allow PROV to be carried forward with update in certain scenarios #131

Open

jeanetteclark opened this issue Jun 19, 2019 · 8 comments
@jeanetteclark (Collaborator) commented Jun 19, 2019

(carryover from this issue: NCEAS/metacatui#310)

Scenario A: update package with new object not part of existing PROV trace
O1 <---derivedFrom--- O3 (using script S1 during execution E2)
O4 added as new object
Metadata updated

Scenario B: update metadata for a package without changing objects
O1 <---derivedFrom--- O3 (using script S1 during execution E2)
Metadata updated

In both situations, all prov relationships between O1 and O3 should be included in the new version of the package.

There is already some code in arcticdatautils that will carry forward PROV (see this commit). It needs to be expanded a bit so that PROV is carried over in the scenarios above, but not in other scenarios where pids involved in the PROV trace are updated with new versions.

So, before adding the carried-over PROV statements from the old resource map to the new resource map, I think we need to check that the pids contained within those statements all exist in the data_pids argument for update_resource_map.

If not all of the pids involved in the PROV trace exist in the data_pids argument, the function will drop all of the PROV statements in the updated version of the resource map. In this case should the function:

  • print a warning
  • error
  • stop and ask the user if they really want to get rid of PROV
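A sketch of that check, written in Python with illustrative names (this is not the actual arcticdatautils API). It assumes PROV statements are (subject, predicate, object) triples, and treats blank nodes and ontology/vocabulary terms as non-pids:

```python
import warnings

# Assumption: rdf:type targets and PROV/ProvONE class URIs are vocabulary
# terms, not pids, so they should not be checked against data_pids.
VOCAB_PREFIXES = ("http://www.w3.org/", "http://purl.dataone.org/provone/")

def carry_forward_prov(prov_statements, data_pids):
    """Return the PROV statements to carry forward, or [] (with a warning)
    if any statement references a pid no longer in the package."""
    referenced = set()
    for subject, _predicate, obj in prov_statements:
        for term in (subject, obj):
            if term.startswith("_:") or term.startswith(VOCAB_PREFIXES):
                continue  # blank nodes and ontology terms are not pids
            referenced.add(term)
    missing = referenced - set(data_pids)
    if missing:
        warnings.warn(
            "Dropping all PROV statements; these pids are no longer in the "
            "package: " + ", ".join(sorted(missing)))
        return []
    return list(prov_statements)
```

This implements the "drop everything if anything is missing" behavior described above; the warning branch is where the warning/error/prompt decision would plug in.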
jeanetteclark added a commit to jeanetteclark/arcticdatautils that referenced this issue Jun 19, 2019
@amoeba (Contributor) commented Jun 20, 2019

Thanks for chatting about this earlier today and writing this up.

Stepping back, I see two general patterns: (1) Keeping PROV by default and (2) dropping PROV by default unless certain scenarios are met. You've outlined option (2) which I think is a reasonable approach. If we go that route, I think we want two things:

  1. A way to force arcticdatautils to not drop the PROV when it otherwise would. Something like a dont_drop_prov_plz = TRUE argument to publish_update().
  2. A bit of hand-holding along the lines of your above warning()/stop()/etc options.

I think, for (2), we could go with a warning() by default and print out a super useful message telling the user exactly how to resurrect any PROV such as:

Hey, you lost all of your PROV statements because the PROV that existed 
in the previous version of this Data Package referenced Objects that 
aren't present in the updated version. You can override this in the 
future by adding `dont_drop_prov_plz = TRUE` and you can also recover 
the lost PROV by running:

prov_statements <- get_prov({paste in the old resource map PID here for the user})
update_package(mn, {paste in new rm pid}, other_statements = prov_statements)

Does something like that seem like a good solution?

Oh, and as a general note, the way I've been thinking about this isn't retaining PROV statements but actually being able to split the statements in the ORE into two groups: (1) Data Packaging statements and (2) all other statements. So in this ticket, any time I/we say PROV statements we'd mean "other statements". I think this is a better approach for a few reasons:

  1. Statements related to Data Packaging are pretty easy to find because we have a formal spec for this
  2. Determining whether a statement is a PROV statement is actually kinda hard because we're working with RDF and would need to bring an RDF reasoner in to do this work
  3. Data Packages may contain statements that are not related to Data Packaging or PROV and we might like to forward migrate those statements too
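The split described above (Data Packaging statements versus everything else) can be sketched in a few lines. This is a Python sketch with an illustrative function name; the ORE namespace itself is the real one from the OAI-ORE vocabulary:

```python
# Partition an ORE graph's triples into (1) Data Packaging statements,
# identified by predicates in the formally specified ORE vocabulary
# (aggregates, isAggregatedBy, describes, isDescribedBy, ...), and
# (2) "all other statements" (PROV and anything else), which would be
# carried forward wholesale without needing an RDF reasoner.
ORE_NS = "http://www.openarchives.org/ore/terms/"

def split_statements(triples):
    packaging, other = [], []
    for s, p, o in triples:
        (packaging if p.startswith(ORE_NS) else other).append((s, p, o))
    return packaging, other
```

The point of this design is visible in the predicate test: deciding "is this an ORE packaging triple?" is a prefix check, whereas deciding "is this a PROV triple?" would require reasoning over the ontology.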

@mbjones (Member) commented Jun 20, 2019

When updating a document like an RDF resource map, I think the default should be to preserve its contents and not be lossy unless there is a specific reason to drop something. Losing RDF triples should be considered a bug, as we will likely add other triples to our ORE docs over time. We shouldn't lose metadata every time it travels through the utils package, as that introduces the need for lots of manual fixes and people might forget to add it all back in. I think the default behavior for our R (and other) packages should be:

  1. Start from the existing resource map with all triples
  2. add aggregation related triples for new or updated content, or remove them for deleted content
  3. add or remove PROV triples as needed for updated or deleted objects (this should preserve all triples for objects that are unchanged)

This does require the software to have a model of a package that understands its components. I think the datapack::DataPackage class has a lot of this built in.
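A minimal sketch of that preserve-by-default behavior, covering steps 1–2 and the removal half of step 3 (Python, illustrative names; a real implementation would live in datapack/dataone and would also add new PROV triples for updated objects):

```python
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"

def update_package_triples(old_triples, aggregation, added_pids, removed_pids):
    """Start from the full old graph, drop only the triples that mention a
    removed pid, then add aggregation triples for new content. Triples about
    unchanged objects (including their PROV) survive untouched."""
    removed = set(removed_pids)
    kept = [(s, p, o) for (s, p, o) in old_triples
            if s not in removed and o not in removed]
    kept += [(aggregation, ORE_AGGREGATES, pid) for pid in added_pids]
    return kept
```

Note the inversion relative to the drop-by-default approach: nothing is lost unless a specific pid was removed or replaced.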

@jeanetteclark (Collaborator, Author) commented Jun 20, 2019

@mbjones, I'm a little confused about your 3rd point since it seems in conflict with your comments here, but maybe I am misunderstanding it.

Let's say we have a package with the following PROV trace:

OBJECT_2 <---derivedFrom--- OBJECT_1 (using a SCRIPT during an EXECUTION)

If OBJECT_2 is updated with a new version of the file from a different execution of the same script, I think it is clear that you cannot blindly add PROV triples associated with the new object (this was the conclusion we came to over a year ago).

If we instead only remove the PROV triples associated with OBJECT_2 (because OBJECT_2 is not included in the data package anymore) and don't make any assumptions about how the new version of the object fits in, the triples would look like this, with the last three rows dropped.

| subject | predicate | object | drop |
| --- | --- | --- | --- |
| EXECUTION | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Execution | FALSE |
| EXECUTION | http://www.w3.org/ns/prov#qualifiedAssociation | _r1561048296r6751r1 | FALSE |
| SCRIPT | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Program | FALSE |
| _r1561048296r6751r1 | http://www.w3.org/ns/prov#hadPlan | SCRIPT | FALSE |
| EXECUTION | http://www.w3.org/ns/prov#used | OBJECT_1 | FALSE |
| OBJECT_1 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Data | FALSE |
| OBJECT_2 | http://www.w3.org/ns/prov#wasGeneratedBy | EXECUTION | TRUE |
| OBJECT_2 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Data | TRUE |
| OBJECT_2 | http://www.w3.org/ns/prov#wasDerivedFrom | OBJECT_1 | TRUE |

I'm not sure I know enough about the PROV model to say whether the result of dropping these triples leaves us with a valid trace or not. We should also consider whether this approach would work if we dropped all of the triples associated with a script that got updated. From a user perspective it would make it much easier to update their provenance because they would only have to add it back in for files that they updated, as opposed to all of the files.
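For what it's worth, the drop column in that table is mechanical: a triple is dropped exactly when the updated object appears as its subject or object. A tiny Python sketch (function name is mine, not from any package):

```python
def triples_to_drop(triples, updated_pid):
    """Select the triples that mention the updated object as subject or
    object; these are the rows marked TRUE in the table above."""
    return [(s, p, o) for (s, p, o) in triples
            if updated_pid in (s, o)]
```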

@amoeba (Contributor) commented Jun 20, 2019

> We shouldn't lose metadata every time it travels through the utils package, as that introduces the need for lots of manual fixes and people might forget to add it all back in.

I agree, and that's the way we had it when we started inserting PROV into DataONE packages, but there are plenty of cases where migrating PROV forward doesn't make sense (which we've outlined here and elsewhere). @jeanetteclark's example above is a case in point.

Re:

>   1. add aggregation related triples for new or updated content, or remove them for deleted content
>   2. add or remove PROV triples as needed for updated or deleted objects (this should preserve all triples for objects that are unchanged)

This sounds good, but it doesn't sound like a good fit for the arcticdatautils API, which is currently totally ignorant of stuff like this. I feel like, if we really want smarter resource map processing, we might want to lean on dataone and datapack, which implement a read-modify-write pattern, whereas arcticdatautils implements a read-modify pattern and isn't aware of your intended changes.

@jeanetteclark (Collaborator, Author) commented

Yeah, it is not set up well to do this kind of thing at the moment. This is definitely something we should consider when we start work on refactoring the R packages.

In the meantime, we should find some kind of solution for arcticdatautils. @amoeba, I liked your solution above; it should be fairly easy to implement in the package.

@jeanetteclark (Collaborator, Author) commented Nov 14, 2019

@amoeba I finally got around to this and have a tested solution. Do you mind checking it out here:

https://github.com/jeanetteclark/arcticdatautils/tree/carry_prov

@dmullen17 it would be good if you had a look too. If we like this, I can create a pull request for more formal review.

I have tests written up, but to play with the functionality, install the package from that branch and then experiment with:

```r
mn <- getMNode(CNode("STAGING"), "urn:node:mnTestARCTIC")
package <- create_dummy_package(mn, size = 3)
# add dummy prov to the package
package_prov <- suppressMessages(add_dummy_prov(mn, package$resource_map))
# publish a new data object
data_new <- create_dummy_object(mn)

# Publish an update and observe the output based on which data objects
# you include, whether to keep the prov, etc.
update <- publish_update(mn,
                         package$metadata,
                         package_prov,
                         data_new,
                         keep_prov = FALSE,
                         check_first = FALSE)
```

@amoeba (Contributor) commented Nov 26, 2019

Hey @jeanetteclark, thanks for putting this together. The warning with example code is 💯 btw.

Is defaulting to removing provenance (keep_prov = FALSE) what we want here? For the common cases, using publish_update to give a package a DOI or to otherwise update the metadata, provenance should probably get forwarded to the new package's resource map since it's still accurate. I'd have to scratch my head a bit more to work out how to integrate this into your flow here, but it seems doable.

PS: I had a few other comments that'd be suitable during code review I could make when you file a PR.

@jeanetteclark (Collaborator, Author) commented
Certainly something that could be up for debate! I think what you describe is already integrated into my workflow; I just need to change the default arg. I'll create a PR and we can see where we go from there.
