allow PROV to be carried forward with update in certain scenarios #131

Open

jeanetteclark opened this issue Jun 19, 2019 · 8 comments
@jeanetteclark (Collaborator) commented Jun 19, 2019

(carryover from this issue: NCEAS/metacatui#310)

Scenario A: update package with new object not part of existing PROV trace
O1 <---derivedFrom--- O3 (using script S1 during execution E2)
O4 added as new object
Metadata updated

Scenario B: update metadata for a package without changing objects
O1 <---derivedFrom--- O3 (using script S1 during execution E2)
Metadata updated

In both situations, all prov relationships between O1 and O3 should be included in the new version of the package.

There is already some code in arcticdatautils that will carry forward PROV (see this commit). It needs to be expanded a bit so that PROV is carried over in the scenarios above, but not in other scenarios where pids involved in the PROV trace are updated with new versions.

So, before adding the carried-over PROV statements from the old resource map to the new resource map, I think we need to check that the pids contained within those statements all exist in the data_pids argument for update_resource_map.

If not all of the pids involved in the PROV trace exist in the data_pids argument, the function will drop all of the PROV statements in the updated version of the resource map. In this case should the function:

  • print a warning
  • error
  • stop and ask the user if they really want to get rid of PROV
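A sketch of that check, written in Python with illustrative names (this is not the actual arcticdatautils API). It assumes PROV statements are (subject, predicate, object) triples, and treats blank nodes and ontology/vocabulary terms as non-pids:

```python
import warnings

# Assumption: rdf:type targets and PROV/ProvONE class URIs are vocabulary
# terms, not pids, so they should not be checked against data_pids.
VOCAB_PREFIXES = ("http://www.w3.org/", "http://purl.dataone.org/provone/")

def carry_forward_prov(prov_statements, data_pids):
    """Return the PROV statements to carry forward, or [] (with a warning)
    if any statement references a pid no longer in the package."""
    referenced = set()
    for subject, _predicate, obj in prov_statements:
        for term in (subject, obj):
            if term.startswith("_:") or term.startswith(VOCAB_PREFIXES):
                continue  # blank nodes and ontology terms are not pids
            referenced.add(term)
    missing = referenced - set(data_pids)
    if missing:
        warnings.warn(
            "Dropping all PROV statements; these pids are no longer in the "
            "package: " + ", ".join(sorted(missing)))
        return []
    return list(prov_statements)
```

This implements the "drop everything if anything is missing" behavior described above; the warning branch is where the warning/error/prompt decision would plug in.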
jeanetteclark added a commit to jeanetteclark/arcticdatautils that referenced this issue Jun 19, 2019
@amoeba (Contributor) commented Jun 20, 2019

Thanks for chatting about this earlier today and writing this up.

Stepping back, I see two general patterns: (1) Keeping PROV by default and (2) dropping PROV by default unless certain scenarios are met. You've outlined option (2) which I think is a reasonable approach. If we go that route, I think we want two things:

  1. A way to force arcticdatautils to not drop the PROV when it otherwise would. Something like a dont_drop_prov_plz = TRUE argument to publish_update().
  2. A bit of hand-holding along the lines of your above warning()/stop()/etc options.

I think, for (2), we could go with a warning() by default and print out a super useful message telling the user exactly how to resurrect any PROV such as:

Hey, you lost all of your PROV statements because the PROV that existed 
in the previous version of this Data Package referenced Objects that 
aren't present in the updated version. You can override this in the 
future by adding `dont_drop_prov_plz = TRUE` and you can also recover 
the lost PROV by running:

prov_statements <- get_prov({paste in the old resource map PID here for the user})
update_package(mn, {paste in new rm pid}, other_statements = prov_statements)

Does something like that seem like a good solution?

Oh, and as a general note, the way I've been thinking about this isn't retaining PROV statements but actually being able to split the statements in the ORE into two groups: (1) Data Packaging statements and (2) all other statements. So in this ticket, any time I/we say PROV statements we'd mean "other statements". I think this is a better approach for a few reasons:

  1. Statements related to Data Packaging are pretty easy to find because we have a formal spec for this
  2. Determining whether a statement is a PROV statement is actually kinda hard because we're working with RDF and would need to bring an RDF reasoner in to do this work
  3. Data Packages may contain statements that are not related to Data Packaging or PROV and we might like to forward migrate those statements too
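The split described above (Data Packaging statements versus everything else) can be sketched in a few lines. This is a Python sketch with an illustrative function name; the ORE namespace itself is the real one from the OAI-ORE vocabulary:

```python
# Partition an ORE graph's triples into (1) Data Packaging statements,
# identified by predicates in the formally specified ORE vocabulary
# (aggregates, isAggregatedBy, describes, isDescribedBy, ...), and
# (2) "all other statements" (PROV and anything else), which would be
# carried forward wholesale without needing an RDF reasoner.
ORE_NS = "http://www.openarchives.org/ore/terms/"

def split_statements(triples):
    packaging, other = [], []
    for s, p, o in triples:
        (packaging if p.startswith(ORE_NS) else other).append((s, p, o))
    return packaging, other
```

The point of this design is visible in the predicate test: deciding "is this an ORE packaging triple?" is a prefix check, whereas deciding "is this a PROV triple?" would require reasoning over the ontology.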

@mbjones (Member) commented Jun 20, 2019

When updating a document like an RDF resource map, I think the default should be to preserve its contents and not be lossy unless there is a specific reason to drop something. Losing RDF triples should be considered a bug, as we will likely add other triples to our ORE docs over time. We shouldn't lose metadata every time it travels through the utils package, as that introduces the need for lots of manual fixes and people might forget to add it all back in. I think the default behavior for our R (and other) packages should be:

  1. Start from the existing resource map with all triples
  2. add aggregation related triples for new or updated content, or remove them for deleted content
  3. add or remove PROV triples as needed for updated or deleted objects (this should preserve all triples for objects that are unchanged)

This does require the software to have a model of a package that understands its components. I think the datapack::DataPackage class has a lot of this built in.
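A minimal sketch of that preserve-by-default behavior, covering steps 1–2 and the removal half of step 3 (Python, illustrative names; a real implementation would live in datapack/dataone and would also add new PROV triples for updated objects):

```python
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"

def update_package_triples(old_triples, aggregation, added_pids, removed_pids):
    """Start from the full old graph, drop only the triples that mention a
    removed pid, then add aggregation triples for new content. Triples about
    unchanged objects (including their PROV) survive untouched."""
    removed = set(removed_pids)
    kept = [(s, p, o) for (s, p, o) in old_triples
            if s not in removed and o not in removed]
    kept += [(aggregation, ORE_AGGREGATES, pid) for pid in added_pids]
    return kept
```

Note the inversion relative to the drop-by-default approach: nothing is lost unless a specific pid was removed or replaced.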

@jeanetteclark (Collaborator, Author) commented Jun 20, 2019

@mbjones, I'm a little confused about your 3rd point since it seems in conflict with your comments here, but maybe I am misunderstanding it.

Let's say we have a package with the following PROV trace:

OBJECT_2 <---derivedFrom--- OBJECT_1 (using a SCRIPT during an EXECUTION)

If OBJECT_2 is updated with a new version of the file from a different execution of the same script, I think it is clear that you cannot blindly add PROV triples associated with the new object (this was the conclusion we came to over a year ago).

If we instead only remove the PROV triples associated with OBJECT_2 (because OBJECT_2 is not included in the data package anymore) and don't make any assumptions about how the new version of the object fits in, the triples would look like this, with the last three rows dropped.

| subject | predicate | object | drop |
| --- | --- | --- | --- |
| EXECUTION | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Execution | FALSE |
| EXECUTION | http://www.w3.org/ns/prov#qualifiedAssociation | _r1561048296r6751r1 | FALSE |
| SCRIPT | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Program | FALSE |
| _r1561048296r6751r1 | http://www.w3.org/ns/prov#hadPlan | SCRIPT | FALSE |
| EXECUTION | http://www.w3.org/ns/prov#used | OBJECT_1 | FALSE |
| OBJECT_1 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Data | FALSE |
| OBJECT_2 | http://www.w3.org/ns/prov#wasGeneratedBy | EXECUTION | TRUE |
| OBJECT_2 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Data | TRUE |
| OBJECT_2 | http://www.w3.org/ns/prov#wasDerivedFrom | OBJECT_1 | TRUE |

I'm not sure I know enough about the PROV model to say whether the result of dropping these triples leaves us with a valid trace or not. We should also consider whether this approach would work if we dropped all of the triples associated with a script that got updated. From a user perspective it would make it much easier to update their provenance because they would only have to add it back in for files that they updated, as opposed to all of the files.
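For what it's worth, the drop column in that table is mechanical: a triple is dropped exactly when the updated object appears as its subject or object. A tiny Python sketch (function name is mine, not from any package):

```python
def triples_to_drop(triples, updated_pid):
    """Select the triples that mention the updated object as subject or
    object; these are the rows marked TRUE in the table above."""
    return [(s, p, o) for (s, p, o) in triples
            if updated_pid in (s, o)]
```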

@amoeba (Contributor) commented Jun 20, 2019

> We shouldn't lose metadata every time it travels through the utils package, as that introduces the need for lots of manual fixes and people might forget to add it all back in.

I agree, and that's the way we had it when we started inserting PROV into DataONE packages, but there are plenty of cases where migrating PROV forward doesn't make sense (which we've outlined here and elsewhere). @jeanetteclark's example above is a case in point.

Re:

>   1. add aggregation related triples for new or updated content, or remove them for deleted content
>   2. add or remove PROV triples as needed for updated or deleted objects (this should preserve all triples for objects that are unchanged)

This sounds good, but it doesn't sound like a good fit for the arcticdatautils API, which is currently totally ignorant of stuff like this. I feel like, if we really want smarter resource map processing, we might want to lean on dataone and datapack, which implement a read-modify-write pattern, whereas arcticdatautils implements a read-modify pattern and isn't aware of your intended changes.

@jeanetteclark (Collaborator, Author) commented

Yeah, it is not set up well to do this kind of thing at the moment. This is definitely something we should consider when we start work on refactoring the R packages.

In the meantime, we should find some kind of solution for arcticdatautils. @amoeba, I liked your solution above; it should be fairly easy to implement in the package.

@jeanetteclark (Collaborator, Author) commented Nov 14, 2019

@amoeba I finally got around to this and have a tested solution. Do you mind checking it out here:

https://github.com/jeanetteclark/arcticdatautils/tree/carry_prov

@dmullen17 it would be good if you had a look too. If we like this, I can create a pull request for more formal review.

I have tests written up, but to play with the functionality, install the package from that branch and then experiment with:

```r
mn <- getMNode(CNode("STAGING"), "urn:node:mnTestARCTIC")
package <- create_dummy_package(mn, size = 3)
# add dummy prov to the package
package_prov <- suppressMessages(add_dummy_prov(mn, package$resource_map))
# publish a new data object
data_new <- create_dummy_object(mn)

# Publish an update and observe the output based on which data objects
# you include, whether to keep the prov, etc.
update <- publish_update(mn,
                         package$metadata,
                         package_prov,
                         data_new,
                         keep_prov = FALSE,
                         check_first = FALSE)
```

@amoeba (Contributor) commented Nov 26, 2019

Hey @jeanetteclark, thanks for putting this together. The warning with example code is 💯 btw.

Is defaulting to removing provenance (keep_prov = FALSE) what we want here? For the common cases, using publish_update to give a package a DOI or to otherwise update the metadata, provenance should probably get forwarded to the new package's resource map since it's still accurate. I'd have to scratch my head a bit more to work out how to integrate this into your flow here, but it seems doable.

PS: I had a few other comments that'd be suitable during code review I could make when you file a PR.

@jeanetteclark (Collaborator, Author) commented
Certainly something that could be up for debate! I think what you describe is already integrated into my workflow; I just need to change the default arg. I'll create a PR and we can see where we go from there.
