Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include source hash in job search #19112

Draft
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

mvdbeek
Copy link
Member

@mvdbeek mvdbeek commented Nov 6, 2024

This means we find equivalent jobs if an input hda either points at the same dataset id (existing behavior), or if the dataset source_uri, transform and hashes match. All further restrictions still apply (same metadata etc).

Builds on #19108, and #19110 is probably also a good idea.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@nsoranzo

This comment was marked as resolved.

@bgruening
Copy link
Member

Running a Galaxy Training Academy without large computational resources - yeah, yeah!

That way the equivalence search will also work for non-URI uploads.
@mvdbeek mvdbeek changed the title Include source_uri, transform and hash in job search Include source hash in job search Nov 7, 2024
@jmchilton
Copy link
Member

We have no clue about extra files in this context right - how can we make this conclusion without that? It isn't a -1 but I am uncomfortable with the direction this is all heading. I feel like we should be working on tightening up job search and not loosening it more and more before we've installed the guard rails it needs IMO.

@jmchilton
Copy link
Member

Maybe to clarify this - I absolutely think we should be able to hash the input to be used with job search. But I don't think this hash is what I would use - I would implement a hash of the dataset and not of the primary file of the dataset. 95% of the time they could be the same but we should verify that before using it in this context.

@mvdbeek
Copy link
Member Author

mvdbeek commented Nov 8, 2024

I would implement a hash of the dataset and not of the primary file of the dataset. 95% of the time they could be the same but we should verify that before using it in this context.

If we match the transform would it still be possible to get different extra files for uploaded content (considering we already match on the datatype) ? I added 4abc6e6 because we don't record the transform for local files, but we could of course do that.

There's of course also value in calculating the hash for datasets as they are written to the object store, but if we can get reliable equivalence using the upload hash + transform we could make nice progress in figuring out other edge cases for IWC workflows ?

Co-authored-by: Nicola Soranzo <[email protected]>
@jmchilton
Copy link
Member

I had the realization reading your response that you only care about upload or think these fields can all only be set during uploads. I believe any tool can produce hashes and source hashes and I believe we have an API for hashing files after the fact anyway. Could we restrict all this logic to just fetch and upload1 - then it feels pretty close to being correct? This also probably explains my unease with #19110 - I think we don't validate the hashes outside the fetch tool but we can create the hashes in other tools I think - it feels like what we need is a hash validated field if we want to act on that data in this fashion but maybe it would be sufficient to be more proactive in validating the hash field whenever it is set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants