Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manifest Reproduce true fix #475

Merged
merged 10 commits into from
Aug 22, 2024

Conversation

jo-basevi
Copy link
Collaborator

@jo-basevi jo-basevi commented Aug 2, 2024

Ok so I've had a go changing the manifests reproduce True logic. Currently (on the main branch), it create symlinks to the work directory from the manifests, and only raise errors if the files that existing on the manifest are modified.

In the first commit, I changed logic in model setup to always add model exe path and prior restart path to the manifest. This picks up if exe path is changed in config.yaml. If there's new restart files, in the prior restart path, it will also pick it up that there's been a change.

It still had the logic that symlinks from files to the work directory are generated from the manifest initially when reproduce is True. This could run into weird issues when there's a file in the manifest, thats existing unchanged on the filesystem, but it's no longer added to manifest in Model.setup. It'll pass the manifest checks because the hashes are unchanged and this file will still be linked into the work directory.

In the second commit, I have changed it to when reproduce is True, all the symlinks are created from filepaths added in the model.setup() when it calls manifest.add_filepath(). I have used the pre-existing existing_filepaths set on the PayuManifest class to keep track of what files haven't been added to the work directory but are in the manifest. There is now a check at the end of PayuManifest.fastcheck checking if that set is empty when reproduce is True.

The second commit, also has changes to the input/scanInputs logic. I've changed it to have scaninputs True when reproduce is True. This is so changes to inputs in config.yaml are picked up, or if new files are added to input directories.

scaninputs can be set to False, when reproduce is False. This is will still not check for changes in config.yaml as it'll create symlinks only from the input manifest. If a file in the manifest no longer exists, it will still be removed from the manifests before manifests are checked (only when reproduce is False).

I'll appreciate any feedback on these changes as it is changing the behaviour of reproduce:True quite a bit, and I'm unsure of my assumptions..

Closes #456

Previous behaviour would only use the exe path and restarts already set in the Manifest. So they would pick up when the exe/restart that were currently pointing toin the manifest would change, but not the configured one.
… when files are added to manifest

- When reproduce is True, set scanInputs to True so new/changed inputs are discovered
- When reproduce is True, only create work symlinks when filepaths are added in manifest during model.setup
- Check for files that exist in the manifest, but haven't been added to work directory, or no longer exist
@pep8speaks
Copy link

pep8speaks commented Aug 2, 2024

Hello @jo-basevi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 147:19: E222 multiple spaces after operator

Comment last updated at 2024-08-22 01:38:38 UTC

@coveralls
Copy link

coveralls commented Aug 2, 2024

Coverage Status

coverage: 55.962% (+3.7%) from 52.281%
when pulling 8fae6f9 on ACCESS-NRI:456-manifest-exe-true-bug
into c83489e on payu-org:master.

@aidanheerdegen aidanheerdegen self-requested a review August 6, 2024 05:44
@aidanheerdegen
Copy link
Collaborator

One goal of the original behaviour was to be able to re-run an experiment from any commit, by using the restart manifest to populate the restart file symlinks as they were at that commit. This would bypass the payu machinery for determining the restarts to use. The same goal is achieved by setting the restart path in config.yaml, but requires manual intervention of the user.

It is an open question as to the utility of this functionality. As far as I know no-one has used it, and with the typical workflow of using ephemeral /scratch storage it is likely the original paths to restarts will no longer exist.

One use-case is perturbation experiments branching from the same point in a control experiment. Each perturbation experiment could be a payu branch and then run with restart reproduce true for a single run to pick up restarts from the control experiment. To be honest I'm not sure if it is worth trying to keep this functionality for the reasons stated above. It is fragile to files being deleted/moved, and the same effective functionality is available setting the restart config option.

aidanheerdegen
aidanheerdegen previously approved these changes Aug 8, 2024
Copy link
Collaborator

@aidanheerdegen aidanheerdegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving but probably worth deciding if we do want to support scaninputs at all and simplifying the logic as much as possible.

docs/source/config.rst Outdated Show resolved Hide resolved
payu/models/model.py Outdated Show resolved Hide resolved
payu/manifest.py Outdated Show resolved Hide resolved
payu/manifest.py Outdated Show resolved Hide resolved
payu/manifest.py Show resolved Hide resolved
payu/manifest.py Outdated Show resolved Hide resolved
Fix some typos

Co-authored-by: Aidan Heerdegen <[email protected]>
- Rename existing_filepaths to previous_filepaths
- Add description that previous_filepaths keep track of filepaths from previous experiment run
- Move reproduce manifest exists check to load()
@jo-basevi jo-basevi marked this pull request as draft August 14, 2024 04:10
@jo-basevi
Copy link
Collaborator Author

jo-basevi commented Aug 14, 2024

I've updated the manifests to keep track of the pre-existing manifest and to compare directly with the created manifests to try simplify some of the logic. A key difference is that previous manifests are used as source of truth for md5 hashes (this was previously the case for just input manifests). So if a calculated fast hash matches the fast hash in the previous manifest, then the full hash from the previous manifest is used. This is to avoid re-calculating any md5 hashes.

@jo-basevi jo-basevi marked this pull request as ready for review August 15, 2024 04:34
aidanheerdegen
aidanheerdegen previously approved these changes Aug 20, 2024
Copy link
Collaborator

@aidanheerdegen aidanheerdegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I only have one question: is it true that when reproduce = True the respective manifests are only written if there are no reproduce errors?

payu/manifest.py Show resolved Hide resolved
@aidanheerdegen
Copy link
Collaborator

I'm doing some tests on performance, which is one of the key criteria for the design and re-use of manifest hashes etc.

The release-01deg_jra55_iaf config is definitely the worst in this regard, as it has all the JRA55-do forcing data in the manifest.

I'll report back what I find.

@aidanheerdegen
Copy link
Collaborator

Timing looks similar after the first run of payu setup. I guess the files made it into some sort of lustre cache.

Typical timings for setup for both are

110.06user 145.29system 0:28.29elapsed 902%CPU (0avgtext+0avgdata 68016maxresident)k
133489472inputs+8202176outputs (4584major+362431minor)pagefaults 0swaps

for 708 files.

If the input manifest is deleted, forcing recalculation, the timing for this PR is

1484.73user 1696.42system 2:29.14elapsed 2132%CPU (0avgtext+0avgdata 65180maxresident)k
1664125624inputs+8202728outputs (12861major+1030049minor)pagefaults 0swaps

Current payu timing is

1650.39user 1556.04system 2:34.74elapsed 2072%CPU (0avgtext+0avgdata 50616maxresident)k
1662310320inputs+8202728outputs (2231major+1110215minor)pagefaults 0swaps

The times vary (I tried a couple of times) and likely it is very sensitive to disk usage and cache status.

Conclusion: this change does not degrade performance.

@jo-basevi
Copy link
Collaborator Author

I only have one question: is it true that when reproduce = True the respective manifests are only written if there are no reproduce errors?

The binhash values may be updated, and maybe the fullpaths if they have changed? But that should only happen if all MD5 hashes match.

@jo-basevi
Copy link
Collaborator Author

Conclusion: this change does not degrade performance.

That's great! Thanks for running those tests :) How did you run those timing tests?

@aidanheerdegen
Copy link
Collaborator

Conclusion: this change does not degrade performance.

That's great! Thanks for running those tests :) How did you run those timing tests?

/usr/bin/time payu setup -f

You can pass command line arguments to that command to output all sorts of useful info, above is just the default. There is also a bash builtin time but it isn't as configurable and doesn't report memory usage.

Copy link
Collaborator

@aidanheerdegen aidanheerdegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let `er rip!

@jo-basevi jo-basevi merged commit 793447d into payu-org:master Aug 22, 2024
9 checks passed
@jo-basevi jo-basevi deleted the 456-manifest-exe-true-bug branch August 22, 2024 04:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mainfest reproduce options don't seem to do what they say on the box
4 participants