Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add save() and load() method to solution. #52

Open
orionarcher opened this issue Dec 6, 2021 · 6 comments
Open

Add save() and load() method to solution. #52

orionarcher opened this issue Dec 6, 2021 · 6 comments
Labels
core enhancement New feature or request

Comments

@orionarcher
Copy link
Collaborator

orionarcher commented Dec 6, 2021

This important functionality is currently missing. As is, users would need to rerun their analysis.

There should be methods to serialize and load a solution object in a single file.

I believe we would need to save:

  • solute indexes
  • solvent names and indexes
  • rdf data
  • cutoff radii
  • solvation_data dataframe

That should be sufficient to reconstruct the Solution.

@orionarcher orionarcher added enhancement New feature or request core labels Dec 6, 2021
@hmacdope
Copy link
Member

hmacdope commented Dec 7, 2021

AFAIK there are 2 main options. The first one (and in my opinion best) is to pickle the Solution class in its entirety, allowing the whole class state to be reconstructed from disk.

I am no expert at this, perhaps @MDAnalysis/coredevs will know more but see the following link: https://docs.python.org/3/library/pickle.html.

The other option is to parse out the data to and from some kind of set of JSON files or such like. I am less in favour of this as it is fiddly and will require some introspection into class state etc which is a bit complicated. It may also seperate stuff into multiple files which is a lot less clean. On the plus side, these can then be human readable, but I think the downsides outweigh the positives.

@orbeckst
Copy link
Member

orbeckst commented Dec 7, 2021

Pickle is quick but not a good format for data. It can happen that you can’t process a pickle file with a different version of Python IIRC.

Results such as RDFs should be in a good data format anyway. CSV (compressed) is the lowest common denominator.

HDF5 is quite flexible and widely used but it is a heavy dependency.

Overall, I would spend some time figuring out how your workflow should work out. It’s often cleaner to have data producers and data analyzers and reduce coupling between the two.

@orionarcher
Copy link
Collaborator Author

If we save the output of expensive operations like solvation_data and rdf_data it would be cheap to recalculate everything else with the load() function. However, Solution also saves a copy of the Universe, is there an established way to serialize a Universe? I could save links to the files that created the universe, but that adds unnecessary complexity.

Maybe as @orbeckst suggests it's best to decouple the data production and analysis and not bother with implementing load. Instead, I could write a save_data function that saves a JSON of important statistics but does not save object state.

I favor JSON because it fits with the other infrastructure I use, but that's my personal bias. CSV would likely be more space efficient for the DataFrames.

@hmacdope
Copy link
Member

As you suggest, perhaps the best initial target is to implement saving functions for analyses in simple easy to use formats.

JSON or CSV is fine, but as @orbeckst says perhaps CSV is the lowest common denominator.

If we were to be dumping state and making it loadable I would favour PyHDF5 as everything can be contained in a single space efficient file. However writing load() is not an easy job as it needs to mirror the behaviour of __init__() very very closely, otherwise you will end up with unexpectedly None data all over the shop.

The coupling with Universe is also quite complicated, you would likely require the Universe ingredients in separates file AFAIK, resulting in 3 requirements: "topology.tpr", "trajectory.h5md", solution_state.hdf5. I think we can move this kind of complexity down the track possibly?

@orionarcher
Copy link
Collaborator Author

Agreed @hmacdope. I just moved this off the v0.2 roadmap. I'll make a new issue that specifically identifies creating a save_data method.

When we return to this later, I think your points are spot on.

@hmacdope
Copy link
Member

Sounds good. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants