[Feature Request]: Larger than memory datasets. #210
Comments
A small update to this request: I also asked the matsciml team about this issue, since their package includes an interface to matgl and other models, and they were kind enough to put together a guide on preparing a suitable dataset (IntelLabs/matsciml#85). I will follow that guide for our data, and maybe it could also be used to extend the training capabilities of matgl.
@JonathanSchmidt1 The dataloaders in matgl already allow you to do a one-time conversion of the structures into a graph dataset. Once that graph dataset is built, it is much smaller in memory than the structures. In fact, that is how we have been training on extremely large datasets.
Thank you for the reply. The preprocessing is definitely useful:

```python
# import paths assumed for the matgl version discussed here; they may differ between releases
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset

# structures (list of pymatgen Structures) and energies are assumed to be defined already
elem_list = get_element_list(structures)
# set up a graph converter
converter = Structure2Graph(element_types=elem_list, cutoff=6.0)
# convert the raw dataset into an M3GNetDataset
dataset = M3GNetDataset(
    threebody_cutoff=4.0, structures=structures, converter=converter, labels={"energies": energies}
)
dataset.process()
dataset.save()
```

For me, the issue is also the rather old architecture of the GPU partition of the supercomputer I have to use (64 GB RAM per node, 1 GPU per node), so in-memory datasets are simply not an option there.
Are there any updates on this? Even though we now have decent nodes with 200 GB of RAM per GPU, the datasets now take up more than 600 GB.
Problem
Now that multi-GPU training is working, we are very interested in training on some larger crystal structure datasets. However, these datasets do not fit into RAM. It would be great if it were possible either to load only a partial dataset on each DDP node or to load the features on the fly, so that large-scale training becomes possible. I assume the LMDB datasets that OCP and matsciml use should work for that. PS: thank you for the PyTorch Lightning implementation.
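For illustration, a minimal sketch of the "load features on the fly" idea, assuming the graphs have already been preprocessed and saved one file per structure with `dgl.save_graphs(..., labels={"energies": ...})`; the directory layout and class name here are hypothetical, not an existing matgl API:

```python
import os

import dgl
from torch.utils.data import Dataset


class LazyGraphDataset(Dataset):
    """Loads one precomputed graph per item from disk instead of keeping all graphs in RAM."""

    def __init__(self, graph_dir: str):
        # graph_dir is assumed to contain one .bin file per structure, e.g. graph_00000.bin
        self.files = sorted(
            os.path.join(graph_dir, f) for f in os.listdir(graph_dir) if f.endswith(".bin")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # dgl.load_graphs returns (list_of_graphs, label_dict)
        graphs, labels = dgl.load_graphs(self.files[idx])
        return graphs[0], labels["energies"]
```

With a lazy dataset like this, each DDP rank only reads the items its sampler assigns to it, so no single node has to hold the full dataset in memory.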
Proposed Solution
Add an option to save and load data to/from an LMDB database.
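A rough sketch of what such an option might look like, assuming graphs and labels are serialized with pickle; matgl does not currently expose this, and all names here are hypothetical:

```python
import pickle

import lmdb


def write_graphs_to_lmdb(db_path, graphs, labels, map_size=2**40):
    """Serialize preprocessed graphs and labels into an LMDB file, one entry per structure."""
    env = lmdb.open(db_path, map_size=map_size, subdir=False)
    with env.begin(write=True) as txn:
        for i, (g, y) in enumerate(zip(graphs, labels)):
            txn.put(str(i).encode("ascii"), pickle.dumps((g, y)))
        txn.put(b"__len__", pickle.dumps(len(graphs)))
    env.close()


class LMDBGraphDataset:
    """Read-only view over the LMDB file; only the requested entry is deserialized."""

    def __init__(self, db_path):
        self.env = lmdb.open(db_path, subdir=False, readonly=True, lock=False)
        with self.env.begin() as txn:
            self.length = pickle.loads(txn.get(b"__len__"))

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin() as txn:
            graph, label = pickle.loads(txn.get(str(idx).encode("ascii")))
        return graph, label
```

This mirrors the pattern used in the OCP and matsciml LMDB datasets linked under Alternatives, where only the per-item deserialization cost is paid during training.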
Alternatives
Examples can be found here:
https://github.com/Open-Catalyst-Project/ocp/blob/main/tutorials/lmdb_dataset_creation.ipynb
or here https://github.com/IntelLabs/matsciml/tree/main/matsciml/datasets