
[Feature Request]: Larger than memory datasets. #210

Open
JonathanSchmidt1 opened this issue Jan 8, 2024 · 4 comments
Labels
data (Data loading and processing)

Comments


JonathanSchmidt1 commented Jan 8, 2024

Problem

Now that multi-GPU training is working, we are very interested in training on some larger crystal-structure datasets. However, these datasets do not fit into RAM. It would be great if it were possible either to load only a part of the dataset on each DDP node or to load the features on the fly, to make large-scale training possible. I assume the LMDB datasets that OCP and MatSciML use should work for that. P.S.: thank you for the PyTorch Lightning implementation.

Proposed Solution

Add an option to save and load data to/from an LMDB database.

Alternatives

Examples can be found here:
https://github.com/Open-Catalyst-Project/ocp/blob/main/tutorials/lmdb_dataset_creation.ipynb
or here https://github.com/IntelLabs/matsciml/tree/main/matsciml/datasets
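For illustration, here is a minimal sketch (not part of matgl, and only following the same general pattern as the linked OCP/MatSciML examples) of an LMDB-backed PyTorch dataset that deserializes one sample per lookup, so only the samples a worker actually touches are held in RAM. The key layout and the assumption that each value is a pickled (graph, label) pair are made up for illustration.

import pickle

import lmdb
import torch
from torch.utils.data import Dataset

class LMDBGraphDataset(Dataset):
    """Sketch of a lazy, LMDB-backed graph dataset (storage layout is hypothetical)."""

    def __init__(self, lmdb_path: str):
        # readonly + lock=False lets several DDP workers share one environment
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False,
                             readahead=False, subdir=False)
        with self.env.begin() as txn:
            self.length = pickle.loads(txn.get(b"__len__"))

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin() as txn:
            # each sample was written at build time as a pickled (graph, label) pair
            graph, label = pickle.loads(txn.get(str(idx).encode()))
        return graph, torch.tensor(label)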

Code of Conduct

  • I agree to follow this project's Code of Conduct
@JonathanSchmidt1 (Author)

A small update on this request: I also asked the MatSciML team about this issue, as they have an interface to matgl and other models included in their package, and they were kind enough to prepare a guide on how to prepare a suitable dataset: IntelLabs/matsciml#85. I will follow that guide for our data, and maybe it could also be used to extend the training capabilities of matgl.

@shyuep (Contributor)

shyuep commented Jan 31, 2024

@JonathanSchmidt1 The dataloaders in matgl already allow you to do a one-time conversion of the structures into a graph dataset. Once that graph dataset is built, it is much smaller in memory than the structures. In fact, that is how we have been training on extremely large datasets.

@JonathanSchmidt1 (Author)

JonathanSchmidt1 commented Feb 13, 2024

Thank you for the reply. The preprocessing is definitely useful.
But after preprocessing, a decently sized dataset (4.5M structures) takes up 132 GB on disk and 128 GB loaded into RAM, and we would like to train on larger datasets in the future.
Maybe I am also doing something wrong. Right now I am just doing the following to preprocess the data:

from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset

# collect the element types present in the structures
elem_list = get_element_list(structures)
# set up a graph converter
converter = Structure2Graph(element_types=elem_list, cutoff=6.0)
# convert the raw dataset into an M3GNetDataset
dataset = M3GNetDataset(
    threebody_cutoff=4.0, structures=structures, converter=converter, labels={"energies": energies}
)
dataset.process()
dataset.save()

For me, the issue is also the rather old architecture of the GPU partition of the supercomputer I have to use (64 GB RAM per node, 1 GPU per node), so in-memory datasets are simply not an option there.
However, even on a modern architecture with e.g. 512 GB per node and 8 GPUs per node, the in-memory dataset becomes a problem, as I think with DDP each process loads its own copy of the dataset, resulting in ~960 GB if I were to use all GPUs?
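As a rough sketch of one possible workaround (an assumption, not existing matgl functionality): if each DDP rank only materializes its own slice of the structures before graph conversion, the per-process memory drops by roughly a factor of world_size. The helper below is hypothetical and only illustrates the index split.

import torch.distributed as dist

def rank_local_indices(num_samples: int) -> range:
    """Hypothetical helper: the slice of samples this DDP rank keeps in memory."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    per_rank = (num_samples + world_size - 1) // world_size  # ceiling division
    start = rank * per_rank
    return range(start, min(start + per_rank, num_samples))

# e.g. structures = [structures[i] for i in rank_local_indices(len(structures))]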

@janosh added the data (Data loading and processing) label on Mar 21, 2024
@JonathanSchmidt1 (Author)

Are there any updates on this? Even though we now have decent nodes with 200 GB of RAM per GPU, the datasets have also grown to more than 600 GB.
