
[Feature Request]: Larger than memory datasets. #210

Open
JonathanSchmidt1 opened this issue Jan 8, 2024 · 4 comments
Labels
data (Data loading and processing)

Comments


JonathanSchmidt1 commented Jan 8, 2024

Problem

Now that multi-GPU training is working, we are very interested in training on some larger crystal-structure datasets. However, these datasets do not fit into RAM. It would be great if it were possible either to load only a part of the dataset on each DDP node or to load the features on the fly, to make large-scale training possible. I assume the LMDB datasets that OCP and MatSciML use should work for that. P.S.: thank you for the PyTorch Lightning implementation.

Proposed Solution

Add an option to save and load data to/from an LMDB database.

Alternatives

Examples can be found here:
https://github.com/Open-Catalyst-Project/ocp/blob/main/tutorials/lmdb_dataset_creation.ipynb
or here https://github.com/IntelLabs/matsciml/tree/main/matsciml/datasets
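For illustration, here is a minimal sketch (not part of matgl, and only following the same general pattern as the linked OCP/MatSciML examples) of an LMDB-backed PyTorch dataset that deserializes one sample per lookup, so only the samples a worker actually touches are held in RAM. The key layout and the assumption that each value is a pickled (graph, label) pair are made up for illustration.

import pickle

import lmdb
import torch
from torch.utils.data import Dataset

class LMDBGraphDataset(Dataset):
    """Sketch of a lazy, LMDB-backed graph dataset (storage layout is hypothetical)."""

    def __init__(self, lmdb_path: str):
        # readonly + lock=False lets several DDP workers share one environment
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False,
                             readahead=False, subdir=False)
        with self.env.begin() as txn:
            self.length = pickle.loads(txn.get(b"__len__"))

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin() as txn:
            # each sample was written at build time as a pickled (graph, label) pair
            graph, label = pickle.loads(txn.get(str(idx).encode()))
        return graph, torch.tensor(label)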

Code of Conduct

  • I agree to follow this project's Code of Conduct
@JonathanSchmidt1 (Author)

A small update on this request: I also asked the MatSciML team about this issue, as they have an interface to matgl and other models included in their package, and they were kind enough to prepare a guide on how to prepare a suitable dataset: IntelLabs/matsciml#85. I will follow that guide for our data, and maybe it could also be used to extend the training capabilities of matgl.

@shyuep (Contributor)

shyuep commented Jan 31, 2024

@JonathanSchmidt1 The dataloaders in matgl already allow you to do a one-time conversion of the structures into a graph dataset. Once that graph dataset is built, it is much smaller in memory than the structures. In fact, that is how we have been training on extremely large datasets.

@JonathanSchmidt1 (Author)

JonathanSchmidt1 commented Feb 13, 2024

Thank you for the reply. The preprocessing is definitely useful.
But after preprocessing, a decently sized dataset (4.5M structures) takes up 132 GB on disk and 128 GB loaded into RAM, and we would like to train on larger datasets in the future.
Maybe I am also doing something wrong. Right now I am just doing the following to preprocess the data:

from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset

# collect the element types present in the structures
elem_list = get_element_list(structures)
# set up a graph converter
converter = Structure2Graph(element_types=elem_list, cutoff=6.0)
# convert the raw dataset into an M3GNetDataset
dataset = M3GNetDataset(
    threebody_cutoff=4.0, structures=structures, converter=converter, labels={"energies": energies}
)
dataset.process()
dataset.save()

For me, the issue is also the rather old architecture of the GPU partition of the supercomputer I have to use (64 GB RAM per node, 1 GPU per node), so in-memory datasets are simply not an option there.
However, even on a modern architecture with e.g. 512 GB per node and 8 GPUs per node, the in-memory dataset becomes a problem, as I think with DDP each process loads its own copy of the dataset, resulting in ~960 GB if I were to use all GPUs?
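As a rough sketch of one possible workaround (an assumption, not existing matgl functionality): if each DDP rank only materializes its own slice of the structures before graph conversion, the per-process memory drops by roughly a factor of world_size. The helper below is hypothetical and only illustrates the index split.

import torch.distributed as dist

def rank_local_indices(num_samples: int) -> range:
    """Hypothetical helper: the slice of samples this DDP rank keeps in memory."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    per_rank = (num_samples + world_size - 1) // world_size  # ceiling division
    start = rank * per_rank
    return range(start, min(start + per_rank, num_samples))

# e.g. structures = [structures[i] for i in rank_local_indices(len(structures))]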

@janosh added the data (Data loading and processing) label on Mar 21, 2024
@JonathanSchmidt1 (Author)

Are there any updates on this? Even though we now have decent nodes with 200 GB of RAM per GPU, the datasets have also grown to more than 600 GB.
