Add ASE Datasets (#492)
* Add AseReadDataset

* Add ASE Reader dataset to __init__

* Fix typo

* Adjust docstring

* Fix typo

* Add ASE DB Dataset

* Lint

* Adjust A2G settings

* Fix typo

* Bug fixes and consistent naming

* Bug fixes and adjustments

* Add handling for structures with no tags

* Add tests of ASE datasets

* Lint

* Test __getitem__ methods

* Default to not pre-compute graph on cpu

* Adjust pattern-matching syntax

* Update test

* Allow user to specify arguments for ase.io.read

* Bug fix

* first commit

* lint tests

* hook lmdb into ase dataset

* fix lmdb ase hook into dataset

* Bug fix

* Update env.common.yml

Add orjson to conda dependencies

* lint

* Improve speed of ASE DB Datasets that use LMDB backend (#499)

* Include structure with fixed atoms in test

* fix constraints

* Test if conda installs break with new cache

* Update config.yml

* Update config.yml

* add metadata and tests

* add metadata for lmdb datasets

* add metadata for lmdb datasets

* change metadata getter

* Don't cast jagged list to np.array

* Force rebuild circleci conda env

* Remove comment

* Suppress another warning on torch.tensorizing a list of np.arrays

* Make untagged atoms check more explicit

* Refactor and add AseMultiReadDataset

* Lint

* Refactor apply_tags to atoms_transform, add test of AseReadMultiStructureDataset

* Remove lmdb lock file

* Add test for ASE DB with deleted row

* Lint

* Fix broken filepath in test

* Include sid values

* Revert "Merge branch 'ase_lmdb' into ase_read_dataset"

This reverts commit 1b1d666, reversing
changes made to 8843f6a.

* Address review comments

* Document ASE datasets

---------

Co-authored-by: zulissimeta <[email protected]>
Co-authored-by: Zack Ulissi <[email protected]>
Co-authored-by: Abhishek Das <[email protected]>
4 people committed Jun 6, 2023
1 parent 819e11d commit d2105bc
Showing 4 changed files with 669 additions and 0 deletions.
106 changes: 106 additions & 0 deletions TRAIN.md
@@ -16,6 +16,12 @@
- [Joint Training](#joint-training)
- [Create EvalAI submission files](#create-evalai-oc22-submission-files)
  - [S2EF-Total/IS2RE-Total](#s2ef-totalis2re-total)
- [Using Your Own Data](#using-your-own-data)
  - [Writing an LMDB](#writing-an-lmdb)
  - [Using an ASE Database](#using-an-ase-database)
  - [Using ASE-Readable Files](#using-ase-readable-files)
    - [Single-Structure Files](#single-structure-files)
    - [Multi-Structure Files](#multi-structure-files)

## Getting Started

@@ -323,3 +329,103 @@ EvalAI expects results to be structured in a specific format for a submission to
```
Where `file.npz` corresponds to the respective `[s2ef/is2re]_predictions.npz` files generated for the corresponding task. The final submission file will be written to `submission_file.npz` (rename accordingly). The `dataset` argument specifies which dataset is being considered — this only needs to be set for OC22 predictions because OC20 is the default.
3. Upload `submission_file.npz` to EvalAI.


# Using Your Own Data

There are multiple ways to train and evaluate OCP models on data other than OC20 and OC22. Writing an LMDB is the most performant option. However, ASE-based dataset formats are also included as a convenience for people with existing data who simply want to try OCP tools without needing to learn about LMDBs.

This tutorial will briefly discuss the basic use of these dataset formats. For more detailed information about the ASE datasets, see the [source code and docstrings](ocpmodels/datasets/ase_datasets.py).

## Writing an LMDB

Storing your data in an LMDB provides very fast random reads and the highest throughput of the supported formats. This is the recommended option for the majority of OCP use cases. For more information about writing your data to an LMDB, please see the [LMDB Dataset Tutorial](https://github.com/Open-Catalyst-Project/ocp/blob/main/tutorials/lmdb_dataset_creation.ipynb).
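
That tutorial's write loop boils down to the following condensed sketch; the input trajectory filename and the `AtomsToGraphs` settings here are illustrative, not prescriptive:

```
import pickle

import lmdb
import torch
from ase.io import read

from ocpmodels.preprocessing import AtomsToGraphs

# Converter from ASE Atoms to torch_geometric Data objects
a2g = AtomsToGraphs(
    max_neigh=50,
    radius=6,
    r_energy=True,
    r_forces=True,
    r_distances=False,
    r_fixed=True,
)

db = lmdb.open(
    "sample.lmdb",
    map_size=1099511627776 * 2,  # LMDB requires a maximum size up front
    subdir=False,
    meminit=False,
    map_async=True,
)

# "my_trajectory.traj" is a placeholder for your own data source
atoms_list = read("my_trajectory.traj", index=":")
for fid, atoms in enumerate(atoms_list):
    data = a2g.convert(atoms)
    data.sid = torch.LongTensor([0])    # system id
    data.fid = torch.LongTensor([fid])  # frame id within the system
    txn = db.begin(write=True)
    txn.put(f"{fid}".encode("ascii"), pickle.dumps(data, protocol=-1))
    txn.commit()

# Store the entry count so the dataset knows its length
txn = db.begin(write=True)
txn.put("length".encode("ascii"), pickle.dumps(len(atoms_list), protocol=-1))
txn.commit()

db.sync()
db.close()
```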

## Using an ASE Database

If your data is already in an [ASE Database](https://databases.fysik.dtu.dk/ase/ase/db/db.html), no additional preprocessing is necessary before running training/prediction! Although the ASE DB backends may not be sufficiently high throughput for all use cases, they are generally considered "fast enough" to train on a reasonably-sized dataset with 1-2 GPUs or predict with a single GPU. If you want to effectively utilize more resources than this, be aware of this potential bottleneck and consider writing your data to an LMDB. If your dataset is small enough to fit in CPU memory, use the `keep_in_memory: True` option to avoid the bottleneck entirely.
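
If your data starts out as in-memory `Atoms` objects rather than an existing DB, writing one is straightforward. A minimal sketch (the filename and energy/force values below are placeholders; attach real results via a calculator such as `SinglePointCalculator` so they are stored in the DB):

```
import numpy as np
from ase.build import add_adsorbate, fcc111
from ase.calculators.singlepoint import SinglePointCalculator
from ase.db import connect

# Build a toy slab + adsorbate; in practice these come from your own calculations
atoms = fcc111("Cu", size=(2, 2, 3), vacuum=10.0)
add_adsorbate(atoms, "O", height=1.2, position="fcc")

# Placeholder energy/forces -- real values would come from DFT or similar
atoms.calc = SinglePointCalculator(
    atoms,
    energy=-1.23,
    forces=np.zeros((len(atoms), 3)),
)

# "my_data.db" should match the `src` field in your config
with connect("my_data.db") as db:
    db.write(atoms)
```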

To use this dataset, we just need to change our config files to use the ASE DB Dataset rather than the LMDB Dataset:

```
task:
  dataset: ase_db
dataset:
  train:
    src: # The path/address to your ASE DB
    connect_args:
      # Keyword arguments for ase.db.connect()
    select_args:
      # Keyword arguments for ase.db.select()
      # These can be used to query/filter the ASE DB
    a2g_args:
      r_energy: True
      r_forces: True
      # Set these if you want to train on energy/forces
      # Energy/force information must be in the ASE DB!
    keep_in_memory: False # Keeping the dataset in memory reduces random reads and is extremely fast, but this is only feasible for relatively small datasets!
  val:
    src:
    a2g_args:
      r_energy: True
      r_forces: True
  test:
    src:
    a2g_args:
      r_energy: False
      r_forces: False
      # It is not necessary to have energy or forces if you are just making predictions.
```
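
The `select_args` are forwarded to `ase.db.select()`, so any query string or keyword filter supported there can be used to carve out a training subset. A quick sketch of running the equivalent query by hand, assuming the hypothetical `my_data.db` from above:

```
from ase.db import connect

db = connect("my_data.db")

# The same filters you would pass via select_args,
# e.g. only systems with more than 10 atoms:
for row in db.select("natoms>10"):
    print(row.id, row.formula)
```
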
## Using ASE-Readable Files

It is possible to train/predict directly on ASE-readable files. This is only recommended for smaller datasets, as directories of many small files do not scale efficiently on all computing infrastructures. There are two options for loading data with the ASE reader:

### Single-Structure Files
This dataset assumes a single structure will be obtained from each file:

```
task:
  dataset: ase_read
dataset:
  train:
    src: # The folder that contains ASE-readable files
    pattern: # Pattern matching each file you want to read (e.g. "*/POSCAR"). Search recursively with two wildcards: "**/*.cif".
    ase_read_args:
      # Keyword arguments for ase.io.read()
    a2g_args:
      # Include energy and forces for training purposes
      # If True, the energy/forces must be readable from the file (ex. OUTCAR)
      r_energy: True
      r_forces: True
    keep_in_memory: False
```
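
Before training, it can be worth mirroring the dataset's file discovery by hand to confirm that `src` and `pattern` match the files you expect and that each file parses as a single structure. A small sketch, with placeholder paths:

```
from pathlib import Path

from ase.io import read

src = Path("my_structures")  # placeholder; same role as `src` above
for path in sorted(src.glob("**/*.cif")):
    atoms = read(str(path))  # the same call the dataset makes per file
    print(path, atoms.get_chemical_formula())
```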

### Multi-Structure Files
This dataset supports reading files that each contain multiple structures (for example, an ASE .traj file). Using an index file, which tells the dataset how many structures each file contains, is recommended. Otherwise, the dataset is forced to load every file at startup and count the number of structures!

```
task:
  dataset: ase_read_multi
dataset:
  train:
    index_file: # Filepath to an index file which contains each filename and the number of structures in each file. e.g.:
      # /path/to/relaxation1.traj 200
      # /path/to/relaxation2.traj 150
      # ...
    # If using an index file, the src and pattern are not necessary
    src: # The folder that contains ASE-readable files
    pattern: # Pattern matching each file you want to read (e.g. "*.traj"). Search recursively with two wildcards: "**/*.xyz".
    ase_read_args:
      # Keyword arguments for ase.io.read()
    a2g_args:
      # Include energy and forces for training purposes
      r_energy: True
      r_forces: True
    keep_in_memory: False
```
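
An index file in this format can be generated ahead of time with a short script such as the sketch below (paths are placeholders). This still reads every file, but only once during preprocessing rather than at every training startup:

```
from pathlib import Path

from ase.io import read

src = Path("my_trajectories")  # placeholder folder of .traj files
with open("index_file.txt", "w") as f:
    for traj in sorted(src.glob("*.traj")):
        n_structures = len(read(str(traj), index=":"))
        f.write(f"{traj} {n_structures}\n")
```
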
6 changes: 6 additions & 0 deletions ocpmodels/datasets/__init__.py
@@ -10,3 +10,9 @@
    data_list_collater,
)
from .oc22_lmdb_dataset import OC22LmdbDataset

from .ase_datasets import (
    AseReadDataset,
    AseReadMultiStructureDataset,
    AseDBDataset,
)
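
These classes can also be instantiated directly in Python, which is convenient for sanity-checking a config before launching training. A minimal sketch, assuming the constructor accepts the same keys as the YAML `dataset.train` blocks above (the DB path is a placeholder):

```
from ocpmodels.datasets import AseDBDataset

config = {
    "src": "my_data.db",  # placeholder path to your ASE DB
    "a2g_args": {"r_energy": True, "r_forces": True},
    "keep_in_memory": False,
}

dataset = AseDBDataset(config)
print(len(dataset))  # number of structures selected from the DB
data = dataset[0]    # a torch_geometric Data object
```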