diff --git a/TRAIN.md b/TRAIN.md index da078a924..d16d08dfb 100644 --- a/TRAIN.md +++ b/TRAIN.md @@ -253,7 +253,7 @@ final, relaxed state. This can be done by training a model to predict per-atom f task and then running an iterative relaxation. Although we present an iterative approach, models that directly predict relaxed states are also possible. The iterative approach IS2RS task uses the same configuration files as the S2EF task `configs/s2ef` and follows the same training scheme above. To perform an iterative relaxation, ensure the following is added to the configuration files of the models you wish to run relaxations on: -``` +```yaml # Relaxation options relax_dataset: src: data/is2re/all/val_id/data.lmdb # path to lmdb of systems to be relaxed (uses same lmdbs as is2re) @@ -268,7 +268,7 @@ relax_opt: ``` After training, relaxations can be run by: -``` +```bash python main.py --mode run-relaxations --config-yml configs/s2ef/2M/schnet/schnet.yml \ --checkpoint checkpoints/[TIMESTAMP]/checkpoint.pt ``` @@ -281,7 +281,7 @@ EvalAI expects results to be structured in a specific format for a submission to ### S2EF/IS2RE: 1. Run predictions `--mode predict` on all 4 splits, generating `[s2ef/is2re]_predictions.npz` files for each split. 2. Run the following command: - ``` + ```bash python make_submission_file.py --id path/to/id/file.npz --ood-ads path/to/ood_ads/file.npz \ --ood-cat path/to/ood_cat/file.npz --ood-both path/to/ood_both/file.npz --out-path submission_file.npz ``` @@ -292,7 +292,7 @@ EvalAI expects results to be structured in a specific format for a submission to ### IS2RS: 1. Ensure `write_pos: True` is included in your configuration file. Run relaxations `--mode run-relaxations` on all 4 splits, generating `relaxed_positions.npz` files for each split. 2. Run the following command: - ``` + ```bash python make_submission_file.py --id path/to/id/relaxed_positions.npz --ood-ads path/to/ood_ads/relaxed_positions.npz \ --ood-cat path/to/ood_cat/relaxed_positions.npz --ood-both path/to/ood_both/relaxed_positions.npz --out-path is2rs_submission.npz ``` @@ -305,7 +305,7 @@ EvalAI expects results to be structured in a specific format for a submission to For the IS2RE-Total task, the model takes the initial structure as input and predicts the total DFT energy of the relaxed structure. This task is more general and more challenging than the original OC20 IS2RE task that predicts adsorption energy. To train an OC22 IS2RE-Total model use the `EnergyTrainer` with the `OC22LmdbDataset` by including these lines in your configuration file: -``` +```yaml trainer: energy # Use the EnergyTrainer task: @@ -318,7 +318,7 @@ You can find examples configuration files in [`configs/oc22/is2re`](https://gith The S2EF-Total task takes a structure and predicts the total DFT energy and per-atom forces. This differs from the original OC20 S2EF task because it predicts total energy instead of adsorption energy. To train an OC22 S2EF-Total model use the ForcesTrainer with the OC22LmdbDataset by including these lines in your configuration file: -``` +```yaml trainer: forces # Use the ForcesTrainer task: @@ -331,7 +331,7 @@ You can find examples configuration files in [`configs/oc22/s2ef`](https://githu Training on OC20 total energies whether independently or jointly with OC22 requires a path to the `oc20_ref` (download link provided below) to be specified in the configuration file. These are necessary to convert OC20 adsorption energies into their corresponding total energies. The following changes in the configuration file capture these changes: -``` +```yaml task: dataset: oc22_lmdb ... @@ -358,7 +358,7 @@ EvalAI expects results to be structured in a specific format for a submission to ### S2EF-Total/IS2RE-Total: 1. Run predictions `--mode predict` on both the id and ood splits, generating `[s2ef/is2re]_predictions.npz` files for each split. 2. Run the following command: - ``` + ```bash python make_submission_file.py --dataset OC22 --id path/to/id/file.npz --ood path/to/ood_ads/file.npz --out-path submission_file.npz ``` Where `file.npz` corresponds to the respective `[s2ef/is2re]_predictions.npz` files generated for the corresponding task. The final submission file will be written to `submission_file.npz` (rename accordingly). The `dataset` argument specifies which dataset is being considered — this only needs to be set for OC22 predictions because OC20 is the default. @@ -381,7 +381,7 @@ If your data is already in an [ASE Database](https://databases.fysik.dtu.dk/ase/ To use this dataset, we will just have to change our config files to use the ASE DB Dataset rather than the LMDB Dataset: -``` +```yaml task: dataset: ase_db @@ -399,6 +399,7 @@ dataset: # Set these if you want to train on energy/forces # Energy/force information must be in the ASE DB! keep_in_memory: False # Keeping the dataset in memory reduces random reads and is extremely fast, but this is only feasible for relatively small datasets! + include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training val: src: a2g_args: @@ -418,29 +419,30 @@ It is possible to train/predict directly on ASE-readable files. This is only rec ### Single-Structure Files This dataset assumes a single structure will be obtained from each file: -``` +```yaml task: dataset: ase_read dataset: train: - src: # The folder that contains ASE-readable files - pattern: # Pattern matching each file you want to read (e.g. "*/POSCAR"). Search recursively with two wildcards: "**/*.cif". - - ase_read_args: - # Keyword arguments for ase.io.read() - a2g_args: - # Include energy and forces for training purposes - # If True, the energy/forces must be readable from the file (ex. OUTCAR) - r_energy: True - r_forces: True - keep_in_memory: False + src: # The folder that contains ASE-readable files + pattern: # Pattern matching each file you want to read (e.g. "*/POSCAR"). Search recursively with two wildcards: "**/*.cif". + include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training + + ase_read_args: + # Keyword arguments for ase.io.read() + a2g_args: + # Include energy and forces for training purposes + # If True, the energy/forces must be readable from the file (ex. OUTCAR) + r_energy: True + r_forces: True + keep_in_memory: False ``` ### Multi-structure Files This dataset supports reading files that each contain multiple structure (for example, an ASE .traj file). Using an index file, which tells the dataset how many structures each file contains, is recommended. Otherwise, the dataset is forced to load every file at startup and count the number of structures! -``` +```yaml task: dataset: ase_read_multi @@ -451,15 +453,15 @@ dataset: /path/to/relaxation2.traj 150 ... - # If using an index file, the src and pattern are not necessary - src: # The folder that contains ASE-readable files - pattern: # Pattern matching each file you want to read (e.g. "*.traj"). Search recursively with two wildcards: "**/*.xyz". - - ase_read_args: - # Keyword arguments for ase.io.read() - a2g_args: - # Include energy and forces for training purposes - r_energy: True - r_forces: True - keep_in_memory: False + # If using an index file, the src and pattern are not necessary + src: # The folder that contains ASE-readable files + pattern: # Pattern matching each file you want to read (e.g. "*.traj"). Search recursively with two wildcards: "**/*.xyz". + + ase_read_args: + # Keyword arguments for ase.io.read() + a2g_args: + # Include energy and forces for training purposes + r_energy: True + r_forces: True + keep_in_memory: False ``` diff --git a/ocpmodels/datasets/ase_datasets.py b/ocpmodels/datasets/ase_datasets.py index dcfd7f4ab..670d0b34a 100644 --- a/ocpmodels/datasets/ase_datasets.py +++ b/ocpmodels/datasets/ase_datasets.py @@ -123,6 +123,9 @@ def __getitem__(self, idx): data_object, **self.config.get("transform_args", {}) ) + if self.config.get("include_relaxed_energy", False): + data_object.y_relaxed = self.get_relaxed_energy(self.ids[idx]) + return data_object @abstractmethod @@ -201,6 +204,10 @@ class AseReadDataset(AseAtomsDataset): to iterate over a dataset many times (e.g. training for many epochs). Not recommended for large datasets. + include_relaxed_energy (bool): Include the relaxed energy in the resulting data object. + The relaxed structure is assumed to be the final structure in the file + (e.g. the last frame of a .traj). + atoms_transform_args (dict): Additional keyword arguments for the atoms_transform callable transform_args (dict): Additional keyword arguments for the transform callable @@ -224,6 +231,10 @@ def load_dataset_get_ids(self, config) -> List[Path]: if self.path.is_file(): raise Exception("The specified src is not a directory") + if self.config.get("include_relaxed_energy", False): + self.relaxed_ase_read_args = copy.deepcopy(self.ase_read_args) + self.relaxed_ase_read_args["index"] = "-1" + return list(self.path.glob(f'{config["pattern"]}')) def get_atoms_object(self, identifier): @@ -235,6 +246,10 @@ def get_atoms_object(self, identifier): return atoms + def get_relaxed_energy(self, identifier): + relaxed_atoms = ase.io.read(identifier, **self.relaxed_ase_read_args) + return relaxed_atoms.get_potential_energy(apply_constraint=False) + @registry.register_dataset("ase_read_multi") class AseReadMultiStructureDataset(AseAtomsDataset): @@ -279,6 +294,10 @@ class AseReadMultiStructureDataset(AseAtomsDataset): to iterate over a dataset many times (e.g. training for many epochs). Not recommended for large datasets. + include_relaxed_energy (bool): Include the relaxed energy in the resulting data object. + The relaxed structure is assumed to be the final structure in the file + (e.g. the last frame of a .traj). + use_tqdm (bool): Use TQDM progress bar when initializing dataset atoms_transform_args (dict): Additional keyword arguments for the atoms_transform callable @@ -347,6 +366,12 @@ def get_atoms_object(self, identifier): def get_metadata(self): return {} + def get_relaxed_energy(self, identifier): + relaxed_atoms = ase.io.read( + "".join(identifier.split(" ")[:-1]), **self.ase_read_args + )[-1] + return relaxed_atoms.get_potential_energy(apply_constraint=False) + class dummy_list(list): def __init__(self, max) -> None: @@ -513,3 +538,8 @@ def get_metadata(self): return self.guess_target_metadata() else: return copy.deepcopy(self.dbs[0].metadata) + + def get_relaxed_energy(self, identifier): + raise NotImplementedError( + "IS2RE-Direct training with an ASE DB is not currently supported." + ) diff --git a/tests/datasets/test_ase_datasets.py b/tests/datasets/test_ase_datasets.py index 57ce15e4d..d1767c978 100644 --- a/tests/datasets/test_ase_datasets.py +++ b/tests/datasets/test_ase_datasets.py @@ -45,6 +45,8 @@ def test_ase_read_dataset() -> None: data = dataset[0] del data + dataset.close_db() + for i in range(len(structures)): os.remove( os.path.join( @@ -52,8 +54,6 @@ def test_ase_read_dataset() -> None: ) ) - dataset.close_db() - def test_ase_db_dataset() -> None: try: @@ -380,11 +380,18 @@ def test_ase_multiread_dataset() -> None: atoms_objects = [build.bulk("Cu", a=a) for a in np.linspace(3.5, 3.7, 10)] + energies = np.linspace(1, 0, len(atoms_objects)) + traj = Trajectory( os.path.join(os.path.dirname(os.path.abspath(__file__)), "test.traj"), mode="w", ) - for atoms in atoms_objects: + + for atoms, energy in zip(atoms_objects, energies): + calc = SinglePointCalculator( + atoms, energy=energy, forces=atoms.positions + ) + atoms.calc = calc traj.write(atoms) dataset = AseReadMultiStructureDataset( @@ -423,6 +430,46 @@ def test_ase_multiread_dataset() -> None: assert len(dataset) == len(atoms_objects) [dataset[:]] + dataset = AseReadMultiStructureDataset( + config={ + "index_file": os.path.join( + os.path.dirname(os.path.abspath(__file__)), "test_index_file" + ), + "a2g_args": { + "r_energy": True, + "r_forces": True, + }, + "include_relaxed_energy": True, + } + ) + + assert len(dataset) == len(atoms_objects) + [dataset[:]] + + assert hasattr(dataset[0], "y_relaxed") + assert dataset[0].y_relaxed != dataset[0].y + assert dataset[-1].y_relaxed == dataset[-1].y + + dataset = AseReadDataset( + config={ + "src": os.path.join(os.path.dirname(os.path.abspath(__file__))), + "pattern": "*.traj", + "ase_read_args": { + "index": "0", + }, + "a2g_args": { + "r_energy": True, + "r_forces": True, + }, + "include_relaxed_energy": True, + } + ) + + [dataset[:]] + + assert hasattr(dataset[0], "y_relaxed") + assert dataset[0].y_relaxed != dataset[0].y + os.remove( os.path.join(os.path.dirname(os.path.abspath(__file__)), "test.traj") )