[BE] Remove large files from fairchem and add references to new location as needed (#761)

* Remove large files from fairchem and add references to new location as needed

* ruff differs from isort specification...

* add fine-tuning supporting-info since it is over 2MB

* add unittest

* linting

* typo

* import

* Use better function name and re-use fairchem_root function

---------

Co-authored-by: Muhammed Shuaibi <[email protected]>
levineds and mshuaibii committed Jul 30, 2024
1 parent 3c0fade commit 434b956
Showing 14 changed files with 319 additions and 193 deletions.
2 changes: 1 addition & 1 deletion docs/core/datasets/oc20dense.md
@@ -11,7 +11,7 @@ The OC20Dense dataset is a validation dataset which was used to assess model per
 |ASE Trajectories |29G |112G | [ee937e5290f8f720c914dc9a56e0281f](https://dl.fbaipublicfiles.com/opencatalystproject/data/adsorbml/oc20_dense_trajectories.tar.gz) |

 The following files are also provided to be used for evaluation and general information:
-* `oc20dense_mapping.pkl` : Mapping of the LMDB `sid` to general metadata information -
+* `oc20dense_mapping.pkl` : Mapping of the LMDB `sid` to general metadata information. If this file is not present, run the command `python src/fairchem/core/scripts/download_large_files.py adsorbml` from the root of the fairchem repo to download it. -
     * `system_id`: Unique system identifier for an adsorbate, bulk, surface combination.
     * `config_id`: Unique configuration identifier, where `rand` and `heur` correspond to random and heuristic initial configurations, respectively.
     * `mpid`: Materials Project bulk identifier.

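For illustration, a lookup against the mapping might look like the sketch below. It assumes the pickle holds a dict keyed by `sid` whose values are dicts carrying the fields listed above; that layout, the helper names, and the sample values are all assumptions, not something this diff confirms.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for oc20dense_mapping.pkl: the real layout is not
# shown in this diff, so a dict of dicts keyed by sid is assumed here.
toy_mapping = {
    "sid_0": {"system_id": "sys_a", "config_id": "rand1", "mpid": "mp-0000"},
    "sid_1": {"system_id": "sys_a", "config_id": "heur0", "mpid": "mp-0000"},
}


def load_mapping(path):
    """Load a pickled sid -> metadata mapping from disk."""
    with open(path, "rb") as fp:
        return pickle.load(fp)


def configs_for_system(mapping, system_id):
    """Return the sids whose metadata belongs to one system_id."""
    return sorted(
        sid for sid, meta in mapping.items() if meta["system_id"] == system_id
    )


# Round-trip the toy mapping through a temporary pickle file.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "toy_mapping.pkl"
    with open(pkl_path, "wb") as fp:
        pickle.dump(toy_mapping, fp)
    mapping = load_mapping(pkl_path)

sids = configs_for_system(mapping, "sys_a")
```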
2 changes: 1 addition & 1 deletion docs/tutorials/NRR/NRR_example.md
@@ -62,7 +62,7 @@ To do this, we will enumerate adsorbate-slab configurations and run ML relaxatio

 +++

-Be sure to set the path in `fairchem/data/oc/configs/paths.py` to point to the correct place or pass the paths as an argument. The database pickles can be found in `fairchem/data/oc/databases/pkls`. We will show one explicitly here as an example and then run all of them in an automated fashion for brevity.
+Be sure to set the path in `fairchem/data/oc/configs/paths.py` to point to the correct place or pass the paths as an argument. The database pickles can be found in `fairchem/data/oc/databases/pkls` (some pkl files are only downloaded by running the command `python src/fairchem/core/scripts/download_large_files.py oc` from the root of the fairchem repo). We will show one explicitly here as an example and then run all of them in an automated fashion for brevity.

 ```{code-cell} ipython3
 import fairchem.data.oc

2 changes: 1 addition & 1 deletion src/fairchem/applications/AdsorbML/README.md
@@ -21,7 +21,7 @@ NOTE - ASE trajectories exclude systems that were not converged or had invalid c
 |ASE Trajectories |29G |112G | [ee937e5290f8f720c914dc9a56e0281f](https://dl.fbaipublicfiles.com/opencatalystproject/data/adsorbml/oc20_dense_trajectories.tar.gz) |

 The following files are also provided to be used for evaluation and general information:
-* `oc20dense_mapping.pkl` : Mapping of the LMDB `sid` to general metadata information -
+* `oc20dense_mapping.pkl` : Mapping of the LMDB `sid` to general metadata information. If this file is not present, run the command `python src/fairchem/core/scripts/download_large_files.py adsorbml` from the root of the fairchem repo to download it. -
     * `system_id`: Unique system identifier for an adsorbate, bulk, surface combination.
     * `config_id`: Unique configuration identifier, where `rand` and `heur` correspond to random and heuristic initial configurations, respectively.
     * `mpid`: Materials Project bulk identifier.

@@ -7,6 +7,8 @@

 import numpy as np

+from fairchem.core.scripts import download_large_files
+

 def is_successful(best_pred_energy, best_dft_energy, SUCCESS_THRESHOLD=0.1):
     """
@@ -161,6 +163,11 @@ def main():

     # targets and metadata are expected to be in
     # the same directory as this script
+    if (
+        not Path(__file__).with_name("oc20dense_val_targets.pkl").exists()
+        or not Path(__file__).with_name("ml_relaxed_dft_targets.pkl").exists()
+    ):
+        download_large_files.download_file_group("adsorbml")
     targets = pickle.load(
         open(Path(__file__).with_name("oc20dense_val_targets.pkl"), "rb")
     )

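One thing worth checking in the hunk above: the guard tests for `oc20dense_val_targets.pkl`, while the `adsorbml` group in `download_large_files.py`, as listed in this commit, contains `oc20dense_mapping.pkl` and `ml_relaxed_dft_targets.pkl`. A minimal sketch of a guard derived from the group's own file list, so that the check and the download cannot drift apart, might look like this. `missing_files` and the trimmed `FILE_GROUPS` are hypothetical, for illustration only, and are not part of fairchem.

```python
import tempfile
from pathlib import Path

# Trimmed, illustrative copy of the group table; the paths are shortened
# placeholders, not the real repository layout.
FILE_GROUPS = {
    "adsorbml": [
        Path("challenge/oc20dense_mapping.pkl"),
        Path("challenge/ml_relaxed_dft_targets.pkl"),
    ],
}


def missing_files(root, file_group):
    """Return the files of a group that do not yet exist under root."""
    return [f for f in FILE_GROUPS[file_group] if not (Path(root) / f).exists()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = missing_files(root, "adsorbml")  # nothing on disk yet

    # Simulate one file having been downloaded already.
    present = root / FILE_GROUPS["adsorbml"][0]
    present.parent.mkdir(parents=True)
    present.touch()
    after = missing_files(root, "adsorbml")
```

A caller would then trigger the download only when the returned list is non-empty.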
76 changes: 76 additions & 0 deletions src/fairchem/core/scripts/download_large_files.py
@@ -0,0 +1,76 @@
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+from urllib.request import urlretrieve
+
+from fairchem.core.common.tutorial_utils import fairchem_root
+
+S3_ROOT = "https://dl.fbaipublicfiles.com/opencatalystproject/data/large_files/"
+
+FILE_GROUPS = {
+    "odac": [
+        Path("configs/odac/s2ef/scaling_factors/painn.pt"),
+        Path("src/fairchem/data/odac/force_field/data_w_oms.json"),
+        Path(
+            "src/fairchem/data/odac/promising_mof/promising_mof_features/JmolData.jar"
+        ),
+        Path(
+            "src/fairchem/data/odac/promising_mof/promising_mof_energies/adsorption_energy.txt"
+        ),
+        Path("src/fairchem/data/odac/supercell_info.csv"),
+    ],
+    "oc": [Path("src/fairchem/data/oc/databases/pkls/bulks.pkl")],
+    "adsorbml": [
+        Path(
+            "src/fairchem/applications/AdsorbML/adsorbml/2023_neurips_challenge/oc20dense_mapping.pkl"
+        ),
+        Path(
+            "src/fairchem/applications/AdsorbML/adsorbml/2023_neurips_challenge/ml_relaxed_dft_targets.pkl"
+        ),
+    ],
+    "cattsunami": [
+        Path("tests/applications/cattsunami/tests/autoframe_inputs_dissociation.pkl"),
+        Path("tests/applications/cattsunami/tests/autoframe_inputs_transfer.pkl"),
+    ],
+    "docs": [
+        Path("docs/tutorials/NRR/NRR_example_bulks.pkl"),
+        Path("docs/core/fine-tuning/supporting-information.json"),
+    ],
+}
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "file_group",
+        type=str,
+        help="Group of files to download",
+        default="ALL",
+        choices=["ALL", *list(FILE_GROUPS)],
+    )
+    return parser.parse_args()
+
+
+def download_file_group(file_group):
+    if file_group in FILE_GROUPS:
+        files_to_download = FILE_GROUPS[file_group]
+    elif file_group == "ALL":
+        files_to_download = [item for group in FILE_GROUPS.values() for item in group]
+    else:
+        raise ValueError(
+            f'Requested file group {file_group} not recognized. Please select one of {["ALL", *list(FILE_GROUPS)]}'
+        )
+
+    fc_root = fairchem_root().parents[1]
+    for file in files_to_download:
+        if not (fc_root / file).exists():
+            print(f"Downloading {file}...")
+            urlretrieve(S3_ROOT + file.name, fc_root / file)
+        else:
+            print(f"{file} already exists")
+
+
+if __name__ == "__main__":
+    args = parse_args()
+    download_file_group(args.file_group)
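
A note on `parse_args` above: argparse treats a positional argument as required unless `nargs` says otherwise, so the `default="ALL"` there only takes effect if the argument can be omitted. The sketch below shows one way to make it optional with `nargs="?"`. It uses a shortened stand-in list of group names and a hypothetical `build_parser` helper, and is not the commit's code.

```python
import argparse

# Stand-in group names; the real script builds this list from FILE_GROUPS.
GROUPS = ["odac", "oc", "adsorbml", "cattsunami", "docs"]


def build_parser():
    """Build a parser whose positional argument may be omitted."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "file_group",
        type=str,
        nargs="?",  # makes the positional optional so the default can apply
        default="ALL",
        choices=["ALL", *GROUPS],
        help="Group of files to download",
    )
    return parser


parser = build_parser()
no_arg = parser.parse_args([]).file_group  # falls back to the default
with_arg = parser.parse_args(["oc"]).file_group  # explicit group wins
```

Without `nargs="?"`, calling the script with no argument exits with a "the following arguments are required" error instead of falling back to `ALL`.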
3 changes: 2 additions & 1 deletion src/fairchem/data/oc/README.md
@@ -9,6 +9,7 @@ This repository hosts the adsorbate-catalyst input generation workflow used in t

 To install just run in your favorite environment with python >= 3.9
 * `pip install fairchem-data-oc`
+* `python src/fairchem/core/scripts/download_large_files.py oc`

 ## Workflow

@@ -155,7 +156,7 @@ python structure_generator.py \

 ### Bulks

-A database of bulk materials taken from existing databases (i.e. Materials Project) and relaxed with consistent RPBE settings may be found in `ocdata/databases/pkls/bulks.pkl`. To preview what bulks are available, view the corresponding mapping between indices and bulks (bulk id and composition): https://dl.fbaipublicfiles.com/opencatalystproject/data/input_generation/mapping_bulks_2021sep20.txt
+A database of bulk materials taken from existing databases (i.e. Materials Project) and relaxed with consistent RPBE settings may be found in `databases/pkls/bulks.pkl` (if not, run the command `python src/fairchem/core/scripts/download_large_files.py oc` from the root of the fairchem repo). To preview what bulks are available, view the corresponding mapping between indices and bulks (bulk id and composition): https://dl.fbaipublicfiles.com/opencatalystproject/data/input_generation/mapping_bulks_2021sep20.txt

 ### Adsorbates

4 changes: 4 additions & 0 deletions src/fairchem/data/oc/core/bulk.py
@@ -9,6 +9,8 @@
 from fairchem.data.oc.core.slab import Slab
 from fairchem.data.oc.databases.pkls import BULK_PKL_PATH

+from fairchem.core.scripts import download_large_files
+
 if TYPE_CHECKING:
     import ase

@@ -51,6 +53,8 @@ def __init__(
             self.src_id = None
         else:
             if bulk_db is None:
+                if bulk_db_path == BULK_PKL_PATH and not os.path.exists(BULK_PKL_PATH):
+                    download_large_files.download_file_group("oc")
                 with open(bulk_db_path, "rb") as fp:
                     bulk_db = pickle.load(fp)

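The guard added to `__init__` above can be read as a general "fetch on first use" pattern. Here is a minimal sketch, assuming the downloader is passed in as a callable so the logic can be exercised without network access. `ensure_file` and `fake_fetch` are hypothetical names, the URL is a placeholder, and creating the parent directory is an addition that is not in the commit's `download_file_group`.

```python
import tempfile
from pathlib import Path


def ensure_file(dest, url, fetch):
    """Download url to dest unless dest already exists.

    fetch is any callable taking (url, destination); in real use this
    could be urllib.request.urlretrieve. Returns True if a download ran.
    """
    dest = Path(dest)
    if dest.exists():
        return False
    # Assumption: the target directory may not exist yet, so create it.
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, dest)
    return True


calls = []


def fake_fetch(url, dest):
    """Offline stand-in for a downloader: records the call, writes a stub."""
    calls.append(url)
    Path(dest).write_bytes(b"stub")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "databases" / "pkls" / "bulks.pkl"
    first = ensure_file(target, "https://example.invalid/bulks.pkl", fake_fetch)
    second = ensure_file(target, "https://example.invalid/bulks.pkl", fake_fetch)
```

The second call is a no-op because the file is already in place, which is the behavior the `os.path.exists` check in the hunk relies on.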
17 changes: 11 additions & 6 deletions src/fairchem/data/oc/databases/update.py
@@ -6,12 +6,15 @@
 from __future__ import annotations

 import pickle
+from pathlib import Path

 import ase.io
 from ase.atoms import Atoms
 from ase.calculators.singlepoint import SinglePointCalculator as SPC
 from tqdm import tqdm

+from fairchem.core.scripts import download_large_files
+

 # Monkey patch fix
 def pbc_patch(self):
@@ -29,7 +32,7 @@ def set_pbc_patch(self, pbc):

 def update_pkls():
     with open(
-        "ocdata/databases/pkls/adsorbates.pkl",
+        "oc/databases/pkls/adsorbates.pkl",
         "rb",
     ) as fp:
         data = pickle.load(fp)
@@ -38,13 +41,15 @@ def update_pkls():
         pbc = data[idx][0].cell._pbc
         data[idx][0]._pbc = pbc
     with open(
-        "ocdata/databases/pkls/adsorbates_new.pkl",
+        "oc/databases/pkls/adsorbates_new.pkl",
         "wb",
     ) as fp:
         pickle.dump(data, fp)

+    if not Path("oc/databases/pkls/bulks.pkl").exists():
+        download_large_files.download_file_group("oc")
     with open(
-        "ocdata/databases/pkls/bulks.pkl",
+        "oc/databases/pkls/bulks.pkl",
         "rb",
     ) as fp:
         data = pickle.load(fp)
@@ -64,7 +69,7 @@ def update_pkls():

         bulks.append((atoms, bulk_id))
     with open(
-        "ocdata/databases/pkls/bulks_new.pkl",
+        "oc/databases/pkls/bulks_new.pkl",
         "wb",
     ) as f:
         pickle.dump(bulks, f)
@@ -73,7 +78,7 @@ def update_pkls():
 def update_dbs():
     for db_name in ["adsorbates", "bulks"]:
         db = ase.io.read(
-            f"ocdata/databases/ase/{db_name}.db",
+            f"oc/databases/ase/{db_name}.db",
             ":",
         )
         new_data = []
@@ -90,7 +95,7 @@ def update_dbs():
             new_data.append(atoms)

         ase.io.write(
-            f"ocdata/databases/ase/{db_name}_new.db",
+            f"oc/databases/ase/{db_name}_new.db",
             new_data,
         )

4 changes: 3 additions & 1 deletion src/fairchem/data/odac/README.md
@@ -4,9 +4,11 @@ To download the ODAC23 dataset, please see the links [here](https://fair-chem.gi

 Pre-trained ML models and configs are available [here](https://fair-chem.github.io/core/model_checkpoints.html#open-direct-air-capture-2023-odac23).

+Large ODAC files can be downloaded by running the command `python src/fairchem/core/scripts/download_large_files.py odac` from the root of the fairchem repo.
+
 This repository contains the list of [promising MOFs](https://github.com/FAIR-Chem/fairchem/tree/main/src/fairchem/data/odac/promising_mof) discovered in the ODAC23 paper, as well as details of the [classifical force field calculations](https://github.com/FAIR-Chem/fairchem/tree/main/src/fairchem/data/odac/force_field).

-Information about supercells can be found in [supercell_info.csv](https://github.com/FAIR-Chem/fairchem/blob/main/src/fairchem/data/odac/supercell_info.csv) for each example.
+Information about supercells can be found in [supercell_info.csv](https://dl.fbaipublicfiles.com/opencatalystproject/data/large_files/supercell_info.csv) for each example (this file is downloaded to the local repo only when the above script is run).

 ## Citing

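The new `supercell_info.csv` link is consistent with how `download_large_files.py` builds URLs in this commit: `S3_ROOT + file.name`, which means the remote location is flat and keyed by basename. A small sketch of that mapping follows. `remote_url` and `duplicate_basenames` are hypothetical helpers. The second one illustrates an implication of the flat layout: basenames have to stay unique across groups.

```python
from collections import Counter
from pathlib import Path

S3_ROOT = "https://dl.fbaipublicfiles.com/opencatalystproject/data/large_files/"

# Subset of the paths listed in the commit's FILE_GROUPS.
FILES = [
    Path("src/fairchem/data/odac/supercell_info.csv"),
    Path("src/fairchem/data/odac/force_field/data_w_oms.json"),
    Path("src/fairchem/data/oc/databases/pkls/bulks.pkl"),
]


def remote_url(repo_path):
    """Map a repo-relative path to its download URL (basename only)."""
    return S3_ROOT + Path(repo_path).name


def duplicate_basenames(paths):
    """Return basenames that occur more than once and would collide remotely."""
    counts = Counter(Path(p).name for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


url = remote_url(FILES[0])
clashes = duplicate_basenames(FILES)
```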
2 changes: 1 addition & 1 deletion src/fairchem/data/odac/force_field/README.md
@@ -2,7 +2,7 @@

 This folder contains data and scripts related to the classical FF analysis performed in this work.

-- The `data_w_oms.json` file contains all successful FF interaction energy calculations with both system information and DFT-computed interaction energies. Calculations were performed across the in-domain training, validation, and test sets.
+- The `data_w_oms.json` file contains all successful FF interaction energy calculations with both system information and DFT-computed interaction energies. Calculations were performed across the in-domain training, validation, and test sets. If this file is not present, run the command `python src/fairchem/core/scripts/download_large_files.py odac` from the root of the fairchem repo to download it.
 - The `data_w_ml.json` file contains the same information for systems with successful ML interaction energy predictions. Only systems in the in-domain test set are included here.
 - The `FF_analysis.py` script performs the error calculations discussed in the paper and generates the four panels of Figure 5. All of the data used in this analysis is contained in 'data_w_oms.json" for reproducibility.
 - The `FF_calcs` folder contains example calculations for classical FF interaction energy predictions.