mofdiff is a diffusion model for generating coarse-grained (CG) MOF structures. This codebase also contains the code for deconstructing/reconstructing the all-atom MOF structures to train MOFDiff and assemble CG structures generated by MOFDiff.
paper | data and pretrained models
If you find this code useful, please consider referencing our paper:
@inproceedings{
fu2024mofdiff,
title={{MOFD}iff: Coarse-grained Diffusion for Metal-Organic Framework Design},
author={Xiang Fu and Tian Xie and Andrew Scott Rosen and Tommi S. Jaakkola and Jake Allen Smith},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=0VBsoluxR2}
}
- Installation
- Process data
- Training
- Generating MOF structures
- Assemble all-atom MOFs
- Relax MOFs
- GCMC simulations
- Responsible AI FAQ
- Contributing
- Acknowledgement
We recommend using mamba rather than conda to install the dependencies, as it is faster. First install mamba following the instructions in the mamba repository. (Note: a requirements.txt mirror of env.yml is provided for compatibility with CI/CD; however, we do not recommend building the environment with pip.)
Install dependencies via mamba:
mamba env create -f env.yml
Then install mofdiff as a package:
pip install -e .
We use MOFid for preprocessing and analysis. To perform these steps, install MOFid following the instructions in the MOFid repository. The generative modeling and MOF simulation portions of this codebase do not depend on MOFid.
Configure the .env file to set the correct paths to the directories required by the functionality you need. An example .env file is provided in the repository.
For model training, please set the learning-related paths.
- PROJECT_ROOT: the parent MOFDiff directory
- DATASET_DIR: the directory containing the .lmdb file produced by processing the data
- LOG_DIR: the directory to which logs will be written
- HYDRA_JOBS: the directory to which Hydra output will be written
- WANDB_DIR: the directory to which WandB output will be written
For MOF relaxation and structural property calculations, please additionally set the Zeo++ path.
- ZEO_PATH: path to the Zeo++ "network" binary
For GCMC simulations, please additionally set the GCMC-related paths.
- RASPA_PATH: the RASPA2 parent directory
- RASPA_SIM_PATH: path to the RASPA2 "simulate" binary
- EGULP_PATH: path to the eGULP "egulp" binary
- EGULP_PARAMETER_PATH: the directory containing the eGULP "MEPO.param" file
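Since each task needs a different subset of these variables, a quick check before launching a job can save a failed run. The following is a minimal sketch, not part of the codebase: the variable names come from the lists above, while the grouping into `training`, `relaxation`, and `gcmc` and the helper name `missing_vars` are assumptions made for illustration. It also assumes the values from .env have already been exported into the process environment.

```python
import os

# Variable names are taken from the lists above. The grouping by task is
# this sketch's own reading of the README, not something the codebase defines.
REQUIRED_VARS = {
    "training": ["PROJECT_ROOT", "DATASET_DIR", "LOG_DIR", "HYDRA_JOBS", "WANDB_DIR"],
    "relaxation": ["ZEO_PATH"],
    "gcmc": ["RASPA_PATH", "RASPA_SIM_PATH", "EGULP_PATH", "EGULP_PARAMETER_PATH"],
}


def missing_vars(task, env=None):
    """Return the variables listed for `task` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS[task] if not env.get(name)]


if __name__ == "__main__":
    # Report the status of every task group for the current environment.
    for task in REQUIRED_VARS:
        missing = missing_vars(task)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{task}: {status}")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.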
You can download the preprocessed BW-DB data from Zenodo (recommended). To use the preprocessed data, please extract bw_db.tar.gz into ${oc.env:DATASET_DIR}.
Alternatively, you can download the BW-DB raw data from Materials Cloud to ${raw_path} and preprocess it with the following commands. This step requires MOFid.
python mofdiff/preprocessing/extract_mofid.py --df_path ${raw_path}/all_MOFs_screening_data.csv --cif_path ${raw_path}/cifs --save_path ${raw_path}/mofid
python mofdiff/preprocessing/preprocess.py --df_path ${raw_path}/all_MOFs_screening_data.csv --mofid_path ${raw_path}/mofid --save_path ${raw_path}/graphs
python mofdiff/preprocessing/save_to_lmdb.py --graph_path ${raw_path}/graphs --save_path ${raw_path}/lmdbs
The preprocessing involves 3 steps:
- Extract the MOFid for all structures (CPU).
- Construct CG MOF data objects from MOFid deconstruction results (CPU or GPU).
- Save the CG MOF objects to an LMDB database (relatively fast).
The entire preprocessing process for BW-DB may take several days (depending on the CPU/GPU resources).
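Because the three steps must run in order and share the same ${raw_path}, it can help to generate the command lines from a single place. The sketch below only builds the commands shown above; the function name `preprocessing_commands` is hypothetical, and the script paths and flags are copied from this README rather than checked against the scripts themselves.

```python
import shlex
from pathlib import PurePosixPath


def preprocessing_commands(raw_path):
    """Build the three preprocessing commands in the order they must run."""
    raw = PurePosixPath(raw_path)
    scripts = PurePosixPath("mofdiff/preprocessing")
    csv = str(raw / "all_MOFs_screening_data.csv")
    return [
        # Step 1: extract the MOFid for all structures.
        ["python", str(scripts / "extract_mofid.py"),
         "--df_path", csv,
         "--cif_path", str(raw / "cifs"),
         "--save_path", str(raw / "mofid")],
        # Step 2: construct CG MOF data objects from the MOFid results.
        ["python", str(scripts / "preprocess.py"),
         "--df_path", csv,
         "--mofid_path", str(raw / "mofid"),
         "--save_path", str(raw / "graphs")],
        # Step 3: save the CG MOF objects to an LMDB database.
        ["python", str(scripts / "save_to_lmdb.py"),
         "--graph_path", str(raw / "graphs"),
         "--save_path", str(raw / "lmdbs")],
    ]


if __name__ == "__main__":
    # Print the commands for an example location; nothing is executed here.
    for cmd in preprocessing_commands("/data/bwdb"):
        print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to `subprocess.run(cmd, check=True)` from the repository root, so that a failure in one step stops the ones that depend on it.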
Before training the diffusion model, we need to train the building block encoder. The building block encoder is a graph neural network that encodes the building blocks of MOFs. The building block encoder is trained with the following command:
python mofdiff/scripts/train.py --config-name=bb
The default output directory is ${oc.env:HYDRA_JOBS}/bb/${expname}/. oc.env:HYDRA_JOBS is configured in .env, and expname is configured in configs/bb.yaml. We use hydra for config management; all configs are stored in configs/. You can override the default output directory with command line arguments. For example:
python mofdiff/scripts/train.py --config-name=bb expname=bwdb_bb_dim_64 model.latent_dim=64
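The override syntax in the command above is Hydra's `key=value` form, with dots addressing nested config entries. As a rough illustration, the sketch below assembles such a command and mirrors the default output layout described above. The helper names `train_command` and `bb_output_dir` are hypothetical, and the directory is reconstructed from the README's description rather than read from the Hydra config.

```python
import os
from pathlib import PurePosixPath


def train_command(overrides, config_name=None):
    """Build a train.py command line with Hydra-style key=value overrides."""
    cmd = ["python", "mofdiff/scripts/train.py"]
    if config_name is not None:
        cmd.append(f"--config-name={config_name}")
    # Each override becomes one key=value argument; dotted keys address
    # nested config entries such as model.latent_dim.
    cmd.extend(f"{key}={value}" for key, value in overrides.items())
    return cmd


def bb_output_dir(expname, env=None):
    """Mirror the default ${oc.env:HYDRA_JOBS}/bb/${expname}/ layout."""
    env = os.environ if env is None else env
    return PurePosixPath(env["HYDRA_JOBS"]) / "bb" / expname


if __name__ == "__main__":
    cmd = train_command(
        {"expname": "bwdb_bb_dim_64", "model.latent_dim": 64},
        config_name="bb",
    )
    print(" ".join(cmd))
```

Changing `expname` is what moves the output directory, since the experiment name is the last component of the default path.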
Logging is done with wandb by default. You need to log in to wandb with wandb login before training. The training logs will be saved to the wandb project mofdiff. You can also override the wandb project with command line arguments, or disable wandb logging by removing the wandb entry in the config as demonstrated here.
Training the diffusion model requires bb_encoder_path, the output directory where the building block encoder is saved. By default, this path is ${oc.env:HYDRA_JOBS}/bb/${expname}/, as defined above. Train/validation splits are defined in splits, with examples provided for BW-DB. With the building block encoder trained to convergence, train the CG diffusion model with the following command:
python mofdiff/scripts/train.py data.bb_encoder_path=${bb_encoder_path}
For BW-DB, training the building block encoder takes roughly 3 days and training the diffusion model takes roughly 5 days on a single NVIDIA V100 GPU.
Pretrained models can be found here. To use the pretrained models, please extract pretrained.tar.gz and bb_emb_space.tar.gz into ${oc.env:PROJECT_ROOT}/pretrained.
With a trained CG diffusion model ${diffusion_model_path}, generate random CG MOF structures with the following command, where ${bb_cache_path} is the path to the trained building block encoder bb_emb_space.pt, either sourced from the pretrained models or generated as described above.
python mofdiff/scripts/sample.py --model_path ${diffusion_model_path} --bb_cache_path ${bb_cache_path}
To optimize MOF structures for a property defined in BW-DB (e.g., CO2 adsorption working capacity), use the following command, where ${data_path} is the path to the processed data data.lmdb, either sourced from the pretrained models or generated as described above.
python mofdiff/scripts/optimize.py --model_path ${diffusion_model_path} --bb_cache_path ${bb_cache_path} --data_path ${data_path} --property "working_capacity_vacuum_swing [mmol/g]" --target_v 15.0
Available arguments for sample.py and optimize.py can be found in the respective files. The generated CG MOF structures will be saved in ${sample_path}=${diffusion_model_path}/${sample_tag} as samples.pt.
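The property name passed to optimize.py above contains a space and square brackets, so it has to reach the script as a single argument. When launching optimization from Python, building the command as a list avoids quoting mistakes. This is a minimal sketch using only the flags shown in this README; the function name `optimize_command` and the example paths are hypothetical.

```python
import shlex


def optimize_command(model_path, bb_cache_path, data_path, prop, target):
    """Build the optimize.py command as an argument list (no shell quoting needed)."""
    return [
        "python", "mofdiff/scripts/optimize.py",
        "--model_path", model_path,
        "--bb_cache_path", bb_cache_path,
        "--data_path", data_path,
        "--property", prop,  # kept as one argument, spaces and brackets included
        "--target_v", str(target),
    ]


if __name__ == "__main__":
    cmd = optimize_command(
        "pretrained/mofdiff_ckpt",       # hypothetical model directory
        "pretrained/bb_emb_space.pt",
        "data/data.lmdb",                # hypothetical data location
        "working_capacity_vacuum_swing [mmol/g]",
        15.0,
    )
    # shlex.join adds the quoting a shell would need for copy-pasting.
    print(shlex.join(cmd))
```

The list form can be handed directly to `subprocess.run`, while `shlex.join` is only needed when the command is to be pasted into a shell.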
The CG structures generated with the diffusion model are not guaranteed to be realizable. We need to assemble the CG structures to recover the all-atom MOF structures. The following sections describe how to assemble the CG MOF structures. None of the subsequent steps require a GPU.
Assemble all-atom MOF structures from the CG MOF structures with the following command:
python mofdiff/scripts/assemble.py --input ${sample_path}/samples.pt
This command will assemble the CG MOF structures in ${sample_path} and save the assembled MOFs in ${sample_path}/assembled.pt. The cif files of the assembled MOFs will be saved in ${sample_path}/cif. If the assembled MOFs came from property-driven optimization, the optimization arguments are saved to ${sample_path}/opt_args.json.
The assembled structures may not be physically plausible. These MOF structures are relaxed using the UFF force field with LAMMPS. LAMMPS has already been installed as part of the environment if you have followed the installation instructions in this README. The script for relaxing the MOF structures also computes structural properties (e.g., pore volume, surface area) with Zeo++ and the mofids of the generated MOFs with MOFid. Both packages should be installed following the instructions in their respective repositories, and the corresponding paths should be added to .env as outlined above. Each step should take no more than a few minutes to complete on a single CPU. We use multiprocessing to parallelize the computation.
Relax MOFs and compute structural properties with the following command:
python mofdiff/scripts/uff_relax.py --input ${sample_path}
This command will relax the assembled MOFs in ${sample_path}/cif and save the relaxed MOFs in ${sample_path}/relaxed. The structural properties of the relaxed MOFs will be saved in ${sample_path}/relaxed/zeo_props_relax.json. The mofids of the relaxed MOFs will be saved in ${sample_path}/mofid.
To run GCMC simulations, first install RASPA2 (simulation software) and eGULP (charge calculation software). The paths to both should additionally be added to .env as outlined above.
RASPA2 can be installed with pip:
pip install "RASPA2==2.0.4"
You may need to install the following Linux dependencies first:
apt-get update
apt-get install -yq libgsl0-dev pkg-config libxrender-dev
Install eGULP following the instructions in the repository. The following commands install eGULP in /usr/local/bin/egulp-master:
unzip egulp-master.zip -d /usr/local/bin
cd /usr/local/bin/egulp-master/src && make
Finally, RASPA2 requires a set of forcefield parameters with which to run the simulations. To use our default simulation settings, copy the UFF parameter set from ForceFields into the RASPA2 forcefield definition directory, typically located at ${oc.env:RASPA_PATH}/share/raspa/forcefield.
Calculate charges for relaxed samples in ${sample_path} with the following command:
python mofdiff/scripts/calculate_charges.py --input ${sample_path}
This command will output cif files with charge information under ${sample_path}/mepo_qeq_charges.
Run GCMC simulations with the following command:
python mofdiff/scripts/gcmc_screen.py --input ${sample_path}/mepo_qeq_charges
The GCMC simulation results will be saved in ${sample_path}/gcmc/screening_results.json.
We have found that RASPA2 may occasionally have trouble reading input files as generated by Python. If you encounter errors of the general form "Creating molecules for more systems than the maximum allowed", please set the rewrite_raspa_input flag:
python mofdiff/scripts/gcmc_screen.py --input ${sample_path}/mepo_qeq_charges --rewrite_raspa_input
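Each stage of the pipeline above writes to a known location under ${sample_path}, which makes it possible to see how far a given sample directory has progressed. The following sketch checks for those outputs. The stage names and the helper `completed_stages` are assumptions made for illustration, and the presence of an output file only indicates that a stage ran, not that it succeeded for every structure.

```python
from pathlib import Path

# Output locations as described in this README, relative to ${sample_path}.
# The stage names on the left are labels chosen for this sketch.
STAGE_OUTPUTS = {
    "sample": "samples.pt",
    "assemble": "assembled.pt",
    "relax": "relaxed/zeo_props_relax.json",
    "charges": "mepo_qeq_charges",
    "gcmc": "gcmc/screening_results.json",
}


def completed_stages(sample_path):
    """List the stages whose expected output exists under sample_path."""
    root = Path(sample_path)
    return [stage for stage, rel in STAGE_OUTPUTS.items() if (root / rel).exists()]
```

For example, a directory that contains samples.pt and assembled.pt but no relaxed/ folder would report only the `sample` and `assemble` stages, indicating that uff_relax.py is the next script to run.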
- What is MOFDiff?
  - MOFDiff is a deep neural network that models metal-organic framework (MOF) 3D structures.
- What can MOFDiff do?
  - MOFDiff allows you to train and sample from models that yield a coarse-grained representation of a MOF. It also includes functions for reassembly of an atomistic MOF structure from the coarse-grained representation and interfaces to other molecular simulation software for evaluation of structural and gas separation properties.
- What is/are MOFDiff’s intended use(s)?
  - MOFDiff is intended for research purposes only, for the machine learning for porous materials community.
- How was MOFDiff evaluated? What metrics are used to measure performance?
  - MOFDiff was evaluated on the validity and novelty of the MOF structures sampled from MOFDiff. Additionally, structures optimized for CO2 adsorption were evaluated based on their simulated CO2 adsorption performance.
- What are the limitations of MOFDiff? How can users minimize the impact of MOFDiff’s limitations when using the system?
  - The provided pretrained models are specific to the BW-DB dataset.
  - While MOFDiff may in principle be trained on arbitrary datasets of MOF structures, it has been minimally tested in this capacity. We enable users to train additional models for research purposes. Please see the training instructions and associated publication above.
  - MOFDiff has not been tested by real-world experiments to see if the MOF structures it samples are achievable.
  - MOFDiff should be used for research purposes only.
- What operational factors and settings allow for effective and responsible use of MOFDiff?
  - MOFDiff should be used for research purposes only.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This codebase is based on several existing repositories: