InfoAlign learns molecular representations through an information bottleneck over molecular structures, cell morphology, and gene expression profiles. For more details, please refer to our paper.
- All packages can now be installed with `pip install -r requirements.txt`.
- We have automated the model and data download process for ML developers. The InfoAlign model can now be trained with a single command!
- We have created the `infoalign` package, which can be installed via `pip install infoalign`. For more details, refer to: https://github.com/liugangcode/infoalign-package.
This project was developed and tested with the following versions:
- Python: 3.11.7
- PyTorch: 2.2.0+cu118
- Torch Geometric: 2.6.1
All dependencies are listed in the `requirements.txt` file.
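To confirm that an existing environment matches these versions, a small stdlib-only check can help. This is an illustrative sketch, not part of the repository; `torch` and `torch_geometric` are the import names for PyTorch and Torch Geometric:

```python
import importlib


def check_versions(packages=("torch", "torch_geometric")):
    """Return a mapping of import name -> installed version.

    Packages that are not installed map to None; packages without a
    __version__ attribute map to "unknown".
    """
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in check_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```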
- Create a Conda environment:
  `conda create --name infoalign python=3.11.7`
- Activate the environment:
  `conda activate infoalign`
- Install dependencies:
  `pip install -r requirements.txt`
We provide a pretrained checkpoint available for download from Hugging Face. For fine-tuning and inference, use the following commands. The pretrained model will be automatically downloaded to `ckpt/pretrain.pt` by default.
```bash
python main.py --model-path ckpt/pretrain.pt --dataset finetune-chembl2k
python main.py --model-path ckpt/pretrain.pt --dataset finetune-broad6k
python main.py --model-path ckpt/pretrain.pt --dataset finetune-biogenadme
python main.py --model-path ckpt/pretrain.pt --dataset finetune-moltoxcast
```
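The four fine-tuning commands above can also be driven from one short script. `main.py` and its flags come from the repository; the `finetune_all` helper and its `dry_run` switch are illustrative additions so the commands can be previewed without launching training:

```python
import subprocess

# The four fine-tuning benchmarks listed in the README.
DATASETS = [
    "finetune-chembl2k",
    "finetune-broad6k",
    "finetune-biogenadme",
    "finetune-moltoxcast",
]


def finetune_all(model_path="ckpt/pretrain.pt", dry_run=False):
    """Run main.py fine-tuning for each benchmark dataset in turn."""
    commands = []
    for dataset in DATASETS:
        cmd = ["python", "main.py", "--model-path", model_path, "--dataset", dataset]
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Preview the commands without running them.
    for cmd in finetune_all(dry_run=True):
        print(" ".join(cmd))
```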
Alternatively, you can manually download the model weights and place the `pretrain.pt` file in the `ckpt` folder along with its corresponding YAML configuration file.
Note: If you wish to access the cell morphology and gene expression features in the ChEMBL2k and Broad6K datasets for baseline evaluation, visit our Hugging Face repository to download these features.
To pretrain the model from scratch, execute the following command:
```bash
python main.py --model-path "ckpt/pretrain.pt" --lr 1e-4 --wdecay 1e-8 --batch-size 3072
```
This will automatically download the pretraining dataset from Hugging Face. If you prefer to download the dataset manually, place all pretraining data files in the `raw_data/pretrain/raw` folder.
The pretrained model will be saved in the `ckpt` folder as `pretrain.pt`.
For readers interested in data collection, here are the sources:
- Cell Morphology Data
  - JUMP dataset: The data are from "JUMP Cell Painting dataset: morphological impact of 136,000 chemical and genetic perturbations" and can be downloaded here. The dataset includes chemical and genetic perturbations for cell morphology features.
  - Bray's dataset: "A dataset of images and morphological profiles of 30,000 small-molecule treatments using the Cell Painting assay". Download from GigaDB; a processed version is available on Zenodo.
- Gene Expression Data
  - LINCS L1000 gene expression data from the paper "Drug-induced adverse events prediction with the LINCS L1000 data": Data.
- Relationships
  - Gene-gene and gene-compound relationships from Hetionet: Data.
If you find this repository useful, please cite our paper:
```bibtex
@article{liu2024learning,
  title={Learning Molecular Representation in a Cell},
  author={Liu, Gang and Seal, Srijit and Arevalo, John and Liang, Zhenwen and Carpenter, Anne E and Jiang, Meng and Singh, Shantanu},
  journal={arXiv preprint arXiv:2406.12056},
  year={2024}
}
```