A multi-target model for de novo molecule generation. By using the internal protein representations of the AlphaFold[1] model, a single SMILES-based transformer can generate relevant molecules for thousands of protein targets (embeddings are available for 4,331 proteins).
The model was trained using bioactivity data from the Papyrus[2] dataset (661,613 unique protein-ligand pairs in total, 6,249,253 after augmentation).
The preprint is available on ChemRxiv:
https://chemrxiv.org/engage/chemrxiv/article-details/65d47632e9ebbb4db9c63988
The setup script will install the required dependencies and download the pretrained model.
git clone https://github.com/CDDLeiden/pcmol.git && cd pcmol
chmod +x setup.sh
bash setup.sh
The conda route requires the user to download the pretrained model manually (link below).
# Setting up a fresh conda environment
git clone https://github.com/CDDLeiden/pcmol.git && cd pcmol
conda env create -f environment.yml && conda activate pcmol
python -m pip install -e .
*When not using the setup script, the pretrained model can be downloaded from here (mirror). It should then be placed in the .../pcmol/data/models folder.
# Run the model on a single target using Accession ID (generates 10 SMILES strings)
conda activate pcmol
python pcmol/generate.py --target P29275
# If GPU is not available
python pcmol/generate.py --target P29275 --device cpu
If AlphaFold2 embeddings are available for the target, they are downloaded automatically and used as input to the model. The generated molecules are saved in the data/results folder.
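To run the command-line script for several targets in a row, the calls can be assembled from Python. The following is a minimal sketch that uses only the `--target` and `--device` flags shown above; the helper names `build_generate_command` and `run_for_targets` are hypothetical and not part of pcmol, and the script path assumes the current directory is the repository root.

```python
import subprocess
import sys
from typing import List, Optional


def build_generate_command(target: str, device: Optional[str] = None) -> List[str]:
    """Build the command line for one target (hypothetical helper, not part of pcmol)."""
    # Assumes the working directory is the root of the cloned repository.
    cmd = [sys.executable, "pcmol/generate.py", "--target", target]
    if device is not None:
        cmd += ["--device", device]
    return cmd


def run_for_targets(
    targets: List[str], device: Optional[str] = None, dry_run: bool = True
) -> List[List[str]]:
    """Return one command per target; execute them only when dry_run is False."""
    commands = [build_generate_command(t, device) for t in targets]
    if not dry_run:
        for cmd in commands:
            # check=True stops the loop if generation fails for a target.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in run_for_targets(["P29275"], device="cpu"):
        print(" ".join(cmd))
```

Keeping `dry_run=True` as the default makes it easy to inspect the commands before starting a long generation run.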
To generate molecules for a particular target from Python, the Runner class can be used directly. The targetted_generation method returns a list of SMILES strings for a target protein specified by its Accession ID.
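from pcmol import Runner
# Instantiate the runner with the "XL" model
model = Runner(model="XL")
# Generate 100 SMILES strings for the target with Accession ID P29275
SMILES = model.targetted_generation(target="P29275", num_mols=100)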
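Generative models often return the same string more than once, so it can be convenient to deduplicate the returned list before saving it. The sketch below assumes only that the result is a list of strings, as described above; `unique_smiles` and `save_smiles` are hypothetical helpers, not part of pcmol, and the CSV layout is an arbitrary choice.

```python
import csv
from pathlib import Path
from typing import Iterable, List


def unique_smiles(smiles: Iterable[str]) -> List[str]:
    """Drop empty strings and exact duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for s in smiles:
        s = s.strip()
        if s and s not in seen:
            seen.add(s)
            result.append(s)
    return result


def save_smiles(smiles: Iterable[str], target: str, path: Path) -> int:
    """Write the deduplicated strings to a two-column CSV and return the row count."""
    rows = unique_smiles(smiles)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["target", "smiles"])
        writer.writerows((target, s) for s in rows)
    return len(rows)


if __name__ == "__main__":
    # Stand-in list; in practice this would be the list returned by the Runner.
    example = ["CCO", "CCO", "c1ccccc1"]
    print(unique_smiles(example))
```

This compares strings literally, so two different SMILES strings that encode the same molecule are both kept. Removing those as well would require canonicalising the strings with a cheminformatics toolkit first.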
The model currently depends on the availability of AlphaFold2 embeddings for the target protein. The list of supported targets can be found in the data/targets.txt file.
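Because generation only works for proteins with available embeddings, it can help to check an Accession ID against the supported list before starting a run. This sketch assumes that data/targets.txt contains one Accession ID per line, which is not confirmed here; `parse_targets`, `load_supported_targets` and `is_supported` are hypothetical helpers, not part of pcmol.

```python
from pathlib import Path
from typing import Set


def parse_targets(text: str) -> Set[str]:
    """Parse Accession IDs from text, assuming one ID per line."""
    targets = set()
    for line in text.splitlines():
        entry = line.strip()
        # Skip blank lines and, as a precaution, lines starting with '#'.
        if entry and not entry.startswith("#"):
            # Keep only the first token in case a line carries extra columns.
            targets.add(entry.split()[0])
    return targets


def load_supported_targets(path: Path) -> Set[str]:
    """Read the supported-target list from a file."""
    return parse_targets(path.read_text())


def is_supported(accession: str, targets: Set[str]) -> bool:
    """Return True if the Accession ID appears in the supported set."""
    return accession.strip() in targets


if __name__ == "__main__":
    # Path is relative to the repository root; adjust if the file lives elsewhere.
    targets_file = Path("data/targets.txt")
    if targets_file.exists():
        supported = load_supported_targets(targets_file)
        print("P29275 supported:", is_supported("P29275", supported))
    else:
        print(f"{targets_file} not found")
```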
[1]: Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.
[2]: Béquignon, O. J., Bongers, B. J., Jespers, W., IJzerman, A. P., van der Water, B., & van Westen, G. J. (2023). Papyrus: a large-scale curated dataset aimed at bioactivity predictions. Journal of Cheminformatics, 15(1), 3.