ProteinGCN: Protein model quality assessment using Graph Convolutional Networks

malllabiisc/ProteinGCN

Source code for the paper: ProteinGCN: Protein model quality assessment using Graph Convolutional Networks

Overview of ProteinGCN: Given a protein structure, ProteinGCN first builds a protein graph and uses a GCN to learn atom embeddings. It then pools the atom embeddings into residue-level embeddings, which are passed through a non-linear fully connected layer to predict the local scores. The residue embeddings are further pooled into a global protein embedding, which is used in the same way to predict the global score.
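
The two pooling stages in the overview can be illustrated with a minimal pure-Python sketch. It assumes mean pooling and uses toy 2-dimensional embeddings; the pooling operator actually used by ProteinGCN is not stated here, and the helper names `mean_pool` and `pool_atoms_to_residues` are hypothetical, not taken from the repository.

```python
from collections import defaultdict


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pool_atoms_to_residues(atom_embeddings, atom_to_residue):
    """Group atom embeddings by residue index and mean-pool each group.

    Mean pooling is an assumption made for illustration only.
    """
    groups = defaultdict(list)
    for embedding, residue in zip(atom_embeddings, atom_to_residue):
        groups[residue].append(embedding)
    # Return residue embeddings ordered by residue index.
    return [mean_pool(groups[r]) for r in sorted(groups)]


# Toy input: 4 atoms with 2-dimensional embeddings, belonging to 2 residues.
atom_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
atom_to_residue = [0, 0, 1, 1]

# Stage 1: atoms -> residues (input to the local score predictor).
residue_embeddings = pool_atoms_to_residues(atom_embeddings, atom_to_residue)

# Stage 2: residues -> protein (input to the global score predictor).
protein_embedding = mean_pool(residue_embeddings)
```

In the real model the atom embeddings would come from the GCN layers, and each pooled embedding would be fed to a fully connected layer rather than used directly.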

Dependencies

  • Compatible with PyTorch 1.0 and Python 3.x.
  • Dependencies can be installed using the requirements.txt file.

Dataset:

  • We train the ProteinGCN model on Rosetta-300k and test it on both the Rosetta-300k and CASP13 datasets for local (residue-level) and global quality assessment predictions.

Training model:

  1. Install all the requirements by executing pip install -r requirements.txt.

  2. Install the required protein .pdb processing library by executing sh preprocess.sh, which clones and installs this GitHub repository.

  3. Next, run the python preprocess_pdb_to_pkl.py script, which creates the .pkl files required for model training from the dataset. It defaults to a sample dataset provided with the code at ./data/. To use the original datasets, change the paths accordingly.

  4. To start a training run:

python train.py trial_run --epochs 10

Once the run completes successfully, it creates a folder named trial_run under ./data/pkl/results/. The folder contains the test results, test_results.csv (each row has the protein model name, target global score, predicted global score, target local scores, and predicted local scores), and the best model checkpoint, model_best.pth.tar. The remaining training arguments and their defaults can be found in arguments.py. Multi-GPU training on a single server is supported by default through PyTorch DataParallel. To enable it, list the GPUs to use in the CUDA_VISIBLE_DEVICES environment variable.
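
As a rough illustration of the multi-GPU setup, the sketch below builds an environment with CUDA_VISIBLE_DEVICES set to a comma-separated list of device IDs and pairs it with the training command shown above. The helper `training_env` is hypothetical and not part of the repository; the GPU IDs 0 and 1 are example values.

```python
import os


def training_env(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    gpu_ids is a list of integer device IDs, e.g. [0, 1].
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


# Same command as the training run above, split into arguments.
command = ["python", "train.py", "trial_run", "--epochs", "10"]

# Expose GPUs 0 and 1 to the training process (example IDs).
env = training_env([0, 1])

# To launch the run from Python, one could then call:
#   import subprocess
#   subprocess.run(command, env=env, check=True)
```

Setting the variable directly in the shell before invoking train.py has the same effect; the Python form is only convenient when runs are scripted.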

  5. To get the final Pearson correlation scores, run:
python correlation.py -file ./data/pkl/results/trial_run/test_results.csv
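
For reference, the Pearson correlation between target and predicted scores can be computed as in the sketch below. This is the textbook definition written in plain Python, not the code in correlation.py, which may differ (for example in how local scores are aggregated across proteins). The score values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up target and predicted global scores for four protein models.
target = [0.2, 0.4, 0.6, 0.8]
predicted = [0.25, 0.35, 0.65, 0.75]

r = pearson(target, predicted)
print(f"Pearson r = {r:.3f}")
```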

Running inference using pretrained ProteinGCN on new models:

  1. To run inference on new models, follow preprocessing steps 1-3 above for the new data. This converts the .pdb files to the pickle files required by the model. Depending on the use case, some changes to preprocess_pdb_to_pkl.py may be required:

    1. Evaluating the performance of ProteinGCN on new models: To evaluate model performance, ground truth global and local scores must be available for the new models. The function get_targets should be changed accordingly to extract these targets from a given protein .pdb filename.
    2. Using ProteinGCN to predict scores for new models: In this use case there may be no ground truth global and local scores, so the get_targets function should be modified to return a fixed value (say 1) for both. Correlations cannot be computed in this case.
  2. We have published our best ProteinGCN model, trained on the Rosetta-300k dataset. To run this pretrained model on the preprocessed data, execute:

python train.py trial_testrun --pretrained ./pretrained/pretrained.pth.tar --epochs 0 --train 0 --val 0 --test 1

The data directory currently defaults to the sample data provided with the repository. To point it at new data, change the corresponding paths in arguments.py.
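
For the prediction-only use case described above, where no ground truth scores exist, a placeholder for the targets might look like the sketch below. The function name `get_targets_placeholder`, its `num_residues` parameter, and the `(global, locals)` return shape are assumptions made for illustration; the actual get_targets in preprocess_pdb_to_pkl.py works from a .pdb filename, and its signature and return format should be checked before adapting this.

```python
def get_targets_placeholder(num_residues, fill_value=1.0):
    """Return dummy (global_score, local_scores) targets.

    Hypothetical stand-in for get_targets when no ground truth is
    available: every score is set to the same fixed value.
    """
    if num_residues < 0:
        raise ValueError("num_residues must be non-negative")
    global_score = fill_value
    local_scores = [fill_value] * num_residues
    return global_score, local_scores


# Example: a protein model with 3 residues.
global_target, local_targets = get_targets_placeholder(3)
```

Because the targets are constant, the target columns in the resulting test_results.csv carry no information, and only the predicted scores should be read.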

Please cite the following paper if you use this code in your work.

@article {Sanyal2020.04.06.028266,
	author = {Sanyal, Soumya and Anishchenko, Ivan and Dagar, Anirudh and Baker, David and Talukdar, Partha},
	title = {ProteinGCN: Protein model quality assessment using Graph Convolutional Networks},
	year = {2020},
	doi = {10.1101/2020.04.06.028266},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2020/04/07/2020.04.06.028266},
	journal = {bioRxiv}
}

For any clarification, comments, or suggestions please create an issue or contact Soumya.