Benchmarking de novo peptide sequencing algorithms

Adding a new algorithm

Make a pull request to add your algorithm to the benchmarking system.

Add your algorithm in the denovo_benchmarks/algorithms/algorithm_name folder by providing the
container.def, make_predictions.sh, input_mapper.py, and output_mapper.py files.
Detailed descriptions of these files are given below.

Templates for each file can be found in the algorithms/base/ folder,
which also includes the InputMapperBase and OutputMapperBase base classes for implementing input and output mappers.
For examples, see the Casanovo and DeepNovo implementations.

  • container.def — definition file of the Apptainer container image that creates the environment and installs the dependencies required to run the algorithm.

  • make_predictions.sh — bash script that runs the de novo algorithm on the input dataset (a folder with MS spectra in .mgf files) and generates an output file with per-spectrum peptide predictions.
    Input: path to a dataset folder containing .mgf files with spectrum data
    Output: a file (in the common output format) containing predictions for all spectra in the dataset

    To configure the model for specific data properties (e.g. non-tryptic data or data from a particular instrument), use dataset tags. The current set of tags is defined in DatasetTag in dataset_config.py and includes nontryptic, timstof, waters, and sciex. Example usage can be found in algorithms/base/make_predictions_template.sh.

  • input_mapper.py — python script to convert input data from its original representation (the input format) to the format expected by the algorithm. A minimal sketch is given after this list.

    Input format

    • Input: a dataset folder with separate .mgf files containing MS spectra.
    • Key order for each spectrum in an .mgf file:
      [TITLE, RTINSECONDS, PEPMASS, CHARGE]
  • output_mapper.py — python script to convert the algorithm output to the common output format. A minimal sketch is given after this list.

    Output format

    • .csv file (with sep=",")

    • must contain columns:

      • "sequence" — predicted peptide sequence, written in the predefined output sequence format
      • "score"de novo algorithm "confidence" score for a predicted sequence
      • "aa_scores" — per-amino acid scores, if available. If not available, the whole peptide score will be used as a score for each amino acid.
      • "spectrum_id" — information to match each prediction with its ground truth sequence.
        {filename}:{index} string, where
        filename — name of the .mgf file in a dataset,
        index — index (0-based) of each spectrum in an .mgf file.
    • Output sequence format

      • 20 amino acid tokens:
        G, A, S, P, V, T, C, L, I, N, D, Q, K, E, M, H, F, R, Y, W
      • Amino acids with post-translational modifications (PTMs) are written in ProForma format with Unimod accession codes for PTMs:
        C[UNIMOD:4] for Cysteine Carbamidomethylation, M[UNIMOD:35] for Methionine Oxidation, etc.
      • N-terminus and C-terminus modifications, if supported by the algorithm, are also written in ProForma notation with Unimod accession codes:
        [UNIMOD:xx]-PEPTIDE-[UNIMOD:yy]
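
Below are minimal sketches of the two mappers. First, an input mapper: this is an illustration only, assuming the pyteomics library and a hypothetical algorithm that accepts .mgf files restricted to the four header keys listed above; the actual InputMapperBase interface in algorithms/base/ may differ.

    # Hypothetical input mapper sketch; the real InputMapperBase API may differ.
    import sys
    from pyteomics import mgf

    ALLOWED_KEYS = ("title", "rtinseconds", "pepmass", "charge")

    def map_input(input_mgf, output_mgf):
        spectra = []
        for spectrum in mgf.read(input_mgf):
            params = spectrum["params"]
            # Keep only the header keys the (hypothetical) algorithm understands.
            spectrum["params"] = {
                key: params[key] for key in ALLOWED_KEYS if key in params
            }
            spectra.append(spectrum)
        # Write the converted spectra, preserving the benchmark's key order.
        mgf.write(spectra, output=output_mgf, key_order=list(ALLOWED_KEYS))

    if __name__ == "__main__":
        map_input(sys.argv[1], sys.argv[2])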
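
Second, an output mapper sketch. The PTM token map and the prediction tuple layout are invented for the example, but the CSV columns, the spectrum_id format, and the aa_scores fallback follow the specification above (aa_scores are assumed comma-separated here).

    # Hypothetical output mapper sketch; OutputMapperBase may structure this differently.
    import csv
    import re

    # Invented algorithm-specific PTM tokens mapped to ProForma/Unimod notation.
    TOKEN_MAP = {
        "C(+57.02)": "C[UNIMOD:4]",   # cysteine carbamidomethylation
        "M(+15.99)": "M[UNIMOD:35]",  # methionine oxidation
    }

    def to_proforma(sequence):
        # Rewrite algorithm-specific modification tokens in ProForma notation.
        for token, proforma in TOKEN_MAP.items():
            sequence = sequence.replace(token, proforma)
        return sequence

    def write_common_format(predictions, output_csv):
        # predictions: iterable of (filename, index, sequence, score, aa_scores),
        # where aa_scores is a comma-separated string or None if unavailable.
        with open(output_csv, "w", newline="") as f:
            writer = csv.DictWriter(
                f, fieldnames=["sequence", "score", "aa_scores", "spectrum_id"]
            )
            writer.writeheader()
            for filename, index, sequence, score, aa_scores in predictions:
                proforma = to_proforma(sequence)
                if aa_scores is None:
                    # Fallback: repeat the peptide score once per amino acid
                    # (count residues, ignoring bracketed modifications).
                    n_aa = len(re.sub(r"\[.*?\]|[^A-Z]", "", proforma))
                    aa_scores = ",".join([str(score)] * n_aa)
                writer.writerow({
                    "sequence": proforma,
                    "score": score,
                    "aa_scores": aa_scores,
                    "spectrum_id": f"{filename}:{index}",  # index is 0-based
                })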

Running the benchmark

To run the benchmark locally:

  1. Clone the repository:

    git clone https://github.com/PominovaMS/denovo_benchmarks.git
    cd denovo_benchmarks
  2. Build containers for algorithms and evaluation: To build all Apptainer images, make sure you have Apptainer installed, then run:

    chmod +x build_apptainer_images.sh
    ./build_apptainer_images.sh

    This builds the Apptainer images for all algorithms, plus the evaluation image.

    If an apptainer image already exists, the script will ask if you want to rebuild it.

    A .sif image for casanovo already exists. Force rebuild? (y/N) 

    If a container is missing, that algorithm will be skipped during benchmarking. We don't share or store containers publicly yet due to ongoing development and their large size.

  3. Configure paths: Set the path to dataset_tags.tsv.

    Open denovo_benchmarks/algorithms/base/constants.py and set the DATASET_TAGS_PATH variable to the absolute path of dataset_tags.tsv on your machine.
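
    For example, the edited line in constants.py might look like this (the path below is a placeholder for your local checkout):

    # denovo_benchmarks/algorithms/base/constants.py
    DATASET_TAGS_PATH = "/home/user/denovo_benchmarks/dataset_tags.tsv"  # placeholder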

  4. Run the benchmark on a dataset: Make sure the required packages are installed:

    sudo apt install squashfuse gocryptfs fuse-overlayfs  

    Run the benchmark:

    ./run.sh /path/to/dataset/dir

    Example:

    ./run.sh sample_data/9_species_human

Input data structure

The benchmark expects input data to follow a specific folder structure.

  • Each dataset is stored in a separate folder with a unique name.
  • Spectra are stored as .mgf files inside the mgf/ subfolder.
  • Ground truth labels (PSMs found via database search) are stored in a labels.csv file within each dataset folder.

Below is an example layout for our evaluation datasets stored on the HPC:

datasets/
    9_species_human/
        labels.csv
        mgf/
            151009_exo3_1.mgf
            151009_exo3_2.mgf
            151009_exo3_3.mgf
            ...
    9_species_solanum_lycopersicum/
        labels.csv
        mgf/...
    9_species_mus_musculus/
        labels.csv
        mgf/...
    9_species_methanosarcina_mazei/
        labels.csv
        mgf/...
    ...

Note that algorithm containers receive as input only the mgf/ subfolder with spectra files and do not have access to the labels.csv file. Only the evaluation container accesses labels.csv to evaluate algorithm predictions.
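
If you assemble your own dataset folder, a quick check like the following sketch (an illustration, not a script shipped with the repo) can confirm the expected layout before running the benchmark:

    # Hypothetical helper: verify a dataset folder has mgf/*.mgf and labels.csv.
    from pathlib import Path

    def check_dataset_layout(dataset_dir):
        root = Path(dataset_dir)
        if not (root / "labels.csv").is_file():
            raise FileNotFoundError(f"{root} is missing labels.csv")
        mgf_dir = root / "mgf"
        if not mgf_dir.is_dir():
            raise FileNotFoundError(f"{root} is missing the mgf/ subfolder")
        if not any(mgf_dir.glob("*.mgf")):
            raise FileNotFoundError(f"{mgf_dir} contains no .mgf files")
        print(f"{root} looks like a valid dataset folder")

    check_dataset_layout("datasets/9_species_human")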

Running the Streamlit dashboard locally

To view the Streamlit dashboard for the benchmark locally, run:

    # If Streamlit is not installed
    pip install streamlit

    streamlit run dashboard.py

The dashboard reads the benchmark results stored in the results/ folder.
