Metabolomics hackathon: MS2 spectra matching for metabolite identification #13

Open
mmattano opened this issue Oct 11, 2022 · 6 comments

@mmattano

MS2 spectra matching for metabolite identification

Abstract

One major open topic in untargeted metabolomics is identifying unknown compounds from mass spectra. As MS1 comparisons can be ambiguous (especially for small molecules), we need to look at MS2 spectra, and compare them to public MS2 databases, to differentiate compounds in the same mass range.
Currently, the best-performing methods for compound identification are GNPS and SIRIUS. They provide the user with a list of potential compounds, but in some cases the uncertainty is very high or multiple candidates are suggested, which makes the downstream analysis labor-intensive. GNPS improves its predictions by using molecular networks and taking biological information into account. SIRIUS improves its predictions by comparing the structural similarity of the compounds.
We would like to set up a novel system with modular parts that can be tested separately. Each aspect of the pipeline can be improved or modified individually, and multiple methods can be combined as an ensemble. Such a system can also serve as a benchmark of existing scoring and matching functions and as a testing playground for novel ideas.

Project Plan

The general purpose is to have an (automated) workflow for MS2 spectra matching that does not rely on cosine similarity scoring alone. Specifically, we would like to

  • Automatically clean MS2 spectra:
    • pick the MS2 spectrum with the highest precursor ion intensity or the highest total ion current for each LC-MS feature;
    • remove peaks with a relative intensity below 0.01 of the most intense peak;
    • remove peaks outside a set m/z window;
    • remove peaks outside a set intensity window;
    • (optional) calculate neutral losses within each MS2 spectrum and compare them across spectra to identify shared functional groups (that might have been lost).
  • Optimize the pipeline with additional steps (e.g., QC points, interactive visualization).
  • Optimize scoring. As a starting point, commonly used scoring functions can be compared, e.g., a normalized dot product (cosine score), Spec2Vec (inspired by Word2Vec) – structure similarity, …
  • Rank the identification results. We would like to not rely on the score alone, as it can be misleading. One option is to include information that lets us exclude, or down-weight, unlikely candidates.
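
As an illustration of the cleaning steps listed above, here is a minimal pure-Python sketch. It is not the hackathon code and does not use matchms. The function names (`clean_spectrum`, `pick_spectrum_by_tic`), the peak representation as `(m/z, intensity)` tuples, and the default windows are assumptions made for this example. The relative-intensity threshold is applied against the base peak of the unfiltered spectrum, which is also an assumption; the proposal does not fix the order of the filters.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); hypothetical representation


def clean_spectrum(
    peaks: List[Peak],
    min_rel_intensity: float = 0.01,
    mz_window: Tuple[float, float] = (0.0, float("inf")),
    intensity_window: Tuple[float, float] = (0.0, float("inf")),
) -> List[Peak]:
    """Drop peaks outside the m/z window, outside the absolute intensity
    window, or below `min_rel_intensity` relative to the base peak."""
    if not peaks:
        return []
    # Assumption: the base peak is taken from the unfiltered spectrum.
    base_peak = max(intensity for _, intensity in peaks)
    if base_peak <= 0:
        return []
    kept = []
    for mz, intensity in peaks:
        if not (mz_window[0] <= mz <= mz_window[1]):
            continue
        if not (intensity_window[0] <= intensity <= intensity_window[1]):
            continue
        if intensity / base_peak < min_rel_intensity:
            continue
        kept.append((mz, intensity))
    return sorted(kept)  # sorted by m/z


def pick_spectrum_by_tic(spectra: List[List[Peak]]) -> List[Peak]:
    """Return the spectrum with the highest total ion current
    (sum of all peak intensities)."""
    return max(spectra, key=lambda s: sum(intensity for _, intensity in s))
```

For example, `clean_spectrum(peaks, mz_window=(60.0, 500.0))` keeps only peaks between m/z 60 and 500 that reach at least 1 % of the base peak intensity.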

Technical Details

Main language: Python

  • Packages:
    • matchms (cosine score, modified cosine score)
    • spec2vec
  • Workflow includes: GNPS, SIRIUS
      GitHub
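
To make the cosine score mentioned above concrete, the following is a hedged, self-contained sketch of a greedy peak-matching cosine score. It is written from scratch for illustration and is not the matchms implementation; the function name `cosine_score`, the `(m/z, intensity)` tuple representation, and the default tolerance of 0.01 Da are assumptions. Peaks are paired greedily by descending intensity product, and each peak is used at most once.

```python
import math
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); hypothetical representation


def cosine_score(
    query: List[Peak], reference: List[Peak], tolerance: float = 0.01
) -> float:
    """Greedy cosine score between two peak lists.

    Peaks within `tolerance` (in Da) are candidate pairs; pairs are
    accepted in order of decreasing intensity product, and each peak
    can be part of only one accepted pair.
    """
    candidates = []
    for i, (mz_q, int_q) in enumerate(query):
        for j, (mz_r, int_r) in enumerate(reference):
            if abs(mz_q - mz_r) <= tolerance:
                candidates.append((int_q * int_r, i, j))
    candidates.sort(reverse=True)

    used_query, used_reference = set(), set()
    dot = 0.0
    for product, i, j in candidates:
        if i in used_query or j in used_reference:
            continue
        used_query.add(i)
        used_reference.add(j)
        dot += product

    norm_q = math.sqrt(sum(intensity ** 2 for _, intensity in query))
    norm_r = math.sqrt(sum(intensity ** 2 for _, intensity in reference))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_q * norm_r)
```

Two identical spectra score 1, and two spectra without any peaks within the tolerance score 0. The greedy assignment is a common simplification; an optimal assignment of peak pairs can give a slightly different score when several peaks fall within the tolerance of each other.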

Contact Information

Members of the metabolomics research group led by Thomas Moritz at the NNF Center for Basic Metabolic Research, Faculty of Health Research, University of Copenhagen

Matthias Mattanovich ([email protected])
Muyao Xi ([email protected])
Lawrence Egyir ([email protected])

@tobiasko
Contributor

tobiasko commented Nov 8, 2022

Dear @mmattano,

I am happy to inform you that your proposal has been selected for the DevMeeting2023! Participants will decide which hackathon to join after the pitch on Monday.

Best,
Tobi

@tobiasko
Contributor

tobiasko commented Nov 8, 2022

Maybe Muyao and Lawrence could leave a short comment here so they also become participants in this issue! THX!

@lawtrea

lawtrea commented Nov 8, 2022

Thank you @tobiasko

@MuyaoXi9271

Thanks @tobiasko
Please add me in👍

@tobiasko
Contributor

Hello everyone,

I just created a slack workspace for the DevMeeting and a channel named metabolomics for this hack. You should receive an invite to join by email.

Best,
Tobi

@mmattano
Author

Summary paragraph

During the metabolomics hackathon, we explored spectral similarity scoring. To identify a metabolite from an MS1 or MS2 spectrum, a score is used to match the spectrum in question to a database entry or, more commonly, to an entry in an in-house library. The cosine similarity score is currently the most frequently used in the field. Here, we set up a pipeline to compare multiple ways of scoring spectral similarity, along with an array of variations of each score and its input parameters. The data, prepared specifically for the hackathon, also allowed us to compute statistics on false positives, false negatives, etc. Furthermore, we set up systems to test the robustness of these scores to intensity perturbations, which are very common when dealing with biological samples, and tested for a possible correlation between structural and spectral similarity.
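
A robustness test of the kind described above could look roughly like the sketch below. It is not the hackathon pipeline: the noise model (multiplicative Gaussian noise on each intensity, clipped at zero), the function names, and the parameters `rel_sd`, `n_trials`, and `seed` are all assumptions chosen for illustration. Because only intensities are perturbed and m/z values stay fixed, the peaks remain aligned and a plain cosine on the intensity vectors is enough.

```python
import math
import random
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two aligned intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def perturb(intensities: List[float], rel_sd: float, rng: random.Random) -> List[float]:
    """Scale each intensity by (1 + Gaussian noise); negative results are clipped to 0."""
    return [max(0.0, x * (1.0 + rng.gauss(0.0, rel_sd))) for x in intensities]


def mean_score_under_noise(
    intensities: List[float], rel_sd: float, n_trials: int = 200, seed: int = 0
) -> float:
    """Average cosine score between a spectrum and noisy copies of itself."""
    rng = random.Random(seed)  # fixed seed keeps the experiment reproducible
    scores = [
        cosine(intensities, perturb(intensities, rel_sd, rng))
        for _ in range(n_trials)
    ]
    return sum(scores) / len(scores)
```

Sweeping `rel_sd` over a range of values and plotting the mean score gives a simple robustness curve; the same loop can be repeated with any other scoring function in place of `cosine` to compare how quickly each score degrades.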

Labels
Development


4 participants