Skip to content
This repository has been archived by the owner on Aug 30, 2023. It is now read-only.

Unsupported amino acids should ideally yield warnings, not crash the program #41

Open
fburic opened this issue Aug 11, 2020 · 3 comments

Comments

@fburic
Copy link

fburic commented Aug 11, 2020

From a usability perspective, I would suggest that sequences with unsupported amino acids be skipped and warnings be issued. The current behavior simply has the program crash with a KeyError. The note on the main page about these amino acids is clear but having the user sanitize their input seems like much more effort than handling this through Prosit.

Here is my quick and dirty patch (just to get the tool running for me). Probably not the best way to handle this but it does rely on utils.peptide_parser() to not duplicate logic.

diff --git a/prosit/tensorize.py b/prosit/tensorize.py
index d4eefdb..d94127f 100644
--- a/prosit/tensorize.py
+++ b/prosit/tensorize.py
@@ -1,4 +1,5 @@
 import collections
+import logging
 import numpy as np

 from . import constants
@@ -45,7 +46,7 @@ def get_precursor_charge_onehot(charges):
     return array


-def get_sequence_integer(sequences):
+def get_sequence_integer(sequences):
     array = np.zeros([len(sequences), MAX_SEQUENCE], dtype=int)
     for i, sequence in enumerate(sequences):
         for j, s in enumerate(utils.peptide_parser(sequence)):
@@ -89,6 +90,14 @@ def csv(df):
     assert "modified_sequence" in df.columns
     assert "collision_energy" in df.columns
     assert "precursor_charge" in df.columns
+
+    n_all = df.shape[0]
+    df = df[df.modified_sequence.apply(is_supported_sequence)]
+    n_supported = df.shape[0]
+    if n_supported < n_all:
+        logging.warning(' {0} / {1} sequences were skipped as they '
+                        'contained unsupported amino acids'.format(n_all - n_supported, n_all))
+
     data = {
         "collision_energy_aligned_normed": get_numbers(df.collision_energy) / 100.0,
         "sequence_integer": get_sequence_integer(df.modified_sequence),
@@ -107,3 +116,12 @@ def csv(df):
     data["masses_pred"] = masses_pred

     return data
+
+
+def is_supported_sequence(seq: str) -> bool:
+    try:
+        _ = [ALPHABET[s] for s in utils.peptide_parser(seq)]
+        return True
+    except KeyError as e:
+        return False
+

PS Good work on developing Prosit! 😄

@fburic fburic mentioned this issue Aug 11, 2020
@tkschmidt
Copy link
Contributor

If you provide a PR, we are happy to merge it. Bonuspoints if you have a working test case :)

@gessulat
Copy link
Contributor

gessulat commented Aug 26, 2020

Sorry to chip in. Originally, we designed the Prosit pipeline, so that it never changes any user input. We on purpose didn't skip sequences, because input and ouput files would have a different length. I believe that providing malformed input data should throw an error and not a warning. But maybe this premise changed in the Prosit pipeline.

@tkschmidt @LLautenbacher @fburic

@fburic
Copy link
Author

fburic commented Aug 26, 2020

I see, that makes sense. If the premise hasn't changed, then I'd only suggest making the error more explicit. In my case, it took me quite some time to figure out what was wrong (and I'm familiar with Python). Then I just decided it was more efficient (and acceptable) for me to fix it like above :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants