Unsupported amino acids should ideally yield warnings, not crash the program #41

fburic · 2020-08-11T11:39:50Z

From a usability perspective, I would suggest that sequences with unsupported amino acids be skipped and warnings be issued. The current behavior simply has the program crash with a KeyError. The note on the main page about these amino acids is clear but having the user sanitize their input seems like much more effort than handling this through Prosit.

Here is my quick and dirty patch (just to get the tool running for me). Probably not the best way to handle this but it does rely on utils.peptide_parser() to not duplicate logic.

diff --git a/prosit/tensorize.py b/prosit/tensorize.py
index d4eefdb..d94127f 100644
--- a/prosit/tensorize.py
+++ b/prosit/tensorize.py
@@ -1,4 +1,5 @@
 import collections
+import logging
 import numpy as np

 from . import constants
@@ -45,7 +46,7 @@ def get_precursor_charge_onehot(charges):
     return array


-def get_sequence_integer(sequences):
+def get_sequence_integer(sequences):
     array = np.zeros([len(sequences), MAX_SEQUENCE], dtype=int)
     for i, sequence in enumerate(sequences):
         for j, s in enumerate(utils.peptide_parser(sequence)):
@@ -89,6 +90,14 @@ def csv(df):
     assert "modified_sequence" in df.columns
     assert "collision_energy" in df.columns
     assert "precursor_charge" in df.columns
+
+    n_all = df.shape[0]
+    df = df[df.modified_sequence.apply(is_supported_sequence)]
+    n_supported = df.shape[0]
+    if n_supported < n_all:
+        logging.warning(' {0} / {1} sequences were skipped as they '
+                        'contained unsupported amino acids'.format(n_all - n_supported, n_all))
+
     data = {
         "collision_energy_aligned_normed": get_numbers(df.collision_energy) / 100.0,
         "sequence_integer": get_sequence_integer(df.modified_sequence),
@@ -107,3 +116,12 @@ def csv(df):
     data["masses_pred"] = masses_pred

     return data
+
+
+def is_supported_sequence(seq: str) -> bool:
+    try:
+        _ = [ALPHABET[s] for s in utils.peptide_parser(seq)]
+        return True
+    except KeyError as e:
+        return False
+

PS Good work on developing Prosit! 😄

The text was updated successfully, but these errors were encountered:

tkschmidt · 2020-08-26T13:15:52Z

If you provide a PR, we are happy to merge it. Bonuspoints if you have a working test case :)

gessulat · 2020-08-26T13:35:46Z

Sorry to chip in. Originally, we designed the Prosit pipeline, so that it never changes any user input. We on purpose didn't skip sequences, because input and ouput files would have a different length. I believe that providing malformed input data should throw an error and not a warning. But maybe this premise changed in the Prosit pipeline.

@tkschmidt @LLautenbacher @fburic

fburic · 2020-08-26T14:03:07Z

I see, that makes sense. If the premise hasn't changed, then I'd only suggest making the error more explicit. In my case, it took me quite some time to figure out what was wrong (and I'm familiar with Python). Then I just decided it was more efficient (and acceptable) for me to fix it like above :)

fburic mentioned this issue Aug 11, 2020

Error log 2 #38

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unsupported amino acids should ideally yield warnings, not crash the program #41

Unsupported amino acids should ideally yield warnings, not crash the program #41

fburic commented Aug 11, 2020

tkschmidt commented Aug 26, 2020

gessulat commented Aug 26, 2020 •

edited

Loading

fburic commented Aug 26, 2020

Unsupported amino acids should ideally yield warnings, not crash the program #41

Unsupported amino acids should ideally yield warnings, not crash the program #41

Comments

fburic commented Aug 11, 2020

tkschmidt commented Aug 26, 2020

gessulat commented Aug 26, 2020 • edited Loading

fburic commented Aug 26, 2020

gessulat commented Aug 26, 2020 •

edited

Loading