How to define CE and charge state for spectral library generation workflow #126

tobiasko · 2023-09-21T07:40:49Z

Question

Dear oktoberfest maintainers,

I tried the SpectralLibraryGeneration workflow starting from a complete reference proteome ("library_input_type": "fasta") and the job finished successfully. Nice! But it seems like the default settings for the workflow are predicting charges 2, 3, 4 at NCE 30:

head prosit_input.csv
modified_sequence,collision_energy,precursor_charge,fragmentation
LTCTLSSGHSSYAIAWHQQQPEK,30,2,hcd
LTCTLSSGHSSYAIAWHQQQPEK,30,3,hcd
LTCTLSSGHSSYAIAWHQQQPEK,30,4,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPR,30,2,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPR,30,3,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPR,30,4,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPRYLMK,30,2,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPRYLMK,30,3,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPRYLMK,30,4,hcd
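A quick way to confirm which collision energies and charge states ended up in such a file is to tally the two columns. The following is only a minimal sketch using the column names from the header above; the helper name `summarize_prosit_input` is hypothetical and not part of oktoberfest or spectrum-io.

```python
import csv
import io
from collections import Counter


def summarize_prosit_input(handle):
    """Count rows per (collision_energy, precursor_charge) pair.

    `handle` is any open text file with the prosit_input.csv header shown above.
    """
    reader = csv.DictReader(handle)
    return Counter(
        (int(row["collision_energy"]), int(row["precursor_charge"]))
        for row in reader
    )


# In-memory sample taken from the first three rows of the listing above.
sample = io.StringIO(
    "modified_sequence,collision_energy,precursor_charge,fragmentation\n"
    "LTCTLSSGHSSYAIAWHQQQPEK,30,2,hcd\n"
    "LTCTLSSGHSSYAIAWHQQQPEK,30,3,hcd\n"
    "LTCTLSSGHSSYAIAWHQQQPEK,30,4,hcd\n"
)
summary = summarize_prosit_input(sample)
for (nce, charge), count in sorted(summary.items()):
    print(f"NCE {nce}, charge {charge}: {count} row(s)")
```

For the real file, pass `open("prosit_input.csv", newline="")` instead of the in-memory sample.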

How can I reduce the charge states (2 and 3 or only 2) or select a specific NCE? My settings were:

{
    "type": "SpectralLibraryGeneration",
    "tag": "",
    "allFeatures": false,
    "inputs": {
        "library_input": "/Users/tobiasko/Documents/UP000005640_9606.fasta",
        "library_input_type": "fasta",
        "search_results": "./msms.txt"
    },
    "fastaDigestOptions": {
        "fragmentation": "HCD",
        "digestion": "full",
        "cleavages": 0,
        "minLength": 7,
        "maxLength": 30,
        "enzyme": "trypsin",
        "specialAas": "KR",
        "db": "target"
    },
    "models": {
        "intensity": "Prosit_2020_intensity_HCD",
        "irt": "Prosit_2019_irt"
    },
    "output": "/Users/tobiasko/tmp/20230920/",
    "outputFormat": "msp",
    "prediction_server": "koina.proteomicsdb.org:443",
    "ssl": true,
    "numThreads": 3,
    "fdr_estimation_method": "mokapot",
    "regressionMethod": "spline",
    "thermoExe": "ThermoRawFileParser.exe",
    "massTolerance": 20,
    "unitMassTolerance": "ppm"
}

Best,
Tobi

picciama · 2023-10-05T13:21:19Z

This is currently hardcoded in the digest function in spectrum-io. I will separate the digestion from writing the output file and add a configuration file section for library generation options containing:

fragmentation method (moved from digestion options)
charge states
collision energy

Requires:

config change
documentation change
IO change of automatically generated peptide input file

This is a breaking change in spectrum-io.

tobiasko · 2023-12-21T13:41:47Z

Hi @picciama,

what is the current status here? Is it possible to define a CE parameter in the spectral library generation workflow? I can't see any corresponding change in the online documentation.

Best,
Tobi

picciama · 2023-12-21T20:24:48Z

How urgent is it? I was pretty busy with some higher prio stuff, see release history. But given that we also have another issue in that regard (#157) I can do this next.

tobiasko · 2023-12-22T08:12:30Z

Hmmm... I would say it depends on Ludwig. I personally think we need it for the Koina manuscript.

picciama · 2024-01-12T23:49:44Z

Hey Tobi, the current development branch of oktoberfest now has the new spectral library generation workflow. Please check if this is working for you. Also please look at the newest documentation because the config has changed: https://oktoberfest.readthedocs.io/en/latest/jobs.html#b-spectral-library-generation

I will also merge the generic model support branch soonish and release both together to have a working version for the Koina manuscript. I also optimized the workflow, so you should see large runtime improvements. I already tested this on ~1.7 million peptides: the workflow now finishes in less than 10 minutes for both spectronaut and msp, compared to more than an hour before, and the library size went from 16 GB to 4.5 GB for spectronaut and from 4.6 GB to 1.5 GB for msp. This is due to the following changes:

minIntensity is set to 5e-4 to exclude all low-intensity peaks. Change this in the config to your liking.
batchsize is set to 10000 peptides. If you predict with, say, 3 processes, the max queue size for the writer is also set to 3, but the processes will already prefetch the next batch, i.e. you have to calculate your memory with batchsize * n_processes * 2. If you increase the batchsize, memory fluctuations will increase, but writing may be faster, although it shouldn't play a big role any longer. A batchsize of 10000 with 2-3 prediction processes seems reasonable according to my tests, but that depends on the IO and CPU power available to the writer process.
You can see whether you need more prediction processes by checking the progress bars: if the prediction bar is not ahead by a few batches, your writer consumes batches faster than the predictions are produced. The queue has a max size equal to the number of prediction processes, so you won't run into memory issues as long as a full queue fits into your memory. The prediction processes go to sleep until the writer is done with the next batch, so it isn't too bad if you have too many prediction processes (i.e. rather have one too many than one too few).
n_threads = 3 means three prediction processes + 1 writer process, i.e. a total of 4 processes.
rounding: we now round fragment / precursor mzs to 1e-8 and fragment intensities / irt to 1e-4. Together with the min intensity filter, this saves a lot of storage while reducing the runtime for writing.
If an error is encountered during prediction of a batch, you will see an error message that instructs you to leave the workflow running. The failed batches are stored without corrupting the library file, i.e. you can simply restart the workflow without changing your config, and it will append to your existing library file without starting all over again. This should hopefully reduce the load on the Koina server and make your life a bit easier if we get timeouts due to heavy load again.
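The memory and process arithmetic described in the list above can be sketched as follows. This is only a back-of-the-envelope helper, not oktoberfest code: the function names are hypothetical, and `bytes_per_peptide` is an assumed placeholder you would have to measure yourself.

```python
def peptides_in_flight(batchsize: int, n_prediction_processes: int) -> int:
    """Upper bound on peptides held in memory at once.

    Follows the rule of thumb from the comment above:
    batchsize * n_processes * 2 (a full writer queue plus one
    prefetched batch per prediction process).
    """
    return batchsize * n_prediction_processes * 2


def total_processes(n_threads: int) -> int:
    """n_threads prediction processes plus one writer process."""
    return n_threads + 1


def estimated_memory_bytes(batchsize: int, n_prediction_processes: int,
                           bytes_per_peptide: int) -> int:
    """Rough memory estimate; bytes_per_peptide is an assumption, not a measured value."""
    return peptides_in_flight(batchsize, n_prediction_processes) * bytes_per_peptide


# Defaults mentioned above: batchsize 10000 with 3 prediction processes.
print(peptides_in_flight(10000, 3))  # 60000 peptides
print(total_processes(3))            # 4 processes
```

Doubling the batchsize doubles this bound, which matches the note that larger batches increase memory fluctuations.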

Please try it out and tell me if this makes sense in terms of default values and if you have additional suggestions that might be helpful.

tobiasko · 2024-01-15T09:01:50Z

Hi @picciama,

NICE! 🥳 will test asap. A question regarding the minIntensity parameter. So peaks with a predicted intensity below that value will not be written to the spectral library file?

For the example config file:

{
    "type": "SpectralLibraryGeneration",
    "tag": "",
    "output": "./out",
    "inputs": {
        "search_results": "./msms.txt",
        "search_results_type": "Maxquant",
        "library_input": "uniprot.fasta",
        "library_input_type": "fasta"
    },
    "models": {
        "intensity": "Prosit_2020_intensity_HCD",
        "irt": "Prosit_2019_irt"
    },
    "spectralLibraryOptions": {
        "fragmentation": "HCD",
        "collisionEnergy": 30,
        "precursorCharge": [2,3],
        "minIntensity": 5e-4,
        "batchsize": 10000,
        "format": "msp"
    },
    "fastaDigestOptions": {
        "digestion": "full",
        "missedCleavages": 2,
        "minLength": 7,
        "maxLength": 60,
        "enzyme": "trypsin",
        "specialAas": "KR",
        "db": "concat"
    },
    "prediction_server": "koina.proteomicsdb.org:443",
    "numThreads": 1,
    "ssl": true
}

Can the following lines be deleted when starting from a FASTA DB?

"search_results": "./msms.txt",
"search_results_type": "Maxquant",

picciama · 2024-01-15T11:57:46Z

Yes, peaks with intensity < minIntensity will be filtered out; peaks with intensity >= minIntensity will be in the file. If you really want all peaks, you need to set minIntensity to 0. According to Ludwig, Koina adds a very small epsilon of around 1e-8 to "negative" or zero peaks, but since intensities are rounded to 4 digits, you will see 0.0000 in these cases.
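The filter semantics described above can be illustrated with a small sketch. It is not the oktoberfest implementation: the function name `filter_peaks` is hypothetical, and applying the threshold before rounding is an assumption made here for illustration.

```python
def filter_peaks(intensities, min_intensity=5e-4, digits=4):
    """Keep peaks with intensity >= min_intensity, then round to `digits` decimals.

    Assumption: the threshold is applied before rounding.
    """
    return [round(i, digits) for i in intensities if i >= min_intensity]


# Made-up intensities; 1e-8 stands in for the small epsilon mentioned above.
predicted = [0.8, 4e-4, 5e-4, 1e-8]

kept_default = filter_peaks(predicted)               # default minIntensity of 5e-4
kept_all = filter_peaks(predicted, min_intensity=0)  # minIntensity set to 0

print([f"{i:.4f}" for i in kept_default])
print([f"{i:.4f}" for i in kept_all])
```

With minIntensity set to 0, the epsilon-sized peak is kept but is written as 0.0000 after rounding to 4 digits.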

The search results are indeed not necessary for spectral library generation. I will remove them from the config examples in the documentation.
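As a minimal sketch of what the trimmed config could look like, the snippet below takes the example config from the question and drops the two search result entries. The keys and values are copied from that example; the helper `drop_search_results` is hypothetical, and the result is untested against oktoberfest itself.

```python
import json

# Example config from the question above, including the search result entries.
config = {
    "type": "SpectralLibraryGeneration",
    "tag": "",
    "output": "./out",
    "inputs": {
        "search_results": "./msms.txt",
        "search_results_type": "Maxquant",
        "library_input": "uniprot.fasta",
        "library_input_type": "fasta",
    },
    "models": {
        "intensity": "Prosit_2020_intensity_HCD",
        "irt": "Prosit_2019_irt",
    },
    "spectralLibraryOptions": {
        "fragmentation": "HCD",
        "collisionEnergy": 30,
        "precursorCharge": [2, 3],
        "minIntensity": 5e-4,
        "batchsize": 10000,
        "format": "msp",
    },
    "fastaDigestOptions": {
        "digestion": "full",
        "missedCleavages": 2,
        "minLength": 7,
        "maxLength": 60,
        "enzyme": "trypsin",
        "specialAas": "KR",
        "db": "concat",
    },
    "prediction_server": "koina.proteomicsdb.org:443",
    "numThreads": 1,
    "ssl": True,
}


def drop_search_results(cfg):
    """Return a copy of cfg without the search_results* entries under "inputs"."""
    cleaned = dict(cfg)
    cleaned["inputs"] = {
        key: value
        for key, value in cfg["inputs"].items()
        if not key.startswith("search_results")
    }
    return cleaned


fasta_only = drop_search_results(config)
print(json.dumps(fasta_only, indent=4))
```

The charge states and collision energy asked about in the original question are the precursorCharge and collisionEnergy entries under spectralLibraryOptions.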

tobiasko added the question Further information is requested label Sep 21, 2023

tobiasko changed the title ~~Question~~ How to define CE and charge state for spectral library generation workflow Sep 22, 2023

picciama self-assigned this Oct 5, 2023

picciama added breaking Breaking Changes enhancement New feature or request labels Oct 5, 2023

picciama mentioned this issue Dec 16, 2023

Prosit Website Typo #157

Open

picciama linked a pull request Jan 12, 2024 that will close this issue

Feature/speclib multiprocessing #176

Merged

4 tasks

picciama mentioned this issue Jan 13, 2024

InferenceServerException: [StatusCode.UNKNOWN] Stream removed #174

Closed

picciama added this to the Release version 0.6.0 milestone Jan 29, 2024

picciama linked a pull request Jan 31, 2024 that will close this issue

Release/0.6.0 #197

Merged

4 tasks

picciama closed this as completed Feb 1, 2024
