How to define CE and charge state for spectral library generation workflow #126

Closed
tobiasko opened this issue Sep 21, 2023 · 7 comments · Fixed by #176 or #197
Labels
breaking Breaking Changes enhancement New feature or request question Further information is requested

Comments

@tobiasko

Question

Dear oktoberfest maintainers,

I tried the SpectralLibraryGeneration workflow starting from a complete reference proteome ("library_input_type": "fasta") and the job finished successfully. Nice! But it seems like the default settings for the workflow are predicting charges 2, 3, 4 at NCE 30:

head prosit_input.csv
modified_sequence,collision_energy,precursor_charge,fragmentation
LTCTLSSGHSSYAIAWHQQQPEK,30,2,hcd
LTCTLSSGHSSYAIAWHQQQPEK,30,3,hcd
LTCTLSSGHSSYAIAWHQQQPEK,30,4,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPR,30,2,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPR,30,3,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPR,30,4,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPRYLMK,30,2,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPRYLMK,30,3,hcd
LTCTLSSGHSSYAIAWHQQQPEKGPRYLMK,30,4,hcd

How can I reduce the charge states (2 and 3 or only 2) or select a specific NCE? My settings were:

{
    "type": "SpectralLibraryGeneration",
    "tag": "",
    "allFeatures": false,
    "inputs": {
        "library_input": "/Users/tobiasko/Documents/UP000005640_9606.fasta",
        "library_input_type": "fasta",
        "search_results": "./msms.txt"
    },
    "fastaDigestOptions": {
        "fragmentation": "HCD",
        "digestion": "full",
        "cleavages": 0,
        "minLength": 7,
        "maxLength": 30,
        "enzyme": "trypsin",
        "specialAas": "KR",
        "db": "target"
    },
    "models": {
        "intensity": "Prosit_2020_intensity_HCD",
        "irt": "Prosit_2019_irt"
    },
    "output": "/Users/tobiasko/tmp/20230920/",
    "outputFormat": "msp",
    "prediction_server": "koina.proteomicsdb.org:443",
    "ssl": true,
    "numThreads": 3,
    "fdr_estimation_method": "mokapot",
    "regressionMethod": "spline",
    "thermoExe": "ThermoRawFileParser.exe",
    "massTolerance": 20,
    "unitMassTolerance": "ppm"
}

Best,
Tobi

@tobiasko tobiasko added the question Further information is requested label Sep 21, 2023
@tobiasko tobiasko changed the title from "Question" to "How to define CE and charge state for spectral library generation workflow" Sep 22, 2023
@picciama picciama self-assigned this Oct 5, 2023
@picciama
Contributor

picciama commented Oct 5, 2023

This is currently hardcoded in the digest function in spectrum-io. I will separate the digestion from writing the output file and add a configuration file section for library generation options containing:

  • fragmentation method (moved from digestion options)
  • charge states
  • collision energy

Requires:

  • config change
  • documentation change
  • IO change of automatically generated peptide input file

This is a breaking change in spectrum-io.
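As a rough illustration of the planned separation (not the actual spectrum-io code), the peptide list from the digest could be crossed with the configured charge states and a single collision energy to produce rows in the same layout as the prosit_input.csv shown above. The function names `build_prediction_input` and `to_csv` and their parameters are hypothetical.

```python
import csv
import io
from itertools import product


def build_prediction_input(peptides, charges, collision_energy, fragmentation):
    """Cross every peptide with every requested charge state.

    Hypothetical helper: the name and signature are illustrative only.
    """
    return [
        {
            "modified_sequence": peptide,
            "collision_energy": collision_energy,
            "precursor_charge": charge,
            "fragmentation": fragmentation,
        }
        for peptide, charge in product(peptides, charges)
    ]


def to_csv(rows):
    """Serialize rows using the column order seen in prosit_input.csv."""
    fields = ["modified_sequence", "collision_energy", "precursor_charge", "fragmentation"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_prediction_input(
    peptides=["LTCTLSSGHSSYAIAWHQQQPEK"],
    charges=[2, 3],
    collision_energy=30,
    fragmentation="hcd",
)
print(to_csv(rows))
```

Keeping the digest (which peptides) apart from the expansion (which charges, which CE) is what would let the charge states and collision energy become plain config values.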

@picciama picciama added breaking Breaking Changes enhancement New feature or request labels Oct 5, 2023
@tobiasko
Author

Hi @picciama,

What is the status here? Is it possible to define a CE parameter in the spectral library generation workflow? I can't see any corresponding change in the online documentation.

Best,
Tobi

@picciama
Contributor

How urgent is it? I was pretty busy with some higher-priority work, see the release history. But given that we also have another issue in that regard (#157), I can do this next.

@tobiasko
Author

Hmmm... I would say it depends on Ludwig. I personally think we need it for the Koina manuscript.

@picciama picciama linked a pull request Jan 12, 2024 that will close this issue
4 tasks
@picciama
Contributor

picciama commented Jan 12, 2024

Hey Tobi, the current development branch of oktoberfest now has the new spectral library generation workflow. Please check if this is working for you. Also please look at the newest documentation because the config has changed: https://oktoberfest.readthedocs.io/en/latest/jobs.html#b-spectral-library-generation

I will also merge the generic model support branch soon and release both together, so that there is a working version for the Koina manuscript. I also optimized the workflow, so you should see large runtime improvements. I tested this on ~1.7 million peptides: the workflow now finishes in less than 10 minutes for both spectronaut and msp, compared to more than an hour before, and the library size went from 16 GB to 4.5 GB for spectronaut and from 4.6 GB to 1.5 GB for msp, due to the following changes:

  • minIntensity is set to 5e-4 to exclude all low-intensity peaks; change this in the config to your liking.

  • batchsize is set to 10000 peptides. If you predict with, say, 3 processes, the max queue size for the writer is also set to 3, but the processes will already continue to prefetch the next batch, i.e. you have to calculate your memory with batchsize * n_processes * 2. If you increase the batchsize, memory fluctuations will increase, but it may be faster to write, although it shouldn't play a big role any longer. A batchsize of 10000 with 2-3 prediction processes seems reasonable according to my tests, but that depends on the IO and CPU power available to the writer process.

  • You can see if you need more prediction processes by checking the progress bars: if the prediction bar is not ahead by a few batches, your writer consumes batches faster than the predictions are produced. The queue has a max size equal to the number of prediction processes, so you won't run into memory issues as long as a full queue fits into your memory. The prediction processes go to sleep until the writer is done with the next batch, so it isn't too bad if you have too many prediction processes (i.e. rather have one too many than one too few).

  • n_threads = 3 means three prediction processes + 1 writer process, i.e. a total of 4 processes.

  • rounding: we now round fragment / precursor m/z values to 1e-8 and fragment intensities / irt to 1e-4. Together with the min intensity filter, this saves a lot of storage while reducing the runtime for writing.

  • If an error is encountered during the prediction of a batch, you will see an error message that instructs you to leave the workflow running. The failed batches are stored without corrupting the library file, i.e. you can simply restart the workflow without changing your config and it will append to your existing library file without starting all over again. This should hopefully reduce the load on the Koina server and make your life a bit easier if we get timeouts due to heavy load again.
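The memory and process arithmetic from the list above can be written out as a small sketch. It only encodes the two rules stated in the comment (batchsize * n_processes * 2 peptides in flight, and one writer process on top of the prediction processes); the function names are hypothetical and actual memory use per peptide is not modelled.

```python
def peak_peptides_in_memory(batchsize, n_prediction_processes):
    """Upper bound on peptides held at once.

    A full queue holds one batch per prediction process, and each
    process may already have prefetched its next batch, hence the
    factor of 2. Hypothetical helper for illustration only.
    """
    return batchsize * n_prediction_processes * 2


def total_processes(n_prediction_processes):
    """Prediction processes plus the single writer process."""
    return n_prediction_processes + 1


batchsize = 10000
n_prediction_processes = 3

print(peak_peptides_in_memory(batchsize, n_prediction_processes))  # 60000
print(total_processes(n_prediction_processes))  # 4
```

With the defaults described above, up to 60000 peptides' worth of predictions may be in memory at once, so doubling the batchsize doubles that bound.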

Please try it out and tell me if this makes sense in terms of default values and if you have additional suggestions that might be helpful.

@tobiasko
Author

Hi @picciama,

NICE! 🥳 Will test asap. A question regarding the minIntensity parameter: peaks with a predicted intensity below that value will not be written to the spectral library file?

For the example config file:

{
    "type": "SpectralLibraryGeneration",
    "tag": "",
    "output": "./out",
    "inputs": {
        "search_results": "./msms.txt",
        "search_results_type": "Maxquant",
        "library_input": "uniprot.fasta",
        "library_input_type": "fasta"
    },
    "models": {
        "intensity": "Prosit_2020_intensity_HCD",
        "irt": "Prosit_2019_irt"
    },
    "spectralLibraryOptions": {
        "fragmentation": "HCD",
        "collisionEnergy": 30,
        "precursorCharge": [2,3],
        "minIntensity": 5e-4,
        "batchsize": 10000,
        "format": "msp"
    },
    "fastaDigestOptions": {
        "digestion": "full",
        "missedCleavages": 2,
        "minLength": 7,
        "maxLength": 60,
        "enzyme": "trypsin",
        "specialAas": "KR",
        "db": "concat"
    },
    "prediction_server": "koina.proteomicsdb.org:443",
    "numThreads": 1,
    "ssl": true
}

Can the following lines be deleted when starting from a FASTA DB?

"search_results": "./msms.txt",
"search_results_type": "Maxquant",

@picciama
Contributor

picciama commented Jan 15, 2024

Yes, peaks with intensity < minIntensity are filtered out and peaks with intensity >= minIntensity are written to the file. If you really want all peaks, you need to set minIntensity to 0. According to Ludwig, Koina adds a very small epsilon of around 1e-8 to "negative" or zero peaks, but since intensities are rounded to 4 digits, you will see 0.0000 in these cases.
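A minimal sketch of this filter-then-round behaviour, under the assumption that the threshold is applied before rounding (the actual order inside oktoberfest is not stated here). The function name `filter_and_round` is hypothetical.

```python
def filter_and_round(intensities, min_intensity=5e-4, digits=4):
    """Keep peaks with intensity >= min_intensity, then round them.

    Hypothetical helper; the filter-before-round order is an assumption.
    """
    return [round(x, digits) for x in intensities if x >= min_intensity]


predicted = [0.5, 4e-4, 5e-4, 1e-8, 0.123456]

# Default threshold: 4e-4 and the 1e-8 epsilon peak are dropped.
print(filter_and_round(predicted))  # [0.5, 0.0005, 0.1235]

# Threshold 0 keeps everything; the epsilon peak rounds to 0.0.
print(filter_and_round(predicted, min_intensity=0))  # [0.5, 0.0004, 0.0005, 0.0, 0.1235]
```

The second call shows why a peak written as 0.0000 can still appear in the library when minIntensity is 0.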

The search results are indeed not necessary for spectral library generation. I will remove it from the config examples in the documentation.
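For a config that starts from a FASTA file, the two search-result keys can therefore be dropped. A small sketch of doing that programmatically; `drop_search_results` is a hypothetical helper, not part of oktoberfest, and only the inputs section of the example config is shown, with all other sections assumed unchanged.

```python
import copy
import json


def drop_search_results(config):
    """Return a copy of the config without the search-result input keys.

    Hypothetical helper for illustration; not part of oktoberfest.
    """
    trimmed = copy.deepcopy(config)
    for key in ("search_results", "search_results_type"):
        trimmed.get("inputs", {}).pop(key, None)
    return trimmed


example = {
    "type": "SpectralLibraryGeneration",
    "inputs": {
        "search_results": "./msms.txt",
        "search_results_type": "Maxquant",
        "library_input": "uniprot.fasta",
        "library_input_type": "fasta",
    },
}

print(json.dumps(drop_search_results(example), indent=4))
```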

@picciama picciama added this to the Release version 0.6.0 milestone Jan 29, 2024
@picciama picciama linked a pull request Jan 31, 2024 that will close this issue
4 tasks
@picciama picciama closed this as completed Feb 1, 2024