
Heuristics for filtering and trimming parameters #8

Open

wbazant opened this issue Feb 18, 2020 · 2 comments

@wbazant commented Feb 18, 2020

I work on MicrobiomeDB, which puts me in a similar business to yours of wrapping dada2, except our code is less nice (there are some tools in https://github.com/VEuPathDB/DJob/tree/microbiomedb17/DistribJobTasks/bin/dada2, plus a wrapper that decides when to run what).

dada2 requires many parameters to be configured before a study can run, and the right values differ depending on the read characteristics. I saw that you ask the user to provide them in the JSON config. Do you provide heuristics that try to guess them, or any support for determining what the right values might be?

We don't do this very well right now, and it's an issue I would benefit from knowing how to address. We have a method for guessing truncLen based on read error profiles, but for the other parameters, the pipeline operator has to guess values, run the analysis, and see whether everything got filtered out.
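For illustration, the kind of heuristic I mean looks roughly like this minimal Python sketch (not our actual code; the quality threshold, sample size, and bare-bones FASTQ parsing are all simplifying assumptions):

```python
import gzip
from statistics import mean

def guess_trunc_len(fastq_path, quality_threshold=30, sample_size=1000):
    """Suggest a truncLen: the last cycle before mean quality drops below a threshold."""
    per_cycle = []  # per_cycle[i] = quality scores observed at cycle i
    with gzip.open(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i // 4 >= sample_size:  # only sample the first records
                break
            if i % 4 == 3:  # the quality line of each 4-line FASTQ record
                for cycle, ch in enumerate(line.rstrip("\n")):
                    if cycle >= len(per_cycle):
                        per_cycle.append([])
                    per_cycle[cycle].append(ord(ch) - 33)  # Phred+33 encoding
    for cycle, quals in enumerate(per_cycle):
        if mean(quals) < quality_threshold:
            return cycle  # truncate just before quality degrades
    return len(per_cycle)  # quality never dropped; keep full read length
```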

@nhoffman (Owner) commented
Hi @wbazant - thanks for leaving a comment. The short answer is no: so far I just have predefined parameters, without any heuristics for adapting them to the input data of a specific run. Defining the parameters is made much easier by the fact that I'm processing data from a small number of labs with highly standardized processes, so in general this approach has worked pretty well. The pipeline does generate a file counts.csv that tracks the yield of reads throughout the pipeline, and this is a pretty good functional indicator of sequencing quality issues, but acting on it requires manual adjustment of params.
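For a sense of what inspecting such a file could look like, here is a minimal sketch (the column names sample, raw, and filtered are placeholders; the real counts.csv tracks more steps and may use different headers):

```python
import csv

def flag_low_yield(counts_csv, min_fraction=0.5):
    """Report samples retaining less than min_fraction of their raw reads.

    Assumes columns 'sample', 'raw', and 'filtered'; the actual counts.csv
    layout may differ and include counts for additional pipeline steps.
    """
    flagged = []
    with open(counts_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            raw, filtered = int(row["raw"]), int(row["filtered"])
            fraction = filtered / raw if raw else 0.0
            if fraction < min_fraction:
                flagged.append((row["sample"], fraction))
    return flagged
```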

The challenge I see with developing heuristics for dynamically adjusting parameters is that you would almost have to run the pipeline (or at least parts of it) twice to gather the necessary information. An intermediate solution might be to provide a report with suggested parameter changes based on the pipeline outcomes (sort of a mechanical Ben Callahan). If you are able to come up with such a thing, please let me know! It might also be worth asking the question over in the dada2 repo.

Thanks for noticing the solution for providing parameters - at some point I realized that it was taking a long time to implement each parameter as a command line argument or config file item, and even then most of the options were left out. The approach is inspired by the Python convention of passing params as a dictionary (i.e., my_function(**kwargs)).
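In miniature, the convention looks like this (a simplified illustration, not the pipeline's actual code; filter_and_trim here is a stand-in for a wrapped dada2 step):

```python
import json

def filter_and_trim(reads, truncLen=0, maxEE=float("inf"), truncQ=2, **extra):
    """Stand-in for a wrapped dada2 step; accepts arbitrary extra options."""
    print(f"filtering {reads}: truncLen={truncLen}, maxEE={maxEE}, "
          f"truncQ={truncQ}, extra={extra}")

# Parameters live in the JSON config as a plain dictionary...
params = json.loads('{"truncLen": 250, "maxEE": 2, "truncQ": 2}')

# ...and are splatted into the function, so supporting a new option
# requires no new command line flag or config-parsing code.
filter_and_trim("sample1.fastq.gz", **params)
```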

@wbazant (Author) commented Feb 18, 2020

Thank you for the answer! That makes sense; I imagine there's one good set of parameters per experimental setup and sequencing machine.

Does the operator of the pipeline examine counts.csv together with the results of a full run, or is there something that demands, for example, that at least 79% of the files have at least 92% of their reads preserved?
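(In code, I mean a dataset-level acceptance check along these lines, with both thresholds made up for the sake of the example:)

```python
def dataset_passes(retained_fractions, min_file_fraction=0.79, min_read_fraction=0.92):
    """retained_fractions: per-file fraction of reads surviving filtering.

    Passes if at least min_file_fraction of the files keep at least
    min_read_fraction of their reads (both thresholds are illustrative).
    """
    if not retained_fractions:
        return False
    ok = sum(1 for f in retained_fractions if f >= min_read_fraction)
    return ok / len(retained_fractions) >= min_file_fraction
```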
