
Heuristics for filtering and trimming parameters #8

Open

wbazant opened this issue Feb 18, 2020 · 2 comments

@wbazant commented Feb 18, 2020

I work on MicrobiomeDB, which puts me in a similar business to yours of wrapping dada2, except our code is less nice (there are some tools in https://github.com/VEuPathDB/DJob/tree/microbiomedb17/DistribJobTasks/bin/dada2, plus a wrapper that decides when to run what).

dada2 requires many parameters to be configured before a study can run, and the right values differ depending on the read characteristics. I saw that you ask the user to provide them in the JSON config. Do you provide heuristics that try to guess them, or any support for determining what the right values might be?

We don't do this very well right now, and it's an issue I would benefit from knowing how to address. We have a method for guessing truncLen based on read error profiles, but for the other parameters, the pipeline operator has to guess values, run the analysis, and see whether everything got filtered out.
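For illustration, the kind of heuristic I mean looks roughly like this minimal Python sketch (not our actual code; the quality threshold, sample size, and bare-bones FASTQ parsing are all simplifying assumptions):

```python
import gzip
from statistics import mean

def guess_trunc_len(fastq_path, quality_threshold=30, sample_size=1000):
    """Suggest a truncLen: the last cycle before mean quality drops below a threshold."""
    per_cycle = []  # per_cycle[i] = quality scores observed at cycle i
    with gzip.open(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i // 4 >= sample_size:  # only sample the first records
                break
            if i % 4 == 3:  # the quality line of each 4-line FASTQ record
                for cycle, ch in enumerate(line.rstrip("\n")):
                    if cycle >= len(per_cycle):
                        per_cycle.append([])
                    per_cycle[cycle].append(ord(ch) - 33)  # Phred+33 encoding
    for cycle, quals in enumerate(per_cycle):
        if mean(quals) < quality_threshold:
            return cycle  # truncate just before quality degrades
    return len(per_cycle)  # quality never dropped; keep full read length
```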

@nhoffman (Owner) commented
Hi @wbazant - thanks for leaving a comment. The short answer is no: so far I just have predefined parameters, without any heuristics for adapting them to the input data of a specific run. Defining the parameters is made much easier by the fact that I'm processing data from a small number of labs with highly standardized processes, so in general this approach has worked pretty well. The pipeline does generate a file counts.csv that tracks the yield of reads throughout the pipeline, and this is a pretty good functional indicator of sequencing quality issues, but acting on it requires manual adjustment of params.
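For a sense of what inspecting such a file could look like, here is a minimal sketch (the column names sample, raw, and filtered are placeholders; the real counts.csv tracks more steps and may use different headers):

```python
import csv

def flag_low_yield(counts_csv, min_fraction=0.5):
    """Report samples retaining less than min_fraction of their raw reads.

    Assumes columns 'sample', 'raw', and 'filtered'; the actual counts.csv
    layout may differ and include counts for additional pipeline steps.
    """
    flagged = []
    with open(counts_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            raw, filtered = int(row["raw"]), int(row["filtered"])
            fraction = filtered / raw if raw else 0.0
            if fraction < min_fraction:
                flagged.append((row["sample"], fraction))
    return flagged
```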

The challenge I see with developing heuristics for dynamically adjusting parameters is that you would almost have to run the pipeline (or at least parts of it) twice to gather the necessary information. An intermediate solution might be to provide a report with suggested parameter changes based on the pipeline outcomes (sort of a mechanical Ben Callahan). If you are able to come up with such a thing, please let me know! It might also be worth asking the question over in the dada2 repo.

Thanks for noticing the solution for providing parameters - at some point I realized that it was taking a long time to implement each parameter as a command line argument or config file item, and even then most of the options were left out. The approach is inspired by the Python convention of passing params as a dictionary (i.e., my_function(**kwargs)).
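In miniature, the convention looks like this (a simplified illustration, not the pipeline's actual code; filter_and_trim here is a stand-in for a wrapped dada2 step):

```python
import json

def filter_and_trim(reads, truncLen=0, maxEE=float("inf"), truncQ=2, **extra):
    """Stand-in for a wrapped dada2 step; accepts arbitrary extra options."""
    print(f"filtering {reads}: truncLen={truncLen}, maxEE={maxEE}, "
          f"truncQ={truncQ}, extra={extra}")

# Parameters live in the JSON config as a plain dictionary...
params = json.loads('{"truncLen": 250, "maxEE": 2, "truncQ": 2}')

# ...and are splatted into the function, so supporting a new option
# requires no new command line flag or config-parsing code.
filter_and_trim("sample1.fastq.gz", **params)
```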

@wbazant (Author) commented Feb 18, 2020

Thank you for the answer! That makes sense; I imagine there's one good set of parameters per experimental setup and sequencing machine.

Does the operator of the pipeline examine counts.csv together with the results of a full run, or is there something that demands, for example, that at least 79% of the files have at least 92% of their reads preserved?
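(In code, I mean a dataset-level acceptance check along these lines, with both thresholds made up for the sake of the example:)

```python
def dataset_passes(retained_fractions, min_file_fraction=0.79, min_read_fraction=0.92):
    """retained_fractions: per-file fraction of reads surviving filtering.

    Passes if at least min_file_fraction of the files keep at least
    min_read_fraction of their reads (both thresholds are illustrative).
    """
    if not retained_fractions:
        return False
    ok = sum(1 for f in retained_fractions if f >= min_read_fraction)
    return ok / len(retained_fractions) >= min_file_fraction
```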
