Heuristics for filtering and trimming parameters #8
Comments
Hi @wbazant - thanks for leaving a comment. The short answer is no: thus far I just have predefined parameters, without any heuristics for adapting them to the input data of a specific run. Defining the parameters is made much easier by the fact that I'm processing data from a small number of labs with highly standardized processes, so in general this approach has worked pretty well. This pipeline does generate a file
The challenge that I see with developing heuristics for dynamically adjusting parameters is that you would almost have to run the pipeline (or at least parts of it) twice to gather the necessary information. Maybe an intermediate solution would be to provide a report with suggestions for changes to parameters based on the pipeline outcomes (sort of a mechanical Ben Callahan). If you are able to come up with such a thing, please let me know! It might be worth asking the question over in the dada2 repo.
Thanks for noticing the solution for providing parameters. At some point I realized that it was taking a long time to implement each parameter as a command line argument or an item in a config file, and even then most of the options were left out. The solution is inspired by the python-style convention of passing params as a dictionary (i.e., |
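To illustrate the dictionary convention mentioned above, here is a minimal sketch in Python. It assumes the convention meant is keyword-argument unpacking, which is a guess because the original sentence is cut off. The function `filter_and_trim`, the config layout, and the file names are hypothetical; only the parameter names (truncLen, maxEE, truncQ, minLen) are borrowed from dada2's filterAndTrim.

```python
import json

# Hypothetical stand-in for a pipeline step. It does no real filtering; it only
# shows how settings from a JSON config can be forwarded without declaring
# every option individually.
def filter_and_trim(fwd_reads, rev_reads, truncLen=(0, 0), maxEE=(2, 2), truncQ=2, **extra):
    return {
        "truncLen": tuple(truncLen),
        "maxEE": tuple(maxEE),
        "truncQ": truncQ,
        "extra": extra,  # options the wrapper does not name explicitly
    }

# Assumed config layout: one JSON object per pipeline step.
config_text = """
{
  "filter_and_trim": {"truncLen": [240, 160], "maxEE": [2, 5], "minLen": 50}
}
"""
config = json.loads(config_text)

# The whole dictionary is unpacked into keyword arguments in one call.
settings = filter_and_trim(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz", **config["filter_and_trim"]
)
print(settings)
```

The appeal of this design is that a new option only has to appear in the config file; the wrapper code does not change unless it needs to treat that option specially.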
Thank you for the answer! That makes sense, I imagine there's one good set of parameters per experimental setup and sequencing machine. Does the operator of the pipeline examine |
I work on MicrobiomeDB, which puts me in a similar business to yours of wrapping dada2, except our code is less nice (there are some tools in https://github.com/VEuPathDB/DJob/tree/microbiomedb17/DistribJobTasks/bin/dada2, and a wrapper that decides when to run what).
dada2 requires many parameters to be configured before a study can run, and the right values depend on the properties of the reads. I saw that you ask the user to provide them in the JSON config. Do you provide heuristics that try to guess them, or any support in determining what the right value might be?
We don't do this very well right now, and it's an issue that I would benefit from knowing how to address. We have a method for guessing truncLen based on read error profiles, but for the other parameters the pipeline operator has to guess the values, run the analysis, and see whether everything got filtered out.
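The guess, run, and check loop described above could be partly automated by looking at how many reads survive each step. The sketch below is only an illustration in Python: the step names, the 50% threshold, and the hint texts are assumptions made for this example, not part of dada2 or of either pipeline, and the hints are rough rules of thumb rather than a diagnosis.

```python
# Assumed order of pipeline steps for which read counts are tracked.
STEPS = ["input", "filtered", "denoised", "merged", "nonchim"]

# Illustrative hints only; which parameter is really at fault needs a human look.
HINTS = {
    "filtered": "many reads failed filtering; consider relaxing maxEE or revisiting truncLen",
    "denoised": "many reads lost in denoising; inspect the error model and read quality",
    "merged": "few pairs merged; truncLen may leave too little overlap between reads",
    "nonchim": "many reads flagged as chimeric; check that primers were removed",
}

def suggest_changes(counts, min_retention=0.5):
    """Return (step, retention, hint) for each step keeping < min_retention of its input."""
    suggestions = []
    for prev, step in zip(STEPS, STEPS[1:]):
        before, after = counts[prev], counts[step]
        if before == 0:
            # Nothing left to process, so later steps carry no information.
            suggestions.append((step, 0.0, "no reads entered this step"))
            break
        retention = after / before
        if retention < min_retention:
            suggestions.append((step, retention, HINTS[step]))
    return suggestions

# Made-up read counts for one sample.
counts = {"input": 10000, "filtered": 9000, "denoised": 8800, "merged": 2000, "nonchim": 1900}
for step, retention, hint in suggest_changes(counts):
    print(f"{step}: kept {retention:.0%} of reads. {hint}")
```

A report like this still requires one full run before any suggestion can be made, so it shortens the loop rather than removing it.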