Improving parallelism for medaka_consensus #350
We do not have generic automated scripting that implements what is noted in the README. It wouldn't be too difficult to add to the existing `medaka_consensus` script. The required additions would be to: […]. This could be more refined (there are lots of open questions around optimal batching), but it should be sufficient for a lot of cases. By the way, there is an outstanding issue that artificially makes medaka slow at processing lots of short contigs: the program reopens the input BAM file (and its index) for every region of data it tries to process. This is a fixed overhead and can be slow for large BAMs. I mention this because it might mean you observe less-than-linear time scaling with the number of compute processes. It's on my TODO list to fix this issue.
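The fixed per-region overhead described above can be illustrated with a toy cost model. The numbers below are invented for illustration; only the reopen-per-region behaviour comes from the comment:

```python
# Toy model of why reopening the BAM (and index) per region penalises
# assemblies with many short contigs. The costs are invented for
# illustration; the fixed-overhead behaviour is the point, not the numbers.

def total_time(n_regions, work_per_region, open_cost):
    """Total time when the BAM file is reopened for every region."""
    return n_regions * (open_cost + work_per_region)

# The same total amount of work (1000 units), split differently:
few_long = total_time(n_regions=10, work_per_region=100.0, open_cost=1.0)
many_short = total_time(n_regions=1000, work_per_region=1.0, open_cost=1.0)

print(few_long)    # 1010.0 -> the open cost is ~1% of the runtime
print(many_short)  # 2000.0 -> the open cost is 50% of the runtime
```

With many short contigs the fixed open cost dominates, which is consistent with the sub-linear scaling mentioned above.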
Thanks for the update! It would be great to have a solution that covers 80% of the cases, and I will look forward to the update on the region issue. I will make a script for now, but it would be great if your solution becomes available in the future.
I have a first draft working, but I was curious whether there is a trade-off between the number of contigs and the total size (besides the terminal command-length limit). I was thinking of making chunks of 100 contigs each, or up to a maximum size; not sure yet about the numbers. All of them will be command-line arguments.
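The batching described here (up to 100 contigs per chunk, or up to a maximum total size) could be done greedily. The function, its name, and the default thresholds below are my assumptions for illustration, not values used by medaka:

```python
def batch_contigs(contigs, max_contigs=100, max_bases=10_000_000):
    """Greedily group (name, length) pairs into batches, closing a batch
    when it would exceed max_contigs entries or max_bases total length.
    The thresholds are illustrative defaults, not medaka's."""
    batches, current, current_bases = [], [], 0
    for name, length in contigs:
        if current and (len(current) >= max_contigs
                        or current_bases + length > max_bases):
            batches.append(current)
            current, current_bases = [], 0
        current.append(name)
        current_bases += length
    if current:
        batches.append(current)
    return batches

# 250 contigs of 50 kb each: the contig-count cap closes each batch at 100.
contigs = [(f"ctg{i}", 50_000) for i in range(250)]
batches = batch_contigs(contigs)
print(len(batches), [len(b) for b in batches])  # 3 [100, 100, 50]
```

Each resulting batch can then be handed to one worker process.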
The core algorithm of medaka is applied to overlapping chunks of base-count matrices derived from read pileups (much like `samtools tview` would show). These chunks are stacked to form N x M x D matrices: N the number of chunks (the batch size), M the fixed chunk width, and D the feature depth. However, and this is where my earlier point about short contigs comes in: if an individual counts matrix is narrower than the fixed M used by the program, it is put to one side for independent processing. We don't try to pad and stack these small matrices for parallel processing, as we found doing so affected accuracy. Putting these things together: if your contigs are long (tens of times longer than M), the number of contigs given to each parallel incarnation of the program is not so important.

I'm going to look at making optimisations around the handling of short contig pieces (through reusing BAM file handles) to help mitigate the above. I'm intrigued: what assemblies do you have with 100s of contigs? Good human assemblies do not typically surpass 1000 contigs, many of those reflecting homologous repeat elements, which are tricky to obtain a consensus for.
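The chunking described above can be sketched as follows: a contig is cut into fixed-width, overlapping windows, and a contig shorter than the fixed width yields no windows and is set aside. The width, overlap, and function names here are hypothetical, not medaka's actual parameters:

```python
def window_contig(length, width=10_000, overlap=1_000):
    """Return (start, end) windows of fixed `width`, overlapping by
    `overlap`. A contig shorter than `width` yields no windows and would
    be handled separately, as described for narrow count matrices.
    Width and overlap are illustrative, not medaka's values."""
    if length < width:
        return []  # too short to fill one fixed-width chunk: set aside
    step = width - overlap
    windows, start = [], 0
    while start + width < length:
        windows.append((start, start + width))
        start += step
    windows.append((length - width, length))  # final window, flush to the end
    return windows

print(window_contig(25_000))  # [(0, 10000), (9000, 19000), (15000, 25000)]
print(window_contig(5_000))   # [] -> independent processing path
```

A long contig thus contributes many stackable fixed-width chunks, while short contigs contribute none, which is why contig length matters more than contig count for batching efficiency.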
Due to the command-length limitations, I can set a limit by contig size (default 10M?) or cap the number of contigs (100? then you reach around 2,000 characters); I have to check what the terminal limits are in general. I had an assembly with ±120,000 contigs, as this is a combination of host (non-human) and microbiome data, and we will do further deep sequencing of mixed environments in the future, where we also expect many contigs due to the vast number of species present.
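The terminal limit worried about here can be queried on POSIX systems rather than guessed. The command prefix below, and the idea of passing contig names as space-separated arguments, are illustrative assumptions:

```python
import os

def command_length(prefix, contig_names):
    """Characters in `prefix` plus a space-separated contig-name list,
    e.g. a hypothetical 'medaka consensus ... --regions ctg1 ctg2 ...'."""
    return len(prefix) + sum(len(n) + 1 for n in contig_names)

names = [f"contig_{i}" for i in range(100)]
needed = command_length("medaka consensus calls.bam out.hdf --regions", names)
print(needed)  # ~1k characters for 100 short contig names

# On POSIX, the kernel's limit on the combined size of argv and environ:
try:
    print(os.sysconf("SC_ARG_MAX"))  # commonly 2 MiB or more on Linux
except (ValueError, OSError):
    pass  # not available on this platform
```

In practice the per-batch contig cap keeps each command far below this limit, so the contig-count and total-size caps are the binding constraints.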
A first draft is available at https://git.wur.nl/unlock/scripts/-/blob/master/medaka/medaka.py; not many options from the individual tools are exposed, but the key ones are.
Is your feature request related to a problem? Please describe.
We are working on some large datasets (20,000 to 100,000+ contigs) and realised that one of the bottlenecks in our workflow is medaka, as it runs mostly on a single thread.
Describe the solution you'd like
A simple approach using the medaka_consensus method in which the files are split as described in https://github.com/nanoporetech/medaka#improving-parallelism
Describe alternatives you've considered
I am currently not aware of an existing solution. As my bash skills are not that great, I am planning to create a medaka_consensus.py script that performs the steps mentioned in https://github.com/nanoporetech/medaka#improving-parallelism with a sensible default parallel approach. Is there interest from your end to eventually pick this up and incorporate the script into medaka? Or does something like that already exist?
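The wrapper script proposed here could, as a minimal sketch, build one `medaka consensus` invocation per contig batch and a final `medaka stitch` call. The exact flags and model name are assumptions based on my reading of the README's "Improving parallelism" section and may differ between medaka versions; nothing is executed below, commands are only constructed:

```python
# Sketch of a parallel wrapper: one `medaka consensus` command per contig
# batch, then one `medaka stitch` to combine results. Flags and the model
# name are assumptions (they vary across medaka versions); the commands
# are built but deliberately not run here.

def build_commands(bam, batches, model="r941_min_high_g360"):
    consensus_cmds, hdfs = [], []
    for i, names in enumerate(batches):
        out = f"consensus_{i}.hdf"
        hdfs.append(out)
        consensus_cmds.append(
            ["medaka", "consensus", bam, out,
             "--model", model, "--regions", *names])
    stitch_cmd = ["medaka", "stitch", *hdfs, "polished.fasta"]
    return consensus_cmds, stitch_cmd

batches = [["ctg1", "ctg2"], ["ctg3"]]
cons, stitch = build_commands("calls_to_draft.bam", batches)
for c in cons:
    print(" ".join(c))
print(" ".join(stitch))
```

The commands in `cons` could be run concurrently (e.g. with `subprocess.run` inside a `multiprocessing.Pool`), with the stitch command run once after all consensus jobs finish.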