optimize mark_duplicates_and_sort #51

chrisamiller · 2022-06-15T15:40:27Z

tools/mark_duplicates_and_sort.wdl is a bottleneck, especially for WGS. It's expensive, partly because it is long-running and gets preempted. We should do some local testing on the cluster to explore options for optimizing it:

Right now the sort and markdup steps get 8 cores each. Is that the optimal ratio? If one is faster than the other, there will be wasted cycles.
If we increase the overall number of cores, how does that affect runtime? (Do we saturate I/O? Does that differ between HDD and SSD?)
Can we avoid localizing the input files to save an hour or so?
Would giving more RAM to the sort part of that step let it write fewer slow temp files to disk and speed things up?
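
For the core-ratio question, a small timing harness along these lines might be a starting point. This is only a sketch: `core_splits`, `time_command`, `benchmark`, and the `build_argv` callback are hypothetical names, and the command below is a do-nothing placeholder, since the actual sort/markdup invocation and its thread options would have to come from the WDL task itself.

```python
import subprocess
import sys
import time


def core_splits(total_cores, step=2):
    """Yield (sort_cores, markdup_cores) pairs that use the whole core budget."""
    for sort_cores in range(step, total_cores, step):
        yield sort_cores, total_cores - sort_cores


def time_command(argv):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def benchmark(build_argv, total_cores, step=2):
    """Time one run per core split.

    build_argv(sort_cores, markdup_cores) must return the argv list to run;
    how the core counts map onto the real tools' options is left to the caller.
    """
    results = {}
    for sort_cores, markdup_cores in core_splits(total_cores, step):
        argv = build_argv(sort_cores, markdup_cores)
        results[(sort_cores, markdup_cores)] = time_command(argv)
    return results


if __name__ == "__main__":
    # Placeholder command (assumption): replace with the real pipeline call.
    def placeholder(sort_cores, markdup_cores):
        return [sys.executable, "-c", "pass"]

    timings = benchmark(placeholder, total_cores=16, step=4)
    for split, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
        print(split, f"{seconds:.2f}s")
```

Running the same grid once on HDD-backed and once on SSD-backed scratch space would also give a first answer to the I/O saturation question.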


chrisamiller · 2022-06-15T16:05:25Z

Breakdown of costs/timing for WGS:

sample	seconds	cpuCost	memCost	diskCost	diskType	totalCost
HCC1395 normal	24870.788	0.771547112	0.42860658	0.221073671	local-disk 576 HDD	1.421227363
HCC1395 tumor	46089.762	1.429806839	1.143282152	0.750381156	local-disk 1055 HDD	3.323470147

Both ran on a custom-16-97280 instance (16 vCPUs, 97280 MB of memory).
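
To make the table easier to compare, here is a quick sketch that converts the runtimes to hours (roughly 6.9 and 12.8) and splits each total into CPU, memory, and disk shares. The numbers are copied from the table above; the `summarize` helper and its field names are made up for illustration, and costs are left in whatever unit the table uses.

```python
# (sample, seconds, cpu_cost, mem_cost, disk_cost, total_cost), copied from the table
rows = [
    ("HCC1395 normal", 24870.788, 0.771547112, 0.42860658, 0.221073671, 1.421227363),
    ("HCC1395 tumor", 46089.762, 1.429806839, 1.143282152, 0.750381156, 3.323470147),
]


def summarize(row):
    """Return runtime in hours, cost per hour, and each component's share of the total."""
    sample, seconds, cpu, mem, disk, total = row
    hours = seconds / 3600
    return {
        "sample": sample,
        "hours": hours,
        "cost_per_hour": total / hours,
        "cpu_share": cpu / total,
        "mem_share": mem / total,
        "disk_share": disk / total,
    }


summaries = [summarize(row) for row in rows]

for s in summaries:
    print(
        f"{s['sample']}: {s['hours']:.1f} h, {s['cost_per_hour']:.3f}/h, "
        f"cpu {s['cpu_share']:.0%}, mem {s['mem_share']:.0%}, disk {s['disk_share']:.0%}"
    )
```

CPU is the largest component in both runs, so changes that shorten wall-clock time on the same instance should cut cost roughly in proportion.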

malachig added the performance optimizations of run time adn/or cost label Jul 8, 2022
