jts edited this page Feb 23, 2012 · 1 revision

De novo assembly is a complex process, and many factors (such as read length, read depth, and genome complexity and size) influence the quality of the final assembly. This page describes some ways to improve the performance of SGA by changing its parameters.

The most important tuning parameters are:

  • the overlap size (-m to sga overlap/sga assemble)
  • the k-mer size used in error correction
  • the k-mer threshold in error correction (sga correct -x)

The last parameter, the k-mer threshold, is automatically inferred from the data if the --learn flag is given to sga correct. If your data has uneven coverage (like a transcriptome) or your coverage is very low (<20X) then the inference may lead to a poor parameter choice. In this case, you may wish to manually choose the parameter instead of using --learn.
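One common way to pick a threshold by hand is to look at a k-mer count histogram and find the valley between the low-count peak (mostly sequencing errors) and the main coverage peak. The sketch below illustrates that heuristic only; it is not how `--learn` works internally. The function name and the histogram layout (`hist[c]` is the number of distinct k-mers seen exactly `c` times) are assumptions made for illustration, and it assumes the threshold means "k-mers seen fewer than this many times are treated as likely errors".

```python
def pick_kmer_threshold(hist):
    """Return the k-mer count at the first valley of the histogram.

    hist[c] is the number of distinct k-mers observed exactly c times;
    hist[0] is unused. Returns None when the histogram never stops
    falling, which suggests coverage is too low to separate error
    k-mers from genuine ones.
    """
    for count in range(1, len(hist) - 1):
        # The valley is the first count whose successor is not smaller.
        if hist[count + 1] >= hist[count]:
            return count
    return None


# Hypothetical histogram: error peak at count 1, coverage peak near count 8.
example = [0, 1000, 300, 80, 40, 60, 150, 400, 700, 650, 300]
print(pick_kmer_threshold(example))
```

With uneven coverage (a transcriptome, for example) there may be no single clean valley, so treat the result as a starting point and compare a few nearby values passed to `sga correct -x`.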

Other parameters that may be worth tuning, though they are less important, are:

  • the aggressiveness of the bubble popping algorithm (-d and -g to sga assemble). It may be important to increase these values when assembling very heterozygous genomes.
  • the length of "tip" branches to trim (-l to sga assemble). The default value (150bp) is optimized for 100bp reads. If you use significantly longer reads (like 150bp MiSeq reads) then you should increase this value.
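As a rough illustration of the tip-length advice, the sketch below builds an `sga assemble` command line with `-l` scaled to the read length. The 1.5x factor is an assumption taken from the default (150bp for 100bp reads); this page does not say the relationship is linear. The function name, file name, and output prefix are hypothetical, and `-d` and `-g` are left at their defaults.

```python
def assemble_command(asqg_path, read_length, min_overlap, out_prefix="assembly"):
    """Build an argument list for `sga assemble` with a scaled tip length."""
    # Assumption: keep the default ratio of 150bp tips per 100bp of read.
    tip_length = read_length * 3 // 2
    return [
        "sga", "assemble",
        "-m", str(min_overlap),   # overlap size
        "-l", str(tip_length),    # length of tip branches to trim
        "-o", out_prefix,         # output prefix
        asqg_path,
    ]


# Hypothetical inputs: 150bp reads and an overlap size of 75.
print(" ".join(assemble_command("reads.asqg.gz", 150, 75)))
```

The list can be passed to `subprocess.run` once the values look sensible; building it separately makes it easy to compare several overlap sizes or tip lengths side by side.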