Scaffolding multiple libraries

jts edited this page May 15, 2013 · 8 revisions

The SGA scaffolder can build scaffolds from multiple libraries of different insert size. To start, you need to align the reads for each library to the input contigs/scaffolds. I recommend using BWA.

For example, let's say you have the following BAM files:

lib.fragment.400bp.bam
lib.matepair.3kb.bam
lib.matepair.8kb.bam

First, create a .de file for each library using sga-bam2de.pl (located in the bin/ directory):

sga-bam2de.pl --prefix lib.fragment.400bp -n 5 -m 200 lib.fragment.400bp.bam
sga-bam2de.pl --prefix lib.matepair.3kb -n 5 -m 200 lib.matepair.3kb.bam
sga-bam2de.pl --prefix lib.matepair.8kb -n 5 -m 200 lib.matepair.8kb.bam

The -n 5 parameter means at least 5 read pairs are required to create a contig-contig link. -m 200 means that only contigs at least 200bp in length will be scaffolded.
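
With more than a few libraries, the per-library calls can be generated in a loop. The following is a minimal sketch, assuming each BAM file is named <prefix>.bam and that the same -n and -m values suit every library. The make_de_files function and the DRY_RUN variable are hypothetical helpers for illustration, not part of SGA. By default the commands are only printed.

```shell
# Print (default) or run sga-bam2de.pl for each BAM file given as an argument.
# ASSUMPTION: the .de prefix is the BAM file name without its .bam suffix.
# Set DRY_RUN=0 to execute the commands instead of printing them.
make_de_files() {
    for bam in "$@"; do
        prefix=${bam%.bam}
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sga-bam2de.pl --prefix $prefix -n 5 -m 200 $bam"
        else
            sga-bam2de.pl --prefix "$prefix" -n 5 -m 200 "$bam"
        fi
    done
}

make_de_files lib.fragment.400bp.bam lib.matepair.3kb.bam lib.matepair.8kb.bam
```

Printing first makes it easy to confirm that the prefixes match the .de file names passed to sga scaffold later on.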

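Next, make an astat file from the highest-coverage library. Typically this will be the short-insert fragment library. Note that the samtools sort call uses the legacy syntax (input BAM followed by an output prefix); samtools 1.3 and later require samtools sort -o lib.fragment.400bp.refsort.bam lib.fragment.400bp.bam instead.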

samtools sort lib.fragment.400bp.bam lib.fragment.400bp.refsort
sga-astat.py -m 200 lib.fragment.400bp.refsort.bam > contigs.astat

Finally, you can perform the scaffolding:

sga scaffold -m 200 -a contigs.astat --pe lib.fragment.400bp.de --mate lib.matepair.3kb.de --mate lib.matepair.8kb.de -o multiple.libs.scaf contigs.fa
sga scaffold2fasta --write-unplaced -m 200 -o scaffolds.fa --use-overlap -a final-graph.asqg.gz multiple.libs.scaf

If you are scaffolding contigs that were not produced by sga assemble, you can replace -a final-graph.asqg.gz with -f contigs.fa in the final sga scaffold2fasta step.
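
Because sga scaffold takes several generated files at once, it can help to confirm they all exist and are non-empty before starting. The following is a minimal sketch of such a check. check_inputs is a hypothetical helper, not an SGA command, and a non-empty file is no guarantee that its contents are valid.

```shell
# Report files that are missing or empty; return non-zero if any are found.
# Hypothetical helper for illustration only.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}

# Example use (file names taken from the steps above):
#   check_inputs contigs.fa contigs.astat lib.fragment.400bp.de \
#       lib.matepair.3kb.de lib.matepair.8kb.de || exit 1
```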

Warnings

Long-insert mate-pair libraries are often contaminated by short-insert fragments. These contaminating pairs typically have the opposite orientation (FR) to the expected mate pairs (RF). If FR pairs outnumber RF pairs, the distance estimation step can fail, and the scaffolds will probably not improve when the mate-pair library is added. For each library a .hist file is produced, containing a simple histogram of the insert sizes found in the library. Use it to QC the library during the scaffolding process.
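
One way to inspect a .hist file is to summarise it with awk. The following is a rough sketch that assumes each line holds an insert size and a count separated by whitespace; that layout is an assumption, so check the actual file first. hist_summary and its cutoff argument are hypothetical. A large fraction of pairs below the cutoff in a mate-pair library would suggest short-insert contamination.

```shell
# Summarise an insert-size histogram.
# ASSUMPTION: each line is "<insert_size> <count>", whitespace separated.
# $1 = histogram file, $2 = size cutoff (bp) separating short from long inserts.
hist_summary() {
    awk -v cutoff="$2" '
        NF >= 2 {
            total += $2                                 # all pairs
            if ($1 < cutoff) short += $2                # pairs below the cutoff
            if ($2 > best) { best = $2; mode = $1 }     # most frequent size
        }
        END {
            if (total == 0) { print "no pairs found"; exit 1 }
            printf "pairs=%d mode=%d short_fraction=%.3f\n", total, mode, short / total
        }
    ' "$1"
}

# Example use, with a 1000bp cutoff for the 3kb mate-pair library:
#   hist_summary lib.matepair.3kb.hist 1000
```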