Indexing large data sets

SGA version 0.9.31 and later include a very efficient algorithm for indexing large numbers of short reads. The code was written by Heng Li (https://github.com/lh3/ropebwt). To use it, pass the -a ropebwt option to sga index. This algorithm can index 1.5 billion 100 bp reads in under 64 GB of memory, so you should be able to index all of your data in a single process:

sga preprocess *.fastq > all.fastq
sga index -a ropebwt [--no-reverse] -t 4 all.fastq
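
Before starting a long indexing run, it can help to check how your data set compares with the 1.5 billion read figure quoted above. The following is a minimal sketch, not part of SGA: count_fastq_reads is a hypothetical helper, and it assumes the FASTQ file stores every record on exactly four lines.

```shell
# Hypothetical helper (not part of SGA): count the reads in a FASTQ file.
# Assumption: every record spans exactly four lines
# (header, sequence, '+' separator, qualities).
count_fastq_reads() {
    lines=$(wc -l < "$1")   # total number of lines in the file
    echo $((lines / 4))     # four lines per read
}
```

Running count_fastq_reads all.fastq prints the number of reads. Memory use also depends on read length, so treat the comparison as a rough guide rather than a guarantee.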
