Distributed construction of an FM index from multiple input files

jts edited this page May 26, 2011 · 1 revision

If your data set consists of multiple files, you can construct the FM-index for each file separately, then merge the indices to obtain an index of the entire data set. This requires much less memory than constructing an index from a single file containing all of the data. For example, suppose your data consists of four files:

s_1_1.fastq 
s_1_2.fastq
s_2_1.fastq 
s_2_2.fastq

We begin by constructing an index of each file individually:

sga index s_1_1.fastq
sga index s_1_2.fastq
sga index s_2_1.fastq
sga index s_2_2.fastq

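We then merge the indices in pairs until a single index remains. The -p option sets the output prefix, so the first two merges produce merged1.fa and merged2.fa, which the third merge takes as input: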

sga merge -p merged1 s_1_1.fastq s_1_2.fastq
sga merge -p merged2 s_2_1.fastq s_2_2.fastq
sga merge -p final merged1.fa merged2.fa
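
With more than four files, writing out the merge commands by hand becomes tedious. The sketch below is a hypothetical helper, not part of sga, that prints the index and pairwise merge commands for any number of input files without running them. It assumes that file names contain no whitespace and that `sga merge -p PREFIX A B` writes PREFIX.fa, as the merged1.fa and merged2.fa names above suggest. The function name `plan_index_and_merge` and the `merged_rN_M` intermediate prefixes are invented for illustration.

```shell
#!/bin/sh
# Dry run: print the sga commands for indexing every input file and merging
# the indices pairwise until one index remains. Nothing is executed.
# Assumptions (not taken from the sga documentation): file names contain no
# whitespace, and `sga merge -p PREFIX A B` writes PREFIX.fa.
#
# Usage: plan_index_and_merge FINAL_PREFIX FILE...
plan_index_and_merge() (
    final=$1
    shift

    # One index command per input file.
    for f in "$@"; do
        echo "sga index $f"
    done

    round=1
    while [ "$#" -gt 1 ]; do
        total=$#       # number of files at the start of this round
        remaining=$#   # files from this round not yet consumed
        pair=0
        while [ "$remaining" -gt 0 ]; do
            if [ "$remaining" -ge 2 ]; then
                pair=$((pair + 1))
                # The last merge (only two files left) gets the final prefix.
                if [ "$total" -eq 2 ]; then
                    prefix=$final
                else
                    prefix="merged_r${round}_${pair}"
                fi
                echo "sga merge -p $prefix $1 $2"
                # Drop the two inputs and queue the merged file for the next round.
                shift 2
                set -- "$@" "$prefix.fa"
                remaining=$((remaining - 2))
            else
                # Odd file out: carry it into the next round unchanged.
                carry=$1
                shift
                set -- "$@" "$carry"
                remaining=$((remaining - 1))
            fi
        done
        round=$((round + 1))
    done
)

plan_index_and_merge final s_1_1.fastq s_1_2.fastq s_2_1.fastq s_2_2.fastq
```

For the four example files this prints the four index commands followed by three merge commands, the last of which writes to the `final` prefix. Once the plan has been reviewed, it can be piped to `sh` to run it. Merges within the same round do not depend on each other, so they could also be dispatched to separate machines.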

The final index can then be used in other steps of the pipeline, for instance to error-correct the original sequence files:

sga correct -p final s_1_1.fastq
sga correct -p final s_1_2.fastq
sga correct -p final s_2_1.fastq
sga correct -p final s_2_2.fastq