Distributed construction of an FM index from multiple input files
jts edited this page May 26, 2011 · 1 revision
If your data set consists of multiple files, you can construct the FM-index for each file separately and then merge the indices to obtain an index of the entire data set. This requires much less memory than constructing the index from a single file containing all of the data. For example, suppose your data consists of four files:
s_1_1.fastq
s_1_2.fastq
s_2_1.fastq
s_2_2.fastq
We begin by constructing an index of each file individually:
sga index s_1_1.fastq
sga index s_1_2.fastq
sga index s_2_1.fastq
sga index s_2_2.fastq
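Because each file is indexed on its own, the four commands above do not depend on one another. The following is a minimal bash sketch, not part of SGA itself, that launches one `sga index` job per file in the background and waits for all of them. The function name `index_all` and the `SGA` variable (used to override the executable name) are hypothetical. Running the jobs concurrently on one machine multiplies peak memory use, so this only makes sense when enough memory is available or when the jobs are spread across machines.

```shell
#!/usr/bin/env bash
# Sketch: index several read files concurrently.
# SGA is a hypothetical override for the executable name (defaults to "sga").

index_all() {
    local f pid status=0
    local -a pids=()
    # Start one indexing job per input file in the background.
    for f in "$@"; do
        "${SGA:-sga}" index "$f" &
        pids+=("$!")
    done
    # Wait for every job; remember whether any of them failed.
    for pid in "${pids[@]}"; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

With the example data this would be called as `index_all s_1_1.fastq s_1_2.fastq s_2_1.fastq s_2_2.fastq`. The function returns a non-zero status if any of the jobs fails.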
We then merge the indices together in pairs until we obtain a single index:
sga merge -p merged1 s_1_1.fastq s_1_2.fastq
sga merge -p merged2 s_2_1.fastq s_2_2.fastq
sga merge -p final merged1.fa merged2.fa
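The same pairwise scheme extends to any number of input files. The bash sketch below only prints the commands it would run (a dry run), so the plan can be inspected before anything is executed. It assumes, as in the example above, that `sga merge -p NAME a b` produces a sequence file called `NAME.fa`. The function name `plan_merges` and the `r<round>_m<pair>` naming scheme are hypothetical choices made for illustration.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the sga commands for a pairwise merge of any number of inputs.
# Assumption (taken from the example above): `sga merge -p NAME a b` yields NAME.fa.
# The r<round>_m<pair> prefixes are a hypothetical naming scheme.

plan_merges() {
    local -a current=("$@") next=()
    local f name round=1 i n
    # Index every input file individually.
    for f in "${current[@]}"; do
        echo "sga index $f"
    done
    # Merge in pairs until a single index remains.
    while [ "${#current[@]}" -gt 1 ]; do
        next=()
        n=${#current[@]}
        i=0
        while [ "$i" -lt "$n" ]; do
            if [ $((i + 1)) -lt "$n" ]; then
                name="r${round}_m$((i / 2 + 1))"
                echo "sga merge -p $name ${current[i]} ${current[i+1]}"
                next+=("$name.fa")
            else
                # Odd file out: carry it into the next round unchanged.
                next+=("${current[i]}")
            fi
            i=$((i + 2))
        done
        current=("${next[@]}")
        round=$((round + 1))
    done
}
```

For the four example files, `plan_merges s_1_1.fastq s_1_2.fastq s_2_1.fastq s_2_2.fastq` prints four `sga index` lines followed by three `sga merge` lines. The last merge uses the prefix `r2_m1`, which plays the role of `final` above.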
The final index can then be used in other steps of the pipeline, for instance, to error-correct the original sequence files:
sga correct -p final s_1_1.fastq
sga correct -p final s_1_2.fastq
sga correct -p final s_2_1.fastq
sga correct -p final s_2_2.fastq