SPLITTING_AND_MERGING

Splitting and Merging the Mapping
---------------------------------

Mapping a set of donor reads to a reference genome can be parallelized in two
ways: splitting the genome into several chunks and splitting the reads into
several chunks. The former of the two (splitting the genome) might be required
in order to fit the jobs into the available RAM. The latter (splitting the
reads) is optional, but it provides a way to fully utilize a computing
cluster. This how-to briefly describes how to split and merge the mapping.


Initial setup
-------------

db.fa - contains all contigs of the reference genome
qr.fa - the reads, say in color space

Suppose we have a cluster of 4 nodes of 16GB each.


Splitting the Genome
--------------------

On each 16GB node, we decide to give the gmapper process 15GB of RAM, keeping
the rest for the operating system. To split the genome into several pieces that
fit in 15GB of RAM, we run:

$ split-db db.fa --ram-size 15

This creates several files of the form db-15gb-12,12,12,12seeds-XofY.fa, where X
ranges from 1..Y. For simplicity, we assume Y=2, and we refer to the resulting
files as db-1of2.fa and db-2of2.fa.


Projecting the Genome
---------------------

This step is  optional but recommended.  This provides significant savings if we
ever run gmapper against the individual pieces db-[12]of2.fa more than once. For
instance, this will be the case if we decide to split the reads. If the reads we
have are in color space, we project the genome pieces with:

$ project-db db-*of2.fa --shrimp-mode cs

This creates files of the form db-[12]of2-cs.genome and
db-[12]of2-cs.seed.[0-3]. We refer to them by their prefix, as db-1of2-cs and
db-2of2-cs.

Note: If the reads are in letter space, use "ls".

Note: Splitting and projecting the genome can be accomplished at the same time
by running:

$ split-project-db db.fa --ram-size 15 --shrimp-mode cs


Splitting the Reads
-------------------

This is optional, but it provides for a way to fully utilize a computing
cluster. If we have a cluster of N nodes and the genome was split into Y pieces,
we would want to split the reads into N/Y pieces. In our example, say N=4 and
Y=2, so we split the reads into 2 pieces, qr-1of2.fa and qr-2of2.fa.

To split reads, use the splitreads.py script.


Run the Mapping Jobs
--------------------

We run the following 4 jobs on different nodes of the cluster:

$ gmapper-cs qr-${A}of2.fa db-${B}of2.fa [options...] >map-qr-${A}of2-db-${B}of2.sam

If we projected the genome, the command lines are:

$ gmapper-cs qr-${A}of2.fa -L db-${B}of2-cs.fa [options...] >map-qr-${A}of2-db-${B}of2.sam


Merging Hits
------------

At this point, we have 4 output files. We merge them in two steps.

First, we merge hits of the *same* reads across *different* pieces of the genome:

$ mergesam qr-1of2.fa map-qr1of2-db*of2.sam >map-qr1of2.sam
$ mergesam qr-2of2.fa map-qr2of2-db*of2.sam >map-qr2of2.sam

Finally, merge hits of *different* reads across the *same* genome (now, all of
it):

$ mergesam <(cat qr-1of2.fa qr-2of2.fa) map-qr*of2.sam >map.sam

Alternatively, we can combine all merging in a single step as follows:

$ mergesam <(cat qr-1of2.fa qr-2of2.fa) map-qr*of2-db*of2.sam >map.sam


Merging and Mapping Qualities
-----------------------------

As of v2.2.0, gmapper supports mapping qualities. These statistics need to be
updated in nontrivial ways during merging. Particularly, merging mappings of the
same read to different reference chunks must be accompanied by mapping quality
recalculation. For this reason, gmapper by default includes in the SAM output
several special fields which are used to perform MQ recalculation during
merging. To inhibit the inclusion of these fields in the output, the final
merging step for every read should include the option --all-contigs (to either
gmapper itself or mergesam). However, the only negative effect of those extra
fields is wasted space: their presence or absence does not affect the actual
mappings in any other way.

Assume we have 2 read chunks and 2 reference chunks as in the example before,
and we want the single best mapping for every read (pair). By default, gmapper
includes the extra SAM fields in the 4 original runs. Moreover, since we are
only interested in the single best mapping, we can additionally give
--single-best-mapping to each of the 4 gmapper runs. We can merge the results as
follows:

(i) All together in one step:

$ mergesam --single-best-mapping --all-contigs <(cat qr-1of2.fa qr-2of2.fa) map-qr*of2-db*of2.sam >map.sam

Depending on the size of the mapping and the number of files involved, one might
prefer to split this work into several parts.

(ii) Across reference chunks, then across reads:

$ mergesam --single-best-mapping --all-contigs qr-1of2.fa map-qr1of2-db*of2.sam >map-qr1of2.sam
$ mergesam --single-best-mapping --all-contigs qr-2of2.fa map-qr2of2-db*of2.sam >map-qr2of2.sam
$ mergesam --no-mapping-qualities --leave-mapq <(cat qr-1of2.fa qr-2of2.fa) map-qr*of2.sam >map.sam

Above, in each of the first two runs all mappings for the reads in those runs
are inspected, hence it is correct to include --all-contigs at that early
point. In the last merge step, the existing mapping qualities are left alone.

(iii) Across reads, then across reference chunks:

$ mergesam <(cat qr-1of2.fa qr-2of2.fa) map-qr*of2-db1of2.sam >map-db1of2.sam
$ mergesam <(cat qr-1of2.fa qr-2of2.fa) map-qr*of2-db2of2.sam >map-db2of2.sam
$ mergesam --single-best-mapping --all-contigs <(cat qr-1of2.fa qr-2of2.fa) map-db*of2.sam >map.sam

Above, one could include --single-best-mapping in the first 2 merge jobs, but
that would have no effect. However, it would be an error to give --all-contigs
to either of those: in that case, the third merge job would need to recompute
mapping qualities of a read across different reference chunks, but the extra SAM
fields that are needed would be missing.