indexing large genome bam file #334

aoladzad-soundag · 2021-10-15T20:16:32Z

I'm working on a very large genome (4.5 G), when we run medaka, we gets the following error which seems to be related to bam indexing process because of the large genome, I know it would work with samtools -c which generates bam index csi and tried to modified the Wrappers.py and add -c option to the samtools command, yet getting same error, so I think I need to know where should I make this modification?

[M::worker_pipeline::14674.2757.95] mapped 126569 sequences
[M::worker_pipeline::15235.0637.95] mapped 127219 sequences
[M::worker_pipeline::15788.7047.95] mapped 127124 sequences
[M::worker_pipeline::16348.9997.95] mapped 129707 sequences
[M::worker_pipeline::16904.9507.96] mapped 128537 sequences
[M::worker_pipeline::17465.0347.96] mapped 127442 sequences
[M::worker_pipeline::18035.1987.96] mapped 127657 sequences
[M::worker_pipeline::18569.9777.96] mapped 130177 sequences
[M::worker_pipeline::18903.816*7.96] mapped 77228 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: minimap2 -x map-ont --MD -t 8 -a -A 2 -B 4 -O 4,24 -E 2,1 /home/dnalinux/efs/g34xl/Pisum_sativum_v1a.fa.mmi /home/dnalinux/efs/medaka/fastq/allfastq.gz
[M::main] Real time: 18904.149 sec; CPU: 150472.292 sec; Peak RSS: 14.238 GB
[bam_sort_core] merging from 40 files and 8 in-memory blocks...
[E::hts_idx_check_range] Region 536869540..536872741 cannot be stored in a bai index. Try using a csi index
[E::sam_index] Read '81baeba4-d17c-466a-ae19-73eeef0a548a' with ref_name='chr5LG3', ref_length=579269071, flags=0, pos=536869541 cannot be indexed
samtools index: failed to create index for "calls_to_draft.bam": Numerical result out of range
Failed to index alignment file.
Failed to run alignment of reads to draft.

Thanks

colindaven · 2024-01-03T16:09:04Z

We have a similar problem when working with long chromosomes, for example with wheat

The offending line is

medaka/medaka/wrappers.py

Line 79 in 9dffdee

index_cmd = ['samtools', 'index', out_bam, '-@', threads]

As the previous poster says, this problem could be alleviated using a csi not the default bai index.

That is, instead of using

index_cmd = ['samtools', 'index', out_bam, '-@', threads]

this should be sufficient (added -c)

index_cmd = ['samtools', 'index', '-c', out_bam, '-@', threads]

The csi index should otherwise be completely compatible with the bai, except for it's name of course.

cjw85 · 2024-01-03T16:38:14Z

The wrappers.py is no longer used (in fact I don't know if it was ever used). In is the script mini_align where some change may be required in order to produce a .csi index. Even with that change there would need to be other changes within medaka to allow it to read from the resultant .csi as if I recall it assumes the existence of a .bai index file.

cjw85 · 2024-01-03T16:51:10Z

I quick look at the code suggests I might be wrong on that account and the core code qill happily accept a .csi.

colindaven · 2024-01-04T09:43:54Z

Ok, thanks. I submitted a PR.

For others, if ONT can't or won't accept the PR due to other breakages being caused, you can modify the mini_align script from the pomoxis repo yourself.

The following is for a Dockerfile. Strip out the RUN if you want to modify your conda/mamba installed files directly.
You'll need to find and replace the path to your mini_align file on your system (use which mini_align when your medaka conda env is activated, and modify the example below)

# Modify ONT pomoxis mini_align script to use a csi not bai index
#samtools index -c -@ "${THREADS}" "${PREFIX}.bam" "${PREFIX}.bam.csi" \
RUN sed -i 's/samtools index/samtools index -c /g' /opt/micromamba/bin/mini_align
RUN sed -i 's/bai/csi/g' /opt/micromamba/bin/mini_align
RUN chmod a+x /opt/micromamba/bin/mini_align

aoladzad-soundag added the enhancement label Oct 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

indexing large genome bam file #334

indexing large genome bam file #334

aoladzad-soundag commented Oct 15, 2021

colindaven commented Jan 3, 2024 •

edited

Loading

cjw85 commented Jan 3, 2024

cjw85 commented Jan 3, 2024

colindaven commented Jan 4, 2024

indexing large genome bam file #334

indexing large genome bam file #334

Comments

aoladzad-soundag commented Oct 15, 2021

colindaven commented Jan 3, 2024 • edited Loading

cjw85 commented Jan 3, 2024

cjw85 commented Jan 3, 2024

colindaven commented Jan 4, 2024

colindaven commented Jan 3, 2024 •

edited

Loading