Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

indexing large genome bam file #334

Open
aoladzad-soundag opened this issue Oct 15, 2021 · 4 comments
Open

indexing large genome bam file #334

aoladzad-soundag opened this issue Oct 15, 2021 · 4 comments

Comments

@aoladzad-soundag
Copy link

I'm working on a very large genome (4.5 G), when we run medaka, we gets the following error which seems to be related to bam indexing process because of the large genome, I know it would work with samtools -c which generates bam index csi and tried to modified the Wrappers.py and add -c option to the samtools command, yet getting same error, so I think I need to know where should I make this modification?

[M::worker_pipeline::14674.2757.95] mapped 126569 sequences
[M::worker_pipeline::15235.063
7.95] mapped 127219 sequences
[M::worker_pipeline::15788.7047.95] mapped 127124 sequences
[M::worker_pipeline::16348.999
7.95] mapped 129707 sequences
[M::worker_pipeline::16904.9507.96] mapped 128537 sequences
[M::worker_pipeline::17465.034
7.96] mapped 127442 sequences
[M::worker_pipeline::18035.1987.96] mapped 127657 sequences
[M::worker_pipeline::18569.977
7.96] mapped 130177 sequences
[M::worker_pipeline::18903.816*7.96] mapped 77228 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: minimap2 -x map-ont --MD -t 8 -a -A 2 -B 4 -O 4,24 -E 2,1 /home/dnalinux/efs/g34xl/Pisum_sativum_v1a.fa.mmi /home/dnalinux/efs/medaka/fastq/allfastq.gz
[M::main] Real time: 18904.149 sec; CPU: 150472.292 sec; Peak RSS: 14.238 GB
[bam_sort_core] merging from 40 files and 8 in-memory blocks...
[E::hts_idx_check_range] Region 536869540..536872741 cannot be stored in a bai index. Try using a csi index
[E::sam_index] Read '81baeba4-d17c-466a-ae19-73eeef0a548a' with ref_name='chr5LG3', ref_length=579269071, flags=0, pos=536869541 cannot be indexed
samtools index: failed to create index for "calls_to_draft.bam": Numerical result out of range
Failed to index alignment file.
Failed to run alignment of reads to draft.

Thanks

@colindaven
Copy link

colindaven commented Jan 3, 2024

We have a similar problem when working with long chromosomes, for example with wheat

The offending line is

index_cmd = ['samtools', 'index', out_bam, '-@', threads]

As the previous poster says, this problem could be alleviated using a csi not the default bai index.

That is, instead of using

index_cmd = ['samtools', 'index', out_bam, '-@', threads]

this should be sufficient (added -c)

index_cmd = ['samtools', 'index', '-c', out_bam, '-@', threads]

The csi index should otherwise be completely compatible with the bai, except for it's name of course.

@cjw85
Copy link
Member

cjw85 commented Jan 3, 2024

The wrappers.py is no longer used (in fact I don't know if it was ever used). In is the script mini_align where some change may be required in order to produce a .csi index. Even with that change there would need to be other changes within medaka to allow it to read from the resultant .csi as if I recall it assumes the existence of a .bai index file.

@cjw85
Copy link
Member

cjw85 commented Jan 3, 2024

I quick look at the code suggests I might be wrong on that account and the core code qill happily accept a .csi.

@colindaven
Copy link

Ok, thanks. I submitted a PR.

For others, if ONT can't or won't accept the PR due to other breakages being caused, you can modify the mini_align script from the pomoxis repo yourself.

  • The following is for a Dockerfile. Strip out the RUN if you want to modify your conda/mamba installed files directly.
  • You'll need to find and replace the path to your mini_align file on your system (use which mini_align when your medaka conda env is activated, and modify the example below)
# Modify ONT pomoxis mini_align script to use a csi not bai index
#samtools index -c -@ "${THREADS}" "${PREFIX}.bam" "${PREFIX}.bam.csi" \
RUN sed -i 's/samtools index/samtools index -c /g' /opt/micromamba/bin/mini_align
RUN sed -i 's/bai/csi/g' /opt/micromamba/bin/mini_align
RUN chmod a+x /opt/micromamba/bin/mini_align

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants