Alignment and Relative Abundance Question #125

kevinmyers · 2022-07-18T16:02:12Z

I want to determine the relative DNA abundance of a set of 217 genomes across 115 experiments (sequenced with Illumina or PacBio). I first ran minimap2 on each of the sequencing files separately, with settings chosen for the technology (Illumina or PacBio). Example commands:

minimap2 -ax map-hifi ../fasta_files/217_genomes.fasta PacBio.fastq > PacBio.sam
minimap2 -ax sr ../fasta_files/217_genomes.fasta Illumina.fastq > Illumina.sam

After converting to BAM and sorting, I ran CoverM on each BAM file using all the 217 FASTA files in the genome list. I used the following command:

for i in *bam; do coverm genome --bam-files ./$i --genome-fasta-directory ../fasta_files/individual_fasta_files/ -x fasta > ${i/.bam/coverm.txt}; done

I compiled the data and wanted to make sure I was interpreting it correctly. I have a few questions I'm hoping you can help me with.
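
The compilation step can be sketched in Python roughly as follows. This is only an illustration: it assumes each per-sample CoverM output is a tab-separated table with a header row, the genome name in the first column (including an `unmapped` row), and the relative abundance in percent in the second column. The function name `merge_coverm_tables`, the sample names, and the values are hypothetical.

```python
import csv
import io


def merge_coverm_tables(tables):
    """Merge per-sample tables into {genome: {sample: percent}}.

    `tables` maps a sample name to the text of one CoverM output file.
    Assumed layout: tab-separated, one header row, genome name in
    column 1, relative abundance (%) in column 2.
    """
    merged = {}
    for sample, text in tables.items():
        reader = csv.reader(io.StringIO(text), delimiter="\t")
        next(reader, None)  # skip the header row (if any)
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed lines
            genome, value = row[0], float(row[1])
            merged.setdefault(genome, {})[sample] = value
    return merged


# Hypothetical two-sample input; real text would be read from the *coverm.txt files.
example = {
    "PacBio": "Genome\tPacBio Relative Abundance (%)\nunmapped\t88.57\nMAG_001\t0.02\n",
    "Illumina": "Genome\tIllumina Relative Abundance (%)\nunmapped\t90.10\nMAG_001\t0.05\n",
}
merged = merge_coverm_tables(example)
for genome, per_sample in merged.items():
    print(genome, per_sample)
```

Keeping the `unmapped` row in the merged table makes it easy to compare it with the aligner's own mapping rate for each sample.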

1. I compared the percentage of aligned reads from minimap2 with the percentage of unmapped reads reported by CoverM. The two were close to summing to 100% but did not do so exactly: some totals were over by a little (100.25%) and others by more (106.5%). Why don't they add up to 100%? I assume it has to do with how each program reports its numbers, but is there a specific reason?
2. My understanding is that CoverM, as I ran it, reports the relative DNA abundance of each MAG within each experiment. Is this correct? If so, a value of 0.02 would mean that the MAG had 0.02% relative DNA abundance in that experiment, and an unmapped value of 88.57 would mean that 88.57% of the DNA could not be mapped to any of the MAGs used. Does this calculation already account for the different numbers of reads per experiment (especially Illumina vs. PacBio)? I want to be sure I understand the results and do not need any additional correction for the number of reads or anything like that.

Thanks in advance for your help (and thanks for all your previous help)! I really appreciate it!

Answered by wwood

wwood · 2022-07-19T03:44:01Z

Maintainer

Hi,

1. I suspect this is because reads mapped to genomes that ultimately have <10% covered_fraction are discarded. Does that make sense?
2. Yes, that is right; no additional calculations are required. However, CoverM assumes that the fraction of reads that are unmapped equals the fraction of the community not in the reference set. If, for instance, the recovered genomes are smaller than the ones that were not recovered (or if there is eukaryotic host contamination), then it will be off.
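
Both points can be illustrated with a little arithmetic. The Python sketch below is not CoverM's code. It assumes that relative abundance is each passing genome's share of mean coverage scaled by the fraction of reads mapped to passing genomes, and that reads on genomes below the covered-fraction cutoff are counted as unmapped. All names and numbers are hypothetical.

```python
def coverm_style_summary(genomes, total_reads, min_covered_fraction=0.10):
    """genomes: {name: (reads_mapped, mean_coverage, covered_fraction)}.

    Returns (relative abundance % per passing genome, unmapped %).
    Assumption: genomes below the cutoff are dropped and their reads
    are treated as unmapped.
    """
    kept = {n: g for n, g in genomes.items() if g[2] >= min_covered_fraction}
    kept_reads = sum(g[0] for g in kept.values())
    kept_cov = sum(g[1] for g in kept.values())
    unmapped_pct = 100 * (total_reads - kept_reads) / total_reads
    abundance = {}
    if kept_cov > 0:
        for name, g in kept.items():
            abundance[name] = 100 * (kept_reads / total_reads) * (g[1] / kept_cov)
    return abundance, unmapped_pct


def expected_read_share(cell_counts, genome_sizes):
    """Share of reads (%) per organism if reads scale with cells x genome size."""
    dna = {n: cell_counts[n] * genome_sizes[n] for n in cell_counts}
    total = sum(dna.values())
    return {n: 100 * d / total for n, d in dna.items()}


# Point 1: MAG_B has 50 reads but only 4% of its length covered, so it is dropped.
genomes = {"MAG_A": (100, 5.0, 0.90), "MAG_B": (50, 0.2, 0.04)}
total_reads = 1000
abundance, unmapped_pct = coverm_style_summary(genomes, total_reads)
aligner_pct = 100 * sum(g[0] for g in genomes.values()) / total_reads
excess = aligner_pct + unmapped_pct - 100  # equals the share of reads on MAG_B
print(aligner_pct, unmapped_pct, excess)

# Point 2: equal cell counts, but the unrecovered genome is three times larger.
shares = expected_read_share(
    {"recovered": 1, "unrecovered": 1},
    {"recovered": 2_000_000, "unrecovered": 6_000_000},
)
print(shares)
```

In the first example the aligner's mapped percentage plus the reported unmapped percentage exceeds 100 by exactly the share of reads on the filtered genome. In the second, the unrecovered organism makes up half of the cells but would account for three quarters of the reads, so the unmapped read fraction overstates the unrecovered fraction of the community.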

Thanks again for the kind words.

kevinmyers Jul 19, 2022
Author

Thanks!
