- Introduction
- Installation
- Mathematical Framework
- Implementation Details
- Usage Guide
- Output Files
- Performance Considerations
- Support
- License
CovTrim is a specialized bioinformatics tool designed for precise coverage-based downsampling of FASTQ files from amplicon sequencing data. It uses pysam for efficient FASTQ processing and implements robust algorithms for coverage calculation and read selection.
# Using conda
conda install -c gosahan covtrim
# Dependencies
- Python ≥ 3.7
- pandas
- numpy
- pysam
The total sequencing space (TSS) is calculated based on the number of amplicons and their size:
Where:
-
$N_a$ = Number of amplicons -
$L_a$ = Amplicon length (bp)
Number of amplicons is calculated as:
Where:
-
$G$ = Genome size (bp)
Code implementation:
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
sequencing_space = num_amplicons * amplicon_size # TSS calculation
# ...
# In downsample_fastq:
num_amplicons = np.ceil(genome_size / amplicon_size) # Na calculation
The mean coverage depth (
Where:
-
$l_i$ = Length of read$i$ -
$n$ = Total number of reads -
$TSS$ = Theoretical sequencing space
Code implementation:
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
sequencing_space = num_amplicons * amplicon_size
mean_coverage = total_bases / sequencing_space # Mean coverage calculation
# ...
# In downsample_fastq:
total_bases = df['len'].sum() # Sum of all read lengths
The expected number of reads per amplicon assuming uniform distribution:
Code implementation:
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
theoretical_reads_per_amplicon = total_bases / (num_amplicons * amplicon_size)
For a target coverage
Where:
-
$C_t$ = Target coverage -
$\bar{C}$ = Current mean coverage
Code implementation:
# In downsample_fastq:
sampling_fraction = min(1.0, target_coverage / current_coverage)
The expected coverage after sampling (
Where:
-
$S$ = Set of sampled reads -
$l_i$ = Length of read$i$
Code implementation:
# In downsample_fastq:
sampled_bases = sampled_reads['len'].sum()
expected_coverage = sampled_bases / coverage_stats['sequencing_space']
Coverage uniformity is assessed using the coefficient of variation:
Where:
-
$\sigma_C$ = Standard deviation of coverage -
$\mu_C$ = Mean coverage
The representation of each amplicon is calculated as:
Where:
-
$N_{observed}$ = Observed number of reads per amplicon -
$N_{expected}$ = Expected number of reads based on uniform distribution
Code implementation:
# These metrics are calculated and included in the final report
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
# ...
return {
'mean_coverage': mean_coverage,
'sequencing_space': sequencing_space,
'theoretical_reads_per_amplicon': theoretical_reads_per_amplicon
}
The overall time complexity of the main operations:
Where
The space complexity for the core operations:
Where
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
"""
Calculate coverage metrics for amplicon sequencing
Returns:
- mean_coverage
- sequencing_space
- theoretical_reads_per_amplicon
"""
def downsample_fastq(input_fastq, target_coverage, output_dir,
genome_size=16000, amplicon_size=500, random_seed=42):
"""
Main function for FASTQ downsampling
"""
covtrim -i <input.fastq> -o <output_dir> -tc <target_coverage> [options]
Required Arguments:
-i, --input Input FASTQ file
-o, --output Output directory
-tc, --target-coverage Target coverage depth (X)
Optional Arguments:
-gs, --genome-size Size of target genome (default: 16000 bp)
-as, --amplicon-size Size of amplicons (default: 500 bp)
-s, --seed Random seed (default: 42)
# Basic usage
covtrim -i sample.fastq -o output_dir -tc 30
# Specify custom genome and amplicon size
covtrim -i sample.fastq -o output_dir -tc 30 -gs 29903 -as 400
coverage_<target>X/
├── <filename>_<target>X.fastq # Downsampled FASTQ file
└── <filename>_<target>X_report.md # Analysis report
- Analysis parameters
- Sequencing metrics
- Coverage calculations
- Sampling results
- Detailed methodology
- Streaming FASTQ processing using pysam
- Only read headers stored in memory
- Linear memory complexity: O(n) where n is number of reads
- Efficient indexing using pysam
- Fast random sampling with pandas
- Time complexity: O(n log n) for sorting operations
For questions or issues, contact Gurasis Osahan at the National Microbiology Laboratory.
Licensed under Apache License, Version 2.0. See LICENSE for details.
Copyright: Government of Canada
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Ensuring public health through advanced genomics. Developed with unwavering commitment and expertise by National Microbiology Laboratory, Public Health Agency of Canada.