Detection of ultra-low fraction variants using unique molecular barcodes and ultra-deep sequencing.
PCR copies of the same original DNA molecule are recognized based on their identical barcode and
mapping position. umiVar first computes corrected consensus reads with adjusted quality scores.
Next, specific error models for the different nucleotide changes and for different levels of barcode
correction are computed. The 'barcode-correction-level' depends on the number of PCR copies (duplicate
reads with identical barcodes) that have been used for generating a consensus read, with quality levels
of 1 (no copy, no correction), 2 (2 copies form the consensus), 3 copies and finally 4 or more copies.
Variant calling is based on the beta-binomial error models in combination with other well-known quality
features (allele frequency, low complexity sequence, strand bias etc.). Variants can be called at fractions
as low as 1 in 10000 reads (0.01%). In addition, umiVar provides various plots for analysing the error
correction efficiency, and the PCR copy number distribution.
Simply clone the repository
Required python packages:
- samtools
- pysam
- numpy
- scipy
- networkx
Required R packages:
- VGAM
- argparse
- data.table
- ggplot2
- grid
- gridExtra
- scales
- seqinr
- survcomp
usage: homopolymer_finder.py [-h] -r REF [-l LENGTH] [-o OUTLIER]
Find all homopolymers of length n in reference sequence (fasta).
optional arguments:
-h, --help show this help message and exit
-r REF, --ref REF Reference fasta file.
-l LENGTH, --length LENGTH
Minimum length of homopolymers to report. Default = 6
-o OUTLIER, --outlier OUTLIER
Maximum number of outliers (other nucleotides) in homopolymers. Default = 0
Example: Detect all homopolymers of at least 6 consecutive equal nucleotides or 7mers with at least 6 equal nucleotides and a wildcard.
python homopolymer_finder.py -r GRCh37.fa -l 7 -o 1 > GRCh37.homopolymer.txt
usage: barcode_correction.py [-h] --infile INFILE --outfile OUTFILE [--barcodes {START,END,BOTH}] [--minBQ MINBQ] [--barcode_error BARCODE_ERROR] [--n]
Correcting BAM files using barcodes info
optional arguments:
-h, --help show this help message and exit
--infile INFILE Input BAM file.
--outfile OUTFILE Output BAM file.
--barcodes {START,END,BOTH}
Barcode position: START = 5' barcode; END = 3' barcode; BOTH = 5' and 3' barcodes. Default = BOTH
--minBQ MINBQ Minimum base quality to be considered. Default = 30
--barcode_error BARCODE_ERROR
Maximum number of sequencing errors allowed in barcode sequence. Default = 0
--n Use Ns instead of reducing base quality.
Example:
python barcode_correction.py --infile raw.bam --outfile barcode_corrected.bam --barcodes BOTH
usage: umiVar - variant calling with unique molecular barcodes [-h] -tbam TBAM [-nbam NBAM] [-b BED] -r REF -hom HOMOPOLYMER [-o OUT_FILE] [-p PARAM] [-mq MQ] [-bq BQ] [-d DIST]
[-ac AC] [-ns NUM_SITES] [-str {0,1}] [-t TEMP_DIR]
optional arguments:
-h, --help show this help message and exit
-tbam TBAM, --tbam TBAM
Tumor bam file
-nbam NBAM, --nbam NBAM
Normal bam file
-b BED, --bed BED Bed file of the targeted regions. O-based
-r REF, --ref REF Reference genome - fasta
-hom HOMOPOLYMER, --homopolymer HOMOPOLYMER
File with homopolymer positions in reference genome
-o OUT_FILE, --out_file OUT_FILE
Out vcf file
-p PARAM, --param PARAM
Beta-binomial parameters table
-mq MQ, --mq MQ Minimum mapping quality
-bq BQ, --bq BQ Minimum base quality
-d DIST, --dist DIST Minimum distance allowed between variants
-ac AC, --ac AC Minimum number of reads supporting a variant
-ns NUM_SITES, --num_sites NUM_SITES
Number of sites to be analysed
-str {0,1}, --strand {0,1}
Strand filter activation. 0 for deactivating the filter. Default [0]
-t TEMP_DIR, --temp_dir TEMP_DIR
Temporary directory
Example:
python umiVar.py -tbam cfDNA.bam -b region.bed -r GRCh37.fa -p beta_params.txt
You can also provide custom binaries for python
, R
and samtools
. For that a settings file called settings.ini
has to be created in the main directory of umiVar. If no settings.ini
is available, umiVar uses the system-wide installed binaries.
Example content (also shown in settings.default
):
python = [PATH_TO_PYTHON]/python3
R = [PATH_TO_R]/RScript
samtools = [PATH_TO_SAMTOOLS]/samtools
Additionally, these paths can also be provided by environment variables:
$umiVar_python_binary="[PATH_TO_PYTHON]/python3"
$umiVar_R_binary="[PATH_TO_R]/RScript"
$umiVar_samtools_binary="[PATH_TO_SAMTOOLS]/samtools"
The Script calculateMRD.py
provides an alternative approach to calculate the minimum residual disease based on the background noise. For that it uses pileup files created by umiVar
and calculates the MRD with a Fisher's Exact Test:
usage: calculateMRD.py [-h] [--max_af MAX_AF] [--keep_gonosomes] [--blacklist BLACKLIST]
[--remove_off_target] [--keep_indels] folder panel output variants
Script to calculate MRD based on background noise.
positional arguments:
folder umiVar folder containing dedup TSV files.
panel VCF containing the cfDNA panel.(commas seperated if multiple)
output Output TSV containing MRD value.
variants Output TSV containing read counts and p-values for each monitoring variants.
optional arguments:
-h, --help show this help message and exit
--max_af MAX_AF Maximum allele frequency for variants to include in the computation. (>1 to disable)
--keep_gonosomes Do not remove gonosomes
--blacklist BLACKLIST
VCF of variants which should be excluded for the background error rate
--remove_off_target Remove tumor off-target variants
--keep_indels Do not remove InDels.