VariantBam: Filtering and profiling of next-generational sequencing data using region-specific rules
License: Apache2
Bioinformatics Paper Wala, J., C. Zhang, M. Meyerson, R. Beroukhim. VariantBam: filtering and profiling of nextgenerational sequencing data using region-specific rules. 2016. Bioinformatics, doi: 10.1093/bioinformatics/btw111
NOTE: VariantBam recently was updated to use the more universal JSON syntax, and to remove all dependencies on Boost to make installation easier.
- Installation
- Description
- Examples
- Rules script syntax
- Command line usage
- Full list of available JSON rules
- Attributions
I have succesfully built on Unix with GCC-4.8, 4.9 and 5.1
git clone --recursive https://github.com/jwalabroad/VariantBam.git
cd VariantBam
./configure
make
To add support for reading BAMs, etc with HTTPS, FTP, S3, Google cloud, etc, you must compile and link with libcurl.
## set hts to build with libcurl links and hfile_libcurl.c
cd VariantBam/SeqLib/htslib
./configure --enable-libcurl
## compile seqlib with libcurl support
cd ../../ # back to VariantBam main directory
./configure LDFLAGS="-lcurl -lcrypto"
make
make install ## if no root or didn't configure with --prefix, binary is found at src/variant
## test
variant gs://isb-cgc-open/ccle/LUSC/DNA-Seq/C836.NCI-H1339.2.bam | head
## using the included test BAM (HCC1143)
BAMOUTPUT='-b'
VariantBam/src/variant test/small.bam -g 'X:1,000,000-1,100,000' --min-mapq 10 $BAMOUTPUT -o mini.bam -v
## can stream in and out (out is binary stream with -b flag)
samtools view -h test/small.bam | variant - --min-mapq 10 -b | bamToBed > out.bed
## get help
VariantBam/src/variant --help
## TL;DR examples
## extract all reads and their pair-mates that overlap SNP sites within 100 bp
rfile=<BED file, samtools-style string (e.g. "1:1,000,000-1,000,010"), or VCF>
variant <bam> -l $rfile -P 100 -b -o mini.bam -v
## mask regions (exclude reads and their pair-mates that overlap)
variant <bam> -L $rfile -b -o mini.bam -v
## extract high-quality clipped reads (where clip length account for low quality bases)
variant <bam> --min-phred 4 --min-clip 5 -o mini.sam -v
## extract reads with high mapq that also contain a large insertion or deletion
## pipe output
variant <bam> --min-mapq 20 --min-ins 10 --min-del 10 | wc -l
## subsample to max-coverage. BAM must be sorted
variant <bam> -m 100 -o mini.bam -v -b
VariantBam is a tool to extract/count specific sets of sequencing reads from next-generational sequencing files. To save money, disk space and I/O, one may not want to store an entire BAM on disk. In many cases, it would be more efficient to store only those read-pairs or reads who intersect some region around the variant locations. Alternatively, if your scientific question is focused on only one aspect of the data (e.g. breakpoints), many reads can be removed without losing the information relevant to the problem, and enriching for the signal you are interested in.
VariantBam packages into a single executable a number of filtering features not easily found using samtools
+ awk
:
- Filter specifically on read clipping, orientation and insert size (all important for structural variation)
- Support for considering only high-quality bases when determining read length or clip count
- Interval tree to efficiently determine if a read overlaps a region
- Ability to link reads to a genomic region if their mate intersects that region.
- Provide different rules for different arbitrarily-sized regions, and to provide these regions as common variant files (VCF, MAF, BED)
- Select reads by matching motifs against a large dictionary using Aho-Corasick implementation
- Count reads that satisfy any number of user-defined properties
- Selectively strip alignment tags
- Support for sub-sampling to obtain a BAM file with a coverage limit
VariantBam is implemented in C++ and uses HTSlib, a highly optimized C library used as the core of Samtools and BCFtools.
To get a full list of options, run variant --help
.
Whole-genome analysis has been conducted on a BAM, generating VCF and MAF files. Ideally, these regions could be manually inspected or reanalyzed without having to keep the entire BAM. Running VariantBam to extract only read-pairs that overlap these events will allow these regions to be rapidly queried, without having to keep the full BAM record.
### Extract all read PAIRS that interset with a variant from a VCF
variant $bam -l myvcf.vcf -o mini.bam -b
In situations where the sequencing or library preparation quality is low, it may be advantageous to remove poor quality reads before starting the analysis train. VariantBam handles this by optionally taking into account base-qualities when making a decision whether to keep a sequencing read. For instance, one might only be interested in high quality MAPQ 0 or clipped reads. VariantBam can be setup to apply unique base-quality filters to different regions or across the entire genome, all with one-pass.
### Extract only high quality reads with >= 50 bases and MAPQ >= 1 and not duplicated/hardclip/qcfail
### json
{
"region1" : {
"region" : "WG",
"rules" : [{
"length" : [50,1000],
"mapq" : [1, 60],
"duplicate" : false,
"hardclip" : false,
"qcfail" : false
}
]
}
}
###
variant $bam -r example2.json -o mini.bam -b
An NGS tool operates only on a subset of the reads (eg. structural variant caller using only clipped/discordant reads). Running VariantBam to keep only these reads allows the tool to run much faster. This is particurlaly useful for facilitating a more rapid "build/test" cycle.
### Extract clipped, discordant, unmapped and indel reads
### json
{
"global" : { "nbases" : [0,0], "hardclip" : false, "supplementary" : false, "qcfail" : false},
"region_wg" : {"region" : "WG", "rules" : [
{ "mapq" : [0, 1000], "clip" : [5,1000] }, {"ic" : true}, {"ff" : true}, {"rf" : true}, {"rr" : true}, { "ins" : [1,1000], "mapq" : [1,100] }, { "del" : [1,1000], "mapq" : [1,1000] }
]}
}
###
variant $bam -r example3.json
A team is only interested in variants in known cancer genes, and would like to analyze thousands of exomes and genomes. Running VariantBam to extract reads from only these genes, and sending the BAM files to compressed CRAM provides sufficient data reduction to allow all of the relevant data to be stored on disk.
### Grab only reads from predefined regions. Strip unneccessary tags and convert to CRAM for maximum compression
variant $bam -l mygenes.bed -C -o mini.cram -s BI,OQ
A research team would like to extract only reads matching a certain motifs, but only if they have high optical quality.
VariantBam with the motif
rule will accomplish this with rapid O(n) efficiency for an arbitrarily large motif dictionary (where n
is
the length of a read)
## example6.json
{
"example6": {
"rules": [{"motif": "mymotifs.txt",
"length": 20 }]
}
}
##
###
variant $bam -r example6.json ## input as a JSON
variant $bam --min-length 20 --motif mymotifs.txt ## using command line shortcuts
To reduce the size of the BAM, reads can be removed from centromeric and satellite repeat regions. These reads are rarely helpful for variant calling.
To remove reads that intersect a region, set the region as an inverse-region. In a VariantBam script, use "exclude" : true```. For quick use on the command line, use
-Lor
-G(opposites of
-land
-g``).
### json
{
"" : {
"region" : "bad.bed",
"exclude" : true,
"matelink" : true
}
}
###
variant $bam -L bad.bed -o mini.bam -b
Massive read-pileups can occur at repetitive regions. These can reduced with VariantBam by subsampling to a max-coverage.
### BAM must be sorted
variant $bam -m 100 -o mini.bam -b
Obtain basic QC stats from a BAM file, or profile how many reads were accepted by each rule
### get QC stats on whole bam AND find how many reads are clipped with high-quality clipped bases
### use the -x flag to produce no output (profiling only)
variant <bam> --min-clip 10 --min-phred 5 -q qcreport.txt -x
Rscript VariantBam/R/BamQCPlot.R -i qcreport.txt -o qcreport.pdf
A user would like to extract only those reads supporting a particular allele at a variant site. This can be done by combining a small point-region at the variant site with a motif dictionary. Consider two alleles G and A at a site (e.g. 1:143250877), along with their adjacent sequences: GCAGAAT and GCAAAAT. To extract variant reads supporting the A allele:
## make the motifs file (include reverse complements)
printf "GCAAAAT\nATTTTGC" > motifs.txt
## just look near the variant
k="1:143,250,677-143,251,077"
r='{"":{"rules":[{"motif":"motifs.txt"}]}}'
g=1:143250877
variant <bam> -k $k -g $g -r $r -b -o mini.bam ## with JSON script
variant <bam> -k $k -g $g --motif motifs.txt -b -o mini.bam ## using command line shortcut
Because sequence information is required to match a motif, and reads do not contain the sequence information of their pair-mates, extracting all read pairs supporting a particular allele requires a two-pass solution:
## two pass solution
variant <bam> -k $k -g $g -r $r | cut -f1 | uniq > q.txt
printf "^@\n" >> q.txt ## keep the sam header too
samtools view <bam> $k | grep -f q.txt | samtools view - -b > mini.bam
This can be expanded for an arbitrary number of heterozygous sites, for instance to capture reads from a single haplotype:
### het.json
{
"A" : {
"region" : "1:132,250,677"
"rules" : {[ "motif" : "motifsA.txt" ]
}
},
"B" : {
"region" : "1:182,250,325"
"rules" : {[ "motif" : "motifsB.txt" ]
}
},
}
###
variant <bam> -r het.json -o mini.bam ## using JSON
variant <bam> -g 1:132,250,677 --motif motifsA.txt -g 1:182,250,325 --motif motifsB.txt ##using command line shortcut
Note that for the allele-specific extraction, there could be false negatives (reads not extracted) if a read has a sequencing error within the motif.
This section will describe the syntax used by VariantBam to specify the cascades of rules and regions applied to a BAM. Below is an example of a valid VariantBam JSON script:
{
"reg1" : {
"region" : "WG",
"rules" : [{RULE_A, RULE_B}, {RULE_C, RULE_D, RULE_E}]
}
}
This can be read as "Accept a read that passes (RULE_A && RULE_B) OR (RULE_C && RULE_D && RULE_E)".
If no "rules" is supplied, it will default to "accept every".
The region
keyword marks that what follows is a genomic region,
which is either the keyword WG
for whole genome, or a VCF, MAF, Callstats, BED file, or samtools-style string. If not specified,
the default "WG" is applied. Regions are
treated such that they will include any read who overlaps it, even partially. Optionally,
you can specify that your region of interest is a bit bigger than is actually in the file. You can do this by "padding"
the regions around the sites. For example:
### json
{
"" : {
"region" : "myvcf.vcf",
"pad" : 1000,
}
}
### command line short-cut
variant $bam -g myvcf.vcf -P 1000
Alternatively, if supplying a region directly with the -l, -L, -g or -G flag, you can specify a padding with the -P flag. Note that this padding must be supplied
after a region flag is provided, and will be applied to the last supplied region flag (but with multiple regions, you can provide multiple -P
flags).
variant <in.bam> -g myvcf.vcf -P 100
You can also state that the region applies to reads who don't necessarily overlap the region, but their pair-mate does (called "mate-linking"). Note that rules that involve pair-mate information not located within the view read (e.g. mate mapq) are not considered. Mate-linking is particularly useful for extracting all read PAIRS that cover a variant site.
### json
{
"" : {
"region" : "myvcf.vcf",
"pad" : 1000,
"matelinked" : true,
}
}
### command line shortcut
variant $bam -l myvcf.vcf -P 1000
### json to remove low (<= 10) MAPQ reads in bad region
{
"" : {
"region" : "blacklist.bed",
"pad" : 1000,
"rules" : [{"mapq" : [0, 10]}]
"exclude" : true
}
}
### command line shortcut (for pure blacklisting)
variant $bam -G blacklist.bed -P 1000
### command line shortcut (for pure blacklisting, with mate linking)
variant $bam -L blacklist.bed -P 1000
To reduce redundancy, you can name a region-rule set "global" anywhere in the stack, and it will append that rule to everything below. For example, to exclude hardclipped, duplicate, qcfail and supplementary reads in every region, you would do:
{
"global" : {
"rules" : [{"hardclip" : false, "duplicate" : false, "qcfail" : false, "supplementary" : false}]
},
"A" : {
"rules" : [{"isize" : [600, 0]}, {"clip" : [10,101], "mapq" : [1,60]}]
},
"B" : {
"region" : myvcf.vcf
}
}
which keeps hardclipped reads, etc out from both of the subsequent regions (A and B)
Rules are supplied as a list of criteria that a read must satisfy. VariantBam handles multiple rules in the following way. For each read, VariantBam will cycle through the rules within a region until the read satisfies a rule. When it does, it includes the read in the output and stops checking. The logic for the entire collection of rules is then as follows:
-
On a given rule line, the read must satisfy ALL conditions (logical AND)
-
Across different rules, the read nead only satisfy ONE rule (logical OR)
Below is an example which uses the ability of VariantBam to interpret VCFs and BED files, and apply rules separately to them.
{
"reg1" : {
"region" : "myvcf.vcf",
"pad" : 1000,
"matelinked" : true
},
"reg2" : {
"region" : "myexonlist.bed",
"matelinked" : true,
"rules" : [{"isize" : [600,0], "mapped" : true, "mate_mapped" : false, "rg" : "H01PE.2"},
{"mapped" : false, "mate_mapped" : false},
{"hardclip" : true},
{"nm" : [1,101], "mapq" : [30, 100]}]
}
}
The above JSON can be interpreted as a rule-cascade that, in one-pass of the BAM:
--Near VCF sites (to within 1000 bases)
Keep reads interesecting region OR reads with mate-pairs that intersect region
--In exons:
keep reads with: (isize outside of 0, 600 && with readgroup H01PE.2) OR (mapepd and mate unmapped) OR hardclipped OR (NM >= 1 && MAPQ >= 30)
Range rules can be input as a two-element JSON array ("mapq" : [30, 100]
) or a single value specificying the
minimum accepted value ("mapq" : 30
). To instead reject reads in this boundary, switch the order. Thus,
"mapq" : [100, 30]
accepts only reads with MAPQ < 30 || MAPQ > 100.
Flag rules can be input using keywords like "mapped" : true
or more versatily using the raw alginmetn flag "!flag" : 4
. Use
flag
to set all of the flags that must be turned on, and !flag
for that must be turned off. Thus, "!flag" : 4
requires that the "unmapped" bit be turned off, and so accepts only mapped reads.
A set of motifs can be supplied, so that only reads with (or without) the motif are accepted. A motif file is just a list of sequences in upper case, separated by newlines.
Reverse complements are not automatically considered,
so these must be explicitly provided. To include reads with a motif, use the --motif
flag for simple rules, or in JSON specify "motif" : "motiffile.txt"
. To
exclude reads with a motif, use JSON key-value pair: "!motif" : "motiffile.txt"
.
Variant BAM can also filter based on the number of N
bases in a read, with the nbases
key, input as a range rule ("nbaes" : [0,3]
)
Filters can be made based on the value of an alignment tag. Supported tags include "rg" (read-group), "nm" (number mismatches), and "xp" (number of supplementary alignments).
Filters can be supplied to enforce a min (or max) number of insertions or deletions, or number of clipped reads. Note that this refers to the max element size. e.g. a CIGAR
of 10M4D20M2D20M
will pass the filter "del" : [0,4]
but fail the filter "del" : [5, 100]
. The number of clipped bases is consider after trimming low-quality bases (if --min-phred
is supplied).
Region-specific subsampling rates can be applied. For example, in region A you can set "subsample" : 0.5
, which will remove half of all reads that otherwise passed the other filters. If a
read falls into two regions, the region listed first in the JSON will apply.
""
You can specify simple scripts directly on the command line:
### keep only paired reads, but not duplicates, in mate-linked region around 100 bp window at VCF sites
variant $bam -l myvcf.vcf -f 1 -F 1024 -P 100 -o out.bam
JSON scripts can also be supplied directly, just make sure to encase in single quotes and remove spaces:
variant $bam -g WG -r '{"":{"rules":[{"motif":"mymotifs.txt"}]}}' -o out.bam
#RULE #EXAMPLE #DESCRIPTION OF EXAMPLE / FLAG
anyflag "anyflag" : 4 Set the flag bits that, if ON pass read
!anyflag "!anyflag" : 4 Set the flag bits that, if OFF pass read
allflag "allflag" : 4 Require that all these flag bits be ON
!allflag "!allflag" : 4 Require that all these flag bits be OFF
motif "motif" : seqs.txt File containing substrings that must be present in the sequence.
!motif "!motif" : seqs.txt File containing substrings that must NOT be present in the sequence.
rg "rg" : "H01PE.2" Limit to just a single read-group
duplicate "duplicate" : true Read must be marked as optical duplicate
supp "supp" : false Read must be primary alignment
qcfail "qcfail" : false Read must note be marked as QC Fail
fwd_strand "fwd_strand" : true Read must be mapped to forward strand
rev_strand "rev_strand" : true Read must be mapped to reverse strand
mate_fwd_strand "mate_fwd_strand" : true Mate of read must be mapped to forward strand
mate_rev_strand "mate_rev_strand" : true Mate of read must be mapped to reverse strand
mapped "mapped" : true Read must be unmapped
mate_mapped "mate_mapped" : true Mate must be mapped
subsample "subsample" : 0.4 Subsample this region to at a certain rate
ff "ff" true Read pair must have forward-forward orientation
rr "rr" : true Read pair must have reverse-reverse orientation
fr "fr" : true Read pair must have forward-reverse orientation (proper)
rf "rf" : true Read pair must have reverse-forward orientation
ic "ic" : true Read pair must have inter-chromosomal mapping
... ALL RANGE RULES FOLLOW THE 3 INPUT OPTIONS ILLUSTRATED BELOW ...
ins "ins" : [5,101] Number of inserted bases on the reads (from parsed CIGAR string)
"ins" : 5 ... Take only reads with max insertion size of >= 5
"ins" : [101,5] ... Take only reads with max insertion size NOT in [5,101] (e.g. 0-4)
del "del" : [5,101] Number of deleted bases relative to reference (from parsed CIGAR string).
nm "nm" : [0,4] NM tag from BAM (number of mismatches). e.g. must be 0-4 inclusive
xp "xp" : [0,4] Number of chimeric aligments, with XP or SA tag from BAM
isize "isize" : [100,500] Insert size, where all insert sizes are converted to positive.
len "len" : [80,101] Length of the read following phred trimming. If phred trimming, don't count hardclips. If not, then HC count to length
clip "clip" : [0,5] Number of clipped bases following phred trimming
nbases "nbases" : [0,5] Removed reads that have within this range of N bases.
VariantBam is developed and maintained by Jeremiah Wala ([email protected]) -- Rameen Berkoukhim's lab -- Dana Farber Cancer Institute, Boston, MA.
This project was developed in collaboration with the Cancer Genome Analysis team at the Broad Institute. Particular thanks to:
- Cheng-Zhong Zhang (Matthew Meyerson Lab)
- Marcin Imielinski
- Gad Getz
- Mara Rosenberg
- Esther Rheinbay
- Gordon Saksena