RohHunter is a tool for run of homozygosity (ROH) detection based on a variant list in VCF format.
RohHunter uses the allele frequency of variants to calculate the probability of observing a ROH by chance.
Allele frequency information can be annotated to the variant list via Ensembl VEP.
These are steps the RohHunter algorithm performs:
- Filter variants (markers) by quality to remove false genotype calls:
  - Depth (default: ≥20)
  - Variant Q score (default: ≥30)
- Determine raw stretches of homozygous markers
- Assign probability to observe ROH by chance
  - based on allele frequency, e.g. using 1000g and gnomAD
- Remove regions with low probability (default: <Q30)
- Merge adjacent ROHs based on
  - distance in markers (default: ≤1 or ≤1% of ROH marker count)
  - distance in bases (default: ≤50% of ROH base count)
- Filter based on
  - Number of markers (default: ≥20)
  - Size (default: ≥20Kb)
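The core of the steps above can be illustrated with a minimal Python sketch. It is not RohHunter's implementation: it assumes Hardy-Weinberg equilibrium and independent markers, so that a marker with alternate allele frequency p is homozygous by chance with probability p² + (1-p)², and it expresses the combined probability as a Phred-scaled score. The quality filter and the merge step are omitted, and all names (`raw_stretches`, `chance_q_score`, `keep_roh`) are hypothetical.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical marker layout: (position, is_homozygous, alt allele frequency or None)
Marker = Tuple[int, bool, Optional[float]]


def raw_stretches(markers: List[Marker]) -> List[List[Marker]]:
    """Group consecutive homozygous markers into raw stretches."""
    stretches: List[List[Marker]] = []
    current: List[Marker] = []
    for marker in markers:
        if marker[1]:
            current.append(marker)
        else:
            # A heterozygous marker ends the current stretch.
            if current:
                stretches.append(current)
            current = []
    if current:
        stretches.append(current)
    return stretches


def chance_q_score(stretch: List[Marker]) -> float:
    """Phred-scaled probability that all markers are homozygous by chance.

    Assumption (not taken from RohHunter): Hardy-Weinberg equilibrium and
    independent markers, i.e. P(homozygous) = p^2 + (1 - p)^2 per marker.
    """
    log10_p = 0.0
    for _, _, af in stretch:
        if af is None:
            # Markers without allele frequency contribute no evidence here.
            continue
        log10_p += math.log10(af * af + (1.0 - af) * (1.0 - af))
    return -10.0 * log10_p


def keep_roh(stretch: List[Marker], min_q: float = 30.0,
             min_markers: int = 20, min_size: int = 20_000) -> bool:
    """Apply the Q score, marker count and size defaults listed above."""
    size = stretch[-1][0] - stretch[0][0] + 1
    return (chance_q_score(stretch) >= min_q
            and len(stretch) >= min_markers
            and size >= min_size)


# Ten common markers (allele frequency 0.5) already reach about Q30.
example = [(1000 * i, True, 0.5) for i in range(1, 11)]
print(round(chance_q_score(example), 1))  # 30.1
```

Under this model, rare variants contribute almost nothing to the score (a marker with allele frequency close to 0 is nearly always homozygous reference), which is one way to see why allele frequency information matters for the probability estimate.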
The following image visualizes the algorithm and shows how it copes with a genotyping error (at the start of exon 2):
Instead of using VEP annotations as the source of allele frequency information, an external database of allele frequencies can be provided via the 'af_source' parameter.
We suggest using gnomAD in version 3.1 or higher as the allele frequency database.
It is important to normalize the allele frequency database (and the variant list) so that most variants can be annotated with allele frequency:
- Split multi-allelic variants to several rows, e.g. with VcfBreakMulti.
- Left-align InDels, e.g. with VcfLeftNormalize.
- Sort variants according to position, e.g. with VcfStreamSort.
Finally, the allele frequency database has to be compressed with bgzip and indexed with tabix.
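To make the first normalization step concrete, here is a minimal Python sketch of what splitting a multi-allelic record means. It is an illustration only and not how VcfBreakMulti works internally: the hypothetical function `split_multiallelic` handles a single tab-separated VCF data line, splits only the `AF` INFO entry per allele, and copies all other fields unchanged.

```python
from typing import List


def split_multiallelic(vcf_line: str) -> List[str]:
    """Split one VCF data line with several ALT alleles into one line per allele.

    Simplifications (assumptions of this sketch): only the INFO key 'AF' is
    split per allele; other INFO entries and any sample columns are copied
    unchanged.
    """
    line = vcf_line.rstrip("\n")
    fields = line.split("\t")
    alts = fields[4].split(",")
    if len(alts) == 1:
        return [line]

    info_entries = fields[7].split(";")
    rows = []
    for i, alt in enumerate(alts):
        new_info = []
        for entry in info_entries:
            if entry.startswith("AF="):
                values = entry[3:].split(",")
                # Only split when there is exactly one value per ALT allele.
                if len(values) == len(alts):
                    entry = "AF=" + values[i]
            new_info.append(entry)
        rows.append("\t".join(fields[:4] + [alt] + fields[5:7]
                              + [";".join(new_info)] + fields[8:]))
    return rows


record = "chr1\t100\t.\tA\tC,G\t.\tPASS\tAF=0.1,0.02;AN=100"
for row in split_multiallelic(record):
    print(row)
```

After splitting, each row carries exactly one alternate allele with its own frequency, so a lookup by chromosome, position, reference and alternate allele can find it. Left-alignment and sorting still have to be applied afterwards.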
Using an external allele frequency database increases the run-time of the tool, since all variants have to be looked up in the database.
Our benchmarks show the following run-time increase when using the gnomAD genome database:
- Exome (60K variants) from 4.3s (annotated) to ~100s.
- Genome (4.8M variants) from 3.3min (annotated) to ~90min.
Thus, for genomes it is preferable to use annotated variant lists if available.
Many large ROHs in a child can be an indicator of consanguinity of the parents.
This plot shows the ROH size sum of ROHs larger than 500kb for WGS (Illumina TruSeq DNA PCR-Free):
This plot shows the ROH size sum of ROHs larger than 500kb for WES (Agilent SureSelect Human All Exon V7):
This plot shows the ROH size sum of ROHs larger than 500kb for patients with different degrees of consanguinity:
The plots show that a ROH size sum larger than 75Mb is a good indicator of consanguinity of the parents.
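The size sum used in these plots can be computed from a list of detected ROHs as in the following minimal Python sketch. The tuple layout `(chromosome, start, end)` with 1-based inclusive coordinates and the function names are assumptions made for illustration and do not describe RohHunter's output format. The 500kb and 75Mb values are the ones stated above.

```python
from typing import Iterable, Tuple

# Hypothetical ROH layout: (chromosome, start, end), 1-based inclusive.
Roh = Tuple[str, int, int]


def roh_size_sum_mb(rohs: Iterable[Roh], min_size: int = 500_000) -> float:
    """Sum the sizes of all ROHs larger than min_size bases, in megabases."""
    total = 0
    for _, start, end in rohs:
        size = end - start + 1
        if size > min_size:
            total += size
    return total / 1_000_000


def suggests_consanguinity(rohs: Iterable[Roh], threshold_mb: float = 75.0) -> bool:
    """True if the size sum of large ROHs exceeds the 75Mb threshold."""
    return roh_size_sum_mb(rohs) > threshold_mb


rohs = [
    ("chr1", 1, 40_000_000),
    ("chr2", 1, 40_000_000),
    ("chr3", 1, 400_000),  # smaller than 500kb, so it is not counted
]
print(roh_size_sum_mb(rohs))         # 80.0
print(suggests_consanguinity(rohs))  # True
```

The threshold is a rule of thumb derived from the plots, so a value close to 75Mb should be interpreted with care rather than as a hard cut-off.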
The RohHunter command-line help and changelog can be found here.