Merge pull request #128 from tskir/eva-2085-clinvar-distrubutions
EVA-2085 — Calculate various value distributions in ClinVar data
tskir authored Jul 24, 2020
2 parents 255dbae + 63fa919 commit b527112
Showing 5 changed files with 112 additions and 9 deletions.
31 changes: 29 additions & 2 deletions clinvar-variant-types/README.md
@@ -1,4 +1,4 @@
# ClinVar data model and variant types
# ClinVar data model and attribute value distributions

The script in this directory parses the ClinVar XML and calculates statistics on all the ways in which variants can be represented. The results are described below.

@@ -13,7 +13,9 @@ python3 \

## Results

Generated from the file `ClinVarFullRelease_2020-0706.xml.gz` (click to enlarge):
All graphs in this section were generated from the file `ClinVarFullRelease_2020-0706.xml.gz`. Graphs can be enlarged by clicking on them.

### Data model and variant types

![](variant-types.png)

@@ -44,3 +46,28 @@ Generated from the file `ClinVarFullRelease_2020-0706.xml.gz` (click to enlarge)
- **Diplotype.** Similar, but at least one of the _trans_ phased alleles includes a haplotype. An example of this would be three variants located on one copy of the gene, and one variant in the second one, all interpreted together.

The most common case is MeasureSet/Variant, which accounts for 1,114,689 out of 1,115,169 RCV records (as of the date when this report was compiled), or 99.96%.

### Clinical significance

![](clinical-significance.png)

Under the current criteria, 188,518 of the 1,114,689 records (17%) are processed.

When multiple clinical significance levels are reported for a given association, they are combined into a single composite string, e.g. `Benign/Likely benign, other`. Before such records can be processed, we need to decide which activity codes should correspond to them.
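
The distinction between simple and composite significance strings can be sketched as follows. This is a minimal illustration which assumes that a composite string is recognised by the presence of a comma or a slash, as the accompanying script does. The helper name `classify_clin_sig` is hypothetical and does not appear in the script.

```python
import re

# Clinical significance values which are currently processed
# (copied from the accompanying script).
PROCESSED_CLIN_SIG = ['Pathogenic', 'Likely pathogenic', 'protective', 'association',
                      'risk_factor', 'affects', 'drug response']


def classify_clin_sig(clinical_significance):
    """Hypothetical helper: classify a clinical significance string into one of three groups."""
    if clinical_significance in PROCESSED_CLIN_SIG:
        return 'Processed'
    # A comma or a slash indicates that several levels were merged into one string.
    if re.search('[,/]', clinical_significance):
        return 'Not processed, complex'
    return 'Not processed, simple'


for value in ('Pathogenic', 'Benign', 'Benign/Likely benign, other'):
    print(value, '->', classify_clin_sig(value))
```

Matching on punctuation keeps the check independent of any fixed list of composite values, at the cost of misclassifying a single-level value that happens to contain a comma or a slash.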

### Star rating (review status)

![](star-rating.png)

The distribution of records by star rating is:
* ☆☆☆☆ 142,855 (13%)
* ★☆☆☆ 894,109 (80%)
* ★★☆☆ 66,107 (6%)
* ★★★☆ 11,583 (1%)
* ★★★★ 35 (< 0.01%)
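
The percentages above follow directly from the counts. The sketch below recomputes them, assuming only the per-rating counts listed in this section; the variable names are illustrative.

```python
# Record counts per number of stars, as listed in this section.
star_counts = {0: 142855, 1: 894109, 2: 66107, 3: 11583, 4: 35}

# The counts should add up to the number of MeasureSet/Variant records.
total = sum(star_counts.values())

for stars, count in star_counts.items():
    label = '★' * stars + '☆' * (4 - stars)
    print('{} {:,} ({:.2%})'.format(label, count, count / total))
```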

### Mode of inheritance

![](mode-of-inheritance.png)

Only a small fraction of all records specify their mode of inheritance: 35,009 out of 1,114,689, or about 3%.
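
As an illustration of how the mode of inheritance is looked up, the sketch below applies the same XPath expression as the script to a hypothetical, heavily simplified record. The XML fragment is an assumption made for this example; real ClinVar records contain many more elements.

```python
import xml.etree.ElementTree as ElementTree

# Hypothetical, heavily simplified RCV record used only for this example.
SAMPLE_RCV = '''
<ReferenceClinVarAssertion>
  <AttributeSet>
    <Attribute Type="ModeOfInheritance">Autosomal recessive inheritance</Attribute>
  </AttributeSet>
</ReferenceClinVarAssertion>
'''

rcv = ElementTree.fromstring(SAMPLE_RCV)

# Same XPath expression as in the accompanying script.
matches = rcv.findall('AttributeSet/Attribute[@Type="ModeOfInheritance"]')

# Zero matches means the attribute is missing; more than one is treated as ambiguous.
mode_of_inheritance = matches[0].text if len(matches) == 1 else None
print(mode_of_inheritance)
```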
Binary file added clinvar-variant-types/clinical-significance.png
90 changes: 83 additions & 7 deletions clinvar-variant-types/clinvar-variant-types.py
@@ -3,9 +3,21 @@
import argparse
from collections import Counter
import gzip
import re
import sys
import xml.etree.ElementTree as ElementTree

PROCESSED_CLIN_SIG = ['Pathogenic', 'Likely pathogenic', 'protective', 'association', 'risk_factor', 'affects',
                      'drug response']

SIG_STARS = {
    'practice guideline': 4,
    'reviewed by expert panel': 3,
    'criteria provided, multiple submitters, no conflicts': 2,
    'criteria provided, conflicting interpretations': 1,
    'criteria provided, single submitter': 1,
}

parser = argparse.ArgumentParser()
parser.add_argument('--clinvar-xml', required=True)
args = parser.parse_args()
@@ -17,9 +29,30 @@ def add_transitions(transitions_counter, transition_chain):
transitions_counter[(transition_from, transition_to)] += 1


def find_attribute(rcv, xpath, attribute_name):
    """Find an attribute in the RCV record which can have either zero or one occurrence. Return a textual representation
    of the attribute, including special representations for the case of zero or multiple, constructed using the
    attribute_name parameter."""

    attributes = rcv.findall(xpath)
    if len(attributes) == 0:
        return '{} missing'.format(attribute_name)
    elif len(attributes) == 1:
        return attributes[0].text
    else:
        return '{} multiple'.format(attribute_name)


def review_status_stars(review_status):
    black_stars = SIG_STARS.get(review_status, 0)
    white_stars = 4 - black_stars
    return '★' * black_stars + '☆' * white_stars


# The dicts store transition counts for the Sankey diagrams. Keys are (from, to), values are transition counts.
# Sankey diagrams can be visualised with SankeyMatic (see http://www.sankeymatic.com/build/)
high_level_transitions, variant_transitions = Counter(), Counter()
# Sankey diagrams can be visualised with SankeyMatic (see http://www.sankeymatic.com/build/).
variant_type_transitions, clin_sig_transitions, review_status_transitions, inheritance_mode_transitions \
    = Counter(), Counter(), Counter(), Counter()


# ClinVar XML has the following top-level structure:
@@ -52,18 +85,60 @@ def add_transitions(transitions_counter, transition_chain):
# Most common case. RCV directly contains one measure set.
measure_set = measure_sets[0]
measure_set_type = measure_set.attrib['Type']
add_transitions(high_level_transitions, ('RCV', 'MeasureSet', measure_set_type))
add_transitions(variant_type_transitions, ('RCV', 'MeasureSet', measure_set_type))

if measure_set_type == 'Variant':
# Most common case. Here, we go into details about its individual types
# Most common case, accounting for >99.95% of all ClinVar records. Here, we go into details on various
# attribute distributions.

# Variant type
measures = measure_set.findall('Measure')
assert len(measures) == 1, 'MeasureSet of type Variant must contain exactly one Measure'
add_transitions(variant_transitions, (measure_set_type, measures[0].attrib['Type']))
add_transitions(variant_type_transitions, (measure_set_type, measures[0].attrib['Type']))

# Clinical significance
clinical_significance = find_attribute(
rcv, 'ClinicalSignificance/Description', 'ClinicalSignificance')
if clinical_significance in PROCESSED_CLIN_SIG:
add_transitions(clin_sig_transitions, (
'Variant',
'Processed',
clinical_significance,
))
else:
significance_type = 'Complex' if re.search('[,/]', clinical_significance) else 'Simple'
add_transitions(clin_sig_transitions, (
'Variant',
'Not processed',
significance_type,
clinical_significance,
))

# Review status
review_status = find_attribute(
rcv, 'ClinicalSignificance/ReviewStatus', 'ReviewStatus')
add_transitions(review_status_transitions, (
'Variant',
review_status_stars(review_status),
review_status,
))

# Mode of inheritance
mode_of_inheritance = find_attribute(
rcv, 'AttributeSet/Attribute[@Type="ModeOfInheritance"]', 'ModeOfInheritance')
add_transitions(inheritance_mode_transitions, (
'Variant',
mode_of_inheritance if mode_of_inheritance.endswith('missing') else 'ModeOfInheritance present',
))
if not mode_of_inheritance.endswith('missing'):
add_transitions(inheritance_mode_transitions, (
'ModeOfInheritance present', mode_of_inheritance
))

elif len(measure_sets) == 0 and len(genotype_sets) == 1:
# RCV directly contains one genotype set.
genotype_set = genotype_sets[0]
add_transitions(high_level_transitions, ('RCV', 'GenotypeSet', genotype_set.attrib['Type']))
add_transitions(variant_type_transitions, ('RCV', 'GenotypeSet', genotype_set.attrib['Type']))

else:
raise AssertionError('RCV must contain either exactly one measure set, or exactly one genotype set')
@@ -78,7 +153,8 @@ def add_transitions(transitions_counter, transition_chain):

# Output the code for Sankey diagram. Transitions are sorted in decreasing number of counts, so that the most frequent
# cases are on top.
for transitions_counter in high_level_transitions, variant_transitions:
for transitions_counter in (variant_type_transitions, clin_sig_transitions, review_status_transitions,
inheritance_mode_transitions):
print()
for (transition_from, transition_to), count in sorted(transitions_counter.items(), key=lambda x: -x[1]):
print('{transition_from} [{count}] {transition_to}'.format(**locals()))
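
The output format produced by the loop above can be illustrated with a small self-contained sketch. The body of `add_transitions` is not fully shown in this diff, so the version below is an assumption (it counts each consecutive pair in the chain). `format_sankey` is a hypothetical helper, and the three records are made up for the example.

```python
from collections import Counter


def add_transitions(transitions_counter, transition_chain):
    """Assumed implementation: count every consecutive (from, to) pair in the chain."""
    for transition_from, transition_to in zip(transition_chain, transition_chain[1:]):
        transitions_counter[(transition_from, transition_to)] += 1


def format_sankey(transitions_counter):
    """Hypothetical helper: return 'from [count] to' lines, most frequent transitions first."""
    lines = []
    for (transition_from, transition_to), count in sorted(transitions_counter.items(), key=lambda x: -x[1]):
        lines.append('{} [{}] {}'.format(transition_from, count, transition_to))
    return lines


# Three made-up records, for illustration only.
transitions = Counter()
add_transitions(transitions, ('RCV', 'MeasureSet', 'Variant'))
add_transitions(transitions, ('RCV', 'MeasureSet', 'Variant'))
add_transitions(transitions, ('RCV', 'GenotypeSet', 'CompoundHeterozygote'))

sankey_lines = format_sankey(transitions)
print('\n'.join(sankey_lines))
```

Each printed line has the `source [count] target` form which can be pasted into SankeyMATIC.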
Binary file added clinvar-variant-types/mode-of-inheritance.png
Binary file added clinvar-variant-types/star-rating.png
