Releases: EBIvariation/CMAT

v2.0.2: Corrections and updates for the Open Targets batch 2021.04

22 Mar 13:13
c8d8cb5

Resolves several issues that caused a loss of evidence strings in v2.0.0 and v2.0.1 compared to v1.3.2:

  • All ClinVar traits are now processed instead of only “Disease” type traits.
  • All names of a ClinVar trait are now used for looking up the corresponding ontology term. Previously, only the preferred name was used, which does not always correspond to the one in the string-to-ontology mapping database.
  • Reintroduced processing of mitochondrial variants and variants containing IUPAC ambiguity bases. They were previously skipped due to not being supported by the Open Targets schema.
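The change to trait name lookup can be illustrated with a minimal sketch. The function name, the shape of the mapping database (a plain dictionary keyed by lower-cased trait name), and the `EXAMPLE:0001` term are assumptions made for illustration only; they are not taken from the pipeline code.

```python
from typing import Dict, Iterable, Optional


def find_ontology_term(trait_names: Iterable[str],
                       mappings: Dict[str, str]) -> Optional[str]:
    """Return the ontology term mapped to the first matching trait name.

    Hypothetical helper: every name of the trait is tried in order,
    rather than only the preferred one.
    """
    for name in trait_names:
        term = mappings.get(name.strip().lower())
        if term is not None:
            return term
    return None


# Hypothetical mapping database in which only an alternative name is present.
example_mappings = {"alias name": "EXAMPLE:0001"}

# Looking up only the preferred name finds nothing ...
preferred_only = find_ontology_term(["Preferred name"], example_mappings)
# ... while looking up all names finds the mapping via the alias.
all_names = find_ontology_term(["Preferred name", "Alias name"], example_mappings)
```

Under these assumptions, a trait whose preferred name is absent from the mapping database is still mapped as long as one of its other names is present.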

Technical changes:

  • Updated the Open Targets schema version: 2.0.5 → 2.0.6.
  • Updated test data and assertions to fix some inconsistencies which weren't spotted in the v2.0.0 and v2.0.1 releases.

v2.0.1: Evidence string duplication, literature references, and ontology mapping adjustments

15 Mar 13:30
8f7856c

Version 2.0.1 addresses three groups of issues.

  • Evidence string duplication
    • Verified that all known problems have been resolved (#185).
    • Introduced additional checks prior to submission (#178, #188).
  • Processing of PubMed references
    • Investigated the three types of ClinVar literature references and provided a report (#166, #192).
    • Clarified the different types of PubMed references in the ClinVar XML parser class (#182).
  • Handling string to ontology mappings
    • Verified that the preferred trait names are used consistently across the pipeline (#177).
    • Verified that multiple string-to-ontology mappings are consistently supported across the pipeline, fixed a minor bug and amended documentation (#115).
    • Prevented non-specific terms like “disease” from reappearing in the manual curation results (#179).
    • Fixed a bug in construction of MONDO IRIs from ClinVar data (#175).
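The release note does not describe the MONDO IRI bug itself. As a rough illustration of what constructing such an IRI involves, the sketch below converts a MONDO identifier into an IRI under the standard OBO PURL prefix. The function name and the accepted input forms are assumptions, not the pipeline's actual code.

```python
import re

# Standard OBO Foundry PURL prefix used for MONDO term IRIs.
OBO_PURL_PREFIX = "http://purl.obolibrary.org/obo/"


def mondo_id_to_iri(identifier: str) -> str:
    """Build a MONDO IRI from an identifier (hypothetical helper).

    Accepts 'MONDO:0009061', 'MONDO_0009061', or the bare '0009061'.
    Raises ValueError for anything else instead of guessing.
    """
    match = re.fullmatch(r"(?:MONDO[:_])?(\d{7})", identifier.strip())
    if match is None:
        raise ValueError(f"Not a recognisable MONDO identifier: {identifier!r}")
    return f"{OBO_PURL_PREFIX}MONDO_{match.group(1)}"
```

Validating the identifier up front, rather than concatenating strings blindly, is one way to avoid malformed IRIs such as a doubled `MONDO_MONDO:` prefix.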

See also PR #202.

v2.0.0: Major refactor of ClinVar input, repeat expansion pipeline, and JSON schema

06 Mar 05:43
5236518

ClinVar input rewrite

All components of the pipeline now use the comprehensive XML data dump from ClinVar as input. The use of VCF and TSV summary files has been discontinued. This should make the results more consistent and comprehensive.

This is made possible by the new clinvar_xml_utils module, which provides a Python interface to work with ClinVar data. External users with similar goals are welcome to also try it out.

Repeat expansion pipeline refactor

Under the new approach, the following Microsatellite records are considered repeat expansion events:

  1. Variants with explicit allele sequences which represent insertions of 12 bases or more;
  2. Variants without explicit allele sequences whose HGVS-like notation does not represent a deletion.

The old approach was essentially confined to category (2). As a result, the number of repeat expansion consequences processed is now larger by approximately a factor of 6.
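The two categories above can be sketched as a single predicate. This is a simplified illustration, not the pipeline's implementation: the function signature is hypothetical, insertion length is approximated as the difference between alternate and reference allele lengths, and a deletion is detected by a plain `del` substring check on the HGVS-like string.

```python
from typing import Optional

# Threshold from the release note: insertions of 12 bases or more.
MIN_INSERTION_LENGTH = 12


def is_repeat_expansion(ref: Optional[str],
                        alt: Optional[str],
                        hgvs_like: Optional[str]) -> bool:
    """Decide whether a Microsatellite record counts as a repeat expansion."""
    # Category 1: explicit allele sequences are available, so measure the
    # net number of inserted bases directly.
    if ref is not None and alt is not None:
        return len(alt) - len(ref) >= MIN_INSERTION_LENGTH
    # Category 2: no explicit sequences; accept the record unless its
    # HGVS-like notation describes a deletion (assumed: contains 'del').
    if hgvs_like is not None:
        return "del" not in hgvs_like
    # Neither kind of information is available.
    return False
```

For example, under these assumptions a record with `ref="A"` and an `alt` that is 12 bases longer is accepted, while a record described only by a hypothetical notation such as `c.52_54del` is rejected.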

JSON schema migration

The pipeline output was migrated to accommodate the new major version of the Open Targets JSON schema, 2.0.5 (up from 1.7.5), described and discussed in detail in #189.

Other changes

  • Substantial refactoring and documentation updates under the hood.
  • A copy of the JSON schema is no longer stored in the repository; it is fetched on the fly instead.
  • Manual curation protocol now includes a “Notes” column, which stores the “NT expansion” annotation without replacing the trait frequency.
  • Removed a number of unused modules, including the old ClinVar XML parser written in Java.

v1.3.2: Minor updates for the 2021.02 batch

22 Jan 07:08
  • Migrated to Open Targets schema version 1.7.5.

v1.3.1: Minor updates for the 2020.11 batch

21 Oct 07:41
  • Evidence string related changes
    • Migrated to Open Targets schema version 1.7.3.
    • Minor updates to the evidence string generation review checklist.
    • Evidence string name format changed from DD-MM-YYYY to YYYY-MM-DD.
  • Other changes
    • Minor fixes to the manual curation protocol to ensure stable sort order.
    • ClinVar data examination script now calculates distributions of allele origins as well.
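As a small illustration of the date format change in evidence string names, the sketch below converts a DD-MM-YYYY date into YYYY-MM-DD. The helper name is hypothetical, and the note above does not say how the pipeline performs the conversion.

```python
from datetime import datetime


def to_iso_date(old_style: str) -> str:
    """Convert a DD-MM-YYYY date string to YYYY-MM-DD (hypothetical helper)."""
    return datetime.strptime(old_style, "%d-%m-%Y").strftime("%Y-%m-%d")


# Dates in the old format do not sort chronologically as plain strings ...
old_names = ["03-08-2020", "22-07-2020"]
sorted_old = sorted(old_names)
# ... whereas the new format does.
sorted_new = sorted(to_iso_date(name) for name in old_names)
```

One practical benefit of the YYYY-MM-DD format is that lexicographic order matches chronological order, so files named this way list in date order.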

The latest ClinVar version with which this pipeline will work is 2020/08. After that, the variant_summary.tsv format changed and no longer includes a “NT expansion” category.

v1.3: Process additional ClinVar attributes

16 Sep 14:42

These changes introduce additional ClinVar attributes into the evidence strings, in preparation for implementing a better and more comprehensive scoring mechanism. All changes affect both genetic_association and somatic_mutation evidence strings.

  • #146 Report records with all clinical significance levels
    • Removed filtering by clinical significance throughout the pipeline.
    • Clinical significance levels are now formatted and processed according to the new schema, allowing multiple values per record.
    • Removed the obsolete target.activity attribute.
    • Always set the evidence.gene2variant.is_associated and evidence.variant2disease.is_associated fields to True.
  • #148 Add ClinVar star rating and review status
    • Add star rating, which ranges from 0 to 4.
    • Add review status, e.g. criteria provided, conflicting interpretations.
  • #149 Add mode of inheritance
    • Reported as strings verbatim from ClinVar and not additionally processed.
    • For consistency between records, this field always contains an array, even when there is only one mode of inheritance (as is the case for the majority of records).
  • #150 Add last evaluated date
    • This field tracks the timestamp of the most recent clinically meaningful update of the record: essentially, the latest (re)evaluation of the clinical significance level.
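The star rating and mode of inheritance attributes can be sketched as follows. The mapping reflects ClinVar's published review status scheme at the time of this release; the function names and the fallback of 0 stars for unrecognised statuses are assumptions, not the pipeline's actual code.

```python
from typing import Iterable, List, Optional, Union

# Review status to star rating, following ClinVar's published scheme.
# Statuses not listed here are assumed to carry 0 stars.
REVIEW_STATUS_TO_STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, conflicting interpretations": 1,
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
    "no assertion provided": 0,
}


def star_rating(review_status: str) -> int:
    """Return the star rating (0 to 4) for a ClinVar review status."""
    return REVIEW_STATUS_TO_STARS.get(review_status.strip().lower(), 0)


def modes_of_inheritance(
        values: Optional[Union[str, Iterable[str]]]) -> List[str]:
    """Always return a list, even for a single mode of inheritance.

    The strings are passed through verbatim, without further processing.
    """
    if values is None:
        return []
    if isinstance(values, str):
        return [values]
    return list(values)
```

Returning a list in every case means downstream consumers never need to distinguish between a single string and a collection.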

v1.2: Technical improvements and bug fixes

11 Aug 14:54
681c804
  • #138 Refactor approach for submitting and reusing ZOOMA feedback
    • The trait-to-ontology mappings from previous iterations of manual curation are now reused directly, rather than relying on files for ZOOMA feedback. The feedback files themselves are also generated at more appropriate stages of the pipeline.
    • This solves a number of issues which occur when two iterations of manual curation happen back to back without evidence string generation in between.
  • #140 Use virtualenv, reorganise dependencies and pin their versions
    • For more consistent dependency management, the pipeline now uses virtualenv for all purposes.
    • The list of dependencies was reorganised and their versions were pinned.
    • Fixed problems caused by multiple regressions in the Pandas 1.1.0 release by downgrading to Pandas 1.0.5.
  • #141 Changes for batch 2020.09. Includes update from JSON schema 1.6.7 to 1.7.1 (only test files and version updates, no actual evidence string format changes necessary) and minor documentation fixes.

v1.1: Drop support for haplotypes and gene2variant→resource_score section

03 Aug 07:08
aae028d
  • #128 Expand distribution diagrams of various attributes in ClinVar data with clinical significance, star rating, and mode of inheritance.
  • #135 Drop existing support for haplotypes, since it handled all variants of a haplotype as though they were independent. As can be seen in the report, haplotypes only account for a very minor fraction of ClinVar variants.
  • #136 Remove the resource_score section from gene2variant. This contained a placeholder value, which started to cause problems after recent updates of OT JSON schema.

v1.0: Inaugural release for the mono repository

22 Jul 10:50
255dbae

After about 5 years of development of this pipeline, it's probably about time to start doing the formal versioning and releases. Recently I merged everything into one mono repository (this one) and renamed it to eva-opentargets. It contains:

  • The core pipeline used to process ClinVar data and generate Open Targets submission → eva_cttv_pipeline;
  • Batch submission and manual curation documentation → docs;
  • Helper pipelines for querying VEP and for processing repeat expansion variants → vep-mapping-pipeline;
  • Scripts for analysing the ClinVar data model → clinvar-variant-types;
  • Helper scripts for comparing sets of Open Targets evidence strings → compare-evidence-strings.

The versioning scheme used will more or less comply with the semantic versioning approach.