Wimmics/WheatGenomicsSLKG

WheatGenomicsSLKG

The Wheat Genomics Scientific Literature Knowledge Graph (WheatGenomicsSLKG) is a FAIR knowledge graph that uses Semantic Web technologies to integrate information about Named Entities (NEs) automatically extracted from a corpus of PubMed scientific articles on wheat genetics and genomics.

From this corpus, we extract Named Entities of four types (genes, phenotypes, taxon names and varieties) mentioned in the titles and abstracts of the publications, as well as the relationships between mentions of wheat varieties and phenotypes.

The figure below shows a PubMed publication in which three types of NEs are recognized: genes (e.g., Sr2, Lr27, Lr34), phenotypes (e.g., leaf rust resistance, resistance to stem rust, powdery mildew resistance) and a taxon name (e.g., wheat).

Figure: gene, phenotype and taxon named entities recognized in a PubMed publication.

Semantic Data Model

Mentions of phenotypes and taxa are linked to existing entities defined in the Wheat Trait and Phenotype Ontology OWL/SKOS (WTO) and the NCBI taxonomy, respectively.

The core of the WheatGenomicsSLKG data model is based on the W3C Web Annotation Ontology (OA), complemented with several vocabularies to describe document metadata:

| Vocabulary | Prefix | URI |
|---|---|---|
| Bibliographic Ontology (BIBO) | bibo | http://purl.org/ontology/bibo/ |
| FaBiO (FRBR-aligned Bibliographic Ontology) | fabio | http://purl.org/spar/fabio/ |
| Dublin Core Elements | dce | http://purl.org/dc/elements/1.1/ |
| Dublin Core Terms | dct | http://purl.org/dc/terms/ |
| Schema.org | schema | http://schema.org/ |
| Web Annotation Vocabulary | oa | https://www.w3.org/TR/annotation-vocab/ |
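
As a rough illustration of how these vocabularies can fit together, the sketch below assembles N-Triples for one hypothetical annotation: a taxon mention in an article title, linked to an NCBI taxonomy entity. The `example.org` IRIs, the PMID, the title, the OBO-style taxon IRI and the overall shape of the annotation are assumptions made for this example and are not taken from the actual graph. Only the vocabulary terms themselves come from the vocabularies listed above. The `oa` prefix is bound here to the namespace IRI `http://www.w3.org/ns/oa#`, whereas the table above links to the vocabulary's documentation page.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

PREFIXES = {
    "bibo": "http://purl.org/ontology/bibo/",
    "fabio": "http://purl.org/spar/fabio/",
    "dct": "http://purl.org/dc/terms/",
    "oa": "http://www.w3.org/ns/oa#",
}


def expand(curie):
    """Turn a prefixed name such as 'oa:hasBody' into a full IRI."""
    prefix, _, local = curie.partition(":")
    return PREFIXES[prefix] + local


def iri(value):
    return f"<{value}>"


def literal(text):
    """Serialize a plain string literal with minimal N-Triples escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def describe_mention(article, annotation, entity, pmid, title, exact):
    """Return N-Triples lines for an article and one annotation on its title."""
    target = annotation + "/target"      # hypothetical IRI pattern
    selector = annotation + "/selector"  # hypothetical IRI pattern
    triples = [
        (iri(article), iri(RDF_TYPE), iri(expand("fabio:ResearchPaper"))),
        (iri(article), iri(expand("bibo:pmid")), literal(pmid)),
        (iri(article), iri(expand("dct:title")), literal(title)),
        (iri(annotation), iri(RDF_TYPE), iri(expand("oa:Annotation"))),
        (iri(annotation), iri(expand("oa:hasBody")), iri(entity)),
        (iri(annotation), iri(expand("oa:hasTarget")), iri(target)),
        (iri(target), iri(expand("oa:hasSource")), iri(article)),
        (iri(target), iri(expand("oa:hasSelector")), iri(selector)),
        (iri(selector), iri(RDF_TYPE), iri(expand("oa:TextQuoteSelector"))),
        (iri(selector), iri(expand("oa:exact")), literal(exact)),
    ]
    return [" ".join(triple) + " ." for triple in triples]


# All values below are made up for illustration.
lines = describe_mention(
    article="http://example.org/article/00000000",
    annotation="http://example.org/annotation/1",
    entity="http://purl.obolibrary.org/obo/NCBITaxon_4565",
    pmid="00000000",
    title="A made-up title about rust resistance in wheat",
    exact="wheat",
)
print("\n".join(lines))
```

The annotation body points at the linked entity, while the target identifies the article and the exact text span that was recognized.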

Wheat Trait and Phenotype Ontology

The Wheat Trait and Phenotype Ontology OWL/SKOS is the RDF representation, using OWL and SKOS, of the previously defined Wheat Trait and Phenotype Ontology. It is an ontology of wheat traits and of the environmental factors that affect them. Traits include resistance to disease, development, nutrition, bread quality, etc. Environmental factors include biotic and abiotic factors.

The files of WTO are also provided in this repository, in folder dataset/WTO-v3.1.

Generation Pipeline

First, we use a SPARQL micro-service to query PubMed's Web API and translate the articles' metadata (including the title and abstract) into RDF. In some cases, the abstract of a publication is split into sub-sections. The SPARQL micro-service is currently deployed publicly at this URL.
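
To make the sub-section case concrete, here is a minimal sketch of the kind of transformation involved, written in plain Python rather than as a SPARQL micro-service. It parses an XML fragment shaped like a PubMed article record, in which a structured abstract appears as several `AbstractText` elements carrying a `Label` attribute, and emits `dct:title` and `dct:abstract` statements as N-Triples. The sample record, the PMID, the `example.org` IRI and the choice to emit one `dct:abstract` statement per sub-section are assumptions for illustration; the actual micro-service may model sub-sections differently.

```python
import xml.etree.ElementTree as ET

DCT = "http://purl.org/dc/terms/"

# Made-up record that only imitates the layout of a PubMed article.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>00000000</PMID>
    <Article>
      <ArticleTitle>A made-up title about rust resistance in wheat.</ArticleTitle>
      <Abstract>
        <AbstractText Label="BACKGROUND">First made-up section.</AbstractText>
        <AbstractText Label="RESULTS">Second made-up section.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def literal(text):
    """Serialize a plain string literal with minimal N-Triples escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def article_to_ntriples(xml_text, base="http://example.org/article/"):
    """Emit one title triple and one abstract triple per abstract section."""
    root = ET.fromstring(xml_text)
    pmid = root.findtext(".//PMID")
    title = root.findtext(".//ArticleTitle")
    subject = f"<{base}{pmid}>"  # hypothetical IRI pattern
    lines = [f"{subject} <{DCT}title> {literal(title)} ."]
    for section in root.iter("AbstractText"):
        text = "".join(section.itertext()).strip()
        label = section.get("Label")
        if label:  # keep the sub-section heading when the abstract is structured
            text = f"{label}: {text}"
        lines.append(f"{subject} <{DCT}abstract> {literal(text)} .")
    return lines


triples = article_to_ntriples(SAMPLE)
print("\n".join(triples))
```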

In parallel, the AlvisNLP tool is used to extract and link the named entities mentioned in the titles and abstracts. The output consists of CSV files that must be translated to RDF.

The CSV files are translated into RDF using Morph-xR2RML, an implementation of the xR2RML mapping language for MongoDB databases. The mapping files are provided in the mapping-rules directory.
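
The effect of such a mapping can be approximated in a few lines of Python. The sketch below is not xR2RML and does not use MongoDB or Morph-xR2RML: it only mimics what a template-based mapping rule does, namely minting one IRI per CSV row and emitting triples from the row's columns. The column names (`pmid`, `start`, `end`, `entity`), the sample rows and the `example.org` IRI templates are hypothetical; the real column layout is determined by the AlvisNLP output and by the files in mapping-rules.

```python
import csv
import io

OA = "http://www.w3.org/ns/oa#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical IRI templates, filled in from the columns of each row.
ANNOTATION_TEMPLATE = "http://example.org/annotation/{pmid}_{start}_{end}"
ARTICLE_TEMPLATE = "http://example.org/article/{pmid}"

# Made-up rows; the entity IRIs are placeholders for linked entities.
SAMPLE_CSV = """pmid,start,end,entity
00000000,31,36,http://purl.obolibrary.org/obo/NCBITaxon_4565
00000000,50,70,http://example.org/wto/some-phenotype
"""


def rows_to_ntriples(csv_text):
    """Emit three N-Triples lines per CSV row: type, body and target."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        annotation = ANNOTATION_TEMPLATE.format(**row)
        article = ARTICLE_TEMPLATE.format(**row)
        lines.append(f"<{annotation}> <{RDF_TYPE}> <{OA}Annotation> .")
        lines.append(f"<{annotation}> <{OA}hasBody> <{row['entity']}> .")
        lines.append(f"<{annotation}> <{OA}hasTarget> <{article}> .")
    return lines


ntriples = rows_to_ntriples(SAMPLE_CSV)
print("\n".join(ntriples))
```

Deriving the annotation IRI from the PMID and the character offsets keeps it stable across runs, which is the usual motivation for template-based subject IRIs in mapping languages.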

License

The code used to produce the knowledge graph is licensed under the Apache License 2.0.

The RDF data files produced by the code are made available under the terms of the Open Data Commons Attribution License v1.0 (ODC-By-1.0).