-
Notifications
You must be signed in to change notification settings - Fork 6
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Implement an identifier mapping service (#41)
References biopragmatics/bioregistry#686. This pull request implements the identifier mapping service described in [SPARQL-enabled identifier conversion with Identifiers.org](https://pubmed.ncbi.nlm.nih.gov/25638809/). The goal of such a service is to act as an interoperability in SPARQL queries that federate data from multiple places and potentially use different IRIs for the same things.
- Loading branch information
Showing
6 changed files
with
488 additions
and
3 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,215 @@ | ||
# -*- coding: utf-8 -*- | ||
|
||
"""Identifier mappings service. | ||
This contains an implementation of the service described in `SPARQL-enabled identifier | ||
conversion with Identifiers.org <https://pubmed.ncbi.nlm.nih.gov/25638809/>`_. | ||
The idea here is that you can write a SPARQL query like the following: | ||
.. code-block:: sparql | ||
PREFIX biomodel: <http://identifiers.org/biomodels.db/> | ||
PREFIX bqbio: <http://biomodels.net/biology-qualifiers#> | ||
PREFIX sbmlrdf: <http://identifiers.org/biomodels.vocabulary#> | ||
PREFIX up: <http://purl.uniprot.org/core/> | ||
SELECT DISTINCT ?protein ?protein_domain | ||
WHERE { | ||
# The first part of this query extracts the proteins appearing in an RDF serialization | ||
# of the BioModels database (see https://www.ebi.ac.uk/biomodels/BIOMD0000000372) on | ||
# insulin/glucose feedback. Note that modelers call entities appearing in compartmental | ||
# models "species", and this does not refer to taxa. | ||
biomodel:BIOMD0000000372 sbmlrdf:species/bqbio:isVersionOf ?biomodels_protein . | ||
# The second part of this query maps BioModels protein IRIs to UniProt protein IRIs | ||
# using service XXX - that's what we're implementing here. | ||
SERVICE <XXX> { | ||
?biomodels_protein owl:sameAs ?uniprot_protein. | ||
} | ||
# The third part of this query gets links between UniProt proteins and their | ||
# domains. Since the service maps between the BioModels query, this only gets | ||
# us relevant protein domains to the insulin/glucose model. | ||
SERVICE <http://beta.sparql.uniprot.org/sparql> { | ||
?uniprot_protein a up:Protein; | ||
up:organism taxon:9606; | ||
rdfs:seeAlso ?protein_domain. | ||
} | ||
} | ||
The SPARQL endpoint running at the web address XXX takes in the bound values for `?biomodels_protein` | ||
one at a time and dynamically generates triples with `owl:sameAs` as the predicate mapping and other | ||
equivalent IRIs (based on the definition of the converter) as the objects. This allows for gluing | ||
together multiple services that use different URIs for the same entities - in this example, there | ||
are two ways of referring to UniProt Proteins: | ||
1. The BioModels database example represents a SBML model on insulin-glucose feedback and uses legacy | ||
Identifiers.org URIs for proteins such as http://identifiers.org/uniprot/P01308. | ||
2. The first-part UniProt database uses its own PURLs such as https://purl.uniprot.org/uniprot/P01308. | ||
.. seealso:: | ||
- Jerven Bolleman's implementation of this service in Java: https://github.com/JervenBolleman/sparql-identifiers | ||
- Vincent Emonet's `SPARQL endpoint for RDFLib generator <https://github.com/vemonet/rdflib-endpoint>`_ | ||
""" | ||
|
||
import itertools as itt | ||
from typing import TYPE_CHECKING, Any, Collection, Iterable, List, Set, Tuple, Union, cast | ||
|
||
from rdflib import OWL, Graph, URIRef | ||
|
||
from .rdflib_custom import JervenSPARQLProcessor # type: ignore | ||
from ..api import Converter | ||
|
||
if TYPE_CHECKING: | ||
import flask | ||
|
||
__all__ = [ | ||
"CURIEServiceGraph", | ||
"get_flask_mapping_blueprint", | ||
"get_flask_mapping_app", | ||
] | ||
|
||
|
||
def _prepare_predicates(predicates: Union[None, str, Collection[str]] = None) -> Set[URIRef]: | ||
if predicates is None: | ||
return {OWL.sameAs} | ||
if isinstance(predicates, str): | ||
return {URIRef(predicates)} | ||
return {URIRef(predicate) for predicate in predicates} | ||
|
||
|
||
class CURIEServiceGraph(Graph): # type:ignore | ||
"""A service that implements identifier mapping based on a converter.""" | ||
|
||
converter: Converter | ||
predicates: Set[URIRef] | ||
|
||
def __init__( | ||
self, | ||
*args: Any, | ||
converter: Converter, | ||
predicates: Union[None, str, List[str]] = None, | ||
**kwargs: Any, | ||
) -> None: | ||
"""Instantiate the graph. | ||
:param args: Positional arguments to pass to :meth:`rdflib.Graph.__init__` | ||
:param converter: A converter object | ||
:param predicates: A predicate or set of predicates. If not given, this service | ||
will use `owl:sameAs` as a predicate for mapping IRIs. | ||
:param kwargs: Keyword arguments to pass to :meth:`rdflib.Graph.__init__` | ||
In the following example, a service graph is instantiated using a small example | ||
converter, then an example SPARQL query is made directly to show how it makes | ||
results: | ||
.. code-block:: python | ||
from curies import Converter | ||
from curies.mapping_service import CURIEServiceGraph | ||
converter = Converter.from_priority_prefix_map( | ||
{ | ||
"CHEBI": [ | ||
"https://www.ebi.ac.uk/chebi/searchId.do?chebiId=", | ||
"http://identifiers.org/chebi/", | ||
"http://purl.obolibrary.org/obo/CHEBI_", | ||
], | ||
"GO": ["http://purl.obolibrary.org/obo/GO_"], | ||
"OBO": ["http://purl.obolibrary.org/obo/"], | ||
..., | ||
} | ||
) | ||
graph = CURIEServiceGraph(converter=converter) | ||
res = graph.query(''' | ||
SELECT ?o WHERE { | ||
VALUES ?s { | ||
<http://purl.obolibrary.org/obo/CHEBI_1> | ||
} | ||
?s owl:sameAs ?o | ||
} | ||
''') | ||
The results of this are: | ||
====================================== ================================================= | ||
subject object | ||
-------------------------------------- ------------------------------------------------- | ||
http://purl.obolibrary.org/obo/CHEBI_1 http://purl.obolibrary.org/obo/CHEBI_1 | ||
http://purl.obolibrary.org/obo/CHEBI_1 http://identifiers.org/chebi/1 | ||
http://purl.obolibrary.org/obo/CHEBI_1 https://www.ebi.ac.uk/chebi/searchId.do?chebiId=1 | ||
====================================== ================================================= | ||
""" | ||
self.converter = converter | ||
self.predicates = _prepare_predicates(predicates) | ||
super().__init__(*args, **kwargs) | ||
|
||
def triples( | ||
self, triple: Tuple[URIRef, URIRef, URIRef] | ||
) -> Iterable[Tuple[URIRef, URIRef, URIRef]]: | ||
"""Generate triples, overriden to dynamically generate mappings based on this graph's converter.""" | ||
subj_query, pred_query, obj_query = triple | ||
if pred_query not in self.predicates: | ||
return | ||
if subj_query is None and obj_query is None: | ||
return # can't generate based on this | ||
if subj_query is None and obj_query is not None: | ||
prefix, identifier = self.converter.parse_uri(obj_query) | ||
if prefix is None or identifier is None: | ||
return | ||
subjects = [ | ||
URIRef(sub) | ||
for sub in cast(Collection[str], self.converter.expand_pair_all(prefix, identifier)) | ||
] | ||
for subj, pred in itt.product(subjects, self.predicates): | ||
yield subj, pred, obj_query | ||
elif subj_query is not None and obj_query is None: | ||
prefix, identifier = self.converter.parse_uri(subj_query) | ||
if prefix is None or identifier is None: | ||
return | ||
objects = [ | ||
URIRef(obj) | ||
for obj in cast(Collection[str], self.converter.expand_pair_all(prefix, identifier)) | ||
] | ||
for obj, pred in itt.product(objects, self.predicates): | ||
yield subj_query, pred, obj | ||
else: # subj_query is not None and obj_query is not None | ||
return # too much specification? maybe just return one triple then? | ||
|
||
|
||
def get_flask_mapping_blueprint(converter: Converter, **kwargs: Any) -> "flask.Blueprint": | ||
"""Get a blueprint for :class:`flask.Flask`. | ||
:param converter: A converter | ||
:param kwargs: Keyword arguments passed through to :class:`flask.Blueprint` | ||
:return: A blueprint | ||
""" | ||
from flask import Blueprint, Response, request | ||
|
||
blueprint = Blueprint("mapping", __name__, **kwargs) | ||
graph = CURIEServiceGraph(converter=converter) | ||
processor = JervenSPARQLProcessor(graph=graph) | ||
|
||
@blueprint.route("/sparql", methods=["GET", "POST"]) # type:ignore | ||
def serve_sparql() -> "Response": | ||
"""Run a SPARQL query and serve the results.""" | ||
sparql = (request.args if request.method == "GET" else request.json).get("query") | ||
if not sparql: | ||
return Response("Missing parameter query", 400) | ||
results = graph.query(sparql, processor=processor).serialize(format="json") | ||
return Response(results) | ||
|
||
return blueprint | ||
|
||
|
||
def get_flask_mapping_app(converter: Converter) -> "flask.Flask": | ||
"""Get a Flask app for the mapping service.""" | ||
from flask import Flask | ||
|
||
blueprint = get_flask_mapping_blueprint(converter) | ||
app = Flask(__name__) | ||
app.register_blueprint(blueprint) | ||
return app |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
# -*- coding: utf-8 -*- | ||
# type: ignore | ||
|
||
"""A custom SPARQL processor that optimizes the query based on https://github.com/RDFLib/rdflib/pull/2257.""" | ||
|
||
from typing import Union | ||
|
||
from rdflib.plugins.sparql.algebra import translateQuery | ||
from rdflib.plugins.sparql.evaluate import evalQuery | ||
from rdflib.plugins.sparql.parser import parseQuery | ||
from rdflib.plugins.sparql.parserutils import CompValue | ||
from rdflib.plugins.sparql.processor import SPARQLProcessor | ||
from rdflib.plugins.sparql.sparql import Query | ||
|
||
__all__ = ["JervenSPARQLProcessor"] | ||
|
||
|
||
class JervenSPARQLProcessor(SPARQLProcessor): | ||
"""A custom SPARQL processor that optimizes the query based on https://github.com/RDFLib/rdflib/pull/2257. | ||
Why is this necessary? Ideally, we get queries like | ||
.. code-block:: sparql | ||
SELECT * WHERE { | ||
VALUES ?s { :a :b ... } | ||
?s owl:sameAs ?o | ||
} | ||
This is fine, since the way that RDFLib parses and constructs an abstract syntax tree, the values | ||
for ``?s`` get bound properly when calling a custom :func:`rdflib.Graph.triples`. However, it's also | ||
valid SPARQL to have the ``VALUES`` clause outside of the ``WHERE`` clause like | ||
.. code-block:: sparql | ||
SELECT * WHERE { | ||
?s owl:sameAs ?o | ||
} | ||
VALUES ?s { :a :b ... } | ||
Unfortunately, this trips up RDFLib since it doesn't know to bind the values before calling ``triples()``, | ||
therefore thwarting our custom implementation that dynamically generates triples based on the bound values | ||
themselves. | ||
This processor, originally by Jerven Bolleman in https://github.com/RDFLib/rdflib/pull/2257, | ||
adds some additional logic between parsing + constructing the abstract syntax tree and evaluation | ||
of the syntax tree. Basically, the abstract syntax tree has nodes with two or more children. Jerven's | ||
clever code (see :func:`_optimize_node` below) finds *Join* nodes that have a ``VALUES`` clause in the | ||
second of its two arguments, then flips them around. It does this recursively for the whole tree. | ||
This gets us to the goal of having the ``VALUES`` clauses appear first, therefore making sure that their | ||
bound values are available to the ``triples`` function. | ||
""" | ||
|
||
def query( | ||
self, | ||
query: Union[str, Query], | ||
initBindings=None, # noqa:N803 | ||
initNs=None, # noqa:N803 | ||
base=None, | ||
DEBUG=False, | ||
): | ||
"""Evaluate a SPARQL query on this processor's graph.""" | ||
if isinstance(query, str): | ||
parse_tree = parseQuery(query) | ||
query = translateQuery(parse_tree, base, initNs) | ||
return self.query(query, initBindings=initBindings, base=base) | ||
|
||
query.algebra = _optimize_node(query.algebra) | ||
return evalQuery(self.graph, query, initBindings or {}, base) | ||
|
||
|
||
# From Jerven's PR to RDFLib (https://github.com/RDFLib/rdflib/pull/2257) | ||
def _optimize_node(comp_value: CompValue) -> CompValue: | ||
if ( | ||
comp_value.name == "Join" | ||
and comp_value.p1.name != "ToMultiSet" | ||
and comp_value.p2.name == "ToMultiSet" | ||
): | ||
comp_value.update(p1=comp_value.p2, p2=comp_value.p1) | ||
for inner_comp_value in comp_value.values(): | ||
if isinstance(inner_comp_value, CompValue): | ||
_optimize_node(inner_comp_value) | ||
return comp_value |
Oops, something went wrong.