Configuring SameAs retrieval
For matching URIs, GERBIL tries to make use of owl:sameAs links. This is done based on different retrievers and a mapping of well-known domains to retriever implementations. Given a URI, the retrieval process extracts the domain and uses the retriever registered for this domain to retrieve data about the entity. If the retrieved data contains owl:sameAs links connecting the given URI to other URIs, these new URIs are added to the set of URIs and are used for retrieval as well. This is repeated until no new URIs are found.
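The iterative expansion described above can be sketched as a small worklist loop. This is only an illustration of the idea, not GERBIL's implementation: the function names, the `retrievers` dictionary and the toy link data are hypothetical, and each retriever is assumed to be a plain function that returns the URIs linked to a given URI.

```python
from urllib.parse import urlparse


def extract_domain(uri):
    """Return the host part of a URI, e.g. 'dbpedia.org'."""
    return urlparse(uri).netloc.lower()


def expand_same_as(start_uri, retrievers):
    """Collect all URIs reachable from start_uri via sameAs lookups.

    `retrievers` maps a domain to a function that takes a URI and returns
    an iterable of linked URIs (a hypothetical stand-in for the retriever
    implementations configured per domain).
    """
    known = {start_uri}
    pending = [start_uri]
    while pending:
        uri = pending.pop()
        retriever = retrievers.get(extract_domain(uri))
        if retriever is None:
            continue  # no retriever is configured for this domain
        for linked in retriever(uri):
            if linked not in known:
                known.add(linked)
                pending.append(linked)  # newly found URIs are looked up too
    return known


# Toy link data standing in for retrieved owl:sameAs statements.
toy_links = {
    "http://dbpedia.org/resource/Berlin": ["http://de.dbpedia.org/resource/Berlin"],
    "http://de.dbpedia.org/resource/Berlin": [
        "http://dbpedia.org/resource/Berlin",
        "http://example.org/Berlin",
    ],
}


def toy_lookup(uri):
    return toy_links.get(uri, [])


toy_retrievers = {"dbpedia.org": toy_lookup, "de.dbpedia.org": toy_lookup}
berlin_uris = expand_same_as("http://dbpedia.org/resource/Berlin", toy_retrievers)
```

The loop terminates because every URI is looked up at most once. URIs from domains without a retriever (here `example.org`) are kept in the set but not expanded further.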
Overall, there are three types of retrievers.
The recommended variant is to use a prepared index. We offer an index for DBpedia URIs. When starting GERBIL using the start.sh file, the user is asked whether the index should be downloaded. It is extracted to gerbil_data/indexes/dbpedia, which is the default path for this index.
The path of the index as well as the domains for which it should be used are defined in the gerbil.properties file:
org.aksw.gerbil.semantic.sameas.impl.index.IndexBasedSameAsRetriever.domain=dbpedia.org
org.aksw.gerbil.semantic.sameas.impl.index.IndexBasedSameAsRetriever.folder=${org.aksw.gerbil.DataPath}/indexes/dbpedia
GERBIL can try to retrieve owl:sameAs links from the web at runtime. These retrieval methods have the advantage that the retrieved data is up to date. However, requesting every URI individually is costly. Hence, the runtime of the evaluation increases significantly when these retrievers are used.
The dereferencing retriever uses Apache Jena to retrieve RDF data for the given URI. It can be configured to be used for several domains in the gerbil.properties file:
org.aksw.gerbil.semantic.sameas.impl.http.HTTPBasedSameAsRetriever.domain=de.dbpedia.org
org.aksw.gerbil.semantic.sameas.impl.http.HTTPBasedSameAsRetriever.domain=fr.dbpedia.org
In practice, redirects within Wikipedia can be very helpful, especially when older datasets are used for the evaluation. Hence, GERBIL can make use of the Wikipedia API to retrieve additional URIs and use them similarly to owl:sameAs links. The usage of the Wikipedia API can be configured in the gerbil.properties file by defining the Wikipedia domain for which it should be used:
org.aksw.gerbil.semantic.sameas.impl.wiki.WikipediaApiBasedSingleUriSameAsRetriever.domain=en.wikipedia.org
The costly retrievers should be used with caches to avoid at least some of the HTTP requests. To this end, two cache implementations are available. A simple in-memory cache can be configured with the maximum number of URIs it should store:
org.aksw.gerbil.semantic.sameas.InMemoryCachingSameAsRetriever.cacheSize=5000
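To illustrate what a size-bounded in-memory cache in front of a costly retriever might look like, here is a minimal sketch. The class name, the `retrieve` callback and the least-recently-used eviction policy are assumptions made for the example; the text above only states that the cache holds a maximum number of URIs.

```python
from collections import OrderedDict


class BoundedSameAsCache:
    """Wraps a costly lookup function and remembers up to cache_size results."""

    def __init__(self, retrieve, cache_size=5000):
        self._retrieve = retrieve        # hypothetical costly lookup function
        self._cache_size = cache_size
        self._entries = OrderedDict()    # URI -> frozenset of linked URIs

    def get(self, uri):
        if uri in self._entries:
            # Cache hit: mark the entry as recently used and skip the lookup.
            self._entries.move_to_end(uri)
            return self._entries[uri]
        result = frozenset(self._retrieve(uri))
        self._entries[uri] = result
        if len(self._entries) > self._cache_size:
            # Assumed policy: evict the least recently used entry.
            self._entries.popitem(last=False)
        return result
```

With such a wrapper, repeated lookups of the same URI within one evaluation cause only a single underlying request as long as the entry has not been evicted.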
A heavier caching method is offered as a file-based cache. This implementation persists the results in a file and can reuse them over a longer period of time. It can be used by setting the path to a cache file:
org.aksw.gerbil.semantic.sameas.CachingSameAsRetriever.cacheFile=${org.aksw.gerbil.CachePath}/sameAs.cache
In some cases, HTTP-based retrieval is inefficient because of its high cost. In this case, it can be deactivated by removing all statements that define domains for the dereferencing retriever or the Wikipedia API retriever.
The retriever implementation comes with some additional retrievers that do not need further configuration. We list them just for completeness.
- The DBpedia Wikipedia bridge transforms DBpedia URIs into Wikipedia URIs and vice versa.
- The URI encoding retriever handles the encoding of special characters and, hence, works like a bridge between URIs and IRIs.
- The error fixing retriever is used to implement fixes of common errors. At the moment, it simply transforms faulty en.dbpedia.org URIs into dbpedia.org URIs.
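As a rough illustration of the first and last items in the list above, the following sketch rewrites URI prefixes. The concrete prefixes (`http://dbpedia.org/resource/`, `http://en.wikipedia.org/wiki/`) and the function names are assumptions for the example; the actual retrievers may handle more variants, such as other schemes or language editions.

```python
# Assumed URI prefixes; the real retrievers may cover more cases.
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"
WIKIPEDIA_PREFIX = "http://en.wikipedia.org/wiki/"
FAULTY_DBPEDIA_PREFIX = "http://en.dbpedia.org/"


def bridge(uri):
    """Map a DBpedia URI to its Wikipedia counterpart and vice versa.

    Returns None if the URI belongs to neither of the two namespaces.
    """
    if uri.startswith(DBPEDIA_PREFIX):
        return WIKIPEDIA_PREFIX + uri[len(DBPEDIA_PREFIX):]
    if uri.startswith(WIKIPEDIA_PREFIX):
        return DBPEDIA_PREFIX + uri[len(WIKIPEDIA_PREFIX):]
    return None


def fix_en_dbpedia(uri):
    """Rewrite a faulty en.dbpedia.org URI into a dbpedia.org URI."""
    if uri.startswith(FAULTY_DBPEDIA_PREFIX):
        return "http://dbpedia.org/" + uri[len(FAULTY_DBPEDIA_PREFIX):]
    return uri
```

Because both functions are pure string transformations, they need no network access, which fits the statement that these retrievers require no further configuration.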
Not all owl:sameAs links are helpful. Since many links between datasets are generated automatically, some of them are known to connect entities that should not be connected. Therefore, GERBIL comes with a filter that removes URIs of certain domains from the URI set. The filter can be configured in the gerbil.properties file:
org.aksw.gerbil.semantic.sameas.impl.UriFilteringSameAsRetrieverDecorator.domainBlacklist=data.nytimes.com
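The effect of such a domain blacklist can be sketched as follows. The function name is hypothetical, and the sketch assumes that a URI is dropped when its host matches a blacklisted domain exactly; whether subdomains are also matched is not stated above.

```python
from urllib.parse import urlparse


def filter_blacklisted(uris, domain_blacklist):
    """Return the URIs whose host is not in the blacklist."""
    blocked = {domain.lower() for domain in domain_blacklist}
    return {uri for uri in uris if urlparse(uri).netloc.lower() not in blocked}


# Hypothetical URI set; the data.nytimes.com path is made up for illustration.
candidates = {
    "http://dbpedia.org/resource/Berlin",
    "http://data.nytimes.com/example",
}
kept = filter_blacklisted(candidates, ["data.nytimes.com"])
```

Here `kept` contains only the DBpedia URI, so the filtered URI can no longer pull in further, possibly wrong, links during the expansion.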