How to add a new Dataset
At the moment, there are two ways to add a new dataset to GERBIL:
- transform your dataset into a NIF file, or
- implement your own Dataset class.
Note that generating a NIF file is the recommended way. Otherwise, you have to add your dataset permanently to GERBIL (step 3).
Either way, you may have to perform some of the following steps.
First, you need to find the correct Experiment Type for your dataset. The types are described here.
For example:
- If your dataset has named entities annotated (position and URI), it can be used for A2KB.
- If your dataset only has document-level tags listing the entities mentioned in the text, it can be used for C2KB.
You can either generate a NIF file containing your dataset or implement an adapter for it.
You have to write an adapter implementing the Dataset interface. You might want to examine existing adapters such as org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset. Afterwards, you will have to follow step 3 to register your dataset so that the GERBIL system can find it.
For this solution, you have to transform your dataset into the NLP Interchange Format (NIF). The resulting RDF could look like the following example.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix aksw: <http://aksw.org/> .
<http://aksw.org/N3/Reuters-128/81#char=0,589>
a nif:String , nif:Context , nif:RFC5147String ;
nif:beginIndex "0"^^xsd:nonNegativeInteger ;
nif:endIndex "589"^^xsd:nonNegativeInteger ;
nif:isString "General Motors Acceptance Corp, a unit of General Motors Corp..."@en ;
nif:sourceUrl <http://www.research.att.com/~lewis/Reuters-21578/15108> .
<http://aksw.org/N3/Reuters-128/81#char=0,30>
a nif:RFC5147String ;
nif:anchorOf "General Motors Acceptance Corp"^^xsd:string ;
nif:beginIndex "0"^^xsd:nonNegativeInteger ;
nif:endIndex "30"^^xsd:nonNegativeInteger ;
nif:referenceContext <http://aksw.org/N3/Reuters-128/81#char=0,589> ;
itsrdf:taIdentRef <http://dbpedia.org/resource/Ally_Financial> ;
itsrdf:taSource "DBpedia_en_3.9"^^xsd:string .
For this step, the articles https://github.com/AKSW/gerbil/wiki/How-to-generate-a-NIF-dataset and https://github.com/AKSW/gerbil/wiki/Generating-a-NIF-dataset-using-Java may be useful.
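As a rough illustration of the transformation, the following Python sketch renders one document and its entity annotations in the same shape as the example above. It uses plain string formatting instead of an RDF library. The function names (`turtle_escape`, `document_to_nif`) and the `(begin, end, entity_uri)` tuple layout are assumptions made for this sketch and are not part of GERBIL. Offsets are Python string indices, so it is worth checking how your offsets are counted if the text contains unusual characters.

```python
PREFIXES = (
    "@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .\n"
    "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def turtle_escape(text):
    """Escape a string so it can sit inside a double-quoted Turtle literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def document_to_nif(doc_uri, text, annotations, lang="en"):
    """Render one document and its annotations as NIF in Turtle syntax.

    annotations is a list of (begin, end, entity_uri) tuples (hypothetical
    layout); begin and end are string indices into text, end is exclusive.
    """
    context_uri = f"{doc_uri}#char=0,{len(text)}"
    blocks = [
        PREFIXES,
        # The context resource carries the full document text.
        f"<{context_uri}>\n"
        f"    a nif:String , nif:Context , nif:RFC5147String ;\n"
        f'    nif:beginIndex "0"^^xsd:nonNegativeInteger ;\n'
        f'    nif:endIndex "{len(text)}"^^xsd:nonNegativeInteger ;\n'
        f'    nif:isString "{turtle_escape(text)}"@{lang} .\n',
    ]
    for begin, end, entity_uri in annotations:
        if not 0 <= begin < end <= len(text):
            raise ValueError(f"annotation span {begin},{end} is outside the text")
        # One resource per annotated mention, linked back to its context.
        blocks.append(
            f"<{doc_uri}#char={begin},{end}>\n"
            f"    a nif:RFC5147String ;\n"
            f'    nif:anchorOf "{turtle_escape(text[begin:end])}"^^xsd:string ;\n'
            f'    nif:beginIndex "{begin}"^^xsd:nonNegativeInteger ;\n'
            f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger ;\n'
            f"    nif:referenceContext <{context_uri}> ;\n"
            f"    itsrdf:taIdentRef <{entity_uri}> .\n"
        )
    return "\n".join(blocks)
```

Calling `document_to_nif` once per document and writing the concatenated result to a `.ttl` file (emitting the prefixes only once) would give a file of the kind described here. For larger datasets, a proper RDF library is likely the more robust choice.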
After creating the NIF file, the dataset can already be used for experiments by uploading it through the user interface. However, if the dataset should be added permanently, step 3 has to be performed.
Adding a dataset permanently means that it can be chosen from the list of known datasets in the GUI. To achieve this, the configuration of the dataset has to be added to the datasets.properties file. A configuration of an example NIF-based dataset could look like:
org.aksw.gerbil.datasets.definition.MyDataset.name=My first dataset
org.aksw.gerbil.datasets.definition.MyDataset.class=org.aksw.gerbil.dataset.impl.nif.FileBasedNIFDataset
org.aksw.gerbil.datasets.definition.MyDataset.constructorArgs=a/path/to/my/dataset.ttl
org.aksw.gerbil.datasets.definition.MyDataset.cacheable=true
org.aksw.gerbil.datasets.definition.MyDataset.experimentType=A2KB
All properties start with org.aksw.gerbil.datasets.definition, followed by a key that identifies the properties of this example dataset (here MyDataset). The properties define the name of the dataset, the class that is used to load the dataset, and the constructor argument with which the class instance is created. In this example, we use one of the constructors of FileBasedNIFDataset that needs the path to the NIF file. The last two properties define that results for this dataset can be cached and that it can be used for A2KB experiments.
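When several NIF files have to be registered, the five lines can be generated instead of typed by hand. The Python sketch below only reproduces the pattern of the example configuration. The helper name `dataset_properties` and the restriction of the key to letters, digits, and underscores are assumptions of this sketch, not rules taken from GERBIL.

```python
import re

PREFIX = "org.aksw.gerbil.datasets.definition"
NIF_LOADER = "org.aksw.gerbil.dataset.impl.nif.FileBasedNIFDataset"


def dataset_properties(key, name, nif_path, experiment_type, cacheable=True):
    """Render the datasets.properties lines for one NIF-based dataset."""
    # Assumption: keep the key free of dots and spaces so that it cannot be
    # confused with the dotted property path around it.
    if not re.fullmatch(r"[A-Za-z0-9_]+", key):
        raise ValueError(f"unsuitable dataset key: {key!r}")
    entries = [
        ("name", name),
        ("class", NIF_LOADER),
        ("constructorArgs", nif_path),
        ("cacheable", "true" if cacheable else "false"),
        ("experimentType", experiment_type),
    ]
    return "\n".join(f"{PREFIX}.{key}.{field}={value}" for field, value in entries)
```

For instance, `dataset_properties("MyDataset", "My first dataset", "a/path/to/my/dataset.ttl", "A2KB")` yields the five lines of the example configuration, which can then be appended to the datasets.properties file.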