ISSA RDF data modeling

A document analyzed by the ISSA pipeline is described in three parts: general metadata (title, authors, publication date, etc.); thematic descriptors characterizing the document as a whole, together with the document's domains and the authors' keywords; and named entities extracted from the document's parts (title, abstract, body_text).


Below we use the following namespaces:

@prefix rdfs:   <>.
@prefix owl:    <>.
@prefix xsd:    <> .

@prefix bibo:   <> .
@prefix dce:    <>.
@prefix dct:    <>.
@prefix fabio:  <> .
@prefix eprint: <> .
@prefix foaf:   <>.
@prefix frbr:   <>.
@prefix oa:     <>.
@prefix prov:   <>.
@prefix schema: <>.

@prefix issa:   <>.
@prefix issapr: <>.

👉 The namespace is used for a specific ISSA instance (e.g. Agritrop). It can be replaced by any other namespace.

Document metadata

Document URIs are formatted as where document_id is a unique document identifier.

RDF resources representing documents can be instances of various classes depending on their type:

  • article (fabio:ResearchPaper, schema:ScholarlyArticle, bibo:AcademicArticle, eprint:JournalArticle)
  • conference article (fabio:ConferencePaper, eprint:ConferencePaper)
  • book (fabio:Book, bibo:Book, eprint:Book)
  • book section (fabio:BookChapter, bibo:BookSection, eprint:BookItem)
  • thesis (fabio:Thesis, bibo:Thesis, eprint:Thesis)
  • application (fabio:ComputerApplication)
  • data management plan (fabio:DataManagementPlan)
  • film (fabio:Film, bibo:AudioVisualDocument)
  • map (fabio:StillImage, bibo:Map)
  • monograph (fabio:Expression, bibo:Document, eprint:Text)
  • patent (fabio:Patent, eprint:Patent)
  • report (fabio:Report, bibo:Report, eprint:Report)
  • review (fabio:Review)
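
As an illustration only, the type-to-class mapping above can be held in a plain lookup table. The sketch below transcribes the list into a Python dictionary; the names `DOCUMENT_CLASSES` and `type_statement` are hypothetical helpers and are not part of the ISSA pipeline.

```python
# Document types and their RDF classes, transcribed from the list above.
DOCUMENT_CLASSES = {
    "article": ["fabio:ResearchPaper", "schema:ScholarlyArticle",
                "bibo:AcademicArticle", "eprint:JournalArticle"],
    "conference article": ["fabio:ConferencePaper", "eprint:ConferencePaper"],
    "book": ["fabio:Book", "bibo:Book", "eprint:Book"],
    "book section": ["fabio:BookChapter", "bibo:BookSection", "eprint:BookItem"],
    "thesis": ["fabio:Thesis", "bibo:Thesis", "eprint:Thesis"],
    "application": ["fabio:ComputerApplication"],
    "data management plan": ["fabio:DataManagementPlan"],
    "film": ["fabio:Film", "bibo:AudioVisualDocument"],
    "map": ["fabio:StillImage", "bibo:Map"],
    "monograph": ["fabio:Expression", "bibo:Document", "eprint:Text"],
    "patent": ["fabio:Patent", "eprint:Patent"],
    "report": ["fabio:Report", "bibo:Report", "eprint:Report"],
    "review": ["fabio:Review"],
}


def type_statement(document_type: str) -> str:
    """Return the Turtle 'a ...' fragment for a document type (hypothetical helper)."""
    classes = DOCUMENT_CLASSES[document_type.lower()]  # KeyError for unknown types
    return "a " + ", ".join(classes)


print(type_statement("thesis"))
```

An unknown document type raises a `KeyError` rather than silently producing an untyped resource, which seems the safer default for a sketch like this.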

For each document, the available metadata are mapped as follows, as far as possible (not all metadata exist for every type of document):

  • title (dct:title)
  • authors (dce:creator)
  • authors in ordered list (bibo:authorList)
  • publication date (dct:issued)
  • journal (schema:publication)
  • license (dct:license)
  • access rights (dct:accessRights)
  • terms and conditions (dct:rights)
  • identifiers
    • archive internal identifier (dct:identifier)
    • DOI (bibo:doi)
  • source (API) from which the metadata information was retrieved (dct:source)
  • document page URL (schema:url)
  • source PDF download URL (schema:downloadUrl)
  • alternate PDF download URLs (schema:sameAs)
  • language
    • language string (dce:language)
    • language URI (dct:language)
  • provenance
    • dataset name and version (rdfs:isDefinedBy)
    • source data URI (prov:wasDerivedFrom)
    • source data creation timestamp (prov:generatedAtTime), i.e. the time at which the document was added to the source archive
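
The mapping above can be sketched as a small translation step from a source record to property/value pairs. This is a minimal sketch under stated assumptions: the record field names (`title`, `authors`, `doi`, ...) are invented for the illustration and do not reflect the actual format returned by any source API; only the property names come from the list above.

```python
# Hypothetical record field names mapped to the properties listed above.
METADATA_PROPERTIES = {
    "title": "dct:title",
    "authors": "dce:creator",
    "publication_date": "dct:issued",
    "journal": "schema:publication",
    "license": "dct:license",
    "access_rights": "dct:accessRights",
    "rights": "dct:rights",
    "identifier": "dct:identifier",
    "doi": "bibo:doi",
    "source": "dct:source",
    "url": "schema:url",
    "download_url": "schema:downloadUrl",
    "language_string": "dce:language",
    "language_uri": "dct:language",
}


def to_property_values(record: dict) -> list:
    """Map a record to (property, value) pairs, skipping unknown or empty fields."""
    pairs = []
    for field, value in record.items():
        prop = METADATA_PROPERTIES.get(field)
        if prop is None or value in (None, "", []):
            continue  # not all metadata exist for all types of documents
        values = value if isinstance(value, list) else [value]
        pairs.extend((prop, v) for v in values)
    return pairs


record = {
    "title": "Accounting for the ecological dimension in participatory research and development",
    "authors": ["Pfund, Jean-Laurent", "Laumonier, Yves"],
    "doi": "",             # missing DOI: no bibo:doi pair is produced
    "page_count": 12,      # field without a mapping: ignored
}
for prop, value in to_property_values(record):
    print(prop, value)
```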

Furthermore, documents are linked to their parts (title, abstract, body) as follows:

  • issapr:hasTitle <>
  • dct:abstract <>
  • issapr:hasBody <>.

👉 In the Agritrop use case, only journal articles have an associated body text.

Here is an example of a journal article's metadata:

  a                      prov:Entity, fabio:ResearchPaper, bibo:AcademicArticle, eprint:JournalArticle, schema:ScholarlyArticle;
  dct:title              "Accounting for the ecological dimension in participatory research and development : lessons learned from Indonesia and Madagascar";
  dce:creator            "Pfund, Jean-Laurent", "Laumonier, Yves", "Bourgeois, Robin";
  bibo:authorList        [ a       rdf:List ;
                            rdf:first "Laumonier, Yves" ;
                            rdf:rest ("Bourgeois, Robin" "Pfund, Jean-Laurent")
                         ] ;
  schema:publication     "Ecology and Society";
  dct:issued             "2008"^^xsd:gYear;

  dct:accessRights       <info:eu-repo/semantics/openAccess> ;
  dct:rights             <>;

  dct:identifier         "543654";
  schema:url             <> ;
  schema:downloadUrl     <>;
  schema:sameAs          <>;

  dce:language           "eng";
  dct:language           <>;

  rdfs:isDefinedBy       issa:issa-agritrop;
  prov:generatedAtTime   "2020-11-21T13:17:03Z"^^xsd:dateTime;
  prov:wasDerivedFrom    <>;

  issapr:hasTitle        <> ;
  dct:abstract           <> ;
  issapr:hasBody         <> .

Thematic descriptors

The thematic descriptors are concepts characterizing a document as a whole. They are described as annotations using the Web Annotation Vocabulary.

Each annotation consists of the following information:

  • the annotation target (oa:hasTarget) is the document it is about (schema:about)
  • the annotation body (oa:hasBody) gives the URI of the resource identified as representing the thematic descriptor (e.g. an Agrovoc category URI)
  • provenance
    • dataset name and version (rdfs:isDefinedBy)
    • the agent that assigned this descriptor to a document (prov:wasAttributedTo)
      • a human documentalist (issa:Documentalist)
      • an automated indexing system, e.g. Annif (issa:AnnifSubjectIndexer)
  • (optional) the confidence score computed by the automated indexer (issapr:confidence)
  • (optional) the rank of the descriptor among all descriptors assigned by the automated indexer (issapr:rank)
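
To make the structure concrete, the fields of a thematic descriptor annotation can be sketched as a small data class. This is an illustrative model only: the class name, the attribute names and the `urn:example:` URIs are assumptions made for the sketch, and the confidence and rank are emitted only when present, as the list above describes them as optional.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ThematicDescriptorAnnotation:
    target: str                         # oa:hasTarget, the document URI
    body: str                           # oa:hasBody, the descriptor URI
    dataset: str                        # rdfs:isDefinedBy
    agent: str                          # prov:wasAttributedTo
    confidence: Optional[float] = None  # issapr:confidence (optional)
    rank: Optional[int] = None          # issapr:rank (optional)

    def property_values(self) -> List[Tuple[str, str]]:
        """Return the annotation as (property, value) pairs."""
        pairs = [
            ("a", "prov:Entity"),
            ("a", "issa:ThematicDescriptorAnnotation"),
            ("oa:hasBody", f"<{self.body}>"),
            ("oa:hasTarget", f"<{self.target}>"),
            ("prov:wasAttributedTo", self.agent),
            ("rdfs:isDefinedBy", self.dataset),
        ]
        if self.confidence is not None:
            pairs.append(("issapr:confidence", str(self.confidence)))
        if self.rank is not None:
            pairs.append(("issapr:rank", str(self.rank)))
        return pairs


# Placeholder URIs, invented for the illustration.
manual = ThematicDescriptorAnnotation(
    target="urn:example:document", body="urn:example:descriptor",
    dataset="issa:issa-agritrop", agent="issa:Documentalist")
automated = ThematicDescriptorAnnotation(
    target="urn:example:document", body="urn:example:descriptor",
    dataset="issa:issa-agritrop", agent="issa:AnnifSubjectIndexer",
    confidence=0.82, rank=1)
```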


# sustainable development
  a                      prov:Entity , issa:ThematicDescriptorAnnotation;
  oa:hasBody             <>;
  oa:hasTarget           <>;
  prov:wasAttributedTo   issa:Documentalist;
  rdfs:isDefinedBy       issa:issa-agritrop.

# natural resource management
  a                      prov:Entity , issa:ThematicDescriptorAnnotation;
  oa:hasBody             <>;
  oa:hasTarget           <>;
  prov:wasAttributedTo   issa:AnnifSubjectIndexer;
  rdfs:isDefinedBy       issa:issa-agritrop;

  issapr:confidence      0.82;
  issapr:rank            1.

👉 In the ISSA Agritrop instance, some of the Agrovoc categories are geographical entities (e.g. countries, regions, cities) and can be categorized as Geographical (Geo) descriptors. To check whether a descriptor has a geographical meaning, the following SPARQL pattern can be used:

      ?descriptorUri <> ?subVocabulary .
      BIND ( REGEX(?subVocabulary, "^Geographical", "i") AS ?isGeographicalDescriptor )
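
For readers post-processing query results outside the triple store, the same test can be reproduced client-side. The sketch below mirrors the case-insensitive, start-anchored match of the pattern above; the function name and the sample sub-vocabulary labels are invented for the illustration.

```python
import re


def is_geographical(sub_vocabulary: str) -> bool:
    """Case-insensitive test that the value starts with 'Geographical'."""
    return re.search(r"^Geographical", sub_vocabulary, flags=re.IGNORECASE) is not None


# Sample values are hypothetical, not taken from Agrovoc.
for label in ["Geographical country level", "geographical region", "Chemicals"]:
    print(label, is_geographical(label))
```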


Each source archive may associate a set of domains with each document. The domains can be proprietary (e.g. AgrIST-thema in Agritrop) or controlled vocabularies (e.g. HAL subjects in HAL).

The domain annotation consists of the following information:

  • the annotation target (oa:hasTarget) is a document
  • the annotation body (oa:hasBody) is the URI of the resource representing the domain
  • provenance
    • dataset name and version (rdfs:isDefinedBy)
    • the agent that assigned this domain to the document (prov:wasAttributedTo), typically a human documentalist (issa:Documentalist)
  • (optional) an automated indexer rank of the descriptor among all assigned (issapr:rank)


  a                      prov:Entity , issa:DomainAnnotation ;
  oa:hasBody             <> ;
  oa:hasTarget           <> ;
  rdfs:isDefinedBy       issa:issa-agritrop ;
  prov:wasAttributedTo   issa:Documentalist ;

  issapr:rank            3.

  rdfs:label             "K01 - Foresterie - Considérations générales". 

Authors' keywords

Some of the document archives (e.g. HAL) may provide a list of keywords assigned by the authors of a document. These keywords are described as annotations as well.

  a                      prov:Entity , issa:AuthorKeywordAnnotation ;
  oa:hasBody             <> ;
  oa:hasTarget           <> ;
  rdfs:isDefinedBy       issa:issa-hal-euromov ;
  prov:wasAttributedTo   issa:Author;

  issapr:rank            3.

  a                      oa:TextualBody ;
  rdf:value              "Coronavirus" ;
  dct:format             "text" ;
  dct:language           "en".

Named entities

The named entities identified in a document are described as annotations using the Web Annotation Vocabulary.

Each annotation consists of the following information:

  • the document it is about (schema:about)
  • the annotation target (oa:hasTarget) describes the piece of the text identified as a named entity as follows:
    • the source (oa:hasSource) is a part of a document where the named entity was detected (title, abstract, or body)
    • the selector (oa:hasSelector) gives the named entity raw text (oa:exact) and its location within the source (oa:start and oa:end)
  • the annotation body (oa:hasBody) gives the URI of the resource identified as representing the named entity (e.g. a Wikidata URI, DBpedia URI, or GeoNames URI)
  • provenance
    • dataset name and version (rdfs:isDefinedBy)
    • the software that assigned this named entity to the document (prov:wasAttributedTo)
  • (optional) domains related to the named entity (dct:subject)
  • (optional) the annotating tool confidence (issapr:confidence)
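
The relation between the quoted text and its offsets can be checked mechanically. The sketch below assumes the convention of the Web Annotation text position selector, where the start offset is zero-based and the end offset is exclusive; whether the ISSA pipeline counts offsets in exactly this way is an assumption, and both helper functions are hypothetical.

```python
def make_selector(source_text: str, phrase: str) -> dict:
    """Build the selector fields for the first occurrence of phrase."""
    start = source_text.find(phrase)
    if start < 0:
        raise ValueError("phrase not found in source text")
    return {"oa:exact": phrase, "oa:start": start, "oa:end": start + len(phrase)}


def selector_is_consistent(source_text: str, selector: dict) -> bool:
    """Check that the offsets select exactly the quoted text."""
    start, end = selector["oa:start"], selector["oa:end"]
    return source_text[start:end] == selector["oa:exact"]


# Invented text: 1733 filler characters followed by a named entity.
text = "x" * 1733 + "natural resource management" + " is discussed."
selector = make_selector(text, "natural resource management")
print(selector["oa:start"], selector["oa:end"])
print(selector_is_consistent(text, selector))
```

A check of this kind can catch offsets computed against a different version of the text, for instance before and after whitespace normalization.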


  a                      prov:Entity , oa:Annotation ;
  rdfs:label             "named entity 'natural resource management'";
  schema:about           <> ;
  dct:subject            "Gas" , "Environment" ;
  issapr:confidence      0.7669;

  oa:hasBody             <> ;
  oa:hasTarget [
      oa:hasSource       <> ;
      oa:hasSelector [
          a              oa:TextPositionSelector, oa:TextQuoteSelector;
          oa:exact       "natural resource management";
          oa:start       1733;
          oa:end         1760
      ]
  ] ;

  rdfs:isDefinedBy       issa:issa-agritrop;
  prov:wasAttributedTo   issa:EntityFishing .

Named Graphs

As a result of the ISSA pipeline, the following named graphs are created:

Data type Named Graph
Annotated text
Human-validated thematic descriptors
Annif-generated thematic descriptors
Documents' domains
Documents' keywords
DBpedia annotations
Wikidata annotations
GeoNames annotations
Instance-specific vocabulary annotations

👉 As a reminder, the namespace is used for a specific ISSA instance (e.g. Agritrop). It can be replaced by any other namespace (e.g. for the HAL Euromov instance).