sparql creator query is too narrowly defined #48

Open
iannesbitt opened this issue Oct 9, 2023 · 12 comments · May be fixed by #124

Comments

@iannesbitt

iannesbitt commented Oct 9, 2023

Edit 2023-10-11: updating in light of new info
Both OpenTopography and CanWIN have an issue where no creator is found in the SO doc, making the citations incorrect.

CanWIN example: DataONE representation, dataset landing, schema validation
Citation:

(2023). Dease Strait Mooring Chl-a and Light Data - 2017 and 2019. Canadian Watershed Information Network (CanWIN). [10.34992/jt0y-en67](https://doi.org/10.34992/jt0y-en67), version: sha256:bf5660b425f554166b3a97d1a827792ec62a526925e332a3f805018ce12d0843.

OpenTopography example: DataONE representation, dataset landing, schema validation
Citation:

(2022). Lidar Survey of Dangermond Preserve, CA. OpenTopography. OTSDEM.042022.32611.1, version: sha256:359b20070883f69c5cd5f5b77d7f91c95a6cf31cffaa12fcc05c515946d3c86e.

sonormal (and by extension mnlite) needs to be more adaptable when looking for dataset creators. I think this is set in sonormal.normalize._forceSODatasetLists.

@iannesbitt iannesbitt self-assigned this Oct 9, 2023
@iannesbitt iannesbitt added the bug Something isn't working label Oct 9, 2023
@iannesbitt
Author

I'm having trouble figuring out where to find this problem. It might actually have something to do with how mnlite serves creator information to the requestor.

@mbjones
Member

mbjones commented Oct 10, 2023

What do you mean by "to the requestor"? As a member node, mnlite serves its SO metadata documents as JSON-LD to the DataONE synchronization system, which grabs an exact copy of each SO document. That copy is transferred to the CN and stored in its Metacat object store, and an event is then triggered to queue the document for indexing. The DataONE indexer picks up that queue entry and tries to extract information from the SO JSON-LD document using a JSON-LD subprocessor. The subprocessor loads the SO document and a number of other vocabularies into an in-memory triple store, then runs SPARQL queries against that content to extract values, which are written to SOLR.

For example, here's a link to the subprocessor configuration for extracting a single creator from SO to populate the SOLR author field, showing its SPARQL query. Similar queries are used to populate the other SOLR fields.

@iannesbitt
Author

I didn't understand the simplicity of how mnlite serves records, so that makes sense.

In that case, maybe it makes sense to look at the SPARQL query. I'm not that familiar with SPARQL, but I understand some SQL. Here is a CanWIN creator field:

    "creator": {
      "@type": "Role",
      "creator": {
        "@type": "Person",
        "Affiliation": {
          "@type": "Organization",
          "name": "Centre for Earth Observation Science - University of Manitoba"
        },
        "Email": "[email protected]",
        "Identifier": {
          "@type": "PropertyValue",
          "propertyID": "https://registry.identifiers.org/registry/orcid",
          "url": "https://orcid.org/0009-0001-2454-4614",
          "value": "0009-0001-2454-4614"
        },
        "Name": "Yendamuri, Kiran\t"
      }
    },

And here is the SPARQL query:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX list: <http://jena.hpl.hp.com/ARQ/list#>
PREFIX SO:   <http://schema.org/>

SELECT (?name as ?author)
WHERE {
    ?dsId rdf:type SO:Dataset .
    ?dsId SO:creator ?list .
    ?list list:index (?pos ?member) .
    ?member SO:name ?name .
}
order by (?pos)
limit 1

@mbjones
Member

mbjones commented Oct 11, 2023

@iannesbitt the actual SPARQL query for the origin field, which is what is used for the creator list in citations, is here:

https://github.com/DataONEorg/dataone-indexer/blob/develop/src/main/resources/application-context-schema-org.xml#L304

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX list: <http://jena.hpl.hp.com/ARQ/list#>
PREFIX SO:   <http://schema.org/>

SELECT (?name as ?origin)
WHERE {
    ?dsId rdf:type SO:Dataset .
    ?dsId SO:creator ?list .
    ?list list:index (?pos ?member) .
    ?member SO:name ?name .
}
order by (?pos)

Note that it assumes a list structure. Here's your example doc (slightly enhanced), which does not use a list but instead embeds a creator Role inside creator:

{
  "@context": "https://schema.org",
  "schema:Dataset": {
    "@type": "schema:Dataset",
    "name": "Test dataset",
    "creator": {
      "@type": "Role",
      "creator": {
        "@type": "Person",
        "affiliation": {
          "@type": "Organization",
          "name": "Centre for Earth Observation Science - University of Manitoba"
        },
        "email": "[email protected]",
        "identifier": {
          "@type": "PropertyValue",
          "propertyID": "https://registry.identifiers.org/registry/orcid",
          "url": "https://orcid.org/0009-0001-2454-4614",
          "value": "0009-0001-2454-4614"
        },
        "name": "Yendamuri, Kiran"
      }
    }
  }
}

Somehow we need to support these multiple encoding approaches. Here's a SPARQL query that retrieves both the name and email from that document:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX list: <http://jena.hpl.hp.com/ARQ/list#>
PREFIX SO:   <http://schema.org/>

SELECT ?name ?email
WHERE {
    ?dsId rdf:type SO:Dataset .
    # The outer creator is the Role node; the nested creator is the Person.
    ?dsId SO:creator ?role .
    ?role SO:creator ?person .
    ?person SO:name ?name .
    ?person SO:email ?email .
}

This produces the following results:

{
  "head": {
    "vars": [
      "name",
      "email"
    ]
  },
  "results": {
    "bindings": [
      {
        "name": {
          "value": "Yendamuri, Kiran",
          "type": "literal"
        },
        "email": {
          "value": "[email protected]",
          "type": "literal"
        }
      }
    ]
  },
  "metadata": {
    "httpRequests": 46
  }
}

Ideally we would also capture the ORCID and affiliation.
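As a rough illustration of which fields that would involve, here is a minimal Python sketch that pulls the name, email, ORCID, and affiliation out of the Person object from the example doc above. It works on the plain JSON rather than on the triple store the indexer uses, so it is not how the indexer does it. The helper name `describe_creator`, the output keys, and the placeholder email address are assumptions made for this example.

```python
def describe_creator(person: dict) -> dict:
    """Pull citation-relevant fields out of a schema.org Person dict.

    Hypothetical helper for illustration only; the output keys are made up
    here and are not SOLR field names.
    """
    identifier = person.get("identifier")
    affiliation = person.get("affiliation")
    orcid = None
    # Only treat the identifier as an ORCID when its propertyID says so.
    if isinstance(identifier, dict) and "orcid" in str(identifier.get("propertyID", "")).lower():
        orcid = identifier.get("url") or identifier.get("value")
    return {
        "name": person.get("name"),
        "email": person.get("email"),
        "orcid": orcid,
        # Affiliation may be an Organization object or a bare string.
        "affiliation": affiliation.get("name") if isinstance(affiliation, dict) else affiliation,
    }


person = {
    "@type": "Person",
    "affiliation": {
        "@type": "Organization",
        "name": "Centre for Earth Observation Science - University of Manitoba",
    },
    "email": "creator@example.org",  # placeholder, not the real address
    "identifier": {
        "@type": "PropertyValue",
        "propertyID": "https://registry.identifiers.org/registry/orcid",
        "url": "https://orcid.org/0009-0001-2454-4614",
        "value": "0009-0001-2454-4614",
    },
    "name": "Yendamuri, Kiran",
}

print(describe_creator(person))
```

In the indexer itself, the equivalent would presumably be additional SPARQL variables bound from the same Person node.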

@iannesbitt
Author

@mbjones should I open an issue or PR in the indexer to track this?

@mbjones
Member

mbjones commented Oct 11, 2023

yeah, or we could transfer this issue over to the indexer repo if it has what we need...

@iannesbitt iannesbitt transferred this issue from DataONEorg/sonormal Oct 11, 2023
@iannesbitt iannesbitt changed the title from "creator is not populated correctly" to "sparql creator query is too narrowly defined" Oct 11, 2023
@iannesbitt
Author

Ok, should be good now. How should we test a change like this?

@iannesbitt
Author

iannesbitt commented Oct 11, 2023

@taojing2002, @mbjones, @artntek and I met for a while to discuss this problem today. We came up with a three-point proposal to try and broaden the configurations of creator that DataONE can parse correctly. @datadavev perhaps you can assess whether you think this is a worthwhile plan.

  1. Find more JSON-LD documents with different configurations of creator to test against (for example, the CanWIN "@type": "Role" setup). The folder for JSON-LD test documents in the indexer is here.
  2. Use sonormal to normalize multiple configurations of the creator field, i.e. allow for a broader range of SOSO creator definitions (tracked in sonormal#3, "Normalize broader configurations of creator").
  3. Modify our SPARQL query in dataone-indexer and cn-index-processor to accept more variations of the creator field, not just creator fields with an @list structure as is the case now.
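The normalization in step 2 could be sketched roughly as follows. This is only an illustration in plain Python, not sonormal's actual code. The function name `normalize_creators` and the set of shapes it handles are assumptions based on the encodings seen in this thread: a single object, a plain array, an `@list` wrapper, and a Role wrapping a nested creator.

```python
from typing import Any


def normalize_creators(creator: Any) -> list[dict]:
    """Flatten the creator encodings discussed above into an ordered list of agents.

    Hypothetical helper, not part of sonormal. Handles a single object, a plain
    array, an {"@list": [...]} wrapper, and a Role that wraps a nested creator.
    """
    if creator is None:
        return []
    if isinstance(creator, list):
        # Plain JSON array: normalize each entry, keeping document order.
        return [agent for item in creator for agent in normalize_creators(item)]
    if isinstance(creator, dict):
        if "@list" in creator:
            # Explicit JSON-LD ordered list.
            return normalize_creators(creator["@list"])
        if creator.get("@type") == "Role" and "creator" in creator:
            # Role wrapper (the CanWIN case): unwrap to the nested agent(s).
            return normalize_creators(creator["creator"])
        return [creator]
    # Bare strings and other scalars are skipped in this sketch.
    return []


canwin = {"@type": "Role", "creator": {"@type": "Person", "name": "Yendamuri, Kiran"}}
listed = {"@list": [{"@type": "Person", "name": "A"}, {"@type": "Person", "name": "B"}]}

print([c["name"] for c in normalize_creators(canwin)])
print([c["name"] for c in normalize_creators(listed)])
```

Rewriting every variant to a single ordered list would let the existing list-based SPARQL query stay as it is, at the cost of altering the source document before indexing. The sketch ignores cases such as `@type` given as an array or creators given as bare strings.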

@iannesbitt
Author

Note: the above comment was edited to include @taojing2002 who was mis-tagged in the original

@iannesbitt
Author

I noticed that there is system metadata for each test document, but I didn't see any documentation on how to create it. Is there a way to generate it automatically?

@datadavev
Member

The system metadata needs to accompany an indexer test document primarily to indicate the type of object being sent to the indexer. Other than that, I think the system metadata can contain any valid values.

@datadavev
Member

With respect to the creator steps: the normalization of schema.org metadata on mnlite is used to compare against subsequent retrievals from the same URL to determine whether the content has changed. The schema.org content forwarded to the CN is, I believe (or at least should be), the original content extracted from the landing page, since we've generally followed a principle of not changing content from the sources.

I believe that means the indexer needs to do the ops described in step 2.

That said, it would certainly be much simpler to index if the json-ld content could be pre-processed to a common representation prior to passing on to the CNs. Perhaps such pre-processing should be part of the indexer?
