Implementation notes #113

Closed
DylanVanAssche opened this issue Mar 5, 2024 · 18 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@DylanVanAssche
Collaborator

DylanVanAssche commented Mar 5, 2024

When fixing RML test cases across modules, we noticed that most modules need some additional notes regarding implementation details. The modules properly describe what something is and what it looks like. However, implementations do not know how to use it. Examples of implementation details we can describe in a separate note:

  • Error codes to generate when a mapping is invalid, data has errors, etc.
  • How reference formulations can be processed
  • Natural Data Type mapping
  • ...

CC: @pmaria @chrdebru

Actions:

  1. Create a separate repository for this Note
  2. Add Respec for the Note
  3. Publish Note and add to Portal

Let us know what you think!

@DylanVanAssche DylanVanAssche added help wanted Extra attention is needed question Further information is requested labels Mar 5, 2024
@pmaria
Collaborator

pmaria commented Mar 5, 2024

+1
IMO we need a note per reference formulation, or at least grouped per type like SQL or SPARQL.
This can then also include the specification of the natural (datatype) mapping and other details that are reference formulation specific.

@chrdebru
Contributor

chrdebru commented Mar 7, 2024

I support this, yes.

@DylanVanAssche
Collaborator Author

kg-construct/rml-io#41 is again such a specific thing, looks more to me like an implementation note than an actual test-case?

@chrdebru
Contributor

chrdebru commented Mar 9, 2024

I agree. Or only rely on Postgres for that test case

@andimou
Contributor

andimou commented Mar 17, 2024

Given that we are a CG, everything is a draft. That being said, I do not disagree with specifying a few reference formulations in more detail, but these can be just examples of what potential reference formulations may look like.

@pmaria
Collaborator

pmaria commented Mar 19, 2024

> Given that we are a CG, everything is a draft. That being said, I do not disagree with specifying a few reference formulations in more detail, but these can be just examples of what potential reference formulations may look like.

I don't agree that the description of reference formulations should be just examples. I think that clearly defining the reference formulations is essential. In R2RML the only reference formulation was SQL, and R2RML defined several aspects of it.

  • How to generate rows to be mapped
  • How to access column values
  • A natural mapping of values

These aspects should be clearly defined for any reference formulation that is introduced for RML. IMHO it would therefore be best to have a note per reference formulation where these can be described.
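
The three aspects listed above could be written down as a small contract that each reference formulation description fills in. The sketch below is only an illustration: `ReferenceFormulation`, `CsvFormulation`, and all method names are hypothetical and do not come from any RML specification or implementation.

```python
import csv
import io
from typing import Any, Iterator, List, Optional, Protocol


class ReferenceFormulation(Protocol):
    """Hypothetical contract covering the three aspects R2RML defined for SQL."""

    def iterate(self, source: str) -> Iterator[Any]:
        """Generate the records ("rows") to be mapped."""

    def evaluate(self, record: Any, reference: str) -> List[Any]:
        """Return the values that a reference selects within one record."""

    def natural_datatype(self, value: Any) -> Optional[str]:
        """Return a datatype IRI for a value, or None for a plain literal."""


class CsvFormulation:
    """Assumed CSV behavior: one record per row, references are column names."""

    def iterate(self, source: str) -> Iterator[dict]:
        yield from csv.DictReader(io.StringIO(source))

    def evaluate(self, record: dict, reference: str) -> List[Any]:
        value = record[reference]  # raises KeyError for an unknown column
        return [] if value is None else [value]

    def natural_datatype(self, value: Any) -> Optional[str]:
        return None  # CSV cells are untyped strings in this sketch


formulation = CsvFormulation()
rows = list(formulation.iterate("id,name\n1,Alice\n2,Bob\n"))
names = [formulation.evaluate(row, "name") for row in rows]
```

A note per reference formulation would then amount to specifying these three behaviors in prose, so that two implementations of the same formulation cannot silently diverge.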

@andimou
Contributor

andimou commented Mar 19, 2024

Well, we can decide with the entire CG if we agree on limiting the Reference Formulations that RML can accept.

In my opinion and how RML was designed so far: RML deliberately left the Reference Formulations unspecified so anyone can define their own Reference Formulation. In this sense, if we specify now some Reference Formulations, it should be examples of such Reference Formulations and RML should not be restricted to these Reference Formulations. One should be able to define any Reference Formulation desired.

@pmaria
Collaborator

pmaria commented Mar 19, 2024

> Well, we can decide with the entire CG if we agree on limiting the Reference Formulations that RML can accept.

I think there is a bit of a misunderstanding. There is no intention to limit the accepted reference formulations. Only to clearly define those that are already mentioned in the specs and test cases, and for which we have definitions in the ontology.

> In my opinion and how RML was designed so far: RML deliberately left the Reference Formulations unspecified so anyone can define their own Reference Formulation. In this sense, if we specify now some Reference Formulations, it should be examples of such Reference Formulations and RML should not be restricted to these Reference Formulations. One should be able to define any Reference Formulation desired.

Agreed. This issue has no intention to change that. The intention of this issue is to define what needs to be defined and described for any reference formulation to be properly handled, and to have a place where we put those definitions. Otherwise we risk that every implementation handles the same reference formulation in a different way.

Since we already have several reference formulations that are broadly in use, the proposal is to define each of these as a note.
Any other new reference formulations that gain broad usage could later on also be added as a separate note at a location that we as a community deem fit.

@DylanVanAssche
Collaborator Author

> The intention of this issue is to define what needs to be defined and described for any reference formulation to be properly handled, and to have a place where we put those definitions.

My 2 cents here: we already 'enforce' specific behavior for reference formulations we use in the spec, ontology, and test cases. In the test cases we have already implicitly 'defined' what happens with a certain source + reference formulation. This note (or these notes) is meant to make that implicit behavior explicit, so developers do not have to read other implementations and interpret the output of each test case to know how a given reference formulation behaves.

@andimou
Contributor

andimou commented Mar 19, 2024

> Only to clearly define those that are already mentioned in the specs and test cases

These are indeed examples of reference formulations. We can include a few but we cannot produce Notes. From the W3C types of documents: "A W3C Draft Note is a document produced by a W3C Working Group, a W3C Interest Group, the Advisory Board (AB), or the W3C Technical Architecture Group (TAG)."

As a CG we publish a report. If specific Reference Formulations come in the report and we use this report for the WG, then the Reference Formulations will become part of the candidate recommendation. Even if we include them as notes, then these are notes for the RML-IO and not RML-core as RML-core is independent of reference formulations.

@DylanVanAssche how do we do that? The test cases in RML-core are independent of reference formulations. In RML-core we are in a situation where we have already retrieved the data and we have key-value pairs with which we deal. There is not (or should not be) anything in RML-core that is reference-formulation-dependent.

@DylanVanAssche
Collaborator Author

> These are indeed examples of reference formulations. We can include a few but we cannot produce Notes. From the W3C types of documents: "A W3C Draft Note is a document produced by a W3C Working Group, a W3C Interest Group, the Advisory Board (AB), or the W3C Technical Architecture Group (TAG)."

Note is maybe not the right wording given the W3C definitions. By 'note' I mean a document with examples and a description of how a reference formulation is supposed to work. See it as a set of guidelines for implementations. Not a hard requirement they MUST follow, but more like a SHOULD as seen from good practice. The same assumptions about the reference formulations documented in there are also made for the output of the test cases.

> The test cases in RML-core are independent of reference formulations.

That's definitely not the case currently. We depend on CSV (more abstractly: tabular data) there (if we move all other data formats out, as proposed in an issue), but we still require implementations to interpret a CSV reference formulation as iterating over each row to correctly generate the triples/quads. If RML-Core were truly independent, no Logical Source could appear in its test cases, but then the test cases would no longer be integration tests as they are now. We cannot use an 'abstract' reference in RML-Core's test cases because at this point it is always tied to some data source defined by the Logical Source. At this point, this 'iterate over each row' behavior is implicitly defined through test cases that assume it. In R2RML this is explicitly defined in the spec:

> Each logical table is mapped to RDF using a triples map. The triples map is a rule that maps each row in the logical table to a number of RDF triples

So this is actually mentioned for R2RML implementations, but not RML implementations. R2RML implementations now know they have to follow a row-based iteration model for RDBs as it is clearly mentioned in the spec. Where we put our 'guidelines' on this matter is of course a point of discussion.
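
To make the implicit 'iterate over each row' behavior concrete, here is a minimal Python sketch of what the test cases currently assume for CSV. `map_rows`, the example data, and the example IRIs are all made up for illustration; real processors also handle IRI-safe encoding and literal escaping, which this sketch skips.

```python
import csv
import io


def map_rows(csv_text, subject_template, predicate, reference):
    """Yield one (subject, predicate, object) triple per CSV row.

    The template uses {Column} placeholders, filled from the current row.
    No IRI encoding or literal escaping is performed in this sketch.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = subject_template.format(**row)
        yield (f"<{subject}>", f"<{predicate}>", f'"{row[reference]}"')


data = "ID,Name\n10,Venus\n20,Mars\n"
triples = list(
    map_rows(data, "http://example.com/{ID}", "http://xmlns.com/foaf/0.1/name", "Name")
)
for triple in triples:
    print(" ".join(triple), ".")
```

Nothing in this loop is stated anywhere for RML's CSV reference formulation; an implementer can only infer it from the expected output of the test cases.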

> In RML-core we are in a situation where we have already retrieved the data and we have key-value pairs with which we deal. There is not (or should not be) anything in RML-core that is reference-formulation-dependent.

That's what RML-Core is supposed to be, yes, but the test cases do not reflect this. How to improve this is a hard question, as the references in rml:reference and rml:template always depend on the reference formulation. Regarding the key-value pairs, that's RML Field. The latter could be the abstracted reference formulation, decoupling RML-Core completely. However, that requires moving Fields into Core.
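
The decoupling idea can be sketched as follows: source-specific adapters turn data into key-value records, and a 'core' step only ever sees those records. This is not the RML Field model itself, only an illustration of the separation; every function name and IRI below is hypothetical.

```python
import csv
import io
import json


def records_from_csv(text):
    """Adapter: each CSV row becomes a dict of column name -> value."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_json_array(text):
    """Adapter: each object in a top-level JSON array becomes one record."""
    return [dict(item) for item in json.loads(text)]


def core_map(records, subject_key, predicate, object_key):
    """Source-agnostic 'core': works on key-value records only."""
    return [
        (f"http://example.com/{record[subject_key]}", predicate, str(record[object_key]))
        for record in records
    ]


NAME = "http://xmlns.com/foaf/0.1/name"
csv_out = core_map(records_from_csv("id,name\n1,Alice\n"), "id", NAME, "name")
json_out = core_map(records_from_json_array('[{"id": 1, "name": "Alice"}]'), "id", NAME, "name")
```

With this split, `core_map` could be tested against hand-written records, and the adapters would be tested separately per reference formulation.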

@pmaria
Collaborator

pmaria commented Mar 19, 2024

> These are indeed examples of reference formulations. We can include a few but we cannot produce Notes. From the W3C types of documents: "A W3C Draft Note is a document produced by a W3C Working Group, a W3C Interest Group, the Advisory Board (AB), or the W3C Technical Architecture Group (TAG)."

> Note is maybe not the right wording given the W3C definitions. By 'note' I mean a document with examples and a description of how a reference formulation is supposed to work. See it as a set of guidelines for implementations. Not a hard requirement they MUST follow, but more like a SHOULD as seen from good practice. The same assumptions about the reference formulations documented in there are also made for the output of the test cases.

IMO these descriptions will be more than a SHOULD, so maybe these should also be reports, and the W3C process can decide how to label them later on.

As for the test cases:

I see the test cases as functional tests, not as unit tests.
With functional tests it is quite normal to add dependencies that are not part of the module under test.

My proposal would be to:

  • use JSONPath as the reference formulation for all core tests (JSONPath in order to have examples of multiple values).
  • introduce tests for specific reference formulations together with the report on that reference formulation.
    These tests can focus on the specifics of the reference formulation, like correct natural mapping of datatypes and such.
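
The 'multiple values' point can be illustrated with a small Python sketch: a single JSONPath-style reference may select several values from one record, and each value then yields its own triple. The `select` function below handles only a tiny, made-up subset of JSONPath (`$`, `.name`, and `[*]`); it is not a real JSONPath implementation, and the data and IRIs are invented.

```python
import json
import re


def select(record, path):
    """Evaluate a tiny JSONPath subset: '$', '.name' and '[*]' steps only."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    nodes = [record]
    for name, wildcard in re.findall(r"\.(\w+)|(\[\*\])", path[1:]):
        if wildcard:
            # '[*]' expands every list node into its elements.
            nodes = [child for node in nodes if isinstance(node, list) for child in node]
        else:
            # '.name' descends into every object that has that key.
            nodes = [node[name] for node in nodes if isinstance(node, dict) and name in node]
    return nodes


person = json.loads('{"name": "Alice", "hobbies": ["chess", "cycling"]}')
single = select(person, "$.name")
multiple = select(person, "$.hobbies[*]")

# One triple per selected value.
triples = [("http://example.com/Alice", "http://example.com/hobby", hobby) for hobby in multiple]
```

A tabular source cannot produce this situation from a single cell without extra conventions, which is why JSONPath exercises a part of the core behavior that CSV does not.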

@DylanVanAssche
Collaborator Author

> IMO these descriptions will be more than a SHOULD, so maybe these should also be reports, and the W3C process can decide how to label them later on.

Yeah you can also do more 'enforcement' here, I first want to have a proper agreement on the rest before deciding the level of enforcement.

> I see the test cases as functional tests, not as unit tests.

That's possible as well, but that contradicts the 'key-value pairs' view IMO.
If the reference formulations are properly defined in some kind of document, it fixes a lot for the test cases in Core because the document clearly says then how to iterate over data.
We just need to decide on this ;)

> use JSONPath as the reference formulation for all core tests (JSONPath in order to have examples of multiple values).

I prefer a tabular data source here for Core because it aligns better with R2RML making the transition less difficult. If we want adoption from R2RML implementations, Core should be as easy as possible to implement.

> introduce tests for specific reference formulations together with the report on that reference formulation.
> These tests can focus on the specifics of the reference formulation, like correct natural mapping of datatypes and such.

+1

@SimonBin

just a comment, we also noticed that TC0002a-JSON fails because the "expected" result is not respecting natural data type mapping
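
For readers unfamiliar with the term, a natural datatype mapping for JSON could look roughly like the Python sketch below. The mapping table is an assumption made for illustration (in the spirit of R2RML's natural mapping for SQL); it is not taken from the RML specs or from the expected output of TC0002a-JSON.

```python
import json

XSD = "http://www.w3.org/2001/XMLSchema#"


def natural_datatype(value):
    """Return an assumed XSD datatype IRI for a parsed JSON value, or None."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return XSD + "boolean"
    if isinstance(value, int):
        return XSD + "integer"
    if isinstance(value, float):
        return XSD + "double"
    return None  # strings (and anything else) stay plain literals here


record = json.loads('{"id": 10, "score": 9.5, "active": true, "name": "Venus"}')
typed = {key: natural_datatype(value) for key, value in record.items()}
```

If the expected output of a test case types `10` as a plain string while an implementation emits `xsd:integer`, the two disagree even though both are defensible, which is exactly the gap a per-formulation document would close.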

@dachafra
Member

dachafra commented Jul 5, 2024

> If the reference formulations are properly defined in some kind of document, it fixes a lot for the test cases in Core because the document clearly says then how to iterate over data.
> We just need to decide on this ;)

It's done, right? @DylanVanAssche, any action here?

@DylanVanAssche
Collaborator Author

> It's done, right? @DylanVanAssche, any action here?

The rml-io-registry repo has been created, but the actual task of writing these documents has not happened yet.

@dachafra
Member

dachafra commented Jul 5, 2024

But then is this an rml-io-registry issue?

@DylanVanAssche
Collaborator Author

Yes, but that didn't exist back then when the issue was created. This is one of the issues that triggered the creation of the registry.

@dachafra dachafra closed this as completed Jul 5, 2024
6 participants