[BUG] #2186
Comments
Thanks @mmalmeida! We will investigate and I will answer here.
Hi @dommitchell, any news on this?
We have identified a fix for this and we will get that into production as soon as we can.
Hi @mmalmeida, thanks for your patience on this issue. We have a fix passing through final testing at the moment, and a release should follow in due course.

In the meantime, I wanted to outline how the fix we are producing differs from what you suggested, and why, and see whether that's an issue. Below I've outlined what the OAI-PMH specification says (which ties up with what you indicated in the original issue), and after that how our incoming fix modifies the response.

OAI-PMH specification

First of all, these are examples from the OAI-PMH specification of how the record should look:

<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2002-05-01T19:20:30Z</responseDate>
<request verb="GetRecord" identifier="oai:arXiv.org:hep-th/9901001"
metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request>
<GetRecord>
<record>
...
</record>
</GetRecord>
</OAI-PMH> Note:
Then inside the <metadata>
<oai_dc:dc
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Using Structural Metadata to Localize Experience of Digital
Content</dc:title>
</oai_dc:dc>
</metadata> Note:
This lines up with what we see on the scielo OAI endpoint: https://scielo.isciii.es/oai/scielo-oai.php?verb=ListRecords&set=0213-1285&from=2022-11-30&metadataPrefix=oai_dc

DOAJ Implementation

Our (updated, not yet released) implementation, meanwhile, looks like this:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2023-03-03T15:22:12Z</responseDate>
<ListRecords>
<record> Note:
Then inside the <metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Exploring the Virulent Jazz Counterculture in Mumbo Jumbo</dc:title> Note:
Analysis

The code which produces this latter part is built using

In producing the above snippet the code (paraphrased) is:

NSMAP = {None: PMH_NAMESPACE, "xsi": XSI_NAMESPACE, "xmlns": XMLNS_NAMESPACE}
NSMAP.update({"oai_dc": OAIDC_NAMESPACE, "dc": DC_NAMESPACE})
oai_dc = etree.SubElement(metadata, self.OAIDC + "dc", nsmap=NSMAP)

That is, the

We believe this is syntactically correct XML, and should work with any formal XML parser. Exactly why the OAI-PMH specification expects the

The only way that I can see that we could become formally specification compliant is to render each oai_dc:dc element as a string with all the namespaces inserted, and then stitch together the final OAI-PMH response out of string representations. This seems like a Bad Idea, so we are not going to go that route.

We'd be interested to know whether you would still find the above correct (but not spec compliant) XML problematic to work with?
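The claim that a namespace-aware parser treats an inherited xsi declaration the same as a repeated one can be checked with a small sketch. It uses Python's standard-library ElementTree rather than the DOAJ code, and the trimmed response and the helper names (`build_response`, `schema_location`) are assumptions made purely for illustration:

```python
import xml.etree.ElementTree as ET

PMH = "http://www.openarchives.org/OAI/2.0/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
OAI_DC_XSD = "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"


def build_response(inner_xsi_decl):
    """Build a heavily trimmed response (hypothetical helper).

    inner_xsi_decl is the text placed on oai_dc:dc: either a repeated
    xmlns:xsi declaration or an empty string.
    """
    return (
        f'<OAI-PMH xmlns="{PMH}" xmlns:xsi="{XSI}">'
        "<record><metadata>"
        f'<oai_dc:dc xmlns:oai_dc="{OAI_DC}"{inner_xsi_decl} '
        f'xsi:schemaLocation="{OAI_DC} {OAI_DC_XSD}"/>'
        "</metadata></record></OAI-PMH>"
    )


def schema_location(xml_text):
    """Return the namespace-resolved xsi:schemaLocation of oai_dc:dc."""
    root = ET.fromstring(xml_text)
    dc = root.find(f".//{{{OAI_DC}}}dc")
    return dc.get(f"{{{XSI}}}schemaLocation")


# Spec-style: xmlns:xsi is declared again on oai_dc:dc.
spec_style = build_response(f' xmlns:xsi="{XSI}"')
# Style described above: xmlns:xsi is only declared on the root element.
doaj_style = build_response("")

# Both documents resolve the attribute to the same namespaced value.
print(schema_location(spec_style) == schema_location(doaj_style))
```

This only shows that the two forms are equivalent for a parser that reads the whole document; it says nothing about tools that handle each record in isolation.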
Hello @richard-jones, thanks for the follow-up.

"We'd be interested to know whether you would still find the above correct (but not spec compliant) XML problematic to work with?"

Yes, it is still a problem, because we use the XOAI parser, a Java library for harvesting metadata from OAI records. This parser follows the OAI-PMH specification (DSpace XOAI GitHub page: https://github.com/DSpace/xoai). When we try to harvest using that library, it raises an error stating that it needs to see that declaration (XMLSchema-instance in xmlns:xsi), and so it does not return the metadata. This solution therefore does not solve our problem, and we cannot find a quick workaround, since the XOAI parser is the tool that provides us with the processed metadata.

On another note, we have had the same issue with another repository that used Python with the same library as you (lxml). It showed the same behaviour you are experiencing: if the declaration exists in the XML header, it is removed from the record metadata headers. We had a meeting with them about this, and the conclusion was that it was not fixable on their side because of that library. Their idea is to eventually stop using that library in order to allow clients to harvest the metadata properly. Maybe a simple version upgrade can solve it, or using another library for this case, if that would be viable for you?

"The only way that I can see that we could become formally specification compliant is to render each oai_dc:dc element as a string with all the namespaces inserted, and then stitch together the final OAI-PMH response out of string representations. This seems like a Bad Idea, so we are not going to go that route."

It does seem like a bad idea, and I understand your concerns about implementing something like that, but it is probably the fastest solution we could get (even though it is far from optimal).
Hi @eduardorep, thanks for the details, that's useful. It's unfortunate that XOAI won't accept this input. Can you give me some details of the error you get from it when the import fails because of the missing

We've discussed this internally, and we're not comfortable moving to a string-manipulation based approach, and we don't plan to introduce a second or alternative XML serialiser at this stage (though we will consider it for the future). Therefore the fastest approach would probably be to introduce some flexibility into the XOAI library to parse these records in the absence of a repeated xsi attribute.

I have briefly looked at the XOAI code, and I wasn't able to quickly see how to do this, but if you can provide me with the error message it gives you and/or a reference code snippet that's using XOAI, then I could take a look and see if there's a quick option there.
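As an illustration of what harvester-side flexibility could look like, here is a minimal sketch in Python (XOAI itself is Java, and this is not XOAI code). It parses the whole response with a namespace-aware parser and then re-serialises each oai_dc:dc element on its own, so that every fragment carries its own namespace declarations. The sample response is trimmed, and the name `standalone_metadata` is hypothetical:

```python
import xml.etree.ElementTree as ET

PMH = "http://www.openarchives.org/OAI/2.0/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Keep the conventional prefixes when fragments are written back out.
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)
ET.register_namespace("xsi", XSI)

# Trimmed response in which xmlns:xsi is declared only on the root element.
RESPONSE = (
    f'<OAI-PMH xmlns="{PMH}" xmlns:xsi="{XSI}">'
    "<ListRecords><record><metadata>"
    f'<oai_dc:dc xmlns:oai_dc="{OAI_DC}" xmlns:dc="{DC}" '
    f'xsi:schemaLocation="{OAI_DC} http://www.openarchives.org/OAI/2.0/oai_dc.xsd">'
    "<dc:title>Exploring the Virulent Jazz Counterculture in Mumbo Jumbo</dc:title>"
    "</oai_dc:dc></metadata></record></ListRecords></OAI-PMH>"
)


def standalone_metadata(response_xml):
    """Return each oai_dc:dc element as a self-contained XML string.

    ElementTree declares every namespace used inside the element on the
    element it serialises, including ones inherited from ancestors.
    """
    root = ET.fromstring(response_xml)
    return [
        ET.tostring(dc, encoding="unicode")
        for dc in root.iter(f"{{{OAI_DC}}}dc")
    ]


for fragment in standalone_metadata(RESPONSE):
    print(fragment)
```

The design point is that namespace resolution happens once, on the full document, before any record is handled separately.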
I'd go so far as to say this is a bug in XOAI. It seems to break the XML down into parts and parse them separately, which is probably where this error comes from: DSpace/xoai#67
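The failure mode suspected here can be reproduced in a few lines. This is a sketch in Python of the general problem, not of XOAI's actual code path: the trimmed response and the substring-based extraction are assumptions chosen only to show what happens when a fragment is cut out of a document and parsed on its own.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Trimmed response: xmlns:xsi is declared on the root element only.
response = (
    f'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="{XSI}">'
    "<record><metadata>"
    f'<oai_dc:dc xmlns:oai_dc="{OAI_DC}" xmlns:dc="{DC}" '
    f'xsi:schemaLocation="{OAI_DC} http://www.openarchives.org/OAI/2.0/oai_dc.xsd">'
    "<dc:title>Exploring the Virulent Jazz Counterculture in Mumbo Jumbo</dc:title>"
    "</oai_dc:dc></metadata></record></OAI-PMH>"
)

# Naive extraction: cut the oai_dc:dc element out as text, losing the
# xmlns:xsi declaration that lives on the enclosing OAI-PMH element.
end_tag = "</oai_dc:dc>"
start = response.index("<oai_dc:dc")
end = response.index(end_tag) + len(end_tag)
fragment = response[start:end]


def parses(xml_text):
    """Return True if xml_text is accepted by a namespace-aware parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print("whole document parses:", parses(response))
print("isolated fragment parses:", parses(fragment))
```

The whole document is accepted, while the isolated fragment is rejected because its xsi prefix is no longer bound to a namespace.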
Hello @richard-jones, could you take a look at the bug in XOAI in order to make it work with what your project requires? Do you have enough experience in Java to fix it? Would you require any assistance to complete it?
Here is a possible solution we are exploring: gdcc/xoai#141
Hi there! Apologies for the small delay in replying. @richard-jones had to catch up with me first to discuss this. I am the Operations Manager for DOAJ and I work closely with Richard in prioritising our developments in a way that makes the best use of the funding that we have.

Richard explained the detail here. In order for us to implement a fix, we would need to investigate how complex the fix is and whether or not it would be accepted by the XOAI maintainers. It's worth noting at this point that Richard contacted them 3 weeks ago with this question and they still haven't responded. This makes us nervous, as we could essentially do the work and have it rejected.

The fix itself is not a small amount of work, as it will involve us understanding XOAI, setting up a testing environment, making the fix (which may be complex) and then piloting the fix through to acceptance. All in all, I don't think that we can risk that kind of resource and money on this at this moment in time. We are severely underfinanced and have a long development list with several high-priority items in it.

I am sorry that I don't have a better answer for you at this time. Thank you for taking the time to go through this with us.
There is a discrepancy between your repository's response and the response expected under the OAI-PMH protocol.
According to guidelines in https://www.openarchives.org/OAI/openarchivesprotocol.html#OAIPMHschema, the response XML sent by the repository should include the following attribute in the metadata part of each record:
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
We've confirmed this occurs in most repositories' responses (we'll use Scielo Spain below as example).
For the request: https://scielo.isciii.es/oai/scielo-oai.php?verb=ListRecords&set=0213-1285&from=2022-11-30&metadataPrefix=oai_dc
The following response (XMLSchema-instance included in metadata element, in xmlns:xsi):
However for: https://doaj.org/oai.article?verb=ListRecords&set=TENDOkRlcm1hdG9sb2d5&from=2022-08-31T23%3A00%3A00Z&metadataPrefix=oai_dc
We get the following response (XMLSchema-instance not included in metadata element):
This makes it impossible for parsers that rely on a correct XML document to retrieve data from DOAJ.
Is it possible to update DOAJ to include the required xmlns:xsi attribute on each record's metadata element?