encodeValue and encodeContent do not escape some invalid XML characters #877

Open
karenhanson opened this issue Aug 16, 2023 · 0 comments

I came across a message that contained invalid XML characters (specifically 0xfffe and 0xffff), which made the XML output fail to parse. This happens even though messages go through Utils.encodeContent(String content), which should clean up the string for XML. This documentation indicates that both characters are forbidden.
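To illustrate the kind of check involved, here is a minimal sketch that drops every code point outside the XML 1.0 `Char` production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class and method names are hypothetical and are not part of JHOVE; this is not the proposed fix, only one way the filtering could look.

```java
// Hypothetical helper, not JHOVE code: removes code points that XML 1.0 forbids.
final class XmlCharFilter {
    private XmlCharFilter() {}

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Walks the string by code point so that valid surrogate pairs are kept
    // together, while unpaired surrogates, 0xFFFE and 0xFFFF are dropped.
    static String stripInvalidXmlChars(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValidXml10(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Whether forbidden characters should be silently dropped or replaced with a visible placeholder is a design choice this sketch leaves open.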

I will open a PR with a proposed fix for this problem, but wanted to log an issue to attach the fix to.

I also noted something else while troubleshooting:

The document linked above also lists "surrogates" as forbidden and says some characters are "discouraged though allowed." Of the discouraged characters, the JHOVE utility removes only one: 0x7f. I wondered whether a standard XML-cleaning function could be used and looked at org.apache.commons.lang3.StringEscapeUtils.escapeXml10. It handles surrogates and escapes the discouraged characters (which are XML version specific), but it also encodes quote, greater than, and apostrophe in all cases, which may not be good for the readability of messages. I mention it because that code might be useful if it seems important to explore escaping further, or if we need to handle the Unicode surrogates.
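As a rough sketch of the middle ground described above, the following escapes only the discouraged control ranges 0x7f-0x84 and 0x86-0x9f as numeric character references and leaves quote, greater than, and apostrophe untouched. The class and method names are hypothetical; the ranges are an assumption taken from the XML 1.0 recommendation's note on discouraged characters, and the other discouraged ranges it lists (such as 0xfdd0-0xfdef) are not covered here.

```java
// Hypothetical helper, not JHOVE or Commons Lang code.
final class DiscouragedCharEscaper {
    private DiscouragedCharEscaper() {}

    // Assumed subset of the "discouraged though allowed" characters:
    // 0x7F-0x84 and 0x86-0x9F (0x85, NEL, is deliberately excluded).
    static boolean isDiscouragedControl(int cp) {
        return (cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F);
    }

    // Replaces each discouraged control character with a decimal numeric
    // character reference; every other code point is copied unchanged.
    static String escapeDiscouraged(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isDiscouragedControl(cp)) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

This does not escape `&` or `<` and does not remove forbidden characters, so it would have to run alongside the existing encoding step rather than replace it.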
