W3CDom.fromJsoup(document) produces invalid XML DOM (multiple roots) - not usable for XPath evaluation #2213

Open
jelinj8 opened this issue Oct 14, 2024 · 0 comments


jelinj8 commented Oct 14, 2024

I'm trying to build an HTML-to-XHTML sanitizer. I have absolutely no control over the source HTML, because it comes from a web data scraper, and the output is later passed to the FlyingSaucer library for conversion to PDF.

Loading the HTML from a FileInputStream works fine:

org.jsoup.nodes.Document document = Jsoup.parse(fis, null, "./");
document.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
document.outputSettings().escapeMode(EscapeMode.xhtml);

The conversion to org.w3c.dom runs without error, but the resulting document has two root elements, and any later XPath evaluation fails:

W3CDom w3cDom = new W3CDom();
org.w3c.dom.Document doc = w3cDom.fromJsoup(document);

The resulting document has two children: a DocumentTypeImpl and an ElementNSImpl (which contains the real data).

To get a valid XML DOM, I have to serialize with document.toString() and parse the result again. I think the problem is in W3CDom.convert: when a "document" node is passed, it skips to firstChild, which in my case is the "DOCTYPE" declaration. The main document body is in the next child. I may be doing something completely wrong, but if so, the behavior is very counter-intuitive.
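For reference, the reparse workaround described above can be sketched with only the JDK's XML APIs. This is a minimal sketch, not jsoup code: the class name `ReparseSketch`, the helpers `reparse` and `describeChildren`, and the hardcoded sample string (standing in for the output of document.toString() with XML syntax enabled) are all assumptions made for illustration. It also lists the node types of the document's top-level children, which can be compared against what fromJsoup returns.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ReparseSketch {

    // Hypothetical helper: parses an XML string into a W3C DOM with the JDK parser.
    static Document reparse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Hypothetical helper: names the node type of each top-level child of the document.
    static String describeChildren(Document doc) {
        StringBuilder sb = new StringBuilder();
        NodeList children = doc.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            short type = children.item(i).getNodeType();
            if (i > 0) {
                sb.append(", ");
            }
            if (type == Node.DOCUMENT_TYPE_NODE) {
                sb.append("doctype");
            } else if (type == Node.ELEMENT_NODE) {
                sb.append("element");
            } else {
                sb.append("other");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a stand-in for the serialized jsoup document (document.toString()).
        String xhtml = "<!DOCTYPE html><html><head><title>t</title></head>"
                + "<body><p>Hello</p></body></html>";

        Document doc = reparse(xhtml);
        System.out.println(describeChildren(doc));

        // Evaluate an XPath expression against the reparsed DOM.
        String text = XPathFactory.newInstance().newXPath().evaluate("/html/body/p", doc);
        System.out.println(text);
    }
}
```

The sample input has no namespace declaration, so the unprefixed XPath steps match. If the serialized markup carries an xmlns attribute, the expression would need a NamespaceContext or local-name() tests instead.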

I'm using jsoup 1.18.1 from Maven on OpenJDK 17.

Optionally, it would be nice if non-XML declarations were skipped automatically for XML output.
