I'm trying to build an HTML-to-XHTML sanitizer. I have no control over the source HTML, since it comes from a web data scraper, and the output is later passed to the FlyingSaucer library for conversion to PDF. Loading the HTML from a FileInputStream works fine.
The conversion to org.w3c.dom appears to succeed, but the resulting document seems to have two root elements, and any later XPath evaluation fails:
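```java
// `document` is the org.jsoup.nodes.Document parsed from the scraped HTML
W3CDom w3cDom = new W3CDom(); // org.jsoup.helper.W3CDom
org.w3c.dom.Document doc = w3cDom.fromJsoup(document);
```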
The resulting document has two children: a DocumentTypeImpl and an ElementNSImpl (which contains the real data).
As a workaround, I have to serialize the document with document.toString() and parse the result again to get a valid XML DOM. I think the problem comes from W3CDom.convert: when a "document" node is passed, it skips to firstChild, which in my case is the "DOCTYPE" declaration. The main document body is in the next child. I may be doing something completely wrong, but if so, the behavior is very counter-intuitive.
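For reference, here is a minimal JDK-only sketch of the node layout described above. It does not use jsoup: `DomChildrenSketch`, `buildSample`, and `xpath` are hypothetical names, and the sample document is a hand-built stand-in for the converted one, so it only approximates what `W3CDom` produces. In the W3C DOM API, a `Document` may hold a `DocumentType` node alongside its single root `Element`, and `getDocumentElement()` returns that element. The sketch also shows that, when elements are namespaced (as the `ElementNSImpl` class name suggests they might be), an unprefixed XPath such as `/html/body` matches nothing, while a `local-name()` test does. Whether that is what causes the failure here is an assumption, not something confirmed.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DomChildrenSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Builds a stand-in document (assumption: roughly the shape of the
    // converted one): a DOCTYPE node followed by a namespaced <html> root.
    static Document buildSample() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DOMImplementation impl = factory.newDocumentBuilder().getDOMImplementation();
            DocumentType doctype = impl.createDocumentType("html", null, null);
            Document doc = impl.createDocument(XHTML_NS, "html", doctype);
            Element body = doc.createElementNS(XHTML_NS, "body");
            body.setTextContent("real data");
            doc.getDocumentElement().appendChild(body);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates an XPath expression and returns its string value
    // (an empty string when nothing matches).
    static String xpath(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = buildSample();
        // List the top-level children: the DOCTYPE and the root element.
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            System.out.println(n.getNodeType() + " " + n.getNodeName());
        }
        // The root element is reachable directly, regardless of the DOCTYPE.
        System.out.println(doc.getDocumentElement().getNodeName());
        // Namespace-agnostic XPath versus an unprefixed one.
        System.out.println(xpath(doc, "/*[local-name()='html']/*[local-name()='body']"));
        System.out.println(xpath(doc, "/html/body").isEmpty());
    }
}
```

If the same pattern holds for the converted document, starting from `doc.getDocumentElement()` instead of `doc.getFirstChild()` would avoid landing on the DOCTYPE node.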
I'm using JSoup 1.18.1 from Maven, OpenJDK 17.
(Optional) Automatically skipping non-XML declarations in XML output would be nice.