<pb/> elements parsed incorrectly #228
Comments
After some analysis, it seems that the page is parsed correctly up to line 102 of structure-xml-parser.service.ts, in the function parsePageContent, which I debugged with a couple of crude console.log calls (sorry). Up to that line the pb elements are parsed correctly, but then getEditionOrigNode is invoked and reverts the node to its original state, i.e. the state before the excess content (all the content after the following pb) was removed. Here is an XML file to easily test the issue: saba_pb_bug.zip. Dump of the two console.log calls before and after line 102, with the pb n="6" marked in light blue.
If that getEditionOrigNode call isn't needed and only adds back redundancy that causes problems, wouldn't it be easier to solve this issue by just deleting it?
You have a point, imho. Since we can't afford to wait any longer to fix this pb issue, for now we are commenting out that line and thus leaving the node as it was before the getEditionOrigNode invocation.
I had to analyse the […]. Basically, the problem arises when we call the following function:
export function getElementsBetweenTreeNode(start: any, end: any): XMLElement[] {
  const range = document.createRange();
  range.setStart(start, 0);
  range.setEnd(end, end.length || end.childNodes.length);
  const commonAncestorChild = Array.from((range.commonAncestorContainer as XMLElement).children);
  const startIdx = commonAncestorChild.indexOf(start);
  const endIdx = commonAncestorChild.indexOf(end);
  const rangeNodes = commonAncestorChild.slice(startIdx, endIdx).filter((c) => c !== start);
  rangeNodes.forEach((c: XMLElement) => c.setAttribute('xpath', xpath(c).replace(/-/g, '/')));
  const fragment = range.cloneContents();
  const nodes = Array.from(fragment.childNodes);
  return nodes as XMLElement[];
}
As you can see, before returning, the range's contents are cloned, so the returned nodes are copies that are detached from the document. Their parent is now a document fragment (as stated in the cloneContents() API documentation). Hence, the returned nodes aren't part of the edition XML tree anymore; that's why the […]
parsePageContent(doc: Document, pageContent: OriginalEncodingNodeType[]): Array<ParseResult<GenericElement>> {
  return pageContent
    .map((node) => {
      const origEl = getEditionOrigNode(node, doc);
      if (origEl.nodeName === this.frontTagName || isNestedInElem(origEl, this.frontTagName)) {
        ...
      }
      ...
    })
    .reduce((x, y) => x.concat(y), []);
}
The focus here is the […]
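To make the detachment point concrete, here is a minimal sketch of why an ancestor check fails on a cloned node. It uses a hypothetical hand-rolled `TreeNode` model rather than the real DOM, and `makeNode`, `cloneDetached` and `isNestedIn` are illustrative names, not EVT functions. One simplification to keep in mind: in the DOM the top-level clones get the document fragment as parent, whereas here the copy simply has no parent.

```typescript
// Hypothetical minimal tree model; this is NOT the DOM API and NOT EVT code.
interface TreeNode {
  name: string;
  parent: TreeNode | null;
  children: TreeNode[];
}

// Builds a node and wires the parent pointers of its children.
function makeNode(name: string, children: TreeNode[] = []): TreeNode {
  const node: TreeNode = { name, parent: null, children };
  children.forEach((child) => {
    child.parent = node;
  });
  return node;
}

// Deep copy with no link back to the source tree, which is the property
// of Range.cloneContents() that matters for this discussion.
function cloneDetached(node: TreeNode): TreeNode {
  return makeNode(node.name, node.children.map(cloneDetached));
}

// Walks up the parent chain looking for an ancestor with the given name.
function isNestedIn(node: TreeNode, ancestorName: string): boolean {
  for (let current = node.parent; current !== null; current = current.parent) {
    if (current.name === ancestorName) {
      return true;
    }
  }
  return false;
}

const paragraph = makeNode('p');
const front = makeNode('front', [paragraph]);
const copy = cloneDetached(paragraph);

console.log(front.children.length); // 1
console.log(isNestedIn(paragraph, 'front')); // true
console.log(isNestedIn(copy, 'front')); // false
```

Under these assumptions, any check that depends on ancestry has to run either before cloning or against the original node, which seems to be what the lookup in parsePageContent is compensating for.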
I can't check the code myself right now, but there must be a better way to distinguish between front and body content, for example two different invocations for the two cases in the parent(s) and/or a parameter passed by the calling function. I remember there was already a distinction between front and body pages at a higher level of that service; maybe we can pass that information down? The elements end up being XML elements without node references anyway ([…]
Yes, we could split the function into two different ones. We'd create a little bit of duplication, but in my opinion that would be fine for the purpose of avoiding that. However, I don't agree that we shouldn't care about node references just because there won't be any in the result. This is one of the main issues I'm having with DEPA. If it's critical to resolve this issue right now, we can try mitigating it by commenting […]
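As a rough illustration of the "pass the information down" idea discussed above, the caller could state which section a page belongs to, so the parser no longer needs to look up the original node to inspect its ancestors. Everything in this sketch (`Section`, `PageNode`, `ParsedItem`, `parsePageContentFor`) is a hypothetical name chosen for illustration; the real EVT 3 types and signatures may well differ.

```typescript
// Hypothetical types for illustration only; not the actual EVT 3 API.
type Section = 'front' | 'body';

interface PageNode {
  name: string;
  text: string;
}

interface ParsedItem {
  section: Section;
  name: string;
  text: string;
}

// The caller already knows whether it is handling front or body pages,
// so the section is an explicit argument instead of being derived from
// the ancestors of each node in the original document.
function parsePageContentFor(section: Section, pageContent: PageNode[]): ParsedItem[] {
  return pageContent.map((node) => ({ section, name: node.name, text: node.text }));
}

// Placeholder input data, one call per section.
const frontItems = parsePageContentFor('front', [{ name: 'titlePage', text: 'Canzoniere' }]);
const bodyItems = parsePageContentFor('body', [{ name: 'lg', text: 'placeholder verse' }]);

console.log(frontItems[0].section); // front
console.log(bodyItems[0].section); // body
```

A single function with a parameter avoids the duplication that two separate functions would introduce, at the cost of one extra argument at each call site.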
Apparently EVT 3 is currently parsing <pb/> elements not as simple milestones, gathering the marked-up content between one and the next as EVT 2 correctly does, but as part of the text structure, expecting them to be children of sibling containers. When this structure is not present in the TEI document, i.e. when <pb/>s follow each other within <p> or <lg> elements, text is parsed and visualized incorrectly, with repeated text blocks in the relevant frame; see the screenshots for pages 5 and 6 of Saba's manuscript text of the Canzoniere. This bug was discovered while working on the authorial philology view for the Saba 2021 project, but as you can see from the screenshots it was already present in EVT 3 (screenshots of the alpha version with Saba's transcribed text).
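The milestone reading of <pb/> described above can be sketched as follows: walk the tree in document order, switch the "current page" every time a <pb/> is met, and keep only the text that falls on the requested page, together with its enclosing elements. This is a simplified model on a hypothetical `XmlNode` type, not the EVT 2 or EVT 3 implementation, and the sample verse lines are placeholders.

```typescript
// Hypothetical minimal XML model; '#text' marks text nodes.
interface XmlNode {
  name: string;
  attrs?: Record<string, string>;
  text?: string;
  children?: XmlNode[];
}

// Returns a pruned copy of the tree holding only the content of page `pageN`,
// or null if nothing belongs to that page. Elements left without any kept
// content are dropped, which is a simplification of this sketch.
function extractPage(root: XmlNode, pageN: string): XmlNode | null {
  let inPage = false;
  const visit = (node: XmlNode): XmlNode | null => {
    if (node.name === 'pb') {
      // A page break is a milestone: it only switches the current page.
      inPage = node.attrs?.n === pageN;
      return null;
    }
    if (node.name === '#text') {
      return inPage ? { ...node } : null;
    }
    const kept: XmlNode[] = [];
    for (const child of node.children ?? []) {
      const copy = visit(child);
      if (copy !== null) {
        kept.push(copy);
      }
    }
    return kept.length > 0 ? { ...node, children: kept } : null;
  };
  return visit(root);
}

// Collects the text nodes of a (possibly null) tree in document order.
function textOf(node: XmlNode | null): string[] {
  if (node === null) {
    return [];
  }
  if (node.name === '#text') {
    return [node.text ?? ''];
  }
  return (node.children ?? []).reduce<string[]>((acc, child) => acc.concat(textOf(child)), []);
}

const text = (value: string): XmlNode => ({ name: '#text', text: value });
const el = (name: string, children: XmlNode[] = [], attrs: Record<string, string> = {}): XmlNode =>
  ({ name, attrs, children });

// A <pb/> inside an <lg>, i.e. the case that is not "sibling containers".
const body = el('body', [
  el('pb', [], { n: '5' }),
  el('lg', [
    el('l', [text('first line')]),
    el('pb', [], { n: '6' }),
    el('l', [text('second line')]),
  ]),
]);

console.log(textOf(extractPage(body, '5'))); // [ 'first line' ]
console.log(textOf(extractPage(body, '6'))); // [ 'second line' ]
```

Because the <lg> wrapper is copied for both pages, each page keeps its structural context without the page break having to sit between sibling containers.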