Hi @Oren-H,

This is a shortcoming of PyPDF2 at the moment. There is no support for this.

Problem 1: Unclear expected results

Order of superscripts and subscripts

For [formula not preserved], you might argue that you're fine with [rendering not preserved] as well as with [rendering not preserved]. Writing both on the same level is not possible with plain text.

Problem 2: Nested elements

I'm not sure how common this is in chemistry, but in mathematics you can have nested elements. The most common are nested fractions: [formula not preserved]. Now we would need to decide on a syntax. Most likely it would be something similar to the math syntax of TeX, but that's really unclear. And it would be extremely hard to parse that.

Problem 3: How should PyPDF2 recognize superscripts / subscripts?

The PDF format does not necessarily represent this in a semantically meaningful way, meaning we would need to be clever about how we extract it.

How does PyPDF2 continue?

For the moment, I would say the math part is completely out of scope. For simple subscripts / superscripts, I would hope that at some point one of our contributors tackles them.
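
To make the recognition question a bit more concrete, here is a minimal sketch of one possible heuristic. It is not something PyPDF2 does. It assumes that some earlier extraction step has already produced text fragments together with a baseline y-coordinate and a font size. The `Fragment` class, the `to_tex_like` function, and the `size_ratio` threshold are all hypothetical names invented for this illustration. Fragments set in a clearly smaller font are treated as superscripts if they sit above the body baseline and as subscripts if they sit below it, and the output borrows the TeX-like `^{}` / `_{}` notation mentioned above.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Fragment:
    """Hypothetical text fragment; not a PyPDF2 type."""
    text: str
    y: float          # baseline y-coordinate (larger means higher on the page)
    font_size: float


def to_tex_like(fragments, size_ratio=0.85):
    """Join the fragments of one line, marking small raised/lowered ones.

    size_ratio is an assumed threshold: anything smaller than
    size_ratio * (largest font size on the line) counts as "small".
    """
    if not fragments:
        return ""
    base_size = max(f.font_size for f in fragments)
    # Body text is everything set at (roughly) the largest size.
    body = [f for f in fragments if f.font_size >= size_ratio * base_size]
    base_y = median(f.y for f in body)

    parts = []
    for f in fragments:
        is_small = f.font_size < size_ratio * base_size
        if is_small and f.y > base_y:
            parts.append("^{" + f.text + "}")   # small and raised
        elif is_small and f.y < base_y:
            parts.append("_{" + f.text + "}")   # small and lowered
        else:
            parts.append(f.text)                # ordinary body text
    return "".join(parts)


water = [Fragment("H", 100, 12), Fragment("2", 97, 8), Fragment("O", 100, 12)]
sulfate = [Fragment("SO", 100, 12), Fragment("4", 97, 8), Fragment("2-", 105, 8)]
print(to_tex_like(water))
print(to_tex_like(sulfate))
```

Note that when a symbol carries both a subscript and a superscript, this sketch simply emits them in the order the fragments arrive, which is exactly the kind of arbitrary choice described under the first problem. It also makes no attempt at nested elements such as fractions.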
I am trying to extract text from a PDF document that contains several chemical formulas. These formulas contain subscripts and superscripts that PyPDF2 is not recognizing when I use the extractText() function. Instead, the characters are either read as normal numbers or preceded by '\n'. Is there any way I can fix this issue, or are there other tools I can use to accurately read subscripts and superscripts? Any help would be greatly appreciated.