Errors when reading subscripts and superscripts in a pdf? #1253

Oren-H · 2022-08-18T23:07:04Z

I am trying to extract text from a PDF document that contains several chemical formulas. These formulas contain subscripts and superscripts that PyPDF2 does not recognize when I use the extractText() function. Instead, the characters are either read as normal numbers or preceded by '\n'. Is there any way I can fix this issue, or are there other tools I can use to accurately read subscripts and superscripts? Any help would be much appreciated.

MartinThoma · 2022-08-19T05:16:10Z

Maintainer

Hi @Oren-H,

This is a shortcoming of PyPDF2 at the moment. There is no support for this.

Problem 1: Unclear expected results

Order of superscripts and subscripts

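For H₂O you would likely expect exactly that, but what would you expect for a nuclide symbol such as uranium-235, where the mass number 235 sits directly above the atomic number 92 in front of the U?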

You might argue that you're fine with

²³⁵₉₂U

as well as with

₉₂²³⁵U

Writing both on the same level is not possible with plain text.
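
To make the ambiguity concrete, here is a minimal Python sketch that flattens the same symbol both ways using Unicode superscript and subscript digits. The function `linearize` and its `superscript_first` option are hypothetical names for illustration only; they are not part of PyPDF2.

```python
# Translation tables from ASCII digits to Unicode superscript / subscript digits.
SUPERSCRIPT = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")
SUBSCRIPT = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def linearize(symbol: str, mass: int, atomic: int, superscript_first: bool = True) -> str:
    """Flatten a nuclide symbol into one line of plain text.

    `superscript_first` is a hypothetical option, not a PyPDF2 parameter.
    """
    sup = str(mass).translate(SUPERSCRIPT)
    sub = str(atomic).translate(SUBSCRIPT)
    prefix = sup + sub if superscript_first else sub + sup
    return prefix + symbol


print(linearize("U", 235, 92))                           # ²³⁵₉₂U
print(linearize("U", 235, 92, superscript_first=False))  # ₉₂²³⁵U
```

Both outputs are defensible, which is the point: an extractor has to pick one order, and neither matches the stacked layout on the page.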

Problem 2: Nested elements

I'm not sure how common this is in chemistry, but in mathematics you can have nested elements. The most common case is nested fractions, where one fraction appears inside the numerator or denominator of another.

Now we would need to decide on a syntax. Most likely it would be something similar to the math syntax of TeX ... but that's really unclear. And it would be extremely hard to parse that.
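
For illustration, assuming a TeX-like syntax were chosen, a nested fraction could be written as follows. The expression is an example picked for this sketch, not one taken from a particular PDF.

```latex
\frac{1}{1 + \frac{1}{x}}
```

A text extractor would have to recover this structure from glyph positions alone. A flat rendering such as 1/(1 + 1/x) needs explicit parentheses to stay unambiguous, and the PDF contains no such parentheses.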

Problem 3: How should PyPDF2 recognize superscripts / subscripts?

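The PDF format does not necessarily represent this in a semantically meaningful way. A subscript is typically just text drawn in a smaller font or on a shifted baseline, with nothing marking it as a subscript, so we would need to be clever about how we extract it.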
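
One possible heuristic, sketched below in Python, is to compare each text fragment's font size and baseline with those of the surrounding text. This is only a sketch under stated assumptions: the `(text, baseline_y, font_size)` tuples, the `classify` function, and the `size_ratio` threshold are hypothetical, and how the tuples are obtained is left open. Newer PyPDF2 releases appear to accept a `visitor_text` callback in `extract_text` that reports the text matrix and font size, but treat that as an assumption to verify against your version.

```python
from typing import List, Tuple

# One text fragment: (text, baseline_y, font_size). How these values are
# obtained from a PDF is outside this sketch.
Fragment = Tuple[str, float, float]


def classify(fragments: List[Fragment], size_ratio: float = 0.8) -> List[Tuple[str, str]]:
    """Label each fragment of one text line "normal", "sub", or "super".

    Heuristic (an assumption, not PyPDF2 behavior): a fragment is a script
    if its font is noticeably smaller than the largest font on the line and
    its baseline is shifted relative to the main baseline.
    """
    if not fragments:
        return []
    base_size = max(size for _, _, size in fragments)
    # The baseline of the first full-size fragment serves as the reference.
    base_y = next(y for _, y, size in fragments if size == base_size)
    labeled = []
    for text, y, size in fragments:
        if size >= size_ratio * base_size or y == base_y:
            kind = "normal"
        elif y > base_y:
            kind = "super"  # the y axis points up in default PDF user space
        else:
            kind = "sub"
        labeled.append((text, kind))
    return labeled


# H2O with the "2" set smaller and lowered
print(classify([("H", 700.0, 12.0), ("2", 697.0, 8.0), ("O", 700.0, 12.0)]))
```

Real documents will break this in many ways (rotated text, mixed font sizes on one line, footnote markers), which is part of why the problem is hard.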

How does PyPDF2 continue?

For the moment, I would say the math part is completely out of scope. For simple subscripts / superscripts, I would hope that at some point one of our contributors tackles them.

2 replies

Oren-H Aug 19, 2022
Author

Thank you for the thorough response! Do you know of other Python packages/tools that would allow me to solve this issue? It is integral to my project. Additionally, is anyone working on adding parameters to the extractText() function that would specify how to handle these different scripts? For instance, I imagine a number of boolean parameters that would let the coder specify how subscripts, superscripts, and other characters should be stored.
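
As a rough illustration of that idea, the sketch below shows what such an option could look like as a post-processing step in Python. Nothing here is part of PyPDF2: the `render` function, its `script_style` parameter, and the `(text, kind)` input format are all hypothetical, and the input assumes that something else has already decided which fragments are subscripts or superscripts.

```python
from typing import Iterable, Tuple

_SUP = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")
_SUB = str.maketrans("0123456789+-", "₀₁₂₃₄₅₆₇₈₉₊₋")


def render(fragments: Iterable[Tuple[str, str]], script_style: str = "unicode") -> str:
    """Join (text, kind) pairs, kind being "normal", "sub", or "super".

    script_style: "plain" drops the distinction, "unicode" maps digits and
    signs to Unicode script characters, "tex" emits _{...} / ^{...}.
    """
    if script_style not in ("plain", "unicode", "tex"):
        raise ValueError(f"unknown script_style: {script_style!r}")
    parts = []
    for text, kind in fragments:
        if kind == "normal" or script_style == "plain":
            parts.append(text)
        elif script_style == "unicode":
            parts.append(text.translate(_SUB if kind == "sub" else _SUP))
        else:  # "tex"
            parts.append(("_" if kind == "sub" else "^") + "{" + text + "}")
    return "".join(parts)


sulfate = [("SO", "normal"), ("4", "sub"), ("2-", "super")]
print(render(sulfate, "plain"))    # SO42-
print(render(sulfate, "unicode"))  # SO₄²⁻
print(render(sulfate, "tex"))      # SO_{4}^{2-}
```

The sketch uses one string-valued option instead of several booleans because the output styles are mutually exclusive. The "unicode" style only covers characters that have Unicode script forms, so letters in a subscript would pass through unchanged.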

MartinThoma Aug 19, 2022
Maintainer

Do you know of other Python packages/tools that would allow me to solve this issue?

No. I'm not aware of any project (Python or otherwise) that can properly recognize subscripts in PDFs.
