Consider supporting ActualText #41

Open
badicsalex opened this issue Sep 19, 2022 · 0 comments

I have several PDFs with some very weird ToUnicode mappings. Some characters get extracted as lowercase instead of uppercase, even though the CID corresponds to the ASCII uppercase version. Unfortunately this breaks later processing steps for these documents.

For example I have the following: https://stickman.hu/junk/actualtext_example.pdf

Here, the line

o) 1. mellékletében foglalt táblázat VII. címében az „1034/2011 és 1035/2011 EU rendeletek” szövegrész helyébe

Extracts as

o) 1. mellékletében foglalt táblázat VII. címében az „1034/2011 és 1035/2011 eU rendeletek” szövegrész helyébe
                                                                             ^
                                                                             |
                                                                         Lowercase

Note that several PDF viewers (e.g. the Firefox built-in one) also copy the wrong text. Chrome, Okular, and poppler in general do capitalize the E in EU. pdftotext from the poppler suite also works OK.

Now why is this? For some reason, the CIDs for both E and e are mapped to the same code point, 101 (lowercase e), in the font.

Why is it handled OK by some extractors? Because this is what the actual operations look like around that part:

op: Operation { operator: "BDC", operands: [/Span, <</ActualText (��^@E)>>] }
op: Operation { operator: "Td", operands: [30.888, 0] }
op: Operation { operator: "Tj", operands: [(E)] }
op: Operation { operator: "EMC", operands: [] }

ActualText is described in section 14.9.4, "Replacement Text", of the PDF standard, and has a special code path in poppler: https://github.com/freedesktop/poppler/blob/315ab3006fb24bf47b595343e6a3e90995f2a588/poppler/Gfx.cc#L5052-L5059
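The unprintable bytes in the ActualText operand above look like a UTF-16BE byte order mark (0xFE 0xFF) followed by 0x00 0x45, i.e. the letter E; that reading is an assumption based on how PDF text strings are normally encoded, not something verified against the file. A minimal sketch of decoding such a value follows. The function name `decode_pdf_text_string` is hypothetical, and the non-UTF-16 branch treats bytes as Latin-1, which only approximates PDFDocEncoding.

```rust
/// Decode a PDF text string such as the value of /ActualText.
///
/// Assumption: a leading 0xFE 0xFF marks UTF-16BE. Anything else is treated
/// as Latin-1, which matches PDFDocEncoding for ASCII but not for every byte.
fn decode_pdf_text_string(bytes: &[u8]) -> String {
    if bytes.len() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF {
        // Skip the byte order mark and read big-endian 16-bit code units.
        // A trailing odd byte, if any, is ignored by chunks_exact.
        let units: Vec<u16> = bytes[2..]
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        // Each byte becomes the Unicode code point with the same value.
        bytes.iter().map(|&b| b as char).collect()
    }
}
```

With the four bytes assumed above, this yields "E", which is what the extractors that honor ActualText end up emitting.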

As far as I can see, handling this case would need some refactoring around show_text, and I'm really not sure how to do it. Probably there should be fully separate code paths for the "simple" and the replacement-text use cases, both of which would call output_character in the end.
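One possible shape for the two code paths, as a rough sketch only: track open marked-content sequences, emit the ActualText once when a sequence carrying it begins, and suppress the glyph text shown inside it. The `Op` enum and `extract_text` below are hypothetical simplifications and are not the crate's real types; in particular, ActualText is assumed to be already decoded, and positioning (which a real output_character call would need) is ignored.

```rust
/// Hypothetical, simplified content-stream operations.
#[derive(Debug, Clone)]
enum Op {
    /// BDC, with the /ActualText entry already decoded if present.
    BeginMarked { actual_text: Option<String> },
    /// Tj, with the text as the font's (possibly wrong) mapping decodes it.
    ShowText(String),
    /// EMC
    EndMarked,
}

/// Extract text, preferring ActualText over the glyphs it replaces.
fn extract_text(ops: &[Op]) -> String {
    let mut out = String::new();
    // One entry per open marked-content sequence: did it carry ActualText?
    let mut stack: Vec<bool> = Vec::new();
    // How many enclosing sequences carry ActualText.
    let mut replaced_depth = 0usize;

    for op in ops {
        match op {
            Op::BeginMarked { actual_text } => {
                let has_text = actual_text.is_some();
                if let Some(text) = actual_text {
                    // Replacement path: emit once, only for the outermost span.
                    if replaced_depth == 0 {
                        out.push_str(text);
                    }
                }
                if has_text {
                    replaced_depth += 1;
                }
                stack.push(has_text);
            }
            Op::ShowText(decoded) => {
                // Simple path: only used outside replaced spans.
                if replaced_depth == 0 {
                    out.push_str(decoded);
                }
            }
            Op::EndMarked => {
                if let Some(true) = stack.pop() {
                    replaced_depth -= 1;
                }
            }
        }
    }
    out
}
```

Fed the sequence from the dump above (a span with ActualText "E" wrapping a Tj that decodes to "e"), this produces the uppercase letter. Emitting at the start of the span is a simplification chosen for brevity; an implementation that needs glyph positions would probably have to defer the output until the enclosed text has been laid out.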

P.S. 1: It seems like this guy had a related issue back in the day: https://stackoverflow.com/questions/17737776/pdf-text-extraction-issue-font-capitalization-inconsistencies

P.S. 2: In the end, I might just expose the CID on the output_character interface and do the same workaround I did in python: https://github.com/badicsalex/hun_law_py/blob/master/hun_law/extractors/pdf.py#L88-L93
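For illustration, a CID-based workaround could look roughly like the sketch below. It is not a copy of the linked Python code, and `fix_case_with_cid` is a hypothetical helper: it assumes the CID is available next to the mapped character, and that for the affected fonts the CID equals the ASCII code of the intended letter, as described at the top of this issue.

```rust
/// Hypothetical workaround: if the CID is an ASCII uppercase letter and the
/// ToUnicode result is merely its lowercase form, trust the CID instead.
fn fix_case_with_cid(cid: u32, mapped: char) -> char {
    match char::from_u32(cid) {
        Some(c) if c.is_ascii_uppercase() && mapped == c.to_ascii_lowercase() => c,
        // Any other combination is left exactly as the font mapping gave it.
        _ => mapped,
    }
}
```

The check is deliberately narrow: it only touches the case where the mapped character is the lowercase twin of the CID's ASCII letter, so fonts whose CIDs have nothing to do with ASCII should be unaffected.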

P.S. 3: Thanks for taking the time to fix some of the bugs I reported, I really appreciate it.

badicsalex added a commit to badicsalex/hun_law_rs that referenced this issue Sep 19, 2022
badicsalex added a commit to badicsalex/hun_law_rs that referenced this issue Sep 19, 2022
badicsalex added a commit to badicsalex/pdf-extract-fhl that referenced this issue Sep 19, 2022