Fonts with custom encoding #85

Open
maxpowel opened this issue Apr 3, 2024 · 4 comments

Contributor

maxpowel commented Apr 3, 2024

Hello, nice library. It is very useful and I had no issues until I found a weird PDF. I don't know if it's an edge case or something common, because I'm not a PDF expert.

Using pdffonts this info is shown:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DRRVLN+AdvPSTim-I                    Type 1C           Custom           yes yes no      12  0
RHJJFZ+AdvSPSMI                      Type 1C           Custom           yes yes no      14  0
FFLANR+AdvPSTim-B                    Type 1C           Custom           yes yes no      16  0
QVMPSS+AdvP4C4E74                    Type 1C           Custom           yes yes no      18  0
JMUKNO+AdvP4C4E51                    Type 1C           Custom           yes yes no      20  0
PTWKHW+AdvPSSym                      Type 1C           Custom           yes yes no      22  0
APZBCX+AdvPSTim                      Type 1C           Custom           yes yes no       8  0
DYRDCR+AdvP4C4E59                    Type 1C           Custom           yes yes no      10  0

If you open the PDF with a reader, the text renders properly. In some readers even copy & paste works, but others just copy strange characters.

As far as I know (from reading and searching around), the custom encoding implies non-standard glyphs, and this is why some readers just copy garbage: they are copying the raw bytes, which are not "readable" outside the PDF context. But this is just my guess.

I think this is the same case as https://community.adobe.com/t5/acrobat-discussions/strange-font-encoding-in-pdf-files/td-p/12472215
It seems to affect mainly old files (mine is 20+ years old).
Other people are hitting the same issue: kermitt2/grobid#518

When using pdf-extract this is the output I get:

unknown glyph name 'C68' for font APZBCX+AdvPSTim
unknown glyph name 'C101' for font APZBCX+AdvPSTim
unknown glyph name 'C116' for font APZBCX+AdvPSTim
unknown glyph name 'C114' for font APZBCX+AdvPSTim
unknown glyph name 'C109' for font APZBCX+AdvPSTim
unknown glyph name 'C105' for font APZBCX+AdvPSTim
unknown glyph name 'C110' for font APZBCX+AdvPSTim
unknown glyph name 'C97' for font APZBCX+AdvPSTim
unknown glyph name 'C111' for font APZBCX+AdvPSTim
unknown glyph name 'C102' for font APZBCX+AdvPSTim
unknown glyph name 'C100' for font APZBCX+AdvPSTim
unknown glyph name 'C115' for font APZBCX+AdvPSTim
unknown glyph name 'C108' for font APZBCX+AdvPSTim

And a bunch more lines like this. The text returned is just bytes in some encoding and is not readable.

This is a sample:

    -$  #  #     . 
   & .  '        /
 0120   3   4 5    &   '       
 (($1(0  ) 4     0  
) $  /  6 & /  6  
 '    / 7     8 9   : ;  
 4    /$<=    '       
  .      #'   '  : ; 9 4  5  
 &              #
  .   $(() !#   > ;  #

I cannot provide the whole document, but I'm attaching the first page so you can reproduce the error.
page.pdf
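
One observation that nobody in this thread has confirmed: the numbers in the glyph names from the warnings above all fall in the printable ASCII range (68, 101, 116, 114 would be `D`, `e`, `t`, `r`). If the `C<number>` names in this font really are decimal character codes, a fallback could look roughly like the sketch below. This is only a guess at the naming scheme. The function names are hypothetical and are not part of pdf-extract, and the mapping may well be wrong for the symbol fonts in the list (e.g. AdvPSSym).

```rust
/// Hypothetical fallback (NOT pdf-extract API): treat a glyph name of the
/// form `C<decimal>` as a Unicode code point.
/// ASSUMPTION: the number after `C` is a character code. This is inferred
/// from the warnings above and is not confirmed anywhere in this thread.
fn glyph_name_to_char(name: &str) -> Option<char> {
    // Only names that start with 'C' and continue with digits qualify.
    let digits = name.strip_prefix('C')?;
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // Overflowing or invalid code points yield None instead of panicking.
    let code: u32 = digits.parse().ok()?;
    char::from_u32(code)
}

/// Decode a run of glyph names, substituting U+FFFD for names that do not
/// match the assumed `C<decimal>` pattern.
fn decode_glyph_names(names: &[&str]) -> String {
    names
        .iter()
        .map(|n| glyph_name_to_char(n).unwrap_or('\u{FFFD}'))
        .collect()
}
```

With the first three names from the log, `decode_glyph_names(&["C68", "C101", "C116"])` gives `"Det"`. A name such as `space` does not match the pattern and would have to go through the normal glyph-name lookup.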

I will investigate more and if I find anything useful I will put it here.

Thank you

Owner

jrmuizel commented May 4, 2024

Which readers does copy and paste work in?

Contributor Author

maxpowel commented May 5, 2024

Okular https://okular.kde.org/ https://github.com/KDE/okular

This is the default PDF reader in KDE. With the file I provided, google-chrome only copies garbage, but okular copies the actual content.

Owner

jrmuizel commented May 6, 2024

I think to fix this we need to parse the CFF fonts.

@maxpowel
Contributor Author

My knowledge of fonts is very limited, but if I can help with anything please tell me.
I found this library that does CFF stuff, https://github.com/RazrFalcon/ttf-parser, but I don't know if it could be useful for this case.

Thanks
