PDF-hul: various issues with parsing PDFs #927

Open
petervwyatt opened this issue Jul 1, 2024 · 0 comments

@petervwyatt

Some issues noted about parsing PDFs:

  • { and } are not PDF delimiter tokens except within Type 4 PostScript functions (i.e. they are PS delimiters only), so treating them as delimiters anywhere else is incorrect. This was a long-standing error in PDF specifications.

  • PDF-hul header check is for %PDF-1 but the spec says the header is %PDF- followed by any digit (0-9), a period, and another digit. PDF 2.0 files should therefore report as a PDF file, but with an unsupported PDF version, until such time as you support PDF 2.0. JHOVE currently reports PDF 2.0 files as a bytestream, which is incorrect. See here

  • PDF-hul crashes if a PDF hex-string contains EOL characters - this is permitted by the PDF spec as whitespace can occur in hex-strings and the EOLs are considered whitespace. (For what it is worth, hex-strings and literal strings are the only 2 types of PDF tokens or keywords that can span multiple lines).

  • there seem to be assumptions with PDF-hul-xx error codes that a key with an explicit null value is invalid whereas the PDF spec states that such keys should be ignored (same as not present). An easy test is to set /Annots null on any page and compare behaviour to not having an /Annots entry present.

  • Java exception gets thrown if cross-reference sub-section marker lines (of 2 integers) start with a negative number (i.e. for the object number).

  • FileSpecification.java does not account for the UF entry added with PDF 1.7. This was noticed from a code review.

  • there is something strange going on when encountering empty names (i.e. just a '/' followed by nothing, which is a valid PDF name). PDump correctly lists it as a Name object with empty string "", but if 2 empty names are appended to a trailer dictionary (i.e. a valid key/value dictionary entry) then JHOVE doesn't work properly...

  • please consider adding support for UTF-8 text strings introduced with PDF 2.0. This was noted from a code review. Also note that UTF-8 strings do occur in some pre-PDF 2.0 files...
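
On the header point above, a minimal sketch of a more permissive check could look like the following. The class and method names are hypothetical and are not taken from the JHOVE code base, and treating only 1.x as supported is an assumption made purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not JHOVE code): recognises any "%PDF-d.d" header. */
final class PdfHeaderSketch {
    // "%PDF-" followed by one ASCII digit, a period, and one ASCII digit.
    private static final Pattern HEADER = Pattern.compile("%PDF-([0-9])\\.([0-9])");

    /** Returns the version (e.g. "2.0") if the bytes start with a PDF header, else null. */
    static String headerVersion(byte[] firstBytes) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String s = new String(firstBytes, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(s);
        return m.lookingAt() ? m.group(1) + "." + m.group(2) : null;
    }

    /** Assumption for illustration: only 1.x counts as a supported version. */
    static boolean isSupported(String version) {
        return version != null && version.startsWith("1.");
    }
}
```

With this shape, a file starting with %PDF-2.0 yields a non-null version that is flagged as unsupported, rather than falling through to a bytestream result.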
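
On the hex-string point above, the following is a rough sketch of a decoder that skips white-space, including EOLs, between hex digits. It is an illustration only: HexStringSketch and its methods are made-up names, and the input is assumed to be the text between the enclosing < and >.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper (not JHOVE code): decodes the body of a PDF hex string. */
final class HexStringSketch {
    /** PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE. */
    static boolean isPdfWhitespace(char c) {
        return c == 0x00 || c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D || c == 0x20;
    }

    /** Returns 0-15 for an ASCII hex digit, or -1 for anything else. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes the characters between '<' and '>' into bytes. */
    static byte[] decode(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 if none
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (isPdfWhitespace(c)) {
                continue; // white-space, including CR and LF, is ignored
            }
            int v = hexValue(c);
            if (v < 0) {
                throw new IllegalArgumentException("Invalid hex digit at offset " + i);
            }
            if (high < 0) {
                high = v;
            } else {
                out.write((high << 4) | v);
                high = -1;
            }
        }
        if (high >= 0) {
            out.write(high << 4); // odd digit count: the missing final digit is taken as 0
        }
        return out.toByteArray();
    }
}
```

Invalid characters are reported with a single, predictable exception type here, so a caller could turn them into a validation message instead of crashing.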
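
On the delimiter and empty-name points above, a small sketch of name tokenisation may help show the intended behaviour. It follows the reading given in this issue, namely that { and } are regular characters outside Type 4 functions. NameTokenSketch is a hypothetical class, and decoding of #xx escapes is deliberately left out.

```java
/** Hypothetical helper (not JHOVE code): reads a PDF name token from a string. */
final class NameTokenSketch {
    /** PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE. */
    static boolean isWhitespace(char c) {
        return c == 0x00 || c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D || c == 0x20;
    }

    /** Delimiters outside Type 4 functions; '{' and '}' are intentionally absent. */
    static boolean isDelimiter(char c) {
        return "()<>[]/%".indexOf(c) >= 0;
    }

    /**
     * Reads the name that starts with the '/' at position pos and returns it
     * without the leading slash. The result may be the empty string.
     * This sketch does not decode #xx escape sequences.
     */
    static String readName(String src, int pos) {
        if (pos >= src.length() || src.charAt(pos) != '/') {
            throw new IllegalArgumentException("No name at offset " + pos);
        }
        int end = pos + 1;
        while (end < src.length()
                && !isWhitespace(src.charAt(end))
                && !isDelimiter(src.charAt(end))) {
            end++;
        }
        return src.substring(pos + 1, end);
    }
}
```

Under these assumptions, the trailer fragment "/ /" tokenises as two empty names, which form one key/value entry.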
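
On the explicit null point above, one way to get consistent behaviour is to funnel every dictionary lookup through a helper that treats a null value the same as a missing key. The sketch below uses a plain Map and a sentinel object as stand-ins. Both are assumptions for illustration and do not reflect how JHOVE models PDF objects.

```java
import java.util.Map;

/** Hypothetical helper (not JHOVE code): dictionary lookup that ignores null entries. */
final class NullEntrySketch {
    /** Stand-in for a parsed PDF null object. */
    static final Object PDF_NULL = new Object();

    /** Returns the value for key, or null if the key is absent or maps to PDF null. */
    static Object lookup(Map<String, Object> dict, String key) {
        Object value = dict.get(key);
        return value == PDF_NULL ? null : value;
    }

    /** True only if the key is present with a non-null value. */
    static boolean hasEntry(Map<String, Object> dict, String key) {
        return lookup(dict, key) != null;
    }
}
```

A page dictionary with /Annots null and one with no /Annots key then look identical to any later validation step.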
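
On the cross-reference point above, a defensive parse of the sub-section marker line could reject a negative object number without throwing. This is only a sketch: XrefSubsectionSketch is a made-up name, and the pattern assumes exactly one space between the two integers, which may be stricter than what real-world files need.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not JHOVE code): parses an xref sub-section marker line. */
final class XrefSubsectionSketch {
    // Two unsigned decimal integers separated by one space, then optional trailing white-space.
    private static final Pattern LINE = Pattern.compile("([0-9]+) ([0-9]+)\\s*");

    /**
     * Returns {firstObjectNumber, entryCount}, or null if the line is malformed.
     * A leading '-' fails the pattern, so negative numbers are reported as
     * malformed instead of surfacing as an exception.
     */
    static long[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        try {
            return new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2))};
        } catch (NumberFormatException e) {
            return null; // digits only, but too large for a long
        }
    }
}
```

A null result gives the caller a single place to raise a validation error for the file.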
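
On the UTF-8 point above, PDF 2.0 text strings are distinguished by their leading bytes: FE FF marks UTF-16BE, EF BB BF marks UTF-8, and anything else is PDFDocEncoding. A minimal sketch of that detection follows. TextStringSketch is a hypothetical class, and PDFDocEncoding decoding is left out because it needs its own mapping table.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not JHOVE code): detects the encoding of a PDF text string. */
final class TextStringSketch {
    enum Encoding { UTF16BE, UTF8, PDFDOC }

    /** Classifies a text string by its byte order mark. */
    static Encoding detect(byte[] b) {
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Encoding.UTF16BE;
        }
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return Encoding.UTF8;
        }
        return Encoding.PDFDOC;
    }

    /** Decodes the two Unicode forms; returns null for PDFDocEncoding (not handled here). */
    static String decodeUnicode(byte[] b) {
        switch (detect(b)) {
            case UTF16BE:
                return new String(b, 2, b.length - 2, StandardCharsets.UTF_16BE);
            case UTF8:
                return new String(b, 3, b.length - 3, StandardCharsets.UTF_8);
            default:
                return null;
        }
    }
}
```

Because the check depends only on the leading bytes, it can also be applied to pre-PDF 2.0 files that already contain UTF-8 strings.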

@carlwilson carlwilson self-assigned this Aug 22, 2024
@carlwilson carlwilson added bug A product defect that needs fixing P2 Medium priority issues to be scheduled in a future release labels Aug 22, 2024
@carlwilson carlwilson added this to the JHOVE 1.34 milestone Aug 22, 2024