PDF-hul: various issues with parsing PDFs #927

petervwyatt · 2024-07-01T02:30:07Z

Some issues noted about parsing PDFs:

{ and } are not PDF delimiter tokens except within Type 4 PostScript functions (i.e. they are PS delimiters only), so treating them as delimiters elsewhere is incorrect. This was a long-standing error in PDF specifications.
The PDF-hul header check looks for %PDF-1, but the spec says the header is %PDF- followed by any digit (0-9), a period and another digit, so PDF 2.0 files should be reported as PDF files, but with an unsupported PDF version until such time as you support PDF 2.0. JHOVE currently reports PDF 2.0 files as a bytestream, which is incorrect. See here
PDF-hul crashes if a PDF hex-string contains EOL characters - this is permitted by the PDF spec as whitespace can occur in hex-strings and the EOLs are considered whitespace. (For what it is worth, hex-strings and literal strings are the only 2 types of PDF tokens or keywords that can span multiple lines).
The PDF-hul-xx error codes seem to assume that a key with an explicit null value is invalid, whereas the PDF spec states that such keys should be ignored (same as not present). An easy test is to set /Annots null on any page and compare the behaviour to not having an /Annots entry present.
A Java exception is thrown if a cross-reference sub-section marker line (of 2 integers) starts with a negative number (i.e. for the object number).
FileSpecification.java does not account for the UF entry added with PDF 1.7. This was noticed from a code review.
There is something strange going on when encountering empty names (i.e. just a '/' followed by nothing, which is a valid PDF name). PDump correctly lists one as a Name object with empty string "", but if 2 empty names are appended to a trailer dictionary (i.e. a valid key/value dictionary entry) then JHOVE doesn't work properly...
Please consider adding support for UTF-8 text strings introduced with PDF 2.0. This was noted from a code review. Also note that UTF-8 strings do occur in some pre-PDF 2.0 files...
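To illustrate the header point above, here is a minimal sketch of a version-agnostic header check. The class and method names are hypothetical and are not JHOVE's actual code; it also assumes the header sits at offset 0 of the supplied bytes. The idea is to recognise any %PDF-d.d header first and decide separately whether the version is supported.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JHOVE.
final class PdfHeaderSketch {
    // "%PDF-" followed by one digit, a period, and one more digit.
    private static final Pattern HEADER = Pattern.compile("%PDF-([0-9])\\.([0-9])");

    /** Returns the version (e.g. "2.0") if the bytes begin with a PDF header. */
    static Optional<String> version(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data decodes safely.
        int len = Math.min(head.length, 8);
        String s = new String(head, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(s);
        return m.lookingAt()
                ? Optional.of(m.group(1) + "." + m.group(2))
                : Optional.empty();
    }

    /** Assumption for this sketch: only 1.x versions are currently supported. */
    static boolean isSupported(String version) {
        return version.startsWith("1.");
    }
}
```

With this split, a file starting with %PDF-2.0 is still identified as a PDF and can then be reported as having an unsupported version, rather than falling through to bytestream.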
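For the hex-string point above, the following is a rough sketch of decoding that tolerates whitespace, including EOLs, between hex digits. The class name is hypothetical and this is not how PDF-hul is structured; the input is assumed to be the characters between < and >.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of JHOVE.
final class HexStringSketch {
    /** PDF whitespace characters: NUL, HT, LF, FF, CR and SPACE. */
    static boolean isPdfWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /** Decodes the body of a hex string, i.e. the characters between '<' and '>'. */
    static byte[] decode(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 if none
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (isPdfWhitespace(c)) {
                continue; // whitespace, including CR and LF, is skipped
            }
            // Restrict to ASCII so non-ASCII Unicode digits are rejected.
            int v = c < 128 ? Character.digit(c, 16) : -1;
            if (v < 0) {
                throw new IllegalArgumentException("Invalid hex digit at index " + i);
            }
            if (high < 0) {
                high = v;
            } else {
                out.write((high << 4) | v);
                high = -1;
            }
        }
        if (high >= 0) {
            out.write(high << 4); // odd digit count: the last digit is padded with 0
        }
        return out.toByteArray();
    }
}
```

A hex string split over several lines then decodes to the same bytes as the single-line form.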
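For the explicit-null point above, one possible approach is to normalise null values at lookup time so that callers cannot tell /Annots null apart from a missing /Annots. This is only a sketch: the class, the PDF_NULL sentinel and the use of plain Java objects for PDF values are all assumptions, not JHOVE's object model.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical dictionary wrapper, not part of JHOVE.
final class NullAwareDictSketch {
    /** Sentinel standing in for the PDF null object in this sketch. */
    static final Object PDF_NULL = new Object();

    private final Map<String, Object> entries = new HashMap<>();

    void put(String key, Object value) {
        entries.put(key, value);
    }

    /** Returns the value, or Java null if the key is absent or maps to PDF null. */
    Object get(String key) {
        Object v = entries.get(key);
        return v == PDF_NULL ? null : v;
    }

    /** A key whose value is PDF null counts as not present. */
    boolean containsKey(String key) {
        return get(key) != null;
    }
}
```

Validation code written against get and containsKey would then behave identically for both forms of the page dictionary.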
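For the UTF-8 text string request above, detection can be done on the byte order mark: FE FF marks UTF-16BE and, as of PDF 2.0, EF BB BF marks UTF-8. The sketch below uses a hypothetical class name, and it falls back to ISO-8859-1 only as a stand-in, because PDFDocEncoding is not a built-in Java charset and differs from ISO-8859-1 for some code points.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JHOVE.
final class TextStringSketch {
    /** Decodes the raw bytes of a PDF text string according to its leading BOM. */
    static String decode(byte[] b) {
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            // UTF-16BE text string: skip the 2-byte BOM.
            return new String(b, 2, b.length - 2, StandardCharsets.UTF_16BE);
        }
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            // UTF-8 text string (PDF 2.0): skip the 3-byte BOM.
            return new String(b, 3, b.length - 3, StandardCharsets.UTF_8);
        }
        // Approximation only: a real implementation needs a PDFDocEncoding table here.
        return new String(b, StandardCharsets.ISO_8859_1);
    }
}
```

Because the check depends only on the leading bytes and not on the declared PDF version, it would also cover the pre-PDF 2.0 files that already contain UTF-8 strings.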

carlwilson self-assigned this Aug 22, 2024

carlwilson added the bug (a product defect that needs fixing) and P2 (medium priority issues to be scheduled in a future release) labels Aug 22, 2024

carlwilson added this to the JHOVE 1.34 milestone Aug 22, 2024
