PDF-hul: JHOVE hangs in infinite loop #920

Open
Bodensuri opened this issue Apr 17, 2024 · 3 comments


@Bodensuri

Validating the PDF of a dissertation ran for several minutes before I aborted the process. I assume there is something in the PDF structure that causes JHOVE to get stuck in an infinite loop. The problem occurs with JHOVE GUI 1.28.0 (2023-05-18, Buffer Size -1, PDF-hul selected) and PDF-hul 1.12.4 (16.03.2023). The same problem occurs with JHOVE 1.26.1, Plugin Version 1.2, plugin name PDF-hul-1.26.

The dissertation "Making CIAM..." is too large to upload here. The PDF is available from https://www.research-collection.ethz.ch/handle/20.500.11850/183653. The direct link is:
https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/183653/KALPAKCI_MakingCIAM_Dissertation.pdf?sequence=3&isAllowed=y

@RvanVeenendaal

RvanVeenendaal commented Apr 17, 2024

Please note that I tested the dissertation file with JHOVE 1.26.1, 2022-07-14 on Windows 10, and that it produces results.
The file seems to be a well-formed and valid PDF 1.4 with a PDF-HUL-136 info message.
It just takes a long time, about half an hour on my Intel Xeon 3.10 GHz laptop with 8 GB RAM, and produces 13 megabytes of output.
See the attached output of "jhove.bat -m pdf-hul KALPAKCI_MakingCIAM_Dissertation.pdf > dissertation.txt".
dissertation.zip
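
To tell a genuinely hung run from a slow one, the command above can be wrapped in a time limit. The following is a minimal Python sketch, not part of JHOVE; the helper name `run_with_timeout` is hypothetical, and it assumes the `jhove.bat` launcher is on the PATH when used as in the comment at the bottom.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, elapsed_seconds, stdout_text).

    finished is False if the process was still running after timeout_s
    seconds, in which case it is killed and stdout_text is empty.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, time.monotonic() - start, ""
    return True, time.monotonic() - start, proc.stdout


# Example call (assumption: jhove.bat is on the PATH; one hour limit):
# finished, elapsed, report = run_with_timeout(
#     ["jhove.bat", "-m", "pdf-hul", "KALPAKCI_MakingCIAM_Dissertation.pdf"],
#     3600,
# )
```

A run that finishes in about half an hour, as reported here, would return `finished == True` with a large report, whereas a real infinite loop would always hit the limit.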

@fitnycdigitalinitiatives

I'm getting the same errors with TIFF files after upgrading to Archivematica 1.15.1, which uses JHOVE 1.26.1. I tested the same files on an earlier version of JHOVE (1.20.0) without any errors.

@Bodensuri
Author

Thank you for investigating the issue. Indeed, the file did not produce an infinite loop. Nevertheless, I feel that the issue should not be closed yet.
The result of the JHOVE validation (dissertation.txt, attached above by RvanVeenendaal) is a file with around 564,000 lines. Lines 47 to 555706 describe metadata for around 60,000 images. The metadata for each image takes 9 lines and looks OK.
However, I can only find about 200 images in the PDF when I open it with Adobe. Validation with Adobe Preflight takes only a few seconds. Preflight finds only those 200 images, no attachments, and nothing conspicuous. veraPDF also takes only a few seconds, finds a small (but slightly higher) number of images, and states that the file is a valid PDF/A-1a file.
There are several methods of embedding images in a PDF, but the 60,000 images found by JHOVE puzzle me. Could this be a bug in JHOVE?
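
For reference, the image count can be derived from the line range quoted above. This is a small Python check of the arithmetic only; it assumes the range is inclusive and that every image occupies exactly 9 lines with nothing else in between, and the function name `count_blocks` is made up for illustration.

```python
def count_blocks(first_line, last_line, lines_per_block):
    """Return (whole_blocks, leftover_lines) for an inclusive line range."""
    total_lines = last_line - first_line + 1
    return divmod(total_lines, lines_per_block)


blocks, leftover = count_blocks(47, 555706, 9)
print(blocks, leftover)  # prints "61740 0"
```

Under those assumptions the report holds 61,740 image entries, roughly 300 times the 200 or so images visible in Adobe.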
