1.0.21 causes previously-consumable PDFs to fail now with RangeError #191

rdunlop · 2021-03-29T20:56:20Z

I suspect that the input PDF that I'm dealing with is invalid...but I wanted to mention that it was working in 1.0.20, but no longer in 1.0.21.

The PDF appears to have an invalid stream defined near the end of my file (relevant part here):

8 0 obj\r<</Length 2200\r/Type\r/Metadata\r/Subtype \r/XML>>\rstream\rendstream\rendobj\r9 0 obj\r<< /Keywords()\r/Creator(HP Scan) \r/CreationDate(D:20210326163700-08'00')\r/ModDate(D:20210326163700-08'00')\r/Author ()\r/Producer (HP Scan Extended Application)\r/Title ()\r/Subject ()\r>>\rendobj\rxref\r0 10\r0000000000 65535 f \r0000000009 00000 n \r0000522282 00000 n \r0000522379 00000 n \r0000522588 00000 n \r0000522646 00000 n \r0000522697 00000 n \r0000522746 00000 n \r0000522892 00000 n \r0000522972 00000 n \rtrailer\r<<\r/Size 10\r/Root 5 0 R\r/Info 6 0 R\r/Info 7 0 R\r/Info 8 0 R\r/Info 9 0 R\r>>\rstartxref\r523171\r%%EOF\r

(pretty printed):

8 0 obj
<</Length 2200
/Type
/Metadata
/Subtype 
/XML>>
stream
endstream
endobj
9 0 obj
<< /Keywords()
/Creator(HP Scan) 
/CreationDate(D:20210326163700-08'00')
/ModDate(D:20210326163700-08'00')
/Author ()
/Producer (HP Scan Extended Application)
/Title ()
/Subject ()
>>
endobj
xref
0 10
0000000000 65535 f 
0000000009 00000 n 
0000522282 00000 n 
0000522379 00000 n 
0000522588 00000 n 
0000522646 00000 n 
0000522697 00000 n 
0000522746 00000 n 
0000522892 00000 n 
0000522972 00000 n 
trailer
<<
/Size 10
/Root 5 0 R
/Info 6 0 R
/Info 7 0 R
/Info 8 0 R
/Info 9 0 R
>>
startxref
523171
%%EOF

As you can see, the Length is 2200, but there are not 2200 bytes left in the file, and thus the line @scanner.pos += out.last[:Length].to_i - 2 ([here](https://github.com/boazsegev/combine_pdf/blob/b966e703fd897ff50832d3823e74791099b82ca3/lib/combine_pdf/parser.rb#L364)) causes a RangeError.
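The failure mode described above can be reproduced in isolation. This is a minimal sketch, assuming a shortened stand-in for the object in question (the `SAMPLE` string and the `advance_by_declared_length` helper are hypothetical and not part of combine_pdf); it only shows that Ruby's `StringScanner#pos=` raises a RangeError when the target position lies past the end of the data.

```ruby
require 'strscan'

# Hypothetical miniature of the failing object: /Length claims 2200 bytes,
# but the stream body is empty and the file ends shortly after.
SAMPLE = "8 0 obj\r<</Length 2200>>\rstream\rendstream\rendobj\r"

# Mimics the failing step: jump ahead by the declared length without
# checking that the target position is still inside the data.
def advance_by_declared_length(data, declared_length)
  scanner = StringScanner.new(data)
  scanner.skip_until(/stream\r/) # move just past the "stream" keyword
  scanner.pos += declared_length # raises RangeError when out of bounds
  :advanced
rescue RangeError
  :range_error
end

puts advance_by_declared_length(SAMPLE, 2200)
puts advance_by_declared_length(SAMPLE, 0)
```

With a declared length that fits inside the data the jump succeeds; with the 2200 from the file above it does not.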

I am opening this ticket because I'm 90% sure that this is an invalid PDF, but I wanted to mention it out loud that the change introduced in 1.0.21 is (to me) a regression in capability. I recognize that #184 is a related issue.

For now, I've resolved my issue by reverting to 1.0.20. Not ideal, but sufficient for my purposes for now.


boazsegev · 2021-04-07T23:30:15Z

Hi @rdunlop ,

Thank you for opening this issue. I totally understand your concern and I myself was debating this change for this very reason.

This isn't about a performance optimization. I would much rather be able to read malformed PDF files than run faster...

...however, as I explained in #185, this is required to accommodate properly authored PDF files that are allowed to contain PDF-like markers in their stream data (i.e., a PDF explaining how PDF data looks might contain the PDF endstream keyword). Issue #184 referenced such a valid PDF file as an example.
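To illustrate the point about markers inside stream data, here is a small sketch comparing the two ways of finding the end of a stream. The helper names and the sample document are hypothetical, the sample is ASCII-only so character and byte offsets coincide, and neither helper is combine_pdf's actual parser code.

```ruby
# Hypothetical helpers for illustration; neither is combine_pdf's parser.
STREAM_KEYWORD = "stream\n"

# Offset of the first byte after the "stream" keyword and its line ending.
def body_start(pdf)
  pdf.index(STREAM_KEYWORD) + STREAM_KEYWORD.bytesize
end

# Ends the stream at the first "endstream" keyword found in the data.
def body_by_keyword(pdf)
  start = body_start(pdf)
  pdf[start...pdf.index("endstream", start)]
end

# Ends the stream after exactly the number of bytes declared in /Length.
def body_by_length(pdf, length)
  pdf.byteslice(body_start(pdf), length)
end

payload = "A stream ends with the endstream keyword."
pdf = "<</Length #{payload.bytesize}>>\nstream\n#{payload}\nendstream\nendobj\n"

puts body_by_keyword(pdf).inspect            # cut short at the embedded keyword
puts body_by_length(pdf, payload.bytesize).inspect # the full payload
```

Searching for the keyword truncates a valid stream whose data happens to contain it, which is why a correct /Length has to take priority.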

The choice was either to continue failing on valid PDF files or to patch in a way that limited support for malformed PDF files... I guess there's a way to support both variations, I just didn't see it at the time (though I see it now, it might have a performance penalty).
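One possible shape for supporting both variations is sketched below: trust /Length when it stays inside the data and is followed by the endstream keyword, and otherwise fall back to searching for the keyword. This is only an assumption about how such a fallback could look; `read_stream_body` is a hypothetical helper, not combine_pdf's API and not necessarily what any linked PR does.

```ruby
require 'strscan'

# Hypothetical helper, not part of combine_pdf.
# Expects the scanner to sit on the first byte of the stream body.
def read_stream_body(scanner, declared_length)
  start = scanner.pos
  finish = start + declared_length
  if declared_length >= 0 && finish <= scanner.string.bytesize
    scanner.pos = finish
    # Accept /Length only if "endstream" follows (after optional whitespace).
    if scanner.match?(/\s*endstream/)
      return scanner.string.byteslice(start, declared_length)
    end
    scanner.pos = start # /Length was wrong; rewind and fall back
  end
  # Fallback for malformed files: everything up to the next "endstream",
  # excluding the line ending in front of it. Returns nil if none is found.
  scanner.scan_until(/(?=\r?\n?endstream)/)
end

# Malformed case from this issue: /Length 2200 with an empty stream body.
broken = StringScanner.new("endstream\rendobj\r")
puts read_stream_body(broken, 2200).inspect
```

The extra `match?` check and the fallback scan only cost time on files whose /Length is wrong, so well-formed files keep the fast path.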

I'm short on time, but if you want to submit a PR that prefers valid PDF files and supports some sort of handling for malformed PDF files, that would be great.

Cheers,
Boaz Segev.

RBIII · 2022-07-12T06:27:33Z

This issue happened for me as well. The PR seems to fix it, @boazsegev.

stiaannel · 2023-03-20T15:30:17Z

Have there been any updates on this ticket or #205 yet, in particular on whether the fix will be merged? @boazsegev

JrmKrb · 2023-06-14T14:06:09Z

Thanks for the PR. Is there any way to get this fix merged, @boazsegev?

AdrienQuilletKelio · 2024-01-30T17:11:14Z

Sorry to bump that PR, but we are experiencing the same bug in production!
A fix would be greatly appreciated :)

julitrows · 2024-05-14T10:15:04Z

Still alive in 1.0.26

boazsegev mentioned this issue Jan 21, 2022

Parsing specific PDF in 1.0.21 - RangeError: index out of range (works in 1.0.20) #205

Open

mfazekas linked a pull request May 3, 2022 that will close this issue

HP Scan invalid Length workaround #215

Open
