Copying pages to a new PDF document brings over all images in the Resources/XObject section, even those not used on the page #1662
Comments
Here's an example of the result of splitting a PDF with several pages; this is the first page. If you look at its resources, you'll see references to many unused images.
Compare that with this version of the same tune, taken from a version of the same PDF tunebook that contained only that one tune. No unused image resources are present.
This was the original PDF fed to the splitter; it has multiple pages of images.
I was finally able to split the pages by range and delete the unused images before exporting the split pages. The function returns the PDF bytes for a range of pages in the original PDF, with unused images stripped. This works for me because I know exactly how I'm creating the original PDF with jsPDF and know its deterministic structure; it may not be a general solution.
I figured that I needed three things: a way to get the list of XObject images used in the document, a way to delete them, and a way to get the raw command stream for each page. From the command stream I could work out which images are actually used in the split pages and delete the rest.
splitPDF(originalPdf, range) takes a PDFDocument and a {start: startpage, end: endpage} range and returns the bytes for the new split PDF document, for saving or other processing. In my case, I just put them in a Blob and save the file (code not provided here).
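The "work out which images are actually used" step can be sketched roughly as below. This is not the code from the comment above, only a minimal illustration. It assumes the page's command stream has already been decoded to text and that images are painted with the standard `/Name Do` operator, which is what simple generated PDFs typically do. The function names `usedXObjectNames` and `unusedXObjectNames` are hypothetical, and the regex is a heuristic, not a full content-stream parser (for example, it could be fooled by `Do` appearing inside a text string).

```javascript
// Collect the XObject names painted by a decoded content stream.
// Assumption: `contentStreamText` is the already-decompressed stream as a string.
function usedXObjectNames(contentStreamText) {
  const used = new Set();
  // Match "/Name Do": a PDF name, whitespace, then the Do operator as a whole token.
  const pattern = /\/([^\s\/\[\]<>(){}%]+)\s+Do(?![^\s\/\[\]<>(){}%])/g;
  for (const match of contentStreamText.matchAll(pattern)) {
    used.add(match[1]);
  }
  return used;
}

// Given the names listed under Resources/XObject and the decoded content
// streams of the pages being kept, return the names that are never painted.
function unusedXObjectNames(resourceNames, contentStreamTexts) {
  const used = new Set();
  for (const text of contentStreamTexts) {
    for (const name of usedXObjectNames(text)) {
      used.add(name);
    }
  }
  return resourceNames.filter((name) => !used.has(name));
}

// Example: a page that paints only I0 and I3 out of four declared images.
const stream = "q 100 0 0 50 10 10 cm /I0 Do Q\nq 80 0 0 40 10 70 cm /I3 Do Q";
console.log(unusedXObjectNames(["I0", "I1", "I2", "I3"], [stream]));
```

The names returned are the candidates for removal from the page's XObject dictionary. Actually deleting the entries and their underlying objects relies on lower-level document access that is not shown here.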
What were you trying to do?
I am working on splitting a PDF document (PDF of music scores generated with a music transcription tool I've built) into individual page ranges, using a common pattern I've seen recommended for doing this sort of thing with pdf-lib.
How did you attempt to do it?
async function splitPDF(pdfBytes, ranges) {
}
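For readers unfamiliar with the pattern, a minimal sketch of the commonly recommended range split is shown below. It is not the original function body, which is not included above. It assumes 1-based, inclusive `{start, end}` ranges. The names `rangeToPageIndices` and `splitByRanges` are hypothetical, and the pdf-lib `PDFDocument` class is passed in as a parameter so the sketch carries no import of its own.

```javascript
// Convert a 1-based, inclusive {start, end} range (an assumption of this
// sketch) into the zero-based page indices that copyPages() expects.
function rangeToPageIndices({ start, end }) {
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 1 || end < start) {
    throw new RangeError(`Invalid page range: ${start}-${end}`);
  }
  return Array.from({ length: end - start + 1 }, (_, i) => start - 1 + i);
}

// Split `pdfBytes` into one new document per range and return each as bytes.
// `PDFDocument` is the class exported by pdf-lib, supplied by the caller.
async function splitByRanges(PDFDocument, pdfBytes, ranges) {
  const source = await PDFDocument.load(pdfBytes);
  const outputs = [];
  for (const range of ranges) {
    const target = await PDFDocument.create();
    // Copy the selected pages into the new document, then append them in order.
    const pages = await target.copyPages(source, rangeToPageIndices(range));
    pages.forEach((page) => target.addPage(page));
    outputs.push(await target.save());
  }
  return outputs;
}

console.log(rangeToPageIndices({ start: 2, end: 4 }));
```

With pdf-lib loaded, a call might look like `splitByRanges(PDFLib.PDFDocument, bytes, [{ start: 1, end: 1 }])`. This is the pattern that produces the oversized output files described in this issue.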
What actually happened?
Unfortunately, what I find in the split files that get written out is that all of the images referenced in the original PDF are present in the split PDF files, and I see entries for them in the context indirectObjects. The split files are essentially all the same size as the original complete PDF.
What did you expect to happen?
I expected each split file to contain only the images used by its own pages. Instead, it looks like copyPages() doesn't filter out the unused images; it copies the entire set of images referenced in the original PDF document you're copying from.
If I inspect the actual operators with a PDF parser, I can see that they reference only the images used in the page range, yet each resulting PDF file is essentially the size of the original PDF before the split.
I've seen a few posts about issues with file size using copyPages() to split the files, and I'm guessing this is the root cause.
How can we reproduce the issue?
Take an existing PDF file that has many images and try to split it into individual files. I've attached a typical example of the sort of PDF generated by my tool that I'm trying to split into individual PDFs per page.
Retreat_Tunes_Played_Slowly_2024_Standard_Notation.pdf
Version
1.17.1
What environment are you running pdf-lib in?
Browser
Checklist
Additional Notes
No response