Reduce Size of Child PDFs by 'Cleaning' Global Resources #2767
-
To be honest, I have never used pypdf for this before, but mostly the (Py)MuPDF functionality, to redact private images from PDF samples. The usual way to do so there is to replace undesired images with a 1x1-pixel dummy image. If I remember correctly, doing a "hard clean" on the file would also remove images not referenced from the included pages. Without digging deeper, do we really need the regex for retrieving the images? Shouldn't …
-
@pubpub-zz, I think your proposed solution was intended to address the inverse use case, namely blanking the images on the selected pages. That is not the goal. I'm attempting to get rid of 'hidden' image data that persists even though it is NOT displayed on any of the selected pages. To make matters worse, none of the images actually belong directly to any page, nor are they referenced directly by their object ID in the page on which they ARE displayed. Instead, a global '/Resources' object contains an XObject reference that 'names' all of the images displayed across all pages. Images are then referenced by name in the compressed byte stream content of each page. Long story short, all of the images for all pages are attached to a global reference that behaves in a fashion much more typical of fonts. If you examine the contents of the 'UGLY WORKAROUND' PDF provided in the OP in a text editor, you'll notice the following:
This global '/Resources' object looks the same in the original, pg4_OOTB, and pg4_UGLY_WORKAROUND variants. The only difference is that the UGLY_WORKAROUND version has blanked out (i.e. replaced with `b""`) the image streams that are not referenced by the retained page. That's what that nasty bit of hacky logic in the OP is doing: digging through the decoded byte stream of every page that IS in the output to find named image references. At the end of the day, I have no idea how frequently this paradigm appears in the space of all PDF creation tools, so it may never be worth trying to deal with, but hopefully that clears up the problem??
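To make the layout concrete, here's a rough sketch of what that structure looks like in raw PDF syntax (object numbers and image names are invented for illustration; `/Name Do` is the standard operator by which a content stream paints a named XObject):

```
% one global resources dictionary, shared by every page
4 0 obj
<< /XObject << /Im1 10 0 R /Im2 11 0 R /Im3 12 0 R >> >>
endobj

% inside any page's (decompressed) content stream, an image is
% drawn purely by name -- no object ID appears on the page itself:
q 200 0 0 100 36 700 cm /Im2 Do Q
```

So a page-level walk of direct image references finds nothing; the only link from a page to its images is the name inside the compressed stream.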
-
Here's a better toy PDF. In this example, I've retained 5 pages, two of which DO contain images.
-
I've refactored the original hacky and outdated logic into something approaching 'modern' that you should be able to run yourself. Here's a toy example that can be used with the 5-page sample PDF from the previous comment. The new logic doesn't capture all of the use cases covered by the original and has only been tested on the sample, but it should be more understandable while still illustrating the proposed improvement. Interestingly enough, it's also more effective in terms of file-size reduction lol.

Notice that I've 'flipped' the regex lookup to eliminate the specificity concern noted in the OP: the new logic simply searches the content stream of each page for the keys of the global /Resources XObject dictionary. Doing it that way, rather than searching for a generic 'instantiated XObject' pattern, should make the logic universally applicable (theoretically ;)). I plan to continue testing with various input PDFs to get something like this implemented in my own repo. Once that's done, something along these lines might be worthy of a PR. Keep you posted.

```python
# %%
import re
from collections.abc import Sequence

from pypdf import PdfReader, PdfWriter
from pypdf.generic import ContentStream, DictionaryObject


def strip_unrendered_images(filename: str, pages: Sequence[int]):
    """
    Generates an 'OOTB' and an 'OPTIMIZED' version of a child PDF containing the
    indicated page indices from the original.

    Args:
        filename (str): The path to the original source pdf.
        pages (Sequence[int]): The page indices to extract from the original.
    """
    fpfx, _, fsfx = filename.rpartition(".")
    r = PdfReader(filename)
    w = PdfWriter()
    w.append(r, pages=pages)
    w.write(f"{fpfx}_pg{','.join(str(i + 1) for i in pages)}_OOTB.{fsfx}")
    # Group the output pages by the (possibly shared) XObject dictionary they inherit.
    xobjs: list[DictionaryObject] = []
    xobj_pages: list[list[int]] = []
    for i, out_page in enumerate(w.pages):
        rsrcs = out_page.get_inherited("/Resources", {})
        if "/XObject" in rsrcs:
            xobj = rsrcs["/XObject"]
            if xobj in xobjs:
                xobj_idx = xobjs.index(xobj)
                xobj_pages[xobj_idx].append(i)
            else:
                xobjs.append(xobj)
                xobj_pages.append([i])
    for page_group, xobj in zip(xobj_pages, xobjs):
        # 'Flipped' lookup: search each page's decoded content stream for the
        # known XObject keys, rather than for a producer-specific pattern.
        name_regex = re.compile(r"(" + r"|".join(xobj.keys()) + r")\s", re.MULTILINE)
        refd_names = set(
            _m
            for i in page_group
            for _m in name_regex.findall(
                str(w.pages[i].get_contents().get_data(), "utf-8", "ignore")
            )
        )
        # Keep the soft masks of any referenced images alive as well.
        refd_names.update(
            xobj[img]["/SMask"]["/Name"]
            for img in refd_names.copy()
            if img in xobj and "/SMask" in xobj[img] and "/Name" in xobj[img]["/SMask"]
        )
        print(f"{page_group=!r} references {refd_names=!r}")
        for unrefd_img in xobj.keys() - refd_names:
            print(f"Clearing {unrefd_img=!r}")
            w._replace_object(xobj[unrefd_img].indirect_reference, ContentStream(None, w))
    w.write(f"{fpfx}_pg{','.join(str(i + 1) for i in pages)}_OPTIMIZED.{fsfx}")


strip_unrendered_images("AB06236EEAC446128EDFBD5E1155C88D_pg4.12.13.41.42_UGLY.PDF", [1])
```
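In case it helps to see the 'flipped' lookup in isolation, here's a minimal, self-contained sketch of the idea. The content stream and XObject names below are made up for illustration; the `/Name Do` form is how the PDF spec paints a named XObject:

```python
import re

# Hypothetical decoded page content stream (two images painted by name).
content = b"q 200 0 0 100 36 700 cm /Im2 Do Q q 50 0 0 50 10 10 cm /Im7 Do Q"

# Hypothetical keys of the shared /Resources -> /XObject dictionary.
xobject_names = ["/Im1", "/Im2", "/Im7", "/Im9"]

# Search the stream for the known names followed by whitespace; requiring the
# trailing whitespace keeps "/Im2" from matching inside a longer name like "/Im20".
name_regex = re.compile("(" + "|".join(map(re.escape, xobject_names)) + r")\s")
referenced = set(name_regex.findall(content.decode("utf-8", "ignore")))
unreferenced = set(xobject_names) - referenced

print(sorted(referenced))    # ['/Im2', '/Im7']
print(sorted(unreferenced))  # ['/Im1', '/Im9']
```

Anything left in `unreferenced` is a candidate for blanking, since no retained page ever draws it.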
Beta Was this translation helpful? Give feedback.
-
I've been using a very hacky workaround for several years now to reduce the size of PDFs that result from splitting a large source PDF into 50, 100, or more children. It's still working fine, so I'm just throwing this out to see if anyone has an idea for addressing it 'properly' with an update to pypdf.
Problem
PDFs that use a global XObject resource entry for all rendered images carry all of the images forward into any child PDF created with pypdf.
Example
- `AB06236EEAC446128EDFBD5E1155C88D.PDF` is a 152-page report containing all of the patient records for patients seen on a single day in a hospital setting (Epic EMR).
- `AB06236EEAC446128EDFBD5E1155C88D_pg4_OOTB.PDF` shows the OOTB file size that results from cloning a single page into a separate PDF using the standard pypdf paradigm.
- `AB06236EEAC446128EDFBD5E1155C88D_pg4_UGLY_WORKAROUND.PDF` shows the file size that results after applying my very hacky and totally-not-production-ready workaround.

The 'OOTB' and 'UGLY_WORKAROUND' PDFs are visually identical. I've included the workaround PDF below since it contains no protected information. I did NOT share the 'OOTB' split PDF because it still contains every image displayed across all 152 pages of the original source PDF (even though NONE of those images are actually displayed on the retained page), and some of those images contain protected information.
AB06236EEAC446128EDFBD5E1155C88D_pg4_UGLY_WORKAROUND.PDF
Hacky Fix
Wait for it... It's ugly...
As shown above, the 'workaround' scans the decoded contents of every page in the child PDF to find all references to images stored in the global XObject, and then 'blanks out' the bytes (i.e. replaces them with `b""`) of all of the unrendered images that are still present in the global XObject. This is a bad practice for a number of reasons, and it's also very dependent on a particular 'regexable' reference format in the decoded bytes that is probably specific to the Epic EMR reporting engine (which generates these source PDFs).

At the end of the day, the size is less of an issue than the potential 'data leaks' that the non-rendered images from other pages represent. Some of these images contain protected information for other patients, so it's important to ensure that only the images that belong to the pages being clipped are included in the output. Any thoughts or ideas on how this might be addressed in a 'non-hacky' way??
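For reference, a 'blanked' image object in the workaround output ends up looking something like this in a text editor (a sketch; the object number is invented and the exact dictionary keys vary by producer):

```
10 0 obj
<< /Length 0 >>
stream
endstream
endobj
```

The name in the global XObject dictionary still resolves, so viewers render the retained pages normally; the blanked object just contributes no image bytes.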