Reduce Size of Child PDFs by 'Cleaning' Global Resources #2767
-
To be honest, I have never used pypdf for this before, but mostly the (Py)MuPDF functionality, to redact private images from PDF samples. The usual way to do so there is to replace undesired images with a 1x1-pixel dummy image. If I remember correctly, doing a "hard clean" on the file would also remove images not referenced from the included pages. Without digging deeper, do we really need the regex for retrieving the images? Shouldn't …
-
@pubpub-zz, I think your proposed solution was intended to address the inverse use case, namely blanking the images on the selected pages. That is not the goal. I'm attempting to get rid of 'hidden' image data that persists even though it is NOT displayed on any of the selected pages. To make matters worse, none of the images actually belong directly to any page, nor are they referenced directly by their object ID in the page on which they ARE displayed. Instead, a global '/Resources' object contains an XObject reference that 'names' all of the images displayed across all pages. Images are then referenced by name in the compressed byte stream content of each page. Long story short, all of the images for all pages are attached to a global reference that behaves in a fashion much more typical of fonts. If you examine the contents of the 'UGLY WORKAROUND' PDF provided in the OP in a text editor, you'll notice the following:
This global '/Resources' object looks the same in the original, pg4_OOTB, and pg4_UGLY_WORKAROUND variants. The only difference is that the UGLY_WORKAROUND version has blanked out (i.e. replaced with `b""`) the image streams that are not referenced by the retained page. That's what that nasty bit of hacky logic in the OP is doing: digging through the decoded byte stream of every page that IS in the output to find named image references. At the end of the day, I have no idea how frequently this paradigm appears in the space of all PDF creation tools, so it may never be worth trying to deal with, but hopefully that clears up the problem??
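To make the layout concrete, here's a rough sketch of what that structure looks like in raw PDF syntax (object numbers and image names are invented for illustration; `/Name Do` is the standard operator by which a content stream paints a named XObject):

```
% one global resources dictionary, shared by every page
4 0 obj
<< /XObject << /Im1 10 0 R /Im2 11 0 R /Im3 12 0 R >> >>
endobj

% inside any page's (decompressed) content stream, an image is
% drawn purely by name -- no object ID appears on the page itself:
q 200 0 0 100 36 700 cm /Im2 Do Q
```

So a page-level walk of direct image references finds nothing; the only link from a page to its images is the name inside the compressed stream.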
-
Here's a better toy PDF. In this example, I've retained 5 pages, two of which DO contain images.
-
I've refactored the original hacky and outdated logic into something approaching 'modern' that you should be able to run yourself. Here's a toy example that can be used with the 5-page sample PDF from the previous comment. The new logic doesn't capture all of the use cases covered by the original and has only been tested on the sample, but it should be more understandable while still illustrating the proposed improvement. Interestingly enough, it's also more effective in terms of file-size reduction lol.

Notice that I've 'flipped' the regex lookup to eliminate the specificity concern noted in the OP: the new logic simply searches the content stream of each page for the keys of the global /Resources XObject dictionary. Doing it that way, rather than searching for a generic 'instantiated XObject' pattern, should make the logic universally applicable (theoretically ;)). I plan to continue testing with various input PDFs to get something like this implemented in my own repo. Once that's done, something along these lines might be worthy of a PR. Keep you posted.

```python
# %%
import re
from collections.abc import Sequence

from pypdf import PdfReader, PdfWriter
from pypdf.generic import ContentStream, DictionaryObject


def strip_unrendered_images(filename: str, pages: Sequence[int]):
    """
    Generates an 'OOTB' and an 'OPTIMIZED' version of a child PDF containing the
    indicated page indices from the original.

    Args:
        filename (str): The path to the original source pdf.
        pages (Sequence[int]): The page indices to extract from the original.
    """
    fpfx, _, fsfx = filename.rpartition(".")
    r = PdfReader(filename)
    w = PdfWriter()
    w.append(r, pages=pages)
    w.write(f"{fpfx}_pg{','.join(str(i + 1) for i in pages)}_OOTB.{fsfx}")
    # Group the output pages by the (possibly shared) XObject dictionary they inherit.
    xobjs: list[DictionaryObject] = []
    xobj_pages: list[list[int]] = []
    for i, out_page in enumerate(w.pages):
        rsrcs = out_page.get_inherited("/Resources", {})
        if "/XObject" in rsrcs:
            xobj = rsrcs["/XObject"]
            if xobj in xobjs:
                xobj_idx = xobjs.index(xobj)
                xobj_pages[xobj_idx].append(i)
            else:
                xobjs.append(xobj)
                xobj_pages.append([i])
    for page_group, xobj in zip(xobj_pages, xobjs):
        # 'Flipped' lookup: search each page's decoded content stream for the
        # known XObject keys, rather than for a producer-specific pattern.
        name_regex = re.compile(r"(" + r"|".join(xobj.keys()) + r")\s", re.MULTILINE)
        refd_names = set(
            _m
            for i in page_group
            for _m in name_regex.findall(
                str(w.pages[i].get_contents().get_data(), "utf-8", "ignore")
            )
        )
        # Keep the soft masks of any referenced images alive as well.
        refd_names.update(
            xobj[img]["/SMask"]["/Name"]
            for img in refd_names.copy()
            if img in xobj and "/SMask" in xobj[img] and "/Name" in xobj[img]["/SMask"]
        )
        print(f"{page_group=!r} references {refd_names=!r}")
        for unrefd_img in xobj.keys() - refd_names:
            print(f"Clearing {unrefd_img=!r}")
            w._replace_object(xobj[unrefd_img].indirect_reference, ContentStream(None, w))
    w.write(f"{fpfx}_pg{','.join(str(i + 1) for i in pages)}_OPTIMIZED.{fsfx}")


strip_unrendered_images("AB06236EEAC446128EDFBD5E1155C88D_pg4.12.13.41.42_UGLY.PDF", [1])
```
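In case it helps to see the 'flipped' lookup in isolation, here's a minimal, self-contained sketch of the idea. The content stream and XObject names below are made up for illustration; the `/Name Do` form is how the PDF spec paints a named XObject:

```python
import re

# Hypothetical decoded page content stream (two images painted by name).
content = b"q 200 0 0 100 36 700 cm /Im2 Do Q q 50 0 0 50 10 10 cm /Im7 Do Q"

# Hypothetical keys of the shared /Resources -> /XObject dictionary.
xobject_names = ["/Im1", "/Im2", "/Im7", "/Im9"]

# Search the stream for the known names followed by whitespace; requiring the
# trailing whitespace keeps "/Im2" from matching inside a longer name like "/Im20".
name_regex = re.compile("(" + "|".join(map(re.escape, xobject_names)) + r")\s")
referenced = set(name_regex.findall(content.decode("utf-8", "ignore")))
unreferenced = set(xobject_names) - referenced

print(sorted(referenced))    # ['/Im2', '/Im7']
print(sorted(unreferenced))  # ['/Im1', '/Im9']
```

Anything left in `unreferenced` is a candidate for blanking, since no retained page ever draws it.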
Beta Was this translation helpful? Give feedback.
-
I've been using a very hacky workaround for several years now to reduce the size of PDFs that result from splitting a large source PDF into 50, 100, or more children. It's still working fine, so I'm just throwing this out to see if anyone has an idea for addressing it 'properly' with an update to pypdf.
Problem
PDFs that use a global XObject resource entry for all rendered images carry all of the images forward into any child PDF created with pypdf.
Example
- `AB06236EEAC446128EDFBD5E1155C88D.PDF` is a 152-page report containing all of the patient records for patients seen on a single day in a hospital setting (Epic EMR).
- `AB06236EEAC446128EDFBD5E1155C88D_pg4_OOTB.PDF` shows the OOTB file size that results from cloning a single page into a separate PDF using the standard pypdf paradigm.
- `AB06236EEAC446128EDFBD5E1155C88D_pg4_UGLY_WORKAROUND.PDF` shows the file size that results after applying my very hacky and totally-not-production-ready workaround.

The 'OOTB' and 'UGLY_WORKAROUND' PDFs are visually identical. I've included the workaround PDF below since it contains no protected information. I did NOT share the 'OOTB' split PDF because it still contains every image displayed across all 152 pages of the original source PDF (even though NONE of those images are actually displayed on the retained page), and some of those images contain protected information.
AB06236EEAC446128EDFBD5E1155C88D_pg4_UGLY_WORKAROUND.PDF
Hacky Fix
Wait for it... It's ugly...
As shown above, the 'workaround' scans the decoded contents of every page in the child PDF to find all references to images stored in the global XObject, and then 'blanks out' the bytes (i.e. replaces them with `b""`) of all of the unrendered images that are still present in the global XObject. This is a bad practice for a number of reasons, and it's also very dependent on a particular 'regexable' reference format in the decoded bytes that is probably specific to the Epic EMR reporting engine (which generates these source PDFs).

At the end of the day, the size is less of an issue than the potential 'data leaks' that the non-rendered images from other pages represent. Some of these images contain protected information for other patients, so it's important to ensure that only the images that belong to the pages being clipped are included in the output. Any thoughts or ideas on how this might be addressed in a 'non-hacky' way??
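For reference, a 'blanked' image object in the workaround output ends up looking something like this in a text editor (a sketch; the object number is invented and the exact dictionary keys vary by producer):

```
10 0 obj
<< /Length 0 >>
stream
endstream
endobj
```

The name in the global XObject dictionary still resolves, so viewers render the retained pages normally; the blanked object just contributes no image bytes.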