How to know if PDF can have watermark (under text) or just a stamp (on top)? #1197

zyzlik · 2022-08-02T13:49:14Z


I have different PDF files: some are digitally generated and some are scans / images. I have to put a watermark on them, preferably under the text. I used page.extract_text() to determine whether a PDF is digitally generated; if so, I added the watermark under the text, otherwise on top. But for some digitally generated PDFs (e.g. an Amazon invoice) I can extract text, yet the watermark is not visible behind the text, while for others it works normally.

Is there a better way to determine whether a watermark can go behind the text?

Answered by MartinThoma

MartinThoma · 2022-08-02T15:06:00Z

Maintainer

The issue you encounter is that the watermark goes behind an image. You could check if the page contains any image.

Something like this:

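from pypdf import PageObject


def contains_image(page: PageObject) -> bool:
    """Return True if the page's resources reference at least one image XObject."""
    # /Resources and /XObject are optional, so check before indexing.
    if "/Resources" not in page:
        return False
    page_resources = page["/Resources"]
    if "/XObject" not in page_resources:
        return False
    x_object = page_resources["/XObject"]

    for name in x_object:
        # Indexing resolves indirect references to the actual XObject.
        if x_object[name].get("/Subtype") == "/Image":
            return True
    return False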

You could refine this by taking the image dimensions into account. Or you could look at the metadata (the generating software). If you see Epson / Canon or similar in there, there is a good chance that the PDF came from a scanner.
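
As a rough illustration of the metadata idea, here is a minimal sketch. The helper name `looks_like_scanner` and the `SCANNER_HINTS` tuple are hypothetical, and the vendor list only contains the two names mentioned above, so treat it as a starting point rather than a reliable detector.

```python
from typing import Optional

# Vendor names taken from the answer above; extend this for your own documents.
SCANNER_HINTS = ("epson", "canon")


def looks_like_scanner(*metadata_fields: Optional[str]) -> bool:
    """Return True if any metadata field mentions a known scanner vendor."""
    for field in metadata_fields:
        # Fields may be missing (None) or empty, so skip those.
        if field and any(hint in field.lower() for hint in SCANNER_HINTS):
            return True
    return False
```

With pypdf, the fields could come from `reader.metadata.producer` and `reader.metadata.creator`. Note that `reader.metadata` may be `None` when the file has no document information.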

4 replies

zyzlik Aug 2, 2022
Author

Oh, that should work, thanks!

zyzlik Aug 2, 2022
Author

But! I tested this on a PDF with text where the underlying watermark works. It also contains images (a logo), yet the watermark is displayed properly behind the text.
And the PDF where the watermark doesn't work contains a 113x23 image, while the page itself is 612x792 🤷‍♀️

MartinThoma Aug 2, 2022
Maintainer

Yes, it's not a bullet-proof method. You could use PyMuPDF (fitz) to render the complete page as an image before and after adding the watermark. If the image changed, you know that at least a part of the watermark is visible.
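
A minimal sketch of that render-and-compare check, assuming the original and the watermarked version have been saved as two separate files. The function names and the `threshold` parameter are hypothetical. PyMuPDF is imported inside `render_page`, so the comparison helper can be used without it.

```python
def changed_fraction(before: bytes, after: bytes) -> float:
    """Fraction of byte positions that differ between two equally sized buffers."""
    if len(before) != len(after):
        raise ValueError("renderings must have identical dimensions")
    if not before:
        return 0.0
    differing = sum(a != b for a, b in zip(before, after))
    return differing / len(before)


def render_page(pdf_path: str, page_index: int = 0) -> bytes:
    """Render one page to raw pixel samples with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so changed_fraction works without it

    with fitz.open(pdf_path) as doc:
        return doc[page_index].get_pixmap().samples


def watermark_is_visible(
    original_pdf: str,
    watermarked_pdf: str,
    page_index: int = 0,
    threshold: float = 0.0,
) -> bool:
    """True if more than `threshold` of the rendered bytes changed."""
    before = render_page(original_pdf, page_index)
    after = render_page(watermarked_pdf, page_index)
    return changed_fraction(before, after) > threshold
```

If the underlay is reported as invisible, you could fall back to stamping on top. The byte-by-byte comparison is slow in pure Python for large renderings, so a lower resolution may be enough for a yes/no answer.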

MartinThoma Aug 2, 2022
Maintainer

You could also look at the size / position of the image on the page.
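
One way to use size and position is to measure how much of the page the largest image covers. This is only a sketch: the names are hypothetical, and it assumes you already have each image's bounding box in page coordinates. The pixel size stored in the image object is not enough, because the content stream decides where and how large the image is drawn. PyMuPDF's `Page.get_image_info()` reports a `bbox` per image, for example.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def rect_area(rect: Rect) -> float:
    """Area of a rectangle; empty or inverted rectangles count as zero."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection(a: Rect, b: Rect) -> Rect:
    """Overlap of two rectangles (may be empty, which rect_area handles)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def largest_image_coverage(page_rect: Rect, image_rects: Iterable[Rect]) -> float:
    """Share of the page area covered by the single largest image."""
    page_area = rect_area(page_rect)
    if page_area == 0:
        return 0.0
    return max(
        (rect_area(intersection(page_rect, r)) / page_area for r in image_rects),
        default=0.0,
    )
```

A full-page scan gives a value close to 1.0, while a small logo gives a value near 0. If the 113x23 image from the reply above were drawn at that size in points on a 612x792 page, it would cover well under 1% of the page, so this check alone would not explain a hidden watermark there.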
