
# Pass Image color channels information to Transformers #2846

Status: Open. Wants to merge 4 commits into master.

Conversation

@davychxn

Background:
In Hugging Face Transformers' image processors, e.g. CLIPImageProcessor, the constructor takes an input_data_format argument that tells the processor whether the image's color channels sit in the first or the last position of its shape.

For example, if an image's shape is (512, 512, 3), its resolution is 512x512 pixels and it has 3 color channels (RGB). In this case input_data_format is ImageChannelDimension.LAST (ChannelDimension.LAST in Transformers).

Sometimes people use a channels-first layout with shape (3, 512, 512) for performance reasons. Transformers requires users to state the layout explicitly, or it tries to infer it from the shape.

Generally, an image has 1 or 3 color channels, representing grayscale or RGB. So the inference algorithm in Transformers looks for a dimension of size 1 or 3 in the image's shape.

If your input images have a shape of (3, xxx, 1) or (1, xxx, 3), the inference gets confused and raises the following errors: 'The channel dimension is ambiguous. Got image shape (1, xxx, 3). Assuming channels are the first dimension.' followed by 'ValueError: mean must have 1 elements if it is an iterable, got 3'
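The ambiguity can be illustrated without Transformers. Below is a minimal sketch of the shape-based inference described above (a hypothetical helper, not the actual Transformers implementation): it checks whether the first or last dimension looks like a plausible channel count of 1 or 3, and fails when both do.

```python
def infer_channel_position(shape):
    """Guess whether channels come first or last by looking for a 1 or 3.

    Returns 'first' or 'last', and raises ValueError when the shape is
    ambiguous (both candidates match) or neither dimension matches.
    """
    first_is_channel = shape[0] in (1, 3)
    last_is_channel = shape[-1] in (1, 3)
    if first_is_channel and last_is_channel:
        raise ValueError(f"The channel dimension is ambiguous. Got image shape {shape}.")
    if first_is_channel:
        return "first"
    if last_is_channel:
        return "last"
    raise ValueError(f"Unable to infer the channel dimension from shape {shape}.")

print(infer_channel_position((512, 512, 3)))  # last
print(infer_channel_position((3, 512, 512)))  # first
# A 1-pixel-tall RGB image is ambiguous: both 1 and 3 look like channel counts.
try:
    infer_channel_position((1, 200, 3))
except ValueError as e:
    print(e)
```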

Fix:

  1. Add a class ImageChannelDimension that defines the 2 possible positions of the color channels in an image's shape.
  2. Accept this information in the model.encode method and pass it through to the Tokenizer and the Transformers image processor.
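The class from step 1 could look like the following sketch; the "channels_first"/"channels_last" string values are assumptions based on what the discussion says Transformers' image processors expect:

```python
class ImageChannelDimension:
    """Possible positions of the color channels in an image's shape."""

    # Channels-first: shape (C, H, W), e.g. (3, 512, 512).
    FIRST = "channels_first"
    # Channels-last: shape (H, W, C), e.g. (512, 512, 3).
    LAST = "channels_last"
```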

@davychxn (Author)

How to reproduce the issue:
Use this image:

[image: header-separator]

Save the above image as "header-separator.png".

Run the official example:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Encode an image
img_emb = model.encode(Image.open('header-separator.png'))
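The failure does not depend on the particular PNG: per the error message quoted above, the decoded separator image is 1 pixel tall, so its array shape is (1, width, 3). A synthetic numpy array (standing in for the decoded PNG) shows why that shape is ambiguous:

```python
import numpy as np

# A 1-pixel-tall RGB strip, like a header-separator graphic:
# height 1, width 200, 3 channels.
strip = np.zeros((1, 200, 3), dtype=np.uint8)
print(strip.shape)  # (1, 200, 3)
# Both the first dim (1) and the last dim (3) look like a channel count,
# so shape-based inference cannot tell channels-first from channels-last.
```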

@davychxn (Author)

davychxn commented Jul 19, 2024

With the fix we can tokenize an image with shape (xxx, xxx, 3) like:

from sentence_transformers import SentenceTransformer, util
...
# By default
model.encode(Image.open('image.png'))
# or
model.encode(Image.open('image.png'), input_data_format=util.ImageChannelDimension.LAST)

And tokenize an image with shape (3, xxx, xxx) like:

model.encode(optimizeImage(Image.open('image.png')), input_data_format=util.ImageChannelDimension.FIRST)

@davychxn (Author)

@tomaarsen Hi Tom, would you help to check my PR? Thank you.

@davychxn (Author)

@fpgmaas Hi Florian, would you take some time to review my PR? Thank you.

@@ -28,6 +28,12 @@
from sentence_transformers.cross_encoder.CrossEncoder import CrossEncoder
from sentence_transformers.SentenceTransformer import SentenceTransformer

class ImageChannelDimension():
Contributor:

This should probably be an Enum

Author:


This class is copied from the Transformers repo, where it is defined the same way, because the strings defined in the class are exactly what Transformers' image processor expects. If we used a plain Enum, I think we would get integer values, and we would need to convert them to strings before passing them to Transformers?
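For reference, Python enums need not be integer-valued: mixing in str makes each member compare equal to its string value, so it could be passed straight through to an API that expects strings. A sketch (the string values here are assumptions):

```python
from enum import Enum

class ImageChannelDimension(str, Enum):
    FIRST = "channels_first"
    LAST = "channels_last"

# Members behave as strings, so no conversion is needed:
print(ImageChannelDimension.LAST == "channels_last")  # True
print(isinstance(ImageChannelDimension.LAST, str))    # True
```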

@fpgmaas (Contributor)

fpgmaas commented Jul 24, 2024

Maybe input_data_format is not an appropriate name? In my opinion the name should be more specific. Maybe just image_channel_dimension?

@fpgmaas (Contributor)

fpgmaas commented Jul 24, 2024

Hey @davychxn , I am not a maintainer of the project and I lack in-depth knowledge to judge your proposed changes, so I'm afraid I cannot approve the PR. However, I left a few review comments anyways to help the PR along.

1. Added a docstring for the new 'image_channel_dimension' parameter of the 'encode' function.
2. Renamed the parameter from 'input_data_format' to 'image_channel_dimension'.
@davychxn (Author)

> Maybe input_data_format is not an appropriate name? In my opinion the name should be more specific. Maybe just image_channel_dimension?

@fpgmaas Fixed.

@davychxn (Author)

> Hey @davychxn , I am not a maintainer of the project and I lack in-depth knowledge to judge your proposed changes, so I'm afraid I cannot approve the PR. However, I left a few review comments anyways to help the PR along.

Thank you for your great help, Florian. I'll try to reach Tom separately.

1. Made the 'tokenize' interface compatible between texts and images.

Also fixed conflicts in:
sentence_transformers/SentenceTransformer.py