
# Pass Image color channels information to Transformers #2846

Status: Open. Wants to merge 4 commits into master.

Conversation

@davychxn

Background:
In Hugging Face Transformers' image processors, e.g. CLIPImageProcessor, the constructor takes an input_data_format argument that tells the processor whether the image's color channels sit in the first or the last position of its shape.

For example, if an image's shape is (512, 512, 3), its resolution is 512x512 pixels and it has 3 color channels (RGB). In this case input_data_format is ImageChannelDimension.LAST (ChannelDimension.LAST in Transformers).

Sometimes people use a channels-first layout with shape (3, 512, 512) for performance reasons. Transformers requires users to state the layout explicitly, or it tries to infer it from the shape.

Generally, an image has 1 or 3 color channels, representing grayscale or RGB. So the inference algorithm in Transformers looks for a dimension of size 1 or 3 in the image's shape.

If your input images have a shape of (3, xxx, 1) or (1, xxx, 3), the inference gets confused and raises the following errors: 'The channel dimension is ambiguous. Got image shape (1, xxx, 3). Assuming channels are the first dimension.' followed by 'ValueError: mean must have 1 elements if it is an iterable, got 3'
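The ambiguity can be illustrated without Transformers. Below is a minimal sketch of the shape-based inference described above (a hypothetical helper, not the actual Transformers implementation): it checks whether the first or last dimension looks like a plausible channel count of 1 or 3, and fails when both do.

```python
def infer_channel_position(shape):
    """Guess whether channels come first or last by looking for a 1 or 3.

    Returns 'first' or 'last', and raises ValueError when the shape is
    ambiguous (both candidates match) or neither dimension matches.
    """
    first_is_channel = shape[0] in (1, 3)
    last_is_channel = shape[-1] in (1, 3)
    if first_is_channel and last_is_channel:
        raise ValueError(f"The channel dimension is ambiguous. Got image shape {shape}.")
    if first_is_channel:
        return "first"
    if last_is_channel:
        return "last"
    raise ValueError(f"Unable to infer the channel dimension from shape {shape}.")

print(infer_channel_position((512, 512, 3)))  # last
print(infer_channel_position((3, 512, 512)))  # first
# A 1-pixel-tall RGB image is ambiguous: both 1 and 3 look like channel counts.
try:
    infer_channel_position((1, 200, 3))
except ValueError as e:
    print(e)
```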

Fix:

  1. Add a class ImageChannelDimension that defines the 2 possible positions of the color channels in an image's shape.
  2. Accept this information in the model.encode method and pass it through to the Tokenizer and the Transformers image processor.
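The class from step 1 could look like the following sketch; the "channels_first"/"channels_last" string values are assumptions based on what the discussion says Transformers' image processors expect:

```python
class ImageChannelDimension:
    """Possible positions of the color channels in an image's shape."""

    # Channels-first: shape (C, H, W), e.g. (3, 512, 512).
    FIRST = "channels_first"
    # Channels-last: shape (H, W, C), e.g. (512, 512, 3).
    LAST = "channels_last"
```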

@davychxn (Author)

How to reproduce the issue:
Use this image:

[image: header-separator]

Save the above image as "header-separator.png".

Run the official example:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Encode an image
img_emb = model.encode(Image.open('header-separator.png'))
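The failure does not depend on the particular PNG: per the error message quoted above, the decoded separator image is 1 pixel tall, so its array shape is (1, width, 3). A synthetic numpy array (standing in for the decoded PNG) shows why that shape is ambiguous:

```python
import numpy as np

# A 1-pixel-tall RGB strip, like a header-separator graphic:
# height 1, width 200, 3 channels.
strip = np.zeros((1, 200, 3), dtype=np.uint8)
print(strip.shape)  # (1, 200, 3)
# Both the first dim (1) and the last dim (3) look like a channel count,
# so shape-based inference cannot tell channels-first from channels-last.
```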

@davychxn (Author)

davychxn commented Jul 19, 2024

With the fix we can tokenize an image with shape (xxx, xxx, 3) like:

from sentence_transformers import SentenceTransformer, util
...
# By default
model.encode(Image.open('image.png'))
# or
model.encode(Image.open('image.png'), input_data_format=util.ImageChannelDimension.LAST)

And tokenize an image with shape (3, xxx, xxx) like:

model.encode(optimizeImage(Image.open('image.png')), input_data_format=util.ImageChannelDimension.FIRST)

@davychxn (Author)

@tomaarsen Hi Tom, would you help to check my PR? Thank you.

@davychxn (Author)

@fpgmaas Hi Florian, would you take some time to review my PR? Thank you.

@@ -28,6 +28,12 @@
from sentence_transformers.cross_encoder.CrossEncoder import CrossEncoder
from sentence_transformers.SentenceTransformer import SentenceTransformer

class ImageChannelDimension():
Contributor:

This should probably be an Enum

Author:


This class is copied from the Transformers repo, where it is defined the same way, because the strings defined in the class are exactly what Transformers' image processor expects. If we used a plain Enum, I think we would get integer values, and we would need to convert them to strings before passing them to Transformers?
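For reference, Python enums need not be integer-valued: mixing in str makes each member compare equal to its string value, so it could be passed straight through to an API that expects strings. A sketch (the string values here are assumptions):

```python
from enum import Enum

class ImageChannelDimension(str, Enum):
    FIRST = "channels_first"
    LAST = "channels_last"

# Members behave as strings, so no conversion is needed:
print(ImageChannelDimension.LAST == "channels_last")  # True
print(isinstance(ImageChannelDimension.LAST, str))    # True
```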

@fpgmaas (Contributor)

fpgmaas commented Jul 24, 2024

Maybe input_data_format is not an appropriate name? In my opinion the name should be more specific. Maybe just image_channel_dimension?

@fpgmaas (Contributor)

fpgmaas commented Jul 24, 2024

Hey @davychxn , I am not a maintainer of the project and I lack in-depth knowledge to judge your proposed changes, so I'm afraid I cannot approve the PR. However, I left a few review comments anyways to help the PR along.

1. Added a docstring for the new 'image_channel_dimension' parameter of the 'encode' function.
2. Renamed the parameter from 'input_data_format' to 'image_channel_dimension'.
@davychxn (Author)

> Maybe input_data_format is not an appropriate name? In my opinion the name should be more specific. Maybe just image_channel_dimension?

@fpgmaas Fixed.

@davychxn (Author)

> Hey @davychxn , I am not a maintainer of the project and I lack in-depth knowledge to judge your proposed changes, so I'm afraid I cannot approve the PR. However, I left a few review comments anyways to help the PR along.

Thank you for your great help, Florian. I'll try to reach Tom separately.

1. Made the 'tokenize' interface compatible between texts and images.

Also fixed conflicts in:
sentence_transformers/SentenceTransformer.py