[multimodal] llava-1.5-7b-hf doesn't work on mmmu_val #2360

Open
BabyChouSr opened this issue Sep 26, 2024 · 4 comments
Labels: bug

BabyChouSr commented Sep 26, 2024

Reproduction:

lm_eval --model hf-multimodal \
    --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1 \
    --tasks mmmu_val \
    --device cuda:0 \
    --batch_size 8

Error:

File "/root/.venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/evaluator.py", line 301, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/evaluator.py", line 496, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/models/hf_vlms.py", line 674, in generate_until
    inputs = self.tok_batch_multimodal_encode(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/models/hf_vlms.py", line 296, in tok_batch_multimodal_encode
    encoding = self.processor(
               ^^^^^^^^^^^^^^^
  File "/workspace/transformers/src/transformers/models/llava/processing_llava.py", line 134, in __call__
    image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/transformers/src/transformers/image_processing_utils.py", line 41, in __call__
    return self.preprocess(images, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/transformers/src/transformers/models/clip/image_processing_clip.py", line 286, in preprocess
    images = make_list_of_images(images)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/transformers/src/transformers/image_utils.py", line 205, in make_list_of_images
    raise ValueError(
ValueError: Invalid image type. Expected either PIL.Image.Image, numpy.ndarray, torch.Tensor, tf.Tensor or jax.ndarray, but got <class 'list'>.
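
A minimal sketch (my guess at the failure mode, not code from the harness) of what the CLIP image processor at the bottom of that traceback accepts: a flat list of PIL images works, while a nested per-sample list raises the same ValueError.

from PIL import Image
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
img = Image.new("RGB", (336, 336))

nested = [[img], [img]]  # one inner list of images per batch sample
# image_processor(nested)  # raises: Invalid image type ... but got <class 'list'>

flat = [im for sample in nested for im in sample]
out = image_processor(flat, return_tensors="pt")  # a flat list of PIL images is accepted
print(out["pixel_values"].shape)  # torch.Size([2, 3, 336, 336])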

I tried using vLLM as well, but it also has an issue: the number of image tokens (4 * 576 = 2304) does not match the number of image placeholders (2305).

haileyschoelkopf added the bug label Sep 26, 2024

haileyschoelkopf (Collaborator) commented:

Hi! We'll take a look at this. If I recall correctly, this is due to an inconsistency in the input formats for this model compared to other HF AutoModelForVision2Seq models and their corresponding processors.

BabyChouSr (Author) commented:

Thanks for the quick reply! It doesn't seem to be just llava-1.5-7b, however; I have issues with Idefics2-8b as well.

Versions:

transformers==4.45.1

Command:

lm_eval --model hf-multimodal \
    --model_args pretrained=HuggingFaceM4/idefics2-8b,max_images=2,attn_implementation=flash_attention_2,dtype=bfloat16,convert_img_format=True \
    --tasks mmmu_val \
    --device cuda:0 \
    --batch_size 2

Traceback:

Traceback (most recent call last):
  File "/root/.venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/evaluator.py", line 301, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/evaluator.py", line 496, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/models/hf_vlms.py", line 686, in generate_until
    cont = self._model_multimodal_generate(inputs, stop=until, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/models/hf_vlms.py", line 342, in _model_multimodal_generate
    return self.model.generate(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/generation/utils.py", line 2048, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/generation/utils.py", line 3008, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 1603, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 1419, in forward
    inputs_embeds = self.inputs_merger(
                    ^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 1296, in inputs_merger
    new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape mismatch: value tensor of shape [640, 4096] cannot be broadcast to indexing result of shape [0, 4096]
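
If it helps with debugging (a hypothetical snippet, not part of the harness): the inputs_merger step that fails here needs the number of <image> placeholder tokens in input_ids to match the number of image hidden-state rows, and an indexing result of shape [0, 4096] suggests the encoded batch ended up with no placeholders at all. Something like this could be used to check:

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")

def count_image_placeholders(input_ids: torch.Tensor) -> int:
    # Positions the model will try to overwrite with image hidden states in inputs_merger.
    return int((input_ids == image_token_id).sum())

# Compare count_image_placeholders(inputs["input_ids"]) against the number of image
# features implied by pixel_values to see whether text and images got out of sync.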

BabyChouSr (Author) commented:

I tried vLLM, and I think an additional image token is getting added somewhere in the context. When running

lm_eval --model vllm-vlm \
    --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1 \
    --tasks mmmu_val_architecture_and_engineering \
    --device cuda:0 \
    --batch_size 1

I noticed that inputs[7] has 2 image tokens in it even though I set max_images to 1. I'm not that familiar with the codebase, so I'm not sure where the image tokens are being set, but I hope this helps.
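
A rough way to check this (hypothetical helper, not from the harness) is to count the placeholder substrings in each rendered prompt before it is handed to vLLM:

def count_placeholders(prompt: str, placeholder: str = "<image>") -> int:
    # Number of image placeholders in a single rendered prompt string.
    return prompt.count(placeholder)

# With max_images=1 each prompt should contain at most one placeholder.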

haileyschoelkopf (Collaborator) commented Sep 27, 2024

Thanks @BabyChouSr, this is helpful -- in our testing we found Idefics2 would run and avoid this error when setting max_images=2, so that error is surprising to me :( I haven't yet traced the root cause.

(@baberabb, also making you aware of this thread in case you hadn't seen it!)
