
RuntimeError: Given groups=1, weight of size [1024, 3, 14, 14], expected input xxx to have 3 channels, but got xx channels instead #25

Open
wnzhyee opened this issue May 15, 2024 · 2 comments

wnzhyee commented May 15, 2024

Hi, I met the same problem when reproducing the pretraining step. It happens when calculating the patch embedding in

self.patch_embedding = nn.Conv2d(

It seems a similar problem was raised about 2 months ago; is there any specific timeline for fixing this issue?
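For reference, here is a minimal standalone sketch (not taken from this repo) that reproduces the same channel mismatch, assuming the vision tower is CLIP ViT-L/14, whose patch_embedding weight has shape [1024, 3, 14, 14]:

import torch
import torch.nn as nn

# CLIP ViT-L/14 patch embedding: Conv2d(3 -> 1024, kernel 14, stride 14),
# i.e. a weight of shape [1024, 3, 14, 14] as in the error message.
patch_embedding = nn.Conv2d(in_channels=3, out_channels=1024,
                            kernel_size=14, stride=14, bias=False)

ok = torch.randn(2, 3, 336, 336)           # [B, 3, H, W]: works, output is [2, 1024, 24, 24]
stacked = torch.randn(2, 3 * 8, 336, 336)  # [B, 3*k, H, W]: k slices stacked on the channel dim

patch_embedding(ok)
patch_embedding(stacked)  # RuntimeError: expected input to have 3 channels, but got 24 channels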

wnzhyee commented May 15, 2024

Here are some of my attempts to work around this error:

def forward(self, images, origin_image_widths, origin_image_heights):
    if images.shape[1] == 24:
        image_features = []
        split_images = torch.chunk(images, chunks=8, dim=1)
        slice_w_nums = []
        slice_h_nums = []
        abstract_w_nums = []
        abstract_h_nums = []
        for i in range(len(origin_image_widths)):
            slice_w_num, slice_h_num, abstract_w_num, abstract_h_num = get_patch_nums(origin_image_widths[i], origin_image_heights[i])
            slice_w_nums.append(slice_w_num)
            slice_h_nums.append(slice_h_num)
            abstract_w_nums.append(abstract_w_num)
            abstract_h_nums.append(abstract_h_num)
        for i, image in enumerate(split_images):
            if i == 7:
                image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
                                                      output_hidden_states=True,
                                                      w_patch_num=abstract_w_nums,
                                                      h_patch_num=abstract_h_nums)
            else:
                image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
                                                      output_hidden_states=True,
                                                      w_patch_num=slice_w_nums,
                                                      h_patch_num=slice_h_nums)
            image_feature = self.feature_select(image_forward_out).to(image.dtype)
            # print("image_feature.shape", image_feature.shape)
            # image_feature.shape torch.Size([4, 576, 1024])
            # print("image_features.shape", image_features.shape)
            image_features.append(image_feature)
    else:
        image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype),
                                               output_hidden_states=True,
                                               w_patch_num=origin_image_widths,
                                               h_patch_num=origin_image_heights)
        image_features = self.feature_select(image_forward_outs).to(images.dtype)
    return image_features

In adapt_CLIPVisionTower's forward, the input images shape is always [B, 3*k, 336, 336], but this code only handles the cases k == 1 and k == 8; for any other k it raises an error.

So I changed it to:

if images.shape[1] % 3 == 0:
    chunk_size = images.shape[1] // 3

    image_features = []
    split_images = torch.chunk(images, chunks=chunk_size, dim=1)
    slice_w_nums=[]
    slice_h_nums=[]
    abstract_w_nums=[]
    abstract_h_nums=[]
    
    for i in range(len(origin_image_widths)):
        slice_w_num,slice_h_num,abstract_w_num,abstract_h_num = get_patch_nums(origin_image_widths[i],origin_image_heights[i])
        slice_w_nums.append(slice_w_num)
        slice_h_nums.append(slice_h_num)
        abstract_w_nums.append(abstract_w_num)
        abstract_h_nums.append(abstract_h_num)
        
    for i, image in enumerate(split_images):
        
        if i == chunk_size - 1:
            image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
                                                output_hidden_states=True,
                                                w_patch_num = abstract_w_nums,
                                                h_patch_num = abstract_h_nums)
        else:
            image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
                                                output_hidden_states=True,
                                                w_patch_num = slice_w_nums,
                                                h_patch_num = slice_h_nums)
        
        image_feature = self.feature_select(image_forward_out).to(image.dtype)
        # print("image_feature.shape",image_feature.shape)
        # image_feature.shape torch.Size([4, 576, 1024])
        # print("image_features.shape",image_features.shape)
        image_features.append(image_feature)

With this change, no error is reported here.
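As a quick standalone sanity check (not repo code) that this chunking yields one 3-channel slice per chunk for any k:

import torch

B, k = 2, 5                                # any k, not just 1 or 8
images = torch.randn(B, 3 * k, 336, 336)

chunk_size = images.shape[1] // 3
split_images = torch.chunk(images, chunks=chunk_size, dim=1)

assert len(split_images) == k
assert all(s.shape == (B, 3, 336, 336) for s in split_images)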

But after this, in adapt_llava.py, there seem to be two more magic numbers, 8 and 4:

for j in range(8):
    cur_image_features = image_features[cur_image_idx + j*4]
    cur_new_input_embeds.append(cur_image_features)
    cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))

I don't know what the 8 and 4 mean, but this code reports another error:

IndexError: index 24 is out of bounds for dimension 0 with size 24

Also, in this function cur_image_idx should be incremented, but I can't see any cur_image_idx += 1 inside this loop. If that is the case, cur_new_input_embeds and cur_new_labels always seem to use the same subset of image_features.
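My guess (illustration only; NUM_SLICES and BATCH_SIZE below are my own names and assumptions, not identifiers from the repo) is that this indexing assumes image_features is laid out slice-major with a hard-coded per-device batch size of 4 and 8 slices per image:

# Illustration of the layout that image_features[cur_image_idx + j*4] seems to assume.
NUM_SLICES = 8   # the magic "8": slices per image
BATCH_SIZE = 4   # the magic "4": hard-coded per-device batch size

def assumed_index(sample_idx: int, slice_idx: int) -> int:
    # slice-major layout: all samples' slice 0, then all samples' slice 1, ...
    return slice_idx * BATCH_SIZE + sample_idx

# With a batch of 3 samples, len(image_features) == 8 * 3 == 24, but already at
# slice_idx == 6 the assumed index is 6 * 4 + 0 == 24 -- matching the IndexError above.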

Please check this code and help clear up my confusion, thanks!

@ParadoxZW

Hi @wnzhyee!

I've released another implementation of LLaVA-UHD here, which I believe is more stable and elegant. The code of the new repo originates from this repo, but its overall quality is improved, and the training program has been tested to run normally without bugs.

When I reviewed this old repo and tried to fix this RuntimeError, I found that it contains a lot of hidden bugs and calculations with wrong logic (violating the spirit of the original paper), and that it misses some necessary processing (such as image normalization). So I decided to rewrite the code and do my best to fix all of these issues. I have now open-sourced my rewritten version.
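For reference, by image normalization I mean the standard CLIP-style preprocessing, roughly like the sketch below (the mean/std are the standard OpenAI CLIP values; this is not the actual code of either repo):

import torch

CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def normalize(images: torch.Tensor) -> torch.Tensor:
    # Normalize [B, 3, H, W] images already scaled to [0, 1].
    return (images - CLIP_MEAN.to(images)) / CLIP_STD.to(images)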

You are very welcome to use it, and I look forward to your feedback.
