
RuntimeError: Given groups=1, weight of size [1024, 3, 14, 14], expected input xxx to have 3 channels, but got xx channels instead #25

Open
wnzhyee opened this issue May 15, 2024 · 2 comments

wnzhyee commented May 15, 2024

Hi, I met the same problem when reproducing the pretraining step. It happens when calculating the patch embedding in

self.patch_embedding = nn.Conv2d(

It seems a similar problem was raised about 2 months ago; is there any specific timeline for fixing this issue?
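For reference, here is a minimal standalone sketch (not taken from this repo) that reproduces the same channel mismatch, assuming the vision tower is CLIP ViT-L/14, whose patch_embedding weight has shape [1024, 3, 14, 14]:

import torch
import torch.nn as nn

# CLIP ViT-L/14 patch embedding: Conv2d(3 -> 1024, kernel 14, stride 14),
# i.e. a weight of shape [1024, 3, 14, 14] as in the error message.
patch_embedding = nn.Conv2d(in_channels=3, out_channels=1024,
                            kernel_size=14, stride=14, bias=False)

ok = torch.randn(2, 3, 336, 336)           # [B, 3, H, W]: works, output is [2, 1024, 24, 24]
stacked = torch.randn(2, 3 * 8, 336, 336)  # [B, 3*k, H, W]: k slices stacked on the channel dim

patch_embedding(ok)
patch_embedding(stacked)  # RuntimeError: expected input to have 3 channels, but got 24 channels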

wnzhyee commented May 15, 2024

Here are some of my attempts to work around this error:

def forward(self, images, origin_image_widths, origin_image_heights):
    if images.shape[1] == 24:
        image_features = []
        split_images = torch.chunk(images, chunks=8, dim=1)
        slice_w_nums = []
        slice_h_nums = []
        abstract_w_nums = []
        abstract_h_nums = []
        for i in range(len(origin_image_widths)):
            slice_w_num, slice_h_num, abstract_w_num, abstract_h_num = get_patch_nums(origin_image_widths[i], origin_image_heights[i])
            slice_w_nums.append(slice_w_num)
            slice_h_nums.append(slice_h_num)
            abstract_w_nums.append(abstract_w_num)
            abstract_h_nums.append(abstract_h_num)
        for i, image in enumerate(split_images):
            if i == 7:
                image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
                                                      output_hidden_states=True,
                                                      w_patch_num=abstract_w_nums,
                                                      h_patch_num=abstract_h_nums)
            else:
                image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
                                                      output_hidden_states=True,
                                                      w_patch_num=slice_w_nums,
                                                      h_patch_num=slice_h_nums)
            image_feature = self.feature_select(image_forward_out).to(image.dtype)
            # print("image_feature.shape", image_feature.shape)
            # image_feature.shape torch.Size([4, 576, 1024])
            # print("image_features.shape", image_features.shape)
            image_features.append(image_feature)
    else:
        image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype),
                                               output_hidden_states=True,
                                               w_patch_num=origin_image_widths,
                                               h_patch_num=origin_image_heights)
        image_features = self.feature_select(image_forward_outs).to(images.dtype)
    return image_features

In adapt_CLIPVisionTower's forward, the input images shape is always [B, 3*k, 336, 336], but this code only handles the cases k == 1 and k == 8; for any other k it raises an error.

So I changed it to:

if images.shape[1] % 3 == 0:
    chunk_size = images.shape[1] // 3

    image_features = []
    split_images = torch.chunk(images, chunks=chunk_size, dim=1)
    slice_w_nums=[]
    slice_h_nums=[]
    abstract_w_nums=[]
    abstract_h_nums=[]
    
    for i in range(len(origin_image_widths)):
        slice_w_num,slice_h_num,abstract_w_num,abstract_h_num = get_patch_nums(origin_image_widths[i],origin_image_heights[i])
        slice_w_nums.append(slice_w_num)
        slice_h_nums.append(slice_h_num)
        abstract_w_nums.append(abstract_w_num)
        abstract_h_nums.append(abstract_h_num)
        
    for i, image in enumerate(split_images):
        
        if i == chunk_size - 1:
            image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
                                                output_hidden_states=True,
                                                w_patch_num = abstract_w_nums,
                                                h_patch_num = abstract_h_nums)
        else:
            image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
                                                output_hidden_states=True,
                                                w_patch_num = slice_w_nums,
                                                h_patch_num = slice_h_nums)
        
        image_feature = self.feature_select(image_forward_out).to(image.dtype)
        # print("image_feature.shape",image_feature.shape)
        # image_feature.shape torch.Size([4, 576, 1024])
        # print("image_features.shape",image_features.shape)
        image_features.append(image_feature)

With this change, no error is reported here.
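As a quick standalone sanity check (not repo code) that this chunking yields one 3-channel slice per chunk for any k:

import torch

B, k = 2, 5                                # any k, not just 1 or 8
images = torch.randn(B, 3 * k, 336, 336)

chunk_size = images.shape[1] // 3
split_images = torch.chunk(images, chunks=chunk_size, dim=1)

assert len(split_images) == k
assert all(s.shape == (B, 3, 336, 336) for s in split_images)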

But after this, in adapt_llava.py, there seem to be two more magic numbers, 8 and 4:

for j in range(8):
    cur_image_features = image_features[cur_image_idx + j*4]
    cur_new_input_embeds.append(cur_image_features)
    cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))

I don't know what the 8 and 4 mean, but this code reports another error:

IndexError: index 24 is out of bounds for dimension 0 with size 24

Also, in this function cur_image_idx should be incremented, but I can't see any cur_image_idx += 1 inside this loop. If that is the case, cur_new_input_embeds and cur_new_labels always seem to use the same subset of image_features.
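My guess (illustration only; NUM_SLICES and BATCH_SIZE below are my own names and assumptions, not identifiers from the repo) is that this indexing assumes image_features is laid out slice-major with a hard-coded per-device batch size of 4 and 8 slices per image:

# Illustration of the layout that image_features[cur_image_idx + j*4] seems to assume.
NUM_SLICES = 8   # the magic "8": slices per image
BATCH_SIZE = 4   # the magic "4": hard-coded per-device batch size

def assumed_index(sample_idx: int, slice_idx: int) -> int:
    # slice-major layout: all samples' slice 0, then all samples' slice 1, ...
    return slice_idx * BATCH_SIZE + sample_idx

# With a batch of 3 samples, len(image_features) == 8 * 3 == 24, but already at
# slice_idx == 6 the assumed index is 6 * 4 + 0 == 24 -- matching the IndexError above.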

Please check this code and help clear up my confusion, thanks!

@ParadoxZW

Hi @wnzhyee!

I've released another implementation of LLaVA-UHD here, which I believe is more stable and elegant. The code of the new repo originates from this repo, but its overall quality is improved, and the training program has been tested to run normally without bugs.

When I reviewed this old repo and tried to fix this RuntimeError, I found that it contains a lot of hidden bugs and calculations with wrong logic (violating the spirit of the original paper), and that it misses some necessary processing (such as image normalization). So I decided to rewrite the code and do my best to fix all of these issues. I have now open-sourced my rewritten version.
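For reference, by image normalization I mean the standard CLIP-style preprocessing, roughly like the sketch below (the mean/std are the standard OpenAI CLIP values; this is not the actual code of either repo):

import torch

CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def normalize(images: torch.Tensor) -> torch.Tensor:
    # Normalize [B, 3, H, W] images already scaled to [0, 1].
    return (images - CLIP_MEAN.to(images)) / CLIP_STD.to(images)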

You are very welcome to use it, and I look forward to your feedback.
