Timeout during finetuning #26

fangruizhu · 2024-10-08T21:22:09Z

Hi,

Thanks for sharing the code. I'm using it to fine-tune on videos by freezing the visual encoder and projector, and tuning the LLM. Initially, everything works well, but as training progresses, I notice that GPU memory usage keeps increasing. I'm using 8 H100s, but eventually, the process times out due to running out of memory. Have you encountered this issue before? Any insights you might have would be greatly appreciated. Thank you!

2U1 · 2024-10-09T01:10:12Z

I haven't tested video training with a large dataset, so I haven't encountered the problem you describe. It doesn't happen with a large image dataset, so it looks like some kind of video preprocessing problem. I'll look into it and let you know when I find something.

Thanks for the issue.

Also, does the memory run out in the middle of training? Does it look like a memory leak?

fangruizhu · 2024-10-09T02:17:15Z

Thank you for the reply! Yes, the memory only runs out in the middle of training; at the beginning it is always fine. I set bs=8 per GPU with grad accum=1 or 2. I use the Valley dataset, which contains 702K videos. When training for one epoch, it times out at around 50% to 80% of the training iterations, with GPU memory usage increasing. I use DeepSpeed ZeRO-3.

2U1 · 2024-10-09T03:00:27Z

Can you check whether the resolution differs between videos?
If it's the same, adding del vr right before the return statement in encode_video in data.py might help. I'm not really sure what the problem is.
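
To illustrate the del vr suggestion, here is a minimal sketch of the pattern. It does not reproduce the repo's encode_video: DummyReader, open_reader, and the frame-selection logic are hypothetical stand-ins, and the real function in data.py may look quite different.

```python
import gc


class DummyReader:
    """Hypothetical stand-in for the object bound to `vr`; not the repo's reader."""

    live = 0  # number of reader objects currently alive

    def __init__(self, num_frames):
        self.frames = [[i] * 4 for i in range(num_frames)]
        DummyReader.live += 1

    def __del__(self):
        DummyReader.live -= 1

    def __len__(self):
        return len(self.frames)

    def get_batch(self, indices):
        return [self.frames[i] for i in indices]


def encode_video(open_reader, max_num_frames):
    """Sketch only: read a few frames, then drop the reader before returning."""
    vr = open_reader()
    step = max(1, len(vr) // max_num_frames)
    indices = list(range(0, len(vr), step))[:max_num_frames]
    # Copy the frames so the returned data does not reference the reader's buffers.
    frames = [list(frame) for frame in vr.get_batch(indices)]
    del vr        # drop the only reference to the reader
    gc.collect()  # also clear any reference cycles that might keep it alive
    return frames
```

In plain CPython the local vr would be released when the function returns anyway, so this mainly makes the release explicit. Whether it changes GPU memory behavior is not established here.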

fangruizhu · 2024-10-09T06:10:26Z

Let me have a try! I will get back to you later, thanks!

fangruizhu · 2024-10-12T20:45:22Z

I tried del vr, and I also tried zero2.json and zero3.json. The training still hangs. I am going to reinstall the environment and try again.

2U1 · 2024-10-14T00:30:30Z

You could try decreasing num_frames. Also, 4 is the best value for num_crops in multi-image/video training.
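
As a rough illustration of why a smaller num_frames lowers the per-sample cost, here is a sketch of uniform frame sampling. The function name sample_frame_indices is hypothetical and the repo's actual sampling logic is not shown in this thread; the point is only that the number of frames passed to the model is capped by num_frames.

```python
def sample_frame_indices(total_frames, num_frames):
    """Pick at most `num_frames` evenly spaced frame indices from a clip."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if total_frames <= num_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    if num_frames == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame and avoids float rounding.
    last = total_frames - 1
    return [(i * last) // (num_frames - 1) for i in range(num_frames)]
```

With this scheme a 300-frame clip yields 8 frames at num_frames=8 and 32 frames at num_frames=32, so the visual token count per video scales with that setting.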
