Timeout during finetuning #26

Open
fangruizhu opened this issue Oct 8, 2024 · 6 comments

Comments

@fangruizhu

Hi,

Thanks for sharing the code. I'm using it to fine-tune on videos by freezing the visual encoder and projector, and tuning the LLM. Initially, everything works well, but as training progresses, I notice that GPU memory usage keeps increasing. I'm using 8 H100s, but eventually, the process times out due to running out of memory. Have you encountered this issue before? Any insights you might have would be greatly appreciated. Thank you!

@2U1
Owner

2U1 commented Oct 9, 2024

I haven't tested video training on a large dataset, so I haven't run into the problem you describe. It doesn't happen with a large image dataset, so it looks like some kind of video preprocessing problem. I'll look into it and let you know what I find.

Thanks for the issue.

Also, does the memory run out partway through training? Does it look like a memory leak?
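One way to answer the memory-leak question is to log allocated GPU memory every few steps and check whether it keeps trending upward after warmup. The following is a minimal sketch, not code from this repository: the helper names `memory_slope`, `looks_like_leak`, and `gpu_memory_mib`, and the `warmup` and `min_slope_mib` thresholds, are assumptions chosen for illustration.

```python
from typing import Sequence


def memory_slope(samples_mib: Sequence[float]) -> float:
    """Least-squares slope of the readings, in MiB per logged step."""
    n = len(samples_mib)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mib) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mib))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_leak(samples_mib: Sequence[float],
                    warmup: int = 10,
                    min_slope_mib: float = 1.0) -> bool:
    """Heuristic: memory still grows steadily once warmup readings are dropped.

    The thresholds are arbitrary illustration values, not tuned numbers.
    """
    steady = list(samples_mib)[warmup:]
    return memory_slope(steady) > min_slope_mib


def gpu_memory_mib() -> float:
    """Memory held by live tensors on the current CUDA device, in MiB."""
    import torch  # imported lazily so the helpers above run without torch

    return torch.cuda.memory_allocated() / 2**20
```

In a training loop, appending `gpu_memory_mib()` to a list every N steps and calling `looks_like_leak` on it periodically would separate a steady climb from the normal plateau after the first few iterations. Note that `torch.cuda.memory_allocated()` only covers tensors managed by PyTorch's allocator, so growth in host memory from video decoding would need a separate check.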

@fangruizhu
Author

Thank you for the reply! Yes, the memory only runs out in the middle of training; at the beginning it is always fine. I set bs=8 per GPU with grad accum=1 or 2. I use the Valley dataset, which contains 702K videos. When training for one epoch, it times out at around 50%-80% of the training iterations, with GPU memory usage increasing. I use DeepSpeed ZeRO-3.
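For reference, the reported settings can be turned into an approximate step count, which helps locate the failure window in the logs. This is a rough sketch under stated assumptions: "702K" is taken as exactly 702,000 samples, the last partial batch is kept, and the function names are hypothetical.

```python
import math


def steps_per_epoch(num_samples: int, per_gpu_batch: int,
                    num_gpus: int, grad_accum: int) -> int:
    """Approximate optimizer steps in one epoch (last partial batch kept)."""
    effective_batch = per_gpu_batch * num_gpus * grad_accum
    return math.ceil(num_samples / effective_batch)


def failure_window(total_steps: int, lo: float = 0.5, hi: float = 0.8):
    """Step range matching the reported 50%-80% point of the epoch."""
    return int(total_steps * lo), int(total_steps * hi)


if __name__ == "__main__":
    # Values from the report: 702K videos, bs=8 per GPU, 8 GPUs, accum 1 or 2.
    for accum in (1, 2):
        total = steps_per_epoch(702_000, 8, 8, accum)
        print(f"grad_accum={accum}: {total} steps, window {failure_window(total)}")
```

The actual step count depends on how the trainer rounds and whether it drops the last batch, so the numbers are only a guide for where to look.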

@2U1
Owner

2U1 commented Oct 9, 2024

Can you check whether the videos have different resolutions?
If they are all the same, adding del vr right before the return statement in encode_video in data.py might help. I'm not really sure what the problem is.
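The suggestion above might look roughly like the sketch below. This is not the repository's actual `encode_video`: it assumes the loader uses decord's `VideoReader`, and the `max_num_frames` default, the `open_reader` parameter, and the striding logic are hypothetical stand-ins added for illustration.

```python
import gc
from typing import Callable, List, Optional


def _open_with_decord(path: str):
    # Assumption: the data loader reads videos with decord.
    from decord import VideoReader, cpu  # imported lazily

    return VideoReader(path, ctx=cpu(0))


def encode_video(video_path: str, max_num_frames: int = 10,
                 open_reader: Optional[Callable] = None) -> List:
    """Decode up to max_num_frames frames, then release the reader."""
    open_reader = open_reader or _open_with_decord
    vr = open_reader(video_path)

    total = len(vr)
    step = max(1, total // max_num_frames)
    frame_idx = list(range(0, total, step))[:max_num_frames]

    # Copy the frames out of the reader before dropping it.
    frames = list(vr.get_batch(frame_idx).asnumpy())

    # The suggested change: drop the reader right before returning so its
    # decoder buffers can be freed instead of lingering across samples.
    del vr
    gc.collect()
    return frames
```

With decord, each returned frame is an array of shape (height, width, channels), so collecting `frame.shape[:2]` for a sample of videos is a quick way to check whether resolutions differ across the dataset.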

@fangruizhu
Author

Let me have a try! I will get back to you later, thanks!

@fangruizhu
Author

I tried del vr, and I also tried zero2.json and zero3.json. The training still hangs. I am going to reinstall the environment and try again.

@2U1
Owner

2U1 commented Oct 14, 2024

You could try decreasing num_frames. Also, setting num_crops to 4 works best for multi-image/video inputs.
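To illustrate what lowering num_frames changes, here is a minimal sketch of evenly spaced frame sampling. It is an assumption about how frames could be picked, not the repository's implementation, and `sample_frame_indices` is a hypothetical helper name.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick num_frames evenly spaced indices, including first and last frame."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Short clip: use every frame rather than repeating any.
        return list(range(total_frames))
    if num_frames == 1:
        return [total_frames // 2]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]
```

Each sampled frame becomes one image passed through the vision encoder, so halving num_frames roughly halves the visual input per video sample and the activation memory that goes with it.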


2 participants