CUDA memory usage continuously increases #77

Open
vlfom opened this issue Jan 25, 2022 · 3 comments

vlfom commented Jan 25, 2022

Dear authors,

Thank you for the great work and clean code.

I am using the default CenterNet2 configuration (from Base-CenterNet2.yaml); however, during training I observe that the memory reserved by CUDA keeps increasing until training fails with a CUDA OOM error. When I replace CenterNet2 with the default RPN, the issue disappears.
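
For reference, the comparison I describe is roughly the following sketch; the config keys are the standard detectron2 yacs keys and `add_centernet_config` is the helper from this repo, so the exact names may differ slightly in your checkout:

```python
from detectron2.config import get_cfg
from centernet.config import add_centernet_config  # helper from the CenterNet2 repo

cfg = get_cfg()
add_centernet_config(cfg)
cfg.merge_from_file("configs/Base-CenterNet2.yaml")

# Run that leaks: proposal generator as set in Base-CenterNet2.yaml ("CenterNet").
# Control run that does not leak: swap in the default RPN proposal generator.
cfg.MODEL.PROPOSAL_GENERATOR.NAME = "RPN"
```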

I tried adding gc.collect() and torch.cuda.empty_cache() to the training loop with no success.
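
Concretely, what I tried looks roughly like this (a sketch; `model`, `optimizer`, and `train_loader` are placeholders for the objects built by the actual training script):

```python
import gc
import torch

for iteration, batch in enumerate(train_loader):
    loss_dict = model(batch)            # detectron2 models return a dict of losses in training mode
    losses = sum(loss_dict.values())

    optimizer.zero_grad()
    losses.backward()
    optimizer.step()

    # Attempted workarounds: neither stopped the reserved memory from growing.
    gc.collect()
    torch.cuda.empty_cache()
```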

Have you noticed such behavior in the past, or could you provide some hints on what the issue could be? Below I also provide some reference screenshots.

Note: in my project, there are a few things that differ from the configuration above: I train on 50% of the COCO dataset and I use LazyConfig to initialize the model. However, I reimplemented the configuration twice and both implementations hit the same issue, so it is unlikely that there is a bug in my code.
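
For completeness, the model is built via detectron2's LazyConfig API, roughly like this (the config path is a placeholder):

```python
from detectron2.config import LazyConfig, instantiate

cfg = LazyConfig.load("path/to/my_centernet2_config.py")  # placeholder path
model = instantiate(cfg.model)
```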

[Screenshots: GPU memory usage over training iterations; the memory allocation keeps increasing in both.]
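
The same trend can also be tracked from inside the training loop by logging PyTorch's allocator statistics; a minimal helper (not from the repo) would be:

```python
import torch

def log_cuda_memory(iteration: int) -> None:
    # memory_allocated(): bytes currently held by live tensors;
    # memory_reserved(): bytes reserved by the caching allocator.
    allocated_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"iter {iteration}: allocated={allocated_mib:.0f} MiB, "
          f"reserved={reserved_mib:.0f} MiB")
```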

vlfom changed the title from "CUDA memory continuously increasing" to "CUDA memory usage continuously increases" on Jan 25, 2022

costapt commented Jan 28, 2022

Hi!

I am facing the same issue. I tried replacing the CustomCascadeROIHeads with the StandardROIHeads to narrow down the cause, but the problem persists. I have the feeling that the problem is in CenterNet, but I have not been able to pinpoint where.
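
Concretely, the only change I made was the ROI heads name in the yacs config (a sketch, assuming the standard detectron2 key; `cfg` is built from the repo's base config as usual):

```python
# CenterNet2 setting described above:
#   cfg.MODEL.ROI_HEADS.NAME = "CustomCascadeROIHeads"
# Swap in the plain detectron2 ROI heads to rule them out:
cfg.MODEL.ROI_HEADS.NAME = "StandardROIHeads"
```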


kachiO commented Jan 29, 2022

I've encountered this issue as well. It seems to happen with the two-stage CenterNet2 models. The workaround I've found is running the model with the following versions: detectron2=v0.6, pytorch=1.8.1, python=3.6, and cuda=11.1.
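
A quick way to confirm the environment matches those versions (standard version attributes only):

```python
import sys

import detectron2
import torch

print("python     :", sys.version.split()[0])     # expect 3.6.x
print("torch      :", torch.__version__)          # expect 1.8.1
print("cuda       :", torch.version.cuda)         # expect 11.1
print("detectron2 :", detectron2.__version__)     # expect 0.6
```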


costapt commented Jan 29, 2022

Thank you! 👍 It seems to have solved the problem here as well!
