Hello, I am seeing a large increase in GPU memory usage as I increase gradient_accumulation_steps. For example, I can fit my desired sequence length with gradient_accumulation_steps at 1 and 4, but at 8 I get an out-of-memory error.
I am using 64 GPUs (16 nodes). Here's the memory usage as I increase gradient_accumulation_steps:

grad accum -> memory
1 -> 54 GB
2 -> 60.6 GB
4 -> 70.4 GB
8 -> OOM
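(Not necessarily how these numbers were collected; just a minimal sketch of one way to track per-rank peak memory in PyTorch. train_one_step is a hypothetical stand-in for one full optimizer step, including all accumulation micro-batches.)

```python
import torch

def report_peak_memory(train_one_step, device=0):
    # Reset the allocator's high-water mark, run one full optimizer step
    # (all gradient-accumulation micro-batches), then read the peak.
    torch.cuda.reset_peak_memory_stats(device)
    train_one_step()
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"peak allocated on device {device}: {peak_gb:.1f} GB")
```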
My understanding was that gradient_accumulation_steps is generally decoupled from memory usage, but that's not what I'm seeing. That is, I should be able to use a high gradient_accumulation_steps and each optimizer step just takes longer; it shouldn't use much more memory.
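For reference, this is the standard single-GPU accumulation pattern I have in mind (plain PyTorch, no model or pipeline parallelism; model, optimizer, and loader are assumed to exist). Each micro-batch's backward pass adds into the same .grad buffers, so peak memory should be roughly independent of accum_steps:

```python
import torch

def train_with_accumulation(model, optimizer, loader, accum_steps, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        # Scale so the accumulated gradient equals the average over the whole
        # effective batch, then accumulate into the existing .grad buffers.
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()       # one optimizer step per accumulation window
            optimizer.zero_grad()  # reset for the next window
```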
I am wondering if this behavior also depends on the model-parallel and pipeline-parallel sizes. I am generally using 8 and 8, but I've tried other settings as well. My best guess is that with this parallelization there is extra communication to move the accumulated gradients around. When I tested on a single node, memory stayed roughly flat as I increased gradient_accumulation_steps.
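For concreteness, the arithmetic for the setup above (the variable names are mine, and the formula assumes the usual 3D-parallel decomposition, which may not match the framework exactly): with model parallel 8 and pipeline parallel 8 on 64 GPUs there is only a single data-parallel replica, so gradient_accumulation_steps would also be the number of micro-batches pushed through the pipeline per optimizer step.

```python
# Back-of-the-envelope for the run described above. Assumes
# data parallel = world size / (model parallel * pipeline parallel).
world_size = 64          # 16 nodes x 4 GPUs, as in the run above
model_parallel = 8
pipeline_parallel = 8

data_parallel = world_size // (model_parallel * pipeline_parallel)
print(f"data-parallel replicas: {data_parallel}")  # -> 1

for grad_accum in (1, 2, 4, 8):
    # With a single data-parallel replica, every accumulation micro-batch
    # traverses the same pipeline before the optimizer step.
    print(f"grad_accum={grad_accum}: {grad_accum} micro-batches per pipeline flush")
```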
Is anyone else experiencing this, or know if this is accurate?
Thanks!