Hello, I am seeing a large increase in GPU memory usage as I increase gradient_accumulation_steps. For example, I can fit my desired sequence length with gradient_accumulation_steps at 1 and 4, but at 8 I get an out-of-memory error.
I am using 64 GPUs (16 nodes). Here's the memory usage as I increase gradient_accumulation_steps:

grad accum -> memory
1 -> 54 GB
2 -> 60.6 GB
4 -> 70.4 GB
8 -> OOM
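(Not necessarily how these numbers were collected; just a minimal sketch of one way to track per-rank peak memory in PyTorch. train_one_step is a hypothetical stand-in for one full optimizer step, including all accumulation micro-batches.)

```python
import torch

def report_peak_memory(train_one_step, device=0):
    # Reset the allocator's high-water mark, run one full optimizer step
    # (all gradient-accumulation micro-batches), then read the peak.
    torch.cuda.reset_peak_memory_stats(device)
    train_one_step()
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"peak allocated on device {device}: {peak_gb:.1f} GB")
```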
My understanding was that gradient_accumulation_steps is generally decoupled from memory usage, but that's not what I'm seeing. That is, I should be able to use a high gradient_accumulation_steps and each optimizer step just takes longer; it shouldn't use much more memory.
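For reference, this is the standard single-GPU accumulation pattern I have in mind (plain PyTorch, no model or pipeline parallelism; model, optimizer, and loader are assumed to exist). Each micro-batch's backward pass adds into the same .grad buffers, so peak memory should be roughly independent of accum_steps:

```python
import torch

def train_with_accumulation(model, optimizer, loader, accum_steps, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        # Scale so the accumulated gradient equals the average over the whole
        # effective batch, then accumulate into the existing .grad buffers.
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()       # one optimizer step per accumulation window
            optimizer.zero_grad()  # reset for the next window
```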
I am wondering if this behavior also depends on the model-parallel and pipeline-parallel sizes. I am generally using 8 and 8, but I've tried other settings as well. My best guess is that with this parallelization there is extra communication to move the accumulated gradients around. When I tested on a single node, memory stayed roughly flat as I increased gradient_accumulation_steps.
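For concreteness, the arithmetic for the setup above (the variable names are mine, and the formula assumes the usual 3D-parallel decomposition, which may not match the framework exactly): with model parallel 8 and pipeline parallel 8 on 64 GPUs there is only a single data-parallel replica, so gradient_accumulation_steps would also be the number of micro-batches pushed through the pipeline per optimizer step.

```python
# Back-of-the-envelope for the run described above. Assumes
# data parallel = world size / (model parallel * pipeline parallel).
world_size = 64          # 16 nodes x 4 GPUs, as in the run above
model_parallel = 8
pipeline_parallel = 8

data_parallel = world_size // (model_parallel * pipeline_parallel)
print(f"data-parallel replicas: {data_parallel}")  # -> 1

for grad_accum in (1, 2, 4, 8):
    # With a single data-parallel replica, every accumulation micro-batch
    # traverses the same pipeline before the optimizer step.
    print(f"grad_accum={grad_accum}: {grad_accum} micro-batches per pipeline flush")
```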
Is anyone else experiencing this, or know if this is accurate?
Thanks!