Fix the FSDP.optim_state_dict_to_load OOM #3184

bigning · 2024-04-10T06:21:47Z

What does this PR do?

Fix the OOM in FSDP.optim_state_dict_to_load. The fix is already in pytorch>=2.3.0: pytorch/pytorch#117261

test

Before the change, the run OOMs in the first forward pass: dbrx-dense-20b-debug-autoresume-AdiVZi

Here is the memory usage before the forward pass:

After the change, the run can train: dbrx-dense-20b-debug-autoresume-3QVblX
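A memory reading like the one taken before the forward pass can be captured with PyTorch's CUDA memory counters. The following is a minimal sketch, not the script used for this PR: the helper name `cuda_memory_gib` is hypothetical, and the optional import is only there so the snippet also loads on a machine without PyTorch.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None

GIB = 1024 ** 3


def cuda_memory_gib(device=None):
    """Return allocated, reserved, and peak CUDA memory in GiB.

    Returns None when PyTorch or a CUDA device is unavailable.
    (Hypothetical helper for illustration; not part of Composer.)
    """
    if torch is None or not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(device) / GIB,
        "reserved": torch.cuda.memory_reserved(device) / GIB,
        "peak_allocated": torch.cuda.max_memory_allocated(device) / GIB,
    }


if __name__ == "__main__":
    # Call once after loading the optimizer state and once before the first
    # forward pass to compare the two readings.
    print(cuda_memory_gib())
```

Comparing the reading on a resumed run with and without the fix would show how much extra memory the optimizer state load leaves behind.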

composer/trainer/mosaic_fsdp_utils.py

snarayan21

That's a crazy amount of extra memory usage, damn.

Can you make the PR title more descriptive instead of "up"? Other than that and one minor comment, LGTM! Thanks for finding and fixing this so quickly.

mvpatel2000

Discussed offline, only for 2.2.2 please!
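The request above amounts to a version gate: apply the patch only on PyTorch 2.2.2, since releases from 2.3.0 onward already carry the upstream fix. A minimal sketch of such a gate follows. The names `parse_release` and `should_apply_patch` are hypothetical and this is not the code merged in `mosaic_fsdp_utils.py`. It assumes the version string begins with `major.minor.patch` and ignores any suffix such as `+cu121`.

```python
import re

# Assumption taken from the review thread: the patch targets exactly this
# release, because pytorch>=2.3.0 already includes the upstream fix.
PATCHED_RELEASE = (2, 2, 2)


def parse_release(version: str) -> tuple:
    """Return (major, minor, patch) from a string like '2.2.2+cu121'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def should_apply_patch(torch_version: str) -> bool:
    """True only when the release triple equals PATCHED_RELEASE."""
    return parse_release(torch_version) == PATCHED_RELEASE


if __name__ == "__main__":
    for candidate in ("2.2.1", "2.2.2+cu121", "2.3.0"):
        print(candidate, should_apply_patch(candidate))
```

In practice the argument would be `torch.__version__`. Comparing integer tuples rather than raw strings avoids misordering versions such as `2.10.0` and `2.2.2`.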

* up * up * up * a * a * up * up * comments * up * lint * line

bigning added 7 commits April 9, 2024 23:19

up (0908130)
up (de6db9c)
up (164469a)
a (98b54fa)
a (702ca0a)
up (d5ddcca)
up (9f620c6)

bigning requested review from mvpatel2000 and snarayan21 April 10, 2024 17:09

bigning marked this pull request as ready for review April 10, 2024 17:10

bigning changed the title from "up" to "Fix the FSDP.optim_state_dict_to_load OOM" Apr 10, 2024

karan6181 reviewed Apr 10, 2024

composer/trainer/mosaic_fsdp_utils.py (outdated, resolved)

snarayan21 reviewed Apr 10, 2024

composer/trainer/mosaic_fsdp_utils.py (resolved)

comments (48aa164)

mvpatel2000 reviewed Apr 10, 2024

up (468e069)

mvpatel2000 approved these changes Apr 10, 2024

bigning added 2 commits April 10, 2024 11:10

lint (0ee1fad)
line (379969b)

bigning enabled auto-merge (squash) April 10, 2024 19:03

Merge branch 'dev' into fix-autoresume-oom (2b0dfc3)

bigning merged commit 52776a7 into dev Apr 10, 2024
14 checks passed

bigning deleted the fix-autoresume-oom branch April 10, 2024 20:17

staghado pushed a commit to lightonai/composer that referenced this pull request Apr 13, 2024

Fix the FSDP.optim_state_dict_to_load OOM (mosaicml#3184) (dd32754)

staghado pushed a commit to lightonai/composer that referenced this pull request Apr 13, 2024

Fix the FSDP.optim_state_dict_to_load OOM (mosaicml#3184) (d39767d)

j316chuck pushed a commit that referenced this pull request May 16, 2024

Fix the FSDP.optim_state_dict_to_load OOM (#3184) (54aed52)
