[WIP] Auto-microbatch fix #3016

bigning · 2024-02-15T18:16:22Z

What does this PR do?

Fixes auto-microbatching. Before this change, Composer registered sync_hook via module.register_forward_hook and module.register_full_backward_hook. Those hooks are triggered AFTER the forward and backward of the original module (not the FSDP wrapper).

issue with previous solution:

Say the model's forward pass runs like this:

fsdp_module_0 -> fsdp_module_1 -> fsdp_module_2

If the OOM happens on rank 0, right between fsdp_module_0 and fsdp_module_1, rank 0 starts an all_reduce. Rank 1 continues on to run fsdp_module_1, which starts an all_gather. This causes a collective mismatch (rank 0 all_reduce vs. rank 1 all_gather).

fix

The fix is simple: register the hook as a pre-forward and pre-backward hook instead. The OOM detection then runs before any FSDP all_gather, instead of after it.
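To make the ordering argument concrete, here is a toy model of the collectives each rank issues. It does not call PyTorch or Composer. `collective_trace`, `first_mismatch`, and the `sync_before` / `oom_after` parameters are hypothetical names made up for this sketch. It assumes each FSDP module starts its forward with an all_gather and that the rank that hits OOM reports it with an all_reduce.

```python
from typing import List, Optional, Tuple

NUM_MODULES = 3  # fsdp_module_0 -> fsdp_module_1 -> fsdp_module_2


def collective_trace(sync_before: bool, oom_after: Optional[int] = None) -> List[str]:
    """List the collectives one rank issues during a forward pass (toy model).

    sync_before=True models the pre-forward hook, False the old post-forward hook.
    oom_after=k means this rank hits OOM right after module k and reports it
    with an all_reduce instead of continuing.
    """
    trace = []
    for i in range(NUM_MODULES):
        if sync_before:
            trace.append("all_reduce")  # sync hook: ranks exchange OOM flags
        trace.append("all_gather")      # FSDP unshards this module's parameters
        if not sync_before:
            trace.append("all_reduce")  # sync hook after the module's forward
        if oom_after == i:
            trace.append("all_reduce")  # OOM handler reports the failure
            break
    return trace


def first_mismatch(a: List[str], b: List[str]) -> Optional[Tuple[int, str, str]]:
    """Return (step, op_a, op_b) for the first step where two ranks disagree."""
    for step, (op_a, op_b) in enumerate(zip(a, b)):
        if op_a != op_b:
            return step, op_a, op_b
    return None


# Rank 0 hits OOM right after fsdp_module_0; rank 1 is healthy.
old = first_mismatch(collective_trace(False, oom_after=0), collective_trace(False))
new = first_mismatch(collective_trace(True, oom_after=0), collective_trace(True))
print("post-hook placement:", old)
print("pre-hook placement:", new)
```

With the post-hook placement, the ranks diverge at the third collective: rank 0 is in all_reduce while rank 1 is in the all_gather of fsdp_module_1. With the pre-hook placement, rank 0's OOM report lines up with rank 1's sync before fsdp_module_1, so rank 1 learns about the OOM before it starts another all_gather.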

test

unit test

python -m composer.cli.launcher -n 2 -m pytest -m gpu tests/trainer/test_fsdp.py -k test_fsdp_auto_microbatch

[todo] e2e test

bigning added 3 commits February 15, 2024 18:00

v0

98f12fb

fix

48f12b2

add gc before empty cache

a3d3636

bigning mentioned this pull request Feb 18, 2024

[fix auto-microbatch] FSDP reshard and cleanup after OOM to fix the cuda memory leak #3030

Merged

bigning added 2 commits March 7, 2024 20:27

Merge branch 'dev' into auto-microbatch-clean-fix

dc054e4

Merge branch 'dev' into auto-microbatch-clean-fix

efd7bf0

mvpatel2000 force-pushed the dev branch from 8a09a3b to 6f8831d Compare July 22, 2024 21:04

bigning closed this Aug 2, 2024
