
[WIP] Auto-microbatch fix #3016

Closed
wants to merge 5 commits

Conversation

@bigning (Contributor) commented Feb 15, 2024

What does this PR do?

Fix auto-microbatching. Before this change, Composer added sync_hook via module.register_forward_hook and module.register_full_backward_hook. Those hooks are triggered AFTER the forward and backward passes of the original module (not the FSDP wrapper).
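A minimal sketch of the pre-fix behavior (hypothetical hook body; the real Composer hook performs a cross-rank OOM check): post-hooks only fire after each module, and therefore after FSDP has already issued its collectives for that module.

```python
import torch
import torch.nn as nn

calls = []

def sync_hook(module, *args):
    # In Composer this is roughly where the cross-rank OOM check would run;
    # here we only record when the hook fires, to show it runs AFTER the module.
    calls.append("sync")

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
for layer in model:
    layer.register_forward_hook(sync_hook)        # fires AFTER layer forward
    layer.register_full_backward_hook(sync_hook)  # fires AFTER layer backward

model(torch.randn(2, 8)).sum().backward()
# calls now holds one entry per layer forward plus one per layer backward
```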

Issue with the previous solution:

Suppose the model's forward pass runs:

fsdp_module_0 -> fsdp_module_1 -> fsdp_module_2

If an OOM happens on rank 0 in the middle, between fsdp_module_0 and fsdp_module_1, rank 0 starts the OOM-sync allReduce. Meanwhile rank 1 continues into fsdp_module_1, which starts FSDP's allGather. The two ranks are now issuing mismatched collectives (rank 0 allReduce vs rank 1 allGather), and the job hangs.
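To see why the mismatch matters, here is a sketch (with a hypothetical helper name) of the kind of cross-rank OOM check the sync hook performs: every rank contributes a 0/1 flag and takes the MAX, so all ranks agree on whether any rank OOMed. Because this is a collective, every rank must reach it at the same point; if one rank calls it while another rank is inside FSDP's allGather, the collectives are matched against each other and both ranks deadlock.

```python
import torch
import torch.distributed as dist

def any_rank_oomed(local_oom: bool) -> bool:
    # Collective agreement on "did anyone OOM?": MAX over a 0/1 flag.
    # Every rank must call this at the same point in the step, or this
    # all_reduce will line up against some other collective (e.g. an FSDP
    # all_gather) on another rank and hang.
    flag = torch.tensor([1.0 if local_oom else 0.0])
    if dist.is_initialized():
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())
```

On a single, uninitialized process this degenerates to returning the local flag.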

fix

So the fix is simple: register the hook as a pre-forward and pre-backward hook instead. OOM detection then runs before any FSDP allGather rather than after it, so every rank reaches the sync collective before entering the next FSDP module.
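The fixed registration can be sketched like this (hypothetical hook body; the actual PR code may differ): pre-hooks fire before the module executes, and therefore before FSDP issues the allGather for that module's parameters.

```python
import torch
import torch.nn as nn

calls = []

def sync_hook_pre(module, *args):
    # The cross-rank OOM all_reduce would go here, ahead of the module's
    # execution and hence ahead of any FSDP all_gather for its parameters.
    calls.append("pre")

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
for layer in model:
    layer.register_forward_pre_hook(sync_hook_pre)        # fires BEFORE forward
    layer.register_full_backward_pre_hook(sync_hook_pre)  # fires BEFORE backward

model(torch.randn(2, 4)).sum().backward()
```

Note that register_full_backward_pre_hook requires PyTorch 2.0 or newer.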

test

unit test

python -m composer.cli.launcher -n 2 -m pytest -m gpu tests/trainer/test_fsdp.py -k test_fsdp_auto_microbatch

[todo] e2e test
