FP8 AllGather Support in Fairscale #1185

Open · wants to merge 21 commits into base: ngoyal_changes_for_pp_fp8_jiecaoyu_debug
Conversation

levendlee (Member)

What does this PR do?

Fixes # (issue).

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

ngoyal2707 and others added 21 commits March 29, 2024 15:12
This commit works with a 4-GPU run on the SMALL model with FSDP and PP
enabled.
- Clean up the flatten and non_flatten parameter generation logic.
- Avoid checking whether the `main_grad` attribute is all zeros.
- Clean up the amax and scale update logic. Amax and scale updates should be
  done for both weights and parameters, so they should happen at the
  forward of each microbatch.
- Consolidate the `cast_params` and `all_gather` streams.
…kresearch/fairscale into shikaili_fp8_allgather_no_pp_fix
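
A minimal sketch of the per-microbatch amax/scale bookkeeping described in the commits above (the function name and buffers are illustrative placeholders, not the PR's actual helpers; it assumes the usual delayed-scaling recipe where scale = fp8_max / amax):

```python
import torch

FP8_E4M3_MAX = 448.0  # dynamic range of the FP8 E4M3 format

def update_amax_and_scale(tensor: torch.Tensor,
                          amax_history: torch.Tensor,
                          scale: torch.Tensor) -> None:
    # Record the current absolute max and derive the quantization scale from
    # the history. Per the commit message, this runs at the forward of every
    # microbatch so the scaling factors used for the FP8 all-gather stay fresh.
    amax_history.copy_(torch.roll(amax_history, shifts=1))
    amax_history[0] = tensor.abs().max()
    scale.copy_(FP8_E4M3_MAX / amax_history.max().clamp(min=1e-12))

# Example usage (hypothetical buffers):
w = torch.randn(1024, 1024)
amax_hist = torch.zeros(16)
scale = torch.ones(1)
update_amax_and_scale(w, amax_hist, scale)
```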
facebook-github-bot added the CLA Signed label on May 20, 2024

@awgu left a comment


Thanks @levendlee for the great work! I left some comments for my own learning.

    and all(_is_te_module_with_weights(info[1]) for info in p._param_infos))
if fused_wgard_accumulation:
    if getattr(p, "main_grad", None) is None:
        p.main_grad = torch.empty_like(p, dtype=torch.float32)

For my understanding, why empty_like instead of zeros_like?
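
For context, a small illustration of the difference the question is about (not the PR's code): `zeros_like` pays for an extra fill kernel, while `empty_like` only allocates and is safe only if the buffer is fully written before it is ever read.

```python
import torch

p = torch.nn.Parameter(torch.randn(4, 4))

# Allocates and zero-fills: safe to accumulate into immediately (grad += ...).
main_grad_zeroed = torch.zeros_like(p, dtype=torch.float32)

# Allocates only: contents are undefined until the first full write,
# e.g. a fused wgrad kernel that overwrites rather than accumulates.
main_grad_uninit = torch.empty_like(p, dtype=torch.float32)
```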

if params is None:
    params = self.params
with torch.cuda.stream(self._streams["fp32_to_fp16"]):

Curious why you used the "all_gather" stream instead of the "fp32_to_fp16" stream?
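
A sketch of the single-side-stream pattern this question touches on (names are illustrative; assumes an initialized process group and a CUDA device):

```python
import torch
import torch.distributed as dist

def cast_and_all_gather(fp32_shard: torch.Tensor, world_size: int) -> torch.Tensor:
    # Enqueue the low-precision cast and the all-gather on the same side
    # stream, so the cast is ordered before the collective without an
    # extra cross-stream event.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        casted = fp32_shard.to(torch.float16)  # stand-in for the FP8 cast
        full = torch.empty(world_size * casted.numel(),
                           dtype=casted.dtype, device=casted.device)
        dist.all_gather_into_tensor(full, casted)
    torch.cuda.current_stream().wait_stream(side_stream)
    return full
```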

@@ -2087,6 +2179,9 @@ def update_p_data(custom_output_tensor: Optional[torch.Tensor] = None) -> None:

self.has_full_params = False

if self.fp8_all_gather:
    self._update_amax_and_scale_fwd(is_first_microbatch_fwd=is_first_microbatch_fwd)

For my understanding, is there a reason that this is not done together with _cast_params_for_all_gather? (For example, could this call be delayed a few lines to below where _cast_params_for_all_gather is called?)




@torch.no_grad()
def _rebuild_full_params(self, force_full_precision: bool = False, wait_for_all_gather = True) -> Optional[List[Tuple[torch.Tensor, bool]]]:
def _rebuild_full_params(

For fp8_all_gather=True, what happens when this method is called without the TE autocast context?
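
For reference, a minimal example of the Transformer Engine autocast context this question refers to (module sizes are arbitrary; requires an FP8-capable GPU):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# FP8 execution, and the amax/scale state that an FP8 all-gather relies on,
# is only active inside fp8_autocast; outside it the module runs in
# higher precision.
recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)
linear = te.Linear(128, 128).cuda()
x = torch.randn(32, 128, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = linear(x)
```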

@@ -1448,16 +1505,22 @@ def forward(self, *args: Any, **kwargs: Any) -> torch.Tensor:

# All-gather full parameters. This will also transfer FP32 parameters to
# ``self.compute_dtype`` (e.g., FP16 if *mixed_precision* is ``True``).
self._rebuild_full_params()
self.module.has_unflatten_views = getattr(self.module, "has_unflatten_views", False)

Why do we need this?
