Save / load from checkpoint TP #269

Merged
merged 13 commits into from
Oct 27, 2023

Conversation

michaelbenayoun
Member

@michaelbenayoun michaelbenayoun commented Oct 24, 2023

As per title.

Fixes: #249

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Collaborator

@dacorvo dacorvo left a comment


Thanks for addressing the specified issue. Wouldn't it be better to have a test verifying that the issue is solved?

I have a few comments, although I am not able to grasp the rationale behind all the changes that come on top of code I am not familiar with.

optimum/neuron/distributed/base.py (outdated, resolved)
optimum/neuron/distributed/base.py (resolved)
Comment on lines 539 to 540
is_zero_1_optimizer = optimizer.__class__.__name__ == "NeuronAcceleratedOptimizer" and isinstance(optimizer.optimizer, NeuronZero1Optimizer)
is_zero_1_optimizer = is_zero_1_optimizer or isinstance(optimizer, NeuronZero1Optimizer)
Collaborator


Isn't the check in the first line equivalent to isinstance(optimizer, NeuronAcceleratedOptimizer)?
Anyway, since in the second line you also accept just NeuronZero1Optimizer, maybe it could be simplified to:
is_zero_1_optimizer = isinstance(optimizer, NeuronZero1Optimizer)

Member Author


No, because you can have:

  1. A NeuronZero1Optimizer, which we do not support yet.
  2. A NeuronAcceleratedOptimizer, which is a wrapper around the optimizer that is always applied when using accelerate (I think we will always fall into that case, but I wanted to cover all the cases here). This wrapper contains an optimizer attribute, and we check whether it is a NeuronZero1Optimizer.
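
To make the two cases above concrete, here is a minimal sketch of why the simplified check would miss the wrapped case. The two classes below are empty stand-ins written only for this illustration (they are not the real optimum-neuron classes), and the helper names is_zero_1 and is_zero_1_simplified are hypothetical. Only the boolean logic mirrors the two lines under review.

```python
# Stand-in classes for illustration only; the real ones live in optimum-neuron.
class NeuronZero1Optimizer:
    """Hypothetical stub for the ZeRO-1 optimizer."""


class NeuronAcceleratedOptimizer:
    """Hypothetical stub for the accelerate wrapper, which stores the
    wrapped optimizer in an `optimizer` attribute."""

    def __init__(self, optimizer):
        self.optimizer = optimizer


class PlainOptimizer:
    """Hypothetical stub for any non-ZeRO-1 optimizer."""


def is_zero_1(optimizer):
    # Same logic as the two lines under review: case 2 (wrapped), then case 1 (bare).
    wrapped = (
        optimizer.__class__.__name__ == "NeuronAcceleratedOptimizer"
        and isinstance(optimizer.optimizer, NeuronZero1Optimizer)
    )
    return wrapped or isinstance(optimizer, NeuronZero1Optimizer)


def is_zero_1_simplified(optimizer):
    # The simplification suggested in the review: only detects the bare case.
    return isinstance(optimizer, NeuronZero1Optimizer)


wrapped_zero_1 = NeuronAcceleratedOptimizer(NeuronZero1Optimizer())
wrapped_plain = NeuronAcceleratedOptimizer(PlainOptimizer())
bare_zero_1 = NeuronZero1Optimizer()

full_results = [is_zero_1(o) for o in (wrapped_zero_1, wrapped_plain, bare_zero_1)]
simplified_results = [
    is_zero_1_simplified(o) for o in (wrapped_zero_1, wrapped_plain, bare_zero_1)
]
```

Under these assumptions, the two checks disagree only on the wrapped ZeRO-1 optimizer, which is the case the reply expects to hit whenever accelerate is used.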

optimum/neuron/distributed/base.py (resolved)
optimum/neuron/trainers.py (resolved)
Collaborator

@dacorvo dacorvo left a comment


Thanks for the update and for providing a test. However, I think this test is not picked up by the CI: is this intended?

tests/distributed/training.py (outdated, resolved)
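
One possible reason a test file is not collected, assuming the CI runs pytest with its default settings (the thread does not confirm this), is the file name: by default pytest only collects files matching test_*.py or *_test.py. The sketch below reproduces that name matching with the standard library; the helper name is hypothetical, and this is not how pytest itself is implemented.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# pytest's default `python_files` patterns; a project can override them in its config.
DEFAULT_PATTERNS = ("test_*.py", "*_test.py")


def matches_default_discovery(path):
    """Return True if the file name matches pytest's default collection patterns."""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in DEFAULT_PATTERNS)


old_name_collected = matches_default_discovery("tests/distributed/training.py")
new_name_collected = matches_default_discovery("tests/distributed/test_training.py")
```

With the default patterns, tests/distributed/training.py would not match, while tests/distributed/test_training.py would.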
Collaborator

@dacorvo dacorvo left a comment


CI is failing: FAILED tests/distributed/test_training.py::test_tp_save_and_resume_from_checkpoint - RuntimeError: You need to log in the Hugging Face Hub otherwise you will not be able to push anything.

@dacorvo dacorvo merged commit 2e2fe40 into main Oct 27, 2023
15 of 16 checks passed
@dacorvo dacorvo deleted the fix_model_saving_tp branch October 27, 2023 05:38
Collaborator

dacorvo commented Oct 27, 2023

Thank you for fixing this!

Successfully merging this pull request may close these issues.

Tensor parallelism: saved model can't be loaded
3 participants