Save / load from checkpoint TP #269

michaelbenayoun · 2023-10-24T12:32:03Z

As per title.

Fixes: #249

HuggingFaceDocBuilderDev · 2023-10-24T12:36:12Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

dacorvo

Thanks for addressing the specified issue. Wouldn't it be better to have a test verifying that the issue is solved?

I have a few comments, although I am not able to grasp the rationale behind all the changes that come on top of code I am not familiar with.

dacorvo · 2023-10-25T14:17:05Z

optimum/neuron/distributed/base.py

+        is_zero_1_optimizer = optimizer.__class__.__name__ == "NeuronAcceleratedOptimizer" and isinstance(optimizer.optimizer, NeuronZero1Optimizer)
+        is_zero_1_optimizer = is_zero_1_optimizer or isinstance(optimizer, NeuronZero1Optimizer)


Isn't the check in the first line equivalent to isinstance(optimizer, NeuronAcceleratedOptimizer)?
Anyway, since the second line also accepts a bare NeuronZero1Optimizer, maybe it could be simplified to:
is_zero_1_optimizer = isinstance(optimizer, NeuronZero1Optimizer)

michaelbenayoun · 2023-10-25

No, because you can have either:

A NeuronZero1Optimizer, which we do not support yet.

A NeuronAcceleratedOptimizer, which is a wrapper around the optimizer that is always applied when using accelerate (I think we will always fall into that case, but I wanted to cover all the cases here). This wrapper has an optimizer attribute, and we check whether it is a NeuronZero1Optimizer.
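To make the two cases in this thread concrete, here is a minimal sketch of the check. The three classes are empty stand-ins that only reuse the names from the diff; they are assumptions for illustration, not the real optimum-neuron classes, and `PlainOptimizer`, `is_zero_1_optimizer` (as a function) and `simplified_check` are hypothetical helpers.

```python
class NeuronZero1Optimizer:
    """Stand-in for the real ZeRO-1 optimizer class (illustration only)."""


class NeuronAcceleratedOptimizer:
    """Stand-in for the accelerate wrapper; it stores the wrapped optimizer in `.optimizer`."""

    def __init__(self, optimizer):
        self.optimizer = optimizer


class PlainOptimizer:
    """Hypothetical stand-in for any optimizer that is not ZeRO-1."""


def is_zero_1_optimizer(optimizer):
    # Case 1: the accelerate wrapper whose inner optimizer is ZeRO-1.
    wrapped = optimizer.__class__.__name__ == "NeuronAcceleratedOptimizer" and isinstance(
        optimizer.optimizer, NeuronZero1Optimizer
    )
    # Case 2: a ZeRO-1 optimizer passed without the wrapper.
    return wrapped or isinstance(optimizer, NeuronZero1Optimizer)


def simplified_check(optimizer):
    # The simplification suggested in review: only looks at the outer object.
    return isinstance(optimizer, NeuronZero1Optimizer)


wrapped_zero_1 = NeuronAcceleratedOptimizer(NeuronZero1Optimizer())
wrapped_plain = NeuronAcceleratedOptimizer(PlainOptimizer())
bare_zero_1 = NeuronZero1Optimizer()

print(is_zero_1_optimizer(wrapped_zero_1), simplified_check(wrapped_zero_1))
print(is_zero_1_optimizer(wrapped_plain), isinstance(wrapped_plain, NeuronAcceleratedOptimizer))
print(is_zero_1_optimizer(bare_zero_1), simplified_check(bare_zero_1))
```

With these stand-ins, the simplified check misses the wrapped ZeRO-1 case, and checking only for the wrapper type would flag every wrapped optimizer, which is why the two-line form looks inside the wrapper's optimizer attribute.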

optimum/neuron/distributed/base.py

optimum/neuron/trainers.py

dacorvo

Thanks for the update and for providing a test. However, I think this test is not picked up by the CI: is this intended?

tests/distributed/training.py

dacorvo

CI is failing: FAILED tests/distributed/test_training.py::test_tp_save_and_resume_from_checkpoint - RuntimeError: You need to log in the Hugging Face Hub otherwise you will not be able to push anything.

dacorvo · 2023-10-27T05:39:04Z

Thank you for fixing this!

michaelbenayoun added 3 commits October 24, 2023 11:39

[WIP] fix resume_from_checkpoint

f62d42f

Merge branch 'main' into fix_model_saving_tp

53f7b26

[WIP] fix resume_from_checkpoint

a735137

michaelbenayoun added 4 commits October 24, 2023 16:47

Fix resume from checkpoint

18a0669

Save config file

e64070a

Fail if using ZeRO-1

2fb0225

Add docstring

76a3075

michaelbenayoun requested review from dacorvo and JingyaHuang October 25, 2023 12:05

dacorvo reviewed Oct 25, 2023

michaelbenayoun added 4 commits October 25, 2023 17:49

Apply suggestions

0841606

Styling

f1affac

Add test

2a7a4d5

Fix

b720644

dacorvo reviewed Oct 26, 2023

Final fix

0516393

dacorvo approved these changes Oct 26, 2023

dacorvo requested changes Oct 26, 2023

Fix tests

0b26272

dacorvo approved these changes Oct 27, 2023

dacorvo merged commit 2e2fe40 into main Oct 27, 2023
15 of 16 checks passed

dacorvo deleted the fix_model_saving_tp branch October 27, 2023 05:38
