KEP-2170: Kubeflow Training V2 API (#2171)
* KEP-2170: Kubeflow Training V2 API

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix some comments

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add user roles diagram

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move diagrams after design

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update diagram

Signed-off-by: Andrey Velichkevich <[email protected]>

* Refactor Model and Dataset configs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update runtime timelines

Signed-off-by: Andrey Velichkevich <[email protected]>

* Address readability comments

Signed-off-by: Andrey Velichkevich <[email protected]>

* Explanation for Trainer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update LLM Fine-Tuning Diagram

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix Llama model name

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add goal for integration with Kueue

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add links for Job run policies

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add some alternatives

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix more API types

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix empty number of nodes

Signed-off-by: Andrey Velichkevich <[email protected]>

* Rename to Coscheduling

Signed-off-by: Andrey Velichkevich <[email protected]>

* Change parameters to env

Add runLauncherAsNode parameter

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update PodSpecOverride with scheduling directives

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix TrainingRuntime field

Signed-off-by: Andrey Velichkevich <[email protected]>

* Refactor PodGroupSpec APIs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add note about scheduler name

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add initial TrainJob status field

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix pre-commit

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
andreyvelich authored Aug 6, 2024
1 parent faec4e8 commit 53341c9
Showing 3 changed files with 1,600 additions and 0 deletions.