This is the Training Operator v1.8.0 release.

This release introduces a new Python API for LLM fine-tuning that makes it easier to fine-tune foundation models on distributed PyTorch nodes.

Install the Kubeflow Training SDK as follows to try it:

pip install -U "kubeflow-training[huggingface]"
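After installing, it can be useful to confirm which SDK version actually landed in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of the Kubeflow SDK, and the distribution name is taken from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "kubeflow-training") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    `installed_version` is an illustrative helper, not a Kubeflow API.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("kubeflow-training is not installed in this environment")
    else:
        print(f"kubeflow-training {found} is installed")
```

Returning `None` instead of raising keeps the check easy to use in setup scripts or notebooks that want to decide whether to run the install step.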

LLM Fine-Tuning API

Train/Fine-tune API Proposal for LLMs (#1945 by @deepanker13)
[SDK] Train API for LLM Fine-Tuning (#1962 by @deepanker13)
Modify LLM Trainer to support BERT and Tiny LLaMA (#2031 by @andreyvelich)
Support arm64 for Hugging Face trainer (#2028 by @tariq-hasan)
Add Fine-Tune BERT LLM Example (#2021 by @andreyvelich)
Train api dataset download changes (#1959 by @deepanker13)
Train api init container creation (#1958 by @deepanker13)
[SDK] Add docstring for Train API (#2075 by @andreyvelich)

Breaking Changes

[SDK] Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
Support K8s v1.29 and Drop K8s v1.26 (#2039 by @tenzen-y)
Support K8s v1.28 and Drop K8s v1.25 (#2038 by @tenzen-y)
Deprecation Notice for MXJob (#2058 by @tenzen-y)
⚠️ Breaking Changes: Rename monitoring-port flag to webhook-server-port (#1925 by @afritzler)

New Features

Control Plane Updates

Upgrade scheduler-plugins to v0.28.9 (#2065 by @tenzen-y)
Implement webhook validations for the PaddleJob (#2057 by @tenzen-y)
Implement webhook validations for the XGBoostJob (#2052 by @tenzen-y)
Implement webhook validation for the TFJob (#2051 by @tenzen-y)
Implement webhook validations for the PyTorchJob (#2035 by @tenzen-y)
Upgrade PyTorchJob examples to PyTorch v2 (#2024 by @champon1020)
Upgrade Go version to v1.22 (#2046 by @tenzen-y)

SDK Improvements

[SDK] Add resources per worker for Create Job API (#1990 by @andreyvelich)
[SDK] Fix Worker and Master templates for PyTorchJob (#1988 by @andreyvelich)
[SDK] Get Kubernetes Events for Job (#1975 by @andreyvelich)
SDK: Upgrade the minimum required Kubernetes version to v1.27.2 (#2066 by @tenzen-y)
[SDK] Add information about TrainingClient logging (#1973 by @andreyvelich)
Training operator SDK unit test (#1938 by @deepanker13)
[SDK] Consolidate Naming for CRUD APIs (#1907 by @andreyvelich)

Bug Fixes

[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2147 by @andreyvelich)
[SDK] Changed package name to flake8 to fix pip install (#2140 by @tenzen-y)
[SDK] Fix Incorrect Events in get_job_logs API (#2138 by @tenzen-y)
Fix volcano podgroup update issue (#2079 by @ckyuto)
Fix import for HuggingFace Dataset Provider (#2085 by @andreyvelich)
Updated examples for train API (#2077 by @shruti2522)
Fail job for non-retryable exit codes (#2071 by @kellyaa)
E2E: Replace outdated images with latest ones (#2083 by @tenzen-y)
fix wrong filepath in the simple example command (#2062 by @qzoscar)
fix(example): add installation of python-etcd in Pytorch example (#2064 by @champon1020)
fix: Upgrade controller-gen to v0.14.0 (#2026 by @champon1020)
Fix build workflow config for pytorch-torchrun-example (#2020 by @PeterWrighten)
Fix Distributed Data Samplers in PyTorch Examples (#2012 by @andreyvelich)
Fix URL in python SDK setup.py (#2011 by @garymm)
Fix for Github CI to publish HF trainer image (#1987 by @johnugeorge)
train api jupyternotebook fix (#1984 by @deepanker13)
fix: volcano podgroup should have a non-empty queue name (#1977 by @lowang-bh)
Fix Master Label for PyTorchJob (#1974 by @andreyvelich)
IsMasterRole fix in pytorchjob controller (#1969 by @deepanker13)
[fix] replace ${go env GOPATH} with $(go env GOPATH) (#1952 by @double12gzh)
Fixing issues with providing existing service account (#1918 by @rpemsel)

Misc

Refine the integration tests for the immutable PyTorchJob (#2130 by @tenzen-y)
Update training operator image to latest (#2089 by @johnugeorge)
Update sdk to v1.8.0rc0 (#2087 by @johnugeorge)
Test: Simplify and Identify pod-controller envtest (#2084 by @tenzen-y)
Remove deadcode related to PodDisruptionBudget (#2073 by @tenzen-y)
docs: updating docs for local development (#2074 by @franciscojavierarceo)
PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode (#2067 by @tenzen-y)
Updated developer docs to include Kind (#2061 by @franciscojavierarceo)
adding fine tune example with s3 as the dataset store (#2006 by @deepanker13)
CI: Use a mode=min in the builder cache (#2053 by @tenzen-y)
Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 (#2043 by @jdcfd)
Remove Dockerfile.ppc64le of pytorch example (#2042 by @champon1020)
publish torchrun example via Dockerfile (#2018 by @PeterWrighten)
Updated examples/pytorch to disable istio sidecar injection (#2004 by @jdcfd)
[docs] development guide update (#1995 by @shashank-iitbhu)
Add Kubeflow Website links to README (#1983 by @andreyvelich)
publish trainer hugging face image (#1985 by @deepanker13)
Adding Training image needed for train api (#1963 by @deepanker13)
Add test to create PyTorchJob from func (#1979 by @andreyvelich)
Corrected Some Spelling And Grammatical Errors (#1980 by @daniel-hutao)
torchrun example with cpu version pytorch (#1965 by @kuizhiqing)
utils changes needed to add train api (#1954 by @deepanker13)
Adding parallel support for coveralls (#1956 by @johnugeorge)
chore: pkg import only once (#1950 by @testwill)
fix nproc env in elastic mode for pytorchjob (#1948 by @kuizhiqing)
Avoid modifying log level globally (#1944 by @droctothorpe)
Add @andreyvelich to Approvers (#1941 by @andreyvelich)
Merge v1.7 branch changes to Main (#1940 by @johnugeorge)
Increase the root volume size on the github runner when building container images (#1931 by @tenzen-y)
Check podGroup CRD for the volcano and the scheduler-plugins as default. (#1929 by @Syulin7)
Use a community hosted image in MXJob E2E (#1928 by @tenzen-y)
Build MXJob examples in CI (#1927 by @tenzen-y)
Bump k8s.io/* deps to 1.28 (#1920 by @afritzler)
Replace XGBoost image for E2E with community hosted (#1922 by @tenzen-y)
Creating service account where appropriate for MPI Job (#1917 by @rpemsel)
Build XGBoostJob example images in CI (#1913 by @tenzen-y)
Manage kube-delivery image from training-operator and update it (#1909 by @rpemsel)
Adding Yuki to Approvers (#1901 by @johnugeorge)
docs: Remove reference to tf-operator specific design doc (#1903 by @terrytangyuan)
Add Training WG Community Call (#1900 by @andreyvelich)
update full change list in changelog (#1895 by @lowang-bh)
update volcano scheduler to 1.8.0 (#1894 by @lowang-bh)
Changelog updated for 1.7.0 rc0 release (#1892 by @johnugeorge)
Add Stale GitHub Action (#1893 by @andreyvelich)
Refactor core/pod tests (#1890 by @tenzen-y)
Remove klog v1 (#1886 by @tenzen-y)

New Contributors

@ckyuto made their first contribution in #2079
@shruti2522 made their first contribution in #2077
@kellyaa made their first contribution in #2071
@qzoscar made their first contribution in #2062
@franciscojavierarceo made their first contribution in #2061
@tariq-hasan made their first contribution in #2028
@champon1020 made their first contribution in #2024
@garymm made their first contribution in #2011
@PeterWrighten made their first contribution in #2018
@jdcfd made their first contribution in #2004
@daniel-hutao made their first contribution in #1980
@shashank-iitbhu made their first contribution in #1995
@double12gzh made their first contribution in #1952
@testwill made their first contribution in #1950
@deepanker13 made their first contribution in #1938
@droctothorpe made their first contribution in #1944
@afritzler made their first contribution in #1920
@rpemsel made their first contribution in #1909
