v1.8.0 release
This is the Training Operator v1.8.0 release.
This release introduces a new Python API for LLM fine-tuning that makes it easier to fine-tune foundation models on distributed PyTorch nodes.
Install the Kubeflow Training SDK as follows to try it:
pip install -U "kubeflow-training[huggingface]"
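As a rough illustration of the new API, the sketch below submits a LoRA fine-tuning job for BERT. It assumes the `TrainingClient.train` parameters and the `HuggingFace*Params` helpers as they are documented for this SDK release. The job name, model URI, dataset, worker count, and resource values are placeholders chosen for illustration, not recommendations. Running the function needs a Kubernetes cluster with the Training Operator installed.

```python
# Illustrative resource request per worker; adjust to what the cluster offers.
WORKER_RESOURCES = {"gpu": 1, "cpu": 4, "memory": "10G"}


def submit_bert_fine_tuning(job_name: str = "fine-tune-bert") -> None:
    """Submit a distributed fine-tuning job (hypothetical helper, not part of the SDK)."""
    # Imports are deferred so this module loads even where the SDK is not installed.
    import transformers
    from peft import LoraConfig
    from kubeflow.training import TrainingClient
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceDatasetParams,
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )

    TrainingClient().train(
        name=job_name,
        num_workers=2,            # number of PyTorch worker nodes
        num_procs_per_worker=1,   # processes started on each node
        resources_per_worker=dict(WORKER_RESOURCES),
        # Pre-trained model to download and the Transformers class used to load it.
        model_provider_parameters=HuggingFaceModelParams(
            model_uri="hf://google-bert/bert-base-cased",
            transformer_type=transformers.AutoModelForSequenceClassification,
        ),
        # A small slice of a public dataset keeps the example quick.
        dataset_provider_parameters=HuggingFaceDatasetParams(
            repo_id="yelp_review_full",
            split="train[:3000]",
        ),
        # Hugging Face Trainer arguments plus a LoRA config to shrink the
        # number of trainable parameters.
        trainer_parameters=HuggingFaceTrainerParams(
            training_parameters=transformers.TrainingArguments(
                output_dir="test_trainer",
                save_strategy="no",
                do_eval=False,
                disable_tqdm=True,
            ),
            lora_config=LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1, bias="none"),
        ),
    )
```

Calling `submit_bert_fine_tuning()` from a notebook or script with cluster access should create the job. The Train API docstring added in #2075 is the place to confirm the exact parameter list.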
LLMs Fine-Tuning API
- Train/Fine-tune API Proposal for LLMs (#1945 by @deepanker13)
- [SDK] Train API for LLM Fine-Tuning (#1962 by @deepanker13)
- Modify LLM Trainer to support BERT and Tiny LLaMA (#2031 by @andreyvelich)
- Support arm64 for Hugging Face trainer (#2028 by @tariq-hasan)
- Add Fine-Tune BERT LLM Example (#2021 by @andreyvelich)
- Train api dataset download changes (#1959 by @deepanker13)
- Train api init container creation (#1958 by @deepanker13)
- [SDK] Add docstring for Train API (#2075 by @andreyvelich)
Breaking Changes
- [SDK] Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
- Support K8s v1.29 and Drop K8s v1.26 (#2039 by @tenzen-y)
- Support K8s v1.28 and Drop K8s v1.25 (#2038 by @tenzen-y)
- Deprecation Notice for MXJob (#2058 by @tenzen-y)
- ⚠️ Breaking Changes: Rename monitoring-port flag to webhook-server-port (#1925 by @afritzler)
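Before upgrading the SDK, it may help to confirm that the local interpreter is still supported. The check below is a minimal sketch that assumes Python 3.8 is the new minimum. The notes above only state that 3.7 is dropped, so treat `ASSUMED_MIN_PYTHON` as an assumption and adjust it if the package metadata says otherwise.

```python
import sys

# Assumption: 3.8 is the lowest supported interpreter once 3.7 is dropped.
ASSUMED_MIN_PYTHON = (3, 8)


def python_is_supported(version_info=sys.version_info, minimum=ASSUMED_MIN_PYTHON) -> bool:
    """Return True when the (major, minor) version meets the assumed minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        raise SystemExit("Python interpreter is older than the assumed minimum for the SDK")
```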
New Features
Control Plane Updates
- Upgrade scheduler-plugins to v0.28.9 (#2065 by @tenzen-y)
- Implement webhook validations for the PaddleJob (#2057 by @tenzen-y)
- Implement webhook validations for the XGBoostJob (#2052 by @tenzen-y)
- Implement webhook validation for the TFJob (#2051 by @tenzen-y)
- Implement webhook validations for the PyTorchJob (#2035 by @tenzen-y)
- Upgrade PyTorchJob examples to PyTorch v2 (#2024 by @champon1020)
- Upgrade Go version to v1.22 (#2046 by @tenzen-y)
SDK Improvements
- [SDK] Add resources per worker for Create Job API (#1990 by @andreyvelich)
- [SDK] Fix Worker and Master templates for PyTorchJob (#1988 by @andreyvelich)
- [SDK] Get Kubernetes Events for Job (#1975 by @andreyvelich)
- SDK: Upgrade the minimum required Kubernetes version to v1.27.2 (#2066 by @tenzen-y)
- [SDK] Add information about TrainingClient logging (#1973 by @andreyvelich)
- Training operator SDK unit test (#1938 by @deepanker13)
- [SDK] Consolidate Naming for CRUD APIs (#1907 by @andreyvelich)
Bug Fixes
- [SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
- [SDK] Sync Transformers version for train API (#2147 by @andreyvelich)
- [SDK] Changed package name to flake8 to fix pip install (#2140 by @tenzen-y)
- [SDK] Fix Incorrect Events in get_job_logs API (#2138 by @tenzen-y)
- Fix volcano podgroup update issue (#2079 by @ckyuto)
- Fix import for HuggingFace Dataset Provider (#2085 by @andreyvelich)
- Updated examples for train API (#2077 by @shruti2522)
- Fail job for non-retryable exit codes (#2071 by @kellyaa)
- E2E: Replace outdated images with latest ones (#2083 by @tenzen-y)
- fix wrong filepath in the simple example command (#2062 by @qzoscar)
- fix(example): add installation of python-etcd in Pytorch example (#2064 by @champon1020)
- fix: Upgrade controller-gen to v0.14.0 (#2026 by @champon1020)
- Fix build workflow config for pytorch-torchrun-example (#2020 by @PeterWrighten)
- Fix Distributed Data Samplers in PyTorch Examples (#2012 by @andreyvelich)
- Fix URL in python SDK setup.py (#2011 by @garymm)
- Fix for Github CI to publish HF trainer image (#1987 by @johnugeorge)
- Train API Jupyter notebook fix (#1984 by @deepanker13)
- fix: volcano podgroup should have a non-empty queue name (#1977 by @lowang-bh)
- Fix Master Label for PyTorchJob (#1974 by @andreyvelich)
- IsMasterRole fix in pytorchjob controller (#1969 by @deepanker13)
- [fix] replace ${go env GOPATH} with $(go env GOPATH) (#1952 by @double12gzh)
- Fixing issues with providing existing service account (#1918 by @rpemsel)
Misc
- Refine the integration tests for the immutable PyTorchJob (#2130 by @tenzen-y)
- Update training operator image to latest (#2089 by @johnugeorge)
- Update sdk to v1.8.0rc0 (#2087 by @johnugeorge)
- Test: Simplify and Identify pod-controller envtest (#2084 by @tenzen-y)
- Remove deadcode related to PodDisruptionBudget (#2073 by @tenzen-y)
- docs: updating docs for local development (#2074 by @franciscojavierarceo)
- PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode (#2067 by @tenzen-y)
- Updated developer docs to include Kind (#2061 by @franciscojavierarceo)
- adding fine tune example with s3 as the dataset store (#2006 by @deepanker13)
- CI: Use a mode=min in the builder cache (#2053 by @tenzen-y)
- Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 (#2043 by @jdcfd)
- Remove Dockerfile.ppc64le of pytorch example (#2042 by @champon1020)
- publish torchrun example via Dockerfile (#2018 by @PeterWrighten)
- Updated examples/pytorch to disable istio sidecar injection (#2004 by @jdcfd)
- [docs] development guide update (#1995 by @shashank-iitbhu)
- Add Kubeflow Website links to README (#1983 by @andreyvelich)
- publish trainer hugging face image (#1985 by @deepanker13)
- Adding Training image needed for train api (#1963 by @deepanker13)
- Add test to create PyTorchJob from func (#1979 by @andreyvelich)
- Corrected Some Spelling And Grammatical Errors (#1980 by @daniel-hutao)
- torchrun example with cpu version pytorch (#1965 by @kuizhiqing)
- utils changes needed to add train api (#1954 by @deepanker13)
- Adding parallel support for coveralls (#1956 by @johnugeorge)
- chore: pkg import only once (#1950 by @testwill)
- fix nproc env in elastic mode for pytorchjob (#1948 by @kuizhiqing)
- Avoid modifying log level globally (#1944 by @droctothorpe)
- Add @andreyvelich to Approvers (#1941 by @andreyvelich)
- Merge v1.7 branch changes to Main (#1940 by @johnugeorge)
- Increase the root volume size on the github runner when building container images (#1931 by @tenzen-y)
- Check podGroup CRD for the volcano and the scheduler-plugins as default. (#1929 by @Syulin7)
- Use a community hosted image in MXJob E2E (#1928 by @tenzen-y)
- Build MXJob examples in CI (#1927 by @tenzen-y)
- Bump k8s.io/* deps to 1.28 (#1920 by @afritzler)
- Replace XGBoost image for E2E with community hosted (#1922 by @tenzen-y)
- Creating service account where appropriate for MPI Job (#1917 by @rpemsel)
- Build XGBoostJob example images in CI (#1913 by @tenzen-y)
- Manage kube-delivery image from training-operator and update it (#1909 by @rpemsel)
- Adding Yuki to Approvers (#1901 by @johnugeorge)
- docs: Remove reference to tf-operator specific design doc (#1903 by @terrytangyuan)
- Add Training WG Community Call (#1900 by @andreyvelich)
- update full change list in changelog (#1895 by @lowang-bh)
- update volcano scheduler to 1.8.0 (#1894 by @lowang-bh)
- Changelog updated for 1.7.0 rc0 release (#1892 by @johnugeorge)
- Add Stale GitHub Action (#1893 by @andreyvelich)
- Refactor core/pod tests (#1890 by @tenzen-y)
- Remove klog v1 (#1886 by @tenzen-y)
New Contributors
- @ckyuto made their first contribution in #2079
- @shruti2522 made their first contribution in #2077
- @kellyaa made their first contribution in #2071
- @qzoscar made their first contribution in #2062
- @franciscojavierarceo made their first contribution in #2061
- @tariq-hasan made their first contribution in #2028
- @champon1020 made their first contribution in #2024
- @garymm made their first contribution in #2011
- @PeterWrighten made their first contribution in #2018
- @jdcfd made their first contribution in #2004
- @daniel-hutao made their first contribution in #1980
- @shashank-iitbhu made their first contribution in #1995
- @double12gzh made their first contribution in #1952
- @testwill made their first contribution in #1950
- @deepanker13 made their first contribution in #1938
- @droctothorpe made their first contribution in #1944
- @afritzler made their first contribution in #1920
- @rpemsel made their first contribution in #1909