Releases · kubeflow/training-operator

03 Aug 19:22

Jeffwan

v1.2.0

6f1e96c

v1.2.0 release

v1.2.0 (2021-08-03)

Full Changelog

Features

Add job namespace to tf_operator_jobs_* counters (#1283, @alembiewski)
feat: upgrade kubeflow common and volcano version (#1276, @shinytang6)
Add task type annotation for pods when EnableGangScheduling is true. (#1268, @jiangkaihua)

Bug fixes

Fix invalid pointer when tfjob is deleted (#1285, @johnugeorge)
fix get_logs pod_names type and iteration blocking (#1280, @Windfarer)
fix calling custom_api.delete_namespaced_custom_object args error (#1281, @Windfarer)
fix: Remove the dup comment tag (#1274, @gaocegege)
Fix: Remove Github CD workflow (#1263, @PatrickXYS)
Fix: the "follow" of TFJobClient.get_logs (#1254, @Windfarer)

Misc

Update container image for v1.1.1 (#1328, @Jeffwan)
add a specific version of tensorflow_datasets (#1305, @jazzsir)
Remove vendor folder (#1288, @Jeffwan)
add podgroups rule in cluster-role.yaml (#1272, @huone1)
Use remote Kustomize build option in standalone installation instructions (#1266, @verult)

Contributors

verult, johnugeorge, and 9 other contributors


25 Mar 04:35

Jeffwan

v1.1.0

f564bce

v1.1.0 release

This is a large official release covering the changes since v0.5.3. Feedback is welcome. Thanks to all contributors.

Features

feat: Remove k8s.io/kubernetes (#1235, @gaocegege)
Migrate to public ECR (#1256, @PatrickXYS)
feat: Add API Documentation WIP (#1249, @gaocegege)
feat: Update developers guide and readme (#1244, @gaocegege)
Move TF Operator e2e tests to AWS Prow (#1204, @ChanYiLin)
crd definition support multiple evaluator (#1240, @oikomi)
support multiple evaluators (#1239, @oikomi)
feat: Change the message for running condition (#1230, @gaocegege)
feat(server): Use apiextension client to check if crd exists (#1228, @gaocegege)
checkCRDExists func return true when k8s cluster is not connected (#1207, @oikomi)
feat: Add CD using GitHub Actions (#1196, @gaocegege)
Migrate controller implementation to kubeflow/common fashion (#1171, @ChanYiLin)
Support success policy for TFJob (#1165, @terrytangyuan)
add distributed training example of using TF 2.1 Strategy API (#1164, @jazzsir)
Set completion time when job exceed specified deadline. (#1150, @SimonCqk)
Support ClusterSpec Propagation Feature in TF 1.14 (#1149, @zhujl1991)
Add watch function for TFJob python Client API (#1122, @jinchihe)
Enhance tfjobs sdk docs (#1114, @jinchihe)
Generate TFJob Python SDK (#1103, @jinchihe)
feat: Support pprof when monitoring is specified (#1102, @gaocegege)
feat: Use kubeflow/common (#1088, @gaocegege)
Add support for aarch64 (#1098, @MrXinWang)
feat: Do not set TF_CONFIG for local training (#1080, @gaocegege)
feat: Replace gometalinter with golangci-lint (#1081, @gaocegege)
Add controller-name label for Pod and service (#1067, @hougangliu)
Add qps and burst options (#1063, @ScorpioCPH)
Avoid unnecessary update when tfjob is complete (#1051, @cheyang)
set annotation automatically when EnableGangScheduling is set to true (#1032, @ChanYiLin)
feat(pod): Support custom gang scheduler via CLI argument (#1050, @gaocegege)

Bug fixes

Fix kubeflow overlay (#1260, @PatrickXYS)
fix: Do not validate evaluator (#1238, @gaocegege)
fix: Remove default resync period (#1237, @gaocegege)
fix: Observe the creation when failed to create the pod (#1236, @gaocegege)
fix: Remove vendor cp command (#1232, @gaocegege)
Fix completion time setting bug (#1226, @shaowei-su)
feat(deploy): Add standalone deployment yaml (#1218, @gaocegege)
Fix updateStatus no worker Crashoff (#1215, @kuikuikuizzZ)
fix: Fix the log message (#1203, @gaocegege)
Fix the typo (#1178, @pingsutw)
Fix setup cluster issue and Pylint issue in CI tests (#1179, @jinchihe)
Fix the link to run_e2e_workflow.py script (#1154, @terrytangyuan)
Fix evaluator runconfig (#1146, @richardsliu)
Fix sdk test issue that's caused by Kubernetes Client bug. (#1143, @jinchihe)
fix(controller): calculate satisfied with && instead of || (#1120, @GuoHaiqing)
fix comment, add +optional flag to comment. (#1137, @EDGsheryl)
fix(ConvertTFJobToUnstructured): ConvertTFJobToUnstructured uses function ToUnstructured to convert TFJob to Unstructured (#1118, @leileiwan)
fix the reconcile flow (#1111, @ChanYiLin)
Fix example Mnist With Summaries (#1073, @andreyvelich)
fix bug: When executing tf-operator.v1 -version, GitSHA is always 'not provided' (#1046, @asdfsx)
fix(UI): show correct namespace and name when deleting job through dashboard (#1044, @gbin10533)
Minor fix to add CoreV1 to scheme (#1037, @johnugeorge)
fix(docs): Fix link for simple_TFJob_test (#1038, @gaocegege)
fix: Remove dup code (#1022, @gaocegege)

Chores

tf-operator: Consolidate manifests (#1255, @yanniszark)
TFJob Operator: Move manifests development upstream (#1247, @yanniszark)
Update vendor as kubeflow/common is updated. (#1252, @jiangkaihua)
docs: Add Ant Group to ADOPTERS.md (#1243, @terrytangyuan)
chore: Add tencent cloud (#1234, @gaocegege)
add vip (#1233, @oikomi)
chore: Update changelog (#1227, @gaocegege)
Update kubeflow common to 0.3.2 (#1225, @shaowei-su)
chore: Remove useless expectation (#1217, @gaocegege)
chore: Update codegen (#1211, @gaocegege)
add Evaluator type for CRD example (#1209, @oikomi)
add err log for create client set failed and code minor optimization (#1210, @oikomi)
chore: Remove the kanban update workflow (#1201, @gaocegege)
chore: Refactor cmd (#1199, @gaocegege)
bugfix for multi_worker_strategy-with-keras.py (#1198, @jiaqianjing)
Fix error when conditions is empty. (#1185, @Corea)
b/168938304 - Inclusive Language Fix-It, repo has non-inclusive language (#1190, @sculd)
chore: Update OWNERS (#1177, @gaocegege)
Update developer_guide.md (#1176, @pingsutw)
Update swagger-codegen-cli URL (#1172, @jinchihe)
Use go mod (#1144, @xychu)
Make tf_operator use static compilation in container (#1160, @MrXinWang)
Update tf_job_client.py remove unused variable. (#1157, @NikeNano)
Update e2e_testing.md (#1155, @NikeNano)
Disable istio sidecar injection in simple tfjob test (#1148, @Bobgy)
OWNERS: Add ChanYiLin as approver (#1147, @ChanYiLin)
Remove unused function arg (#1145, @zhujl1991)
docs: Add roadmap (#1140, @gaocegege)
simple_tfjob_tests py3 version (#1134, @gabrielwen)
add tf-operator test in py3 (#1133, @gabrielwen)
Distroless image for TF operator (#1124, @krishnadurai)
SDK support getting the TFJob training logs (#1130, @jinchihe)
...


09 Feb 06:14

gaocegege

v1.0.1-rc.5

fc46a92

v1.0.1-rc.5

feat: Update readme (#1244)

Signed-off-by: cegao <[email protected]>


04 Feb 02:14

gaocegege

v1.0.1-rc.4

6a608a7

v1.0.1-rc.4

v1.0.1-rc.4 (2021-02-04)

Full Changelog

Closed issues:

I have some questions about the function createNewPod in pkg/controller.v1/tensorflow/pod.go #1221

Merged pull requests:

fix: Remove default resync period #1237 (gaocegege)
fix: Observe the creation when failed to create the pod #1236 (gaocegege)
feat: Remove k8s.io/kubernetes #1235 (gaocegege)
chore: Add tencent cloud #1234 (gaocegege)
add vip #1233 (oikomi)
fix: Remove vendor cp command #1232 (gaocegege)
feat: Change the message for running condition #1230 (gaocegege)
chore: Update changelog #1227 (gaocegege)


27 Jan 11:11

gaocegege

v1.0.1-rc.3

8fd8229

v1.0.1-rc.3

v1.0.1-rc.3 (2021-01-27)

Full Changelog

Closed issues:

Error with release tag v1.0.1 "invalid memory address or nil pointer dereference" #1223

Merged pull requests:

feat(server): Use apiextension client to check if crd exists #1228 (gaocegege)


27 Jan 01:53

gaocegege

v1.0.1-rc.2

5e69262

v1.0.1-rc.2

v1.0.1-rc.2 (2021-01-27)

Full Changelog

Merged pull requests:

Fix completion time setting bug #1226 (shaowei-su)
Update kubeflow common to 0.3.2 #1225 (shaowei-su)


19 Jan 11:42

gaocegege

v1.0.1-rc.1

5f002e4

v1.0.1-rc.1

v1.0.1-rc.1 (2021-01-18)

Full Changelog

Closed issues:

checkCRDExists func return true when k8s cluster is not connected #1206
How to install it without kubeflow #1195
Pod get re-created after it exited and get garbage collected #1186
Surface Pod and other Errors that Prevent TFJob from starting #1131
Jobs failing when a node is preempted #999

Merged pull requests:

feat(deploy): Add standalone deployment yaml #1218 (gaocegege)
chore: Remove useless expectation #1217 (gaocegege)
Fix updateStatus no worker Crashoff #1215 (kuikuikuizzZ)
chore: Update codegen #1211 (gaocegege)
add err log for create client set failed and code minor optimization #1210 (oikomi)
add Evaluator type for CRD example #1209 (oikomi)
checkCRDExists func return true when k8s cluster is not connected #1207 (oikomi)
fix: Fix the log message #1203 (gaocegege)


22 Dec 07:15

gaocegege

v1.0.1-rc.0

6df2d50

v1.0.1-rc.0 Pre-release


v1.0.1-rc.0 (2020-12-22)

Full Changelog

Closed issues:

tf-operator panic without worker role #1192
TFJob completion with active services/endpoints resources #1191
Having trouble viewing logs using Kubernetes dashboard #1189
[feature] Support SuccessPolicy/FailurePolicy Based on % of Succeeded/Failed Workers #1188
TFJob cannot utilize GPUs in the node. #1184
[bug] With Python SDK, TFJob won't stop running #1183
[bug] [Python SDK] tfjob_client.get_logs broken #1182
How to create a python sdk for mxnet-operator #1181
[feature] python sdk should report errors in created TFJobs #1180
Could not introduce k8s.io/kube-openapi@master #1174
can tf-operator used in distribute scene, such as Multi-node #1173
Multi-worker training with Keras only use one GPU #1169
NCCL WARN Failed to open libibverbs.so[.1] #1168
tf-job-operator pod restarts #1167
swagger-codegen-cli-2.4.6.jar not found #1166
Cut release for tf-operator project #1163
Replace reconciler implementation with kubeflow/common JobController #1161
Error while replicating mnist_with_summaries #1159
Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory #1158
TFjob pods hang without explanation #1156
[Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141
evaluator should be set in TF_CONFIG when using Estimator distribute strategy #1139
Is there any case to run the different command in tfReplicaSpecs? #1138
should gpu resource be released when tfjob failed because of image pull problem? #1136
tf-job-operator CrashLoopBackOff #1135
How to change the log level of tf-job-operator #1132
Support getting the training process via Python SDK #1129
Popgroup is not created automatically. #1121
TFConfig should be demonstrated more specifically. #1115
[chore] Remove tfjob dashboard #1113
read TF_CONFIG env from configMap #1112
Long job names result in jobs stuck forever #1101
[Question] can't the base image "registry.access.redhat.com/ubi8/ubi:latest" in Dockerfile be replaced with "debian:buster" ? #1099
can i install tf-operator alone without kubeflow? #1096
c #1095
TFJob test is failing on master and v0.7 branch for kubeflow/kubeflow #1094
TFJob tests should use pytest #1093
Multiple Evaluator replicas gives InvalidTFJobSpec #1091
Java client for current version of TFjob #1090
[enhancement] Replace common with kubeflow/common #1087
Lack of documents for deployment #1086
Performance problem about pod informer #1079
[bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078
Separate cluster scoped and namespace scoped resources #1077
TFJob 1.0 #1076
[bug] Keep tf-job-role as deprecated label in this version #1068
GenLabels may select wrong Pods #1066
Can I create a tf-operator pod without using GO? #1065
tf-job-dashboard cannot work #1060
[discussion] Should We Add CleanPodPolicy PS? #1059
Refactor dockerfile #1058
remove v1beta1 in v0.5.3 cause incompatible issue when using go mod #1057
Invalid value: "v1beta1": must appear in spec.versions #1056
Example on EKS: Device or resource busy #1053
can we add PriorityClassName when we create TF-job Podgroup? #1048
TFjob still running while chief pod is completed #1045
Is there any document for how to run TFJob in AllReduce Strategy #1039
tf-operator version conflicts #1035
Add E2E test for gang-scheduling #1033
gang schedule annotation #1031
[feature] Can we use one headless service for one job? #1030
Will tf-operator upgrading k8s to 1.13? #1029
no error log for create tfjob fail #1026
Creating tfjob in dashboard usability issues #1024
Deleting tf-job through the dashboard is not working #1019
Create common CRD validate and mutating webhook for all operator #1016
error with kubeflow instalation #996
Shall we consider upgrading k8s to 1.11.3 #985
TFJob Dashboard is not support pvc #980
ERROR handle object: patching object from cluster: merging object with existing state: unable to recognize "/var/folders/tl/zzfcr4zs53vgnpqqjq4n08sh0000gn/T/ksonnet-mergepatch020443124": no matches for kind "TFJob" in version "kubeflow.org/v1beta1" #976
Create CRD conversion webhook #967
Performance issue when there is a lot of completed jobs #965
Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #964
Proposal for a Common Operator #960
Delete pod with unknown status in reconcilePods #956
Create distributed training example for TF 2.0 #953
Consider using KubeBuilder to reduce boilerplate code #925
e2e test for dashboard/backend/handler/api_handler.go #921
Use pod group instead of PDB for gang scheduling #916
shareProcessNamespace not working with TFJob #902
[feasibility-research] Handle machine failure #900
Should limit the size of logs of tf_operator container #888
Log message severity isn't properly reported in stackdriver #864
E2E ...


28 Jun 19:09

kunmingg

v1.0.0-rc.0

d746bde

v1.0.0-rc.0 Pre-release


tf-operator pre-graduation


03 Jun 17:40

richardsliu

v0.5.3

d0b973b

v0.5.3

fix bug for check PodPending (#1021)

