
Releases: kubeflow/training-operator

v1.2.0 release

03 Aug 19:22
6f1e96c

v1.2.0 (2021-08-03)

Full Changelog

Features

Bug fixes

Misc

v1.1.0 release

25 Mar 04:35
f564bce

This is a large official release following v0.5.3. Please share your feedback. Thanks to all contributors.

Features

Bug fixes

Chores


v1.0.1-rc.5

09 Feb 06:14
fc46a92
feat: Update readme (#1244)

Signed-off-by: cegao <[email protected]>

v1.0.1-rc.4

04 Feb 02:14
6a608a7

v1.0.1-rc.4 (2021-02-04)

Full Changelog

Closed issues:

  • I have some questions about the function createNewPod in pkg/controller.v1/tensorflow/pod.go #1221

Merged pull requests:

v1.0.1-rc.3

27 Jan 11:11
8fd8229

v1.0.1-rc.3 (2021-01-27)

Full Changelog

Closed issues:

  • Error with release tag v1.0.1 "invalid memory address or nil pointer dereference" #1223

Merged pull requests:

  • feat(server): Use apiextension client to check if crd exists #1228 (gaocegege)

v1.0.1-rc.2

27 Jan 01:53
5e69262

v1.0.1-rc.2 (2021-01-27)

Full Changelog

Merged pull requests:

v1.0.1-rc.1

19 Jan 11:42
5f002e4

v1.0.1-rc.1 (2021-01-18)

Full Changelog

Closed issues:

  • checkCRDExists func return true when k8s cluster is not connected #1206
  • How to install it without kubeflow #1195
  • Pod get re-created after it exited and get garbage collected #1186
  • Surface Pod and other Errors that Prevent TFJob from starting #1131
  • Jobs failing when a node is preempted #999

Merged pull requests:

v1.0.1-rc.0

22 Dec 07:15
6df2d50
Pre-release

v1.0.1-rc.0 (2020-12-22)

Full Changelog

Closed issues:

  • tf-operator panic without worker role #1192
  • TFJob completion with active services/endpoints resources #1191
  • Having trouble viewing logs using Kubernetes dashboard #1189
  • [feature] Support SuccessPolicy/FailurePolicy Based on % of Succeeded/Failed Workers #1188
  • TFJob cannot utilize GPUs in the node. #1184
  • [bug] With Python SDK, TFJob won't stop running #1183
  • [bug] [Python SDK] tfjob_client.get_logs broken #1182
  • How to create a python sdk for mxnet-operator #1181
  • [feature] python sdk should report errors in created TFJobs #1180
  • Could not introduce k8s.io/kube-openapi@master #1174
  • can tf-operator used in distribute scene, such as Multi-node #1173
  • Multi-worker training with Keras only use one GPU #1169
  • NCCL WARN Failed to open libibverbs.so[.1] #1168
  • tf-job-operator pod restarts #1167
  • swagger-codegen-cli-2.4.6.jar not found #1166
  • Cut release for tf-operator project #1163
  • Replace reconciler implementation with kubeflow/common JobController #1161
  • Error while replicating mnist_with_summaries #1159
  • Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory #1158
  • TFjob pods hang without explanation #1156
  • [Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141
  • evaluator should be set in TF_CONFIG when using Estimator distribute strategy #1139
  • Is there any case to run the different command in tfReplicaSpecs? #1138
  • should gpu resource be released when tfjob failed because of image pull problem? #1136
  • tf-job-operator CrashLoopBackOff #1135
  • How to change the log level of tf-job-operator #1132
  • Support getting the training process via Python SDK #1129
  • Popgroup is not created automatically. #1121
  • TFConfig should be demonstrated more specifically. #1115
  • [chore] Remove tfjob dashboard #1113
  • read TF_CONFIG env from configMap #1112
  • Long job names result in jobs stuck forever #1101
  • [Question] can't the base image "registry.access.redhat.com/ubi8/ubi:latest" in Dockerfile be replaced with "debian:buster" ? #1099
  • can i install tf-operator alone without kubeflow? #1096
  • c #1095
  • TFJob test is failing on master and v0.7 branch for kubeflow/kubeflow #1094
  • TFJob tests should use pytest #1093
  • Multiple Evaluator replicas gives InvalidTFJobSpec #1091
  • Java client for current version of TFjob #1090
  • [enhancement] Replace common with kubeflow/common #1087
  • Lack of documents for deployment #1086
  • Performance problem about pod informer #1079
  • [bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078
  • Separate cluster scoped and namespace scoped resources #1077
  • TFJob 1.0 #1076
  • [bug] Keep tf-job-role as deprecated label in this version #1068
  • GenLabels may select wrong Pods #1066
  • Can I create a tf-operator pod without using GO? #1065
  • tf-job-dashboard cannot work #1060
  • [discussion] Should We Add CleanPodPolicy PS? #1059
  • Refactor dockerfile #1058
  • remove v1beta1 in v0.5.3 cause incompatible issue when using go mod #1057
  • Invalid value: "v1beta1": must appear in spec.versions #1056
  • Example on EKS: Device or resource busy #1053
  • can we add PriorityClassName when we create TF-job Podgroup? #1048
  • TFjob still running while chief pod is completed #1045
  • Is there any document for how to run TFJob in AllReduce Strategy #1039
  • tf-operator version conficts #1035
  • Add E2E test for gang-scheduling #1033
  • gang schedule annotation #1031
  • [feature] Can we use one headless service for one job? #1030
  • Will tf-operator upgrading k8s to 1.13? #1029
  • no error log for create tfjob fail #1026
  • Creating tfjob in dashboard usability issues #1024
  • Deleting tf-job through the dashboard is not working #1019
  • Create common CRD validate and mutating webhook for all operator #1016
  • error with kubeflow instalation #996
  • Shall we consider upgrading k8s to 1.11.3 #985
  • TFJob Dashboard is not support pvc #980
  • ERROR handle object: patching object from cluster: merging object with existing state: unable to recognize "/var/folders/tl/zzfcr4zs53vgnpqqjq4n08sh0000gn/T/ksonnet-mergepatch020443124": no matches for kind "TFJob" in version "kubeflow.org/v1beta1" #976
  • Create CRD conversion webhook #967
  • Performance issue when there is a lot of completed jobs #965
  • Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #964
  • Proposal for a Common Operator #960
  • Delete pod with unknown status in reconcilePods #956
  • Create distributed training example for TF 2.0 #953
  • Consider using KubeBuilder to reduce boilerplate code #925
  • e2e test for dashboard/backend/handler/api_handler.go #921
  • Use pod group instead of PDB for gang scheduling #916
  • shareProcessNamespace not working with TFJob #902
  • [feasibility-research] Handle machine failure #900
  • Should limit the size of logs of tf_operator container #888
  • Log message severity isn't properly reported in stackdriver #864
  • E2E ...

v1.0.0-rc.0

28 Jun 19:09
Pre-release

tf-operator pre-graduation

v0.5.3

03 Jun 17:40
fix bug for check PodPending (#1021)