Releases: kubeflow/training-operator
v1.2.0 release
v1.2.0 (2021-08-03)
Features
- Add job namespace to tf_operator_jobs_* counters (#1283, @alembiewski)
- feat: upgrade kubeflow common and volcano version (#1276, @shinytang6)
- Add task type annotation for pods when EnableGangScheduling is true. (#1268, @jiangkaihua)
Bug fixes
- Fix invalid pointer when tfjob is deleted (#1285, @johnugeorge)
- fix get_logs pod_names type and iteration blocking (#1280, @Windfarer)
- fix calling custom_api.delete_namespaced_custom_object args error (#1281, @Windfarer)
- fix: Remove the dup comment tag (#1274, @gaocegege)
- Fix: Remove Github CD workflow (#1263, @PatrickXYS)
- Fix: the "follow" of TFJobClient.get_logs (#1254, @Windfarer)
Misc
- Update container image for v1.1.1 (#1328, @Jeffwan)
- add a specific version of tensorflow_datasets (#1305, @jazzsir)
- Remove vendor folder (#1288, @Jeffwan)
- add podgroups rule in cluster-role.yaml (#1272, @huone1)
- Use remote Kustomize build option in standalone installation instructions (#1266, @verult)
v1.1.0 release
This is a large official release following v0.5.3. Feedback is welcome. Thanks to all contributors.
Features
- feat: Remove k8s.io/kubernetes (#1235, @gaocegege)
- Migrate to public ECR (#1256, @PatrickXYS)
- feat: Add API Documentation WIP (#1249, @gaocegege)
- feat: Update developers guide and readme (#1244, @gaocegege)
- Move TF Operator e2e tests to AWS Prow (#1204, @ChanYiLin)
- crd definition support multiple evaluator (#1240, @oikomi)
- support multiple evaluators (#1239, @oikomi)
- feat: Change the message for running condition (#1230, @gaocegege)
- feat(server): Use apiextension client to check if crd exists (#1228, @gaocegege)
- checkCRDExists func return true when k8s cluster is not connected (#1207, @oikomi)
- feat: Add CD using GitHub Actions (#1196, @gaocegege)
- Migrate controller implementation to kubeflow/common fashion (#1171, @ChanYiLin)
- Support success policy for TFJob (#1165, @terrytangyuan)
- add distributed training example of using TF 2.1 Strategy API (#1164, @jazzsir)
- Set completion time when job exceed specified deadline. (#1150, @SimonCqk)
- Support ClusterSpec Propagation Feature in TF 1.14 (#1149, @zhujl1991)
- Add watch function for TFJob python Client API (#1122, @jinchihe)
- Enhance tfjobs sdk docs (#1114, @jinchihe)
- Generate TFJob Python SDK (#1103, @jinchihe)
- feat: Support pprof when monitoring is specified (#1102, @gaocegege)
- feat: Use kubeflow/common (#1088, @gaocegege)
- Add support for aarch64 (#1098, @MrXinWang)
- feat: Do not set TF_CONFIG for local training (#1080, @gaocegege)
- feat: Replace gometalinter with golangci-lint (#1081, @gaocegege)
- Add controller-name label for Pod and service (#1067, @hougangliu)
- Add qps and burst options (#1063, @ScorpioCPH)
- Avoid unnecessary update when tfjob is complete (#1051, @cheyang)
- set annotation automatically when EnableGangScheduling is set to true (#1032, @ChanYiLin)
- feat(pod): Support custom gang scheduler via CLI argument (#1050, @gaocegege)
Bug fixes
- Fix kubeflow overlay (#1260, @PatrickXYS)
- fix: Do not validate evaluator (#1238, @gaocegege)
- fix: Remove default resync period (#1237, @gaocegege)
- fix: Observe the creation when failed to create the pod (#1236, @gaocegege)
- fix: Remove vendor cp command (#1232, @gaocegege)
- Fix completion time setting bug (#1226, @shaowei-su)
- feat(deploy): Add standalone deployment yaml (#1218, @gaocegege)
- Fix updateStatus no worker Crashoff (#1215, @kuikuikuizzZ)
- fix: Fix the log message (#1203, @gaocegege)
- Fix the typo (#1178, @pingsutw)
- Fix setup cluster issue and Pylint issue in CI tests (#1179, @jinchihe)
- Fix the link to run_e2e_workflow.py script (#1154, @terrytangyuan)
- Fix evaluator runconfig (#1146, @richardsliu)
- Fix sdk test issue that's caused by kubernetes Client bug. (#1143, @jinchihe)
- fix(controller): calculate satisfied with && instead of || (#1120, @GuoHaiqing)
- fix comment, add +optional flag to comment. (#1137, @EDGsheryl)
- fix(ConvertTFJobToUnstructured): ConvertTFJobToUnstructured uses function ToUnstructured to convert TFJob to Unstructured (#1118, @leileiwan)
- fix the reconcile flow (#1111, @ChanYiLin)
- Fix example Mnist With Summaries (#1073, @andreyvelich)
- fix bug: When executing tf-operator.v1 -version, GitSHA is always 'not provided' (#1046, @asdfsx)
- fix(UI): show correct namespace and name when deleting job through dashboard (#1044, @gbin10533)
- Minor fix to add CoreV1 to scheme (#1037, @johnugeorge)
- fix(docs): Fix link for simple_TFJob_test (#1038, @gaocegege)
- fix: Remove dup code (#1022, @gaocegege)
Chores
- tf-operator: Consolidate manifests (#1255, @yanniszark)
- TFJob Operator: Move manifests development upstream (#1247, @yanniszark)
- Update vendor as kubeflow/common is updated. (#1252, @jiangkaihua)
- docs: Add Ant Group to ADOPTERS.md (#1243, @terrytangyuan)
- chore: Add tencent cloud (#1234, @gaocegege)
- add vip (#1233, @oikomi)
- chore: Update changelog (#1227, @gaocegege)
- Update kubeflow common to 0.3.2 (#1225, @shaowei-su)
- chore: Remove useless expectation (#1217, @gaocegege)
- chore: Update codegen (#1211, @gaocegege)
- add Evaluator type for CRD example (#1209, @oikomi)
- add err log for create client set failed and code minor optimization (#1210, @oikomi)
- chore: Remove the kanban update workflow (#1201, @gaocegege)
- chore: Refactor cmd (#1199, @gaocegege)
- bugfix for multi_worker_strategy-with-keras.py (#1198, @jiaqianjing)
- Fix error when conditions is empty. (#1185, @Corea)
- b/168938304 - Inclusive Language Fix-It, repo has non-inclusive language (#1190, @sculd)
- chore: Update OWNERS (#1177, @gaocegege)
- Update developer_guide.md (#1176, @pingsutw)
- Update swagger-codegen-cli URL (#1172, @jinchihe)
- Use go mod (#1144, @xychu)
- Make tf_operator use static compilation in container (#1160, @MrXinWang)
- Update tf_job_client.py remove unused variable. (#1157, @NikeNano)
- Update e2e_testing.md (#1155, @NikeNano)
- Disable istio sidecar injection in simple tfjob test (#1148, @Bobgy)
- OWNERS: Add ChanYiLin as approver (#1147, @ChanYiLin)
- Remove unused function arg (#1145, @zhujl1991)
- docs: Add roadmap (#1140, @gaocegege)
- simple_tfjob_tests py3 version (#1134, @gabrielwen)
- add tf-operator test in py3 (#1133, @gabrielwen)
- Distroless image for TF operator (#1124, @krishnadurai)
- SDK support getting the TFJob training logs (#1130, @jinchihe)
...
v1.0.1-rc.5
feat: Update readme (#1244)
Signed-off-by: cegao
v1.0.1-rc.4
v1.0.1-rc.4 (2021-02-04)
Closed issues:
- I have some questions about the function createNewPod in pkg/controller.v1/tensorflow/pod.go #1221
Merged pull requests:
- fix: Remove default resync period #1237 (gaocegege)
- fix: Observe the creation when failed to create the pod #1236 (gaocegege)
- feat: Remove k8s.io/kubernetes #1235 (gaocegege)
- chore: Add tencent cloud #1234 (gaocegege)
- add vip #1233 (oikomi)
- fix: Remove vendor cp command #1232 (gaocegege)
- feat: Change the message for running condition #1230 (gaocegege)
- chore: Update changelog #1227 (gaocegege)
v1.0.1-rc.3
v1.0.1-rc.3 (2021-01-27)
Closed issues:
- Error with release tag v1.0.1 "invalid memory address or nil pointer dereference" #1223
Merged pull requests:
v1.0.1-rc.2
v1.0.1-rc.2 (2021-01-27)
Merged pull requests:
- Fix completion time setting bug #1226 (shaowei-su)
- Update kubeflow common to 0.3.2 #1225 (shaowei-su)
v1.0.1-rc.1
v1.0.1-rc.1 (2021-01-18)
Closed issues:
- checkCRDExists func return true when k8s cluster is not connected #1206
- How to install it without kubeflow #1195
- Pod get re-created after it exited and get garbage collected #1186
- Surface Pod and other Errors that Prevent TFJob from starting #1131
- Jobs failing when a node is preempted #999
Merged pull requests:
- feat(deploy): Add standalone deployment yaml #1218 (gaocegege)
- chore: Remove useless expectation #1217 (gaocegege)
- Fix updateStatus no worker Crashoff #1215 (kuikuikuizzZ)
- chore: Update codegen #1211 (gaocegege)
- add err log for create client set failed and code minor optimization #1210 (oikomi)
- add Evaluator type for CRD example #1209 (oikomi)
- checkCRDExists func return true when k8s cluster is not connected #1207 (oikomi)
- fix: Fix the log message #1203 (gaocegege)
v1.0.1-rc.0
v1.0.1-rc.0 (2020-12-22)
Closed issues:
- tf-operator panic without worker role #1192
- TFJob completion with active services/endpoints resources #1191
- Having trouble viewing logs using Kubernetes dashboard #1189
- [feature] Support SuccessPolicy/FailurePolicy Based on % of Succeeded/Failed Workers #1188
- TFJob cannot utilize GPUs in the node. #1184
- [bug] With Python SDK, TFJob won't stop running #1183
- [bug] [Python SDK] tfjob_client.get_logs broken #1182
- How to create a python sdk for mxnet-operator #1181
- [feature] python sdk should report errors in created TFJobs #1180
- Could not introduce k8s.io/kube-openapi@master #1174
- can tf-operator used in distribute scene, such as Multi-node #1173
- Multi-worker training with Keras only use one GPU #1169
- NCCL WARN Failed to open libibverbs.so[.1] #1168
- tf-job-operator pod restarts #1167
- swagger-codegen-cli-2.4.6.jar not found #1166
- Cut release for tf-operator project #1163
- Replace reconciler implementation with kubeflow/common JobController #1161
- Error while replicating mnist_with_summaries #1159
- Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory #1158
- TFjob pods hang without explanation #1156
- [Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141
- evaluator should be set in TF_CONFIG when using Estimator distribute strategy #1139
- Is there any case to run the different command in tfReplicaSpecs? #1138
- should gpu resource be released when tfjob failed because of image pull problem? #1136
- tf-job-operator CrashLoopBackOff #1135
- How to change the log level of tf-job-operator #1132
- Support getting the training process via Python SDK #1129
- Popgroup is not created automatically. #1121
- TFConfig should be demonstrated more specifically. #1115
- [chore] Remove tfjob dashboard #1113
- read TF_CONFIG env from configMap #1112
- Long job names result in jobs stuck forever #1101
- [Question] can't the base image "registry.access.redhat.com/ubi8/ubi:latest" in Dockerfile be replaced with "debian:buster" ? #1099
- can i install tf-operator alone without kubeflow? #1096
- c #1095
- TFJob test is failing on master and v0.7 branch for kubeflow/kubeflow #1094
- TFJob tests should use pytest #1093
- Multiple Evaluator replicas gives InvalidTFJobSpec #1091
- Java client for current version of TFjob #1090
- [enhancement] Replace common with kubeflow/common #1087
- Lack of documents for deployment #1086
- Performance problem about pod informer #1079
- [bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078
- Separate cluster scoped and namespace scoped resources #1077
- TFJob 1.0 #1076
- [bug] Keep tf-job-role as deprecated label in this version #1068
- GenLabels may select wrong Pods #1066
- Can I create a tf-operator pod without using GO? #1065
- tf-job-dashboard cannot work #1060
- [discussion] Should We Add CleanPodPolicy PS? #1059
- Refactor dockerfile #1058
- remove v1beta1 in v0.5.3 cause incompatible issue when using go mod #1057
- Invalid value: "v1beta1": must appear in spec.versions #1056
- Example on EKS: Device or resource busy #1053
- can we add PriorityClassName when we create TF-job Podgroup? #1048
- TFjob still running while chief pod is completed #1045
- Is there any document for how to run TFJob in AllReduce Strategy #1039
- tf-operator version conflicts #1035
- Add E2E test for gang-scheduling #1033
- gang schedule annotation #1031
- [feature] Can we use one headless service for one job? #1030
- Will tf-operator upgrading k8s to 1.13? #1029
- no error log for create tfjob fail #1026
- Creating tfjob in dashboard usability issues #1024
- Deleting tf-job through the dashboard is not working #1019
- Create common CRD validate and mutating webhook for all operator #1016
- error with kubeflow installation #996
- Shall we consider upgrading k8s to 1.11.3 #985
- TFJob Dashboard is not support pvc #980
- ERROR handle object: patching object from cluster: merging object with existing state: unable to recognize "/var/folders/tl/zzfcr4zs53vgnpqqjq4n08sh0000gn/T/ksonnet-mergepatch020443124": no matches for kind "TFJob" in version "kubeflow.org/v1beta1" #976
- Create CRD conversion webhook #967
- Performance issue when there is a lot of completed jobs #965
- Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #964
- Proposal for a Common Operator #960
- Delete pod with unknown status in reconcilePods #956
- Create distributed training example for TF 2.0 #953
- Consider using KubeBuilder to reduce boilerplate code #925
- e2e test for dashboard/backend/handler/api_handler.go #921
- Use pod group instead of PDB for gang scheduling #916
- shareProcessNamespace not working with TFJob #902
- [feasibility-research] Handle machine failure #900
- Should limit the size of logs of tf_operator container #888
- Log message severity isn't properly reported in stackdriver #864
- E2E ...
v1.0.0-rc.0
tf-operator pre-graduation
v0.5.3
fix bug for check PodPending (#1021)