Initial release of the TFJob operator
v0.1.0 (2018-03-29)
Closed issues:
- [v1alpha2] Implement condition update #502
- E2E tests timing out; job appears to remain in running state even though job is done. #500
- [v1alpha2] TF_CONFIG should be configurable by user #499
- [test] All log is 404 in argo #496
- Presubmit shows succeeded, but some test actually failed. #479
- Waiting pods start too long #461
- [test] Add unit test for pkg/controller #455
- Create a suitable OWNERS file in /dashboard #443
- Tide is misconfigured for this repository. #433
- CI failed to setup the cluster #420
- [docs] Add dashboard readme #411
- Make coverall results advisory and not report as failure #406
- Presubmits failing due to lint #404
- [enhancement] Fix go vet errors which not caught by the compilers #395
- User facing website for Kubeflow that details how to choose a stack #371
- [discussion] How to set clusterspec #369
- [enhancement] Rename the cmd/tf_operator to cmd/tf-operator #363
- Local releaser fails due to version_tag #360
- Helm test failure not reported to gubernator #355
- [discussion] Whether to create CRD in helm charts #353
- Should resourcelock be in the same namespace as controller? #352
- Helm test tf-job does not pass validation #351
- Move tensorflow/k8s to kubeflow/tf-operator #350
- Get rid of TensorBoard replica #347
- Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs #346
- Deprecate the ENV MY_POD_NAMESPACE and MY_POD_NAME #341
- [feature] Does tfJob support setting different label/envVar for each worker(replicas >1)? #340
- [Discussion] Time to start tagging releases for the TF operator? #339
- [discussion] Should group name be tensorflow.org or kubeflow.io or kubeflow.org? #337
- dashboard silent error during calling non-existent tfjob #335
- in dashboard, silent error when nonexistent namespace is specified #334
- Deprecate the IsDefaultPS field #329
- [Convention] Replace Tf with TF in CRD #328
- Standardise labels for issues and PRs #326
- Manage Pods directly instead of using Job controllers #325
- TfJobs dashboard not showing jobs #324
- TfJobs dashboard doesn't work with K8s API server proxy or envoy proxy #323
- Recreating a failed/successful job with same name doesn't work #322
- Releaser incorrectly tags images as "dirty" #321
- Reenable the releaser #320
- E2E tests are not isolated #318
- Need to mark prow job as failed if any tests fail #315
- Remove outdated branch wbuchwalter-patch-1 #311
- E2E test delete and recreate job with same name #310
- TrainingJob.reconcile not called periodically #309
- rename master to chief #306
- Assign resource quota for TensorBoard #304
- Jobs evicted for lack of memory, potentially add resource field to tf-job prototype #301
- [Discussion] Operators vs. controller pattern #300
- [bug] Add a default pod template for PS #297
- Bunch of pylint error messages #294
- Fix Head #293
- Operator deployment fails post-v20180108-190394d #292
- Promote last known good release #290
- [bug] metadata.ownerReferences.apiVersion is not set #288
- fail to run example job. invalid job spec: tfReplicaSpec.TfPort can't be nil #284
- [bug] Build log 404 in https://prow.k8s.io/?repo=tensorflow%2Fk8s #282
- [feature] Separate the CRD and controller #281
- Gaps in test coverage #280
- Regression in flag name: controller-config-file #279
- [bug] glog before flag.Parse() #275
- build new code to new image and find some problem #274
- Fix the releaser so we can build new images #270
- deploy.py gives gcloud api error '... Version "1.8.1-gke.1" is invalid.' #268
- Pods terminated without waiting #267
- Attach appropriate header (copyright) to go files #266
- suppose i've install the tfjob in my k8s cluster #265
- what's the folder pkg for? #264
- Build failing because of lint issues #256
- what's the main change between version 0.2 and version 0.3? #247
- SetupCluster failures unexpected keyword argument 'client_configuration' #242
- GPU test marked as succeeded but airflow step is failing #240
- Use Kubeflow & ksonnet to install TfJob #239
- tf_smoke.py distributed computing doesn't work on minikube #238
- example-job can not work in private k8s cluster #233
- Test failures aren't properly reported in Gubernator #229
- [CRD] Request for input and output dirs in TFJobSpec #224
- TfJob should be marked as failed if setup fails #218
- panic: runtime error: invalid memory address or nil pointer dereference can not run in k8s 1.8.5 #212
- Rethink the TFJob CRD #209
- ksonnet configs for deploying the TfJob CRD & Controller #208
- Make default TfImage configurable by users #207
- refactor the TfJob to use Informer and Controller #206
- Use Argo workflow engine for CI/CD or releases #205
- Potential issue with Tensorboard / value of simple best-practices example with tboard #202
- Investigate using buildah to build our images #201
- E2E tests pre & postsubmits are failing #196
- Publishing a client to pypi #193
- Don't require a master or chief #192
- Make cloning the repo and building the artifacts separate commands in py/release.py #189
- Handle the case where grpcServerFilePath is the empty string #188
- Make Airflow logs accessible #185
- Complement docs for Python 3rd party dependencies #181
- Helm Test fails because grpcServerFilePath is the empty string #179
- Helm should only set --controller_config_file conditionally #175
- Troubleshooting Guide: no matches for tensorflow.org/, Kind=TfJob #174
- no matches for tensorflow.org/, Kind=TfJob #173
- Failed to build TFOperator #171
- E2E test for GPUs #164
- TfJob doesn't work on minikube #160
- Deleted jobs re-starting #156
- Use coveralls.io to report and check code coverage #155
- Clarify scope of tensorflow/k8s #150
- After init helm, install chart failed #149
- Helm test; insufficient permissions on RBAC clusters #135
- Need to trim trailing slash of host string in TfJobRestClient.Watch() #130
- results of lint test aren't reported in junit file used by gubernator #126
- Collaborators need to be K8s members to trigger tests #122
- Extend Test Infrastructure to run multiple E2E tests in parallel #120
- initResource() failed; findAllTfJobs returned error: #118
- Latest tag on gcr.io is not up to date #116
- duplicate #115
- postsubmit results aren't showing up in testrgrid #113
- TensorBoard replica set not deleted when job deleted. #107
- helm permission issue on 1.8.1 #106
- Run python unittests as part of pre/post/periodic tests #101
- E2E tests are failing #96
- E2E Test log should capture output from helm-test #95
- Rename TfJob kind to remove mlkube.io #89
- Setup travis for tensorflow/k8s #88
- Update repo to use its new location tensorflow/k8s #86
- mlkube.io -> tensorflow/k8s #85
- Update prow to use repo tensorflow/k8s #84
- periodic test is failing #83
- runner.py needs to create build-log.txt with stdout/stderr of test #82
- E2E tests leaking GKE clusters #80
- No results show up if you click on mlkube-build-periodic #76
- No results show up in prow test grid for presubmit jobs #75
- Include TfJob name in labels #72
- Simplify/Clarify Accelerators config #71
- Clean up examples; don't require cloning the repo #68
- How to create TF Jobs from the user side? #67
- Change version from beta -> alpha #65
- API Review #64
- Setup release process for CRD #63
- Post submit jobs don't correctly upload artifacts to GCS #62
- presubmit test(bootstrap.py) doesn't properly check out PRs #59
- E2E Test for default PS server #58
- UI / Kubernetes Dashboard Integration #57
- E2E test for GPUs #54
- Integrate with Prow for Continuous Testing #46
- Consider how we manage replicas (stateful sets, managing pods directly) #45
- Use K8s Garbage Collection #42
- func c.findAllTfJobs() in controller.go will never reach #41
- Rename project #34
- Structured (Json) logging for Tf Processes #32
- Permanent errors don't cause job failure #28
- If handling Add event fails, TfJob should be marked as failed with appropriate error #26
- Structured Logging For the operator #24
- Operator Log Spam; replicas.go:287] No container named: tensorflow found for pod; assuming POD is running #23
- Provide a default value for TfPort, replicas, and tfReplicaType #22
- Setup continuous build of containers #19
- Should this be converted to a Custom Resource Definition (CRD) in anticipation of 1.7 #17
- Run TensorFlow server for parameter servers by default #16
- TensorBoard Integration #13
- Dependency management #7
- Better GPU support #6
- TfJobRestClient.Create doesn't set kind appropriately #5
- Add a creationTimestamp #4
Merged pull requests:
- Fix outdated information about GPUs in README #513 (mindprince)
- Don't leave pods running when a job completes. #512 (jlewi)
- Check running status more gracefully #507 (ScorpioCPH)
- test: Add test cases for condition #506 (gaocegege)
- test: Fix failed case because of update status #505 (gaocegege)
- Add condition logic code #504 (ScorpioCPH)
- Fix bug with jobs not being marked as completed. #501 (jlewi)
- release: Fix style #498 (gaocegege)
- pkg: Fix the code changed in #486 #497 (gaocegege)
- Set JSONLogFormat to false by default #495 (ScorpioCPH)
- Fix env append issue #494 (ScorpioCPH)
- Add dist-mnist for e2e test #493 (ScorpioCPH)
- Set restart policy #491 (ScorpioCPH)
- test: Add test cases #488 (gaocegege)
- Add sleep and random exit image for e2e test #487 (ScorpioCPH)
- fixed some golint warning #486 (AK-ayush)
- Support testing on minikube. #485 (jlewi)
- controller: Add defaulter #483 (gaocegege)
- controller: Add check for service and fix service #482 (gaocegege)
- controller: Separate ps and worker pods #481 (gaocegege)
- controller: Add internal state test #480 (gaocegege)
- *: Fix some errors in Travis CI #477 (gaocegege)
- controller: Update status in time #476 (gaocegege)
- add LabelsByIndex method to eliminate code duplication #474 (rc-zhang)
- Make RestartPolicy a property of the ReplicaSpec #473 (ScorpioCPH)
- Update tfjob status #472 (ScorpioCPH)
- Use headless services for Training jobs #471 (rc-zhang)
- Append labels instead of rewriting #468 (ScorpioCPH)
- test: Add unit test for controller #467 (gaocegege)
- linter: Fix linter ignore file #466 (gaocegege)
- Fix field selectors in controller #465 (wbuchwalter)
- Run ks upgrade #464 (lluunn)
- Import v1alpha2 logic code #463 (ScorpioCPH)
- Fix owners file id #462 (lluunn)
- Remove deprecated package retryutil #460 (ScorpioCPH)
- Change test cluster to kubeflow-ci #459 (lluunn)
- Update API to v1alpha2 #457 (ScorpioCPH)
- *: Remove APIExtension clientset #454 (gaocegege)
- travis: Ignore generated code #453 (gaocegege)
- Create PDB of TFReplicaSet for gang scheduling by kube-arbitrator #452 (mitake)
- Add OWNERS file for dashboard #446 (wbuchwalter)
- Make local release cross-platform + fix #445 (wbuchwalter)
- Add proxying to front-end development server. #442 (wbuchwalter)
- Fix dashboard + proxy incompatibility #441 (wbuchwalter)
- change kubeflow.io to kubeflow.org #440 (Jimexist)
- Remove unreachable code #434 (ScorpioCPH)
- *: Remove type ContainerName #432 (gaocegege)
- add boilerplate header for go file #431 (wackxu)
- format the python files with yapf #429 (mitake)
- clientset: Fix code which is changed manually #428 (gaocegege)
- Delete Dockerfile to build a docker image to use for prow. #425 (jlewi)
- Fix setup_cluster. #421 (jlewi)
- Add ScorpioCPH as approver/reviewer #419 (ScorpioCPH)
- Create resources (Services/Jobs) only once #418 (ScorpioCPH)
- Dashboard: Dev Guide #417 (wbuchwalter)
- Use logrus for structured logging #416 (ankushagarwal)
- Create an initial OWNERS file. #414 (jlewi)
- Docs should refer to Kubeflow user guide for deploying the TFJob operator #412 (jlewi)
- Run glide update to update glide.lock #410 (ankushagarwal)
- Fix typo in Makefile #409 (ankushagarwal)
- Add a field SchedulerName to TFJob for specifying a scheduler #408 (mitake)
- Fix lint issues with python3 and a bug in lint script #405 (jlewi)
- Support using our E2E workflow to build a Docker image for releases. #403 (jlewi)
- add go 1.10 support in travis #402 (Jimexist)
- use yapf to format python code #401 (Jimexist)
- Fix bug with jobs not working if you recreate a job with same name as previous job #399 (jlewi)
- Fixes go vet errors #397 (swiftdiaries)
- Fixed-363: Rename cmd/tf_operator -> cmd/tf-operator #393 (AK-ayush)
- README: Add community section and quick links #392 (gaocegege)
- Remove TensorBoard related code in operator #391 (gaocegege)
- Fix something after move to kubeflow/tf-operator #390 (sdf611097)
- Add a prow_config.yaml file to configure our prow jobs. #388 (jlewi)
- fix a typo in the README file. #387 (ChanYiLin)
- *: Replace the repo name #386 (gaocegege)
- travis: Add go build command #383 (gaocegege)
- config.sh: Remove #381 (gaocegege)
- Use ksonnet to easily define TFJobs to be run as tests #374 (jlewi)
- Fix repo name env #372 (jose5918)
- controller.go: Fix a glog typo #368 (gaocegege)
- fix -version option: print version #367 (caogj)
- *: Add copyright owner in go files #364 (gaocegege)
- Fix local releaser #361 (jose5918)
- nit: try to simplify e2e main.go #359 (Jimexist)
- Use Argo rather than Airflow to run our E2E tests #358 (jlewi)
- Add an option to release.py to specify the tag for the image to use. #357 (jlewi)
- Fix helm test #356 (jose5918)
- feat(group): Update CRD group to kubeflow.org #354 (gaocegege)
- Deprecate the ENV MY_POD_NAME and use default namespace #348 (ScorpioCPH)
- feat(crd): Separate CRD and controller #345 (gaocegege)
- Create Pod instead of Job #344 (ScorpioCPH)
- Deprecate IsDefaultPS in TFJob CRD API #343 (ScorpioCPH)
- Update documentation #342 (jose5918)
- feat(dashboard): Namespace handling #338 (wbuchwalter)
- feat(dashboard): better error handling in dashboard code #336 (Jimexist)
- Rename Tf to TF #332 (ScorpioCPH)
- Delete binary file #331 (ScorpioCPH)
- Take test failures into account when setting prow job status #319 (jlewi)
- remove unused file rename.sh #316 (caogj)
- add UpdateFunc to handle update events #313 (mqliang)
- pkg: Add recorder support #312 (gaocegege)
- Fix a bunch of problems in TfJob CRD that crept in while tests were broken #308 (jlewi)
- replace TPR with CRD #307 (mqliang)
- fix broken link #305 (caogj)
- Fix python lint checks #303 (jlewi)
- Fix setting defaults. #299 (jlewi)
- Add service account name to dashboard if RBAC. #298 (ConnorDoyle)
- The flag should be --controller-config-file. #295 (jlewi)
- Fix the junit XML file format. #291 (jlewi)
- *: Fix API Version #289 (gaocegege)
- *: Implement the List interface for TfJobList #278 (gaocegege)
- cmd: Fix the flag error caused by pflag #277 (gaocegege)
- types.go: Fix CRDKind #276 (gaocegege)
- Move around due to new directories layout #273 (ScorpioCPH)
- bugfix: set faliures=true if failed deleting configmap #272 (mqliang)
- Fix our continuous release process #271 (jlewi)
- update initialClusterVersion to 1.7.11-gke.1 #269 (cwbeitel)
- Misc Cleanup. #262 (jlewi)
- Add proposed directories layout #261 (ScorpioCPH)
- record event when tf_operator failover #260 (zjj2wry)
- follow kubernetes flag convention #259 (zjj2wry)
- refactor dashboard backend, use versioned tfjob clientset #258 (zjj2wry)
- apply goimports -w to generated files #257 (Jimexist)
- add gometalinter into travis build #254 (Jimexist)
- fix(no-dup): reduce dup code in printVersion #253 (Jimexist)
- Improve utilities for E2E tests. #251 (jlewi)
- Fix leaking of clusters in E2E tests #80 #250 (jlewi)
- feat(pipenv): Use pipenv to lock down python dependencies #248 (Jimexist)
- fix(lint): add prop types and fix all eslint errors #246 (Jimexist)
- refactor code and format imported package #245 (zjj2wry)
- feat(lint): apply prettier to format frontend src/ code #244 (Jimexist)
- feature(lint): use prettier and lint-staged for frontend javascript code #243 (Jimexist)
- Fix issues with tf_job_gpu test #241 (jlewi)
- Use the release/test python scripts pulled from the repo. #237 (jlewi)
- Don't run glide install in travis builds. #236 (jlewi)
- refactor the controller logic #234 (wackxu)
- feat(coverage): add coveralls support #232 (Jimexist)
- use glide install --strip-vendor remove subpackage vendor #231 (zjj2wry)
- update k8s dependency to stable version #230 (wackxu)
- let tfJob image configurable #228 (zjj2wry)
- remove todo, add gitSHA into version information #227 (zjj2wry)
- controller.go: Fix a print error #226 (gaocegege)
- replace tf-job-operator-config configmap when it already exist #225 (zjj2wry)
- Add the vendor directory to the repository. #222 (zjj2wry)
- allow using WORKER:0 as chief #221 (lluunn)
- Fix issue with handling of json errors. #220 (jlewi)
- Set state to failed if there is a problem initializing job #219 (jlewi)
- On GKE mounting volumes should no longer be required for GPUs. #217 (jlewi)
- update developer guide #216 (ddysher)
- Refactor the TfJob to use K8s libraries #215 (wackxu)
- Add a basic GPU job test as part of our E2E tests. #213 (jlewi)
- minor spelling porxy => proxy #211 (cbockman)
- Add terminationPolicy to TfJobSpec #204 (lluunn)
- Split cloning the repo and building the images into two steps in our airflow pipeline #200 (jlewi)
- Create separate commands to clone and build the repo #199 (jlewi)
- Install yarn and nodejs inside the Airflow container. #198 (jlewi)
- Update the Airflow deployment to use Docker images built from a clean tree #197 (jlewi)
- Fix some cuda issues on Azure #194 (wbuchwalter)
- Fixing front page documentation to have grpcServerFilePath #190 (hyperbolic2346)
- Add an option to build Docker images with GCB. #187 (jlewi)
- replace deprecated tf.initialize_all_variables #184 (DjangoPeng)
- build_and_push.py: Support python3 #183 (gaocegege)
- tf_job_design_doc: Fix the apiVersion #182 (gaocegege)
- py: Add requirements.txt #180 (gaocegege)
- resolve a merge conflict imported by commit ae8c31 #178 (DjangoPeng)
- tf_job_design_doc.md: Fix a typo #177 (gaocegege)
- Fix helm templates so that we don't require a configmap. #176 (jlewi)
- replace Google and Golang repos with corresponding github repos #172 (DjangoPeng)
- Stop hardcoding namespace for TfJob config map #169 (haitch)
- Tooling to make it easier to run a bunch of TfJob tests. #168 (jlewi)
- Run python lint and unittests as part of our E2E test pipeline #166 (jlewi)
- A binary to run pylint and python unittests #163 (jlewi)
- fix dev guide #162 (lluunn)
- Integrate Airflow with Prow #158 (jlewi)
- rename jlewi/mlkube.io in glide.yaml #153 (moon03432)
- add Create(), Delete() in TfJobClient interface #152 (moon03432)
- change jobname from task-runtimeid-index to jobname-task-runtimeid-index #151 (moon03432)
- Create binaries to run steps in an E2E test pipeline. #148 (jlewi)
- Fix a typo in the command line help. #147 (jlewi)
- ignore too-many-locals. #146 (jlewi)
- On RBAC clusters, test needs a service account with appropriate permissions #145 (jlewi)
- Airflow pipeline to run our tests #144 (jlewi)
- fix(*): amend the number of worker and ps in example yaml spec for a distributed job #142 (lienhua34)
- fix a log issue #141 (moon03432)
- rename clus to tfjob in controller.go #138 (moon03432)
- rename InClusterConfig() to GetClusterConfig() #137 (moon03432)
- Remove trailing slash of host #134 (ScorpioCPH)
- Turn release.py into a binary to build the artifacts for all the different contexts #133 (jlewi)
- Minor fix typo and redundancy #131 (ScorpioCPH)
- Update developer_guide.md #129 (Jimexist)
- Use K8s Garbage Collection #127 (jlewi)
- Dashboard V1 #125 (wbuchwalter)
- More verbose logging of resource deletion #124 (jlewi)
- Fix rbac settings in chart. #123 (jlewi)
- Fix issue in tpr_util.Delete() #121 (wbuchwalter)
- Tag docker images with "latest". #119 (jlewi)
- Update API group in the chart #117 (sozercan)
- Helm instructions #111 (jlewi)
- Name label #105 (jlewi)
- Update helm install syntax in readme #104 (sozercan)
- Change group to tensorflow.org and version to v1alpha1. #103 (jlewi)
- [WIP] Notebook demonstrating use of TfJob on GKE #102 (jlewi)
- Fix bugs in the release script. #100 (jlewi)
- Fix bugs in the release script. #99 (jlewi)
- Update release.py so we can run it continuously. #98 (jlewi)
- Fix the E2E test by specifying cloud when deploying the helm package. #97 (jlewi)
- Need to set environment to enable Estimators with TF <=1.3 #94 (jlewi)
- Update README.md #92 (Jimexist)
- Add python lint check to travis and fix python lint issues #91 (jlewi)
- #71 Simplify accelerators config #90 (wbuchwalter)
- Update test infrastructure to use repo tensorflow/k8s #87 (jlewi)
- Create symbolic links in GCS to output of presubmit results. #79 (jlewi)
- Fix periodic results (#76) #78 (jlewi)
- Another attempt to fix periodic jobs. #77 (jlewi)
- Fix location of the post submit results. #74 (jlewi)
- Overhaul the documentation #73 (jlewi)
- Release scripts #69 (jlewi)
- Record latest green from postsubmit #66 (jlewi)
- Fix presubmit jobs and periodic jobs #60 (jlewi)
- Fix periodic test #56 (jlewi)
- Updated chart with batch.jobs and extensions.deployments cluster roles #52 (sozercan)
- Added RBAC support for tf-operator chart #51 (sozercan)
- PR to test Prow presubmit integration. #50 (jlewi)
- E2E test for the CRD #49 (jlewi)
- Create configs for setting up Prow for continuous testing. #47 (jlewi)
- Fix bug that prevents permanent errors from causing job failure. #44 (jlewi)
- Always check for existing TfJobs and instantiate controllers for them. #43 (jlewi)
- support multi namespaces #39 (loadwiki)
- Use Jinja templates and a Python script to build example Docker images for examples #37 (jlewi)
- Parameter Server: Run TF server by default #36 (wbuchwalter)
- Set default values for Replicas, TfPort, TfReplicaType. #31 (jlewi)
- Fix a couple bugs. #27 (jlewi)
- [WIP] Update to CustomResourceDefinition instead of ThirdPartyResource. #20 (jlewi)
- Update glide config. #18 (jlewi)
- Add TensorBoard Integration #15 (wbuchwalter)
- Changes to support CI using Travis. #14 (jlewi)
- Add Environment Variables in Controller Config #12 (wbuchwalter)
- Fix tests #11 (wbuchwalter)
- Helm charts renaming #10 (wbuchwalter)
- Simplify GPU configuration process. #9 (jlewi)
- Fix build, add Glide for dependency management. #8 (wbuchwalter)
- Update links in README.md #3 (wbuchwalter)
- A more thorough E2E test. #2 (jlewi)
- Create a helm chart for deploying the TfJob operator #1 (jlewi)