This developer guide is for people who want to contribute to the Katib project. If you're interested in using Katib in your machine learning project, see the following user guides:
- Concepts in Katib, hyperparameter tuning, and neural architecture search.
- Getting started with Katib.
- Detailed guide to configuring and running a Katib experiment.
- Go (1.22 or later)
- Docker (24.0 or later)
- Docker Buildx (0.8.0 or later)
- Java (8 or later)
- Python (3.11 or later)
- kustomize (4.0.5 or later)
Note that Docker Desktop must have the containerd image store enabled to build multi-arch images. Build the Katib images from source code as follows:
make build REGISTRY=<image-registry> TAG=<image-tag>
If you are using an Apple Silicon machine and encounter the "rosetta error: bss_size overflow," go to Docker Desktop -> General and uncheck "Use Rosetta for x86_64/amd64 emulation on Apple Silicon."
To use your custom images for the Katib components, modify the Kustomization file and the Katib Config.
You can deploy Katib v1beta1 manifests into a Kubernetes cluster as follows:
make deploy
You can undeploy Katib v1beta1 manifests from a Kubernetes cluster as follows:
make undeploy
The following guidelines apply primarily to Katib, but other projects like Training Operator might also adhere to them.
When coding:
- Follow the Effective Go guidelines.
- Run `make check` locally to verify that your changes follow best practices before submitting PRs.
Testing:
- Use `cmp.Diff` instead of `reflect.DeepEqual` to provide useful comparisons.
- Define test cases as maps instead of slices to avoid dependencies on the running order. The map key should be equal to the test case name.
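As a minimal sketch of the map-based guideline, the program below keys each test case by its name, so nothing depends on iteration order. The function under test (`normalizeName`) and the case names are hypothetical and not part of Katib. To stay within the standard library, the sketch uses a plain string comparison; in a real `_test.go` file you would typically run each case with `t.Run(name, ...)` and report `cmp.Diff(tc.want, got)` from `github.com/google/go-cmp/cmp` instead.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalizeName is a hypothetical function under test; it is not Katib code.
func normalizeName(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// testCase holds the input and the expected output of one case.
type testCase struct {
	input string
	want  string
}

// cases returns the test cases as a map. The key is the test case name,
// and map iteration order is random, so cases cannot depend on each other.
func cases() map[string]testCase {
	return map[string]testCase{
		"trims surrounding whitespace":      {input: "  random  ", want: "random"},
		"lowercases mixed case":             {input: "Bayesian", want: "bayesian"},
		"leaves normalized input unchanged": {input: "grid", want: "grid"},
	}
}

// failedCases runs fn against every case and returns the sorted names of
// the cases whose output did not match the expectation.
func failedCases(fn func(string) string) []string {
	var failed []string
	for name, tc := range cases() {
		if got := fn(tc.input); got != tc.want {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	return failed
}

func main() {
	fmt.Println("number of failed cases:", len(failedCases(normalizeName)))
}
```

Because each case carries its own name, a failure report stays readable even though the execution order changes between runs.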
If you want to modify the Katib controller APIs, you have to regenerate the deepcopy, clientset, listers, informers, OpenAPI, and Python SDK code from the changed APIs. You can update the necessary files as follows:
make generate
Below is a list of command-line flags accepted by Katib controller:
Name | Type | Default | Description |
---|---|---|---|
katib-config | string | "" | The katib-controller will load its initial configuration from this file. Omit this flag to use the default configuration values. |
Below is a list of command-line flags accepted by Katib DB Manager:
Name | Type | Default | Description |
---|---|---|---|
connect-timeout | time.Duration | 60s | How long to wait for a database connection before returning an error |
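To illustrate how a `time.Duration` flag such as `connect-timeout` behaves, here is a minimal sketch using Go's standard `flag` package. It is not the Katib DB Manager source; the function name `parseConnectTimeout` and the flag set name are assumptions made for this example. Only the flag name, type, and default come from the table above.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// parseConnectTimeout parses args with a single connect-timeout flag whose
// default is 60s, mirroring the table above. Hypothetical helper, not Katib code.
func parseConnectTimeout(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("db-manager-sketch", flag.ContinueOnError)
	timeout := fs.Duration("connect-timeout", 60*time.Second,
		"How long to wait for a database connection before returning an error")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *timeout, nil
}

func main() {
	// Duration flags accept values such as "90s", "2m", or "1m30s".
	d, err := parseConnectTimeout([]string{"-connect-timeout=90s"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("connect timeout:", d)
}
```

Omitting the flag yields the 60s default, and a value without a unit (for example `-connect-timeout=90`) is rejected by the parser.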
Please see workflow-design.md.
Katib uses three Kubernetes admission webhooks.
- `validator.experiment.katib.kubeflow.org` - Validating admission webhook that validates the Katib Experiment before creation.
- `defaulter.experiment.katib.kubeflow.org` - Mutating admission webhook that sets the default values in the Katib Experiment before creation.
- `mutator.pod.katib.kubeflow.org` - Mutating admission webhook that injects the metrics collector sidecar container into the training pod. Learn more about Katib's metrics collector in the Kubeflow documentation.
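For readers unfamiliar with admission webhooks, the following is a rough, standard-library-only sketch of what a validating webhook does at the HTTP level: it receives an `admission.k8s.io/v1` `AdmissionReview` request and answers with `allowed: true` or `false`. It does not reproduce Katib's webhook code. The struct definitions cover only a small subset of the real API types, and the validation rule (the object must set `metadata.name`) is a placeholder chosen for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Minimal subset of the admission.k8s.io/v1 AdmissionReview wire format.
type admissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *admissionRequest  `json:"request,omitempty"`
	Response   *admissionResponse `json:"response,omitempty"`
}

type admissionRequest struct {
	UID    string          `json:"uid"`
	Object json.RawMessage `json:"object"`
}

type admissionResponse struct {
	UID     string  `json:"uid"`
	Allowed bool    `json:"allowed"`
	Status  *status `json:"status,omitempty"`
}

type status struct {
	Message string `json:"message"`
}

// review applies a placeholder rule: the submitted object must set metadata.name.
func review(in admissionReview) admissionReview {
	out := admissionReview{APIVersion: in.APIVersion, Kind: in.Kind}
	if in.Request == nil {
		out.Response = &admissionResponse{Status: &status{Message: "missing request"}}
		return out
	}
	var obj struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
	}
	resp := &admissionResponse{UID: in.Request.UID, Allowed: true}
	if err := json.Unmarshal(in.Request.Object, &obj); err != nil || obj.Metadata.Name == "" {
		resp.Allowed = false
		resp.Status = &status{Message: "object must set metadata.name"}
	}
	out.Response = resp
	return out
}

// allowedFor reports whether the placeholder rule admits the given object JSON.
func allowedFor(objectJSON string) bool {
	req := &admissionRequest{UID: "1", Object: json.RawMessage(objectJSON)}
	return review(admissionReview{Request: req}).Response.Allowed
}

// handle decodes an AdmissionReview request and writes the reviewed response.
func handle(w http.ResponseWriter, r *http.Request) {
	var in admissionReview
	if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(review(in))
}

func main() {
	body := []byte(`{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview",` +
		`"request":{"uid":"123","object":{"metadata":{"name":"demo"}}}}`)
	rec := httptest.NewRecorder()
	handle(rec, httptest.NewRequest(http.MethodPost, "/validate", bytes.NewReader(body)))
	fmt.Print(rec.Body.String())
}
```

A mutating webhook follows the same request and response flow but additionally returns a patch describing the changes to apply to the object.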
You can find the YAMLs for the Katib webhooks here.
Note: If you are using a private Kubernetes cluster, you have to allow traffic via `TCP:8443` by specifying a firewall rule, and you have to update the master plane CIDR source range to use the Katib webhooks.
The Katib controller has an internal `cert-generator` that generates certificates for the webhooks.
Once Katib is deployed in the Kubernetes cluster, the `cert-generator` follows these steps:
- Generate the self-signed certificate and private key.
- Update a Kubernetes Secret with the self-signed TLS certificate and private key.
- Patch the webhooks with the `CABundle`.
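The first step above can be sketched with Go's standard crypto packages. This is an illustrative example under stated assumptions, not the `cert-generator` implementation: the function name `generateSelfSigned`, the RSA key size, the one-year validity, and the service DNS name are all choices made for this sketch. Because the certificate is self-signed, the same PEM-encoded certificate can serve as the CA bundle that the webhooks are patched with.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// generateSelfSigned returns a PEM-encoded self-signed certificate and its
// PEM-encoded RSA private key for the given DNS name. Hypothetical helper.
func generateSelfSigned(dnsName string) ([]byte, []byte, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	// Use a random 128-bit serial number.
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
	if err != nil {
		return nil, nil, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber:          serial,
		Subject:               pkix.Name{CommonName: dnsName},
		DNSNames:              []string{dnsName},
		NotBefore:             now,
		NotAfter:              now.AddDate(1, 0, 0), // assumed one-year validity
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	// Passing the template as its own parent makes the certificate self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
	return certPEM, keyPEM, nil
}

func main() {
	// The DNS name below is a placeholder, not necessarily Katib's service name.
	certPEM, keyPEM, err := generateSelfSigned("webhook-service.example-namespace.svc")
	if err != nil {
		fmt.Println("generation failed:", err)
		return
	}
	fmt.Printf("certificate: %d bytes, private key: %d bytes\n", len(certPEM), len(keyPEM))
}
```

The remaining two steps (updating the Secret and patching the webhook configurations) require a Kubernetes client and are left out of this sketch.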
Once the `cert-generator` has finished, the Katib controller starts registering controllers such as `experiment-controller` with the manager.
You can find the `cert-generator` source code here.
NOTE: Katib also supports using cert-manager to generate certificates for the admission webhooks instead of the `cert-generator`. You can find the installation with cert-manager here.
Please see new-algorithm-service.md.
Please see Katib UI README.
Please see proposals.