Implement pre-commit hooks

Signed-off-by: droctothorpe <[email protected]>
droctothorpe committed Jul 25, 2024
1 parent f55a91d commit 37391ed
Showing 64 changed files with 233 additions and 174 deletions.
14 changes: 14 additions & 0 deletions .github/workflows/pre-commit.yaml
@@ -0,0 +1,14 @@
name: pre-commit

on:
  pull_request:
  push:
    branches: [main]

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v3
      - uses: pre-commit/[email protected]
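The workflow runs pre-commit's published GitHub Action on every push and pull request. To reproduce the gate locally before pushing, something like the following should behave equivalently (a sketch; the flags below are the pre-commit CLI's own and may differ slightly from what the action passes internally):

```sh
# Run every configured hook against the entire repository,
# roughly what pre-commit/action does in CI.
pre-commit run --all-files --show-diff-on-failure
```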
28 changes: 28 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,28 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
        args: [--allow-multiple-documents]
      - id: check-json
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/pycqa/isort
    rev: 5.11.5
    hooks:
      - id: isort
        name: isort
        entry: isort --profile google
        exclude: |
          (?x)^(
              pkg/apis/kubeflow.org/v1/openapi_generated.go|
              pkg/apis/kubeflow.org/v1/zz_.*|
              pkg/client/.*|
              test_job/apis/test_job/v1/.*generated.*.go|
              test_job/client/.*|
              sdk/python/kubeflow/training/[^/]*.py|
              sdk/python/kubeflow/training/models/.*|
              sdk/python/test/.*|
              docs/api/.*|
              sdk/python/docs/.*
          )$
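Two details of this configuration are worth calling out. The isort hook runs with `--profile google`, which forces one import per line (`typing` excepted) and sorts `import x` and `from x import y` statements together within each section; that profile is what drives the import reshuffling in the Python files below. The `exclude` value is a Python verbose regex matched against repository-relative paths, which keeps the hooks away from generated code. A hedged sketch of exercising both by hand (assumes isort 5.11.5, the rev pinned above, is installed):

```sh
# Apply the same sorting rules directly to one of the files in this commit.
isort --profile google examples/pytorch/mnist/mnist.py

# Run only the isort hook against chosen paths; files matching the exclude
# regex (e.g. anything under sdk/python/kubeflow/training/models/) are skipped.
pre-commit run isort --files examples/pytorch/mnist/mnist.py
```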
40 changes: 28 additions & 12 deletions docs/development/developer_guide.md
@@ -5,7 +5,7 @@ Kubeflow Training Operator is currently at v1.
## Requirements

- [Go](https://golang.org/) (1.22 or later)
- [Docker](https://docs.docker.com/) (20.10 or later)
- [Docker Buildx](https://docs.docker.com/build/buildx/) (0.8.0 or later)
- [Python](https://www.python.org/) (3.11 or later)
@@ -49,12 +49,12 @@ First, you need to run a Kubernetes cluster locally. We recommend [Kind](https:/

You can create a `kind` cluster by running
```sh
kind create cluster
```
This will load your Kubernetes config file with the new cluster.

After creating the cluster, you can check the nodes with the command below, which should show the kind-control-plane node.
```sh
kubectl get nodes
```
The output should look something like below:
@@ -74,9 +74,9 @@ Then we can patch it with the latest operator image.
```sh
kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "kubeflow/training-operator:latest"}]'
```
Then we can run the job with the following command.

```sh
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
```
You can see the job's output in the logs, though it may take some time to appear.
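The logs can be tailed with the job's label selector (a hedged sketch; the selector matches the one used later in this guide, and pod startup may take a while):

```sh
kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple --follow
```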
@@ -116,10 +116,10 @@ make docker-build IMG=my-username/training-operator:my-pr-01
```
You can swap `my-username/training-operator:my-pr-01` with whatever you would like.

## Load docker image
```sh
kind load docker-image my-username/training-operator:my-pr-01
```

## Replace the operator image with the new one

@@ -129,8 +129,8 @@ kustomize edit set image my-username/training-operator=my-username/training-oper
```
Update the `newTag` key in `./manifests/overlays/standalone/kustomization.yaml` with the new image.
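Alternatively, `kustomize edit` can rewrite that key for you; a sketch with placeholder image names (run from the overlay directory):

```sh
cd manifests/overlays/standalone
# Updates the images: entry in kustomization.yaml in place.
kustomize edit set image kubeflow/training-operator=my-username/training-operator:my-pr-01
```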

Deploy the operator with:
```sh
kubectl apply -k ./manifests/overlays/standalone
```
And now we can submit jobs to the operator.
@@ -140,7 +140,7 @@ kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/ma
```
You should be able to see the training job's pods in your namespace and follow their logs using
```sh
kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple
```
## Go version

@@ -187,3 +187,19 @@ sdk/python/kubeflow/training/api
```sh
black --check --exclude '/*kubeflow_org_v1*|__init__.py|api_client.py|configuration.py|exceptions.py|rest.py' sdk/
```

### pre-commit

Make sure to install [pre-commit](https://pre-commit.com/) (`pip install
pre-commit`) and run `pre-commit install` from the root of the repository at
least once before creating git commits.

The pre-commit [hooks](../.pre-commit-config.yaml) ensure code quality and
consistency. They are executed in CI. PRs that fail to comply with the hooks
will not be able to pass the corresponding CI gate. The hooks are only executed
against staged files unless you run `pre-commit run --all-files`, in which case
they'll be executed against every file in the repository.
Specific programmatically generated files listed in the `exclude` field in
[.pre-commit-config.yaml](../.pre-commit-config.yaml) are deliberately excluded
from the hooks.
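
In practice, the full local workflow looks something like this (a minimal sketch):

```sh
# One-time setup from the repository root.
pip install pre-commit
pre-commit install          # registers the git hook

# Hooks now run automatically on every `git commit`;
# to check all files up front:
pre-commit run --all-files
```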
1 change: 0 additions & 1 deletion docs/diagrams/tfjob_k8s_resources.svg
3 changes: 2 additions & 1 deletion docs/release/changelog.py
@@ -1,6 +1,7 @@
import argparse

from github import Github

REPO_NAME = "kubeflow/training-operator"
CHANGELOG_FILE = "CHANGELOG.md"

2 changes: 1 addition & 1 deletion examples/mxnet/byteps/dist_gpu_v1.yaml
@@ -43,5 +43,5 @@ spec:
            nvidia.com/gpu: 8
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
15 changes: 9 additions & 6 deletions examples/mxnet/tune/auto-tuning.py
@@ -1,16 +1,19 @@
import argparse
import os
import sys

import nnvm.compiler
import nnvm.testing
import numpy as np
import tvm
from tvm import autotvm
from tvm.autotvm.tuner import GATuner
from tvm.autotvm.tuner import GridSearchTuner
from tvm.autotvm.tuner import RandomTuner
from tvm.autotvm.tuner import XGBTuner
import tvm.contrib.graph_runtime as runtime
from tvm.contrib.util import tempdir

#from mxnet.gluon.model_zoo.vision import get_model

def get_network(name, batch_size):
2 changes: 1 addition & 1 deletion examples/mxnet/tune/start-job.py
@@ -1,8 +1,8 @@
# Before running this script, make sure tvm is installed in your cluster

import json
import os
import time

if __name__ == '__main__':
    mx_config = json.loads(os.environ.get('MX_CONFIG') or '{}')
1 change: 0 additions & 1 deletion examples/paddlepaddle/simple-gpu.yaml
@@ -33,4 +33,3 @@ spec:
        - name: dshm
          emptyDir:
            medium: Memory
8 changes: 4 additions & 4 deletions examples/pytorch/README.md
@@ -1,5 +1,5 @@
## Installation & deployment tips
1. You need to configure your node to utilize the GPU. This can be done in the following way:
* Install [nvidia-docker2](https://github.com/NVIDIA/nvidia-docker)
* Connect to your MasterNode and set nvidia as the default runtime in `/etc/docker/daemon.json`:
```
@@ -13,11 +13,11 @@
}
}
```
* After that, deploy the nvidia-daemon to Kubernetes:
```bash
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
```

2. NVIDIA GPUs can now be consumed via container level resource requirements using the resource name nvidia.com/gpu:
```
resources:
1 change: 0 additions & 1 deletion examples/pytorch/elastic/echo/echo.yaml
@@ -26,4 +26,3 @@ spec:
            - torch.distributed.run
            - --rdzv_backend=c10d
            - ./echo.py
2 changes: 1 addition & 1 deletion examples/pytorch/elastic/etcd.yaml
@@ -71,4 +71,4 @@ spec:
      protocol: TCP
      targetPort: 2380
  selector:
    etcd_node: etcd-server
2 changes: 1 addition & 1 deletion examples/pytorch/elastic/imagenet/.dockerignore
@@ -1 +1 @@
data
14 changes: 7 additions & 7 deletions examples/pytorch/elastic/imagenet/imagenet.py
@@ -44,30 +44,30 @@
"""

import argparse
from contextlib import contextmanager
from datetime import timedelta
import io
import os
import shutil
import time
from typing import List, Tuple

import numpy
import torch
import torch.distributed as dist
from torch.distributed.elastic.multiprocessing.errors import record
from torch.distributed.elastic.utils.data import ElasticDistributedSampler
import torch.nn as nn
import torch.nn.parallel
from torch.nn.parallel import DistributedDataParallel
import torch.optim
from torch.optim import SGD
import torch.utils.data
from torch.utils.data import DataLoader
import torch.utils.data.distributed
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms

model_names = sorted(
    name
@@ -696,4 +696,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
6 changes: 3 additions & 3 deletions examples/pytorch/mnist/Makefile
@@ -19,7 +19,7 @@ IMG = gcr.io/kubeflow-examples/pytorch-dist-mnist
PUBLIC = gcr.io/kubeflow-examples/pytorch-dist-mnist
DIR := ${CURDIR}

# List any changed files.
CHANGED_FILES := $(shell git diff-files --relative=examples/dist-mnist)

ifeq ($(strip $(CHANGED_FILES)),)
@@ -43,9 +43,9 @@ build:
# Build but don't attach the latest tag. This allows manual testing/inspection of the image
# first.
push: build
	gcloud docker -- push $(IMG):$(TAG)
@echo Pushed $(IMG) with :$(TAG) tags

push-latest: push
gcloud container images add-tag --quiet $(IMG):$(TAG) $(IMG):latest --verbosity=info
echo created $(IMG):latest
Expand Down
5 changes: 3 additions & 2 deletions examples/pytorch/mnist/mnist.py
@@ -3,14 +3,15 @@
import argparse
import os

from tensorboardX import SummaryWriter
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DistributedSampler
from torchvision import datasets
from torchvision import transforms


class Net(nn.Module):
2 changes: 1 addition & 1 deletion examples/pytorch/mnist/v1/pytorch_job_mnist_gloo.yaml
@@ -22,7 +22,7 @@ spec:
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "gloo"]
8 changes: 4 additions & 4 deletions examples/pytorch/mnist/v1/pytorch_job_mnist_mpi.yaml
@@ -14,19 +14,19 @@ spec:
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "mpi"]
              # Comment out the below resources to use the CPU.
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "mpi"]
              # Comment out the below resources to use the CPU.
              resources:
                limits:
                  nvidia.com/gpu: 1
6 changes: 3 additions & 3 deletions examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml
@@ -13,18 +13,18 @@ spec:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
2 changes: 1 addition & 1 deletion examples/pytorch/smoke-dist/README.md
@@ -1,4 +1,4 @@
### Distributed send/recv e2e test

This folder contains the Dockerfile and a distributed send/recv test.

8 changes: 4 additions & 4 deletions examples/pytorch/smoke-dist/dist_sendrecv.py
@@ -33,19 +33,19 @@ def init_processes(fn, backend='gloo'):

def main():
    logging.info("Torch version: %s", torch.__version__)

    port = os.environ.get("MASTER_PORT", "{}")
    logging.info("MASTER_PORT: %s", port)

    addr = os.environ.get("MASTER_ADDR", "{}")
    logging.info("MASTER_ADDR: %s", addr)

    world_size = os.environ.get("WORLD_SIZE", "{}")
    logging.info("WORLD_SIZE: %s", world_size)

    rank = os.environ.get("RANK", "{}")
    logging.info("RANK: %s", rank)

    init_processes(run)

