Multinode serving #574

Draft
seanshi-scale wants to merge 91 commits into main from seanshi/20240722-multinode-serving

Commits (91)
c189fbf
mark a bunch of todos, maybe not complete though
seanshi-scale Jul 23, 2024
0dac28d
more todos in k8s_endpoint_resource_delegate
seanshi-scale Jul 23, 2024
b00dc8b
mark more todos
seanshi-scale Jul 23, 2024
b7275f3
partially string nodes_per_worker through
seanshi-scale Jul 23, 2024
69e767f
wip k8s_endpoint_resource_delegate
seanshi-scale Jul 24, 2024
b755e56
lws service template config map yaml
seanshi-scale Jul 24, 2024
832b29b
delete lws
seanshi-scale Jul 24, 2024
41cffd2
small things for delete
seanshi-scale Jul 24, 2024
c1f78a0
create_lws
seanshi-scale Jul 24, 2024
bcc4fb2
mark more places
seanshi-scale Jul 24, 2024
f3974cd
think this is sufficient for LWS args?
seanshi-scale Jul 25, 2024
a9f43fb
fix most of the domain test cases
seanshi-scale Jul 25, 2024
9cf87c6
fix rest of the domain test cases
seanshi-scale Jul 25, 2024
f168fca
partial llm stuff
seanshi-scale Jul 25, 2024
73cf9d8
validation of bundle extra params in endpoint use case layer
seanshi-scale Jul 25, 2024
b3da423
fix domain tests
seanshi-scale Jul 25, 2024
a1fa3a6
try fixing more tests
seanshi-scale Jul 25, 2024
7b22801
fix more tests
seanshi-scale Jul 25, 2024
0ce420d
don't set nodes per worker in update pls
seanshi-scale Jul 25, 2024
2e481e8
validate endpoint + multinode compat, more lws loading stuff
seanshi-scale Jul 26, 2024
03ef9fc
k8s resource delegate stuff
seanshi-scale Jul 26, 2024
ef110fa
.
seanshi-scale Jul 26, 2024
56be80a
add code to allow vllm to do multinode
seanshi-scale Jul 26, 2024
4b3591d
start on creating the multinode bundle
seanshi-scale Jul 26, 2024
009e707
create the bundle
seanshi-scale Jul 27, 2024
c865661
try fixing tests?
seanshi-scale Jul 27, 2024
7c81254
fix remaining unit tests
seanshi-scale Jul 27, 2024
c7b2b2e
Merge branch 'main' into seanshi/20240722-multinode-serving
seanshi-scale Jul 27, 2024
210da1a
screwed up the merge oops
seanshi-scale Jul 27, 2024
b390d25
temp turn off cache for testing, also try fixing the service template…
seanshi-scale Jul 27, 2024
258052b
hopefully the autogen templates was correct lol
seanshi-scale Jul 27, 2024
1cdd5ef
oops need to await
seanshi-scale Jul 27, 2024
da07257
fix a few typos
seanshi-scale Jul 27, 2024
31097f4
one more typo
seanshi-scale Jul 27, 2024
828fd2c
refactor out _get_deployment
seanshi-scale Jul 29, 2024
3321e2d
wip refactored out get_resources into deployment/lws types, todo the …
seanshi-scale Jul 29, 2024
2662cc7
get_resources for lws, almost done
seanshi-scale Jul 30, 2024
f4b227c
priority class
seanshi-scale Jul 30, 2024
64a773c
get_all_deployments
seanshi-scale Jul 30, 2024
9167347
black
seanshi-scale Jul 30, 2024
914b07a
comments
seanshi-scale Jul 30, 2024
d458529
try making the custom_obj_client calls return ApiException
seanshi-scale Jul 31, 2024
ceb63ce
a bunch of test stubs
seanshi-scale Jul 31, 2024
f05b722
one more test
seanshi-scale Aug 1, 2024
6c6edb4
client
seanshi-scale Aug 1, 2024
56d02ea
comment out model endpoint infra gateway update_model_endpoint_infra …
seanshi-scale Aug 1, 2024
79cfd13
fill in some tests
seanshi-scale Aug 1, 2024
f6c1f5b
more domain tests
seanshi-scale Aug 1, 2024
f71d69b
get multinode test
seanshi-scale Aug 1, 2024
1632719
fix more tests
seanshi-scale Aug 2, 2024
59c55f5
stub test
seanshi-scale Aug 8, 2024
cc6d430
Merge branch 'main' into seanshi/20240722-multinode-serving
seanshi-scale Sep 5, 2024
e2092eb
delete some things that were added back in the merge conflict
seanshi-scale Sep 5, 2024
0d8af5e
fix some semantic merge broken things
seanshi-scale Sep 5, 2024
a92d78d
delete test update multinode since that's not allowed in the api
seanshi-scale Sep 6, 2024
fa3de5e
update client
seanshi-scale Sep 6, 2024
c23e595
unmark some todos
seanshi-scale Sep 6, 2024
a6400ca
mark more todos
seanshi-scale Sep 6, 2024
a96d4eb
unmark more todos
seanshi-scale Sep 6, 2024
651a2bb
Merge branch 'main' into seanshi/20240722-multinode-serving
seanshi-scale Sep 12, 2024
0857c5b
fix test
seanshi-scale Sep 13, 2024
9863906
remove a test that isn't allowed in the api
seanshi-scale Sep 13, 2024
e87a33b
more test
seanshi-scale Sep 13, 2024
3286f1a
get test mostly working
seanshi-scale Sep 13, 2024
b45f19e
autogen template
seanshi-scale Sep 13, 2024
36d58f5
get test to pass
seanshi-scale Sep 13, 2024
ac2b92a
format
seanshi-scale Sep 13, 2024
4c5d2d5
add explicit resource
seanshi-scale Sep 13, 2024
7aca689
use the example config
seanshi-scale Sep 13, 2024
c8f5aa2
silly bug
seanshi-scale Sep 13, 2024
3ebc12b
fix test
seanshi-scale Sep 13, 2024
f25be41
.
seanshi-scale Sep 13, 2024
6ff255c
update multinode in gateway test case
seanshi-scale Sep 14, 2024
5c993e9
some cleanup
seanshi-scale Sep 14, 2024
6b22fd8
Merge branch 'main' into seanshi/20240722-multinode-serving
seanshi-scale Sep 25, 2024
65f10a3
strip out that worker env/command metadata hack
seanshi-scale Sep 25, 2024
a83d611
black
seanshi-scale Sep 25, 2024
3660008
Merge branch 'main' into seanshi/20240722-multinode-serving
seanshi-scale Sep 26, 2024
b00df92
turn cache back on
seanshi-scale Sep 26, 2024
b9845b5
unmark todos that I've done
seanshi-scale Sep 26, 2024
b3bd945
clean up more todos, add multinode deployment validation to live endp…
seanshi-scale Sep 27, 2024
cff475f
cleanup
seanshi-scale Sep 27, 2024
84d2288
try commenting out some mocks since we might not need to mock them
seanshi-scale Sep 27, 2024
af10a07
fix test, blackwell isn't out yet
seanshi-scale Sep 27, 2024
1ca4aa8
.
seanshi-scale Sep 27, 2024
6fd28e2
uncomment the config map stuff
seanshi-scale Sep 27, 2024
5076c7e
.
seanshi-scale Sep 27, 2024
bd5d657
more cleanup
seanshi-scale Sep 27, 2024
f740a2e
oops
seanshi-scale Sep 27, 2024
20c0b13
k8s doesn't allow underscores dang it
seanshi-scale Sep 28, 2024
02b0d89
vllm worker doesn't get ray cluster size
seanshi-scale Sep 28, 2024
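
Taken together, the vLLM-side commits above (56be80a "add code to allow vllm to do multinode", 02b0d89 "vllm worker doesn't get ray cluster size") point at the usual Ray-backed leader/worker split: only the leader needs the cluster size to configure parallelism, while workers simply join the leader's Ray cluster. Below is a hypothetical sketch of that split, not code from this PR; `LWS_LEADER_ADDRESS` is the env var LeaderWorkerSet injects into pods, and all other names and commands are assumptions for illustration.

```python
# Hypothetical sketch, not code from this PR: the split between ${COMMAND}
# (run on the LWS leader) and ${WORKER_COMMAND} (run on each worker) for a
# Ray-backed multinode vLLM endpoint.
import os
import subprocess

RAY_PORT = 6379

def run_leader(vllm_args: list[str]) -> None:
    # The leader starts the Ray head, then launches the vLLM server. Only
    # the leader needs the cluster size (to size tensor/pipeline
    # parallelism); per commit 02b0d89, workers don't receive it.
    subprocess.run(["ray", "start", "--head", "--port", str(RAY_PORT)], check=True)
    subprocess.run(
        ["python", "-m", "vllm.entrypoints.openai.api_server", *vllm_args],
        check=True,
    )

def run_worker() -> None:
    # A worker joins the leader's Ray cluster and blocks; pod-level failures
    # are handled by restartPolicy: RecreateGroupOnPodRestart in the LWS spec.
    leader = os.environ["LWS_LEADER_ADDRESS"]  # injected by LeaderWorkerSet
    subprocess.run(
        ["ray", "start", "--address", f"{leader}:{RAY_PORT}", "--block"],
        check=True,
    )
```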
379 changes: 379 additions & 0 deletions charts/model-engine/templates/service_template_config_map.yaml
@@ -472,6 +472,385 @@ data:
unsafeSsl: "false"
databaseIndex: "${REDIS_DB_INDEX}"
{{- end }}
{{- range $device := tuple "gpu" }}
{{- range $mode := tuple "streaming"}}
leader-worker-set-{{ $mode }}-{{ $device }}.yaml: |-
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: ${RESOURCE_NAME}
namespace: ${NAMESPACE}
labels:
{{- $service_template_labels | nindent 8 }}
spec:
replicas: ${MIN_WORKERS}
leaderWorkerTemplate:
size: ${LWS_SIZE}
restartPolicy: RecreateGroupOnPodRestart # TODO un-hardcode? if necessary
leaderTemplate:
metadata:
labels:
app: ${RESOURCE_NAME}
{{- $service_template_labels | nindent 14 }}
sidecar.istio.io/inject: "false" # Never inject istio, it screws up networking
version: v1
annotations:
ad.datadoghq.com/main.logs: '[{"service": "${ENDPOINT_NAME}", "source": "python"}]'
kubernetes.io/change-cause: "${CHANGE_CAUSE_MESSAGE}"
spec:
affinity:
{{- include "modelEngine.serviceTemplateAffinity" . | nindent 14 }}
{{- if eq $mode "async" }} # TODO
terminationGracePeriodSeconds: 1800
{{- else }}
terminationGracePeriodSeconds: 600
{{- end }}
{{- if $service_template_service_account_name }}
serviceAccount: {{ $service_template_service_account_name }}
{{- else }}
serviceAccount: {{ $launch_name }}
{{- end }}
{{- with $node_selector }}
nodeSelector:
{{- toYaml . | nindent 14 }}
{{- end }}
{{- if eq $device "gpu" }}
{{- if empty $node_selector }}
nodeSelector:
{{- end }}
k8s.amazonaws.com/accelerator: ${GPU_TYPE}
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
{{- end }}
priorityClassName: ${PRIORITY}
containers:
{{- if eq $mode "sync" }}
- name: http-forwarder
image: {{ $forwarder_repository }}:${GIT_TAG}
imagePullPolicy: IfNotPresent
command:
- /usr/bin/dumb-init
- --
{{- if $enable_datadog }}
- ddtrace-run
{{- end }}
- python
- -m
- model_engine_server.inference.forwarding.http_forwarder
- --config
- /workspace/model-engine/model_engine_server/inference/configs/${FORWARDER_CONFIG_FILE_NAME}
- --port
- "${FORWARDER_PORT}"
- --num-workers
- "${FORWARDER_WORKER_COUNT}"
- --set
- "forwarder.sync.predict_route=${PREDICT_ROUTE}"
- --set
- "forwarder.sync.healthcheck_route=${HEALTHCHECK_ROUTE}"
- --set
- "forwarder.stream.healthcheck_route=${HEALTHCHECK_ROUTE}"
{{- $sync_forwarder_template_env | nindent 16 }}
readinessProbe:
httpGet:
path: /readyz
port: ${FORWARDER_PORT}
initialDelaySeconds: ${READINESS_INITIAL_DELAY}
periodSeconds: 5
timeoutSeconds: 5
resources:
requests:
cpu: ${FORWARDER_CPUS_LIMIT}
memory: "100M"
ephemeral-storage: "100M"
limits:
cpu: ${FORWARDER_CPUS_LIMIT}
memory: ${FORWARDER_MEMORY_LIMIT}
ephemeral-storage: ${FORWARDER_STORAGE_LIMIT}
{{ $forwarder_volume_mounts | nindent 16 }}
ports:
- containerPort: ${FORWARDER_PORT}
name: http
{{- else if eq $mode "streaming" }}
- name: http-forwarder
image: {{ $forwarder_repository }}:${GIT_TAG}
imagePullPolicy: IfNotPresent
command:
- /usr/bin/dumb-init
- --
{{- if $enable_datadog }}
- ddtrace-run
{{- end }}
- python
- -m
- model_engine_server.inference.forwarding.http_forwarder
- --config
- /workspace/model-engine/model_engine_server/inference/configs/service--http_forwarder.yaml
- --port
- "${FORWARDER_PORT}"
- --num-workers
- "${FORWARDER_WORKER_COUNT}"
- --set
- "forwarder.sync.predict_route=${PREDICT_ROUTE}"
- --set
- "forwarder.stream.predict_route=${STREAMING_PREDICT_ROUTE}"
- --set
- "forwarder.sync.healthcheck_route=${HEALTHCHECK_ROUTE}"
- --set
- "forwarder.stream.healthcheck_route=${HEALTHCHECK_ROUTE}"
{{- $sync_forwarder_template_env | nindent 16 }}
readinessProbe:
httpGet:
path: /readyz
port: ${FORWARDER_PORT}
initialDelaySeconds: ${READINESS_INITIAL_DELAY}
periodSeconds: 5
timeoutSeconds: 5
resources:
requests:
cpu: ${FORWARDER_CPUS_LIMIT}
memory: "100M"
ephemeral-storage: "100M"
limits:
cpu: ${FORWARDER_CPUS_LIMIT}
memory: ${FORWARDER_MEMORY_LIMIT}
ephemeral-storage: ${FORWARDER_STORAGE_LIMIT}
{{ $forwarder_volume_mounts | nindent 16 }}
ports:
- containerPort: ${FORWARDER_PORT}
name: http
{{- else if eq $mode "async" }}
- name: celery-forwarder
image: {{ $forwarder_repository }}:${GIT_TAG}
imagePullPolicy: IfNotPresent
command:
- /usr/bin/dumb-init
- --
{{- if $enable_datadog }}
- ddtrace-run
{{- end }}
- python
- -m
- model_engine_server.inference.forwarding.celery_forwarder
- --config
- /workspace/model-engine/model_engine_server/inference/configs/${FORWARDER_CONFIG_FILE_NAME}
- --queue
- "${QUEUE}"
- --task-visibility
- "VISIBILITY_24H"
- --set
- "forwarder.async.predict_route=${PREDICT_ROUTE}"
- --set
- "forwarder.async.healthcheck_route=${HEALTHCHECK_ROUTE}"
{{- if eq $celery_broker_type "sqs" }}
- --sqs-url
- "${SQS_QUEUE_URL}"
{{- end }}
- --num-workers
- "${PER_WORKER}"
- --broker-type
- {{ $celery_broker_type }}
{{- if eq $celery_broker_type "servicebus" }}
- --backend-protocol
- abs
{{- end }}
{{- $async_forwarder_template_env | nindent 16 }}
resources:
requests:
cpu: 0.1
memory: "100M"
ephemeral-storage: "100M"
limits:
cpu: ${FORWARDER_CPUS_LIMIT}
memory: ${FORWARDER_MEMORY_LIMIT}
ephemeral-storage: ${FORWARDER_STORAGE_LIMIT}
{{ $forwarder_volume_mounts | nindent 16 }}
{{- end }}
- name: lws-leader
image: ${IMAGE}
imagePullPolicy: IfNotPresent
command: ${COMMAND}
env: ${MAIN_ENV}
readinessProbe:
httpGet:
path: ${HEALTHCHECK_ROUTE}
port: ${USER_CONTAINER_PORT}
initialDelaySeconds: ${READINESS_INITIAL_DELAY}
periodSeconds: 5
timeoutSeconds: 5
resources:
requests:
{{- if eq $device "gpu" }}
nvidia.com/gpu: ${GPUS}
{{- end }}
cpu: ${CPUS}
memory: ${MEMORY}
${STORAGE_DICT}
limits:
{{- if eq $device "gpu" }}
nvidia.com/gpu: ${GPUS}
{{- end }}
cpu: ${CPUS}
memory: ${MEMORY}
${STORAGE_DICT}
volumeMounts:
{{- if $require_aws_config }}
- name: config-volume
mountPath: /opt/.aws/config
subPath: config
{{- end }}
- mountPath: /dev/shm
name: dshm
{{- if $mount_infra_config }}
- name: infra-service-config-volume
mountPath: ${INFRA_SERVICE_CONFIG_VOLUME_MOUNT_PATH}
{{- end }}
- name: user-config
mountPath: /app/user_config
subPath: raw_data
- name: endpoint-config
mountPath: /app/endpoint_config
subPath: raw_data
ports:
- containerPort: ${USER_CONTAINER_PORT}
name: http
volumes:
{{- if $require_aws_config }}
- name: config-volume
configMap:
{{- if $service_template_aws_config_map_name }}
name: {{ $service_template_aws_config_map_name }}
{{- else }}
name: {{ $aws_config_map_name }}
{{- end }}
{{- end }}
- name: user-config
configMap:
name: ${RESOURCE_NAME}
- name: endpoint-config
configMap:
name: ${RESOURCE_NAME}-endpoint-config
- name: dshm
emptyDir:
medium: Memory
{{- if $config_values }}
- name: infra-service-config-volume
configMap:
name: {{ $launch_name }}-service-config
items:
- key: infra_service_config
path: config.yaml
{{- end }}
workerTemplate:
metadata:
labels:
app: ${RESOURCE_NAME}
{{- $service_template_labels | nindent 14 }}
sidecar.istio.io/inject: "false" # Never inject istio for LWS, it screws up networking
version: v1
annotations:
ad.datadoghq.com/main.logs: '[{"service": "${ENDPOINT_NAME}", "source": "python"}]'
kubernetes.io/change-cause: "${CHANGE_CAUSE_MESSAGE}"
spec:
affinity:
{{- include "modelEngine.serviceTemplateAffinity" . | nindent 14 }}
{{- if eq $mode "async" }} # TODO
terminationGracePeriodSeconds: 1800
{{- else }}
terminationGracePeriodSeconds: 600
{{- end }}
{{- if $service_template_service_account_name }}
serviceAccount: {{ $service_template_service_account_name }}
{{- else }}
serviceAccount: {{ $launch_name }}
{{- end }}
{{- with $node_selector }}
nodeSelector:
{{- toYaml . | nindent 14 }}
{{- end }}
{{- if eq $device "gpu" }}
{{- if empty $node_selector }}
nodeSelector:
{{- end }}
k8s.amazonaws.com/accelerator: ${GPU_TYPE}
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
{{- end }}
priorityClassName: ${PRIORITY}
containers:
- name: lws-worker
image: ${IMAGE}
imagePullPolicy: IfNotPresent
command: ${WORKER_COMMAND}
env: ${WORKER_ENV}
resources:
requests:
{{- if eq $device "gpu" }}
nvidia.com/gpu: ${GPUS}
{{- end }}
cpu: ${CPUS}
memory: ${MEMORY}
${STORAGE_DICT}
limits:
{{- if eq $device "gpu" }}
nvidia.com/gpu: ${GPUS}
{{- end }}
cpu: ${CPUS}
memory: ${MEMORY}
${STORAGE_DICT}
volumeMounts:
{{- if $require_aws_config }}
- name: config-volume
mountPath: /opt/.aws/config
subPath: config
{{- end }}
- mountPath: /dev/shm
name: dshm
{{- if $mount_infra_config }}
- name: infra-service-config-volume
mountPath: ${INFRA_SERVICE_CONFIG_VOLUME_MOUNT_PATH}
{{- end }}
- name: user-config
mountPath: /app/user_config
subPath: raw_data
- name: endpoint-config
mountPath: /app/endpoint_config
subPath: raw_data
ports:
- containerPort: ${USER_CONTAINER_PORT}
name: http
volumes:
{{- if $require_aws_config }}
- name: config-volume
configMap:
{{- if $service_template_aws_config_map_name }}
name: {{ $service_template_aws_config_map_name }}
{{- else }}
name: {{ $aws_config_map_name }}
{{- end }}
{{- end }}
- name: user-config
configMap:
name: ${RESOURCE_NAME}
- name: endpoint-config
configMap:
name: ${RESOURCE_NAME}-endpoint-config
- name: dshm
emptyDir:
medium: Memory
{{- if $config_values }}
- name: infra-service-config-volume
configMap:
name: {{ $launch_name }}-service-config
items:
- key: infra_service_config
path: config.yaml
{{- end }}
{{- end }} # mode
{{- end }} # device
service.yaml: |-
apiVersion: v1
kind: Service
…
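
The config map entries above (e.g. `leader-worker-set-streaming-gpu.yaml`) are templates rendered at endpoint-creation time by substituting the `${...}` placeholders. The placeholder syntax happens to match Python's `string.Template`; whether `k8s_endpoint_resource_delegate` renders it exactly this way is an assumption, and the substitution values below are illustrative only. A minimal sketch:

```python
# Minimal sketch, assuming plain ${VAR} substitution over the config map
# entry shown above; the exact mechanism in k8s_endpoint_resource_delegate
# may differ. Requires PyYAML.
from string import Template

import yaml

def render_lws(template_text: str, substitutions: dict[str, str]) -> dict:
    # safe_substitute leaves unknown placeholders intact rather than raising,
    # so missing values are easy to spot in the parsed output.
    rendered = Template(template_text).safe_substitute(substitutions)
    return yaml.safe_load(rendered)

lws_object = render_lws(
    open("leader-worker-set-streaming-gpu.yaml").read(),
    {
        "RESOURCE_NAME": "llama-multinode",   # illustrative endpoint name
        "NAMESPACE": "model-engine",
        "MIN_WORKERS": "1",
        "LWS_SIZE": "2",  # one leader + one worker, i.e. nodes_per_worker = 2
        # ... remaining placeholders (IMAGE, COMMAND, GPU_TYPE, CPUS, GPUS, etc.)
    },
)
```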