How to Control the Number of Model Replicas in ModelMesh Serving #500

Open
michael-nammi opened this issue Apr 16, 2024 · 7 comments

@michael-nammi

michael-nammi commented Apr 16, 2024

Description

I am running ModelMesh Serving on a Kubernetes cluster and am looking for a way to control the number of replicas of a specific model. My setup has a Triton runtime with two pods, serving a single model, mobilenet. I want to be able to set the number of model replicas to a specific value.

Cluster State:

The state of pods in my cluster is as follows:

NAME                                           READY   STATUS    RESTARTS   AGE
etcd-bcc445f46-gnmw6                           1/1     Running   0          2d21h
minio-67577699d-frm4s                          1/1     Running   0          2d21h
modelmesh-controller-5fd6b98c4f-h4njm          1/1     Running   0          65s
modelmesh-serving-triton-2.x-9849f97c6-54gh7   4/4     Running   0          18s
modelmesh-serving-triton-2.x-9849f97c6-qndvd   4/4     Running   0          18s
traefik-78db748568-cmn4x                       1/1     Running   0          2d21h

Inference service status

NAME                     URL                                               READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
example-mobilenet-isvc   grpc://modelmesh-serving.modelmesh-serving:8033   True                                                                  40s

The InferenceService for mobilenet (example-mobilenet-isvc) has minReplicas set to 2, as shown in the description below:

Name:         example-mobilenet-isvc
Namespace:    modelmesh-serving
Labels:       <none>
Annotations:  serving.kserve.io/deploymentMode: ModelMesh
API Version:  serving.kserve.io/v1beta1
Kind:         InferenceService
Metadata:
  Creation Timestamp:  2024-04-18T03:01:25Z
  Generation:          1
  Resource Version:    454691
  UID:                 f34abe33-606f-4fbd-95e4-a67829f7dac0
Spec:
  Predictor:
    Min Replicas:  2
    Model:
      Model Format:
        Name:   onnx
      Runtime:  triton-2.x
      Storage:
        Key:  minio
        Parameters:
          Bucket:  modelmesh-serving
        Path:      mobilenetv2-7.onnx
Status:
  Components:
    Predictor:
      Grpc URL:  grpc://modelmesh-serving.modelmesh-serving:8033
      Rest URL:  http://modelmesh-serving.modelmesh-serving:8008
      URL:       grpc://modelmesh-serving.modelmesh-serving:8033
  Conditions:
    Last Transition Time:  2024-04-18T03:01:40Z
    Status:                True
    Type:                  PredictorReady
    Last Transition Time:  2024-04-18T03:01:40Z
    Status:                True
    Type:                  Ready
  Model Status:
    Copies:
      Failed Copies:  0
      Total Copies:   1
    States:
      Active Model State:  Loaded
      Target Model State:
    Transition Status:     UpToDate
  URL:                     grpc://modelmesh-serving.modelmesh-serving:8033
Events:                    <none>
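
The mismatch in this output (Min Replicas: 2 versus Total Copies: 1) can also be checked programmatically. The following is only a minimal sketch: it assumes the InferenceService has been exported as JSON (for example with `kubectl get ... -o json`) and that the camelCase field names correspond to the fields shown in the description above. `replica_gap` is a hypothetical helper, not part of ModelMesh.

```python
import json

def replica_gap(isvc: dict) -> int:
    """Return how many copies are missing relative to spec.predictor.minReplicas."""
    wanted = isvc.get("spec", {}).get("predictor", {}).get("minReplicas", 0)
    copies = isvc.get("status", {}).get("modelStatus", {}).get("copies", {})
    loaded = copies.get("totalCopies", 0)
    return max(wanted - loaded, 0)

# Trimmed stand-in for the object described above (values copied from it).
example = json.loads("""
{
  "spec": {"predictor": {"minReplicas": 2}},
  "status": {"modelStatus": {"copies": {"failedCopies": 0, "totalCopies": 1}}}
}
""")

print(replica_gap(example))
```

A non-zero result means fewer copies are reported than the configured minimum, which is the situation this issue describes.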

ETCD Keys and Values:

The relevant etcd data suggests that only one replica of the model is active, judging by the instanceIds and count fields:

{"hostname":"10.244.0.174","instanceId":"9f97c6-qndvd","port":8080,"version":"20230801-7b484","registrationTime":1713409288745,"connConfig":{"transport.tprotocol.factory":"org.apache.thrift.protocol.TCompactProtocol$Factory","transport.framed":"false","transport.ssl.enabled":"false","transport.extrainfo_supported":"true","service.class":"com.ibm.watson.modelmesh.thrift.ModelMeshService","methodinfo.applyModelMulti":"idp=t","methodinfo.applyModel":"idp=t","app.kv_store_type":"etcd"}}
/litelinks/modelmesh-serving/10.244.0.175_8080_18eef26ea30
{"hostname":"10.244.0.175","instanceId":"9f97c6-54gh7","port":8080,"version":"20230801-7b484","registrationTime":1713409288755,"connConfig":{"transport.tprotocol.factory":"org.apache.thrift.protocol.TCompactProtocol$Factory","transport.framed":"false","transport.ssl.enabled":"false","transport.extrainfo_supported":"true","service.class":"com.ibm.watson.modelmesh.thrift.ModelMeshService","methodinfo.applyModel":"idp=t","methodinfo.applyModelMulti":"idp=t","app.kv_store_type":"etcd"}}
/mm/modelmesh-serving/instances/9f97c6-54gh7
{"startTime":1713409287610,"loc":"172.18.0.2","labels":["mt:keras","mt:keras:2","mt:onnx","mt:onnx:1","mt:pytorch","mt:pytorch:1","mt:tensorflow","mt:tensorflow:1","mt:tensorflow:2","mt:tensorrt","mt:tensorrt:7","pv:grpc-v2","pv:v2","rt:triton-2.x"],"actionable":true,"lruTime":1713407522245,"count":1,"cap":48661,"used":123,"lThreads":2,"lInProg":1}
/mm/modelmesh-serving/instances/9f97c6-qndvd
{"startTime":1713409287621,"loc":"172.18.0.2","labels":["mt:keras","mt:keras:2","mt:onnx","mt:onnx:1","mt:pytorch","mt:pytorch:1","mt:tensorflow","mt:tensorflow:1","mt:tensorflow:2","mt:tensorrt","mt:tensorrt:7","pv:grpc-v2","pv:v2","rt:triton-2.x"],"actionable":true,"lruTime":1713407522368,"count":1,"cap":48661,"used":2174,"lThreads":2}
/mm/modelmesh-serving/leaderLatch/_9f97c6-54gh7
_9f97c6-54gh7
/mm/modelmesh-serving/leaderLatch/_9f97c6-qndvd
_9f97c6-qndvd
/mm/modelmesh-serving/registry/example-mobilenet-isvc__isvc-0b5941bbd0
{"type":"rt:triton-2.x","encKey":"{\"storage_key\":\"minio\",\"storage_params\":{\"bucket\":\"modelmesh-serving\"},\"model_type\":{\"name\":\"onnx\"}}","mPath":"mobilenetv2-7.onnx","autoDel":true,"instanceIds":{"9f97c6-qndvd":1713409297527},"refs":1,"lu":1713407522368}
/mm/modelmesh-serving/vmodels/example-mobilenet-isvc
{"o":"isvc","amid":"example-mobilenet-isvc__isvc-0b5941bbd0","tmid":"example-mobilenet-isvc__isvc-0b5941bbd0"}
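
To read the copy count straight from the registry record, the `instanceIds` map can be counted. This is a rough sketch that assumes each key in `instanceIds` is one serving instance holding a copy of the model, which is an interpretation of the dump above and not a documented contract. `loaded_instances` is a hypothetical helper, and the sample value is the registry record above with `encKey` trimmed out.

```python
import json

def loaded_instances(registry_value: str) -> list[str]:
    """List the instance IDs recorded for a model in its registry record."""
    record = json.loads(registry_value)
    # Assumption: one key per instance that currently holds a copy.
    return sorted(record.get("instanceIds", {}))

# Registry record from the etcd dump above, with encKey omitted for brevity.
registry_value = (
    '{"type":"rt:triton-2.x","mPath":"mobilenetv2-7.onnx","autoDel":true,'
    '"instanceIds":{"9f97c6-qndvd":1713409297527},"refs":1,"lu":1713407522368}'
)

copies = loaded_instances(registry_value)
print(len(copies), copies)
```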

Question:

How can I ensure that ModelMesh Serving honors the minReplicas setting for a specific model? The documentation does not seem to cover scaling individual model replicas across the serving pods in any depth. Is there a way to control the number of model replicas in ModelMesh Serving?

@haiminh2001

Hi @michael-nammi, have you found a solution?

@haiminh2001

I found this doc in the model-mesh repository. I hope it helps.

@mafs12

mafs12 commented Oct 16, 2024

I'm also trying to understand how to set replicas to a specific, fixed number. It seems model autoscaling is on by default, so I'm not sure if that is possible. Maybe it's only possible by setting minReplicas=maxReplicas?

@haiminh2001

> I'm also trying to understand how to set replicas to a specific, fixed number. It seems model autoscaling is on by default, so I'm not sure if that is possible. Maybe it's only possible by setting minReplicas=maxReplicas?

Unfortunately, there is no way to set a fixed number of replicas for a given model; you can only control it indirectly through the concurrency settings of the serving runtime servers. As far as I know, there are two model-scaling mechanisms:

@mafs12

mafs12 commented Oct 17, 2024

So, there's no way to stick with one replica!? I'm observing that most of the time I have 2 replicas of the model (MM v0.12.0). That's not great when deploying LLMs. Setting maxReplicas=1 has no effect, right?

@mafs12

mafs12 commented Oct 18, 2024

> So, there's no way to stick with one replica!? I'm observing that most of the time I have 2 replicas of the model (MM v0.12.0). That's not great when deploying LLMs. Setting maxReplicas=1 has no effect, right?

It turns out I had two Prometheus jobs collecting the same metrics, so aggregating them produced duplicated results. I actually have 7 models and 7 copies.
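
The double counting described here can be illustrated with a small sketch. Everything in it is hypothetical: the sample layout, the `model`, `pod`, and `job` label names, and the `copies_per_model` helper are assumptions for illustration, not the actual ModelMesh metric schema. The idea is to collapse samples that differ only in the scraping job before summing.

```python
from collections import defaultdict

def copies_per_model(samples, ignore=("job",)):
    """Sum copy counts per model after dropping samples duplicated across jobs."""
    seen = {}
    for sample in samples:
        # Identity of a series once the scraping job label is ignored.
        key = tuple(sorted(
            (name, value) for name, value in sample["labels"].items()
            if name not in ignore
        ))
        seen[key] = sample["value"]  # a duplicate overwrites instead of adding
    totals = defaultdict(int)
    for key, value in seen.items():
        totals[dict(key).get("model", "unknown")] += value
    return dict(totals)

# Hypothetical samples: two jobs scrape the same series for model-a.
samples = [
    {"labels": {"job": "scrape-1", "model": "model-a", "pod": "pod-1"}, "value": 1},
    {"labels": {"job": "scrape-2", "model": "model-a", "pod": "pod-1"}, "value": 1},
    {"labels": {"job": "scrape-1", "model": "model-b", "pod": "pod-2"}, "value": 1},
]

naive_total = sum(s["value"] for s in samples)
deduplicated = copies_per_model(samples)
print(naive_total, deduplicated)
```

Summing the raw samples counts model-a twice, while the deduplicated view reports one copy per model.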

@haiminh2001

The idea here is that ModelMesh controls the number of replicas of the serving runtimes, not of the models. You can definitely set maxReplicas for the serving runtimes.

The fact that your model scaled up means there was spare capacity in your serving runtimes, so scaling up makes sense to me in that case. If you want to prevent it, I think you can set maxReplicas or raise the model's scale-up threshold.

By the way, I have only just deployed ModelMesh to my company's staging environment. We are going to deploy it to production soon, but I may still be missing something about how ModelMesh controls the number of replicas (that part baffled me a lot).
