document for huggingface(vllm) servingruntime for multi-node #402

Open

Jooho wants to merge 4 commits into kserve:main from Jooho:multi-node

Contributor

Jooho commented Oct 8, 2024

"Fixes #issue-number" or "Add description of the problem this PR solves"

Proposed Changes

This PR adds new documentation for setting up multi-node/multi-GPU inference using the Hugging Face LLM Serving Runtime. It includes detailed instructions on prerequisites, key configurations, model inference, and sample requests for the OpenAI completions and chat endpoints. The documentation aims to improve user understanding and streamline deployment for developers who want to use Hugging Face's capabilities in a Kubernetes environment.

This documentation is valid only after kserve/kserve#3972 is merged.

netlify bot commented Oct 8, 2024 (edited)

✅ Deploy Preview for elastic-nobel-0aef7a ready!

Name	Link
🔨 Latest commit	`a5a8661`
🔍 Latest deploy log	https://app.netlify.com/sites/elastic-nobel-0aef7a/deploys/6722fd3b6beb3c000810e851
😎 Deploy Preview	https://deploy-preview-402--elastic-nobel-0aef7a.netlify.app

Jooho mentioned this pull request

Support Multi-Node Inference and Serving. kserve/kserve#3870

Open

4 tasks

Jooho marked this pull request as draft

October 8, 2024 15:13


          document for huggingface(vllm) servingruntime for multi-node

4e7cca4

Signed-off-by: jooho lee <[email protected]>

Jooho force-pushed the multi-node branch from dda0af9 to 4e7cca4 Compare

October 10, 2024 15:52


          update doc based on the change of logic

1860f33

Signed-off-by: jooho lee <[email protected]>

Jooho marked this pull request as ready for review

October 18, 2024 16:05


          add consideration

9b4d994

Signed-off-by: jooho lee <[email protected]>

spolti reviewed


docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

+                      timeoutSeconds: 5
+                      initialDelaySeconds: 10
+              ..
+              ~~~

spolti Oct 25, 2024

Maybe add a note on how and where to set this, for example whether it goes in a custom runtime or in the isvc?

Contributor Author

Jooho Oct 31, 2024

This needs to be set in the ServingRuntime, e.g. `huggingfaceserver-multinode`.
I added a brief note about this to the doc.

spolti reviewed


docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md Outdated


		### Key Validations

		- `TENSOR_PARALLEL_SIZE` and `PIPELINE_PARALLEL_SIZE` cannot be set via environment variables. They must be configured through `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize`.

spolti Oct 25, 2024

Suggested change

      
-              - `TENSOR_PARALLEL_SIZE` and `PIPELINE_PARALLEL_SIZE` cannot be set via environment variables. They must be configured through `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize`.
+              - `TENSOR_PARALLEL_SIZE` and `PIPELINE_PARALLEL_SIZE` cannot be set via environment variables. They must be configured through `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize` respectively.

Contributor Author

Jooho Oct 31, 2024

updated

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

+              ### Key Validations
+              - `TENSOR_PARALLEL_SIZE` and `PIPELINE_PARALLEL_SIZE` cannot be set via environment variables. They must be configured through `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize`.
+              - In a ServingRuntime designed for multi-node, both `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize` must be set.

spolti Oct 25, 2024

There is no expected default value for `pipelineParallelSize` (as it should be >= 2).

Contributor Author

Jooho Oct 31, 2024

Line 16 explains the minimum value for each field.
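
As a rough illustration of the two rules quoted in this thread (both fields must be set; the minimums are 1 and 2), the sketch below checks a `workerSpec` mapping. The function name `validate_worker_spec` and the plain-dict layout are hypothetical and chosen only for this example; KServe performs its actual validation in its own controller code, which is not reproduced here.

```python
from typing import Any, Dict, List

# Minimum values quoted in the README under review.
MIN_TENSOR_PARALLEL_SIZE = 1
MIN_PIPELINE_PARALLEL_SIZE = 2


def validate_worker_spec(worker_spec: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the spec looks valid.

    Hypothetical helper for illustration only; not part of KServe.
    """
    minimums = {
        "tensorParallelSize": MIN_TENSOR_PARALLEL_SIZE,
        "pipelineParallelSize": MIN_PIPELINE_PARALLEL_SIZE,
    }
    problems = []
    for field, minimum in minimums.items():
        value = worker_spec.get(field)
        if value is None:
            # Rule 1: both fields must be set explicitly (no default).
            problems.append(f"workerSpec.{field} must be set")
        elif not isinstance(value, int) or isinstance(value, bool) or value < minimum:
            # Rule 2: each field has a documented minimum.
            problems.append(f"workerSpec.{field} must be an integer >= {minimum}")
    return problems
```

Under these assumptions, a spec with `tensorParallelSize: 1` and `pipelineParallelSize: 2` passes, while omitting `pipelineParallelSize` is reported as a missing field rather than silently defaulted.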

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

+              - In a ServingRuntime designed for multi-node, both `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize` must be set.
+              - The minimum value for `workerSpec.tensorParallelSize` is **1**, and the minimum value for `workerSpec.pipelineParallelSize` is **2**.
+              - Currently, four GPU types are allowed: `nvidia.com/gpu` (*default*), `intel.com/gpu`, `amd.com/gpu`, and `habana.ai/gaudi`.
+              - You can specify the GPU type via InferenceService, but if it differs from what is set in the ServingRuntime, both GPU types will be assigned to the resource. Then it can cause issues.

spolti Oct 25, 2024

Suggested change

      
-              - You can specify the GPU type via InferenceService, but if it differs from what is set in the ServingRuntime, both GPU types will be assigned to the resource. Then it can cause issues.
+                - You can specify the GPU type via InferenceService, but if it differs from what is set in the ServingRuntime, both GPU types will be assigned to the resource. Then it can cause issues.

spolti Oct 25, 2024

make sure it is nested from the previous bullet regarding GPUs.

Contributor Author

Jooho Oct 31, 2024

It was implemented, but after discussion we decided to add a validation check for it. Users have to set this.
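
To illustrate why a mismatch is a problem, here is a minimal sketch that assumes the resource mappings from the ServingRuntime and the InferenceService are combined key by key. That merge behaviour and the helper names (`gpu_types`, `merged_gpu_types`) are assumptions made for this example; only the four allowed resource names come from the README text quoted above.

```python
from typing import Dict, Set

# GPU resource names listed as allowed in the README under review.
ALLOWED_GPU_TYPES = frozenset(
    {"nvidia.com/gpu", "intel.com/gpu", "amd.com/gpu", "habana.ai/gaudi"}
)


def gpu_types(resources: Dict[str, str]) -> Set[str]:
    """Return the allowed GPU resource names present in a requests/limits mapping."""
    return {name for name in resources if name in ALLOWED_GPU_TYPES}


def merged_gpu_types(
    runtime_resources: Dict[str, str], isvc_resources: Dict[str, str]
) -> Set[str]:
    """GPU types that would end up on the container after a key-by-key merge.

    Hypothetical helper: more than one entry in the result signals the
    conflicting configuration described in the README bullet.
    """
    return gpu_types(runtime_resources) | gpu_types(isvc_resources)
```

For example, a runtime requesting `nvidia.com/gpu` combined with an InferenceService requesting `amd.com/gpu` yields two GPU types in the merged set, which is the situation a validation check would need to reject.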

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

+              - The minimum value for `workerSpec.tensorParallelSize` is **1**, and the minimum value for `workerSpec.pipelineParallelSize` is **2**.
+              - Currently, four GPU types are allowed: `nvidia.com/gpu` (*default*), `intel.com/gpu`, `amd.com/gpu`, and `habana.ai/gaudi`.
+              - You can specify the GPU type via InferenceService, but if it differs from what is set in the ServingRuntime, both GPU types will be assigned to the resource. Then it can cause issues.
+              - The Autoscaler must be configured as `external`.

spolti Oct 25, 2024

if no other autoscaler is supported, why not default to it independently of what the user defines?

Contributor Author

Jooho Oct 31, 2024

I don't think there is a way to set default autoscaler in ServingRuntime.
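
Since the runtime cannot supply a default, the check has to look at what the user put on the InferenceService. The sketch below assumes the autoscaler class is set through the `serving.kserve.io/autoscalerClass` annotation; that key should be confirmed against the KServe documentation for the version in use. The function name and the dict-shaped manifest are hypothetical.

```python
from typing import Any, Dict

# Assumed annotation key; verify against the KServe docs for your version.
AUTOSCALER_ANNOTATION = "serving.kserve.io/autoscalerClass"


def autoscaler_is_external(isvc: Dict[str, Any]) -> bool:
    """Return True if the manifest explicitly sets the autoscaler to `external`.

    Hypothetical helper operating on a manifest already parsed into a dict.
    """
    metadata = isvc.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    return annotations.get(AUTOSCALER_ANNOTATION) == "external"
```

A manifest without the annotation returns `False` here, which mirrors the point of this thread: the value is not defaulted and must be set by the user.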

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

+              - Currently, four GPU types are allowed: `nvidia.com/gpu` (*default*), `intel.com/gpu`, `amd.com/gpu`, and `habana.ai/gaudi`.
+              - You can specify the GPU type via InferenceService, but if it differs from what is set in the ServingRuntime, both GPU types will be assigned to the resource. Then it can cause issues.
+              - The Autoscaler must be configured as `external`.
+              - The only supported storage protocol for StorageURI is `PVC`.

spolti Oct 25, 2024

isn't modelCar already supported by KServe?

Contributor Author

Jooho Oct 31, 2024

The first phase only supports PVC. modelcar might be included in the next phase.
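
A minimal sketch of what a PVC-only restriction could look like, assuming storage URIs follow the `pvc://<claim-name>/<path>` form. The helper name `is_pvc_storage_uri` is hypothetical, and the example claim names in the usage note are made up.

```python
from urllib.parse import urlparse


def is_pvc_storage_uri(storage_uri: str) -> bool:
    """Return True if the URI uses the pvc scheme and names a claim.

    Hypothetical helper for illustration; not KServe's own validation.
    """
    parsed = urlparse(storage_uri)
    # scheme is the part before "://", netloc is the claim name after it.
    return parsed.scheme == "pvc" and bool(parsed.netloc)
```

With this check, `pvc://my-claim/models/llama` is accepted, while `s3://bucket/model` or an `oci://` modelcar reference is rejected until a later phase adds support.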

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md



		2. `workerSpec.pipelineParallelSize`
		This setting determines how many nodes are involved in the deployment. This variable represents the total number of nodes, including both the head and worker nodes.

spolti Oct 25, 2024

I think it's worth mentioning that all nodes must have GPU available.

Contributor Author

Jooho Oct 31, 2024

Not all nodes need to have GPUs, as affinity can be used to select GPU-enabled nodes. Additionally, NFD (Node Feature Discovery) can add GPU labels to nodes, allowing OpenShift to choose nodes with GPUs automatically.
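
To make the counting rule in the quoted README text concrete: `pipelineParallelSize` is the total, head included. The sketch below assumes there is exactly one head, which the quoted text implies but does not state outright; `node_layout` is a hypothetical name.

```python
from typing import Dict


def node_layout(pipeline_parallel_size: int) -> Dict[str, int]:
    """Split the total node count into head and worker counts.

    Assumes a single head node (an assumption made for this sketch).
    """
    if pipeline_parallel_size < 2:
        # The README gives 2 as the minimum for pipelineParallelSize.
        raise ValueError("pipelineParallelSize must be at least 2")
    return {"head": 1, "workers": pipeline_parallel_size - 1}
```

So a value of 2 means one head plus one worker, and a value of 4 means one head plus three workers.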

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md


		### 2. Download the Model to the PVC

		To download the model, export your Hugging Face token (`HF_TEST_TOKEN`) as an environment variable. Replace `%token%` with your actual token.

spolti Oct 25, 2024

Why the parentheses?

Contributor Author

Jooho Oct 31, 2024

It has no special meaning.

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md


		### 2. Download the Model to the PVC

		To download the model, export your Hugging Face token (`HF_TEST_TOKEN`) as an environment variable. Replace `%token%` with your actual token.

spolti Oct 25, 2024

Is `%string%` the pattern for placeholders?

I've seen `{{string}}` used in other places.

Contributor Author

Jooho Oct 31, 2024

Correct, it is for placeholders.
This pattern is simply the one I usually use; it has no special meaning. The most important thing is that users understand what the string means, and I think they will.

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md Outdated

+              !!! success "Expected Output"
+                  ```{ .bash .no-copy }
+                  NAME                 URL                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                          AGE
+                  huggingface-llama3   http://huggingface-llama3.default.example.com                                          12m

spolti Oct 25, 2024

shouldn't 12m be under AGE?

Contributor Author

Jooho Oct 31, 2024

It does not matter to me. Also, the 405B model actually takes 12 minutes just to load.

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md Outdated

		huggingface-llama3 http://huggingface-llama3.default.example.com 12m
		```

spolti Oct 25, 2024

Suggested change

Contributor Author

Jooho Oct 31, 2024

updated.


          follow up for comments

a5a8661

Signed-off-by: jooho lee <[email protected]>

terrytangyuan reviewed


docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

		@@ -0,0 +1,330 @@
		# Multi-node/Multi-GPU Inference with Hugging Face LLM Serving Runtime

Member

terrytangyuan Nov 8, 2024

Suggested change

      
-              # Multi-node/Multi-GPU Inference with Hugging Face LLM Serving Runtime
+              # Multi-node/Multi-GPU Inference with Hugging Face vLLM Serving Runtime

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

		@@ -0,0 +1,330 @@
		# Multi-node/Multi-GPU Inference with Hugging Face LLM Serving Runtime

		This guide provides step-by-step instructions on setting up multi-node and multi-GPU inference using Hugging Face's LLM Serving Runtime. Before proceeding, please ensure you meet the following prerequisites and understand the limitations of this setup.

Member

terrytangyuan Nov 8, 2024

Suggested change

      
-              This guide provides step-by-step instructions on setting up multi-node and multi-GPU inference using Hugging Face's LLM Serving Runtime. Before proceeding, please ensure you meet the following prerequisites and understand the limitations of this setup.
+              This guide provides step-by-step instructions on setting up multi-node and multi-GPU inference using Hugging Face's vLLM Serving Runtime. Before proceeding, please ensure you meet the following prerequisites and understand the limitations of this setup.

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

+              - `TENSOR_PARALLEL_SIZE` and `PIPELINE_PARALLEL_SIZE` cannot be set via environment variables. They must be configured through `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize` respectively.
+              - In a ServingRuntime designed for multi-node, both `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize` must be set.
+              - The minimum value for `workerSpec.tensorParallelSize` is **1**, and the minimum value for `workerSpec.pipelineParallelSize` is **2**.
+              - Currently, four GPU types are allowed: `nvidia.com/gpu` (*default*), `intel.com/gpu`, `amd.com/gpu`, and `habana.ai/gaudi`.

Member

terrytangyuan Nov 8, 2024

Should we add reference on how to add additional types? e.g. adding them here https://github.com/kserve/kserve/blob/21b103e1ce7166a444d0145412cee19bd6574309/pkg/constants/constants.go#L218-L223

Member

terrytangyuan Nov 8, 2024

Oh I see. There are instructions later

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

+                "nvidia.com/gpu"
+                "amd.com/gpu"
+                "intel.com/gpu"
+                "habana.ai/gaudi"

Member

terrytangyuan Nov 8, 2024

Already mentioned these earlier

docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

+                  | NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
+                  |-----------------------------------------+------------------------+----------------------+
+                  | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+                  | Fan  Temp   Perf          Pwr:Usage/Cap |            # Specifying workerSpec indicates that multi-node functionality will be used     Memory-Usage | GPU-Util  Compute M. |

Member

terrytangyuan Nov 8, 2024

There's some unexpected copy paste here

rnetser reviewed


docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

+              - Multi-node functionality is only supported in **RawDeployment** mode.
+              - **Auto-scaling is not available** for multi-node setups.
+              - A **Persistent Volume Claim (PVC)** is required for multi-node configurations, and it must support the **ReadWriteMany (RWM)** access mode.

rnetser Nov 10, 2024

Suggested change

      
-              - A **Persistent Volume Claim (PVC)** is required for multi-node configurations, and it must support the **ReadWriteMany (RWM)** access mode.
+              - A **Persistent Volume Claim (PVC)** is required for multi-node configurations, and it must support the **ReadWriteMany (RWX)** access mode.

?
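
For context on the suggestion above, Kubernetes abbreviates the `ReadWriteMany` access mode as RWX in CLI output. A small sketch of checking a PVC manifest for that mode follows; it assumes the manifest has already been parsed into a dict, and `supports_read_write_many` is a hypothetical helper name.

```python
from typing import Any, Dict


def supports_read_write_many(pvc: Dict[str, Any]) -> bool:
    """Return True if the PVC manifest lists ReadWriteMany in spec.accessModes."""
    spec = pvc.get("spec") or {}
    access_modes = spec.get("accessModes") or []
    return "ReadWriteMany" in access_modes
```

A claim declared with only `ReadWriteOnce` would fail this check, which matters here because the head and worker pods need to mount the same model volume at the same time.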
