Replies: 2 comments
-
Whilst LLMs are very large, they typically aren't so large that multiple machines are required to run a given model. onnxruntime is used in Azure to run LLMs on single machines without a problem. It's not clear what you mean by 'CPU offload' or 'tensor/pipeline parallelism', or how those would help if the model were too large to run on one machine.
-
We don't have plans to support CPU offload. You may try quantizing the model if it doesn't fit on a single GPU; we support 4-bit quantization (similar to ggml). Tensor parallelism is in the works but not ready for general-purpose consumption. Pipeline parallelism requires no extra support from ORT, as you can simply partition the model, create a session for each partition, and pipeline the execution of the model serially between sessions.
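For reference, a minimal sketch of that pipeline approach, assuming the model has already been split offline into two ONNX files (`part1.onnx` and `part2.onnx` are placeholder names) whose graph outputs and inputs line up by name, with each partition pinned to its own GPU. The input name and shape are also assumptions about the partitioned graph:

```python
import numpy as np
import onnxruntime as ort

# One session per partition, each pinned to its own GPU
# via CUDA execution provider options.
sess1 = ort.InferenceSession(
    "part1.onnx",  # placeholder name for the first partition
    providers=[("CUDAExecutionProvider", {"device_id": 0})],
)
sess2 = ort.InferenceSession(
    "part2.onnx",  # placeholder name for the second partition
    providers=[("CUDAExecutionProvider", {"device_id": 1})],
)

# Example input; the name "input_ids" and the shape are assumptions.
feed = {"input_ids": np.ones((1, 128), dtype=np.int64)}

# Run the first partition, then pipe its outputs into the second
# partition by matching output names to input names.
mid_outputs = sess1.run(None, feed)
mid_feed = {
    out.name: val
    for out, val in zip(sess1.get_outputs(), mid_outputs)
}
final_outputs = sess2.run(None, mid_feed)
```

To actually overlap work across the GPUs (rather than run the stages strictly serially for one request), you would feed successive micro-batches so that while `sess2` processes batch N, `sess1` is already running batch N+1.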
-
LLMs are very large, so their parameters may not fit on a single accelerator.
In that case, does onnxruntime provide CPU offload, or tensor/pipeline parallelism?