Replies: 2 comments
-
Whilst LLMs are very large, they typically aren't so large that multiple machines are required to run a given model. onnxruntime is used in Azure to run LLMs on single machines without a problem. It's not clear what you mean by 'CPU offload' or 'tensor/pipeline parallelism', or how those would help if the model were too large to run on one machine.
-
We don't have plans to support CPU offload. You may try quantizing the model if it doesn't fit on a single GPU; we support 4-bit quantization (similar to ggml). Tensor parallelism is in the works but not ready for general-purpose consumption. Pipeline parallelism requires no extra support from ORT, as you can simply partition the model, create a session for each partition, and pipeline the execution of the model serially between sessions.
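For reference, a minimal sketch of that pipeline approach, assuming the model has already been split offline into two ONNX files (`part1.onnx` and `part2.onnx` are placeholder names) whose graph outputs and inputs line up by name, with each partition pinned to its own GPU. The input name and shape are also assumptions about the partitioned graph:

```python
import numpy as np
import onnxruntime as ort

# One session per partition, each pinned to its own GPU
# via CUDA execution provider options.
sess1 = ort.InferenceSession(
    "part1.onnx",  # placeholder name for the first partition
    providers=[("CUDAExecutionProvider", {"device_id": 0})],
)
sess2 = ort.InferenceSession(
    "part2.onnx",  # placeholder name for the second partition
    providers=[("CUDAExecutionProvider", {"device_id": 1})],
)

# Example input; the name "input_ids" and the shape are assumptions.
feed = {"input_ids": np.ones((1, 128), dtype=np.int64)}

# Run the first partition, then pipe its outputs into the second
# partition by matching output names to input names.
mid_outputs = sess1.run(None, feed)
mid_feed = {
    out.name: val
    for out, val in zip(sess1.get_outputs(), mid_outputs)
}
final_outputs = sess2.run(None, mid_feed)
```

To actually overlap work across the GPUs (rather than run the stages strictly serially for one request), you would feed successive micro-batches so that while `sess2` processes batch N, `sess1` is already running batch N+1.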
-
LLMs are very large, so their parameters may not fit on a single accelerator.
In that case, does onnxruntime provide CPU offload, or tensor/pipeline parallelism?