Add Doc for accelerator
liuhongwei committed Jun 18, 2024
1 parent 0400ef1 commit bd96e43
Showing 2 changed files with 127 additions and 31 deletions.
97 changes: 72 additions & 25 deletions docs/en/advanced_guides/accelerator_intro.md
@@ -1,50 +1,50 @@
# Accelerating Inference Evaluation with VLLM or LMDeploy
# Accelerate Evaluation Inference with vLLM or LMDeploy

## Background

In the evaluation process of OpenCompass, the default method is to use Huggingface's transformers library for inference, which is a very versatile solution. However, in some cases, we may require more efficient inference methods to speed up this process, such as leveraging VLLM or LMDeploy.
During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging vLLM or LMDeploy.

- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference and serving library, featuring advanced serving throughput, efficient memory management with PagedAttention, continuous batching of requests, fast model execution with CUDA/HIP graphs, quantization techniques (such as GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.
- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit designed for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving, featuring advanced serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (e.g., GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.

## Preparation for Acceleration

First, check if the model you want to evaluate supports inference acceleration using VLLM or LMDeploy. Then, ensure you have installed VLLM or LMDeploy. Here are the reference installation methods based on their official documentation:
First, check whether the model you want to evaluate supports inference acceleration using vLLM or LMDeploy. Additionally, ensure you have installed vLLM or LMDeploy as per their official documentation. Below are the installation methods for reference:

### LMDeploy Installation
### LMDeploy Installation Method

Install using pip (Python 3.8+) or from [source](https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md):
Install LMDeploy using pip (Python 3.8+) or from [source](https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md):

```bash
pip install lmdeploy
```

### VLLM Installation
### vLLM Installation Method

Install using pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
Install vLLM using pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```
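
After installation, you can optionally confirm that the backends are importable in the current environment. The snippet below is just a sanity check and assumes both packages are installed; drop whichever one you do not need:

```python
# Optional sanity check: confirm the acceleration backends are importable
# and print their versions.
import lmdeploy
import vllm

print('lmdeploy version:', lmdeploy.__version__)
print('vllm version:', vllm.__version__)
```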

## Using VLLM or LMDeploy for Evaluation
## Accelerated Evaluation Using vLLM or LMDeploy

OpenCompass provides a one-click evaluation acceleration feature, which automatically converts Huggingface transformer models to VLLM or LMDeploy models during the evaluation process. Below is a script that evaluates the GSM8k dataset using the default Huggingface version of the Internlm2-chat-7b model:
### Method 1: Using Command Line Parameters to Change the Inference Backend

### OpenCompass Main Repository
OpenCompass offers one-click evaluation acceleration. During evaluation, it can automatically convert Huggingface transformers models into vLLM or LMDeploy models. Below is example code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
from mmengine.config import read_base

with read_base():
    # choose a list of datasets
    # Select a dataset list
    from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets
    # choose a model of interest
    # Select a model of interest
    from ..models.hf_llama.hf_llama3_8b_instruct import models
```

Here, `hf_llama3_8b_instruct` is the original Huggingface model config, as follows:
Here, `hf_llama3_8b_instruct` specifies the original Huggingface model configuration, as shown below:

```python
from opencompass.models import HuggingFacewithChatTemplate
@@ -62,13 +62,13 @@ models = [
]
```
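
The model list itself is collapsed in the diff above. For reference, a typical `HuggingFacewithChatTemplate` entry looks roughly like the sketch below; the field values here are illustrative rather than a copy of the repository config:

```python
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='llama-3-8b-instruct-hf',               # name shown in result tables
        path='meta-llama/Meta-Llama-3-8B-Instruct',  # Huggingface model path
        max_out_len=1024,                            # maximum number of generated tokens
        batch_size=8,                                # inference batch size
        run_cfg=dict(num_gpus=1),                    # GPUs requested per task
    )
]
```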

To evaluate the Llama3-8b-instruct model on the GSM8k dataset using the default Huggingface version, run:
To evaluate the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model, use:

```bash
python run.py config/eval_gsm8k.py
```

To use VLLM or LMDeploy for accelerated evaluation, use the following scripts:
To accelerate the evaluation using vLLM or LMDeploy, you can use the following script:

```bash
python run.py config/eval_gsm8k.py -a vllm
@@ -80,14 +80,61 @@ or
python run.py config/eval_gsm8k.py -a lmdeploy
```

## Performance Comparison
### Method 2: Accelerating Evaluation via a Deployed Inference Acceleration Service API

Below is a performance comparison table for evaluating the GSM8k dataset using VLLM or LMDeploy, with single A800 GPU:
OpenCompass also supports accelerating evaluation by deploying vLLM or LMDeploy inference acceleration service APIs. Follow these steps:

| Inference Backend | Accuracy | Inference Time (min:sec) | Speedup (relative to Huggingface) |
| ----------------- | -------- | ------------------------ | --------------------------------- |
| Huggingface | 74.22 | 24:26 | 1.0 |
| LMDeploy | 73.69 | 11:15 | 2.2 |
| VLLM | 72.63 | 07:52 | 3.1 |
1. Install the openai package:

As shown in the table, using VLLM or LMDeploy for inference acceleration can significantly reduce inference time while maintaining high accuracy.
```bash
pip install openai
```

2. Deploy the inference acceleration service API for vLLM or LMDeploy. Below is an example for LMDeploy:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```

The parameters available when starting the api_server can be listed with `lmdeploy serve api_server -h`; for example, `--tp` sets tensor parallelism, `--session-len` sets the maximum context window length, and `--cache-max-entry-count` adjusts the memory usage ratio of the k/v cache.
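
Before wiring the service into OpenCompass, you can optionally verify that it responds by sending a request with the openai client installed in step 1. This is a minimal sketch assuming the server started above is listening on port 23333 and serving the model name Meta-Llama-3-8B-Instruct:

```python
from openai import OpenAI

# Connect to the locally deployed, OpenAI-compatible LMDeploy api_server.
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')  # the key is not checked locally

response = client.chat.completions.create(
    model='Meta-Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Briefly introduce yourself.'}],
    temperature=0.01,
    max_tokens=64,
)
print(response.choices[0].message.content)
```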

3. Once the service is successfully deployed, modify the evaluation script so that the model configuration points to the deployed service address, as shown below:

```python
from opencompass.models import OpenAI

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='Meta-Llama-3-8B-Instruct-LMDeploy-API',
        type=OpenAI,
        openai_api_base='http://0.0.0.0:23333/v1',  # Service address
        path='Meta-Llama-3-8B-Instruct',  # Model name for service request
        rpm_verbose=True,  # Whether to print request rate
        meta_template=api_meta_template,  # Service request template
        query_per_second=1,  # Service request rate
        max_out_len=1024,  # Maximum output length
        max_seq_len=4096,  # Maximum input length
        temperature=0.01,  # Generation temperature
        batch_size=8,  # Batch size
        retry=3,  # Number of retries
    )
]
```
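
The API-backed `models` list above is used in an evaluation config in the same way as the Huggingface one in Method 1. The sketch below shows one way to wire it up; the file name `eval_gsm8k_api.py` and the reuse of the GSM8k dataset config are assumptions, and the evaluation is then launched with `python run.py config/eval_gsm8k_api.py` without the `-a` flag:

```python
# eval_gsm8k_api.py -- illustrative config: evaluate GSM8k through the deployed API.
from mmengine.config import read_base
from opencompass.models import OpenAI

with read_base():
    # Reuse the same GSM8k dataset config as in Method 1.
    from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets

# Define `api_meta_template` and `models` exactly as in the snippet above,
# with `openai_api_base` pointing at the deployed service address.
```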

## Acceleration Effect and Performance Comparison

Below is a comparison of the acceleration effect and performance when evaluating the Llama-3-8B-Instruct model on the GSM8k dataset with vLLM or LMDeploy on a single A800 GPU:

| Inference Backend | Accuracy | Inference Time (minutes:seconds) | Speedup (relative to Huggingface) |
| ----------------- | -------- | -------------------------------- | --------------------------------- |
| Huggingface | 74.22 | 24:26 | 1.0 |
| LMDeploy | 73.69 | 11:15 | 2.2 |
| VLLM | 72.63 | 07:52 | 3.1 |
61 changes: 55 additions & 6 deletions docs/zh_cn/advanced_guides/accelerator_intro.md
@@ -1,4 +1,4 @@
# One-Click Acceleration of Evaluation Inference with VLLM or LMDeploy
# One-Click Acceleration of Evaluation Inference with vLLM or LMDeploy

## Background

@@ -9,7 +9,7 @@

## Preparation Before Acceleration

First, check whether the model you want to evaluate supports inference acceleration with VLLM or LMDeploy. Second, make sure you have installed VLLM or LMDeploy; for specific installation steps, refer to their official documentation. The installation methods below are provided for reference:
First, check whether the model you want to evaluate supports inference acceleration with vLLM or LMDeploy. Second, make sure you have installed vLLM or LMDeploy; for specific installation steps, refer to their official documentation. The installation methods below are provided for reference:

### LMDeploy Installation Method

@@ -29,9 +29,9 @@ pip install vllm

## Using VLLM or LMDeploy During Evaluation

OpenCompass provides one-click evaluation acceleration: during evaluation, Huggingface transformers models can be automatically converted into VLLM or LMDeploy models for use. Below is a script for evaluating the GSM8k dataset with the default Huggingface version of the Internlm2-chat-7b model:
### Method 1: Change the Inference Backend via Command-Line Arguments

### OpenCompass Main Repository
OpenCompass provides one-click evaluation acceleration: during evaluation, Huggingface transformers models can be automatically converted into VLLM or LMDeploy models for use. Below is sample code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
@@ -68,7 +68,7 @@ models = [
python run.py config/eval_gsm8k.py
```

If you need to accelerate the evaluation with VLLM or LMDeploy, you can use the following script:
If you need to accelerate the evaluation with vLLM or LMDeploy, you can use the following script:

```bash
python run.py config/eval_gsm8k.py -a vllm
@@ -80,9 +80,58 @@ python run.py config/eval_gsm8k.py -a vllm
python run.py config/eval_gsm8k.py -a lmdeploy
```

### Method 2: Accelerate Evaluation via a Deployed Inference Acceleration Service API

OpenCompass also supports accelerating evaluation by deploying a vLLM or LMDeploy inference acceleration service API. The reference steps are as follows:

1. Install the openai package:

```bash
pip install openai
```

2. Deploy the vLLM or LMDeploy inference acceleration service API; for specific deployment steps, refer to their official documentation. Below is an example using LMDeploy:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```

The parameters available when starting the api_server can be listed with `lmdeploy serve api_server -h`; for example, `--tp` sets tensor parallelism, `--session-len` sets the maximum context window length for inference, and `--cache-max-entry-count` adjusts the memory usage ratio of the k/v cache.

3. Once the service is deployed successfully, modify the evaluation script so that the model configuration points to the deployed service address, as shown below:

```python
from opencompass.models import OpenAI

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='Meta-Llama-3-8B-Instruct-LMDeploy-API',
        type=OpenAI,
        openai_api_base='http://0.0.0.0:23333/v1',  # Service address
        path='Meta-Llama-3-8B-Instruct',  # Model name used when requesting the service
        rpm_verbose=True,  # Whether to print the request rate
        meta_template=api_meta_template,  # Service request template
        query_per_second=1,  # Service request rate
        max_out_len=1024,  # Maximum output length
        max_seq_len=4096,  # Maximum input length
        temperature=0.01,  # Generation temperature
        batch_size=8,  # Batch size
        retry=3,  # Number of retries
    )
]
```

## Acceleration Effect and Performance Comparison

Below is a comparison of the acceleration effect and performance when running accelerated evaluation on the GSM8k dataset with VLLM or LMDeploy on a single A800 GPU:
Below is a comparison of the acceleration effect and performance when running accelerated evaluation of the Llama-3-8B-Instruct model on the GSM8k dataset with VLLM or LMDeploy on a single A800 GPU:

| Inference Backend | Accuracy | Inference Time (minutes:seconds) | Speedup (relative to Huggingface) |
| ----------- | ---------------- | -------------------- | ---------------------------- |
| Huggingface | 74.22 | 24:26 | 1.0 |
| LMDeploy | 73.69 | 11:15 | 2.2 |
| VLLM | 72.63 | 07:52 | 3.1 |
