Add Doc for accelerator
liuhongwei committed Jun 18, 2024
1 parent 0400ef1 commit bd96e43
Showing 2 changed files with 127 additions and 31 deletions.
97 changes: 72 additions & 25 deletions docs/en/advanced_guides/accelerator_intro.md
@@ -1,50 +1,50 @@
# Accelerating Inference Evaluation with VLLM or LMDeploy
# Accelerate Evaluation Inference with vLLM or LMDeploy

## Background

In the evaluation process of OpenCompass, the default method is to use Huggingface's transformers library for inference, which is a very versatile solution. However, in some cases, we may require more efficient inference methods to speed up this process, such as leveraging VLLM or LMDeploy.
During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging vLLM or LMDeploy.

- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference and serving library, featuring advanced serving throughput, efficient memory management with PagedAttention, continuous batching of requests, fast model execution with CUDA/HIP graphs, quantization techniques (such as GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.
- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit designed for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving, featuring advanced serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (e.g., GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.

## Preparation for Acceleration

First, check if the model you want to evaluate supports inference acceleration using VLLM or LMDeploy. Then, ensure you have installed VLLM or LMDeploy. Here are the reference installation methods based on their official documentation:
First, check whether the model you want to evaluate supports inference acceleration using vLLM or LMDeploy. Additionally, ensure you have installed vLLM or LMDeploy as per their official documentation. Below are the installation methods for reference:

### LMDeploy Installation
### LMDeploy Installation Method

Install using pip (Python 3.8+) or from [source](https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md):
Install LMDeploy using pip (Python 3.8+) or from [source](https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md):

```bash
pip install lmdeploy
```

### VLLM Installation
### vLLM Installation Method

Install using pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
Install vLLM using pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```
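
After installation, you can optionally confirm that the backends are importable in the current environment. The snippet below is just a sanity check and assumes both packages are installed; drop whichever one you do not need:

```python
# Optional sanity check: confirm the acceleration backends are importable
# and print their versions.
import lmdeploy
import vllm

print('lmdeploy version:', lmdeploy.__version__)
print('vllm version:', vllm.__version__)
```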

## Using VLLM or LMDeploy for Evaluation
## Accelerated Evaluation Using vLLM or LMDeploy

OpenCompass provides a one-click evaluation acceleration feature, which automatically converts Huggingface transformer models to VLLM or LMDeploy models during the evaluation process. Below is a script that evaluates the GSM8k dataset using the default Huggingface version of the Internlm2-chat-7b model:
### Method 1: Using Command Line Parameters to Change the Inference Backend

### OpenCompass Main Repository
OpenCompass offers one-click evaluation acceleration. During evaluation, it can automatically convert Huggingface transformers models into vLLM or LMDeploy models. Below is example code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
from mmengine.config import read_base

with read_base():
    # choose a list of datasets
    # Select a dataset list
    from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets
    # choose a model of interest
    # Select a model of interest
    from ..models.hf_llama.hf_llama3_8b_instruct import models
```

Here, `hf_llama3_8b_instruct` is the original Huggingface model config, as follows:
Here, `hf_llama3_8b_instruct` specifies the original Huggingface model configuration, as shown below:

```python
from opencompass.models import HuggingFacewithChatTemplate
@@ -62,13 +62,13 @@ models = [
]
```
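
The model list itself is collapsed in the diff above. For reference, a typical `HuggingFacewithChatTemplate` entry looks roughly like the sketch below; the field values here are illustrative rather than a copy of the repository config:

```python
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='llama-3-8b-instruct-hf',               # name shown in result tables
        path='meta-llama/Meta-Llama-3-8B-Instruct',  # Huggingface model path
        max_out_len=1024,                            # maximum number of generated tokens
        batch_size=8,                                # inference batch size
        run_cfg=dict(num_gpus=1),                    # GPUs requested per task
    )
]
```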

To evaluate the Llama3-8b-instruct model on the GSM8k dataset using the default Huggingface version, run:
To evaluate the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model, use:

```bash
python run.py config/eval_gsm8k.py
```

To use VLLM or LMDeploy for accelerated evaluation, use the following scripts:
To accelerate the evaluation using vLLM or LMDeploy, you can use the following script:

```bash
python run.py config/eval_gsm8k.py -a vllm
@@ -80,14 +80,61 @@ or
python run.py config/eval_gsm8k.py -a lmdeploy
```

## Performance Comparison
### Method 2: Accelerating Evaluation via a Deployed Inference Acceleration Service API

Below is a performance comparison table for evaluating the GSM8k dataset using VLLM or LMDeploy, with single A800 GPU:
OpenCompass also supports accelerating evaluation by deploying vLLM or LMDeploy inference acceleration service APIs. Follow these steps:

| Inference Backend | Accuracy | Inference Time (min:sec) | Speedup (relative to Huggingface) |
| ----------------- | -------- | ------------------------ | --------------------------------- |
| Huggingface | 74.22 | 24:26 | 1.0 |
| LMDeploy | 73.69 | 11:15 | 2.2 |
| VLLM | 72.63 | 07:52 | 3.1 |
1. Install the openai package:

As shown in the table, using VLLM or LMDeploy for inference acceleration can significantly reduce inference time while maintaining high accuracy.
```bash
pip install openai
```

2. Deploy the inference acceleration service API for vLLM or LMDeploy. Below is an example for LMDeploy:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```

The parameters available when starting the api_server can be listed with `lmdeploy serve api_server -h`; for example, `--tp` sets tensor parallelism, `--session-len` sets the maximum context window length, and `--cache-max-entry-count` adjusts the memory usage ratio of the k/v cache.
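
Before wiring the service into OpenCompass, you can optionally verify that it responds by sending a request with the openai client installed in step 1. This is a minimal sketch assuming the server started above is listening on port 23333 and serving the model name Meta-Llama-3-8B-Instruct:

```python
from openai import OpenAI

# Connect to the locally deployed, OpenAI-compatible LMDeploy api_server.
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')  # the key is not checked locally

response = client.chat.completions.create(
    model='Meta-Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Briefly introduce yourself.'}],
    temperature=0.01,
    max_tokens=64,
)
print(response.choices[0].message.content)
```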

3. Once the service is successfully deployed, modify the evaluation script so that the model configuration points to the deployed service address, as shown below:

```python
from opencompass.models import OpenAI

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='Meta-Llama-3-8B-Instruct-LMDeploy-API',
        type=OpenAI,
        openai_api_base='http://0.0.0.0:23333/v1',  # Service address
        path='Meta-Llama-3-8B-Instruct',  # Model name for service request
        rpm_verbose=True,  # Whether to print request rate
        meta_template=api_meta_template,  # Service request template
        query_per_second=1,  # Service request rate
        max_out_len=1024,  # Maximum output length
        max_seq_len=4096,  # Maximum input length
        temperature=0.01,  # Generation temperature
        batch_size=8,  # Batch size
        retry=3,  # Number of retries
    )
]
```
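
The API-backed `models` list above is used in an evaluation config in the same way as the Huggingface one in Method 1. The sketch below shows one way to wire it up; the file name `eval_gsm8k_api.py` and the reuse of the GSM8k dataset config are assumptions, and the evaluation is then launched with `python run.py config/eval_gsm8k_api.py` without the `-a` flag:

```python
# eval_gsm8k_api.py -- illustrative config: evaluate GSM8k through the deployed API.
from mmengine.config import read_base
from opencompass.models import OpenAI

with read_base():
    # Reuse the same GSM8k dataset config as in Method 1.
    from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets

# Define `api_meta_template` and `models` exactly as in the snippet above,
# with `openai_api_base` pointing at the deployed service address.
```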

## Acceleration Effect and Performance Comparison

Below is a comparison of the acceleration effect and performance when evaluating the Llama-3-8B-Instruct model on the GSM8k dataset with vLLM or LMDeploy on a single A800 GPU:

| Inference Backend | Accuracy | Inference Time (minutes:seconds) | Speedup (relative to Huggingface) |
| ----------------- | -------- | -------------------------------- | --------------------------------- |
| Huggingface | 74.22 | 24:26 | 1.0 |
| LMDeploy | 73.69 | 11:15 | 2.2 |
| VLLM | 72.63 | 07:52 | 3.1 |
61 changes: 55 additions & 6 deletions docs/zh_cn/advanced_guides/accelerator_intro.md
@@ -1,4 +1,4 @@
# One-Click Acceleration of Evaluation Inference with VLLM or LMDeploy
# One-Click Acceleration of Evaluation Inference with vLLM or LMDeploy

## Background

@@ -9,7 +9,7 @@

## Preparation Before Acceleration

First, check whether the model you want to evaluate supports inference acceleration with VLLM or LMDeploy. Second, make sure you have installed VLLM or LMDeploy; for specific installation steps, refer to their official documentation. The installation methods below are provided for reference:
First, check whether the model you want to evaluate supports inference acceleration with vLLM or LMDeploy. Second, make sure you have installed vLLM or LMDeploy; for specific installation steps, refer to their official documentation. The installation methods below are provided for reference:

### LMDeploy Installation Method

@@ -29,9 +29,9 @@ pip install vllm

## Using VLLM or LMDeploy During Evaluation

OpenCompass provides one-click evaluation acceleration: during evaluation, Huggingface transformers models can be automatically converted into VLLM or LMDeploy models for use. Below is a script for evaluating the GSM8k dataset with the default Huggingface version of the Internlm2-chat-7b model:
### Method 1: Change the Inference Backend via Command-Line Arguments

### OpenCompass Main Repository
OpenCompass provides one-click evaluation acceleration: during evaluation, Huggingface transformers models can be automatically converted into VLLM or LMDeploy models for use. Below is sample code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
@@ -68,7 +68,7 @@ models = [
python run.py config/eval_gsm8k.py
```

If you need to accelerate the evaluation with VLLM or LMDeploy, you can use the following script:
If you need to accelerate the evaluation with vLLM or LMDeploy, you can use the following script:

```bash
python run.py config/eval_gsm8k.py -a vllm
@@ -80,9 +80,58 @@ python run.py config/eval_gsm8k.py -a vllm
python run.py config/eval_gsm8k.py -a lmdeploy
```

### Method 2: Accelerate Evaluation via a Deployed Inference Acceleration Service API

OpenCompass also supports accelerating evaluation by deploying a vLLM or LMDeploy inference acceleration service API. The reference steps are as follows:

1. Install the openai package:

```bash
pip install openai
```

2. Deploy the vLLM or LMDeploy inference acceleration service API; for specific deployment steps, refer to their official documentation. Below is an example using LMDeploy:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```

The parameters available when starting the api_server can be listed with `lmdeploy serve api_server -h`; for example, `--tp` sets tensor parallelism, `--session-len` sets the maximum context window length for inference, and `--cache-max-entry-count` adjusts the memory usage ratio of the k/v cache.

3. Once the service is deployed successfully, modify the evaluation script so that the model configuration points to the deployed service address, as shown below:

```python
from opencompass.models import OpenAI

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='Meta-Llama-3-8B-Instruct-LMDeploy-API',
        type=OpenAI,
        openai_api_base='http://0.0.0.0:23333/v1',  # Service address
        path='Meta-Llama-3-8B-Instruct',  # Model name used when requesting the service
        rpm_verbose=True,  # Whether to print the request rate
        meta_template=api_meta_template,  # Service request template
        query_per_second=1,  # Service request rate
        max_out_len=1024,  # Maximum output length
        max_seq_len=4096,  # Maximum input length
        temperature=0.01,  # Generation temperature
        batch_size=8,  # Batch size
        retry=3,  # Number of retries
    )
]
```

## Acceleration Effect and Performance Comparison

Below is a comparison of the acceleration effect and performance when running accelerated evaluation on the GSM8k dataset with VLLM or LMDeploy on a single A800 GPU:
Below is a comparison of the acceleration effect and performance when running accelerated evaluation of the Llama-3-8B-Instruct model on the GSM8k dataset with VLLM or LMDeploy on a single A800 GPU:

| Inference Backend | Accuracy | Inference Time (minutes:seconds) | Speedup (relative to Huggingface) |
| ----------- | ---------------- | -------------------- | ---------------------------- |
| Huggingface | 74.22 | 24:26 | 1.0 |
| LMDeploy | 73.69 | 11:15 | 2.2 |
| VLLM | 72.63 | 07:52 | 3.1 |
