[Docs] update ds1000 code eval docs (open-compass#618)
yingfhu authored Nov 22, 2023
1 parent d6b4d9b commit f819ac9
Showing 4 changed files with 92 additions and 20 deletions.
57 changes: 46 additions & 11 deletions docs/en/advanced_guides/code_eval_service.md
@@ -1,33 +1,44 @@
# Multilingual Code Evaluation Tutorial
# Code Evaluation Docker Tutorial

To complete LLM code capability evaluation, we need to set up an independent evaluation environment to avoid executing erroneous codes on development environments which would cause unavoidable losses. The current Code Evaluation Service used in OpenCompass refers to the project [code-evaluator](https://github.com/open-compass/code-evaluator.git), which has already supported evaluating datasets for multiple programming languages [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x). The following tutorials will introduce how to conduct code review services under different requirements.
To complete the LLM code capability evaluation, we need to set up a separate evaluation environment to avoid executing erroneous code in the development environment, which would inevitably cause losses. For the code evaluation service currently used by OpenCompass, refer to the [code-evaluator](https://github.com/open-compass/code-evaluator) project. The following introduces evaluation tutorials for different needs built around the code evaluation service.

Dataset [download address](https://github.com/THUDM/CodeGeeX2/tree/main/benchmark/humanevalx). Please download the needed files (xx.jsonl.gz) into `./data/humanevalx` folder.
1. humaneval-x

Supported languages are `python`, `cpp`, `go`, `java`, `js`.
This is a multi-programming language dataset [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x).
You can download the dataset from this [download link](https://github.com/THUDM/CodeGeeX2/tree/main/benchmark/humanevalx). Please download the file (xx.jsonl.gz) for each language you want to evaluate and place it in the `./data/humanevalx` folder.

The currently supported languages are `python`, `cpp`, `go`, `java`, `js`.

2. DS1000

This is [ds1000](https://github.com/xlang-ai/DS-1000), a Python dataset covering multiple algorithm libraries.
You can download the dataset from this [download link](https://github.com/xlang-ai/DS-1000/blob/main/ds1000_data.zip).

The currently supported algorithm libraries are `Pandas`, `Numpy`, `Tensorflow`, `Scipy`, `Sklearn`, `Pytorch`, `Matplotlib`.

## Launching the Code Evaluation Service

1. Ensure you have installed Docker; see the [Docker installation documentation](https://docs.docker.com/engine/install/).
2. Pull the source code of the code evaluation service project and build the Docker image.

Choose the Dockerfile that corresponds to the dataset you need, and substitute `humanevalx` or `ds1000` for `{your-dataset}` in the commands below.

```shell
git clone https://github.com/open-compass/code-evaluator.git
cd code-evaluator
sudo docker build -t code-eval:latest .
sudo docker build -t code-eval-{your-dataset}:latest -f docker/{your-dataset}/Dockerfile .
```
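
For instance, building the DS1000 image simply expands the template above (the image tag is your choice; use `humanevalx` instead to build that image):

```shell
# Build the DS1000 evaluation image from the code-evaluator repository root
sudo docker build -t code-eval-ds1000:latest -f docker/ds1000/Dockerfile .
```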

3. Create a container with the following commands:

```shell
# Print logs to the console
sudo docker run -it -p 5000:5000 code-eval:latest python server.py
sudo docker run -it -p 5000:5000 code-eval-{your-dataset}:latest python server.py

# Run the program in the background
# sudo docker run -itd -p 5000:5000 code-eval:latest python server.py
# sudo docker run -itd -p 5000:5000 code-eval-{your-dataset}:latest python server.py

# Using different ports
# sudo docker run -itd -p 5001:5001 code-eval:latest python server.py --port 5001
# sudo docker run -itd -p 5001:5001 code-eval-{your-dataset}:latest python server.py --port 5001
```

4. To ensure you can access the service, use the following command to check the connection between the inference environment and the evaluation service. (If inference and code evaluation run on the same host, skip this step.)
@@ -39,7 +50,7 @@ telnet your_service_ip_address your_service_port

## Local Code Evaluation

When the model inference and code evaluation services are running on the same host or within the same local area network, direct code reasoning and evaluation can be performed.
When the model inference and code evaluation services run on the same host or within the same local area network, inference and code evaluation can be performed directly. **Note: DS1000 is currently not supported locally; please use remote evaluation instead.**

### Configuration File

@@ -95,7 +106,7 @@ Refer to the [Quick Start](../get_started.html)

When the model inference and code evaluation services are on different machines that cannot reach each other directly, run model inference first and then collect the inference results for code evaluation. The configuration file and inference process from the previous tutorial can be reused.

### Collect Inference Results
### Collect Inference Results (Only for Humanevalx)

OpenCompass provides a script, `collect_code_preds.py`, in its `tools` folder to post-process and collect the inference results. Pass it the configuration file used to launch the task and specify the working directory of that task to reuse; this works the same as the `-r` option of `run.py`. More details can be found in the [documentation](https://opencompass.readthedocs.io/en/latest/get_started.html#launch-evaluation). A sketch of a typical invocation is shown below.
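
For example, assuming the task was launched with a config file called `configs/eval_humanevalx.py` (a hypothetical name) and its working directory carries the timestamp `20231122_102500`, the invocation might look like this:

```shell
# Post-process and collect the predictions of an earlier run;
# the config path and timestamp below are placeholders for your own run
python tools/collect_code_preds.py configs/eval_humanevalx.py -r 20231122_102500
```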
@@ -123,10 +134,14 @@ workdir/humanevalx
├── ...
```

For DS1000, you just need to obtain the corresponding prediction files generated by OpenCompass; see the illustrative path below.
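
OpenCompass typically writes these predictions under a timestamped output directory; the path below is only illustrative (the timestamp and model folder come from your own run):

```shell
# Illustrative location of the DS1000 prediction files produced by OpenCompass
ls outputs/default/20231122_102500/predictions/internlm-chat-7b-hf-v11/
# ds1000_Numpy.json  ds1000_Pandas.json  ...
```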

### Code Evaluation

Make sure your code evaluation service is running, then use `curl` to submit a request:

#### The following applies only to Humanevalx

```shell
curl -X POST -F 'file=@{result_absolute_path}' -F 'dataset={dataset/language}' {your_service_ip_address}:{your_service_port}/evaluate
```
@@ -149,6 +164,26 @@ Additionally, we offer an extra option named `with_prompt` (defaults to `True`),
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' -H 'with-prompt: False' localhost:5000/evaluate
```

#### The following applies only to DS1000

Make sure the code evaluation service is started, then use `curl` to submit a request:

```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' localhost:5000/evaluate
```

DS1000 supports an additional `debug` parameter. Be aware that a large amount of log output will be generated when it is enabled:

- `full`: additionally prints, for each failing sample, the original prediction, the post-processed prediction, the program that was run, and the final error.
- `half`: additionally prints, for each failing sample, the program that was run and the final error.
- `error`: additionally prints the final error for each failing sample.

```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' -F 'debug=error' localhost:5000/evaluate
```

You can also set `num_workers` in the same way to control the degree of parallelism, as sketched below.
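
For example, passing `num_workers` as an extra form field, just like `debug`, might look like this (the worker count and file path are illustrative):

```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' -F 'num_workers=8' localhost:5000/evaluate
```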

## Advanced Tutorial

Besides evaluating the supported `humanevalx` dataset, users might also need:
1 change: 1 addition & 0 deletions docs/en/index.rst
@@ -61,6 +61,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
advanced_guides/new_model.md
advanced_guides/evaluation_turbomind.md
advanced_guides/evaluation_lightllm.md
advanced_guides/code_eval.md
advanced_guides/code_eval_service.md
advanced_guides/multimodal_eval.md
advanced_guides/prompt_attack.md
53 changes: 44 additions & 9 deletions docs/zh_cn/advanced_guides/code_eval_service.md
@@ -1,33 +1,44 @@
# Multilingual Code Evaluation Tutorial
# Code Evaluation Docker Tutorial

To complete the LLM code capability evaluation, we need to set up a separate evaluation environment to avoid executing erroneous code in the development environment, which would inevitably cause losses. The code evaluation service currently used by OpenCompass can be found in the [code-evaluator](https://github.com/open-compass/code-evaluator) project, which already supports evaluating the multi-programming-language dataset [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x). The following introduces evaluation tutorials for different needs built around the code evaluation service.
To complete the LLM code capability evaluation, we need to set up a separate evaluation environment to avoid executing erroneous code in the development environment, which would inevitably cause losses. For the code evaluation service currently used by OpenCompass, refer to the [code-evaluator](https://github.com/open-compass/code-evaluator) project. The following introduces evaluation tutorials for different needs built around the code evaluation service.

1. humaneval-x

A multi-programming-language dataset: [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x).
Dataset [download link](https://github.com/THUDM/CodeGeeX2/tree/main/benchmark/humanevalx): please download the file (xx.jsonl.gz) for each language you want to evaluate and place it in the `./data/humanevalx` folder.

The currently supported languages are `python`, `cpp`, `go`, `java`, `js`.

2. DS1000

A Python dataset covering multiple algorithm libraries: [ds1000](https://github.com/xlang-ai/DS-1000).
Dataset [download link](https://github.com/xlang-ai/DS-1000/blob/main/ds1000_data.zip).

The currently supported libraries are `Pandas`, `Numpy`, `Tensorflow`, `Scipy`, `Sklearn`, `Pytorch`, `Matplotlib`.

## Launching the Code Evaluation Service

1. Make sure you have installed Docker; see the [Docker installation documentation](https://docs.docker.com/engine/install/).
2. Pull the code evaluation service project and build the Docker image.

Choose the Dockerfile that corresponds to the dataset you need, and substitute `humanevalx` or `ds1000` for `{your-dataset}` in the commands below.

```shell
git clone https://github.com/open-compass/code-evaluator.git
cd code-evaluator
sudo docker build -t code-eval:latest .
sudo docker build -t code-eval-{your-dataset}:latest -f docker/{your-dataset}/Dockerfile .
```

3. Create a container with the following commands:

```shell
# Print logs to the console
sudo docker run -it -p 5000:5000 code-eval:latest python server.py
sudo docker run -it -p 5000:5000 code-eval-{your-dataset}:latest python server.py

# Run the program in the background
# sudo docker run -itd -p 5000:5000 code-eval:latest python server.py
# sudo docker run -itd -p 5000:5000 code-eval-{your-dataset}:latest python server.py

# Use a different port
# sudo docker run -itd -p 5001:5001 code-eval:latest python server.py --port 5001
# sudo docker run -itd -p 5001:5001 code-eval-{your-dataset}:latest python server.py --port 5001
```

4. To ensure you can access the service, use the following command to check the connection between the inference environment and the evaluation service. (If inference and code evaluation run on the same host, skip this step.)
@@ -39,7 +50,7 @@ telnet your_service_ip_address your_service_port

## Local Code Evaluation

When the model inference and code evaluation services run on the same host or within the same local area network, inference and code evaluation can be performed directly.
When the model inference and code evaluation services run on the same host or within the same local area network, inference and code evaluation can be performed directly. **Note: DS1000 is currently not supported locally; please use remote evaluation instead.**

### Configuration File

@@ -94,7 +105,7 @@ humanevalx_datasets = [

When the model inference and code evaluation services are on different machines that cannot reach each other, model inference needs to be run first and the inference results collected. The configuration file and inference process can reuse the tutorial above.

### Collect Inference Results
### Collect Inference Results (Only for Humanevalx)

OpenCompass provides a script, `collect_code_preds.py`, in `tools` to post-process and collect the inference results. Pass it the configuration file used to launch the task and specify the working directory of that task to reuse; this is the same as the `-r` option of `run.py`. See the [documentation](https://opencompass.readthedocs.io/zh_CN/latest/get_started.html#id7) for details.

@@ -121,8 +132,12 @@ workdir/humanevalx
├── ...
```

For DS1000, you only need to obtain the corresponding prediction files generated by `opencompass`.

### Code Evaluation

#### The following applies only to Humanevalx

With the code evaluation service running, use `curl` to submit a request:

```shell
@@ -147,6 +162,26 @@ curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' -H 'with-prompt: False' localhost:5000/evaluate
```

#### The following applies only to DS1000

With the code evaluation service running, use `curl` to submit a request:

```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' localhost:5000/evaluate
```

DS1000 supports an additional `debug` parameter. Note that a large amount of log output will be generated when it is enabled:

- `full`: additionally prints, for each failing sample, the original prediction, the post-processed prediction, the program that was run, and the final error.
- `half`: additionally prints, for each failing sample, the program that was run and the final error.
- `error`: additionally prints the final error for each failing sample.

```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' -F 'debug=error' localhost:5000/evaluate
```

You can also set `num_workers` in the same way to control the degree of parallelism.

## Advanced Tutorial

Besides evaluating the supported `humanevalx` dataset, users might also have the following needs:
1 change: 1 addition & 0 deletions docs/zh_cn/index.rst
@@ -61,6 +61,7 @@ OpenCompass Getting Started Roadmap
advanced_guides/new_model.md
advanced_guides/evaluation_turbomind.md
advanced_guides/evaluation_lightllm.md
advanced_guides/code_eval.md
advanced_guides/code_eval_service.md
advanced_guides/multimodal_eval.md
advanced_guides/prompt_attack.md
