Merge branch 'main' of github.com:open-compass/opencompass into github
liuhongwei committed May 27, 2024
2 parents 64c5ba2 + 9fa80b0 commit 8fe4ceb
Showing 84 changed files with 2,964 additions and 19 deletions.
110 changes: 110 additions & 0 deletions configs/datasets/CHARM/README.md
# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024]
[![arXiv](https://img.shields.io/badge/arXiv-2403.14112-b31b1b.svg)](https://arxiv.org/abs/2403.14112)
[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)
<div align="center">

📃[Paper](https://arxiv.org/abs/2403.14112)
🏰[Project Page](https://opendatalab.github.io/CHARM/)
🏆[Leaderboard](https://opendatalab.github.io/CHARM/leaderboard.html)
[Findings](https://opendatalab.github.io/CHARM/findings.html)

</div>

<div align="center">
📖 <a href="./README_ZH.md"> 中文</a> | <a href="./README.md">English</a>
</div>

## Dataset Description

**CHARM** is the first benchmark to comprehensively and deeply evaluate the Chinese commonsense reasoning ability of large language models (LLMs), covering both globally known and Chinese-specific commonsense. In addition, CHARM can evaluate LLMs' memorization-independent reasoning abilities and analyze their typical errors.

## Comparison of commonsense reasoning benchmarks
<html lang="en">
<table align="center">
<thead class="fixed-header">
<tr>
<th>Benchmarks</th>
<th>CN-Lang</th>
<th>CSR</th>
<th>CN-specifics</th>
<th>Dual-Domain</th>
<th>Rea-Mem</th>
</tr>
</thead>
<tr>
<td>Most benchmarks in <a href="https://arxiv.org/abs/2302.04752"> davis2023benchmarks</a></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
          <td><a href="https://arxiv.org/abs/1809.05053">XNLI</a>, <a
                  href="https://arxiv.org/abs/2005.00333">XCOPA</a>, <a
                  href="https://arxiv.org/abs/2112.10668">XStoryCloze</a></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2007.08124">LogiQA</a>, <a
href="https://arxiv.org/abs/2004.05986">CLUE</a>, <a
href="https://arxiv.org/abs/2306.09212">CMMLU</a></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2312.12853">CORECODE</a> </td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><strong><a href="https://arxiv.org/abs/2403.14112">CHARM (ours)</a> </strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
</tr>
</table>

"CN-Lang" indicates the benchmark is presented in the Chinese language. "CSR" means the benchmark is designed to focus on <strong>C</strong>ommon<strong>S</strong>ense <strong>R</strong>easoning. "CN-specifics" indicates the benchmark includes elements unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in a similar style and format. "Rea-Mem" indicates the benchmark includes closely interconnected <strong>rea</strong>soning and <strong>mem</strong>orization tasks.


## 🛠️ How to Use
Below are the steps for quickly downloading CHARM and using OpenCompass for evaluation.

### 1. Download CHARM
```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
```
### 2. Run Inference and Evaluation
```bash
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM

# Run inference and evaluation on CHARM with the hf_llama3_8b_instruct model
python run.py --models hf_llama3_8b_instruct --datasets charm_gen
```

## 🖊️ Citation
```bibtex
@misc{sun2024benchmarking,
title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
year={2024},
eprint={2403.14112},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
108 changes: 108 additions & 0 deletions configs/datasets/CHARM/README_ZH.md
# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024]
[![arXiv](https://img.shields.io/badge/arXiv-2403.14112-b31b1b.svg)](https://arxiv.org/abs/2403.14112)
[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)
<div align="center">

📃[Paper](https://arxiv.org/abs/2403.14112)
🏰[Project Page](https://opendatalab.github.io/CHARM/)
🏆[Leaderboard](https://opendatalab.github.io/CHARM/leaderboard.html)
[Findings](https://opendatalab.github.io/CHARM/findings.html)
</div>

<div align="center">
📖 <a href="./README_ZH.md"> 中文</a> | <a href="./README.md">English</a>
</div>

## 数据集介绍

**CHARM** 是首个全面深入评估大型语言模型(LLMs)中文常识推理能力的基准测试,它覆盖了国际普遍认知的常识以及中国特有的常识。此外,CHARM 还可以评估 LLMs 独立于记忆的推理能力,并分析其典型错误。


## 与其他常识推理评测基准的比较
<html lang="en">
<table align="center">
<thead class="fixed-header">
<tr>
<th>基准</th>
<th>汉语</th>
<th>常识推理</th>
<th>中国特有知识</th>
<th>中国和世界知识域</th>
<th>推理和记忆的关系</th>
</tr>
</thead>
<tr>
<td><a href="https://arxiv.org/abs/2302.04752"> davis2023benchmarks</a> 中提到的基准</td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
          <td><a href="https://arxiv.org/abs/1809.05053">XNLI</a>, <a
                  href="https://arxiv.org/abs/2005.00333">XCOPA</a>, <a
                  href="https://arxiv.org/abs/2112.10668">XStoryCloze</a></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
          <td><a href="https://arxiv.org/abs/2007.08124">LogiQA</a>, <a
                  href="https://arxiv.org/abs/2004.05986">CLUE</a>, <a
                  href="https://arxiv.org/abs/2306.09212">CMMLU</a></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2312.12853">CORECODE</a> </td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><strong><a href="https://arxiv.org/abs/2403.14112">CHARM (ours)</a> </strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
</tr>
</table>


## 🛠️ 如何使用
以下是快速下载 CHARM 并在 OpenCompass 上进行评估的步骤。

### 1. 下载 CHARM
```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
```
### 2. 推理和评测
```bash
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM

# 在CHARM上对模型hf_llama3_8b_instruct做推理和评测
python run.py --models hf_llama3_8b_instruct --datasets charm_gen
```

## 🖊️ 引用
```bibtex
@misc{sun2024benchmarking,
title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
year={2024},
eprint={2403.14112},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
4 changes: 4 additions & 0 deletions configs/datasets/CHARM/charm_rea_gen.py
from mmengine.config import read_base

with read_base():
from .charm_rea_gen_f8fca2 import charm_rea_datasets # noqa: F401, F403
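This four-line config is a thin re-export: with mmengine's `read_base()`, importing from the hashed base config pulls `charm_rea_datasets` into this file's namespace, so `--datasets charm_rea_gen` resolves to the full list defined in `charm_rea_gen_f8fca2.py`. A rough plain-Python analogy of the pattern (illustrative only; `read_base` actually parses config files rather than performing a normal import):

```python
# Illustrative stand-in for the hashed base config module: in reality this
# dict's contents come from charm_rea_gen_f8fca2.py.
base_module = {'charm_rea_datasets': [{'abbr': 'charm-rea-demo_Direct'}]}

# The re-export: after read_base(), the stable config name exposes the same
# dataset list that the hashed file defines.
charm_rea_datasets = base_module['charm_rea_datasets']

assert charm_rea_datasets[0]['abbr'].startswith('charm-rea-')
```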
78 changes: 78 additions & 0 deletions configs/datasets/CHARM/charm_rea_gen_f8fca2.py
import os
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CharmDataset, charm_rea_postprocess, CharmReaEvaluator

charm_tasks = [
'Chinese_Anachronisms_Judgment',
'Chinese_Movie_and_Music_Recommendation',
'Chinese_Natural_Language_Inference',
'Chinese_Reading_Comprehension',
'Chinese_Sequence_Understanding',
'Chinese_Sport_Understanding',
'Chinese_Time_Understanding',
'Global_Anachronisms_Judgment',
'Global_Movie_and_Music_Recommendation',
'Global_Natural_Language_Inference',
'Global_Reading_Comprehension',
'Global_Sequence_Understanding',
'Global_Sport_Understanding',
'Global_Time_Understanding',
]

data_dir = 'data/CHARM'
dataset_path_ZH = f'{data_dir}/reasoning'
dataset_path_TransEn = f'{data_dir}/reasoning_Translate-EN'
fewshot_example_path_ZH = os.path.join(os.path.dirname(__file__), 'few-shot-examples')
fewshot_example_path_TransEn = os.path.join(os.path.dirname(__file__), 'few-shot-examples_Translate-EN')

XLT_template = 'Follow the given examples and answer the question.\n{_hint}\n\n I want you to act as an commonsense reasoning expert for Chinese. \n Request: {{input}}\n'
Translate_EN_template = 'Follow the given examples and answer the question.\n{_hint}\n\nQ: {{input}}\nA: '
Other_template = '请按照给定的例子回答问题。\n{_hint}\n\nQ:{{input}}\nA:'

settings = [
('Direct', '', dataset_path_ZH, fewshot_example_path_ZH, Other_template),
('ZH-CoT', '让我们一步一步来思考。', dataset_path_ZH, fewshot_example_path_ZH, Other_template),
('EN-CoT', "Let's think step by step.", dataset_path_ZH, fewshot_example_path_ZH, Other_template),
('XLT', """You should retell the request in English.\nYou should do the answer step by step to choose the right answer.\nYou should step-by-step answer the request.\nYou should tell me the answer in this format 'So the answer is'.""", dataset_path_ZH, fewshot_example_path_ZH, XLT_template),
('Translate-EN', "Let's think step by step.", dataset_path_TransEn, fewshot_example_path_TransEn, Translate_EN_template),
]

charm_rea_datasets = []

for _cot, _cot_prefix, dataset_path, fewshot_example_path, prompt_template in settings:
for _task in charm_tasks:
_fewshot_example_file = os.path.join(fewshot_example_path, f'{_task}_{_cot}.txt')
with open(_fewshot_example_file, 'r') as f:
_hint = f.read()

charm_rea_reader_cfg = dict(input_columns=['input'], output_column='target')

charm_rea_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[dict(role='HUMAN', prompt=prompt_template.format(_hint=_hint) + _cot_prefix)]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512),
)

charm_rea_eval_cfg = dict(
evaluator=dict(type=CharmReaEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=charm_rea_postprocess),
dataset_postprocessor=dict(type=charm_rea_postprocess),
)

charm_rea_datasets.append(
dict(
type=CharmDataset,
path=dataset_path,
name=_task,
abbr='charm-rea-' + _task + '_' + _cot,
reader_cfg=charm_rea_reader_cfg,
infer_cfg=charm_rea_infer_cfg.copy(),
eval_cfg=charm_rea_eval_cfg.copy(),
)
)
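The loop above instantiates 5 prompt settings × 14 tasks = 70 dataset configs. One detail worth noting is the brace escaping in the templates: a minimal sketch (reusing names from the config above) of how a template, a few-shot hint, and a CoT prefix combine into the final prompt:

```python
# Sketch of the prompt assembly in the config above. str.format fills {_hint}
# and collapses the doubled {{input}} to a literal {input}, which OpenCompass's
# PromptTemplate fills with each question later.
Other_template = '请按照给定的例子回答问题。\n{_hint}\n\nQ:{{input}}\nA:'
hint = '识别给定陈述是否包含时代错误。'  # first line of a few-shot example file
cot_prefix = '让我们一步一步来思考。'    # the ZH-CoT setting's prefix

prompt = Other_template.format(_hint=hint) + cot_prefix
assert '{input}' in prompt          # single braces remain for the inferencer
assert prompt.endswith(cot_prefix)  # CoT trigger is appended last
```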
@@ -0,0 +1,22 @@
识别给定陈述是否包含时代错误。

Q:以下陈述是否包含时代错误,请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择?
李白用钢笔写诗。
选项:
(A) 是
(B) 否
A:(A)

Q:以下陈述是否包含时代错误,请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择?
唐朝时期,人们可以在街头买到《红楼梦》。
选项:
(A) 是
(B) 否
A:(A)

Q:以下陈述是否包含时代错误,请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择?
1998年10月1日,人们手举五星红旗,在天安门广场上庆祝国庆
选项:
(A) 是
(B) 否
A:(B)
@@ -0,0 +1,25 @@
识别给定陈述是否包含时代错误。

Q:以下陈述是否包含时代错误,请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择?
李白用钢笔写诗。
选项:
(A) 是
(B) 否
A:Let's think step by step.
This statement mentions "Li Bai", a poet from the Tang Dynasty in China. The "pen" mentioned in the statement is a modern device, so it is impossible for Li Bai to write poetry with a pen. This statement contains errors from the times. So the answer is (A).

Q:以下陈述是否包含时代错误,请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择?
唐朝时期,人们可以在街头买到《红楼梦》。
选项:
(A) 是
(B) 否
A:Let's think step by step.
This statement mentions "Dream of the Red Chamber", which was written by Qing Dynasty writer Cao Xueqin. There was no "Dream of the Red Chamber" during the Tang Dynasty, so this statement contains historical errors. So the answer is (A).

Q:以下陈述是否包含时代错误,请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择?
1998年10月1日,人们手举五星红旗,在天安门广场上庆祝国庆
选项:
(A) 是
(B) 否
A:Let's think step by step.
This statement mentions that in 1998, New China was established in 1949, and the five-star red flag was designated as the national flag of China. Therefore, October 1, 1998 is National Day, and it is reasonable for people to celebrate National Day at Tiananmen Square, excluding historical errors. So the answer is (B).
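Each CoT exemplar above ends with "So the answer is (X)", which is presumably what the `pred_postprocessor` (`charm_rea_postprocess`) keys on when scoring. A hypothetical minimal sketch of such answer extraction (the real implementation in `opencompass.datasets` may differ):

```python
import re

def extract_choice(pred: str) -> str:
    # Hypothetical sketch: take the LAST parenthesised option letter from a
    # CoT answer such as "... So the answer is (B).". The actual
    # charm_rea_postprocess in opencompass.datasets may differ.
    matches = re.findall(r'\(([A-E])\)', pred)
    return matches[-1] if matches else ''

print(extract_choice('This statement contains an anachronism. So the answer is (A).'))  # prints: A
```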
