# Metric Calculation

In the evaluation phase, we typically select the corresponding evaluation strategy based on the characteristics of the dataset itself. The main criterion is the **type of standard answer**, which generally falls into one of the following categories:

- **Choice**: Common in classification tasks, true/false questions, and multiple-choice questions. Datasets of this type currently make up the largest share, e.g. MMLU and CEval. Accuracy is usually the evaluation standard: `ACCEvaluator`.
- **Phrase**: Common in Q&A and reading comprehension tasks. Datasets of this type include CLUE_CMRC, CLUE_DRCD, DROP, etc. Exact match rate is usually the evaluation standard: `EMEvaluator`.
- **Sentence**: Common in translation and in pseudocode/command-line generation tasks, with datasets such as Flores, Summscreen, Govrepcrs, and Iwslt2017. BLEU (Bilingual Evaluation Understudy) is usually the evaluation standard: `BleuEvaluator`.
- **Paragraph**: Common in text summarization tasks, with datasets such as Lcsts, TruthfulQA, and Xsum. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is usually the evaluation standard: `RougeEvaluator`.
- **Code**: Common in code generation tasks, with datasets such as Humaneval and MBPP. Execution pass rate and `pass@k` are usually the evaluation standards. OpenCompass currently supports `MBPPEvaluator` and `HumanEvaluator`.
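
Whatever the metric, these evaluators share a common pattern: take the list of model predictions and the list of reference answers, and return a dictionary of scores. As a minimal sketch, assuming the common `BaseEvaluator.score(predictions, references)` interface (prefer the built-in evaluators in real configurations), a custom accuracy evaluator might look like this:

```python
from opencompass.openicl.icl_evaluator import BaseEvaluator


class MyAccEvaluator(BaseEvaluator):
    """A minimal accuracy evaluator, for illustration only.

    Assumes the common ``score(predictions, references)`` interface;
    prefer the built-in ``ACCEvaluator`` in practice.
    """

    def score(self, predictions: list, references: list) -> dict:
        if len(predictions) != len(references):
            return {'error': 'predictions and references differ in length'}
        # Count exact matches between predictions and references.
        correct = sum(str(p) == str(r) for p, r in zip(predictions, references))
        return {'accuracy': 100 * correct / len(references)}
```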

There is also a class of **scoring-type** evaluation tasks without standard answers, such as judging whether a model's output is toxic; these can be scored directly by a related API service. Currently `ToxicEvaluator` is supported, and the realtoxicityprompts dataset uses this evaluation method.

## Supported Evaluation Metrics

Currently, the commonly used Evaluators in OpenCompass are located in the [`opencompass/openicl/icl_evaluator`](https://github.com/InternLM/opencompass/tree/main/opencompass/openicl/icl_evaluator) folder; some dataset-specific metrics live in parts of [`opencompass/datasets`](https://github.com/InternLM/opencompass/tree/main/opencompass/datasets). Below is a summary:

| Evaluator          | Evaluation Metric    | Common Postprocessing Method | Datasets                                                             |
| ------------------ | -------------------- | ---------------------------- | -------------------------------------------------------------------- |
| `ACCEvaluator`     | Accuracy             | `first_capital_postprocess`  | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
| `EMEvaluator`      | Match Rate           | None, or dataset-specific    | drop, CLUE_CMRC, CLUE_DRCD                                           |
| `BleuEvaluator`    | BLEU                 | None, or `flores`            | flores, iwslt2017, summscreen, govrepcrs                             |
| `RougeEvaluator`   | ROUGE                | None, or dataset-specific    | lcsts, truthfulqa, Xsum, XLSum                                       |
| `HumanEvaluator`   | pass@k               | `humaneval_postprocess`      | humaneval                                                            |
| `MBPPEvaluator`    | Execution Pass Rate  | None                         | mbpp                                                                 |
| `ToxicEvaluator`   | PerspectiveAPI       | None                         | realtoxicityprompts                                                  |
| `AGIEvalEvaluator` | Accuracy             | None                         | agieval                                                              |
| `AUCROCEvaluator`  | AUC-ROC              | None                         | jigsawmultilingual, civilcomments                                    |
| `MATHEvaluator`    | Accuracy             | `math_postprocess`           | math                                                                 |
| `MccEvaluator`     | Matthews Correlation | None                         | --                                                                   |
| `SquadEvaluator`   | F1 score             | None                         | --                                                                   |
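
The postprocessors in the table normalize raw model output before scoring. For example, `first_capital_postprocess` reduces a free-form multiple-choice answer to a single option letter. Below is a rough sketch of that behavior, assuming it simply scans for the first uppercase character; see `opencompass/utils/text_postprocessors.py` for the actual implementation:

```python
def first_capital_postprocess(text: str) -> str:
    """Return the first uppercase letter of the model output.

    A sketch of the expected behavior only; see
    opencompass/utils/text_postprocessors.py for the real implementation.
    """
    for char in text:
        if char.isupper():
            return char
    return ''


# The option letter 'A' is extracted from a verbose answer:
assert first_capital_postprocess('the answer is A, because ...') == 'A'
```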

## How to Configure

The evaluation standard configuration is generally placed in the dataset configuration file; the final `xxdataset_eval_cfg` is passed to the dataset as its `eval_cfg` instantiation parameter.

Below is the definition of `govrepcrs_eval_cfg`; for the full configuration, see [configs/datasets/govrepcrs](https://github.com/InternLM/opencompass/tree/main/configs/datasets/govrepcrs).

```python
from opencompass.openicl.icl_evaluator import BleuEvaluator
from opencompass.datasets import GovRepcrsDataset
from opencompass.utils.text_postprocessors import general_cn_postprocess

govrepcrs_reader_cfg = dict(...)  # reader configuration, omitted here
govrepcrs_infer_cfg = dict(...)   # inference configuration, omitted here

# Configuration of the evaluation metric
govrepcrs_eval_cfg = dict(
    evaluator=dict(type=BleuEvaluator),  # use the common translation evaluator BleuEvaluator
    pred_role='BOT',  # accept the output of the 'BOT' role
    pred_postprocessor=dict(type=general_cn_postprocess),  # postprocessing of prediction results
    dataset_postprocessor=dict(type=general_cn_postprocess))  # postprocessing of dataset standard answers

govrepcrs_datasets = [
    dict(
        type=GovRepcrsDataset,  # dataset class name
        path='./data/govrep/',  # dataset path
        abbr='GovRepcrs',  # dataset alias
        reader_cfg=govrepcrs_reader_cfg,  # reading config: which split, columns, etc.
        infer_cfg=govrepcrs_infer_cfg,  # inference config, mainly prompt-related
        eval_cfg=govrepcrs_eval_cfg)  # evaluation config: evaluation standard and pre/postprocessing
]
```
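
Beyond full benchmark runs, it can be convenient to sanity-check a metric choice by exercising an evaluator directly on a few strings. A minimal sketch, assuming `BleuEvaluator` follows the common `score(predictions, references)` interface and that the required metric backend (e.g. sacrebleu) is installed:

```python
from opencompass.openicl.icl_evaluator import BleuEvaluator

# Hypothetical direct use, assuming the common
# `score(predictions, references)` evaluator interface.
evaluator = BleuEvaluator()
result = evaluator.score(
    predictions=['the cat sat on the mat'],
    references=['the cat sat on the mat'],
)
print(result)  # expected: a dict containing a BLEU score
```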