[DOC] Add metric doc (#118)

* update
* update
* update metric docs
* update index.rst
* update metrics
open-compass · Aug 1, 2023 · e9b7b8a · e9b7b8a
1 parent d860b61
commit e9b7b8a

Showing 4 changed files with 123 additions and 1 deletion.
diff --git a/docs/en/index.rst b/docs/en/index.rst
@@ -35,6 +35,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
    user_guides/models.md
    user_guides/evaluation.md
    user_guides/experimentation.md
+   user_guides/metrics.md
 
 .. _AdvancedGuides:
 .. toctree::

diff --git a/docs/en/user_guides/metrics.md b/docs/en/user_guides/metrics.md
@@ -1 +1,62 @@
 # Metric Calculation
+
+In the evaluation phase, the evaluation metric is usually chosen according to the characteristics of the dataset itself. The main criterion is the **type of standard answer**, which generally falls into one of the following types:
+
+- **Choice**: Common in classification tasks, true/false questions, and multiple-choice questions. Datasets of this type currently make up the largest share, for example MMLU and CEval. Accuracy is usually used as the evaluation standard (`ACCEvaluator`).
+- **Phrase**: Common in Q&A and reading comprehension tasks. Datasets of this type include CLUE_CMRC, CLUE_DRCD, and DROP. Match rate is usually used as the evaluation standard (`EMEvaluator`).
+- **Sentence**: Common in translation and in pseudocode or command-line generation tasks. Datasets of this type include Flores, Summscreen, Govrepcrs, and Iwslt2017. BLEU (Bilingual Evaluation Understudy) is usually used as the evaluation standard (`BleuEvaluator`).
+- **Paragraph**: Common in text summarization tasks. Commonly used datasets include Lcsts, TruthfulQA, and Xsum. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is usually used as the evaluation standard (`RougeEvaluator`).
+- **Code**: Common in code generation tasks. Commonly used datasets include Humaneval and MBPP. Execution pass rate and `pass@k` are usually used as the evaluation standard. OpenCompass currently supports `MBPPEvaluator` and `HumanEvaluator`.
+
+There is also a **scoring-type** evaluation task that has no standard answer, such as judging whether the output of a model is toxic. Such tasks can be scored directly with a related API service. `ToxicEvaluator` is currently supported, and the realtoxicityprompts dataset uses this evaluation method.
+
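The two simplest metrics named above, accuracy and match rate, can be sketched in plain Python. This is only an illustration: the function names `accuracy`, `normalize`, and `exact_match` and the normalization rule are assumptions made here, not the OpenCompass implementation.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that are identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return 100.0 * correct / len(references)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified normalization)."""
    return " ".join(text.lower().split())


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Accuracy computed on normalized strings, as a stand-in for a match rate."""
    return accuracy(
        [normalize(pred) for pred in predictions],
        [normalize(ref) for ref in references],
    )


# Choice-type answers are compared as-is; phrase-type answers after normalization.
choice_score = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
phrase_score = exact_match(["  New York ", "paris"], ["new york", "London"])
print(choice_score, phrase_score)
```

Sentence- and paragraph-level metrics such as BLEU and ROUGE score partial overlap rather than equality, so they are not covered by this sketch.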
+## Supported Evaluation Metrics
+
+In OpenCompass, the commonly used evaluators are located mainly in the [`opencompass/openicl/icl_evaluator`](https://github.com/InternLM/opencompass/tree/main/opencompass/openicl/icl_evaluator) folder. Some dataset-specific metrics are implemented in individual files under [`opencompass/datasets`](https://github.com/InternLM/opencompass/tree/main/opencompass/datasets). Below is a summary:
+
+| Evaluation Strategy | Evaluation Metrics   | Common Postprocessing Method | Datasets                                                             |
+| ------------------- | -------------------- | ---------------------------- | -------------------------------------------------------------------- |
+| `ACCEvaluator`      | Accuracy             | `first_capital_postprocess`  | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
+| `EMEvaluator`       | Match Rate           | None, dataset-specific       | drop, CLUE_CMRC, CLUE_DRCD                                           |
+| `BleuEvaluator`     | BLEU                 | None, `flores`               | flores, iwslt2017, summscreen, govrepcrs                             |
+| `RougeEvaluator`    | ROUGE                | None, dataset-specific       | lcsts, truthfulqa, Xsum, XLSum                                       |
+| `HumanEvaluator`    | pass@k               | `humaneval_postprocess`      | humaneval                                                            |
+| `MBPPEvaluator`     | Execution Pass Rate  | None                         | mbpp                                                                 |
+| `ToxicEvaluator`    | PerspectiveAPI       | None                         | realtoxicityprompts                                                  |
+| `AGIEvalEvaluator`  | Accuracy             | None                         | agieval                                                              |
+| `AUCROCEvaluator`   | AUC-ROC              | None                         | jigsawmultilingual, civilcomments                                    |
+| `MATHEvaluator`     | Accuracy             | `math_postprocess`           | math                                                                 |
+| `MccEvaluator`      | Matthews Correlation | None                         | --                                                                   |
+| `SquadEvaluator`    | F1-scores            | None                         | --                                                                   |
+
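For the `pass@k` row, a widely used formulation is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. The sketch below assumes that estimator; the function name `pass_at_k` is hypothetical, and whether `HumanEvaluator` computes the metric in exactly this way is not stated in this document.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples with 3 passing, evaluated at k = 1 and k = 5.
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```

The per-problem values would then be averaged over all problems in the dataset to obtain the reported score.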
+## How to Configure
+
+The evaluation configuration is generally placed in the dataset configuration file, and the final `xxdataset_eval_cfg` is passed to the dataset entry as its `eval_cfg` parameter.
+
+Below is the definition of `govrepcrs_eval_cfg`; for details, see [configs/datasets/govrepcrs](https://github.com/InternLM/opencompass/tree/main/configs/datasets/govrepcrs).
+
+```python
+from opencompass.openicl.icl_evaluator import BleuEvaluator
+from opencompass.datasets import GovRepcrsDataset
+from opencompass.utils.text_postprocessors import general_cn_postprocess
+
+govrepcrs_reader_cfg = dict(.......)
+govrepcrs_infer_cfg = dict(.......)
+
+# Configuration of evaluation metrics
+govrepcrs_eval_cfg = dict(
+    evaluator=dict(type=BleuEvaluator),            # Use BleuEvaluator, the common evaluator for translation tasks
+    pred_role='BOT',                               # Accept 'BOT' role output
+    pred_postprocessor=dict(type=general_cn_postprocess),      # Postprocessing of prediction results
+    dataset_postprocessor=dict(type=general_cn_postprocess))   # Postprocessing of dataset standard answers
+
+govrepcrs_datasets = [
+    dict(
+        type=GovRepcrsDataset,                 # Dataset class name
+        path='./data/govrep/',                 # Dataset path
+        abbr='GovRepcrs',                      # Dataset alias
+        reader_cfg=govrepcrs_reader_cfg,       # Dataset reading configuration: split, columns, etc.
+        infer_cfg=govrepcrs_infer_cfg,         # Dataset inference configuration, mainly prompt-related
+        eval_cfg=govrepcrs_eval_cfg)           # Dataset evaluation configuration: evaluation standard plus pre- and postprocessing
+]
+```
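Conceptually, the fields of an `eval_cfg` combine as follows: predictions and standard answers each pass through their postprocessor, and the evaluator then scores the cleaned pairs. The sketch below mimics that flow with stand-ins only. `first_capital`, `ToyAccEvaluator`, `toy_eval_cfg`, and `run_eval` are hypothetical names, and the sketch assumes an evaluator object with a `score(predictions, references)` method; it is not the OpenCompass pipeline code.

```python
def first_capital(text: str) -> str:
    """Hypothetical postprocessor: return the first uppercase letter, or ''."""
    for char in text:
        if char.isupper():
            return char
    return ""


class ToyAccEvaluator:
    """Hypothetical evaluator with an assumed score(predictions, references) method."""

    def score(self, predictions, references):
        correct = sum(pred == ref for pred, ref in zip(predictions, references))
        return {"accuracy": 100.0 * correct / len(references)}


# Same shape as the eval_cfg above, but built only from the stand-ins.
toy_eval_cfg = dict(
    evaluator=dict(type=ToyAccEvaluator),
    pred_postprocessor=dict(type=first_capital),
)


def run_eval(eval_cfg, predictions, references):
    """Apply the optional postprocessors, then let the evaluator score."""
    pred_post = eval_cfg.get("pred_postprocessor")
    if pred_post is not None:
        predictions = [pred_post["type"](pred) for pred in predictions]
    data_post = eval_cfg.get("dataset_postprocessor")
    if data_post is not None:
        references = [data_post["type"](ref) for ref in references]
    evaluator = eval_cfg["evaluator"]["type"]()
    return evaluator.score(predictions, references)


result = run_eval(
    toy_eval_cfg,
    ["B. because it is the largest", "C", "A is right", "D"],
    ["B", "C", "D", "D"],
)
print(result)
```

A postprocessor this simple picks up any leading capital letter, so a prediction that starts with an ordinary capitalized word would be scored on that letter instead of the intended option.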
diff --git a/docs/zh_cn/index.rst b/docs/zh_cn/index.rst
@@ -36,6 +36,7 @@ OpenCompass Getting Started Roadmap
    user_guides/models.md
    user_guides/evaluation.md
    user_guides/experimentation.md
+   user_guides/metrics.md
 
 .. _提示词:
 .. toctree::

diff --git a/docs/zh_cn/user_guides/metrics.md b/docs/zh_cn/user_guides/metrics.md
@@ -1,3 +1,62 @@
 # Evaluation Metrics
 
-Coming soon.
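+In the evaluation phase, the evaluation strategy is usually chosen according to the characteristics of the dataset itself. The main criterion is the **type of standard answer**, which generally falls into one of the following types:
+
+- **Choice**: Common in classification tasks, true/false questions, and multiple-choice questions. Datasets of this type currently make up the largest share, for example MMLU and CEval. Accuracy is usually used as the evaluation standard (`ACCEvaluator`).
+- **Phrase**: Common in Q&A and reading comprehension tasks. Datasets of this type include CLUE_CMRC, CLUE_DRCD, and DROP. Match rate is usually used as the evaluation standard (`EMEvaluator`).
+- **Sentence**: Common in translation and in pseudocode or command-line generation tasks. Datasets of this type include Flores, Summscreen, Govrepcrs, and Iwslt2017. BLEU (Bilingual Evaluation Understudy) is usually used as the evaluation standard (`BleuEvaluator`).
+- **Paragraph**: Common in text summarization tasks. Commonly used datasets include Lcsts, TruthfulQA, and Xsum. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is usually used as the evaluation standard (`RougeEvaluator`).
+- **Code**: Common in code generation tasks. Commonly used datasets include Humaneval and MBPP. Execution pass rate and `pass@k` are usually used as the evaluation standard. OpenCompass currently supports `MBPPEvaluator` and `HumanEvaluator`.
+
+There is also a **scoring-type** evaluation task that has no standard answer, such as judging whether the output of a model is toxic. Such tasks can be scored directly with a related API service. `ToxicEvaluator` is currently supported, and the realtoxicityprompts dataset uses this evaluation method.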
+
+## Supported Evaluation Metrics
+
+In OpenCompass, the commonly used evaluators are located mainly in the [`opencompass/openicl/icl_evaluator`](https://github.com/InternLM/opencompass/tree/main/opencompass/openicl/icl_evaluator) folder. Some dataset-specific metrics are implemented in individual files under [`opencompass/datasets`](https://github.com/InternLM/opencompass/tree/main/opencompass/datasets). Below is a summary:
+
+| Evaluation Strategy | Evaluation Metrics   | Common Postprocessing Method | Datasets                                                             |
+| ------------------- | -------------------- | ---------------------------- | -------------------------------------------------------------------- |
+| `ACCEvaluator`      | Accuracy             | `first_capital_postprocess`  | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
+| `EMEvaluator`       | Match Rate           | None, dataset-specific       | drop, CLUE_CMRC, CLUE_DRCD                                           |
+| `BleuEvaluator`     | BLEU                 | None, `flores`               | flores, iwslt2017, summscreen, govrepcrs                             |
+| `RougeEvaluator`    | ROUGE                | None, dataset-specific       | lcsts, truthfulqa, Xsum, XLSum                                       |
+| `HumanEvaluator`    | pass@k               | `humaneval_postprocess`      | humaneval                                                            |
+| `MBPPEvaluator`     | Execution Pass Rate  | None                         | mbpp                                                                 |
+| `ToxicEvaluator`    | PerspectiveAPI       | None                         | realtoxicityprompts                                                  |
+| `AGIEvalEvaluator`  | Accuracy             | None                         | agieval                                                              |
+| `AUCROCEvaluator`   | AUC-ROC              | None                         | jigsawmultilingual, civilcomments                                    |
+| `MATHEvaluator`     | Accuracy             | `math_postprocess`           | math                                                                 |
+| `MccEvaluator`      | Matthews Correlation | None                         | --                                                                   |
+| `SquadEvaluator`    | F1-scores            | None                         | --                                                                   |
+
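+## How to Configure
+
+The evaluation configuration is generally placed in the dataset configuration file, and the final `xxdataset_eval_cfg` is passed to the dataset entry as its `eval_cfg` parameter.
+
+Below is the definition of `govrepcrs_eval_cfg`; for details, see [configs/datasets/govrepcrs](https://github.com/InternLM/opencompass/tree/main/configs/datasets/govrepcrs).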
+
+```python
+from opencompass.openicl.icl_evaluator import BleuEvaluator
+from opencompass.datasets import GovRepcrsDataset
+from opencompass.utils.text_postprocessors import general_cn_postprocess
+
+govrepcrs_reader_cfg = dict(.......)
+govrepcrs_infer_cfg = dict(.......)
+
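+# Configuration of evaluation metrics
+govrepcrs_eval_cfg = dict(
+    evaluator=dict(type=BleuEvaluator),            # Use BleuEvaluator, the common evaluator for translation tasks
+    pred_role='BOT',                               # Accept output from the 'BOT' role
+    pred_postprocessor=dict(type=general_cn_postprocess),      # Postprocessing of prediction results
+    dataset_postprocessor=dict(type=general_cn_postprocess))   # Postprocessing of dataset standard answers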
+
+govrepcrs_datasets = [
+    dict(
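+        type=GovRepcrsDataset,                 # Dataset class name
+        path='./data/govrep/',                 # Dataset path
+        abbr='GovRepcrs',                      # Dataset alias
+        reader_cfg=govrepcrs_reader_cfg,       # Dataset reading configuration: split, columns, etc.
+        infer_cfg=govrepcrs_infer_cfg,         # Dataset inference configuration, mainly prompt-related
+        eval_cfg=govrepcrs_eval_cfg)           # Dataset evaluation configuration: evaluation standard plus pre- and postprocessing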
+]
+```