Commit
[Feature] Add support for SciCode (#1417)
* add SciCode

* add SciCode

* add SciCode

* add SciCode

* add SciCode

* add SciCode

* add SciCode

* add SciCode w/ bg

* add scicode

* Update README.md

* Update README.md

* Delete configs/eval_SciCode.py

* rename

* 1

* rename

* Update README.md

* Update scicode.py

* Update scicode.py

* fix some bugs

* Update

* Update

---------

Co-authored-by: root <HariSeldon0>
Co-authored-by: tonysy <[email protected]>
HariSeldon0 and tonysy authored Aug 22, 2024
1 parent d3963bc commit 14b4b73
Showing 13 changed files with 486 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -70,6 +70,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2024.08.20\]** OpenCompass now supports [SciCode](https://github.com/scicode-bench/SciCode): A Research Coding Benchmark Curated by Scientists. 🔥🔥🔥
- **\[2024.08.16\]** OpenCompass now supports the brand new long-context language model evaluation benchmark — [RULER](https://arxiv.org/pdf/2404.06654). RULER provides an evaluation of long-context including retrieval, multi-hop tracing, aggregation, and question answering through flexible configurations. Check out the [RULER](configs/datasets/ruler/README.md) evaluation config now! 🔥🔥🔥
- **\[2024.08.09\]** We have released the example data and configuration for the CompassBench-202408, welcome to [CompassBench](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/compassbench_intro.html) for more details. 🔥🔥🔥
- **\[2024.08.01\]** We supported the [Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) models. Welcome to try! 🔥🔥🔥
1 change: 1 addition & 0 deletions README_zh-CN.md
@@ -69,6 +69,7 @@
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2024.08.20\]** OpenCompass now supports [SciCode](https://github.com/scicode-bench/SciCode): A Research Coding Benchmark Curated by Scientists. 🔥🔥🔥
- **\[2024.08.16\]** OpenCompass now supports the brand-new long-context language model evaluation benchmark [RULER](https://arxiv.org/pdf/2404.06654). Through flexible configurations, RULER evaluates long-context tasks including retrieval, multi-hop tracing, aggregation, and question answering; see the [RULER](configs/datasets/ruler/README.md) config. 🔥🔥🔥
- **\[2024.07.23\]** We now support the [Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) models. Welcome to try! 🔥🔥🔥
- **\[2024.07.23\]** We now support [ModelScope](www.modelscope.cn) datasets; you can load them on demand without first downloading all the data locally. Welcome to try! 🔥🔥🔥
31 changes: 31 additions & 0 deletions configs/datasets/scicode/README.md
@@ -0,0 +1,31 @@
# SciCode: A Research Coding Benchmark Curated by Scientists

## Introduction
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It covers 16 subdomains across five domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is derived from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions specifying useful scientific background information, along with scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. Broadly, SciCode reflects scientists' everyday workflow of identifying critical scientific concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps gauge contemporary LLMs' progress toward becoming helpful assistants for scientists, but also sheds light on the future development and evaluation of scientific AI. For more details, please refer to https://scicode-bench.github.io/.

## How to Use
Set the `with_bg` parameter in the configuration file to enable the w/ background evaluation setup.

```bash
python run.py --datasets scicode_gen --hf-num-gpus 1 --hf-type chat --hf-path meta-llama/Meta-Llama-3-8B-Instruct --debug --model-kwargs device_map='auto' trust_remote_code=True --batch-size 1
```
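
For example, here is a minimal sketch of the w/ background variant of the `scicode_gen_085b98.py` config included in this commit. Only the `with_bg` flags change; `SciCode_reader_cfg` and `SciCode_infer_cfg` are as defined in that file, and the `abbr` rename is illustrative:

```python
from opencompass.datasets import SciCodeDataset, SciCodeEvaluator

SciCode_eval_cfg = dict(
    evaluator=dict(type=SciCodeEvaluator,
                   dataset_path='./data/scicode',
                   with_bg=True))  # score the background-aware setting

SciCode_datasets = [
    dict(
        abbr='SciCode_with_bg',  # illustrative name to distinguish the run
        type=SciCodeDataset,
        path='./data/scicode',
        with_bg=True,  # include scientist-written background in prompts
        reader_cfg=SciCode_reader_cfg,  # as defined in scicode_gen_085b98.py
        infer_cfg=SciCode_infer_cfg,    # as defined in scicode_gen_085b98.py
        eval_cfg=SciCode_eval_cfg)
]
```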

## Reference Performance
| Model | Condition | Subproblem Accuracy | Main Problem Accuracy |
|---------------------------|--------------|---------------------|-----------------------|
| Llama-3-70B-Instruct | w/o Background | 21.53% | 4.62% |
| Llama-3-70B-Instruct | w/ Background | 24.31% | 7.69% |
| Qwen2-72B-Instruct | w/o Background | 16.67% | 1.54% |
| Qwen2-72B-Instruct | w/ Background | 19.79% | 1.54% |

## Citation
```bibtex
@misc{tian2024scicode,
title={SciCode: A Research Coding Benchmark Curated by Scientists},
author={Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},
year={2024},
eprint={2407.13168},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
4 changes: 4 additions & 0 deletions configs/datasets/scicode/scicode_gen.py
@@ -0,0 +1,4 @@
from mmengine.config import read_base

with read_base():
    from .scicode_gen_085b98 import SciCode_datasets  # noqa: F401, F403
29 changes: 29 additions & 0 deletions configs/datasets/scicode/scicode_gen_085b98.py
@@ -0,0 +1,29 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import ChatInferencer
from opencompass.datasets import SciCodeDataset, SciCodeEvaluator

# The full prompt is the only input column; there is no gold output column
# because the evaluator scores generations against test cases.
SciCode_reader_cfg = dict(input_columns=['prompt'], output_column=None)

SciCode_infer_cfg = dict(
    ice_template=dict(
        type=PromptTemplate,
        template='',
    ),
    retriever=dict(type=ZeroRetriever),
    # infer_mode='every' answers each subproblem turn of the multi-turn prompt.
    inferencer=dict(type=ChatInferencer, infer_mode='every', max_out_len=4096))

SciCode_eval_cfg = dict(
    evaluator=dict(type=SciCodeEvaluator,
                   dataset_path='./data/scicode',
                   with_bg=False))

SciCode_datasets = [
    dict(
        abbr='SciCode',
        type=SciCodeDataset,
        path='./data/scicode',
        with_bg=False,  # set True for the w/ background setting
        reader_cfg=SciCode_reader_cfg,
        infer_cfg=SciCode_infer_cfg,
        eval_cfg=SciCode_eval_cfg)
]
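
For reference, a minimal sketch of wiring this dataset list into an evaluation config via `read_base`; the model import path is illustrative, so substitute any OpenCompass model config:

```python
from mmengine.config import read_base

with read_base():
    # Dataset list defined in this commit.
    from .datasets.scicode.scicode_gen import SciCode_datasets
    # Illustrative model config path; replace with the model you evaluate.
    from .models.hf_llama.hf_llama3_8b_instruct import models

datasets = SciCode_datasets
```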
31 changes: 31 additions & 0 deletions opencompass/configs/datasets/scicode/README.md
@@ -0,0 +1,31 @@
# SciCode: A Research Coding Benchmark Curated by Scientists

## Introduction
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It covers 16 subdomains across five domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is derived from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions specifying useful scientific background information, along with scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. Broadly, SciCode reflects scientists' everyday workflow of identifying critical scientific concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps gauge contemporary LLMs' progress toward becoming helpful assistants for scientists, but also sheds light on the future development and evaluation of scientific AI. For more details, please refer to https://scicode-bench.github.io/.

## How to Use
Set the `with_bg` parameter in the configuration file to enable the w/ background evaluation setup.

```bash
python run.py --datasets scicode_gen --hf-num-gpus 1 --hf-type chat --hf-path meta-llama/Meta-Llama-3-8B-Instruct --debug --model-kwargs device_map='auto' trust_remote_code=True --batch-size 1
```
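
For example, here is a minimal sketch of the w/ background variant of the `scicode_gen_085b98.py` config included in this commit. Only the `with_bg` flags change; `SciCode_reader_cfg` and `SciCode_infer_cfg` are as defined in that file, and the `abbr` rename is illustrative:

```python
from opencompass.datasets import SciCodeDataset, SciCodeEvaluator

SciCode_eval_cfg = dict(
    evaluator=dict(type=SciCodeEvaluator,
                   dataset_path='./data/scicode',
                   with_bg=True))  # score the background-aware setting

SciCode_datasets = [
    dict(
        abbr='SciCode_with_bg',  # illustrative name to distinguish the run
        type=SciCodeDataset,
        path='./data/scicode',
        with_bg=True,  # include scientist-written background in prompts
        reader_cfg=SciCode_reader_cfg,  # as defined in scicode_gen_085b98.py
        infer_cfg=SciCode_infer_cfg,    # as defined in scicode_gen_085b98.py
        eval_cfg=SciCode_eval_cfg)
]
```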

## Reference Performance
| Model | Condition | Subproblem Accuracy | Main Problem Accuracy |
|---------------------------|--------------|---------------------|-----------------------|
| Llama-3-70B-Instruct | w/o Background | 21.53% | 4.62% |
| Llama-3-70B-Instruct | w/ Background | 24.31% | 7.69% |
| Qwen2-72B-Instruct | w/o Background | 16.67% | 1.54% |
| Qwen2-72B-Instruct | w/ Background | 19.79% | 1.54% |

## Citation
```bibtex
@misc{tian2024scicode,
title={SciCode: A Research Coding Benchmark Curated by Scientists},
author={Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},
year={2024},
eprint={2407.13168},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
4 changes: 4 additions & 0 deletions opencompass/configs/datasets/scicode/scicode_gen.py
@@ -0,0 +1,4 @@
from mmengine.config import read_base

with read_base():
    from .scicode_gen_085b98 import SciCode_datasets  # noqa: F401, F403
29 changes: 29 additions & 0 deletions opencompass/configs/datasets/scicode/scicode_gen_085b98.py
@@ -0,0 +1,29 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import ChatInferencer
from opencompass.datasets import SciCodeDataset, SciCodeEvaluator

# The full prompt is the only input column; there is no gold output column
# because the evaluator scores generations against test cases.
SciCode_reader_cfg = dict(input_columns=['prompt'], output_column=None)

SciCode_infer_cfg = dict(
    ice_template=dict(
        type=PromptTemplate,
        template='',
    ),
    retriever=dict(type=ZeroRetriever),
    # infer_mode='every' answers each subproblem turn of the multi-turn prompt.
    inferencer=dict(type=ChatInferencer, infer_mode='every', max_out_len=4096))

SciCode_eval_cfg = dict(
    evaluator=dict(type=SciCodeEvaluator,
                   dataset_path='./data/scicode',
                   with_bg=False))

SciCode_datasets = [
    dict(
        abbr='SciCode',
        type=SciCodeDataset,
        path='./data/scicode',
        with_bg=False,  # set True for the w/ background setting
        reader_cfg=SciCode_reader_cfg,
        infer_cfg=SciCode_infer_cfg,
        eval_cfg=SciCode_eval_cfg)
]
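
As with the copy under `configs/`, a minimal sketch of wiring this dataset list into an evaluation config via `read_base`; the model import path is illustrative, so substitute any OpenCompass model config:

```python
from mmengine.config import read_base

with read_base():
    # Dataset list defined in this commit.
    from .datasets.scicode.scicode_gen import SciCode_datasets
    # Illustrative model config path; replace with the model you evaluate.
    from .models.hf_llama.hf_llama3_8b_instruct import models

datasets = SciCode_datasets
```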
1 change: 1 addition & 0 deletions opencompass/datasets/__init__.py
@@ -99,6 +99,7 @@
from .ruler import * # noqa: F401, F403
from .safety import * # noqa: F401, F403
from .scibench import ScibenchDataset, scibench_postprocess # noqa: F401, F403
from .scicode import * # noqa: F401, F403
from .siqa import * # noqa: F401, F403
from .squad20 import SQuAD20Dataset, SQuAD20Evaluator # noqa: F401, F403
from .storycloze import * # noqa: F401, F403