[Feature] Add support for SciCode (#1417)
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode w/ bg
* add scicode
* Update README.md
* Update README.md
* Delete configs/eval_SciCode.py
* rename
* 1
* rename
* Update README.md
* Update scicode.py
* Update scicode.py
* fix some bugs
* Update
* Update

Co-authored-by: root <HariSeldon0>
Co-authored-by: tonysy <[email protected]>
1 parent d3963bc · commit 14b4b73 · 13 changed files with 486 additions and 1 deletion
# SciCode: A Research Coding Benchmark Curated by Scientists

## Introduction
SciCode is a challenging benchmark designed to evaluate the capability of language models (LMs) to generate code that solves realistic scientific research problems. It covers 16 subdomains across six domains, including Physics, Math, Material Science, Biology, and Chemistry. Unlike earlier benchmarks built from exam-style question-answer pairs, SciCode is derived from real research problems. Each problem naturally factorizes into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions of useful scientific background information together with scientist-annotated gold-standard solutions and test cases for evaluation. Claude 3.5 Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. Broadly, SciCode mirrors the everyday workflow of scientists: identifying critical scientific concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only demonstrates contemporary LLMs' progress toward becoming helpful assistants for scientists but also sheds light on the future development and evaluation of scientific AI. For more details, see https://scicode-bench.github.io/.

## How to Use
Setting the `with_bg` parameter in the configuration file to `True` enables the with-background (w/ Background) evaluation setting, in which prompts additionally include the scientist-annotated background descriptions. A minimal sketch of the change, mirroring the default dataset config shown later in this commit:
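```python
# Hypothetical w/ Background variant of the default SciCode config;
# only the `with_bg` flags (and the abbr, for bookkeeping) change.
SciCode_eval_cfg = dict(
    evaluator=dict(type=SciCodeEvaluator,
                   dataset_path='./data/scicode',
                   with_bg=True))  # score under the w/ Background setting

SciCode_datasets = [
    dict(
        abbr='SciCode_wbg',  # illustrative name, not from this commit
        type=SciCodeDataset,
        path='./data/scicode',
        with_bg=True,  # include background descriptions in the prompts
        reader_cfg=SciCode_reader_cfg,
        infer_cfg=SciCode_infer_cfg,
        eval_cfg=SciCode_eval_cfg)
]
```

Then launch the evaluation with OpenCompass's `run.py`, for example: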
```bash
python run.py --datasets scicode_gen --hf-num-gpus 1 --hf-type chat --hf-path meta-llama/Meta-Llama-3-8B-Instruct --debug --model-kwargs device_map='auto' trust_remote_code=True --batch-size 1
```
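Here `--datasets scicode_gen` selects the default SciCode config, and the `--hf-*` flags point OpenCompass at a local HuggingFace chat model; any other model supported by OpenCompass can be substituted.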
## Reference Performance

| Model                | Condition      | Subproblem Accuracy | Main Problem Accuracy |
|----------------------|----------------|---------------------|-----------------------|
| Llama-3-70B-Instruct | w/o Background | 21.53%              | 4.62%                 |
| Llama-3-70B-Instruct | w/ Background  | 24.31%              | 7.69%                 |
| Qwen2-72B-Instruct   | w/o Background | 16.67%              | 1.54%                 |
| Qwen2-72B-Instruct   | w/ Background  | 19.79%              | 1.54%                 |
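Note that a main problem counts as solved only if all of its subproblems pass their tests, which is why main-problem accuracy is substantially lower than subproblem accuracy.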

## Citation

```bibtex
@misc{tian2024scicode,
      title={SciCode: A Research Coding Benchmark Curated by Scientists},
      author={Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},
      year={2024},
      eprint={2407.13168},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```
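The commit also adds the `scicode_gen` entry point referenced by `--datasets scicode_gen` above, which simply re-exports the default prompt variant: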
```python
from mmengine.config import read_base

with read_base():
    from .scicode_gen_085b98 import SciCode_datasets  # noqa: F401, F403
```
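The `085b98` suffix follows OpenCompass's usual convention of tagging a dataset config with a short hash of its prompt settings, so the unsuffixed entry point always resolves to the current default variant.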
`opencompass/configs/datasets/scicode/scicode_gen_085b98.py` (29 additions, 0 deletions):
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import ChatInferencer
from opencompass.datasets import SciCodeDataset, SciCodeEvaluator

# SciCode prompts are pre-built, so the reader passes them through;
# there is no gold output column (scoring runs the generated code).
SciCode_reader_cfg = dict(input_columns=['prompt'], output_column=None)

SciCode_infer_cfg = dict(
    ice_template=dict(
        type=PromptTemplate,
        template='',
    ),
    retriever=dict(type=ZeroRetriever),  # zero-shot: no in-context examples
    # infer_mode='every' makes the ChatInferencer generate a reply at every
    # round of the multi-turn dialogue, i.e. once per subproblem
    inferencer=dict(type=ChatInferencer, infer_mode='every', max_out_len=4096))

# run generated code against the scientist-annotated test cases;
# with_bg=False is the more realistic w/o Background setting
SciCode_eval_cfg = dict(
    evaluator=dict(type=SciCodeEvaluator,
                   dataset_path='./data/scicode',
                   with_bg=False))

SciCode_datasets = [
    dict(
        abbr='SciCode',
        type=SciCodeDataset,
        path='./data/scicode',
        with_bg=False,
        reader_cfg=SciCode_reader_cfg,
        infer_cfg=SciCode_infer_cfg,
        eval_cfg=SciCode_eval_cfg)
]
```
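For reference, a minimal sketch of how `SciCode_datasets` would typically be consumed in a top-level OpenCompass evaluation config; the file name and the model import below are illustrative assumptions, not part of this commit:

```python
# eval_scicode.py -- hypothetical top-level config (illustrative only)
from mmengine.config import read_base

with read_base():
    # dataset list added by this commit
    from .datasets.scicode.scicode_gen import SciCode_datasets
    # any OpenCompass model config works here; this path is an assumption
    from .models.hf_llama.hf_llama3_8b_instruct import models as llama3_models

datasets = SciCode_datasets
models = llama3_models
```

Such a config would be launched with `python run.py <path-to-config>` instead of passing `--datasets` on the command line.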