
[Task Submission] OperationsResearchQA (operationsresearchqa) #22

Closed
wants to merge 11 commits

Conversation

OpResearchQA

Operations Research QA (ORQA)

ORQA: Can Pretrained Language Models reason about Operations Research?

We propose Operations Research QA (ORQA) as a new benchmark to evaluate the ability of pretrained Large Language Models (LLMs) to generalize to new technical domains. Our benchmark addresses the cross-domain shift issue of LLMs and focuses on a multiple-choice question answering task. In our new dataset, the target domain is Operations Research (OR), and the task tests both the language models' domain-specific knowledge and their reasoning skills on optimization modeling. Our dataset is handcrafted by domain experts, i.e., OR experts, and is representative of different types of optimization and various application domains, such as vehicle routing, production planning, and investment portfolio allocation.

Authors

  • Rindra Ramamonjison [email protected]
  • Mahdi Mostajabdaveh
  • Timothy Yu
  • Giuseppe Carenini
  • Samarendra Dash
  • Serge Jabo Byusa
  • Zirui Zhou
  • Yong Zhang

Implementation

The custom function changed in task.py is format_example. Each example in our dataset consists of context, question, target, and target_options. This function formats an example so that the context, question, and target_options are combined into the input field.
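A minimal sketch of such a format_example override is shown below; the prompt wording and the A–D labeling are illustrative assumptions, not necessarily the exact ORQA implementation.

```python
from typing import Any, Dict

def format_example(self, example: Dict[str, Any]) -> Dict[str, Any]:
    # Combine context, question, and the answer options into one "input"
    # string, and pass the gold answer through as the "target".
    options = "\n".join(
        f"{letter}. {option}"
        for letter, option in zip("ABCD", example["target_options"])
    )
    input_text = (
        f"{example['context']}\n\n"
        f"Question: {example['question']}\n"
        f"Options:\n{options}\n"
        f"Answer:"
    )
    return {"input": input_text, "target": example["target"]}
```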

Usage

The evaluation function can be run using the default method.
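For reference, a minimal sketch of that default prompt-based flow as we understand it; my_model_generate is a placeholder for any model call, and the task id and the {"target": ...} prediction format are assumptions on our side.

```python
from genbench import load_task
from genbench.api import PreparationStrategy

# Load the task and build the 0-shot prompt-based test set.
task = load_task("operationsresearchqa")
ds = task.get_prepared_datasets(
    PreparationStrategy.PROMPT_BASED_TESTING, shot_list=[0]
)[0]

# my_model_generate is a placeholder for whatever LLM call is used.
predictions = [{"target": my_model_generate(ex["input"])} for ex in ds]

results = task.evaluate_predictions(predictions=predictions, gold=ds)
print(results)
```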

Checklist:

  • I and my co-authors agree that, if this PR is merged, the code will be available under the same license as the genbench_cbt repository.
  • Prior to submitting, I have run the GenBench CBT test suite using the genbench-cli test-task tool.
  • I have read the description of what should be in the doc.md of my task, and have added the required arguments.
  • I have submitted or will submit an accompanying paper to the GenBench workshop.

@OpResearchQA
Author

Dear GenBench Organizers,

ORQA is a multi-choice QA (MCQA) task.

  1. What is the standard method for evaluation that you would prefer for MCQA? There is a lot of variance across different runs, and https://arxiv.org/pdf/2210.12353.pdf reports PPA to quantify this issue. Should we instead run inference three times with three pre-set seeds and report the average and variance of the runs? Or are we over-thinking this, and should we simply expect users of the benchmark to provide an output file with their predictions, without considering how they obtained it?

  2. There were some functions within /src/genbench/task.py that we had to change to make the evaluation script work. Specifically, "_format_example_for_in_context_learning" and "evaluate_predictions". Is this acceptable?

  3. We are not sure why in "make_nshot_dataset", DatasetSplit.VALIDATION had to be replaced explicitly with "validation" for it to detect the key in the formatted_dataset. We will look into that, but were wondering if you or any other participants reported this issue.

  4. Finally, we are going to separate this into three tasks: (1) zero-shot, (2) 3-shot, and (3) chain-of-thought. How would you recommend we submit this in the PR? We read in the comments on other PRs that each new task has to be a new PR. However, we also recall you mentioning that a change in the number of shots is recommended not to constitute a new task. We would appreciate any help/advice.

Thank you,
ORQA Team

@vernadankers
Contributor

Dear ORQA team,

  1. That is up to you, and please explain your rationale in the paper that you submit along with the dataset.
  2. Yes, that is perfectly acceptable! We gave you the opportunity to customise the task.py for this very reason, since not all submissions fit the same technical setup.
  3. @kazemnejad Would you be able to look into that?
  4. We created the option to have a task with subtasks, see https://github.com/GenBench/genbench_cbt#task-with-subtasks in the README, perhaps you can create three different subtasks for zero-shot, 3-shot and chain-of-thought?

We are getting quite close to the deadline (September 1, 11:59PM anywhere on earth), so if your PR needs any final changes, please make them now,
and don't forget to submit your accompanying paper to Openreview via https://openreview.net/group?id=GenBench.org/2023/Workshop by September 1.

Good luck finalising your PR and paper, feel free to tag us if you have questions.
Cheers, Verna
On behalf of the GenBench team

@OpResearchQA
Author

Thank you for your help, Verna!

We validated the test task using

$ genbench-cli test-task --id my_awesome_task

It skips 3 tests, passes 2, and fails one: "test_perfect_score_for_gold_instances". This is because we have changed the "get_prepared_datasets" function in the task.py file. We are currently using a two-stage prompting method, and the way this test checks that the gold instances yield a perfect score no longer retrieves the labels from the dataset properly.

Should we rewrite the task.py file so that this test passes? Alternatively, we could write our own custom test script to verify that perfect predictions result in perfect scores.
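Such a custom sanity check could look roughly like the following; it assumes evaluate_predictions accepts predictions as {"target": ...} dictionaries and that the prepared dataset still exposes the gold target field, both of which are assumptions on our side.

```python
from genbench import load_task
from genbench.api import PreparationStrategy

task = load_task("operationsresearchqa")
ds = task.get_prepared_datasets(
    PreparationStrategy.PROMPT_BASED_TESTING, shot_list=[0]
)[0]

# Feed the gold targets back in as "predictions" ...
gold_predictions = [{"target": example["target"]} for example in ds]
results = task.evaluate_predictions(predictions=gold_predictions, gold=ds)

# ... and check that every reported metric is at its maximum.
print(results)  # e.g. accuracy should be 1.0
```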

Please let us know how to proceed.

Best,
ORQA Team

@kazemnejad
Contributor

  1. We are not sure why in "make_nshot_dataset", DatasetSplit.VALIDATION had to be replaced explicitly with "validation" for it to detect the key in the formatted_dataset. We will look into that, but were wondering if you or any other participants reported this issue.

Thanks for bringing up the issue. We'll fix this.

For editing the _format_examples.. and evaluate_prediction, please override these methods in your own task implementation. The framework files should remain untouched.

We are currently doing a two-stage prompting

Could you please elaborate on the two-stage prompting you have in mind?

@OpResearchQA
Author

cc @kazemnejad
Dear GenBench organizers,

  1. Our ORQA benchmark focuses on a multiple-choice question answering (MCQA) task, for which we used two different prompting approaches as baselines:
  • Standard prompting uses a single prompt (in both 0-shot and few-shot variants), since the model is directly prompted to output the correct answer as a single letter among the choices (A, B, C, or D).
  • Chain of Thought follows a two-stage prompting approach (in both 0-shot and few-shot variants): in the first stage, the model is prompted to generate the reasoning steps; then, in the second stage, the reasoning steps are added to a second prompt for the model to produce the MCQA answer. We follow prior works that suggested this two-stage prompting approach for MCQA:
    https://arxiv.org/pdf/2205.11916.pdf
    https://arxiv.org/pdf/2304.13007.pdf

In summary, the two-stage prompting protocol applies only to the Chain-of-Thought subtask.
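For illustration, the two-stage protocol can be sketched as follows; the generate callable and the exact prompt wording are placeholders, not the ORQA prompts.

```python
def two_stage_cot_answer(generate, context, question, options):
    # `generate` stands in for any LLM call.
    mcq = (
        f"{context}\n\nQuestion: {question}\n"
        + "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
    )
    # Stage 1: elicit the reasoning steps.
    reasoning = generate(f"{mcq}\nLet's think step by step.")
    # Stage 2: append the reasoning and ask for the final answer letter.
    answer = generate(
        f"{mcq}\nLet's think step by step.\n{reasoning}\n"
        "Therefore, among A, B, C and D, the answer is"
    )
    return answer.strip()
```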

Please feel free to refer to this comment for more context on the task https://github.com/GenBench/genbench_cbt/pull/22#issuecomment-1698004423

For editing the _format_examples.. and evaluate_prediction, please override these methods in your own task implementation. The framework files should remain untouched.

Understood, we will not change the framework files and will instead override the methods in our own task implementation. One related question: the changes we are making to these two methods are common to both subtasks. Is it OK if we create a task.py file in the parent task folder, add these methods there, and then make the corresponding task classes in the subtask folders inherit from the task class in the parent folder? We want to do this to avoid code repetition.
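For concreteness, the layout we have in mind would be roughly the following; class names, method signatures, and import paths are our assumptions and would follow whatever conventions the framework requires.

```python
# operationsresearchqa/task.py (parent) -- shared overrides.
from genbench import Task

class OperationsresearchqaTask(Task):
    def _format_example_for_in_context_learning(self, *args, **kwargs):
        ...  # shared formatting logic for both subtasks

    def evaluate_predictions(self, *, predictions=None, gold=None):
        ...  # shared MCQA scoring logic


# operationsresearchqa/standard/task.py (subtask) -- inherits the shared behaviour.
from genbench.tasks.operationsresearchqa.task import OperationsresearchqaTask

class OperationsresearchqaStandard(OperationsresearchqaTask):
    pass  # subtask-specific tweaks (e.g. prompt config) would go here
```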

  1. There is an existing bug in the genbench/task.py code. It is described in detail in the following issue:
    https://github.com/GenBench/genbench_cbt/issues/35

Due to this bug, setting n_shots=n gives us n+1 examples in the few-shot prompt. We want to bring this to your attention and request a fix. We would also appreciate your advice on whether, for now, we should ignore the bug and raise the pull request as-is, or add code on our side to compensate for it.
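If compensating on our side turns out to be the way to go, one hypothetical workaround, assuming the off-by-one reported in the issue above holds consistently, would be to request one shot fewer than intended:

```python
from genbench import load_task
from genbench.api import PreparationStrategy

desired_shots = 3
task = load_task("operationsresearchqa")
# Request desired_shots - 1; with the off-by-one, the prompts then contain
# desired_shots in-context examples. Remove once issue #35 is fixed upstream.
ds = task.get_prepared_datasets(
    PreparationStrategy.PROMPT_BASED_TESTING, shot_list=[desired_shots - 1]
)[0]
```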
