
[Task Submission] OperationsResearchQA (operationsresearchqa) #22

Closed
wants to merge 11 commits

Conversation

OpResearchQA

Operations Research QA (ORQA)

ORQA: Can Pretrained Language Models reason about Operations Research?

We propose Operations Research QA (ORQA) as a new benchmark to evaluate the ability of pretrained Large Language Models (LLMs) to generalize to new technical domains. Our benchmark addresses the cross-domain shift issue of LLMs and focuses on a multiple-choice question answering task. In our new dataset, the target domain is Operations Research (OR), and the task tests both the language models' domain-specific knowledge and their reasoning skills on optimization modeling. Our dataset is handcrafted by domain experts, i.e., OR experts, and is representative of different types of optimization and various application domains, such as vehicle routing, production planning, and investment portfolio allocation.

Authors

  • Rindra Ramamonjison [email protected]
  • Mahdi Mostajabdaveh
  • Timothy Yu
  • Giuseppe Carenini
  • Samarendra Dash
  • Serge Jabo Byusa
  • Zirui Zhou
  • Yong Zhang

Implementation

The custom function changed in task.py is format_example. Each example in our dataset consists of context, question, target, and target_options. This function formats an example so that the context, question, and target_options are combined into the input field.
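A minimal sketch of such a format_example override is shown below; the prompt wording and the A–D labeling are illustrative assumptions, not necessarily the exact ORQA implementation.

```python
from typing import Any, Dict

def format_example(self, example: Dict[str, Any]) -> Dict[str, Any]:
    # Combine context, question, and the answer options into one "input"
    # string, and pass the gold answer through as the "target".
    options = "\n".join(
        f"{letter}. {option}"
        for letter, option in zip("ABCD", example["target_options"])
    )
    input_text = (
        f"{example['context']}\n\n"
        f"Question: {example['question']}\n"
        f"Options:\n{options}\n"
        f"Answer:"
    )
    return {"input": input_text, "target": example["target"]}
```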

Usage

The evaluation function can be run using the default method.
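For reference, a minimal sketch of that default prompt-based flow as we understand it; my_model_generate is a placeholder for any model call, and the task id and the {"target": ...} prediction format are assumptions on our side.

```python
from genbench import load_task
from genbench.api import PreparationStrategy

# Load the task and build the 0-shot prompt-based test set.
task = load_task("operationsresearchqa")
ds = task.get_prepared_datasets(
    PreparationStrategy.PROMPT_BASED_TESTING, shot_list=[0]
)[0]

# my_model_generate is a placeholder for whatever LLM call is used.
predictions = [{"target": my_model_generate(ex["input"])} for ex in ds]

results = task.evaluate_predictions(predictions=predictions, gold=ds)
print(results)
```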

Checklist:

  • I and my co-authors agree that, if this PR is merged, the code will be available under the same license as the genbench_cbt repository.
  • Prior to submitting, I have run the GenBench CBT test suite using the genbench-cli test-task tool.
  • I have read the description of what should be in the doc.md of my task, and have added the required arguments.
  • I have submitted or will submit an accompanying paper to the GenBench workshop.

@OpResearchQA
Author

Dear GenBench Organizers,

ORQA is a multi-choice QA (MCQA) task.

  1. What is the standard method for evaluation that you would prefer for MCQA? There is a lot of variance across different runs, and https://arxiv.org/pdf/2210.12353.pdf reports PPA to quantify this issue. Should we instead run inference three times with three pre-set seeds and report the average and variance of the runs? Or are we over-thinking this, and should we simply expect users of the benchmark to provide an output file with their predictions, without considering how they obtained it?

  2. There were some functions within /src/genbench/task.py that we had to change to make the evaluation script work. Specifically, "_format_example_for_in_context_learning" and "evaluate_predictions". Is this acceptable?

  3. We are not sure why in "make_nshot_dataset", DatasetSplit.VALIDATION had to be replaced explicitly with "validation" for it to detect the key in the formatted_dataset. We will look into that, but were wondering if you or any other participants reported this issue.

  4. Finally, we are going to separate this into three tasks: (1) zero-shot, (2) 3-shot, and (3) chain-of-thought. How would you recommend we submit this in the PR? We read in the comments on other PRs that each new task has to be a new PR. However, we also recall you mentioning that a change in the number of shots is recommended not to constitute a new task. We would appreciate any help/advice.

Thank you,
ORQA Team

@vernadankers
Contributor

Dear ORQA team,

  1. That is up to you, and please explain your rationale in the paper that you submit along with the dataset.
  2. Yes, that is perfectly acceptable! We gave you the opportunity to customise the task.py for this very reason, since not all submissions fit the same technical setup.
  3. @kazemnejad Would you be able to look into that?
  4. We created the option to have a task with subtasks, see https://github.com/GenBench/genbench_cbt#task-with-subtasks in the README, perhaps you can create three different subtasks for zero-shot, 3-shot and chain-of-thought?

We are getting quite close to the deadline (September 1, 11:59PM anywhere on earth), so if your PR needs any final changes, please make them now,
and don't forget to submit your accompanying paper to Openreview via https://openreview.net/group?id=GenBench.org/2023/Workshop by September 1.

Good luck finalising your PR and paper, feel free to tag us if you have questions.
Cheers, Verna
On behalf of the GenBench team

@OpResearchQA
Author

Thank you for your help, Verna!

We validated the test task using

$ genbench-cli test-task --id my_awesome_task

It skips 3 tests, passes 2, and fails one: "test_perfect_score_for_gold_instances". This is because we have changed the "get_prepared_datasets" function in the task.py file. We are currently using a two-stage prompting method, and the way this test checks that the gold instances yield a perfect score no longer retrieves the labels from the dataset properly.

Should we rewrite the task.py file so that this test passes? Alternatively, we could write our own custom test script to verify that perfect predictions result in perfect scores.
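Such a custom sanity check could look roughly like the following; it assumes evaluate_predictions accepts predictions as {"target": ...} dictionaries and that the prepared dataset still exposes the gold target field, both of which are assumptions on our side.

```python
from genbench import load_task
from genbench.api import PreparationStrategy

task = load_task("operationsresearchqa")
ds = task.get_prepared_datasets(
    PreparationStrategy.PROMPT_BASED_TESTING, shot_list=[0]
)[0]

# Feed the gold targets back in as "predictions" ...
gold_predictions = [{"target": example["target"]} for example in ds]
results = task.evaluate_predictions(predictions=gold_predictions, gold=ds)

# ... and check that every reported metric is at its maximum.
print(results)  # e.g. accuracy should be 1.0
```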

Please let us know how to proceed.

Best,
ORQA Team

@kazemnejad
Contributor

  1. We are not sure why in "make_nshot_dataset", DatasetSplit.VALIDATION had to be replaced explicitly with "validation" for it to detect the key in the formatted_dataset. We will look into that, but were wondering if you or any other participants reported this issue.

Thanks for bringing up the issue. We'll fix this.

For editing the _format_examples.. and evaluate_prediction, please override these methods in your own task implementation. The framework files should remain untouched.

We are currently doing a two-stage prompting

Could you please elaborate on the two-stage prompting you have in mind?

@OpResearchQA
Author

cc @kazemnejad
Dear GenBench organizers,

  1. Our ORQA benchmark focuses on a multiple-choice question answering (MCQA) task, for which we used two different prompting approaches as baselines:
  • Standard prompting uses a single prompt (in both 0-shot and few-shot variants), since the model is directly prompted to output the correct answer as a single letter among the choices (A, B, C, or D).
  • Chain of Thought follows a two-stage prompting approach (in both 0-shot and few-shot variants): in the first stage, the model is prompted to generate the reasoning steps; then, in the second stage, the reasoning steps are added to a second prompt for the model to produce the MCQA answer. We follow prior works that suggested this two-stage prompting approach for MCQA:
    https://arxiv.org/pdf/2205.11916.pdf
    https://arxiv.org/pdf/2304.13007.pdf

In summary, the two-stage prompting protocol applies only to the Chain-of-Thought subtask.
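For illustration, the two-stage protocol can be sketched as follows; the generate callable and the exact prompt wording are placeholders, not the ORQA prompts.

```python
def two_stage_cot_answer(generate, context, question, options):
    # `generate` stands in for any LLM call.
    mcq = (
        f"{context}\n\nQuestion: {question}\n"
        + "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
    )
    # Stage 1: elicit the reasoning steps.
    reasoning = generate(f"{mcq}\nLet's think step by step.")
    # Stage 2: append the reasoning and ask for the final answer letter.
    answer = generate(
        f"{mcq}\nLet's think step by step.\n{reasoning}\n"
        "Therefore, among A, B, C and D, the answer is"
    )
    return answer.strip()
```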

Please feel free to refer to this comment for more context on the task https://github.com/GenBench/genbench_cbt/pull/22#issuecomment-1698004423

For editing the _format_examples.. and evaluate_prediction, please override these methods in your own task implementation. The framework files should remain untouched.

Understood, we will not change the framework files and will instead override the methods in our own task implementation. One related question: the changes we are making to these two methods are common to both subtasks. Is it OK if we create a task.py file in the parent task folder, add these methods there, and then make the corresponding task classes in the subtask folders inherit from the task class in the parent folder? We want to do this to avoid code repetition.
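For concreteness, the layout we have in mind would be roughly the following; class names, method signatures, and import paths are our assumptions and would follow whatever conventions the framework requires.

```python
# operationsresearchqa/task.py (parent) -- shared overrides.
from genbench import Task

class OperationsresearchqaTask(Task):
    def _format_example_for_in_context_learning(self, *args, **kwargs):
        ...  # shared formatting logic for both subtasks

    def evaluate_predictions(self, *, predictions=None, gold=None):
        ...  # shared MCQA scoring logic


# operationsresearchqa/standard/task.py (subtask) -- inherits the shared behaviour.
from genbench.tasks.operationsresearchqa.task import OperationsresearchqaTask

class OperationsresearchqaStandard(OperationsresearchqaTask):
    pass  # subtask-specific tweaks (e.g. prompt config) would go here
```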

  1. There is an existing bug in the genbench/task.py code. It is described in detail in the following issue:
    https://github.com/GenBench/genbench_cbt/issues/35

Due to this bug, setting n_shots=n gives us n+1 examples in the few-shot prompt. We want to bring this to your attention and request a fix. We would also appreciate your advice on whether, for now, we should ignore the bug and raise the pull request as-is, or add code on our side to compensate for it.
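If compensating on our side turns out to be the way to go, one hypothetical workaround, assuming the off-by-one reported in the issue above holds consistently, would be to request one shot fewer than intended:

```python
from genbench import load_task
from genbench.api import PreparationStrategy

desired_shots = 3
task = load_task("operationsresearchqa")
# Request desired_shots - 1; with the off-by-one, the prompts then contain
# desired_shots in-context examples. Remove once issue #35 is fixed upstream.
ds = task.get_prepared_datasets(
    PreparationStrategy.PROMPT_BASED_TESTING, shot_list=[desired_shots - 1]
)[0]
```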
