[Task Submission] OperationsResearchQA (operationsresearchqa) #22
Conversation
Dear GenBench Organizers,

ORQA is a multiple-choice QA (MCQA) task.

Thank you,
Dear ORQA team,

We are getting quite close to the deadline (September 1, 11:59PM anywhere on earth), so if your PR needs any final changes, please make them now. Good luck finalising your PR and paper, and feel free to tag us if you have questions.
…d_instances test because of how the `get_prepared_datasets` function was rewritten to use a two-stage prompting approach
Thank you for your help, Verna!

We validated the task using `$ genbench-cli test-task --id my_awesome_task`. It skips 3 tests, passes 2, but fails one: `test_perfect_score_for_gold_instances`. This is because we have changed the `get_prepared_datasets` function in the task.py file. We are currently using a two-stage prompting method, and the way this test validates that the gold instances would result in perfect performance does not retrieve the labels from the dataset properly. Should we rewrite the task.py file to return this? Alternatively, we could write our own custom test script to verify that perfect predictions result in perfect scores (a sketch of such a check follows below). Please let us know how to proceed.

Best,
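A minimal sketch of such a standalone check, assuming the task object exposes `get_datasets` and an `evaluate_predictions`-style method; these names are assumptions for illustration, not the exact genbench_cbt API:

```python
# Hypothetical standalone sanity check: feed the gold targets back in as
# predictions and assert that every reported metric is perfect. The accessors
# and field names (get_datasets, evaluate_predictions, "target") are
# assumptions for illustration, not the exact genbench_cbt API.

def check_perfect_score(task):
    test_set = task.get_datasets()["test"]  # assumed accessor
    gold_predictions = [{"target": ex["target"]} for ex in test_set]
    results = task.evaluate_predictions(predictions=gold_predictions, gold=test_set)
    # Every reported metric should be exactly perfect.
    assert all(abs(score - 1.0) < 1e-9 for score in results.values()), results
```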
Thanks for bringing up the issue. We'll fix this. For editing `_format_examples..` and `evaluate_prediction`, please override these methods in your own task implementation. The framework files should remain untouched.

Could you please elaborate on the two-stage prompting you have in mind?
cc @kazemnejad
In summary, the two-stage prompting protocol applies only to the chain-of-thought subtask. Please feel free to refer to this comment for more context on the task: https://github.com/GenBench/genbench_cbt/pull/22#issuecomment-1698004423
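As an illustration only (the prompt wording and the `generate` callable are placeholders, not the ORQA implementation), the two-stage protocol could be sketched as:

```python
# Illustrative two-stage (chain-of-thought) protocol. `generate` stands in
# for any LLM call; the prompt wording is a placeholder.

def two_stage_answer(generate, context, question, target_options):
    options = "\n".join(f"({i}) {opt}" for i, opt in enumerate(target_options))

    # Stage 1: elicit free-form step-by-step reasoning.
    stage1_prompt = (
        f"{context}\n\nQuestion: {question}\nOptions:\n{options}\n"
        "Let's think step by step."
    )
    reasoning = generate(stage1_prompt)

    # Stage 2: condition on the generated reasoning and ask for the choice.
    stage2_prompt = stage1_prompt + f"\n{reasoning}\n\nTherefore, the answer is option:"
    return generate(stage2_prompt)
```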
Understood, we will not change the framework files and will instead override the methods in our own task implementation. One related question: the changes we are making to these two methods are common to both subtasks. Is it okay if we create a task.py file in the parent task folder, add these methods there, and then have the task classes in the subtask folders inherit from the parent class? We want to do this to avoid code repetition; a sketch of the layout we have in mind is below.
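A minimal sketch of that layout, assuming the usual `from genbench import Task` base class; all class, folder, and method names here are illustrative, not the actual ORQA code:

```python
# operationsresearchqa/task.py  (parent task folder)
from genbench import Task

class ORQABaseTask(Task):
    """Shared overrides used by both subtasks."""

    def format_example(self, example):
        # common formatting logic implemented once here
        ...

    def evaluate_predictions(self, *, predictions=None, gold=None):
        # common evaluation override implemented once here
        ...


# operationsresearchqa/<subtask>/task.py  (each subtask folder)
from ..task import ORQABaseTask

class ORQASubtask(ORQABaseTask):
    pass  # inherits the shared methods unchanged
```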
Due to this bug, setting n_shots=n gives us n+1 examples in the few-shot prompt. We want to bring this to your attention and request that you fix it. Could you also advise whether, for now, we should ignore the bug and raise the pull request as-is, or add extra lines in our code to compensate for it? A possible workaround is sketched below.
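For illustration only, one stop-gap would be to request one fewer shot so the rendered prompt ends up with the intended count; the `shot_list` keyword in the call below is an assumption on our part, not a confirmed framework signature:

```python
def get_prepared_with_workaround(task, preparation_strategy, intended_shots):
    # Hypothetical stop-gap until the off-by-one is fixed upstream: request
    # one fewer shot so the prompt contains the intended number of examples.
    # The shot_list keyword is an assumption for illustration.
    return task.get_prepared_datasets(
        preparation_strategy,
        shot_list=[intended_shots - 1],  # compensates for the extra example
    )
```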
Operations Research QA (ORQA)
ORQA: Can Pretrained Language Models reason about Operations Research?
We propose Operations Research QA (ORQA) as a new benchmark to evaluate the ability of pretrained Large Language Models (LLMs) to generalize to new technical domains. Our benchmark considers the cross-domain shift issue of LLMs and focuses on the multiple-choice question answering task. In our new dataset, the target domain is Operations Research (OR), and the task tests both the language models' domain-specific knowledge and their reasoning skills on optimization modeling. Our dataset is handcrafted by domain experts, i.e., OR experts, and is representative of different types of optimization and various application domains, such as vehicle routing, production planning, or investment portfolio allocation.
Authors
[email protected]
Implementation
The custom function in `task.py` that was changed is `format_example`. Our dataset consists of `context`, `question`, `target`, and `target_options`. This function simply formats each example so that the context, question, and target_options are combined into the `input`
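A minimal sketch of what such a formatting function might look like; the exact prompt template is an assumption, while the field names come from the dataset description above:

```python
def format_example(example):
    # Combine context, question, and lettered answer options into a single
    # prompt string; keep the gold answer as the target. The template below
    # is illustrative, not the exact one used by ORQA.
    options = "\n".join(
        f"{chr(ord('A') + i)}. {opt}"
        for i, opt in enumerate(example["target_options"])
    )
    return {
        "input": (
            f"{example['context']}\n\nQuestion: {example['question']}\n"
            f"Options:\n{options}\n\nAnswer:"
        ),
        "target": example["target"],
    }
```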
field.

Usage
The evaluation function can be run with the default method.
Checklist:
- The task was verified with the `genbench-cli test-task` tool.