[Task Submission] mmlusr (mmlusr) #3

Open · wants to merge 1 commit into main
54 changes: 54 additions & 0 deletions src/genbench/tasks/mmlusr/config.jsonnet
@@ -0,0 +1,54 @@
{
name: 'mmlusr',

description: 'MMLU-SR aims to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance in question-answering tasks with modified terms.',

keywords: [
'LLMs',
'Benchmarks',
'Dataset',
'Reasoning'
],

authors: [
'Wentian Wang',
'Sarthak Jain',
'Paul Kantor',
'Jacob Feldman',
'Lazaros Gallos',
'Hao Wang'
],

data_source: {
type: 'hf',
hf_id: [
'NiniCat/MMLU-SR',
'answer_only_abstract_algebra' // switch to another subset by changing this to a different config name, e.g. 'question_and_answer_abstract_algebra'
],
git_commit_sha: '505322b292ac81cc83c0942c2d2930af5ba31068'
},

has_validation_set: false,
has_train_set: true,

task_type: 'multiple_choice',

evaluation_metrics: [
{
hf_id: 'accuracy',
best_score: 1.0,
git_commit_sha: '330abb383de68be32352dd876716f644bc71c1e5',
}
],

preparation_strategies: {
prompt_based_testing: {
prompt_builder: {
instruction_zero_shot: 'Please respond to each question with \'Answer: <letter>\' where <letter> is the correct choice. Avoid additional explanations.\n\n',
instruction_few_shot: 'Follow the given examples and answer the question. Please respond to each question with \'Answer: <letter>\' where <letter> is the correct choice. Avoid additional explanations.\n\n',
input_prefix: 'Q: ',
output_prefix: '\nA: '
}
}
}
}
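
For orientation, here is a minimal sketch of how the `prompt_builder` fields above compose into a zero-shot prompt. The concatenation order is an assumption for illustration; GenBench's own prompt builder performs the actual assembly.

```python
# Minimal sketch, assuming simple concatenation of the prompt_builder fields;
# GenBench's actual prompt builder handles the real assembly.
instruction_zero_shot = (
    "Please respond to each question with 'Answer: <letter>' where <letter> "
    "is the correct choice. Avoid additional explanations.\n\n"
)
input_prefix = "Q: "
output_prefix = "\nA: "

# Hypothetical question text for illustration.
question = "Which of the following best describes the human body's defense mechanism?"

prompt = instruction_zero_shot + input_prefix + question + output_prefix
print(prompt)
```
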
19 changes: 19 additions & 0 deletions src/genbench/tasks/mmlusr/doc.md
@@ -0,0 +1,19 @@
# mmlusr

## Abstract
*We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance in question-answering tasks with modified terms. We reasoned that an agent that "truly" understands a concept can still evaluate it when key terms are replaced by suitably defined alternate terms, and sought to differentiate such comprehension from mere text replacement. In our study, we modified standardized test questions by replacing a key term with a dummy word along with its definition. The key term could be in the context of questions, answers, or both questions and answers. Notwithstanding the high scores achieved by recent popular LLMs on the MMLU leaderboard, we found a substantial reduction in model performance after such replacement, suggesting poor comprehension. This new dataset provides a rigorous benchmark for testing true model comprehension, and poses a challenge to the broader scientific community.*

## Examples
*"Suppose 'Dummy' means 'ubiquitous, mostly free-living organisms often consisting of one biological cell.' Which of the following best describes the human body's defense mechanism against environmental Dummy?", Hair in the nose,"Suppose 'Queen' means 'The moist, inner lining of some organs and body cavities' Queen",Suppose 'Noise' means 'cells that form new bones and grow and heal existing bones.' Noise,Suppose 'Bard' means 'an extracellular fluid produced and secreted by salivary glands in the mouth.' Bard,B*

## Usage
*Please see our Git repository: https://github.com/Wang-ML-Lab/MMLU-SR*

## Data Source
*The dataset can be retrieved from either Hugging Face or GitHub. HF: NiniCat/MMLU-SR. Git: https://github.com/Wang-ML-Lab/MMLU-SR*
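
A minimal loading sketch using the Hugging Face `datasets` library and the subset name from `config.jsonnet` (the available splits are an assumption to verify on the Hub):

```python
from datasets import load_dataset

# Load the 'answer only' abstract algebra subset named in config.jsonnet;
# other subsets such as 'question_and_answer_abstract_algebra' follow the
# same pattern.
ds = load_dataset("NiniCat/MMLU-SR", "answer_only_abstract_algebra")
print(ds)
```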

## Limitations and Bias
*NA*

## GenBench Eval card
*There are dev and test splits; the dev set is used for few-shot prompting, and the test set is the actual evaluation set.*
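
As a sketch of that split usage, a few-shot prompt would prepend dev-set examples under the config's `instruction_few_shot` before the test question. The assembly below is illustrative and the example rows are hypothetical; GenBench's prompt builder does this internally.

```python
# Illustrative few-shot assembly; GenBench's real prompt builder may differ.
instruction_few_shot = (
    "Follow the given examples and answer the question. Please respond to "
    "each question with 'Answer: <letter>' where <letter> is the correct "
    "choice. Avoid additional explanations.\n\n"
)

# Hypothetical dev-set pairs of (question, answer letter).
dev_examples = [
    ("Suppose 'Dummy' means '...'. What is Dummy?", "C"),
    ("Suppose 'Bard' means '...'. Which option is Bard?", "A"),
]
test_question = "Suppose 'Queen' means '...'. What describes Queen?"

shots = "".join(f"Q: {q}\nA: Answer: {a}\n\n" for q, a in dev_examples)
prompt = instruction_few_shot + shots + f"Q: {test_question}\nA: "
print(prompt)
```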
18 changes: 18 additions & 0 deletions src/genbench/tasks/mmlusr/task.py
@@ -0,0 +1,18 @@
from typing import Any, Dict

from genbench import Task


class MMLUSRTask(Task):
    def format_example(self, example: Dict[str, Any]) -> Dict[str, Any]:
        """Map a raw MMLU-SR row to GenBench's multiple-choice format."""
        options = [example["choice1"], example["choice2"], example["choice3"], example["choice4"]]

        # The answer may be stored as a 0-based index (0-3)...
        if isinstance(example["answer"], (int, float)):
            target = int(example["answer"])
        else:
            # ...or as a letter (A-D); normalize whitespace and case before mapping.
            answer_map = {"A": 0, "B": 1, "C": 2, "D": 3}
            target = answer_map[example["answer"].strip().upper()]

        return {"input": example["question"], "target": target, "target_options": options}
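
A quick sanity check of `format_example` on a hand-written row shaped like the dataset's columns (the field values are hypothetical; since the method does not touch `self`, it can be invoked unbound here):

```python
# Hypothetical row mirroring the column names used in format_example.
row = {
    "question": "Suppose 'Queen' means 'the moist, inner lining of some organs.' What protects against bacteria?",
    "choice1": "Hair in the nose",
    "choice2": "Queen",
    "choice3": "Noise",
    "choice4": "Bard",
    "answer": "B",
}

# format_example does not use self, so it can be called unbound for a check;
# normal use goes through GenBench's task-loading machinery.
print(MMLUSRTask.format_example(None, row))
# Expected: {'input': '...', 'target': 1, 'target_options': [...]}
```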