Paper claims there are 10 choices but the test split has a varying number of choices (anywhere from 3 to 10) #24

Open
eldarkurtic opened this issue Sep 25, 2024 · 5 comments

Comments

@eldarkurtic

Hi folks, thanks for creating the dataset.
In your paper and the dataset card, you claim that MMLU-Pro has 10 choices for each question, which seems to be false.
By opening the Viewer tab and selecting the test split, one can see that only 83% of the questions have 10 choices; the remaining ones have anywhere from 3 to 10.
What is happening here?
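
For reference, here is a quick way to reproduce the count locally (a sketch assuming the dataset is hosted at TIGER-Lab/MMLU-Pro and stores the choices in an "options" column):

```python
from collections import Counter

from datasets import load_dataset

# Assumption: the dataset lives at TIGER-Lab/MMLU-Pro on the Hub and
# keeps the answer choices in an "options" column.
ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

counts = Counter(len(row["options"]) for row in ds)
total = sum(counts.values())
for n_options, n_questions in sorted(counts.items()):
    print(f"{n_options} options: {n_questions} questions ({100 * n_questions / total:.1f}%)")
```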

@eldarkurtic eldarkurtic changed the title Paper claims there are 10-choices but the test split has varying number from choices (anywhere from 3 to 10) Paper claims there are 10-choices but the test split has varying number of choices (anywhere from 3 to 10) Sep 25, 2024
@wenhuchen
Contributor

Hi there, our paper says that the questions are augmented to 10 options. Our strict human and machine quality checks then remove the low-quality options, which is why 17% of the questions were impacted.

@eldarkurtic
Author

Do you plan to remove those 17% from the HF Hub, or do you plan to augment them somehow to get 10 choices?
I am asking because the Open LLM Leaderboard v2 uses MMLU-Pro as one of its tasks, and the normalization of scores is affected: it was implemented following the claim that each question has 10 choices (so scores were normalized against a random baseline of 1/10).

We have an ongoing discussion about it here https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/947#66f3e0bae2e1cb781da1c769

@wenhuchen
Contributor

I see. A simple solution would be to pad the 17% of questions with "N/A" options so that each question physically reaches 10 options. Would that approach work for the normalization strategy?
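
For illustration, the padding could look something like this (a sketch; the "options" column name, the Hub path, and the `pad_options` helper are assumptions for this example, not existing repo code):

```python
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")  # assumed Hub path

def pad_options(example, target=10, filler="N/A"):
    # Append "N/A" entries until the question has exactly `target` options.
    missing = target - len(example["options"])
    example["options"] = example["options"] + [filler] * missing
    return example

padded = ds.map(pad_options)
assert all(len(row["options"]) == 10 for row in padded)
```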

@eldarkurtic
Author

Unfortunately no, because the idea of normalization is to subtract the random-baseline accuracy first and then rescale the result back to 0-100 (more details here: https://huggingface.co/spaces/open-llm-leaderboard/blog).
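
In pseudocode, the normalization roughly works like this (a minimal sketch of the idea described in that blog post, not the leaderboard's exact implementation):

```python
def normalize(acc: float, num_choices: int) -> float:
    """Subtract the random-baseline accuracy and rescale to 0-100."""
    baseline = 1.0 / num_choices
    return 100.0 * (acc - baseline) / (1.0 - baseline)

# A random guesser on a true 4-choice question scores ~0.25 raw accuracy,
# which should normalize to 0 -- but under the 10-choice assumption it doesn't:
print(normalize(0.25, num_choices=4))   # 0.0
print(normalize(0.25, num_choices=10))  # ~16.7
```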

@wenhuchen
Contributor

I have read the blog. I still think that padding the 17% of questions with "N/A" options up to 10 options should work. Would you mind pointing out what the issue is here?
