[Refactor] Mmlu subgroups and weight avg #922
lintangsutawika commented Oct 16, 2023
- Further splits MMLU into 4 subcategories instead of just aggregating over all subtasks.
- Add weighted averaging and stderr.
Output
…ation-harness into mmlu_subgroups
The print-out is very ugly. I get that ASCII-art tables aren't the most aesthetic thing in the world, but I suspect we can do better than this. I think it would look a lot better if we replaced the [...] I also think that the "main results table" should show the grouped scores, not the score breakdown, as the grouped scores are typically the thing we actually want.
The dash is supposed to denote the task's place in the hierarchy (whether it's part of a group). The group table is at the bottom to make it easier to inspect: since it's printed last, you don't have to scroll up to find it. It's not an issue if you think we should put it above, but my preference is to print it after the full task table. As for the naming, it's understandable that the 'mmlu' prefix might be too crowded, but it's a by-product of how each task has to be uniquely identified. For the default mmlu tasks, I suppose we can remove the prefix.
@StellaAthena, simpler table presentation through the addition of
Looks good to me! Has this been added to the documentation as well?
How do I use this PR to evaluate all MMLU tasks? Could anyone provide an example command line? I have updated to the latest version.
You should be able to run via simply
What are the weights on each subject for the weighted average?
They're weighted by the number of samples in each subject.
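To make the weighting concrete, here is a minimal sketch of a sample-count-weighted average. It is not taken from this PR's code: the function names and the numbers are hypothetical, and the standard-error formula assumes the per-subject estimates are independent, which may not match how the harness aggregates stderr.

```python
import math


def weighted_mean(scores, sizes):
    """Average per-subject scores, weighting each by its sample count."""
    total = sum(sizes)
    return sum(score * n for score, n in zip(scores, sizes)) / total


def weighted_mean_stderr(stderrs, sizes):
    """Standard error of the weighted mean above.

    Assumption (not from the PR): subject-level estimates are independent,
    so the variances add with squared weights (n_i / N) ** 2.
    """
    total = sum(sizes)
    variance = sum((n / total) ** 2 * se ** 2 for se, n in zip(stderrs, sizes))
    return math.sqrt(variance)


# Hypothetical subjects: (accuracy, stderr, number of samples)
accs = [0.50, 0.80]
stderrs = [0.05, 0.02]
sizes = [100, 300]

group_acc = weighted_mean(accs, sizes)
group_se = weighted_mean_stderr(stderrs, sizes)
print(f"weighted acc = {group_acc:.4f}, stderr = {group_se:.4f}")
```

With these made-up numbers the larger subject dominates, so the weighted accuracy sits closer to 0.80 than the unweighted mean of 0.65 would.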
@lintangsutawika in the latest code, this average score seems to be gone. Any guidance on the current way to get the MMLU average score?
Could you tell me more about what the issue might be? Also, it might be better to open a new issue for this.
@lintangsutawika basically, I would like to have the MMLU weighted average in the report (exactly what this PR does), but the latest main seems to drop this info (potentially reverted?). Here is what I have now (notice there is no weighted average like the one you showed above):
So I'm wondering what the current way of getting what you have above is. Happy to create a new issue.
Opened an issue: #1415
Just fixed in #1414!