
[Refactor] Mmlu subgroups and weight avg #922

Merged
merged 23 commits into big-refactor from mmlu_subgroups on Nov 6, 2023

Conversation

lintangsutawika
Contributor

  1. Further splits MMLU into 4 subcategories instead of just aggregating over all subtasks.
  2. Adds weighted averaging and stderr.
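
To illustrate what sample-size-weighted aggregation means here, a minimal sketch (not the harness's actual implementation; the task names, counts, and the pooled-stderr convention are illustrative assumptions):

```python
import math

# Hypothetical per-subtask results: (accuracy, stderr, number of samples).
subtasks = {
    "mmlu_formal_logic": (0.2857, 0.0404, 126),
    "mmlu_philosophy":   (0.1961, 0.0226, 311),
}

n_total = sum(n for _, _, n in subtasks.values())

# Weighted average: each subtask contributes in proportion to its sample count.
acc = sum(a * n for a, _, n in subtasks.values()) / n_total

# Pooled stderr of the weighted mean, assuming independent subtask estimates
# (one common convention; the harness's exact formula may differ).
stderr = math.sqrt(sum(((n / n_total) * s) ** 2 for _, s, n in subtasks.values()))

print(f"acc = {acc:.4f} +/- {stderr:.4f}")
```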

Review comment on lm_eval/evaluator.py (outdated, resolved)
@lintangsutawika lintangsutawika linked an issue Oct 16, 2023 that may be closed by this pull request
@lintangsutawika lintangsutawika marked this pull request as ready for review October 17, 2023 14:18
@lintangsutawika
Contributor Author

Output

hf (pretrained=EleutherAI/pythia-2.8b), limit: None, num_fewshot: None, batch_size: 1
|                  Tasks                   |Version|Filter|Metric |  Value   |   |Stderr|
|------------------------------------------|-------|------|-------|---------:|---|-----:|
|boolq                                     |Yaml   |none  |acc    |    0.6465|±  |0.0084|
|                                          |       |      |samples| 3270.0000|   |      |
|mmlu                                      |N/A    |none  |acc    |    0.2522|±  |0.4029|
|                                          |       |      |samples|14042.0000|   |      |
|-mmlu_humanities                          |N/A    |none  |acc    |    0.2349|±  |0.1431|
|--mmlu_formal_logic                       |Yaml   |none  |acc    |    0.2857|±  |0.0404|
|--mmlu_high_school_european_history       |Yaml   |none  |acc    |    0.2485|±  |0.0337|
|--mmlu_high_school_us_history             |Yaml   |none  |acc    |    0.2353|±  |0.0298|
|--mmlu_high_school_world_history          |Yaml   |none  |acc    |    0.2616|±  |0.0286|
|--mmlu_international_law                  |Yaml   |none  |acc    |    0.1983|±  |0.0364|
|--mmlu_jurisprudence                      |Yaml   |none  |acc    |    0.2407|±  |0.0413|
|--mmlu_logical_fallacies                  |Yaml   |none  |acc    |    0.1963|±  |0.0312|
|--mmlu_moral_disputes                     |Yaml   |none  |acc    |    0.2254|±  |0.0225|
|--mmlu_moral_scenarios                    |Yaml   |none  |acc    |    0.2425|±  |0.0143|
|--mmlu_philosophy                         |Yaml   |none  |acc    |    0.1961|±  |0.0226|
|--mmlu_prehistory                         |Yaml   |none  |acc    |    0.2716|±  |0.0247|
|--mmlu_professional_law                   |Yaml   |none  |acc    |    0.2288|±  |0.0107|
|--mmlu_world_religions                    |Yaml   |none  |acc    |    0.2398|±  |0.0327|
|-mmlu_other                               |N/A    |none  |acc    |    0.2803|±  |0.1703|
|--mmlu_business_ethics                    |Yaml   |none  |acc    |    0.2200|±  |0.0416|
|--mmlu_clinical_knowledge                 |Yaml   |none  |acc    |    0.2566|±  |0.0269|
|--mmlu_college_medicine                   |Yaml   |none  |acc    |    0.2717|±  |0.0339|
|--mmlu_global_facts                       |Yaml   |none  |acc    |    0.3000|±  |0.0461|
|--mmlu_human_aging                        |Yaml   |none  |acc    |    0.2960|±  |0.0306|
|--mmlu_management                         |Yaml   |none  |acc    |    0.2524|±  |0.0430|
|--mmlu_marketing                          |Yaml   |none  |acc    |    0.2735|±  |0.0292|
|--mmlu_medical_genetics                   |Yaml   |none  |acc    |    0.2700|±  |0.0446|
|--mmlu_miscellaneous                      |Yaml   |none  |acc    |    0.2771|±  |0.0160|
|--mmlu_nutrition                          |Yaml   |none  |acc    |    0.2288|±  |0.0241|
|--mmlu_professional_accounting            |Yaml   |none  |acc    |    0.2695|±  |0.0265|
|--mmlu_professional_medicine              |Yaml   |none  |acc    |    0.4118|±  |0.0299|
|--mmlu_virology                           |Yaml   |none  |acc    |    0.2771|±  |0.0348|
|-mmlu_social_sciences                     |N/A    |none  |acc    |    0.2398|±  |0.1615|
|--mmlu_econometrics                       |Yaml   |none  |acc    |    0.2193|±  |0.0389|
|--mmlu_high_school_geography              |Yaml   |none  |acc    |    0.1970|±  |0.0283|
|--mmlu_high_school_government_and_politics|Yaml   |none  |acc    |    0.2487|±  |0.0312|
|--mmlu_high_school_macroeconomics         |Yaml   |none  |acc    |    0.2128|±  |0.0208|
|--mmlu_high_school_microeconomics         |Yaml   |none  |acc    |    0.2353|±  |0.0276|
|--mmlu_high_school_psychology             |Yaml   |none  |acc    |    0.2734|±  |0.0191|
|--mmlu_human_sexuality                    |Yaml   |none  |acc    |    0.2137|±  |0.0360|
|--mmlu_professional_psychology            |Yaml   |none  |acc    |    0.2500|±  |0.0175|
|--mmlu_public_relations                   |Yaml   |none  |acc    |    0.3636|±  |0.0461|
|--mmlu_security_studies                   |Yaml   |none  |acc    |    0.1918|±  |0.0252|
|--mmlu_sociology                          |Yaml   |none  |acc    |    0.2289|±  |0.0297|
|--mmlu_us_foreign_policy                  |Yaml   |none  |acc    |    0.2400|±  |0.0429|
|-mmlu_stem                                |N/A    |none  |acc    |    0.2623|±  |0.1835|
|--mmlu_abstract_algebra                   |Yaml   |none  |acc    |    0.3000|±  |0.0461|
|--mmlu_anatomy                            |Yaml   |none  |acc    |    0.2667|±  |0.0382|
|--mmlu_astronomy                          |Yaml   |none  |acc    |    0.2434|±  |0.0349|
|--mmlu_college_biology                    |Yaml   |none  |acc    |    0.2847|±  |0.0377|
|--mmlu_college_chemistry                  |Yaml   |none  |acc    |    0.2300|±  |0.0423|
|--mmlu_college_computer_science           |Yaml   |none  |acc    |    0.2800|±  |0.0451|
|--mmlu_college_mathematics                |Yaml   |none  |acc    |    0.2700|±  |0.0446|
|--mmlu_college_physics                    |Yaml   |none  |acc    |    0.3431|±  |0.0472|
|--mmlu_computer_security                  |Yaml   |none  |acc    |    0.2900|±  |0.0456|
|--mmlu_conceptual_physics                 |Yaml   |none  |acc    |    0.2979|±  |0.0299|
|--mmlu_electrical_engineering             |Yaml   |none  |acc    |    0.2069|±  |0.0338|
|--mmlu_elementary_mathematics             |Yaml   |none  |acc    |    0.2619|±  |0.0226|
|--mmlu_high_school_biology                |Yaml   |none  |acc    |    0.2581|±  |0.0249|
|--mmlu_high_school_chemistry              |Yaml   |none  |acc    |    0.2512|±  |0.0305|
|--mmlu_high_school_computer_science       |Yaml   |none  |acc    |    0.2400|±  |0.0429|
|--mmlu_high_school_mathematics            |Yaml   |none  |acc    |    0.2630|±  |0.0268|
|--mmlu_high_school_physics                |Yaml   |none  |acc    |    0.2517|±  |0.0354|
|--mmlu_high_school_statistics             |Yaml   |none  |acc    |    0.2269|±  |0.0286|
|--mmlu_machine_learning                   |Yaml   |none  |acc    |    0.2589|±  |0.0416|

|       Groups        |Version|Filter|Metric |  Value   |   |Stderr|
|---------------------|-------|------|-------|---------:|---|-----:|
|mmlu                 |N/A    |none  |acc    |    0.2522|±  |0.4029|
|                     |       |      |samples|14042.0000|   |      |
|-mmlu_humanities     |N/A    |none  |acc    |    0.2349|±  |0.1431|
|-mmlu_other          |N/A    |none  |acc    |    0.2803|±  |0.1703|
|-mmlu_social_sciences|N/A    |none  |acc    |    0.2398|±  |0.1615|
|-mmlu_stem           |N/A    |none  |acc    |    0.2623|±  |0.1835|

@StellaAthena
Member

The print-out is very ugly. I get that ASCII-art tables aren't the most aesthetic thing in the world, but I suspect we can do better than this. I think it would look a lot better if we replaced the `-` with indentation and dropped the leading `mmlu_` from each component task, for example.

I also think that the "main results table" should show the grouped scores, not the score breakdown, as the grouped scores are typically the thing we actually want.

@lintangsutawika
Contributor Author

The dash is supposed to denote the hierarchy of the task (if it's part of a group).

The group table at the bottom is meant to make inspection easier: since it's printed last, you don't have to scroll up to find it. It's not an issue if you think we should put it above, but my preference is that printing it after the full task table makes it easier to inspect.

As for the naming, I understand the mmlu_ prefix can make the table crowded, but it's a by-product of how each task has to be uniquely identified. For the default MMLU tasks, I suppose we can remove the prefix.

@StellaAthena StellaAthena added this to the v0.3.0 milestone Nov 1, 2023
@lintangsutawika
Contributor Author

@StellaAthena, I've simplified the table presentation by adding task_alias and group_alias fields that a user can set to be displayed in place of the task_name, which has to be unique across all tasks in lm-eval.
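
For example, a task's YAML config might set these fields roughly as follows (a sketch only: the task_alias and group_alias fields are from this PR, but the surrounding schema and values are assumptions), producing the tidier output below:

```python
import yaml  # PyYAML, used here just to show the snippet parses

# Hypothetical task config: the aliases are what the results table displays,
# while the unique names still identify the task internally.
task_config = yaml.safe_load("""
group: mmlu_stem
group_alias: stem
task: mmlu_abstract_algebra
task_alias: abstract_algebra
""")

print(task_config["task"], "->", task_config["task_alias"])
# mmlu_abstract_algebra -> abstract_algebra
```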

hf (pretrained=EleutherAI/pythia-2.8b), limit: None, num_fewshot: None, batch_size: 1
|                Tasks                 |Version|Filter|Metric |  Value   |   |Stderr|
|--------------------------------------|-------|------|-------|---------:|---|-----:|
|mmlu                                  |N/A    |none  |acc    |    0.2522|±  |0.0407|
|                                      |       |      |samples|14042.0000|   |      |
| -humanities                          |N/A    |none  |acc    |    0.2349|±  |0.0289|
|  -formal_logic                       |Yaml   |none  |acc    |    0.2857|±  |0.0404|
|  -high_school_european_history       |Yaml   |none  |acc    |    0.2485|±  |0.0337|
|  -high_school_us_history             |Yaml   |none  |acc    |    0.2353|±  |0.0298|
|  -high_school_world_history          |Yaml   |none  |acc    |    0.2616|±  |0.0286|
|  -international_law                  |Yaml   |none  |acc    |    0.1983|±  |0.0364|
|  -jurisprudence                      |Yaml   |none  |acc    |    0.2407|±  |0.0413|
|  -logical_fallacies                  |Yaml   |none  |acc    |    0.1963|±  |0.0312|
|  -moral_disputes                     |Yaml   |none  |acc    |    0.2254|±  |0.0225|
|  -moral_scenarios                    |Yaml   |none  |acc    |    0.2425|±  |0.0143|
|  -philosophy                         |Yaml   |none  |acc    |    0.1961|±  |0.0226|
|  -prehistory                         |Yaml   |none  |acc    |    0.2716|±  |0.0247|
|  -professional_law                   |Yaml   |none  |acc    |    0.2288|±  |0.0107|
|  -world_religions                    |Yaml   |none  |acc    |    0.2398|±  |0.0327|
| -other                               |N/A    |none  |acc    |    0.2803|±  |0.0490|
|  -business_ethics                    |Yaml   |none  |acc    |    0.2200|±  |0.0416|
|  -clinical_knowledge                 |Yaml   |none  |acc    |    0.2566|±  |0.0269|
|  -college_medicine                   |Yaml   |none  |acc    |    0.2717|±  |0.0339|
|  -global_facts                       |Yaml   |none  |acc    |    0.3000|±  |0.0461|
|  -human_aging                        |Yaml   |none  |acc    |    0.2960|±  |0.0306|
|  -management                         |Yaml   |none  |acc    |    0.2524|±  |0.0430|
|  -marketing                          |Yaml   |none  |acc    |    0.2735|±  |0.0292|
|  -medical_genetics                   |Yaml   |none  |acc    |    0.2700|±  |0.0446|
|  -miscellaneous                      |Yaml   |none  |acc    |    0.2771|±  |0.0160|
|  -nutrition                          |Yaml   |none  |acc    |    0.2288|±  |0.0241|
|  -professional_accounting            |Yaml   |none  |acc    |    0.2695|±  |0.0265|
|  -professional_medicine              |Yaml   |none  |acc    |    0.4118|±  |0.0299|
|  -virology                           |Yaml   |none  |acc    |    0.2771|±  |0.0348|
| -social_sciences                     |N/A    |none  |acc    |    0.2398|±  |0.0390|
|  -econometrics                       |Yaml   |none  |acc    |    0.2193|±  |0.0389|
|  -high_school_geography              |Yaml   |none  |acc    |    0.1970|±  |0.0283|
|  -high_school_government_and_politics|Yaml   |none  |acc    |    0.2487|±  |0.0312|
|  -high_school_macroeconomics         |Yaml   |none  |acc    |    0.2128|±  |0.0208|
|  -high_school_microeconomics         |Yaml   |none  |acc    |    0.2353|±  |0.0276|
|  -high_school_psychology             |Yaml   |none  |acc    |    0.2734|±  |0.0191|
|  -human_sexuality                    |Yaml   |none  |acc    |    0.2137|±  |0.0360|
|  -professional_psychology            |Yaml   |none  |acc    |    0.2500|±  |0.0175|
|  -public_relations                   |Yaml   |none  |acc    |    0.3636|±  |0.0461|
|  -security_studies                   |Yaml   |none  |acc    |    0.1918|±  |0.0252|
|  -sociology                          |Yaml   |none  |acc    |    0.2289|±  |0.0297|
|  -us_foreign_policy                  |Yaml   |none  |acc    |    0.2400|±  |0.0429|
| -stem                                |N/A    |none  |acc    |    0.2623|±  |0.0412|
|  -abstract_algebra                   |Yaml   |none  |acc    |    0.3000|±  |0.0461|
|  -anatomy                            |Yaml   |none  |acc    |    0.2667|±  |0.0382|
|  -astronomy                          |Yaml   |none  |acc    |    0.2434|±  |0.0349|
|  -college_biology                    |Yaml   |none  |acc    |    0.2847|±  |0.0377|
|  -college_chemistry                  |Yaml   |none  |acc    |    0.2300|±  |0.0423|
|  -college_computer_science           |Yaml   |none  |acc    |    0.2800|±  |0.0451|
|  -college_mathematics                |Yaml   |none  |acc    |    0.2700|±  |0.0446|
|  -college_physics                    |Yaml   |none  |acc    |    0.3431|±  |0.0472|
|  -computer_security                  |Yaml   |none  |acc    |    0.2900|±  |0.0456|
|  -conceptual_physics                 |Yaml   |none  |acc    |    0.2979|±  |0.0299|
|  -electrical_engineering             |Yaml   |none  |acc    |    0.2069|±  |0.0338|
|  -elementary_mathematics             |Yaml   |none  |acc    |    0.2619|±  |0.0226|
|  -high_school_biology                |Yaml   |none  |acc    |    0.2581|±  |0.0249|
|  -high_school_chemistry              |Yaml   |none  |acc    |    0.2512|±  |0.0305|
|  -high_school_computer_science       |Yaml   |none  |acc    |    0.2400|±  |0.0429|
|  -high_school_mathematics            |Yaml   |none  |acc    |    0.2630|±  |0.0268|
|  -high_school_physics                |Yaml   |none  |acc    |    0.2517|±  |0.0354|
|  -high_school_statistics             |Yaml   |none  |acc    |    0.2269|±  |0.0286|
|  -machine_learning                   |Yaml   |none  |acc    |    0.2589|±  |0.0416|

|     Groups      |Version|Filter|Metric |  Value   |   |Stderr|
|-----------------|-------|------|-------|---------:|---|-----:|
|mmlu             |N/A    |none  |acc    |    0.2522|±  |0.0407|
|                 |       |      |samples|14042.0000|   |      |
| -humanities     |N/A    |none  |acc    |    0.2349|±  |0.0289|
| -other          |N/A    |none  |acc    |    0.2803|±  |0.0490|
| -social_sciences|N/A    |none  |acc    |    0.2398|±  |0.0390|
| -stem           |N/A    |none  |acc    |    0.2623|±  |0.0412|

@StellaAthena
Member

StellaAthena commented Nov 2, 2023

Looks good to me! Has this been added to the documentation as well?

@lintangsutawika lintangsutawika merged commit 815f59e into big-refactor Nov 6, 2023
3 of 8 checks passed
@lintangsutawika lintangsutawika deleted the mmlu_subgroups branch November 6, 2023 13:47
@fancyerii

How do I use this PR to evaluate all MMLU tasks? Can anyone provide an example command line? I have updated to the latest version.

@haileyschoelkopf
Collaborator

You should be able to run it simply via `python -m lm_eval --model hf --model_args ... --tasks mmlu`!

@JeevanBhoot
Contributor

What are the weights on each subject for the weighted average?

@lintangsutawika
Contributor Author

They're weighted by the number of samples in each subtask.
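
Concretely, with hypothetical numbers:

```python
# Two hypothetical subtasks: (accuracy, sample count).
acc_a, n_a = 0.25, 100
acc_b, n_b = 0.40, 300

# The larger subtask dominates the group average.
group_acc = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
print(group_acc)  # 0.3625
```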

@lchu-ibm
Contributor

lchu-ibm commented Feb 9, 2024

@lintangsutawika in the latest code, this average score seems to be gone. Any guidance on the current way to get the MMLU average score?

@lintangsutawika
Contributor Author

Could you tell me more about what the issue might be? It would also be better to discuss this in a new issue.

@lchu-ibm
Contributor

lchu-ibm commented Feb 9, 2024

@lintangsutawika basically I would like to have the MMLU weighted average in the report (exactly what this PR does), but the latest main seems to drop this info (possibly reverted?). Here is what I have now (note there is no more weighted average, unlike what you showed above):


|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
| - humanities     |N/A    |none  |     0|acc   |0.4740|±  |0.0256|
| - other          |N/A    |none  |     0|acc   |0.4465|±  |0.0318|
| - social_sciences|N/A    |none  |     0|acc   |0.4798|±  |0.0304|
| - stem           |N/A    |none  |     0|acc   |0.3477|±  |0.0366|

|                 Tasks                 |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------------------------|-------|------|-----:|------|-----:|---|-----:|
| - humanities                          |N/A    |none  |     0|acc   |0.4740|±  |0.0256|
|  - formal_logic                       |      0|none  |     0|acc   |0.3095|±  |0.0413|
|  - high_school_european_history       |      0|none  |     0|acc   |0.6000|±  |0.0383|
|  - high_school_us_history             |      0|none  |     0|acc   |0.5490|±  |0.0349|
|  - high_school_world_history          |      0|none  |     0|acc   |0.5865|±  |0.0321|
|  - international_law                  |      0|none  |     0|acc   |0.5868|±  |0.0450|
|  - jurisprudence                      |      0|none  |     0|acc   |0.5185|±  |0.0483|

So I'm wondering what the current way is to get what you showed above; happy to create a new issue.

@lchu-ibm
Contributor

lchu-ibm commented Feb 9, 2024

Opened an issue: #1415

@haileyschoelkopf
Collaborator

Just fixed in #1414!

Linked issue this pull request may close: Add mmlu average score in report