
[Refactor] Mmlu subgroups and weight avg #922

Merged
merged 23 commits into big-refactor from mmlu_subgroups on Nov 6, 2023

Conversation

lintangsutawika
Contributor

  1. Further splits MMLU into 4 subcategories instead of just aggregating over all subtasks.
  2. Adds weighted averaging and stderr.
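
To illustrate what sample-size-weighted aggregation means here, a minimal sketch (not the harness's actual implementation; the task names, counts, and the pooled-stderr convention are illustrative assumptions):

```python
import math

# Hypothetical per-subtask results: (accuracy, stderr, number of samples).
subtasks = {
    "mmlu_formal_logic": (0.2857, 0.0404, 126),
    "mmlu_philosophy":   (0.1961, 0.0226, 311),
}

n_total = sum(n for _, _, n in subtasks.values())

# Weighted average: each subtask contributes in proportion to its sample count.
acc = sum(a * n for a, _, n in subtasks.values()) / n_total

# Pooled stderr of the weighted mean, assuming independent subtask estimates
# (one common convention; the harness's exact formula may differ).
stderr = math.sqrt(sum(((n / n_total) * s) ** 2 for _, s, n in subtasks.values()))

print(f"acc = {acc:.4f} +/- {stderr:.4f}")
```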

Review comment on lm_eval/evaluator.py (outdated, resolved)
@lintangsutawika lintangsutawika linked an issue Oct 16, 2023 that may be closed by this pull request
@lintangsutawika lintangsutawika marked this pull request as ready for review October 17, 2023 14:18
@lintangsutawika
Contributor Author

Output

hf (pretrained=EleutherAI/pythia-2.8b), limit: None, num_fewshot: None, batch_size: 1
|                  Tasks                   |Version|Filter|Metric |  Value   |   |Stderr|
|------------------------------------------|-------|------|-------|---------:|---|-----:|
|boolq                                     |Yaml   |none  |acc    |    0.6465|±  |0.0084|
|                                          |       |      |samples| 3270.0000|   |      |
|mmlu                                      |N/A    |none  |acc    |    0.2522|±  |0.4029|
|                                          |       |      |samples|14042.0000|   |      |
|-mmlu_humanities                          |N/A    |none  |acc    |    0.2349|±  |0.1431|
|--mmlu_formal_logic                       |Yaml   |none  |acc    |    0.2857|±  |0.0404|
|--mmlu_high_school_european_history       |Yaml   |none  |acc    |    0.2485|±  |0.0337|
|--mmlu_high_school_us_history             |Yaml   |none  |acc    |    0.2353|±  |0.0298|
|--mmlu_high_school_world_history          |Yaml   |none  |acc    |    0.2616|±  |0.0286|
|--mmlu_international_law                  |Yaml   |none  |acc    |    0.1983|±  |0.0364|
|--mmlu_jurisprudence                      |Yaml   |none  |acc    |    0.2407|±  |0.0413|
|--mmlu_logical_fallacies                  |Yaml   |none  |acc    |    0.1963|±  |0.0312|
|--mmlu_moral_disputes                     |Yaml   |none  |acc    |    0.2254|±  |0.0225|
|--mmlu_moral_scenarios                    |Yaml   |none  |acc    |    0.2425|±  |0.0143|
|--mmlu_philosophy                         |Yaml   |none  |acc    |    0.1961|±  |0.0226|
|--mmlu_prehistory                         |Yaml   |none  |acc    |    0.2716|±  |0.0247|
|--mmlu_professional_law                   |Yaml   |none  |acc    |    0.2288|±  |0.0107|
|--mmlu_world_religions                    |Yaml   |none  |acc    |    0.2398|±  |0.0327|
|-mmlu_other                               |N/A    |none  |acc    |    0.2803|±  |0.1703|
|--mmlu_business_ethics                    |Yaml   |none  |acc    |    0.2200|±  |0.0416|
|--mmlu_clinical_knowledge                 |Yaml   |none  |acc    |    0.2566|±  |0.0269|
|--mmlu_college_medicine                   |Yaml   |none  |acc    |    0.2717|±  |0.0339|
|--mmlu_global_facts                       |Yaml   |none  |acc    |    0.3000|±  |0.0461|
|--mmlu_human_aging                        |Yaml   |none  |acc    |    0.2960|±  |0.0306|
|--mmlu_management                         |Yaml   |none  |acc    |    0.2524|±  |0.0430|
|--mmlu_marketing                          |Yaml   |none  |acc    |    0.2735|±  |0.0292|
|--mmlu_medical_genetics                   |Yaml   |none  |acc    |    0.2700|±  |0.0446|
|--mmlu_miscellaneous                      |Yaml   |none  |acc    |    0.2771|±  |0.0160|
|--mmlu_nutrition                          |Yaml   |none  |acc    |    0.2288|±  |0.0241|
|--mmlu_professional_accounting            |Yaml   |none  |acc    |    0.2695|±  |0.0265|
|--mmlu_professional_medicine              |Yaml   |none  |acc    |    0.4118|±  |0.0299|
|--mmlu_virology                           |Yaml   |none  |acc    |    0.2771|±  |0.0348|
|-mmlu_social_sciences                     |N/A    |none  |acc    |    0.2398|±  |0.1615|
|--mmlu_econometrics                       |Yaml   |none  |acc    |    0.2193|±  |0.0389|
|--mmlu_high_school_geography              |Yaml   |none  |acc    |    0.1970|±  |0.0283|
|--mmlu_high_school_government_and_politics|Yaml   |none  |acc    |    0.2487|±  |0.0312|
|--mmlu_high_school_macroeconomics         |Yaml   |none  |acc    |    0.2128|±  |0.0208|
|--mmlu_high_school_microeconomics         |Yaml   |none  |acc    |    0.2353|±  |0.0276|
|--mmlu_high_school_psychology             |Yaml   |none  |acc    |    0.2734|±  |0.0191|
|--mmlu_human_sexuality                    |Yaml   |none  |acc    |    0.2137|±  |0.0360|
|--mmlu_professional_psychology            |Yaml   |none  |acc    |    0.2500|±  |0.0175|
|--mmlu_public_relations                   |Yaml   |none  |acc    |    0.3636|±  |0.0461|
|--mmlu_security_studies                   |Yaml   |none  |acc    |    0.1918|±  |0.0252|
|--mmlu_sociology                          |Yaml   |none  |acc    |    0.2289|±  |0.0297|
|--mmlu_us_foreign_policy                  |Yaml   |none  |acc    |    0.2400|±  |0.0429|
|-mmlu_stem                                |N/A    |none  |acc    |    0.2623|±  |0.1835|
|--mmlu_abstract_algebra                   |Yaml   |none  |acc    |    0.3000|±  |0.0461|
|--mmlu_anatomy                            |Yaml   |none  |acc    |    0.2667|±  |0.0382|
|--mmlu_astronomy                          |Yaml   |none  |acc    |    0.2434|±  |0.0349|
|--mmlu_college_biology                    |Yaml   |none  |acc    |    0.2847|±  |0.0377|
|--mmlu_college_chemistry                  |Yaml   |none  |acc    |    0.2300|±  |0.0423|
|--mmlu_college_computer_science           |Yaml   |none  |acc    |    0.2800|±  |0.0451|
|--mmlu_college_mathematics                |Yaml   |none  |acc    |    0.2700|±  |0.0446|
|--mmlu_college_physics                    |Yaml   |none  |acc    |    0.3431|±  |0.0472|
|--mmlu_computer_security                  |Yaml   |none  |acc    |    0.2900|±  |0.0456|
|--mmlu_conceptual_physics                 |Yaml   |none  |acc    |    0.2979|±  |0.0299|
|--mmlu_electrical_engineering             |Yaml   |none  |acc    |    0.2069|±  |0.0338|
|--mmlu_elementary_mathematics             |Yaml   |none  |acc    |    0.2619|±  |0.0226|
|--mmlu_high_school_biology                |Yaml   |none  |acc    |    0.2581|±  |0.0249|
|--mmlu_high_school_chemistry              |Yaml   |none  |acc    |    0.2512|±  |0.0305|
|--mmlu_high_school_computer_science       |Yaml   |none  |acc    |    0.2400|±  |0.0429|
|--mmlu_high_school_mathematics            |Yaml   |none  |acc    |    0.2630|±  |0.0268|
|--mmlu_high_school_physics                |Yaml   |none  |acc    |    0.2517|±  |0.0354|
|--mmlu_high_school_statistics             |Yaml   |none  |acc    |    0.2269|±  |0.0286|
|--mmlu_machine_learning                   |Yaml   |none  |acc    |    0.2589|±  |0.0416|

|       Groups        |Version|Filter|Metric |  Value   |   |Stderr|
|---------------------|-------|------|-------|---------:|---|-----:|
|mmlu                 |N/A    |none  |acc    |    0.2522|±  |0.4029|
|                     |       |      |samples|14042.0000|   |      |
|-mmlu_humanities     |N/A    |none  |acc    |    0.2349|±  |0.1431|
|-mmlu_other          |N/A    |none  |acc    |    0.2803|±  |0.1703|
|-mmlu_social_sciences|N/A    |none  |acc    |    0.2398|±  |0.1615|
|-mmlu_stem           |N/A    |none  |acc    |    0.2623|±  |0.1835|

@StellaAthena
Member

The print-out is very ugly. I get that ASCII-art tables aren't the most aesthetic thing in the world, but I suspect we can do better than this. I think it would look a lot better if we replaced the `-` with indentation and dropped the leading `mmlu_` from each component task, for example.

I also think that the "main results table" should show the grouped scores, not the score breakdown, as the grouped scores are typically the thing we actually want.

@lintangsutawika
Contributor Author

The dash is supposed to denote the hierarchy of the task (if it's part of a group).

The group table at the bottom is meant to make inspection easier: since it's printed last, you don't have to scroll up to find it. It's not an issue if you think we should put it above, but my preference is that printing it after the full task table makes it easier to inspect.

As for the naming, I understand the mmlu_ prefix can make the table crowded, but it's a by-product of how each task has to be uniquely identified. For the default MMLU tasks, I suppose we can remove the prefix.

@StellaAthena StellaAthena added this to the v0.3.0 milestone Nov 1, 2023
@lintangsutawika
Contributor Author

@StellaAthena, I've simplified the table presentation by adding task_alias and group_alias fields that a user can set to be displayed in place of the task_name, which has to be unique across all tasks in lm-eval.
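
For example, a task's YAML config might set these fields roughly as follows (a sketch only: the task_alias and group_alias fields are from this PR, but the surrounding schema and values are assumptions), producing the tidier output below:

```python
import yaml  # PyYAML, used here just to show the snippet parses

# Hypothetical task config: the aliases are what the results table displays,
# while the unique names still identify the task internally.
task_config = yaml.safe_load("""
group: mmlu_stem
group_alias: stem
task: mmlu_abstract_algebra
task_alias: abstract_algebra
""")

print(task_config["task"], "->", task_config["task_alias"])
# mmlu_abstract_algebra -> abstract_algebra
```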

hf (pretrained=EleutherAI/pythia-2.8b), limit: None, num_fewshot: None, batch_size: 1
|                Tasks                 |Version|Filter|Metric |  Value   |   |Stderr|
|--------------------------------------|-------|------|-------|---------:|---|-----:|
|mmlu                                  |N/A    |none  |acc    |    0.2522|±  |0.0407|
|                                      |       |      |samples|14042.0000|   |      |
| -humanities                          |N/A    |none  |acc    |    0.2349|±  |0.0289|
|  -formal_logic                       |Yaml   |none  |acc    |    0.2857|±  |0.0404|
|  -high_school_european_history       |Yaml   |none  |acc    |    0.2485|±  |0.0337|
|  -high_school_us_history             |Yaml   |none  |acc    |    0.2353|±  |0.0298|
|  -high_school_world_history          |Yaml   |none  |acc    |    0.2616|±  |0.0286|
|  -international_law                  |Yaml   |none  |acc    |    0.1983|±  |0.0364|
|  -jurisprudence                      |Yaml   |none  |acc    |    0.2407|±  |0.0413|
|  -logical_fallacies                  |Yaml   |none  |acc    |    0.1963|±  |0.0312|
|  -moral_disputes                     |Yaml   |none  |acc    |    0.2254|±  |0.0225|
|  -moral_scenarios                    |Yaml   |none  |acc    |    0.2425|±  |0.0143|
|  -philosophy                         |Yaml   |none  |acc    |    0.1961|±  |0.0226|
|  -prehistory                         |Yaml   |none  |acc    |    0.2716|±  |0.0247|
|  -professional_law                   |Yaml   |none  |acc    |    0.2288|±  |0.0107|
|  -world_religions                    |Yaml   |none  |acc    |    0.2398|±  |0.0327|
| -other                               |N/A    |none  |acc    |    0.2803|±  |0.0490|
|  -business_ethics                    |Yaml   |none  |acc    |    0.2200|±  |0.0416|
|  -clinical_knowledge                 |Yaml   |none  |acc    |    0.2566|±  |0.0269|
|  -college_medicine                   |Yaml   |none  |acc    |    0.2717|±  |0.0339|
|  -global_facts                       |Yaml   |none  |acc    |    0.3000|±  |0.0461|
|  -human_aging                        |Yaml   |none  |acc    |    0.2960|±  |0.0306|
|  -management                         |Yaml   |none  |acc    |    0.2524|±  |0.0430|
|  -marketing                          |Yaml   |none  |acc    |    0.2735|±  |0.0292|
|  -medical_genetics                   |Yaml   |none  |acc    |    0.2700|±  |0.0446|
|  -miscellaneous                      |Yaml   |none  |acc    |    0.2771|±  |0.0160|
|  -nutrition                          |Yaml   |none  |acc    |    0.2288|±  |0.0241|
|  -professional_accounting            |Yaml   |none  |acc    |    0.2695|±  |0.0265|
|  -professional_medicine              |Yaml   |none  |acc    |    0.4118|±  |0.0299|
|  -virology                           |Yaml   |none  |acc    |    0.2771|±  |0.0348|
| -social_sciences                     |N/A    |none  |acc    |    0.2398|±  |0.0390|
|  -econometrics                       |Yaml   |none  |acc    |    0.2193|±  |0.0389|
|  -high_school_geography              |Yaml   |none  |acc    |    0.1970|±  |0.0283|
|  -high_school_government_and_politics|Yaml   |none  |acc    |    0.2487|±  |0.0312|
|  -high_school_macroeconomics         |Yaml   |none  |acc    |    0.2128|±  |0.0208|
|  -high_school_microeconomics         |Yaml   |none  |acc    |    0.2353|±  |0.0276|
|  -high_school_psychology             |Yaml   |none  |acc    |    0.2734|±  |0.0191|
|  -human_sexuality                    |Yaml   |none  |acc    |    0.2137|±  |0.0360|
|  -professional_psychology            |Yaml   |none  |acc    |    0.2500|±  |0.0175|
|  -public_relations                   |Yaml   |none  |acc    |    0.3636|±  |0.0461|
|  -security_studies                   |Yaml   |none  |acc    |    0.1918|±  |0.0252|
|  -sociology                          |Yaml   |none  |acc    |    0.2289|±  |0.0297|
|  -us_foreign_policy                  |Yaml   |none  |acc    |    0.2400|±  |0.0429|
| -stem                                |N/A    |none  |acc    |    0.2623|±  |0.0412|
|  -abstract_algebra                   |Yaml   |none  |acc    |    0.3000|±  |0.0461|
|  -anatomy                            |Yaml   |none  |acc    |    0.2667|±  |0.0382|
|  -astronomy                          |Yaml   |none  |acc    |    0.2434|±  |0.0349|
|  -college_biology                    |Yaml   |none  |acc    |    0.2847|±  |0.0377|
|  -college_chemistry                  |Yaml   |none  |acc    |    0.2300|±  |0.0423|
|  -college_computer_science           |Yaml   |none  |acc    |    0.2800|±  |0.0451|
|  -college_mathematics                |Yaml   |none  |acc    |    0.2700|±  |0.0446|
|  -college_physics                    |Yaml   |none  |acc    |    0.3431|±  |0.0472|
|  -computer_security                  |Yaml   |none  |acc    |    0.2900|±  |0.0456|
|  -conceptual_physics                 |Yaml   |none  |acc    |    0.2979|±  |0.0299|
|  -electrical_engineering             |Yaml   |none  |acc    |    0.2069|±  |0.0338|
|  -elementary_mathematics             |Yaml   |none  |acc    |    0.2619|±  |0.0226|
|  -high_school_biology                |Yaml   |none  |acc    |    0.2581|±  |0.0249|
|  -high_school_chemistry              |Yaml   |none  |acc    |    0.2512|±  |0.0305|
|  -high_school_computer_science       |Yaml   |none  |acc    |    0.2400|±  |0.0429|
|  -high_school_mathematics            |Yaml   |none  |acc    |    0.2630|±  |0.0268|
|  -high_school_physics                |Yaml   |none  |acc    |    0.2517|±  |0.0354|
|  -high_school_statistics             |Yaml   |none  |acc    |    0.2269|±  |0.0286|
|  -machine_learning                   |Yaml   |none  |acc    |    0.2589|±  |0.0416|

|     Groups      |Version|Filter|Metric |  Value   |   |Stderr|
|-----------------|-------|------|-------|---------:|---|-----:|
|mmlu             |N/A    |none  |acc    |    0.2522|±  |0.0407|
|                 |       |      |samples|14042.0000|   |      |
| -humanities     |N/A    |none  |acc    |    0.2349|±  |0.0289|
| -other          |N/A    |none  |acc    |    0.2803|±  |0.0490|
| -social_sciences|N/A    |none  |acc    |    0.2398|±  |0.0390|
| -stem           |N/A    |none  |acc    |    0.2623|±  |0.0412|

@StellaAthena
Member

StellaAthena commented Nov 2, 2023

Looks good to me! Has this been added to the documentation as well?

@lintangsutawika lintangsutawika merged commit 815f59e into big-refactor Nov 6, 2023
3 of 8 checks passed
@lintangsutawika lintangsutawika deleted the mmlu_subgroups branch November 6, 2023 13:47
@fancyerii

How do I use this PR to evaluate all MMLU tasks? Can anyone provide an example command line? I have updated to the latest version.

@haileyschoelkopf
Collaborator

You should be able to run it simply via `python -m lm_eval --model hf --model_args ... --tasks mmlu`!

@JeevanBhoot
Contributor

What are the weights on each subject for the weighted average?

@lintangsutawika
Contributor Author

They're weighted by the number of samples in each subtask.
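
Concretely, with hypothetical numbers:

```python
# Two hypothetical subtasks: (accuracy, sample count).
acc_a, n_a = 0.25, 100
acc_b, n_b = 0.40, 300

# The larger subtask dominates the group average.
group_acc = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
print(group_acc)  # 0.3625
```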

@lchu-ibm
Contributor

lchu-ibm commented Feb 9, 2024

@lintangsutawika in the latest code, this average score seems to be gone. Any guidance on the current way to get the MMLU average score?

@lintangsutawika
Contributor Author

Could you tell me more about what the issue might be? It would also be better to discuss this in a new issue.

@lchu-ibm
Contributor

lchu-ibm commented Feb 9, 2024

@lintangsutawika basically I would like to have the MMLU weighted average in the report (exactly what this PR does), but the latest main seems to drop this info (possibly reverted?). Here is what I have now (note there is no more weighted average, unlike what you showed above):


|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
| - humanities     |N/A    |none  |     0|acc   |0.4740|±  |0.0256|
| - other          |N/A    |none  |     0|acc   |0.4465|±  |0.0318|
| - social_sciences|N/A    |none  |     0|acc   |0.4798|±  |0.0304|
| - stem           |N/A    |none  |     0|acc   |0.3477|±  |0.0366|

|                 Tasks                 |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------------------------|-------|------|-----:|------|-----:|---|-----:|
| - humanities                          |N/A    |none  |     0|acc   |0.4740|±  |0.0256|
|  - formal_logic                       |      0|none  |     0|acc   |0.3095|±  |0.0413|
|  - high_school_european_history       |      0|none  |     0|acc   |0.6000|±  |0.0383|
|  - high_school_us_history             |      0|none  |     0|acc   |0.5490|±  |0.0349|
|  - high_school_world_history          |      0|none  |     0|acc   |0.5865|±  |0.0321|
|  - international_law                  |      0|none  |     0|acc   |0.5868|±  |0.0450|
|  - jurisprudence                      |      0|none  |     0|acc   |0.5185|±  |0.0483|

So I'm wondering what the current way is to get what you showed above; happy to create a new issue.

@lchu-ibm
Contributor

lchu-ibm commented Feb 9, 2024

Opened an issue: #1415

@haileyschoelkopf
Collaborator

Just fixed in #1414!

Linked issue this pull request may close: Add mmlu average score in report