Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) (#2232)

* arabic leaderboard yaml file is added

* arabic toxigen is implemented

* Dataset library is imported

* arabic sciq is added

* util file of arabic toxigen is updated

* arabic race is added

* arabic piqa is implemented

* arabic open qa is added

* arabic copa is implemented

* arabic boolq is added

* arabic arc easy is added

* arabic arc challenge is added

* arabic exams benchmark is implemented

* arabic hellaswag is added

* arabic leaderboard yaml file metrics are updated

* arabic mmlu benchmarks are added

* arabic mmlu group yaml file is updated

* alghafa benchmarks are added

* acva benchmarks are added

* acva utils.py is updated

* light version of arabic leaderboard benchmarks is added

* bugs fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* library import bug is fixed

* doc to target updated

* bash file is deleted

* results folder is deleted

* leaderboard groups are added

* full arabic leaderboard groups are added, plus some bug fixes to the light version

* Create README.md

README.md for arabic_leaderboard_complete

* Create README.md

README.md for arabic_leaderboard_light

* Delete lm_eval/tasks/arabic_leaderboard directory

* Update README.md

* Update README.md

adding the Arabic leaderboards to the library

* Update README.md

10% of the training set

* Update README.md

10% of the training set

* revert .gitignore to prev version

* Update lm_eval/tasks/README.md

Co-authored-by: Hailey Schoelkopf <[email protected]>

* updated main README.md

* Update lm_eval/tasks/README.md

* specify machine translated benchmarks (complete)

* specify machine translated benchmarks (light version)

* add alghafa to the related task names (complete and light)

* add 'acva' to the related task names (complete and light)

* add 'arabic_leaderboard' to all the groups (complete and light)

* full dataset - not a random sample

* added more accurate details to the readme file

* added mt_mmlu from okapi

* Update lm_eval/tasks/README.md

Co-authored-by: Hailey Schoelkopf <[email protected]>

* Update lm_eval/tasks/README.md

Co-authored-by: Hailey Schoelkopf <[email protected]>

* updated mt_mmlu readme

* renaming 'alghafa' full and light

* renaming 'arabic_mmlu' light and full

* renaming 'acva' full and light

* update readme and standardize dir/file names

* running pre-commit

---------

Co-authored-by: shahrzads <[email protected]>
Co-authored-by: shahrzads <[email protected]>
Co-authored-by: Hailey Schoelkopf <[email protected]>
4 people committed Sep 10, 2024
1 parent 543617f commit decc533
Showing 337 changed files with 7,980 additions and 0 deletions.
2 changes: 2 additions & 0 deletions lm_eval/tasks/README.md
@@ -11,6 +11,8 @@
| [aexams](aexams/README.md) | Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic |
| [agieval](agieval/README.md) | Tasks involving historical data or questions related to history and historical texts. | English, Chinese |
| [anli](anli/README.md) | Adversarial natural language inference tasks designed to test model robustness. | English |
| [arabic_leaderboard_complete](arabic_leaderboard_complete/README.md) | The full version of the Open Arabic LLM Leaderboard tasks, evaluating models on Arabic language understanding and comprehension as well as Arabic culture and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
| [arabic_leaderboard_light](arabic_leaderboard_light/README.md) | A light version of the Open Arabic LLM Leaderboard tasks (a 10% sample of the test set of each original benchmark), evaluating models on Arabic language understanding and comprehension as well as Arabic culture and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
| [arabicmmlu](arabicmmlu/README.md) | Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. | Arabic |
| [arc](arc/README.md) | Tasks involving complex reasoning over a diverse set of questions. | English |
| [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English |
254 changes: 254 additions & 0 deletions lm_eval/tasks/arabic_leaderboard_complete/README.md

Large diffs are not rendered by default.

@@ -0,0 +1,23 @@
group: arabic_leaderboard_alghafa
task:
  - arabic_leaderboard_alghafa_mcq_exams_test_ar
  - arabic_leaderboard_alghafa_meta_ar_dialects
  - arabic_leaderboard_alghafa_meta_ar_msa
  - arabic_leaderboard_alghafa_multiple_choice_facts_truefalse_balanced_task
  - arabic_leaderboard_alghafa_multiple_choice_grounded_statement_soqal_task
  - arabic_leaderboard_alghafa_multiple_choice_grounded_statement_xglue_mlqa_task
  - arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_no_neutral_task
  - arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_task
  - arabic_leaderboard_alghafa_multiple_choice_sentiment_task

aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true
  - metric: acc_norm
    aggregation: mean
    weight_by_size: true
metadata:
  version: 1.0
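
Not part of this commit, but a quick way to see how the new group is meant to be consumed: a minimal sketch using the harness's Python API, assuming lm_eval is installed. The model name, few-shot count, and batch size below are placeholders for illustration, not values prescribed by the configs above.

import lm_eval

# Placeholder model and settings; any Hugging Face causal LM can be substituted.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["arabic_leaderboard_alghafa"],  # the group defined in the YAML above
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])  # per-task acc / acc_norm scores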
@@ -0,0 +1,23 @@
task: arabic_leaderboard_alghafa_mcq_exams_test_ar
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: mcq_exams_test_ar
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,23 @@
task: arabic_leaderboard_alghafa_meta_ar_dialects
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: meta_ar_dialects
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,23 @@
task: arabic_leaderboard_alghafa_meta_ar_msa
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: meta_ar_msa
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,23 @@
task: arabic_leaderboard_alghafa_multiple_choice_facts_truefalse_balanced_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_facts_truefalse_balanced_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,23 @@
task: arabic_leaderboard_alghafa_multiple_choice_grounded_statement_soqal_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_grounded_statement_soqal_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,23 @@
task: arabic_leaderboard_alghafa_multiple_choice_grounded_statement_xglue_mlqa_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_grounded_statement_xglue_mlqa_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,23 @@
task: arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_no_neutral_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_rating_sentiment_no_neutral_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,23 @@
task: arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_rating_sentiment_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,23 @@
task: arabic_leaderboard_alghafa_multiple_choice_sentiment_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_sentiment_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,23 @@
import datasets
import numpy as np


def process_docs(dataset: datasets.Dataset):
    def _process_doc(doc):
        question = doc["query"]
        answer_index = int(doc["label"])
        # Dynamically determining the choices by excluding '__few_shots', 'query' and 'label'
        choices_keys = [
            key for key in doc.keys() if key not in ["query", "label", "__few_shots"]
        ]
        choices = [doc[key] for key in choices_keys]

        instruction = "الأسئلة التالية هي أسئلة متعددة الإختيارات مع الجواب الصحيح\n\n"
        query = f"{instruction}السؤال: {question}\n"
        for index, choice in enumerate(choices):
            query += f"{index}) {choice}\n"
        query += "الإجابة:"

        return {"query": query, "choices": choices, "gold": answer_index}

    return dataset.map(_process_doc)
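
For illustration only (not in the commit), here is how a toy record would pass through process_docs above. The sol1/sol2 column names are hypothetical stand-ins for the AlGhafa answer-option columns; anything other than query, label, and __few_shots is picked up as a choice.

import datasets

# Hypothetical two-option record mimicking the AlGhafa layout (illustrative only).
toy = datasets.Dataset.from_list(
    [{"query": "ما عاصمة مصر؟", "label": "0", "sol1": "القاهرة", "sol2": "الرباط"}]
)
processed = process_docs(toy)
print(processed[0]["query"])    # instruction + question + "0) ..." / "1) ..." + "الإجابة:"
print(processed[0]["choices"])  # ['القاهرة', 'الرباط']
print(processed[0]["gold"])     # 0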
@@ -0,0 +1,23 @@
task: arabic_exams
dataset_path: OALL/Arabic_EXAMS
dataset_name: default
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,13 @@
group: arabic_leaderboard_arabic_exams
task:
  - arabic_exams

aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true
  - metric: acc_norm
    aggregation: mean
    weight_by_size: true
metadata:
  version: 1.0
@@ -0,0 +1,33 @@
import datasets
import numpy as np


# fmt: off
LETTER_INDICES_AR = ["أ", "ب", "ج", "د", "هـ", "و", "ز", "ح", "ط", "ي", "ك", "ل", "م", "ن", "س", "ع", "ف", "ص", "ق", "ر", "ش", "ت", "ث", "خ", "ذ", "ض", "ظ", "غ"]
# fmt: on


# fmt: off
LETTER_INDICES = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"]
# fmt: on


def process_docs(dataset: datasets.Dataset):
    def _process_doc(doc):
        topic = doc["subject"]
        question = doc["question"]
        choices = [doc["A"], doc["B"], doc["C"], doc["D"]]
        choices_formatted = [
            f" {LETTER_INDICES_AR[i]}) {choice}\n" for i, choice in enumerate(choices)
        ]
        answer = doc["answer"]
        answer_index = LETTER_INDICES.index(answer)

        instruction = f"الأسئلة التالية هي أسئلة متعددة الإختيارات مع الجواب الصحيح حول {topic.replace('_', ' ')}. \n\n"
        query = f"{instruction}السؤال: {question}\n"
        query += "\n".join(choices_formatted)
        query += "\nالإجابة:"

        return {"query": query, "choices": LETTER_INDICES_AR[:4], "gold": answer_index}

    return dataset.map(_process_doc)
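
Again purely as an illustration (not in the commit), a hypothetical Arabic_EXAMS-style record run through this process_docs: the gold index is the position of the English answer letter, while the choices presented to the model are the Arabic letter labels.

import datasets

# Hypothetical record; real Arabic_EXAMS rows have the same fields but real content.
toy = datasets.Dataset.from_list(
    [{
        "subject": "Islamic_studies",
        "question": "سؤال تجريبي؟",
        "A": "الخيار الأول",
        "B": "الخيار الثاني",
        "C": "الخيار الثالث",
        "D": "الخيار الرابع",
        "answer": "B",
    }]
)
processed = process_docs(toy)
print(processed[0]["gold"])     # 1 -> index of "B" in LETTER_INDICES
print(processed[0]["choices"])  # ['أ', 'ب', 'ج', 'د'] (Arabic letter labels)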