Add new benchmark: Portuguese bench #2156

Merged
merged 7 commits into from
Sep 30, 2024
Changes from 5 commits
64 changes: 64 additions & 0 deletions lm_eval/tasks/portuguese_bench/README.md
@@ -0,0 +1,64 @@
# PortugueseBench

### Paper

PortugueseBench is a benchmark for evaluating language models on Portuguese tasks. That is, it evaluates the ability of a language model to understand and generate Portuguese text. PortugueseBench combines pre-existing, open datasets. Full details of PortugueseBench will be published in a forthcoming paper.

The datasets included in PortugueseBench are:

| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_pt | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| FLORES_pt | Translation | [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://arxiv.org/abs/2106.03193) | https://huggingface.co/datasets/facebook/flores |
| ASSIN | Natural Language Inference + Paraphrasing | [Avaliando a similaridade semântica entre frases curtas através de uma abordagem híbrida](https://aclanthology.org/W17-6612/) | https://huggingface.co/datasets/nilc-nlp/assin |


### Citation
Paper for PortugueseBench coming soon.

### Groups and Tasks

#### Groups

- `portuguese_bench`: All tasks included in PortugueseBench.
- `flores_pt`: All FLORES translation tasks from or to Portuguese.

#### Tasks

The following tasks evaluate models on the PortugueseBench datasets using various scoring methods.
- `assin_paraphrase`
- `assin_entailment`
- `belebele_por_Latn`
- `flores_pt`
- `flores_pt-ca`
- `flores_pt-de`
- `flores_pt-en`
- `flores_pt-es`
- `flores_pt-eu`
- `flores_pt-fr`
- `flores_pt-gl`
- `flores_pt-it`
- `flores_ca-pt`
- `flores_de-pt`
- `flores_en-pt`
- `flores_es-pt`
- `flores_eu-pt`
- `flores_fr-pt`
- `flores_gl-pt`
- `flores_it-pt`

Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_por_Latn`: Belebele Portuguese
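
As a quick sanity check on the task list above, the sixteen directional FLORES task names can be derived from the eight partner-language codes. This is a minimal sketch and not part of the PR; `PARTNERS` and `flores_task_names` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper (not part of the PR): rebuild the FLORES task names
# listed in this README from the two-letter codes of the partner languages.
PARTNERS = ["ca", "de", "en", "es", "eu", "fr", "gl", "it"]


def flores_task_names(main="pt", partners=PARTNERS):
    """Return task names for both translation directions of every pair."""
    names = []
    for other in partners:
        names.append(f"flores_{main}-{other}")  # Portuguese -> other
        names.append(f"flores_{other}-{main}")  # other -> Portuguese
    return names


if __name__ == "__main__":
    for name in flores_task_names():
        print(name)
```

Each name follows the `flores_<source>-<target>` pattern used by the per-direction YAML files in `flores_pt/`.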


### Checklist

* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation?
* [ ] Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
17 changes: 17 additions & 0 deletions lm_eval/tasks/portuguese_bench/assin_entailment.yaml
@@ -0,0 +1,17 @@
task: assin_entailment
dataset_path: nilc-nlp/assin
dataset_name: null
training_split: train
validation_split: validation
test_split: test
output_type: multiple_choice
doc_to_text: ""
doc_to_target: '{{0 if entailment_judgment == 0 else 1}}'
target_delimiter: ""
doc_to_choice: '{{[premise + ", certo? Também, " + hypothesis, premise + ", certo? Sim, " + hypothesis]}}'
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
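
To make the template concrete, the following sketch reproduces in plain Python what the `doc_to_choice` and `doc_to_target` expressions in this config evaluate to for a single document. The sample record is invented for illustration and the function names are hypothetical; the config itself expresses these fields as Jinja templates.

```python
# Plain-Python mirror of the two template expressions in assin_entailment.yaml.
# The sample document is made up; only the field names come from the config.


def render_choices(doc):
    """Mirror of doc_to_choice: two candidate strings built from the pair."""
    return [
        doc["premise"] + ", certo? Também, " + doc["hypothesis"],
        doc["premise"] + ", certo? Sim, " + doc["hypothesis"],
    ]


def render_target(doc):
    """Mirror of doc_to_target: index 0 when entailment_judgment is 0, else 1."""
    return 0 if doc["entailment_judgment"] == 0 else 1


sample = {
    "premise": "O gato dorme no sofá",
    "hypothesis": "o gato está dormindo",
    "entailment_judgment": 1,
}

choices = render_choices(sample)
gold = render_target(sample)
print(choices[gold])
```

Because `doc_to_text` is empty, the whole premise-plus-hypothesis string is the candidate being scored, and any non-zero `entailment_judgment` maps to the second choice.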
17 changes: 17 additions & 0 deletions lm_eval/tasks/portuguese_bench/assin_paraphrase.yaml
@@ -0,0 +1,17 @@
task: assin_paraphrase
dataset_path: nilc-nlp/assin
dataset_name: null
training_split: train
validation_split: validation
test_split: test
output_type: multiple_choice
doc_to_text: ""
doc_to_target: '{{0 if entailment_judgment == 0 else 1}}'
target_delimiter: ""
doc_to_choice: '{{[premise + ", certo? Não, " + hypothesis, premise + ", certo? Sim, " + hypothesis]}}'
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
28 changes: 28 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/_flores_common_yaml
@@ -0,0 +1,28 @@
group: flores
dataset_path: facebook/flores
dataset_name: all
output_type: generate_until
#! The test split of flores is not publicly available! (See paper section 6.1)
#! We are using `dev` and `devtest` splits, but they're mapped to train/validation/test in `data/flores/flores.py`.
training_split: dev
validation_split: dev
test_split: devtest
fewshot_split: dev
target_delimiter: ''
generation_kwargs:
  until:
    - "\n"
metric_list:
  - metric: bleu
    aggregation: bleu
    higher_is_better: true
  - metric: ter
    aggregation: ter
    higher_is_better: false
  - metric: chrf
    aggregation: chrf
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
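
The `until` entry asks generation to stop at the first newline, so each translation is a single line. A rough sketch of that truncation in plain Python follows, assuming the stop string and everything after it are simply cut off; `truncate_at_stop` is a hypothetical helper and not a harness function, and the sample output is invented.

```python
def truncate_at_stop(text, stop_sequences=("\n",)):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Invented model output: a translation followed by an unwanted continuation.
raw_output = " Uma frase traduzida.\nEnglish sentence: another prompt"
print(truncate_at_stop(raw_output))
```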
114 changes: 114 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/create-yamls_flores_pt.py
@@ -0,0 +1,114 @@
"""
Script to generate task YAMLs for the FLORES-200 dataset.
Based on `tasks/translation/utils.py`.
"""

import argparse
import itertools

import yaml
from langcodes import Language

# utils
# `itertools` is imported as a module so that `itertools.chain` resolves.
flatten = lambda l: list(itertools.chain(*l))

# constants
_LANGUAGES = [
"ace_Arab", "bam_Latn", "dzo_Tibt", "hin_Deva", "khm_Khmr", "mag_Deva", "pap_Latn", "sot_Latn", "tur_Latn",
"ace_Latn", "ban_Latn", "ell_Grek", "hne_Deva", "kik_Latn", "mai_Deva", "pbt_Arab", "spa_Latn", "twi_Latn",
"acm_Arab", "bel_Cyrl", "eng_Latn", "hrv_Latn", "kin_Latn", "mal_Mlym", "pes_Arab", "srd_Latn", "tzm_Tfng",
"acq_Arab", "bem_Latn", "epo_Latn", "hun_Latn", "kir_Cyrl", "mar_Deva", "plt_Latn", "srp_Cyrl", "uig_Arab",
"aeb_Arab", "ben_Beng", "est_Latn", "hye_Armn", "kmb_Latn", "min_Arab", "pol_Latn", "ssw_Latn", "ukr_Cyrl",
"afr_Latn", "bho_Deva", "eus_Latn", "ibo_Latn", "kmr_Latn", "min_Latn", "por_Latn", "sun_Latn", "umb_Latn",
"ajp_Arab", "bjn_Arab", "ewe_Latn", "ilo_Latn", "knc_Arab", "mkd_Cyrl", "prs_Arab", "swe_Latn", "urd_Arab",
"aka_Latn", "bjn_Latn", "fao_Latn", "ind_Latn", "knc_Latn", "mlt_Latn", "quy_Latn", "swh_Latn", "uzn_Latn",
"als_Latn", "bod_Tibt", "fij_Latn", "isl_Latn", "kon_Latn", "mni_Beng", "ron_Latn", "szl_Latn", "vec_Latn",
"amh_Ethi", "bos_Latn", "fin_Latn", "ita_Latn", "kor_Hang", "mos_Latn", "run_Latn", "tam_Taml", "vie_Latn",
"apc_Arab", "bug_Latn", "fon_Latn", "jav_Latn", "lao_Laoo", "mri_Latn", "rus_Cyrl", "taq_Latn", "war_Latn",
"arb_Arab", "bul_Cyrl", "fra_Latn", "jpn_Jpan", "lij_Latn", "mya_Mymr", "sag_Latn", "taq_Tfng", "wol_Latn",
"arb_Latn", "cat_Latn", "fur_Latn", "kab_Latn", "lim_Latn", "nld_Latn", "san_Deva", "tat_Cyrl", "xho_Latn",
"ars_Arab", "ceb_Latn", "fuv_Latn", "kac_Latn", "lin_Latn", "nno_Latn", "sat_Olck", "tel_Telu", "ydd_Hebr",
"ary_Arab", "ces_Latn", "gaz_Latn", "kam_Latn", "lit_Latn", "nob_Latn", "scn_Latn", "tgk_Cyrl", "yor_Latn",
"arz_Arab", "cjk_Latn", "gla_Latn", "kan_Knda", "lmo_Latn", "npi_Deva", "shn_Mymr", "tgl_Latn", "yue_Hant",
"asm_Beng", "ckb_Arab", "gle_Latn", "kas_Arab", "ltg_Latn", "nso_Latn", "sin_Sinh", "tha_Thai", "zho_Hans",
"ast_Latn", "crh_Latn", "glg_Latn", "kas_Deva", "ltz_Latn", "nus_Latn", "slk_Latn", "tir_Ethi", "zho_Hant",
"awa_Deva", "cym_Latn", "grn_Latn", "kat_Geor", "lua_Latn", "nya_Latn", "slv_Latn", "tpi_Latn", "zsm_Latn",
"ayr_Latn", "dan_Latn", "guj_Gujr", "kaz_Cyrl", "lug_Latn", "oci_Latn", "smo_Latn", "tsn_Latn", "zul_Latn",
"azb_Arab", "deu_Latn", "hat_Latn", "kbp_Latn", "luo_Latn", "ory_Orya", "sna_Latn", "tso_Latn",
"azj_Latn", "dik_Latn", "hau_Latn", "kea_Latn", "lus_Latn", "pag_Latn", "snd_Arab", "tuk_Latn",
"bak_Cyrl", "dyu_Latn", "heb_Hebr", "khk_Cyrl", "lvs_Latn", "pan_Guru", "som_Latn", "tum_Latn"
]
LANGUAGE_PAIRS = [(a, b) for idx, a in enumerate(_LANGUAGES) for b in _LANGUAGES[idx + 1:]]

LANGUAGES_OF_INTEREST = ["cat_Latn", "spa_Latn", "eng_Latn", "glg_Latn", "eus_Latn", "ita_Latn", "deu_Latn", "por_Latn", "fra_Latn"]
MAIN_LANG = "por_Latn"
LANGUAGE_PAIRS = [(a, b) for (a, b) in LANGUAGE_PAIRS if a in LANGUAGES_OF_INTEREST and b in LANGUAGES_OF_INTEREST and MAIN_LANG in (a, b)]

# auxiliary functions

code_to_language_name = lambda code: Language.make(language=Language.get(code)["language"]).display_name()
code_to_short_name = lambda code: Language.get(code)["language"]
jinja_var = lambda s: "{{" + s + "}}" # wrapper to avoid having to escape { } in format strings

def doc_to_text(src: str, tgt: str) -> str:
    src_name, tgt_name = map(code_to_language_name, [src, tgt])

    return f"""\
{src_name} sentence: {jinja_var('sentence_' + src)}
{tgt_name} sentence:"""

def doc_to_target(tgt: str) -> str:

    return f"{jinja_var('sentence_' + tgt)}"

# main function

def gen_lang_yamls(output_dir: str, overwrite: bool) -> None:
    """
    Generate a YAML file for each translation direction.
    """

    err = []
    for src, tgt in LANGUAGE_PAIRS:

        # do both translation directions for each lang pair
        for src, tgt in [(src, tgt), (tgt, src)]:
            lang_pair_name = f"{code_to_short_name(src)}-{code_to_short_name(tgt)}"
            yaml_file_name = f"flores_{lang_pair_name}.yaml"

            try:
                with open(f"{output_dir}/{yaml_file_name}", "w" if overwrite else "x", encoding="utf-8") as outfile:
                    print(f"Creating {yaml_file_name}...")
                    outfile.write("# File generated by `create-yamls.py`\n")
                    yaml.dump(
                        {
                            # "group": "flores_pt",
                            "include": "_flores_common_yaml",
                            "task": f"flores_{lang_pair_name}",
                            "doc_to_text": doc_to_text(src, tgt),
                            "doc_to_target": doc_to_target(tgt),
                        },
                        outfile,
                        sort_keys=False,
                    )

            except FileExistsError:
                err.append(yaml_file_name)

    if len(err) > 0:
        raise FileExistsError(
            "Files were not created because they already exist:"
            f" {', '.join(err)}"
            "\nUse flag --overwrite to overwrite them."
        )


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--overwrite", default=False, action="store_true", help="Overwrite files if they already exist")
    parser.add_argument("--output-dir", default=".", help="Directory to write yaml files to")
    args = parser.parse_args()

    gen_lang_yamls(output_dir=args.output_dir, overwrite=args.overwrite)

if __name__ == "__main__":
    main()
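
For reference, the sketch below shows the prompt and target templates the script builds for one translation direction, without depending on `langcodes`. The `DISPLAY_NAMES` mapping is a hard-coded stand-in for `code_to_language_name` and is an assumption made for this illustration only.

```python
# Stand-in for code_to_language_name, hard-coded so the sketch has no
# third-party dependencies. The real script derives names via langcodes.
DISPLAY_NAMES = {"por_Latn": "Portuguese", "cat_Latn": "Catalan"}


def jinja_var(s):
    """Wrap a field name in Jinja delimiters, as the script does."""
    return "{{" + s + "}}"


def doc_to_text(src, tgt):
    """Build the prompt template for translating from `src` to `tgt`."""
    return (
        f"{DISPLAY_NAMES[src]} sentence: {jinja_var('sentence_' + src)}\n"
        f"{DISPLAY_NAMES[tgt]} sentence:"
    )


def doc_to_target(tgt):
    """Build the reference template: the target-language sentence field."""
    return jinja_var("sentence_" + tgt)


print(doc_to_text("por_Latn", "cat_Latn"))
print(doc_to_target("cat_Latn"))
```

The two strings correspond to the `doc_to_text` and `doc_to_target` values written into `flores_pt-ca.yaml`.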
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_ca-pt.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_ca-pt
doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}}

  Portuguese sentence:'
doc_to_target: '{{sentence_por_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_de-pt.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_de-pt
doc_to_text: 'German sentence: {{sentence_deu_Latn}}

  Portuguese sentence:'
doc_to_target: '{{sentence_por_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_en-pt.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_en-pt
doc_to_text: 'English sentence: {{sentence_eng_Latn}}

  Portuguese sentence:'
doc_to_target: '{{sentence_por_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_es-pt.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_es-pt
doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}

  Portuguese sentence:'
doc_to_target: '{{sentence_por_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_eu-pt.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_eu-pt
doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}

  Portuguese sentence:'
doc_to_target: '{{sentence_por_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_fr-pt.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_fr-pt
doc_to_text: 'French sentence: {{sentence_fra_Latn}}

  Portuguese sentence:'
doc_to_target: '{{sentence_por_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_gl-pt.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_gl-pt
doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}

  Portuguese sentence:'
doc_to_target: '{{sentence_por_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_it-pt.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_it-pt
doc_to_text: 'Italian sentence: {{sentence_ita_Latn}}

  Portuguese sentence:'
doc_to_target: '{{sentence_por_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_pt-ca.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_pt-ca
doc_to_text: 'Portuguese sentence: {{sentence_por_Latn}}

  Catalan sentence:'
doc_to_target: '{{sentence_cat_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_pt-de.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_pt-de
doc_to_text: 'Portuguese sentence: {{sentence_por_Latn}}

  German sentence:'
doc_to_target: '{{sentence_deu_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_pt-en.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_pt-en
doc_to_text: 'Portuguese sentence: {{sentence_por_Latn}}

  English sentence:'
doc_to_target: '{{sentence_eng_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_pt-es.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_pt-es
doc_to_text: 'Portuguese sentence: {{sentence_por_Latn}}

  Spanish sentence:'
doc_to_target: '{{sentence_spa_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_pt-eu.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_pt-eu
doc_to_text: 'Portuguese sentence: {{sentence_por_Latn}}

  Basque sentence:'
doc_to_target: '{{sentence_eus_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_pt-fr.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_pt-fr
doc_to_text: 'Portuguese sentence: {{sentence_por_Latn}}

  French sentence:'
doc_to_target: '{{sentence_fra_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_pt-gl.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_pt-gl
doc_to_text: 'Portuguese sentence: {{sentence_por_Latn}}

  Galician sentence:'
doc_to_target: '{{sentence_glg_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_pt-it.yaml
@@ -0,0 +1,7 @@
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_pt-it
doc_to_text: 'Portuguese sentence: {{sentence_por_Latn}}

  Italian sentence:'
doc_to_target: '{{sentence_ita_Latn}}'
23 changes: 23 additions & 0 deletions lm_eval/tasks/portuguese_bench/flores_pt/flores_pt.yaml
@@ -0,0 +1,23 @@
group: flores_pt
task:
  - flores_es-pt
  - flores_pt-es
  - flores_en-pt
  - flores_pt-en
  - flores_eu-pt
  - flores_pt-eu
  - flores_pt-it
  - flores_it-pt
  - flores_pt-fr
  - flores_fr-pt
  - flores_pt-ca
  - flores_ca-pt
  - flores_pt-gl
  - flores_gl-pt
  - flores_pt-de
  - flores_de-pt
aggregate_metric_list:
  - metric: bleu
    aggregation: mean
metadata:
  version: 1.0
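
The group reports a single BLEU figure by averaging over its sixteen subtasks. A minimal sketch of that aggregation follows, assuming per-task scores are already computed. The scores and sample counts are invented, and `aggregate_mean` is a hypothetical helper rather than the harness's own function. Optional size weighting is included on the assumption that a group mean may be weighted by subtask size; with equally sized subtasks the weighted and unweighted means coincide.

```python
def aggregate_mean(scores, sizes=None):
    """Average per-task scores, optionally weighting each task by its size."""
    if sizes is None:
        return sum(scores.values()) / len(scores)
    total = sum(sizes[task] for task in scores)
    return sum(scores[task] * sizes[task] for task in scores) / total


# Invented numbers for illustration only.
bleu = {"flores_pt-en": 40.0, "flores_en-pt": 38.0, "flores_pt-es": 24.0}
sizes = {task: 1000 for task in bleu}

print(aggregate_mean(bleu))
print(aggregate_mean(bleu, sizes))
```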