Add new benchmark: Spanish bench #2157
Status: Open

zxcvuser wants to merge 12 commits into `EleutherAI:main` from `zxcvuser:spanish_bench`.
Commits (12, all by zxcvuser):

- `6141bb3` Add spanish_bench
- `d59bec4` Merge branch 'EleutherAI:main' into spanish_bench
- `b7f165a` Add flores_es group
- `27fde35` Update _flores_common_yaml
- `b38282d` Merge branch 'EleutherAI:main' into spanish_bench
- `2b606b0` Merge branch 'EleutherAI:main' into spanish_bench
- `12fb2f1` Delete lm_eval/tasks/spanish_bench/escola.yaml
- `28eef2d` Delete escola from spanish_bench.yaml
- `9445975` Delete escola from README.md
- `6b4b9af` pre-commit run --all-files
- `735fb47` Merge branch 'EleutherAI:main' into spanish_bench
- `95f5071` Updated some task groupings and readme
File: `README.md` (new file)
@@ -0,0 +1,86 @@
# SpanishBench

### Paper

SpanishBench is a benchmark for evaluating language models on Spanish tasks. That is, it evaluates the ability of a language model to understand and generate Spanish text. SpanishBench offers a combination of pre-existing, open datasets. All the details of SpanishBench will be published in a paper soon.

The datasets included in SpanishBench are:

| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_es | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| FLORES_es | Translation | [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://arxiv.org/abs/2106.03193) | https://huggingface.co/datasets/facebook/flores |
| MGSM_es | Math | [Language Models are Multilingual Chain-of-Thought Reasoners](https://arxiv.org/abs/2210.03057) | https://huggingface.co/datasets/juletxara/mgsm |
| PAWS-X_es | Paraphrasing | [PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification](https://aclanthology.org/D19-1382/) | https://huggingface.co/datasets/google-research-datasets/paws-x |
| WNLI-es | Natural Language Inference | No paper. | https://huggingface.co/datasets/PlanTL-GOB-ES/wnli-es |
| XL-Sum_es | Summarization | [XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages](https://aclanthology.org/2021.findings-acl.413/) | https://huggingface.co/datasets/csebuetnlp/xlsum |
| XNLI_es | Natural Language Inference | [XNLI: Evaluating Cross-lingual Sentence Representations](https://aclanthology.org/D18-1269/) | https://huggingface.co/datasets/facebook/xnli |
| XQuAD_es | Question Answering | [On the Cross-lingual Transferability of Monolingual Representations](https://aclanthology.org/2020.acl-main.421/) | https://huggingface.co/datasets/google/xquad |
| XStoryCloze_es | Commonsense Reasoning | [Few-shot Learning with Multilingual Generative Language Models](https://aclanthology.org/2022.emnlp-main.616/) | https://huggingface.co/datasets/juletxara/xstory_cloze |
### Citation

Paper for SpanishBench coming soon.

### Groups and Tasks

#### Groups

- `spanish_bench`: All tasks included in SpanishBench.
- `flores_es`: All FLORES translation tasks from or to Spanish.
- `phrases_es`: Two Phrases_va tasks for language adaptation between Spanish and Valencian.

#### Tasks

The following tasks evaluate models on the SpanishBench datasets using various scoring methods.
- `belebele_spa_Latn`
- `flores_es`
- `flores_es-ca`
- `flores_es-de`
- `flores_es-en`
- `flores_es-eu`
- `flores_es-fr`
- `flores_es-gl`
- `flores_es-it`
- `flores_es-pt`
- `flores_ca-es`
- `flores_de-es`
- `flores_en-es`
- `flores_eu-es`
- `flores_fr-es`
- `flores_gl-es`
- `flores_it-es`
- `flores_pt-es`
- `mgsm_direct_es_v2` (`v2` is due to an existing open issue in the original task)
- `paws_es`
- `phrases_es`
- `veritasqa_gen_es`
- `veritasqa_mc1_es`
- `veritasqa_mc2_es`
- `wnli_es`
- `xlsum_es`
- `xnli_es`
- `xquad_es`
- `xstorycloze_es`
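The sixteen directional FLORES subtasks above follow a uniform `flores_{src}-{tgt}` naming pattern that pairs Spanish with each covered language in both directions. A small sketch (illustrative only, not part of the harness) that reproduces that list:

```python
# Languages paired with Spanish in the flores_es group (taken from the task list above).
OTHER_LANGS = ["ca", "de", "en", "eu", "fr", "gl", "it", "pt"]

def flores_task_names(pivot="es"):
    """Build flores_{src}-{tgt} task names for every directional pair involving the pivot."""
    names = []
    for lang in OTHER_LANGS:
        names.append(f"flores_{pivot}-{lang}")  # Spanish -> other language
    for lang in OTHER_LANGS:
        names.append(f"flores_{lang}-{pivot}")  # other language -> Spanish
    return names

print(flores_task_names())  # 16 names, flores_es-ca .. flores_pt-es
```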
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:

- `belebele_spa_Latn`: Belebele Spanish
- `mgsm_direct_es`: MGSM Spanish (we fix an existing open issue in the original task)
- `paws_es`: PAWS-X Spanish
- `veritasqa_gen_es`: VeritasQA Spanish
- `veritasqa_mc1_es`: VeritasQA Spanish
- `veritasqa_mc2_es`: VeritasQA Spanish
- `xnli_es`: XNLI Spanish
- `xstorycloze_es`: XStoryCloze Spanish

### Checklist

* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation?
  * [ ] Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
File: `_flores_common_yaml` (new file)
@@ -0,0 +1,28 @@
group: flores
dataset_path: facebook/flores
dataset_name: all
output_type: generate_until
#! The test split of flores is not publicly available! (See paper section 6.1)
#! We are using `dev` and `devtest` splits, but they're mapped to train/validation/test in `data/flores/flores.py`.
training_split: dev
validation_split: dev
test_split: devtest
fewshot_split: dev
target_delimiter: ''
generation_kwargs:
  until:
    - "\n"
metric_list:
  - metric: bleu
    aggregation: bleu
    higher_is_better: true
  - metric: ter
    aggregation: ter
    higher_is_better: false
  - metric: chrf
    aggregation: chrf
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
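In a `generate_until` task, the `until` field lists stop sequences: the model's continuation is cut at the earliest occurrence of any of them before metrics are computed, so with `until: ["\n"]` only the first generated line is scored. A minimal stdlib sketch of that truncation behaviour (an illustration of the idea, not the harness's actual implementation):

```python
def truncate_at_stop(continuation, until=("\n",)):
    """Cut a generated continuation at the earliest occurrence of any stop
    sequence, mimicking the `until` field of a generate_until task (sketch)."""
    cut = len(continuation)
    for stop in until:
        idx = continuation.find(stop)
        if idx != -1:
            cut = min(cut, idx)  # keep the earliest stop position seen so far
    return continuation[:cut]

# Only the text before the first newline would be scored:
print(truncate_at_stop("Hola mundo\nSegunda línea"))  # -> Hola mundo
```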