[Task Submission] Divergent DepRel Distributions (europarl_dbca_splits) #33
Conversation
Hi GenBench team! To accommodate multiple datasets, I created this new Task Submission with subtasks, which replaces the old submission https://github.com/GenBench/genbench_cbt/pull/15. I hope this is ok!
Yes, that's alright; make sure your paper submission contains the right PR URL, though!
@anmoisio We're in the process of merging the tasks into the repo. In order to merge your task, we need the following changes:
Hey @anmoisio! Is there any update on the usage_example?
Hi @kazemnejad, sorry for the delay; see the last commits for the example. One question about subtasks: I have used the subtask feature in this task, although it doesn't really have subtasks. Rather, it has sub-datasets, in the sense that the abstract task does not change between subtasks; only the dataset differs. There is now a lot of repetition, because I copied task.py etc. unchanged for each subtask. So my question is: is there a better way to include sub-datasets for one task?
@kazemnejad I'd recommend creating an abstract Task (e.g.
Hi @kazemnejad, sorry to commit after you added the ready-to-be-merged tag already, but I have now removed the repetitive code as you instructed. Thanks for the help again!
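The abstract-Task pattern recommended above can be sketched as follows. All class names here are hypothetical, and the metric is a placeholder exact-match rate rather than the chrF2++ used by the real task; the point is only that shared logic lives in one base class while each subtask carries just its split-specific configuration.

```python
from abc import ABC


class BaseDbcaTask(ABC):
    """Hypothetical shared base for all europarl_dbca_splits subtasks.

    Each subtask differs only in which data split it loads, so the
    evaluation logic is defined once here and inherited everywhere.
    """

    split_name: str = ""

    def evaluate_predictions(self, predictions, gold):
        # Placeholder metric (exact-match rate) to illustrate the
        # inheritance pattern; the real task computes chrF2++.
        matches = sum(p == g for p, g in zip(predictions, gold))
        return matches / len(gold)


class Comdiv0De(BaseDbcaTask):
    split_name = "comdiv0_de"  # low compound divergence, German


class Comdiv1De(BaseDbcaTask):
    split_name = "comdiv1_de"  # high compound divergence, German
```

With this layout, adding a new language pair or divergence level is a two-line subclass instead of a copied module.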
Divergent DepRel Distributions
Note: this PR replaces https://github.com/GenBench/genbench_cbt/pull/15
To assess NMT models' capacity to translate novel syntactic structures, we split the Europarl parallel corpus into training and test sets with divergent distributions of syntactic structures. The data-splitting method derives from the distribution-based compositionality assessment (DBCA) introduced by Keysers et al. (2020). We define the atoms as the lemmas and dependency relations, and the compounds as three-element tuples of two lemmas (the head and the dependent) and their relation, for instance (appreciate, dobj, vigilance).
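The atom and compound definitions above can be made concrete with a small sketch. Here a parsed sentence is assumed to already be a list of (head_lemma, deprel, dependent_lemma) triples; in practice these would come from a dependency parser, and the function name is hypothetical.

```python
from collections import Counter


def extract_atoms_and_compounds(parsed_sentence):
    """Count atoms and compounds in one dependency-parsed sentence.

    `parsed_sentence` is assumed to be a list of
    (head_lemma, deprel, dependent_lemma) triples.
    Atoms are individual lemmas and dependency relations;
    compounds are the full three-element tuples.
    """
    atoms, compounds = Counter(), Counter()
    for head, rel, dep in parsed_sentence:
        atoms[head] += 1
        atoms[rel] += 1
        atoms[dep] += 1
        compounds[(head, rel, dep)] += 1
    return atoms, compounds
```

For the example from the text, the triple (appreciate, dobj, vigilance) yields three atoms and one compound.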
Authors
[email protected]
[email protected]
[email protected]
Implementation
This submission modifies the task.py module: evaluate_predictions() is overridden to compute the chrF2 score with the Hugging Face evaluate library, and auxiliary methods are added to calculate the divergences between the train and test compound and atom distributions.
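The train-test divergence in DBCA is based on the Chernoff coefficient between two discrete distributions. A minimal sketch, not the submission's actual code, might look like this; Keysers et al. (2020) use alpha = 0.5 for atom divergence and alpha = 0.1 for compound divergence.

```python
def normalise(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def chernoff_divergence(p, q, alpha):
    """1 minus the Chernoff coefficient of distributions p and q.

    Returns 0.0 when p == q and 1.0 when their supports are disjoint.
    alpha = 0.5 is used for atom divergence and alpha = 0.1 for
    compound divergence in DBCA (Keysers et al., 2020).
    """
    keys = set(p) | set(q)
    return 1.0 - sum(
        p.get(k, 0.0) ** alpha * q.get(k, 0.0) ** (1 - alpha) for k in keys
    )
```

A low-compound-divergence split keeps this value near 0 between train and test compound distributions, while a high-divergence split pushes it towards 1, with atom divergence kept low in both cases.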
Usage
To evaluate generalisation, both the low- and high-compound-divergence data splits should be evaluated. Therefore, run both subtasks for the selected language, e.g. "comdiv0_de" and "comdiv1_de", and take the ratio of the chrF2++ scores:
task.comdiv1_de.evaluate_predictions(predictions, gold) / task.comdiv0_de.evaluate_predictions(predictions, gold)
Checklist: the task was checked with the genbench-cli test-task tool.