Showing 7 changed files with 337 additions and 0 deletions.

src/genbench/tasks/cross_lingual_consistency/config.jsonnet
50 changes: 50 additions & 0 deletions
{
    name: 'Cross Lingual Consistency',

    // @TODO: Add a description of the task
    description: 'Cross Lingual Consistency aims to measure ...',

    // @TODO: Add a list of keywords that describe the task
    keywords: [
        'Model Consistency',
        'Multilinguality',
        'Knowledge Incorporation',
        'Large-scale Pre-trained Language Model',
        'Model Evaluation',
        'Knowledge Probing',
    ],

    authors: [
        'Jirui Qi',
        'Raquel Fernández',
        'Arianna Bisazza',
    ],

    data_source: {
        type: 'manual',
        test: 'https://raw.githubusercontent.com/Betswish/Cross-Lingual-Consistency/main/1_easyrun/',
    },

    has_validation_set: false,
    has_train_set: false,

    task_type: 'multiple_choice',

    field_mapping: {
        input: 'Prompt',
        target: 'Ans',
        target_options: 'Candidate Ans',
    },

    evaluation_metrics: [
    ],

    preparation_strategies: {
        // A recipe for preparing the model to perform the task by configuring its prompt.
        // This recipe is suitable for generative LMs such as GPT-3, OPT, T5, etc.
        // We provide a few options for configuring the prompt. But, the task creator can
        // also provide a custom prompt preparation in the task's Python class.
    },
}
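
To illustrate the `field_mapping` above: each row of the per-language TSV files is read with the HuggingFace `csv` loader, and its `Prompt`, `Ans`, and `Candidate Ans` columns become the task's `input`, `target`, and `target_options` fields. A hypothetical formatted instance might look like the sketch below (the values are borrowed from the illustrative example in the task implementation further down, not from actual BMLAMA data):

```python
# Hypothetical instance after field_mapping has been applied.
example = {
    "input": "The capital of Canada ",  # from the 'Prompt' column
    "target": "Ottawa",                 # from the 'Ans' column
    "target_options": [                 # from the 'Candidate Ans' column
        "Beijing",
        "Tokyo",
        "Ottawa",
    ],
}
```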

41 changes: 41 additions & 0 deletions

# Cross Lingual Consistency

## Abstract
Multilingual large-scale Pretrained Language Models (PLMs) have been shown to learn considerable amounts of factual knowledge from the training corpora. However, large variations are observed in the extent to which this knowledge generalizes across different languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we resplit the existing mLAMA (Kassner et al., 2021) to construct a new benchmark, BMLAMA, where instances for each language are balanced. Additionally, we propose a new Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently of accuracy. We conduct an in-depth analysis of the determining factors for CLC, both at the model level and at the language-pair level. Among other results, we find that average CLC is low across different PLMs. Moreover, increasing model size leads to higher factual probing accuracy in most languages but does not improve cross-lingual consistency. All code and data are released at https://github.com/Betswish/Cross-Lingual-Consistency.

## Statistics of the Balanced Multilingual LAMA (BMLAMA)

| Property | BMLAMA-17 | BMLAMA-53 |
|---|---|---|
| # Languages | 17 | 53 |
| # Relations | 41 | 30 |
| # Queries | 6792 × 17 | 3070 × 53 |
| # Candidates (avg) | 9.71 | 9.56 |

## Examples
Instead of measuring the amount of factual knowledge encoded in each language, we focus on its consistency across languages, which reflects how knowledge is shared across different languages within the PLM.
As illustrated in the motivation figure (see motivation.png), the multilingual Bloom-3b model outputs consistently correct completions of the first prompt when queried in English, Spanish, and Vietnamese, but not in Hungarian and Greek.
The model also gives consistent, though wrong, answers to the second query in English, Spanish, and Vietnamese, but not in Hungarian and Greek, suggesting the first three languages share relevant knowledge representations within the model.

## Data Source
BMLAMA is filtered from X-FACTR and MLAMA, with the number of queries balanced across languages.
We first probe the PLMs with the same questions in all supported languages and record the rankings of all candidate answers.
Based on the ranking lists of two languages (e.g. en and es), we propose the RankC metric to compute the cross-lingual consistency of factual knowledge between different language pairs.
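
For reference, the ranking-based consistency computed by the task implementation below can be sketched as follows (the notation is ours and mirrors the weighting used in the code; it is not a verbatim definition from the paper). For languages $\ell_1$ and $\ell_2$, $N$ parallel queries, and $k$ candidates per query:

$$
\mathrm{RankC}(\ell_1,\ell_2) \;=\; \frac{1}{N}\sum_{i=1}^{N}\;\sum_{j=1}^{k} w_j\,\frac{\lvert T^{\ell_1}_j(i)\,\cap\,T^{\ell_2}_j(i)\rvert}{j},
\qquad
w_j \;=\; \frac{e^{\,k+1-j}}{\sum_{j'=1}^{k} e^{\,k+1-j'}},
$$

where $T^{\ell}_j(i)$ is the set of the top-$j$ candidates for query $i$ ranked by the model's probabilities in language $\ell$; the softmax weights $w_j$ emphasize agreement at the top of the two rankings.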

## Usage
You need to specify the language pair you want to use when requesting the dataset.
You also need to set the variable `mini` to indicate whether you want to use BMLAMA-17 (`True`) or BMLAMA-53 (`False`).

Then you can feed the dataset to the model you want to evaluate, record each candidate answer and its corresponding probability,
and finally pass these results to the evaluation function to get the cross-lingual consistency score between the two languages.
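
Below is a minimal sketch of this workflow. It assumes the standard GenBench loader (`load_task`) and the prompt-based testing preparation strategy; `score_options` is a hypothetical placeholder for your own model call that returns one log-probability per candidate answer.

```python
from genbench import load_task
from genbench.api import PreparationStrategy

# Build the concatenated en/es evaluation set
# (mini=True selects BMLAMA-17, mini=False selects BMLAMA-53).
task = load_task("cross_lingual_consistency")
ds = task.get_prepared_datasets(
    PreparationStrategy.PROMPT_BASED_TESTING,
    shot_list=[0],
    mini=True,
    lang1="en",
    lang2="es",
)

# Score every candidate answer with your own model; `score_options`
# is a placeholder returning one log-probability per target option.
predictions = []
for example in ds:
    logprobs = score_options(example["input"], example["target_options"])
    predictions.append(
        {
            "target_options": example["target_options"],
            "target_option_logprobs": logprobs,
        }
    )

# RankC score for the chosen language pair.
print(task.evaluate_predictions(predictions=predictions, gold=ds))
```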

## Limitations and Bias
*Note any known limitations or biases that the Cross Lingual Consistency has, with links and references if possible.*

Limitation:
Due to the restriction of our GPU resources, we could not test models larger than Bloom-3b. Extending our analysis to larger-scale models in future work is encouraged, to see if the same conclusions are reached. Nevertheless, results in the paper indicate that the average CLC grows extremely slowly with model scale.

The facts included in BMLAMA, while supposed to be universal, are likely to be more relevant to the Western world, which can introduce a bias in the evaluation. We inherit this problem from the benchmarks BMLAMA is built upon. Fixing this issue is not trivial, especially in comparative work where probing the exact same set of facts across languages is a requirement, and should be given attention in future work.

Bias:
Since BMLAMA data is derived from the previous works X-FACTR and MLAMA, queries in BMLAMA are also likely to encounter the gender and racial bias issues that are inevitable in the source Wikidata. However, this paper mostly focuses on the consistency between knowledge in different languages rather than the specific knowledge in a single language.

## GenBench Eval card
See eval_card.png.

246 changes: 246 additions & 0 deletions
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Optional

import datasets
import numpy as np
from datasets import load_dataset

from genbench import Task
from genbench.api import PreparationStrategy


class CrossLingualConsistencyTask(Task):
    def _load_data_source(
        self,
        mini,
        lang1,
        lang2,
    ):
        """
        Private method to load the data source based on the type specified in the configuration.
        The data source can be of two types: 'manual' or 'hf'.
        For 'manual' type, it loads JSON datasets from the specified test, validation, and train files.
        For 'hf' type, it loads datasets from the HuggingFace datasets hub using the given HuggingFace dataset ID(s)
        and the git commit SHA for the specified version of the dataset.
        Returns:
            Loaded dataset which can be any of the following types:
            DatasetDict, Dataset, IterableDatasetDict, IterableDataset.
        Raises:
            ValueError: If the specified data source type is not supported.
        """
        if self.config.data_source.type == "manual":
            if mini:
                file_path = self.config.data_source.test + "BMLAMA17/"
            else:
                file_path = self.config.data_source.test + "BMLAMA53/"

            data_files = dict()
            for lang in [lang1, lang2]:
                data_files[lang] = file_path + lang + ".tsv"
            """
            data_files = {
                "test": self.config.data_source.test,
            }
            """
            if self.config.data_source.validation is not None:
                data_files["validation"] = self.config.data_source.validation
            if self.config.data_source.train is not None:
                data_files["train"] = self.config.data_source.train

            return load_dataset("csv", data_files=data_files, delimiter="\t")
            # return load_dataset("json", data_files=data_files, field=None)
        elif self.config.data_source.type == "hf":
            hf_id = self.config.data_source.hf_id
            if isinstance(hf_id, str):
                hf_id = [hf_id]

            return load_dataset(*hf_id, revision=self.config.data_source.git_commit_sha)
        else:
            raise ValueError(f"Unsupported data source type: {self.config.data_source.type}")

    def get_datasets_raw(self, mini=True, lang1="en", lang2="es"):
        data_source = self._load_data_source(mini=mini, lang1=lang1, lang2=lang2)

        """
        if self.config.split_file is not None:
            split_file_path = get_task_dir(self.root_task_id, self.subtask_id) / self.config.split_file
            splitting_info = load_jsonnet(split_file_path)
            data_source = resplit_data_source(data_source, splitting_info)
        """

        output = {}
        for split in sorted(data_source.keys()):
            dataset = data_source[split]
            output[split] = dataset.map(
                self.format_example,
                num_proc=self.dataset_format_num_proc,
                batched=self.dataset_format_batched,
                desc=f"Formatting `{split}` examples",
            )
            assert all([f in output[split].column_names for f in ["input", "target"]])

        # Assign id to each example
        for split in sorted(output.keys()):
            output[split] = output[split].map(
                lambda example, idx: {"_genbench_idx": idx},
                with_indices=True,
                num_proc=self.dataset_format_num_proc,
                batched=False,
                desc=f"Assigning id to `{split}` examples",
            )

        return output

    def get_prepared_datasets(
        self,
        preparation_strategy: PreparationStrategy,
        shot_list: Optional[List[int]] = None,
        random_seed: int = 42,
        mini=None,
        lang1=None,
        lang2=None,
    ):
        # `mini` must be set explicitly: True selects BMLAMA-17, False selects BMLAMA-53.
        if mini is None:
            raise ValueError("Value for 'mini=True/False' is required for this task")
        if not lang1 or not lang2:
            raise ValueError("Values for 'lang1=' and 'lang2=' are required for this task")

        if preparation_strategy == PreparationStrategy.FINETUNING:
            if self.config.preparation_strategies.finetuning is None:
                raise ValueError("Finetuning preparation strategy is not supported for this task")

        datasets = self.get_datasets_raw(mini=mini, lang1=lang1, lang2=lang2)

        # datasets is a dict of language_id -> Dataset
        lang1_ds = datasets[lang1]
        lang2_ds = datasets[lang2]

        # They all have the same length (i.e. they are translations of each other)
        assert len(lang1_ds) == len(lang2_ds)

        # Each of them contains instances of the form:
        # {
        #     "input": "The capital of Canada ",
        #     "target": "Ottawa",
        #     "target_options": [
        #         "Beijing",
        #         "Tokyo",
        #         "Ottawa",
        #     ],
        #     "_genbench_idx": <some index>
        # }

        # Add language identifier to each instance
        lang1_ds = lang1_ds.map(lambda x: {"lang": lang1})
        lang2_ds = lang2_ds.map(lambda x: {"lang": lang2})

        # Concatenate the datasets
        from datasets import concatenate_datasets

        final_dataset = concatenate_datasets([lang1_ds, lang2_ds])

        return final_dataset

    def evaluate_predictions(
        self,
        *,
        predictions: List[Mapping[str, Any]] = None,
        gold: datasets.Dataset = None,
    ) -> Dict[str, float]:
        def softmax(x):
            """Compute softmax values for each set of scores in x."""
            return np.exp(x) / np.sum(np.exp(x), axis=0)

        # Make sure that the predictions are in the same order as the gold dataset
        assert len(predictions) == len(gold)

        # Just to make sure the gold dataset is the same as the one we generated in `get_prepared_datasets`
        assert "lang" in gold.features
        assert "_genbench_idx" in gold.features

        # Also, make sure that predictions contain logprobs for each option
        assert all(
            "target_option_logprobs" in pred and len(pred["target_option_logprobs"]) == len(pred["target_options"])
            for pred in predictions
        )

        # Group the predictions and instances such that we have:
        # _genbench_idx -> {
        #     "lang_id_1": { ...data_instance..., target_option_logprobs: ... }
        #     "lang_id_2": { ...data_instance..., target_option_logprobs: ... }
        # },
        grouped_examples = defaultdict(dict)
        for pred, gold_row in zip(predictions, gold):
            original_idx = gold_row["_genbench_idx"]
            lang = gold_row["lang"]
            grouped_examples[original_idx][lang] = {
                **gold_row,
                **pred,
            }

        CLC_score = 0
        count = 0
        langs = []
        # Now, we compute the cross lingual consistency score
        for idx, example in grouped_examples.items():
            # Rank the candidate options of each language by their logprobs.
            # Candidates are compared across languages by their position in the
            # original candidate list, which is assumed to be parallel between
            # the two language files of BMLAMA.
            for lang, data in example.items():
                if len(langs) < 2:
                    langs.append(lang)

                logprobs = data["target_option_logprobs"]
                # Candidate indices sorted from most to least probable
                ranked_indices = sorted(
                    range(len(logprobs)),
                    key=lambda i: logprobs[i],
                    reverse=True,
                )
                grouped_examples[idx][lang]["ranked_option_indices"] = ranked_indices

            # Compute the cross lingual consistency score for this query
            ranking1 = grouped_examples[idx][langs[0]]["ranked_option_indices"]
            ranking2 = grouped_examples[idx][langs[1]]["ranked_option_indices"]

            # Softmax weights give more importance to agreement at the top ranks
            order = [len(ranking1) - i for i in range(len(ranking1))]
            order = np.array(order)
            weight = softmax(order)

            for j in range(len(ranking1)):
                set1 = set(ranking1[: j + 1])
                set2 = set(ranking2[: j + 1])

                cover = set1.intersection(set2)
                CLC_score += weight[j] * (len(cover) / len(set1))

            count += 1
        CLC_score /= count

        # Compute the final score
        result = {
            "cross_lingual_consistency": float(CLC_score),
        }

        return result

    def format_example(self, example: Mapping[str, Any]) -> Mapping[str, Any]:
        if self.config.field_mapping is None:
            assert "input" in example
            assert "target" in example
            output = {}
        else:
            assert "input" in self.config.field_mapping
            assert "target" in self.config.field_mapping

            output = {
                "input": example[self.config.field_mapping["input"]],
                "target": example[self.config.field_mapping["target"]],
            }

            if "target_options" in self.config.field_mapping:
                output["target_options"] = example[self.config.field_mapping["target_options"]]

        return output