Add My Awesome Task
Betswish committed Nov 30, 2023
1 parent decee6a commit 7a6181e
Showing 7 changed files with 337 additions and 0 deletions.
50 changes: 50 additions & 0 deletions src/genbench/tasks/cross_lingual_consistency/config.jsonnet
@@ -0,0 +1,50 @@
{
name: 'Cross Lingual Consistency',

    // Short description of what the task measures.
    description: 'Cross-Lingual Consistency (CLC) measures how consistently multilingual pretrained language models rank candidate answers to the same factual queries across languages, using the BMLAMA benchmark and the RankC metric.',

keywords: [
'Model Consistency',
'Multilinguality',
'Knowledge Incorporation',
'Large-scale Pre-trained Language Model',
'Model Evaluation',
'Knowledge Probing',
],

authors: [
'Jirui Qi',
'Raquel Fernández',
'Arianna Bisazza',

],

data_source: {
type: 'manual',
test: 'https://raw.githubusercontent.com/Betswish/Cross-Lingual-Consistency/main/1_easyrun/',
},


has_validation_set: false,
has_train_set: false,

task_type: 'multiple_choice',

field_mapping: {
input: 'Prompt',
target: 'Ans',
target_options: 'Candidate Ans',
},

evaluation_metrics: [
],

preparation_strategies: {
// A recipe for preparing the model to perform the task by configuring its prompt.
// This recipe is suitable for generative LMs such as GPT-3, OPT, T5, etc.
// We provide a few options for configuring the prompt. But, the task creator can
// also provide a custom prompt preparation in the task's Python class.
},
}
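
For reference, `field_mapping` above maps the BMLAMA TSV columns (`Prompt`, `Ans`, `Candidate Ans`) to the standard GenBench fields. After formatting, each example takes roughly the following shape (values reproduced from the illustrative comment in `task.py` below; the index value is hypothetical):

```python
# A single formatted BMLAMA instance (illustrative values from the comment in task.py).
example = {
    "input": "The capital of Canada ",
    "target": "Ottawa",
    "target_options": ["Beijing", "Tokyo", "Ottawa"],
    "_genbench_idx": 0,  # id assigned by get_datasets_raw
}
```
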
41 changes: 41 additions & 0 deletions src/genbench/tasks/cross_lingual_consistency/doc.md
@@ -0,0 +1,41 @@
# Cross Lingual Consistency

## Abstract
Multilingual large-scale Pretrained Language Models (PLMs) have been shown to learn considerable amounts of factual knowledge from the training corpora. However, large variations are observed in the extent to which this knowledge generalizes across different languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we resplit the existing mLAMA (Kassner et al., 2021) to construct a new benchmark, BMLAMA, in which instances for each language are balanced. Additionally, we propose a new Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently of accuracy. We conduct an in-depth analysis of the determining factors of CLC, both at the model level and at the language-pair level. Among other results, we find that average CLC is low across different PLMs. Moreover, increasing model size leads to higher factual probing accuracy in most languages but does not improve cross-lingual consistency. All code and data are released at https://github.com/Betswish/Cross-Lingual-Consistency.

## Statistics of the Balanced Multilingual LAMA (BMLAMA)

| Property | BMLAMA-17 | BMLAMA-53 |
|---|---|---|
| # Languages | 17 | 53 |
| # Relations | 41 | 30 |
| # Queries | 6792 × 17 | 3070 × 53 |
| # Candidates (avg) | 9.71 | 9.56 |

## Examples
Instead of measuring the amount of factual knowledge encoded in each language, we focus on its consistency across languages, which reflects how knowledge generalizes across languages within the PLM.
As illustrated in the motivation figure (motivation.png), the multilingual Bloom-3b model outputs consistently correct completions of the first prompt when queried in English, Spanish, and Vietnamese, but not in Hungarian and Greek.
The model also outputs consistent, though wrong, answers to the second query in English, Spanish, and Vietnamese, but not in Hungarian and Greek, suggesting the first three languages share relevant knowledge representations within the model.

## Data Source
BMLAMA is filtered from X-FACTR and MLAMA, with the number of queries balanced across languages.
We first probe the PLMs with the same questions in all supported languages and record the ranking of all candidate answers.
Based on the ranking lists of two languages (e.g., en and es), we propose the RankC metric to compute the cross-lingual consistency of factual knowledge between language pairs; a sketch of the computation is given below.
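
Concretely, for the i-th query let $C_i$ be its candidate set and let $P_{\ell}@j$ be the top-$j$ candidates when ranked by the model's probability in language $\ell$. A sketch of the score computed by `evaluate_predictions` in `task.py` (notation introduced here for illustration only):

$$
\mathrm{RankC}(\ell_1, \ell_2) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{|C_i|} w_j \,\frac{|P_{\ell_1}@j \cap P_{\ell_2}@j|}{j},
\qquad
w_j = \frac{e^{\,|C_i| + 1 - j}}{\sum_{k=1}^{|C_i|} e^{\,|C_i| + 1 - k}},
$$

so agreement on the top-ranked candidates receives the largest weight, and the score is independent of whether the shared top candidates are actually correct.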

## Usage
You need to specify the language pair you want to use when requesting the dataset.
You also need to set the variable `mini` to indicate whether you want to use BMLAMA-17 (`True`) or BMLAMA-53 (`False`).

Then you can feed the dataset to the model you want to evaluate, record each candidate answer together with its log-probability,
and finally pass these results to the evaluation function to obtain the cross-lingual consistency score between the two languages, as sketched below.
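
A minimal sketch of this workflow, assuming the standard GenBench `load_task` entry point and `PreparationStrategy.PROMPT_BASED_TESTING`, plus a hypothetical `model_logprob(prompt, option)` helper that returns your model's log-probability for a candidate answer:

```python
from genbench import load_task
from genbench.api import PreparationStrategy

# Load the task and request English/Spanish queries from BMLAMA-17 (mini=True).
task = load_task("cross_lingual_consistency")
ds = task.get_prepared_datasets(
    PreparationStrategy.PROMPT_BASED_TESTING,
    mini=True,
    lang1="en",
    lang2="es",
)

# Score every candidate answer with your own model; `model_logprob` is a
# placeholder for whatever scoring routine your model exposes.
predictions = []
for example in ds:
    logprobs = [model_logprob(example["input"], option) for option in example["target_options"]]
    predictions.append(
        {
            "target_options": example["target_options"],
            "target_option_logprobs": logprobs,
        }
    )

# RankC score between the two requested languages.
print(task.evaluate_predictions(predictions=predictions, gold=ds))
```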

## Limitations and Bias
Limitations:
Due to limited GPU resources, we could not test models larger than Bloom-3b. Extending our analysis to larger-scale models in future work is encouraged, to see whether the same conclusions hold. Nevertheless, results in the paper indicate that the average CLC grows extremely slowly as model scale increases.

The facts included in BMLAMA, while supposed to be universal, are likely to be more relevant to the Western world, which can introduce a bias in the evaluation. We inherit this problem from the benchmarks BMLAMA is built upon. Fixing this issue is not trivial, especially in comparative work where probing the exact same set of facts across languages is a requirement, and it should be given attention in future work.

Bias:
Since BMLAMA is derived from the previous works X-FACTR and MLAMA, its queries are also likely to inherit the gender and racial biases present in the source Wikidata. However, this work mostly focuses on the consistency of knowledge across languages rather than on the specific knowledge encoded in a single language.

## GenBench Eval card
See eval\_card.png.
246 changes: 246 additions & 0 deletions src/genbench/tasks/cross_lingual_consistency/task.py
@@ -0,0 +1,246 @@
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Optional

import datasets
import numpy as np
from datasets import concatenate_datasets, load_dataset

from genbench import Task
from genbench.api import PreparationStrategy

class CrossLingualConsistencyTask(Task):
def _load_data_source(
self,
mini,
lang1,
lang2,
):
"""
Private method to load the data source based on the type specified in the configuration.
The data source can be of two types: 'manual' or 'hf'.
        For the 'manual' type, it downloads the BMLAMA TSV file for each of the two requested languages from the base URL given in the config's `test` field.
For 'hf' type, it loads datasets from the HuggingFace datasets hub using the given HuggingFace dataset ID(s)
and the git commit SHA for the specified version of the dataset.
Returns:
Loaded dataset which can be any of the following types:
DatasetDict, Dataset, IterableDatasetDict, IterableDataset.
Raises:
ValueError: If the specified data source type is not supported.
"""
if self.config.data_source.type == "manual":
if mini:
file_path = self.config.data_source.test + "BMLAMA17/"
else:
file_path = self.config.data_source.test + "BMLAMA53/"

data_files = dict()
for lang in [lang1, lang2]:
data_files[lang] = file_path + lang + ".tsv"
"""
data_files = {
"test": self.config.data_source.test,
}
"""
if self.config.data_source.validation is not None:
data_files["validation"] = self.config.data_source.validation
if self.config.data_source.train is not None:
data_files["train"] = self.config.data_source.train

return load_dataset("csv", data_files=data_files, delimiter="\t")
# return load_dataset("json", data_files=data_files, field=None)
elif self.config.data_source.type == "hf":
hf_id = self.config.data_source.hf_id
if isinstance(hf_id, str):
hf_id = [hf_id]

return load_dataset(*hf_id, revision=self.config.data_source.git_commit_sha)
else:
raise ValueError(f"Unsupported data source type: {self.config.data_source.type}")

def get_datasets_raw(self, mini=True, lang1="en", lang2="es"):
data_source = self._load_data_source(mini=mini, lang1=lang1, lang2=lang2)

"""
if self.config.split_file is not None:
split_file_path = get_task_dir(self.root_task_id, self.subtask_id) / self.config.split_file
splitting_info = load_jsonnet(split_file_path)
data_source = resplit_data_source(data_source, splitting_info)
"""

output = {}
for split in sorted(data_source.keys()):
dataset = data_source[split]
output[split] = dataset.map(
self.format_example,
num_proc=self.dataset_format_num_proc,
batched=self.dataset_format_batched,
desc=f"Formatting `{split}` examples",
)
assert all([f in output[split].column_names for f in ["input", "target"]])

# Assign id to each example
for split in sorted(output.keys()):
output[split] = output[split].map(
lambda example, idx: {"_genbench_idx": idx},
with_indices=True,
num_proc=self.dataset_format_num_proc,
batched=False,
desc=f"Assigning id to `{split}` examples",
)

return output

def get_prepared_datasets(
self,
preparation_strategy: PreparationStrategy,
shot_list: Optional[List[int]] = None,
random_seed: int = 42,
        mini=None,
lang1=None,
lang2=None,
):
        if mini is None:
            raise ValueError("A value for 'mini' (True for BMLAMA-17, False for BMLAMA-53) is required for this task")
        if not lang1 or not lang2:
            raise ValueError("Values for 'lang1' and 'lang2' are required for this task")

if preparation_strategy == PreparationStrategy.FINETUNING:
if self.config.preparation_strategies.finetuning is None:
raise ValueError("Finetuning preparation strategy is not supported for this task")

        raw_datasets = self.get_datasets_raw(mini=mini, lang1=lang1, lang2=lang2)

        # raw_datasets is a dict of language_id -> Dataset
        lang1_ds = raw_datasets[lang1]
        lang2_ds = raw_datasets[lang2]

        # They all have the same length (i.e., they are translations of each other)
assert len(lang1_ds) == len(lang2_ds)

# Each of them contains instances of the form:
# {
# "input": "The capital of Canada ",
# "target": "Ottawa",
# "target_options": [
# "Beijing",
# "Tokyo",
# "Ottawa",
# ],
        #     "_genbench_idx": <some index>
# }

# Add language identifier to each instance
lang1_ds = lang1_ds.map(lambda x: {"lang": lang1})
lang2_ds = lang2_ds.map(lambda x: {"lang": lang2})

        # Concatenate the two language-specific datasets into a single dataset
        final_dataset = concatenate_datasets([lang1_ds, lang2_ds])

return final_dataset

def evaluate_predictions(
self,
*,
predictions: List[Mapping[str, Any]] = None,
gold: datasets.Dataset = None,
) -> Dict[str, float]:
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
return np.exp(x) / np.sum(np.exp(x), axis=0)

# Make sure that the predictions are in the same order as the gold dataset
assert len(predictions) == len(gold)

# Just to make sure the gold dataset is the same as the one we generated in `get_prepared_datasets`
assert "lang" in gold.features
assert "_genbnech_idx" in gold.features

# Also, make sure that predictions contain logprobs for each option
assert all(
"target_option_logprobs" in pred and len(pred["target_option_logprobs"]) == len(pred["target_options"])
for pred in predictions
)

        # Group the predictions and gold instances such that we have:
        # _genbench_idx -> {
# "lang_id_1": { ...data_instance..., target_option_logprobs: ... }
# "lang_id_2": { ...data_instance..., target_option_logprobs: ... }
# },

grouped_examples = defaultdict(dict)
        for pred, gold_example in zip(predictions, gold):
            original_idx = gold_example["_genbench_idx"]
            lang = gold_example["lang"]
            grouped_examples[original_idx][lang] = {
                **gold_example,
                **pred,
            }

CLC_score = 0
count = 0
langs = []
# Now, we compute the cross lingual consistency score
for idx, example in grouped_examples.items():
# Rerank the options based on the logprobs
for lang, data in example.items():
if len(langs) < 2:
langs.append(lang)

logprobs = data["target_option_logprobs"]
                sorted_options = sorted(
                    zip(data["target_options"], logprobs),
                    key=lambda x: x[1],
                    # Rank candidates from the most to the least probable, so that
                    # position 0 holds each language's top prediction.
                    reverse=True,
                )
sorted_options, logprobs = zip(*sorted_options)
grouped_examples[idx][lang]["target_options"] = list(sorted_options)
grouped_examples[idx][lang]["target_option_logprobs"] = list(logprobs)

# Compute the cross lingual consistency score
ranking1 = grouped_examples[idx][langs[0]]["target_options"]
ranking2 = grouped_examples[idx][langs[1]]["target_options"]

            # Softmax weights that decay with rank position, so that agreement on
            # the top-ranked candidates contributes most to the consistency score.
            order = np.array([len(ranking1) - i for i in range(len(ranking1))])
            weight = softmax(order)

            for j in range(len(ranking1)):
                # Overlap between the top-(j+1) candidates of the two languages.
                set1 = set(ranking1[: j + 1])
                set2 = set(ranking2[: j + 1])

                cover = set1.intersection(set2)
                CLC_score += weight[j] * (len(cover) / len(set1))

count += 1
CLC_score /= count

# Compute the final score
result = {
"cross_lingual_consistency": CLC_score,
}

return result

def format_example(self, example: Mapping[str, Any]) -> Mapping[str, Any]:
if self.config.field_mapping is None:
assert "input" in example
assert "target" in example
output = {}
else:
assert "input" in self.config.field_mapping
assert "target" in self.config.field_mapping

output = {
"input": example[self.config.field_mapping["input"]],
"target": example[self.config.field_mapping["target"]],
}

if "target_options" in self.config.field_mapping:
output["target_options"] = example[self.config.field_mapping["target_options"]]

return output
