[Task Submission] Cross Lingual Consistency (cross_lingual_consistency) #9
Conversation
Thanks for submitting a task to GenBench. If I understand your submission correctly, in this PR you're reporting the result of evaluating BLOOM on your task. In its current form, this PR doesn't seem to be mergeable into the framework. CC @dieuwkehupkes
Hello kazemnejad, thanks for your reply! However, our whole benchmark can't fit the current GenBench framework well, since one of our focuses is to separate the measurement of generalization of factual knowledge from the correctness of predictions. So here we only provided a sample. The code for our benchmark can be divided into two parts:
Do you have any suggestions about how to revise the implementation?
Exactly, I also thought your task should be model-agnostic. I'm not sure which knowledge-probing method you plan to use, but if you're using normal prompting, I'd suggest you modify the data preparation. So, for example, if you have an initial data instance

```json
{
    "input": "The capital of France is _",
    "target": "Paris",
    "idx": 874
}
```

you can create three separate queries for different languages:

```json
{
    "input": "translation of (The capital of France is _) in Lang X",
    "target": "...",
    "original_sample_idx": 874,
    "idx": 1
}
```

```json
{
    "input": "translation of (The capital of France is _) in Lang Y",
    "target": "...",
    "original_sample_idx": 874,
    "idx": 2
}
```

```json
{
    "input": "translation of (The capital of France is _) in Lang Z",
    "target": "...",
    "original_sample_idx": 874,
    "idx": 3
}
```

These queries will be sent to an LLM, and you'll receive its responses back for evaluation. Let me know if this is the intention of your task. If I understand correctly, your task will fit into our framework; just the implementation needs rework.
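To make the suggestion above concrete, here is a minimal sketch of how one original instance could be expanded into per-language queries that keep a pointer back to the original via `original_sample_idx`. The `TRANSLATIONS` table and the `expand_to_languages` helper are hypothetical illustrations, not part of the GenBench API.

```python
from typing import Dict, List

# Hypothetical translations of one original instance, keyed by language code.
TRANSLATIONS: Dict[str, str] = {
    "lang_x": "translation of (The capital of France is _) in Lang X",
    "lang_y": "translation of (The capital of France is _) in Lang Y",
    "lang_z": "translation of (The capital of France is _) in Lang Z",
}


def expand_to_languages(original: Dict, translations: Dict[str, str]) -> List[Dict]:
    """Expand one original instance into one query per language,
    keeping a pointer back to the original via `original_sample_idx`."""
    queries = []
    for new_idx, (lang, text) in enumerate(translations.items(), start=1):
        queries.append(
            {
                "input": text,
                "target": "...",  # a (translated) gold answer, or a placeholder
                "original_sample_idx": original["idx"],
                "idx": new_idx,
                "lang": lang,
            }
        )
    return queries


original = {"input": "The capital of France is _", "target": "Paris", "idx": 874}
per_language_queries = expand_to_languages(original, TRANSLATIONS)
```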
Hello! We are getting quite close to the deadline (September 1, 11:59PM anywhere on earth), which is why I wanted to remind you that your PR still needs some attention: see Amir's message above. Please don't forget to submit your accompanying paper to OpenReview via https://openreview.net/group?id=GenBench.org/2023/Workshop by September 1. Good luck finalising your PR and paper, and feel free to tag us if you have questions.
Actually, the input of our benchmark should be two languages (X, Y) and one PLM (M), and the output is a score (0%-100%) representing the generalization of knowledge between X and Y in this PLM. Therefore, this benchmark has no target (i.e., gold answer) for the output scores; it only measures to what extent the factual knowledge in one language generalizes to another language. So I'm still unsure how to fit our implementation into the current framework.
By the way, thanks for the reminder! We have already submitted our paper on OpenReview.
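A rough sketch of the interface being described, assuming a hypothetical function name and model identifier (neither is part of the actual benchmark code):

```python
# Hypothetical sketch of the benchmark interface described above: two
# languages and one pretrained LM go in, a consistency score in [0, 100]
# comes out, and no gold answer is involved. Names are illustrative only.
def cross_lingual_consistency(model_name: str, lang_x: str, lang_y: str) -> float:
    """Return the percentage of factual knowledge that generalizes
    consistently between lang_x and lang_y in the given PLM."""
    raise NotImplementedError  # the actual computation is defined by the task


# Intended usage (illustrative):
# score = cross_lingual_consistency("bigscience/bloom-3b", "en", "es")
# print(f"CLC(en, es) = {score:.1f}%")
```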
Your data instances don't have a gold answer, so you can just put a placeholder in the `target` field. I think my suggestion in my previous comment should be able to handle your case.
@Betswish We're in the process of merging the tasks into the repo. Is there any new update regarding modifying your implementation to fit the GenBench framework?
Hi @kazemnejad, I kept thinking about how to fit our evaluation metric into your framework but still couldn't figure it out, because our benchmark is different from traditional evaluation tasks.
So I drew a workflow diagram to show our benchmark. I hope it gives you a clear view of it.
The figure makes everything very clear. Here is my suggestion on how to implement this task as a GenBench task. Since each data instance has a set of candidates, I'd suggest using the multiple-choice format with a `target_options` field. Here is an example based on the flowchart:

```json
{
    "input": "The capital of Canada ",
    "target": 0,
    "target_options": [
        "Beijing",
        "Tokyo",
        "Ottawa"
    ]
}
```

or

```json
{
    "input": "The capital of Netherland ",
    "target": 0,
    "target_options": [
        "Miami",
        "Dublin",
        "Amsterdam"
    ]
}
```

Note that in these examples, the ground truth (represented in the `target` field) is just a placeholder, since the task does not use it for scoring.

How to have data instances from multiple languages?

Basically, the output of `get_prepared_datasets` can be a single dataset that concatenates the instances from all languages:

```python
# Sketch of the task's `get_prepared_datasets` method; the type names come
# from the GenBench framework.
def get_prepared_datasets(
    self,
    preparation_strategy: PreparationStrategy,
    shot_list: Optional[List[int]] = None,
    random_seed: int = 42,
) -> Union[Mapping[DatasetSplit, Dataset], Mapping[int, Dataset]]:
    if preparation_strategy == PreparationStrategy.FINETUNING:
        raise ValueError("This task does not support finetuning")

    datasets = self.get_datasets_raw()
    # datasets is a dict of language_id -> Dataset
    english_ds = datasets["en"]
    french_ds = datasets["fr"]
    ...

    # They all have the same length (i.e. they are translations of each other)
    assert len(english_ds) == len(french_ds)

    # Each of them contains instances of the form:
    # {
    #     "input": "The capital of Canada ",
    #     "target": 0,
    #     "target_options": [
    #         "Beijing",
    #         "Tokyo",
    #         "Ottawa",
    #     ],
    #     "_genbnech_idx": <some index>
    # }
    #
    # The order of instances is the same across languages. That is, if the
    # 10th instance in english_ds is:
    # {
    #     "input": "The capital of Canada ",
    #     "target": 0,
    #     "target_options": [
    #         "Beijing",
    #         "Tokyo",
    #         "Ottawa",
    #     ],
    #     "_genbnech_idx": 9
    # }
    # then the 10th instance in french_ds is:
    # {
    #     "input": "La capitale du Canada est ",
    #     "target": 0,
    #     "target_options": [
    #         "Beijing",
    #         "Tokyo",
    #         "Ottawa",
    #     ],
    #     "_genbnech_idx": 9
    # }

    # Add a language identifier to each instance
    english_ds = english_ds.map(lambda x: {"lang": "en"})
    french_ds = french_ds.map(lambda x: {"lang": "fr"})
    ...

    # Concatenate the datasets
    from datasets import concatenate_datasets

    final_dataset = concatenate_datasets([english_ds, french_ds, ...])
    return final_dataset
```

This should resolve your first concern, as we're evaluating ALL examples.

How to rerank and evaluate

So, the user of our task has to send the `final_dataset` to an LLM and compute the score (log-probabilities) of each candidate. Here is an example:

```python
import numpy as np

# `load_task` and `PreparationStrategy` come from the GenBench framework.
from genbench import load_task
cross_ling_const_task = load_task(...)

ds = cross_ling_const_task.get_prepared_datasets(
    preparation_strategy=PreparationStrategy.PROMPT_BASED_TESTING
)
# The dataset length is num_languages * num_instances_per_language.

predictions = []
for data in ds:
    # Create the queries to be sent to the LLM (one per candidate option)
    logprobs = []
    for option in data["target_options"]:
        query = f"{data['input']}{option}"
        logprob = llm(query)
        logprobs.append(logprob)

    # Get the index of the highest logprob.
    # It's not needed for this task, but I added it for completeness.
    pred_idx = np.argmax(logprobs)
    pred_option = data["target_options"][pred_idx]

    assert len(logprobs) == len(data["target_options"])

    pred = {
        "target_option_logprobs": logprobs,
        "target": pred_option,
    }
    predictions.append(pred)
```

Note that this code snippet is not part of your task implementation. It is just an example of how an external user might use your task. Now, assuming that each prediction contains the logprobs of all candidates, we can group them and compute the evaluation metric. Here is an example:

```python
from collections import defaultdict
from typing import Any, List, Mapping

from datasets import Dataset

# `EvaluationResult` comes from the GenBench framework.


class Task:
    def evaluate_predictions(
        self,
        *,
        predictions: List[Mapping[str, Any]] = None,
        gold: Dataset = None,
    ) -> EvaluationResult:
        # Make sure that the predictions are in the same order as the gold dataset
        assert len(predictions) == len(gold)

        # Just to make sure the gold dataset is the same as the one we
        # generated in `get_prepared_datasets`
        assert "lang" in gold.features
        assert "_genbnech_idx" in gold.features

        # Also, make sure that predictions contain logprobs for each option
        assert all(
            "target_option_logprobs" in pred
            and len(pred["target_option_logprobs"]) == len(gold_instance["target_options"])
            for pred, gold_instance in zip(predictions, gold)
        )

        # Group the predictions and instances such that we have:
        # _genbnech_idx -> {
        #     "lang_id_1": { ...data_instance..., target_option_logprobs: ... }
        #     "lang_id_2": { ...data_instance..., target_option_logprobs: ... }
        # },
        grouped_examples = defaultdict(dict)
        for pred, gold_instance in zip(predictions, gold):
            original_idx = gold_instance["_genbnech_idx"]
            lang = gold_instance["lang"]
            grouped_examples[original_idx][lang] = {
                **gold_instance,
                **pred,
            }

        # Now, we compute the cross-lingual consistency score
        for idx, example in grouped_examples.items():
            # Rerank the options based on the logprobs
            for lang, data in example.items():
                logprobs = data["target_option_logprobs"]
                sorted_options = sorted(
                    zip(data["target_options"], logprobs),
                    key=lambda x: x[1],
                    reverse=True,
                )
                sorted_options, logprobs = zip(*sorted_options)
                grouped_examples[idx][lang]["target_options"] = list(sorted_options)
                grouped_examples[idx][lang]["target_option_logprobs"] = list(logprobs)

            # Compute the cross-lingual consistency score
            ...

        # Compute the final score
        result = {
            "cross_lingual_consistency": ...,
        }
        return result
```

This should resolve your second concern.
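The snippet above leaves the actual consistency computation as `...`. As a placeholder for where that step would plug in, here is a minimal sketch that scores top-1 agreement between two languages over the `grouped_examples` structure built above. This is not the RankC metric from the paper; it assumes candidate strings are shared across languages (as in the mock instances above), and the language codes are illustrative.

```python
from typing import Any, Dict, Mapping


def top1_agreement(
    grouped_examples: Mapping[int, Dict[str, Mapping[str, Any]]],
    lang_a: str = "en",
    lang_b: str = "fr",
) -> float:
    """Fraction of instances whose highest-ranked candidate is identical in
    both languages, assuming `target_options` are already sorted by logprob
    (as done in the reranking loop above) and shared across languages."""
    agree, total = 0, 0
    for example in grouped_examples.values():
        if lang_a not in example or lang_b not in example:
            continue  # skip instances missing one of the two languages
        total += 1
        if example[lang_a]["target_options"][0] == example[lang_b]["target_options"][0]:
            agree += 1
    return agree / total if total else 0.0
```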
Great, thanks! I'm revising it and will try to complete it ASAP!
Hey @Betswish! Are there any updates regarding the new implementation?
Hi @kazemnejad, the new implementation is complete!
Hi @kazemnejad, I made some minor updates!
@Betswish Thanks for your efforts. It's definitely getting ready to be merged. There are still some fixes that need to be implemented.
Thanks a lot.
[Task Name] Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
Multilingual large-scale Pretrained Language Models (PLMs) have been shown to learn considerable amounts of factual knowledge from the training corpora. However, large variations are observed in the extent to which this knowledge generalizes across languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we resplit the existing mLAMA benchmark (Kassner et al., 2021) to construct a new benchmark, BMLAMA, in which the instances for each language are balanced. Additionally, we propose a new Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently from accuracy. We conduct an in-depth analysis of the determining factors for CLC, both at the model level and at the language-pair level. Among other results, we find that average CLC is low across different PLMs. Moreover, increasing model size leads to higher factual probing accuracy in most languages but does not improve cross-lingual consistency. All code and data will be released on GitHub.
Authors
[email protected]
[email protected]
[email protected]
Usage
Our evaluation function should be run differently from the default way, since we focus on assessing the generalization of factual knowledge in multilingual PLMs with our proposed RankC metric. Our task can be described in three steps:
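As a rough end-to-end illustration, following the workflow discussed in the conversation above (prepare the multilingual prompts, score each candidate with the PLM, pass the per-candidate log-probabilities to `evaluate_predictions`): the task id string, the `llm_logprob` stand-in, and the `PreparationStrategy` import path below are assumptions, not verified against the final implementation.

```python
import numpy as np

from genbench import load_task
from genbench.api import PreparationStrategy  # assumed import path


def llm_logprob(text: str) -> float:
    """Hypothetical stand-in for the PLM scoring backend."""
    raise NotImplementedError


task = load_task("cross_lingual_consistency")  # illustrative task id
ds = task.get_prepared_datasets(
    preparation_strategy=PreparationStrategy.PROMPT_BASED_TESTING
)

predictions = []
for instance in ds:
    # Score every candidate continuation of the prompt in its own language.
    logprobs = [
        llm_logprob(f"{instance['input']}{option}")
        for option in instance["target_options"]
    ]
    predictions.append(
        {
            "target_option_logprobs": logprobs,
            "target": instance["target_options"][int(np.argmax(logprobs))],
        }
    )

# The task groups predictions across languages and computes the RankC score.
results = task.evaluate_predictions(predictions=predictions, gold=ds)
print(results)
```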
Implementation
Checklist:
Task tested with the `genbench-cli test-task` tool.