This repository has been archived by the owner on Jul 23, 2024. It is now read-only.

[Task Submission] Cross Lingual Consistency (cross_lingual_consistency) #9

Closed
wants to merge 29 commits

Conversation

Betswish
Contributor

@Betswish Betswish commented Jul 26, 2023

[Task Name] Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models

Multilingual large-scale Pretrained Language Models (PLMs) have been shown to learn considerable amounts of factual knowledge from the training corpora. However, large variations are observed in the extent to which this knowledge generalizes across different languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we resplit the existing mLAMA (Kassner et al., 2021) to construct a new benchmark BMLAMA where instances for each language are balanced. Additionally, we propose a new Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently from accuracy. We conduct an in-depth analysis of the determining factors for CLC, both at model level and at language-pair level. Among other results, we find that average CLC is low across different PLMs. Moreover, increasing model size leads to higher factual probing accuracy in most languages, but does not improve cross-lingual consistency. All code and data will be released on GitHub.

Authors

Usage

Our evaluation function is not meant to be run in the default way, since we focus on assessing how factual knowledge generalizes across languages in multilingual PLMs with our proposed RankC metric. Our task can be described in three steps:

  • We first probe the multilingual PLMs with queries of factual knowledge in different languages (BMLAMA-17 and BMLAMA-53).
  • Based on the probabilities of the candidates for each query, we sort the candidate set from highest to lowest (provided in test_sample.jsonl).
  • We calculate the CLC between two languages with our proposed RankC metric, which we re-implement in the format_example() function. The RankC score for the given sample is stored in the 'target' field of the returned dictionary; a simplified sketch of this kind of computation is given below. For more results, see 'bloom_3b_CLC.png'.

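For intuition, a simplified, unweighted rank-overlap consistency score between two languages could be computed as in the sketch below. This is only an illustration: the function name and the uniform weighting over ranks are assumptions, and the exact RankC formulation is the one implemented in format_example() and described in the paper.

def rank_overlap_consistency(ranked_x, ranked_y):
    """Average top-j overlap between two ranked candidate lists for the same query.

    ranked_x, ranked_y: the same candidate set, sorted from most to least probable
    under language X and language Y respectively.
    """
    assert len(ranked_x) == len(ranked_y)
    n = len(ranked_x)
    # Overlap of the top-j candidates, averaged over j = 1..n (uniform weights here).
    return sum(
        len(set(ranked_x[:j]) & set(ranked_y[:j])) / j for j in range(1, n + 1)
    ) / n

# Identical rankings give 1.0; disagreement near the top lowers the score.
score = rank_overlap_consistency(
    ["Ottawa", "Tokyo", "Beijing"],
    ["Tokyo", "Ottawa", "Beijing"],
)  # (0/1 + 2/2 + 3/3) / 3 ≈ 0.67
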
Implementation

  • evaluate_predictions(): Not applicable to our task, since we focus on assessing the cross-lingual generalization of factual knowledge in multilingual PLMs with our proposed RankC metric.
  • format_example(): We re-implement this function to show how factual knowledge generalizes between a language pair in a PLM.

Checklist:

  • I and my co-authors agree that, if this PR is merged, the code will be available under the same license as the genbench_cbt repository.
  • Prior to submitting, I have run the GenBench CBT test suite using the genbench-cli test-task tool.
  • I have read the description of what should be in the doc.md of my task, and have added the required arguments.
  • I have submitted or will submit an accompanying paper to the GenBench workshop.

Adding additional comments
@kazemnejad
Contributor

Thanks for submitting a task to GenBench.

If I understand your submission correctly, in this PR you're reporting the result of evaluating BLOOM on your task in format_example. This is not an intended use of the GenBench CBT framework. Ideally, a task should prepare the inputs (that are fed to the model) and then evaluate the model's predictions against the correct target. The implementation itself should remain model-agnostic. Thus, I guess the correct implementation of your proposed task should provide the multilingual queries that need to be sent to the model.

In its current form, this PR doesn't seem to be mergeable into the framework.

CC. @dieuwkehupkes

@Betswish
Contributor Author

Hello kazemnejad,

Thanks for your reply!
Actually, our benchmark is model-agnostic. bloom_3b_CLC.png is only an example of the CLC between all language pairs in BLOOM-3b. We also measure the generalization of factual knowledge in other PLMs.

However, our whole benchmark doesn't fit the current GenBench framework well, since one of our goals is to separate the measurement of how factual knowledge generalizes from the correctness of predictions. So here we only provide a sample in format_example to show the computation process of our evaluation metric, which measures the generalization (CLC) of factual knowledge in PLMs between one language pair.

The code for our benchmark can be divided into two parts:

  1. Running a knowledge-probing task in each language on model X with our re-split dataset BMLAMA, and ranking the candidates by their probabilities;
  2. Calculating the CLC score between all language pairs; the result shows the generalization of factual knowledge in model X (like bloom_3b_CLC.png).

Do you have any suggestions about how to revise the implementation?

@kazemnejad
Contributor

Exactly, I also thought your task should be model agnostic.

I'm not sure which knowledge-probing setup you plan to use, but if you're using normal prompting, I'd suggest modifying the get_prepared_datasets method to create all the queries that need to be sent to an LLM.

So, for example, if you have an initial data instance

{
  "input": "The capital of France is _",
  "target": "Paris",
  "idx": 874,
}

You can create three separate queries for different languages:

{
  "input": "translation of (The capital of France is _) in Lang X",
  "target": "...",
  "original_sample_idx": 874,
  "idx": 1,
}
{
  "input": "translation of (The capital of France is _) in Lang Y",
  "target": "...",
  "original_sample_idx": 874,
  "idx": 2,
}
{
  "input": "translation of (The capital of France is _) in Lang Z",
  "target": "...",
  "original_sample_idx": 874,
  "idx": 3,
}

These queries will be sent to an LLM, and you'll receive the responses in evaluate_predictions. You can use original_sample_idx in that function to collect the results for an original data instance and compute consistency (a small grouping sketch is shown below). The only remaining part is the logprobs of the model's predictions, which we are planning to add to the framework.

Let me know if this is the intention of your task. If I understand correctly, your task will fit into our framework; just the implementation needs rework.
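
For illustration, the grouping by original_sample_idx inside evaluate_predictions could look roughly like this sketch (the lang field is an assumed per-query language tag stored alongside original_sample_idx):

from collections import defaultdict

def group_by_original_idx(gold_instances, predictions):
    """Collect each language's prediction under the index of the original (untranslated) instance."""
    grouped = defaultdict(dict)
    for instance, pred in zip(gold_instances, predictions):
        # "lang" is an assumed language identifier added next to original_sample_idx
        grouped[instance["original_sample_idx"]][instance["lang"]] = pred
    return grouped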

@vernadankers
Contributor

vernadankers commented Sep 1, 2023

Hello!

We are getting quite close to the deadline (September 1, 11:59 PM anywhere on earth), which is why I wanted to remind you that your PR still needs some attention: see Amir's message above.

Please don't forget to submit your accompanying paper to Openreview via https://openreview.net/group?id=GenBench.org/2023/Workshop by September 1.

Good luck finalising your PR and paper, feel free to tag us if you have questions.
Cheers, Verna
On behalf of the GenBench team

@Betswish
Contributor Author

Betswish commented Sep 1, 2023

Actually, the input of our benchmark should be two languages (X, Y) and one PLM (M), and the output is a score (0%-100%) representing the generalization of knowledge between X and Y in this PLM.

Therefore, this benchmark has no target (i.e. gold answer) for the output scores. It only measures to what extent the factual knowledge in one language generalizes to another language.

So I'm still unsure how to fit our implementation into the current framework.

btw, thanks for the reminder! We have already submitted our paper on Openreview.

@kazemnejad
Contributor

Your data instances don't have a gold answer, so you can just put a placeholder in the target field.

I think my suggestion in my previous comment should be able to handle your case.

@kazemnejad
Contributor

@Betswish We're in the process of merging the tasks into the repo. Is there any new update regarding modifying your implementation to fit the GenBench framework?

@Qiorios

Qiorios commented Nov 1, 2023

Hi @kazemnejad, I kept thinking about how to fit our evaluation metric into your framework but still couldn't figure it out.

Because our benchmark differs from traditional evaluation tasks:

  1. It does not randomly pick samples and evaluate based on those results. Instead, our benchmark needs to run on all provided samples (all Query-CandidateSet pairs), one by one, in both lang X and lang Y.
  2. It also needs to calculate the probabilities of all candidate tokens given each query.

So I drew a workflow diagram to illustrate our benchmark; I hope it gives you a clear view of it.
I'd like to hear your suggestions about how to fit it into your framework.
Sorry for the trouble, and thanks in advance!

[Workflow diagram attached: 未命名绘图.drawio ("Untitled drawing")]

@kazemnejad
Contributor

The figure makes everything very clear. Here is my suggestion on how to implement this task as a GenBench task.
You will probably need to reimplement _load_data_source, get_prepared_datasets, and evaluate_predictions yourself.

Since each data instance comes with a set of candidates, I'd suggest using multiple_choice as the task type. Each data instance then has target_options (which act as your candidates here).

Here is an example based on the flowchart:

{
  "input": "The capital of Canada ",
  "target": 0,
  "target_options": [
    "Beijing",
    "Tokyo",
    "Ottawa",
  ],
}

or

{
  "input": "The capital of the Netherlands ",
  "target": 0,
  "target_options": [
    "Miami",
    "Dublin",
    "Amsterdam",
  ],
}

Note that in these examples, the ground truth (represented in the target field) is always equal to zero. This is intended, as your task doesn't really have a target, so we just put some default value there. This is not problematic because you're overriding the evaluate_predictions method and don't rely on the gold target.

How to have data instances from multiple languages?

Basically, the output of get_prepared_datasets is a single Dataset object, which contains the instances (the queries that are going to be sent to an LLM) from all languages. However, since these examples are not independent (i.e. they are translations of each other), you should assign some sort of unique id and a language identifier to them so we can group them later. Here is some pseudocode for this method:

    def get_prepared_datasets(
        self,
        preparation_strategy: PreparationStrategy,
        shot_list: Optional[List[int]] = None,
        random_seed: int = 42,
    ) -> Union[Mapping[DatasetSplit, Dataset], Mapping[int, Dataset]]:

        if preparation_strategy == PreparationStrategy.FINETUNING:
            raise ValueError("This task does not support finetuning")

        datasets = self.get_datasets_raw()

        # datasets is a dict of language_id -> Dataset
        english_ds = datasets["en"]
        french_ds = datasets["fr"]
        ...

        # They all have the same length (i.e. they are translations of each other)
        assert len(english_ds) == len(french_ds)

        # Each of them contains instances of the form:
        # {
        #     "input": "The capital of Canada ",
        #     "target": 0,
        #     "target_options": [
        #             "Beijing",
        #             "Tokyo",
        #             "Ottawa",
        #     ],
        #     "_genbnech_idx": <some index>
        # }

        # The order of instances is the same across languages. That is if the 10th instance in english_ds is:
        # {
        #     "input": "The capital of Canada ",
        #     "target": 0,
        #     "target_options": [
        #             "Beijing",
        #             "Tokyo",
        #             "Ottawa",
        #     ],
        #     "_genbnech_idx": 9
        # }

        # Then the 10th instance in french_ds is:
        # {
        #     "input": "La capitale del Canada e ",
        #     "target": 0,
        #     "target_options": [
        #             "Beijing",
        #             "Tokyo",
        #             "Ottawa",
        #     ],
        #     "_genbnech_idx": 9
        # }

        # Add language identifier to each instance
        english_ds = english_ds.map(lambda x: {"lang": "en"})
        french_ds = french_ds.map(lambda x: {"lang": "fr"})
        ...

        # Concatenate the datasets
        from datasets import concatenate_datasets
        final_dataset = concatenate_datasets([english_ds, french_ds, ...])

        return final_dataset

This should resolve your first concern as we're evaluating ALL examples.

How to rerank and evaluate

So, the user of our task has to send the final_dataset to the LLM and compute the score (logprobs) of each candidate. Here is an example:

from genbench import load_task
from genbench.api import PreparationStrategy  # assumed import path; adjust if PreparationStrategy lives elsewhere in the framework

import numpy as np

cross_ling_const_task = load_task(...)


ds = cross_ling_const_task.get_prepared_datasets(
    preparation_strategy=PreparationStrategy.PROMPT_BASED_TESTING
)

# the dataset length is num_languages * num_instances_per_language

predictions = []
for data in ds:
    # Create the query to be sent to the LLM
    logprobs = []
    for option in data["target_options"]:
        query = f"{data['input']}{option}"
        logprob = llm(query)
        logprobs.append(logprob)
    
    # Get the index of the highest logprob. 
    # It's not needed for this task, but I added it for completeness
    pred_idx = np.argmax(logprobs)
    pred_option = data["target_options"][pred_idx]

    assert len(logprobs) == len(data["target_options"])

    pred = {
        "target_option_logprobs": logprobs,
        "target": pred_option,
    }
    predictions.append(pred)

Note that this code snippet is not part of your task implementation. It's just an example of how an external user might use your task.

Now, assuming that each prediction contains the logprobs of all candidates, we can group them and compute the evaluation metric. Here is an example:

from collections import defaultdict

class Task:
    def evaluate_predictions(
        self,
        *,
        predictions: List[Mapping[str, Any]] = None,
        gold: Dataset = None,
    ) -> EvaluationResult:
        # Make sure that the predictions are in the same order as the gold dataset
        assert len(predictions) == len(gold)

        # Just to make sure the gold dataset is the same as the one we generated in `get_prepared_datasets`
        assert "lang" in gold.features
        assert "_genbnech_idx" in gold.features

        # Also, make sure that each prediction contains a logprob for every option
        assert all(
            "target_option_logprobs" in pred
            and len(pred["target_option_logprobs"]) == len(gold_instance["target_options"])
            for pred, gold_instance in zip(predictions, gold)
        )

        # Group the predictions and instances such that we have:
        # _genbnech_idx -> {
        #    "lang_id_1": { ...data_instance..., target_option_logprobs: ... }
        #    "lang_id_2": { ...data_instance..., target_option_logprobs: ... }
        # },

        grouped_examples = defaultdict(dict)
        for pred, gold_instance in zip(predictions, gold):
            original_idx = gold_instance["_genbnech_idx"]
            lang = gold_instance["lang"]
            grouped_examples[original_idx][lang] = {
                **gold_instance,
                **pred,
            }

        # Now, we compute the cross lingual consistency score
        for idx, example in grouped_examples.items():
            # Rerank the options based on the logprobs
            for lang, data in example.items():
                logprobs = data["target_option_logprobs"]
                sorted_options = sorted(
                    zip(data["target_options"], logprobs),
                    key=lambda x: x[1],
                    reverse=True,
                )
                sorted_options, logprobs = zip(*sorted_options)
                grouped_examples[idx][lang]["target_options"] = list(sorted_options)
                grouped_examples[idx][lang]["target_option_logprobs"] = list(logprobs)
            
            # Compute the cross lingual consistency score
            ...

        # Compute the final score
        result = {
            "cross_lingual_consistency": ...,
        }

        return result

This should resolve your second concern.

@Betswish
Contributor Author

Betswish commented Nov 6, 2023

Thanks a lot! I'm revising it and will try to complete it ASAP!

@kazemnejad
Contributor

kazemnejad commented Nov 16, 2023

Hey @Betswish ! Are there any updates regarding the new implementation?
Thanks :)

@Betswish
Contributor Author

Hi @kazemnejad, the new implementation is completed!

@Betswish
Contributor Author

Betswish commented Nov 22, 2023

Hi @kazemnejad, I made some minor updates!
I have run the test_task and fixed most issues.
But there is one remaining issue I'm not sure how to fix: it says the keys of the data source in config.jsonnet ('test') don't match the keys of the 'dataset' variable in task.py ('en', 'es').
You may check it by re-running the checks.

@kazemnejad
Contributor

kazemnejad commented Nov 29, 2023

@Betswish Thanks for your efforts. It's definitely getting close to ready to be merged. There are still some fixes that need to be implemented.

  • The task dataset files should be served somewhere else, as we cannot host them in the repo. You can take a look at other pull requests to see how they handle it.
  • Could you please include a single file usage_example.py for your task, where you showcase the full pipeline of using the task, making inferences, and performing evaluation the way you intend? Preferably, it should be done with a pretrained HuggingFace model (a sketch of such a scoring helper is given after this list). Please also include requirements-usage-example.txt with the Python dependencies that need to be installed to run the example.
  • Since your current PR already contains the dataset files, and they would remain in the git history even if you removed them, could you open a new PR with the above changes?
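
For reference, the candidate-scoring part of such a usage_example.py might look roughly like the sketch below. This is only an illustration, not the task's actual code: the checkpoint name is an arbitrary small multilingual model, and candidate_logprob is a hypothetical helper that sums the token log-probabilities of a candidate given the query.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # illustrative choice of a small multilingual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def candidate_logprob(query: str, candidate: str) -> float:
    """Sum of log-probabilities of the candidate tokens given the query."""
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    full_ids = tokenizer(query + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # Score only the positions belonging to the candidate (everything after the query prefix).
    # Tokenization at the query/candidate boundary can differ slightly from tokenizing
    # the query alone; this sketch ignores that edge case.
    total = 0.0
    for pos in range(query_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

Each query's candidate scores would then be collected into target_option_logprobs and passed to evaluate_predictions, as in the earlier example.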

Thanks a lot.

@Betswish Betswish closed this Nov 30, 2023
@Betswish Betswish deleted the BMLAMA branch November 30, 2023 13:30