This repository has been archived by the owner on Jul 23, 2024. It is now read-only.

[Task Submission] Cross Lingual Consistency (cross_lingual_consistency) #9

Closed
wants to merge 29 commits

Conversation

Betswish
Contributor

@Betswish Betswish commented Jul 26, 2023

[Task Name] Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models

Multilingual large-scale Pretrained Language Models (PLMs) have been shown to learn considerable amounts of factual knowledge from the training corpora. However, large variations are observed in the extent to which this knowledge generalizes across different languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we resplit the existing mLAMA (Kassner et al., 2021) to construct a new benchmark BMLAMA where instances for each language are balanced. Additionally, we propose a new Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently from accuracy. We conduct an in-depth analysis of the determining factors for CLC, both at model level and at language-pair level. Among other results, we find that average CLC is low across different PLMs. Moreover, increasing model size leads to higher factual probing accuracy in most languages, but does not improve cross-lingual consistency. All code and data will be released on GitHub.

Authors

Usage

Our evaluation function is not meant to be run in the default way, since we focus on assessing how factual knowledge generalizes across languages in multilingual PLMs with our proposed RankC metric. Our task can be described in three steps:

  • We first probe the multilingual PLMs with queries of factual knowledge in different languages (BMLAMA-17 and BMLAMA-53).
  • Based on the probabilities of the candidates for each query, we sort the candidate set from highest to lowest (provided in test_sample.jsonl).
  • We calculate the CLC between two languages with our proposed RankC metric, which we re-implement in the format_example() function. The RankC score for the given sample is stored in the 'target' field of the returned dictionary; a simplified sketch of this kind of computation is given below. For more results, see 'bloom_3b_CLC.png'.

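For intuition, a simplified, unweighted rank-overlap consistency score between two languages could be computed as in the sketch below. This is only an illustration: the function name and the uniform weighting over ranks are assumptions, and the exact RankC formulation is the one implemented in format_example() and described in the paper.

def rank_overlap_consistency(ranked_x, ranked_y):
    """Average top-j overlap between two ranked candidate lists for the same query.

    ranked_x, ranked_y: the same candidate set, sorted from most to least probable
    under language X and language Y respectively.
    """
    assert len(ranked_x) == len(ranked_y)
    n = len(ranked_x)
    # Overlap of the top-j candidates, averaged over j = 1..n (uniform weights here).
    return sum(
        len(set(ranked_x[:j]) & set(ranked_y[:j])) / j for j in range(1, n + 1)
    ) / n

# Identical rankings give 1.0; disagreement near the top lowers the score.
score = rank_overlap_consistency(
    ["Ottawa", "Tokyo", "Beijing"],
    ["Tokyo", "Ottawa", "Beijing"],
)  # (0/1 + 2/2 + 3/3) / 3 ≈ 0.67
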
Implementation

  • evaluate_predictions(): Not applicable to our task, since we focus on assessing the cross-lingual generalization of factual knowledge in multilingual PLMs with our proposed RankC metric.
  • format_example(): We re-implement this function to show how factual knowledge generalizes between a language pair in a PLM.

Checklist:

  • I and my co-authors agree that, if this PR is merged, the code will be available under the same license as the genbench_cbt repository.
  • Prior to submitting, I have run the GenBench CBT test suite using the genbench-cli test-task tool.
  • I have read the description of what should be in the doc.md of my task, and have added the required arguments.
  • I have submitted or will submit an accompanying paper to the GenBench workshop.

Adding additional comments
@kazemnejad
Contributor

Thanks for submitting a task to GenBench.

If I understand your submission correctly, in this PR you're reporting the result of evaluating BLOOM on your task in format_example. This is not an intended use of the GenBench CBT framework. Ideally, a task should prepare the inputs (that are fed to the model) and then evaluate the model's predictions against the correct target. The implementation itself should remain model-agnostic. Thus, I guess the correct implementation of your proposed task should provide the multilingual queries that need to be sent to the model.

In its current form, this PR doesn't seem to be mergeable into the framework.

CC. @dieuwkehupkes

@Betswish
Contributor Author

Hello kazemnejad,

Thanks for your reply!
Actually, our benchmark is model-agnostic. bloom_3b_CLC.png is only an example of the CLC between all language pairs in BLOOM-3b. We also measure the generalization of factual knowledge in other PLMs.

However, our whole benchmark doesn't fit the current GenBench framework well, since one of our goals is to separate the measurement of how factual knowledge generalizes from the correctness of predictions. So here we only provide a sample in format_example to show the computation process of our evaluation metric, which measures the generalization (CLC) of factual knowledge in PLMs between one language pair.

The code for our benchmark can be divided into two parts:

  1. Running a knowledge-probing task in each language on model X with our re-split dataset BMLAMA, and ranking the candidates by their probabilities;
  2. Calculating the CLC score between all language pairs; the result shows the generalization of factual knowledge in model X (like bloom_3b_CLC.png).

Do you have any suggestions about how to revise the implementation?

@kazemnejad
Contributor

Exactly, I also thought your task should be model agnostic.

I'm not sure which knowledge-probing setup you plan to use, but if you're using normal prompting, I'd suggest modifying the get_prepared_datasets method to create all the queries that need to be sent to an LLM.

So, for example, if you have an initial data instance

{
  "input": "The capital of France is _",
  "target": "Paris",
  "idx": 874,
}

You can create three separate queries for different languages:

{
  "input": "translation of (The capital of France is _) in Lang X",
  "target": "...",
  "original_sample_idx": 874,
  "idx": 1,
}
{
  "input": "translation of (The capital of France is _) in Lang Y",
  "target": "...",
  "original_sample_idx": 874,
  "idx": 2,
}
{
  "input": "translation of (The capital of France is _) in Lang Z",
  "target": "...",
  "original_sample_idx": 874,
  "idx": 3,
}

These queries will be sent to an LLM, and you'll receive the responses in evaluate_predictions. You can use original_sample_idx in that function to collect the results for an original data instance and compute consistency (a small grouping sketch is shown below). The only remaining part is the logprobs of the model's predictions, which we are planning to add to the framework.

Let me know if this is the intention of your task. If I understand correctly, your task will fit into our framework; just the implementation needs rework.
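
For illustration, the grouping by original_sample_idx inside evaluate_predictions could look roughly like this sketch (the lang field is an assumed per-query language tag stored alongside original_sample_idx):

from collections import defaultdict

def group_by_original_idx(gold_instances, predictions):
    """Collect each language's prediction under the index of the original (untranslated) instance."""
    grouped = defaultdict(dict)
    for instance, pred in zip(gold_instances, predictions):
        # "lang" is an assumed language identifier added next to original_sample_idx
        grouped[instance["original_sample_idx"]][instance["lang"]] = pred
    return grouped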

@vernadankers
Contributor

vernadankers commented Sep 1, 2023

Hello!

We are getting quite close to the deadline (September 1, 11:59 PM anywhere on earth), which is why I wanted to remind you that your PR still needs some attention: see Amir's message above.

Please don't forget to submit your accompanying paper to Openreview via https://openreview.net/group?id=GenBench.org/2023/Workshop by September 1.

Good luck finalising your PR and paper, feel free to tag us if you have questions.
Cheers, Verna
On behalf of the GenBench team

@Betswish
Contributor Author

Betswish commented Sep 1, 2023

Actually, the input of our benchmark should be two languages (X, Y) and one PLM (M), and the output is a score (0%-100%) representing the generalization of knowledge between X and Y in this PLM.

Therefore, this benchmark has no target (i.e. gold answer) for the output scores. It only measures to what extent the factual knowledge in one language generalizes to another language.

So I'm still unsure how to fit our implementation into the current framework.

btw, thanks for the reminder! We have already submitted our paper on Openreview.

@kazemnejad
Contributor

Your data instances don't have a gold answer, so you can just put a placeholder in the target field.

I think my suggestion in my previous comment should be able to handle your case.

@kazemnejad
Contributor

@Betswish We're in the process of merging the tasks into the repo. Is there any new update regarding modifying your implementation to fit the GenBench framework?

@Qiorios

Qiorios commented Nov 1, 2023

Hi @kazemnejad, I kept thinking about how to fit our evaluation metric into your framework but still couldn't figure it out.

Because our benchmark differs from traditional evaluation tasks:

  1. It does not randomly pick samples and evaluate based on those results. Instead, our benchmark needs to run on all provided samples (all Query-CandidateSet pairs), one by one, in both lang X and lang Y.
  2. It also needs to calculate the probabilities of all candidate tokens given each query.

So I drew a workflow diagram to illustrate our benchmark; I hope it gives you a clear view of it.
I'd like to hear your suggestions about how to fit it into your framework.
Sorry for the trouble, and thanks in advance!

[Workflow diagram attached: 未命名绘图.drawio ("Untitled drawing")]

@kazemnejad
Contributor

The figure makes everything very clear. Here is my suggestion on how to implement this task as a GenBench task.
You will probably need to reimplement _load_data_source, get_prepared_datasets, and evaluate_predictions yourself.

Since each data instance comes with a set of candidates, I'd suggest using multiple_choice as the task type. Each data instance then has target_options (which act as your candidates here).

Here is an example based on the flowchart:

{
  "input": "The capital of Canada ",
  "target": 0,
  "target_options": [
    "Beijing",
    "Tokyo",
    "Ottawa",
  ],
}

or

{
  "input": "The capital of the Netherlands ",
  "target": 0,
  "target_options": [
    "Miami",
    "Dublin",
    "Amsterdam",
  ],
}

Note that in these examples, the ground truth (represented in the target field) is always equal to zero. This is intended, as your task doesn't really have a target, so we just put some default value there. This is not problematic because you're overriding the evaluate_predictions method and don't rely on the gold target.

How to have data instances from multiple languages?

Basically, the output of get_prepared_datasets is a single Dataset object, which contains the instances (the queries that are going to be sent to an LLM) from all languages. However, since these examples are not independent (i.e. they are translations of each other), you should assign some sort of unique id and a language identifier to them so we can group them later. Here is some pseudocode for this method:

    def get_prepared_datasets(
        self,
        preparation_strategy: PreparationStrategy,
        shot_list: Optional[List[int]] = None,
        random_seed: int = 42,
    ) -> Union[Mapping[DatasetSplit, Dataset], Mapping[int, Dataset]]:

        if preparation_strategy == PreparationStrategy.FINETUNING:
            raise ValueError("This task does not support finetuning")

        datasets = self.get_datasets_raw()

        # datasets is a dict of language_id -> Dataset
        english_ds = datasets["en"]
        french_ds = datasets["fr"]
        ...

        # They all have the same length (i.e. they are translations of each other)
        assert len(english_ds) == len(french_ds)

        # Each of them contains instances of the form:
        # {
        #     "input": "The capital of Canada ",
        #     "target": 0,
        #     "target_options": [
        #             "Beijing",
        #             "Tokyo",
        #             "Ottawa",
        #     ],
        #     "_genbnech_idx": <some index>
        # }

        # The order of instances is the same across languages. That is if the 10th instance in english_ds is:
        # {
        #     "input": "The capital of Canada ",
        #     "target": 0,
        #     "target_options": [
        #             "Beijing",
        #             "Tokyo",
        #             "Ottawa",
        #     ],
        #     "_genbnech_idx": 9
        # }

        # Then the 10th instance in french_ds is:
        # {
        #     "input": "La capitale del Canada e ",
        #     "target": 0,
        #     "target_options": [
        #             "Beijing",
        #             "Tokyo",
        #             "Ottawa",
        #     ],
        #     "_genbnech_idx": 9
        # }

        # Add language identifier to each instance
        english_ds = english_ds.map(lambda x: {"lang": "en"})
        french_ds = french_ds.map(lambda x: {"lang": "fr"})
        ...

        # Concatenate the datasets
        from datasets import concatenate_datasets
        final_dataset = concatenate_datasets([english_ds, french_ds, ...])

        return final_dataset

This should resolve your first concern as we're evaluating ALL examples.

How to rerank and evaluate

So, the user of our task has to send the final_dataset to the LLM and compute the score (logprobs) of each candidate. Here is an example:

from genbench import load_task
from genbench.api import PreparationStrategy  # assumed import path; adjust if PreparationStrategy lives elsewhere in the framework

import numpy as np

cross_ling_const_task = load_task(...)


ds = cross_ling_const_task.get_prepared_datasets(
    preparation_strategy=PreparationStrategy.PROMPT_BASED_TESTING
)

# the dataset length is num_languages * num_instances_per_language

predictions = []
for data in ds:
    # Create the query to be sent to the LLM
    logprobs = []
    for option in data["target_options"]:
        query = f"{data['input']}{option}"
        logprob = llm(query)
        logprobs.append(logprob)
    
    # Get the index of the highest logprob. 
    # It's not needed for this task, but I added it for completeness
    pred_idx = np.argmax(logprobs)
    pred_option = data["target_options"][pred_idx]

    assert len(logprobs) == len(data["target_options"])

    pred = {
        "target_option_logprobs": logprobs,
        "target": pred_option,
    }
    predictions.append(pred)

Note that this code snippet is not part of your task implementation. It's just an example of how an external user might use your task.

Now, assuming that each prediction contains the logprobs of all candidates, we can group them and compute the evaluation metric. Here is an example:

from collections import defaultdict

class Task:
    def evaluate_predictions(
        self,
        *,
        predictions: List[Mapping[str, Any]] = None,
        gold: Dataset = None,
    ) -> EvaluationResult:
        # Make sure that the predictions are in the same order as the gold dataset
        assert len(predictions) == len(gold)

        # Just to make sure the gold dataset is the same as the one we generated in `get_prepared_datasets`
        assert "lang" in gold.features
        assert "_genbnech_idx" in gold.features

        # Also, make sure that each prediction contains a logprob for every option
        assert all(
            "target_option_logprobs" in pred
            and len(pred["target_option_logprobs"]) == len(gold_instance["target_options"])
            for pred, gold_instance in zip(predictions, gold)
        )

        # Group the predictions and instances such that we have:
        # _genbnech_idx -> {
        #    "lang_id_1": { ...data_instance..., target_option_logprobs: ... }
        #    "lang_id_2": { ...data_instance..., target_option_logprobs: ... }
        # },

        grouped_examples = defaultdict(dict)
        for pred, gold_instance in zip(predictions, gold):
            original_idx = gold_instance["_genbnech_idx"]
            lang = gold_instance["lang"]
            grouped_examples[original_idx][lang] = {
                **gold_instance,
                **pred,
            }

        # Now, we compute the cross lingual consistency score
        for idx, example in grouped_examples.items():
            # Rerank the options based on the logprobs
            for lang, data in example.items():
                logprobs = data["target_option_logprobs"]
                sorted_options = sorted(
                    zip(data["target_options"], logprobs),
                    key=lambda x: x[1],
                    reverse=True,
                )
                sorted_options, logprobs = zip(*sorted_options)
                grouped_examples[idx][lang]["target_options"] = list(sorted_options)
                grouped_examples[idx][lang]["target_option_logprobs"] = list(logprobs)
            
            # Compute the cross lingual consistency score
            ...

        # Compute the final score
        result = {
            "cross_lingual_consistency": ...,
        }

        return result

This should resolve your second concern.

@Betswish
Contributor Author

Betswish commented Nov 6, 2023

Thanks a lot! I'm revising it and will try to complete it ASAP!

@kazemnejad
Contributor

kazemnejad commented Nov 16, 2023

Hey @Betswish ! Are there any updates regarding the new implementation?
Thanks :)

@Betswish
Contributor Author

Hi @kazemnejad, the new implementation is completed!

@Betswish
Contributor Author

Betswish commented Nov 22, 2023

Hi @kazemnejad, I made some minor updates!
I have run the test_task and fixed most issues.
But there is one remaining issue I'm not sure how to fix: it says the keys of the data source in config.jsonnet ('test') don't match the keys of the 'dataset' variable in task.py ('en', 'es').
You may check it by re-running the checks.

@kazemnejad
Contributor

kazemnejad commented Nov 29, 2023

@Betswish Thanks for your efforts. It's definitely getting close to ready to be merged. There are still some fixes that need to be implemented.

  • The task dataset files should be served somewhere else, as we cannot host them in the repo. You can take a look at other pull requests to see how they handle it.
  • Could you please include a single file usage_example.py for your task, where you showcase the full pipeline of using the task, making inferences, and performing evaluation the way you intend? Preferably, it should be done with a pretrained HuggingFace model (a sketch of such a scoring helper is given after this list). Please also include requirements-usage-example.txt with the Python dependencies that need to be installed to run the example.
  • Since your current PR already contains the dataset files, and they would remain in the git history even if you removed them, could you open a new PR with the above changes?
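
For reference, the candidate-scoring part of such a usage_example.py might look roughly like the sketch below. This is only an illustration, not the task's actual code: the checkpoint name is an arbitrary small multilingual model, and candidate_logprob is a hypothetical helper that sums the token log-probabilities of a candidate given the query.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # illustrative choice of a small multilingual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def candidate_logprob(query: str, candidate: str) -> float:
    """Sum of log-probabilities of the candidate tokens given the query."""
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    full_ids = tokenizer(query + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # Score only the positions belonging to the candidate (everything after the query prefix).
    # Tokenization at the query/candidate boundary can differ slightly from tokenizing
    # the query alone; this sketch ignores that edge case.
    total = 0.0
    for pos in range(query_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

Each query's candidate scores would then be collected into target_option_logprobs and passed to evaluate_predictions, as in the earlier example.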

Thanks a lot.

@Betswish Betswish closed this Nov 30, 2023
@Betswish Betswish deleted the BMLAMA branch November 30, 2023 13:30