This is the repository for the paper "T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics".
You can run T5Score on a corpus to obtain automatic evaluation scores. An example of evaluating on WMT20 is provided, along with example results for the cs-en language pair of the WMT20 corpus.
To compare the automatic evaluation metric with human judgements, you can run segment-level and system-level analysis as follows:
$ python evaluation/calculate_corr.py \
--dir "./evaluation/results/" \
--language_pair "cs-en" \
--filename "scores.csv" \
--metrics "t5_score_ref_F"
You can use parallel data to train your own unsupervised T5Score. An example trained on ParaCotta, a multilingual paraphrase dataset, is provided.
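The unsupervised variant scores a hypothesis generatively: the model's (length-normalized) log-probability of generating the hypothesis from the source or reference serves as the quality score, so no human labels are needed. The toy token probabilities below stand in for real T5 outputs; this is an illustrative sketch, not the repo's code:

```python
import math

def generative_score(token_probs):
    """Length-normalized sum of token log-probabilities (higher is better)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# A fluent hypothesis receives higher token probabilities than a weak one,
# so its length-normalized log-probability score is higher.
good = generative_score([0.9, 0.8, 0.85])
bad = generative_score([0.3, 0.2, 0.25])
assert good > bad
```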
You can use paired data with human judgements to train your own supervised T5Score. An example trained on data from WMT17 through WMT19 is provided.
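With human judgements available, discriminative fine-tuning can use a ranking objective over hypothesis pairs: the human-preferred hypothesis should score higher than the dispreferred one by some margin. A margin ranking loss is one common choice for this; the exact objective and margin value in the paper may differ, so treat this as an illustrative sketch:

```python
def margin_ranking_loss(score_better, score_worse, margin=0.1):
    """Zero loss once the preferred hypothesis leads by at least the margin."""
    return max(0.0, margin - (score_better - score_worse))

# Well-separated pair: no loss, nothing to learn from this example.
well_separated = margin_ranking_loss(0.8, 0.5)
# Pair within the margin: positive loss pushes the scores further apart.
within_margin = margin_ranking_loss(0.55, 0.5)
```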