Datasets and Evaluation Metrics

The provided fine-tuning script lets you select between three datasets by passing the dataset argument to the llama_finetuning.py script. The current options are grammar_dataset, alpaca_dataset and samsum_dataset. Note: Use of any of these datasets should comply with the dataset's underlying license (including but not limited to non-commercial use).

grammar_dataset contains 150K pairs of English sentences and possible corrections.
alpaca_dataset provides 52K instruction-response pairs generated by text-davinci-003.
samsum_dataset contains about 16k messenger-like conversations with summaries.

Adding custom datasets

The list of available datasets can easily be extended with custom datasets by following these instructions.

Each dataset has a corresponding configuration (a dataclass) in configs/dataset.py, which contains the dataset name, the training/validation split names, and optional parameters such as data files.
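As a rough illustration of such a configuration, the sketch below defines a dataclass holding a dataset name, the two split names and one optional parameter. The class name and all field names (dataset, train_split, test_split, data_path) are assumptions made for this example, not a confirmed schema; check the existing entries in configs/dataset.py for the fields the script actually expects.

```python
from dataclasses import dataclass


@dataclass
class MyCustomDatasetConfig:
    # Hypothetical configuration: every field name here is an assumption
    # chosen to mirror the description above.
    dataset: str = "my_custom_dataset"   # name used to look up the dataset
    train_split: str = "train"           # split name used for training
    test_split: str = "validation"       # split name used for validation
    data_path: str = ""                  # optional parameter, e.g. data file location


cfg = MyCustomDatasetConfig()
print(cfg.dataset, cfg.train_split, cfg.test_split)
```

Using a dataclass keeps the defaults in one place while still allowing individual fields to be overridden, for example MyCustomDatasetConfig(data_path="my_data.json").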

Additionally, there is a preprocessing function for each dataset in the ft_datasets folder. The data returned by the dataset must be consumable by the forward method of the model being fine-tuned, that is, by calling model(**data). For CausalLM models this usually means that the data needs to be a dictionary with "input_ids", "attention_mask" and "labels" fields.
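To make the expected dictionary concrete, here is a minimal sketch that assembles one training example from token ids that have already been produced by a tokenizer. The helper name build_example is hypothetical. Masking the prompt positions with -100 (the default ignore index of PyTorch's cross-entropy loss) is one common choice, assumed here for illustration; copying input_ids into labels unchanged is a simpler alternative.

```python
def build_example(prompt_ids, answer_ids, ignore_index=-100):
    """Build one causal-LM example from already-tokenized prompt and answer ids.

    build_example is a hypothetical helper, not part of the fine-tuning script.
    """
    input_ids = prompt_ids + answer_ids
    return {
        "input_ids": input_ids,
        # Every position is a real token here, so nothing is masked out.
        "attention_mask": [1] * len(input_ids),
        # Prompt positions are ignored by the loss; answer positions are trained on.
        "labels": [ignore_index] * len(prompt_ids) + answer_ids,
    }


example = build_example(prompt_ids=[1, 15, 27], answer_ids=[42, 2])
print(example)
```

In real use the lists would typically be converted to tensors and batched before being passed as model(**data).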

To add a custom dataset, perform the following steps.

Create a dataset configuration following the schema described above. Examples can be found in configs/dataset.py.
Create a preprocessing routine that loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function needs to be (dataset_config, tokenizer, split_name), where split_name is the string for the train/validation split as defined in the dataclass.
Register the dataset name and preprocessing function by inserting them as key and value into the DATASET_PREPROC dictionary in utils/dataset_utils.py.
Set the dataset field in the training config to the dataset name, or use the --dataset option of the llama_finetuning.py training script.
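The steps above can be sketched end to end as follows. This is a self-contained toy, not the repository's code: the local DATASET_PREPROC dictionary stands in for the one in utils/dataset_utils.py, SimpleNamespace stands in for the configuration dataclass, and ToyDataset, get_toy_dataset, toy_tokenizer and the in-memory data are all hypothetical names invented for illustration. The dataset class only implements __len__ and __getitem__; in practice it would typically subclass torch.utils.data.Dataset.

```python
from types import SimpleNamespace


class ToyDataset:
    """Map-style dataset: exposes __len__ and __getitem__ like a PyTorch dataset."""

    def __init__(self, texts, tokenizer):
        self.texts = texts
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        ids = self.tokenizer(self.texts[idx])
        return {
            "input_ids": ids,
            "attention_mask": [1] * len(ids),
            "labels": list(ids),  # simplest choice: train on every token
        }


def get_toy_dataset(dataset_config, tokenizer, split_name):
    """Preprocessing function with the (dataset_config, tokenizer, split_name) signature."""
    # Hypothetical in-memory data; a real routine would load files here.
    data = {
        "train": ["hello world", "fine tuning is fun"],
        "validation": ["hello again"],
    }
    return ToyDataset(data[split_name], tokenizer)


# Stand-in for the registry: dataset name -> preprocessing function.
DATASET_PREPROC = {"toy_dataset": get_toy_dataset}


def toy_tokenizer(text):
    # Stand-in for a real tokenizer: maps each word to its length.
    return [len(word) for word in text.split()]


# Stand-in for the configuration dataclass.
cfg = SimpleNamespace(dataset="toy_dataset", train_split="train", test_split="validation")

train_set = DATASET_PREPROC[cfg.dataset](cfg, toy_tokenizer, cfg.train_split)
print(len(train_set), train_set[0])
```

Looking the function up by name keeps the training script independent of any particular dataset: adding a new dataset only requires a new configuration and a new dictionary entry.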

Application

Below are other datasets that can be used for fine-tuning, grouped by their main use case.

Q&A (these can also be used for evaluation)

MMLU
BoolQ
NarrativeQA
NaturalQuestions (closed-book)
NaturalQuestions (open-book)
QuAC
HellaSwag
OpenbookQA
TruthfulQA (can help assess the model's factual accuracy and its tendency to produce misinformation)

Instruction fine-tuning

Alpaca (52k): instruction tuning
Dolly 15k (15k): instruction tuning

Simple text generation for quick tests

English quotes (2508): multi-label text classification, text generation

Reasoning (used mostly for evaluation of LLMs)

bAbI
Dyck
GSM8K
MATH
APPS
HumanEval
LSAT
Entity matching

Toxicity evaluation

RealToxicityPrompts

Bias evaluation

CrowS-Pairs: gender bias
WinoGender: gender bias

Useful Links

More information on evaluation datasets can be found in HELM.
