
dataset format

Ming Xu (徐明) edited this page Aug 1, 2023 · 1 revision

Dataset format

Datasets passed through --train_file_dir and --validation_file_dir use the following formats.

  • The format of the PT (pre-training) dataset is as follows: plain-text (txt) file, one sample per line.
  • The format of the SFT (supervised fine-tuning) dataset is as follows: json file in the alpaca dataset format, one sample per line, each sample contains the following fields:
{"instruction": "text1", "input": "text2", "output": "text3"}
  • The format of the Reward (reward model) dataset is as follows: json file, one sample per line, each sample contains the following fields:
{"question": "text1", "response_chosen": "text2", "response_rejected": "text3"}
  • The RL (Reinforcement Learning) dataset format is as follows: json file, one sample per line, each sample contains the following fields:
{"instruction": "text1", "input": "text2", "output": "text3"}

The RL format has the same fields as the SFT format, so SFT datasets can be reused for RL.
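
The formats above can be checked before training with a short script. What follows is only a minimal sketch and is not part of the project: the function name parse_samples and the stage labels "pt", "sft", "reward" and "rl" are hypothetical, and skipping blank lines is an assumption. Only the field names come from this page.

```python
import json

# Required keys per stage, copied from the field lists on this page.
# The stage labels themselves are hypothetical names used only in this sketch.
REQUIRED_FIELDS = {
    "sft": ("instruction", "input", "output"),
    "reward": ("question", "response_chosen", "response_rejected"),
    "rl": ("instruction", "input", "output"),
}


def parse_samples(lines, stage):
    """Parse an iterable of lines into a list of samples for one stage.

    For "pt", each non-empty line is one plain-text sample.
    For the other stages, each non-empty line must be a JSON object
    that contains every required field for that stage.
    """
    if stage != "pt" and stage not in REQUIRED_FIELDS:
        raise ValueError(f"unknown stage: {stage!r}")
    samples = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if stage == "pt":
            samples.append(line)
            continue
        record = json.loads(line)  # raises ValueError on malformed JSON
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        missing = [key for key in REQUIRED_FIELDS[stage] if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        samples.append(record)
    return samples
```

Since a file object iterates line by line, a call such as parse_samples(open(path, encoding="utf-8"), "reward") would check a whole file, where path is whatever file sits in the directory given to --train_file_dir.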

Use --dataset_name to load datasets from the Hugging Face (HF) Hub; for the expected format, refer to shibing624/medical.
