Training Details

Ming Xu (徐明) edited this page Jun 5, 2023 · 3 revisions

The training scripts are in the scripts directory:

  • First stage: PT (Continued Pre-Training), run_pt.sh
  • Second stage: SFT (Supervised Fine-Tuning), run_sft.sh
  • Third stage: RM (Reward Model), run_rm.sh
  • Fourth stage: RL (Reinforcement Learning from human feedback), run_rl.sh

Description of training parameters

  1. To train on a single card, set nproc_per_node to 1, or drop the torchrun command and run the Python script directly, for example python scripts/run_supervised_finetuning.py
  2. The default pre-trained model is LLaMA. The training code is also compatible with other GPT-style models such as ChatGLM-6B and BLOOM; just adjust model_name_or_path
  3. To specify the training data, use --train_file_dir for the training data directory and --validation_file_dir for the validation data directory. If they are not specified, the HF datasets dataset given by --dataset_name is used by default. See "Dataset format" for the field format of the training set. It is recommended to add some general dialogue data to the domain training set. For dataset links, see 📚 Dataset
  4. If the runtime environment supports deepspeed, add --deepspeed deepspeed_config.json
  5. If the GPU supports int8, add --load_in_8bit True to train with 8-bit quantization, which can significantly reduce memory usage
  6. To debug, use --max_train_samples and --max_eval_samples to cap the number of training and validation samples and quickly verify that the code runs. Delete these two parameters or set them to -1 for real training
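The options above can be combined into one launch command. The following is a minimal sketch, not part of the repository: build_sft_command is a hypothetical helper, and it assumes that model_name_or_path is passed on the command line as --model_name_or_path. The other flags are the ones named in the list above.

```python
from typing import List, Optional


def build_sft_command(
    nproc_per_node: int = 1,
    model_name_or_path: Optional[str] = None,
    train_file_dir: Optional[str] = None,
    validation_file_dir: Optional[str] = None,
    deepspeed_config: Optional[str] = None,
    load_in_8bit: bool = False,
    max_train_samples: int = -1,
    max_eval_samples: int = -1,
) -> List[str]:
    """Assemble an argument list for the SFT script (hypothetical helper)."""
    script = "scripts/run_supervised_finetuning.py"
    if nproc_per_node > 1:
        cmd = ["torchrun", "--nproc_per_node", str(nproc_per_node), script]
    else:
        # Single card: skip torchrun and call the script directly.
        cmd = ["python", script]
    if model_name_or_path:
        # Assumption: the option is exposed as --model_name_or_path.
        cmd += ["--model_name_or_path", model_name_or_path]
    if train_file_dir:
        cmd += ["--train_file_dir", train_file_dir]
    if validation_file_dir:
        cmd += ["--validation_file_dir", validation_file_dir]
    if deepspeed_config:
        cmd += ["--deepspeed", deepspeed_config]
    if load_in_8bit:
        cmd += ["--load_in_8bit", "True"]
    # Sample caps are only for debugging; -1 (the default here) omits them.
    if max_train_samples > 0:
        cmd += ["--max_train_samples", str(max_train_samples)]
    if max_eval_samples > 0:
        cmd += ["--max_eval_samples", str(max_eval_samples)]
    return cmd


if __name__ == "__main__":
    # Quick debug run on one card with a small sample cap.
    print(" ".join(build_sft_command(train_file_dir="data/train", max_train_samples=100)))
```

The returned list can be passed to subprocess.run, or joined with spaces and pasted into a shell script.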

About LoRA Training

LoRA training is used by default at every stage to reduce memory requirements. After each stage, the LoRA (peft adapter) weights need to be merged into the base model, and the next stage sets model_name_or_path to the merged model folder. Merge with the following command:

python scripts/merge_peft_adapter.py \
  --base_model_name_or_path base_model_dir \
  --peft_model_path lora_model_dir \
  --output_dir outputs-merged
  • This script requires peft>=0.3.0
  • The merged weights are saved in the output_dir directory and can later be loaded directly with from_pretrained
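For readers who want to see what such a merge involves, here is a minimal sketch using the peft and transformers APIs. It is not the content of merge_peft_adapter.py. The function name merge_lora is hypothetical, and the sketch assumes a causal language model that AutoModelForCausalLM can load (for example LLaMA or BLOOM); other architectures may need a different model class.

```python
def merge_lora(base_model_dir: str, lora_model_dir: str, output_dir: str) -> None:
    """Merge LoRA adapter weights into the base model and save the result."""
    # Imported lazily so this file can be imported without torch/peft installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model, then attach the trained LoRA adapter to it.
    base_model = AutoModelForCausalLM.from_pretrained(base_model_dir)
    model = PeftModel.from_pretrained(base_model, lora_model_dir)

    # Fold the adapter weights into the base weights and drop the adapter layers.
    merged = model.merge_and_unload()

    # Save the merged model and the tokenizer so from_pretrained can load them.
    merged.save_pretrained(output_dir)
    tokenizer = AutoTokenizer.from_pretrained(base_model_dir)
    tokenizer.save_pretrained(output_dir)


if __name__ == "__main__":
    # Directory names mirror the placeholders in the command above.
    merge_lora("base_model_dir", "lora_model_dir", "outputs-merged")
```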

About Model Results

The training logs and models are saved in the output_dir directory, and the file structure in the directory is as follows:

output_dir/
|-- adapter_config.json
|-- adapter_model.bin
|-- checkpoint-24000
|   |-- adapter_config.json
|   |-- adapter_model.bin
|   |-- trainer_state.json
|   `-- training_args.bin
|-- train_results.txt
|-- eval_results.txt
|-- special_tokens_map.json
|-- tokenizer_config.json
|-- training_args.bin
|-- logs
|   |-- 1685436851.18595
|   |   `-- events.out.tfevents.1685436851.ts-89f5028ad154472e99e7bcf2c9bf2343-launcher.82684.1
`-- config.json
  • trainer_state.json records the changes in loss and learning_rate
  • The files in the logs directory can be used for tensorboard visualization. The command to start tensorboard is as follows: tensorboard --logdir output_dir/logs --host 0.0.0.0 --port 8008
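Besides tensorboard, the loss and learning_rate history can be read straight from trainer_state.json. The sketch below assumes the usual layout written by the Hugging Face Trainer, where a log_history list holds one dictionary per logging step. The function name read_training_curve and the example path are hypothetical.

```python
import json
from typing import Dict, List


def read_training_curve(trainer_state_path: str) -> Dict[str, List[float]]:
    """Collect step, loss and learning_rate from a trainer_state.json file."""
    with open(trainer_state_path, encoding="utf-8") as f:
        state = json.load(f)

    steps, losses, lrs = [], [], []
    for entry in state.get("log_history", []):
        # Evaluation entries carry eval_* keys instead of "loss"; skip them.
        if "loss" in entry and "learning_rate" in entry:
            steps.append(entry["step"])
            losses.append(entry["loss"])
            lrs.append(entry["learning_rate"])
    return {"step": steps, "loss": losses, "learning_rate": lrs}


if __name__ == "__main__":
    curve = read_training_curve("output_dir/checkpoint-24000/trainer_state.json")
    for step, loss, lr in zip(curve["step"], curve["loss"], curve["learning_rate"]):
        print(f"step={step} loss={loss} lr={lr}")
```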

About deepspeed

For the parameters in the deepspeed configuration file deepspeed_config.json, refer to:

  1. https://www.deepspeed.ai/docs/config-json/
  2. https://huggingface.co/docs/accelerate/usage_guides/deepspeed
  3. https://github.com/huggingface/transformers/blob/main/tests/deepspeed

If GPU memory is sufficient, prefer ZeRO stage 2; the corresponding configuration file is deepspeed_config.json. If GPU memory is insufficient, you can use stage 3, which partitions the model parameters across GPUs; this significantly reduces memory usage, but training is much slower.
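To illustrate the stage choice, the sketch below generates a minimal configuration for either stage. It is not the repository's deepspeed_config.json, whose contents are not shown here. The function name make_zero_config and the output file name are hypothetical, and the "auto" values rely on the Hugging Face Trainer integration filling them in from the training arguments.

```python
import json


def make_zero_config(stage: int = 2) -> dict:
    """Return a minimal DeepSpeed config dict for ZeRO stage 2 or 3."""
    if stage not in (2, 3):
        raise ValueError("this sketch only covers ZeRO stage 2 and stage 3")
    return {
        # "auto" lets the Hugging Face Trainer fill in values from its own arguments.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "fp16": {"enabled": "auto"},
        # Stage 2 partitions optimizer states and gradients;
        # stage 3 additionally partitions the model parameters.
        "zero_optimization": {"stage": stage},
    }


if __name__ == "__main__":
    # Hypothetical file name, chosen to avoid overwriting the repository's config.
    with open("deepspeed_config_sketch.json", "w", encoding="utf-8") as f:
        json.dump(make_zero_config(stage=2), f, indent=2)
```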

About multi-machine multi-card training

Take two machines with 8 cards each as an example:

node_rank=$1
echo ${node_rank}
master_addr="10.111.112.223"

torchrun --nproc_per_node 8 --nnodes 2 --master_addr ${master_addr} --master_port 14545 --node_rank ${node_rank} scripts/run_supervised_finetuning.py ...
  • node_rank is the rank of the node: set it to 0 on the first (master) machine and to 1 on the second machine
  • nnodes is the number of machines (nodes)
  • master_addr is the IP address of the master machine
  • master_port is the port used to communicate with the master machine
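The relationship between these parameters and the per-process ranks can be sketched with a little arithmetic. The helpers below are hypothetical, and they assume the usual contiguous layout in which each node's processes occupy a consecutive block of global ranks.

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of training processes across all machines."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of a process, assuming contiguous ranks per node."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


if __name__ == "__main__":
    # Two machines with 8 cards each, as in the example above.
    print(world_size(nnodes=2, nproc_per_node=8))
    # First card on the second machine (node_rank=1).
    print(global_rank(node_rank=1, local_rank=0, nproc_per_node=8))
```

With two machines of 8 cards each, this gives 16 processes in total, and the second machine hosts global ranks 8 through 15.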

Dataset format

Formats of the datasets loaded via --train_file_dir and --validation_file_dir:

  • The format of the PT (pre-training) dataset is as follows: txt file, one sample per line
  • The format of the SFT (supervised fine-tuning) dataset is as follows: json file in the alpaca dataset format, one sample per line, each sample contains the following fields:
{"instruction": "text1", "input": "text2", "output": "text3"}
  • The format of the Reward (reward model) dataset is as follows: json file, one sample per line, each sample contains the following fields:
{"question": "text1", "response_chosen": "text2", "response_rejected": "text3"}
  • The format of the RL (reinforcement learning) dataset is as follows: json file, one sample per line, each sample contains the following fields:
{"instruction": "text1", "input": "text2", "output": "text3"}

The RL dataset has the same fields as the SFT dataset, so SFT datasets can be reused.

When loading HF datasets with --dataset_name, refer to shibing624/medical for the format.
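Before training, it can help to check that every line of a json data file has the fields its stage expects. The following is a minimal sketch with a hypothetical helper, check_jsonl. The field names are taken from the formats listed above; the stage keys "sft", "reward" and "rl" and the example file path are labels chosen for this example.

```python
import json
from typing import Iterable, List

# Required fields per stage, as listed in the dataset formats above.
REQUIRED_FIELDS = {
    "sft": ("instruction", "input", "output"),
    "reward": ("question", "response_chosen", "response_rejected"),
    "rl": ("instruction", "input", "output"),
}


def check_jsonl(lines: Iterable[str], stage: str) -> List[int]:
    """Return the 1-based numbers of lines that are not valid samples for the stage."""
    required = REQUIRED_FIELDS[stage]
    bad = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            sample = json.loads(line)
        except json.JSONDecodeError:
            bad.append(lineno)
            continue
        # A sample must be a JSON object containing every required field.
        if not isinstance(sample, dict) or any(k not in sample for k in required):
            bad.append(lineno)
    return bad


if __name__ == "__main__":
    # Hypothetical path; point this at a file inside your --train_file_dir.
    with open("data/finetune/train.json", encoding="utf-8") as f:
        print(check_jsonl(f, "sft"))
```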
