diff --git a/README.md b/README.md index 4e729226c..2a8eb2879 100644 --- a/README.md +++ b/README.md @@ -39,6 +39,7 @@ English | [简体中文](README_zh-CN.md) ## 🎉 News +- **\[2024/07\]** Support [DPO](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/dpo), [ORPO](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/orpo) and [Reward Model](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/reward_model) training with packed data and sequence parallelism! See the [documentation](https://xtuner.readthedocs.io/en/latest/dpo/overview.html) for more details. - **\[2024/07\]** Support [InternLM 2.5](xtuner/configs/internlm/internlm2_5_chat_7b/) models! - **\[2024/06\]** Support [DeepSeek V2](xtuner/configs/deepseek/deepseek_v2_chat/) models! **2x faster!** - **\[2024/04\]** [LLaVA-Phi-3-mini](https://huggingface.co/xtuner/llava-phi-3-mini-hf) is released! Click [here](xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336) for details! @@ -144,6 +145,9 @@ XTuner is an efficient, flexible and full-featured toolkit for fine-tuning large
  • QLoRA
  • LoRA
  • Full parameter fine-tune
  • + DPO
  • + ORPO
  • + Reward Model
  • diff --git a/README_zh-CN.md b/README_zh-CN.md index 16c1a2af2..58076210f 100644 --- a/README_zh-CN.md +++ b/README_zh-CN.md @@ -39,6 +39,7 @@ ## 🎉 更新 +- **\[2024/07\]** 支持训练 [DPO](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/dpo), [ORPO](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/orpo) 还有 [Reward Model](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/reward_model) ! 并且能够支持打包数据以及序列并行功能! 请参考 [文档](https://xtuner.readthedocs.io/zh-cn/latest/dpo/overview.html) 了解更多信息。 - **\[2024/07\]** 支持 [InternLM 2.5](xtuner/configs/internlm/internlm2_5_chat_7b/) 模型! - **\[2024/06\]** 支持 [DeepSeek V2](xtuner/configs/deepseek/deepseek_v2_chat/) models! **训练速度提升一倍!** - **\[2024/04\]** 多模态大模型 [LLaVA-Phi-3-mini](https://huggingface.co/xtuner/llava-phi-3-mini-hf) 发布!快速开始请查阅此[文档](xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336)! @@ -144,6 +145,9 @@ XTuner 是一个高效、灵活、全能的轻量化大模型微调工具库。
  • QLoRA
  • LoRA
  • 全量参数微调
  • + DPO
  • + ORPO
  • + Reward Model
  • diff --git a/docs/en/dpo/modify_settings.md b/docs/en/dpo/modify_settings.md new file mode 100644 index 000000000..d78cc40e6 --- /dev/null +++ b/docs/en/dpo/modify_settings.md @@ -0,0 +1,83 @@ +## Modify DPO Training Configuration + +This section introduces config parameters related to DPO (Direct Preference Optimization) training. For more details on XTuner config files, please refer to [Modifying Training Configuration](https://xtuner.readthedocs.io/zh-cn/latest/training/modify_settings.html). + +### Loss Function + +In DPO training, you can choose different types of loss functions according to your needs. XTuner provides various loss function options, such as `sigmoid`, `hinge`, `ipo`, etc. You can select the desired loss function type by setting the `dpo_loss_type` parameter. + +Additionally, you can control the temperature coefficient in the loss function by adjusting the `loss_beta` parameter. The `label_smoothing` parameter can be used for smoothing labels. + +```python +####################################################################### +# PART 1 Settings # +####################################################################### +# Model +dpo_loss_type = 'sigmoid' # One of ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'sppo_hard', 'nca_pair', 'robust'] +loss_beta = 0.1 +label_smoothing = 0.0 +``` + +### Modifying the Model + +Users can modify `pretrained_model_name_or_path` to change the pretrained model. + +```python +####################################################################### +# PART 1 Settings # +####################################################################### +# Model +pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft' +``` + +### Training Data + +In DPO training, you can specify the maximum number of tokens for a single sample sequence using the `max_length` parameter. XTuner will automatically truncate or pad the data. + +```python +# Data +max_length = 2048 +``` + +In the configuration file, we use the `train_dataset` field to specify the training dataset. You can specify the dataset loading method using the `dataset` field and the dataset mapping function using the `dataset_map_fn` field. + +```python +####################################################################### +# PART 3 Dataset & Dataloader # +####################################################################### +sampler = SequenceParallelSampler \ + if sequence_parallel_size > 1 else DefaultSampler + +train_dataset = dict( + type=build_preference_dataset, + dataset=dict(type=load_dataset, path='mlabonne/orpo-dpo-mix-40k'), + tokenizer=tokenizer, + max_length=max_length, + dataset_map_fn=orpo_dpo_mix_40k_map_fn, + is_dpo=True, + is_reward=False, + reward_token_id=-1, + num_proc=32, + use_varlen_attn=use_varlen_attn, + max_packed_length=max_packed_length, + shuffle_before_pack=True, +) + +train_dataloader = dict( + batch_size=batch_size, + num_workers=dataloader_num_workers, + dataset=train_dataset, + sampler=dict(type=sampler, shuffle=True), + collate_fn=dict( + type=preference_collate_fn, use_varlen_attn=use_varlen_attn)) +``` + +In the above configuration, we use `load_dataset` to load the `mlabonne/orpo-dpo-mix-40k` dataset from Hugging Face and use `orpo_dpo_mix_40k_map_fn` as the dataset mapping function. + +For more information on handling datasets and writing dataset mapping functions, please refer to the [Preference Dataset Section](../reward_model/preference_data.md). 
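+If your preference data is stored locally in the JSONL format described in the [Preference Dataset Section](../reward_model/preference_data.md), the `dataset` field can point directly at those files. A minimal sketch (the file paths are placeholders; the remaining fields stay the same as in the `train_dataset` above):
+
+```python
+train_dataset = dict(
+    type=build_preference_dataset,
+    dataset=dict(
+        type=load_jsonl_dataset,
+        data_files=[
+            '/your/jsonl/path/here.jsonl',
+        ]),
+    tokenizer=tokenizer,
+    max_length=max_length,
+    is_dpo=True,
+    is_reward=False,
+    reward_token_id=-1,
+    num_proc=32,
+    use_varlen_attn=use_varlen_attn,
+    max_packed_length=max_packed_length,
+    shuffle_before_pack=True,
+)
+```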
+ +### Accelerating Training + +When training with preference data, we recommend enabling the [Variable-Length Attention Mechanism](https://xtuner.readthedocs.io/zh-cn/latest/acceleration/varlen_flash_attn.html) to avoid memory waste caused by length differences between chosen and rejected samples within a single preference pair. You can enable the variable-length attention mechanism by setting `use_varlen_attn=True`. + +XTuner also supports many training acceleration methods. For details on how to use them, please refer to the [Acceleration Strategies Section](https://xtuner.readthedocs.io/zh-cn/latest/acceleration/hyper_parameters.html). diff --git a/docs/en/dpo/overview.md b/docs/en/dpo/overview.md new file mode 100644 index 000000000..0c20946e3 --- /dev/null +++ b/docs/en/dpo/overview.md @@ -0,0 +1,27 @@ +## Introduction to DPO + +### Overview + +DPO (Direct Preference Optimization) is a method used in large language model training for directly optimizing models on human preferences. Unlike traditional reinforcement learning methods, DPO directly uses human preference data to optimize the model, thereby improving the quality of generated content to better align with human preferences. DPO also eliminates the need to train a Reward Model and a Critic Model, avoiding the complexity of reinforcement learning algorithms, reducing training overhead, and enhancing training efficiency. + +Many algorithms have proposed improvements to DPO's loss function. In XTuner, besides DPO, we have also implemented loss functions from papers such as [Identity Preference Optimization (IPO)](https://huggingface.co/papers/2310.12036). To use these algorithms, please refer to the [Modify DPO Settings](./modify_settings.md) section. We also provide some [example configurations](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/dpo) for reference. + +In addition to DPO, there are alignment algorithms like [ORPO](https://arxiv.org/abs/2403.07691) that do not require a reference model. ORPO uses the odds ratio to penalize rejected samples during training, steering the model more effectively toward the chosen samples. ORPO eliminates the dependence on a reference model, making the training process simpler and more efficient. The training method for ORPO in XTuner is very similar to DPO, and we provide some [example configurations](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/orpo). Users can refer to the DPO tutorial to modify the configuration. + +### Features of DPO Training in XTuner + +DPO training in XTuner offers the following significant advantages: + +1. **Latest Algorithms**: In addition to supporting standard DPO, XTuner also supports improved DPO algorithms as well as memory-efficient algorithms like ORPO that do not rely on reference models. + +2. **Reducing Memory Waste**: Due to the length differences in chosen and rejected data in preference datasets, padding tokens during data concatenation can cause memory waste. In XTuner, by utilizing the variable-length attention feature from Flash Attention2, preference pairs are packed into the same sequence during training, significantly reducing memory waste caused by padding tokens. This not only improves memory efficiency but also allows for training larger models or handling more data under the same hardware conditions. + + ![img](../../zh_cn/reward_model/images/var_len_atten.png) + +3.
**Efficient Training**: Leveraging XTuner's QLoRA training capabilities, the reference model can be obtained by simply removing the LoRA adapter from the policy model, which eliminates the memory overhead of separate reference model weights and significantly reduces DPO training costs. + +4. **Long Text Training**: With XTuner's sequence parallelism, long text data can be trained efficiently. + +### Getting Started + +Refer to the [Quick Start Guide](./quick_start.md) to understand the basic concepts. For more information on configuring training parameters, please see the [Modify DPO Settings](./modify_settings.md) section. diff --git a/docs/en/dpo/quick_start.md b/docs/en/dpo/quick_start.md new file mode 100644 index 000000000..19fffbf8b --- /dev/null +++ b/docs/en/dpo/quick_start.md @@ -0,0 +1,71 @@ +## Quick Start with DPO + +In this section, we will introduce how to use XTuner to train a 1.8B DPO (Direct Preference Optimization) model to help you get started quickly. + +### Preparing Pretrained Model Weights + +We use the model [InternLM2-chat-1.8b-sft](https://huggingface.co/internlm/internlm2-chat-1_8b-sft) as the initial model for DPO training to align with human preferences. + +Set `pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'` in the training configuration file, and the model files will be automatically downloaded when training starts. If you need to download the model weights manually, please refer to the section [Preparing Pretrained Model Weights](https://xtuner.readthedocs.io/zh-cn/latest/preparation/pretrained_model.html), which provides detailed instructions on how to download model weights from Huggingface or Modelscope. Here are the links to the models on HuggingFace and ModelScope: + +- HuggingFace link: https://huggingface.co/internlm/internlm2-chat-1_8b-sft +- ModelScope link: https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b-sft/summary + +### Preparing Training Data + +In this tutorial, we use the [mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k) dataset from Huggingface as an example. + +```python +train_dataset = dict( + type=build_preference_dataset, + dataset=dict( + type=load_dataset, + path='mlabonne/orpo-dpo-mix-40k'), + dataset_map_fn=orpo_dpo_mix_40k_map_fn, + is_dpo=True, + is_reward=False, +) +``` + +Using the above configuration in the configuration file will automatically download and process this dataset. If you want to use other open-source datasets from Huggingface or custom datasets, please refer to the [Preference Dataset](../reward_model/preference_data.md) section. + +### Preparing Configuration File + +XTuner provides several ready-to-use configuration files, which can be viewed using `xtuner list-cfg`. Execute the following command to copy a configuration file to the current directory. + +```bash +xtuner copy-cfg internlm2_chat_1_8b_dpo_full . +``` + +Open the copied configuration file. If you choose to download the model and dataset automatically, no modifications are needed. If you want to specify paths to your pre-downloaded model and dataset, modify the `pretrained_model_name_or_path` and the `path` parameter in `dataset` under `train_dataset`. + +For more training parameter configurations, please refer to the [Modifying DPO Training Configuration](./modify_settings.md) section. + +### Starting the Training + +After completing the above steps, you can start the training task using the following commands.
+ +```bash +# Single machine, single GPU +xtuner train ./internlm2_chat_1_8b_dpo_full_copy.py +# Single machine, multiple GPUs +NPROC_PER_NODE=${GPU_NUM} xtuner train ./internlm2_chat_1_8b_dpo_full_copy.py +# Slurm cluster +srun ${SRUN_ARGS} xtuner train ./internlm2_chat_1_8b_dpo_full_copy.py --launcher slurm +``` + +### Model Conversion + +XTuner provides integrated tools to convert models to HuggingFace format. Simply execute the following commands: + +```bash +# Create a directory for HuggingFace format parameters +mkdir work_dirs/internlm2_chat_1_8b_dpo_full_copy/iter_15230_hf + +# Convert format +xtuner convert pth_to_hf internlm2_chat_1_8b_dpo_full_copy.py \ + work_dirs/internlm2_chat_1_8b_dpo_full_copy/iter_15230.pth \ + work_dirs/internlm2_chat_1_8b_dpo_full_copy/iter_15230_hf +``` + +This will convert the XTuner's ckpt to the HuggingFace format. diff --git a/docs/en/index.rst b/docs/en/index.rst index c702e0a04..c4c18d31a 100644 --- a/docs/en/index.rst +++ b/docs/en/index.rst @@ -56,6 +56,23 @@ Documentation training/open_source_dataset.rst training/visualization.rst +.. toctree:: + :maxdepth: 2 + :caption: DPO + + dpo/overview.md + dpo/quick_start.md + dpo/modify_settings.md + +.. toctree:: + :maxdepth: 2 + :caption: Reward Model + + reward_model/overview.md + reward_model/quick_start.md + reward_model/modify_settings.md + reward_model/preference_data.md + .. toctree:: :maxdepth: 2 :caption: Acceleration diff --git a/docs/en/reward_model/modify_settings.md b/docs/en/reward_model/modify_settings.md new file mode 100644 index 000000000..4f41ca300 --- /dev/null +++ b/docs/en/reward_model/modify_settings.md @@ -0,0 +1,100 @@ +## Modify Reward Model Training Configuration + +This section introduces the config related to Reward Model training. For more details on XTuner config files, please refer to [Modify Settings](https://xtuner.readthedocs.io/zh-cn/latest/training/modify_settings.html). + +### Loss Function + +XTuner uses the [Bradley–Terry Model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) for preference modeling in the Reward Model. You can specify `loss_type="ranking"` to use ranking loss. XTuner also implements the focal loss function proposed in InternLM2, which adjusts the weights of difficult and easy samples to avoid overfitting. You can set `loss_type="focal"` to use this loss function. For a detailed explanation of this loss function, please refer to the [InternLM2 Technical Report](https://arxiv.org/abs/2403.17297). + +Additionally, to maintain stable reward model output scores, we have added a constraint term in the loss. You can specify `penalty_type='log_barrier'` or `penalty_type='L2'` to enable log barrier or L2 constraints, respectively. + +```python +####################################################################### +# PART 1 Settings # +####################################################################### +# Model +loss_type = 'focal' # 'ranking' or 'focal' +penalty_type = 'log_barrier' # 'log_barrier' or 'L2' +``` + +### Modifying the Model + +Users can modify `pretrained_model_name_or_path` to change the pretrained model. + +Note that XTuner calculates reward scores by appending a special token at the end of the data. Therefore, when switching models with different vocabularies, the ID of this special token also needs to be modified accordingly. We usually use an unused token at the end of the vocabulary as the reward token. 
+ +For example, in InternLM2, we use `[UNUSED_TOKEN_130]` as the reward token: + +```python +####################################################################### +# PART 1 Settings # +####################################################################### +# Model +pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft' +reward_token_id = 92527 # use [UNUSED_TOKEN_130] as reward token +``` + +If the user switches to the llama3 model, we can use `<|reserved_special_token_0|>` as the reward token: + +```python +####################################################################### +# PART 1 Settings # +####################################################################### +# Model +pretrained_model_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct' +reward_token_id = 128002 # use <|reserved_special_token_0|> as reward token +``` + +### Training Data + +In Reward Model training, you can specify the maximum number of tokens for a single sample sequence using `max_length`. XTuner will automatically truncate or pad the data. + +```python +# Data +max_length = 2048 +``` + +In the configuration file, we use the `train_dataset` field to specify the training dataset. You can specify the dataset loading method using the `dataset` field and the dataset mapping function using the `dataset_map_fn` field. + +```python +####################################################################### +# PART 3 Dataset & Dataloader # +####################################################################### +sampler = SequenceParallelSampler \ + if sequence_parallel_size > 1 else DefaultSampler + +train_dataset = dict( + type=build_preference_dataset, + dataset=dict( + type=load_dataset, + path='argilla/ultrafeedback-binarized-preferences-cleaned'), + tokenizer=tokenizer, + max_length=max_length, + dataset_map_fn=orpo_dpo_mix_40k_map_fn, + is_dpo=False, + is_reward=True, + reward_token_id=reward_token_id, + num_proc=32, + use_varlen_attn=use_varlen_attn, + max_packed_length=max_packed_length, + shuffle_before_pack=True, +) + +train_dataloader = dict( + batch_size=batch_size, + num_workers=dataloader_num_workers, + dataset=train_dataset, + sampler=dict(type=sampler, shuffle=True), + collate_fn=dict( + type=preference_collate_fn, use_varlen_attn=use_varlen_attn)) +``` + +In the above configuration, we use `load_dataset` to load the `argilla/ultrafeedback-binarized-preferences-cleaned` dataset from Hugging Face, using `orpo_dpo_mix_40k_map_fn` as the dataset mapping function (this is because `orpo_dpo_mix_40k` and `ultrafeedback-binarized-preferences-cleaned` have the same format, so the same mapping function is used). + +For more information on handling datasets and writing dataset mapping functions, please refer to the [Preference Data Section](./preference_data.md). + +### Accelerating Training + +When training with preference data, we recommend enabling the [Variable-Length Attention Mechanism](https://xtuner.readthedocs.io/zh-cn/latest/acceleration/varlen_flash_attn.html) to avoid memory waste caused by length differences between chosen and rejected samples within a single preference. You can enable the variable-length attention mechanism by setting `use_varlen_attn=True`. + +XTuner also supports many training acceleration methods. For details on how to use them, please refer to the [Acceleration Strategies Section](https://xtuner.readthedocs.io/zh-cn/latest/acceleration/hyper_parameters.html). 
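+For reference, these acceleration switches live in the settings part of the config and are consumed by the `train_dataset`, sampler, and `collate_fn` entries shown above. A minimal sketch with illustrative values:
+
+```python
+#######################################################################
+#                          PART 1  Settings                           #
+#######################################################################
+# Pack the chosen and rejected samples of a pair into one sequence
+use_varlen_attn = True
+# Set > 1 to shard long sequences across GPUs with sequence parallelism
+sequence_parallel_size = 1
+```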
diff --git a/docs/en/reward_model/overview.md b/docs/en/reward_model/overview.md new file mode 100644 index 000000000..eb210140c --- /dev/null +++ b/docs/en/reward_model/overview.md @@ -0,0 +1,43 @@ +## Introduction to Reward Model + +### Overview + +The Reward Model is a crucial component in the reinforcement learning process. Its primary task is to predict reward values based on given inputs, guiding the direction of the learning algorithm. In RLHF (Reinforcement Learning from Human Feedback), the Reward Model acts as a proxy for human preferences, helping the reinforcement learning algorithm optimize its policy more effectively. + +In large language model training, the Reward Model typically refers to the Preference Model. By being trained on good and bad (chosen & rejected) responses to the same prompts, it fits human preferences and, during inference, predicts a reward value to guide the optimization of the Actor model in the RLHF process. + +Applications of the Reward Model include but are not limited to: + +- **RLHF Training**: During RLHF training with algorithms such as Proximal Policy Optimization (PPO), the Reward Model provides reward signals that improve the quality of generated content and align it more closely with human preferences. +- **BoN Sampling**: In the Best-of-N (BoN) sampling process, users can use the Reward Model to score multiple responses to the same prompt and select the highest-scoring generated result, thereby enhancing the model's output. +- **Data Construction**: The Reward Model can be used to evaluate and filter training data, or to replace manual annotation when constructing DPO training data. + +### Features of Reward Model Training in XTuner + +Reward Model training in XTuner offers the following significant advantages: + +1. **Latest Training Techniques**: XTuner integrates the Reward Model training loss function from InternLM2, which stabilizes the numerical range of reward scores and reduces overfitting on simple samples (see [InternLM2 Technical Report](https://arxiv.org/abs/2403.17297) for details). + +2. **Reducing Memory Waste**: Due to the length differences in chosen and rejected data in preference datasets, padding tokens during data concatenation can cause memory waste. In XTuner, by utilizing the variable-length attention feature from Flash Attention2, preference pairs are packed into the same sequence during training, significantly reducing memory waste caused by padding tokens. This not only improves memory efficiency but also allows for training larger models or handling more data under the same hardware conditions. + +![img](../../zh_cn/reward_model/images/var_len_atten.png) + +3. **Efficient Training**: Leveraging XTuner's QLoRA training capabilities, we can perform full parameter training only on the Reward Model's Value Head, while using QLoRA fine-tuning on the language model itself, substantially reducing the memory overhead of model training. + +4. **Long Text Training**: With XTuner's sequence parallelism, long text data can be trained efficiently. + +![img](../../zh_cn/reward_model/images/sequence_parallel.png) + +### Getting Started + +Refer to the [Quick Start Guide](./quick_start.md) to understand the basic concepts. For more information on configuring training parameters, please see the [Modifying Reward Model Settings](./modify_settings.md) section.
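+To make the BoN sampling use case above concrete, the sketch below scores several candidate answers with a reward model that has been converted to a standard `SequenceClassification` checkpoint (see the conversion notes in the [Quick Start Guide](./quick_start.md)). The model path and prompts are placeholders, not part of XTuner's API; the InternLM2 reward models listed below expose their own scoring interfaces via remote code instead.
+
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+# Placeholder path to a converted reward model with a single-logit value head.
+model_path = 'path/to/converted/reward_model_hf'
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+model = AutoModelForSequenceClassification.from_pretrained(model_path)
+model.eval()
+
+prompt = [{'role': 'user', 'content': 'Where was the 2020 World Series played?'}]
+candidates = [
+    'The 2020 World Series was played at Globe Life Field in Arlington, Texas.',
+    "I don't know.",
+]
+
+scores = []
+for answer in candidates:
+    chat = prompt + [{'role': 'assistant', 'content': answer}]
+    text = tokenizer.apply_chat_template(chat, tokenize=False)
+    inputs = tokenizer(text, return_tensors='pt')
+    with torch.no_grad():
+        # With a single-logit reward head, the score is the scalar logit.
+        scores.append(model(**inputs).logits[0, 0].item())
+
+print(candidates[scores.index(max(scores))])  # Best-of-N selection
+```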
+ +### Open-source Models + +We have used XTuner to train the InternLM2 Reward Models from the InternLM2 Technical Report; you are welcome to download and use them: + +| Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | RewardBench Score | | ------------------------- | ---------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | | **InternLM2-1.8B-Reward** | [🤗internlm2-1_8b-reward](https://huggingface.co/internlm/internlm2-1_8b-reward) | [internlm2-1_8b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-1_8b-reward) | 80.6 | | **InternLM2-7B-Reward** | [🤗internlm2-7b-reward](https://huggingface.co/internlm/internlm2-7b-reward) | [internlm2-7b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-7b-reward) | 86.6 | | **InternLM2-20B-Reward** | [🤗internlm2-20b-reward](https://huggingface.co/internlm/internlm2-20b-reward) | [internlm2-20b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-20b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-20b-reward) | 89.5 | diff --git a/docs/en/reward_model/preference_data.md b/docs/en/reward_model/preference_data.md new file mode 100644 index 000000000..2f304e627 --- /dev/null +++ b/docs/en/reward_model/preference_data.md @@ -0,0 +1,110 @@ +## Preference Dataset + +### Overview + +XTuner's Reward Model, DPO, ORPO, and other algorithms that train on preference data all adopt the same data format. Each training sample in the preference dataset needs to contain the following three fields: `prompt`, `chosen`, and `rejected`. The values for each field follow the [OpenAI chat message](https://platform.openai.com/docs/api-reference/chat/create) format. A specific example is as follows: + +```json +{ + "prompt": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "Who won the world series in 2020?" + }, + { + "role": "assistant", + "content": "The Los Angeles Dodgers won the World Series in 2020." + }, + { + "role": "user", + "content": "Where was it played?" + } + ], + "chosen": [ + { + "role": "assistant", + "content": "The 2020 World Series was played at Globe Life Field in Arlington, Texas." + } + ], + "rejected": [ + { + "role": "assistant", + "content": "I don't know." + } + ] +} +``` + +When conducting Reward Model training or DPO training, XTuner processes the preference dataset into different training labels based on the type of training task. + +![img](../../zh_cn/reward_model/images/preference_data.png) + +As shown in the above image, for Reward Model training, we follow the ChatGPT training method by adding a special `<|reward|>` token at the end of the conversation data and calculating the loss only on the logits output by this token.
For DPO series algorithm training, we mask the tokens in the prompt part and calculate the loss only on the chosen and rejected responses. In the configuration file, we control the dataset type through the `is_reward` and `is_dpo` fields in the dataset. + +### Training with Custom Data + +After understanding the data format of a single sample, you only need to save the converted data in [JSON Lines](https://jsonlines.org/) format. Each line in the file is a sample. + +```json +{"prompt": [{"content": "How can I learn to better regulate my emotions as a grown adult?", "role": "user"}], "chosen": [{"content": "You could do emotional regulation exercises with me, and try to notice and practice ways that you could better control your emotions in real life.", "role": "assistant"}], "rejected": [{"content": "Do you mean how can you learn to be more emotionally mature?", "role": "assistant"}]} +{"prompt": [{"content": "What are some interesting things I can add to a grilled cheese sandwich?", "role": "user"}, {"content": "A tomato, a slice of pickle, some chopped pickles, some chopped green onions. Also, some mayonnaise if you like.", "role": "assistant"}, {"content": "Sounds great. Anything else?", "role": "user"}], "chosen": [{"content": "If you’d like something a little different, I recommend grating some horseradish on top of the sandwich.", "role": "assistant"}], "rejected": [{"content": "Maybe some nice fresh basil?", "role": "assistant"}]} +{"prompt": [{"content": "How do I bob for apples?", "role": "user"}], "chosen": [{"content": "Ah! Do you mean the game of “applesauce,” or do you mean the idea of “to bob for apples”?", "role": "assistant"}], "rejected": [{"content": "Sorry, I don’t know that term.", "role": "assistant"}]} +...... +``` + +After preparing the custom dataset, you need to fill in the path to your saved data in the `data_files` field in the configuration file. You can load multiple JSONL files simultaneously for training. + +```python +####################################################################### +# PART 3 Dataset & Dataloader # +####################################################################### +train_dataset = dict( + type=build_preference_dataset, + dataset=dict( + type=load_jsonl_dataset, + data_files=[ + '/your/jsonl/path/here.jsonl', + '/your/another/jsonl/path/here.jsonl' + ]), +) +``` + +### Training with Open Source Datasets + +Similar to configuring SFT data in XTuner, when using open-source datasets from Hugging Face, you only need to define a mapping function `map_fn` to process the dataset format into XTuner's data format. + +Taking `Intel/orca_dpo_pairs` as an example, this dataset has `system`, `question`, `chosen`, and `rejected` fields, with each field's value in text format instead of the [OpenAI chat message](https://platform.openai.com/docs/api-reference/chat/create) format. 
Therefore, we need to define a mapping function for this dataset: + +```python +def intel_orca_dpo_map_fn(example): + prompt = [{ + 'role': 'system', + 'content': example['system'] + }, { + 'role': 'user', + 'content': example['question'] + }] + chosen = [{'role': 'assistant', 'content': example['chosen']}] + rejected = [{'role': 'assistant', 'content': example['rejected']}] + return {'prompt': prompt, 'chosen': chosen, 'rejected': rejected} +``` + +As shown in the code, `intel_orca_dpo_map_fn` processes the four fields in the original data, converting them into `prompt`, `chosen`, and `rejected` fields, and ensures each field follows the [OpenAI chat message](https://platform.openai.com/docs/api-reference/chat/create) format, maintaining uniformity in subsequent data processing flows. + +After defining the mapping function, you need to import it in the configuration file and configure it in the `dataset_map_fn` field. + +```python +train_dataset = dict( + type=build_preference_dataset, + dataset=dict( + type=load_dataset, + path='Intel/orca_dpo_pairs'), + tokenizer=tokenizer, + max_length=max_length, + dataset_map_fn=intel_orca_dpo_map_fn, +) +``` diff --git a/docs/en/reward_model/quick_start.md b/docs/en/reward_model/quick_start.md new file mode 100644 index 000000000..5c802be2f --- /dev/null +++ b/docs/en/reward_model/quick_start.md @@ -0,0 +1,85 @@ +## Quick Start Guide for Reward Model + +In this section, we will introduce how to use XTuner to train a 1.8B Reward Model, helping you get started quickly. + +### Preparing Pretrained Model Weights + +According to the paper [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155), we use a language model fine-tuned with SFT as the initialization model for the Reward Model. Here, we use [InternLM2-chat-1.8b-sft](https://huggingface.co/internlm/internlm2-chat-1_8b-sft) as the initialization model. + +Set `pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'` in the training configuration file, and the model files will be automatically downloaded when training starts. If you need to download the model weights manually, please refer to the section [Preparing Pretrained Model Weights](https://xtuner.readthedocs.io/zh-cn/latest/preparation/pretrained_model.html), which provides detailed instructions on how to download model weights from Huggingface or Modelscope. Here are the links to the models on HuggingFace and ModelScope: + +- HuggingFace link: https://huggingface.co/internlm/internlm2-chat-1_8b-sft +- ModelScope link: https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b-sft/summary + +### Preparing Training Data + +In this tutorial, we use the [UltraFeedback](https://arxiv.org/abs/2310.01377) dataset as an example. For convenience, we use the preprocessed [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) dataset from Huggingface. + +```python +train_dataset = dict( + type=build_preference_dataset, + dataset=dict( + type=load_dataset, + path='argilla/ultrafeedback-binarized-preferences-cleaned'), + dataset_map_fn=orpo_dpo_mix_40k_map_fn, + is_dpo=False, + is_reward=True, +) +``` + +Using the above configuration in the configuration file will automatically download and process this dataset. If you want to use other open-source datasets from Huggingface or custom datasets, please refer to the [Preference Dataset](./preference_data.md) section. 
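+Before launching a full run, you can optionally verify that the dataset downloads and looks like preference data by loading it directly with the `datasets` library (a quick optional check; the `train` split name is assumed to be the dataset's default on the Hub):
+
+```python
+from datasets import load_dataset
+
+# Optional sanity check: inspect the raw preference data before training.
+ds = load_dataset('argilla/ultrafeedback-binarized-preferences-cleaned', split='train')
+print(ds.column_names)  # expect chosen/rejected-style preference columns
+print(ds[0])            # one raw sample, before XTuner's mapping function
+```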
+ +### Preparing Configuration Files + +XTuner provides several ready-to-use configuration files, which can be viewed using `xtuner list-cfg`. Execute the following command to copy a configuration file to the current directory. + +```bash +xtuner copy-cfg internlm2_chat_1_8b_reward_full_ultrafeedback . +``` + +Open the copied configuration file. If you choose to download the model and dataset automatically, no modifications are needed. If you want to specify paths to your pre-downloaded model and dataset, modify the `pretrained_model_name_or_path` and the `path` parameter in `dataset` under `train_dataset`. + +For more training parameter configurations, please refer to the section [Modifying Reward Training Configuration](./modify_settings.md). + +### Starting the Training + +After completing the above steps, you can start the training task using the following commands. + +```bash +# Single node single GPU +xtuner train ./internlm2_chat_1_8b_reward_full_ultrafeedback_copy.py +# Single node multiple GPUs +NPROC_PER_NODE=${GPU_NUM} xtuner train ./internlm2_chat_1_8b_reward_full_ultrafeedback_copy.py +# Slurm cluster +srun ${SRUN_ARGS} xtuner train ./internlm2_chat_1_8b_reward_full_ultrafeedback_copy.py --launcher slurm +``` + +The correct training log should look like the following (running on a single A800 GPU): + +``` +06/06 16:12:11 - mmengine - INFO - Iter(train) [ 10/15230] lr: 3.9580e-07 eta: 2:59:41 time: 0.7084 data_time: 0.0044 memory: 18021 loss: 0.6270 acc: 0.0000 chosen_score_mean: 0.0000 rejected_score_mean: 0.0000 num_samples: 4.0000 num_tokens: 969.0000 +06/06 16:12:17 - mmengine - INFO - Iter(train) [ 20/15230] lr: 8.3536e-07 eta: 2:45:25 time: 0.5968 data_time: 0.0034 memory: 42180 loss: 0.6270 acc: 0.5000 chosen_score_mean: 0.0013 rejected_score_mean: 0.0010 num_samples: 4.0000 num_tokens: 1405.0000 +06/06 16:12:22 - mmengine - INFO - Iter(train) [ 30/15230] lr: 1.2749e-06 eta: 2:37:18 time: 0.5578 data_time: 0.0024 memory: 32121 loss: 0.6270 acc: 0.7500 chosen_score_mean: 0.0016 rejected_score_mean: 0.0011 num_samples: 4.0000 num_tokens: 932.0000 +06/06 16:12:28 - mmengine - INFO - Iter(train) [ 40/15230] lr: 1.7145e-06 eta: 2:36:05 time: 0.6033 data_time: 0.0025 memory: 42186 loss: 0.6270 acc: 0.7500 chosen_score_mean: 0.0027 rejected_score_mean: 0.0016 num_samples: 4.0000 num_tokens: 994.0000 +06/06 16:12:35 - mmengine - INFO - Iter(train) [ 50/15230] lr: 2.1540e-06 eta: 2:41:03 time: 0.7166 data_time: 0.0027 memory: 42186 loss: 0.6278 acc: 0.5000 chosen_score_mean: 0.0031 rejected_score_mean: 0.0032 num_samples: 4.0000 num_tokens: 2049.0000 +06/06 16:12:40 - mmengine - INFO - Iter(train) [ 60/15230] lr: 2.5936e-06 eta: 2:33:37 time: 0.4627 data_time: 0.0023 memory: 30238 loss: 0.6262 acc: 1.0000 chosen_score_mean: 0.0057 rejected_score_mean: 0.0030 num_samples: 4.0000 num_tokens: 992.0000 +06/06 16:12:46 - mmengine - INFO - Iter(train) [ 70/15230] lr: 3.0331e-06 eta: 2:33:18 time: 0.6018 data_time: 0.0025 memory: 42186 loss: 0.6247 acc: 0.7500 chosen_score_mean: 0.0117 rejected_score_mean: 0.0055 num_samples: 4.0000 num_tokens: 815.0000 +``` + +### Model Conversion + +XTuner provides integrated tools to convert models to HuggingFace format. 
Simply execute the following commands: + +```bash +# Create a directory to store HF format parameters +mkdir work_dirs/internlm2_chat_1_8b_reward_full_ultrafeedback_copy/iter_15230_hf + +# Convert the format +xtuner convert pth_to_hf internlm2_chat_1_8b_reward_full_ultrafeedback_copy.py \ + work_dirs/internlm2_chat_1_8b_reward_full_ultrafeedback_copy/iter_15230.pth \ + work_dirs/internlm2_chat_1_8b_reward_full_ultrafeedback_copy/iter_15230_hf +``` + +This will convert the XTuner checkpoint to the HuggingFace format. + +Note: Since the Reward Model type is not integrated into the official transformers library, only the Reward Models trained with InternLM2 will be converted to the `InternLM2ForRewardModel` type. Other models will default to the `SequenceClassification` type (for example, LLaMa3 will be converted to the `LlamaForSequenceClassification` type). diff --git a/docs/zh_cn/dpo/modify_settings.md b/docs/zh_cn/dpo/modify_settings.md index 7b4672792..2365be25c 100644 --- a/docs/zh_cn/dpo/modify_settings.md +++ b/docs/zh_cn/dpo/modify_settings.md @@ -32,7 +32,7 @@ pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft' ### 训练数据 -在 Reward Model 训练中,你可以通过 `max_length` 来指定单个样本序列的最大 token 数,XTuner 会自动对数据进行截断或是填充。 +在 DPO 训练中,你可以通过 `max_length` 来指定单个样本序列的最大 token 数,XTuner 会自动对数据进行截断或是填充。 ```python # Data diff --git a/docs/zh_cn/dpo/overview.md b/docs/zh_cn/dpo/overview.md index d1bfc4379..d3c3a7aad 100644 --- a/docs/zh_cn/dpo/overview.md +++ b/docs/zh_cn/dpo/overview.md @@ -20,6 +20,8 @@ XTuner 中的 DPO 训练具备以下显著优势: 3. **高效训练**:借助 XTuner 的 QLoRA 训练功能,参考模型能够被转化为移除LoRA适配器的语言模型,从而省去了参考模型权重的显存占用,大幅降低了 DPO 的训练开销。 +4. **长文本训练**: 借助 XTuner 的序列并行功能,能够对长文本数据进行训练。 + ### 开始训练 请参阅[快速上手](./quick_start.md)来了解最基本的概念,若希望了解更多训练参数配置相关的内容,请参考[修改DPO配置](./modify_settings.md)章节。 diff --git a/docs/zh_cn/reward_model/images/sequence_parallel.png b/docs/zh_cn/reward_model/images/sequence_parallel.png new file mode 100644 index 000000000..53f86c81a Binary files /dev/null and b/docs/zh_cn/reward_model/images/sequence_parallel.png differ diff --git a/docs/zh_cn/reward_model/overview.md b/docs/zh_cn/reward_model/overview.md index 84b5ab14b..6c7c976ac 100644 --- a/docs/zh_cn/reward_model/overview.md +++ b/docs/zh_cn/reward_model/overview.md @@ -24,6 +24,20 @@ XTuner 中的 Reward Model 训练具备以下显著优势: 3. **高效训练**:借助 XTuner 的 QLoRA 训练功能,我们能够仅对 Reward Model 的 Value Head 进行全参数训练,而对语言模型本身使用 QLoRA 微调,大幅降低了模型训练的显存开销。 +4.
**长文本训练**: 借助 XTuner 的序列并行功能,能够对长文本数据进行训练。 + +![img](./images/sequence_parallel.png) + ### 开始训练 请参[阅快速上手](./quick_start.md)来了解最基本的概念,若希望了解更多训练参数配置相关的内容,请参考[修改Reward Model配置](./modify_settings.md)章节。 + +### 开源模型 + +我们使用 XTuner 训练了 InternLM2 技术报告中的 Reward Model,欢迎下载使用: + +| Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | RewardBench Score | +| ------------------------- | -------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | +| **InternLM2-1.8B-Reward** | [🤗internlm2-1_8b-reward](https://huggingface.co/internlm/internlm2-1_8b-reward) | [internlm2-1_8b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-1_8b-reward) | 80.6 | +| **InternLM2-7B-Reward** | [🤗internlm2-7b-reward](https://huggingface.co/internlm/internlm2-7b-reward) | [internlm2-7b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-7b-reward) | 86.6 | +| **InternLM2-20B-Reward** | [🤗internlm2-20b-reward](https://huggingface.co/internlm/internlm2-20b-reward) | [internlm2-20b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-20b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-20b-reward) | 89.5 |