[Docs]: update readme and DPO en docs (#853)
* [Docs]: update readme and DPO en docs

* update link
RangiLyu committed Jul 19, 2024
1 parent 16e2f8f commit 3617c98
Showing 14 changed files with 561 additions and 1 deletion.
4 changes: 4 additions & 0 deletions README.md
@@ -39,6 +39,7 @@ English | [简体中文](README_zh-CN.md)

## 🎉 News

- **\[2024/07\]** Support [DPO](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/dpo), [ORPO](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/orpo) and [Reward Model](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/reward_model) training with packed data and sequence parallel! See [documents](https://xtuner.readthedocs.io/en/latest/dpo/overview.html) for more details.
- **\[2024/07\]** Support [InternLM 2.5](xtuner/configs/internlm/internlm2_5_chat_7b/) models!
- **\[2024/06\]** Support [DeepSeek V2](xtuner/configs/deepseek/deepseek_v2_chat/) models! **2x faster!**
- **\[2024/04\]** [LLaVA-Phi-3-mini](https://huggingface.co/xtuner/llava-phi-3-mini-hf) is released! Click [here](xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336) for details!
@@ -144,6 +145,9 @@ XTuner is an efficient, flexible and full-featured toolkit for fine-tuning large
<li><a href="http://arxiv.org/abs/2305.14314">QLoRA</a></li>
<li><a href="http://arxiv.org/abs/2106.09685">LoRA</a></li>
<li>Full parameter fine-tune</li>
<li><a href="https://arxiv.org/abs/2305.18290">DPO</a></li>
<li><a href="https://arxiv.org/abs/2403.07691">ORPO</a></li>
<li>Reward Model</li>
</ul>
</td>
</tr>
4 changes: 4 additions & 0 deletions README_zh-CN.md
@@ -39,6 +39,7 @@

## 🎉 News

- **\[2024/07\]** Support [DPO](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/dpo), [ORPO](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/orpo) and [Reward Model](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/reward_model) training, with support for packed data and sequence parallel! See the [documents](https://xtuner.readthedocs.io/zh-cn/latest/dpo/overview.html) for more details.
- **\[2024/07\]** Support [InternLM 2.5](xtuner/configs/internlm/internlm2_5_chat_7b/) models!
- **\[2024/06\]** Support [DeepSeek V2](xtuner/configs/deepseek/deepseek_v2_chat/) models! **2x faster!**
- **\[2024/04\]** The multimodal model [LLaVA-Phi-3-mini](https://huggingface.co/xtuner/llava-phi-3-mini-hf) is released! See this [document](xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336) for a quick start!
@@ -144,6 +145,9 @@ XTuner is an efficient, flexible and full-featured lightweight toolkit for fine-tuning large models.
<li><a href="http://arxiv.org/abs/2305.14314">QLoRA</a></li>
<li><a href="http://arxiv.org/abs/2106.09685">LoRA</a></li>
<li>Full parameter fine-tune</li>
<li><a href="https://arxiv.org/abs/2305.18290">DPO</a></li>
<li><a href="https://arxiv.org/abs/2403.07691">ORPO</a></li>
<li>Reward Model</li>
</ul>
</td>
</tr>
83 changes: 83 additions & 0 deletions docs/en/dpo/modify_settings.md
@@ -0,0 +1,83 @@
## Modify DPO Training Configuration

This section introduces config parameters related to DPO (Direct Preference Optimization) training. For more details on XTuner config files, please refer to [Modifying Training Configuration](https://xtuner.readthedocs.io/zh-cn/latest/training/modify_settings.html).

### Loss Function

In DPO training, you can choose different types of loss functions according to your needs. XTuner provides various loss function options, such as `sigmoid`, `hinge`, `ipo`, etc. You can select the desired loss function type by setting the `dpo_loss_type` parameter.

Additionally, you can control the temperature coefficient of the loss function with the `loss_beta` parameter, and apply label smoothing with the `label_smoothing` parameter.

```python
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
dpo_loss_type = 'sigmoid' # One of ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'sppo_hard', 'nca_pair', 'robust']
loss_beta = 0.1
label_smoothing = 0.0
```
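As a rough illustration of how the `sigmoid` variant combines these three parameters, here is a minimal PyTorch sketch (an illustration of the standard formulation, not XTuner's actual implementation; the log-probability tensors are hypothetical inputs):

```python
import torch.nn.functional as F

def sigmoid_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps,
                     beta=0.1, label_smoothing=0.0):
    """Sketch of the sigmoid DPO loss: beta scales the implicit reward margin,
    label_smoothing softens the binary preference labels."""
    # Implicit reward margin between the chosen and rejected responses.
    logits = (policy_chosen_logps - policy_rejected_logps) \
        - (ref_chosen_logps - ref_rejected_logps)
    # Label-smoothed binary preference objective.
    loss = -(1 - label_smoothing) * F.logsigmoid(beta * logits) \
        - label_smoothing * F.logsigmoid(-beta * logits)
    return loss.mean()
```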

### Modifying the Model

Users can modify `pretrained_model_name_or_path` to change the pretrained model.

```python
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
```

### Training Data

In DPO training, you can specify the maximum number of tokens for a single sample sequence using the `max_length` parameter. XTuner will automatically truncate or pad the data.

```python
# Data
max_length = 2048
```

In the configuration file, we use the `train_dataset` field to specify the training dataset. You can specify the dataset loading method using the `dataset` field and the dataset mapping function using the `dataset_map_fn` field.

```python
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler

train_dataset = dict(
type=build_preference_dataset,
dataset=dict(type=load_dataset, path='mlabonne/orpo-dpo-mix-40k'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)

train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
```

In the above configuration, we use `load_dataset` to load the `mlabonne/orpo-dpo-mix-40k` dataset from Hugging Face and use `orpo_dpo_mix_40k_map_fn` as the dataset mapping function.

For more information on handling datasets and writing dataset mapping functions, please refer to the [Preference Dataset Section](../reward_model/preference_data.md).
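As a rough illustration of what a dataset mapping function does, the hedged sketch below converts a raw sample into a prompt/chosen/rejected structure; the raw field names (`question`, `good_answer`, `bad_answer`) are hypothetical, and the exact output format expected by `build_preference_dataset` is described in the Preference Dataset section linked above.

```python
def my_preference_map_fn(example):
    """Hypothetical mapping function for a dataset whose samples contain
    'question', 'good_answer' and 'bad_answer' fields (assumed names)."""
    return {
        'prompt': [{'role': 'user', 'content': example['question']}],
        'chosen': [{'role': 'assistant', 'content': example['good_answer']}],
        'rejected': [{'role': 'assistant', 'content': example['bad_answer']}],
    }
```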

### Accelerating Training

When training with preference data, we recommend enabling the [Variable-Length Attention Mechanism](https://xtuner.readthedocs.io/zh-cn/latest/acceleration/varlen_flash_attn.html) to avoid memory waste caused by length differences between the chosen and rejected samples within a single preference pair. You can enable the variable-length attention mechanism by setting `use_varlen_attn=True`.
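As a minimal sketch, the relevant switches live in PART 1 of the config file (the concrete values below are examples, not recommendations):

```python
#######################################################################
#                          PART 1  Settings                           #
#######################################################################
use_varlen_attn = True              # pack chosen/rejected pairs with variable-length attention
sequence_parallel_size = 1          # set >1 to enable sequence parallelism for long sequences
max_length = 2048                   # token limit for a single sample
max_packed_length = max_length * 2  # budget for one packed sequence (example value)
```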

XTuner also supports many training acceleration methods. For details on how to use them, please refer to the [Acceleration Strategies Section](https://xtuner.readthedocs.io/zh-cn/latest/acceleration/hyper_parameters.html).
27 changes: 27 additions & 0 deletions docs/en/dpo/overview.md
@@ -0,0 +1,27 @@
## Introduction to DPO

### Overview

DPO (Direct Preference Optimization) is a method used in large language model training for directly optimizing human preferences. Unlike traditional reinforcement learning methods, DPO directly uses human preference data to optimize the model, thereby improving the quality of generated content to better align with human preferences. DPO also eliminates the need to train a Reward Model and a Critic Model, avoiding the complexity of reinforcement learning algorithms, reducing training overhead, and enhancing training efficiency.
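Formally, given a preference pair $(x, y_w, y_l)$ with chosen response $y_w$ and rejected response $y_l$, the objective from the original DPO paper is

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],
$$

where $\pi_{\mathrm{ref}}$ is the frozen reference model (typically the SFT model) and $\beta$ controls how far the policy may deviate from it.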

Many algorithms have made certain improvements to DPO's loss function. In XTuner, besides DPO, we have also implemented loss functions from papers such as [Identity Preference Optimization (IPO)](https://huggingface.co/papers/2310.12036). To use these algorithms, please refer to the [Modify DPO Settings](./modify_settings.md) section. We also provide some [example configurations](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/dpo) for reference.

In addition to DPO, there are alignment algorithms like [ORPO](https://arxiv.org/abs/2403.07691) that do not require a reference model. ORPO uses the concept of odds ratio to optimize the model by penalizing rejected samples during the training process, thereby adapting more effectively to the chosen samples. ORPO eliminates the dependence on a reference model, making the training process more simplified and efficient. The training method for ORPO in XTuner is very similar to DPO, and we provide some [example configurations](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/orpo). Users can refer to the DPO tutorial to modify the configuration.
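For intuition, the ORPO paper defines the odds of a response as $\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$ and adds an odds-ratio penalty to the standard SFT loss (sketched here from the paper, not from XTuner's code):

$$
\mathcal{L}_{\mathrm{ORPO}} = \mathbb{E}_{(x, y_w, y_l)} \left[ \mathcal{L}_{\mathrm{SFT}} - \lambda \cdot \log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right) \right]
$$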

### Features of DPO Training in XTuner

DPO training in XTuner offers the following significant advantages:

1. **Latest Algorithms**: In addition to standard DPO, XTuner supports improved DPO variants as well as memory-efficient algorithms like ORPO that do not rely on a reference model.

2. **Reducing Memory Waste**: Due to the length differences between chosen and rejected data in preference datasets, padding tokens introduced when concatenating data can cause memory waste. In XTuner, by utilizing the variable-length attention feature of Flash Attention 2, preference pairs are packed into the same sequence during training, significantly reducing memory waste caused by padding tokens. This not only improves memory efficiency but also allows for training larger models or handling more data under the same hardware conditions.

![img](../../zh_cn/reward_model/images/var_len_atten.png)

3. **Efficient Training**: Leveraging XTuner's QLoRA training capabilities, the reference model can be replaced by the policy model with its LoRA adapter removed, eliminating the memory overhead of separate reference model weights and significantly reducing DPO training costs.

4. **Long Text Training**: With XTuner's sequence parallel functionality, long text data can be trained efficiently.

### Getting Started

Refer to the [Quick Start Guide](./quick_start.md) to understand the basic concepts. For more information on configuring training parameters, please see the [Modify DPO Settings](./modify_settings.md) section.
71 changes: 71 additions & 0 deletions docs/en/dpo/quick_start.md
@@ -0,0 +1,71 @@
## Quick Start with DPO

In this section, we will introduce how to use XTuner to train a 1.8B DPO (Direct Preference Optimization) model to help you get started quickly.

### Preparing Pretrained Model Weights

We use [InternLM2-chat-1.8b-sft](https://huggingface.co/internlm/internlm2-chat-1_8b-sft) as the initial model for DPO training to align it with human preferences.

Set `pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'` in the training configuration file, and the model files will be downloaded automatically when training starts. If you need to download the model weights manually, please refer to the section [Preparing Pretrained Model Weights](https://xtuner.readthedocs.io/zh-cn/latest/preparation/pretrained_model.html), which provides detailed instructions on how to download model weights from HuggingFace or ModelScope. Here are the links to the model on HuggingFace and ModelScope:

- HuggingFace link: https://huggingface.co/internlm/internlm2-chat-1_8b-sft
- ModelScope link: https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b-sft/summary
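If you prefer to fetch the weights ahead of time, a minimal sketch using the `huggingface_hub` Python API is shown below (the local directory is just an example path; ModelScope provides an analogous `snapshot_download` API):

```python
from huggingface_hub import snapshot_download

# Download the SFT model weights into a local directory (example path).
local_dir = snapshot_download(
    repo_id='internlm/internlm2-chat-1_8b-sft',
    local_dir='./models/internlm2-chat-1_8b-sft',
)
print(f'Model downloaded to: {local_dir}')
```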

### Preparing Training Data

In this tutorial, we use the [mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k) dataset from Huggingface as an example.

```python
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_dataset,
path='mlabonne/orpo-dpo-mix-40k'),
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=True,
is_reward=False,
)
```

Using the above configuration in the configuration file will automatically download and process this dataset. If you want to use other open-source datasets from Huggingface or custom datasets, please refer to the [Preference Dataset](../reward_model/preference_data.md) section.

### Preparing Configuration File

XTuner provides several ready-to-use configuration files, which can be viewed using `xtuner list-cfg`. Execute the following command to copy a configuration file to the current directory.

```bash
xtuner copy-cfg internlm2_chat_1_8b_dpo_full .
```

Open the copied configuration file. If you choose to download the model and dataset automatically, no modifications are needed. If you want to specify paths to your pre-downloaded model and dataset, modify the `pretrained_model_name_or_path` and the `path` parameter in `dataset` under `train_dataset`.

For more training parameter configurations, please refer to the [Modifying DPO Training Configuration](./modify_settings.md) section.

### Starting the Training

After completing the above steps, you can start the training task using the following commands.

```bash
# Single machine, single GPU
xtuner train ./internlm2_chat_1_8b_dpo_full_copy.py
# Single machine, multiple GPUs
NPROC_PER_NODE=${GPU_NUM} xtuner train ./internlm2_chat_1_8b_dpo_full_copy.py
# Slurm cluster
srun ${SRUN_ARGS} xtuner train ./internlm2_chat_1_8b_dpo_full_copy.py --launcher slurm
```

### Model Conversion

XTuner provides integrated tools to convert models to HuggingFace format. Simply execute the following commands:

```bash
# Create a directory for HuggingFace format parameters
mkdir work_dirs/internlm2_chat_1_8b_dpo_full_copy/iter_15230_hf

# Convert format
xtuner convert pth_to_hf internlm2_chat_1_8b_dpo_full_copy.py \
work_dirs/internlm2_chat_1_8b_dpo_full_copy/iter_15230.pth \
work_dirs/internlm2_chat_1_8b_dpo_full_copy/iter_15230_hf
```

This converts the XTuner checkpoint to the HuggingFace format.
17 changes: 17 additions & 0 deletions docs/en/index.rst
@@ -56,6 +56,23 @@ Documentation
training/open_source_dataset.rst
training/visualization.rst

.. toctree::
:maxdepth: 2
:caption: DPO

dpo/overview.md
dpo/quick_start.md
dpo/modify_settings.md

.. toctree::
:maxdepth: 2
:caption: Reward Model

reward_model/overview.md
reward_model/quick_start.md
reward_model/modify_settings.md
reward_model/preference_data.md

.. toctree::
:maxdepth: 2
:caption: Acceleration
100 changes: 100 additions & 0 deletions docs/en/reward_model/modify_settings.md
@@ -0,0 +1,100 @@
## Modify Reward Model Training Configuration

This section introduces the config related to Reward Model training. For more details on XTuner config files, please refer to [Modify Settings](https://xtuner.readthedocs.io/zh-cn/latest/training/modify_settings.html).

### Loss Function

XTuner uses the [Bradley–Terry Model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) for preference modeling in the Reward Model. You can specify `loss_type="ranking"` to use ranking loss. XTuner also implements the focal loss function proposed in InternLM2, which adjusts the weights of difficult and easy samples to avoid overfitting. You can set `loss_type="focal"` to use this loss function. For a detailed explanation of this loss function, please refer to the [InternLM2 Technical Report](https://arxiv.org/abs/2403.17297).

Additionally, to keep the reward model's output scores stable, we add a constraint term to the loss. You can specify `penalty_type='log_barrier'` or `penalty_type='L2'` to enable a log-barrier or L2 constraint, respectively.

```python
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
loss_type = 'focal' # 'ranking' or 'focal'
penalty_type = 'log_barrier' # 'log_barrier' or 'L2'
```
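To make the two choices concrete, here is an illustrative PyTorch sketch of a Bradley–Terry ranking loss combined with a score-magnitude constraint (a simplified reference only, not XTuner's implementation; the focal variant and the exact log-barrier form follow the InternLM2 report, and the bound and weight below are assumed values):

```python
import torch
import torch.nn.functional as F

def ranking_loss_with_penalty(chosen_scores, rejected_scores,
                              penalty_type='L2', penalty_weight=0.01):
    """Sketch: Bradley-Terry ranking loss plus a constraint on score magnitude."""
    # Ranking term: the chosen response should score higher than the rejected one.
    rank_loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()

    scores = torch.cat([chosen_scores, rejected_scores])
    if penalty_type == 'L2':
        # Quadratic penalty keeps absolute reward scores small.
        penalty = scores.pow(2).mean()
    else:  # 'log_barrier': a soft wall that grows as scores approach +/- bound
        bound = 5.0  # assumed bound for illustration
        ratio = (scores / bound).pow(2).clamp(max=1 - 1e-6)
        penalty = -torch.log(1 - ratio).mean()
    return rank_loss + penalty_weight * penalty
```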

### Modifying the Model

Users can modify `pretrained_model_name_or_path` to change the pretrained model.

Note that XTuner calculates reward scores by appending a special token at the end of the data. Therefore, when switching models with different vocabularies, the ID of this special token also needs to be modified accordingly. We usually use an unused token at the end of the vocabulary as the reward token.

For example, in InternLM2, we use `[UNUSED_TOKEN_130]` as the reward token:

```python
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
reward_token_id = 92527 # use [UNUSED_TOKEN_130] as reward token
```

If you switch to the Llama 3 model, you can use `<|reserved_special_token_0|>` as the reward token:

```python
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
reward_token_id = 128002 # use <|reserved_special_token_0|> as reward token
```
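Conceptually, the reward token acts as a read-out position: it is appended after the conversation, and the sample's score is taken from the model's output at that position. A hedged sketch (the model interface here is hypothetical; XTuner handles this internally):

```python
import torch

def score_conversation(model, tokenizer, conversation_text, reward_token_id):
    """Hypothetical illustration of reading a reward score at the reward token."""
    input_ids = tokenizer(conversation_text, return_tensors='pt').input_ids
    # Append the reward token at the end of the sequence.
    reward_token = torch.tensor([[reward_token_id]], dtype=input_ids.dtype)
    input_ids = torch.cat([input_ids, reward_token], dim=-1)
    # Assume the reward model emits one scalar value per token position.
    values = model(input_ids)          # shape: (1, seq_len, 1) in this sketch
    return values[0, -1].item()        # score read at the reward token position
```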

### Training Data

In Reward Model training, you can specify the maximum number of tokens for a single sample sequence using `max_length`. XTuner will automatically truncate or pad the data.

```python
# Data
max_length = 2048
```

In the configuration file, we use the `train_dataset` field to specify the training dataset. You can specify the dataset loading method using the `dataset` field and the dataset mapping function using the `dataset_map_fn` field.

```python
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler

train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_dataset,
path='argilla/ultrafeedback-binarized-preferences-cleaned'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=False,
is_reward=True,
reward_token_id=reward_token_id,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)

train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
```

In the above configuration, we use `load_dataset` to load the `argilla/ultrafeedback-binarized-preferences-cleaned` dataset from Hugging Face and reuse `orpo_dpo_mix_40k_map_fn` as the dataset mapping function (`orpo_dpo_mix_40k` and `ultrafeedback-binarized-preferences-cleaned` share the same format, so the same mapping function works for both).

For more information on handling datasets and writing dataset mapping functions, please refer to the [Preference Data Section](./preference_data.md).

### Accelerating Training

When training with preference data, we recommend enabling the [Variable-Length Attention Mechanism](https://xtuner.readthedocs.io/zh-cn/latest/acceleration/varlen_flash_attn.html) to avoid memory waste caused by length differences between the chosen and rejected samples within a single preference pair. You can enable the variable-length attention mechanism by setting `use_varlen_attn=True`.

XTuner also supports many training acceleration methods. For details on how to use them, please refer to the [Acceleration Strategies Section](https://xtuner.readthedocs.io/zh-cn/latest/acceleration/hyper_parameters.html).