v1.5.0
Version 1.5
Added the DPO (Direct Preference Optimization) method. DPO achieves precise control over a language model's behavior by optimizing the model directly, so it can learn human preferences effectively without the complexity of reinforcement learning. Compared with RLHF, DPO is simpler to implement, easier to train, and tends to give better results.
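As a rough illustration of the idea (a minimal sketch, not the repository's actual implementation), the DPO loss compares the policy's log-probabilities for chosen vs. rejected responses against a frozen reference model; the function name, inputs, and the beta value below are assumptions for this example:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective: widen the margin between the policy's
    preference for chosen vs. rejected responses, measured relative to a
    frozen reference model, with no reward model or RL loop."""
    # Implicit rewards: log-ratio of policy vs. reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary preference (logistic) loss on the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```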
Provided a pipeline notebook that chains the full PT + SFT + DPO training stages end to end: run_training_dpo_pipeline.ipynb ; its corresponding Colab: ; a complete run takes about 15 minutes; a copy of my successful Colab run:
What's Changed
- Update rl_training.py by @dividez in #159
- Update pretraining.py by @anwuzhiab in #167
- Dpo by @shibing624 in #180
- update dpo pynb by @shibing624 in #181
New Contributors
- @dividez made their first contribution in #159
- @anwuzhiab made their first contribution in #167
Full Changelog: 1.4.0...1.5.0