[Examples] boiler plate code for multi-turn reward for RLHF #2467
Description
This PR addresses: [Feature Request] multi-turn reward for RLHF #2271
This PR implements a reward system for multi-turn reinforcement learning from human feedback (RLHF), following the paper Multi-turn Reinforcement Learning from Preference Human Feedback. The key change is a simulated multi-turn dialogue environment in which simulated human feedback serves as the reward signal that guides policy learning. The policy is trained with policy gradient methods and is updated from the feedback provided at each turn.
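To make the described setup concrete, here is a minimal sketch of a multi-turn loop with per-turn simulated feedback and a policy gradient update. It is not the code from this PR, does not use the TorchRL API, and is not the algorithm from the referenced paper. It is plain REINFORCE with reward-to-go and a running-mean baseline, written with the standard library only. All names (`PREFERRED`, `simulated_feedback`, `run_episode`, `reinforce_update`, `train`) and all constants are hypothetical and chosen for illustration.

```python
import math
import random

# Hypothetical toy setup: a dialogue of NUM_TURNS turns, where the policy
# picks one of NUM_ACTIONS candidate replies at each turn.
NUM_ACTIONS = 3
NUM_TURNS = 4
# Assumption: the simulated human prefers exactly one reply per turn.
PREFERRED = [0, 2, 1, 0]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulated_feedback(turn, action):
    """Stand-in for human feedback: 1.0 for the preferred reply, else 0.0."""
    return 1.0 if action == PREFERRED[turn] else 0.0


def run_episode(logits, rng):
    """Roll out one dialogue and return a list of (turn, action, reward)."""
    trajectory = []
    for turn in range(NUM_TURNS):
        probs = softmax(logits[turn])
        action = rng.choices(range(NUM_ACTIONS), weights=probs)[0]
        trajectory.append((turn, action, simulated_feedback(turn, action)))
    return trajectory


def reinforce_update(logits, baseline, trajectory, lr=0.1, gamma=1.0):
    """One REINFORCE step using reward-to-go minus a per-turn baseline."""
    ret = 0.0
    for turn, action, reward in reversed(trajectory):
        ret = reward + gamma * ret          # reward-to-go from this turn
        advantage = ret - baseline[turn]
        probs = softmax(logits[turn])
        for a in range(NUM_ACTIONS):
            # Gradient of log softmax w.r.t. logit a: 1[a == action] - p(a)
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[turn][a] += lr * advantage * grad
        # Running mean of the return, used to reduce variance.
        baseline[turn] += 0.05 * (ret - baseline[turn])


def train(episodes=3000, seed=0):
    """Train a turn-indexed tabular softmax policy and return its logits."""
    rng = random.Random(seed)
    logits = [[0.0] * NUM_ACTIONS for _ in range(NUM_TURNS)]
    baseline = [0.0] * NUM_TURNS
    for _ in range(episodes):
        reinforce_update(logits, baseline, run_episode(logits, rng))
    return logits


if __name__ == "__main__":
    trained = train()
    for turn in range(NUM_TURNS):
        probs = softmax(trained[turn])
        print(turn, [round(p, 3) for p in probs])
```

The policy here is tabular and indexed only by turn, which keeps the sketch short. A real dialogue policy would condition on the conversation history, and the feedback function would be replaced by a preference or reward model.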
Changes include:
Motivation and Context
This change is necessary to replicate the reward structure proposed in the referenced paper, implementing multi-turn RLHF in a way that closely follows the described methodology. It introduces the simulation of human preferences, which plays a key role in the learning process. This change also resolves issue #2271, which proposed adding this reward mechanism to the project.
Closes [Feature Request] multi-turn reward for RLHF #2271
I have raised an issue to propose this change (required for new features and bug fixes)
Types of changes
What types of changes does your code introduce? Remove all that do not apply:
Checklist
Go over all the following points, and put an x in all the boxes that apply.