
Baseline (Value Function) should account for KL penalty but currently does not #9

Open
Silent-Zebra opened this issue Aug 5, 2023 · 0 comments

Issue:
Because the objective is modified by the KL penalty, the baseline (value function) should take this penalty into account. The current training scheme has the baseline learn the values as if there were no KL penalty.

Possible Effects:
A suboptimal baseline yields higher variance than an optimal one, adding noise that can hamper learning. Fixing this should give better variance reduction, which would hopefully translate into better results. Since the baseline affects only the variance of the gradient estimate, not its bias, even an incorrect baseline can still lead to sensible learning outcomes (just with more noise), which is possibly what is happening with the current code.
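A toy one-step check of the claim above (this is illustrative code, not code from this repo): subtracting a baseline leaves the REINFORCE gradient estimate unbiased, but a poorly chosen baseline gives it higher variance.

```python
import numpy as np

# One-step Bernoulli policy: pi(a=1) = sigmoid(theta) = p.
rng = np.random.default_rng(0)
p = 0.7
a = rng.random(500_000) < p              # sampled actions
reward = np.where(a, 1.0, 0.0)           # r(a=1) = 1, r(a=0) = 0
score = np.where(a, 1.0 - p, -p)         # d/dtheta log pi(a)

def grad(baseline):
    # Per-sample REINFORCE gradient estimates with a constant baseline.
    return (reward - baseline) * score

g_bad, g_good = grad(0.0), grad(0.3)     # 0.3 is the variance-minimizing baseline in this toy setup
print(g_bad.mean(), g_good.mean())       # both ~ p*(1-p) = 0.21: unbiased either way
print(g_bad.var(), g_good.var())         # the bad baseline is much noisier
```

The same logic applies here: a value function trained without the KL term is a valid but noisy baseline for the KL-penalized objective.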

Fix:
Derive a formulation of the reward (including with DiCE) in which the KL divergence enters the reward directly at each time step, rather than being computed separately. This may be as simple as adding the KL term to the per-token reward, but it is worth deriving carefully to be sure of the correct formulation. Implement this, merging the KL divergence into the reward, and use this new reward for learning the value function as well.
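A minimal sketch of the "trivial" version of this fix (function and argument names are my assumptions, not this repo's API): fold a single-sample per-token KL penalty estimate into the reward, then fit the value function to returns of the shaped reward.

```python
import numpy as np

def shaped_rewards(rewards, logprobs, ref_logprobs, kl_coef):
    # Single-sample per-token KL estimate: log pi(a_t) - log pi_ref(a_t).
    # The penalty is subtracted from the environment reward at each step.
    return rewards - kl_coef * (logprobs - ref_logprobs)

def returns_to_go(shaped, gamma=1.0):
    # Discounted returns of the *shaped* reward; using these as the value
    # function's regression targets makes the baseline account for the
    # KL penalty as well.
    out = np.zeros_like(shaped, dtype=float)
    running = 0.0
    for t in reversed(range(len(shaped))):
        running = shaped[t] + gamma * running
        out[t] = running
    return out
```

The same shaped reward would then feed both the (DiCE) policy objective and the value-function regression, so the two stay consistent; whether this simple additive form is exactly right is what the derivation should confirm.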

Timeline:
Uncertain. I am currently quite busy, so I will not work on this right away, and I am unsure how much effort the fix will take. Unlike some other previously discovered bugs, however, the effect here is relatively clear: fixing it should only improve results.
