Issue:
Because the objective is modified by the KL penalty, the baseline (value function) should take that penalty into account. In the current training scheme, the baseline simply learns the values as if there were no KL penalty.
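For concreteness (notation assumed here rather than taken from the code): if r_t is the environment reward, beta the penalty coefficient, and KL_t the per-step KL term, the penalized objective effectively optimizes a shaped reward, and the ideal baseline is the value of that shaped reward rather than of r_t alone:

```math
\tilde r_t = r_t - \beta\,\mathrm{KL}_t,
\qquad
V(s_t) \approx \mathbb{E}\!\left[\textstyle\sum_{k\ge 0} \gamma^{k}\,\tilde r_{t+k} \,\middle|\, s_t\right]
```

The current code instead trains the baseline on returns of r_t alone.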
Possible Effects:
If the baseline is suboptimal, the gradient variance will be higher than with an optimal baseline. This extra noise could hamper learning, so fixing the issue should give better variance reduction, which would hopefully translate into better results. Since the baseline affects only the variance and not the expected gradient, even an incorrect baseline can still lead to sensible learning outcomes (just with more noise), which is possibly what is happening with the current code.
Fix:
Derive a formulation of the reward (including with DiCE) such that the KL divergence enters the reward directly at each time step, instead of being computed separately. This may be trivial (just add the term to the reward), but it is worth taking care to be sure of the right formulation. Implement this by merging the KL penalty into the reward, and use this new reward for learning the value function as well; a rough sketch is given below.
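As a rough illustration only, not the repository's actual API: the function names, tensor shapes, and the sample-based KL estimate below are assumptions. The idea is to fold the per-step KL penalty into the reward before computing returns, and then regress the value function onto those shaped returns:

```python
import torch

def kl_shaped_rewards(rewards, logp_new, logp_ref, kl_coef):
    """Fold the per-step KL penalty into the reward.

    All tensors are assumed to have shape [T, B] (time, batch). The KL term is
    approximated here by the sample estimate logp_new - logp_ref at the taken
    actions; substitute the exact per-state KL if the codebase computes it.
    """
    per_step_kl = logp_new - logp_ref
    return rewards - kl_coef * per_step_kl  # shaped reward: r_t - beta * KL_t

def discounted_returns(shaped_rewards, gamma):
    """Discounted returns of the shaped rewards, computed backwards in time."""
    returns = torch.zeros_like(shaped_rewards)
    running = torch.zeros_like(shaped_rewards[0])
    for t in reversed(range(shaped_rewards.shape[0])):
        running = shaped_rewards[t] + gamma * running
        returns[t] = running
    return returns

# The value function (baseline) would then be trained on the shaped returns,
# so it matches the objective the policy gradient actually optimizes, e.g.:
#   targets = discounted_returns(kl_shaped_rewards(r, logp, logp_ref, beta), gamma)
#   value_loss = ((values - targets.detach()) ** 2).mean()
```

The same shaped reward would feed into the DiCE objective, so that the penalty is differentiated per time step rather than as a separately computed term.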
Timeline:
Uncertain. I am currently quite busy, so I will not work on this for now, and I am unsure how much effort the fix will take. However, unlike some other previously discovered bugs, I believe the effect of this one is relatively clear, and fixing it should only lead to better results.