Hey there! This repository holds some reinforcement learning algorithms that I will be implementing in PyTorch. Some examples of things I will implement:

- Basic Policy Gradient
- Advantage Actor Critic (A2C)
- Proximal Policy Optimization (PPO)
- Deep Deterministic Policy Gradient (DDPG)
- Twin Delayed DDPG (TD3)
- Soft Actor Critic (SAC)
- Batch sizes (right now it's just single-episode online learning, which has high variance)
- Reward normalization

Some of the algorithms (A2C, PPO, SAC) use mujoco-py for the Inverted Pendulum environment. You can check out MuJoCo's website in order to get a license; if you're a student, you should be able to get a free yearly license. PyBullet is a good alternative if you can't get one.
Starting with the basic policy gradient: I tried to keep this implementation simple and combine it with GAE (Generalized Advantage Estimation).
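For reference, here is the GAE recursion this refers to as a standalone sketch (the function and argument names are mine, not necessarily the repo's):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finite episode.

    `values` has one extra entry for the state after the last step
    (use 0.0 there if the episode ended in a terminal state).
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                         # discounted sum of TD errors
        advantages[t] = gae
    return advantages

# Toy usage with made-up numbers: 5 steps of reward 1 and rough value guesses.
print(gae_advantages([1, 1, 1, 1, 1], [4.0, 3.5, 3.0, 2.0, 1.0, 0.0]))
```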
The secret to getting this one working? Realizing that storing things in NumPy arrays gets rid of gradient history - converting a tensor to NumPy detaches it from the autograd graph, so anything the loss needs gradients through has to stay a torch tensor.
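Here's a minimal sketch of that pattern - not the repo's exact code, just a toy network with fake observations - where the per-step log-probs are kept as a list of torch tensors and only stacked at update time, so the graph survives until `backward()`:

```python
import torch
from torch.distributions import Categorical

# Toy policy and optimizer; sizes and hyperparameters are illustrative only.
policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

log_probs = []   # list of torch tensors -> gradient history is preserved
rewards = []     # plain floats are fine, no gradients flow through rewards

for t in range(100):                         # fake 100-step rollout
    obs = torch.randn(4)                     # stand-in for an environment observation
    dist = Categorical(logits=policy(obs))
    action = dist.sample()
    log_probs.append(dist.log_prob(action))  # do NOT call .detach().numpy() here
    rewards.append(1.0)                      # stand-in reward

# Rewards-to-go (no gradients needed here, so plain Python/NumPy is fine).
returns, running = [], 0.0
for r in reversed(rewards):
    running = r + 0.99 * running
    returns.append(running)
returns = torch.tensor(list(reversed(returns)))
returns = (returns - returns.mean()) / (returns.std() + 1e-8)

loss = -(torch.stack(log_probs) * returns).mean()  # stacking keeps the graph intact
optimizer.zero_grad()
loss.backward()
optimizer.step()
```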
TODO: Add an entropy term to the loss (subtract it to encourage exploration)! You can just call .entropy() on the distribution and multiply it by a constant coefficient.
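A minimal, self-contained sketch of that entropy bonus (the 0.01 coefficient and the toy tensors are just illustrative):

```python
import torch
from torch.distributions import Categorical

# Stand-ins for a batch of policy logits, sampled actions, and advantages.
logits = torch.randn(8, 2, requires_grad=True)
dist = Categorical(logits=logits)
actions = dist.sample()
advantages = torch.randn(8)

entropy_coef = 0.01                                   # illustrative constant, not tuned
policy_loss = -(dist.log_prob(actions) * advantages).mean()
entropy_bonus = dist.entropy().mean()                 # higher entropy = more exploration
loss = policy_loss - entropy_coef * entropy_bonus     # subtracting it encourages exploration
loss.backward()
```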
While trying to rewrite the actor-critic part of this, I realized the importance of discount factors in infinite-horizon MDPs: they keep the returns bounded and help the actor-critic converge.
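To make that concrete with a constant +1-per-step reward (just arithmetic, not repo code): without discounting the return grows with the episode length, while with gamma < 1 it is capped by a geometric series, so the critic's targets stay bounded.

```python
# With reward = 1 every step, the discounted return is a geometric series:
#   G_T = sum_{k=0}^{T-1} gamma^k  ->  1 / (1 - gamma)  as T -> infinity
gamma = 0.99
for T in (10, 100, 1000, 10_000):
    G = sum(gamma ** k for k in range(T))
    print(T, round(G, 2))  # 9.56, 63.4, 100.0, 100.0 -> capped at 1 / (1 - 0.99) = 100
```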
Additional notes: There might be a balance issue between the policy loss and the value function loss, especially in low-dimensional action spaces such as the inverted pendulum. Would this be an issue?
More notes: Because of this, the value function often never really converges - even when maxing out the episode limit, the value loss is still ridiculously high. Why is this?
The problem with the value function in an (effectively) infinite-horizon task is that its regression target depends heavily on how much of the episode is left: the target is the reward-to-go, and the inverted pendulum environment hands out a constant +1 reward every step.
TL;DR: The value function essentially just becomes a predictor of the remaining episode length... which conflicts with the fact that you want the episodes to get longer as the policy improves.
The main way the value loss can get low - and typically the easiest: just have short episodes!
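Concretely (a sketch with made-up numbers): with a constant +1 reward, the reward-to-go target at step t is roughly the discounted number of steps remaining, so short episodes hand the critic small, easy targets while long episodes hand it large ones.

```python
# Reward-to-go targets for an episode of length T with reward = 1 every step.
def rewards_to_go(T, gamma=0.99):
    targets, running = [], 0.0
    for _ in range(T):                 # accumulate backwards from the last step
        running = 1.0 + gamma * running
        targets.append(running)
    return list(reversed(targets))     # targets[t] ~ discounted "steps remaining" at step t

short, long_ = rewards_to_go(20), rewards_to_go(1000)
print(short[0], short[-1])             # ~18.2 ... 1.0 -> small targets, easy to fit
print(long_[0], long_[-1])             # ~100 ... 1.0  -> targets track the remaining length
```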
DDPG - new stuff: a replay buffer and a deterministic policy; very DQN-esque.
Note: WIP
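A rough sketch of the replay-buffer half of that (generic, not this repo's exact class or field names):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Minimal FIFO replay buffer; names and capacity are illustrative."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(np.array, zip(*batch)))  # (obs, action, reward, next_obs, done)

    def __len__(self):
        return len(self.buffer)

# Usage: push transitions during rollout, then sample minibatches for the
# DQN-style off-policy update once the buffer has enough data in it.
buffer = ReplayBuffer()
for _ in range(256):
    buffer.push(np.zeros(4), np.zeros(1), 0.0, np.zeros(4), False)
obs, action, reward, next_obs, done = buffer.sample(64)
```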
TD3 - just DDPG with some bells and whistles: twin critics, delayed policy updates, and target policy smoothing.
Note: WIP
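Here's a hedged sketch of how those tricks shape the target Q-value - the networks and hyperparameters below are placeholders, not the repo's actual ones:

```python
import torch

# Placeholder target networks; in a real implementation these are copies of the
# actor/critics that get updated slowly via Polyak averaging.
actor_target = torch.nn.Linear(4, 1)    # state (4) -> action (1)
critic1_target = torch.nn.Linear(5, 1)  # (state, action) -> Q
critic2_target = torch.nn.Linear(5, 1)

gamma, policy_noise, noise_clip, max_action = 0.99, 0.2, 0.5, 1.0
next_obs, reward, done = torch.randn(64, 4), torch.randn(64, 1), torch.zeros(64, 1)

with torch.no_grad():
    # Target policy smoothing: add clipped noise to the target action.
    noise = (torch.randn(64, 1) * policy_noise).clamp(-noise_clip, noise_clip)
    next_action = (actor_target(next_obs) + noise).clamp(-max_action, max_action)

    # Clipped double Q-learning: take the minimum of the twin target critics.
    sa = torch.cat([next_obs, next_action], dim=1)
    target_q = reward + gamma * (1.0 - done) * torch.min(critic1_target(sa), critic2_target(sa))
```

(The third trick, delayed updates, just means updating the actor and the target networks only every couple of critic updates.)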
SAC - emphasis on the entropy term inside the loss, scaled by the temperature alpha (set to a constant in this implementation).
Note: WIP
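A sketch of where that alpha enters the actor loss, with a fixed temperature as described above (the networks are placeholders, and the tanh-squashing log-prob correction of a full SAC actor is omitted for brevity):

```python
import torch
from torch.distributions import Normal

# Placeholder networks; a real SAC actor outputs a tanh-squashed Gaussian and
# real SAC uses twin critics (taking the min), both skipped here to stay short.
actor = torch.nn.Linear(4, 2)   # state (4) -> [mean, log_std] for a 1-D action
critic = torch.nn.Linear(5, 1)  # (state, action) -> Q

alpha = 0.2                     # temperature, held constant in this implementation
obs = torch.randn(64, 4)

mean, log_std = actor(obs).chunk(2, dim=1)
dist = Normal(mean, log_std.exp())
action = dist.rsample()         # reparameterized sample so gradients reach the actor
log_prob = dist.log_prob(action)
q_value = critic(torch.cat([obs, action], dim=1))

# Actor objective: maximize Q plus entropy, i.e. minimize (alpha * log_pi - Q).
actor_loss = (alpha * log_prob - q_value).mean()
actor_loss.backward()
```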