Hey there! This repository holds some reinforcement learning algorithms that I will be implementing in PyTorch. Some examples of things I will implement:

- Basic Policy Gradient
- Advantage Actor Critic (A2C)
- Proximal Policy Optimization (PPO)
- Deep Deterministic Policy Gradient (DDPG)
- Twin Delayed DDPG (TD3)
- Soft Actor Critic (SAC)
- Batch sizes (right now it's just single-episode online learning, which has high variance)
- Reward normalization

Some of the algorithms (A2C, PPO, SAC) use mujoco-py for the Inverted Pendulum environment. You can check out MuJoCo's website in order to get a license; if you're a student, you should be able to get a free yearly license. PyBullet is a good alternative if you can't get one.
Starting with the basic policy gradient: I tried to keep this implementation simple and combine it with GAE (Generalized Advantage Estimation).
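For reference, here is the GAE recursion this refers to as a standalone sketch (the function and argument names are mine, not necessarily the repo's):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finite episode.

    `values` has one extra entry for the state after the last step
    (use 0.0 there if the episode ended in a terminal state).
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                         # discounted sum of TD errors
        advantages[t] = gae
    return advantages

# Toy usage with made-up numbers: 5 steps of reward 1 and rough value guesses.
print(gae_advantages([1, 1, 1, 1, 1], [4.0, 3.5, 3.0, 2.0, 1.0, 0.0]))
```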
The secret to getting this one working? Realizing that storing things in NumPy arrays gets rid of gradient history - converting a tensor to NumPy detaches it from the autograd graph, so anything the loss needs gradients through has to stay a torch tensor.
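Here's a minimal sketch of that pattern - not the repo's exact code, just a toy network with fake observations - where the per-step log-probs are kept as a list of torch tensors and only stacked at update time, so the graph survives until `backward()`:

```python
import torch
from torch.distributions import Categorical

# Toy policy and optimizer; sizes and hyperparameters are illustrative only.
policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

log_probs = []   # list of torch tensors -> gradient history is preserved
rewards = []     # plain floats are fine, no gradients flow through rewards

for t in range(100):                         # fake 100-step rollout
    obs = torch.randn(4)                     # stand-in for an environment observation
    dist = Categorical(logits=policy(obs))
    action = dist.sample()
    log_probs.append(dist.log_prob(action))  # do NOT call .detach().numpy() here
    rewards.append(1.0)                      # stand-in reward

# Rewards-to-go (no gradients needed here, so plain Python/NumPy is fine).
returns, running = [], 0.0
for r in reversed(rewards):
    running = r + 0.99 * running
    returns.append(running)
returns = torch.tensor(list(reversed(returns)))
returns = (returns - returns.mean()) / (returns.std() + 1e-8)

loss = -(torch.stack(log_probs) * returns).mean()  # stacking keeps the graph intact
optimizer.zero_grad()
loss.backward()
optimizer.step()
```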
TODO: Add an entropy term to the loss (subtract it to encourage exploration)! You can just call .entropy() on the distribution and multiply it by a constant coefficient.
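A minimal, self-contained sketch of that entropy bonus (the 0.01 coefficient and the toy tensors are just illustrative):

```python
import torch
from torch.distributions import Categorical

# Stand-ins for a batch of policy logits, sampled actions, and advantages.
logits = torch.randn(8, 2, requires_grad=True)
dist = Categorical(logits=logits)
actions = dist.sample()
advantages = torch.randn(8)

entropy_coef = 0.01                                   # illustrative constant, not tuned
policy_loss = -(dist.log_prob(actions) * advantages).mean()
entropy_bonus = dist.entropy().mean()                 # higher entropy = more exploration
loss = policy_loss - entropy_coef * entropy_bonus     # subtracting it encourages exploration
loss.backward()
```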
While trying to rewrite the actor-critic part of this, I realized the importance of discount factors in infinite-horizon MDPs: they keep the returns bounded and help the actor-critic converge.
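To make that concrete with a constant +1-per-step reward (just arithmetic, not repo code): without discounting the return grows with the episode length, while with gamma < 1 it is capped by a geometric series, so the critic's targets stay bounded.

```python
# With reward = 1 every step, the discounted return is a geometric series:
#   G_T = sum_{k=0}^{T-1} gamma^k  ->  1 / (1 - gamma)  as T -> infinity
gamma = 0.99
for T in (10, 100, 1000, 10_000):
    G = sum(gamma ** k for k in range(T))
    print(T, round(G, 2))  # 9.56, 63.4, 100.0, 100.0 -> capped at 1 / (1 - 0.99) = 100
```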
Additional notes: There might be a balance issue between the policy loss and the value function loss, especially in low-dimensional action spaces such as the inverted pendulum. Would this be an issue?
More notes: Because of this, the value function often never really converges - even when maxing out the episode limit, the value loss is still ridiculously high. Why is this?
The problem with the value function in an (effectively) infinite-horizon task is that its regression target depends heavily on how much of the episode is left: the target is the reward-to-go, and the inverted pendulum environment hands out a constant +1 reward every step.
TL;DR: The value function essentially just becomes a predictor of the remaining episode length... which conflicts with the fact that you want the episodes to get longer as the policy improves.
The main way the value loss can get low - and typically the easiest: just have short episodes!
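Concretely (a sketch with made-up numbers): with a constant +1 reward, the reward-to-go target at step t is roughly the discounted number of steps remaining, so short episodes hand the critic small, easy targets while long episodes hand it large ones.

```python
# Reward-to-go targets for an episode of length T with reward = 1 every step.
def rewards_to_go(T, gamma=0.99):
    targets, running = [], 0.0
    for _ in range(T):                 # accumulate backwards from the last step
        running = 1.0 + gamma * running
        targets.append(running)
    return list(reversed(targets))     # targets[t] ~ discounted "steps remaining" at step t

short, long_ = rewards_to_go(20), rewards_to_go(1000)
print(short[0], short[-1])             # ~18.2 ... 1.0 -> small targets, easy to fit
print(long_[0], long_[-1])             # ~100 ... 1.0  -> targets track the remaining length
```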
DDPG - new stuff: a replay buffer and a deterministic policy; very DQN-esque.
Note: WIP
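A rough sketch of the replay-buffer half of that (generic, not this repo's exact class or field names):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Minimal FIFO replay buffer; names and capacity are illustrative."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(np.array, zip(*batch)))  # (obs, action, reward, next_obs, done)

    def __len__(self):
        return len(self.buffer)

# Usage: push transitions during rollout, then sample minibatches for the
# DQN-style off-policy update once the buffer has enough data in it.
buffer = ReplayBuffer()
for _ in range(256):
    buffer.push(np.zeros(4), np.zeros(1), 0.0, np.zeros(4), False)
obs, action, reward, next_obs, done = buffer.sample(64)
```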
TD3 - just DDPG with some bells and whistles: twin critics, delayed policy updates, and target policy smoothing.
Note: WIP
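Here's a hedged sketch of how those tricks shape the target Q-value - the networks and hyperparameters below are placeholders, not the repo's actual ones:

```python
import torch

# Placeholder target networks; in a real implementation these are copies of the
# actor/critics that get updated slowly via Polyak averaging.
actor_target = torch.nn.Linear(4, 1)    # state (4) -> action (1)
critic1_target = torch.nn.Linear(5, 1)  # (state, action) -> Q
critic2_target = torch.nn.Linear(5, 1)

gamma, policy_noise, noise_clip, max_action = 0.99, 0.2, 0.5, 1.0
next_obs, reward, done = torch.randn(64, 4), torch.randn(64, 1), torch.zeros(64, 1)

with torch.no_grad():
    # Target policy smoothing: add clipped noise to the target action.
    noise = (torch.randn(64, 1) * policy_noise).clamp(-noise_clip, noise_clip)
    next_action = (actor_target(next_obs) + noise).clamp(-max_action, max_action)

    # Clipped double Q-learning: take the minimum of the twin target critics.
    sa = torch.cat([next_obs, next_action], dim=1)
    target_q = reward + gamma * (1.0 - done) * torch.min(critic1_target(sa), critic2_target(sa))
```

(The third trick, delayed updates, just means updating the actor and the target networks only every couple of critic updates.)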
SAC - emphasis on the entropy term inside the loss, scaled by the temperature alpha (set to a constant in this implementation).
Note: WIP
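A sketch of where that alpha enters the actor loss, with a fixed temperature as described above (the networks are placeholders, and the tanh-squashing log-prob correction of a full SAC actor is omitted for brevity):

```python
import torch
from torch.distributions import Normal

# Placeholder networks; a real SAC actor outputs a tanh-squashed Gaussian and
# real SAC uses twin critics (taking the min), both skipped here to stay short.
actor = torch.nn.Linear(4, 2)   # state (4) -> [mean, log_std] for a 1-D action
critic = torch.nn.Linear(5, 1)  # (state, action) -> Q

alpha = 0.2                     # temperature, held constant in this implementation
obs = torch.randn(64, 4)

mean, log_std = actor(obs).chunk(2, dim=1)
dist = Normal(mean, log_std.exp())
action = dist.rsample()         # reparameterized sample so gradients reach the actor
log_prob = dist.log_prob(action)
q_value = critic(torch.cat([obs, action], dim=1))

# Actor objective: maximize Q plus entropy, i.e. minimize (alpha * log_pi - Q).
actor_loss = (alpha * log_prob - q_value).mean()
actor_loss.backward()
```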