ACKTR Continuous (#41)
* Start ACKTR continuous version

* [ci skip] Update hyperparams

* [ci skip] Update hyperparams (add bipedal)

* [ci skip] Update hyperparams

* Add support for hyperparam optimization for ACKTR

* Fix for zip format

* Update hyperparams

* Hyperparam optimization for TD3

* Update benchmark

* Update benchmark

* Update benchmark

* Update travis

* Upgrade docker images

* [ci skip] Update Readme

* Split travis tests

* Fix permission script

* Fix pytest

* Fix test for TD3
araffin authored Sep 29, 2019
1 parent ccf95e3 commit a41e611
Showing 40 changed files with 345 additions and 36 deletions.
21 changes: 19 additions & 2 deletions .travis.yml
@@ -5,11 +5,28 @@ python:
notifications:
  email: false

env:
  global:
    - DOCKER_IMAGE=araffin/rl-baselines-zoo-cpu:v2.8.0

services:
  - docker

install:
-  - docker pull araffin/rl-baselines-zoo-cpu
+  - docker pull ${DOCKER_IMAGE}

script:
-  - docker run -it --rm --network host --ipc=host --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/rl-baselines-zoo-cpu bash -c "cd /root/code/stable-baselines/ && pip install --upgrade git+https://github.com/pfnet/optuna.git && python -m pytest --cov-config .coveragerc --cov-report term --cov=. -v tests/"
+  - ./scripts/run_tests_travis.sh "${TEST_GLOB}"

jobs:
  include:
    # Split test suite to avoid exceeding travis limit
    - stage: Test
      name: "Unit Tests Train"
      env: TEST_GLOB="train.py"

    - name: "Unit Tests Enjoy"
      env: TEST_GLOB="enjoy.py"

    - name: "Unit Tests Hyperparams opt"
      env: TEST_GLOB="hyperparams_opt.py"
24 changes: 13 additions & 11 deletions README.md
@@ -62,14 +62,14 @@ mpirun -n 16 python train.py --algo trpo --env BreakoutNoFrameskip-v4

We use [Optuna](https://optuna.org/) for optimizing the hyperparameters.

- Note: hyperparameters search is only implemented for PPO2/A2C/SAC/TRPO/DDPG for now.
+ Note: hyperparameters search is not implemented for ACER and DQN for now.
When using SuccessiveHalvingPruner ("halving"), you must specify `--n-jobs > 1`.

Budget of 1000 trials with a maximum of 50000 steps:

```
python train.py --algo ppo2 --env MountainCar-v0 -n 50000 -optimize --n-trials 1000 --n-jobs 2 \
-   --sampler random --pruner median
+   --sampler tpe --pruner median
```
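Under the hood, those flags select Optuna's TPE sampler and median pruner. A minimal sketch against the current Optuna API; the objective below is a hypothetical stand-in for the zoo's real train-and-evaluate objective, not its actual code:

```
import optuna

def objective(trial):
    # Hypothetical stand-in: sample hyperparameters, "train" in stages,
    # and report intermediate values so the pruner can stop weak trials.
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1.0, log=True)
    gamma = trial.suggest_categorical('gamma', [0.9, 0.95, 0.98, 0.99])
    mean_reward = 0.0
    for step in range(5):
        mean_reward += learning_rate * gamma  # placeholder for train + evaluate
        trial.report(mean_reward, step)
        if trial.should_prune():  # MedianPruner: drop below-median trials early
            raise optuna.TrialPruned()
    return mean_reward

study = optuna.create_study(direction='maximize',
                            sampler=optuna.samplers.TPESampler(),
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=1000, n_jobs=2)
print(study.best_params)
```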


@@ -116,7 +116,7 @@ Additional Atari Games (to be completed):
|----------|--------------|----------------|------------|--------------|--------------------------|
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| ACER | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | N/A | N/A |
- | ACKTR | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | N/A | N/A |
+ | ACKTR | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DQN | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | N/A | N/A |
| DDPG | N/A | N/A | N/A | :heavy_check_mark: | :heavy_check_mark: |
@@ -129,15 +129,15 @@ Additional Atari Games (to be completed):

| RL Algo | BipedalWalker-v2 | LunarLander-v2 | LunarLanderContinuous-v2 | BipedalWalkerHardcore-v2 | CarRacing-v0 |
|----------|--------------|----------------|------------|--------------|--------------------------|
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | |
| ACER | N/A | :heavy_check_mark: | N/A | N/A | N/A |
- | ACKTR | N/A | :heavy_check_mark: | N/A | N/A | N/A |
+ | ACKTR | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | |
| DQN | N/A | :heavy_check_mark: | N/A | N/A | N/A |
| DDPG | :heavy_check_mark: | N/A | :heavy_check_mark: | | |
| SAC | :heavy_check_mark: | N/A | :heavy_check_mark: | :heavy_check_mark: | |
- | TD3 | | N/A | :heavy_check_mark: | | |
- | TRPO | | :heavy_check_mark: | :heavy_check_mark: | | |
+ | TD3 | :heavy_check_mark: | N/A | :heavy_check_mark: | | |
+ | TRPO | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | |

### PyBullet Environments

@@ -149,6 +149,7 @@ Note: those environments are derived from [Roboschool](https://github.com/openai
| RL Algo | Walker2D | HalfCheetah | Ant | Reacher | Hopper | Humanoid |
|----------|-----------|-------------|-----|---------|---------|----------|
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | :heavy_check_mark: | |
| ACKTR | | :heavy_check_mark: | | | | |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DDPG | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | | |
| SAC | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
@@ -160,6 +161,7 @@ PyBullet Envs (Continued)
| RL Algo | Minitaur | MinitaurDuck | InvertedDoublePendulum | InvertedPendulumSwingup |
|----------|-----------|-------------|-----|---------|
| A2C | | | | |
| ACKTR | | | | |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DDPG | | | | |
| SAC | | | :heavy_check_mark: | :heavy_check_mark: |
@@ -209,11 +211,11 @@ You can train agents online using [colab notebook](https://colab.research.google

### Stable-Baselines PyPi Package

- Min version: stable-baselines >= 2.7.0
+ Min version: stable-baselines[mpi] >= 2.8.0

```
apt-get install swig cmake libopenmpi-dev zlib1g-dev ffmpeg
- pip install stable-baselines box2d box2d-kengz pyyaml pybullet optuna pytablewriter scikit-optimize
+ pip install stable-baselines[mpi] box2d box2d-kengz pyyaml pybullet optuna pytablewriter scikit-optimize
```

Please see [Stable Baselines README](https://github.com/hill-a/stable-baselines) for alternatives.
7 changes: 7 additions & 0 deletions benchmark.md
@@ -34,12 +34,18 @@
|acer |SpaceInvadersNoFrameskip-v4 | 542.556| 172.332| 150374| 133|
|acktr|Acrobot-v1 | -91.284| 32.515| 149959| 1625|
|acktr|BeamRiderNoFrameskip-v4 | 3760.976| 1826.059| 147414| 41|
|acktr|BipedalWalker-v2 | 292.419| 54.373| 149881| 216|
|acktr|BipedalWalkerHardcore-v2 | 44.796| 113.898| 149216| 129|
|acktr|BreakoutNoFrameskip-v4 | 448.514| 88.882| 143118| 37|
|acktr|CartPole-v1 | 487.573| 63.866| 149685| 307|
|acktr|EnduroNoFrameskip-v4 | 0.000| 0.000| 149574| 45|
|acktr|HalfCheetahBulletEnv-v0 | 2535.255| 110.368| 150000| 150|
|acktr|LunarLander-v2 | 96.822| 64.020| 149905| 176|
|acktr|LunarLanderContinuous-v2 | 239.953| 58.406| 149825| 480|
|acktr|MountainCar-v0 | -111.917| 21.422| 149969| 1340|
|acktr|MountainCarContinuous-v0 | 93.779| 0.115| 149993| 2265|
|acktr|MsPacmanNoFrameskip-v4 | 1598.776| 264.338| 149588| 147|
|acktr|Pendulum-v0 | -213.831| 137.857| 150000| 750|
|acktr|PongNoFrameskip-v4 | 19.224| 3.697| 147753| 67|
|acktr|QbertNoFrameskip-v4 | 9569.575| 3980.468| 150896| 106|
|acktr|SeaquestNoFrameskip-v4 | 1672.239| 105.092| 149148| 67|
@@ -104,6 +110,7 @@
|sac |ReacherBulletEnv-v0 | 17.529| 9.860| 150000| 1000|
|sac |Walker2DBulletEnv-v0 | 2052.646| 13.631| 150000| 150|
|td3 |AntBulletEnv-v0 | 3269.021| 60.697| 150000| 150|
|td3 |BipedalWalker-v2 | 308.793| 23.750| 149713| 228|
|td3 |HalfCheetahBulletEnv-v0 | 3160.318| 15.284| 150000| 150|
|td3 |HopperBulletEnv-v0 | 2743.910| 20.159| 150000| 150|
|td3 |HumanoidBulletEnv-v0 | 1638.081| 801.594| 149453| 182|
2 changes: 1 addition & 1 deletion docker/Dockerfile.cpu
@@ -22,7 +22,7 @@ RUN \
    pip install pytest-cov && \
    pip install pyyaml && \
    pip install box2d-py==2.3.5 && \
-    pip install stable-baselines && \
+    pip install stable-baselines[mpi]==2.8.0 && \
    pip install pybullet && \
    pip install gym-minigrid && \
    pip install scikit-optimize && \
2 changes: 1 addition & 1 deletion docker/Dockerfile.gpu
@@ -22,7 +22,7 @@ RUN \
    pip install pyyaml && \
    pip install box2d-py==2.3.5 && \
    pip install tensorflow-gpu==1.8.0 && \
-    pip install stable-baselines && \
+    pip install stable-baselines[mpi]==2.8.0 && \
    pip install pybullet && \
    pip install gym-minigrid && \
    pip install scikit-optimize && \
12 changes: 10 additions & 2 deletions enjoy.py
@@ -73,10 +73,18 @@ def main():
    else:
        log_path = os.path.join(folder, algo)

-    model_path = "{}/{}.pkl".format(log_path, env_id)
-
    assert os.path.isdir(log_path), "The {} folder was not found".format(log_path)
-    assert os.path.isfile(model_path), "No model found for {} on {}, path: {}".format(algo, env_id, model_path)
+
+    found = False
+    for ext in ['pkl', 'zip']:
+        model_path = "{}/{}.{}".format(log_path, env_id, ext)
+        found = os.path.isfile(model_path)
+        if found:
+            break
+
+    if not found:
+        raise ValueError("No model found for {} on {}, path: {}".format(algo, env_id, model_path))

    if algo in ['dqn', 'ddpg', 'sac', 'td3']:
        args.n_envs = 1
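The two-extension lookup above tracks stable-baselines' move from cloudpickle `.pkl` saves to the newer `.zip` archive format. Whichever file is found can be passed straight to the matching algorithm's `load`. A minimal sketch, ignoring the `VecNormalize` statistics that the zoo stores alongside normalized agents:

```
import gym
from stable_baselines import ACKTR

# Works for the '.zip' added by this commit; older '.pkl' saves load the
# same way, since stable-baselines detects the format from the file itself.
model = ACKTR.load("trained_agents/acktr/BipedalWalker-v2.zip")

env = gym.make("BipedalWalker-v2")
obs = env.reset()
action, _states = model.predict(obs, deterministic=True)
```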
102 changes: 102 additions & 0 deletions hyperparams/acktr.yml
@@ -32,3 +32,105 @@ Acrobot-v1:
  n_timesteps: !!float 5e5
  policy: 'MlpPolicy'
  ent_coef: 0.0

Pendulum-v0:
  n_envs: 4
  n_timesteps: !!float 2e6
  policy: 'MlpPolicy'
  ent_coef: 0.0
  gamma: 0.99
  n_steps: 16
  learning_rate: 0.06
  lr_schedule: 'constant'

LunarLanderContinuous-v2:
  normalize: true
  n_envs: 8
  n_timesteps: !!float 5e6
  policy: 'MlpPolicy'
  gamma: 0.99
  n_steps: 16
  ent_coef: 0.0
  learning_rate: 0.06
  lr_schedule: 'constant'

MountainCarContinuous-v0:
  normalize: true
  n_envs: 16
  n_timesteps: !!float 3e5
  policy: 'MlpPolicy'
  ent_coef: 0.0

# Tuned
HalfCheetahBulletEnv-v0:
  env_wrapper: utils.wrappers.TimeFeatureWrapper
  normalize: True
  n_envs: 1
  n_timesteps: !!float 2e6
  policy: 'MlpPolicy'
  ent_coef: 0.0
  lr_schedule: 'constant'
  learning_rate: 0.0217
  n_steps: 128
  nprocs: 4
  max_grad_norm: 0.5
  gamma: 0.98
  vf_coef: 0.946

# TO BE tuned
Walker2DBulletEnv-v0:
  env_wrapper: utils.wrappers.TimeFeatureWrapper
  normalize: True
  n_envs: 1
  n_timesteps: !!float 2e6
  policy: 'MlpPolicy'
  ent_coef: 0.0
  # lr_schedule: 'constant'
  # learning_rate: 0.0217
  n_steps: 128
  nprocs: 4
  gamma: 0.99
  vf_coef: 0.946

HalfCheetah-v2:
  env_wrapper: utils.wrappers.TimeFeatureWrapper
  normalize: True
  n_envs: 1
  n_timesteps: !!float 1e6
  policy: 'MlpPolicy'
  ent_coef: 0.0
  lr_schedule: 'constant'
  learning_rate: 0.2
  n_steps: 2048
  nprocs: 4
  max_grad_norm: 10
  gamma: 0.99
  vf_coef: 0.5
  policy_kwargs: "dict(net_arch=[256, 256])"

# Tuned
BipedalWalkerHardcore-v2:
  normalize: true
  n_envs: 8
  n_timesteps: !!float 10e7
  policy: 'MlpPolicy'
  ent_coef: 0.000125
  lr_schedule: 'constant'
  learning_rate: 0.0675
  n_steps: 16
  gamma: 0.9999
  vf_coef: 0.51

# Tuned
BipedalWalker-v2:
  normalize: true
  n_envs: 8
  n_timesteps: !!float 5e6
  policy: 'MlpPolicy'
  ent_coef: 0.0
  lr_schedule: 'constant'
  learning_rate: 0.298
  n_steps: 32
  gamma: 0.98
  vf_coef: 0.38
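For context, the tuned `BipedalWalker-v2` entry above corresponds roughly to the ACKTR construction below; this is a hedged sketch, since the zoo's `train.py` is what actually builds the vectorized env and forwards the remaining keys as keyword arguments. (The Bullet entries additionally apply `utils.wrappers.TimeFeatureWrapper`, which augments each observation with a feature for the remaining episode time.)

```
import gym
from stable_baselines import ACKTR
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

# n_envs: 8 and normalize: true from the YAML entry
env = VecNormalize(DummyVecEnv([lambda: gym.make("BipedalWalker-v2")] * 8))

# The remaining keys become ACKTR keyword arguments
model = ACKTR("MlpPolicy", env, ent_coef=0.0, lr_schedule="constant",
              learning_rate=0.298, n_steps=32, gamma=0.98, vf_coef=0.38)
model.learn(total_timesteps=int(5e6))  # n_timesteps: !!float 5e6
```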
16 changes: 15 additions & 1 deletion hyperparams/td3.yml
@@ -49,6 +49,20 @@ HalfCheetahBulletEnv-v0:
  gradient_steps: 1000
  policy_kwargs: "dict(layers=[400, 300])"

BipedalWalker-v2:
  n_timesteps: !!float 2e6
  policy: 'MlpPolicy'
  gamma: 0.99
  buffer_size: 1000000
  noise_type: 'normal'
  noise_std: 0.1
  learning_starts: 10000
  batch_size: 100
  learning_rate: !!float 1e-3
  train_freq: 1000
  gradient_steps: 1000
  policy_kwargs: "dict(layers=[400, 300])"

# To be tuned
BipedalWalkerHardcore-v2:
n_timesteps: !!float 5e7
@@ -59,7 +73,7 @@ BipedalWalkerHardcore-v2:
  noise_std: 0.2
  learning_starts: 10000
  batch_size: 100
-  learning_rate: 1e-3
+  learning_rate: !!float 1e-3
  train_freq: 1000
  gradient_steps: 1000
  policy_kwargs: "dict(layers=[400, 300])"
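Two details in these entries are worth spelling out. First, the `!!float` tag in the fix above is not cosmetic: PyYAML resolves a bare `1e-3` (no decimal point) as the string `'1e-3'`, so the tag (or writing `0.001`) is needed to get an actual float. Second, `noise_type: 'normal'` with `noise_std: 0.1` denotes Gaussian exploration noise. A hedged sketch of the equivalent direct construction, assuming stable-baselines >= 2.8:

```
import gym
import numpy as np
from stable_baselines import TD3
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make("BipedalWalker-v2")
n_actions = env.action_space.shape[0]

# noise_type: 'normal' / noise_std: 0.1 from the YAML entry
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))

model = TD3("MlpPolicy", env, gamma=0.99, buffer_size=1000000,
            learning_starts=10000, batch_size=100, learning_rate=1e-3,
            train_freq=1000, gradient_steps=1000, action_noise=action_noise,
            policy_kwargs=dict(layers=[400, 300]))
model.learn(total_timesteps=int(2e6))
```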
2 changes: 1 addition & 1 deletion run_docker_cpu.sh
@@ -8,5 +8,5 @@ echo $cmd_line


docker run -it --rm --network host --ipc=host \
-  --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/rl-baselines-zoo-cpu\
+  --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/rl-baselines-zoo-cpu:v2.8.0\
  bash -c "cd /root/code/stable-baselines/ && $cmd_line"
2 changes: 1 addition & 1 deletion run_docker_gpu.sh
100644 → 100755
@@ -8,5 +8,5 @@ echo $cmd_line


docker run -it --runtime=nvidia --rm --network host --ipc=host \
-  --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/rl-baselines-zoo\
+  --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/rl-baselines-zoo:v2.8.0\
  bash -c "cd /root/code/stable-baselines/ && $cmd_line"
23 changes: 23 additions & 0 deletions scripts/run_tests_travis.sh
@@ -0,0 +1,23 @@
#!/usr/bin/env bash

DOCKER_CMD="docker run -it --rm --network host --ipc=host --mount src=$(pwd),target=/root/code/stable-baselines,type=bind"
BASH_CMD="cd /root/code/stable-baselines/"

if [[ $# -ne 1 ]]; then
  echo "usage: $0 <test glob>"
  exit 1
fi

if [[ ${DOCKER_IMAGE} = "" ]]; then
  echo "Need DOCKER_IMAGE environment variable to be set."
  exit 1
fi

TEST_GLOB=$1

set -e # exit immediately on any error


${DOCKER_CMD} ${DOCKER_IMAGE} \
bash -c "${BASH_CMD} && \
python -m pytest --cov-config .coveragerc --cov-report term --cov=. -v tests/test_${TEST_GLOB}"
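For a local run of one shard, the same script can be invoked directly, e.g. `DOCKER_IMAGE=araffin/rl-baselines-zoo-cpu:v2.8.0 ./scripts/run_tests_travis.sh train.py`, which executes only `tests/test_train.py` inside the container.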
6 changes: 4 additions & 2 deletions tests/test_hyperparams_opt.py
@@ -13,9 +13,9 @@ def _assert_eq(left, right):
N_TRIALS = 2
N_JOBS = 1

- ALGOS = ('ppo2', 'a2c', 'trpo')
+ ALGOS = ('ppo2', 'a2c', 'trpo', 'acktr')
# Not yet supported:
- # ALGOS = ('acer', 'acktr', 'dqn')
+ # ALGOS = ('acer', 'dqn')
ENV_IDS = ('CartPole-v1',)
LOG_FOLDER = 'logs/tests_optimize/'

@@ -29,6 +29,8 @@ def _assert_eq(left, right):
experiments['ddpg-MountainCarContinuous-v0'] = ('ddpg', 'MountainCarContinuous-v0')
# Test for SAC
experiments['sac-Pendulum-v0'] = ('sac', 'Pendulum-v0')
# Test for TD3
experiments['td3-Pendulum-v0'] = ('td3', 'Pendulum-v0')

# Clean up
if os.path.isdir(LOG_FOLDER):
4 changes: 2 additions & 2 deletions train.py
@@ -51,9 +51,9 @@
                    help='Run hyperparameters search')
parser.add_argument('--n-jobs', help='Number of parallel jobs when optimizing hyperparameters', type=int, default=1)
parser.add_argument('--sampler', help='Sampler to use when optimizing hyperparameters', type=str,
-                    default='skopt', choices=['random', 'tpe', 'skopt'])
+                    default='tpe', choices=['random', 'tpe', 'skopt'])
parser.add_argument('--pruner', help='Pruner to use when optimizing hyperparameters', type=str,
-                    default='none', choices=['halving', 'median', 'none'])
+                    default='median', choices=['halving', 'median', 'none'])
parser.add_argument('--verbose', help='Verbose mode (0: no output, 1: INFO)', default=1,
                    type=int)
parser.add_argument('--gym-packages', type=str, nargs='+', default=[], help='Additional external Gym environment package modules to import (e.g. gym_minigrid)')
Binary file added trained_agents/acktr/BipedalWalker-v2.zip
11 changes: 11 additions & 0 deletions trained_agents/acktr/BipedalWalker-v2/config.yml
@@ -0,0 +1,11 @@
!!python/object/apply:collections.OrderedDict
- - [ent_coef, 0.0]
  - [gamma, 0.98]
  - [learning_rate, 0.298]
  - [lr_schedule, constant]
  - [n_envs, 8]
  - [n_steps, 32]
  - [n_timesteps, 5000000.0]
  - [normalize, true]
  - [policy, MlpPolicy]
  - [vf_coef, 0.38]
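The zoo stores a `config.yml` like this next to every trained agent so the exact hyperparameters can be recovered later. Because the `!!python/object/apply` tag constructs a real Python object, it must be read with an unsafe loader rather than `yaml.safe_load`. A minimal sketch, assuming PyYAML >= 5.1 for `UnsafeLoader`:

```
import yaml

# '!!python/object/apply:collections.OrderedDict' builds a Python object,
# which yaml.safe_load refuses; UnsafeLoader allows it for trusted files.
with open("trained_agents/acktr/BipedalWalker-v2/config.yml") as f:
    hyperparams = yaml.load(f, Loader=yaml.UnsafeLoader)

print(hyperparams["learning_rate"])  # 0.298
```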
Binary file added trained_agents/acktr/BipedalWalker-v2/obs_rms.pkl
Binary file added trained_agents/acktr/BipedalWalker-v2/ret_rms.pkl
Binary file added trained_agents/acktr/BipedalWalkerHardcore-v2.zip
