ACKTR Continuous (#41)
* Start ACKTR continuous version

* [ci skip] Update hyperparams

* [ci skip] Update hyperparams (add bipedal)

* [ci skip] Update hyperparams

* Add support for hyperparam optimization for ACKTR

* Fix for zip format

* Update hyperparams

* Hyperparam optimization for TD3

* Update benchmark

* Update benchmark

* Update benchmark

* Update travis

* Upgrade docker images

* [ci skip] Update Readme

* Split travis tests

* Fix permission script

* Fix pytest

* Fix test for TD3
araffin authored Sep 29, 2019
1 parent ccf95e3 commit a41e611
Showing 40 changed files with 345 additions and 36 deletions.
21 changes: 19 additions & 2 deletions .travis.yml
@@ -5,11 +5,28 @@ python:
notifications:
  email: false

env:
  global:
    - DOCKER_IMAGE=araffin/rl-baselines-zoo-cpu:v2.8.0

services:
  - docker

install:
-  - docker pull araffin/rl-baselines-zoo-cpu
+  - docker pull ${DOCKER_IMAGE}

script:
-  - docker run -it --rm --network host --ipc=host --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/rl-baselines-zoo-cpu bash -c "cd /root/code/stable-baselines/ && pip install --upgrade git+https://github.com/pfnet/optuna.git && python -m pytest --cov-config .coveragerc --cov-report term --cov=. -v tests/"
+  - ./scripts/run_tests_travis.sh "${TEST_GLOB}"

jobs:
  include:
    # Split test suite to avoid exceeding travis limit
    - stage: Test
      name: "Unit Tests Train"
      env: TEST_GLOB="train.py"

    - name: "Unit Tests Enjoy"
      env: TEST_GLOB="enjoy.py"

    - name: "Unit Tests Hyperparams opt"
      env: TEST_GLOB="hyperparams_opt.py"
24 changes: 13 additions & 11 deletions README.md
@@ -62,14 +62,14 @@ mpirun -n 16 python train.py --algo trpo --env BreakoutNoFrameskip-v4

We use [Optuna](https://optuna.org/) for optimizing the hyperparameters.

- Note: hyperparameters search is only implemented for PPO2/A2C/SAC/TRPO/DDPG for now.
+ Note: hyperparameters search is not implemented for ACER and DQN for now.
When using SuccessiveHalvingPruner ("halving"), you must specify `--n-jobs > 1`.

Budget of 1000 trials with a maximum of 50000 steps:

```
python train.py --algo ppo2 --env MountainCar-v0 -n 50000 -optimize --n-trials 1000 --n-jobs 2 \
-   --sampler random --pruner median
+   --sampler tpe --pruner median
```
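Under the hood, those flags select Optuna's TPE sampler and median pruner. A minimal sketch against the current Optuna API; the objective below is a hypothetical stand-in for the zoo's real train-and-evaluate objective, not its actual code:

```
import optuna

def objective(trial):
    # Hypothetical stand-in: sample hyperparameters, "train" in stages,
    # and report intermediate values so the pruner can stop weak trials.
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1.0, log=True)
    gamma = trial.suggest_categorical('gamma', [0.9, 0.95, 0.98, 0.99])
    mean_reward = 0.0
    for step in range(5):
        mean_reward += learning_rate * gamma  # placeholder for train + evaluate
        trial.report(mean_reward, step)
        if trial.should_prune():  # MedianPruner: drop below-median trials early
            raise optuna.TrialPruned()
    return mean_reward

study = optuna.create_study(direction='maximize',
                            sampler=optuna.samplers.TPESampler(),
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=1000, n_jobs=2)
print(study.best_params)
```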


@@ -116,7 +116,7 @@ Additional Atari Games (to be completed):
|----------|--------------|----------------|------------|--------------|--------------------------|
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| ACER | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | N/A | N/A |
- | ACKTR | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | N/A | N/A |
+ | ACKTR | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DQN | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | N/A | N/A |
| DDPG | N/A | N/A | N/A | :heavy_check_mark: | :heavy_check_mark: |
@@ -129,15 +129,15 @@ Additional Atari Games (to be completed):

| RL Algo | BipedalWalker-v2 | LunarLander-v2 | LunarLanderContinuous-v2 | BipedalWalkerHardcore-v2 | CarRacing-v0 |
|----------|--------------|----------------|------------|--------------|--------------------------|
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | |
| ACER | N/A | :heavy_check_mark: | N/A | N/A | N/A |
- | ACKTR | N/A | :heavy_check_mark: | N/A | N/A | N/A |
+ | ACKTR | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | |
| DQN | N/A | :heavy_check_mark: | N/A | N/A | N/A |
| DDPG | :heavy_check_mark: | N/A | :heavy_check_mark: | | |
| SAC | :heavy_check_mark: | N/A | :heavy_check_mark: | :heavy_check_mark: | |
- | TD3 | | N/A | :heavy_check_mark: | | |
- | TRPO | | :heavy_check_mark: | :heavy_check_mark: | | |
+ | TD3 | :heavy_check_mark: | N/A | :heavy_check_mark: | | |
+ | TRPO | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | |

### PyBullet Environments

@@ -149,6 +149,7 @@ Note: those environments are derived from [Roboschool](https://github.com/openai
| RL Algo | Walker2D | HalfCheetah | Ant | Reacher | Hopper | Humanoid |
|----------|-----------|-------------|-----|---------|---------|----------|
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | :heavy_check_mark: | |
| ACKTR | | :heavy_check_mark: | | | | |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DDPG | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | | |
| SAC | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
@@ -160,6 +161,7 @@ PyBullet Envs (Continued)
| RL Algo | Minitaur | MinitaurDuck | InvertedDoublePendulum | InvertedPendulumSwingup |
|----------|-----------|-------------|-----|---------|
| A2C | | | | |
| ACKTR | | | | |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DDPG | | | | |
| SAC | | | :heavy_check_mark: | :heavy_check_mark: |
@@ -209,11 +211,11 @@ You can train agents online using [colab notebook](https://colab.research.google

### Stable-Baselines PyPi Package

- Min version: stable-baselines >= 2.7.0
+ Min version: stable-baselines[mpi] >= 2.8.0

```
apt-get install swig cmake libopenmpi-dev zlib1g-dev ffmpeg
- pip install stable-baselines box2d box2d-kengz pyyaml pybullet optuna pytablewriter scikit-optimize
+ pip install stable-baselines[mpi] box2d box2d-kengz pyyaml pybullet optuna pytablewriter scikit-optimize
```

Please see [Stable Baselines README](https://github.com/hill-a/stable-baselines) for alternatives.
7 changes: 7 additions & 0 deletions benchmark.md
@@ -34,12 +34,18 @@
|acer |SpaceInvadersNoFrameskip-v4 | 542.556| 172.332| 150374| 133|
|acktr|Acrobot-v1 | -91.284| 32.515| 149959| 1625|
|acktr|BeamRiderNoFrameskip-v4 | 3760.976| 1826.059| 147414| 41|
|acktr|BipedalWalker-v2 | 292.419| 54.373| 149881| 216|
|acktr|BipedalWalkerHardcore-v2 | 44.796| 113.898| 149216| 129|
|acktr|BreakoutNoFrameskip-v4 | 448.514| 88.882| 143118| 37|
|acktr|CartPole-v1 | 487.573| 63.866| 149685| 307|
|acktr|EnduroNoFrameskip-v4 | 0.000| 0.000| 149574| 45|
|acktr|HalfCheetahBulletEnv-v0 | 2535.255| 110.368| 150000| 150|
|acktr|LunarLander-v2 | 96.822| 64.020| 149905| 176|
|acktr|LunarLanderContinuous-v2 | 239.953| 58.406| 149825| 480|
|acktr|MountainCar-v0 | -111.917| 21.422| 149969| 1340|
|acktr|MountainCarContinuous-v0 | 93.779| 0.115| 149993| 2265|
|acktr|MsPacmanNoFrameskip-v4 | 1598.776| 264.338| 149588| 147|
|acktr|Pendulum-v0 | -213.831| 137.857| 150000| 750|
|acktr|PongNoFrameskip-v4 | 19.224| 3.697| 147753| 67|
|acktr|QbertNoFrameskip-v4 | 9569.575| 3980.468| 150896| 106|
|acktr|SeaquestNoFrameskip-v4 | 1672.239| 105.092| 149148| 67|
@@ -104,6 +110,7 @@
|sac |ReacherBulletEnv-v0 | 17.529| 9.860| 150000| 1000|
|sac |Walker2DBulletEnv-v0 | 2052.646| 13.631| 150000| 150|
|td3 |AntBulletEnv-v0 | 3269.021| 60.697| 150000| 150|
|td3 |BipedalWalker-v2 | 308.793| 23.750| 149713| 228|
|td3 |HalfCheetahBulletEnv-v0 | 3160.318| 15.284| 150000| 150|
|td3 |HopperBulletEnv-v0 | 2743.910| 20.159| 150000| 150|
|td3 |HumanoidBulletEnv-v0 | 1638.081| 801.594| 149453| 182|
2 changes: 1 addition & 1 deletion docker/Dockerfile.cpu
@@ -22,7 +22,7 @@ RUN \
    pip install pytest-cov && \
    pip install pyyaml && \
    pip install box2d-py==2.3.5 && \
-    pip install stable-baselines && \
+    pip install stable-baselines[mpi]==2.8.0 && \
    pip install pybullet && \
    pip install gym-minigrid && \
    pip install scikit-optimize && \
2 changes: 1 addition & 1 deletion docker/Dockerfile.gpu
@@ -22,7 +22,7 @@ RUN \
    pip install pyyaml && \
    pip install box2d-py==2.3.5 && \
    pip install tensorflow-gpu==1.8.0 && \
-    pip install stable-baselines && \
+    pip install stable-baselines[mpi]==2.8.0 && \
    pip install pybullet && \
    pip install gym-minigrid && \
    pip install scikit-optimize && \
12 changes: 10 additions & 2 deletions enjoy.py
@@ -73,10 +73,18 @@ def main():
    else:
        log_path = os.path.join(folder, algo)

-    model_path = "{}/{}.pkl".format(log_path, env_id)
-
    assert os.path.isdir(log_path), "The {} folder was not found".format(log_path)
-    assert os.path.isfile(model_path), "No model found for {} on {}, path: {}".format(algo, env_id, model_path)
+
+    found = False
+    for ext in ['pkl', 'zip']:
+        model_path = "{}/{}.{}".format(log_path, env_id, ext)
+        found = os.path.isfile(model_path)
+        if found:
+            break
+
+    if not found:
+        raise ValueError("No model found for {} on {}, path: {}".format(algo, env_id, model_path))

    if algo in ['dqn', 'ddpg', 'sac', 'td3']:
        args.n_envs = 1
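The two-extension lookup above tracks stable-baselines' move from cloudpickle `.pkl` saves to the newer `.zip` archive format. Whichever file is found can be passed straight to the matching algorithm's `load`. A minimal sketch, ignoring the `VecNormalize` statistics that the zoo stores alongside normalized agents:

```
import gym
from stable_baselines import ACKTR

# Works for the '.zip' added by this commit; older '.pkl' saves load the
# same way, since stable-baselines detects the format from the file itself.
model = ACKTR.load("trained_agents/acktr/BipedalWalker-v2.zip")

env = gym.make("BipedalWalker-v2")
obs = env.reset()
action, _states = model.predict(obs, deterministic=True)
```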
102 changes: 102 additions & 0 deletions hyperparams/acktr.yml
@@ -32,3 +32,105 @@ Acrobot-v1:
  n_timesteps: !!float 5e5
  policy: 'MlpPolicy'
  ent_coef: 0.0

Pendulum-v0:
  n_envs: 4
  n_timesteps: !!float 2e6
  policy: 'MlpPolicy'
  ent_coef: 0.0
  gamma: 0.99
  n_steps: 16
  learning_rate: 0.06
  lr_schedule: 'constant'

LunarLanderContinuous-v2:
  normalize: true
  n_envs: 8
  n_timesteps: !!float 5e6
  policy: 'MlpPolicy'
  gamma: 0.99
  n_steps: 16
  ent_coef: 0.0
  learning_rate: 0.06
  lr_schedule: 'constant'

MountainCarContinuous-v0:
  normalize: true
  n_envs: 16
  n_timesteps: !!float 3e5
  policy: 'MlpPolicy'
  ent_coef: 0.0

# Tuned
HalfCheetahBulletEnv-v0:
  env_wrapper: utils.wrappers.TimeFeatureWrapper
  normalize: True
  n_envs: 1
  n_timesteps: !!float 2e6
  policy: 'MlpPolicy'
  ent_coef: 0.0
  lr_schedule: 'constant'
  learning_rate: 0.0217
  n_steps: 128
  nprocs: 4
  max_grad_norm: 0.5
  gamma: 0.98
  vf_coef: 0.946

# TO BE tuned
Walker2DBulletEnv-v0:
  env_wrapper: utils.wrappers.TimeFeatureWrapper
  normalize: True
  n_envs: 1
  n_timesteps: !!float 2e6
  policy: 'MlpPolicy'
  ent_coef: 0.0
  # lr_schedule: 'constant'
  # learning_rate: 0.0217
  n_steps: 128
  nprocs: 4
  gamma: 0.99
  vf_coef: 0.946

HalfCheetah-v2:
  env_wrapper: utils.wrappers.TimeFeatureWrapper
  normalize: True
  n_envs: 1
  n_timesteps: !!float 1e6
  policy: 'MlpPolicy'
  ent_coef: 0.0
  lr_schedule: 'constant'
  learning_rate: 0.2
  n_steps: 2048
  nprocs: 4
  max_grad_norm: 10
  gamma: 0.99
  vf_coef: 0.5
  policy_kwargs: "dict(net_arch=[256, 256])"

# Tuned
BipedalWalkerHardcore-v2:
  normalize: true
  n_envs: 8
  n_timesteps: !!float 10e7
  policy: 'MlpPolicy'
  ent_coef: 0.000125
  lr_schedule: 'constant'
  learning_rate: 0.0675
  n_steps: 16
  gamma: 0.9999
  vf_coef: 0.51

# Tuned
BipedalWalker-v2:
  normalize: true
  n_envs: 8
  n_timesteps: !!float 5e6
  policy: 'MlpPolicy'
  ent_coef: 0.0
  lr_schedule: 'constant'
  learning_rate: 0.298
  n_steps: 32
  gamma: 0.98
  vf_coef: 0.38
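For context, the tuned `BipedalWalker-v2` entry above corresponds roughly to the ACKTR construction below; this is a hedged sketch, since the zoo's `train.py` is what actually builds the vectorized env and forwards the remaining keys as keyword arguments. (The Bullet entries additionally apply `utils.wrappers.TimeFeatureWrapper`, which augments each observation with a feature for the remaining episode time.)

```
import gym
from stable_baselines import ACKTR
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

# n_envs: 8 and normalize: true from the YAML entry
env = VecNormalize(DummyVecEnv([lambda: gym.make("BipedalWalker-v2")] * 8))

# The remaining keys become ACKTR keyword arguments
model = ACKTR("MlpPolicy", env, ent_coef=0.0, lr_schedule="constant",
              learning_rate=0.298, n_steps=32, gamma=0.98, vf_coef=0.38)
model.learn(total_timesteps=int(5e6))  # n_timesteps: !!float 5e6
```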
16 changes: 15 additions & 1 deletion hyperparams/td3.yml
@@ -49,6 +49,20 @@ HalfCheetahBulletEnv-v0:
  gradient_steps: 1000
  policy_kwargs: "dict(layers=[400, 300])"

BipedalWalker-v2:
  n_timesteps: !!float 2e6
  policy: 'MlpPolicy'
  gamma: 0.99
  buffer_size: 1000000
  noise_type: 'normal'
  noise_std: 0.1
  learning_starts: 10000
  batch_size: 100
  learning_rate: !!float 1e-3
  train_freq: 1000
  gradient_steps: 1000
  policy_kwargs: "dict(layers=[400, 300])"

# To be tuned
BipedalWalkerHardcore-v2:
n_timesteps: !!float 5e7
@@ -59,7 +73,7 @@ BipedalWalkerHardcore-v2:
  noise_std: 0.2
  learning_starts: 10000
  batch_size: 100
-  learning_rate: 1e-3
+  learning_rate: !!float 1e-3
  train_freq: 1000
  gradient_steps: 1000
  policy_kwargs: "dict(layers=[400, 300])"
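Two details in these entries are worth spelling out. First, the `!!float` tag in the fix above is not cosmetic: PyYAML resolves a bare `1e-3` (no decimal point) as the string `'1e-3'`, so the tag (or writing `0.001`) is needed to get an actual float. Second, `noise_type: 'normal'` with `noise_std: 0.1` denotes Gaussian exploration noise. A hedged sketch of the equivalent direct construction, assuming stable-baselines >= 2.8:

```
import gym
import numpy as np
from stable_baselines import TD3
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make("BipedalWalker-v2")
n_actions = env.action_space.shape[0]

# noise_type: 'normal' / noise_std: 0.1 from the YAML entry
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))

model = TD3("MlpPolicy", env, gamma=0.99, buffer_size=1000000,
            learning_starts=10000, batch_size=100, learning_rate=1e-3,
            train_freq=1000, gradient_steps=1000, action_noise=action_noise,
            policy_kwargs=dict(layers=[400, 300]))
model.learn(total_timesteps=int(2e6))
```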
2 changes: 1 addition & 1 deletion run_docker_cpu.sh
@@ -8,5 +8,5 @@ echo $cmd_line


docker run -it --rm --network host --ipc=host \
-  --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/rl-baselines-zoo-cpu\
+  --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/rl-baselines-zoo-cpu:v2.8.0\
  bash -c "cd /root/code/stable-baselines/ && $cmd_line"
2 changes: 1 addition & 1 deletion run_docker_gpu.sh
100644 → 100755
@@ -8,5 +8,5 @@ echo $cmd_line


docker run -it --runtime=nvidia --rm --network host --ipc=host \
-  --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/rl-baselines-zoo\
+  --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/rl-baselines-zoo:v2.8.0\
  bash -c "cd /root/code/stable-baselines/ && $cmd_line"
23 changes: 23 additions & 0 deletions scripts/run_tests_travis.sh
@@ -0,0 +1,23 @@
#!/usr/bin/env bash

DOCKER_CMD="docker run -it --rm --network host --ipc=host --mount src=$(pwd),target=/root/code/stable-baselines,type=bind"
BASH_CMD="cd /root/code/stable-baselines/"

if [[ $# -ne 1 ]]; then
  echo "usage: $0 <test glob>"
  exit 1
fi

if [[ ${DOCKER_IMAGE} = "" ]]; then
  echo "Need DOCKER_IMAGE environment variable to be set."
  exit 1
fi

TEST_GLOB=$1

set -e # exit immediately on any error


${DOCKER_CMD} ${DOCKER_IMAGE} \
bash -c "${BASH_CMD} && \
python -m pytest --cov-config .coveragerc --cov-report term --cov=. -v tests/test_${TEST_GLOB}"
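For a local run of one shard, the same script can be invoked directly, e.g. `DOCKER_IMAGE=araffin/rl-baselines-zoo-cpu:v2.8.0 ./scripts/run_tests_travis.sh train.py`, which executes only `tests/test_train.py` inside the container.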
6 changes: 4 additions & 2 deletions tests/test_hyperparams_opt.py
@@ -13,9 +13,9 @@ def _assert_eq(left, right):
N_TRIALS = 2
N_JOBS = 1

- ALGOS = ('ppo2', 'a2c', 'trpo')
+ ALGOS = ('ppo2', 'a2c', 'trpo', 'acktr')
# Not yet supported:
- # ALGOS = ('acer', 'acktr', 'dqn')
+ # ALGOS = ('acer', 'dqn')
ENV_IDS = ('CartPole-v1',)
LOG_FOLDER = 'logs/tests_optimize/'

@@ -29,6 +29,8 @@ def _assert_eq(left, right):
experiments['ddpg-MountainCarContinuous-v0'] = ('ddpg', 'MountainCarContinuous-v0')
# Test for SAC
experiments['sac-Pendulum-v0'] = ('sac', 'Pendulum-v0')
# Test for TD3
experiments['td3-Pendulum-v0'] = ('td3', 'Pendulum-v0')

# Clean up
if os.path.isdir(LOG_FOLDER):
4 changes: 2 additions & 2 deletions train.py
@@ -51,9 +51,9 @@
                    help='Run hyperparameters search')
parser.add_argument('--n-jobs', help='Number of parallel jobs when optimizing hyperparameters', type=int, default=1)
parser.add_argument('--sampler', help='Sampler to use when optimizing hyperparameters', type=str,
-                    default='skopt', choices=['random', 'tpe', 'skopt'])
+                    default='tpe', choices=['random', 'tpe', 'skopt'])
parser.add_argument('--pruner', help='Pruner to use when optimizing hyperparameters', type=str,
-                    default='none', choices=['halving', 'median', 'none'])
+                    default='median', choices=['halving', 'median', 'none'])
parser.add_argument('--verbose', help='Verbose mode (0: no output, 1: INFO)', default=1,
                    type=int)
parser.add_argument('--gym-packages', type=str, nargs='+', default=[], help='Additional external Gym environment package modules to import (e.g. gym_minigrid)')
Binary file added trained_agents/acktr/BipedalWalker-v2.zip
11 changes: 11 additions & 0 deletions trained_agents/acktr/BipedalWalker-v2/config.yml
@@ -0,0 +1,11 @@
!!python/object/apply:collections.OrderedDict
- - [ent_coef, 0.0]
  - [gamma, 0.98]
  - [learning_rate, 0.298]
  - [lr_schedule, constant]
  - [n_envs, 8]
  - [n_steps, 32]
  - [n_timesteps, 5000000.0]
  - [normalize, true]
  - [policy, MlpPolicy]
  - [vf_coef, 0.38]
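The zoo stores a `config.yml` like this next to every trained agent so the exact hyperparameters can be recovered later. Because the `!!python/object/apply` tag constructs a real Python object, it must be read with an unsafe loader rather than `yaml.safe_load`. A minimal sketch, assuming PyYAML >= 5.1 for `UnsafeLoader`:

```
import yaml

# '!!python/object/apply:collections.OrderedDict' builds a Python object,
# which yaml.safe_load refuses; UnsafeLoader allows it for trusted files.
with open("trained_agents/acktr/BipedalWalker-v2/config.yml") as f:
    hyperparams = yaml.load(f, Loader=yaml.UnsafeLoader)

print(hyperparams["learning_rate"])  # 0.298
```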
Binary file added trained_agents/acktr/BipedalWalker-v2/obs_rms.pkl
Binary file added trained_agents/acktr/BipedalWalker-v2/ret_rms.pkl
Binary file added trained_agents/acktr/BipedalWalkerHardcore-v2.zip
