TF v2 #469 (Draft)
wants to merge 5 commits into master

Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 69 additions & 5 deletions week06_policy_based/a2c-optional.ipynb
@@ -144,10 +144,10 @@
"To train the part of the model that predicts state values you will need to compute the value targets. \n",
"Any callable could be passed to `EnvRunner` to be applied to each partial trajectory after it is collected. \n",
"Thus, we can implement and use `ComputeValueTargets` callable. \n",
"The formula for the value targets is simple:\n",
"The formula for the value targets is simple, it's the right side of the following equation:\n",
"\n",
"$$\n",
"\\hat v(s_t) = \\left( \\sum_{t'=0}^{T - 1 - t} \\gamma^{t'}r_{t+t'} \\right) + \\gamma^T \\hat{v}(s_{t+T}),\n",
"V(s_t) = \\left( \\sum_{t'=0}^{T - 1 - t} \\gamma^{t'} \\cdot r (s_{t+t'}, a_{t + t'}) \\right) + \\gamma^T \\cdot V(s_{t+T}),\n",
"$$\n",
"\n",
"In implementation, however, do not forget to use \n",
@@ -165,7 +165,7 @@
"class ComputeValueTargets:\n",
" def __init__(self, policy, gamma=0.99):\n",
" self.policy = policy\n",
" \n",
"\n",
" def __call__(self, trajectory):\n",
" # This method should modify trajectory inplace by adding\n",
" # an item with key 'value_targets' to it.\n",
@@ -214,7 +214,58 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now is the time to implement the advantage actor critic algorithm itself. You can look into your lecture,\n",
"# Actor-critic objective\n",
"\n",
"Here we define a loss function that uses rollout above to train advantage actor-critic agent.\n",
"\n",
"\n",
"Our loss consists of three components:\n",
"\n",
"* __The policy \"loss\"__\n",
" $$ \\hat J = {1 \\over T} \\cdot \\sum_t { \\log \\pi(a_t | s_t) } \\cdot A_{const}(s,a) $$\n",
" * This function has no meaning in and of itself, but it was built such that\n",
" * $ \\nabla \\hat J = {1 \\over N} \\cdot \\sum_t { \\nabla \\log \\pi(a_t | s_t) } \\cdot A(s,a) \\approx \\nabla E_{s, a \\sim \\pi} R(s,a) $\n",
" * Therefore if we __maximize__ J_hat with gradient descent we will maximize expected reward\n",
" \n",
" \n",
"* __The value \"loss\"__\n",
" $$ L_{td} = {1 \\over T} \\cdot \\sum_t { [r + \\gamma \\cdot V_{const}(s_{t+1}) - V(s_t)] ^ 2 }$$\n",
" * Ye Olde TD_loss from q-learning and alike\n",
" * If we minimize this loss, V(s) will converge to $V_\\pi(s) = E_{a \\sim \\pi(a | s)} R(s,a) $\n",
"\n",
"\n",
"* __Entropy Regularizer__\n",
" $$ H = - {1 \\over T} \\sum_t \\sum_a {\\pi(a|s_t) \\cdot \\log \\pi (a|s_t)}$$\n",
" * If we __maximize__ entropy we discourage agent from predicting zero probability to actions\n",
" prematurely (a.k.a. exploration)\n",
" \n",
" \n",
"So we optimize a linear combination of $L_{td}$ $- \\hat J$, $-H$\n",
" \n",
"```\n",
"\n",
"```\n",
"\n",
"```\n",
"\n",
"```\n",
"\n",
"```\n",
"\n",
"```\n",
"\n",
"\n",
"__One more thing:__ since we train on T-step rollouts, we can use N-step formula for advantage for free:\n",
" * At the last step, $A(s_t,a_t) = r(s_t, a_t) + \\gamma \\cdot V(s_{t+1}) - V(s) $\n",
" * One step earlier, $A(s_t,a_t) = r(s_t, a_t) + \\gamma \\cdot r(s_{t+1}, a_{t+1}) + \\gamma ^ 2 \\cdot V(s_{t+2}) - V(s) $\n",
" * Et cetera, et cetera. This way agent starts training much faster since it's estimate of A(s,a) depends less on his (imperfect) value function and more on actual rewards. There's also a [nice generalization](https://arxiv.org/abs/1506.02438) of this."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also look into your lecture,\n",
"[Mnih et al. 2016](https://arxiv.org/abs/1602.01783) paper, and [lecture](https://www.youtube.com/watch?v=Tol_jw5hWnI&list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37&index=20) by Sergey Levine."
]
},
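To make the objective above concrete, here is a hedged TensorFlow 2 sketch of the combined A2C loss. The tensor names and the loss coefficients are illustrative assumptions, not the notebook's reference solution; the advantage `value_targets - values` is the N-step advantage discussed above, because the value targets already bootstrap from the end of the rollout.

```python
import tensorflow as tf


def a2c_loss(logits, values, actions, value_targets,
             value_loss_coef=0.25, entropy_coef=0.01):
    """Sketch only. logits: (batch, n_actions); values, actions,
    value_targets: (batch,). Coefficients are assumed defaults."""
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    probs = tf.nn.softmax(logits, axis=-1)
    n_actions = tf.shape(logits)[-1]
    # log pi(a_t | s_t) for the actions actually taken.
    log_pi_a = tf.reduce_sum(log_probs * tf.one_hot(actions, n_actions), axis=-1)

    # N-step advantage, treated as a constant for the policy gradient.
    advantages = tf.stop_gradient(value_targets - values)
    policy_loss = -tf.reduce_mean(log_pi_a * advantages)            # -J_hat

    # TD-style value loss towards the (constant) value targets.
    value_loss = tf.reduce_mean(
        tf.square(tf.stop_gradient(value_targets) - values))        # L_td

    # Entropy of the policy, averaged over the batch.
    entropy = -tf.reduce_mean(tf.reduce_sum(probs * log_probs, axis=-1))

    # Linear combination: minimize L_td and -J_hat, maximize H.
    return policy_loss + value_loss_coef * value_loss - entropy_coef * entropy
```

Minimizing the returned scalar with a single optimizer then trains both the actor and the critic at once; the exact coefficient values are a design choice left to the notebook.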
@@ -288,9 +339,22 @@
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"pygments_lexer": "ipython3"
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
56 changes: 38 additions & 18 deletions week06_policy_based/atari_wrappers.py
@@ -213,12 +213,16 @@ def __init__(self, env, prefix=None, running_mean_size=100):
self.episode_counter = 0
self.prefix = prefix or self.env.spec.id

nenvs = getattr(self.env.unwrapped, "nenvs", 1)
self.rewards = np.zeros(nenvs)
self.had_ended_episodes = np.zeros(nenvs, dtype=np.bool)
self.episode_lengths = np.zeros(nenvs)
self.nenvs = getattr(self.env.unwrapped, "nenvs", 1)
self.rewards = np.zeros(self.nenvs)
self.had_ended_episodes = np.zeros(self.nenvs, dtype=bool)
self.episode_lengths = np.zeros(self.nenvs)
self.reward_queues = [deque([], maxlen=running_mean_size)
for _ in range(nenvs)]
for _ in range(self.nenvs)]
self.global_step = 0

def add_summary_scalar(self, name, value):
raise NotImplementedError

def should_write_summaries(self):
""" Returns true if it's time to write summaries. """
@@ -260,6 +264,8 @@ def step(self, action):
self.reward_queues[i].append(self.rewards[i])
self.rewards[i] = 0

self.global_step += self.nenvs

if self.should_write_summaries():
self.add_summaries()
return obs, rew, done, info
@@ -272,19 +278,22 @@ def reset(self, **kwargs):


class TFSummaries(SummariesBase):
""" Writes env summaries using TensorFlow."""
""" Writes env summaries using TensorFlow.
In order to write summaries to a specific directory,
you may define a writer and set it as the default just before
the training loop, as in the example at
https://www.tensorflow.org/api_docs/python/tf/summary
Other summaries can be added in the A2C class or elsewhere.
"""

def __init__(self, env, prefix=None, running_mean_size=100, step_var=None):
def __init__(self, env, prefix=None,
running_mean_size=100, step_var=None):

super().__init__(env, prefix, running_mean_size)

import tensorflow as tf
self.step_var = (step_var if step_var is not None
else tf.train.get_global_step())

def add_summary_scalar(self, name, value):
import tensorflow as tf
tf.contrib.summary.scalar(name, value, step = self.step_var)
tf.summary.scalar(name, value, self.global_step)


class NumpySummaries(SummariesBase):
@@ -304,7 +313,7 @@ def get_values(cls, name):
def clear(cls):
cls._summaries = defaultdict(list)

def __init__(self, env, prefix = None, running_mean_size = 100):
def __init__(self, env, prefix=None, running_mean_size=100):
super().__init__(env, prefix, running_mean_size)

def add_summary_scalar(self, name, value):
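As the `TFSummaries` docstring above suggests, the summary writer is created and set as default before the training loop. A hypothetical usage sketch follows; the log directory and environment id are illustrative choices, not part of this PR.

```python
import tensorflow as tf

from atari_wrappers import nature_dqn_env

# Hypothetical setup: "logs/a2c" and the env id are illustrative.
env = nature_dqn_env("BreakoutNoFrameskip-v4", nenvs=8, summaries="TensorFlow")

writer = tf.summary.create_file_writer("logs/a2c")
with writer.as_default():
    # Every tf.summary.scalar call made by TFSummaries.add_summary_scalar
    # while interacting with `env` is written under logs/a2c.
    observations = env.reset()
    # ... training loop goes here ...
```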
@@ -316,6 +325,7 @@ def nature_dqn_env(env_id, nenvs=None, seed=None,
""" Wraps env as in Nature DQN paper. """
if "NoFrameskip" not in env_id:
raise ValueError(f"env_id must have 'NoFrameskip' but is {env_id}")

if nenvs is not None:
if seed is None:
seed = list(range(nenvs))
@@ -327,20 +337,30 @@

env = ParallelEnvBatch([
lambda i=i, env_seed=env_seed: nature_dqn_env(
env_id, seed=env_seed, summaries=False, clip_reward=False)
env_id, seed=env_seed, summaries=None, clip_reward=False)
for i, env_seed in enumerate(seed)
])
if summaries:
summaries_class = NumpySummaries if summaries == 'Numpy' else TFSummaries
env = summaries_class(env, prefix=env_id)
if summaries is not None:
if summaries == 'Numpy':
env = NumpySummaries(env, prefix=env_id)
elif summaries == 'TensorFlow':
env = TFSummaries(env, prefix=env_id)
else:
raise ValueError(
f"Unknown `summaries` value: expected either 'Numpy' or 'TensorFlow', got {summaries}")
if clip_reward:
env = ClipReward(env)
return env

env = gym.make(env_id)
env.seed(seed)
if summaries:
if summaries == 'Numpy':
env = NumpySummaries(env)
elif summaries == 'TensorFlow':
env = TFSummaries(env)
elif summaries:
raise ValueError(f"summaries must be either Numpy, "
f"or TensorFlow, or a falsy value, but is {summaries}")
env = EpisodicLife(env)
if "FIRE" in env.unwrapped.get_action_meanings():
env = FireReset(env)
9 changes: 9 additions & 0 deletions week06_policy_based/local_setup.sh
@@ -0,0 +1,9 @@
#!/usr/bin/env bash

apt-get install -yqq ffmpeg
apt-get install -yqq python-opengl

python3 -m pip install --user gym==0.14.0
python3 -m pip install --user pygame
python3 -m pip install --user pyglet==1.3.2
python3 -m pip install --user "tensorflow>=2.0.0"