Proofread weeks from 2 to 6
mmamedli authored and dniku committed Feb 13, 2021
1 parent 1e936b0 commit cd9f5e9
Showing 11 changed files with 292 additions and 286 deletions.
42 changes: 22 additions & 20 deletions week2_model_based/practice_vi.ipynb
@@ -6,9 +6,9 @@
"source": [
"### Markov decision process\n",
"\n",
"This week's methods are all built to solve __M__arkov __D__ecision __P__rocesses. In the broadest sense, an MDP is defined by how it changes states and how rewards are computed.\n",
"This week methods are all built to solve __M__arkov __D__ecision __P__rocesses. In the broadest sense, the MDP is defined by how it changes the states and how rewards are computed.\n",
"\n",
"State transition is defined by $P(s' |s,a)$ - how likely are you to end at state $s'$ if you take action $a$ from state $s$. Now there's more than one way to define rewards, but we'll use $r(s,a,s')$ function for convenience.\n",
"State transition is defined by $P(s' |s,a)$ - how likely you are to end at the state $s'$ if you take an action $a$ from the state $s$. Now there's more than one way to define rewards, but for convenience we'll use $r(s,a,s')$ function.\n",
"\n",
"_This notebook is inspired by the awesome_ [CS294](https://github.com/berkeleydeeprlcourse/homework/blob/36a0b58261acde756abd55306fbe63df226bf62b/hw2/HW2.ipynb) _by Berkeley_"
]
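
For concreteness, the transition and reward structure described above can be written as nested dictionaries, with transition_probs[s][a][s'] = P(s'|s,a) and rewards[s][a][s'] = r(s,a,s'). The states, probabilities and rewards below are purely illustrative, and the commented-out MDP constructor is an assumption about the course's mdp.py, which this diff does not show:

# A toy MDP in dictionary form (all numbers are illustrative):
transition_probs = {
    's0': {'a0': {'s0': 0.5, 's2': 0.5}, 'a1': {'s2': 1.0}},
    's1': {'a0': {'s0': 0.7, 's1': 0.1, 's2': 0.2}, 'a1': {'s1': 0.95, 's2': 0.05}},
    's2': {'a0': {'s0': 0.4, 's2': 0.6}, 'a1': {'s0': 0.3, 's1': 0.3, 's2': 0.4}},
}
rewards = {
    's1': {'a0': {'s0': +5}},  # transitions not listed here yield zero reward
    's2': {'a1': {'s0': -1}},
}

# Assuming the constructor provided by the course's mdp.py:
# from mdp import MDP
# mdp = MDP(transition_probs, rewards, initial_state='s0')
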
@@ -39,7 +39,7 @@
" !touch .setup_complete\n",
"\n",
"# This code creates a virtual display to draw game images on.\n",
"# It will have no effect if your machine has a monitor.\n",
"# It won't have any effect if your machine has a monitor.\n",
"if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
" !bash ../xvfb start\n",
" os.environ['DISPLAY'] = ':1'"
@@ -78,7 +78,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use MDP just as any other gym environment:"
"We can now use the MDP just as any other gym environment:"
]
},
{
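
A quick sketch of what "use it as a gym environment" means in practice. It assumes the classic gym convention, where step() returns (next_state, reward, done, info), and the helper get_possible_actions() from the course's mdp.py; both are assumptions, since neither appears in this hunk:

s = mdp.reset()                       # start a new episode, get the initial state
a = mdp.get_possible_actions(s)[0]    # pick any action available in s
next_s, r, done, info = mdp.step(a)   # sample s' ~ P(s'|s,a) and receive r(s,a,s')
print(f"moved {s} -> {next_s}, reward {r}, done: {done}")
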
@@ -105,7 +105,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"but it also has other methods that you'll need for Value Iteration"
"but it also has other methods that you'll need for Value Iteration:"
]
},
{
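
Those methods can be sketched as follows. The names (get_possible_actions, get_next_states, get_reward) are assumptions based on the course's mdp.py; only get_all_states and is_terminal are visible later in this diff:

# Enumerate the whole model through the (assumed) MDP interface:
for s in mdp.get_all_states():
    for a in mdp.get_possible_actions(s):
        for next_s, p in mdp.get_next_states(s, a).items():   # {s': P(s'|s,a)}
            r = mdp.get_reward(s, a, next_s)
            print(f"P({next_s} | {s}, {a}) = {p:.2f}, r = {r}")
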
@@ -352,7 +352,9 @@
"<graphviz.dot.Digraph at 0x7f729b9db7b8>"
]
},
"metadata": {},
"metadata": {
"tags": []
},
"output_type": "display_data"
}
],
@@ -387,9 +389,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's write a function to compute the state-action value function $Q^{\\pi}$, defined as follows\n",
"First, let's write a function to compute the state-action value function $Q^{\\pi}$, defined as follows:\n",
"\n",
"$$Q_i(s, a) = \\sum_{s'} P(s' | s,a) \\cdot [ r(s,a,s') + \\gamma V_{i}(s')]$$\n"
"$$Q_i(s, a) = \\sum_{s'} P(s' | s,a) \\cdot [ r(s,a,s') + \\gamma V_{i}(s')].$$\n"
]
},
{
@@ -399,7 +401,7 @@
"outputs": [],
"source": [
"def get_action_value(mdp, state_values, state, action, gamma):\n",
" \"\"\" Computes Q(s,a) as in formula above \"\"\"\n",
" \"\"\" Computes Q(s,a) according to the formula above \"\"\"\n",
"\n",
" <YOUR CODE>\n",
"\n",
@@ -422,7 +424,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Using $Q(s,a)$ we can now define the \"next\" V(s) for value iteration.\n",
"Using $Q(s,a)$ we now can define the \"next\" V(s) for value iteration.\n",
" $$V_{(i+1)}(s) = \\max_a \\sum_{s'} P(s' | s,a) \\cdot [ r(s,a,s') + \\gamma V_{i}(s')] = \\max_a Q_i(s,a)$$"
]
},
@@ -433,7 +435,7 @@
"outputs": [],
"source": [
"def get_new_state_value(mdp, state_values, state, gamma):\n",
" \"\"\" Computes next V(s) as in formula above. Please do not change state_values in process. \"\"\"\n",
" \"\"\" Computes the next V(s) according to the formula above. Please do not change state_values in process. \"\"\"\n",
" if mdp.is_terminal(state):\n",
" return 0\n",
"\n",
@@ -470,9 +472,9 @@
"outputs": [],
"source": [
"# parameters\n",
"gamma = 0.9 # discount for MDP\n",
"gamma = 0.9 # discount for the MDP\n",
"num_iter = 100 # maximum iterations, excluding initialization\n",
"# stop VI if new values are this close to old values (or closer)\n",
"# stop VI if new values are as close to old values (or closer)\n",
"min_difference = 0.001\n",
"\n",
"# initialize V(s)\n",
@@ -528,11 +530,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use those $V^{*}(s)$ to find optimal actions in each state\n",
"Now let's use those $V^{*}(s)$ to find optimal actions in each state:\n",
"\n",
" $$\\pi^*(s) = argmax_a \\sum_{s'} P(s' | s,a) \\cdot [ r(s,a,s') + \\gamma V_{i}(s')] = argmax_a Q_i(s,a)$$\n",
" $$\\pi^*(s) = argmax_a \\sum_{s'} P(s' | s,a) \\cdot [ r(s,a,s') + \\gamma V_{i}(s')] = argmax_a Q_i(s,a).$$\n",
" \n",
"The only difference vs V(s) is that here we take not max but argmax: find action such with maximum Q(s,a)."
"The only difference vs V(s) is that here instead of max we take argmax: find the action that leads to the maximum of Q(s,a)."
]
},
{
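
Policy extraction then only swaps max for argmax, for example (a sketch; the function name and the get_possible_actions helper are assumptions):

def get_optimal_action(mdp, state_values, state, gamma=0.9):
    """ pi*(s) = argmax_a Q(s, a); returns None for terminal states. """
    if mdp.is_terminal(state):
        return None
    return max(
        mdp.get_possible_actions(state),
        key=lambda action: get_action_value(mdp, state_values, state, action, gamma),
    )
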
@@ -622,7 +624,7 @@
"outputs": [],
"source": [
"def value_iteration(mdp, state_values=None, gamma=0.9, num_iter=1000, min_difference=1e-5):\n",
" \"\"\" performs num_iter value iteration steps starting from state_values. Same as before but in a function \"\"\"\n",
" \"\"\" performs num_iter value iteration steps starting from state_values. The same as before but in a function \"\"\"\n",
" state_values = state_values or {s: 0 for s in mdp.get_all_states()}\n",
" for i in range(num_iter):\n",
"\n",
@@ -631,7 +633,7 @@
"\n",
" assert isinstance(new_state_values, dict)\n",
"\n",
" # Compute difference\n",
" # Compute the difference\n",
" diff = max(abs(new_state_values[s] - state_values[s])\n",
" for s in mdp.get_all_states())\n",
"\n",
@@ -677,7 +679,7 @@
"source": [
"### Let's visualize!\n",
"\n",
"It's usually interesting to see what your algorithm actually learned under the hood. To do so, we'll plot state value functions and optimal actions at each VI step."
"It's usually interesting to see, what your algorithm actually learned under the hood. To do so, we'll plot the state value functions and optimal actions at each VI step."
]
},
{
@@ -903,5 +905,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 0
}
40 changes: 20 additions & 20 deletions week3_model_free/experience_replay.ipynb
@@ -6,21 +6,21 @@
"source": [
"### Honor Track: experience replay\n",
"\n",
"There's a powerful technique that you can use to improve sample efficiency for off-policy algorithms: [spoiler] Experience replay :)\n",
"There's a powerful technique that you can use to improve the sample efficiency for off-policy algorithms: [spoiler] Experience replay :)\n",
"\n",
"The catch is that you can train Q-learning and EV-SARSA on `<s,a,r,s'>` tuples even if they aren't sampled under current agent's policy. So here's what we're gonna do:\n",
"The catch is that you can train Q-learning and EV-SARSA on `<s,a,r,s'>` tuples even if they aren't sampled under the current agent's policy. So here's what we're gonna do:\n",
"\n",
"<img src=https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/exp_replay.png width=480>\n",
"\n",
"#### Training with experience replay\n",
"1. Play game, sample `<s,a,r,s'>`.\n",
"2. Update q-values based on `<s,a,r,s'>`.\n",
"3. Store `<s,a,r,s'>` transition in a buffer. \n",
" 3. If buffer is full, delete earliest data.\n",
"4. Sample K such transitions from that buffer and update q-values based on them.\n",
" 3. If buffer is full, delete the earliest data.\n",
"4. Sample K such transitions from that buffer and update the q-values based on them.\n",
"\n",
"\n",
"To enable such training, first we must implement a memory structure that would act like such a buffer."
"To enable such training, first, we must implement a memory structure, that would act as this buffer."
]
},
{
@@ -39,7 +39,7 @@
" !touch .setup_complete\n",
"\n",
"# This code creates a virtual display to draw game images on.\n",
"# It will have no effect if your machine has a monitor.\n",
"# It won't have any effect if your machine has a monitor.\n",
"if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
" !bash ../xvfb start\n",
" os.environ['DISPLAY'] = ':1'"
@@ -64,7 +64,7 @@
"metadata": {},
"outputs": [],
"source": [
"<YOUR CODE: copy your implementation of QLearningAgent from previous notebooks here>"
"<YOUR CODE: copy your implementation of QLearningAgent from the previous notebooks here>"
]
},
{
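
In case the earlier notebook is not at hand, a bare-bones QLearningAgent might look like this. It is a sketch, not the course's reference implementation; the constructor signature (alpha, epsilon, discount, get_legal_actions) is inferred from how the agent is instantiated further down in this diff, and update(s, a, r, next_s) from how it is called there:

import random
from collections import defaultdict

class QLearningAgent:
    """ Minimal tabular epsilon-greedy Q-learning agent (sketch). """

    def __init__(self, alpha, epsilon, discount, get_legal_actions):
        self.get_legal_actions = get_legal_actions
        self._qvalues = defaultdict(lambda: defaultdict(lambda: 0.0))
        self.alpha = alpha        # learning rate
        self.epsilon = epsilon    # exploration probability
        self.discount = discount  # gamma

    def get_qvalue(self, state, action):
        return self._qvalues[state][action]

    def get_value(self, state):
        actions = list(self.get_legal_actions(state))
        if not actions:
            return 0.0
        return max(self.get_qvalue(state, a) for a in actions)

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
        target = reward + self.discount * self.get_value(next_state)
        self._qvalues[state][action] = (
            (1 - self.alpha) * self.get_qvalue(state, action) + self.alpha * target
        )

    def get_best_action(self, state):
        actions = list(self.get_legal_actions(state))
        if not actions:
            return None
        return max(actions, key=lambda a: self.get_qvalue(state, a))

    def get_action(self, state):
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily
        actions = list(self.get_legal_actions(state))
        if not actions:
            return None
        if random.random() < self.epsilon:
            return random.choice(actions)
        return self.get_best_action(state)
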
@@ -83,12 +83,12 @@
" Parameters\n",
" ----------\n",
" size: int\n",
" Max number of transitions to store in the buffer. When the buffer\n",
" overflows the old memories are dropped.\n",
" Max number of transitions to store in the buffer. When the buffer is\n",
" overflowed, the old memories are dropped.\n",
"\n",
" Note: for this assignment you can pick any data structure you want.\n",
" If you want to keep it simple, you can store a list of tuples of (s, a, r, s') in self._storage\n",
" However you may find out there are faster and/or more memory-efficient ways to do so.\n",
" However you may find, that there are faster and/or more memory-efficient ways to do so.\n",
" \"\"\"\n",
" self._storage = []\n",
" self._maxsize = size\n",
@@ -101,7 +101,7 @@
" def add(self, obs_t, action, reward, obs_tp1, done):\n",
" '''\n",
" Make sure, _storage will not exceed _maxsize. \n",
" Make sure, FIFO rule is being followed: the oldest examples has to be removed earlier\n",
" Make sure, FIFO rule is being followed: the oldest examples have to be removed earlier\n",
" '''\n",
" data = (obs_t, action, reward, obs_tp1, done)\n",
"\n",
@@ -121,9 +121,9 @@
" act_batch: np.array\n",
" batch of actions executed given obs_batch\n",
" rew_batch: np.array\n",
" rewards received as results of executing act_batch\n",
" rewards received as the results of executing act_batch\n",
" next_obs_batch: np.array\n",
" next set of observations seen after executing act_batch\n",
" next set of observations, seen after executing act_batch\n",
" done_mask: np.array\n",
" done_mask[i] = 1 if executing act_batch[i] resulted in\n",
" the end of an episode and 0 otherwise.\n",
@@ -184,7 +184,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use this buffer to improve training:"
"Now let's use this buffer to improve the training:"
]
},
{
@@ -212,7 +212,7 @@
" - train agent using agent.update(...) whenever possible\n",
" - return total reward\n",
" :param replay: ReplayBuffer where agent can store and sample (s,a,r,s',done) tuples.\n",
" If None, do not use experience replay\n",
" If None, do not use an experience replay\n",
" \"\"\"\n",
" total_reward = 0.0\n",
" s = env.reset()\n",
@@ -231,7 +231,7 @@
" <YOUR CODE>\n",
"\n",
" # sample replay_batch_size random transitions from replay,\n",
" # then update agent on each of them in a loop\n",
" # then update the agent on each of them in a loop\n",
" s_, a_, r_, next_s_, done_ = replay.sample(replay_batch_size)\n",
" for i in range(replay_batch_size):\n",
" <YOUR CODE>\n",
@@ -250,7 +250,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Create two agents: first will use experience replay, second will not.\n",
"# Create two agents: first will use the experience replay, second will not.\n",
"\n",
"agent_baseline = QLearningAgent(\n",
" alpha=0.5, epsilon=0.25, discount=0.99,\n",
@@ -326,11 +326,11 @@
"\n",
"### Outro\n",
"\n",
"We will use the code you just wrote extensively in the next week of our course. If you're feeling that you need more examples to understand how experience replay works, try using it for binarized state spaces (CartPole or other __[classic control envs](https://gym.openai.com/envs/#classic_control)__).\n",
"We will use the code you just wrote extensively in the next week of our course. If you're feeling, that you need more examples to understand how the experience replay works, try using it for binarized state spaces (CartPole or other __[classic control envs](https://gym.openai.com/envs/#classic_control)__).\n",
"\n",
"__Next week__ we're gonna explore how q-learning and similar algorithms can be applied for large state spaces, with deep learning models to approximate the Q function.\n",
"\n",
"However, __the code you've written__ for this week is already capable of solving many RL problems, and as an added benifit - it is very easy to detach. You can use Q-learning, SARSA and Experience Replay for any RL problems you want to solve - just thow 'em into a file and import the stuff you need."
"However, __the code you've written__ this week is already capable to solve many RL problems, and as an added benifit - it is very easy to detach. You can use Q-learning, SARSA and Experience Replay for any RL problems you want to solve - just throw them into a file and import the stuff you need."
]
}
],
Expand All @@ -341,5 +341,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 0
}