diff --git a/docs/ch12/main.md b/docs/ch12/main.md
index 2bfd917..c7a41e1 100644
--- a/docs/ch12/main.md
+++ b/docs/ch12/main.md
@@ -1,12 +1,14 @@
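-# PPO Algorithm
+# Chapter 12 PPO Algorithm
-$\qquad$ In this chapter we begin to cover $\text{PPO}$ ($\text{proximal policy optimization}$), the most general-purpose algorithm in reinforcement learning. It holds a very important place in reinforcement learning research and applications and can be called a milestone of the field. $\text{PPO}$ is a policy-gradient-based reinforcement learning algorithm proposed by $\text{Schulman}$ et al. at $\text{OpenAI}$ in $\text{2017}$. Its main idea is to introduce an importance ratio into the policy-gradient optimization to limit the size of each policy update, which improves the stability and convergence of the algorithm. $\text{PPO}$ is simple, easy to implement, and easy to tune, and it also performs very well in practice, so it is widely used in reinforcement learning.
+$\qquad$ In this chapter we begin to cover the $\text{PPO}$ algorithm, an important reinforcement learning algorithm that holds a prominent place in applications and is regarded as a milestone. Unlike $\text{DDPG}$, which targets continuous action spaces, $\text{PPO}$ is a typical $\text{Actor-Critic}$ algorithm that works with both continuous and discrete action spaces.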
+
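+$\qquad$ $\text{PPO}$ is a policy-gradient-based reinforcement learning algorithm proposed by $\text{Schulman}$ et al. at $\text{OpenAI}$ in $\text{2017}$. Its main idea is to introduce an importance weight into the policy-gradient optimization and to constrain it so that the size of each policy update is limited, which improves the stability and convergence of the algorithm. $\text{PPO}$ is simple, easy to implement, and easy to tune, and it is so widely used that people say "when in doubt, use $\text{PPO}$".
$\qquad$ The predecessor of $\text{PPO}$ is the $\text{TRPO}$ algorithm, and $\text{PPO}$ aims to overcome some of the computational difficulties and training instabilities of $\text{TRPO}$. $\text{TRPO}$ is a policy-gradient-based algorithm that defines a trust region for policy updates, so that each updated policy does not stray too far from the current one and overly large updates do not degrade performance. However, $\text{TRPO}$ has to solve a complex constrained optimization problem, which is computationally cumbersome. Because this book is mainly practice-oriented, we do not cover $\text{TRPO}$, which is overly complex and has largely been superseded; readers who need it for in-depth research or job interviews can consult the relevant literature. Next we explain the principles and implementation of $\text{PPO}$ in detail, hoping to help readers better understand and master this algorithm.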
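-## Importance Sampling
+## 12.1 Importance Sampling
-$\qquad$ Before covering the $\text{PPO}$ algorithm, we need to lay the groundwork with one concept, namely importance sampling ($\text{importance sampling}$). Importance sampling is a statistical method for estimating the expectation of a random variable under a probability distribution. The principle is simple. Suppose we have a function $f(x)$ and need samples from a distribution $p(x)$ to compute its expectation, but sampling from $p(x)$ is difficult. We can then sample from another distribution $q(x)$ that is easier to sample from and thereby indirectly achieve the effect of sampling from $p(x)$. This is expressed mathematically in Eq. $\text{(12.1)}$.
+$\qquad$ Before turning to the $\text{PPO}$ algorithm, we first introduce one concept: importance sampling ($\text{importance sampling}$). Importance sampling is a statistical method for estimating the expectation of a random variable under a probability distribution. The principle is simple. Suppose we have a function $f(x)$ and need samples from a distribution $p(x)$ to compute its expectation, but sampling from $p(x)$ is difficult. We can then sample from another distribution $q(x)$ that is easier to sample from and thereby indirectly achieve the effect of sampling from $p(x)$. This is expressed mathematically in Eq. $\text{(12.1)}$.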
$$
\tag{12.1}
@@ -17,7 +19,9 @@ $\qquad$ For the discrete case, this can be written as Eq. $\text{(12.2)}$.
$$
\tag{12.2}
+\begin{aligned}
E_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f\left(x_{i}\right) \frac{p\left(x_{i}\right)}{q\left(x_{i}\right)}, \quad x_{i} \sim q(x)
+\end{aligned}
$$
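$\qquad$ The original problem thus reduces to sampling from $q(x)$ and computing the ratio $\frac{p(x)}{q(x)}$ between the two distributions, which is called the **importance weight**. In other words, every sample drawn from $q(x)$ must be multiplied by its importance weight to correct the sampling bias, that is, the difference between the two distributions. One caveat is that wherever $p(x)$ is nonzero, $q(x)$ must be nonzero as well. The two may be $\text{0}$ at the same point, since such points are never sampled and contribute nothing to the expectation. The details are not essential here, so we do not elaborate.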
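
$\qquad$ The following is a minimal sketch of the importance-weighted average, using only the Python standard library. The distributions are illustrative assumptions, not taken from the text: the target is $p = N(1, 1)$, the proposal is $q = N(0, 2^2)$, and $f(x) = x^2$, so the exact value is $E_p[x^2] = 2$. The helper names `normal_pdf` and `importance_sampling_estimate` are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def importance_sampling_estimate(f, p_pdf, q_pdf, q_samples):
    """Average f(x_i) * p(x_i) / q(x_i) over samples x_i drawn from q."""
    total = 0.0
    for x in q_samples:
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / len(q_samples)

rng = random.Random(0)
# Samples come from the proposal q = N(0, 2^2), never from the target p
q_samples = [rng.gauss(0.0, 2.0) for _ in range(200_000)]
estimate = importance_sampling_estimate(
    f=lambda x: x ** 2,
    p_pdf=lambda x: normal_pdf(x, 1.0, 1.0),  # assumed target p = N(1, 1)
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),  # assumed proposal q = N(0, 2^2)
    q_samples=q_samples,
)
print(f"importance-sampling estimate: {estimate:.3f} (exact value: 2.0)")
```

$\qquad$ The proposal here is deliberately wider than the target, so that $q(x)$ is nonzero wherever $p(x)$ is and the weights stay bounded.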
@@ -33,15 +37,17 @@ $\qquad$ Combining this with the importance sampling formula, we obtain Eq. $\text{(12.4)}$.
$$
\tag{12.4}
+\begin{aligned}
Var_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]=E_{x \sim q}\left[\left(f(x) \frac{p(x)}{q(x)}\right)^{2}\right]-\left(E_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]\right)^{2} \\
-=E_{x \sim p}\left[f(x)^{2} \frac{p(x)}{q(x)}\right]-\left(E_{x \sim p}[f(x)]\right)^{2}
+= E_{x \sim p}\left[f(x)^{2} \frac{p(x)}{q(x)}\right]-\left(E_{x \sim p}[f(x)]\right)^{2}
+\end{aligned}
$$
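$\qquad$ It is easy to see that the closer $q(x)$ is to $p(x)$, that is, the closer the importance weight is to $1$, the smaller the variance; conversely, the further apart the two distributions are, the larger the variance.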
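
$\qquad$ This effect can be checked empirically. The sketch below is only an illustration under assumed settings ($p = N(0, 1)$, $f(x) = x$, and two hypothetical proposals with the same width but different means); it measures the sample variance of $f(x)\frac{p(x)}{q(x)}$ for a proposal close to $p$ and for one far from it.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def weighted_sample_variance(q_mean, q_std, n_samples=100_000, seed=0):
    """Empirical variance of f(x) * p(x) / q(x), x ~ q, with f(x) = x and p = N(0, 1)."""
    rng = random.Random(seed)
    weighted_values = []
    for _ in range(n_samples):
        x = rng.gauss(q_mean, q_std)
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, q_mean, q_std)
        weighted_values.append(x * weight)
    mean = sum(weighted_values) / n_samples
    return sum((v - mean) ** 2 for v in weighted_values) / n_samples

var_close = weighted_sample_variance(q_mean=0.0, q_std=1.2)  # q close to p
var_far = weighted_sample_variance(q_mean=3.0, q_std=1.2)    # q far from p
print(f"variance with q close to p: {var_close:.3f}")
print(f"variance with q far from p: {var_far:.3f}")
```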
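$\qquad$ Importance sampling is in fact a form of Monte Carlo estimation, albeit a special one: it lets us handle complex problems by sampling from a known, simple distribution instead of sampling the difficult distribution directly, and with suitable weights the Monte Carlo estimate can be brought closer to the true value.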
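-## PPO Algorithm
+## 12.2 PPO Algorithm
$\qquad$ Importance sampling is essentially a Monte Carlo estimate that works better in certain situations, and we saw in the earlier $\text{Actor-Critic}$ chapter that the high variance of policy-gradient algorithms mainly comes from the sampled estimate of the $\text{Actor}$'s policy gradient. Readers can therefore probably guess what $\text{PPO}$ improves. Indeed, the core idea of $\text{PPO}$ is to use importance sampling to improve the original policy-gradient estimate; its objective function is given in Eq. $\text{(12.5)}$.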
@@ -80,10 +86,79 @@ $$
J^{KL}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text {old }}}\left(a_t \mid s_t\right)} \hat{A}_t-\beta \mathrm{KL}\left[\pi_{\theta_{\text {old }}}\left(\cdot \mid s_t\right), \pi_\theta\left(\cdot \mid s_t\right)\right]\right]
$$
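-The $\text{KL}$ constraint is also commonly called the $\text{KL-penalty}$: on top of the $\text{TRPO}$ loss it adds a $\text{KL}$-divergence penalty term whose coefficient $\beta$ is usually around $0.01$. This penalty likewise keeps each updated policy distribution from drifting too far from the previous one, so that the importance weight stays close to $1$. In practice the $\text{clip}$ constraint is generally used, because it is simpler, cheaper to compute, and also works better.
-
-## A Common Misconception
-
-In a much earlier chapter, we discussed `on-policy` and
-
-## Hands-on: PPO Algorithm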
\ No newline at end of file
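+$\qquad$ The $\text{KL}$ constraint is also commonly called the $\text{KL-penalty}$: on top of the $\text{TRPO}$ loss it adds a $\text{KL}$-divergence penalty term whose coefficient $\beta$ is usually around $0.01$. This penalty likewise keeps each updated policy distribution from drifting too far from the previous one, so that the importance weight stays close to $1$. In practice the $\text{clip}$ constraint is generally used, because it is simpler, cheaper to compute, and also works better.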
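
$\qquad$ To make the two constraints concrete, here is a per-sample sketch in plain Python. It assumes the clipped form used in the update code of this chapter, $\min(r \hat{A}, \text{clip}(r, 1-\epsilon, 1+\epsilon) \hat{A})$, and the $\text{KL}$-penalty form $r \hat{A} - \beta \mathrm{KL}$ from the formula above; the function names and the numeric inputs are hypothetical.

```python
def clipped_surrogate(ratio, advantage, eps_clip=0.2):
    """Per-sample clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    return min(ratio * advantage, clipped_ratio * advantage)

def kl_penalized_surrogate(ratio, advantage, kl, beta=0.01):
    """Per-sample KL-penalty objective: r * A - beta * KL(pi_old || pi_new)."""
    return ratio * advantage - beta * kl

# Positive advantage: a ratio above 1 + eps earns no extra objective
gain_capped = clipped_surrogate(ratio=1.5, advantage=1.0)
# Negative advantage: a ratio below 1 - eps earns no extra objective
loss_capped = clipped_surrogate(ratio=0.5, advantage=-1.0)
# The KL penalty lowers the objective in proportion to the divergence
penalized = kl_penalized_surrogate(ratio=1.5, advantage=1.0, kl=0.1)
print(gain_capped, loss_capped, penalized)
```

$\qquad$ In both capped cases the objective no longer depends on the ratio, so the gradient gives no incentive to move the new policy further from the old one.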
+
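+## 12.3 A Common Misconception
+
+$\qquad$ In earlier chapters we discussed $\text{on-policy}$ and $\text{off-policy}$ algorithms. The former generate samples with the current policy and update that same policy from them, whereas the latter can update the current policy with samples collected by past policies. $\text{On-policy}$ algorithms use data less efficiently, because after each policy update the old samples may no longer be applicable and new samples usually have to be collected. $\text{Off-policy}$ algorithms can exploit historical experience, typically storing and reusing it through experience replay, so their data efficiency is higher: the same batch of data can be used for many updates. Reusing experience may introduce some bias, although it also helps stabilize learning. In environments that demand immediate learning and adaptation, $\text{on-policy}$ algorithms may be the better fit, because they operate directly under the current policy.
+
+$\qquad$ So is $\text{PPO}$ an $\text{on-policy}$ or an $\text{off-policy}$ algorithm? Because the importance-sampling part of the $\text{PPO}$ update uses samples collected by the old $\text{Actor}$, some readers may conclude that $\text{PPO}$ is $\text{off-policy}$. In fact, although this batch of samples was drawn from the old policy, we do not use it directly to update the policy. Instead, importance sampling first corrects the error caused by the mismatch between the data distributions, so that the difference between the two sample distributions is made as small as possible. In other words, the samples after importance sampling, although collected by the old policy, can be regarded as approximately drawn from the updated policy, so the $\text{Actor}$ being optimized and the $\text{Actor}$ doing the sampling are effectively the same. Therefore **$\text{PPO}$ is an $\text{on-policy}$ algorithm**.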
+
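+## 12.4 Hands-on: PPO Algorithm
+### 12.4.1 PPO Pseudocode
+
+$\qquad$ As shown in Figure $\text{12-1}$, unlike $\text{off-policy}$ algorithms, $\text{PPO}$ collects samples for a number of time steps each time and then uses these samples to update the policy, instead of storing them in an experience replay buffer and sampling from it for updates.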
+
+
+
+
+Figure $\text{12-1}$ Pseudocode of the $\text{PPO}$ algorithm
+
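+### 12.4.2 PPO Update
+
+$\qquad$ Whether the action space is continuous or discrete, $\text{PPO}$ samples actions in the same way as the $\text{Actor-Critic}$ algorithms of the previous chapters, so we do not expand on it in this hands-on section; the complete code is available in the $\text{JoyRL}$ repository. We focus on how the policy is updated, as shown in Listing $\text{12-1}$.
+
+
+ Listing $\text{12-1}$ Update step of the $\text{PPO}$ algorithm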
+
+
+```Python
+def update(self):
+    # Fetch all transitions collected by the old policy
+    old_states, old_actions, old_log_probs, old_rewards, old_dones = self.memory.sample()
+    # Convert to tensors
+    old_states = torch.tensor(np.array(old_states), device=self.device, dtype=torch.float32)
+    old_actions = torch.tensor(np.array(old_actions), device=self.device, dtype=torch.float32)
+    old_log_probs = torch.tensor(old_log_probs, device=self.device, dtype=torch.float32)
+    # Compute discounted returns, resetting the running sum at episode boundaries
+    returns = []
+    discounted_sum = 0
+    for reward, done in zip(reversed(old_rewards), reversed(old_dones)):
+        if done:
+            discounted_sum = 0
+        discounted_sum = reward + (self.gamma * discounted_sum)
+        returns.insert(0, discounted_sum)
+    # Normalize the returns
+    returns = torch.tensor(returns, device=self.device, dtype=torch.float32)
+    returns = (returns - returns.mean()) / (returns.std() + 1e-5)  # 1e-5 to avoid division by zero
+    for _ in range(self.k_epochs):  # reuse the same batch for k_epochs gradient steps
+        # Compute the advantage; squeeze values from [N, 1] to [N] so they match returns
+        values = self.critic(old_states).squeeze(-1)
+        advantage = returns - values.detach()
+        probs = self.actor(old_states)
+        dist = Categorical(probs)
+        new_log_probs = dist.log_prob(old_actions)
+        # Compute the importance weight pi_theta / pi_theta_old
+        ratio = torch.exp(new_log_probs - old_log_probs)
+        surr1 = ratio * advantage
+        surr2 = torch.clamp(ratio, 1 - self.eps_clip, 1 + self.eps_clip) * advantage
+        # Subtract the entropy term so that minimizing the loss maximizes the policy entropy
+        actor_loss = -torch.min(surr1, surr2).mean() - self.entropy_coef * dist.entropy().mean()
+        critic_loss = (returns - values).pow(2).mean()
+        # Backpropagate and update both networks
+        self.actor_optimizer.zero_grad()
+        self.critic_optimizer.zero_grad()
+        actor_loss.backward()
+        critic_loss.backward()
+        self.actor_optimizer.step()
+        self.critic_optimizer.step()
+```
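
$\qquad$ The return computation in the listing can be tried in isolation. The sketch below mirrors that loop in plain Python; the function name `discounted_returns` and the toy rewards are assumptions made for illustration only.

```python
def discounted_returns(rewards, dones, gamma):
    """Discounted return for each step, restarting the sum at episode ends."""
    returns = []
    discounted_sum = 0.0
    # Walk backwards so each step can reuse the return of its successor
    for reward, done in zip(reversed(rewards), reversed(dones)):
        if done:
            discounted_sum = 0.0  # the next stored step belongs to a new episode
        discounted_sum = reward + gamma * discounted_sum
        returns.insert(0, discounted_sum)
    return returns

# Two steps of one episode (ending at index 1), then the first step of the next
returns = discounted_returns(rewards=[1.0, 1.0, 1.0], dones=[False, True, False], gamma=0.5)
print(returns)  # [1.5, 1.0, 1.0]
```

$\qquad$ The step at index 1 ends its episode, so its return is just its own reward, while the step before it adds the discounted successor return.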
+
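+$\qquad$ Note that because each sampled trajectory usually contains many samples, the update can use mini-batch stochastic gradient descent: the samples are randomly split into several parts, and the network parameters are updated one batch at a time. For brevity, Listing $\text{12-1}$ reuses the full batch in each of its `k_epochs` passes. In addition, when updating the $\text{Actor}$ parameters we add a regularization term that maximizes the policy entropy; its principle is covered in the next chapter. Finally, the training performance of the algorithm on $\text{CartPole}$ is shown in Figure $\text{12-2}$.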
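
$\qquad$ One way to realize the random split into mini-batches is sketched below. It is an assumption-laden illustration rather than the $\text{JoyRL}$ implementation: the generator `iterate_minibatches` is a hypothetical helper that yields shuffled index lists, which could then be used to index the state, action, and return tensors.

```python
import random

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield lists of shuffled sample indices, each at most batch_size long."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(0)
batches = list(iterate_minibatches(n_samples=10, batch_size=4, rng=rng))
for batch in batches:
    # In an update loop, each batch would index the collected tensors
    print(batch)
```

$\qquad$ Every sample is used exactly once per pass, and the last batch is simply shorter when the sample count is not a multiple of the batch size.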
+
+
+
+
+Figure $\text{12-2}$ Training curve of the $\text{PPO}$ algorithm in the $\text{CartPole}$ environment
+
+$\qquad$ As can be seen, compared with the $\text{A2C}$ algorithm, $\text{PPO}$ converges faster and more stably.
\ No newline at end of file
diff --git a/docs/ch13/main.md b/docs/ch13/main.md
index 2b64b97..02345e6 100644
--- a/docs/ch13/main.md
+++ b/docs/ch13/main.md
@@ -1,8 +1,8 @@
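-# SAC Algorithm
+# Chapter 13 SAC Algorithm
-$\qquad$ This chapter introduces the last of the classic policy-gradient algorithms, $\text{Soft Actor-Critic}$, abbreviated $\text{SAC}$. $\text{SAC}$ is a policy-gradient algorithm based on maximum-entropy reinforcement learning whose goal is to maximize the entropy of the policy, which makes the policy more robust. The core idea of $\text{SAC}$ is that maximizing the policy entropy makes the policy more robust, and with tuned hyperparameters $\text{SAC}$ can rival $\text{PPO}$ in stability. Note that because the theory of $\text{SAC}$ is somewhat more complex than that of the earlier algorithms, the derivations are much longer, but the final results are relatively concise, so readers may read selectively according to their needs and focus only on the meaning of the variables in the pseudocode and on the resulting formulas.
+$\qquad$ This chapter introduces the last of the classic policy-gradient algorithms, $\text{Soft Actor-Critic}$, abbreviated $\text{SAC}$. Compared with the previous two algorithms, $\text{SAC}$ is more complex, so this chapter involves considerably more derivations, although the final results are relatively concise. Readers may therefore read selectively according to their needs and focus only on the meaning of the variables in the pseudocode and on the resulting formulas. $\text{SAC}$ is a policy-gradient algorithm based on maximum-entropy reinforcement learning: its core idea is to maximize the entropy of the policy, which makes the policy more robust. With tuned hyperparameters, $\text{SAC}$ can rival $\text{PPO}$ in stability.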
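-## Maximum-Entropy Reinforcement Learning
+## 13.1 Maximum-Entropy Reinforcement Learning
$\qquad$ $\text{SAC}$ stands apart from the earlier policy-gradient algorithms in that it follows the maximum-entropy reinforcement learning approach. To help readers understand $\text{SAC}$, we first introduce maximum-entropy reinforcement learning and then start from the value-based $\text{Soft Q-Learning}$ algorithm. Recall deterministic and stochastic policies: a deterministic policy always selects the same action in the same state, whereas a stochastic policy may select among several possible actions in a given state. Have you ever wondered what the practical pros and cons of these two kinds of policy are, or which one is better? Setting aside concrete application scenarios for now, we summarize the strengths and weaknesses of the two kinds of policy themselves, beginning with deterministic policies: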
@@ -43,7 +43,7 @@ $$
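$\qquad$ It expresses how random the probability distribution of the stochastic policy $\pi\left(\cdot \mid \mathbf{s}_t\right)$ is: the more random the policy, the larger the entropy. As we will see later, although the theoretical derivation is fairly involved, the practical implementation is quite simple.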
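
$\qquad$ As a small numeric illustration, the sketch below computes the entropy of a few discrete action distributions, assuming the usual definition $\mathcal{H}(\pi) = -\sum_a \pi(a) \log \pi(a)$ for a discrete policy. The helper `policy_entropy` and the example probabilities are hypothetical.

```python
import math

def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    # Terms with zero probability contribute nothing to the entropy
    return sum(-p * math.log(p) for p in action_probs if p > 0.0)

uniform_entropy = policy_entropy([0.25, 0.25, 0.25, 0.25])    # fully random policy
peaked_entropy = policy_entropy([0.97, 0.01, 0.01, 0.01])     # nearly deterministic
deterministic_entropy = policy_entropy([1.0, 0.0, 0.0, 0.0])  # deterministic policy
print(f"uniform: {uniform_entropy:.3f}")
print(f"peaked: {peaked_entropy:.3f}")
print(f"deterministic: {deterministic_entropy:.3f}")
```

$\qquad$ The uniform policy attains the maximum value $\log 4$ for four actions, and the entropy falls to zero as the policy becomes deterministic.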
-## Soft Q-Learning
+## 13.2 Soft Q-Learning
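$\qquad$ In the previous section we introduced the expected cumulative reward with an entropy term. Next we need to re-derive the related quantities based on this redefined reward. As we will see, although the derivation is fairly involved, the code is simple, since it differs little from conventional $\text{Q-Learning}$. Readers focused on practical application can therefore skip the derivations in this section and go straight to the hands-on part later.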
@@ -152,7 +152,7 @@ $$
\end{aligned}
$$
-## SAC
+## 13.3 SAC
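$\qquad$ $\text{SAC}$ actually has two versions. The first was proposed by $\text{Tuomas Haarnoja}$ in $\text{2018}$①, and the second, also by $\text{Tuomas Haarnoja}$, in $\text{2019}$②; the latter is usually called $\text{SAC v2}$. The second version simplifies the first and adds automatic tuning of the temperature factor, which makes the algorithm simpler and more stable.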
@@ -205,7 +205,7 @@ $$
\end{aligned}
$$
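-## Automatic Temperature Tuning
+## 13.4 Automatic Temperature Tuning
$\qquad$ This section explains how to derive the version with automatic temperature tuning. The overall idea is simple: recast the problem as a constrained optimization problem and then simplify it with dynamic programming, Lagrange multipliers, and similar methods. Readers interested only in the result can skip to the gradient-descent formula for the temperature factor $\alpha$ at the end of this section.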
@@ -361,4 +361,122 @@ $$
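$\qquad$ This makes automatic tuning of the temperature factor possible. Because this version tunes the temperature automatically, it needs no additional $V$-value network and directly uses two $Q$ networks (including the target network and the current network) as the $\text{Critic}$ to estimate values.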
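
$\qquad$ The effect of the temperature update can be sketched in a few lines of plain Python. This sketch assumes the temperature objective takes the form used in $\text{SAC v2}$, $J(\alpha) = \mathbb{E}\left[-\alpha \log \pi(a \mid s) - \alpha \bar{\mathcal{H}}\right]$, where $\bar{\mathcal{H}}$ is a target entropy; the function names, learning rate, and sample log-probabilities are hypothetical, and practical implementations often optimize $\log \alpha$ instead of $\alpha$.

```python
def temperature_gradient(log_probs, target_entropy):
    """Gradient of J(alpha) = mean(-alpha * (log_prob + target_entropy)) w.r.t. alpha."""
    return -sum(lp + target_entropy for lp in log_probs) / len(log_probs)

def update_temperature(alpha, log_probs, target_entropy, lr=0.1):
    """One gradient-descent step on alpha, kept non-negative."""
    return max(alpha - lr * temperature_gradient(log_probs, target_entropy), 0.0)

# Policy entropy (about 0.5) below the target (1.0): alpha grows, rewarding exploration
alpha_up = update_temperature(alpha=0.2, log_probs=[-0.5, -0.5], target_entropy=1.0)
# Policy entropy (about 2.0) above the target (1.0): alpha shrinks
alpha_down = update_temperature(alpha=0.2, log_probs=[-2.0, -2.0], target_entropy=1.0)
print(alpha_up, alpha_down)
```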
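-## Hands-on: SAC Algorithm
\ No newline at end of file
+## 13.5 Hands-on: SAC Algorithm
+
+$\qquad$ In this hands-on section we focus on the second version of the $\text{SAC}$ algorithm, the one with automatic temperature tuning. Its pseudocode is shown in Figure $\text{13-1}$. The overall training procedure is fairly simple; it only requires defining some additional components, such as the one used to tune the temperature factor.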
+
+
+
+
+Figure $\text{13-1}$ Pseudocode of the $\text{SAC}$ algorithm
+
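+### 13.5.1 Defining the Models
+
+$\qquad$ First we define the $\text{Actor}$ and the $\text{Critic}$, that is, the policy network and the value network, which are essentially the same as in the $\text{A2C}$ algorithm, as shown in Listing $\text{13-1}$.
+
+
+ Listing $\text{13-1}$ $\text{Actor}$ and $\text{Critic}$ networks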
+
+
+```Python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.distributions import Normal
+
+class ValueNet(nn.Module):
+    def __init__(self, state_dim, hidden_dim, init_w=3e-3):
+        '''Value network'''
+        super(ValueNet, self).__init__()
+        self.linear1 = nn.Linear(state_dim, hidden_dim)  # input layer
+        self.linear2 = nn.Linear(hidden_dim, hidden_dim)  # hidden layer
+        self.linear3 = nn.Linear(hidden_dim, 1)
+
+        self.linear3.weight.data.uniform_(-init_w, init_w)  # initialize the output layer
+        self.linear3.bias.data.uniform_(-init_w, init_w)
+
+    def forward(self, state):
+        x = F.relu(self.linear1(state))
+        x = F.relu(self.linear2(x))
+        x = self.linear3(x)
+        return x
+
+class PolicyNet(nn.Module):
+    def __init__(self, state_dim, action_dim, hidden_dim, init_w=3e-3, log_std_min=-20, log_std_max=2):
+        super(PolicyNet, self).__init__()
+        self.log_std_min = log_std_min
+        self.log_std_max = log_std_max
+
+        self.linear1 = nn.Linear(state_dim, hidden_dim)
+        self.linear2 = nn.Linear(hidden_dim, hidden_dim)
+
+        # Output heads for the mean and log standard deviation, with small initial weights
+        self.mean_linear = nn.Linear(hidden_dim, action_dim)
+        self.mean_linear.weight.data.uniform_(-init_w, init_w)
+        self.mean_linear.bias.data.uniform_(-init_w, init_w)
+
+        self.log_std_linear = nn.Linear(hidden_dim, action_dim)
+        self.log_std_linear.weight.data.uniform_(-init_w, init_w)
+        self.log_std_linear.bias.data.uniform_(-init_w, init_w)
+
+    def forward(self, state):
+        x = F.relu(self.linear1(state))
+        x = F.relu(self.linear2(x))
+
+        mean = self.mean_linear(x)
+        log_std = self.log_std_linear(x)
+        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
+
+        return mean, log_std
+
+    def evaluate(self, state, epsilon=1e-6):
+        mean, log_std = self.forward(state)
+        std = log_std.exp()
+        # Sample with the reparameterization trick so gradients reach mean and std
+        normal = Normal(mean, std)
+        z = normal.rsample()
+        action = torch.tanh(z)
+        # Log-probability of the squashed action (change-of-variables correction for tanh)
+        log_prob = normal.log_prob(z) - torch.log(1 - action.pow(2) + epsilon)
+        log_prob = log_prob.sum(-1, keepdim=True)
+
+        return action, log_prob, z, mean, log_std
+
+    def get_action(self, state):
+        state = torch.FloatTensor(state).unsqueeze(0)
+        mean, log_std = self.forward(state)
+        std = log_std.exp()
+
+        normal = Normal(mean, std)
+        z = normal.sample()
+        action = torch.tanh(z)
+
+        action = action.detach().cpu().numpy()
+        return action[0]
+```
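
$\qquad$ The log-probability correction in `evaluate` follows from the change of variables $a = \tanh(z)$. As a sanity check, the scalar sketch below (plain Python, with assumed values for the mean and standard deviation and hypothetical function names) integrates the resulting density over $(-1, 1)$ and confirms that it is approximately normalized.

```python
import math

def normal_log_pdf(z, mean, std):
    """Log-density of N(mean, std^2) at z."""
    return -0.5 * ((z - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def squashed_log_prob(action, mean, std, epsilon=1e-6):
    """Log-density of action = tanh(z) with z ~ N(mean, std^2), for a scalar action."""
    z = math.atanh(action)
    # Subtract log(1 - tanh(z)^2), the log-derivative of tanh at z
    return normal_log_pdf(z, mean, std) - math.log(1.0 - action ** 2 + epsilon)

# Midpoint-rule integral of the action density over (-1, 1)
n_bins = 20_000
width = 2.0 / n_bins
total_probability = sum(
    math.exp(squashed_log_prob(-1.0 + (i + 0.5) * width, mean=0.3, std=0.5)) * width
    for i in range(n_bins)
)
print(f"integral of the squashed density: {total_probability:.4f}")
```

$\qquad$ Without the correction term, the same integral would not be close to one, which is why the term appears in the listing.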
+
+$\qquad$ We then define an additional $\text{Soft Q}$ network, as shown in Listing $\text{13-2}$.
+
+
+ Listing $\text{13-2}$ $\text{Soft Q}$ network
+
+
+```Python
+class SoftQNet(nn.Module):
+    def __init__(self, state_dim, action_dim, hidden_dim, init_w=3e-3):
+        '''Q network; state_dim, action_dim, hidden_dim, and init_w are the state dimension,
+        action dimension, hidden-layer dimension, and initial weight range, respectively
+        '''
+        super(SoftQNet, self).__init__()
+        self.linear1 = nn.Linear(state_dim + action_dim, hidden_dim)
+        self.linear2 = nn.Linear(hidden_dim, hidden_dim)
+        self.linear3 = nn.Linear(hidden_dim, 1)
+
+        self.linear3.weight.data.uniform_(-init_w, init_w)
+        self.linear3.bias.data.uniform_(-init_w, init_w)
+
+    def forward(self, state, action):
+        x = torch.cat([state, action], 1)  # concatenate state and action along the feature axis
+        x = F.relu(self.linear1(x))
+        x = F.relu(self.linear2(x))
+        x = self.linear3(x)
+        return x
+```
+
+### 13.5.2 Algorithm Update
+
+$\qquad$ Next, let us look at
\ No newline at end of file
diff --git a/docs/figs/ch12/PPO_Cartpole_training_curve.png b/docs/figs/ch12/PPO_Cartpole_training_curve.png
new file mode 100644
index 0000000..d4b4bf8
Binary files /dev/null and b/docs/figs/ch12/PPO_Cartpole_training_curve.png differ
diff --git a/docs/figs/ch12/ppo_pseu.png b/docs/figs/ch12/ppo_pseu.png
new file mode 100644
index 0000000..c1bba91
Binary files /dev/null and b/docs/figs/ch12/ppo_pseu.png differ
diff --git a/docs/figs/ch13/sac_pseu.png b/docs/figs/ch13/sac_pseu.png
new file mode 100644
index 0000000..67f2f91
Binary files /dev/null and b/docs/figs/ch13/sac_pseu.png differ
diff --git a/notebooks/PPO.ipynb b/notebooks/PPO.ipynb
new file mode 100644
index 0000000..94cadae
--- /dev/null
+++ b/notebooks/PPO.ipynb
@@ -0,0 +1,522 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# PPO on CartPole-v1 (discrete action space)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Define the algorithm"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1.1 Define the models"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch.nn as nn\n",
+ "import torch.nn.functional as F\n",
+ "class ActorSoftmax(nn.Module):\n",
+ "    def __init__(self, input_dim, output_dim, hidden_dim=256):\n",
+ "        super(ActorSoftmax, self).__init__()\n",
+ "        self.fc1 = nn.Linear(input_dim, hidden_dim)\n",
+ "        self.fc2 = nn.Linear(hidden_dim, hidden_dim)\n",
+ "        self.fc3 = nn.Linear(hidden_dim, output_dim)\n",
+ "    def forward(self,x):\n",
+ "        x = F.relu(self.fc1(x))\n",
+ "        x = F.relu(self.fc2(x))\n",
+ "        probs = F.softmax(self.fc3(x),dim=1)\n",
+ "        return probs\n",
+ "class Critic(nn.Module):\n",
+ "    def __init__(self,input_dim,output_dim,hidden_dim=256):\n",
+ "        super(Critic,self).__init__()\n",
+ "        assert output_dim == 1 # critic must output a single value\n",
+ "        self.fc1 = nn.Linear(input_dim, hidden_dim)\n",
+ "        self.fc2 = nn.Linear(hidden_dim, hidden_dim)\n",
+ "        self.fc3 = nn.Linear(hidden_dim, output_dim)\n",
+ "    def forward(self,x):\n",
+ "        x = F.relu(self.fc1(x))\n",
+ "        x = F.relu(self.fc2(x))\n",
+ "        value = self.fc3(x)\n",
+ "        return value"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1.2 Define the experience replay"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 58,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import random\n",
+ "from collections import deque\n",
+ "class ReplayBufferQue:\n",
+ "    '''Replay buffer for DQN; samples batch_size transitions at a time'''\n",
+ "    def __init__(self, capacity: int) -> None:\n",
+ "        self.capacity = capacity\n",
+ "        self.buffer = deque(maxlen=self.capacity)\n",
+ "    def push(self,transitions):\n",
+ "        '''Store one transition\n",
+ "        Args:\n",
+ "            transitions (tuple): a single transition to append to the buffer\n",
+ "        '''\n",
+ "        self.buffer.append(transitions)\n",
+ "    def sample(self, batch_size: int, sequential: bool = False):\n",
+ "        if batch_size > len(self.buffer):\n",
+ "            batch_size = len(self.buffer)\n",
+ "        if sequential: # sequential sampling\n",
+ "            rand = random.randint(0, len(self.buffer) - batch_size)\n",
+ "            batch = [self.buffer[i] for i in range(rand, rand + batch_size)]\n",
+ "            return zip(*batch)\n",
+ "        else:\n",
+ "            batch = random.sample(self.buffer, batch_size)\n",
+ "            return zip(*batch)\n",
+ "    def clear(self):\n",
+ "        self.buffer.clear()\n",
+ "    def __len__(self):\n",
+ "        return len(self.buffer)\n",
+ "\n",
+ "class PGReplay(ReplayBufferQue):\n",
+ "    '''Replay buffer for policy-gradient methods; every call returns all stored transitions,\n",
+ "    so it only inherits from ReplayBufferQue and overrides sample\n",
+ "    '''\n",
+ "    def __init__(self):\n",
+ "        self.buffer = deque()\n",
+ "    def sample(self):\n",
+ "        ''' sample all the transitions\n",
+ "        '''\n",
+ "        batch = list(self.buffer)\n",
+ "        return zip(*batch)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1.3 Define the agent"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import torch\n",
+ "from torch.distributions import Categorical\n",
+ "class Agent:\n",
+ "    def __init__(self,cfg) -> None:\n",
+ "\n",
+ "        self.gamma = cfg.gamma\n",
+ "        self.device = torch.device(cfg.device)\n",
+ "        self.actor = ActorSoftmax(cfg.n_states,cfg.n_actions, hidden_dim = cfg.actor_hidden_dim).to(self.device)\n",
+ "        self.critic = Critic(cfg.n_states,1,hidden_dim=cfg.critic_hidden_dim).to(self.device)\n",
+ "        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=cfg.actor_lr)\n",
+ "        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=cfg.critic_lr)\n",
+ "        self.memory = PGReplay()\n",
+ "        self.k_epochs = cfg.k_epochs # update policy for K epochs\n",
+ "        self.eps_clip = cfg.eps_clip # clip parameter for PPO\n",
+ "        self.entropy_coef = cfg.entropy_coef # entropy coefficient\n",
+ "        self.sample_count = 0\n",
+ "        self.update_freq = cfg.update_freq\n",
+ "\n",
+ "    def sample_action(self,state):\n",
+ "        self.sample_count += 1\n",
+ "        state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)\n",
+ "        probs = self.actor(state)\n",
+ "        dist = Categorical(probs)\n",
+ "        action = dist.sample()\n",
+ "        self.log_probs = dist.log_prob(action).detach()\n",
+ "        return action.detach().cpu().numpy().item()\n",
+ "    @torch.no_grad()\n",
+ "    def predict_action(self,state):\n",
+ "        state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)\n",
+ "        probs = self.actor(state)\n",
+ "        dist = Categorical(probs)\n",
+ "        action = dist.sample()\n",
+ "        return action.detach().cpu().numpy().item()\n",
+ "    def update(self):\n",
+ "        # update policy every n steps\n",
+ "        if self.sample_count % self.update_freq != 0:\n",
+ "            return\n",
+ "        # print(\"update policy\")\n",
+ "        old_states, old_actions, old_log_probs, old_rewards, old_dones = self.memory.sample()\n",
+ "        # convert to tensor\n",
+ "        old_states = torch.tensor(np.array(old_states), device=self.device, dtype=torch.float32)\n",
+ "        old_actions = torch.tensor(np.array(old_actions), device=self.device, dtype=torch.float32)\n",
+ "        old_log_probs = torch.tensor(old_log_probs, device=self.device, dtype=torch.float32)\n",
+ "        # monte carlo estimate of state rewards\n",
+ "        returns = []\n",
+ "        discounted_sum = 0\n",
+ "        for reward, done in zip(reversed(old_rewards), reversed(old_dones)):\n",
+ "            if done:\n",
+ "                discounted_sum = 0\n",
+ "            discounted_sum = reward + (self.gamma * discounted_sum)\n",
+ "            returns.insert(0, discounted_sum)\n",
+ "        # Normalizing the rewards:\n",
+ "        returns = torch.tensor(returns, device=self.device, dtype=torch.float32)\n",
+ "        returns = (returns - returns.mean()) / (returns.std() + 1e-5) # 1e-5 to avoid division by zero\n",
+ "        for _ in range(self.k_epochs):\n",
+ "            # compute advantage; squeeze values from [N, 1] to [N] so they match returns\n",
+ "            values = self.critic(old_states).squeeze(-1)\n",
+ "            advantage = returns - values.detach() # detach to avoid backprop through the critic\n",
+ "            # get action probabilities\n",
+ "            probs = self.actor(old_states)\n",
+ "            dist = Categorical(probs)\n",
+ "            # get new action log-probabilities\n",
+ "            new_log_probs = dist.log_prob(old_actions)\n",
+ "            # compute ratio (pi_theta / pi_theta__old):\n",
+ "            ratio = torch.exp(new_log_probs - old_log_probs) # old_log_probs must be detached\n",
+ "            # compute surrogate loss\n",
+ "            surr1 = ratio * advantage\n",
+ "            surr2 = torch.clamp(ratio, 1 - self.eps_clip, 1 + self.eps_clip) * advantage\n",
+ "            # compute actor loss; the entropy term is subtracted so that entropy is maximized\n",
+ "            actor_loss = -torch.min(surr1, surr2).mean() - self.entropy_coef * dist.entropy().mean()\n",
+ "            # compute critic loss\n",
+ "            critic_loss = (returns - values).pow(2).mean()\n",
+ "            # take gradient step\n",
+ "            self.actor_optimizer.zero_grad()\n",
+ "            self.critic_optimizer.zero_grad()\n",
+ "            actor_loss.backward()\n",
+ "            critic_loss.backward()\n",
+ "            self.actor_optimizer.step()\n",
+ "            self.critic_optimizer.step()\n",
+ "        self.memory.clear()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Define training"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import copy\n",
+ "def train(cfg, env, agent):\n",
+ "    ''' Training loop\n",
+ "    '''\n",
+ "    print(\"开始训练!\")\n",
+ "    rewards = [] # rewards of all episodes\n",
+ "    steps = []\n",
+ "    best_ep_reward = 0 # best evaluation reward so far\n",
+ "    output_agent = None\n",
+ "    for i_ep in range(cfg.train_eps):\n",
+ "        ep_reward = 0 # reward accumulated within one episode\n",
+ "        ep_step = 0\n",
+ "        state = env.reset() # reset the environment and get the initial state\n",
+ "        for _ in range(cfg.max_steps):\n",
+ "            ep_step += 1\n",
+ "            action = agent.sample_action(state) # select an action\n",
+ "            next_state, reward, done, _ = env.step(action) # step the environment and get the transition\n",
+ "            agent.memory.push((state, action,agent.log_probs,reward,done)) # store the transition\n",
+ "            state = next_state # move to the next state\n",
+ "            agent.update() # update the agent\n",
+ "            ep_reward += reward # accumulate the reward\n",
+ "            if done:\n",
+ "                break\n",
+ "        if (i_ep+1)%cfg.eval_per_episode == 0:\n",
+ "            sum_eval_reward = 0\n",
+ "            for _ in range(cfg.eval_eps):\n",
+ "                eval_ep_reward = 0\n",
+ "                state = env.reset()\n",
+ "                for _ in range(cfg.max_steps):\n",
+ "                    action = agent.predict_action(state) # select an action\n",
+ "                    next_state, reward, done, _ = env.step(action) # step the environment and get the transition\n",
+ "                    state = next_state # move to the next state\n",
+ "                    eval_ep_reward += reward # accumulate the reward\n",
+ "                    if done:\n",
+ "                        break\n",
+ "                sum_eval_reward += eval_ep_reward\n",
+ "            mean_eval_reward = sum_eval_reward/cfg.eval_eps\n",
+ "            if mean_eval_reward >= best_ep_reward:\n",
+ "                best_ep_reward = mean_eval_reward\n",
+ "                output_agent = copy.deepcopy(agent)\n",
+ "                print(f\"回合:{i_ep+1}/{cfg.train_eps},奖励:{ep_reward:.2f},评估奖励:{mean_eval_reward:.2f},最佳评估奖励:{best_ep_reward:.2f},更新模型!\")\n",
+ "            else:\n",
+ "                print(f\"回合:{i_ep+1}/{cfg.train_eps},奖励:{ep_reward:.2f},评估奖励:{mean_eval_reward:.2f},最佳评估奖励:{best_ep_reward:.2f}\")\n",
+ "        steps.append(ep_step)\n",
+ "        rewards.append(ep_reward)\n",
+ "    print(\"完成训练!\")\n",
+ "    env.close()\n",
+ "    return output_agent,{'rewards':rewards}\n",
+ "\n",
+ "def test(cfg, env, agent):\n",
+ "    print(\"开始测试!\")\n",
+ "    rewards = [] # rewards of all episodes\n",
+ "    steps = []\n",
+ "    for i_ep in range(cfg.test_eps):\n",
+ "        ep_reward = 0 # reward accumulated within one episode\n",
+ "        ep_step = 0\n",
+ "        state = env.reset() # reset the environment and get the initial state\n",
+ "        for _ in range(cfg.max_steps):\n",
+ "            ep_step+=1\n",
+ "            action = agent.predict_action(state) # select an action\n",
+ "            next_state, reward, done, _ = env.step(action) # step the environment and get the transition\n",
+ "            state = next_state # move to the next state\n",
+ "            ep_reward += reward # accumulate the reward\n",
+ "            if done:\n",
+ "                break\n",
+ "        steps.append(ep_step)\n",
+ "        rewards.append(ep_reward)\n",
+ "        print(f\"回合:{i_ep+1}/{cfg.test_eps},奖励:{ep_reward:.2f}\")\n",
+ "    print(\"完成测试\")\n",
+ "    env.close()\n",
+ "    return {'rewards':rewards}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Define the environment"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 61,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import gym\n",
+ "import os\n",
+ "import numpy as np\n",
+ "def all_seed(env,seed = 1):\n",
+ "    ''' Seed every source of randomness\n",
+ "    '''\n",
+ "    if seed == 0:\n",
+ "        return\n",
+ "    env.seed(seed) # env config\n",
+ "    np.random.seed(seed)\n",
+ "    random.seed(seed)\n",
+ "    torch.manual_seed(seed) # config for CPU\n",
+ "    torch.cuda.manual_seed(seed) # config for GPU\n",
+ "    os.environ['PYTHONHASHSEED'] = str(seed) # config for python scripts\n",
+ "    # config for cudnn\n",
+ "    torch.backends.cudnn.deterministic = True\n",
+ "    torch.backends.cudnn.benchmark = False\n",
+ "    torch.backends.cudnn.enabled = False\n",
+ "def env_agent_config(cfg):\n",
+ "    env = gym.make(cfg.env_name) # create the environment\n",
+ "    all_seed(env,seed=cfg.seed)\n",
+ "    n_states = env.observation_space.shape[0]\n",
+ "    n_actions = env.action_space.n\n",
+ "    print(f\"状态空间维度:{n_states},动作空间维度:{n_actions}\")\n",
+ "    # store n_states and n_actions in cfg\n",
+ "    setattr(cfg, 'n_states', n_states)\n",
+ "    setattr(cfg, 'n_actions', n_actions)\n",
+ "    agent = Agent(cfg)\n",
+ "    return env,agent"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. Set the parameters"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "class Config:\n",
+ "    def __init__(self) -> None:\n",
+ "        self.env_name = \"CartPole-v1\" # environment name\n",
+ "        self.new_step_api = False # whether to use the new gym API\n",
+ "        self.algo_name = \"PPO\" # algorithm name\n",
+ "        self.mode = \"train\" # train or test\n",
+ "        self.seed = 1 # random seed\n",
+ "        self.device = \"cuda\" # device to use\n",
+ "        self.train_eps = 200 # number of training episodes\n",
+ "        self.test_eps = 20 # number of test episodes\n",
+ "        self.max_steps = 200 # maximum number of steps per episode\n",
+ "        self.eval_eps = 5 # number of evaluation episodes\n",
+ "        self.eval_per_episode = 10 # evaluate every this many training episodes\n",
+ "\n",
+ "        self.gamma = 0.99 # discount factor\n",
+ "        self.k_epochs = 4 # number of policy update epochs per batch\n",
+ "        self.actor_lr = 0.0003 # learning rate of the actor network\n",
+ "        self.critic_lr = 0.0003 # learning rate of the critic network\n",
+ "        self.eps_clip = 0.2 # epsilon-clip\n",
+ "        self.entropy_coef = 0.01 # entropy coefficient\n",
+ "        self.update_freq = 100 # update every this many sampled steps\n",
+ "        self.actor_hidden_dim = 256 # hidden-layer dimension of the actor network\n",
+ "        self.critic_hidden_dim = 256 # hidden-layer dimension of the critic network\n",
+ "\n",
+ "def smooth(data, weight=0.9):\n",
+ "    '''Smooth a curve, similar to the smoothing in Tensorboard\n",
+ "    '''\n",
+ "    last = data[0]\n",
+ "    smoothed = []\n",
+ "    for point in data:\n",
+ "        smoothed_val = last * weight + (1 - weight) * point # exponential moving average\n",
+ "        smoothed.append(smoothed_val)\n",
+ "        last = smoothed_val\n",
+ "    return smoothed\n",
+ "\n",
+ "def plot_rewards(rewards,cfg, tag='train'):\n",
+ "    ''' Plot the reward curves\n",
+ "    '''\n",
+ "    sns.set()\n",
+ "    plt.figure() # create a new figure so that several plots can be drawn\n",
+ "    plt.title(f\"{tag}ing curve on {cfg.device} of {cfg.algo_name} for {cfg.env_name}\")\n",
+ "    plt.xlabel('episodes')\n",
+ "    plt.plot(rewards, label='rewards')\n",
+ "    plt.plot(smooth(rewards), label='smoothed')\n",
+ "    plt.legend()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Start training"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 63,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "状态空间维度:4,动作空间维度:2\n",
+ "开始训练!\n",
+ "回合:10/200,奖励:11.00,评估奖励:29.20,最佳评估奖励:29.20,更新模型!\n",
+ "回合:20/200,奖励:68.00,评估奖励:25.00,最佳评估奖励:29.20\n",
+ "回合:30/200,奖励:60.00,评估奖励:26.20,最佳评估奖励:29.20\n",
+ "回合:40/200,奖励:105.00,评估奖励:27.60,最佳评估奖励:29.20\n",
+ "回合:50/200,奖励:26.00,评估奖励:60.60,最佳评估奖励:60.60,更新模型!\n",
+ "回合:60/200,奖励:122.00,评估奖励:113.40,最佳评估奖励:113.40,更新模型!\n",
+ "回合:70/200,奖励:65.00,评估奖励:38.00,最佳评估奖励:113.40\n",
+ "回合:80/200,奖励:175.00,评估奖励:135.40,最佳评估奖励:135.40,更新模型!\n",
+ "回合:90/200,奖励:200.00,评估奖励:177.20,最佳评估奖励:177.20,更新模型!\n",
+ "回合:100/200,奖励:115.00,评估奖励:173.60,最佳评估奖励:177.20\n",
+ "回合:110/200,奖励:200.00,评估奖励:183.20,最佳评估奖励:183.20,更新模型!\n",
+ "回合:120/200,奖励:196.00,评估奖励:173.60,最佳评估奖励:183.20\n",
+ "回合:130/200,奖励:46.00,评估奖励:61.40,最佳评估奖励:183.20\n",
+ "回合:140/200,奖励:200.00,评估奖励:166.40,最佳评估奖励:183.20\n",
+ "回合:150/200,奖励:172.00,评估奖励:154.40,最佳评估奖励:183.20\n",
+ "回合:160/200,奖励:61.00,评估奖励:84.80,最佳评估奖励:183.20\n",
+ "回合:170/200,奖励:127.00,评估奖励:181.60,最佳评估奖励:183.20\n",
+ "回合:180/200,奖励:152.00,评估奖励:173.20,最佳评估奖励:183.20\n",
+ "回合:190/200,奖励:200.00,评估奖励:200.00,最佳评估奖励:200.00,更新模型!\n",
+ "回合:200/200,奖励:176.00,评估奖励:190.20,最佳评估奖励:200.00\n",
+ "完成训练!\n",
+ "开始测试!\n",
+ "回合:1/20,奖励:200.00\n",
+ "回合:2/20,奖励:200.00\n",
+ "回合:3/20,奖励:200.00\n",
+ "回合:4/20,奖励:200.00\n",
+ "回合:5/20,奖励:200.00\n",
+ "回合:6/20,奖励:200.00\n",
+ "回合:7/20,奖励:200.00\n",
+ "回合:8/20,奖励:200.00\n",
+ "回合:9/20,奖励:200.00\n",
+ "回合:10/20,奖励:200.00\n",
+ "回合:11/20,奖励:200.00\n",
+ "回合:12/20,奖励:200.00\n",
+ "回合:13/20,奖励:200.00\n",
+ "回合:14/20,奖励:200.00\n",
+ "回合:15/20,奖励:200.00\n",
+ "回合:16/20,奖励:200.00\n",
+ "回合:17/20,奖励:200.00\n",
+ "回合:18/20,奖励:200.00\n",
+ "回合:19/20,奖励:200.00\n",
+ "回合:20/20,奖励:200.00\n",
+ "完成测试\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "