
Commit

Update the DDPG algorithm
johnjim0816 committed Sep 9, 2023
1 parent 41f2b3f commit 51be956
Showing 13 changed files with 1,322 additions and 39 deletions.
2 changes: 1 addition & 1 deletion docs/_sidebar.md
@@ -10,6 +10,6 @@
- [Chapter 8: Advanced DQN Algorithms](/ch8/main.md)
- [Chapter 9: Policy Gradient](/ch9/main.md)
- [Chapter 10: Actor-Critic Algorithms](/ch10/main.md)
- [Chapter 11: DDPG and TD3 Algorithms](/ch11/main.md)
- [Chapter 11: DDPG Algorithm](/ch11/main.md)
- [Chapter 12: PPO Algorithm](/ch12/main.md)
- [Chapter 13: SAC Algorithm](/ch13/main.md)
264 changes: 234 additions & 30 deletions docs/ch11/main.md

Large diffs are not rendered by default.

Binary file added docs/ch11/main.pptx
Binary file not shown.
Binary file added docs/figs/ch11/DDPG_Pendulum_training_curve.png
Binary file added docs/figs/ch11/Pendulum.png
Binary file added docs/figs/ch11/TD3_Pendulum_training_curve.png
Binary file added docs/figs/ch11/ddpg.png
Binary file added docs/figs/ch11/ddpg_actor.png
Binary file added docs/figs/ch11/ddpg_pseu.png
559 changes: 559 additions & 0 deletions notebooks/DDPG.ipynb

Large diffs are not rendered by default.

520 changes: 520 additions & 0 deletions notebooks/TD3.ipynb

Large diffs are not rendered by default.

Binary file modified pseudocodes/pseudo_without_notes.pdf
Binary file not shown.
16 changes: 8 additions & 8 deletions pseudocodes/pseudo_without_notes.tex
@@ -387,29 +387,29 @@ \section{PPO-KL Divergence Algorithm}
\clearpage
\section{DDPG Algorithm}
\begin{algorithm}[H] % [H] fixes the float position
- \floatname{algorithm}{{DDPG Algorithm}\footnotemark[1]}
+ \floatname{algorithm}{{DDPG Algorithm}}
\renewcommand{\thealgorithm}{} % remove the algorithm number
\caption{}
\begin{algorithmic}[1] % [1] shows step numbers
\STATE Initialize the critic network $Q\left(s, a \mid \theta^Q\right)$ and the actor network $\mu(s|\theta^{\mu})$ with parameters $\theta^Q$ and $\theta^{\mu}$
\STATE Initialize the corresponding target network parameters: $\theta^{Q^{\prime}} \leftarrow \theta^Q$, $\theta^{\mu^{\prime}} \leftarrow \theta^\mu$
- \STATE Initialize the replay buffer $R$
+ \STATE Initialize the replay buffer $D$
\FOR {episode $= 1, M$}
\STATE {\bfseries Interact and sample:}
\STATE Select the action $a_t=\mu\left(s_t \mid \theta^\mu\right)+\mathcal{N}_t$, where $\mathcal{N}_t$ is exploration noise
\STATE The environment returns the reward $r_t$ and the next state $s_{t+1}$ in response to $a_t$
- \STATE Store the transition $(s_t,a_t,r_t,s_{t+1})$ in the replay buffer $R$
+ \STATE Store the sample $(s_t,a_t,r_t,s_{t+1})$ in the replay buffer $D$
\STATE Update the current state: $s_t \leftarrow s_{t+1}$
- \STATE {\bfseries Update the policy:}
- \STATE Sample a random minibatch of $(s_i,a_i,r_i,s_{i+1})$ from $R$
+ \STATE {\bfseries Policy update:}
+ \STATE Sample a random minibatch of $(s_i,a_i,r_i,s_{i+1})$ from $D$
\STATE Compute $y_i=r_i+\gamma Q^{\prime}\left(s_{i+1}, \mu^{\prime}\left(s_{i+1} \mid \theta^{\mu^{\prime}}\right) \mid \theta^{Q^{\prime}}\right)$
- \STATE Update the critic parameters with the loss $L=\frac{1}{N} \sum_i\left(y_i-Q\left(s_i, a_i \mid \theta^Q\right)\right)^2$
- \STATE Update the actor parameters: $\left.\left.\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q\left(s, a \mid \theta^Q\right)\right|_{s=s_i, a=\mu\left(s_i\right)} \nabla_{\theta^\mu} \mu\left(s \mid \theta^\mu\right)\right|_{s_i}$
+ \STATE Update the $\text{critic}$ parameters with the loss $L=\frac{1}{N} \sum_i\left(y_i-Q\left(s_i, a_i \mid \theta^Q\right)\right)^2$
+ \STATE Update the $\text{actor}$ parameters: $\left.\left.\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q\left(s, a \mid \theta^Q\right)\right|_{s=s_i, a=\mu\left(s_i\right)} \nabla_{\theta^\mu} \mu\left(s \mid \theta^\mu\right)\right|_{s_i}$
\STATE Soft-update the target networks: $\theta^{Q^{\prime}} \leftarrow \tau \theta^Q+(1-\tau) \theta^{Q^{\prime}}$, $\theta^{\mu^{\prime}} \leftarrow \tau \theta^\mu+(1-\tau) \theta^{\mu^{\prime}}$
\ENDFOR
\end{algorithmic}
\end{algorithm}
- \footnotetext[1]{Continuous control with deep reinforcement learning}
\clearpage
\section{SoftQ Algorithm}
\begin{algorithm}[H]
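The DDPG pseudocode in the diff above maps almost line for line onto a few lines of PyTorch. Below is a minimal sketch of the core pieces — noisy action selection, the target value $y_i$, the critic and actor updates, and the soft target update — written as an illustration for this page rather than taken from the notebooks/DDPG.ipynb added in this commit; the network sizes, the names `Actor`, `Critic`, `select_action`, `ddpg_update`, `soft_update`, and the Pendulum-style action bound of 2.0 are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Deterministic policy mu(s | theta^mu), squashed into [-action_bound, action_bound]."""
    def __init__(self, state_dim, action_dim, action_bound=2.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.action_bound = action_bound

    def forward(self, state):
        return self.net(state) * self.action_bound

class Critic(nn.Module):
    """Action-value function Q(s, a | theta^Q); takes the concatenated state and action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def soft_update(target, source, tau=0.005):
    # theta' <- tau * theta + (1 - tau) * theta'
    for t_param, param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(tau * param.data + (1.0 - tau) * t_param.data)

def select_action(actor, state, noise_std=0.1, action_bound=2.0):
    # a_t = mu(s_t | theta^mu) + N_t, using Gaussian exploration noise
    with torch.no_grad():
        action = actor(state)
        action = action + noise_std * torch.randn_like(action)
    return action.clamp(-action_bound, action_bound)

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    # Batch tensors are assumed to have shape (N, dim); rewards have shape (N, 1).
    state, action, reward, next_state = batch
    # Target value y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = reward + gamma * target_critic(next_state, target_actor(next_state))
    # Critic loss: mean squared error between y_i and Q(s_i, a_i)
    critic_loss = F.mse_loss(critic(state, action), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # Actor loss: ascend Q(s, mu(s)) by descending its negation
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # Soft-update both target networks
    soft_update(target_critic, critic, tau)
    soft_update(target_actor, actor, tau)
```

Minimizing the negated critic value is the autograd form of the deterministic policy gradient line in the pseudocode: backpropagating through $Q(s, \mu(s))$ yields exactly $\nabla_a Q \cdot \nabla_{\theta^\mu}\mu$. The target networks are assumed to start as copies of the online networks (e.g. via `load_state_dict`), matching the initialization step above; handling of terminal states (masking the bootstrap term) is omitted here, as it is in the pseudocode.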

0 comments on commit 51be956
