class: middle, center, title-slide
Lecture: Generative adversarial networks (optional)
Prof. Gilles Louppe
[email protected]
class: middle
.center.italic["Generative adversarial networks is the coolest idea
in deep learning in the last 20 years."]
.pull-right[Yann LeCun, 2018.]
Learn a model of the data.
- Generative adversarial networks
- Numerics of GANs
- State of the art
- Applications
class: middle
class: middle
.center[.width-30[] .width-30[]]
class: middle
In generative adversarial networks (GANs), the task of learning a generative model is expressed as a two-player zero-sum game between two networks.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
The first network is a generator $g(\cdot;\theta) : \mathcal{Z} \to \mathcal{X}$, mapping a latent space equipped with a prior distribution $p(\mathbf{z})$ to the data space, thereby inducing a distribution of generated samples $\mathbf{x} \sim q(\mathbf{x};\theta)$.

The second network $d(\cdot;\phi) : \mathcal{X} \to [0,1]$ is a classifier trained to distinguish between true samples $\mathbf{x} \sim p(\mathbf{x})$ and generated samples $\mathbf{x} \sim q(\mathbf{x};\theta)$.
class: middle
For a fixed generator $g$, the classifier $d$ can be trained by assembling a two-class data set of true samples $\mathbf{x} \sim p(\mathbf{x})$ and generated samples $\mathbf{x} \sim q(\mathbf{x};\theta)$, and by minimizing the binary cross-entropy with respect to $\phi$.

However, the situation is slightly more complicated since we also want to train $g$ to fool $d$, which calls for a two-player game formulation.
class: middle
Let us consider the value function
$$V(\phi, \theta) = \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[\log d(\mathbf{x};\phi)\right] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[\log (1 - d(g(\mathbf{z};\theta);\phi))\right].$$
- For a fixed $g$, $V(\phi, \theta)$ is high if $d$ is good at recognizing true from generated samples.
- If $d$ is the best classifier given $g$, and if $V$ is high, then this implies that the generator is bad at reproducing the data distribution.
- Conversely, $g$ will be a good generative model if $V$ is low when $d$ is a perfect opponent.
Therefore, the ultimate goal is
$$\theta^* = \arg \min_\theta \max_\phi V(\phi, \theta).$$
class: middle
In practice, the minimax solution is approximated using alternating stochastic gradient descent: $$ \begin{aligned} \theta &\leftarrow \theta - \gamma \nabla_\theta V(\phi, \theta) \\ \phi &\leftarrow \phi + \gamma \nabla_\phi V(\phi, \theta), \end{aligned} $$ where gradients are estimated with Monte Carlo integration.
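To make the alternating procedure concrete, here is a minimal sketch (a toy setup of our own, not the original implementation) of the two updates in PyTorch, where $V$ is estimated by Monte Carlo over a minibatch and the data distribution is replaced by a placeholder Gaussian:

```python
# A minimal, hypothetical sketch of the alternating updates above (PyTorch),
# with V(phi, theta) estimated by Monte Carlo over a minibatch.
import torch
import torch.nn as nn

g = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))                # generator
d = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())  # classifier
opt_g = torch.optim.SGD(g.parameters(), lr=1e-3)
opt_d = torch.optim.SGD(d.parameters(), lr=1e-3)

def V(x_real, x_fake):
    # V(phi, theta) = E_p[log d(x)] + E_q[log(1 - d(x))]
    return torch.log(d(x_real)).mean() + torch.log(1 - d(x_fake)).mean()

for step in range(1000):
    x_real = torch.randn(128, 2) + 3.0   # stand-in for samples from p(x)
    x_fake = g(torch.randn(128, 8))      # samples from q(x; theta)

    # Ascent step on phi: maximize V (fake samples are detached).
    opt_d.zero_grad()
    (-V(x_real, x_fake.detach())).backward()
    opt_d.step()

    # Descent step on theta: minimize V (gradients flow through d into g).
    opt_g.zero_grad()
    V(x_real, g(torch.randn(128, 8))).backward()
    opt_g.step()
```

Note how the descent step on $\theta$ backpropagates all the way through $d$ into $g$.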
???
- For one step on $\theta$, we can optionally take $k$ steps on $\phi$, since we need the classifier to remain near optimal.
- Note that to compute $\nabla_\theta V(\phi, \theta)$, it is necessary to backprop all the way through $d$ before computing the partial derivatives with respect to $g$'s internals.
class: middle
.footnote[Credits: Goodfellow et al, Generative Adversarial Networks, 2014.]
class: middle
For a generator $g$ fixed at $\theta$, the classifier $d$ with parameters $\phi^*_\theta$ is optimal if and only if
$$\forall \mathbf{x},\, d(\mathbf{x};\phi^*_\theta) = \frac{p(\mathbf{x})}{q(\mathbf{x};\theta) + p(\mathbf{x})}.$$
class: middle
Therefore,
$$\begin{aligned}
&\min_\theta \max_\phi V(\phi, \theta) = \min_\theta V(\phi^*_\theta, \theta) \\
&= \min_\theta \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[ \log \frac{p(\mathbf{x})}{q(\mathbf{x};\theta) + p(\mathbf{x})} \right] + \mathbb{E}_{\mathbf{x} \sim q(\mathbf{x};\theta)}\left[ \log \frac{q(\mathbf{x};\theta)}{q(\mathbf{x};\theta) + p(\mathbf{x})} \right] \\
&= \min_\theta \text{KL}\left(p(\mathbf{x}) || \frac{p(\mathbf{x}) + q(\mathbf{x};\theta)}{2}\right) \\
&\quad\quad\quad+ \text{KL}\left(q(\mathbf{x};\theta) || \frac{p(\mathbf{x}) + q(\mathbf{x};\theta)}{2}\right) -\log 4\\
&= \min_\theta 2\, \text{JSD}(p(\mathbf{x}) || q(\mathbf{x};\theta)) - \log 4
\end{aligned}$$
where $\text{JSD}$ is the Jensen-Shannon divergence.
class: middle
In summary, $$ \begin{aligned} \theta^* &= \arg \min_\theta \max_\phi V(\phi, \theta) \\ &= \arg \min_\theta \text{JSD}(p(\mathbf{x}) || q(\mathbf{x};\theta)). \end{aligned}$$
Since the Jensen-Shannon divergence is minimized if and only if the two distributions are equal, the minimax solution corresponds to a generative model that perfectly reproduces the true data distribution.
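As a quick numerical sanity check of the identity above (an illustrative sketch with two arbitrary discrete distributions, not part of the original derivation):

```python
# Check that E_p[log p/(p+q)] + E_q[log q/(p+q)] = 2 JSD(p||q) - log 4
# for two discrete distributions.
import numpy as np

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
m = (p + q) / 2

kl = lambda a, b: np.sum(a * np.log(a / b))
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)

lhs = np.sum(p * np.log(p / (p + q))) + np.sum(q * np.log(q / (p + q)))
print(np.isclose(lhs, 2 * jsd - np.log(4)))  # True
```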
class: middle, center
(demo)
class: middle
.footnote[Credits: Goodfellow et al, Generative Adversarial Networks, 2014.]
class: middle
.footnote[Credits: Radford et al, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015.]
class: middle
.footnote[Credits: Radford et al, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015.]
class: middle
Training a standard GAN often results in pathological behaviors:
- Oscillations without convergence: contrary to standard loss minimization, alternating stochastic gradient descent has no guarantee of convergence.
- Vanishing gradients: when the classifier $d$ is too good, the value function saturates and we end up with no gradient to update the generator.
- Mode collapse: the generator $g$ models very well a small sub-population, concentrating on a few modes of the data distribution.
- Performance is also difficult to assess in practice.
.center.width-100[![](figures/archives-lec-gan/mode-collapse.png)]
.center[Mode collapse (Metz et al, 2016)]
class: middle
While early results (2014-2016) were already impressive, a close inspection of the fake samples distribution $q(\mathbf{x};\theta)$ often revealed fundamental issues highlighting architectural limitations.
class: middle
.center[Cherry-picks]
.footnote[Credits: Ian Goodfellow, 2016.]
class: middle
.center[Problems with counting]
.footnote[Credits: Ian Goodfellow, 2016.]
class: middle
.center[Problems with perspective]
.footnote[Credits: Ian Goodfellow, 2016.]
class: middle
.center[Problems with global structures]
.footnote[Credits: Ian Goodfellow, 2016.]
class: middle, inactive
count: false
exclude: true
(optional)
class: middle
count: false
exclude: true
For most non-toy data distributions, the fake samples $\mathbf{x} \sim q(\mathbf{x};\theta)$ may be so unrealistic early on that the response of $d$ saturates.

At the limit, when $d$ is a perfect classifier for the current generator $g$, $d(\mathbf{x};\phi) = 1$ for all $\mathbf{x} \sim p(\mathbf{x})$ and $d(\mathbf{x};\phi) = 0$ for all $\mathbf{x} \sim q(\mathbf{x};\theta)$, so that $V(\phi, \theta)$ saturates and $\nabla_\theta V(\phi, \theta) \approx 0$: the generator no longer receives a useful training signal.
class: middle
count: false
exclude: true
Dilemma:
- If $d$ is bad, then $g$ does not have accurate feedback and the loss function cannot represent the reality.
- If $d$ is too good, the gradients drop to 0, thereby slowing down or even halting the optimization.
class: middle
count: false
exclude: true
For any two distributions $p$ and $q$, $0 \leq JSD(p||q) \leq \log 2$, where
- $JSD(p||q)=0$ if and only if $p=q$,
- $JSD(p||q)=\log 2$ if and only if $p$ and $q$ have disjoint supports.
class: middle
count: false
exclude: true
Notice how the Jensen-Shannon divergence poorly accounts for the metric structure of the space.
Intuitively, instead of comparing distributions "vertically", we would like to compare them "horizontally".
class: middle
count: false
exclude: true
An alternative choice is the Earth mover's distance, which intuitively corresponds to the minimum mass displacement to transform one distribution into the other.
$p = \frac{1}{4}\mathbf{1}_{[1,2]} + \frac{1}{4}\mathbf{1}_{[3,4]} + \frac{1}{2}\mathbf{1}_{[9,10]}$ $q = \mathbf{1}_{[5,7]}$
Then,
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
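For a rough numerical check of this example (a sketch assuming $q$ is normalized to a proper density, i.e. of height $1/2$ on $[5,7]$), the distance can be approximated by sampling:

```python
# Approximate the Earth mover's distance of the example above by sampling
# (assumes q is a proper density of height 1/2 on [5, 7]).
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n = 100_000

# Sample from p = 1/4 U([1,2]) + 1/4 U([3,4]) + 1/2 U([9,10]).
components = rng.choice(3, size=n, p=[0.25, 0.25, 0.5])
x_p = np.array([1.0, 3.0, 9.0])[components] + rng.random(n)

# Sample from q = U([5,7]).
x_q = 5.0 + 2.0 * rng.random(n)

print(wasserstein_distance(x_p, x_q))  # ~3.0 under these assumptions
```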
class: middle
count: false
exclude: true
The Earth mover's distance is also known as the Wasserstein-1 distance and is defined as
$$W_1(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma}\left[||x-y||\right],$$
where
- $\Pi(p,q)$ denotes the set of all joint distributions $\gamma(x,y)$ whose marginals are respectively $p$ and $q$;
- $\gamma(x,y)$ indicates how much mass must be transported from $x$ to $y$ in order to transform the distribution $p$ into $q$;
- $||\cdot||$ is the L1 norm and $||x-y||$ represents the cost of moving a unit of mass from $x$ to $y$.
class: middle
count: false
exclude: true

class: middle
count: false
exclude: true
Notice how the Wasserstein-1 distance, unlike the Jensen-Shannon divergence, accounts for the metric structure of the space.

For any two distributions $p$ and $q$,
- $W_1(p,q) \in \mathbb{R}^+$,
- $W_1(p,q)=0$ if and only if $p=q$.
class: middle
count: false
exclude: true
Given the attractive properties of the Wasserstein-1 distance, Arjovsky et al (2017) propose
to learn a generative model by solving instead
$$\theta^* = \arg \min_\theta W_1(p(\mathbf{x}) || q(\mathbf{x};\theta)).$$
Unfortunately, the infimum over couplings is intractable. On the other hand, the Kantorovich-Rubinstein duality tells us that
$$W_1(p(\mathbf{x}) || q(\mathbf{x};\theta)) = \sup_{||f||_L \leq 1} \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[f(\mathbf{x})\right] - \mathbb{E}_{\mathbf{x} \sim q(\mathbf{x};\theta)}\left[f(\mathbf{x})\right],$$
where the supremum is taken over all 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$.
class: middle
count: false
exclude: true
For
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
count: false
exclude: true
Using this result, the Wasserstein GAN algorithm consists in solving the minimax problem
$$\theta^* = \arg \min_\theta \max_\phi \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[d(\mathbf{x};\phi)\right] - \mathbb{E}_{\mathbf{x} \sim q(\mathbf{x};\theta)}\left[d(\mathbf{x};\phi)\right],$$
where the critic $d(\cdot;\phi)$ plays the role of the 1-Lipschitz function $f$:
- The classifier $d:\mathcal{X} \to [0,1]$ is replaced by a critic function $d:\mathcal{X}\to \mathbb{R}$ and its output is not interpreted through the cross-entropy loss;
- There is a strong regularization on the form of $d$. In practice, to ensure 1-Lipschitzness,
  - Arjovsky et al (2017) propose to clip the weights of the critic at each iteration;
  - Gulrajani et al (2017) add a regularization term to the loss.
- As a result, Wasserstein GANs benefit from:
  - a meaningful loss metric,
  - improved stability (no mode collapse is observed).
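As a minimal sketch of how the two options above look in practice (toy PyTorch modules of our own, not the reference implementations):

```python
# Hedged sketch of the two 1-Lipschitz strategies above (PyTorch; toy 2D data).
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))     # d: X -> R
generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))  # g: Z -> X

x_real = torch.randn(128, 2)             # placeholder for true samples
x_fake = generator(torch.randn(128, 8))  # samples from q(x; theta)

# Critic loss: negative Wasserstein estimate (the critic ascends the estimate).
critic_loss = -(critic(x_real).mean() - critic(x_fake.detach()).mean())

# Option 1 (Arjovsky et al, 2017): clip the critic weights (after each optimizer step).
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-0.01, 0.01)

# Option 2 (Gulrajani et al, 2017): gradient penalty on interpolated samples.
eps = torch.rand(128, 1)
x_hat = (eps * x_real + (1 - eps) * x_fake.detach()).requires_grad_(True)
grad = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
gp = ((grad.norm(2, dim=1) - 1) ** 2).mean()
critic_loss = critic_loss + 10.0 * gp  # lambda = 10, as in the WGAN-GP paper
```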
class: middle
count: false
exclude: true
.footnote[Credits: Arjovsky et al, Wasserstein GAN, 2017.]
class: middle
count: false
exclude: true
.footnote[Credits: Arjovsky et al, Wasserstein GAN, 2017.]
class: middle
???
Check https://mitliagkas.github.io/ift6085-2019/ift-6085-lecture-14-notes.pdf
class: middle
.center[ .width-45[] .width-45[] ]
Solving for saddle points is different from gradient descent.
- Minimization of scalar functions yields conservative vector fields.
- Min-max saddle point problems may yield non-conservative vector fields.
.footnote[Credits: Ferenc Huszár, GANs are Broken in More than One Way, 2017.]
???
A vector field is conservative when it can be expressed as the gradient of a scalar function.
class: middle
Following the notations of Mescheder et al (2018), the training objective for the two players can be described by an objective function of the form
$$L(\theta, \phi) = \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[f(d(g(\mathbf{z};\theta);\phi))\right] + \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[f(-d(\mathbf{x};\phi))\right],$$
where the generator tries to minimize the loss, whereas the discriminator tries to maximize it.

If $f(t) = -\log(1 + \exp(-t))$, the original GAN objective is recovered.
???
If
class: middle
Training algorithms can be described as fixed-point algorithms that apply some operator $F_h(\theta,\phi)$ to the parameter values $(\theta,\phi)$.
- For simultaneous gradient descent, $$F_h(\theta,\phi) = (\theta,\phi) + h v(\theta,\phi),$$ where $v(\theta,\phi)$ denotes the gradient vector field $$v(\theta,\phi) := \begin{pmatrix} -\frac{\partial L}{\partial \theta}(\theta,\phi) \\ \frac{\partial L}{\partial \phi}(\theta,\phi) \end{pmatrix}$$ and $h$ is a scalar stepsize.
- Similarly, alternating gradient descent can be described by an operator $F_h = F_{2,h} \circ F_{1,h}$, where $F_{1,h}$ and $F_{2,h}$ perform an update for the generator and discriminator, respectively.
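To make these operators concrete, here is a small illustrative sketch (a toy example of our own: the bilinear game $L(\theta,\phi)=\theta\phi$, in the spirit of the Dirac-GAN discussed later) comparing the two schemes:

```python
# Toy bilinear game L(theta, phi) = theta * phi:
# v(theta, phi) = (-phi, theta). Compare the two fixed-point operators F_h.
import numpy as np

h = 0.1

def F_simultaneous(theta, phi):
    return theta - h * phi, phi + h * theta

def F_alternating(theta, phi):
    theta = theta - h * phi  # generator update first...
    phi = phi + h * theta    # ...then discriminator update on the new theta
    return theta, phi

for F in (F_simultaneous, F_alternating):
    theta, phi = 1.0, 1.0
    for _ in range(500):
        theta, phi = F(theta, phi)
    print(F.__name__, np.hypot(theta, phi))  # distance to the equilibrium (0, 0)
```

On this example, simultaneous gradient descent spirals away from the equilibrium, while alternating gradient descent oscillates around it without converging.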
class: middle
Let us consider the Jacobian $J_{F_h}(\theta^*,\phi^*)$ of $F_h$ at a fixed point $(\theta^*,\phi^*)$:
- if $J_{F_h}(\theta^*,\phi^*)$ has eigenvalues with absolute value bigger than 1, the training will generally not converge to $(\theta^*,\phi^*)$;
- if all eigenvalues have absolute value smaller than 1, the training will converge to $(\theta^*,\phi^*)$;
- if all eigenvalues are on the unit circle, training can be convergent, divergent or neither.

Mescheder et al (2017) show that all eigenvalues can be forced to remain within the unit ball if and only if the stepsize $h$ is sufficiently small, provided the eigenvalues of the Jacobian of the gradient vector field all have a negative real part.
class: middle
.center[Discrete system: divergence.]
.footnote[Credits: Mescheder et al, Which Training Methods for GANs do actually Converge?, 2018.]
class: middle
.center[Discrete system: convergence.]
.footnote[Credits: Mescheder et al, Which Training Methods for GANs do actually Converge?, 2018.]
class: middle
For the (idealized) continuous system
$$
\begin{pmatrix}
\dot{\theta}(t) \\
\dot{\phi}(t)
\end{pmatrix} =
\begin{pmatrix}
-\frac{\partial L}{\partial \theta}(\theta,\phi) \\
\frac{\partial L}{\partial \phi}(\theta,\phi)
\end{pmatrix},$$
which corresponds to training GANs with an infinitely small learning rate $h \to 0$:
- if all eigenvalues of the Jacobian $v'(\theta^*,\phi^*)$ at a stationary point $(\theta^*,\phi^*)$ have negative real part, the continuous system converges locally to $(\theta^*,\phi^*)$;
- if $v'(\theta^*,\phi^*)$ has eigenvalues with positive real part, the continuous system is not locally convergent;
- if all eigenvalues have zero real part, it can be convergent, divergent or neither.
class: middle
.center[Continuous system: divergence.]
.footnote[Credits: Mescheder et al, Which Training Methods for GANs do actually Converge?, 2018.]
class: middle
.center[Continuous system: convergence.]
.footnote[Credits: Mescheder et al, Which Training Methods for GANs do actually Converge?, 2018.]
class: middle
On the Dirac-GAN toy problem, the eigenvalues of the Jacobian of the gradient vector field at the equilibrium are purely imaginary. Following the previous analysis, unregularized GAN training is therefore not locally convergent.
.footnote[Credits: Mescheder et al, Which Training Methods for GANs do actually Converge?, 2018.]
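This can be checked directly on the toy bilinear game used earlier, whose gradient vector field has Jacobian $\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$ at the equilibrium (an illustrative computation, not taken from the paper):

```python
# Eigenvalues of J_{F_h} = I + h * v' for a Jacobian v' with purely imaginary spectrum.
import numpy as np

for h in (0.5, 0.1, 0.01):
    J = np.eye(2) + h * np.array([[0.0, -1.0], [1.0, 0.0]])
    print(h, np.abs(np.linalg.eigvals(J)))  # |1 +- ih| = sqrt(1 + h^2) > 1
```

Since the real parts are zero, $|1 \pm ih| = \sqrt{1+h^2} > 1$ for any $h > 0$: no stepsize makes the discrete updates locally convergent, which motivates the regularization below.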
class: middle
exclude: true
Eigenvalues are
.footnote[Credits: Mescheder et al, Which Training Methods for GANs do actually Converge?, 2018.]
class: middle
A penalty on the squared norm of the gradients of the discriminator results in the regularization
$$R_1(\phi) = \frac{\gamma}{2} \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[ ||\nabla_{\mathbf{x}} d(\mathbf{x};\phi)||^2 \right].$$
.footnote[Credits: Mescheder et al, Which Training Methods for GANs do actually Converge?, 2018.]
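A minimal PyTorch sketch of this penalty (toy discriminator of our own; $\gamma$ set arbitrarily):

```python
# R1 regularization: penalize the squared norm of grad_x d(x; phi) on real data only.
import torch
import torch.nn as nn

d = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # toy discriminator
gamma = 10.0

x_real = torch.randn(128, 2).requires_grad_(True)  # stand-in for samples from p(x)
grad = torch.autograd.grad(d(x_real).sum(), x_real, create_graph=True)[0]
r1 = 0.5 * gamma * grad.pow(2).sum(dim=1).mean()   # (gamma / 2) E_p[ ||grad_x d||^2 ]

# r1 is added to the discriminator loss before calling backward().
```

The penalty is evaluated on real samples only and added to the discriminator loss; in the analysis of Mescheder et al (2018), it is what restores local convergence.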
class: middle
.footnote[Credits: Mescheder et al, Which Training Methods for GANs do actually Converge?, 2018.]
class: middle
.footnote[Credits: Mescheder et al, Which Training Methods for GANs do actually Converge?, 2018.]
class: middle
.footnote[Credits: Mescheder et al, Which Training Methods for GANs do actually Converge?, 2018.]
class: middle
class: middle
class: middle
.center[
Wasserstein GANs as baseline (Arjovsky et al, 2017) +
Gradient Penalty (Gulrajani et al, 2017) + (quite a few other tricks)
]
.center[(Karras et al, 2017)]
class: middle
.center[(Karras et al, 2017)]
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/XOxxPcy5Gr4" frameborder="0" volume="0" allowfullscreen></iframe>(Karras et al, 2017)
class: middle
.center[
Self-attention GANs as baseline (Zhang et al, 2018) + Hinge loss objective (Lim and Ye, 2017; Tran et al, 2017) + Class information fed to both $g$ and $d$
]
.center.width-100[![](figures/archives-lec-gan/biggan.png)] .center[(Brock et al, 2018)]
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/YY6LrQSxIbc" frameborder="0" allowfullscreen></iframe>(Brock et al, 2018)
class: middle
.center[
Progressive GANs as baseline (Karras et al, 2017) + Non-saturating loss instead of WGAN-GP + $R_1$ regularization + a style-based generator
]
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/kSLJriaOumA" frameborder="0" allowfullscreen></iframe>(Karras et al, 2018)
class: middle
The StyleGAN generator first maps the latent code $\mathbf{z}$ to an intermediate latent space $\mathcal{W}$ through a mapping network; the resulting style vector then modulates the synthesis network at each layer through adaptive instance normalization (AdaIN).
.center[ .width-30[] .width-30[] ]
class: middle
class: middle
.center[ .width-30[] .width-30[] ]
class: middle
.center[
]
.center[(Karras et al, 2019; Karras et al, 2021)]
class: middle
.center[(Esser et al, 2021)]
???
Check https://ljvmiranda921.github.io/notebook/2021/08/08/clip-vqgan/
class: middle, center

.width-45[] .width-45[]
(Esser et al, 2021)
class: middle
class: middle
.center[
.center[CycleGANs (Zhu et al, 2017)]
]
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/3AIpPlzM_qs" frameborder="0" volume="0" allowfullscreen></iframe>High-resolution image synthesis (Wang et al, 2017)
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/p5U4NgVGAwg" frameborder="0" allowfullscreen></iframe>GauGAN: Changing sketches into photorealistic masterpieces (NVIDIA, 2019)
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/p9MAvRpT6Cg" frameborder="0" allowfullscreen></iframe>GauGAN2 (NVIDIA, 2021)
class: middle
.center[
Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
(Zakharov et al, 2019)
]
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/rJb0MDrT3SE" frameborder="0" allowfullscreen></iframe>Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
(Zakharov et al, 2019)
class: middle
.center[(Shetty et al, 2017)]
class: middle
.center[
.center[(Zhang et al, 2017)]
]
class: middle
.center[
.center[(Zhang et al, 2017)]
]
class: middle
.center[
.center[StyleCLIP (Patashnik et al, 2021)]
]
???
See also https://stylegan-nada.github.io/ or VQGAN+CLIP.
class: middle
.center[
]
.center[MuseGAN (Dong et al, 2018)]
class: middle
.grid[
.kol-2-3[.width-100[]]
.kol-1-3[
.width-100[]]
]
.center[Learning particle physics (Paganini et al, 2017)]
???
https://arxiv.org/pdf/1712.10321.pdf
class: middle
.center[Learning cosmological models (Rodriguez et al, 2018)]
???
https://arxiv.org/pdf/1801.09070.pdf
class: end-slide, center
count: false
The end.