- Writer: Jaeun Jeong (VAE), Haneol Lee (PixelCNN, PixelRNN, GANs)
-
Title: (cs231n) Lecture 13 : Generative Models
-
Link: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture13.pdf
-
Keywords: VAE, Explicit density model, PixelRNN, PixelCNN, Generative adversarial networks, KL-divergence, GANs problems, mode collapse
-
Supervised Learning
- Data: (x, y), x is data, y is label
- Goal: Learn a function to map x -> y
- Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.
-
Unsupervised Learning
- Data: x, x is data
- Goal: Learn some underlying hidden structure of the data
- Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.
-
Generative Models
-
Given training data, generate new samples from same distribution
-
Want to learn $p_{model}(x)$ similar to $p_{data}(x)$
Taxonomy of Generative Models (image reference: [1])
-
Explicit density models optimize the exact likelihood and give good samples, but generation is an inefficient sequential process.
-
Use chain rule to decompose likelihood of an image x into product of 1-d distributions:
$p_{\theta}(x)=\prod_{i=1}^{n} p_{\theta}(x_i|x_1, ..., x_{i-1})$
- $p_{\theta}(x)$ : likelihood of image $x$
- $p_{\theta}(x_i|x_1, ..., x_{i-1})$ : probability of the $i$-th pixel value given all previous pixels
Then maximize likelihood of training data.
-
The distribution over pixel values is complex, so we express it using a neural network.
-
Need to define ordering of previous pixels
Figure 1: Visualization example of previous pixels: [1]
- Generate image pixels starting from corner.
- Dependency on previous pixels now modeled using an RNN (LSTM).
- In this example, the pixels to the left and top of the current pixel are defined as the previous pixels.
- If no previous pixel, use padding.
- Drawback:
- Sequential generation is slow (see the sketch below).
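A minimal sketch of this sequential generation loop, assuming PyTorch and a hypothetical `pixel_rnn` model that returns a distribution over the next pixel value given the pixels generated so far (the image size and the 256 intensity levels are illustrative choices):

```python
import torch

def generate(pixel_rnn, H=32, W=32):
    # Pixels are generated one at a time in raster-scan order, starting
    # from the top-left corner; each pixel is sampled from
    # p(x_i | x_1, ..., x_{i-1}), so the H*W steps cannot be parallelized.
    img = torch.zeros(H, W, dtype=torch.long)
    for r in range(H):
        for c in range(W):
            logits = pixel_rnn(img, r, c)              # scores over 256 pixel values (assumed interface)
            probs = torch.softmax(logits, dim=-1)
            img[r, c] = torch.multinomial(probs, num_samples=1).item()
    return img
```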
Figure 2: A visualization of PixelCNN, which maps a neighborhood of pixels to a prediction for the next pixel. To generate pixel $x_i$, the model can only condition on the previously generated pixels $x_1, ..., x_{i-1}$. [1]
-
Generate image pixels starting from corner
-
Dependency on previous pixels now modeled using a CNN over context region
-
Training: maximize likelihood of training images
$p_{\theta}(x)=\prod_{i=1}^{n} p_{\theta}(x_i|x_1, ..., x_{i-1})$
- Drawback:
- The generation process is still slow, because generation must still proceed sequentially.
- The major drawback of PixelCNN is that its performance is worse than PixelRNN's.
- Another drawback is the presence of a blind spot in the receptive field: because $x_i$ and all pixels after it (e.g. $x_{i+1}$, $x_{i+2}$, ...) are masked out at training time, stacking masked convolutions leaves part of the valid context unseen (see the masked-convolution sketch after the PixelCNN improvements list below).
-
Pros:
- Can explicitly compute likelihood $p(x)$
- Explicit likelihood of training data gives a good evaluation metric
- Good samples
-
Con:
- Slow because of sequential generation
-
Improving PixelCNN performance
- Gated convolutional layers
- Shortcut connections
- Discretized logistic loss
- Multi-scale
- Training tricks
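A minimal sketch of the masked convolution at the core of PixelCNN, assuming PyTorch (mask type 'A' hides the current pixel and everything after it in raster order; stacking such layers is what produces the blind spot mentioned above):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so that the output at (r, c)
    only depends on pixels before (r, c) in raster-scan order."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        _, _, kH, kW = self.weight.shape
        mask = torch.ones(kH, kW)
        # Zero out the current pixel (type 'A' only) and everything to its right...
        mask[kH // 2, kW // 2 + (mask_type == 'B'):] = 0
        # ...and every row below the center row.
        mask[kH // 2 + 1:, :] = 0
        self.register_buffer('mask', mask)

    def forward(self, x):
        return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                    self.stride, self.padding, self.dilation, self.groups)

# First layer uses mask 'A' so the prediction cannot see the pixel it predicts.
layer = MaskedConv2d('A', in_channels=1, out_channels=16, kernel_size=7, padding=3)
out = layer(torch.randn(8, 1, 28, 28))   # (batch, channels, H, W)
```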
Variational Autoencoder (VAE)
-
The main difference between Bayesians and frequentists is that the former treat the parameter we'd like to estimate as a random variable, whereas the latter treat it as a fixed constant.
For example, suppose we have data of people's heights and we'd like to estimate the mean height: a frequentist treats this mean as a fixed unknown constant, while a Bayesian treats it as a random variable with its own distribution.
The key idea of the Bayesian approach: combine the data with a prior belief, i.e., derive a posterior distribution from the prior distribution and the data.
We'd like to know the joint distribution of the label $y$ and the parameter $\theta$ given the input $x$:
$p(y, \theta \vert x) = p(y \vert \theta, x)p(\theta) \text{ } \because x \perp\theta$
- training: $p(\theta \vert X, Y) = \frac{p(Y \vert X, \theta)p(\theta)}{\int p(Y \vert X, \theta)p(\theta)d\theta}\text{, where X, Y denote the whole training set}$
- test: $p(y \vert x, X, Y) = \int{ p(y \vert x, \theta)p(\theta \vert X, Y)d\theta}$
Problem: unless the prior is conjugate to the likelihood, the posterior (and hence the predictive integral above) cannot be computed analytically. A simple workaround is the MAP estimate, which only needs the unnormalized posterior:
$\theta_{MP} = argmax_{\theta}p(\theta \vert X, Y) = argmax_{\theta}p(Y \vert X, \theta)p(\theta)$, $p(y \vert x, X, Y) \approx p(y \vert x, \theta_{MP})$
More generally, with no conjugacy the true posterior is impossible to obtain analytically, so we need approximate inference:
-
- variational inference: approximate the posterior with a simpler distribution, $q(\theta) \approx p(\theta \vert x)$
- sampling-based methods (e.g. MCMC): draw samples of $\theta$ from a distribution proportional to $p(x \vert \theta)p(\theta)$; accurate but time-consuming
We focus on the first method (variational inference): assume an approximate posterior and fit it to be close to the true posterior. To measure the distance between distributions, we use the KL-divergence.
- Problem 1: we don't know $p(\theta \vert x)$.
- Problem 2: how can we optimize with respect to a probability distribution?
Sol) Minimizing $D_{KL}(q(\theta) \vert\vert p(\theta \vert x))$ with respect to $q$ is equivalent to maximizing the lower bound
$\mathcal{L}(q) = E_{q(\theta)}[\log p(x \vert \theta)] - D_{KL}(q(\theta) \vert\vert p(\theta)) = \text{data likelihood term + KL-regularizer term}$,
which does not involve the unknown $p(\theta \vert x)$, so Problem 1 is solved.
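For reference, this equivalence follows from a short standard derivation (only Bayes' rule is used):

$\log p(x) = E_{q(\theta)}\left[\log \frac{p(x \vert \theta)p(\theta)}{q(\theta)}\right] + D_{KL}(q(\theta) \vert\vert p(\theta \vert x)) = \mathcal{L}(q) + D_{KL}(q(\theta) \vert\vert p(\theta \vert x))$

Since $\log p(x)$ does not depend on $q$, minimizing the KL term to the true posterior is the same as maximizing $\mathcal{L}(q)$.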
Then how can we optimize with respect to a distribution $q(\theta)$ (Problem 2)?
- mean-field approximation: factorize $q(\theta) = \prod_i q_i(\theta_i)$; can be used when the $\theta_i$ are assumed independent.
- parametric approximation: the most popular method in deep learning. Define $q(\theta)=q(\theta \vert \lambda)$ and optimize with respect to $\lambda$.
The VAE applies this parametric variational inference to a latent-variable model with data $X$ and latent variable $Z$. It has two goals:
- quality generation => maximize $\log p(X)$
- learn the distribution of the latent variable $Z$ => make $q(Z \vert X) \approx p(Z \vert X)$
To achieve the first goal, decompose $\log p(X)$ with respect to the approximate posterior $q(Z \vert X)$:
$\log p(X) = -D_{KL}(q(Z \vert X) \vert\vert p(Z)) + E_{q(Z \vert X)}[\log p(X \vert Z)] + D_{KL}(q(Z \vert X) \vert\vert p(Z \vert X))$
The rest except the final KL term is a lower bound (the final KL term is always $\geq 0$). Therefore, maximizing this lower bound pushes up $\log p(X)$ (goal 1) and at the same time drives $q(Z \vert X)$ toward the true posterior $p(Z \vert X)$ (goal 2).
The former term is the KL term between the prior and the approximate posterior, and the latter equals the decoder probability. In the original VAE paper, both the prior and the approximate posterior are Gaussian, $p(Z)=\mathcal{N}(0, I)$ and $q(Z \vert X)=\mathcal{N}(\mu(X), \mathrm{diag}(\sigma^2(X)))$, so the KL term has a closed-form expression.
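Concretely, for that Gaussian choice the KL term evaluates to the standard closed form (summing over the $d$ latent dimensions $j$):

$D_{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \vert\vert \mathcal{N}(0, I)\big) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\right)$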
If we look at the decoder probability $E_{q(Z \vert X)}[\log p(X \vert Z)]$, we have to take the mean over samples $Z \sim q(Z \vert X)$, and sampling is not a differentiable operation, so gradients cannot flow back into the encoder. To handle these problems, we use the reparameterization trick! (a minimal sketch is given below)
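A minimal sketch of the trick, assuming PyTorch, an encoder that outputs `mu` and `logvar`, and a Gaussian decoder (the function names here are only illustrative):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I): the randomness is moved into eps,
    # so gradients can flow through mu and logvar back to the encoder.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

def vae_loss(x, x_recon, mu, logvar):
    # Negative ELBO = reconstruction term + KL(q(Z|X) || N(0, I)).
    # MSE reconstruction corresponds to a Gaussian decoder (see the
    # linear-regression analogy below); the KL uses the closed form above.
    recon = F.mse_loss(x_recon, x, reduction='sum')
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return recon + kl
```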
This is all about linear regression! Linear regression assumes that the target is Gaussian around the model's prediction, $y \sim \mathcal{N}(f(x), \sigma^2 I)$, so maximizing the log-likelihood is the same as minimizing the squared error. In the same way, a Gaussian decoder $p(X \vert Z) = \mathcal{N}(f_{dec}(Z), \sigma^2 I)$ turns the decoder probability into a (negative) mean-squared reconstruction loss.
However, the VAE suffers from blurry generation. Since the approximate-posterior KL term only acts as a regularizer and the reconstruction loss is the actual data-fitting loss, the VAE learns to maximize average reconstruction quality, i.e., it tends to output something close to the mean of the plausible images, which makes the samples blurry.
-
The ultimate goal of GANs is to generate data that approximates the real data distribution.
-
GANs take a game-theoretic approach: learn to generate from the training distribution through a 2-player game. However, training can be tricky and unstable, and there are no inference queries such as $p(x)$ or $p(z|x)$.
Fake and real images [1]
- Problem: We want to sample from a complex, high-dimensional training distribution. There is no direct way to do this.
- Solution: Sample from a simple distribution, e.g. random noise, and learn a transformation to the training distribution (see the generator sketch below).
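A minimal sketch of this idea, assuming PyTorch (the layer sizes and the 100-dimensional noise are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

# Generator: maps simple noise z ~ N(0, I) to an image-shaped sample.
generator = nn.Sequential(
    nn.Linear(100, 256),   # 100-dim noise vector
    nn.ReLU(),
    nn.Linear(256, 784),   # e.g. 28x28 = 784 pixels
    nn.Tanh(),             # pixel values in [-1, 1]
)

z = torch.randn(64, 100)     # sample a batch of noise from a simple distribution
fake_images = generator(z)   # learned transformation toward the data distribution
```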
- Minimax objective function:
Minimax objective loss function [1]
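For reference, the objective from the original GAN paper [2] can be written as:

$\min_{\theta_g}\max_{\theta_d}\left[E_{x \sim p_{data}}\log D_{\theta_d}(x) + E_{z \sim p(z)}\log\left(1 - D_{\theta_d}(G_{\theta_g}(z))\right)\right]$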
-
- Generator ($\theta_g$) network: tries to fool the discriminator by generating real-looking images.
  - The generator wants to minimize the objective such that $D(G(z))$ is close to 1 (the discriminator is fooled into thinking the generated $G(z)$ is real).
- Discriminator ($\theta_d$) network: tries to distinguish between real and fake images.
  - The discriminator wants to maximize the objective such that $D(x)$ is close to 1 (real) and $D(G(z))$ is close to 0 (fake).
  - The discriminator outputs the likelihood in $(0, 1)$ that an image is real.
-
Gradient ascent and descent of GANs in practice
-
Gradient ascent on discriminator:
-
Gradient descent on generator (original formulation):
- In practice, optimizing the generator objective function does not work well.
Figure 3: the original generator objective $\log(1 - D(G(z)))$ as a function of $D(G(z))$ [1]
-
When a sample is likely fake ($D(G(z))$ close to 0), we want to learn from it to improve the generator, but the gradient of the objective in this region is relatively flat.
-
Gradient signal dominated by region where sample is already good.
-
Gradient ascent on generator, as used in standard practice (instead of the original gradient descent on the generator above):
- Instead of minimizing likelihood of discriminator being correct, now maximize likelihood of discriminator being wrong.
Figure 4: the alternative generator objective $\log D(G(z))$ (maximized) as a function of $D(G(z))$ [1]
-
Same objective of fooling the discriminator, but now there is a higher gradient signal for bad samples, so it works better in practice (see the training-step sketch below).
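A minimal sketch of one training step with these practical losses, assuming PyTorch, a generator `G`, a discriminator `D` that outputs probabilities via a sigmoid, and pre-built optimizers `d_opt` and `g_opt` (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def gan_training_step(D, G, real_images, z_dim, d_opt, g_opt):
    batch = real_images.size(0)

    # Discriminator: gradient ascent on log D(x) + log(1 - D(G(z))),
    # implemented as descent on the equivalent binary cross-entropy loss.
    z = torch.randn(batch, z_dim)
    fake_images = G(z).detach()          # do not backprop into G in this step
    d_real = D(real_images)
    d_fake = D(fake_images)
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: non-saturating loss, i.e. maximize log D(G(z))
    # instead of minimizing log(1 - D(G(z))).
    z = torch.randn(batch, z_dim)
    d_fake = D(G(z))
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    return d_loss.item(), g_loss.item()
```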
-
Jointly training two networks is challenging, can be unstable.
- Choosing objectives with better loss landscapes helps training.
-
Generative models create a model $q(x)$ and train it to assign high likelihood to the observed real data (maximum likelihood estimation).
- This is the same as minimizing the KL-divergence $D_{KL}(p \vert\vert q)$, which measures how the estimated probability distribution $q$ diverges from the real-world data distribution $p$ (proof in detail).
- KL-divergence is not symmetrical.
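For reference, the KL-divergence between $p$ and $q$ is defined as:

$D_{KL}(p \vert\vert q) = \int p(x) \log\frac{p(x)}{q(x)} dx$

and in general $D_{KL}(p \vert\vert q) \neq D_{KL}(q \vert\vert p)$, which is why the two directions below behave differently.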
-
As you can see in Figure 5, the forward KL-divergence $D_{KL}(p||q)$ penalizes the generator if it misses some modes of images: the penalty is high where $p(x) > 0$ but $q(x) \to 0$. Nevertheless, it is acceptable that some images do not look real: the penalty is low when $p(x) \to 0$ but $q(x) > 0$. (Poorer quality but more diverse samples)
-
On the other hand, the reverse KL-divergence $D_{KL}(q||p)$ penalizes the generator if the images do not look real: high penalty if $p(x) \to 0$ but $q(x) > 0$. But it explores less variety: low penalty if $q(x) \to 0$ but $p(x) > 0$. (Better quality but less diverse samples)

Figure 5: probability density functions of $p$ and $q$ (left), KL-divergence of $p$ and $q$ (right) [9]
- Non-convergence: the model parameters oscillate, destabilize and never converge.
- Mode collapse: the generator collapses and produces only a limited variety of samples.
- Diminished gradient: the discriminator becomes so successful that the generator gradient vanishes and the generator learns nothing.
- Imbalance between the generator and discriminator, causing overfitting.
- Highly sensitive to hyperparameter selection.
Mode collapse refers to the phenomenon where the model we are trying to train does not cover the full distribution of the actual data and loses diversity. This happens when G, which is only learning to reduce its loss, cannot find the entire data distribution and instead concentrates strongly on one mode at a time, as shown in the figure below. For example, this is the case where a G trained on MNIST generates only certain digits. [7]
The problem where the probability density functions of the generator and the discriminator keep oscillating without converging is also related to mode collapse. [7]
mode collapse example [7], [9]
The key to solving mode collapse is to train the model to cover the entire data distribution evenly and to keep that coverage remembered.
-
feature matching : add the squared error between statistics (intermediate discriminator features) of fake data and real data to the objective function (see the formula after this list).
-
mini-batch discrimination : give the discriminator access to the distances between samples within each mini-batch, so a generator that collapses to very similar samples can be detected and penalized.
-
historical averaging : add a term to the loss that penalizes the parameters for drifting far from their own historical average, which incorporates history into training.
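For example, the feature-matching term is commonly written as follows, where $f(\cdot)$ denotes the activations of an intermediate layer of the discriminator:

$\left\| E_{x \sim p_{data}} f(x) - E_{z \sim p(z)} f(G(z)) \right\|_2^2$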
References
- [1] cs231n 2017 lecture 13
- [2] Generative Adversarial Networks
- [3] Pixel RNN
- [4] Pixel CNN, Pixel CNN v2
- [5] https://towardsdatascience.com/auto-regressive-generative-models-pixelrnn-pixelcnn-32d192911173
- [6] cs231n 2020 lecture11
- [7] mode collapse in GANs
- [8] developers.google.com: mode collapse
- [9] solutions of mode collapse