1. How to make the intractable objective function $P(X)$ tractable
2. Why the latent variable can be forced to follow the standard normal distribution $\mathcal{N}(0, I)$
3. How to handle the non-differentiable sampling step that prevents training with gradient descent (the reparameterization trick)

## Transforming the objective function

The goal of a VAE is to maximize, for each training sample $X$, the probability $P(X)$ that the generative process reproduces $X$:

$$P(X) = \int{P(X|z;\theta)P(z)\mathrm{d}z}$$
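To see why this objective is awkward to optimize directly, one can try the naive Monte Carlo estimate $P(X) \approx \frac{1}{N}\sum_i P(X|z_i)$ with $z_i \sim P(z)$. A minimal NumPy sketch, using a made-up 1-D Gaussian decoder purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_mean(z):
    # Hypothetical decoder: maps latent z to the mean of a Gaussian over X.
    return 2.0 * z + 1.0

def log_gaussian(x, mu, sigma):
    # Log density of N(mu, sigma^2) evaluated at x.
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def naive_p_x(x, n_samples=10000, sigma=0.5):
    # P(X) ~= (1/N) * sum_i P(X | z_i) with z_i drawn from the prior N(0, 1).
    z = rng.standard_normal(n_samples)
    return np.mean(np.exp(log_gaussian(x, decoder_mean(z), sigma)))

print(naive_p_x(1.0))   # ~= 0.19, the marginal density of this toy model at X = 1
```

For high-dimensional $X$, almost all prior samples $z_i$ contribute negligibly to the average, which is why the VAE instead samples $z$ from a distribution conditioned on $X$, as derived next.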

This integral is intractable in general, so we introduce an approximate distribution $Q(z)$ and measure its mismatch with the true posterior $P(z|X)$ using the KL divergence:

$$\mathcal{D}[Q(z)||P(z|X)] = E_{z \sim Q}[\mathrm{log}Q(z) - \mathrm{log}P(z|X)]$$

Applying Bayes' rule $P(z|X) = P(X|z)P(z)/P(X)$ and pulling $\mathrm{log}P(X)$, which does not depend on $z$, out of the expectation:

$$\mathcal{D}[Q(z)||P(z|X)] = E_{z \sim Q}[\mathrm{log}Q(z) - \mathrm{log}P(X|z) - \mathrm{log}P(z)] + \mathrm{log}P(X)$$

Rearranging, the $\mathrm{log}Q(z) - \mathrm{log}P(z)$ terms combine into a KL divergence against the prior:

$$\mathrm{log}P(X) - \mathcal{D}[Q(z)||P(z|X)] = E_{z \sim Q}[\mathrm{log}P(X|z)] - \mathcal{D}[Q(z)||P(z)]$$

Since $Q$ is meant to approximate the posterior, we condition it on $X$, giving the objective a VAE actually optimizes (its right-hand side is the evidence lower bound, ELBO):

$$\mathrm{log}P(X) - \mathcal{D}[Q(z|X)||P(z|X)] = E_{z \sim Q}[\mathrm{log}P(X|z)] - \mathcal{D}[Q(z|X)||P(z)]$$
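This identity holds exactly for any choice of $Q$, which can be verified numerically in a toy 1-D linear-Gaussian model where every term has a closed form (the model and all numbers below are assumptions for illustration only):

```python
import numpy as np

def kl_gauss(m0, s0, m1, s1):
    # KL[N(m0, s0^2) || N(m1, s1^2)] for 1-D Gaussians, in closed form.
    return np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

sigma = 0.7          # toy decoder noise: X | z ~ N(z, sigma^2), prior z ~ N(0, 1)
X = 1.3              # an observed sample
m, s = 0.4, 0.9      # an arbitrary approximate posterior Q(z|X) = N(m, s^2)

# Left-hand side: log P(X) - KL[Q(z|X) || P(z|X)], both exact in this model.
log_pX = -0.5 * np.log(2 * np.pi * (1 + sigma**2)) - X**2 / (2 * (1 + sigma**2))
post_var = 1 / (1 + 1 / sigma**2)        # true posterior variance
post_mean = post_var * X / sigma**2      # true posterior mean
lhs = log_pX - kl_gauss(m, s, post_mean, np.sqrt(post_var))

# Right-hand side: E_{z~Q}[log P(X|z)] - KL[Q(z|X) || P(z)].
# E_{z~Q}[(X - z)^2] = (X - m)^2 + s^2 gives the expectation in closed form.
recon = -0.5 * np.log(2 * np.pi * sigma**2) - ((X - m)**2 + s**2) / (2 * sigma**2)
rhs = recon - kl_gauss(m, s, 0.0, 1.0)

print(lhs, rhs)   # the two sides agree
```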

## Optimizing the objective and the reparameterization trick

The KL divergence between two multivariate Gaussians has a closed form (here $k$ is the dimensionality of $z$):

$$\mathcal{D}[\mathcal{N}(\mu_0, \Sigma_0) || \mathcal{N}(\mu_1, \Sigma_1)] = \frac{1}{2}\left(\mathrm{tr}(\Sigma_{1}^{-1}\Sigma_0) + (\mu_1 - \mu_0)^{T}\Sigma_{1}^{-1}(\mu_1 - \mu_0) - k + \mathrm{log}\frac{\mathrm{det}\Sigma_1}{\mathrm{det}\Sigma_0}\right)$$

With $Q(z|X) = \mathcal{N}(\mu(X), \Sigma(X))$ and $P(z) = \mathcal{N}(0, I)$, this simplifies to:

$$\mathcal{D}[\mathcal{N}(\mu(X), \Sigma(X)) || \mathcal{N}(0, I)] = \frac{1}{2}\left(\mathrm{tr}(\Sigma(X)) + \mu(X)^{T}\mu(X) - k - \mathrm{log}\,\mathrm{det}\Sigma(X)\right)$$
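The closed-form KL term and reparameterized sampling $z = \mu(X) + \Sigma(X)^{1/2}\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ can be sketched for a diagonal $\Sigma(X)$ as follows; the values of `mu` and `log_var` are made up here, whereas a real VAE would produce them with an encoder network:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    # KL[N(mu, diag(exp(log_var))) || N(0, I)]
    # = 1/2 * (tr(Sigma) + mu^T mu - k - log det Sigma) for diagonal Sigma.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I): the sampling noise is moved
    # into eps, so gradients can flow through mu and log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([0.5, -0.3])
log_var = np.array([0.0, -1.0])
print(kl_to_standard_normal(mu, log_var))   # ~= 0.354
z = reparameterize(mu, log_var)             # a differentiable sample from Q(z|X)
```

The KL term is zero exactly when $\mu(X) = 0$ and $\Sigma(X) = I$, i.e. when the encoder's output matches the prior.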

## Why can $z$ follow a normal distribution?

Why can we force $z$ to follow a normal distribution? The reason is that even if $z$ is normally distributed, a sufficiently complex function can transform it into an arbitrary distribution. For example, the randomly distributed $z$ on the left of the figure below is mapped to the distribution on the right by the function $g(z) = z / 10 + z/||z||$.
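This transformation is easy to reproduce: applying $g$ to 2-D standard normal samples concentrates them on a ring, since $\|g(z)\| = 1 + \|z\|/10$. A small NumPy sketch (plotting omitted; the statistics alone show the ring structure):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    # g(z) = z/10 + z/||z||, applied row-wise to 2-D samples.
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / 10 + z / norms

z = rng.standard_normal((5000, 2))   # z ~ N(0, I)
x = g(z)
radii = np.linalg.norm(x, axis=1)
print(radii.mean(), radii.std())     # samples cluster on a ring of radius ~ 1.1
```

The mean radius is close to 1.1 with a small spread, i.e. a ring rather than a Gaussian blob, even though the input was standard normal.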

## Divergence and entropy

### Definition of information entropy

The information content (self-information) of observing an event $x$ with probability $p(x)$ is:

$$h(x) = -\mathrm{log}_2p(x) \tag{1}$$

The entropy of a distribution is the expected information content:

$$H[x] = - \sum_x{p(x)\mathrm{log}_2p(x)} \tag{2}$$

With the natural logarithm instead of base 2, entropy is measured in nats rather than bits:

$$H[x] = - \sum_x{p(x)\mathrm{ln}p(x)} \label{entropy}\tag{3}$$
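Equations (1)–(3) can be checked numerically; a small Python helper (the function name is our own):

```python
import numpy as np

def entropy(p, base=2):
    # H[x] = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))    # fair coin: 1 bit
print(entropy([0.25] * 4))    # uniform over 4 outcomes: 2 bits
print(entropy([1.0]))         # deterministic outcome: 0 bits
```

Entropy is maximized by the uniform distribution and drops to zero when one outcome is certain.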

### Definition of KL divergence

The KL divergence (Kullback-Leibler Divergence, KLD) is also known as relative entropy, information divergence, or information gain.

KL divergence is an asymmetric measure of the difference between two probability distributions $P$ and $Q$. For two probability density functions $p(x)$ and $q(x)$, it is defined as:

\begin{align} KL(p||q) & = - \int{p(x)\mathrm{ln}q(x)\mathrm{d}x} - (-\int{p(x)\mathrm{ln}p(x)\mathrm{d}x}) \\ & = - \int{p(x)\mathrm{ln}{\frac{q(x)}{p(x)}}\mathrm{d}x} \label{kl-entropy} \tag{4} \end{align}
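The discrete analogue of equation (4) can be computed directly; a small sketch (function name is our own) that also illustrates the asymmetry noted above:

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p||q) = sum_x p(x) ln(p(x)/q(x)), in nats.
    # Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.4, 0.6]
q = [0.5, 0.5]
print(kl_divergence(p, q))   # > 0
print(kl_divergence(q, p))   # a different value: KL is asymmetric
print(kl_divergence(p, p))   # 0: identical distributions
```

The divergence is always non-negative and vanishes only when the two distributions coincide, which is what makes it usable as a training penalty in the VAE objective above.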