The backpropagation algorithm was first introduced in the 1970s, but it attracted little attention at the time. That changed in 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams published a now-famous paper showing that backpropagation trains neural networks far more efficiently than earlier approaches, making problems that had previously been intractable for neural networks solvable. Today, backpropagation is the workhorse behind neural network training.

## A Vectorized Notation for Neural Networks

$$\begin{eqnarray} a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right), \tag{23}\label{23}\end{eqnarray}$$

$$\begin{eqnarray} f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right) = \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right] = \left[ \begin{array}{c} 4 \\ 9 \end{array} \right], \tag{24}\label{24}\end{eqnarray}$$

$$\begin{eqnarray} a^{l} = \sigma(w^l a^{l-1}+b^l). \tag{25}\label{25}\end{eqnarray}$$
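Equation (25) maps directly onto matrix code. The sketch below is a minimal NumPy illustration; the layer sizes, random initialization, and input values are made-up assumptions, not part of the original text:

```python
import numpy as np

def sigmoid(z):
    # Applied elementwise, in the sense of equation (24)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer sizes: 3 neurons in layer l-1, 2 neurons in layer l
rng = np.random.default_rng(0)
w = rng.standard_normal((2, 3))   # w[j, k]: weight from neuron k in layer l-1 to neuron j in layer l
b = rng.standard_normal((2, 1))   # one bias per neuron in layer l
a_prev = np.array([[0.1], [0.5], [0.9]])  # activations a^{l-1}

# Equation (25): a^l = sigma(w^l a^{l-1} + b^l)
a = sigmoid(w @ a_prev + b)
```

Note how the ordering of indices $j, k$ in $w^l_{jk}$ is exactly what makes the matrix product `w @ a_prev` compute the sums $\sum_k w^l_{jk} a^{l-1}_k$ of equation (23) without any transpose.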

## Two Assumptions About the Cost Function

$$\begin{eqnarray} C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2, \tag{26}\end{eqnarray}$$

$$\begin{eqnarray} C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2, \tag{27}\end{eqnarray}$$
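The per-example cost in equation (27) can be checked numerically; the target and output vectors below are illustrative values chosen for this sketch:

```python
import numpy as np

def quadratic_cost(a_L, y):
    # Equation (27): C_x = (1/2) * ||y - a^L||^2 for a single training example x
    return 0.5 * np.sum((y - a_L) ** 2)

y = np.array([0.0, 1.0])     # desired output
a_L = np.array([0.2, 0.8])   # actual output activations
c = quadratic_cost(a_L, y)   # (1/2) * (0.2^2 + 0.2^2) = 0.04
```

Averaging this quantity over all $n$ training examples recovers the full cost of equation (26).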

## The Hadamard Product, $s \odot t$

$$\begin{eqnarray} \left[ \begin{array}{c} 1 \\ 2 \end{array} \right] \odot \left[\begin{array}{c} 3 \\ 4\end{array} \right] = \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right] = \left[ \begin{array}{c} 3 \\ 8 \end{array} \right]. \tag{28} \end{eqnarray}$$
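In NumPy, the Hadamard product is just the ordinary `*` operator on arrays, so equation (28) becomes:

```python
import numpy as np

s = np.array([1, 2])
t = np.array([3, 4])

# Elementwise (Hadamard) product, equation (28)
print(s * t)  # [3 8]
```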

## The Four Fundamental Equations of Backpropagation

$$\begin{eqnarray} \delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \tag{29}\end{eqnarray}$$

$$\delta_j^L = \frac{\partial{C}}{\partial{a_j^L}}\sigma'(z_j^L) \label{BP1} \tag{BP1}$$

$$\delta^L = \nabla_a C \odot \sigma'(z^L). \label{BP1a} \tag{BP1a}$$

$$\delta^L = (a^L - y) \odot \sigma'(z^L)$$

$$\begin{eqnarray} \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l), \tag{BP2} \label{BP2}\end{eqnarray}$$

$$\begin{eqnarray} \frac{\partial C}{\partial b^l_j} = \delta^l_j. \tag{BP3}\end{eqnarray}$$

$$\begin{eqnarray} \frac{\partial C}{\partial b} = \delta, \tag{31}\end{eqnarray}$$

$$\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j. \tag{BP4} \label{BP4}\end{eqnarray}$$

$$\begin{eqnarray} \frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out}, \tag{32} \label{32}\end{eqnarray}$$

$$\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j). \tag{BP1}\end{eqnarray}$$

$$\begin{eqnarray} \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l), \tag{BP2}\end{eqnarray}$$

$$\begin{eqnarray} \frac{\partial C}{\partial b^l_j} = \delta^l_j. \tag{BP3}\end{eqnarray}$$

$$\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j. \tag{BP4}\end{eqnarray}$$
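Chaining BP1–BP4 gives the backward pass for a single training example. The sketch below assumes a hypothetical two-layer network with the quadratic cost of equation (27); the sizes and random values are illustrative, not from the original text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

# Hypothetical network: 3 inputs -> 4 hidden -> 2 outputs
rng = np.random.default_rng(1)
sizes = [3, 4, 2]
ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(2)]
bs = [rng.standard_normal((sizes[i + 1], 1)) for i in range(2)]

x = rng.standard_normal((3, 1))
y = np.array([[0.0], [1.0]])

# Forward pass, caching z^l and a^l for every layer (equation 25)
activations, zs, a = [x], [], x
for w, b in zip(ws, bs):
    z = w @ a + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

grads_w = [None, None]
grads_b = [None, None]

# BP1 for the quadratic cost: delta^L = (a^L - y) ⊙ sigma'(z^L)
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
grads_b[-1] = delta                       # BP3: dC/db^L_j = delta^L_j
grads_w[-1] = delta @ activations[-2].T   # BP4: dC/dw^L_{jk} = a^{L-1}_k delta^L_j

# BP2: propagate the error one layer back
delta = (ws[-1].T @ delta) * sigmoid_prime(zs[-2])
grads_b[0] = delta                        # BP3 again, for the hidden layer
grads_w[0] = delta @ activations[0].T     # BP4 again, for the hidden layer
```

The gradient arrays have the same shapes as the corresponding weights and biases, which is what allows a gradient-descent step of the form `w -= eta * grad_w`.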