A multilayer perceptron (MLP) is a neural network consisting of interleaved matrix multiplications and nonlinearities. An $L$-layer MLP $f(\textbf{X})$ takes the following form: $f^{[L]}(f^{[L-1]}(\dots f^{[2]}(f^{[1]}(\textbf{X}))))$, where $f^{[l]}(\textbf{X}) = g^{[l]}(\textbf{X} \textbf{W}^{[l]})$.1 For a binary classification task, the last weight matrix $\textbf{W}^{[L]}$ has dimensions $p^{[L-1]} \times 1$ and the last nonlinearity $g^{[L]}(z)$ is the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ in order to map the final output to a scalar between 0 and 1. With a training dataset consisting of $n$ examples and $p^{[0]}$ features organized into a design matrix $\textbf{X} \in \mathbb{R}^{n \times p^{[0]}}$ and a binary label vector $\textbf{y} \in \{0, 1\}^n$, we calculate the log loss over the training dataset as $J(\textbf{W}) = -\textbf{y}^T \log[f(\textbf{X})] - (1 - \textbf{y})^T \log[1 - f(\textbf{X})]$. Gradient descent can be used to find weights that minimize this loss. In practice, automatic differentiation libraries make it easy to compute $\nabla J(\textbf{W})$, but in this post we work through it by hand.

To help with the bookkeeping, let $\textbf{A}^{[0]} = \textbf{X}, \textbf{A}^{[1]} = g^{[1]}(\textbf{A}^{[0]} \textbf{W}^{[1]}), \dots, \textbf{A}^{[L]} = g^{[L]}(\textbf{A}^{[L-1]} \textbf{W}^{[L]})$ and $\textbf{Z}^{[1]} = \textbf{A}^{[0]} \textbf{W}^{[1]}, \dots, \textbf{Z}^{[L]} = \textbf{A}^{[L-1]} \textbf{W}^{[L]}$.
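
To make the notation concrete, here is a minimal NumPy sketch of the forward pass that caches each $\textbf{Z}^{[l]}$ and $\textbf{A}^{[l]}$ and evaluates the log loss. The layer sizes, the choice of ReLU for the hidden layers, and all of the variable names below are illustrative assumptions, not something fixed by the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def forward(X, weights):
    """Forward pass through an L-layer MLP with no bias terms (as in this post).
    weights = [W1, ..., WL]; hidden layers use ReLU here, the last layer sigmoid.
    Returns the cached activations A[0..L] and pre-activations Z[1..L]."""
    A, Z = [X], [None]   # Z[0] is a placeholder so list indices match layer numbers
    for l, W in enumerate(weights, start=1):
        Z.append(A[-1] @ W)
        A.append(sigmoid(Z[-1]) if l == len(weights) else relu(Z[-1]))
    return A, Z

def log_loss(y, y_hat):
    # J(W) = -y^T log(y_hat) - (1 - y)^T log(1 - y_hat)
    return float(-y @ np.log(y_hat) - (1 - y) @ np.log(1 - y_hat))

# Toy problem: n = 8 examples, p[0] = 4 features, one hidden layer of width 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
y = rng.integers(0, 2, size=8).astype(float)
weights = [0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(3, 1))]
A, Z = forward(X, weights)
print(log_loss(y, A[-1].ravel()))
```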

Last layer

We start with the last layer. By the chain rule, we have $\frac{\partial J}{\partial \textbf{W}^{[L]}} = \frac{\partial J}{\partial \textbf{A}^{[L]}} \frac{\partial \textbf{A}^{[L]}}{\partial \textbf{Z}^{[L]}} \frac{\partial \textbf{Z}^{[L]}}{\partial \textbf{W}^{[L]}}$.

The matrix-by-matrix derivatives get unwieldy, so we compute scalar-by-scalar derivatives and then convert back to matrix form.

$\frac{\partial J}{\partial a_{ij}^{[L]}} = -\frac{y_i}{a_{ij}^{[L]}} + \frac{1 - y_i}{1 - a_{ij}^{[L]}}$

$\frac{\partial a_{ij}^{[L]}}{\partial z_{kl}^{[L]}} = \frac{\partial \sigma(z_{ij}^{[L]})}{\partial z_{kl}^{[L]}} = a_{ij}^{[L]} (1 - a_{ij}^{[L]})$ when $i = k$ and $j = l$, and 0 otherwise.

Combining these two terms and simplifying, we have $\frac{\partial J}{\partial z_{i1}^{[L]}} = \left(-\frac{y_i}{a_{i1}^{[L]}} + \frac{1 - y_i}{1 - a_{i1}^{[L]}}\right) a_{i1}^{[L]} (1 - a_{i1}^{[L]}) = a_{i1}^{[L]} - y_i$, or in matrix form $\frac{\partial J}{\partial \textbf{Z}^{[L]}} = \textbf{A}^{[L]} - \textbf{y}$. This looks like the $\textbf{y}$ vector is being subtracted from a matrix, but recall that $\textbf{A}^{[L]}$ in the last layer has dimensions $n \times 1$.

$\frac{\partial z_{ij}^{[L]}}{\partial w_{kl}^{[L]}} = \frac{\partial}{\partial w_{kl}^{[L]}} \sum_{m=1}^{p^{[L-1]}} a_{im}^{[L-1]} w_{mj}^{[L]} = a_{ik}^{[L-1]}$ if $j = l$ and 0 otherwise, since only the $m = k$ term survives.

Putting all this together (summing over $i$ and $j$ in the chain rule), we get $\frac{\partial J}{\partial \textbf{W}^{[L]}} = (\textbf{A}^{[L-1]})^T (\textbf{A}^{[L]} - \textbf{y})$, which has the same dimensions ($p^{[L-1]} \times 1$) as $\textbf{W}^{[L]}$.
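
Continuing the NumPy sketch from above (so the toy data, weight shapes, and helper names remain illustrative assumptions), the last-layer gradient and a finite-difference spot check on a single weight entry look like this; the step size is an arbitrary choice.

```python
def grad_last_layer(A, y):
    # dJ/dW[L] = (A[L-1])^T (A[L] - y), where A[L] has shape (n, 1)
    return A[-2].T @ (A[-1] - y.reshape(-1, 1))

# Central-difference estimate of dJ/dw[L]_{1,1} for comparison.
eps = 1e-6
analytic = grad_last_layer(A, y)[0, 0]
W_plus = [W.copy() for W in weights]
W_minus = [W.copy() for W in weights]
W_plus[-1][0, 0] += eps
W_minus[-1][0, 0] -= eps
numeric = (log_loss(y, forward(X, W_plus)[0][-1].ravel())
           - log_loss(y, forward(X, W_minus)[0][-1].ravel())) / (2 * eps)
print(analytic, numeric)   # the two estimates should agree to several decimal places
```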

Penultimate layer

We work through one other layer before generalizing: $\frac{\partial J}{\partial \textbf{W}^{[L-1]}} = \frac{\partial J}{\partial \textbf{Z}^{[L]}} \frac{\partial \textbf{Z}^{[L]}}{\partial \textbf{A}^{[L-1]}} \frac{\partial \textbf{A}^{[L-1]}}{\partial \textbf{Z}^{[L-1]}} \frac{\partial \textbf{Z}^{[L-1]}}{\partial \textbf{W}^{[L-1]}}$.

$\frac{\partial J}{\partial \textbf{Z}^{[L]}} = \textbf{A}^{[L]} - \textbf{y}$ (see section above)

$\frac{\partial \textbf{Z}^{[L]}}{\partial \textbf{A}^{[L-1]}} = \textbf{W}^{[L]}$

$\frac{\partial \textbf{A}^{[L-1]}}{\partial \textbf{Z}^{[L-1]}} = g'^{[L-1]}(\textbf{Z}^{[L-1]})$ (the derivative of whatever nonlinearity we use for this layer)

$\frac{\partial \textbf{Z}^{[L-1]}}{\partial \textbf{W}^{[L-1]}} = (\textbf{A}^{[L-2]})^T$ (see section above)

Putting all this together, we get $\frac{\partial J}{\partial \textbf{W}^{[L-1]}} = (\textbf{A}^{[L-2]})^T \left[ (\textbf{A}^{[L]} - \textbf{y}) (\textbf{W}^{[L]})^T \odot g'^{[L-1]}(\textbf{Z}^{[L-1]}) \right]$.
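
As a sanity check on the shapes, using the convention implied by the dimensions given for $\textbf{W}^{[L]}$ (i.e., $\textbf{A}^{[l]} \in \mathbb{R}^{n \times p^{[l]}}$ and $\textbf{W}^{[l]} \in \mathbb{R}^{p^{[l-1]} \times p^{[l]}}$), the factors line up as

$$\underbrace{(\textbf{A}^{[L-2]})^T}_{p^{[L-2]} \times n} \Big[ \underbrace{(\textbf{A}^{[L]} - \textbf{y})}_{n \times 1} \underbrace{(\textbf{W}^{[L]})^T}_{1 \times p^{[L-1]}} \odot \underbrace{g'^{[L-1]}(\textbf{Z}^{[L-1]})}_{n \times p^{[L-1]}} \Big] \in \mathbb{R}^{p^{[L-2]} \times p^{[L-1]}},$$

which matches the dimensions of $\textbf{W}^{[L-1]}$.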

All layers

We start with $\frac{\partial J}{\partial \textbf{Z}^{[L]}} = \textbf{A}^{[L]} - \textbf{y}$. Then, proceeding backwards through the layers, we compute $\frac{\partial J}{\partial \textbf{W}^{[l]}} = (\textbf{A}^{[l-1]})^T \frac{\partial J}{\partial \textbf{Z}^{[l]}}$ and $\frac{\partial J}{\partial \textbf{Z}^{[l]}} = \left( \frac{\partial J}{\partial \textbf{Z}^{[l+1]}} (\textbf{W}^{[l+1]})^T \right) \odot g'^{[l]}(\textbf{Z}^{[l]})$.
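
Written out as code, the whole backward pass is a short loop. Continuing the earlier NumPy sketch (ReLU hidden layers, sigmoid output, no biases, all names assumed), one way it might look:

```python
def relu_grad(z):
    # Subgradient convention: use 0 at z = 0.
    return (z > 0).astype(float)

def backward(A, Z, weights, y):
    """Returns [dJ/dW[1], ..., dJ/dW[L]] for the log loss.
    Starts from dJ/dZ[L] = A[L] - y, then for l = L-1, ..., 1:
      dJ/dZ[l] = (dJ/dZ[l+1] W[l+1]^T) elementwise-times g'[l](Z[l])
      dJ/dW[l] = (A[l-1])^T dJ/dZ[l]"""
    L = len(weights)
    dZ = A[-1] - y.reshape(-1, 1)      # dJ/dZ[L]
    grads = [None] * L
    grads[L - 1] = A[L - 1].T @ dZ     # dJ/dW[L] = (A[L-1])^T dJ/dZ[L]
    for l in range(L - 1, 0, -1):      # 1-indexed layers l = L-1, ..., 1
        dZ = (dZ @ weights[l].T) * relu_grad(Z[l])   # weights[l] is W[l+1] (0-indexed list)
        grads[l - 1] = A[l - 1].T @ dZ
    return grads

grads = backward(A, Z, weights, y)
print([g.shape for g in grads])                       # matches [W.shape for W in weights]
print(np.allclose(grads[-1], grad_last_layer(A, y)))  # agrees with the last-layer formula
```

Looping the finite-difference comparison from the last-layer section over every entry of every $\textbf{W}^{[l]}$ is a more thorough way to check that the indices above line up.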

Sources

  • Week 4 reading, Neural Networks and Deep Learning (Coursera)
  • https://web.stanford.edu/class/cs224n/readings/gradient-notes.pdf
  • http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn-2017/resources/Matrix_derivatives_cribsheet.pdf
  • https://en.wikipedia.org/wiki/Matrix_calculus

Footnotes

  1. Usually, the MLP also has a bias term in each layer, i.e., $f^{[l]}(\textbf{X}) = g^{[l]}(\textbf{X} \textbf{W}^{[l]} + \textbf{b}^{[l]})$, but I set $\textbf{b}^{[l]} = 0$ here to reduce the plethora of notation already in this post.