<p><strong>Kernel trick</strong> (Technical notes, 2021-01-17)</p>
<p>Suppose we observe a training dataset $\{(\textbf{x}^{(i)}, y^{(i)})\}_{i=1}^n$ where each feature vector $\textbf{x}^{(i)} \in \mathbb{R}^p$ and each target $y^{(i)} \in \mathbb{R}$. We want to estimate a function $h(\textbf{x})$ that enables us to predict $y$ from $\textbf{x}$.</p>
<p>Organize the training dataset into the design matrix $\textbf{X} \in \mathbb{R}^{n \times p}$ and the target vector $\textbf{y} \in \mathbb{R}^n$. Ridge regression finds the coefficients $\textbf{w}$ that minimize $\lVert \textbf{y} - \textbf{X}\textbf{w} \rVert_2^2 + \lambda \lVert \textbf{w} \rVert_2^2$, where $\lambda$ controls the tradeoff between the fit to the training dataset and the complexity of the model (as measured by the magnitude of its coefficients).</p>
<p>We solve for $\textbf{w}$ using the regularized normal equation: $\textbf{w} = (\textbf{X}^T \textbf{X} + \lambda I_p)^{-1} \textbf{X}^T \textbf{y}$ (supposing we fit a model without an intercept, which we can always do by mean-centering each $y^{(i)}$ and $\textbf{x}^{(i)}$). Solving this $p \times p$ system of linear equations takes $O(p^3)$ time.</p>
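<p>A minimal NumPy sketch of the regularized normal equation, assuming mean-centered data and no intercept. The function name <code>ridge_fit</code> and the synthetic data are illustrative choices, not part of the original note.</p>

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I_p) w = X^T y for w (no intercept)."""
    p = X.shape[1]
    # Solving the p x p linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data (hypothetical): 50 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Mean-center so that a model without an intercept is appropriate.
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()

lam = 1.0
w = ridge_fit(X_centered, y_centered, lam)
```

<p>At the solution, the gradient of the ridge objective, $2\textbf{X}^T(\textbf{X}\textbf{w} - \textbf{y}) + 2\lambda\textbf{w}$, should be numerically zero.</p>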
<p>Suppose we want to increase the capacity of the model by adding interactions between features, e.g., we could define a new feature vector $\phi(\textbf{x}) = (\textbf{x}^T, x_1 \textbf{x}^T, x_2 \textbf{x}^T, \cdots, x_p \textbf{x}^T)^T$. The dimensionality of $\phi(\textbf{x})$ is $O(p^2)$, so solving for $\textbf{w}_{\phi}$ takes $O(p^6)$ time.</p>
<p>That’s slow, but let’s make it even worse. Taking $p = 1$ to keep the notation simple, redefine $\phi(x) = \exp \left( -\frac{x^2}{2\sigma^2}\right) \left(1, \frac{x}{\sigma \sqrt{1!}}, \frac{x^2}{\sigma^2 \sqrt{2!}}, \frac{x^3}{\sigma^3 \sqrt{3!}}, \cdots \right)^T$. The virtue of this new representation of features is that the dot product of (think similarity between) $\phi(x^{(i)})$ and $\phi(x^{(j)})$ is given by the Gaussian function, which for general $p$ reads $k(\textbf{x}^{(i)}, \textbf{x}^{(j)}) = \exp\left(-\frac{\lVert \textbf{x}^{(i)} - \textbf{x}^{(j)} \rVert_2^2}{2 \sigma^2}\right)$. It’s a nice way to encode smoothness in the feature space (and we can control the amount of smoothness by varying $\sigma$). The downside is that this is an infinite-dimensional feature space, which seems intractable.</p>
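<p>As a sanity check for scalar inputs, the sketch below truncates the infinite feature map after a finite number of coordinates and compares the resulting dot product with the Gaussian function. It assumes the scaling factor is $\exp(-x^2 / (2\sigma^2))$, which is the choice that makes the two agree. The names <code>phi</code> and <code>gaussian_kernel</code> and the test values are illustrative.</p>

```python
import math

def phi(x, sigma, num_terms):
    """First num_terms coordinates of the infinite feature map for a scalar x."""
    scale = math.exp(-x**2 / (2 * sigma**2))
    return [scale * x**k / (sigma**k * math.sqrt(math.factorial(k)))
            for k in range(num_terms)]

def gaussian_kernel(x, z, sigma):
    """Gaussian function of the distance between two scalars."""
    return math.exp(-(x - z)**2 / (2 * sigma**2))

x, z, sigma = 0.7, -0.4, 1.5
# Dot product of the truncated feature vectors.
approx = sum(a * b for a, b in zip(phi(x, sigma, 20), phi(z, sigma, 20)))
exact = gaussian_kernel(x, z, sigma)
```

<p>The agreement follows from the Taylor series $\sum_k \frac{(xz/\sigma^2)^k}{k!} = \exp(xz/\sigma^2)$, so keeping more terms only tightens the match.</p>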
<p>Go back to the original feature representation for a moment. Define the “kernel” matrix $\textbf{K}$ to contain the dot product between each pair of training examples, i.e., $K_{i,j} = (\textbf{x}^{(i)})^T \textbf{x}^{(j)}$ or in matrix form: $\textbf{K} = \textbf{X} \textbf{X}^T \in \mathbb{R}^{n \times n}$ (notice that this is different from the unnormalized covariance matrix $\textbf{X}^T \textbf{X} \in \mathbb{R}^{p \times p}$). Let $\alpha = (\textbf{K} + \lambda I_n)^{-1} \textbf{y}$. With some algebra, we can show that $\textbf{w} = \textbf{X}^T \alpha$ and therefore $\textbf{w}^T \textbf{x} = \sum_{i=1}^n \alpha_i (\textbf{x}^T \textbf{x}^{(i)})$, so $\alpha$ gives the solution to the ridge regression problem, but where the features only show up in the equations for training and prediction as dot products.</p>
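<p>The equivalence can be checked numerically. This sketch, on made-up random data, computes the prediction once from $\textbf{w} = (\textbf{X}^T \textbf{X} + \lambda I_p)^{-1} \textbf{X}^T \textbf{y}$ and once from $\alpha = (\textbf{K} + \lambda I_n)^{-1} \textbf{y}$, using only dot products in the second case. Variable names are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 4, 0.5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Primal solution: a p x p system in the coefficients w.
w = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual solution: an n x n system in alpha, built from pairwise dot products.
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Predict at a new point; the dual form needs only x^T x^(i) for each i.
x_new = rng.normal(size=p)
primal_pred = w @ x_new
dual_pred = alpha @ (X @ x_new)
```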
<p>We can run through the same steps for $\phi(\textbf{x})$. Even though $\phi(\textbf{x})$ is infinite-dimensional, we only ever have to compute dot products, and the dot product of $\phi(\textbf{x}^{(i)})$ and $\phi(\textbf{x}^{(j)})$ is given by the Gaussian function. It takes $O(p)$ time to evaluate the Gaussian function. We have to compute it for every pair of points, giving $O(n^2 p)$ time to construct $\textbf{K}_{\phi}$. Then it takes $O(n^3)$ time to solve the $n \times n$ system of linear equations. So we can regress $y$ on an infinite number of features.</p>
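<p>Putting the pieces together, here is a small sketch of ridge regression with the Gaussian kernel. The function names, the toy target $\sin(x)$ and the values of $\lambda$ and $\sigma$ are assumptions made for illustration; this is not a tuned implementation.</p>

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 sigma^2)) for rows of A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2 * A @ B.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * sigma**2))

def fit(X, y, lam, sigma):
    """Return alpha solving (K + lam * I_n) alpha = y."""
    K = gaussian_kernel_matrix(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, sigma):
    """Prediction is a kernel-weighted sum over the training examples."""
    return gaussian_kernel_matrix(X_new, X_train, sigma) @ alpha

# Toy one-dimensional regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0])

alpha = fit(X, y, lam=1e-3, sigma=1.0)
X_test = np.array([[0.0], [1.0]])
pred = predict(X, alpha, X_test, sigma=1.0)
```

<p>Note that $\phi$ never appears: training and prediction touch the features only through the kernel matrix.</p>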
<p>In general, avoiding the explicit computation of $\phi(\textbf{x})$ for some feature map $\phi$ by computing dot products in that feature space is called the “kernel trick”.</p>
<p><strong>Sources</strong></p>
<ul>
<li>https://people.eecs.berkeley.edu/~jrs/189s19/lec/14.pdf</li>
</ul>
<p><strong>Gradient descent for multilayer perceptrons</strong> (2021-01-09)</p>
<p>A multilayer perceptron (MLP) is a neural network that consists of an interleaved sequence of matrix multiplications and nonlinearities. An $L$-layer MLP $f(\textbf{X})$ takes the following form: $f^{[L]}(f^{[L-1]}(\dots f^{[2]}(f^{[1]}(\textbf{X}))))$, where $f^{[l]}(\textbf{X}) = g^{[l]}(\textbf{X} \textbf{W}^{[l]})$.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> For a binary classification task, the last weight matrix $\textbf{W}^{[L]}$ has dimensions $p^{[L]} \times 1$ and the last nonlinearity $g^{[L]}(z)$ is the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ in order to map the final output to a scalar between 0 and 1. With a training dataset consisting of $n$ examples and $p^{[1]}$ features organized into a design matrix $\textbf{X} \in \mathbb{R}^{n \times p^{[1]}}$ and a binary label vector $\textbf{y} \in \{0, 1\}^n$, we calculate the log loss over the training dataset as $J(\textbf{W}) = -\textbf{y}^T \log[f(\textbf{X})] - (1 - \textbf{y})^T \log[1 - f(\textbf{X})]$. Gradient descent can be used to find weights that minimize this loss. In practice, automatic differentiation libraries make it easy to compute $\nabla J(\textbf{W})$, but in this post we work through it by hand.</p>
<p>To help with the bookkeeping, let $\textbf{A}^{[0]} = \textbf{X}, \textbf{A}^{[1]} = g^{[1]}(\textbf{A}^{[0]} \textbf{W}^{[1]}), \dots, \textbf{A}^{[L]} = g^{[L]}(\textbf{A}^{[L-1]} \textbf{W}^{[L]})$ and $\textbf{Z}^{[1]} = \textbf{A}^{[0]} \textbf{W}^{[1]}, \dots, \textbf{Z}^{[L]} = \textbf{A}^{[L-1]} \textbf{W}^{[L]}$.</p>
<p><strong>Last layer</strong></p>
<p>We start with the last layer. By the chain rule, we have $\frac{\partial J}{\partial \textbf{W}^{[L]}} = \frac{\partial J}{\partial \textbf{A}^{[L]}} \frac{\partial \textbf{A}^{[L]}}{\partial \textbf{Z}^{[L]}} \frac{\partial \textbf{Z}^{[L]}}{\partial \textbf{W}^{[L]}}$.</p>
<p>The matrix-by-matrix derivatives get unwieldy, so we compute scalar-by-scalar derivatives and then convert back to matrix form.</p>
<p>$\frac{\partial J}{\partial a_{ij}^{[L]}} = -\frac{y_i}{a_{ij}^{[L]}} + \frac{1 - y_i}{1 - a_{ij}^{[L]}}$</p>
<p>$\frac{\partial a_{ij}^{[L]}}{\partial z_{kl}^{[L]}} = \frac{\partial \sigma(z_{ij}^{[L]})}{\partial z_{kl}^{[L]}} = a_{ij}^{[L]} (1 - a_{ij}^{[L]})$ when $i = k$ and $j = l$ and 0 otherwise.</p>
<p>Combining these two terms and simplifying, we have $\frac{\partial J}{\partial \textbf{Z}^{[L]}} = \textbf{A}^{[L]} - \textbf{y}$. This looks like the $\textbf{y}$ vector is subtracted from a matrix but recall that $\textbf{A}^{[L]}$ in the last layer has dimensions $n \times 1$.</p>
<p>$\frac{\partial z_{ij}^{[L]}}{\partial w_{kl}^{[L]}} = \sum_{m=1}^{p^{[L]}} a_{im}^{[L-1]} \frac{\partial w_{mj}^{[L]}}{\partial w_{kl}^{[L]}} = a_{ik}^{[L-1]}$ if $j = l$ and 0 otherwise (only the $m = k$ term of the sum survives).</p>
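<p>The simplification to $a - y$ for a single example can be checked with a finite difference. This is a scalar sketch with arbitrarily chosen values of $z$ and $y$; the function names are illustrative.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(z, y):
    """Log loss of one example as a function of the pre-activation z."""
    a = sigmoid(z)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

z, y, eps = 0.3, 1.0, 1e-6
# Central finite difference approximation of dJ/dz.
numeric = (log_loss(z + eps, y) - log_loss(z - eps, y)) / (2 * eps)
# Closed form from combining the two partial derivatives: a - y.
analytic = sigmoid(z) - y
```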
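<p>Putting all this together, and ordering the factors so that the shapes conform when the examples are the rows of $\textbf{X}$, we get $\frac{\partial J}{\partial \textbf{W}^{[L]}} = (\textbf{A}^{[L-1]})^T (\textbf{A}^{[L]} - \textbf{y})$, which has the same shape as $\textbf{W}^{[L]}$.</p>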
<p><strong>Penultimate layer</strong></p>
<p>We work through one other layer before generalizing: $\frac{\partial J}{\partial \textbf{W}^{[L-1]}} = \frac{\partial J}{\partial \textbf{Z}^{[L]}} \frac{\partial \textbf{Z}^{[L]}}{\partial \textbf{A}^{[L-1]}} \frac{\partial \textbf{A}^{[L-1]}}{\partial \textbf{Z}^{[L-1]}} \frac{\partial \textbf{Z}^{[L-1]}}{\partial \textbf{W}^{[L-1]}}$.</p>
<p>$\frac{\partial J}{\partial \textbf{Z}^{[L]}} = \textbf{A}^{[L]} - \textbf{y}$ (see section above)</p>
<p>$\frac{\partial \textbf{Z}^{[L]}}{\partial \textbf{A}^{[L-1]}} = \textbf{W}^{[L]}$</p>
<p>$\frac{\partial \textbf{A}^{[L-1]}}{\partial \textbf{Z}^{[L-1]}} = g^{[L-1]\prime}(\textbf{Z}^{[L-1]})$ (the derivative of whatever nonlinearity we use for this layer)</p>
<p>$\frac{\partial \textbf{Z}^{[L-1]}}{\partial \textbf{W}^{[L-1]}} = (\textbf{A}^{[L-2]})^T$ (see section above)</p>
<p>Putting all this together, and ordering the factors so that the shapes conform when the examples are the rows of $\textbf{X}$, we get $\frac{\partial J}{\partial \textbf{W}^{[L-1]}} = (\textbf{A}^{[L-2]})^T \left[ \left( (\textbf{A}^{[L]} - \textbf{y}) (\textbf{W}^{[L]})^T \right) \odot g^{[L-1]\prime}(\textbf{Z}^{[L-1]}) \right]$.</p>
<p><strong>All layers</strong></p>
<p>We start with $\frac{\partial J}{\partial \textbf{Z}^{[L]}} = \textbf{A}^{[L]} - \textbf{y}$. Then proceeding backwards through the layers, we compute $\frac{\partial J}{\partial \textbf{W}^{[l]}} = (\textbf{A}^{[l-1]})^T \frac{\partial J}{\partial \textbf{Z}^{[l]}}$ and $\frac{\partial J}{\partial \textbf{Z}^{[l]}} = \left( \frac{\partial J}{\partial \textbf{Z}^{[l+1]}} (\textbf{W}^{[l+1]})^T \right) \odot g^{[l]\prime}(\textbf{Z}^{[l]})$, where the factors are ordered so that the shapes conform when the examples are the rows of $\textbf{X}$.</p>
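<p>The recursion can be sketched in NumPy and checked against a finite difference. Several things here are assumptions for illustration: the hidden layers use $\tanh$ (the post leaves $g^{[l]}$ unspecified), examples are stored as rows of $\textbf{X}$ so the factors are ordered to make the shapes conform, and the names <code>forward</code>, <code>loss</code> and <code>gradients</code> are hypothetical.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights):
    """Return activations A[0..L] and pre-activations Z[1..L] (Z[0] is unused)."""
    A, Z = [X], [None]
    for l, W in enumerate(weights, start=1):
        Z.append(A[-1] @ W)
        # Hidden layers use tanh (an arbitrary choice); the last layer uses the sigmoid.
        A.append(sigmoid(Z[-1]) if l == len(weights) else np.tanh(Z[-1]))
    return A, Z

def loss(X, y, weights):
    """Log loss summed over the training examples."""
    a = forward(X, weights)[0][-1][:, 0]
    return -(y @ np.log(a)) - (1 - y) @ np.log(1 - a)

def gradients(X, y, weights):
    """Backward pass: grads[l - 1] holds dJ/dW^[l]."""
    A, _ = forward(X, weights)
    L = len(weights)
    grads = [None] * L
    dZ = A[L] - y[:, None]                      # dJ/dZ^[L]
    for l in range(L, 0, -1):
        grads[l - 1] = A[l - 1].T @ dZ          # dJ/dW^[l]
        if l > 1:
            # dJ/dZ^[l-1]; tanh'(Z^[l-1]) = 1 - (A^[l-1])^2.
            dZ = (dZ @ weights[l - 1].T) * (1 - A[l - 1] ** 2)
    return grads

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 1))]
grads = gradients(X, y, weights)

# Finite-difference check of one entry of dJ/dW^[1].
eps = 1e-6
W_plus = [W.copy() for W in weights]
W_minus = [W.copy() for W in weights]
W_plus[0][1, 2] += eps
W_minus[0][1, 2] -= eps
numeric = (loss(X, y, W_plus) - loss(X, y, W_minus)) / (2 * eps)
```

<p>Each entry of <code>grads</code> has the same shape as the corresponding weight matrix, which is a quick way to catch a misplaced transpose.</p>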
<p><strong>Sources</strong></p>
<ul>
<li>Week 4, Neural Networks and Deep Learning, Coursera, Reading</li>
<li>https://web.stanford.edu/class/cs224n/readings/gradientnotes.pdf</li>
<li>http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn2017/resources/Matrix_derivatives_cribsheet.pdf</li>
<li>https://en.wikipedia.org/wiki/Matrix_calculus</li>
</ul>
<p><strong>Footnotes</strong></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Usually, the MLP also has a bias term in each layer, i.e., $f^{[l]}(\textbf{X}) = g^{[l]}(\textbf{X} \textbf{W}^{[l]} + \textbf{b}^{[l]})$, but I set $\textbf{b}^{[l]} = 0$ here to reduce the plethora of notation already in this post. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p><strong>Automatic differentiation</strong> (2021-01-08)</p>
<p><strong>Automatic differentiation</strong> is a technique for computing the gradient of a function specified by a computer program.</p>
<p>It takes advantage of the fact that any function implemented in a computer program can be decomposed into primitive operations (or else how would the function be implemented in the first place?), which are themselves easy to differentiate and whose derivatives can then be combined to get the derivative of the original function.</p>
<p>For example, suppose our primitive operations are multiplying by a constant, adding a constant and exponentiating by a constant. And the function we want to differentiate is $f(x) = 2 x^3 + 7$. We can write $f(x)$ as a composition of primitive operations. Define $f_1(x) = x + 7$, $f_2(x) = 2x$, $f_3(x) = x^3$. Then $f(x) = f_1(f_2(f_3(x)))$.</p>
<p>We can apply the chain rule to get the derivative. Define $g(x) = f_2(f_3(x))$. The chain rule states that $\frac{\partial} {\partial x} [ f_1(g(x)) ] = f_1'(g(x)) g'(x)$. By the same token, $g'(x) = f_2'(f_3(x)) f_3'(x)$. Putting that together, we have $f'(x) = f_1'(f_2(f_3(x))) f_2'(f_3(x)) f_3'(x)$, which is the derivative of the original function written in terms of the derivatives of primitive operations.</p>
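<p>A small sketch of this composition for $f(x) = 2x^3 + 7$: each primitive is paired with its derivative, and the chain rule multiplies the local derivatives as the value is pushed through. Representing primitives as pairs of lambdas is an illustrative choice.</p>

```python
# Each primitive operation is a (function, derivative) pair.
f1 = (lambda u: u + 7, lambda u: 1.0)          # add a constant
f2 = (lambda u: 2 * u, lambda u: 2.0)          # multiply by a constant
f3 = (lambda u: u ** 3, lambda u: 3 * u ** 2)  # raise to a constant power

def value_and_derivative(x, primitives):
    """Evaluate a composition and its derivative; primitives are listed innermost first."""
    value, deriv = x, 1.0
    for fn, dfn in primitives:
        deriv = dfn(value) * deriv  # chain rule: multiply by the local derivative
        value = fn(value)
    return value, deriv

# f(x) = f1(f2(f3(x))) evaluated at x = 2.
value, deriv = value_and_derivative(2.0, [f3, f2, f1])
```

<p>At $x = 2$ this gives $f(2) = 23$ and $f'(2) = 6 \cdot 2^2 = 24$.</p>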
<p>How do we do this more generally? First, we need a way to represent a function. Supposing that we can decompose a function into primitive operations, then we can represent a function as a <strong>computational graph</strong> where each node in the graph (a directed acyclic graph) is either a primitive operation or a variable. For example, the computational graph for $f(x, y, z) = (x + y) \cdot z$ looks like:</p>
<div class="highlighterrouge"><div class="highlight"><pre class="highlight"><code>    *
   / \
  +   z
 / \
x   y
</code></pre></div></div>
<p>An automatic differentiator takes the root of a computational graph as input and values for the variable nodes and returns the gradient of the function evaluated at those input values.</p>
<p><strong>How do we compute the gradient using the graph?</strong> Even though we don’t yet know how to calculate the partial derivative of the root node with respect to one of its grandchildren ($x$ or $y$), we first notice that it’s easy to calculate the partial derivative of a node with respect to one of its children, because each parent node is a primitive operation and we know how to calculate derivatives for primitive operations by applying one of a few formulas. For example, relabel the computational graph above with $a = x + y$ and $b = az$:</p>
<div class="highlighterrouge"><div class="highlight"><pre class="highlight"><code>    b
    *
   / \
  a   z
  +
 / \
x   y
</code></pre></div></div>
<p>The partial derivative of the $b$ node with respect to its child node $a$ is just $\frac{\partial}{\partial a} b = \frac{\partial}{\partial a} (a \cdot z) = a \frac{\partial{z}}{\partial{a}} + z \frac{\partial{a}}{\partial{a}} = z$ according to the product rule of calculus.</p>
<p>We can label each edge with the partial derivative of the parent/destination node with respect to its child/source node. Now, how do we use that information to get the partial derivatives of the root with respect to each of the leaf/variable nodes in the graph? For a given leaf node, it turns out that the sum over all the paths from that leaf to the root of the product of the edges for each path gives you the partial derivative of the root node with respect to the leaf node.</p>
<p>That procedure is just a visual way of describing the multivariate chain rule. Suppose we have a function $f(u_1, u_2, \cdots, u_n)$ where the input variables depend on some other variable $x$ ($f$ should be thought of as the root node in the graph, $u_i$ as intermediate nodes and $x$ as a leaf node), then $\frac{\partial f}{\partial x} = \sum_{i=1}^n \frac{\partial f}{\partial u_i} \frac{\partial u_i}{\partial x}$.</p>
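<p>The sum-over-paths rule can be sketched as a tiny differentiator for graphs built from addition and multiplication. The <code>Node</code> class and the helper names are hypothetical; each node stores its children together with the partial derivative along that edge, and the root's derivative is pushed down the graph one edge at a time.</p>

```python
class Node:
    """A variable or primitive operation in a computational graph."""
    def __init__(self, value, children=()):
        self.value = value
        self.children = children  # tuple of (child_node, d(self)/d(child))
        self.grad = 0.0           # d(root)/d(self), filled in by backward()

def var(value):
    return Node(value)

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def backward(root):
    """Accumulate d(root)/d(node) for every node reachable from root."""
    order, seen = [], set()

    def visit(node):
        # Post-order depth-first search: children are appended before parents.
        if id(node) in seen:
            return
        seen.add(id(node))
        for child, _ in node.children:
            visit(child)
        order.append(node)

    visit(root)
    root.grad = 1.0
    # Process parents before children so each grad is complete when it is used.
    for node in reversed(order):
        for child, local in node.children:
            child.grad += node.grad * local  # one edge of one path

# f(x, y, z) = (x + y) * z at x = 2, y = 3, z = 4.
x, y, z = var(2.0), var(3.0), var(4.0)
b = mul(add(x, y), z)
backward(b)
```

<p>Here $\frac{\partial b}{\partial x} = \frac{\partial b}{\partial y} = z = 4$ and $\frac{\partial b}{\partial z} = x + y = 5$. The <code>+=</code> is what implements the sum over paths when a variable feeds into the root along more than one route.</p>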
<p><strong>Forward and reverse mode accumulation</strong>: The way that we traverse the graph to compute these partial derivatives can dramatically alter the efficiency of the computation. Consider the following graph:</p>
<div class="highlighterrouge"><div class="highlight"><pre class="highlight"><code>     y
     |
     uk
     |
     .
     .
     .
     |
     u1
     +
   /   \
 x1 ... xp
</code></pre></div></div>
<p>If we want to calculate the gradient of $y$ and we start from the leaf nodes and move up, then we first calculate $\frac{\partial y}{\partial x_1} = \frac{\partial u_1}{\partial x_1} \frac{\partial u_2}{\partial u_1} \dots \frac{\partial u_k}{\partial u_{k-1}} \frac{\partial y}{\partial u_k}$. We then sweep through the graph again to calculate $\frac{\partial y}{\partial x_2} = \frac{\partial u_1}{\partial x_2} \frac{\partial u_2}{\partial u_1} \dots \frac{\partial u_k}{\partial u_{k-1}} \frac{\partial y}{\partial u_k}$. And again for $\frac{\partial y}{\partial x_3}$ and so on, each time repeating the computation $\frac{\partial u_2}{\partial u_1} \dots \frac{\partial u_k}{\partial u_{k-1}} \frac{\partial y}{\partial u_k}$. This is forward mode accumulation and it requires $k \cdot p$ multiplies to get the gradient.</p>
<p>If we instead start from the top of the graph and move downwards caching the results as we go, then we first calculate $\frac{\partial y}{\partial u_k}$, then $\frac{\partial y}{\partial u_{k-1}}$ as $\frac{\partial y}{\partial u_k} \cdot \frac{\partial u_k}{\partial u_{k-1}}$ using the cached value for the first term and so on down the graph. This is reverse mode accumulation and it only requires $k + p$ multiplies to get the gradient. In the case of gradient descent on the cost function of a neural network with millions of parameters, automatic differentiation with reverse mode accumulation (also called <strong>backpropagation</strong>) makes optimization feasible.</p>
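<p>To make the difference concrete, this sketch counts multiplications for the chain-shaped graph above, given the local derivative on every edge. The edge values are made up and the function names are illustrative. Counted exactly, forward mode uses $k \cdot p$ multiplies and reverse mode uses $k + p - 1$, in line with the $k \cdot p$ versus $k + p$ comparison.</p>

```python
def forward_mode(d_x, d_u, d_y):
    """One sweep per input: recompute the shared chain of factors every time."""
    grads, multiplies = [], 0
    for dxi in d_x:
        acc = dxi
        for local in d_u + [d_y]:
            acc *= local
            multiplies += 1
        grads.append(acc)
    return grads, multiplies

def reverse_mode(d_x, d_u, d_y):
    """One sweep from the root: compute dy/du_1 once, then reuse it for every input."""
    multiplies = 0
    acc = d_y
    for local in reversed(d_u):
        acc *= local
        multiplies += 1
    grads = []
    for dxi in d_x:
        grads.append(acc * dxi)
        multiplies += 1
    return grads, multiplies

p, k = 1000, 10
d_x = [1.0] * p                              # du_1/dx_i = 1 because u_1 is a sum
d_u = [0.5 + 0.1 * j for j in range(k - 1)]  # du_{j+1}/du_j, arbitrary values
d_y = 2.0                                    # dy/du_k, arbitrary value

fwd_grads, fwd_count = forward_mode(d_x, d_u, d_y)
rev_grads, rev_count = reverse_mode(d_x, d_u, d_y)
```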
<p><strong>Sources</strong></p>
<ul>
<li>https://colah.github.io/posts/201508Backprop/</li>
<li>https://www.offconvex.org/2016/12/20/backprop/</li>
<li>https://en.wikipedia.org/wiki/Automatic_differentiation</li>
<li>https://arxiv.org/pdf/1811.05031.pdf</li>
</ul>
<p><strong>Gradient descent</strong> (2021-01-06)</p>
<p><strong>Gradient descent</strong> is an algorithm for finding a minimum of a function.</p>
<p>It works as follows. Given a function $f: \mathbb{R}^p \to \mathbb{R}$ and a <strong>learning rate</strong> $\lambda \in \mathbb{R}^{+}$, choose a random starting point $\textbf{w}^{(0)} \in \mathbb{R}^p$ and apply the following update iteratively: $\textbf{w}^{(t)} = \textbf{w}^{(t-1)} - \lambda \nabla f(\textbf{w}^{(t-1)})$ for $t = 1, 2, 3, \dots$ Stop when $\textbf{w}^{(t)} \approx \textbf{w}^{(t-1)}$ or when $f(\textbf{w}^{(t)})$ is less than some threshold or using some other heuristic. Take the last $\textbf{w}^{(t)}$ as a minimizer of the function.</p>
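<p>A minimal sketch of the update loop, using the "stop when the iterates barely move" criterion. The function name, the tolerance, the iteration cap and the quadratic test function are all assumptions made for illustration; the starting point is passed in rather than drawn at random so the run is reproducible.</p>

```python
import numpy as np

def gradient_descent(grad_f, w0, lam, tol=1e-8, max_iter=10_000):
    """Iterate w <- w - lam * grad_f(w) until the step is smaller than tol."""
    w = np.asarray(w0, dtype=float)
    for t in range(1, max_iter + 1):
        w_next = w - lam * grad_f(w)
        if np.linalg.norm(w_next - w) < tol:
            return w_next, t
        w = w_next
    return w, max_iter

# f(w) = ||w - c||^2 has gradient 2 (w - c) and is minimized at w = c.
c = np.array([1.0, -2.0])
w_min, steps = gradient_descent(lambda w: 2 * (w - c), w0=[5.0, 5.0], lam=0.1)
```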
<p>$\nabla f(\textbf{w}^{(t-1)})$ is the <strong>gradient</strong> of $f$ evaluated at $\textbf{w}^{(t-1)}$. In general, $\nabla f(\textbf{w}) = \begin{bmatrix} \frac{\partial f}{\partial w_1} (\textbf{w}) & \cdots & \frac{\partial f}{\partial w_p}(\textbf{w}) \end{bmatrix}^T$. The $i$th component tells us how much the function changes when we move $w_i$ a little bit while keeping the other components fixed. Computing $\nabla f(\textbf{w}) \cdot \textbf{u}$ tells us how much the function changes when we move $\textbf{w}$ a little bit in the direction of some vector $\textbf{u}$. Which direction will most decrease the function? For a fixed length of $\textbf{u}$, the dot product is minimized when the two vectors point in opposite directions, i.e., when $\textbf{u}$ is a positive multiple of $-\nabla f(\textbf{w})$. Therefore, gradient descent is a greedy algorithm that at each point takes a step in the direction (the negative of the gradient) that most decreases the function near the point.</p>
<p><strong>Example:</strong> Let $f: \mathbb{R} \to \mathbb{R}$. In particular, let $f(w) = w^2$, then $\nabla f(w) = 2w$. The update rule is then $w^{(t)} = w^{(t-1)} - 2 \lambda w^{(t-1)} = (1 - 2 \lambda) w^{(t-1)}$. In this unusual case, we can write down a closed-form solution: $w^{(t)} = (1 - 2 \lambda)^{t} w^{(0)}$. If $\lambda \in (0, 1)$, then $\lvert 1 - 2\lambda \rvert < 1$ and as $t$ gets larger, $w^{(t)}$ approaches 0, which is the minimizer of the function. For $\lambda \in (\frac{1}{2}, 1)$, the sign of $w^{(t)}$ alternates from one step to the next, but the iterates still converge. What happens if we choose $\lambda \geq 1$? At $\lambda = 1$, $w^{(t)}$ flips between $w^{(0)}$ and $-w^{(0)}$ forever, and for $\lambda > 1$ the iterates alternate in sign and grow in magnitude, never approaching the minimum. If we make $\lambda$ very small, then $w^{(t)}$ will get closer and closer to 0, but it might take a while before our stopping criterion kicks in. In practice, we try several different values of $\lambda$ and pick one based on plots of $f(w^{(t)})$ against $t$.</p>
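<p>The effect of the learning rate on $f(w) = w^2$ is easy to see by running the update for a fixed number of steps. The particular values of $\lambda$, the starting point and the step count below are arbitrary choices for illustration.</p>

```python
def iterate(w0, lam, steps):
    """Apply w <- w - lam * f'(w) with f(w) = w^2, so f'(w) = 2w."""
    w = w0
    for _ in range(steps):
        w = w - lam * 2 * w
    return w

w0, steps = 1.0, 50
# Final iterate after 50 steps for several learning rates.
results = {lam: iterate(w0, lam, steps) for lam in (0.01, 0.25, 0.75, 1.0, 1.1)}

# The closed form (1 - 2 * lam)^t * w0 should match the loop.
closed_form = {lam: (1 - 2 * lam) ** steps * w0 for lam in results}
```

<p>With these settings, the small rate $\lambda = 0.01$ is still far from 0 after 50 steps, $\lambda = 0.25$ and $\lambda = 0.75$ both end up essentially at 0, $\lambda = 1$ returns to the starting point, and $\lambda = 1.1$ has blown up.</p>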
<p><strong>Sources</strong></p>
<ul>
<li>http://web.archive.org/save/https://www.khanacademy.org/math/multivariablecalculus/multivariablederivatives/partialderivativeandgradientarticles/a/directionalderivativesgoingdeeper</li>
<li>https://math.stackexchange.com/questions/223252/whyisgradientthedirectionofsteepestascent</li>
<li>Gradient descent, how neural networks learn, 3Blue1Brown (https://www.youtube.com/watch?v=IHZwWFHWaw)</li>
<li>https://www.stat.cmu.edu/~ryantibs/convexoptF15/lectures/05graddescent.pdf</li>
</ul>