<p><strong>Technical notes</strong></p>

<p><strong>Exact Deterministic Dynamic Programming</strong> (2021-01-31)</p>

<p><strong>Setup</strong></p>
<p>$x_{k+1} = f_k(x_k, u_k)$ for $k = 0, 1, \dots, N-1$</p>
<p>where</p>
<ul>
<li>$k$ is the time index</li>
<li>$x_k$ is the state of the system (a summary of the past that makes the future independent of it)</li>
<li>$u_k$ is the action or decision or control selected at time $k$ from a set $U_k(x_k)$</li>
<li>$f_k$ describes the dynamics of the system</li>
<li>$N$ is the finite time horizon</li>
</ul>
<p>The problem also has a cost function $g_k(x_k, u_k)$ associated with it, which is the cost of going from $x_k$ to $x_{k+1}$, and the terminal cost $g_N(x_N)$ of the last state.</p>
<p>The total cost $J(x_0)$, which we sometimes write as $J_{u_0, \dots, u_{N-1}}(x_0)$ or $J(x_0; u_0, \dots, u_{N-1})$ or $J(x_0, u_0, \dots, u_{N-1})$, is:</p>
<p>$J(x_0) = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k)$</p>
<p>We want to choose $u_k$ so as to achieve the minimum cost $J^{*} (x_0)$:</p>
<p>$J^{*} (x_0) = \min_{u_k \in U_k(x_k),\, k = 0, \dots, N-1} J(x_0; u_0, \dots, u_{N-1})$</p>
<p><strong>Shortest path view</strong></p>
<p>The setup above can be completely described as a weighted shortest-path problem on a graph. We set up the graph with an artificial terminal node:</p>
<p><img src="/img/bertsekas_finite_state_problems_shortest_path_view.png" /></p>
<p><strong>Principle of optimality</strong></p>
<p>The tail of an optimal sequence is optimal for the tail subproblem.</p>
<p>“For an auto travel analogy, suppose that the fastest route from Los Angeles to Boston passes through Chicago. The principle of optimality translates to the obvious fact that the Chicago to Boston portion of the route is also the fastest route for a trip that starts from Chicago and ends in Boston”</p>
<p>The proof is an example of the Extremal Principle (Engel, Problem-Solving Strategies).</p>
<p><strong>Example: Deterministic Scheduling Problem</strong></p>
<p>Figure 1.1.6 in Chapter 1 of Bertsekas.</p>
<p><img src="/img/bertsekas_figure116.png" /></p>
<p>$N = 3$ (the initial state is at $k = 0$)</p>
<p>$x_N = x_3 \in \{ABC, ACB, ACD, CAB, CAD, CDA\}$</p>
<p>[DP algorithm: Start with: $J_N^{*} (x_N) = g_N (x_N)$ for all $x_N$]</p>
<p>$J_3^{*} (ABC) = 6, J_3^{*} (ACB) = 1, J_3^{*} (ACD) = 3, J_3^{*} (CAB) = 1, J_3^{*} (CAD) = 3, J_3^{*} (CDA) = 2$</p>
<p>[DP algorithm: And for $k = N-1, \dots, 0$, let $J_k^{*} (x_k) = \min_{u_k \in U_k(x_k)} \left[ g_k(x_k, u_k) + J^{*}_{k+1} (f_k(x_k, u_k)) \right]$ for all $x_k$]</p>
<p>[Note that $J^{*}_{k+1} (f_k(x_k, u_k))$ is not a constant. And this calculation is not just for one path along the graph. It’s a calculation for every state/node in the graph.]</p>
<p>$k = N - 1$</p>
<p>$x_{N-1} = x_2 \in \{AB, AC, CA, CD\}$</p>
<p>$J_2^{*} (AB) = \min_{[ABC]} \left[ g_2(x_2, u_2) + J^{*}_{3} (f_2(x_2, u_2)) \right] = \left[ g_2(AB, ABC) + J^{*}_{3} (ABC) \right] = 3 + 6 = 9$</p>
<p>$J_2^{*} (AC) = \min_{[ACB, ACD]} \left[ g_2(x_2, u_2) + J^{*}_{3} (f_2(x_2, u_2)) \right] = \min \left[ g_2(AC, ACB) + J^{*}_{3} (ACB), g_2(AC, ACD) + J^{*}_{3} (ACD)\right] = 5$</p>
<p>$\cdots$</p>
<p>Now we have $J_k^{*}(x_k)$ for all $x_k$, i.e., $J_0^{*} (x_0), J_1^{*} (A), J_1^{*} (C), J_2^{*} (AB), J_2^{*}(AC), J_2^{*}(CA), J_2^{*}(CD), J_3^{*}(ABC), \dots, J_3^{*}(CDA)$</p>
<p><strong>DP algorithm</strong></p>
<p>Start at $k = N$. Label all the terminal nodes (the set of all possible $x_N$) with the terminal cost at that node.</p>
<p>Proceed backwards.</p>
<p>$k = N - 1$. Label all the nodes (the set of all possible $x_{N-1}$) with the optimal cost if you started at that node, i.e., the minimum over all edges of the sum of the cost of the edge to the terminal node and the label of the terminal node it connects to. Then do the same for $k = N - 2$ and so on, all the way back to $k = 0$.</p>
<p><strong>Construction of optimal control sequence</strong></p>
<p>After the DP algorithm / labeling of the graph, we have $J_k^{*}(x_k)$ for all $x_k$. Reverse the process. Start from the initial state and choose the action with the minimum sum of the edge weight and label of the node it travels to.</p>
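<p>The backward labeling pass and the forward reconstruction can be sketched in a few lines of Python. This is a minimal illustration, not Bertsekas's code; the node names echo the scheduling example, but the stage costs below are invented for the demo:</p>

```python
def dp_shortest_path(stages, terminal_cost):
    """Backward DP: label every node with its optimal cost-to-go.

    stages[k] maps each state x_k to {x_{k+1}: g_k(x_k, u_k)}; choosing
    the successor node plays the role of the control u_k.
    """
    N = len(stages)
    J = [None] * (N + 1)
    J[N] = dict(terminal_cost)                       # J_N^*(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):                   # k = N-1, ..., 0
        J[k] = {x: min(g + J[k + 1][nxt] for nxt, g in succ.items())
                for x, succ in stages[k].items()}
    return J


def optimal_path(stages, J, x0):
    """Forward pass: repeatedly take the edge minimizing edge cost + label."""
    path = [x0]
    for k in range(len(stages)):
        succ = stages[k][path[-1]]
        path.append(min(succ, key=lambda nxt: succ[nxt] + J[k + 1][nxt]))
    return path


# Hypothetical two-stage graph (costs invented for illustration).
stages = [
    {"s0": {"A": 1, "C": 2}},                             # k = 0
    {"A": {"AB": 3, "AC": 1}, "C": {"CA": 2, "CD": 4}},   # k = 1
]
terminal = {"AB": 2, "AC": 5, "CA": 1, "CD": 3}           # g_N(x_N)
J = dp_shortest_path(stages, terminal)
```

<p>Here <code>J[0]["s0"]</code> is the optimal total cost, and <code>optimal_path</code> recovers the minimizing sequence of states by the reverse process described above.</p>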
<p><strong>Sources</strong></p>
<ul>
<li><a href="https://web.mit.edu/dimitrib/www/RL_1SHORTINTERNETPOSTED.pdf">Bertsekas, Chapter 1, Reinforcement Learning and Optimal Control</a></li>
<li><a href="https://web.mit.edu/dimitrib/www/Slides_Lecture1_RLOC.pdf">Bertsekas, Lecture 1, CSE 691</a></li>
<li><a href="https://web.mit.edu/dimitrib/www/Slides_Lecture2_RLOC.pdf">Bertsekas, Lecture 2, CSE 691</a></li>
</ul>

<p><strong>James-Stein Estimation</strong> (2021-01-30)</p>

<p><strong>What is James-Stein Estimation?</strong></p>
<p>Suppose we observe $\textbf{x} \in \mathbb{R}^p$, where $x_i = \theta_i + \epsilon_i$ for some unobserved $\theta \in \mathbb{R}^p$ and $\epsilon_i \sim N(0, 1)$, and our goal is to find a function $f(\textbf{x}) = \hat \theta$ that minimizes $E\left[\lVert \theta - \hat \theta \rVert^2\right]$.</p>
<p>Because the distributions that generate each $x_i$ are independent, it seems like all we can do is use each $x_i$ as our estimate of $\theta_i$ by choosing $f(\textbf{x}) = \textbf{x}$. Also, $\textbf{x}$ is the Maximum Likelihood estimate of $\theta$.</p>
<p>The James-Stein estimate instead chooses $f(\textbf{x}) = \left(1 - \frac{p - 2}{\lVert \textbf{x} \rVert^2}\right) \textbf{x}$.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> The surprising result is that this estimate does better (according to our evaluation criteria) than the Maximum Likelihood estimate for any $\theta$ when $p \ge 3$.</p>
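<p>One way to see the result concretely is a small Monte Carlo experiment. The sketch below (my own illustration, with a fixed hypothetical $\theta$ and standard normal noise) compares the average squared error of $\textbf{x}$ itself against the James-Stein estimate:</p>

```python
import numpy as np

def risks(theta, trials=2000, seed=0):
    """Monte Carlo estimates of E||x - theta||^2 and E||JS(x) - theta||^2."""
    rng = np.random.default_rng(seed)
    p = len(theta)
    mle_sse, js_sse = 0.0, 0.0
    for _ in range(trials):
        x = theta + rng.normal(size=p)             # x_i = theta_i + N(0, 1)
        js = (1 - (p - 2) / np.dot(x, x)) * x      # James-Stein shrinkage
        mle_sse += np.sum((x - theta) ** 2)
        js_sse += np.sum((js - theta) ** 2)
    return mle_sse / trials, js_sse / trials
```

<p>For $p = 10$ and $\theta = 0$, the maximum likelihood risk comes out near $p = 10$ while the James-Stein risk is far smaller; as $\lVert \theta \rVert$ grows, the gap shrinks but does not reverse.</p>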
<p><strong>Why is it so surprising?</strong></p>
<p>The James-Stein estimate uses all the samples to make a guess for each mean even though the distributions the samples come from are independent. It seems to suggest that we are improving our guess for a particular mean by using information from completely unrelated distributions. As described on the Wikipedia page for <a href="https://web.archive.org/web/20200601224707/https://en.wikipedia.org/wiki/Stein%27s_example">Stein’s example</a>: “To demonstrate the unintuitive nature of Stein’s example, consider the following real-world example. Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket…At first sight it appears that somehow we get a better estimator for US wheat yield by measuring some other unrelated statistics such as the number of spectators at Wimbledon and the weight of a candy bar.”</p>
<p><strong>What’s the trick?</strong></p>
<p>At a high level, the Wikipedia page explains that “This is of course absurd; we have not obtained a better estimator for US wheat yield by itself, but we have produced an estimator for the vector of the means of all three random variables, which has a reduced total risk. This occurs because the cost of a bad estimate in one component of the vector is compensated by a better estimate in another component. Also, a specific set of the three estimated mean values obtained with the new estimator will not necessarily be better than the ordinary set (the measured values). It is only on average that the new estimator is better.”</p>
<p><strong>How does it work?</strong></p>
<p>Each $x_i$ has some probability of being an outlier and misleading us about the value of $\theta_i$. But it’s much less likely for all the $x_i$ to be outliers, because the distributions are independent. Similarly, a fair coin has a 50% chance of landing on heads, but the chance of 10 coins all landing on heads is 1 in 1024. Relatedly, estimating a metric that summarizes $\theta$ (like its mean $\bar{\theta}$) is an easier task than estimating each component of $\theta$. If $\theta_1, \dots, \theta_p$ are similar enough, then $\bar{\textbf{x}}$ will give a better estimate of $\theta$ than $\textbf{x}$ (we see this clearly in the extreme case of $\theta_1 = \dots = \theta_p$, where the mean just uses more data to estimate the same quantity).</p>
<p>On the other hand, if the variance of $\theta$ is large, i.e., $\theta_1, \dots, \theta_p$ are very different from each other, then $\bar{\textbf{x}}$ will be a very poor estimate of $\theta$. In this case, getting a good estimate of $\bar{\theta}$ doesn’t help us.</p>
<p>There is a tradeoff between estimating $\theta$ based on a more reliable, but less relevant estimate of the global behavior and estimating it based on a less reliable, but more relevant estimate of its local behavior. This tradeoff depends on the variance of $\theta$, i.e., the consistency of its local behavior. We don’t know the variance of $\theta$, but we can estimate it based on the variance of $\textbf{x}$. We can then construct an estimator for $\theta_i$ that is an average between $x_i$ and $\bar{\textbf{x}}$ weighted by the inverse variance of $\textbf{x}$. The greater the variance, the more weight we put on $x_i$ for our estimate of $\theta_i$. The smaller the variance, the more weight we put on $\bar{\textbf{x}}$.</p>
<p>This is an instance of a BiasVariance tradeoff. A special feature of the mean squared error is that we can think of it as the sum of the squared bias of the estimator (how close is $\theta$ to $E[\hat \theta]$?) and the variance of the estimator (how close is $E[\hat \theta]$ to $\hat \theta$ on average?). In this context, we’re weighing the increased bias of using information from other distributions with the decreased variance from using more data.</p>
<p>To connect this back to the JamesStein estimate, consider $\lVert \theta \rVert$, i.e. the distance of $\theta$ from the origin, as our summary metric of $\theta$ instead of $\bar{\theta}$. We can get a reliable estimate of $\lVert \theta \rVert$ based on all of the data we observe. If $\lVert \theta \rVert$ is small, then a reasonable guess of $\theta$ is 0 and the increase in bias is worth the decrease in variance from this “smoothed” guess. Otherwise, $\textbf{x}$ is better: the increase in variance is offset by the decrease in bias from not guessing 0 for a vector with large magnitude.</p>
<p>We can construct an estimator for $\theta_i$ that is an average between $x_i$ and 0 weighted by the inverse of our estimate of $\lVert \theta \rVert$. The greater the magnitude, the more weight we put on $x_i$. The smaller the magnitude, the more weight we put on 0. The James-Stein estimate is this weighted average when the weight we put on $x_i$ is $1 - \frac{p - 2}{\lVert \textbf{x} \rVert^2}$.</p>
<p>Why this particular function of the inverse magnitude for the weight? For one, $E\left[\left\lVert \left(1 - \frac{p - 2}{\lVert \textbf{x} \rVert^2}\right) \textbf{x} \right\rVert^2\right] \approx E[\lVert \theta \rVert^2]$ when $p$ is large.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> In another surprising twist, it’s also an estimate of the slope of the line through the origin that we get from regressing $\theta$ on $\textbf{x}$.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> We can’t run this regression, because we don’t observe $\theta$, but we can still construct an estimate of the slope using $\textbf{x}$ alone. Under this interpretation, we use all the data to characterize a global relationship between $\theta$ and $\textbf{x}$ and then exploit that to make a local prediction for each $\theta_i$. If we don’t require the line to go through the origin, then our estimate for the equation of the line becomes $\frac{1}{\hat{\mathrm{Var}}[\theta]} \bar{\textbf{x}} + \left(1 - \frac{1}{\hat{\mathrm{Var}}[\theta]}\right) x_i$, an average of $\bar{\textbf{x}}$ and $x_i$ weighted by our estimate of the inverse variance of $\theta$.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
<p><strong>Footnotes</strong></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We can also replace $p - 2$ with any $c \in (0, 2p - 4)$ and the result still holds (<a href="https://projecteuclid.org/download/pdf_1/euclid.ss/1177012274">Stigler 1990</a>, Pg. 1) <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://www.stat.washington.edu/~pdhoff/courses/581/LectureNotes/Static/shrinkage.pdf">Hoff 2013</a>, Pg. 12 <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p><a href="https://projecteuclid.org/download/pdf_1/euclid.ss/1177012274">Stigler 1990</a> <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p><a href="https://piazza.com/class_profile/get_resource/hzdbtb6jdr56q1/i2kz4qj4x102b1">Jordan 2014</a>, Pg. 7 <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>

<p><strong>Kernel trick</strong> (2021-01-17)</p>

<p>Suppose we observe a training dataset $\{(\textbf{x}^{(i)}, y^{(i)})\}_{i=1}^n$ where each feature vector $\textbf{x}^{(i)} \in \mathbb{R}^p$ and each target $y^{(i)} \in \mathbb{R}$. We want to estimate a function $h(\textbf{x})$ that enables us to predict $y$ from $\textbf{x}$.</p>
<p>Organize the training dataset into the design matrix $\textbf{X} \in \mathbb{R}^{n \times p}$ and the target vector $\textbf{y} \in \mathbb{R}^n$. Ridge regression finds the coefficients $\textbf{w}$ that minimize $\lVert \textbf{y} - \textbf{X}\textbf{w} \rVert_2^2 + \lambda \lVert \textbf{w} \rVert_2^2$, where $\lambda$ controls the tradeoff between the fit to the training dataset and the complexity of the model (as measured by the magnitude of its coefficients).</p>
<p>We solve for $\textbf{w}$ using the regularized normal equation: $\textbf{w} = (\textbf{X}^T \textbf{X} + \lambda I_p)^{-1} \textbf{X}^T \textbf{y}$ (supposing we fit a model without an intercept, which we can always do by mean-centering each $y^{(i)}$ and $\textbf{x}^{(i)}$). Solving this $p \times p$ system of linear equations takes $O(p^3)$ time.</p>
<p>Suppose we want to increase the capacity of the model by adding interactions between features, e.g., we could define a new feature vector $\phi(\textbf{x}) = (\textbf{x}, x_1 \textbf{x}, x_2 \textbf{x}, \cdots, x_p \textbf{x})^T$. The dimensionality of $\phi(\textbf{x})$ is $O(p^2)$, making solving for $\textbf{w}_{\phi}$ take at least $O(p^6)$ time.</p>
<p>That’s slow, but let’s make it even worse. Redefine $\phi(x) = \exp \left( -\frac{x^2}{2 \sigma^2}\right) \left(1, \frac{x}{\sigma \sqrt{1!}}, \frac{x^2}{\sigma^2 \sqrt{2!}}, \frac{x^3}{\sigma^3 \sqrt{3!}}, \cdots \right)^T$ (written here for a scalar feature $x$). The virtue of this new representation of features is that the dot product of (think similarity between) $\textbf{x}^{(i)}$ and $\textbf{x}^{(j)}$ is given by the Gaussian function: $k(\textbf{x}^{(i)}, \textbf{x}^{(j)}) = \exp\left(-\frac{\lVert \textbf{x}^{(i)} - \textbf{x}^{(j)} \rVert_2^2}{2 \sigma^2}\right)$. It’s a nice way to encode smoothness in the feature space (and we can control the amount of smoothness by varying $\sigma$). The downside is that this is an infinite dimensional feature space, which seems intractable.</p>
<p>Go back to the original feature representation for a moment. Define the “kernel” matrix $\textbf{K}$ to contain the dot product between each pair of training examples, i.e., $K_{i,j} = (\textbf{x}^{(i)})^T \textbf{x}^{(j)}$, or in matrix form: $\textbf{K} = \textbf{X} \textbf{X}^T \in \mathbb{R}^{n \times n}$ (notice that this is different from the covariance matrix $\textbf{X}^T \textbf{X} \in \mathbb{R}^{p \times p}$). Let $\alpha = (\textbf{K} + \lambda I_n)^{-1} \textbf{y}$. With some algebra, we can show that $\textbf{w}^T \textbf{x} = \sum_{i=1}^n \alpha_i (\textbf{x}^T \textbf{x}^{(i)})$, so $\alpha$ gives the solution to the ridge regression problem, but where the features only show up in the equations for training and prediction as dot products.</p>
<p>We can run through the same steps for $\phi(\textbf{x})$. Even though $\phi(\textbf{x})$ is infinite dimensional, we only ever have to compute dot products, and the dot product of $\phi(\textbf{x}^{(i)})$ and $\phi(\textbf{x}^{(j)})$ is given by the Gaussian function. It takes $O(p)$ time to evaluate the Gaussian function. We have to compute it for every pair of points, giving $O(n^2 p)$ time to construct $\textbf{K}_{\phi}$. Then it takes $O(n^3)$ time to solve the $n \times n$ system of linear equations. So we can regress $y$ on an infinite number of features.</p>
<p>In general, avoiding the explicit computation of $\phi(\textbf{x})$ for some feature map $\phi$ by computing dot products in that feature space is called the “kernel trick”.</p>
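<p>A minimal NumPy sketch of kernel ridge regression with the Gaussian kernel (the function names are mine; this illustrates the bookkeeping above, not a tuned implementation):</p>

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian similarities between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=1.0, sigma=1.0):
    """Solve (K + lam * I) alpha = y: O(n^2 p) to build K, O(n^3) to solve."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, sigma=1.0):
    """A prediction is an alpha-weighted sum of similarities to training points."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```

<p>The feature map $\phi$ never appears explicitly; only kernel evaluations do.</p>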
<p><strong>Sources</strong></p>
<ul>
<li>https://people.eecs.berkeley.edu/~jrs/189s19/lec/14.pdf</li>
</ul>

<p><strong>Gradient descent for multilayer perceptrons</strong> (2021-01-09)</p>

<p>A multilayer perceptron (MLP) is a neural network that consists of an interleaved sequence of matrix multiplications and nonlinearities. An $L$-layer MLP $f(\textbf{X})$ takes the following form: $f^{[L]}(f^{[L-1]}(\dots f^{[2]}(f^{[1]}(\textbf{X}))))$, where $f^{[l]}(\textbf{X}) = g^{[l]}(\textbf{X} \textbf{W}^{[l]})$.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> For a binary classification task, the last weight matrix $\textbf{W}^{[L]}$ has dimensions $p^{[L-1]} \times 1$ and the last nonlinearity $g^{[L]}(z)$ is the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ in order to map the final output to a scalar between 0 and 1. With a training dataset consisting of $n$ examples and $p^{[1]}$ features organized into a design matrix $\textbf{X} \in \mathbb{R}^{n \times p^{[1]}}$ and a binary label vector $\textbf{y} \in \{0, 1\}^n$, we calculate the log loss over the training dataset as $J(\textbf{W}) = -\textbf{y}^T \log[f(\textbf{X})] - (1 - \textbf{y})^T \log[1 - f(\textbf{X})]$. Gradient descent can be used to find weights that minimize this loss. In practice, automatic differentiation libraries make it easy to compute $\nabla J(\textbf{W})$, but in this post we work through it by hand.</p>
<p>To help with the bookkeeping, let $\textbf{A}^{[0]} = \textbf{X}, \textbf{A}^{[1]} = g^{[1]}(\textbf{A}^{[0]} \textbf{W}^{[1]}), \dots, \textbf{A}^{[L]} = g^{[L]}(\textbf{A}^{[L-1]} \textbf{W}^{[L]})$ and $\textbf{Z}^{[1]} = \textbf{A}^{[0]} \textbf{W}^{[1]}, \dots, \textbf{Z}^{[L]} = \textbf{A}^{[L-1]} \textbf{W}^{[L]}$.</p>
<p><strong>Last layer</strong></p>
<p>We start with the last layer. By the chain rule, we have $\frac{\partial J}{\partial \textbf{W}^{[L]}} = \frac{\partial J}{\partial \textbf{A}^{[L]}} \frac{\partial \textbf{A}^{[L]}}{\partial \textbf{Z}^{[L]}} \frac{\partial \textbf{Z}^{[L]}}{\partial \textbf{W}^{[L]}}$.</p>
<p>The matrixbymatrix derivatives get unwieldy, so we compute scalarbyscalar derivatives and then convert back to matrix form.</p>
<p>$\frac{\partial J}{\partial a_{ij}^{[L]}} = -\frac{y_i}{a_{ij}^{[L]}} + \frac{1 - y_i}{1 - a_{ij}^{[L]}}$</p>
<p>$\frac{\partial a_{ij}^{[L]}}{\partial z_{kl}^{[L]}} = \frac{\partial \sigma(z_{ij}^{[L]})}{\partial z_{kl}^{[L]}} = a_{ij}^{[L]} (1 - a_{ij}^{[L]})$ when $i = k$ and $j = l$ and 0 otherwise.</p>
<p>Combining these 2 terms and simplifying, we have $\frac{\partial J}{\partial \textbf{Z}^{[L]}} = \textbf{A}^{[L]} - \textbf{y}$. This looks like the $\textbf{y}$ vector is subtracted from a matrix, but recall that $\textbf{A}^{[L]}$ in the last layer has dimensions $n \times 1$.</p>
<p>$\frac{\partial z_{ij}^{[L]}}{\partial w_{kl}^{[L]}} = \sum_{m=1}^{p^{[L-1]}} a_{im}^{[L-1]} \frac{\partial}{\partial w_{kl}^{[L]}} w_{mj}^{[L]} = a_{ik}^{[L-1]}$ if $j = l$ (only the $m = k$ term of the sum survives) and 0 otherwise.</p>
<p>Putting all this together, we get $\frac{\partial J}{\partial \textbf{W}^{[L]}} = (\textbf{A}^{[L-1]})^T (\textbf{A}^{[L]} - \textbf{y})$, which has the same $p^{[L-1]} \times 1$ shape as $\textbf{W}^{[L]}$.</p>
<p><strong>Penultimate layer</strong></p>
<p>We work through one other layer before generalizing: $\frac{\partial J}{\partial \textbf{W}^{[L-1]}} = \frac{\partial J}{\partial \textbf{Z}^{[L]}} \frac{\partial \textbf{Z}^{[L]}}{\partial \textbf{A}^{[L-1]}} \frac{\partial \textbf{A}^{[L-1]}}{\partial \textbf{Z}^{[L-1]}} \frac{\partial \textbf{Z}^{[L-1]}}{\partial \textbf{W}^{[L-1]}}$.</p>
<p>$\frac{\partial J}{\partial \textbf{Z}^{[L]}} = \textbf{A}^{[L]} - \textbf{y}$ (see section above)</p>
<p>$\frac{\partial \textbf{Z}^{[L]}}{\partial \textbf{A}^{[L1]}} = \textbf{W}^{[L]}$</p>
<p>$\frac{\partial \textbf{A}^{[L-1]}}{\partial \textbf{Z}^{[L-1]}} = g'^{[L-1]}(\textbf{Z}^{[L-1]})$ (the derivative of whatever nonlinearity we use for this layer)</p>
<p>$\frac{\partial \textbf{Z}^{[L-1]}}{\partial \textbf{W}^{[L-1]}} = (\textbf{A}^{[L-2]})^T$ (see section above)</p>
<p>Putting all this together (ordering and transposing the factors so the shapes conform), we get $\frac{\partial J}{\partial \textbf{W}^{[L-1]}} = (\textbf{A}^{[L-2]})^T \left[ \left( (\textbf{A}^{[L]} - \textbf{y}) (\textbf{W}^{[L]})^T \right) \odot g'^{[L-1]}(\textbf{Z}^{[L-1]}) \right]$.</p>
<p><strong>All layers</strong></p>
<p>We start with $\frac{\partial J}{\partial \textbf{Z}^{[L]}} = \textbf{A}^{[L]} - \textbf{y}$. Then, proceeding backwards through the layers, we compute $\frac{\partial J}{\partial \textbf{W}^{[l]}} = (\textbf{A}^{[l-1]})^T \frac{\partial J}{\partial \textbf{Z}^{[l]}}$ and $\frac{\partial J}{\partial \textbf{Z}^{[l]}} = \left( \frac{\partial J}{\partial \textbf{Z}^{[l+1]}} (\textbf{W}^{[l+1]})^T \right) \odot g'^{[l]}(\textbf{Z}^{[l]})$.</p>
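<p>The two recursions translate directly into code. Below is a minimal NumPy sketch (my own illustration), assuming for simplicity that every layer uses the sigmoid, so $g'^{[l]}(\textbf{Z}^{[l]}) = \textbf{A}^{[l]} \odot (1 - \textbf{A}^{[l]})$:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, Ws):
    """Forward pass, recording every Z^[l] and A^[l] for the backward pass."""
    As, Zs = [X], []
    for W in Ws:
        Zs.append(As[-1] @ W)
        As.append(sigmoid(Zs[-1]))
    return Zs, As

def backward(X, y, Ws):
    """Gradients of the log loss w.r.t. each W^[l] via the recursions above."""
    Zs, As = forward(X, Ws)
    grads = [None] * len(Ws)
    dZ = As[-1] - y                                    # dJ/dZ^[L] = A^[L] - y
    for l in range(len(Ws) - 1, -1, -1):
        grads[l] = As[l].T @ dZ                        # (A^[l-1])^T dJ/dZ^[l]
        if l > 0:
            dZ = (dZ @ Ws[l].T) * As[l] * (1 - As[l])  # sigmoid: g' = A(1 - A)
    return grads
```

<p>A numerical gradient check (perturb one weight, difference the loss) is a good way to confirm the bookkeeping.</p>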
<p><strong>Sources</strong></p>
<ul>
<li>Week 4, Neural Networks and Deep Learning, Coursera, Reading</li>
<li>https://web.stanford.edu/class/cs224n/readings/gradientnotes.pdf</li>
<li>http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn2017/resources/Matrix_derivatives_cribsheet.pdf</li>
<li>https://en.wikipedia.org/wiki/Matrix_calculus</li>
</ul>
<p><strong>Footnotes</strong></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Usually, the MLP also has a bias term in each layer, i.e., $f^{[l]}(\textbf{X}) = g^{[l]}(\textbf{X} \textbf{W}^{[l]} + \textbf{b}^{[l]})$, but I set $\textbf{b}^{[l]} = 0$ here to reduce the plethora of notation already in this post. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>

<p><strong>Automatic differentiation</strong> (2021-01-08)</p>

<p><strong>Automatic differentiation</strong> is a technique for computing the gradient of a function specified by a computer program.</p>
<p>It takes advantage of the fact that any function implemented in a computer program can be decomposed into primitive operations (or else how would the function be implemented in the first place?), which are themselves easy to differentiate and whose derivatives can then be combined to get the derivative of the original function.</p>
<p>For example, suppose our primitive operations are multiplying by a constant, adding a constant and exponentiating by a constant. And the function we want to differentiate is $f(x) = 2 x^3 + 7$. We can write $f(x)$ as a composition of primitive operations. Define $f_1(x) = x + 7$, $f_2(x) = 2x$, $f_3(x) = x^3$. Then $f(x) = f_1(f_2(f_3(x)))$.</p>
<p>We can apply the chain rule to get the derivative. Define $g(x) = f_2(f_3(x))$. The chain rule states that $\frac{\partial}{\partial x} \left[ f_1(g(x)) \right] = f_1'(g(x)) g'(x)$. By the same token, $g'(x) = f_2'(f_3(x)) f_3'(x)$. Putting that together, we have $f'(x) = f_1'(f_2(f_3(x))) f_2'(f_3(x)) f_3'(x)$, which is the derivative of the original function written in terms of the derivatives of primitive operations.</p>
<p>How do we do this more generally? First, we need a way to represent a function. Supposing that we can decompose a function into primitive operations, then we can represent a function as a <strong>computational graph</strong> where each node in the graph (a directed acyclic graph) is either a primitive operation or a variable. For example, the computational graph for $f(x, y, z) = (x + y) \cdot z$ looks like:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    *
   / \
  +   z
 / \
x   y
</code></pre></div></div>
<p>An automatic differentiator takes the root of a computational graph as input and values for the variable nodes and returns the gradient of the function evaluated at those input values.</p>
<p><strong>How do we compute the gradient using the graph?</strong> Even though we don’t yet know how to calculate the partial derivative of the root node with respect to one of its grandchildren ($x$ or $y$), we first notice that it’s easy to calculate the partial derivative of a node with respect to one of its children, because each parent node is a primitive operation and we know how to calculate derivatives for primitive operations by applying one of a few formulas. For example, relabel the computational graph above with $a = x + y$ and $b = az$:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      b
      *
     / \
    a   z
    +
   / \
  x   y
</code></pre></div></div>
<p>The partial derivative of the $b$ node with respect to its child node $a$ is just $\frac{\partial}{\partial a} b = \frac{\partial}{\partial a} a \cdot z = a \frac{\partial{z}}{\partial{a}} + z \frac{\partial{a}}{\partial{a}} = z$ according to the product rule of calculus (the first term vanishes because $z$ does not depend on $a$).</p>
<p>We can label each edge with the partial derivative of the parent/destination node with respect to its child/source node. Now, how do we use that information to get the partial derivatives of the root with respect to each of the leaf/variable nodes in the graph? For a given leaf node, it turns out that the sum over all the paths from that leaf to the root of the product of the edges for each path gives you the partial derivative of the root node with respect to the leaf node.</p>
<p>That procedure is just a visual way of describing the multivariate chain rule. Suppose we have a function $f(u_1, u_2, \cdots, u_n)$ where the input variables depend on some other variable $x$ ($f$ should be thought of as the root node in the graph, $u_i$ as intermediate nodes and $x$ as a leaf node), then $\frac{\partial f}{\partial x} = \sum_{i=1}^n \frac{\partial f}{\partial u_i} \frac{\partial u_i}{\partial x}$.</p>
<p><strong>Forward and reverse mode accumulation</strong>: The way that we traverse the graph to compute these partial derivatives can dramatically alter the efficiency of the computation. Consider the following graph:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      y
      |
     u_k
      |
      .
      .
      .
      |
     u_1
      +
   /  |  \
 x1  ...  xp
</code></pre></div></div>
<p>If we want to calculate the gradient of $y$ and we start from the leaf nodes and move up, then we first calculate $\frac{\partial y}{\partial x_1} = \frac{\partial u_1}{\partial x_1} \frac{\partial u_2}{\partial u_1} \dots \frac{\partial u_k}{\partial u_{k-1}} \frac{\partial y}{\partial u_k}$. We then sweep through the graph again to calculate $\frac{\partial y}{\partial x_2} = \frac{\partial u_1}{\partial x_2} \frac{\partial u_2}{\partial u_1} \dots \frac{\partial u_k}{\partial u_{k-1}} \frac{\partial y}{\partial u_k}$. And again for $\frac{\partial y}{\partial x_3}$ and so on, each time repeating the computation $\frac{\partial u_2}{\partial u_1} \dots \frac{\partial u_k}{\partial u_{k-1}} \frac{\partial y}{\partial u_k}$. This is forward mode accumulation and it requires $k \cdot p$ multiplies to get the gradient.</p>
<p>If we instead start from the top of the graph and move downwards, caching the results as we go, then we first calculate $\frac{\partial y}{\partial u_k}$, then $\frac{\partial y}{\partial u_{k-1}}$ as $\frac{\partial y}{\partial u_k} \cdot \frac{\partial u_k}{\partial u_{k-1}}$ using the cached value for the first term, and so on down the graph. This is reverse mode accumulation and it only requires $k + p$ multiplies to get the gradient. In the case of gradient descent on the cost function of a neural network with millions of parameters, automatic differentiation with reverse mode accumulation (also called <strong>backpropagation</strong>) makes optimization feasible.</p>
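<p>A toy reverse-mode differentiator fits in a few dozen lines. The sketch below (a simplified illustration supporting only addition and multiplication) builds the computational graph as values are combined, then sweeps from the root toward the leaves in topological order, summing products of edge partials over all paths exactly as described above:</p>

```python
class Var:
    """A node in the computational graph: a value plus edges to its inputs."""

    def __init__(self, value, children=(), local_grads=()):
        self.value = value
        self.children = children        # input nodes this node was built from
        self.local_grads = local_grads  # d(self)/d(child), one per child
        self.grad = 0.0                 # holds d(root)/d(self) after backward()

    def __add__(self, other):
        return Var(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Var(self.value * other.value, (self, other),
                   (other.value, self.value))

    def backward(self):
        """Reverse-mode sweep: propagate d(root)/d(node) down each edge,
        accumulating (summing over paths) into each child's grad."""
        order, seen = [], set()

        def topo(v):
            if id(v) not in seen:
                seen.add(id(v))
                for c in v.children:
                    topo(c)
                order.append(v)

        topo(self)
        self.grad = 1.0
        for node in reversed(order):
            for child, g in zip(node.children, node.local_grads):
                child.grad += node.grad * g
```

<p>For $f(x, y, z) = (x + y) \cdot z$ evaluated at $(2, 3, 4)$, the backward pass recovers $\frac{\partial f}{\partial x} = z = 4$ and $\frac{\partial f}{\partial z} = x + y = 5$.</p>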
<p><strong>Sources</strong></p>
<ul>
<li>https://colah.github.io/posts/201508Backprop/</li>
<li>https://www.offconvex.org/2016/12/20/backprop/</li>
<li>https://en.wikipedia.org/wiki/Automatic_differentiation</li>
<li>https://arxiv.org/pdf/1811.05031.pdf</li>
</ul>

<p><strong>Gradient descent</strong> (2021-01-06)</p>

<p><strong>Gradient descent</strong> is an algorithm for finding a minimum of a function.</p>
<p>It works as follows. Given a function $f: \mathbb{R}^p \mapsto \mathbb{R}$ and a <strong>learning rate</strong> $\lambda \in \mathbb{R}^{+}$, choose a random starting point $\textbf{w}^{(0)} \in \mathbb{R}^p$ and apply the following update iteratively: $\textbf{w}^{(t)} = \textbf{w}^{(t-1)} - \lambda \nabla f(\textbf{w}^{(t-1)})$ for $t = 1, 2, 3, \dots$ Stop when $\textbf{w}^{(t)} \approx \textbf{w}^{(t-1)}$, or when $f(\textbf{w}^{(t)})$ is less than some threshold, or using some other heuristic. Take the last $\textbf{w}^{(t)}$ as a minimizer of the function.</p>
<p>$\nabla f(\textbf{w}^{(t-1)})$ is the <strong>gradient</strong> of $f$ evaluated at $\textbf{w}^{(t-1)}$. In general, $\nabla f(\textbf{w}) = \begin{bmatrix} \frac{\partial f}{\partial w_1} (\textbf{w}) & \cdots & \frac{\partial f}{\partial w_p}(\textbf{w}) \end{bmatrix}^T$. The $i$th component tells us how much the function changes when we move $w_i$ a little bit while keeping the other components fixed. Computing $\nabla f(\textbf{w}) \cdot \textbf{u}$ tells us how much the function changes when we move $\textbf{w}$ a little bit in the direction of some vector $\textbf{u}$. Which direction will most decrease the function? The dot product is minimized when the two vectors point in opposite directions, i.e., when $\textbf{u} = -\nabla f(\textbf{w})$. Therefore, gradient descent is a greedy algorithm that at each point takes a step in the direction (the negative of the gradient) that most decreases the function near the point.</p>
<p><strong>Example:</strong> Let $f: \mathbb{R} \mapsto \mathbb{R}$. In particular, let $f(w) = w^2$, then $\nabla f(w) = 2w$. The update rule is then $w^{(t)} = w^{(t-1)} - 2 \lambda w^{(t-1)} = (1 - 2 \lambda) w^{(t-1)}$. In this unusual case, we can write down a closed form solution: $w^{(t)} = (1 - 2 \lambda)^{t} w^{(0)}$. If $\lambda \in (0, \frac{1}{2})$, then as $t$ gets larger, $w^{(t)}$ approaches 0, which is the minimum of the function. What happens if we choose $\lambda > \frac{1}{2}$? Then $1 - 2\lambda$ is negative and $w^{(t)}$ flips sign at every step; for $\lambda \in (\frac{1}{2}, 1)$ it still converges to 0 while oscillating, but for $\lambda > 1$ we have $\lvert 1 - 2\lambda \rvert > 1$ and the iterates diverge. If we make $\lambda$ very small, then $w^{(t)}$ will get closer and closer to 0, but it might take a while before our stopping criterion kicks in. In practice, we try several different values of $\lambda$ and pick one based on plots of $f(w^{(t)})$ against $t$.</p>
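<p>The example is easy to check numerically with a minimal sketch of the update rule (the function name is mine):</p>

```python
def gradient_descent(grad, w0, lam, steps):
    """Iterate the update w <- w - lam * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - lam * grad(w)
    return w

# For f(w) = w^2, grad(w) = 2w, so each step computes w <- (1 - 2*lam) * w.
f_grad = lambda w: 2 * w
```

<p>Running this with a small $\lambda$ such as 0.1 drives $w^{(t)}$ toward 0, while a large $\lambda$ such as 1.5 makes the iterates blow up.</p>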
<p><strong>Sources</strong></p>
<ul>
<li>http://web.archive.org/save/https://www.khanacademy.org/math/multivariablecalculus/multivariablederivatives/partialderivativeandgradientarticles/a/directionalderivativesgoingdeeper</li>
<li>https://math.stackexchange.com/questions/223252/whyisgradientthedirectionofsteepestascent</li>
<li>Gradient descent, how neural networks learn, 3Blue1Brown (https://www.youtube.com/watch?v=IHZwWFHWaw)</li>
<li>https://www.stat.cmu.edu/~ryantibs/convexoptF15/lectures/05graddescent.pdf</li>
</ul>