Mathematics for Intelligent Systems Lecture 5 Homework Solutions


(Advanced Calculus I: Derivatives and local geometry)
Nathan Ratliff
November 25, 2014

1 Problem 1: Gradient and Hessian Calculations

We've seen that the gradient and Hessian of h(x) = ‖x‖ are

    ∇h(x) = x/‖x‖  and  ∇²h(x) = (1/‖x‖)(I − x̂ x̂ᵀ),  where x̂ = x/‖x‖.  (1)

We can plug these in when they become necessary below.

1.1 Calculating the derivatives

For generic differentiable h, the partials of ψ(x) = log(e^{h(x)} + e^{−h(x)}) are

    ∂ψ/∂x_k = [1/(e^h + e^{−h})] (e^h ∂h/∂x_k − e^{−h} ∂h/∂x_k)  (2)
            = [(e^h − e^{−h})/(e^h + e^{−h})] ∂h/∂x_k = α(x) ∂h/∂x_k.  (3)

This expression shows that the kth partial of ψ(x) is just the kth partial of h weighted by a factor α(x). This weighting function has the nice property that as h(x) gets large, it approaches the value 1, and as h(x) approaches zero, it vanishes. We'll use those properties later on when analyzing limits. Stacking these partial derivatives, we get the following expression for the full gradient:

    ∇ψ(x) = α(x) ∇h(x).  (4)
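As a quick sanity check of Equation (4) for the concrete case h(x) = ‖x‖, the analytic gradient can be compared against central finite differences. This is a sketch of my own (the names psi, grad_psi, and num_grad are mine, not from the notes), assuming NumPy is available:

```python
import numpy as np

def psi(x):
    # psi(x) = log(e^{||x||} + e^{-||x||}); logaddexp keeps it stable for large ||x||
    h = np.linalg.norm(x)
    return np.logaddexp(h, -h)

def grad_psi(x):
    # Equation (4): grad psi = alpha(x) grad h, with alpha = tanh(||x||) and grad h = x/||x||
    h = np.linalg.norm(x)
    return np.tanh(h) * x / h

def num_grad(f, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for k in range(len(x)):
        e = np.zeros_like(x); e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([0.3, -1.2, 0.7])
print(np.max(np.abs(grad_psi(x) - num_grad(psi, x))))
```

The printed discrepancy should be on the order of finite-difference error, i.e. far below 1e-5.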

Now for the Hessian. We simply take the second partial of the expression in Equation (3):

    ∂²ψ/∂x_l ∂x_k = ∂/∂x_l [α(x) ∂h/∂x_k] = α(x) ∂²h/∂x_l ∂x_k + (∂α/∂x_l)(∂h/∂x_k).  (5)

Already we know that the first of those terms is going to be the Hessian of h weighted by α(x). To calculate the second term, we need to compute the first partial of α(x):

    ∂α/∂x_l = ∂/∂x_l [(e^h − e^{−h})/(e^h + e^{−h})]
            = [(e^h + e^{−h})(e^h + e^{−h}) ∂h/∂x_l − (e^h − e^{−h})(e^h − e^{−h}) ∂h/∂x_l] / (e^h + e^{−h})²  (6)
            = [1 − ((e^h − e^{−h})/(e^h + e^{−h}))²] ∂h/∂x_l  (7)
            = (1 − α(x)²) ∂h/∂x_l.  (8)

Plugging this expression back into Equation (5), we get

    ∂²ψ/∂x_l ∂x_k = α(x) ∂²h/∂x_l ∂x_k + (1 − α(x)²) (∂h/∂x_l)(∂h/∂x_k).  (9)

That first term is just the scaled Hessian of h, as we mentioned above, and the second term is a scaled version of the outer product of h's gradient vector. So the full Hessian matrix is

    ∇²ψ(x) = α(x) ∇²h(x) + (1 − α(x)²) ∇h(x) ∇h(x)ᵀ.  (10)

Note that correctly tracking the indices k and l throughout the computation is just about as easy as using a primed notation, since these partial derivatives themselves represent one-dimensional derivatives of the restricted function created by fixing the values of all variables except the one in question. Often people make the notation even simpler by writing ∂_k ≡ ∂/∂x_k. Importantly, correctly tracking these indices gives us confidence that the final solution is properly arranged in matrix form. In this case, the Hessian is always symmetric, so dimensionality analysis can sometimes help us guess the right answer; but when we make things a little more difficult by finding the Jacobian of mappings of the form φ : ℝⁿ → ℝⁿ, for instance, or move even beyond that to tensors of higher-order derivatives, things get more hairy. It's very important, as the problems become harder, that you meticulously track those indices. (If you look up the Einstein summation convention for higher-order tensors, you'll notice that for hard problems this technique of manipulating the partials is really all you can do.) So I encourage you to be careful with your indices. It'll save you a lot of anguish as derivatives become more complex.
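Equation (10) can be checked the same way, by differencing the analytic gradient to approximate the Hessian. Again a sketch of my own, with h(x) = ‖x‖ and ∇²h from Equation (1) plugged in:

```python
import numpy as np

def grad_psi(x):
    # Equation (4) with h(x) = ||x||
    h = np.linalg.norm(x)
    return np.tanh(h) * x / h

def hess_psi(x):
    # Equation (10): alpha * Hess h + (1 - alpha^2) * grad h grad h^T
    h = np.linalg.norm(x)
    xhat = x / h
    a = np.tanh(h)
    hess_h = (np.eye(len(x)) - np.outer(xhat, xhat)) / h   # Equation (1)
    return a * hess_h + (1 - a**2) * np.outer(xhat, xhat)

def num_jac(f, x, eps=1e-6):
    # finite-difference Jacobian of a vector-valued function f
    n = len(x)
    J = np.zeros((n, n))
    for k in range(n):
        e = np.zeros(n); e[k] = eps
        J[:, k] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.array([0.5, -0.4, 1.1])
print(np.max(np.abs(hess_psi(x) - num_jac(grad_psi, x))))
```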

1.2 Taking the limits

Now for the limits. From the intuition we gained from graphing the function, we might expect the gradient and Hessian of ψ away from zero to increasingly look like the gradient and Hessian of h(x) = ‖x‖. And as we approach zero, the point of the cone formed by the graph of h(x) becomes increasingly rounded; we'd expect the gradient at least to be 0 at that point, and we might guess, if all goes right, that the Hessian is just the identity I since it's uniform in all directions. We can even look at the second derivative in one dimension to better convince ourselves that the identity makes sense, but to be sure about all of this, we should perform the actual calculations.

First we note that α(x) approaches 1 as ‖x‖ → ∞, and α(0) = 0. The gradient of ψ is the same as the gradient of h, just scaled by α, so as ‖x‖ gets large, since α approaches 1, the gradient simply approaches the gradient of h, as we thought. We can make a similar argument for the first term of the Hessian in Equation (10): α approaches 1 as ‖x‖ gets large, so that term approaches the Hessian of h. For the second term, we note that the gradients ∇h = x/‖x‖ all have bounded norm (specifically, the norm is always 1). That means each element of the vector is bounded in absolute value, too. Since α → 1 as ‖x‖ gets large, the weight on this term, 1 − α², approaches 0, so we can bound each element of the outer product ∇h ∇hᵀ above and below by a value that approaches 0 as ‖x‖ gets large. That term, then, must go to zero in this limit. In all, that tells us that ∇²ψ looks more and more like ∇²h as x gets bigger and bigger in norm, matching our intuition.

In the other direction, as x → 0, the weight α(x) approaches zero. So for the gradient ∇ψ = α ∇h, since ∇h always has norm 1, we can make the same bounding argument for its elements as we made for the second term of the Hessian to say that all components of this gradient must approach zero.
The gradient, therefore, along all paths converging to x = 0, limits to 0. Note that from the viewpoint of the constituent functions that went into creating ψ, since ∇h is undefined at x = 0, it seems like the gradient of ψ should be undefined there as well. However, this function isn't really something constructed from its constituent components; it's a mapping from all points x to the corresponding values ψ(x) ∈ ℝ, and from that perspective, we know nothing about what went into its construction. If we return to the definition of partial derivatives, we see that the derivative is actually the limit of the difference quotients as the perturbation of a variable x_k gets smaller and smaller. That limit is perfectly well defined for this function at x = 0, so we can conclude that the function is differentiable there. And the full gradient there, as we found in the above analysis, is 0. Thus, it's absolutely correct to return the value 0 in an implementation of the gradient evaluation at x = 0. This isn't just a trick; it's the right calculation for the function ψ.

The limit of the Hessian is a bit trickier. Naïvely, it's tempting to say that, looking back at Equation (10), since α → 0 as x → 0, the first term vanishes and the weight on the second term approaches 1, leaving us with just the unweighted second term ∇h ∇hᵀ. In our case, that would suggest that the Hessian limits to x̂ x̂ᵀ, meaning that it entirely depends on the path we take while approaching x = 0. And we'd conclude that there is no definitive limit, and that this function behaves poorly around the origin. But we would be wrong! The reason is that while α → 0, we need to remember that ∇²h = (1/‖x‖)(I − x̂ x̂ᵀ) explodes to infinity. Thus, we have a situation where two separate factors are approaching zero and infinity simultaneously, so this is an indeterminate form. We need to use L'Hôpital! Denoting h = ‖x‖, we can collect the Hessian terms as follows:

    ∇²ψ = (α/h)(I − x̂ x̂ᵀ) + (1 − α²) x̂ x̂ᵀ = (α/h) I + (1 − α² − α/h) x̂ x̂ᵀ.  (11)

It's clear what all the elements of this expression that pertain to α do, except α/h. That term, as x → 0, approaches 0/0, so we need L'Hôpital to understand its limit. Expanding that term, we get

    α/h = (e^h − e^{−h}) / (h (e^h + e^{−h})).  (12)

This fraction is an indeterminate form, so we can take the derivative of the numerator and denominator separately and examine their ratio:

    (e^h + e^{−h}) / [(e^h + e^{−h}) + h (e^h − e^{−h})] = 1 / (1 + h α) → 1.  (13)

Thus, this weird term limits to 1. (Note that we could also apply L'Hôpital directly to the fraction α/h as a function of h. The calculation is slightly more complicated, but the new ratio becomes 1 − α², which also limits to 1.) Returning to the Hessian in Equation (11), we see now that the Hessian limits to

    lim_{h→0} ∇²ψ = lim_{h→0} [(α/h) I + (1 − α² − α/h) x̂ x̂ᵀ] = 1·I + (1 − 0 − 1) x̂ x̂ᵀ = I.  (14)

So we conclude that the Hessian, as we intuitively guessed above, is actually I at x = 0.

1.3 A note on checking your work with a computer

There are a number of tools these days for checking that you've calculated and implemented these formulas correctly. I'll mention a few of them here since they came up during our discussion. By far the easiest and most reliable is numeric differentiation (i.e. finite differencing). Section 5.3 of this week's lecture notes (Multivariate Calculus I: Derivatives and local geometry) talks about that method at length. The great thing about numeric differentiation is that it's largely agnostic to simplification of the expression. Independent of how convoluted you make your formula, as long as the implementation is correct, it'll spit out the right numbers, and numeric differentiation will report the same (approximate) values. You don't have to stare at a different expression and try to verify that it's the same.

On the other hand, Mathematica (and, by extension, Matlab) has very powerful symbolic differentiation tools that we can use. Unfortunately, those tools may do a poor job of simplification (which is a hard and largely ambiguous problem). They take symbolic representations of the function and report back its first and second derivatives, but in the end you may get grotesquely complicated strings of terms that you'll have to somehow relate back to your own expression. You can always evaluate those expressions at various points to gain confidence, but it's often easier to implement a numeric differentiation module and use it everywhere. One place where you really want to use it is in unit tests. You should unit test all of your derivative implementations using numeric differentiation. Unit tests will verify their correctness and ensure that changes to the code never break anything. Once your function value implementation diverges from the gradient and Hessian computations, all bets are off. Those are hard bugs to reason through.

The last tool we'll discuss is automatic differentiation. Automatic differentiation is different from both symbolic and numeric differentiation; you can think of it as lying somewhere between the two.
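In that spirit, here is a small numeric check (my own sketch, not from the notes) of the limit results from §1.2: the closed-form gradient and Hessian of ψ with h(x) = ‖x‖ should vanish and approach I near the origin, and should approach the gradient and Hessian of ‖x‖ far from it:

```python
import numpy as np

def grad_psi(x):
    # Equation (4): alpha(||x||) * x/||x||, with alpha = tanh
    h = np.linalg.norm(x)
    return np.tanh(h) * x / h

def hess_psi(x):
    # Equation (11): (alpha/h) I + (1 - alpha^2 - alpha/h) xhat xhat^T
    h = np.linalg.norm(x)
    xhat = x / h
    a = np.tanh(h)
    return (a / h) * np.eye(len(x)) + (1 - a**2 - a / h) * np.outer(xhat, xhat)

d = np.array([1.0, 2.0, -0.5]); d /= np.linalg.norm(d)   # an arbitrary unit direction

# Near the origin: gradient -> 0, Hessian -> I, regardless of direction d
print(np.linalg.norm(grad_psi(1e-6 * d)))
print(np.max(np.abs(hess_psi(1e-6 * d) - np.eye(3))))

# Far from the origin: gradient -> d = grad ||x||, Hessian -> (I - d d^T)/||x||
print(np.linalg.norm(grad_psi(50.0 * d) - d))
print(np.max(np.abs(hess_psi(50.0 * d) - (np.eye(3) - np.outer(d, d)) / 50.0)))
```

Note that evaluating Equation (11) directly near x = 0 stays well behaved numerically because tanh(h)/h ≈ 1 − h²/3 for small h; an implementation should still special-case x = 0 exactly, returning 0 and I as derived above.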
The idea's been around for a very long time (at least dating back to the 1980s), but outside of the backpropagation algorithm for neural networks in machine learning (which is an independently developed instance of automatic differentiation used to calculate the gradient of the network), we're only now seeing a strong resurgence of its use within the machine learning and robotics communities. The basic idea is that the chain rule, product rule, and all the other rules of differentiation are very regular and algorithmic. If you implement a function evaluation using a collection of well-known primitives (such as exp(), sin(), cos(), log()), then while traversing the execution tree to evaluate the function, we can keep track of just a little bit more information along the way to simultaneously compute the gradient, and even the Hessian. These tools are very powerful, and in some cases you can use them to replace derivative calculations. Implement a function from primitives that support automatic differentiation, and you can call a pre-built method to calculate the value and gradient automatically. It'll calculate both the function value and the corresponding gradient at a given point with no effort on your part at all. Unfortunately, the problem of finding the optimal (i.e. most efficient) computation tree for derivative calculations is intractable, and you can usually find a computationally more efficient way to directly implement the gradient and (especially) the Hessian that automatic differentiators probably wouldn't find. But automatic differentiation is significantly more efficient than symbolic differentiation, since it's just computing the value of the derivatives at a particular point, and not the entire symbolic expression giving the gradient everywhere. And often in modern applications the networks of functions are complicated enough that it may make sense to trade off computational efficiency for speed of implementation and correctness. So use it if it makes sense; you can really only get a feel for the trade-offs by playing around with it. Otherwise, I'd recommend numeric differentiation over symbolic differentiation, since it's a more general and reliable tool for unit testing your code.

2 Problem 2: Gradient and Hessian Calculations for Machine Learning

2.1 Sigmoid derivatives

We present the derivatives as a lemma, and include the derivation below.

Lemma 1 (Sigmoid analysis). Let σ(x) = 1/(1 + e^{−x}). Its first derivative is

    dσ/dx = σ(1 − σ),  (15)

and its second derivative is

    d²σ/dx² = (1 − 2σ) σ(1 − σ).  (16)

Proof. We derive the result by straightforward calculation. For the first derivative,

    dσ/dx = d/dx (1 + e^{−x})⁻¹ = e^{−x}/(1 + e^{−x})² = [1/(1 + e^{−x})] · [e^{−x}/(1 + e^{−x})] = σ(1 − σ),

since e^{−x}/(1 + e^{−x}) = 1 − σ. For the second derivative we have

    d²σ/dx² = d/dx [e^{−x}/(1 + e^{−x})²]
            = [−e^{−x}(1 + e^{−x})² + 2 e^{−x}(1 + e^{−x}) e^{−x}] / (1 + e^{−x})⁴
            = −e^{−x}/(1 + e^{−x})² + 2 e^{−2x}/(1 + e^{−x})³
            = −σ(1 − σ) + 2 σ(1 − σ)²
            = σ(1 − σ)[2(1 − σ) − 1]
            = (1 − 2σ) σ(1 − σ).  ∎
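Both identities in the lemma are easy to confirm by finite differencing, exactly in the unit-test spirit of §1.3. This little check is my own sketch (the names sigma, dsigma, d2sigma are mine):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    # Lemma 1, Equation (15): sigma' = sigma (1 - sigma)
    s = sigma(x)
    return s * (1 - s)

def d2sigma(x):
    # Lemma 1, Equation (16): sigma'' = (1 - 2 sigma) sigma (1 - sigma)
    s = sigma(x)
    return (1 - 2 * s) * s * (1 - s)

eps = 1e-4
for x in [-2.0, 0.0, 1.3]:
    fd1 = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)           # central first difference
    fd2 = (sigma(x + eps) - 2 * sigma(x) + sigma(x - eps)) / eps**2  # central second difference
    print(abs(fd1 - dsigma(x)), abs(fd2 - d2sigma(x)))
```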

2.2 Logistic regression

Generally, for any y ∈ {−1, 1}, we can say

    1 − p(y | x, w) = 1 − 1/(1 + e^{−y wᵀx})
                    = e^{−y wᵀx}/(1 + e^{−y wᵀx})
                    = 1/(e^{y wᵀx} + 1)
                    = 1/(1 + e^{−ỹ wᵀx})
                    = p(ỹ | x, w),  where ỹ = −y.

So, as the problem explains, we can conclude that replacing y with −y is equivalent to finding the probability of "not y". As for the derivatives, these are actually slightly easier than the raw sigmoids above, because of the log. (Indeed, this also shows that perhaps there's an easier way to find the above derivatives, since d/dx log σ = σ'/σ, which implies σ' = σ · d/dx log σ. But we'll leave that for you to play with.) The first partial of the logistic regression objective is

    ∂/∂w_k Σᵢ log(1 + e^{−yᵢ wᵀxᵢ}) = Σᵢ [e^{−yᵢ wᵀxᵢ}/(1 + e^{−yᵢ wᵀxᵢ})] (−yᵢ xᵢ⁽ᵏ⁾)
                                    = −Σᵢ (1 − p(yᵢ | xᵢ, w)) yᵢ xᵢ⁽ᵏ⁾.

And the second partials are

    ∂²/∂w_l ∂w_k Σᵢ log(1 + e^{−yᵢ wᵀxᵢ}) = ∂/∂w_l [−Σᵢ (1 − pᵢ) yᵢ xᵢ⁽ᵏ⁾]
                                           = Σᵢ pᵢ(1 − pᵢ) yᵢ² xᵢ⁽ˡ⁾ xᵢ⁽ᵏ⁾
                                           = Σᵢ pᵢ(1 − pᵢ) xᵢ⁽ˡ⁾ xᵢ⁽ᵏ⁾,

where pᵢ = p(yᵢ | xᵢ, w). Here we've also used the property that (yᵢ)² = 1 independent of the value of yᵢ, since yᵢ ∈ {−1, 1}. In vector and matrix form, the full gradient and Hessian become

    ∇_w Σᵢ₌₁ᴺ log(1 + e^{−yᵢ wᵀxᵢ}) = −Σᵢ₌₁ᴺ (1 − pᵢ) yᵢ xᵢ  (17)

and

    ∇²_w Σᵢ₌₁ᴺ log(1 + e^{−yᵢ wᵀxᵢ}) = Σᵢ₌₁ᴺ pᵢ(1 − pᵢ) xᵢ xᵢᵀ.  (18)
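Equations (17) and (18) can be verified against finite differences of the loss on made-up data; this is my own sketch (the data X, y and all names here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # rows are the data points x_i (synthetic)
y = rng.choice([-1.0, 1.0], size=20)    # labels in {-1, +1}

def p(w):
    # p_i = p(y_i | x_i, w) = sigma(y_i w^T x_i)
    return 1.0 / (1.0 + np.exp(-y * (X @ w)))

def loss(w):
    # sum_i log(1 + exp(-y_i w^T x_i)); logaddexp(0, z) = log(1 + e^z), computed stably
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def grad(w):
    # Equation (17): -sum_i (1 - p_i) y_i x_i
    return -X.T @ ((1 - p(w)) * y)

def hess(w):
    # Equation (18): sum_i p_i (1 - p_i) x_i x_i^T
    s = p(w) * (1 - p(w))
    return (X * s[:, None]).T @ X

w = rng.normal(size=3)
eps = 1e-6
E = np.eye(3)
g_fd = np.array([(loss(w + eps * E[k]) - loss(w - eps * E[k])) / (2 * eps)
                 for k in range(3)])
H_fd = np.column_stack([(grad(w + eps * E[k]) - grad(w - eps * E[k])) / (2 * eps)
                        for k in range(3)])
print(np.max(np.abs(grad(w) - g_fd)), np.max(np.abs(hess(w) - H_fd)))
```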

3 Problem 3(A): The geometry of revolute manipulators

Taking the time derivative of the differentiable map φ : ℝⁿ → ℝ³ gives ẋ = J_φ q̇. Thus, just from calculus, we see that this Jacobian matrix J_φ is a linear map that transforms velocities in the space of the joints to velocities at the end-effector. Given a velocity q̇, we simply multiply by J_φ and we get back a velocity vector ẋ sitting at the end-effector that tells us the corresponding direction of motion of that end-effector point.

In the other direction, now assume J_φ is full rank and we're given a desired velocity ẋ_d. What types of movements of the joints make the end-effector move in that direction? Since J_φ is full rank, we can write its SVD as

    J_φ = U [S 0] [V_∥ᵀ ; V_⊥ᵀ],  (19)

as we've seen before. In this case, though, U ∈ ℝ³ˣ³ is a full square matrix since the matrix is full rank. The general solution to the linear system ẋ_d = J_φ q̇ is then

    q̇ = V_∥ S⁻¹ Uᵀ ẋ_d + V_⊥ β,  (20)

for any vector of coefficients β. Any of these motions will do: a small motion in the direction of q̇ will move the end-effector in the desired direction, simply because we're exploiting the linear relationship between velocities in the joint space and velocities at the end-effector. Note that the solution matrix V_∥ S⁻¹ Uᵀ is often called the Moore-Penrose pseudoinverse, and is written

    J_φ† = J_φᵀ (J_φ J_φᵀ)⁻¹.  (21)

One may multiply this expression out in terms of the SVD above in Equation (19) to get insight into why it works: it calculates the matrix V_∥ S⁻¹ Uᵀ without actually having to compute the SVD. Additionally, the matrix I − J_φ† J_φ projects any arbitrary vector v onto the null space of J_φ, so that (I − J_φ† J_φ) v = V_⊥ β for some β. It's often useful in practice to choose some direction in joint space q̇_d that we'd like to move in (for instance, it might be a direction pointing back toward some default configuration), and calculate

    q̇ = J_φ† ẋ_d + (I − J_φ† J_φ) q̇_d  (22)

to resolve the null-space redundancy of the problem.
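A small numeric illustration of Equations (21) and (22) (my own sketch; the 3×7 Jacobian here stands in for a hypothetical 7-joint arm and is just random data):

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.normal(size=(3, 7))             # full-rank 3x7 Jacobian (random stand-in)
xdot_d = np.array([0.1, -0.2, 0.05])    # desired end-effector velocity
qdot_d = rng.normal(size=7)             # preferred joint direction (e.g. toward a default pose)

J_pinv = J.T @ np.linalg.inv(J @ J.T)   # Equation (21): Moore-Penrose pseudoinverse
N = np.eye(7) - J_pinv @ J              # projector onto the null space of J
qdot = J_pinv @ xdot_d + N @ qdot_d     # Equation (22)

print(np.max(np.abs(J @ qdot - xdot_d)))  # commanded joints realize xdot_d
print(np.max(np.abs(J @ N @ qdot_d)))     # null-space term doesn't disturb the end-effector
```

Both residuals are at the level of floating-point roundoff, confirming that the null-space term lets us bias the joints toward q̇_d without changing the end-effector motion.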
These linear algebra results can easily be turned into an algorithm for controlling the robot's end-effector to a point: at each iteration, simply choose a desired velocity for the end-effector that takes the robot in the right direction, and move the joints a small step in the resulting joint-space direction that implements that desired end-effector movement. How you specifically implement that procedure is flexible. But note that a potential function of the form ψ(x_d − x), where ψ is the softmax function defined in Problem 1, creates a field of vectors (the negative gradient vectors) that nicely converge on x_d. If you're familiar with differential equations, you'll recognize this field of negative potential gradients as a differential equation whose integral curves (i.e. the curves you get from always instantaneously following the directions specified by the vector field) converge on x_d. So the problem of controlling the robot to x_d simply becomes a problem of numerically following one of these integral curves by choosing appropriate joint velocity vectors q̇. There are a number of numerical integration tools readily available to choose from, such as Euler integration (a particularly simple choice) or Runge-Kutta (a more numerically robust algorithm).

And finally, when the Jacobian is reduced rank, we can't actually fully solve the problem. The best we can do is solve the least squares problem to find the best-fit motion within the achievable set of end-effector velocities.

4 Problem 4: Inverse function theorem

Suppose φ : ℝⁿ → ℝⁿ. At the domain point x₀ we have a corresponding point in the co-domain y₀ = φ(x₀). The first-order Taylor approximation of this nonlinear map around the point x₀ gives

    φ(x) = y ≈ y₀ + J_φ (x − x₀),  (23)

where J_φ is the Jacobian at x₀. If J_φ is invertible, we can think of this linear map as a bijective connection between the two spaces. Any notion of directionality (mapping from x to y vs. the reverse mapping from y to x) is simply an artifact of notation. This bijective linear map creates an association between points in the domain and points in the co-domain, and we can easily calculate an expression representing how points map from y to x by simply solving this linear system for x in terms of y. Doing so gives

    x ≈ x₀ + J_φ⁻¹ (y − y₀).  (24)

This new expression looks like a Taylor expansion of a mapping from y to x, so it suggests that perhaps it is the first-order Taylor expansion of the inverse map. If so, then it must be that J_φ⁻¹ is the Jacobian of that inverse map. These are just heuristic arguments, but they demonstrate that the first-order Taylor approximation can reveal structure of the nonlinear map that isn't evident from the original expression. That first-order Taylor approximation connects the nonlinear function back to linear algebra and allows us to analyze the map locally using our linear algebra tools. In the above case, it tells us something about the inverse of the map; but more generally, even when we're moving between ℝⁿ and a space of different dimensionality ℝᵐ, we can still use the first-order Taylor approximation to understand the local structure of the map. When m ≠ n the best we can do is use pseudoinverses, but that's still quite handy when we have computational tools readily available while implementing complex systems.
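To illustrate the heuristic numerically (my own sketch; the map φ below is made up for this purpose), we can invert φ locally with Newton's method and check that the finite-difference Jacobian of the inverse at y₀ matches J_φ⁻¹, as Equation (24) suggests:

```python
import numpy as np

def phi(x):
    # a made-up smooth map R^2 -> R^2 with an invertible Jacobian near x0
    return np.array([x[0] + 0.5 * np.sin(x[1]),
                     x[1] + 0.25 * x[0] ** 2])

def J_phi(x):
    # analytic Jacobian of phi
    return np.array([[1.0, 0.5 * np.cos(x[1])],
                     [0.5 * x[0], 1.0]])

x0 = np.array([0.3, -0.8])
y0 = phi(x0)

def phi_inv(y):
    # local inverse via Newton's method, started near x0
    x = x0.copy()
    for _ in range(20):
        x = x + np.linalg.solve(J_phi(x), y - phi(x))
    return x

# finite-difference Jacobian of the inverse map at y0
eps = 1e-6
E = np.eye(2)
J_inv = np.column_stack([(phi_inv(y0 + eps * E[k]) - phi_inv(y0 - eps * E[k])) / (2 * eps)
                         for k in range(2)])
print(np.max(np.abs(J_inv - np.linalg.inv(J_phi(x0)))))
```

The discrepancy is at finite-difference accuracy, consistent with the claim that the inverse map's Jacobian at y₀ is J_φ⁻¹.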
24) This new expression looks like a Taylor expansion of a mapping from y to x, so it suggests that perhaps it is the first order Taylor expansion of the inverse map. If so, then it must be that J φ is the Jacobian of that inverse map. These are just heuristic arguments, but they demonstrate that the first order Taylor approximation can demonstrate structure of the nonlinear map that isn t evident from the original expression. That first order Taylor approximation connects the nonlinear function back to linear algebra and allows us to analyze the map locally using our linear algebra tools. In the above case, we see that it tells us something about the inverse of the map, but more generally, even when we re moving between R n and a space of different dimensionality R m, we can still use the first order Taylor approximation to understand the local structure of the map. When m n the best we can do is use pseudoinverses, but that s still quite handy when we have computational tools readily available while implementing complex systems. 9