Gradients and Directional Derivatives
Brett Bernstein, CDS at NYU
Recitation 1, January 21, 2018
Intro Question

Question. We are given the data set $(x_1, y_1), \ldots, (x_n, y_n)$ where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. We want to fit a linear function to this data by performing empirical risk minimization. More precisely, we are using the hypothesis space $\mathcal{F} = \{f(x) = w^T x \mid w \in \mathbb{R}^d\}$ and the loss function $\ell(a, y) = (a - y)^2$. Given an initial guess $w$ for the empirical risk minimizing parameter vector, how could we improve our guess?

[Figure: scatter plot of the data with a fitted line; axes labeled $x$ and $y$.]
Intro Solution

Solution. The empirical risk is given by
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) = \frac{1}{n} \sum_{i=1}^n (w^T x_i - y_i)^2 = \frac{1}{n} \|Xw - y\|_2^2,$$
where $X \in \mathbb{R}^{n \times d}$ is the matrix whose $i$th row is given by $x_i$. We can improve a non-optimal guess $w$ by taking a small step in the direction of the negative gradient.
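A minimal numpy sketch of this idea (the synthetic data, step size, and number of iterations are illustrative assumptions, not part of the slides): repeatedly stepping against the gradient of the empirical risk lowers it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # n = 100 points, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # noisy linear data (assumed for the demo)

def risk(w):
    return np.mean((X @ w - y) ** 2)          # (1/n) * ||Xw - y||_2^2

def grad(w):
    return (2 / len(y)) * X.T @ (X @ w - y)   # gradient of the empirical risk

w = np.zeros(3)                               # initial (non-optimal) guess
step = 0.1                                    # assumed small step size
for _ in range(100):
    w = w - step * grad(w)                    # step along the negative gradient

print(risk(np.zeros(3)), risk(w))             # the risk drops substantially
```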
Single Variable Differentiation

Calculus lets us turn non-linear problems into linear algebra. For $f : \mathbb{R} \to \mathbb{R}$ differentiable, the derivative is given by
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.$$
This can also be written as
$$f(x + h) = f(x) + h f'(x) + o(h) \quad \text{as } h \to 0,$$
where $o(h)$ denotes a function $g(h)$ with $g(h)/h \to 0$ as $h \to 0$. Points with $f'(x) = 0$ are called critical points.
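A quick numerical illustration of the limit definition (the choice of $f = \sin$ and the step sizes are mine): the difference quotient approaches $f'(x)$ as $h$ shrinks.

```python
import numpy as np

f = np.sin                                   # example function with known derivative cos(x)
x = 1.0
for h in [1e-1, 1e-3, 1e-5]:
    quotient = (f(x + h) - f(x)) / h         # difference quotient from the definition
    print(h, quotient, np.cos(x))            # quotient approaches cos(1.0) as h -> 0
```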
1D Linear Approximation By Derivative

[Figure: the tangent line $f(x_0) + (t - x_0) f'(x_0)$ approximates $f(t)$ near the point $(x_0, f(x_0))$; the vertical gap between them is $f(t) - (f(x_0) + (t - x_0) f'(x_0))$.]
Multivariable Differentiation

Consider now a function $f : \mathbb{R}^n \to \mathbb{R}$ with inputs of the form $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. Unlike the 1-dimensional case, we cannot assign a single number to the slope at a point, since there are many directions we can move in.
[Figure: multiple possible directions of movement from a point, for $f : \mathbb{R}^2 \to \mathbb{R}$.]
Directional Derivative

Definition. Let $f : \mathbb{R}^n \to \mathbb{R}$. The directional derivative $f'(x; u)$ of $f$ at $x \in \mathbb{R}^n$ in the direction $u \in \mathbb{R}^n$ is given by
$$f'(x; u) = \lim_{h \to 0} \frac{f(x + hu) - f(x)}{h}.$$
By fixing a direction $u$ we turn our multidimensional problem into a 1-dimensional problem. Similar to the 1-d case we have
$$f(x + hu) = f(x) + h f'(x; u) + o(h).$$
We say that $u$ is a descent direction of $f$ at $x$ if $f'(x; u) < 0$.
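A sketch checking this definition numerically (the test function, point, and direction are my own choices): a negative value confirms that $u$ is a descent direction there.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2      # simple test function (assumed for the demo)

x = np.array([1.0, 1.0])
u = np.array([-1.0, 0.0])                 # candidate direction

h = 1e-6
dir_deriv = (f(x + h * u) - f(x)) / h     # f'(x; u) from the limit definition
print(dir_deriv)                          # approx -2.0 < 0, so u is a descent direction
```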
[Figure: the directional derivative as the slope of a 1-dimensional slice of the graph of $f$ along the direction $u$.]
Partial Derivative

Let $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)$ denote the $i$th standard basis vector, with the $1$ in the $i$th position. The $i$th partial derivative is defined to be the directional derivative along $e_i$. It can be written many ways:
$$f'(x; e_i) = \frac{\partial}{\partial x_i} f(x) = \partial_{x_i} f(x) = \partial_i f(x).$$
What is the intuitive meaning of $\partial_{x_i} f(x)$? For example, what does a large value of $\partial_{x_3} f(x)$ imply?
Differentiability

We say a function $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable at $x \in \mathbb{R}^n$ if
$$\lim_{v \to 0} \frac{f(x + v) - f(x) - g^T v}{\|v\|_2} = 0$$
for some $g \in \mathbb{R}^n$. If it exists, this $g$ is unique and is called the gradient of $f$ at $x$, with notation $g = \nabla f(x)$. It can be shown that
$$\nabla f(x) = \begin{pmatrix} \partial_{x_1} f(x) \\ \vdots \\ \partial_{x_n} f(x) \end{pmatrix}.$$
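Since the gradient stacks the partial derivatives, it can be approximated componentwise by finite differences. A sketch (the test function and evaluation point are assumptions for illustration):

```python
import numpy as np

def f(x):
    return x[0] * x[1] + np.exp(x[2])        # test function; gradient is (x2, x1, e^{x3})

def numeric_gradient(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0                          # ith standard basis vector
        g[i] = (f(x + h * e_i) - f(x)) / h    # ith partial derivative
    return g

x = np.array([1.0, 2.0, 0.0])
print(numeric_gradient(f, x))                 # approximately [2, 1, 1]
```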
Useful Convention

Consider $f : \mathbb{R}^{p+q} \to \mathbb{R}$. Split the input $x \in \mathbb{R}^{p+q}$ into parts $w \in \mathbb{R}^p$ and $z \in \mathbb{R}^q$ so that $x = (w, z)$. Define the partial gradients
$$\nabla_w f(w, z) := \begin{pmatrix} \partial_{w_1} f(w, z) \\ \vdots \\ \partial_{w_p} f(w, z) \end{pmatrix} \quad \text{and} \quad \nabla_z f(w, z) := \begin{pmatrix} \partial_{z_1} f(w, z) \\ \vdots \\ \partial_{z_q} f(w, z) \end{pmatrix}.$$
Tangent Plane

Analogous to the 1-d case, we can express differentiability as
$$f(x + v) = f(x) + \nabla f(x)^T v + o(\|v\|_2).$$
The approximation $f(x + v) \approx f(x) + \nabla f(x)^T v$ gives a tangent plane at the point $x$. The tangent plane of $f$ at $x$ is given by
$$P = \{(x + v, \; f(x) + \nabla f(x)^T v) \mid v \in \mathbb{R}^n\} \subseteq \mathbb{R}^{n+1}.$$
Methods like gradient descent approximate a function locally by its tangent plane, and then take a step accordingly.
[Figure: tangent plane to the graph of $f : \mathbb{R}^2 \to \mathbb{R}$ at a point.]
Directional Derivatives from Gradients

If $f$ is differentiable we have
$$f'(x; u) = \nabla f(x)^T u.$$
If $\nabla f(x) \neq 0$, this implies that
$$\operatorname*{arg\,max}_{\|u\|_2 = 1} f'(x; u) = \frac{\nabla f(x)}{\|\nabla f(x)\|_2} \quad \text{and} \quad \operatorname*{arg\,min}_{\|u\|_2 = 1} f'(x; u) = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2}.$$
The gradient points in the direction of steepest ascent; the negative gradient points in the direction of steepest descent.
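A sketch of the steepest-ascent claim (the example gradient and the random sampling are assumptions): no random unit direction gives a larger directional derivative than the normalized gradient, whose value is $\|\nabla f(x)\|_2$.

```python
import numpy as np

# gradient of f(x) = x1^2 + 3*x2^2 at the point x = (1, 1), computed by hand
grad = np.array([2.0, 6.0])
rng = np.random.default_rng(0)

best = -np.inf
for _ in range(1000):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)                        # random unit direction
    best = max(best, grad @ u)                    # f'(x; u) = grad^T u

steepest = grad @ (grad / np.linalg.norm(grad))   # u = grad / ||grad||_2
print(best, steepest, np.linalg.norm(grad))       # best <= steepest = ||grad||_2
```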
Critical Points

Analogous to 1-d, if $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable and $x$ is a local extremum, then we must have $\nabla f(x) = 0$. Points with $\nabla f(x) = 0$ are called critical points. Later we will see that for a convex differentiable function, $x$ is a critical point if and only if it is a global minimizer.
[Figure: critical points of a function $f : \mathbb{R}^2 \to \mathbb{R}$.]
Computing Gradients

Question. For questions 1 and 2, compute the gradient of the given function.

1. $f : \mathbb{R}^3 \to \mathbb{R}$ is given by $f(x_1, x_2, x_3) = \log(1 + e^{x_1 + 2x_2 + 3x_3})$.
2. $f : \mathbb{R}^n \to \mathbb{R}$ is given by
$$f(x) = \|Ax - y\|_2^2 = (Ax - y)^T (Ax - y) = x^T A^T A x - 2 y^T A x + y^T y,$$
for some $A \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$.
3. Assume $A$ in the previous question has full column rank. What is the critical point of $f$?
Computing Gradients: $f(x_1, x_2, x_3) = \log(1 + e^{x_1 + 2x_2 + 3x_3})$, Solution 1

We can compute the partial derivatives directly:
$$\partial_{x_1} f = \frac{e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}}, \quad
\partial_{x_2} f = \frac{2\, e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}}, \quad
\partial_{x_3} f = \frac{3\, e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}},$$
and obtain
$$\nabla f(x_1, x_2, x_3) = \frac{1}{1 + e^{x_1 + 2x_2 + 3x_3}} \begin{pmatrix} e^{x_1 + 2x_2 + 3x_3} \\ 2\, e^{x_1 + 2x_2 + 3x_3} \\ 3\, e^{x_1 + 2x_2 + 3x_3} \end{pmatrix}.$$
Computing Gradients: $f(x_1, x_2, x_3) = \log(1 + e^{x_1 + 2x_2 + 3x_3})$, Solution 2

Let $w = (1, 2, 3)^T$ and write $f(x) = \log(1 + e^{w^T x})$. Apply a version of the chain rule:
$$\nabla f(x) = \frac{e^{w^T x}}{1 + e^{w^T x}}\, w.$$

Theorem (Chain Rule). If $g : \mathbb{R} \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}$ are differentiable, then
$$\nabla (g \circ h)(x) = g'(h(x))\, \nabla h(x).$$
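A finite-difference check of this gradient formula (the particular evaluation point is arbitrary, chosen only for the demo):

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])

def f(x):
    return np.log(1 + np.exp(w @ x))

def analytic_grad(x):
    s = np.exp(w @ x) / (1 + np.exp(w @ x))     # scalar factor e^{w^T x} / (1 + e^{w^T x})
    return s * w

x = np.array([0.1, -0.2, 0.05])
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x)) / h for e in np.eye(3)])
print(analytic_grad(x))
print(numeric)                                   # the two should agree closely
```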
Computing Gradients: $f(x) = \|Ax - y\|_2^2$, Solution

We could use techniques similar to the previous problem, but instead we show a different method using directional derivatives. For arbitrary $t \in \mathbb{R}$ and $x, v \in \mathbb{R}^n$ we have
$$\begin{aligned}
f(x + tv) &= (x + tv)^T A^T A (x + tv) - 2 y^T A (x + tv) + y^T y \\
&= x^T A^T A x + t^2 v^T A^T A v + 2 t\, x^T A^T A v - 2 y^T A x - 2 t\, y^T A v + y^T y \\
&= f(x) + t\,(2 x^T A^T A - 2 y^T A) v + t^2 v^T A^T A v.
\end{aligned}$$
This gives
$$f'(x; v) = \lim_{t \to 0} \frac{f(x + tv) - f(x)}{t} = (2 x^T A^T A - 2 y^T A) v = \nabla f(x)^T v.$$
Thus $\nabla f(x) = 2(A^T A x - A^T y) = 2 A^T (Ax - y)$. What is the data science interpretation of $\nabla f(x)$?
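The same kind of finite-difference check applied to $\nabla f(x) = 2 A^T (Ax - y)$ (the random $A$, $y$, and $x$ are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
y = rng.normal(size=5)
x = rng.normal(size=3)

def f(x):
    return np.sum((A @ x - y) ** 2)          # ||Ax - y||_2^2

analytic = 2 * A.T @ (A @ x - y)             # the gradient derived above
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x)) / h for e in np.eye(3)])
print(analytic)
print(numeric)                               # should agree to roughly 1e-5
```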
Computing Gradients: Critical Points of $f(x) = \|Ax - y\|_2^2$

We need $\nabla f(x) = 2 A^T A x - 2 A^T y = 0$. Since $A$ is assumed to have full column rank, $A^T A$ is invertible. Thus we have
$$x = (A^T A)^{-1} A^T y.$$
As we will see later, this function is strictly convex (the Hessian $\nabla^2 f(x) = 2 A^T A$ is positive definite). Thus we have found the unique minimizer (the least squares solution).
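A sketch comparing this closed-form critical point with numpy's least-squares solver (the random $A$ and $y$ are assumptions; a random Gaussian $A$ has full column rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(10, 3))                       # tall matrix, full column rank
y = rng.normal(size=10)

x_closed = np.linalg.solve(A.T @ A, A.T @ y)       # x = (A^T A)^{-1} A^T y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)    # library least-squares solution

print(np.allclose(x_closed, x_lstsq))              # True: both give the unique minimizer
```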
Computing Gradients: Technical Aside on Differentiability

When computing the gradients above we assumed the functions were differentiable. We can use the following theorem to be completely rigorous.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ and suppose $\partial_{x_i} f : \mathbb{R}^n \to \mathbb{R}$ is continuous for $i = 1, \ldots, n$. Then $f$ is differentiable.