Gradients and Directional Derivatives
Brett Bernstein, CDS at NYU
Recitation 1, January 21, 2018
Intro Question

Question. We are given the data set $(x_1, y_1), \ldots, (x_n, y_n)$ where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. We want to fit a linear function to this data by performing empirical risk minimization. More precisely, we are using the hypothesis space $\mathcal{F} = \{f(x) = w^T x \mid w \in \mathbb{R}^d\}$ and the loss function $\ell(a, y) = (a - y)^2$. Given an initial guess $w$ for the empirical risk minimizing parameter vector, how could we improve our guess?

[Figure: scatter plot of the data with a fitted line; axes labeled $x$ and $y$.]
Intro Solution

Solution. The empirical risk is given by
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) = \frac{1}{n} \sum_{i=1}^n (w^T x_i - y_i)^2 = \frac{1}{n} \|Xw - y\|_2^2,$$
where $X \in \mathbb{R}^{n \times d}$ is the matrix whose $i$th row is given by $x_i$. We can improve a non-optimal guess $w$ by taking a small step in the direction of the negative gradient.
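A minimal numpy sketch of this idea (the synthetic data, step size, and number of iterations are illustrative assumptions, not part of the slides): repeatedly stepping against the gradient of the empirical risk lowers it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # n = 100 points, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # noisy linear data (assumed for the demo)

def risk(w):
    return np.mean((X @ w - y) ** 2)          # (1/n) * ||Xw - y||_2^2

def grad(w):
    return (2 / len(y)) * X.T @ (X @ w - y)   # gradient of the empirical risk

w = np.zeros(3)                               # initial (non-optimal) guess
step = 0.1                                    # assumed small step size
for _ in range(100):
    w = w - step * grad(w)                    # step along the negative gradient

print(risk(np.zeros(3)), risk(w))             # the risk drops substantially
```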
Single Variable Differentiation

Calculus lets us turn non-linear problems into linear algebra. For $f : \mathbb{R} \to \mathbb{R}$ differentiable, the derivative is given by
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.$$
This can also be written as
$$f(x + h) = f(x) + h f'(x) + o(h) \quad \text{as } h \to 0,$$
where $o(h)$ denotes a function $g(h)$ with $g(h)/h \to 0$ as $h \to 0$. Points with $f'(x) = 0$ are called critical points.
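A quick numerical illustration of the limit definition (the choice of $f = \sin$ and the step sizes are mine): the difference quotient approaches $f'(x)$ as $h$ shrinks.

```python
import numpy as np

f = np.sin                                   # example function with known derivative cos(x)
x = 1.0
for h in [1e-1, 1e-3, 1e-5]:
    quotient = (f(x + h) - f(x)) / h         # difference quotient from the definition
    print(h, quotient, np.cos(x))            # quotient approaches cos(1.0) as h -> 0
```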
1D Linear Approximation By Derivative

[Figure: the tangent line $f(x_0) + (t - x_0) f'(x_0)$ approximates $f(t)$ near the point $(x_0, f(x_0))$; the vertical gap between them is $f(t) - (f(x_0) + (t - x_0) f'(x_0))$.]
Multivariable Differentiation

Consider now a function $f : \mathbb{R}^n \to \mathbb{R}$ with inputs of the form $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. Unlike the 1-dimensional case, we cannot assign a single number to the slope at a point, since there are many directions we can move in.
[Figure: multiple possible directions of movement from a point, for $f : \mathbb{R}^2 \to \mathbb{R}$.]
Directional Derivative

Definition. Let $f : \mathbb{R}^n \to \mathbb{R}$. The directional derivative $f'(x; u)$ of $f$ at $x \in \mathbb{R}^n$ in the direction $u \in \mathbb{R}^n$ is given by
$$f'(x; u) = \lim_{h \to 0} \frac{f(x + hu) - f(x)}{h}.$$
By fixing a direction $u$ we turn our multidimensional problem into a 1-dimensional problem. Similar to the 1-d case we have
$$f(x + hu) = f(x) + h f'(x; u) + o(h).$$
We say that $u$ is a descent direction of $f$ at $x$ if $f'(x; u) < 0$.
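A sketch checking this definition numerically (the test function, point, and direction are my own choices): a negative value confirms that $u$ is a descent direction there.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2      # simple test function (assumed for the demo)

x = np.array([1.0, 1.0])
u = np.array([-1.0, 0.0])                 # candidate direction

h = 1e-6
dir_deriv = (f(x + h * u) - f(x)) / h     # f'(x; u) from the limit definition
print(dir_deriv)                          # approx -2.0 < 0, so u is a descent direction
```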
[Figure: the directional derivative as the slope of a 1-dimensional slice of the graph of $f$ along the direction $u$.]
Partial Derivative

Let $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)$ denote the $i$th standard basis vector, with the $1$ in the $i$th position. The $i$th partial derivative is defined to be the directional derivative along $e_i$. It can be written many ways:
$$f'(x; e_i) = \frac{\partial}{\partial x_i} f(x) = \partial_{x_i} f(x) = \partial_i f(x).$$
What is the intuitive meaning of $\partial_{x_i} f(x)$? For example, what does a large value of $\partial_{x_3} f(x)$ imply?
Differentiability

We say a function $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable at $x \in \mathbb{R}^n$ if
$$\lim_{v \to 0} \frac{f(x + v) - f(x) - g^T v}{\|v\|_2} = 0$$
for some $g \in \mathbb{R}^n$. If it exists, this $g$ is unique and is called the gradient of $f$ at $x$, with notation $g = \nabla f(x)$. It can be shown that
$$\nabla f(x) = \begin{pmatrix} \partial_{x_1} f(x) \\ \vdots \\ \partial_{x_n} f(x) \end{pmatrix}.$$
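Since the gradient stacks the partial derivatives, it can be approximated componentwise by finite differences. A sketch (the test function and evaluation point are assumptions for illustration):

```python
import numpy as np

def f(x):
    return x[0] * x[1] + np.exp(x[2])        # test function; gradient is (x2, x1, e^{x3})

def numeric_gradient(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0                          # ith standard basis vector
        g[i] = (f(x + h * e_i) - f(x)) / h    # ith partial derivative
    return g

x = np.array([1.0, 2.0, 0.0])
print(numeric_gradient(f, x))                 # approximately [2, 1, 1]
```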
Useful Convention

Consider $f : \mathbb{R}^{p+q} \to \mathbb{R}$. Split the input $x \in \mathbb{R}^{p+q}$ into parts $w \in \mathbb{R}^p$ and $z \in \mathbb{R}^q$ so that $x = (w, z)$. Define the partial gradients
$$\nabla_w f(w, z) := \begin{pmatrix} \partial_{w_1} f(w, z) \\ \vdots \\ \partial_{w_p} f(w, z) \end{pmatrix} \quad \text{and} \quad \nabla_z f(w, z) := \begin{pmatrix} \partial_{z_1} f(w, z) \\ \vdots \\ \partial_{z_q} f(w, z) \end{pmatrix}.$$
Tangent Plane

Analogous to the 1-d case, we can express differentiability as
$$f(x + v) = f(x) + \nabla f(x)^T v + o(\|v\|_2).$$
The approximation $f(x + v) \approx f(x) + \nabla f(x)^T v$ gives a tangent plane at the point $x$. The tangent plane of $f$ at $x$ is given by
$$P = \{(x + v, \; f(x) + \nabla f(x)^T v) \mid v \in \mathbb{R}^n\} \subseteq \mathbb{R}^{n+1}.$$
Methods like gradient descent approximate a function locally by its tangent plane, and then take a step accordingly.
[Figure: tangent plane to the graph of $f : \mathbb{R}^2 \to \mathbb{R}$ at a point.]
Directional Derivatives from Gradients

If $f$ is differentiable we have
$$f'(x; u) = \nabla f(x)^T u.$$
If $\nabla f(x) \neq 0$, this implies that
$$\operatorname*{arg\,max}_{\|u\|_2 = 1} f'(x; u) = \frac{\nabla f(x)}{\|\nabla f(x)\|_2} \quad \text{and} \quad \operatorname*{arg\,min}_{\|u\|_2 = 1} f'(x; u) = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2}.$$
The gradient points in the direction of steepest ascent; the negative gradient points in the direction of steepest descent.
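A sketch of the steepest-ascent claim (the example gradient and the random sampling are assumptions): no random unit direction gives a larger directional derivative than the normalized gradient, whose value is $\|\nabla f(x)\|_2$.

```python
import numpy as np

# gradient of f(x) = x1^2 + 3*x2^2 at the point x = (1, 1), computed by hand
grad = np.array([2.0, 6.0])
rng = np.random.default_rng(0)

best = -np.inf
for _ in range(1000):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)                        # random unit direction
    best = max(best, grad @ u)                    # f'(x; u) = grad^T u

steepest = grad @ (grad / np.linalg.norm(grad))   # u = grad / ||grad||_2
print(best, steepest, np.linalg.norm(grad))       # best <= steepest = ||grad||_2
```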
Critical Points

Analogous to 1-d, if $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable and $x$ is a local extremum, then we must have $\nabla f(x) = 0$. Points with $\nabla f(x) = 0$ are called critical points. Later we will see that for a convex differentiable function, $x$ is a critical point if and only if it is a global minimizer.
[Figure: critical points of a function $f : \mathbb{R}^2 \to \mathbb{R}$.]
Computing Gradients

Question. For questions 1 and 2, compute the gradient of the given function.

1. $f : \mathbb{R}^3 \to \mathbb{R}$ is given by $f(x_1, x_2, x_3) = \log(1 + e^{x_1 + 2x_2 + 3x_3})$.
2. $f : \mathbb{R}^n \to \mathbb{R}$ is given by
$$f(x) = \|Ax - y\|_2^2 = (Ax - y)^T (Ax - y) = x^T A^T A x - 2 y^T A x + y^T y,$$
for some $A \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$.
3. Assume $A$ in the previous question has full column rank. What is the critical point of $f$?
Computing Gradients: $f(x_1, x_2, x_3) = \log(1 + e^{x_1 + 2x_2 + 3x_3})$, Solution 1

We can compute the partial derivatives directly:
$$\partial_{x_1} f = \frac{e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}}, \quad
\partial_{x_2} f = \frac{2\, e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}}, \quad
\partial_{x_3} f = \frac{3\, e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}},$$
and obtain
$$\nabla f(x_1, x_2, x_3) = \frac{1}{1 + e^{x_1 + 2x_2 + 3x_3}} \begin{pmatrix} e^{x_1 + 2x_2 + 3x_3} \\ 2\, e^{x_1 + 2x_2 + 3x_3} \\ 3\, e^{x_1 + 2x_2 + 3x_3} \end{pmatrix}.$$
Computing Gradients: $f(x_1, x_2, x_3) = \log(1 + e^{x_1 + 2x_2 + 3x_3})$, Solution 2

Let $w = (1, 2, 3)^T$ and write $f(x) = \log(1 + e^{w^T x})$. Apply a version of the chain rule:
$$\nabla f(x) = \frac{e^{w^T x}}{1 + e^{w^T x}}\, w.$$

Theorem (Chain Rule). If $g : \mathbb{R} \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}$ are differentiable, then
$$\nabla (g \circ h)(x) = g'(h(x))\, \nabla h(x).$$
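A finite-difference check of this gradient formula (the particular evaluation point is arbitrary, chosen only for the demo):

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])

def f(x):
    return np.log(1 + np.exp(w @ x))

def analytic_grad(x):
    s = np.exp(w @ x) / (1 + np.exp(w @ x))     # scalar factor e^{w^T x} / (1 + e^{w^T x})
    return s * w

x = np.array([0.1, -0.2, 0.05])
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x)) / h for e in np.eye(3)])
print(analytic_grad(x))
print(numeric)                                   # the two should agree closely
```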
Computing Gradients: $f(x) = \|Ax - y\|_2^2$, Solution

We could use techniques similar to the previous problem, but instead we show a different method using directional derivatives. For arbitrary $t \in \mathbb{R}$ and $x, v \in \mathbb{R}^n$ we have
$$\begin{aligned}
f(x + tv) &= (x + tv)^T A^T A (x + tv) - 2 y^T A (x + tv) + y^T y \\
&= x^T A^T A x + t^2 v^T A^T A v + 2 t\, x^T A^T A v - 2 y^T A x - 2 t\, y^T A v + y^T y \\
&= f(x) + t\,(2 x^T A^T A - 2 y^T A) v + t^2 v^T A^T A v.
\end{aligned}$$
This gives
$$f'(x; v) = \lim_{t \to 0} \frac{f(x + tv) - f(x)}{t} = (2 x^T A^T A - 2 y^T A) v = \nabla f(x)^T v.$$
Thus $\nabla f(x) = 2(A^T A x - A^T y) = 2 A^T (Ax - y)$. What is the data science interpretation of $\nabla f(x)$?
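The same kind of finite-difference check applied to $\nabla f(x) = 2 A^T (Ax - y)$ (the random $A$, $y$, and $x$ are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
y = rng.normal(size=5)
x = rng.normal(size=3)

def f(x):
    return np.sum((A @ x - y) ** 2)          # ||Ax - y||_2^2

analytic = 2 * A.T @ (A @ x - y)             # the gradient derived above
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x)) / h for e in np.eye(3)])
print(analytic)
print(numeric)                               # should agree to roughly 1e-5
```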
Computing Gradients: Critical Points of $f(x) = \|Ax - y\|_2^2$

We need $\nabla f(x) = 2 A^T A x - 2 A^T y = 0$. Since $A$ is assumed to have full column rank, $A^T A$ is invertible. Thus we have
$$x = (A^T A)^{-1} A^T y.$$
As we will see later, this function is strictly convex (the Hessian $\nabla^2 f(x) = 2 A^T A$ is positive definite). Thus we have found the unique minimizer (the least squares solution).
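A sketch comparing this closed-form critical point with numpy's least-squares solver (the random $A$ and $y$ are assumptions; a random Gaussian $A$ has full column rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(10, 3))                       # tall matrix, full column rank
y = rng.normal(size=10)

x_closed = np.linalg.solve(A.T @ A, A.T @ y)       # x = (A^T A)^{-1} A^T y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)    # library least-squares solution

print(np.allclose(x_closed, x_lstsq))              # True: both give the unique minimizer
```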
Computing Gradients: Technical Aside on Differentiability

When computing the gradients above we assumed the functions were differentiable. We can use the following theorem to be completely rigorous.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ and suppose $\partial_{x_i} f : \mathbb{R}^n \to \mathbb{R}$ is continuous for $i = 1, \ldots, n$. Then $f$ is differentiable.