Recitation 1. Gradients and Directional Derivatives. Brett Bernstein. CDS at NYU. January 21, 2018

Intro Question

Question. We are given the data set $(x_1, y_1), \ldots, (x_n, y_n)$ where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. We want to fit a linear function to this data by performing empirical risk minimization. More precisely, we are using the hypothesis space $\mathcal{F} = \{f(x) = w^T x \mid w \in \mathbb{R}^d\}$ and the loss function $\ell(a, y) = (a - y)^2$. Given an initial guess $w$ for the empirical risk minimizing parameter vector, how could we improve our guess?

(Figure: scatter plot of the data; axes $x$ and $y$.)

Intro Solution

Solution. The empirical risk is given by
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) = \frac{1}{n} \sum_{i=1}^n (w^T x_i - y_i)^2 = \frac{1}{n} \|Xw - y\|_2^2,$$
where $X \in \mathbb{R}^{n \times d}$ is the matrix whose $i$th row is given by $x_i$. We can improve a non-optimal guess $w$ by taking a small step in the direction of the negative gradient.
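As a concrete illustration (my addition, not part of the original slides), here is a minimal NumPy sketch that builds the empirical risk $\frac{1}{n}\|Xw - y\|_2^2$ on synthetic data and takes one small step along the negative gradient. The synthetic data, the zero initialization, and the step size 0.1 are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))          # design matrix: ith row is x_i
w_true = np.array([1.0, -2.0, 0.5])  # ground-truth parameters for the synthetic data
y = X @ w_true + 0.1 * rng.normal(size=n)

def risk(w):
    """Empirical risk (1/n) ||Xw - y||_2^2."""
    r = X @ w - y
    return r @ r / n

def grad(w):
    """Gradient of the empirical risk: (2/n) X^T (Xw - y)."""
    return (2.0 / n) * X.T @ (X @ w - y)

w = np.zeros(d)                # initial (non-optimal) guess
step = 0.1                     # small step size
w_new = w - step * grad(w)     # step in the negative gradient direction
print(risk(w), "->", risk(w_new))   # the risk decreases
```

Repeating the update `w -= step * grad(w)` gives gradient descent.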

Single Variable Differentiation

Calculus lets us turn non-linear problems into linear algebra. For $f : \mathbb{R} \to \mathbb{R}$ differentiable, the derivative is given by
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.$$
This can also be written as
$$f(x + h) = f(x) + h f'(x) + o(h) \quad \text{as } h \to 0,$$
where $o(h)$ denotes a function $g(h)$ with $g(h)/h \to 0$ as $h \to 0$. Points with $f'(x) = 0$ are called critical points.
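As a quick numerical check (an illustration I added), the remainder $f(x+h) - f(x) - h f'(x)$ should vanish faster than $h$; dividing it by $h$ and shrinking $h$ makes the $o(h)$ behavior visible. A sketch with $f(x) = x^3$ at $x = 1$:

```python
f = lambda x: x**3
fprime = lambda x: 3 * x**2   # derivative of x^3

x = 1.0
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    remainder = f(x + h) - f(x) - h * fprime(x)
    print(h, remainder / h)   # ratio tends to 0 as h -> 0, i.e. the remainder is o(h)
```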

1D Linear Approximation By Derivative

(Figure: the graph of $f(t)$ and its tangent line $f(x_0) + (t - x_0) f'(x_0)$ at the point $(x_0, f(x_0))$; the vertical gap between them is the error $f(t) - (f(x_0) + (t - x_0) f'(x_0))$.)

Multivariable Differentiation

Consider now a function $f : \mathbb{R}^n \to \mathbb{R}$ with inputs of the form $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. Unlike the 1-dimensional case, we cannot assign a single number to the slope at a point, since there are many directions we can move in.

Multiple Possible Directions for $f : \mathbb{R}^2 \to \mathbb{R}$

(Figure.)

Directional Derivative

Definition. Let $f : \mathbb{R}^n \to \mathbb{R}$. The directional derivative $f'(x; u)$ of $f$ at $x \in \mathbb{R}^n$ in the direction $u \in \mathbb{R}^n$ is given by
$$f'(x; u) = \lim_{h \to 0} \frac{f(x + hu) - f(x)}{h}.$$
By fixing a direction $u$ we turn our multidimensional problem into a 1-dimensional problem. Similar to the 1-d case we have
$$f(x + hu) = f(x) + h f'(x; u) + o(h).$$
We say that $u$ is a descent direction of $f$ at $x$ if $f'(x; u) < 0$.
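Here is a small sketch (mine, not from the slides) that approximates $f'(x; u)$ by the difference quotient above with a small $h$, and uses its sign to test whether $u$ is a descent direction. The quadratic test function and the direction are arbitrary.

```python
import numpy as np

def f(x):
    return x[0]**2 + 3 * x[1]**2          # simple smooth test function

def directional_derivative(f, x, u, h=1e-6):
    """Finite-difference approximation of f'(x; u)."""
    return (f(x + h * u) - f(x)) / h

x = np.array([1.0, 1.0])
u = np.array([-1.0, 0.0])
val = directional_derivative(f, x, u)      # approximately -2 here
print("f'(x; u) =", val, "(descent direction)" if val < 0 else "(not a descent direction)")
```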

Directional Derivative as a Slope of a Slice

(Figure.)

Partial Derivative

Let $e_i = (\underbrace{0, 0, \ldots, 0}_{i-1}, 1, 0, \ldots, 0)$ denote the $i$th standard basis vector. The $i$th partial derivative is defined to be the directional derivative along $e_i$. It can be written in many ways:
$$f'(x; e_i) = \frac{\partial}{\partial x_i} f(x) = \partial_{x_i} f(x) = \partial_i f(x).$$
What is the intuitive meaning of $\partial_{x_i} f(x)$? For example, what does a large value of $\partial_{x_3} f(x)$ imply?

Differentiability

We say a function $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable at $x \in \mathbb{R}^n$ if
$$\lim_{v \to 0} \frac{f(x + v) - f(x) - g^T v}{\|v\|_2} = 0$$
for some $g \in \mathbb{R}^n$. If it exists, this $g$ is unique and is called the gradient of $f$ at $x$, with notation $g = \nabla f(x)$. It can be shown that
$$\nabla f(x) = \begin{pmatrix} \partial_{x_1} f(x) \\ \vdots \\ \partial_{x_n} f(x) \end{pmatrix}.$$
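To make the stacked-partials formula concrete (an illustration I added, not from the slides), one can approximate each partial derivative with a finite difference along $e_i$ and stack the results into a numerical gradient:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Stack finite-difference approximations of the partial derivatives along each e_i."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0
        g[i] = (f(x + h * e_i) - f(x)) / h   # approximates the ith partial derivative
    return g

f = lambda x: np.sin(x[0]) + x[0] * x[1]**2
x = np.array([0.5, 2.0])
print(numerical_gradient(x=x, f=f))   # close to [cos(0.5) + 2.0**2, 2 * 0.5 * 2.0]
```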

Useful Convention

Consider $f : \mathbb{R}^{p+q} \to \mathbb{R}$. Split the input $x \in \mathbb{R}^{p+q}$ into parts $w \in \mathbb{R}^p$ and $z \in \mathbb{R}^q$ so that $x = (w, z)$. Define the partial gradients
$$\nabla_w f(w, z) := \begin{pmatrix} \partial_{w_1} f(w, z) \\ \vdots \\ \partial_{w_p} f(w, z) \end{pmatrix}
\quad \text{and} \quad
\nabla_z f(w, z) := \begin{pmatrix} \partial_{z_1} f(w, z) \\ \vdots \\ \partial_{z_q} f(w, z) \end{pmatrix}.$$

Tangent Plane

Analogous to the 1-d case, we can express differentiability as
$$f(x + v) = f(x) + \nabla f(x)^T v + o(\|v\|_2).$$
The approximation $f(x + v) \approx f(x) + \nabla f(x)^T v$ gives a tangent plane at the point $x$. The tangent plane of $f$ at $x$ is given by
$$P = \{(x + v,\ f(x) + \nabla f(x)^T v) \mid v \in \mathbb{R}^n\} \subseteq \mathbb{R}^{n+1}.$$
Methods like gradient descent approximate a function locally by its tangent plane, and then take a step accordingly.

Tangent Plane for $f : \mathbb{R}^2 \to \mathbb{R}$

(Figure.)

Directional Derivatives from Gradients

If $f$ is differentiable we have
$$f'(x; u) = \nabla f(x)^T u.$$
If $\nabla f(x) \neq 0$ this implies that
$$\operatorname*{arg\,max}_{\|u\|_2 = 1} f'(x; u) = \frac{\nabla f(x)}{\|\nabla f(x)\|_2}
\quad \text{and} \quad
\operatorname*{arg\,min}_{\|u\|_2 = 1} f'(x; u) = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2}.$$
The gradient points in the direction of steepest ascent; the negative gradient points in the direction of steepest descent.
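A sanity check I added (not in the slides): with $f'(x; u) = \nabla f(x)^T u$, no unit vector should give a larger directional derivative than the normalized gradient. The quadratic below is an arbitrary test function.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])   # gradient of f(x) = x_1^2 + 3 x_2^2

x = np.array([1.0, 1.0])
g = grad_f(x)
u_star = g / np.linalg.norm(g)              # claimed steepest-ascent direction

best = -np.inf
for _ in range(10000):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)                  # random unit direction
    best = max(best, g @ u)

print(g @ u_star, ">=", best)               # the value along u_star is never beaten
```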

Critical Points

Analogous to the 1-d case, if $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable and $x$ is a local extremum, then we must have $\nabla f(x) = 0$. Points with $\nabla f(x) = 0$ are called critical points. Later we will see that for a convex differentiable function, $x$ is a critical point if and only if it is a global minimizer.

Critical Points of $f : \mathbb{R}^2 \to \mathbb{R}$

(Figure.)

Computing Gradients: Question

For questions 1 and 2, compute the gradient of the given function.

1. $f : \mathbb{R}^3 \to \mathbb{R}$ is given by $f(x_1, x_2, x_3) = \log(1 + e^{x_1 + 2x_2 + 3x_3})$.

2. $f : \mathbb{R}^n \to \mathbb{R}$ is given by
$$f(x) = \|Ax - y\|_2^2 = (Ax - y)^T (Ax - y) = x^T A^T A x - 2 y^T A x + y^T y,$$
for some $A \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$.

3. Assume $A$ in the previous question has full column rank. What is the critical point of $f$?

Computing Gradients: $f(x_1, x_2, x_3) = \log(1 + e^{x_1 + 2x_2 + 3x_3})$, Solution 1

We can compute the partial derivatives directly:
$$\partial_{x_1} f(x_1, x_2, x_3) = \frac{e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}}, \quad
\partial_{x_2} f(x_1, x_2, x_3) = \frac{2 e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}}, \quad
\partial_{x_3} f(x_1, x_2, x_3) = \frac{3 e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}},$$
and obtain
$$\nabla f(x_1, x_2, x_3) = \begin{pmatrix}
\dfrac{e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}} \\[2ex]
\dfrac{2 e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}} \\[2ex]
\dfrac{3 e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}}
\end{pmatrix}.$$
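As a quick cross-check (my addition), the closed-form gradient agrees with a finite-difference approximation at an arbitrary point:

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])

def f(x):
    return np.log(1.0 + np.exp(w @ x))

def grad_f(x):
    s = np.exp(w @ x)
    return s / (1.0 + s) * w                 # closed-form gradient from the solution above

x = np.array([0.1, -0.2, 0.3])
h = 1e-6
numerical = np.array([(f(x + h * np.eye(3)[i]) - f(x)) / h for i in range(3)])
print(grad_f(x))
print(numerical)                             # the two should closely agree
```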

Computing Gradients: $f(x_1, x_2, x_3) = \log(1 + e^{x_1 + 2x_2 + 3x_3})$, Solution 2

Let $w = (1, 2, 3)^T$ and write $f(x) = \log(1 + e^{w^T x})$. Apply a version of the chain rule:
$$\nabla f(x) = \frac{e^{w^T x}}{1 + e^{w^T x}}\, w.$$

Theorem (Chain Rule). If $g : \mathbb{R} \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}$ are differentiable, then
$$\nabla (g \circ h)(x) = g'(h(x))\, \nabla h(x).$$

Computing Gradients: $f(x) = \|Ax - y\|_2^2$, Solution

We could use techniques similar to the previous problem, but instead we show a different method using directional derivatives. For arbitrary $t \in \mathbb{R}$ and $x, v \in \mathbb{R}^n$ we have
$$\begin{aligned}
f(x + tv) &= (x + tv)^T A^T A (x + tv) - 2 y^T A (x + tv) + y^T y \\
&= x^T A^T A x + t^2 v^T A^T A v + 2 t x^T A^T A v - 2 y^T A x - 2 t y^T A v + y^T y \\
&= f(x) + t (2 x^T A^T A - 2 y^T A) v + t^2 v^T A^T A v.
\end{aligned}$$
This gives
$$f'(x; v) = \lim_{t \to 0} \frac{f(x + tv) - f(x)}{t} = (2 x^T A^T A - 2 y^T A) v = \nabla f(x)^T v.$$
Thus $\nabla f(x) = 2(A^T A x - A^T y) = 2 A^T (Ax - y)$. What is the data science interpretation of $\nabla f(x)$?
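An illustrative check (mine, not from the recitation): verify $\nabla f(x) = 2 A^T (Ax - y)$ numerically on random data.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.normal(size=(m, n))
y = rng.normal(size=m)

def f(x):
    r = A @ x - y
    return r @ r                              # ||Ax - y||_2^2

def grad_f(x):
    return 2.0 * A.T @ (A @ x - y)            # closed form derived above

x = rng.normal(size=n)
h = 1e-6
numerical = np.array([(f(x + h * np.eye(n)[i]) - f(x)) / h for i in range(n)])
print(np.max(np.abs(grad_f(x) - numerical)))  # small, up to finite-difference error
```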

Critical Points of $f(x) = \|Ax - y\|_2^2$

We need $\nabla f(x) = 2 A^T A x - 2 A^T y = 0$. Since $A$ is assumed to have full column rank, $A^T A$ is invertible. Thus we have $x = (A^T A)^{-1} A^T y$. As we will see later, this function is strictly convex (the Hessian $\nabla^2 f(x) = 2 A^T A$ is positive definite), so we have found the unique minimizer (the least squares solution).
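A short sketch I added: compute the critical point by solving the normal equations (rather than forming an explicit inverse), then confirm that the gradient vanishes there and that the result matches NumPy's least squares routine.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
A = rng.normal(size=(m, n))                   # has full column rank with probability 1
y = rng.normal(size=m)

x_star = np.linalg.solve(A.T @ A, A.T @ y)    # x = (A^T A)^{-1} A^T y
grad_at_star = 2.0 * A.T @ (A @ x_star - y)

print(np.linalg.norm(grad_at_star))           # ~ 0 up to floating-point error
print(np.allclose(x_star, np.linalg.lstsq(A, y, rcond=None)[0]))   # True
```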

Technical Aside: Differentiability

When computing the gradients above we assumed the functions were differentiable. We can use the following theorem to be completely rigorous.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ and suppose $\partial_{x_i} f : \mathbb{R}^n \to \mathbb{R}$ is continuous for $i = 1, \ldots, n$. Then $f$ is differentiable.