Machine Learning, Brett Bernstein. Recitation 1: Gradients and Directional Derivatives

Intro Question

1. We are given the data set $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. We want to fit a linear function to this data by performing empirical risk minimization. More precisely, we are using the hypothesis space $\mathcal{F} = \{f(x) = w^T x \mid w \in \mathbb{R}^d\}$ and the loss function $\ell(a, y) = (a - y)^2$. Given an initial guess $w$ for the empirical risk minimizing parameter vector, how could we improve our guess?

[Figure 1: Data set with $d = 1$, plotting $y$ against $x$.]
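Previewing where the recitation is headed, one concrete way to improve the guess is to take a small step in a direction that locally decreases the empirical risk. The sketch below is not part of the original notes: the synthetic data, step size, and helper names are invented for illustration. It takes a single negative-gradient step on the average squared loss and checks that the empirical risk drops.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3                         # hypothetical data sizes
    X = rng.normal(size=(n, d))           # rows are the x_i
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    def empirical_risk(w):
        # (1/n) sum_i (w^T x_i - y_i)^2
        return np.mean((X @ w - y) ** 2)

    def risk_gradient(w):
        # gradient of the empirical risk: (2/n) X^T (X w - y)
        return (2.0 / n) * X.T @ (X @ w - y)

    w = np.zeros(d)                       # initial guess
    step_size = 0.1                       # hypothetical step size
    w_new = w - step_size * risk_gradient(w)
    print(empirical_risk(w_new) < empirical_risk(w))   # True: the guess improved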

Multivariable Differentiation

Differential calculus allows us to convert non-linear problems into local linear problems, to which we can apply the well-developed techniques of linear algebra. Here we will review some of the important concepts needed in the rest of the course.

Single Variable Differentiation

To gain intuition, we first recall the single variable case. Let $f : \mathbb{R} \to \mathbb{R}$ be differentiable. The derivative
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
gives us a local linear approximation of $f$ near $x$. This is seen more clearly in the following form:
$$f(x + h) = f(x) + h f'(x) + o(h) \quad \text{as } h \to 0,$$
where $o(h)$ represents a function $g(h)$ with $g(h)/h \to 0$ as $h \to 0$. This can be used to show that if $x$ is a local extremum of $f$, then $f'(x) = 0$. Points with $f'(x) = 0$ are called critical points.

[Figure 2: 1D linear approximation by the derivative, showing $f(t)$, the tangent line $f(x_0) + (t - x_0) f'(x_0)$, the gap $f(t) - (f(x_0) + (t - x_0) f'(x_0))$, and the point $(x_0, f(x_0))$.]
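As a quick numerical illustration of the $o(h)$ remainder (my own example, not from the notes), the sketch below uses $f(x) = \log(1 + e^x)$ and checks that the gap between $f(x + h)$ and the linear approximation $f(x) + h f'(x)$ shrinks faster than $h$.

    import numpy as np

    f = lambda x: np.log1p(np.exp(x))
    fprime = lambda x: 1.0 / (1.0 + np.exp(-x))   # derivative of log(1 + e^x)

    x = 0.7
    for h in [1e-1, 1e-2, 1e-3, 1e-4]:
        remainder = f(x + h) - (f(x) + h * fprime(x))
        print(f"h={h:.0e}  remainder/h = {remainder / h:.2e}")   # tends to 0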

Multivariate Differentiation

More generally, we will look at functions $f : \mathbb{R}^n \to \mathbb{R}$. In the single-variable case, the derivative was just a number that signified how much the function increased when we moved in the positive $x$-direction. In the multivariable case, we have many possible directions we can move along from a given point $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$.

[Figure 3: Multiple possible directions for $f : \mathbb{R}^2 \to \mathbb{R}$.]

If we fix a direction $u$ we can compute the directional derivative $f'(x; u)$ as
$$f'(x; u) = \lim_{h \to 0} \frac{f(x + hu) - f(x)}{h}.$$
This allows us to turn our multidimensional problem into a 1-dimensional computation. For instance,
$$f(x + hu) = f(x) + h f'(x; u) + o(h),$$
mimicking our earlier 1-d formula. This says that near $x$ we can get a good approximation of $f(x + hu)$ using the linear approximation $f(x) + h f'(x; u)$. In particular, if $f'(x; u) < 0$ (such a $u$ is called a descent direction), then for sufficiently small $h > 0$ we have $f(x + hu) < f(x)$.
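The difference quotient above can also be evaluated numerically for a small fixed $h$. The sketch below is illustrative only (the function and the direction are made up); it estimates $f'(x; u)$ by a one-sided difference quotient and confirms that a direction with a negative directional derivative decreases $f$ for a small step.

    import numpy as np

    def f(x):                       # hypothetical example function on R^2
        return x[0] ** 2 + 3.0 * x[1] ** 2

    def directional_derivative(f, x, u, h=1e-6):
        return (f(x + h * u) - f(x)) / h

    x = np.array([1.0, 1.0])
    u = np.array([-1.0, 0.0])       # a candidate direction
    slope = directional_derivative(f, x, u)
    print(slope)                    # about -2, so u is a descent direction
    print(f(x + 0.01 * u) < f(x))   # True: a small step along u decreases f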

[Figure 4: Directional derivative as the slope of a slice.]

Let $e_i = (0, 0, \ldots, 0, 1, 0, \ldots, 0)$ be the $i$th standard basis vector, with the $1$ in the $i$th position. The directional derivative in the direction $e_i$ is called the $i$th partial derivative and can be written in several ways:
$$\frac{\partial}{\partial x_i} f(x) = \partial_{x_i} f(x) = \partial_i f(x) = f'(x; e_i).$$
We say that $f$ is differentiable at $x$ if
$$\lim_{v \to 0} \frac{f(x + v) - f(x) - g^T v}{\|v\|_2} = 0$$
for some $g \in \mathbb{R}^n$ (note that the limit for $v$ is taken in $\mathbb{R}^n$). This $g$ is uniquely determined, and is called the gradient of $f$ at $x$, denoted by $\nabla f(x)$. It is easy to show that the gradient is the vector of partial derivatives:
$$\nabla f(x) = \begin{pmatrix} \partial_{x_1} f(x) \\ \vdots \\ \partial_{x_n} f(x) \end{pmatrix}.$$
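A small sketch of this fact (my own check, not from the notes): approximate each partial derivative by a difference quotient along the corresponding standard basis vector, stack the results, and compare against an analytic gradient for an example function.

    import numpy as np

    def f(x):                                   # hypothetical example on R^3
        return x[0] * x[1] + np.sin(x[2])

    def grad_f(x):                              # analytic gradient
        return np.array([x[1], x[0], np.cos(x[2])])

    def numerical_gradient(f, x, h=1e-6):
        g = np.zeros_like(x)
        for i in range(len(x)):
            e_i = np.zeros_like(x)
            e_i[i] = 1.0                        # ith standard basis vector
            g[i] = (f(x + h * e_i) - f(x)) / h  # ith partial derivative
        return g

    x = np.array([0.5, -1.2, 0.3])
    print(numerical_gradient(f, x))
    print(grad_f(x))                            # agrees to roughly 1e-6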

The $k$th entry of the gradient (i.e., the $k$th partial derivative) is the approximate change in $f$ due to a small positive change in $x_k$.

Sometimes we will split the variables of $f$ into two parts. For instance, we could write $f(x, w)$ with $x \in \mathbb{R}^p$ and $w \in \mathbb{R}^q$. It is often useful to take the gradient with respect to only some of the variables. Here we would write $\nabla_x$ or $\nabla_w$ to specify which part (a short numerical sketch follows this subsection):
$$\nabla_x f(x, w) := \begin{pmatrix} \partial_{x_1} f(x, w) \\ \vdots \\ \partial_{x_p} f(x, w) \end{pmatrix} \quad \text{and} \quad \nabla_w f(x, w) := \begin{pmatrix} \partial_{w_1} f(x, w) \\ \vdots \\ \partial_{w_q} f(x, w) \end{pmatrix}.$$
Analogous to the univariate case, we can express the condition for differentiability in terms of a gradient approximation:
$$f(x + v) = f(x) + \nabla f(x)^T v + o(\|v\|_2).$$
The approximation $f(x + v) \approx f(x) + \nabla f(x)^T v$ gives a tangent plane at the point $x$ as we let $v$ vary.

[Figure 5: Tangent plane for $f : \mathbb{R}^2 \to \mathbb{R}$.]
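To make the block-gradient notation concrete, here is a small sketch (my own example, not from the notes; the function $f(x, w) = (w^T x - y_0)^2$ and the scalar $y_0$ are invented) that computes $\nabla_w f$ analytically and checks it against difference quotients taken only in the $w$ variables.

    import numpy as np

    def f(x, w, y0=1.5):                   # y0 is a made-up fixed target
        return (w @ x - y0) ** 2

    def grad_w(x, w, y0=1.5):              # gradient in the w block only
        return 2.0 * (w @ x - y0) * x

    x = np.array([1.0, 2.0, -1.0])
    w = np.array([0.5, -0.5, 2.0])

    h = 1e-6
    numeric = np.array([(f(x, w + h * e) - f(x, w)) / h for e in np.eye(len(w))])
    print(numeric)
    print(grad_w(x, w))                    # agrees to roughly 1e-5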

If $f$ is differentiable, we can use the gradient to compute an arbitrary directional derivative:
$$f'(x; u) = \nabla f(x)^T u.$$
From this expression we can quickly see that (assuming $\nabla f(x) \neq 0$)
$$\operatorname*{arg\,max}_{\|u\|_2 = 1} f'(x; u) = \frac{\nabla f(x)}{\|\nabla f(x)\|_2} \quad \text{and} \quad \operatorname*{arg\,min}_{\|u\|_2 = 1} f'(x; u) = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2}.$$
In words, we say that the gradient points in the direction of steepest ascent, and the negative gradient points in the direction of steepest descent.

As in the 1-dimensional case, if $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable and $x$ is a local extremum of $f$, then we must have $\nabla f(x) = 0$. Points $x$ with $\nabla f(x) = 0$ are called critical points. As we will see later in the course, if a function is differentiable and convex, then a point is critical if and only if it is a global minimum.

[Figure 6: Critical points of $f : \mathbb{R}^2 \to \mathbb{R}$.]

Computing Gradients

A simple method to compute the gradient of a function is to compute each partial derivative separately. For example, suppose $f : \mathbb{R}^3 \to \mathbb{R}$ is given by
$$f(x_1, x_2, x_3) = \log(1 + e^{x_1 + 2x_2 + 3x_3}).$$

Then we can directly compute
$$\partial_{x_1} f(x_1, x_2, x_3) = \frac{e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}}, \quad \partial_{x_2} f(x_1, x_2, x_3) = \frac{2e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}}, \quad \partial_{x_3} f(x_1, x_2, x_3) = \frac{3e^{x_1 + 2x_2 + 3x_3}}{1 + e^{x_1 + 2x_2 + 3x_3}},$$
and obtain
$$\nabla f(x_1, x_2, x_3) = \frac{1}{1 + e^{x_1 + 2x_2 + 3x_3}} \begin{pmatrix} e^{x_1 + 2x_2 + 3x_3} \\ 2e^{x_1 + 2x_2 + 3x_3} \\ 3e^{x_1 + 2x_2 + 3x_3} \end{pmatrix}.$$
Alternatively, we could let $w = (1, 2, 3)^T$ and write
$$f(x) = \log(1 + e^{w^T x}).$$
Then we can apply a version of the chain rule which says that if $g : \mathbb{R} \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}$ are differentiable, then
$$\nabla g(h(x)) = g'(h(x)) \nabla h(x).$$
Applying the chain rule twice (for $\log$ and $\exp$) we obtain
$$\nabla f(x) = \frac{1}{1 + e^{w^T x}} e^{w^T x} w,$$
where we use the fact that $\nabla_x (w^T x) = w$. This last expression is more concise, and is more amenable to vectorized computation in many languages.

Another useful technique is to compute a general directional derivative and then infer the gradient from the computation. For example, let $f : \mathbb{R}^n \to \mathbb{R}$ be given by
$$f(x) = \|Ax - y\|_2^2 = (Ax - y)^T (Ax - y) = x^T A^T A x - 2 y^T A x + y^T y,$$
for some $A \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$. Assuming $f$ is differentiable (so that $f'(x; v) = \nabla f(x)^T v$) we have
$$\begin{aligned}
f(x + tv) &= (x + tv)^T A^T A (x + tv) - 2 y^T A (x + tv) + y^T y \\
&= x^T A^T A x + t^2 v^T A^T A v + 2 t x^T A^T A v - 2 y^T A x - 2 t y^T A v + y^T y \\
&= f(x) + t (2 x^T A^T A - 2 y^T A) v + t^2 v^T A^T A v.
\end{aligned}$$
Thus we have
$$\frac{f(x + tv) - f(x)}{t} = (2 x^T A^T A - 2 y^T A) v + t v^T A^T A v.$$
Taking the limit as $t \to 0$ shows
$$f'(x; v) = (2 x^T A^T A - 2 y^T A) v \implies \nabla f(x) = (2 x^T A^T A - 2 y^T A)^T = 2 A^T A x - 2 A^T y.$$
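As a sanity check (not in the original notes), the formula $\nabla f(x) = 2 A^T A x - 2 A^T y$ can be compared against coordinate-wise difference quotients on small random data.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 5, 3
    A = rng.normal(size=(m, n))
    y = rng.normal(size=m)
    x = rng.normal(size=n)

    f = lambda x: np.sum((A @ x - y) ** 2)      # f(x) = ||Ax - y||_2^2
    grad = 2 * A.T @ A @ x - 2 * A.T @ y        # the formula derived above

    h = 1e-6
    numeric = np.array([(f(x + h * e) - f(x)) / h for e in np.eye(n)])
    print(np.allclose(numeric, grad, atol=1e-4))   # True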

Assume the columns of the data matrix $A$ have been centered (by subtracting their respective means). We can interpret $\nabla f(x) = 2 A^T (Ax - y)$ as (up to scaling) the covariance between the features and the residual.

Using the above calculation we can determine the critical points of $f$. Let's assume here that $A$ has full column rank. Then $A^T A$ is invertible, and the unique critical point is $x = (A^T A)^{-1} A^T y$ (a numerical check appears at the end of these notes). As we will see later in the course, this is a global minimum since $f$ is convex (the Hessian of $f$ satisfies $\nabla^2 f(x) = 2 A^T A \succeq 0$).

(⋆) Proving Differentiability

With a little extra work we can make the previous technique give a proof of differentiability. Using the computation above, we can rewrite $f(x + v)$ as $f(x)$ plus terms depending on $v$:
$$f(x + v) = f(x) + (2 x^T A^T A - 2 y^T A) v + v^T A^T A v.$$
Note that
$$\frac{v^T A^T A v}{\|v\|_2} = \frac{\|Av\|_2^2}{\|v\|_2} \leq \frac{\|A\|_2^2 \|v\|_2^2}{\|v\|_2} = \|A\|_2^2 \|v\|_2 \to 0,$$
as $\|v\|_2 \to 0$. (This section is starred since we used the matrix norm $\|A\|_2$ here.) This shows $f(x + v)$ above has the form
$$f(x + v) = f(x) + \nabla f(x)^T v + o(\|v\|_2).$$
This proves that $f$ is differentiable and that
$$\nabla f(x) = 2 A^T A x - 2 A^T y.$$

Another method we could have used to establish differentiability is to observe that the partial derivatives are all continuous. This relies on the following theorem.

Theorem 1. Let $f : \mathbb{R}^n \to \mathbb{R}$ and suppose $\partial_{x_i} f(x) : \mathbb{R}^n \to \mathbb{R}$ is continuous for all $x \in \mathbb{R}^n$ and all $i = 1, \ldots, n$. Then $f$ is differentiable.
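As promised above, a short closing sketch (illustrative, not from the notes): for a random $A$ with full column rank, the point $x^* = (A^T A)^{-1} A^T y$ has vanishing gradient and coincides with the least squares solution returned by NumPy.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 8, 3
    A = rng.normal(size=(m, n))        # full column rank with probability 1
    y = rng.normal(size=m)

    x_star = np.linalg.solve(A.T @ A, A.T @ y)       # (A^T A)^{-1} A^T y
    grad_at_x_star = 2 * A.T @ (A @ x_star - y)

    print(np.allclose(grad_at_x_star, 0.0))                            # True
    print(np.allclose(x_star, np.linalg.lstsq(A, y, rcond=None)[0]))   # True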