Gradient Descent. Stephan Robert



Model
Other inputs: # of bathrooms, lot size, year built, location, lake view.
[Scatter plot: Price (CHF) vs. Size (m²)]

Model - How it works
Linear regression with one variable.
[Scatter plot: Price (CHF) $y$ vs. Size (m²) $x$]

Model - How it works
Linear regression with one variable:
$y_i = \theta_0 + \theta_1 x_i + \epsilon_i$, $\quad f(x) = \theta_0 + \theta_1 x$
[Scatter plot: Price (CHF) $y$ vs. Size (m²) $x$, with the fitted line $f(x)$]

Model
Idea: choose $\theta_0$, $\theta_1$ so that $f(x) = \theta_0 + \theta_1 x$ is close to $y$ for our training examples.
[Scatter plot: Price (CHF) $y$ vs. Size (m²) $x$]

Model
Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$
Aim: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1) = \min_{\theta_0, \theta_1} \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$
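
As a concrete illustration of this cost, here is a minimal NumPy sketch (the function name and the array-based interface are my own choices, not from the slides):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Mean squared error J(theta0, theta1) = 1/(2m) * sum_i (f(x_i) - y_i)^2."""
    m = len(x)
    predictions = theta0 + theta1 * x      # f(x_i) for every sample
    return np.sum((predictions - y) ** 2) / (2 * m)
```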

Other functions
Quadratic: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
Higher-order polynomial: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \dots$

A very simple example
With $\theta_0 = 0$, $\theta_1 = 1$ and the three training points $(1, 2)$, $(2, 4)$, $(3, 6)$:
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2 = \frac{1}{2 \cdot 3} \left( (1-2)^2 + (2-4)^2 + (3-6)^2 \right) = \frac{1}{6}(1 + 4 + 9) = \frac{14}{6} \approx 2.33$
[Plot of the three points $(x_i, y_i)$ and the line $f(x) = x$]
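
A quick, self-contained check of the arithmetic on this slide (the variable names are mine):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
theta0, theta1 = 0.0, 1.0
J = np.sum((theta0 + theta1 * x - y) ** 2) / (2 * len(x))
print(J)   # 2.333... = 14/6
```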

A very simple example
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$
[Plot of $J$ as a function of $\theta_1$]

A very simple example
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$
[Plot of $J$ as a function of $\theta_1$, with the minimum marked]

A very simple example
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$
(Figure from Andrew Ng, Stanford)

3D-plots

Cost function
[Surface/contour plot of $J(\theta_0, \theta_1)$ over $(\theta_0, \theta_1)$]

Gradient descent (intuition)
Have some function $J(\theta_0, \theta_1)$.
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Solution: we start with some $(\theta_0, \theta_1)$ and we keep changing these parameters to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum.

Gradient descent (intuition)
Have some function $J(\theta_0, \theta_1)$.
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Proposition: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
[Plot of $J(\theta_1)$ with a point on the part of the curve where the slope is negative]

Gradient descent - Formally
$J(\theta) : \mathbb{R}^{n+1} \to \mathbb{R}$ is convex and differentiable, with $\theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$.
We want to solve $\min_{\theta \in \mathbb{R}^{n+1}} J(\theta)$, i.e. find $\theta^*$ such that $J(\theta^*) = \min_{\theta \in \mathbb{R}^{n+1}} J(\theta)$.

Gradient descent - Algorithm
Choose an initial $\theta[0] \in \mathbb{R}^{n+1}$.
Repeat: $\theta[k] = \theta[k-1] - \alpha[k-1]\, \nabla J(\theta[k-1])$, for $k = 1, 2, 3, \dots$
Stop at some point.
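
A minimal sketch of this loop in Python, assuming the gradient is supplied as a function; the names (grad_J, alpha), the fixed iteration budget, and the step-norm stopping rule are illustrative choices, not from the slides:

```python
import numpy as np

def gradient_descent(grad_J, theta_init, alpha=0.01, max_iter=1000, tol=1e-6):
    """Iterate theta[k] = theta[k-1] - alpha * grad J(theta[k-1]) until the step is tiny."""
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        step = alpha * grad_J(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:   # "stop at some point"
            break
    return theta
```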

Example with 2 parameters
Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0])$.
Repeat: $\theta_j[k] = \theta_j[k-1] - \alpha[k-1]\, \frac{\partial}{\partial \theta_j} J(\theta_0[k-1], \theta_1[k-1])$, for $j = 0, 1$.
Stop at some point.

Example with 2 parameters - Implementation detail
Repeat (simultaneous update), for $k = 1, 2, 3, \dots$:
$\mathrm{temp}_0 = \theta_0[k-1] - \alpha[k-1]\, \frac{\partial}{\partial \theta_0} J(\theta_0[k-1], \theta_1[k-1])$
$\mathrm{temp}_1 = \theta_1[k-1] - \alpha[k-1]\, \frac{\partial}{\partial \theta_1} J(\theta_0[k-1], \theta_1[k-1])$
$\theta_0[k] = \mathrm{temp}_0$
$\theta_1[k] = \mathrm{temp}_1$
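
The same detail as a small Python helper: both partial derivatives are evaluated at the old values before either parameter is overwritten (the gradient functions are passed in as arguments; their names are mine):

```python
def simultaneous_update(theta0, theta1, dJ_dtheta0, dJ_dtheta1, alpha):
    """One gradient-descent step with a simultaneous update of (theta0, theta1)."""
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)   # uses the OLD theta0, theta1
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)   # uses the OLD theta0, theta1
    return temp0, temp1   # assign back to (theta0, theta1) only after both are computed
```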

Learning rate
Search over one parameter only (the other one, $\theta_0$, is supposed to be set already):
$\theta_1[k] = \theta_1[k-1] - \alpha[k-1]\, \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1[k-1])$
With a constant learning rate $\alpha[1] = \dots = \alpha[k] = \alpha$:
$\theta_1[k] = \theta_1[k-1] - \alpha\, \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1[k-1])$

Learning rate
$\theta_1[k] = \theta_1[k-1] - \alpha\, \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1[k-1])$
[Two plots of $J(\theta_1)$: one with $\alpha$ too small, one with $\alpha$ too large]

Gradient Descent - Learning rate (intuitive)
Practical trick (debug): $J$ should decrease after each iteration; fix a stop threshold (0.1?) and observe the cost function $J(\theta)$ versus the number of iterations as it approaches $J(\theta^*) = \min_{\theta \in \mathbb{R}^{n+1}} J(\theta)$.
[Plot of $J(\theta)$ vs. number of iterations]

Gradient descent
Important: we assume there are no local minima for now!
(Illustration: Wikimedia)


Gradient Descent - Backtracking line search (Boyd)
Convergence analysis might help. Take an arbitrary parameter $\beta$ and fix it: $0.1 \le \beta \le 0.8$.
At each iteration, start with $\alpha = 1$ and, while $J(\theta - \alpha \nabla J(\theta)) > J(\theta) - \frac{\alpha}{2} \|\nabla J(\theta)\|^2$, update $\alpha = \beta \alpha$.
Simple, but works quite well in practice (check!).
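
A sketch of this backtracking rule in Python, following the condition on the slide; the function names J and grad_J and the default beta are assumptions:

```python
import numpy as np

def backtracking_step(J, grad_J, theta, beta=0.8):
    """Shrink alpha until J(theta - alpha*grad) <= J(theta) - (alpha/2)*||grad||^2, then step."""
    g = grad_J(theta)
    alpha = 1.0
    while J(theta - alpha * g) > J(theta) - 0.5 * alpha * np.dot(g, g):
        alpha *= beta                     # alpha = beta * alpha
    return theta - alpha * g
```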

Gradient Descent - Backtracking line search (Boyd) - Interpretation
[Figure: as a function of $\alpha$, the curve $J(\theta - \alpha \nabla J(\theta))$ compared with the lines $J(\theta) - \alpha \|\nabla J(\theta)\|^2$ and $J(\theta) - \frac{\alpha}{2} \|\nabla J(\theta)\|^2$; backtracking accepts $\alpha$ once the curve lies below the latter line.]

Gradient Descent - Backtracking line search (Boyd)
Example (13 steps) on a quadratic objective in $(\theta_1, \theta_2)$ with $\beta = 0.8$ (Geoff Gordon & Ryan Tibshirani). Convergence analysis by GG&RT.

Gradient Descent - Exact line search (Boyd)
At each iteration, do the best along the gradient direction: $\alpha = \arg\min_{s \ge 0} J(\theta - s \nabla J(\theta))$
Often not much more efficient than backtracking; not really worth it.
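
For completeness, exact line search can be sketched with SciPy's bounded scalar minimizer (the search interval [0, s_max] and the function names are assumptions; as the slide notes, backtracking is usually preferred):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def exact_line_search_step(J, grad_J, theta, s_max=10.0):
    """Pick the step size s minimizing J(theta - s * grad J(theta)) over [0, s_max]."""
    g = grad_J(theta)
    res = minimize_scalar(lambda s: J(theta - s * g), bounds=(0.0, s_max), method="bounded")
    return theta - res.x * g
```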

Gradient descent - Linear regression
$f(x) = \theta_0 + \theta_1 x$
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2 = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)^2$
Repeat: $\theta_j[k] = \theta_j[k-1] - \alpha[k-1]\, \frac{\partial}{\partial \theta_j} J(\theta_0[k-1], \theta_1[k-1])$, for $j = 0, 1$.
Stop at some point.

Gradient descent - Linear regression
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)^2$
$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0} \left( \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)^2 \right) = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)$

Gradient descent - Linear regression
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)^2$
$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_1} \left( \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)^2 \right) = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)\, x_i$

Gradient descent - Linear regression
Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0])$ and a constant learning rate $\alpha[1] = \dots = \alpha[k] = \alpha$.
Repeat:
$\theta_0[k] = \theta_0[k-1] - \alpha\, \frac{\partial}{\partial \theta_0} J(\theta_0[k-1], \theta_1[k-1])$
$\theta_1[k] = \theta_1[k-1] - \alpha\, \frac{\partial}{\partial \theta_1} J(\theta_0[k-1], \theta_1[k-1])$
Stop at some point.

Gradient descent - Linear regression
Repeat:
$\theta_0[k] = \theta_0[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (\theta_0[k-1] + \theta_1[k-1]\, x_i - y_i)$
$\theta_1[k] = \theta_1[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (\theta_0[k-1] + \theta_1[k-1]\, x_i - y_i)\, x_i$
Stop at some point.
WARNING: update $\theta_0$ and $\theta_1$ simultaneously.
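
Putting the pieces together, a small NumPy sketch of these updates for one feature (the learning rate, iteration count, and function name are illustrative assumptions):

```python
import numpy as np

def linreg_gradient_descent(x, y, alpha=0.1, n_iter=2000):
    """Gradient descent for f(x) = theta0 + theta1*x with simultaneous updates."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iter):
        residual = theta0 + theta1 * x - y                 # f(x_i) - y_i for every sample
        grad0 = residual.sum() / m                         # dJ/dtheta0
        grad1 = (residual * x).sum() / m                   # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update
    return theta0, theta1

# On the earlier toy data (y = 2x) this converges to roughly theta0 = 0, theta1 = 2
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(linreg_gradient_descent(x, y))
```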

Generalization
Higher-order polynomial: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \dots$
Multi-feature: $f(x_1, x_2, \dots) = f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots$
WARNING - Notation: in $f(x)$, $(x)_1 = x_1$ is a scalar (the first sample); in $f(x_1, x_2, \dots)$, $x_1$ is a variable (the first feature) and $(x_1)_i$ is a scalar (its value for sample $i$).

Gradient Descent, n=1 - Linear Regression
Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0])$ and a constant learning rate $\alpha[1] = \dots = \alpha[k] = \alpha$.
Repeat:
$\theta_0[k] = \theta_0[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (\theta_0[k-1] + \theta_1[k-1]\, x_i - y_i)$
$\theta_1[k] = \theta_1[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (\theta_0[k-1] + \theta_1[k-1]\, x_i - y_i)\, x_i$
Stop at some point.

Gradient Descent, n>1 - Linear Regression
Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0], \theta_2[0], \dots, \theta_n[0])$.
Repeat:
$\theta_0[k] = \theta_0[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (f(x_i) - y_i)$
$\theta_1[k] = \theta_1[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (f(x_i) - y_i)\,(x_1)_i$
$\theta_2[k] = \theta_2[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (f(x_i) - y_i)\,(x_2)_i$
...
Stop at some point.

Gradient Descent - Practical trick
Scale all features to a similar scale, e.g. size of the house: 300 m², number of floors: 3.
Mean normalization: $x_i = \frac{\mathrm{Size}_i - \mathrm{MeanSize}}{\mathrm{MaxSize} - \mathrm{MinSize}}$
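
A column-wise NumPy version of this normalization (the function and variable names are mine):

```python
import numpy as np

def mean_normalize(features):
    """Rescale each feature column to (value - mean) / (max - min)."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / (features.max(axis=0) - features.min(axis=0))
```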

Vector Notation
$x = (x_0, x_1, \dots, x_n)^t \in \mathbb{R}^{n+1}$, $\quad \theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$
$f(x_0, x_1, x_2, \dots, x_n) = f(x) = \theta^t x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
Often: $(x_0)_i = 1$ for every sample $i$.

Vector Notation
$(x)_i = ((x_0)_i, (x_1)_i, \dots, (x_n)_i)^t \in \mathbb{R}^{n+1}$, $\quad \theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$
$f((x_0)_i, (x_1)_i, (x_2)_i, \dots, (x_n)_i) = f((x)_i) = \theta^t (x)_i = \theta_0 + \theta_1 (x_1)_i + \theta_2 (x_2)_i + \dots + \theta_n (x_n)_i$, for $i \in \{1, \dots, m\}$.

Vector Notation
$X = \begin{pmatrix} (x)_1^t \\ (x)_2^t \\ \vdots \\ (x)_m^t \end{pmatrix} = \begin{pmatrix} (x_0)_1 & (x_1)_1 & \dots & (x_n)_1 \\ (x_0)_2 & (x_1)_2 & \dots & (x_n)_2 \\ \vdots & & & \vdots \\ (x_0)_m & (x_1)_m & \dots & (x_n)_m \end{pmatrix} \in \mathbb{R}^{m \times (n+1)}$

Vector Notation
$y = ((y)_1, (y)_2, \dots, (y)_m)^t = (y_1, y_2, \dots, y_m)^t \in \mathbb{R}^{m}$, $\quad \theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$
Cost function: $J(\theta) = \frac{1}{2m} (X\theta - y)^t (X\theta - y)$
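
With this notation the cost and its gradient vectorize; a minimal NumPy sketch (the gradient $\frac{1}{m} X^t (X\theta - y)$ collects the per-coordinate derivatives seen earlier; the function names are mine):

```python
import numpy as np

def cost_vec(theta, X, y):
    """J(theta) = 1/(2m) * (X theta - y)^t (X theta - y)."""
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m)

def grad_vec(theta, X, y):
    """Gradient of J: (1/m) * X^t (X theta - y)."""
    return X.T @ (X @ theta - y) / len(y)
```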

Example
$m = 4$ = number of samples

  x_0 | x_1: Size | x_2: #bedrooms | x_3: #floors | x_4: Age | y: Price (mios)
  1   | 125       | 3              | 2            | 2        | 0.8
  1   | 22        | 5              | 2            | 15       | 1.2
  1   | 4         | 7              | 3            | 5        | 4
  1   | 25        | 4              | 2            | 1        | 2

$X = \begin{pmatrix} 1 & 125 & 3 & 2 & 2 \\ 1 & 22 & 5 & 2 & 15 \\ 1 & 4 & 7 & 3 & 5 \\ 1 & 25 & 4 & 2 & 1 \end{pmatrix}$, $\quad y = \begin{pmatrix} 0.8 \\ 1.2 \\ 4 \\ 2 \end{pmatrix} \in \mathbb{R}^{m}$

Example
$X = \begin{pmatrix} 1 & 125 & 3 & 2 & 2 \\ 1 & 22 & 5 & 2 & 15 \\ 1 & 4 & 7 & 3 & 5 \\ 1 & 25 & 4 & 2 & 1 \end{pmatrix} \in \mathbb{R}^{m \times (n+1)}$, $\quad X^t = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 125 & 22 & 4 & 25 \\ 3 & 5 & 7 & 4 \\ 2 & 2 & 3 & 2 \\ 2 & 15 & 5 & 1 \end{pmatrix} \in \mathbb{R}^{(n+1) \times m}$

Normal equation
Method to solve for $\theta$ analytically (least squares): setting the derivatives to zero gives
$\theta = (X^t X)^{-1} X^t y$
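
A minimal NumPy sketch of this formula; the tiny data set is hypothetical and chosen so that $X^t X$ is invertible (more samples than parameters), and the linear system is solved directly rather than forming the inverse explicitly:

```python
import numpy as np

# Hypothetical data: m = 5 samples, one feature plus the x0 = 1 column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# theta = (X^t X)^{-1} X^t y, computed by solving the normal equations
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # approximately [0.04, 1.0]
```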

Comparison
Gradient Descent:
- Need to choose a learning parameter $\alpha$
- Many iterations
- Works even for large $n$ (on the order of $10^6$)
Normal Equation:
- No need to choose a learning parameter
- No iterations
- Need to compute $(X^t X)^{-1}$
- Slow when $n$ is large