Gradient Descent. Stephan Robert


1 Gradient Descent Stephan Robert

2 Model. Other inputs: # of bathrooms, lot size, year built, location, lake view. [Scatter plot: Price (CHF) vs. Size (m²)]

3 Model. How it works: linear regression with one variable. [Scatter plot: y = Price (CHF) vs. x = Size (m²)]

4 Model. How it works: linear regression with one variable, $y_i = \theta_0 + \theta_1 x_i + \epsilon_i$, fitted by $f(x) = \theta_0 + \theta_1 x$. [Plot: fitted line over Price (CHF) vs. Size (m²)]

5 Model. Idea: choose $\theta_0, \theta_1$ so that $f(x) = \theta_0 + \theta_1 x$ is close to $y$ for our training examples. [Plot: candidate line over Price (CHF) vs. Size (m²)]

6 Model. Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(f(x_i) - y_i)^2$. Aim: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1) = \min_{\theta_0, \theta_1} \frac{1}{2m}\sum_{i=1}^{m}(f(x_i) - y_i)^2$
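As an illustration (not part of the slides), the cost function maps directly to a few lines of NumPy; the helper name `compute_cost` is ours:

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum_i (f(x_i) - y_i)^2 for f(x) = theta0 + theta1*x."""
    m = len(x)
    predictions = theta0 + theta1 * x   # f(x_i) for every sample, vectorized
    return np.sum((predictions - y) ** 2) / (2 * m)
```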

7 Other functions. Quadratic: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$. Higher-order polynomial: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$

8 A very simple example ($\theta_0 = 0$, $\theta_1 = 1$, so $f(x_i) = x_i$; data $y = (2, 4, 6)$ at $x = (1, 2, 3)$): $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(f(x_i) - y_i)^2 = \frac{1}{2m}\left((1-2)^2 + (2-4)^2 + (3-6)^2\right) = \frac{1}{6}(1 + 4 + 9) = \frac{14}{6} \approx 2.33$
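Plugging the toy data into the sketch above reproduces the slide's value:

```python
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 1.0, x, y))   # 14/6 ≈ 2.333
```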

9 A very simple example: $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(f(x_i) - y_i)^2$. [Plot of $J$ as a function of $\theta_1$]

10 A very simple example: $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(f(x_i) - y_i)^2$. [Plot of $J$ with its minimum marked]

11 A very simple example: $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(f(x_i) - y_i)^2$. [Contour plot from Andrew Ng, Stanford]

12 3D-plots

13 Cost function. [Contour plot of $J(\theta_0, \theta_1)$ over the $(\theta_0, \theta_1)$ plane]

14 Gradient descent (intuition). Have some function $J(\theta_0, \theta_1)$. Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$. Solution: we start with some $(\theta_0, \theta_1)$ and we keep changing these parameters to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum.

15 Gradient descent (intuition). Have some function $J(\theta_0, \theta_1)$. Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$. Proposition: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$. [Plot: at a point with negative slope, the update increases $\theta_j$]

16 Gradient descent. Formally: $J(\theta): \mathbb{R}^{n+1} \to \mathbb{R}$ is convex and differentiable, with $\theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$. We want to solve $\min_{\theta \in \mathbb{R}^{n+1}} J(\theta)$, i.e. find $\theta^\star$ such that $J(\theta^\star) = \min_{\theta \in \mathbb{R}^{n+1}} J(\theta)$.

17 Gradient descent. Algorithm: choose an initial $\theta[0] \in \mathbb{R}^{n+1}$. Repeat $\theta[k] = \theta[k-1] - \alpha[k-1]\,\nabla J(\theta[k-1])$ for $k = 1, 2, 3, \dots$ Stop at some point.
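A minimal sketch of this generic loop, assuming a fixed step size and a caller-supplied gradient function (the names `grad_J`, `alpha`, and `tol` are ours, not from the slides):

```python
import numpy as np

def gradient_descent(grad_J, theta_init, alpha=0.1, tol=1e-6, max_iter=10000):
    """theta[k] = theta[k-1] - alpha * grad_J(theta[k-1]); stop when the step is tiny."""
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        step = alpha * grad_J(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:   # "stop at some point"
            break
    return theta
```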

18 Example with 2 parameters. Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0])$. Repeat $\theta_j[k] = \theta_j[k-1] - \alpha[k-1]\,\frac{\partial}{\partial \theta_j} J(\theta_0[k-1], \theta_1[k-1])$ for $j = 0, 1$. Stop at some point.

19 Example with 2 parameters. Implementation detail, repeat (simultaneous update) for $k = 1, 2, 3, \dots$:
$\text{temp}_0 = \theta_0[k-1] - \alpha[k-1]\,\frac{\partial}{\partial \theta_0} J(\theta_0[k-1], \theta_1[k-1])$
$\text{temp}_1 = \theta_1[k-1] - \alpha[k-1]\,\frac{\partial}{\partial \theta_1} J(\theta_0[k-1], \theta_1[k-1])$
$\theta_0[k] = \text{temp}_0$, $\theta_1[k] = \text{temp}_1$
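In scalar code the temporaries are the whole point: both partial derivatives must be evaluated at the old $(\theta_0, \theta_1)$ before either parameter is overwritten. A sketch, with hypothetical derivative callables `dJ_d0` and `dJ_d1`:

```python
def simultaneous_update(theta0, theta1, alpha, dJ_d0, dJ_d1):
    """One gradient-descent step; both derivatives use the OLD parameters."""
    temp0 = theta0 - alpha * dJ_d0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_d1(theta0, theta1)
    return temp0, temp1   # assign only after both temporaries are computed
```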

20 Learning rate. Search over one parameter only (the other one is supposed to be set already): $\theta_1[k] = \theta_1[k-1] - \alpha[k-1]\,\frac{\partial}{\partial \theta_1} J(\theta_0[k-1], \theta_1[k-1])$. With $\alpha[k-1] = \alpha$ for all $k$ and $\theta_0$ fixed: $\theta_1[k] = \theta_1[k-1] - \alpha\,\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1[k-1])$

21 Learning rate. $\theta_1[k] = \theta_1[k-1] - \alpha\,\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1[k-1])$. [Two plots of $J(\theta_1)$: $\alpha$ too small (slow convergence) vs. $\alpha$ too large (overshooting)]

22 Gradient Descent. Learning rate (intuitive). Practical trick (debug): $J$ should decrease after each iteration; fix a stop-threshold (0.1?). Observe the cost function $J(\theta)$ as it approaches $J(\theta^\star) = \min_{\theta} J(\theta)$. [Plot: $J(\theta)$ vs. number of iterations]
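One way to implement the debugging trick, as a sketch: record $J$ after every iteration and stop once the decrease falls below the chosen threshold (all names here are illustrative):

```python
def gd_with_monitor(J, grad_J, theta, alpha=0.1, stop_threshold=0.1, max_iter=1000):
    """Gradient descent that tracks the cost curve; J should decrease each iteration."""
    history = [J(theta)]
    for _ in range(max_iter):
        theta = theta - alpha * grad_J(theta)
        history.append(J(theta))
        if history[-2] - history[-1] < stop_threshold:   # progress too small: stop
            break
    return theta, history   # plot history vs. iteration count to debug alpha
```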

23 Gradient descent. Important: we assume there are no local minima for now! [Illustration: Wikimedia]

24 Gradient Descent. Learning rate (intuitive). Practical trick (debug): $J$ should decrease after each iteration; fix a stop-threshold (0.1?). Observe the cost function $J(\theta)$ as it approaches $J(\theta^\star) = \min_{\theta} J(\theta)$. [Plot: $J(\theta)$ vs. number of iterations]

25 Gradient Descent. Backtracking line search (Boyd). Convergence analysis might help. Take an arbitrary parameter $\beta$ and fix it: $0.1 \le \beta \le 0.8$. At each iteration start with $\alpha = 1$ and, while $J(\theta - \alpha \nabla J(\theta)) > J(\theta) - \frac{\alpha}{2}\,\|\nabla J(\theta)\|^2$, update $\alpha := \beta \alpha$. Simple, but works quite well in practice (check!).
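A sketch of this rule, assuming `J` and `grad_J` are callables for the cost and its gradient:

```python
import numpy as np

def backtracking_step(J, grad_J, theta, beta=0.8):
    """Shrink alpha until J(theta - alpha*g) <= J(theta) - (alpha/2)*||g||^2."""
    g = grad_J(theta)
    alpha = 1.0
    while J(theta - alpha * g) > J(theta) - 0.5 * alpha * np.dot(g, g):
        alpha *= beta   # alpha is reset to 1 at every outer iteration
    return theta - alpha * g
```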

26 Gradient Descent. Backtracking line search (Boyd): interpretation. [Plot vs. $\alpha$: the curve $J(\theta - \alpha \nabla J(\theta))$ together with the lines $J(\theta) - \alpha \|\nabla J(\theta)\|^2$ and $J(\theta) - \frac{\alpha}{2}\|\nabla J(\theta)\|^2$]

27 Gradient Descent. Backtracking line search (Boyd). Example (13 steps) and convergence analysis from Geoff Gordon & Ryan Tibshirani, with $\beta = 0.8$.

28 Gradient Descent. Exact line search (Boyd). At each iteration do the best along the gradient direction: $\alpha = \arg\min_{s \ge 0} J(\theta - s \nabla J(\theta))$. Often not much more efficient than backtracking; not really worth it.

29 Gradient descent, linear regression. Repeat $\theta_j[k] = \theta_j[k-1] - \alpha[k-1]\,\frac{\partial}{\partial \theta_j} J(\theta_0[k-1], \theta_1[k-1])$ for $j = 0, 1$; stop at some point. Here $f(x) = \theta_0 + \theta_1 x$ and $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(f(x_i) - y_i)^2 = \frac{1}{2m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x_i - y_i)^2$

30 Gradient descent, linear regression. $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0}\left(\frac{1}{2m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x_i - y_i)^2\right) = \frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x_i - y_i)$

31 Gradient descent, linear regression. $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_1}\left(\frac{1}{2m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x_i - y_i)^2\right) = \frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x_i - y_i)\,x_i$
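The two partial derivatives above as a vectorized NumPy sketch (the helper name `gradients` is ours):

```python
def gradients(theta0, theta1, x, y):
    """Partial derivatives of J for one-variable linear regression."""
    m = len(x)
    residuals = theta0 + theta1 * x - y          # (theta0 + theta1*x_i - y_i)
    return residuals.sum() / m, (residuals * x).sum() / m
```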

32 Gradient descent, linear regression. Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0])$. Repeat, for $k = 1, 2, \dots$: $\theta_0[k] = \theta_0[k-1] - \alpha[k-1]\,\frac{\partial}{\partial \theta_0} J(\theta_0[k-1], \theta_1[k-1])$, $\theta_1[k] = \theta_1[k-1] - \alpha[k-1]\,\frac{\partial}{\partial \theta_1} J(\theta_0[k-1], \theta_1[k-1])$. Stop at some point.

33 Gradient descent, linear regression. Repeat: $\theta_0[k] = \theta_0[k-1] - \frac{\alpha}{m}\sum_{i=1}^{m}(\theta_0[k-1] + \theta_1[k-1]x_i - y_i)$, $\theta_1[k] = \theta_1[k-1] - \frac{\alpha}{m}\sum_{i=1}^{m}(\theta_0[k-1] + \theta_1[k-1]x_i - y_i)\,x_i$. Stop at some point. WARNING: update $\theta_0$ and $\theta_1$ simultaneously.
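Putting the pieces together, a sketch of the full loop with the simultaneous update baked in (it reuses the `gradients` helper above; the fixed `alpha` and iteration count are illustrative):

```python
def linear_regression_gd(x, y, alpha=0.01, n_iter=1000):
    """Batch gradient descent for f(x) = theta0 + theta1*x."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iter):
        d0, d1 = gradients(theta0, theta1, x, y)  # both evaluated at the OLD parameters
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1
```

On the toy data of slide 8 this drives $(\theta_0, \theta_1)$ toward the exact fit $y = 2x$.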

34 Generalization. Higher-order polynomial: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \dots$ Multi-features: $f(x_1, x_2, \dots) = f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots$ WARNING, notation: in $f(x)$, $(x)_1 = x_1$ is a scalar; in $f(x_1, x_2, \dots)$, $x_1$ is a variable and $(x_1)_i$ (the value of feature $x_1$ for sample $i$) is a scalar.

35 Gradient Descent, n = 1 (Linear Regression). Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0])$. Repeat, for $k = 1, 2, \dots$: $\theta_0[k] = \theta_0[k-1] - \frac{\alpha}{m}\sum_{i=1}^{m}(\theta_0[k-1] + \theta_1[k-1]x_i - y_i)$, $\theta_1[k] = \theta_1[k-1] - \frac{\alpha}{m}\sum_{i=1}^{m}(\theta_0[k-1] + \theta_1[k-1]x_i - y_i)\,x_i$. Stop at some point.

36 Gradient Descent, n > 1 (Linear Regression). Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0], \theta_2[0], \dots, \theta_n[0])$. Repeat: $\theta_0[k] = \theta_0[k-1] - \frac{\alpha}{m}\sum_{i=1}^{m}(f(x_i) - y_i)$, $\theta_1[k] = \theta_1[k-1] - \frac{\alpha}{m}\sum_{i=1}^{m}(f(x_i) - y_i)(x_1)_i$, $\theta_2[k] = \theta_2[k-1] - \frac{\alpha}{m}\sum_{i=1}^{m}(f(x_i) - y_i)(x_2)_i$, ... Stop at some point.

37 Gradient Descent. Practical trick: scale all features to a similar scale, e.g. size of the house = 300 m² vs. number of floors = 3. Mean normalization: $x_i = \frac{\text{Size}_i - \text{MeanSize}}{\text{MaxSize} - \text{MinSize}}$
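A sketch of mean normalization for one feature column (a NumPy array is assumed):

```python
def mean_normalize(col):
    """(x - mean) / (max - min), bringing the feature roughly into [-0.5, 0.5]."""
    return (col - col.mean()) / (col.max() - col.min())
```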

38 Vector Notation. $x = (x_0, x_1, \dots, x_n)^t \in \mathbb{R}^{n+1}$, $\theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$. $f(x_0, x_1, x_2, \dots, x_n) = f(x) = \theta^t x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$. Often the convention $(x_0)_i = 1$ is used for all samples $i$.

39 Vector Notation. $(x)_i = ((x_0)_i, (x_1)_i, \dots, (x_n)_i)^t$, $\theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$. $f((x_0)_i, (x_1)_i, (x_2)_i, \dots, (x_n)_i) = f((x)_i) = \theta^t (x)_i = \theta_0 + \theta_1 (x_1)_i + \theta_2 (x_2)_i + \dots + \theta_n (x_n)_i$, for samples $i = 1, \dots, m$.

40 Vector Notation. $X = \begin{pmatrix} (x)^t_1 \\ (x)^t_2 \\ \vdots \\ (x)^t_m \end{pmatrix} = \begin{pmatrix} (x_0)_1 & (x_1)_1 & \dots & (x_n)_1 \\ (x_0)_2 & (x_1)_2 & \dots & (x_n)_2 \\ \vdots & & & \vdots \\ (x_0)_m & (x_1)_m & \dots & (x_n)_m \end{pmatrix} \in \mathbb{R}^{m \times (n+1)}$

41 Vector Notation. $y = ((y)_1, (y)_2, \dots, (y)_m)^t = (y_1, y_2, \dots, y_m)^t \in \mathbb{R}^m$, $\theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$. Cost function: $J(\theta) = \frac{1}{2m}(X\theta - y)^t(X\theta - y)$
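The vectorized cost is essentially one line of NumPy; a sketch, assuming `X` already includes the $x_0 = 1$ column:

```python
def cost_vectorized(theta, X, y):
    """J(theta) = 1/(2m) * (X@theta - y)^T (X@theta - y)."""
    r = X @ theta - y          # residual vector, shape (m,)
    return (r @ r) / (2 * len(y))
```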

42 Example, m = 4 = # of samples. Columns: $x_0$ (= 1), $x_1$ = Size, $x_2$ = #bedrooms, $x_3$ = #floors, $x_4$ = Age; target $y$ = Price (mios). Stacking the four samples row-wise gives $X \in \mathbb{R}^{m \times (n+1)}$ and $y \in \mathbb{R}^m$.

43 Example. The same $X \in \mathbb{R}^{m \times (n+1)}$ and its transpose $X^t \in \mathbb{R}^{(n+1) \times m}$.

44 Normal equation. Method to solve for $\theta$ analytically (least squares): set the derivatives to 0, which gives $\theta = (X^t X)^{-1} X^t y$
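A sketch of the closed-form solve; `np.linalg.solve` avoids forming the explicit inverse, and `np.linalg.lstsq` would be the numerically safer choice when $X^t X$ is ill-conditioned:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^t X) theta = X^t y, i.e. theta = (X^t X)^{-1} X^t y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```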

45 Comparison. Gradient Descent: need to choose a learning parameter $\alpha$; many iterations; works even for large n (on the order of $10^6$). Normal Equation: no need to choose a learning parameter; no iteration; need to compute $(X^t X)^{-1}$; slow when n is large.
