Linear Regression. Introduction to Machine Learning. Matt Gormley, Lecture 5, September 14, 2016. Readings: Bishop 3.1
1 School of Computer Science. Introduction to Machine Learning: Linear Regression. Readings: Bishop 3.1. Matt Gormley, Lecture 5, September 14, 2016
2 Homework 2: Reminders. Extension: due Friday (9/16) at 5:30pm. Recitation schedule posted on course website.
3 Outline. Linear Regression: Simple example; Model; Learning: Gradient Descent, SGD, Closed Form. Advanced Topics: Geometric and Probabilistic Interpretation of LMS; L2 Regularization; L1 Regularization; Features
4 Outline. Linear Regression: Simple example; Model; Learning (aka. Least Squares): Gradient Descent, SGD (aka. Least Mean Squares (LMS)), Closed Form (aka. Normal Equations). Advanced Topics: Geometric and Probabilistic Interpretation of LMS; L2 Regularization (aka. Ridge Regression); L1 Regularization (aka. LASSO); Features (aka. non-linear basis functions)
5 Linear regression. Our goal is to estimate $w$ from training data of $\langle x_i, y_i \rangle$ pairs: $Y = wX + \epsilon$. Optimization goal: minimize squared error (least squares): $\arg\min_w \sum_i (y_i - w x_i)^2$. Why least squares? - minimizes squared distance between measurements and the predicted line (see HW) - has a nice probabilistic interpretation - the math is pretty
6 Solving linear regression. To optimize in closed form, we just take the derivative w.r.t. $w$ and set it to 0: $\frac{\partial}{\partial w} \sum_i (y_i - w x_i)^2 = -2 \sum_i x_i (y_i - w x_i) = 0 \;\Rightarrow\; \sum_i x_i y_i = w \sum_i x_i^2 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$
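A minimal numpy sketch of this closed form (the function name and synthetic data are my own, for illustration):

import numpy as np

def fit_origin(x, y):
    # Closed-form least squares for y = w*x (no intercept):
    # w = sum_i(x_i * y_i) / sum_i(x_i^2)
    return np.dot(x, y) / np.dot(x, x)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + rng.normal(0, 0.1, size=100)
print(fit_origin(x, y))  # close to the generating slope 2.0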
7 Linear regression. Given an input $x$ we would like to compute an output $y$. In linear regression we assume that $y$ (what we are trying to predict) and $x$ (the observed values) are related by the equation $y = wx + \epsilon$, where $w$ is a parameter and $\epsilon$ represents measurement or other noise.
8 Regression example. (Plot: data generated with $w = 2$, noise std = 1; recovered $\hat{w} = 2.03$.)
9 Regression example. (Plot: same setup with noise std = 2; recovered $\hat{w} = 2.05$.)
10 Regression example. (Plot: same setup with noise std = 4; recovered $\hat{w} = 2.08$.)
11 Bias term. So far we assumed that the line passes through the origin. What if the line does not? No problem: simply change the model to $y = w_0 + w_1 x + \epsilon$. Can use least squares to determine $w_0, w_1$: $w_0 = \frac{\sum_i y_i - w_1 \sum_i x_i}{n}$, $\quad w_1 = \frac{\sum_i x_i (y_i - w_0)}{\sum_i x_i^2}$
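A sketch of the same fit with an intercept; the two coupled equations above are solved jointly here, yielding the familiar decoupled formulas (names and data are illustrative):

import numpy as np

def fit_with_bias(x, y):
    # Closed-form least squares for y = w0 + w1*x, obtained by solving the
    # two stationarity equations for w0 and w1 simultaneously.
    n = len(x)
    w1 = (np.dot(x, y) - x.sum() * y.sum() / n) / (np.dot(x, x) - x.sum() ** 2 / n)
    w0 = (y.sum() - w1 * x.sum()) / n
    return w0, w1

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
y = 1.5 + 2.0 * x + rng.normal(0, 0.2, size=200)
print(fit_with_bias(x, y))  # close to (1.5, 2.0)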
12 Linear Regression. Data: Inputs are continuous vectors of length $K$; outputs are continuous scalars. $\mathcal{D} = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^N$ where $x \in \mathbb{R}^K$ and $y \in \mathbb{R}$. What are some example problems of this form?
13 Linear Regression. Data: Inputs are continuous vectors of length $K$; outputs are continuous scalars. $\mathcal{D} = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^N$ where $x \in \mathbb{R}^K$ and $y \in \mathbb{R}$. Prediction: Output is a linear function of the inputs: $\hat{y} = h_\theta(x) = \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_K x_K = \theta^T x$ (we assume $x_1$ is 1). Learning: finds the parameters that minimize some objective function, $\theta^* = \operatorname{argmin}_\theta J(\theta)$.
14 Least Squares. Learning: finds the parameters that minimize some objective function, $\theta^* = \operatorname{argmin}_\theta J(\theta)$. We minimize the sum of the squares: $J(\theta) = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2$. Why? 1. Reduces distance between true measurements and the predicted hyperplane (a line in 1D). 2. Has a nice probabilistic interpretation.
15 Least Squares. Learning: finds the parameters that minimize some objective function, $\theta^* = \operatorname{argmin}_\theta J(\theta)$. We minimize the sum of the squares: $J(\theta) = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2$. Why? 1. Reduces distance between true measurements and the predicted hyperplane (a line in 1D). 2. Has a nice probabilistic interpretation. Aside: this is a very general optimization setup; we could solve it in lots of ways. Today, we'll consider three ways.
16 Least Squares. Learning: Three approaches to solving $\theta^* = \operatorname{argmin}_\theta J(\theta)$. Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Approach 3: Closed Form (set derivatives equal to zero and solve for parameters).
17 Gradient Descent. Algorithm 1 Gradient Descent: 1: procedure GD($\mathcal{D}$, $\theta^{(0)}$); 2: $\theta \leftarrow \theta^{(0)}$; 3: while not converged do; 4: $\theta \leftarrow \theta - \lambda \nabla_\theta J(\theta)$; 5: return $\theta$. In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. the vector of partial derivatives): $\nabla_\theta J(\theta) = \left[ \tfrac{d}{d\theta_1} J(\theta),\ \tfrac{d}{d\theta_2} J(\theta),\ \dots,\ \tfrac{d}{d\theta_K} J(\theta) \right]^T$.
18 Gradient Descent. (Algorithm 1 as above.) There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance: $\| \nabla_\theta J(\theta) \|_2 \le \epsilon$. Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.
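A batch gradient descent sketch using the L2-norm-of-the-gradient convergence test just described (the step size, tolerance, and data are arbitrary choices of mine, not from the slides):

import numpy as np

def gradient_descent(X, y, lr=0.01, tol=1e-6, max_iter=10000):
    # Minimizes J(theta) = 0.5 * sum_i (theta^T x_i - y_i)^2 by full-batch steps.
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (X @ theta - y)      # gradient summed over all N examples
        if np.linalg.norm(grad) < tol:    # convergence check from the slide
            break
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # x_1 = 1 plays the bias role
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, size=50)
print(gradient_descent(X, y))  # close to [1.0, 2.0]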
19 Stochastic Gradient Descent (SGD). Algorithm 2 Stochastic Gradient Descent (SGD): 1: procedure SGD($\mathcal{D}$, $\theta^{(0)}$); 2: $\theta \leftarrow \theta^{(0)}$; 3: while not converged do; 4: for $i \in \text{shuffle}(\{1, 2, \dots, N\})$ do; 5: $\theta \leftarrow \theta - \lambda \nabla_\theta J^{(i)}(\theta)$; 6: return $\theta$. Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$ where $J^{(i)}(\theta) = \frac{1}{2} (\theta^T x^{(i)} - y^{(i)})^2$.
20 Stochastic Gradient Descent (SGD). Algorithm 2 Stochastic Gradient Descent (SGD), written per coordinate: 1: procedure SGD($\mathcal{D}$, $\theta^{(0)}$); 2: $\theta \leftarrow \theta^{(0)}$; 3: while not converged do; 4: for $i \in \text{shuffle}(\{1, 2, \dots, N\})$ do; 5: for $k \in \{1, 2, \dots, K\}$ do; 6: $\theta_k \leftarrow \theta_k - \lambda \frac{d}{d\theta_k} J^{(i)}(\theta)$; 7: return $\theta$. Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$ where $J^{(i)}(\theta) = \frac{1}{2} (\theta^T x^{(i)} - y^{(i)})^2$.
21 Stochastic Gradient Descent (SGD). (Algorithm 2 as above.) Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$ where $J^{(i)}(\theta) = \frac{1}{2} (\theta^T x^{(i)} - y^{(i)})^2$. Let's start by calculating this partial derivative for the Linear Regression objective function.
22 Partial Derivatives for Linear Reg. Let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$ where $J^{(i)}(\theta) = \frac{1}{2} (\theta^T x^{(i)} - y^{(i)})^2$. Then: $\frac{d}{d\theta_k} J^{(i)}(\theta) = \frac{d}{d\theta_k} \frac{1}{2} (\theta^T x^{(i)} - y^{(i)})^2 = (\theta^T x^{(i)} - y^{(i)}) \, \frac{d}{d\theta_k} (\theta^T x^{(i)} - y^{(i)}) = (\theta^T x^{(i)} - y^{(i)}) \, \frac{d}{d\theta_k} \Big( \sum_{k'=1}^K \theta_{k'} x_{k'}^{(i)} - y^{(i)} \Big) = (\theta^T x^{(i)} - y^{(i)}) \, x_k^{(i)}$
23 Partial Derivatives for Linear Reg. Let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$ where $J^{(i)}(\theta) = \frac{1}{2} (\theta^T x^{(i)} - y^{(i)})^2$. Then: $\frac{d}{d\theta_k} J^{(i)}(\theta) = (\theta^T x^{(i)} - y^{(i)}) \, x_k^{(i)}$ (used by SGD, aka. LMS), and $\frac{d}{d\theta_k} J(\theta) = \sum_{i=1}^N \frac{d}{d\theta_k} J^{(i)}(\theta) = \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)}) \, x_k^{(i)}$ (used by Gradient Descent).
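A quick way to validate this derivation is to compare the analytic per-example gradient against central finite differences; this check is my own addition, not part of the slides:

import numpy as np

def per_example_grad(theta, x_i, y_i):
    # Analytic gradient of J^(i)(theta) = 0.5 * (theta^T x_i - y_i)^2
    return (theta @ x_i - y_i) * x_i

def numeric_grad(theta, x_i, y_i, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    J = lambda t: 0.5 * (t @ x_i - y_i) ** 2
    g = np.zeros_like(theta)
    for k in range(len(theta)):
        e = np.zeros_like(theta)
        e[k] = eps
        g[k] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
theta, x_i, y_i = rng.normal(size=3), rng.normal(size=3), 0.7
print(np.allclose(per_example_grad(theta, x_i, y_i),
                  numeric_grad(theta, x_i, y_i), atol=1e-5))  # True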
24 Least Mean Squares (LMS). Algorithm 3 Least Mean Squares (LMS): 1: procedure LMS($\mathcal{D}$, $\theta^{(0)}$); 2: $\theta \leftarrow \theta^{(0)}$; 3: while not converged do; 4: for $i \in \text{shuffle}(\{1, 2, \dots, N\})$ do; 5: for $k \in \{1, 2, \dots, K\}$ do; 6: $\theta_k \leftarrow \theta_k - \lambda (\theta^T x^{(i)} - y^{(i)}) \, x_k^{(i)}$; 7: return $\theta$. Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
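A compact LMS sketch following Algorithm 3, with the inner loop over k vectorized into a single numpy update (the epoch count and step size are illustrative choices of mine):

import numpy as np

def lms(X, y, lr=0.01, epochs=50, seed=0):
    # SGD for linear regression: one small step per shuffled example,
    # theta <- theta - lr * (theta^T x_i - y_i) * x_i  (all k at once).
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):     # shuffle({1, ..., N})
            theta -= lr * (theta @ X[i] - y[i]) * X[i]
    return theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, size=200)
print(lms(X, y))  # hovers near [1.0, 2.0]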
25 Optimization for Linear Reg. vs. Logistic Reg. Can use the same tricks for both: regularization; tuning the learning rate on development data; shuffling examples out-of-core (if they can't fit in memory) and streaming over them; local hill climbing yields the global optimum (both problems are convex); etc. But Logistic Regression does not have a closed-form solution for the MLE parameters. What about Linear Regression?
26 Least Squares. Learning: Three approaches to solving $\theta^* = \operatorname{argmin}_\theta J(\theta)$. Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Approach 3: Closed Form (set derivatives equal to zero and solve for parameters).
27 The normal equations. Write the cost function in matrix form: $J(\theta) = \frac{1}{2} \sum_{i=1}^n (x_i^T \theta - y_i)^2 = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2} \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)$, where $X$ stacks the inputs as rows, $X = [x_1^T; x_2^T; \dots; x_n^T]$, and $\vec{y} = [y_1, \dots, y_n]^T$. To minimize $J(\theta)$, take the derivative and set it to zero (using the trace identities on the next slide): $\nabla_\theta J = X^T X \theta - X^T \vec{y} = 0$. This gives the normal equations $X^T X \theta = X^T \vec{y}$, with solution $\theta^* = (X^T X)^{-1} X^T \vec{y}$.
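A numpy sketch of this solution on synthetic data of my choosing; numerically, a linear solve (or QR-based least squares) is preferable to forming the explicit inverse:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(0, 0.1, size=100)

theta = np.linalg.solve(X.T @ X, X.T @ y)         # normal equations X^T X theta = X^T y
theta_qr, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerically safer route
print(theta, theta_qr)  # both close to [0.5, 1.0, -2.0]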
28 Some matrix derivatives. For $f : \mathbb{R}^{m \times n} \to \mathbb{R}$, define $\nabla_A f(A)$ to be the $m \times n$ matrix with entries $[\nabla_A f(A)]_{ij} = \frac{\partial f}{\partial A_{ij}}$. Trace: $\operatorname{tr} A = \sum_{i} A_{ii}$ for square $A$; $\operatorname{tr} a = a$ for a scalar $a$; $\operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$. Some facts of matrix derivatives (without proof): $\nabla_A \operatorname{tr} AB = B^T$; $\nabla_A \operatorname{tr} ABA^T C = CAB + C^T A B^T$; $\nabla_A |A| = |A| (A^{-1})^T$.
29 Comments on the normal equation. In most situations of practical interest, the number of data points $N$ is larger than the dimensionality $k$ of the input space and the matrix $X$ is of full column rank. If this condition holds, then it is easy to verify that $X^T X$ is necessarily invertible. The assumption that $X^T X$ is invertible implies that it is positive definite, thus the critical point we have found is a minimum. What if $X$ has less than full column rank? Then we need regularization (later).
30 Direct and Iterative methods. Direct methods: we can achieve the solution in a single step by solving the normal equations. Using Gaussian elimination or QR decomposition, we converge in a finite number of steps. This can be infeasible when data are streaming in in real time, or when the data are extremely large. Iterative methods: stochastic or steepest gradient. They converge only in a limiting sense, but are more attractive in large practical problems. Caution is needed in deciding the learning rate $\alpha$.
31 Convergence rate. Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the step size $\alpha$ is sufficiently small. A formal analysis of LMS needs more mathematical muscle; in practice, one can use a small $\alpha$, or gradually decrease $\alpha$.
32 Convergence Curves. (Figure: training MSE vs. iteration for the batch and online updates.) For the batch method, the training MSE is initially large due to the uninformed initialization. In the online update, the $N$ updates per epoch reduce the MSE to a much smaller value.
33 Least Squares. Learning: Three approaches to solving $\theta^* = \operatorname{argmin}_\theta J(\theta)$. Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Pros: conceptually simple, guaranteed convergence. Cons: batch, often slow to converge. Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Pros: memory efficient, fast convergence, less prone to local optima. Cons: convergence in practice requires tuning and fancier variants. Approach 3: Closed Form (set derivatives equal to zero and solve for parameters). Pros: one-shot algorithm! Cons: does not scale to large datasets (the matrix inverse is the bottleneck).
34 Matching Game. Goal: Match the Algorithm to its Update Rule. Algorithms: 1. SGD for Logistic Regression, $h_\theta(x) = p(y \mid x)$. 2. Least Mean Squares, $h_\theta(x) = \theta^T x$. 3. Perceptron (next lecture), $h_\theta(x) = \operatorname{sgn}(\theta^T x)$. Update rules 4-6: each has the form $\theta_k \leftarrow \theta_k + \lambda \,(h_\theta(x^{(i)}) - y^{(i)})\, x_k^{(i)}$, one with an additional $\exp$ term (the exact rules did not survive transcription). Answer choices: A. 1=5, 2=4, 3=6; B. 1=5, 2=6, 3=4; C. 1=6, 2=4, 3=4; D. 1=5, 2=6, 3=6; E. 1=6, 2=6, 3=6.
35 Geometric Interpretation of LMS. The predictions on the training data are $\hat{\vec{y}} = X \theta^* = X (X^T X)^{-1} X^T \vec{y}$. Note that $\hat{\vec{y}} - \vec{y} = \left( X (X^T X)^{-1} X^T - I \right) \vec{y}$, and $X^T (\hat{\vec{y}} - \vec{y}) = X^T \left( X (X^T X)^{-1} X^T - I \right) \vec{y} = \left( X^T X (X^T X)^{-1} X^T - X^T \right) \vec{y} = 0$. So $\hat{\vec{y}}$ is the orthogonal projection of $\vec{y}$ onto the space spanned by the columns of $X$ (where $X$ stacks the $x_i^T$ as rows and $\vec{y} = [y_1, \dots, y_n]^T$).
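The orthogonality claim is easy to verify numerically; the random data here are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)

theta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta                          # y_hat = X (X^T X)^{-1} X^T y
print(np.allclose(X.T @ (y_hat - y), 0))   # True: residual is orthogonal to col(X)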
36 Probabilistic Interpretation of LMS. Let us assume that the target variable and the inputs are related by the equation $y_i = \theta^T x_i + \epsilon_i$, where $\epsilon$ is an error term of unmodeled effects or random noise. Now assume that $\epsilon$ follows a Gaussian $N(0, \sigma^2)$; then we have: $p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)$. By the independence assumption: $L(\theta) = \prod_{i=1}^n p(y_i \mid x_i; \theta) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\left( -\frac{\sum_{i=1}^n (y_i - \theta^T x_i)^2}{2\sigma^2} \right)$.
37 Probabilistic Interpretation of LMS, cont. Hence the log-likelihood is: $\ell(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \theta^T x_i)^2$. Do you recognize the last term? Yes, it is: $J(\theta) = \frac{1}{2} \sum_{i=1}^n (x_i^T \theta - y_i)^2$. Thus, under the independence assumption, LMS is equivalent to MLE of $\theta$!
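A tiny numeric illustration of this equivalence (all values here are synthetic, my own choices): the negative log-likelihood differs from $J(\theta)$ only by an additive constant and a positive scale, so the two share the same minimizer in $\theta$.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(0, 0.5, size=40)
theta = rng.normal(size=2)   # an arbitrary parameter setting
sigma = 0.5

nll = (len(y) * np.log(np.sqrt(2 * np.pi) * sigma)
       + np.sum((y - X @ theta) ** 2) / (2 * sigma ** 2))
J = 0.5 * np.sum((X @ theta - y) ** 2)
# NLL = const(n, sigma) + J / sigma^2, for every theta:
print(np.isclose(nll, len(y) * np.log(np.sqrt(2 * np.pi) * sigma) + J / sigma ** 2))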
38 Ridge Regression. Adds an L2 regularizer to Linear Regression: $J_{RR}(\theta) = J(\theta) + \lambda \|\theta\|_2^2 = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^K \theta_k^2$. Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters, $\theta_{MAP} = \operatorname{argmax}_\theta \sum_{i=1}^N \log p_\theta(y^{(i)} \mid x^{(i)}) + \log p(\theta)$, which equals $\operatorname{argmin}_\theta J_{RR}(\theta)$, where $p(\theta) \sim \mathcal{N}(0, \frac{1}{\lambda})$.
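Ridge regression keeps a closed form: the penalty just adds a multiple of the identity to $X^T X$. A sketch below; note it regularizes every coordinate, including the constant feature, to match the slide's sum over k (in practice the bias is often left unpenalized), and all data are illustrative:

import numpy as np

def ridge(X, y, lam):
    # For J_RR(theta) = 0.5 * ||X theta - y||^2 + lam * ||theta||^2, setting the
    # gradient to zero gives (X^T X + 2*lam*I) theta = X^T y. The added term also
    # repairs the rank-deficient case flagged on the normal-equations slide.
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + 2 * lam * np.eye(K), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + rng.normal(0, 0.1, size=50)
print(ridge(X, y, lam=1.0))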
39 LASSO. Adds an L1 regularizer to Linear Regression: $J_{LASSO}(\theta) = J(\theta) + \lambda \|\theta\|_1 = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^K |\theta_k|$. Bayesian interpretation: MAP estimation with a Laplace prior on the parameters, $\theta_{MAP} = \operatorname{argmax}_\theta \sum_{i=1}^N \log p_\theta(y^{(i)} \mid x^{(i)}) + \log p(\theta)$, which equals $\operatorname{argmin}_\theta J_{LASSO}(\theta)$, where $p(\theta) \sim \text{Laplace}(0, f(\lambda))$.
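LASSO has no closed form; one standard solver is coordinate descent with soft thresholding. This is a generic sketch of that idea (a common technique, but not an algorithm prescribed by the slides), on data with a sparse true weight vector:

import numpy as np

def soft_threshold(z, t):
    # Proximal map of the L1 penalty.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for 0.5 * ||X theta - y||^2 + lam * ||theta||_1.
    theta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(X.shape[1]):
            r = y - X @ theta + X[:, k] * theta[k]   # residual with feature k removed
            theta[k] = soft_threshold(X[:, k] @ r, lam) / col_sq[k]
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 0.8]                        # sparse ground truth
y = X @ w_true + rng.normal(0, 0.1, size=100)
print(np.round(lasso_cd(X, y, lam=5.0), 2))  # zeros outside the first three coordinates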
40 Ridge Regression vs Lasso. (Figure: level sets of $J(\beta)$, i.e. $\beta$s with constant $J(\beta)$, intersecting the constraint regions: $\beta$s with constant $\ell_2$ norm for ridge, $\beta$s with constant $\ell_1$ norm for lasso.) Lasso ($\ell_1$ penalty) results in sparse solutions: a vector with more zero coordinates. Good for high-dimensional problems: we don't have to store all coordinates!
41 Non-linear basis functions. So far we only used the observed values $x_1, x_2, \dots$. However, linear regression can be applied in the same way to functions of these values. E.g., to add a term $w\, x_1 x_2$, add a new variable $z = x_1 x_2$, so each example becomes: $x_1, x_2, \dots, z$. As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem: $y = w_0 + w_1 x_1 + \dots + w_k x_k + \epsilon$
42 Non-linear basis functions. What type of functions can we use? A few common examples: Polynomial: $\phi_j(x) = x^j$ for $j = 0, \dots, n$. Gaussian: $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$. Sigmoid: $\phi_j(x) = \frac{1}{1 + \exp(-s_j x)}$. Logs: $\phi_j(x) = \log(x + 1)$. Any function of the input values can be used; the solution for the parameters of the regression remains the same.
43 General linear regression problem. Using our new notation for the basis functions, linear regression can be written as $y = \sum_{j=0}^n w_j \phi_j(x)$, where $\phi_j(x)$ can be either $x_j$ for multivariate regression or one of the non-linear basis functions we defined, and $\phi_0(x) = 1$ for the intercept term.
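A sketch of this recipe with polynomial basis functions: build the design matrix of $\phi_j(x)$ values, then reuse plain least squares unchanged (the sine-shaped data are illustrative):

import numpy as np

def poly_features(x, degree):
    # Columns are phi_0(x)=1, phi_1(x)=x, ..., phi_degree(x)=x^degree.
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=50)

Phi = poly_features(x, degree=3)                # design matrix of basis values
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)     # ordinary least squares, as before
print(w)  # weights on 1, x, x^2, x^3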
44 An example: polynomial basis vectors on a small dataset. (Figures from Bishop, Ch. 1.)
45 0th Order Polynomial (n = 10)
46 1st Order Polynomial
47 3rd Order Polynomial
48 9th Order Polynomial
49 Over-fitting. Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2 E(\mathbf{w}^*) / N}$
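A small experiment in the spirit of these figures (the data sizes, noise level, and sin(2*pi*x) target mirror Bishop's setup, but the exact values are my own): training RMS keeps falling as the polynomial order M grows, while test RMS blows up at M = 9.

import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=n)

x_tr, y_tr = make_data(10)     # 10 training points, as in the n = 10 slides
x_te, y_te = make_data(100)

def rms(w, x, y):
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - y) ** 2))

for M in [0, 1, 3, 9]:
    Phi = np.vander(x_tr, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y_tr, rcond=None)
    print(M, rms(w, x_tr, y_tr), rms(w, x_te, y_te))
# Train RMS shrinks with M; at M = 9 the fit interpolates the 10 training
# points (train RMS near 0) while the test RMS becomes very large.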
50 Polynomial Coefficients. (Table of fitted coefficient values for each polynomial order, from Bishop.)
51 9th Order Polynomial, varying Data Set Size. (Figures: the 9th-order fit as the data set grows; over-fitting diminishes with more data.)
52 Regularization. Penalize large coefficient values: $J_\lambda(\mathbf{w}) = \frac{1}{2} \sum_i \Big( y_i - \sum_j w_j \phi_j(x_i) \Big)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
53 Regularization. (Figure: the 9th-order fit with a moderate regularizer; the wild oscillations are damped.)
54 Polynomial Coefficients. (Table: coefficient values for $\lambda$ = none, $\lambda = e^{-18}$, and huge $\lambda$; without regularization the 9th-order coefficients are enormous, and they shrink as $\lambda$ grows.)
55 Over-Regularization. (Figure: with too large a $\lambda$, the fit is overly smooth and underfits the data.)