School of Computer Science
10-601 Introduction to Machine Learning
Linear Regression
Readings: Bishop 3.1
Matt Gormley
Lecture 5
September 14, 2016
Homework 2: Reminders
- Extension: due Friday (9/16) at 5:30pm
- Recitation schedule posted on course website
Outline
- Linear Regression
  - Simple example
  - Model
  - Learning
    - Gradient Descent
    - SGD
    - Closed Form
  - Advanced Topics
    - Geometric and Probabilistic Interpretation of LMS
    - L2 Regularization
    - L1 Regularization
    - Features
Outline
- Linear Regression
  - Simple example
  - Model
  - Learning (aka. Least Squares)
    - Gradient Descent
    - SGD (aka. Least Mean Squares (LMS))
    - Closed Form (aka. Normal Equations)
  - Advanced Topics
    - Geometric and Probabilistic Interpretation of LMS
    - L2 Regularization (aka. Ridge Regression)
    - L1 Regularization (aka. LASSO)
    - Features (aka. non-linear basis functions)
Linear regression
Our goal is to estimate w from training data of <x_i, y_i> pairs:
y_i = w x_i + ε_i
Optimization goal: minimize squared error (least squares):
arg min_w Σ_i (y_i - w x_i)²
Why least squares?
- minimizes squared distance between measurements and predicted line (see HW)
- has a nice probabilistic interpretation
- the math is pretty
Solving linear regression
To optimize, use the closed form: we just take the derivative w.r.t. w and set it to 0:
d/dw Σ_i (y_i - w x_i)² = -2 Σ_i x_i (y_i - w x_i)
-2 Σ_i x_i (y_i - w x_i) = 0  ⇒  Σ_i x_i y_i = w Σ_i x_i²
⇒  w = Σ_i x_i y_i / Σ_i x_i²
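A minimal numpy sketch of this one-dimensional closed-form solution; the synthetic data, true slope, and noise level below are illustrative assumptions, mirroring the regression examples on the next slides:

```python
import numpy as np

# Synthetic 1-D data through the origin: y = 2*x + noise (w = 2 and std = 1 are demo choices)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x + rng.normal(scale=1.0, size=100)

# Closed form for a line through the origin: w = sum(x_i * y_i) / sum(x_i^2)
w_hat = np.sum(x * y) / np.sum(x * x)
print(f"recovered w = {w_hat:.3f}")
```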
Linear regression
Given an input x we would like to compute an output y. In linear regression we assume that y and x are related by the following equation:
y = w x + ε
where w is a parameter and ε represents measurement or other noise. (In the figure, Y is what we are trying to predict and X are the observed values.)
Regression example: Generated with w = 2; Recovered: w = 2.03; Noise: std = 1
Regression example: Generated with w = 2; Recovered: w = 2.05; Noise: std = 2
Regression example: Generated with w = 2; Recovered: w = 2.08; Noise: std = 4
Bias term
So far we assumed that the line passes through the origin. What if the line does not? No problem, simply change the model to
y = w_0 + w_1 x + ε
Can use least squares to determine w_0, w_1:
w_0 = ( Σ_i y_i - w_1 Σ_i x_i ) / n
w_1 = Σ_i x_i (y_i - w_0) / Σ_i x_i²
Linear Regression
Data: Inputs are continuous vectors of length K. Outputs are continuous scalars.
D = {x^(i), y^(i)}_{i=1}^N where x ∈ R^K and y ∈ R
What are some example problems of this form?
Linear Regression
Data: Inputs are continuous vectors of length K. Outputs are continuous scalars.
D = {x^(i), y^(i)}_{i=1}^N where x ∈ R^K and y ∈ R
Prediction: Output is a linear function of the inputs.
ŷ = h_θ(x) = θ_1 x_1 + θ_2 x_2 + ... + θ_K x_K = θ^T x
(We assume x_1 is 1.)
Learning: finds the parameters that minimize some objective function:
θ* = argmin_θ J(θ)
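A small numpy sketch of this prediction rule; the feature values and parameter vector are made up for illustration, and the leading column of ones plays the role of the assumption x_1 = 1:

```python
import numpy as np

# N = 4 examples with K-1 = 2 "real" features; prepend a ones column so that x_1 = 1
X_raw = np.array([[0.5, 1.2],
                  [2.0, -0.3],
                  [1.1, 0.7],
                  [-0.4, 2.5]])
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # shape (N, K) with K = 3

theta = np.array([0.1, 2.0, -1.5])   # hypothetical parameters

y_hat = X @ theta                    # y_hat[i] = theta^T x^(i)
print(y_hat)
```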
Least Squares
Learning: finds the parameters that minimize some objective function:
θ* = argmin_θ J(θ)
We minimize the sum of the squares:
J(θ) = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))²
Why?
1. Reduces distance between true measurements and predicted hyperplane (line in 1D).
2. Has a nice probabilistic interpretation.
Least Squares
Learning: finds the parameters that minimize some objective function:
θ* = argmin_θ J(θ)
We minimize the sum of the squares:
J(θ) = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))²
Why?
1. Reduces distance between true measurements and predicted hyperplane (line in 1D).
2. Has a nice probabilistic interpretation.
(This is a very general optimization setup. We could solve it in lots of ways. Today, we'll consider three ways.)
Least Squares
Learning: Three approaches to solving θ* = argmin_θ J(θ)
- Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient)
- Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
- Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)
Gradient Descent

Algorithm 1: Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ - λ ∇_θ J(θ)
5:   return θ

In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. the vector of partial derivatives):
∇_θ J(θ) = [ dJ(θ)/dθ_1, dJ(θ)/dθ_2, ..., dJ(θ)/dθ_K ]^T
Gradient Descent

Algorithm 1: Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ - λ ∇_θ J(θ)
5:   return θ

There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient, ‖∇_θ J(θ)‖₂, is below some small tolerance. Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.
Stochastic Gradient Descent (SGD)

Algorithm 2: Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, ..., N}) do
5:       θ ← θ - λ ∇_θ J^(i)(θ)
6:   return θ

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ) where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))².
Stochastic Gradient Descent (SGD)

Algorithm 2: Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, ..., N}) do
5:       for k ∈ {1, 2, ..., K} do
6:         θ_k ← θ_k - λ (d/dθ_k) J^(i)(θ)
7:   return θ

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ) where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))².
Stochastic Gradient Descent (SGD)

Algorithm 2: Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, ..., N}) do
5:       for k ∈ {1, 2, ..., K} do
6:         θ_k ← θ_k - λ (d/dθ_k) J^(i)(θ)
7:   return θ

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ) where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))².
Let's start by calculating this partial derivative for the Linear Regression objective function.
Partial Derivatives for Linear Reg.
Let J(θ) = Σ_{i=1}^N J^(i)(θ) where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))².
d/dθ_k J^(i)(θ) = d/dθ_k [ (1/2)(θ^T x^(i) - y^(i))² ]
  = (1/2) d/dθ_k (θ^T x^(i) - y^(i))²
  = (θ^T x^(i) - y^(i)) d/dθ_k (θ^T x^(i) - y^(i))
  = (θ^T x^(i) - y^(i)) d/dθ_k ( Σ_{k'=1}^K θ_{k'} x_{k'}^(i) - y^(i) )
  = (θ^T x^(i) - y^(i)) x_k^(i)
Partial Derivatives for Linear Reg.
Let J(θ) = Σ_{i=1}^N J^(i)(θ) where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))².
d/dθ_k J^(i)(θ) = (θ^T x^(i) - y^(i)) x_k^(i)    (used by SGD, aka. LMS)
d/dθ_k J(θ) = Σ_{i=1}^N d/dθ_k J^(i)(θ) = Σ_{i=1}^N (θ^T x^(i) - y^(i)) x_k^(i)    (used by Gradient Descent)
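A minimal batch gradient descent sketch built on this gradient; the learning rate, iteration count, and synthetic data are illustrative assumptions rather than values from the lecture:

```python
import numpy as np

def gd_linear_regression(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent for J(theta) = 1/2 * sum_i (theta^T x^(i) - y^(i))^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = X @ theta - y      # (theta^T x^(i) - y^(i)) for every i
        grad = X.T @ residual         # dJ/dtheta_k = sum_i residual_i * x_k^(i)
        theta -= lr * grad            # step opposite the gradient
    return theta

# Tiny synthetic check: y = 1 + 3*x plus noise
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.1, size=50)
print(gd_linear_regression(X, y))     # should be close to [1.0, 3.0]
```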
Least Mean Squares (LMS)

Algorithm 3: Least Mean Squares (LMS)
1: procedure LMS(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, ..., N}) do
5:       for k ∈ {1, 2, ..., K} do
6:         θ_k ← θ_k - λ (θ^T x^(i) - y^(i)) x_k^(i)
7:   return θ

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
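A corresponding SGD/LMS sketch, one update per training example; the learning rate, epoch count, and data are again illustrative assumptions:

```python
import numpy as np

def lms(X, y, lr=0.05, n_epochs=50, seed=0):
    """SGD for least squares (LMS): update theta after each shuffled example."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):      # shuffle({1, 2, ..., N})
            residual = X[i] @ theta - y[i]     # theta^T x^(i) - y^(i)
            theta -= lr * residual * X[i]      # updates every coordinate k at once
    return theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.1, size=50)
print(lms(X, y))                               # approximately [1.0, 3.0]
```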
Optimization for Linear Reg. vs. Logistic Reg.
Can use the same tricks for both:
- regularization
- tuning the learning rate on development data
- shuffle examples out-of-core (if they can't fit in memory) and stream over them
- local hill climbing yields the global optimum (both problems are convex)
- etc.
But Logistic Regression does not have a closed-form solution for the MLE parameters. What about Linear Regression?
Least Squares
Learning: Three approaches to solving θ* = argmin_θ J(θ)
- Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient)
- Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
- Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)
The normal equations
Write the cost function in matrix form:
J(θ) = (1/2) Σ_{i=1}^n (x_i^T θ - y_i)²
     = (1/2) (Xθ - y)^T (Xθ - y)
     = (1/2) (θ^T X^T X θ - θ^T X^T y - y^T X θ + y^T y)
where X is the design matrix whose i-th row is x_i^T and y = (y_1, ..., y_n)^T.
To minimize J(θ), take the derivative and set it to zero:
∇_θ J = (1/2) ∇_θ tr(θ^T X^T X θ - θ^T X^T y - y^T X θ + y^T y)
      = (1/2) (X^T X θ + X^T X θ - 2 X^T y)
      = X^T X θ - X^T y = 0
This gives the normal equations X^T X θ = X^T y, so θ* = (X^T X)^{-1} X^T y.
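A minimal numpy sketch of this closed-form solution; numerically it is better to solve the normal equations with np.linalg.solve (or use np.linalg.lstsq) than to form the inverse explicitly. The data here are synthetic:

```python
import numpy as np

def normal_equations(X, y):
    """Solve X^T X theta = X^T y for theta (assumes X has full column rank)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)

print(normal_equations(X, y))                   # close to [0.5, 2.0, -1.0]
print(np.linalg.lstsq(X, y, rcond=None)[0])     # same answer via a least-squares routine
```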
Some matrix derivatives
For f : R^{m×n} → R, define:
∇_A f(A) = [ ∂f/∂A_{11} ... ∂f/∂A_{1n} ; ... ; ∂f/∂A_{m1} ... ∂f/∂A_{mn} ]
Trace: tr A = Σ_{i=1}^n A_{ii},  tr a = a (for a scalar a),  tr ABC = tr CAB = tr BCA
Some facts of matrix derivatives (without proof):
∇_A tr AB = B^T
∇_A tr ABA^T C = CAB + C^T AB^T
∇_A |A| = |A| (A^{-1})^T
Comments on the normal equation
- In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that X^T X is necessarily invertible.
- The assumption that X^T X is invertible implies that it is positive definite, thus the critical point we have found is a minimum.
- What if X has less than full column rank? → regularization (later).
Direct and Iterative methods
- Direct methods: we can achieve the solution in a single step by solving the normal equation
  - Using Gaussian elimination or QR decomposition, we converge in a finite number of steps
  - It can be infeasible when data are streaming in in real time, or are of very large amount
- Iterative methods: stochastic or steepest gradient
  - Converging in a limiting sense
  - But more attractive in large practical problems
  - Caution is needed for deciding the learning rate α
Convergence rate
- Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the learning rate α is sufficiently small.
- A formal analysis of LMS needs more mathematical muscle; in practice, one can use a small α, or gradually decrease α.
Convergence Curves
- For the batch method, the training MSE is initially large due to the uninformed initialization.
- In the online update, the N updates in every epoch reduce the MSE to a much smaller value.
Least Squares
Learning: Three approaches to solving θ* = argmin_θ J(θ)
- Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient)
  pros: conceptually simple, guaranteed convergence
  cons: batch, often slow to converge
- Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
  pros: memory efficient, fast convergence, less prone to local optima
  cons: convergence in practice requires tuning and fancier variants
- Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)
  pros: one shot algorithm!
  cons: does not scale to large datasets (matrix inverse is bottleneck)
Matching Game
Goal: Match the Algorithm to its Update Rule

1. SGD for Logistic Regression:  h_θ(x) = p(y|x)
2. Least Mean Squares:           h_θ(x) = θ^T x
3. Perceptron (next lecture):    h_θ(x) = sgn(θ^T x)

4. θ_k ← θ_k + λ (h_θ(x^(i)) - y^(i))
5. θ_k ← θ_k + λ (h_θ(x^(i)) - y^(i)) / (1 + exp(θ^T x^(i)))
6. θ_k ← θ_k + λ (h_θ(x^(i)) - y^(i)) x_k^(i)

A. 1=5, 2=4, 3=6
B. 1=5, 2=6, 3=4
C. 1=6, 2=4, 3=4
D. 1=5, 2=6, 3=6
E. 1=6, 2=6, 3=6
Geometric Interpretation of LMS
The predictions on the training data are:
ŷ = X θ* = X (X^T X)^{-1} X^T y
Note that
ŷ - y = ( X (X^T X)^{-1} X^T - I ) y
and
X^T (ŷ - y) = X^T ( X (X^T X)^{-1} X^T - I ) y = ( X^T X (X^T X)^{-1} X^T - X^T ) y = 0
So ŷ is the orthogonal projection of y into the space spanned by the columns of X, where X is the matrix whose rows are x_1^T, ..., x_n^T and y = (y_1, ..., y_n)^T.
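A quick numpy check of this orthogonality property on made-up data; the residual ŷ - y should be orthogonal to every column of X up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta_star

print(X.T @ (y_hat - y))   # approximately [0, 0, 0]
```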
Probabilistic Interpretation of LMS
Let us assume that the target variable and the inputs are related by the equation:
y_i = θ^T x_i + ε_i
where ε_i is an error term of unmodeled effects or random noise.
Now assume that ε_i follows a Gaussian N(0, σ²); then we have:
p(y_i | x_i; θ) = (1 / (√(2π) σ)) exp( -(y_i - θ^T x_i)² / (2σ²) )
By the independence assumption:
L(θ) = Π_{i=1}^n p(y_i | x_i; θ) = (1 / (√(2π) σ))^n exp( -Σ_{i=1}^n (y_i - θ^T x_i)² / (2σ²) )
Probabilistic Interpretation of LMS, cont.
Hence the log-likelihood is:
ℓ(θ) = n log(1 / (√(2π) σ)) - (1/σ²) · (1/2) Σ_{i=1}^n (y_i - θ^T x_i)²
Do you recognize the last term? Yes, it is:
J(θ) = (1/2) Σ_{i=1}^n (θ^T x_i - y_i)²
Thus, under the independence assumption, LMS is equivalent to MLE of θ!
Ridge Regression
Adds an L2 regularizer to Linear Regression:
J_RR(θ) = J(θ) + λ‖θ‖₂² = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))² + λ Σ_{k=1}^K θ_k²
Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters:
θ_MAP = argmax_θ [ Σ_{i=1}^N log p_θ(y^(i) | x^(i)) + log p(θ) ] = argmin_θ J_RR(θ)
where p(θ) ~ N(0, 1/λ)
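A minimal sketch of ridge regression in closed form. With J_RR as written above, setting the gradient to zero gives (X^T X + 2λI)θ = X^T y; the regularization strength lam and the data below are illustrative assumptions (and this version also penalizes the intercept weight, which one often excludes in practice):

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Minimize 1/2 * ||X theta - y||^2 + lam * ||theta||^2 via its closed form."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(K), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=30)
print(ridge(X, y, lam=0.1))
```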
LASSO
Adds an L1 regularizer to Linear Regression:
J_LASSO(θ) = J(θ) + λ‖θ‖₁ = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))² + λ Σ_{k=1}^K |θ_k|
Bayesian interpretation: MAP estimation with a Laplace prior on the parameters:
θ_MAP = argmax_θ [ Σ_{i=1}^N log p_θ(y^(i) | x^(i)) + log p(θ) ] = argmin_θ J_LASSO(θ)
where p(θ) ~ Laplace(0, f(λ))
Ridge Regression vs Lasso
- Ridge Regression: the level sets of J(β) (βs with constant J(β)) meet the L2 ball (βs with constant ‖β‖₂).
- Lasso: the same level sets meet the L1 ball (βs with constant ‖β‖₁).
- Lasso (L1 penalty) results in sparse solutions: vectors with more zero coordinates.
- Good for high-dimensional problems: don't have to store all coordinates!
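The sparsity just described can be seen directly in a short coordinate-descent sketch with soft thresholding; LASSO has no closed form, this is only one of several standard ways to optimize the L1-penalized objective, and the data, lam, and iteration count below are illustrative assumptions:

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam=0.1, n_iters=200):
    """Coordinate descent for 1/2 * ||X theta - y||^2 + lam * ||theta||_1."""
    theta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)                  # precompute x_k^T x_k for each column
    for _ in range(n_iters):
        for k in range(X.shape[1]):
            # Partial residual: remove coordinate k's current contribution
            r_k = y - X @ theta + X[:, k] * theta[k]
            rho = X[:, k] @ r_k
            theta[k] = soft_threshold(rho, lam) / col_sq[k]
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
true_theta = np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0.7])   # sparse ground truth
y = X @ true_theta + rng.normal(scale=0.1, size=50)
print(lasso_cd(X, y, lam=5.0))   # many coordinates are driven exactly to zero
```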
Non-Linear basis function
- So far we only used the observed values x_1, x_2, ...
- However, linear regression can be applied in the same way to functions of these values.
- E.g.: to add a term w x_1 x_2, add a new variable z = x_1 x_2, so each example becomes: x_1, x_2, ..., z.
- As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multi-variate linear regression problem:
  y = w_0 + w_1 x_1 + ... + w_k x_k + ε
Non-linear basis functions
What type of functions can we use? A few common examples:
- Polynomial: φ_j(x) = x^j for j = 0 ... n
- Gaussian: φ_j(x) = (x - μ_j)² / (2σ_j²)
- Sigmoid: φ_j(x) = 1 / (1 + exp(-s_j x))
- Logs: φ_j(x) = log(x + 1)
Any function of the input values can be used. The solution for the parameters of the regression remains the same.
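A short sketch of fitting a non-linear target with a linear model by expanding the input through basis functions; the polynomial degree and data are illustrative assumptions:

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a 1-D input to the basis [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=30)   # non-linear target

Phi = polynomial_features(x, degree=3)         # design matrix of basis functions
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)    # ordinary least squares, exactly as before
print(w)                                       # weights on 1, x, x^2, x^3
```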
General linear regression problem
Using our new notation for the basis functions, linear regression can be written as
y = Σ_{j=0}^n w_j φ_j(x)
where φ_j(x) can be either x_j for multivariate regression or one of the non-linear basis functions we defined, and φ_0(x) = 1 for the intercept term.
An example: polynomial basis vectors on a small dataset (from Bishop Ch 1)
0th Order Polynomial (n = 10)
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Root-Mean-Square (RMS) Error: E_RMS = sqrt(2 E(w*) / N)
Polynomial Coefficients (table from Bishop Ch 1)
9th Order Polynomial: Data Set Size (figures from Bishop Ch 1)
Regularization
Penalize large coefficient values:
J_{X,y}(w) = (1/2) Σ_i ( Σ_j w_j φ_j(x_i) - y_i )² + (λ/2) ‖w‖²
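A small sketch contrasting an unregularized and a regularized 9th-order polynomial fit, in the spirit of the Bishop figures on these slides; the data set size, noise level, and λ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 10))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)

Phi = np.vander(x, N=10, increasing=True)            # 9th-order polynomial features

# Unregularized least squares: the coefficients blow up and the curve over-fits
w_ols = np.linalg.lstsq(Phi, y, rcond=None)[0]

# Ridge-regularized fit: penalizing ||w||^2 keeps the coefficients small
lam = 1e-3
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)

print(np.abs(w_ols).max(), np.abs(w_ridge).max())    # ridge coefficients are much smaller
```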
Regularization: (regularized 9th-order fit; figure from Bishop Ch 1)
Polynomial Coefficients, with and without regularization (table from Bishop Ch 1)
Over Regularization: (figure from Bishop Ch 1)