Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso

Transcription:

Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso

Recap: Complete Pipeline.
Training Data: S = {(x_i, y_i)}
Model Class(es): f(x | w, b) = w^T x - b
Loss Function: L(a, b) = (a - b)^2
Learning: argmin_{w,b} Σ_i L(y_i, f(x_i | w, b)) via SGD!
Cross Validation & Model Selection. Profit!

Different Model Classes?
Option 1: SVMs vs ANNs vs LR vs LS.
Option 2: Regularization: argmin_{w,b} Σ_i L(y_i, f(x_i | w, b)) via SGD! Cross Validation & Model Selection.

Notation Part 1.
L0 Norm (not actually a norm): number of non-zero entries, ||w||_0 = Σ_d 1[w_d ≠ 0]
L1 Norm: sum of absolute values, ||w||_1 = Σ_d |w_d|
L2 Norm: sqrt(sum of squares), ||w||_2 = sqrt(Σ_d w_d^2) = sqrt(w^T w)
Squared L2 Norm: sum of squares, ||w||_2^2 = Σ_d w_d^2 = w^T w
L-infinity Norm: max absolute value, ||w||_∞ = lim_{p→∞} (Σ_d |w_d|^p)^{1/p} = max_d |w_d|
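A minimal NumPy sketch of these norms (the vector w below is an arbitrary made-up example, not from the slides):

```python
import numpy as np

w = np.array([0.0, -3.0, 0.5, 0.0, 2.0])   # arbitrary example vector

l0 = np.count_nonzero(w)                   # L0 "norm": number of non-zero entries
l1 = np.sum(np.abs(w))                     # L1 norm: sum of absolute values
l2 = np.sqrt(np.sum(w ** 2))               # L2 norm: sqrt of sum of squares (= sqrt(w @ w))
l2_sq = np.sum(w ** 2)                     # squared L2 norm: w @ w
linf = np.max(np.abs(w))                   # L-infinity norm: max absolute value

print(l0, l1, l2, l2_sq, linf)             # 3, 5.5, ~3.64, 13.25, 3.0
```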

Notation Part 2. Minimizing Squared Loss = Regression = Least-Squares: argmin_{w,b} Σ_i (y_i - w^T x_i + b)^2 (unless otherwise stated). E.g., Logistic Regression = Log Loss.

Ridge Regression, aka L2-Regularized Regression: argmin_{w,b} λ w^T w + Σ_i (y_i - w^T x_i + b)^2 (Regularization term + Training Loss). Trades off model complexity vs training loss. Each choice of λ is a model class. Will discuss this further later.
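A minimal sketch of ridge regression in closed form, assuming synthetic data; the bias b is absorbed as an extra constant feature and left unpenalized, and the choice lam = 1.0 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)   # synthetic linear data + noise

Xb = np.hstack([X, np.ones((N, 1))])    # constant column plays the role of the bias b
lam = 1.0                               # regularization strength lambda

# Closed form: w = (lam*I + X^T X)^{-1} X^T y, with the bias column left unpenalized
reg = lam * np.eye(D + 1)
reg[-1, -1] = 0.0                       # do not regularize the bias term
w = np.linalg.solve(reg + Xb.T @ Xb, Xb.T @ y)
print(w)
```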

Example: features x = (1[age>10], 1[gender=male]), label y = 1 if height > 55", 0 if height ≤ 55". Objective: argmin_{w,b} λ w^T w + Σ_i (y_i - w^T x_i + b)^2.
[Figure: Training Loss and Test Loss vs lambda, for lambda from 10^-2 to 10^1.]
Learned (w_1, w_2, b) as lambda grows:
(0.7401, 0.2441, -0.1745), (0.7122, 0.2277, -0.1967), (0.6197, 0.1765, -0.2686), (0.4124, 0.0817, -0.4196), (0.1801, 0.0161, -0.5686), (0.0001, 0.0000, -0.6666).
Data (split into Train and Test):
Person | Age>10 | Male? | Height>55"
Alice  | 1 | 0 | 1
Bob    | 0 | 1 | 0
Carol  | 0 | 0 | 0
Dave   | 1 | 1 | 1
Erin   | 1 | 0 | 1
Frank  | 0 | 1 | 1
Gena   | 0 | 0 | 0
Harold | 1 | 1 | 1
Irene  | 1 | 0 | 0
John   | 0 | 1 | 1
Kelly  | 1 | 0 | 1
Larry  | 1 | 1 | 1

Updated Pipeline.
Training Data: S = {(x_i, y_i)}
Model Class: f(x | w, b) = w^T x - b
Loss Function: L(a, b) = (a - b)^2
Learning: argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b)), choosing λ via Cross Validation & Model Selection. Profit!
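A minimal sketch of choosing λ by K-fold cross validation; the synthetic data, the grid of lambdas, and K = 5 are all assumptions, and the bias is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 100, 10, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.5 * rng.normal(size=N)

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares (bias omitted for brevity)."""
    return np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)

folds = np.array_split(rng.permutation(N), K)
best_lam, best_loss = None, np.inf
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    losses = []
    for k in range(K):
        val = folds[k]
        trn = np.setdiff1d(np.arange(N), val)
        w = ridge_fit(X[trn], y[trn], lam)
        losses.append(np.mean((y[val] - X[val] @ w) ** 2))   # held-out squared loss
    if np.mean(losses) < best_loss:
        best_lam, best_loss = lam, np.mean(losses)
print("chosen lambda:", best_lam)
```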

Model scores f(x) per person as lambda increases (same Train/Test split); the best test error occurs at an intermediate lambda.
Person | Age>10 | Male? | Height>55" | Scores for increasing lambda
Alice  | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.75, 0.67
Bob    | 0 | 1 | 0 | 0.42, 0.45, 0.50, 0.58, 0.67
Carol  | 0 | 0 | 0 | 0.17, 0.26, 0.42, 0.50, 0.67
Dave   | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67
Erin   | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67
Frank  | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67
Gena   | 0 | 0 | 0 | 0.17, 0.27, 0.42, 0.50, 0.67
Harold | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67
Irene  | 1 | 0 | 0 | 0.91, 0.89, 0.83, 0.79, 0.67
John   | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67
Kelly  | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67
Larry  | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67

Choice of Lambda Depends on Training Size. [Figure: four panels of test loss vs lambda, for 25, 50, 75, and 100 training points.] Setup: 25-dimensional space, randomly generated linear response function + noise.

Recap: Ridge Regularization. Ridge Regression = L2-Regularized Least-Squares: argmin_{w,b} λ w^T w + Σ_i (y_i - w^T x_i + b)^2. Large λ → more stable predictions, less likely to overfit the training data. Too large λ → underfit. Works with other losses: Hinge Loss, Log Loss, etc.

Aside: Stochastic Gradient Descent. Objective: argmin_{w,b} J(w, b) = λ ||w||^2 + Σ_i L(y_i, f(x_i | w, b)). Define the per-example objective J_i(w, b) = (1/N) λ ||w||^2 + L(y_i, f(x_i | w, b)), so that (1/N) ∇J(w, b) = E_i[ ∇J_i(w, b) ]. Do SGD on this.
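A minimal SGD sketch for this objective, with each sampled example carrying a (1/N)·λ share of the regularizer as on the slide; the synthetic data, step size, and epoch count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 200, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

lam, eta, epochs = 1.0, 0.01, 50
w, b = np.zeros(D), 0.0
for _ in range(epochs):
    for i in rng.permutation(N):
        r = y[i] - (X[i] @ w - b)                        # residual for f(x) = w^T x - b
        grad_w = (2.0 / N) * lam * w - 2.0 * r * X[i]    # per-example gradient incl. (1/N)*lam*||w||^2
        grad_b = 2.0 * r                                 # d/db of (y - w^T x + b)^2
        w -= eta * grad_w
        b -= eta * grad_b
print(w, b)
```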

Model Class Interpretation. argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b)). This is not a model class! At least not what we've discussed... it is an optimization procedure. Is there a connection?

Norm Constrained Model Class: f(x | w, b) = w^T x - b, subject to w^T w ≤ c. [Figure: visualization of the L2 norm balls w^T w ≤ c for c = 1, 2, 3.] Seems to correspond to λ in argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b)). [Figure: norm of the learned w vs lambda.]

Lagrange Multipliers. argmin_w L(y, w), where L(y, w) = (y - w^T x)^2, subject to w^T w ≤ c. Optimality condition: gradients aligned at the constraint boundary! ∃ λ ≥ 0 : (∇_w L(y, w) = -λ ∇_w(w^T w)) ∧ (w^T w ≤ c). (Omitting b & using 1 training data point for simplicity.) http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm Constrained Model Class. Training: argmin_w L(y, w) = (y - w^T x)^2 subject to w^T w ≤ c. Two conditions must be satisfied at optimality: ∃ λ ≥ 0 : (∇_w L(y, w) = -λ ∇_w(w^T w)) ∧ (w^T w ≤ c). Lagrangian: argmin_w max_{λ≥0} Λ(w, λ) = (y - w^T x)^2 + λ (w^T w - c). Claim: solving the Lagrangian solves the norm-constrained training problem. Optimality implication of the Lagrangian: ∇_w Λ(w, λ) = -2 x (y - w^T x) + 2 λ w ≡ 0 → -2 x (y - w^T x) = -2 λ w, which satisfies the first condition! (Omitting b & using 1 training data point for simplicity.) http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm Constrained Model Class (continued). Training: argmin_w L(y, w) = (y - w^T x)^2 subject to w^T w ≤ c. Two conditions must be satisfied at optimality: ∃ λ ≥ 0 : (∇_w L(y, w) = -λ ∇_w(w^T w)) ∧ (w^T w ≤ c). Lagrangian: argmin_w max_{λ≥0} Λ(w, λ) = (y - w^T x)^2 + λ (w^T w - c). Claim: solving the Lagrangian solves the norm-constrained training problem. Optimality implication of the Lagrangian for λ: ∂_λ Λ(w, λ) = 0 if w^T w < c (the optimal λ is 0), and = w^T w - c if w^T w ≥ c; setting this ≡ 0 forces w^T w ≤ c, which satisfies the second condition! (Omitting b & using 1 training data point for simplicity.) http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm Constrained Model Class. Training: argmin_w L(y, w) = (y - w^T x)^2 subject to w^T w ≤ c. L2 Regularized Training: argmin_w λ w^T w + (y - w^T x)^2. Lagrangian: argmin_w max_{λ≥0} Λ(w, λ) = (y - w^T x)^2 + λ (w^T w - c). Lagrangian = Norm Constrained Training: ∃ λ ≥ 0 : (∇_w L(y, w) = -λ ∇_w(w^T w)) ∧ (w^T w ≤ c). Lagrangian = L2 Regularized Training: hold λ fixed, and the λ (w^T w - c) term is λ w^T w plus a constant. So L2 regularized training is equivalent to solving the norm-constrained problem, for some c. (Omitting b & using 1 training data point for simplicity.) http://en.wikipedia.org/wiki/Lagrange_multiplier
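A quick numeric sketch of the correspondence: each λ in the regularized problem implicitly picks out a norm budget c = w(λ)^T w(λ) for the constrained problem (synthetic data, bias omitted):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 50, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # ridge solution for this lambda
    w = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)
    # the implied norm budget c shrinks as lambda grows
    print(f"lambda={lam:7.2f}  implied c = w^T w = {w @ w:.4f}")
```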

Recap 2: Ridge Regularization. Ridge Regression = L2 Regularized Least-Squares = Norm Constrained Model: argmin_{w,b} λ w^T w + L(w) ⇔ argmin_{w,b} L(w) subject to w^T w ≤ c. Large λ → more stable predictions, less likely to overfit the training data. Too large λ → underfit. Works with other losses: Hinge Loss, Log Loss, etc.

Hallucinating Data Points. Ridge objective: argmin_w λ w^T w + Σ_i (y_i - w^T x_i)^2, with gradient 2 λ w - Σ_i 2 x_i (y_i - w^T x_i). What if instead we hallucinate D extra data points {(sqrt(λ) e_d, 0)}_{d=1}^D, where e_d = (0, ..., 0, 1, 0, ..., 0) is the unit vector along the d-th dimension? Unregularized objective: Σ_{d=1}^D (0 - sqrt(λ) e_d^T w)^2 + Σ_i (y_i - w^T x_i)^2. The gradient of the hallucinated term is Σ_d 2 λ e_d (e_d^T w) = 2 λ w, identical to regularization! (Omitting b for simplicity.)
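A minimal check of this trick on synthetic data: appending the D hallucinated rows sqrt(λ)·e_d with target 0 to a plain least-squares problem reproduces the ridge solution (bias omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, lam = 30, 4, 2.0
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# Ridge solution: argmin_w lam * w^T w + ||y - X w||^2
w_ridge = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

# Plain least squares on the augmented data {(sqrt(lam) e_d, 0)}
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
y_aug = np.concatenate([y, np.zeros(D)])
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_aug))   # True: identical to regularization
```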

Extension: Multi-task Learning. Two prediction tasks: a spam filter for Alice and a spam filter for Bob. Limited training data for both... but Alice is similar to Bob.

Extension: Multi-task Learning. Two training sets, both relatively small: S^(1) = {(x_i^(1), y_i^(1))}, S^(2) = {(x_i^(2), y_i^(2))}. Option 1: Train Separately: argmin_w λ w^T w + Σ_i (y_i^(1) - w^T x_i^(1))^2 and argmin_v λ v^T v + Σ_i (y_i^(2) - v^T x_i^(2))^2. Both models have high error. (Omitting b for simplicity.)

Extension: Multi-task Learning. Two training sets, both relatively small: S^(1) = {(x_i^(1), y_i^(1))}, S^(2) = {(x_i^(2), y_i^(2))}. Option 2: Train Jointly: argmin_{w,v} λ w^T w + Σ_i (y_i^(1) - w^T x_i^(1))^2 + λ v^T v + Σ_i (y_i^(2) - v^T x_i^(2))^2. Doesn't accomplish anything! (w & v don't depend on each other.) (Omitting b for simplicity.)

Multi-task Regularization: argmin_{w,v} λ w^T w + λ v^T v + γ (w - v)^T (w - v) + Σ_i (y_i^(1) - w^T x_i^(1))^2 + Σ_i (y_i^(2) - v^T x_i^(2))^2, i.e. standard regularization + multi-task regularization + training loss. Prefer w & v to be close, controlled by γ. Tasks similar → larger γ helps! Tasks not identical → γ not too large. [Figure: Test Loss (Task 2) vs gamma.]
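A minimal gradient-descent sketch of this joint objective; the two synthetic tasks are built to be similar but not identical, and λ, γ, the step size, and iteration count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
D, N1, N2 = 10, 15, 15
w_true = rng.normal(size=D)
X1, X2 = rng.normal(size=(N1, D)), rng.normal(size=(N2, D))
y1 = X1 @ w_true + 0.1 * rng.normal(size=N1)                               # task 1 (e.g., Alice)
y2 = X2 @ (w_true + 0.1 * rng.normal(size=D)) + 0.1 * rng.normal(size=N2)  # task 2, similar to task 1

lam, gamma, eta = 1.0, 5.0, 0.005
w, v = np.zeros(D), np.zeros(D)
for _ in range(2000):
    # gradients of: lam*w^Tw + lam*v^Tv + gamma*(w-v)^T(w-v) + squared losses
    grad_w = 2 * lam * w + 2 * gamma * (w - v) - 2 * X1.T @ (y1 - X1 @ w)
    grad_v = 2 * lam * v - 2 * gamma * (w - v) - 2 * X2.T @ (y2 - X2 @ v)
    w, v = w - eta * grad_w, v - eta * grad_v
print(np.linalg.norm(w - v))   # small: the coupling term keeps the two tasks close
```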

Lasso: L1-Regularized Least-Squares

L1 Regularized Least Squares. L2: argmin_w λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2, vs. L1: argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2. [Figure: the penalty functions w_d^2 and |w_d| plotted over w_d ∈ [-2, 2], with a comparison of the two norms on example weight vectors.] (Omitting b for simplicity.)

Aside: Subgradient (sub-differential). ∂_a R(a) = { c : ∀ a', R(a') - R(a) ≥ c (a' - a) }. If R is differentiable, ∂_a R(a) = { ∇_a R(a) }. For L1: ∂_{w_d} |w_d| = -1 if w_d < 0, +1 if w_d > 0, and the interval [-1, +1] if w_d = 0. A continuous range of subgradients at w_d = 0! [Figure: plot of |w_d|.] (Omitting b for simplicity.)

L1 Regularized Least Squares. L2: ∂_{w_d} of λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2 uses ∂_{w_d} ||w||_2^2 = 2 w_d. L1: ∂_{w_d} ||w||_1 = -1 if w_d < 0, +1 if w_d > 0, and [-1, +1] if w_d = 0. [Figure: the two penalties plotted over w_d ∈ [-2, 2].] (Omitting b for simplicity.)
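A small sketch of why the [-1, +1] subgradient at 0 yields exact zeros: for a single coordinate with a single data point, the minimizer of λ|w| + (y - w)^2 is the soft-thresholding rule (this one-dimensional setup is an illustration, not taken from the slides):

```python
import numpy as np

def soft_threshold(y, lam):
    """Minimizer of lam*|w| + (y - w)^2 over scalar w.

    The [-1, +1] subgradient of |w| at 0 means w = 0 is optimal whenever |y| <= lam/2,
    which is exactly where the sparsity comes from.
    """
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

# Quick brute-force check of the closed form on a grid of candidate w values.
lam = 1.0
for y in [-2.0, -0.3, 0.0, 0.4, 1.7]:
    grid = np.linspace(-3, 3, 100001)
    w_brute = grid[np.argmin(lam * np.abs(grid) + (y - grid) ** 2)]
    print(y, soft_threshold(y, lam), round(w_brute, 3))
```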

Lagrange Multipliers. argmin_w L(y, w) = (y - w^T x)^2 subject to ||w||_1 ≤ c, with ∂_{w_d} ||w||_1 = -1 if w_d < 0, +1 if w_d > 0, [-1, +1] if w_d = 0. ∃ λ ≥ 0 : (∇_w L(y, w) ∈ -λ ∂_w ||w||_1) ∧ (||w||_1 ≤ c). Solutions tend to be at corners of the L1 ball! (Omitting b & using 1 training data point for simplicity.) http://en.wikipedia.org/wiki/Lagrange_multiplier

Sparsity. w is sparse if it is mostly 0's: small L0 norm, ||w||_0 = Σ_d 1[w_d ≠ 0]. Why not L0 regularization, argmin_w λ ||w||_0 + Σ_i (y_i - w^T x_i)^2? Not continuous! L1 induces sparsity and is continuous: argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2. (Omitting b for simplicity.)
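A minimal proximal-gradient (ISTA-style) sketch of L1-regularized least squares on synthetic data with a sparse true w; the step size, λ, and iteration count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, lam = 100, 20, 10.0
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.5, 1.0]                      # sparse ground truth
X = rng.normal(size=(N, D))
y = X @ w_true + 0.1 * rng.normal(size=N)

step = 0.5 / np.linalg.norm(X.T @ X, 2)            # 1/L, where L = 2 * lambda_max(X^T X)
w = np.zeros(D)
for _ in range(500):
    grad = -2 * X.T @ (y - X @ w)                  # gradient of the squared loss
    z = w - step * grad                            # gradient step on the smooth part
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-threshold: prox of lam*||w||_1
print(np.count_nonzero(np.round(w, 4)))            # only a few coordinates stay non-zero
```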

Why is Sparsity Important? Computational / memory efficiency: instead of storing 1M numbers in an array, store 2 numbers per non-zero entry as (index, value) pairs, e.g., w = [(50, 1), (51, 1)] for w = [0 0 ... 0 1 1 0 ... 0]. The dot product w^T x is more efficient. Also, sometimes the true w is sparse, and we want to recover the non-zero dimensions.
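A tiny sketch of the storage/compute point: keep only (index, value) pairs for the non-zero weights, so the dot product touches just those entries (mirroring the slide's example [(50, 1), (51, 1)]):

```python
import numpy as np

D = 1_000_000
w_sparse = [(50, 1.0), (51, 1.0)]      # (index, value) pairs for the non-zero entries

x = np.arange(D, dtype=float)          # some dense feature vector

def sparse_dot(w_pairs, x):
    """w^T x touching only the non-zero coordinates of w."""
    return sum(v * x[i] for i, v in w_pairs)

print(sparse_dot(w_sparse, x))         # 50 + 51 = 101, without ever storing a dense w
```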

Lasso Guarantee. argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i + b)^2. Suppose the data are generated as y_i ~ Normal(w*^T x_i, σ^2), and define Supp(w*) = { d : w*_d ≠ 0 }. Then if λ > (2/κ) sqrt(2 σ^2 log D / N), with high probability (increasing with N): Supp(ŵ) ⊆ Supp(w*) (high precision parameter recovery), and if additionally every non-zero |w*_d| ≥ λ c, then Supp(ŵ) = Supp(w*) (sometimes high recall). See also: https://www.cs.utexas.edu/~pradeepr/courses/395t-lt/filez/highdimii.pdf and http://www.eecs.berkeley.edu/~wainwrig/papers/wai_sparseinfo09.pdf
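A small simulation in the spirit of this guarantee, using scikit-learn's Lasso as the solver (an assumption; any L1 solver would do, and sklearn's alpha is scaled differently from the slide's λ): generate y ~ Normal(w*^T x, σ^2) with a sparse w* and check that the recovered support lies within the true support.

```python
import numpy as np
from sklearn.linear_model import Lasso   # assumes scikit-learn is installed

rng = np.random.default_rng(7)
N, D, sigma = 200, 50, 0.5
w_star = np.zeros(D)
w_star[[3, 17, 40]] = [1.5, -2.0, 1.0]                 # Supp(w*) = {3, 17, 40}
X = rng.normal(size=(N, D))
y = X @ w_star + sigma * rng.normal(size=N)            # y ~ Normal(w*^T x, sigma^2)

# Note: sklearn's alpha weights the L1 term against (1/(2N)) * squared loss,
# so it is not numerically the same lambda as on the slide.
model = Lasso(alpha=0.2, fit_intercept=False).fit(X, y)
support_hat = set(np.flatnonzero(model.coef_))
print(support_hat, support_hat <= {3, 17, 40})         # recovered support within the true support
```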

[Figure: magnitude of the two weights (age>10 on the x-axis, male? on the y-axis) as regularization shrinks them, for L1 vs L2.]
Person | Age>10 | Male? | Height>55"
Alice  | 1 | 0 | 1
Bob    | 0 | 1 | 0
Carol  | 0 | 0 | 0
Dave   | 1 | 1 | 1
Erin   | 1 | 0 | 1
Frank  | 0 | 1 | 1
Gena   | 0 | 0 | 0
Harold | 1 | 1 | 1
Irene  | 1 | 0 | 0
John   | 0 | 1 | 1
Kelly  | 1 | 0 | 1
Larry  | 1 | 1 | 1

Recap: Lasso vs Ridge. Model assumptions: Lasso learns a sparse weight vector. Predictive accuracy: Lasso is often not as accurate; re-run least squares on the dimensions selected by Lasso. Ease of inspection: a sparse w is easier to inspect. Ease of optimization: Lasso is somewhat trickier to optimize.
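A sketch of the "refit after selection" point: use the Lasso only to pick the non-zero dimensions, then re-run unregularized least squares on those columns (scikit-learn assumed for the Lasso step, synthetic data):

```python
import numpy as np
from sklearn.linear_model import Lasso   # assumes scikit-learn is installed

rng = np.random.default_rng(8)
N, D = 200, 50
w_star = np.zeros(D)
w_star[[1, 2, 30]] = [1.0, -1.0, 2.0]
X = rng.normal(size=(N, D))
y = X @ w_star + 0.5 * rng.normal(size=N)

lasso = Lasso(alpha=0.2, fit_intercept=False).fit(X, y)
selected = np.flatnonzero(lasso.coef_)                 # dimensions selected by the Lasso

# Unregularized least squares restricted to the selected dimensions (debiasing step)
coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
w_refit = np.zeros(D)
w_refit[selected] = coef

print(selected, w_refit[selected])                     # refit weights are less shrunken than lasso.coef_
```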

Recap: Regularization.
L2: argmin_w λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2
L1 (Lasso): argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2
Multi-task: argmin_{w,v} λ w^T w + λ v^T v + γ (w - v)^T (w - v) + Σ_i (y_i^(1) - w^T x_i^(1))^2 + Σ_i (y_i^(2) - v^T x_i^(2))^2
[Insert Yours Here!] (Omitting b for simplicity.)

Next Lectures: Decision Trees, Bagging, Random Forests, Boosting, Ensemble Selection. No Recitation this Week.