Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso


Recap: Complete Pipeline
Training Data:  S = {(x_i, y_i)}
Model Class(es):  f(x | w, b) = w^T x - b
Loss Function:  L(a, b) = (a - b)^2
Learn:  argmin_{w,b} Σ_i L(y_i, f(x_i | w, b))  via SGD
Cross Validation & Model Selection
Profit!

Different Model Classes?
Option 1: SVMs vs ANNs vs LR vs LS
Option 2: Regularization
argmin_{w,b} Σ_i L(y_i, f(x_i | w, b)),  then SGD, Cross Validation & Model Selection

Notation Part 1
L0 norm (not actually a norm): number of non-zero entries,  ||w||_0 = Σ_d 1[w_d ≠ 0]
L1 norm: sum of absolute values,  ||w||_1 = Σ_d |w_d|
L2 norm: sqrt of the sum of squares,  ||w||_2 = sqrt(Σ_d w_d^2) = sqrt(w^T w)
Squared L2 norm: sum of squares,  ||w||_2^2 = Σ_d w_d^2 = w^T w
L-infinity norm: max absolute value,  ||w||_∞ = lim_{p→∞} (Σ_d |w_d|^p)^{1/p} = max_d |w_d|
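A minimal NumPy sketch of these norms (the helper name norms and the example vector are illustrative, not from the lecture):

    import numpy as np

    def norms(w):
        """Compute the norms from Notation Part 1 for a weight vector w."""
        l0 = np.count_nonzero(w)          # ||w||_0: number of non-zero entries
        l1 = np.sum(np.abs(w))            # ||w||_1: sum of absolute values
        l2 = np.sqrt(np.sum(w ** 2))      # ||w||_2: sqrt of the sum of squares
        l2_sq = np.dot(w, w)              # ||w||_2^2: sum of squares (= w^T w)
        linf = np.max(np.abs(w))          # ||w||_inf: max absolute value
        return l0, l1, l2, l2_sq, linf

    print(norms(np.array([0.0, 3.0, -4.0])))  # l0=2, l1=7, l2=5, l2_sq=25, linf=4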

Notation Part 2
Minimizing Squared Loss = Regression = Least-Squares:  Σ_i (y_i - w^T x_i + b)^2  (unless otherwise stated).
E.g., Logistic Regression = Log Loss.

Ridge Regression (aka L2-Regularized Regression)
argmin_{w,b}  λ w^T w  +  Σ_i (y_i - w^T x_i + b)^2        [regularization term + training loss]
Trades off model complexity vs training loss.
Each choice of λ is a model class (more on this later).
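With squared loss, this objective also has a closed-form solution. The sketch below is one standard way to compute it, handling an unpenalized bias via an extra constant column (the names ridge_fit, X, y, lam are illustrative assumptions, not from the slides):

    import numpy as np

    def ridge_fit(X, y, lam):
        """Solve argmin_{w,b} lam * w^T w + sum_i (y_i - (w^T x_i - b))^2 in closed form."""
        n, d = X.shape
        Xb = np.hstack([X, -np.ones((n, 1))])   # last column implements the "- b" in f(x|w,b)
        reg = lam * np.eye(d + 1)
        reg[d, d] = 0.0                         # do not penalize the bias term
        wb = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)
        return wb[:d], wb[d]                    # (w, b)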

Example: Ridge Regression on a Toy Dataset
Features:  x = [ 1[age > 10], 1[gender = male] ];   label:  y = 1 if height > 55", 0 if height ≤ 55".
Objective:  argmin_{w,b} λ w^T w + Σ_i (y_i - w^T x_i + b)^2
[Figure: Training Loss and Test Loss (roughly 0.10 to 0.22) vs lambda (10^-2 to 10^1).]

Learned (w1, w2, b) as lambda increases (larger lambda shrinks w):
  0.7401   0.2441   -0.1745
  0.7122   0.2277   -0.1967
  0.6197   0.1765   -0.2686
  0.4124   0.0817   -0.4196
  0.1801   0.0161   -0.5686
  0.0001   0.0000   -0.6666

Data (Person: Age>10, Male?, Height>55"; split into a training set and a test set):
  Alice  1 0 1    Bob    0 1 0    Carol  0 0 0    Dave   1 1 1
  Erin   1 0 1    Frank  0 1 1    Gena   0 0 0    Harold 1 1 1
  Irene  1 0 0    John   0 1 1    Kelly  1 0 1    Larry  1 1 1

Updated Pipeline
Training Data:  S = {(x_i, y_i)}
Model Class:  f(x | w, b) = w^T x - b
Loss Function:  L(a, b) = (a - b)^2
Learn:  argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b))
Cross Validation & Model Selection
Profit!

Model scores f(x | w, b) for increasing lambda (the rightmost column corresponds to the largest lambda, where all predictions collapse to 0.67; an intermediate lambda achieves the best test error):

Person  (Age>10, Male?, Height>55")   scores for increasing lambda
Alice   (1, 0, 1)   0.91  0.89  0.83  0.75  0.67
Bob     (0, 1, 0)   0.42  0.45  0.50  0.58  0.67
Carol   (0, 0, 0)   0.17  0.26  0.42  0.50  0.67
Dave    (1, 1, 1)   1.16  1.06  0.91  0.83  0.67
Erin    (1, 0, 1)   0.91  0.89  0.83  0.79  0.67
Frank   (0, 1, 1)   0.42  0.45  0.50  0.54  0.67
Gena    (0, 0, 0)   0.17  0.27  0.42  0.50  0.67
Harold  (1, 1, 1)   1.16  1.06  0.91  0.83  0.67
Irene   (1, 0, 0)   0.91  0.89  0.83  0.79  0.67
John    (0, 1, 1)   0.42  0.45  0.50  0.54  0.67
Kelly   (1, 0, 1)   0.91  0.89  0.83  0.79  0.67
Larry   (1, 1, 1)   1.16  1.06  0.91  0.83  0.67

Choice of Lambda Depends on Training Size
[Figure: test loss vs lambda (10^-2 to 10^2), four panels: 25, 50, 75, and 100 training points.]
Setup: 25-dimensional space, randomly generated linear response function plus noise.
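Selecting λ is exactly the cross-validation / model-selection step in the pipeline. A minimal sketch of a held-out sweep, reusing the ridge_fit sketch above (the grid of λ values and the helper name are illustrative assumptions):

    import numpy as np

    def select_lambda(X_train, y_train, X_val, y_val,
                      lambdas=(0.01, 0.1, 1.0, 10.0, 100.0)):
        """Return the lambda whose ridge fit has the lowest squared loss on held-out data."""
        best = None
        for lam in lambdas:
            w, b = ridge_fit(X_train, y_train, lam)            # sketch defined earlier
            val_loss = np.mean((y_val - (X_val @ w - b)) ** 2)
            if best is None or val_loss < best[0]:
                best = (val_loss, lam, w, b)
        return best                                            # (val loss, lambda, w, b)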

Recap: Ridge Regularization
Ridge Regression = L2-Regularized Least-Squares:  argmin_{w,b} λ w^T w + Σ_i (y_i - w^T x_i + b)^2
Large λ ⇒ more stable predictions, less likely to overfit the training data.
Too large λ ⇒ underfit.
Works with other losses too: Hinge Loss, Log Loss, etc.

Aside: Stochastic Gradient Descent
Full objective:  L(w, b) = λ ||w||_2^2 + Σ_i L(y_i, f(x_i | w, b))
Per-example objective:  L_i(w, b) = (1/N) λ ||w||_2^2 + L(y_i, f(x_i | w, b))
Then  ∇L(w, b) = N · E_i[ ∇L_i(w, b) ],  so do SGD on the L_i.
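A minimal SGD sketch for the L2-regularized squared loss, following the per-example split above (the learning rate, epoch count, and function name are illustrative assumptions):

    import numpy as np

    def sgd_ridge(X, y, lam, lr=0.01, epochs=100, seed=0):
        """SGD on L_i(w, b) = (1/N) * lam * ||w||^2 + (y_i - (w^T x_i - b))^2."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            for i in rng.permutation(n):
                err = y[i] - (X[i] @ w - b)                 # residual on example i
                grad_w = (2.0 / n) * lam * w - 2.0 * err * X[i]
                grad_b = 2.0 * err                          # d/db of (y_i - w^T x_i + b)^2
                w -= lr * grad_w
                b -= lr * grad_b
        return w, b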

Model Class Interpretation
argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b))
This is not a model class! At least not the way we've discussed so far; it is an optimization procedure.
Is there a connection?

Norm-Constrained Model Class
f(x | w, b) = w^T x - b,   s.t.  w^T w ≤ c
[Figure: L2-norm balls for c = 1, 2, 3; plot of the learned ||w|| shrinking as lambda grows.]
The constraint level c seems to correspond to λ in  argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b)).
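The norm-constrained class can also be trained directly with projected gradient descent: take a gradient step on the loss, then project w back onto the ball w^T w ≤ c. This is a sketch under those assumptions, not an algorithm given in the lecture:

    import numpy as np

    def projected_gd(X, y, c, lr=0.001, epochs=500):
        """Minimize sum_i (y_i - w^T x_i)^2 subject to w^T w <= c (bias omitted)."""
        d = X.shape[1]
        w = np.zeros(d)
        for _ in range(epochs):
            grad = -2.0 * X.T @ (y - X @ w)        # gradient of the squared loss
            w -= lr * grad
            norm_sq = w @ w
            if norm_sq > c:                        # project onto the L2 ball of radius sqrt(c)
                w *= np.sqrt(c / norm_sq)
        return w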

Lagrange Multipliers
Minimize  L(y, w) = (y - w^T x)^2   s.t.  w^T w ≤ c
Optimality condition: the gradients are aligned at the constraint boundary!
∃ λ ≥ 0:  (∇_w L(y, w) = λ ∇_w w^T w)  ∧  (w^T w ≤ c)
(Omitting b and using a single training point for simplicity.)
http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm-Constrained Model Class
Training:  argmin_w L(y, w),  with  L(y, w) = (y - w^T x)^2,   s.t.  w^T w ≤ c   (omitting b and using a single training point for simplicity)
Two conditions must be satisfied at optimality. Observation about optimality:
  ∃ λ ≥ 0:  (∇_w L(y, w) = λ ∇_w w^T w)  ∧  (w^T w ≤ c)
Lagrangian:  argmin_w max_{λ≥0} Λ(w, λ) = (y - w^T x)^2 + λ (w^T w - c)
Claim: solving the Lagrangian solves the norm-constrained training problem.
Optimality implication of the Lagrangian for w:
  ∇_w Λ(w, λ) = -2 x (y - w^T x) + 2 λ w ≡ 0   ⇒   2 x (y - w^T x) = 2 λ w,  which satisfies the first condition.
Optimality implication of the Lagrangian for λ:
  ∂_λ Λ(w, λ) = w^T w - c if w^T w ≥ c, and the maximizing λ is 0 if w^T w < c   ⇒   w^T w ≤ c,  which satisfies the second condition.
http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm-Constrained vs L2-Regularized Training
Norm-constrained training:  argmin_w L(y, w) = (y - w^T x)^2   s.t.  w^T w ≤ c
L2-regularized training:  argmin_w λ w^T w + (y - w^T x)^2
Lagrangian:  argmin_w max_{λ≥0} Λ(w, λ) = (y - w^T x)^2 + λ (w^T w - c)
Lagrangian = norm-constrained training:  ∃ λ ≥ 0:  (∇_w L(y, w) = λ ∇_w w^T w)  ∧  (w^T w ≤ c)
Lagrangian = L2-regularized training: hold λ fixed; the extra -λc term is a constant, so minimizing over w is the same problem.
So L2-regularized training is equivalent to norm-constrained training, for some c.
(Omitting b and using a single training point for simplicity.)
http://en.wikipedia.org/wiki/Lagrange_multiplier

Recap 2: Ridge Regularization
Ridge Regression (L2-Regularized Least-Squares):  argmin_{w,b} λ w^T w + L(w, b)
= Norm-Constrained Model:  argmin_{w,b} L(w, b)   s.t.  w^T w ≤ c
Large λ ⇒ more stable predictions, less likely to overfit the training data.
Too large λ ⇒ underfit.
Works with other losses too: Hinge Loss, Log Loss, etc.

Hallucinating Data Points
The gradient of  argmin_w λ w^T w + Σ_i (y_i - w^T x_i)^2  is  2 λ w - Σ_i 2 x_i (y_i - w^T x_i).
What if we instead hallucinate D extra data points  {(√λ e_d, 0)}_{d=1..D},  where  e_d = [0 ... 0 1 0 ... 0]^T  is the unit vector along the d-th dimension?
Objective:  argmin_w Σ_{d=1..D} (0 - √λ w^T e_d)^2 + Σ_i (y_i - w^T x_i)^2
Gradient of the hallucinated term:  Σ_{d=1..D} 2 √λ e_d (√λ w^T e_d) = Σ_d 2 λ w_d e_d = 2 λ w
Identical to regularization!
(Omitting b for simplicity.)
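A quick numerical check of this equivalence (a sketch with randomly generated data; bias omitted): ordinary least squares on the augmented dataset should match the closed-form ridge solution.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 50, 5, 2.0
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    # Ridge solution (bias omitted): (X^T X + lam * I)^{-1} X^T y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Hallucinate D = d points (sqrt(lam) * e_d, 0) and solve plain least squares
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    y_aug = np.concatenate([y, np.zeros(d)])
    w_halluc, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

    print(np.allclose(w_ridge, w_halluc))  # True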

Extension: Multi-task Learning
Two prediction tasks: a spam filter for Alice, and a spam filter for Bob.
Limited training data for both... but Alice is similar to Bob.

Extension: Multi-task Learning
Two training sets, both relatively small:  S^(1) = {(x_i^(1), y_i^(1))}  and  S^(2) = {(x_i^(2), y_i^(2))}.
Option 1: train separately:
  argmin_w λ w^T w + Σ_i (y_i^(1) - w^T x_i^(1))^2        argmin_v λ v^T v + Σ_i (y_i^(2) - v^T x_i^(2))^2
Both models have high error.
(Omitting b for simplicity.)

Extension: Multi-task Learning
Option 2: train jointly:
  argmin_{w,v} λ w^T w + Σ_i (y_i^(1) - w^T x_i^(1))^2 + λ v^T v + Σ_i (y_i^(2) - v^T x_i^(2))^2
Doesn't accomplish anything! (w and v don't depend on each other.)
(Omitting b for simplicity.)

Multi-task Regularization
argmin_{w,v}  λ w^T w + λ v^T v + γ (w - v)^T (w - v) + Σ_i (y_i^(1) - w^T x_i^(1))^2 + Σ_i (y_i^(2) - v^T x_i^(2))^2
(standard regularization + multi-task regularization + training loss)
Prefers w and v to be close, controlled by γ.
Tasks similar ⇒ larger γ helps!   Tasks not identical ⇒ do not make γ too large.
[Figure: test loss on Task 2 (roughly 12 to 14.5) vs gamma (10^-4 to 10^2).]
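A minimal gradient-descent sketch of this joint objective (the function name, learning rate, and iteration count are illustrative assumptions; bias omitted as on the slide):

    import numpy as np

    def multitask_fit(X1, y1, X2, y2, lam, gamma, lr=0.001, epochs=2000):
        """Jointly fit w (task 1) and v (task 2) with L2 penalties plus a gamma*||w - v||^2 coupling."""
        d = X1.shape[1]
        w, v = np.zeros(d), np.zeros(d)
        for _ in range(epochs):
            grad_w = 2 * lam * w + 2 * gamma * (w - v) - 2 * X1.T @ (y1 - X1 @ w)
            grad_v = 2 * lam * v - 2 * gamma * (w - v) - 2 * X2.T @ (y2 - X2 @ v)
            w -= lr * grad_w
            v -= lr * grad_v
        return w, v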

Lasso: L1-Regularized Least-Squares

L1 Regularized Least Squares
L2:  argmin_w λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2
L1:  argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2
[Figure: the penalties w^2 (L2) and |w| (L1) for a scalar weight w in [-2, 2].]
(Omitting b for simplicity.)

Aside: Subgradient (sub-differential)
∂_a R(a) = { c : ∀ a',  R(a') - R(a) ≥ c (a' - a) }
If R is differentiable:  ∂_a R(a) = { ∇_a R(a) }
For the L1 norm:  ∂_{w_d} ||w||_1 = -1 if w_d < 0;  +1 if w_d > 0;  [-1, +1] if w_d = 0
A continuous range of subgradients at w_d = 0!
[Figure: |w| for a scalar weight w in [-2, 2].]
(Omitting b for simplicity.)

L1 Regularized Least Squares
L2:  argmin_w λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2,   with  ∂_{w_d} ||w||_2^2 = 2 w_d
L1:  argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2,   with  ∂_{w_d} ||w||_1 = -1 if w_d < 0;  +1 if w_d > 0;  [-1, +1] if w_d = 0
[Figure: the two penalty curves for a scalar weight w in [-2, 2].]
(Omitting b for simplicity.)

Lagrange Multipliers (L1 constraint)
Minimize  L(y, w) = (y - w^T x)^2   s.t.  ||w||_1 ≤ c
Subgradient of the constraint:  ∂_{w_d} ||w||_1 = -1 if w_d < 0;  +1 if w_d > 0;  [-1, +1] if w_d = 0
Solutions tend to be at corners!
∃ λ ≥ 0:  (∇_w L(y, w) ∈ λ ∂_w ||w||_1)  ∧  (||w||_1 ≤ c)
(Omitting b and using a single training point for simplicity.)
http://en.wikipedia.org/wiki/Lagrange_multiplier

Sparsity
w is sparse if it is mostly 0's: small L0 norm,  ||w||_0 = Σ_d 1[w_d ≠ 0].
Why not L0 regularization,  argmin_w λ ||w||_0 + Σ_i (y_i - w^T x_i)^2 ?  Not continuous!
L1 induces sparsity, and is continuous:  argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2
(Omitting b for simplicity.)

Why is Sparsity Important?
Computational / memory efficiency: instead of storing 1M numbers in an array, store two numbers per non-zero entry as (index, value) pairs.
E.g.,  w = [0 0 ... 0 1 1 0 ... 0]  becomes  [ (50, 1), (51, 1) ],  and the dot product w^T x is more efficient.
Also, sometimes the true w* is sparse, and we want to recover its non-zero dimensions.
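A minimal sketch of the (index, value) representation and the corresponding dot product (helper names and the tolerance are illustrative):

    def to_sparse(w, tol=0.0):
        """Represent a dense vector as (index, value) pairs for its non-zero entries."""
        return [(i, v) for i, v in enumerate(w) if abs(v) > tol]

    def sparse_dot(sparse_w, x):
        """Compute w^T x touching only the non-zero entries of w."""
        return sum(v * x[i] for i, v in sparse_w)

    w = [0.0] * 1000
    w[50], w[51] = 1.0, 1.0
    print(to_sparse(w))                            # [(50, 1.0), (51, 1.0)]
    print(sparse_dot(to_sparse(w), range(1000)))   # 1*50 + 1*51 = 101.0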

Lasso Guarantee
argmin_{w,b} λ ||w||_1 + Σ_i (y_i - w^T x_i + b)^2
Suppose the data is generated as  y_i ~ Normal(w*^T x_i, σ^2).
Then if  λ > (2/κ) sqrt(2 σ^2 N log D),  with high probability (increasing with N):
  Supp(ŵ) ⊆ Supp(w*)                                        (high-precision parameter recovery)
  ∀ d ∈ Supp(w*): |w*_d| ≥ λ c   ⇒   Supp(ŵ) = Supp(w*)     (sometimes high recall)
where  Supp(w*) = { d : w*_d ≠ 0 }.
See also:
https://www.cs.utexas.edu/~pradeepr/courses/395t-lt/flez/hghdmii.pdf
http://www.eecs.berkeley.edu/~wainwrig/papers/wa_sparseinfo09.pdf

Weight Paths: L1 vs L2
[Figure: magnitudes of the two weights (x-axis: age>10?, y-axis: male?) as the regularization shrinks, traced for L1 and for L2.]
Data: the same person table as above (Person: Age>10, Male?, Height>55").

Aside: Optimizing Lasso
Solving Lasso gives a sparse model. Will stochastic gradient descent find it? No! It is hard to hit exactly 0 with gradient descent.
Solution: Iterative Soft Thresholding.
Intuition: if the gradient update would pass through 0, clamp the weight at 0.
https://blogs.princeton.edu/imabandit/2013/04/11/orf523-ista-and-fista/
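A minimal ISTA sketch for the Lasso objective λ||w||_1 + Σ_i (y_i - w^T x_i)^2 (bias omitted; the default step size, tied to the Lipschitz constant of the smooth part, and the iteration count are illustrative assumptions):

    import numpy as np

    def soft_threshold(z, tau):
        """Shrink z toward 0 by tau; entries whose update would cross 0 are clamped at exactly 0."""
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def ista_lasso(X, y, lam, lr=None, iters=1000):
        """Iterative Soft Thresholding for argmin_w lam * ||w||_1 + ||y - X w||^2."""
        d = X.shape[1]
        if lr is None:
            lr = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
        w = np.zeros(d)
        for _ in range(iters):
            grad = -2.0 * X.T @ (y - X @ w)                # gradient of the squared-loss term
            w = soft_threshold(w - lr * grad, lr * lam)    # gradient step, then soft threshold
        return w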

Recap: Lasso vs Ridge
Model assumptions: Lasso learns a sparse weight vector.
Predictive accuracy: Lasso is often not as accurate; re-run Least Squares on the dimensions selected by Lasso.
Ease of inspection: a sparse w is easier to inspect.
Ease of optimization: Lasso is somewhat trickier to optimize.
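A minimal sketch of that refit step, reusing the ista_lasso sketch above (the support threshold 1e-8 and helper name are illustrative assumptions): run Lasso to select dimensions, then ordinary least squares on just those dimensions.

    import numpy as np

    def lasso_then_refit(X, y, lam):
        """Select a support with Lasso, then re-fit unregularized least squares on it."""
        w_lasso = ista_lasso(X, y, lam)                     # sketch defined earlier
        support = np.flatnonzero(np.abs(w_lasso) > 1e-8)    # selected (non-zero) dimensions
        w = np.zeros(X.shape[1])
        if support.size > 0:
            coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
            w[support] = coef
        return w, support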

Recap: Regularization
L2:  argmin_w λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2
L1 (Lasso):  argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2
Multi-task:  argmin_{w,v} λ w^T w + λ v^T v + γ (w - v)^T (w - v) + Σ_i (y_i^(1) - w^T x_i^(1))^2 + Σ_i (y_i^(2) - v^T x_i^(2))^2
[Insert Yours Here!]
(Omitting b for simplicity.)

Recap: Updated Pipeline
Training Data:  S = {(x_i, y_i)}
Model Class:  f(x | w, b) = w^T x - b
Loss Function:  L(a, b) = (a - b)^2
Learn:  argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b))
Cross Validation & Model Selection
Profit!

Next Lectures
Decision Trees, Bagging, Random Forests, Boosting, Ensemble Selection.
No Recitation this Week.