Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso




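2. Recap: Complete Pipeline
Training Data: $S = \{(x_i, y_i)\}_{i=1}^{N}$
Model Class(es): $f(x \mid w, b) = w^T x - b$
Loss Function: $L(a, b) = (a - b)^2$
Training, solved with SGD:
\[ \operatorname*{argmin}_{w,b} \sum_{i=1}^{N} L\big(y_i, f(x_i \mid w, b)\big) \]
Cross Validation & Model Selection
Profit!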

3. Different Model Classes?
Option 1: SVMs vs ANNs vs LR vs LS
Option 2: Regularization
\[ \operatorname*{argmin}_{w,b} \sum_{i=1}^{N} L\big(y_i, f(x_i \mid w, b)\big) \]
SGD, then Cross Validation & Model Selection

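4. Notation Part 1
L0 Norm (not actually a norm), number of non-zero entries: $\|w\|_0 = \sum_d 1[w_d \neq 0]$
L1 Norm, sum of absolute values: $\|w\|_1 = |w| = \sum_d |w_d|$
L2 Norm, sqrt(sum of squares): $\|w\|_2 = \sqrt{\sum_d w_d^2} = \sqrt{w^T w}$
Squared L2 Norm, sum of squares: $\|w\|_2^2 = \sum_d w_d^2 = w^T w$
L-infinity Norm, max absolute value: $\|w\|_\infty = \lim_{p \to \infty} \big(\sum_d |w_d|^p\big)^{1/p} = \max_d |w_d|$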
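The norm definitions on this slide can be checked numerically. The following is a minimal sketch in plain Python; the function names and the example vector are illustrative choices, not part of the lecture.

```python
import math

def l0_norm(w):
    """Number of non-zero entries (not actually a norm)."""
    return sum(1 for w_d in w if w_d != 0)

def l1_norm(w):
    """Sum of absolute values."""
    return sum(abs(w_d) for w_d in w)

def squared_l2_norm(w):
    """Sum of squares, i.e. w^T w."""
    return sum(w_d * w_d for w_d in w)

def l2_norm(w):
    """Square root of the sum of squares."""
    return math.sqrt(squared_l2_norm(w))

def linf_norm(w):
    """Maximum absolute value."""
    return max(abs(w_d) for w_d in w)

w = [3.0, 0.0, -4.0]  # illustrative vector
print(l0_norm(w), l1_norm(w), l2_norm(w), squared_l2_norm(w), linf_norm(w))
# prints: 2 7.0 5.0 25.0 4.0
```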

5. Notation Part 2
Minimizing Squared Loss = Regression = Least-Squares (unless otherwise stated):
\[ \operatorname*{argmin}_{w} \sum_{i=1}^{N} (y_i - w^T x_i + b)^2 \]
E.g., Logistic Regression = Log Loss

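6. Ridge Regression
\[ \operatorname*{argmin}_{w,b} \underbrace{\lambda\, w^T w}_{\text{Regularization}} + \underbrace{\sum_{i=1}^{N} (y_i - w^T x_i + b)^2}_{\text{Training Loss}} \]
aka L2-Regularized Regression
Trades off model complexity vs training loss
Each choice of λ is a "model class" (will discuss this further later)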
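To make the trade-off concrete, here is a minimal sketch of ridge regression for a scalar weight with the bias omitted. In that special case the objective λw² + Σ(y_i − w x_i)² has the closed-form minimizer w = Σ x_i y_i / (λ + Σ x_i²). The function name `ridge_1d` and the toy data are assumptions made for illustration.

```python
def ridge_1d(xs, ys, lam):
    """Minimize lam * w**2 + sum((y - w * x)**2) over a scalar w (no bias)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative 2*lam*w - 2*sum(x*(y - w*x)) to zero gives:
    return sxy / (lam + sxx)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy data lying exactly on y = 2x
for lam in [0.0, 1.0, 14.0, 100.0]:
    # The learned weight shrinks toward 0 as lambda grows.
    print(lam, ridge_1d(xs, ys, lam))
```

With λ = 0 this recovers ordinary least squares; larger λ gives a smaller weight and a higher training loss.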

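7. Larger Lambda!
\[ \operatorname*{argmin}_{w,b} \lambda\, w^T w + \sum_{i=1}^{N} (y_i - w^T x_i + b)^2 \]
Features: $x = \big(1[\text{age} > 10],\ 1[\text{gender} = \text{male}]\big)^T$
Label: $y = 1$ if height > 55", $y = 0$ if height ≤ 55"
[Figure: Training Loss and Test Loss plotted against lambda.]
[Table: Person, Age>10, Male?, Height > 55", split into Train and Test rows for Alice, Bob, Carol, Dave, Erin, Frank, Gena, Harold, Irene, John, Kelly, Larry. Cell values not recoverable.]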

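8. Updated Pipeline
Training Data: $S = \{(x_i, y_i)\}_{i=1}^{N}$
Model Class: $f(x \mid w, b) = w^T x - b$
Loss Function: $L(a, b) = (a - b)^2$
\[ \operatorname*{argmin}_{w,b} \lambda\, w^T w + \sum_{i=1}^{N} L\big(y_i, f(x_i \mid w, b)\big) \]
Cross Validation & Model Selection (now also over λ)
Profit!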
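The cross-validation and model-selection step of the pipeline can be sketched as a loop over candidate values of λ. This is a minimal illustration for a scalar weight with no bias; the helper names (`fit_ridge_1d`, `squared_loss`, `choose_lambda`), the interleaved fold assignment, and the toy data are all assumptions, not the lecture's procedure.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of lam * w**2 + sum((y - w * x)**2) for scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (lam + sxx)

def squared_loss(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def choose_lambda(xs, ys, lambdas, k=3):
    """Return the lambda with the lowest total held-out loss over k folds."""
    n = len(xs)
    best_lam, best_loss = None, float("inf")
    for lam in lambdas:
        total = 0.0
        for fold in range(k):
            held_out = set(range(fold, n, k))          # every k-th point
            train = [i for i in range(n) if i not in held_out]
            w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
            total += squared_loss(w, [xs[i] for i in held_out],
                                  [ys[i] for i in held_out])
        if total < best_loss:
            best_lam, best_loss = lam, total
    return best_lam

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]          # noise-free toy data
print(choose_lambda(xs, ys, [10.0, 1.0, 0.0]))
```

On noise-free data the unregularized model wins; with noisy data and few training points, a larger λ is typically selected.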

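9. Model Score w/ Increasing Lambda
[Table: Person, Age>10, Male?, Height > 55", followed by model scores for increasing lambda, with Train and Test rows for Alice, Bob, Carol, Dave, Erin, Frank, Gena, Harold, Irene, John, Kelly, Larry. The column with the best test error is marked. Cell values not recoverable.]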

10. Choice of Lambda Depends on Training Size
[Figure: four panels of test loss versus lambda, one per number of training points. Panel sizes and axis values not recoverable.]
25 dimensional space
Randomly generated linear response function + noise

11. Recap: Ridge Regularization
Ridge Regression: L2 Regularized Least-Squares
\[ \operatorname*{argmin}_{w,b} \lambda\, w^T w + \sum_{i=1}^{N} (y_i - w^T x_i + b)^2 \]
Large λ ⇒ more stable predictions, less likely to overfit to training data
Too large λ ⇒ underfit
Works with other losses: Hinge Loss, Log Loss, etc.

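12. Aside: Stochastic Gradient Descent
\[ \operatorname*{argmin}_{w,b} \tilde{L}(w, b) = \lambda \|w\|^2 + \sum_{i=1}^{N} L\big(y_i, f(x_i \mid w, b)\big) \]
Per-example objective, with $L_i(w, b) = L\big(y_i, f(x_i \mid w, b)\big)$:
\[ \tilde{L}_i(w, b) = \frac{\lambda}{N} \|w\|^2 + L_i(w, b) \]
\[ \nabla \tilde{L}(w, b) = N\, \mathbb{E}_i\big[\nabla \tilde{L}_i(w, b)\big] \]
Do SGD on this.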
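A minimal sketch of SGD on the regularized objective, assuming squared loss, no bias, and the regularizer split evenly across the N examples so that each per-example objective is (λ/N)‖w‖² + (y_i − wᵀx_i)². The function names, step size, epoch count, and toy data are illustrative assumptions.

```python
import random

def dot(w, x):
    return sum(w_d * x_d for w_d, x_d in zip(w, x))

def grad_i(w, x, y, lam, n):
    """Gradient of (lam / n) * ||w||^2 + (y - w.x)^2 for one example."""
    resid = y - dot(w, x)
    return [2.0 * lam / n * w_d - 2.0 * resid * x_d for w_d, x_d in zip(w, x)]

def full_grad(w, data, lam):
    """Gradient of lam * ||w||^2 + sum_i (y_i - w.x_i)^2."""
    g = [2.0 * lam * w_d for w_d in w]
    for x, y in data:
        resid = y - dot(w, x)
        for d in range(len(w)):
            g[d] -= 2.0 * resid * x[d]
    return g

def sgd(data, lam, step=0.05, epochs=2000, seed=0):
    """One per-example gradient step at a time, in a fresh random order each epoch."""
    rng = random.Random(seed)
    n = len(data)
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for i in rng.sample(range(n), n):
            x, y = data[i]
            g = grad_i(w, x, y, lam, n)
            w = [w_d - step * g_d for w_d, g_d in zip(w, g)]
    return w

# Noise-free toy data generated by w* = (1, -2).
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -2.0), ([1.0, 1.0], -1.0), ([2.0, 1.0], 0.0)]
print(sgd(data, lam=0.0))
```

Summing `grad_i` over all examples reproduces `full_grad`, which is why a uniformly sampled per-example gradient, scaled by N, is an unbiased estimate of the full gradient.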

13. Model Class Interpretation
\[ \operatorname*{argmin}_{w,b} \lambda\, w^T w + \sum_{i=1}^{N} L\big(y_i, f(x_i \mid w, b)\big) \]
This is not a model class! At least not what we've discussed...
It is an optimization procedure. Is there a connection?

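14. Norm Constrained Model Class
\[ f(x \mid w, b) = w^T x - b \quad \text{s.t.} \quad w^T w \le c \;\equiv\; \|w\|_2^2 \le c \]
[Figure: visualization of the L2 norm constraint for c=1, c=2, c=3.]
Seems to correspond to lambda...
\[ \operatorname*{argmin}_{w,b} \lambda\, w^T w + \sum_{i=1}^{N} L\big(y_i, f(x_i \mid w, b)\big) \]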

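15. Lagrange Multipliers
\[ \operatorname*{argmin}_{w} L(y, w) \equiv (y - w^T x)^2 \quad \text{s.t.} \quad w^T w \le c \]
Optimality condition: gradients aligned at the constraint boundary!
\[ \exists \lambda \ge 0 : \big(\partial_w L(y, w) = -\lambda\, \partial_w w^T w\big) \wedge \big(w^T w \le c\big) \]
Omitting b & using 1 training data point for simplicity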

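16. Norm Constrained Model Class
Training:
\[ \operatorname*{argmin}_{w} L(y, w) \equiv (y - w^T x)^2 \quad \text{s.t.} \quad w^T w \le c \]
Observation about optimality (two conditions must be satisfied):
\[ \exists \lambda \ge 0 : \big(\partial_w L(y, w) = -\lambda\, \partial_w w^T w\big) \wedge \big(w^T w \le c\big) \]
Lagrangian:
\[ \operatorname*{argmin}_{w,\lambda} \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda\,(w^T w - c) \]
Claim: Solving the Lagrangian solves the norm-constrained training problem.
Optimality implication of the Lagrangian (satisfies the first condition!):
\[ \partial_w \Lambda(w, \lambda) = -2x\,(y - w^T x)^T + 2\lambda w \equiv 0 \;\Rightarrow\; 2x\,(y - w^T x)^T = 2\lambda w \]
Omitting b & using 1 training data point for simplicity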


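18. Norm Constrained Model Class
Training:
\[ \operatorname*{argmin}_{w} L(y, w) \equiv (y - w^T x)^2 \quad \text{s.t.} \quad w^T w \le c \]
Observation about optimality (two conditions must be satisfied):
\[ \exists \lambda \ge 0 : \big(\partial_w L(y, w) = -\lambda\, \partial_w w^T w\big) \wedge \big(w^T w \le c\big) \]
Lagrangian:
\[ \operatorname*{argmin}_{w,\lambda} \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda\,(w^T w - c) \]
Claim: Solving the Lagrangian solves the norm-constrained training problem.
Optimality implication of the Lagrangian (satisfies the 2nd condition!):
\[ \partial_\lambda \Lambda(w, \lambda) = \begin{cases} 0 & \text{if } w^T w < c \\ w^T w - c & \text{if } w^T w \ge c \end{cases} \;\equiv 0 \;\Rightarrow\; w^T w \le c \]
Omitting b & using 1 training data point for simplicity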

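19. Norm Constrained Model Class
Training:
\[ \operatorname*{argmin}_{w} L(y, w) \equiv (y - w^T x)^2 \quad \text{s.t.} \quad w^T w \le c \]
L2 Regularized Training:
\[ \operatorname*{argmin}_{w} \lambda\, w^T w + (y - w^T x)^2 \]
Lagrangian:
\[ \operatorname*{argmin}_{w,\lambda} \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda\,(w^T w - c) \]
Lagrangian = Norm Constrained Training:
\[ \exists \lambda \ge 0 : \big(\partial_w L(y, w) = -\lambda\, \partial_w w^T w\big) \wedge \big(w^T w \le c\big) \]
Lagrangian = L2 Regularized Training: hold λ fixed. Equivalent to solving Norm Constrained, for some c!
Omitting b & using 1 training data point for simplicity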
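The equivalence can be checked numerically in the simplest setting the slide uses: one training point, no bias, and (as an extra simplifying assumption here) a scalar weight. For each λ, the ridge solution defines a norm budget c = w², and the constrained problem with that c returns the same weight. The function names and numbers are illustrative.

```python
import math

def ridge_1d(x, y, lam):
    """argmin_w lam * w**2 + (y - w * x)**2 for a single training point."""
    return x * y / (lam + x * x)

def constrained_1d(x, y, c):
    """argmin_w (y - w * x)**2 subject to w**2 <= c."""
    w_ls = y / x                              # unconstrained least-squares solution
    bound = math.sqrt(c)
    # The loss is convex in w, so the constrained minimizer is the
    # projection of w_ls onto the interval [-sqrt(c), sqrt(c)].
    return max(-bound, min(bound, w_ls))

x, y = 2.0, 6.0
for lam in [0.0, 1.0, 4.0]:
    w_ridge = ridge_1d(x, y, lam)
    c = w_ridge ** 2                          # norm budget implied by this lambda
    print(lam, w_ridge, constrained_1d(x, y, c))
```

Larger λ corresponds to a smaller c, which matches the picture of a shrinking constraint region.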

20. Recap 2: Ridge Regularization
Ridge Regression: L2 Regularized Least-Squares = Norm Constrained Model
\[ \operatorname*{argmin}_{w,b} \lambda\, w^T w + L(w) \;\equiv\; \operatorname*{argmin}_{w,b} L(w) \ \ \text{s.t.} \ \ w^T w \le c \]
Large λ ⇒ more stable predictions, less likely to overfit to training data
Too large λ ⇒ underfit
Works with other losses: Hinge Loss, Log Loss, etc.

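21. Hallucinating Data Points
\[ \operatorname*{argmin}_{w} \lambda\, w^T w + \sum_{i=1}^{N} (y_i - w^T x_i)^2 \]
\[ \partial_w = 2\lambda w - 2 \sum_{i=1}^{N} x_i\,(y_i - w^T x_i)^T \]
Instead hallucinate D data points? $\{(\sqrt{\lambda}\, e_d,\ 0)\}_{d=1}^{D}$, where $e_d = (0, \dots, 0, 1, 0, \dots, 0)^T$ is the unit vector along the d-th dimension.
\[ \partial_w \sum_{d=1}^{D} \big(0 - w^T \sqrt{\lambda}\, e_d\big)^2 = 2 \sum_{d=1}^{D} \sqrt{\lambda}\, e_d \big(\sqrt{\lambda}\, e_d^T w\big) = 2\lambda \sum_{d=1}^{D} w_d\, e_d = 2\lambda w \]
Identical to Regularization!
Omitting b for simplicity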
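A small numerical check of the hallucinated-data argument: the gradient of plain least squares on the training set augmented with the D points (√λ e_d, 0) should match the ridge gradient on the original data. This is a sketch in plain Python with the bias omitted; the function names and toy values are illustrative assumptions.

```python
import math

def ls_grad(w, data):
    """Gradient of sum_i (y_i - w.x_i)^2 with respect to w."""
    g = [0.0] * len(w)
    for x, y in data:
        resid = y - sum(w_d * x_d for w_d, x_d in zip(w, x))
        for d in range(len(w)):
            g[d] -= 2.0 * resid * x[d]
    return g

def ridge_grad(w, data, lam):
    """Gradient of lam * w.w + sum_i (y_i - w.x_i)^2."""
    return [2.0 * lam * w_d + g_d for w_d, g_d in zip(w, ls_grad(w, data))]

def hallucinated_points(lam, dim):
    """One extra point (sqrt(lam) * e_d, 0) per dimension d."""
    points = []
    for d in range(dim):
        x = [0.0] * dim
        x[d] = math.sqrt(lam)
        points.append((x, 0.0))
    return points

data = [([1.0, 2.0], 3.0), ([0.5, -1.0], 1.0)]   # toy training set
w, lam = [0.3, -0.7], 4.0
print(ridge_grad(w, data, lam))
print(ls_grad(w, data + hallucinated_points(lam, 2)))
```

Because the gradients agree at every w, an unregularized least-squares solver run on the augmented data returns the ridge solution.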

22. Extension: Multi-task Learning
2 prediction tasks: spam filter for Alice, spam filter for Bob
Limited training data for both... but Alice is similar to Bob

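23. Extension: Multi-task Learning
Two training sets, relatively small:
\[ S^{(1)} = \{(x_i^{(1)}, y_i^{(1)})\}, \qquad S^{(2)} = \{(x_i^{(2)}, y_i^{(2)})\} \]
Option 1: Train Separately
\[ \operatorname*{argmin}_{w} \lambda\, w^T w + \sum_i \big(y_i^{(1)} - w^T x_i^{(1)}\big)^2 \]
\[ \operatorname*{argmin}_{v} \lambda\, v^T v + \sum_i \big(y_i^{(2)} - v^T x_i^{(2)}\big)^2 \]
Both models have high error.
Omitting b for simplicity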

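24. Extension: Multi-task Learning
Two training sets, relatively small:
\[ S^{(1)} = \{(x_i^{(1)}, y_i^{(1)})\}, \qquad S^{(2)} = \{(x_i^{(2)}, y_i^{(2)})\} \]
Option 2: Train Jointly
\[ \operatorname*{argmin}_{w,v} \lambda\, w^T w + \sum_i \big(y_i^{(1)} - w^T x_i^{(1)}\big)^2 + \lambda\, v^T v + \sum_i \big(y_i^{(2)} - v^T x_i^{(2)}\big)^2 \]
Doesn't accomplish anything! (w & v don't depend on each other)
Omitting b for simplicity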

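25. Multi-task Regularization
\[ \operatorname*{argmin}_{w,v} \underbrace{\lambda\, w^T w + \lambda\, v^T v}_{\text{Standard Regularization}} + \underbrace{\gamma\,(w - v)^T (w - v)}_{\text{Multi-task Regularization}} + \underbrace{\sum_i \big(y_i^{(1)} - w^T x_i^{(1)}\big)^2 + \sum_i \big(y_i^{(2)} - v^T x_i^{(2)}\big)^2}_{\text{Training Loss}} \]
Prefer w & v to be "close", controlled by γ
Tasks similar: larger γ helps!
Tasks not identical: γ not too large
[Figure: Test Loss (Task 2) plotted against gamma.]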
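The effect of γ can be seen in a minimal sketch with scalar weights w and v and no bias. Setting both partial derivatives of the multi-task objective to zero gives a 2x2 linear system, solved here by Cramer's rule. The function name `multitask_1d` and the toy tasks are assumptions for illustration.

```python
def multitask_1d(task1, task2, lam, gamma):
    """Minimize lam*w^2 + lam*v^2 + gamma*(w - v)^2
    + sum over task1 of (y - w*x)^2 + sum over task2 of (y - v*x)^2
    for scalar w and v (requires lam + sum of x^2 > 0 for each task)."""
    a1 = sum(x * x for x, _ in task1)
    b1 = sum(x * y for x, y in task1)
    a2 = sum(x * x for x, _ in task2)
    b2 = sum(x * y for x, y in task2)
    # Stationarity conditions:
    #   (lam + gamma + a1) * w - gamma * v = b1
    #   -gamma * w + (lam + gamma + a2) * v = b2
    p = lam + gamma + a1
    q = lam + gamma + a2
    det = p * q - gamma * gamma
    w = (b1 * q + gamma * b2) / det
    v = (p * b2 + gamma * b1) / det
    return w, v

task1 = [(1.0, 1.0)]   # task 1 alone prefers w = 1
task2 = [(1.0, 3.0)]   # task 2 alone prefers v = 3
for gamma in [0.0, 1.0, 1000.0]:
    print(gamma, multitask_1d(task1, task2, lam=0.0, gamma=gamma))
```

With γ = 0 the two tasks are fit independently; as γ grows, w and v are pulled toward a shared value.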

26. Lasso: L1-Regularized Least-Squares

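27. L1 Regularized Least Squares
L2:
\[ \operatorname*{argmin}_{w} \lambda \|w\|^2 + \sum_{i=1}^{N} (y_i - w^T x_i)^2 \]
L1:
\[ \operatorname*{argmin}_{w} \lambda |w| + \sum_{i=1}^{N} (y_i - w^T x_i)^2 \]
[Slide example: side-by-side comparison of the L2 and L1 penalties at a few example weight values. The numbers are not recoverable.]
Omitting b for simplicity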

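28. Aside: Subgradient (sub-differential)
\[ \nabla_a R(a) = \{\, c \mid \forall a' : R(a') - R(a) \ge c\,(a' - a) \,\} \]
Differentiable: $\nabla_a R(a) = \partial_a R(a)$
L1:
\[ \nabla_{w_d} |w| = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases} \]
Continuous range for w = 0!
Omitting b for simplicity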

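29. L1 Regularized Least Squares
L2:
\[ \operatorname*{argmin}_{w} \lambda \|w\|^2 + \sum_{i=1}^{N} (y_i - w^T x_i)^2, \qquad \partial_{w_d} \|w\|^2 = 2 w_d \]
L1:
\[ \operatorname*{argmin}_{w} \lambda |w| + \sum_{i=1}^{N} (y_i - w^T x_i)^2, \qquad \nabla_{w_d} |w| = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases} \]
Omitting b for simplicity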
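The sub-differential of the absolute value can be written down directly and checked against the defining inequality R(a') − R(a) ≥ c (a' − a). The following is a minimal sketch; the interval representation, the helper names, and the probe points are illustrative assumptions.

```python
def abs_subgradient(w_d):
    """Sub-differential of |w_d|, returned as a closed interval (lo, hi)."""
    if w_d < 0:
        return (-1.0, -1.0)
    if w_d > 0:
        return (1.0, 1.0)
    return (-1.0, 1.0)   # a continuous range of valid slopes at 0

def is_subgradient(c, a, f, probes):
    """Check f(a') - f(a) >= c * (a' - a) at a handful of probe points a'."""
    return all(f(p) - f(a) >= c * (p - a) - 1e-12 for p in probes)

probes = [-2.0, -1.0, -0.1, 0.0, 0.1, 1.0, 2.0]
print(abs_subgradient(0.0))                       # (-1.0, 1.0)
print(is_subgradient(0.5, 0.0, abs, probes))      # True: 0.5 lies in [-1, +1]
print(is_subgradient(1.5, 0.0, abs, probes))      # False: slope too steep
```

Because 0 is inside the interval at w_d = 0, the regularizer can balance a non-zero loss gradient there, which is what allows exact zeros in the solution.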

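30. Lagrange Multipliers
\[ \operatorname*{argmin}_{w} L(y, w) \equiv (y - w^T x)^2 \quad \text{s.t.} \quad |w| \le c \]
\[ \nabla_{w_d} |w| = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases} \]
\[ \exists \lambda \ge 0 : \big(\partial_w L(y, w) \in -\lambda\, \nabla_w |w|\big) \wedge \big(|w| \le c\big) \]
Solutions tend to be at corners!
Omitting b & using 1 training data point for simplicity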

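31. Sparsity
w is sparse if mostly 0's: small L0 norm, $\|w\|_0 = \sum_d 1[w_d \neq 0]$
Why not L0 regularization? Not continuous!
\[ \operatorname*{argmin}_{w} \lambda \|w\|_0 + \sum_{i=1}^{N} (y_i - w^T x_i)^2 \]
L1 induces sparsity, and is continuous!
\[ \operatorname*{argmin}_{w} \lambda |w| + \sum_{i=1}^{N} (y_i - w^T x_i)^2 \]
Omitting b for simplicity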

32. Why is Sparsity Important?
Computational / Memory Efficiency
Dense: store 1M numbers in an array
Sparse: store 2 numbers per non-zero as (Index, Value) pairs, e.g., [ (50,1), (51,1) ]
Dot product $w^T x$ is more efficient
Sometimes the true w is sparse: want to recover the non-zero dimensions
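The efficiency argument can be sketched directly: with the weights stored as (index, value) pairs, the dot product touches only the non-zero entries. The function name `sparse_dot` and the dense vector `x` below are illustrative; the two weight pairs are the slide's example.

```python
def sparse_dot(w_sparse, x):
    """Dot product of a sparse w, stored as (index, value) pairs, with a dense x."""
    return sum(value * x[index] for index, value in w_sparse)

w_sparse = [(50, 1.0), (51, 1.0)]            # 2 non-zeros instead of 1M stored numbers
x = [float(i) for i in range(1_000_000)]     # dense 1M-dimensional input
print(sparse_dot(w_sparse, x))               # 50.0 + 51.0 = 101.0
```

The cost is proportional to the number of non-zero weights rather than to the dimension of x.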

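33. Lasso Guarantee
\[ \operatorname*{argmin}_{w} \lambda |w| + \sum_{i=1}^{N} (y_i - w^T x_i + b)^2 \]
Suppose data generated as: $y_i \sim \mathrm{Normal}(w^{*T} x_i, \sigma^2)$
Then if:
\[ \lambda > \frac{2}{\kappa} \sqrt{\frac{2 \sigma^2 \log D}{N}} \]
With high probability (increasing with N):
High precision parameter recovery: $\mathrm{Supp}(w) \subseteq \mathrm{Supp}(w^*)$
Sometimes high recall: $\forall d : |w^*_d| \ge \lambda c \;\Rightarrow\; \mathrm{Supp}(w) = \mathrm{Supp}(w^*)$
where $\mathrm{Supp}(w^*) = \{\, d \mid w^*_d \neq 0 \,\}$
See also: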

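34. [Figure: magnitude of the two weights (male? and age > 10?) under L1 regularization, as regularization shrinks.]
[Table: Person, Age>10, Male?, Height > 55" for Alice, Bob, Carol, Dave, Erin, Frank, Gena, Harold, Irene, John, Kelly, Larry. Cell values not recoverable.]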

35. Aside: Optimizing Lasso
Solving Lasso gives a sparse model
Will stochastic gradient descent find it? No! Hard to hit exactly 0 with gradient descent
Solution: Iterative Soft Thresholding
Intuition: if a gradient update passes 0, clamp at 0
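A minimal sketch of iterative soft thresholding for the Lasso objective λ|w| + Σ(y_i − wᵀx_i)², with the bias omitted. Each iteration takes a gradient step on the squared loss and then shrinks every coordinate toward 0, clamping at exactly 0 when the shrinkage would cross it. The step-size rule, iteration count, function names, and toy data are assumptions of this sketch, not the lecture's specification.

```python
def soft_threshold(z, t):
    """Shrink z toward 0 by t; clamp at exactly 0 if it would pass 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ista(data, lam, iters=500):
    dim = len(data[0][0])
    # Conservative step size 1 / (2 * sum_i ||x_i||^2), which upper-bounds the
    # curvature of the squared-loss term (an assumption of this sketch).
    step = 1.0 / (2.0 * sum(x_d * x_d for x, _ in data for x_d in x))
    w = [0.0] * dim
    for _ in range(iters):
        # Gradient of the squared-loss term only.
        g = [0.0] * dim
        for x, y in data:
            resid = y - sum(w_d * x_d for w_d, x_d in zip(w, x))
            for d in range(dim):
                g[d] -= 2.0 * resid * x[d]
        # Gradient step followed by soft thresholding at step * lam.
        w = [soft_threshold(w_d - step * g_d, step * lam)
             for w_d, g_d in zip(w, g)]
    return w

# Toy data: dimension 0 carries a strong signal, dimension 1 a weak one.
data = [([1.0, 0.0], 3.0), ([0.0, 1.0], 0.2)]
print(ista(data, lam=1.0))
```

On this toy problem the weakly supported coordinate is set to exactly 0.0 rather than to a small non-zero value, which plain gradient descent would not achieve.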

36. Recap: Lasso vs Ridge
Model Assumptions: Lasso learns a sparse weight vector
Predictive Accuracy: Lasso often not as accurate. Re-run Least Squares on the dimensions selected by Lasso
Ease of Inspection: sparse w's are easier to inspect
Ease of Optimization: Lasso somewhat trickier to optimize

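37. Recap: Regularization
L2:
\[ \operatorname*{argmin}_{w} \lambda \|w\|^2 + \sum_{i=1}^{N} (y_i - w^T x_i)^2 \]
L1 (Lasso):
\[ \operatorname*{argmin}_{w} \lambda |w| + \sum_{i=1}^{N} (y_i - w^T x_i)^2 \]
Multi-task:
\[ \operatorname*{argmin}_{w,v} \lambda\, w^T w + \lambda\, v^T v + \gamma\,(w - v)^T (w - v) + \sum_i \big(y_i^{(1)} - w^T x_i^{(1)}\big)^2 + \sum_i \big(y_i^{(2)} - v^T x_i^{(2)}\big)^2 \]
[Insert Yours Here!]
Omitting b for simplicity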

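38. Recap: Updated Pipeline
Training Data: $S = \{(x_i, y_i)\}_{i=1}^{N}$
Model Class: $f(x \mid w, b) = w^T x - b$
Loss Function: $L(a, b) = (a - b)^2$
\[ \operatorname*{argmin}_{w,b} \lambda\, w^T w + \sum_{i=1}^{N} L\big(y_i, f(x_i \mid w, b)\big) \]
Cross Validation & Model Selection
Profit!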

39. Next Lectures
Decision Trees
Bagging
Random Forests
Boosting
Ensemble Selection
No Recitation this Week
