Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso


Corey Hortense Cobb
5 years ago

1 Machine Learning & Data Mining, CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso

2 Recap: Complete Pipeline
Training data: $S = \{(x_i, y_i)\}_{i=1}^N$
Model class(es): $f(x \mid w, b) = w^T x - b$
Loss function: $L(a, b) = (a - b)^2$
Training: $\arg\min_{w,b} \sum_{i=1}^N L(y_i, f(x_i \mid w, b))$, solved with SGD!
Cross validation & model selection. Profit!

3 Different Model Classes?
Option 1: SVMs vs ANNs vs LR vs LS
Option 2: Regularization
Training: $\arg\min_{w,b} \sum_{i=1}^N L(y_i, f(x_i \mid w, b))$, solved with SGD!
Cross validation & model selection.

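4 Notation Part 1
L0 norm (not actually a norm), number of non-zero entries: $\|w\|_0 = \sum_d 1_{[w_d \neq 0]}$
L1 norm, sum of absolute values: $\|w\|_1 = \sum_d |w_d|$
L2 norm, sqrt(sum of squares): $\|w\|_2 = \sqrt{\sum_d w_d^2} = \sqrt{w^T w}$
Squared L2 norm, sum of squares: $\|w\|_2^2 = \sum_d w_d^2 = w^T w$
L-infinity norm, max absolute value: $\|w\|_\infty = \lim_{p \to \infty} \sqrt[p]{\sum_d |w_d|^p} = \max_d |w_d|$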
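
The definitions above can be checked numerically. The following is a minimal sketch in plain Python; the function names and the example vector `w` are illustrative and not part of the lecture. The last helper, `lp_norm`, is included only to show the p-norm approaching the max absolute value as p grows.

```python
import math

def l0_norm(w):
    """Number of non-zero entries (not actually a norm)."""
    return sum(1 for w_d in w if w_d != 0)

def l1_norm(w):
    """Sum of absolute values."""
    return sum(abs(w_d) for w_d in w)

def squared_l2_norm(w):
    """Sum of squares, i.e. w^T w."""
    return sum(w_d * w_d for w_d in w)

def l2_norm(w):
    """Square root of the sum of squares."""
    return math.sqrt(squared_l2_norm(w))

def linf_norm(w):
    """Max absolute value."""
    return max(abs(w_d) for w_d in w)

def lp_norm(w, p):
    """General p-norm; approaches linf_norm(w) as p grows."""
    return sum(abs(w_d) ** p for w_d in w) ** (1.0 / p)

w = [3.0, 0.0, -4.0]  # illustrative vector
print(l0_norm(w), l1_norm(w), squared_l2_norm(w), l2_norm(w), linf_norm(w))
print(lp_norm(w, 50))  # close to the max absolute value
```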

5 Notation Part 2
Minimizing squared loss = regression = least-squares (unless otherwise stated):
$\arg\min_w \sum_i (y_i - w^T x_i + b)^2$
E.g., logistic regression = log loss.

6 Ridge Regression
$\arg\min_{w,b} \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$
The first term is the regularization, the second is the training loss.
Aka L2-regularized regression.
Trades off model complexity vs training loss.
Each choice of λ is a model class. Will discuss this further later.
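
To make the trade-off concrete, here is a minimal sketch that assumes a single scalar feature and no bias term, so the ridge minimizer has the closed form $w = \sum_i x_i y_i / (\lambda + \sum_i x_i^2)$. The data and function names are made up for illustration.

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of lam*w^2 + sum_i (y_i - w*x_i)^2 for scalar w (no bias term)."""
    return sum(x * y for x, y in zip(xs, ys)) / (lam + sum(x * x for x in xs))

def training_loss(xs, ys, w):
    """Sum of squared residuals on the training set."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

# Illustrative data, roughly y = 2x plus noise.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

lambdas = [0.0, 1.0, 10.0, 100.0]
weights = [ridge_1d(xs, ys, lam) for lam in lambdas]
losses = [training_loss(xs, ys, w) for w in weights]

for lam, w, loss in zip(lambdas, weights, losses):
    print(f"lambda={lam:6.1f}  w={w:.4f}  training loss={loss:.4f}")
```

In this toy setting a larger λ shrinks the weight toward zero and raises the training loss, which is the complexity-versus-fit trade-off the slide describes.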

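7 $\arg\min_{w,b} \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$
Features: $x = (1_{[\text{age} > 10]}, 1_{[\text{gender} = \text{male}]})$. Label: $y = 1$ if height > 55", $y = 0$ if height ≤ 55".
[Figure: training loss and test loss as functions of lambda; learned $w$ and $b$ for larger lambda.]
[Table: Person, Age>10, Male?, Height > 55 for Alice, Bob, Carol, Dave, Erin, Frank, Gena, Harold, Irene, John, Kelly and Larry, split into Train and Test rows.]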

8 Updated Pipeline
Training data: $S = \{(x_i, y_i)\}_{i=1}^N$
Model class: $f(x \mid w, b) = w^T x - b$
Loss function: $L(a, b) = (a - b)^2$
Training: $\arg\min_{w,b} \lambda w^T w + \sum_{i=1}^N L(y_i, f(x_i \mid w, b))$
Cross validation & model selection. Profit!

Model score for increasing lambda (left to right); rows are split into Train and Test.
Person  Age>10  Male?  Height > 55  Model score
Alice   1       0      1            0.91  0.89  0.83  0.75  0.67
Bob     0       1      0            0.42  0.45  0.50  0.58  0.67
Carol   0       0      0            0.17  0.26  0.42  0.50  0.67
Dave    1       1      1            1.

9 [Table: model score for increasing lambda per person (Alice, Bob, Carol, Dave, Erin, Frank, Gena, Harold, Irene, John, Kelly, Larry), split into Train and Test rows, with the best test error marked.]

10 Choice of Lambda Depends on Training Size
[Figure: four panels of test loss vs lambda, one per number of training points.]
25 dimensional space. Randomly generated linear response function + noise.
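
Choosing lambda by cross validation can be sketched as follows. This is a simplified illustration, not the lecture's experiment: it assumes a single scalar feature with no bias (so ridge has a closed form), assigns folds by index modulo k, and uses made-up data and candidate values.

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of lam*w^2 + sum_i (y_i - w*x_i)^2 for scalar w (no bias term)."""
    return sum(x * y for x, y in zip(xs, ys)) / (lam + sum(x * x for x in xs))

def cross_validation_loss(xs, ys, lam, k=3):
    """Mean held-out squared loss over k folds (fold of point i is i mod k)."""
    total = 0.0
    for fold in range(k):
        train_x = [x for i, x in enumerate(xs) if i % k != fold]
        train_y = [y for i, y in enumerate(ys) if i % k != fold]
        w = ridge_1d(train_x, train_y, lam)
        total += sum((y - w * x) ** 2
                     for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold)
    return total / len(xs)

# Illustrative data and candidate lambdas.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.3, 5.9]
lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]

cv_losses = {lam: cross_validation_loss(xs, ys, lam) for lam in lambdas}
best_lambda = min(cv_losses, key=cv_losses.get)
print("selected lambda:", best_lambda, "cv loss:", cv_losses[best_lambda])
```

Which candidate wins depends on the data and on how many training points there are, which is the point of the slide.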

11 Recap: Ridge Regularization
Ridge regression: L2 regularized least-squares
$\arg\min_{w,b} \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$
Large λ ⇒ more stable predictions. Less likely to overfit to training data.
Too large λ ⇒ underfit.
Works with other losses: hinge loss, log loss, etc.

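12 Aside: Stochastic Gradient Descent
$\arg\min_{w,b} \tilde{L}(w, b) \equiv \lambda \|w\|^2 + \sum_{i=1}^N L(y_i, f(x_i \mid w, b))$
Per-example objective: $\tilde{L}_i(w, b) = \frac{1}{N} \lambda \|w\|^2 + L_i(w, b)$, so that $\tilde{L}(w, b) = \sum_{i=1}^N \tilde{L}_i(w, b)$.
$\frac{1}{N} \nabla \tilde{L}(w, b) = E_i\left[\nabla \tilde{L}_i(w, b)\right]$
Do SGD on this.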
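
A minimal sketch of SGD on the per-example objective $\frac{1}{N}\lambda\|w\|^2 + (y_i - w^T x_i)^2$ follows. Several choices here are assumptions for illustration only: the bias is omitted, the step size decays as `eta0 / (1 + decay * t)`, and the toy data is built so that $X^T X = 10 I$, which makes the exact ridge solution easy to write down for comparison.

```python
import random

def sgd_ridge(xs, ys, lam, steps=20000, eta0=0.02, decay=0.01, seed=0):
    """SGD on sum_i [ (lam/N)*||w||^2 + (y_i - w^T x_i)^2 ], no bias term."""
    rng = random.Random(seed)
    n, dim = len(xs), len(xs[0])
    w = [0.0] * dim
    for t in range(steps):
        i = rng.randrange(n)              # sample one training example
        eta = eta0 / (1.0 + decay * t)    # decaying step size (an assumption)
        residual = ys[i] - sum(w_d * x_d for w_d, x_d in zip(w, xs[i]))
        # Gradient of (lam/N)*||w||^2 + (y_i - w^T x_i)^2 with respect to w.
        grad = [2.0 * lam / n * w_d - 2.0 * x_d * residual
                for w_d, x_d in zip(w, xs[i])]
        w = [w_d - eta * g_d for w_d, g_d in zip(w, grad)]
    return w

# Toy data with orthogonal feature columns: X^T X = 10 * I.
xs = [[1.0, 1.0], [1.0, -1.0], [2.0, 2.0], [2.0, -2.0]]
ys = [3.1, -0.9, 5.8, -2.1]
lam = 1.0

w_sgd = sgd_ridge(xs, ys, lam)

# Exact ridge solution for this design: w_d = (X^T y)_d / (lam + 10).
xty = [sum(x[d] * y for x, y in zip(xs, ys)) for d in range(2)]
w_exact = [b_d / (lam + 10.0) for b_d in xty]
print("SGD:  ", w_sgd)
print("exact:", w_exact)
```

Because each step uses a noisy gradient, the SGD iterate only approximates the exact minimizer; the decaying step size keeps that noise small.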

13 Model Class Interpretation
$\arg\min_{w,b} \lambda w^T w + \sum_i L(y_i, f(x_i \mid w, b))$
This is not a model class! At least not what we've discussed...
It is an optimization procedure. Is there a connection?

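14 Norm Constrained Model Class
$f(x \mid w, b) = w^T x - b$ s.t. $w^T w \le c$, i.e. $\|w\|_2^2 \le c$
Visualization: constraint regions for c = 1, c = 2, c = 3.
$\arg\min_{w,b} \lambda w^T w + \sum_i L(y_i, f(x_i \mid w, b))$
[Figure: L2 norm of the learned $w$ as a function of lambda.] The norm constraint seems to correspond to lambda.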

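15 Lagrange Multipliers
$\arg\min_w L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \le c$
Optimality condition: gradients aligned at the constraint boundary!
$\exists \lambda \ge 0 : \left(\partial_w L(y, w) = -\lambda \, \partial_w w^T w\right) \wedge \left(w^T w \le c\right)$
Omitting b & using 1 training data point for simplicity.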

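16 Norm Constrained Model Class
Training: $\arg\min_w L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \le c$
Two conditions must be satisfied at optimality.
Observation about optimality: $\exists \lambda \ge 0 : \left(\partial_w L(y, w) = -\lambda \, \partial_w w^T w\right) \wedge \left(w^T w \le c\right)$
Lagrangian: $\arg\min_w \max_{\lambda \ge 0} \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$
Claim: solving the Lagrangian solves the norm-constrained training problem.
Optimality implication of the Lagrangian (satisfies the first condition!):
$\partial_w \Lambda(w, \lambda) = -2x(y - w^T x)^T + 2\lambda w \equiv 0 \Rightarrow 2x(y - w^T x)^T = 2\lambda w$
Omitting b & using 1 training data point for simplicity.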

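17 Norm Constrained Model Class
Training: $\arg\min_w L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \le c$
Two conditions must be satisfied at optimality.
Observation about optimality: $\exists \lambda \ge 0 : \left(\partial_w L(y, w) = -\lambda \, \partial_w w^T w\right) \wedge \left(w^T w \le c\right)$
Lagrangian: $\arg\min_w \max_{\lambda \ge 0} \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$
Claim: solving the Lagrangian solves the norm-constrained training problem.
Optimality implication of the Lagrangian (satisfies the 2nd condition!):
$\partial_\lambda \Lambda(w, \lambda) = \begin{cases} 0 & \text{if } w^T w < c \\ w^T w - c & \text{if } w^T w \ge c \end{cases} \equiv 0 \Rightarrow w^T w \le c$
Omitting b & using 1 training data point for simplicity.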

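18 Norm Constrained Model Class
Training: $\arg\min_w L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \le c$
L2 regularized training: $\arg\min_w \lambda w^T w + (y - w^T x)^2$
Lagrangian: $\arg\min_w \max_{\lambda \ge 0} \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$
Lagrangian = norm constrained training: $\exists \lambda \ge 0 : \left(\partial_w L(y, w) = -\lambda \, \partial_w w^T w\right) \wedge \left(w^T w \le c\right)$
Lagrangian = L2 regularized training: hold λ fixed. Equivalent to solving the norm constrained problem, for some c!
Omitting b & using 1 training data point for simplicity.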
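
The correspondence between λ and c can be checked numerically in a toy case. This sketch assumes a scalar weight and no bias; the helper names and data are illustrative. With a scalar weight the squared loss is a parabola in $w$, so the constrained minimizer is the unconstrained least-squares weight clipped to $[-\sqrt{c}, \sqrt{c}]$.

```python
import math

def ridge_1d(xs, ys, lam):
    """Minimizer of lam*w^2 + sum_i (y_i - w*x_i)^2 for scalar w (no bias term)."""
    return sum(x * y for x, y in zip(xs, ys)) / (lam + sum(x * x for x in xs))

def constrained_1d(xs, ys, c):
    """Minimizer of sum_i (y_i - w*x_i)^2 subject to w^2 <= c for scalar w."""
    w_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    bound = math.sqrt(c)
    # The loss is a convex parabola in w, so clip the least-squares weight.
    return max(-bound, min(bound, w_ls))

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

for lam in (0.5, 1.0, 10.0):
    w_ridge = ridge_1d(xs, ys, lam)
    c = w_ridge ** 2                      # the "some c" matching this lambda
    w_constrained = constrained_1d(xs, ys, c)
    print(f"lambda={lam:5.1f}  c={c:.4f}  ridge w={w_ridge:.4f}  constrained w={w_constrained:.4f}")
```

For each fixed λ, choosing $c = w_\lambda^2$ makes the two problems return the same weight in this example.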

19 Recap 2: Ridge Regularization
Ridge regression: L2 regularized least-squares = norm constrained model
$\arg\min_{w,b} \lambda w^T w + L(w)$ corresponds to $\arg\min_{w,b} L(w)$ s.t. $w^T w \le c$
Large λ ⇒ more stable predictions. Less likely to overfit to training data.
Too large λ ⇒ underfit.
Works with other losses: hinge loss, log loss, etc.

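20 Hallucinating Data Points
$\arg\min_w \lambda w^T w + \sum_{i=1}^N (y_i - w^T x_i)^2$
$\partial_w = 2\lambda w - 2 \sum_{i=1}^N x_i (y_i - w^T x_i)^T$
Instead hallucinate D data points? $\{(\sqrt{\lambda} e_d, 0)\}_{d=1}^D$, where $e_d = [0, \ldots, 0, 1, 0, \ldots, 0]^T$ is the unit vector along the d-th dimension.
$\arg\min_w \sum_{d=1}^D (0 - w^T \sqrt{\lambda} e_d)^2 + \sum_{i=1}^N (y_i - w^T x_i)^2$
$\partial_w = 2 \sum_{d=1}^D \sqrt{\lambda} e_d (w^T \sqrt{\lambda} e_d)^T - 2 \sum_{i=1}^N x_i (y_i - w^T x_i)^T$
$2 \sum_{d=1}^D \sqrt{\lambda} e_d (w^T \sqrt{\lambda} e_d)^T = 2 \sum_{d=1}^D \lambda e_d w_d = 2\lambda w$
Identical to regularization!
Omitting b for simplicity.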
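
The gradient identity above can be verified directly. The sketch below (plain Python, bias omitted, data and names made up for illustration) compares the gradient of the ridge objective with the gradient of plain squared loss on the data augmented with the D hallucinated points $(\sqrt{\lambda} e_d, 0)$.

```python
import math

def squared_loss_gradient(xs, ys, w):
    """Gradient of sum_i (y_i - w^T x_i)^2 with respect to w."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        residual = y - sum(w_d * x_d for w_d, x_d in zip(w, x))
        for d in range(len(w)):
            grad[d] -= 2.0 * x[d] * residual
    return grad

def ridge_gradient(xs, ys, w, lam):
    """Gradient of lam * w^T w + sum_i (y_i - w^T x_i)^2."""
    loss_grad = squared_loss_gradient(xs, ys, w)
    return [2.0 * lam * w_d + g_d for w_d, g_d in zip(w, loss_grad)]

def hallucinate(xs, ys, lam):
    """Append the D points (sqrt(lam) * e_d, 0) to the training set."""
    dim = len(xs[0])
    extra_xs = [[math.sqrt(lam) if j == d else 0.0 for j in range(dim)]
                for d in range(dim)]
    return xs + extra_xs, ys + [0.0] * dim

xs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
ys = [1.0, -2.0, 0.5]
w = [0.3, -1.2]   # arbitrary point at which to compare gradients
lam = 2.5

aug_xs, aug_ys = hallucinate(xs, ys, lam)
grad_ridge = ridge_gradient(xs, ys, w, lam)
grad_augmented = squared_loss_gradient(aug_xs, aug_ys, w)
print(grad_ridge)
print(grad_augmented)
```

The two gradients agree up to floating-point rounding, so least squares on the augmented set has the same minimizer as ridge on the original set.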

21 Extension: Multi-task Learning
2 prediction tasks: spam filter for Alice, spam filter for Bob.
Limited training data for both, but Alice is similar to Bob.

22 Extension: Multi-task Learning
Two training sets, N relatively small: $S^{(1)} = \{(x_i^{(1)}, y_i^{(1)})\}_{i=1}^N$, $S^{(2)} = \{(x_i^{(2)}, y_i^{(2)})\}_{i=1}^N$
Option 1: Train separately
$\arg\min_w \lambda w^T w + \sum_{i=1}^N (y_i^{(1)} - w^T x_i^{(1)})^2$
$\arg\min_v \lambda v^T v + \sum_{i=1}^N (y_i^{(2)} - v^T x_i^{(2)})^2$
Both models have high error.
Omitting b for simplicity.

23 Extension: Multi-task Learning
Two training sets, N relatively small: $S^{(1)} = \{(x_i^{(1)}, y_i^{(1)})\}_{i=1}^N$, $S^{(2)} = \{(x_i^{(2)}, y_i^{(2)})\}_{i=1}^N$
Option 2: Train jointly
$\arg\min_{w,v} \lambda w^T w + \sum_{i=1}^N (y_i^{(1)} - w^T x_i^{(1)})^2 + \lambda v^T v + \sum_{i=1}^N (y_i^{(2)} - v^T x_i^{(2)})^2$
Doesn't accomplish anything! ($w$ & $v$ don't depend on each other.)
Omitting b for simplicity.

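24 Multi-task Regularization
$\arg\min_{w,v} \lambda w^T w + \lambda v^T v + \gamma (w - v)^T (w - v) + \sum_{i=1}^N (y_i^{(1)} - w^T x_i^{(1)})^2 + \sum_{i=1}^N (y_i^{(2)} - v^T x_i^{(2)})^2$
The terms are, in order: standard regularization ($\lambda w^T w + \lambda v^T v$), multi-task regularization ($\gamma (w - v)^T (w - v)$), and training loss.
Prefer $w$ & $v$ to be close, controlled by γ.
Tasks similar ⇒ larger γ helps!
Tasks not identical ⇒ γ not too large.
[Figure: test loss (task 2) as a function of gamma.]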
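
The effect of γ can be seen in a small closed-form example. This sketch assumes scalar weights $w$ and $v$ and no bias, so setting both partial derivatives of the objective to zero gives a 2x2 linear system; the data and the function name `multitask_1d` are hypothetical.

```python
def multitask_1d(xs1, ys1, xs2, ys2, lam, gamma):
    """Minimize lam*w^2 + lam*v^2 + gamma*(w - v)^2
    + sum_i (y1_i - w*x1_i)^2 + sum_i (y2_i - v*x2_i)^2 over scalars w, v."""
    s1 = sum(x * x for x in xs1)
    s2 = sum(x * x for x in xs2)
    b1 = sum(x * y for x, y in zip(xs1, ys1))
    b2 = sum(x * y for x, y in zip(xs2, ys2))
    # Zero partial derivatives give the linear system
    #   (lam + gamma + s1) * w - gamma * v = b1
    #   -gamma * w + (lam + gamma + s2) * v = b2
    a = lam + gamma + s1
    d = lam + gamma + s2
    det = a * d - gamma * gamma
    w = (b1 * d + gamma * b2) / det
    v = (a * b2 + gamma * b1) / det
    return w, v

# Two small, similar tasks (illustrative data).
xs1, ys1 = [1.0, 2.0], [1.2, 1.9]
xs2, ys2 = [1.0, 3.0], [1.5, 3.3]
lam = 0.1

gammas = [0.0, 1.0, 10.0, 100.0]
solutions = [multitask_1d(xs1, ys1, xs2, ys2, lam, g) for g in gammas]
gaps = [abs(w - v) for w, v in solutions]
for g, (w, v), gap in zip(gammas, solutions, gaps):
    print(f"gamma={g:6.1f}  w={w:.4f}  v={v:.4f}  |w - v|={gap:.4f}")
```

With γ = 0 the two weights are the separate ridge solutions; as γ grows they are pulled toward each other.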

25 Lasso: L1-Regularized Least-Squares

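26 L1 Regularized Least Squares
L2: $\arg\min_w \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$
L1: $\arg\min_w \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$
[Figure: the two penalties compared at $w = 2$ vs $w = 1$ and at $w = 1$ vs $w = 0$.]
Omitting b for simplicity.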

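27 Aside: Subgradient (sub-differential)
$\nabla_a R(a) = \{ c \mid \forall a' : R(a') - R(a) \ge c (a' - a) \}$
Differentiable: $\nabla_a R(a) = \{ \partial_a R(a) \}$
L1: $\nabla_{w_d} \|w\|_1 = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$
Continuous range for $w = 0$!
Omitting b for simplicity.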
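
The subgradient definition can be spot-checked for $R(a) = |a|$. In this sketch the function names are illustrative, and `is_subgradient` tests the defining inequality only on a finite list of sample points, so it is a sanity check rather than a proof.

```python
def l1_subdifferential(w_d):
    """Return (low, high) bounds of the subdifferential of |w_d|."""
    if w_d < 0:
        return (-1.0, -1.0)
    if w_d > 0:
        return (1.0, 1.0)
    return (-1.0, 1.0)  # continuous range at w_d = 0

def is_subgradient(c, a, sample_points):
    """Check R(a') - R(a) >= c * (a' - a) for R = abs on the sample points."""
    return all(abs(a2) - abs(a) >= c * (a2 - a) - 1e-12 for a2 in sample_points)

points = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]

print(l1_subdifferential(-3.0))          # single value -1
print(l1_subdifferential(0.0))           # whole interval [-1, +1]
print(is_subgradient(0.3, 0.0, points))  # 0.3 lies in [-1, +1]
print(is_subgradient(1.5, 0.0, points))  # 1.5 lies outside [-1, +1]
```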

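28 L1 Regularized Least Squares
L2: $\arg\min_w \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$, with $\partial_{w_d} \|w\|_2^2 = 2 w_d$
L1: $\arg\min_w \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$, with $\nabla_{w_d} \|w\|_1 = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$
Omitting b for simplicity.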

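29 Lagrange Multipliers
$\arg\min_w L(y, w) \equiv (y - w^T x)^2$ s.t. $\|w\|_1 \le c$
$\nabla_{w_d} \|w\|_1 = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$
Solutions tend to be at corners!
$\exists \lambda \ge 0 : \left(\partial_w L(y, w) \in -\lambda \nabla_w \|w\|_1\right) \wedge \left(\|w\|_1 \le c\right)$
Omitting b & using 1 training data point for simplicity.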

30 Sparsity
$w$ is sparse if mostly 0's: small L0 norm, $\|w\|_0 = \sum_d 1_{[w_d \neq 0]}$
Why not L0 regularization? Not continuous!
$\arg\min_w \lambda \|w\|_0 + \sum_i (y_i - w^T x_i)^2$
L1 induces sparsity, and is continuous!
$\arg\min_w \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$
Omitting b for simplicity.

31 Why is Sparsity Important?
Computational / memory efficiency:
Dense: store 1M numbers in an array.
Sparse: store 2 numbers per non-zero, as (index, value) pairs, e.g. [ (50,1), (51,1) ].
The dot product $w^T x$ is more efficient.
Sometimes the true $w$ is sparse: want to recover the non-zero dimensions.
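
The (index, value) representation and the cheaper dot product can be sketched as below. The helper names are illustrative; the example reuses the slide's pairs at indices 50 and 51, and the feature vector `x` is made up.

```python
def to_sparse(w):
    """Convert a dense weight list to (index, value) pairs for non-zero entries."""
    return [(d, w_d) for d, w_d in enumerate(w) if w_d != 0]

def sparse_dot(sparse_w, x):
    """Dot product w^T x that touches only the non-zero entries of w."""
    return sum(value * x[d] for d, value in sparse_w)

dim = 1_000_000
w = [0.0] * dim          # dense storage: 1M numbers
w[50] = 1.0
w[51] = 1.0
sparse_w = to_sparse(w)  # sparse storage: 2 numbers per non-zero entry

x = [float(d % 7) for d in range(dim)]  # arbitrary dense feature vector

dense_result = sum(w_d * x_d for w_d, x_d in zip(w, x))  # 1M multiplications
sparse_result = sparse_dot(sparse_w, x)                  # 2 multiplications
print(sparse_w, dense_result, sparse_result)
```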

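32 Lasso Guarantee
$\arg\min_w \lambda \|w\|_1 + \sum_{i=1}^N (y_i - w^T x_i + b)^2$
Suppose data generated as: $y_i \sim \mathrm{Normal}(w^{*T} x_i, \sigma^2)$
Then if: $\lambda > \frac{2}{\kappa} \sqrt{\frac{2 \sigma^2 \log D}{N}}$
With high probability (increasing with N):
$\mathrm{Supp}(w) \subseteq \mathrm{Supp}(w^*)$ (high precision parameter recovery)
If $|w^*_d| \ge \lambda c$ for every non-zero $w^*_d$, then $\mathrm{Supp}(w) = \mathrm{Supp}(w^*)$ (sometimes high recall)
where $\mathrm{Supp}(w^*) = \{ d \mid w^*_d \neq 0 \}$
See also: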

33 [Figure: magnitude of the two weights (male? and age > 10?) under L1 and under L2, as regularization shrinks.]
[Table: Person, Age>10, Male?, Height > 55 for Alice, Bob, Carol, Dave, Erin, Frank, Gena, Harold, Irene, John, Kelly and Larry.]

34 Aside: Optimizing Lasso
Solving Lasso gives a sparse model. Will stochastic gradient descent find it?
No! Hard to hit exactly 0 with gradient descent.
Solution: Iterative Soft Thresholding.
Intuition: if a gradient update passes 0, clamp at 0.
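
One common form of iterative soft thresholding is the proximal gradient method (ISTA): take a gradient step on the squared loss, then shrink every weight toward zero by the step size times λ, clamping at 0 if the shrink would cross zero. The sketch below is an illustration under assumptions, not necessarily the exact variant intended in the lecture: the bias is omitted, the step size is fixed by hand, and the toy data has $X^T X = 10 I$ with a true second weight of zero.

```python
def soft_threshold(z, t):
    """Shrink z toward 0 by t; values that would cross 0 are clamped at 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ista_lasso(xs, ys, lam, eta=0.02, steps=500):
    """Minimize lam*||w||_1 + sum_i (y_i - w^T x_i)^2 by proximal gradient steps."""
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(steps):
        # Gradient of the squared-loss part only.
        grad = [0.0] * dim
        for x, y in zip(xs, ys):
            residual = y - sum(w_d * x_d for w_d, x_d in zip(w, x))
            for d in range(dim):
                grad[d] -= 2.0 * x[d] * residual
        # Gradient step, then soft thresholding with threshold eta * lam.
        w = [soft_threshold(w_d - eta * g_d, eta * lam)
             for w_d, g_d in zip(w, grad)]
    return w

# Toy data: y is roughly 1.5 * x_1, the second feature is irrelevant.
xs = [[1.0, 1.0], [1.0, -1.0], [2.0, 2.0], [2.0, -2.0]]
ys = [1.6, 1.4, 2.95, 3.1]

w_lasso = ista_lasso(xs, ys, lam=1.0)
print(w_lasso)  # the second weight is exactly 0.0
```

The step size must be small enough for the gradient step to be stable; here `eta=0.02` is below $1 / (2 \cdot 10)$, the reciprocal of the largest curvature of the squared-loss term for this data.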

35 Recap: Lasso vs Ridge
Model assumptions: Lasso learns a sparse weight vector.
Predictive accuracy: Lasso often not as accurate. Re-run least squares on the dimensions selected by Lasso.
Ease of inspection: sparse $w$'s are easier to inspect.
Ease of optimization: Lasso is somewhat trickier to optimize.

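36 Recap: Regularization
L2: $\arg\min_w \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$
L1 (Lasso): $\arg\min_w \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$
Multi-task: $\arg\min_{w,v} \lambda w^T w + \lambda v^T v + \gamma (w - v)^T (w - v) + \sum_i (y_i^{(1)} - w^T x_i^{(1)})^2 + \sum_i (y_i^{(2)} - v^T x_i^{(2)})^2$
[Insert Yours Here!]
Omitting b for simplicity.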

37 Recap: Updated Pipeline
Training data: $S = \{(x_i, y_i)\}_{i=1}^N$
Model class: $f(x \mid w, b) = w^T x - b$
Loss function: $L(a, b) = (a - b)^2$
Training: $\arg\min_{w,b} \lambda w^T w + \sum_{i=1}^N L(y_i, f(x_i \mid w, b))$
Cross validation & model selection. Profit!

38 Next Lectures
Decision Trees, Bagging, Random Forests, Boosting, Ensemble Selection.
Recitation tonight: Linear Algebra.
