Machine Learning & Data Mining, CS/CNS/EE 155
Lecture 4: Regularization, Sparsity & Lasso
Recap: Complete Pipeline
- Training Data: $S = \{(x_i, y_i)\}$
- Model Class(es): $f(x | w, b) = w^T x - b$
- Loss Function: $L(a, b) = (a - b)^2$
- Learning objective: $\min_{w,b} \sum_i L(y_i, f(x_i | w, b))$ (SGD!)
- Cross Validation & Model Selection
- Profit!
Different Model Classes?
- Option 1: different learners (SVMs vs. ANNs vs. LR vs. LS)
- Option 2: Regularization
$\min_{w,b} \sum_i L(y_i, f(x_i | w, b))$ (SGD!)
- Cross Validation & Model Selection
Notation Part 1
- L0 norm (not actually a norm): # of non-zero entries: $\|w\|_0 = \sum_d 1[w_d \neq 0]$
- L1 norm: sum of absolute values: $\|w\|_1 = \sum_d |w_d|$
- L2 norm: sqrt(sum of squares): $\|w\|_2 = \sqrt{w^T w}$
- Squared L2 norm: sum of squares: $\|w\|_2^2 = w^T w$
- L-infinity norm: max absolute value: $\|w\|_\infty = \lim_{p \to \infty} \left( \sum_d |w_d|^p \right)^{1/p} = \max_d |w_d|$
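A minimal sketch (not from the lecture) of these norms in NumPy; the example vector w is illustrative:

    import numpy as np

    w = np.array([0.0, -3.0, 4.0, 0.0])

    l0 = np.count_nonzero(w)      # "L0 norm": number of non-zero entries -> 2
    l1 = np.sum(np.abs(w))        # L1 norm: sum of absolute values       -> 7.0
    l2 = np.sqrt(w @ w)           # L2 norm: sqrt of sum of squares       -> 5.0
    l2_sq = w @ w                 # squared L2 norm: w^T w                -> 25.0
    linf = np.max(np.abs(w))      # L-infinity norm: max absolute value   -> 4.0

    print(l0, l1, l2, l2_sq, linf)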
Notation Part 2
- Minimizing Squared Loss = Regression = Least-Squares: $\min_{w,b} \sum_i (y_i - w^T x_i + b)^2$
- (Unless otherwise stated; e.g., Logistic Regression = Log Loss)
Ridge Regression
$\min_{w,b}\ \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$   (Regularization + Training Loss)
- aka L2-Regularized Regression
- Trades off model complexity vs. training loss
- Each choice of λ defines a model class
- Will discuss this further later
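A hedged sketch (not the lecture's code): omitting b, this objective has the closed-form solution $w = (X^T X + \lambda I)^{-1} X^T y$. The function name ridge_fit and the synthetic data below are illustrative:

    import numpy as np

    def ridge_fit(X, y, lam):
        """Minimize lam * w^T w + sum_i (y_i - w^T x_i)^2 in closed form."""
        D = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

    for lam in [0.01, 1.0, 100.0]:
        w = ridge_fit(X, y, lam)
        print(lam, w)   # weights shrink toward 0 as lam grows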
Example: Ridge Regression on a small dataset
- Features: $x = [\,1[\text{age} > 10],\ 1[\text{gender} = \text{male}]\,]$; Label: $y = 1[\text{height} > 55"]$
- Objective: $\min_{w,b}\ \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$
[Plot: training loss and test loss vs. λ (log scale); larger λ shrinks the weights]

Learned parameters as λ increases (w1 = weight on Age>10, w2 = weight on Male?):

w1     | w2     | b
0.7401 | 0.2441 | -0.1745
0.7122 | 0.2277 | -0.1967
0.6197 | 0.1765 | -0.2686
0.4124 | 0.0817 | -0.4196
0.1801 | 0.0161 | -0.5686
0.0001 | 0.0000 | -0.6666

Person | Age>10 | Male? | Height>55"
Alice  | 1 | 0 | 1
Bob    | 0 | 1 | 0
Carol  | 0 | 0 | 0
Dave   | 1 | 1 | 1
Erin   | 1 | 0 | 1
Frank  | 0 | 1 | 1
Gena   | 0 | 0 | 0
Harold | 1 | 1 | 1
Irene  | 1 | 0 | 0
John   | 0 | 1 | 1
Kelly  | 1 | 0 | 1
Larry  | 1 | 1 | 1
(split into a Train set and a Test set in the original slide)
Updated Pipeline
- Training Data: $S = \{(x_i, y_i)\}$
- Model Class: $f(x | w, b) = w^T x - b$
- Loss Function: $L(a, b) = (a - b)^2$
- Learning objective: $\min_{w,b}\ \lambda w^T w + \sum_i L(y_i, f(x_i | w, b))$
- Cross Validation & Model Selection
- Profit!
Model scores for increasing λ (left to right); Train/Test split as in the original slide:

Person | Age>10 | Male? | Height>55" | Model score (increasing λ →)
Alice  | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.75, 0.67
Bob    | 0 | 1 | 0 | 0.42, 0.45, 0.50, 0.58, 0.67
Carol  | 0 | 0 | 0 | 0.17, 0.26, 0.42, 0.50, 0.67
Dave   | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67
Erin   | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67
Frank  | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67
Gena   | 0 | 0 | 0 | 0.17, 0.27, 0.42, 0.50, 0.67
Harold | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67
Irene  | 1 | 0 | 0 | 0.91, 0.89, 0.83, 0.79, 0.67
John   | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67
Kelly  | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67
Larry  | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67

An intermediate λ achieves the best test error.
Choice of Lambda Depends on Training Size
[Four plots: test loss vs. λ (log scale) for 25, 50, 75, and 100 training points]
- 25-dimensional feature space
- Randomly generated linear response function + noise
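To make the λ-selection step concrete, here is a minimal sketch in the spirit of the plots above (my assumptions: synthetic 25-dimensional data and a single held-out validation set rather than full cross-validation; all names and sizes are illustrative):

    import numpy as np

    def ridge_fit(X, y, lam):
        D = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    rng = np.random.default_rng(1)
    D, N = 25, 50
    w_true = rng.normal(size=D)
    X = rng.normal(size=(N, D)); y = X @ w_true + rng.normal(size=N)            # train
    Xv = rng.normal(size=(200, D)); yv = Xv @ w_true + rng.normal(size=200)     # validation

    lambdas = np.logspace(-2, 2, 9)
    val_loss = [np.mean((yv - Xv @ ridge_fit(X, y, lam)) ** 2) for lam in lambdas]
    best = lambdas[int(np.argmin(val_loss))]
    print(best)   # re-running with different N shows the best lambda shifting, as in the plots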
Recap: Ridge Regularization
Ridge Regression: L2-Regularized Least-Squares
$\min_{w,b}\ \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$
- Large λ → more stable predictions
- Less likely to overfit to training data
- Too large λ → underfit
- Works with other losses (Hinge Loss, Log Loss, etc.)
Aside: Stochastic Gradient Descent
$\min_{w,b}\ L(w, b) = \lambda \|w\|^2 + \sum_i L(y_i, f(x_i | w, b)) = \sum_i \left[ \tfrac{1}{N} \lambda \|w\|^2 + L_i(w, b) \right]$
$\nabla L(w, b) = N \cdot E_i\!\left[ \nabla \tilde{L}_i(w, b) \right]$, where $\tilde{L}_i(w, b) = \tfrac{1}{N} \lambda \|w\|^2 + L_i(w, b)$
Do SGD on this!
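A minimal SGD sketch for this per-example objective (squared loss, b omitted; the step size, epoch count, function name, and data are illustrative assumptions, not the lecture's):

    import numpy as np

    def sgd_ridge(X, y, lam, eta=0.01, epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        N, D = X.shape
        w = np.zeros(D)
        for _ in range(epochs):
            for i in rng.permutation(N):
                grad_i = -2 * X[i] * (y[i] - w @ X[i])      # gradient of L_i
                grad_reg = (2 * lam / N) * w                # gradient of (1/N) * lam * ||w||^2
                w -= eta * (grad_i + grad_reg)
        return w

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, 0.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=100)
    print(sgd_ridge(X, y, lam=1.0))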
Model Class Interpretation
$\min_{w,b}\ \lambda w^T w + \sum_i L(y_i, f(x_i | w, b))$
- This is not a model class! (At least not what we've discussed so far...)
- It is an optimization procedure
- Is there a connection?
Norm-Constrained Model Class
$f(x | w, b) = w^T x - b$  s.t.  $w^T w \leq c$
[Visualization: L2-norm balls $w^T w \leq c$ for c = 1, 2, 3]
- c seems to correspond to λ in $\min_{w,b}\ \lambda w^T w + \sum_i L(y_i, f(x_i | w, b))$
[Plot: norm of the regularized solution vs. λ (log scale)]
Lagrange Multipliers
$\min_w\ L(y, w^T x) \equiv (y - w^T x)^2$  s.t.  $w^T w \leq c$
Optimality condition: gradients aligned at the constraint boundary!
$\exists \lambda \geq 0 : \left( \nabla_w L(y, w^T x) = \lambda \nabla_w (w^T w) \right) \wedge \left( w^T w \leq c \right)$
(Omitting b & using 1 training data point for simplicity)
http://en.wikipedia.org/wiki/Lagrange_multiplier
Norm-Constrained Model Class
Training: $\min_w\ L(y, w^T x) \equiv (y - w^T x)^2$  s.t.  $w^T w \leq c$
Two conditions must be satisfied at optimality:
$\exists \lambda \geq 0 : \left( \nabla_w L(y, w^T x) = \lambda \nabla_w (w^T w) \right) \wedge \left( w^T w \leq c \right)$
Lagrangian: $\min_w \max_{\lambda \geq 0}\ \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$
Claim: solving the Lagrangian solves the norm-constrained training problem.
Optimality implication of the Lagrangian (satisfies the first condition, gradients aligned):
$\nabla_w \Lambda(w, \lambda) = -2x(y - w^T x) + 2\lambda w \equiv 0\ \Rightarrow\ 2x(y - w^T x) = 2\lambda w$
(Omitting b & using 1 training data point for simplicity)
http://en.wikipedia.org/wiki/Lagrange_multiplier
Norm-Constrained Model Class (continued)
Optimality implication of the Lagrangian (satisfies the second condition):
$\partial_\lambda \Lambda(w, \lambda) = \begin{cases} 0 & \text{if } w^T w < c \\ w^T w - c & \text{if } w^T w \geq c \end{cases} \equiv 0\ \Rightarrow\ w^T w \leq c$
(Omitting b & using 1 training data point for simplicity)
http://en.wikipedia.org/wiki/Lagrange_multiplier
Norm-Constrained Model Class (equivalence)
Training: $\min_w\ L(y, w^T x) \equiv (y - w^T x)^2$  s.t.  $w^T w \leq c$
L2-Regularized Training: $\min_w\ \lambda w^T w + (y - w^T x)^2$
Lagrangian: $\min_w \max_{\lambda \geq 0}\ \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$
- Lagrangian = Norm-Constrained Training: $\exists \lambda \geq 0 : (\nabla_w L = \lambda \nabla_w (w^T w)) \wedge (w^T w \leq c)$
- Lagrangian = L2-Regularized Training: hold λ fixed
- So L2-regularized training is equivalent to norm-constrained training, for some c
(Omitting b & using 1 training data point for simplicity)
http://en.wikipedia.org/wiki/Lagrange_multiplier
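A small numerical check of this equivalence, a sketch of my own rather than anything from the slides: at the ridge solution the training-loss gradient is exactly $-2\lambda w$, i.e., aligned with the constraint gradient $2w$, and the corresponding constraint level is $c = w^T w$:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(30, 4))
    y = rng.normal(size=30)
    lam = 5.0

    w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)   # ridge solution

    grad_loss = -2 * X.T @ (y - X @ w)            # gradient of sum_i (y_i - w^T x_i)^2
    print(np.allclose(grad_loss, -2 * lam * w))   # True: aligned with the constraint gradient
    print(w @ w)                                  # the constraint level c this lambda corresponds to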
Recap 2: Ridge Regularization
Ridge Regression: L2-Regularized Least-Squares = Norm-Constrained Model
$\min_{w,b}\ \lambda w^T w + L(w)\ \ \equiv\ \ \min_{w,b}\ L(w)\ \text{s.t.}\ w^T w \leq c$
- Large λ → more stable predictions
- Less likely to overfit to training data
- Too large λ → underfit
- Works with other losses (Hinge Loss, Log Loss, etc.)
Hallucinating Data Points
$\nabla_w \left[ \lambda w^T w + \sum_i (y_i - w^T x_i)^2 \right] = 2\lambda w - \sum_i 2 x_i (y_i - w^T x_i)$
Instead, hallucinate D data points $\{ (\sqrt{\lambda}\, e_d, 0) \}_{d=1}^D$, where $e_d = [0\ \cdots\ 0\ 1\ 0\ \cdots\ 0]^T$ is the unit vector along the d-th dimension:
$\sum_{d=1}^D (0 - w^T \sqrt{\lambda}\, e_d)^2 + \sum_i (y_i - w^T x_i)^2$
$\nabla_w \sum_{d=1}^D (0 - w^T \sqrt{\lambda}\, e_d)^2 = \sum_{d=1}^D 2 \sqrt{\lambda}\, e_d (w^T \sqrt{\lambda}\, e_d) = \sum_{d=1}^D 2 \lambda\, w_d\, e_d = 2\lambda w$
Identical to regularization!
(Omitting b for simplicity)
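A quick numerical check (illustrative, not from the slides) that appending the D hallucinated points $(\sqrt{\lambda}\, e_d, 0)$ to the data reproduces the ridge solution exactly:

    import numpy as np

    rng = np.random.default_rng(4)
    N, D, lam = 30, 4, 2.5
    X = rng.normal(size=(N, D))
    y = rng.normal(size=N)

    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])   # D extra rows: sqrt(lam) * e_d
    y_aug = np.concatenate([y, np.zeros(D)])           # each with target 0
    w_ls, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

    print(np.allclose(w_ridge, w_ls))   # True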
Extension: Multi-task Learning
- 2 prediction tasks:
  - Spam filter for Alice
  - Spam filter for Bob
- Limited training data for both...
- ... but Alice is similar to Bob
Extension: Multi-task Learning
Two training sets, both relatively small: $S^{(1)} = \{ (x_i^{(1)}, y_i^{(1)}) \}$, $S^{(2)} = \{ (x_i^{(2)}, y_i^{(2)}) \}$
Option 1: Train Separately
$\min_w\ \lambda w^T w + \sum_i (y_i^{(1)} - w^T x_i^{(1)})^2$
$\min_v\ \lambda v^T v + \sum_i (y_i^{(2)} - v^T x_i^{(2)})^2$
Both models have high error.
(Omitting b for simplicity)
Extension: Multi-task Learning
Two training sets, both relatively small: $S^{(1)} = \{ (x_i^{(1)}, y_i^{(1)}) \}$, $S^{(2)} = \{ (x_i^{(2)}, y_i^{(2)}) \}$
Option 2: Train Jointly
$\min_{w,v}\ \lambda w^T w + \sum_i (y_i^{(1)} - w^T x_i^{(1)})^2 + \lambda v^T v + \sum_i (y_i^{(2)} - v^T x_i^{(2)})^2$
Doesn't accomplish anything! (w & v don't depend on each other)
(Omitting b for simplicity)
Multi-task Regularization
$\min_{w,v}\ \lambda w^T w + \lambda v^T v + \gamma (w - v)^T (w - v) + \sum_i (y_i^{(1)} - w^T x_i^{(1)})^2 + \sum_i (y_i^{(2)} - v^T x_i^{(2)})^2$
(Standard Regularization + Multi-task Regularization + Training Loss)
- Prefer w & v to be close, controlled by γ
- Tasks similar → larger γ helps!
- Tasks not identical → γ not too large
[Plot: test loss (Task 2) vs. γ (log scale)]
(Omitting b for simplicity)
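A sketch of my own (not the lecture's code) solving this joint objective in closed form: setting the gradients with respect to w and v to zero gives a block linear system in the stacked vector (w, v). The function name multitask_ridge, the data, and the chosen λ, γ are illustrative:

    import numpy as np

    def multitask_ridge(X1, y1, X2, y2, lam, gamma):
        D = X1.shape[1]
        I = np.eye(D)
        # Normal equations for the coupled objective, stacked as [w; v]
        A = np.block([[X1.T @ X1 + (lam + gamma) * I, -gamma * I],
                      [-gamma * I, X2.T @ X2 + (lam + gamma) * I]])
        b = np.concatenate([X1.T @ y1, X2.T @ y2])
        wv = np.linalg.solve(A, b)
        return wv[:D], wv[D:]

    rng = np.random.default_rng(5)
    D = 10
    w_star = rng.normal(size=D)
    v_star = w_star + 0.1 * rng.normal(size=D)        # task 2 is close to task 1
    X1 = rng.normal(size=(15, D)); y1 = X1 @ w_star + 0.1 * rng.normal(size=15)
    X2 = rng.normal(size=(15, D)); y2 = X2 @ v_star + 0.1 * rng.normal(size=15)

    w, v = multitask_ridge(X1, y1, X2, y2, lam=1.0, gamma=10.0)
    print(np.linalg.norm(w - v))   # larger gamma pulls the two weight vectors together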
Lasso: L1-Regularized Least-Squares
L1-Regularized Least Squares
L2: $\min_w\ \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$
L1: $\min_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$
[Plot: per-coordinate penalties $|w_d|$ (L1) vs. $w_d^2$ (squared L2) over $w_d \in [-2, 2]$; the slide also compares the two norms on example weight vectors]
(Omitting b for simplicity)
Aside: Subgradient (sub-differential)
$\partial_a R(a) = \{ c \mid \forall a': R(a') - R(a) \geq c (a' - a) \}$
If R is differentiable: $\partial_a R(a) = \{ \nabla_a R(a) \}$
L1: $\partial_{w_d} |w_d| = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$
Continuous range of subgradients at $w_d = 0$!
[Plot: $|w_d|$ over $w_d \in [-2, 2]$]
(Omitting b for simplicity)
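As a quick worked check of the definition at $w_d = 0$ (my expansion, not on the slide):
$\partial_{w_d} |w_d| \big|_{w_d = 0} = \{ c \mid \forall w': |w'| - |0| \geq c (w' - 0) \} = \{ c \mid \forall w': |w'| \geq c\, w' \} = [-1, +1]$,
since any $c > 1$ fails for $w' > 0$ and any $c < -1$ fails for $w' < 0$.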
L1-Regularized Least Squares
L2: $\min_w\ \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$, with $\partial_{w_d} \|w\|_2^2 = 2 w_d$
L1: $\min_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$, with $\partial_{w_d} \|w\|_1 = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$
[Plot: the two penalty functions over $w_d \in [-2, 2]$]
(Omitting b for simplicity)
Lagrange Multipliers
$\min_w\ L(y, w^T x) \equiv (y - w^T x)^2$  s.t.  $\|w\|_1 \leq c$
$\partial_{w_d} \|w\|_1 = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$
$\exists \lambda \geq 0 : \left( \nabla_w L(y, w^T x) \in \lambda\, \partial_w \|w\|_1 \right) \wedge \left( \|w\|_1 \leq c \right)$
Solutions tend to be at corners!
(Omitting b & using 1 training data point for simplicity)
http://en.wikipedia.org/wiki/Lagrange_multiplier
Sparsity
- w is sparse if it is mostly 0's: small L0 norm $\|w\|_0 = \sum_d 1[w_d \neq 0]$
- Why not L0 regularization? $\min_w\ \lambda \|w\|_0 + \sum_i (y_i - w^T x_i)^2$ is not continuous!
- L1 induces sparsity, and is continuous! $\min_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$
(Omitting b for simplicity)
Why is Sparsity Important?
- Computational / memory efficiency:
  - Instead of storing 1M numbers in an array, store 2 numbers per non-zero: (index, value) pairs
  - E.g., $w = [0\ \cdots\ 0\ 1\ 1\ 0\ \cdots\ 0]$ stored as [(50, 1), (51, 1)]
  - Dot product $w^T x$ is more efficient (see the sketch below)
- Sometimes the true w is sparse:
  - Want to recover the non-zero dimensions
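A minimal sketch of the (index, value) representation and the cheaper dot product described above (the data and function name are illustrative):

    import numpy as np

    def sparse_dot(w_sparse, x):
        """w stored as a list of (index, value) pairs; x is a dense vector."""
        return sum(value * x[index] for index, value in w_sparse)

    w_sparse = [(50, 1.0), (51, 1.0)]        # only 2 non-zeros out of 1M dimensions
    x = np.zeros(1_000_000)
    x[50], x[51], x[100] = 2.0, 3.0, 7.0

    print(sparse_dot(w_sparse, x))           # 5.0, touching only 2 entries of x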
Lasso Guarantee
$\min_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i + b)^2$
Suppose the data are generated as: $y_i \sim \text{Normal}(w^{*T} x_i, \sigma^2)$
Then if $\lambda > \frac{2}{\kappa} \sqrt{2 \sigma^2 \log D}$, with high probability (increasing with N):
- $\text{Supp}(\hat{w}) \subseteq \text{Supp}(w^*)$ (High Precision Parameter Recovery)
- Sometimes high recall: if $|w^*_d| \geq \lambda c$ for all $d \in \text{Supp}(w^*)$, then $\text{Supp}(\hat{w}) = \text{Supp}(w^*)$
where $\text{Supp}(w^*) = \{ d \mid w^*_d \neq 0 \}$.
See also:
https://www.cs.utexas.edu/~pradeepr/courses/395t-lt/filez/highdimii.pdf
http://www.eecs.berkeley.edu/~wainwrig/papers/Wai_sparseinfo09.pdf
[Plot: magnitude of the two weights (age>10 weight vs. male? weight) as regularization shrinks them, under L1 vs. L2 regularization]

Person | Age>10 | Male? | Height>55"
Alice  | 1 | 0 | 1
Bob    | 0 | 1 | 0
Carol  | 0 | 0 | 0
Dave   | 1 | 1 | 1
Erin   | 1 | 0 | 1
Frank  | 0 | 1 | 1
Gena   | 0 | 0 | 0
Harold | 1 | 1 | 1
Irene  | 1 | 0 | 0
John   | 0 | 1 | 1
Kelly  | 1 | 0 | 1
Larry  | 1 | 1 | 1
Aside: Optimizing the Lasso
- Solving the Lasso gives a sparse model
- Will stochastic gradient descent find it? No! It is hard to hit exactly 0 with gradient descent
- Solution: Iterative Soft Thresholding
- Intuition: if the gradient update passes 0, clamp at 0 (see the sketch below)
https://blogs.princeton.edu/imabandit/2013/04/11/orf523-ista-and-fista/
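A minimal iterative-soft-thresholding (ISTA) sketch for $\min_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$, bias omitted; the step-size rule, iteration count, and synthetic data are my illustrative assumptions:

    import numpy as np

    def soft_threshold(z, t):
        """Clamp to 0 anything a plain gradient step would push past 0."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def ista(X, y, lam, iters=500):
        N, D = X.shape
        eta = 1.0 / (2 * np.linalg.eigvalsh(X.T @ X).max())   # safe step size
        w = np.zeros(D)
        for _ in range(iters):
            grad = -2 * X.T @ (y - X @ w)                # gradient of the squared loss
            w = soft_threshold(w - eta * grad, eta * lam)
        return w

    rng = np.random.default_rng(6)
    X = rng.normal(size=(100, 20))
    w_star = np.zeros(20); w_star[[3, 7]] = [2.0, -1.5]   # sparse true weights
    y = X @ w_star + 0.1 * rng.normal(size=100)

    w_hat = ista(X, y, lam=5.0)
    print(np.nonzero(w_hat)[0])   # most dimensions end up exactly 0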
Recap: Lasso vs. Ridge
- Model assumptions: Lasso learns a sparse weight vector
- Predictive accuracy: Lasso is often not as accurate; re-run least squares on the dimensions selected by Lasso (see the sketch below)
- Ease of inspection: sparse w's are easier to inspect
- Ease of optimization: Lasso is somewhat trickier to optimize
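A sketch of the "re-run least squares on the dimensions Lasso selected" idea; w_lasso below is a hypothetical stand-in for the output of a Lasso solver such as the ISTA sketch above, and all other names and data are illustrative:

    import numpy as np

    def refit_on_support(X, y, w_lasso):
        """Ordinary least squares restricted to the non-zero dimensions of w_lasso."""
        support = np.nonzero(w_lasso)[0]
        w = np.zeros(X.shape[1])
        if support.size:
            w[support], *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        return w

    rng = np.random.default_rng(7)
    X = rng.normal(size=(50, 10))
    y = X @ np.array([2.0, 0, 0, -1.0, 0, 0, 0, 0, 0, 0]) + 0.1 * rng.normal(size=50)

    w_lasso = np.array([1.6, 0, 0, -0.7, 0, 0, 0, 0, 0, 0])   # shrunk Lasso-style estimate
    print(refit_on_support(X, y, w_lasso))                     # unshrunk weights on the support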
Recap: Regularization
- L2: $\min_w\ \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$
- L1 (Lasso): $\min_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$
- Multi-task: $\min_{w,v}\ \lambda w^T w + \lambda v^T v + \gamma (w - v)^T (w - v) + \sum_i (y_i^{(1)} - w^T x_i^{(1)})^2 + \sum_i (y_i^{(2)} - v^T x_i^{(2)})^2$
- [Insert yours here!]
(Omitting b for simplicity)
Recap: Updated Pipeline
- Training Data: $S = \{(x_i, y_i)\}$
- Model Class: $f(x | w, b) = w^T x - b$
- Loss Function: $L(a, b) = (a - b)^2$
- Learning objective: $\min_{w,b}\ \lambda w^T w + \sum_i L(y_i, f(x_i | w, b))$
- Cross Validation & Model Selection
- Profit!
Next Lectures
- Decision Trees
- Bagging
- Random Forests
- Boosting
- Ensemble Selection
Recitation tonight: Linear Algebra