Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso


Recap: Complete Pipeline
Training Data: $S = \{(x_i, y_i)\}$
Model Class(es): $f(x, w, b) = w^T x - b$
Loss Function: $L(a, b) = (a - b)^2$
Learning: $\operatorname{argmin}_{w,b} \sum_i L(y_i, f(x_i, w, b))$ → SGD!
Then Cross Validation / Model Selection → Profit!

Different Model Classes?
Option 1: SVMs vs. ANNs vs. LR vs. LS.
Option 2: Regularization.
$\operatorname{argmin}_{w,b} \sum_i L(y_i, f(x_i, w, b))$ → SGD! Then Cross Validation / Model Selection.

Notation Part 1
L0 norm (not actually a norm): number of non-zero entries, $\|w\|_0 = \lim_{p \to 0} \sum_d |w_d|^p = \sum_d 1_{[w_d \neq 0]}$
L1 norm: sum of absolute values, $\|w\|_1 = \sum_d |w_d|$
L2 norm: square root of the sum of squares, $\|w\|_2 = \sqrt{\sum_d w_d^2} = \sqrt{w^T w}$
Squared L2 norm: sum of squares, $\|w\|_2^2 = \sum_d w_d^2 = w^T w$
L-infinity norm: max absolute value, $\|w\|_\infty = \lim_{p \to \infty} \big(\sum_d |w_d|^p\big)^{1/p} = \max_d |w_d|$
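As a quick sanity check, here is a minimal NumPy sketch (not from the slides; the example vector is made up) that computes each of these norms for a small weight vector:

```python
import numpy as np

w = np.array([0.0, -3.0, 4.0])   # hypothetical weight vector

l0 = np.count_nonzero(w)         # L0 "norm": number of non-zero entries -> 2
l1 = np.sum(np.abs(w))           # L1 norm: sum of absolute values -> 7
l2 = np.sqrt(np.sum(w ** 2))     # L2 norm: sqrt of the sum of squares -> 5
l2_sq = np.dot(w, w)             # squared L2 norm: w^T w -> 25
linf = np.max(np.abs(w))         # L-infinity norm: max absolute value -> 4
```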

Notation Part 2
Minimizing squared loss = regression = least-squares: $\sum_i (y_i - w^T x_i + b)^2$ (unless otherwise stated).
E.g., logistic regression = log loss.

Ridge Regression
$\operatorname{argmin}_{w,b}\ \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$ (regularization term + training loss)
Also known as L2-regularized regression. It trades off model complexity against training loss; each choice of λ defines a model class. Will discuss this further later.
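For concreteness, a minimal sketch of solving this objective directly, assuming the offset b is omitted as in later slides: setting the gradient to zero gives $(\lambda I + X^T X)\, w = X^T y$. The data and function name below are purely illustrative, not from the course:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize lam * w^T w + sum_i (y_i - w^T x_i)^2 (offset b omitted for simplicity)."""
    d = X.shape[1]
    # Setting the gradient to zero gives (lam * I + X^T X) w = X^T y.
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# Hypothetical usage: larger lam shrinks the learned weights toward zero.
X = np.random.randn(100, 5)
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + 0.1 * np.random.randn(100)
for lam in (0.01, 1.0, 100.0):
    print(lam, np.round(ridge_fit(X, y, lam), 3))
```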

[Figure: training and test loss vs. λ (from 10^-2 to 10^1) for $\operatorname{argmin}_{w,b}\ \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$, with features $x = (1_{[\text{age}>10]},\ 1_{[\text{gender}=\text{male}]})$ and label $y = 1_{[\text{height} > 55'']}$; larger λ shrinks the weights.]
Learned parameters as λ grows, shown as (w_age>10, w_male?, b):
(0.7401, 0.2441, -0.1745), (0.7122, 0.2277, -0.1967), (0.6197, 0.1765, -0.2686), (0.4124, 0.0817, -0.4196), (0.1801, 0.0161, -0.5686), (0.0001, 0.0000, -0.6666)
Data (the slide splits the rows into train and test groups):
Person | Age>10 | Male? | Height>55"
Alice | 1 | 0 | 1
Bob | 0 | 1 | 0
Carol | 0 | 0 | 0
Dave | 1 | 1 | 1
Erin | 1 | 0 | 1
Frank | 0 | 1 | 1
Gena | 0 | 0 | 0
Harold | 1 | 1 | 1
Irene | 1 | 0 | 0
John | 0 | 1 | 1
Kelly | 1 | 0 | 1
Larry | 1 | 1 | 1

Updated Pipeline
Training Data: $S = \{(x_i, y_i)\}$
Model Class: $f(x, w, b) = w^T x - b$
Loss Function: $L(a, b) = (a - b)^2$
Learning: $\operatorname{argmin}_{w,b}\ \lambda w^T w + \sum_i L(y_i, f(x_i, w, b))$
Then Cross Validation / Model Selection → Profit!

Model scores as λ increases (the slide splits the rows into train and test groups and marks the column achieving the best test error):
Person | Age>10 | Male? | Height>55" | Scores for increasing λ
Alice | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.75, 0.67
Bob | 0 | 1 | 0 | 0.42, 0.45, 0.50, 0.58, 0.67
Carol | 0 | 0 | 0 | 0.17, 0.26, 0.42, 0.50, 0.67
Dave | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67
Erin | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67
Frank | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67
Gena | 0 | 0 | 0 | 0.17, 0.27, 0.42, 0.50, 0.67
Harold | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67
Irene | 1 | 0 | 0 | 0.91, 0.89, 0.83, 0.79, 0.67
John | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67
Kelly | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67
Larry | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67

Choice of Lambda Depends on Training Size
[Figure: test loss vs. λ (from 10^-2 to 10^2) for 25, 50, 75, and 100 training points, in a 25-dimensional space with a randomly generated linear response function plus noise; the λ that minimizes test loss shifts with the training set size.]

Recap: Ridge Regularization
Ridge Regression = L2-regularized least-squares: $\operatorname{argmin}_{w,b}\ \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$
Large λ → more stable predictions, less likely to overfit the training data. Too large λ → underfit.
Works with other losses as well (hinge loss, log loss, etc.).

Aside: Stochastic Gradient Descent
$\operatorname{argmin}_{w,b}\ L(w, b) = \lambda \|w\|^2 + \sum_i L(y_i, f(x_i, w, b)) = \sum_i \tilde{L}_i(w, b)$, where $\tilde{L}_i(w, b) = \tfrac{1}{N}\lambda \|w\|^2 + L_i(w, b)$
so $\tfrac{1}{N}\nabla L(w, b) = E_i\big[\nabla \tilde{L}_i(w, b)\big]$ → do SGD on this.
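A hedged sketch of what "do SGD on this" could look like in code, assuming squared loss and an omitted offset b; the function and parameter names are illustrative assumptions, not the course's reference implementation:

```python
import numpy as np

def sgd_ridge(X, y, lam, lr=0.01, epochs=50, seed=0):
    """Minimal SGD sketch for the L2-regularized squared loss (offset b omitted).

    The full objective lam*||w||^2 + sum_i (y_i - w^T x_i)^2 is rewritten as a sum of
    per-example terms (lam/N)*||w||^2 + (y_i - w^T x_i)^2, so each step uses one example.
    The step size lr may need tuning for a given dataset.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(N):
            resid = y[i] - w @ X[i]
            grad = (2.0 * lam / N) * w - 2.0 * resid * X[i]   # gradient of the i-th term
            w -= lr * grad
    return w
```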

Model Class Interpretation
$\operatorname{argmin}_{w,b}\ \lambda w^T w + \sum_i L(y_i, f(x_i, w, b))$
This is not a model class! At least not in the sense we've discussed so far; it is an optimization procedure. Is there a connection?

Norm Constrained Model Class
$f(x, w, b) = w^T x - b$ s.t. $w^T w \leq c$
[Figure: visualization of the L2 norm balls for c = 1, 2, 3, next to a plot of the solution norm vs. λ (from 10^-2 to 10^1) for $\operatorname{argmin}_{w,b}\ \lambda w^T w + \sum_i L(y_i, f(x_i, w, b))$.]
The constraint level c seems to correspond to λ.

Lagrange Multipliers
$\operatorname{argmin}_w\ L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \leq c$ (omitting b and using one training point for simplicity)
Optimality condition: the gradients are aligned at the constraint boundary,
$\exists \lambda \geq 0: \big(\nabla_w L(y, w) = -\lambda \nabla_w (w^T w)\big) \wedge \big(w^T w \leq c\big)$
[Figure: level sets of the loss and the constraint boundary $w^T w = c$, with the gradients aligned at the optimum.]
http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm Constrained Model Class
Training: $\operatorname{argmin}_w\ L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \leq c$ (omitting b and using one training point for simplicity)
Two conditions must be satisfied at optimality (the observation above):
$\exists \lambda \geq 0: \big(\nabla_w L(y, w) = -\lambda \nabla_w (w^T w)\big) \wedge \big(w^T w \leq c\big)$
Lagrangian: $\operatorname{argmin}_w \max_{\lambda \geq 0}\ \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$
Claim: solving the Lagrangian solves the norm-constrained training problem.
Optimality implication of the Lagrangian (first condition): $\nabla_w \Lambda(w, \lambda) = -2x(y - w^T x) + 2\lambda w = 0 \;\Rightarrow\; 2x(y - w^T x) = 2\lambda w$, so the first condition is satisfied.
http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm Constrained Model Class
Training: $\operatorname{argmin}_w\ L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \leq c$ (omitting b and using one training point for simplicity)
Two conditions must be satisfied at optimality:
$\exists \lambda \geq 0: \big(\nabla_w L(y, w) = -\lambda \nabla_w (w^T w)\big) \wedge \big(w^T w \leq c\big)$
Lagrangian: $\operatorname{argmin}_w \max_{\lambda \geq 0}\ \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$
Claim: solving the Lagrangian solves the norm-constrained training problem.
Optimality implication of the Lagrangian (second condition): $\partial_\lambda \Lambda(w, \lambda) = w^T w - c$. For the inner maximum over $\lambda \geq 0$ to be attained (rather than driven to infinity), we need $w^T w - c \leq 0$, i.e., $w^T w \leq c$; and if $w^T w < c$, the maximizing λ is 0. So the second condition is satisfied.
http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm Constrained Model Class
Training: $\operatorname{argmin}_w\ L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \leq c$
L2-regularized training: $\operatorname{argmin}_w\ \lambda w^T w + (y - w^T x)^2$
Lagrangian: $\operatorname{argmin}_w \max_{\lambda \geq 0}\ \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$
Lagrangian = norm-constrained training: $\exists \lambda \geq 0: \big(\nabla_w L(y, w) = -\lambda \nabla_w (w^T w)\big) \wedge \big(w^T w \leq c\big)$
Lagrangian = L2-regularized training: hold λ fixed.
So L2-regularized training is equivalent to solving the norm-constrained problem for some c. (Omitting b and using one training point for simplicity.)
http://en.wikipedia.org/wiki/Lagrange_multiplier

Recap 2: Ridge Regularization
Ridge Regression (L2-regularized least-squares) = norm-constrained model:
$\operatorname{argmin}_{w,b}\ \lambda w^T w + L(w) \quad\equiv\quad \operatorname{argmin}_{w,b}\ L(w) \ \text{s.t.}\ w^T w \leq c$
Large λ → more stable predictions, less likely to overfit the training data; too large λ → underfit.
Works with other losses (hinge loss, log loss, etc.).

Hallucinating Data Points
$\nabla_w \Big[ \lambda w^T w + \sum_i (y_i - w^T x_i)^2 \Big] = 2\lambda w - 2 \sum_i x_i (y_i - w^T x_i)$
Instead, hallucinate D extra data points $\{(\sqrt{\lambda}\, e_d,\ 0)\}_{d=1}^D$, where $e_d = (0, \ldots, 0, 1, 0, \ldots, 0)^T$ is the unit vector along the d-th dimension:
$\nabla_w \Big[ \sum_{d=1}^D (0 - w^T \sqrt{\lambda}\, e_d)^2 + \sum_i (y_i - w^T x_i)^2 \Big] = \sum_{d=1}^D 2\sqrt{\lambda}\, e_d (w^T \sqrt{\lambda}\, e_d) - 2\sum_i x_i (y_i - w^T x_i) = 2\lambda w - 2\sum_i x_i (y_i - w^T x_i)$
Identical to regularization! (Omitting b for simplicity.)
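This trick is easy to verify numerically. The sketch below (synthetic data; function name is an illustrative assumption) appends the hallucinated points $(\sqrt{\lambda}\, e_d, 0)$ as extra rows and solves plain least squares, which matches the ridge closed form:

```python
import numpy as np

def ridge_via_hallucination(X, y, lam):
    """Append D hallucinated points (sqrt(lam)*e_d, 0) and solve plain least squares."""
    D = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])   # one extra row per dimension
    y_aug = np.concatenate([y, np.zeros(D)])           # with target 0
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w

# Matches the ridge closed form (lam*I + X^T X)^{-1} X^T y up to numerical error.
X = np.random.randn(50, 4)
y = np.random.randn(50)
lam = 3.0
w_ridge = np.linalg.solve(lam * np.eye(4) + X.T @ X, X.T @ y)
print(np.allclose(ridge_via_hallucination(X, y, lam), w_ridge))  # True
```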

Extension: Multi-task Learning
Two prediction tasks: a spam filter for Alice and a spam filter for Bob.
Limited training data for both, but Alice is similar to Bob.

Extension: Multi-task Learning
Two training sets, both relatively small: $S^{(1)} = \{(x_i^{(1)}, y_i^{(1)})\}$ and $S^{(2)} = \{(x_i^{(2)}, y_i^{(2)})\}$
Option 1: train separately:
$\operatorname{argmin}_w\ \lambda w^T w + \sum_i \big(y_i^{(1)} - w^T x_i^{(1)}\big)^2$ and $\operatorname{argmin}_v\ \lambda v^T v + \sum_i \big(y_i^{(2)} - v^T x_i^{(2)}\big)^2$
Both models have high error. (Omitting b for simplicity.)

Extension: Multi-task Learning
Two training sets, both relatively small: $S^{(1)} = \{(x_i^{(1)}, y_i^{(1)})\}$ and $S^{(2)} = \{(x_i^{(2)}, y_i^{(2)})\}$
Option 2: train jointly:
$\operatorname{argmin}_{w,v}\ \lambda w^T w + \sum_i \big(y_i^{(1)} - w^T x_i^{(1)}\big)^2 + \lambda v^T v + \sum_i \big(y_i^{(2)} - v^T x_i^{(2)}\big)^2$
Doesn't accomplish anything! (w and v don't depend on each other.) (Omitting b for simplicity.)

Multi-task Regularization
$\operatorname{argmin}_{w,v}\ \lambda w^T w + \lambda v^T v + \gamma (w - v)^T (w - v) + \sum_i \big(y_i^{(1)} - w^T x_i^{(1)}\big)^2 + \sum_i \big(y_i^{(2)} - v^T x_i^{(2)}\big)^2$
(standard regularization + multi-task regularization + training loss)
Prefers w and v to be close, controlled by γ. If the tasks are similar, a larger γ helps; if the tasks are not identical, γ should not be too large. A gradient-descent sketch of this joint objective is given below.
[Figure: test loss on Task 2 vs. γ (from 10^-4 to 10^2), with a minimum at an intermediate γ.]
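As promised above, a gradient-descent sketch of the joint multi-task objective. This is an illustrative implementation (the step size, iteration count, and function name are made-up assumptions), not the course's reference code:

```python
import numpy as np

def multitask_fit(X1, y1, X2, y2, lam, gamma, lr=1e-3, steps=5000):
    """Gradient-descent sketch of
    lam*w'w + lam*v'v + gamma*(w-v)'(w-v) + ||y1 - X1 w||^2 + ||y2 - X2 v||^2.
    The step size lr may need tuning for a given dataset."""
    d = X1.shape[1]
    w = np.zeros(d)
    v = np.zeros(d)
    for _ in range(steps):
        gw = 2 * lam * w + 2 * gamma * (w - v) - 2 * X1.T @ (y1 - X1 @ w)
        gv = 2 * lam * v - 2 * gamma * (w - v) - 2 * X2.T @ (y2 - X2 @ v)
        w -= lr * gw
        v -= lr * gv
    return w, v
```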

Lasso: L1-Regularized Least-Squares

L1 Regularized Least Squares
L2: $\operatorname{argmin}_w\ \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$
L1: $\operatorname{argmin}_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$
[Figure: the per-coordinate penalty curves $w_d^2$ (L2) and $|w_d|$ (L1), comparing e.g. $w_d = 2$ vs. $w_d = 1$ vs. $w_d = 0$: the squared L2 penalty gives 4 vs. 1 vs. 0 and flattens out near zero, while the L1 penalty gives 2 vs. 1 vs. 0 and stays linear.]
(Omitting b for simplicity.)

Aside: Subgradient (sub-differential)
$\partial_a R(a) = \{\, c \mid \forall a' : R(a') - R(a) \geq c\,(a' - a) \,\}$
If R is differentiable: $\partial_a R(a) = \{\nabla_a R(a)\}$
L1: $\partial_{w_d} |w_d| = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$
A continuous range of subgradients at $w_d = 0$! (Omitting b for simplicity.)

L1 Regularized Least Squares
L2: $\operatorname{argmin}_w\ \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$, with $\partial_{w_d} \|w\|_2^2 = 2 w_d$
L1: $\operatorname{argmin}_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$, with $\partial_{w_d} \|w\|_1 = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$
(Omitting b for simplicity.)

Lagrange Multipliers
$\operatorname{argmin}_w\ L(y, w) \equiv (y - w^T x)^2$ s.t. $\|w\|_1 \leq c$ (omitting b and using one training point for simplicity)
$\partial_{w_d} \|w\|_1 = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$
Optimality: $\exists \lambda \geq 0: \big(\nabla_w L(y, w) \in -\lambda\, \partial_w \|w\|_1\big) \wedge \big(\|w\|_1 \leq c\big)$
Solutions tend to be at corners of the L1 ball!
http://en.wikipedia.org/wiki/Lagrange_multiplier

Sparsity
w is sparse if it is mostly 0's: small L0 norm, $\|w\|_0 = \sum_d 1_{[w_d \neq 0]}$
Why not L0 regularization? $\operatorname{argmin}_w\ \lambda \|w\|_0 + \sum_i (y_i - w^T x_i)^2$ is not continuous!
L1 induces sparsity and is continuous: $\operatorname{argmin}_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$
(Omitting b for simplicity.)

Why is Sparsity Important?
Computational / memory efficiency: instead of storing 1M numbers in an array, store 2 numbers per non-zero entry as (index, value) pairs, e.g., [(50, 1), (51, 1)] for $w = (0, \ldots, 0, 1, 1, 0, \ldots, 0)^T$. The dot product $w^T x$ then becomes much cheaper (see the sketch below).
Also, sometimes the true w is sparse, and we want to recover its non-zero dimensions.
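As referenced above, a small sketch of the (index, value) representation and the resulting cheap dot product; the indices mirror the slide's example and are otherwise arbitrary:

```python
def sparse_dot(w_sparse, x):
    """Dot product when w is stored as (index, value) pairs for its non-zeros only."""
    return sum(value * x[index] for index, value in w_sparse)

# E.g., a 1M-dimensional w with only two non-zeros, as on the slide.
w_sparse = [(50, 1.0), (51, 1.0)]
x = [0.0] * 1_000_000
x[50], x[51] = 2.0, 3.0
print(sparse_dot(w_sparse, x))  # 5.0 -- touches 2 entries instead of 1M
```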

Lasso Guarantee
$\operatorname{argmin}_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i + b)^2$
Suppose the data are generated as $y_i \sim \text{Normal}(w^{*T} x_i,\ \sigma^2)$. Then if $\lambda > \frac{2}{\kappa}\sqrt{\frac{2\sigma^2 \log D}{N}}$, with high probability (increasing with N):
$\text{Supp}(\hat{w}) \subseteq \text{Supp}(w^*)$ (high precision);
and if additionally $\forall d \in \text{Supp}(w^*): |w^*_d| \geq \lambda c'$, then $\text{Supp}(\hat{w}) = \text{Supp}(w^*)$ (sometimes high recall),
where $\text{Supp}(w^*) = \{ d : w^*_d \neq 0 \}$ and "parameter recovery" refers to identifying these dimensions.
See also: https://www.cs.utexas.edu/~pradeepr/courses/395t-lt/filez/highdimii.pdf and http://www.eecs.berkeley.edu/~wainwrig/papers/Wai_SparseInfo09.pdf

[Figure: magnitude of the two weights (age>10 on the x-axis, male? on the y-axis) as the regularization shrinks, comparing the L1 and L2 regularization paths, on the same Alice-through-Larry height-prediction data table shown earlier.]

Aside: Optimizing Lasso
Solving Lasso gives a sparse model. Will stochastic gradient descent find it? No! It is hard to hit exactly 0 with gradient descent.
Solution: Iterative Soft Thresholding. Intuition: if a gradient update would pass through 0, clamp it at 0.
https://blogs.princeton.edu/imabandit/2013/04/11/orf523-ista-and-fista/
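A minimal sketch of iterative soft thresholding for the objective $\lambda\|w\|_1 + \sum_i (y_i - w^T x_i)^2$, assuming b is omitted; the step-size choice and function names are illustrative assumptions, not the exact algorithm from the linked notes:

```python
import numpy as np

def soft_threshold(z, tau):
    """Shrink each coordinate toward zero; anything that would cross zero is clamped at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(X, y, lam, lr=None, steps=1000):
    """Iterative soft thresholding for lam * ||w||_1 + sum_i (y_i - w^T x_i)^2 (b omitted)."""
    if lr is None:
        lr = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = -2.0 * X.T @ (y - X @ w)               # gradient of the squared-loss term only
        w = soft_threshold(w - lr * grad, lr * lam)   # proximal step handles the L1 term
    return w
```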

Recap: Lasso vs Ridge
Model assumptions: Lasso learns a sparse weight vector.
Predictive accuracy: Lasso is often not as accurate; one remedy is to re-run least squares on just the dimensions selected by Lasso (see the sketch below).
Ease of inspection: sparse w's are easier to inspect.
Ease of optimization: Lasso is somewhat trickier to optimize.
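A sketch of that "fit Lasso, then re-fit least squares on the selected support" recipe, reusing the hypothetical ista() from the previous sketch; the threshold and function name are illustrative assumptions:

```python
import numpy as np

def lasso_then_refit(X, y, lam, tol=1e-8):
    """Use Lasso only for feature selection, then re-fit least squares on that support.
    Assumes the ista() sketch above is in scope."""
    w_lasso = ista(X, y, lam)
    support = np.flatnonzero(np.abs(w_lasso) > tol)   # dimensions Lasso kept non-zero
    w = np.zeros(X.shape[1])
    if support.size > 0:
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        w[support] = sol                              # unregularized re-fit on the selected dims
    return w
```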

Recap: Regularization
L2: $\operatorname{argmin}_w\ \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$
L1 (Lasso): $\operatorname{argmin}_w\ \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$
Multi-task: $\operatorname{argmin}_{w,v}\ \lambda w^T w + \lambda v^T v + \gamma (w - v)^T (w - v) + \sum_i \big(y_i^{(1)} - w^T x_i^{(1)}\big)^2 + \sum_i \big(y_i^{(2)} - v^T x_i^{(2)}\big)^2$
[Insert yours here!]
(Omitting b for simplicity.)

Recap: Updated Pipeline
Training Data: $S = \{(x_i, y_i)\}$
Model Class: $f(x, w, b) = w^T x - b$
Loss Function: $L(a, b) = (a - b)^2$
Learning: $\operatorname{argmin}_{w,b}\ \lambda w^T w + \sum_i L(y_i, f(x_i, w, b))$
Then Cross Validation / Model Selection → Profit!

Next Lectures: Decision Trees, Bagging, Random Forests, Boosting, Ensemble Selection.
Recitation tonight: Linear Algebra.