Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso

Transcription:

Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso

Recap: Complete Pipeline.
Training Data: S = {(x_i, y_i)}
Model Class(es): f(x | w, b) = w^T x - b
Loss Function: L(a, b) = (a - b)^2
Learning: argmin_{w,b} Σ_i L(y_i, f(x_i | w, b)) via SGD!
Cross Validation & Model Selection. Profit!

Different Model Classes?
Option 1: SVMs vs ANNs vs LR vs LS.
Option 2: Regularization: argmin_{w,b} Σ_i L(y_i, f(x_i | w, b)) via SGD! Cross Validation & Model Selection.

Notation Part 1.
L0 Norm (not actually a norm): number of non-zero entries, ||w||_0 = Σ_d 1[w_d ≠ 0]
L1 Norm: sum of absolute values, ||w||_1 = Σ_d |w_d|
L2 Norm: sqrt(sum of squares), ||w||_2 = sqrt(Σ_d w_d^2) = sqrt(w^T w)
Squared L2 Norm: sum of squares, ||w||_2^2 = Σ_d w_d^2 = w^T w
L-infinity Norm: max absolute value, ||w||_∞ = lim_{p→∞} (Σ_d |w_d|^p)^{1/p} = max_d |w_d|
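A minimal NumPy sketch of these norms (the vector w below is an arbitrary made-up example, not from the slides):

```python
import numpy as np

w = np.array([0.0, -3.0, 0.5, 0.0, 2.0])   # arbitrary example vector

l0 = np.count_nonzero(w)                   # L0 "norm": number of non-zero entries
l1 = np.sum(np.abs(w))                     # L1 norm: sum of absolute values
l2 = np.sqrt(np.sum(w ** 2))               # L2 norm: sqrt of sum of squares (= sqrt(w @ w))
l2_sq = np.sum(w ** 2)                     # squared L2 norm: w @ w
linf = np.max(np.abs(w))                   # L-infinity norm: max absolute value

print(l0, l1, l2, l2_sq, linf)             # 3, 5.5, ~3.64, 13.25, 3.0
```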

Notation Part 2. Minimizing Squared Loss = Regression = Least-Squares: argmin_{w,b} Σ_i (y_i - w^T x_i + b)^2 (unless otherwise stated). E.g., Logistic Regression = Log Loss.

Ridge Regression, aka L2-Regularized Regression: argmin_{w,b} λ w^T w + Σ_i (y_i - w^T x_i + b)^2 (Regularization term + Training Loss). Trades off model complexity vs training loss. Each choice of λ is a model class. Will discuss this further later.
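A minimal sketch of ridge regression in closed form, assuming synthetic data; the bias b is absorbed as an extra constant feature and left unpenalized, and the choice lam = 1.0 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)   # synthetic linear data + noise

Xb = np.hstack([X, np.ones((N, 1))])    # constant column plays the role of the bias b
lam = 1.0                               # regularization strength lambda

# Closed form: w = (lam*I + X^T X)^{-1} X^T y, with the bias column left unpenalized
reg = lam * np.eye(D + 1)
reg[-1, -1] = 0.0                       # do not regularize the bias term
w = np.linalg.solve(reg + Xb.T @ Xb, Xb.T @ y)
print(w)
```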

Example: features x = (1[age>10], 1[gender=male]), label y = 1 if height > 55", 0 if height ≤ 55". Objective: argmin_{w,b} λ w^T w + Σ_i (y_i - w^T x_i + b)^2.
[Figure: Training Loss and Test Loss vs lambda, for lambda from 10^-2 to 10^1.]
Learned (w_1, w_2, b) as lambda grows:
(0.7401, 0.2441, -0.1745), (0.7122, 0.2277, -0.1967), (0.6197, 0.1765, -0.2686), (0.4124, 0.0817, -0.4196), (0.1801, 0.0161, -0.5686), (0.0001, 0.0000, -0.6666).
Data (split into Train and Test):
Person | Age>10 | Male? | Height>55"
Alice  | 1 | 0 | 1
Bob    | 0 | 1 | 0
Carol  | 0 | 0 | 0
Dave   | 1 | 1 | 1
Erin   | 1 | 0 | 1
Frank  | 0 | 1 | 1
Gena   | 0 | 0 | 0
Harold | 1 | 1 | 1
Irene  | 1 | 0 | 0
John   | 0 | 1 | 1
Kelly  | 1 | 0 | 1
Larry  | 1 | 1 | 1

Updated Pipeline.
Training Data: S = {(x_i, y_i)}
Model Class: f(x | w, b) = w^T x - b
Loss Function: L(a, b) = (a - b)^2
Learning: argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b)), choosing λ via Cross Validation & Model Selection. Profit!
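A minimal sketch of choosing λ by K-fold cross validation; the synthetic data, the grid of lambdas, and K = 5 are all assumptions, and the bias is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 100, 10, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.5 * rng.normal(size=N)

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares (bias omitted for brevity)."""
    return np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)

folds = np.array_split(rng.permutation(N), K)
best_lam, best_loss = None, np.inf
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    losses = []
    for k in range(K):
        val = folds[k]
        trn = np.setdiff1d(np.arange(N), val)
        w = ridge_fit(X[trn], y[trn], lam)
        losses.append(np.mean((y[val] - X[val] @ w) ** 2))   # held-out squared loss
    if np.mean(losses) < best_loss:
        best_lam, best_loss = lam, np.mean(losses)
print("chosen lambda:", best_lam)
```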

Model scores f(x) per person as lambda increases (same Train/Test split); the best test error occurs at an intermediate lambda.
Person | Age>10 | Male? | Height>55" | Scores for increasing lambda
Alice  | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.75, 0.67
Bob    | 0 | 1 | 0 | 0.42, 0.45, 0.50, 0.58, 0.67
Carol  | 0 | 0 | 0 | 0.17, 0.26, 0.42, 0.50, 0.67
Dave   | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67
Erin   | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67
Frank  | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67
Gena   | 0 | 0 | 0 | 0.17, 0.27, 0.42, 0.50, 0.67
Harold | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67
Irene  | 1 | 0 | 0 | 0.91, 0.89, 0.83, 0.79, 0.67
John   | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67
Kelly  | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67
Larry  | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67

Choice of Lambda Depends on Training Size. [Figure: four panels of test loss vs lambda, for 25, 50, 75, and 100 training points.] Setup: 25-dimensional space, randomly generated linear response function + noise.

Recap: Ridge Regularization. Ridge Regression = L2-Regularized Least-Squares: argmin_{w,b} λ w^T w + Σ_i (y_i - w^T x_i + b)^2. Large λ → more stable predictions, less likely to overfit the training data. Too large λ → underfit. Works with other losses: Hinge Loss, Log Loss, etc.

Aside: Stochastic Gradient Descent. Objective: argmin_{w,b} J(w, b) = λ ||w||^2 + Σ_i L(y_i, f(x_i | w, b)). Define the per-example objective J_i(w, b) = (1/N) λ ||w||^2 + L(y_i, f(x_i | w, b)), so that (1/N) ∇J(w, b) = E_i[ ∇J_i(w, b) ]. Do SGD on this.
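A minimal SGD sketch for this objective, with each sampled example carrying a (1/N)·λ share of the regularizer as on the slide; the synthetic data, step size, and epoch count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 200, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

lam, eta, epochs = 1.0, 0.01, 50
w, b = np.zeros(D), 0.0
for _ in range(epochs):
    for i in rng.permutation(N):
        r = y[i] - (X[i] @ w - b)                        # residual for f(x) = w^T x - b
        grad_w = (2.0 / N) * lam * w - 2.0 * r * X[i]    # per-example gradient incl. (1/N)*lam*||w||^2
        grad_b = 2.0 * r                                 # d/db of (y - w^T x + b)^2
        w -= eta * grad_w
        b -= eta * grad_b
print(w, b)
```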

Model Class Interpretation. argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b)). This is not a model class! At least not what we've discussed... it is an optimization procedure. Is there a connection?

Norm Constrained Model Class: f(x | w, b) = w^T x - b, subject to w^T w ≤ c. [Figure: visualization of the L2 norm balls w^T w ≤ c for c = 1, 2, 3.] Seems to correspond to λ in argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b)). [Figure: norm of the learned w vs lambda.]

Lagrange Multipliers. argmin_w L(y, w), where L(y, w) = (y - w^T x)^2, subject to w^T w ≤ c. Optimality condition: gradients aligned at the constraint boundary! ∃ λ ≥ 0 : (∇_w L(y, w) = -λ ∇_w(w^T w)) ∧ (w^T w ≤ c). (Omitting b & using 1 training data point for simplicity.) http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm Constrained Model Class. Training: argmin_w L(y, w) = (y - w^T x)^2 subject to w^T w ≤ c. Two conditions must be satisfied at optimality: ∃ λ ≥ 0 : (∇_w L(y, w) = -λ ∇_w(w^T w)) ∧ (w^T w ≤ c). Lagrangian: argmin_w max_{λ≥0} Λ(w, λ) = (y - w^T x)^2 + λ (w^T w - c). Claim: solving the Lagrangian solves the norm-constrained training problem. Optimality implication of the Lagrangian: ∇_w Λ(w, λ) = -2 x (y - w^T x) + 2 λ w ≡ 0 → -2 x (y - w^T x) = -2 λ w, which satisfies the first condition! (Omitting b & using 1 training data point for simplicity.) http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm Constrained Model Class (continued). Training: argmin_w L(y, w) = (y - w^T x)^2 subject to w^T w ≤ c. Two conditions must be satisfied at optimality: ∃ λ ≥ 0 : (∇_w L(y, w) = -λ ∇_w(w^T w)) ∧ (w^T w ≤ c). Lagrangian: argmin_w max_{λ≥0} Λ(w, λ) = (y - w^T x)^2 + λ (w^T w - c). Claim: solving the Lagrangian solves the norm-constrained training problem. Optimality implication of the Lagrangian for λ: ∂_λ Λ(w, λ) = 0 if w^T w < c (the optimal λ is 0), and = w^T w - c if w^T w ≥ c; setting this ≡ 0 forces w^T w ≤ c, which satisfies the second condition! (Omitting b & using 1 training data point for simplicity.) http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm Constrained Model Class. Training: argmin_w L(y, w) = (y - w^T x)^2 subject to w^T w ≤ c. L2 Regularized Training: argmin_w λ w^T w + (y - w^T x)^2. Lagrangian: argmin_w max_{λ≥0} Λ(w, λ) = (y - w^T x)^2 + λ (w^T w - c). Lagrangian = Norm Constrained Training: ∃ λ ≥ 0 : (∇_w L(y, w) = -λ ∇_w(w^T w)) ∧ (w^T w ≤ c). Lagrangian = L2 Regularized Training: hold λ fixed, and the λ (w^T w - c) term is λ w^T w plus a constant. So L2 regularized training is equivalent to solving the norm-constrained problem, for some c. (Omitting b & using 1 training data point for simplicity.) http://en.wikipedia.org/wiki/Lagrange_multiplier
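A quick numeric sketch of the correspondence: each λ in the regularized problem implicitly picks out a norm budget c = w(λ)^T w(λ) for the constrained problem (synthetic data, bias omitted):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 50, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # ridge solution for this lambda
    w = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)
    # the implied norm budget c shrinks as lambda grows
    print(f"lambda={lam:7.2f}  implied c = w^T w = {w @ w:.4f}")
```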

Recap 2: Ridge Regularization. Ridge Regression = L2 Regularized Least-Squares = Norm Constrained Model: argmin_{w,b} λ w^T w + L(w) ⇔ argmin_{w,b} L(w) subject to w^T w ≤ c. Large λ → more stable predictions, less likely to overfit the training data. Too large λ → underfit. Works with other losses: Hinge Loss, Log Loss, etc.

Hallucinating Data Points. Ridge objective: argmin_w λ w^T w + Σ_i (y_i - w^T x_i)^2, with gradient 2 λ w - Σ_i 2 x_i (y_i - w^T x_i). What if instead we hallucinate D extra data points {(sqrt(λ) e_d, 0)}_{d=1}^D, where e_d = (0, ..., 0, 1, 0, ..., 0) is the unit vector along the d-th dimension? Unregularized objective: Σ_{d=1}^D (0 - sqrt(λ) e_d^T w)^2 + Σ_i (y_i - w^T x_i)^2. The gradient of the hallucinated term is Σ_d 2 λ e_d (e_d^T w) = 2 λ w, identical to regularization! (Omitting b for simplicity.)
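A minimal check of this trick on synthetic data: appending the D hallucinated rows sqrt(λ)·e_d with target 0 to a plain least-squares problem reproduces the ridge solution (bias omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, lam = 30, 4, 2.0
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# Ridge solution: argmin_w lam * w^T w + ||y - X w||^2
w_ridge = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

# Plain least squares on the augmented data {(sqrt(lam) e_d, 0)}
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
y_aug = np.concatenate([y, np.zeros(D)])
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_aug))   # True: identical to regularization
```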

Extension: Multi-task Learning. Two prediction tasks: a spam filter for Alice and a spam filter for Bob. Limited training data for both... but Alice is similar to Bob.

Extension: Multi-task Learning. Two training sets, both relatively small: S^(1) = {(x_i^(1), y_i^(1))}, S^(2) = {(x_i^(2), y_i^(2))}. Option 1: Train Separately: argmin_w λ w^T w + Σ_i (y_i^(1) - w^T x_i^(1))^2 and argmin_v λ v^T v + Σ_i (y_i^(2) - v^T x_i^(2))^2. Both models have high error. (Omitting b for simplicity.)

Extension: Multi-task Learning. Two training sets, both relatively small: S^(1) = {(x_i^(1), y_i^(1))}, S^(2) = {(x_i^(2), y_i^(2))}. Option 2: Train Jointly: argmin_{w,v} λ w^T w + Σ_i (y_i^(1) - w^T x_i^(1))^2 + λ v^T v + Σ_i (y_i^(2) - v^T x_i^(2))^2. Doesn't accomplish anything! (w & v don't depend on each other.) (Omitting b for simplicity.)

Multi-task Regularization: argmin_{w,v} λ w^T w + λ v^T v + γ (w - v)^T (w - v) + Σ_i (y_i^(1) - w^T x_i^(1))^2 + Σ_i (y_i^(2) - v^T x_i^(2))^2, i.e. standard regularization + multi-task regularization + training loss. Prefer w & v to be close, controlled by γ. Tasks similar → larger γ helps! Tasks not identical → γ not too large. [Figure: Test Loss (Task 2) vs gamma.]
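A minimal gradient-descent sketch of this joint objective; the two synthetic tasks are built to be similar but not identical, and λ, γ, the step size, and iteration count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
D, N1, N2 = 10, 15, 15
w_true = rng.normal(size=D)
X1, X2 = rng.normal(size=(N1, D)), rng.normal(size=(N2, D))
y1 = X1 @ w_true + 0.1 * rng.normal(size=N1)                               # task 1 (e.g., Alice)
y2 = X2 @ (w_true + 0.1 * rng.normal(size=D)) + 0.1 * rng.normal(size=N2)  # task 2, similar to task 1

lam, gamma, eta = 1.0, 5.0, 0.005
w, v = np.zeros(D), np.zeros(D)
for _ in range(2000):
    # gradients of: lam*w^Tw + lam*v^Tv + gamma*(w-v)^T(w-v) + squared losses
    grad_w = 2 * lam * w + 2 * gamma * (w - v) - 2 * X1.T @ (y1 - X1 @ w)
    grad_v = 2 * lam * v - 2 * gamma * (w - v) - 2 * X2.T @ (y2 - X2 @ v)
    w, v = w - eta * grad_w, v - eta * grad_v
print(np.linalg.norm(w - v))   # small: the coupling term keeps the two tasks close
```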

Lasso: L1-Regularized Least-Squares

L1 Regularized Least Squares. L2: argmin_w λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2, vs. L1: argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2. [Figure: the penalty functions w_d^2 and |w_d| plotted over w_d ∈ [-2, 2], with a comparison of the two norms on example weight vectors.] (Omitting b for simplicity.)

Aside: Subgradient (sub-differential). ∂_a R(a) = { c : ∀ a', R(a') - R(a) ≥ c (a' - a) }. If R is differentiable, ∂_a R(a) = { ∇_a R(a) }. For L1: ∂_{w_d} |w_d| = -1 if w_d < 0, +1 if w_d > 0, and the interval [-1, +1] if w_d = 0. A continuous range of subgradients at w_d = 0! [Figure: plot of |w_d|.] (Omitting b for simplicity.)

L1 Regularized Least Squares. L2: ∂_{w_d} of λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2 uses ∂_{w_d} ||w||_2^2 = 2 w_d. L1: ∂_{w_d} ||w||_1 = -1 if w_d < 0, +1 if w_d > 0, and [-1, +1] if w_d = 0. [Figure: the two penalties plotted over w_d ∈ [-2, 2].] (Omitting b for simplicity.)
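A small sketch of why the [-1, +1] subgradient at 0 yields exact zeros: for a single coordinate with a single data point, the minimizer of λ|w| + (y - w)^2 is the soft-thresholding rule (this one-dimensional setup is an illustration, not taken from the slides):

```python
import numpy as np

def soft_threshold(y, lam):
    """Minimizer of lam*|w| + (y - w)^2 over scalar w.

    The [-1, +1] subgradient of |w| at 0 means w = 0 is optimal whenever |y| <= lam/2,
    which is exactly where the sparsity comes from.
    """
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

# Quick brute-force check of the closed form on a grid of candidate w values.
lam = 1.0
for y in [-2.0, -0.3, 0.0, 0.4, 1.7]:
    grid = np.linspace(-3, 3, 100001)
    w_brute = grid[np.argmin(lam * np.abs(grid) + (y - grid) ** 2)]
    print(y, soft_threshold(y, lam), round(w_brute, 3))
```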

Lagrange Multipliers. argmin_w L(y, w) = (y - w^T x)^2 subject to ||w||_1 ≤ c, with ∂_{w_d} ||w||_1 = -1 if w_d < 0, +1 if w_d > 0, [-1, +1] if w_d = 0. ∃ λ ≥ 0 : (∇_w L(y, w) ∈ -λ ∂_w ||w||_1) ∧ (||w||_1 ≤ c). Solutions tend to be at corners of the L1 ball! (Omitting b & using 1 training data point for simplicity.) http://en.wikipedia.org/wiki/Lagrange_multiplier

Sparsity. w is sparse if it is mostly 0's: small L0 norm, ||w||_0 = Σ_d 1[w_d ≠ 0]. Why not L0 regularization, argmin_w λ ||w||_0 + Σ_i (y_i - w^T x_i)^2? Not continuous! L1 induces sparsity and is continuous: argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2. (Omitting b for simplicity.)
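A minimal proximal-gradient (ISTA-style) sketch of L1-regularized least squares on synthetic data with a sparse true w; the step size, λ, and iteration count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, lam = 100, 20, 10.0
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.5, 1.0]                      # sparse ground truth
X = rng.normal(size=(N, D))
y = X @ w_true + 0.1 * rng.normal(size=N)

step = 0.5 / np.linalg.norm(X.T @ X, 2)            # 1/L, where L = 2 * lambda_max(X^T X)
w = np.zeros(D)
for _ in range(500):
    grad = -2 * X.T @ (y - X @ w)                  # gradient of the squared loss
    z = w - step * grad                            # gradient step on the smooth part
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-threshold: prox of lam*||w||_1
print(np.count_nonzero(np.round(w, 4)))            # only a few coordinates stay non-zero
```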

Why is Sparsity Important? Computational / memory efficiency: instead of storing 1M numbers in an array, store 2 numbers per non-zero entry as (index, value) pairs, e.g., w = [(50, 1), (51, 1)] for w = [0 0 ... 0 1 1 0 ... 0]. The dot product w^T x is more efficient. Also, sometimes the true w is sparse, and we want to recover the non-zero dimensions.
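A tiny sketch of the storage/compute point: keep only (index, value) pairs for the non-zero weights, so the dot product touches just those entries (mirroring the slide's example [(50, 1), (51, 1)]):

```python
import numpy as np

D = 1_000_000
w_sparse = [(50, 1.0), (51, 1.0)]      # (index, value) pairs for the non-zero entries

x = np.arange(D, dtype=float)          # some dense feature vector

def sparse_dot(w_pairs, x):
    """w^T x touching only the non-zero coordinates of w."""
    return sum(v * x[i] for i, v in w_pairs)

print(sparse_dot(w_sparse, x))         # 50 + 51 = 101, without ever storing a dense w
```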

Lasso Guarantee. argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i + b)^2. Suppose the data are generated as y_i ~ Normal(w*^T x_i, σ^2), and define Supp(w*) = { d : w*_d ≠ 0 }. Then if λ > (2/κ) sqrt(2 σ^2 log D / N), with high probability (increasing with N): Supp(ŵ) ⊆ Supp(w*) (high precision parameter recovery), and if additionally every non-zero |w*_d| ≥ λ c, then Supp(ŵ) = Supp(w*) (sometimes high recall). See also: https://www.cs.utexas.edu/~pradeepr/courses/395t-lt/filez/highdimii.pdf and http://www.eecs.berkeley.edu/~wainwrig/papers/wai_sparseinfo09.pdf
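A small simulation in the spirit of this guarantee, using scikit-learn's Lasso as the solver (an assumption; any L1 solver would do, and sklearn's alpha is scaled differently from the slide's λ): generate y ~ Normal(w*^T x, σ^2) with a sparse w* and check that the recovered support lies within the true support.

```python
import numpy as np
from sklearn.linear_model import Lasso   # assumes scikit-learn is installed

rng = np.random.default_rng(7)
N, D, sigma = 200, 50, 0.5
w_star = np.zeros(D)
w_star[[3, 17, 40]] = [1.5, -2.0, 1.0]                 # Supp(w*) = {3, 17, 40}
X = rng.normal(size=(N, D))
y = X @ w_star + sigma * rng.normal(size=N)            # y ~ Normal(w*^T x, sigma^2)

# Note: sklearn's alpha weights the L1 term against (1/(2N)) * squared loss,
# so it is not numerically the same lambda as on the slide.
model = Lasso(alpha=0.2, fit_intercept=False).fit(X, y)
support_hat = set(np.flatnonzero(model.coef_))
print(support_hat, support_hat <= {3, 17, 40})         # recovered support within the true support
```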

[Figure: magnitude of the two weights (age>10 on the x-axis, male? on the y-axis) as regularization shrinks them, for L1 vs L2.]
Person | Age>10 | Male? | Height>55"
Alice  | 1 | 0 | 1
Bob    | 0 | 1 | 0
Carol  | 0 | 0 | 0
Dave   | 1 | 1 | 1
Erin   | 1 | 0 | 1
Frank  | 0 | 1 | 1
Gena   | 0 | 0 | 0
Harold | 1 | 1 | 1
Irene  | 1 | 0 | 0
John   | 0 | 1 | 1
Kelly  | 1 | 0 | 1
Larry  | 1 | 1 | 1

Recap: Lasso vs Ridge. Model assumptions: Lasso learns a sparse weight vector. Predictive accuracy: Lasso is often not as accurate; re-run least squares on the dimensions selected by Lasso. Ease of inspection: a sparse w is easier to inspect. Ease of optimization: Lasso is somewhat trickier to optimize.
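A sketch of the "refit after selection" point: use the Lasso only to pick the non-zero dimensions, then re-run unregularized least squares on those columns (scikit-learn assumed for the Lasso step, synthetic data):

```python
import numpy as np
from sklearn.linear_model import Lasso   # assumes scikit-learn is installed

rng = np.random.default_rng(8)
N, D = 200, 50
w_star = np.zeros(D)
w_star[[1, 2, 30]] = [1.0, -1.0, 2.0]
X = rng.normal(size=(N, D))
y = X @ w_star + 0.5 * rng.normal(size=N)

lasso = Lasso(alpha=0.2, fit_intercept=False).fit(X, y)
selected = np.flatnonzero(lasso.coef_)                 # dimensions selected by the Lasso

# Unregularized least squares restricted to the selected dimensions (debiasing step)
coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
w_refit = np.zeros(D)
w_refit[selected] = coef

print(selected, w_refit[selected])                     # refit weights are less shrunken than lasso.coef_
```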

Recap: Regularization.
L2: argmin_w λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2
L1 (Lasso): argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2
Multi-task: argmin_{w,v} λ w^T w + λ v^T v + γ (w - v)^T (w - v) + Σ_i (y_i^(1) - w^T x_i^(1))^2 + Σ_i (y_i^(2) - v^T x_i^(2))^2
[Insert Yours Here!] (Omitting b for simplicity.)

Next Lectures: Decision Trees, Bagging, Random Forests, Boosting, Ensemble Selection. No Recitation this Week.