Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso


Recap: Complete Pipeline
Training Data:  S = {(x_i, y_i)}
Model Class(es):  f(x | w, b) = w^T x - b
Loss Function:  L(a, b) = (a - b)^2
Learn:  argmin_{w,b} Σ_i L(y_i, f(x_i | w, b))  via SGD
Cross Validation & Model Selection
Profit!

Different Model Classes?
Option 1: SVMs vs ANNs vs LR vs LS
Option 2: Regularization
argmin_{w,b} Σ_i L(y_i, f(x_i | w, b)),  then SGD, Cross Validation & Model Selection

Notation Part 1
L0 norm (not actually a norm): number of non-zero entries,  ||w||_0 = Σ_d 1[w_d ≠ 0]
L1 norm: sum of absolute values,  ||w||_1 = Σ_d |w_d|
L2 norm: sqrt of the sum of squares,  ||w||_2 = sqrt(Σ_d w_d^2) = sqrt(w^T w)
Squared L2 norm: sum of squares,  ||w||_2^2 = Σ_d w_d^2 = w^T w
L-infinity norm: max absolute value,  ||w||_∞ = lim_{p→∞} (Σ_d |w_d|^p)^{1/p} = max_d |w_d|
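A minimal NumPy sketch of these norms (the helper name norms and the example vector are illustrative, not from the lecture):

    import numpy as np

    def norms(w):
        """Compute the norms from Notation Part 1 for a weight vector w."""
        l0 = np.count_nonzero(w)          # ||w||_0: number of non-zero entries
        l1 = np.sum(np.abs(w))            # ||w||_1: sum of absolute values
        l2 = np.sqrt(np.sum(w ** 2))      # ||w||_2: sqrt of the sum of squares
        l2_sq = np.dot(w, w)              # ||w||_2^2: sum of squares (= w^T w)
        linf = np.max(np.abs(w))          # ||w||_inf: max absolute value
        return l0, l1, l2, l2_sq, linf

    print(norms(np.array([0.0, 3.0, -4.0])))  # l0=2, l1=7, l2=5, l2_sq=25, linf=4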

Notation Part 2
Minimizing Squared Loss = Regression = Least-Squares:  Σ_i (y_i - w^T x_i + b)^2  (unless otherwise stated).
E.g., Logistic Regression = Log Loss.

Ridge Regression (aka L2-Regularized Regression)
argmin_{w,b}  λ w^T w  +  Σ_i (y_i - w^T x_i + b)^2        [regularization term + training loss]
Trades off model complexity vs training loss.
Each choice of λ is a model class (more on this later).
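With squared loss, this objective also has a closed-form solution. The sketch below is one standard way to compute it, handling an unpenalized bias via an extra constant column (the names ridge_fit, X, y, lam are illustrative assumptions, not from the slides):

    import numpy as np

    def ridge_fit(X, y, lam):
        """Solve argmin_{w,b} lam * w^T w + sum_i (y_i - (w^T x_i - b))^2 in closed form."""
        n, d = X.shape
        Xb = np.hstack([X, -np.ones((n, 1))])   # last column implements the "- b" in f(x|w,b)
        reg = lam * np.eye(d + 1)
        reg[d, d] = 0.0                         # do not penalize the bias term
        wb = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)
        return wb[:d], wb[d]                    # (w, b)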

Example: Ridge Regression on a Toy Dataset
Features:  x = [ 1[age > 10], 1[gender = male] ];   label:  y = 1 if height > 55", 0 if height ≤ 55".
Objective:  argmin_{w,b} λ w^T w + Σ_i (y_i - w^T x_i + b)^2
[Figure: Training Loss and Test Loss (roughly 0.10 to 0.22) vs lambda (10^-2 to 10^1).]

Learned (w1, w2, b) as lambda increases (larger lambda shrinks w):
  0.7401   0.2441   -0.1745
  0.7122   0.2277   -0.1967
  0.6197   0.1765   -0.2686
  0.4124   0.0817   -0.4196
  0.1801   0.0161   -0.5686
  0.0001   0.0000   -0.6666

Data (Person: Age>10, Male?, Height>55"; split into a training set and a test set):
  Alice  1 0 1    Bob    0 1 0    Carol  0 0 0    Dave   1 1 1
  Erin   1 0 1    Frank  0 1 1    Gena   0 0 0    Harold 1 1 1
  Irene  1 0 0    John   0 1 1    Kelly  1 0 1    Larry  1 1 1

Updated Pipeline
Training Data:  S = {(x_i, y_i)}
Model Class:  f(x | w, b) = w^T x - b
Loss Function:  L(a, b) = (a - b)^2
Learn:  argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b))
Cross Validation & Model Selection
Profit!

Model scores f(x | w, b) for increasing lambda (the rightmost column corresponds to the largest lambda, where all predictions collapse to 0.67; an intermediate lambda achieves the best test error):

Person  (Age>10, Male?, Height>55")   scores for increasing lambda
Alice   (1, 0, 1)   0.91  0.89  0.83  0.75  0.67
Bob     (0, 1, 0)   0.42  0.45  0.50  0.58  0.67
Carol   (0, 0, 0)   0.17  0.26  0.42  0.50  0.67
Dave    (1, 1, 1)   1.16  1.06  0.91  0.83  0.67
Erin    (1, 0, 1)   0.91  0.89  0.83  0.79  0.67
Frank   (0, 1, 1)   0.42  0.45  0.50  0.54  0.67
Gena    (0, 0, 0)   0.17  0.27  0.42  0.50  0.67
Harold  (1, 1, 1)   1.16  1.06  0.91  0.83  0.67
Irene   (1, 0, 0)   0.91  0.89  0.83  0.79  0.67
John    (0, 1, 1)   0.42  0.45  0.50  0.54  0.67
Kelly   (1, 0, 1)   0.91  0.89  0.83  0.79  0.67
Larry   (1, 1, 1)   1.16  1.06  0.91  0.83  0.67

Choice of Lambda Depends on Training Size
[Figure: test loss vs lambda (10^-2 to 10^2), four panels: 25, 50, 75, and 100 training points.]
Setup: 25-dimensional space, randomly generated linear response function plus noise.
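Selecting λ is exactly the cross-validation / model-selection step in the pipeline. A minimal sketch of a held-out sweep, reusing the ridge_fit sketch above (the grid of λ values and the helper name are illustrative assumptions):

    import numpy as np

    def select_lambda(X_train, y_train, X_val, y_val,
                      lambdas=(0.01, 0.1, 1.0, 10.0, 100.0)):
        """Return the lambda whose ridge fit has the lowest squared loss on held-out data."""
        best = None
        for lam in lambdas:
            w, b = ridge_fit(X_train, y_train, lam)            # sketch defined earlier
            val_loss = np.mean((y_val - (X_val @ w - b)) ** 2)
            if best is None or val_loss < best[0]:
                best = (val_loss, lam, w, b)
        return best                                            # (val loss, lambda, w, b)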

Recap: Ridge Regularization
Ridge Regression = L2-Regularized Least-Squares:  argmin_{w,b} λ w^T w + Σ_i (y_i - w^T x_i + b)^2
Large λ ⇒ more stable predictions, less likely to overfit the training data.
Too large λ ⇒ underfit.
Works with other losses too: Hinge Loss, Log Loss, etc.

Aside: Stochastic Gradient Descent
Full objective:  L(w, b) = λ ||w||_2^2 + Σ_i L(y_i, f(x_i | w, b))
Per-example objective:  L_i(w, b) = (1/N) λ ||w||_2^2 + L(y_i, f(x_i | w, b))
Then  ∇L(w, b) = N · E_i[ ∇L_i(w, b) ],  so do SGD on the L_i.
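A minimal SGD sketch for the L2-regularized squared loss, following the per-example split above (the learning rate, epoch count, and function name are illustrative assumptions):

    import numpy as np

    def sgd_ridge(X, y, lam, lr=0.01, epochs=100, seed=0):
        """SGD on L_i(w, b) = (1/N) * lam * ||w||^2 + (y_i - (w^T x_i - b))^2."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            for i in rng.permutation(n):
                err = y[i] - (X[i] @ w - b)                 # residual on example i
                grad_w = (2.0 / n) * lam * w - 2.0 * err * X[i]
                grad_b = 2.0 * err                          # d/db of (y_i - w^T x_i + b)^2
                w -= lr * grad_w
                b -= lr * grad_b
        return w, b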

Model Class Interpretation
argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b))
This is not a model class! At least not the way we've discussed so far; it is an optimization procedure.
Is there a connection?

Norm-Constrained Model Class
f(x | w, b) = w^T x - b,   s.t.  w^T w ≤ c
[Figure: L2-norm balls for c = 1, 2, 3; plot of the learned ||w|| shrinking as lambda grows.]
The constraint level c seems to correspond to λ in  argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b)).
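The norm-constrained class can also be trained directly with projected gradient descent: take a gradient step on the loss, then project w back onto the ball w^T w ≤ c. This is a sketch under those assumptions, not an algorithm given in the lecture:

    import numpy as np

    def projected_gd(X, y, c, lr=0.001, epochs=500):
        """Minimize sum_i (y_i - w^T x_i)^2 subject to w^T w <= c (bias omitted)."""
        d = X.shape[1]
        w = np.zeros(d)
        for _ in range(epochs):
            grad = -2.0 * X.T @ (y - X @ w)        # gradient of the squared loss
            w -= lr * grad
            norm_sq = w @ w
            if norm_sq > c:                        # project onto the L2 ball of radius sqrt(c)
                w *= np.sqrt(c / norm_sq)
        return w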

Lagrange Multipliers
Minimize  L(y, w) = (y - w^T x)^2   s.t.  w^T w ≤ c
Optimality condition: the gradients are aligned at the constraint boundary!
∃ λ ≥ 0:  (∇_w L(y, w) = λ ∇_w w^T w)  ∧  (w^T w ≤ c)
(Omitting b and using a single training point for simplicity.)
http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm-Constrained Model Class
Training:  argmin_w L(y, w),  with  L(y, w) = (y - w^T x)^2,   s.t.  w^T w ≤ c   (omitting b and using a single training point for simplicity)
Two conditions must be satisfied at optimality. Observation about optimality:
  ∃ λ ≥ 0:  (∇_w L(y, w) = λ ∇_w w^T w)  ∧  (w^T w ≤ c)
Lagrangian:  argmin_w max_{λ≥0} Λ(w, λ) = (y - w^T x)^2 + λ (w^T w - c)
Claim: solving the Lagrangian solves the norm-constrained training problem.
Optimality implication of the Lagrangian for w:
  ∇_w Λ(w, λ) = -2 x (y - w^T x) + 2 λ w ≡ 0   ⇒   2 x (y - w^T x) = 2 λ w,  which satisfies the first condition.
Optimality implication of the Lagrangian for λ:
  ∂_λ Λ(w, λ) = w^T w - c if w^T w ≥ c, and the maximizing λ is 0 if w^T w < c   ⇒   w^T w ≤ c,  which satisfies the second condition.
http://en.wikipedia.org/wiki/Lagrange_multiplier

Norm-Constrained vs L2-Regularized Training
Norm-constrained training:  argmin_w L(y, w) = (y - w^T x)^2   s.t.  w^T w ≤ c
L2-regularized training:  argmin_w λ w^T w + (y - w^T x)^2
Lagrangian:  argmin_w max_{λ≥0} Λ(w, λ) = (y - w^T x)^2 + λ (w^T w - c)
Lagrangian = norm-constrained training:  ∃ λ ≥ 0:  (∇_w L(y, w) = λ ∇_w w^T w)  ∧  (w^T w ≤ c)
Lagrangian = L2-regularized training: hold λ fixed; the extra -λc term is a constant, so minimizing over w is the same problem.
So L2-regularized training is equivalent to norm-constrained training, for some c.
(Omitting b and using a single training point for simplicity.)
http://en.wikipedia.org/wiki/Lagrange_multiplier

Recap 2: Ridge Regularization
Ridge Regression (L2-Regularized Least-Squares):  argmin_{w,b} λ w^T w + L(w, b)
= Norm-Constrained Model:  argmin_{w,b} L(w, b)   s.t.  w^T w ≤ c
Large λ ⇒ more stable predictions, less likely to overfit the training data.
Too large λ ⇒ underfit.
Works with other losses too: Hinge Loss, Log Loss, etc.

Hallucinating Data Points
The gradient of  argmin_w λ w^T w + Σ_i (y_i - w^T x_i)^2  is  2 λ w - Σ_i 2 x_i (y_i - w^T x_i).
What if we instead hallucinate D extra data points  {(√λ e_d, 0)}_{d=1..D},  where  e_d = [0 ... 0 1 0 ... 0]^T  is the unit vector along the d-th dimension?
Objective:  argmin_w Σ_{d=1..D} (0 - √λ w^T e_d)^2 + Σ_i (y_i - w^T x_i)^2
Gradient of the hallucinated term:  Σ_{d=1..D} 2 √λ e_d (√λ w^T e_d) = Σ_d 2 λ w_d e_d = 2 λ w
Identical to regularization!
(Omitting b for simplicity.)
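A quick numerical check of this equivalence (a sketch with randomly generated data; bias omitted): ordinary least squares on the augmented dataset should match the closed-form ridge solution.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 50, 5, 2.0
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    # Ridge solution (bias omitted): (X^T X + lam * I)^{-1} X^T y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Hallucinate D = d points (sqrt(lam) * e_d, 0) and solve plain least squares
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    y_aug = np.concatenate([y, np.zeros(d)])
    w_halluc, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

    print(np.allclose(w_ridge, w_halluc))  # True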

Extension: Multi-task Learning
Two prediction tasks: a spam filter for Alice, and a spam filter for Bob.
Limited training data for both... but Alice is similar to Bob.

Extension: Multi-task Learning
Two training sets, both relatively small:  S^(1) = {(x_i^(1), y_i^(1))}  and  S^(2) = {(x_i^(2), y_i^(2))}.
Option 1: train separately:
  argmin_w λ w^T w + Σ_i (y_i^(1) - w^T x_i^(1))^2        argmin_v λ v^T v + Σ_i (y_i^(2) - v^T x_i^(2))^2
Both models have high error.
(Omitting b for simplicity.)

Extension: Multi-task Learning
Option 2: train jointly:
  argmin_{w,v} λ w^T w + Σ_i (y_i^(1) - w^T x_i^(1))^2 + λ v^T v + Σ_i (y_i^(2) - v^T x_i^(2))^2
Doesn't accomplish anything! (w and v don't depend on each other.)
(Omitting b for simplicity.)

Multi-task Regularization
argmin_{w,v}  λ w^T w + λ v^T v + γ (w - v)^T (w - v) + Σ_i (y_i^(1) - w^T x_i^(1))^2 + Σ_i (y_i^(2) - v^T x_i^(2))^2
(standard regularization + multi-task regularization + training loss)
Prefers w and v to be close, controlled by γ.
Tasks similar ⇒ larger γ helps!   Tasks not identical ⇒ do not make γ too large.
[Figure: test loss on Task 2 (roughly 12 to 14.5) vs gamma (10^-4 to 10^2).]
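A minimal gradient-descent sketch of this joint objective (the function name, learning rate, and iteration count are illustrative assumptions; bias omitted as on the slide):

    import numpy as np

    def multitask_fit(X1, y1, X2, y2, lam, gamma, lr=0.001, epochs=2000):
        """Jointly fit w (task 1) and v (task 2) with L2 penalties plus a gamma*||w - v||^2 coupling."""
        d = X1.shape[1]
        w, v = np.zeros(d), np.zeros(d)
        for _ in range(epochs):
            grad_w = 2 * lam * w + 2 * gamma * (w - v) - 2 * X1.T @ (y1 - X1 @ w)
            grad_v = 2 * lam * v - 2 * gamma * (w - v) - 2 * X2.T @ (y2 - X2 @ v)
            w -= lr * grad_w
            v -= lr * grad_v
        return w, v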

Lasso: L1-Regularized Least-Squares

L1 Regularized Least Squares
L2:  argmin_w λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2
L1:  argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2
[Figure: the penalties w^2 (L2) and |w| (L1) for a scalar weight w in [-2, 2].]
(Omitting b for simplicity.)

Aside: Subgradient (sub-differential)
∂_a R(a) = { c : ∀ a',  R(a') - R(a) ≥ c (a' - a) }
If R is differentiable:  ∂_a R(a) = { ∇_a R(a) }
For the L1 norm:  ∂_{w_d} ||w||_1 = -1 if w_d < 0;  +1 if w_d > 0;  [-1, +1] if w_d = 0
A continuous range of subgradients at w_d = 0!
[Figure: |w| for a scalar weight w in [-2, 2].]
(Omitting b for simplicity.)

L1 Regularized Least Squares
L2:  argmin_w λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2,   with  ∂_{w_d} ||w||_2^2 = 2 w_d
L1:  argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2,   with  ∂_{w_d} ||w||_1 = -1 if w_d < 0;  +1 if w_d > 0;  [-1, +1] if w_d = 0
[Figure: the two penalty curves for a scalar weight w in [-2, 2].]
(Omitting b for simplicity.)

Lagrange Multipliers (L1 constraint)
Minimize  L(y, w) = (y - w^T x)^2   s.t.  ||w||_1 ≤ c
Subgradient of the constraint:  ∂_{w_d} ||w||_1 = -1 if w_d < 0;  +1 if w_d > 0;  [-1, +1] if w_d = 0
Solutions tend to be at corners!
∃ λ ≥ 0:  (∇_w L(y, w) ∈ λ ∂_w ||w||_1)  ∧  (||w||_1 ≤ c)
(Omitting b and using a single training point for simplicity.)
http://en.wikipedia.org/wiki/Lagrange_multiplier

Sparsity
w is sparse if it is mostly 0's: small L0 norm,  ||w||_0 = Σ_d 1[w_d ≠ 0].
Why not L0 regularization,  argmin_w λ ||w||_0 + Σ_i (y_i - w^T x_i)^2 ?  Not continuous!
L1 induces sparsity, and is continuous:  argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2
(Omitting b for simplicity.)

Why is Sparsity Important?
Computational / memory efficiency: instead of storing 1M numbers in an array, store two numbers per non-zero entry as (index, value) pairs.
E.g.,  w = [0 0 ... 0 1 1 0 ... 0]  becomes  [ (50, 1), (51, 1) ],  and the dot product w^T x is more efficient.
Also, sometimes the true w* is sparse, and we want to recover its non-zero dimensions.
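A minimal sketch of the (index, value) representation and the corresponding dot product (helper names and the tolerance are illustrative):

    def to_sparse(w, tol=0.0):
        """Represent a dense vector as (index, value) pairs for its non-zero entries."""
        return [(i, v) for i, v in enumerate(w) if abs(v) > tol]

    def sparse_dot(sparse_w, x):
        """Compute w^T x touching only the non-zero entries of w."""
        return sum(v * x[i] for i, v in sparse_w)

    w = [0.0] * 1000
    w[50], w[51] = 1.0, 1.0
    print(to_sparse(w))                            # [(50, 1.0), (51, 1.0)]
    print(sparse_dot(to_sparse(w), range(1000)))   # 1*50 + 1*51 = 101.0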

Lasso Guarantee
argmin_{w,b} λ ||w||_1 + Σ_i (y_i - w^T x_i + b)^2
Suppose the data is generated as  y_i ~ Normal(w*^T x_i, σ^2).
Then if  λ > (2/κ) sqrt(2 σ^2 N log D),  with high probability (increasing with N):
  Supp(ŵ) ⊆ Supp(w*)                                        (high-precision parameter recovery)
  ∀ d ∈ Supp(w*): |w*_d| ≥ λ c   ⇒   Supp(ŵ) = Supp(w*)     (sometimes high recall)
where  Supp(w*) = { d : w*_d ≠ 0 }.
See also:
https://www.cs.utexas.edu/~pradeepr/courses/395t-lt/flez/hghdmii.pdf
http://www.eecs.berkeley.edu/~wainwrig/papers/wa_sparseinfo09.pdf

Weight Paths: L1 vs L2
[Figure: magnitudes of the two weights (x-axis: age>10?, y-axis: male?) as the regularization shrinks, traced for L1 and for L2.]
Data: the same person table as above (Person: Age>10, Male?, Height>55").

Aside: Optimizing Lasso
Solving Lasso gives a sparse model. Will stochastic gradient descent find it? No! It is hard to hit exactly 0 with gradient descent.
Solution: Iterative Soft Thresholding.
Intuition: if the gradient update would pass through 0, clamp the weight at 0.
https://blogs.princeton.edu/imabandit/2013/04/11/orf523-ista-and-fista/
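A minimal ISTA sketch for the Lasso objective λ||w||_1 + Σ_i (y_i - w^T x_i)^2 (bias omitted; the default step size, tied to the Lipschitz constant of the smooth part, and the iteration count are illustrative assumptions):

    import numpy as np

    def soft_threshold(z, tau):
        """Shrink z toward 0 by tau; entries whose update would cross 0 are clamped at exactly 0."""
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def ista_lasso(X, y, lam, lr=None, iters=1000):
        """Iterative Soft Thresholding for argmin_w lam * ||w||_1 + ||y - X w||^2."""
        d = X.shape[1]
        if lr is None:
            lr = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
        w = np.zeros(d)
        for _ in range(iters):
            grad = -2.0 * X.T @ (y - X @ w)                # gradient of the squared-loss term
            w = soft_threshold(w - lr * grad, lr * lam)    # gradient step, then soft threshold
        return w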

Recap: Lasso vs Ridge
Model assumptions: Lasso learns a sparse weight vector.
Predictive accuracy: Lasso is often not as accurate; re-run Least Squares on the dimensions selected by Lasso.
Ease of inspection: a sparse w is easier to inspect.
Ease of optimization: Lasso is somewhat trickier to optimize.
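A minimal sketch of that refit step, reusing the ista_lasso sketch above (the support threshold 1e-8 and helper name are illustrative assumptions): run Lasso to select dimensions, then ordinary least squares on just those dimensions.

    import numpy as np

    def lasso_then_refit(X, y, lam):
        """Select a support with Lasso, then re-fit unregularized least squares on it."""
        w_lasso = ista_lasso(X, y, lam)                     # sketch defined earlier
        support = np.flatnonzero(np.abs(w_lasso) > 1e-8)    # selected (non-zero) dimensions
        w = np.zeros(X.shape[1])
        if support.size > 0:
            coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
            w[support] = coef
        return w, support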

Recap: Regularization
L2:  argmin_w λ ||w||_2^2 + Σ_i (y_i - w^T x_i)^2
L1 (Lasso):  argmin_w λ ||w||_1 + Σ_i (y_i - w^T x_i)^2
Multi-task:  argmin_{w,v} λ w^T w + λ v^T v + γ (w - v)^T (w - v) + Σ_i (y_i^(1) - w^T x_i^(1))^2 + Σ_i (y_i^(2) - v^T x_i^(2))^2
[Insert Yours Here!]
(Omitting b for simplicity.)

Recap: Updated Pipeline
Training Data:  S = {(x_i, y_i)}
Model Class:  f(x | w, b) = w^T x - b
Loss Function:  L(a, b) = (a - b)^2
Learn:  argmin_{w,b} λ w^T w + Σ_i L(y_i, f(x_i | w, b))
Cross Validation & Model Selection
Profit!

Next Lectures
Decision Trees, Bagging, Random Forests, Boosting, Ensemble Selection.
No Recitation this Week.