
UVA CS 6316/4501 Fall 2016 Machine Learning
Lecture 6: Linear Regression Model with Regularizations
Dr. Yanjun Qi, University of Virginia, Department of Computer Science

Where are we? → Five major sections of this course
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models

Today → Regression (supervised)
- Four ways to train / perform optimization for linear regression models:
  - Normal equation
  - Gradient descent (GD)
  - Stochastic GD
  - Newton's method
- Supervised regression models:
  - Linear regression (LR)
  - LR with non-linear basis functions
  - Locally weighted LR
  - LR with regularizations

Today
- Linear Regression Model with Regularizations
  - Ridge regression
  - Lasso regression
  - Elastic net

Review: Vector norms
A norm of a vector x is, informally, a measure of the length of the vector. Common norms: L1, L2 (Euclidean), L-infinity.
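Compactly, in the standard notation (a brief summary consistent with the slide):
$$\|x\|_1 = \sum_i |x_i|, \qquad \|x\|_2 = \Big(\sum_i x_i^2\Big)^{1/2}, \qquad \|x\|_\infty = \max_i |x_i|,$$
all special cases of $\|x\|_p = \big(\sum_i |x_i|^p\big)^{1/p}$.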

Review: Vector norm (L2, when p = 2)
$$\|x\|_2 = \Big(\sum_i x_i^2\Big)^{1/2}$$
the ordinary Euclidean length of the vector.

Review: Normal equation for LR
Write the cost function in matrix form:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}(x_i^T\theta - y_i)^2 = \frac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y}) = \frac{1}{2}\big(\theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y}\big)$$
where $X$ stacks the inputs as rows, $X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}$, and $\vec{y} = (y_1, \dots, y_n)^T$. To minimize $J(\theta)$, take the derivative and set it to zero:
$$X^T X \theta = X^T \vec{y} \qquad \text{(the normal equations)}$$
$$\theta^* = (X^T X)^{-1} X^T \vec{y}$$
assuming that $X^T X$ is invertible.
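A minimal numpy sketch of the normal-equation solution (the data and variable names are illustrative, not from the lecture; solving the linear system is preferred over forming an explicit inverse):

```python
import numpy as np

# Toy data: n = 100 samples, p = 3 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Normal equations: X^T X theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [1.0, -2.0, 0.5]
```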

Comments on the normal equation
In most situations of practical interest, the number of data points N is larger than the dimensionality p of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that $X^T X$ is necessarily invertible. That $X^T X$ is invertible implies that it is positive definite (→ the SSE is strongly convex), thus the critical point we have found is a minimum. What if X has less than full column rank? → regularization (later).

Review: page 17 of the linear-algebra handout → One important property of positive definite and negative definite matrices is that they are always full rank, and hence, invertible.

(Review: page 11 of the linear-algebra handout.)

Ridge regression / L2
If $X^T X$ is not invertible, a solution is to add a small element $\lambda$ to the diagonal. With the model $\hat{Y} = \hat\beta_1 x_1 + \dots + \hat\beta_p x_p$, the estimator becomes
$$\beta^* = (X^T X + \lambda I_{p \times p})^{-1} X^T \vec{y}$$
The ridge estimator is the solution of
$$\hat\beta^{ridge} = \arg\min_\beta \,(y - X\beta)^T(y - X\beta) + \lambda \beta^T \beta$$
(to minimize, take the derivative and set it to zero). Equivalently,
$$\hat\beta^{ridge} = \arg\min_\beta \,(y - X\beta)^T(y - X\beta) \quad \text{subject to} \quad \sum_j \beta_j^2 \le s$$
By convention, the bias/intercept term is typically not regularized. Here we assume the data have been centered, therefore there is no bias term. (Basic model, HW2)
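A minimal sketch of this closed form (the function name and arguments are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimator (X^T X + lam * I)^{-1} X^T y.

    For lam > 0, X^T X + lam * I is positive definite, hence
    invertible even when X lacks full column rank.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```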

Objective function's contour lines for ridge regression
(Figure: elliptical contours of the least-squares objective, the circular L2 constraint region of size $s$, the ridge regression solution where the contours touch the constraint region, and the unconstrained least-squares solution at the center of the ellipses.)

(Review: a figure from Lecture 3.)


(1) Ridge regression / L2
The parameter $\lambda > 0$ penalizes $\sum_j \beta_j^2$. The solution is
$$\hat\beta_\lambda = (X^T X + \lambda I)^{-1} X^T y$$
where $I$ is the identity matrix. Note that $\lambda = 0$ gives the least squares estimator; as $\lambda \to \infty$, $\hat\beta \to 0$.

Shrinkage?
$$\beta^{OLS} = (X^T X)^{-1} X^T \vec{y}, \qquad \beta^{Rg} = (X^T X + \lambda I)^{-1} X^T \vec{y}$$
When $\lambda = 0$ the two coincide; when $\lambda$ grows, the ridge coefficients are shrunk toward zero. See page 65 of the ESL book: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

Two forms of ridge regression
The penalized and constrained forms are totally equivalent: http://stats.stackexchange.com/questions/190993/how-to-find-regression-coefficients-beta-in-ridge-regression

Positive definiteness
One important property of positive definite and negative definite matrices is that they are always full rank, and hence, invertible.

$$\hat\beta_\lambda = (X^T X + \lambda I)^{-1} X^T y$$

Today
- Linear Regression Model with Regularizations
  - Ridge regression
  - Lasso regression
  - Elastic net

(2) Lasso (least absolute shrinkage and selection operator) / L1
The lasso is a shrinkage method like ridge, but acts in a nonlinear manner on the outcome $y$. The lasso is defined by
$$\hat\beta^{lasso} = \arg\min_\beta \,(y - X\beta)^T(y - X\beta) \quad \text{subject to} \quad \sum_j |\beta_j| \le s$$
By convention, the bias/intercept term is typically not regularized. Here we assume the data have been centered, therefore there is no bias term.

Lasso (least absolute shrinkage and selection operator)
Notice that the ridge penalty $\sum_j \beta_j^2$ is replaced by $\sum_j |\beta_j|$. Due to the nature of the constraint, if the tuning parameter is chosen small enough, the lasso will set some coefficients exactly to zero.
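There is no closed form for the lasso, but the sparsity is easy to observe with a standard solver. A small illustrative sketch using scikit-learn (the data and the alpha value are arbitrary, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features carry signal.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1)  # alpha plays the role of lambda here
lasso.fit(X, y)
print(lasso.coef_)        # many entries are exactly 0.0
```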

Lasso (least absolute shrinkage and selection operator)
Suppose we are in 2 dimensions, $\beta = (\beta_1, \beta_2)$. The constraint boundary $|\beta_1| + |\beta_2| = \text{const}$ consists of the four line segments $\beta_1 + \beta_2 = \text{const}$, $\beta_1 - \beta_2 = \text{const}$, $-\beta_1 + \beta_2 = \text{const}$, $-\beta_1 - \beta_2 = \text{const}$, i.e., a diamond with corners on the axes. (Figure: the lasso solution lies where the least-squares contours first touch the diamond of size $s$, often at a corner, compared with the unconstrained least-squares solution.) If the data are not centered, see: http://stats.stackexchange.com/questions/86991/reason-for-not-shrinking-the-bias-intercept-term-in-regression

Lasso estimator vs. ridge regression
(Figure: side-by-side constraint regions of size $s$: the lasso diamond and the ridge disk, each with the elliptical least-squares contours touching them.)

Today
- Linear Regression Model with Regularizations
  - Ridge regression
  - Lasso regression
  - Elastic net

(3) Elastic net: a hybrid of ridge and lasso
The elastic net combines the two penalties:
$$\hat\beta^{enet} = \arg\min_\beta \,(y - X\beta)^T(y - X\beta) + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2$$
Normally $x$ and $y$ have been centered, therefore no bias term is needed above!
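For reference, scikit-learn parameterizes this hybrid penalty with a total strength alpha and a mixing weight l1_ratio (l1_ratio = 1 recovers the lasso penalty, l1_ratio = 0 the ridge penalty); a small illustrative sketch:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# sklearn's penalty: alpha * (l1_ratio * ||b||_1
#                             + 0.5 * (1 - l1_ratio) * ||b||_2^2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```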

Geometry of the elastic net
(Figure: the elastic net constraint region lies between the lasso diamond and the ridge disk, with singular corners like the lasso and strictly convex edges like ridge.)

Example: "Movie Reviews and Revenues: An Experiment in Text Regression", Proceedings of HLT '10: Human Language Technologies.
Use linear regression to directly predict the opening weekend gross earnings, denoted $y$, based on features $x$ extracted from the movie metadata and/or the text of the reviews.

More: a family of shrinkage estimators
$$\hat\beta = \arg\min_\beta \sum_{i=1}^{N}(y_i - x_i^T\beta)^2 \quad \text{subject to} \quad \sum_j |\beta_j|^q \le s$$
for $q \ge 0$; contours of constant value of $\sum_j |\beta_j|^q$ are shown for the case of two inputs. Here we assume $x$ and $y$ have been centered (as is normal), therefore no bias term is needed above!

Summary: regularized multivariate linear regression
Model: $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_p x_p$
LR estimation: $\arg\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
LASSO estimation: $\arg\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
Ridge regression estimation: $\arg\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$

Regularized multivariate linear regression
- Task: regression
- Representation: $Y$ = weighted linear sum of the $X$'s
- Score function: least squares + regularization
- Search/optimization: linear algebra / GD
- Models, parameters: regression coefficients (constrained)

Regularization path of a ridge regression
(Figure: the ridge coefficient profiles traced as $\lambda$ varies, from large $\lambda$ down to $\lambda = 0$.)
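A sketch of how such a path can be traced with the ridge closed form, sweeping $\lambda$ over a log-spaced grid (the data and the grid are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.1 * rng.normal(size=100)

# Closed-form ridge estimate for a grid of lambda values.
lambdas = np.logspace(-3, 3, 50)
path = np.array([
    np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    for lam in lambdas
])
# path[i, j] is beta_j at lambdas[i]: each coefficient shrinks
# smoothly toward 0 as lambda grows, and the first row
# (smallest lambda) is essentially the least-squares solution.
```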

An example (ESL Fig. 3.8): choose the $\lambda$ that generalizes well!
(Figure: for ridge regression, how each $\beta_j$ varies as $\lambda$ increases; at $\lambda = 0$ the coefficients equal the least-squares estimates, and they shrink as $\lambda$ increases.)

(Figures: the constraint-region view of the ridge and lasso solutions in the $(\beta_1, \beta_2)$ plane; as $\lambda$ increases the constraint size $s$ decreases, shrinking the feasible region, and as $\lambda \to 0$ the constraint relaxes toward the unconstrained least-squares solution.)

Regularization path of a lasso regression
(Figure: how each $\beta_j$ varies as $\lambda$ varies, from large $\lambda$ down to $\lambda = 0$.)
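scikit-learn can compute the analogous lasso path directly; a small illustrative sketch:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.1 * rng.normal(size=100)

# alphas is a decreasing grid of penalty strengths chosen by sklearn;
# coefs has shape (n_features, n_alphas).
alphas, coefs, _ = lasso_path(X, y)
# Unlike the smooth ridge path, lasso coefficients hit exactly zero
# and enter the model one by one as alpha decreases.
```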

Choose the $\lambda$ that generalizes well! An example (ESL Fig. 3.10).
(Figure: lasso coefficient profiles between the fully penalized model and $\lambda = 0$.)

Lasso when p > n
Prediction accuracy and model interpretation are two important aspects of regression models. LASSO does shrinkage and variable selection simultaneously, for better prediction and model interpretation.
Disadvantages:
- In the p > n case, the lasso selects at most n variables before it saturates.
- If there is a group of variables among which the pairwise correlations are very high, the lasso tends to select only one variable from the group.

The bias/intercept term is not shrunk
If the data are not centered, there is a bias term; see http://stats.stackexchange.com/questions/86991/reason-for-not-shrinking-the-bias-intercept-term-in-regression
We normally assume we have centered $x$ and $y$. If this is true, there is no need for a bias term, e.g., for the lasso:
$$\hat\beta^{lasso} = \arg\min_\beta \sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda \sum_j |\beta_j|$$
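A short check of the centering claim (an illustrative sketch; sklearn's fit_intercept flag turns off the bias term):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + 0.1 * rng.normal(size=100)

Xc = X - X.mean(axis=0)   # center each feature
yc = y - y.mean()         # center the response
m = Lasso(alpha=0.1, fit_intercept=False).fit(Xc, yc)
# m.coef_ matches Lasso(alpha=0.1).fit(X, y).coef_ up to numerics,
# and the intercept is recovered as y.mean() - X.mean(axis=0) @ m.coef_.
```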

Today: Recap
- Linear Regression Model with Regularizations
  - Ridge regression (why invertible: next class)
  - Lasso regression (extra: how to perform training, next class)
  - Elastic net (extra: how to perform training, next class)

References
- Big thanks to Prof. Eric Xing @ CMU for allowing me to reuse some of his slides
- Prof. Nando de Freitas's tutorial slides
- "Regularization and variable selection via the elastic net", Hui Zou and Trevor Hastie, Stanford University, USA
- ESL book: The Elements of Statistical Learning