UVA CS 4501: Machine Learning. Lecture 6: Linear Regression Model with Regularizations. Dr. Yanjun Qi. University of Virginia


UVA CS 4501: Machine Learning. Lecture 6: Linear Regression Model with Regularizations. Dr. Yanjun Qi, University of Virginia, Department of Computer Science.

Where are we? Five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models

Today: Regression (supervised)
- Four ways to train / perform optimization for linear regression models:
  - Normal equation
  - Gradient descent (GD)
  - Stochastic GD
  - Newton's method
- Supervised regression models:
  - Linear regression (LR)
  - LR with non-linear basis functions
  - Locally weighted LR
  - LR with regularizations

Today: Linear Regression Model with Regularizations
- Review: (ordinary) least squares: squared loss (normal equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- WHY, and the influence of the regularization parameter

A dataset for regression (continuous-valued target variable):
- Data / points / instances / examples / samples / records: the rows
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: the columns, except the last
- Target / outcome / response / label / dependent variable: the special column to be predicted (the last column)
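
For concreteness, a minimal sketch (assuming a NumPy array whose last column holds the target; the values are made up for illustration) of slicing such a dataset into the feature matrix X and target vector y:

```python
import numpy as np

# Hypothetical toy dataset: each row is an example; the last column is the target.
data = np.array([
    [2.0, 1.0, 3.5],   # features x1, x2, then target y
    [1.0, 0.5, 2.1],
    [3.0, 2.0, 5.4],
])

X = data[:, :-1]   # feature columns (all but the last)
y = data[:, -1]    # target column (the last one)
print(X.shape, y.shape)   # (3, 2) (3,)
```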

SUPERVISED regression (Dr. Yanjun Qi): the training dataset consists of input-output pairs, where the target Y is a continuous variable; the learned function f is used to predict the value f(x) at a new query point x.

Review: the normal equation for LR. Write the cost function in matrix form: J(β) = (1/2) Σ_{i=1}^n (x_i^T β − y_i)^2 = (1/2) (Xβ − y)^T (Xβ − y) = (1/2) (β^T X^T X β − β^T X^T y − y^T X β + y^T y), where X is the matrix whose rows are x_1^T, …, x_n^T and y = (y_1, …, y_n)^T. To minimize J(β), take the derivative and set it to zero, which gives the normal equations X^T X β = X^T y and hence β* = (X^T X)^{-1} X^T y, assuming X^T X is invertible.
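
A minimal NumPy sketch of this closed form on synthetic data (names such as `beta_true` are illustrative, not from the slides; in practice np.linalg.lstsq is preferred for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Normal equations: solve (X^T X) beta = X^T y  (assumes X^T X is invertible).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true
```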

Comments on the normal equation: what if X has less than full column rank? Then X^T X is not invertible → add regularization.

(See page 11 of the handout.)

e.g., a practical application of a regression model (Proceedings of HLT 2010: Human Language Technologies).

e.g., "Movie Reviews and Revenues: An Experiment in Text Regression," Proceedings of HLT '10: about 1.7k examples (n) and more than 3k features, e.g., counts of an n-gram in the text.

e.g., QSAR drug screening: binding to thrombin (DuPont Pharmaceuticals). n = 2,543 compounds tested for their ability to bind to a target site on thrombin (a human enzyme); p = 139,351 binary features, which describe three-dimensional properties of the molecules. (Weston et al., Bioinformatics, 2002)

Today: Linear Regression Model with Regularizations
- Review: (ordinary) least squares: squared loss (normal equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- Influence of the regularization parameter

Review: vector norms. A norm of a vector x is, informally, a measure of the length of the vector. Common norms: L1, L2 (Euclidean), and L-infinity.
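
A quick NumPy sketch of the three norms on an example vector (values chosen only for illustration):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
print(np.linalg.norm(x, 1))       # L1 norm:  |3| + |-4| + |1| = 8
print(np.linalg.norm(x, 2))       # L2 norm:  sqrt(9 + 16 + 1) ≈ 5.10
print(np.linalg.norm(x, np.inf))  # L-infinity norm: max |x_i| = 4
```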

Review: vector norm (L2, i.e., the p-norm with p = 2).

[Figure: contour plots of the norm penalties — the lasso (L1) diamond versus the quadratic (L2) circle.]

Ridge regression / L2 regularization. The least-squares solution is β* = (X^T X)^{-1} X^T y. If X^T X is not invertible, a classical solution is to add a small positive element to the diagonal: β* = (X^T X + λI)^{-1} X^T y. By convention, the bias/intercept term is typically not regularized; here we assume the data have been centered, so there is no bias term.
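
A minimal NumPy sketch of this ridge closed form on synthetic, centered data (the variable `lam` stands in for λ and is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

lam = 1.0
# Ridge closed form: (X^T X + lambda * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.round(beta_ridge, 3))
```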

Positive definite matrices: one important property of positive definite matrices is that they are always full rank, and hence invertible. (Extra: see the proof on pages 17-18 of the linear-algebra handout.)

β* = (X^T X + λI)^{-1} X^T y

Ridge regression / squared loss + L2. β* = (X^T X + λI)^{-1} X^T y is the solution (as derived in HW2) of β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) + λ β^T β; to minimize, take the derivative and set it to zero. By convention, the bias/intercept term is typically not regularized; here we assume the data have been centered, so there is no bias term.

Ridge regression / squared loss + L2. β* = (X^T X + λI)^{-1} X^T y solves (as in HW2) β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) + λ β^T β (take the derivative and set it to zero). Equivalently, β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) subject to Σ_{j=1}^p β_j^2 ≤ s^2. By convention, the bias/intercept term is typically not regularized; here we assume the data have been centered, so there is no bias term.

Objective function's contour lines for ridge regression: β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) subject to Σ_{j=1}^p β_j^2 ≤ s^2. [Figure: elliptical loss contours around the least-squares solution, with the circular constraint region of radius s.]

Objective function's contour lines for ridge regression. [Figure: the ridge regression solution is where the loss contours first touch the circular constraint region; the least-squares solution sits at the center of the contours.]

Least squares + L2: the ridge solution. [Figure: in the (β_1, β_2) plane, the ridge solution lies on the circular constraint region of radius s.]

Ridge regression / squared loss + L2 (see whiteboard): λ > 0 penalizes each θ_j; if λ = 0 we recover the least-squares estimator; as λ → ∞, each θ_j → 0.

Parameter shrinkage: β_OLS = (X^T X)^{-1} X^T y versus β_ridge = (X^T X + λI)^{-1} X^T y. When λ = 0 the ridge estimate equals the OLS estimate; as λ grows, the coefficients are shrunk toward zero. (See page 65 of the ESL book at http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf)
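
A small scikit-learn sketch (synthetic data) illustrating this shrinkage; note that scikit-learn calls λ `alpha`:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
y = X @ np.array([3.0, -1.5, 0.0, 2.0, 0.5]) + 0.1 * rng.standard_normal(100)

for alpha in (0.01, 1.0, 10.0, 100.0):
    model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
    # Coefficient magnitudes shrink toward zero as alpha (i.e., lambda) grows.
    print(alpha, np.round(model.coef_, 3))
```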

Influence of the regularization parameter. [Figures: contour plots in the (β_1, β_2) plane showing how the ridge solution moves between the least-squares solution (λ → 0) and the origin as λ grows.]

Today: Linear Regression Model with Regularizations
- Review: (ordinary) least squares: squared loss (normal equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- WHY, and the influence of the regularization parameter

(2) Lasso (least absolute shrinkage and selection operator) / squared loss + L1. The lasso is a shrinkage method like ridge, but it acts in a nonlinear manner on the outcome y. The lasso is defined by β̂_lasso = argmin_β (y − Xβ)^T (y − Xβ) = argmin_β Σ_{i=1}^n (y_i − x_i^T β)^2, subject to Σ_j |β_j| ≤ s. By convention, the bias/intercept term is typically not regularized; here we assume the data have been centered, so there is no bias term.
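
A minimal scikit-learn sketch on synthetic data with only a few truly relevant features; scikit-learn's Lasso solves the penalized form (1/2n)·||y − Xβ||² + α·||β||₁, which drives some coefficients exactly to zero (the alpha value is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10))
# Only features 0 and 3 matter; the rest are noise.
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.standard_normal(200)

lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(np.round(lasso.coef_, 3))           # most entries are exactly 0
print("nonzero:", np.flatnonzero(lasso.coef_))
```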

Lasso (least absolute shrinkage and selection operator): suppose we are in 2 dimensions, β = (β_1, β_2). The constraint region |β_1| + |β_2| ≤ s is a diamond whose four edges lie on the lines β_1 + β_2 = const, β_1 − β_2 = const, −β_1 + β_2 = const, and −β_1 − β_2 = const. [Figure: the lasso solution is where the loss contours around the least-squares solution first touch this diamond.]

In the figure, the solution has eliminated the role of x_2 (its coefficient is exactly zero), leading to sparsity. [Figure: the lasso solution sits at a corner of the diamond of size s; the least-squares solution is at the center of the contours.]

[Figure: the lasso estimator (diamond constraint region of size s) compared with ridge regression (circular constraint region of size s).]

Lasso (least absolute shrinkage and selection operator): notice that the ridge penalty Σ_j β_j^2 is replaced by Σ_j |β_j|. Due to the nature of this constraint, if the tuning parameter is chosen small enough, the lasso will set some coefficients exactly to zero.

Lasso when p > n. Prediction accuracy and model interpretation are two important aspects of regression models; the lasso does shrinkage and variable selection simultaneously, for better prediction and better interpretation. Disadvantages: in the p > n case, the lasso selects at most n variables before it saturates; and if there is a group of variables among which the pairwise correlations are very high, the lasso tends to select only one variable from the group.

Lasso: implicit feature selection. [Figure: an n × p design matrix X with p ≫ n; the lasso keeps only a small subset of the p columns.]


Today: Linear Regression Model with Regularizations
- Review: (ordinary) least squares: squared loss (normal equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- WHY, and the influence of the regularization parameter

(3) Hybrid of ridge and lasso: elastic-net regularization. The L1 part of the penalty generates a sparse model. The L2 part of the penalty removes the limitation on the number of selected variables, encourages a grouping effect, and stabilizes the L1 regularization path.

Naive elastic net: for any non-negative fixed λ_1 and λ_2, the naive elastic net criterion is L(λ_1, λ_2, β) = (y − Xβ)^T (y − Xβ) + λ_2 Σ_j β_j^2 + λ_1 Σ_j |β_j|, and the naive elastic net estimator is the minimizer of this equation.
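
A small scikit-learn sketch on synthetic data; note that sklearn's ElasticNet uses a different parameterization (a single `alpha` plus `l1_ratio`, which mixes the L1 and L2 parts) rather than separate λ_1 and λ_2, and the values below are illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.standard_normal((150, 20))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.standard_normal(150)

# l1_ratio=0.5 mixes the L1 and L2 penalties equally.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)
print(np.round(enet.coef_, 3))
```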

Geometry of the elastic net. [Figure: the elastic-net constraint region, intermediate between the lasso diamond and the ridge circle.]

"Movie Reviews and Revenues: An Experiment in Text Regression," Proceedings of HLT '10: Human Language Technologies. The paper uses linear regression to directly predict a movie's opening-weekend gross earnings, denoted y, based on features x extracted from the movie metadata and/or the text of the reviews.

Advantage of the elastic net (extra): the naive elastic net can be converted to a lasso problem on an augmented data set. In the augmented formulation the sample size is n + p and the augmented design matrix X* has rank p, so the elastic net can potentially select all p predictors; the naive elastic net can therefore perform automatic variable selection like the lasso.
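
A sketch of that augmented-data construction, following (to the best of my reading) Lemma 1 of Zou & Hastie (2005); the penalty rescaling noted in the docstring is part of that lemma, and this snippet only builds the augmented data rather than tying itself to any particular lasso solver's parameterization:

```python
import numpy as np

def naive_enet_augment(X, y, lam2):
    """Augmented data (X*, y*): per Zou & Hastie, a lasso on (X*, y*) with
    penalty lam1 / sqrt(1 + lam2), rescaled by 1 / sqrt(1 + lam2), gives the
    naive elastic-net coefficients for penalties (lam1, lam2)."""
    n, p = X.shape
    X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    y_star = np.concatenate([y, np.zeros(p)])
    return X_star, y_star    # shapes: (n + p, p) and (n + p,)

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 50))            # p > n
y = X[:, 0] + 0.1 * rng.standard_normal(30)
X_star, y_star = naive_enet_augment(X, y, lam2=1.0)
print(X_star.shape)   # (80, 50): rank p, so all 50 predictors can enter
```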

Regularized multivariate linear regression:
- Task: regression
- Representation: Y = weighted linear sum of the X's
- Score function: least squares + regularization, min_β J(β) = Σ_{i=1}^n (Y_i − Ŷ_i)^2 + λ (Σ_{j=1}^p |β_j|^q)^{1/q}
- Search / optimization: linear algebra for ridge; sub-gradient descent for lasso and elastic net
- Models / parameters: regression coefficients (regularized weights)

Today: Linear Regression Model with Regularizations
- Review: (ordinary) least squares: squared loss (normal equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- WHY, and the influence of the regularization parameter

Why use simpler models? Because they are:
- simpler to use (lower computational complexity)
- easier to train (need fewer examples)
- less sensitive to noise
- easier to explain (more interpretable)
- better at generalizing (lower variance; Occam's razor)
More on this in future lectures. (Adapted from Dr. Christoph Eick's slides.)

Model selection & generalization.
- Generalization: learn a function / hypothesis from past data in order to explain, predict, model, or control new data examples.
- Underfitting: the model is too simple, and both training and test errors are large.
- Overfitting: the model is too complex; test errors are large although training errors are small, because after learning the model tends to fit the noise.

Issue: overfitting and underfitting. Fitting the same data with y = θ_0 + θ_1 x (underfit), y = θ_0 + θ_1 x + θ_2 x^2 (looks good), and y = Σ_{j=0}^5 θ_j x^j (overfit). Generalization: learn a function / hypothesis from past data in order to explain, predict, model, or control new data examples — use k-fold cross-validation!
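
A sketch of this comparison on synthetic quadratic data with a hypothetical train/test split; typically the degree-1 fit has high error everywhere, degree 2 does well on both sets, and degree 5 has the lowest training error but not the lowest test error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, size=(40, 1))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + 0.5 * rng.standard_normal(40)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 2, 5):   # underfit, about right, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    print(degree,
          round(mean_squared_error(y_tr, model.predict(x_tr)), 3),   # train error
          round(mean_squared_error(y_te, model.predict(x_te)), 3))   # test error
```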

Overfitting and underfitting. [Figure: an overfit curve passing through every training point.]

Overfitting and regularization. A regularizer is an additional criterion added to the loss function to make sure that we don't overfit. It is called a regularizer because it tries to keep the parameters more "normal" / "regular". It is a bias on the model: it forces the learning to prefer certain types of weights over others, e.g., β̂_ridge = argmin_β Σ_{i=1}^n (y_i − x_i^T β)^2 + λ β^T β.

Common regularizers.
- L2: squared weights, Σ_j β_j^2 — penalizes large values more.
- L1: sum of absolute weights, Σ_j |β_j| — penalizes small values more (relative to L2).
Generally, we don't want huge weights: if weights are large, a small change in a feature can result in a large change in the prediction. We might also prefer weights of exactly 0 for features that aren't useful.
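
A tiny numeric sketch of how the two penalties treat the same weight vectors differently (vectors chosen only for illustration):

```python
import numpy as np

for beta in (np.array([1.0, 1.0]), np.array([2.0, 0.0])):
    l1 = np.abs(beta).sum()    # L1 penalty
    l2 = (beta ** 2).sum()     # L2 penalty
    print(beta, "L1:", l1, "L2:", l2)
# Both vectors have the same L1 penalty (2.0), but L2 prefers the spread-out
# vector [1, 1] (penalty 2.0) over the concentrated [2, 0] (penalty 4.0):
# L2 punishes large individual weights more, while L1 is indifferent and so
# more readily tolerates solutions with some weights exactly zero.
```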

Summary: regularized multivariate linear regression.
- Model: Ŷ = β̂_0 + β̂_1 x_1 + … + β̂_p x_p
- LR estimation: argmin_β Σ_{i=1}^n (Y_i − Ŷ_i)^2
- LASSO estimation: argmin_β Σ_{i=1}^n (Y_i − Ŷ_i)^2 + λ Σ_{j=1}^p |β_j|
- Ridge regression estimation: argmin_β Σ_{i=1}^n (Y_i − Ŷ_i)^2 + λ Σ_{j=1}^p β_j^2

More: a family of shrinkage estimators, β̂ = argmin_β Σ_{i=1}^N (y_i − x_i^T β)^2 subject to Σ_j |β_j|^q ≤ s, for q ≥ 0. [Figure: contours of constant value of Σ_j |β_j|^q for the case of two inputs.]

Norms visualized: all p-norms penalize larger weights; q < 2 tends to create sparse solutions (i.e., lots of 0 weights), while q > 2 tends to push for similar weights.

Regularization path of a ridge regression. [Figure: coefficient values plotted against λ, down to λ = 0.]

Regularization path of a ridge regression ("weight decay"). [Figure: an example with 8 features, showing each coefficient shrinking as λ increases from 0.]

Regularization path of a lasso regression: as λ varies, how each β_j varies. [Figure: an example with 8 features; coefficients hit exactly zero one by one as λ grows from 0.]
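
A sketch of computing such a path with scikit-learn's lasso_path on synthetic data with 8 features (plotting omitted; alphas play the role of λ and are returned largest-first):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)

# coefs has shape (n_features, n_alphas): one coefficient trajectory per feature.
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
print(coefs.shape)
# Stronger regularization (larger alpha) leaves more coefficients exactly zero.
print((np.abs(coefs) > 1e-8).sum(axis=0)[:5])   # nonzero counts at the 5 largest alphas
```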

WHY and how to select λ? (1) For generalization ability, use k-fold cross-validation to decide. (2) λ controls the bias and variance of the model (details in future lectures).
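
A sketch of choosing λ by cross-validation with scikit-learn (which again calls λ `alpha`; the alpha grid below is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(8)
X = rng.standard_normal((200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(200)

# 5-fold cross-validation over a grid of candidate regularization strengths.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
lasso = LassoCV(n_alphas=100, cv=5).fit(X, y)
print("ridge lambda:", ridge.alpha_)
print("lasso lambda:", lasso.alpha_)
```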

Regression: complexity versus goodness of fit. [Figure: the training data fit by a model that is too simple (low variance / high bias), one that is about right, and one that is too complex (low bias / high variance).] What ultimately matters: GENERALIZATION.

Choose the λ that generalizes well! An example of ridge regression: as λ varies, how each β_j varies. [Figure: an example with 8 features; coefficients shrink as λ increases from 0.]

Choose the λ that generalizes well! [Figure: as λ varies, how each β_j varies; an example with 8 features.]

Today's recap: Linear Regression Model with Regularizations
- Review: (ordinary) least squares: squared loss (normal equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- Influence of the regularization parameter

References:
- Big thanks to Prof. Eric Xing (CMU) for allowing me to reuse some of his slides.
- Prof. Nando de Freitas's tutorial slides.
- "Regularization and variable selection via the elastic net," Hui Zou and Trevor Hastie, Stanford University, USA.
- ESL book: The Elements of Statistical Learning.

More (see the L6-extra slides): optimization of regularized regression; the relation between λ and s; why the elastic net has a few nice properties.