Linear Regression Models. Based on Chapter 3 of Hastie, Tibshirani and Friedman



Linear Regression Models

f(X) = β₀ + Σ_{j=1}^p X_j β_j

Here the X's might be:
- Raw predictor variables (continuous or coded-categorical)
- Transformed predictors (X₄ = log X₃)
- Basis expansions (X₄ = X₃², X₅ = X₃³, etc.)
- Interactions (X₄ = X₂·X₃)

A popular choice for estimation is least squares:

RSS(β) = Σ_{i=1}^N (y_i − β₀ − Σ_{j=1}^p X_{ij} β_j)²
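As a quick illustration (a minimal sketch on simulated data using base R's lm(); the variable names are invented for this example), raw, transformed, expanded, and interaction terms all enter through the same least-squares machinery via the model formula:

# Simulated data: x1, x2, x3 play the role of raw predictors
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rexp(n)                       # positive, so log(x3) is defined
y  <- 1 + 2*x1 - x2 + 0.5*log(x3) + 0.3*x2*x3 + rnorm(n)

# Raw predictors, a transformed predictor, a basis expansion, and an interaction
fit <- lm(y ~ x1 + x2 + log(x3) + I(x3^2) + x2:x3)
summary(fit)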

Least Squares

RSS(β) = (y − Xβ)ᵀ(y − Xβ)

β̂ = (XᵀX)⁻¹ Xᵀ y

ŷ = X β̂ = X (XᵀX)⁻¹ Xᵀ y, where H = X (XᵀX)⁻¹ Xᵀ is the "hat matrix".

Often we assume that the Y's are independent and normally distributed, leading to various classical statistical tests and confidence intervals.
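A minimal sketch in base R (simulated data; solve() on the normal equations is used for clarity, not numerical robustness) checking the closed-form solution and the hat matrix against lm():

set.seed(2)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))        # include an intercept column
beta_true <- c(1, 2, -1, 0.5)
y <- drop(X %*% beta_true + rnorm(n))

beta_hat <- solve(t(X) %*% X, t(X) %*% y)        # (X'X)^{-1} X'y
H        <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix
y_hat    <- H %*% y                              # fitted values = X beta_hat

# Agrees with lm(), which uses a QR decomposition internally
fit <- lm(y ~ X - 1)
max(abs(beta_hat - coef(fit)))
sum(diag(H))   # trace of H = number of fitted parameters (here 4)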

Gauss-Markov Theorem

Consider any linear combination of the β's: θ = aᵀβ. The least squares estimate of θ is:

θ̂ = aᵀβ̂ = aᵀ(XᵀX)⁻¹Xᵀy

If the linear model is correct, this estimate is unbiased (X fixed):

E(θ̂) = E(aᵀ(XᵀX)⁻¹Xᵀy) = aᵀ(XᵀX)⁻¹XᵀXβ = aᵀβ

Gauss-Markov states that for any other linear unbiased estimator θ̃ = cᵀy (i.e., E(cᵀy) = aᵀβ):

Var(aᵀβ̂) ≤ Var(cᵀy)

Of course, there might be a biased estimator with lower MSE.
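A small simulation sketch (entirely illustrative; the "competitor" is simply least squares computed on half the rows, which is still linear in y and unbiased but wastes information) showing the variance ordering the theorem guarantees:

set.seed(3)
n <- 60
X <- cbind(1, rnorm(n), rnorm(n))     # fixed design, reused in every replication
beta <- c(1, 2, -1)
a <- c(0, 1, 0)                       # target: theta = a' beta = beta_1

est_full <- replicate(5000, {
  y <- drop(X %*% beta + rnorm(n))
  drop(a %*% solve(t(X) %*% X, t(X) %*% y))       # a' beta_hat (OLS on all rows)
})
est_half <- replicate(5000, {
  y  <- drop(X %*% beta + rnorm(n))
  Xh <- X[1:(n/2), ]; yh <- y[1:(n/2)]
  drop(a %*% solve(t(Xh) %*% Xh, t(Xh) %*% yh))   # linear, unbiased, but noisier
})

c(mean(est_full), mean(est_half))   # both close to beta_1 = 2 (unbiased)
c(var(est_full),  var(est_half))    # full-data least squares has the smaller variance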

Bias-Variance

For any estimator θ̃ of θ:

MSE(θ̃) = E(θ̃ − θ)² = E(θ̃ − E θ̃)² + (E θ̃ − θ)² = Var(θ̃) + bias²(θ̃)

Note that MSE is closely related to prediction error:

E(Y₀ − x₀ᵀθ̃)² = σ² + E(x₀ᵀθ̃ − x₀ᵀθ)² = σ² + MSE(x₀ᵀθ̃)
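A simulation sketch (simulated data; the shrinkage estimator is just the least squares coefficients multiplied by 0.8, chosen only to illustrate the decomposition) estimating squared bias, variance, and MSE of a prediction at a fixed point x₀:

set.seed(4)
n <- 40
X <- cbind(1, rnorm(n), rnorm(n))
beta <- c(1, 2, -1)
x0 <- c(1, 0.5, -0.5)
f0 <- sum(x0 * beta)                      # true mean at x0

pred <- replicate(5000, {
  y <- drop(X %*% beta + rnorm(n))
  b_ols <- solve(t(X) %*% X, t(X) %*% y)
  c(ols = sum(x0 * b_ols), shrunk = sum(x0 * 0.8 * b_ols))
})

summary_fun <- function(p) c(bias2 = (mean(p) - f0)^2,
                             var   = var(p),
                             mse   = mean((p - f0)^2))
rbind(ols = summary_fun(pred["ols", ]), shrunk = summary_fun(pred["shrunk", ]))
# In each row, mse ≈ bias2 + var; shrinking lowers the variance but introduces bias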

Too Many Predictors?

When there are lots of X's, we get models with high variance and prediction suffers. Three solutions:

1. Subset selection (score with AIC, BIC, etc.; all-subsets via leaps-and-bounds; stepwise methods)
2. Shrinkage / ridge regression
3. Derived inputs

Subset Selection

Standard all-subsets selection finds the subset of size k, for k = 1, 2, ..., p, that minimizes RSS. Choosing the subset size requires a tradeoff: use AIC, BIC, marginal likelihood, cross-validation, etc. Leaps and bounds is an efficient algorithm for all-subsets selection.
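A sketch using the leaps package (assuming it is installed; its regsubsets() function implements the leaps-and-bounds search) on simulated data with several irrelevant predictors:

# install.packages("leaps")   # if not already installed
library(leaps)

set.seed(5)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
y <- 2*X[, 1] - X[, 2] + 0.5*X[, 3] + rnorm(n)   # only x1..x3 matter
dat <- data.frame(y, X)

fits <- regsubsets(y ~ ., data = dat, nvmax = p)  # best subset of each size 1..p
s <- summary(fits)
s$which[which.min(s$bic), ]                       # variables in the BIC-best subset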

Cross-Validation

e.g. 10-fold cross-validation:
- Randomly divide the data into ten parts
- Train the model on 9 tenths and compute the prediction error on the remaining 1 tenth
- Do this for each tenth of the data
- Average the 10 prediction error estimates

One-standard-error rule: pick the simplest model within one standard error of the minimum.
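A minimal base-R sketch of 10-fold cross-validation for a linear model (simulated data; no packages assumed):

set.seed(6)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 + rnorm(n)
dat <- data.frame(y, x1, x2)

K <- 10
fold <- sample(rep(1:K, length.out = n))     # random fold assignment
cv_err <- sapply(1:K, function(k) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]
  fit   <- lm(y ~ x1 + x2, data = train)     # train on 9/10 of the data
  mean((test$y - predict(fit, test))^2)      # prediction error on the held-out tenth
})
c(cv_estimate = mean(cv_err), std_error = sd(cv_err) / sqrt(K))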

Shrinkage Methods

Subset selection is a discrete process: individual variables are either in or out. This method can have high variance; a different dataset from the same source can result in a totally different model. Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included, but with a shrunken coefficient.

Ridge Regression

β̂^ridge = argmin_β Σ_{i=1}^N (y_i − β₀ − Σ_{j=1}^p x_{ij} β_j)²

subject to: Σ_{j=1}^p β_j² ≤ s

Equivalently:

β̂^ridge = argmin_β { Σ_{i=1}^N (y_i − β₀ − Σ_{j=1}^p x_{ij} β_j)² + λ Σ_{j=1}^p β_j² }

This leads to:

β̂^ridge = (XᵀX + λI)⁻¹ Xᵀ y   (works even when XᵀX is singular)

Choose λ by cross-validation. Predictors should be centered.
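A sketch of the closed-form ridge solution in base R (simulated data; predictors and response are centered so the intercept can be handled separately):

set.seed(7)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(3, 1.5, 0, 0, 2) + rnorm(n))

Xc <- scale(X, center = TRUE, scale = FALSE)   # center the predictors
yc <- y - mean(y)                              # center the response

ridge_coef <- function(lambda)
  drop(solve(t(Xc) %*% Xc + lambda * diag(p), t(Xc) %*% yc))

cbind(ols = ridge_coef(0), ridge = ridge_coef(10))  # lambda = 10 shrinks toward zero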

[Figure: effective number of X's (the effective degrees of freedom of the ridge fit)]
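The "effective number of X's" is the effective degrees of freedom of the ridge fit, df(λ) = Σ_j d_j² / (d_j² + λ), where the d_j are the singular values of the centered X. A short base-R sketch (simulated data):

set.seed(8)
X  <- matrix(rnorm(100 * 5), 100, 5)
Xc <- scale(X, center = TRUE, scale = FALSE)
d  <- svd(Xc)$d                                  # singular values of centered X

df_ridge <- function(lambda) sum(d^2 / (d^2 + lambda))
sapply(c(0, 1, 10, 100, 1000), df_ridge)   # decreases from p = 5 toward 0 as lambda grows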

Ridge Regression = Bayesian Regression

y_i ~ N(β₀ + x_iᵀβ, σ²)
β_j ~ N(0, τ²)

The resulting posterior mode for β is the same as ridge with λ = σ²/τ².
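A quick numeric check of the equivalence (simulated data, no intercept for simplicity; the posterior mean under the Gaussian prior is computed directly and compared with the ridge formula at λ = σ²/τ²):

set.seed(9)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0.5)
sigma2 <- 1; tau2 <- 0.5
y <- drop(X %*% beta + rnorm(n, sd = sqrt(sigma2)))

# Posterior mean of beta under beta_j ~ N(0, tau2), y | beta ~ N(X beta, sigma2 I)
post_mean <- solve(t(X) %*% X / sigma2 + diag(p) / tau2, t(X) %*% y / sigma2)

# Ridge estimate with lambda = sigma2 / tau2
ridge <- solve(t(X) %*% X + (sigma2 / tau2) * diag(p), t(X) %*% y)

max(abs(post_mean - ridge))   # numerically zero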

The Lasso

β̂^lasso = argmin_β Σ_{i=1}^N (y_i − β₀ − Σ_{j=1}^p x_{ij} β_j)²

subject to: Σ_{j=1}^p |β_j| ≤ s

A quadratic programming algorithm is needed to solve for the parameter estimates. Choose s via cross-validation.

More generally, consider the penalized criterion

β̃ = argmin_β { Σ_{i=1}^N (y_i − β₀ − Σ_{j=1}^p x_{ij} β_j)² + λ Σ_{j=1}^p |β_j|^q }

q = 0: variable selection; q = 1: lasso; q = 2: ridge. Learn q?
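In practice the lasso is usually fit with coordinate-descent software rather than generic quadratic programming. A sketch using the glmnet package (assuming it is installed; alpha = 1 selects the lasso penalty):

# install.packages("glmnet")   # if not already installed
library(glmnet)

set.seed(10)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(3, -2, 1.5, rep(0, p - 3)) + rnorm(n))

fit <- glmnet(X, y, alpha = 1)      # lasso path over a grid of lambda values
plot(fit, xvar = "lambda")          # coefficient profiles versus log(lambda)

cv <- cv.glmnet(X, y, alpha = 1)    # choose lambda by 10-fold cross-validation
coef(cv, s = "lambda.1se")          # one-standard-error rule; many coefficients exactly 0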

[Figure: lasso coefficient profiles plotted as a function of 1/λ]

Principal Component Regression

Consider an eigen-decomposition of XᵀX (and hence of the covariance matrix of X, since X is first centered; X is N × p):

XᵀX = V D² Vᵀ

The eigenvectors v_j are called the principal components of X. D is diagonal with entries d₁ ≥ d₂ ≥ ... ≥ d_p.

Xv₁ has the largest sample variance amongst all normalized linear combinations of the columns of X: Var(Xv₁) = d₁²/N.

Xv_k has the largest sample variance amongst all normalized linear combinations of the columns of X, subject to being orthogonal to all the earlier ones.

Principal Component Regression

PC regression regresses y on the first M principal components, where M < p. Similar to ridge regression in some respects (see HTF, p. 66).
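A base-R sketch of PC regression (simulated data; prcomp() centers X, the response is regressed on the first M score vectors, and the coefficients are mapped back to the original predictors):

set.seed(11)
n <- 100; p <- 6
X <- matrix(rnorm(n * p), n, p)
X[, 4] <- X[, 1] + 0.1 * rnorm(n)        # make some columns strongly correlated
y <- drop(X %*% c(2, -1, 0.5, 0, 0, 0) + rnorm(n))

M  <- 3
pc <- prcomp(X, center = TRUE, scale. = FALSE)
Z  <- pc$x[, 1:M]                        # scores on the first M principal components

fit <- lm(y ~ Z)                         # regress y on the leading components
beta_pcr <- pc$rotation[, 1:M] %*% coef(fit)[-1]   # back to coefficients on the original X's
beta_pcr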

www.r-project.org/user-2006/slides/hesterberg+fraley.pdf

# Incremental forward-stagewise regression with two predictors:
# repeatedly nudge the coefficient of the predictor most correlated
# with the current residual by a small amount epsilon.
x1 <- rnorm(10)
x2 <- rnorm(10)
y  <- 3*x1 + x2 + rnorm(10, 0.1)

par(mfrow = c(1, 2))
plot(x1, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x1))
plot(x2, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x2))

epsilon <- 0.1
r       <- y          # current residual
beta    <- c(0, 0)
numIter <- 25

for (i in 1:numIter) {
  cat(cor(x1, r), "\t", cor(x2, r), "\t", beta[1], "\t", beta[2], "\n")
  if (cor(x1, r) > cor(x2, r)) {
    delta   <- epsilon * (2 * (drop(r %*% x1) > 0) - 1)   # step in the sign of the correlation
    beta[1] <- beta[1] + delta
    r       <- r - delta * x1
    par(mfg = c(1, 1))
    abline(0, beta[1], col = "red")
  }
  if (cor(x1, r) <= cor(x2, r)) {
    delta   <- epsilon * (2 * (drop(r %*% x2) > 0) - 1)
    beta[2] <- beta[2] + delta
    r       <- r - delta * x2
    par(mfg = c(1, 2))
    abline(0, beta[2], col = "green")
  }
}

LARS

- Start with all coefficients b_j = 0
- Find the predictor x_j most correlated with y
- Increase b_j in the direction of the sign of its correlation with y. Take residuals r = y − ŷ along the way. Stop when some other predictor x_k has as much correlation with r as x_j has
- Increase (b_j, b_k) in their joint least squares direction until some other predictor x_m has as much correlation with the residual r
- Continue until all predictors are in the model
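A sketch using the lars package (assuming it is installed; type = "lar" gives least angle regression, type = "lasso" the lasso modification):

# install.packages("lars")   # if not already installed
library(lars)

set.seed(12)
n <- 100; p <- 6
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(3, -2, 1, 0, 0, 0) + rnorm(n))

fit <- lars(X, y, type = "lar")   # least angle regression path
plot(fit)                         # coefficients entering one at a time
coef(fit)                         # coefficient values at each step of the path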

Fused Lasso

If there are many correlated features, the lasso tends to give non-zero weight to only one of them. Maybe correlated features (e.g. time-ordered ones) should have similar coefficients? The fused lasso adds a penalty on the differences between successive coefficients. Tibshirani et al. (2005)

Group Lasso

Suppose you represent a categorical predictor with indicator variables. You might want the whole set of indicators to be in or out together.

regular lasso penalty: λ Σ_j |β_j|
group lasso penalty: λ Σ_g √(p_g) ||β_g||₂, where β_g collects the coefficients in group g and p_g is the group size

Yuan and Lin (2006)
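A sketch using the grpreg package (assuming it is installed; the group argument maps each column of X to a group, so the three indicator columns enter or leave the model together):

# install.packages("grpreg")   # if not already installed
library(grpreg)

set.seed(13)
n <- 200
f  <- factor(sample(c("a", "b", "c", "d"), n, replace = TRUE))   # categorical predictor
x5 <- rnorm(n); x6 <- rnorm(n)
X  <- cbind(model.matrix(~ f)[, -1], x5, x6)   # 3 indicator columns + 2 numeric columns
y  <- 2 * (f == "b") - 1.5 * (f == "c") + x5 + rnorm(n)

group <- c(1, 1, 1, 2, 3)                      # the indicators form one group
fit <- grpreg(X, y, group = group, penalty = "grLasso")
plot(fit)                                      # whole groups enter the model together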