ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

General terms Supervised learning: we have input X and the corresponding output Y; regression and classification are typical supervised learning tasks. Unsupervised learning: only the input X is given, and there is no notion of an output during learning; clustering is a typical unsupervised learning task. Linear regression: method of (ordinary) least squares; penalized least squares.

Simple Linear Regression We observe the data (y_i, x_i) for i = 1, ..., n, and model the linear relationship y_i = β_0 + β_1 x_i + ε_i, assuming the ε_i are independent and identically distributed (i.i.d.) normal with mean 0 and variance σ^2. How do we estimate β_0, β_1 and σ^2?

Method of Least Squares The method of least squares: find b_0 and b_1 so as to minimize the residual sum of squares RSS(b_0, b_1) = Σ_{i=1}^n (y_i − b_0 − b_1 x_i)^2. The solution is b_1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)^2 and b_0 = ȳ − b_1 x̄.
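A minimal numpy sketch of these closed-form estimates (the function and variable names are illustrative, not from the course):

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form least squares for y_i = b0 + b1*x_i + e_i."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    residuals = y - (b0 + b1 * x)
    sigma2_hat = np.sum(residuals ** 2) / (len(y) - 2)  # unbiased estimate of sigma^2
    return b0, b1, sigma2_hat

# usage with simulated data
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)
b0, b1, s2 = simple_ols(x, y)
```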

Fitted Values & Residuals The regression line is ŷ = b_0 + b_1 x, and the values ŷ_i = b_0 + b_1 x_i are called fitted or predicted values. The residuals are the deviations from the estimated line: e_i = y_i − ŷ_i, for i = 1, ..., n.

Properties of Regression Estimators Since b_1 is a linear combination of the y_i, we have E(b_1) = β_1. Thus b_1 is unbiased, with Var(b_1) = σ^2 / Σ_i (x_i − x̄)^2. Similarly, E(b_0) = β_0, and thus b_0 is unbiased, with Var(b_0) = σ^2 (1/n + x̄^2 / Σ_i (x_i − x̄)^2).

How to estimate σ^2? Note that the ε_i are i.i.d. with mean 0 and variance σ^2, so it would be reasonable to estimate σ^2 by (1/n) Σ_i ε_i^2. Since β_0 and β_1 are unknown, we instead estimate σ^2 from the residuals by σ̂^2 = Σ_i e_i^2 / (n − 2) = RSS / (n − 2).

Summary Let S_xx = Σ_i (x_i − x̄)^2. If the ε_i are i.i.d. N(0, σ^2), then b_1 ~ N(β_1, σ^2 / S_xx), b_0 ~ N(β_0, σ^2 (1/n + x̄^2 / S_xx)), and (n − 2) σ̂^2 / σ^2 ~ χ^2_{n−2}, independent of (b_0, b_1).

Statistical inference A 100(1 − α)% confidence interval for β_1 is b_1 ± t_{n−2, α/2} · σ̂ / √S_xx. A 100(1 − α)% confidence interval for the estimated regression mean at a new point x = x* is (b_0 + b_1 x*) ± t_{n−2, α/2} · σ̂ √(1/n + (x* − x̄)^2 / S_xx).
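A hedged sketch of these interval formulas in Python (scipy supplies the t quantile; the function name and the 95% default level are my own choices):

```python
import numpy as np
from scipy import stats

def slr_intervals(x, y, x_new, alpha=0.05):
    """CIs for beta_1 and for the mean response at x_new in simple linear regression."""
    n = len(y)
    x_bar = x.mean()
    Sxx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x_bar
    resid = y - (b0 + b1 * x)
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    se_b1 = sigma_hat / np.sqrt(Sxx)
    ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    se_mean = sigma_hat * np.sqrt(1 / n + (x_new - x_bar) ** 2 / Sxx)
    fit_new = b0 + b1 * x_new
    ci_mean = (fit_new - t_crit * se_mean, fit_new + t_crit * se_mean)
    return ci_b1, ci_mean
```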

Multiple Linear Regression Observe data (Y_i, X_{i1}, ..., X_{ip}) for i = 1, ..., n, and assume the linear model Y_i = f(X_i) + ε_i = β_0 + β_1 X_{i1} + ... + β_p X_{ip} + ε_i with ε_i ~ N(0, σ^2). How do we estimate the β_j? The (ordinary) least squares estimator is β̂_ols = (XᵀX)^{−1} XᵀY, which has good properties (Gauss-Markov, maximum likelihood).
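A minimal numpy sketch of the OLS estimator; the design matrix gets an explicit intercept column, and the system is solved with a least-squares routine rather than an explicit matrix inverse (names are illustrative):

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimate (X'X)^{-1} X'y, computed via a numerically stable solver."""
    X1 = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta_hat
```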

Derivation: Multiple Linear Regression Unbiased: E(β̂_ols) = (XᵀX)^{−1} Xᵀ E(Y) = (XᵀX)^{−1} XᵀXβ = β. Variance: Var(β̂_ols) = (XᵀX)^{−1} Xᵀ Var(Y) X (XᵀX)^{−1} = σ^2 (XᵀX)^{−1}.

Statistical inference Estimation of σ^2: σ̂^2 = RSS / (n − p − 1), where RSS = (Y − Xβ̂)ᵀ(Y − Xβ̂). T-test of β_j: t_j = β̂_j / (σ̂ √v_j), where v_j is the j-th diagonal element of (XᵀX)^{−1}; under H_0: β_j = 0, t_j follows a t_{n−p−1} distribution. F-test (comparing a restricted model with p_0 predictors to the full model with p_1 predictors): F = [(RSS_0 − RSS_1) / (p_1 − p_0)] / [RSS_1 / (n − p_1 − 1)].
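A short sketch of the nested F-test described above (both designs get an intercept; the function and variable names are my own):

```python
import numpy as np
from scipy import stats

def nested_f_test(X_restricted, X_full, y):
    """F-test comparing a restricted design against a full design."""
    def rss_and_k(X):
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return resid @ resid, X1.shape[1]          # RSS and number of coefficients

    rss0, k0 = rss_and_k(X_restricted)
    rss1, k1 = rss_and_k(X_full)
    f_stat = ((rss0 - rss1) / (k1 - k0)) / (rss1 / (len(y) - k1))
    p_value = 1 - stats.f.cdf(f_stat, k1 - k0, len(y) - k1)
    return f_stat, p_value
```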


Insight of two models Consider the restricted model Y_restricted = X_1 β_{1,restricted} and the full model Y_full = X_1 β_{1,full} + X_2 β_{2,full}.

What does "better" mean? Two questions to ask when judging whether an estimator β̂ is good. Model identification: is β̂ close to the true β, e.g., is the mean squared error (MSE) small? MSE(β̂) = E[(β̂ − β)^2]. Prediction: will the model predict future observations well?

Multiple Linear Regression Gauss-Markov Theorem: if E(Y) = Xβ and Var(Y) = σ^2 I, then the best linear unbiased estimator (BLUE) of aᵀβ is aᵀβ̂_ols.


Prediction: Overfitting vs. Underfitting Overfitting: when the statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. Underfitting: when the model is too simple, so both training and test errors are large.

Variable Selection Subset selection: suppose that we have p predictors with coefficients β_0, β_1, ..., β_{p−1}. The subset selection problem is to find, for each k ∈ {0, 1, 2, ..., p}, the subset of k out of p predictors that minimizes the RSS. Can we search all 2^p candidate models? (e.g., p = 20 gives 2^p = 1,048,576.)
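An illustrative brute-force sketch of best-subset search for a fixed subset size k; it is only feasible for small p, which is exactly the point of the 2^p remark above (names are illustrative):

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Among all size-k subsets of columns of X, return the one with the smallest RSS."""
    n, p = X.shape
    best_rss, best_cols = np.inf, ()
    for cols in combinations(range(p), k):
        X1 = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        rss = np.sum((y - X1 @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols, best_rss
```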

Greedy procedures & sequential search methods The previous figure shows that using more than two predictor variables does not gain much. So why don't we select the smallest subset of variables such that the RSS is smaller than a given threshold? Forward selection: start with β_0 (the intercept-only model), and subsequently add to the model the predictor that most improves the fit. Question: where does the F-test come from?

Greedy procedures & sequential search methods Backward selection: start with the full model (with all p predictors), and sequentially drop predictors one at a time, each time removing the one whose corresponding F-ratio is the smallest. Stepwise selection: a combination of the above, testing at each step for variables to be included or excluded; a forward-selection sketch follows below.
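A hedged sketch of forward selection, greedily adding the predictor that reduces the RSS the most; the simple stopping rule (a maximum number of variables) is my own simplification rather than the F-test-based rule discussed above:

```python
import numpy as np

def forward_selection(X, y, max_vars):
    """Greedy forward selection: repeatedly add the column that most reduces the RSS."""
    n, p = X.shape
    selected, remaining = [], list(range(p))

    def rss_of(cols):
        X1 = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return np.sum((y - X1 @ beta) ** 2)

    while remaining and len(selected) < max_vars:
        scores = {c: rss_of(selected + [c]) for c in remaining}
        best = min(scores, key=scores.get)     # candidate giving the smallest RSS
        selected.append(best)
        remaining.remove(best)
    return selected
```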

Classical Subset Selection Classical subset selection seeks a subset of variables that minimizes some criterion: Mallows' C_p; cross-validation (Ch 7.10) and GCV (Eq. 7.46 on p. 239); Akaike's information criterion (AIC, Akaike 1973); Bayesian information criterion (BIC, Schwarz 1978).

Information Criterion Typically, an information criterion has the form −2 log-likelihood + λ(n) · k, where k is the number of parameters. AIC: λ(n) = 2 (better for prediction). BIC: λ(n) = log n (better for model identification). Others (λ(n) = o(n)): AICc (Hurvich & Tsai, 1989): AIC + 2k(k + 1)/(n − k − 1); HQ (Hannan & Quinn, 1979): λ(n) = 2 log log n.
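A small sketch of computing AIC and BIC from the RSS of a Gaussian linear model (this uses a common convention that drops additive constants; exact constants vary across texts and software):

```python
import numpy as np

def gaussian_ic(rss, n, k):
    """AIC and BIC for a Gaussian linear model, up to an additive constant.
    k counts the estimated coefficients (including the intercept)."""
    neg2_loglik = n * np.log(rss / n)   # -2 log-likelihood with constants dropped
    aic = neg2_loglik + 2 * k
    bic = neg2_loglik + np.log(n) * k
    return aic, bic
```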

Lasso regression Because the ℓ_0-norm problem (subset selection) is combinatorial and hard to solve, people have tried to approximate it with other convex objective functions that are easier to solve; this tactic is called "convex relaxation." For example, it is natural to convert the ℓ_0-constrained formulation into the ℓ_1-constrained one: min_β (Y − Xβ)ᵀ(Y − Xβ) subject to Σ_j |β_j| ≤ t. Lasso regression uses this ℓ_1-norm convex relaxation to approximate the original subset selection problem. We usually center the data before we do the regression.
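A minimal sketch of fitting the lasso in its Lagrangian (penalized) form with scikit-learn, assuming that library is available; the simulated data, the choice alpha = 0.1, and the variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# illustrative sparse-regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ beta_true + rng.normal(scale=0.5, size=100)

X_std = StandardScaler().fit_transform(X)   # center and scale the predictors
lasso = Lasso(alpha=0.1).fit(X_std, y - y.mean())
print(lasso.coef_)                          # several coefficients are shrunk exactly to zero
```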

How well do greedy algorithms or convex approximations work? For greedy algorithms: Tropp, 2004, "Greedy is good: algorithmic results for sparse approximation," IEEE Transactions on Information Theory, 50(10): 2231-2242. Basically, when the size of the optimal subset is small enough, the greedy algorithm can reach the true optimum; unfortunately, the bound on "small enough" is typically too small. For convex approximation: Tropp, 2006, "Just relax: convex programming methods for subset selection and sparse approximation," IEEE Transactions on Information Theory, 52(3): 1030-1051. The conclusion is similar to that for the greedy algorithms, but the bound is a little more relaxed.

Lasso regression Define a standardized quantity, the shrinkage factor s = t / t_0, where t_0 = Σ_j |β̂_j^{ols}|. When s = 1, all the coefficients from the lasso are the same as those from the LSE; when s = 0, all the coefficients from the lasso are zero; for s in between, s indicates how much the coefficients are shrunk on average.
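A tiny sketch of computing this shrinkage factor for a fitted lasso (it uses the fact that, when the ℓ_1 constraint is active, the ℓ_1 norm of the lasso fit equals the bound t; names are illustrative):

```python
import numpy as np

def shrinkage_factor(beta_lasso, beta_ols):
    """s = ||beta_lasso||_1 / ||beta_ols||_1; 1 means no shrinkage, 0 means an all-zero fit."""
    return np.sum(np.abs(beta_lasso)) / np.sum(np.abs(beta_ols))
```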

Subset selection vs. coefficient shrinkage As expected, in the lasso-result figure, the coefficient values are shrunk as s goes from 1 to 0. For this reason, people also use "coefficient shrinkage" to refer to methods for subset selection; oftentimes, coefficient shrinkage is indeed used interchangeably with subset selection. However, fundamentally, subset selection is a combinatorial problem, while coefficient shrinkage does not have to be; as usually formulated, coefficient shrinkage is a continuous optimization problem. So we can understand coefficient shrinkage as a continuous approximation to the subset selection problem.

Shrinkage Method Generalization Assume we observe (Y_i, X_{i1}, ..., X_{ip}) for i = 1, ..., n, and all variables are standardized. A shrinkage method solves the optimization problem min_β (Y − Xβ)ᵀ(Y − Xβ) + λ Σ_{j=1}^p J(|β_j|), where J(|β_j|) is the penalty function and λ ≥ 0 is the decay or tuning parameter.

Shrinkage Methods Shrinkage methods (also called penalized or regularization methods): are based on adding a penalty to the risk (equivalently, subtracting it from the log-likelihood), where the penalty is a function of a decay parameter; can filter out unimportant variables from the candidate variables, and estimate the important ones consistently and with high efficiency; can make an ill-posed problem well-posed (e.g., when the matrix X is not of full rank). Once the decay parameter is estimated, the variable (or model) selection is done!

Shrinkage Method Generalization In general, a shrinkage method solves the optimization problem min_β (Y − Xβ)ᵀ(Y − Xβ) + λ Σ_{j=1}^p J(|β_j|). An alternative formulation is to solve min_β (Y − Xβ)ᵀ(Y − Xβ) subject to Σ_{j=1}^p J(|β_j|) ≤ t. How do we choose the penalty function?
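A generic (and deliberately naive) sketch of this penalized formulation with a pluggable penalty J, using a general-purpose scipy optimizer; specialized algorithms are used in practice, and the penalty examples below are just the ridge and lasso choices discussed on the following slides:

```python
import numpy as np
from scipy.optimize import minimize

def penalized_ls(X, y, penalty, lam):
    """Minimize ||y - X beta||^2 + lam * sum_j J(|beta_j|) for a user-supplied J."""
    def objective(beta):
        resid = y - X @ beta
        return resid @ resid + lam * np.sum(penalty(np.abs(beta)))
    beta0 = np.zeros(X.shape[1])
    return minimize(objective, beta0, method="Powell").x  # derivative-free, tolerates non-smooth J

ridge_J = lambda b: b ** 2   # l2 penalty -> ridge regression
lasso_J = lambda b: b        # l1 penalty -> lasso
```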

Ridge regression The ridge regression estimator is the ℓ_2-penalty formulation under Lagrangian relaxation: β̂_ridge = argmin_β (Y − Xβ)ᵀ(Y − Xβ) + λ Σ_{j=1}^p β_j^2. The explicit expression is β̂_ridge = (XᵀX + λI)^{−1} XᵀY. It is most useful when X is nonsingular but has high collinearity (i.e., XᵀX is close to singular).
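A minimal numpy sketch of the closed-form ridge estimator, assuming the predictors are centered and standardized so that no intercept needs to be penalized:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^{-1} X'y for centered, standardized data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```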

Remarks on Ridge regression β̂_ridge is still a linear function of y, the same as the LSE. β̂_ridge is biased, with the bias decreasing to 0 as λ goes to 0. As λ increases, the components of β̂_ridge get closer to 0 (though they are rarely exactly 0). One may force coefficients to zero by choosing a cutoff below which coefficients are set to 0. This may help smooth out noisy data and produce robust solutions. Overall, despite the bias, the variance is usually smaller than that of OLS, and thus ridge regression can achieve smaller MSE and better prediction. (Ridge regression was introduced into statistics by Hoerl & Kennard in 1970, who used it to address a singularity problem.)

Another way to see regularization Write the singular value decomposition (SVD) X = UDVᵀ, where U and V are orthogonal matrices, with the columns of U spanning the column space of X and the columns of V spanning the row space; D = diag(d_1, ..., d_p), and d_1, ..., d_p are called the singular values of X. Using the singular values of X, people define the effective degrees of freedom of the ridge fit: df(λ) = tr[X(XᵀX + λI)^{−1}Xᵀ] = Σ_{j=1}^p d_j^2 / (d_j^2 + λ).

Ridge regression example The effective degrees of freedom indicate how many effective coefficients result from a ridge regression fit.
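A short sketch of the effective-degrees-of-freedom formula via the singular values of X (names are illustrative):

```python
import numpy as np

def ridge_effective_df(X, lam):
    """Effective degrees of freedom of ridge regression: sum_j d_j^2 / (d_j^2 + lambda)."""
    d = np.linalg.svd(X, compute_uv=False)   # singular values of X
    return np.sum(d ** 2 / (d ** 2 + lam))
```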

Comparisons Performance-wise, in terms of selecting the best linear predictive model, ridge and lasso perform quite similarly. The lasso outperforms ridge when there are a moderate number of sizable effects, rather than many small effects; it also produces more interpretable models. The LASSO does poorly: (i) when the true model is not sparse; (ii) when some variables are highly correlated (the LASSO picks one of them essentially at random, and ridge regression tends to perform better); (iii) the LASSO may not be robust to outliers in the responses. Ridge regression used to be more popular because it has a closed-form analytical solution, but with today's computing power the lasso is also easy to use. For each fixed λ, the computation of the LASSO solution can be formulated as a quadratic program (Tibshirani, 1996). Least Angle Regression (LARS) algorithm by Efron et al. (2004): the number of linear pieces in the LASSO path is approximately p, and the complexity of computing the whole LASSO path is O(np^2), the same as the cost of computing a single OLS fit.
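A hedged sketch of tracing the whole lasso path with scikit-learn's LARS implementation, assuming that library is available; the simulated data mirror the earlier lasso sketch:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ np.array([3.0, -2.0, 1.5] + [0.0] * 7) + rng.normal(scale=0.5, size=100)

# alphas: breakpoints of the piecewise-linear path; coefs: coefficients at each breakpoint
alphas, active, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)   # (n_features, n_breakpoints): roughly p linear pieces
```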