
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj

Outline: Introduction; Linear Regression Models and Least Squares; Subset Selection; Shrinkage Methods; Methods Using Derived Input Directions; Discussions; Summary

Introduction: Let $(X, Y)$ be a data point. A linear regression model assumes $E[Y \mid X]$ is a linear function of $X$. Advantages: they are simple and often provide an adequate and interpretable description; they can sometimes outperform nonlinear models when there are small numbers of training cases, low signal-to-noise ratio, or sparse data; and linear methods can be applied to transformations of the inputs.

Outline: Introduction; Linear Regression Models and Least Squares; Subset Selection; Shrinkage Methods; Methods Using Derived Input Directions; Discussions; Summary

Linear Regression Models: The linear regression model is $f(X) = \beta_0 + \sum_{j=1}^p X_j \beta_j$, where the $\beta_j$'s are unknown coefficients. The variables $X_j$ could be: quantitative inputs; transformations of quantitative inputs (log, square-root or square); basis expansions (e.g., $X_2 = X_1^2$, $X_3 = X_1^3$); numeric coding of qualitative inputs.

Least Squares: Given a set of training data $(x_1, y_1), \ldots, (x_N, y_N)$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$, minimize the Residual Sum of Squares $\mathrm{RSS}(\beta) = \sum_{i=1}^N \big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2$. This criterion is valid if the $y_i$'s are conditionally independent given the inputs.

A Geometric Interpretation

Optimization (1): Let $\mathbf{X}$ be the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position) and $\mathbf{y}$ the $N$-vector of outputs. Then we have $\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$.

Optimization (2): Differentiate with respect to $\beta$: $\frac{\partial \mathrm{RSS}}{\partial \beta} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta)$. Set the derivative to zero: $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$. Assume $\mathbf{X}^T\mathbf{X}$ is invertible: $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.
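
To make the closed-form solution concrete, here is a minimal NumPy sketch (the synthetic design matrix X and response y are assumptions used only for illustration); in practice np.linalg.lstsq is preferred over forming the normal equations explicitly:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])  # prepend a column of 1s
    beta_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=N)

    # Normal equations: beta_hat = (X^T X)^{-1} X^T y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Numerically more stable equivalent
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)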

Predictions: The prediction at an input $x_0$ is $\hat f(x_0) = (1, x_0^T)\hat\beta$. The predictions on the training data are $\hat{\mathbf{y}} = \mathbf{X}\hat\beta = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$. Let $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$; then $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$ is the orthogonal projection of $\mathbf{y}$ onto the subspace spanned by the columns of $\mathbf{X}$.

Understanding (1): Assume the linear model is correct, but the observations contain noise: $\mathbf{y} = \mathbf{X}\beta + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2\mathbf{I})$. Then $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon$.

Understanding (2): Since $\varepsilon$ is a Gaussian random vector, $\hat\beta = \beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon$ is also a Gaussian random vector. $E[\hat\beta] = \beta$ and $\mathrm{Cov}(\hat\beta) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\,\mathrm{Cov}(\varepsilon)\,\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$. Thus $\hat\beta \sim N\big(\beta, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\big)$.

Expected Prediction Error (EPE): Given a test point $x_0$, assume $y_0 = f(x_0) + \varepsilon_0$ with $\varepsilon_0 \sim N(0, \sigma^2)$. The EPE of an estimator $\tilde f(x_0)$ is $E\big[(y_0 - \tilde f(x_0))^2\big] = \sigma^2 + \mathrm{MSE}(\tilde f(x_0))$. The Mean Squared Error (MSE) decomposes as $\mathrm{MSE}(\tilde f(x_0)) = E\big[(\tilde f(x_0) - f(x_0))^2\big] = \mathrm{Var}(\tilde f(x_0)) + \big[E\,\tilde f(x_0) - f(x_0)\big]^2$, i.e., variance plus squared bias.

EPE of Least Squares: Under the assumption that $\varepsilon \sim N(0, \sigma^2)$, the EPE of the least squares prediction $\hat f(x_0) = x_0^T\hat\beta$ is $E\big[(y_0 - \hat f(x_0))^2\big] = \sigma^2 + \mathrm{MSE}(\hat f(x_0))$. Since $\hat f(x_0)$ is unbiased when the linear model is correct, the MSE is pure variance: $\mathrm{MSE}(\hat f(x_0)) = E\big[(\hat f(x_0) - E\,\hat f(x_0))^2\big] = \mathrm{Var}(\hat f(x_0)) = \sigma^2\, x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0$.

The Gauss–Markov Theorem: The least squares estimate $\hat\beta$ has the smallest variance among all linear unbiased estimates. We aim to estimate $\theta = a^T\beta$; the least squares estimate of $\theta$ is $\hat\theta = a^T\hat\beta = a^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$. From previous discussions, we have $E[a^T\hat\beta] = a^T\beta$, and for every other linear estimator $\tilde\theta = \mathbf{c}^T\mathbf{y}$ such that $E[\mathbf{c}^T\mathbf{y}] = a^T\beta$, we have $\mathrm{Var}(a^T\hat\beta) \leq \mathrm{Var}(\mathbf{c}^T\mathbf{y})$.

Multiple Outputs (1): Suppose we aim to predict $K$ outputs $Y_1, \ldots, Y_K$ and assume a linear model for each. Given training data, we have $\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$, where $\mathbf{Y}$ is the $N \times K$ response matrix, $\mathbf{X}$ is the $N \times (p+1)$ data matrix, $\mathbf{B}$ is the $(p+1) \times K$ matrix of parameters, and $\mathbf{E}$ is the $N \times K$ matrix of errors.

Multiple Outputs (2): The Residual Sum of Squares is $\mathrm{RSS}(\mathbf{B}) = \mathrm{tr}\big[(\mathbf{Y} - \mathbf{X}\mathbf{B})^T(\mathbf{Y} - \mathbf{X}\mathbf{B})\big]$. The solution is $\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$. It is equivalent to performing $K$ independent least squares regressions, one per output.
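
As a quick numerical check of this equivalence (a sketch; X and Y below are synthetic placeholders), solving the multi-output problem in one shot gives the same coefficients as fitting each output column separately:

    import numpy as np

    rng = np.random.default_rng(1)
    N, p, K = 50, 4, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
    Y = rng.normal(size=(N, K))

    # Joint solution: B_hat = (X^T X)^{-1} X^T Y
    B_joint = np.linalg.solve(X.T @ X, X.T @ Y)

    # Column-by-column least squares
    B_cols = np.column_stack([np.linalg.lstsq(X, Y[:, k], rcond=None)[0] for k in range(K)])

    assert np.allclose(B_joint, B_cols)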

Large-scale Setting: The problem: when the number of samples $N$ is very large, computing the exact least squares solution becomes expensive. Sampling: a faster least squares approximation can be obtained by solving the problem on a random subsample of the data.

Outline: Introduction; Linear Regression Models and Least Squares; Subset Selection; Shrinkage Methods; Methods Using Derived Input Directions; Discussions; Summary

Subset Selection: Limitations of least squares. Prediction accuracy: the least squares estimates often have low bias but large variance. Interpretation: we often would like to determine a smaller subset of variables that exhibits the strongest effects. The remedy is to shrink or set some coefficients to zero: we sacrifice a little bit of bias to reduce the variance of the predicted values.

Best-Subset Selection: For each subset size $k$, select the subset of $k$ variables (features) such that the RSS is minimized.

Forward- and Backward- Stepwise Selection Forward-stepwise Selection 1. Start with the intercept 2. Sequentially add into the model the predictor that most improves the fit Backward-stepwise Selection 1. Start with the full model 2. Sequentially delete the predictor that has the least impact on the fit Both are greedy algorithms Both can be solved quite efficiently
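
The following is a minimal sketch of forward-stepwise selection scored by training RSS (the scoring rule, stopping size, and synthetic data are assumptions, not the textbook's exact procedure):

    import numpy as np

    def forward_stepwise(X, y, k_max):
        """Greedily add the predictor that most reduces the training RSS."""
        n, p = X.shape
        active, remaining = [], list(range(p))
        for _ in range(min(k_max, p)):
            best_j, best_rss = None, np.inf
            for j in remaining:
                cols = np.hstack([np.ones((n, 1)), X[:, active + [j]]])
                beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
                rss = np.sum((y - cols @ beta) ** 2)
                if rss < best_rss:
                    best_j, best_rss = j, rss
            active.append(best_j)
            remaining.remove(best_j)
        return active

    rng = np.random.default_rng(2)
    X = rng.normal(size=(80, 10))
    y = X[:, 0] - 2 * X[:, 3] + 0.5 * rng.normal(size=80)
    print(forward_stepwise(X, y, k_max=3))  # should include columns 0 and 3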

Forward-Stagewise Regression: 1. Start with an intercept equal to $\bar y$ and centered predictors with coefficients initially all 0. 2. Identify the variable most correlated with the current residual. 3. Compute the simple linear regression coefficient of the residual on this chosen variable and add it to that variable's current coefficient; then repeat steps 2-3. None of the other variables are adjusted when a term is added to the model.
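
A minimal NumPy sketch of this procedure (the standardization, step cap, and tolerance are assumptions made for the illustration):

    import numpy as np

    def forward_stagewise(X, y, n_steps=1000, tol=1e-8):
        """Repeatedly regress the current residual on the single predictor most
        correlated with it and add that simple coefficient to the model."""
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # centered, standardized predictors
        beta0 = y.mean()                            # intercept = mean of y
        r = y - beta0                               # current residual
        beta = np.zeros(X.shape[1])
        for _ in range(n_steps):
            corr = Xs.T @ r                          # proportional to the correlations
            j = int(np.argmax(np.abs(corr)))
            if np.abs(corr[j]) < tol:
                break
            delta = corr[j] / np.sum(Xs[:, j] ** 2)  # simple regression coefficient
            beta[j] += delta                         # only this coefficient is adjusted
            r = r - delta * Xs[:, j]
        return beta0, beta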

Comparisons

Outline: Introduction; Linear Regression Models and Least Squares; Subset Selection; Shrinkage Methods; Methods Using Derived Input Directions; Discussions; Summary

Shrinkage Methods: Limitation of subset selection: it is a discrete process (variables are either retained or discarded), so it often exhibits high variance and therefore doesn't reduce the prediction error. Shrinkage methods are more continuous and have lower variance. Examples: Ridge Regression, the Lasso, Least Angle Regression.

Ridge Regression: Shrink the regression coefficients by imposing a penalty on their size. The objective: $\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \sum_{i=1}^N \big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2 + \lambda\sum_{j=1}^p \beta_j^2$, where $\lambda \geq 0$ is a complexity parameter. An equivalent form: minimize the RSS subject to $\sum_{j=1}^p \beta_j^2 \leq t$. The coefficients cannot be too large even when variables are correlated.

Optimization (1): Let $\mathbf{X}$ be the $N \times p$ matrix with each row an input vector (without the constant 1) and $\mathbf{y}$ the vector of outputs. The objective becomes $\min_{\beta_0,\beta}\;(\mathbf{y} - \beta_0\mathbf{1} - \mathbf{X}\beta)^T(\mathbf{y} - \beta_0\mathbf{1} - \mathbf{X}\beta) + \lambda\beta^T\beta$, where $\beta = (\beta_1, \ldots, \beta_p)^T$ and the intercept $\beta_0$ is not penalized.

Optimization (2): Differentiate with respect to $\beta_0$ and set it to zero: $-2\mathbf{1}^T(\mathbf{y} - \beta_0\mathbf{1} - \mathbf{X}\beta) = 0$, which gives $\beta_0 = \frac{1}{N}\mathbf{1}^T(\mathbf{y} - \mathbf{X}\beta)$. Differentiate with respect to $\beta$ and set it to zero: $-2\mathbf{X}^T(\mathbf{y} - \beta_0\mathbf{1} - \mathbf{X}\beta) + 2\lambda\beta = 0$.

Optimization (3): The final solution. Let $\mathbf{C} = \mathbf{I} - \frac{1}{N}\mathbf{1}\mathbf{1}^T$ be the centering matrix. Then $\hat\beta^{\mathrm{ridge}} = (\mathbf{X}^T\mathbf{C}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{C}\mathbf{y}$ and $\hat\beta_0 = \frac{1}{N}\mathbf{1}^T(\mathbf{y} - \mathbf{X}\hat\beta^{\mathrm{ridge}})$; the matrix $\mathbf{X}^T\mathbf{C}\mathbf{X} + \lambda\mathbf{I}$ is always invertible for $\lambda > 0$.
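
A minimal NumPy sketch of this closed-form ridge solution, handling the centering explicitly (the synthetic data and the value of lam, playing the role of $\lambda$, are assumptions):

    import numpy as np

    rng = np.random.default_rng(3)
    N, p, lam = 60, 5, 1.0
    X = rng.normal(size=(N, p))
    y = X @ rng.normal(size=p) + 0.2 * rng.normal(size=N)

    # Center X and y, solve (Xc^T Xc + lam I) beta = Xc^T yc, then recover the intercept.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y.mean() - X.mean(axis=0) @ beta
    # sklearn.linear_model.Ridge(alpha=lam).fit(X, y) should give the same coefficients.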

Understanding (1): Assume $\mathbf{X}$ is centered; then $\hat\beta^{\mathrm{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$. Let the SVD of $\mathbf{X}$ be $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, where $\mathbf{U}$ contains the left singular vectors, $\mathbf{V}$ the right singular vectors, and $\mathbf{D}$ is a diagonal matrix with diagonal entries $d_1 \geq d_2 \geq \cdots \geq d_p \geq 0$. Then we examine the predictions on the training data.

Understanding (2): Least squares: $\mathbf{X}\hat\beta^{\mathrm{ls}} = \mathbf{U}\mathbf{U}^T\mathbf{y} = \sum_{j=1}^p \mathbf{u}_j\mathbf{u}_j^T\mathbf{y}$. Ridge regression: $\mathbf{X}\hat\beta^{\mathrm{ridge}} = \sum_{j=1}^p \mathbf{u}_j\,\frac{d_j^2}{d_j^2 + \lambda}\,\mathbf{u}_j^T\mathbf{y}$, i.e., ridge shrinks the coordinates by the factors $\frac{d_j^2}{d_j^2 + \lambda}$.
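
A short numerical check of these two expressions (a sketch on synthetic, centered data; lam is an arbitrary choice of $\lambda$):

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(40, 4)); X -= X.mean(axis=0)
    y = rng.normal(size=40); y -= y.mean()
    lam = 2.0

    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    fit_ls = U @ (U.T @ y)                             # least squares fit of the centered data
    fit_ridge = U @ (d**2 / (d**2 + lam) * (U.T @ y))  # coordinates shrunk by d_j^2/(d_j^2+lam)

    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
    assert np.allclose(fit_ridge, X @ beta_ridge)
    print(d**2 / (d**2 + lam))                         # the shrinkage factors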

Understanding (3): Connection with PCA. The columns $\mathbf{z}_j = \mathbf{X}\mathbf{v}_j = d_j\mathbf{u}_j$ are the principal components of $\mathbf{X}$, with sample variance $d_j^2/N$; ridge regression therefore shrinks most along the low-variance principal component directions (small $d_j$).

An Example

The Lasso: The objective: $\hat\beta^{\mathrm{lasso}} = \arg\min_\beta \sum_{i=1}^N \big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^p |\beta_j| \leq t$. It is equivalent to the $\ell_1$-norm penalized form $\min_\beta \sum_{i=1}^N \big(y_i - \beta_0 - \sum_j x_{ij}\beta_j\big)^2 + \lambda\sum_{j=1}^p |\beta_j|$.

Optimization: The first (constrained) formulation can be solved by projected gradient descent, i.e., gradient steps followed by projection onto the $\ell_1$-ball [1]. The second (penalized) formulation is a convex composite optimization problem and can be solved by proximal gradient methods [2].
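
As an illustration of the composite-optimization view, here is a minimal proximal-gradient (ISTA-style) sketch for the penalized form; the step size, iteration count, and synthetic data are assumptions, and this is not the reference implementation of [1] or [2]:

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_ista(X, y, lam, n_iter=500):
        """Minimize ||y - X b||^2 + lam * ||b||_1 by proximal gradient steps."""
        L = 2 * np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part's gradient
        b = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = 2 * X.T @ (X @ b - y)
            b = soft_threshold(b - grad / L, lam / L)
        return b

    rng = np.random.default_rng(5)
    X = rng.normal(size=(100, 20))
    y = X[:, 0] - 2 * X[:, 5] + 0.1 * rng.normal(size=100)
    print(np.nonzero(lasso_ista(X, y, lam=5.0))[0])   # indices of nonzero coefficients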

An Example: as the penalty grows, coefficients hit 0 exactly; the coefficient profiles are piecewise linear in the bound $t$.

Subset Selection, Ridge, Lasso: When the columns of $\mathbf{X}$ are orthonormal, each method has an explicit solution in terms of the least squares estimate $\hat\beta_j$: best-subset selection applies hard-thresholding (keep $\hat\beta_j$ if it is among the $M$ largest in magnitude, else set it to 0), ridge applies scaling ($\hat\beta_j/(1+\lambda)$), and the lasso applies soft-thresholding ($\mathrm{sign}(\hat\beta_j)(|\hat\beta_j| - \lambda)_+$).
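
A small sketch of the three rules applied to one vector of least squares coefficients (the coefficient values, penalty lam, and subset size M are illustrative assumptions):

    import numpy as np

    beta_ls = np.array([3.0, -1.5, 0.4, -0.2, 2.2])   # hypothetical least squares coefficients
    lam, M = 1.0, 2                                    # penalty and subset size for illustration

    # Best subset (hard-thresholding): keep the M largest coefficients in magnitude.
    keep = np.argsort(-np.abs(beta_ls))[:M]
    beta_subset = np.where(np.isin(np.arange(beta_ls.size), keep), beta_ls, 0.0)

    # Ridge (scaling): shrink every coefficient by the same factor.
    beta_ridge = beta_ls / (1 + lam)

    # Lasso (soft-thresholding): shrink toward zero and truncate at zero.
    beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)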

Ridge vs. Lasso (1): In ridge regression, the $\ell_2$-norm appears in the constraint: $\sum_{j=1}^p \beta_j^2 \leq t$. In the lasso, the $\ell_1$-norm appears in the constraint: $\sum_{j=1}^p |\beta_j| \leq t$.

Ridge vs. Lasso (2)

Generalization (1): A general formulation: $\tilde\beta = \arg\min_\beta \sum_{i=1}^N \big(y_i - \beta_0 - \sum_j x_{ij}\beta_j\big)^2 + \lambda\sum_{j=1}^p |\beta_j|^q$ for $q \geq 0$. Contours of constant value of $\sum_j |\beta_j|^q$ for different $q$.

Generalization (2): A mixed formulation, the elastic-net penalty: $\lambda\sum_{j=1}^p \big(\alpha\beta_j^2 + (1-\alpha)|\beta_j|\big)$, which compromises between ridge and the lasso.
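
A minimal scikit-learn sketch of an elastic-net fit (synthetic data; note that scikit-learn's l1_ratio weights the $\ell_1$ term, i.e., it plays the role of $1-\alpha$ in the penalty above, while its alpha scales the whole penalty):

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(6)
    X = rng.normal(size=(100, 20))
    y = X[:, 0] - 2 * X[:, 5] + 0.1 * rng.normal(size=100)

    model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    print(np.nonzero(model.coef_)[0])   # indices of the selected (nonzero) coefficients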

Least Angle Regression (LAR) The Procedure 1. Identify the variable most correlated with the response 2. Move the coefficient of this variable continuously toward its least squares value 3. As soon as another variable catches up in terms of correlation with the residual, the process is paused 4. The second variable then joins the active set, and their coefficients are moved together in a way that keeps their correlations tied and decreasing
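
scikit-learn exposes this procedure through lars_path; a minimal sketch on synthetic data (the data and the chosen printouts are assumptions for illustration):

    import numpy as np
    from sklearn.linear_model import lars_path

    rng = np.random.default_rng(7)
    X = rng.normal(size=(100, 8))
    y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=100)

    # method='lar' runs least angle regression; method='lasso' gives the lasso path instead.
    alphas, active, coefs = lars_path(X, y, method="lar")
    print(active)        # order in which variables enter the active set
    print(coefs.shape)   # (n_features, n_steps): piecewise-linear coefficient paths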

An Example

LAR vs. Lasso

Comparisons

Outline: Introduction; Linear Regression Models and Least Squares; Subset Selection; Shrinkage Methods; Methods Using Derived Input Directions; Discussions; Summary

Methods Using Derived Input Directions: We have a large number of inputs, often very correlated. 1. Generate a small number of linear combinations $Z_m$, $m = 1, \ldots, M$, of the original inputs $X_j$. 2. Use the $Z_m$ in place of the $X_j$ as inputs in the regression. This is linear dimensionality reduction followed by regression.

Principal Components Regression (PCR): The linear combinations are generated by PCA: $\mathbf{z}_m = \mathbf{X}\mathbf{v}_m$, where $\mathbf{X}$ is centered and $\mathbf{v}_m$ is the $m$-th right singular vector of $\mathbf{X}$. Since the $\mathbf{z}_m$'s are orthogonal, $\hat{\mathbf{y}}^{\mathrm{pcr}}_{(M)} = \bar y\,\mathbf{1} + \sum_{m=1}^M \hat\theta_m\mathbf{z}_m$, where $\hat\theta_m = \langle\mathbf{z}_m, \mathbf{y}\rangle / \langle\mathbf{z}_m, \mathbf{z}_m\rangle$.
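
A minimal NumPy sketch of PCR along these lines (the synthetic data and the choice of M components are assumptions):

    import numpy as np

    def pcr_fit(X, y, M):
        """Principal components regression with M components."""
        xbar, ybar = X.mean(axis=0), y.mean()
        Xc = X - xbar
        U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
        Z = Xc @ Vt[:M].T                                   # first M principal components
        theta = (Z.T @ (y - ybar)) / np.sum(Z**2, axis=0)   # univariate coefficients
        beta = Vt[:M].T @ theta                             # map back to the original inputs
        return ybar - xbar @ beta, beta                     # intercept, coefficients

    rng = np.random.default_rng(8)
    X = rng.normal(size=(100, 6))
    y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=100)
    b0, b = pcr_fit(X, y, M=3)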

PCR vs. Ridge

Partial Least Squares (PLS): The procedure (with standardized inputs $\mathbf{x}_j$): 1. Compute $\hat\varphi_{1j} = \langle\mathbf{x}_j, \mathbf{y}\rangle$ for each feature. 2. Construct the 1st derived input $\mathbf{z}_1 = \sum_j \hat\varphi_{1j}\mathbf{x}_j$. 3. $\mathbf{y}$ is regressed on $\mathbf{z}_1$, giving coefficient $\hat\theta_1 = \langle\mathbf{z}_1, \mathbf{y}\rangle / \langle\mathbf{z}_1, \mathbf{z}_1\rangle$. 4. Orthogonalize each $\mathbf{x}_j$ with respect to $\mathbf{z}_1$. 5. Repeat the above process with the orthogonalized inputs to obtain $\mathbf{z}_2, \mathbf{z}_3, \ldots$
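
scikit-learn provides a PLS implementation; its NIPALS-based algorithm may differ in details from the procedure above, so treat this as a sketch of an assumed but standard interface:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(9)
    X = rng.normal(size=(100, 10))
    y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=100)

    pls = PLSRegression(n_components=2).fit(X, y)
    y_hat = pls.predict(X).ravel()   # predictions built from the two derived directions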

Outline: Introduction; Linear Regression Models and Least Squares; Subset Selection; Shrinkage Methods; Methods Using Derived Input Directions; Discussions; Summary

Discussions (1) Model complexity increases as we move from left to right.

Discussions (2)

Discussions (3)

Outline: Introduction; Linear Regression Models and Least Squares; Subset Selection; Shrinkage Methods; Methods Using Derived Input Directions; Discussions; Summary

Summary: Linear Regression Models and Least Squares; Shrinkage Methods (Ridge Regression, Lasso, Least Angle Regression (LAR)); Methods Using Derived Input Directions (Principal Components Regression (PCR), Partial Least Squares (PLS)).

References: [1] Duchi et al. Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272-279, 2008. [2] Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.