A Modern Look at Classical Multivariate Techniques


A Modern Look at Classical Multivariate Techniques
Yoonkyung Lee, Department of Statistics, The Ohio State University
March 16-20, 2015
The 13th School of Probability and Statistics, CIMAT, Guanajuato, Mexico

[Photos of Guanajuato, January 2007 and August 2010]

Overview
Technological innovation and development in science, medicine, engineering, and industry have made high dimensional, complex data widely available. Statistical methods are inherently linked to our assumptions (explicit or implicit) about the data. The classical statistical paradigm was developed for relatively small data sets, for which simple analytic models may be sufficient, and classical multivariate techniques therefore tend to be ill-posed or poorly posed for high dimensional data. How can we modify them for modern data? Less is more.

Outline
Part I: Regression. From Galton's linear regression to modern penalized regression.
Part II: Classification. From Fisher's linear discriminant analysis to its regularized variants and pattern recognition algorithms.
Part III: Dimensionality reduction. From Hotelling's principal component analysis (PCA) to generalized PCA for non-Gaussian data.

Part I: Regression
From Galton's linear regression to modern penalized regression.
Galton, F. (1886), "Regression towards mediocrity in hereditary stature," Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.

Galton's Height Data
Galton studied the degree to which human traits were passed from one generation to the next. In an 1885 study, he measured the heights (in inches) of 933 adult children (480 males and 453 females) and their parents.

> head(galton)  # first several rows
  Gender Family Height Father Mother
1   male      1   73.2   78.5   67.0
2 female      1   69.2   78.5   67.0
3 female      1   69.0   78.5   67.0
4 female      1   69.0   78.5   67.0
5   male      2   73.5   75.5   66.5
6   male      2   72.5   75.5   66.5

A Matrix of Scatterplots for Females
[Figure: scatterplot matrix of Height, Mother, and Father (in inches) for the female children]

Fitted regression plane:
$$\hat\mu\{\text{height} \mid \text{mother}, \text{father}\} = 18.758 + 0.304\,(\text{mother}) + 0.374\,(\text{father})$$
[Figure: regression plane for Height over the Mother and Father axes]

Method of Least Squares
Linear regression model with $p$ predictors:
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i,$$
where $x_i = (x_{i1}, \ldots, x_{ip})$ is the $i$th observation ($i = 1, \ldots, n$) and the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$.
Find the coefficients $\beta = (\beta_0, \ldots, \beta_p)$ that minimize the residual sum of squares
$$RSS(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 = (y - X\beta)^\top (y - X\beta),$$
where $X$ is the $n \times (p+1)$ data matrix with $i$th row $[1 \; x_i^\top]$ and $y = (y_1, \ldots, y_n)^\top$.

Geometry of Least Squares Fitting
[Figure: $Y$ projected onto the plane spanned by $X_1$ and $X_2$; courtesy of Hastie, Tibshirani & Friedman]

Least Squares Estimates
When $X^\top X$ is non-singular, the unique minimizer is given by
$$\hat\beta = (X^\top X)^{-1} X^\top y,$$
satisfying the normal equations $\partial RSS / \partial \beta = 0$.
What if the number of variables $p$ is much larger than the sample size $n$? What if some variables are highly correlated?
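As a quick illustration, here is a minimal R sketch of the least squares fit for the Galton data, assuming a data frame galton with the columns shown in the head() output above (not necessarily the version shipped with any particular R package); the closed-form solution of the normal equations is compared against lm():

X <- cbind(1, galton$Mother, galton$Father)   # design matrix with intercept
y <- galton$Height

# Closed-form least squares: solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Same fit via lm(); the two columns below should agree
fit <- lm(Height ~ Mother + Father, data = galton)
round(cbind(beta_hat, coef(fit)), 3)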

Singularity of $X^\top X$
If $X$ is not of full column rank, $X^\top X$ is singular and $\hat\beta$ is not uniquely defined. Rank deficiencies happen when the number of variables $p$ exceeds the sample size $n$ (e.g. image analysis). When the variables are closely related to each other, the columns of $X$ may be nearly linearly dependent, and $X^\top X$ is then nearly singular. Multicollinearity results in a large variance of $\hat\beta$:
$$\mathrm{Var}(\hat\beta) = (X^\top X)^{-1} \sigma^2.$$
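A small simulation sketch (the setup is illustrative, not from the slides) showing how near-collinearity inflates the coefficient variances through $(X^\top X)^{-1}\sigma^2$:

set.seed(1)
n <- 100; sigma <- 1
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)     # nearly collinear with x1
X  <- cbind(1, x1, x2)

# Variances of the least squares coefficients: diagonal of (X'X)^{-1} sigma^2
diag(solve(crossprod(X))) * sigma^2

# With an unrelated second predictor instead, the variances are far smaller
x2_indep <- rnorm(n)
diag(solve(crossprod(cbind(1, x1, x2_indep)))) * sigma^2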

Remedies for the Ordinary Least Squares Method
Reduce the features by filtering (e.g. subset selection). Derive a small number of linear combinations of the original inputs (e.g. principal component regression). Modify the fitting process through regularization or penalization to obtain an estimator in a reliable manner (bias-variance trade-off): to improve overall accuracy, introduce a little bias in exchange for a reduction in variance. Examples of biased regression include ridge regression and the LASSO.

Ridge Regression
Hoerl, A. and Kennard, R. (1970), "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics.
Main motivation: alleviate the effects of multicollinearity when $X^\top X$ is badly conditioned. $\hat\beta^{ridge}$ is defined as the minimizer of the penalized residual sum of squares
$$RSS_\lambda(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$
where $\lambda \ge 0$ is a shrinkage (or ridge) parameter. With $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$, large coefficients are penalized.

Ridge Regression
Standardize the inputs first, as the ridge solution is not equivariant under scaling of the inputs. Ridge regression can achieve a smaller mean squared error, $MSE(\hat\beta) = E(\|\hat\beta - \beta\|^2)$, than $\hat\beta^{LS}$. Alternative form of the ridge problem:
$$\min_\beta RSS(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \|\beta\|_2^2 \le s.$$
If $s \ge \|\hat\beta^{LS}\|_2^2$, then $\hat\beta^{ridge} = \hat\beta^{LS}$; otherwise, the solution is constrained by the size $s$.

Geometry of Ridge Regression
For a model without the intercept, using the centered $y$ and $x_j$:
$$\min_\beta \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \|\beta\|_2^2 \le s.$$

How to Get the Ridge Estimator?
With $RSS_\lambda(\beta) = (y - X\beta)^\top (y - X\beta) + \lambda \beta^\top \beta$, setting $\partial RSS_\lambda / \partial \beta = 0$ gives
$$\hat\beta^{ridge} = (X^\top X + \lambda I)^{-1} X^\top y.$$
As $\lambda \to 0$, $\hat\beta^{ridge} \to \hat\beta^{LS}$, and as $\lambda \to \infty$, $\hat\beta^{ridge} \to 0$. The fitted values at the training inputs are given by
$$\hat y = X \hat\beta^{ridge} = \underbrace{X (X^\top X + \lambda I)^{-1} X^\top}_{H(\lambda)}\, y.$$
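A minimal R sketch of the ridge estimator in the closed form above, assuming standardized predictors and a centered response (function names are illustrative):

ridge_fit <- function(X, y, lambda) {
  # (X'X + lambda I)^{-1} X'y
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

ridge_hat <- function(X, lambda) {
  # Smoother matrix H(lambda) = X (X'X + lambda I)^{-1} X'
  X %*% solve(crossprod(X) + lambda * diag(ncol(X))) %*% t(X)
}

set.seed(1)
X <- scale(matrix(rnorm(50 * 3), 50, 3))
y <- drop(X %*% c(2, 0, -1)) + rnorm(50)
y <- y - mean(y)
ridge_fit(X, y, lambda = 0)    # recovers the least squares fit
ridge_fit(X, y, lambda = 10)   # coefficients shrunk toward zero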

When $X$ is Orthogonal
If $X^\top X = I$, then $\hat\beta^{LS} = X^\top y$. From $\hat\beta^{ridge} = (X^\top X + \lambda I)^{-1} X^\top y$,
$$\hat\beta^{ridge} = \hat\beta^{LS} / (1 + \lambda).$$
The ridge estimator is a scaled version of the LSE: $\hat\beta_j^{ridge} = \hat\beta_j^{LS} / (1 + \lambda)$. Ridge regression shrinks the coefficients toward zero by imposing a penalty on their size.

LASSO
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," JRSS-B.
Least Absolute Shrinkage and Selection Operator (also known as basis pursuit in Chen et al. 1998). A shrinkage method for simultaneous model fitting and variable selection, it combines the interpretability of subset selection with the stability of ridge regression. The $\ell_1$-norm constraint on $\beta = (\beta_1, \ldots, \beta_p)$ can set some coefficients to exactly zero.

Definition of the LASSO
Assuming that $y$ is centered and the $x_j$'s are standardized, find $\beta$ minimizing
$$RSS(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j| \le s.$$
Equivalently, $\hat\beta^{lasso}$ is defined as the minimizer of
$$RSS_\lambda(\beta) = \tfrac{1}{2} (y - X\beta)^\top (y - X\beta) + \lambda \|\beta\|_1,$$
where $\lambda \ge 0$ is a shrinkage parameter.

Geometry of LASSO
$$\min_\beta \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \|\beta\|_1 \le s.$$
[Figure: contours of RSS and the constraint region $\|\beta\|_1 \le s$, with $\hat\beta$ marked]

When $X$ is Orthogonal
When $X^\top X = I$,
$$\hat\beta_j^{lasso} = \mathrm{sign}(\hat\beta_j^{LS}) \big( |\hat\beta_j^{LS}| - \lambda \big)_+,$$
which is soft thresholding, as used in signal or image recovery and wavelet-based smoothing. Recall that ridge regression scales the LSE: $\hat\beta_j^{ridge} = \hat\beta_j^{LS} / (1 + \lambda)$.
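A minimal R sketch of the soft-thresholding rule above, contrasted with the ridge scaling, for an orthonormal design (names illustrative):

soft_threshold <- function(beta_ls, lambda) {
  # sign(beta) * (|beta| - lambda)_+, applied componentwise
  sign(beta_ls) * pmax(abs(beta_ls) - lambda, 0)
}

beta_ls <- c(3, -1.5, 0.4, -0.2)   # least squares estimates under X'X = I
lambda  <- 0.5
soft_threshold(beta_ls, lambda)    # lasso: small coefficients set exactly to zero
beta_ls / (1 + lambda)             # ridge: every coefficient shrunk proportionally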

LASSO vs Ridge Regression
[Figure: lasso (soft thresholding) and ridge (proportional shrinkage) estimates $\hat\beta^{lasso}$ and $\hat\beta^{ridge}$ plotted against the least squares estimate $\hat\beta$]

Model Complexity
Control of model complexity or capacity is critical for a good fit to the data and proper generalization to new data. The complexity of the ridge and LASSO solutions is indexed by the tuning parameter $\lambda$ or $s$. Regularization therefore entails a model selection problem: the tuning parameters need to be chosen to optimize the bias-variance trade-off. How should we define model degrees of freedom for penalized regression solutions?

Effective Model Degrees of Freedom
The model degrees of freedom of a multiple linear regression model with $p$ predictors are $p = \mathrm{tr}(H)$, where $H = X (X^\top X)^{-1} X^\top$ is the projection matrix that maps $y$ to $\hat y = X \hat\beta^{LS} = H y$. From $\hat y = X \hat\beta^{ridge} = X (X^\top X + \lambda I)^{-1} X^\top y = H(\lambda) y$, we define the effective degrees of freedom of the ridge regression fit analogously as $\mathrm{tr}[H(\lambda)]$. Let $\nu_1 \ge \nu_2 \ge \cdots \ge \nu_p > 0$ be the eigenvalues of $X^\top X$. Then
$$\mathrm{tr}[H(\lambda)] = \sum_{j=1}^{p} \frac{\nu_j}{\nu_j + \lambda}.$$
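A minimal R sketch computing the effective degrees of freedom of a ridge fit, both directly as the trace of $H(\lambda)$ (using the ridge_hat() helper sketched earlier) and through the eigenvalues of $X^\top X$:

set.seed(1)
X <- scale(matrix(rnorm(40 * 5), 40, 5))
lambda <- 2

# Directly: trace of the smoother matrix H(lambda)
sum(diag(ridge_hat(X, lambda)))

# Via the eigenvalues of X'X: sum_j nu_j / (nu_j + lambda)
nu <- eigen(crossprod(X), symmetric = TRUE)$values
sum(nu / (nu + lambda))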

Illustration
$y_i = f(x_i) + \epsilon_i$ for $i = 1, \ldots, n$, where $\epsilon_i \sim N(0, \sigma^2)$.
[Figure: scatterplot of the simulated data $(x_i, y_i)$ with $x \in [0, 1]$]
Estimate $f$ with a large number of basis functions.

Basis Functions
$$\{\, 1,\; x,\; x^2,\; x^3,\; (x - x_1)_+^3,\; \ldots,\; (x - x_n)_+^3 \,\}$$
[Figure: the polynomial and truncated cubic basis functions plotted on $[0, 1]$]
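A minimal R sketch constructing this truncated power basis (the knot locations and evaluation grid here are illustrative):

truncated_power_basis <- function(x, knots) {
  # Columns: 1, x, x^2, x^3, and (x - knot)_+^3 for each knot
  poly_part  <- cbind(1, x, x^2, x^3)
  trunc_part <- sapply(knots, function(k) pmax(x - k, 0)^3)
  cbind(poly_part, trunc_part)
}

x_grid <- seq(0, 1, length.out = 101)
B <- truncated_power_basis(x_grid, knots = c(0.2, 0.4, 0.6, 0.8))
dim(B)   # 101 rows, 4 + number-of-knots columns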

Smoothing Splines
Wahba (1990), Spline Models for Observational Data.
Find $f \in W_2[a, b] = \{ f : \int_a^b (f''(x))^2\, dx < \infty \}$ (a Sobolev space) minimizing
$$\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \underbrace{\int_a^b (f''(x))^2\, dx}_{J(f)},$$
where $J(f)$ measures the curvature of $f$ and $\lambda > 0$ is a smoothing parameter. The solution is a natural cubic spline with knots at the $x_i$ (a piecewise cubic polynomial with two continuous derivatives, linear beyond the boundary knots):
$$\hat f_\lambda(x) = \sum_{j=1}^{n} \beta_j N_j(x) \quad \text{with a basis } \{N_j(x)\}_{j=1}^{n}.$$

Smoothing Spline as a Penalized LS Solution
The curvature of $\hat f_\lambda$ is
$$\int_a^b (\hat f_\lambda''(x))^2\, dx = \int_a^b \Big\{ \sum_{j=1}^{n} \beta_j N_j''(x) \Big\}^2 dx = \beta^\top \Omega \beta,$$
where $\Omega = \big[ \int_a^b N_i''(x) N_j''(x)\, dx \big]$ is an $n \times n$ matrix. To obtain the solution $\hat f_\lambda$, find $\beta$ minimizing
$$(y - N\beta)^\top (y - N\beta) + \lambda \beta^\top \Omega \beta,$$
where $N = [N_j(x_i)]$ is the basis matrix. This is a generalized ridge regression problem:
$$\hat\beta_\lambda = (N^\top N + \lambda \Omega)^{-1} N^\top y,$$
where the coefficients are shrunk toward the linear fit.
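A minimal sketch in R: once the basis matrix $N$ and penalty matrix $\Omega$ are built, the coefficients follow from the generalized ridge formula above; for illustration, the built-in smooth.spline() fits the same kind of penalized least squares criterion on simulated data (settings are illustrative):

set.seed(1)
n <- 100
x <- sort(runif(n))
f <- function(x) 3 * sin(2 * pi * x)
y <- f(x) + rnorm(n)

# Penalized least squares fit; here the smoothing parameter is specified
# indirectly through the requested effective degrees of freedom
fit  <- smooth.spline(x, y, df = 8)
yhat <- predict(fit, x)$y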

Smoothing Spline Fits
[Figure: four panels of smoothing spline fits to the simulated data]

How to Choose $\lambda$?
Ideally we want to choose the $\lambda$ that minimizes the true risk
$$\frac{1}{n} \sum_{i=1}^{n} E\big( \hat f_\lambda(x_i) - f(x_i) \big)^2.$$
A Mallows-type criterion gives an unbiased risk estimate:
$$\frac{1}{n} \| y - \hat y \|^2 + \frac{2 \sigma^2}{n} \mathrm{tr}[A(\lambda)], \quad \text{where } \hat y = A(\lambda) y.$$
[Figure: average prediction error and the unbiased risk estimate plotted against the effective degrees of freedom (df roughly 5 to 30)]
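A minimal R sketch of the Mallows-type criterion above for the smoothing spline example, evaluated over a grid of effective degrees of freedom (continuing the simulated data from the previous sketch and treating $\sigma$ as known; names illustrative):

sigma <- 1
dfs <- 5:30
ure <- sapply(dfs, function(d) {
  fit  <- smooth.spline(x, y, df = d)
  yhat <- predict(fit, x)$y
  # (1/n) ||y - yhat||^2 + (2 sigma^2 / n) tr[A(lambda)],
  # where fit$df is the attained trace of the smoother matrix
  mean((y - yhat)^2) + 2 * sigma^2 * fit$df / n
})
dfs[which.min(ure)]   # degrees of freedom minimizing the unbiased risk estimate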