A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico
[Photos: Guanajuato, January 2007 and August 2010]
Overview
Technological innovation and development in science, medicine, engineering, and industry have made high dimensional, complex data widely available.
Statistical methods are inherently linked to our assumptions (either explicit or implicit) about the data.
The classical statistical paradigm is designed for relatively small data sets, for which simple analytic models may be sufficient.
Classical multivariate techniques tend to be ill-posed or poorly posed for high dimensional data.
How can they be modified for modern data? Less is more.
Outline
Part I: Regression. Galton's linear regression to modern penalized regression.
Part II: Classification. Fisher's linear discriminant analysis to its regularized variants and pattern recognition algorithms.
Part III: Dimensionality reduction. Hotelling's principal component analysis (PCA) to generalized PCA for non-Gaussian data.
Part I: Regression
Galton's linear regression to modern penalized regression.
Galton, F. (1886), Regression towards mediocrity in hereditary stature, Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.
Galton's Height Data
Galton studied the degree to which human traits were passed from one generation to the next. In an 1885 study, he measured the heights (in inches) of 933 adult children (480 males and 453 females) and their parents.
> head(galton)  # first several rows
  Gender Family Height Father Mother
1   male      1   73.2   78.5   67.0
2 female      1   69.2   78.5   67.0
3 female      1   69.0   78.5   67.0
4 female      1   69.0   78.5   67.0
5   male      2   73.5   75.5   66.5
6   male      2   72.5   75.5   66.5
A Matrix of Scatterplots for Females
[figure: scatterplot matrix of the children's Height, Mother's height, and Father's height]
$\hat{\mu}\{\text{height} \mid \text{mother}, \text{father}\} = 18.758 + 0.304\,(\text{mother}) + 0.374\,(\text{father})$
[figure: fitted regression plane of Height against Father's and Mother's heights]
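A minimal R sketch of such a fit, assuming the `galton` data frame shown earlier (columns Gender, Family, Height, Father, Mother) and restricting to the female children as in the preceding scatterplot matrix; the coefficients reported above correspond to one such fit, and the exact values depend on the subset used.

females <- subset(galton, Gender == "female")      # daughters only
fit <- lm(Height ~ Mother + Father, data = females)
coef(fit)   # intercept and slopes for mother's and father's heights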
Method of Least Squares
Linear regression model with p predictors: $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i$, where $x_i = (x_{i1},\ldots,x_{ip})$ is the $i$th observation ($i = 1,\ldots,n$) and the $\epsilon_i$ are i.i.d. $N(0,\sigma^2)$.
Find coefficients $\beta = (\beta_0,\ldots,\beta_p)$ that minimize the residual sum of squares
$RSS(\beta) = \sum_{i=1}^{n} \big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\big)^2 = (y - X\beta)'(y - X\beta)$,
where $X$ is the $n \times (p+1)$ data matrix with $i$th row $[1\ x_i']$ and $y = (y_1,\ldots,y_n)'$.
Geometry of Least Squares Fitting
[figure: $\hat y$ as the projection of $y$ onto the space spanned by $x_1$ and $x_2$; courtesy of Hastie, Tibshirani & Friedman]
Least Squares Estimates
When $X'X$ is non-singular, the unique minimizer is given by $\hat\beta = (X'X)^{-1}X'y$, satisfying the normal equations $\partial RSS(\beta)/\partial\beta = 0$.
What if the number of variables p is much larger than the sample size n? What if some variables are highly correlated?
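To make the normal equations concrete, here is a small R sketch (the simulated X and y are illustrative assumptions, not data from the slides) that solves $(X'X)\beta = X'y$ directly and checks the result against lm():

set.seed(1)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))          # design matrix with an intercept column
y <- drop(X %*% c(1, 2, -1, 0.5)) + rnorm(n)       # response from a known linear model
beta_hat <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y via the normal equations
cbind(normal_eq = drop(beta_hat), lm = coef(lm(y ~ X - 1)))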
Singularity of $X'X$
If $X$ is not of full column rank, $X'X$ is singular and $\hat\beta$ is not uniquely defined.
Rank deficiencies happen if the number of variables p exceeds the sample size n (e.g., image analysis).
When the variables are closely related to each other, the columns of $X$ may be nearly linearly dependent; then $X'X$ is nearly singular.
Multicollinearity results in large variance of $\hat\beta$: $\mathrm{Var}(\hat\beta) = (X'X)^{-1}\sigma^2$.
Remedies for the Ordinary Least Squares Method
Reduce the features by filtering (e.g., subset selection).
Derive a small number of linear combinations of the original inputs (e.g., principal component regression).
Modify the fitting process through regularization or penalization to obtain an estimator in a more reliable manner (bias-variance trade-off): to improve the overall accuracy, introduce a little bias in exchange for a reduction in variance. Examples of biased regression include ridge regression and the LASSO.
Ridge Regression
Hoerl, A. and Kennard, R. (1970), Ridge regression: Biased estimation for nonorthogonal problems, Technometrics.
Main motivation: alleviate the effects of multicollinearity when $X'X$ is badly conditioned.
$\hat\beta^{\,ridge}$ is defined as the minimizer of the penalized residual sum of squares
$RSS_\lambda(\beta) = \sum_{i=1}^{n} \big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$,
where $\lambda \ge 0$ is a shrinkage (or ridge) parameter. With $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$, large coefficients are penalized.
Ridge Regression
Standardize the inputs first, as the ridge solution is not equivariant under scaling of the inputs.
Ridge regression can achieve a smaller mean squared error $MSE(\hat\beta) = E(\|\hat\beta - \beta\|^2)$ than $\hat\beta^{LS}$.
Alternative form of the ridge problem:
$\min_\beta RSS(\beta) = \sum_{i=1}^{n} \big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\big)^2$ subject to $\|\beta\|_2^2 \le s$.
If $s \ge \|\hat\beta^{LS}\|_2^2$, then $\hat\beta^{\,ridge} = \hat\beta^{LS}$. Otherwise, it is constrained by the size $s$.
Geometry of Ridge Regression
For a model without the intercept, using the centered $y$ and $x_j$:
$\min_\beta \sum_{i=1}^{n} \big(y_i - \sum_{j=1}^{p} \beta_j x_{ij}\big)^2$ subject to $\|\beta\|_2^2 \le s$
How to Get the Ridge Estimator?
With $RSS_\lambda(\beta) = (y - X\beta)'(y - X\beta) + \lambda\beta'\beta$, setting $\partial RSS_\lambda/\partial\beta = 0$ gives
$\hat\beta^{\,ridge} = (X'X + \lambda I)^{-1}X'y$.
As $\lambda \to 0$, $\hat\beta^{\,ridge} \to \hat\beta^{LS}$, and as $\lambda \to \infty$, $\hat\beta^{\,ridge} \to 0$.
The fitted values at the training inputs are given by $\hat y = X\hat\beta^{\,ridge} = \underbrace{X(X'X + \lambda I)^{-1}X'}_{H(\lambda)}\,y$.
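A minimal R sketch of the closed-form ridge estimator above, assuming a centered response y and standardized inputs X with no intercept column:

ridge_coef <- function(X, y, lambda) {
  # (X'X + lambda I)^{-1} X'y
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
# As lambda -> 0 the solution approaches the least squares fit;
# as lambda grows the coefficients shrink toward zero.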
When X is Orthogonal
If $X'X = I$, then $\hat\beta^{LS} = X'y$. From $\hat\beta^{\,ridge} = (X'X + \lambda I)^{-1}X'y$, we get $\hat\beta^{\,ridge} = \hat\beta^{LS}/(1+\lambda)$.
The ridge estimator is a scaled version of the LSE: $\hat\beta_j^{\,ridge} = \hat\beta_j^{LS}/(1+\lambda)$.
Ridge regression shrinks coefficients toward zero by imposing a penalty on their size.
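A quick numeric check of this shrinkage identity in R, using a simulated orthonormal design (the design and true coefficients are illustrative assumptions):

set.seed(2)
X <- qr.Q(qr(matrix(rnorm(100 * 4), 100, 4)))      # columns orthonormal, so X'X = I
y <- drop(X %*% c(3, -2, 1, 0)) + rnorm(100, sd = 0.5)
lambda <- 2
beta_ls    <- drop(crossprod(X, y))                # least squares estimate when X'X = I
beta_ridge <- drop(solve(crossprod(X) + lambda * diag(4), crossprod(X, y)))
cbind(ridge = beta_ridge, scaled_ls = beta_ls / (1 + lambda))   # the two columns agree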
LASSO
Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso, JRSS-B.
Least Absolute Shrinkage and Selection Operator (also known as basis pursuit in Chen et al. 1998).
A shrinkage method for simultaneous model fitting and variable selection.
Combines the interpretability of subset selection with the stability of ridge regression.
An $\ell_1$-norm constraint on $\beta = (\beta_1,\ldots,\beta_p)$ can set some coefficients to zero exactly.
Definition of the LASSO
Assuming that $y$ is centered and the $x_j$'s are standardized, find $\beta$ minimizing
$RSS(\beta) = \sum_{i=1}^{n} \big(y_i - \sum_{j=1}^{p} \beta_j x_{ij}\big)^2$ subject to $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j| \le s$.
Equivalently, $\hat\beta^{\,lasso}$ is defined as the minimizer of
$RSS_\lambda(\beta) = \tfrac{1}{2}(y - X\beta)'(y - X\beta) + \lambda\|\beta\|_1$,
where $\lambda \ge 0$ is a shrinkage parameter.
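In practice the lasso solution path is computed numerically; a hedged sketch using the glmnet package in R (one possible solver, not prescribed by the slides), with X a numeric matrix of standardized predictors and y a centered response from the earlier sketches:

library(glmnet)
fit <- glmnet(X, y, alpha = 1)        # alpha = 1 gives the lasso penalty
coef(fit, s = 0.1)                    # coefficients at lambda = 0.1; some are exactly zero
plot(fit, xvar = "lambda")            # coefficient paths as lambda varies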
Geometry of LASSO
$\min_\beta \sum_{i=1}^{n} \big(y_i - \sum_{j=1}^{p} \beta_j x_{ij}\big)^2$ subject to $\|\beta\|_1 \le s$
[figure: RSS contours meeting the $\ell_1$ constraint region, with $\hat\beta$ marked]
When X is Orthogonal
When $X'X = I$, $\hat\beta_j^{\,lasso} = \mathrm{sign}(\hat\beta_j^{LS})\,(|\hat\beta_j^{LS}| - \lambda)_+$.
This is soft thresholding, as in signal or image recovery and wavelet-based smoothing.
Recall that ridge regression scales the LSE: $\hat\beta_j^{\,ridge} = \hat\beta_j^{LS}/(1+\lambda)$.
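The soft-thresholding operator is easy to write directly; a small R sketch:

soft_threshold <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)
soft_threshold(c(-2, -0.3, 0.1, 1.5), lambda = 0.5)
# -1.5  0.0  0.0  1.0   (coefficients inside [-lambda, lambda] are set exactly to zero)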
LASSO vs Ridge Regression
[figure: comparison of $\hat\beta^{\,lasso}$ and $\hat\beta^{\,ridge}$ as functions of $\hat\beta^{LS}$: soft thresholding versus proportional shrinkage]
Model Complexity Control of model complexity or capacity is critical for a good fit to the data and proper generalization to new data. The complexity of ridge and LASSO solutions is indexed by tuning parameters λ or s. Regularization entails a model selection problem. Tuning parameters need to be chosen to optimize the bias-variance tradeoff. How to define model degrees of freedom for the penalized regression solutions?
Effective Model Degrees of Freedom
The model degrees of freedom of a multiple linear regression model with p predictors are $p = \mathrm{tr}(H)$, where $H = X(X'X)^{-1}X'$ is the projection matrix that maps $y$ to $\hat y = X\hat\beta^{LS} = Hy$.
From $\hat y = X\hat\beta^{\,ridge} = X(X'X + \lambda I)^{-1}X'y = H(\lambda)y$, we analogously define the effective degrees of freedom of the ridge regression fit as $\mathrm{tr}[H(\lambda)]$.
Let $\nu_1 \ge \nu_2 \ge \cdots \ge \nu_p > 0$ be the eigenvalues of $X'X$. Then $\mathrm{tr}[H(\lambda)] = \sum_{j=1}^{p} \frac{\nu_j}{\nu_j + \lambda}$.
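A short R sketch of this effective-df formula, computing tr[H(lambda)] from the singular values of a centered and standardized design matrix X:

ridge_df <- function(X, lambda) {
  nu <- svd(X)$d^2                 # eigenvalues of X'X
  sum(nu / (nu + lambda))
}
# For full-rank X, ridge_df(X, 0) equals the number of columns (ordinary least squares);
# the effective df decreases monotonically as lambda grows.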
Illustration
$y_i = f(x_i) + \epsilon_i$ for $i = 1,\ldots,n$, where $\epsilon_i \sim N(0,\sigma^2)$
[figure: scatterplot of simulated (x, y) data]
Estimate $f$ with a large number of basis functions.
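A sketch in R of simulated data like those in the illustration; the true function f and noise level here are illustrative assumptions, not the ones used in the slides:

set.seed(3)
n <- 100
x <- sort(runif(n))                      # design points in [0, 1]
f <- function(x) 3 * sin(2 * pi * x)     # hypothetical smooth truth
y <- f(x) + rnorm(n, sd = 1)             # noisy observations
plot(x, y)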
Basis Functions
$\{1,\ x,\ x^2,\ x^3,\ (x - x_1)^3_+,\ \ldots,\ (x - x_n)^3_+\}$
[figure: plots of the truncated power basis functions over x]
Smoothing Splines
Wahba (1990), Spline Models for Observational Data.
Find $f \in W_2[a,b] = \{f : \int_a^b (f''(x))^2\,dx < \infty\}$ (a Sobolev space) minimizing
$\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int_a^b (f''(x))^2\,dx$,
where $J(f) = \int_a^b (f''(x))^2\,dx$ measures the curvature of $f$ and $\lambda > 0$ is a smoothing parameter.
The solution is a natural cubic spline with knots at the $x_i$ (a piecewise cubic polynomial with two continuous derivatives, linear beyond the boundary knots):
$\hat f_\lambda(x) = \sum_{j=1}^{n} \beta_j N_j(x)$ with a basis $\{N_j(x)\}_{j=1}^{n}$.
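In R, smoothing splines of this form are available through smooth.spline(); a minimal sketch using the simulated (x, y) from the illustration above:

fit_gcv <- smooth.spline(x, y, cv = FALSE)   # smoothing parameter chosen by GCV
fit_df  <- smooth.spline(x, y, df = 20)      # a rougher fit with more effective df
plot(x, y)
grid_x <- seq(min(x), max(x), length.out = 200)
lines(predict(fit_gcv, grid_x), col = "blue")
lines(predict(fit_df,  grid_x), col = "red")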
Smoothing Spline as a Penalized LS Solution
The curvature of $\hat f_\lambda$ is $\int_a^b (\hat f_\lambda''(x))^2\,dx = \int_a^b \big\{\sum_{j=1}^{n} \beta_j N_j''(x)\big\}^2 dx = \beta'\Omega\beta$, where $\Omega = \big[\int_a^b N_i''(x)N_j''(x)\,dx\big]$ is an $n \times n$ matrix.
To obtain the solution $\hat f_\lambda$, find $\beta$ minimizing $(y - N\beta)'(y - N\beta) + \lambda\beta'\Omega\beta$, where $N = [N_j(x_i)]$ is the basis matrix.
Solve a generalized ridge regression problem: $\hat\beta_\lambda = (N'N + \lambda\Omega)^{-1}N'y$, where the coefficients are shrunk toward the linear fit.
Smoothing Spline Fits
[figure: four panels of smoothing spline fits to the simulated (x, y) data]
How to Choose λ?
Ideally we want to choose the λ that minimizes the true risk: $\frac{1}{n}\sum_{i=1}^{n} E\big(\hat f_\lambda(x_i) - f(x_i)\big)^2$.
The Mallows-type criterion as an unbiased risk estimate: $\frac{1}{n}\|y - \hat y\|^2 + \frac{2\sigma^2}{n}\mathrm{tr}[A(\lambda)]$, where $\hat y = A(\lambda)y$.
[figure: average prediction error and unbiased risk estimate plotted against effective df]
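A small R helper for this Mallows-type criterion (a sketch; sigma2 is assumed known or estimated separately, and df_lambda stands for tr[A(lambda)]):

ure <- function(y, y_hat, df_lambda, sigma2) {
  mean((y - y_hat)^2) + 2 * sigma2 * df_lambda / length(y)
}
# Evaluate ure() over a grid of lambda (or df) values and pick the minimizer,
# as in the risk curves described above.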