Ridge regression
Patrick Breheny, High-Dimensional Data Analysis (BIOS 7600), February 8
Outline: Penalized regression; Ridge regression; Bayesian interpretation



Introduction

Large-scale testing is, of course, a big area, and we could keep talking about it. However, for the rest of the course we will take up the issue of high-dimensional regression: using the features to predict/explain the outcome. As we saw in our first lecture, ordinary least squares is problematic in high dimensions. Reducing the dimensionality through model selection allows for some progress, but has several shortcomings.

Likelihood and loss

More broadly speaking, this can be seen as a failure of likelihood-based methods. In this course, we will use the notation L to refer to the negative log-likelihood:

L(\theta \mid \text{Data}) = -\log \ell(\theta \mid \text{Data}) = -\log p(\text{Data} \mid \theta)

Here, L is known as the loss function and we seek estimates with a low loss; this is equivalent to finding a value (or interval of values) with a high likelihood.

Likelihood for linear regression

In the context of linear regression, the loss function is

L(\beta \mid X, y) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2

It is only the difference in loss functions between two values, L(β_1 | X, y) − L(β_2 | X, y), i.e., the likelihood ratio, that is relevant to likelihood-based inference; thus, the first term may be ignored. For the purposes of finding the MLE, the (2σ²)⁻¹ factor may also be ignored, although we must account for it when constructing likelihood-based intervals.
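To make this concrete, here is a small R sketch (our own illustration on simulated data, not code from the slides; the function name negloglik is ours) evaluating the loss at the least squares fit and at a nearby value of β, with σ² treated as known:

    ## Illustrative only: evaluate L(beta | X, y) for a fixed, known sigma^2
    set.seed(1)
    n <- 30
    X <- cbind(1, rnorm(n))                  # intercept plus one feature
    beta_true <- c(1, 2); sigma2 <- 1
    y <- as.vector(X %*% beta_true + rnorm(n, sd = sqrt(sigma2)))
    negloglik <- function(b) n/2 * log(2*pi*sigma2) + sum((y - X %*% b)^2) / (2*sigma2)
    b_ols <- coef(lm(y ~ X - 1))             # least squares fit minimizes the loss over beta
    negloglik(b_ols)
    negloglik(b_ols + c(0.5, 0))             # any other beta gives a larger loss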

Penalized likelihood

Given the aforementioned problems with likelihood methods, consider instead the following modification:

Q(\beta \mid X, y) = L(\beta \mid X, y) + P_\lambda(\beta),

where P is a penalty function that penalizes what one would consider less realistic values of the unknown parameters, and λ is a regularization parameter that controls the tradeoff between the two components. The combined function Q is known as the objective function.

Meaning of the penalty

What exactly do we mean by "less realistic" values? The most common use of penalization is to impose the belief that small regression coefficients are more likely than large ones; i.e., that we would not be surprised if β_j was 1.2 or 0.3 or 0, but would be very surprised if β_j was 9.7 × 10⁴. Later in the course, we consider other uses for penalization: to reflect beliefs that the true coefficients may be grouped into hierarchies, or display a spatial pattern such that β_j is likely to be close to β_{j+1}.

Remarks

Some care is needed in the application of the idea that small regression coefficients are more likely than large ones. First of all, it typically does not make sense to apply this line of reasoning to the intercept; hence β_0 is not included in the penalty. Second, the size of a regression coefficient depends on the scale with which the associated feature is measured; depending on the units x_j is measured in, β_j = 9.7 × 10⁴ might, in fact, be realistic.

Standardization

This is a particular problem if different features are measured on different scales, as the penalty would not have an equal effect on all coefficient estimates. To avoid this issue and ensure invariance to scale, features are usually standardized prior to model fitting to have mean zero and standard deviation 1:

\bar{x}_j = 0, \qquad x_j^T x_j = n \quad \text{for each } j

This can be accomplished without any loss of generality, as any location shifts for X are absorbed into the intercept, and scale changes can be reversed after the model has been fit.
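A minimal R sketch of this standardization (our own helper, standardize(), not course code): each column is centered and then divided by its standard deviation computed with denominator n, so that x_jᵀx_j = n.

    standardize <- function(X) {
      n <- nrow(X)
      Xc <- scale(X, center = TRUE, scale = FALSE)   # subtract column means
      s <- sqrt(colSums(Xc^2) / n)                   # sd with denominator n
      Xs <- sweep(Xc, 2, s, "/")                     # now each t(x_j) %*% x_j equals n
      list(X = Xs, center = attr(Xc, "scaled:center"), scale = s)
    }

Returning the centering and scaling constants makes it easy to transform fitted coefficients back to the original units afterwards.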

Added benefits of standardization

Centering and scaling the explanatory variables has added benefits in terms of computational savings and conceptual simplicity. The features are now orthogonal to the intercept term, meaning that in the standardized covariate space, β̂_0 = ȳ regardless of the rest of the model. Also, standardization simplifies the solutions; to illustrate with simple linear regression,

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}, \qquad \hat{\beta}_1 = \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sum_i (x_i - \bar{x})^2}

However, if we center and scale x and center y, then we get the much simpler expressions

\hat{\beta}_0 = 0, \qquad \hat{\beta}_1 = x^T y / n
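A quick numeric check of the simplified slope formula on made-up data (the variable names are ours):

    set.seed(2)
    x <- rnorm(50); y <- 2 + 3*x + rnorm(50)
    xs <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))   # centered, with sum(xs^2) = n
    yc <- y - mean(y)
    unname(coef(lm(yc ~ xs))[2])      # least squares slope
    sum(xs * yc) / length(xs)         # same value: t(x) %*% y / n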

Ridge regression: penalty

If penalized regression is to impose the assumption that small regression coefficients are more likely than large ones, we should choose a penalty that discourages large regression coefficients. A natural choice is to penalize the sum of squares of the regression coefficients:

P_\lambda(\beta) = \frac{1}{2\tau^2}\sum_{j=1}^{p}\beta_j^2

Applying this penalty in the context of penalized regression is known as ridge regression, and has a long history in statistics, dating back to 1970.

Objective function

The ridge regression objective function is

Q(\beta \mid X, y) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \frac{1}{2\tau^2}\sum_{j=1}^{p}\beta_j^2

It is often convenient to multiply the above objective function by σ²/n; as we will see, doing so tends to simplify the expressions involved in penalized regression:

Q(\beta \mid X, y) = \frac{1}{2n}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \frac{\lambda}{2}\sum_{j=1}^{p}\beta_j^2,

where λ = σ²/(nτ²).

Solution

For linear regression, the ridge penalty is particularly attractive to work with because the maximum penalized likelihood estimator has a simple closed-form solution. The objective function is differentiable, and it is straightforward to show that its minimum occurs at

\hat{\beta} = (n^{-1}X^TX + \lambda I)^{-1} n^{-1}X^Ty

The solution is similar to the least squares solution, but with the addition of a "ridge" down the diagonal of the matrix to be inverted. Note that the ridge solution is a simple function of the marginal OLS solutions n⁻¹Xᵀy and the correlation matrix n⁻¹XᵀX.
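The closed form translates directly into a few lines of R. This is an illustrative sketch (the name ridge_fit is ours), assuming X has already been standardized and y centered so that no intercept is needed:

    ridge_fit <- function(X, y, lambda) {
      n <- nrow(X); p <- ncol(X)
      # beta_hat = (X'X/n + lambda*I)^{-1} X'y/n
      solve(crossprod(X) / n + lambda * diag(p), crossprod(X, y) / n)
    }

Using solve(A, b) rather than explicitly forming the inverse is the usual, numerically preferable way to compute a quantity of the form A⁻¹b.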

Orthonormal solutions

To understand the effect of the ridge penalty on the estimator β̂, it helps to consider the special case of an orthonormal design matrix (XᵀX/n = I). In this case,

\hat{\beta}_j = \frac{\hat{\beta}_j^{\text{OLS}}}{1 + \lambda}

This illustrates the essential feature of ridge regression: shrinkage; i.e., the primary effect of applying the ridge penalty is to shrink the estimates toward zero.
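This shrinkage is easy to verify numerically on an orthonormal design (illustration only; assumes the ridge_fit() sketch above):

    set.seed(3)
    n <- 100
    X <- qr.Q(qr(matrix(rnorm(n * 3), n, 3))) * sqrt(n)   # t(X) %*% X / n = I
    y <- as.vector(X %*% c(1, -2, 0.5) + rnorm(n)); y <- y - mean(y)
    lambda <- 0.5
    b_ols <- ridge_fit(X, y, 0)                            # OLS, since lambda = 0
    cbind(ridge      = as.vector(ridge_fit(X, y, lambda)),
          shrunk_ols = as.vector(b_ols) / (1 + lambda))    # identical columns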

Simple example

The benefits of ridge regression are most striking in the presence of multicollinearity. Consider the following very simple simulated example:

> x1 <- rnorm(20)
> x2 <- rnorm(20, mean=x1, sd=.01)
> y <- rnorm(20, mean=3+x1+x2)
> lm(y~x1+x2)
...
(Intercept)           x1           x2
   2.582064    39.971344   -38.040040

Although there are only two covariates, the strong correlation between X_1 and X_2 causes a great deal of trouble for maximum likelihood.

Ridge regression for the simple example

The problem here is that the likelihood surface is very flat along β_1 + β_2 = 2, leading to tremendous uncertainty. When we introduce the added assumption that small coefficients are more likely than large ones by using a ridge penalty, however, this uncertainty is resolved:

> library(MASS)
> lm.ridge(y~x1+x2, lambda=1)
                   x1        x2
2.6214998 0.9906773 0.8973912

Ridge regression always has unique solutions

The maximum likelihood estimator is not always unique: if X is not full rank, XᵀX is not invertible and an infinite number of β values maximize the likelihood. This problem does not occur with ridge regression.

Theorem: For any design matrix X, the quantity n⁻¹XᵀX + λI is always invertible provided that λ > 0; thus, there is always a unique solution β̂.
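The practical consequence is easy to see (illustration only; again assumes the ridge_fit() sketch above): with more features than observations, lm() leaves some coefficients undetermined, while the ridge estimate exists and is unique for any λ > 0.

    set.seed(4)
    n <- 20; p <- 50
    X <- matrix(rnorm(n * p), n, p)
    y <- rnorm(n)
    sum(is.na(coef(lm(y ~ X))))      # many coefficients are not estimable
    length(ridge_fit(X, y, 0.1))     # all 50 ridge coefficients exist and are unique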

Is ridge better than maximum likelihood?

In our simple example from earlier, the ridge regression estimate was much closer to the truth than the MLE. An obvious question is whether ridge regression estimates are systematically closer to the truth than MLEs are, or whether that example was a fluke. To address this question, let us first derive the bias and variance of ridge regression.

Bias and variance

The variance of the ridge regression estimate is

\mathrm{Var}(\hat{\beta}) = \sigma^2 W X^T X W, \qquad \text{where } W = (X^TX + n\lambda I)^{-1}

Meanwhile, the bias is

\mathrm{Bias}(\hat{\beta}) = -n\lambda W\beta

Both bias and variance contribute to overall accuracy, as measured by mean squared error (MSE):

\mathrm{MSE}(\hat{\beta}) = \mathbb{E}\lVert\hat{\beta} - \beta\rVert^2 = \sum_j \mathrm{Var}(\hat{\beta}_j) + \sum_j \mathrm{Bias}(\hat{\beta}_j)^2
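These expressions are straightforward to evaluate for a given design, true β, and σ²; the helper below (ridge_mse(), our own sketch) simply plugs them in and sums over the coefficients.

    ridge_mse <- function(X, beta, sigma2, lambda) {
      n <- nrow(X); p <- ncol(X)
      W <- solve(crossprod(X) + n * lambda * diag(p))   # W = (X'X + n*lambda*I)^{-1}
      V <- sigma2 * W %*% crossprod(X) %*% W            # Var(beta_hat)
      b <- -n * lambda * W %*% beta                     # Bias(beta_hat)
      c(Var = sum(diag(V)), Bias2 = sum(b^2), MSE = sum(diag(V)) + sum(b^2))
    }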

Existence theorem

So, is ridge regression better than maximum likelihood (OLS)?

Theorem: There always exists a value of λ > 0 such that MSE(β̂_λ) < MSE(β̂_OLS).

This is a rather surprising result with somewhat radical implications: despite the typically impressive theoretical properties of maximum likelihood and linear regression, we can always obtain a better estimator by shrinking the MLE towards zero.

Sketch of proof

[Figure: MSE, Bias², and Var of the ridge estimator plotted as functions of λ over the range 0 to 10.]
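The figure can be roughly reproduced with the standardize() and ridge_mse() sketches above; the design and true coefficients here are our own choices, so the exact curves will differ from the slide:

    set.seed(5)
    X <- standardize(matrix(rnorm(100 * 5), 100, 5))$X
    beta <- c(0.5, -0.3, 0.2, 0, 0)
    lam <- seq(0, 10, length.out = 200)
    out <- t(sapply(lam, function(l) ridge_mse(X, beta, sigma2 = 1, lambda = l)))
    matplot(lam, out[, c("MSE", "Bias2", "Var")], type = "l", lty = 1,
            xlab = expression(lambda), ylab = "")
    legend("topright", legend = c("MSE", "Bias^2", "Var"), lty = 1, col = 1:3)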

Bayesian justification for the penalty

From a Bayesian perspective, one can think of the penalty as arising from a formal prior distribution on the parameters. Let p(y | β) denote the distribution of y given β and p(β) the prior for β; then the posterior density is

p(\beta \mid y) = \frac{p(y \mid \beta)\,p(\beta)}{p(y)} \propto p(y \mid \beta)\,p(\beta),

or, on the log scale,

\log p(\beta \mid y) = \log p(y \mid \beta) + \log p(\beta) + \text{constant};

this is exactly the generic form of a penalized likelihood.

Ridge regression from a Bayesian perspective

By optimizing the objective function, we are finding the mode of the posterior distribution of β; this is known as the maximum a posteriori, or MAP, estimate. Specifically, suppose that we assume the prior β_j ~ iid N(0, τ²); the resulting log-posterior is then, up to a constant, exactly the negative of the ridge regression objective function. Furthermore, the ridge regression estimator β̂ is the posterior mean (in addition to being the posterior mode), and the regularization parameter λ is the ratio of the prior precision (1/τ²) to the information (n/σ²).
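A numeric check of this correspondence (illustrative values of σ² and τ²; assumes the ridge_fit() sketch above): the posterior mean under the normal prior coincides with the ridge estimate at λ = σ²/(nτ²).

    set.seed(6)
    n <- 60; p <- 4; sigma2 <- 2; tau2 <- 0.5
    X <- matrix(rnorm(n * p), n, p)
    y <- as.vector(X %*% rnorm(p, sd = sqrt(tau2)) + rnorm(n, sd = sqrt(sigma2)))
    # Posterior mean under beta_j ~ iid N(0, tau^2), y | beta ~ N(X beta, sigma^2 I)
    post_mean <- solve(crossprod(X) / sigma2 + diag(p) / tau2, crossprod(X, y) / sigma2)
    ridge_est <- ridge_fit(X, y, lambda = sigma2 / (n * tau2))
    max(abs(post_mean - ridge_est))   # numerically zero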

Similarities and differences

Thus, we arrive at the same estimator β̂ whether we view it as a modified maximum likelihood estimator or a Bayes estimator. In other inferential respects, however, the similarity between the Bayesian and Frequentist approaches breaks down. Two aspects, in particular, are worthy of mention.

Properties of intervals

First is the inferential goal of constructing intervals for β and what properties such intervals should have. Frequentist confidence intervals are required to maintain a certain level of coverage for any fixed value of β. Bayesian posterior intervals, on the other hand, may have much higher coverage at some values of β than others.

Properties of intervals (cont'd)

[Figure: frequentist coverage of a 95% Bayesian posterior interval, plotted as a function of β over the range −4 to 4.]

The Bayes coverage for a 95% posterior interval at β_j ≈ 0 is > 99%, but only about 20% for β_j ≈ 3.5. The interval nevertheless maintains 95% coverage across a collection of β_j values, integrated with respect to the prior.
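A small simulation sketch of this phenomenon for a single coefficient, under assumed values rather than the setup behind the slide's figure: β̂ | β ~ N(β, 1) with prior β ~ N(0, 1), so the posterior is normal with its mean shrunk halfway toward zero.

    bayes_coverage <- function(beta, s = 1, tau = 1, nsim = 1e5) {
      shrink <- tau^2 / (tau^2 + s^2)
      beta_hat <- rnorm(nsim, mean = beta, sd = s)
      post_mean <- shrink * beta_hat                 # posterior mean given beta_hat
      half <- qnorm(0.975) * sqrt(shrink) * s        # posterior sd = sqrt(shrink) * s
      mean(abs(post_mean - beta) <= half)            # coverage at this fixed beta
    }
    bayes_coverage(0)      # well above 0.95
    bayes_coverage(3.5)    # well below 0.95
    mean(sapply(rnorm(2000, sd = 1), bayes_coverage, nsim = 2000))   # about 0.95 averaged over the prior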

Point properties at 0

The other aspect in which a clear divide emerges between the Bayes and Frequentist perspectives is with regard to the specific value β = 0. From a Bayesian perspective, the posterior probability that β = 0 is 0, because its posterior distribution is continuous. From a Frequentist perspective, however, the notion of testing whether β = 0 is still meaningful, and indeed often of interest in an analysis.

Final remarks

The penalized regression literature generally adopts the perspective of maximum likelihood theory, although the appearance of a penalty in the likelihood somewhat blurs the lines between Bayes and Frequentist ideas. The vast majority of research into penalized regression methods has focused on point estimation and its properties, so these inferential differences between the Bayesian and Frequentist perspectives are relatively unexplored. Nevertheless, developing inferential methods for penalized regression is an active area of current research, and we will come back to some of these issues when we discuss inference for high-dimensional models.