g-priors for Linear Regression

Stat260: Bayesian Modeling and Inference. Lecture date: March 15, 2010.
Lecturer: Michael I. Jordan. Scribe: Andrew H. Chan.

1 Linear regression and g-priors

In the last lecture, we mentioned the use of g-priors for linear regression in a Bayesian framework. In this lecture we continue to discuss several of the issues that arise when using g-priors. As a recapitulation of the problem setup, recall that we have data $y = X\beta + \epsilon$, where $X$ is the design matrix, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, and $\beta \sim \mathcal{N}(\beta_0, g\sigma^2 (X^T X)^{-1})$. The prior on $\sigma^2$ is the Jeffreys prior, $\pi(\sigma^2) \propto 1/\sigma^2$, and usually $\beta_0$ is taken to be $0$ for simplification purposes. The appeal of the method is that there is only one free parameter $g$ for the entire linear regression. Furthermore, the simplicity of the g-prior model generally leads to easily obtained analytical results. However, we still face the problem of selecting $g$ or a prior for $g$, and this lecture provides an overview of the issues that come up.

1.1 Marginal likelihood and Bayes factors

To compare different models to each other, we need to compute the marginal likelihoods of the given models and then take their ratio to obtain the Bayes factor. Fortunately, for the g-prior, the marginal likelihood for a model $M_\gamma$ is easily computed as
\[
p(y \mid M_\gamma, g) = \frac{\Gamma\!\left(\frac{n-1}{2}\right)}{\pi^{(n-1)/2}\,\sqrt{n}}\;\|y - \bar{y}\|^{-(n-1)}\;\frac{(1+g)^{(n-1-p_\gamma)/2}}{\bigl(1 + g(1 - R_\gamma^2)\bigr)^{(n-1)/2}}, \tag{1}
\]
where $\gamma$ is an indicator of which covariates are used in the model, $\bar{y}$ is the grand mean of the data, $R_\gamma^2$ is the coefficient of determination, and $p_\gamma$ is the number of degrees of freedom of the given model. Recall from the last lecture that the coefficient of determination is
\[
R^2 = 1 - \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{(y - \bar{y})^T (y - \bar{y})},
\]
where $\hat{\beta} = (X^T X)^{-1} X^T y$ is the MLE of $\beta$.

To find the Bayes factor of a given model with respect to the null model $M_N$, in which $p_\gamma = 0$ and $R_\gamma^2 = 0$, we take the ratio of their marginal likelihoods. The constants in (1) cancel, and we are left with
\[
\mathrm{BF}(M_\gamma, M_N) = \frac{(1+g)^{(n-1-p_\gamma)/2}}{\bigl(1 + g(1 - R_\gamma^2)\bigr)^{(n-1)/2}}.
\]
With this in hand, we can compare any model to the null model, and thus we can compare any pair of models by using
\[
\mathrm{BF}(M_\gamma, M_{\gamma'}) = \frac{\mathrm{BF}(M_\gamma, M_N)}{\mathrm{BF}(M_{\gamma'}, M_N)}.
\]
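These quantities are easy to compute directly. As a concrete illustration, here is a minimal Python sketch, assuming simulated data, the illustrative choice $g = n$, and $\beta_0 = 0$; the function name and the simulated design are ours, not from the lecture or any particular package. It evaluates the log of (1) for two nested models and combines them into a Bayes factor exactly as above. Working on the log scale avoids overflow in the Gamma and power terms.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, X, g):
    """log p(y | M_gamma, g) from (1), taking beta_0 = 0.

    X holds the p_gamma covariates of M_gamma; the intercept enters only
    through the grand mean y-bar, as in the lecture's formula."""
    n = len(y)
    yc = y - y.mean()
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # MLE of beta
    resid = y - X @ beta_hat
    R2 = 1.0 - (resid @ resid) / (yc @ yc)             # coefficient of determination
    p_gamma = X.shape[1]
    return (gammaln((n - 1) / 2)
            - ((n - 1) / 2) * np.log(np.pi)
            - 0.5 * np.log(n)
            - ((n - 1) / 2) * np.log(yc @ yc)          # ||y - y-bar||^{-(n-1)}
            + ((n - 1 - p_gamma) / 2) * np.log1p(g)
            - ((n - 1) / 2) * np.log1p(g * (1.0 - R2)))

# Illustrative simulated data: M2 adds two pure-noise covariates to M1
rng = np.random.default_rng(0)
n, g = 50, 50
X1 = rng.normal(size=(n, 2))
X2 = np.hstack([X1, rng.normal(size=(n, 2))])
y = X1 @ np.array([1.0, -0.5]) + rng.normal(size=n)

# BF(M1, M2) = BF(M1, M_N) / BF(M2, M_N): a difference of log marginals
log_bf = log_marginal_likelihood(y, X1, g) - log_marginal_likelihood(y, X2, g)
print(f"log BF(M1, M2) = {log_bf:.2f}  (positive values favor M1)")
```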

Before moving on to the question of how $g$ should be selected in the model, we consider the reason why the marginal likelihood is used to compare models against each other.

1.2 Why the marginal likelihood?

If one were to plot the classical likelihood with respect to the complexity of the model, one would find that as the complexity of the model increased, so would the likelihood, as illustrated in Figure 1. In this setting, the likelihood is obtained after estimating the parameters using some statistical method (e.g. maximum likelihood) and then plotting the value of $p(x \mid \hat{\theta})$, the likelihood evaluated at the estimate. However, the problem with this method of comparison is that more complex models will always do better than simpler models. Because more complex models have more parameters, whatever can be done in a simple model can also be done in a complex model, and thus a more complex model will lead to a better fit. The downside is that this leads to overfitting, which is particularly damaging in the setting of prediction. In frequentist statistics, many methods have been developed to penalize the likelihood with respect to increasing complexity of the model. However, these methods are typically only justified after proposing the idea and then performing a series of analyses to show that it has desirable properties, rather than being well-motivated from the start.

Figure 1: The dashed line represents the likelihood and the solid line is the marginal likelihood. With increasing model complexity, the likelihood continually increases, whereas the marginal likelihood has a maximum at some point, after which it begins to decrease.

In the Bayesian framework, marginal likelihoods have a natural built-in penalty for more complex models. At a certain point, the marginal likelihood will begin to decrease with increasing complexity, and hence it does not directly suffer from the overfitting problem that occurs when considering only likelihoods. The intuition for why the marginal likelihood begins to decrease is that as the complexity of the model increases, the prior will be spread out more thinly across both the good models and the bad models. Because the marginal likelihood is the likelihood integrated with respect to the prior, spreading the prior across too many models will place too little prior mass on the good models and, as a result, cause the marginal likelihood to decrease.

Another way to look at it is as follows. Suppose $x$ is the data, $\theta$ are the parameters, and $M$ is the model. The posterior is then
\[
p(\theta \mid x, M) = \frac{p(x \mid \theta, M)\, p(\theta \mid M)}{p(x \mid M)}. \tag{2}
\]
Solving for the marginal likelihood $p(x \mid M)$ in (2), we obtain
\[
p(x \mid M) = \frac{p(x \mid \theta, M)\, p(\theta \mid M)}{p(\theta \mid x, M)}. \tag{3}
\]
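The behavior sketched in Figure 1 can be reproduced with a short simulation. The sketch below is illustrative only and assumes data simulated from a quadratic and polynomial models of increasing degree; it compares the maximized Gaussian log likelihood with the null-based log Bayes factor from Section 1.1, which differs from the log marginal likelihood only by a constant common to all models.

```python
import numpy as np

rng = np.random.default_rng(1)
n, g = 60, 60                       # sample size; fixed g (here g = n, purely illustrative)
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=1.0, size=n)   # true model: quadratic

yc = y - y.mean()
tss = yc @ yc

for degree in range(1, 9):
    # Polynomial design matrix (intercept included as the constant column)
    X = np.vander(x, degree + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    R2 = 1.0 - rss / tss
    p = degree                       # number of covariates beyond the intercept

    # Maximized Gaussian log likelihood: increases monotonically with degree
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

    # log BF(M_gamma, M_N) from Section 1.1: typically peaks near the true
    # degree and then decreases, mirroring the marginal likelihood in Figure 1
    log_bf = ((n - 1 - p) / 2) * np.log1p(g) - ((n - 1) / 2) * np.log1p(g * (1 - R2))

    print(f"degree {degree}: max loglik = {loglik:8.2f}, log BF vs null = {log_bf:8.2f}")
```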

Taking the log of (3) results in
\[
\log p(x \mid M) = \underbrace{\log p(x \mid \theta, M)}_{\text{log likelihood}} + \underbrace{\log p(\theta \mid M) - \log p(\theta \mid x, M)}_{\text{penalty}}. \tag{4}
\]
This is true for any choice of $\theta$, and in particular for the maximum likelihood estimate of $\theta$. The $\log p(x \mid \theta, M)$ term in (4) is the log likelihood, and it only increases with increasing model complexity. On the other hand, the $\log p(\theta \mid M) - \log p(\theta \mid x, M)$ term can be viewed as a penalty against complex models. The overall sign of this penalty is negative because $p(\theta \mid x, M)$, the posterior, is generally larger than $p(\theta \mid M)$, the prior, assuming that given the data, the posterior sharpens up with respect to the prior. Hence, the penalty term balances out the increase in likelihood as the model complexity increases.

2 Fixed value for g

We still need to find a prior for $g$, and so we first consider using a fixed $g$. The first idea might be to let $g$ go to infinity, since this gives a posterior expectation equal to the maximum likelihood estimate. However, as $g \to \infty$, the Bayes factor $\mathrm{BF}(M_\gamma, M_N) \to 0$, and thus the null model would always be preferred to any other model. This is generally a bad idea, and is called the Lindley paradox.

In another attempt we might consider letting $R_\gamma^2$ go to $1$. $R_\gamma^2$, the coefficient of determination, is a measure of how well the data fit the regression. Intuitively, letting $R_\gamma^2$ go to $1$ considers what happens as we see data sets that are more and more appropriate for the model. However, in letting $R_\gamma^2$ go to $1$, we find that
\[
\mathrm{BF}(M_\gamma, M_N) \to (1+g)^{(n-1-p_\gamma)/2}, \tag{5}
\]
which shows that the Bayes factor converges to a constant. One would expect the Bayes factor to go to infinity as we consider data sets that fit the model better and better, but here we find that it instead tops off at a constant, which is considered a problem and is essentially a symptom of a deeper issue. This problem is called the information paradox. In the end, we conclude that a fixed $g$ is a bad idea, and $g$ should, in fact, depend on the data. We will eventually construct a hierarchical model to handle the selection of $g$, but first we give a brief overview of hierarchical models.

3 Hierarchical models

Suppose we have some data $x$, parameters $\theta$, and a hyperparameter $\lambda$, such that the distribution of $x$ given $\theta$ is $p(x \mid \theta)$ and the distribution of $\theta$ given $\lambda$ is $p(\theta \mid \lambda)$. In the regression setting, $y$ is the data, $\beta$ is the parameter, and $g$ is the hyperparameter. The first question we must answer is how to set $\lambda$, the hyperparameter. We will consider two ways to do so: the empirical Bayes method and the fully Bayesian approach.

3.1 Empirical Bayes

The empirical Bayes method is to choose the value of $\lambda$ that maximizes $p(x \mid \lambda)$; i.e.,
\[
\hat{\lambda}_{EB} = \operatorname*{argmax}_{\lambda}\, p(x \mid \lambda).
\]
This is essentially a maximization problem over $\lambda$ of
\[
p(x \mid \lambda) = \int p(x \mid \theta)\, p(\theta \mid \lambda)\, d\theta.
\]
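In the regression setting, this amounts to maximizing the marginal likelihood (1) of the model at hand over $g$. Here is a minimal sketch, assuming illustrative values of $n$, $p_\gamma$, and $R_\gamma^2$ and a simple grid search (both are our assumptions, not part of the lecture); since the constants in (1) do not depend on $g$, maximizing the null-based log Bayes factor over $g$ is equivalent.

```python
import numpy as np

def log_bf_null(n, p, R2, g):
    """log BF(M_gamma, M_N) from Section 1.1; maximizing this over g is the
    same as maximizing p(y | M_gamma, g), since the constants in (1) drop out."""
    return ((n - 1 - p) / 2) * np.log1p(g) - ((n - 1) / 2) * np.log1p(g * (1 - R2))

# Illustrative numbers (assumed): p = 4 covariates, n = 100 observations,
# and an observed coefficient of determination of 0.6
n, p, R2 = 100, 4, 0.6

grid = np.logspace(-3, 6, 2000)                   # candidate values of g
g_hat = grid[np.argmax(log_bf_null(n, p, R2, grid))]
print(f"empirical Bayes estimate of g: {g_hat:.1f}")
```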

Having chosen $\hat{\lambda}_{EB}$, we can use the model $p(x \mid \theta)$ with the prior $p(\theta \mid \hat{\lambda}_{EB})$ (where the prior is now fixed using $\hat{\lambda}_{EB}$) to do Bayesian inference. However, a major concern is that the data are used twice: once for the hyperparameter and another time in the likelihood $p(x \mid \theta)$. This typically leads to over-confidence in the posterior because the data are used multiple times. Indeed, empirical Bayes is not Bayesian. Rather, it is frequentist; i.e., it needs frequentist theory to justify it (and that theory is available). For example, the James-Stein estimator is an empirical Bayes procedure [EM73]. This estimator is used in the setting where there is one draw of data $x$ from a normal distribution, $x \sim \mathcal{N}(\theta, \sigma^2 I)$, where $\sigma^2$ is known. With only one draw of data, the maximum likelihood estimate of the mean is simply $\hat{\theta}_{ML} = x$. However, James and Stein [JS61] showed that in three or more dimensions, the estimate $\hat{\theta}_{ML}$ is inadmissible with respect to the quadratic (squared-error) loss function, $\ell = \|\theta - \delta(x)\|^2$, where $\theta$ is the true mean and $\delta(x)$ is an estimate of the mean as a function of the data $x$. That is, there exists an estimator that is better than $\hat{\theta}_{ML}$ for all $\theta$. They showed this by devising their own estimator $\hat{\theta}_{JS}$,
\[
\hat{\theta}_{JS} = \left(1 - \frac{(p-2)\sigma^2}{\|x\|^2}\right) x,
\]
where $p$ is the number of dimensions. This estimator dominates $\hat{\theta}_{ML}$ for all $\theta$ in settings with three or more dimensions. Efron and Morris showed that the James-Stein estimator can be interpreted as arising from a two-level Gaussian model, where the hyperparameter is obtained by applying empirical Bayes.

Empirical Bayes has some properties that make it especially useful in practice. For example, in large hierarchies, it is helpful to stop growing the hierarchy and plug in some numbers for the hyperparameters. These hyperparameters can be obtained by finding a maximum, which typically involves taking a derivative. This can often be easier than integrating throughout the entire hierarchy.

3.2 Fully Bayesian

An alternative to the empirical Bayes method is the fully Bayesian approach. We still have $p(x \mid \theta)$ and $p(\theta \mid \lambda)$, but instead of fixing $\lambda$ to some estimate, we put a prior $p(\lambda)$ on $\lambda$. In this case, any inference depends on integrating over $\lambda$, generally using the posterior $p(\lambda \mid x)$. This approach is more conservative than the empirical Bayes approach (i.e., it gives more spread in the estimate of $\lambda$), and it also does not suffer from the issue of using the data twice. This approach can have good frequentist properties, but the choice of the prior $p(\lambda)$ matters, and that choice is essential to any frequentist analysis.

3.3 Other approaches

In addition to empirical Bayes and the fully Bayesian approach, there are other approaches, such as cross-validation. In cross-validation, a fraction of the data is left out as test data, and the model is trained on the remaining data. The trained model is then evaluated on the test data, and the parameter value that yields the best held-out likelihood is chosen as the inferred parameter. One problem with cross-validation is that part of the data must be left out of the training set, which might not be practical if there is already very little data available.
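A small simulation makes the dominance claim concrete. The sketch below is our own illustration, assuming an arbitrary true mean vector, known $\sigma^2 = 1$, and $p = 10$ dimensions; it compares the average quadratic loss of $\hat{\theta}_{ML} = x$ with that of $\hat{\theta}_{JS}$ over repeated single draws $x \sim \mathcal{N}(\theta, \sigma^2 I)$.

```python
import numpy as np

def james_stein(x, sigma2):
    """Shrink the single observation x toward 0: (1 - (p - 2) sigma^2 / ||x||^2) x."""
    p = x.shape[-1]
    shrinkage = 1.0 - (p - 2) * sigma2 / np.sum(x**2, axis=-1, keepdims=True)
    return shrinkage * x

rng = np.random.default_rng(2)
p, sigma2, n_reps = 10, 1.0, 20000
theta = rng.normal(scale=2.0, size=p)            # an arbitrary "true" mean vector

# One draw x per replication; compare quadratic loss of the MLE and of James-Stein
x = theta + rng.normal(scale=np.sqrt(sigma2), size=(n_reps, p))
risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))                  # roughly p * sigma^2
risk_js = np.mean(np.sum((james_stein(x, sigma2) - theta) ** 2, axis=1))
print(f"risk of MLE: {risk_mle:.3f}   risk of James-Stein: {risk_js:.3f}")
```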

4 Prior for g

Returning to the issue of selecting a prior for $g$ in linear regression, we have a list of essential properties that the prior should have. The first is that it should not suffer from any paradoxes, such as the Lindley paradox and the information paradox described previously. The second is that it should have model selection consistency, which is typically a frequentist property; as applied here, however, the requirement has both Bayesian and frequentist aspects. For the model selection procedure to be consistent, the probability of the true model given the data must approach $1$ as the number of data points $n$ goes to infinity. That is, when $M_\gamma$ is the true model,
\[
p(M_\gamma \mid y, X) \xrightarrow{\text{a.s.}} 1 \quad \text{as } n \to \infty, \tag{6}
\]
where a.s. denotes almost-sure convergence. The Bayesian aspect of the statement in (6) is the fact that the model is assigned a probability. The frequentist aspect is that the convergence considers what happens as $y$ is generated over multiple draws of the data, which is a frequentist notion.

If we use the empirical Bayes approach to select $g$, we do not encounter any paradoxes. However, this method does not have model selection consistency. Using the fully Bayesian approach, we do not encounter any paradoxes, and we also achieve model selection consistency. In the next lecture, we will cover two priors for $g$ that have these appealing properties: the Zellner-Siow prior and the hyper-g prior.

References

[JS61] W. James and C. M. Stein. Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961.

[EM73] B. Efron and C. Morris. Stein's Estimation Rule and its Competitors: An Empirical Bayes Approach. Journal of the American Statistical Association, Vol. 68, No. 341, 1973.