Stat260: Bayesian Modeling and Inference Lecture Date: March 10, 2010

Bayes Factors, g-priors, and Model Selection for Regression

Lecturer: Michael I. Jordan                                Scribe: Tamara Broderick

The reading for this lecture is Liang et al. (2008).

1  Bayes Factors

We begin by clarifying a point about Bayes factors from the previous lecture. Consider data x ∈ X and parameter θ ∈ Θ. Take H_0: θ ∈ Θ_0 and H_1: θ ∈ Θ_1 for some partition {Θ_0, Θ_1} of Θ.

Figure 1: Cartoon of the partition of the space Θ into regions Θ_0 and Θ_1.

Then the Bayes factor BF for comparing models H_0 and H_1 is

    BF = p(x | H_0) / p(x | H_1),

and we have the following relation between the posterior odds and the prior odds:

    p(H_0 | x) / p(H_1 | x) = BF · p(H_0) / p(H_1).

Note that the probabilities p(H_0) and p(H_1) are hyperpriors on the models themselves and are not expressible in terms of π(θ | H_0) or π(θ | H_1). Indeed, the overall prior on θ is a mixture distribution in which p(H_0) and p(H_1) are the mixture probabilities:

    π(θ) = p(H_0) π(θ | H_0) + p(H_1) π(θ | H_1).

Once we know BF, we can solve for the posterior probability of H_0:

    p(H_0 | x) / (1 - p(H_0 | x)) = BF · p(H_0) / p(H_1),

so that

    p(H_0 | x) = [BF · p(H_0)/p(H_1)] / [1 + BF · p(H_0)/p(H_1)].

We can see from the last line that an arbitrary constant in the Bayes factor does not factor out when calculating the posterior probability of H_0. Rather, an arbitrary constant can be used to obtain an arbitrary posterior: letting the constant decrease to zero drives p(H_0 | x) to 0 in the limit, and letting it increase to infinity drives p(H_0 | x) to 1 in the limit.
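To make the odds arithmetic above concrete, here is a minimal numeric sketch (not part of the original notes; the Bayes factor and prior probability values are made-up assumptions) showing how the posterior probability of H_0 follows from a Bayes factor and the prior model probabilities, and how an arbitrary constant multiplying the Bayes factor can push that posterior anywhere between 0 and 1.

```python
# Minimal sketch: posterior probability of H0 from a Bayes factor.
# The numbers below are made up for illustration.

def posterior_prob_H0(bf, prior_H0=0.5):
    """p(H0 | x) given BF = p(x|H0)/p(x|H1) and the prior p(H0)."""
    prior_odds = prior_H0 / (1.0 - prior_H0)
    posterior_odds = bf * prior_odds          # posterior odds = BF * prior odds
    return posterior_odds / (1.0 + posterior_odds)

bf = 3.0                                      # hypothetical Bayes factor favoring H0
print(posterior_prob_H0(bf))                  # 0.75 with even prior odds

# An arbitrary multiplicative constant in the Bayes factor does not cancel:
for c in (1e-3, 1.0, 1e3):
    print(c, posterior_prob_H0(c * bf))       # posterior can be pushed toward 0 or 1
```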

Figure 2: Left: in one-dimensional linear regression, sets of points with more scatter along the one-dimensional predictor x are more informative about the slope and intercept; it is easy to roughly predict these parameters from the figure. Right: data with less scatter in the predictor variable are less informative about the slope and intercept.

2  g-priors

We consider a regression model

    y = Xβ + ε,    ε ~ N(0, σ² I),        (1)

with a Jeffreys prior on σ²,

    π(σ²) ∝ 1/σ²,        (2)

and a prior called a g-prior on the vector of regression coefficients β,

    β ~ N(β_0, g σ² (X^T X)^{-1}).        (3)

In Eq. (3), g is a scalar that we will determine later, and the σ² factor in the covariance is necessary for conjugacy. Generally in regression models, we make the assumption that the design matrix X is known and fixed, so we are free to use it in our prior. The particular form in which X appears in Eq. (3) represents a kind of dispersion matrix. For instance, recall that for the usual maximum likelihood estimator β̂ of β, we have

    Var(β̂) = (X^T X)^{-1} · {an estimate of σ²}.

Alternatively, consider a principal component analysis of X (and ignore the response variable y for the moment): the eigenvectors of X^T X give the directions of the new coordinates.

Although the g-prior is not a reference prior, we can further motivate the use of X in Eq. (3) by the same intuitive considerations we used for reference priors, namely placing more prior mass in areas of the parameter space where we expect the data to be less informative about the parameters. In the one-dimensional classical regression case (Figure 2), high scatter along the predictor axis corresponds to high leverage. The data in this case are more informative about the intercept and slope (β) than in the case of less scatter and low leverage.
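The short sketch below (not from the lecture; the dimensions and the values of g, σ², and β_0 are illustrative assumptions) simulates the generative model of Eqs. (1)-(3) and shows how the same matrix (X^T X)^{-1} appears in both the g-prior covariance and the sampling variance of the maximum likelihood estimator.

```python
# Minimal sketch of the g-prior setup in Eqs. (1)-(3); every number below is an
# illustrative assumption, not a value from the lecture.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                  # fixed, known design matrix
sigma2, g = 1.0, 10.0                        # assumed noise variance and g
beta0 = np.zeros(p)                          # prior mean for beta

# g-prior covariance from Eq. (3): g * sigma^2 * (X^T X)^{-1}
XtX_inv = np.linalg.inv(X.T @ X)
prior_cov = g * sigma2 * XtX_inv
beta = rng.multivariate_normal(beta0, prior_cov)     # one draw of beta from the prior

# data drawn from the regression model of Eq. (1)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# The same matrix (X^T X)^{-1} scales the sampling variance of the MLE,
# Var(beta_hat) = sigma^2 (X^T X)^{-1}, which is why it acts as a dispersion matrix.
beta_hat = XtX_inv @ X.T @ y
print("draw of beta:", beta, "\nMLE beta_hat:", beta_hat)
```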

Posterior

With this generative model (Eqs. 1, 2, 3) in hand, we can calculate the full posterior:

    π(β, σ² | y, X) ∝ (σ²)^{-(n/2 + p/2 + 1)} exp{ -(1/(2σ²)) (y - Xβ̂)^T (y - Xβ̂)
                        - (1/(2σ²)) (β - β̂)^T X^T X (β - β̂)
                        - (1/(2gσ²)) (β - β_0)^T X^T X (β - β_0) }.        (4)

Recall that the maximum likelihood estimator β̂ satisfies β̂ = (X^T X)^{-1} X^T y, and note that the first term in the sum is proportional to the classical residual sum of squares s²:

    s² = (y - Xβ̂)^T (y - Xβ̂).

From Eq. (4), we obtain a Gaussian posterior on β and an inverse gamma posterior on σ²:

    β | σ², y, X ~ N( (g/(g+1)) (β_0/g + β̂),  σ² (g/(g+1)) (X^T X)^{-1} ),        (5)

    σ² | y, X ~ IG( n/2,  s²/2 + (1/(2(g+1))) (β̂ - β_0)^T X^T X (β̂ - β_0) ).        (6)

Posterior means

From Eqs. (5, 6), we can find each parameter's posterior mean. First,

    E(β | y, X) = E( E(β | σ², y, X) | y, X )        by the tower rule
                = E( (g/(g+1)) (β_0/g + β̂) | y, X )
                = (g/(g+1)) (β_0/g + β̂).

Letting g increase to infinity, we recover the frequentist estimate of β: E(β | y, X) → β̂. Moreover, as g increases to infinity, we approach a flat prior on β. We have previously seen that such a prior can be useful for estimation but becomes problematic when performing hypothesis testing.

Next, using the fact that the mean of an IG(a, b)-distributed random variable is b/(a - 1) for a > 1, we have

    E(σ² | y, X) = [ s² + (1/(g+1)) (β̂ - β_0)^T X^T X (β̂ - β_0) ] / (n - 2).

Letting g → ∞ yields E(σ² | y, X) → s²/(n - 2). Compare this result to the classical frequentist unbiased estimator s²/(n - p) for σ². Here, p is the number of degrees of freedom, or equivalently the number of components of the β vector. One might worry that s²/(n - 2) is overly conservative for p > 2, but note that the σ² estimate on its own does not provide the error bars on β (e.g., interval estimates). We will see how to calculate such intervals for β next.
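As a sanity check on Eqs. (5) and (6), here is a minimal sketch (with made-up data; nothing below comes from the notes) that computes the conjugate posterior quantities in closed form: the posterior mean of β and the inverse gamma parameters and posterior mean for σ².

```python
# Sketch of the conjugate posterior in Eqs. (5)-(6); the data-generating values
# here are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n, p, g = 50, 3, 10.0
beta0 = np.zeros(p)
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)                 # MLE
s2 = float((y - X @ beta_hat) @ (y - X @ beta_hat))      # residual sum of squares s^2
d = beta_hat - beta0

# Eq. (5): beta | sigma^2, y, X ~ N(post_mean, sigma^2 * (g/(g+1)) * (X^T X)^{-1})
post_mean = (g / (g + 1.0)) * (beta0 / g + beta_hat)

# Eq. (6): sigma^2 | y, X ~ IG(a, b)
a = n / 2.0
b = s2 / 2.0 + (d @ XtX @ d) / (2.0 * (g + 1.0))

# posterior means; the mean of IG(a, b) is b/(a - 1) for a > 1
print("E[beta | y, X]    =", post_mean)
print("E[sigma^2 | y, X] =", b / (a - 1.0))              # = (s^2 + d'X'Xd/(g+1)) / (n - 2)
```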

Figure 3: Illustration of the quantile function St_n^{-1}. Mass α/2 of the Student's t density lies to the right of St_n^{-1}(1 - α/2) and is shown in gray; the remaining mass is to the left.

Highest posterior density region for β_i

In order to find the HPD region for β_i, we first calculate the posterior of β_i with σ² integrated out:

    β_i | y, X ~ St(T_i, K_ii, n),

where St represents Student's t-distribution, which has three parameters: a mean, a spread, and a number of degrees of freedom. In our case, we further have

    T_i := (g/(g+1)) (β_0i/g + β̂_i),
    K_ii := (g/(g+1)) · [ s² + (1/(g+1)) (β̂ - β_0)^T X^T X (β̂ - β_0) ] / n · ω_ii,
    ω_ii := [(X^T X)^{-1}]_ii.

Here, β_0i and β̂_i are the i-th components of β_0 and β̂, respectively. It follows that

    (β_i - T_i) / √K_ii ~ St_n,

where St_n is the standard Student's t-distribution (zero mean, unit scale) with n degrees of freedom. It follows that, if we let St_n^{-1} be the quantile function for the standard Student's t-distribution with n degrees of freedom (Figure 3), the HPD interval at level α is

    T_i ± √K_ii · St_n^{-1}(1 - α/2).

How to set g

We have already seen that letting g → ∞ is not an option if we want to perform hypothesis testing. Moreover, letting g → 0 takes the posterior to the prior distribution, also reducing our ability to perform inference. Some other options for choosing g include using BIC, empirical Bayes, and full Bayes.

BIC

The Bayesian information criterion (BIC) is an approximation to the marginal likelihood obtained by a Laplace expansion. We can use the BIC to determine a value for g, but the resulting model is not necessarily consistent (e.g., for the model selection problem in Section 3).
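Here is a hedged sketch of the HPD computation described above, using the Student-t quantile from SciPy; the data, g, and α = 0.05 are illustrative assumptions rather than values from the lecture.

```python
# Sketch of the HPD interval for each beta_i from the Student-t marginal above;
# the data, g, and alpha are illustrative assumptions.
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(2)
n, p, g, alpha = 50, 3, 10.0, 0.05
beta0 = np.zeros(p)
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)
beta_hat = np.linalg.solve(XtX, X.T @ y)
s2 = float((y - X @ beta_hat) @ (y - X @ beta_hat))
d = beta_hat - beta0
Q = float(d @ XtX @ d)                       # (beta_hat - beta0)' X'X (beta_hat - beta0)

for i in range(p):
    T_i = (g / (g + 1.0)) * (beta0[i] / g + beta_hat[i])
    omega_ii = XtX_inv[i, i]
    K_ii = (g / (g + 1.0)) * (s2 + Q / (g + 1.0)) / n * omega_ii
    half_width = np.sqrt(K_ii) * student_t.ppf(1.0 - alpha / 2.0, df=n)
    print(f"beta_{i}: {T_i:.3f} +/- {half_width:.3f}")
```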

Empirical Bayes

The second option for choosing g is empirical Bayes, which works by drawing a line in the hierarchical description of our model:

    y | β, σ²
    β | σ², g
    σ²
    ----------
    g

Then we take a Bayesian approach to estimation above the line and a frequentist approach below the line. While a full frequentist would consider a profile likelihood for g, empirical Bayes methodology prescribes integrating out the other parameters and maximizing the marginal likelihood. Thus, the empirical Bayes estimate for g is ĝ_EB:

    ĝ_EB = argmax_g p(y | X, g).

Inference under the g-prior with g set to ĝ_EB is, in general, consistent. We can calculate the marginal likelihood for our case as follows. Integrating out β and σ² yields

    p(y | X, g) = [ Γ((n-1)/2) / ( π^{(n-1)/2} n^{1/2} ‖y - ȳ‖^{n-1} ) ] · (1 + g)^{(n-1-p)/2} / (1 + g(1 - R²))^{(n-1)/2}.

In the formula above, ȳ is the grand mean, ‖·‖ is a vector norm, and R² is the usual coefficient of determination,

    R² = 1 - (y - Xβ̂)^T (y - Xβ̂) / (y - ȳ)^T (y - ȳ).

Full Bayes

A full Bayesian puts a prior distribution on g. Not only is a hierarchy of this type consistent, but the convergence will have the best rate (of the three methods considered here) in the frequentist sense.

3  Model Selection for Regression

Let γ be an indicator vector encoding which components of β are used in a model. When p is the dimensionality of β, we might have, for example,

    γ = (0, 0, 1, 1, 0, 1, ..., 0)        (p elements).

Further define X_γ to be the matrix that has only those columns of X corresponding to ones in the γ vector, and let β_γ be the vector that has only those components of β corresponding to ones in γ. Then there are 2^p models M_γ of the form

    y = X_γ β_γ + ε.

Model selection refers to the problem of picking one of these models. That is, we wish to choose γ based on data (X, y). The problem is like hypothesis testing but with many simultaneous hypotheses. The issue we encounter is that 2^p can be very large. We need a way to select γ based on priors on the parameters in each model, but with potentially very many models, assigning separate priors for each model may be infeasible. We consider instead a Bayesian approach utilizing nested models. Recall that, when we wish to use the Bayes factor, we may use improper priors when the same parameter appears in both models.
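As an aside, here is a small sketch of the empirical Bayes choice ĝ_EB described above: since the factor of p(y | X, g) that does not involve g is irrelevant to the argmax, the code maximizes only the g-dependent part of the log marginal likelihood over a crude grid. The simulated data and the grid range are assumptions made for illustration.

```python
# Sketch of the empirical Bayes estimate g_EB = argmax_g p(y | X, g); only the
# g-dependent part of the log marginal likelihood matters for the argmax.
# Simulated data and the grid are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([1.5, 0.0, -0.7]) + rng.normal(size=n)

# R^2 for the model with an intercept, as in the marginal likelihood formula
Z = np.column_stack([np.ones(n), X])
coef = np.linalg.lstsq(Z, y, rcond=None)[0]
resid = y - Z @ coef
R2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def log_marginal_in_g(g):
    # g-dependent part of log p(y | X, g):
    #   (n - 1 - p)/2 * log(1 + g) - (n - 1)/2 * log(1 + g * (1 - R^2))
    return 0.5 * (n - 1 - p) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - R2))

grid = np.linspace(0.0, 1000.0, 100001)      # crude grid search over g
g_EB = grid[np.argmax(log_marginal_in_g(grid))]
print("R^2 =", round(R2, 3), " g_EB =", g_EB)
```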

Continuing with the nested-model approach, we might, for instance, use a g-prior on the remaining parameters. This pairwise concept extends to model selection by comparing all models to the null model or to the full model. That is,

    γ_null = (0, ..., 0),    γ_full = (1, ..., 1).

For some base model M_b, we can calculate, for each model M_γ,

    BF(M_γ, M_b) = p(y | M_γ) / p(y | M_b).

Then for any two models M_γ and M_γ', we have

    BF(M_γ, M_γ') = BF(M_γ, M_b) / BF(M_γ', M_b).

If the base model is the null model, then the only parameters in common between any model and M_null are σ² (and often the intercept α). We can place a Jeffreys prior separately on each of these parameters; in particular, α is a location parameter (whose Jeffreys prior is flat) and σ² is a scale parameter, so

    π(σ², α) ∝ 1/σ².

For the case where the base model is the full model, see Liang et al. (2008) for more information.

How to set g, continued

Choosing g-priors for the remaining components of β yields a closed-form expression for BF(M_γ, M_γ') via

    BF(M_γ, M_null) = (1 + g)^{(n - p_γ - 1)/2} / (1 + g(1 - R_γ²))^{(n-1)/2}.

Here, R_γ² is the coefficient of determination for the model M_γ, and p_γ is the number of remaining parameters, i.e., the number of components of β_γ. We can see directly now that if we send g → ∞, then BF(M_γ, M_null) → 0. That is, for any value of n, the null model is always preferred in the limit. This behavior is an instance of Lindley's paradox. On the other hand, consider choosing a fixed g, e.g., by using BIC, and letting n → ∞. Then BF(M_γ, M_null) → ∞. That is, no matter the data, the alternative to the null is always preferred in the limit n → ∞. We will see a resolution to this quandary in the next lecture.

References

Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. (2008). Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481):410-423.
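As a closing illustration of Section 3 (a sketch with simulated data; the design, coefficients, and values of g below are assumptions, not anything from the notes or from Liang et al.), the code enumerates all 2^p submodels, scores each against the null model with the closed-form Bayes factor above, and shows that an enormous g drives every comparison toward the null model, as in Lindley's paradox.

```python
# Sketch for Section 3: enumerate all 2^p submodels, score each against the null
# model via the closed-form Bayes factor, and watch an enormous g favor the null
# (Lindley's paradox). All data and settings are illustrative assumptions.
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([1.5, 0.0, -0.7]) + rng.normal(size=n)

def R2(X_gamma):
    """Coefficient of determination for a submodel fit with an intercept."""
    Z = np.column_stack([np.ones(n), X_gamma])
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    rss = float(np.sum((y - Z @ coef) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - rss / tss

def log_bf_vs_null(gamma, g):
    """log BF(M_gamma, M_null) = (n - p_gamma - 1)/2 log(1+g) - (n-1)/2 log(1 + g(1 - R_gamma^2))."""
    p_gamma = sum(gamma)
    if p_gamma == 0:
        return 0.0                           # the null model against itself
    cols = [j for j, z in enumerate(gamma) if z]
    r2 = R2(X[:, cols])
    return 0.5 * (n - p_gamma - 1) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))

for g in (10.0, 1e30):
    models = list(itertools.product([0, 1], repeat=p))
    best = max(models, key=lambda gam: log_bf_vs_null(gam, g))
    print(f"g = {g:g}: preferred model gamma = {best}")
# With moderate g the true submodel wins; as g -> infinity every log BF vs the
# null tends to -infinity, so the null model is preferred regardless of the data.
```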