Stat260: Bayesian Modeling and Inference Lecture Date: March 10, 2010

Bayes Factors, g-priors, and Model Selection for Regression

Lecturer: Michael I. Jordan Scribe: Tamara Broderick

The reading for this lecture is Liang et al. (2008).

1 Bayes Factors

We begin by clarifying a point about Bayes factors from the previous lecture. Consider data x ∈ X and parameter θ ∈ Θ. Take H0: θ ∈ Θ0 and H1: θ ∈ Θ1 for some partition {Θ0, Θ1} of Θ.

Figure 1: Cartoon of the partition of Θ space into regions Θ0 and Θ1.

Then the Bayes factor BF for comparing models H0 and H1 is

    BF = p(x | H0) / p(x | H1),

and we have the following relation between the posterior odds and the prior odds:

    p(H0 | x) / p(H1 | x) = BF · p(H0) / p(H1).

Note that the probabilities p(H0) and p(H1) are hyperpriors on the models themselves and are not expressible in terms of π(θ | H0) or π(θ | H1). Indeed, the overall prior on θ is a mixture distribution in which p(H0) and p(H1) are the mixture weights:

    π(θ) = p(H0) π(θ | H0) + p(H1) π(θ | H1).

Once we know BF, we can solve for the posterior probability of H0:

    p(H0 | x) = BF · [p(H0)/p(H1)] / (1 + BF · [p(H0)/p(H1)]).

We can see from the last line that an arbitrary constant in the Bayes factor does not cancel when calculating the posterior of H0. Rather, an arbitrary constant can be used to obtain an arbitrary posterior: letting the Bayes factor decrease to zero yields p(H0 | x) = 0 in the limit, and letting the Bayes factor increase to infinity yields p(H0 | x) = 1 in the limit.
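The relation between the Bayes factor and the posterior probability of H0, and the effect of an arbitrary multiplicative constant, can be sketched in a few lines (the numbers below are purely illustrative):

```python
# Posterior probability of H0 from a Bayes factor and a prior on H0.
# Implements p(H0 | x) = BF * [p(H0)/p(H1)] / (1 + BF * [p(H0)/p(H1)]).

def posterior_h0(bf, prior_h0):
    """Posterior P(H0 | x) given BF = p(x|H0)/p(x|H1) and P(H0)."""
    prior_odds = prior_h0 / (1.0 - prior_h0)
    posterior_odds = bf * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# With equal prior odds, the posterior reduces to BF/(1+BF):
print(posterior_h0(3.0, 0.5))          # 0.75
# Scaling BF by an arbitrary constant drives the posterior to either extreme:
print(posterior_h0(3.0 * 1e6, 0.5))    # close to 1
print(posterior_h0(3.0 * 1e-6, 0.5))   # close to 0
```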
Figure 2: Left: In one-dimensional linear regression, sets of points with more scatter along the one-dimensional predictor x are more informative about the slope and intercept. It is easy to roughly predict these parameters from the figure. Right: Data with less scatter in the predictor variable are less informative about the slope and intercept.

2 g-priors

We consider a regression model

    y = Xβ + ǫ,  ǫ ~ N(0, σ²I)  (1)

with a Jeffreys prior on σ²,

    π(σ²) ∝ 1/σ²,  (2)

and a prior called a g-prior on the vector of regression coefficients β,

    β ~ N(β0, g σ² (XᵀX)⁻¹).  (3)

In Eq. (3), g is a scalar that we will determine later, and the σ² factor in the covariance is necessary for conjugacy. Generally in regression models, we make the assumption that the design matrix X is known and fixed, so we are free to use it in our prior. The particular form in which X appears in Eq. (3) represents a kind of dispersion matrix. For instance, recall that for the usual maximum likelihood estimator β̂ of β, we have

    Var(β̂) = (XᵀX)⁻¹ · {an estimate of σ²}.

Alternatively, consider a principal component analysis on X (and ignore the response variable y for the moment). The eigenvectors of XᵀX give the directions of the new coordinates.

Although the g-prior is not a reference prior, we can further motivate the use of X in Eq. (3) by the same intuitive considerations we used for reference priors, namely placing more prior mass in areas of the parameter space where we expect the data to be less informative about the parameters. In the one-dimensional classical regression case (Figure 2), high scatter along the predictor axis corresponds to high leverage. The data in this case are more informative about the intercept and slope (β) than in the case of less scatter and low leverage.
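To make the leverage intuition concrete, here is a minimal sketch of drawing coefficients from the g-prior in Eq. (3); the design matrix and all settings (n = 40, p = 2, g = 5, σ² = 1) are simulated purely for illustration:

```python
import numpy as np

# Draw regression coefficients from the g-prior
#     beta ~ N(beta0, g * sigma2 * (X^T X)^{-1}).
rng = np.random.default_rng(1)
n, p = 40, 2
X = rng.normal(size=(n, p))       # fixed, known design matrix
g, sigma2 = 5.0, 1.0
beta0 = np.zeros(p)

cov = g * sigma2 * np.linalg.inv(X.T @ X)
draws = rng.multivariate_normal(beta0, cov, size=1000)
print(draws.shape)  # (1000, 2)

# More scatter in a predictor column means a larger X^T X in that direction,
# hence a *smaller* prior variance there: the prior concentrates where the
# data are informative, matching the leverage intuition of Figure 2.
```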
Posterior

With this generative model (Eqs. 1, 2, 3) in hand, we can calculate the full posterior:

    π(β, σ² | y, X) ∝ (σ²)^{−((n+p)/2+1)} exp{ −(1/(2σ²)) (y − Xβ̂)ᵀ(y − Xβ̂)
        − (1/(2σ²)) (β − β̂)ᵀ XᵀX (β − β̂) − (1/(2gσ²)) (β − β0)ᵀ XᵀX (β − β0) }.  (4)

Recall that the maximum likelihood estimator β̂ satisfies

    β̂ = (XᵀX)⁻¹ Xᵀ y,

and note that the first term in the sum is proportional to the classical residual sum of squares s²:

    s² = (y − Xβ̂)ᵀ(y − Xβ̂).

From Eq. (4), we obtain a Gaussian posterior on β and an inverse gamma posterior on σ²:

    β | σ², y, X ~ N( β0/(g+1) + g β̂/(g+1), (g/(g+1)) σ² (XᵀX)⁻¹ ),  (5)

    σ² | y, X ~ IG( n/2, s²/2 + (1/(2(g+1))) (β̂ − β0)ᵀ XᵀX (β̂ − β0) ).  (6)

Posterior means

From Eqs. (5, 6), we can find each parameter's posterior mean. First,

    E(β | y, X) = E( E(β | σ², y, X) | y, X )    by the tower rule
                = E( β0/(g+1) + g β̂/(g+1) | y, X )
                = β0/(g+1) + g β̂/(g+1).

Letting g increase to infinity, we recover the frequentist estimate of β: E(β | y, X) = β̂. Moreover, as g increases to infinity, we approach a flat prior on β. We have previously seen that such a prior can be useful for estimation but becomes problematic when performing hypothesis testing.

Next, using the fact that the mean of an IG(a, b)-distributed random variable is b/(a − 1) for a > 1, we have

    E(σ² | y, X) = [ s² + (1/(g+1)) (β̂ − β0)ᵀ XᵀX (β̂ − β0) ] / (n − 2).

Letting g → ∞ yields E(σ² | y, X) = s²/(n − 2). Compare this result to the classical frequentist unbiased estimator s²/(n − p) for σ². Here, p is the number of components of the β vector. One might worry that s²/(n − 2) is overly conservative for p > 2, but note that the σ²-estimate does not provide the error bars on β for, e.g., confidence intervals. We will see how to calculate confidence intervals for β next.
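The posterior-mean formulas above are easy to sketch in code; the data here are simulated purely for illustration, and the large-g limits can be checked directly:

```python
import numpy as np

# Sketch of the posterior means under a g-prior (Eqs. 5-6) on simulated data.
rng = np.random.default_rng(0)
n, p, g = 50, 3, 10.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)
beta0 = np.zeros(p)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)                 # MLE
s2 = float((y - X @ beta_hat) @ (y - X @ beta_hat))      # residual sum of squares s^2

# E(beta | y, X): shrinkage between the prior mean beta0 and the MLE beta_hat
post_mean_beta = beta0 / (g + 1) + g * beta_hat / (g + 1)

# E(sigma^2 | y, X): IG(n/2, b) has mean b/(n/2 - 1) = 2b/(n - 2)
b = s2 / 2 + (beta_hat - beta0) @ XtX @ (beta_hat - beta0) / (2 * (g + 1))
post_mean_sigma2 = b / (n / 2 - 1)

print(post_mean_beta)    # lies between beta0 and beta_hat
print(post_mean_sigma2)  # approaches s2/(n - 2) as g grows
```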
Figure 3: Illustration of the quantile function St⁻¹_n. Mass α/2 of the Student's t density appears to the right of St⁻¹_n(1 − α/2) and is shown in gray. The remaining mass is on the left.

Highest posterior density region for β_i

In order to find the HPD region for β_i, we first calculate the posterior of β_i with σ² integrated out:

    β_i | y, X ~ St(T_i, K_ii, n),

where St represents Student's t-distribution, which has three parameters: a location, a scale, and a number of degrees of freedom. In our case, we further have

    T_i := β_{0i}/(g+1) + g β̂_i/(g+1),

    K_ii := (g/(g+1)) · [ s² + (1/(g+1)) (β̂ − β0)ᵀ XᵀX (β̂ − β0) ] / n · ω_ii,

    ω_ii := (XᵀX)⁻¹_ii.

Here, β_{0i} and β̂_i are the i-th components of β0 and β̂, respectively. It follows that

    (β_i − T_i) / √K_ii ~ St_n,

where St_n is the standard Student's t-distribution (zero location, unit scale) with n degrees of freedom. It follows that, if we let St⁻¹_n be the quantile function for the standard Student's t-distribution with n degrees of freedom (Figure 3), the HPD interval at level α is

    T_i ± √K_ii · St⁻¹_n(1 − α/2).

How to set g

We've already seen that letting g → ∞ is not an option if we want to perform hypothesis testing. Moreover, letting g → 0 takes the posterior to the prior distribution, also reducing our ability to perform inference. Some other options for choosing g include using BIC, empirical Bayes, and full Bayes.

BIC

The Bayesian information criterion (BIC) is an approximation to the marginal likelihood obtained by a Laplace expansion. We can use the BIC to determine a value for g, but the resulting model is not necessarily consistent (e.g., for the model selection problem in Section 3).
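The HPD interval T_i ± √K_ii · St⁻¹_n(1 − α/2) above is straightforward to compute; a minimal sketch using SciPy's Student's t quantile function, with purely illustrative values for T_i, K_ii, and n:

```python
from scipy.stats import t as student_t

def hpd_interval(T_i, K_ii, n, alpha=0.05):
    """HPD interval for beta_i whose posterior is St(T_i, K_ii, n)."""
    half_width = (K_ii ** 0.5) * student_t.ppf(1 - alpha / 2, df=n)
    return T_i - half_width, T_i + half_width

# Illustrative values, not derived from real data:
lo, hi = hpd_interval(T_i=1.2, K_ii=0.04, n=30, alpha=0.05)
print((lo, hi))  # interval centered at T_i = 1.2
```

Because the Student's t density is symmetric and unimodal, this equal-tailed interval coincides with the HPD region.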
Empirical Bayes

Empirical Bayes works by drawing a line in the hierarchical description of our model:

    y | β, σ²
    β | σ², g
    ---------
    g

Then we take a Bayesian approach to estimation above the line and a frequentist approach below the line. While a full frequentist would consider a profile likelihood for g, empirical Bayes methodology prescribes integrating out the other parameters and maximizing the marginal likelihood. Thus, the empirical Bayes estimate for g is ĝ_EB:

    ĝ_EB = argmax_g p(y | X, g).

Inference under the prior with g set to ĝ_EB is, in general, consistent. We can calculate the marginal likelihood for our case as follows. Integrating out β and σ² yields

    p(y | X, g) = [ Γ((n−1)/2) / ( π^{(n−1)/2} n^{1/2} ‖y − ȳ‖^{n−1} ) ] · (1 + g)^{(n−1−p)/2} / (1 + g(1 − R²))^{(n−1)/2}.

In the formula above, ȳ is the grand mean, ‖·‖ is a vector norm, and R² is the usual coefficient of determination:

    R² = 1 − (y − Xβ̂)ᵀ(y − Xβ̂) / (y − ȳ)ᵀ(y − ȳ).

Full Bayes

A full Bayesian puts a prior distribution on g. Not only is a hierarchy of this type consistent, but the convergence will have the best rate (of the three methods considered here) in the frequentist sense.

3 Model Selection for Regression

Let γ be an indicator vector encoding which components of β are used in a model. When p is the dimensionality of β, we might have, for example,

    γ = (0, 0, 1, 1, 0, 1, ..., 0)    (p elements).

Further define X_γ to be the matrix that has only those columns of X corresponding to ones in the γ vector, and let β_γ be the vector that has only those components of β corresponding to ones in γ. Then there are 2^p models M_γ of the form

    y = X_γ β_γ + ǫ.

Model selection refers to the problem of picking one of these models. That is, we wish to choose γ based on data (X, y). The problem is like hypothesis testing but with many simultaneous hypotheses. The issue we encounter is that 2^p can be very large.
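Since the leading constant in p(y | X, g) does not involve g, ĝ_EB can be found by maximizing only the g-dependent factor. A crude grid-search sketch, with illustrative values of n, p, and R²:

```python
import math

def log_marginal_g_part(g, n, p, R2):
    """log of the g-dependent factor (1+g)^((n-1-p)/2) / (1+g(1-R2))^((n-1)/2)."""
    return 0.5 * (n - 1 - p) * math.log1p(g) - 0.5 * (n - 1) * math.log1p(g * (1 - R2))

def g_hat_eb(n, p, R2):
    """Crude grid search for the empirical Bayes estimate of g."""
    g_grid = [0.01 * k for k in range(1, 100000)]  # g in (0, 1000)
    return max(g_grid, key=lambda g: log_marginal_g_part(g, n, p, R2))

print(g_hat_eb(n=100, p=3, R2=0.5))
```

Setting the derivative of log_marginal_g_part to zero gives the maximizer in closed form; for n = 100, p = 3, R² = 0.5 it is g = 31, which the grid search recovers.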
We need a way to select γ based on priors on the parameters in each model, but with potentially very many models, assigning separate priors for each model may be infeasible. We consider instead a Bayesian approach utilizing nested models. Recall that, when we wish to use the Bayes factor, we may use improper priors when the same parameter appears in both models.
Then we might, for instance, use a g-prior on the remaining parameters. This pairwise concept extends to model selection by comparing all models to the null model or the full model. That is,

    γ_null = (0, ..., 0),
    γ_full = (1, ..., 1).

For some base model M_b, we can calculate for each model M_γ

    BF(M_γ, M_b) = p(y | M_γ) / p(y | M_b).

Then for any two models M_γ and M_γ′, we have

    BF(M_γ, M_γ′) = BF(M_γ, M_b) / BF(M_γ′, M_b).

If the base model is null, then the only parameters in common between any model and M_null are σ² (and often the intercept α). We can place a Jeffreys prior separately on each of these parameters. In particular, α is a location parameter, so

    π(σ², α) ∝ 1/σ².

For the case where the base model is the full model, see Liang et al. (2008) for more information.

How to set g, continued

Choosing g-priors for the remaining components of β yields a closed-form expression for BF(M_γ, M_γ′) via

    BF(M_γ, M_null) = (1 + g)^{(n − p_γ − 1)/2} / (1 + g(1 − R²_γ))^{(n−1)/2}.

Here, R²_γ is the coefficient of determination for the model M_γ, and p_γ is the number of parameters of β_γ. We can see directly now that if we send g → ∞, then BF(M_γ, M_null) → 0. That is, for any value of n, the null model is always preferred in the limit. This behavior is an instance of Lindley's paradox. On the other hand, consider choosing a fixed g, e.g. by using BIC, and letting n → ∞. Then BF(M_γ, M_null) → ∞. That is, no matter the data, the alternative to the null is always preferred in the limit n → ∞. We will see a resolution to this quandary in the next lecture.

References

Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. (2008). Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481):410–423.
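The limiting behavior behind Lindley's paradox is easy to verify numerically from the closed-form Bayes factor against the null; the values of n, p_γ, and R²_γ below are purely illustrative:

```python
import math

def log_bf_vs_null(g, n, p_gamma, R2_gamma):
    """log BF(M_gamma, M_null) = ((n-p_gamma-1)/2) log(1+g) - ((n-1)/2) log(1+g(1-R2))."""
    return (0.5 * (n - p_gamma - 1) * math.log1p(g)
            - 0.5 * (n - 1) * math.log1p(g * (1 - R2_gamma)))

# For fixed data, the log Bayes factor decreases without bound as g grows,
# so the null model is always preferred in the limit g -> infinity:
for g in (1e2, 1e4, 1e6):
    print(log_bf_vs_null(g, n=50, p_gamma=3, R2_gamma=0.6))
```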