Bayesian model selection: methodology, computation and applications

David Nott, Department of Statistics and Applied Probability, National University of Singapore. Statistical Genomics Summer School Program.

Outline:
- Examples
- Bayesian statistics
- Bayesian model selection
- Marginal likelihood computation
- Predictive model selection
- Conclusion

What is model selection? For us, a statistical model describes a (random) process by which data might have been generated. Often there will be several plausible models to choose from and hence uncertainty about the data generating process. Sometimes we won't believe that any of the models under consideration generated the data. The goal of model selection is to choose from a collection of models the best one for a given purpose.

Example: gene expression arrays Gene expression arrays are able to give a measure, for different tissue samples, of the level of gene expression for thousands of genes simultaneously. Over the page I've randomly selected 30 genes from a microarray experiment and plotted gene expression values for the 30 genes.

[Figure: "30 randomly chosen genes" — a dot plot of differential expression values, one row per gene (rows labelled 1 to 30); horizontal axis: differential expression.]

Example: gene expression arrays A row corresponds to a gene. There are 10 dots in each row (10 microarrays doing 10 comparisons of tissue samples). In this experiment, we are comparing brain tissue in two strains of mice (Cotsapas et al., 2003). Data for gene $g$ has mean $\mu_g$ (assume normality, say). Is $\mu_g = 0$ or $\mu_g \neq 0$? A model selection problem.

Rainfall-runoff models Used by hydrologists for simulating processes such as streamflow in response to a rainfall event. The choice of model for a given application is a difficult problem. "One is left with the view that the state of water resources modelling is like an economy subject to inflation; that there are too many models chasing (as yet) too few applications; that there are too many modellers chasing too few ideas; and that the response is to print ever-increasing quantities of paper, thereby devaluing the currency..." (Robin Clarke, 1974).

[Diagram: schematic of a runoff model, with storages S1, S2, S3, areas A1, A2, A3, a baseflow store BS, inputs P and E, and flow components Qs, Qb and Qr.]

Rainfall-runoff models The diagram shows a representation of a runoff model (the Australian Water Balance Model (AWBM), Boughton (2004)) that relates rainfall and other variables to discharge of a stream. The number of storages shown in the diagram is a variable to be chosen by the modeller. How should the number of storages be chosen in the best way for prediction, interpretation, etc.?

Nonparametric regression using linear combinations of basis terms I simulated $n = 600$ response values from the following model:
$$y_i = f(z_i) + \epsilon_i, \quad i = 1, \ldots, 600,$$
where the errors $\epsilon_i$ are independent $N(0, 0.5^2)$, $f(z_i)$ is the mean function and the predictors $z_i = (z_{i1}, z_{i2})$ are generated uniformly on the unit square $[0, 1]^2$.

[Figure: "Mean function for simulated data set" — a surface plot of $f(z)$ over $(z_1, z_2) \in [0, 1]^2$, with $f(z)$ ranging from about 1 to 4.5.]

Nonparametric regression using linear combinations of basis terms The mean function $f(z) = f(z_1, z_2)$ is
$$f(z) = 1 + N(\mu_1, \Sigma_1, z) + N(\mu_2, \Sigma_2, z)$$
where $N(\mu, \Sigma, z)$ denotes a bivariate normal density with mean $\mu$ and covariance matrix $\Sigma$ evaluated at $z$. Here we choose $\mu_1 = (0.25, 0.75)^T$, $\mu_2 = (0.75, 0.25)^T$,
$$\Sigma_1 = \begin{bmatrix} 0.05 & 0.01 \\ 0.01 & 0.05 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 0.1 & 0.01 \\ 0.01 & 0.1 \end{bmatrix}.$$
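
As a concrete illustration, here is a minimal sketch (in Python, assuming numpy and scipy; the seed and variable names are my own) of simulating a data set from the model above.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)                      # illustrative seed
n = 600
z = rng.uniform(0.0, 1.0, size=(n, 2))              # predictors uniform on [0,1]^2

mu1, mu2 = np.array([0.25, 0.75]), np.array([0.75, 0.25])
Sigma1 = np.array([[0.05, 0.01], [0.01, 0.05]])
Sigma2 = np.array([[0.10, 0.01], [0.01, 0.10]])

def f(z):
    # mean function: 1 + N(mu1, Sigma1, z) + N(mu2, Sigma2, z)
    return (1.0
            + multivariate_normal.pdf(z, mean=mu1, cov=Sigma1)
            + multivariate_normal.pdf(z, mean=mu2, cov=Sigma2))

y = f(z) + rng.normal(0.0, 0.5, size=n)             # errors independent N(0, 0.5^2)
```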

Nonparametric regression using linear combinations of basis terms Suppose I didn't know the mean function $f(z)$ beforehand and want to estimate it. One way to do this is to adopt a flexible representation for $f(z)$ as a linear combination of a large number of basis functions:
$$f(z) \approx \sum_{j=1}^{K} \beta_j h_j(z).$$

Nonparametric regression using linear combinations of basis terms Here the $h_j(z)$ are the basis terms and the $\beta_j$ are unknown coefficients. To estimate the unknown coefficients we fit a linear model:
$$y_i = \sum_{j=1}^{K} \beta_j h_j(z_i) + \epsilon_i.$$
Some kind of variable selection (estimating some of the $\beta_j$ as exactly zero) might be done to prevent overfitting, since $K$ is large.

Nonparametric regression using linear combinations of basis terms One choice of basis (there are many):
$$\{1, z_1, z_2, \|z - \rho_1\|^2 \log(\|z - \rho_1\|), \ldots, \|z - \rho_s\|^2 \log(\|z - \rho_s\|)\}$$
where $\rho_i$, $i = 1, \ldots, s$, are a collection of so-called knot points. The knots are points in the predictor space. For a fairly rich set of knots we can obtain good approximations to the mean function.
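
A minimal sketch of this basis construction and a least-squares fit; the knot grid, the stand-in data and the function names are illustrative assumptions of mine, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, size=(600, 2))            # stand-in predictors
y = 1 + np.sin(2 * np.pi * z[:, 0]) + rng.normal(0, 0.5, 600)  # stand-in responses

def basis(z, knots):
    # columns: 1, z1, z2, and ||z - rho||^2 log(||z - rho||) for each knot rho
    r = np.linalg.norm(z[:, None, :] - knots[None, :, :], axis=2)
    radial = np.where(r > 0.0, r**2 * np.log(np.where(r > 0.0, r, 1.0)), 0.0)
    return np.column_stack([np.ones(len(z)), z, radial])

g = np.linspace(0.1, 0.9, 5)
knots = np.array([(a, b) for a in g for b in g])    # s = 25 knots on a grid
H = basis(z, knots)                                 # 600 x (3 + 25) design matrix
beta, *_ = np.linalg.lstsq(H, y, rcond=None)        # fit the linear model
fhat = H @ beta                                     # estimated mean function values
```

In practice some coefficients would be shrunk or set exactly to zero (variable selection) rather than fit by plain least squares, as the slide notes.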

Purposes of model selection The three examples of model selection problems that I've presented so far illustrate different purposes of model selection. In the gene expression example, interpretation is all-important (which genes are different between the two tissue samples?). In the nonparametric regression example, our purpose is purely to predict well; there is no particular interpretation associated with the basis functions which are selected. In the hydrology example, there are elements of both prediction and interpretation to the problem.

Purposes of model selection A key idea in model selection of any kind is that one must consider what the model is to be used for. We won't treat this idea very formally in this talk, but it is important. The idea of this talk is to review the Bayesian approach to model selection. I'll give a brief review of Bayesian statistics first.

Bayesian statistics Bayesian statistics is distinguished by the use of probability for quantifying all kinds of uncertainty. There is a set of unknowns $\theta$ to learn about, and data $y$. Specify a full probability model for the data and unknowns:
$$p(y, \theta) = p(\theta) p(y \mid \theta)$$
$p(\theta)$ is called the prior distribution and $p(y \mid \theta)$ is the likelihood function. The prior codes in probabilistic form what we know about the unknowns before observing data, and gives the opportunity for use of prior knowledge.

Bayesian statistics Conditioning on the observed data in this model we get Bayes rule:
$$p(\theta \mid y) \propto p(\theta) p(y \mid \theta)$$
Here $p(\theta \mid y)$ is the posterior distribution, expressing what we know about $\theta$ given the data $y$. Inference is from the posterior distribution. In summarizing the posterior (by calculating probabilities or expectations) an integration over the parameter space is needed.
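
A hedged sketch of Bayes rule in action on a one-dimensional grid, with toy data and a $N(0, 1)$ prior of my own choosing, just to make the conditioning and the integration concrete.

```python
import numpy as np

y = np.array([0.8, 1.1, 0.4, 1.6])                  # toy data, y_i ~ N(mu, 1)
mu = np.linspace(-3.0, 3.0, 2001)                   # grid over the unknown
dmu = mu[1] - mu[0]
log_prior = -0.5 * mu**2                            # N(0, 1) prior, up to a constant
log_lik = -0.5 * ((y[:, None] - mu[None, :])**2).sum(axis=0)
log_post = log_prior + log_lik                      # Bayes rule: prior x likelihood
post = np.exp(log_post - log_post.max())
post /= post.sum() * dmu                            # normalize to a density
print("posterior mean:", (mu * post).sum() * dmu)   # a posterior summary (integral)
```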

Example: AWBM [Diagram: the AWBM schematic, as before.]

Example: AWBM [Slide: excerpt from a published journal page comparing MCMC sampling schemes for a rainfall-runoff model. The posterior summaries obtained from the different schemes were similar, with approximately equal means and comparable posterior quartiles, but the AM and MHBU algorithms were superior to the MHSS algorithm in computation time and simplicity, and the AM algorithm was far more statistically efficient. The page includes a figure of the posterior and prior distributions for the K parameter.]

Predictive inference Suppose predictions of future data $y^*$ are required. Predictive inference is based on
$$p(y^* \mid y) = \int p(y^* \mid \theta) p(\theta \mid y) \, d\theta.$$
In Bayesian inference predictive distributions are often the basis for informal methods of model criticism and even model selection. How we go about model criticism usually depends on what the model will be used for.
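
A minimal sketch of this integral by Monte Carlo, for a toy conjugate normal model (unit data variance, $N(0, 1)$ prior on the mean; these choices and the seed are mine), where the exact posterior is normal and the exact predictive moments are available as a check.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=20)                   # toy data, y_i ~ N(mu, 1)
n, s02 = len(y), 1.0                                # prior mu ~ N(0, s02)
v = 1.0 / (n + 1.0 / s02)                           # posterior variance of mu
m = v * y.sum()                                     # posterior mean of mu
mu_draws = rng.normal(m, np.sqrt(v), size=100_000)  # draws from p(mu | y)
y_star = rng.normal(mu_draws, 1.0)                  # draws from p(y* | y)
print(y_star.mean(), y_star.var())                  # approx m and v + 1 (exact values)
```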

Bayesian model comparison Now let's consider model selection in the Bayesian framework. At the end of this material there will be a question for the audience... Consider a collection of models $M = \{M_1, \ldots, M_k\}$. Denoting the data by $y$, write $p(y \mid \theta_i, M_i)$ for the likelihood function for model $M_i$, where we have written $\theta_i$ for the set of unknown parameters in $M_i$. Write $p(\theta_i \mid M_i)$ for the prior distribution on the parameters in model $M_i$, $i = 1, \ldots, k$.

Bayesian model comparison In Bayesian statistics uncertainty about unknowns is treated probabilistically, so we need a prior distribution on the unknown model, which will be updated to a posterior distribution based on the data. Write $p(M_i)$ for the prior probability of model $M_i$, $i = 1, \ldots, k$. Now we apply Bayes rule to obtain

Bayesian model comparison p(m i y) p(m i )p(y M i ) (1) where the so-called marginal likelihood p(y M i ) for model M i is obtained as p(y M i ) = p(y θ i, M i )p(θ i M i )dθ i Normalizing (1) so that the distribution sums to one we obtain p(m i y) = p(m i )p(y M i ) j p(m j)p(y M j )

Bayesian model comparison Note that the posterior odds of model $M_i$ relative to model $M_j$ are
$$\frac{p(M_i \mid y)}{p(M_j \mid y)} = \frac{p(M_i) p(y \mid M_i)}{p(M_j) p(y \mid M_j)}.$$
From this, the ratio of the posterior odds to the prior odds is
$$\frac{p(M_i \mid y) / p(M_j \mid y)}{p(M_i) / p(M_j)} = \frac{p(y \mid M_i)}{p(y \mid M_j)},$$
which is called the Bayes factor for model $M_i$ relative to model $M_j$. If all models are assigned equal probability in the prior, then the Bayes factor is simply the ratio of posterior probabilities of the models compared.
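
The bookkeeping above in a short sketch; the log marginal likelihoods are hypothetical numbers of my own, and the subtraction of the maximum is a standard guard against underflow when exponentiating.

```python
import numpy as np

log_ml = np.array([-134.2, -131.7, -135.9])         # hypothetical log p(y | M_i)
prior = np.array([1.0, 1.0, 1.0]) / 3.0             # equal prior probabilities p(M_i)
w = np.log(prior) + log_ml
w = np.exp(w - w.max())                             # shift for numerical stability
post = w / w.sum()                                  # posterior probabilities p(M_i | y)
bf_21 = np.exp(log_ml[1] - log_ml[0])               # Bayes factor of M_2 vs M_1
print(post, bf_21)
```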

Bayesian model averaging We discussed predictive inference before. How do we do this in the presence of model uncertainty? Let $\Delta$ be a quantity to be predicted (a future response, say), and let $p(\Delta \mid M_i, y)$ be the predictive distribution under model $M_i$; we talked about these kinds of predictive distributions before. The predictive distribution incorporating model uncertainty (Bayesian model averaging) is
$$p(\Delta \mid y) = \sum_i p(\Delta \mid M_i, y) \, p(M_i \mid y).$$
Model-specific predictive distributions are weighted according to the posterior model probabilities.
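
A sketch of the model-averaged predictive density as a mixture; the model-specific normal predictives and the weights are invented purely for illustration.

```python
import numpy as np
from scipy.stats import norm

post = np.array([0.6, 0.3, 0.1])                    # hypothetical p(M_i | y)
means = np.array([1.0, 1.4, 0.7])                   # model-specific predictive means
sds = np.array([0.5, 0.6, 0.4])                     # model-specific predictive sds
grid = np.linspace(-2.0, 4.0, 601)
p_bma = sum(w * norm.pdf(grid, m, s)                # mixture weighted by p(M_i | y)
            for w, m, s in zip(post, means, sds))
```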

A simple example Consider testing on a normal mean. Observed data $y_1, \ldots, y_n$, independent $N(\mu, 1)$. We want to compare the two models $M_1 = \{\mu = 0\}$ and $M_2 = \{\mu \neq 0\}$. This might be a reasonable way to formulate model selection for one gene in my gene expression example.

Example: testing on a normal mean In model $M_1$ there are no unknown parameters and the marginal likelihood is
$$p(y \mid M_1) = (2\pi)^{-n/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} y_i^2 \right).$$
In model $M_2$ we need to specify a prior distribution on $\mu$. We take $\mu \mid M_2 \sim N(0, \sigma_0^2)$. Question for the audience: what happens as $\sigma_0^2 \to 0$ and as $\sigma_0^2 \to \infty$?

Example: testing on a normal mean It is easily shown that
$$p(y \mid M_2) = (2\pi)^{-n/2} (n\sigma_0^2 + 1)^{-1/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} y_i^2 \right) \exp\left( \frac{n^2 \bar{y}^2}{2(n + 1/\sigma_0^2)} \right).$$

Example: testing on a normal mean Comparing the expressions for $p(y \mid M_1)$ and $p(y \mid M_2)$, the factor
$$(2\pi)^{-n/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} y_i^2 \right)$$
is common to both, and hence these terms cancel when we compute the Bayes factor of model $M_2$ relative to model $M_1$, which is
$$\frac{p(y \mid M_2)}{p(y \mid M_1)} = (n\sigma_0^2 + 1)^{-1/2} \exp\left( \frac{n^2 \bar{y}^2}{2(n + 1/\sigma_0^2)} \right).$$

Example: testing on a normal mean Note that as $\sigma_0^2 \to 0$ this Bayes factor $\to 1$, and as $\sigma_0^2 \to \infty$ the Bayes factor $\to 0$. In other words, if the models have equal prior probability, then $p(M_2 \mid y) \to 0.5$ as $\sigma_0^2 \to 0$ and $p(M_2 \mid y) \to 0$ as $\sigma_0^2 \to \infty$, regardless of the data. This example illustrates that it is not acceptable to thoughtlessly use vague proper priors in Bayesian model selection, as this will tend to favour the simplest model.
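
A numerical check of these limits using the closed-form Bayes factor above; the toy data (a modest true effect) and the seed are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=30)                   # toy data, true mu = 0.5
n, ybar = len(y), y.mean()

def bf_21(s02):
    # closed-form Bayes factor of M2 vs M1 for prior variance s02
    return (n * s02 + 1.0) ** -0.5 * np.exp(n**2 * ybar**2 / (2.0 * (n + 1.0 / s02)))

for s02 in [1e-6, 1e-2, 1.0, 1e2, 1e6]:
    print(s02, bf_21(s02))                          # -> 1 as s02 -> 0, -> 0 as s02 -> inf
```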

Bayesian model comparison Note that the marginal likelihood
$$p(y \mid M_i) = \int p(y \mid \theta_i, M_i) p(\theta_i \mid M_i) \, d\theta_i$$
can be regarded as a predictive density for $y$ before $y$ is observed: if we haven't observed any data yet, the posterior on $\theta_i$ is just the prior $p(\theta_i \mid M_i)$, so our definition of the predictive density of $y$ reduces to the marginal likelihood above.

Bayesian model comparison Looking at the marginal likelihood this way, you can see why the behaviour in our example happens. If we have a very tight prior around zero for $\mu$ in model $M_2$, the models are nearly the same (there is not much difference between setting $\mu = 0$ in model $M_1$ and having a very tight prior around zero in model $M_2$), and so the Bayes factor is close to 1.

Bayesian model comparison With a diffuse prior on $\mu$ in model $M_2$, the prior predictive density is very spread out (since our prior allows the mean to be anywhere), so the value of the prior predictive density at the observed data is bound to be smaller than for model $M_1$. Finding good default choices for priors in model comparison is an active area of current research. Fortunately, it is not hard to find a good default choice for many common model selection problems.

Methods for calculating marginal likelihoods Once priors are specified, everything is easy. Not quite: how does one compute for model $M_i$ the marginal likelihood
$$p(y \mid M_i) = \int p(y \mid \theta_i, M_i) p(\theta_i \mid M_i) \, d\theta_i?$$

Methods for calculating marginal likelihoods In my example I could do this analytically. For complex models, however, $\theta_i$ is high-dimensional, and $p(y \mid M_i)$ is defined by an integral over the (possibly high-dimensional) parameter space. Hard.

Methods for calculating marginal likelihoods Obvious idea: recall that
$$p(y \mid M_i) = \int p(y \mid \theta_i, M_i) p(\theta_i \mid M_i) \, d\theta_i$$
and observe that this is an expectation with respect to the prior. Simulate $\theta_i^{(1)}, \ldots, \theta_i^{(S)}$ from $p(\theta_i \mid M_i)$, then use
$$\frac{1}{S} \sum_{s=1}^{S} p(y \mid \theta_i^{(s)}, M_i)$$
to estimate $p(y \mid M_i)$. Bad idea: the variance is large, as the prior is very spread out compared to the likelihood. More sophisticated methods are available.
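
A sketch of this estimator for the normal-mean model with a deliberately diffuse prior (my own toy data and seed); replicated estimates scatter noticeably around the exact marginal likelihood, whose closed form comes from the earlier slides.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=30)
n, s02 = len(y), 100.0                              # deliberately diffuse prior

def lik(mu):
    # p(y | mu), vectorized over an array of mu values
    return (2 * np.pi) ** (-n / 2) * np.exp(-0.5 * ((y[:, None] - mu) ** 2).sum(axis=0))

for _ in range(5):                                  # replicate the estimator 5 times
    mu_draws = rng.normal(0.0, np.sqrt(s02), size=500)  # draws from the prior
    print(lik(mu_draws).mean())                     # estimates fluctuate a lot

exact = ((2 * np.pi) ** (-n / 2) * (n * s02 + 1) ** -0.5
         * np.exp(-0.5 * (y**2).sum() + n**2 * y.mean()**2 / (2 * (n + 1 / s02))))
print("exact:", exact)
```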

Methods for calculating marginal likelihoods In what follows let's just consider a single model $M$. We won't show conditioning on $M$ explicitly, so if $M$ has parameter $\theta$ we write $p(\theta)$, $p(y \mid \theta)$ and $p(y)$ for the prior, likelihood and marginal likelihood. Markov chain Monte Carlo methods which sample on the model and parameter space jointly are one common approach to calculating marginal likelihoods (Green, 1995, Biometrika; Carlin and Chib, 1995, JRSSB). These methods are not always easy to apply, even for experts.

Methods for calculating marginal likelihoods An alternative uses methods based on the so-called candidate's formula. Rearranging Bayes rule,
$$p(y) = \frac{p(\theta) p(y \mid \theta)}{p(\theta \mid y)}.$$
This holds for every $\theta$. Suppose $\hat{\theta}$ is some estimate of the mode. If we can estimate $p(\hat{\theta} \mid y)$, then substituting into the candidate's formula gives an estimate of $p(y)$ (since calculating $p(\hat{\theta})$ and $p(y \mid \hat{\theta})$ is usually easy). If we can just estimate the posterior density at a point, we can estimate the marginal likelihood!
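
A hedged sketch of this idea: estimate the posterior density at a point from posterior draws with a kernel density estimate, then apply the candidate's formula. Here exact posterior draws from the conjugate normal-mean model (my own toy setup) stand in for MCMC output.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=30)
n, s02 = len(y), 1.0                                # prior mu ~ N(0, s02)
v = 1.0 / (n + 1.0 / s02)
m = v * y.sum()                                     # exact posterior is N(m, v)

draws = rng.normal(m, np.sqrt(v), size=50_000)      # stands in for MCMC output
theta_hat = draws.mean()                            # a point of high posterior density
post_at_hat = gaussian_kde(draws)(theta_hat)[0]     # estimate of p(theta_hat | y)

log_prior = norm.logpdf(theta_hat, 0.0, np.sqrt(s02))
log_lik = norm.logpdf(y, theta_hat, 1.0).sum()
log_py = log_prior + log_lik - np.log(post_at_hat)  # candidate's formula, log scale
print(log_py)
```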

Methods for calculating marginal likelihoods The Laplace approximation arises from the candidate's formula by choosing $\hat{\theta}$ to be the posterior mode and using a normal approximation to $p(\theta \mid y)$ with mean $\hat{\theta}$ and covariance $H^{-1}$ based on derivatives of the log posterior at the mode, giving
$$p(y) \approx (2\pi)^{p/2} |H|^{-1/2} p(\hat{\theta}) p(y \mid \hat{\theta})$$
where $p$ is the number of parameters. Further simplifying the Laplace approximation to $\log p(y)$ leads to the famous BIC criterion.
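
A sketch of the Laplace approximation for the same toy normal-mean model; because the posterior is exactly Gaussian here, the approximation reproduces the closed-form log marginal likelihood, which is printed as a check.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=30)
n, s02 = len(y), 1.0

def log_joint(mu):
    # log p(mu) + log p(y | mu)
    return (norm.logpdf(mu, 0.0, np.sqrt(s02))
            + norm.logpdf(y, mu, 1.0).sum())

mu_hat = y.sum() / (n + 1.0 / s02)                  # posterior mode (exact here)
H = n + 1.0 / s02                                   # minus second derivative of log posterior
p = 1                                               # number of parameters
log_py = 0.5 * p * np.log(2 * np.pi) - 0.5 * np.log(H) + log_joint(mu_hat)

exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n * s02 + 1)
         - 0.5 * (y**2).sum() + n**2 * y.mean()**2 / (2 * (n + 1 / s02)))
print(log_py, exact)                                # agree in this Gaussian case
```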

Methods for calculating marginal likelihoods There are more sophisticated ways to use the candidate's formula approach (Chib, 1995, JASA; Chib and Jeliazkov, 2001, JASA), and bridge estimation (Meng and Wong, 1996, Statistica Sinica). For estimating $p(y)$ (not the most general setting), suppose we have some density $r(\theta)$ with no unknown normalizing constant, and let $t(\theta)$ be any function of $\theta$ such that
$$0 < \int t(\theta) r(\theta) p(\theta) p(y \mid \theta) \, d\theta < \infty.$$

Methods for calculating marginal likelihoods Then
$$p(y) = \frac{\int p(\theta) p(y \mid \theta) t(\theta) r(\theta) \, d\theta}{\int r(\theta) t(\theta) p(\theta \mid y) \, d\theta}.$$
To see this, just write $p(y) p(\theta \mid y) = p(\theta) p(y \mid \theta)$, multiply both sides by $r(\theta) t(\theta)$, and integrate. The denominator can be estimated from a Monte Carlo sample from $p(\theta \mid y)$ (obtainable using standard Bayesian computational methods), and the numerator can be estimated from a sample from $r(\theta)$.
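
A sketch of this identity with the simple choice $t(\theta) = 1$, so that $p(y)$ is estimated by the ratio of a Monte Carlo average of $p(\theta) p(y \mid \theta)$ under $r$ to a Monte Carlo average of $r(\theta)$ under the posterior. My own assumptions: $r$ is a slightly inflated normal approximation to the posterior, and exact posterior draws from the toy normal-mean model stand in for MCMC output.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=30)
n, s02 = len(y), 1.0
v = 1.0 / (n + 1.0 / s02)                           # exact posterior variance
m = v * y.sum()                                     # exact posterior mean
r_sd = np.sqrt(2.0 * v)                             # r(theta): slightly inflated normal

def joint(mu):
    # p(mu) p(y | mu), vectorized over mu
    return norm.pdf(mu, 0.0, np.sqrt(s02)) * np.exp(
        -0.5 * n * np.log(2 * np.pi) - 0.5 * ((y[:, None] - mu) ** 2).sum(axis=0))

theta_r = rng.normal(m, r_sd, size=20_000)          # sample from r(theta)
theta_post = rng.normal(m, np.sqrt(v), size=20_000) # stands in for MCMC output

num = joint(theta_r).mean()                         # estimates the numerator integral
den = norm.pdf(theta_post, m, r_sd).mean()          # estimates the denominator integral
print(np.log(num / den))                            # estimate of log p(y)
```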

Predictive model selection So far we've used what is sometimes called the "fully probabilistic" approach to model selection. Some of the challenges in implementing this approach have caused some Bayesians to look for alternatives. The difficulties with the conventional approach are:
- One must assign a prior distribution on the model space, and this may be difficult to do when you don't believe in any of the models.
- Model comparison may be sensitive to priors on the model parameters.
- The marginal likelihood is hard to calculate.
- The goals of model improvement may be better served by examining less formal diagnostics that illuminate how the model doesn't fit.
There are of course ways of responding to these criticisms by those who advocate the traditional Bayesian approach, but I won't go into this debate here.

Predictive model selection There are numerous alternatives to the traditional Bayesian approach to model selection. Most are motivated from the point of view of wanting to predict well rather than choosing the "true" model. They often view model selection as a two-step decision problem: first a model is chosen, and then the chosen model is used to make a prediction. The model is chosen to minimize some specified measure of predictive loss.

Predictive model selection Popular predictive approaches:
- Bayesian variants of cross-validation (Geisser and Eddy, 1979, JASA; Bernardo and Smith, 1994).
- DIC (Spiegelhalter et al., 2002, JRSSB).
- Posterior Bayes factors (Aitkin, 1991, JRSSB).
- Less formal approaches based on simulations of replicate data sets from posterior predictive distributions (posterior predictive checks; Gelman, Meng and Stern, 1996, Statistica Sinica).
- Many others.

Conclusion Even if you are not a Bayesian, thinking about the Bayesian way to approach a problem is often very illuminating for understanding all the sources of information available. One must also consider the purpose for which a model is constructed: prediction is a very different goal from attempting to choose the "true" model. Don't forget about background knowledge, sensitivity analysis and diagnostics when model building. References to my own research are on my website: http://www.stat.nus.edu.sg/~standj