CHAPTER 10. Bayesian methods. Geir Storvik


10.1 Introduction

Statistical inference concerns learning from data, either about parameters (estimation) or about some, typically future, variables (prediction). In most introductory courses, we learn that parameters to be estimated are fixed unknown quantities, while variables to be predicted are random variables. The consequence of this is a different treatment of parameters and variables. In particular, for prediction, parameter uncertainty is usually ignored, although more sophisticated methods, such as Bootstrapping (Chapter 18), can be used in order to take uncertainty in the point estimates into account. Probability statements about a hypothesis being true, or about a parameter belonging to a given interval, are therefore not possible to formulate (although confidence intervals often, wrongly, are interpreted that way).

Bayesian methods differ from classical or frequentist statistics in that parameters are treated as random variables rather than fixed quantities. The machinery requires specification of a prior distribution describing our knowledge about the model, typically through model parameters, before data is taken into account. Such information can be obtained from communication with experts within the field of interest or from other datasets that contain information relevant for the problem at hand. The prior distribution is updated with respect to the given data, giving the posterior distribution. The posterior distribution summarizes the combined information from our prior and the data. The Bayesian paradigm states that all relevant information is contained in this posterior distribution, and information about any quantity of interest can be extracted from it. The Bayesian setting therefore legitimates probability statements concerning parameters or hypotheses.

Use of Bayesian methods can be motivated from many different angles. In some situations, incorporation of prior information is the essential aspect. This can be weak information, such as similarity of the prevalence of a disease at nearby sites, or strong information about the parameters of interest obtained from previous studies of a similar nature.

Such information can be important to include in pre-clinical or pilot studies where the main aim is to obtain preliminary knowledge of a new treatment method or a new drug. In this case the Bayesian approach provides a coherent way of combining different information sources, forming the basis for decisions on whether trials should be continued. In other situations, the use is of a more pragmatic nature in that complicated multilevel (or hierarchical) models (Mugglin et al., 2002) in many cases are easier to handle within a Bayesian framework than with classical methods. Some researchers prefer Bayesian methods due to their capability to relate inference statements to the actual observations that are collected, in contrast to classical methods where inference statements relate to what would happen if similar experiments were repeated many times. Yet another motivation is the coherent framework within the Bayesian approach for handling parameter uncertainty in prediction problems.

Example 10.1 As an illustrative example, assume our interest is in the prevalence θ of an infectious disease in a small city. A simple random sample of n = 20 individuals is checked for infection. Defining Y to be the number of infected individuals in the sample, a reasonable model for Y is the binomial(20, θ) distribution. Within classical statistics, a popular 100(1 − α)% confidence interval for the population proportion θ is the Wald interval given by

    ȳ ± z_{α/2} √(ȳ(1 − ȳ)/n),

where ȳ is the fraction of infected people in the sample. This interval turns out to perform poorly, especially for small θ. A better alternative is the Wilson score method, given by the more complex, but still easy to compute, formula

    [ȳ + z_{α/2}²/(2n) ± z_{α/2} √(ȳ(1 − ȳ)/n + z_{α/2}²/(2n)²)] / [1 + z_{α/2}²/n].

Considering the special case with n = 20, Y = 0 and α = 0.05, the Wald interval is (0, 0), clearly not a reasonable result. Wilson's method gives (0.000, 0.161), better reflecting the uncertainty involved. Other alternatives have also been suggested in the literature.

Consider now a Bayesian approach. The first step is to specify a prior for θ. For a situation with no prior information, a possible prior is the uniform distribution. Assume again that a random sample of n = 20 individuals is checked for infection and that Y = 0 are infected. The prior distribution can now be updated to the posterior distribution based on our observations. The left panel of Figure 10.1 shows the prior and posterior densities for θ in this case. Compared to the prior, the posterior density is clearly shifted towards zero, reflecting that small prevalences are most likely when no infected individuals are observed within the random sample.

Figure 10.1: Prior (dashed line) and posterior (solid line) densities of θ in Example 10.1. The left panel is based on a uniform prior on θ. The right panel is based on an informative Beta prior distribution.

An interval containing θ with probability 0.95 can be constructed by specifying endpoints θ_L, θ_U such that Pr(θ ≤ θ_L) = Pr(θ ≥ θ_U) = 0.025. Before data is collected, this gives an interval (0.025, 0.975), which is updated to (0.001, 0.161) when taking the data into account. The latter is almost identical to the Wilson interval. Note, however, the different interpretations of these intervals. The Bayesian interval is a claim about θ directly related to the given data. The classical confidence intervals aim at being precise in the sense that if the collection of 20 individuals were repeated many times and an interval calculated for each sample, 100(1 − α)% of the intervals would on average cover the true parameter value. No statement can be made about the interval obtained from the actual data!

Assume now an alternative situation where previous studies indicate that the prevalence should be around 0.1 and that we are 99% sure that it is less than 0.3. A possible way to describe this through a prior distribution is by using a beta distribution. The beta distributions form a two-parameter class of distributions limited to the interval (0, 1), with the uniform distribution as a special case. A beta distribution with parameters specified through the given prior knowledge is shown in Figure 10.1 (dashed line in the right panel). Again we can update the prior based on our data, giving the solid line in the right panel of Figure 10.1. A Bayesian 95% interval is now given as (0.006, 0.135). The interval is in this case narrower, reflecting that we include more information about θ through an informative prior distribution.
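
The quantities in this example are easy to reproduce. Below is a small R sketch (not taken from the chapter) computing the Wald and Wilson intervals and the Bayesian credibility interval under the uniform prior; the numbers should match those quoted above up to rounding.

    ## Example 10.1: n = 20 individuals, Y = 0 infected, alpha = 0.05
    n <- 20; y <- 0; alpha <- 0.05
    ybar <- y / n
    z <- qnorm(1 - alpha / 2)
    ## Wald interval (degenerates to (0, 0) when y = 0)
    wald <- ybar + c(-1, 1) * z * sqrt(ybar * (1 - ybar) / n)
    ## Wilson score interval
    wilson <- (ybar + z^2 / (2 * n) +
               c(-1, 1) * z * sqrt(ybar * (1 - ybar) / n + z^2 / (2 * n)^2)) /
              (1 + z^2 / n)
    ## Uniform prior + binomial likelihood gives a Beta(y + 1, n - y + 1) posterior
    bayes <- qbeta(c(alpha / 2, 1 - alpha / 2), y + 1, n - y + 1)
    round(rbind(wald, wilson, bayes), 3)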

For a long time, the use of Bayesian methods within medical applications was limited, very much due to the subjectivity involved in the specification of priors, but partly also due to the computational problems involved in routine use of such approaches. Today this has largely changed, with Bayesian methods playing an important role within many areas of medical research (e.g. genetics) and with increasing use in medical research in general. Using prior information when present is more widely accepted, and software for performing Bayesian inference is more readily available. Ashby (2006) gave a recent review of Bayesian statistics in medicine, with the following quote demonstrating the importance of Bayesian methods within medical research:

"It charts the growth of Bayesian statistics as it is applied to medicine and makes predictions for the future. From sparse beginnings, where Bayesian statistics was barely mentioned, Bayesian statistics has now permeated all the major areas of medical statistics, including clinical trials, epidemiology, meta-analyses and evidence synthesis, spatial modelling, longitudinal modelling, survival modelling, molecular genetics and decision-making in respect of new technologies."

A formal presentation of Bayesian methods requires a somewhat higher mathematical background than what will be expected here. Our presentation will instead be based on the ideas that the essential information about unknowns is summarised through probability distributions, and that all quantities of interest can be extracted from such distributions through computer simulation.

10.2 The basics

We will in this section describe the basic ideas behind Bayesian methods. Assume data y, possibly a vector of observations, is available. A standard statistical approach is to describe the stochastic behaviour of the data through a density or distribution specified by some parameters, i.e. p(y | θ) (where the bar notation will be used throughout for "conditional on"). The task of the analysis is to make inference about θ based on y. For simplicity of notation we will in the following assume y to have a continuous sample space, in which case we can concentrate on densities (although we will talk about densities and distributions interchangeably). We will use p(·) generically to denote densities, with the quantities within the parentheses specifying the variables of interest.

The likelihood principle states that all relevant information about θ in the observations is contained in the likelihood function L(θ) = p(y | θ). Classical statistics would go on to construct an estimate of θ through maximization of the likelihood function, giving maximum likelihood (ML) estimates.

Figure 10.2: Luteinizing hormone in blood samples taken at 10-minute intervals from a human female (source: Diggle, 1990).

Example 10.2 Figure 10.2 shows n = 48 samples of luteinizing hormone in blood samples taken at 10-minute intervals from a human female. Because the luteinizing hormone level in blood is always positive, we will make a log-transformation when analysing these data. A simple model on the log-scale is an autoregressive model of order one, AR(1):

    y_i = µ + ρ y_{i−1} + ε_i,   i = 1, ..., n,                              (10.1)

where we for simplicity assume that y_0, the first observation, has a distribution not depending on the unknown parameters. In this model ρ measures the dependence between two subsequent variables, while µ is related to the centre value of the series. We assume ε_1, ..., ε_n are independent normal variables with expectation zero and variance equal to σ², i.e. N(0, σ²). It is common to assume |ρ| < 1, which can be considered as a kind of stability condition. The parameter vector θ = (µ, ρ, σ) is unknown and needs to be estimated.

By defining x_i = y_{i−1}, equation (10.1) is similar to an ordinary regression model, and maximum likelihood estimates of ρ, µ and σ can be found by standard software. For the given data, we get µ̂_ML = 0.364 and ρ̂_ML = 0.575, together with a corresponding estimate of σ. We will use this example, and extensions of it, as a running example to explain many of the concepts within Bayesian analysis. We will refer to the data involved as the lh-data.
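
As a concrete illustration, the conditional ML (least-squares) fit of (10.1) can be obtained by regressing each value on its lag. The R sketch below assumes the data are the lh series shipped with R, which matches the description in the text; the estimates should agree closely with the values quoted above.

    ## Conditional ML (least-squares) fit of the AR(1) model (10.1) to log(lh)
    y <- log(lh)                      # log-scale, as in the text
    n <- length(y)                    # n = 48
    fit <- lm(y[-1] ~ y[-n])          # regress y_i on y_{i-1}
    est <- c(mu    = unname(coef(fit)[1]),
             rho   = unname(coef(fit)[2]),
             sigma = sqrt(mean(residuals(fit)^2)))   # ML-type estimate of sigma
    round(est, 3)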

Given the simplicity of obtaining ML estimates, one might question the necessity of going further to Bayesian methods. Considering more complex versions of this simple model, and other problems than parameter estimation, we will see that Bayesian methods can be beneficial.

In Bayesian analysis we introduce a distribution p(θ) describing our prior belief about θ. Bayes theorem, a basic mathematical result within probability theory, can be used to update our knowledge about θ when observations y arrive. Its mathematical expression is

    p(θ | y) = p(θ) p(y | θ) / p(y),                                          (10.2)

where p(y) is the so-called marginal density of the observations y. It describes the random behaviour of the observations when taking the uncertainty about the parameters θ into account. The conditional density p(y | θ) is what we generally term the likelihood of the data, giving that

    posterior ∝ prior × likelihood,

where we have used the proportionality sign because the term p(y) does not depend on θ. This term can be considered as a normalization term making the posterior a proper probability density. For a large part of Bayesian analysis, we do not need to consider the marginal density, but see Section 10.5.

The Bayesian paradigm states that all information about θ is contained in the posterior distribution p(θ | y). For example, a point estimate of θ_j, the jth component of θ, is the posterior expectation

    E[θ_j | y] = ∫ θ_j p(θ | y) dθ.

A difficulty that for a long time prevented the use of Bayesian statistics is that such posterior distributions can be difficult to handle numerically. This situation has largely changed today through the increased possibility of performing Monte Carlo simulations. Assume θ^1, ..., θ^S are computer-generated samples from the posterior distribution p(θ | y). All relevant information can be drawn from these samples. The posterior expectation above can be well approximated by the Monte Carlo estimate

    Ê[θ_j | y] = (1/S) Σ_{s=1}^S θ_j^s,

where θ_j^s is the jth component of θ^s. Uncertainty about θ_j can be evaluated through the empirical variance of the samples θ_j^1, ..., θ_j^S. Uncertainty intervals can be constructed by finding numbers θ_j^L, θ_j^U such that 100(1 − α)% of the samples fall within the interval [θ_j^L, θ_j^U].

Such intervals are called credibility intervals (CrI) within Bayesian statistics. If a function of θ, h(θ) say, is of interest, similar procedures can be applied, i.e.

    Ê[h(θ) | y] = (1/S) Σ_{s=1}^S h(θ^s).

An important practical issue is how to perform simulation from the posterior. This issue will be discussed in Section 10.6.

Example 10.2 (cont.) In order to apply Bayesian methods to the lh-data, we need to specify a prior for the parameters involved. For mathematical convenience, consider the precision parameter τ = 1/σ² rather than σ itself (a common trick within Bayesian statistics). We will first of all assume τ, µ and ρ are a priori independent. A mathematically convenient distribution for precision parameters is the Gamma distribution, Gamma(a, b), where a and b are constants to be specified. Figure 10.3 (left panel) shows this distribution for different values of (a, b). In each case the mean of the prior distribution (equal to a/b) is 3, while the variances (equal to a/b²) range from 1 to 100. In practice, the type of distribution is of less importance compared to the choice of the parameters involved. For ρ, we will assume a uniform distribution on the interval (−1, 1), reflecting that we have no prior information about ρ beyond the stability condition. For µ we will assume a non-informative distribution, mathematically defined as having a constant density value for all possible µ. Such non-informative priors are frequently used within Bayesian analysis to reflect situations where no prior information about a parameter is available; see the general discussion about this issue below.

For this simple model, analytical expressions for the posterior distribution can be derived. We will however concentrate on the more general approach of computer simulation from the posterior distribution. Figure 10.4 shows histograms of 100,000 samples from the posterior distribution of (τ, µ, ρ). Such histograms show both which values the posterior distributions are concentrated around and the uncertainty involved. For τ and ρ, the prior distributions are superimposed, and the sharper peaks in the posterior distributions show the learning from data about the parameters in question. Note that given samples τ^1, ..., τ^S of τ, we can easily obtain samples of σ through the transformation σ^s = 1/√τ^s for s = 1, ..., S. Summary statistics such as means, variances and credibility intervals can easily be extracted and are given in Table 10.1.
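
For this particular model such posterior samples can be generated with a simple Gibbs sampler (algorithms of this type are discussed in Section 10.6). The following R sketch is only an illustration of how samples like those behind Figure 10.4 and Table 10.1 could be produced under the priors above; it is not the code used by the author, and it conditions on the first observation as in the ML fit earlier.

    ## Gibbs sampler sketch for the AR(1) model with priors:
    ## tau ~ Gamma(a, b), mu flat, rho ~ Uniform(-1, 1).
    set.seed(1)
    y <- log(lh); n <- length(y)
    resp <- y[2:n]; lag <- y[1:(n - 1)]; m <- n - 1
    a <- 0.9; b <- 0.3                      # Gamma prior parameters used in the text
    S <- 10000
    out <- matrix(NA, S, 3, dimnames = list(NULL, c("tau", "mu", "rho")))
    mu <- mean(resp); rho <- 0; tau <- 1    # arbitrary starting values
    for (s in 1:S) {
      ## tau | mu, rho, y  ~  Gamma(a + m/2, b + SS/2)
      SS  <- sum((resp - mu - rho * lag)^2)
      tau <- rgamma(1, a + m / 2, b + SS / 2)
      ## mu | tau, rho, y  ~  Normal (flat prior on mu)
      mu  <- rnorm(1, mean(resp - rho * lag), sqrt(1 / (m * tau)))
      ## rho | tau, mu, y  ~  Normal truncated to (-1, 1) (uniform prior on rho)
      v    <- 1 / (tau * sum(lag^2))
      mrho <- sum(lag * (resp - mu)) * tau * v
      u    <- runif(1, pnorm(-1, mrho, sqrt(v)), pnorm(1, mrho, sqrt(v)))
      rho  <- qnorm(u, mrho, sqrt(v))
      out[s, ] <- c(tau, mu, rho)
    }
    ## Posterior summaries, comparable to Table 10.1 (sigma = 1/sqrt(tau))
    sigma <- 1 / sqrt(out[, "tau"])
    apply(cbind(out, sigma = sigma), 2, quantile, c(0.025, 0.5, 0.975))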

Figure 10.3: On the left, prior densities for τ for (a, b) equal to (9, 3) (solid line), (0.9, 0.3) (dashed line) and (0.09, 0.03) (dotted line). On the right, the corresponding posterior distributions.

Comparing the point estimates to those obtained by maximum likelihood, we see that for µ and ρ they are essentially equal. This result is related to our use of non-informative prior distributions. For τ, the point estimates are somewhat influenced by the prior, giving a shift towards the prior distribution. Recalling that the prior mean of τ was 3, we see that our Bayesian estimate is moved towards the prior mean compared to the maximum likelihood estimate. On the right of Figure 10.3, posterior distributions for τ are given for the different prior distributions shown on the left. We see that for the most informative prior (solid line), the posterior distribution is very much influenced by the prior (the main mass lies midway between the prior mean and the maximum likelihood estimate). For the two other priors, being more vague, the main mass is around the maximum likelihood value, indicating that the posterior distribution is largely determined by the data.

The amount of learning from data depends on the information in the data. Figure 10.5 shows posterior distributions for σ for varying numbers of observations. More data result in sharper peaks in the posterior distributions. Note that for small numbers of observations the posterior will be close to the prior, while for increasing numbers of observations the prior gets less influential.

This example illustrates many aspects which can be transferred to more general cases. When using informative priors, posterior means will be smoothed versions of the prior means and the maximum likelihood estimates. When using non-informative priors, Bayes estimates are typically similar to the ML estimates.

Figure 10.4: Histograms of 100,000 samples from the posterior distribution of τ, µ and ρ using prior parameters (a, b) = (0.9, 0.3). The prior distributions for τ and ρ are superimposed on the histograms (dashed lines).

Table 10.1: Posterior summary statistics of τ, σ, µ and ρ for the analysis of the luteinizing hormone data using prior parameters (a, b) = (0.9, 0.3). SE is standard error (square root of the variance) while CrI is credibility interval. The last column gives the corresponding ML estimates.

    Param.   Mean   SE   95% CrI             ML
    τ                    (12.767, 28.991)
    σ                    (0.186, 0.280)
    µ                    (0.121, 0.613)
    ρ                    (0.291, 0.851)

Non-informative priors are in many cases not proper densities, in that they can have infinite mass. Formally, an improper prior can be seen as the limit of proper priors as the variance increases to infinity. In practice such improper densities cause no problems as long as the posterior distribution becomes proper; see however Section 10.5 for further discussion of this issue. The influence of the prior depends both on the information in the prior and on the amount of data. When the number of observations increases, the influence of the prior will in most cases vanish and the Bayesian estimates will become similar to the maximum likelihood estimates.

Figure 10.5: Posterior distributions for σ based on the first n observations; n = 0 corresponds to the prior distribution. Results are based on a = 0.9, b = 0.3.

10.3 Prediction

Prediction is needed in many applications. For illustration, we will again consider the time-series data in Figure 10.2.

Example 10.2 (cont.) Assume now we want to predict y_{n+k}, the luteinizing hormone level k × 10 minutes after the last measurement. From (10.1),

    y_{n+k} = (1 + ρ + ··· + ρ^{k−1})µ + ρ^k y_n + ε_{n+k} + ρ ε_{n+k−1} + ··· + ρ^{k−1} ε_{n+1},      (10.3)

which implies that

    E[y_{n+k} | y_1, ..., y_n, µ, ρ, σ²] = (1 + ρ + ··· + ρ^{k−1})µ + ρ^k y_n,
    Var[y_{n+k} | y_1, ..., y_n, µ, ρ, σ²] = σ²[1 + ρ² + ··· + ρ^{2(k−1)}].

If µ, ρ and σ were known, the obvious approach would be to use the conditional expectation as the prediction, with the corresponding variance giving a measure of the uncertainty in the prediction. With the parameters unknown, a simple plug-in approach is to insert point estimates into the expressions above. A weakness of this approach is that uncertainty in the parameter estimates is not taken into account.

The Bayesian approach is to include the uncertainty in (µ, ρ, σ) directly in the predictions.

Figure 10.6: Predictions of exp(y_{n+k}) for k = 1, 2, 3, 4. Dashed lines are predictions and 95% prediction intervals for the plug-in rule. Dotted lines are the same quantities for Bayesian prediction using a = 0.9, b = 0.3.

Formally, this means looking at the posterior distribution of the observation y_{n+k} given y = (y_1, ..., y_n), which can be written as a mathematical integral:

    p(y_{n+k} | y) = ∫∫∫ p(y_{n+k} | y, µ, ρ, σ) p(µ, ρ, σ | y) dµ dρ dσ.        (10.4)

Taking our Monte Carlo approach to performing inference, such integrals can be substituted by simulations. In this context it means simulating y_{n+k} in addition to the previous samples of (µ, ρ, σ). Simulation of y_{n+k} can easily be performed using (10.3). Note that several values y_{n+1}, ..., y_{n+k} can be simulated simultaneously; there is therefore no need to run the simulation scheme several times if prediction at many time points is of interest. Further, transforming our predictions to the original scale (remembering that we performed our analysis on the log-scale) can be done by looking at exp(y_{n+k}) instead. Figure 10.6 shows predictions and prediction intervals based on the plug-in rule and on Bayesian prediction for k = 1, 2, 3, 4. We see that by taking the uncertainty in the parameters into account, wider, but more reliable, intervals result.
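
Continuing the illustrative sketches above (and assuming the matrix out of posterior samples produced by the Gibbs sketch in Section 10.2), Bayesian predictions of exp(y_{n+k}) can be obtained by simulating one future path per posterior sample:

    ## Simulate the Bayesian predictive distribution of y_{n+1}, ..., y_{n+4}
    ## using one posterior draw of (mu, rho, sigma) per simulated path.
    K <- 4
    S <- nrow(out)
    pred <- matrix(NA, S, K)
    for (s in 1:S) {
      mu <- out[s, "mu"]; rho <- out[s, "rho"]; sigma <- 1 / sqrt(out[s, "tau"])
      ylast <- y[n]                                      # last (log-scale) observation
      for (k in 1:K) {
        ylast <- mu + rho * ylast + rnorm(1, 0, sigma)   # one step of (10.1)
        pred[s, k] <- exp(ylast)                         # back to the original scale
      }
    }
    ## Bayesian predictions and 95% prediction intervals for k = 1, ..., 4
    apply(pred, 2, quantile, c(0.025, 0.5, 0.975))

The plug-in version of the same calculation would simply fix (µ, ρ, σ) at their point estimates for every simulated path, which is why its intervals come out narrower.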

10.4 Hierarchical models

Latent random variables can be included in models in order to describe dependence or clustering structures in the data. Latent variables (as opposed to observable variables) are variables that are not directly observed but are included as a modelling tool. Such approaches have been given different names within different areas of statistics. Within Bayesian statistics, this is usually called hierarchical modelling. Classical statistics uses the terms random effects or latent variables, while in survival analysis the name frailty model is used. One of the real strengths of Bayesian analysis is the coherent way of handling such hierarchical structures. We will illustrate the main ideas through the time series data of luteinizing hormone.

Example 10.2 (cont.) The luteinizing hormone data are given with a precision of one decimal, indicating that there is some deviation between the actual lh level and the observations. Assume now that ỹ_i is the observed lh level at time i (on the original scale) while x_i is the actual level at time i (on log-scale). Because the observations always will be positive, we will assume a simple model where ỹ_i is equal to exp(x_i) multiplied by some measurement noise and rounded to one decimal. This gives an extension of model (10.1):

    x_i = µ + ρ x_{i−1} + ε_i,          ε_i ∼ N(0, σ²),
    ỹ_i = round_1(exp(x_i + η_i)),      η_i ∼ N(0, 0.01),                     (10.5)

where by round_1(·) we mean rounding to one decimal. The data are still described by the same parameters, but the variability involved is now divided into two parts, one reflecting the variability in the true lh level and the other related to measurement error.

Equations (10.5) describe an example of a hierarchical model with {x_i} being the latent variables, not directly observed. The relation between x_i and ỹ_i could take many different forms, both deterministic and stochastic. The main idea is that the dependence structure involved usually is much easier to specify for the latent structure than directly on the observations. In the example above, the sample space for ỹ_i is a discrete set of values for which it is far from obvious how to specify dependence. The latent process {x_i} is however normal, where dependence structures are directly specified through correlations. In this example the correlation structure is determined through a dynamic equation, but other options can easily be considered in more complex settings.
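
To make the data-generating mechanism in (10.5) concrete, the following R fragment simulates one synthetic series from it. The parameter values are arbitrary illustrative choices, and the measurement standard deviation 0.1 matches the value used in the WinBUGS code of Section 10.6.

    ## Forward simulation from the hierarchical model (10.5); illustrative values only
    set.seed(1)
    n <- 48
    mu <- 0.36; rho <- 0.58; sigma <- 0.2
    x <- numeric(n)
    x[1] <- mu / (1 - rho)                          # start at the stationary mean
    for (i in 2:n) x[i] <- mu + rho * x[i - 1] + rnorm(1, 0, sigma)
    eta <- rnorm(n, 0, 0.1)                         # measurement error on log-scale
    y.tilde <- round(exp(x + eta), 1)               # observed, rounded to one decimal
    head(cbind(x = x, y.tilde = y.tilde))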

With latent variables present, the focus changes to the simultaneous posterior distribution of the pair (θ, x) given the observations. Conceptually this is not different from the situations considered earlier. Still taking the Monte Carlo approach, we now need to simulate the pair (θ, x) conditional on the observed data. Assume {(θ^s, x^s), s = 1, ..., S} are our Monte Carlo samples. If inference about a specific parameter θ_j is the main interest, θ_j^1, ..., θ_j^S can be used to construct a histogram or to extract summary statistics. Similarly, if x_i is of interest, x_i^1, ..., x_i^S are the relevant quantities to explore.

Example 10.2 (cont.) Consider now the extended model (10.5). Our interest will be both in estimation of µ, ρ, σ and in prediction of x_i, i = 1, ..., n. Figure 10.7 shows histograms of σ, µ and ρ based on 10,000 Monte Carlo samples from the posterior distribution. Superimposed on the histograms are posterior densities obtained from the original model (10.1). We see that the mass of σ is somewhat lower, reflecting that the variability in the data now is divided into two parts, one relating to the latent dynamic process and one to the observations. For µ and ρ, there are only minor changes. Simultaneously, we get information about the latent variables x_i (lower panel of Figure 10.7). We see in this case that predictions of the latent process are quite accurate, reflecting that ỹ_i gives quite precise information about x_i.

The use of latent variables as a modelling tool has become very popular recently, partly because of the great flexibility in describing complicated dependence structures and partly because the computational obstacles now can be handled through Monte Carlo simulations. There is however a computational cost in including latent variables. Performing inference based on models like (10.5) is not straightforward. This issue will be discussed in Section 10.6.

10.5 Model selection

In many real applications, uncertainty is also related to the model(s) assumed. For our running example, an alternative model to (10.1) is

    y_i = µ + ρ_1 y_{i−1} + ρ_2 y_{i−2} + ε_i,   i = 1, ..., n,              (10.6)

i.e. an autoregressive model of order two. In classical statistics, selection of a model can proceed by performing a test of the hypothesis H_0: ρ_2 = 0. The pure Bayesian approach to model selection is to include the model as an unknown random quantity and to include a prior on a set of possible models as well.

Figure 10.7: The upper panel shows histograms of σ (left), µ (middle) and ρ (right) based on 10,000 samples from the posterior distribution using a = 0.9 and b = 0.3. Posterior densities for the original model (10.1) are shown as dashed lines. The lower panel shows the posterior mean (solid line) and 95% credibility intervals (dashed lines) for x_i, i = 1, ..., n. Observations (on log-scale) are shown as circles.

Defining M_1, ..., M_m to be this set, our interest now is in the posterior probabilities Pr(M_j | y) of M_j being the true model. Using Bayes theorem, we obtain

    Pr(M_j | y) = Pr(M_j) p(y | M_j) / C,   j = 1, ..., m,                    (10.7)

where C = Σ_{j=1}^m Pr(M_j) p(y | M_j). Here Pr(M_j) is the prior belief in model M_j. If there is no preference of one model over another, we can choose Pr(M_j) = 1/m. In addition to this prior, we also need to consider p(y | M_j), which in Section 10.2 was referred to as the marginal density (in Section 10.2 the conditioning on the model was done implicitly, in that only one model was considered).

Table 10.2: Verbal guideline for interpreting Bayes factors.

    Bayes factor   Evidence against model M_2
    1 to 3         Not worth more than a bare mention
    3 to 20        Positive
    20 to 150      Strong
    > 150          Very strong

When comparing two competing models, M_1 and M_2 say, an alternative is to look at the ratio of their posterior probabilities. Using (10.7), we get

    Pr(M_1 | y) / Pr(M_2 | y) = [Pr(M_1) / Pr(M_2)] × [p(y | M_1) / p(y | M_2)].

In this formulation we are able to divide the ratio of posteriors into one part related to the prior and another part related to the data. Because of its importance in model selection, the last term is given its own name, the Bayes factor. In many cases, only the Bayes factors are considered when performing model selection, which indirectly indicates the use of Pr(M_j) = 1/m. Table 10.2 gives a verbal guideline for interpreting Bayes factors suggested by Kass and Raftery (1995) (a modification of an original suggestion by Jeffreys (1961)).

Mathematically, the marginal density is given by

    p(y | M_j) = ∫ p(θ_j | M_j) p(y | θ_j, M_j) dθ_j,

where θ_j are the parameters involved in model M_j. If latent variables are present, the integral above is extended to include these as well. In the above expression, p(θ_j | M_j) is the prior distribution for the parameters involved in model M_j. The marginal densities quantify the fit between data and model and are the essential quantities for model selection. Note that uncertainties in the parameters and latent variables are taken into account.

There are several problematic issues related to the marginal densities. In contrast to posterior distributions for parameters, they depend heavily on the prior distributions, even for a large number of observations. In particular, all prior distributions need to be proper. If no prior information is available about some parameters, a possible solution is to divide the data into two subsets, y_1 and y_2 say. The set y_2 is used to perform model selection. As prior for θ, one can use the posterior distribution for θ based on y_1, with "prior" now having the meaning "prior to data y_2". Assuming y_1 contains a reasonable amount of data, we obtain a prior that is proper. For more sophisticated strategies, utilizing the full dataset better, see Berger and Pericchi (1996).
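
As a small numerical illustration of (10.7), with made-up marginal likelihood values (the actual values for the lh-data belong in Table 10.3 below), posterior model probabilities and the Bayes factor follow directly:

    ## Hypothetical (made-up) marginal likelihoods for two models, equal prior weights
    marg <- c(M1 = 2.0e-8, M2 = 8.0e-8)        # p(y | M_j), illustrative values only
    prior <- c(0.5, 0.5)
    bayes.factor <- marg["M1"] / marg["M2"]    # evidence for M1 relative to M2
    post <- prior * marg / sum(prior * marg)   # Pr(M_j | y), equation (10.7)
    c(BF = unname(bayes.factor), round(post, 3))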

Table 10.3: Marginal densities, Bayes factor for model M_1 compared to model M_2, and posterior model probabilities for the last 24 lh-observations. Priors are constructed using the first 24 observations.

    (a, b)          (9, 3)   (0.9, 0.3)   (0.09, 0.03)
    p(y | M_1)
    p(y | M_2)
    Bayes factor
    Pr(M_2 | y)

Numerical calculation of marginal densities can be difficult. We will discuss this issue further in Section 10.6.

Example 10.2 (cont.) Assume two alternative models, the AR(1) model (10.1) and the AR(2) model (10.6). In our original formulation, we used an improper prior distribution for µ and a rather wide prior distribution for ρ. In order to obtain proper priors, divide the dataset into two parts, y_1 = (y_1, ..., y_24) and y_2 = (y_25, ..., y_48). Priors for θ for the analysis of y_2 are then specified through the posteriors for θ based on y_1. Note that different priors are obtained through different choices of the parameters (a, b) in the Gamma prior for τ. Table 10.3 shows marginal density values, Bayes factors and posterior model probabilities for the three sets of values of (a, b) considered earlier. The marginal densities change considerably for the different priors. For the most informative prior (a = 9, b = 3), the small marginal densities reflect that the prior does not fit very well with the data. The differences between the two more vague priors are much smaller. The Bayes factors and the posterior probabilities for M_2 are quite stable, though.

The strong dependence on priors and the computational difficulties involved in the calculation of marginal densities and Bayes factors have motivated the use of alternatives. The Bayesian information criterion (BIC) is based on an approximation of the Bayes factor,

    p(y | M_1) / p(y | M_2) ≈ [p(y | θ̂_1, M_1) / p(y | θ̂_2, M_2)] × n^{0.5(k_2 − k_1)},       (10.8)

where θ̂_j is the ML estimator of θ_j under model j, while k_j is the number of parameters in model j. The approximation is based on an asymptotic argument, indicating its dependence on a large number of observations.
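
A sketch of how such a BIC comparison can be carried out in R for the two AR models is given below, using the definition of BIC stated just after this (and conditioning on the first two observations, as in the example that follows). This is not the author's computation, and the exact numbers depend on the conditioning and on the likelihood convention used.

    ## BIC comparison of the AR(1) and AR(2) models, conditioning on y_1, y_2.
    ## BIC_j = 2 log p(y | theta.hat_j, M_j) - k_j log(n); larger is better.
    y <- log(lh); N <- length(y)
    y0 <- y[3:N]                   # responses y_3, ..., y_48
    l1 <- y[2:(N - 1)]             # first lag
    l2 <- y[1:(N - 2)]             # second lag
    m1 <- lm(y0 ~ l1)              # AR(1): k_1 = 3 (mu, rho, sigma)
    m2 <- lm(y0 ~ l1 + l2)         # AR(2): k_2 = 4
    bic <- function(fit, k) 2 * as.numeric(logLik(fit)) - k * log(length(y0))
    c(BIC1 = bic(m1, 3), BIC2 = bic(m2, 4))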

Defining now BIC for model M_j to be

    BIC_j = 2 log p(y | θ̂_j, M_j) − k_j log(n),

we see that 2 times the log of the right-hand side of (10.8) becomes equal to BIC_1 − BIC_2. Model selection can be performed by selecting the model with the largest BIC value. The advantages of using BIC are its simplicity and its (apparent) independence of prior distributions. However, as stated by Kass and Wasserman (1995), the approximation is based on an implicit prior. This prior is quite sensible, though, and does not have to be expressed explicitly.

Example 10.2 (cont.) We compare models M_1 and M_2 based on the observations y_3, ..., y_48, assuming y_1, y_2 are fixed. Evaluating the maximised log-likelihoods log p(y | θ̂_1, M_1) and log p(y | θ̂_2, M_2), and using that k_1 = 3 while k_2 = 4, we obtain BIC_1 = 8.91 and a smaller value for BIC_2. This gives a preference to model 1.

The BIC measure can be seen as a measure of fit with a penalty term included for model complexity. Many alternative such measures have been suggested in the literature, with Akaike's information criterion (AIC; Akaike, 1973) being the most well known. Another criterion popular within Bayesian statistics is DIC (Spiegelhalter et al., 2002).

10.6 Computational aspects

As discussed in Section 10.2, computer-generated (Monte Carlo) samples can be used to explore the posterior distribution. By transforming the samples properly, any quantity of interest can be explored. Using Monte Carlo simulation, a numerical approximation is applied. The errors involved in such approximations will usually be small compared to the uncertainty due to basing our inference on a finite number of observations. Further, this approximation error can be controlled through the number of Monte Carlo samples S that we generate. For some examples (e.g. the Bayesian analysis of the lh-data using model (10.1)), performing Monte Carlo simulations can be easy. In more complex situations, constructing efficient simulation routines is more difficult. Still, Monte Carlo approaches have been quite successfully applied in a wide range of applications.

In cases with latent variables x, we might be interested in the simultaneous posterior p(x, θ | y). For simplicity of notation, assume z contains all the unknowns, both latent variables and parameters (for the hierarchical models discussed in Section 10.4, z = (θ, x)). Assume further that we are interested in some univariate quantity h(z) (which could be a parameter, a latent variable or some more complicated function).

Our interest will then be in the posterior distribution of h(z) given our data y. Assuming z^s, s = 1, ..., S, are samples from the posterior distribution p(z | y), a Monte Carlo estimate of the posterior expectation of h(z), E[h(z) | y], is

    ĥ_MC = (1/S) Σ_{s=1}^S h(z^s).

The law of large numbers states that ĥ_MC → E[h(z) | y] for large S. Further, if the samples are independent, we have

    Var[ĥ_MC | y] = (1/S) C_h^MC,                                              (10.9)

where C_h^MC is a factor depending on the function h and the posterior distribution p(z | y), but not on S. The factor C_h^MC can be estimated by the empirical variance of the h(z^s), and the specification of S can then be made by quantifying the required precision of our estimate. Note that the decrease in variance (the factor 1/S) is independent of the dimension of the problem. This is in sharp contrast to numerical integration techniques and is the main reason for the popularity of Monte Carlo methods within Bayesian analysis.

The dimension of the problem does, however, come into play when considering methods for simulation from the posterior distribution p(z | y). For high-dimensional problems, direct sampling can be difficult. Markov chain Monte Carlo (MCMC) algorithms (Gilks et al., 1996) are a powerful class of methods for performing simulations in high-dimensional spaces. The main idea is to sample {z^s} sequentially such that when s increases, the distribution of the samples gets closer to the target distribution p(z | y). This sequential updating is performed such that the generation of z^s only depends on the previous sample z^{s−1} and not on any earlier samples. This results in a Markov chain, giving the name to this class of algorithms. For MCMC samples, the Monte Carlo estimate is modified to

    ĥ_MCMC = (1/S) Σ_{s=B+1}^{B+S} h(z^s),

i.e. we run the Markov chain B iterations before starting to collect our samples. B is usually denoted the burn-in period and reflects that the first samples can have distributions not close enough to the target distribution.
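
Before turning to dependent MCMC samples, here is a small illustration of (10.9) for independent samples (my own sketch), again using the Beta(1, 21) posterior from Example 10.1, where independent draws are easy to obtain; the Monte Carlo standard error of the estimated posterior mean shrinks at rate 1/√S:

    ## Monte Carlo estimate of E[h(theta) | y] with h(theta) = theta, plus its
    ## standard error, for independent samples from the Beta(1, 21) posterior.
    set.seed(1)
    S <- 10000
    theta <- rbeta(S, 1, 21)
    h.mc <- mean(theta)                   # Monte Carlo estimate
    se.mc <- sd(theta) / sqrt(S)          # estimated sqrt(C_h / S), cf. (10.9)
    c(estimate = h.mc, MC.se = se.mc, exact = 1 / 22)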

Remarkably, under relatively weak conditions, we also have ĥ_MCMC → E[h(z) | y], a result based on general Markov chain theory. Further, the variance will be approximately

    Var[ĥ_MCMC] ≈ (1/S) C_h^MCMC.

The factor C_h^MCMC will in most cases be larger (in some cases much larger) than C_h^MC and will depend on the specific type of algorithm that is chosen. This increase in variance is due to the Monte Carlo samples now being dependent, resulting in an effective sample size smaller than S.

Two subclasses of MCMC algorithms have mainly been applied. The Metropolis-Hastings algorithms are closely related to acceptance-rejection sampling, a sampling technique based on generating a sample from an (in principle) arbitrary proposal distribution, followed by an acceptance step which works as a correction for sampling from the wrong distribution. Assuming a sample z^s is given at iteration s, a proposal z* is generated from a proposal distribution q(z* | z^s). Then

    z^{s+1} = z*    with probability r(z^s, z*),
    z^{s+1} = z^s   with probability 1 − r(z^s, z*),

where

    r(z^s, z*) = min{1, [p(z* | y) q(z^s | z*)] / [p(z^s | y) q(z* | z^s)]}.

The key factor in obtaining an effective algorithm is the choice of the proposal distribution q. Typical choices are small changes in one or a few components of z at a time. Metropolis-Hastings algorithms can be applied in quite general settings and are usually easy to implement. The main requirements are the possibility of evaluating the posterior p(z | y) and that the proposal distribution allows the Markov chain {z^s} to explore all possible values of z. In many Bayesian settings, the posterior distribution can only be evaluated up to a proportionality constant. This is handled easily in the Metropolis-Hastings algorithm due to the occurrence of the posterior both in the numerator and in the denominator of the acceptance probability. Practical implementations do contain many tuning parameters that need to be specified carefully.
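
A minimal random-walk Metropolis sketch (not the author's code) illustrates these points on the one-parameter posterior from Example 10.1, where the posterior only needs to be known up to a proportionality constant and the proposal standard deviation acts as a tuning parameter:

    ## Random-walk Metropolis for the prevalence theta in Example 10.1
    ## (y = 0 of n = 20, uniform prior); log.post is an unnormalised log-posterior.
    set.seed(1)
    n <- 20; y <- 0
    log.post <- function(theta) {
      if (theta <= 0 || theta >= 1) return(-Inf)   # outside the support
      y * log(theta) + (n - y) * log(1 - theta)    # binomial likelihood x uniform prior
    }
    S <- 20000; sd.prop <- 0.05                    # sd.prop is a tuning parameter
    theta <- numeric(S); theta[1] <- 0.5
    for (s in 2:S) {
      prop <- rnorm(1, theta[s - 1], sd.prop)      # symmetric proposal: q cancels in r
      r <- exp(log.post(prop) - log.post(theta[s - 1]))
      theta[s] <- if (runif(1) < r) prop else theta[s - 1]
    }
    quantile(theta[-(1:2000)], c(0.025, 0.975))    # discard burn-in; cf. (0.001, 0.161)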

The Gibbs sampler is another popular class of algorithms, its simplest version being specified by two steps at each iteration:

1. Choose a component j at random.
2. Simulate z_j^{s+1} from the conditional distribution of z_j given z_k = z_k^s, k ≠ j, and put z_k^{s+1} = z_k^s for k ≠ j.

A component in this setting could be a single variable or a set of several components. The Gibbs sampler can formally be seen as a special case of a Metropolis-Hastings algorithm (by using the conditional distribution as a proposal distribution). In practice, applications of Metropolis-Hastings algorithms use quite different proposals and are usually considered as alternatives to the Gibbs sampler. Compared to the Metropolis-Hastings algorithm, the Gibbs sampler requires more effort in the sense that the conditional distributions need to be worked out. On the other hand, no tuning parameters need to be specified, making it more directly applicable when the relevant conditional distributions are available.

Implementation of Markov chain Monte Carlo algorithms can be time-consuming. Both general and application-specific software are however becoming increasingly available, making Bayesian analysis accessible. WinBUGS is the most general Bayesian software package available to date and is freely available on the web. The first version of the program, called BUGS, was an abbreviation of Bayesian inference Using Gibbs Sampling, with WinBUGS being the Windows successor to BUGS (the only version maintained, although there is an OpenBUGS version that is still under development). The original version mainly performed Gibbs sampling, while later versions also allow Metropolis-Hastings steps, extending the application area considerably. The software comes with a user manual, in addition to a wide range of examples. The nice part of using WinBUGS/OpenBUGS is that only the model needs to be specified; the construction of the MCMC algorithm is done automatically.

Model (10.5) is a bit problematic to implement directly due to the rounding of the observations. The model can however be implemented by first reformulating it to

    x_i = µ + ρ x_{i−1} + ε_i,       ε_i ∼ N(0, σ²),
    v_i = x_i + η_i,                 η_i ∼ N(0, 0.01),                        (10.10)
    ỹ_i ∼ Uniform[exp(v_i) − 0.05, exp(v_i) + 0.05].

The reformulation consists of two steps. The first is to include an extra variable v_i representing the (log-scale) observation before rounding. The second is to reformulate the rounding to a uniform distribution. This actually gives another model, but it can be shown that the posterior distribution remains the same. The model can then be specified as shown in Figure 10.8, where logical (or deterministic) links are denoted by <- while stochastic links are denoted by ~. In addition, data and initial values need to be loaded into the system before the actual simulations can be performed. Figure 10.9 shows a trace plot of the simulated values of ρ. This plot shows a typical behaviour in that some initial iterations are needed before the values stabilise.

    model {
      x[1] ~ dnorm(mu, tau)
      for (i in 2:n) {
        mean[i] <- mu + rho * x[i-1]
        x[i] ~ dnorm(mean[i], tau)
      }
      for (i in 1:n) {
        y.tilde[i] ~ dnorm(x[i], 100)      # y.tilde corresponds to v in (10.10)
        L[i] <- exp(y.tilde[i]) - 0.05
        U[i] <- exp(y.tilde[i]) + 0.05
        y[i] ~ dunif(L[i], U[i])           # y is the observed (rounded) series
      }
      mu ~ dflat()
      rho ~ dunif(-1, 1)
      tau ~ dgamma(a, b)
    }

Figure 10.8: WinBUGS code for specifying model (10.10).

Such trace plots can be used for visually specifying the burn-in value B, but more formal methods are available, many of which are implemented in WinBUGS or related software. For this example, convergence is fast, but it can be substantially slower in more complex situations. Several other packages for MCMC simulation in Bayesian settings are available, as are more application-specific programs. Within the R package (R Development Core Team, 2009), a freeware statistical package, many libraries tuned towards specific applications are available.

The above discussion is primarily related to inference within a model. When comparing models (Section 10.5), marginal densities need to be evaluated as well. A marginal density (dropping the conditioning on the model) is given by

    p(y) = ∫ p(y | z) p(z) dz.

This density can be seen as the expectation of the conditional density p(y | z) with respect to the prior p(z). Since simulation from the prior p(z) nearly always is easy, in principle this expectation can be approximated by

    p̂(y) = (1/S) Σ_{s=1}^S p(y | z^s),

where z^1, ..., z^S are samples from p(z). In practice, very few of the samples from p(z) will cover the main support of p(y | z), making the effective sample size very small. Alternative methods are available, but are rather complicated. We refer to Han and Carlin (2001) for a review of some of these methods.
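
For intuition, here is a tiny illustration of this estimator (my own sketch, not from the chapter) in a setting where the answer is known: with a uniform prior on θ in Example 10.1, p(y) = 1/(n + 1) for every y. In this one-dimensional case the prior-sampling estimate works well; the degeneracy described above appears as the dimension of z grows.

    ## Prior-sampling estimate of the marginal density p(y) in Example 10.1,
    ## where the exact answer is 1/(n + 1) under a uniform prior.
    set.seed(1)
    n <- 20; y <- 0; S <- 100000
    theta <- runif(S)                      # samples from the prior p(theta)
    p.hat <- mean(dbinom(y, n, theta))     # (1/S) * sum over s of p(y | theta^s)
    c(estimate = p.hat, exact = 1 / (n + 1))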

Figure 10.9: Trace plot of simulated values of ρ based on Markov chain Monte Carlo simulations.

10.7 Summary/discussion

Given the similarities with maximum likelihood estimates under non-informative priors, one might question the need for Bayesian analysis. There are several arguments that can be used to answer this.

When prior information is available, it will influence the results. For the running example one might have information from other patients, or data collected earlier, that can give information about both ρ and τ.

Classical statistical approaches differ from Bayesian ones in how parameter uncertainty is taken into account, for example in prediction. The Bayesian approach offers a coherent way of doing this. Within classical statistics, various methods are in use. The simplest, and perhaps most widely used, approach is the plug-in method. Such a method neglects the uncertainty in the parameter estimates and thereby underestimates the uncertainty in the predictions. More advanced methods such as Bootstrapping are more accurate but are not always that easy to apply.

Apart from the simplest models, inference within classical statistics is based on large-sample approximations, while Bayesian methods are exact in the sense that, assuming the model assumptions are valid, the posterior distribution does give the right answer. The need for numerical approximations violates this exactness, but such errors are usually of a smaller scale than the variability due to data. On the other hand, large-sample approximations are usually quite robust to model assumptions. This is also true in Bayesian settings, but with small samples the results will typically rely heavily on the model assumptions.

Maximum likelihood methods involve optimization; Bayesian approaches involve integration. Advances in statistical computing, and in particular Monte Carlo methods, have for many problems made the computational challenges easier to handle within the Bayesian framework than in classical settings. Pragmatic considerations have in such cases largely given preference to the use of Bayesian methods.

10.8 Further reading

For general introductions to Bayesian methods, we refer to Gelman et al. (2004) or Carlin and Louis (2008). Many applications of Bayesian methods within health research are given in Berry and Stangl (1996). Ashby (2006) contains a huge list of references to Bayesian statistics within medicine. Breslow (1990) discusses Bayesian methods in contrast to frequentist (classical) statistical approaches.

Bibliography

Akaike, H. (1973). Information theory and an extension of the likelihood ratio principle. In Second International Symposium on Information Theory.

Ashby, D. (2006). Bayesian statistics in medicine: a 25 year review. Statistics in Medicine 25(21).

Berger, J. and L. Pericchi (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association 91(433).

Berry, D. and D. Stangl (1996). Bayesian Biostatistics. CRC Press.

Breslow, N. (1990). Biostatistics and Bayes. Statistical Science 5(3).

Carlin, B. and T. Louis (2008). Bayesian Methods for Data Analysis (Third ed.). CRC Press.

Diggle, P. (1990). Time Series: A Biostatistical Introduction. Oxford University Press.

Gelman, A., J. Carlin, H. Stern, and D. Rubin (2004). Bayesian Data Analysis. Chapman and Hall/CRC.

Gilks, W., S. Richardson, and D. J. Spiegelhalter (1996). Markov Chain Monte Carlo in Practice. London: Chapman & Hall.

Han, C. and B. Carlin (2001). Markov chain Monte Carlo methods for computing Bayes factors: a comparative review. Journal of the American Statistical Association 96(455).

Jeffreys, H. (1961). Theory of Probability. Oxford: Clarendon Press.

Kass, R. and A. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90(430).

Kass, R. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 90(431).

Mugglin, A., N. Cressie, and I. Gemmell (2002). Hierarchical statistical modelling of influenza epidemic dynamics in space and time. Statistics in Medicine 21(18).

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Spiegelhalter, D., N. Best, B. Carlin, and A. van der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 64(4).


More information

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling 2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling Jon Wakefield Departments of Statistics and Biostatistics University of Washington Outline Introduction and Motivating

More information

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation. PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.. Beta Distribution We ll start by learning about the Beta distribution, since we end up using

More information

ST 740: Model Selection

ST 740: Model Selection ST 740: Model Selection Alyson Wilson Department of Statistics North Carolina State University November 25, 2013 A. Wilson (NCSU Statistics) Model Selection November 25, 2013 1 / 29 Formal Bayesian Model

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

1 Hypothesis Testing and Model Selection

1 Hypothesis Testing and Model Selection A Short Course on Bayesian Inference (based on An Introduction to Bayesian Analysis: Theory and Methods by Ghosh, Delampady and Samanta) Module 6: From Chapter 6 of GDS 1 Hypothesis Testing and Model Selection

More information

Part III. A Decision-Theoretic Approach and Bayesian testing

Part III. A Decision-Theoretic Approach and Bayesian testing Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to

More information

An Extended BIC for Model Selection

An Extended BIC for Model Selection An Extended BIC for Model Selection at the JSM meeting 2007 - Salt Lake City Surajit Ray Boston University (Dept of Mathematics and Statistics) Joint work with James Berger, Duke University; Susie Bayarri,

More information

Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation

Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation So far we have discussed types of spatial data, some basic modeling frameworks and exploratory techniques. We have not discussed

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

The Metropolis-Hastings Algorithm. June 8, 2012

The Metropolis-Hastings Algorithm. June 8, 2012 The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Basic of Probability Theory for Ph.D. students in Education, Social Sciences and Business (Shing On LEUNG and Hui Ping WU) (May 2015)

Basic of Probability Theory for Ph.D. students in Education, Social Sciences and Business (Shing On LEUNG and Hui Ping WU) (May 2015) Basic of Probability Theory for Ph.D. students in Education, Social Sciences and Business (Shing On LEUNG and Hui Ping WU) (May 2015) This is a series of 3 talks respectively on: A. Probability Theory

More information

Bayesian Inference: Concept and Practice

Bayesian Inference: Concept and Practice Inference: Concept and Practice fundamentals Johan A. Elkink School of Politics & International Relations University College Dublin 5 June 2017 1 2 3 Bayes theorem In order to estimate the parameters of

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

David Giles Bayesian Econometrics

David Giles Bayesian Econometrics David Giles Bayesian Econometrics 1. General Background 2. Constructing Prior Distributions 3. Properties of Bayes Estimators and Tests 4. Bayesian Analysis of the Multiple Regression Model 5. Bayesian

More information

Weakness of Beta priors (or conjugate priors in general) They can only represent a limited range of prior beliefs. For example... There are no bimodal beta distributions (except when the modes are at 0

More information

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017 Chalmers April 6, 2017 Bayesian philosophy Bayesian philosophy Bayesian statistics versus classical statistics: War or co-existence? Classical statistics: Models have variables and parameters; these are

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

Introduction to Bayesian Methods

Introduction to Bayesian Methods Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

The STS Surgeon Composite Technical Appendix

The STS Surgeon Composite Technical Appendix The STS Surgeon Composite Technical Appendix Overview Surgeon-specific risk-adjusted operative operative mortality and major complication rates were estimated using a bivariate random-effects logistic

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

Sequential Monitoring of Clinical Trials Session 4 - Bayesian Evaluation of Group Sequential Designs

Sequential Monitoring of Clinical Trials Session 4 - Bayesian Evaluation of Group Sequential Designs Sequential Monitoring of Clinical Trials Session 4 - Bayesian Evaluation of Group Sequential Designs Presented August 8-10, 2012 Daniel L. Gillen Department of Statistics University of California, Irvine

More information

Heriot-Watt University

Heriot-Watt University Heriot-Watt University Heriot-Watt University Research Gateway Prediction of settlement delay in critical illness insurance claims by using the generalized beta of the second kind distribution Dodd, Erengul;

More information

Bayesian Inference. Introduction

Bayesian Inference. Introduction Bayesian Inference Introduction The frequentist approach to inference holds that probabilities are intrinsicially tied (unsurprisingly) to frequencies. This interpretation is actually quite natural. What,

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

MCMC for Cut Models or Chasing a Moving Target with MCMC

MCMC for Cut Models or Chasing a Moving Target with MCMC MCMC for Cut Models or Chasing a Moving Target with MCMC Martyn Plummer International Agency for Research on Cancer MCMSki Chamonix, 6 Jan 2014 Cut models What do we want to do? 1. Generate some random

More information

Inference for a Population Proportion

Inference for a Population Proportion Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist

More information

Bayesian inference: what it means and why we care

Bayesian inference: what it means and why we care Bayesian inference: what it means and why we care Robin J. Ryder Centre de Recherche en Mathématiques de la Décision Université Paris-Dauphine 6 November 2017 Mathematical Coffees Robin Ryder (Dauphine)

More information

Model comparison: Deviance-based approaches

Model comparison: Deviance-based approaches Model comparison: Deviance-based approaches Patrick Breheny February 19 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/23 Model comparison Thus far, we have looked at residuals in a fairly

More information

BAYESIAN MODEL CRITICISM

BAYESIAN MODEL CRITICISM Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes

More information

Bayesian Phylogenetics:

Bayesian Phylogenetics: Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes

More information

INTRODUCTION TO BAYESIAN ANALYSIS

INTRODUCTION TO BAYESIAN ANALYSIS INTRODUCTION TO BAYESIAN ANALYSIS Arto Luoma University of Tampere, Finland Autumn 2014 Introduction to Bayesian analysis, autumn 2013 University of Tampere 1 / 130 Who was Thomas Bayes? Thomas Bayes (1701-1761)

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

Bayesian Analysis of RR Lyrae Distances and Kinematics

Bayesian Analysis of RR Lyrae Distances and Kinematics Bayesian Analysis of RR Lyrae Distances and Kinematics William H. Jefferys, Thomas R. Jefferys and Thomas G. Barnes University of Texas at Austin, USA Thanks to: Jim Berger, Peter Müller, Charles Friedman

More information

Seminar über Statistik FS2008: Model Selection

Seminar über Statistik FS2008: Model Selection Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can

More information

Bayesian Meta-analysis with Hierarchical Modeling Brian P. Hobbs 1

Bayesian Meta-analysis with Hierarchical Modeling Brian P. Hobbs 1 Bayesian Meta-analysis with Hierarchical Modeling Brian P. Hobbs 1 Division of Biostatistics, School of Public Health, University of Minnesota, Mayo Mail Code 303, Minneapolis, Minnesota 55455 0392, U.S.A.

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

an introduction to bayesian inference

an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

More information

Bayesian data analysis in practice: Three simple examples

Bayesian data analysis in practice: Three simple examples Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to

More information

Bayesian model selection for computer model validation via mixture model estimation

Bayesian model selection for computer model validation via mixture model estimation Bayesian model selection for computer model validation via mixture model estimation Kaniav Kamary ATER, CNAM Joint work with É. Parent, P. Barbillon, M. Keller and N. Bousquet Outline Computer model validation

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Chapter 5. Bayesian Statistics

Chapter 5. Bayesian Statistics Chapter 5. Bayesian Statistics Principles of Bayesian Statistics Anything unknown is given a probability distribution, representing degrees of belief [subjective probability]. Degrees of belief [subjective

More information

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling Due: Tuesday, May 10, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to the questions below, including

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information

Bayesian Estimation An Informal Introduction

Bayesian Estimation An Informal Introduction Mary Parker, Bayesian Estimation An Informal Introduction page 1 of 8 Bayesian Estimation An Informal Introduction Example: I take a coin out of my pocket and I want to estimate the probability of heads

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Bayes Factors for Grouped Data

Bayes Factors for Grouped Data Bayes Factors for Grouped Data Lizanne Raubenheimer and Abrie J. van der Merwe 2 Department of Statistics, Rhodes University, Grahamstown, South Africa, L.Raubenheimer@ru.ac.za 2 Department of Mathematical

More information

MCMC 2: Lecture 2 Coding and output. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham

MCMC 2: Lecture 2 Coding and output. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham MCMC 2: Lecture 2 Coding and output Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. General (Markov) epidemic model 2. Non-Markov epidemic model 3. Debugging

More information

Bayesian Methods in Multilevel Regression

Bayesian Methods in Multilevel Regression Bayesian Methods in Multilevel Regression Joop Hox MuLOG, 15 september 2000 mcmc What is Statistics?! Statistics is about uncertainty To err is human, to forgive divine, but to include errors in your design

More information

A note on Reversible Jump Markov Chain Monte Carlo

A note on Reversible Jump Markov Chain Monte Carlo A note on Reversible Jump Markov Chain Monte Carlo Hedibert Freitas Lopes Graduate School of Business The University of Chicago 5807 South Woodlawn Avenue Chicago, Illinois 60637 February, 1st 2006 1 Introduction

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information