CHAPTER 10. Bayesian methods. Geir Storvik


10.1 Introduction

Statistical inference concerns learning from data, either about parameters (estimation) or about some, typically future, variables (prediction). In most introductory courses, we learn that parameters to be estimated are fixed unknown quantities, while variables to be predicted are random variables. The consequence of this is a different treatment of parameters and variables. In particular, for prediction, parameter uncertainty is usually ignored, although more sophisticated methods, such as Bootstrapping (Chapter 18), can be used in order to take uncertainty in the point estimates into account. Probability statements about a hypothesis being true, or about a parameter belonging to a given interval, are therefore not possible to formulate (although confidence intervals often, wrongly, are interpreted that way).

Bayesian methods differ from classical or frequentist statistics in that parameters are treated as random variables rather than fixed quantities. The machinery requires specification of a prior distribution describing our knowledge about the model, typically through model parameters, before data is taken into account. Such information can be obtained from communication with experts within the field of interest or from other datasets that contain information relevant for the problem at hand. The prior distribution is updated with respect to the given data, giving the posterior distribution. The posterior distribution summarizes the combined information from our prior and the data. The Bayesian paradigm states that all relevant information is contained in this posterior distribution, and information about any quantity of interest can be extracted from it. The Bayesian setting therefore legitimates probability statements concerning parameters or hypotheses.

Use of Bayesian methods can be motivated from many different angles. In some situations, incorporation of prior information is the essential aspect. This can be weak information, such as similarity of the prevalence of a disease at nearby sites, or strong information about the parameters of interest obtained from previous studies of a similar nature.

Such information can be important to include in pre-clinical or pilot studies where the main aim is to obtain preliminary knowledge of a new treatment method or a new drug. In this case the Bayesian approach provides a coherent way of combining different information sources, forming the basis for decisions on whether trials should be continued. In other situations, the use is of a more pragmatic nature in that complicated multilevel (or hierarchical) models (Mugglin et al., 2002) in many cases are easier to handle within a Bayesian framework than with classical methods. Some researchers prefer Bayesian methods due to their capability to relate inference statements to the actual observations that are collected, in contrast to classical methods where inference statements relate to what would happen if similar experiments were repeated many times. Yet another motivation is the coherent framework within the Bayesian approach for handling parameter uncertainty in prediction problems.

Example 10.1 As an illustrative example, assume our interest is in the prevalence θ of an infectious disease in a small city. A simple random sample of n = 20 individuals is checked for infection. Defining Y to be the number of infected individuals in the sample, a reasonable model for Y is the binomial(20, θ) distribution. Within classical statistics, a popular 100(1 − α)% confidence interval for the population proportion θ is the Wald interval given by

    ȳ ± z_{α/2} √(ȳ(1 − ȳ)/n),

where ȳ is the fraction of infected people in the sample. This interval turns out to perform poorly, especially for small θ. A better alternative is the Wilson score method, given by the more complex, but still easy to compute, formula

    [ȳ + z_{α/2}²/(2n) ± z_{α/2} √(ȳ(1 − ȳ)/n + z_{α/2}²/(2n)²)] / [1 + z_{α/2}²/n].

Considering the special case with n = 20, Y = 0 and α = 0.05, the Wald interval is (0, 0), clearly not a reasonable result. Wilson's method gives (0.000, 0.161), better reflecting the uncertainty involved. Other alternatives have also been suggested in the literature.

Consider now a Bayesian approach. The first step is to specify a prior for θ. For a situation with no prior information, a possible prior is the uniform distribution. Assume again that a random sample of n = 20 individuals is checked for infection and that Y = 0 are infected. The prior distribution can now be updated to the posterior distribution based on our observations. The left panel of Figure 10.1 shows the prior and posterior densities for θ in this case. Compared to the prior, the posterior density is clearly shifted towards zero, reflecting that small prevalences are most likely when no infected individuals are observed within the random sample.

Figure 10.1: Prior (dashed line) and posterior (solid line) densities of θ in Example 10.1. The left panel is based on a uniform prior on θ. The right panel is based on an informative Beta prior distribution.

An interval containing θ with probability 0.95 can be constructed by specifying endpoints θ_L, θ_U such that Pr(θ ≤ θ_L) = Pr(θ ≥ θ_U) = 0.025. Before data is collected, this gives an interval (0.025, 0.975), which is updated to (0.001, 0.161) when taking the data into account. The latter is almost identical to the Wilson interval. Note, however, the different interpretations of these intervals. The Bayesian interval is a claim about θ directly related to the given data. The classical confidence intervals aim at being precise in the sense that if the collection of 20 individuals were repeated many times and an interval calculated for each sample, 100(1 − α)% of the intervals would on average cover the true parameter value. No statement can be made about the interval obtained from the actual data!

Assume now an alternative situation where previous studies indicate that the prevalence should be around 0.1 and that we are 99% sure that it is less than 0.3. A possible way to describe this through a prior distribution is by using a beta distribution. The beta distributions form a two-parameter class of distributions limited to the interval (0, 1), with the uniform distribution as a special case. A beta distribution with parameters specified through the given prior knowledge is shown in Figure 10.1 (dashed line in the right panel). Again we can update the prior based on our data, giving the solid line in the right panel of Figure 10.1. A Bayesian 95% interval is now given as (0.006, 0.135). The interval is in this case narrower, reflecting that we include more information about θ through an informative prior distribution.
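
The quantities in this example are easy to reproduce. Below is a small R sketch (not taken from the chapter) computing the Wald and Wilson intervals and the Bayesian credibility interval under the uniform prior; the numbers should match those quoted above up to rounding.

    ## Example 10.1: n = 20 individuals, Y = 0 infected, alpha = 0.05
    n <- 20; y <- 0; alpha <- 0.05
    ybar <- y / n
    z <- qnorm(1 - alpha / 2)
    ## Wald interval (degenerates to (0, 0) when y = 0)
    wald <- ybar + c(-1, 1) * z * sqrt(ybar * (1 - ybar) / n)
    ## Wilson score interval
    wilson <- (ybar + z^2 / (2 * n) +
               c(-1, 1) * z * sqrt(ybar * (1 - ybar) / n + z^2 / (2 * n)^2)) /
              (1 + z^2 / n)
    ## Uniform prior + binomial likelihood gives a Beta(y + 1, n - y + 1) posterior
    bayes <- qbeta(c(alpha / 2, 1 - alpha / 2), y + 1, n - y + 1)
    round(rbind(wald, wilson, bayes), 3)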

For a long time, the use of Bayesian methods within medical applications was limited, very much due to the subjectivity involved in the specification of priors, but partly also due to the computational problems involved in routine use of such approaches. Today this has largely changed, with Bayesian methods playing an important role within many areas of medical research (e.g. genetics) and with increasing use in medical research in general. Using prior information when present is more widely accepted, and software for performing Bayesian inference is more readily available. Ashby (2006) gave a recent review of Bayesian statistics in medicine, with the following quote demonstrating the importance of Bayesian methods within medical research:

"It charts the growth of Bayesian statistics as it is applied to medicine and makes predictions for the future. From sparse beginnings, where Bayesian statistics was barely mentioned, Bayesian statistics has now permeated all the major areas of medical statistics, including clinical trials, epidemiology, meta-analyses and evidence synthesis, spatial modelling, longitudinal modelling, survival modelling, molecular genetics and decision-making in respect of new technologies."

A formal presentation of Bayesian methods requires a somewhat higher mathematical background than what will be expected here. Our presentation will instead be based on the ideas that the essential information about unknowns is summarised through probability distributions, and that all quantities of interest can be extracted from such distributions through computer simulation.

10.2 The basics

We will in this section describe the basic ideas behind Bayesian methods. Assume data y, possibly a vector of observations, is available. A standard statistical approach is to describe the stochastic behaviour of the data through a density or distribution specified by some parameters, i.e. p(y | θ) (where the bar notation will be used throughout for "conditional on"). The task of the analysis is to make inference about θ based on y. For simplicity of notation we will in the following assume y to have a continuous sample space, in which case we can concentrate on densities (although we will talk about densities and distributions interchangeably). We will use p(·) generically to denote densities, with the quantities within the parentheses specifying the variables of interest.

The likelihood principle states that all relevant information about θ in the observations is contained in the likelihood function L(θ) = p(y | θ). Classical statistics would go on to construct an estimate of θ through maximization of the likelihood function, giving maximum likelihood (ML) estimates.

Figure 10.2: Luteinizing hormone in blood samples taken at 10-minute intervals from a human female (source: Diggle, 1990).

Example 10.2 Figure 10.2 shows n = 48 samples of luteinizing hormone in blood samples taken at 10-minute intervals from a human female. Because the luteinizing hormone level in blood is always positive, we will make a log-transformation when analysing these data. A simple model on the log-scale is an autoregressive model of order one, AR(1):

    y_i = µ + ρ y_{i−1} + ε_i,   i = 1, ..., n,                              (10.1)

where we for simplicity assume that y_0, the first observation, has a distribution not depending on the unknown parameters. In this model ρ measures the dependence between two subsequent variables, while µ is related to the centre value of the series. We assume ε_1, ..., ε_n are independent normal variables with expectation zero and variance equal to σ², i.e. N(0, σ²). It is common to assume |ρ| < 1, which can be considered as a kind of stability condition. The parameter vector θ = (µ, ρ, σ) is unknown and needs to be estimated.

By defining x_i = y_{i−1}, equation (10.1) is similar to an ordinary regression model, and maximum likelihood estimates of ρ, µ and σ can be found by standard software. For the given data, we get µ̂_ML = 0.364 and ρ̂_ML = 0.575, together with a corresponding estimate of σ. We will use this example, and extensions of it, as a running example to explain many of the concepts within Bayesian analysis. We will refer to the data involved as the lh-data.
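
As a concrete illustration, the conditional ML (least-squares) fit of (10.1) can be obtained by regressing each value on its lag. The R sketch below assumes the data are the lh series shipped with R, which matches the description in the text; the estimates should agree closely with the values quoted above.

    ## Conditional ML (least-squares) fit of the AR(1) model (10.1) to log(lh)
    y <- log(lh)                      # log-scale, as in the text
    n <- length(y)                    # n = 48
    fit <- lm(y[-1] ~ y[-n])          # regress y_i on y_{i-1}
    est <- c(mu    = unname(coef(fit)[1]),
             rho   = unname(coef(fit)[2]),
             sigma = sqrt(mean(residuals(fit)^2)))   # ML-type estimate of sigma
    round(est, 3)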

Given the simplicity of obtaining ML estimates, one might question the necessity of going further to Bayesian methods. Considering more complex versions of this simple model, and other problems than parameter estimation, we will see that Bayesian methods can be beneficial.

In Bayesian analysis we introduce a distribution p(θ) describing our prior belief about θ. Bayes theorem, a basic mathematical result within probability theory, can be used to update our knowledge about θ when observations y arrive. Its mathematical expression is

    p(θ | y) = p(θ) p(y | θ) / p(y),                                          (10.2)

where p(y) is the so-called marginal density of the observations y. It describes the random behaviour of the observations when taking the uncertainty about the parameters θ into account. The conditional density p(y | θ) is what we generally term the likelihood of the data, giving that

    posterior ∝ prior × likelihood,

where we have used the proportionality sign because the term p(y) does not depend on θ. This term can be considered as a normalization term making the posterior a proper probability density. For a large part of Bayesian analysis, we do not need to consider the marginal density, but see Section 10.5.

The Bayesian paradigm states that all information about θ is contained in the posterior distribution p(θ | y). For example, a point estimate of θ_j, the jth component of θ, is the posterior expectation

    E[θ_j | y] = ∫ θ_j p(θ | y) dθ.

A difficulty that for a long time prevented the use of Bayesian statistics is that such posterior distributions can be difficult to handle numerically. This situation has largely changed today through the increased possibility of performing Monte Carlo simulations. Assume θ^1, ..., θ^S are computer-generated samples from the posterior distribution p(θ | y). All relevant information can be drawn from these samples. The posterior expectation above can be well approximated by the Monte Carlo estimate

    Ê[θ_j | y] = (1/S) Σ_{s=1}^S θ_j^s,

where θ_j^s is the jth component of θ^s. Uncertainty about θ_j can be evaluated through the empirical variance of the samples θ_j^1, ..., θ_j^S. Uncertainty intervals can be constructed by finding numbers θ_j^L, θ_j^U such that 100(1 − α)% of the samples fall within the interval [θ_j^L, θ_j^U].

Such intervals are called credibility intervals (CrI) within Bayesian statistics. If a function of θ, h(θ) say, is of interest, similar procedures can be applied, i.e.

    Ê[h(θ) | y] = (1/S) Σ_{s=1}^S h(θ^s).

An important practical issue is how to perform simulation from the posterior. This issue will be discussed in Section 10.6.

Example 10.2 (cont.) In order to apply Bayesian methods to the lh-data, we need to specify a prior for the parameters involved. For mathematical convenience, consider the precision parameter τ = 1/σ² rather than σ itself (a common trick within Bayesian statistics). We will first of all assume τ, µ and ρ are a priori independent. A mathematically convenient distribution for precision parameters is the Gamma distribution, Gamma(a, b), where a and b are constants to be specified. Figure 10.3 (left panel) shows this distribution for different values of (a, b). In each case the mean of the prior distribution (equal to a/b) is 3, while the variances (equal to a/b²) range from 1 to 100. In practice, the type of distribution is of less importance compared to the choice of the parameters involved. For ρ, we will assume a uniform distribution on the interval (−1, 1), reflecting that we have no prior information about ρ beyond the stability condition. For µ we will assume a non-informative distribution, mathematically defined as having a constant density value for all possible µ. Such non-informative priors are frequently used within Bayesian analysis to reflect situations where no prior information about a parameter is available; see the general discussion about this issue below.

For this simple model, analytical expressions for the posterior distribution can be derived. We will however concentrate on the more general approach of computer simulation from the posterior distribution. Figure 10.4 shows histograms of 100,000 samples from the posterior distribution of (τ, µ, ρ). Such histograms show both which values the posterior distributions are concentrated around and the uncertainty involved. For τ and ρ, the prior distributions are superimposed, and the sharper peaks in the posterior distributions show the learning from data about the parameters in question. Note that given samples τ^1, ..., τ^S of τ, we can easily obtain samples of σ through the transformation σ^s = 1/√τ^s for s = 1, ..., S. Summary statistics such as means, variances and credibility intervals can easily be extracted and are given in Table 10.1.
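
For this particular model such posterior samples can be generated with a simple Gibbs sampler (algorithms of this type are discussed in Section 10.6). The following R sketch is only an illustration of how samples like those behind Figure 10.4 and Table 10.1 could be produced under the priors above; it is not the code used by the author, and it conditions on the first observation as in the ML fit earlier.

    ## Gibbs sampler sketch for the AR(1) model with priors:
    ## tau ~ Gamma(a, b), mu flat, rho ~ Uniform(-1, 1).
    set.seed(1)
    y <- log(lh); n <- length(y)
    resp <- y[2:n]; lag <- y[1:(n - 1)]; m <- n - 1
    a <- 0.9; b <- 0.3                      # Gamma prior parameters used in the text
    S <- 10000
    out <- matrix(NA, S, 3, dimnames = list(NULL, c("tau", "mu", "rho")))
    mu <- mean(resp); rho <- 0; tau <- 1    # arbitrary starting values
    for (s in 1:S) {
      ## tau | mu, rho, y  ~  Gamma(a + m/2, b + SS/2)
      SS  <- sum((resp - mu - rho * lag)^2)
      tau <- rgamma(1, a + m / 2, b + SS / 2)
      ## mu | tau, rho, y  ~  Normal (flat prior on mu)
      mu  <- rnorm(1, mean(resp - rho * lag), sqrt(1 / (m * tau)))
      ## rho | tau, mu, y  ~  Normal truncated to (-1, 1) (uniform prior on rho)
      v    <- 1 / (tau * sum(lag^2))
      mrho <- sum(lag * (resp - mu)) * tau * v
      u    <- runif(1, pnorm(-1, mrho, sqrt(v)), pnorm(1, mrho, sqrt(v)))
      rho  <- qnorm(u, mrho, sqrt(v))
      out[s, ] <- c(tau, mu, rho)
    }
    ## Posterior summaries, comparable to Table 10.1 (sigma = 1/sqrt(tau))
    sigma <- 1 / sqrt(out[, "tau"])
    apply(cbind(out, sigma = sigma), 2, quantile, c(0.025, 0.5, 0.975))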

Figure 10.3: On the left, prior densities for τ for (a, b) equal to (9, 3) (solid line), (0.9, 0.3) (dashed line) and (0.09, 0.03) (dotted line). On the right, the corresponding posterior distributions.

Comparing the point estimates to those obtained by maximum likelihood, we see that for µ and ρ they are essentially equal. This result is related to our use of non-informative prior distributions. For τ, the point estimates are somewhat influenced by the prior, giving a shift towards the prior distribution. Recalling that the prior mean of τ was 3, we see that our Bayesian estimate is moved towards the prior mean compared to the maximum likelihood estimate. On the right of Figure 10.3, posterior distributions for τ are given for the different prior distributions shown on the left. We see that for the most informative prior (solid line), the posterior distribution is very much influenced by the prior (the main mass lies midway between the prior mean and the maximum likelihood estimate). For the two other priors, being more vague, the main mass is around the maximum likelihood value, indicating that the posterior distribution is largely determined by the data.

The amount of learning from data depends on the information in the data. Figure 10.5 shows posterior distributions for σ for varying numbers of observations. More data result in sharper peaks in the posterior distributions. Note that for small numbers of observations the posterior will be close to the prior, while for increasing numbers of observations the prior gets less influential.

This example illustrates many aspects which can be transferred to more general cases. When using informative priors, posterior means will be smoothed versions of the prior means and the maximum likelihood estimates. When using non-informative priors, Bayes estimates are typically similar to the ML estimates.

Figure 10.4: Histograms of 100,000 samples from the posterior distribution of τ, µ and ρ using prior parameters (a, b) = (0.9, 0.3). The prior distributions for τ and ρ are superimposed on the histograms (dashed lines).

Table 10.1: Posterior summary statistics of τ, σ, µ and ρ for the analysis of the luteinizing hormone data using prior parameters (a, b) = (0.9, 0.3). SE is standard error (square root of the variance) while CrI is credibility interval. The last column gives the corresponding ML estimates.

    Param.   Mean   SE   95% CrI             ML
    τ                    (12.767, 28.991)
    σ                    (0.186, 0.280)
    µ                    (0.121, 0.613)
    ρ                    (0.291, 0.851)

Non-informative priors are in many cases not proper densities, in that they can have infinite mass. Formally, an improper prior can be seen as the limit of proper priors as the variance increases to infinity. In practice such improper densities cause no problems as long as the posterior distribution becomes proper; see however Section 10.5 for further discussion of this issue. The influence of the prior depends both on the information in the prior and on the amount of data. When the number of observations increases, the influence of the prior will in most cases vanish and the Bayesian estimates will become similar to the maximum likelihood estimates.

Figure 10.5: Posterior distributions for σ based on the first n observations; n = 0 corresponds to the prior distribution. Results are based on a = 0.9, b = 0.3.

10.3 Prediction

Prediction is needed in many applications. For illustration, we will again consider the time-series data in Figure 10.2.

Example 10.2 (cont.) Assume now we want to predict y_{n+k}, the luteinizing hormone level k × 10 minutes after the last measurement. From (10.1),

    y_{n+k} = (1 + ρ + ··· + ρ^{k−1})µ + ρ^k y_n + ε_{n+k} + ρ ε_{n+k−1} + ··· + ρ^{k−1} ε_{n+1},      (10.3)

which implies that

    E[y_{n+k} | y_1, ..., y_n, µ, ρ, σ²] = (1 + ρ + ··· + ρ^{k−1})µ + ρ^k y_n,
    Var[y_{n+k} | y_1, ..., y_n, µ, ρ, σ²] = σ²[1 + ρ² + ··· + ρ^{2(k−1)}].

If µ, ρ and σ were known, the obvious approach would be to use the conditional expectation as the prediction, with the corresponding variance giving a measure of the uncertainty in the prediction. With the parameters unknown, a simple plug-in approach is to insert point estimates into the expressions above. A weakness of this approach is that uncertainty in the parameter estimates is not taken into account.

The Bayesian approach is to include the uncertainty in (µ, ρ, σ) directly in the predictions.

Figure 10.6: Predictions of exp(y_{n+k}) for k = 1, 2, 3, 4. Dashed lines are predictions and 95% prediction intervals for the plug-in rule. Dotted lines are the same quantities for Bayesian prediction using a = 0.9, b = 0.3.

Formally, this means looking at the posterior distribution of the observation y_{n+k} given y = (y_1, ..., y_n), which can be written as a mathematical integral:

    p(y_{n+k} | y) = ∫∫∫ p(y_{n+k} | y, µ, ρ, σ) p(µ, ρ, σ | y) dµ dρ dσ.        (10.4)

Taking our Monte Carlo approach to performing inference, such integrals can be substituted by simulations. In this context it means simulating y_{n+k} in addition to the previous samples of (µ, ρ, σ). Simulation of y_{n+k} can easily be performed using (10.3). Note that several values y_{n+1}, ..., y_{n+k} can be simulated simultaneously; there is therefore no need to run the simulation scheme several times if prediction at many time points is of interest. Further, transforming our predictions to the original scale (remembering that we performed our analysis on the log-scale) can be done by looking at exp(y_{n+k}) instead. Figure 10.6 shows predictions and prediction intervals based on the plug-in rule and on Bayesian prediction for k = 1, 2, 3, 4. We see that by taking the uncertainty in the parameters into account, wider, but more reliable, intervals result.
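
Continuing the illustrative sketches above (and assuming the matrix out of posterior samples produced by the Gibbs sketch in Section 10.2), Bayesian predictions of exp(y_{n+k}) can be obtained by simulating one future path per posterior sample:

    ## Simulate the Bayesian predictive distribution of y_{n+1}, ..., y_{n+4}
    ## using one posterior draw of (mu, rho, sigma) per simulated path.
    K <- 4
    S <- nrow(out)
    pred <- matrix(NA, S, K)
    for (s in 1:S) {
      mu <- out[s, "mu"]; rho <- out[s, "rho"]; sigma <- 1 / sqrt(out[s, "tau"])
      ylast <- y[n]                                      # last (log-scale) observation
      for (k in 1:K) {
        ylast <- mu + rho * ylast + rnorm(1, 0, sigma)   # one step of (10.1)
        pred[s, k] <- exp(ylast)                         # back to the original scale
      }
    }
    ## Bayesian predictions and 95% prediction intervals for k = 1, ..., 4
    apply(pred, 2, quantile, c(0.025, 0.5, 0.975))

The plug-in version of the same calculation would simply fix (µ, ρ, σ) at their point estimates for every simulated path, which is why its intervals come out narrower.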

10.4 Hierarchical models

Latent random variables can be included in models in order to describe dependence or clustering structures in the data. Latent variables (as opposed to observable variables) are variables that are not directly observed but are included as a modelling tool. Such approaches have been given different names within different areas of statistics. Within Bayesian statistics, this is usually called hierarchical modelling. Classical statistics uses the terms random effects or latent variables, while in survival analysis the name frailty model is used. One of the real strengths of Bayesian analysis is the coherent way of handling such hierarchical structures. We will illustrate the main ideas through the time series data of luteinizing hormone.

Example 10.2 (cont.) The luteinizing hormone data are given with a precision of one decimal, indicating that there is some deviation between the actual lh level and the observations. Assume now that ỹ_i is the observed lh level at time i (on the original scale) while x_i is the actual level at time i (on log-scale). Because the observations always will be positive, we will assume a simple model where ỹ_i is equal to exp(x_i) multiplied by some measurement noise and rounded to one decimal. This gives an extension of model (10.1):

    x_i = µ + ρ x_{i−1} + ε_i,          ε_i ∼ N(0, σ²),
    ỹ_i = round_1(exp(x_i + η_i)),      η_i ∼ N(0, 0.01),                     (10.5)

where by round_1(·) we mean rounding to one decimal. The data are still described by the same parameters, but the variability involved is now divided into two parts, one reflecting the variability in the true lh level and the other related to measurement error.

Equations (10.5) describe an example of a hierarchical model with {x_i} being the latent variables, not directly observed. The relation between x_i and ỹ_i could take many different forms, both deterministic and stochastic. The main idea is that the dependence structure involved usually is much easier to specify for the latent structure than directly on the observations. In the example above, the sample space for ỹ_i is a discrete set of values for which it is far from obvious how to specify dependence. The latent process {x_i} is however normal, where dependence structures are directly specified through correlations. In this example the correlation structure is determined through a dynamic equation, but other options can easily be considered in more complex settings.
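
To make the data-generating mechanism in (10.5) concrete, the following R fragment simulates one synthetic series from it. The parameter values are arbitrary illustrative choices, and the measurement standard deviation 0.1 matches the value used in the WinBUGS code of Section 10.6.

    ## Forward simulation from the hierarchical model (10.5); illustrative values only
    set.seed(1)
    n <- 48
    mu <- 0.36; rho <- 0.58; sigma <- 0.2
    x <- numeric(n)
    x[1] <- mu / (1 - rho)                          # start at the stationary mean
    for (i in 2:n) x[i] <- mu + rho * x[i - 1] + rnorm(1, 0, sigma)
    eta <- rnorm(n, 0, 0.1)                         # measurement error on log-scale
    y.tilde <- round(exp(x + eta), 1)               # observed, rounded to one decimal
    head(cbind(x = x, y.tilde = y.tilde))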

With latent variables present, the focus changes to the simultaneous posterior distribution of the pair (θ, x) given the observations. Conceptually this is not different from the situations considered earlier. Still taking the Monte Carlo approach, we now need to simulate the pair (θ, x) conditional on the observed data. Assume {(θ^s, x^s), s = 1, ..., S} are our Monte Carlo samples. If inference about a specific parameter θ_j is the main interest, θ_j^1, ..., θ_j^S can be used to construct a histogram or to extract summary statistics. Similarly, if x_i is of interest, x_i^1, ..., x_i^S are the relevant quantities to explore.

Example 10.2 (cont.) Consider now the extended model (10.5). Our interest will be both in estimation of µ, ρ, σ and in prediction of x_i, i = 1, ..., n. Figure 10.7 shows histograms of σ, µ and ρ based on 10,000 Monte Carlo samples from the posterior distribution. Superimposed on the histograms are posterior densities obtained from the original model (10.1). We see that the mass of σ is somewhat lower, reflecting that the variability in the data now is divided into two parts, one relating to the latent dynamic process and one to the observations. For µ and ρ, there are only minor changes. Simultaneously, we get information about the latent variables x_i (lower panel of Figure 10.7). We see in this case that predictions of the latent process are quite accurate, reflecting that ỹ_i gives quite precise information about x_i.

The use of latent variables as a modelling tool has become very popular recently, partly because of the great flexibility in describing complicated dependence structures and partly because the computational obstacles now can be handled through Monte Carlo simulations. There is however a computational cost in including latent variables. Performing inference based on models like (10.5) is not straightforward. This issue will be discussed in Section 10.6.

10.5 Model selection

In many real applications, uncertainty is also related to the model(s) assumed. For our running example, an alternative model to (10.1) is

    y_i = µ + ρ_1 y_{i−1} + ρ_2 y_{i−2} + ε_i,   i = 1, ..., n,              (10.6)

i.e. an autoregressive model of order two. In classical statistics, selection of a model can proceed by performing a test of the hypothesis H_0: ρ_2 = 0. The pure Bayesian approach to model selection is to include the model as an unknown random quantity and to include a prior on a set of possible models as well.

Figure 10.7: The upper panel shows histograms of σ (left), µ (middle) and ρ (right) based on 10,000 samples from the posterior distribution using a = 0.9 and b = 0.3. Posterior densities for the original model (10.1) are shown as dashed lines. The lower panel shows the posterior mean (solid line) and 95% credibility intervals (dashed lines) for x_i, i = 1, ..., n. Observations (on log-scale) are shown as circles.

Defining M_1, ..., M_m to be this set, our interest now is in the posterior probabilities Pr(M_j | y) of M_j being the true model. Using Bayes theorem, we obtain

    Pr(M_j | y) = Pr(M_j) p(y | M_j) / C,   j = 1, ..., m,                    (10.7)

where C = Σ_{j=1}^m Pr(M_j) p(y | M_j). Here Pr(M_j) is the prior belief in model M_j. If there is no preference of one model over another, we can choose Pr(M_j) = 1/m. In addition to this prior, we also need to consider p(y | M_j), which in Section 10.2 was referred to as the marginal density (in Section 10.2 the conditioning on the model was done implicitly, in that only one model was considered).

Table 10.2: Verbal guideline for interpreting Bayes factors.

    Bayes factor   Evidence against model M_2
    1 to 3         Not worth more than a bare mention
    3 to 20        Positive
    20 to 150      Strong
    > 150          Very strong

When comparing two competing models, M_1 and M_2 say, an alternative is to look at the ratio of their posterior probabilities. Using (10.7), we get

    Pr(M_1 | y) / Pr(M_2 | y) = [Pr(M_1) / Pr(M_2)] × [p(y | M_1) / p(y | M_2)].

In this formulation we are able to divide the ratio of posteriors into one part related to the prior and another part related to the data. Because of its importance in model selection, the last term is given its own name, the Bayes factor. In many cases, only the Bayes factors are considered when performing model selection, which indirectly indicates the use of Pr(M_j) = 1/m. Table 10.2 gives a verbal guideline for interpreting Bayes factors suggested by Kass and Raftery (1995) (a modification of an original suggestion by Jeffreys (1961)).

Mathematically, the marginal density is given by

    p(y | M_j) = ∫ p(θ_j | M_j) p(y | θ_j, M_j) dθ_j,

where θ_j are the parameters involved in model M_j. If latent variables are present, the integral above is extended to include these as well. In the above expression, p(θ_j | M_j) is the prior distribution for the parameters involved in model M_j. The marginal densities quantify the fit between data and model and are the essential quantities for model selection. Note that uncertainties in the parameters and latent variables are taken into account.

There are several problematic issues related to the marginal densities. In contrast to posterior distributions for parameters, they depend heavily on the prior distributions, even for a large number of observations. In particular, all prior distributions need to be proper. If no prior information is available about some parameters, a possible solution is to divide the data into two subsets, y_1 and y_2 say. The set y_2 is used to perform model selection. As prior for θ, one can use the posterior distribution for θ based on y_1, with "prior" now having the meaning "prior to data y_2". Assuming y_1 contains a reasonable amount of data, we obtain a prior that is proper. For more sophisticated strategies, utilizing the full dataset better, see Berger and Pericchi (1996).
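
As a small numerical illustration of (10.7), with made-up marginal likelihood values (the actual values for the lh-data belong in Table 10.3 below), posterior model probabilities and the Bayes factor follow directly:

    ## Hypothetical (made-up) marginal likelihoods for two models, equal prior weights
    marg <- c(M1 = 2.0e-8, M2 = 8.0e-8)        # p(y | M_j), illustrative values only
    prior <- c(0.5, 0.5)
    bayes.factor <- marg["M1"] / marg["M2"]    # evidence for M1 relative to M2
    post <- prior * marg / sum(prior * marg)   # Pr(M_j | y), equation (10.7)
    c(BF = unname(bayes.factor), round(post, 3))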

Table 10.3: Marginal densities, Bayes factor for model M_1 compared to model M_2, and posterior model probabilities for the last 24 lh-observations. Priors are constructed using the first 24 observations.

    (a, b)          (9, 3)   (0.9, 0.3)   (0.09, 0.03)
    p(y | M_1)
    p(y | M_2)
    Bayes factor
    Pr(M_2 | y)

Numerical calculation of marginal densities can be difficult. We will discuss this issue further in Section 10.6.

Example 10.2 (cont.) Assume two alternative models, the AR(1) model (10.1) and the AR(2) model (10.6). In our original formulation, we used an improper prior distribution for µ and a rather wide prior distribution for ρ. In order to obtain proper priors, divide the dataset into two parts, y_1 = (y_1, ..., y_24) and y_2 = (y_25, ..., y_48). Priors for θ for the analysis of y_2 are then specified through the posteriors for θ based on y_1. Note that different priors are obtained through different choices of the parameters (a, b) in the Gamma prior for τ. Table 10.3 shows marginal density values, Bayes factors and posterior model probabilities for the three sets of values of (a, b) considered earlier. The marginal densities change considerably for the different priors. For the most informative prior (a = 9, b = 3), the small marginal densities reflect that the prior does not fit very well with the data. The differences between the two more vague priors are much smaller. The Bayes factors and the posterior probabilities for M_2 are quite stable, though.

The strong dependence on priors and the computational difficulties involved in the calculation of marginal densities and Bayes factors have motivated the use of alternatives. The Bayesian information criterion (BIC) is based on an approximation of the Bayes factor,

    p(y | M_1) / p(y | M_2) ≈ [p(y | θ̂_1, M_1) / p(y | θ̂_2, M_2)] × n^{0.5(k_2 − k_1)},       (10.8)

where θ̂_j is the ML estimator of θ_j under model j, while k_j is the number of parameters in model j. The approximation is based on an asymptotic argument, indicating its dependence on a large number of observations.
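
A sketch of how such a BIC comparison can be carried out in R for the two AR models is given below, using the definition of BIC stated just after this (and conditioning on the first two observations, as in the example that follows). This is not the author's computation, and the exact numbers depend on the conditioning and on the likelihood convention used.

    ## BIC comparison of the AR(1) and AR(2) models, conditioning on y_1, y_2.
    ## BIC_j = 2 log p(y | theta.hat_j, M_j) - k_j log(n); larger is better.
    y <- log(lh); N <- length(y)
    y0 <- y[3:N]                   # responses y_3, ..., y_48
    l1 <- y[2:(N - 1)]             # first lag
    l2 <- y[1:(N - 2)]             # second lag
    m1 <- lm(y0 ~ l1)              # AR(1): k_1 = 3 (mu, rho, sigma)
    m2 <- lm(y0 ~ l1 + l2)         # AR(2): k_2 = 4
    bic <- function(fit, k) 2 * as.numeric(logLik(fit)) - k * log(length(y0))
    c(BIC1 = bic(m1, 3), BIC2 = bic(m2, 4))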

Defining now BIC for model M_j to be

    BIC_j = 2 log p(y | θ̂_j, M_j) − k_j log(n),

we see that 2 times the log of the right-hand side of (10.8) becomes equal to BIC_1 − BIC_2. Model selection can be performed by selecting the model with the largest BIC value. The advantages of using BIC are its simplicity and its (apparent) independence of prior distributions. However, as stated by Kass and Wasserman (1995), the approximation is based on an implicit prior. This prior is quite sensible, though, and does not have to be expressed explicitly.

Example 10.2 (cont.) We compare models M_1 and M_2 based on the observations y_3, ..., y_48, assuming y_1, y_2 are fixed. Evaluating the maximised log-likelihoods log p(y | θ̂_1, M_1) and log p(y | θ̂_2, M_2), and using that k_1 = 3 while k_2 = 4, we obtain BIC_1 = 8.91 and a smaller value for BIC_2. This gives a preference to model 1.

The BIC measure can be seen as a measure of fit with a penalty term included for model complexity. Many alternative such measures have been suggested in the literature, with Akaike's information criterion (AIC; Akaike, 1973) being the most well known. Another criterion popular within Bayesian statistics is DIC (Spiegelhalter et al., 2002).

10.6 Computational aspects

As discussed in Section 10.2, computer-generated (Monte Carlo) samples can be used to explore the posterior distribution. By transforming the samples properly, any quantity of interest can be explored. Using Monte Carlo simulation, a numerical approximation is applied. The errors involved in such approximations will usually be small compared to the uncertainty due to basing our inference on a finite number of observations. Further, this approximation error can be controlled through the number of Monte Carlo samples S that we generate. For some examples (e.g. the Bayesian analysis of the lh-data using model (10.1)), performing Monte Carlo simulations can be easy. In more complex situations, constructing efficient simulation routines is more difficult. Still, Monte Carlo approaches have been quite successfully applied in a wide range of applications.

In cases with latent variables x, we might be interested in the simultaneous posterior p(x, θ | y). For simplicity of notation, assume z contains all the unknowns, both latent variables and parameters (for the hierarchical models discussed in Section 10.4, z = (θ, x)). Assume further that we are interested in some univariate quantity h(z) (which could be a parameter, a latent variable or some more complicated function).

Our interest will then be in the posterior distribution of h(z) given our data y. Assuming z^s, s = 1, ..., S, are samples from the posterior distribution p(z | y), a Monte Carlo estimate of the posterior expectation of h(z), E[h(z) | y], is

    ĥ_MC = (1/S) Σ_{s=1}^S h(z^s).

The law of large numbers states that ĥ_MC → E[h(z) | y] for large S. Further, if the samples are independent, we have

    Var[ĥ_MC | y] = (1/S) C_h^MC,                                              (10.9)

where C_h^MC is a factor depending on the function h and the posterior distribution p(z | y), but not on S. The factor C_h^MC can be estimated by the empirical variance of the h(z^s), and the specification of S can then be made by quantifying the required precision of our estimate. Note that the decrease in variance (the factor 1/S) is independent of the dimension of the problem. This is in sharp contrast to numerical integration techniques and is the main reason for the popularity of Monte Carlo methods within Bayesian analysis.

The dimension of the problem does, however, come into play when considering methods for simulation from the posterior distribution p(z | y). For high-dimensional problems, direct sampling can be difficult. Markov chain Monte Carlo (MCMC) algorithms (Gilks et al., 1996) are a powerful class of methods for performing simulations in high-dimensional spaces. The main idea is to sample {z^s} sequentially such that when s increases, the distribution of the samples gets closer to the target distribution p(z | y). This sequential updating is performed such that the generation of z^s only depends on the previous sample z^{s−1} and not on any earlier samples. This results in a Markov chain, giving the name to this class of algorithms. For MCMC samples, the Monte Carlo estimate is modified to

    ĥ_MCMC = (1/S) Σ_{s=B+1}^{B+S} h(z^s),

i.e. we run the Markov chain B iterations before starting to collect our samples. B is usually denoted the burn-in period and reflects that the first samples can have distributions not close enough to the target distribution.
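
Before turning to dependent MCMC samples, here is a small illustration of (10.9) for independent samples (my own sketch), again using the Beta(1, 21) posterior from Example 10.1, where independent draws are easy to obtain; the Monte Carlo standard error of the estimated posterior mean shrinks at rate 1/√S:

    ## Monte Carlo estimate of E[h(theta) | y] with h(theta) = theta, plus its
    ## standard error, for independent samples from the Beta(1, 21) posterior.
    set.seed(1)
    S <- 10000
    theta <- rbeta(S, 1, 21)
    h.mc <- mean(theta)                   # Monte Carlo estimate
    se.mc <- sd(theta) / sqrt(S)          # estimated sqrt(C_h / S), cf. (10.9)
    c(estimate = h.mc, MC.se = se.mc, exact = 1 / 22)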

Remarkably, under relatively weak conditions, we also have ĥ_MCMC → E[h(z) | y], a result based on general Markov chain theory. Further, the variance will be approximately

    Var[ĥ_MCMC] ≈ (1/S) C_h^MCMC.

The factor C_h^MCMC will in most cases be larger (in some cases much larger) than C_h^MC and will depend on the specific type of algorithm that is chosen. This increase in variance is due to the Monte Carlo samples now being dependent, resulting in an effective sample size smaller than S.

Two subclasses of MCMC algorithms have mainly been applied. The Metropolis-Hastings algorithms are closely related to acceptance-rejection sampling, a sampling technique based on generating a sample from an (in principle) arbitrary proposal distribution, followed by an acceptance step which works as a correction for sampling from the wrong distribution. Assuming a sample z^s is given at iteration s, a proposal z* is generated from a proposal distribution q(z* | z^s). Then

    z^{s+1} = z*    with probability r(z^s, z*),
    z^{s+1} = z^s   with probability 1 − r(z^s, z*),

where

    r(z^s, z*) = min{1, [p(z* | y) q(z^s | z*)] / [p(z^s | y) q(z* | z^s)]}.

The key factor in obtaining an effective algorithm is the choice of the proposal distribution q. Typical choices are small changes in one or a few components of z at a time. Metropolis-Hastings algorithms can be applied in quite general settings and are usually easy to implement. The main requirements are the possibility of evaluating the posterior p(z | y) and that the proposal distribution allows the Markov chain {z^s} to explore all possible values of z. In many Bayesian settings, the posterior distribution can only be evaluated up to a proportionality constant. This is handled easily in the Metropolis-Hastings algorithm due to the occurrence of the posterior both in the numerator and in the denominator of the acceptance probability. Practical implementations do contain many tuning parameters that need to be specified carefully.
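
A minimal random-walk Metropolis sketch (not the author's code) illustrates these points on the one-parameter posterior from Example 10.1, where the posterior only needs to be known up to a proportionality constant and the proposal standard deviation acts as a tuning parameter:

    ## Random-walk Metropolis for the prevalence theta in Example 10.1
    ## (y = 0 of n = 20, uniform prior); log.post is an unnormalised log-posterior.
    set.seed(1)
    n <- 20; y <- 0
    log.post <- function(theta) {
      if (theta <= 0 || theta >= 1) return(-Inf)   # outside the support
      y * log(theta) + (n - y) * log(1 - theta)    # binomial likelihood x uniform prior
    }
    S <- 20000; sd.prop <- 0.05                    # sd.prop is a tuning parameter
    theta <- numeric(S); theta[1] <- 0.5
    for (s in 2:S) {
      prop <- rnorm(1, theta[s - 1], sd.prop)      # symmetric proposal: q cancels in r
      r <- exp(log.post(prop) - log.post(theta[s - 1]))
      theta[s] <- if (runif(1) < r) prop else theta[s - 1]
    }
    quantile(theta[-(1:2000)], c(0.025, 0.975))    # discard burn-in; cf. (0.001, 0.161)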

The Gibbs sampler is another popular class of algorithms, its simplest version being specified by two steps at each iteration:

1. Choose a component j at random.
2. Simulate z_j^{s+1} from the conditional distribution of z_j given z_k = z_k^s, k ≠ j, and put z_k^{s+1} = z_k^s for k ≠ j.

A component in this setting could be a single variable or a set of several components. The Gibbs sampler can formally be seen as a special case of a Metropolis-Hastings algorithm (by using the conditional distribution as a proposal distribution). In practice, applications of Metropolis-Hastings algorithms use quite different proposals and are usually considered as alternatives to the Gibbs sampler. Compared to the Metropolis-Hastings algorithm, the Gibbs sampler requires more effort in the sense that the conditional distributions need to be worked out. On the other hand, no tuning parameters need to be specified, making it more directly applicable when the relevant conditional distributions are available.

Implementation of Markov chain Monte Carlo algorithms can be time-consuming. Both general and application-specific software are however becoming increasingly available, making Bayesian analysis accessible. WinBUGS is the most general Bayesian software package available to date and is freely available on the web. The first version of the program, called BUGS, was an abbreviation of Bayesian inference Using Gibbs Sampling, with WinBUGS being the Windows successor to BUGS (the only version maintained, although there is an OpenBUGS version that is still under development). The original version mainly performed Gibbs sampling, while later versions also allow Metropolis-Hastings steps, extending the application area considerably. The software comes with a user manual, in addition to a wide range of examples. The nice part of using WinBUGS/OpenBUGS is that only the model needs to be specified; the construction of the MCMC algorithm is done automatically.

Model (10.5) is a bit problematic to implement directly due to the rounding of the observations. The model can however be implemented by first reformulating it to

    x_i = µ + ρ x_{i−1} + ε_i,       ε_i ∼ N(0, σ²),
    v_i = x_i + η_i,                 η_i ∼ N(0, 0.01),                        (10.10)
    ỹ_i ∼ Uniform[exp(v_i) − 0.05, exp(v_i) + 0.05].

The reformulation consists of two steps. The first is to include an extra variable v_i representing the (log-scale) observation before rounding. The second is to reformulate the rounding to a uniform distribution. This actually gives another model, but it can be shown that the posterior distribution remains the same. The model can then be specified as shown in Figure 10.8, where logical (or deterministic) links are denoted by <- while stochastic links are denoted by ~. In addition, data and initial values need to be loaded into the system before the actual simulations can be performed. Figure 10.9 shows a trace plot of the simulated values of ρ. This plot shows a typical behaviour in that some initial iterations are needed before the values stabilise.

    model {
      x[1] ~ dnorm(mu, tau)
      for (i in 2:n) {
        mean[i] <- mu + rho * x[i-1]
        x[i] ~ dnorm(mean[i], tau)
      }
      for (i in 1:n) {
        y.tilde[i] ~ dnorm(x[i], 100)      # y.tilde corresponds to v in (10.10)
        L[i] <- exp(y.tilde[i]) - 0.05
        U[i] <- exp(y.tilde[i]) + 0.05
        y[i] ~ dunif(L[i], U[i])           # y is the observed (rounded) series
      }
      mu ~ dflat()
      rho ~ dunif(-1, 1)
      tau ~ dgamma(a, b)
    }

Figure 10.8: WinBUGS code for specifying model (10.10).

Such trace plots can be used for visually specifying the burn-in value B, but more formal methods are available, many of which are implemented in WinBUGS or related software. For this example, convergence is fast, but it can be substantially slower in more complex situations. Several other packages for MCMC simulation in Bayesian settings are available, as are more application-specific programs. Within the R package (R Development Core Team, 2009), a freeware statistical package, many libraries tuned towards specific applications are available.

The above discussion is primarily related to inference within a model. When comparing models (Section 10.5), marginal densities need to be evaluated as well. A marginal density (dropping the conditioning on the model) is given by

    p(y) = ∫ p(y | z) p(z) dz.

This density can be seen as the expectation of the conditional density p(y | z) with respect to the prior p(z). Since simulation from the prior p(z) nearly always is easy, in principle this expectation can be approximated by

    p̂(y) = (1/S) Σ_{s=1}^S p(y | z^s),

where z^1, ..., z^S are samples from p(z). In practice, very few of the samples from p(z) will cover the main support of p(y | z), making the effective sample size very small. Alternative methods are available, but are rather complicated. We refer to Han and Carlin (2001) for a review of some of these methods.
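
For intuition, here is a tiny illustration of this estimator (my own sketch, not from the chapter) in a setting where the answer is known: with a uniform prior on θ in Example 10.1, p(y) = 1/(n + 1) for every y. In this one-dimensional case the prior-sampling estimate works well; the degeneracy described above appears as the dimension of z grows.

    ## Prior-sampling estimate of the marginal density p(y) in Example 10.1,
    ## where the exact answer is 1/(n + 1) under a uniform prior.
    set.seed(1)
    n <- 20; y <- 0; S <- 100000
    theta <- runif(S)                      # samples from the prior p(theta)
    p.hat <- mean(dbinom(y, n, theta))     # (1/S) * sum over s of p(y | theta^s)
    c(estimate = p.hat, exact = 1 / (n + 1))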

Figure 10.9: Trace plot of simulated values of ρ based on Markov chain Monte Carlo simulations.

10.7 Summary/discussion

Given the similarities with maximum likelihood estimates under non-informative priors, one might question the need for Bayesian analysis. There are several arguments that can be used to answer this.

When prior information is available, it will influence the results. For the running example one might have information from other patients, or data collected earlier, that can give information about both ρ and τ.

Classical statistical approaches differ from Bayesian ones in how parameter uncertainty is taken into account, for example in prediction. The Bayesian approach offers a coherent way of doing this. Within classical statistics, various methods are in use. The simplest, and perhaps most widely used, approach is the plug-in method. Such a method neglects the uncertainty in the parameter estimates and thereby underestimates the uncertainty in the predictions. More advanced methods such as Bootstrapping are more accurate but are not always that easy to apply.

Apart from the simplest models, inference within classical statistics is based on large-sample approximations, while Bayesian methods are exact in the sense that, assuming the model assumptions are valid, the posterior distribution does give the right answer. The need for numerical approximations violates this exactness, but such errors are usually of a smaller scale than the variability due to data. On the other hand, large-sample approximations are usually quite robust to model assumptions. This is also true in Bayesian settings, but with small samples the results will typically rely heavily on the model assumptions.

Maximum likelihood methods involve optimization; Bayesian approaches involve integration. Advances in statistical computing, and in particular Monte Carlo methods, have for many problems made the computational challenges easier to handle within the Bayesian framework than in classical settings. Pragmatic considerations have in such cases largely given preference to the use of Bayesian methods.

10.8 Further reading

For general introductions to Bayesian methods, we refer to Gelman et al. (2004) or Carlin and Louis (2008). Many applications of Bayesian methods within health research are given in Berry and Stangl (1996). Ashby (2006) contains a huge list of references to Bayesian statistics within medicine. Breslow (1990) discusses Bayesian methods in contrast to frequentist (classical) statistical approaches.

Bibliography

Akaike, H. (1973). Information theory and an extension of the likelihood ratio principle. In Second International Symposium on Information Theory.

Ashby, D. (2006). Bayesian statistics in medicine: a 25 year review. Statistics in Medicine 25(21).

Berger, J. and L. Pericchi (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association 91(433).

Berry, D. and D. Stangl (1996). Bayesian Biostatistics. CRC Press.

Breslow, N. (1990). Biostatistics and Bayes. Statistical Science 5(3).

Carlin, B. and T. Louis (2008). Bayesian Methods for Data Analysis (Third ed.). CRC Press.

Diggle, P. (1990). Time Series: A Biostatistical Introduction. Oxford University Press.

Gelman, A., J. Carlin, H. Stern, and D. Rubin (2004). Bayesian Data Analysis. Chapman and Hall/CRC.

Gilks, W., S. Richardson, and D. J. Spiegelhalter (1996). Markov Chain Monte Carlo in Practice. London: Chapman & Hall.

Han, C. and B. Carlin (2001). Markov chain Monte Carlo methods for computing Bayes factors: a comparative review. Journal of the American Statistical Association 96(455).

Jeffreys, H. (1961). Theory of Probability. Oxford: Clarendon Press.

Kass, R. and A. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90(430).

Kass, R. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 90(431).

Mugglin, A., N. Cressie, and I. Gemmell (2002). Hierarchical statistical modelling of influenza epidemic dynamics in space and time. Statistics in Medicine 21(18).

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Spiegelhalter, D., N. Best, B. Carlin, and A. van der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 64(4).


More information

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling 2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling Jon Wakefield Departments of Statistics and Biostatistics University of Washington Outline Introduction and Motivating

More information

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation. PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.. Beta Distribution We ll start by learning about the Beta distribution, since we end up using

More information

ST 740: Model Selection

ST 740: Model Selection ST 740: Model Selection Alyson Wilson Department of Statistics North Carolina State University November 25, 2013 A. Wilson (NCSU Statistics) Model Selection November 25, 2013 1 / 29 Formal Bayesian Model

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

1 Hypothesis Testing and Model Selection

1 Hypothesis Testing and Model Selection A Short Course on Bayesian Inference (based on An Introduction to Bayesian Analysis: Theory and Methods by Ghosh, Delampady and Samanta) Module 6: From Chapter 6 of GDS 1 Hypothesis Testing and Model Selection

More information

Part III. A Decision-Theoretic Approach and Bayesian testing

Part III. A Decision-Theoretic Approach and Bayesian testing Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to

More information

An Extended BIC for Model Selection

An Extended BIC for Model Selection An Extended BIC for Model Selection at the JSM meeting 2007 - Salt Lake City Surajit Ray Boston University (Dept of Mathematics and Statistics) Joint work with James Berger, Duke University; Susie Bayarri,

More information

Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation

Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation So far we have discussed types of spatial data, some basic modeling frameworks and exploratory techniques. We have not discussed

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

The Metropolis-Hastings Algorithm. June 8, 2012

The Metropolis-Hastings Algorithm. June 8, 2012 The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Basic of Probability Theory for Ph.D. students in Education, Social Sciences and Business (Shing On LEUNG and Hui Ping WU) (May 2015)

Basic of Probability Theory for Ph.D. students in Education, Social Sciences and Business (Shing On LEUNG and Hui Ping WU) (May 2015) Basic of Probability Theory for Ph.D. students in Education, Social Sciences and Business (Shing On LEUNG and Hui Ping WU) (May 2015) This is a series of 3 talks respectively on: A. Probability Theory

More information

Bayesian Inference: Concept and Practice

Bayesian Inference: Concept and Practice Inference: Concept and Practice fundamentals Johan A. Elkink School of Politics & International Relations University College Dublin 5 June 2017 1 2 3 Bayes theorem In order to estimate the parameters of

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

David Giles Bayesian Econometrics

David Giles Bayesian Econometrics David Giles Bayesian Econometrics 1. General Background 2. Constructing Prior Distributions 3. Properties of Bayes Estimators and Tests 4. Bayesian Analysis of the Multiple Regression Model 5. Bayesian

More information

Weakness of Beta priors (or conjugate priors in general) They can only represent a limited range of prior beliefs. For example... There are no bimodal beta distributions (except when the modes are at 0

More information

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017 Chalmers April 6, 2017 Bayesian philosophy Bayesian philosophy Bayesian statistics versus classical statistics: War or co-existence? Classical statistics: Models have variables and parameters; these are

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

Introduction to Bayesian Methods

Introduction to Bayesian Methods Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

The STS Surgeon Composite Technical Appendix

The STS Surgeon Composite Technical Appendix The STS Surgeon Composite Technical Appendix Overview Surgeon-specific risk-adjusted operative operative mortality and major complication rates were estimated using a bivariate random-effects logistic

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

Sequential Monitoring of Clinical Trials Session 4 - Bayesian Evaluation of Group Sequential Designs

Sequential Monitoring of Clinical Trials Session 4 - Bayesian Evaluation of Group Sequential Designs Sequential Monitoring of Clinical Trials Session 4 - Bayesian Evaluation of Group Sequential Designs Presented August 8-10, 2012 Daniel L. Gillen Department of Statistics University of California, Irvine

More information

Heriot-Watt University

Heriot-Watt University Heriot-Watt University Heriot-Watt University Research Gateway Prediction of settlement delay in critical illness insurance claims by using the generalized beta of the second kind distribution Dodd, Erengul;

More information

Bayesian Inference. Introduction

Bayesian Inference. Introduction Bayesian Inference Introduction The frequentist approach to inference holds that probabilities are intrinsicially tied (unsurprisingly) to frequencies. This interpretation is actually quite natural. What,

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

MCMC for Cut Models or Chasing a Moving Target with MCMC

MCMC for Cut Models or Chasing a Moving Target with MCMC MCMC for Cut Models or Chasing a Moving Target with MCMC Martyn Plummer International Agency for Research on Cancer MCMSki Chamonix, 6 Jan 2014 Cut models What do we want to do? 1. Generate some random

More information

Inference for a Population Proportion

Inference for a Population Proportion Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist

More information

Bayesian inference: what it means and why we care

Bayesian inference: what it means and why we care Bayesian inference: what it means and why we care Robin J. Ryder Centre de Recherche en Mathématiques de la Décision Université Paris-Dauphine 6 November 2017 Mathematical Coffees Robin Ryder (Dauphine)

More information

Model comparison: Deviance-based approaches

Model comparison: Deviance-based approaches Model comparison: Deviance-based approaches Patrick Breheny February 19 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/23 Model comparison Thus far, we have looked at residuals in a fairly

More information

BAYESIAN MODEL CRITICISM

BAYESIAN MODEL CRITICISM Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes

More information

Bayesian Phylogenetics:

Bayesian Phylogenetics: Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes

More information

INTRODUCTION TO BAYESIAN ANALYSIS

INTRODUCTION TO BAYESIAN ANALYSIS INTRODUCTION TO BAYESIAN ANALYSIS Arto Luoma University of Tampere, Finland Autumn 2014 Introduction to Bayesian analysis, autumn 2013 University of Tampere 1 / 130 Who was Thomas Bayes? Thomas Bayes (1701-1761)

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

Bayesian Analysis of RR Lyrae Distances and Kinematics

Bayesian Analysis of RR Lyrae Distances and Kinematics Bayesian Analysis of RR Lyrae Distances and Kinematics William H. Jefferys, Thomas R. Jefferys and Thomas G. Barnes University of Texas at Austin, USA Thanks to: Jim Berger, Peter Müller, Charles Friedman

More information

Seminar über Statistik FS2008: Model Selection

Seminar über Statistik FS2008: Model Selection Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can

More information

Bayesian Meta-analysis with Hierarchical Modeling Brian P. Hobbs 1

Bayesian Meta-analysis with Hierarchical Modeling Brian P. Hobbs 1 Bayesian Meta-analysis with Hierarchical Modeling Brian P. Hobbs 1 Division of Biostatistics, School of Public Health, University of Minnesota, Mayo Mail Code 303, Minneapolis, Minnesota 55455 0392, U.S.A.

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

an introduction to bayesian inference

an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

More information

Bayesian data analysis in practice: Three simple examples

Bayesian data analysis in practice: Three simple examples Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to

More information

Bayesian model selection for computer model validation via mixture model estimation

Bayesian model selection for computer model validation via mixture model estimation Bayesian model selection for computer model validation via mixture model estimation Kaniav Kamary ATER, CNAM Joint work with É. Parent, P. Barbillon, M. Keller and N. Bousquet Outline Computer model validation

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Chapter 5. Bayesian Statistics

Chapter 5. Bayesian Statistics Chapter 5. Bayesian Statistics Principles of Bayesian Statistics Anything unknown is given a probability distribution, representing degrees of belief [subjective probability]. Degrees of belief [subjective

More information

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling Due: Tuesday, May 10, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to the questions below, including

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information

Bayesian Estimation An Informal Introduction

Bayesian Estimation An Informal Introduction Mary Parker, Bayesian Estimation An Informal Introduction page 1 of 8 Bayesian Estimation An Informal Introduction Example: I take a coin out of my pocket and I want to estimate the probability of heads

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Bayes Factors for Grouped Data

Bayes Factors for Grouped Data Bayes Factors for Grouped Data Lizanne Raubenheimer and Abrie J. van der Merwe 2 Department of Statistics, Rhodes University, Grahamstown, South Africa, L.Raubenheimer@ru.ac.za 2 Department of Mathematical

More information

MCMC 2: Lecture 2 Coding and output. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham

MCMC 2: Lecture 2 Coding and output. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham MCMC 2: Lecture 2 Coding and output Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. General (Markov) epidemic model 2. Non-Markov epidemic model 3. Debugging

More information

Bayesian Methods in Multilevel Regression

Bayesian Methods in Multilevel Regression Bayesian Methods in Multilevel Regression Joop Hox MuLOG, 15 september 2000 mcmc What is Statistics?! Statistics is about uncertainty To err is human, to forgive divine, but to include errors in your design

More information

A note on Reversible Jump Markov Chain Monte Carlo

A note on Reversible Jump Markov Chain Monte Carlo A note on Reversible Jump Markov Chain Monte Carlo Hedibert Freitas Lopes Graduate School of Business The University of Chicago 5807 South Woodlawn Avenue Chicago, Illinois 60637 February, 1st 2006 1 Introduction

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information