Spatial Statistics, Chapter 4: Basics of Bayesian Inference and Computation


So far we have discussed types of spatial data, some basic modeling frameworks, and exploratory techniques. We have not discussed inference from applying models to data in any great detail (though some elements of maximum likelihood inference for random effects models have been mentioned). In this chapter we will outline Bayesian methods for statistical inference as well as the computational algorithms that are required for applying the methods to complex problems.

Once we have done this, we will discuss hierarchical spatial models and develop inferential techniques for such models in the Bayesian framework (Chapter 5, the core of the course). We will cover just the basic concepts of Bayesian inference and computation and get everyone up and running with Markov chain Monte Carlo using the WinBUGS software. Good references on Bayesian statistics: 1. Applied perspectives: Gelman, Carlin, Stern and Rubin (2004); Carlin and Louis (2000). 2. More theoretical: Robert (2001); Bernardo and Smith (1996).

Everyone is familiar with the maximum likelihood approach, which is based on a model for data f(y | θ), where θ is treated as a fixed unknown quantity of interest. Inference is then based on the notion of repeated sampling along with asymptotic arguments. In other words, we consider the distribution of the MLE θ̂ₙ induced by repeatedly sampling the data under identical conditions and, in addition, letting n → ∞. The Bayesian approach to inference is based on adding to the data model f(y | θ) a distribution for θ.

This distribution is meant to reflect external knowledge or prior opinion, for example from previous related studies. An example would be a pilot study in a clinical trial. These are usually conducted before the formal trial to obtain initial estimates of variability, which are then used in sample size calculations. This distribution for θ is known as a prior distribution and will have density π(θ | λ), where λ is a vector of parameters, known as hyperparameters, indexing a family of prior distributions.

In many situations λ is known: 1. determined from prior information; 2. deliberately set to a particular value that makes the prior vague. For example, in regression models one often sets the prior for a regression coefficient to β | λ ∼ N(0, λ) and then sets λ to a very large value, say λ = 10⁶, to provide an essentially non-informative prior.

In such cases, when λ is known, inference concerning θ is based on the posterior distribution π(θ | y, λ), which from Bayes' theorem takes the form

π(θ | y, λ) = p(y, θ | λ) / p(y | λ) = f(y | θ)π(θ | λ) / ∫ f(y | θ)π(θ | λ) dθ

a synthesis of the information contained in the likelihood and the prior. Note 1: There is no notion of repeated sampling of Y here. Inference is based strictly on the data y we have observed. Note 2: Asymptotics do not have a strong role to play. We consider the posterior distribution as determined by the sample size of the observed data. In this sense, inference is exact.
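To make the normalization concrete, here is a minimal numerical sketch (not from the notes): with a single observation, a N(θ, 1) likelihood, and a N(0, 2²) prior, the integral in the denominator can be approximated on a grid. All numerical choices here are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Grid approximation of a posterior: evaluate likelihood x prior on a grid
# and normalize numerically. Assumptions (mine): y = 1.5, sigma = 1, prior
# N(0, 2^2); the exact conjugate answer is known and serves as a check.
y = 1.5
grid = np.linspace(-10.0, 10.0, 4001)
d = grid[1] - grid[0]

unnorm = norm.pdf(y, loc=grid, scale=1.0) * norm.pdf(grid, loc=0.0, scale=2.0)
p_y = unnorm.sum() * d        # Riemann approximation to p(y) = ∫ f(y|θ)π(θ)dθ
post = unnorm / p_y           # values of π(θ|y) on the grid

post_mean = (grid * post).sum() * d
print(post_mean)              # ≈ 1.2 = y·τ²/(τ²+σ²) for these choices
```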

Note 3: Basing inference on a probability distribution for the unknowns, π(θ | y, λ), leads to straightforward inference on θ, where objects such as p-values and confidence intervals are replaced by quantities based on probability statements, which are more intuitive. If the hyperparameter λ is not known a priori, we can assign it a distribution, a hyperprior π(λ), in which case the posterior π(θ | y) is obtained as

π(θ | y) = ∫ f(y | θ)π(θ | λ)π(λ) dλ / ∫∫ f(y | θ)π(θ | λ)π(λ) dθ dλ

where we simply integrate λ out of the posterior.

Note 1: It is typical to assume Y and λ are conditionally independent given θ, so that f(y | θ, λ) = f(y | θ). Note 2: In general, Bayesian inference deals with nuisance parameters in a very straightforward way: we simply integrate them out of the posterior. In most complicated models, some components of the hyperparameter λ will be of interest as well. For example, when modeling random effects, the hyperparameter λ will be the associated variance component.

In this case we base inference for both θ and λ on their joint posterior

π(θ, λ | y) = f(y | θ)π(θ | λ)π(λ) / ∫∫ f(y | θ)π(θ | λ)π(λ) dθ dλ

Another alternative to dealing with unknown hyperparameters is to estimate them from the data. That is, we use the data to obtain λ̂, stuff this into the prior, and base inference on π(θ | y, λ̂): empirical Bayes. The estimate of λ is obtained as the maximizer of the likelihood function with θ integrated out,

λ̂ = argmax_λ p(y | λ), where p(y | λ) = ∫ f(y | θ)π(θ | λ) dθ

i.e., the usual MLE of λ. Any problems with this approach?

Replacing λ with λ̂ and proceeding as if it were known does not acknowledge the uncertainty associated with the estimation of λ̂. One of the difficulties that arises in applying Bayesian methods is the evaluation of the integrals in

π(θ | y, λ) = f(y | θ)π(θ | λ) / ∫ f(y | θ)π(θ | λ) dθ

In particular, the normalizing constant (also known as the marginal likelihood)

p(y | λ) = ∫ f(y | θ)π(θ | λ) dθ

is generally not available in closed form. This computational difficulty severely limited the application of Bayesian methods up until the early 1990s.

Since then, an increase in computing power along with the development of Markov chain Monte Carlo algorithms has led to a rapid increase in the use of Bayesian methods, especially in spatial statistics.

Example. Assume we observe a single data point Y with Y | θ ∼ N(θ, σ²), with σ² known. In this case the likelihood function is

f(y | θ) = (1/√(2πσ²)) exp(−(y − θ)²/(2σ²))

For the unknown θ, suppose we adopt a normal prior θ | µ, τ² ∼ N(µ, τ²), with µ and τ² known. The posterior in this case is

π(θ | y) = f(y | θ)π(θ | µ, τ²)/p(y | µ, τ²) ∝ f(y | θ)π(θ | µ, τ²) ∝ exp(−(y − θ)²/(2σ²)) exp(−(θ − µ)²/(2τ²))

= exp(−½ [(y − θ)²/σ² + (θ − µ)²/τ²])
= exp(−½ [τ²(y − θ)² + σ²(θ − µ)²] / (τ²σ²))
∝ exp(−½ [(τ² + σ²)θ² − 2(yτ² + µσ²)θ] / (τ²σ²))
= exp(−½ [θ² − 2θ(yτ² + µσ²)/(τ² + σ²)] / (τ²σ²/(τ² + σ²)))

Complete the square...

∝ exp(−(θ − (yτ² + µσ²)/(τ² + σ²))² / (2τ²σ²/(τ² + σ²)))

This is the kernel of a normal distribution, so

θ | y ∼ N((yτ² + µσ²)/(τ² + σ²), τ²σ²/(τ² + σ²))

Note that the posterior mean

E[θ | y] = (τ²/(τ² + σ²)) y + (σ²/(τ² + σ²)) µ

is a weighted average of the data y and the prior mean µ. The weights depend on the relative variability of the likelihood and the prior. Also note that the posterior precision (inverse variance),

1/V[θ | y] = 1/σ² + 1/τ²

is the sum of the precision of the data and the precision of the prior.

If we observe a sample y₁, ..., yₙ, then using the exact same argument and the fact that ȳ is sufficient for θ, we can show that the posterior for θ is once again normal, with

E[θ | y] = (nτ²/(nτ² + σ²)) ȳ + (σ²/(nτ² + σ²)) µ and V[θ | y] = σ²τ²/(σ² + nτ²)

What happens as n gets large? In general, for large datasets the likelihood will swamp the prior; in this way inference will be driven primarily by the observed data.
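As a quick illustration of this swamping effect, here is a sketch that evaluates the posterior mean and variance formulas above for increasing n. The hyperparameters and simulated data are invented for the example.

```python
import numpy as np

# Conjugate normal-normal update: the posterior mean is a precision-weighted
# average of ybar and the prior mean mu. Assumptions (mine): mu = 0, tau2 = 1,
# sigma2 = 4, true theta = 2.
rng = np.random.default_rng(42)
mu, tau2 = 0.0, 1.0            # prior N(mu, tau2)
sigma2, theta_true = 4.0, 2.0

for n in (1, 10, 100, 10_000):
    ybar = rng.normal(theta_true, np.sqrt(sigma2), size=n).mean()
    post_mean = (n * tau2 * ybar + sigma2 * mu) / (n * tau2 + sigma2)
    post_var = sigma2 * tau2 / (sigma2 + n * tau2)
    print(n, round(post_mean, 3), round(post_var, 5))
# As n grows the weight on ybar tends to 1 and the posterior variance to
# sigma2/n: the likelihood swamps the prior.
```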

Note 1: In the above example a conjugate prior was used for θ. A conjugate prior is a prior distribution which, when combined with a given likelihood, leads to a posterior distribution in the same family as the prior. Note 2: τ² = ∞ in the above example yields a noninformative prior, in which case π(θ | y) = N(ȳ, σ²/n); the prior has no effect on the posterior in this case. Note 3: τ² = ∞ corresponds to a flat prior π(θ) ∝ 1, which is improper.

Setting τ² to some very large value, say 10⁶, would not be strictly non-informative but would yield a vague but proper prior N(µ, 10⁶) for θ. Such vague but proper priors are often employed to achieve both approximate noninformativeness and the computational convenience of a conjugate prior. These sorts of priors have been criticized for various reasons. For example, with sparse data such a prior may not be as non-informative as one hopes. In general it is good practice to conduct a sensitivity analysis and obtain results corresponding to various forms for the prior.

Example (linear model): Suppose we observe data (Yᵢ, xᵢ), i = 1, ..., n, and we assume a linear regression model for Y of the form Y | β ∼ MVNₙ(Xβ, Σ), with Σ assumed known, and adopt the prior β ∼ MVN_p(β₀, V), with Σ, β₀ and V known. Here, interest lies in the regression coefficient β, which under the specified (conjugate) prior has posterior

β | y ∼ MVN_p(Dd, D)

where

D⁻¹ = X′Σ⁻¹X + V⁻¹ and d = X′Σ⁻¹y + V⁻¹β₀

With this posterior, one sensible estimate would be the posterior mean E[β | y] = Dd, with variation assessed using D. To obtain a non-informative prior for β we can set the prior precision to the zero matrix, V⁻¹ = 0, so that

D⁻¹ = X′Σ⁻¹X, d = X′Σ⁻¹y

If in addition we make the usual linear regression assumption Σ = σ²I, then we obtain the posterior

β | y ∼ MVN_p((X′X)⁻¹X′y, σ²(X′X)⁻¹)

The posterior mean in this case is the usual least squares estimator! In addition, the posterior covariance Cov[β | y], which we use to assess variability, is exactly the covariance of β̂ in the frequentist setting: β̂ ∼ MVN_p(β, σ²(X′X)⁻¹).

We see that adopting a non-informative prior in the Bayesian setting leads, in the linear model, to results that are formally equivalent to the usual frequentist results. The frequentist approach, in this case, is just a special case of the Bayesian approach corresponding to a particular prior.
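A short sketch of this result, using the notation above (D, d, β₀, V) on made-up data: with prior precision V⁻¹ = 0, the posterior mean Dd reproduces the least squares estimate.

```python
import numpy as np

# Conjugate linear-model posterior beta | y ~ MVN(Dd, D) with
# D^{-1} = X'Σ^{-1}X + V^{-1} and d = X'Σ^{-1}y + V^{-1}β0.
# The design, σ², and prior settings are invented for illustration.
rng = np.random.default_rng(0)
n, p, sigma2 = 50, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

Sigma_inv = np.eye(n) / sigma2        # Σ = σ² I
V_inv = np.zeros((p, p))              # non-informative: prior precision 0
beta0 = np.zeros(p)

D = np.linalg.inv(X.T @ Sigma_inv @ X + V_inv)
d = X.T @ Sigma_inv @ y + V_inv @ beta0
post_mean = D @ d

ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(post_mean, ols)                 # identical up to numerical error
```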

Bayesian Inference from the Posterior Distribution

As mentioned previously, in most real problems the posterior distribution π(θ | y) will not be available in closed form. If we do have the posterior distribution, or some estimate of it, inference regarding unknown parameters as well as prediction becomes fairly straightforward. For now, we will assume we have access to the posterior and discuss how one conducts estimation, tests hypotheses, and goes about prediction using it.

For ease of presentation I will also assume dim(θ) = 1; however, everything carries forward to the multidimensional setting in a straightforward manner. Once we have discussed the posterior quantities one uses for inference and prediction, we will discuss estimation of these quantities using Monte Carlo methods. Estimation: If we have π(θ | y) we can estimate θ using some measure of centrality derived from this distribution.
1. Posterior mean: θ̂ = E[θ | y]
2. Posterior median: θ̂ such that ∫ from −∞ to θ̂ of π(θ | y) dθ = 0.5
3. Posterior mode: θ̂ such that π(θ̂ | y) ≥ π(θ | y) for all θ ∈ Θ

Note that if π(θ | y) is symmetric and unimodal, all three are equivalent. When this is not the case we can choose one, or report all three as summaries of the posterior. Note that the posterior mode can be particularly bad in some cases, for example if the posterior distribution looks like an exponential distribution. The posterior mean is often used; however, it can be sensitive to heavy tails. The posterior median is the most robust of the three with respect to different distributional forms.

Note that in many cases, all three of these estimators will have good frequentist properties (small bias, etc.). Formally, starting with a loss function L(θ, θ̂), the Bayes estimator θ̂ is the estimator that minimizes the posterior expected value of the loss function. For squared error loss, L(θ, θ̂) = (θ − θ̂)², the Bayes estimator is the posterior mean. For absolute error loss, L(θ, θ̂) = |θ − θ̂|, the Bayes estimator is the posterior median.

Note that if one adopts a flat improper prior π(θ) ∝ 1 and the posterior is proper, then the posterior mode is exactly the maximum likelihood estimator. In the hierarchical spatial models we will consider in Chapter 5, the posterior median and mean will be used for estimation.

Interval estimation: We can use the posterior distribution to obtain interval estimates with a direct probability interpretation. Suppose q_L and q_U are two points in the parameter space such that

∫ from −∞ to q_L of π(θ | y) dθ = α/2 and ∫ from q_U to ∞ of π(θ | y) dθ = α/2

Then we have P(q_L < θ < q_U | y) = 1 − α. The interval (q_L, q_U) is called a 100(1 − α)% equal-tail credible interval for θ.

More generally, any subset C of the parameter space that satisfies

1 − α = P(θ ∈ C | y) = ∫_C π(θ | y) dθ

is called a 100(1 − α)% credible set for θ. Note that for a given continuous posterior distribution there are an infinite number of such sets C. Typically we want the one that encompasses the required 1 − α probability but also has the shortest length. Such a shortest-length set is achieved by restricting elements of C to have posterior density greater than some cutoff value k(α): highest posterior density sets.

The cutoff is chosen to be as large as possible while still maintaining the required coverage probability. When the posterior is approximately symmetric and unimodal, the equal-tail interval is typically used. Prediction: Prediction of a new observation Y₀ given data y is obtained via the posterior predictive distribution

p(y₀ | y) = ∫ f(y₀ | y, θ) π(θ | y) dθ

We can predict Y₀ using the mean of this distribution and easily form prediction intervals in the same way we formed credible intervals above.
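The following sketch illustrates the interval estimates just discussed from posterior draws (the skewed gamma-shaped "posterior" is a stand-in of my own): the equal-tail interval uses quantiles, while a simple HPD approximation takes the shortest window containing a fraction 1 − α of the sorted draws.

```python
import numpy as np

# Equal-tail vs (approximate) HPD interval from Monte Carlo samples.
rng = np.random.default_rng(1)
draws = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # skewed stand-in posterior
alpha = 0.05

eq_tail = np.quantile(draws, [alpha / 2, 1 - alpha / 2])

s = np.sort(draws)
k = int(np.ceil((1 - alpha) * len(s)))       # draws each candidate window must contain
widths = s[k - 1:] - s[: len(s) - k + 1]     # width of every window of k sorted draws
i = int(np.argmin(widths))
hpd = (s[i], s[i + k - 1])

print("equal tail:", eq_tail)
print("HPD approx:", hpd)   # shorter than the equal-tail interval for a skewed posterior
```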

Notice that this is much nicer than the frequentist method of estimating θ and plugging the estimate into the conditional distribution. As with other posterior quantities, we will use Monte Carlo methods to form predictions and prediction intervals. Hypothesis Testing: If we restrict our attention to a particular family of models f(y | θ) indexed by θ, a hypothesis is simply a statement about θ. For example, we may have H₀: β ∈ (−1, 1) for a regression coefficient, or H₀: µ₁ < µ₂ for a pair of normal means (µ₁, µ₂).

Given a posterior distribution π(θ | y), we can evaluate the hypothesis of interest by evaluating its posterior probability

P(H₀ | y) = ∫_{H₀} π(θ | y) dθ

with small values giving evidence in favor of the alternative. This posterior probability is easily conveyed and interpreted by subject-matter specialists and is far more meaningful than a p-value. There are issues, however, with the prior distribution. The posterior probability will necessarily be a function of the prior probability π(H₀). It may not be clear how to choose this; however, a default choice is the fair prior π(H₀) = 0.5.

In addition, testing point null hypotheses such as H₀: β = 0 requires special care, since such a hypothesis will have posterior probability zero under a continuous prior. An easier way of assessing such point-null hypotheses is to form the 100(1 − α)% credible interval for β and to simply check whether the null value is contained in the interval; this is what is usually done. Model selection: Choosing between competing models for data is a very important statistical task. In the likelihood setting, one usually uses a likelihood ratio or a score test to compare models. What are the limitations of these?

Models must be nested. That is, one of the competing models must be a submodel of the other, for example a submodel obtained by setting one of the parameters in the model to a particular value. In addition, the submodel cannot correspond to a value on the boundary of the parameter space. If these conditions don't hold, the usual asymptotics for these test statistics break down.

In general, we would like to compare nonnested models: 1. compare proportional hazards models to accelerated failure time models; 2. compare different link functions for generalized linear models; 3. compare different distributions for random effects in mixed models. Also, values of the parameter that lie on the boundary of the parameter space can sometimes correspond to very interesting submodels that we would like to test for. Unfortunately, even in the Bayesian framework there is no universally accepted technique for doing model selection.

The traditional Bayesian method for comparing models is through the use of posterior model probabilities and Bayes factors. Suppose we have two competing (not necessarily nested) models M₁ and M₂, with parameters θ₁ and θ₂ respectively and associated prior densities πᵢ(θᵢ), i = 1, 2. To complete the prior specification we assign to each model a prior probability π(Mᵢ), i = 1, 2. Again, a default choice is usually π(Mᵢ) = 0.5.

To choose between models we can calculate the posterior model probabilities, obtained from Bayes' theorem as

P(Mᵢ | y) = p(y | Mᵢ)π(Mᵢ) / [p(y | M₁)π(M₁) + p(y | M₂)π(M₂)]

In order to calculate these we need the marginal distributions for each model, p(y | Mᵢ), which are obtained as

p(y | Mᵢ) = ∫ f(y | θᵢ, Mᵢ)πᵢ(θᵢ) dθᵢ, i = 1, 2

where f(y | θᵢ, Mᵢ) is the likelihood function under model Mᵢ. The posterior probabilities are summarized using the Bayes factor.

The Bayes factor, BF, is an odds ratio: the ratio of the posterior odds of M₁ to the prior odds of M₁,

BF = [P(M₁ | y)/P(M₂ | y)] / [P(M₁)/P(M₂)]

which using Bayes' theorem is just

BF = p(y | M₁) / p(y | M₂)

a ratio of the marginal likelihoods under the two models. If BF > 1 then the posterior odds of M₁ are greater than the prior odds of M₁. That is, having observed the data, the odds of M₁ have increased: the data favor M₁.
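In the conjugate normal example from earlier, the marginal likelihood is available in closed form, p(y) = N(y; µ, σ² + τ²), so an exact Bayes factor can be computed. The sketch below (the two models and all numbers are my own illustration, not from the notes) compares a point null against a diffuse alternative.

```python
from scipy.stats import norm

# Bayes factor from closed-form marginal likelihoods. Assumptions (mine):
# M1 is the point null theta = 0, so p(y | M1) = N(y; 0, sigma2);
# M2 puts a N(0, tau2) prior on theta, so p(y | M2) = N(y; 0, sigma2 + tau2).
y, sigma2, tau2 = 1.8, 1.0, 4.0

m1 = norm.pdf(y, loc=0.0, scale=sigma2 ** 0.5)            # p(y | M1)
m2 = norm.pdf(y, loc=0.0, scale=(sigma2 + tau2) ** 0.5)   # p(y | M2)
print(m1 / m2)   # BF < 1 here: the data favor the diffuse alternative M2
```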

This is a very elegant way of comparing two models; unfortunately there are some difficulties that hamper the use of the BF for model comparison. First, the calculation of the marginal likelihoods p(y | Mᵢ) is difficult. This is the normalizing constant in the posterior distribution of θᵢ in model Mᵢ. A shortcut method for approximating the Bayes factor is the BIC, which for a given model Mᵢ is defined as

BIC = −2 l_{Mᵢ}(θ̂ᵢ) + p log(n)

a fit-plus-penalty model selection tool, where l_{Mᵢ}(θᵢ) is the log-likelihood for model Mᵢ and θ̂ᵢ is the MLE of θᵢ.

Models with lower BIC are preferred. In addition, the difference in the BIC scores of two models is asymptotically equal to minus twice the log of the Bayes factor comparing the two models,

ΔBIC ≈ −2 log BF

assuming both models are a priori equally likely. So the Bayes factor comparing two models can be approximated by calculating the corresponding BICs. Unfortunately, this approximation is not valid for random effects models.

Another serious problem with Bayes factors, besides their computation, is that they are not well defined when non-informative priors are employed, even if the corresponding posterior is proper. This is due to the fact that the marginal likelihood

p(y | Mᵢ) = ∫ f(y | θᵢ, Mᵢ)πᵢ(θᵢ) dθᵢ

is not well defined when πᵢ(θᵢ) is improper. Similar to the BIC, another fit-plus-penalty model selection tool that we have already seen is the AIC,

AIC = −2 l_{Mᵢ}(θ̂ᵢ) + 2p

which replaces log(n) in the penalty by 2 and is therefore more liberal whenever n > e².
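For concreteness, a tiny sketch of the two criteria as defined above; the log-likelihood values below are hypothetical.

```python
import numpy as np

# Fit-plus-penalty criteria: AIC = -2*loglik + 2p, BIC = -2*loglik + p*log(n).
def aic(loglik, p):
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    return -2.0 * loglik + np.log(n) * p

# two hypothetical fitted models on the same n = 200 observations
print(aic(-321.4, p=3), bic(-321.4, p=3, n=200))
print(aic(-318.9, p=6), bic(-318.9, p=6, n=200))
# AIC penalizes each extra parameter by 2, BIC by log(200) ≈ 5.3,
# so BIC is the more conservative criterion here (n > e²).
```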

Unfortunately the AIC, just like the BIC, is not appropriate for comparing models with random effects. Why are these tools inappropriate for such models? The raw number of parameters, p, that appears in the penalty of both the AIC and the BIC is not an appropriate way to measure complexity in random effects models, where the parameters are correlated and hence the effective number of parameters will generally be less than p. To address this concern, Spiegelhalter et al. (2002) proposed a generalization of the AIC known as the deviance information criterion (DIC), which is more appropriate for doing model selection with Bayesian hierarchical models.

This generalization is based on the deviance statistic, which for a given model takes the form

D(θ) = −2 log f(y | θ) + 2 log h(y)

where f(y | θ) is the likelihood function under the model and h(y) is some standardizing function of the data alone. Typically, we set h(y) = 1 for convenience. To assess the fit of the model we consider the posterior mean of the deviance, D̄ = E[D(θ) | y], which is smaller for better-fitting models. To penalize this measure of fit, we measure a given model's complexity by the effective number of parameters p_D, defined by

p_D = E[D(θ) | y] − D(θ̄)

where θ̄ is the posterior mean of θ.

Is p_D always positive? Why or why not? This penalty counts the number of parameters while also accounting for correlation or shrinkage in the parameters. For example, in a model with spatially correlated random effects, we may have n random effects contributing to the parameter count. Rather than simply adding n to the total parameter count, the p_D penalty accounts for the fact that the random effects are spatially correlated, so the effective number added to the penalty due to the random effects will be less than n.

Note that unlike the penalty term used in the AIC and BIC, this penalty will also depend on the observed data. If the data encourage a great deal of shrinkage in the random effects, the p_D value, and hence the penalty, will be lower. Adding p_D to the posterior mean of the deviance yields the DIC,

DIC = D̄ + p_D = 2E[D(θ) | y] − D(θ̄)

with smaller values being preferred. For non-hierarchical models (linear regression models, GLMs, parametric survival models), the value of p_D will be very close to the actual parameter count and the DIC will essentially be equal to the AIC.
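A sketch of the DIC computation from posterior draws, taking h(y) = 1 so that D(θ) = −2 log f(y | θ). To keep the snippet self-contained, the "MCMC output" is drawn directly from a known conjugate posterior; in practice these draws would come from a sampler.

```python
import numpy as np
from scipy.stats import norm

# DIC = Dbar + p_D = 2*Dbar - D(theta_bar), with p_D = Dbar - D(theta_bar).
rng = np.random.default_rng(7)
y = rng.normal(2.0, 1.0, size=50)          # data, sigma = 1 known
# stand-in posterior draws for theta: N(ybar, 1/n) under a vague prior
draws = rng.normal(y.mean(), np.sqrt(1.0 / len(y)), size=5_000)

def deviance(theta):
    return -2.0 * norm.logpdf(y, loc=theta, scale=1.0).sum()

dbar = np.mean([deviance(t) for t in draws])   # posterior mean deviance
dhat = deviance(draws.mean())                  # deviance at the posterior mean
p_d = dbar - dhat                              # effective number of parameters
print(p_d, dbar + p_d)                         # p_D ≈ 1 here; DIC = Dbar + p_D
```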

The DIC is a very popular model selection tool, primarily because it is easy to calculate (using output from the Gibbs sampler) and is automatically computed by the WinBUGS MCMC software. There is no theoretical justification for its use outside the exponential family class of models; nevertheless, it has been used for model selection in far more complicated situations and tends to work fairly well. The lack of a theoretical justification implies that the DIC, in many situations, can only be considered a quick-and-dirty model selection tool used to rank a set of candidate models. My feeling is that the DIC tends to be too liberal in model selection, generally choosing more complicated models than perhaps necessary. I use it a lot anyways...

Another problem with the DIC is that it is not invariant to re-parametrization. Change the parametrization of your model and the DIC for the model will change! Posterior predictive loss criteria: Another Bayesian model selection tool, developed by Gelfand and Ghosh (1998). The posterior predictive loss criterion focuses on model evaluation based on the notion of prediction. In particular, we consider a hypothetical replicate data set Y_rep drawn under the same conditions (i.e. same positions, same covariate values, etc.) as the observed data, and we assume that Y_rep is conditionally independent of Y given θ.

The criterion has a decision-theoretic foundation, where the action of interest is prediction of Y_rep using the posterior predictive distribution of Y_rep,

f(y_rep | y) = ∫ f(y_rep | θ)π(θ | y) dθ

One chooses a loss function that penalizes predictions in some way, and the selected models are those that perform well under the given loss function. Under squared error loss, the criterion penalizes actions (predictions of Y_rep) when the predictive mean E[Y_rep,i | y], under a given model, departs from the observed data (predictive bias), and also penalizes actions when the predictive variability V[Y_rep,i | y] for a given observation is high.

In other words, under squared error loss, the criterion chooses models that have small mean squared error of prediction with respect to the replicate data Y_rep. The criterion can be written as the sum of two terms, D = G + P, where

G = Σᵢ₌₁ⁿ (E[Y_rep,i | y] − y_obs,i)² and P = Σᵢ₌₁ⁿ V[Y_rep,i | y]

Models that have a small combination of predictive bias and variance (with respect to replicate data) will yield small D values; these are the preferred models under this criterion. For a given model, we can estimate E[Y_rep,i | y] and V[Y_rep,i | y] using the Monte Carlo methods we will discuss down the road. This criterion is not as popular as the DIC. This is likely due to the fact that it is not automatically calculated in WinBUGS (but it is easily calculated!) and requires the choice of a loss function.
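A sketch of estimating G and P from predictive samples, anticipating the Monte Carlo methods discussed later; the data and the stand-in posterior draws are simulated so the snippet runs on its own.

```python
import numpy as np

# Posterior predictive loss D = G + P under squared error loss.
# y_rep is a (J x n) array of replicate data sets, one per posterior draw.
rng = np.random.default_rng(3)
n, J = 40, 2_000
y_obs = rng.normal(1.0, 1.0, size=n)

theta_draws = rng.normal(y_obs.mean(), 1.0 / np.sqrt(n), size=J)  # stand-in posterior
y_rep = rng.normal(theta_draws[:, None], 1.0, size=(J, n))        # replicate data

G = np.sum((y_rep.mean(axis=0) - y_obs) ** 2)   # predictive bias term
P = np.sum(y_rep.var(axis=0))                   # predictive variance term
print(G, P, G + P)                              # smaller D = G + P is better
```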

Since it is based on the posterior predictive distribution and the notion of prediction, using this criterion for model selection is quite natural if prediction is an eventual goal. In general it seems to have a stronger theoretical foundation than the DIC.

The Basics of Bayesian Computation

Although the posterior π(θ | y) provides all of the information concerning θ, simply writing it down is not enough:

π(θ, λ | y) = f(y | θ)π(θ | λ)π(λ) / ∫∫ f(y | θ)π(θ | λ)π(λ) dθ dλ

The normalizing constant in this distribution will typically be analytically intractable. In addition, we will usually be interested in the marginal distributions π(θᵢ | y) of certain components of θ,

π(θᵢ | y) = ∫∫ f(y | θ)π(θ | λ)π(λ) dλ dθ₋ᵢ / ∫∫ f(y | θ)π(θ | λ)π(λ) dθ dλ

where θ₋ᵢ denotes the parameter vector θ with the i-th component θᵢ removed. For example, the marginal distribution π(θᵢ | y) is required to construct a 100(1 − α)% credible interval for θᵢ. The problem we have with obtaining the posterior and any of its marginals is essentially one of integration.

To implement Bayesian methods, we need ways of approximating high-dimensional integrals. There are several ways to approach this problem: 1. Asymptotic approximations: Laplace approximation, Bayesian CLT. 2. Quadrature: methods like Simpson's rule, etc. 3. Importance sampling: a method for generating independent realizations from the posterior; hard to implement in high-dimensional problems (many parameters). 4. Markov chain Monte Carlo (MCMC): methods for generating dependent realizations from the posterior; easy to implement in many high-dimensional problems.

We will only discuss MCMC, since it is essentially the only one of these methods applicable to spatial statistics. Note that Laplace approximations have been used in implementing frequentist techniques for spatial statistics (penalized quasi-likelihood), but we will not discuss these.

Posterior inference through simulation: Even though we cannot access the posterior distribution in its analytic form, if it were possible to simulate independent realizations θ^(1), θ^(2), ..., θ^(J) from the posterior distribution,

θ^(j) iid ∼ π(θ | y), j = 1, ..., J

we could use these simulated realizations, called Monte Carlo samples, to summarize the posterior distribution, estimate posterior quantities of interest, and hence conduct inference based on these simulations. To see this, it is helpful to consider the duality between a probability density function and a histogram of a set of random draws from that distribution.

Given a large enough sample θ^(1), ..., θ^(J), the histogram based on θ^(1), ..., θ^(J) can provide essentially complete information about the density. Consider approximating the density of a Weibull(ρ, λ) distribution with shape ρ = 2 and scale λ = 1. The density can be approximated by the histogram obtained from simulated draws from this distribution.

[Figure: four histograms of simulated Weibull(2, 1) draws (panels for 10 draws, 100 draws, and two larger sample sizes), each on a density scale; the histogram approaches the true density as the number of draws increases.]

Using a sample θ^(1), ..., θ^(J) simulated from the posterior distribution, we can estimate the various moments of the posterior, percentiles, and other summary statistics; indeed, we can estimate any aspect of the posterior distribution, to a level of precision which can itself be estimated.

For example:
1. Ê[θᵢ | y] = (1/J) Σⱼ₌₁ᴶ θᵢ^(j)
2. P̂(θᵢ ∈ D | y) = (1/J) Σⱼ₌₁ᴶ I{θᵢ^(j) ∈ D}
3. P̂(θᵢ > θₖ | y) = (1/J) Σⱼ₌₁ᴶ I{θᵢ^(j) > θₖ^(j)}
4. 95th percentile of π(θᵢ | y): use the 0.95J-th order statistic of θᵢ^(1), ..., θᵢ^(J)
5. The DIC for a model with deviance D(θ) = −2 log f(y | θ) is estimated as DIC^ = 2 (1/J) Σⱼ₌₁ᴶ D(θ^(j)) − D((1/J) Σⱼ₌₁ᴶ θ^(j))
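A sketch of these simulation-based summaries on a made-up matrix of posterior draws with two components, θ₁ and θ₂:

```python
import numpy as np

# Monte Carlo posterior summaries from a (J x 2) matrix of draws.
# The correlated bivariate normal "posterior" is invented for illustration.
rng = np.random.default_rng(11)
J = 50_000
draws = rng.multivariate_normal([0.0, 0.5], [[1.0, 0.3], [0.3, 1.0]], size=J)
th1, th2 = draws[:, 0], draws[:, 1]

print(th1.mean())                        # 1. E[theta_1 | y]
print(np.mean((th1 > -1) & (th1 < 1)))   # 2. P(theta_1 in (-1, 1) | y)
print(np.mean(th1 > th2))                # 3. P(theta_1 > theta_2 | y)
print(np.quantile(th1, 0.95))            # 4. 95th percentile of pi(theta_1 | y)
```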

In addition, the marginal posterior density π(θᵢ | y) can be estimated using the histogram of the simulated values θᵢ^(1), θᵢ^(2), ..., θᵢ^(J), or a smooth estimate can be obtained using a kernel density estimate based on the simulated values. The level of precision depends only on the Monte Carlo sample size and can be increased to an arbitrarily high level simply by increasing the size of the Monte Carlo sample. We usually drop the hat notation for these simulation-based estimators under the assumption that a very large Monte Carlo sample size has been used.

Note that provided the draws θ^(1), ..., θ^(J) are iid from π(θ | y), the weak law of large numbers guarantees that simulation-based estimators such as

Ê[θᵢ | y] = (1/J) Σⱼ₌₁ᴶ θᵢ^(j) and P̂(θᵢ ∈ D | y) = (1/J) Σⱼ₌₁ᴶ I{θᵢ^(j) ∈ D}

are all simulation consistent. In other words, they converge in probability to the actual posterior quantities as the Monte Carlo sample size J → ∞. Unfortunately, generating independent draws from a possibly high-dimensional and nonstandard distribution π(θ | y) can be very difficult.

This is where Markov chain Monte Carlo comes in... initially developed by Metropolis et al. (1953) for applications in statistical physics, extended by Hastings (1970), and applied to image restoration problems by Geman and Geman (1984), where the Gibbs sampler, a special case of Metropolis-Hastings, was developed. These algorithms remained virtually unnoticed by statisticians until Gelfand and Smith (1990), in a landmark paper, observed that they could be exploited in Bayesian statistics.

Idea behind MCMC: Simulate realizations from a Markov chain θ^(1) → θ^(2) → θ^(3) → θ^(4) → ... → θ^(k) → ... that converges to a unique stationary or limiting distribution that is the posterior distribution of interest. Once we are able to simulate realizations from a Markov process whose limiting distribution is π(θ | y), we run the simulation long enough, say for t₀ iterations, θ^(1) → θ^(2) → θ^(3) → θ^(4) → ... → θ^(t₀), so that future draws θ^(t₀+1), θ^(t₀+2), ..., θ^(t₀+J) constitute a sample of size J from the true posterior distribution.

For a given posterior π(θ | y), there are a variety of ways to construct such a chain; moreover, to do so, we only need to know the posterior distribution up to a normalizing constant. That is, the simulation of such a Markov chain requires only the ability to evaluate f(y | θ)π(θ), without knowledge of the normalizing constant ∫ f(y | θ)π(θ) dθ. Given the samples from the Markov chain (after it has reached its stationary distribution), θ^(t₀+1), θ^(t₀+2), ..., θ^(t₀+J), we can construct simulation-consistent estimates of posterior quantities as before.

Note 1: The sampled values {θ^(t₀+1), θ^(t₀+2), ..., θ^(t₀+J)} are not independent; they are draws from a Markov chain. Note that a dependent sample contains less information about the posterior than a sample of independent values. The precision of our Monte Carlo estimates will therefore depend on (I) the Monte Carlo sample size J, as in the case of independent samples, and (II) the level of dependence or autocorrelation in the sampled values. If the level of autocorrelation in the sampled values is high, then we will need a larger value of J to produce a reliable posterior summary.

Note 2: It is critical for us to be able to decide when the Markov chain has converged to its stationary distribution, π(θ | y). That is, we need some way of determining the value of the integer t₀ such that for every k > t₀ the sampled values θ^(k) are draws from the stationary distribution of the Markov chain. The initial set of sampled values {θ^(0), θ^(1), ..., θ^(t₀)} is a transient phase of the Markov chain simulation and is known as the burn-in period.

We do not use these initial sampled values for posterior inference, as we cannot be sure that they come from the posterior distribution; they are discarded. Examining the sequence of simulated values θ^(0), θ^(1), ... and determining when the chain has converged to its stationary distribution (in other words, determining the number of initial values to discard as burn-in) is known as convergence assessment.

Given the above notes, we will answer the following questions in order: 1. How exactly do I construct and simulate from a Markov chain that has π(θ | y) as its stationary distribution? Gibbs sampling, Metropolis, and Metropolis-Hastings algorithms. 2. How do I assess convergence of the chain to the posterior distribution? 3. Since the posterior samples are correlated, how do I assess the quality of my MCMC-based posterior estimates? In other words, how can I assess the Monte Carlo variance of the posterior summary I have produced?

Gibbs Sampling

The simplest and most widely used of all MCMC algorithms. Suppose our posterior distribution [θ | y] is k-dimensional, where k could be (and usually is in spatial models) large: θ = (θ₁, ..., θₖ). For any component θᵢ, we define the full conditional distribution as

[θᵢ | θ₁, ..., θᵢ₋₁, θᵢ₊₁, ..., θₖ, y] = [θᵢ | θ₋ᵢ, y]

the distribution of θᵢ conditional on y and all other components of θ.

The Markov chain constructed through Gibbs sampling generates realizations from the posterior [θ | y] by iteratively generating realizations from the full conditional distributions. Initialization: We begin the procedure by choosing an initial or starting value for θ, θ^(0) = (θ₁^(0), θ₂^(0), ..., θₖ^(0)). The starting value θ^(0) can be any point in the parameter space over which the posterior distribution is defined.

Simulation: T iterations of the Gibbs sampler produce realizations θ^(0) → θ^(1) → ... → θ^(T) by repeating the following process for i = 1, ..., T:
1. generate θ₁^(i) ∼ [θ₁ | θ₂^(i−1), θ₃^(i−1), ..., θₖ^(i−1), y]
2. generate θ₂^(i) ∼ [θ₂ | θ₁^(i), θ₃^(i−1), ..., θₖ^(i−1), y]
3. generate θ₃^(i) ∼ [θ₃ | θ₁^(i), θ₂^(i), θ₄^(i−1), ..., θₖ^(i−1), y]
...
k. generate θₖ^(i) ∼ [θₖ | θ₁^(i), θ₂^(i), ..., θ_{k−1}^(i), y]

At iteration i of the simulation, the vector θ^(i) is drawn in an iterative fashion by simulating each of its components θⱼ^(i), j = 1, ..., k, from the corresponding full conditional distribution

[θⱼ | θ₁^(i), ..., θ_{j−1}^(i), θ_{j+1}^(i−1), ..., θₖ^(i−1), y]

This procedure produces a Markovian sequence with [θ | y] as its stationary distribution. Convergence to the stationary distribution is guaranteed for a very wide class of models. Note 1: It is not obvious why this works.
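As a concrete example (my own, not one worked in the notes), here is a two-component Gibbs sampler for normal data with unknown mean and variance under semi-conjugate priors µ ∼ N(m₀, v₀) and σ² ∼ InvGamma(a, b); both full conditionals are standard, so every update is a direct draw.

```python
import numpy as np

# Gibbs sampler for (mu, sig2) in the model y_i ~ N(mu, sig2).
# Hyperparameters and data are illustrative only.
rng = np.random.default_rng(5)
y = rng.normal(2.0, 1.5, size=100)
n, ybar = len(y), y.mean()
m0, v0, a, b = 0.0, 100.0, 2.0, 2.0

T = 5_000
mu, sig2 = 0.0, 1.0                          # arbitrary starting values
samples = np.empty((T, 2))
for t in range(T):
    # [mu | sig2, y] is normal
    prec = 1.0 / v0 + n / sig2
    mean = (m0 / v0 + n * ybar / sig2) / prec
    mu = rng.normal(mean, np.sqrt(1.0 / prec))
    # [sig2 | mu, y] is inverse gamma: draw a gamma and invert
    sig2 = 1.0 / rng.gamma(a + n / 2.0, 1.0 / (b + 0.5 * np.sum((y - mu) ** 2)))
    samples[t] = mu, sig2

burn = 1_000                                 # discard the burn-in period
print(samples[burn:, 0].mean(), samples[burn:, 1].mean())
```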

Note 2: The Gibbs sampler reduces the problem of simulating from a high-dimensional distribution [θ | y] to that of iteratively simulating from a sequence of lower-dimensional (typically 1-dimensional) distributions. Each component θⱼ of θ need not be 1-dimensional; however, this is often the case. If θⱼ is not 1-dimensional, it is typically of low dimension. Simulating from low-dimensional full conditional distributions is typically easy. Since they are of low dimension, we can easily find the associated normalizing constant of each; however, knowledge of this normalizing constant is often not required.

Techniques used to simulate from full conditional distributions: 1. If the full conditional is a standard distribution (gamma, normal, etc.), we can simulate from it directly. 2. Rejection sampling: requires finding an envelope distribution. 3. Adaptive rejection sampling (ARS): requires log-concavity of the density associated with the full conditional distribution. 4. Slice sampling and Metropolis-Hastings, which we will discuss later.

Note 3: If X, Y, Z are random vectors, then in order to simulate realizations Z^(1), ..., Z^(J) from [Z | y], it is sufficient to simulate realizations (Z^(1), X^(1)), ..., (Z^(J), X^(J)) from [Z, X | y] and then simply consider the Z components of these realizations. Once we have a sample θ^(t₀+1), ..., θ^(t₀+J) from the posterior [θ | y], a sample from the marginal posterior of any component of θ, say [θⱼ | y], is obtained simply by considering the corresponding components of the simulated values: θⱼ^(t₀+1), ..., θⱼ^(t₀+J).

Note 4: The Gibbs sampler is guaranteed to converge for any starting value θ^(0). Regardless of the initial value, the chain will eventually forget its initial state and converge to the stationary distribution. The choice of initial value is still important... if the initial value is located in a region of high posterior mass, the chain will converge quickly; however, an initial value with low posterior mass may require a longer burn-in period. In practice one usually runs several chains initialized from different starting values and compares the resulting sequences to assess convergence.

Upon convergence, regardless of their starting values, all chains will be sampling from the same distribution. Note 5: Deriving full conditional distributions is straightforward. Suppose we want [θᵢ | θ₋ᵢ, y], which is, of course, characterized by the corresponding density (or pmf) f(θᵢ | θ₋ᵢ, y):

f(θᵢ | θ₋ᵢ, y) = f(θᵢ, θ₋ᵢ, y) / f(θ₋ᵢ, y) = f(θ, y) / f(θ₋ᵢ, y)

In the context of the full conditional distribution, both θ₋ᵢ and y are known, so that

f(θᵢ | θ₋ᵢ, y) ∝ f(θ, y) = f(y | θ)π(θ)

The full conditional density (pmf) is just proportional to likelihood × prior. So we write down f(y | θ)π(θ) and view it as a function of θᵢ to obtain the full conditional density (pmf) up to a normalizing constant. If we recognize this as the kernel of some standard distribution, we can simulate from it easily. Otherwise we can use one of the aforementioned techniques (rejection sampling, ARS, slice sampling, or MH), which only require knowledge of the full conditional density up to a normalizing constant.

Note 6: Situations where Gibbs sampling does not work well: 1. Multimodal posterior distributions: the Markov chain can spend too much time in one of the modes before tunneling into the other modes. After an initial burn-in sample has been discarded, Gibbs sampling in this setting will need very long simulation runs in order for the posterior space to be adequately explored. 2. Weakly identified models in conjunction with vague prior distributions and sparse data will usually produce weird results.

Note 7: Missing data is handled very nicely within a Gibbs sampling framework. Suppose our data y can be partitioned into two components, y = (y_obs, y_mis): one we have actually observed and the other we have not.

In this case y is not the observed data; rather, it is a hypothetical set of complete data, and there are two reasons for considering this: 1. The likelihood f(y | θ) corresponding to y = (y_obs, y_mis) is much simpler in form than the likelihood corresponding to the data we have observed,

f(y_obs | θ) = ∫ f(y | θ) dy_mis

and so we consider the complete data in an attempt to work with the simpler likelihood function: data augmentation. 2. We may be interested in conducting inference on the missing data itself.

In either case we use the Gibbs sampler to simulate from the joint posterior [θ, y_mis | y_obs], which then yields samples from the marginal posterior [θ | y_obs], which we use for conducting inference on θ, and also yields samples from [y_mis | y_obs], which we can use for conducting inference on the missing data (if this is of interest). Why is this easy to do within a Gibbs sampler? Consider the density of the joint posterior:

π(θ, y_mis | y_obs) ∝ π(θ, y_mis, y_obs) = f(y_mis, y_obs | θ)π(θ)

To simulate from this distribution using the Gibbs sampler, we work iteratively with the conditional distributions

θ ∼ [θ | y_mis, y_obs] = [θ | y] and y_mis ∼ [y_mis | y_obs, θ]

This allows us to work with the complete-data likelihood when drawing from [θ | y_mis, y_obs] = [θ | y], and y_mis is essentially treated as an additional parameter, drawn from its full conditional distribution at each iteration.

Note 8: Prediction within a Gibbs sampling framework is straightforward. Recall that we base prediction of Y_new on the posterior predictive distribution, whose density (pmf) can be written as

f(y_new | y) = ∫ f(y_new | θ, y)π(θ | y) dθ

which, under the assumption that Y_new is conditionally independent of Y given θ, is given by

f(y_new | y) = ∫ f(y_new | θ)π(θ | y) dθ

If we have samples θ^(1), θ^(2), ..., θ^(J) from the posterior distribution (obtained using the Gibbs sampler!), we can generate one-for-one samples:
1. generate y_new^(1) ∼ f(y_new | θ^(1))
2. generate y_new^(2) ∼ f(y_new | θ^(2))
3. generate y_new^(3) ∼ f(y_new | θ^(3))
...
J. generate y_new^(J) ∼ f(y_new | θ^(J))
In which case y_new^(1), ..., y_new^(J) are a sample from the predictive density f(y_new | y). We can use these predictive samples to form predictions, prediction intervals, etc.
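A sketch of this one-for-one predictive sampling; the posterior draws of (µ, σ²) below are stand-ins generated directly rather than taken from a Gibbs run, purely to keep the snippet self-contained.

```python
import numpy as np

# One predictive draw per posterior draw of theta = (mu, sig2).
rng = np.random.default_rng(9)
J = 10_000
mu_draws = rng.normal(2.0, 0.15, size=J)             # stand-in posterior draws
sig2_draws = 1.0 / rng.gamma(50.0, 1.0 / 100.0, size=J)

y_new = rng.normal(mu_draws, np.sqrt(sig2_draws))    # one y_new per theta draw

print(y_new.mean())                                  # point prediction
print(np.quantile(y_new, [0.025, 0.975]))            # 95% prediction interval
```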

The Metropolis-Hastings Algorithm

Implementing the Gibbs sampler is, in principle, fairly straightforward: we only need to sample from the full conditional distributions [θᵢ | θ₋ᵢ, y], i = 1, ..., k. When, for at least one i, the full conditional [θᵢ | θ₋ᵢ, y] is not of standard form, we need some way to sample from this distribution in order to implement the Gibbs sampler. Note that we will always have the density of the full conditional distribution up to a normalizing constant, as this is simply proportional to the portion of f(y | θ)π(θ) that involves θᵢ.

The Metropolis-Hastings algorithm will be useful in these situations. This is an MCMC algorithm that allows one to generate (dependent) samples from an arbitrary distribution and requires only knowledge of the corresponding density up to a normalizing constant. First I will describe the algorithm in its most general form, for simulating realizations of some random vector X from a multivariate distribution with density f(x) ∝ h(x); we call this the target distribution. Then I will describe how it is applied, very usefully, for dealing with non-standard full conditional distributions within a Gibbs sampler.

Note that the target distribution f(x) ∝ h(x) can represent virtually any distribution one is interested in: a posterior distribution, a full conditional distribution, or some other distribution that may be of interest in non-Bayesian applications. As before, the algorithm will generate a Markov chain X^(1) → X^(2) → ... → X^(t) → ... having f(x) ∝ h(x) as its stationary or limiting distribution. The algorithm is a rejection algorithm: we generate a proposed value X* from some candidate distribution and either accept or reject this proposed value with a certain probability.

To implement this algorithm we need to specify a candidate distribution from which to draw proposals. At iteration t, the density of the candidate distribution can depend on the previous value of the chain, X^(t−1), in some way, and we let q(x* | x^(t−1)) denote this density. Given a candidate (or proposal) distribution q(x* | x^(t−1)) and an initial value X^(0), the MH algorithm proceeds for t = 1, 2, 3, ... as follows:

1. Draw X* ∼ q(· | x^(t−1)).
2. Compute the acceptance ratio

r = [f(x*) q(x^(t−1) | x*)] / [f(x^(t−1)) q(x* | x^(t−1))]

3. Accept the candidate X* as the new state of the chain, X^(t), with probability p = min{1, r}. That is, if the acceptance ratio r ≥ 1 we set X^(t) = X*; otherwise we set X^(t) = X* with probability r and X^(t) = X^(t−1) with probability 1 − r.

Note that the normalizing constant associated with the density f(·) cancels in the acceptance ratio, so that we have

r = [h(x*) q(x^(t−1) | x*)] / [h(x^(t−1)) q(x* | x^(t−1))]

and we only require h(·) to implement the algorithm. Note that unlike the Gibbs sampler, the Metropolis-Hastings algorithm does not necessarily change its state in every iteration. If the proposal at iteration t is rejected, then X^(t−1) and X^(t) are identical. Implementation of the MH algorithm requires the choice of a candidate distribution q(x* | x^(t−1)) from which to draw proposals.

Amazingly, the algorithm will converge to the stationary distribution for virtually any choice of proposal distribution, in theory. In practice, the choice of proposal distribution has an important effect on the performance of the algorithm. Types of proposals useful in practice: 1. Symmetric, q(x* | x^(t−1)) = q(x^(t−1) | x*): the Metropolis algorithm. 2. Proposal independent of the current state, q(x* | x^(t−1)) = q(x*): the Hastings independence chain.

Of these, the most common option is to use what is known as a random walk Metropolis proposal. Assuming the sample space of the target distribution [X] is ℝᵏ, the proposal at iteration t is formed as

X* = X^(t−1) + e, where e ∼ MVN(0, Σ)

and Σ is pre-specified so that the sampler performs well in terms of its overall acceptance rate.

Note that this leads to a Metropolis algorithm, as the associated proposal distribution is X* ∼ MVN(X^(t−1), Σ), which has a symmetric density. If the sample space of the target [X] is not ℝᵏ, the above random walk proposal can be employed after applying a suitable transformation (homework). The performance of the MH algorithm is often measured by its empirical acceptance rate. That is, we run the algorithm for some fixed number of iterations and determine the proportion of proposed moves that were accepted.

What acceptance rate would we like the algorithm to have? An acceptance rate of 1 is definitely not optimal. A high acceptance rate, say above 0.9, may be indicative of an overly narrow candidate distribution. In this case the algorithm will only propose minor departures from the current state. These minor departures typically have high acceptance probability; however, the chain can take a very long time to explore the entire posterior, as it will move around the parameter space at a snail's pace (a diagram would be helpful here): high autocorrelation in the chain.

On the other hand, if our candidate distribution proposes very bold moves, the proposed values may lie in the tails of the target distribution and will almost certainly be rejected. In such cases, the chain can get stuck in the same state for a very long time, X^(j) = X^(j+1) = ... = X^(j+100) = ..., which is obviously undesirable. The ideal candidate distribution will yield an empirical acceptance rate that lies somewhere in the middle of these two extremes; between 20 and 50 percent seems to work well in practice (rough guidelines).

Such an ideal candidate distribution is often chosen adaptively: we tune the algorithm. For example, let's assume we want to sample from an arbitrary 1-dimensional distribution f(x) ∝ h(x) that has support on the entire real line. In this case, a random walk Metropolis algorithm will propose, at the t-th iteration,

X* = X^(t−1) + e, with e ∼ N(0, σ²)

and the corresponding acceptance ratio will be

r = f(x*)/f(x^(t−1)) = h(x*)/h(x^(t−1))

In this case, choosing an appropriate candidate distribution boils down to choosing an appropriate value for σ²; we tune the algorithm to determine this value. Start by picking some initial value of σ², and then keep track of the empirical proportion of candidates that are accepted over some small number of iterations, say 100. If this fraction is too high (say above 75 percent) we increase σ² (why?), and if it is too low (say below 20 percent) we decrease σ², and continue until a reasonable acceptance rate is obtained.
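A sketch of such a 1-dimensional random walk Metropolis sampler with its empirical acceptance rate; the target h(x) is invented for illustration, and working on the log scale avoids numerical underflow. The printed acceptance rate is what one would monitor while tuning σ.

```python
import numpy as np

# Random walk Metropolis for a 1-d target known only up to a constant:
# h(x) = exp(-x^2/2) * (2 + sin(5x)), a bumpy density chosen for illustration.
rng = np.random.default_rng(13)

def log_h(x):
    return -0.5 * x ** 2 + np.log(2.0 + np.sin(5.0 * x))

T, sigma = 50_000, 2.0        # sigma is the tuning parameter of the proposal
x = 0.0
chain = np.empty(T)
accept = 0
for t in range(T):
    x_prop = x + rng.normal(0.0, sigma)       # symmetric proposal
    log_r = log_h(x_prop) - log_h(x)          # Metropolis ratio (q cancels)
    if np.log(rng.uniform()) < log_r:         # accept with probability min(1, r)
        x, accept = x_prop, accept + 1
    chain[t] = x

print(accept / T)   # empirical acceptance rate; adjust sigma over trial runs
                    # until it lands in the rough 20-50 percent range
```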

The MH algorithm is typically run in two phases: 1. A first phase where the algorithm is tuned in order to select a candidate distribution that leads to reasonable acceptance rates. 2. A second phase where the candidate distribution is held fixed and the required samples are drawn.

Back to the Gibbs Sampler

Now that we know about the Metropolis-Hastings algorithm, we can discuss its utilization within a Gibbs sampler. If we encounter an awkward, non-standard full conditional [θᵢ | θ₋ᵢ, y] in implementing the Gibbs sampler, we could in principle sample from this full conditional distribution using the Metropolis-Hastings algorithm. That is, at the i-th update step of the Gibbs sampler we simulate from [θᵢ | θ₋ᵢ, y] by running another Markov chain (via the MH algorithm) that has the required full conditional [θᵢ | θ₋ᵢ, y] as its stationary distribution. Such an approach is known as Metropolis within Gibbs. Problems with this?

Fortunately we do not have to consider such overly bulky MCMC-within-MCMC type algorithms! It turns out that a single MH substep is sufficient to ensure convergence of the overall chain to its stationary distribution (the posterior). That is, when we encounter an awkward full conditional distribution [θᵢ | θ₋ᵢ, y], we simply draw one Metropolis-Hastings candidate, calculate the acceptance ratio, either accept or reject it, and move on to updating the next component θᵢ₊₁.

Given this, an overall strategy for drawing samples from a posterior distribution π(θ | y) is as follows: 1. Divide the parameter vector θ into k components, θ = (θ₁, θ₂, ..., θₖ). Each θᵢ need not be a scalar; however, this is often the case. 2. Starting with an initial value θ^(0), we generate each subsequent state of the chain θ^(t) by visiting and updating the corresponding components using the full conditional distributions [θᵢ | θ₋ᵢ, y]. 3. If the full conditional is some standard distribution, we draw θᵢ from this distribution: a Gibbs step.

4. If the full conditional is non-standard but has support on ℝ (or ℝ^dim(θᵢ)), we update θᵢ by taking a random walk Metropolis step. 5. If the full conditional is non-standard and has support on some D ⊂ ℝ (or D ⊂ ℝ^dim(θᵢ)), we take a random walk Metropolis step after transforming to ℝ (or ℝ^dim(θᵢ)): a Metropolis-Hastings step (homework).

6. We first run this algorithm in an adaptive phase, adjusting the tuning parameters associated with each MH step so that the corresponding acceptance rate is not too high and not too low. 7. We run a second phase with these values held fixed, assess convergence of the sampler to its limiting distribution (burn-in), and draw posterior samples from the rest. This simple MCMC algorithm, a hybrid of Gibbs and MH steps, will work very well for a broad class of models.

Assessing Convergence of an MCMC Sampler

As mentioned before, upon running an MCMC algorithm we need to decide when the chain has reached its stationary distribution. That is, we need to determine the value of some t₀ such that θ^(t₀+1) → θ^(t₀+2) → ... → θ^(t₀+J) constitute a dependent sample of size J from the posterior. The idea is to run the chain for some initial number of iterations and examine the output in some way to determine whether the initial state has been forgotten.

The simplest way to examine the output is through a trace plot. That is, for each θᵢ we plot θᵢ^(t) versus t and examine the evolution of the chain. [Figure: MCMC trace plot for θ, plotting θ^(t) against iteration t.]

Examining a single trace plot can be dangerous. Consider the following trace plot: [Figure: MCMC trace plot for θ from a single chain.] The chain looks fairly stable, as if it has reached its stationary distribution fairly quickly.

If we run another chain with a different starting value and compare the output to the original chain, we see: [Figure: trace plots of θ from two chains started at different values, which fail to overlap.]

We see that the chain had not converged; it was simply moving very slowly across the parameter space. Running two chains, each initialized at a different point, allowed us to detect this. In general, it is always best to run multiple chains in parallel and compare the output of all chains. The chains should be initialized at points that are overdispersed with respect to the posterior. The trace plots from all chains are then examined to determine if there is an identifiable point t₀ after which all chains seem to be overlapping.


Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

One-parameter models

One-parameter models One-parameter models Patrick Breheny January 22 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/17 Introduction Binomial data is not the only example in which Bayesian solutions can be worked

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

Introduction to Bayesian Methods

Introduction to Bayesian Methods Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY

MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY ECO 513 Fall 2008 MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY SIMS@PRINCETON.EDU 1. MODEL COMPARISON AS ESTIMATING A DISCRETE PARAMETER Data Y, models 1 and 2, parameter vectors θ 1, θ 2.

More information

Model comparison. Christopher A. Sims Princeton University October 18, 2016

Model comparison. Christopher A. Sims Princeton University October 18, 2016 ECO 513 Fall 2008 Model comparison Christopher A. Sims Princeton University sims@princeton.edu October 18, 2016 c 2016 by Christopher A. Sims. This document may be reproduced for educational and research

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Eco517 Fall 2014 C. Sims MIDTERM EXAM

Eco517 Fall 2014 C. Sims MIDTERM EXAM Eco57 Fall 204 C. Sims MIDTERM EXAM You have 90 minutes for this exam and there are a total of 90 points. The points for each question are listed at the beginning of the question. Answer all questions.

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 Nasser Sadeghkhani a.sadeghkhani@queensu.ca There are two main schools to statistical inference: 1-frequentist

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation. PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.. Beta Distribution We ll start by learning about the Beta distribution, since we end up using

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Department of Forestry & Department of Geography, Michigan State University, Lansing Michigan, U.S.A. 2 Biostatistics, School of Public

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Bayesian Networks in Educational Assessment

Bayesian Networks in Educational Assessment Bayesian Networks in Educational Assessment Estimating Parameters with MCMC Bayesian Inference: Expanding Our Context Roy Levy Arizona State University Roy.Levy@asu.edu 2017 Roy Levy MCMC 1 MCMC 2 Posterior

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Preliminaries. Probabilities. Maximum Likelihood. Bayesian

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Penalized Loss functions for Bayesian Model Choice

Penalized Loss functions for Bayesian Model Choice Penalized Loss functions for Bayesian Model Choice Martyn International Agency for Research on Cancer Lyon, France 13 November 2009 The pure approach For a Bayesian purist, all uncertainty is represented

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Bayesian inference: what it means and why we care

Bayesian inference: what it means and why we care Bayesian inference: what it means and why we care Robin J. Ryder Centre de Recherche en Mathématiques de la Décision Université Paris-Dauphine 6 November 2017 Mathematical Coffees Robin Ryder (Dauphine)

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee September 03 05, 2017 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles Linear Regression Linear regression is,

More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Bayesian inference Bayes rule. Monte Carlo integation.

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

BAYESIAN MODEL CRITICISM

BAYESIAN MODEL CRITICISM Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

Bayesian search for other Earths

Bayesian search for other Earths Bayesian search for other Earths Low-mass planets orbiting nearby M dwarfs Mikko Tuomi University of Hertfordshire, Centre for Astrophysics Research Email: mikko.tuomi@utu.fi Presentation, 19.4.2013 1

More information

g-priors for Linear Regression

g-priors for Linear Regression Stat60: Bayesian Modeling and Inference Lecture Date: March 15, 010 g-priors for Linear Regression Lecturer: Michael I. Jordan Scribe: Andrew H. Chan 1 Linear regression and g-priors In the last lecture,

More information

Module 22: Bayesian Methods Lecture 9 A: Default prior selection

Module 22: Bayesian Methods Lecture 9 A: Default prior selection Module 22: Bayesian Methods Lecture 9 A: Default prior selection Peter Hoff Departments of Statistics and Biostatistics University of Washington Outline Jeffreys prior Unit information priors Empirical

More information

Inference for a Population Proportion

Inference for a Population Proportion Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist

More information

ST 740: Model Selection

ST 740: Model Selection ST 740: Model Selection Alyson Wilson Department of Statistics North Carolina State University November 25, 2013 A. Wilson (NCSU Statistics) Model Selection November 25, 2013 1 / 29 Formal Bayesian Model

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks (9) Model selection and goodness-of-fit checks Objectives In this module we will study methods for model comparisons and checking for model adequacy For model comparisons there are a finite number of candidate

More information

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017 Chalmers April 6, 2017 Bayesian philosophy Bayesian philosophy Bayesian statistics versus classical statistics: War or co-existence? Classical statistics: Models have variables and parameters; these are

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

Bayesian Assessment of Hypotheses and Models

Bayesian Assessment of Hypotheses and Models 8 Bayesian Assessment of Hypotheses and Models This is page 399 Printer: Opaque this 8. Introduction The three preceding chapters gave an overview of how Bayesian probability models are constructed. Once

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment

Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment Ben Shaby SAMSI August 3, 2010 Ben Shaby (SAMSI) OFS adjustment August 3, 2010 1 / 29 Outline 1 Introduction 2 Spatial

More information

Overall Objective Priors

Overall Objective Priors Overall Objective Priors Jim Berger, Jose Bernardo and Dongchu Sun Duke University, University of Valencia and University of Missouri Recent advances in statistical inference: theory and case studies University

More information

David Giles Bayesian Econometrics

David Giles Bayesian Econometrics David Giles Bayesian Econometrics 1. General Background 2. Constructing Prior Distributions 3. Properties of Bayes Estimators and Tests 4. Bayesian Analysis of the Multiple Regression Model 5. Bayesian

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31 Hierarchical models Dr. Jarad Niemi Iowa State University August 31, 2017 Jarad Niemi (Iowa State) Hierarchical models August 31, 2017 1 / 31 Normal hierarchical model Let Y ig N(θ g, σ 2 ) for i = 1,...,

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Bayesian data analysis in practice: Three simple examples

Bayesian data analysis in practice: Three simple examples Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to

More information

Hierarchical Models & Bayesian Model Selection

Hierarchical Models & Bayesian Model Selection Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017 Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the

More information

Markov Chain Monte Carlo and Applied Bayesian Statistics

Markov Chain Monte Carlo and Applied Bayesian Statistics Markov Chain Monte Carlo and Applied Bayesian Statistics Trinity Term 2005 Prof. Gesine Reinert Markov chain Monte Carlo is a stochastic simulation technique that is very useful for computing inferential

More information

Bayesian Inference: Concept and Practice

Bayesian Inference: Concept and Practice Inference: Concept and Practice fundamentals Johan A. Elkink School of Politics & International Relations University College Dublin 5 June 2017 1 2 3 Bayes theorem In order to estimate the parameters of

More information

General Bayesian Inference I

General Bayesian Inference I General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

David Giles Bayesian Econometrics

David Giles Bayesian Econometrics David Giles Bayesian Econometrics 5. Bayesian Computation Historically, the computational "cost" of Bayesian methods greatly limited their application. For instance, by Bayes' Theorem: p(θ y) = p(θ)p(y

More information

Bayesian model selection for computer model validation via mixture model estimation

Bayesian model selection for computer model validation via mixture model estimation Bayesian model selection for computer model validation via mixture model estimation Kaniav Kamary ATER, CNAM Joint work with É. Parent, P. Barbillon, M. Keller and N. Bousquet Outline Computer model validation

More information