Checking for Prior-Data Conflict

by

Michael Evans, Department of Statistics, University of Toronto
and
Hadas Moshonov, Department of Statistics, University of Toronto

Technical Report No. 413, February 24, 2005

TECHNICAL REPORT SERIES, UNIVERSITY OF TORONTO, DEPARTMENT OF STATISTICS

Checking for Prior-Data Conflict

M. Evans and H. Moshonov, U. of Toronto

Abstract

Inference proceeds from ingredients chosen by the analyst and data. To validate any inferences drawn it is essential that the inputs chosen be deemed appropriate for the data. In the Bayesian context these inputs consist of both the sampling model and the prior. There are thus two possibilities for failure: the data may not have arisen from the sampling model, or the prior may place most of its mass on parameter values that are not feasible in light of the data, referred to here as prior-data conflict. Failure of the sampling model can only be fixed by modifying the model, while prior-data conflict can be overcome if sufficient data is available. We examine how to assess whether or not a prior-data conflict exists and how to assess when its effects can be ignored for inferences. The concept of prior-data conflict is seen to lead to a partial characterization of what is meant by a noninformative prior or a noninformative sequence of priors.

1 Introduction

Model checking is a necessary component of statistical analyses, because statistical analyses are based on assumptions. These assumptions typically take the form of possibilities for how the observed data were generated (the sampling model) and, in Bayesian contexts, a quantification of the investigator's beliefs (the prior) about the actual data-generating mechanism among the possibilities included in the sampling model. In certain contexts it might be argued that we have specific information that leads to a sampling model, or that the investigator has a very clear idea of what an appropriate prior is, but in general this does not seem to be the case. In fact the sampling model and prior often seem, if not chosen arbitrarily, then at least selected for convenience.
Accordingly it seems quite important that, before carrying out an inferential analysis, we first check that the choices made make sense in light of the data collected. For, if we can show that the observed data are surprising in light of the sampling model and the prior, then we must at least be suspicious about the validity of the inferences drawn (ignoring any ambiguities about the correct inference process itself). Our concern here is with the Bayesian context, where we have two ingredients, namely the sampling model and the prior. Several authors have considered model checking in this context, such as Guttman (1967), Box (1980), Rubin

(1984), and Gelman, Meng and Stern (1996). We note that all of these authors considered the effect of both the sampling model and the prior simultaneously. There are, however, two possible ways in which the Bayesian model can fail. First, the sampling model may fail. By this we mean that the observed data are surprising for each of the assumed distributions in the model. If this is the case, then either we had the misfortune to observe a rare data set or the model is in fact not appropriate. In the former case collecting more data may resolve the issue, but in the latter case it will not. Second, assuming the sampling model is appropriate, the Bayesian model may fail by the prior placing its mass primarily on distributions in the sampling model for which the observed data are surprising. In other words, there are distributions in the model for which the data are not surprising, but the prior places little or no mass there. We refer to this as prior-data conflict. This conflict again leads to doubts about the validity of inferences drawn from the Bayesian model. We note that in the second context increasing the amount of data can lead to a resolution of the problem, for as the amount of data increases the effect of the prior decreases, vanishing in the limit. What this implies is that even if there is prior-data conflict, we may have sufficient data so that its effect can be ignored. The preceding discussion leads us to the conclusion that in Bayesian contexts we must check individually for the separate sources of failure. First we must check for the failure of the sampling model. There are many frequentist methods for this, as well as Bayesian methods as discussed in Bayarri and Berger (2000). If we have no evidence that the sampling model is in error, then we must check to see whether or not there is any prior-data conflict and, if there is, whether or not this conflict leads to erroneous inferences.
Of course, if we find evidence that the sampling model is wrong, then it doesn't make sense to check for prior-data conflict. See also Section 6 for further justification for separately checking the sampling model and for prior-data conflict. We note also that if we do obtain evidence of the sampling model failing, this failure may not lead to serious problems for our inferences in certain circumstances. This is because the model may only be approximately correct and we have observed a sufficient amount of data to detect this deviation. If the approximate correctness of the model is adequate for the application at hand, then we would not want to discard the model. This is, however, a separate issue from what we will discuss in this paper, and it needs to be addressed by methods other than those we present here. In any case, throughout the remainder of this paper we assume that the sampling model is correct and focus on looking for prior-data conflict. The question now arises as to how we should check for the existence of a prior-data conflict. In Section 2 we argue that it is appropriate to use the prior predictive distribution of the minimal sufficient statistic to do this. In other words, if the observed value of the minimal sufficient statistic is a surprising value from its prior predictive distribution, then we have evidence that a prior-data conflict exists. In Section 3 we show how the concept of ancillarity becomes

relevant when looking for prior-data conflict. In Section 4 we discuss the implications of our definition of prior-data conflict for the characterization of a noninformative prior. In Section 5 we discuss how to assess whether or not an observed prior-data conflict can be ignored, so that the Bayesian model can be used to derive inferences about the unknowns in the sampling model. In Section 6 we discuss a factorization of the joint distribution of the parameter and data that serves as further support for the approach taken here.

In preparation for our discussion we specify some notation. We denote the sample space for the data s by S, the parameter space by Ω, and the sampling model by {f_θ : θ ∈ Ω}, where each f_θ is a probability density on S with respect to some support measure µ. The prior probability measure on Ω is denoted by Π, with density π with respect to a support measure υ on Ω. The joint distribution of (θ, s) leads to the prior predictive measure for s given by

M(B) = ∫_Ω ∫_B f_θ(s) π(θ) µ(ds) υ(dθ) = ∫_B m(s) µ(ds),

where

m(s) = ∫_Ω f_θ(s) π(θ) υ(dθ)

is the density of M with respect to µ. We note that we are restricting our attention here to the case where Π is proper, so that M is indeed a probability measure. We do, however, make some comments relevant to the improper case. When we observe the data s₀, we have the posterior probability measure Π(· | s₀) with density, with respect to υ, given by

π(θ | s₀) = f_θ(s₀) π(θ) / m(s₀)

for inferences about θ. For a function T : S → T we denote the marginal densities by f_{θ,T}, with respect to some support measure λ on T, and this leads to the marginal prior predictive density of T given by

m_T(t) = ∫_Ω f_{θ,T}(t) π(θ) υ(dθ),

again with respect to λ. We will also refer to the posterior predictive distribution of T, which has density

m_T(t | s₀) = ∫_Ω f_{θ,T}(t) π(θ | s₀) υ(dθ)

with respect to λ.
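As a concrete illustration of these definitions, the prior predictive density m(s) can be computed directly by quadrature. The sketch below is our own code, not the paper's; it takes a N(θ, 1) sampling density with a N(0, 4) prior, for which the prior predictive has the closed form N(0, 5), giving a check on the computation.

```python
# Illustrative sketch (our code, not the paper's): compute the prior
# predictive density m(s) = ∫ f_theta(s) pi(theta) d(theta) by midpoint
# quadrature for a N(theta, 1) model with a N(0, 4) prior; closed form N(0, 5).
from math import exp, pi, sqrt

def norm_pdf(x, mean, var):
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def prior_predictive(s, lo=-50.0, hi=50.0, steps=20000):
    # midpoint rule over a grid wide enough to capture essentially all prior mass
    h = (hi - lo) / steps
    return sum(norm_pdf(s, th, 1.0) * norm_pdf(th, 0.0, 4.0) * h
               for th in (lo + (i + 0.5) * h for i in range(steps)))

print(prior_predictive(1.0), norm_pdf(1.0, 0.0, 5.0))  # the two agree closely
```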
To assess whether or not a prior-data conflict exists, we need to compare some observed value with a fixed distribution to determine whether or not it is surprising. For convenience we will use the P-value for this purpose, although we are not necessarily recommending this as "the" way to measure surprise. There are various concerns with P-values, and in practical contexts we recommend at least plotting the distribution being used in the comparison in addition to computing the P-value.

2 Checking for Prior-Data Conflict: Sufficiency

Intuitively, a prior-data conflict exists whenever the data provide little or no support to those values of θ where the prior places its support. So, if the region where the prior places most of its mass is somewhat separate from the region where the likelihood function is relatively high, then the measure we use to assess the conflict should indicate this. This might lead one to compare the MLE θ̂(s₀) (or some other consistent estimate of θ) to the prior distribution to assess the degree of conflict. If θ̂(s₀) lay in a region where the prior placed little support, e.g., out in the tails of π, then we might conclude that a conflict exists. We note, however, that this is not quite correct, for if the spread of the likelihood were sufficiently wide, then there could be a significant overlap between the effective support of the prior and regions of support for the likelihood, even if θ̂(s₀) were "out in the tails" of π. Therefore, a more appropriate measure of prior-data conflict would seem to be one that compared the effective support of the prior (loosely speaking, a region that contains most of the mass of the prior) with the region where the likelihood is relatively high. If there is very little overlap between these regions, then we have evidence that a prior-data conflict exists. It is not clear, however, how to appropriately measure this degree of overlap. Accordingly we take a somewhat different approach. We note that the prior induces a probability distribution on the set of possible likelihood functions via the prior predictive probability measure M. If the observed likelihood L(· | s₀), given by L(θ | s₀) = f_θ(s₀), is a surprising value from this distribution, then this would seem to indicate that a prior-data conflict exists. We can simplify this somewhat by noting that the likelihood is in effect equivalent to a minimal sufficient statistic T; namely, if we know the value T(s₀), then we can obtain the map L(· | s₀), and conversely.
So, instead of comparing the observed likelihood to its marginal model, we can compare T(s₀) to its prior distribution M_T. We can then choose T so that this comparison can be made relatively simply.

We might ask if anything in the observed data s₀, beyond the value of T(s₀), has any relevance to checking for a prior-data conflict. The following result suggests that this is not the case.

Theorem 1. Suppose T is a sufficient statistic for the model {f_θ : θ ∈ Ω} for data s. Then the conditional prior predictive distribution of the data s given T is independent of the prior π.
Proof: See the Appendix.

So the conditional prior distribution of any aspect of the data, beyond the value of T(s₀), doesn't depend on the prior. This suggests that such information can tell us nothing about whether or not a prior-data conflict exists, and so can be ignored for this purpose. Note also the following result.

Theorem 2. If L(· | s₀) is nonzero only on a θ-region where π places no mass,

then T(s₀) is an impossible outcome for M_T.
Proof: See the Appendix.

So we see, at least in this extreme case of nonoverlapping support for the likelihood and prior, that comparing T(s₀) to M_T leads to definite evidence of a prior-data conflict. These results support our assertion that comparing the observed value T(s₀) of a minimal sufficient statistic to its prior distribution M_T is an appropriate method for checking for prior-data conflict. We note that Box (1980) suggested comparing the observed data s₀ to M as an appropriate method for model checking. So the method we are suggesting here is really just a restriction of Box's approach to minimal sufficient statistics, and we are further suggesting that this is appropriate only for checking for prior-data conflict and not model checking generally. This, together with the discussion in Section 6, would seem to lead to a resolution of some anomalous behavior of Box's approach in particular examples. We note also that the discussion in Section 3 leads to a further modification of Box's approach. The following examples show that basing the check for prior-data conflict on comparing T(s₀) to M_T produces acceptable results. We also consider some alternatives in these examples (such as using the posterior predictive) and show that these have bad properties for the problem of checking for prior-data conflict. Perhaps the most basic model is the following.

Example 1. Location normal model

Suppose s = (s₁, ..., sₙ) is a sample from a N(θ, 1) distribution with θ ∈ R¹, and we place a N(θ₀, σ₀²) prior on θ. A minimal sufficient statistic is T(s) = s̄ ~ N(θ, 1/n). For the prior predictive distribution of s̄ we can write s̄ = θ + Z, where Z ~ N(0, 1/n) independent of θ. So the prior predictive distribution of s̄ is N(θ₀, σ₀² + 1/n).
Accordingly it makes sense to see if s̄₀ lies in the tails of this distribution, and we assess this via the P-value

M_T(m_T(s̄) ≤ m_T(s̄₀)) = 2[1 − Φ(|s̄₀ − θ₀| / (σ₀² + 1/n)^{1/2})].   (1)

This shows that when s̄₀ lies in the tails of its prior distribution, we have evidence of a prior-data conflict existing. Note that using the prior predictive distribution of s̄ to assess this, instead of the prior for θ, results in the standardization by (σ₀² + 1/n)^{1/2} rather than σ₀. So s̄₀ has to be somewhat further out in the tail than if we simply compared the MLE to the prior of θ. Now suppose that the true value of θ is θ*. Then as n → ∞ we have that (1) converges almost surely to

2[1 − Φ(|θ* − θ₀| / σ₀)].

Therefore, if the true value of θ lies in the tails of the prior, then asymptotically (1) will detect this and lead to evidence that a prior-data conflict exists.
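The P-value (1) is immediate to compute; the following minimal sketch is our own code (the function name and the illustrative numbers are ours, not the paper's).

```python
# Sketch of the P-value (1) for the location normal model (our code): compare
# the observed mean to its prior predictive N(theta0, sigma0^2 + 1/n).
from math import sqrt
from statistics import NormalDist

def conflict_pvalue(sbar0, theta0, sigma0, n):
    """Two-sided prior predictive P-value of Example 1."""
    z = abs(sbar0 - theta0) / sqrt(sigma0 ** 2 + 1.0 / n)
    return 2.0 * (1.0 - NormalDist().cdf(z))

print(conflict_pvalue(sbar0=3.0, theta0=0.0, sigma0=1.0, n=25))    # small: conflict
print(conflict_pvalue(sbar0=3.0, theta0=0.0, sigma0=100.0, n=25))  # near 1: diffuse prior
```

The second call illustrates the limiting behavior discussed above: a very diffuse prior (large σ₀) cannot be found in conflict with the data.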

Also note that as σ₀ → ∞, (1) converges to 1, and so no evidence will be found of a prior-data conflict. This is not surprising, for recall that we are proceeding as if the sampling model is correct, and a diffuse prior simply indicates that all values are equally likely, so we should definitely not find any conflict. Further, as σ₀ → 0, so that we have a very precise prior, (1) will definitely find evidence of a prior-data conflict for large n unless θ₀ is indeed the true value.

One might wonder about the possibility of using the posterior predictive distribution of the minimal sufficient statistic instead. In fact a result analogous to Theorem 1 is true in the posterior context as well, so it makes sense to only look at functions of the minimal sufficient statistic again. The posterior predictive distribution for s̄ is given by

s̄ ~ N((1/σ₀² + n)⁻¹(θ₀/σ₀² + n s̄₀), (1/σ₀² + n)⁻¹ + 1/n).

Then, comparing s̄₀ to this distribution, we would find a prior-data conflict if the P-value

2[1 − Φ(|s̄₀ − (1/σ₀² + n)⁻¹(θ₀/σ₀² + n s̄₀)| / ((1/σ₀² + n)⁻¹ + 1/n)^{1/2})]   (2)

is small. As n → ∞ we have that (2) converges almost surely to 1, irrespective of the true value of θ. So if we were to use the posterior predictive, we would never conclude that a prior-data conflict exists. This is clearly an unacceptable situation. This phenomenon occurs because the data inevitably overwhelm the effect of the prior in the posterior, and we have an instance of the double use of the data, as discussed in Bayarri and Berger (2000). Similarly, coin tossing represents a very simple context.

Example 2. Bernoulli model

Suppose s = (s₁, ..., sₙ) is a sample from a Bernoulli(θ) model where θ ∈ [0, 1] is unknown, and θ is given a Beta(α, β) prior distribution. A minimal sufficient statistic is T(s) = n s̄ = Σ sᵢ ~ Binomial(n, θ). The prior predictive probability function of T(s) = n s̄ is then given by

m_T(t) = C(n, t) [Γ(α + β) / (Γ(α)Γ(β))] [Γ(t + α)Γ(n − t + β) / Γ(n + α + β)],

where C(n, t) denotes the binomial coefficient. To assess whether or not t₀ = n s̄₀ is surprising, we must compare t₀ to m_T.
This can be conveniently done by computing the tail probability

M_T(m_T(t) ≤ m_T(t₀)),   (3)

which, in this case, must be done numerically. In essence, (3) is the prior probability of obtaining a value of the minimal sufficient statistic with probability of

occurrence no greater than that for the observed value. Because of cancellations we can write (3) as

M_T(Γ(t + α)Γ(n − t + β) / [Γ(t + 1)Γ(n − t + 1)] ≤ Γ(t₀ + α)Γ(n − t₀ + β) / [Γ(t₀ + 1)Γ(n − t₀ + 1)]).   (4)

Notice that when α = β = 1, (4) equals 1, and we never have any evidence of prior-data conflict. This makes sense, however, as given that the sampling model is correct, there should be no conflict between the data and a noninformative prior. To illustrate the use of the P-value given by (4), we consider a numerical example. For this suppose that n = 10, α = 5 and β = 20. In this case the prior distribution puts most of its mass on values of θ in (0, 1/2). Figure 1 is a plot of the prior predictive probability function of the minimal sufficient statistic.

Figure 1: Plot of the prior predictive probability function of the minimal sufficient statistic for a sample of n = 10 from the Bernoulli(θ) distribution when θ ~ Beta(5, 20).

Generating a sample of size 10 from the Bernoulli(.9) distribution, we obtained n s̄₀ = 9. Note that θ = .9 is outside the effective support of the prior distribution, i.e., a prior-data conflict exists. The prior predictive P-value is given by

M_T(m_T(t) ≤ m_T(9)) = m_T(9) + m_T(10) = .000117.

As expected, this is an indication of a prior-data conflict. Generating a sample of size 10 from the Bernoulli(.25) distribution, we obtained n s̄₀ = 1. Note that this value θ = .25 lies well in the support of the prior distribution, i.e., a prior-data conflict does not exist. The P-value is given by M_T(m_T(t) ≤ m_T(1)) = 1. As expected, this does not indicate any prior-data conflict. The following example illustrates that the notion of a noninformative prior is somewhat subtle. We discuss this further in Section 4.
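Returning to Example 2, the computation of (3) can be sketched as follows. This is our own code, not the paper's: the prior predictive of t = n s̄ is beta-binomial, and the P-value sums m_T over all t whose probability is no greater than that of the observed t₀.

```python
# Numerical version of the tail probability (3) for Example 2 (our code):
# m_T is the beta-binomial probability function C(n,t) B(t+a, n-t+b) / B(a,b),
# and the P-value sums m_T over all t no more probable than the observed t0.
from math import comb, lgamma, exp

def m_T(t, n, a, b):
    return comb(n, t) * exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                            + lgamma(t + a) + lgamma(n - t + b) - lgamma(n + a + b))

def prior_pvalue(t0, n, a, b):
    probs = [m_T(t, n, a, b) for t in range(n + 1)]
    return sum(p for p in probs if p <= probs[t0])

# n = 10 with a Beta(5, 20) prior, which concentrates on (0, 1/2):
print(prior_pvalue(9, 10, 5.0, 20.0))  # ≈ 0.000117: t0 = 9 conflicts with the prior
print(prior_pvalue(1, 10, 5.0, 20.0))  # 1.0: t0 = 1 is the modal value
```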

Example 3. Negative-binomial sampling

Suppose that a coin is tossed in independent tosses until k heads are obtained, and the number of tails observed until the k-th head is s. Then s is distributed Negative-Binomial(k, θ) and is minimal sufficient. Note that the likelihood function is a positive multiple of the likelihood function from the Bernoulli model in Example 2 when we obtain k ones in a sample of size n = s + k. It is then clear that posterior inferences about θ are the same from the two models. We will see, however, that the checks for prior-data conflict are somewhat different. Suppose that we place a uniform prior on θ. Then, as noted in Example 2, the prior predictive for the minimal sufficient statistic under Bernoulli sampling is uniform, and we never have evidence of a prior-data conflict existing. The prior predictive under negative-binomial sampling, however, is given by

m(s) = C(s + k − 1, k − 1) ∫₀¹ θᵏ (1 − θ)ˢ dθ
     = [Γ(s + k) / (Γ(k) Γ(s + 1))] [Γ(k + 1) Γ(s + 1) / Γ(s + k + 2)]
     = k / [(s + k)(s + k + 1)],   (5)

and this is not uniform. Indeed, it cannot be uniform, because the support of this distribution is the nonnegative integers. Notice that (5) is decreasing in s, and so

M(m(s) ≤ m(s₀)) = Σ_{s=s₀}^∞ k / [(s + k)(s + k + 1)]
                = Σ_{s=s₀}^∞ k [1/(s + k) − 1/(s + k + 1)]
                = k / (s₀ + k),   (6)

and we see that this P-value is small only when s₀ is large compared to k. While this P-value may seem unusual when compared to the binomial case, there is a subtle difference between the two situations. In particular, in the Binomial(n, θ) case it is permissible to have θ = 0 and so, for example, observe n consecutive tails. In the Negative-Binomial(k, θ) case, however, it is not possible to have θ = 0, as there is no such thing as the Negative-Binomial(k, 0) distribution. The uniform prior on the unit interval places positive mass on every interval about 0 and so is not really a sensible prior for this model, as the positive mass indicates a prior belief that θ = 0 is a possible value.
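The telescoping sum above is easy to verify numerically; the quick sketch below is our own code, not the paper's.

```python
# Check of the closed form (6) in Example 3 (our code): under a uniform prior,
# m(s) = k / ((s + k)(s + k + 1)) and, since m is decreasing in s, the P-value
# is the tail sum from s0, which telescopes to k / (s0 + k).
def m(s, k):
    return k / ((s + k) * (s + k + 1))

def pvalue_direct(s0, k, terms=200000):
    # brute-force partial sum of the tail; the remainder is k / (s0 + terms + k)
    return sum(m(s, k) for s in range(s0, s0 + terms))

k, s0 = 3, 47
print(pvalue_direct(s0, k), k / (s0 + k))  # both close to 0.06
```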
Recall our restriction that, when looking for prior-data conflict, we assume that the sampling model is correct. If we place a uniform prior on [ε, 1] with ε > 0, as this places no mass at 0, and denote the corresponding prior predictive by m_ε, then the dominated convergence theorem establishes that lim_{ε→0} M_ε(m_ε(s) ≤ m_ε(s₀)) is equal to (6). In fact, when k = 1 we have that M_ε(m_ε(s) ≤ m_ε(s₀)) = (1 − ε)^{s₀}/(s₀ + 1). So we see that (6) is an approximation to the P-value we will obtain when ε is small. Accordingly we can view (6) as an approximation to the P-value we would obtain for a prior that

is 0 in a small neighborhood of 0 and rises smoothly to 1 outside this interval. If (6) is small, then we have evidence of prior-data conflict when using this kind of prior. The following example illustrates that looking at different functions of the minimal sufficient statistic may be appropriate when assessing prior-data conflict.

Example 4. Location-scale normal model

Suppose that x = (x₁, ..., xₙ) is a sample from a N(µ, σ²) distribution, where µ ∈ R¹ and σ > 0 are unknown. With s² = (n − 1)⁻¹ Σ (xᵢ − x̄)², then (x̄, s²) is a minimal sufficient statistic for (µ, σ²), with x̄ ~ N(µ, σ²/n) independent of s² ~ (σ²/(n − 1)) χ²(n − 1). Suppose we put the prior on (µ, σ²) given by

µ | σ² ~ N(µ₀, τ₀² σ²),  1/σ² ~ Gamma(α₀, β₀).   (7)

The posterior distribution of (µ, σ²) is then given by

µ | σ², x₁, ..., xₙ ~ N(µₓ, (n + 1/τ₀²)⁻¹ σ²),  1/σ² | x₁, ..., xₙ ~ Gamma(α₀ + n/2, βₓ),

where

µₓ = (1/τ₀² + n)⁻¹ (µ₀/τ₀² + n x̄)

and

βₓ = β₀ + µ₀²/(2τ₀²) + (n/2) x̄² + ((n − 1)/2) s² − (1/2)(1/τ₀² + n)⁻¹ (µ₀/τ₀² + n x̄)².

The joint prior predictive distribution of (x̄, s²) is given by

m(x̄, s²) = ∫∫ f(x̄, s² | µ, σ²) π(µ | σ²) π(σ²) dσ² dµ
          = [Γ(n/2 + α₀)/Γ(α₀)] (n + 1/τ₀²)^{−1/2} (2π)^{−n/2} β₀^{α₀} τ₀^{−1} βₓ^{−(n/2) − α₀}.   (8)

We then need to assess whether or not an observed value (x̄₀, s₀²) is a reasonable value from the distribution given by (8). While there might be a number of ways of assessing this, we compute the probability of observing a value of (x̄, s²) at least as extreme as (x̄₀, s₀²), namely, we compute the P-value

M(m(x̄, s²) ≤ m(x̄₀, s₀²)).

To compute this, for specified values of the hyperparameters α₀, β₀, µ₀ and τ₀², we generate (µ, σ²) using (7), then generate (x̄, s²) from their joint distribution given (µ, σ²), and evaluate m(x̄, s²) using the formula in (8). Repeating this many times, and recording the proportion of values of m(x̄, s²) that are less than or equal to m(x̄₀, s₀²), gives us a Monte Carlo estimate of the P-value. For example, suppose that (µ, σ) = (0, 1) and that for a sample of size n = 20 from this distribution we obtained observed values x̄₀ and s₀². Then for the prior specified by τ₀² = 1, µ₀ = 5, α₀ = 10 and β₀ = 5, the above simulation process, based on a Monte Carlo sample of size N = 10³, gives a small P-value, and so clear evidence of a prior-data conflict is obtained.

Rather than computing a P-value based on the full prior predictive distribution of the minimal sufficient statistic, it is also possible to use the prior predictive distribution of some function of the minimal sufficient statistic. For example, we could instead use the marginal prior predictive distributions of x̄ and s². We note that we can write x̄ = µ + (σ²/n)^{1/2} z, where z ~ N(0, 1), and when µ | σ² ~ N(µ₀, τ₀²σ²), then x̄ | σ² ~ N(µ₀, σ²(τ₀² + 1/n)). Therefore, after multiplying the N(µ₀, σ²(τ₀² + 1/n)) density by the marginal prior for 1/σ² and integrating out σ², we have that x̄ ~ t_{2α₀}(µ₀, (β₀(τ₀² + 1/n)/α₀)^{1/2}), where t_λ(0, 1) denotes the t distribution with λ degrees of freedom and t_λ(µ, σ) = µ + σ t_λ(0, 1), is the marginal prior predictive distribution of x̄. Therefore the P-value based on x̄ alone is

2[1 − G_{2α₀}(|x̄₀ − µ₀| / (β₀(τ₀² + 1/n)/α₀)^{1/2})],

where G_{2α₀} is the cdf of the t distribution with 2α₀ degrees of freedom. For the above generated data this P-value is also small, which again indicates the existence of a prior-data conflict. Similarly, we find that s² ~ (β₀/α₀) F(n − 1, 2α₀) is the marginal prior predictive of s². For the above example we compare s₀²/0.5 with the F(19, 20) distribution.
Computing the P-value as the probability of obtaining a value from the F(19, 20) distribution with density smaller than that obtained at the observed value leads to a large P-value, which does not indicate any prior-data conflict. We see from these two checks that the conflict arises from the location of the data and not its spread. Similar results are available for the normal regression model when using a conjugate prior. In general it seems difficult to determine the large-sample behavior of M_T.

Example 5. Exponential models

Suppose we have a sample s = (s₁, ..., sₙ) from a density of the form

f_θ(sᵢ) = c(θ) h(sᵢ) exp{θᵗ u(sᵢ)}.

Then the minimal sufficient statistic T(s) = n⁻¹ Σᵢ₌₁ⁿ u(sᵢ) converges almost surely (P_θ) to E_θ(u), and so we expect that, for large n, the prior predictive distribution of T will be closely approximated by the prior distribution of E_θ(u). So, for large n, if T(s₀) is extreme with respect to the prior distribution of E_θ(u), we will have evidence of prior-data conflict. The following example has posed a variety of problems for inference but seems to lead to sensible answers when we apply our criterion for assessing whether or not any prior-data conflict exists.

Example 6.

Suppose we observe s = (s₁, ..., sₙ), where sᵢ ~ N(θᵢ, 1) with s₁, ..., sₙ independent and θ₁, ..., θₙ independent N(θ₀, σ₀²). In this case s is a minimal sufficient statistic with prior predictive distribution N_n(θ₀(1, ..., 1)ᵗ, (σ₀² + 1)I). So we see that we have evidence of prior-data conflict whenever

(σ₀² + 1)⁻¹ Σᵢ₌₁ⁿ (sᵢ − θ₀)²

is extreme with respect to the Chi-squared(n) distribution, i.e., whenever the P-value

P(χ²ₙ > (σ₀² + 1)⁻¹ Σᵢ₌₁ⁿ (sᵢ − θ₀)²)

is small. Notice that as σ₀² → ∞ this P-value goes to 1, and so we have no evidence of a conflict existing when we have a diffuse prior. When σ₀² → 0 we have evidence of a prior-data conflict whenever s₁, ..., sₙ doesn't look like a sample from the N(θ₀, 1) distribution.

3 Checking for Prior-Data Conflict: Ancillarity

In principle we could look for prior-data conflict by comparing the observed value of U(T(s)) with its marginal prior predictive distribution for any function U; e.g., consider Example 4. We note, however, that for certain choices of U this will clearly not be appropriate. For example, if U is ancillary, then the marginal prior predictive distribution of U is the same as its marginal distribution from the sampling model, i.e., the marginal prior predictive distribution does not depend on the prior. So if U(T(s₀)) is surprising, this cannot be evidence of a prior-data conflict but rather indicates a problem with the sampling model.
Since we are assuming here that the sampling model has passed its checks, we want to avoid this possibility. A simple answer would be to say that we simply don't choose U to be ancillary. But we note that the prior predictive distribution of T will also be affected by aspects that are ancillary. In other words, T(s₀) may be a surprising value simply because U(T(s₀)) is surprising for some ancillary U, and we want to avoid this. Accordingly, in such a context, it makes sense to compare

the observed value T(s₀) with its prior predictive distribution given the value U(T(s₀)); i.e., we remove the variation in the prior predictive distribution of T that is associated with U by conditioning on U. Note that we avoid the necessity of conditioning on an ancillary whenever we have a complete minimal sufficient statistic T, as then Basu's theorem implies that every ancillary U is independent of T. So, for the examples discussed in Section 2, we do not need to condition on ancillaries; i.e., the prior predictive distribution of T does not exhibit variation that can be ascribed to any ancillary U.

If U₁ ∘ T and U₂ ∘ T are ancillary and U₁ ∘ T = h ∘ U₂ ∘ T for some h, then we need only do the check based on the prior predictive distribution of T given U₂(T(s₀)), since this is conditioning on more information and so removing more variation. When an ancillary U₁ is such that there is no function h and ancillary U₂ (other than a choice where h is one-to-one) such that U₁ ∘ T = h ∘ U₂ ∘ T, then U₁ is a maximal ancillary. So in general we would like to condition on maximal ancillaries whenever possible. We note, however, as discussed in Lehmann and Scholz (1992), the lack of a general method for obtaining maximal ancillaries. Still, if we have an ancillary U available, it seems better to condition on U, and so remove the variation due to U, even if we cannot determine whether or not U is a maximal ancillary or determine a maximal ancillary based on U. Furthermore, it can happen that there is more than one maximal ancillary in a problem; e.g., see Lehmann and Scholz (1992). We note that the lack of a unique maximal ancillary can cause problems in frequentist approaches to inference, because inference about θ can be different depending on which ancillary we choose to condition on. This does not cause a problem here, however.
For if the check based on conditioning on one maximal ancillary indicates a prior-data conflict, then a conflict exists irrespective of what happens when we condition on other maximal ancillaries. In essence we are considering different aspects of the prior predictive distribution when we condition on different maximal ancillaries. The following example is presented in Cox and Hinkley (1974) as a particularly compelling argument for the use of ancillary statistics in inference.

Example 7. Mixtures

Suppose we have a response x that is either from a N(θ, σ₀²) or a N(θ, σ₁²) distribution, where θ ∈ R¹ is unknown and σ₀², σ₁² are both known and unequal. Suppose these two distributions correspond to the variation exhibited by two measuring instruments, and the particular instrument used is chosen according to c ~ Bernoulli(p₀), where p₀ is known. Then we see that (c, x) is minimal sufficient and c is ancillary. We place a N(θ₀, 1) prior on θ. Therefore, when c = 0 we would use the prior predictive based on the prior and the N(θ, σ₀²) family, namely x ~ N(θ₀, σ₀² + 1), and when c = 1 we would use the prior predictive based on the prior and the N(θ, σ₁²) family, namely x ~ N(θ₀, σ₁² + 1), to check for prior-data conflict. This example nicely illustrates the use of ancillary statistics, for there are two "components" to the prior predictive distribution of the minimal sufficient

statistic, and the ancillary statistic selects one. We see that x may lead to a prior-data conflict for both components, one component or neither component. This will depend on the prior π and the values of σ₀² and σ₁². Note also that if p₀ is very small and we observed c = 1, then we might consider this as evidence of a problem with the model, not as indicating that a prior-data conflict exists (see Section 6).

More generally, if we have that (x, u) is minimal sufficient for a model with x | u ~ f_θ(· | u) and u ~ h a maximal ancillary, then

m(x | u) = ∫_Ω f_θ(x | u) π(θ) υ(dθ).

We compare the observed value of x with this distribution to see if there is any evidence of prior-data conflict. Clearly the variation in the value of the minimal sufficient statistic due to changes in u can tell us nothing about whether or not there is any prior-data conflict. Conditioning on u removes this variation. The following example is presented in Cox and Hinkley (1974) as a case where the use of ancillary statistics leads to ambiguity for inferences about the model parameter.

Example 8. Special Multinomial Distribution

Suppose that we have a sample of n from the Multinomial(1, (1 − θ)/6, (1 + θ)/6, (2 − θ)/6, (2 + θ)/6) distribution, where θ ∈ [−1, 1] is unknown. Then the cell counts (f₁, f₂, f₃, f₄) constitute a minimal sufficient statistic, and U₁ = (f₁ + f₂, f₃ + f₄) is ancillary, as is U₂ = (f₁ + f₄, f₂ + f₃). Then (f₁, f₂, f₃, f₄) | U₁ is given by f₁ | U₁ ~ Binomial(f₁ + f₂, (1 − θ)/2) independent of f₃ | U₁ ~ Binomial(f₃ + f₄, (2 − θ)/4), giving

m(f₁, f₂, f₃, f₄ | U₁) = C(f₁ + f₂, f₁) C(f₃ + f₄, f₃) ∫₋₁¹ ((1 − θ)/2)^{f₁} ((1 + θ)/2)^{f₂} ((2 − θ)/4)^{f₃} ((2 + θ)/4)^{f₄} π(θ) dθ.

This distribution is 2-dimensional and can be used as described in Example 4 for prior-data conflict checking. Alternatively, we could use the 1-dimensional distributions f₁ | U₁ and f₃ | U₁.
In Example 4 we saw that separate 1-dimensional assessments can lead to more insight concerning any prior-data conflict that may exist. For example, suppose π is a Beta(α, β) distribution on [−1, 1]. When α = β = 2, so the prior concentrates about 0, and f1 + f2 = 10, Figure 2 gives the conditional prior predictive probability function for f1 | U1. We see that we will find evidence of a prior-data conflict whenever f1 ∈ {0, 1, 9, 10}. On the other hand, when α = 10, β = 2, so the prior concentrates about 1, and f1 + f2 = 10, Figure 3 gives the conditional prior predictive probability function for f1 | U1. In

this case we will find evidence of a prior-data conflict whenever f1 ∈ {3, ..., 10}.

Figure 2: Conditional prior predictive of f1 given U1 when α = β = 2.

Figure 3: Conditional prior predictive of f1 given U1 when α = 10, β = 2.

Alternatively, we could use f1 | U2 ~ Binomial(f1 + f4, (1 − θ)/3). In this case, when α = β = 2 and f1 + f4 = 10, Figure 4 gives the conditional prior predictive probability function for f1 | U2. We see that we will find evidence of a prior-data conflict whenever f1 ∈ {0, 8, 9, 10}. This differs from the previous check involving f1, but there is no conflict between them. If either check indicates that f1 is a surprising value, then we have evidence of a prior-data conflict existing. We also have available the checks using f3 | U1 and f2 | U2.

Figure 4: Conditional prior predictive of f1 given U2 when α = β = 2.

The following example can be considered as an archetype for the common situation where an ancillary is determined as the maximal invariant under a transformation group acting on the sample space.

Example 9. Location Cauchy

Suppose we have a sample s = (s1, ..., sn) from a distribution with density proportional to (1 + (x − θ)²)^{−1}, where θ ∈ R¹. Then it is known that T(s) = (s(1), ..., s(n)), the order statistic, is a minimal sufficient statistic. Further, we have that U(T(s)) = (s(2) − s(1), ..., s(n) − s(1)) = (u1, ..., u_{n−1}) is ancillary. Clearly the conditional distribution of T given U can be expressed as the conditional distribution of s(1) given U, and this has conditional density proportional to

(1 + (s(1) − θ)²)^{−1} ∏_{i=1}^{n−1} (1 + (s(1) + u_i − θ)²)^{−1}.

Note that the normalizing constant will not involve θ, and so the conditional prior predictive density of s(1) is

m(s(1) | u1, ..., u_{n−1}) = ∫ (1 + (s(1) − θ)²)^{−1} ∏_{i=1}^{n−1} (1 + (s(1) + u_i − θ)²)^{−1} π(θ) dθ
= ∫ (1 + v²)^{−1} ∏_{i=1}^{n−1} (1 + (v + u_i)²)^{−1} π(s(1) − v) dv (9)

for prior π. Integrating s(1) out of the above expression shows that the normalizing constant is given by

∫ (1 + v²)^{−1} ∏_{i=1}^{n−1} (1 + (v + u_i)²)^{−1} dv. (10)

We then assess whether or not there is any prior-data conflict by comparing the observed value of s(1) with the distribution given by m(· | u1, ..., u_{n−1}). To evaluate a P-value associated with this conditional prior predictive we must integrate numerically, and this can be a bit tricky. For example, consider the following ordered sample of n = 10 from the Cauchy distribution with θ = 0.

In Figure 5 we have plotted the normalized integrand in (10) over the interval (−u_{n−1}, u1). This is in fact a plot of the conditional density of v = s(1) − θ given θ and (u1, ..., u_{n−1}).

Figure 5: The conditional density of v = s(1) − θ given θ and (u1, ..., u_{n−1}).

We can combine the information in this plot with our knowledge about where the prior places its mass to determine where we should place the integration points to evaluate the integral in (9). For example, when the prior is a N(0, 1) distribution we see that, for this data set, we need to tabulate m(s(1) | u1, ..., u_{n−1}) for s(1) in the interval (−8, 1). Note that we need only tabulate the normalized integrand in (10) once to tabulate m(s(1) | u1, ..., u_{n−1}), and this saves considerable computation. Doing this we obtain the conditional density m(· | u1, ..., u_{n−1}) shown in Figure 6. From this we can see that the observed value of s(1) provides no evidence of prior-data conflict, and this is what we would expect.

Figure 6: The density function m(· | u1, ..., u_{n−1}) with a N(0, 1) prior.

Suppose, however, that the prior distribution is N(θ0, 1). In this case the density m(· | u1, ..., u_{n−1}) looks just like that plotted in Figure 6 but is translated by

θ0 units. So we see that if θ0 is more than 3, then the observed value of s(1) will provide evidence of a prior-data conflict. When the prior is N(θ0, σ0²) and σ0² → 0, then m(· | u1, ..., u_{n−1}) converges to a density that is the same as the density depicted in Figure 5, but translated by θ0. So when θ0 > 1 or θ0 < −2 we see clearly that there will be evidence of a prior-data conflict. Note that the left-tail probability is given by

P(s(1) | u1, ..., u_{n−1}) = [∫ (1 + v²)^{−1} ∏_{i=1}^{n−1} (1 + (v + u_i)²)^{−1} Φ((s(1) − θ0 − v)/σ0) dv] / [∫ (1 + v²)^{−1} ∏_{i=1}^{n−1} (1 + (v + u_i)²)^{−1} dv],

and this converges to 1/2 as σ0² → ∞. So for a very diffuse prior we will not find any evidence of a prior-data conflict.

We note that the method of calculation that we have outlined here will work with any location family when the minimal sufficient statistic is given by the order statistic, as then (u1, ..., u_{n−1}) will always be ancillary. For example, when we are sampling from any t distribution, this will be the case. Further, this method can be generalized to location-scale models.

As noted in Lehmann and Scholz (1992), there can be a conflict between ancillarity and sufficiency in frequentist inference, in the sense that it is not entirely clear which should be imposed first. In other words, there are situations where ancillary statistics exist that are not functions of the minimal sufficient statistic. In the context of checking for a prior-data conflict, however, this does not pose a difficulty, as it is clear from Theorem 1 that we only consider functions of the minimal sufficient statistic for this. Of course, for Bayesian inferences about θ the concept of ancillarity has no relevance.

4 Noninformative Priors

Various definitions are available for expressing what it means for a prior to be noninformative. For example, see Kass and Wasserman (1996) for a discussion of the various possibilities and of the difficulties inherent in this, and also Bernardo (1979) and Berger and Bernardo (1992).
A somewhat different requirement for noninformativity arises from considerations about the existence of a prior-data conflict. For if a prior is such that we would never conclude that a prior-data conflict exists, no matter what data is obtained, then it seems reasonable to say that such a prior is at least a candidate for being called noninformative. Example 2 gives an example where the uniform prior on a compact parameter space satisfies this requirement. We saw in Example 3 that, while the parameter space is contained in a compact set, it is not the case that the uniform distribution is noninformative.

Strictly interpreted, our requirement only applies to proper priors because M_T is not a probability measure unless Π is proper. We can, however, extend this to sequences of priors that converge in some sense to an improper prior. As in Example 1, we can say that a sequence of priors satisfies this requirement for noninformativity if the P-value we choose to compute converges to 1 for every possible set of data. In general it is not clear whether a proper prior, or a sequence of proper priors, will exist for a given problem that satisfies this requirement. Further, even if we can show that such objects do exist, it is likely that further requirements must be satisfied for a prior to be called noninformative. If it is possible, however, for a so-called noninformative prior to conflict with the data in some sense, then it does seem that it is putting some information into the analysis. So we consider the absence of the possibility of any prior-data conflict as a necessary characteristic of noninformativity rather than as a characterization of this concept. The following example demonstrates that more than one sequence can satisfy the requirement.

Example 10. Location normal model

Consider the context of Example 1, but now suppose that we take the prior Π_w for θ to be the uniform distribution on (−w, w) for some w > 0. Then the prior predictive density of T(s) = s̄ is given by

m_{T,w}(s̄) = (1/2w) ∫_{−w}^{w} (n/2π)^{1/2} exp{−(n/2)(s̄ − θ)²} dθ = (1/2w) [Φ(n^{1/2}(w − s̄)) − Φ(n^{1/2}(−w − s̄))].

It is clear that m_{T,w} is unimodal, with mode at s̄ = 0, and is symmetric about 0. Then the relevant P-value is given by

M_{T,w}(m_{T,w}(s̄) < m_{T,w}(s̄0)) = 1 − M_{T,w}((−|s̄0|, |s̄0|)) = 1 − ∫_{−|s̄0|}^{|s̄0|} (1/2w) [Φ(n^{1/2}(w − s̄)) − Φ(n^{1/2}(−w − s̄))] ds̄,

which converges to 1 as w → ∞ by the dominated convergence theorem. So in this case we see that a natural sequence of priors converging to the uniform prior satisfies our requirement, as in the limit there is no prior-data conflict. This is also true, as discussed in Example 1, for a sequence of N(θ0, σ0²) priors with σ0² → ∞.
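The convergence of this P-value to 1 can be checked numerically. The sketch below is ours: it evaluates m_{T,w} through the closed form above and approximates the P-value by a midpoint rule, with an illustrative observed mean s̄0 = 1.5 and n = 5.

```python
import math

def normal_cdf(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prior_predictive_density(sbar, w, n):
    # m_{T,w}(sbar): Uniform(-w, w) prior mixed with the N(theta, 1/n)
    # sampling distribution of the sample mean.
    rootn = math.sqrt(n)
    return (normal_cdf(rootn * (w - sbar))
            - normal_cdf(rootn * (-w - sbar))) / (2.0 * w)

def conflict_pvalue(sbar_obs, w, n, grid=2000):
    """M_{T,w}(m_{T,w}(sbar) < m_{T,w}(sbar_obs)).

    Since m_{T,w} is symmetric and unimodal at 0, this is one minus the
    prior predictive mass of (-|sbar_obs|, |sbar_obs|), computed here by
    the midpoint rule."""
    a = abs(sbar_obs)
    h = 2.0 * a / grid
    mass = sum(prior_predictive_density(-a + (i + 0.5) * h, w, n)
               for i in range(grid)) * h
    return 1.0 - mass
```

Since m_{T,w} is bounded by 1/(2w) everywhere, the mass being subtracted is at most |s̄0|/w, which makes the convergence of the P-value to 1 as w → ∞ transparent.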
Both of these sequences give the same limiting posterior inferences. The discussion in Section 3 indicates that the definition of a noninformative prior must also involve ancillary statistics.

Example 11. Mixtures

Consider again the context of Example 7 and suppose that we take the prior on θ to be the uniform prior on (−w, w). Then, after conditioning on

the ancillary c, the analysis of Example 10 shows that the improper uniform prior satisfies our requirement for noninformativity, as the sequence of P-values converges to 1. Notice, however, that if p0 was small and we didn't condition on c, then if we observed c = 1 the P-value would not converge to 1 as w → ∞. This reinforces the necessity of conditioning on ancillaries when considering whether or not a prior-data conflict exists.

In Examples 8 and 9 it is not entirely clear what priors will satisfy our requirement for noninformativity. Partly this is a technical problem that might yield results with more work, but it is conceivable that in certain problems no such priors or sequences of priors will exist. Further, there is no reason to suppose that, even if such a prior exists, it will necessarily be unique. In Example 13 we discuss what it means for a sequence of priors to be noninformative for the location-scale normal problem, as discussed in Example 4, in light of our requirement. This presents a more challenging problem than those considered in this section.

5 Diagnostics for Ignoring Prior-data Conflict

Suppose we have concluded that there is evidence of a prior-data conflict. The question then remains as to what to do next. In general this is a somewhat difficult question to answer, as it may well entail using a different prior to avoid the conflict, and thus we would need to prescribe how to go about choosing such a prior in a scientifically defendable way. There are circumstances, however, where the answer can be as simple as collecting more data. For we know that, with sufficient data, the effect of the prior on our inferences is immaterial. Of course, circumstances can arise where collecting more data is not possible, and so we may be left with the more difficult problem of selecting a different prior, but we would still like to know if, in a given context, we have enough data to ignore the prior-data conflict and, if not, how much more data we need.
Intuitively, if the inferences about θ that we are interested in making are not strongly dependent on the prior, then we might feel that we can ignore any prior-data conflict. We need, however, some quantitative assessment of this, and there are several possibilities. When there is a prior that is noninformative, we can simply compare the posterior inferences obtained via the noninformative prior with those obtained using the informative prior. If these inferences do not differ by an amount that is considered of practical importance, then it seems reasonable to ignore the prior-data conflict. The amount of difference will depend on the particular application, as a small difference in one context may be considered quite large in another. Consider the following examples.

Example 12. Bernoulli model

Suppose the situation is as described in Example 2 and we specify a Beta(α, β) prior distribution for θ. Suppose further we are interested in estimating θ and

for this we use the posterior mode. Given that the posterior distribution is Beta(α + ns̄, β + n(1 − s̄)), the mode is (α + ns̄ − 1)/(α + β + n − 2). A noninformative prior is, as described in Example 2, given by α = β = 1. A comparison between the posterior modes leads to the difference

(α + ns̄ − 1)/(α + β + n − 2) − s̄ = [(α − 1)(1 − s̄) − (β − 1)s̄]/(α + β + n − 2). (11)

If (11) is smaller than some prescribed value δ, then we can say that any prior-data conflict can be ignored. Note that (11) is bounded above by (α + β − 2)/(α + β + n − 2), and so we can determine n so that this difference is always smaller than δ for a given choice of prior. The value of δ is a value such that a difference greater than this is of practical significance and not otherwise. Necessarily, this value depends upon the application and must be specified by the analyst.

Example 13. Location-scale normal model

Suppose the situation is as described in Example 4 and we are interested in 0.95-hpd intervals for µ. With the prior specified as in (7), the posterior distribution of µ is

µ ~ µ_x + (n + 1/τ0²)^{−1/2} (α0 + n/2)^{−1/2} β_x^{1/2} t_{2α0+n},

where µ_x, β_x are as specified in Example 4. Therefore, a γ-hpd interval for µ is given by

µ_x ± (n + 1/τ0²)^{−1/2} (α0 + n/2)^{−1/2} β_x^{1/2} G^{−1}_{2α0+n}((1 + γ)/2), (12)

where G^{−1}_{2α0+n} is the inverse cdf of the t distribution on 2α0 + n degrees of freedom. For a noninformative sequence of priors we consider what happens to M(m(x̄, s²) ≤ m(x̄0, s0²)) as a function of the hyperparameters µ0, τ0², α0, and β0. From (8), and some easy algebraic manipulations, we see that

M(m(x̄, s²) ≤ m(x̄0, s0²)) = M( (n/((n − 1)(nτ0² + 1))) (α0/β0) (x̄ − µ0)² + (α0/β0) s² ≥ (n/((n − 1)(nτ0² + 1))) (α0/β0) (x̄0 − µ0)² + (α0/β0) s0² ). (13)

Now recall that s² ~ (β0/α0) F(n−1, 2α0), and so the left side of the inequality in (13) converges almost surely (and so in distribution also) to a random variable with the F(n−1, 2α0) distribution, as τ0² → ∞ and β0 → 0 with the added restriction that β0τ0² → ∞. Then, with the additional restriction that α0/β0 → 0, we see that (13) converges to 1.
Note that since β0 → 0, this also implies that α0 → 0. Therefore a sequence of priors, as specified in Example 4, via the hyperparameters (µ0, τ0², α0, β0), will be noninformative for this problem whenever

22 τ,β τ 2 and α /β. Actually, it is apparent that we need only require that α /β to have a noninformative sequence of priors, so there are many such sequences. The condition τ is saying that the variance of the prior on µ is becoming large. Since the mean of a Gamma(α,β ) distribution is α /β, the condition α /β says that the mass of the prior distribution for 1/σ 2 is concentrating at. But since the variance a Gamma(α,β ) distribution is α /β 2, the condition β τ 2 implies that the concentration must occur at a suitably slow rate. For example, if we take β =1/τ and α =1/τ p with p>1, these conditions will be satisþed as τ. Note that with these choices the prior variance of 1/σ 2 is α /β 2 = τ 2 p and so this will become inþnite as τ whenever 1 <p<2, will remain constant at 1 when p =2and goes towhenp>2. Under a sequence of priors, as speciþed above, we have that the γ-hpd interval for µ given by (12) converges to x ± s r n 1 n n G 1 n µ 1+γ 2. (14) This interval differs from the classical frequentist interval by using the factor (1 1/n) 1/2 G 1 n ((1 + γ)/2) instead of G 1 n 1 ((1 + γ)/2). Note, however, that the (1 1/n) 1/2 t n distribution and the t n 1 distribution are very similar. Now we want to compare the two hpd intervals to obtain our diagnostic. Perhaps the most natural comparison is to compute the length measure of the symmetric difference of the two intervals. Certainly something like this would be necessary for general regions. For intervals, however, if one can show that the differences between the corresponding endpoints are less than δ then the length measure of the symmetric difference is also less than δ. So our diagnostic is the maximum difference between the corresponding endpoints of the intervals given by (13) and by (14). If this maximum is less than δ, where δ is a value such that a difference greater than this is of practical signiþcance and not otherwise, then we can ignore any prior-data conßict. Other diagnostics suggest themselves. 
For example, in general we could compute the divergence between the posteriors under the informative and noninformative priors. For this, however, we would need to first state a cut-off value for the divergence, below which we would not view the difference between the distributions as being material. As indicated in Section 4, it does not always seem obvious how to obtain a noninformative prior or sequence of priors. One possibility in these circumstances is to simply select another prior that seems reasonable for the problem and compare inferences as we have done in Examples 12 and 13. For example, we may use some convenient reference prior, even if we have not been able to prove that this prior is noninformative or arises from a noninformative sequence of priors.
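For the Bernoulli setting of Example 12 such a divergence has a closed form, since both posteriors are Beta. The sketch below is ours: it computes the Kullback-Leibler divergence between the posterior under an informative Beta(α, β) prior and the posterior under the uniform (α = β = 1) noninformative prior, with a self-contained digamma function.

```python
import math

def digamma(x):
    # Psi function: recurrence to push x >= 6, then the asymptotic series.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_beta(a1, b1, a2, b2):
    """KL(Beta(a1, b1) || Beta(a2, b2)), in closed form."""
    return (log_beta(a2, b2) - log_beta(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def posterior_divergence(successes, n, alpha, beta):
    """KL between the posterior under the Beta(alpha, beta) prior and the
    posterior under the uniform (noninformative) Beta(1, 1) prior."""
    return kl_beta(alpha + successes, beta + n - successes,
                   1.0 + successes, 1.0 + n - successes)
```

With the data proportion held fixed, the divergence decreases as n grows, quantifying the point at which the informative prior, and hence any prior-data conflict, stops mattering relative to a stated cut-off.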

6 Factoring the Joint Distribution

As further support for the approach taken here, consider the following factorization of the joint distribution of the data and parameter. This seems to point to a logical separation between the activities of checking the sampling model, checking for prior-data conflict, and making inferences about the model parameter. In a Bayesian analysis with a proper prior, the full information available to the statistician for subsequent analysis is effectively the observed data s0 and the joint distribution of (θ, s) as given by the measure P_θ × Π. Denoting a minimal sufficient statistic by T, we see that this factors as P(· | T) × P_{T,θ} × Π, where P_{T,θ} is the marginal probability measure of T and P(· | T) is the conditional distribution of the data s given T. The measure P(· | T) is independent of θ and the prior, and so only depends on the choice of the sampling model {P_θ : θ ∈ Ω}. Therefore, we can compare the observed data s0 against this distribution to assess whether or not the sampling model is reasonable. Also, we can write P_{T,θ} × Π as M_T × Π(· | T), where M_T is the prior predictive distribution of T and Π(· | T) is the posterior distribution of θ. As we have discussed in this paper, comparing the observed value T(s0) with the distribution M_T is an appropriate method for assessing whether or not there is a conflict between the prior and the data. For a maximal ancillary U that is a function of the minimal sufficient statistic T, we can write M_T × Π(· | T) as P_U × M_T(· | U) × Π(· | T), where P_U is the marginal distribution of U and M_T(· | U) is the conditional prior predictive distribution of T given U. We have that P_U depends only on the choice of the sampling model {P_θ : θ ∈ Ω}, and so we can also compare the observed value U(s0) with P_U to assess whether or not the sampling model is reasonable.
In a case where U is not independent of T, we have argued that it is more appropriate to compare T(s0) with M_T(· | U), rather than with M_T, to assess whether or not a prior-data conflict exists, as conditioning on U removes inappropriate variation (variation that does not depend on the prior) from the comparison. Finally, when we feel comfortable with both the sampling model and the prior, we can proceed to inferences about θ, and for this we use the posterior distribution Π(· | T). Note that checking for prior-data conflict is only appropriate when we know the sampling model is reasonable, and making inferences about θ is only appropriate when we know both that the sampling model is appropriate and that there is no prior-data conflict.

In Box (1980) a method was suggested for model checking based upon the full marginal prior predictive M for the data. In several examples this can be seen to give anomalous results. The approach taken in this paper can be seen as a refinement of Box's proposal. For we can factor M as P(· | T) × M_T, and we see that the first component is available for checking the sampling model and the second component is available for checking for prior-data conflict.
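For a Bernoulli sample this factorization can be made concrete. Conditionally on T = Σ s_i every arrangement of the successes is equally likely, which is the P(· | T) component, while under a Beta prior T is beta-binomial, which is the M_T component. The sketch below is ours; the number of runs is just one convenient discrepancy statistic for the model check, and the example values are illustrative.

```python
import math
from itertools import combinations

def runs(seq):
    # Number of maximal blocks of equal symbols.
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def model_check_pvalue(s):
    """Sampling model check via P(. | T): given T = sum(s), all
    arrangements of the successes are equally likely, so compare the
    observed number of runs with its conditional lower tail (few runs
    indicate clumping that the Bernoulli model would rarely produce)."""
    n, t = len(s), sum(s)
    obs = runs(s)
    total = extreme = 0
    for pos in combinations(range(n), t):
        arr = [1 if i in pos else 0 for i in range(n)]
        total += 1
        if runs(arr) <= obs:
            extreme += 1
    return extreme / total

def prior_check_pvalue(t, n, alpha, beta):
    """Prior-data conflict check via M_T: under a Beta(alpha, beta)
    prior, T is beta-binomial(n, alpha, beta)."""
    def log_b(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    probs = [math.exp(math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1)
                      + log_b(k + alpha, n - k + beta) - log_b(alpha, beta))
             for k in range(n + 1)]
    m_obs = probs[t]
    return sum(p for p in probs if p <= m_obs + 1e-12)
```

A clumped sequence such as 111111000000 fails the P(· | T) check, while its total T = 6 raises no prior-data conflict under a Beta(2, 2) prior; the two checks examine the two factors separately.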

Published version: Evans, M. and Moshonov, H. (2006). Checking for Prior-Data Conflict. Bayesian Analysis, 1(4), pp. 893-914.

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Hierarchical Models & Bayesian Model Selection

Hierarchical Models & Bayesian Model Selection Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or

More information

Bayesian Inference. p(y)

Bayesian Inference. p(y) Bayesian Inference There are different ways to interpret a probability statement in a real world setting. Frequentist interpretations of probability apply to situations that can be repeated many times,

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 1) Fall 2017 1 / 10 Lecture 7: Prior Types Subjective

More information

P (E) = P (A 1 )P (A 2 )... P (A n ).

P (E) = P (A 1 )P (A 2 )... P (A n ). Lecture 9: Conditional probability II: breaking complex events into smaller events, methods to solve probability problems, Bayes rule, law of total probability, Bayes theorem Discrete Structures II (Summer

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Lecture 2: Conjugate priors

Lecture 2: Conjugate priors (Spring ʼ) Lecture : Conjugate priors Julia Hockenmaier juliahmr@illinois.edu Siebel Center http://www.cs.uiuc.edu/class/sp/cs98jhm The binomial distribution If p is the probability of heads, the probability

More information

The Exciting Guide To Probability Distributions Part 2. Jamie Frost v1.1

The Exciting Guide To Probability Distributions Part 2. Jamie Frost v1.1 The Exciting Guide To Probability Distributions Part 2 Jamie Frost v. Contents Part 2 A revisit of the multinomial distribution The Dirichlet Distribution The Beta Distribution Conjugate Priors The Gamma

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Principles of Statistical Inference Recap of statistical models Statistical inference (frequentist) Parametric vs. semiparametric

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006 Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

Beta statistics. Keywords. Bayes theorem. Bayes rule

Beta statistics. Keywords. Bayes theorem. Bayes rule Keywords Beta statistics Tommy Norberg tommy@chalmers.se Mathematical Sciences Chalmers University of Technology Gothenburg, SWEDEN Bayes s formula Prior density Likelihood Posterior density Conjugate

More information

arxiv: v1 [math.st] 22 Feb 2013

arxiv: v1 [math.st] 22 Feb 2013 arxiv:1302.5468v1 [math.st] 22 Feb 2013 What does the proof of Birnbaum s theorem prove? Michael Evans Department of Statistics University of Toronto Abstract: Birnbaum s theorem, that the sufficiency

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 2013 Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 27 Course arrangements Lectures M.2

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

One-parameter models

One-parameter models One-parameter models Patrick Breheny January 22 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/17 Introduction Binomial data is not the only example in which Bayesian solutions can be worked

More information

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom Central Limit Theorem and the Law of Large Numbers Class 6, 8.5 Jeremy Orloff and Jonathan Bloom Learning Goals. Understand the statement of the law of large numbers. 2. Understand the statement of the

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

Bayesian Analysis for Natural Language Processing Lecture 2

Bayesian Analysis for Natural Language Processing Lecture 2 Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013 Administrativia The class has a mailing list: coms-e6998-11@cs.columbia.edu Need two volunteers for leading a discussion

More information

Lecture 2. (See Exercise 7.22, 7.23, 7.24 in Casella & Berger)

Lecture 2. (See Exercise 7.22, 7.23, 7.24 in Casella & Berger) 8 HENRIK HULT Lecture 2 3. Some common distributions in classical and Bayesian statistics 3.1. Conjugate prior distributions. In the Bayesian setting it is important to compute posterior distributions.

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

Topic 12 Overview of Estimation

Topic 12 Overview of Estimation Topic 12 Overview of Estimation Classical Statistics 1 / 9 Outline Introduction Parameter Estimation Classical Statistics Densities and Likelihoods 2 / 9 Introduction In the simplest possible terms, the

More information

University of Toronto

University of Toronto A Limit Result for the Prior Predictive by Michael Evans Department of Statistics University of Toronto and Gun Ho Jang Department of Statistics University of Toronto Technical Report No. 1004 April 15,

More information

New Bayesian methods for model comparison

New Bayesian methods for model comparison Back to the future New Bayesian methods for model comparison Murray Aitkin murray.aitkin@unimelb.edu.au Department of Mathematics and Statistics The University of Melbourne Australia Bayesian Model Comparison

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics MAS 713 Chapter 8 Previous lecture: 1 Bayesian Inference 2 Decision theory 3 Bayesian Vs. Frequentist 4 Loss functions 5 Conjugate priors Any questions? Mathematical Statistics

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Chapter 4: An Introduction to Probability and Statistics

Chapter 4: An Introduction to Probability and Statistics Chapter 4: An Introduction to Probability and Statistics 4. Probability The simplest kinds of probabilities to understand are reflected in everyday ideas like these: (i) if you toss a coin, the probability

More information

Lecture 6: Model Checking and Selection

Lecture 6: Model Checking and Selection Lecture 6: Model Checking and Selection Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 27, 2014 Model selection We often have multiple modeling choices that are equally sensible: M 1,, M T. Which

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

Probability and Estimation. Alan Moses

Probability and Estimation. Alan Moses Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.

More information

Compute f(x θ)f(θ) dθ

Compute f(x θ)f(θ) dθ Bayesian Updating: Continuous Priors 18.05 Spring 2014 b a Compute f(x θ)f(θ) dθ January 1, 2017 1 /26 Beta distribution Beta(a, b) has density (a + b 1)! f (θ) = θ a 1 (1 θ) b 1 (a 1)!(b 1)! http://mathlets.org/mathlets/beta-distribution/

More information

An Extended BIC for Model Selection

An Extended BIC for Model Selection An Extended BIC for Model Selection at the JSM meeting 2007 - Salt Lake City Surajit Ray Boston University (Dept of Mathematics and Statistics) Joint work with James Berger, Duke University; Susie Bayarri,

More information

2 Inference for Multinomial Distribution

2 Inference for Multinomial Distribution Markov Chain Monte Carlo Methods Part III: Statistical Concepts By K.B.Athreya, Mohan Delampady and T.Krishnan 1 Introduction In parts I and II of this series it was shown how Markov chain Monte Carlo

More information

10. Exchangeability and hierarchical models Objective. Recommended reading

10. Exchangeability and hierarchical models Objective. Recommended reading 10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.

More information

Computational Perception. Bayesian Inference

Computational Perception. Bayesian Inference Computational Perception 15-485/785 January 24, 2008 Bayesian Inference The process of probabilistic inference 1. define model of problem 2. derive posterior distributions and estimators 3. estimate parameters

More information

Classical and Bayesian inference

Classical and Bayesian inference Classical and Bayesian inference AMS 132 January 18, 2018 Claudia Wehrhahn (UCSC) Classical and Bayesian inference January 18, 2018 1 / 9 Sampling from a Bernoulli Distribution Theorem (Beta-Bernoulli

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Chapter 6 Likelihood Inference

Chapter 6 Likelihood Inference Chapter 6 Likelihood Inference CHAPTER OUTLINE Section 1 The Likelihood Function Section 2 Maximum Likelihood Estimation Section 3 Inferences Based on the MLE Section 4 Distribution-Free Methods Section

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Why Bayesian? Rigorous approach to address statistical estimation problems. The Bayesian philosophy is mature and powerful.

Why Bayesian? Rigorous approach to address statistical estimation problems. The Bayesian philosophy is mature and powerful. Why Bayesian? Rigorous approach to address statistical estimation problems. The Bayesian philosophy is mature and powerful. Even if you aren t Bayesian, you can define an uninformative prior and everything

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Advanced Methods for Data Analysis (36-402/36-608 Spring 2014 1 Generalized linear models 1.1 Introduction: two regressions So far we ve seen two canonical settings for regression.

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

1 Probability Model. 1.1 Types of models to be discussed in the course

1 Probability Model. 1.1 Types of models to be discussed in the course Sufficiency January 18, 016 Debdeep Pati 1 Probability Model Model: A family of distributions P θ : θ Θ}. P θ (B) is the probability of the event B when the parameter takes the value θ. P θ is described

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016

Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016 Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016 1 Theory This section explains the theory of conjugate priors for exponential families of distributions,

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 Suggested Projects: www.cs.ubc.ca/~arnaud/projects.html First assignement on the web: capture/recapture.

More information

Remarks on Improper Ignorance Priors

Remarks on Improper Ignorance Priors As a limit of proper priors Remarks on Improper Ignorance Priors Two caveats relating to computations with improper priors, based on their relationship with finitely-additive, but not countably-additive

More information

Discrete Binary Distributions

Discrete Binary Distributions Discrete Binary Distributions Carl Edward Rasmussen November th, 26 Carl Edward Rasmussen Discrete Binary Distributions November th, 26 / 5 Key concepts Bernoulli: probabilities over binary variables Binomial:

More information

Probability. Lecture Notes. Adolfo J. Rumbos

Probability. Lecture Notes. Adolfo J. Rumbos Probability Lecture Notes Adolfo J. Rumbos October 20, 204 2 Contents Introduction 5. An example from statistical inference................ 5 2 Probability Spaces 9 2. Sample Spaces and σ fields.....................

More information

Uncertain Inference and Artificial Intelligence

Uncertain Inference and Artificial Intelligence March 3, 2011 1 Prepared for a Purdue Machine Learning Seminar Acknowledgement Prof. A. P. Dempster for intensive collaborations on the Dempster-Shafer theory. Jianchun Zhang, Ryan Martin, Duncan Ermini

More information