Bayesian Inference

There are different ways to interpret a probability statement in a real-world setting. Frequentist interpretations of probability apply to situations that can be repeated many times, e.g., tossing a coin, or administering a treatment to a large cohort of patients. In this setting, a probability statement relates to the frequency with which an event occurs in the limit, e.g., how often the coin will land heads up, or what proportion of patients taking the treatment will respond favourably. An alternative approach is to view probability as a rational numerical expression of belief and uncertainty. In this setting, an event that we have a high degree of belief to be true (false) will have a probability close to 1 (0). If we have a higher degree of belief in one event than another, then its corresponding probability will be higher. In this context we can assign probabilities to events that do not occur repeatedly, or that we are unable to observe, e.g., the probability that Shakespeare was the author of the plays attributed to him, or that an important part of a plane's engine will fail while it is being used for a commercial flight. This latter interpretation of probability is associated with Bayesian inference.

Bayes's rule

Consider a statistical model for data y with parameters θ. When we perform inference in a Bayesian context, we consider both y and θ to be random variables. We represent our beliefs about the possible values that these variables can take using the joint distribution

p(θ, y) = p(y | θ) p(θ).

Here p(y | θ) refers to the likelihood function, while p(θ) is the prior distribution of the parameters. We can interpret p(θ) as our beliefs about the model parameters before any data are observed. If we observe some data y, then we can condition our beliefs about θ by using Bayes's rule to obtain the posterior distribution:

p(θ | y) = p(y | θ) p(θ) / p(y).

In words, we update our beliefs about θ, conditional on the value of y, with these beliefs expressed in the form of a probability distribution. In practice, the marginal distribution of the data,

p(y) = ∫_Θ p(y, θ) dθ,

is often difficult to compute exactly. Instead we will usually focus on the unnormalised posterior,

p(θ | y) ∝ p(y | θ) p(θ).

Either the unnormalised posterior will have a recognisable shape, or we will resort to computational methods to determine the normalised distribution of the parameter, as we will see later.
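To make the normalisation step concrete, here is a minimal sketch in R, assuming a hypothetical binomial model with a uniform prior (the model and the data values are illustrative, not from the text): it evaluates the unnormalised posterior on a grid and approximates p(y) by numerical integration.

```r
# Grid approximation of a posterior (illustrative model and data).
theta <- seq(0.001, 0.999, length.out = 1000)   # grid over the parameter space
n <- 20; y <- 14                                # hypothetical data: 14 successes in 20 trials
unnorm <- dbinom(y, size = n, prob = theta) * dunif(theta)  # p(y | theta) p(theta)
p_y <- sum(unnorm) * (theta[2] - theta[1])      # quadrature approximation of p(y)
posterior <- unnorm / p_y                       # normalised posterior density on the grid
```

For this simple one-parameter model a grid suffices; the computational methods mentioned above become necessary when θ is high-dimensional.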
For many years, the inclusion of the prior distribution p(θ) in the inference process was a source of controversy and considerable debate within the statistics community, the argument being that it made inference subjective and unreliable. From a practical perspective, and broadly speaking, when there is a large sample of data, the amount of information in the likelihood overwhelms that in the prior, so that very extreme prior beliefs would be needed to meaningfully influence the inference procedure. On the other hand, when data are scarce, such as in some reliability applications, expert opinion is often required before meaningful probability statements can be made. In general, the utility of Bayesian methods has now been widely demonstrated, and the use of such methods is accepted by the wider scientific community.

Beta-binomial model

Suppose that we observe n trials of a binary outcome y = (y_1, ..., y_n), where each y_i ∼ Binomial(1, θ). Then the likelihood for the data takes the form

p(y | θ) = ∏_{i=1}^n θ^{y_i} (1 − θ)^{1 − y_i} = θ^{∑_{i=1}^n y_i} (1 − θ)^{n − ∑_{i=1}^n y_i}.

We can specify a Beta(a, b) distribution as a prior:

p(θ | a, b) = c(a, b) θ^{a−1} (1 − θ)^{b−1}.

Here

c(a, b) = Γ(a + b) / (Γ(a) Γ(b)) = 1 / ∫_0^1 θ^{a−1} (1 − θ)^{b−1} dθ,

which we will take as a known result from calculus, and where Γ(x) denotes the gamma function, which is defined for any number x > 0. We call a and b the hyperparameters of the prior distribution. Typically, these values must be chosen. A simple argument (in fact, something of a cop-out) for choosing a and b is to note that setting a = b = 1 means that p(θ | a, b) = 1 for any value of θ. Hence this is often interpreted as a non-informative prior for θ. Regardless of the choice of hyperparameters, we can combine the prior and likelihood together so that

p(θ | y, a, b) ∝ p(y | θ) p(θ | a, b)
             = θ^{∑_{i=1}^n y_i} (1 − θ)^{n − ∑_{i=1}^n y_i} · c(a, b) θ^{a−1} (1 − θ)^{b−1}
             ∝ θ^{∑_{i=1}^n y_i + a − 1} (1 − θ)^{n − ∑_{i=1}^n y_i + b − 1}
             = θ^{a* − 1} (1 − θ)^{b* − 1}.

Here we have ignored any terms that do not involve θ, and then used some algebra to tidy up the expression on the right-hand side of the equation. It remains to identify a normalising constant for the posterior p(θ | y, a, b). You should be able to recognise that the shape of p(θ | y, a, b) is the same as the prior p(θ | a, b), but with parameters a* and b*. This means that p(θ | y, a, b) follows a beta distribution, with parameters a* = ∑_{i=1}^n y_i + a and b* = n − ∑_{i=1}^n y_i + b.
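As a quick illustration of the conjugate update, here is a minimal sketch in R with simulated (hypothetical) Bernoulli data; it computes a* and b* directly from the formulas above.

```r
# Conjugate beta-binomial update (simulated data, Beta(1, 1) prior).
set.seed(1)
n <- 20
y <- rbinom(n, size = 1, prob = 0.7)  # hypothetical Bernoulli observations
a <- 1; b <- 1                        # hyperparameters of the Beta(a, b) prior
a_star <- sum(y) + a                  # a* = sum(y_i) + a
b_star <- n - sum(y) + b              # b* = n - sum(y_i) + b
# The posterior density is dbeta(theta, a_star, b_star); e.g., its mean:
a_star / (a_star + b_star)
```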
Figure 1: Examples of the posterior distribution for the beta-binomial model with different sample sizes (panels n = 10 and n = 20) and hyperparameters (Beta(1,1) and Beta(3,2) priors).

Inspecting the parameters of the posterior distribution, we can interpret the updated terms a* and b* as a combination of our prior knowledge (in the form of the hyperparameters a and b) and summary statistics of the observed data (∑_{i=1}^n y_i and n − ∑_{i=1}^n y_i). We can interpret the value of a relative to b as a reflection of our prior belief in the number of successes relative to failures we would expect to observe. The value a + b is a reflection of our certainty in these beliefs; larger values indicate higher certainty. If the sample size n is much larger than a + b, then the hyperparameters will have relatively little effect on the posterior distribution. Some examples of the posterior distribution for the beta-binomial model with different sample sizes (n = 10, 20) and hyperparameters ((a = 1, b = 1), (a = 3, b = 2)) are shown in Figure 1.
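To see the diminishing influence of the prior numerically, consider this small sketch in R (the counts are hypothetical): the posterior mean is (∑ y_i + a)/(n + a + b), so for fixed hyperparameters the data term dominates as n grows.

```r
# Posterior mean under a Beta(a, b) prior given sum(y) successes in n trials.
post_mean <- function(sum_y, n, a, b) (sum_y + a) / (n + a + b)
post_mean(7, 10, 3, 2)    # 0.667: n = 10, the Beta(3,2) prior pulls below the raw 0.7
post_mean(70, 100, 3, 2)  # 0.695: n = 100, the posterior mean is close to the raw 0.7
```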
Figure 2: Example: posterior distribution of θ, the probability of a female birth in the case of placenta previa, with a Beta(1,1) prior.

Example: estimating the probability of female birth given placenta previa

The following example is taken from [2]. We consider the sex ratio of births for which the maternal condition placenta previa occurred. This is an unusual condition of pregnancy that prevents a normal delivery from occurring. A study concerning the sex of placenta previa births in Germany recorded that 437 of a total of 980 births were female. How much evidence does this provide for the claim that the proportion of female births in the population of placenta previa births is less than 0.5? If we adopt a uniform Beta(1,1) prior, then the posterior distribution for θ, the probability of a female birth in the case of placenta previa, is Beta(438, 544). This distribution is visualised in Figure 2. The red line indicates the posterior distribution, and the green line the prior. In this case we have a large number of observations, so the curves are very distinct. A dashed line indicates the value θ = 0.5. Clearly, in this case the majority of the mass of the distribution is to the left of 0.5. Using R, we can compute the probability P(θ < 0.5) = 0.99965.
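The quoted probability can be reproduced with a one-line computation in R, using the cumulative distribution function of the beta distribution:

```r
# P(theta < 0.5) under the Beta(438, 544) posterior.
pbeta(0.5, shape1 = 438, shape2 = 544)  # approximately 0.99965, as quoted above
```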
Conjugacy

We have shown that a beta prior and a binomial likelihood lead to a beta posterior, which makes inference easier (we didn't need to do any integration ourselves, and instead could use a known result). We say that the beta prior is conjugate for the binomial distribution. More generally, we say that a class P of prior distributions for θ is conjugate for a likelihood function p(y | θ) if

p(θ) ∈ P ⇒ p(θ | y) ∈ P.

We will exploit this convenient property many times.

Graphical diagram

We can represent the data generation for a beta-binomial model as follows:

θ ∼ Beta(a, b);
Y_i | θ ∼ Binomial(1, θ), for i = 1, ..., n.

This is a reflection of the conditional distribution of the posterior, i.e., p(θ | y, a, b) ∝ p(y | θ) p(θ | a, b). We can also represent this model using a graphical diagram; see Figure 3. This is a graph, consisting of nodes and edges. The nodes represent the parameters and data of the model. The edges connect the nodes, and represent the dependence between parameters. The direction of the edges indicates the nature of the dependence between the parameters. Figure 3a shows each datapoint y_1, ..., y_n separately. Figure 3b shows the same model more concisely; the data y are collectively represented using a plate diagram. In both figures, the node representing y is shaded, denoting that this quantity is observed. The nodes representing a and b are boxes, denoting that they are hyperparameters and are specified by the analyst. The node representing θ is transparent and circular, which means that it is a quantity to be inferred.

Figure 3: Graphical diagrams of the beta-binomial model. The second diagram uses plate notation to represent the data.
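As a complement to the diagram, here is a minimal sketch in R of the generative process just described (the hyperparameter values and sample size are hypothetical):

```r
# Simulating one dataset from the beta-binomial generative model.
set.seed(2)
a <- 3; b <- 2; n <- 20                 # hypothetical hyperparameters and sample size
theta <- rbeta(1, a, b)                 # theta ~ Beta(a, b)
y <- rbinom(n, size = 1, prob = theta)  # y_i | theta ~ Binomial(1, theta)
```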
References

[1] P. D. Hoff, A First Course in Bayesian Statistical Methods, Chapter 3. Springer, 2009.
[2] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Rubin, Bayesian Data Analysis, 2nd edition, Chapter 2. Chapman & Hall/CRC, 2004.