Chalmers April 6, 2017
Bayesian philosophy
Bayesian philosophy

Bayesian statistics versus classical statistics: war or co-existence?

Classical statistics: Models have variables and parameters; these are conceptually different. Variables represent (potential) data. Parameters are assumed to have a FIXED but UNKNOWN value. Thus, models for unrepeatable events are meaningless. Procedures to compute the parameters from data are judged by their properties when applied to potential new data. Models with estimated parameters inserted yield predictions.

Bayesian statistics: Models have only variables; their distributions represent (some person's) KNOWLEDGE about some part of the world. Models for unrepeatable events are meaningful. Models give predictions of (relative) probabilities for data, even before data is observed. Predictions for new observations are made from models conditional on old data.
Example

A sequence of independent and identical trials is performed, each resulting in success (1) or failure (0). The following data is observed: 0, 1, 0, 0, 1, 0, 0, 1.

Classical analysis: A possible model is a Binomial distribution, with probability of success p and x out of 8 trials observed as successes. A possible estimator for p is p̂ = x/8. One can show this estimator is unbiased, i.e., E[p̂] = p. With our data, we get p̂ = 3/8. Plugging into the model, we compute the probability 0.062 that 4 of the next 5 trials will be successes.

Another possible model is a negative Binomial distribution, where y is the number of trials needed to observe 3 successes. A possible estimator for p is p̂ = 3/y. This estimator for p has a different distribution. For example, it is biased, i.e., E[p̂] ≠ p. One might instead use the minimum variance unbiased estimator for p, and get p̂ = (3 − 1)/(8 − 1) = 2/7. But this would yield the probability 0.024 that 4 of the next 5 trials will be successes.
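The two classical point estimates and the resulting predictions can be checked with a short computation (a sketch; the helper name is ours):

```python
from math import comb

x, n = 3, 8                            # 3 successes observed in 8 trials

p_hat_binom = x / n                    # unbiased estimator under the Binomial model
p_hat_mvue = (x - 1) / (n - 1)         # minimum variance unbiased estimator, negative Binomial model

def prob_4_of_5(p):
    # probability that exactly 4 of the next 5 trials succeed
    return comb(5, 4) * p**4 * (1 - p)

print(round(prob_4_of_5(p_hat_binom), 3))   # 0.062
print(round(prob_4_of_5(p_hat_mvue), 3))    # 0.024
```

The same data thus yields two different predictions, depending only on which sampling model (and hence estimator) was chosen.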
Example, continued

Assume we want to do a hypothesis test where H0: p ≥ 0.6, while H1: p < 0.6. What will the p-value be? The answer depends on which test statistic we use. Recall that the p-value is the probability, assuming H0 and generating new data, of observing something equally or more extreme than the given test statistic, in terms of rejecting H0.

One possibility is the test statistic x, the number of successes in 8 trials. The probability of observing 0, 1, 2, or 3 successes when p = 0.6 is 0.174, so the p-value is 0.174.

Another possibility is the test statistic y, the number of trials needed to observe 3 successes. The probability of needing 8 or more trials when p = 0.6 is 0.096, so now that is the p-value.
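Both p-values can be recomputed directly as Binomial tail probabilities (needing 8 or more trials for 3 successes is the same event as at most 2 successes in the first 7 trials):

```python
from math import comb

def binom_pmf(k, n, p):
    # probability of exactly k successes in n trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

p0 = 0.6
# test statistic x: P(x <= 3) in 8 trials under H0
pval_x = sum(binom_pmf(k, 8, p0) for k in range(4))
# test statistic y: P(y >= 8) = P(at most 2 successes in first 7 trials)
pval_y = sum(binom_pmf(k, 7, p0) for k in range(3))
print(round(pval_x, 3), round(pval_y, 3))   # 0.174 0.096
```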
Example, continued

In the classical analysis, answers depend on the choice of estimator or test statistic. However, they do not depend on the context. Consider the following contexts:
* 8 tosses of a coin give 3 heads.
* 8 tests of a new medical procedure lead to 3 fatalities.
* Controlling 8 items produced in a factory uncovers 3 faulty items.
In real life, the predicted probability of 4 successes in the next 5 trials would be different in these three contexts.

In a Bayesian analysis, the different contexts would be taken into account by formulating a prior probability distribution for p, indicating the prior knowledge:
* For the coin example, we might use p ~ Beta(20, 20).
* For the medical example, studies of similar medical procedures might yield p ~ Beta(2, 6).
* For the factory example, knowledge gained in similar testing might be formulated with p ~ Beta(1, 10).
Digression: The Beta distribution

θ has a Beta distribution on [0, 1], with parameters α and β, if its density has the form

π(θ | α, β) = (1 / B(α, β)) θ^(α−1) (1 − θ)^(β−1)

where B(α, β) is the Beta function, defined by

B(α, β) = Γ(α)Γ(β) / Γ(α + β)

where Γ(t) is the Gamma function, defined by

Γ(t) = ∫₀^∞ x^(t−1) e^(−x) dx

Recall that for positive integers, Γ(n) = (n − 1)! = 1 · 2 · · · (n − 1). See for example Wikipedia for more properties of the Beta distribution, and the Beta and Gamma functions. We write π(θ | α, β) = Beta(θ; α, β) for the Beta density.
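These definitions can be sketched directly using Python's math.gamma (function names are ours):

```python
from math import gamma

def beta_fn(a, b):
    # B(a, b) = Γ(a)Γ(b) / Γ(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_pdf(theta, a, b):
    # density of the Beta(a, b) distribution at theta in (0, 1)
    return theta**(a - 1) * (1 - theta)**(b - 1) / beta_fn(a, b)

# B(2, 6) = Γ(2)Γ(6)/Γ(8) = 1 · 120 / 5040 = 1/42
print(round(beta_fn(2, 6), 5))   # 0.02381
```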
Example, continued

The Bayesian model consists of the appropriate prior for p, and, conditionally on each such p, a model for the data: It could be either a Binomial model for x or a negative Binomial model for y.

Thus, the Bayesian model is a bivariate probability distribution; there is no conceptual difference between the variable for observed data (be it x or y) and p.

Before the data is observed, the probability of observing just this data can be computed using the marginal distribution of x (or y) computed from the bivariate distribution representing the model.
Example, continued

The knowledge about p after considering the data can be computed as the conditional distribution when we fix the data. This is called the posterior distribution for p. Note that there is no need to make a subjective choice of an estimator for p. Crucially, the posterior will be the same whether we use a Binomial model or a negative Binomial model for the data.

Let z be the number of successes in 5 new trials. Given the posterior distribution for p, we get a bivariate model for p and z by multiplying with a Binomial distribution for z, with 5 trials and probability of success p. The distribution of z can be computed as the marginal distribution over this model. Note that predictions about z will not depend on whether we used a Binomial or negative Binomial distribution for the data.
Bayesian computations, simplest example

Bayesian computations can in fact always be performed by multiplying probability densities or probability functions, and taking conditional or marginal distributions.

Example: An archeological item could be from either of three areas, A, B, or C. Based on visual inspection, it is judged to be from A, B, or C with probabilities 0.2, 0.5, and 0.3, respectively. Now a chemical analysis is done to detect two trace elements, X and Y. We know that the probabilities of detecting combinations of these trace elements, given the item's origin, are given in the table below:

      Both X and Y   X only   Y only   None
  A   0.1            0.7      0.1      0.1
  B   0.6            0.1      0.2      0.1
  C   0.1            0.1      0.1      0.7
Bayesian computations, simplest example, continued

How can we answer, for example, questions like: What is the probability that the item is from A, given that X only is detected? The table below represents the joint distribution (with marginals in the last row and column):

      Both X and Y   X only   Y only   None
  A   0.02           0.14     0.02     0.02   0.2
  B   0.30           0.05     0.10     0.05   0.5
  C   0.03           0.03     0.03     0.21   0.3
      0.35           0.22     0.15     0.28

All questions can be answered by computing conditional or marginal distributions from the table above. For example, Pr(A | X only) = 0.14/0.22 = 0.636.
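The whole table manipulation can be sketched in a few lines (the dictionary keys are illustrative labels):

```python
# Joint distribution: prior over origins times the conditional table from the slide.
prior = {"A": 0.2, "B": 0.5, "C": 0.3}
likelihood = {  # P(observation | origin)
    "A": {"both": 0.1, "X only": 0.7, "Y only": 0.1, "none": 0.1},
    "B": {"both": 0.6, "X only": 0.1, "Y only": 0.2, "none": 0.1},
    "C": {"both": 0.1, "X only": 0.1, "Y only": 0.1, "none": 0.7},
}

joint = {(o, obs): prior[o] * p
         for o, row in likelihood.items() for obs, p in row.items()}

obs = "X only"
marginal = sum(joint[(o, obs)] for o in prior)   # column sum: 0.22
posterior_A = joint[("A", obs)] / marginal       # 0.14 / 0.22
print(round(posterior_A, 3))                     # 0.636
```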
Computations in Beta-Binomial example

Let's choose one of the priors: Assume p ~ Beta(2, 6). The probability density becomes π(p) = (1 / B(2, 6)) p^(2−1) (1 − p)^(6−1).

If we use that x has a Binomial distribution with 8 trials and parameter p, we get the probability function π(x | p) = C(8, x) p^x (1 − p)^(8−x).

The joint model becomes π(x, p) = π(x | p) π(p) = C(8, x) p^x (1 − p)^(8−x) · (1 / B(2, 6)) p^1 (1 − p)^5.

We would like to compute π(p | x) = π(x, p) / π(x) with x fixed to the value 3. Note that, as a function of p, this must be proportional to p^4 (1 − p)^10. Note also that the Beta distribution with parameters 5 and 11 has a density proportional to p^4 (1 − p)^10. Thus these two densities must be identical! We get

π(p | x = 3) = Beta(p; 5, 11)
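The resulting conjugate update rule is a one-liner (the function name is ours):

```python
# Conjugate Beta-Binomial update: a Beta(a, b) prior combined with x successes
# in n trials gives a Beta(a + x, b + n - x) posterior.
def beta_binomial_update(a, b, x, n):
    return a + x, b + n - x

print(beta_binomial_update(2, 6, 3, 8))   # (5, 11)
```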
Computations in previous example, continued

More generally, if we had used the prior Beta(p; α, β) for p, we would get the posterior Beta(p; α + 3, β + 5).

Note that, if we had chosen to use data y with a negative Binomial distribution, we would have π(y | p) = C(y − 1, 2) p^3 (1 − p)^(y−3), and one can check that the posterior for p would become the same.

The possible new data z has a Binomial distribution with 5 trials and parameter p. Multiplying this probability function with the posterior density found above, we get the joint distribution for z and p given the data. We can now compute

π(z) = ∫ π(z | p) π(p | x = 3) dp = C(5, z) (1 / B(5, 11)) ∫ p^(4+z) (1 − p)^(15−z) dp = C(5, z) B(5 + z, 16 − z) / B(5, 11).

Thus we get that the probability of 4 successes in 5 new trials is π(z = 4) = 0.04966.
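The predictive distribution can be evaluated numerically from the formula above (function names are ours):

```python
from math import comb, gamma

def beta_fn(a, b):
    # B(a, b) = Γ(a)Γ(b) / Γ(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def predictive(z, m=5, a_post=5, b_post=11):
    # P(z successes in m new trials) = C(m, z) B(a_post + z, b_post + m - z) / B(a_post, b_post)
    return comb(m, z) * beta_fn(a_post + z, b_post + m - z) / beta_fn(a_post, b_post)

print(round(predictive(4), 5))   # 0.04966
```

As a sanity check, the probabilities for z = 0, ..., 5 sum to 1.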
More advanced

More generally, let x be a vector representing the data, and let θ be a vector representing the variables of interest. Assume we can write down the probability (density) function π(x | θ), and the prior π(θ). Then the posterior for the parameter θ is given by Bayes' formula

π(θ | x) = π(x | θ) π(θ) / π(x) = π(x | θ) π(θ) / ∫ π(x | θ) π(θ) dθ ∝_θ π(x | θ) π(θ)

where π(x) is the marginal probability (density) for x. Note the notation ∝_θ: If we only know the posterior π(θ | x) up to a factor not depending on θ, it can be reconstructed by requiring the sum (or integral) to be 1. Thus, in order to do inference, i.e., compute the posterior distribution of θ, we only need to compute the distribution for θ whose density is proportional to π(x | θ) π(θ).
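A minimal sketch of working "up to proportionality": evaluate π(x | θ) π(θ) on a grid of θ values and normalize so the values sum to 1. Here we reuse the Beta(2, 6) prior and the data x = 3 of 8 from the running example, so the exact answer is Beta(5, 11):

```python
# Grid approximation of the posterior, using only the unnormalized product
# likelihood × prior; the normalizing constant is recovered by summing.
N = 100_000
grid = [(i + 0.5) / N for i in range(N)]                        # midpoints in (0, 1)
unnorm = [p**3 * (1 - p)**5 * p**1 * (1 - p)**5 for p in grid]  # Binomial kernel × Beta(2,6) kernel
total = sum(unnorm)
posterior = [u / total for u in unnorm]                          # normalized weights

mean = sum(p * w for p, w in zip(grid, posterior))
print(round(mean, 4))   # 0.3125, the mean of Beta(5, 11)
```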
Computational methods for the posterior

When all variables are finite-valued, there are algorithms for exact, efficient computations, even when the distributions π(θ) and π(x | θ) are expressed in terms of a network of dependent variables.

The computations in the second example above work out (fairly) easily because we chose as the prior for p a distribution that is conjugate to the Binomial distribution (or negative Binomial) for the data. With enough conjugacies, one can also obtain exact posteriors.

In all other cases, one can only compute approximations of the posterior. The group of methods called Markov chain Monte Carlo (MCMC) is by far the most general and popular family of approximation methods. There are some other approximate algorithms, for example INLA (Integrated Nested Laplace Approximation), but they can be applied to more limited sets of models.
Markov chain Monte Carlo

The idea is to generate an (approximate) sample from the posterior. Then, inference can be done based on this sample. The sample is produced using a Markov chain. The chain is produced by
* starting at some fairly arbitrary value θ_0,
* for each step, generating a new proposed value from the old, using some algorithm, and
* accepting or rejecting the proposed value based on an acceptance criterion.
The acceptance criterion depends on the posterior distribution π(θ | x), but it needs to be known only up to a constant. This fits our situation perfectly.

The distribution of the chain converges to the correct distribution, but the convergence may be slow. The chain may also have autocorrelation.
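The steps above can be sketched as a minimal random-walk Metropolis sampler (the proposal scale 0.1, the starting value, and the chain length are arbitrary choices), targeting the unnormalized posterior kernel p^4 (1 − p)^10 from the example above:

```python
import random

# Target known only up to a constant: the Beta(5, 11) posterior kernel.
def target(p):
    return p**4 * (1 - p)**10 if 0 < p < 1 else 0.0

random.seed(0)
p = 0.5                      # fairly arbitrary starting value
samples = []
for _ in range(50_000):
    proposal = p + random.gauss(0, 0.1)               # symmetric random-walk proposal
    if random.random() < min(1, target(proposal) / target(p)):
        p = proposal                                  # accept; otherwise keep old value
    samples.append(p)

burned = samples[5_000:]                              # discard burn-in
print(sum(burned) / len(burned))                      # close to 0.3125, the Beta(5, 11) mean
```

Note that only ratios of the target appear, so the normalizing constant is never needed.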
Checking convergence

The simplest method is to monitor the series of values of a variable. Does the pattern seem to stabilize?

A slightly more advanced method is to use several parallel Markov chains with independent starting points. If convergence is reached, the range of values spanned by all chains together should be the same as the range of values spanned by each chain; otherwise it is larger. This is measured by a quantity called R, and estimated by R̂. If R̂ goes down towards 1, this indicates convergence.

High autocorrelation means that the chain moves very slowly. This also indicates slow convergence.
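A sketch of computing R̂ from several chains (this is the basic Gelman–Rubin form, comparing between-chain and within-chain variance; modern software additionally splits each chain in half):

```python
import random

def r_hat(chains):
    # chains: list of m equal-length lists of sampled values
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand)**2 for mu in means)     # between-chain variance
    W = sum(sum((x - mu)**2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m               # within-chain variance
    var_plus = (n - 1) / n * W + B / n                         # pooled variance estimate
    return (var_plus / W) ** 0.5

# Two chains exploring the same region -> R-hat near 1.
random.seed(1)
c1 = [random.gauss(0, 1) for _ in range(1000)]
c2 = [random.gauss(0, 1) for _ in range(1000)]
print(r_hat([c1, c2]))   # close to 1
```

Shifting one chain away from the other makes R̂ grow well above 1, flagging non-convergence.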
Improving convergence

A popular type of MCMC is Gibbs sampling. Each proposal changes only one of the variables in the variable vector, and the proposal is based on the conditional distribution of this variable given all the others. Gibbs sampling often works great, and is easy to implement. However, for highly correlated variables, convergence can be too slow.

General methods to improve convergence speed exist. But often, the most efficient approach is to look carefully at the shape of your distribution, and choose a proposal function adapted to it.
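A Gibbs sampling sketch for an illustrative target not taken from the slides: a bivariate normal with correlation ρ, where each full conditional is itself normal, x | y ~ N(ρy, 1 − ρ²) and symmetrically for y:

```python
import random

random.seed(2)
rho = 0.8
x = y = 0.0
xs, ys = [], []
for _ in range(20_000):
    x = random.gauss(rho * y, (1 - rho**2) ** 0.5)   # draw x from its full conditional
    y = random.gauss(rho * x, (1 - rho**2) ** 0.5)   # draw y from its full conditional
    xs.append(x)
    ys.append(y)

# Sample correlation should recover rho.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
vx = sum((a - mx)**2 for a in xs) / n
vy = sum((b - my)**2 for b in ys) / n
corr = cov / (vx * vy) ** 0.5
print(round(corr, 1))   # 0.8
```

As ρ approaches 1, the conditional distributions become very narrow and the chain creeps along the diagonal, illustrating the slow convergence mentioned above.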
Using the sample for inference

Given a sample from a distribution, all properties of the distribution can in fact be estimated from this sample.

For example, given a sample of size 10,000 of a variable, you can estimate a 95% credibility interval (i.e., an interval that covers 95% of the probability density) by finding the 250th and the 9750th values in the ordered sample. In R, use quantile.

To estimate the expectation of any function f of the variable θ, simply compute f(θ_1), f(θ_2), ..., f(θ_10000) and take their average.
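The recipe above, sketched on a synthetic sample of size 10,000 (drawn here from Beta(5, 11) for illustration, with f(θ) = θ²):

```python
import random

random.seed(3)
sample = sorted(random.betavariate(5, 11) for _ in range(10_000))

lower, upper = sample[249], sample[9749]            # the 250th and 9750th ordered values
mean_sq = sum(t**2 for t in sample) / len(sample)   # estimate of E[θ²]
print(round(lower, 2), round(upper, 2))
```

For Beta(5, 11), the exact value E[θ²] = 5·6/(16·17) ≈ 0.110, so the sample average lands close to that.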
Most statisticians use both frequentist and Bayesian methods, so a large proportion of the software available, including R packages, uses some Bayesian ideas.

When models are Bayesian Networks with finite-valued variables (or only normally distributed variables), algorithms for exact inference are available in programs like Hugin (commercial) or GeNIe (free).

There are a few general-purpose programs for models formulated as a Bayesian Network. The most famous and oldest is BUGS, which exists in a number of incarnations (WinBUGS, OpenBUGS). It basically implements Gibbs sampling. It can be accessed from R via a number of different R packages, e.g., R2OpenBUGS, BRugs, etc.

Some more modern general-purpose programs exist, most notably JAGS and Stan. They implement improvements to the algorithms of BUGS that in general increase convergence speed.