Bayesian Inference: Concept and Practice
1 Bayesian Inference: Concept and Practice. Fundamentals. Johan A. Elkink, School of Politics & International Relations, University College Dublin. 5 June 2017
2 Outline: 1. Fundamentals of Bayesian inference. 2. The Bayesian linear model. 3. Bayesian hypothesis testing.
3 Bayes' theorem. In order to estimate the parameters of a model, it is natural to ask, given the data: what probability distribution can we assign to the parameters? p(θ|X) = ? From Bayes' theorem we have:

p(θ|X)p(X) = p(X|θ)p(θ)

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫ p(X|θ)p(θ)dθ

We refer to p(θ|X) as the posterior and p(θ) as the prior distribution of θ. Bayesian inference is therefore the updating of our prior beliefs given some new data.
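The update above can be sketched numerically for a discrete parameter. The coin example below is illustrative and not from the slides: θ is either 0.5 (fair) or 0.8 (loaded), and we update after observing heads.

```python
# Bayes' theorem for a discrete parameter: a coin whose bias theta is
# either 0.5 (fair) or 0.8 (loaded), updated after observing heads.
prior = {0.5: 0.5, 0.8: 0.5}          # p(theta)
likelihood = {0.5: 0.5, 0.8: 0.8}     # p(heads | theta)

# p(x) = sum over theta of p(x | theta) p(theta): the normalising constant
evidence = sum(likelihood[t] * prior[t] for t in prior)

# p(theta | x) = p(x | theta) p(theta) / p(x)
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}

print(posterior)  # the loaded coin becomes more probable
```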
4 Prior beliefs. Where do these prior beliefs come from? A purely subjective starting point. Earlier scientific findings. Earlier data in the same endeavour. Skeptical or optimistic views about what should be found. But also: priors can be vague, low on information, so as not to drive the results. Priors can be improper, so that they are not real probability densities, but carry very little information. Priors can be data-driven.
5 Likelihood function. p(θ|X) = p(X|θ)p(θ) / p(X). While p(X|θ) is a probability distribution of X for a given value of θ, it can also be seen as a function of θ for the observed value of X. In this case it is not a proper probability density function, as it does not necessarily integrate to 1, and it is called a likelihood function: L(θ|X) = p(X|θ). It is often easier to work with the log-likelihood l(θ|X) = log L(θ|X). (dice example) (Lee, 2012, 37)
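The slide's dice example might be sketched as follows; the counts here are illustrative, not from the slides. We treat θ, the probability of rolling a six, as the argument of the (log-)likelihood for fixed observed data, dropping the binomial coefficient since it does not depend on θ.

```python
import math

# Log-likelihood of theta = P(six) after observing k sixes in n rolls,
# up to an additive constant (the binomial coefficient is free of theta).
def log_likelihood(theta, k=30, n=120):
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

# Compare the fair-die value 1/6 with the observed frequency 30/120 = 1/4:
# as a function of theta, the likelihood is maximised at the latter.
print(log_likelihood(1/6), log_likelihood(0.25))
```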
6 Frequentist vs. Bayesian. Note that in Bayesian statistics, we take θ to be a random variable with an associated probability distribution, and we try to describe this distribution. In frequentist statistics, we instead take X as the random variable, with an associated probability distribution parameterised by θ, and find the θ that maximises this distribution. Since the likelihood L(θ|X) is exactly p(X|θ) viewed as a function of θ, this amounts to finding the θ that maximises the (log-)likelihood function. We return to the comparison with frequentist statistics when we turn to hypothesis testing.
7 Predictive distribution. p(θ|X) = p(X|θ)p(θ) / p(X). The normalising constant p(X) can be taken as the marginal distribution p(X) = ∫ p(X|θ)p(θ)dθ, which is also called the predictive distribution, as it is our prediction of X taking into account our uncertainty around θ as well as around X given θ. In practice, there are many circumstances where we do not need to concern ourselves with this constant, focusing instead on p(θ|X) ∝ p(X|θ)p(θ). (Lee, 2012, 39)
8 posterior ∝ likelihood × prior
9 Parametric. Note that Bayesian inference as presented here is parametric: every parameter is given a particular type of probability distribution. If we have no idea about the type of distribution that describes p(X|θ), this can be problematic. There is such a thing as semiparametric or nonparametric Bayesian inference, but it is beyond the scope of today.
10 Normal distribution (1 observation, unknown µ). In some circumstances, Bayesian inference has an analytic solution. For example, assume we have one observation x from a normally distributed random variable with mean µ and variance σ², X ∼ N(µ, σ²), where σ² is known:

p(x) = (2πσ²)^(−1/2) exp(−(x − µ)² / 2σ²)

Suppose we have a prior belief regarding the mean that follows a normal distribution, µ ∼ N(µ₀, σ²_µ0):

p(µ) = (2πσ²_µ0)^(−1/2) exp(−(µ − µ₀)² / 2σ²_µ0)

(Lee, 2012, 40)
11 Normal distribution (1 observation, unknown µ). We can now update our prior belief as follows:

p(µ|x) ∝ p(x|µ)p(µ) = (2πσ²)^(−1/2) exp(−(x − µ)² / 2σ²) × (2πσ²_µ0)^(−1/2) exp(−(µ − µ₀)² / 2σ²_µ0),

which can be worked out as:

p(µ|x) = (2πσ²_µ1)^(−1/2) exp(−(µ − µ₁)² / 2σ²_µ1),

which implies µ|x ∼ N(µ₁, σ²_µ1), where µ₁ = σ²_µ1 (µ₀/σ²_µ0 + x/σ²) and σ²_µ1 = (σ⁻²_µ0 + σ⁻²)⁻¹. (Lee, 2012, 40–41)
12 Normal distribution (1 observation, unknown µ). Observation x = 2, while σ² = 1 is known. Prior distribution µ ∼ N(0, 2). [Figure: prior and posterior densities p(µ).]
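The slide's numbers (x = 2, σ² = 1, prior N(0, 2)) can be plugged into the update formulas directly; a minimal sketch:

```python
def normal_update(mu0, var0, x, var):
    """Posterior of mu after one observation x ~ N(mu, var), with
    prior mu ~ N(mu0, var0); the data variance var is assumed known."""
    var1 = 1.0 / (1.0 / var0 + 1.0 / var)   # posterior variance
    mu1 = var1 * (mu0 / var0 + x / var)     # posterior mean
    return mu1, var1

# The slide's example: x = 2, sigma^2 = 1, prior N(0, 2).
print(normal_update(0.0, 2.0, 2.0, 1.0))  # posterior mean 4/3, variance 2/3
```

The posterior mean 4/3 sits between the prior mean 0 and the observation 2, closer to the observation because the data are more precise than the prior.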
13 Normal distribution (1 observation, unknown µ). Observation x = 2, while σ² = 1 is known. Prior distribution µ ∼ N(0, 5). [Figure: prior and posterior densities p(µ).]
14 Normal distribution (1 observation, unknown µ). Bayesian statisticians typically work with the precision instead of the variance, τ = (σ²)⁻¹, so that σ²_µ1 = (σ⁻²_µ0 + σ⁻²)⁻¹ simplifies to τ₁ = τ₀ + τ: posterior precision equals prior precision plus likelihood precision (in the case of a normal prior and normal likelihood). Likewise,

µ₁ = σ²_µ1 (µ₀/σ²_µ0 + x/σ²) = µ₀ τ₀/(τ₀ + τ) + x τ/(τ₀ + τ),

which is therefore a weighted mean of the prior mean and the data, weighted by their precisions. Note that this still assumes σ², or τ, is known. (Lee, 2012, 41)
15 Normal distribution (n observations, unknown µ). When we have not one but n observations from the same random variable, we obtain:

τ₁ = τ₀ + nτ
µ₁ = τ₁⁻¹ (µ₀τ₀ + Σᵢ xᵢ τ) = τ₁⁻¹ (µ₀τ₀ + n x̄ τ)

Note that we would obtain the same result if we updated the prior using the first observation, then took that posterior as the prior when using the second observation, and so forth; this is called online updating. (Lee, 2012, 45)
16 Normal distribution (n observations, unknown µ). 5 observations with x̄ = 2, while σ² = 1 is known. Prior distribution µ ∼ N(0, 10). [Figure: prior and posterior densities p(µ).]
17 Normal distribution (n observations, unknown µ). 100 observations with x̄ = 2, while σ² = 1 is known. Prior distribution µ ∼ N(0, 10). [Figure: prior and posterior densities p(µ).]
18 Normal distribution (n observations, unknown µ and τ). Similar analytical solutions exist when both µ and τ are unknown, given either normal priors or other (possibly improper) priors, such as uniform ones. The basic principles are more important: the posterior is a combination of prior information and information from the data, and the more data, and the less variable the data, the more the posterior is driven by the data rather than the prior.
20 The linear model is familiar from ordinary least squares (OLS) estimation:

yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + ... + βₖxᵢₖ + εᵢ

or y = Xβ + ε, whereby ε ∼ N(0, σ²ε). In Bayesian inference, the quantity of interest is then p(β|y, X) ∝ p(y, X|β)p(β), whereby we assume that the errors and X are independent: p(ε, X) = p(ε)p(X). (Lancaster, 2004)
21 We assume a (hierarchical) prior for the errors, assuming that they are normally and independently distributed with mean zero and precision τ (τ = 1/σ²ε):

p(ε|τ) ∝ τ^(n/2) exp(−(τ/2) ε′ε),

from which follows:

p(y, X|β, τ) ∝ τ^(n/2) exp(−(τ/2)(y − Xβ)′(y − Xβ)) p(X|β).

If we assume X is strictly exogenous and does not depend on β or τ, we obtain:

p(y, X|β, τ) ∝ p(y|X, β, τ) ∝ τ^(n/2) exp(−(τ/2)(y − Xβ)′(y − Xβ)). (Lancaster, 2004, 117)
22 Bayesian linear model: improper prior. There is a conventional, vague, improper uniform prior for β and τ, namely

p(β, τ) ∝ 1/τ,

whereby −∞ < β < ∞ and τ > 0. Note that this prior is improper, as it is not an actual probability distribution: it does not integrate to one. To obtain the posterior, we multiply the likelihood by the prior:

p(β, τ|y, X) ∝ τ^(n/2 − 1) exp(−(τ/2)(y − Xβ)′(y − Xβ)). (Lancaster, 2004, 120)
23 Bayesian linear model: improper prior. The marginal posterior distribution of β in this case is

β ∼ t(b, s²(X′X)⁻¹, ν),

a t-distribution with ν = n − k degrees of freedom, where b = (X′X)⁻¹X′y is the OLS estimate of β, s²(X′X)⁻¹ is the OLS estimate of the variance-covariance matrix, and s² = e′e/ν is the residual variance. The use of the improper prior makes this analytically tractable and shows the importance of least squares in Bayesian statistics as well. (Lancaster, 2004, 124, 133)
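Under this improper prior, the posterior of β is centred on the OLS estimate, so the familiar least-squares computations deliver the Bayesian answer. A minimal sketch with simulated, purely illustrative data:

```python
import numpy as np

# Illustrative regression data: intercept 1, slope 2, unit-variance noise.
rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)    # OLS estimate = posterior centre
e = y - X @ b                            # residuals
nu = n - k                               # degrees of freedom
s2 = e @ e / nu                          # residual variance
cov_b = s2 * np.linalg.inv(X.T @ X)      # scale matrix of the posterior t

print(b, s2)
```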
24 Bayesian linear model: improper prior. The marginal posterior distribution of τ in this case is

p(τ|y, X) ∝ τ^(ν/2 − 1) exp(−τνs²/2),

a gamma distribution. The expected value then works out to be E(τ) = 1/s², which is not surprising given that the precision is the reciprocal of the variance. (Lancaster, 2004)
25 Bayesian linear model: conjugate prior. A conjugate prior is a prior from the same family of probability distributions as the posterior, such that a posterior based on one set of observations can be used immediately as the prior for a subsequent set of observations. The natural conjugate priors in this case are β|τ ∼ N(β₀, τ⁻¹A⁻¹) and τ ∼ G(α, c), with density

p(β, τ) ∝ τ^(α/2 − 1) exp(−(τ/2)(β − β₀)′A(β − β₀)) exp(−τc/2).

Compare this to the likelihood:

p(y, X|β, τ) ∝ τ^(n/2) exp(−(τ/2)(β − b)′X′X(β − b)) exp(−τe′e/2),

where b is the least squares estimate and e the associated residuals. (Lancaster 2004, 133; Gamerman and Lopes 2006, 55–56)
26 Bayesian linear model: conjugate prior. Using this prior, we obtain an expected value of the slope parameters of

E(β|y, X) = (X′X + A)⁻¹(X′Xb + Aβ₀),

which can be seen as a weighted average of the prior parameter β₀ and the least squares estimate b, weighted by their respective precisions. This prior is therefore much more informative: it affects the results more strongly than the flat improper prior. (Lancaster, 2004)
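A numerical sketch of this posterior mean, showing the shrinkage towards the prior as the prior precision grows. The data, the prior precision matrix A, and β₀ are all illustrative choices, not from the slides.

```python
import numpy as np

# Illustrative regression data.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=40)

b = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate
beta0 = np.zeros(2)                     # prior mean

# E(beta | y, X) = (X'X + A)^(-1) (X'X b + A beta0) for weak vs strong priors.
post_means = {}
for a in (0.1, 1000.0):
    A = a * np.eye(2)
    post_means[a] = np.linalg.solve(X.T @ X + A, X.T @ X @ b + A @ beta0)

# A weak prior (small A) leaves the posterior mean near b; a strong
# prior (large A) pulls it towards beta0.
print(post_means)
```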
27 Jeffreys priors. One well-known proposal for obtaining non-informative, improper priors is to use the prior

p(θ) ∝ |I(θ|X)|^(1/2), where I(θ|X) = −E[∂² log f(X|θ) / ∂θ∂θ′]

is the expected Fisher information matrix of θ. In particular, this leads to priors that are invariant under reparameterisation of the model, which is not necessarily the case otherwise. Typically, this leads to priors of p(θ) ∝ k for location parameters and p(τ) ∝ τ⁻¹ for scale parameters. (Gamerman and Lopes, 2006, 45–46)
28 Power priors. Another special type of prior is the power prior, where the idea is to put a relative weight, which itself can be configured, on prior data versus newly obtained data. Say we collect data X, but an earlier study had already collected data X₀ and provided posterior estimates. We can then use the prior

p(θ|X₀, a₀) ∝ p(θ)[L(θ|X₀)]^a₀, where 0 ≤ a₀ ≤ 1,

so that our posterior becomes

p(θ|X, X₀, a₀) ∝ p(X|θ)p(θ|X₀, a₀).

In typical Bayesian style, we can also put a prior on this weight parameter:

p(θ|X₀) = ∫₀¹ p(θ)[L(θ|X₀)]^a₀ p(a₀|ψ) da₀,

with ψ a hyperparameter of some distribution of a₀, e.g. a gamma distribution. (Gill, 2015)
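For a normal mean with known variance, raising the historical likelihood to the power a₀ behaves like keeping only a₀·n₀ of the n₀ historical observations. The sketch below (all numbers illustrative, assuming a near-flat initial prior) shows the posterior mean sliding between ignoring and fully pooling the earlier data as a₀ moves from 0 to 1.

```python
def posterior(mu0, tau0, n, xbar, tau):
    """Normal-mean update: prior N(mu0, 1/tau0), n observations with
    mean xbar, each with known precision tau."""
    tau1 = tau0 + n * tau
    return (mu0 * tau0 + n * xbar * tau) / tau1, tau1

n0, xbar0 = 20, 1.5    # historical data X0 (illustrative)
n, xbar = 10, 2.2      # new data X (illustrative)

results = {}
for a0 in (0.0, 0.5, 1.0):
    # Power prior: a near-flat prior updated with a0-weighted history.
    mu_p, tau_p = posterior(0.0, 1e-8, a0 * n0, xbar0, 1.0)
    mu1, tau1 = posterior(mu_p, tau_p, n, xbar, 1.0)
    results[a0] = mu1

print(results)  # a0 = 0 ignores the history; a0 = 1 pools it fully
```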
30 Hypothesis testing. Assume we have an unknown parameter θ, which we know to be from a set Θ. In frequentist statistics we usually have a null hypothesis H₀: θ ∈ Θ₀ and an alternative H₁: θ ∈ Θ₁, where Θ₀ ∪ Θ₁ = Θ and Θ₀ ∩ Θ₁ = ∅. (Lee, 2012, 138) For example, we might have a parameter β, with H₀: β = 0 and H₁: β ≠ 0.
31 Frequentist hypothesis testing. The probability of observing any given value of β is infinitely small, so in frequentist statistics we reject H₀ if, under the sampling distribution given H₀, the probability of observing x or greater is less than a threshold value α. Note that the p-value is not the probability that the null or the alternative is true. As n increases, p decreases. The α-value of 0.05 is arbitrary. The null and alternative hypotheses are often unrealistic, and typically strongly biased in favour of one over the other. The p-value is based on the probability of observations we have not made. (Friel 2015, lecture 8; Lee 2012, 139)
32 Frequentist hypothesis testing. Harold Jeffreys remarks: "What the use of p implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred." (Jeffreys 1967, as cited in Lee 2012, 139) Andrew Gelman remarks: "The relevant goal is not to ask the question 'Do the data come from the assumed model?' (to which the answer is almost always no), but to quantify the discrepancies between data and model, and assess whether they could have arisen by chance, under the model's own assumptions." (Friel, 2015, lecture 8)
33 Bayesian testing: Bayes factor. As usual, in Bayesian statistics we attempt to calculate full probability distributions more directly. So we are interested in p₀ = p(θ ∈ Θ₀|X) and p₁ = p(θ ∈ Θ₁|X), the posterior probabilities of the two hypotheses. And of course, we need priors: π₀ = p(θ ∈ Θ₀) and π₁ = p(θ ∈ Θ₁). We then focus on the odds of H₀ against H₁, the ratio p₀/p₁, and consider the Bayes factor

B = (p₀/p₁) / (π₀/π₁) = p₀π₁ / (p₁π₀),

i.e. the factor by which the data shift the odds in favour of H₀ against H₁. (Lee, 2012)
34 Bayesian testing: likelihood ratio. If θ could only have two possible values, θ₀ and θ₁, then p₀ ∝ π₀ p(X|θ₀) and p₁ ∝ π₁ p(X|θ₁), and therefore

p₀/p₁ = (π₀/π₁) × p(X|θ₀)/p(X|θ₁).

The Bayes factor is then

B = p(X|θ₀) / p(X|θ₁),

which is the likelihood ratio of H₀ against H₁. When θ can take more values, we need to integrate over all possible values:

B = [π₀⁻¹ ∫_{θ∈Θ₀} p(X|θ)p(θ)dθ] / [π₁⁻¹ ∫_{θ∈Θ₁} p(X|θ)p(θ)dθ]. (Lee, 2012, 141)
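In the simple-vs-simple case the Bayes factor is just a ratio of two density evaluations; a tiny sketch with a normal likelihood, where the observation and the two hypothesised means are illustrative:

```python
import math

# Normal density with mean mu and variance var.
def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# H0: theta = 0 vs H1: theta = 2, one observation x with variance 1.
x = 1.2
B = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 2.0, 1.0)
print(B)  # B < 1: the data favour H1 over H0
```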
35 Bayesian testing: critical values. Much like the α-value for judging p-values, using thresholds is generally inappropriate. Nevertheless, Jeffreys suggests the following:

B > 1: support for H₀
1 > B > 10^(−1/2): minimal evidence against H₀
10^(−1/2) > B > 10^(−1): substantial evidence against H₀
10^(−1) > B > 10^(−2): strong evidence against H₀
B < 10^(−2): decisive evidence against H₀

And one can translate "against H₀" into "in favour of H₁" in a way that one cannot with frequentist conclusions. Note that calculating B requires the normalising constant p(X), which is often difficult to obtain. B is also more sensitive to the priors than most other Bayesian quantities. (Gill, 2015, 217)
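Jeffreys' conventional cut-offs at half-powers of ten can be wrapped in a small helper; the labels below follow that scale, with the usual caveat that any such thresholds are arbitrary.

```python
# Map a Bayes factor B for H0 against H1 to a Jeffreys-style label.
def jeffreys_label(B):
    if B > 1:
        return "support for H0"
    if B > 10 ** -0.5:
        return "minimal evidence against H0"
    if B > 10 ** -1:
        return "substantial evidence against H0"
    if B > 10 ** -2:
        return "strong evidence against H0"
    return "decisive evidence against H0"

print(jeffreys_label(0.5))  # 0.5 lies between 10^(-1/2) and 1
```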
36 Bayesian testing: p-values. Assume a one-sided test. In frequentist statistics, the p-value, or exact significance level, is then the probability that the random variable X takes a value at least as high as the observed value x, given the null: p(X ≥ x|θ = θ₀). In Bayesian statistics we would instead focus on the posterior probability p₀ = p(θ ≤ θ₀|X = x) = Φ((θ₀ − x)√τ). As it turns out,

p(X ≥ x|θ = θ₀) = 1 − Φ((x − θ₀)√τ) = Φ((θ₀ − x)√τ) = p₀.

Note that this implies that the p-value equals (1 + B⁻¹)⁻¹. (Lee, 2012, 144)
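The coincidence between the one-sided p-value and the posterior probability can be checked numerically. A sketch for the normal case with known precision τ, using the standard normal CDF built from `math.erf`; the observation and hypothesised mean are illustrative.

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

theta0, x, tau = 0.0, 1.7, 1.0       # illustrative values

# Frequentist one-sided p-value: P(X >= x | theta = theta0).
p_value = 1.0 - Phi((x - theta0) * math.sqrt(tau))

# Posterior probability P(theta <= theta0 | x) under a flat prior.
p0 = Phi((theta0 - x) * math.sqrt(tau))

print(p_value, p0)  # the two agree
```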
37 Likelihood principle. The likelihood principle states that all the information about the parameters that the data provide is contained in the likelihood function. This principle generally holds in Bayesian inference, but is violated by, among other things: frequentist statistical significance tests and confidence intervals; and the use of Jeffreys priors.
38 References.
Friel, Nial. 2015. Bayesian Analysis. Lecture slides.
Gamerman, Dani and Hedibert F. Lopes. 2006. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. 2nd ed. Boca Raton, FL: Chapman & Hall.
Gill, Jeff. 2015. Bayesian Methods: A Social and Behavioral Sciences Approach. 3rd ed. Boca Raton: CRC Press.
Jeffreys, Sir Harold. 1967. Theory of Probability. 3rd ed. Oxford: Clarendon Press.
Lancaster, Tony. 2004. An Introduction to Modern Bayesian Econometrics. Malden, MA: Blackwell.
Lee, Peter M. 2012. Bayesian Statistics: An Introduction. 4th ed. Chichester: Wiley.
Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationg-priors for Linear Regression
Stat60: Bayesian Modeling and Inference Lecture Date: March 15, 010 g-priors for Linear Regression Lecturer: Michael I. Jordan Scribe: Andrew H. Chan 1 Linear regression and g-priors In the last lecture,
More informationA Very Brief Summary of Statistical Inference, and Examples
A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)
More information10. Exchangeability and hierarchical models Objective. Recommended reading
10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.
More informationA Discussion of the Bayesian Approach
A Discussion of the Bayesian Approach Reference: Chapter 10 of Theoretical Statistics, Cox and Hinkley, 1974 and Sujit Ghosh s lecture notes David Madigan Statistics The subject of statistics concerns
More informationLECTURE 5. Introduction to Econometrics. Hypothesis testing
LECTURE 5 Introduction to Econometrics Hypothesis testing October 18, 2016 1 / 26 ON TODAY S LECTURE We are going to discuss how hypotheses about coefficients can be tested in regression models We will
More informationCOMP 551 Applied Machine Learning Lecture 19: Bayesian Inference
COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted
More informationStat 451 Lecture Notes Monte Carlo Integration
Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:
More informationBayesian Inference. STA 121: Regression Analysis Artin Armagan
Bayesian Inference STA 121: Regression Analysis Artin Armagan Bayes Rule...s! Reverend Thomas Bayes Posterior Prior p(θ y) = p(y θ)p(θ)/p(y) Likelihood - Sampling Distribution Normalizing Constant: p(y
More informationStatistical Data Analysis Stat 3: p-values, parameter estimation
Statistical Data Analysis Stat 3: p-values, parameter estimation London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515 Glen Cowan Physics Department Royal Holloway,
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationCS281A/Stat241A Lecture 22
CS281A/Stat241A Lecture 22 p. 1/4 CS281A/Stat241A Lecture 22 Monte Carlo Methods Peter Bartlett CS281A/Stat241A Lecture 22 p. 2/4 Key ideas of this lecture Sampling in Bayesian methods: Predictive distribution
More informationFundamental Probability and Statistics
Fundamental Probability and Statistics "There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are
More informationFoundations of Statistical Inference
Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 20 Lecture 6 : Bayesian Inference
More informationPart III. A Decision-Theoretic Approach and Bayesian testing
Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationThe Bayesian Approach to Multi-equation Econometric Model Estimation
Journal of Statistical and Econometric Methods, vol.3, no.1, 2014, 85-96 ISSN: 2241-0384 (print), 2241-0376 (online) Scienpress Ltd, 2014 The Bayesian Approach to Multi-equation Econometric Model Estimation
More informationIntroduction to Bayesian inference
Introduction to Bayesian inference Thomas Alexander Brouwer University of Cambridge tab43@cam.ac.uk 17 November 2015 Probabilistic models Describe how data was generated using probability distributions
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationAST 418/518 Instrumentation and Statistics
AST 418/518 Instrumentation and Statistics Class Website: http://ircamera.as.arizona.edu/astr_518 Class Texts: Practical Statistics for Astronomers, J.V. Wall, and C.R. Jenkins Measuring the Universe,
More informationVector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I.
Vector Autoregressive Model Vector Autoregressions II Empirical Macroeconomics - Lect 2 Dr. Ana Beatriz Galvao Queen Mary University of London January 2012 A VAR(p) model of the m 1 vector of time series
More informationA BAYESIAN MATHEMATICAL STATISTICS PRIMER. José M. Bernardo Universitat de València, Spain
A BAYESIAN MATHEMATICAL STATISTICS PRIMER José M. Bernardo Universitat de València, Spain jose.m.bernardo@uv.es Bayesian Statistics is typically taught, if at all, after a prior exposure to frequentist
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationStat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.
Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 CS students: don t forget to re-register in CS-535D. Even if you just audit this course, please do register.
More informationLecture 7 and 8: Markov Chain Monte Carlo
Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani
More informationBayesian analysis in nuclear physics
Bayesian analysis in nuclear physics Ken Hanson T-16, Nuclear Physics; Theoretical Division Los Alamos National Laboratory Tutorials presented at LANSCE Los Alamos Neutron Scattering Center July 25 August
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationBayesian Inference for the Multivariate Normal
Bayesian Inference for the Multivariate Normal Will Penny Wellcome Trust Centre for Neuroimaging, University College, London WC1N 3BG, UK. November 28, 2014 Abstract Bayesian inference for the multivariate
More information