9 Bayesian inference
9.1 Subjective probability

This is probability regarded as degree of belief. A subjective probability of an event $A$ is assessed as $p$ if you are prepared to stake $pM$ to win $M$ and equally prepared to accept a stake of $pM$ to win $M$. In other words the bet is fair and you are assumed to behave rationally.

Kolmogorov's axioms

How does subjective probability fit in with the fundamental axioms? Let $\mathcal{A}$ be the set of all subsets of a countable sample space $\Omega$. Then

(i) $P(A) \ge 0$ for every $A \in \mathcal{A}$;

(ii) $P(\Omega) = 1$;
(iii) if $\{A_\lambda : \lambda \in \Lambda\}$ is a countable set of mutually exclusive events belonging to $\mathcal{A}$, then
$$P\Big(\bigcup_{\lambda \in \Lambda} A_\lambda\Big) = \sum_{\lambda \in \Lambda} P(A_\lambda).$$

Obviously the subjective interpretation has no difficulty in conforming with (i) and (ii). Axiom (iii) is slightly less obvious. Suppose we have two events $A$ and $B$ such that $A \cap B = \emptyset$. Consider a stake of $p_A M$ to win $M$ if $A$ occurs and a stake $p_B M$ to win $M$ if $B$ occurs. The total stake for bets on $A$ or $B$ occurring is $p_A M + p_B M$ to win $M$ if $A$ or $B$ occurs. Thus we have $(p_A + p_B)M$ to win $M$ and so
$$P(A \cup B) = P(A) + P(B).$$

Conditional probability

Define $p_B$, $p_{AB}$, $p_{A|B}$ such that $p_B M$ is the fair stake for $M$ if $B$ occurs; $p_{AB} M$ is the fair stake for $M$ if $A$ and $B$ occur; $p_{A|B} M$ is the fair stake for $M$ if $A$ occurs given $B$ has occurred, otherwise the bet is off. Consider the three alternative outcomes: $B$ does not occur (so the conditional bet is off), $B$ occurs but $A$ does not, and both $A$ and $B$ occur. If $G_1, G_2, G_3$ are the corresponding gains when stakes $M_1, M_2, M_3$ are placed on $A \mid B$, $B$ and $A \cap B$ respectively, then
$$\begin{aligned}
\bar B:\quad & -p_B M_2 - p_{AB} M_3 = G_1 \\
B \cap \bar A:\quad & -p_{A|B} M_1 + (1 - p_B) M_2 - p_{AB} M_3 = G_2 \\
A \cap B:\quad & (1 - p_{A|B}) M_1 + (1 - p_B) M_2 + (1 - p_{AB}) M_3 = G_3
\end{aligned}$$
The principle of rationality does not allow the possibility of setting the stakes to obtain a sure profit. If the coefficient matrix of this linear system were invertible, the stakes could be chosen to produce any gains whatever, including sure positive ones; hence the determinant must vanish:
$$\begin{vmatrix} 0 & -p_B & -p_{AB} \\ -p_{A|B} & 1 - p_B & -p_{AB} \\ 1 - p_{A|B} & 1 - p_B & 1 - p_{AB} \end{vmatrix} = 0,$$
which reduces to
$$p_{AB} - p_B\, p_{A|B} = 0,$$
i.e.
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) \ne 0.$$
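The coherence argument above can be checked numerically: the gains matrix is singular exactly when $p_{AB} = p_B\, p_{A|B}$. A minimal sketch, with invented numerical probabilities:

```python
# Numeric check of the "no sure profit" argument: the 3x3 coefficient matrix
# of the gains G1, G2, G3 in the stakes M1, M2, M3 must be singular, which
# forces p_AB = p_B * p_(A|B).

def gain_matrix(p_B, p_AB, p_A_given_B):
    """Rows: outcomes (not B), (B only), (A and B); columns: stakes M1, M2, M3."""
    return [
        [0.0,              -p_B,     -p_AB],
        [-p_A_given_B,     1 - p_B,  -p_AB],
        [1 - p_A_given_B,  1 - p_B,  1 - p_AB],
    ]

def det3(m):
    a, b, c = m[0]; d, e, f = m[1]; g, h, i = m[2]
    return a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)

# Coherent assessments: P(B) = 0.5, P(A|B) = 0.4, so P(A and B) = 0.2.
coherent = det3(gain_matrix(0.5, 0.2, 0.4))
# Incoherent assessments (P(A and B) = 0.3) leave the matrix invertible,
# so a bettor could solve for stakes giving any desired sure gains.
incoherent = det3(gain_matrix(0.5, 0.3, 0.4))
print(coherent, incoherent)
```

Expanding the determinant symbolically gives exactly $p_{AB} - p_B\, p_{A|B}$, which is what the second call returns (up to rounding).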
9.2 Estimation formulated as a decision problem

$\mathcal{A}$: the action space, the set of all possible actions $a \in \mathcal{A}$.

$\Theta$: the parameter space, consisting of all possible states of nature, only one of which occurs or will occur (this true state being unknown).

$L$: the loss function, having domain $\Theta \times \mathcal{A}$ (the set of all possible consequences $(\theta, a)$) and codomain $\mathbb{R}$.

$\mathcal{R}_X$: the sample space, containing all possible realisations $x \in \mathcal{R}_X$ of a random variable $X$ belonging to the family $\{f(x; \theta) : \theta \in \Theta\}$.

$\mathcal{D}$: the decision space, consisting of all possible decision functions $\delta \in \mathcal{D}$, each decision function having domain $\mathcal{R}_X$ and codomain $\mathcal{A}$.

The basic idea

The true state of the world is unknown when the action is chosen. The loss which results from each of the possible consequences $(\theta, a)$ is known. Data are collected to provide information about the unknown $\theta$, and a choice of action is taken for each possible observation of $X$. Decision theory is concerned with the criteria for making this choice. The form of the loss function is crucial: it depends upon the nature of the problem.

Example: You are a fashion buyer and must stock up with trendy dresses. The true number you will sell is $\theta$ (unknown).
You stock up with $a$ dresses. If you overstock you must sell off at your end-of-season sale and lose $A$ per dress. If you understock you lose the profit of $B$ per dress. Thus
$$L(\theta, a) = \begin{cases} A(a - \theta), & a \ge \theta, \\ B(\theta - a), & a \le \theta. \end{cases}$$

Special forms of loss function

If $A = B$, you have modular loss,
$$L(\theta, a) = A\,|\theta - a|.$$
To suit comparison with classical inference, quadratic loss is often used:
$$L(\theta, a) = C(\theta - a)^2.$$

The risk function

This is a measure of the quality of a decision function. Suppose we choose a decision $\delta(x)$ (i.e. choose $a = \delta(x)$). The function $R$ with domain $\Theta \times \mathcal{D}$ and codomain $\mathbb{R}$ defined by
$$R(\theta, \delta) = \int_{\mathcal{R}_X} L(\theta, \delta(x))\, f(x; \theta)\,dx$$
is called the risk function. In the discrete case we may write this as
$$R(\theta, \delta) = \sum_{x \in \mathcal{R}_X} L(\theta, \delta(x))\, p(x; \theta).$$
In either case,
$$R(\theta, \delta) = E\left[L(\theta, \delta(X))\right].$$

Example: Suppose $X_1, \ldots, X_n$ is a random sample from $N(\theta, \theta)$ and suppose we assume quadratic loss.
Consider
$$\delta_1(X) = \bar X, \qquad \delta_2(X) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$$
We know that $E(\bar X) = \theta$, so
$$R(\theta, \delta_1) = E\left[(\theta - \bar X)^2\right] = V(\bar X) = \frac{\theta}{n}.$$
But $E[\delta_2(X)] = \theta$, so that
$$R(\theta, \delta_2) = E\left[(\theta - \delta_2(X))^2\right] = V[\delta_2(X)] = \frac{\theta^2}{(n-1)^2}\, V\left[\frac{\sum_{i=1}^n (X_i - \bar X)^2}{\theta}\right] = \frac{\theta^2}{(n-1)^2} \cdot 2(n-1) = \frac{2\theta^2}{n-1},$$
since $\sum (X_i - \bar X)^2/\theta \sim \chi^2(n-1)$, which has variance $2(n-1)$.

[Figure: the two risk functions plotted against $\theta$: $R(\theta, \delta_1)$ is linear and $R(\theta, \delta_2)$ is quadratic in $\theta$, and they cross at $\theta = (n-1)/2n$.]

Without knowing $\theta$, how do we choose between $\delta_1$ and $\delta_2$?

Two sensible criteria

1. Take precautions against the worst.
2. Take into account any further information you may have.
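The two risk formulas can be checked by simulation. A sketch, with $\theta$, $n$ and the replication count chosen purely for illustration:

```python
# Monte Carlo check of R(theta, delta1) = theta/n and
# R(theta, delta2) = 2*theta^2/(n-1) for samples from N(theta, theta).
import random

def simulate_risks(theta, n, reps=20000, seed=1):
    rng = random.Random(seed)
    sd = theta ** 0.5          # the variance of each observation is theta
    loss1 = loss2 = 0.0
    for _ in range(reps):
        xs = [rng.gauss(theta, sd) for _ in range(n)]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
        loss1 += (theta - xbar) ** 2
        loss2 += (theta - s2) ** 2
    return loss1 / reps, loss2 / reps

theta, n = 2.0, 10
r1, r2 = simulate_risks(theta, n)
print(r1, theta / n)                  # estimate vs theory: theta/n = 0.2
print(r2, 2 * theta**2 / (n - 1))     # estimate vs theory: 2*theta^2/9
```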
Criterion 1 leads to minimax: choose the decision leading to the smallest maximum risk, i.e. choose $\delta^* \in \mathcal{D}$ such that
$$\sup_{\theta \in \Theta} \{R(\theta, \delta^*)\} = \inf_{\delta \in \mathcal{D}} \sup_{\theta \in \Theta} \{R(\theta, \delta)\}.$$
Minimax rules have disadvantages, the main one being that they treat Nature as an adversary. Minimax guards against the worst case, but the value of $\theta$ should not be regarded as a value deliberately chosen by Nature to make the risk large. Fortunately, in many situations minimax rules turn out to be reasonable in practice, if not in concept.

Criterion 2 involves introducing beliefs about $\theta$ into the analysis. If $\pi(\theta)$ represents our beliefs about $\theta$, then define the Bayes risk $r(\pi, \delta)$ as
$$r(\pi, \delta) = \begin{cases} \int_\Theta R(\theta, \delta)\,\pi(\theta)\,d\theta & \text{if } \theta \text{ is continuous}, \\ \sum_{\theta \in \Theta} R(\theta, \delta)\,\pi(\theta) & \text{if } \theta \text{ is discrete}. \end{cases}$$
Now choose $\delta^*$ such that
$$r(\pi, \delta^*) = \inf_{\delta \in \mathcal{D}} \{r(\pi, \delta)\}.$$

Bayes rule: The Bayes rule with respect to a prior $\pi$ is the decision rule $\delta^*$ that minimises the Bayes risk among all possible decision rules.

[Photograph: Bayes' tomb.]
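Applying criterion 2 to the earlier $N(\theta, \theta)$ example: given a prior for $\theta$, the Bayes risks of $\delta_1$ and $\delta_2$ can be compared directly. A sketch, with an Exp(1) prior chosen purely for illustration:

```python
# Bayes risks r(pi, delta1) and r(pi, delta2) under an (invented) Exp(1)
# prior on theta, using the risk functions R(theta, delta1) = theta/n and
# R(theta, delta2) = 2*theta^2/(n-1) derived for N(theta, theta) samples.
from math import exp

def bayes_risk(risk_fn, h=1e-4, upper=20.0):
    """Midpoint-rule integral of R(theta) * exp(-theta) over (0, upper)."""
    total, theta = 0.0, h / 2
    while theta < upper:
        total += risk_fn(theta) * exp(-theta) * h
        theta += h
    return total

n = 10
r1 = bayes_risk(lambda t: t / n)               # E(theta)/n = 1/n = 0.1
r2 = bayes_risk(lambda t: 2 * t**2 / (n - 1))  # 2*E(theta^2)/(n-1) = 4/9
print(r1, r2)   # delta1 has the smaller Bayes risk under this prior
```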
9.2.3 Decision terminology

$\delta$ is called a Bayes decision function; $\delta(X)$ is a Bayes estimator; $\delta(x)$ is a Bayes estimate.

If a decision function $\delta''$ is such that there exists a decision function $\delta'$ satisfying
$$R(\theta, \delta') \le R(\theta, \delta'') \quad \forall\, \theta \in \Theta$$
and
$$R(\theta, \delta') < R(\theta, \delta'') \quad \text{for some } \theta \in \Theta,$$
then $\delta''$ is dominated by $\delta'$ and $\delta''$ is said to be inadmissible. All other decisions are admissible.
9.3 Prior and posterior distributions

The prior distribution represents our initial belief in the parameter $\theta$. This means that $\theta$ is regarded as a random variable with p.d.f. $\pi(\theta)$. From Bayes' theorem,
$$f(x \mid \theta)\,\pi(\theta) = \pi(\theta \mid x)\, f(x)$$
or
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int_\Theta f(x \mid \theta)\,\pi(\theta)\,d\theta}.$$
$\pi(\theta \mid x)$ is called the posterior p.d.f. It represents our updated belief in $\theta$, having taken account of the data. We write the expression in the form
$$\pi(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta)$$
or
$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}.$$

Example: Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a Poisson distribution, Poi($\theta$), where $\theta$ has the prior distribution
$$\pi(\theta) = \frac{\theta^{\alpha-1} e^{-\theta}}{\Gamma(\alpha)}, \qquad \theta \ge 0,\ \alpha > 0.$$
The posterior distribution is given by
$$\pi(\theta \mid x) \propto \frac{\theta^{\sum x_i}\, e^{-n\theta}}{\prod x_i!} \cdot \frac{\theta^{\alpha-1} e^{-\theta}}{\Gamma(\alpha)},$$
which simplifies to
$$\pi(\theta \mid x) \propto \theta^{\sum x_i + \alpha - 1}\, e^{-(n+1)\theta}, \qquad \theta \ge 0.$$
This is recognisable as a $\Gamma$-distribution and, by comparison with $\pi(\theta)$, we can write
$$\pi(\theta \mid x) = \frac{(n+1)^{\sum x_i + \alpha}\, \theta^{\sum x_i + \alpha - 1}\, e^{-(n+1)\theta}}{\Gamma\!\left(\sum x_i + \alpha\right)}, \qquad \theta \ge 0,$$
which represents our posterior belief in $\theta$.
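Because the Gamma prior is conjugate to the Poisson likelihood, the update above is just arithmetic on the parameters. A sketch, with invented data and an invented $\alpha$:

```python
# The Poisson-Gamma update above: with a Gamma(alpha, 1) prior (rate 1, as in
# the notes) and data x1..xn ~ Poi(theta), the posterior is
# Gamma(sum(x) + alpha, n + 1).
def poisson_gamma_posterior(xs, alpha, rate=1.0):
    """Return (shape, rate) of the Gamma posterior for theta."""
    return sum(xs) + alpha, rate + len(xs)

xs = [3, 1, 4, 1, 5]                       # invented counts
shape, rate = poisson_gamma_posterior(xs, alpha=2.0)
print(shape, rate)      # 16.0 6
print(shape / rate)     # posterior mean (sum(x) + alpha)/(n + 1)
```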
9.4 Decisions

We have already defined the Bayes risk to be
$$r(\pi, \delta) = \int_\Theta R(\theta, \delta)\,\pi(\theta)\,d\theta$$
where
$$R(\theta, \delta) = \int_{\mathcal{R}_X} L(\theta, \delta(x))\, f(x \mid \theta)\,dx,$$
so that
$$r(\pi, \delta) = \int_\Theta \int_{\mathcal{R}_X} L(\theta, \delta(x))\, f(x \mid \theta)\,dx\; \pi(\theta)\,d\theta = \int_{\mathcal{R}_X} \left[\int_\Theta L(\theta, \delta(x))\, \pi(\theta \mid x)\,d\theta\right] f(x)\,dx.$$
For any given $x$, minimising the Bayes risk is therefore equivalent to minimising
$$\int_\Theta L(\theta, \delta(x))\, \pi(\theta \mid x)\,d\theta,$$
i.e. we minimise the posterior expected loss.

Example: Take the posterior distribution in the previous example and, for no special reason, assume quadratic loss
$$L(\theta, \delta(x)) = [\theta - \delta(x)]^2.$$
Then we minimise
$$\int_\Theta [\theta - \delta(x)]^2\, \pi(\theta \mid x)\,d\theta$$
by differentiating with respect to $\delta$ to obtain
$$-2\int_\Theta [\theta - \delta(x)]\, \pi(\theta \mid x)\,d\theta = 0 \quad\Rightarrow\quad \delta(x) = \int_\Theta \theta\, \pi(\theta \mid x)\,d\theta.$$
This is the mean of the posterior distribution:
$$\delta(x) = \int_0^\infty \theta\, \frac{(n+1)^{\sum x_i + \alpha}\, \theta^{\sum x_i + \alpha - 1}\, e^{-(n+1)\theta}}{\Gamma\!\left(\sum x_i + \alpha\right)}\, d\theta = \frac{(n+1)^{\sum x_i + \alpha}\, \Gamma\!\left(\sum x_i + \alpha + 1\right)}{\Gamma\!\left(\sum x_i + \alpha\right)(n+1)^{\sum x_i + \alpha + 1}} = \frac{\sum x_i + \alpha}{n+1}.$$

Theorem: Suppose that $\delta^\pi$ is a decision rule that is Bayes with respect to some prior distribution $\pi$. If the risk function of $\delta^\pi$ satisfies
$$R(\theta, \delta^\pi) \le r(\pi, \delta^\pi) \quad \text{for all } \theta \in \Theta,$$
then $\delta^\pi$ is a minimax decision rule.

Proof: Suppose $\delta^\pi$ is not minimax. Then there is a decision rule $\delta'$ such that
$$\sup_{\theta \in \Theta} R(\theta, \delta') < \sup_{\theta \in \Theta} R(\theta, \delta^\pi).$$
For this decision rule we have, since, if $f(x) \le k$, then $E(f(X)) \le k$,
$$r(\pi, \delta') \le \sup_{\theta \in \Theta} R(\theta, \delta') < \sup_{\theta \in \Theta} R(\theta, \delta^\pi) \le r(\pi, \delta^\pi),$$
contradicting the statement that $\delta^\pi$ is Bayes with respect to $\pi$. Hence $\delta^\pi$ is minimax.

Example: Let $X_1, \ldots, X_n$ be a random sample from a Bernoulli distribution $B(1, p)$. Let $Y = \sum X_i$ and consider estimating $p$ using the loss function
$$L(p, a) = \frac{(p - a)^2}{p(1-p)},$$
a loss which penalises mistakes more if $p$ is near 0 or 1. The m.l.e. is $\hat p = Y/n$ and
$$R(p, \hat p) = E\left[\frac{(p - \hat p)^2}{p(1-p)}\right] = \frac{1}{p(1-p)} \cdot \frac{p(1-p)}{n} = \frac{1}{n}.$$
The risk is constant and therefore, once we exhibit a prior for which $\hat p$ is Bayes, the theorem shows it is minimax. [NB A decision rule with constant risk is called an equalizer rule.]

We now show that $\hat p$ is the Bayes estimator with respect to the $U(0,1)$ prior, which will imply that $\hat p$ is minimax. The posterior is
$$\pi(p \mid y) \propto p^y (1-p)^{n-y},$$
so we want the value of $a$ which minimises
$$\int_0^1 \frac{(p - a)^2}{p(1-p)}\, p^y (1-p)^{n-y}\,dp.$$
Differentiation with respect to $a$ yields
$$a = \frac{\int_0^1 p^y (1-p)^{n-y-1}\,dp}{\int_0^1 p^{y-1} (1-p)^{n-y-1}\,dp}.$$
Now a $\beta$-distribution with parameters $\alpha, \beta$ has p.d.f.
$$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}, \qquad x \in [0, 1],$$
so
$$a = \frac{\Gamma(y+1)\,\Gamma(n-y)}{\Gamma(n+1)} \cdot \frac{\Gamma(n)}{\Gamma(y)\,\Gamma(n-y)}$$
and, using the identity $\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha)$,
$$a = \frac{y}{n},$$
i.e. $\hat p = Y/n$.
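That the weighted-loss risk of $\hat p = Y/n$ really is the constant $1/n$ can be verified by exact enumeration over the binomial distribution of $Y$. A sketch, with $n$ and the grid of $p$ values invented for illustration:

```python
# Exact check that the risk of phat = Y/n under L(p,a) = (p-a)^2 / (p(1-p))
# is 1/n for every p: enumerate Y ~ Bin(n, p).
from math import comb

def risk(p, n):
    total = 0.0
    for y in range(n + 1):
        prob = comb(n, y) * p**y * (1 - p)**(n - y)
        total += prob * (p - y / n) ** 2 / (p * (1 - p))
    return total

n = 12
for p in (0.1, 0.37, 0.5, 0.9):
    print(risk(p, n))   # each value equals 1/n = 1/12 up to rounding
```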
9.5 Bayes estimates

(i) The mean of $\pi(\theta \mid x)$ is the Bayes estimate with respect to quadratic loss.

Proof: Choose $\delta(x)$ to minimise $\int_\Theta [\theta - \delta(x)]^2\, \pi(\theta \mid x)\,d\theta$ by differentiating with respect to $\delta$ and equating to zero. Then
$$\delta(x) = \int_\Theta \theta\, \pi(\theta \mid x)\,d\theta,$$
since $\int_\Theta \pi(\theta \mid x)\,d\theta = 1$.

(ii) The median of $\pi(\theta \mid x)$ is the Bayes estimate with respect to modular loss (i.e. absolute value loss).

Proof: Choose $\delta(x)$ to minimise
$$\int_\Theta |\theta - \delta(x)|\, \pi(\theta \mid x)\,d\theta = \int_{-\infty}^{\delta(x)} (\delta(x) - \theta)\, \pi(\theta \mid x)\,d\theta + \int_{\delta(x)}^{\infty} (\theta - \delta(x))\, \pi(\theta \mid x)\,d\theta.$$
Differentiate with respect to $\delta$ and equate to zero:
$$\int_{-\infty}^{\delta(x)} \pi(\theta \mid x)\,d\theta = \int_{\delta(x)}^{\infty} \pi(\theta \mid x)\,d\theta,$$
so that
$$2\int_{-\infty}^{\delta(x)} \pi(\theta \mid x)\,d\theta = \int_{-\infty}^{\infty} \pi(\theta \mid x)\,d\theta = 1,$$
i.e. $\int_{-\infty}^{\delta(x)} \pi(\theta \mid x)\,d\theta = \tfrac12$, and $\delta(x)$ is the median of $\pi(\theta \mid x)$.
(iii) The mode of $\pi(\theta \mid x)$ is the Bayes estimate with respect to zero-one loss (i.e. loss of zero for a correct estimate and loss of one for any incorrect estimate).

Proof: The loss function has the form
$$L(\theta, \delta(x)) = \begin{cases} 1, & \text{if } \delta(x) \ne \theta, \\ 0, & \text{if } \delta(x) = \theta. \end{cases}$$
This is only a useful idea for a discrete distribution. With $\delta(x) = \theta'$,
$$\sum_{\theta \in \Theta} L(\theta, \delta(x))\, \pi(\theta \mid x) = \sum_{\theta \in \Theta \setminus \{\theta'\}} \pi(\theta \mid x) = 1 - \pi(\theta' \mid x).$$
Minimisation means maximising $\pi(\theta' \mid x)$, so $\delta(x)$ is taken to be the most likely value, i.e. the mode.

Remember that $\pi(\theta \mid x)$ represents our most up-to-date belief in $\theta$ and is all that is needed for inference.
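Results (i) and (ii) can be illustrated numerically: on a grid approximation to a posterior, brute-force minimisation of the expected loss recovers the mean under quadratic loss and the median under absolute loss. A sketch, with a Beta(3, 7) posterior invented for illustration:

```python
# Grid check: the minimiser of posterior expected quadratic loss is the
# posterior mean; the minimiser of expected absolute loss is the median.
def beta_grid(a, b, m=401):
    """Midpoint grid and normalised weights for a Beta(a, b) density."""
    xs = [(i + 0.5) / m for i in range(m)]
    ws = [x**(a - 1) * (1 - x)**(b - 1) for x in xs]
    s = sum(ws)
    return xs, [w / s for w in ws]

xs, ws = beta_grid(3, 7)

def exp_loss(d, loss):
    return sum(w * loss(x, d) for x, w in zip(xs, ws))

best_quad = min(xs, key=lambda d: exp_loss(d, lambda x, d: (x - d) ** 2))
best_abs = min(xs, key=lambda d: exp_loss(d, lambda x, d: abs(x - d)))

mean = sum(x * w for x, w in zip(xs, ws))   # Beta(3,7) mean is 3/10
print(best_quad, mean)
print(best_abs)    # close to the Beta(3,7) median (roughly 0.287)
```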
9.6 Credible intervals

Definition: The interval $(\theta_1, \theta_2)$ is a $100(1-\alpha)\%$ credible interval for $\theta$ if
$$\int_{\theta_1}^{\theta_2} \pi(\theta \mid x)\,d\theta = 1 - \alpha, \qquad 0 \le \alpha \le 1.$$

[Figure: a posterior density $\pi(\theta \mid x)$ with two different intervals $(\theta_1, \theta_2)$ and $(\theta_1', \theta_2')$, each containing posterior probability $1 - \alpha$.]

Both $(\theta_1, \theta_2)$ and $(\theta_1', \theta_2')$ are $100(1-\alpha)\%$ credible intervals: a credible interval is not unique.

Definition: The interval $(\theta_1, \theta_2)$ is a $100(1-\alpha)\%$ highest posterior density interval if

(i) $(\theta_1, \theta_2)$ is a $100(1-\alpha)\%$ credible interval;

(ii) for all $\theta \in (\theta_1, \theta_2)$ and $\theta' \notin (\theta_1, \theta_2)$, $\pi(\theta \mid x) \ge \pi(\theta' \mid x)$.

Obviously this defines the shortest possible $100(1-\alpha)\%$ credible interval.
[Figure: a highest posterior density interval $(\theta_1, \theta_2)$ containing posterior probability $1 - \alpha$, obtained by cutting the density $\pi(\theta \mid x)$ at a horizontal level.]
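Both kinds of interval are easy to compute on a grid. A sketch comparing an equal-tailed credible interval with a highest posterior density interval, using an invented Beta(3, 7) posterior:

```python
# Equal-tailed vs highest-posterior-density 95% intervals for a Beta(3, 7)
# posterior, computed on a grid. The HPD interval is never longer.
def beta_pdf_grid(a, b, m=4000):
    xs = [(i + 0.5) / m for i in range(m)]
    ws = [x**(a - 1) * (1 - x)**(b - 1) for x in xs]
    s = sum(ws)
    return xs, [w / s for w in ws]

def equal_tailed(xs, ws, alpha):
    cum, lo, hi = 0.0, None, None
    for x, w in zip(xs, ws):
        cum += w
        if lo is None and cum >= alpha / 2:
            lo = x
        if hi is None and cum >= 1 - alpha / 2:
            hi = x
    return lo, hi

def hpd(xs, ws, alpha):
    # keep the highest-density grid points until they hold mass 1 - alpha;
    # for a unimodal density these points form an interval
    order = sorted(range(len(xs)), key=lambda i: -ws[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += ws[i]
        if mass >= 1 - alpha:
            break
    return xs[min(kept)], xs[max(kept)]

xs, ws = beta_pdf_grid(3, 7)
et = equal_tailed(xs, ws, 0.05)
hp = hpd(xs, ws, 0.05)
print(et, hp)
print(et[1] - et[0] >= hp[1] - hp[0])   # True: HPD is the shortest
```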
9.7 Features of Bayesian models

(i) If there is a sufficient statistic, say $T$, the posterior distribution will depend upon the data only through $T$.

(ii) There is a natural parametric family of priors such that the posterior distributions also belong to this family. Such priors are called conjugate priors.

Example: Binomial likelihood, beta prior. A binomial likelihood has the form
$$p(x \mid \theta) = \binom{n}{x}\, \theta^x (1-\theta)^{n-x}, \qquad x = 0, 1, \ldots, n,$$
and a beta prior is
$$\pi(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}, \qquad \theta \in [0, 1],$$
where
$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)},$$
so that, since Posterior $\propto$ Likelihood $\times$ Prior,
$$\pi(\theta \mid x) \propto \theta^{x + a - 1}(1-\theta)^{n - x + b - 1},$$
and comparison with the prior leads to
$$\pi(\theta \mid x) = \frac{\theta^{x + a - 1}(1-\theta)^{n - x + b - 1}}{B(x + a,\, n - x + b)}, \qquad \theta \in [0, 1],$$
which is also a beta distribution.
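In code, the conjugate update is simply parameter arithmetic. A sketch with invented data:

```python
# The conjugate beta update above: observing x successes in n Bernoulli
# trials sends a Beta(a, b) prior to a Beta(a + x, b + n - x) posterior.
def beta_binomial_update(a, b, x, n):
    return a + x, b + n - x

a, b = 1.0, 1.0                              # uniform prior = Beta(1, 1)
a, b = beta_binomial_update(a, b, x=7, n=10) # invented data
print(a, b)          # 8.0 4.0
print(a / (a + b))   # posterior mean 2/3
```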
9.8 Hypothesis testing

The Bayesian only has a concept of a hypothesis test in certain very specific situations. This is because of the alien idea of having to regard a parameter as a fixed point. However the concept has some validity in an example such as the following.

Example: Water samples are taken from a reservoir and checked for bacterial content. These provide information about $\theta$, the proportion of infected samples. If it is believed that bacterial density is such that more than 1% of samples are infected, there is a legal obligation to change the chlorination procedure. The Bayesian procedure is

(i) use the data to obtain the posterior distribution of $\theta$;

(ii) calculate $P(\theta > 0.01)$.

EC rules are formulated in statistical terms and, in such cases, allow for a 5% significance level. This could be interpreted in a Bayesian sense as allowing for a wrong decision with probability no more than 0.05, so order increased chlorination if $P(\theta > 0.01) > 0.95$.

Where for some reason (e.g. a legal reason) a fixed parameter value is of interest because it defines a threshold, a Bayesian concept of a test is possible. It is usually performed either by calculating a credible bound or a credible interval and examining it in relation to that threshold value or values, or by calculating posterior probabilities based on the threshold or thresholds.
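Step (ii) is a single posterior tail probability. A sketch of the reservoir example, with a uniform prior and invented sample counts:

```python
# Bayesian "test" for the reservoir example: with a Beta posterior for the
# infected proportion theta (uniform prior, invented counts), compute
# P(theta > 0.01) on a grid and compare it with the decision threshold.
def beta_tail_prob(a, b, c, m=20000):
    """P(theta > c) for Beta(a, b), by midpoint-rule integration."""
    xs = [(i + 0.5) / m for i in range(m)]
    ws = [x**(a - 1) * (1 - x)**(b - 1) for x in xs]
    total = sum(ws)
    return sum(w for x, w in zip(xs, ws) if x > c) / total

# uniform prior, 3 infected samples out of 100 -> Beta(4, 98) posterior
p = beta_tail_prob(4, 98, 0.01)
print(p)            # the posterior probability that theta exceeds 1%
print(p > 0.95)     # True here, so order increased chlorination
```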
9.9 Inferences for normal distributions

Unknown mean, known variance

Consider data from $N(\theta, \sigma^2)$ and a prior $\theta \sim N(\tau, \kappa^2)$, so
$$f(x \mid \theta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum (x_i - \theta)^2\right)$$
and
$$\pi(\theta) = \frac{1}{\sqrt{2\pi}\,\kappa} \exp\left(-\frac{1}{2\kappa^2}(\theta - \tau)^2\right), \qquad \theta \in \mathbb{R}.$$
Since Posterior $\propto$ Likelihood $\times$ Prior,
$$\pi(\theta \mid x, \sigma^2, \tau, \kappa^2) \propto \exp\left(-\frac{n}{2\sigma^2}(\theta^2 - 2\theta\bar x) - \frac{1}{2\kappa^2}(\theta^2 - 2\theta\tau)\right).$$
But
$$\frac{n}{2\sigma^2}(\theta^2 - 2\theta\bar x) + \frac{1}{2\kappa^2}(\theta^2 - 2\theta\tau) = \frac12\left(\frac{n}{\sigma^2} + \frac{1}{\kappa^2}\right)\theta^2 - \left(\frac{n\bar x}{\sigma^2} + \frac{\tau}{\kappa^2}\right)\theta = \frac12\left(\frac{n}{\sigma^2} + \frac{1}{\kappa^2}\right)\left(\theta - \frac{n\bar x/\sigma^2 + \tau/\kappa^2}{n/\sigma^2 + 1/\kappa^2}\right)^2 + \text{(terms not involving } \theta),$$
so the posterior p.d.f. is proportional to
$$\exp\left[-\frac12\left(\frac{n}{\sigma^2} + \frac{1}{\kappa^2}\right)\left(\theta - \frac{n\bar x/\sigma^2 + \tau/\kappa^2}{n/\sigma^2 + 1/\kappa^2}\right)^2\right].$$
This means that
$$\theta \mid x, \sigma^2, \tau, \kappa^2 \;\sim\; N\left(\frac{n\bar x/\sigma^2 + \tau/\kappa^2}{n/\sigma^2 + 1/\kappa^2},\ \left(\frac{n}{\sigma^2} + \frac{1}{\kappa^2}\right)^{-1}\right).$$
The posterior mean is a weighted average. The prior mean $\tau$ has weight $1/\kappa^2$: this is 1/(prior variance),
and the observed sample mean $\bar x$ has weight $n/\sigma^2$: this is 1/(variance of the sample mean). This is reasonable because
$$\lim_{n \to \infty} \frac{n\bar x/\sigma^2 + \tau/\kappa^2}{n/\sigma^2 + 1/\kappa^2} = \bar x \qquad \text{and} \qquad \lim_{\kappa^2 \to \infty} \frac{n\bar x/\sigma^2 + \tau/\kappa^2}{n/\sigma^2 + 1/\kappa^2} = \bar x.$$
The posterior mean tends to $\bar x$ as the amount of data swamps the prior, or as prior beliefs become very vague.

Unknown variance, known mean

We need to assign a prior p.d.f. $\pi(\sigma^2)$. The $\Gamma$ family of p.d.f.s provides a flexible set of shapes over $[0, \infty)$. We take
$$\tau = \frac{1}{\sigma^2} \sim \Gamma\left(\frac{\nu}{2}, \frac{\nu\lambda}{2}\right),$$
where $\lambda$ and $\nu$ are chosen to provide suitable location and shape. In other words, the prior p.d.f. of $\tau = 1/\sigma^2$ is chosen to be of the form
$$\pi(\tau) = \frac{(\nu\lambda)^{\nu/2}\, \tau^{\nu/2 - 1}}{2^{\nu/2}\,\Gamma(\nu/2)} \exp\left(-\frac{\nu\lambda\tau}{2}\right), \qquad \tau \ge 0,$$
or
$$\pi(\sigma^2) = \frac{(\nu\lambda)^{\nu/2}\, (\sigma^2)^{-\nu/2 - 1}}{2^{\nu/2}\,\Gamma(\nu/2)} \exp\left(-\frac{\nu\lambda}{2\sigma^2}\right).$$
$\tau = 1/\sigma^2$ is called the precision. The posterior p.d.f. is given by
$$\pi(\sigma^2 \mid x, \theta) \propto f(x \mid \theta, \sigma^2)\, \pi(\sigma^2),$$
and the right-hand side, as a function of $\sigma^2$, is proportional to
$$(\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum (x_i - \theta)^2\right) (\sigma^2)^{-\nu/2 - 1} \exp\left(-\frac{\nu\lambda}{2\sigma^2}\right).$$
Thus $\pi(\sigma^2 \mid x, \theta)$ is proportional to
$$(\sigma^2)^{-(\nu + n)/2 - 1} \exp\left(-\frac{1}{2\sigma^2}\left[\nu\lambda + \sum (x_i - \theta)^2\right]\right).$$
This is the same as the prior except that $\nu$ is replaced by $\nu + n$ and $\lambda$ is replaced by
$$\frac{\nu\lambda + \sum (x_i - \theta)^2}{\nu + n}.$$
Consider the random variable
$$W = \frac{\nu\lambda + \sum (x_i - \theta)^2}{\sigma^2}.$$
Clearly
$$f_W(w) = C\, w^{(\nu + n)/2 - 1}\, e^{-w/2}, \qquad w \ge 0.$$
This is a $\chi^2$-distribution with $\nu + n$ degrees of freedom. In other words,
$$\frac{\nu\lambda + \sum (x_i - \theta)^2}{\sigma^2} \;\sim\; \chi^2(\nu + n).$$
Again this agrees with intuition: as $n \to \infty$ we approach the classical estimate, and also as $\nu \to 0$, which corresponds to vague prior knowledge.

Mean and variance unknown

We now have to assign a joint prior p.d.f. $\pi(\theta, \sigma^2)$ to obtain a joint posterior p.d.f.
$$\pi(\theta, \sigma^2 \mid x) \propto f(x \mid \theta, \sigma^2)\, \pi(\theta, \sigma^2).$$
We shall look at the special case of a non-informative prior. From theoretical considerations beyond the scope of this course, a non-informative prior may be approximated by
$$\pi(\theta, \sigma^2) \propto \frac{1}{\sigma^2}.$$
Note that this is not to be thought of as an expression of prior beliefs, but rather as an approximate prior which allows the posterior to be dominated by the data. Then
$$\pi(\theta, \sigma^2 \mid x) \propto (\sigma^2)^{-n/2 - 1} \exp\left(-\frac{1}{2\sigma^2}\sum (x_i - \theta)^2\right).$$
The right-hand side may be written in the form
$$(\sigma^2)^{-n/2 - 1} \exp\left(-\frac{1}{2\sigma^2}\left[\sum (x_i - \bar x)^2 + n(\bar x - \theta)^2\right]\right).$$
We can use this form to integrate out, in turn, $\theta$ and $\sigma^2$. Using
$$\int_{-\infty}^{\infty} \exp\left(-\frac{n}{2\sigma^2}(\theta - \bar x)^2\right) d\theta = \sqrt{\frac{2\pi\sigma^2}{n}},$$
we find
$$\pi(\sigma^2 \mid x) \propto (\sigma^2)^{-(n-1)/2 - 1} \exp\left(-\frac{1}{2\sigma^2}\sum (x_i - \bar x)^2\right)$$
and, putting $W = \sum (x_i - \bar x)^2/\sigma^2$,
$$f_W(w) = C\, w^{(n-1)/2 - 1}\, e^{-w/2}, \qquad w \ge 0,$$
so we see that
$$\frac{\sum (x_i - \bar x)^2}{\sigma^2} \;\sim\; \chi^2(n - 1).$$
To integrate out $\sigma^2$, make the substitution $\sigma^2 = \tau^{-1}$. Then $\pi(\theta \mid x)$ is proportional to
$$\int_0^\infty \tau^{n/2 - 1} \exp\left(-\frac{\tau}{2}\left[\sum (x_i - \bar x)^2 + n(\bar x - \theta)^2\right]\right) d\tau.$$
From the definition of the $\Gamma$-function,
$$\int_0^\infty y^{\alpha - 1} e^{-\beta y}\, dy = \frac{\Gamma(\alpha)}{\beta^\alpha},$$
and we see, since only the terms in $\theta$ are relevant,
$$\pi(\theta \mid x) \propto \left(\sum (x_i - \bar x)^2 + n(\bar x - \theta)^2\right)^{-n/2},$$
or
$$\pi(\theta \mid x) \propto \left(1 + \frac{t^2}{n - 1}\right)^{-n/2},$$
where
$$t = \frac{\sqrt{n}\,(\theta - \bar x)}{\sqrt{\sum (x_i - \bar x)^2/(n - 1)}}.$$
This is a $t$-distribution with $n - 1$ degrees of freedom.
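Two of the results in this section can be checked numerically: the precision-weighted posterior for a normal mean with known variance, and the $\chi^2(n-1)$ marginal for the variance under the non-informative prior. A sketch, with all numerical values invented for illustration:

```python
# Checks of two results from this section.
import random

def normal_posterior(xbar, n, sigma2, tau, kappa2):
    """Posterior mean and variance of theta for N(theta, sigma2) data with a
    N(tau, kappa2) prior: the precision-weighted average derived above."""
    w_data, w_prior = n / sigma2, 1.0 / kappa2
    return ((w_data * xbar + w_prior * tau) / (w_data + w_prior),
            1.0 / (w_data + w_prior))

# prior N(0, 1); n = 4 observations with mean 2 and known sigma^2 = 1
m, v = normal_posterior(xbar=2.0, n=4, sigma2=1.0, tau=0.0, kappa2=1.0)
print(m, v)   # 1.6 0.2

# sum((x_i - xbar)^2)/sigma^2 should average n - 1 (chi-square, n - 1 df)
rng = random.Random(0)
n, sigma, reps, acc = 8, 2.0, 40000, 0.0
for _ in range(reps):
    xs = [rng.gauss(5.0, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    acc += sum((x - xbar) ** 2 for x in xs) / sigma**2
print(acc / reps)   # close to n - 1 = 7
```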
More informationHypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3
Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest
More informationHPD Intervals / Regions
HPD Intervals / Regions The HPD region will be an interval when the posterior is unimodal. If the posterior is multimodal, the HPD region might be a discontiguous set. Picture: The set {θ : θ (1.5, 3.9)
More informationBayesian Inference. STA 121: Regression Analysis Artin Armagan
Bayesian Inference STA 121: Regression Analysis Artin Armagan Bayes Rule...s! Reverend Thomas Bayes Posterior Prior p(θ y) = p(y θ)p(θ)/p(y) Likelihood - Sampling Distribution Normalizing Constant: p(y
More informationProbability. Lecture Notes. Adolfo J. Rumbos
Probability Lecture Notes Adolfo J. Rumbos October 20, 204 2 Contents Introduction 5. An example from statistical inference................ 5 2 Probability Spaces 9 2. Sample Spaces and σ fields.....................
More informationChapter 4. Bayesian inference. 4.1 Estimation. Point estimates. Interval estimates
Chapter 4 Bayesian inference The posterior distribution π(θ x) summarises all our information about θ to date. However, sometimes it is helpful to reduce this distribution to a few key summary measures.
More informationThe Bayesian Paradigm
Stat 200 The Bayesian Paradigm Friday March 2nd The Bayesian Paradigm can be seen in some ways as an extra step in the modelling world just as parametric modelling is. We have seen how we could use probabilistic
More informationLecture 25: Review. Statistics 104. April 23, Colin Rundel
Lecture 25: Review Statistics 104 Colin Rundel April 23, 2012 Joint CDF F (x, y) = P [X x, Y y] = P [(X, Y ) lies south-west of the point (x, y)] Y (x,y) X Statistics 104 (Colin Rundel) Lecture 25 April
More informationHypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes
Neyman-Pearson paradigm. Suppose that a researcher is interested in whether the new drug works. The process of determining whether the outcome of the experiment points to yes or no is called hypothesis
More informationPeter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11
Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of
More informationSTAT215: Solutions for Homework 2
STAT25: Solutions for Homework 2 Due: Wednesday, Feb 4. (0 pt) Suppose we take one observation, X, from the discrete distribution, x 2 0 2 Pr(X x θ) ( θ)/4 θ/2 /2 (3 θ)/2 θ/4, 0 θ Find an unbiased estimator
More informationStatistics Ph.D. Qualifying Exam: Part I October 18, 2003
Statistics Ph.D. Qualifying Exam: Part I October 18, 2003 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. 1 2 3 4 5 6 7 8 9 10 11 12 2. Write your answer
More informationBayesian Inference: Posterior Intervals
Bayesian Inference: Posterior Intervals Simple values like the posterior mean E[θ X] and posterior variance var[θ X] can be useful in learning about θ. Quantiles of π(θ X) (especially the posterior median)
More informationDefinition 3.1 A statistical hypothesis is a statement about the unknown values of the parameters of the population distribution.
Hypothesis Testing Definition 3.1 A statistical hypothesis is a statement about the unknown values of the parameters of the population distribution. Suppose the family of population distributions is indexed
More informationInference for a Population Proportion
Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More information(3) Review of Probability. ST440/540: Applied Bayesian Statistics
Review of probability The crux of Bayesian statistics is to compute the posterior distribution, i.e., the uncertainty distribution of the parameters (θ) after observing the data (Y) This is the conditional
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More information2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling
2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling Jon Wakefield Departments of Statistics and Biostatistics University of Washington Outline Introduction and Motivating
More informationBayesian inference. Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark. April 10, 2017
Bayesian inference Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark April 10, 2017 1 / 22 Outline for today A genetic example Bayes theorem Examples Priors Posterior summaries
More informationLecture 17: The Exponential and Some Related Distributions
Lecture 7: The Exponential and Some Related Distributions. Definition Definition: A continuous random variable X is said to have the exponential distribution with parameter if the density of X is e x if
More information10. Exchangeability and hierarchical models Objective. Recommended reading
10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.
More informationLinear Models A linear model is defined by the expression
Linear Models A linear model is defined by the expression x = F β + ɛ. where x = (x 1, x 2,..., x n ) is vector of size n usually known as the response vector. β = (β 1, β 2,..., β p ) is the transpose
More informationMAS3301 Bayesian Statistics
MAS331 Bayesian Statistics M. Farrow School of Mathematics and Statistics Newcastle University Semester 2, 28-9 1 9 Conjugate Priors II: More uses of the beta distribution 9.1 Geometric observations 9.1.1
More informationFoundations of Statistical Inference
Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 32 Lecture 14 : Variational Bayes
More informationLecture 3. Univariate Bayesian inference: conjugate analysis
Summary Lecture 3. Univariate Bayesian inference: conjugate analysis 1. Posterior predictive distributions 2. Conjugate analysis for proportions 3. Posterior predictions for proportions 4. Conjugate analysis
More informationFinal Examination a. STA 532: Statistical Inference. Wednesday, 2015 Apr 29, 7:00 10:00pm. Thisisaclosed bookexam books&phonesonthefloor.
Final Examination a STA 532: Statistical Inference Wednesday, 2015 Apr 29, 7:00 10:00pm Thisisaclosed bookexam books&phonesonthefloor Youmayuseacalculatorandtwo pagesofyourownnotes Do not share calculators
More informationPreliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com
1 School of Oriental and African Studies September 2015 Department of Economics Preliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com Gujarati D. Basic Econometrics, Appendix
More informationBayesian inference: what it means and why we care
Bayesian inference: what it means and why we care Robin J. Ryder Centre de Recherche en Mathématiques de la Décision Université Paris-Dauphine 6 November 2017 Mathematical Coffees Robin Ryder (Dauphine)
More informationf(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain
0.1. INTRODUCTION 1 0.1 Introduction R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observaton from f(x θ). Fisher
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide
More informationBayesian Inference. Chapter 2: Conjugate models
Bayesian Inference Chapter 2: Conjugate models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in
More informationSTAT 512 sp 2018 Summary Sheet
STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}
More informationFundamental Theory of Statistical Inference
Fundamental Theory of Statistical Inference G. A. Young January 2012 Preface These notes cover the essential material of the LTCC course Fundamental Theory of Statistical Inference. They are extracted
More informationBayesian Statistics. Debdeep Pati Florida State University. February 11, 2016
Bayesian Statistics Debdeep Pati Florida State University February 11, 2016 Historical Background Historical Background Historical Background Brief History of Bayesian Statistics 1764-1838: called probability
More informationInterval Estimation. Chapter 9
Chapter 9 Interval Estimation 9.1 Introduction Definition 9.1.1 An interval estimate of a real-values parameter θ is any pair of functions, L(x 1,..., x n ) and U(x 1,..., x n ), of a sample that satisfy
More information