Course arrangements. Foundations of Statistical Inference

Size: px

Start display at page:

Download "Course arrangements. Foundations of Statistical Inference"

Horace Gregory
5 years ago
Views:

1 Course arrangements Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 013 Lectures M. and Th.10 SPR1 weeks 1-8 Classes Weeks 3-8 Friday 1-1, 4-5, 5-6 SPR1/ Hand in solutions by 1noon on Wednesday SPR1. Class Tutors : Jonathan Marchini, Charlotte Greenan Notes and Problem sheets will be available at Books Garthwaite, P. H., Jolliffe, I. T. and Jones, B. (00 Statistical Inference, Oxford Science Publications Leonard, T., Hsu, J. S. (005 Bayesian Methods, Cambridge University Press. D. R. Cox (006 Principals of Statistical Inference This course builds on notes from Bob Griffiths and Geoff Nicholls Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT 013 / 8 Part A Statistics The majority of the statistics that you have learned up to now falls under the philosophy of classical (or Frequentist statistics. This theory makes the assumption that we can randomly take repeated samples of data from the same population. You learned about three types of statistical inference Point estimation (Maximum likelihood, bias, consistency, efficiency, information Interval estimation(exact and approximate intervals using CLT Hypothesis testing (Neyman-Pearson lemma, uniformly most powerful tests, generalised likelihood ratio tests You also had an introduction to Bayesian Statistics Posterior inference (Posterior Likelihood Prior Interval estimation(credible intervals, HPD intervals Priors(conjugate priors, improper priors, Jeffreys prior Hypothesis testing (marginal likelihoods, Bayes Factors Jonathan Marchini (University of Oxford BSa MT / 8 Frequentist inference In BSa we develop the theory of point estimation further. Are there families of distributions about which we can make general statements? Exponential families. How can we summarise all the information in a dataset about a parameter θ? Sufficiency and the Factorization Theorem. What are limits of how well we can estimate a parameter θ? Cramer-Rao inequality (and bound. How can we find good estimators of a parameter θ? Rao-Blackwell Theorem and Lehmann-Scheffé Theorem. Jonathan Marchini (University of Oxford BSa MT / 8

2 Bayesian inference Parameters are treated as random variables. Inference starts by specifying a prior distribution on θ based on prior beliefs. Having collected some data we use Bayes Theorem to update our beliefs to obtain a posterior distribution. Quick Example Suppose I give a coin and tell you that it is bit biased. We might use a Beta(4,4 distribution to represent our beliefs about the θ. If we observe 30 heads and 10 tails we can use probability theory to infer a posterior distribution for θ of Beta(34, Prior distribution Posterior distribution Computational techniques for Bayesian inference It is not always possible to obtain an analytic solution when doing Bayesian Inference, so we study approximate computational techniques in this course. These include Markov chain Monte Carlo (MCMC methods Approximations to marginal likelihoods NEW Variational Approximations Laplace approximations Bayesian Information Criterion (BIC The EM algorithm NEW useful in Frequentist and Bayesian inference of missing data problems Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Decision theory Quick Example You have been exposed to a deadly virus. About 1/3 of people who are exposed to the virus are infected by it, and all those infected by it die unless they receive a vaccine. By the time any symptoms of the virus show up, it is too late for the vaccine to work. You are offered a vaccine for 500. Do you take it or not? The most likely scenario is that you don t have the virus but basing a decision on this ignores the costs (or loss associated with the decisions. We would put a very high loss on dying! Decision theory allows us to formalise this problem. Closely connected to Bayesian ideas. Applicable to point estimation and hypothesis testing. Inference is based on specifying a number of possible actions we might take and a loss associated with each action. The theory involves determining a rule to make decisions using an overall measure of loss. Notation X, Y, Z Capital letters for random variables. x, y, z Lower case letters for realisations of random variables. E X ( Expectation with respect to the random variable X. φ = {φ 1,..., φ k } Sometimes we will use bold symbols to denote a vector of parameters. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

3 Parametric families Lecture 1 - Exponential families f (x; θ, θ Θ, probability density of a random variable (rv which could be discrete or continuous. θ can be 1-dimensional or of higher dimension. Equivalent notation:f θ (x,f (x θ,f (x, θ. Likelihood L(θ; x = f (x; θ and log-likelihood l(θ; x = log(l. Examples 1. Normal N(θ, 1 : f (x; θ = 1 e 1 (x θ x R, θ R. π. Poisson: f (x; θ = θx x! e θ, x = 0, 1,,..., θ > Regression: f (y; θ = n 1 e 1 (y i p πσ j=1 x ij β j, y R n, σ > 0, β R p. θ = {β, σ}. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Exponential families of distributions Example 1 : Poisson Definition 1 GJJ.6, DRC.3 A rv X belongs to a k-parameter exponential family if its probability density function (pdf can be written as k f (x; θ = exp A j (θb j (x + C(x + D(θ, j=1 where x χ, θ Θ, A 1 (θ,..., A k (θ, D(θ are functions of θ alone and B 1 (x, B (x,..., B k (x, C(x are well behaved functions of x alone. Exponential families are widely used in practice - for example in generalised linear models (see BS1a. We want to put the Poisson distribution in the form (with k = 1 f (x; θ = exp {A(θB(x + C(x + D(θ}, e θ θ x θ+x log θ log x! /x! = e = exp {(log θx log x! θ} So A(θ = log θ, B(x = x, C(x = log x!, D(θ = θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

4 Examples of 1-parameter Exponential families Binomial, Poisson, Normal, Exponential. Distn f (x; θ A(θ B(x C(x D(θ Bin(n, p ( n x p x (1 p n x log p (1 p x log ( n x Pois(θ e θ θ x /x! log θ x log(x! θ n log(1 p N(µ, 1 1 π exp{ (x µ } µ x x / 1 (µ log(π Exp(θ θe θx θ x 0 log θ Others : negative binomial, Pareto (with known minimum, Weibull (with known shape, Laplace (with known mean, Log-normal, inverse Gaussian, beta, Dirichlet, Wishart. Exercise: check these distributions Example : a -parameter family (Gamma If X Gamma(α, β then let θ = (α, β so f (x; θ = βα x α 1 e βx Γ(α = exp {α log β + (α 1 log x βx log Γ(α} = exp { (α 1 log x βx log [ Γ(αβ α]} And we have A 1 (θ = α 1, B 1 (x = log x, A (θ = β, B (x = x. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Some other -parameter Exponential families Exponential family canonical form Distribution f (x; θ A(θ B(x C(x D(θ N(µ, σ Gamma e (x µ σ A πσ 1 (θ = 1/σ B 1 (x = x 0 1 log(πσ β α x α 1 e βx A (θ = µ/σ B (x = x 0 1 µ /σ Γ(α A 1 (θ = α 1 B 1 (x = log x 0 log [ Γ(αβ α] A (θ = β B (x = x Let φ j = A j (θ, j = 1,..., k k f (x; φ = exp φ j B j (x + C(x + D(φ. j=1 φ j, j = 1,..., k are the canonical parameters, B j, j = 1,..., k are the canonical observations. These are sometimes called the natural parameters and observations. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

5 Since we have R f (x; φdx = 1 k exp φ R j B j (x + C(x + D(φ dx = 1 j=1 k exp{d(φ} exp φ R j B j (x + C(x dx = 1 j=1 k exp φ j B j (x + C(x dx = exp{ D(φ} R j=1 Jonathan Marchini (University of Oxford BSa MT / 8 R k exp φ j B j (x + C(x dx = exp{ D(φ} j=1 Differentiate with respect to φ i. k B i (x exp φ R j B j (x + C(x dx = D(φ exp{ D(φ} φ i j=1 k B i (x exp φ j B j (x + C(x + D(φ dx = D(φ φ i R j=1 E[B i (X] = φ i D(φ Exercise Cov[B i (X, B j (X] = φ i φ j D(φ Exercise Var[B i (X] = φ D(φ i Jonathan Marchini (University of Oxford BSa MT / 8 Example 3 : Gamma We already know that if X Gamma(α, β the E(X = α β and Var(X = α β. β α x α 1 e βx Γ(α = exp { βx + (α 1 log x + α log β log Γ(α} φ 1 = β, φ = α 1, B 1 (x = x, B (x = log x D(φ = α log β log Γ(α = (φ + 1 log( φ 1 log Γ(φ + 1 E[X] = D(φ = (φ + 1 = α φ 1 φ 1 β Var[X] = φ D(φ = (φ φ = α β 1 Exercise: show E[log X] = ψ 0 (α log(β where ψ 0 is the digamma function, and Γ (α = Γ(αψ 0 (α. Jonathan Marchini (University of Oxford BSa MT / 8 Cumulant Generating Function In a scalar canonical exponential family (k = 1 E X [e sb(x ] = exp {(φ + sb(x + C(x + D(φ} dx R = exp{d(φ D(φ + s} If M B(X (s is the Moment Generating Function (mgf for B(X, then log(m B(X (s = D(φ D(φ + s This is the cumulant generating function (defined as the log of the mgf for the cumulants of B(X i.e. log(m B(X (s = κ r s r /r! where κ 1 = E(B(X and κ = V (B(X Exercise : prove this Jonathan Marchini (University of Oxford BSa MT / 8 r=1

6 Example 4 : Binomial We already know that if X Binom(n, p then E(X = np. and Therefore φ = log p (1 p p = eφ 1 + e φ B 1 (x = x, D(φ = n log(1 p = n log(1 + e φ log(m X (s = D(φ D(φ + s = n log(1 + e φ + n log(1 + e φ+s κ 1 = s log(m X (s = n eφ s=0 1 + e φ = np Jonathan Marchini (University of Oxford BSa MT / 8 Example 5 : Skew-logistic distribution Consider the real valued random variable X with pdf θe x f (x; θ = (1 + e x θ+1 { ( e = exp θ log(1 + e x x } + log 1 + e x + log θ and φ = θ, B 1 (x = log(1 + e x and D(φ = log θ = log( φ log(m X (s = D(φ D(φ + s = log( φ log( (φ + s E(log(1 + e x = 1 = 1 φ + s s=0 θ Var(log(1 + e x 1 s=0 = (φ + s = 1 θ These results are harder to derive directly. Jonathan Marchini (University of Oxford BSa MT 013 / 8 Family preserved under transformations Exponential family conditions A smooth invertible transformation of a rv from the Exponential family is also within the Exponential family. If X Y, Y = Y (X then f Y (y; θ = f X (x(y; θ X/ Y k = exp A j (θb j (x(y + C(x(y + D(θ X/ Y, j=1 The Jacobian depends only on y and so the natural observation B(x(y, the natural parameter A(θ, and D(θ do not change, while C(X C(X(Y + log X/ Y. 1. The range of X, χ does not depend on θ.. ( A 1 (θ, A (θ,..., A k (θ have continuous second derivatives for θ Θ. 3. If dim Θ = d, then the Jacobian [ ] Ai (θ J(θ = θ j has full rank d for θ Θ. [Part of the regularity conditions which we cite later.] Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

7 The family is minimal if no linear combination exists such that Linear relations among A s do not generate curvature λ 0 + λ 1 B 1 (x + + λ k B k (x = 0 for all x χ. The dimension of the family is defined to be d, the rank of J(θ. A minimal exponential family is said to be curved when d < k and linear when d = k. We refer to a (k, d curved exponential family. Example 6 (X 1, X independent, normal, unit variance, means (θ, c/θ, c known. Take a k-parameter exponential family and impose A 1 = aa + b for constants a and b. k A i B i + C + D = k A i B i + A (ab 1 + B + (C + bb 1 + D i=3 This gives a k 1 parameter EF, not a (k, k 1-CEF. log f (x; θ = x 1 θ + cx /θ θ / c θ / +... is a (, 1 curved exponential family. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Examples not in an exponential family 1 Uniform on [0, θ], θ > 0. f (x; θ = 1, x [0, θ] θ > 0 θ Lecture - Sufficiency, Factorization Theorem, Minimal sufficiency The Cauchy distribution f (x; θ = [π(1 + (x θ ] 1, x R Other examples include the F-distribution, hypergeometric distribution and logistic distribution. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

8 Sufficient statistics Let X 1,..., X n be a random sample from f (x; θ. Definition A statistic T (X 1,..., X n is a function of the data that does not depend on unknown parameters. Definition 3 A statistic T (X 1,..., X n is said to be sufficient for θ if the conditional distribution of X 1,..., X n, given T, does not depend on θ. That is, g(x t, θ = g(x t Comment The definition says that a sufficient statistic T contains all the information there is in the sample about θ. Example 7 n independent trials where the probability of success is p. Let X 1,..., X n be indicator variables which are 1 or 0 depending if the trial is a success or failure. Let T = n X i. The conditional distribution of X 1,..., X n given T = t is g(x 1,..., x n t, p = f (x 1,..., x n, t p h(t p = = = n px i (1 p 1 x i ( n t p t (1 p n t ( n t p t (1 p n t p t (1 p n t ( n 1, t not depending on p, so T is sufficient for p. Comment Makes sense, since no information in the order. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Theorem 1 : Factorization Criterion Proof - for discrete random variables 1. Assume that T is sufficient, then the distribution of the sample is T (X 1,..., X n is a sufficient statistic for θ if and only if there exist two non-negative functions K 1, K such that the likelihood function L(θ; x can be written L(θ; x = K 1 [t(x 1,..., x n ; θ]k [x 1,..., x n ] = K 1 [t; θ]k [x], where K 1 depends only on the sample through T, and K does not depend on θ. L(θ; x = f (x θ = f (x, t θ = g(x t, θh(t θ T is sufficient which implies g(x t, θ = g(x t h(t θ depends on x through t(x only so L(θ; x = g(x th(t θ We set L(θ; x = K 1 K, where K 1 h, K g. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

9 . Suppose L(θ; x = f (x θ = K 1 [t; θ]k [x].then h(t θ = f (x, t θ = L(θ; x Thus g(x t, θ = {x:t (x=t} f (x, t θ h(t θ = {x:t (x=t} = K 1 [t; θ] L(θ; x h(t θ = {x:t (x=t} K (x. K [x] {x:t (x=t} K (x, not depending on θ. (K 1 cancels out in numerator and denominator. Minimal sufficiency In general, we can envisage a sequence of functions T 1 (X, T (X, T 3 (X,... such that T j (X is a function of T j 1 (X (j where moving along the sequence gains progressive reductions of data, and if T j (X is sufficient for θ, then so is T i (X i < j. Typically there will be a last member T k (X that is sufficient, with T i (X not sufficient for i > k. Then T k (X is minimal sufficient Example 7 (cont. Consider n = 3 Bernoulli trials 1 T 1 (X = (X 1, X, X 3 (the individual sequences of trials T (X = (X 1, 3 X i (the 1st random variable and the total sum. 3 T 3 (X = 3 X i (the total sum 4 T 4 (X = I(T 3 (X = 0 (I is indicator function; Exercise Prove T 4 not sufficient Definition 4 A statistic is minimal sufficient if it can be expressed as a function of every other sufficient statistic. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example 7 (cont. : Minimal sufficiency n Bernoulli trials with T = n X i. Suppose T above is not minimal sufficient but another statistic U is MS.Then U can be given as a function of T (and not vis versa or T is MS and there exist t 1 t values of T so that U(t 1 = U(t (ie T U is many to one so U T is not a function, and we assume for the moment no other t make U(t = U(t 1.The event U = u is the event T {t 1, t }. Let x 1,...x n contain t 1 successes. Then g(x 1,..., x n u, p = g(x 1,..., x n t 1, pp(t 1 u, p = g(x 1,..., x n t 1 P(T = t 1 T {t 1, t } = p t 1(1 p n t 1 ( n t 1 p t 1(1 p n t 1 + ( n t p t (1 p n t Minimal sufficiency and partitions of the sample space Intuitively, a minimal sufficient statistic most efficiently captures all possible information about the parameter θ. Any statistic T (X partitions the sample space into subsets and in each subset T (X has constant value. Minimal sufficient statistics correspond to the coarsest possible partition of the sample space. In the example of n = 3 Bernoulli trials consider the following 4 statistics and the partitions they induce. which depends on p, so U is not sufficient, a contradiction, and hence T must be MS (similar reasoning for multiple t i. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

Lemma 1 : Lehmann-Scheffé partitions TTT TTH THT THH HTT HHT HTH HHH TTT TTH THT THH HTT HHT HTH HHH Consider the partition of the sample space defined by putting x and y into the same class of the

10 Lemma 1 : Lehmann-Scheffé partitions TTT TTH THT THH HTT HHT HTH HHH TTT TTH THT THH HTT HHT HTH HHH Consider the partition of the sample space defined by putting x and y into the same class of the partition if and only if T 1 (X = ( X 1, X, X 3 3 T (X = X 1, X i L(θ; y/l(θ; x = f (y θ/f (x θ = m(x, y. Then any statistic corresponding to this partition is minimal sufficient. TTT TTH THT THH HTT HHT HTH HHH TTT TTH THT THH HTT HHT HTH HHH Comment This Lemma tells us how to define partitions that correspond to minimal sufficient statistics. It says that ratios of likelihoods of two values x and y in the same partition (and hence same statistic value should not depend on θ. 3 T 3 (X = X i T 4 (X = I T 3 (X = 0 ( Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Proof (for discrete RVs 1. Sufficiency. Suppose T is such a statistic g(x t, θ = f (x θ f (t θ = = = f (x θ τ = {y : T (y = t} f (y θ, y τ f (x θ y τ f (x θm(x, y [ ] 1 m(x, y y τ which does not depend on θ. Hence the partition D is sufficient.. Minimal sufficiency. Now suppose U is any other sufficient statistic and that U(x = U(y for some pair of values (x, y. If we can show that U(x = U(y implies T (x = T (y, then the Lehmann-Scheffé partition induced by T includes the partition based on any other sufficient statistic.in other words, T is a function of every other sufficient statistic, and so must be minimal sufficient. Since U is sufficient we have L(θ; y L(θ; x = K 1[u(y; θ]k [y] K 1 [u(x; θ]k [x] = K [y] K [x] which does not depend on θ. So the statistic U produces a partition at least as fine as that induced by T, and the result is proved. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

11 Sufficiency in an exponential family Random Sample X 1,..., X n Likelihood L(θ; x = n f (x i ; θ n k = exp A j (θb j (x i + C(x i + D(θ j=1 ( k n n = exp A j (θ B j (x i + nd(θ + C(x i. j=1 Exponential family form again. Jonathan Marchini (University of Oxford BSa MT / 8 Sufficiency in an exponential family Suppose the family is in canonical form so φ j = A j (θ, and let t j = n B j(x i, C(x = n C(x i. k L(θ; x = exp φ j t j + nd(θ + C(x. j=1 By the factorization criterion t 1,..., t k are sufficient statistics for φ 1,..., φ k. In fact, we do not need canonical form. If k L(θ; x = exp A j (θt j + nd(θ + C(x j=1 is a minimal k-dimensional linear exponential family then (by the regularity conditions above t 1,..., t k are minimal sufficient for θ 1,..., θ k. Minimal sufficiency is verified using Lemma 1. Jonathan Marchini (University of Oxford BSa MT / 8 Estimators Definition 5 A point estimate for θ is a statistic of the data. θ = θ(x = t(x 1,..., x n. Lecture 3 - Estimators, Minimum Variance Unbiased Estimators and the Cramér-Rao Lower Bound. Definition 6 An interval estimate is a set valued function C(X Θ such that θ C(X with a specified probability. Definition 7 : Maximum likelihood estimation If L(θ is differentiable and there is a unique maximum in the interior of θ Θ, then the MLE θ is the solution of where l(θ = log L(θ; x. L(θ; x = 0 or θ l(θ = 0, θ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

12 Lemma : MLEs and exponential families Consider a k-dimensional exponential family in canonical form ( k n n L(θ; x = exp φ j B j (x i + nd(φ + C(x i. j=1 Let T j (X = n B j(x i, j = 1,..., k. If the realized data are X = x, then the statistics evaluated on the data are T j (x = t j. The MLEs of φ 1,..., φ k are the solution of t j = E X (T j, j = 1,..., k. i.e. set the expected values of the sufficient statistics equal to their realised values and solve for φ j. [If the family is not in canonical form, there is a similar slightly more complicated matrix equation] Proof l = log L = const + k φ j t j + nd(φ j=1 φ j l = t j + n φ j D(φ However, since E X [B i (X] = φ i D(φ and T j (X = n B j(x i we know that E X [T j ] = n D(φ, so φ j is equivalent to t j = E X (T j. φ j l = t j E X (T j = 0 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Bias, Variance, Mean Squared Error T n = T (X 1,..., X n is a statistic. Definition 8 T n is unbiased for a function g(θ if E X (T n = t n (xf (x; θdx = g(θ, for all θ Θ. χ Definition 9 The bias of an estimator T n is bias(t n = E X [T n g(θ] Definition 10 T n is a consistent estimator if ɛ > 0, P( T n θ > ɛ 0 as n. Definition 11 The Mean Squared Error (MSE of T n is MSE(T n = E X [T n g(θ] = V X (T n + [bias(t n ] Example 10 N(µ, σ. µ = X and S = (n 1 1 n ( Xi X are unbiased estimates of µ and σ. Minimum Variance Unbiased Estimators (MVUE If we want to find a good estimator then one obvious strategy is to try to find estimators that minimise MSE. This is often difficult. For example, if we choose the estimator θ = θ 0 then this has MSE=0 when θ = θ 0, so no other estimator can be uniformly best unless it has zero MSE everywhere. If we restrict attention to unbiased estimators then the situation becomes more tractable. In this case, MSE reduces to the variance of the estimator and we can focus on minimising the variance of estimators. That is, we search for minimum variance unbiased estimators (MVUE. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

13 Theorem : Cramér-Rao inequality (and bound. If θ is an unbiased estimator of θ, then subject to certain regularity conditions on f (x; θ, we have Var( θ I 1 θ. where I θ, the Fisher information, is given by [ ] I θ = E θ l(θ Comment This bound tells us the minimum possible variance. If an estimator achieves the bound then it is MVUE. There is no guarantee that the bound will be attainable. In many cases it is attainable asymptotically. Intuitively, the more information we have about θ, the larger I θ will be and lowest possible variance of the estimator will be smaller. Regularity conditions for CRLB We will not be concerned with the details of the required regularity conditions. The main reason they are needed is to ensure that it is ok to interchange integration and differentiation during parts of the proof. One condition that is often easy to check is that the range of the rv X must not depend on θ. So for example, the result can not be applied when working with the uniform distribution U[0, θ] and we wish to estimate θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 In order to prove the CRLB we will need to use a few results. Proposition 1 : Variance-Covariance inequality Let U and V be scalar rv. Then cov(u, V var(uvar(v with equality if and only if U = av + b for constants and a 0. The Fisher Information I θ, which is used in the Cramér-Rao lower bound, can be expressed in two different forms. Lemma 3 Under regularity conditions [ ] [ ( ] l I θ = E θ l(θ = E = Var[S(X; θ], θ where the score function s(x; θ is defined as s(x; θ = θ l(θ = f (x; θ f (x; θ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

14 Lemma 3 - Proof [ ] [ ( ] l We need to prove E θ l(θ = E. θ l θ = θ { 1 L } [ since l L θ = 1 ( L L + 1 L θ L θ ( l = + 1 ( L θ L θ The second term has expectation zero because [ 1 E L ( ] L θ = θ = 1 L ] L θ 1 L L θ Ldx = L dx = θ θ Ldx = 0 The alternative form I θ = Var[S(X; θ] follows from E [ l θ ] = 0. Jonathan Marchini (University of Oxford BSa MT / 8 Proof of the CRLB We consider only unbiased estimators, so we have E( θ = θ(xl(θ; xdx = θ Differentiate both sides w.r.t. θ Now so 1 = χ χ χ θ L θ dx = 1 L θ = L l θ θ l θ Ldx = E [ θ l ] θ Jonathan Marchini (University of Oxford BSa MT / 8 Proof of the CRLB Now we use the inequality that for two random variables U, V Cov[U, V ] Var[U]Var[V ] with U = θ, V = l l θ.we know Var[ θ ] = I θ. Must show Cov[U, V ] = 1. [ Cov[U, V ] = E[UV ] E[U]E[V ], E[U] = θ, E θ l ] = 1 θ E[V ] = χ l θ Ldx = L χ θ dx = [ ] Ldx θ χ Var[ θ] = Var[U] Cov[U, V ] Var[V ] = 1 I θ = θ [1] = 0 = I 1 θ Jonathan Marchini (University of Oxford BSa MT / 8 Information in a sample of size n. If we have n iid observations then f (x; θ = and the Fisher information is [ ] i n (θ = E θ l(θ = n n f (x i ; θ That is, i 1 (θ is calculated from the density as θ log f (x i; θf (x; θdx = ni 1 (θ. i 1 (θ = log f (x; θf (x; θdx θ Jonathan Marchini (University of Oxford BSa MT / 8

15 Question Under what conditions will we be able to attain the Cramér-Rao bound and find a MVUE? Lecture 4 - Consequences of the Cramér-Rao Lower Bound. Searching for a MVUE. Rao-Blackwell Theorem, Lehmann-Scheffé Theorem. Corollary 1 There exists an unbiased estimator θ which attains the CR lower bound (under regularity conditions if and only if Proof In the CR proof l θ = I θ( θ θ Cov[U, V ] Var[U]Var[V ] and the lower bound is attained if and only equality is achieved. If U = θ, V = l l θ, the equality occurs when θ = c + d θ, where c, d are constants. E[V ] = 0 so c = dθ and l θ = d( θ θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Multiply by l/ θ and take expectations. E [ ( ] l θ The LHS is I θ so we have d = I θ and [ ] [ ] l = de θ θ l dθe = d 1 0 θ l θ = I θ( θ θ Question What is the relationship between the CRLB and exponential families? Corollary If there exists an unbiased estimator θ(x which attains the CR lower bound (under regularity conditions it follows that X must be in an exponential family Proof Taking n = 1 and log f (x; θ θ = l θ = I θ( θ θ log f (x; θ = θa(θ + D(θ + C(x which is an exponential family form. The constant of integration C(x is a function of x. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

16 Question What is the relationship between the CRLB and MLEs? Corollary 3 Suppose θ(x is an unbiased estimator that attains the CRLB, and so is a MVUE. Suppose that the MLE θ is a solution to l/ θ = 0 (so, not on boundary. Then θ = θ. i.e. if the CRLB is attained then it is generally the MLE that attains it. Proof θ must satisfy l θ = I θ( θ θ. Setting l θ = 0 and solving will give the MLE θ. Since I θ > 0 (in all but exceptional circumstances, this gives θ = θ. Question Do all MLEs attain the CRLB? No, because not all MLEs are unbiased. Example 11 Let X 1,..., X n be a random sample from N(µ, σ. Then we know the MLEs are µ = X, σ = 1 n n X i X. Exercise µ is unbiased, but σ is biased. CRLBs are 1/I µ = σ and 1/I σ = σ 4 /n. Var( µ = σ /n which equals the CRLB so is MVUE. Var( σ = (n 1σ 4 /n is less than the CRLB. But σ is biased. The sample variance S = 1 n 1 n (X i X is unbiased and has variance σ 4 /(n 1 which is larger than the CRLB. Question Is S a MVUE? Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Efficiency Asymptotic normality of MLE Definition 13 The (Bahadur efficiency of an estimator θ is defined as a comparison of the variance of θ with the CR bound I 1 θ. That is Efficiency of θ = I 1 θ Var[ θ] = 1 I θ Var[ θ] The asymptotic efficiency is the limit as n. There are similar definitions for the relative efficiency of two estimators. Revision from Part A Statistics As the sample size n, the MLE θ N(θ, I 1 θ. This is a powerful and general result. Assuming the usual regularity conditions hold then it tells us that the MLE has the following properties 1 it is asymptotically unbiased it is asymptotically efficient i.e. it attains the CRLB asymptotically. 3 it has a normal distribution asymptotically. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

17 Extensions to the Cramér-Rao inequality 1. If θ is an estimator with bias b(θ = bias( θ, then Var[ θ] ( 1 + b I 1 θ θ. If ĝ(x is an unbiased estimator for g(θ, then Var[ĝ(X] ( g I 1 θ θ. Proof Begin with E θ ( θ(x = θ + b(θ (in 1. and E θ (ĝ(x = g(θ (in.. Differentiate both sides and proceed as above to find Cov[U, V ] = (1 + b/ θ (in 1. and Cov[U, V ] = g/ θ (in., with U = ĝ. The bound is against Cov[U, V ] which leads to the results above. Fisher Information for a d-dimensional parameter Information matrix: [ ] [ l l ] l I ij = E = E θ i θ j θ i θ j under regularity conditions. The CR inequality is Var( θ i [I 1 ] ii, i = 1,..., d. Exercise: verify that we have already proved Var( θ i [I ii ] 1. Note that [I 1 ] ii [I ii ] 1 (GJJ so bound above is stronger. Exercise For an Exponential family in canonical form, I ij = nd(φ. φ i φ j Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 The CRLB may not be achievable but will still wish to search for an MVUE. Sufficiency plays an important role in the search for a MVUE. Theorem 3 : Rao-Blackwell Theorem (RBT (GJJ.5. Let X 1,..., X n be a random sample of observations from f (x; θ. Suppose that T is a sufficient statistic for θ and that θ is any unbiased estimator for θ. Define a new estimator θ T = E[ˆθ T ]. Then 1. θ T is a function of T alone;. E[ θ T ] = θ; 3. Var( θ T Var( θ. Comment This says that estimators maybe be improved if we take advantage of sufficient statistics. Proof 1. θ T = E X [ˆθ T = t] = = ˆθ(xf (x t, θdx χ χ ˆθ(xf (x tdx. E[ θ T ] = E T [E[ˆθ T ]] = E[ˆθ] = θ (by law of total expectation 3. Using the law of total variance Var( θ = Var(E[ˆθ T ] + E T [Var(ˆθ T ] = Var( θ T + E T [Var(ˆθ T ] Var( θ Var( θ T Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

18 Example 1 Suppose X 1,..., X n be a random sample from Bernoulli(θ. It is easy to see that θ = X 1 is unbiased for θ. Also, we have seen before that T = N X i is sufficient for θ. We can use RBT to construct an estimator with smaller variance E[X 1 T = t] = P(X 1 = 1 T = t = P(X 1 = 1, N X i = t P( N X i = t = P(X 1 = 1, N i= X i = t 1 N C t θ t (1 θ N t = θ.n 1 C t 1 θ t 1 (1 θ N t N C t θ t (1 θ n t Corollary 4 If an MVUE θ for θ exists, then there is a function θ T of the minimal sufficient statistic T for θ which is an MVUE. Proof If θ is a MVUE and T is minimal sufficient then by RBT we can construct θ T. Which implies θ T is a function of T alone, is unbiased and variance no larger than θ. Hence is also a MVUE. Comment This says that we can restrict our search for a MVUE to those based on minimal sufficient statistics. = t N Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Completeness Definition 14 : Complete Sufficient Statistics Let T (X 1,..., X n be a sufficient statistic for θ. The statistic T is complete if, whenever h(t is a function of T for which E[h(T ] = 0 for all θ, then h(t 0 almost everywhere. Lemma 4 Suppose T is a complete sufficient statistic for θ, and g(t unbiased for θ, so E[g(T ] = θ. Then g(t is the unique function of T which is an unbiased estimator of θ. Proof If there were two such unbiased estimators g 1 (T, g (T, then E[g 1 (T g (T ] = θ θ = 0 for all θ, so g 1 (T = g (T almost everywhere. Question If we have an unbiased estimator what are the sufficient conditions for it to be MVUE? Lemma 5 If an MVUE for θ exists and T is a complete and minimal sufficient statistic for θ, and suppose h = h(t is unbiased for θ, then h(t is a MVUE. This Lemma combines the results of Corollary 4 and Lemma 4. Proof If an MVUE exists then there is a function of T which is an MVUE, by the RB Corollary 4. But h(t is the only function of T which is unbiased for θ (from Lemma 4. So h must be the function of T which an MVUE. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

19 Question Finally, how can we construct a MVUE? Theorem 4 : Lehmann-Scheffé theorem Let T be a complete sufficient statistic for θ, and let θ be an unbiased estimator for θ, then the unbiased estimator θ T = E[ˆθ T ] has the smallest variance among all unbiased estimators of θ. That is, for all unbiased estimators θ. Var( θ T Var( θ Comment This theorem says that if we can find any unbiased estimator and a complete sufficient statistic T then we can construct a MVUE. Proof Suppose θ exists with Var( θ < Var( θ T. Then by RBT we can construct θ T = E[ θ T ] such that Var( θ T Var( θ < Var( θ T But θ T and θ T are both unbiased and T is complete, so by Lemma 4 we have θ T = θ T and Var( θ T = Var( θ T which is a contradiction. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Lemma 6 : Complete Sufficiency in EFs If the rv X has a distribution belonging to a k-parameter exponential family, then under the usual regularity conditions, the statistic ( n B 1 (X i, n B (X i,..., n B k (X i which we already know is minimal sufficient, is complete. Comment We showed before that MLEs for exponential families were functions of this same statistic. Therefore, for a member of an exponential family if the MLE is unbiased (note : not all MLEs are unbiased, then by Lemma 5, the MLE will be MVUE. If there is an unbiased estimator that attains the CRLB then, by Corollary 3, the MLE will attain the CRLB. Example 13 Let X 1,..., X n be a random sample from N(µ, σ. Then we know the MLEs are µ = X, σ = 1 n n X i X. µ is unbiased, but σ is biased. Exercise ( The minimal sufficient complete statistic is n X i, n X i. So µ is MVUE and attains the CRLB with variance σ /n. The sample variance S = 1 n 1 n (X i X is unbiased and is a function of the minimal sufficient complete statistic so is MVUE with variance σ 4 /(n 1 which is larger than the CRLB of σ 4 /n. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

20 Method of moments Generate estimators by equating observed statistics with their expected values E{t(X 1,..., X n } = t(x 1,..., x n. Lecture 5 - Method of Moments Comment Simple, easy to use. Often have good properties, but not guaranteed. Can provide good starting point for an iterative method for finding MLEs. Example 14 Uniform iid sample X = (X 1, X,..., X n on (0, θ. L(θ; x = f (x; θ = θ n, 0 < x 1,..., x n < θ. Let X (n = max i X i. In this case the moment relation E [ ] X (n = x(n Jonathan Marchini (University of Oxford BSa MT / 8 leads to an unbiased sufficient statistic, θ = n+1 n X (n. Jonathan Marchini (University of Oxford BSa MT / 8 Method of moments The distribution of X (n = max i X i is obtained from the CDF ( y n P(X (n y = θ (the probability all iid X i fall in (0, y so f X(n (y; θ = ny n 1 θ n, 0 < y < θ and E [ ] n X (n = n + 1 θ, so θ = n + 1 n is unbiased. X (n Now check θ is sufficient. The distribution of X X (n = y is f (x X (n = y; θ = which does not depend on θ. θ is minimal sufficient since f (x; θ f X(n (y; θ = 1 ny n 1 L(θ; x L(θ; y = θ n I[x (n < θ] θ n I[y (n < θ] does not depend on θ if x (n = y (n i.e. θ forms a Lehmann-Scheffé partition. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

21 Finally, we can show that θ = θ(x (n is complete. If E[ θ] = 0 for all θ > 0 then θ 0 ny n 1 h( θ(y θ n dy = 0 for all θ > 0 and hence θ h( θ(yy n 1 dy = 0. 0 Differentiate wrt θ (using the Leibniz integral rule, and conclude h( θ = 0 identically for all X. It follows that θ is the unique unbiased estimator based on X (n. Since θ is minimal sufficient, if a MVUE exists, it equals θ (Corollary 4. We can t use the CRB to show a MVUE exists, as the problem is non-regular. Jonathan Marchini (University of Oxford BSa MT / 8 Exercise l(θ; x = n log(θ so [ ( ] l E = n /θ θ and the CRB would be Var( θ θ /n. However, you can check (using f X(n (y; θ above that Var( θ = θ n(n + 1 which is smaller than θ /n for any n 1. This is not a contradiction, as f (x; θ doesn t satisfy the regularity conditions (limits of x depend on θ. Jonathan Marchini (University of Oxford BSa MT / 8 We can find the MLE in this example but this estimator is θ = X (n 1 biased and the Uniform density does not satisfy the regularity condition needed for CRLB so it does not apply. does not satisfy l/ θ = 0, so even if the CRLB did apply we cannot make the link between the MLE and the lower bound. Example 15 A sample from N(µ, σ. The moment relations [ n ] [ n ] E X i = nµ, E = n(µ + σ lead to the estimating equations X i µ = X, µ + σ = n 1 n X i. We have seen that these are just the equations for the MLE s in this -dimensional exponential linear family, and solve to give the MLE n σ = n 1 (X i X. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

22 Example 16 A sample from Pois(λ. We have two possible moment relations [ ] 1 n E X i = λ, n λ 1 = 1 n X i n E [ 1 n ] n (X i X = λ, λ = 1 n n (X i X For a sample of size n we have A(λ = log λ, B(x = n x i, C(x = log( n x i!, D(λ = nλ By Lemma 6 we have that B(x = n x i is a sufficient complete statistic. Thus by the Lehmann-Scheffé Theorem we have that arithmetic mean λ 1 is MVUE. λ is not a function of B(x. Exercise Show that λ 1 attains the CRLB. Jonathan Marchini (University of Oxford BSa MT / 8 Method of moments asymptotics Distribution of θ as n.suppose the dimension of θ is d = 1 and H = n 1 n h(x i where E( H = k(θ and θ is the solution to H = k(θ. Theorem 5 As n, θ is asymptotically normally distributed with mean θ and variance n 1 σ h /(k (θ, where σ h = h(x f (x; θdx k(θ Proof E( H = k(θ, so by the Central Limit Theorem, Approximately ( H k(θ n σ h D N(0, 1 H = k( θ = k(θ + ( θ θk (θ Jonathan Marchini (University of Oxford BSa MT / 8 Method of moments asymptotics Example 17 - Exponential distribution Rearrange to get so as n. n( θ θk (θ σ h θ N = ( H k(θ n D N(0, 1 σ h ( θ, σ h n(k (θ That is Now so f (x; θ = θ exp( θx, x > 0 E[ X] = θ 1, θ = X 1 k(θ = θ 1, H = X, h(x = X, E [ h(x ] = θ 1 Var [ h(x ] = σ h = θ, k (θ = θ θ N ( θ, σ h n(k (θ θ N(θ, n 1 θ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

23 Exponential families - Method of moments and MLEs MLE are moment estimates because we are solving [ n ] n E t(x i = t(x i for θ. We showed earlier that [ ] E t(x = D (θ so Now [ n ] E t(x i = nd (θ k(θ = D (θ, k (θ = D (θ. Lecture 6 : Bayesian Inference Var( θ Var(t(X nd (θ = D (θ nd (θ = I 1 θ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Ideas of probability The majority of statistics you have learned so are are called classical or Frequentist. The probability for an event is defined as the proportion of successes in an infinite number of repeatable trials. By contrast, in Subjective Bayesian inference, probability is a measure of the strength of belief. We treat parameters as random variables. Before collecting any data we there is uncertainty about the value of a parameter. This uncertainty can be formalised by specifying a pdf (or pmf for the parameter. We then conduct an experiment to collect some data that will give us information about the parameter. We then use Bayes Theorem to combine our prior beliefs with the data to derive an updated estimate of our uncertainty about the parameter. The history of Bayesian Statistics Bayesian methods originated with Bayes and Laplace (late 1700s to mid 1800s. In the early 190 s, Fisher put forward an opposing viewpoint, that statistical inference must be based entirely on probabilities with direct experimental interpretation i.e. the repeated sampling principle. In 1939 Jeffrey s book The theory of probability started a resurgence of interest in Bayesian inference. This continued throughout the s, especially as problems with the Frequentist approach started to emerge. The development of simulation based inference has transformed Bayesian statistics in the last 0-30 years and it now plays a prominent part in modern statistics. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

24 Problems with Frequentist Inference - I Frequentist Inference generally does not condition on the observed data A confidence interval is a set-valued function C(X Θ of the data X which covers the parameter θ C(X a fraction 1 α of repeated draws of X taken under the null H 0. This is not the same as the statement that, given data X = x, the interval C(x covers θ with probability 1 α. But this is the type of statement we might wish to make. Example 1 Suppose X 1, X U(θ 1/, θ + 1/ so that X (1 and X ( are the order statistics. Then C(X = [X (1, X ( ] is a α = 1/ level CI for θ. Suppose in your data X = x, x ( x (1 > 1/ (this happens in an eighth of data sets. Then θ [x (1, x ( ] with probability one. Problems with Frequentist Inference - II Frequentist Inference depends on data that were never observed The likelihood principle Suppose that two experiments relating to θ, E 1, E, give rise to data y 1, y, such that the corresponding likelihoods are proportional, that is, for all θ L(θ; y 1, E 1 = cl(θ; y, E. then the two experiments lead to identical conclusions about θ. Key point MLE s respect the likelihood principle i.e. the MLEs for θ are identical in both experiments. But significance tests do not respect the likelihood principle. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Consider a pure test for significance where we specify just H 0. We must choose a test statistic T (x, and define the p-value for data T (x = t as p-value = P(T (X at least as extreme as t H 0. The choice of T (X amounts to a statement about the direction of likely departures from the null, which requires some consideration of alternative models. Note 1 The calculation of the p-value involves a sum (or integral over data that was not observed, and this can depend upon the form of the experiment. Note A p-value is not P(H 0 T (X = t. Example A Bernoulli trial succeeds with probability p. E 1 E fix n 1 Bernoulli trials, count number y 1 of successes count number n Bernoulli trials to get fixed number y successes L(p; y 1, E 1 = L(p; y, E = ( n1 p y 1 (1 p n 1 y 1 binomial y ( 1 n 1 p y (1 p n y negative binomial y 1 If n 1 = n = n, y 1 = y = y then L(p; y 1, E 1 L(p; y, E. So MLEs for p will be the same under E 1 and E. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

25 Example But significance tests contradict : eg, H 0 : p = 1/ against H 1 : p < 1/ and suppose n = 1 and y = 3. The p-value based on E 1 is P ( Y y θ = 1 = y k=0 ( n (1/ k (1 1/ n k (= k while the p-value based on E is ( P N n θ = 1 ( k 1 = (1/ k (1 1/ n k (= y 1 k=n so different conclusions at significance level Note The p-values disagree because they sum over portions of two different sample spaces. Jonathan Marchini (University of Oxford BSa MT / 8 Bayesian Inference - Revison Likelihood f (x θ and prior distribution π(θ for ϑ. The posterior distribution of ϑ at ϑ = θ, given x, is π(θ x = f (x θπ(θ π(θ x f (x θπ(θ f (x θπ(θdθ posterior likelihood prior The same form for θ continuous (π(θ x a pdf or discrete (π(θ x a pmf. We call f (x θπ(θdθ the marginal likelihood. Likelihood principle Notice that, if we base all inference on the posterior distribution, then we respect the likelihood principle. If two likelihood functions are proportional, then any constant cancels top and bottom in Bayes rule, and the two posterior distributions are the same. Jonathan Marchini (University of Oxford BSa MT / 8 Example 1 X Bin(n, ϑ for known n and unknown ϑ. Suppose our prior knowledge about ϑ is represented by a Beta distribution on (0, 1, and θ is a trial value for ϑ. π(θ = θa 1 (1 θ b 1, 0 < θ < 1. B(a, b density a=1, b=1 a=0, b=0 a=1, b=3 a=8, b= Jonathan Marchini (University of Oxford BSa MT / 8 x Example 1 Prior probability density Likelihood π(θ = θa 1 (1 θ b 1, 0 < θ < 1. B(a, b f (x θ = Posterior probability density ( n θ x (1 θ n x, x = 0,..., n x π(θ x likelihood prior θ a 1 (1 θ b 1 θ x (1 θ n x = θ a+x 1 (1 θ n x+b 1 Here posterior has the same form as the prior (conjugacy with updated parameters a, b replaced by a + x, b + n x, so π(θ x = θa+x 1 (1 θ n x+b 1 B(a + x, b + n x Jonathan Marchini (University of Oxford BSa MT / 8

26 Example 1 Example 1 For a Beta distribution with parameters a, b µ = a a + b, σ = The posterior mean and variance are ab (a + b (a + b + 1 a + X a + b + n, (a + X(b + n X (a + b + n (a + b + n + 1 Suppose X = n and we set a = b = 1 for our prior. Then posterior mean is n + 1 n + i.e. when we observe events of just one type then our point estimate is not 0 or 1 (which is sensible especially in small sample sizes. For large n, the posterior mean and variance are approximately In classical statistics X X(n X, n n 3 θ = X n, θ(1 θ n = X(n X n 3 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example 1 Suppose ϑ = 0.3 but prior mean is 0.7 with std 0.1. Suppose data X Bin(n, ϑ with n = 10 (yielding X = 3 and then n = 100 (yielding X = 30, say. probability density post n= theta post n=10 As n increases, the likelihood overwhelms information in prior. Jonathan Marchini (University of Oxford BSa MT / 8 prior Example - Conjugate priors Normal distribution when the mean and variance are unknown.let τ = 1/σ, θ = (τ, µ. τ is called the precision. The prior τ has a Gamma distribution with parameters α, β > 0, and conditional on τ, µ has a distribution N(ν, 1 kτ for some k > 0, ν R. The prior is { π(τ, µ = βα Γ(α τ α 1 e βτ (π 1/ (kτ 1/ exp kτ } (µ ν or The likelihood is [ π(τ, µ τ α 1/ exp τ {β + k }] (µ ν f (x µ, τ = (π n/ τ n/ exp { τ } n (x i µ Jonathan Marchini (University of Oxford BSa MT / 8

27 Example - Conjugate priors Thus [ { π(τ, µ x τ α+(n/ 1/ exp τ β + k (µ ν + 1 }] n (x i µ Example - Conjugate priors Thus the posterior is where π(τ, µ x τ α 1/ exp [ τ {β + k }] (ν µ Complete the square to see that k(µ ν + (x i µ ( kν + n x = (k + n µ + nk k + n n + k ( x ν + (x i x α = α + n β = β + 1 nk n + k ( x ν + 1 (xi x k = k + n ν = kν + n x k + n This is the same form as the prior, so the class is conjugate prior. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Constructing priors Lecture 7 : Prior distributions. Hierarchical models. Predictive Distributions. Subjective Priors: Write down a distribution representing prior knowledge about the parameter before the data is available. If possible, build a model for the parameter. If different scientists have different priors or it is unclear how to represent prior knowledge as a distribution, then consider several different priors. Repeat the analysis and check that conclusions are insensitive to priors representing different points of view. Non-Subjective Priors: Several approaches offer the promise of an automatic and even objective prior. We list some suggestions below (Uniform, Jeffreys, MaxEnt. In practice, if one of these priors conflicted prior knowledge, we wouldnt use it. These approaches can be useful to complete the specification of a prior distribution, once subjective considerations have been taken into account. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

28 Uniform priors If we use a uniform prior (π(θ constant then π(θ x = L(θ; x/ L(θ; xdθ and Θ L(θ; xdθ may not be finite. Such distributions are called improper and all inference is meaningless. Example X Exp(1/µ and N = I(X < 1 yielding N = n with n {0, 1}. Suppose we observe n = 0. Now Θ L(µ; n = exp( 1/µ so if we take π(µ 1 for µ > 0 we have π(µ n exp( 1/µ which is improper, as π(µ n 1 as µ so 0 exp( 1/µdµ cannot exist. Jonathan Marchini (University of Oxford BSa MT / 8 Jeffreys Priors Jeffreys reasoned as follows. If we have a rule for constructing priors it should lead to the same distribution if we apply it to θ or some other parameterization ψ with g(ψ = θ.jeffreys took π(θ I θ where I θ = E is the Fisher information. Now if g(ψ = θ then π Ψ (ψ π(g(ψ g (ψ, [ ( ] l θ so Jeffreys rule should yield π Ψ (ψ I g(ψ g (ψ. The rule gives π Ψ (ψ I ψ. But I ψ = g (ψ I g (ψ, so I ψ = I g(ψ g (ψ and the rule is consistent in this respect. It is sometimes desirable (on subjective grounds to have a prior which is invariant under reparameterization. Jonathan Marchini (University of Oxford BSa MT / 8 Higher dimensions If Θ R k, and l(θ; X = log(f (X; θ, the Fisher information satisfies ( l(θ; X [I θ ] i,j = E θ θ i θ j ( ( l(θ; X l(θ; X l(θ; X E θ = E θ θ i θ j θ i θ j subject to regularity conditions. A k-dimensional Jeffreys prior π(θ I θ 1/ ( A det(a is invariant under 1-1 reparameterization. Exercise Verify 1 to 1 g(ψ = θ in R k gives π Ψ (ψ = θ I T g(ψ ψ. Maximum Entropy Priors Choose a density π(θ which maximizes the entropy φ[π] = π(θ log π(θdθ Θ over functions π(θ subject to constraints on π. This is a Calculus of Variations problem. Example The distribution π maximizing φ[π] over all densities π on Θ = R, subject to 0 π(θdθ = 1, 0 θπ(θdθ = µ, and 0 (θ µ π(θdθ = σ, (normalized with Eϑ = µ and Var(ϑ = σ is the normal density π(θ = 1 πσ e (θ µ /σ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

29 This is a special case of the following Theorem. Theorem The density π(θ that maximizes φ(π, subject to E[t j (θ] = t j, j = 1,..., p takes the p-parameter exponential family form { p } π(θ exp λ i t i (θ for all θ Θ, where λ 1,..., λ p are determined by the constraints. (For the proof see Leonard and Hsu. Example In the normal case t 1 (θ = θ, t 1 = µ, t (θ = (θ µ, t = σ gives π(θ exp(λ 1 θ + λ (θ µ. Impose the constraints to get λ 1 = 0 and λ = 1/σ. Jonathan Marchini (University of Oxford BSa MT / 8 Example Suppose prior probabilities are specified so that with j φ j = 1 and P(a j 1 < ϑ a j = φ j, j = 1,..., p ϑ (a 0, a p, a 0 a 1 a p a p. We find the maximum entropy distribution subject to these conditions. The conditions are equivalent to E[t j (ϑ] = φ j, j = 1,..., p where t j (ϑ = I[a j 1 < ϑ a j ]. The posterior density of ϑ is p π(θ exp λ j I[a j 1 < θ a j ], a 0 θ a p j=1 where λ 1,..., λ p are determined by the conditions. π(θ is a histogram, with intervals (a 0, a 1 ], (a 1, a ],..., (a p 1, a p ]. Jonathan Marchini (University of Oxford BSa MT / 8 Hierarchical models These are models where the prior has parameters which again have a probability distribution. Data x have a density f (x; θ. The prior distribution of θ is π(θ; ψ. ψ has a prior distribution g(ψ, for ψ Ψ. We can work with posterior π(θ, ψ x f (x; θπ(θ; ψg(ψ, or π(θ x = π(θ, ψ xdψ f (x; θ π(θ; ψg(ψdψ Ψ Ψ f (x; θπ(θ The hierarchical model simply specifies the prior for θ indirectly π(θ = π(θ; ψg(ψdψ. Example For i = 1,,..., k we make n i observations X i,1, X i,,..., X i,ni on population i, with X ij N(θ i, σ. The θ i are the unknown means for observations on the i th population but σ is known. Suppose the prior model for the θ i is iid normal, θ i N(φ, τ. X 1,1,, X 1,n1 ~ N θ 1,σ X,1,, X,n ~ N θ,σ ( θ 1 ( θ.. N φ,τ.. X k,1,, X k,nk ~ N θ k,σ ( θ k (! Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

30 Example If ψ = (φ, τ π(θ 1,..., θ k ; ψ = k { (πτ 1/ exp 1 } τ (θ i φ, Now we need a prior for φ and τ. Suppose we take g(φ, τ = constant, so all possible φ, τ equally likely a priori. careful!. The joint posterior of the parameters is π(θ, ψ x f (x; θπ(θ ψg(ψ Example Integration with respect to ψ gives the posterior density of θ. k π(θ, φ, τ x exp 1 n i σ (x ij θ i j=1 [ k { τ 1 exp 1 } ] τ (θ i φ Integrate out wrt φ and τ to obtain the marginal posterior distribution of θ. Exercise Integrating the last factor wrt φ gives a term proportional to { τ 1 k exp 1 } (θi τ θ Exercise Then the integral wrt τ gives a term proportional to [ (θi θ ] 1 k/ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example Thus the posterior distribution of θ is k π(θ x exp 1 n i [ σ (x ij θ i (θi θ ] 1 k/ j=1 [ k { exp 1 } ] [ σ n i(θ i x i (θi θ ] 1 k/ where x i = j x ij/n i. Let θ j be the MAP estimate for θ j (posterior mode and put θ = θj /k. Exercise Differentiate π(θ x wrt θ j and set to equal zero to get θ j = (ω j x j + ν θ /(ω j + ν, [ where ω j = (σ /n j 1 and ν = ( θi θ ] 1 /(k. If the θ j were unrelated then ˆθ j = x j. The model modifies the estimate by pulling it towards the mean of the estimated θ i s. Jonathan Marchini (University of Oxford BSa MT / 8 Uniform priors If the prior is uniform (and possibly improper itself but the posterior is proper, and it can be shown that all parameter values being equally probable a priori is a fair representation of prior knowledge, then the uniform prior may be useful. In the hierarchical model in the previous example the prior on ϕ, τ was improper, but the posterior is proper. Jonathan Marchini (University of Oxford BSa MT / 8

31 Priors for Exponential Families Conjugate prior for an exponential family k n f (x θ = exp A j (θ B j (x i + j=1 n C(x i + nd(θ The prior distribution based on sufficient statistics k π(θ exp{τ 0 D(θ + A j (θτ j } (where (τ 0,..., τ k are constant prior parameters is conjugate. j=1 Priors for Exponential Families The posterior density is proportional to f (x θπ(θ τ 0,..., τ k [ k n ] exp A j (θ B j (x i + τ j + (n + τ 0 D(θ j=1 This is an updated form of the prior with n B j (x = B j (x i + τ j n = n + τ 0 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 f Predictive distributions X 1,..., X n are observations from f (x; θ and the predictive distribution of a further observation X n+1 is required. If x = (x 1,..., x n then the posterior predictive distribution is g(x n+1 x = f (x n+1 ; θπ(θ xdθ Predictive distributions are useful for...prediction. They are used also for model checking. Divide the data in two groups, Y = (X 1,..., X a and Z = (X a+1,..., X n. If we fit using Y and check that the reserved data Z overlap g(x n+1 x in distribution. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

32 Example Data X 1,..., X n are iid N(θ, σ with σ known and prior θ N(µ 0, σ 0. Predict X n+1. We saw that π(θ x = N(θ; µ 1, σ 1 with µ 1 and σ 1 calculated in example 3 above. In order to calculate the posterior predictive density for X n+1 we need to evaluate g(x n+1 x = 1 (x θ e πσ πσ1 σ 1 We could complete the square to solve this. e (θ µ 1 σ 1 dθ Example Alternatively, think how X n+1 is built up. We have θ X N(µ 1, σ 1 and X n+1 θ N(θ, σ. If Y, Z N(0, 1 then θ = µ 1 + σ 1 Z (posterior X n+1 = θ + σy (observation model = µ 1 + σ 1 Z + σy. It follows that X n+1 N(µ 1, σ + σ1 is the posterior predictive density for X n+1 X 1,..., X n. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Hypothesis Testing I Lecture 8 : Bayesian Hypothesis Testing In Bayesian inference, hypotheses are represented by prior distributions. There is nothing special about H 0, and H 0 and H 1 need not be nested. Let π(θ H 0, θ Θ 0 be the prior distribution of θ under hypothesis H 0. Here π(θ H 0 is a pmf/pfd as θ H 0 is continuous/discrete. Composite H 0 : θ Θ 0, with Θ 0 Θ, π(θ H 0 = π(θ π(θ Θ 0 I(θ Θ 0 If Θ 0 has more than one element then H 0 is a composite hypothesis. Simple H 0 : θ = θ 0, so that π(θ 0 H 0 = 1. This is a simple hypothesis. However, since any statement about the form of the prior amounts to a hypothesis about θ, we are not restricted to statements about set membership (not just simple and composite. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

33 Example 1 In a quality inspection program components are selected at random from a batch and tested. Let θ denote the failure probability. Suppose that we want to test for H 0 : θ 0. against H 1 : θ > 0. and that the prior is θ Beta(, 5 so that π(θ = 30θ(1 θ 4, 0 < θ < 1. Now if π(h 0 = π(θ Θ 0 then π(h 0 = θ(1 θ 4 dθ so that π(h and π(h so and π(θ H 0 = π(θ H 1 = 30θ(1 θ4, 0 < θ 0. π(h 0 30θ(1 θ4, 0. < θ < 1 π(h 1 Marginal Likelihood P(x H 0 is called the marginal likelihood (for hypothesis H 0. We can think of a parameter H {H 0, H 1 }, with likelihood P(x H, prior P(H and posterior P(H x. By the partition theorem for probability (π(θ H 0 a pdf, say P(x H 0 = P(x θ, H 0 P(θ H 0 dθ Θ 0 = L(θ; xπ(θ H 0 dθ, Θ 0 since, given θ, x is determined by the observation model, and independent of the process (H 0 that generated θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 In the discrete case P(x H 0 = θ Θ 0 L(θ; xπ(θ H 0. In the special case that H 0 is a simple hypothesis Θ 0 = {θ 0 }, π(θ 0 H 0 = 1, and P(x H 0 = L(θ 0 ; x. Jonathan Marchini (University of Oxford BSa MT / 8 Example 1 (cont In the quality inspection program suppose n components are selected for independent testing. The number X that fail is X Binomial(n, θ. Recall H 0 : θ 0. with θ Beta(, 5 in the prior. The marginal likelihood for H 0 is P(x H 0 = L(θ; xπ(θ H 0 dθ Θ 0 ( 5 0. = θ x n x 30θ(1 θ4 (1 θ dθ x π(h 0 For one batch of size n = 5, X = 0 is observed. Recall that π(h Then P(x H 0 = ( θ(1 θ 9 dθ 0 0 π(h /0.345 = Jonathan Marchini (University of Oxford BSa MT / 8

34 Notice that Similarly, for H 1 : θ > 0. P(x H 1 = ( θ(1 θ 9 dθ 0 0. π(h P(x H 0 = E(L(ϑ; x H 0, that is, the marginal likelihood is the average likelihood given the prior π(θ H 0 the marginal likelihood is the normalizing constant we often leave off when we write the posterior π(θ x, H 0 = L(θ; xπ(x H 0, P(x H 0 posterior = likelihood prior marginal likelihood, Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Prior and Posterior Probabilities for Hypotheses We have a posterior probability for H 0 itself. This is actually where we started with Bayesian inference. In the simple case where we have two hypotheses H 0, H 1, exactly one of which is true, where P(H 0 x = P(H 0P(x H 0, P(x P(x = P(H 0 P(x H 0 + P(H 1 P(x H 1 so that P(H 0 x + P(H 1 x = 1. When we estimate the value of a discrete parameter H {H 0, H 1 }, we are making a Bayesian hypothesis test. Example 1 (cont X Binomial(5, θ with θ Beta(, 5 in the prior and H 0 : θ 0. and H 1 : θ > 0.. The posterior probability for H 0 given we observe X = 0 is P(H 0 x = P(x H 0P(H 0 P(x P(H 0 = P(θ Θ 0 (P(θ Θ 0 + P(θ Θ 1 = π(h 0 P(x H 0 π(h P(x H 1 π(h P(x P(x H 0 P(H 0 + P(x H 1 P(H P(H 0 x 0.185/0.73 = P(H 1 x 0.3 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

35 Hypothesis Testing II, Bayes factors Suppose we have two hypotheses H 0, H 1, exactly one of which is true. Data x. The Prior Odds Q = P[H 0] P[H 1 ] These are prior odds since P[H 1 ] = 1 P[H 0 ]. Here H 0 is Q times more probable than H 1, given the prior model. The Posterior Odds Q = P[H 0 x] P[H 1 x] are the posterior odds, so that H 0 is Q times more probable than H 1, given the data and prior model. Jonathan Marchini (University of Oxford BSa MT / 8 Hypothesis Testing II, Bayes factors The posterior odds for H 0 against H 1 can be written where Q is the prior odds and is the Bayes Factor. Q = P[H 0] P[H 1 ] P(x H 0 P(x H 1 = Q B B = P(x H 0 P(x H 1 The Bayes Factor is a criterion for model comparison since H 0 is B times more probable than H 1, given the data and a prior model which puts equal probability on H 0 and H 1. The Bayes factor tells us how the data shifts the strength of belief (measured as a probability in H 0 relative to H 1. Jonathan Marchini (University of Oxford BSa MT / 8 Example 1 (cont X Binomial(5, θ with θ Beta(, 5 in the prior and H 0 : θ 0. and H 1 : θ > 0.. The prior odds are The posterior odds are Q = P(H 0 /P(H /( Q = P(H 0 x/p(h 1 x 0.678/( The Bayes factor comparing H 0 and H 1 is B = P(x H 0 P(x H /0.134 = 4 Jonathan Marchini (University of Oxford BSa MT / 8 Explicitly, from the beginning, Θ B = 0 L(x; θπ(θ H 0 dθ Θ 1 L(x; θπ(θ H 1 dθ = = Θ 0 L(x; θπ(θdθ Θ 1 L(x; θπ(θdθ π(h 1 π(h 0 ( θ(1 θ 9 1 dθ 0. ( 5 30θ(1 θ4 dθ θ(1 0. θ9 dθ 0 30θ(1 θ 4 dθ = / (Maple. This is positive evidence for θ 0.. Notice that the Bayes factor is more positive than the posterior odds, as the prior odds were weighted against H 0. Jonathan Marchini (University of Oxford BSa MT / 8

36 Interpreting Bayes Factors Adrian Raftery gives this table (values are approximate, and adapted from a table due to Jeffreys interpreting B. P(H 0 x B log(b evidence for H 0 < 0.5 < 1 < 0 negative (supports H to to 3 0 to barely worth mentioning 0.75 to to 1 to 5 positive 0.9 to to to 10 strong > 0.99 > 150 > 10 very strong I added the leftmost column (posterior for prior odds equal one. We sometimes report log(b because it is on the same scale as the familiar deviance and likelihood ratio test statistic. Simple-Simple hypothesis If both hypotheses are simple H 0 : θ = θ 0 ; H 1 : θ = θ 1, with priors P(H 0 and P(H 1 for the two hypotheses, the posterior probability for H 0 is P(H 0 x = P(x H 0P(H 0 P(x L(θ 0 ; xp(h 0 = L(θ 0 ; xp(h 0 + L(θ 1 ; xp(h 1, since P(x H 0 is just L(θ 0 ; x. The Bayes factor is then just likelihood ratio B = L(θ 0; x L(θ 1 ; x. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Simple-Composite hypothesis If one hypothesis is simple and the other composite, for example, H 0 : θ = θ 0 ; H 1 : π(θ H 1, θ Θ, with priors P(H 0 and P(H 1 for the two hypotheses, the Bayes factor is L(x; θ 0 B = Θ L(x; θπ(θ H 1dθ The denominator is just Θ L(x; θπ(θdθ when π(θ H 1 is a pdf. Exercise Show that the posterior probability for H 0 is L(θ 0 ; xp(h 0 P(H 0 x = P(H 0 L(θ 0 ; x + P(H 1 Θ L(x; θπ(θdθ when π(θ H 1 is a pdf, in this simple-composite comparison. Example X 1,..., X n are iid N(θ, σ, with σ known. H 0 : θ = 0, H 1 : θ H 1 N(µ, τ. Bayes factor is P 0 /P 1, where ( P 0 = (πσ n/ exp 1 i x σ ( P 1 = (πσ n/ exp 1 (xi σ θ (πτ 1/ (θ µ exp ( τ dθ. Completing the square in P 1 and integrating dθ, 1/ ( P 1 = (πσ n/ σ nτ + σ [ exp 1 { n nτ + σ ( x µ + 1 }] (xi σ x Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

37 So B = (1 + nτ 1/ [ σ exp 1 { }] n x σ n nτ ( x µ + σ Defining t = n x/σ, η = µ/τ, ρ = σ/(τ n, this can be written as B = ( / [ ρ exp 1 { }] (t ρη 1 + ρ η Lecture 9 : Decision theory This example illustrates a problem choosing the prior. If we take a diffuse prior, for ρ so that ρ 0, then B, giving overwelming support for H 0. This is an instance of Lindley s paradox. The point here is that B compares the models θ = θ 0 and θ π( H 1, not the sets θ 0 against θ \ {θ 0 }. If the π(θ H 1 -prior becomes very diffuse then the average likelihood (ie P(H 1 x, the marginal likelihood, which is the denominator of B goes to zero, while P(H 0 x = L(θ 0 ; x is fixed. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Decision theory Decision Theory Example You have been exposed to a deadly virus. About 1/3 of people who are exposed to the virus are infected by it, and all those infected by it die unless they receive a vaccine. By the time any symptoms of the virus show up, it is too late for the vaccine to work. You are offered a vaccine for 500. Do you take it or not? The most likely scenario is that you don t have the virus but basing a decision on this ignores the costs (or loss associated with the decisions. We would put a very high loss on dying! Decision Theory sits above Bayesian and classical statistical inference and gives us a basis to compare different approaches to statistical inference. We make decisions by applying rules to data. Decisions are subject to risk. A risk function specifies the expected loss which follows from the application of a given rule, and this is a basis for comparing rules. We may choose a rule to minimize the maximum risk, or we may choose a rule to minimize the average risk. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

38 Terminology θ is the true state of nature, X f (x; θ is the data. The Decision rule is δ. If X = x, adopt the action δ(x given by the rule. Example A single parameter θ is estimated from X = x by δ(x = θ(x. Example L S (θ, θ(x is the loss function which increases for θ(x being away from θ. Here are three common loss functions. Zero-One loss { L S (θ, θ(x 0 θ(x θ < b = a θ(x θ b where a, b are constants. The rule θ is the functional form of the estimator. The action is the value of the estimator. The Loss function L S (θ, δ(x measures the loss from action δ(x when θ holds. Absolute error loss where a > 0. Quadratic loss L S (θ, θ(x = a θ(x θ L S (θ, θ(x = ( θ(x θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example Risk function - b a θ b Zero- One Loss Definition The risk function R(θ, δ is defined as R(θ, δ = L S (θ, δ(xf (x; θdx, Absolute error loss This is the expected value of the loss (aka expected loss. θ Example In the context of point estimation, with Quadratic Loss, the risk function is the mean square error, R(θ, θ = E[( θ(x θ ]. Quadra3c loss θ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

39 Admissible and inadmissible rules. Definition A procedure δ 1 is inadmissible if there exists another procedure δ such that R(θ, δ 1 R(θ, δ, for all θ Θ with R(θ, δ 1 > R(θ, δ for at least some θ. A procedure which is not inadmissible is admissible. Example Suppose X U(0, θ. Consider estimators of the form θ(x = ax (this is a family of decisions rules indexed by a. Show that a = 3/ is a necessary condition for the rule θ to be admissible for quadratic loss. R(θ, θ = and R is minimized at a = 3/. θ 0 (ax θ 1 θ dx = (a /3 a + 1θ This does not show θ(x = 3x/ is admissible here. Jonathan Marchini (University of Oxford BSa MT / 8 It only shows that all estimators with a 3/ are inadmissible. The estimator θ(x = 3x/ maybe inadmissible relative to other estimators not in this class. Jonathan Marchini (University of Oxford BSa MT / 8 Minimax rules Definition A rule δ is a minimax rule if max θ R(θ, δ max θ R(θ, δ for any other rule δ. It minimizes the maximum risk. Since minimax minimizes the maximum risk (ie, the loss averaged over all possible data X f the choice of rule is not influenced by the actual data X = x (though given the rule δ, the action δ(x is data-dependent. It makes sense when the maximum loss scenario must be avoided, but can can lead to poor performance on average. Bayes risk, Bayes rule, expected posterior loss Definition Suppose we have a prior probability π = π(θ for θ. Denote by r(π, δ = R(θ, δπ(θdθ the Bayes risk of rule δ. A Bayes rule is a rule that minimizes the Bayes risk. A Bayes rule is sometimes called a Bayes procedure. L(x; θπ(θ Let π(θ x = denote the posterior following from likelihood h(x L and prior π. Definition The expected posterior loss is defined as L S (θ, δ(xπ(θ xdθ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

40 Lemma A Bayes rule minimizes the expected posterior loss. Proof R(θ, δπ(θdθ = = = L S (θ, δ(xl(θ; xπ(θdx dθ L S (θ, δ(xπ(θ xh(xdx dθ ( h(x L S (θ, δ(xπ(θ xdθ dx That is for each x we choose δ(x to minimize the integral L S (θ, δ(xπ(θ xdθ Bayes rules for Point estimation Zero-One loss where a, b are constants. We need to minimize L S (θ, θ(x = π(θ xl S (θ, θdθ = a { 0 θ(x θ < b a 1 That is we want to maximize θ+b θ b θ+b θ+b θ(x θ b θ b π(θ xdθ + a θ b π(θ xdθ π(θ xdθ π(θ xdθ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Zero-one loss If π(θ x is unimodal the maximum is attained by choosing θ to be the mid-point of the interval of length b for which π(θ b has the same value at both ends. Absolute error loss Absolute error loss L S (θ, θ(x = a θ(x θ We need to minimise θ θ π(θ xdθ = θ ( θ θπ(θ xdθ + θ (θ θπ(θ xdθ. θˆ As b 0, θ the global mode of the posterior distribution. If π(θ x is unimodal and symmetric, the optimal θ is the median (equal to the mean and mode of the posterior distribution. Differentiate wrt θ and equate to zero. θ π(θ xdθ θ π(θ xdθ = 0 That is, the optimal θ is the median of the posterior distribution. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

41 Quadratic loss Summary Minimize E θ x [( θ θ ] = E θ x [( θ θ + θ θ ] where θ is the posterior mean of θ. = ( θ θ + (ˆθ θe θ x (θ θ + E θ x [(θ θ ] Note ˆθ and θ are constants in the posterior distribution of θ so that (ˆθ θe(θ θ = 0. So E θ x [( θ θ ] = ( θ θ + V θ x (θ The form of the Bayes rule depends upon the loss function in the following way Zero-one loss (as b 0 leads to the posterior mode. Absolute error loss leads to the posterior median. Quadratic loss leads to the posterior mean. Note These are not the only loss functions one could use in a given situation, and other loss functions will lead to different Bayes rules. The Quadratic loss function is minimized when θ = θ, the posterior mean. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example X Binomial (n, θ, and the prior π(θ is a Beta (α, β distribution. E(θ = α α + β The distribution is unimodal if α, β > 1 with mode (α 1/(α + β. The posterior distribution of θ x is Beta (α + x, β + n x. Lecture 10 : Finding minimax rules. Hypothesis testing with loss functions. With zero-one loss and b 0 the Bayes estimator is (α + x 1/(α + β + n. For a quadratic loss function, the Bayes estimator is (α + x/(α + β + n. For an absolute error loss function is the median of the posterior. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

42 Minimax rules Definition A rule δ is a minimax rule if max θ R(θ, δ max θ R(θ, δ for any other rule δ. It minimizes the maximum risk. Sometimes this doesn t produce a sensible choice of decision rule. Risk function d d 1 Risk function Jonathan Marchini (University of Oxford BSa MT / 8 d d 1 Finding minimax rules Theorem 1 If δ is a Bayes rule for prior π, with r(π, δ = C, and δ 0 is a rule for which max θ R(θ, δ 0 = C, then δ 0 is minimax. Proof If for some other rule δ, max θ R(θ, δ = C ɛ for some ɛ > 0 (so δ 0 is not minimax, then r(π, δ = R(θ, δ π(θdθ (C ɛπ(θdθ = (C ɛ < r(π, δ so δ is not the Bayes rule for π, a contradiction. [This is an informal treatment which assumes the min and max exist - see Y&S Ch Sec.6] Jonathan Marchini (University of Oxford BSa MT / 8 Finding minimax rules Finding minimax rules Theorem If δ is a Bayes rule for prior π with the property that R(θ, δ does not depend on θ, then δ is minimax. Proof (Y&S Ch Let R(θ, δ = C (no θ dependence. This implies r(π, δ = R(θ, δπ(θdθ = C. If for some other rule δ, max θ R(θ, δ = C ɛ for some ɛ > 0 (so δ 0 is not minimax, then we have r(π, δ C ɛ. But r(π, δ = C so δ is not the Bayes rule for π, a contradiction. Theorem tells us that The Bayes estimator with constant risk is minimax. This result is useful, as it gives an approach to finding minimax rules. Bayes rules are sometimes easy to compute, so if we find a prior that yields a Bayes rule with constant risk for all θ we have the minimax rule. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

43 Example : minimax estimator for quadratic loss X is Binomial (n, θ, and the prior π(θ is a Beta (α, β distribution. For a quadratic loss function, the Bayes estimator is (α + X/(α + β + n The risk function is E X θ [( θ θ ] = MSE( θ = [Bias( θ] + Var[ θ] [ ( ] α + X = θ E + Var α + β + n [ ( ] α + nθ = θ + α + β + n = [θ(α + β α] + nθ(1 θ [α + β + n] [ α + X ] α + β + n nθ(1 θ (α + β + n The Bayes estimator with constant risk is minimax. This occurs when α = β = n/, so the minimax estimator using quadratic loss is (α + x/(α + β + n = (x + n//(n + n. Jonathan Marchini (University of Oxford BSa MT / 8 Hypothesis testing with loss functions Suppose X i f (x; θ iid for i = 1,,..., n and we wish to test H 0 : θ = θ 0 against H 1 : θ = θ 1. The decision rule is δ C when we use a critical region C so δ C (x = H 1 if x C and otherwise δ C (x = H 0. Loss function: a θ = θ 0, x C 0 θ = θ L S (θ, δ C (x = 0, x C b θ = θ 1, x C 0 θ = θ 1, x C so the loss for Type I error is a (H 0 holds and we accept H 1 and the loss for Type II error is b (H 1 holds and we accept H 0. Jonathan Marchini (University of Oxford BSa MT / 8 Hypothesis testing with loss functions The risk function for the rule δ C is R(θ 0 ; δ C = L S (θ 0, δ C (xf (x; θ 0 dx = ai(x Cf (x; θ 0 dx = aα as α = P(X C H 0 is the probability for Type I error. Similarly R(θ 1 ; δ C = bβ, as β = P(X C H 1 is the probability for a Type II error. Note α and β are functions of the critical region C. Hypothesis testing with loss functions To calculate the Bayes risk we need a prior. Let π(θ 0 = p 0 and π(θ 1 = p 1 be the prior probabilities that H 0 and H 1 hold. The Bayes risk is r(π, δ C = θ {θ 0,θ 1 } R(θ; δ C π(θ = p 0 aα(c + p 1 bβ(c. The Bayes test chooses the critical region C to minimize the Bayes risk. Question How do we do this? Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

44 Hypothesis testing with loss functions To answer this question, let s revise what we know about Frequentist hypothesis testing. Neyman-Pearson lemma The best test of size α of H 0 vs H 1 is a likelihood ratio test with critical region { C = x; L(θ } 1; x L(θ 0 ; x A for some constant A > 0 chosen so that P(X C H 0 = α. By best test we mean the test with the highest power, where Power = 1 - Type II error. Bayes tests The following theorem tells us how to choose the critical region to minimize the Bayes risk, in the same way that the Neyman-Pearson lemma tells us how to maximize the power at fixed size. Theorem The critical region for the Bayes test is the critical region for a LR test with A = p 0a p 1 b Every LR test is a Bayes test for some p 0, p 1. In frequentist statistics we fix the Type I error (α and this determines the value of A, thus defining the critical region for the test. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Proof The Bayes test minimizes p 0 aα + p 1 bβ = p 0 ap(x C H 0 + p 1 bp(x C H 1 = p 0 a L(θ 0 ; xdx + p 1 b L(θ 1 ; xdx C C [ ] = p 0 a L(θ 0 ; xdx + p 1 b 1 L(θ 1 ; xdx C C [ ] = p 1 b + p 0 al(θ 0 ; x p 1 bl(θ 1 ; x dx C So choose C such that x C iff p 0 al(θ 0 ; x p 1 bl(θ 1 ; x 0, i.e. when L(θ 1 ; x L(θ 0 ; x p 0a p 1 b Example Let X 1,..., X n be N(µ, σ with σ known, and we want to test H 0 : µ = µ 0 vs H 1 : µ = µ 1, with µ 1 > µ 0. The critical region for the LR test L(θ 1 ; x L(θ 0 ; x A becomes x σ log(a n(µ 1 µ (µ 0 + µ 1 = B( say In the classical case we ignore the exact value of B, but in a Bayes test A = p 0 a/(p 1 b and we substitute into B. As an example take µ 0 = 0, µ 1 = 1, σ = 1, n = 4, a =, b = 1, p 0 = 1/4, p 1 = 3/4 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

45 Then the Bayes test has critical region x 1 4 log( = 1 4 log( = For this test (using the fact that X is N(µ, 1/4 α = P( X µ = 0, σ /n = 1/4 = 0.1 Lecture 11 : Stein s paradox and the James-Stein estimator. Empirical Bayes and β = P( X < µ = 1, σ /n = 1/4 = In a classical approach fixing α = 0.05, B = /4 = 0.8, so β = P( X < 0.8 µ = 1, σ = 1/4 = In the Bayes test α has been increased and β decreased. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Stein s paradox and the James-Stein Estimator Stein s paradox has been described (Efron, 199 as the most striking theorem of post-war mathematical statistics! Theorem Let X i N(µ i, 1, i = 1,,..., p be jointly independent so we have one data point for each of the p µ i -parameters. Let X = (X 1,..., X p and µ = (µ 1,..., µ p. The MLE ˆµ for µ is ˆµ MLE = X. This estimator is inadmissable for quadratic loss i.e. Risk = MSE. The result is surprising since the estimator X is the MLE and also MVUE. An estimator with lower risk is given by the James-Stein estimator ( ˆµ JSE = 1 a i X i X Jonathan Marchini (University of Oxford BSa MT / 8 Implications of Stein s Paradox Suppose we are interested in estimating 1 the weight of a randomly chosen loaf of bread from a supermarket. the height of a random chosen blade of grass from a garden. 3 the speed of a randomly chosen car as it passes a speed camera. These are totally unrelated quantities. It seem implausible that by combining information across the data points that we might end up with a better way of estimating the vector of three parameters. The James-Stein estimator tells us that we can get a better estimate (on average for the vector of three parameters by simultaneously using the three unrelated measurements. Jonathan Marchini (University of Oxford BSa MT / 8

46 Proof Consider the alternative estimator ( ˆµ JSE = 1 a i X i X (the James-Stein estimator Note This estimator shrinks X towards 0 (when i X i > a. We will show that if a = p then R(µ, ˆµ JSE < R(µ, ˆµ MLE for every µ R n, so that the MLE is inadmissable in this case. First, the risk for ˆµ MLE is R(µ, ˆµ MLE = p E( µ i ˆµ MLE,i = recognizing Var(X i = 1. p E( µ i X i = p Jonathan Marchini (University of Oxford BSa MT / 8 Stein s Lemma In order to calculate R(µ, ˆµ JSE, it is convenient to use Stein s Lemma, for Normal RV, ( h(x E((X i µh(x = E. X i This can be shown by integrating by parts. Noting (x i µe (x i µ i / dx = e (x i µ i / we have (x i µ i h(xe (x i µ i / dx i = h(xe (x i µ i / + x i = x i = h(x e (x i µ i / dx i x i The first term is zero if h(x (for eg is bounded, giving the lemma. Jonathan Marchini (University of Oxford BSa MT / 8 Proof (continued R(µ, ˆµ JSE = ( p E( µ i ˆµ i with ˆµ i = ( E( µ i ˆµ i = E( µ i X i (X i µ i X i ae j X j ( ( (X i µ i X i E j X j = E = E ( +a E X i ( j X X i X i j X j ( j X j Xi ( j X j j Stein s lemma 1 a ( 1 = E j X j i X i X i X i ( j X j Jonathan Marchini (University of Oxford BSa MT / 8 Putting the pieces together, ( p E( µ i ˆµ i 1 = R(µ, ˆµ MLE (ap 4aE = p (a(p a E ( 1 j X j j X j and this is less than p if ap 4a a > 0 and in particular at a = p, which minimizes the risk over a R. ( + a 1 E j X j Note We have not shown that the James Stein estimator is admissible. Further reading Young and Smith sec 3.4 which covers the James-Stein estimator is worth reading. It includes a nice example (sec on the application of the estimator to estimation of baseball home run rates. Jonathan Marchini (University of Oxford BSa MT / 8

47 The risk of the James-Stein estimator Remember R(µ, ˆµ MLE = p. When a = p If µ i = 0 X i N(0, 1 ( j X j χ p E R(µ, ˆµ JSE =. 1 j X j = 1/(p If µ i = λ X i = λ + Z i where Z i N(0, 1 and j X j λ + χ ( p E 0 as λ so R(µ, ˆµ JSE p. 1 j X j So( we get a smaller difference between R(µ, ˆµ MLE and R(µ, ˆµ JSE as E gets smaller. 1 j X j Exercise A better estimator is X1 ( p + 1 a V (X X1 p where V = p j 1 (X j X and 1 p is p-dimensional vector of 1 s. Empirical Bayes Bayes estimators have good risk properties (for example, the posterior mean is usually admissible for quadratic loss. However, Bayes estimators may be hard to compute (for example, the posterior mean is an integral, or sum, particularly for hierarchical models. In Empirical Bayes, we use Bayesian reasoning to find estimators which can then be used in classical frequentist. EB uses a particular strategy to simplify hierarchical models. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Empirical Bayes Recall the setup for Bayesian inference for hierarchical models. X f (x; θ θ π(θ; ψ ψ g(ψ Empirical Bayes The EB trick is to avoid doing ψ-integrals by replacing ψ with an estimate ˆψ, derived from the data, and consider the model X f (x; θ θ π(θ; ˆψ Our prior for θ has a parameter ψ which also has a prior. The posterior is π(θ, ψ x L(θ; xπ(θ; ψg(ψ If we want minimim risk for quadratic loss (for eg we should use ˆθ = θπ(θ, ψ xdθdψ This EB approximation to the full posterior chops off a layer of the hierarchy. The reduced model has posterior ˆπ(θ x L(θ; xπ(θ; ˆψ, and a Bayes estimator ˆθ EB is calculated using ˆπ(θ x. For example, for quadratic loss, ˆθ = θˆπ(θ xdθ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

48 Empirical Bayes Example The James-Stein estimator is an EB-estimator We still need an estimator for ψ. There are several choices. We can use the MLE ˆψ = arg max ψ p(x ψ for ψ in the marginal likelihood p(x ψ = L(θ; xπ(θ; ψdθ. Method of moments estimators are used also. Data x i N(θ i, 1, i = 1,..., p (so one observation x i for each parameter θ i. The MLE for θ i is simply ˆθ MLE,i = x i. Construct an EB estimator for quadratic loss. Suppose the prior is θ i N(0, τ (we have some freedom here, as we are mainly interested in the risk-related properties of the final estimator. If we knew τ we have (completing the square - see Lecture 7 p13 θ i (x i, τ N ( xi τ 1 + τ, τ 1 + τ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example To get an estimate for τ we compute the marginal distribution for X i given τ, which is X i N(0, τ + 1. The MLE for τ is then ˆτ = 1 p i X i 1, and this gives ˆθ EB,i = X i ˆτ which is the James-Stein estimator ( = ˆθ JS,i = ( 1 + ˆτ 1 p i X i 1 a i X i with a = p. This isn t the minimum risk JS estimator for quadratic loss, (that is a = p but it already beats the MLE for all θ. (to get the best JS estimator (with a = p use a method of moments estimator for τ. See Young and Smith Section 3.5 Jonathan Marchini (University of Oxford BSa MT / 8 X i X i Example Data x i Poisson(θ i, i = 1,..., n (so one observation x i for each parameter θ i. The MLE for θ i is simply ˆθ MLE,i = x i. Construct an EB estimator for quadratic loss. Suppose the prior for θ i s is iid Exponential(λ i.e. π(θ i λ = λe λθ i. p(x i λ = = 0 ( λ e θ i θ x i i x i! λe λθ i dθ i xi λ 1 + λ given λ the x i s are iid geometric (λ/(1 + λ. The MLE of λ based on x 1,..., x n is λ = 1/ x, where x = 1 n n 1 x i. Jonathan Marchini (University of Oxford BSa MT / 8

49 Now, under the EB simplification, set λ = ˆλ, so that ˆπ(θ x L(θ; xπ(θ ˆλ = i e θ i θ x i i ˆλe ˆλθ i and we recognise θ i x Γ(x i + 1, ˆλ + 1 in this EB approximation. This leads to an estimator ˆθ EB,i = θ i ˆπ(θ i xdθ i Lecture 1 : Markov Chain Monte Carlo. The Metropolis algorithm. = x i + 1 ˆλ + 1 = x x i + 1 x + 1 We can rewrite this ˆθ EB,i = x i + x x + 1 ( x x i showing that this EB estimator shrinks the estimates towards the common mean. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Computationally intensive methods In Bayesian inference, the things we need to know are often given in terms of intractable integrals or sums. If x f (x; θ and we have a prior π(θ on θ Θ and we want to measure the evidence for θ S, S Θ then we might calculate P(θ S x = π(θ xdθ where π(θ x f (x; θπ(θ is the posterior density. The posterior mean is commonly the most useful point estimate. There we have to compute E θ x θ = θπ(θ xdθ. Θ In previous examples our choice of prior has kept calculations doable. Jonathan Marchini (University of Oxford BSa MT / 8 S Monte Carlo estimation P(θ S x = E θ x I(θ S and E θ x θ are both averages. If we have N iid samples θ (i π(θ x, i = 1,,..., N distributed according to the posterior density and we want to estimate a finite mean µ f = E θ x f (θ for some given function f, and we take then fn = 1 f (θ (i N fn P µf, that is, f N is consistent for estimation of µ f (in fact there is usually a CLT. But how do we generate the samples θ (i π(θ x, i = 1,,..., N? Jonathan Marchini (University of Oxford BSa MT / 8 i

50 Markov Chain Monte Carlo estimation The idea in MCMC is to simulate an ergodic Markov Chain θ (i i = 1,,..., N that has π(θ x as its stationary distribution. Remember a discrete time Markov chain is specified by it s transition function P(θ, θ = P(θ (t+1 = θ θ (t = θ Given θ (0 we use the transition function to draw θ (1, then θ ( given θ (1, then θ (3 given θ ( etc. So we obtain a sequence θ (0, θ (1, θ (, θ (3, θ (4, θ (5, θ (6, θ (7,... The sequence of simulated θ-values θ (i i = 1,,..., N is correlated. Note In what follows we will refer to π(θ x as the target, and we will drop the x while we are setting up the theory. Jonathan Marchini (University of Oxford BSa MT / 8 Markov Chain Monte Carlo estimation However, the ergodic theorem for Markov chains from Part A tells us that 1 N f (θ (i P µ f i so we can use the Markov Chain sequence to form estimates. The key here is that we can often write down an ergodic Markov chain that is easy to simulate, and has stationary distribution π(θ x. Question If we start from π(θ x how do we choose the transition function P(θ, θ? We will learn about two methods to do this 1 The Metropolis algorithm Gibbs sampling Jonathan Marchini (University of Oxford BSa MT / 8 Metropolis algorithm For target π(θ x. Suppose θ (t = θ. Simulate a value for θ (t+1 as follows. 1 Simulate a candidate state θ q(θ θ i.e. a simple random change to the state satisfying q(θ θ = q(θ θ With probability α(θ θ, set θ (t+1 = θ (accept and otherwise set θ (t+1 = θ (reject. Here { } α(θ θ = min 1, π(θ x π(θ x The overall transition probability from θ to θ is P(θ, θ = q(θ θα(θ θ Example In a population of m individuals N are red and m N are blue, with unknown true value N unknown. A sample of k individuals is taken, with replacement, and X are red. The prior on N is uniform 0 to m. The posterior for N = n is π(n x n x (m n k x, n {0, 1,,..., m} In this example we can work out posterior probabilities exactly. Let us use MCMC to estimate posterior probabilities, and check it matches the exact result (within Monte Carlo error. We need to write an MCMC algorithm simulating a Markov chain N (t, t = 0, 1,,... with the property that N (t π(n x. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

51 Example Metropolis algorithm targeting π(n x. See R code. Suppose N (t = n. Simulate a value for N (t+1 as follows. 1 Toss a coin and set n = n + 1 on a Head and otherwise n = n 1. So q(n + 1 n = q(n n + 1 = 1/. With probability { } α(n n = min 1, π(n x π(n x set N (t+1 = n and otherwise set N (t+1 = n (reject. Note that α = 0 if n = 1, m + 1 (since π(n x = 0 in those cases and otherwise α(n n = min {1, (n x (m n k x } n x (m n k x. Example Here is a plot of showing the results of running the Markov chains for 10,000 iterations. (m = 0, x = 3, k = 10 N t Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example Here is a plot showing the empirical density of the 10,000 samples, together with the true density in red. You can see that approximation is good. Density Histogram of N N Why does it work? Theorem If θ (t, t = 0, 1,,... is an irreducible, aperiodic Markov chain on a finite space Θ, satisfying detailed balance with respect to a target distribution π(θ on Θ, and f : Θ R is a function with finite mean and variance in π, then 1 f (θ (i P E π f (θ N i [This is essentially a watered down version of the Ergodic Theorem of Part A, with positive recurrence replaced by detailed balance] Suppose Θ is finite. Let P (n (θ, θ = P(θ (t+n = θ θ (t = θ give the n-step transition probability in a homogeneous Markov chain with transition probability P(θ, θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

52 Markov Chains Irreducible θ, θ Θ, n such that P (n (θ, θ > 0, Aperiodic Detailed balance θ Θ, gcd{n : P (n (θ, θ > 0} = 1 π(θp(θ, θ = π(θ P(θ, θ If detailed balance is satisfied the it is easy to show that the chain is a stationary distribution π(θp(θ, θ = π(θ. θ Θ Jonathan Marchini (University of Oxford BSa MT / 8 Metropolis algorithm Let us check DB. Assume WLOG π(θ π(θ. P(θ, θ π(θ = q(θ θα(θ θπ(θ = q(θ θ min { } 1, π(θ π(θ π(θ = q(θ θπ(θ { = q(θ θ min 1, π(θ } π(θ π(θ = P(θ, θπ(θ, { } { since min 1, π(θ = π(θ and min 1, π(θ } π(θ π(θ = 1. We have shown that the transition probability for the Metropolis update has π(θ as a stationary distribution. If the chain is also irreducible and aperiodic, it will generate the samples we need. These last two conditions have to be checked separately for each application (but are often obvious. Jonathan Marchini (University of Oxford BSa MT / 8 Example In Q4 of PS3, we wrote down the posterior distribution for a hierarchical model, but were unable to do anything with it, as the integrals were intractable. The observation model is X i B(n i, p i, i = 1,,..., k, the prior π(p i a, b is Beta(a, b and the prior on a, b is the improper prior π(a, b 1/ab on a, b > 0. The posterior density is π(p, a, b x 1 ab k i (1 p i n i x i p a 1 i (1 p i b 1, B(a, b p x i with B(a, b = Γ(aΓ(b Γ(a + b. Note we used an improper prior - we need to check this is a proper posterior distribution. Example Use Monte Carlo to sample (p (t, a (t, b (t π(p, a, b x, t = 1,,..., T and use the samples to estimate any posterior expectations of interest. For example, if we want to estimate a using the posterior mean â = 1 a (t. T or the posterior probability that p 1 > c, π(p 1 > c x = 1 T t t I(p (t 1 > c. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

53 Example Metropolis algorithm targeting π(p, a, b x. See R code. Suppose θ (t = θ = (p, a, b. Simulate θ (t+1 as follows. 1 Pick a component of θ, one of (p 1,,..., p k, a or b and propose an updated value. i If we choose p i, propose p i U(0, 1. ii If we choose a propose a U(a w, a + w iii If we choose b propose b U(b w, b + w with w > 0 a fixed number. This generates a new state θ with one component changed. Clearly, q(θ θ = q(θ θ. Adjustment to avoid negative a and b. With probability α(θ θ set θ (t+1 = θ and otherwise set θ (t+1 = θ (reject. Example Exercise Verify the following formulae. If we update a p i (so θ = ((p 1,..., p i 1, p i, p i+1,..., p k, a, b { } α(θ θ = min 1, (p i a+x i 1 (1 p i b+n i x i 1 (p i a+x i 1 (1 p i b+n i x i 1 whilst if we update a (so that θ = (p, a, b { } α(θ θ = min 1, (p 1p...p k a B(a, b k a (p 1 p...p k a B(a, b k a and finally b (where θ = (p, a, b { α(θ θ = min 1, [(1 p 1(1 p...(1 p k ] b B(a, b k [(1 p 1 (1 p...(1 p k ] b B(a, b k b b } Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 How long to run the MCMC? Here are 10,000 samples from Example 1 with m = 00, x =, k = 30. Lecture 13 : MCMC convergence. The Metropolis-Hastings algorithm. The Gibbs sampler. N e+00 e+04 4e+04 6e+04 8e+04 1e+05 t Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

54 Convergence and burn-in iterations The Markov Chain theory that underlies MCMC tells us about the long run properties of the Markov Chain we have constructed i.e. that the chain will converge to it s stationary distribution. Often we will throw away some of the samples at the beginning of the chain. These samples are known as the burn-in iterations. The number of burn-in iterations has to be decided upon by the user, but plots of the Markov chain can often help. Plotting the Markov chain allows us to visually observe when the chain looks to have converged. A better alternative is to run multiple chains from a diverse set of starting points. If all the chains produce the same answer then this can be good evidence that the chains ave converged. Jonathan Marchini (University of Oxford BSa MT / 8 Example Consider a Metropolis algorithm for simulating samples from a bivariate Normal distribution with mean µ = (0, 0 and covariance ( Σ = using a proposal distribution for each component x i x i U(x w, x + w i {1, } The performance of the algorithm can be controlled by setting w. If w is small then we propose only small moves and the chain will move slowly around the parameter space. If w is large then it is large then we may only accept a few moves. There is an art to implementing MCMC that involves choice of good proposal distributions. Jonathan Marchini (University of Oxford BSa MT / 8 Example N = 4 w = 0.1 T = 100 Example N = 4 w = 0.1 T = Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

55 Example Example N = 4 w = 0.01 T = N = 4 w = 0.1 T = Jonathan Marchini (University of Oxford 0 BSa 4 MT / 8 Example Jonathan Marchini (University of Oxford 0 BSa 4 MT / 8 Example N = 4 w = 10 T = N = 4 w = 0.01 T = 1e+05 4 Jonathan Marchini (University of Oxford 0 BSa 4 MT / 8 Jonathan Marchini (University of Oxford 0 BSa 4 MT / 8

56 The Metropolis-Hastings algorithm The Metropolis-Hastings (MH algorithm generalises the Metropolis algorithm in two ways 1 The proposal distribution does not need to be symmetric i.e. we can have q(θ θ q(θ θ To correct for any asymmetry in the jumping rule the acceptance ratio becomes α(θ θ = min { 1, π(θ x/q(θ } θ π(θ x/q(θ θ { = min 1, π(θ x π(θ x q(θ θ } q(θ θ Choosing a proposal distribution A good proposal distribution has the following properties 1 For any θ, it is easy to sample from q(θ θ. It is easy to compute the acceptance ratio. 3 Each proposal goes a reasonable distance in the parameter space (otherwise the chain moves too slowly. 4 The proposals are not rejected too often (otherwise the chain spends too much time at the same positions. Allowing asymmetric proposals can help the chain to converge faster. Exercise Prove detailed balance for the MH algorithm. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 The Gibbs sampler A Gibbs sampler is a particular type of MCMC algorithm that has been found to be very useful in high dimensional problems. Suppose we wish to sample from π(θ where θ = (θ 1,..., θ d. Each iteration of the Gibbs sampler occurs as follows 1 An ordering of the d components of θ is chosen. For each component in this ordering, θ j say, we draw a new value is sampled from the full conditional distribution given all the other components of θ. θ t j π(θ j θ t 1 j where θ t 1 j represents all the components of θ, except the for θ j, at their current values Example Consider a single observation y = (y 1, y from a bivariate Normal distribution ( with unknown mean θ = (θ 1, θ and known covariance 1 ρ Σ =. With a uniform prior on θ, the posterior distribution is ρ 1 θ y N(y, Σ In this case the posterior is tractable but we will consider how to construct a Gibbs sampler. We need the two conditional distributions π(θ 1 θ, y and π(θ θ 1, y. θ t 1 j = (θ t 1,..., θt j 1, θt 1 j+1,..., θt 1 d. Jonathan Marchini (University of Oxford BSa MT 013 / 8 Jonathan Marchini (University of Oxford BSa MT / 8

57 Example Example π(θ 1 θ, y π(θ 1, θ y { Similarly, exp 1 (θ yt Σ 1 (θ y } { ( T ( ( } 1 θ1 y exp 1 1 ρ θ1 y 1 (1 ρ θ y ρ 1 θ y } {(θ 1 y 1 + ρ(θ 1 y 1 (θ y + (θ y = exp exp { } 1 (1 ρ (θ 1 (y 1 + ρ(θ y θ 1 θ, y N(y 1 + ρ(θ y, 1 ρ θ θ 1, y N(y + ρ(θ 1 y 1, 1 ρ N = 1 T = Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Deriving full conditional distributions The Gibbs sampler requires us to be able to obtain the full conditional distributions of all components of the parameter vector θ. To determine π(θ j θ t 1 j we write down the posterior distribution and consider it as a function of θ j. If we are lucky then we will be able to spot the distribution for θ j θ t 1 j (using conjugate priors often helps. If the full conditional is not of a nice form then this component could be sampled using rejection sampling or a MH update could be used instead of a Gibbs update. Jonathan Marchini (University of Oxford BSa MT / 8 Gibbs sampling and Metropolis-Hastings Gibbs sampling can be viewed as a special case of the MH algorithm with proposal distributions equal to the full conditionals. The MH acceptance ratio is min(1, r where r = π(θ /q(θ θ t 1 π(θ t 1 /q(θ t 1 θ We have θ t 1 = (θ 1,..., θ j,..., θ d = (θ j, θ t 1 j and we use proposal q(θ θ t 1 = π(θ j θt 1 j so that θ = (θ 1,..., θ j,..., θ d = (θ j, θt 1 j and. q(θ t 1 θ = π(θ j θ t 1 j Jonathan Marchini (University of Oxford BSa MT / 8

58 Gibbs sampling and Metropolis-Hastings Then the MH acceptance ratio is min(1, r where r = π(θ /q(θ j θt 1 π(θ t 1 /q(θ t 1 j θ = π(θ /π(θ j θt 1 j π(θ t 1 /π(θ j θ t 1 j But the numerator can be simplified π(θ π(θ j θt 1 j = π(θ j, θt 1 j j π(θ j θt 1 j = π(θt 1 Similarly the denominator can be simplified π(θ t 1 π(θ j θ t 1 j = π(θ j, θ t 1 j π(θ /π(θ j θt 1 j = π(θ t 1 j π(θ t 1 /π(θ j θ t 1 j π(θ t 1 j π(θ j θ t 1 j = π(θt 1 So we have r = = 1 j Thus, every proposed move is accepted. Jonathan Marchini (University of Oxford BSa MT / 8 Lecture 14 : Variational Bayes Jonathan Marchini (University of Oxford BSa MT / 8 A key quantity in Bayesian inference is the marginal likelihood of a set of data D given a model M P(D M = P(D θ, MP(θ Mdθ where θ are the parameters of the model and P(θ M is the prior on the parameters of the model. The integral required to calculate the marginal likelihood of a given model is not always tractable for some models, so an approximation can be desirable. Variational Bayes (VB is one possibility, that has become reasonably popular recently. Variational Bayes The idea of VB is to find an approximation Q(θ to a given posterior distribution P(θ D, M. That is Q(θ P(θ D, M where θ is the vector of parameters of the model M. We then use Q(θ to approximate the marginal likelihood. In fact, what we do is find a lower bound for the marginal likelihood. Question How to find a good approximate posterior Q(θ? Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

59 Kullback-Liebler (KL divergence The strategy we take is to find a distribution Q(θ that minimizes a measure of distance between Q(θ and the posterior P(θ D, M. Definition The Kullback-Leibler divergence KL(q p between two distributions q(x and p(x is [ q(x ] KL(q p = log q(xdx p(x N(µ, σ approximations to a Gamma(10,1 density µ = 10,σ = 4,KL = x density µ = 9.11,σ = 3.03,KL = x density x Exercise KL(q p 0 and KL(q p = 0 iff q = p. q(x p(x q(x * log(q(x/p(x Jonathan Marchini (University of Oxford BSa MT / 8 x density µ = 13,σ =.3,KL = x density µ = 9,σ =,KL = x Jonathan Marchini (University of Oxford BSa MT / 8 We consider the KL divergence betweem Q(θ and P(θ D, M [ Q(θ ] KL(Q(θ P(θ D, M = log Q(θdθ P(θ D, M [ Q(θP(D M ] = log Q(θdθ P(θ, D M [ p(θ, D M ] = log P(D M log Q(θdθ Q(θ The log marginal likelihood can then be written as log P(D M = F(Q(θ + KL(Q(θ P(θ D, M (1 ] Q(θdθ. where F(Q(θ = log [ P(θ,D M Q(θ Note Since KL(q p 0 we have that log P(D M F(Q(θ so that F(Q(θ is a lower bound on the log-marginal likelihood. Jonathan Marchini (University of Oxford BSa MT / 8 The mean field approximation We now need to ask what form that Q(θ should take? The most widely used approximation is known as the mean field approximation and assumes only that the approximate posterior has a factorized form Q(θ = Q(θ i i The VB algorithm iteratively maximises F(Q(θ with respect to the free distributions, Q(θ i, which is coordinate ascent in the function space of variational distributions. We refer to each Q(θ i as a VB component. We update each component Q(θ i in turn keeping Q(θ j j i fixed. Jonathan Marchini (University of Oxford BSa MT / 8

60 VB components Lemma The VB components take the form ( log Q(θ i = E Q(θ i log P(D, θ M + const Proof Writing Q(θ = Q(θ i Q(θ i where θ i = θ\θ i, the lower-bound can be re-written as [ P(θ, D M ] F(Q(θ = log Q(θdθ Q(θ [ P(θ, D M ] = log Q(θ i Q(θ i dθ i dθ i Q(θ i Q(θ i [ ] = Q(θ i log P(D, θ MQ(θ i dθ i dθ i Q(θ i Q(θ i log Q(θ i dθ i dθ i Q(θ i Q(θ i log Q(θ i dθ i dθ i F (Q(θ = = [ ] Q(θ i log P(D, θ MQ(θ i dθ i dθ i Q(θ i Q(θ i log Q(θ i dθ i dθ i Q(θ i Q(θ i log Q(θ i dθ i dθ i [ ] Q(θ i log P(D, θ MQ(θ i dθ i dθ i Q(θ i log Q(θ i dθ i Q(θ j log Q(θ j dθ j j i If we let Q (θ i = 1 Z exp [ log P(D, θ MQ(θ i dθ i ] where Z is a normalising constant and write H(Q(θ j = Q(θ j log Q(θ j dθ j as the entropy of Q(θ j then F (Q(θ = Q(θ i log Q (θ i Q(θ i dθ i + log Z + H(Q(θ j j i = KL(Q(θ i Q (θ i + log Z + j i H(Q(θ j Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 VB algorithm F(Q(θ = KL(Q(θ i Q (θ i + log Z + j i H(Q(θ j We then see that F(Q(θ is maximised when Q(θ i = Q (θ i as this choice minimises the Kullback-Liebler divergence term. Thus the update for Q(θ i is given by or [ ] Q(θ i exp log P(D, θ MQ(θ i dθ i log Q(θ i = E Q(θ i ( log P(D, θ M + const This implies a straightforward algorithm for variational inference: 1 Initialize all approximate posteriors Q(θ = Q(µQ(τ, e.g., by setting them to their priors. Cycle over the parameters, revising each given the current estimates of the others. 3 Loop until convergence. Convergence is checked by calculating the VB lower bound at each step i.e. [ P(θ, D M ] F(Q(θ = log Q(θdθ Q(θ The precise form of this term needs to be derived, and can be quite tricky. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

61 Example 1 Consider applying VB to the hierarchical model D i N(µ, τ 1 i = 1,..., P, µ N(m, (τβ 1 τ Γ(a, b Note We are using a prior of the form π(τ, µ = π(µ τπ(τ. Let θ = (µ, τ and assume Q(θ = Q(µQ(τ. We will use notation θ i = E Q(θ i θ i. log P(D, θ M = P log τ τ P (D i µ + 1 log τ τβ (µ m +(a 1 log τ bτ + K We can derive the VB updates one at a time. We start with Q(µ. Note We just need to focus on terms involving µ. ( log Q(µ = E Q(τ log P(D, θ M + C The log joint density is log P(D, θ M = P log τ τ P (D i µ + 1 log τ τβ +(a 1 log τ bτ + K (µ m = τ ( P (D i µ β(µ m + C where τ = E Q(τ (τ. We will be able to determine τ when we derive the other component of the approximate density, Q(τ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 We can see this log density has the form of a normal distribution where log Q(µ = τ Thus Q(µ = N(µ m, β 1. ( P (D i µ β(µ m + C = β (µ m β = (β + P τ m = β 1( βm + P D i log P(D, θ M = P log τ τ P (D i µ + 1 log τ τβ (µ m +(a 1 log τ bτ + K The second component of the VB approximation is derived as ( log Q(τ = E Q(µ log P(D, θ M + C = ((P + 1/ + a 1 log τ τ P (D i µ β(µ m τ + C Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

62 We can see this log density has the form of a gamma distribution log Q(τ = ((P + 1/ + a 1 log τ τ P (D i µ β(µ m τ which is Γ(a, b where + C a = a + (P + 1/ b = b + 1 ( P Di µ + P µ + β ( m µ + µ Jonathan Marchini (University of Oxford BSa MT / 8 So overall we have 1 Q(µ = N(µ m, β 1 where Q(τ = Γ(τ a, b where β = (β + P τ ( m = β 1( P βm + D i (3 a = a + (P + 1/ b = b + 1 ( P Di µ + P µ + β ( m µ + µ To calculate these we need τ = a b µ = m µ = β 1 + m Jonathan Marchini (University of Oxford BSa MT / 8 Example 1 For this model the exact posterior was calculate in Lecture 6. [ }] π(τ, µ D τ α 1 exp τ {b + β (m µ where a = a + P + 1 b = b + 1 (Di D + 1 β = β + P m = β 1 (βm + P D i Pβ P + β ( D m We can compare the true and VB posterior when applied to a real dataset. We see that VB approximations underestimate posterior variances. Density µ True posterior VB posterior Density True posterior VB posterior We note some similarity between the VB updates and the true posterior parameters. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

63 General comments The property of VB underestimating the variance in the posterior is a general feature of the method, when there exists correlation between the θ i s in the posterior, which is usually the case. This may not be important if the purpose of inference is model comparison i.e. comparing the approximate marginal likelihoods between models. VB is often much, much faster to implement than MCMC or other sampling based methods. The VB updates and lower bound can be tricky to derive, and sometime further approximation is needed. The VB algorithm will find a local mode of the posterior, so care should be taken when the posterior is thought/known to be multi-modal. Laplace s method Consider the integral e Nf (x dx Expanding f (x around a maximum point x 0 we have f (x f (x 0 + f (x 0 (x x f (x 0 (x x 0 = f (x 0 1 f (x 0 (x x 0 and we can approximate the integral as { e Nf (x dx e Nf (x 0 exp N f (x 0 (x x 0 } dx Comparing to a normal pdf we have e Nf (x dx e Nf (x 0 (π 1 N 1 f (x 0 1 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Laplace s method Laplace approximation For a d-dimensional function the analogue of this result is e Nf (x dx e Nf (x 0 (π d N d f (x 0 1 where f (x 0 is the determinant of the Hessian of the function evaluated at x 0. We apply Laplace s method to the integral that is the marginal likelihood in Bayesian inference, as follows P(θ D, M P(D θ, MP(θ Mdθ = { 1 exp N( N log P(D θ, M + 1 } N log P(θ M dθ here we have f (x = 1 N log P(D θ, M + 1 log P(θ M N Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

64 Laplace approximation We let θ be the (vector value of θ that maximizes the posterior θ = arg max P(θ D, M θ This is known as the maximum a posteriori (MAP estimate of θ. This leads to log P(θ D, M log P(D θ, M+log P( θ M+ d log π+1 log f ( θ 1 d log N Bayesian Information Criterion (BIC The Bayesian Information Criterion (BIC takes the approximation one step further, essentially by minimizing the impact of the prior. Firstly, the MAP estimate θ is replaced by the MLE ˆθ, which is reasonable if the prior has a small effect. Secondly, BIC only retains the terms that vary in N, since asymptotically the terms that are constant in N do not matter. Dropping the constant terms we get, log P(θ D, M log P(D ˆθ, M d log N Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Lecture 15 : The EM algorithm An expectation-maximization (EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP estimates of parameters in statistical models, where the model depends on unobserved latent variables. So called missing data problems are widespread in many application areas of statistics. We will consider the general set up where X = Observed data Z = Missing or unobserved data θ = Model parameters l(θ = log L(θ; X Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

65 Example Suppose we observed data X 1,..., X M from the following model p(z i = 1 = p(z i = = 1, X i Z i N(θZ i, 1 So we observe the X i s but not the Z i s or θ and want to estimate θ. Frequency Note If we knew the Z i s this would be an easy problem, but we don t...hence the need for the EM algorithm. x Overview of EM algorithm The EM algorithm is an iterative algorithm for maximising the log-likelihood l(θ that proceeds as follows 1 Set t = 0 and pick a initial value θ t. E-Step Calculate the expected log-likelihood of the full data likelihood [ ] log P(X, Z θ E p(z X,θt where the expectation is wrt p(z X, θ t i.e. the conditional distribution of the missing data Z given and observed data X and θ t. [ ] 3 M-Step Set θ t+1 = arg max θ E p(z X,θt log P(X, Z θ 4 Iterate steps and 3 until convergence. Question Why does this work? Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 l(θ = log P(X θ = log P(X, Z θ = log P(X, Z θ log P(Z X, θ P(Z X, θ Now take expectations wrt to a distribution q(z l(θ = log P(X, Z θq(z dz log P(Z X, θq(z dz = log P(X, Z θq(z dz log q(z q(z dz + log q(z q(z dz log P(Z X, θq(z dz [ ] = E q(z log P(X, Z θ + H(q(Z + KL(q(Z P(Z X, θ [ ] l(θ E q(z log P(X, Z θ + H(q(Z where H(q(Z = log q(z q(z dz is known as the Entropy of q(z. Jonathan Marchini (University of Oxford BSa MT / 8 E-Step Now set q(z = P(Z X, θ t. Then we have [ ] l(θ E P(Z X,θt log P(X, Z θ + H(q(Z with equality at θ = θ t i.e. we have a lower bound for l(θ. θ t Jonathan Marchini (University of Oxford BSa MT / 8 l( θ

66 M-Step We then obtain an updated estimate θ t+1 by maximizing this lower bound [ ] θ t+1 = arg max E P(Z X,θt log P(X, Z θ θ since H(q(Z is not a function of θ. Example Suppose we observed data X 1,..., X M from the following model p(z i = 1 = p(z i = = 1, X i Z i N(θZ i, 1 So we observe the X i s but not the Z i s or θ and want to estimate θ. l( θ Frequency θ t θ t x Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 The log-likelihood is l(θ = log P(X θ = log M k=1 P(x i z i = k, θp(z i = k The full data likelihood and it s logarithm are given by P(X, Z θ = M M P(x i z i, θp(z i = N(x i : θz i, 1 1 log P(X, Z θ = 1 M (x i θz i + K E-Step For θ t we then calculate the conditional distribution of the missing data P(Z X, θ t. This can be done one Z i at a time. ( p ikt = P(Z i = k X i, θ t exp 1 (x i kθ t Then calculate the expectation of the log of the full data likelihood [ ] E log P(X, Z θ = 1 = 1 M k=1 (x i θk p ikt M (x i θ p i1t + (x i θ p it = 1 θ i (p i1t + 4p it + θ i x i (p i1t + p it + C Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

67 Summary EM algorithm M-Step We can then update the estimate of θ by maximizing the expected log-likelihood, [ ] E log P(X, Z θ which leads to = 1 θ i θ t+1 = (p i1t + 4p it + θ i i x i(p i1t + p it i (p i1t + 4p it x i (p i1t + p it + C 1 Set t = 0 and pick a initial value θ t. E-Step To calculate the expected log-likelihood of the full data likelihood we calculate the conditional distribution of the missing data Z given and observed data X and θ t. ( p ikt = P(Z i = k X i, θ t exp 1 (x i kθ t 3 M-Step We then maximise the expected log-likelihood using θ t+1 = i x i(p i1t + p it i (p i1t + 4p it 4 Iterate steps and 3 until convergence. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 General comments Frequency On this real example the EM algorithm leads to the following sequence of estimates i θ i l(θ Jonathan Marchini (University of Oxford BSa MT / 8 x 1 The EM algorithm is only guaranteed to converge to a local maxima of the likelihood. In some situations there will be multiple-maxima. A computational strategy that is often employed to overcome this involves running the EM algorithm from many different starting points and choosing the estimate that produces the largest likelihood. The convergence of the EM algorithm can sometimes be slow, and this depends upon the form of the likelihood function. Jonathan Marchini (University of Oxford BSa MT / 8

68 Contingency table analysis North Carolina State University data. Lecture 16 : Bayesian analysis of contingency tables. Bayesian linear regression. EC : Extra Curricular activities in hours per week. EC < to 1 > 1 C or better D or F Let y = (y ij be the matrix of counts. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Frequentist analysis Usual χ test from R. Pearson s Chi-squared test X-squared = 6.964, df = (3-1(-1 =, p-value = The p-value is , evidence that grades are related to time spent on extra curricular activities. Bayesian analysis Let p = {p 11,..., p 3 }. Consider two models M I M D EC < to 1 > 1 C or better p 11 p 1 p 13 D or F p 1 p p 3 the two categorical variables are independent the two categorical variables are dependent. The Bayes factor is BF = P(y M D P(y M I. So we need to calculate the marginal likelihoods. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

The Dirichlet distribution Dirichlet integral z 1 + +z k =1 z ν 1 1 1 z ν k 1 k dz 1 dz k = Γ(ν 1 Γ(ν k Γ( ν i Examples of 3D Dirichlet distributions Dirichlet distribution Γ( ν i Γ(ν 1 Γ(ν k zν 1 1

69 The Dirichlet distribution Dirichlet integral z 1 + +z k =1 z ν z ν k 1 k dz 1 dz k = Γ(ν 1 Γ(ν k Γ( ν i Examples of 3D Dirichlet distributions Dirichlet distribution Γ( ν i Γ(ν 1 Γ(ν k zν z ν k 1 k, z z k = 1 The means are E[Z i ] = ν i / ν i, i = 1,..., k. A representation that makes the Dirichlet easy to simulate from is the following. Let W 1,..., W k be independent Gamma (ν 1,... Gamma (ν k random variables, W = W i and set Z i = W i /W, i = 1,..., k. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Calculating marginal likelihoods Calculating marginal likelihoods where P(y M D = = p P(y pπ(pdp p p 3 =1 ( y ( yij pij π(pdp11 dp 3 y ( y = ( y ij!/ y ij! y Under M D choose a uniform distribution for p i.e. Dirichlet(1,..., 1 ij P(y M D = = = where ( y Γ(RC y ( y Γ(yij + 1 Γ(RC y Γ( y + RC ( y D(y + 1 y D(1 RC p p 3 =1 D(ν = Γ(ν i /Γ( ν i ( yij pij dp11 dp 3 ij π(p = Γ(RC, p p 3 = 1 and y + 1 denotes the matrix of counts with 1 added to all entries and 1 RC denotes a vector of length RC with all entries equal to 1. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

Foundations of Statistical Inference

Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 2013 Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 27 Course arrangements Lectures M.2