Course arrangements. Foundations of Statistical Inference

Size: px
Start display at page:

Download "Course arrangements. Foundations of Statistical Inference"

Transcription

1 Course arrangements Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 013 Lectures M. and Th.10 SPR1 weeks 1-8 Classes Weeks 3-8 Friday 1-1, 4-5, 5-6 SPR1/ Hand in solutions by 1noon on Wednesday SPR1. Class Tutors : Jonathan Marchini, Charlotte Greenan Notes and Problem sheets will be available at Books Garthwaite, P. H., Jolliffe, I. T. and Jones, B. (00 Statistical Inference, Oxford Science Publications Leonard, T., Hsu, J. S. (005 Bayesian Methods, Cambridge University Press. D. R. Cox (006 Principals of Statistical Inference This course builds on notes from Bob Griffiths and Geoff Nicholls Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT 013 / 8 Part A Statistics The majority of the statistics that you have learned up to now falls under the philosophy of classical (or Frequentist statistics. This theory makes the assumption that we can randomly take repeated samples of data from the same population. You learned about three types of statistical inference Point estimation (Maximum likelihood, bias, consistency, efficiency, information Interval estimation(exact and approximate intervals using CLT Hypothesis testing (Neyman-Pearson lemma, uniformly most powerful tests, generalised likelihood ratio tests You also had an introduction to Bayesian Statistics Posterior inference (Posterior Likelihood Prior Interval estimation(credible intervals, HPD intervals Priors(conjugate priors, improper priors, Jeffreys prior Hypothesis testing (marginal likelihoods, Bayes Factors Jonathan Marchini (University of Oxford BSa MT / 8 Frequentist inference In BSa we develop the theory of point estimation further. Are there families of distributions about which we can make general statements? Exponential families. How can we summarise all the information in a dataset about a parameter θ? Sufficiency and the Factorization Theorem. What are limits of how well we can estimate a parameter θ? Cramer-Rao inequality (and bound. How can we find good estimators of a parameter θ? Rao-Blackwell Theorem and Lehmann-Scheffé Theorem. Jonathan Marchini (University of Oxford BSa MT / 8

2 Bayesian inference Parameters are treated as random variables. Inference starts by specifying a prior distribution on θ based on prior beliefs. Having collected some data we use Bayes Theorem to update our beliefs to obtain a posterior distribution. Quick Example Suppose I give a coin and tell you that it is bit biased. We might use a Beta(4,4 distribution to represent our beliefs about the θ. If we observe 30 heads and 10 tails we can use probability theory to infer a posterior distribution for θ of Beta(34, Prior distribution Posterior distribution Computational techniques for Bayesian inference It is not always possible to obtain an analytic solution when doing Bayesian Inference, so we study approximate computational techniques in this course. These include Markov chain Monte Carlo (MCMC methods Approximations to marginal likelihoods NEW Variational Approximations Laplace approximations Bayesian Information Criterion (BIC The EM algorithm NEW useful in Frequentist and Bayesian inference of missing data problems Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Decision theory Quick Example You have been exposed to a deadly virus. About 1/3 of people who are exposed to the virus are infected by it, and all those infected by it die unless they receive a vaccine. By the time any symptoms of the virus show up, it is too late for the vaccine to work. You are offered a vaccine for 500. Do you take it or not? The most likely scenario is that you don t have the virus but basing a decision on this ignores the costs (or loss associated with the decisions. We would put a very high loss on dying! Decision theory allows us to formalise this problem. Closely connected to Bayesian ideas. Applicable to point estimation and hypothesis testing. Inference is based on specifying a number of possible actions we might take and a loss associated with each action. The theory involves determining a rule to make decisions using an overall measure of loss. Notation X, Y, Z Capital letters for random variables. x, y, z Lower case letters for realisations of random variables. E X ( Expectation with respect to the random variable X. φ = {φ 1,..., φ k } Sometimes we will use bold symbols to denote a vector of parameters. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

3 Parametric families Lecture 1 - Exponential families f (x; θ, θ Θ, probability density of a random variable (rv which could be discrete or continuous. θ can be 1-dimensional or of higher dimension. Equivalent notation:f θ (x,f (x θ,f (x, θ. Likelihood L(θ; x = f (x; θ and log-likelihood l(θ; x = log(l. Examples 1. Normal N(θ, 1 : f (x; θ = 1 e 1 (x θ x R, θ R. π. Poisson: f (x; θ = θx x! e θ, x = 0, 1,,..., θ > Regression: f (y; θ = n 1 e 1 (y i p πσ j=1 x ij β j, y R n, σ > 0, β R p. θ = {β, σ}. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Exponential families of distributions Example 1 : Poisson Definition 1 GJJ.6, DRC.3 A rv X belongs to a k-parameter exponential family if its probability density function (pdf can be written as k f (x; θ = exp A j (θb j (x + C(x + D(θ, j=1 where x χ, θ Θ, A 1 (θ,..., A k (θ, D(θ are functions of θ alone and B 1 (x, B (x,..., B k (x, C(x are well behaved functions of x alone. Exponential families are widely used in practice - for example in generalised linear models (see BS1a. We want to put the Poisson distribution in the form (with k = 1 f (x; θ = exp {A(θB(x + C(x + D(θ}, e θ θ x θ+x log θ log x! /x! = e = exp {(log θx log x! θ} So A(θ = log θ, B(x = x, C(x = log x!, D(θ = θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

4 Examples of 1-parameter Exponential families Binomial, Poisson, Normal, Exponential. Distn f (x; θ A(θ B(x C(x D(θ Bin(n, p ( n x p x (1 p n x log p (1 p x log ( n x Pois(θ e θ θ x /x! log θ x log(x! θ n log(1 p N(µ, 1 1 π exp{ (x µ } µ x x / 1 (µ log(π Exp(θ θe θx θ x 0 log θ Others : negative binomial, Pareto (with known minimum, Weibull (with known shape, Laplace (with known mean, Log-normal, inverse Gaussian, beta, Dirichlet, Wishart. Exercise: check these distributions Example : a -parameter family (Gamma If X Gamma(α, β then let θ = (α, β so f (x; θ = βα x α 1 e βx Γ(α = exp {α log β + (α 1 log x βx log Γ(α} = exp { (α 1 log x βx log [ Γ(αβ α]} And we have A 1 (θ = α 1, B 1 (x = log x, A (θ = β, B (x = x. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Some other -parameter Exponential families Exponential family canonical form Distribution f (x; θ A(θ B(x C(x D(θ N(µ, σ Gamma e (x µ σ A πσ 1 (θ = 1/σ B 1 (x = x 0 1 log(πσ β α x α 1 e βx A (θ = µ/σ B (x = x 0 1 µ /σ Γ(α A 1 (θ = α 1 B 1 (x = log x 0 log [ Γ(αβ α] A (θ = β B (x = x Let φ j = A j (θ, j = 1,..., k k f (x; φ = exp φ j B j (x + C(x + D(φ. j=1 φ j, j = 1,..., k are the canonical parameters, B j, j = 1,..., k are the canonical observations. These are sometimes called the natural parameters and observations. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

5 Since we have R f (x; φdx = 1 k exp φ R j B j (x + C(x + D(φ dx = 1 j=1 k exp{d(φ} exp φ R j B j (x + C(x dx = 1 j=1 k exp φ j B j (x + C(x dx = exp{ D(φ} R j=1 Jonathan Marchini (University of Oxford BSa MT / 8 R k exp φ j B j (x + C(x dx = exp{ D(φ} j=1 Differentiate with respect to φ i. k B i (x exp φ R j B j (x + C(x dx = D(φ exp{ D(φ} φ i j=1 k B i (x exp φ j B j (x + C(x + D(φ dx = D(φ φ i R j=1 E[B i (X] = φ i D(φ Exercise Cov[B i (X, B j (X] = φ i φ j D(φ Exercise Var[B i (X] = φ D(φ i Jonathan Marchini (University of Oxford BSa MT / 8 Example 3 : Gamma We already know that if X Gamma(α, β the E(X = α β and Var(X = α β. β α x α 1 e βx Γ(α = exp { βx + (α 1 log x + α log β log Γ(α} φ 1 = β, φ = α 1, B 1 (x = x, B (x = log x D(φ = α log β log Γ(α = (φ + 1 log( φ 1 log Γ(φ + 1 E[X] = D(φ = (φ + 1 = α φ 1 φ 1 β Var[X] = φ D(φ = (φ φ = α β 1 Exercise: show E[log X] = ψ 0 (α log(β where ψ 0 is the digamma function, and Γ (α = Γ(αψ 0 (α. Jonathan Marchini (University of Oxford BSa MT / 8 Cumulant Generating Function In a scalar canonical exponential family (k = 1 E X [e sb(x ] = exp {(φ + sb(x + C(x + D(φ} dx R = exp{d(φ D(φ + s} If M B(X (s is the Moment Generating Function (mgf for B(X, then log(m B(X (s = D(φ D(φ + s This is the cumulant generating function (defined as the log of the mgf for the cumulants of B(X i.e. log(m B(X (s = κ r s r /r! where κ 1 = E(B(X and κ = V (B(X Exercise : prove this Jonathan Marchini (University of Oxford BSa MT / 8 r=1

6 Example 4 : Binomial We already know that if X Binom(n, p then E(X = np. and Therefore φ = log p (1 p p = eφ 1 + e φ B 1 (x = x, D(φ = n log(1 p = n log(1 + e φ log(m X (s = D(φ D(φ + s = n log(1 + e φ + n log(1 + e φ+s κ 1 = s log(m X (s = n eφ s=0 1 + e φ = np Jonathan Marchini (University of Oxford BSa MT / 8 Example 5 : Skew-logistic distribution Consider the real valued random variable X with pdf θe x f (x; θ = (1 + e x θ+1 { ( e = exp θ log(1 + e x x } + log 1 + e x + log θ and φ = θ, B 1 (x = log(1 + e x and D(φ = log θ = log( φ log(m X (s = D(φ D(φ + s = log( φ log( (φ + s E(log(1 + e x = 1 = 1 φ + s s=0 θ Var(log(1 + e x 1 s=0 = (φ + s = 1 θ These results are harder to derive directly. Jonathan Marchini (University of Oxford BSa MT 013 / 8 Family preserved under transformations Exponential family conditions A smooth invertible transformation of a rv from the Exponential family is also within the Exponential family. If X Y, Y = Y (X then f Y (y; θ = f X (x(y; θ X/ Y k = exp A j (θb j (x(y + C(x(y + D(θ X/ Y, j=1 The Jacobian depends only on y and so the natural observation B(x(y, the natural parameter A(θ, and D(θ do not change, while C(X C(X(Y + log X/ Y. 1. The range of X, χ does not depend on θ.. ( A 1 (θ, A (θ,..., A k (θ have continuous second derivatives for θ Θ. 3. If dim Θ = d, then the Jacobian [ ] Ai (θ J(θ = θ j has full rank d for θ Θ. [Part of the regularity conditions which we cite later.] Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

7 The family is minimal if no linear combination exists such that Linear relations among A s do not generate curvature λ 0 + λ 1 B 1 (x + + λ k B k (x = 0 for all x χ. The dimension of the family is defined to be d, the rank of J(θ. A minimal exponential family is said to be curved when d < k and linear when d = k. We refer to a (k, d curved exponential family. Example 6 (X 1, X independent, normal, unit variance, means (θ, c/θ, c known. Take a k-parameter exponential family and impose A 1 = aa + b for constants a and b. k A i B i + C + D = k A i B i + A (ab 1 + B + (C + bb 1 + D i=3 This gives a k 1 parameter EF, not a (k, k 1-CEF. log f (x; θ = x 1 θ + cx /θ θ / c θ / +... is a (, 1 curved exponential family. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Examples not in an exponential family 1 Uniform on [0, θ], θ > 0. f (x; θ = 1, x [0, θ] θ > 0 θ Lecture - Sufficiency, Factorization Theorem, Minimal sufficiency The Cauchy distribution f (x; θ = [π(1 + (x θ ] 1, x R Other examples include the F-distribution, hypergeometric distribution and logistic distribution. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

8 Sufficient statistics Let X 1,..., X n be a random sample from f (x; θ. Definition A statistic T (X 1,..., X n is a function of the data that does not depend on unknown parameters. Definition 3 A statistic T (X 1,..., X n is said to be sufficient for θ if the conditional distribution of X 1,..., X n, given T, does not depend on θ. That is, g(x t, θ = g(x t Comment The definition says that a sufficient statistic T contains all the information there is in the sample about θ. Example 7 n independent trials where the probability of success is p. Let X 1,..., X n be indicator variables which are 1 or 0 depending if the trial is a success or failure. Let T = n X i. The conditional distribution of X 1,..., X n given T = t is g(x 1,..., x n t, p = f (x 1,..., x n, t p h(t p = = = n px i (1 p 1 x i ( n t p t (1 p n t ( n t p t (1 p n t p t (1 p n t ( n 1, t not depending on p, so T is sufficient for p. Comment Makes sense, since no information in the order. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Theorem 1 : Factorization Criterion Proof - for discrete random variables 1. Assume that T is sufficient, then the distribution of the sample is T (X 1,..., X n is a sufficient statistic for θ if and only if there exist two non-negative functions K 1, K such that the likelihood function L(θ; x can be written L(θ; x = K 1 [t(x 1,..., x n ; θ]k [x 1,..., x n ] = K 1 [t; θ]k [x], where K 1 depends only on the sample through T, and K does not depend on θ. L(θ; x = f (x θ = f (x, t θ = g(x t, θh(t θ T is sufficient which implies g(x t, θ = g(x t h(t θ depends on x through t(x only so L(θ; x = g(x th(t θ We set L(θ; x = K 1 K, where K 1 h, K g. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

9 . Suppose L(θ; x = f (x θ = K 1 [t; θ]k [x].then h(t θ = f (x, t θ = L(θ; x Thus g(x t, θ = {x:t (x=t} f (x, t θ h(t θ = {x:t (x=t} = K 1 [t; θ] L(θ; x h(t θ = {x:t (x=t} K (x. K [x] {x:t (x=t} K (x, not depending on θ. (K 1 cancels out in numerator and denominator. Minimal sufficiency In general, we can envisage a sequence of functions T 1 (X, T (X, T 3 (X,... such that T j (X is a function of T j 1 (X (j where moving along the sequence gains progressive reductions of data, and if T j (X is sufficient for θ, then so is T i (X i < j. Typically there will be a last member T k (X that is sufficient, with T i (X not sufficient for i > k. Then T k (X is minimal sufficient Example 7 (cont. Consider n = 3 Bernoulli trials 1 T 1 (X = (X 1, X, X 3 (the individual sequences of trials T (X = (X 1, 3 X i (the 1st random variable and the total sum. 3 T 3 (X = 3 X i (the total sum 4 T 4 (X = I(T 3 (X = 0 (I is indicator function; Exercise Prove T 4 not sufficient Definition 4 A statistic is minimal sufficient if it can be expressed as a function of every other sufficient statistic. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example 7 (cont. : Minimal sufficiency n Bernoulli trials with T = n X i. Suppose T above is not minimal sufficient but another statistic U is MS.Then U can be given as a function of T (and not vis versa or T is MS and there exist t 1 t values of T so that U(t 1 = U(t (ie T U is many to one so U T is not a function, and we assume for the moment no other t make U(t = U(t 1.The event U = u is the event T {t 1, t }. Let x 1,...x n contain t 1 successes. Then g(x 1,..., x n u, p = g(x 1,..., x n t 1, pp(t 1 u, p = g(x 1,..., x n t 1 P(T = t 1 T {t 1, t } = p t 1(1 p n t 1 ( n t 1 p t 1(1 p n t 1 + ( n t p t (1 p n t Minimal sufficiency and partitions of the sample space Intuitively, a minimal sufficient statistic most efficiently captures all possible information about the parameter θ. Any statistic T (X partitions the sample space into subsets and in each subset T (X has constant value. Minimal sufficient statistics correspond to the coarsest possible partition of the sample space. In the example of n = 3 Bernoulli trials consider the following 4 statistics and the partitions they induce. which depends on p, so U is not sufficient, a contradiction, and hence T must be MS (similar reasoning for multiple t i. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

10 Lemma 1 : Lehmann-Scheffé partitions TTT TTH THT THH HTT HHT HTH HHH TTT TTH THT THH HTT HHT HTH HHH Consider the partition of the sample space defined by putting x and y into the same class of the partition if and only if T 1 (X = ( X 1, X, X 3 3 T (X = X 1, X i L(θ; y/l(θ; x = f (y θ/f (x θ = m(x, y. Then any statistic corresponding to this partition is minimal sufficient. TTT TTH THT THH HTT HHT HTH HHH TTT TTH THT THH HTT HHT HTH HHH Comment This Lemma tells us how to define partitions that correspond to minimal sufficient statistics. It says that ratios of likelihoods of two values x and y in the same partition (and hence same statistic value should not depend on θ. 3 T 3 (X = X i T 4 (X = I T 3 (X = 0 ( Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Proof (for discrete RVs 1. Sufficiency. Suppose T is such a statistic g(x t, θ = f (x θ f (t θ = = = f (x θ τ = {y : T (y = t} f (y θ, y τ f (x θ y τ f (x θm(x, y [ ] 1 m(x, y y τ which does not depend on θ. Hence the partition D is sufficient.. Minimal sufficiency. Now suppose U is any other sufficient statistic and that U(x = U(y for some pair of values (x, y. If we can show that U(x = U(y implies T (x = T (y, then the Lehmann-Scheffé partition induced by T includes the partition based on any other sufficient statistic.in other words, T is a function of every other sufficient statistic, and so must be minimal sufficient. Since U is sufficient we have L(θ; y L(θ; x = K 1[u(y; θ]k [y] K 1 [u(x; θ]k [x] = K [y] K [x] which does not depend on θ. So the statistic U produces a partition at least as fine as that induced by T, and the result is proved. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

11 Sufficiency in an exponential family Random Sample X 1,..., X n Likelihood L(θ; x = n f (x i ; θ n k = exp A j (θb j (x i + C(x i + D(θ j=1 ( k n n = exp A j (θ B j (x i + nd(θ + C(x i. j=1 Exponential family form again. Jonathan Marchini (University of Oxford BSa MT / 8 Sufficiency in an exponential family Suppose the family is in canonical form so φ j = A j (θ, and let t j = n B j(x i, C(x = n C(x i. k L(θ; x = exp φ j t j + nd(θ + C(x. j=1 By the factorization criterion t 1,..., t k are sufficient statistics for φ 1,..., φ k. In fact, we do not need canonical form. If k L(θ; x = exp A j (θt j + nd(θ + C(x j=1 is a minimal k-dimensional linear exponential family then (by the regularity conditions above t 1,..., t k are minimal sufficient for θ 1,..., θ k. Minimal sufficiency is verified using Lemma 1. Jonathan Marchini (University of Oxford BSa MT / 8 Estimators Definition 5 A point estimate for θ is a statistic of the data. θ = θ(x = t(x 1,..., x n. Lecture 3 - Estimators, Minimum Variance Unbiased Estimators and the Cramér-Rao Lower Bound. Definition 6 An interval estimate is a set valued function C(X Θ such that θ C(X with a specified probability. Definition 7 : Maximum likelihood estimation If L(θ is differentiable and there is a unique maximum in the interior of θ Θ, then the MLE θ is the solution of where l(θ = log L(θ; x. L(θ; x = 0 or θ l(θ = 0, θ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

12 Lemma : MLEs and exponential families Consider a k-dimensional exponential family in canonical form ( k n n L(θ; x = exp φ j B j (x i + nd(φ + C(x i. j=1 Let T j (X = n B j(x i, j = 1,..., k. If the realized data are X = x, then the statistics evaluated on the data are T j (x = t j. The MLEs of φ 1,..., φ k are the solution of t j = E X (T j, j = 1,..., k. i.e. set the expected values of the sufficient statistics equal to their realised values and solve for φ j. [If the family is not in canonical form, there is a similar slightly more complicated matrix equation] Proof l = log L = const + k φ j t j + nd(φ j=1 φ j l = t j + n φ j D(φ However, since E X [B i (X] = φ i D(φ and T j (X = n B j(x i we know that E X [T j ] = n D(φ, so φ j is equivalent to t j = E X (T j. φ j l = t j E X (T j = 0 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Bias, Variance, Mean Squared Error T n = T (X 1,..., X n is a statistic. Definition 8 T n is unbiased for a function g(θ if E X (T n = t n (xf (x; θdx = g(θ, for all θ Θ. χ Definition 9 The bias of an estimator T n is bias(t n = E X [T n g(θ] Definition 10 T n is a consistent estimator if ɛ > 0, P( T n θ > ɛ 0 as n. Definition 11 The Mean Squared Error (MSE of T n is MSE(T n = E X [T n g(θ] = V X (T n + [bias(t n ] Example 10 N(µ, σ. µ = X and S = (n 1 1 n ( Xi X are unbiased estimates of µ and σ. Minimum Variance Unbiased Estimators (MVUE If we want to find a good estimator then one obvious strategy is to try to find estimators that minimise MSE. This is often difficult. For example, if we choose the estimator θ = θ 0 then this has MSE=0 when θ = θ 0, so no other estimator can be uniformly best unless it has zero MSE everywhere. If we restrict attention to unbiased estimators then the situation becomes more tractable. In this case, MSE reduces to the variance of the estimator and we can focus on minimising the variance of estimators. That is, we search for minimum variance unbiased estimators (MVUE. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

13 Theorem : Cramér-Rao inequality (and bound. If θ is an unbiased estimator of θ, then subject to certain regularity conditions on f (x; θ, we have Var( θ I 1 θ. where I θ, the Fisher information, is given by [ ] I θ = E θ l(θ Comment This bound tells us the minimum possible variance. If an estimator achieves the bound then it is MVUE. There is no guarantee that the bound will be attainable. In many cases it is attainable asymptotically. Intuitively, the more information we have about θ, the larger I θ will be and lowest possible variance of the estimator will be smaller. Regularity conditions for CRLB We will not be concerned with the details of the required regularity conditions. The main reason they are needed is to ensure that it is ok to interchange integration and differentiation during parts of the proof. One condition that is often easy to check is that the range of the rv X must not depend on θ. So for example, the result can not be applied when working with the uniform distribution U[0, θ] and we wish to estimate θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 In order to prove the CRLB we will need to use a few results. Proposition 1 : Variance-Covariance inequality Let U and V be scalar rv. Then cov(u, V var(uvar(v with equality if and only if U = av + b for constants and a 0. The Fisher Information I θ, which is used in the Cramér-Rao lower bound, can be expressed in two different forms. Lemma 3 Under regularity conditions [ ] [ ( ] l I θ = E θ l(θ = E = Var[S(X; θ], θ where the score function s(x; θ is defined as s(x; θ = θ l(θ = f (x; θ f (x; θ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

14 Lemma 3 - Proof [ ] [ ( ] l We need to prove E θ l(θ = E. θ l θ = θ { 1 L } [ since l L θ = 1 ( L L + 1 L θ L θ ( l = + 1 ( L θ L θ The second term has expectation zero because [ 1 E L ( ] L θ = θ = 1 L ] L θ 1 L L θ Ldx = L dx = θ θ Ldx = 0 The alternative form I θ = Var[S(X; θ] follows from E [ l θ ] = 0. Jonathan Marchini (University of Oxford BSa MT / 8 Proof of the CRLB We consider only unbiased estimators, so we have E( θ = θ(xl(θ; xdx = θ Differentiate both sides w.r.t. θ Now so 1 = χ χ χ θ L θ dx = 1 L θ = L l θ θ l θ Ldx = E [ θ l ] θ Jonathan Marchini (University of Oxford BSa MT / 8 Proof of the CRLB Now we use the inequality that for two random variables U, V Cov[U, V ] Var[U]Var[V ] with U = θ, V = l l θ.we know Var[ θ ] = I θ. Must show Cov[U, V ] = 1. [ Cov[U, V ] = E[UV ] E[U]E[V ], E[U] = θ, E θ l ] = 1 θ E[V ] = χ l θ Ldx = L χ θ dx = [ ] Ldx θ χ Var[ θ] = Var[U] Cov[U, V ] Var[V ] = 1 I θ = θ [1] = 0 = I 1 θ Jonathan Marchini (University of Oxford BSa MT / 8 Information in a sample of size n. If we have n iid observations then f (x; θ = and the Fisher information is [ ] i n (θ = E θ l(θ = n n f (x i ; θ That is, i 1 (θ is calculated from the density as θ log f (x i; θf (x; θdx = ni 1 (θ. i 1 (θ = log f (x; θf (x; θdx θ Jonathan Marchini (University of Oxford BSa MT / 8

15 Question Under what conditions will we be able to attain the Cramér-Rao bound and find a MVUE? Lecture 4 - Consequences of the Cramér-Rao Lower Bound. Searching for a MVUE. Rao-Blackwell Theorem, Lehmann-Scheffé Theorem. Corollary 1 There exists an unbiased estimator θ which attains the CR lower bound (under regularity conditions if and only if Proof In the CR proof l θ = I θ( θ θ Cov[U, V ] Var[U]Var[V ] and the lower bound is attained if and only equality is achieved. If U = θ, V = l l θ, the equality occurs when θ = c + d θ, where c, d are constants. E[V ] = 0 so c = dθ and l θ = d( θ θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Multiply by l/ θ and take expectations. E [ ( ] l θ The LHS is I θ so we have d = I θ and [ ] [ ] l = de θ θ l dθe = d 1 0 θ l θ = I θ( θ θ Question What is the relationship between the CRLB and exponential families? Corollary If there exists an unbiased estimator θ(x which attains the CR lower bound (under regularity conditions it follows that X must be in an exponential family Proof Taking n = 1 and log f (x; θ θ = l θ = I θ( θ θ log f (x; θ = θa(θ + D(θ + C(x which is an exponential family form. The constant of integration C(x is a function of x. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

16 Question What is the relationship between the CRLB and MLEs? Corollary 3 Suppose θ(x is an unbiased estimator that attains the CRLB, and so is a MVUE. Suppose that the MLE θ is a solution to l/ θ = 0 (so, not on boundary. Then θ = θ. i.e. if the CRLB is attained then it is generally the MLE that attains it. Proof θ must satisfy l θ = I θ( θ θ. Setting l θ = 0 and solving will give the MLE θ. Since I θ > 0 (in all but exceptional circumstances, this gives θ = θ. Question Do all MLEs attain the CRLB? No, because not all MLEs are unbiased. Example 11 Let X 1,..., X n be a random sample from N(µ, σ. Then we know the MLEs are µ = X, σ = 1 n n X i X. Exercise µ is unbiased, but σ is biased. CRLBs are 1/I µ = σ and 1/I σ = σ 4 /n. Var( µ = σ /n which equals the CRLB so is MVUE. Var( σ = (n 1σ 4 /n is less than the CRLB. But σ is biased. The sample variance S = 1 n 1 n (X i X is unbiased and has variance σ 4 /(n 1 which is larger than the CRLB. Question Is S a MVUE? Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Efficiency Asymptotic normality of MLE Definition 13 The (Bahadur efficiency of an estimator θ is defined as a comparison of the variance of θ with the CR bound I 1 θ. That is Efficiency of θ = I 1 θ Var[ θ] = 1 I θ Var[ θ] The asymptotic efficiency is the limit as n. There are similar definitions for the relative efficiency of two estimators. Revision from Part A Statistics As the sample size n, the MLE θ N(θ, I 1 θ. This is a powerful and general result. Assuming the usual regularity conditions hold then it tells us that the MLE has the following properties 1 it is asymptotically unbiased it is asymptotically efficient i.e. it attains the CRLB asymptotically. 3 it has a normal distribution asymptotically. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

17 Extensions to the Cramér-Rao inequality 1. If θ is an estimator with bias b(θ = bias( θ, then Var[ θ] ( 1 + b I 1 θ θ. If ĝ(x is an unbiased estimator for g(θ, then Var[ĝ(X] ( g I 1 θ θ. Proof Begin with E θ ( θ(x = θ + b(θ (in 1. and E θ (ĝ(x = g(θ (in.. Differentiate both sides and proceed as above to find Cov[U, V ] = (1 + b/ θ (in 1. and Cov[U, V ] = g/ θ (in., with U = ĝ. The bound is against Cov[U, V ] which leads to the results above. Fisher Information for a d-dimensional parameter Information matrix: [ ] [ l l ] l I ij = E = E θ i θ j θ i θ j under regularity conditions. The CR inequality is Var( θ i [I 1 ] ii, i = 1,..., d. Exercise: verify that we have already proved Var( θ i [I ii ] 1. Note that [I 1 ] ii [I ii ] 1 (GJJ so bound above is stronger. Exercise For an Exponential family in canonical form, I ij = nd(φ. φ i φ j Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 The CRLB may not be achievable but will still wish to search for an MVUE. Sufficiency plays an important role in the search for a MVUE. Theorem 3 : Rao-Blackwell Theorem (RBT (GJJ.5. Let X 1,..., X n be a random sample of observations from f (x; θ. Suppose that T is a sufficient statistic for θ and that θ is any unbiased estimator for θ. Define a new estimator θ T = E[ˆθ T ]. Then 1. θ T is a function of T alone;. E[ θ T ] = θ; 3. Var( θ T Var( θ. Comment This says that estimators maybe be improved if we take advantage of sufficient statistics. Proof 1. θ T = E X [ˆθ T = t] = = ˆθ(xf (x t, θdx χ χ ˆθ(xf (x tdx. E[ θ T ] = E T [E[ˆθ T ]] = E[ˆθ] = θ (by law of total expectation 3. Using the law of total variance Var( θ = Var(E[ˆθ T ] + E T [Var(ˆθ T ] = Var( θ T + E T [Var(ˆθ T ] Var( θ Var( θ T Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

18 Example 1 Suppose X 1,..., X n be a random sample from Bernoulli(θ. It is easy to see that θ = X 1 is unbiased for θ. Also, we have seen before that T = N X i is sufficient for θ. We can use RBT to construct an estimator with smaller variance E[X 1 T = t] = P(X 1 = 1 T = t = P(X 1 = 1, N X i = t P( N X i = t = P(X 1 = 1, N i= X i = t 1 N C t θ t (1 θ N t = θ.n 1 C t 1 θ t 1 (1 θ N t N C t θ t (1 θ n t Corollary 4 If an MVUE θ for θ exists, then there is a function θ T of the minimal sufficient statistic T for θ which is an MVUE. Proof If θ is a MVUE and T is minimal sufficient then by RBT we can construct θ T. Which implies θ T is a function of T alone, is unbiased and variance no larger than θ. Hence is also a MVUE. Comment This says that we can restrict our search for a MVUE to those based on minimal sufficient statistics. = t N Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Completeness Definition 14 : Complete Sufficient Statistics Let T (X 1,..., X n be a sufficient statistic for θ. The statistic T is complete if, whenever h(t is a function of T for which E[h(T ] = 0 for all θ, then h(t 0 almost everywhere. Lemma 4 Suppose T is a complete sufficient statistic for θ, and g(t unbiased for θ, so E[g(T ] = θ. Then g(t is the unique function of T which is an unbiased estimator of θ. Proof If there were two such unbiased estimators g 1 (T, g (T, then E[g 1 (T g (T ] = θ θ = 0 for all θ, so g 1 (T = g (T almost everywhere. Question If we have an unbiased estimator what are the sufficient conditions for it to be MVUE? Lemma 5 If an MVUE for θ exists and T is a complete and minimal sufficient statistic for θ, and suppose h = h(t is unbiased for θ, then h(t is a MVUE. This Lemma combines the results of Corollary 4 and Lemma 4. Proof If an MVUE exists then there is a function of T which is an MVUE, by the RB Corollary 4. But h(t is the only function of T which is unbiased for θ (from Lemma 4. So h must be the function of T which an MVUE. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

19 Question Finally, how can we construct a MVUE? Theorem 4 : Lehmann-Scheffé theorem Let T be a complete sufficient statistic for θ, and let θ be an unbiased estimator for θ, then the unbiased estimator θ T = E[ˆθ T ] has the smallest variance among all unbiased estimators of θ. That is, for all unbiased estimators θ. Var( θ T Var( θ Comment This theorem says that if we can find any unbiased estimator and a complete sufficient statistic T then we can construct a MVUE. Proof Suppose θ exists with Var( θ < Var( θ T. Then by RBT we can construct θ T = E[ θ T ] such that Var( θ T Var( θ < Var( θ T But θ T and θ T are both unbiased and T is complete, so by Lemma 4 we have θ T = θ T and Var( θ T = Var( θ T which is a contradiction. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Lemma 6 : Complete Sufficiency in EFs If the rv X has a distribution belonging to a k-parameter exponential family, then under the usual regularity conditions, the statistic ( n B 1 (X i, n B (X i,..., n B k (X i which we already know is minimal sufficient, is complete. Comment We showed before that MLEs for exponential families were functions of this same statistic. Therefore, for a member of an exponential family if the MLE is unbiased (note : not all MLEs are unbiased, then by Lemma 5, the MLE will be MVUE. If there is an unbiased estimator that attains the CRLB then, by Corollary 3, the MLE will attain the CRLB. Example 13 Let X 1,..., X n be a random sample from N(µ, σ. Then we know the MLEs are µ = X, σ = 1 n n X i X. µ is unbiased, but σ is biased. Exercise ( The minimal sufficient complete statistic is n X i, n X i. So µ is MVUE and attains the CRLB with variance σ /n. The sample variance S = 1 n 1 n (X i X is unbiased and is a function of the minimal sufficient complete statistic so is MVUE with variance σ 4 /(n 1 which is larger than the CRLB of σ 4 /n. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

20 Method of moments Generate estimators by equating observed statistics with their expected values E{t(X 1,..., X n } = t(x 1,..., x n. Lecture 5 - Method of Moments Comment Simple, easy to use. Often have good properties, but not guaranteed. Can provide good starting point for an iterative method for finding MLEs. Example 14 Uniform iid sample X = (X 1, X,..., X n on (0, θ. L(θ; x = f (x; θ = θ n, 0 < x 1,..., x n < θ. Let X (n = max i X i. In this case the moment relation E [ ] X (n = x(n Jonathan Marchini (University of Oxford BSa MT / 8 leads to an unbiased sufficient statistic, θ = n+1 n X (n. Jonathan Marchini (University of Oxford BSa MT / 8 Method of moments The distribution of X (n = max i X i is obtained from the CDF ( y n P(X (n y = θ (the probability all iid X i fall in (0, y so f X(n (y; θ = ny n 1 θ n, 0 < y < θ and E [ ] n X (n = n + 1 θ, so θ = n + 1 n is unbiased. X (n Now check θ is sufficient. The distribution of X X (n = y is f (x X (n = y; θ = which does not depend on θ. θ is minimal sufficient since f (x; θ f X(n (y; θ = 1 ny n 1 L(θ; x L(θ; y = θ n I[x (n < θ] θ n I[y (n < θ] does not depend on θ if x (n = y (n i.e. θ forms a Lehmann-Scheffé partition. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

21 Finally, we can show that θ = θ(x (n is complete. If E[ θ] = 0 for all θ > 0 then θ 0 ny n 1 h( θ(y θ n dy = 0 for all θ > 0 and hence θ h( θ(yy n 1 dy = 0. 0 Differentiate wrt θ (using the Leibniz integral rule, and conclude h( θ = 0 identically for all X. It follows that θ is the unique unbiased estimator based on X (n. Since θ is minimal sufficient, if a MVUE exists, it equals θ (Corollary 4. We can t use the CRB to show a MVUE exists, as the problem is non-regular. Jonathan Marchini (University of Oxford BSa MT / 8 Exercise l(θ; x = n log(θ so [ ( ] l E = n /θ θ and the CRB would be Var( θ θ /n. However, you can check (using f X(n (y; θ above that Var( θ = θ n(n + 1 which is smaller than θ /n for any n 1. This is not a contradiction, as f (x; θ doesn t satisfy the regularity conditions (limits of x depend on θ. Jonathan Marchini (University of Oxford BSa MT / 8 We can find the MLE in this example but this estimator is θ = X (n 1 biased and the Uniform density does not satisfy the regularity condition needed for CRLB so it does not apply. does not satisfy l/ θ = 0, so even if the CRLB did apply we cannot make the link between the MLE and the lower bound. Example 15 A sample from N(µ, σ. The moment relations [ n ] [ n ] E X i = nµ, E = n(µ + σ lead to the estimating equations X i µ = X, µ + σ = n 1 n X i. We have seen that these are just the equations for the MLE s in this -dimensional exponential linear family, and solve to give the MLE n σ = n 1 (X i X. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

22 Example 16 A sample from Pois(λ. We have two possible moment relations [ ] 1 n E X i = λ, n λ 1 = 1 n X i n E [ 1 n ] n (X i X = λ, λ = 1 n n (X i X For a sample of size n we have A(λ = log λ, B(x = n x i, C(x = log( n x i!, D(λ = nλ By Lemma 6 we have that B(x = n x i is a sufficient complete statistic. Thus by the Lehmann-Scheffé Theorem we have that arithmetic mean λ 1 is MVUE. λ is not a function of B(x. Exercise Show that λ 1 attains the CRLB. Jonathan Marchini (University of Oxford BSa MT / 8 Method of moments asymptotics Distribution of θ as n.suppose the dimension of θ is d = 1 and H = n 1 n h(x i where E( H = k(θ and θ is the solution to H = k(θ. Theorem 5 As n, θ is asymptotically normally distributed with mean θ and variance n 1 σ h /(k (θ, where σ h = h(x f (x; θdx k(θ Proof E( H = k(θ, so by the Central Limit Theorem, Approximately ( H k(θ n σ h D N(0, 1 H = k( θ = k(θ + ( θ θk (θ Jonathan Marchini (University of Oxford BSa MT / 8 Method of moments asymptotics Example 17 - Exponential distribution Rearrange to get so as n. n( θ θk (θ σ h θ N = ( H k(θ n D N(0, 1 σ h ( θ, σ h n(k (θ That is Now so f (x; θ = θ exp( θx, x > 0 E[ X] = θ 1, θ = X 1 k(θ = θ 1, H = X, h(x = X, E [ h(x ] = θ 1 Var [ h(x ] = σ h = θ, k (θ = θ θ N ( θ, σ h n(k (θ θ N(θ, n 1 θ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

23 Exponential families - Method of moments and MLEs MLE are moment estimates because we are solving [ n ] n E t(x i = t(x i for θ. We showed earlier that [ ] E t(x = D (θ so Now [ n ] E t(x i = nd (θ k(θ = D (θ, k (θ = D (θ. Lecture 6 : Bayesian Inference Var( θ Var(t(X nd (θ = D (θ nd (θ = I 1 θ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Ideas of probability The majority of statistics you have learned so are are called classical or Frequentist. The probability for an event is defined as the proportion of successes in an infinite number of repeatable trials. By contrast, in Subjective Bayesian inference, probability is a measure of the strength of belief. We treat parameters as random variables. Before collecting any data we there is uncertainty about the value of a parameter. This uncertainty can be formalised by specifying a pdf (or pmf for the parameter. We then conduct an experiment to collect some data that will give us information about the parameter. We then use Bayes Theorem to combine our prior beliefs with the data to derive an updated estimate of our uncertainty about the parameter. The history of Bayesian Statistics Bayesian methods originated with Bayes and Laplace (late 1700s to mid 1800s. In the early 190 s, Fisher put forward an opposing viewpoint, that statistical inference must be based entirely on probabilities with direct experimental interpretation i.e. the repeated sampling principle. In 1939 Jeffrey s book The theory of probability started a resurgence of interest in Bayesian inference. This continued throughout the s, especially as problems with the Frequentist approach started to emerge. The development of simulation based inference has transformed Bayesian statistics in the last 0-30 years and it now plays a prominent part in modern statistics. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

24 Problems with Frequentist Inference - I Frequentist Inference generally does not condition on the observed data A confidence interval is a set-valued function C(X Θ of the data X which covers the parameter θ C(X a fraction 1 α of repeated draws of X taken under the null H 0. This is not the same as the statement that, given data X = x, the interval C(x covers θ with probability 1 α. But this is the type of statement we might wish to make. Example 1 Suppose X 1, X U(θ 1/, θ + 1/ so that X (1 and X ( are the order statistics. Then C(X = [X (1, X ( ] is a α = 1/ level CI for θ. Suppose in your data X = x, x ( x (1 > 1/ (this happens in an eighth of data sets. Then θ [x (1, x ( ] with probability one. Problems with Frequentist Inference - II Frequentist Inference depends on data that were never observed The likelihood principle Suppose that two experiments relating to θ, E 1, E, give rise to data y 1, y, such that the corresponding likelihoods are proportional, that is, for all θ L(θ; y 1, E 1 = cl(θ; y, E. then the two experiments lead to identical conclusions about θ. Key point MLE s respect the likelihood principle i.e. the MLEs for θ are identical in both experiments. But significance tests do not respect the likelihood principle. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Consider a pure test for significance where we specify just H 0. We must choose a test statistic T (x, and define the p-value for data T (x = t as p-value = P(T (X at least as extreme as t H 0. The choice of T (X amounts to a statement about the direction of likely departures from the null, which requires some consideration of alternative models. Note 1 The calculation of the p-value involves a sum (or integral over data that was not observed, and this can depend upon the form of the experiment. Note A p-value is not P(H 0 T (X = t. Example A Bernoulli trial succeeds with probability p. E 1 E fix n 1 Bernoulli trials, count number y 1 of successes count number n Bernoulli trials to get fixed number y successes L(p; y 1, E 1 = L(p; y, E = ( n1 p y 1 (1 p n 1 y 1 binomial y ( 1 n 1 p y (1 p n y negative binomial y 1 If n 1 = n = n, y 1 = y = y then L(p; y 1, E 1 L(p; y, E. So MLEs for p will be the same under E 1 and E. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

25 Example But significance tests contradict : eg, H 0 : p = 1/ against H 1 : p < 1/ and suppose n = 1 and y = 3. The p-value based on E 1 is P ( Y y θ = 1 = y k=0 ( n (1/ k (1 1/ n k (= k while the p-value based on E is ( P N n θ = 1 ( k 1 = (1/ k (1 1/ n k (= y 1 k=n so different conclusions at significance level Note The p-values disagree because they sum over portions of two different sample spaces. Jonathan Marchini (University of Oxford BSa MT / 8 Bayesian Inference - Revison Likelihood f (x θ and prior distribution π(θ for ϑ. The posterior distribution of ϑ at ϑ = θ, given x, is π(θ x = f (x θπ(θ π(θ x f (x θπ(θ f (x θπ(θdθ posterior likelihood prior The same form for θ continuous (π(θ x a pdf or discrete (π(θ x a pmf. We call f (x θπ(θdθ the marginal likelihood. Likelihood principle Notice that, if we base all inference on the posterior distribution, then we respect the likelihood principle. If two likelihood functions are proportional, then any constant cancels top and bottom in Bayes rule, and the two posterior distributions are the same. Jonathan Marchini (University of Oxford BSa MT / 8 Example 1 X Bin(n, ϑ for known n and unknown ϑ. Suppose our prior knowledge about ϑ is represented by a Beta distribution on (0, 1, and θ is a trial value for ϑ. π(θ = θa 1 (1 θ b 1, 0 < θ < 1. B(a, b density a=1, b=1 a=0, b=0 a=1, b=3 a=8, b= Jonathan Marchini (University of Oxford BSa MT / 8 x Example 1 Prior probability density Likelihood π(θ = θa 1 (1 θ b 1, 0 < θ < 1. B(a, b f (x θ = Posterior probability density ( n θ x (1 θ n x, x = 0,..., n x π(θ x likelihood prior θ a 1 (1 θ b 1 θ x (1 θ n x = θ a+x 1 (1 θ n x+b 1 Here posterior has the same form as the prior (conjugacy with updated parameters a, b replaced by a + x, b + n x, so π(θ x = θa+x 1 (1 θ n x+b 1 B(a + x, b + n x Jonathan Marchini (University of Oxford BSa MT / 8

26 Example 1 Example 1 For a Beta distribution with parameters a, b µ = a a + b, σ = The posterior mean and variance are ab (a + b (a + b + 1 a + X a + b + n, (a + X(b + n X (a + b + n (a + b + n + 1 Suppose X = n and we set a = b = 1 for our prior. Then posterior mean is n + 1 n + i.e. when we observe events of just one type then our point estimate is not 0 or 1 (which is sensible especially in small sample sizes. For large n, the posterior mean and variance are approximately In classical statistics X X(n X, n n 3 θ = X n, θ(1 θ n = X(n X n 3 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example 1 Suppose ϑ = 0.3 but prior mean is 0.7 with std 0.1. Suppose data X Bin(n, ϑ with n = 10 (yielding X = 3 and then n = 100 (yielding X = 30, say. probability density post n= theta post n=10 As n increases, the likelihood overwhelms information in prior. Jonathan Marchini (University of Oxford BSa MT / 8 prior Example - Conjugate priors Normal distribution when the mean and variance are unknown.let τ = 1/σ, θ = (τ, µ. τ is called the precision. The prior τ has a Gamma distribution with parameters α, β > 0, and conditional on τ, µ has a distribution N(ν, 1 kτ for some k > 0, ν R. The prior is { π(τ, µ = βα Γ(α τ α 1 e βτ (π 1/ (kτ 1/ exp kτ } (µ ν or The likelihood is [ π(τ, µ τ α 1/ exp τ {β + k }] (µ ν f (x µ, τ = (π n/ τ n/ exp { τ } n (x i µ Jonathan Marchini (University of Oxford BSa MT / 8

27 Example - Conjugate priors Thus [ { π(τ, µ x τ α+(n/ 1/ exp τ β + k (µ ν + 1 }] n (x i µ Example - Conjugate priors Thus the posterior is where π(τ, µ x τ α 1/ exp [ τ {β + k }] (ν µ Complete the square to see that k(µ ν + (x i µ ( kν + n x = (k + n µ + nk k + n n + k ( x ν + (x i x α = α + n β = β + 1 nk n + k ( x ν + 1 (xi x k = k + n ν = kν + n x k + n This is the same form as the prior, so the class is conjugate prior. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Constructing priors Lecture 7 : Prior distributions. Hierarchical models. Predictive Distributions. Subjective Priors: Write down a distribution representing prior knowledge about the parameter before the data is available. If possible, build a model for the parameter. If different scientists have different priors or it is unclear how to represent prior knowledge as a distribution, then consider several different priors. Repeat the analysis and check that conclusions are insensitive to priors representing different points of view. Non-Subjective Priors: Several approaches offer the promise of an automatic and even objective prior. We list some suggestions below (Uniform, Jeffreys, MaxEnt. In practice, if one of these priors conflicted prior knowledge, we wouldnt use it. These approaches can be useful to complete the specification of a prior distribution, once subjective considerations have been taken into account. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

28 Uniform priors If we use a uniform prior (π(θ constant then π(θ x = L(θ; x/ L(θ; xdθ and Θ L(θ; xdθ may not be finite. Such distributions are called improper and all inference is meaningless. Example X Exp(1/µ and N = I(X < 1 yielding N = n with n {0, 1}. Suppose we observe n = 0. Now Θ L(µ; n = exp( 1/µ so if we take π(µ 1 for µ > 0 we have π(µ n exp( 1/µ which is improper, as π(µ n 1 as µ so 0 exp( 1/µdµ cannot exist. Jonathan Marchini (University of Oxford BSa MT / 8 Jeffreys Priors Jeffreys reasoned as follows. If we have a rule for constructing priors it should lead to the same distribution if we apply it to θ or some other parameterization ψ with g(ψ = θ.jeffreys took π(θ I θ where I θ = E is the Fisher information. Now if g(ψ = θ then π Ψ (ψ π(g(ψ g (ψ, [ ( ] l θ so Jeffreys rule should yield π Ψ (ψ I g(ψ g (ψ. The rule gives π Ψ (ψ I ψ. But I ψ = g (ψ I g (ψ, so I ψ = I g(ψ g (ψ and the rule is consistent in this respect. It is sometimes desirable (on subjective grounds to have a prior which is invariant under reparameterization. Jonathan Marchini (University of Oxford BSa MT / 8 Higher dimensions If Θ R k, and l(θ; X = log(f (X; θ, the Fisher information satisfies ( l(θ; X [I θ ] i,j = E θ θ i θ j ( ( l(θ; X l(θ; X l(θ; X E θ = E θ θ i θ j θ i θ j subject to regularity conditions. A k-dimensional Jeffreys prior π(θ I θ 1/ ( A det(a is invariant under 1-1 reparameterization. Exercise Verify 1 to 1 g(ψ = θ in R k gives π Ψ (ψ = θ I T g(ψ ψ. Maximum Entropy Priors Choose a density π(θ which maximizes the entropy φ[π] = π(θ log π(θdθ Θ over functions π(θ subject to constraints on π. This is a Calculus of Variations problem. Example The distribution π maximizing φ[π] over all densities π on Θ = R, subject to 0 π(θdθ = 1, 0 θπ(θdθ = µ, and 0 (θ µ π(θdθ = σ, (normalized with Eϑ = µ and Var(ϑ = σ is the normal density π(θ = 1 πσ e (θ µ /σ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

29 This is a special case of the following Theorem. Theorem The density π(θ that maximizes φ(π, subject to E[t j (θ] = t j, j = 1,..., p takes the p-parameter exponential family form { p } π(θ exp λ i t i (θ for all θ Θ, where λ 1,..., λ p are determined by the constraints. (For the proof see Leonard and Hsu. Example In the normal case t 1 (θ = θ, t 1 = µ, t (θ = (θ µ, t = σ gives π(θ exp(λ 1 θ + λ (θ µ. Impose the constraints to get λ 1 = 0 and λ = 1/σ. Jonathan Marchini (University of Oxford BSa MT / 8 Example Suppose prior probabilities are specified so that with j φ j = 1 and P(a j 1 < ϑ a j = φ j, j = 1,..., p ϑ (a 0, a p, a 0 a 1 a p a p. We find the maximum entropy distribution subject to these conditions. The conditions are equivalent to E[t j (ϑ] = φ j, j = 1,..., p where t j (ϑ = I[a j 1 < ϑ a j ]. The posterior density of ϑ is p π(θ exp λ j I[a j 1 < θ a j ], a 0 θ a p j=1 where λ 1,..., λ p are determined by the conditions. π(θ is a histogram, with intervals (a 0, a 1 ], (a 1, a ],..., (a p 1, a p ]. Jonathan Marchini (University of Oxford BSa MT / 8 Hierarchical models These are models where the prior has parameters which again have a probability distribution. Data x have a density f (x; θ. The prior distribution of θ is π(θ; ψ. ψ has a prior distribution g(ψ, for ψ Ψ. We can work with posterior π(θ, ψ x f (x; θπ(θ; ψg(ψ, or π(θ x = π(θ, ψ xdψ f (x; θ π(θ; ψg(ψdψ Ψ Ψ f (x; θπ(θ The hierarchical model simply specifies the prior for θ indirectly π(θ = π(θ; ψg(ψdψ. Example For i = 1,,..., k we make n i observations X i,1, X i,,..., X i,ni on population i, with X ij N(θ i, σ. The θ i are the unknown means for observations on the i th population but σ is known. Suppose the prior model for the θ i is iid normal, θ i N(φ, τ. X 1,1,, X 1,n1 ~ N θ 1,σ X,1,, X,n ~ N θ,σ ( θ 1 ( θ.. N φ,τ.. X k,1,, X k,nk ~ N θ k,σ ( θ k (! Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

30 Example If ψ = (φ, τ π(θ 1,..., θ k ; ψ = k { (πτ 1/ exp 1 } τ (θ i φ, Now we need a prior for φ and τ. Suppose we take g(φ, τ = constant, so all possible φ, τ equally likely a priori. careful!. The joint posterior of the parameters is π(θ, ψ x f (x; θπ(θ ψg(ψ Example Integration with respect to ψ gives the posterior density of θ. k π(θ, φ, τ x exp 1 n i σ (x ij θ i j=1 [ k { τ 1 exp 1 } ] τ (θ i φ Integrate out wrt φ and τ to obtain the marginal posterior distribution of θ. Exercise Integrating the last factor wrt φ gives a term proportional to { τ 1 k exp 1 } (θi τ θ Exercise Then the integral wrt τ gives a term proportional to [ (θi θ ] 1 k/ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example Thus the posterior distribution of θ is k π(θ x exp 1 n i [ σ (x ij θ i (θi θ ] 1 k/ j=1 [ k { exp 1 } ] [ σ n i(θ i x i (θi θ ] 1 k/ where x i = j x ij/n i. Let θ j be the MAP estimate for θ j (posterior mode and put θ = θj /k. Exercise Differentiate π(θ x wrt θ j and set to equal zero to get θ j = (ω j x j + ν θ /(ω j + ν, [ where ω j = (σ /n j 1 and ν = ( θi θ ] 1 /(k. If the θ j were unrelated then ˆθ j = x j. The model modifies the estimate by pulling it towards the mean of the estimated θ i s. Jonathan Marchini (University of Oxford BSa MT / 8 Uniform priors If the prior is uniform (and possibly improper itself but the posterior is proper, and it can be shown that all parameter values being equally probable a priori is a fair representation of prior knowledge, then the uniform prior may be useful. In the hierarchical model in the previous example the prior on ϕ, τ was improper, but the posterior is proper. Jonathan Marchini (University of Oxford BSa MT / 8

31 Priors for Exponential Families Conjugate prior for an exponential family k n f (x θ = exp A j (θ B j (x i + j=1 n C(x i + nd(θ The prior distribution based on sufficient statistics k π(θ exp{τ 0 D(θ + A j (θτ j } (where (τ 0,..., τ k are constant prior parameters is conjugate. j=1 Priors for Exponential Families The posterior density is proportional to f (x θπ(θ τ 0,..., τ k [ k n ] exp A j (θ B j (x i + τ j + (n + τ 0 D(θ j=1 This is an updated form of the prior with n B j (x = B j (x i + τ j n = n + τ 0 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 f Predictive distributions X 1,..., X n are observations from f (x; θ and the predictive distribution of a further observation X n+1 is required. If x = (x 1,..., x n then the posterior predictive distribution is g(x n+1 x = f (x n+1 ; θπ(θ xdθ Predictive distributions are useful for...prediction. They are used also for model checking. Divide the data in two groups, Y = (X 1,..., X a and Z = (X a+1,..., X n. If we fit using Y and check that the reserved data Z overlap g(x n+1 x in distribution. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

32 Example Data X 1,..., X n are iid N(θ, σ with σ known and prior θ N(µ 0, σ 0. Predict X n+1. We saw that π(θ x = N(θ; µ 1, σ 1 with µ 1 and σ 1 calculated in example 3 above. In order to calculate the posterior predictive density for X n+1 we need to evaluate g(x n+1 x = 1 (x θ e πσ πσ1 σ 1 We could complete the square to solve this. e (θ µ 1 σ 1 dθ Example Alternatively, think how X n+1 is built up. We have θ X N(µ 1, σ 1 and X n+1 θ N(θ, σ. If Y, Z N(0, 1 then θ = µ 1 + σ 1 Z (posterior X n+1 = θ + σy (observation model = µ 1 + σ 1 Z + σy. It follows that X n+1 N(µ 1, σ + σ1 is the posterior predictive density for X n+1 X 1,..., X n. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Hypothesis Testing I Lecture 8 : Bayesian Hypothesis Testing In Bayesian inference, hypotheses are represented by prior distributions. There is nothing special about H 0, and H 0 and H 1 need not be nested. Let π(θ H 0, θ Θ 0 be the prior distribution of θ under hypothesis H 0. Here π(θ H 0 is a pmf/pfd as θ H 0 is continuous/discrete. Composite H 0 : θ Θ 0, with Θ 0 Θ, π(θ H 0 = π(θ π(θ Θ 0 I(θ Θ 0 If Θ 0 has more than one element then H 0 is a composite hypothesis. Simple H 0 : θ = θ 0, so that π(θ 0 H 0 = 1. This is a simple hypothesis. However, since any statement about the form of the prior amounts to a hypothesis about θ, we are not restricted to statements about set membership (not just simple and composite. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

33 Example 1 In a quality inspection program components are selected at random from a batch and tested. Let θ denote the failure probability. Suppose that we want to test for H 0 : θ 0. against H 1 : θ > 0. and that the prior is θ Beta(, 5 so that π(θ = 30θ(1 θ 4, 0 < θ < 1. Now if π(h 0 = π(θ Θ 0 then π(h 0 = θ(1 θ 4 dθ so that π(h and π(h so and π(θ H 0 = π(θ H 1 = 30θ(1 θ4, 0 < θ 0. π(h 0 30θ(1 θ4, 0. < θ < 1 π(h 1 Marginal Likelihood P(x H 0 is called the marginal likelihood (for hypothesis H 0. We can think of a parameter H {H 0, H 1 }, with likelihood P(x H, prior P(H and posterior P(H x. By the partition theorem for probability (π(θ H 0 a pdf, say P(x H 0 = P(x θ, H 0 P(θ H 0 dθ Θ 0 = L(θ; xπ(θ H 0 dθ, Θ 0 since, given θ, x is determined by the observation model, and independent of the process (H 0 that generated θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 In the discrete case P(x H 0 = θ Θ 0 L(θ; xπ(θ H 0. In the special case that H 0 is a simple hypothesis Θ 0 = {θ 0 }, π(θ 0 H 0 = 1, and P(x H 0 = L(θ 0 ; x. Jonathan Marchini (University of Oxford BSa MT / 8 Example 1 (cont In the quality inspection program suppose n components are selected for independent testing. The number X that fail is X Binomial(n, θ. Recall H 0 : θ 0. with θ Beta(, 5 in the prior. The marginal likelihood for H 0 is P(x H 0 = L(θ; xπ(θ H 0 dθ Θ 0 ( 5 0. = θ x n x 30θ(1 θ4 (1 θ dθ x π(h 0 For one batch of size n = 5, X = 0 is observed. Recall that π(h Then P(x H 0 = ( θ(1 θ 9 dθ 0 0 π(h /0.345 = Jonathan Marchini (University of Oxford BSa MT / 8

34 Notice that Similarly, for H 1 : θ > 0. P(x H 1 = ( θ(1 θ 9 dθ 0 0. π(h P(x H 0 = E(L(ϑ; x H 0, that is, the marginal likelihood is the average likelihood given the prior π(θ H 0 the marginal likelihood is the normalizing constant we often leave off when we write the posterior π(θ x, H 0 = L(θ; xπ(x H 0, P(x H 0 posterior = likelihood prior marginal likelihood, Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Prior and Posterior Probabilities for Hypotheses We have a posterior probability for H 0 itself. This is actually where we started with Bayesian inference. In the simple case where we have two hypotheses H 0, H 1, exactly one of which is true, where P(H 0 x = P(H 0P(x H 0, P(x P(x = P(H 0 P(x H 0 + P(H 1 P(x H 1 so that P(H 0 x + P(H 1 x = 1. When we estimate the value of a discrete parameter H {H 0, H 1 }, we are making a Bayesian hypothesis test. Example 1 (cont X Binomial(5, θ with θ Beta(, 5 in the prior and H 0 : θ 0. and H 1 : θ > 0.. The posterior probability for H 0 given we observe X = 0 is P(H 0 x = P(x H 0P(H 0 P(x P(H 0 = P(θ Θ 0 (P(θ Θ 0 + P(θ Θ 1 = π(h 0 P(x H 0 π(h P(x H 1 π(h P(x P(x H 0 P(H 0 + P(x H 1 P(H P(H 0 x 0.185/0.73 = P(H 1 x 0.3 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

35 Hypothesis Testing II, Bayes factors Suppose we have two hypotheses H 0, H 1, exactly one of which is true. Data x. The Prior Odds Q = P[H 0] P[H 1 ] These are prior odds since P[H 1 ] = 1 P[H 0 ]. Here H 0 is Q times more probable than H 1, given the prior model. The Posterior Odds Q = P[H 0 x] P[H 1 x] are the posterior odds, so that H 0 is Q times more probable than H 1, given the data and prior model. Jonathan Marchini (University of Oxford BSa MT / 8 Hypothesis Testing II, Bayes factors The posterior odds for H 0 against H 1 can be written where Q is the prior odds and is the Bayes Factor. Q = P[H 0] P[H 1 ] P(x H 0 P(x H 1 = Q B B = P(x H 0 P(x H 1 The Bayes Factor is a criterion for model comparison since H 0 is B times more probable than H 1, given the data and a prior model which puts equal probability on H 0 and H 1. The Bayes factor tells us how the data shifts the strength of belief (measured as a probability in H 0 relative to H 1. Jonathan Marchini (University of Oxford BSa MT / 8 Example 1 (cont X Binomial(5, θ with θ Beta(, 5 in the prior and H 0 : θ 0. and H 1 : θ > 0.. The prior odds are The posterior odds are Q = P(H 0 /P(H /( Q = P(H 0 x/p(h 1 x 0.678/( The Bayes factor comparing H 0 and H 1 is B = P(x H 0 P(x H /0.134 = 4 Jonathan Marchini (University of Oxford BSa MT / 8 Explicitly, from the beginning, Θ B = 0 L(x; θπ(θ H 0 dθ Θ 1 L(x; θπ(θ H 1 dθ = = Θ 0 L(x; θπ(θdθ Θ 1 L(x; θπ(θdθ π(h 1 π(h 0 ( θ(1 θ 9 1 dθ 0. ( 5 30θ(1 θ4 dθ θ(1 0. θ9 dθ 0 30θ(1 θ 4 dθ = / (Maple. This is positive evidence for θ 0.. Notice that the Bayes factor is more positive than the posterior odds, as the prior odds were weighted against H 0. Jonathan Marchini (University of Oxford BSa MT / 8

36 Interpreting Bayes Factors Adrian Raftery gives this table (values are approximate, and adapted from a table due to Jeffreys interpreting B. P(H 0 x B log(b evidence for H 0 < 0.5 < 1 < 0 negative (supports H to to 3 0 to barely worth mentioning 0.75 to to 1 to 5 positive 0.9 to to to 10 strong > 0.99 > 150 > 10 very strong I added the leftmost column (posterior for prior odds equal one. We sometimes report log(b because it is on the same scale as the familiar deviance and likelihood ratio test statistic. Simple-Simple hypothesis If both hypotheses are simple H 0 : θ = θ 0 ; H 1 : θ = θ 1, with priors P(H 0 and P(H 1 for the two hypotheses, the posterior probability for H 0 is P(H 0 x = P(x H 0P(H 0 P(x L(θ 0 ; xp(h 0 = L(θ 0 ; xp(h 0 + L(θ 1 ; xp(h 1, since P(x H 0 is just L(θ 0 ; x. The Bayes factor is then just likelihood ratio B = L(θ 0; x L(θ 1 ; x. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Simple-Composite hypothesis If one hypothesis is simple and the other composite, for example, H 0 : θ = θ 0 ; H 1 : π(θ H 1, θ Θ, with priors P(H 0 and P(H 1 for the two hypotheses, the Bayes factor is L(x; θ 0 B = Θ L(x; θπ(θ H 1dθ The denominator is just Θ L(x; θπ(θdθ when π(θ H 1 is a pdf. Exercise Show that the posterior probability for H 0 is L(θ 0 ; xp(h 0 P(H 0 x = P(H 0 L(θ 0 ; x + P(H 1 Θ L(x; θπ(θdθ when π(θ H 1 is a pdf, in this simple-composite comparison. Example X 1,..., X n are iid N(θ, σ, with σ known. H 0 : θ = 0, H 1 : θ H 1 N(µ, τ. Bayes factor is P 0 /P 1, where ( P 0 = (πσ n/ exp 1 i x σ ( P 1 = (πσ n/ exp 1 (xi σ θ (πτ 1/ (θ µ exp ( τ dθ. Completing the square in P 1 and integrating dθ, 1/ ( P 1 = (πσ n/ σ nτ + σ [ exp 1 { n nτ + σ ( x µ + 1 }] (xi σ x Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

37 So B = (1 + nτ 1/ [ σ exp 1 { }] n x σ n nτ ( x µ + σ Defining t = n x/σ, η = µ/τ, ρ = σ/(τ n, this can be written as B = ( / [ ρ exp 1 { }] (t ρη 1 + ρ η Lecture 9 : Decision theory This example illustrates a problem choosing the prior. If we take a diffuse prior, for ρ so that ρ 0, then B, giving overwelming support for H 0. This is an instance of Lindley s paradox. The point here is that B compares the models θ = θ 0 and θ π( H 1, not the sets θ 0 against θ \ {θ 0 }. If the π(θ H 1 -prior becomes very diffuse then the average likelihood (ie P(H 1 x, the marginal likelihood, which is the denominator of B goes to zero, while P(H 0 x = L(θ 0 ; x is fixed. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Decision theory Decision Theory Example You have been exposed to a deadly virus. About 1/3 of people who are exposed to the virus are infected by it, and all those infected by it die unless they receive a vaccine. By the time any symptoms of the virus show up, it is too late for the vaccine to work. You are offered a vaccine for 500. Do you take it or not? The most likely scenario is that you don t have the virus but basing a decision on this ignores the costs (or loss associated with the decisions. We would put a very high loss on dying! Decision Theory sits above Bayesian and classical statistical inference and gives us a basis to compare different approaches to statistical inference. We make decisions by applying rules to data. Decisions are subject to risk. A risk function specifies the expected loss which follows from the application of a given rule, and this is a basis for comparing rules. We may choose a rule to minimize the maximum risk, or we may choose a rule to minimize the average risk. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

38 Terminology θ is the true state of nature, X f (x; θ is the data. The Decision rule is δ. If X = x, adopt the action δ(x given by the rule. Example A single parameter θ is estimated from X = x by δ(x = θ(x. Example L S (θ, θ(x is the loss function which increases for θ(x being away from θ. Here are three common loss functions. Zero-One loss { L S (θ, θ(x 0 θ(x θ < b = a θ(x θ b where a, b are constants. The rule θ is the functional form of the estimator. The action is the value of the estimator. The Loss function L S (θ, δ(x measures the loss from action δ(x when θ holds. Absolute error loss where a > 0. Quadratic loss L S (θ, θ(x = a θ(x θ L S (θ, θ(x = ( θ(x θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example Risk function - b a θ b Zero- One Loss Definition The risk function R(θ, δ is defined as R(θ, δ = L S (θ, δ(xf (x; θdx, Absolute error loss This is the expected value of the loss (aka expected loss. θ Example In the context of point estimation, with Quadratic Loss, the risk function is the mean square error, R(θ, θ = E[( θ(x θ ]. Quadra3c loss θ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

39 Admissible and inadmissible rules. Definition A procedure δ 1 is inadmissible if there exists another procedure δ such that R(θ, δ 1 R(θ, δ, for all θ Θ with R(θ, δ 1 > R(θ, δ for at least some θ. A procedure which is not inadmissible is admissible. Example Suppose X U(0, θ. Consider estimators of the form θ(x = ax (this is a family of decisions rules indexed by a. Show that a = 3/ is a necessary condition for the rule θ to be admissible for quadratic loss. R(θ, θ = and R is minimized at a = 3/. θ 0 (ax θ 1 θ dx = (a /3 a + 1θ This does not show θ(x = 3x/ is admissible here. Jonathan Marchini (University of Oxford BSa MT / 8 It only shows that all estimators with a 3/ are inadmissible. The estimator θ(x = 3x/ maybe inadmissible relative to other estimators not in this class. Jonathan Marchini (University of Oxford BSa MT / 8 Minimax rules Definition A rule δ is a minimax rule if max θ R(θ, δ max θ R(θ, δ for any other rule δ. It minimizes the maximum risk. Since minimax minimizes the maximum risk (ie, the loss averaged over all possible data X f the choice of rule is not influenced by the actual data X = x (though given the rule δ, the action δ(x is data-dependent. It makes sense when the maximum loss scenario must be avoided, but can can lead to poor performance on average. Bayes risk, Bayes rule, expected posterior loss Definition Suppose we have a prior probability π = π(θ for θ. Denote by r(π, δ = R(θ, δπ(θdθ the Bayes risk of rule δ. A Bayes rule is a rule that minimizes the Bayes risk. A Bayes rule is sometimes called a Bayes procedure. L(x; θπ(θ Let π(θ x = denote the posterior following from likelihood h(x L and prior π. Definition The expected posterior loss is defined as L S (θ, δ(xπ(θ xdθ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

40 Lemma A Bayes rule minimizes the expected posterior loss. Proof R(θ, δπ(θdθ = = = L S (θ, δ(xl(θ; xπ(θdx dθ L S (θ, δ(xπ(θ xh(xdx dθ ( h(x L S (θ, δ(xπ(θ xdθ dx That is for each x we choose δ(x to minimize the integral L S (θ, δ(xπ(θ xdθ Bayes rules for Point estimation Zero-One loss where a, b are constants. We need to minimize L S (θ, θ(x = π(θ xl S (θ, θdθ = a { 0 θ(x θ < b a 1 That is we want to maximize θ+b θ b θ+b θ+b θ(x θ b θ b π(θ xdθ + a θ b π(θ xdθ π(θ xdθ π(θ xdθ Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Zero-one loss If π(θ x is unimodal the maximum is attained by choosing θ to be the mid-point of the interval of length b for which π(θ b has the same value at both ends. Absolute error loss Absolute error loss L S (θ, θ(x = a θ(x θ We need to minimise θ θ π(θ xdθ = θ ( θ θπ(θ xdθ + θ (θ θπ(θ xdθ. θˆ As b 0, θ the global mode of the posterior distribution. If π(θ x is unimodal and symmetric, the optimal θ is the median (equal to the mean and mode of the posterior distribution. Differentiate wrt θ and equate to zero. θ π(θ xdθ θ π(θ xdθ = 0 That is, the optimal θ is the median of the posterior distribution. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

41 Quadratic loss Summary Minimize E θ x [( θ θ ] = E θ x [( θ θ + θ θ ] where θ is the posterior mean of θ. = ( θ θ + (ˆθ θe θ x (θ θ + E θ x [(θ θ ] Note ˆθ and θ are constants in the posterior distribution of θ so that (ˆθ θe(θ θ = 0. So E θ x [( θ θ ] = ( θ θ + V θ x (θ The form of the Bayes rule depends upon the loss function in the following way Zero-one loss (as b 0 leads to the posterior mode. Absolute error loss leads to the posterior median. Quadratic loss leads to the posterior mean. Note These are not the only loss functions one could use in a given situation, and other loss functions will lead to different Bayes rules. The Quadratic loss function is minimized when θ = θ, the posterior mean. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example X Binomial (n, θ, and the prior π(θ is a Beta (α, β distribution. E(θ = α α + β The distribution is unimodal if α, β > 1 with mode (α 1/(α + β. The posterior distribution of θ x is Beta (α + x, β + n x. Lecture 10 : Finding minimax rules. Hypothesis testing with loss functions. With zero-one loss and b 0 the Bayes estimator is (α + x 1/(α + β + n. For a quadratic loss function, the Bayes estimator is (α + x/(α + β + n. For an absolute error loss function is the median of the posterior. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

42 Minimax rules Definition A rule δ is a minimax rule if max θ R(θ, δ max θ R(θ, δ for any other rule δ. It minimizes the maximum risk. Sometimes this doesn t produce a sensible choice of decision rule. Risk function d d 1 Risk function Jonathan Marchini (University of Oxford BSa MT / 8 d d 1 Finding minimax rules Theorem 1 If δ is a Bayes rule for prior π, with r(π, δ = C, and δ 0 is a rule for which max θ R(θ, δ 0 = C, then δ 0 is minimax. Proof If for some other rule δ, max θ R(θ, δ = C ɛ for some ɛ > 0 (so δ 0 is not minimax, then r(π, δ = R(θ, δ π(θdθ (C ɛπ(θdθ = (C ɛ < r(π, δ so δ is not the Bayes rule for π, a contradiction. [This is an informal treatment which assumes the min and max exist - see Y&S Ch Sec.6] Jonathan Marchini (University of Oxford BSa MT / 8 Finding minimax rules Finding minimax rules Theorem If δ is a Bayes rule for prior π with the property that R(θ, δ does not depend on θ, then δ is minimax. Proof (Y&S Ch Let R(θ, δ = C (no θ dependence. This implies r(π, δ = R(θ, δπ(θdθ = C. If for some other rule δ, max θ R(θ, δ = C ɛ for some ɛ > 0 (so δ 0 is not minimax, then we have r(π, δ C ɛ. But r(π, δ = C so δ is not the Bayes rule for π, a contradiction. Theorem tells us that The Bayes estimator with constant risk is minimax. This result is useful, as it gives an approach to finding minimax rules. Bayes rules are sometimes easy to compute, so if we find a prior that yields a Bayes rule with constant risk for all θ we have the minimax rule. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

43 Example : minimax estimator for quadratic loss X is Binomial (n, θ, and the prior π(θ is a Beta (α, β distribution. For a quadratic loss function, the Bayes estimator is (α + X/(α + β + n The risk function is E X θ [( θ θ ] = MSE( θ = [Bias( θ] + Var[ θ] [ ( ] α + X = θ E + Var α + β + n [ ( ] α + nθ = θ + α + β + n = [θ(α + β α] + nθ(1 θ [α + β + n] [ α + X ] α + β + n nθ(1 θ (α + β + n The Bayes estimator with constant risk is minimax. This occurs when α = β = n/, so the minimax estimator using quadratic loss is (α + x/(α + β + n = (x + n//(n + n. Jonathan Marchini (University of Oxford BSa MT / 8 Hypothesis testing with loss functions Suppose X i f (x; θ iid for i = 1,,..., n and we wish to test H 0 : θ = θ 0 against H 1 : θ = θ 1. The decision rule is δ C when we use a critical region C so δ C (x = H 1 if x C and otherwise δ C (x = H 0. Loss function: a θ = θ 0, x C 0 θ = θ L S (θ, δ C (x = 0, x C b θ = θ 1, x C 0 θ = θ 1, x C so the loss for Type I error is a (H 0 holds and we accept H 1 and the loss for Type II error is b (H 1 holds and we accept H 0. Jonathan Marchini (University of Oxford BSa MT / 8 Hypothesis testing with loss functions The risk function for the rule δ C is R(θ 0 ; δ C = L S (θ 0, δ C (xf (x; θ 0 dx = ai(x Cf (x; θ 0 dx = aα as α = P(X C H 0 is the probability for Type I error. Similarly R(θ 1 ; δ C = bβ, as β = P(X C H 1 is the probability for a Type II error. Note α and β are functions of the critical region C. Hypothesis testing with loss functions To calculate the Bayes risk we need a prior. Let π(θ 0 = p 0 and π(θ 1 = p 1 be the prior probabilities that H 0 and H 1 hold. The Bayes risk is r(π, δ C = θ {θ 0,θ 1 } R(θ; δ C π(θ = p 0 aα(c + p 1 bβ(c. The Bayes test chooses the critical region C to minimize the Bayes risk. Question How do we do this? Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

44 Hypothesis testing with loss functions To answer this question, let s revise what we know about Frequentist hypothesis testing. Neyman-Pearson lemma The best test of size α of H 0 vs H 1 is a likelihood ratio test with critical region { C = x; L(θ } 1; x L(θ 0 ; x A for some constant A > 0 chosen so that P(X C H 0 = α. By best test we mean the test with the highest power, where Power = 1 - Type II error. Bayes tests The following theorem tells us how to choose the critical region to minimize the Bayes risk, in the same way that the Neyman-Pearson lemma tells us how to maximize the power at fixed size. Theorem The critical region for the Bayes test is the critical region for a LR test with A = p 0a p 1 b Every LR test is a Bayes test for some p 0, p 1. In frequentist statistics we fix the Type I error (α and this determines the value of A, thus defining the critical region for the test. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Proof The Bayes test minimizes p 0 aα + p 1 bβ = p 0 ap(x C H 0 + p 1 bp(x C H 1 = p 0 a L(θ 0 ; xdx + p 1 b L(θ 1 ; xdx C C [ ] = p 0 a L(θ 0 ; xdx + p 1 b 1 L(θ 1 ; xdx C C [ ] = p 1 b + p 0 al(θ 0 ; x p 1 bl(θ 1 ; x dx C So choose C such that x C iff p 0 al(θ 0 ; x p 1 bl(θ 1 ; x 0, i.e. when L(θ 1 ; x L(θ 0 ; x p 0a p 1 b Example Let X 1,..., X n be N(µ, σ with σ known, and we want to test H 0 : µ = µ 0 vs H 1 : µ = µ 1, with µ 1 > µ 0. The critical region for the LR test L(θ 1 ; x L(θ 0 ; x A becomes x σ log(a n(µ 1 µ (µ 0 + µ 1 = B( say In the classical case we ignore the exact value of B, but in a Bayes test A = p 0 a/(p 1 b and we substitute into B. As an example take µ 0 = 0, µ 1 = 1, σ = 1, n = 4, a =, b = 1, p 0 = 1/4, p 1 = 3/4 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

45 Then the Bayes test has critical region x 1 4 log( = 1 4 log( = For this test (using the fact that X is N(µ, 1/4 α = P( X µ = 0, σ /n = 1/4 = 0.1 Lecture 11 : Stein s paradox and the James-Stein estimator. Empirical Bayes and β = P( X < µ = 1, σ /n = 1/4 = In a classical approach fixing α = 0.05, B = /4 = 0.8, so β = P( X < 0.8 µ = 1, σ = 1/4 = In the Bayes test α has been increased and β decreased. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Stein s paradox and the James-Stein Estimator Stein s paradox has been described (Efron, 199 as the most striking theorem of post-war mathematical statistics! Theorem Let X i N(µ i, 1, i = 1,,..., p be jointly independent so we have one data point for each of the p µ i -parameters. Let X = (X 1,..., X p and µ = (µ 1,..., µ p. The MLE ˆµ for µ is ˆµ MLE = X. This estimator is inadmissable for quadratic loss i.e. Risk = MSE. The result is surprising since the estimator X is the MLE and also MVUE. An estimator with lower risk is given by the James-Stein estimator ( ˆµ JSE = 1 a i X i X Jonathan Marchini (University of Oxford BSa MT / 8 Implications of Stein s Paradox Suppose we are interested in estimating 1 the weight of a randomly chosen loaf of bread from a supermarket. the height of a random chosen blade of grass from a garden. 3 the speed of a randomly chosen car as it passes a speed camera. These are totally unrelated quantities. It seem implausible that by combining information across the data points that we might end up with a better way of estimating the vector of three parameters. The James-Stein estimator tells us that we can get a better estimate (on average for the vector of three parameters by simultaneously using the three unrelated measurements. Jonathan Marchini (University of Oxford BSa MT / 8

46 Proof Consider the alternative estimator ( ˆµ JSE = 1 a i X i X (the James-Stein estimator Note This estimator shrinks X towards 0 (when i X i > a. We will show that if a = p then R(µ, ˆµ JSE < R(µ, ˆµ MLE for every µ R n, so that the MLE is inadmissable in this case. First, the risk for ˆµ MLE is R(µ, ˆµ MLE = p E( µ i ˆµ MLE,i = recognizing Var(X i = 1. p E( µ i X i = p Jonathan Marchini (University of Oxford BSa MT / 8 Stein s Lemma In order to calculate R(µ, ˆµ JSE, it is convenient to use Stein s Lemma, for Normal RV, ( h(x E((X i µh(x = E. X i This can be shown by integrating by parts. Noting (x i µe (x i µ i / dx = e (x i µ i / we have (x i µ i h(xe (x i µ i / dx i = h(xe (x i µ i / + x i = x i = h(x e (x i µ i / dx i x i The first term is zero if h(x (for eg is bounded, giving the lemma. Jonathan Marchini (University of Oxford BSa MT / 8 Proof (continued R(µ, ˆµ JSE = ( p E( µ i ˆµ i with ˆµ i = ( E( µ i ˆµ i = E( µ i X i (X i µ i X i ae j X j ( ( (X i µ i X i E j X j = E = E ( +a E X i ( j X X i X i j X j ( j X j Xi ( j X j j Stein s lemma 1 a ( 1 = E j X j i X i X i X i ( j X j Jonathan Marchini (University of Oxford BSa MT / 8 Putting the pieces together, ( p E( µ i ˆµ i 1 = R(µ, ˆµ MLE (ap 4aE = p (a(p a E ( 1 j X j j X j and this is less than p if ap 4a a > 0 and in particular at a = p, which minimizes the risk over a R. ( + a 1 E j X j Note We have not shown that the James Stein estimator is admissible. Further reading Young and Smith sec 3.4 which covers the James-Stein estimator is worth reading. It includes a nice example (sec on the application of the estimator to estimation of baseball home run rates. Jonathan Marchini (University of Oxford BSa MT / 8

47 The risk of the James-Stein estimator Remember R(µ, ˆµ MLE = p. When a = p If µ i = 0 X i N(0, 1 ( j X j χ p E R(µ, ˆµ JSE =. 1 j X j = 1/(p If µ i = λ X i = λ + Z i where Z i N(0, 1 and j X j λ + χ ( p E 0 as λ so R(µ, ˆµ JSE p. 1 j X j So( we get a smaller difference between R(µ, ˆµ MLE and R(µ, ˆµ JSE as E gets smaller. 1 j X j Exercise A better estimator is X1 ( p + 1 a V (X X1 p where V = p j 1 (X j X and 1 p is p-dimensional vector of 1 s. Empirical Bayes Bayes estimators have good risk properties (for example, the posterior mean is usually admissible for quadratic loss. However, Bayes estimators may be hard to compute (for example, the posterior mean is an integral, or sum, particularly for hierarchical models. In Empirical Bayes, we use Bayesian reasoning to find estimators which can then be used in classical frequentist. EB uses a particular strategy to simplify hierarchical models. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Empirical Bayes Recall the setup for Bayesian inference for hierarchical models. X f (x; θ θ π(θ; ψ ψ g(ψ Empirical Bayes The EB trick is to avoid doing ψ-integrals by replacing ψ with an estimate ˆψ, derived from the data, and consider the model X f (x; θ θ π(θ; ˆψ Our prior for θ has a parameter ψ which also has a prior. The posterior is π(θ, ψ x L(θ; xπ(θ; ψg(ψ If we want minimim risk for quadratic loss (for eg we should use ˆθ = θπ(θ, ψ xdθdψ This EB approximation to the full posterior chops off a layer of the hierarchy. The reduced model has posterior ˆπ(θ x L(θ; xπ(θ; ˆψ, and a Bayes estimator ˆθ EB is calculated using ˆπ(θ x. For example, for quadratic loss, ˆθ = θˆπ(θ xdθ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

48 Empirical Bayes Example The James-Stein estimator is an EB-estimator We still need an estimator for ψ. There are several choices. We can use the MLE ˆψ = arg max ψ p(x ψ for ψ in the marginal likelihood p(x ψ = L(θ; xπ(θ; ψdθ. Method of moments estimators are used also. Data x i N(θ i, 1, i = 1,..., p (so one observation x i for each parameter θ i. The MLE for θ i is simply ˆθ MLE,i = x i. Construct an EB estimator for quadratic loss. Suppose the prior is θ i N(0, τ (we have some freedom here, as we are mainly interested in the risk-related properties of the final estimator. If we knew τ we have (completing the square - see Lecture 7 p13 θ i (x i, τ N ( xi τ 1 + τ, τ 1 + τ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example To get an estimate for τ we compute the marginal distribution for X i given τ, which is X i N(0, τ + 1. The MLE for τ is then ˆτ = 1 p i X i 1, and this gives ˆθ EB,i = X i ˆτ which is the James-Stein estimator ( = ˆθ JS,i = ( 1 + ˆτ 1 p i X i 1 a i X i with a = p. This isn t the minimum risk JS estimator for quadratic loss, (that is a = p but it already beats the MLE for all θ. (to get the best JS estimator (with a = p use a method of moments estimator for τ. See Young and Smith Section 3.5 Jonathan Marchini (University of Oxford BSa MT / 8 X i X i Example Data x i Poisson(θ i, i = 1,..., n (so one observation x i for each parameter θ i. The MLE for θ i is simply ˆθ MLE,i = x i. Construct an EB estimator for quadratic loss. Suppose the prior for θ i s is iid Exponential(λ i.e. π(θ i λ = λe λθ i. p(x i λ = = 0 ( λ e θ i θ x i i x i! λe λθ i dθ i xi λ 1 + λ given λ the x i s are iid geometric (λ/(1 + λ. The MLE of λ based on x 1,..., x n is λ = 1/ x, where x = 1 n n 1 x i. Jonathan Marchini (University of Oxford BSa MT / 8

49 Now, under the EB simplification, set λ = ˆλ, so that ˆπ(θ x L(θ; xπ(θ ˆλ = i e θ i θ x i i ˆλe ˆλθ i and we recognise θ i x Γ(x i + 1, ˆλ + 1 in this EB approximation. This leads to an estimator ˆθ EB,i = θ i ˆπ(θ i xdθ i Lecture 1 : Markov Chain Monte Carlo. The Metropolis algorithm. = x i + 1 ˆλ + 1 = x x i + 1 x + 1 We can rewrite this ˆθ EB,i = x i + x x + 1 ( x x i showing that this EB estimator shrinks the estimates towards the common mean. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Computationally intensive methods In Bayesian inference, the things we need to know are often given in terms of intractable integrals or sums. If x f (x; θ and we have a prior π(θ on θ Θ and we want to measure the evidence for θ S, S Θ then we might calculate P(θ S x = π(θ xdθ where π(θ x f (x; θπ(θ is the posterior density. The posterior mean is commonly the most useful point estimate. There we have to compute E θ x θ = θπ(θ xdθ. Θ In previous examples our choice of prior has kept calculations doable. Jonathan Marchini (University of Oxford BSa MT / 8 S Monte Carlo estimation P(θ S x = E θ x I(θ S and E θ x θ are both averages. If we have N iid samples θ (i π(θ x, i = 1,,..., N distributed according to the posterior density and we want to estimate a finite mean µ f = E θ x f (θ for some given function f, and we take then fn = 1 f (θ (i N fn P µf, that is, f N is consistent for estimation of µ f (in fact there is usually a CLT. But how do we generate the samples θ (i π(θ x, i = 1,,..., N? Jonathan Marchini (University of Oxford BSa MT / 8 i

50 Markov Chain Monte Carlo estimation The idea in MCMC is to simulate an ergodic Markov Chain θ (i i = 1,,..., N that has π(θ x as its stationary distribution. Remember a discrete time Markov chain is specified by it s transition function P(θ, θ = P(θ (t+1 = θ θ (t = θ Given θ (0 we use the transition function to draw θ (1, then θ ( given θ (1, then θ (3 given θ ( etc. So we obtain a sequence θ (0, θ (1, θ (, θ (3, θ (4, θ (5, θ (6, θ (7,... The sequence of simulated θ-values θ (i i = 1,,..., N is correlated. Note In what follows we will refer to π(θ x as the target, and we will drop the x while we are setting up the theory. Jonathan Marchini (University of Oxford BSa MT / 8 Markov Chain Monte Carlo estimation However, the ergodic theorem for Markov chains from Part A tells us that 1 N f (θ (i P µ f i so we can use the Markov Chain sequence to form estimates. The key here is that we can often write down an ergodic Markov chain that is easy to simulate, and has stationary distribution π(θ x. Question If we start from π(θ x how do we choose the transition function P(θ, θ? We will learn about two methods to do this 1 The Metropolis algorithm Gibbs sampling Jonathan Marchini (University of Oxford BSa MT / 8 Metropolis algorithm For target π(θ x. Suppose θ (t = θ. Simulate a value for θ (t+1 as follows. 1 Simulate a candidate state θ q(θ θ i.e. a simple random change to the state satisfying q(θ θ = q(θ θ With probability α(θ θ, set θ (t+1 = θ (accept and otherwise set θ (t+1 = θ (reject. Here { } α(θ θ = min 1, π(θ x π(θ x The overall transition probability from θ to θ is P(θ, θ = q(θ θα(θ θ Example In a population of m individuals N are red and m N are blue, with unknown true value N unknown. A sample of k individuals is taken, with replacement, and X are red. The prior on N is uniform 0 to m. The posterior for N = n is π(n x n x (m n k x, n {0, 1,,..., m} In this example we can work out posterior probabilities exactly. Let us use MCMC to estimate posterior probabilities, and check it matches the exact result (within Monte Carlo error. We need to write an MCMC algorithm simulating a Markov chain N (t, t = 0, 1,,... with the property that N (t π(n x. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

51 Example Metropolis algorithm targeting π(n x. See R code. Suppose N (t = n. Simulate a value for N (t+1 as follows. 1 Toss a coin and set n = n + 1 on a Head and otherwise n = n 1. So q(n + 1 n = q(n n + 1 = 1/. With probability { } α(n n = min 1, π(n x π(n x set N (t+1 = n and otherwise set N (t+1 = n (reject. Note that α = 0 if n = 1, m + 1 (since π(n x = 0 in those cases and otherwise α(n n = min {1, (n x (m n k x } n x (m n k x. Example Here is a plot of showing the results of running the Markov chains for 10,000 iterations. (m = 0, x = 3, k = 10 N t Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Example Here is a plot showing the empirical density of the 10,000 samples, together with the true density in red. You can see that approximation is good. Density Histogram of N N Why does it work? Theorem If θ (t, t = 0, 1,,... is an irreducible, aperiodic Markov chain on a finite space Θ, satisfying detailed balance with respect to a target distribution π(θ on Θ, and f : Θ R is a function with finite mean and variance in π, then 1 f (θ (i P E π f (θ N i [This is essentially a watered down version of the Ergodic Theorem of Part A, with positive recurrence replaced by detailed balance] Suppose Θ is finite. Let P (n (θ, θ = P(θ (t+n = θ θ (t = θ give the n-step transition probability in a homogeneous Markov chain with transition probability P(θ, θ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

52 Markov Chains Irreducible θ, θ Θ, n such that P (n (θ, θ > 0, Aperiodic Detailed balance θ Θ, gcd{n : P (n (θ, θ > 0} = 1 π(θp(θ, θ = π(θ P(θ, θ If detailed balance is satisfied the it is easy to show that the chain is a stationary distribution π(θp(θ, θ = π(θ. θ Θ Jonathan Marchini (University of Oxford BSa MT / 8 Metropolis algorithm Let us check DB. Assume WLOG π(θ π(θ. P(θ, θ π(θ = q(θ θα(θ θπ(θ = q(θ θ min { } 1, π(θ π(θ π(θ = q(θ θπ(θ { = q(θ θ min 1, π(θ } π(θ π(θ = P(θ, θπ(θ, { } { since min 1, π(θ = π(θ and min 1, π(θ } π(θ π(θ = 1. We have shown that the transition probability for the Metropolis update has π(θ as a stationary distribution. If the chain is also irreducible and aperiodic, it will generate the samples we need. These last two conditions have to be checked separately for each application (but are often obvious. Jonathan Marchini (University of Oxford BSa MT / 8 Example In Q4 of PS3, we wrote down the posterior distribution for a hierarchical model, but were unable to do anything with it, as the integrals were intractable. The observation model is X i B(n i, p i, i = 1,,..., k, the prior π(p i a, b is Beta(a, b and the prior on a, b is the improper prior π(a, b 1/ab on a, b > 0. The posterior density is π(p, a, b x 1 ab k i (1 p i n i x i p a 1 i (1 p i b 1, B(a, b p x i with B(a, b = Γ(aΓ(b Γ(a + b. Note we used an improper prior - we need to check this is a proper posterior distribution. Example Use Monte Carlo to sample (p (t, a (t, b (t π(p, a, b x, t = 1,,..., T and use the samples to estimate any posterior expectations of interest. For example, if we want to estimate a using the posterior mean â = 1 a (t. T or the posterior probability that p 1 > c, π(p 1 > c x = 1 T t t I(p (t 1 > c. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

53 Example Metropolis algorithm targeting π(p, a, b x. See R code. Suppose θ (t = θ = (p, a, b. Simulate θ (t+1 as follows. 1 Pick a component of θ, one of (p 1,,..., p k, a or b and propose an updated value. i If we choose p i, propose p i U(0, 1. ii If we choose a propose a U(a w, a + w iii If we choose b propose b U(b w, b + w with w > 0 a fixed number. This generates a new state θ with one component changed. Clearly, q(θ θ = q(θ θ. Adjustment to avoid negative a and b. With probability α(θ θ set θ (t+1 = θ and otherwise set θ (t+1 = θ (reject. Example Exercise Verify the following formulae. If we update a p i (so θ = ((p 1,..., p i 1, p i, p i+1,..., p k, a, b { } α(θ θ = min 1, (p i a+x i 1 (1 p i b+n i x i 1 (p i a+x i 1 (1 p i b+n i x i 1 whilst if we update a (so that θ = (p, a, b { } α(θ θ = min 1, (p 1p...p k a B(a, b k a (p 1 p...p k a B(a, b k a and finally b (where θ = (p, a, b { α(θ θ = min 1, [(1 p 1(1 p...(1 p k ] b B(a, b k [(1 p 1 (1 p...(1 p k ] b B(a, b k b b } Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 How long to run the MCMC? Here are 10,000 samples from Example 1 with m = 00, x =, k = 30. Lecture 13 : MCMC convergence. The Metropolis-Hastings algorithm. The Gibbs sampler. N e+00 e+04 4e+04 6e+04 8e+04 1e+05 t Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

54 Convergence and burn-in iterations The Markov Chain theory that underlies MCMC tells us about the long run properties of the Markov Chain we have constructed i.e. that the chain will converge to it s stationary distribution. Often we will throw away some of the samples at the beginning of the chain. These samples are known as the burn-in iterations. The number of burn-in iterations has to be decided upon by the user, but plots of the Markov chain can often help. Plotting the Markov chain allows us to visually observe when the chain looks to have converged. A better alternative is to run multiple chains from a diverse set of starting points. If all the chains produce the same answer then this can be good evidence that the chains ave converged. Jonathan Marchini (University of Oxford BSa MT / 8 Example Consider a Metropolis algorithm for simulating samples from a bivariate Normal distribution with mean µ = (0, 0 and covariance ( Σ = using a proposal distribution for each component x i x i U(x w, x + w i {1, } The performance of the algorithm can be controlled by setting w. If w is small then we propose only small moves and the chain will move slowly around the parameter space. If w is large then it is large then we may only accept a few moves. There is an art to implementing MCMC that involves choice of good proposal distributions. Jonathan Marchini (University of Oxford BSa MT / 8 Example N = 4 w = 0.1 T = 100 Example N = 4 w = 0.1 T = Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

55 Example Example N = 4 w = 0.01 T = N = 4 w = 0.1 T = Jonathan Marchini (University of Oxford 0 BSa 4 MT / 8 Example Jonathan Marchini (University of Oxford 0 BSa 4 MT / 8 Example N = 4 w = 10 T = N = 4 w = 0.01 T = 1e+05 4 Jonathan Marchini (University of Oxford 0 BSa 4 MT / 8 Jonathan Marchini (University of Oxford 0 BSa 4 MT / 8

56 The Metropolis-Hastings algorithm The Metropolis-Hastings (MH algorithm generalises the Metropolis algorithm in two ways 1 The proposal distribution does not need to be symmetric i.e. we can have q(θ θ q(θ θ To correct for any asymmetry in the jumping rule the acceptance ratio becomes α(θ θ = min { 1, π(θ x/q(θ } θ π(θ x/q(θ θ { = min 1, π(θ x π(θ x q(θ θ } q(θ θ Choosing a proposal distribution A good proposal distribution has the following properties 1 For any θ, it is easy to sample from q(θ θ. It is easy to compute the acceptance ratio. 3 Each proposal goes a reasonable distance in the parameter space (otherwise the chain moves too slowly. 4 The proposals are not rejected too often (otherwise the chain spends too much time at the same positions. Allowing asymmetric proposals can help the chain to converge faster. Exercise Prove detailed balance for the MH algorithm. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 The Gibbs sampler A Gibbs sampler is a particular type of MCMC algorithm that has been found to be very useful in high dimensional problems. Suppose we wish to sample from π(θ where θ = (θ 1,..., θ d. Each iteration of the Gibbs sampler occurs as follows 1 An ordering of the d components of θ is chosen. For each component in this ordering, θ j say, we draw a new value is sampled from the full conditional distribution given all the other components of θ. θ t j π(θ j θ t 1 j where θ t 1 j represents all the components of θ, except the for θ j, at their current values Example Consider a single observation y = (y 1, y from a bivariate Normal distribution ( with unknown mean θ = (θ 1, θ and known covariance 1 ρ Σ =. With a uniform prior on θ, the posterior distribution is ρ 1 θ y N(y, Σ In this case the posterior is tractable but we will consider how to construct a Gibbs sampler. We need the two conditional distributions π(θ 1 θ, y and π(θ θ 1, y. θ t 1 j = (θ t 1,..., θt j 1, θt 1 j+1,..., θt 1 d. Jonathan Marchini (University of Oxford BSa MT 013 / 8 Jonathan Marchini (University of Oxford BSa MT / 8

57 Example Example π(θ 1 θ, y π(θ 1, θ y { Similarly, exp 1 (θ yt Σ 1 (θ y } { ( T ( ( } 1 θ1 y exp 1 1 ρ θ1 y 1 (1 ρ θ y ρ 1 θ y } {(θ 1 y 1 + ρ(θ 1 y 1 (θ y + (θ y = exp exp { } 1 (1 ρ (θ 1 (y 1 + ρ(θ y θ 1 θ, y N(y 1 + ρ(θ y, 1 ρ θ θ 1, y N(y + ρ(θ 1 y 1, 1 ρ N = 1 T = Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Deriving full conditional distributions The Gibbs sampler requires us to be able to obtain the full conditional distributions of all components of the parameter vector θ. To determine π(θ j θ t 1 j we write down the posterior distribution and consider it as a function of θ j. If we are lucky then we will be able to spot the distribution for θ j θ t 1 j (using conjugate priors often helps. If the full conditional is not of a nice form then this component could be sampled using rejection sampling or a MH update could be used instead of a Gibbs update. Jonathan Marchini (University of Oxford BSa MT / 8 Gibbs sampling and Metropolis-Hastings Gibbs sampling can be viewed as a special case of the MH algorithm with proposal distributions equal to the full conditionals. The MH acceptance ratio is min(1, r where r = π(θ /q(θ θ t 1 π(θ t 1 /q(θ t 1 θ We have θ t 1 = (θ 1,..., θ j,..., θ d = (θ j, θ t 1 j and we use proposal q(θ θ t 1 = π(θ j θt 1 j so that θ = (θ 1,..., θ j,..., θ d = (θ j, θt 1 j and. q(θ t 1 θ = π(θ j θ t 1 j Jonathan Marchini (University of Oxford BSa MT / 8

58 Gibbs sampling and Metropolis-Hastings Then the MH acceptance ratio is min(1, r where r = π(θ /q(θ j θt 1 π(θ t 1 /q(θ t 1 j θ = π(θ /π(θ j θt 1 j π(θ t 1 /π(θ j θ t 1 j But the numerator can be simplified π(θ π(θ j θt 1 j = π(θ j, θt 1 j j π(θ j θt 1 j = π(θt 1 Similarly the denominator can be simplified π(θ t 1 π(θ j θ t 1 j = π(θ j, θ t 1 j π(θ /π(θ j θt 1 j = π(θ t 1 j π(θ t 1 /π(θ j θ t 1 j π(θ t 1 j π(θ j θ t 1 j = π(θt 1 So we have r = = 1 j Thus, every proposed move is accepted. Jonathan Marchini (University of Oxford BSa MT / 8 Lecture 14 : Variational Bayes Jonathan Marchini (University of Oxford BSa MT / 8 A key quantity in Bayesian inference is the marginal likelihood of a set of data D given a model M P(D M = P(D θ, MP(θ Mdθ where θ are the parameters of the model and P(θ M is the prior on the parameters of the model. The integral required to calculate the marginal likelihood of a given model is not always tractable for some models, so an approximation can be desirable. Variational Bayes (VB is one possibility, that has become reasonably popular recently. Variational Bayes The idea of VB is to find an approximation Q(θ to a given posterior distribution P(θ D, M. That is Q(θ P(θ D, M where θ is the vector of parameters of the model M. We then use Q(θ to approximate the marginal likelihood. In fact, what we do is find a lower bound for the marginal likelihood. Question How to find a good approximate posterior Q(θ? Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

59 Kullback-Liebler (KL divergence The strategy we take is to find a distribution Q(θ that minimizes a measure of distance between Q(θ and the posterior P(θ D, M. Definition The Kullback-Leibler divergence KL(q p between two distributions q(x and p(x is [ q(x ] KL(q p = log q(xdx p(x N(µ, σ approximations to a Gamma(10,1 density µ = 10,σ = 4,KL = x density µ = 9.11,σ = 3.03,KL = x density x Exercise KL(q p 0 and KL(q p = 0 iff q = p. q(x p(x q(x * log(q(x/p(x Jonathan Marchini (University of Oxford BSa MT / 8 x density µ = 13,σ =.3,KL = x density µ = 9,σ =,KL = x Jonathan Marchini (University of Oxford BSa MT / 8 We consider the KL divergence betweem Q(θ and P(θ D, M [ Q(θ ] KL(Q(θ P(θ D, M = log Q(θdθ P(θ D, M [ Q(θP(D M ] = log Q(θdθ P(θ, D M [ p(θ, D M ] = log P(D M log Q(θdθ Q(θ The log marginal likelihood can then be written as log P(D M = F(Q(θ + KL(Q(θ P(θ D, M (1 ] Q(θdθ. where F(Q(θ = log [ P(θ,D M Q(θ Note Since KL(q p 0 we have that log P(D M F(Q(θ so that F(Q(θ is a lower bound on the log-marginal likelihood. Jonathan Marchini (University of Oxford BSa MT / 8 The mean field approximation We now need to ask what form that Q(θ should take? The most widely used approximation is known as the mean field approximation and assumes only that the approximate posterior has a factorized form Q(θ = Q(θ i i The VB algorithm iteratively maximises F(Q(θ with respect to the free distributions, Q(θ i, which is coordinate ascent in the function space of variational distributions. We refer to each Q(θ i as a VB component. We update each component Q(θ i in turn keeping Q(θ j j i fixed. Jonathan Marchini (University of Oxford BSa MT / 8

60 VB components Lemma The VB components take the form ( log Q(θ i = E Q(θ i log P(D, θ M + const Proof Writing Q(θ = Q(θ i Q(θ i where θ i = θ\θ i, the lower-bound can be re-written as [ P(θ, D M ] F(Q(θ = log Q(θdθ Q(θ [ P(θ, D M ] = log Q(θ i Q(θ i dθ i dθ i Q(θ i Q(θ i [ ] = Q(θ i log P(D, θ MQ(θ i dθ i dθ i Q(θ i Q(θ i log Q(θ i dθ i dθ i Q(θ i Q(θ i log Q(θ i dθ i dθ i F (Q(θ = = [ ] Q(θ i log P(D, θ MQ(θ i dθ i dθ i Q(θ i Q(θ i log Q(θ i dθ i dθ i Q(θ i Q(θ i log Q(θ i dθ i dθ i [ ] Q(θ i log P(D, θ MQ(θ i dθ i dθ i Q(θ i log Q(θ i dθ i Q(θ j log Q(θ j dθ j j i If we let Q (θ i = 1 Z exp [ log P(D, θ MQ(θ i dθ i ] where Z is a normalising constant and write H(Q(θ j = Q(θ j log Q(θ j dθ j as the entropy of Q(θ j then F (Q(θ = Q(θ i log Q (θ i Q(θ i dθ i + log Z + H(Q(θ j j i = KL(Q(θ i Q (θ i + log Z + j i H(Q(θ j Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 VB algorithm F(Q(θ = KL(Q(θ i Q (θ i + log Z + j i H(Q(θ j We then see that F(Q(θ is maximised when Q(θ i = Q (θ i as this choice minimises the Kullback-Liebler divergence term. Thus the update for Q(θ i is given by or [ ] Q(θ i exp log P(D, θ MQ(θ i dθ i log Q(θ i = E Q(θ i ( log P(D, θ M + const This implies a straightforward algorithm for variational inference: 1 Initialize all approximate posteriors Q(θ = Q(µQ(τ, e.g., by setting them to their priors. Cycle over the parameters, revising each given the current estimates of the others. 3 Loop until convergence. Convergence is checked by calculating the VB lower bound at each step i.e. [ P(θ, D M ] F(Q(θ = log Q(θdθ Q(θ The precise form of this term needs to be derived, and can be quite tricky. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

61 Example 1 Consider applying VB to the hierarchical model D i N(µ, τ 1 i = 1,..., P, µ N(m, (τβ 1 τ Γ(a, b Note We are using a prior of the form π(τ, µ = π(µ τπ(τ. Let θ = (µ, τ and assume Q(θ = Q(µQ(τ. We will use notation θ i = E Q(θ i θ i. log P(D, θ M = P log τ τ P (D i µ + 1 log τ τβ (µ m +(a 1 log τ bτ + K We can derive the VB updates one at a time. We start with Q(µ. Note We just need to focus on terms involving µ. ( log Q(µ = E Q(τ log P(D, θ M + C The log joint density is log P(D, θ M = P log τ τ P (D i µ + 1 log τ τβ +(a 1 log τ bτ + K (µ m = τ ( P (D i µ β(µ m + C where τ = E Q(τ (τ. We will be able to determine τ when we derive the other component of the approximate density, Q(τ. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 We can see this log density has the form of a normal distribution where log Q(µ = τ Thus Q(µ = N(µ m, β 1. ( P (D i µ β(µ m + C = β (µ m β = (β + P τ m = β 1( βm + P D i log P(D, θ M = P log τ τ P (D i µ + 1 log τ τβ (µ m +(a 1 log τ bτ + K The second component of the VB approximation is derived as ( log Q(τ = E Q(µ log P(D, θ M + C = ((P + 1/ + a 1 log τ τ P (D i µ β(µ m τ + C Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

62 We can see this log density has the form of a gamma distribution log Q(τ = ((P + 1/ + a 1 log τ τ P (D i µ β(µ m τ which is Γ(a, b where + C a = a + (P + 1/ b = b + 1 ( P Di µ + P µ + β ( m µ + µ Jonathan Marchini (University of Oxford BSa MT / 8 So overall we have 1 Q(µ = N(µ m, β 1 where Q(τ = Γ(τ a, b where β = (β + P τ ( m = β 1( P βm + D i (3 a = a + (P + 1/ b = b + 1 ( P Di µ + P µ + β ( m µ + µ To calculate these we need τ = a b µ = m µ = β 1 + m Jonathan Marchini (University of Oxford BSa MT / 8 Example 1 For this model the exact posterior was calculate in Lecture 6. [ }] π(τ, µ D τ α 1 exp τ {b + β (m µ where a = a + P + 1 b = b + 1 (Di D + 1 β = β + P m = β 1 (βm + P D i Pβ P + β ( D m We can compare the true and VB posterior when applied to a real dataset. We see that VB approximations underestimate posterior variances. Density µ True posterior VB posterior Density True posterior VB posterior We note some similarity between the VB updates and the true posterior parameters. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

63 General comments The property of VB underestimating the variance in the posterior is a general feature of the method, when there exists correlation between the θ i s in the posterior, which is usually the case. This may not be important if the purpose of inference is model comparison i.e. comparing the approximate marginal likelihoods between models. VB is often much, much faster to implement than MCMC or other sampling based methods. The VB updates and lower bound can be tricky to derive, and sometime further approximation is needed. The VB algorithm will find a local mode of the posterior, so care should be taken when the posterior is thought/known to be multi-modal. Laplace s method Consider the integral e Nf (x dx Expanding f (x around a maximum point x 0 we have f (x f (x 0 + f (x 0 (x x f (x 0 (x x 0 = f (x 0 1 f (x 0 (x x 0 and we can approximate the integral as { e Nf (x dx e Nf (x 0 exp N f (x 0 (x x 0 } dx Comparing to a normal pdf we have e Nf (x dx e Nf (x 0 (π 1 N 1 f (x 0 1 Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Laplace s method Laplace approximation For a d-dimensional function the analogue of this result is e Nf (x dx e Nf (x 0 (π d N d f (x 0 1 where f (x 0 is the determinant of the Hessian of the function evaluated at x 0. We apply Laplace s method to the integral that is the marginal likelihood in Bayesian inference, as follows P(θ D, M P(D θ, MP(θ Mdθ = { 1 exp N( N log P(D θ, M + 1 } N log P(θ M dθ here we have f (x = 1 N log P(D θ, M + 1 log P(θ M N Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

64 Laplace approximation We let θ be the (vector value of θ that maximizes the posterior θ = arg max P(θ D, M θ This is known as the maximum a posteriori (MAP estimate of θ. This leads to log P(θ D, M log P(D θ, M+log P( θ M+ d log π+1 log f ( θ 1 d log N Bayesian Information Criterion (BIC The Bayesian Information Criterion (BIC takes the approximation one step further, essentially by minimizing the impact of the prior. Firstly, the MAP estimate θ is replaced by the MLE ˆθ, which is reasonable if the prior has a small effect. Secondly, BIC only retains the terms that vary in N, since asymptotically the terms that are constant in N do not matter. Dropping the constant terms we get, log P(θ D, M log P(D ˆθ, M d log N Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Lecture 15 : The EM algorithm An expectation-maximization (EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP estimates of parameters in statistical models, where the model depends on unobserved latent variables. So called missing data problems are widespread in many application areas of statistics. We will consider the general set up where X = Observed data Z = Missing or unobserved data θ = Model parameters l(θ = log L(θ; X Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

65 Example Suppose we observed data X 1,..., X M from the following model p(z i = 1 = p(z i = = 1, X i Z i N(θZ i, 1 So we observe the X i s but not the Z i s or θ and want to estimate θ. Frequency Note If we knew the Z i s this would be an easy problem, but we don t...hence the need for the EM algorithm. x Overview of EM algorithm The EM algorithm is an iterative algorithm for maximising the log-likelihood l(θ that proceeds as follows 1 Set t = 0 and pick a initial value θ t. E-Step Calculate the expected log-likelihood of the full data likelihood [ ] log P(X, Z θ E p(z X,θt where the expectation is wrt p(z X, θ t i.e. the conditional distribution of the missing data Z given and observed data X and θ t. [ ] 3 M-Step Set θ t+1 = arg max θ E p(z X,θt log P(X, Z θ 4 Iterate steps and 3 until convergence. Question Why does this work? Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 l(θ = log P(X θ = log P(X, Z θ = log P(X, Z θ log P(Z X, θ P(Z X, θ Now take expectations wrt to a distribution q(z l(θ = log P(X, Z θq(z dz log P(Z X, θq(z dz = log P(X, Z θq(z dz log q(z q(z dz + log q(z q(z dz log P(Z X, θq(z dz [ ] = E q(z log P(X, Z θ + H(q(Z + KL(q(Z P(Z X, θ [ ] l(θ E q(z log P(X, Z θ + H(q(Z where H(q(Z = log q(z q(z dz is known as the Entropy of q(z. Jonathan Marchini (University of Oxford BSa MT / 8 E-Step Now set q(z = P(Z X, θ t. Then we have [ ] l(θ E P(Z X,θt log P(X, Z θ + H(q(Z with equality at θ = θ t i.e. we have a lower bound for l(θ. θ t Jonathan Marchini (University of Oxford BSa MT / 8 l( θ

66 M-Step We then obtain an updated estimate θ t+1 by maximizing this lower bound [ ] θ t+1 = arg max E P(Z X,θt log P(X, Z θ θ since H(q(Z is not a function of θ. Example Suppose we observed data X 1,..., X M from the following model p(z i = 1 = p(z i = = 1, X i Z i N(θZ i, 1 So we observe the X i s but not the Z i s or θ and want to estimate θ. l( θ Frequency θ t θ t x Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 The log-likelihood is l(θ = log P(X θ = log M k=1 P(x i z i = k, θp(z i = k The full data likelihood and it s logarithm are given by P(X, Z θ = M M P(x i z i, θp(z i = N(x i : θz i, 1 1 log P(X, Z θ = 1 M (x i θz i + K E-Step For θ t we then calculate the conditional distribution of the missing data P(Z X, θ t. This can be done one Z i at a time. ( p ikt = P(Z i = k X i, θ t exp 1 (x i kθ t Then calculate the expectation of the log of the full data likelihood [ ] E log P(X, Z θ = 1 = 1 M k=1 (x i θk p ikt M (x i θ p i1t + (x i θ p it = 1 θ i (p i1t + 4p it + θ i x i (p i1t + p it + C Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

67 Summary EM algorithm M-Step We can then update the estimate of θ by maximizing the expected log-likelihood, [ ] E log P(X, Z θ which leads to = 1 θ i θ t+1 = (p i1t + 4p it + θ i i x i(p i1t + p it i (p i1t + 4p it x i (p i1t + p it + C 1 Set t = 0 and pick a initial value θ t. E-Step To calculate the expected log-likelihood of the full data likelihood we calculate the conditional distribution of the missing data Z given and observed data X and θ t. ( p ikt = P(Z i = k X i, θ t exp 1 (x i kθ t 3 M-Step We then maximise the expected log-likelihood using θ t+1 = i x i(p i1t + p it i (p i1t + 4p it 4 Iterate steps and 3 until convergence. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 General comments Frequency On this real example the EM algorithm leads to the following sequence of estimates i θ i l(θ Jonathan Marchini (University of Oxford BSa MT / 8 x 1 The EM algorithm is only guaranteed to converge to a local maxima of the likelihood. In some situations there will be multiple-maxima. A computational strategy that is often employed to overcome this involves running the EM algorithm from many different starting points and choosing the estimate that produces the largest likelihood. The convergence of the EM algorithm can sometimes be slow, and this depends upon the form of the likelihood function. Jonathan Marchini (University of Oxford BSa MT / 8

68 Contingency table analysis North Carolina State University data. Lecture 16 : Bayesian analysis of contingency tables. Bayesian linear regression. EC : Extra Curricular activities in hours per week. EC < to 1 > 1 C or better D or F Let y = (y ij be the matrix of counts. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Frequentist analysis Usual χ test from R. Pearson s Chi-squared test X-squared = 6.964, df = (3-1(-1 =, p-value = The p-value is , evidence that grades are related to time spent on extra curricular activities. Bayesian analysis Let p = {p 11,..., p 3 }. Consider two models M I M D EC < to 1 > 1 C or better p 11 p 1 p 13 D or F p 1 p p 3 the two categorical variables are independent the two categorical variables are dependent. The Bayes factor is BF = P(y M D P(y M I. So we need to calculate the marginal likelihoods. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

69 The Dirichlet distribution Dirichlet integral z 1 + +z k =1 z ν z ν k 1 k dz 1 dz k = Γ(ν 1 Γ(ν k Γ( ν i Examples of 3D Dirichlet distributions Dirichlet distribution Γ( ν i Γ(ν 1 Γ(ν k zν z ν k 1 k, z z k = 1 The means are E[Z i ] = ν i / ν i, i = 1,..., k. A representation that makes the Dirichlet easy to simulate from is the following. Let W 1,..., W k be independent Gamma (ν 1,... Gamma (ν k random variables, W = W i and set Z i = W i /W, i = 1,..., k. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8 Calculating marginal likelihoods Calculating marginal likelihoods where P(y M D = = p P(y pπ(pdp p p 3 =1 ( y ( yij pij π(pdp11 dp 3 y ( y = ( y ij!/ y ij! y Under M D choose a uniform distribution for p i.e. Dirichlet(1,..., 1 ij P(y M D = = = where ( y Γ(RC y ( y Γ(yij + 1 Γ(RC y Γ( y + RC ( y D(y + 1 y D(1 RC p p 3 =1 D(ν = Γ(ν i /Γ( ν i ( yij pij dp11 dp 3 ij π(p = Γ(RC, p p 3 = 1 and y + 1 denotes the matrix of counts with 1 added to all entries and 1 RC denotes a vector of length RC with all entries equal to 1. Jonathan Marchini (University of Oxford BSa MT / 8 Jonathan Marchini (University of Oxford BSa MT / 8

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 2013 Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 27 Course arrangements Lectures M.2

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 20 Lecture 6 : Bayesian Inference

More information

Proof In the CR proof. and

Proof In the CR proof. and Question Under what conditions will we be able to attain the Cramér-Rao bound and find a MVUE? Lecture 4 - Consequences of the Cramér-Rao Lower Bound. Searching for a MVUE. Rao-Blackwell Theorem, Lehmann-Scheffé

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Mathematical statistics

Mathematical statistics October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

STAT 730 Chapter 4: Estimation

STAT 730 Chapter 4: Estimation STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum

More information

Chapter 8.8.1: A factorization theorem

Chapter 8.8.1: A factorization theorem LECTURE 14 Chapter 8.8.1: A factorization theorem The characterization of a sufficient statistic in terms of the conditional distribution of the data given the statistic can be difficult to work with.

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics Chapter Three. Point Estimation 3.4 Uniformly Minimum Variance Unbiased Estimator(UMVUE) Criteria for Best Estimators MSE Criterion Let F = {p(x; θ) : θ Θ} be a parametric distribution

More information

Introduction to Bayesian Methods

Introduction to Bayesian Methods Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative

More information

A Very Brief Summary of Bayesian Inference, and Examples

A Very Brief Summary of Bayesian Inference, and Examples A Very Brief Summary of Bayesian Inference, and Examples Trinity Term 009 Prof Gesine Reinert Our starting point are data x = x 1, x,, x n, which we view as realisations of random variables X 1, X,, X

More information

ECE 275B Homework # 1 Solutions Version Winter 2015

ECE 275B Homework # 1 Solutions Version Winter 2015 ECE 275B Homework # 1 Solutions Version Winter 2015 1. (a) Because x i are assumed to be independent realizations of a continuous random variable, it is almost surely (a.s.) 1 the case that x 1 < x 2

More information

ECE 275B Homework # 1 Solutions Winter 2018

ECE 275B Homework # 1 Solutions Winter 2018 ECE 275B Homework # 1 Solutions Winter 2018 1. (a) Because x i are assumed to be independent realizations of a continuous random variable, it is almost surely (a.s.) 1 the case that x 1 < x 2 < < x n Thus,

More information

March 10, 2017 THE EXPONENTIAL CLASS OF DISTRIBUTIONS

March 10, 2017 THE EXPONENTIAL CLASS OF DISTRIBUTIONS March 10, 2017 THE EXPONENTIAL CLASS OF DISTRIBUTIONS Abstract. We will introduce a class of distributions that will contain many of the discrete and continuous we are familiar with. This class will help

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information

Mathematical statistics

Mathematical statistics October 18 th, 2018 Lecture 16: Midterm review Countdown to mid-term exam: 7 days Week 1 Chapter 1: Probability review Week 2 Week 4 Week 7 Chapter 6: Statistics Chapter 7: Point Estimation Chapter 8:

More information

1. Fisher Information

1. Fisher Information 1. Fisher Information Let f(x θ) be a density function with the property that log f(x θ) is differentiable in θ throughout the open p-dimensional parameter set Θ R p ; then the score statistic (or score

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Statistical Theory MT 2007 Problems 4: Solution sketches

Statistical Theory MT 2007 Problems 4: Solution sketches Statistical Theory MT 007 Problems 4: Solution sketches 1. Consider a 1-parameter exponential family model with density f(x θ) = f(x)g(θ)exp{cφ(θ)h(x)}, x X. Suppose that the prior distribution has the

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

A Few Notes on Fisher Information (WIP)

A Few Notes on Fisher Information (WIP) A Few Notes on Fisher Information (WIP) David Meyer dmm@{-4-5.net,uoregon.edu} Last update: April 30, 208 Definitions There are so many interesting things about Fisher Information and its theoretical properties

More information

Part III. A Decision-Theoretic Approach and Bayesian testing

Part III. A Decision-Theoretic Approach and Bayesian testing Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to

More information

9 Bayesian inference. 9.1 Subjective probability

9 Bayesian inference. 9.1 Subjective probability 9 Bayesian inference 1702-1761 9.1 Subjective probability This is probability regarded as degree of belief. A subjective probability of an event A is assessed as p if you are prepared to stake pm to win

More information

DA Freedman Notes on the MLE Fall 2003

DA Freedman Notes on the MLE Fall 2003 DA Freedman Notes on the MLE Fall 2003 The object here is to provide a sketch of the theory of the MLE. Rigorous presentations can be found in the references cited below. Calculus. Let f be a smooth, scalar

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

HT Introduction. P(X i = x i ) = e λ λ x i

HT Introduction. P(X i = x i ) = e λ λ x i MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework

More information

Testing Hypothesis. Maura Mezzetti. Department of Economics and Finance Università Tor Vergata

Testing Hypothesis. Maura Mezzetti. Department of Economics and Finance Università Tor Vergata Maura Department of Economics and Finance Università Tor Vergata Hypothesis Testing Outline It is a mistake to confound strangeness with mystery Sherlock Holmes A Study in Scarlet Outline 1 The Power Function

More information

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources STA 732: Inference Notes 10. Parameter Estimation from a Decision Theoretic Angle Other resources 1 Statistical rules, loss and risk We saw that a major focus of classical statistics is comparing various

More information

MAS3301 Bayesian Statistics

MAS3301 Bayesian Statistics MAS3301 Bayesian Statistics M. Farrow School of Mathematics and Statistics Newcastle University Semester 2, 2008-9 1 11 Conjugate Priors IV: The Dirichlet distribution and multinomial observations 11.1

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions.

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions. Large Sample Theory Study approximate behaviour of ˆθ by studying the function U. Notice U is sum of independent random variables. Theorem: If Y 1, Y 2,... are iid with mean µ then Yi n µ Called law of

More information

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1 Chapter 4 HOMEWORK ASSIGNMENTS These homeworks may be modified as the semester progresses. It is your responsibility to keep up to date with the correctly assigned homeworks. There may be some errors in

More information

Classical Estimation Topics

Classical Estimation Topics Classical Estimation Topics Namrata Vaswani, Iowa State University February 25, 2014 This note fills in the gaps in the notes already provided (l0.pdf, l1.pdf, l2.pdf, l3.pdf, LeastSquares.pdf). 1 Min

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

10-704: Information Processing and Learning Fall Lecture 24: Dec 7

10-704: Information Processing and Learning Fall Lecture 24: Dec 7 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 24: Dec 7 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy of

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let

More information

ST5215: Advanced Statistical Theory

ST5215: Advanced Statistical Theory Department of Statistics & Applied Probability Wednesday, October 19, 2011 Lecture 17: UMVUE and the first method of derivation Estimable parameters Let ϑ be a parameter in the family P. If there exists

More information

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics DS-GA 100 Lecture notes 11 Fall 016 Bayesian statistics In the frequentist paradigm we model the data as realizations from a distribution that depends on deterministic parameters. In contrast, in Bayesian

More information

Review. December 4 th, Review

Review. December 4 th, Review December 4 th, 2017 Att. Final exam: Course evaluation Friday, 12/14/2018, 10:30am 12:30pm Gore Hall 115 Overview Week 2 Week 4 Week 7 Week 10 Week 12 Chapter 6: Statistics and Sampling Distributions Chapter

More information

STATISTICAL METHODS FOR SIGNAL PROCESSING c Alfred Hero

STATISTICAL METHODS FOR SIGNAL PROCESSING c Alfred Hero STATISTICAL METHODS FOR SIGNAL PROCESSING c Alfred Hero 1999 32 Statistic used Meaning in plain english Reduction ratio T (X) [X 1,..., X n ] T, entire data sample RR 1 T (X) [X (1),..., X (n) ] T, rank

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Conjugate Priors, Uninformative Priors

Conjugate Priors, Uninformative Priors Conjugate Priors, Uninformative Priors Nasim Zolaktaf UBC Machine Learning Reading Group January 2016 Outline Exponential Families Conjugacy Conjugate priors Mixture of conjugate prior Uninformative priors

More information

Statistical Theory MT 2006 Problems 4: Solution sketches

Statistical Theory MT 2006 Problems 4: Solution sketches Statistical Theory MT 006 Problems 4: Solution sketches 1. Suppose that X has a Poisson distribution with unknown mean θ. Determine the conjugate prior, and associate posterior distribution, for θ. Determine

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology Singapore University of Design and Technology Lecture 9: Hypothesis testing, uniformly most powerful tests. The Neyman-Pearson framework Let P be the family of distributions of concern. The Neyman-Pearson

More information

Lecture 1: Introduction

Lecture 1: Introduction Principles of Statistics Part II - Michaelmas 208 Lecturer: Quentin Berthet Lecture : Introduction This course is concerned with presenting some of the mathematical principles of statistical theory. One

More information

ECE 275A Homework 6 Solutions

ECE 275A Homework 6 Solutions ECE 275A Homework 6 Solutions. The notation used in the solutions for the concentration (hyper) ellipsoid problems is defined in the lecture supplement on concentration ellipsoids. Note that θ T Σ θ =

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

INTRODUCTION TO BAYESIAN METHODS II

INTRODUCTION TO BAYESIAN METHODS II INTRODUCTION TO BAYESIAN METHODS II Abstract. We will revisit point estimation and hypothesis testing from the Bayesian perspective.. Bayes estimators Let X = (X,..., X n ) be a random sample from the

More information

General Bayesian Inference I

General Bayesian Inference I General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 32 Lecture 14 : Variational Bayes

More information

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That Statistics Lecture 2 August 7, 2000 Frank Porter Caltech The plan for these lectures: The Fundamentals; Point Estimation Maximum Likelihood, Least Squares and All That What is a Confidence Interval? Interval

More information

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:

More information

Chapters 9. Properties of Point Estimators

Chapters 9. Properties of Point Estimators Chapters 9. Properties of Point Estimators Recap Target parameter, or population parameter θ. Population distribution f(x; θ). { probability function, discrete case f(x; θ) = density, continuous case The

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

Exercises and Answers to Chapter 1

Exercises and Answers to Chapter 1 Exercises and Answers to Chapter The continuous type of random variable X has the following density function: a x, if < x < a, f (x), otherwise. Answer the following questions. () Find a. () Obtain mean

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

1 Introduction. P (n = 1 red ball drawn) =

1 Introduction. P (n = 1 red ball drawn) = Introduction Exercises and outline solutions. Y has a pack of 4 cards (Ace and Queen of clubs, Ace and Queen of Hearts) from which he deals a random of selection 2 to player X. What is the probability

More information

10. Linear Models and Maximum Likelihood Estimation

10. Linear Models and Maximum Likelihood Estimation 10. Linear Models and Maximum Likelihood Estimation ECE 830, Spring 2017 Rebecca Willett 1 / 34 Primary Goal General problem statement: We observe y i iid pθ, θ Θ and the goal is to determine the θ that

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Introduction to Probabilistic Methods Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Statistics & Data Sciences: First Year Prelim Exam May 2018

Statistics & Data Sciences: First Year Prelim Exam May 2018 Statistics & Data Sciences: First Year Prelim Exam May 2018 Instructions: 1. Do not turn this page until instructed to do so. 2. Start each new question on a new sheet of paper. 3. This is a closed book

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Principles of Statistics

Principles of Statistics Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 81 Paper 4, Section II 28K Let g : R R be an unknown function, twice continuously differentiable with g (x) M for

More information

STA 260: Statistics and Probability II

STA 260: Statistics and Probability II Al Nosedal. University of Toronto. Winter 2017 1 Properties of Point Estimators and Methods of Estimation 2 3 If you can t explain it simply, you don t understand it well enough Albert Einstein. Definition

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Chapter 3 : Likelihood function and inference

Chapter 3 : Likelihood function and inference Chapter 3 : Likelihood function and inference 4 Likelihood function and inference The likelihood Information and curvature Sufficiency and ancilarity Maximum likelihood estimation Non-regular models EM

More information

Estimation of reliability parameters from Experimental data (Parte 2) Prof. Enrico Zio

Estimation of reliability parameters from Experimental data (Parte 2) Prof. Enrico Zio Estimation of reliability parameters from Experimental data (Parte 2) This lecture Life test (t 1,t 2,...,t n ) Estimate θ of f T t θ For example: λ of f T (t)= λe - λt Classical approach (frequentist

More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

Theory of Maximum Likelihood Estimation. Konstantin Kashin

Theory of Maximum Likelihood Estimation. Konstantin Kashin Gov 2001 Section 5: Theory of Maximum Likelihood Estimation Konstantin Kashin February 28, 2013 Outline Introduction Likelihood Examples of MLE Variance of MLE Asymptotic Properties What is Statistical

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 10 Alternatives to Monte Carlo Computation Since about 1990, Markov chain Monte Carlo has been the dominant

More information

Interval Estimation. Chapter 9

Interval Estimation. Chapter 9 Chapter 9 Interval Estimation 9.1 Introduction Definition 9.1.1 An interval estimate of a real-values parameter θ is any pair of functions, L(x 1,..., x n ) and U(x 1,..., x n ), of a sample that satisfy

More information

Final Examination a. STA 532: Statistical Inference. Wednesday, 2015 Apr 29, 7:00 10:00pm. Thisisaclosed bookexam books&phonesonthefloor.

Final Examination a. STA 532: Statistical Inference. Wednesday, 2015 Apr 29, 7:00 10:00pm. Thisisaclosed bookexam books&phonesonthefloor. Final Examination a STA 532: Statistical Inference Wednesday, 2015 Apr 29, 7:00 10:00pm Thisisaclosed bookexam books&phonesonthefloor Youmayuseacalculatorandtwo pagesofyourownnotes Do not share calculators

More information

Lecture 2: Statistical Decision Theory (Part I)

Lecture 2: Statistical Decision Theory (Part I) Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical

More information

Mathematical statistics

Mathematical statistics October 1 st, 2018 Lecture 11: Sufficient statistic Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

ECE531 Lecture 10b: Maximum Likelihood Estimation

ECE531 Lecture 10b: Maximum Likelihood Estimation ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So

More information

Bayesian Inference: Posterior Intervals

Bayesian Inference: Posterior Intervals Bayesian Inference: Posterior Intervals Simple values like the posterior mean E[θ X] and posterior variance var[θ X] can be useful in learning about θ. Quantiles of π(θ X) (especially the posterior median)

More information

Chapter 3: Unbiased Estimation Lecture 22: UMVUE and the method of using a sufficient and complete statistic

Chapter 3: Unbiased Estimation Lecture 22: UMVUE and the method of using a sufficient and complete statistic Chapter 3: Unbiased Estimation Lecture 22: UMVUE and the method of using a sufficient and complete statistic Unbiased estimation Unbiased or asymptotically unbiased estimation plays an important role in

More information

Probability and Distributions

Probability and Distributions Probability and Distributions What is a statistical model? A statistical model is a set of assumptions by which the hypothetical population distribution of data is inferred. It is typically postulated

More information

Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2

Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Logistics CSE 446: Point Estimation Winter 2012 PS2 out shortly Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Last Time Random variables, distributions Marginal, joint & conditional

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium November 12, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Algorithms for Uncertainty Quantification

Algorithms for Uncertainty Quantification Algorithms for Uncertainty Quantification Tobias Neckel, Ionuț-Gabriel Farcaș Lehrstuhl Informatik V Summer Semester 2017 Lecture 2: Repetition of probability theory and statistics Example: coin flip Example

More information

Modern Methods of Statistical Learning sf2935 Auxiliary material: Exponential Family of Distributions Timo Koski. Second Quarter 2016

Modern Methods of Statistical Learning sf2935 Auxiliary material: Exponential Family of Distributions Timo Koski. Second Quarter 2016 Auxiliary material: Exponential Family of Distributions Timo Koski Second Quarter 2016 Exponential Families The family of distributions with densities (w.r.t. to a σ-finite measure µ) on X defined by R(θ)

More information

MATH4427 Notebook 2 Fall Semester 2017/2018

MATH4427 Notebook 2 Fall Semester 2017/2018 MATH4427 Notebook 2 Fall Semester 2017/2018 prepared by Professor Jenny Baglivo c Copyright 2009-2018 by Jenny A. Baglivo. All Rights Reserved. 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................

More information

Chapter 3. Point Estimation. 3.1 Introduction

Chapter 3. Point Estimation. 3.1 Introduction Chapter 3 Point Estimation Let (Ω, A, P θ ), P θ P = {P θ θ Θ}be probability space, X 1, X 2,..., X n : (Ω, A) (IR k, B k ) random variables (X, B X ) sample space γ : Θ IR k measurable function, i.e.

More information

1 Probability Model. 1.1 Types of models to be discussed in the course

1 Probability Model. 1.1 Types of models to be discussed in the course Sufficiency January 18, 016 Debdeep Pati 1 Probability Model Model: A family of distributions P θ : θ Θ}. P θ (B) is the probability of the event B when the parameter takes the value θ. P θ is described

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Outline. 1. Define likelihood 2. Interpretations of likelihoods 3. Likelihood plots 4. Maximum likelihood 5. Likelihood ratio benchmarks

Outline. 1. Define likelihood 2. Interpretations of likelihoods 3. Likelihood plots 4. Maximum likelihood 5. Likelihood ratio benchmarks Outline 1. Define likelihood 2. Interpretations of likelihoods 3. Likelihood plots 4. Maximum likelihood 5. Likelihood ratio benchmarks Likelihood A common and fruitful approach to statistics is to assume

More information