A Very Brief Summary of Statistical Inference, and Examples
Trinity Term 2008
Prof. Gesine Reinert
Data $x = (x_1, x_2, \ldots, x_n)$, realisations of random variables $X_1, X_2, \ldots, X_n$ with distribution (model) $f(x_1, x_2, \ldots, x_n; \theta)$.

Frequentist: $\theta \in \Theta$ is an unknown constant.
1. Likelihood and Sufficiency (Fisherian)

Likelihood approach: $L(\theta) = L(\theta, x) = f(x_1, x_2, \ldots, x_n; \theta)$.

Often $X_1, X_2, \ldots, X_n$ are independent, identically distributed (i.i.d.); then
$$L(\theta, x) = \prod_{i=1}^n f(x_i, \theta).$$

To summarize the information about $\theta$: find a minimal sufficient statistic $t(x)$. From the Factorization Theorem: $T = t(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(x)$ such that for all $x$ and $\theta$
$$f(x, \theta) = g(t(x), \theta)\, h(x).$$
Moreover, $T = t(X)$ is minimal sufficient when
$$\frac{f(x, \theta)}{f(y, \theta)} \ \text{is constant in}\ \theta \iff t(x) = t(y).$$
Example. Let $X_1, \ldots, X_n$ be a random sample from the truncated exponential distribution, where
$$f_{X_i}(x_i) = e^{\theta - x_i}, \quad x_i > \theta,$$
or, using indicator function notation, $f_{X_i}(x_i) = e^{\theta - x_i} I_{(\theta, \infty)}(x_i)$. Show that $Y_1 = \min_i X_i$ is sufficient for $\theta$.

Let $T = T(X_1, \ldots, X_n) = Y_1$. We need the pdf $f_T(t)$ of the smallest order statistic. Calculate the cdf of $X_i$:
$$F(x) = \int_\theta^x e^{\theta - z}\, dz = e^\theta \left( e^{-\theta} - e^{-x} \right) = 1 - e^{\theta - x}.$$
Now
$$P(T > t) = \prod_{i=1}^n (1 - F(t)) = (1 - F(t))^n.$$
Differentiating gives that $f_T(t)$ equals
$$n[1 - F(t)]^{n-1} f(t) = n e^{(\theta - t)(n-1)} e^{\theta - t} = n e^{n(\theta - t)}, \quad t > \theta.$$
So the conditional density of $X_1, \ldots, X_n$ given $T = t$ is
$$\frac{e^{\theta - x_1} e^{\theta - x_2} \cdots e^{\theta - x_n}}{n e^{n(\theta - t)}} = \frac{e^{-\sum x_i}}{n e^{-nt}}, \quad x_i \ge t, \ i = 1, \ldots, n,$$
which does not depend on $\theta$ for each fixed $t = \min_i x_i$. Note that since $x_i \ge t$, $i = 1, \ldots, n$, neither the expression nor the range space depends on $\theta$, so the first order statistic, $X_{(1)}$, is a sufficient statistic for $\theta$.
Alternatively, use the Factorization Theorem:
$$f_X(x) = \prod_{i=1}^n e^{\theta - x_i} I_{(\theta, \infty)}(x_i) = I_{(\theta, \infty)}(x_{(1)}) \prod_{i=1}^n e^{\theta - x_i} = e^{n\theta}\, I_{(\theta, \infty)}(x_{(1)})\, e^{-\sum_{i=1}^n x_i},$$
with $g(t, \theta) = e^{n\theta} I_{(\theta, \infty)}(t)$ and $h(x) = e^{-\sum_{i=1}^n x_i}$.
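The derived density can also be checked by simulation. Below is a minimal Python sketch (an illustration, not from the notes): under this model $X = \theta + E$ with $E \sim \operatorname{Exp}(1)$, and $f_T(t) = n e^{n(\theta - t)}$ says that $T - \theta \sim \operatorname{Exp}(n)$; the parameter values are arbitrary choices.

```python
import numpy as np

# Check by Monte Carlo that T = min(X_i) satisfies T - theta ~ Exp(rate = n)
# for the truncated exponential f(x; theta) = exp(theta - x), x > theta.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 5, 100_000

# X = theta + E with E ~ Exp(1) has exactly this density.
x = theta + rng.exponential(1.0, size=(reps, n))
t = x.min(axis=1)  # smallest order statistic in each replication

# Exp(n) has mean 1/n and variance 1/n^2.
print((t - theta).mean(), 1 / n)      # ~ 0.2
print((t - theta).var(), 1 / n**2)    # ~ 0.04
```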
2. Point Estimation

Estimate $\theta$ by a function $t(x_1, \ldots, x_n)$ of the data; often by maximum likelihood:
$$\hat\theta = \arg\max_\theta L(\theta),$$
or by the method of moments.

Neither is unbiased in general, but the m.l.e. is asymptotically unbiased and asymptotically efficient; under some regularity assumptions,
$$\hat\theta \approx \mathcal{N}\left( \theta, I_n^{-1}(\theta) \right),$$
where
$$I_n(\theta) = E_\theta \left[ \left( \frac{\partial \ell(\theta, X)}{\partial \theta} \right)^2 \right]$$
is the Fisher information (matrix).
Under more regularity,
$$I_n(\theta) = -E_\theta \left( \frac{\partial^2 \ell(\theta, X)}{\partial \theta^2} \right).$$
If $x$ is a random sample, then $I_n(\theta) = n I_1(\theta)$; often abbreviate $I_1(\theta) = I(\theta)$.

The m.l.e. is a function of a sufficient statistic; recall that, for scalar $\theta$, there is a nice theory using the Cramér–Rao lower bound and the Rao–Blackwell theorem on how to obtain minimum variance unbiased estimators based on a sufficient statistic and an unbiased estimator.

Nice invariance property: the m.l.e. of a function $\phi(\theta)$ is $\phi(\hat\theta)$.
Example. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the Gamma distribution with density
$$f(x; c, \beta) = \frac{x^{c-1}}{\Gamma(c)\beta^c}\, e^{-x/\beta}, \quad x > 0,$$
where $c > 0$, $\beta > 0$; $\theta = (c, \beta)$;
$$\ell(\theta) = -n \log \Gamma(c) - nc \log \beta + (c - 1) \sum \log x_i - \frac{1}{\beta} \sum x_i.$$
Put $D_1(c) = \frac{\partial}{\partial c} \log \Gamma(c)$; then
$$\frac{\partial \ell}{\partial \beta} = -\frac{nc}{\beta} + \frac{1}{\beta^2} \sum x_i, \qquad \frac{\partial \ell}{\partial c} = -n D_1(c) - n \log \beta + \sum \log x_i.$$
Setting these equal to zero yields $\hat\beta = \bar x / \hat c$, where $\hat c$ solves
$$D_1(\hat c) - \log \hat c = \log\left( \left[ \textstyle\prod x_i \right]^{1/n} \big/ \bar x \right).$$
(One could check that the sufficient statistic is indeed $\left( \sum x_i, \prod x_i \right)$, and that it is minimal sufficient.)

Check the second derivatives for a maximum. To calculate the Fisher information, put $D_2(c) = \frac{\partial^2}{\partial c^2} \log \Gamma(c)$:
$$\frac{\partial^2 \ell}{\partial \beta^2} = \frac{nc}{\beta^2} - \frac{2}{\beta^3} \sum x_i, \qquad \frac{\partial^2 \ell}{\partial \beta \partial c} = -\frac{n}{\beta}, \qquad \frac{\partial^2 \ell}{\partial c^2} = -n D_2(c).$$
Use that $E X_i = c\beta$ to obtain
$$I_n(\theta) = n \begin{pmatrix} \frac{c}{\beta^2} & \frac{1}{\beta} \\[2pt] \frac{1}{\beta} & D_2(c) \end{pmatrix}.$$
Recall also that there is an iterative method to compute m.l.e.s, related to the Newton–Raphson method.
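The equation for $\hat c$ has no closed form, but $D_1$ is the digamma function, so it is easy to solve numerically. A minimal Python sketch (an illustration, not from the notes; the simulated data and parameter values are assumptions):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

# Solve the m.l.e. equations for the Gamma(c, beta) model:
#   beta_hat = xbar / c_hat,
#   D1(c_hat) - log(c_hat) = log(geometric mean / xbar),
# where D1 = digamma.  The right-hand side is <= 0 by the AM-GM inequality,
# and D1(c) - log(c) increases from -inf to 0, so a root exists.
rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=2.0, size=500)  # simulated data, c = 3, beta = 2

xbar = x.mean()
rhs = np.log(x).mean() - np.log(xbar)  # log(geometric mean / xbar)

c_hat = brentq(lambda c: digamma(c) - np.log(c) - rhs, 1e-6, 1e6)
beta_hat = xbar / c_hat
print(c_hat, beta_hat)  # should be close to (3, 2)
```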
3. Hypothesis Testing

For a simple null hypothesis and a simple alternative, the Neyman–Pearson Lemma says that the most powerful tests are likelihood-ratio tests. These can sometimes be generalized to one-sided alternatives in such a way as to yield uniformly most powerful (UMP) tests.
Example. Suppose as above that $X_1, \ldots, X_n$ is a random sample from the truncated exponential distribution, where $f_{X_i}(x_i; \theta) = e^{\theta - x_i}$, $x_i > \theta$. Find a UMP test of size $\alpha$ for testing $H_0: \theta = \theta_0$ against $H_1: \theta > \theta_0$.

Let $\theta_1 > \theta_0$; then the likelihood ratio is
$$\frac{f(x, \theta_1)}{f(x, \theta_0)} = e^{n(\theta_1 - \theta_0)}\, I_{(\theta_1, \infty)}(x_{(1)}),$$
which increases with $x_{(1)}$, so we reject $H_0$ if the smallest observation, $T = X_{(1)}$, is large. Under $H_0$, we have calculated that the pdf of $T$ is
$$f_T(t) = n e^{n(\theta_0 - t)}, \quad t > \theta_0.$$
So for $t > \theta_0$, under $H_0$,
$$P(T > t) = \int_t^\infty n e^{n(\theta_0 - s)}\, ds = e^{n(\theta_0 - t)}.$$
For a test at level $\alpha$, choose $t$ such that $e^{n(\theta_0 - t)} = \alpha$, i.e.
$$t = \theta_0 - \frac{\ln \alpha}{n}.$$
The LR test rejects $H_0$ if $X_{(1)} > \theta_0 - \frac{\ln \alpha}{n}$. The test is the same for all $\theta_1 > \theta_0$, so it is UMP.
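A short Python sketch of this test (an illustration, not from the notes; $\theta_0$, $n$ and $\alpha$ are arbitrary choices), including a Monte Carlo check that the rejection rate under $H_0$ is close to $\alpha$:

```python
import numpy as np

# UMP test for the truncated exponential: reject H0: theta = theta0
# when X_(1) > theta0 - ln(alpha) / n.
rng = np.random.default_rng(2)
theta0, n, alpha = 0.0, 10, 0.05
crit = theta0 - np.log(alpha) / n  # critical value, ~0.2996 here

def ump_test(x, theta0, alpha):
    """Return True if H0 is rejected at level alpha."""
    return x.min() > theta0 - np.log(alpha) / len(x)

# Size check under H0: the rejection rate should be approximately alpha.
x0 = theta0 + rng.exponential(1.0, size=(100_000, n))
print((x0.min(axis=1) > crit).mean())  # ~0.05
```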
For a general null hypothesis and a general alternative:
$$H_0: \theta \in \Theta_0 \qquad H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0.$$
The (generalized) LR test uses the likelihood ratio statistic
$$T = \frac{\max_{\theta \in \Theta} L(\theta; X)}{\max_{\theta \in \Theta_0} L(\theta; X)}$$
and rejects $H_0$ for large values of $T$; use chi-square asymptotics
$$2 \log T \approx \chi^2_p, \quad \text{where } p = \dim \Theta - \dim \Theta_0$$
(nested models only). Or use score tests, which are based on the score function $\partial \ell / \partial \theta$, with asymptotics in terms of the normal distribution $\mathcal{N}(0, I_n(\theta))$.

An important example is Pearson's chi-square test, which we derived as a score test, and which we saw is asymptotically equivalent to the generalized likelihood ratio test.
Example. Let $X_1, \ldots, X_n$ be a random sample from a geometric distribution with parameter $p$:
$$P(X_i = k) = p(1 - p)^k, \quad k = 0, 1, 2, \ldots.$$
Then $L(p) = p^n (1 - p)^{\sum x_i}$ and $\ell(p) = n \ln p + n \bar x \ln(1 - p)$, so that
$$\frac{\partial \ell}{\partial p} = n \left( \frac{1}{p} - \frac{\bar x}{1 - p} \right)$$
and
$$\frac{\partial^2 \ell}{\partial p^2} = -n \left( \frac{1}{p^2} + \frac{\bar x}{(1 - p)^2} \right).$$
We know that $E X_1 = (1 - p)/p$. Calculate the information:
$$I(p) = \frac{n}{p^2 (1 - p)}.$$
Suppose $n = 20$, $\bar x = 3$, $H_0: p_0 = 0.15$ and $H_1: p_0 \ne 0.15$. Then $\frac{\partial \ell}{\partial p}(0.15) = 62.7$ and $I(0.15) = 1045.8$. The test statistic is $Z = 62.7 / \sqrt{1045.8} = 1.9388$. Compare to 1.96 for a test at level $\alpha = 0.05$: do not reject $H_0$.
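The arithmetic of this score test, as a Python sketch (with the values from the example):

```python
import numpy as np

# Score test for the geometric model: n = 20, xbar = 3, H0: p = 0.15.
n, xbar, p0 = 20, 3.0, 0.15

score = n * (1 / p0 - xbar / (1 - p0))  # dl/dp at p0  -> 62.7
info = n / (p0**2 * (1 - p0))           # Fisher information -> 1045.8
z = score / np.sqrt(info)               # -> 1.939
print(z, abs(z) > 1.96)                 # do not reject at the 5% level
```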
Example continued. For a generalized LRT the test statistic is based on
$$2 \log\left[ L(\hat p) / L(p_0) \right] \approx \chi^2_1,$$
and the test statistic is thus
$$2\left( \ell(\hat p) - \ell(p_0) \right) = 2n\left( \ln \hat p - \ln p_0 + \bar x \left( \ln(1 - \hat p) - \ln(1 - p_0) \right) \right).$$
We calculate that the m.l.e. is
$$\hat p = \frac{1}{1 + \bar x}.$$
Suppose again $n = 20$, $\bar x = 3$, $H_0: p_0 = 0.15$ and $H_1: p_0 \ne 0.15$. The m.l.e. is $\hat p = 0.25$, and $\ell(0.25) = -44.98$, $\ell(0.15) = -47.69$. Calculate $\chi^2 = 2(47.69 - 44.98) = 5.4$ and compare to the chi-square distribution with 1 degree of freedom: 3.84 at the 5 percent level, so reject $H_0$.
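The corresponding generalized LRT computation, again as a short sketch:

```python
import numpy as np

# Generalized LRT for the same geometric data: n = 20, xbar = 3, p0 = 0.15.
n, xbar, p0 = 20, 3.0, 0.15
p_hat = 1 / (1 + xbar)  # m.l.e., = 0.25

def loglik(p):
    return n * np.log(p) + n * xbar * np.log(1 - p)

lrt = 2 * (loglik(p_hat) - loglik(p0))  # -> 5.4
print(lrt, lrt > 3.84)                  # reject at the 5% level
```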
4. Confidence Regions

If we can find a pivot, a function $t(X, \theta)$ of a sufficient statistic whose distribution does not depend on $\theta$, then we can find confidence regions in a straightforward manner. Otherwise we may have to resort to approximate confidence regions, for example using the approximate normality of the m.l.e.

Recall that confidence intervals are equivalent to hypothesis tests with a simple null hypothesis and one- or two-sided alternatives.
Not to forget about: Profile likelihood

Often $\theta = (\psi, \lambda)$, where $\psi$ contains the parameters of interest; then base inference on the profile likelihood for $\psi$,
$$L_P(\psi) = L(\psi, \hat\lambda_\psi).$$
Again we can use the (generalized) LRT or score test; if $\psi$ is scalar, with $H_0: \psi = \psi_0$, $H_1^+: \psi > \psi_0$, $H_1^-: \psi < \psi_0$, use the test statistic
$$T = \frac{\partial \ell}{\partial \psi}(\psi_0, \hat\lambda_0; X),$$
where $\hat\lambda_0$ is the m.l.e. for $\lambda$ when $H_0$ is true.
Large positive values of $T$ indicate $H_1^+$; large negative values indicate $H_1^-$. Moreover,
$$T \approx \ell_\psi - I_{\psi,\lambda} I_{\lambda,\lambda}^{-1} \ell_\lambda \sim \mathcal{N}\left( 0, 1/I^{\psi,\psi} \right),$$
where
$$I^{\psi,\psi} = \left( I_{\psi,\psi} - I_{\psi,\lambda}^2 I_{\lambda,\lambda}^{-1} \right)^{-1}$$
is the top left element of $I^{-1}$. Estimate by substituting the null hypothesis values; calculate the practical standardized form of $T$ as
$$Z = \frac{T}{\sqrt{\operatorname{Var}(T)}} \approx \frac{\partial \ell}{\partial \psi}(\psi, \hat\lambda_\psi) \left[ I^{\psi,\psi}(\psi, \hat\lambda_\psi) \right]^{1/2},$$
which is approximately standard normal.
Bias and variance approximations: the delta method

If we cannot calculate mean and variance directly: suppose $T = g(S)$, where $E S = \beta$ and $\operatorname{Var} S = V$. Taylor expansion gives
$$T = g(S) \approx g(\beta) + (S - \beta) g'(\beta).$$
Taking the mean and variance of the r.h.s.:
$$E T \approx g(\beta), \qquad \operatorname{Var} T \approx [g'(\beta)]^2 V.$$
This also works for vectors $S$, $\beta$, with $T$ still a scalar: with $\left( g'(\beta) \right)_i = \partial g / \partial \beta_i$ and $g''(\beta)$ the matrix of second derivatives,
$$\operatorname{Var} T \approx [g'(\beta)]^T V g'(\beta) \qquad \text{and} \qquad E T \approx g(\beta) + \frac{1}{2} \operatorname{trace}\left[ g''(\beta) V \right].$$
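A small simulation sketch (an illustration, not from the notes) for $T = g(S) = \log S$, with $S$ the mean of $n$ Exponential($\beta$) observations, so $E S = \beta$ and $V = \beta^2/n$; the delta approximations are $\operatorname{Var} T \approx 1/n$ and, with the second-order term, $E T \approx \log \beta - 1/(2n)$:

```python
import numpy as np

# Delta method check for T = log(S), S = mean of n Exp(beta) observations.
# g'(beta) = 1/beta  =>  Var T ~ (beta^2/n) / beta^2 = 1/n;
# g''(beta) = -1/beta^2  =>  E T ~ log(beta) - 1/(2n).
rng = np.random.default_rng(3)
beta, n, reps = 4.0, 50, 100_000

s = rng.exponential(beta, size=(reps, n)).mean(axis=1)
t = np.log(s)

print(t.var(), 1 / n)                        # variance approximation
print(t.mean(), np.log(beta) - 1 / (2 * n))  # mean with bias correction
```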
Exponential family

For distributions in the exponential family, many calculations have been standardized; see Notes.
A Very Brief Summary of Bayesian Inference, and Examples

Data $x = (x_1, x_2, \ldots, x_n)$, realisations of random variables $X_1, X_2, \ldots, X_n$ with distribution (model) $f(x_1, x_2, \ldots, x_n \mid \theta)$.

Bayesian: $\theta \in \Theta$ is random, with prior $\pi(\theta)$.
1. Priors, Posteriors, Likelihood, and Sufficiency

The posterior distribution of $\theta$ given $x$ is
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta}.$$
Shorter: posterior $\propto$ prior $\times$ likelihood.

The (prior) predictive distribution of the data $x$ on the basis of $\pi$ is
$$p(x) = \int f(x \mid \theta)\, \pi(\theta)\, d\theta.$$
Suppose data $x_1$ is available, and we want to predict additional data:
$$p(x_2 \mid x_1) = \int f(x_2 \mid \theta)\, \pi(\theta \mid x_1)\, d\theta.$$
Note that $x_2$ and $x_1$ are assumed conditionally independent given $\theta$; they are not, in general, unconditionally independent.
Example. Suppose $y_1, y_2, \ldots, y_n$ are independent normally distributed random variables, each with variance 1 and with means $\beta x_1, \ldots, \beta x_n$, where $\beta$ is an unknown real-valued parameter and $x_1, x_2, \ldots, x_n$ are known constants. Suppose the prior is $\pi(\beta) = \mathcal{N}(\mu, \alpha^2)$. Then the likelihood is
$$L(\beta) = (2\pi)^{-\frac{n}{2}} \prod_{i=1}^n e^{-\frac{(y_i - \beta x_i)^2}{2}}$$
and
$$\pi(\beta) = \frac{1}{\sqrt{2\pi\alpha^2}}\, e^{-\frac{(\beta - \mu)^2}{2\alpha^2}} \propto \exp\left\{ -\frac{\beta^2}{2\alpha^2} + \frac{\beta \mu}{\alpha^2} \right\},$$
and the posterior of $\beta$ given $y_1, y_2, \ldots, y_n$ can be calculated as
$$\pi(\beta \mid y) \propto \exp\left\{ -\frac{1}{2} \sum (y_i - \beta x_i)^2 - \frac{1}{2\alpha^2}(\beta - \mu)^2 \right\} \propto \exp\left\{ \beta \sum x_i y_i - \frac{1}{2}\beta^2 \sum x_i^2 - \frac{\beta^2}{2\alpha^2} + \frac{\beta \mu}{\alpha^2} \right\}.$$
Abbreviate $s_{xx} = \sum x_i^2$, $s_{xy} = \sum x_i y_i$; then
$$\pi(\beta \mid y) \propto \exp\left\{ \beta s_{xy} - \frac{1}{2}\beta^2 s_{xx} - \frac{\beta^2}{2\alpha^2} + \frac{\beta \mu}{\alpha^2} \right\} = \exp\left\{ -\frac{\beta^2}{2}\left( s_{xx} + \frac{1}{\alpha^2} \right) + \beta \left( s_{xy} + \frac{\mu}{\alpha^2} \right) \right\},$$
which we recognize as
$$\mathcal{N}\left( \frac{s_{xy} + \frac{\mu}{\alpha^2}}{s_{xx} + \frac{1}{\alpha^2}},\ \left( s_{xx} + \frac{1}{\alpha^2} \right)^{-1} \right).$$
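A minimal Python sketch of this conjugate update (the simulated data, prior hyperparameters $\mu$, $\alpha^2$ and true $\beta$ are illustrative assumptions, not from the notes):

```python
import numpy as np

# Posterior N((s_xy + mu/alpha^2)/(s_xx + 1/alpha^2), (s_xx + 1/alpha^2)^-1)
# for y_i ~ N(beta * x_i, 1) with prior beta ~ N(mu, alpha^2).
rng = np.random.default_rng(4)
beta_true, mu, alpha2 = 1.5, 0.0, 10.0
x = np.linspace(0.1, 2.0, 25)                 # known constants
y = beta_true * x + rng.normal(size=x.size)   # simulated responses

s_xx, s_xy = (x**2).sum(), (x * y).sum()
post_prec = s_xx + 1 / alpha2
post_mean = (s_xy + mu / alpha2) / post_prec
post_var = 1 / post_prec
print(post_mean, post_var)
# Letting alpha2 -> infinity recovers the Jeffreys-prior posterior
# N(s_xy / s_xx, 1 / s_xx) derived later in the notes.
```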
To summarize the information about $\theta$: find a minimal sufficient statistic $t(x)$; $T = t(X)$ is sufficient for $\theta$ if and only if for all $x$ and $\theta$
$$\pi(\theta \mid x) = \pi(\theta \mid t(x)).$$
This agrees with the frequentist approach; the posterior (inference) is based on a sufficient statistic.
Choice of prior

There are many ways of choosing a prior distribution, for example using a coherent belief system. Often it is convenient to restrict the class of priors to a particular family of distributions. When the posterior is in the same family of models as the prior, i.e. when one has a conjugate prior, then updating the distribution under new data is particularly convenient. For the regular $k$-parameter exponential family,
$$f(x \mid \theta) = f(x)\, g(\theta) \exp\left\{ \sum_{i=1}^k c_i \phi_i(\theta) h_i(x) \right\}, \quad x \in \mathcal{X},$$
conjugate priors can be derived in a straightforward manner, using the sufficient statistics; a small sketch of conjugate updating follows below.
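As a concrete instance (an illustration, not from the notes): for a Poisson mean $\theta$ with a Gamma$(a, b)$ prior (shape $a$, rate $b$), the prior is conjugate and the posterior after data $x_1, \ldots, x_n$ is Gamma$(a + \sum x_i,\ b + n)$. The hyperparameters below are arbitrary choices.

```python
import numpy as np

# Conjugate Gamma-Poisson update: prior Gamma(a, b), data x_1..x_n Poisson,
# posterior Gamma(a + sum(x), b + n).
rng = np.random.default_rng(5)
a, b = 2.0, 1.0                   # assumed prior hyperparameters
x = rng.poisson(3.0, size=40)     # simulated Poisson data with mean 3

a_post, b_post = a + x.sum(), b + len(x)
print(a_post / b_post)            # posterior mean, close to 3
```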
Non-informative priors favour no particular values of the parameter over others. If $\Theta$ is finite, choose the uniform prior; if $\Theta$ is infinite, there are several approaches (some may yield improper priors).

For a location density $f(x; \theta) = f(x - \theta)$, the non-informative location-invariant prior is $\pi(\theta) = \pi(0)$, a constant; usually we choose $\pi(\theta) = 1$ for all $\theta$ (improper prior).

For a scale density $f(x; \sigma) = \frac{1}{\sigma} f\left( \frac{x}{\sigma} \right)$ for $\sigma > 0$, the scale-invariant non-informative prior is $\pi(\sigma) \propto \frac{1}{\sigma}$; usually we choose $\pi(\sigma) = \frac{1}{\sigma}$ (improper).

The Jeffreys prior is $\pi(\theta) \propto I(\theta)^{1/2}$, if the information $I(\theta)$ exists; it is invariant under reparametrization; it may or may not be improper.
Example continued. Suppose $y_1, y_2, \ldots, y_n$ are independent, $y_i \sim \mathcal{N}(\beta x_i, 1)$, where $\beta$ is an unknown parameter and $x_1, x_2, \ldots, x_n$ are known. Then
$$I(\beta) = -E\left( \frac{\partial^2}{\partial \beta^2} \log L(\beta, y) \right) = -E\left( \frac{\partial}{\partial \beta} \sum (y_i - \beta x_i) x_i \right) = s_{xx},$$
and so the Jeffreys prior is $\sqrt{s_{xx}}$: constant, and improper. With this Jeffreys prior, the calculation of the posterior distribution is equivalent to putting $\alpha^2 = \infty$ in the previous calculation, yielding that $\pi(\beta \mid y)$ is
$$\mathcal{N}\left( \frac{s_{xy}}{s_{xx}},\ (s_{xx})^{-1} \right).$$
If $\Theta$ is discrete, with constraints $E^\pi g_k(\theta) = \mu_k$, $k = 1, \ldots, m$, choose the distribution with the maximum entropy under these constraints:
$$\pi(\theta_i) = \frac{\exp\left( \sum_{k=1}^m \lambda_k g_k(\theta_i) \right)}{\sum_i \exp\left( \sum_{k=1}^m \lambda_k g_k(\theta_i) \right)},$$
where the $\lambda_k$ are determined by the constraints.

There is a similar formula for the continuous case, maximizing the entropy relative to a particular reference distribution $\pi_0$ under constraints. For $\pi_0$ one would choose the natural invariant non-informative prior.

For inference, check the influence of the choice of prior, for example by using an $\varepsilon$-contamination class of priors.
2. Point Estimation

Under suitable regularity conditions, with random sampling, when $n$ is large the posterior is approximately
$$\mathcal{N}\left( \hat\theta,\ \left( n I_1(\hat\theta) \right)^{-1} \right),$$
where $\hat\theta$ is the m.l.e., provided that the prior is non-zero in a region surrounding $\hat\theta$. In particular, if $\theta_0$ is the true parameter, then the posterior will become more and more concentrated around $\theta_0$. Often reporting the posterior distribution is preferred to point estimation.
Bayes estimators are constructed to minimize the posterior expected loss. Recall from Decision Theory: $\Theta$ is the set of all possible states of nature (values of the parameter), $D$ is the set of all possible decisions (actions); a loss function is any function
$$L: \Theta \times D \to [0, \infty).$$
For point estimation: $D = \Theta$, and $L(\theta, d)$ is the loss in reporting $d$ when $\theta$ is true.
Bayesian setting: for a prior $\pi$ and data $x \in \mathcal{X}$, the posterior expected loss of a decision $d$ is
$$\rho(\pi, d \mid x) = \int_\Theta L(\theta, d)\, \pi(\theta \mid x)\, d\theta.$$
For a prior $\pi$, the integrated risk of a decision rule $\delta$ is
$$r(\pi, \delta) = \int_\Theta \int_{\mathcal{X}} L(\theta, \delta(x))\, f(x \mid \theta)\, dx\, \pi(\theta)\, d\theta.$$
An estimator minimizing $r(\pi, \delta)$ can be obtained by selecting, for every $x \in \mathcal{X}$, the value $\delta(x)$ that minimizes $\rho(\pi, d \mid x)$. A Bayes estimator associated with prior $\pi$ and loss $L$ is any estimator $\delta^\pi$ which minimizes $r(\pi, \delta)$. Then $r(\pi) = r(\pi, \delta^\pi)$ is the Bayes risk.
This is valid for proper priors, and for improper priors if $r(\pi) < \infty$. If $r(\pi) = \infty$, define a generalized Bayes estimator as the minimizer, for every $x$, of $\rho(\pi, d \mid x)$.

For strictly convex loss functions, Bayes estimators are unique. For squared error loss $L(\theta, d) = (\theta - d)^2$, the Bayes estimator $\delta^\pi$ associated with prior $\pi$ is the posterior mean. For absolute error loss $L(\theta, d) = |\theta - d|$, the posterior median is a Bayes estimator.
Example. Let $x$ be a single observation from $\mathcal{N}(\mu, 1)$, and let $\mu$ have prior distribution $\pi(\mu) \propto e^\mu$ on the whole real line. What is the generalized Bayes estimator for $\mu$ under squared error loss?

First we find the posterior distribution of $\mu$ given $x$:
$$\pi(\mu \mid x) \propto \exp\left( \mu - \frac{1}{2}(\mu - x)^2 \right) \propto \exp\left( -\frac{1}{2}\left( \mu - (x + 1) \right)^2 \right),$$
and so $\pi(\mu \mid x) = \mathcal{N}(x + 1, 1)$. The generalized Bayes estimator under squared error loss is the posterior mean, and so we have
$$\mu^B = x + 1.$$
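A quick numerical check of this posterior by grid integration (a sketch; the observed value $x = 0.7$ is an arbitrary choice of mine):

```python
import numpy as np

# Verify numerically that prior ~ exp(mu), x | mu ~ N(mu, 1) gives
# posterior N(x + 1, 1), so the posterior mean is x + 1.
x_obs = 0.7
mu = np.linspace(-10.0, 12.0, 20_001)  # grid covering the posterior mass
d_mu = mu[1] - mu[0]

unnorm = np.exp(mu - 0.5 * (mu - x_obs)**2)  # prior * likelihood
post = unnorm / (unnorm.sum() * d_mu)        # normalize on the grid

mean = (mu * post).sum() * d_mu
var = ((mu - mean)**2 * post).sum() * d_mu
print(mean, x_obs + 1)  # posterior mean equals x + 1
print(var)              # posterior variance equals 1
```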
Recall that we have seen more in Decision Theory: least favourable distributions.
3. Hypothesis Testing

$H_0: \theta \in \Theta_0$; $D = \{\text{accept } H_0, \text{reject } H_0\} = \{1, 0\}$, where 1 stands for acceptance. Loss function:
$$L(\theta, \phi) = \begin{cases} 0 & \text{if } \theta \in \Theta_0,\ \phi = 1 \\ a_0 & \text{if } \theta \in \Theta_0,\ \phi = 0 \\ 0 & \text{if } \theta \notin \Theta_0,\ \phi = 0 \\ a_1 & \text{if } \theta \notin \Theta_0,\ \phi = 1 \end{cases}$$
Under this loss function, the Bayes decision rule associated with a prior distribution $\pi$ is
$$\phi^\pi(x) = \begin{cases} 1 & \text{if } P^\pi(\theta \in \Theta_0 \mid x) > \frac{a_1}{a_0 + a_1} \\ 0 & \text{otherwise.} \end{cases}$$
The Bayes factor for testing $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$ is
$$B^\pi(x) = \frac{P^\pi(\theta \in \Theta_0 \mid x) / P^\pi(\theta \in \Theta_1 \mid x)}{P^\pi(\theta \in \Theta_0) / P^\pi(\theta \in \Theta_1)} = \frac{p(x \mid \theta \in \Theta_0)}{p(x \mid \theta \in \Theta_1)}.$$
This is the ratio of how likely the data are under $H_0$ to how likely they are under $H_1$; it measures the extent to which the data $x$ will change the odds of $\Theta_0$ relative to $\Theta_1$.

Note that we have to make sure that our prior distribution puts mass on $H_0$ (and on $H_1$). If $H_0$ is simple, this is usually achieved by choosing a prior that has some point mass on $H_0$ and otherwise lives on $H_1$.
Example. Suppose that $x_1, \ldots, x_n$ is a random sample from a Poisson distribution with unknown mean $\theta$; possible models for the prior distribution are
$$\pi_1(\theta) = e^{-\theta}, \quad \theta > 0, \qquad \text{and} \qquad \pi_2(\theta) = \theta e^{-\theta}, \quad \theta > 0.$$
The prior probabilities of model 1 and model 2 are assessed at probability 1/2 each. Then the Bayes factor is
$$B(x) = \frac{\int_0^\infty e^{-n\theta} \theta^{\sum x_i} e^{-\theta}\, d\theta}{\int_0^\infty e^{-n\theta} \theta^{\sum x_i} e^{-\theta}\, \theta\, d\theta} = \frac{\Gamma\left( \sum x_i + 1 \right) / (n + 1)^{\sum x_i + 1}}{\Gamma\left( \sum x_i + 2 \right) / (n + 1)^{\sum x_i + 2}} = \frac{n + 1}{\sum x_i + 1}.$$
Note that in this setting
$$B(x) = \frac{P(\text{model 1} \mid x)}{P(\text{model 2} \mid x)} = \frac{P(\text{model 1} \mid x)}{1 - P(\text{model 1} \mid x)},$$
so that $P(\text{model 1} \mid x) = \left( 1 + B(x)^{-1} \right)^{-1}$. Hence
$$P(\text{model 1} \mid x) = \left( 1 + \frac{\sum x_i + 1}{n + 1} \right)^{-1} = \left( 1 + \frac{\bar x + \frac{1}{n}}{1 + \frac{1}{n}} \right)^{-1},$$
which is decreasing as $\bar x$ increases.
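In code (a sketch; the Poisson data below are simulated purely for illustration):

```python
import numpy as np

# Bayes factor for the two Poisson-mean priors pi_1(theta) = exp(-theta)
# and pi_2(theta) = theta * exp(-theta), with equal prior model probabilities.
rng = np.random.default_rng(6)
x = rng.poisson(2.0, size=15)   # illustrative data
n, s = len(x), x.sum()

B = (n + 1) / (s + 1)           # B(x) = (n + 1) / (sum x_i + 1)
p_model1 = 1 / (1 + 1 / B)      # posterior probability of model 1
print(B, p_model1)
```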
For robustness: least favourable Bayesian answers. Suppose $H_0: \theta = \theta_0$, $H_1: \theta \ne \theta_0$, and the prior probability on $H_0$ is $\rho_0 = 1/2$. For $G$ a family of priors on $H_1$, put
$$B(x, G) = \inf_{g \in G} \frac{f(x \mid \theta_0)}{\int_\Theta f(x \mid \theta)\, g(\theta)\, d\theta}.$$
A Bayesian prior $g \in G$ on $H_1$ will then give posterior probability at least
$$P(x, G) = \left( 1 + \frac{1}{B(x, G)} \right)^{-1}$$
on $H_0$ (for $\rho_0 = 1/2$). If $\hat\theta$ is the m.l.e. of $\theta$, and $G_A$ is the set of all prior distributions, then
$$B(x, G_A) = \frac{f(x \mid \theta_0)}{f(x \mid \hat\theta(x))} \qquad \text{and} \qquad P(x, G_A) = \left( 1 + \frac{f(x \mid \hat\theta(x))}{f(x \mid \theta_0)} \right)^{-1}.$$
4. Credible intervals

A $(1 - \alpha)$ (posterior) credible interval (region) is an interval (region) of $\theta$ values within which $1 - \alpha$ of the posterior probability lies.

Often we would like to find the HPD (highest posterior density) region: a $(1 - \alpha)$ credible region that has minimal volume. When the posterior density is unimodal, this is often straightforward.
Example continued. Suppose $y_1, y_2, \ldots, y_n$ are independent normally distributed random variables, each with variance 1 and with means $\beta x_1, \ldots, \beta x_n$, where $\beta$ is an unknown real-valued parameter and $x_1, x_2, \ldots, x_n$ are known constants. Under the Jeffreys prior, the posterior is $\mathcal{N}\left( \frac{s_{xy}}{s_{xx}}, (s_{xx})^{-1} \right)$, so a 95% credible interval for $\beta$ is
$$\frac{s_{xy}}{s_{xx}} \pm 1.96\, \frac{1}{\sqrt{s_{xx}}}.$$
The Bayesian interpretation is conditional on the observed $x$; the randomness relates to the distribution of $\theta$. In contrast, a frequentist confidence interval applies before $x$ is observed; the randomness relates to the distribution of $x$.
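In code (a sketch, with the same illustrative data-generating choices as in the earlier posterior sketch):

```python
import numpy as np

# 95% credible interval s_xy/s_xx +/- 1.96 / sqrt(s_xx) under the
# Jeffreys prior for beta.
rng = np.random.default_rng(7)
x = np.linspace(0.1, 2.0, 25)
y = 1.5 * x + rng.normal(size=x.size)  # simulated data, true beta = 1.5

s_xx, s_xy = (x**2).sum(), (x * y).sum()
centre, half_width = s_xy / s_xx, 1.96 / np.sqrt(s_xx)
print(centre - half_width, centre + half_width)
```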
5. Not to forget about: Nuisance parameters

$\theta = (\psi, \lambda)$, where $\lambda$ is a nuisance parameter. From $\pi(\theta \mid x) = \pi((\psi, \lambda) \mid x)$, the marginal posterior of $\psi$ is
$$\pi(\psi \mid x) = \int \pi(\psi, \lambda \mid x)\, d\lambda.$$
Just integrate out the nuisance parameter.