A Very Brief Summary of Statistical Inference, and Examples

Trinity Term 2008
Prof. Gesine Reinert

Data $x = x_1, x_2, \ldots, x_n$: realisations of random variables $X_1, X_2, \ldots, X_n$ with distribution (model) $f(x_1, x_2, \ldots, x_n; \theta)$.

Frequentist: $\theta \in \Theta$ is an unknown constant.

1. Likelihood and Sufficiency (Fisherian)

Likelihood approach: $L(\theta) = L(\theta, x) = f(x_1, x_2, \ldots, x_n; \theta)$.

Often $X_1, X_2, \ldots, X_n$ are independent, identically distributed (i.i.d.); then
$$L(\theta, x) = \prod_{i=1}^n f(x_i, \theta).$$

To summarize the information about $\theta$, find a minimal sufficient statistic $t(x)$. From the Factorization Theorem: $T = t(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(x)$ such that for all $x$ and $\theta$,
$$f(x, \theta) = g(t(x), \theta)\, h(x).$$

Moreover, $T = t(X)$ is minimal sufficient when
$$\frac{f(x, \theta)}{f(y, \theta)} \text{ is constant in } \theta \iff t(x) = t(y).$$

Example. Let $X_1, \ldots, X_n$ be a random sample from the truncated exponential distribution with density $f_{X_i}(x_i) = e^{\theta - x_i}$ for $x_i > \theta$, or, using indicator notation, $f_{X_i}(x_i) = e^{\theta - x_i} I_{(\theta, \infty)}(x_i)$. Show that $Y_1 = \min_i X_i$ is sufficient for $\theta$.

Let $T = T(X_1, \ldots, X_n) = Y_1$. We need the pdf $f_T(t)$ of the smallest order statistic. Calculate the cdf of $X_i$:
$$F(x) = \int_\theta^x e^{\theta - z}\, dz = e^{\theta}\left[e^{-\theta} - e^{-x}\right] = 1 - e^{\theta - x}.$$

Now
$$P(T > t) = \prod_{i=1}^n (1 - F(t)) = (1 - F(t))^n.$$
Differentiating gives that $f_T(t)$ equals
$$n[1 - F(t)]^{n-1} f(t) = n e^{(\theta - t)(n-1)} e^{\theta - t} = n e^{n(\theta - t)}, \quad t > \theta.$$
So the conditional density of $X_1, \ldots, X_n$ given $T = t$ is
$$\frac{e^{\theta - x_1} e^{\theta - x_2} \cdots e^{\theta - x_n}}{n e^{n(\theta - t)}} = \frac{e^{-\sum x_i}}{n e^{-nt}}, \quad x_i \geq t,\ i = 1, \ldots, n,$$
which does not depend on $\theta$ for each fixed $t = \min_i x_i$. Note that since $x_i \geq t$, $i = 1, \ldots, n$, neither the expression nor the range space depends on $\theta$, so the first order statistic, $X_{(1)}$, is a sufficient statistic for $\theta$.

Alternatively, use the Factorization Theorem:
$$f_X(x) = \prod_{i=1}^n e^{\theta - x_i} I_{(\theta, \infty)}(x_i) = I_{(\theta, \infty)}(x_{(1)}) \prod_{i=1}^n e^{\theta - x_i} = e^{n\theta} I_{(\theta, \infty)}(x_{(1)})\, e^{-\sum_{i=1}^n x_i},$$
with $g(t, \theta) = e^{n\theta} I_{(\theta, \infty)}(t)$ and $h(x) = e^{-\sum_{i=1}^n x_i}$.

2. Point Estimation

Estimate $\theta$ by a function $t(x_1, \ldots, x_n)$ of the data; often by maximum likelihood: $\hat{\theta} = \arg\max_\theta L(\theta)$, or by the method of moments.

Neither is unbiased in general, but the m.l.e. is asymptotically unbiased and asymptotically efficient; under some regularity assumptions,
$$\hat{\theta} \approx \mathcal{N}\left(\theta, I_n^{-1}(\theta)\right),$$
where
$$I_n(\theta) = E_\theta\left[\left(\frac{\partial l(\theta, X)}{\partial \theta}\right)^2\right]$$
is the Fisher information (matrix).

Under more regularity,
$$I_n(\theta) = -E_\theta\left(\frac{\partial^2 l(\theta, X)}{\partial \theta^2}\right).$$
If $x$ is a random sample, then $I_n(\theta) = n I_1(\theta)$; we often abbreviate $I_1(\theta) = I(\theta)$.

The m.l.e. is a function of a sufficient statistic; recall that, for scalar $\theta$, there is a nice theory using the Cramér-Rao lower bound and the Rao-Blackwell theorem on how to obtain minimum variance unbiased estimators based on a sufficient statistic and an unbiased estimator.

Nice invariance property: the m.l.e. of a function $\phi(\theta)$ is $\phi(\hat{\theta})$.

Example. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the Gamma distribution with density
$$f(x; c, \beta) = \frac{x^{c-1}}{\Gamma(c)\beta^c} e^{-x/\beta}, \quad x > 0,$$
where $c > 0$, $\beta > 0$; $\theta = (c, \beta)$;
$$l(\theta) = -n \log \Gamma(c) - nc \log \beta + (c-1) \sum \log x_i - \frac{1}{\beta} \sum x_i.$$
Put $D_1(c) = \frac{\partial}{\partial c} \log \Gamma(c)$; then
$$\frac{\partial l}{\partial \beta} = -\frac{nc}{\beta} + \frac{1}{\beta^2} \sum x_i, \qquad \frac{\partial l}{\partial c} = -n D_1(c) - n \log \beta + \sum \log x_i.$$

Setting these equal to zero yields $\hat{\beta} = \bar{x}/\hat{c}$, where $\hat{c}$ solves
$$D_1(\hat{c}) - \log \hat{c} = \log\left(\left[\prod x_i\right]^{1/n} / \bar{x}\right).$$
(One could check that the sufficient statistic is indeed $\left(\sum x_i, \prod x_i\right)$, and that it is minimal sufficient.) Check the second derivatives for a maximum. Calculate the Fisher information: put $D_2(c) = \frac{\partial^2}{\partial c^2} \log \Gamma(c)$; then
$$\frac{\partial^2 l}{\partial \beta^2} = \frac{nc}{\beta^2} - \frac{2}{\beta^3} \sum x_i, \qquad \frac{\partial^2 l}{\partial \beta\, \partial c} = -\frac{n}{\beta}, \qquad \frac{\partial^2 l}{\partial c^2} = -n D_2(c).$$

Use that $E X_i = c\beta$ to obtain
$$I_n(\theta) = n \begin{pmatrix} c/\beta^2 & 1/\beta \\ 1/\beta & D_2(c) \end{pmatrix}.$$
Recall also that there is an iterative method to compute m.l.e.s, related to the Newton-Raphson method.
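The stationarity equations above can be solved numerically. Below is a minimal sketch (the data values are hypothetical), which approximates $D_1(c)$ by a finite difference of `math.lgamma` and bisects for $\hat{c}$; since $D_1(c) - \log c$ is increasing from $-\infty$ to $0$, bisection is safe.

```python
import math

def D1(c, h=1e-6):
    # D1(c) = d/dc log Gamma(c), via a central difference on math.lgamma
    return (math.lgamma(c + h) - math.lgamma(c - h)) / (2 * h)

def gamma_mle(x):
    """Solve D1(c) - log c = log(geometric mean / arithmetic mean) by bisection."""
    n = len(x)
    xbar = sum(x) / n
    rhs = sum(math.log(v) for v in x) / n - math.log(xbar)  # <= 0 by AM-GM
    lo, hi = 1e-6, 1e6
    for _ in range(200):
        mid = math.sqrt(lo * hi)          # bisect on a log scale
        if D1(mid) - math.log(mid) < rhs:
            lo = mid
        else:
            hi = mid
    c_hat = math.sqrt(lo * hi)
    return c_hat, xbar / c_hat            # (c_hat, beta_hat = xbar / c_hat)

x = [1.2, 0.7, 3.1, 2.2, 0.9, 1.5]        # hypothetical sample
c_hat, beta_hat = gamma_mle(x)
```

This is exactly the kind of iterative computation the Newton-Raphson remark refers to; bisection is slower but needs no second derivatives.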

3. Hypothesis Testing

For a simple null hypothesis and a simple alternative, the Neyman-Pearson Lemma says that the most powerful tests are likelihood-ratio tests. These can sometimes be generalized to one-sided alternatives in such a way as to yield uniformly most powerful (UMP) tests.

Example. Suppose as above that $X_1, \ldots, X_n$ is a random sample from the truncated exponential distribution, where $f_{X_i}(x_i; \theta) = e^{\theta - x_i}$, $x_i > \theta$. Find a UMP test of size $\alpha$ for testing $H_0: \theta = \theta_0$ against $H_1: \theta > \theta_0$.

Let $\theta_1 > \theta_0$; then the likelihood ratio is
$$\frac{f(x, \theta_1)}{f(x, \theta_0)} = e^{n(\theta_1 - \theta_0)} I_{(\theta_1, \infty)}(x_{(1)}),$$
which increases with $x_{(1)}$, so we reject $H_0$ if the smallest observation, $T = X_{(1)}$, is large. Under $H_0$, we have calculated that the pdf of $T$ is
$$f_T(y_1) = n e^{n(\theta_0 - y_1)}, \quad y_1 > \theta_0.$$

So for $t > \theta_0$, under $H_0$,
$$P(T > t) = \int_t^\infty n e^{n(\theta_0 - s)}\, ds = e^{n(\theta_0 - t)}.$$
For a test of size $\alpha$, choose $t$ such that $e^{n(\theta_0 - t)} = \alpha$, that is,
$$t = \theta_0 - \frac{\ln \alpha}{n}.$$
The LR test rejects $H_0$ if $X_{(1)} > \theta_0 - (\ln \alpha)/n$. The test is the same for all $\theta_1 > \theta_0$, so it is UMP.
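A quick Monte Carlo sanity check of this critical value (the simulation settings below are arbitrary): under $H_0$ each $X_i$ is $\theta_0$ plus a standard exponential, and the rejection rate should come out close to $\alpha$.

```python
import math, random

def reject(sample, theta0, alpha):
    # reject H0 when the smallest observation exceeds theta0 - ln(alpha)/n
    return min(sample) > theta0 - math.log(alpha) / len(sample)

random.seed(1)
theta0, n, alpha, reps = 2.0, 5, 0.05, 40000
hits = sum(
    reject([theta0 + random.expovariate(1.0) for _ in range(n)], theta0, alpha)
    for _ in range(reps)
)
rate = hits / reps   # empirical size; should be close to alpha = 0.05
```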

For a general null hypothesis and a general alternative,
$$H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0,$$
the (generalized) LR test uses the likelihood ratio statistic
$$T = \frac{\max_{\theta \in \Theta} L(\theta; X)}{\max_{\theta \in \Theta_0} L(\theta; X)}$$
and rejects $H_0$ for large values of $T$; use the chi-square asymptotics
$$2 \log T \approx \chi^2_p, \quad \text{where } p = \dim \Theta - \dim \Theta_0$$
(nested models only). Alternatively, use score tests, which are based on the score function $\partial l / \partial \theta$, with asymptotics in terms of the normal distribution $\mathcal{N}(0, I_n(\theta))$. An important example is Pearson's chi-square test, which we derived as a score test, and we saw that it is asymptotically equivalent to the generalized likelihood ratio test.

Example. Let $X_1, \ldots, X_n$ be a random sample from a geometric distribution with parameter $p$;
$$P(X_i = k) = p(1-p)^k, \quad k = 0, 1, 2, \ldots.$$
Then $L(p) = p^n (1-p)^{\sum x_i}$ and $l(p) = n \ln p + n\bar{x} \ln(1-p)$, so that
$$\frac{\partial l}{\partial p} = n\left(\frac{1}{p} - \frac{\bar{x}}{1-p}\right)$$
and
$$\frac{\partial^2 l}{\partial p^2} = -n\left(\frac{1}{p^2} + \frac{\bar{x}}{(1-p)^2}\right).$$

We know that $E X_1 = (1-p)/p$. Calculate the information:
$$I(p) = \frac{n}{p^2(1-p)}.$$
Suppose $n = 20$, $\bar{x} = 3$, $H_0: p = 0.15$ and $H_1: p \neq 0.15$. Then $\frac{\partial l}{\partial p}(0.15) = 62.7$ and $I(0.15) = 1045.8$. The test statistic is $Z = 62.7/\sqrt{1045.8} = 1.9388$. Compare to $1.96$ for a test at level $\alpha = 0.05$: do not reject $H_0$.
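These numbers are easy to reproduce directly; the tiny discrepancy in the last digits of $Z$ comes from rounding the score to 62.7 before dividing.

```python
import math

n, xbar, p0 = 20, 3.0, 0.15
score = n * (1 / p0 - xbar / (1 - p0))   # dl/dp evaluated at p0
info = n / (p0 ** 2 * (1 - p0))          # Fisher information I(p0)
z = score / math.sqrt(info)
# score ≈ 62.75, info ≈ 1045.75, z ≈ 1.94 < 1.96: do not reject H0
```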

Example continued. For a generalized LRT, the test statistic is based on
$$2 \log \left[ L(\hat{p})/L(p_0) \right] \approx \chi^2_1,$$
and the test statistic is thus
$$2\left(l(\hat{p}) - l(p_0)\right) = 2n\left(\ln \hat{p} - \ln p_0 + \bar{x}\left(\ln(1 - \hat{p}) - \ln(1 - p_0)\right)\right).$$
We calculate that the m.l.e. is $\hat{p} = 1/(1 + \bar{x})$. Suppose again $n = 20$, $\bar{x} = 3$, $H_0: p_0 = 0.15$ and $H_1: p_0 \neq 0.15$. The m.l.e. is $\hat{p} = 0.25$, with $l(0.25) = -44.98$ and $l(0.15) = -47.69$. Calculate $\chi^2 = 2(47.69 - 44.98) = 5.4$ and compare to the chi-square distribution with 1 degree of freedom: the 5 percent critical value is $3.84$, so reject $H_0$.
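The same numbers again, computed from the log-likelihood:

```python
import math

n, xbar = 20, 3.0

def loglik(p):
    # l(p) = n ln p + n xbar ln(1 - p) for the geometric model
    return n * math.log(p) + n * xbar * math.log(1 - p)

p_hat = 1 / (1 + xbar)                     # m.l.e. = 0.25
chi2 = 2 * (loglik(p_hat) - loglik(0.15))
# l(0.25) ≈ -44.99, l(0.15) ≈ -47.69, chi2 ≈ 5.41 > 3.84: reject H0
```

Note that the score test and the LRT disagree here at the 5% level; both are only asymptotic.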

4. Confidence Regions

If we can find a pivot, a function $t(X, \theta)$ of a sufficient statistic whose distribution does not depend on $\theta$, then we can find confidence regions in a straightforward manner. Otherwise we may have to resort to approximate confidence regions, for example using the approximate normality of the m.l.e.

Recall that confidence intervals are equivalent to hypothesis tests with a simple null hypothesis and one- or two-sided alternatives.

Not to forget about: profile likelihood. Often $\theta = (\psi, \lambda)$, where $\psi$ contains the parameters of interest; then base inference on the profile likelihood for $\psi$,
$$L_P(\psi) = L(\psi, \hat{\lambda}_\psi).$$
Again we can use the (generalized) LRT or a score test; if $\psi$ is scalar, with $H_0: \psi = \psi_0$, $H_1^+: \psi > \psi_0$, $H_1^-: \psi < \psi_0$, use the test statistic
$$T = \frac{\partial l}{\partial \psi}(\psi_0, \hat{\lambda}_0; X),$$
where $\hat{\lambda}_0$ is the m.l.e. for $\lambda$ when $H_0$ is true.

Large positive values of $T$ indicate $H_1^+$; large negative values indicate $H_1^-$. We have
$$T \approx l_\psi - I_{\psi,\lambda} I_{\lambda,\lambda}^{-1} l_\lambda \sim \mathcal{N}\left(0, 1/I^{\psi,\psi}\right),$$
where
$$I^{\psi,\psi} = \left(I_{\psi,\psi} - I_{\psi,\lambda}^2 I_{\lambda,\lambda}^{-1}\right)^{-1}$$
is the top left element of $I^{-1}$. Estimate by substituting the null hypothesis values, and calculate the practical standardized form of $T$ as
$$Z = \frac{T}{\sqrt{\operatorname{Var}(T)}} \approx \frac{\partial l}{\partial \psi}(\psi, \hat{\lambda}_\psi) \left[I^{\psi,\psi}(\psi, \hat{\lambda}_\psi)\right]^{1/2},$$
which is approximately standard normal.

Bias and variance approximations: the delta method. If we cannot calculate the mean and variance directly: suppose $T = g(S)$, where $E S = \beta$ and $\operatorname{Var} S = V$. Taylor expansion:
$$T = g(S) \approx g(\beta) + (S - \beta) g'(\beta).$$
Taking the mean and variance of the r.h.s.:
$$E T \approx g(\beta), \qquad \operatorname{Var} T \approx [g'(\beta)]^2 V.$$
This also works for vectors $S, \beta$, with $T$ still a scalar: with $(g'(\beta))_i = \partial g / \partial \beta_i$ and $g''(\beta)$ the matrix of second derivatives,
$$\operatorname{Var} T \approx [g'(\beta)]^T V g'(\beta) \qquad \text{and} \qquad E T \approx g(\beta) + \frac{1}{2} \operatorname{trace}\left[g''(\beta) V\right].$$
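A small simulation illustrating the delta method for the (arbitrary) choice $g(s) = e^s$ with $S$ normal; the approximations are good here because $V$ is small.

```python
import math, random

random.seed(0)
beta, V = 1.0, 0.01                     # E S and Var S (small variance)
# delta method with g = exp, so g'(beta) = exp(beta):
approx_mean = math.exp(beta)
approx_var = math.exp(beta) ** 2 * V

t = [math.exp(random.gauss(beta, math.sqrt(V))) for _ in range(200000)]
mc_mean = sum(t) / len(t)
mc_var = sum((s - mc_mean) ** 2 for s in t) / len(t)
# mc_mean ≈ approx_mean and mc_var ≈ approx_var, up to O(V) bias and MC noise
```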

Exponential family. For distributions in the exponential family, many calculations have been standardized; see the Notes.

A Very Brief Summary of Bayesian Inference, and Examples

Data $x = x_1, x_2, \ldots, x_n$: realisations of random variables $X_1, X_2, \ldots, X_n$ with distribution (model) $f(x_1, x_2, \ldots, x_n \mid \theta)$.

Bayesian: $\theta \in \Theta$ random, with prior $\pi(\theta)$.

1. Priors, Posteriors, Likelihood, and Sufficiency

The posterior distribution of $\theta$ given $x$ is
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\pi(\theta)}{\int f(x \mid \theta)\pi(\theta)\, d\theta}.$$
Shorter: posterior $\propto$ prior $\times$ likelihood.

The (prior) predictive distribution of the data $x$ on the basis of $\pi$ is
$$p(x) = \int f(x \mid \theta)\pi(\theta)\, d\theta.$$
Suppose data $x_1$ is available and we want to predict additional data:
$$p(x_2 \mid x_1) = \int f(x_2 \mid \theta)\pi(\theta \mid x_1)\, d\theta.$$
Note that $x_2$ and $x_1$ are assumed conditionally independent given $\theta$; they are not, in general, unconditionally independent.

Example. Suppose $y_1, y_2, \ldots, y_n$ are independent normally distributed random variables, each with variance 1 and with means $\beta x_1, \ldots, \beta x_n$, where $\beta$ is an unknown real-valued parameter and $x_1, x_2, \ldots, x_n$ are known constants. Suppose the prior is $\pi(\beta) = \mathcal{N}(\mu, \alpha^2)$. Then the likelihood is
$$L(\beta) = (2\pi)^{-n/2} \prod_{i=1}^n e^{-\frac{(y_i - \beta x_i)^2}{2}}$$
and
$$\pi(\beta) = \frac{1}{\sqrt{2\pi\alpha^2}}\, e^{-\frac{(\beta - \mu)^2}{2\alpha^2}} \propto \exp\left\{-\frac{\beta^2}{2\alpha^2} + \frac{\beta\mu}{\alpha^2}\right\}.$$

The posterior of $\beta$ given $y_1, y_2, \ldots, y_n$ can then be calculated as
$$\pi(\beta \mid y) \propto \exp\left\{-\frac{1}{2}\sum (y_i - \beta x_i)^2 - \frac{1}{2\alpha^2}(\beta - \mu)^2\right\} \propto \exp\left\{\beta \sum x_i y_i - \frac{1}{2}\beta^2 \sum x_i^2 - \frac{\beta^2}{2\alpha^2} + \frac{\beta\mu}{\alpha^2}\right\}.$$
Abbreviate $s_{xx} = \sum x_i^2$ and $s_{xy} = \sum x_i y_i$; then
$$\pi(\beta \mid y) \propto \exp\left\{\beta s_{xy} - \frac{1}{2}\beta^2 s_{xx} - \frac{\beta^2}{2\alpha^2} + \frac{\beta\mu}{\alpha^2}\right\} = \exp\left\{-\frac{\beta^2}{2}\left(s_{xx} + \frac{1}{\alpha^2}\right) + \beta\left(s_{xy} + \frac{\mu}{\alpha^2}\right)\right\},$$
which we recognize as
$$\mathcal{N}\left(\frac{s_{xy} + \mu/\alpha^2}{s_{xx} + 1/\alpha^2},\ \left(s_{xx} + \frac{1}{\alpha^2}\right)^{-1}\right).$$
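The closed form can be verified against a brute-force grid computation; the data and hyperparameters below are made up.

```python
import math

xs = [1.0, 2.0, 3.0]          # known constants x_i (hypothetical)
ys = [1.1, 1.9, 3.2]          # observed y_i (hypothetical)
mu, alpha2 = 0.0, 4.0         # prior N(mu, alpha^2)

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
post_mean = (sxy + mu / alpha2) / (sxx + 1 / alpha2)
post_var = 1 / (sxx + 1 / alpha2)

# brute force: evaluate the unnormalised posterior kernel on a fine grid
grid = [post_mean + (i - 2000) * 0.005 for i in range(4001)]
w = [math.exp(b * sxy - 0.5 * b * b * sxx - (b - mu) ** 2 / (2 * alpha2))
     for b in grid]
z = sum(w)
g_mean = sum(b * wi for b, wi in zip(grid, w)) / z
g_var = sum((b - g_mean) ** 2 * wi for b, wi in zip(grid, w)) / z
# g_mean ≈ post_mean and g_var ≈ post_var
```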

Summarize the information about $\theta$: find a minimal sufficient statistic $t(x)$; $T = t(X)$ is sufficient for $\theta$ if and only if for all $x$ and $\theta$,
$$\pi(\theta \mid x) = \pi(\theta \mid t(x)).$$
This agrees with the frequentist approach: the posterior, and hence the inference, is based on a sufficient statistic.

Choice of prior. There are many ways of choosing a prior distribution, for example using a coherent belief system. Often it is convenient to restrict the class of priors to a particular family of distributions. When the posterior is in the same family of models as the prior, i.e. when one has a conjugate prior, then updating the distribution under new data is particularly convenient. For the regular $k$-parameter exponential family,
$$f(x \mid \theta) = f(x)\, g(\theta) \exp\left\{\sum_{i=1}^k c_i \phi_i(\theta) h_i(x)\right\}, \quad x \in \mathcal{X},$$
conjugate priors can be derived in a straightforward manner, using the sufficient statistics.

Non-informative priors favour no particular values of the parameter over others. If $\Theta$ is finite, choose the uniform prior; if $\Theta$ is infinite, there are several ways (some may be improper).

For a location density $f(x \mid \theta) = f(x - \theta)$, the non-informative location-invariant prior is $\pi(\theta) = \pi(0)$, constant; usually we choose $\pi(\theta) = 1$ for all $\theta$ (improper prior).

For a scale density $f(x \mid \sigma) = \frac{1}{\sigma} f\left(\frac{x}{\sigma}\right)$ for $\sigma > 0$, the scale-invariant non-informative prior is $\pi(\sigma) \propto \frac{1}{\sigma}$; usually we choose $\pi(\sigma) = \frac{1}{\sigma}$ (improper).

The Jeffreys prior is $\pi(\theta) \propto I(\theta)^{1/2}$, if the information $I(\theta)$ exists; it is invariant under reparametrization, and may or may not be improper.

Example continued. Suppose $y_1, y_2, \ldots, y_n$ are independent, $y_i \sim \mathcal{N}(\beta x_i, 1)$, where $\beta$ is an unknown parameter and $x_1, x_2, \ldots, x_n$ are known. Then
$$I(\beta) = -E\left(\frac{\partial^2}{\partial \beta^2} \log L(\beta, y)\right) = -E\left(\frac{\partial}{\partial \beta} \sum (y_i - \beta x_i) x_i\right) = s_{xx},$$
and so the Jeffreys prior is $\propto \sqrt{s_{xx}}$: constant, and improper. With this Jeffreys prior, the calculation of the posterior distribution is equivalent to putting $\alpha^2 = \infty$ in the previous calculation, yielding that $\pi(\beta \mid y)$ is
$$\mathcal{N}\left(\frac{s_{xy}}{s_{xx}},\ (s_{xx})^{-1}\right).$$

If $\Theta$ is discrete, with constraints $E^\pi g_k(\theta) = \mu_k$, $k = 1, \ldots, m$, choose the distribution with the maximum entropy under these constraints:
$$\pi(\theta_i) = \frac{\exp\left(\sum_{k=1}^m \lambda_k g_k(\theta_i)\right)}{\sum_j \exp\left(\sum_{k=1}^m \lambda_k g_k(\theta_j)\right)},$$
where the $\lambda_k$ are determined by the constraints.

There is a similar formula for the continuous case, maximizing the entropy relative to a particular reference distribution $\pi_0$ under constraints. For $\pi_0$ one would choose the natural invariant non-informative prior.

For inference, check the influence of the choice of prior, for example by using an $\epsilon$-contamination class of priors.

2. Point Estimation

Under suitable regularity conditions, with random sampling, when $n$ is large the posterior is approximately
$$\mathcal{N}\left(\hat{\theta},\ \left(n I_1(\hat{\theta})\right)^{-1}\right),$$
where $\hat{\theta}$ is the m.l.e., provided that the prior is non-zero in a region surrounding $\hat{\theta}$. In particular, if $\theta_0$ is the true parameter, then the posterior will become more and more concentrated around $\theta_0$. Often reporting the posterior distribution is preferred to point estimation.

Bayes estimators are constructed to minimize the posterior expected loss. Recall from Decision Theory: $\Theta$ is the set of all possible states of nature (values of the parameter), $D$ is the set of all possible decisions (actions); a loss function is any function
$$L: \Theta \times D \to [0, \infty).$$
For point estimation: $D = \Theta$, and $L(\theta, d)$ is the loss in reporting $d$ when $\theta$ is true.

Bayesian: for a prior $\pi$ and data $x \in \mathcal{X}$, the posterior expected loss of a decision is
$$\rho(\pi, d \mid x) = \int_\Theta L(\theta, d)\, \pi(\theta \mid x)\, d\theta.$$
For a prior $\pi$, the integrated risk of a decision rule $\delta$ is
$$r(\pi, \delta) = \int_\Theta \int_{\mathcal{X}} L(\theta, \delta(x))\, f(x \mid \theta)\, dx\, \pi(\theta)\, d\theta.$$
An estimator minimizing $r(\pi, \delta)$ can be obtained by selecting, for every $x \in \mathcal{X}$, the value $\delta(x)$ that minimizes $\rho(\pi, \delta(x) \mid x)$. A Bayes estimator associated with prior $\pi$ and loss $L$ is any estimator $\delta^\pi$ which minimizes $r(\pi, \delta)$. Then $r(\pi) = r(\pi, \delta^\pi)$ is the Bayes risk.

This is valid for proper priors, and for improper priors if $r(\pi) < \infty$. If $r(\pi) = \infty$, define a generalized Bayes estimator as the minimizer, for every $x$, of $\rho(\pi, d \mid x)$.

For strictly convex loss functions, Bayes estimators are unique. For squared error loss $L(\theta, d) = (\theta - d)^2$, the Bayes estimator $\delta^\pi$ associated with prior $\pi$ is the posterior mean. For absolute error loss $L(\theta, d) = |\theta - d|$, the posterior median is a Bayes estimator.

Example. Let $x$ be a single observation from $\mathcal{N}(\mu, 1)$, and let $\mu$ have prior distribution $\pi(\mu) \propto e^{\mu}$ on the whole real line. What is the generalized Bayes estimator for $\mu$ under squared error loss?

First we find the posterior distribution of $\mu$ given $x$:
$$\pi(\mu \mid x) \propto \exp\left(\mu - \frac{1}{2}(\mu - x)^2\right) \propto \exp\left(-\frac{1}{2}\left(\mu - (x + 1)\right)^2\right),$$
and so $\pi(\mu \mid x) = \mathcal{N}(x + 1, 1)$. The generalized Bayes estimator under squared error loss is the posterior mean, and so $\mu^B = x + 1$.
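A numerical-integration check of this example (the grid settings are arbitrary): the computed posterior mean should equal $x + 1$.

```python
import math

def posterior_mean(x, half=12.0, step=0.001):
    # posterior mean of mu under prior pi(mu) ∝ e^mu and one N(mu, 1) observation
    num = den = 0.0
    m = x + 1.0                  # centre the grid; the width covers all the mass
    k = int(2 * half / step)
    for i in range(k + 1):
        mu = m - half + i * step
        w = math.exp(mu - 0.5 * (mu - x) ** 2)   # prior * likelihood, unnormalised
        num += mu * w
        den += w
    return num / den

mu_B = posterior_mean(0.3)       # should equal x + 1 = 1.3
```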

Recall that we have seen more in Decision Theory: least favourable distributions.

3. Hypothesis Testing

$H_0: \theta \in \Theta_0$; $D = \{\text{accept } H_0, \text{reject } H_0\} = \{1, 0\}$, where 1 stands for acceptance. Loss function:
$$L(\theta, \phi) = \begin{cases} 0 & \text{if } \theta \in \Theta_0,\ \phi = 1, \\ a_0 & \text{if } \theta \in \Theta_0,\ \phi = 0, \\ 0 & \text{if } \theta \notin \Theta_0,\ \phi = 0, \\ a_1 & \text{if } \theta \notin \Theta_0,\ \phi = 1. \end{cases}$$
Under this loss function, the Bayes decision rule associated with a prior distribution $\pi$ is
$$\phi^\pi(x) = \begin{cases} 1 & \text{if } P^\pi(\theta \in \Theta_0 \mid x) > \frac{a_1}{a_0 + a_1}, \\ 0 & \text{otherwise.} \end{cases}$$

The Bayes factor for testing $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$ is
$$B^\pi(x) = \frac{P^\pi(\theta \in \Theta_0 \mid x) / P^\pi(\theta \in \Theta_1 \mid x)}{P^\pi(\theta \in \Theta_0) / P^\pi(\theta \in \Theta_1)} = \frac{p(x \mid \theta \in \Theta_0)}{p(x \mid \theta \in \Theta_1)}.$$
This is the ratio of how likely the data are under $H_0$ to how likely the data are under $H_1$; it measures the extent to which the data $x$ will change the odds of $\Theta_0$ relative to $\Theta_1$.

Note that we have to make sure that our prior distribution puts mass on $H_0$ (and on $H_1$). If $H_0$ is simple, this is usually achieved by choosing a prior that has some point mass on $H_0$ and otherwise lives on $H_1$.

Example. Suppose that $x_1, \ldots, x_n$ is a random sample from a Poisson distribution with unknown mean $\theta$; possible models for the prior distribution are
$$\pi_1(\theta) = e^{-\theta}, \quad \theta > 0, \qquad \text{and} \qquad \pi_2(\theta) = \theta e^{-\theta}, \quad \theta > 0.$$
The prior probabilities of model 1 and model 2 are assessed at probability $1/2$ each. Then the Bayes factor is
$$B(x) = \frac{\int_0^\infty e^{-n\theta} \theta^{\sum x_i} e^{-\theta}\, d\theta}{\int_0^\infty e^{-n\theta} \theta^{\sum x_i}\, \theta e^{-\theta}\, d\theta} = \frac{\Gamma\left(\sum x_i + 1\right)/(n+1)^{\sum x_i + 1}}{\Gamma\left(\sum x_i + 2\right)/(n+1)^{\sum x_i + 2}} = \frac{n + 1}{\sum x_i + 1}.$$

Note that in this setting
$$B(x) = \frac{P(\text{model 1} \mid x)}{P(\text{model 2} \mid x)} = \frac{P(\text{model 1} \mid x)}{1 - P(\text{model 1} \mid x)},$$
so that $P(\text{model 1} \mid x) = \left(1 + B(x)^{-1}\right)^{-1}$. Hence
$$P(\text{model 1} \mid x) = \left(1 + \frac{\sum x_i + 1}{n + 1}\right)^{-1} = \left(1 + \frac{\bar{x} + \frac{1}{n}}{1 + \frac{1}{n}}\right)^{-1},$$
which is decreasing as $\bar{x}$ increases.
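The closed form $B(x) = (n+1)/(\sum x_i + 1)$ is easy to check by numerical integration; $n$ and the data total below are made up.

```python
import math

def marginal(n, sx, extra, step=0.001, upper=60.0):
    # integral of exp(-n*theta) * theta^sx * (theta^extra * e^{-theta}) d theta,
    # by a simple Riemann sum (the x_i! factors cancel in the Bayes factor)
    total = 0.0
    t = step
    while t < upper:
        total += math.exp(-(n + 1) * t + (sx + extra) * math.log(t)) * step
        t += step
    return total

n, sx = 5, 7                        # hypothetical: n observations summing to sx
B_closed = (n + 1) / (sx + 1)       # = 0.75
B_numeric = marginal(n, sx, 0) / marginal(n, sx, 1)
```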

For robustness: least favourable Bayesian answers. Suppose $H_0: \theta = \theta_0$, $H_1: \theta \neq \theta_0$, and the prior probability on $H_0$ is $\rho_0 = 1/2$. For $G$ a family of priors on $H_1$, put
$$\underline{B}(x, G) = \inf_{g \in G} \frac{f(x \mid \theta_0)}{\int_\Theta f(x \mid \theta) g(\theta)\, d\theta}.$$
Any prior $g \in G$ on $H_1$ will then give posterior probability at least
$$\underline{P}(x, G) = \left(1 + \frac{1}{\underline{B}(x, G)}\right)^{-1}$$
on $H_0$ (for $\rho_0 = 1/2$). If $\hat{\theta}$ is the m.l.e. of $\theta$, and $G_A$ is the set of all prior distributions, then
$$\underline{B}(x, G_A) = \frac{f(x \mid \theta_0)}{f(x \mid \hat{\theta}(x))} \qquad \text{and} \qquad \underline{P}(x, G_A) = \left(1 + \frac{f(x \mid \hat{\theta}(x))}{f(x \mid \theta_0)}\right)^{-1}.$$

4. Credible intervals

A $(1 - \alpha)$ (posterior) credible interval (region) is an interval (region) of $\theta$-values within which $1 - \alpha$ of the posterior probability lies. Often we would like to find the HPD (highest posterior density) region: a $(1 - \alpha)$ credible region that has minimal volume. When the posterior density is unimodal, this is often straightforward.

Example continued. Suppose $y_1, y_2, \ldots, y_n$ are independent normally distributed random variables, each with variance 1 and with means $\beta x_1, \ldots, \beta x_n$, where $\beta$ is an unknown real-valued parameter and $x_1, x_2, \ldots, x_n$ are known constants. Under the Jeffreys prior, the posterior is $\mathcal{N}\left(\frac{s_{xy}}{s_{xx}}, (s_{xx})^{-1}\right)$, so a 95% credible interval for $\beta$ is
$$\frac{s_{xy}}{s_{xx}} \pm 1.96 \frac{1}{\sqrt{s_{xx}}}.$$
The Bayesian interpretation is conditional on the observed $x$: the randomness relates to the distribution of $\theta$. In contrast, a frequentist confidence interval applies before $x$ is observed: the randomness relates to the distribution of $x$.
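As a small worked illustration (the data values are hypothetical):

```python
import math

xs = [1.0, 2.0, 3.0]               # known constants x_i (hypothetical)
ys = [1.1, 1.9, 3.2]               # observations y_i (hypothetical)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
centre = sxy / sxx                 # posterior mean under the Jeffreys prior
half = 1.96 / math.sqrt(sxx)
interval = (centre - half, centre + half)   # 95% credible interval for beta
```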

5. Not to forget about: Nuisance parameters

Write $\theta = (\psi, \lambda)$, where $\lambda$ is a nuisance parameter, and $\pi(\theta \mid x) = \pi((\psi, \lambda) \mid x)$. The marginal posterior of $\psi$ is
$$\pi(\psi \mid x) = \int \pi(\psi, \lambda \mid x)\, d\lambda.$$
Just integrate out the nuisance parameter.