Statistical Inference

Robert L. Wolpert
Institute of Statistics and Decision Sciences
Duke University, Durham, NC, USA

1. Asymptotic Inference in Exponential Families

Let $X_j$ be a sequence of independent, identically distributed random variables from a natural exponential family
\[
f(x \mid \eta) = h(x)\, e^{\eta' T(x) - A(\eta)}
\]
and let $q$ denote the dimension of $T$ and $\eta$. We have seen that the log moment generating function for $T(X)$ is $A(\omega + \eta) - A(\eta)$, and hence that the mean and covariance of $T$ are given respectively by
\[
\mathrm{E}[T \mid \eta] = \nabla A(\eta), \qquad \mathrm{V}[T \mid \eta] = \nabla^2 A(\eta) = I(\eta), \tag{1}
\]
where we recognize the Hessian of $A(\eta)$ as the information matrix, both expected (Fisher) and observed. The likelihood function upon observing a sample of size $n$ is
\[
L_n(\eta) = \prod_j h(x_j)\, e^{\eta' \sum_j T(x_j) - n A(\eta)},
\]
so the Maximum Likelihood Estimator (MLE) $\hat\eta_n$ of $\eta$ satisfies the equation
\[
\nabla A(\hat\eta_n) = \bar T_n.
\]
Under suitable regularity conditions the Central Limit Theorem will ensure that $\bar T_n$ has an asymptotically normal distribution with (by (1)) mean $\nabla A(\eta)$ and covariance $I(\eta)/n$, so $\sqrt{n}\,[\nabla A(\hat\eta_n) - \nabla A(\eta)]$ will have asymptotically a $\mathrm{No}(0, I(\eta))$ distribution.
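For a quick numerical check (not part of the original notes), here is a short R sketch using the Poisson family as a concrete natural exponential family, with $T(x) = x$ and $A(\eta) = e^\eta$, so that $\nabla A(\eta) = e^\eta = I(\eta)$; the choices of $\eta$, $n$, and the number of replications are arbitrary.

  ## Minimal sketch, assuming the Poisson family: grad A(eta) = exp(eta) = I(eta).
  set.seed(1)
  eta  <- log(3)     # true natural parameter (mean lambda = 3)
  n    <- 500        # sample size (arbitrary)
  reps <- 2000       # Monte Carlo replications (arbitrary)
  z <- replicate(reps, {
    x <- rpois(n, lambda = exp(eta))
    sqrt(n) * (mean(x) - exp(eta))    # sqrt(n) * [grad A(eta.hat) - grad A(eta)]
  })
  c(var.z = var(z), I.eta = exp(eta)) # the two should be close, as claimed above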

By Taylor's Theorem we can write
\[
\nabla A(\hat\eta_n) - \nabla A(\eta) = \nabla^2 A(\eta)\,(\hat\eta_n - \eta) + O(\|\hat\eta_n - \eta\|^2) = I(\eta)\,(\hat\eta_n - \eta) + O(1/n),
\]
from which we conclude
\[
\sqrt{n}\,(\hat\eta_n - \eta) \Rightarrow \mathrm{No}\bigl(0,\ I(\eta)^{-1}\bigr)
\]
or, more casually, that $\hat\eta_n$ has approximately a $\mathrm{No}\bigl(\eta,\ [n\,I(\eta)]^{-1}\bigr)$ distribution, so the Maximum Likelihood Estimator is consistent, efficient, and asymptotically normal.

1.1. Unnatural Families

If we parametrize by some $\theta \in \Theta$ other than the natural parameter $\eta$, but still have a smooth mapping $\theta \mapsto \eta(\theta)$, we can note that $\hat\eta_n = \eta(\hat\theta_n)$ and (again, by Taylor)
\[
\eta(\hat\theta_n) = \eta(\theta) + J\,(\hat\theta_n - \theta) + O(\|\hat\theta_n - \theta\|^2),
\]
where the Jacobian matrix $J$ is given by $J_{ij} = \partial \eta_j / \partial \theta_i$, so
\[
\sqrt{n}\,(\hat\theta_n - \theta) \Rightarrow \mathrm{No}\bigl(0,\ [J' I(\eta) J]^{-1}\bigr) = \mathrm{No}\bigl(0,\ I(\theta)^{-1}\bigr)
\]
and again we have consistency, efficiency, and asymptotic normality. In one-dimensional families this leads to Frequentist confidence intervals of the form
\[
1 - \alpha \approx \Pr_\theta\Bigl[\hat\theta_n - z_{\alpha/2}\big/\sqrt{n\,I(\hat\theta_n)} \ \le\ \theta\ \le\ \hat\theta_n + z_{\alpha/2}\big/\sqrt{n\,I(\hat\theta_n)}\Bigr],
\]
where the "$\approx$" is required both because the distribution of $\hat\theta_n$ is only approximately normal, and because $I(\hat\theta_n)$ is only approximately $I(\theta)$.
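As a concrete illustration, the Wald-type interval above is easy to code in R; the helper name wald.ci and its arguments are mine (not from the notes), and info is whatever Fisher information function applies to the model at hand.

  ## Hypothetical helper: theta.hat +/- z_{alpha/2} / sqrt(n * I(theta.hat))
  wald.ci <- function(theta.hat, info, n, alpha = 0.05) {
    half <- qnorm(1 - alpha/2) / sqrt(n * info(theta.hat))
    c(lower = theta.hat - half, upper = theta.hat + half)
  }
  ## Bernoulli case used in Section 3: I(theta) = 1 / (theta * (1 - theta))
  wald.ci(0.10, function(t) 1 / (t * (1 - t)), n = 100)   # about (0.041, 0.159)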

2. Bayesian Asymptotic Inference

The observed information for a single observation $X = x$ from the model $X \sim f(x \mid \theta)$ is
\[
i(\theta, x) = -\frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta);
\]
evidently the Fisher (expected) information is related to this by $I(\theta) = \mathrm{E}[\,i(\theta, X) \mid \theta\,]$. The likelihood for a sample of size $n$ is just the product of the individual likelihoods, leading to a sum for the log likelihoods, and observed information
\[
i_n(\theta, x) = \sum_{j=1}^n i(\theta, x_j).
\]
If the log likelihood $\log f_n(x \mid \theta)$ is differentiable throughout $\Theta$ and attains a unique maximum at an interior point $\hat\theta_n(x) \in \Theta$, then we can expand $\log f_n(x \mid \theta)$ in a second-order Taylor series for $\theta = \hat\theta_n(x) + \epsilon/\sqrt{n}$ close to $\hat\theta_n(x)$ to find (for $q = 1$-dimensional $\theta$)
\[
\begin{aligned}
\log f_n(x \mid \theta)
 &= \log f_n(x \mid \hat\theta_n)
   + \frac{(\epsilon/\sqrt{n})}{1!}\,\partial_\theta \log f_n(x \mid \hat\theta_n)
   + \frac{(\epsilon/\sqrt{n})^2}{2!}\,\partial_\theta^2 \log f_n(x \mid \hat\theta_n)
   + o\bigl((\epsilon/\sqrt{n})^2\,\partial_\theta^2 \log f_n(x \mid \hat\theta_n)\bigr) \\
 &= \log f_n(x \mid \hat\theta_n) + 0 - \frac{\epsilon^2}{2}\,\frac{1}{n}\sum_{j=1}^n i(\hat\theta_n, x_j) + o(1) \\
 &\approx \log f_n(x \mid \hat\theta_n) - \frac{\epsilon^2}{2}\,\mathrm{E}[\,i(\hat\theta_n, X)\,]
  = \log f_n(x \mid \hat\theta_n) - \frac{\epsilon^2}{2}\, I(\hat\theta_n) \\
 &= \log f_n(x \mid \hat\theta_n) - \frac{n}{2}\, I(\hat\theta_n)\,(\theta - \hat\theta_n)^2,
\end{aligned}
\]
where we have used the consistency of $\hat\theta_n$ and have applied the strong law of large numbers to $i(\theta, X)$. Thus we have the likelihood approximation
\[
f_n(x \mid \theta) \approx \mathrm{No}\bigl(\hat\theta_n(x),\ n\,I(\hat\theta_n)\bigr),
\]
normal with mean the MLE $\hat\theta_n(x)$ and precision $n\,I(\hat\theta_n)$ (or covariance $[n\,I(\hat\theta_n)]^{-1}$). Note that the prior distribution is irrelevant, asymptotically, so long as it is smooth and doesn't vanish in a neighborhood of $\hat\theta_n$; thus we have the Bayesian Central Limit Theorem,
\[
\pi_n(\theta \mid x) \approx \mathrm{No}\bigl(\hat\theta_n,\ [n\,I(\hat\theta_n)]^{-1}\bigr),
\]
leading in the $q = 1$-dimensional case to Bayesian credible intervals of the form
\[
1 - \alpha \approx \Pr\Bigl[\hat\theta_n - z_{\alpha/2}\big/\sqrt{n\,I(\hat\theta_n)} \ \le\ \theta\ \le\ \hat\theta_n + z_{\alpha/2}\big/\sqrt{n\,I(\hat\theta_n)} \;\Big|\; x\Bigr],
\]
just as before, but now with a completely different interpretation.
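To see the Bayesian Central Limit Theorem at work, the following R sketch (mine, not from the notes) compares an exact conjugate posterior with its normal approximation in the Bernoulli setting of the next section; the flat Beta(1,1) prior is an arbitrary illustrative choice.

  ## With a Beta(1,1) prior and T_n successes in n Bernoulli trials, the exact
  ## posterior is Beta(T_n + 1, n - T_n + 1); the Bayesian CLT approximates it by
  ## No(theta.hat, [n I(theta.hat)]^{-1}) with I(theta) = 1 / (theta * (1 - theta)).
  n <- 100; Tn <- 10
  theta.hat <- Tn / n
  sd.approx <- sqrt(theta.hat * (1 - theta.hat) / n)
  curve(dbeta(x, Tn + 1, n - Tn + 1), 0, 0.3, ylab = "posterior density")
  curve(dnorm(x, theta.hat, sd.approx), add = TRUE, lty = 2)   # normal approximation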

3. An Example

Let $X_i$ be Bernoulli random variables with $\Pr[X_i = 1] = \theta$ for some $\theta \in \Theta = (0, 1)$; this is an exponential family with natural parameter
\[
\eta = \eta(\theta) = \log\frac{\theta}{1 - \theta}
\]
and natural sufficient statistic (for a sample of size $n$) $T_n = \Sigma X_j$, hence with log normalizing factor
\[
A(\eta) = \log(1 + e^\eta) = B(\theta) = -\log(1 - \theta),
\]
i.e. with likelihood
\[
L(\theta) = \theta^{T_n} (1 - \theta)^{n - T_n} = e^{\eta T_n - n \log(1 + e^\eta)},
\]
and hence with MLEs and ($1 \times 1$) information matrices
\[
\hat\theta_n = T_n/n = \bar X_n, \qquad
\hat\eta_n = \log\frac{T_n}{n - T_n} = \log\frac{\bar X_n}{1 - \bar X_n}, \qquad
I_\theta = \frac{1}{\theta(1 - \theta)}, \qquad
I_\eta = \frac{e^\eta}{(1 + e^\eta)^2},
\]
so $I_\eta(\hat\eta_n) = T_n(n - T_n)/n^2$, $I_\theta(\hat\theta_n) = n^2/(T_n(n - T_n))$, and 95% confidence intervals would be
\[
0.95 \approx \Pr\bigl[\hat\eta_n - 1.96\big/\sqrt{n\,I_\eta(\hat\eta_n)} < \eta < \hat\eta_n + 1.96\big/\sqrt{n\,I_\eta(\hat\eta_n)}\bigr]
= \Pr\Bigl[\log\tfrac{T_n}{n - T_n} - \tfrac{1.96}{\sqrt{T_n(n - T_n)/n}} < \eta < \log\tfrac{T_n}{n - T_n} + \tfrac{1.96}{\sqrt{T_n(n - T_n)/n}}\Bigr].
\]
This can be written as an interval $0.95 \approx \Pr[L_n < \theta < R_n]$ for $\theta = e^\eta/(1 + e^\eta)$, with left and right endpoints
\[
L_n(T_n) = T_n \Big/ \Bigl[T_n + (n - T_n)\exp\bigl(+1.96\big/\sqrt{T_n(n - T_n)/n}\bigr)\Bigr], \qquad
R_n(T_n) = T_n \Big/ \Bigl[T_n + (n - T_n)\exp\bigl(-1.96\big/\sqrt{T_n(n - T_n)/n}\bigr)\Bigr];
\]
for example, with $T_{100} = 10$ successes in $n = 100$ tries the endpoints are $L_{100}(10) = 10/[10 + 90\exp(1.96/\sqrt{9})] = 1/[1 + 9\exp(0.6533)] = 0.05465$ and $R_{100}(10) = 1/[1 + 9\exp(-0.6533)] = 0.17597$, while with $T_{100} = 50$ the interval endpoints are $L_{100}(50) = 50/[50 + 50\exp(1.96/\sqrt{25})] = 1/[1 + \exp(0.392)] = 0.40324$ and $R_{100}(50) = 1/[1 + \exp(-0.392)] = 0.59676$.
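The endpoint formulas above translate directly into R (the function names Ln and Rn are mine):

  ## eta-based 95% interval endpoints, mapped back to theta = e^eta / (1 + e^eta)
  Ln <- function(Tn, n, z = 1.96)
    Tn / (Tn + (n - Tn) * exp( z / sqrt(Tn * (n - Tn) / n)))
  Rn <- function(Tn, n, z = 1.96)
    Tn / (Tn + (n - Tn) * exp(-z / sqrt(Tn * (n - Tn) / n)))
  c(Ln(10, 100), Rn(10, 100))   # approximately 0.05465, 0.17597
  c(Ln(50, 100), Rn(50, 100))   # approximately 0.40324, 0.59676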

Intervals for $\theta$ can be made directly using
\[
0.95 \approx \Pr\bigl[\hat\theta_n - 1.96\big/\sqrt{n\,I_\theta(\hat\theta_n)} < \theta < \hat\theta_n + 1.96\big/\sqrt{n\,I_\theta(\hat\theta_n)}\bigr]
= \Pr\bigl[\hat\theta_n - 1.96\sqrt{\hat\theta_n(1 - \hat\theta_n)/n} < \theta < \hat\theta_n + 1.96\sqrt{\hat\theta_n(1 - \hat\theta_n)/n}\bigr],
\]
an interval with endpoints $L_{100}(10) = 0.10 - 1.96\sqrt{0.09/100} = 0.10 - 1.96(0.030) = 0.0412$ and $R_{100}(10) = 0.10 + 1.96(0.030) = 0.1588$ for $T_{100} = 10$, and $L_{100}(50) = 0.50 - 1.96\sqrt{0.25/100} = 0.50 - 1.96(0.050) = 0.402$ and $R_{100}(50) = 0.50 + 1.96(0.050) = 0.598$ for $T_{100} = 50$.

Recall that the exact 95% confidence intervals for $\theta$ are [qbeta(0.025, T_n, n-T_n+1), qbeta(0.975, T_n+1, n-T_n)] in general, or [qbeta(0.025, 10, 91), qbeta(0.975, 11, 90)] = [0.0490, 0.1762] for $T_{100} = 10$ and [qbeta(0.025, 50, 51), qbeta(0.975, 51, 50)] = [0.3983, 0.6017] for $T_{100} = 50$. In summary:

                                [L_100(10), R_100(10)]    [L_100(50), R_100(50)]
  Normal, based on η:            [0.05465, 0.17597]        [0.40324, 0.59676]
  Normal, based on θ:            [0.04120, 0.15880]        [0.40200, 0.59800]
  Exact Frequentist:             [0.04900, 0.17622]        [0.39832, 0.60168]
  Exact Bayes (1.0, 1.0):        [0.05564, 0.17456]        [0.40364, 0.59636]
  Exact Bayes (0.5, 0.5):        [0.05258, 0.17102]        [0.40317, 0.59683]
  Exact Bayes (0.1, 0.1):        [0.04915, 0.16557]        [0.40270, 0.59730]

Evidently the normal approximations to the Frequentist intervals are both rather close, but a bit too narrow (hence cover $\theta$ with a bit less than the promised 95% probability). Of course the approximations improve with increasing $n$ and become worse for smaller $n$. The intervals based on the natural parameter are very close to the Bayesian credible intervals. All the approximations are better near the middle of the range ($\theta \approx 1/2$) than near the endpoints.
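The table can be reproduced with a few lines of R; I read the "Exact Bayes (a, a)" rows as equal-tailed credible intervals under a Beta(a, a) prior (an assumption on my part, though it matches the tabled values), and the η-based row comes from the Ln and Rn functions sketched earlier.

  ## Sketch: one column of the table (T_100 = 10); set Tn <- 50 for the other.
  n <- 100; Tn <- 10; z <- qnorm(0.975)
  th <- Tn / n
  th + c(-1, 1) * z * sqrt(th * (1 - th) / n)                    # Normal, based on theta
  c(qbeta(0.025, Tn, n - Tn + 1), qbeta(0.975, Tn + 1, n - Tn))  # Exact Frequentist
  for (a in c(1.0, 0.5, 0.1))
    print(qbeta(c(0.025, 0.975), Tn + a, n - Tn + a))            # Exact Bayes, assumed Beta(a, a) prior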