
Large Sample Theory

Study the approximate behaviour of $\hat\theta$ by studying the function $U$. Notice that $U$ is a sum of independent random variables.

Theorem: If $Y_1, Y_2, \ldots$ are iid with mean $\mu$ then
$$\frac{\sum Y_i}{n} \to \mu.$$
This is called the law of large numbers and it comes in two forms. The strong law is
$$P\left(\lim_{n\to\infty} \frac{\sum Y_i}{n} = \mu\right) = 1$$
and the weak law is
$$\lim_{n\to\infty} P\left(\left|\frac{\sum Y_i}{n} - \mu\right| > \epsilon\right) = 0.$$
For iid $Y_i$ the stronger conclusion holds; for our heuristics we ignore the differences between these notions.

Now suppose $\theta_0$ is the true value of $\theta$. Then $U(\theta)/n \to \mu(\theta)$ where
$$\mu(\theta) = E_{\theta_0}\left[\frac{\partial \log f}{\partial\theta}(X_i, \theta)\right] = \int \frac{\partial \log f}{\partial\theta}(x, \theta)\, f(x, \theta_0)\,dx.$$

Example: $N(\mu, 1)$ data:
$$U(\mu)/n = \sum(X_i - \mu)/n = \bar X - \mu.$$
If the true mean is $\mu_0$ then $\bar X \to \mu_0$ and $U(\mu)/n \to \mu_0 - \mu$.

Consider $\mu < \mu_0$: the derivative of $l(\mu)$ is likely to be positive, so that $l$ increases as $\mu$ increases. For $\mu > \mu_0$: the derivative is probably negative, and so $l$ tends to be decreasing for $\mu > \mu_0$. Hence $l$ is likely to be maximized close to $\mu_0$.

Repeat the ideas for the more general case. Study the random variable $\log[f(X_i, \theta)/f(X_i, \theta_0)]$. You know the inequality $E(X)^2 \le E(X^2)$ (the difference is $\mathrm{Var}(X) \ge 0$). The generalization is Jensen's inequality: for $g$ a convex function ($g'' \ge 0$, roughly),
$$g(E(X)) \le E(g(X)).$$
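
A minimal simulation sketch (my addition, not part of the notes): for $N(\mu, 1)$ data the log likelihood $l(\mu) = -\sum(X_i - \mu)^2/2 + \text{const}$ should peak near $\mu_0$, and its root is exactly $\bar X$. The sample size, seed, and grid are illustrative.

```python
# Sketch: the N(mu, 1) log likelihood is maximized near mu_0, with exact root xbar.
import numpy as np

rng = np.random.default_rng(0)
mu_0, n = 2.0, 500
x = rng.normal(loc=mu_0, scale=1.0, size=n)

mu_grid = np.linspace(0.0, 4.0, 401)
loglik = np.array([-0.5 * np.sum((x - mu) ** 2) for mu in mu_grid])  # up to a constant

print("argmax of l over the grid:", mu_grid[np.argmax(loglik)])
print("sample mean (the exact root of U):", x.mean())
```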

The inequality above has $g(x) = x^2$. Use $g(x) = -\log(x)$: convex because $g''(x) = x^{-2} > 0$. We get
$$-\log\left(E_{\theta_0}\left[f(X_i, \theta)/f(X_i, \theta_0)\right]\right) \le E_{\theta_0}\left[-\log\{f(X_i, \theta)/f(X_i, \theta_0)\}\right].$$
But
$$E_{\theta_0}\left[\frac{f(X_i, \theta)}{f(X_i, \theta_0)}\right] = \int \frac{f(x, \theta)}{f(x, \theta_0)}\, f(x, \theta_0)\,dx = \int f(x, \theta)\,dx = 1.$$
We can reassemble the inequality and this calculation to get
$$E_{\theta_0}\left[\log\{f(X_i, \theta)/f(X_i, \theta_0)\}\right] \le 0.$$

Fact: the inequality is strict unless the $\theta$ and $\theta_0$ densities are actually the same. Let $\mu(\theta) < 0$ be this expected value. Then for each $\theta$ we find
$$\frac{l(\theta) - l(\theta_0)}{n} = \frac{\sum \log[f(X_i, \theta)/f(X_i, \theta_0)]}{n} \to \mu(\theta).$$
This proves the likelihood is probably higher at $\theta_0$ than at any other single $\theta$. The idea can often be stretched to prove that the mle is consistent; one needs uniform convergence in $\theta$.
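
A quick numerical check (my addition, not part of the notes): for $N(\theta, 1)$ data the expectation $E_{\theta_0}[\log\{f(X, \theta)/f(X, \theta_0)\}]$ equals $-(\theta - \theta_0)^2/2$, which is $\le 0$ with equality only at $\theta = \theta_0$. The Monte Carlo estimate below should match that value; the sample size and seed are illustrative.

```python
# Sketch: Monte Carlo check that E_{theta0}[ log f(X;theta)/f(X;theta0) ] <= 0
# for N(theta, 1) data; the exact value is -(theta - theta0)^2 / 2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta0 = 0.0
x = rng.normal(theta0, 1.0, size=200_000)

for theta in [-1.0, 0.0, 0.5, 2.0]:
    log_ratio = norm.logpdf(x, loc=theta) - norm.logpdf(x, loc=theta0)
    print(theta, log_ratio.mean(), -(theta - theta0) ** 2 / 2)
```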

Definition: A sequence $\hat\theta_n$ of estimators of $\theta$ is consistent if $\hat\theta_n$ converges weakly (or strongly) to $\theta$.

Proto-theorem: In regular problems the mle $\hat\theta$ is consistent.

More precise statements of possible conclusions. Use the notation $N(\epsilon) = \{\theta : |\theta - \theta_0| \le \epsilon\}$. Suppose:
$\hat\theta_n$ is the global maximizer of $l$;
$\hat\theta_{n,\delta}$ maximizes $l$ over $N(\delta) = \{|\theta - \theta_0| \le \delta\}$;
$$A_\epsilon = \{|\hat\theta_n - \theta_0| \le \epsilon\}$$
$$B_{\delta,\epsilon} = \{|\hat\theta_{n,\delta} - \theta_0| \le \epsilon\}$$
$$C_L = \{\exists!\,\theta \in N(L/n^{1/2}) : U(\theta) = 0,\ U'(\theta) < 0\}$$

Theorem:
1. Under conditions I, $P(A_\epsilon) \to 1$ for each $\epsilon > 0$.
2. Under conditions II, there is a $\delta > 0$ such that for all $\epsilon > 0$ we have $P(B_{\delta,\epsilon}) \to 1$.
3. Under conditions III, for all $\delta > 0$ there is an $L$ so large and an $n_0$ so large that for all $n \ge n_0$, $P(C_L) > 1 - \delta$.
4. Under conditions III, there is a sequence $L_n$ tending to $\infty$ so slowly that $P(C_{L_n}) \to 1$.

Point: the conditions get weaker as the conclusions get weaker. There are many possible conditions in the literature. See the book by Zacks for some precise conditions.

Asymptotic Normality

Study the shape of the log likelihood near the true value of $\theta$. Assume $\hat\theta$ is a root of the likelihood equations close to $\theta_0$. Taylor expansion (1-dimensional parameter $\theta$):
$$U(\hat\theta) = 0 = U(\theta_0) + U'(\theta_0)(\hat\theta - \theta_0) + U''(\tilde\theta)(\hat\theta - \theta_0)^2/2$$
for some $\tilde\theta$ between $\theta_0$ and $\hat\theta$.

WARNING: This form of the remainder in Taylor's theorem is not valid for multivariate $\theta$.

The derivatives of $U$ are sums of $n$ terms, so each derivative should be proportional to $n$ in size. The second derivative term is multiplied by the square of the small number $\hat\theta - \theta_0$, so it should be negligible compared to the first derivative term. Ignoring the second derivative term we get
$$U'(\theta_0)(\hat\theta - \theta_0) \approx -U(\theta_0).$$
Now look at the terms $U$ and $U'$.

Normal case: $U(\theta_0) = \sum(X_i - \mu_0)$ has a normal distribution with mean 0 and variance $n$ (SD $\sqrt n$). The derivative is $U'(\mu) = -n$. The next derivative $U''$ is 0. Notice: both $U$ and $U'$ are sums of iid random variables. Let
$$U_i = \frac{\partial \log f}{\partial\theta}(X_i, \theta_0)$$
and
$$V_i = -\frac{\partial^2 \log f}{\partial\theta^2}(X_i, \theta).$$

In general, $U(\theta_0) = \sum U_i$ has mean 0 and approximately a normal distribution. Here is how we check the mean:
$$
\begin{aligned}
E_{\theta_0}(U(\theta_0)) &= n E_{\theta_0}(U_1) \\
&= n \int \frac{\partial \log f(x, \theta_0)}{\partial\theta}\, f(x, \theta_0)\,dx \\
&= n \int \frac{\partial f(x, \theta_0)/\partial\theta}{f(x, \theta_0)}\, f(x, \theta_0)\,dx \\
&= n \int \frac{\partial f}{\partial\theta}(x, \theta_0)\,dx \\
&= n \left.\frac{\partial}{\partial\theta}\int f(x, \theta)\,dx\right|_{\theta=\theta_0} \\
&= n\,\frac{\partial}{\partial\theta} 1 = 0.
\end{aligned}
$$

Notice: we interchanged the order of differentiation and integration at one point. This step is usually justified by applying the dominated convergence theorem to the definition of the derivative.

Differentiate the identity just proved:
$$\int \frac{\partial \log f}{\partial\theta}(x, \theta)\, f(x, \theta)\,dx = 0.$$
Take the derivative of both sides wrt $\theta$; pull the derivative under the integral sign:
$$\int \frac{\partial}{\partial\theta}\left[\frac{\partial \log f}{\partial\theta}(x, \theta)\, f(x, \theta)\right] dx = 0.$$
Do the derivative and get
$$-\int \frac{\partial^2 \log f}{\partial\theta^2}(x, \theta)\, f(x, \theta)\,dx = \int \frac{\partial \log f}{\partial\theta}(x, \theta)\, \frac{\partial f}{\partial\theta}(x, \theta)\,dx = \int \left[\frac{\partial \log f}{\partial\theta}(x, \theta)\right]^2 f(x, \theta)\,dx.$$

Definition: The Fisher information is
$$I(\theta) = E_\theta(-U'(\theta)) = n E_\theta(V_1).$$
We refer to $\mathcal I(\theta) = E_\theta(V_1)$ as the information in one observation. The idea is that $I$ is a measure of how curved the log likelihood tends to be at the true value of $\theta$. Big curvature means precise estimates. Our identity above is
$$I(\theta) = \mathrm{Var}_\theta(U(\theta)) = n\mathcal I(\theta).$$

Now we return to our Taylor expansion approximation
$$U'(\theta_0)(\hat\theta - \theta_0) \approx -U(\theta_0)$$
and study the two appearances of $U$. We have shown that $U = \sum U_i$ is a sum of iid mean-0 random variables. The central limit theorem thus proves that
$$n^{-1/2} U(\theta_0) \Rightarrow N(0, \sigma^2)$$
where $\sigma^2 = \mathrm{Var}(U_i) = E(V_i) = \mathcal I(\theta)$.
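
A quick worked example (my addition, not in the notes): for Poisson($\theta$) observations,
$$\log f(x, \theta) = x\log\theta - \theta - \log x!,\qquad U_1 = \frac{X}{\theta} - 1,\qquad V_1 = -\frac{\partial U_1}{\partial\theta} = \frac{X}{\theta^2},$$
so $\mathcal I(\theta) = E_\theta(V_1) = 1/\theta$ and $I(\theta) = n/\theta$. The identity checks out, since $\mathrm{Var}_\theta(U_1) = \mathrm{Var}_\theta(X)/\theta^2 = 1/\theta$ as well.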

Next observe that
$$-U'(\theta) = \sum V_i$$
where again
$$V_i = -\frac{\partial U_i}{\partial\theta}.$$
The law of large numbers can be applied to show
$$-U'(\theta_0)/n \to E_{\theta_0}[V_1] = \mathcal I(\theta_0).$$
Now manipulate our Taylor expansion as follows:
$$n^{1/2}(\hat\theta - \theta_0) \approx \left[\frac{\sum V_i}{n}\right]^{-1} \frac{\sum U_i}{\sqrt n}.$$
Apply Slutsky's Theorem to conclude that the right hand side converges in distribution to $N(0, \sigma^2/\mathcal I(\theta)^2)$, which simplifies, because of the identities, to $N\{0, 1/\mathcal I(\theta)\}$.
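
A small simulation sketch of this conclusion (my addition, not from the notes): for iid Exponential data with rate $\theta$, the mle is $\hat\theta = 1/\bar X$ and $\mathcal I(\theta) = 1/\theta^2$, so $n^{1/2}(\hat\theta - \theta_0)$ should have variance close to $\theta_0^2$. The parameter values, sample size, and seed are illustrative.

```python
# Sketch: check n^(1/2)(theta_hat - theta_0) ~ N(0, 1/I(theta_0)) for
# Exponential(rate theta) data, where theta_hat = 1/xbar and I(theta) = 1/theta^2.
import numpy as np

rng = np.random.default_rng(2)
theta0, n, reps = 2.0, 200, 5000
x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)

z = np.sqrt(n) * (theta_hat - theta0)
print("simulated variance of sqrt(n)(theta_hat - theta0):", z.var())
print("1 / I(theta0) = theta0^2:", theta0 ** 2)
```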

Summary

In regular families, assuming $\hat\theta = \hat\theta_n$ is a consistent root of $U(\theta) = 0$:
$$n^{-1/2} U(\theta_0) \Rightarrow MVN(0, \mathcal I)$$
where $\mathcal I_{ij} = E_{\theta_0}\{V_{1,ij}(\theta_0)\}$ and
$$V_{k,ij}(\theta) = -\frac{\partial^2 \log f(X_k, \theta)}{\partial\theta_i\,\partial\theta_j}.$$
If $V_k(\theta)$ is the matrix $[V_{k,ij}]$ then
$$\frac{\sum_{k=1}^n V_k(\theta_0)}{n} \to \mathcal I.$$
If $V(\theta) = \sum_k V_k(\theta)$ then
$$\{V(\theta_0)/n\}\, n^{1/2}(\hat\theta - \theta_0) - n^{-1/2} U(\theta_0) \to 0$$
in probability as $n \to \infty$.

Also
$$\{V(\hat\theta)/n\}\, n^{1/2}(\hat\theta - \theta_0) - n^{-1/2} U(\theta_0) \to 0$$
in probability as $n \to \infty$,
$$n^{1/2}(\hat\theta - \theta_0) - \{\mathcal I(\theta_0)\}^{-1}\, n^{-1/2} U(\theta_0) \to 0$$
in probability as $n \to \infty$, and
$$n^{1/2}(\hat\theta - \theta_0) \Rightarrow MVN(0, \mathcal I^{-1}).$$

In general (not just iid cases):
$$\sqrt{I(\theta_0)}\,(\hat\theta - \theta_0) \Rightarrow N(0, 1)$$
$$\sqrt{I(\hat\theta)}\,(\hat\theta - \theta_0) \Rightarrow N(0, 1)$$
$$\sqrt{V(\theta_0)}\,(\hat\theta - \theta_0) \Rightarrow N(0, 1)$$
$$\sqrt{V(\hat\theta)}\,(\hat\theta - \theta_0) \Rightarrow N(0, 1)$$
where $V = -l''$ is the so-called observed information, the negative second derivative of the log-likelihood.

Note: If the square roots are replaced by matrix square roots we can let $\theta$ be vector valued and get $MVN(0, I)$ (the identity matrix) as the limit law.

Why all these different forms? We use the limit laws to test hypotheses and compute confidence intervals. Test $H_o: \theta = \theta_0$ using one of the four quantities as the test statistic. Find confidence intervals using the quantities as pivots. E.g., the second and fourth limits lead to the confidence intervals
$$\hat\theta \pm z_{\alpha/2}\big/\sqrt{I(\hat\theta)}$$
and
$$\hat\theta \pm z_{\alpha/2}\big/\sqrt{V(\hat\theta)}$$
respectively. The other two are more complicated. For iid $N(0, \sigma^2)$ data we have
$$V(\sigma) = \frac{3\sum X_i^2}{\sigma^4} - \frac{n}{\sigma^2}$$
and
$$I(\sigma) = \frac{2n}{\sigma^2}.$$
The first of the four limits above then justifies confidence intervals for $\sigma$ computed by finding all those $\sigma$ for which
$$\left|\frac{\sqrt{2n}\,(\hat\sigma - \sigma)}{\sigma}\right| \le z_{\alpha/2}.$$
A similar interval can be derived from the third expression, though this is much more complicated.
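
A short sketch of these intervals in code (my addition, not from the notes), for the iid $N(0, \sigma^2)$ example: with the mean known to be 0 the mle is $\hat\sigma = (\sum X_i^2/n)^{1/2}$, the second form gives $\hat\sigma \pm z_{\alpha/2}\hat\sigma/\sqrt{2n}$, and inverting the first form gives $[\hat\sigma/(1 + z_{\alpha/2}/\sqrt{2n}),\ \hat\sigma/(1 - z_{\alpha/2}/\sqrt{2n})]$. The numerical settings are illustrative.

```python
# Sketch: two Wald-type 95% intervals for sigma from iid N(0, sigma^2) data,
# using I(sigma) = 2n / sigma^2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma0, n = 1.5, 100
x = rng.normal(0.0, sigma0, size=n)

sigma_hat = np.sqrt(np.mean(x ** 2))   # mle of sigma when the mean is known to be 0
z = norm.ppf(0.975)

# second form: sigma_hat +/- z / sqrt(I(sigma_hat)) with I(sigma) = 2n / sigma^2
half = z * sigma_hat / np.sqrt(2 * n)
print("interval from I(sigma_hat):", sigma_hat - half, sigma_hat + half)

# first form: all sigma with sqrt(2n) |sigma_hat - sigma| <= z * sigma
print("interval from the pivot   :", sigma_hat / (1 + z / np.sqrt(2 * n)),
      sigma_hat / (1 - z / np.sqrt(2 * n)))
```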

Usual summary: the mle is consistent and asymptotically normal with an asymptotic variance which is the inverse of the Fisher information.

Problems with maximum likelihood:

1. Many parameters lead to poor approximations. MLEs can be far from the right answer. See the homework for the Neyman-Scott example, where the MLE is not consistent.

2. Multiple roots of the likelihood equations: you must choose the right root. Start with a different, consistent estimator and apply an iterative scheme like Newton-Raphson to the likelihood equations to find the MLE. Not many steps of NR are generally required if the starting point is a reasonable estimate.

Finding (good) preliminary point estimates: Method of Moments

Basic strategy: set sample moments equal to population moments and solve for the parameters.

Definition: The $r$th sample moment (about the origin) is
$$\frac{1}{n}\sum_{i=1}^n X_i^r.$$
The $r$th population moment is $E(X^r)$. (The central moments are
$$\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^r \quad\text{and}\quad E\left[(X - \mu)^r\right].)$$

If we have $p$ parameters we can estimate the parameters $\theta_1, \ldots, \theta_p$ by solving the system of $p$ equations
$$\mu_1 = \bar X,\qquad \mu_2 = \overline{X^2},\qquad \ldots,\qquad \mu_p = \overline{X^p}.$$
You need to remember that the population moments $\mu_k$ will be formulas involving the parameters.

Gamma Example

The Gamma$(\alpha, \beta)$ density is
$$f(x; \alpha, \beta) = \frac{1}{\beta\,\Gamma(\alpha)}\left(\frac{x}{\beta}\right)^{\alpha-1}\exp\left[-\frac{x}{\beta}\right] 1(x > 0)$$
and has
$$\mu_1 = \alpha\beta \quad\text{and}\quad \mu_2 = \alpha(\alpha+1)\beta^2.$$
This gives the equations
$$\alpha\beta = \bar X,\qquad \alpha(\alpha+1)\beta^2 = \overline{X^2},$$
or
$$\alpha\beta = \bar X,\qquad \alpha\beta^2 = \overline{X^2} - \bar X^2.$$

Divide the second equation by the first to find the method of moments estimate of $\beta$:
$$\tilde\beta = (\overline{X^2} - \bar X^2)/\bar X.$$
Then from the first equation get
$$\tilde\alpha = \bar X/\tilde\beta = (\bar X)^2/(\overline{X^2} - \bar X^2).$$
The method of moments equations are much easier to solve than the likelihood equations, which involve the function
$$\psi(\alpha) = \frac{d}{d\alpha}\log(\Gamma(\alpha)),$$
called the digamma function.
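
A minimal sketch of the method of moments fit (my addition, not from the notes); the true shape and scale used to simulate the data are illustrative.

```python
# Sketch: method of moments for Gamma(alpha, beta) using
# alpha~ = xbar^2 / (x2bar - xbar^2) and beta~ = (x2bar - xbar^2) / xbar.
import numpy as np

rng = np.random.default_rng(4)
x = rng.gamma(shape=3.0, scale=2.0, size=2000)   # true alpha = 3, beta = 2

xbar = x.mean()
x2bar = np.mean(x ** 2)
beta_mm = (x2bar - xbar ** 2) / xbar
alpha_mm = xbar ** 2 / (x2bar - xbar ** 2)
print("method of moments estimates:", alpha_mm, beta_mm)
```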

The score function has components
$$U_\beta = \frac{\sum X_i}{\beta^2} - \frac{n\alpha}{\beta}$$
and
$$U_\alpha = -n\psi(\alpha) + \sum\log(X_i) - n\log(\beta).$$
You can solve for $\beta$ in terms of $\alpha$ (namely $\beta = \sum X_i/(n\alpha)$), which leaves you trying to find a root of the equation
$$-n\psi(\alpha) + \sum\log(X_i) - n\log\left(\sum X_i/(n\alpha)\right) = 0.$$
To use Newton-Raphson on this you begin with the preliminary estimate $\hat\alpha_1 = \tilde\alpha$ and then compute iteratively
$$\hat\alpha_{k+1} = \hat\alpha_k - \frac{\overline{\log X} - \log(\bar X) + \log(\hat\alpha_k) - \psi(\hat\alpha_k)}{1/\hat\alpha_k - \psi'(\hat\alpha_k)}$$
until the sequence converges. Computation of $\psi'$, the trigamma function, requires special software. Web sites like netlib and statlib are good sources for this sort of thing.
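
Here is a minimal sketch of that iteration (my addition, not from the notes), using SciPy's digamma and polygamma (trigamma) functions and starting from the method of moments estimate; the simulated data and tolerance are illustrative.

```python
# Sketch: Newton-Raphson for the gamma shape alpha, solving
# psi(alpha) - log(alpha) = mean(log x) - log(xbar), then beta_hat = xbar / alpha_hat.
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(5)
x = rng.gamma(shape=3.0, scale=2.0, size=2000)

xbar, x2bar = x.mean(), np.mean(x ** 2)
alpha = xbar ** 2 / (x2bar - xbar ** 2)          # method of moments starting value
c = np.mean(np.log(x)) - np.log(xbar)

for _ in range(50):
    g = c + np.log(alpha) - digamma(alpha)       # score (after substituting beta), divided by n
    g_prime = 1.0 / alpha - polygamma(1, alpha)  # polygamma(1, .) is the trigamma function
    step = g / g_prime
    alpha -= step
    if abs(step) < 1e-10:
        break

beta = xbar / alpha
print("maximum likelihood estimates:", alpha, beta)
```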

Estimating Equations

The same large sample ideas arise whenever estimates are derived by solving some equation. Example: large sample theory for Generalized Linear Models. Suppose $Y_i$ is the number of cancer cases in some group of people characterized by values $x_i$ of some covariates. Think of $x_i$ as containing variables like age, a dummy for sex, average income, and so on. A possible parametric regression model: $Y_i$ has a Poisson distribution with mean $\mu_i$, where the mean $\mu_i$ depends somehow on $x_i$.

Typically we assume $g(\mu_i) = \beta_0 + x_i\beta$; $g$ is the link function. Often $g(\mu) = \log(\mu)$ and $x_i\beta$ is a matrix product: $x_i$ a row vector, $\beta$ a column vector. This is a linear regression model with Poisson errors. Special case: $\log(\mu_i) = \beta x_i$ where $x_i$ is a scalar. The log likelihood is simply
$$l(\beta) = \sum\left(Y_i\log(\mu_i) - \mu_i\right),$$
ignoring irrelevant factorials. The score function is, since $\log(\mu_i) = \beta x_i$,
$$U(\beta) = \sum\left(Y_i x_i - x_i\mu_i\right) = \sum x_i(Y_i - \mu_i).$$
(Notice again that the score has mean 0 when you plug in the true parameter value.)
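
A minimal sketch of solving this score equation by Newton-Raphson (my addition, not from the notes) for the scalar model $\log(\mu_i) = \beta x_i$; the covariate design, true $\beta$, and seed are illustrative.

```python
# Sketch: Newton-Raphson for U(beta) = sum x_i (Y_i - mu_i) = 0 with mu_i = exp(beta * x_i).
import numpy as np

rng = np.random.default_rng(6)
beta_true, n = 0.5, 1000
x = rng.uniform(0.0, 2.0, size=n)
y = rng.poisson(np.exp(beta_true * x))

beta = 0.0
for _ in range(25):
    mu = np.exp(beta * x)
    score = np.sum(x * (y - mu))        # U(beta)
    info = np.sum(x ** 2 * mu)          # V(beta) = -U'(beta)
    step = score / info
    beta += step
    if abs(step) < 1e-12:
        break

se = 1.0 / np.sqrt(np.sum(x ** 2 * np.exp(beta * x)))   # 1 / sqrt(sum x_i^2 mu_i)
print("beta_hat:", beta, "approximate standard error:", se)
```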

The key observation, however, is that it is not necessary to believe that $Y_i$ has a Poisson distribution to make solving the equation $U = 0$ sensible. Suppose only that $\log(E(Y_i)) = x_i\beta$. Then we have assumed that
$$E_\beta(U(\beta)) = 0.$$
This was the key condition in proving that there was a consistent root of the likelihood equations, and here it is what is needed, roughly, to prove that the equation $U(\beta) = 0$ has a consistent root $\hat\beta$.

Ignoring higher order terms in a Taylor expansion will give
$$V(\beta)(\hat\beta - \beta) \approx U(\beta)$$
where $V = -U'$. In the mle case we had identities relating the expectation of $V$ to the variance of $U$. In general here we have
$$\mathrm{Var}(U) = \sum x_i^2\,\mathrm{Var}(Y_i).$$
If $Y_i$ is Poisson with mean $\mu_i$ (and so $\mathrm{Var}(Y_i) = \mu_i$) this is
$$\mathrm{Var}(U) = \sum x_i^2\mu_i.$$
Moreover we have
$$V_i = x_i^2\mu_i$$
and so
$$V(\beta) = \sum x_i^2\mu_i.$$

The central limit theorem (the Lyapunov kind) will show that $U(\beta)$ has an approximate normal distribution with variance
$$\sigma_U^2 = \sum x_i^2\,\mathrm{Var}(Y_i)$$
and so
$$\hat\beta - \beta \approx N\left(0,\ \sigma_U^2\Big/\Big(\sum x_i^2\mu_i\Big)^2\right).$$
If $\mathrm{Var}(Y_i) = \mu_i$, as it is for the Poisson case, the asymptotic variance simplifies to $1/\sum x_i^2\mu_i$.
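
A simulation sketch of this point (my addition, not from the notes): the data below are overdispersed (a gamma-frailty Poisson, so $\mathrm{Var}(Y_i) > \mu_i$) but still satisfy $\log E(Y_i) = \beta x_i$, so the root of $U = 0$ remains sensible while the appropriate variance is $\sigma_U^2/(\sum x_i^2\mu_i)^2$ rather than $1/\sum x_i^2\mu_i$. The frailty variance, sample sizes, and seed are illustrative.

```python
# Sketch: compare the simulated variance of beta_hat with the "sandwich" formula
# sigma_U^2 / (sum x_i^2 mu_i)^2 and with the Poisson-model formula 1 / sum x_i^2 mu_i.
import numpy as np

rng = np.random.default_rng(7)
beta_true, n, reps = 0.5, 500, 2000
x = rng.uniform(0.0, 2.0, size=n)
mu = np.exp(beta_true * x)

def solve_beta(y):
    beta = 0.0
    for _ in range(50):                          # Newton-Raphson, as in the earlier sketch
        m = np.exp(beta * x)
        beta += np.sum(x * (y - m)) / np.sum(x ** 2 * m)
    return beta

frailty = rng.gamma(shape=2.0, scale=0.5, size=(reps, n))   # mean 1, variance 1/2
y_all = rng.poisson(mu * frailty)                           # Var(Y_i) = mu_i + mu_i^2 / 2
betas = np.array([solve_beta(y) for y in y_all])

var_y = mu + mu ** 2 / 2.0
print("simulated Var(beta_hat)          :", betas.var())
print("sandwich sigma_U^2/(sum x^2 mu)^2:", np.sum(x ** 2 * var_y) / np.sum(x ** 2 * mu) ** 2)
print("Poisson-model 1/(sum x^2 mu)     :", 1.0 / np.sum(x ** 2 * mu))
```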

Other estimating equations are possible and popular. If $w_i$ is any set of deterministic weights (possibly depending on $\mu_i$) then we could define
$$U(\beta) = \sum w_i(Y_i - \mu_i)$$
and still conclude that $U = 0$ probably has a consistent root which has an asymptotic normal distribution. The idea is widely used. Example: Generalized Estimating Equations (GEE), Zeger and Liang. Econometricians call this the Generalized Method of Moments.

An estimating equation is unbiased if
$$E_\theta(U(\theta)) = 0.$$
Theorem: Suppose $\hat\theta$ is a consistent root of the unbiased estimating equation $U(\theta) = 0$. Let $V = -U'$. Suppose there is a sequence of constants $B(\theta)$ such that
$$V(\theta)/B(\theta) \to 1,$$
and let
$$A(\theta) = \mathrm{Var}_\theta(U(\theta))$$
and
$$C(\theta) = B(\theta)A^{-1}(\theta)B(\theta).$$
Then
$$\sqrt{C(\theta_0)}\,(\hat\theta - \theta_0) \Rightarrow N(0, 1)$$
$$\sqrt{C(\hat\theta)}\,(\hat\theta - \theta_0) \Rightarrow N(0, 1).$$
Other ways to estimate $A$, $B$ and $C$ lead to the same conclusions. There are multivariate extensions using matrix square roots.
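
To connect the theorem back to the earlier examples (my addition, not in the notes): for the weighted equation $U(\beta) = \sum w_i(Y_i - \mu_i)$ with $\mu_i = e^{\beta x_i}$ and fixed weights $w_i$,
$$A(\beta) = \sum w_i^2\,\mathrm{Var}(Y_i),\qquad B(\beta) = \sum w_i x_i\mu_i,\qquad C(\beta) = \frac{\left(\sum w_i x_i\mu_i\right)^2}{\sum w_i^2\,\mathrm{Var}(Y_i)}.$$
Taking $w_i = x_i$ and $\mathrm{Var}(Y_i) = \mu_i$ gives $C(\beta) = \sum x_i^2\mu_i$, matching the Poisson-model asymptotic variance $1/\sum x_i^2\mu_i$ for $\hat\beta$ found above. In the pure likelihood case $A(\theta) = B(\theta) = I(\theta)$, so $C(\theta) = I(\theta)$ and the theorem reproduces $\sqrt{I(\theta_0)}\,(\hat\theta - \theta_0) \Rightarrow N(0, 1)$.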