Lecture Stat 461-561: Maximum Likelihood Estimation


Lecture Stat 461-561: Maximum Likelihood Estimation
A.D., January 2008
A.D. () January 2008 1 / 63

Maximum Likelihood Estimation
- Invariance
- Consistency
- Efficiency
- Nuisance Parameters
A.D. () January 2008 2 / 63

Parametric Inference. Let $f(x|\theta)$ denote the joint pdf or pmf of the sample $X = (X_1, \ldots, X_n)$ parametrized by $\theta \in \Theta$. Then, given that $X = x$ is observed, the function $L(\theta|x) = f(x|\theta)$ is the likelihood function. The most common estimate is the Maximum Likelihood Estimate (MLE) given by
$$\hat\theta = \arg\max_{\theta \in \Theta} L(\theta|x).$$
A.D. () January 2008 3 / 63
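
A minimal numerical sketch of this definition, assuming NumPy and SciPy are available: the MLE of a Poisson rate is found by direct maximization of the log-likelihood, and the closed form (the sample mean) serves as a check. The rate 3.2 and sample size are illustrative choices.

# Numerical MLE for a Poisson rate; closed-form MLE (sample mean) used as a check.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.2, size=200)          # observed sample

def neg_log_lik(lam):
    # -log L(lambda | x) = -sum_i [x_i log(lam) - lam - log(x_i!)]
    return -np.sum(x * np.log(lam) - lam - gammaln(x + 1))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
print("numerical MLE:", res.x, " closed form (sample mean):", x.mean())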

Example: Gaussian distribution.
$$f(x_i|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right).$$
Then we have, with $\theta = (\mu, \sigma^2)$,
$$\log L(\theta|x) = \sum_{i=1}^n \log f(x_i|\theta) = -\frac{n}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2.$$
By taking the derivatives and setting them to zero,
$$\frac{\partial \log L(\theta|x)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = 0, \qquad \frac{\partial \log L(\theta|x)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i-\mu)^2 = 0.$$
A.D. () January 2008 4 / 63

By solving these equations, we obtain
$$\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \widehat{\sigma^2} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2.$$
Note that $\hat\mu$ is an unbiased estimate but $\widehat{\sigma^2}$ is biased.
A.D. () January 2008 5 / 63
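
A small simulation, assuming NumPy, that checks this bias claim: over many replications, $\hat\mu$ averages to $\mu$ while $\widehat{\sigma^2}$ averages to $\frac{n-1}{n}\sigma^2$. The parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n, reps = 2.0, 4.0, 10, 100_000
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
mu_hat = x.mean(axis=1)                                   # MLE of mu in each replication
sigma2_hat = ((x - mu_hat[:, None]) ** 2).mean(axis=1)    # MLE of sigma^2 (divides by n)
print("E[mu_hat]     ~", mu_hat.mean(), "   (true mu =", mu, ")")
print("E[sigma2_hat] ~", sigma2_hat.mean(), "   ((n-1)/n * sigma^2 =", (n - 1) / n * sigma2, ")")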

Example: Laplace Distribution (Double Exponential).
$$f(x_i|\theta) = \frac{1}{2}\exp\left(-|x_i - \theta|\right).$$
Then we have
$$\log L(\theta|x) = -n\log 2 - \sum_{i=1}^n |x_i - \theta|.$$
By taking the derivative, we obtain
$$\frac{d\log L(\theta|x)}{d\theta} = \sum_{i=1}^n \operatorname{sgn}(x_i - \theta),$$
hence $\hat\theta = \operatorname{med}\{x_1, \ldots, x_n\}$ for $n = 2p+1$.
A.D. () January 2008 6 / 63

Example (Uniform Distribution): Consider $X_i \sim \mathcal{U}(0, \theta)$, i.e.
$$f(x_i|\theta) = \begin{cases} 1/\theta & \text{if } 0 \le x_i < \theta, \\ 0 & \text{otherwise.} \end{cases}$$
We have
$$L(\theta|x) = \prod_{i=1}^n f(x_i|\theta) = \begin{cases} (1/\theta)^n & \text{if } \theta \ge x_{(n)}, \\ 0 & \text{if } \theta < x_{(n)}. \end{cases}$$
It follows that $\hat\theta = x_{(n)}$, where $x_{(1)} < x_{(2)} < \cdots < x_{(n)}$ are the order statistics.
A.D. () January 2008 7 / 63

Example (Linear Regression): Let $\{x_i, y_i\}$ be a set of data where $x_i = (x_i^1, x_i^2, \ldots, x_i^p)^T$ is a set of explanatory variables and $y_i \in \mathbb{R}$ is the response. We assume
$$y_i = x_i^T\beta + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2),$$
thus
$$f(y_i|x_i, \beta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - x_i^T\beta)^2}{2\sigma^2}\right).$$
We have, for $\theta = (\beta, \sigma^2)$,
$$\log L(\theta) = \sum_{i=1}^n \log f(y_i|x_i, \beta) = -\frac{n}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - x_i^T\beta\right)^2 = -\frac{n}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta),$$
where $y = (y_1, \ldots, y_n)^T$ and $X = (x_1, \ldots, x_n)^T$.
A.D. () January 2008 8 / 63

By taking the derivatives and setting them to zero,
$$\frac{\partial \log L(\theta|x)}{\partial\beta} = -\frac{1}{2\sigma^2}\left(-2X^T y + 2X^T X\beta\right) = 0, \qquad \frac{\partial \log L(\theta|x)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}(y - X\beta)^T(y - X\beta) = 0.$$
Thus we obtain
$$\hat\beta = \left(X^T X\right)^{-1} X^T y, \qquad \widehat{\sigma^2} = \frac{1}{n}\left(y - X\hat\beta\right)^T\left(y - X\hat\beta\right).$$
A.D. () January 2008 9 / 63
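
A short sketch, assuming NumPy, of these closed-form solutions on simulated data; the design, true coefficients and noise level are illustrative.

import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design matrix with intercept
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # (X^T X)^{-1} X^T y
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)       # MLE of sigma^2 (divides by n)
print("beta_hat:", beta_hat, " sigma2_hat:", sigma2_hat)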

Example (Time Series): Consider the following autoregression: $X_0 = x_0$, $X_n = \alpha X_{n-1} + \sigma V_n$ where $V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$, and $\theta = (\alpha, \sigma^2)$. We have
$$L(\theta|x) = f(x|\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i - \alpha x_{i-1})^2}{2\sigma^2}\right).$$
Thus we have
$$\log L(\theta|x) = \text{cst} - \frac{n}{2}\log\sigma^2 - \sum_{i=1}^n \frac{(x_i - \alpha x_{i-1})^2}{2\sigma^2}.$$
A.D. () January 2008 10 / 63

It follows that
$$\frac{\partial \log L(\theta|x)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_{i=1}^n (x_i - \alpha x_{i-1})^2}{2\sigma^4}, \qquad \frac{\partial \log L(\theta|x)}{\partial\alpha} = \frac{1}{\sigma^2}\sum_{i=1}^n x_{i-1}(x_i - \alpha x_{i-1}).$$
Thus we have
$$\hat\alpha = \frac{\sum_{i=1}^n x_i x_{i-1}}{\sum_{i=1}^n x_{i-1}^2}, \qquad \widehat{\sigma^2} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\alpha x_{i-1})^2.$$
A.D. () January 2008 11 / 63
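
A quick sketch, assuming NumPy, of these AR(1) estimators on a simulated path; the values of $\alpha$, $\sigma$ and the length are illustrative.

import numpy as np

rng = np.random.default_rng(3)
alpha_true, sigma_true, n = 0.8, 0.5, 5000
x = np.zeros(n + 1)
for k in range(1, n + 1):
    x[k] = alpha_true * x[k - 1] + sigma_true * rng.normal()

x_prev, x_curr = x[:-1], x[1:]
alpha_hat = np.sum(x_curr * x_prev) / np.sum(x_prev ** 2)    # sum x_i x_{i-1} / sum x_{i-1}^2
sigma2_hat = np.mean((x_curr - alpha_hat * x_prev) ** 2)     # (1/n) sum of squared residuals
print("alpha_hat:", alpha_hat, " sigma2_hat:", sigma2_hat)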

Invariance. Consider $\eta = g(\theta)$. We introduce the induced likelihood function
$$L^*(\eta|x) = \sup_{\{\theta : g(\theta) = \eta\}} L(\theta|x).$$
Invariance property: If $\hat\theta$ is the MLE of $\theta$, then for any function $\eta = g(\theta)$, $g(\hat\theta)$ is the MLE of $\eta$.
Proof: The MLE of $\eta$ is defined by
$$\hat\eta = \arg\sup_\eta \sup_{\{\theta : g(\theta) = \eta\}} L(\theta|x).$$
Define $g^{-1}(\eta) = \{\theta : g(\theta) = \eta\}$. Then clearly $\hat\theta \in g^{-1}(\hat\eta)$ and cannot be in any other preimage, so $\hat\eta = g(\hat\theta)$.
A.D. () January 2008 12 / 63

Consistency. Definition. A sequence of estimators $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ is consistent for the parameter $\theta$ if, for every $\varepsilon > 0$ and every $\theta \in \Theta$,
$$\lim_{n\to\infty} P_\theta\left(|\hat\theta_n - \theta| < \varepsilon\right) = 1 \quad \text{(equivalently } \lim_{n\to\infty} P_\theta\left(|\hat\theta_n - \theta| \ge \varepsilon\right) = 0\text{)}.$$
Example: Consider $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, 1)$ and $\hat\theta_n = \frac{1}{n}\sum_{i=1}^n X_i$; then $\hat\theta_n \sim \mathcal{N}(\theta, 1/n)$ and
$$P_\theta\left(|\hat\theta_n - \theta| < \varepsilon\right) = \int_{-\varepsilon\sqrt{n}}^{\varepsilon\sqrt{n}} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right) du \to 1.$$
It is possible to avoid this calculation and use instead Chebyshev's inequality:
$$P_\theta\left(|\hat\theta_n - \theta| \ge \varepsilon\right) = P_\theta\left((\hat\theta_n - \theta)^2 \ge \varepsilon^2\right) \le \frac{E_\theta\left[(\hat\theta_n - \theta)^2\right]}{\varepsilon^2},$$
where $E_\theta\left[(\hat\theta_n - \theta)^2\right] = \operatorname{var}_\theta(\hat\theta_n) + \left(E_\theta[\hat\theta_n] - \theta\right)^2$.
A.D. () January 2008 13 / 63

Example of inconsistent MLE (Fisher). Let
$$(X_i, Y_i) \sim \mathcal{N}\left(\begin{pmatrix}\mu_i \\ \mu_i\end{pmatrix}, \begin{pmatrix}\sigma^2 & 0 \\ 0 & \sigma^2\end{pmatrix}\right).$$
The likelihood function is given by
$$L(\theta) = \prod_{i=1}^n \frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2}\left[(x_i - \mu_i)^2 + (y_i - \mu_i)^2\right]\right).$$
We obtain
$$l(\theta) = \text{cste} - n\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n\left[2\left(\frac{x_i + y_i}{2} - \mu_i\right)^2 + \frac{(x_i - y_i)^2}{2}\right].$$
We have
$$\hat\mu_i = \frac{x_i + y_i}{2}, \qquad \widehat{\sigma^2} = \frac{\sum_{i=1}^n (x_i - y_i)^2}{4n} \to \frac{\sigma^2}{2}.$$
A.D. () January 2008 14 / 63

Consistency of the MLE. Kullback-Leibler distance: for any densities $f, g$,
$$D(f, g) = \int f(x)\log\frac{f(x)}{g(x)}\, dx.$$
We have $D(f, g) \ge 0$ and $D(f, f) = 0$. Indeed,
$$-D(f, g) = \int f(x)\log\frac{g(x)}{f(x)}\, dx \le \int f(x)\left(\frac{g(x)}{f(x)} - 1\right) dx = 0.$$
$D(f, g)$ is a very useful distance and appears in many different contexts.
A.D. () January 2008 15 / 63

Alternative measures of similarity.
Hellinger distance: $D(f, g) = \int \left(\sqrt{f(x)} - \sqrt{g(x)}\right)^2 dx$.
Generalized information: $D(f, g) = \frac{1}{\lambda}\int \left[\left(\frac{f(x)}{g(x)}\right)^\lambda - 1\right] f(x)\, dx$.
L1-norm / total variation: $D(f, g) = \int |f(x) - g(x)|\, dx$.
L2-norm: $D(f, g) = \int (f(x) - g(x))^2\, dx$.
A.D. () January 2008 16 / 63

Example. Suppose we have $f(x) = \mathcal{N}(x; \xi, \tau^2)$ and $g(x) = \mathcal{N}(x; \mu, \sigma^2)$. We have
$$E_f\left[(X - \mu)^2\right] = E_f\left[(X - \xi)^2 + 2(X - \xi)(\xi - \mu) + (\xi - \mu)^2\right] = \tau^2 + (\xi - \mu)^2.$$
So it follows that
$$E_f[\log g(X)] = E_f\left[-\frac{1}{2}\log 2\pi\sigma^2 - \frac{(X - \mu)^2}{2\sigma^2}\right] = -\frac{1}{2}\log 2\pi\sigma^2 - \frac{\tau^2 + (\xi - \mu)^2}{2\sigma^2},$$
and
$$E_f[\log f(X)] = -\frac{1}{2}\log 2\pi\tau^2 - \frac{1}{2}.$$
A.D. () January 2008 17 / 63

It follows that
$$D(f, g) = \int f(x)\log\frac{f(x)}{g(x)}\, dx = E_f[\log f(X)] - E_f[\log g(X)] = \frac{1}{2}\left(\log\frac{\sigma^2}{\tau^2} + \frac{\tau^2 + (\xi - \mu)^2}{\sigma^2} - 1\right).$$
It can easily be checked that $D(f, f) = 0$ (less easy to show $D(f, g) \ge 0$).
A.D. () January 2008 18 / 63
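
A quick sanity check of this closed form, assuming NumPy: compare it against a plain Monte Carlo average of $\log f(X) - \log g(X)$ with $X \sim f$. The parameter values are illustrative.

import numpy as np

xi, tau2 = 0.5, 1.5
mu, sigma2 = 0.0, 2.0

kl_closed = 0.5 * (np.log(sigma2 / tau2) + (tau2 + (xi - mu) ** 2) / sigma2 - 1.0)

rng = np.random.default_rng(4)
x = rng.normal(xi, np.sqrt(tau2), size=1_000_000)
log_f = -0.5 * np.log(2 * np.pi * tau2) - (x - xi) ** 2 / (2 * tau2)
log_g = -0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
print("closed form:", kl_closed, " Monte Carlo:", np.mean(log_f - log_g))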

Example. Assume we have $f(x) = \frac{1}{2}\exp(-|x|)$ and $g(x) = \mathcal{N}(x; \mu, \sigma^2)$. We obtain
$$E_f[\log f(X)] = -\log 2 - \frac{1}{2}\int |x|\exp(-|x|)\, dx = -\log 2 - \int_0^\infty x\exp(-x)\, dx = -\log 2 - 1,$$
$$E_f[\log g(X)] = -\frac{1}{2}\log 2\pi\sigma^2 - \frac{1}{4\sigma^2}\left(4 + 2\mu^2\right).$$
It follows that
$$D(f, g) = \frac{1}{2}\log 2\pi\sigma^2 + \frac{1}{2\sigma^2}\left(2 + \mu^2\right) - \log 2 - 1.$$
A.D. () January 2008 19 / 63

Assume the pdfs $f(x|\theta)$ have common support for all $\theta$ and $f(x|\theta) \ne f(x|\theta')$ for $\theta \ne \theta'$; i.e. $S_\theta = \{x : f(x|\theta) > 0\}$ is independent of $\theta$. Denote
$$M_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log\frac{f(X_i|\theta)}{f(X_i|\theta^*)}.$$
As the MLE $\hat\theta_n$ maximizes $L(\theta|x)$, it also maximizes $M_n(\theta)$. Assume $X_i \overset{\text{i.i.d.}}{\sim} f(x|\theta^*)$. Note that by the law of large numbers, $M_n(\theta)$ converges to
$$E_{\theta^*}\left[\log\frac{f(X|\theta)}{f(X|\theta^*)}\right] = \int f(x|\theta^*)\log\frac{f(x|\theta)}{f(x|\theta^*)}\, dx = -D\left(f(\cdot|\theta^*), f(\cdot|\theta)\right) := M(\theta).$$
Hence $M_n(\theta) \approx -D(f(\cdot|\theta^*), f(\cdot|\theta))$, which is maximized for $\theta^*$, so we expect that its maximizer will converge towards $\theta^*$.
A.D. () January 2008 20 / 63

Example. Assume $f(x) = g(x) = \mathcal{N}(x; 0, 1)$. We approximate
$$E_f[\log g(X)] = -\frac{1}{2}\log(2\pi) - \frac{1}{2} = -1.4189$$
through
$$\widehat{E}_f[\log g(X)] = -\frac{1}{2}\log(2\pi) - \frac{1}{2n}\sum_{i=1}^n X_i^2.$$
Numerical examples (mean, variance and standard deviation over 1,000 Monte Carlo trials):

n                     10        100       1,000     10,000    E_f[log g(X)]
Mean                  -1.4188   -1.4185   -1.4191   -1.4189   -1.4189
Variance              0.05079   0.00497   0.00050   0.00005   -
Standard deviation    0.22537   0.07056   0.02232   0.00696   -

A.D. () January 2008 21 / 63
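
A sketch, assuming NumPy, that reproduces the spirit of this experiment (exact numbers will differ with the random seed):

import numpy as np

rng = np.random.default_rng(5)
trials = 1000
for n in (10, 100, 1000, 10000):
    x = rng.normal(size=(trials, n))
    est = -0.5 * np.log(2 * np.pi) - 0.5 * np.mean(x ** 2, axis=1)   # one estimate per trial
    print(f"n={n:6d}  mean={est.mean():.4f}  var={est.var():.5f}  std={est.std():.5f}")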

Theorem. Suppose that
$$\sup_{\theta\in\Theta}\left|M_n(\theta) - M(\theta)\right| \overset{P}{\to} 0$$
and that, for every $\varepsilon > 0$,
$$\sup_{\theta : |\theta - \theta^*| \ge \varepsilon} M(\theta) < M(\theta^*);$$
then $\hat\theta_n \overset{P}{\to} \theta^*$.
A.D. () January 2008 22 / 63

Proof. Since $\hat\theta_n$ maximizes $M_n(\theta)$, we have $M_n(\hat\theta_n) \ge M_n(\theta^*)$. Thus,
$$M(\theta^*) - M(\hat\theta_n) = M_n(\hat\theta_n) - M(\hat\theta_n) + M(\theta^*) - M_n(\hat\theta_n) \le M_n(\hat\theta_n) - M(\hat\theta_n) + M(\theta^*) - M_n(\theta^*) \le 2\sup_\theta\left|M_n(\theta) - M(\theta)\right| \overset{P}{\to} 0.$$
Thus it implies that for any $\delta > 0$, we have $\Pr\left(M(\hat\theta_n) < M(\theta^*) - \delta\right) \to 0$. Now for any $\varepsilon > 0$, there exists $\delta > 0$ such that $|\theta - \theta^*| \ge \varepsilon$ implies $M(\theta) < M(\theta^*) - \delta$. Hence,
$$\Pr\left(|\hat\theta_n - \theta^*| > \varepsilon\right) \le \Pr\left(M(\hat\theta_n) < M(\theta^*) - \delta\right) \to 0.$$
A.D. () January 2008 23 / 63

Asymptotic Normality. Assuming we have $\hat\theta_n \overset{P}{\to} \theta^*$, what can we say about $\sqrt{n}(\hat\theta_n - \theta^*)$?
Lemma. Let $s(x|\theta) := \frac{\partial\log f(x|\theta)}{\partial\theta}$ be the score function; then we have, for any $\theta$, $E_\theta[s(X|\theta)] = 0$.
Proof. We have
$$\int \frac{\partial\log f(x|\theta)}{\partial\theta} f(x|\theta)\, dx = \int \frac{\partial f(x|\theta)/\partial\theta}{f(x|\theta)} f(x|\theta)\, dx = \int \frac{\partial f(x|\theta)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta}\underbrace{\int f(x|\theta)\, dx}_{=1} = 0.$$
A.D. () January 2008 24 / 63

Lemma. We also have
$$\operatorname{var}_\theta[s(X|\theta)] = E_\theta\left[s(X|\theta)^2\right] = -E_\theta\left[\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\right] := I(\theta).$$
Proof. This follows from
$$\int \frac{\partial\log f(x|\theta)}{\partial\theta} f(x|\theta)\, dx = 0;$$
thus, by taking the derivative once more with respect to $\theta$,
$$0 = \frac{\partial}{\partial\theta}\int \frac{\partial\log f(x|\theta)}{\partial\theta} f(x|\theta)\, dx = \int \frac{\partial^2\log f(x|\theta)}{\partial\theta^2} f(x|\theta)\, dx + \int \frac{\partial\log f(x|\theta)}{\partial\theta}\frac{\partial f(x|\theta)}{\partial\theta}\, dx.$$
A.D. () January 2008 25 / 63

Heuristic Derivation. We have, for $l(\theta) := \log L(\theta|x)$,
$$0 = l'(\hat\theta_n) \approx l'(\theta^*) + (\hat\theta_n - \theta^*)\, l''(\theta^*) \;\Rightarrow\; \hat\theta_n - \theta^* \approx -\frac{l'(\theta^*)}{l''(\theta^*)}.$$
That is,
$$\sqrt{n}\left(\hat\theta_n - \theta^*\right) \approx \frac{\frac{1}{\sqrt n}\, l'(\theta^*)}{-\frac{1}{n}\, l''(\theta^*)}.$$
Now remember that $l'(\theta^*) = \sum_{i=1}^n s(X_i|\theta^*)$, where $E_{\theta^*}[s(X_i|\theta^*)] = 0$ and $\operatorname{var}_{\theta^*}[s(X_i|\theta^*)] = I(\theta^*)$, so the CLT tells us that
$$\frac{1}{\sqrt n}\, l'(\theta^*) \overset{D}{\to} \mathcal{N}(0, I(\theta^*)).$$
A.D. () January 2008 26 / 63

Now the law of large numbers yields $-\frac{1}{n}\, l''(\theta^*) \overset{P}{\to} I(\theta^*)$, so by Slutsky's theorem
$$\sqrt{n}\left(\hat\theta_n - \theta^*\right) \overset{D}{\to} \mathcal{N}\left(0, \frac{1}{I(\theta^*)}\right), \qquad \sqrt{n\, I(\theta^*)}\left(\hat\theta_n - \theta^*\right) \overset{D}{\to} \mathcal{N}(0, 1).$$
Note that you have already seen this expression when establishing the Cramer-Rao bound. It is important to remember that, depending on $\theta$, the parameter can be more or less easy to estimate.
A.D. () January 2008 27 / 63

Similarly, we can prove that
$$\sqrt{n\, I(\hat\theta_n)}\left(\hat\theta_n - \theta^*\right) \overset{D}{\to} \mathcal{N}(0, 1).$$
We can also prove that
$$\frac{\sqrt{n\, I(\hat\theta_n)}}{g'(\hat\theta_n)}\left(g(\hat\theta_n) - g(\theta^*)\right) \overset{D}{\to} \mathcal{N}(0, 1).$$
This allows us to derive some confidence intervals.
A.D. () January 2008 28 / 63
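
A minimal sketch, assuming NumPy and SciPy, of the resulting Wald-type interval for the exponential model $f(x|\theta) = \theta e^{-\theta x}$, for which $\hat\theta = 1/\bar x$ and $I(\theta) = 1/\theta^2$. The true rate and sample size are illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
theta_true, n = 2.0, 400
x = rng.exponential(scale=1.0 / theta_true, size=n)

theta_hat = 1.0 / x.mean()
se = theta_hat / np.sqrt(n)            # 1 / sqrt(n I(theta_hat)) with I(theta) = 1/theta^2
z = norm.ppf(0.975)
print("theta_hat:", theta_hat, " approx. 95% CI:", (theta_hat - z * se, theta_hat + z * se))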

Making the proof more rigorous. We have
$$l'(\hat\theta_n) = l'(\theta^*) + (\hat\theta_n - \theta^*)\, l''(\theta^*) + \frac{1}{2}(\hat\theta_n - \theta^*)^2\, l'''(\theta_{*,n}),$$
where $\theta_{*,n}$ lies between $\hat\theta_n$ and $\theta^*$, so that
$$\sqrt{n}\left(\hat\theta_n - \theta^*\right) = \frac{\frac{1}{\sqrt n}\, l'(\theta^*)}{-\frac{1}{n}\, l''(\theta^*) - \frac{1}{2n}(\hat\theta_n - \theta^*)\, l'''(\theta_{*,n})}.$$
To prove the result, we need to check that $\frac{1}{2n}(\hat\theta_n - \theta^*)\, l'''(\theta_{*,n}) \overset{P}{\to} 0$. As $\hat\theta_n \overset{P}{\to} \theta^*$, we just need to prove that $\frac{1}{2n}\, l'''(\theta_{*,n})$ is bounded (in probability). So we need an additional condition of the form, say,
$$\left|\frac{\partial^3\log f(x|\theta)}{\partial\theta^3}\right| \le C(x)$$
for any $\theta$, with $E_{\theta^*}[C(X)] < \infty$.
A.D. () January 2008 29 / 63

Multiparameter Case. The extension to the multiparameter case $\theta = (\theta_1, \ldots, \theta_d)$ is straightforward:
$$\sqrt{n}\left(\hat\theta_n - \theta^*\right) \overset{D}{\to} \mathcal{N}\left(0, J(\theta^*)\right), \quad \text{where } J(\theta^*) = I(\theta^*)^{-1} \text{ and } [I(\theta^*)]_{k,l} = -E_{\theta^*}\left[\frac{\partial^2\log f(x|\theta)}{\partial\theta_k\,\partial\theta_l}\right].$$
We define $\nabla g := \left(\frac{\partial g}{\partial\theta_1}, \ldots, \frac{\partial g}{\partial\theta_d}\right)^T$; then, if $\nabla g(\theta^*) \ne 0$,
$$\sqrt{n}\left(g(\hat\theta_n) - g(\theta^*)\right) \overset{D}{\to} \mathcal{N}\left(0, \nabla g(\theta^*)^T J(\theta^*)\, \nabla g(\theta^*)\right).$$
A.D. () January 2008 30 / 63

Example: If $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$, then
$$\log f(x|\theta) = \text{cst} - \log\sigma - \frac{(x - \mu)^2}{2\sigma^2}, \qquad s(x|\theta) = \left(\frac{x - \mu}{\sigma^2},\; -\frac{1}{\sigma} + \frac{(x - \mu)^2}{\sigma^3}\right),$$
$$I(\theta) = \begin{pmatrix} E_\theta\left[\frac{1}{\sigma^2}\right] & E_\theta\left[\frac{2(x - \mu)}{\sigma^3}\right] \\ E_\theta\left[\frac{2(x - \mu)}{\sigma^3}\right] & E_\theta\left[-\frac{1}{\sigma^2} + \frac{3(x - \mu)^2}{\sigma^4}\right]\end{pmatrix} = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2}\end{pmatrix}.$$
The MLE of $\mu$ is given by $\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i \Rightarrow \operatorname{var}[\hat\mu] = \frac{\sigma^2}{n}$, and the MLE is indeed efficient (it reaches the Cramer-Rao lower bound).
A.D. () January 2008 31 / 63

Assume we observe a vector $X = (X_1, \ldots, X_k)$ where $X_j \in \{0, 1\}$, $\sum_{j=1}^k X_j = 1$, with
$$f(x|p_1, \ldots, p_{k-1}) = \left(\prod_{j=1}^{k-1} p_j^{x_j}\right)\left(1 - \sum_{j=1}^{k-1} p_j\right)^{x_k},$$
where $p_j > 0$ and $p_k := 1 - \sum_{j=1}^{k-1} p_j < 1$. We have $\theta = (p_1, \ldots, p_{k-1})$ and
$$\frac{\partial\log f(x|\theta)}{\partial p_j} = \frac{x_j}{p_j} - \frac{x_k}{p_k}, \qquad \frac{\partial^2\log f(x|\theta)}{\partial p_j^2} = -\frac{x_j}{p_j^2} - \frac{x_k}{p_k^2}, \qquad \frac{\partial^2\log f(x|\theta)}{\partial p_j\,\partial p_l} = -\frac{x_k}{p_k^2}, \quad j \ne l < k.$$
A.D. () January 2008 32 / 63

Recall that $X_j$ has a Bernoulli distribution with mean $p_j$, so
$$I(\theta) = \begin{pmatrix} p_1^{-1} + p_k^{-1} & p_k^{-1} & \cdots & p_k^{-1} \\ p_k^{-1} & p_2^{-1} + p_k^{-1} & & \vdots \\ \vdots & & \ddots & p_k^{-1} \\ p_k^{-1} & \cdots & p_k^{-1} & p_{k-1}^{-1} + p_k^{-1}\end{pmatrix}.$$
One can check that
$$I(\theta)^{-1} = \begin{pmatrix} p_1(1 - p_1) & -p_1 p_2 & \cdots & -p_1 p_{k-1} \\ -p_1 p_2 & p_2(1 - p_2) & \cdots & -p_2 p_{k-1} \\ \vdots & & \ddots & \vdots \\ -p_1 p_{k-1} & -p_2 p_{k-1} & \cdots & p_{k-1}(1 - p_{k-1})\end{pmatrix}.$$
A.D. () January 2008 33 / 63

Now assume we observe $X^1, X^2, \ldots, X^n$; then
$$\log L(\theta|x) = \sum_{j=1}^k t_j\log p_j = \sum_{j=1}^{k-1} t_j\log p_j + t_k\log\left(1 - \sum_{j=1}^{k-1} p_j\right),$$
where $t_j = \sum_{i=1}^n x_j^i$. So we have
$$\frac{\partial\log L(\theta|x)}{\partial p_j} = \frac{t_j}{p_j} - \frac{t_k}{p_k} \quad \text{for } j = 1, \ldots, k-1 \;\Rightarrow\; \hat p_j = \frac{t_j}{n}.$$
Clearly $t_j$ is Binomial$(n, p_j)$ with variance $n p_j(1 - p_j)$, so $\hat p_j$ is efficient.
A.D. () January 2008 34 / 63

Nuisance Parameters. Assume $\theta = (\theta_1, \ldots, \theta_d)$ is the parameter vector but only the scalar $\theta_1$ is of interest, whereas $(\theta_2, \ldots, \theta_d)$ are nuisance parameters. We want to assess how the asymptotic precision with which we estimate $\theta_1$ is influenced by the presence of nuisance parameters; i.e. if $\hat\theta$ is an efficient estimate of $\theta$, then how does $\hat\theta_1$ as an estimator of $\theta_1$ compare to an efficient estimator of $\theta_1$, say $\tilde\theta_1$, which would assume that all the nuisance parameters are known. Intuitively, we should have $\operatorname{var}[\tilde\theta_1] \le \operatorname{var}[\hat\theta_1]$; i.e. ignorance cannot bring you any advantage.
A.D. () January 2008 35 / 63

The asymptotic variance of $\sqrt{n}(\hat\theta - \theta^*)$ is $I^{-1}(\theta^*)$, whose $(i, j)$ entry is denoted $\alpha_{i,j}$. The asymptotic variance of $\sqrt{n}(\tilde\theta_1 - \theta^*_1)$ is $1/\gamma_{1,1}$, where $\gamma_{1,1} = [I(\theta^*)]_{1,1}$.
Theorem. We have $\alpha_{1,1} \ge 1/\gamma_{1,1}$, with equality if and only if $\alpha_{1,2} = \cdots = \alpha_{1,d} = 0$.
A.D. () January 2008 36 / 63

Partition $I(\theta)$ as follows:
$$I(\theta) = \begin{pmatrix} \gamma_{1,1} & \rho^T \\ \rho & \Sigma\end{pmatrix}.$$
Now we use the fact that
$$I^{-1}(\theta) = \frac{1}{\tau}\begin{pmatrix} 1 & -\rho^T\Sigma^{-1} \\ -\Sigma^{-1}\rho & \tau\Sigma^{-1} + \Sigma^{-1}\rho\rho^T\Sigma^{-1}\end{pmatrix},$$
where $\tau = \gamma_{1,1} - \rho^T\Sigma^{-1}\rho$. As $I(\theta)$ is positive definite, $\Sigma^{-1}$ is positive definite and
$$\alpha_{1,1} = \frac{1}{\tau} \ge \frac{1}{\gamma_{1,1}},$$
with equality iff $\rho = 0$. To show that $\tau > 0$, we use the fact that $I(\theta)$ is p.d. and that $\tau = v^T I(\theta)\, v$ where $v = (1, -\rho^T\Sigma^{-1})^T$.
A.D. () January 2008 37 / 63

Beyond Maximum Likelihood: Method of Moments. MLE estimates can be difficult to compute; the method of moments is a simple alternative. The obtained estimators are typically not optimal but can be used as starting values for more sophisticated methods. For $1 \le j \le d$, define the $j$th moment of $f(x|\theta)$, where $\theta = (\theta_1, \ldots, \theta_d)$,
$$\alpha_j(\theta) = E_\theta\left[X^j\right] = \int x^j f(x|\theta)\, dx,$$
and, given $X = (X_1, \ldots, X_n)$, the $j$th sample moment as
$$\hat\alpha_j = \frac{1}{n}\sum_{i=1}^n X_i^j.$$
The idea of the method of moments is to match the theoretical moments to the sample moments; that is, we define $\hat\theta$ as the value of $\theta$ such that $\alpha_j(\hat\theta) = \hat\alpha_j$ for $j = 1, \ldots, d$.
A.D. () January 2008 38 / 63

Example: Let $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$; then
$$\alpha_1(\theta) = \mu, \quad \alpha_2(\theta) = \sigma^2 + \mu^2, \quad \hat\alpha_1 = \frac{1}{n}\sum_{i=1}^n X_i, \quad \hat\alpha_2 = \frac{1}{n}\sum_{i=1}^n X_i^2.$$
Thus we obtain $\hat\mu = \hat\alpha_1$ and $\widehat{\sigma^2} = \hat\alpha_2 - (\hat\alpha_1)^2$. Note that $\widehat{\sigma^2}$ is not unbiased.
A.D. () January 2008 39 / 63

Assume $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{U}(\theta_1, \theta_2)$ where $-\infty < \theta_1 < \theta_2 < +\infty$; then
$$\alpha_1(\theta) = \frac{\theta_1 + \theta_2}{2}, \qquad \alpha_2(\theta) = \frac{\theta_1^2 + \theta_2^2 + \theta_1\theta_2}{3}.$$
Now we solve and obtain $\theta_1 = 2\hat\alpha_1 - \theta_2$,
$$3\hat\alpha_2 = (2\hat\alpha_1 - \theta_2)^2 + \theta_2^2 + (2\hat\alpha_1 - \theta_2)\theta_2, \qquad (\theta_2 - \hat\alpha_1)^2 = 3\left(\hat\alpha_2 - \hat\alpha_1^2\right).$$
Since $\theta_2 > E(X)$, then
$$\hat\theta_2 = \hat\alpha_1 + \sqrt{3\left(\hat\alpha_2 - \hat\alpha_1^2\right)}, \qquad \hat\theta_1 = \hat\alpha_1 - \sqrt{3\left(\hat\alpha_2 - \hat\alpha_1^2\right)}.$$
Note that $(\hat\theta_1, \hat\theta_2)$ is NOT a function of the sufficient statistics $(X_{(1)}, X_{(n)})$.
A.D. () January 2008 40 / 63
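
A short sketch, assuming NumPy, of these method-of-moments estimates, printed next to the sufficient statistics $(x_{(1)}, x_{(n)})$ for comparison; the true endpoints and sample size are illustrative.

import numpy as np

rng = np.random.default_rng(7)
theta1, theta2, n = -1.0, 3.0, 500
x = rng.uniform(theta1, theta2, size=n)

a1, a2 = x.mean(), np.mean(x ** 2)                  # first two sample moments
spread = np.sqrt(3.0 * (a2 - a1 ** 2))
theta1_mom, theta2_mom = a1 - spread, a1 + spread
print("method of moments:", theta1_mom, theta2_mom)
print("min / max        :", x.min(), x.max())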

Assume $X_i \overset{\text{i.i.d.}}{\sim} \text{Bin}(k, p)$ with parameters $k \in \mathbb{N}$ and $p \in (0, 1)$. We have
$$\alpha_1(\theta) = kp, \qquad \alpha_2(\theta) = kp(1 - p) + k^2p^2.$$
Thus we obtain
$$\hat p = \frac{\hat\alpha_1 + \hat\alpha_1^2 - \hat\alpha_2}{\hat\alpha_1}, \qquad \hat k = \frac{\hat\alpha_1^2}{\hat\alpha_1 + \hat\alpha_1^2 - \hat\alpha_2}.$$
The estimator $\hat p \in (0, 1)$, but $\hat k$ is generally not an integer.
A.D. () January 2008 41 / 63

Statistical Properties of the Estimate. Let $\hat\alpha = (\hat\alpha_1, \ldots, \hat\alpha_d)$; we have $\hat\alpha = h(\hat\theta)$ and, if the inverse function $g = h^{-1}$ exists, then $\hat\theta = g(\hat\alpha)$. If $g$ is continuous at $\alpha = (\alpha_1, \ldots, \alpha_d)$, then $\hat\theta$ is a consistent estimate of $\theta$ as $\hat\alpha_j \to \alpha_j$. Moreover, if $g$ is differentiable at $\alpha$ and $E_\theta[X^{2d}] < \infty$, then
$$\sqrt{n}\left(\hat\theta - \theta\right) \overset{D}{\to} \mathcal{N}\left(0, \nabla g(\alpha)^T V_\alpha\, \nabla g(\alpha)\right),$$
where $V_\alpha[i, j] = \alpha_{i+j} - \alpha_i\alpha_j$.
A.D. () January 2008 42 / 63

The result follows from $\sqrt{n}(\hat\alpha_n - \alpha) \overset{D}{\to} \mathcal{N}(0, V_\alpha)$, as
$$E_\theta[\hat\alpha_{j,n}] = \alpha_j, \qquad \operatorname{cov}[\hat\alpha_{i,n}, \hat\alpha_{j,n}] = \frac{\alpha_{i+j} - \alpha_i\alpha_j}{n}.$$
We have $\hat\theta_n - \theta = g(\hat\alpha_n) - g(\alpha)$, so using the delta method,
$$\sqrt{n}\left(\hat\theta_n - \theta\right) \overset{D}{\to} \mathcal{N}\left(0, \nabla g(\alpha)^T V_\alpha\, \nabla g(\alpha)\right).$$
A.D. () January 2008 43 / 63

We can also establish the first-order asymptotic bias of the estimate, as
$$g(\hat\alpha_n) = g(\alpha) + \nabla g(\alpha)^T(\hat\alpha_n - \alpha) + \frac{1}{2}(\hat\alpha_n - \alpha)^T\nabla^2 g(\alpha)(\hat\alpha_n - \alpha) + o\left(\frac{1}{n}\right),$$
where $\sqrt{n}(\hat\alpha_n - \alpha) \overset{D}{\to} Z_\Sigma$ with $Z_\Sigma \sim \mathcal{N}(0, \Sigma)$, so
$$n\,(\hat\alpha_n - \alpha)^T\nabla^2 g(\alpha)(\hat\alpha_n - \alpha) \overset{D}{\to} Z_\Sigma^T\nabla^2 g(\alpha)\, Z_\Sigma;$$
recall that $X_n \overset{D}{\to} X$ implies $\varphi(X_n) \overset{D}{\to} \varphi(X)$. Thus we have
$$E[g(\hat\alpha_n)] = g(\alpha) + \frac{1}{2n}E\left[Z_\Sigma^T\nabla^2 g(\alpha)\, Z_\Sigma\right] + o\left(\frac{1}{n}\right) = g(\alpha) + \frac{\operatorname{tr}\left(\nabla^2 g(\alpha)\,\Sigma\right)}{2n} + o\left(\frac{1}{n}\right).$$
A.D. () January 2008 44 / 63

Beyond Maximum Likelihood: Pseudo-Likelihood. Assume $X = (X^1, \ldots, X^q) \sim f(x|\theta)$. Given observations $X_i \sim f(x|\theta)$, the MLE requires maximizing $L(\theta|x)$. However, in some problems it might be difficult to specify $f(x|\theta)$, and we may only be able to specify, say, $f(x^k, x^l|\theta)$ for $1 \le k < l \le q$. Based on this information and the observations, we could define the pseudo-log-likelihood functions
$$l_1(\theta|x) = \sum_{i=1}^n l_1(\theta|x_i), \qquad l_1(\theta|x_i) = \sum_{s=1}^q \log f(x_i^s|\theta),$$
$$l_2(\theta|x) = \sum_{i=1}^n l_2(\theta|x_i), \qquad l_2(\theta|x_i) = \sum_{s=1}^q\sum_{t=s+1}^q \log f(x_i^s, x_i^t|\theta) + \alpha\, l_1(\theta|x_i).$$
These pseudo-likelihood functions are simpler than the full likelihood.
A.D. () January 2008 45 / 63

Under regularity conditions very similar to the ones for the MLE, solving $l_k'(\theta|x) = 0$ for $k = 1, 2$ will provide unbiased estimates. To derive the asymptotic variance, we use
$$0 = l_k'(\hat\theta) \approx l_k'(\theta^*) + (\hat\theta - \theta^*)\, l_k''(\theta^*) \;\Rightarrow\; \sqrt{n}\left(\hat\theta - \theta^*\right) = \frac{\frac{1}{\sqrt n}\, l_k'(\theta^*)}{-\frac{1}{n}\, l_k''(\theta^*)},$$
where $\frac{1}{n}\, l_k''(\theta^*) \overset{P}{\to} E_{\theta^*}[l_k'']$ and $\frac{1}{\sqrt n}\, l_k'(\theta^*) \overset{D}{\to} \mathcal{N}\left(0, E_{\theta^*}[l_k'^2]\right)$; thus
$$\sqrt{n}\left(\hat\theta - \theta^*\right) \overset{D}{\to} \mathcal{N}\left(0, E_{\theta^*}[l_k'']^{-2} E_{\theta^*}[l_k'^2]\right).$$
We have the estimates
$$E_{\theta^*}[l_k''] \approx \frac{1}{n}\sum_{i=1}^n l_k''(\theta|x_i), \qquad E_{\theta^*}[l_k'^2] \approx \frac{1}{n}\sum_{i=1}^n l_k'^2(\theta|x_i).$$
A.D. () January 2008 46 / 63

Example. Assume that $X = (X^1, \ldots, X^q) \sim \mathcal{N}(0, \Sigma)$ where $[\Sigma](i, j) = 1$ if $i = j$ and $\rho$ otherwise. We are interested in estimating $\theta = \rho$. There is no information about $\rho$ in $l_1(\theta|x)$, so we use $l_2(\theta|x)$ with $\alpha = 0$. For observations $X_1, \ldots, X_n$, we have
$$l_2(\theta|x) = -\frac{n\,q(q-1)}{4}\log\left(1 - \rho^2\right) - \frac{q - 1 + \rho}{2(1 - \rho^2)}\, SS_W - \frac{(q - 1)(1 - \rho)}{2(1 - \rho^2)}\, SS_B,$$
where
$$SS_W = \sum_{i=1}^n\sum_{s=1}^q\left(X_s^i - \frac{1}{q}\sum_{t=1}^q X_t^i\right)^2, \qquad SS_B = \sum_{i=1}^n \frac{1}{q}\left(\sum_{t=1}^q X_t^i\right)^2.$$
A.D. () January 2008 47 / 63

After simple but tedious calculations, we obtain for the asymptotic variance
$$\frac{2(1 - \rho)^2\, c(q, \rho)}{n\, q(q - 1)\,(1 + \rho^2)^2}, \quad \text{where } c(q, \rho) = (1 - \rho)^2\left(1 + 3\rho^2\right) + q\rho\left(-3\rho^3 + 8\rho^2 - 3\rho + 2\right) + q^2\rho^2(1 - \rho)^2,$$
whereas for the MLE we have
$$\frac{2\left(1 + (q - 1)\rho\right)^2(1 - \rho)^2}{n\, q(q - 1)\left(1 + (q - 1)\rho^2\right)}.$$
The ratio is 1 for $q = 2$ as expected, and also 1 if $\rho = 0$ or $1$. For any other values, there is a loss of efficiency for $l_2(\theta|x)$ which increases as $q \to \infty$.
A.D. () January 2008 48 / 63

Consider the following time series: $X_0 \sim \mathcal{N}\left(0, \frac{\sigma^2}{1 - \alpha^2}\right)$, $X_n = \alpha X_{n-1} + \sigma V_n$ where $V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$, and $\theta = \sigma^2$. We can show that we have, for any $i = 0, 1, \ldots, n$,
$$f(x_i|\theta) = \sqrt{\frac{1 - \alpha^2}{2\pi\sigma^2}}\exp\left(-\frac{(1 - \alpha^2)\, x_i^2}{2\sigma^2}\right),$$
and we consider
$$l_1(\theta|x) = \sum_i \log f(x_i|\theta) = \text{cste} - \frac{n}{2}\log\frac{\sigma^2}{1 - \alpha^2} - \frac{1 - \alpha^2}{2\sigma^2}\sum_i x_i^2.$$
This pseudo-likelihood can easily be maximized:
$$\widehat{\sigma^2} = \frac{1 - \alpha^2}{n}\sum_i x_i^2.$$
If one is interested in estimating $\alpha$, it would be necessary to introduce $f(x_i, x_{i+1}|\theta)$.
A.D. () January 2008 49 / 63

Pseudo-likelihood is widely used for Markov random fields since its introduction by Besag (1975). In the Gaussian context, we have $X = (X_1, \ldots, X_d)$ where $d$ is extremely large, $X$ is Gaussian, and the model is specified by
$$E_\theta[X_i | x_{-i}] = \lambda\sum_j H_{ij} x_j, \qquad \operatorname{var}_\theta[X_i | x_{-i}] = \kappa.$$
Computing the likelihood for $\theta = (\lambda, \kappa)$ can be too computationally intensive, so the pseudo-likelihood is defined through
$$\tilde l(\theta|x) = \sum_{i=1}^d \log f(x_i|\theta, x_{-i}),$$
thus
$$\hat\lambda = \frac{x^T H x}{x^T H^2 x}, \qquad \hat\kappa = \frac{1}{d}\left(x^T x - \frac{\left(x^T H x\right)^2}{x^T H^2 x}\right).$$
In this context, it can be shown that the estimate is consistent and has a reasonable asymptotic variance.
A.D. () January 2008 50 / 63
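
A minimal sketch, assuming NumPy, of these closed-form pseudo-likelihood estimates; the nearest-neighbour coupling matrix H and the field x below are purely illustrative stand-ins (x is not actually drawn from the model).

import numpy as np

rng = np.random.default_rng(10)
d = 50
H = np.zeros((d, d))
idx = np.arange(d - 1)
H[idx, idx + 1] = H[idx + 1, idx] = 1.0   # hypothetical nearest-neighbour coupling on a chain
x = rng.normal(size=d)                    # stand-in field, for illustration only

Hx = H @ x
lam_hat = (x @ Hx) / (Hx @ Hx)                               # x^T H x / x^T H^2 x
kappa_hat = (x @ x - (x @ Hx) ** 2 / (Hx @ Hx)) / d          # (1/d)(x^T x - (x^T H x)^2 / x^T H^2 x)
print("lambda_hat:", lam_hat, " kappa_hat:", kappa_hat)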

Summary of Pseudo-Likelihood. In many applications, the log-likelihood $l(\theta; y_{1:n})$ is very complex to compute. Instead we maximize a surrogate function $l_S(\theta; y_{1:n})$. If possible, we pick this function such that, if $\theta^*$ is the true parameter, then $E_{\theta^*}\left(l_S(\theta; Y_{1:n})\right)$ is maximized for $\theta = \theta^*$ and solving
$$\nabla l_S\left(\hat\theta; Y_{1:n}\right) = 0$$
is easy.
A.D. () January 2008 51 / 63

Under regularity assumptions, we have
$$\sqrt{n}\left(\hat\theta - \theta^*\right) \Rightarrow \mathcal{N}\left(0, G^{-1}(\theta^*)\right),$$
where
$$G^{-1}(\theta) = H^{-1}(\theta)\, J(\theta)\, H^{-T}(\theta)$$
with
$$J(\theta) = \mathbb{V}\left\{\nabla l_S(\theta; Y_{1:n})\right\}, \qquad H(\theta) = -E\left[\nabla^2 l_S(\theta; Y_{1:n})\right].$$
When $l_S(\theta; Y_{1:n}) = l(\theta; Y_{1:n})$ and the model is correctly specified, then $G(\theta)$ is the Fisher information matrix. When $l_S(\theta; Y_{1:n}) \ne l(\theta; Y_{1:n})$, we typically lose in terms of efficiency.
A.D. () January 2008 52 / 63

Application to General State-Space Models. Consider the following general state-space model. Let $\{X_k\}_{k\ge1}$ be a Markov process defined by $X_1 \sim \mu_\theta$ and $X_k | (X_{k-1} = x_{k-1}) \sim f_\theta(\cdot|x_{k-1})$. Then we have that, for any $n > 0$,
$$p_\theta(x_{1:n}) = p_\theta(x_1)\prod_{k=2}^n p_\theta(x_k|x_{1:k-1}) = \mu_\theta(x_1)\prod_{k=2}^n f_\theta(x_k|x_{k-1}).$$
A.D. () January 2008 53 / 63

We are interested in estimating $\theta$ from the data, but we do not have access to $\{X_k\}_{k\ge1}$. We only have access to a process $\{Y_k\}_{k\ge1}$ such that, conditional upon $\{X_k\}_{k\ge1}$, the observations are statistically independent and $Y_k | (X_k = x_k) \sim g_\theta(\cdot|x_k)$. That is, we have for any $n > 0$
$$p_\theta(y_{1:n}|x_{1:n}) = \prod_{k=1}^n p_\theta(y_k|x_k) = \prod_{k=1}^n g_\theta(y_k|x_k).$$
A.D. () January 2008 54 / 63

Examples. Linear Gaussian model. Consider, say for $|\alpha| < 1$,
$$X_1 \sim \mathcal{N}\left(0, \frac{\sigma^2}{1 - \alpha^2}\right), \quad X_k = \alpha X_{k-1} + \sigma V_k \text{ where } V_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1),$$
$$Y_k = \beta + X_k + \tau W_k \text{ where } W_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1).$$
In this case we have, say, $\theta = (\beta, \sigma^2, \alpha, \tau^2)$ and
$$f_\theta(x_k|x_{k-1}) = \mathcal{N}\left(x_k; \alpha x_{k-1}, \sigma^2\right), \qquad g_\theta(y_k|x_k) = \mathcal{N}\left(y_k; \beta + x_k, \tau^2\right).$$
A.D. () January 2008 55 / 63
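
A short simulation sketch, assuming NumPy, of this linear Gaussian model, using the parameter values that appear later in the lecture ($\beta = 0.1$, $\tau = 1.0$, $\alpha = 0.95$, $\sigma = 0.55$).

import numpy as np

def simulate_lgssm(n, beta=0.1, tau=1.0, alpha=0.95, sigma=0.55, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    x = np.empty(n)
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - alpha ** 2))   # stationary initial distribution
    for k in range(1, n):
        x[k] = alpha * x[k - 1] + sigma * rng.normal()
    y = beta + x + tau * rng.normal(size=n)                     # noisy observations
    return x, y

x, y = simulate_lgssm(500, rng=np.random.default_rng(8))
print("first 5 observations:", y[:5])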

Stochastic Volatility model. Consider, say for $|\alpha| < 1$,
$$X_1 \sim \mathcal{N}\left(0, \frac{\sigma^2}{1 - \alpha^2}\right), \quad X_k = \alpha X_{k-1} + \sigma V_k \text{ where } V_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1),$$
$$Y_k = \beta\exp(X_k/2)\, W_k \text{ where } W_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1).$$
In this case we have, say, $\theta = (\beta, \sigma^2, \alpha)$ and
$$f_\theta(x_k|x_{k-1}) = \mathcal{N}\left(x_k; \alpha x_{k-1}, \sigma^2\right), \qquad g_\theta(y_k|x_k) = \mathcal{N}\left(y_k; 0, \beta^2\exp(x_k)\right).$$
A.D. () January 2008 56 / 63

In this case, the likelihood of the observations $y_{1:n}$ is given by
$$p_\theta(y_{1:n}) = \int p_\theta(x_{1:n}, y_{1:n})\, dx_{1:n} = \int p_\theta(y_{1:n}|x_{1:n})\, p_\theta(x_{1:n})\, dx_{1:n} = \int\left(\prod_{k=1}^n g_\theta(y_k|x_k)\right)\left(\mu_\theta(x_1)\prod_{k=2}^n f_\theta(x_k|x_{k-1})\right) dx_{1:n}.$$
If the model is linear Gaussian or has a finite state-space, we can compute the likelihood in closed form, but the maximization is not trivial. Otherwise, we cannot.
A.D. () January 2008 57 / 63

Pairwise likelihood for state-space models. We consider the following pseudo-likelihood for $m \ge 1$:
$$L_S(\theta; y_{1:n}) = \prod_{i=1}^n\prod_{j=i+1}^{\min\{i+m,\, n\}} p_\theta(y_i, y_j),$$
where
$$p_\theta(y_i, y_j) = \int g_\theta(y_i|x_i)\, g_\theta(y_j|x_j)\, p_\theta(x_i, x_j)\, dx_i\, dx_j.$$
As an alternative, if $n = pm$, we could maximize
$$L_S(\theta; y_{1:n}) = \prod_{i=1}^p p_\theta\left(y_{(i-1)m+1:im}\right).$$
A.D. () January 2008 58 / 63

For the two models discussed earlier, it is possible to compute $p_\theta(x_i, x_j)$ exactly as $p_\theta(x_i, x_j) = p_\theta(x_i)\, p_\theta(x_j|x_i)$, where
$$p_\theta(x_i) = \mathcal{N}\left(x_i; 0, \frac{\sigma^2}{1 - \alpha^2}\right), \qquad p_\theta(x_j|x_i) = \mathcal{N}\left(x_j; \alpha^{j-i} x_i,\; \sigma^2\sum_{k=0}^{j-i-1}\alpha^{2k}\right).$$
In a general case, we could approximate $p_\theta(y_i, y_j)$ through Monte Carlo:
$$\hat p_\theta(y_i, y_j) = \frac{1}{N}\sum_{l=1}^N g_\theta\left(y_i|X_i^l\right) g_\theta\left(y_j|X_j^l\right), \quad \text{where } \left(X_i^l, X_j^l\right) \sim p_\theta(x_i, x_j).$$
A.D. () January 2008 59 / 63
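
A sketch, assuming NumPy and SciPy, of this Monte Carlo approximation for the stochastic volatility model, drawing the pair $(X_i, X_j)$ exactly from the Gaussian law above; the parameter values are illustrative.

import numpy as np
from scipy.stats import norm

def pair_density_sv(y_i, y_j, lag, alpha=0.95, sigma=0.55, beta=0.65, N=100_000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    var0 = sigma ** 2 / (1.0 - alpha ** 2)
    x_i = rng.normal(0.0, np.sqrt(var0), size=N)                      # X_i ~ p_theta(x_i)
    var_lag = sigma ** 2 * np.sum(alpha ** (2 * np.arange(lag)))      # sigma^2 sum_{k<lag} alpha^{2k}
    x_j = alpha ** lag * x_i + np.sqrt(var_lag) * rng.normal(size=N)  # X_j | X_i
    g_i = norm.pdf(y_i, loc=0.0, scale=beta * np.exp(x_i / 2.0))      # g_theta(y_i | x_i)
    g_j = norm.pdf(y_j, loc=0.0, scale=beta * np.exp(x_j / 2.0))      # g_theta(y_j | x_j)
    return np.mean(g_i * g_j)

print(pair_density_sv(0.3, -0.2, lag=1, rng=np.random.default_rng(9)))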

Under regularity assumptions, the expected pseudo-log-likelihood
$$\int l_S(\theta; y_{1:n})\, p_{\theta^*}(y_{1:n})\, dy_{1:n}$$
is maximum in $\theta = \theta^*$, so maximizing this pseudo-likelihood makes sense. To prove it, note that
$$l_S(\theta; y_{1:n}) = \sum_{i=1}^n\sum_{j=i+1}^{\min\{i+m,\, n\}}\log p_\theta(y_i, y_j)$$
and
$$\int \log p_\theta(y_i, y_j)\, p_{\theta^*}(y_{1:n})\, dy_{1:n} = \int \log p_\theta(y_i, y_j)\, p_{\theta^*}(y_i, y_j)\, dy_i\, dy_j,$$
which is maximum in $\theta = \theta^*$.
A.D. () January 2008 60 / 63

Application. Consider
$$X_1 \sim \mathcal{N}\left(0, \frac{\sigma^2}{1 - \alpha^2}\right), \quad X_k = \alpha X_{k-1} + \sigma V_k \text{ where } V_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1), \quad Y_k = \beta + X_k + \tau W_k \text{ where } W_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1),$$
where $\theta = (\beta, \sigma^2, \alpha, \tau^2)$. In this case, we can directly establish not only $p_\theta(x_i, x_j)$ but $p_\theta(y_i, y_j)$:
$$\begin{pmatrix} Y_i \\ Y_j\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}\beta \\ \beta\end{pmatrix}, \begin{pmatrix}\tau^2 + \frac{\sigma^2}{1 - \alpha^2} & \alpha^{j-i}\frac{\sigma^2}{1 - \alpha^2} \\ \alpha^{j-i}\frac{\sigma^2}{1 - \alpha^2} & \tau^2 + \frac{\sigma^2}{1 - \alpha^2}\end{pmatrix}\right).$$
For $m = 2, \ldots, 20$ we compare the performance of $\hat\theta_{MLE}$ and $\hat\theta_{MPL}$, where the likelihood and pseudo-likelihood are maximized using a simple gradient algorithm (EM could be used). 1,000 time series of length $n = 500$ with $\beta = 0.1$, $\tau = 1.0$, $\alpha = 0.95$ and $\sigma = 0.55$ are simulated.
A.D. () January 2008 61 / 63

Parameter   true    MPL (m=2)   MPL (m=6)   MPL (m=12)   MPL (m=20)   MLE
β           0.1     0.108       0.108       0.109        0.109        0.102
                    (0.488)     (0.489)     (0.4908)     (0.492)      (0.481)
τ           1.0     0.994       0.997       0.990        0.981        0.995
                    (0.066)     (0.048)     (0.054)      (0.068)      (0.046)
α           0.95    0.941       0.941       0.939        0.937        0.941
                    (0.033)     (0.020)     (0.022)      (0.024)      (0.020)
σ           0.55    0.535       0.551       0.560        0.571        0.554
                    (0.160)     (0.064)     (0.072)      (0.087)      (0.061)
A.D. () January 2008 62 / 63

Now you should...
- be able to compute the MLE for rather complex models,
- be able to compute the asymptotic variance of the MLE,
- be able to derive the expression of the asymptotic variance of simple estimates different from the MLE.
A.D. () January 2008 63 / 63