Lecture Stat 461-561: Maximum Likelihood Estimation. A.D., January 2008.
Outline: Maximum Likelihood Estimation; Invariance; Consistency; Efficiency; Nuisance Parameters.
Parametric Inference. Let $f(x\mid\theta)$ denote the joint pdf or pmf of the sample $X=(X_1,\dots,X_n)$, parametrized by $\theta\in\Theta$. Given that $X=x$ is observed, the function $L(\theta\mid x)=f(x\mid\theta)$ is the likelihood function. The most common estimate is the Maximum Likelihood Estimate (MLE), given by
\[ \hat\theta = \arg\max_{\theta\in\Theta} L(\theta\mid x). \]
Example (Gaussian distribution):
\[ f(x_i\mid\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right). \]
Then, with $\theta=(\mu,\sigma^2)$,
\[ \log L(\theta\mid x) = \sum_{i=1}^n \log f(x_i\mid\theta) = -\frac{n}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2. \]
Taking the derivatives and setting them to zero,
\[ \frac{\partial \log L(\theta\mid x)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = 0, \qquad \frac{\partial \log L(\theta\mid x)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i-\mu)^2 = 0. \]
Solving these equations, we obtain
\[ \hat\mu = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \widehat{\sigma^2} = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2. \]
Note that $\hat\mu$ is an unbiased estimate but $\widehat{\sigma^2}$ is biased.
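As a quick numerical check (not part of the original slides), the closed-form Gaussian MLEs above can be computed directly; the true parameter values below are illustrative assumptions.

```python
import random

random.seed(0)
mu_true, sigma_true, n = 2.0, 1.5, 100_000
x = [random.gauss(mu_true, sigma_true) for _ in range(n)]

mu_hat = sum(x) / n                                    # MLE of mu: sample mean
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n   # MLE of sigma^2: 1/n normalization (biased)

print(mu_hat, sigma2_hat)
```

With a large sample, both estimates land close to $(\mu,\sigma^2)=(2, 2.25)$; the $1/n$ (rather than $1/(n-1)$) normalization is what makes $\widehat{\sigma^2}$ biased.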
Example (Laplace distribution, or double exponential): $f(x_i\mid\theta)=\tfrac12\exp(-|x_i-\theta|)$. Then
\[ \log L(\theta\mid x) = -n\log 2 - \sum_{i=1}^n |x_i-\theta|. \]
Taking the derivative (wherever it exists),
\[ \frac{d\log L(\theta\mid x)}{d\theta} = \sum_{i=1}^n \operatorname{sgn}(x_i-\theta), \]
hence $\hat\theta = \operatorname{med}\{x_1,\dots,x_n\}$ for $n = 2p+1$.
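A minimal sketch of this fact (my own check, with illustrative parameter values): a Laplace$(\theta,1)$ draw can be generated as $\theta$ plus a difference of two independent Exp(1) variables, and the sample median recovers $\theta$.

```python
import random

random.seed(1)
theta_true, n = 3.0, 10_001  # odd sample size n = 2p + 1, as on the slide
# Laplace(theta, 1): theta plus a difference of two independent Exp(1) draws
x = [theta_true + random.expovariate(1.0) - random.expovariate(1.0)
     for _ in range(n)]

theta_hat = sorted(x)[n // 2]  # sample median maximizes the Laplace likelihood
print(theta_hat)
```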
Example (Uniform distribution): Consider $X_i \sim \mathcal U(0,\theta)$, i.e.
\[ f(x_i\mid\theta) = \begin{cases} 1/\theta & \text{if } 0 \le x_i \le \theta,\\ 0 & \text{otherwise.}\end{cases} \]
We have
\[ L(\theta\mid x) = \prod_{i=1}^n f(x_i\mid\theta) = \begin{cases} (1/\theta)^n & \text{if } \theta \ge x_{(n)},\\ 0 & \text{if } \theta < x_{(n)}, \end{cases} \]
where $x_{(1)} < x_{(2)} < \dots < x_{(n)}$ are the order statistics. It follows that $\hat\theta = x_{(n)}$.
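A one-line numerical illustration (not from the slides; the value of $\theta$ is an assumption): the MLE is the sample maximum, which always sits just below the true $\theta$.

```python
import random

random.seed(2)
theta_true, n = 5.0, 10_000
x = [random.uniform(0.0, theta_true) for _ in range(n)]

theta_hat = max(x)  # MLE is the sample maximum x_(n); note theta_hat <= theta_true
print(theta_hat)
```

Since $\mathrm E[\theta - x_{(n)}] = \theta/(n+1)$, the downward bias vanishes quickly as $n$ grows.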
Example (Linear regression): Let $\{x_i, y_i\}_{i=1}^n$ be a set of data where $x_i = (x_i^1, x_i^2, \dots, x_i^p)^{\mathsf T}$ is a set of explanatory variables and $y_i \in \mathbb R$ is the response. We assume
\[ y_i = x_i^{\mathsf T}\beta + \epsilon_i, \qquad \epsilon_i \sim \mathcal N(0,\sigma^2), \]
thus
\[ f(y_i\mid x_i,\beta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - x_i^{\mathsf T}\beta)^2}{2\sigma^2}\right). \]
We have, for $\theta=(\beta,\sigma^2)$,
\[ \log L(\theta) = \sum_{i=1}^n \log f(y_i\mid x_i,\beta) = -\frac n2\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^{\mathsf T}\beta)^2 = -\frac n2\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}(y-X\beta)^{\mathsf T}(y-X\beta), \]
where $y=(y_1,\dots,y_n)^{\mathsf T}$ and $X=(x_1,\dots,x_n)^{\mathsf T}$.
Taking the derivatives and setting them to zero,
\[ \frac{\partial\log L(\theta\mid x)}{\partial\beta} = \frac{1}{2\sigma^2}\left(2X^{\mathsf T}y - 2X^{\mathsf T}X\beta\right) = 0, \qquad \frac{\partial\log L(\theta\mid x)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}(y-X\beta)^{\mathsf T}(y-X\beta) = 0. \]
Thus we obtain
\[ \hat\beta = (X^{\mathsf T}X)^{-1}X^{\mathsf T}y, \qquad \widehat{\sigma^2} = \frac1n\left(y-X\hat\beta\right)^{\mathsf T}\left(y-X\hat\beta\right). \]
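A minimal sketch of the closed-form solution (my own example, with assumed data-generating values): for $p=2$ (intercept and one slope) the normal equations are a $2\times 2$ system we can solve by hand.

```python
import random

random.seed(3)
n = 5_000
b0_true, b1_true, sigma = 1.0, -2.0, 0.5  # illustrative assumptions
rows = [(1.0, random.gauss(0.0, 1.0)) for _ in range(n)]
y = [b0_true * r[0] + b1_true * r[1] + random.gauss(0.0, sigma) for r in rows]

# Normal equations: (X^T X) beta = X^T y, written out for p = 2
a11 = sum(r[0] * r[0] for r in rows)
a12 = sum(r[0] * r[1] for r in rows)
a22 = sum(r[1] * r[1] for r in rows)
c1 = sum(r[0] * yi for r, yi in zip(rows, y))
c2 = sum(r[1] * yi for r, yi in zip(rows, y))
det = a11 * a22 - a12 * a12
beta_hat = ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)

# MLE of sigma^2 uses the 1/n normalization (biased, as in the Gaussian case)
resid = [yi - beta_hat[0] * r[0] - beta_hat[1] * r[1] for r, yi in zip(rows, y)]
sigma2_hat = sum(e * e for e in resid) / n
print(beta_hat, sigma2_hat)
```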
Example (Time series): Consider the following autoregression: $X_0 = x_0$ and
\[ X_n = \alpha X_{n-1} + \sigma V_n, \qquad V_n \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1), \]
where $\theta = (\alpha,\sigma^2)$. We have
\[ L(\theta\mid x) = f(x\mid\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\alpha x_{i-1})^2}{2\sigma^2}\right). \]
Thus we have
\[ \log L(\theta\mid x) = \text{cst} - \frac n2 \log\sigma^2 - \sum_{i=1}^n \frac{(x_i-\alpha x_{i-1})^2}{2\sigma^2}. \]
It follows that
\[ \frac{\partial\log L(\theta\mid x)}{\partial\alpha} = \frac{1}{\sigma^2}\sum_{i=1}^n x_{i-1}(x_i - \alpha x_{i-1}), \qquad \frac{\partial\log L(\theta\mid x)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_{i=1}^n (x_i-\alpha x_{i-1})^2}{2\sigma^4}. \]
Thus we have
\[ \hat\alpha = \frac{\sum_{i=1}^n x_{i-1}x_i}{\sum_{i=1}^n x_{i-1}^2}, \qquad \widehat{\sigma^2} = \frac1n\sum_{i=1}^n (x_i - \hat\alpha x_{i-1})^2. \]
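These AR(1) formulas can be checked numerically; the sketch below (my own, with assumed $\alpha$ and $\sigma$) simulates a path and applies the closed-form MLEs.

```python
import random

random.seed(4)
alpha_true, sigma_true, n = 0.8, 0.5, 50_000  # illustrative assumptions
x = [0.0]  # X_0 = x_0 = 0
for _ in range(n):
    x.append(alpha_true * x[-1] + sigma_true * random.gauss(0.0, 1.0))

# Closed-form conditional MLEs from the slide
num = sum(x[i - 1] * x[i] for i in range(1, n + 1))
den = sum(x[i - 1] ** 2 for i in range(1, n + 1))
alpha_hat = num / den
sigma2_hat = sum((x[i] - alpha_hat * x[i - 1]) ** 2 for i in range(1, n + 1)) / n
print(alpha_hat, sigma2_hat)
```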
Invariance. Consider $\eta = g(\theta)$. We introduce the induced likelihood function
\[ L^*(\eta\mid x) = \sup_{\{\theta:\, g(\theta)=\eta\}} L(\theta\mid x). \]
Invariance property: if $\hat\theta$ is the MLE of $\theta$, then for any function $\eta = g(\theta)$, $g(\hat\theta)$ is the MLE of $\eta$.
Proof: the MLE of $\eta$ is defined by
\[ \hat\eta = \arg\sup_\eta \sup_{\{\theta:\, g(\theta)=\eta\}} L(\theta\mid x). \]
Define $g^{-1}(\eta) = \{\theta : g(\theta)=\eta\}$. Then clearly $\hat\theta \in g^{-1}(\hat\eta)$ and cannot be in any other preimage, so $\hat\eta = g(\hat\theta)$.
Consistency. Definition: a sequence of estimators $\hat\theta_n = \hat\theta_n(X_1,\dots,X_n)$ is consistent for the parameter $\theta$ if, for every $\varepsilon>0$ and every $\theta\in\Theta$,
\[ \lim_{n\to\infty} P_\theta\left(|\hat\theta_n - \theta| < \varepsilon\right) = 1 \quad\text{(equivalently } \lim_{n\to\infty} P_\theta\left(|\hat\theta_n-\theta| \ge \varepsilon\right) = 0\text{)}. \]
Example: consider $X_i \overset{\text{i.i.d.}}{\sim} \mathcal N(\theta,1)$ and $\hat\theta_n = \frac1n\sum X_i$; then $\hat\theta_n \sim \mathcal N(\theta, 1/n)$ and
\[ P_\theta\left(|\hat\theta_n - \theta| < \varepsilon\right) = \int_{-\varepsilon\sqrt n}^{\varepsilon\sqrt n} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right) du \to 1. \]
It is possible to avoid this calculation and use instead Chebyshev's inequality:
\[ P_\theta\left(|\hat\theta_n-\theta| \ge \varepsilon\right) = P_\theta\left((\hat\theta_n-\theta)^2 \ge \varepsilon^2\right) \le \frac{\mathrm E_\theta\left[(\hat\theta_n-\theta)^2\right]}{\varepsilon^2}, \]
where $\mathrm E_\theta[(\hat\theta_n-\theta)^2] = \operatorname{var}_\theta(\hat\theta_n) + \left(\mathrm E_\theta[\hat\theta_n]-\theta\right)^2$.
Example of an inconsistent MLE (Fisher):
\[ (X_i, Y_i) \sim \mathcal N\left(\begin{pmatrix}\mu_i\\ \mu_i\end{pmatrix}, \begin{pmatrix}\sigma^2 & 0\\ 0 & \sigma^2\end{pmatrix}\right). \]
The likelihood function is given by
\[ L(\theta) = \prod_{i=1}^n \frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2}\left[(x_i-\mu_i)^2 + (y_i-\mu_i)^2\right]\right), \]
so
\[ l(\theta) = \text{cste} - n\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n\left[2\left(\frac{x_i+y_i}{2}-\mu_i\right)^2 + \frac{(x_i-y_i)^2}{2}\right]. \]
We obtain
\[ \hat\mu_i = \frac{x_i+y_i}{2}, \qquad \widehat{\sigma^2} = \frac{1}{4n}\sum_{i=1}^n (x_i-y_i)^2 \to \frac{\sigma^2}{2}, \]
so the MLE of $\sigma^2$ is not consistent: the number of nuisance parameters $\mu_i$ grows with $n$.
Consistency of the MLE. Kullback-Leibler distance: for any densities $f, g$,
\[ D(f,g) = \int f(x)\log\frac{f(x)}{g(x)}\, dx. \]
We have $D(f,g)\ge 0$ and $D(f,f)=0$. Indeed, using $\log t \le t-1$,
\[ -D(f,g) = \int f(x)\log\frac{g(x)}{f(x)}\, dx \le \int f(x)\left(\frac{g(x)}{f(x)} - 1\right) dx = 0. \]
$D(f,g)$ is a very useful distance and appears in many different contexts.
Alternative measures of similarity. Hellinger distance:
\[ D(f,g) = \int \left(\sqrt{f(x)} - \sqrt{g(x)}\right)^2 dx. \]
Generalized information:
\[ D(f,g) = \frac1\lambda \int \left(\left(\frac{f(x)}{g(x)}\right)^{\lambda} - 1\right) f(x)\, dx. \]
$L_1$-norm / total variation:
\[ D(f,g) = \int |f(x)-g(x)|\, dx. \]
$L_2$-norm:
\[ D(f,g) = \int (f(x)-g(x))^2\, dx. \]
Example: suppose we have $f(x) = \mathcal N(x;\xi,\tau^2)$ and $g(x)=\mathcal N(x;\mu,\sigma^2)$. We have
\[ \mathrm E_f\left[(X-\mu)^2\right] = \mathrm E_f\left[(X-\xi)^2 + 2(X-\xi)(\xi-\mu) + (\xi-\mu)^2\right] = \tau^2 + (\xi-\mu)^2. \]
So it follows that
\[ \mathrm E_f[\log g(X)] = \mathrm E_f\left[-\frac12\log 2\pi\sigma^2 - \frac{(X-\mu)^2}{2\sigma^2}\right] = -\frac12\log 2\pi\sigma^2 - \frac{\tau^2+(\xi-\mu)^2}{2\sigma^2}, \]
and
\[ \mathrm E_f[\log f(X)] = -\frac12\log 2\pi\tau^2 - \frac12. \]
It follows that
\[ D(f,g) = \int f(x)\log\frac{f(x)}{g(x)}\, dx = \mathrm E_f[\log f(X)] - \mathrm E_f[\log g(X)] = \frac12\left(\log\frac{\sigma^2}{\tau^2} + \frac{\tau^2 + (\xi-\mu)^2}{\sigma^2} - 1\right). \]
It can easily be checked that $D(f,f)=0$ (it is less easy to show $D(f,g)\ge 0$ directly from this expression).
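The closed-form Gaussian KL above can be verified by Monte Carlo, averaging $\log f(X) - \log g(X)$ over draws $X\sim f$; the parameter values below are illustrative assumptions.

```python
import math
import random

random.seed(5)
xi, tau = 1.0, 2.0    # f = N(xi, tau^2), assumed values
mu, sig = 0.0, 1.5    # g = N(mu, sig^2), assumed values

# Closed form from the slide
kl_closed = 0.5 * (math.log(sig**2 / tau**2)
                   + (tau**2 + (xi - mu)**2) / sig**2 - 1.0)

def logpdf(x, m, s):
    return -0.5 * math.log(2 * math.pi * s * s) - (x - m)**2 / (2 * s * s)

n = 200_000
kl_mc = sum(logpdf(x, xi, tau) - logpdf(x, mu, sig)
            for x in (random.gauss(xi, tau) for _ in range(n))) / n
print(kl_closed, kl_mc)
```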
Example: assume we have $f(x)=\frac12\exp(-|x|)$ and $g(x)=\mathcal N(x;\mu,\sigma^2)$. We obtain
\[ \mathrm E_f[\log f(X)] = -\log 2 - \mathrm E_f[|X|] = -\log 2 - \int_0^\infty x\exp(-x)\, dx = -\log 2 - 1, \]
\[ \mathrm E_f[\log g(X)] = -\frac12\log 2\pi\sigma^2 - \frac{1}{4\sigma^2}\left(4 + 2\mu^2\right) = -\frac12\log 2\pi\sigma^2 - \frac{2+\mu^2}{2\sigma^2}. \]
It follows that
\[ D(f,g) = \frac12\log 2\pi\sigma^2 + \frac{2+\mu^2}{2\sigma^2} - \log 2 - 1. \]
Assume the pdfs $f(x\mid\theta)$ have common support for all $\theta$, i.e. $S_\theta = \{x: f(x\mid\theta)>0\}$ is independent of $\theta$, and that $f(x\mid\theta)\neq f(x\mid\theta')$ for $\theta\neq\theta'$. Denote
\[ M_n(\theta) = \frac1n\sum_{i=1}^n \log\frac{f(X_i\mid\theta)}{f(X_i\mid\theta^*)}. \]
As the MLE $\hat\theta$ maximises $L(\theta\mid x)$, it also maximizes $M_n(\theta)$. Assume $X_i \overset{\text{i.i.d.}}{\sim} f(x\mid\theta^*)$. Note that by the law of large numbers $M_n(\theta)$ converges to
\[ \mathrm E_{\theta^*}\left[\log\frac{f(X\mid\theta)}{f(X\mid\theta^*)}\right] = \int f(x\mid\theta^*)\log\frac{f(x\mid\theta)}{f(x\mid\theta^*)}\, dx = -D\left(f(\cdot\mid\theta^*), f(\cdot\mid\theta)\right) := M(\theta). \]
Hence $M(\theta) = -D(f(\cdot\mid\theta^*), f(\cdot\mid\theta))$, which is maximized at $\theta^*$, so we expect that the maximizer of $M_n$ will converge towards $\theta^*$.
Example: assume $f(x)=g(x)=\mathcal N(x;0,1)$. We approximate
\[ \mathrm E_f[\log g(X)] = -\frac12\log(2\pi) - \frac12 = -1.4189 \]
through
\[ \widehat{\mathrm E}_f[\log g(X)] = -\frac12\log(2\pi) - \frac{1}{2n}\sum_{i=1}^n X_i^2. \]
Numerical results (mean, variance and standard deviation over 1,000 Monte Carlo trials):

n                    10        100       1,000     10,000    exact
Mean                 -1.4188   -1.4185   -1.4191   -1.4189   -1.4189
Variance             0.05079   0.00497   0.00050   0.00005   -
Standard deviation   0.22537   0.07056   0.02232   0.00696   -
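The experiment on this slide is easy to reproduce; a single-trial sketch (my own) is below, showing the estimate tightening around the exact value $-1.4189$ as $n$ grows.

```python
import math
import random

random.seed(6)
exact = -0.5 * math.log(2 * math.pi) - 0.5  # = -1.4189...

for n in (10, 100, 1_000, 10_000):
    # One Monte Carlo estimate of E_f[log g(X)] with f = g = N(0, 1)
    est = (-0.5 * math.log(2 * math.pi)
           - 0.5 * sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n)
    print(n, est)
```

Repeating this 1,000 times and recording mean/variance per $n$ reproduces the table above.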
Theorem. Suppose that
\[ \sup_{\theta\in\Theta} |M_n(\theta) - M(\theta)| \overset{P}{\to} 0 \]
and that, for every $\varepsilon > 0$,
\[ \sup_{\theta:\, |\theta-\theta^*| \ge \varepsilon} M(\theta) < M(\theta^*); \]
then $\hat\theta_n \overset{P}{\to} \theta^*$.
Proof. Since $\hat\theta_n$ maximizes $M_n(\theta)$, we have $M_n(\hat\theta_n) \ge M_n(\theta^*)$. Thus
\[ M(\theta^*) - M(\hat\theta_n) = M(\theta^*) - M_n(\theta^*) + M_n(\theta^*) - M(\hat\theta_n) \le M(\theta^*) - M_n(\theta^*) + M_n(\hat\theta_n) - M(\hat\theta_n) \le 2\sup_\theta |M_n(\theta)-M(\theta)| \overset{P}{\to} 0. \]
Thus, for any $\delta>0$, we have $\Pr\left(M(\hat\theta_n) < M(\theta^*) - \delta\right) \to 0$. Now for any $\varepsilon>0$ there exists $\delta>0$ such that $|\theta-\theta^*| \ge \varepsilon$ implies $M(\theta) < M(\theta^*) - \delta$. Hence
\[ \Pr\left(|\hat\theta_n - \theta^*| > \varepsilon\right) \le \Pr\left(M(\hat\theta_n) < M(\theta^*) - \delta\right) \to 0. \]
Asymptotic Normality. Assuming $\hat\theta \overset{P}{\to} \theta^*$, what can we say about $\sqrt n\,(\hat\theta - \theta^*)$?
Lemma: let $s(x\mid\theta) := \frac{\partial \log f(x\mid\theta)}{\partial\theta}$ be the score function; then for any $\theta$, $\mathrm E_\theta[s(X\mid\theta)] = 0$.
Proof:
\[ \int \frac{\partial\log f(x\mid\theta)}{\partial\theta} f(x\mid\theta)\, dx = \int \frac{\partial f(x\mid\theta)/\partial\theta}{f(x\mid\theta)}\, f(x\mid\theta)\, dx = \int \frac{\partial f(x\mid\theta)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta}\underbrace{\int f(x\mid\theta)\, dx}_{=1} = 0. \]
Lemma. We also have
\[ \operatorname{var}_\theta[s(X\mid\theta)] = \mathrm E_\theta\left[s(X\mid\theta)^2\right] = -\mathrm E_\theta\left[\frac{\partial^2\log f(X\mid\theta)}{\partial\theta^2}\right] := I(\theta). \]
Proof: this follows from
\[ \int \frac{\partial\log f(x\mid\theta)}{\partial\theta}\, f(x\mid\theta)\, dx = 0; \]
taking the derivative once more with respect to $\theta$,
\[ 0 = \frac{\partial}{\partial\theta}\int \frac{\partial\log f(x\mid\theta)}{\partial\theta}\, f(x\mid\theta)\, dx = \int \frac{\partial^2\log f(x\mid\theta)}{\partial\theta^2}\, f(x\mid\theta)\, dx + \int \left(\frac{\partial\log f(x\mid\theta)}{\partial\theta}\right)^2 f(x\mid\theta)\, dx. \]
Heuristic derivation. We have, for $l(\theta) := \log L(\theta\mid x)$,
\[ 0 = l'(\hat\theta) \approx l'(\theta^*) + (\hat\theta - \theta^*)\, l''(\theta^*) \;\Rightarrow\; \hat\theta - \theta^* = -\frac{l'(\theta^*)}{l''(\theta^*)}. \]
That is,
\[ \sqrt n\,(\hat\theta - \theta^*) = \frac{\frac{1}{\sqrt n}\, l'(\theta^*)}{-\frac1n\, l''(\theta^*)}. \]
Now remember that $l'(\theta^*) = \sum_{i=1}^n s(X_i\mid\theta^*)$, where $\mathrm E_{\theta^*}[s(X_i\mid\theta^*)] = 0$ and $\operatorname{var}_{\theta^*}[s(X_i\mid\theta^*)] = I(\theta^*)$, so the CLT tells us that
\[ \frac{1}{\sqrt n}\, l'(\theta^*) \overset{D}{\to} \mathcal N(0, I(\theta^*)). \]
Now the law of large numbers yields $-\frac1n\, l''(\theta^*) \overset{P}{\to} I(\theta^*)$, so by Slutsky's theorem
\[ \sqrt n\,(\hat\theta - \theta^*) \overset{D}{\to} \mathcal N\left(0, \frac{1}{I(\theta^*)}\right), \qquad \sqrt{n\, I(\theta^*)}\,(\hat\theta - \theta^*) \overset{D}{\to} \mathcal N(0,1). \]
Note that you have already seen this expression when establishing the Cramer-Rao bound. It is important to remember that, depending on $\theta^*$, the parameter can be more or less easy to estimate.
Similarly, we can prove that
\[ \sqrt{n\, I(\hat\theta)}\,(\hat\theta - \theta^*) \overset{D}{\to} \mathcal N(0,1). \]
We can also prove that
\[ \frac{\sqrt{n\, I(\hat\theta)}}{g'(\hat\theta)}\left(g(\hat\theta) - g(\theta^*)\right) \overset{D}{\to} \mathcal N(0,1). \]
This allows us to derive confidence intervals.
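As a concrete illustration of the confidence-interval construction (my own example, not from the slides): for i.i.d. Bernoulli$(p)$ data, $I(p) = 1/(p(1-p))$ per observation, so a Wald 95% interval is $\hat p \pm 1.96/\sqrt{n\,I(\hat p)}$.

```python
import math
import random

random.seed(7)
p_true, n = 0.3, 10_000  # assumed true parameter and sample size
x = [1 if random.random() < p_true else 0 for _ in range(n)]

p_hat = sum(x) / n
# 1 / sqrt(n I(p_hat)) with I(p) = 1 / (p (1 - p)) per observation
se = math.sqrt(p_hat * (1.0 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(p_hat, (lo, hi))
```

The interval covers the true $p$ in roughly 95% of repetitions, by the asymptotic normality result above.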
Making the proof more rigorous. We have
\[ l'(\hat\theta) = l'(\theta^*) + (\hat\theta-\theta^*)\, l''(\theta^*) + \frac12(\hat\theta-\theta^*)^2\, l'''(\theta_n^*), \]
where $\theta_n^*$ lies between $\hat\theta$ and $\theta^*$, so that
\[ \sqrt n\,(\hat\theta - \theta^*) = \frac{\frac{1}{\sqrt n}\, l'(\theta^*)}{-\frac1n\, l''(\theta^*) - \frac{1}{2n}(\hat\theta-\theta^*)\, l'''(\theta_n^*)}. \]
To prove the result, we need to check that $\frac{1}{2n}(\hat\theta-\theta^*)\, l'''(\theta_n^*) \overset{P}{\to} 0$. As $\hat\theta \overset{P}{\to} \theta^*$, we just need to prove that $\frac{1}{2n}\, l'''(\theta_n^*)$ is bounded (in probability). So we need an additional condition of the form
\[ \left|\frac{\partial^3 \log f(x\mid\theta)}{\partial\theta^3}\right| \le C(x) \]
for any $\theta$, with $\mathrm E_\theta[C(X)] < \infty$.
Multiparameter Case. The extension to the multiparameter case $\theta=(\theta_1,\dots,\theta_d)$ is straightforward:
\[ \sqrt n\,(\hat\theta - \theta^*) \overset{D}{\to} \mathcal N(0, J(\theta^*)), \qquad J(\theta^*) = I(\theta^*)^{-1}, \]
where
\[ [I(\theta^*)]_{k,l} = -\mathrm E_{\theta^*}\left[\frac{\partial^2\log f(x\mid\theta)}{\partial\theta_k\,\partial\theta_l}\right]. \]
We define $\nabla g := \left(\frac{\partial g}{\partial\theta_1},\dots,\frac{\partial g}{\partial\theta_d}\right)^{\mathsf T}$; then if $\nabla g(\theta^*) \neq 0$,
\[ \sqrt n\left(g(\hat\theta) - g(\theta^*)\right) \overset{D}{\to} \mathcal N\left(0, \nabla g(\theta^*)^{\mathsf T}\, J(\theta^*)\, \nabla g(\theta^*)\right). \]
Example: if $X_i \overset{\text{i.i.d.}}{\sim} \mathcal N(\mu,\sigma^2)$ with $\theta=(\mu,\sigma)$, then
\[ \log f(x\mid\theta) = \text{cst} - \log\sigma - \frac{(x-\mu)^2}{2\sigma^2}, \qquad s(x\mid\theta) = \begin{pmatrix} \dfrac{x-\mu}{\sigma^2} \\[4pt] -\dfrac1\sigma + \dfrac{(x-\mu)^2}{\sigma^3} \end{pmatrix}, \]
\[ I(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} & \mathrm E_\theta\left[\dfrac{2(x-\mu)}{\sigma^3}\right] \\[6pt] \mathrm E_\theta\left[\dfrac{2(x-\mu)}{\sigma^3}\right] & \mathrm E_\theta\left[-\dfrac{1}{\sigma^2} + \dfrac{3(x-\mu)^2}{\sigma^4}\right] \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\[4pt] 0 & \dfrac{2}{\sigma^2} \end{pmatrix}. \]
The MLE of $\mu$ is given by $\hat\mu = \frac1n\sum X_i$, so $\operatorname{var}[\hat\mu] = \frac{\sigma^2}{n}$, and the MLE is indeed efficient (it reaches the Cramer-Rao lower bound).
Assume we observe a vector $X=(X^1,\dots,X^k)$ where $X^j \in \{0,1\}$ and $\sum_{j=1}^k X^j = 1$, with
\[ f(x\mid p_1,\dots,p_{k-1}) = \left(\prod_{j=1}^{k-1} p_j^{x^j}\right) p_k^{x^k}, \]
where $p_j > 0$ and $p_k := 1 - \sum_{j=1}^{k-1} p_j < 1$. We have $\theta = (p_1,\dots,p_{k-1})$ and
\[ \frac{\partial\log f(x\mid\theta)}{\partial p_j} = \frac{x^j}{p_j} - \frac{x^k}{p_k}, \qquad \frac{\partial^2\log f(x\mid\theta)}{\partial p_j^2} = -\frac{x^j}{p_j^2} - \frac{x^k}{p_k^2}, \qquad \frac{\partial^2\log f(x\mid\theta)}{\partial p_j\,\partial p_l} = -\frac{x^k}{p_k^2}, \quad j \neq l < k. \]
Recall that $X^j$ has a Bernoulli distribution with mean $p_j$, so
\[ I(\theta) = \begin{pmatrix} p_1^{-1} + p_k^{-1} & p_k^{-1} & \cdots & p_k^{-1} \\ p_k^{-1} & p_2^{-1} + p_k^{-1} & & \vdots \\ \vdots & & \ddots & \\ p_k^{-1} & \cdots & & p_{k-1}^{-1} + p_k^{-1} \end{pmatrix}. \]
One can check that
\[ I(\theta)^{-1} = \begin{pmatrix} p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_{k-1} \\ -p_1 p_2 & p_2(1-p_2) & & -p_2 p_{k-1} \\ \vdots & & \ddots & \\ -p_1 p_{k-1} & -p_2 p_{k-1} & & p_{k-1}(1-p_{k-1}) \end{pmatrix}. \]
Now assume we observe $X_1, X_2, \dots, X_n$; then
\[ \log L(\theta\mid x) = \sum_{j=1}^k t_j \log p_j = \sum_{j=1}^{k-1} t_j \log p_j + t_k \log\left(1 - \sum_{j=1}^{k-1} p_j\right), \]
where $t_j = \sum_{i=1}^n x_i^j$. So we have
\[ \frac{\partial\log L(\theta\mid x)}{\partial p_j} = \frac{t_j}{p_j} - \frac{t_k}{p_k} = 0 \text{ for } j=1,\dots,k-1 \;\Rightarrow\; \hat p_j = \frac{t_j}{n}. \]
Clearly $t_j$ is Binomial$(n, p_j)$, so $\hat p_j$ has variance $p_j(1-p_j)/n$ and is efficient.
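A quick check of $\hat p_j = t_j/n$ (my own sketch; the cell probabilities are illustrative assumptions):

```python
import random

random.seed(8)
p_true = [0.5, 0.3, 0.2]  # assumed multinomial cell probabilities
n = 100_000

# Draw n categorical observations and tabulate the counts t_j
draws = random.choices(range(3), weights=p_true, k=n)
counts = [draws.count(j) for j in range(3)]

p_hat = [t / n for t in counts]  # MLE: relative frequencies
print(p_hat)
```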
Nuisance Parameters. Assume $\theta=(\theta_1,\dots,\theta_d)$ is the parameter vector but only the scalar $\theta_1$ is of interest, whereas $(\theta_2,\dots,\theta_d)$ are nuisance parameters. We want to assess how the asymptotic precision with which we estimate $\theta_1$ is influenced by the presence of nuisance parameters; i.e. if $\hat\theta$ is an efficient estimate of $\theta$, how does $\hat\theta_1$ as an estimator of $\theta_1$ compare to an efficient estimator of $\theta_1$, say $\tilde\theta_1$, which would assume that all the nuisance parameters are known? Intuitively, we should have $\operatorname{var}[\tilde\theta_1] \le \operatorname{var}[\hat\theta_1]$; i.e. ignorance cannot bring you any advantage.
The asymptotic variance of $\sqrt n\,(\hat\theta - \theta^*)$ is $I^{-1}(\theta^*)$, whose $(i,j)$ entry is denoted $\alpha_{i,j}$. The asymptotic variance of $\sqrt n\,(\tilde\theta_1 - \theta^*_1)$ is $I^{-1}(\theta^*_1) = 1/\gamma_{1,1}$, where $\gamma_{1,1} = [I(\theta^*)]_{1,1}$.
Theorem: we have $\alpha_{1,1} \ge 1/\gamma_{1,1}$, with equality if and only if $\alpha_{1,2} = \dots = \alpha_{1,d} = 0$.
Partition $I(\theta)$ as follows:
\[ I(\theta) = \begin{pmatrix} \gamma_{1,1} & \rho^{\mathsf T} \\ \rho & \Sigma \end{pmatrix}. \]
Now we use the block-inverse formula
\[ I^{-1}(\theta) = \frac1\tau \begin{pmatrix} 1 & -\rho^{\mathsf T}\Sigma^{-1} \\ -\Sigma^{-1}\rho & \Sigma^{-1}\rho\rho^{\mathsf T}\Sigma^{-1} + \tau\Sigma^{-1} \end{pmatrix}, \qquad \tau = \gamma_{1,1} - \rho^{\mathsf T}\Sigma^{-1}\rho. \]
As $I(\theta)$ is positive definite, $\Sigma^{-1}$ is positive definite, and
\[ \alpha_{1,1} = \frac1\tau \ge \frac{1}{\gamma_{1,1}}, \]
with equality iff $\rho = 0$. To show that $\tau > 0$, we use the fact that $I(\theta)$ is p.d. and that $\tau = v^{\mathsf T} I(\theta)\, v$, where $v = \left(1, -\rho^{\mathsf T}\Sigma^{-1}\right)^{\mathsf T}$.
Beyond Maximum Likelihood: Method of Moments. MLE estimates can be difficult to compute; the method of moments is a simple alternative. The obtained estimators are typically not optimal, but can be used as starting values for more sophisticated methods. For $1\le j\le d$, define the $j$-th moment of $f(x\mid\theta)$, where $\theta = (\theta_1,\dots,\theta_d)$:
\[ \alpha_j(\theta) = \mathrm E_\theta\left[X^j\right] = \int x^j f(x\mid\theta)\, dx, \]
and, given $X = (X_1,\dots,X_n)$, the $j$-th sample moment
\[ \hat\alpha_j = \frac1n\sum_{i=1}^n X_i^j. \]
The idea of the method of moments is to match the theoretical moments to the sample moments; that is, we define $\hat\theta$ as the value of $\theta$ such that $\alpha_j(\hat\theta) = \hat\alpha_j$ for $j = 1,\dots,d$.
Example: let $X_i \overset{\text{i.i.d.}}{\sim} \mathcal N(\mu,\sigma^2)$ with $\theta = (\mu,\sigma^2)$; then
\[ \alpha_1(\theta) = \mu, \quad \alpha_2(\theta) = \sigma^2 + \mu^2, \qquad \hat\alpha_1 = \frac1n\sum_{i=1}^n X_i, \quad \hat\alpha_2 = \frac1n\sum_{i=1}^n X_i^2. \]
Thus we obtain
\[ \hat\mu = \hat\alpha_1 \quad\text{and}\quad \widehat{\sigma^2} = \hat\alpha_2 - \hat\alpha_1^2. \]
Note that $\widehat{\sigma^2}$ is not unbiased.
Assume $X_i \overset{\text{i.i.d.}}{\sim} \mathcal U(\theta_1,\theta_2)$ where $-\infty < \theta_1 < \theta_2 < +\infty$; then
\[ \alpha_1(\theta) = \frac{\theta_1+\theta_2}{2}, \qquad \alpha_2(\theta) = \frac{\theta_1^2 + \theta_2^2 + \theta_1\theta_2}{3}. \]
Now we solve: $\theta_1 = 2\hat\alpha_1 - \theta_2$, and
\[ 3\hat\alpha_2 = (2\hat\alpha_1 - \theta_2)^2 + \theta_2^2 + (2\hat\alpha_1-\theta_2)\,\theta_2 \;\Leftrightarrow\; (\theta_2 - \hat\alpha_1)^2 = 3\left(\hat\alpha_2 - \hat\alpha_1^2\right). \]
Since $\theta_2 > \mathrm E(X)$, we obtain
\[ \hat\theta_2 = \hat\alpha_1 + \sqrt{3\left(\hat\alpha_2 - \hat\alpha_1^2\right)}, \qquad \hat\theta_1 = \hat\alpha_1 - \sqrt{3\left(\hat\alpha_2 - \hat\alpha_1^2\right)}. \]
Note that $(\hat\theta_1, \hat\theta_2)$ is NOT a function of the sufficient statistics $(X_{(1)}, X_{(n)})$.
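These moment equations are easy to run numerically; a minimal sketch (my own, with assumed endpoints):

```python
import math
import random

random.seed(9)
t1_true, t2_true, n = -1.0, 4.0, 100_000  # assumed uniform endpoints
x = [random.uniform(t1_true, t2_true) for _ in range(n)]

a1 = sum(x) / n                     # first sample moment
a2 = sum(xi * xi for xi in x) / n   # second sample moment
s = math.sqrt(3.0 * (a2 - a1 * a1))
t1_hat, t2_hat = a1 - s, a1 + s     # method-of-moments estimates
print(t1_hat, t2_hat)
```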
Assume $X_i \overset{\text{i.i.d.}}{\sim} \text{Bin}(k, p)$ with parameters $k\in\mathbb N$ and $p\in(0,1)$. We have
\[ \alpha_1(\theta) = kp, \qquad \alpha_2(\theta) = kp(1-p) + k^2p^2. \]
Thus we obtain
\[ \hat p = \frac{\hat\alpha_1 + \hat\alpha_1^2 - \hat\alpha_2}{\hat\alpha_1}, \qquad \hat k = \frac{\hat\alpha_1^2}{\hat\alpha_1 + \hat\alpha_1^2 - \hat\alpha_2}. \]
The estimator satisfies $\hat p \in (0,1)$, but $\hat k$ is generally not an integer.
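A numerical sanity check of these formulas (my own sketch; $k$, $p$ and $n$ are assumptions). Note that $\hat k$ is a ratio of small quantities and is quite variable, so a large sample is used.

```python
import random

random.seed(10)
k_true, p_true, n = 12, 0.4, 200_000  # assumed values
x = [sum(1 for _ in range(k_true) if random.random() < p_true)
     for _ in range(n)]

a1 = sum(x) / n
a2 = sum(xi * xi for xi in x) / n
denom = a1 + a1 * a1 - a2   # equals k p^2 at the true moments
p_hat = denom / a1
k_hat = a1 * a1 / denom
print(p_hat, k_hat)
```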
Statistical Properties of the Estimate. Let $\hat\alpha_n = (\hat\alpha_1,\dots,\hat\alpha_d)$; the moment equations read $\hat\alpha_n = h(\hat\theta)$ and, if the inverse function $g = h^{-1}$ exists, then $\hat\theta = g(\hat\alpha_n)$. If $g$ is continuous at $\alpha = (\alpha_1,\dots,\alpha_d)$, then $\hat\theta$ is a consistent estimate of $\theta$, as $\hat\alpha_j \to \alpha_j$. Moreover, if $g$ is differentiable at $\alpha$ and $\mathrm E_\theta[X^{2d}] < \infty$, then
\[ \sqrt n\,(\hat\theta - \theta) \overset{D}{\to} \mathcal N\left(0, \nabla g(\alpha)^{\mathsf T}\, V_\alpha\, \nabla g(\alpha)\right), \]
where $V_\alpha[i,j] = \alpha_{i+j} - \alpha_i\alpha_j$.
The result follows from $\sqrt n\,(\hat\alpha_n - \alpha) \overset{D}{\to} \mathcal N(0, V_\alpha)$, as
\[ \mathrm E_\theta[\hat\alpha_{j,n}] = \alpha_j, \qquad n\operatorname{cov}[\hat\alpha_{i,n}, \hat\alpha_{j,n}] = \alpha_{i+j} - \alpha_i\alpha_j. \]
We have $\hat\theta - \theta = g(\hat\alpha_n) - g(\alpha)$, so using the delta method,
\[ \sqrt n\,(\hat\theta - \theta) \overset{D}{\to} \mathcal N\left(0, \nabla g(\alpha)^{\mathsf T}\, V_\alpha\, \nabla g(\alpha)\right). \]
We can also establish the first-order asymptotic bias of the estimate:
\[ g(\hat\alpha_n) = g(\alpha) + \nabla g(\alpha)^{\mathsf T}(\hat\alpha_n - \alpha) + \frac12(\hat\alpha_n-\alpha)^{\mathsf T}\nabla^2 g(\alpha)(\hat\alpha_n-\alpha) + o\left(\frac1n\right), \]
where $\sqrt n\,(\hat\alpha_n - \alpha) \overset{D}{\to} Z_\Sigma$ with $Z_\Sigma \sim \mathcal N(0,\Sigma)$, so
\[ n\,(\hat\alpha_n-\alpha)^{\mathsf T}\nabla^2 g(\alpha)(\hat\alpha_n-\alpha) \overset{D}{\to} Z_\Sigma^{\mathsf T}\nabla^2 g(\alpha)\, Z_\Sigma, \]
recalling that $X_n \overset{D}{\to} X$ implies $\varphi(X_n) \overset{D}{\to} \varphi(X)$ for continuous $\varphi$. Thus we have
\[ \mathrm E[g(\hat\alpha_n)] = g(\alpha) + \frac{1}{2n}\,\mathrm E\left[Z_\Sigma^{\mathsf T}\nabla^2 g(\alpha)\, Z_\Sigma\right] + o\left(\frac1n\right) = g(\alpha) + \frac{\operatorname{tr}\left(\nabla^2 g(\alpha)\,\Sigma\right)}{2n} + o\left(\frac1n\right). \]
Beyond Maximum Likelihood: Pseudo-Likelihood. Assume $X = (X^1,\dots,X^q) \sim f(x\mid\theta)$. Given observations $X^{(i)} \sim f(x\mid\theta)$, the MLE requires maximizing $L(\theta\mid x)$. However, in some problems it might be difficult to specify $f(x\mid\theta)$, and we may only be able to specify, say, the marginals $f(x^s\mid\theta)$ and the pairs $f(x^s, x^t\mid\theta)$ for $1\le s < t \le q$. Based on this information and $n$ observations, we can define the pseudo-log-likelihood functions
\[ l_1(\theta\mid x) = \sum_{i=1}^n l_1\left(\theta\mid x^{(i)}\right), \qquad l_1\left(\theta\mid x^{(i)}\right) = \sum_{s=1}^q \log f\left(x^{(i),s}\mid\theta\right), \]
\[ l_2(\theta\mid x) = \sum_{i=1}^n l_2\left(\theta\mid x^{(i)}\right), \qquad l_2\left(\theta\mid x^{(i)}\right) = \sum_{s=1}^q \sum_{t=s+1}^q \log f\left(x^{(i),s}, x^{(i),t}\mid\theta\right) + \alpha\, l_1\left(\theta\mid x^{(i)}\right). \]
These pseudo-likelihood functions are simpler than the full likelihood.
Under regularity conditions very similar to the ones for the MLE, solving $l_k'(\theta\mid x) = 0$ for $k=1,2$ provides unbiased estimating equations and hence consistent estimates. To derive the asymptotic variance, we use
\[ 0 = l_k'(\hat\theta) \approx l_k'(\theta^*) + (\hat\theta - \theta^*)\, l_k''(\theta^*) \;\Rightarrow\; \sqrt n\,(\hat\theta - \theta^*) = \frac{\frac{1}{\sqrt n}\, l_k'(\theta^*)}{-\frac1n\, l_k''(\theta^*)}, \]
where $\frac1n l_k''(\theta^*) \overset{P}{\to} \mathrm E_{\theta^*}[l_k'']$ and $\frac{1}{\sqrt n} l_k'(\theta^*) \overset{D}{\to} \mathcal N\left(0, \mathrm E_{\theta^*}[l_k'^2]\right)$, thus
\[ \sqrt n\,(\hat\theta - \theta^*) \overset{D}{\to} \mathcal N\left(0, \mathrm E_{\theta^*}[l_k'']^{-2}\,\mathrm E_{\theta^*}[l_k'^2]\right). \]
We have the estimates
\[ \widehat{\mathrm E}_{\theta^*}[l_k''] = \frac1n\sum_{i=1}^n l_k''\left(\theta\mid x^{(i)}\right), \qquad \widehat{\mathrm E}_{\theta^*}[l_k'^2] = \frac1n\sum_{i=1}^n l_k'^2\left(\theta\mid x^{(i)}\right). \]
Example. Assume that $X = (X^1,\dots,X^q) \sim \mathcal N(0,\Sigma)$, where $[\Sigma]_{i,j} = 1$ if $i=j$ and $\rho$ otherwise. We are interested in estimating $\theta = \rho$. There is no information about $\rho$ in $l_1(\theta\mid x)$, so we use $l_2(\theta\mid x)$ with $\alpha = 0$. For observations $X^{(1)},\dots,X^{(n)}$, we have (writing $\bar X^{(i)} = q^{-1}\sum_{t=1}^q X^{(i),t}$)
\[ l_2(\theta\mid x) = -\frac{nq(q-1)}{4}\log\left(1-\rho^2\right) - \frac{(q-1)(1-\rho)}{2(1-\rho^2)}\, SS_W - \frac{q-1+\rho}{2(1-\rho^2)}\,\frac{SS_B}{q}, \]
where
\[ SS_W = \sum_{i=1}^n\sum_{s=1}^q \left(X^{(i),s} - \bar X^{(i)}\right)^2, \qquad SS_B = \sum_{i=1}^n \left(\sum_{t=1}^q X^{(i),t}\right)^2. \]
After simple but tedious calculations, we obtain for the asymptotic variance of the pairwise estimator
\[ \frac{2\left(1-\rho^2\right)^2 c(q,\rho)}{n\, q(q-1)\left(1+\rho^2\right)^2}, \qquad c(q,\rho) = \left(1-\rho^2\right)\left(1+3\rho^2\right) + q\rho\left(-3\rho^3 + 8\rho^2 - 3\rho + 2\right) + q^2\rho^2(1-\rho)^2, \]
whereas for the MLE we have
\[ \frac{2\left(1+(q-1)\rho\right)^2(1-\rho)^2}{n\, q(q-1)\left(1+(q-1)\rho^2\right)}. \]
The ratio is 1 for $q=2$, as expected, and also 1 if $\rho = 0$ or $1$. For any other values there is a loss of efficiency for $l_2(\theta\mid x)$, which increases as $q\to\infty$.
Consider the following time series: $X_0 \sim \mathcal N\left(0, \frac{\sigma^2}{1-\alpha^2}\right)$ and
\[ X_n = \alpha X_{n-1} + \sigma V_n, \qquad V_n \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1), \]
where $\theta = \sigma^2$. By stationarity, we have for any $i = 0,1,\dots,n$
\[ f(x_i\mid\theta) = \frac{1}{\sqrt{2\pi\sigma^2/(1-\alpha^2)}}\exp\left(-\frac{(1-\alpha^2)\, x_i^2}{2\sigma^2}\right), \]
and we consider
\[ 2\, l_1(\theta\mid x) = 2\sum_{i=0}^n \log f(x_i\mid\theta) = \text{cste} - (n+1)\log\frac{\sigma^2}{1-\alpha^2} - \frac{1-\alpha^2}{\sigma^2}\sum_{i=0}^n x_i^2. \]
This pseudo-likelihood can easily be maximized:
\[ \widehat{\sigma^2} = \frac{1-\alpha^2}{n+1}\sum_{i=0}^n x_i^2. \]
If one is interested in estimating $\alpha$, it is necessary to introduce $f(x_i, x_{i+1}\mid\theta)$.
Pseudo-likelihood has been widely used for Markov random fields since its introduction by Besag (1975). In the Gaussian context, we have $X = (X_1,\dots,X_d)$ Gaussian, where $d$ is extremely large, and the model is specified through the conditionals
\[ \mathrm E_\theta[X_i\mid x_{-i}] = \lambda\sum_{j} H_{ij}\, x_j, \qquad \operatorname{var}_\theta[X_i\mid x_{-i}] = \kappa. \]
Computing the likelihood for $\theta = (\lambda,\kappa)$ can be too computationally intensive, so the pseudo-likelihood is defined through
\[ \tilde l(\theta\mid x) = \sum_{i=1}^d \log f(x_i\mid\theta, x_{-i}), \]
thus
\[ \hat\lambda = \frac{x^{\mathsf T} H x}{x^{\mathsf T} H^2 x}, \qquad \hat\kappa = \frac1d\left(x^{\mathsf T}x - \frac{\left(x^{\mathsf T}Hx\right)^2}{x^{\mathsf T}H^2 x}\right). \]
In this context, it can be shown that the estimate is consistent and has a reasonable asymptotic variance.
Summary of Pseudo-Likelihood. In many applications, the log-likelihood $l(\theta; y_{1:n})$ is very complex to compute. Instead we maximize a surrogate function $l_S(\theta; y_{1:n})$. If possible, we pick this function such that, if $\theta^*$ is the true parameter, then $\mathrm E_{\theta^*}[l_S(\theta; Y_{1:n})]$ is maximized at $\theta = \theta^*$, and solving
\[ \nabla l_S\left(\hat\theta; Y_{1:n}\right) = 0 \]
is easy.
Under regularity assumptions, we have
\[ \sqrt n\,(\hat\theta - \theta^*) \Rightarrow \mathcal N\left(0, G^{-1}(\theta^*)\right), \qquad G^{-1}(\theta) = H^{-1}(\theta)\, J(\theta)\, H^{-\mathsf T}(\theta), \]
with
\[ J(\theta) = \mathrm V\left\{\nabla l_S(\theta; Y_{1:n})\right\}, \qquad H(\theta) = -\mathrm E\left[\nabla^2 l_S(\theta; Y_{1:n})\right]. \]
When $l_S(\theta; Y_{1:n}) = l(\theta; Y_{1:n})$ and the model is correctly specified, $G(\theta)$ is the Fisher information matrix. When $l_S(\theta; Y_{1:n}) \neq l(\theta; Y_{1:n})$, we typically lose in terms of efficiency.
Application to General State-Space Models. Consider the following general state-space model. Let $\{X_k\}_{k\ge 1}$ be a Markov process defined by $X_1 \sim \mu_\theta$ and $X_k\mid(X_{k-1}=x_{k-1}) \sim f_\theta(\cdot\mid x_{k-1})$. Then we have, for any $n > 0$,
\[ p_\theta(x_{1:n}) = p_\theta(x_1)\prod_{k=2}^n p_\theta(x_k\mid x_{1:k-1}) = \mu_\theta(x_1)\prod_{k=2}^n f_\theta(x_k\mid x_{k-1}). \]
We are interested in estimating $\theta$ from the data, but we do not have access to $\{X_k\}_{k\ge1}$. We only have access to a process $\{Y_k\}_{k\ge1}$ such that, conditional upon $\{X_k\}_{k\ge1}$, the observations are statistically independent and $Y_k\mid(X_k=x_k) \sim g_\theta(\cdot\mid x_k)$. That is, we have for any $n>0$
\[ p_\theta(y_{1:n}\mid x_{1:n}) = \prod_{k=1}^n p_\theta(y_k\mid x_k) = \prod_{k=1}^n g_\theta(y_k\mid x_k). \]
Examples. Linear Gaussian model: consider, say for $|\alpha|<1$,
\[ X_1 \sim \mathcal N\left(0, \frac{\sigma^2}{1-\alpha^2}\right), \qquad X_k = \alpha X_{k-1} + \sigma V_k, \quad V_k \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1), \qquad Y_k = \beta + X_k + \tau W_k, \quad W_k \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1). \]
In this case we have, say, $\theta = (\beta,\sigma^2,\alpha,\tau^2)$ and
\[ f_\theta(x_k\mid x_{k-1}) = \mathcal N\left(x_k; \alpha x_{k-1}, \sigma^2\right), \qquad g_\theta(y_k\mid x_k) = \mathcal N\left(y_k; \beta + x_k, \tau^2\right). \]
Stochastic volatility model: consider, say for $|\alpha|<1$,
\[ X_1 \sim \mathcal N\left(0, \frac{\sigma^2}{1-\alpha^2}\right), \qquad X_k = \alpha X_{k-1} + \sigma V_k, \quad V_k \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1), \qquad Y_k = \beta\exp\left(X_k/2\right) W_k, \quad W_k \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1). \]
In this case we have, say, $\theta = (\beta,\sigma^2,\alpha)$ and
\[ f_\theta(x_k\mid x_{k-1}) = \mathcal N\left(x_k; \alpha x_{k-1}, \sigma^2\right), \qquad g_\theta(y_k\mid x_k) = \mathcal N\left(y_k; 0, \beta^2\exp(x_k)\right). \]
In this case, the likelihood of the observations $y_{1:n}$ is given by
\[ p_\theta(y_{1:n}) = \int p_\theta(x_{1:n}, y_{1:n})\, dx_{1:n} = \int p_\theta(y_{1:n}\mid x_{1:n})\, p_\theta(x_{1:n})\, dx_{1:n} = \int\left(\prod_{k=1}^n g_\theta(y_k\mid x_k)\right)\left(\mu_\theta(x_1)\prod_{k=2}^n f_\theta(x_k\mid x_{k-1})\right) dx_{1:n}. \]
If the model is linear Gaussian or has a finite state-space, we can compute the likelihood in closed form, but the maximization is not trivial. Otherwise, we cannot.
Pairwise likelihood for state-space models. We consider the following pseudo-likelihood for $m \ge 1$:
\[ L_S(\theta; y_{1:n}) = \prod_{i=1}^{n-1}\prod_{j=i+1}^{\min(i+m,\,n)} p_\theta(y_i, y_j), \]
where
\[ p_\theta(y_i, y_j) = \int g_\theta(y_i\mid x_i)\, g_\theta(y_j\mid x_j)\, p_\theta(x_i, x_j)\, dx_i\, dx_j. \]
As an alternative, if $n = pm$, we could maximize
\[ L_S(\theta; y_{1:n}) = \prod_{i=1}^p p_\theta\left(y_{(i-1)m+1:im}\right). \]
For the two models discussed earlier, it is possible to compute $p_\theta(x_i,x_j)$ exactly, as $p_\theta(x_i,x_j) = p_\theta(x_i)\, p_\theta(x_j\mid x_i)$, where
\[ p_\theta(x_i) = \mathcal N\left(x_i; 0, \frac{\sigma^2}{1-\alpha^2}\right), \qquad p_\theta(x_j\mid x_i) = \mathcal N\left(x_j;\; \alpha^{j-i} x_i,\; \sigma^2\sum_{k=0}^{j-i-1}\alpha^{2k}\right). \]
In the general case, we could approximate $p_\theta(y_i,y_j)$ through Monte Carlo:
\[ \hat p_\theta(y_i, y_j) = \frac1N\sum_{l=1}^N g_\theta\left(y_i\mid X_i^l\right) g_\theta\left(y_j\mid X_j^l\right), \qquad \left(X_i^l, X_j^l\right) \sim p_\theta(x_i, x_j). \]
Under regularity assumptions,
\[ \int l_S(\theta; y_{1:n})\, p_{\theta^*}(y_{1:n})\, dy_{1:n} \]
is maximized at $\theta = \theta^*$, so maximizing this pseudo-likelihood makes sense. To prove it, note that
\[ l_S(\theta; y_{1:n}) = \sum_{i=1}^{n-1}\sum_{j=i+1}^{\min(i+m,\,n)}\log p_\theta(y_i,y_j) \]
and
\[ \int \log p_\theta(y_i,y_j)\, p_{\theta^*}(y_{1:n})\, dy_{1:n} = \int \log p_\theta(y_i,y_j)\, p_{\theta^*}(y_i,y_j)\, dy_i\, dy_j, \]
which is maximized at $\theta = \theta^*$ (by the Kullback-Leibler argument).
Application. Consider
\[ X_1 \sim \mathcal N\left(0, \frac{\sigma^2}{1-\alpha^2}\right), \qquad X_k = \alpha X_{k-1} + \sigma V_k, \quad V_k \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1), \qquad Y_k = \beta + X_k + \tau W_k, \quad W_k \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1), \]
where $\theta = (\beta,\sigma^2,\alpha,\tau^2)$. In this case, we can directly establish not only $p_\theta(x_i,x_j)$ but also $p_\theta(y_i,y_j)$:
\[ \begin{pmatrix} Y_i \\ Y_j \end{pmatrix} \sim \mathcal N\left(\begin{pmatrix}\beta\\ \beta\end{pmatrix}, \begin{pmatrix}\tau^2 + \dfrac{\sigma^2}{1-\alpha^2} & \alpha^{j-i}\dfrac{\sigma^2}{1-\alpha^2} \\[6pt] \alpha^{j-i}\dfrac{\sigma^2}{1-\alpha^2} & \tau^2 + \dfrac{\sigma^2}{1-\alpha^2}\end{pmatrix}\right). \]
For $m = 2,\dots,20$ we compare the performance of $\hat\theta_{\text{MLE}}$ and $\hat\theta_{\text{MPL}}$, where the likelihood and pseudo-likelihood are maximized using a simple gradient algorithm (EM could be used). 1,000 time series of length $n = 500$ with $\beta = 0.1$, $\tau = 1.0$, $\alpha = 0.95$ and $\sigma = 0.55$ are simulated.
Results (standard deviations in parentheses):

      true    MPL(m=2)  MPL(m=6)  MPL(m=12)  MPL(m=20)  ML
beta  0.1     0.108     0.108     0.109      0.109      0.102
              (0.488)   (0.489)   (0.491)    (0.492)    (0.481)
tau   1.0     0.994     0.997     0.990      0.981      0.995
              (0.066)   (0.048)   (0.054)    (0.068)    (0.046)
alpha 0.95    0.941     0.941     0.939      0.937      0.941
              (0.033)   (0.020)   (0.022)    (0.024)    (0.020)
sigma 0.55    0.535     0.551     0.560      0.571      0.554
              (0.160)   (0.064)   (0.072)    (0.087)    (0.061)
Now you should... be able to compute the MLE for rather complex models; be able to compute the asymptotic variance of the MLE; and be able to derive the expression of the asymptotic variance of simple estimates different from the MLE.