ECE 645: Estimation Theory, Spring 2015
Instructor: Prof. Stanley H. Chan

Maximum Likelihood Estimation
(LaTeX prepared by Shaobo Fang)
April 14, 2015

This lecture note is based on ECE 645 (Spring 2015) by Prof. Stanley H. Chan in the School of Electrical and Computer Engineering at Purdue University.

1 Introduction

For many families besides the exponential family, the Minimum Variance Unbiased Estimator (MVUE) can be very difficult to find, or it may not even exist. For such models, we need an alternative method to obtain good estimators. In the absence of prior information, maximum likelihood estimation is a viable alternative. (Poor IV.D)

Definition 1. Maximum likelihood estimate (MLE). The maximum likelihood estimator is defined as

\hat{\theta}_{ML}(y) \stackrel{\text{def}}{=} \operatorname*{argmax}_{\theta} f_\theta(y),  (1)

where $f_\theta(y) = f_Y(y;\theta)$. Here, the function $f_\theta(y)$ is called the likelihood function. We can also take the log of $f_\theta(y)$ and obtain the same maximizer:

\hat{\theta}_{ML}(y) \stackrel{\text{def}}{=} \operatorname*{argmax}_{\theta} \log f_\theta(y).  (2)

The function $\log f_\theta(y)$ is called the log-likelihood function.

Example 1. Let $Y = [Y_1,\ldots,Y_n]$ be a sequence of i.i.d. random variables such that $Y_k \sim N(\mu,\sigma^2)$. Assume that $\sigma^2$ is known; find $\hat{\theta}_{ML}$ for $\mu$.

Solution: First of all, the likelihood function is

f_\theta(y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{k=1}^n (y_k-\mu)^2 \right).

Taking the log on both sides, we have the log-likelihood function

\log f_\theta(y) = -\frac{1}{2\sigma^2}\sum_{k=1}^n (y_k-\mu)^2 - \frac{n}{2}\log(2\pi\sigma^2).

In order to find the maximizer of the log-likelihood function, we take the first-order derivative and set it to zero. This yields

\hat{\mu}_{ML}(y) = \frac{1}{n}\sum_{k=1}^n y_k.

We can also show that

E[\hat{\mu}_{ML}(Y)] = \frac{1}{n}\sum_{k=1}^n E[Y_k] = \mu,

which says that the estimator is unbiased.
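As a quick numerical sanity check of Example 1, the following minimal Python sketch (numpy assumed; the sample size, parameters, and grid are illustrative, not from the notes) compares a brute-force maximizer of the log-likelihood with the closed-form sample mean:

import numpy as np

# Maximize the Gaussian log-likelihood of Example 1 over a grid of
# candidate means and compare with the closed-form MLE (the sample mean).
rng = np.random.default_rng(0)
n, mu, sigma = 1000, 2.0, 1.5          # illustrative values
y = rng.normal(mu, sigma, size=n)

def log_likelihood(m):
    # log f_theta(y), dropping the constant -(n/2) log(2 pi sigma^2)
    return -np.sum((y - m) ** 2) / (2 * sigma ** 2)

grid = np.linspace(0.0, 4.0, 2001)
m_grid = grid[np.argmax([log_likelihood(m) for m in grid])]
print(m_grid, y.mean())                # agree up to the grid resolution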
Example 2. Now we consider the previous example with both $\mu$ and $\sigma^2$ unknown. Our goal is to determine $\hat{\theta}_{ML} \stackrel{\text{def}}{=} [\hat{\theta}_1, \hat{\theta}_2]^T$ for $\theta_1 = \mu$ and $\theta_2 = \sigma^2$.

Solution: As in the previous problem, the log-likelihood function is

\log f_\theta(y) = -\frac{1}{2\theta_2}\sum_{k=1}^n (y_k-\theta_1)^2 - \frac{n}{2}\log(2\pi\theta_2).

Taking the partial derivative with respect to $\theta_1$ yields

\frac{\partial}{\partial\theta_1}\log f_\theta(y) = \frac{1}{2\theta_2}\sum_{k=1}^n 2(y_k-\theta_1) = 0,

which gives

\hat{\theta}_1(y) = \frac{1}{n}\sum_{k=1}^n y_k.

Similarly, taking the partial derivative with respect to $\theta_2$ yields

\frac{\partial}{\partial\theta_2}\log f_\theta(y) = \frac{1}{2\theta_2^2}\sum_{k=1}^n (y_k-\theta_1)^2 - \frac{n}{2\theta_2} = 0,

which gives

\hat{\theta}_2(y) = \frac{1}{n}\sum_{k=1}^n (y_k-\hat{\theta}_1)^2.

Note that $E[\hat{\theta}_2(Y)] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$. So $\hat{\theta}_2$ is biased.

Remark: In order to obtain an unbiased estimator of the population variance, it is preferred to use the sample variance, defined as (Zwillinger 1995, p. 603):

S_{n-1} = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar{Y})^2,

where $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ is the sample mean. In fact, the function var in MATLAB computes the sample variance.
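The bias of $\hat{\theta}_2$ and the unbiasedness of $S_{n-1}$ are easy to see numerically. A minimal sketch (numpy assumed; constants illustrative), using the ddof argument of numpy's var to switch between the $1/n$ and $1/(n-1)$ normalizations:

import numpy as np

# Compare the biased variance MLE (divide by n) with the sample
# variance S_{n-1} (divide by n-1) over many Monte Carlo trials.
rng = np.random.default_rng(1)
n, true_var, trials = 10, 4.0, 200000
y = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

mle_var = y.var(axis=1, ddof=0)        # (1/n) sum (y_k - ybar)^2
sample_var = y.var(axis=1, ddof=1)     # (1/(n-1)) sum (y_k - ybar)^2
print(mle_var.mean())                  # approx (n-1)/n * 4.0 = 3.6
print(sample_var.mean())               # approx 4.0 (unbiased)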
Example 3. Bernoulli (Statistical Inference, Example 7.2.7, Casella and Berger). Let $Y = [Y_1,\ldots,Y_n]$ be a sequence of i.i.d. Bernoulli random variables with parameter $\theta$. We would like to find the MLE $\hat{\theta}_{ML}$ for $\theta$.

Solution: First of all, we define the likelihood function:

f_\theta(y) = \prod_{k=1}^n \theta^{y_k}(1-\theta)^{1-y_k}.

Letting $y = \sum_{k=1}^n y_k$, we can rewrite the likelihood function as

f_\theta(y) = \theta^{y}(1-\theta)^{n-y}.

Hence, the log-likelihood function is

\log f_\theta(y) = y\log\theta + (n-y)\log(1-\theta).

Taking the derivative and setting it to zero yields

\frac{\partial}{\partial\theta}\log f_\theta(y) = \frac{y}{\theta} - \frac{n-y}{1-\theta} = 0.

Therefore, $\hat{\theta}_{ML}(y)$ is

\hat{\theta}_{ML}(y) = \frac{1}{n}\sum_{k=1}^n y_k.

Example 4. Binomial. Let $Y = [Y_1,\ldots,Y_n]$ be a sequence of i.i.d. random variables with a binomial$(k,\theta)$ distribution. We would like to find $\hat{\theta}_{ML}$ for $\theta$.

Solution: The likelihood function is

f_\theta(y) = \prod_{i=1}^n \binom{k}{y_i}\theta^{y_i}(1-\theta)^{k-y_i}.

By letting $y = \sum_{i=1}^n y_i$, we can rewrite the likelihood function as

f_\theta(y) = \left[\prod_{i=1}^n \binom{k}{y_i}\right]\theta^{y}(1-\theta)^{nk-y}.

The log-likelihood function is

\log f_\theta(y) = y\log\theta + (nk-y)\log(1-\theta) + \underbrace{\sum_{i=1}^n \log\binom{k}{y_i}}_{\text{this term does not contain }\theta}.

Taking the first-order derivative and setting it to zero yields

\hat{\theta}_{ML}(y) = \frac{y}{nk} = \frac{1}{nk}\sum_{i=1}^n y_i.
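The closed-form answer of Example 4 can be checked against a brute-force grid search; a minimal sketch (numpy assumed; constants illustrative):

import numpy as np

# Grid-search the binomial log-likelihood and compare with the
# closed-form MLE y/(nk) from Example 4.
rng = np.random.default_rng(2)
n, k, theta = 500, 10, 0.3
y = rng.binomial(k, theta, size=n)

def log_likelihood(t):
    # Binomial coefficients are omitted: they do not depend on theta.
    return y.sum() * np.log(t) + (n * k - y.sum()) * np.log(1 - t)

grid = np.linspace(0.01, 0.99, 9801)
print(grid[np.argmax(log_likelihood(grid))])   # grid maximizer
print(y.sum() / (n * k))                       # closed-form MLE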
Example 5. Poisson. Let $Y = [Y_1,\ldots,Y_n]$ be a sequence of i.i.d. Poisson random variables with parameter $\lambda$. Recall that the Poisson distribution is

P(Y_i = y_i) = \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}.

We would like to find $\hat{\lambda}_{ML}$ for the parameter $\lambda$.

Solution: Similarly to the previous examples, we first find the likelihood function:

f_\lambda(y) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} = \frac{e^{-n\lambda}\lambda^{\sum_i y_i}}{\prod_i y_i!}.

Thus the log-likelihood function is

\log f_\lambda(y) = -n\lambda + \sum_{i=1}^n y_i\log\lambda - \underbrace{\log\prod_{i=1}^n y_i!}_{\text{this term does not contain }\lambda}.

Setting the first-order derivative to zero yields

\hat{\lambda}_{ML}(y) = \frac{1}{n}\sum_{i=1}^n y_i.

2 Bias vs. Variance

In general, an MLE can be either biased or unbiased. To take a closer look at this property, we decompose the MSE into variance and bias terms as below:

MSE_\theta = E_Y[(\hat{\theta}_{ML}(Y)-\theta)^2]
= E[(\hat{\theta}_{ML} - E[\hat{\theta}_{ML}] + E[\hat{\theta}_{ML}] - \theta)^2]
= E[(\hat{\theta}_{ML} - E[\hat{\theta}_{ML}])^2] + E[(E[\hat{\theta}_{ML}]-\theta)^2] + 2E[(\hat{\theta}_{ML}-E[\hat{\theta}_{ML}])(E[\hat{\theta}_{ML}]-\theta)]
= \underbrace{E_Y[(\hat{\theta}_{ML}(Y) - E[\hat{\theta}_{ML}(Y)])^2]}_{\text{variance}} + \underbrace{(E[\hat{\theta}_{ML}(Y)]-\theta)^2}_{\text{bias}^2},  (3)

where the cross term vanishes because $E[\hat{\theta}_{ML}-E[\hat{\theta}_{ML}]] = 0$ and $E[\hat{\theta}_{ML}]-\theta$ is deterministic.
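Decomposition (3) can be verified numerically; a minimal sketch (numpy assumed; constants illustrative), reusing the biased variance MLE of Example 2 as the estimator:

import numpy as np

# Check MSE = variance + bias^2 for the biased variance MLE of Example 2.
rng = np.random.default_rng(3)
n, true_var = 10, 4.0
est = rng.normal(0.0, np.sqrt(true_var), size=(200000, n)).var(axis=1, ddof=0)

mse = np.mean((est - true_var) ** 2)
decomposed = est.var() + (est.mean() - true_var) ** 2
print(mse, decomposed)   # agree exactly: (3) is an algebraic identity in the sample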
Example 6. Image Denoising. Let $z$ be a clean signal and let $n$ be a noise vector such that $n \sim N(0,\sigma^2 I)$. Suppose that we are given the noisy observation $y = z + n$; our goal is to estimate $z$ from $y$. In this example, let us consider a linear denoising method. We would like to find a $W$ such that the estimator $\hat{z} = Wy$ is optimal in some sense. We shall call $W$ a smoothing filter. To determine which $W$ would be good, we first consider the MSE:

MSE = E[\|\hat{z}-z\|^2] = E[\|Wy-z\|^2] = E[\|W(z+n)-z\|^2] = E[\|(W-I)z + Wn\|^2] = \underbrace{\|(W-I)z\|^2}_{\text{bias}} + \underbrace{E[\|Wn\|^2]}_{\text{variance}}.

Now, by using the eigendecomposition, we can write $W$ as $W = U\Lambda U^T$. Then, the bias can be computed as

\text{bias} = \|(W-I)z\|^2 = \|(U\Lambda U^T - I)z\|^2 = \|U(\Lambda-I)U^T z\|^2 = z^T U(\Lambda-I)^2 U^T z = \sum_{i=1}^n (\lambda_i-1)^2 v_i^2,

where $v = U^T z$. Similarly, the variance can be computed as

\text{variance} = E[\|Wn\|^2] = E[n^T W^T W n] = \sigma^2 \operatorname{Tr}\{W^T W\} = \sigma^2 \sum_{i=1}^n \lambda_i^2.

Therefore, the MSE can be written as

MSE = \sum_{i=1}^n (\lambda_i-1)^2 v_i^2 + \sigma^2 \sum_{i=1}^n \lambda_i^2.

To minimize the MSE, $\lambda_i$ should be chosen such that

\frac{\partial}{\partial\lambda_i} MSE = 2v_i^2(\lambda_i-1) + 2\sigma^2\lambda_i = 0,

which gives

\lambda_i = \frac{v_i^2}{v_i^2+\sigma^2}.

Thus far we have come across many examples where the estimators are unbiased. So are biased estimators bad? The answer is no. Here is an example. Let us consider a random variable $Y \sim N(0,\sigma^2)$. Now, consider the following two estimators:

Estimator 1: $\hat{\theta}_1(Y) = Y^2$. Then $E[\hat{\theta}_1(Y)] = E[Y^2] = \sigma^2$, thus it is unbiased.

Estimator 2: $\hat{\theta}_2(Y) = aY^2$, $a \neq 1$. Then $E[\hat{\theta}_2(Y)] = a\sigma^2$. Thus it is biased.

Let us now consider the MSE of $\hat{\theta}_2$. (Note that the MSE of $\hat{\theta}_1$ can be found by letting $a = 1$.)

MSE = E[(\hat{\theta}_2(Y)-\sigma^2)^2] = E[(aY^2-\sigma^2)^2] = E[a^2Y^4] - 2\sigma^2 E[aY^2] + \sigma^4 = 3a^2\sigma^4 - 2a\sigma^4 + \sigma^4 = \sigma^4(3a^2-2a+1),

where we used $E[Y^4] = 3\sigma^4$ for a zero-mean Gaussian. Therefore, the MSE attains its minimum at

\frac{\partial}{\partial a} MSE = \sigma^4(6a-2) = 0,

which gives $a = \frac{1}{3}$. This result says: although $\hat{\theta}_2$ is biased, it actually attains a lower MSE!
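A Monte Carlo check of this conclusion (a minimal sketch, numpy assumed; $\sigma^2 = 2$ is illustrative): the empirical MSE of $aY^2$ should track $\sigma^4(3a^2-2a+1)$, so $a = 1/3$ should beat the unbiased choice $a = 1$ by a factor of three.

import numpy as np

# Empirical MSE of theta_hat(Y) = a Y^2 versus sigma^4 (3a^2 - 2a + 1).
rng = np.random.default_rng(4)
sigma2 = 2.0
y = rng.normal(0.0, np.sqrt(sigma2), size=1000000)

for a in (1.0, 1.0 / 3.0):
    empirical = np.mean((a * y ** 2 - sigma2) ** 2)
    formula = sigma2 ** 2 * (3 * a ** 2 - 2 * a + 1)
    print(a, empirical, formula)   # a = 1/3 gives 1/3 of the MSE at a = 1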
3 Fisher Information

3.1 Variance and Curvature of the Log-likelihood

For unbiased estimators, the variance provides extremely important information about the performance of the estimator. In order to study the variance more carefully, we first study its relationship to the log-likelihood, as demonstrated in the example below.

Example 7. Let $Y \sim N(\theta,\sigma^2)$, where $\sigma$ is known. Accordingly,

f_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\theta)^2}{2\sigma^2}\right)
\log f_\theta(y) = -\log\sqrt{2\pi\sigma^2} - \frac{(y-\theta)^2}{2\sigma^2}
\frac{\partial \log f_\theta(y)}{\partial\theta} = \frac{1}{\sigma^2}(y-\theta)
\underbrace{\frac{\partial^2 \log f_\theta(y)}{\partial\theta^2} = -\frac{1}{\sigma^2}}_{\text{curvature of the log-likelihood}}

Therefore, as $\sigma^2$ increases, the magnitude of $\frac{\partial^2}{\partial\theta^2}\log f_\theta(y)$ decreases. Thus, we conclude that as the variance increases, the curvature of the log-likelihood decreases: the flatter the log-likelihood, the less informative the data are about $\theta$.

3.2 Fisher Information

Definition 2. Fisher Information. The Fisher information is defined as

I(\theta) \stackrel{\text{def}}{=} -E_Y\left[\frac{\partial^2 \log f_\theta(Y)}{\partial\theta^2}\right],  (4)

where

E_Y\left[\frac{\partial^2 \log f_\theta(Y)}{\partial\theta^2}\right] = \int \frac{\partial^2 \log f_\theta(y)}{\partial\theta^2}\, f_\theta(y)\,dy.  (5)

We will compute the Fisher information in the examples below.
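Before working the examples analytically, here is a quick numerical sanity check of (4) on the scalar Gaussian model of Example 7 (a minimal sketch, numpy assumed; constants illustrative), estimating the curvature by a central finite difference and averaging over samples:

import numpy as np

# Estimate -E[d^2/dtheta^2 log f_theta(Y)] by finite differences;
# for N(theta, sigma^2) this should equal 1/sigma^2 (Example 7).
rng = np.random.default_rng(5)
theta, sigma2, h = 1.0, 2.0, 1e-3
y = rng.normal(theta, np.sqrt(sigma2), size=100000)

def ll(t):
    # log f_theta(y), dropping the additive constant -log sqrt(2 pi sigma^2)
    return -(y - t) ** 2 / (2 * sigma2)

curvature = (ll(theta + h) - 2 * ll(theta) + ll(theta - h)) / h ** 2
print(-curvature.mean(), 1 / sigma2)   # both approx 0.5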
Example 8. Let $Y = [Y_1,\ldots,Y_n]$ be a sequence of i.i.d. random variables such that $Y_i \sim N(\theta,\sigma^2)$. We would like to determine $I(\theta)$. First, we know that the log-likelihood is

\log f_\theta(y) = -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^n \frac{(y_i-\theta)^2}{2\sigma^2}.

The first-order derivative is

\frac{\partial \log f_\theta(y)}{\partial\theta} = \frac{n}{\sigma^2}(\bar{y}-\theta),

where

\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i.

Consequently, the second-order derivative is

\frac{\partial^2 \log f_\theta(y)}{\partial\theta^2} = -\frac{n}{\sigma^2}.

Finally, the Fisher information is

I(\theta) = -E_Y\left[\frac{\partial^2 \log f_\theta(Y)}{\partial\theta^2}\right] = -E_Y\left[-\frac{n}{\sigma^2}\right] = \frac{n}{\sigma^2}.

Example 9. Let $Y = [Y_1,\ldots,Y_n]$ be a sequence of independent random variables such that

Y_k = A\cos(\omega_0 k + \theta) + N_k,

where the $N_k \sim N(0,\sigma^2)$ are i.i.d. Find $I(\theta)$.

The likelihood function is

f_\theta(y) = \prod_{k=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_k - A\cos(\omega_0 k+\theta))^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{k=1}^n (y_k - A\cos(\omega_0 k+\theta))^2\right).

Then, the first-order derivative of the log-likelihood is

\frac{\partial}{\partial\theta}\log f_\theta(y) = -\frac{1}{2\sigma^2}\frac{\partial}{\partial\theta}\left[\sum_{k=1}^n (y_k - A\cos(\omega_0 k+\theta))^2\right]
= -\frac{1}{\sigma^2}\sum_{k=1}^n (y_k - A\cos(\omega_0 k+\theta))(A\sin(\omega_0 k+\theta))
= -\frac{A}{\sigma^2}\sum_{k=1}^n \left(y_k\sin(\omega_0 k+\theta) - \frac{A}{2}\sin(2\omega_0 k+2\theta)\right).

The second-order derivative is

\frac{\partial^2}{\partial\theta^2}\log f_\theta(y) = -\frac{A}{\sigma^2}\sum_{k=1}^n \left[y_k\cos(\omega_0 k+\theta) - A\cos(2\omega_0 k+2\theta)\right].

Accordingly, $E_Y[\frac{\partial^2}{\partial\theta^2}\log f_\theta(Y)]$ can be computed as follows:

E_Y\left[\frac{\partial^2}{\partial\theta^2}\log f_\theta(Y)\right] = -\frac{A}{\sigma^2}\sum_{k=1}^n \left(E[Y_k]\cos(\omega_0 k+\theta) - A\cos(2\omega_0 k+2\theta)\right)
= -\frac{A}{\sigma^2}\sum_{k=1}^n \left[A\cos^2(\omega_0 k+\theta) - A\cos(2\omega_0 k+2\theta)\right]
= -\frac{A^2}{\sigma^2}\sum_{k=1}^n \left(\frac{1}{2} + \frac{1}{2}\cos(2\omega_0 k+2\theta) - \cos(2\omega_0 k+2\theta)\right)
= -\frac{nA^2}{2\sigma^2} + \frac{A^2}{2\sigma^2}\sum_{k=1}^n \cos(2\omega_0 k+2\theta).

By using the fact that $\frac{1}{n}\sum_{k=1}^n \cos(2\omega_0 k+2\theta) \approx 0$, we have

I(\theta) = \frac{nA^2}{2\sigma^2}.
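The approximation at the end of Example 9 can be checked by Monte Carlo; a minimal sketch (numpy assumed; constants illustrative) averages the negated second derivative over noise realizations and compares it with $nA^2/(2\sigma^2)$:

import numpy as np

# Monte Carlo estimate of I(theta) for Y_k = A cos(w0 k + theta) + N_k,
# using the second-derivative expression derived in Example 9.
rng = np.random.default_rng(6)
n, A, w0, theta, sigma2 = 200, 1.5, 0.3, 0.7, 0.5
k = np.arange(1, n + 1)
s = A * np.cos(w0 * k + theta)

draws = []
for _ in range(2000):
    y = s + rng.normal(0.0, np.sqrt(sigma2), size=n)
    d2 = -(A / sigma2) * np.sum(y * np.cos(w0 * k + theta)
                                - A * np.cos(2 * w0 * k + 2 * theta))
    draws.append(-d2)                  # negated second derivative
print(np.mean(draws), n * A ** 2 / (2 * sigma2))   # both approx 450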
3.3 Fisher Information and KL Divergence

There is an interesting relationship between the Fisher information and the KL divergence, which we shall now discuss. To begin with, let us first list two assumptions.

Assumptions:

1. \frac{\partial}{\partial\theta}\int f_\theta(y)\,dy = \int \frac{\partial}{\partial\theta} f_\theta(y)\,dy

2. \frac{\partial}{\partial\theta}\int \hat{\theta}(y)\, f_\theta(y)\,dy = \int \frac{\partial}{\partial\theta} f_\theta(y)\,\hat{\theta}(y)\,dy

Basically, the two assumptions say that we can interchange the order of integration and differentiation. If the assumptions hold, we can show the following result:

I(\theta) = E_Y\left[\left(\frac{\partial \log f_\theta(Y)}{\partial\theta}\right)^2\right].  (6)

Proof. By the assumptions and direct differentiation of $\log f_\theta$, we have

I(\theta) = -E_Y\left[\frac{\partial^2}{\partial\theta^2}\log f_\theta(Y)\right]
= -\int\left(\frac{f_\theta''(y)}{f_\theta(y)} - \frac{(f_\theta'(y))^2}{f_\theta(y)^2}\right) f_\theta(y)\,dy
= -\underbrace{\int f_\theta''(y)\,dy}_{= 0 \text{ by assumption 1}} + \int \frac{(f_\theta'(y))^2}{f_\theta(y)}\,dy
= \int\left(\frac{\partial \log f_\theta(y)}{\partial\theta}\right)^2 f_\theta(y)\,dy = E_Y\left[\left(\frac{\partial\log f_\theta(Y)}{\partial\theta}\right)^2\right],

where primes denote derivatives with respect to $\theta$, and we used $\frac{\partial^2}{\partial\theta^2}\log f_\theta = \frac{f_\theta''}{f_\theta} - \left(\frac{f_\theta'}{f_\theta}\right)^2$.
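Identity (6) is easy to check numerically for the Gaussian model: there the score is $(Y-\theta)/\sigma^2$, so $E[(\partial \log f_\theta(Y)/\partial\theta)^2]$ should equal $1/\sigma^2$. A minimal sketch (numpy assumed; constants illustrative):

import numpy as np

# Check E[score^2] = I(theta) = 1/sigma^2 for Y ~ N(theta, sigma^2).
rng = np.random.default_rng(7)
theta, sigma2 = 1.0, 2.0
y = rng.normal(theta, np.sqrt(sigma2), size=1000000)

score = (y - theta) / sigma2             # d/dtheta log f_theta(y)
print(np.mean(score ** 2), 1 / sigma2)   # both approx 0.5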
The following proposition links the KL divergence and $I(\theta)$.

Proposition 1. Let $\theta = \theta_0 + \delta$ for some small deviation $\delta$. Then

D(f_{\theta_0} \| f_\theta) = \frac{I(\theta_0)}{2}(\theta-\theta_0)^2 + O((\theta-\theta_0)^3).  (7)

Interpretation: If $I(\theta_0)$ is large, then $D(f_{\theta_0}\|f_\theta)$ is large. Accordingly, it would be easier to differentiate between $\theta_0$ and $\theta$.

Proof. First, recall that the KL divergence is defined as

D(f_{\theta_0}\|f_\theta) = \int f_{\theta_0}(y)\log\frac{f_{\theta_0}(y)}{f_\theta(y)}\,dy.

Consider the Taylor expansion about $\theta_0$; we compute the first two derivative terms as below.

First-order derivative:

\frac{\partial}{\partial\theta} D(f_{\theta_0}\|f_\theta)\Big|_{\theta=\theta_0} = \int f_{\theta_0}(y)\,\frac{\partial}{\partial\theta}\left[\log f_{\theta_0}(y) - \log f_\theta(y)\right]dy\,\Big|_{\theta=\theta_0}
= -\int f_{\theta_0}(y)\frac{1}{f_\theta(y)}\frac{\partial}{\partial\theta}f_\theta(y)\,dy\,\Big|_{\theta=\theta_0}
= -\int \frac{\partial}{\partial\theta}f_\theta(y)\,dy\,\Big|_{\theta=\theta_0} = -\frac{\partial}{\partial\theta}\int f_\theta(y)\,dy = 0.  (8)

Second-order derivative:

\frac{\partial^2}{\partial\theta^2} D(f_{\theta_0}\|f_\theta)\Big|_{\theta=\theta_0} = \int f_{\theta_0}(y)\,\frac{\partial^2}{\partial\theta^2}\left[-\log f_\theta(y)\right]dy\,\Big|_{\theta=\theta_0} = E\left[-\frac{\partial^2}{\partial\theta^2}\log f_\theta(Y)\right]\Big|_{\theta=\theta_0} = I(\theta_0).  (9)

Substituting the above terms into the Taylor expansion and ignoring the higher-order terms:

D(f_{\theta_0}\|f_\theta) = D(f_{\theta_0}\|f_{\theta_0}) + (\theta-\theta_0)\frac{\partial}{\partial\theta}D(f_{\theta_0}\|f_\theta)\Big|_{\theta=\theta_0} + \frac{(\theta-\theta_0)^2}{2}\frac{\partial^2}{\partial\theta^2}D(f_{\theta_0}\|f_\theta)\Big|_{\theta=\theta_0} + O((\theta-\theta_0)^3)
= \frac{(\theta-\theta_0)^2}{2}\, I(\theta_0) + O((\theta-\theta_0)^3).  (10)
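As a concrete check of (7), take $f_\theta = N(\theta,\sigma^2)$. A direct computation gives

D(f_{\theta_0}\|f_\theta) = \int f_{\theta_0}(y)\,\frac{(y-\theta)^2 - (y-\theta_0)^2}{2\sigma^2}\,dy = \frac{(\theta-\theta_0)^2}{2\sigma^2},

which equals $\frac{I(\theta_0)}{2}(\theta-\theta_0)^2$ exactly, since $I(\theta_0) = 1/\sigma^2$ for a single Gaussian observation (Example 7); in this model the higher-order terms in (7) vanish identically.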
4 Cramér-Rao Lower Bound (CRLB)

The CRLB is a fundamental result that characterizes the performance of an estimator.

Theorem 1. Under assumptions (1) and (2),

Var(\hat{\theta}(Y)) \geq \frac{\left(\frac{\partial}{\partial\theta}E[\hat{\theta}(Y)]\right)^2}{I(\theta)}  (11)

for any estimator $\hat{\theta}(Y)$.

Proof. To prove the inequality, we first note that

Var(\hat{\theta}(Y))\, I(\theta) = \int \left(\hat{\theta}(y) - E[\hat{\theta}(Y)]\right)^2 f_\theta(y)\,dy \int \left(\frac{\partial}{\partial\theta}\log f_\theta(y)\right)^2 f_\theta(y)\,dy.

Letting

A = \hat{\theta}(Y) - E[\hat{\theta}(Y)], \qquad B = \frac{\partial}{\partial\theta}\log f_\theta(Y),

the above equation can be simplified as

Var(\hat{\theta}(Y))\, I(\theta) = E[A^2]E[B^2] \geq E[AB]^2,

where the inequality is due to Cauchy-Schwarz. We can also show that

E[AB]^2 = \left[\int (\hat{\theta}(y)-E[\hat{\theta}(Y)])\left(\frac{\partial}{\partial\theta}\log f_\theta(y)\right) f_\theta(y)\,dy\right]^2
= \left[\int (\hat{\theta}(y)-E[\hat{\theta}(Y)])\,\frac{\partial}{\partial\theta}f_\theta(y)\,dy\right]^2
= \left[\int \hat{\theta}(y)\,\frac{\partial}{\partial\theta}f_\theta(y)\,dy - E[\hat{\theta}(Y)]\int\frac{\partial}{\partial\theta}f_\theta(y)\,dy\right]^2
= \left[\frac{\partial}{\partial\theta}E[\hat{\theta}(Y)] - 0\right]^2 = \left(\frac{\partial}{\partial\theta}E[\hat{\theta}(Y)]\right)^2.

Combining the two displays proves (11).

Proposition 2. An estimator $\hat{\theta}(Y)$ achieves CRLB equality if and only if $\hat{\theta}(Y)$ is a sufficient statistic of a one-parameter exponential family.

Proof. Suppose that CRLB equality holds. Then we must have

\frac{\partial}{\partial\theta}\log f_\theta(y) = k(\theta)(\hat{\theta}(y) - E[\hat{\theta}(Y)])

for some function $k(\theta)$. This implies that

\log f_\theta(y) = \int_a^\theta k(\theta')(\hat{\theta}(y) - E[\hat{\theta}(Y)])\,d\theta' + H(y)
= \underbrace{-\int_a^\theta k(\theta')\, E[\hat{\theta}(Y)]\,d\theta'}_{\log C(\theta)} + \underbrace{H(y)}_{\log h(y)} + \hat{\theta}(y)\underbrace{\int_a^\theta k(\theta')\,d\theta'}_{Q(\theta)}.

Thus,

f_\theta(y) = C(\theta)\exp(Q(\theta)\,\hat{\theta}(y))\,h(y),

which is a one-parameter exponential family.

Conversely, suppose that $\hat{\theta}(Y)$ is a sufficient statistic of a one-parameter exponential family. Then

f_\theta(y) = C(\theta)\exp(Q(\theta)T(y))\,h(y),

where $T(y) = \hat{\theta}(y)$, and

C(\theta) = \left(\int \exp(Q(\theta)T(y))\,h(y)\,dy\right)^{-1}.

In order to show that $Var\{T(Y)\}$ attains the CRLB, we need to obtain the Fisher information:

I(\theta) = E\left\{\left(\frac{\partial}{\partial\theta}\log f_\theta(Y)\right)^2\right\}.

Note that since

\log f_\theta(y) = Q(\theta)T(y) + \log h(y) - \log\int\exp(Q(\theta)T(y'))h(y')\,dy',

we must have

\frac{\partial}{\partial\theta}\log f_\theta(y) = Q'(\theta)T(y) - \frac{\int\exp(Q(\theta)T(y'))h(y')\,T(y')\,dy'\,Q'(\theta)}{\int\exp(Q(\theta)T(y'))h(y')\,dy'} = Q'(\theta)\{T(y) - E\{T(Y)\}\}.

Therefore,

I(\theta) = E\left\{\left(\frac{\partial}{\partial\theta}\log f_\theta(Y)\right)^2\right\} = (Q'(\theta))^2\, Var\{T(Y)\}.

The Cramér-Rao lower bound is

Var(\hat{\theta}(Y)) \geq \frac{\left(\frac{\partial}{\partial\theta}E[\hat{\theta}(Y)]\right)^2}{I(\theta)}.

Thus we need to determine $(\frac{\partial}{\partial\theta}E[\hat{\theta}(Y)])^2$. Suppose that $\hat{\theta}(Y) = T(Y)$. Then

\frac{\partial}{\partial\theta}E\{\hat{\theta}(Y)\} = \frac{\partial}{\partial\theta}\left[\frac{\int T(y)\exp(Q(\theta)T(y))h(y)\,dy}{\int\exp(Q(\theta)T(y))h(y)\,dy}\right]
= Q'(\theta)\,\frac{\int T(y)^2 e^{Q(\theta)T(y)}h(y)\,dy\,\int e^{Q(\theta)T(y)}h(y)\,dy - \left(\int T(y)\,e^{Q(\theta)T(y)}h(y)\,dy\right)^2}{\left(\int e^{Q(\theta)T(y)}h(y)\,dy\right)^2}
= Q'(\theta)\, Var\{T(Y)\}.

Therefore,

\frac{\left(\frac{\partial}{\partial\theta}E[\hat{\theta}(Y)]\right)^2}{I(\theta)} = \frac{(Q'(\theta))^2\, Var\{T(Y)\}^2}{(Q'(\theta))^2\, Var\{T(Y)\}} = Var(T(Y)) = Var(\hat{\theta}(Y)),

which shows that CRLB equality is attained.
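To see Proposition 2 in action, consider the i.i.d. Bernoulli model of Example 3. The likelihood can be rewritten as

f_\theta(y) = \theta^{\sum_k y_k}(1-\theta)^{n-\sum_k y_k} = \underbrace{(1-\theta)^n}_{C(\theta)}\exp\Big(\underbrace{\log\tfrac{\theta}{1-\theta}}_{Q(\theta)}\,\underbrace{\textstyle\sum_{k=1}^n y_k}_{T(y)}\Big)\underbrace{1}_{h(y)},

which is a one-parameter exponential family with sufficient statistic $T(Y) = \sum_k Y_k$. By Proposition 2, $T(Y)$ achieves CRLB equality, and the same holds for the rescaled MLE $\hat{\theta}_{ML}(Y) = T(Y)/n$, since scaling by $1/n$ multiplies both $Var(\hat{\theta})$ and $(\frac{\partial}{\partial\theta}E[\hat{\theta}(Y)])^2$ by $1/n^2$.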
Example 10. Let $Y = [Y_1,\ldots,Y_n]$ be a sequence of i.i.d. random variables such that $Y_i \sim N(\theta,\sigma^2)$. Consider the estimator $\hat{\theta}(Y) = \frac{1}{n}\sum_{i=1}^n Y_i$. Is $\hat{\theta}(Y)$ an MVUE?

Solution: The CRLB is

Var(\hat{\theta}) \geq \frac{\left(\frac{\partial}{\partial\theta}E[\hat{\theta}(Y)]\right)^2}{I(\theta)},

where it is not difficult to show that $I(\theta) = \frac{n}{\sigma^2}$ and $E[Y_i] = \theta$, so that $E[\hat{\theta}(Y)] = \theta$. Therefore, the CRLB becomes

Var(\hat{\theta}) \geq \frac{1}{I(\theta)} = \frac{\sigma^2}{n}.

On the other hand, we can show that

Var(\hat{\theta}) = Var\left(\frac{1}{n}\sum_{i=1}^n Y_i\right) = \frac{1}{n^2}\sum_{i=1}^n Var(Y_i) = \frac{\sigma^2}{n} = \frac{1}{I(\theta)},

which means that CRLB equality is achieved. Therefore, the estimator

\hat{\theta}(Y) = \frac{1}{n}\sum_{i=1}^n Y_i

is an MVUE.
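A Monte Carlo confirmation of Example 10 (a minimal sketch, numpy assumed; constants illustrative):

import numpy as np

# The variance of the sample mean should match the CRLB sigma^2/n.
rng = np.random.default_rng(8)
n, theta, sigma2 = 20, 1.0, 3.0
means = rng.normal(theta, np.sqrt(sigma2), size=(200000, n)).mean(axis=1)
print(means.var(), sigma2 / n)   # both approx 0.15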
Example 11. Let $Y = [Y_1,\ldots,Y_n]$ be a sequence of independent random variables such that $Y_k = s_k(\theta) + N_k$, where $s_k(\theta)$ is a known function of $\theta$ (indexed by $k$), and the $N_k \sim N(0,\sigma^2)$ are i.i.d. Find the CRLB for any unbiased estimator $\hat{\theta}$.

Solution: The log-likelihood is

\log f_\theta(y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{k=1}^n (y_k - s_k(\theta))^2.

Consequently, we can show that

\frac{\partial^2}{\partial\theta^2}\log f_\theta(y) = \frac{1}{\sigma^2}\sum_{k=1}^n \left[(y_k - s_k(\theta))\,\frac{\partial^2 s_k(\theta)}{\partial\theta^2}\right] - \frac{1}{\sigma^2}\sum_{k=1}^n \left(\frac{\partial s_k(\theta)}{\partial\theta}\right)^2.

Accordingly, since $E[Y_k] = s_k(\theta)$,

E\left[\frac{\partial^2}{\partial\theta^2}\log f_\theta(Y)\right] = -\frac{1}{\sigma^2}\sum_{k=1}^n \left(\frac{\partial s_k(\theta)}{\partial\theta}\right)^2.

Therefore,

Var(\hat{\theta}) \geq \frac{1}{I(\theta)} = \frac{\sigma^2}{\sum_{k=1}^n \left(\frac{\partial}{\partial\theta}s_k(\theta)\right)^2}.

For example, if $s_k(\theta) = \theta$, then $Var(\hat{\theta}) \geq \frac{\sigma^2}{n}$. If $s_k(\theta) = A\cos(\omega_0 k + \theta)$, then $Var(\hat{\theta}) \geq \frac{2\sigma^2}{nA^2}$ approximately, using $\sum_{k=1}^n \sin^2(\omega_0 k+\theta) \approx \frac{n}{2}$.
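For signal models where $\partial s_k/\partial\theta$ is inconvenient to derive by hand, the CRLB of Example 11 can also be evaluated numerically. A minimal sketch (numpy assumed; constants illustrative) uses a central finite difference for the sinusoidal case and compares it against the $2\sigma^2/(nA^2)$ approximation:

import numpy as np

# CRLB = sigma^2 / sum_k (d s_k / d theta)^2 for s_k(theta) = A cos(w0 k + theta).
n, A, w0, theta, sigma2, h = 200, 1.5, 0.3, 0.7, 0.5, 1e-6
k = np.arange(1, n + 1)

ds = (A * np.cos(w0 * k + theta + h)
      - A * np.cos(w0 * k + theta - h)) / (2 * h)   # finite-difference derivative
crlb = sigma2 / np.sum(ds ** 2)
print(crlb, 2 * sigma2 / (n * A ** 2))              # both approx 2.2e-3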