18.650 Statistics for Applications
Chapter 3: Maximum Likelihood Estimation
Total variation distance (1)

Let $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ be a statistical model associated with a sample of i.i.d. r.v. $X_1, \ldots, X_n$. Assume that there exists $\theta^* \in \Theta$ such that $X_1 \sim \mathbb{P}_{\theta^*}$: $\theta^*$ is the true parameter.

Statistician's goal: given $X_1, \ldots, X_n$, find an estimator $\hat\theta = \hat\theta(X_1, \ldots, X_n)$ such that $\mathbb{P}_{\hat\theta}$ is close to $\mathbb{P}_{\theta^*}$ for the true parameter $\theta^*$.

This means: $|\mathbb{P}_{\hat\theta}(A) - \mathbb{P}_{\theta^*}(A)|$ is small for all $A \subset E$.

Definition. The total variation distance between two probability measures $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is defined by
$$\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \max_{A \subset E} \big|\mathbb{P}_\theta(A) - \mathbb{P}_{\theta'}(A)\big|.$$
Total variation distance (2)

Assume that $E$ is discrete (i.e., finite or countable). This includes Bernoulli, Binomial, Poisson, ...

Therefore $X$ has a PMF (probability mass function): $\mathbb{P}_\theta(X = x) = p_\theta(x)$ for all $x \in E$,
$$p_\theta(x) \ge 0, \qquad \sum_{x \in E} p_\theta(x) = 1.$$

The total variation distance between $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is a simple function of the PMFs $p_\theta$ and $p_{\theta'}$:
$$\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \sum_{x \in E} \big|p_\theta(x) - p_{\theta'}(x)\big|.$$
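The discrete-case formula is easy to check numerically. A minimal sketch (the `tv_discrete` helper, the dictionary representation of a PMF, and the Bernoulli example are illustrative choices, not from the slides):

```python
# Total variation between two discrete distributions, computed from their
# PMFs as (1/2) * sum over x of |p(x) - q(x)|.

def tv_discrete(p, q):
    """p, q: dicts mapping outcomes x in E to probabilities."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def bernoulli_pmf(a):
    return {1: a, 0: 1 - a}

# For Ber(a) vs Ber(b) the sum collapses to |a - b|:
# (1/2) * (|a - b| + |(1 - a) - (1 - b)|) = |a - b|.
```

For instance, `tv_discrete(bernoulli_pmf(0.3), bernoulli_pmf(0.5))` evaluates to $|0.3 - 0.5| = 0.2$.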
Total variation distance (3)

Assume that $E$ is continuous. This includes Gaussian, Exponential, ...

Assume that $X$ has a density: $\mathbb{P}_\theta(X \in A) = \int_A f_\theta(x)\,dx$ for all $A \subset E$,
$$f_\theta(x) \ge 0, \qquad \int_E f_\theta(x)\,dx = 1.$$

The total variation distance between $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is a simple function of the densities $f_\theta$ and $f_{\theta'}$:
$$\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \int_E \big|f_\theta(x) - f_{\theta'}(x)\big|\,dx.$$
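In the continuous case the integral can be approximated numerically. A sketch for two unit-variance Gaussians; the integration bounds, the step count, and the closed-form benchmark $\mathrm{TV}(\mathcal{N}(\mu_1, \sigma^2), \mathcal{N}(\mu_2, \sigma^2)) = 2\Phi(|\mu_1 - \mu_2|/(2\sigma)) - 1$ are assumptions made for this illustration:

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def tv_gaussians(mu1, mu2, lo=-12.0, hi=12.0, n=200_000):
    # Midpoint Riemann sum for (1/2) * integral of |f1 - f2| over [lo, hi];
    # the tails outside [lo, hi] are negligible for these parameters.
    h = (hi - lo) / n
    return 0.5 * h * sum(abs(normal_pdf(lo + (i + 0.5) * h, mu1)
                             - normal_pdf(lo + (i + 0.5) * h, mu2))
                         for i in range(n))

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Comparing `tv_gaussians(0.0, 1.0)` against `2 * std_normal_cdf(0.5) - 1` gives agreement to several decimal places.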
Total variation distance (4)

Properties of total variation:
- $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \mathrm{TV}(\mathbb{P}_{\theta'}, \mathbb{P}_\theta)$ (symmetric)
- $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \ge 0$ (nonnegative)
- If $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = 0$ then $\mathbb{P}_\theta = \mathbb{P}_{\theta'}$ (definite)
- $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \le \mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta''}) + \mathrm{TV}(\mathbb{P}_{\theta''}, \mathbb{P}_{\theta'})$ (triangle inequality)

These imply that total variation is a distance between probability distributions.
Total variation distance (5)

An estimation strategy:
1. Build an estimator $\widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$ for all $\theta \in \Theta$.
2. Then find $\hat\theta$ that minimizes the function $\theta \mapsto \widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$.

Problem: unclear how to build $\widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$!
Kullback-Leibler (KL) divergence (1)

There are many distances between probability measures that could replace total variation. Let us choose one that is more convenient.

Definition. The Kullback-Leibler (KL) divergence between two probability measures $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is defined by
$$\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \begin{cases} \displaystyle\sum_{x \in E} p_\theta(x) \log\Big(\frac{p_\theta(x)}{p_{\theta'}(x)}\Big) & \text{if } E \text{ is discrete,} \\[2ex] \displaystyle\int_E f_\theta(x) \log\Big(\frac{f_\theta(x)}{f_{\theta'}(x)}\Big)\,dx & \text{if } E \text{ is continuous.} \end{cases}$$
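The discrete-case definition can be computed directly. A sketch (the helper names and the Bernoulli example are illustrative; it assumes $q(x) > 0$ wherever $p(x) > 0$):

```python
import math

# KL divergence between two discrete distributions with PMFs p and q:
# KL(P, Q) = sum over x of p(x) * log(p(x) / q(x)).

def kl_discrete(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def bernoulli_pmf(a):
    return {1: a, 0: 1 - a}
```

Evaluating `kl_discrete(bernoulli_pmf(0.3), bernoulli_pmf(0.6))` and `kl_discrete(bernoulli_pmf(0.6), bernoulli_pmf(0.3))` gives two different positive numbers, exhibiting the asymmetry discussed on the next slide.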
Kullback-Leibler (KL) divergence (2)

Properties of KL-divergence:
- $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \ne \mathrm{KL}(\mathbb{P}_{\theta'}, \mathbb{P}_\theta)$ in general (not symmetric)
- $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \ge 0$
- If $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = 0$ then $\mathbb{P}_\theta = \mathbb{P}_{\theta'}$ (definite)
- $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \not\le \mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta''}) + \mathrm{KL}(\mathbb{P}_{\theta''}, \mathbb{P}_{\theta'})$ in general (no triangle inequality)

Hence KL is not a distance; it is called a divergence. The asymmetry is the key to our ability to estimate it!
Kullback-Leibler (KL) divergence (3)

$$\mathrm{KL}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) = \mathbb{E}_{\theta^*}\Big[\log\Big(\frac{p_{\theta^*}(X)}{p_\theta(X)}\Big)\Big] = \mathbb{E}_{\theta^*}\big[\log p_{\theta^*}(X)\big] - \mathbb{E}_{\theta^*}\big[\log p_\theta(X)\big]$$

So the function $\theta \mapsto \mathrm{KL}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta)$ is of the form: constant $-\ \mathbb{E}_{\theta^*}\big[\log p_\theta(X)\big]$.

This can be estimated: $\mathbb{E}_{\theta^*}[h(X)] \approx \frac{1}{n}\sum_{i=1}^n h(X_i)$ (by the LLN). Hence
$$\widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) = \text{constant} - \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i).$$
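The LLN step can be illustrated for a Bernoulli model: the sample average of $-\log p_\theta(X_i)$ tracks the population quantity $-\mathbb{E}_{\theta^*}[\log p_\theta(X)]$, which equals $\mathrm{KL}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta)$ up to the constant. The value of $\theta^*$, the sample size, and the seed below are illustrative choices:

```python
import math
import random

random.seed(0)
theta_star = 0.35
n = 50_000
sample = [1 if random.random() < theta_star else 0 for _ in range(n)]

def avg_neg_log_lik(theta):
    # -(1/n) * sum of log p_theta(X_i), computable from data alone
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in sample) / n

def population_target(theta):
    # -E_{theta*}[log p_theta(X)], which requires knowing theta*
    return -(theta_star * math.log(theta) + (1 - theta_star) * math.log(1 - theta))
```

By the LLN, `avg_neg_log_lik(theta)` is close to `population_target(theta)` for each fixed `theta`, and minimizing it over `theta` is the same as minimizing the KL estimate, since the unknown constant does not depend on `theta`.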
Kullback-Leibler (KL) divergence (4)

$$\widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) = \text{constant} - \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)$$

$$\min_{\theta \in \Theta} \widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) \;\Leftrightarrow\; \min_{\theta \in \Theta} \Big(-\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)\Big)$$
$$\Leftrightarrow\; \max_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i) \;\Leftrightarrow\; \max_{\theta \in \Theta} \log \prod_{i=1}^n p_\theta(X_i)$$
$$\Leftrightarrow\; \max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(X_i)$$

This is the maximum likelihood principle.
Interlude: maximizing/minimizing functions (1)

Note that
$$\min_{\theta \in \Theta} h(\theta) = -\max_{\theta \in \Theta} \big(-h(\theta)\big).$$
In this class, we focus on maximization.

Maximization of arbitrary functions can be difficult. Example: $\theta \mapsto \prod_{i=1}^n (\theta - X_i)$.
Interlude: maximizing/minimizing functions (2)

Definition. A twice differentiable function $h : \Theta \subset \mathbb{R} \to \mathbb{R}$ is said to be concave if its second derivative satisfies
$$h''(\theta) \le 0, \quad \forall \theta \in \Theta.$$
It is said to be strictly concave if the inequality is strict: $h''(\theta) < 0$.

Moreover, $h$ is said to be (strictly) convex if $-h$ is (strictly) concave, i.e. $h''(\theta) \ge 0$ ($h''(\theta) > 0$).

Examples:
- $\Theta = \mathbb{R}$, $h(\theta) = -\theta^2$
- $\Theta = (0, \infty)$, $h(\theta) = \sqrt{\theta}$
- $\Theta = (0, \infty)$, $h(\theta) = \log\theta$
- $\Theta = [0, \pi]$, $h(\theta) = \sin(\theta)$
- $\Theta = \mathbb{R}$, $h(\theta) = 2\theta^3$ (neither concave nor convex)
Interlude: maximizing/minimizing functions (3)

More generally for a multivariate function $h : \Theta \subset \mathbb{R}^d \to \mathbb{R}$, $d \ge 2$, define the gradient vector
$$\nabla h(\theta) = \Big(\frac{\partial h}{\partial \theta_1}(\theta), \ldots, \frac{\partial h}{\partial \theta_d}(\theta)\Big)^\top \in \mathbb{R}^d$$
and the Hessian matrix
$$\nabla^2 h(\theta) = \Big(\frac{\partial^2 h}{\partial \theta_i \partial \theta_j}(\theta)\Big)_{1 \le i, j \le d} \in \mathbb{R}^{d \times d}.$$

- $h$ is concave $\Leftrightarrow$ $x^\top \nabla^2 h(\theta)\, x \le 0$ for all $x \in \mathbb{R}^d$, $\theta \in \Theta$.
- $h$ is strictly concave $\Leftrightarrow$ $x^\top \nabla^2 h(\theta)\, x < 0$ for all $x \in \mathbb{R}^d \setminus \{0\}$, $\theta \in \Theta$.

Examples:
- $\Theta = \mathbb{R}^2$, $h(\theta) = -\theta_1^2 - 2\theta_2^2$ or $h(\theta) = -(\theta_1 - \theta_2)^2$
- $\Theta = (0, \infty)^2$, $h(\theta) = \log(\theta_1 + \theta_2)$
Interlude: maximizing/minimizing functions (4)

Strictly concave functions are easy to maximize: if they have a maximum, then it is unique. It is the unique solution of
$$h'(\theta) = 0,$$
or, in the multivariate case,
$$\nabla h(\theta) = 0 \in \mathbb{R}^d.$$

There are many algorithms to find it numerically: this is the theory of convex optimization. In this class, there is often a closed-form formula for the maximum.
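One such numerical algorithm is Newton's method, which solves $h'(\theta) = 0$ by repeated linearization. A sketch on an example not taken from the slides: $h(\theta) = \log\theta - \theta$ on $(0, \infty)$, which is strictly concave ($h''(\theta) = -1/\theta^2 < 0$) with unique maximizer $\theta = 1$:

```python
# Newton's method for maximizing a strictly concave function h:
# iterate theta <- theta - h'(theta) / h''(theta) until the step is tiny.

def newton_max(dh, d2h, theta0, tol=1e-10, max_iter=100):
    theta = theta0
    for _ in range(max_iter):
        step = dh(theta) / d2h(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# h(theta) = log(theta) - theta: h'(theta) = 1/theta - 1, h''(theta) = -1/theta^2
theta_hat = newton_max(lambda t: 1.0 / t - 1.0,
                       lambda t: -1.0 / t ** 2,
                       theta0=0.5)
```

Because $h$ is strictly concave, the iteration converges to the unique stationary point, here $\theta = 1$.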
Likelihood, Discrete case (1)

Let $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ be a statistical model associated with a sample of i.i.d. r.v. $X_1, \ldots, X_n$. Assume that $E$ is discrete (i.e., finite or countable).

Definition. The likelihood of the model is the map $L_n$ (or just $L$) defined as:
$$L_n : E^n \times \Theta \to \mathbb{R}, \qquad (x_1, \ldots, x_n, \theta) \mapsto \mathbb{P}_\theta[X_1 = x_1, \ldots, X_n = x_n].$$
Likelihood, Discrete case (2)

Example 1 (Bernoulli trials): If $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathrm{Ber}(p)$ for some $p \in (0,1)$:
- $E = \{0, 1\}$; $\Theta = (0, 1)$;
- for all $(x_1, \ldots, x_n) \in \{0,1\}^n$ and $p \in (0,1)$,
$$L(x_1, \ldots, x_n, p) = \prod_{i=1}^n \mathbb{P}_p[X_i = x_i] = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}\,(1-p)^{\,n - \sum_{i=1}^n x_i}.$$
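The Bernoulli log-likelihood, $\log L = (\sum_i x_i)\log p + (n - \sum_i x_i)\log(1-p)$, can be maximized numerically to preview the MLE result from later slides. The toy data and the grid-search approach are illustrative assumptions:

```python
import math

def bernoulli_log_lik(xs, p):
    # log L = (sum x_i) * log p + (n - sum x_i) * log(1 - p)
    s = sum(xs)
    return s * math.log(p) + (len(xs) - s) * math.log(1 - p)

xs = [1, 0, 1, 1, 0, 1, 0, 1]            # toy data: sum = 5, n = 8
grid = [i / 1000 for i in range(1, 1000)]  # p in {0.001, ..., 0.999}
p_hat = max(grid, key=lambda p: bernoulli_log_lik(xs, p))
```

The grid maximizer lands on the sample mean $\bar{x} = 5/8 = 0.625$, consistent with the closed-form MLE.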
Likelihood, Discrete case (3)

Example 2 (Poisson model): If $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathrm{Poiss}(\lambda)$ for some $\lambda > 0$:
- $E = \mathbb{N}$; $\Theta = (0, \infty)$;
- for all $(x_1, \ldots, x_n) \in \mathbb{N}^n$ and $\lambda > 0$,
$$L(x_1, \ldots, x_n, \lambda) = \prod_{i=1}^n \mathbb{P}_\lambda[X_i = x_i] = \prod_{i=1}^n e^{-\lambda}\frac{\lambda^{x_i}}{x_i!} = e^{-n\lambda}\,\frac{\lambda^{\sum_{i=1}^n x_i}}{x_1! \cdots x_n!}.$$
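Likewise for the Poisson model, $\log L = -n\lambda + (\sum_i x_i)\log\lambda - \sum_i \log(x_i!)$ can be maximized on a grid. The toy counts and the grid are illustrative assumptions:

```python
import math

def poisson_log_lik(xs, lam):
    # log L = -n*lam + (sum x_i) * log(lam) - sum log(x_i!)
    # math.lgamma(x + 1) computes log(x!)
    return (-len(xs) * lam + sum(xs) * math.log(lam)
            - sum(math.lgamma(x + 1) for x in xs))

xs = [2, 3, 1, 0, 4, 2]                   # toy counts: mean = 2.0
grid = [i / 100 for i in range(1, 1001)]  # lam in {0.01, ..., 10.0}
lam_hat = max(grid, key=lambda lam: poisson_log_lik(xs, lam))
```

Again the maximizer lands on the sample mean, previewing the closed-form MLE $\hat\lambda = \bar{X}_n$.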
Likelihood, Continuous case (1)

Let $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ be a statistical model associated with a sample of i.i.d. r.v. $X_1, \ldots, X_n$. Assume that all the $\mathbb{P}_\theta$ have density $f_\theta$.

Definition. The likelihood of the model is the map $L$ defined as:
$$L : E^n \times \Theta \to \mathbb{R}, \qquad (x_1, \ldots, x_n, \theta) \mapsto \prod_{i=1}^n f_\theta(x_i).$$
Likelihood, Continuous case (2)

Example 1 (Gaussian model): If $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, \sigma^2)$ for some $\mu \in \mathbb{R}$, $\sigma^2 > 0$:
- $E = \mathbb{R}$; $\Theta = \mathbb{R} \times (0, \infty)$;
- for all $(x_1, \ldots, x_n) \in \mathbb{R}^n$ and $(\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)$,
$$L(x_1, \ldots, x_n, \mu, \sigma^2) = \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\Big).$$
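The Gaussian log-likelihood can be evaluated directly and checked against the closed-form maximizer (sample mean and biased sample variance) that appears on a later slide. The toy data and comparison points are illustrative assumptions:

```python
import math

def gaussian_log_lik(xs, mu, sigma2):
    # log L = -(n/2) * log(2*pi*sigma^2) - (1/(2*sigma^2)) * sum (x_i - mu)^2
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

xs = [1.2, -0.3, 0.8, 2.1, 0.4]           # toy data
mu_hat = sum(xs) / len(xs)                 # sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)  # biased variance
```

No parameter pair should beat $(\hat\mu, \hat\sigma^2)$, so perturbing either coordinate can only decrease the log-likelihood.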
Maximum likelihood estimator (1)

Let $X_1, \ldots, X_n$ be an i.i.d. sample associated with a statistical model $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ and let $L$ be the corresponding likelihood.

Definition. The maximum likelihood estimator of $\theta$ is defined as
$$\hat\theta_n^{MLE} = \underset{\theta \in \Theta}{\operatorname{argmax}}\; L(X_1, \ldots, X_n, \theta),$$
provided it exists.

Remark (log-likelihood): In practice, we use the fact that
$$\hat\theta_n^{MLE} = \underset{\theta \in \Theta}{\operatorname{argmax}}\; \log L(X_1, \ldots, X_n, \theta).$$
Maximum likelihood estimator (2)

Examples:
- Bernoulli trials: $\hat{p}^{MLE} = \bar{X}_n$.
- Poisson model: $\hat\lambda^{MLE} = \bar{X}_n$.
- Gaussian model: $(\hat\mu^{MLE}, \hat\sigma^{2\,MLE}) = (\bar{X}_n, \hat{S}_n)$, where $\hat{S}_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$.
Maximum likelihood estimator (3)

Definition: Fisher information. Define the log-likelihood for one observation as
$$\ell(\theta) = \log L_1(X, \theta), \quad \theta \in \Theta \subset \mathbb{R}^d.$$
Assume that $\ell$ is a.s. twice differentiable. Under some regularity conditions, the Fisher information of the statistical model is defined as
$$I(\theta) = \mathbb{E}\big[\nabla \ell(\theta)\, \nabla \ell(\theta)^\top\big] - \mathbb{E}\big[\nabla \ell(\theta)\big]\,\mathbb{E}\big[\nabla \ell(\theta)\big]^\top = -\mathbb{E}\big[\nabla^2 \ell(\theta)\big].$$

If $\Theta \subset \mathbb{R}$, we get:
$$I(\theta) = \operatorname{var}\big[\ell'(\theta)\big] = -\mathbb{E}\big[\ell''(\theta)\big].$$
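As a worked one-dimensional example (the Bernoulli model; an illustration, not from the slides): with $\ell(p) = X\log p + (1-X)\log(1-p)$ we have $\ell''(p) = -X/p^2 - (1-X)/(1-p)^2$, so $I(p) = -\mathbb{E}[\ell''(p)] = 1/p + 1/(1-p) = 1/(p(1-p))$. This can be checked by taking the expectation over $X \in \{0, 1\}$ explicitly:

```python
# Fisher information of the Bernoulli model via I(p) = -E[l''(p)],
# with the expectation over X in {0, 1}, P(X = 1) = p.

def fisher_info_bernoulli(p):
    # l''(p) evaluated at an observation x
    d2l = lambda x: -x / p ** 2 - (1 - x) / (1 - p) ** 2
    return -(p * d2l(1) + (1 - p) * d2l(0))
```

The result matches the closed form $1/(p(1-p))$, which blows up as $p$ approaches 0 or 1, where a single observation is very informative.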
Maximum likelihood estimator (4)

Theorem. Let $\theta^* \in \Theta$ (the true parameter). Assume the following:
1. The model is identified;
2. For all $\theta \in \Theta$, the support of $\mathbb{P}_\theta$ does not depend on $\theta$;
3. $\theta^*$ is not on the boundary of $\Theta$;
4. $I(\theta)$ is invertible in a neighborhood of $\theta^*$;
5. A few more technical conditions.

Then $\hat\theta_n^{MLE}$ satisfies:
- $\hat\theta_n^{MLE} \xrightarrow{\;\mathbb{P}\;} \theta^*$ w.r.t. $\mathbb{P}_{\theta^*}$ (consistency);
- $\sqrt{n}\,\big(\hat\theta_n^{MLE} - \theta^*\big) \xrightarrow{\;(d)\;} \mathcal{N}\big(0, I(\theta^*)^{-1}\big)$ w.r.t. $\mathbb{P}_{\theta^*}$ (asymptotic normality).
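The asymptotic normality can be illustrated by simulation for the Bernoulli model, where $\hat{p}^{MLE} = \bar{X}_n$ and $I(p^*)^{-1} = p^*(1-p^*)$: the empirical variance of $\sqrt{n}\,(\hat{p} - p^*)$ across many replications should be close to $p^*(1-p^*)$. The choices of $p^*$, sample size, replication count, and seed are illustrative:

```python
import random
from statistics import pvariance

random.seed(1)
p_star, n, reps = 0.3, 500, 2000

draws = []
for _ in range(reps):
    # one replication: draw n Bernoulli(p_star) variables, record sqrt(n)*(p_hat - p_star)
    p_hat = sum(1 if random.random() < p_star else 0 for _ in range(n)) / n
    draws.append((n ** 0.5) * (p_hat - p_star))

# should be close to I(p_star)^{-1} = p_star * (1 - p_star) = 0.21
empirical_var = pvariance(draws)
```

With these settings the empirical variance lands within a few percent of $0.21$, as the theorem predicts.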
MIT OpenCourseWare
https://ocw.mit.edu

18.650 / 18.6501 Statistics for Applications
Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.