ECE 901 Lecture 3: Maximum Likelihood Estimation
R. Nowak, 5/7/2009

The focus of this lecture is to consider another approach to learning based on maximum likelihood estimation. Unlike the earlier approaches considered here, we are willing to make somewhat stronger assumptions about the relation between features and labels. These are quite reasonable in many settings, in particular in many imaging applications. Consider the classical signal plus noise model:

$$Y_i = f(i/n) + W_i, \quad i = 1, \dots, n,$$

where the $W_i$ are i.i.d. zero-mean noises. Furthermore, assume that the $W_i$ have a distribution characterized by a probability density function (p.d.f.) $p(w)$ for some known density $p(w)$. Then

$$Y_i \sim p\big(y - f(i/n)\big) \equiv p_{f_i}(y),$$

since $Y_i = f(i/n) + W_i$. In a setting like this it is quite common to consider the maximum likelihood approach: seek a most probable explanation for the observations. Define the likelihood of the data to be the p.d.f. of the observations,

$$(Y_1, \dots, Y_n) \sim \prod_{i=1}^n p_{f_i}(Y_i).$$

The maximum likelihood estimator seeks the model that maximizes the likelihood, or equivalently minimizes the negative log-likelihood

$$-\sum_{i=1}^n \log p_{f_i}(Y_i). \tag{1}$$

We immediately notice the similarity between the empirical risk we have seen before and the negative log-likelihood. We will see that we can regard maximum likelihood estimation as our familiar empirical risk minimization when the loss function is chosen appropriately. In the meantime, note that minimizing (1) yields our familiar squared-error loss if the $W_i$ are Gaussian. If the $W_i$ are Laplacian ($p_W(w) \propto e^{-c|w|}$) we get the sum of absolute errors. We can also consider non-additive models like the Poisson model (used often in medical imaging applications, like PET imaging),

$$Y_i \sim p\big(y \mid f(i/n)\big) = \frac{e^{-f(i/n)}\, f(i/n)^y}{y!},$$

which gives rise to the following negative log-likelihood,

$$-\log p\big(Y_i \mid f(i/n)\big) = f(i/n) - Y_i \log f(i/n) + \text{constant},$$

a very different loss function, but one quite appropriate for many imaging problems.
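To make the correspondence between noise models and per-sample losses concrete, here is a minimal Python sketch (not part of the original notes); the signal $f$, the sample size, and the noise constants below are illustrative assumptions, and constants that do not depend on $f$ are dropped from each loss.

```python
import numpy as np

# Per-sample negative log-likelihoods implied by three observation models,
# up to constants that do not depend on f.

def nll_gaussian(y, f, sigma=1.0):
    # Gaussian noise: -log p(y | f) = (y - f)^2 / (2 sigma^2) + const  -> squared error
    return (y - f) ** 2 / (2 * sigma ** 2)

def nll_laplacian(y, f, c=1.0):
    # Laplacian noise: -log p(y | f) = c |y - f| + const  -> absolute error
    return c * np.abs(y - f)

def nll_poisson(y, f):
    # Poisson counts: -log p(y | f) = f - y log f + const  (requires f > 0)
    return f - y * np.log(f)

# Evaluate the empirical risk (1/n) sum_i loss(Y_i, f(i/n)) under Poisson data.
n = 200
f = lambda t: 5.0 + 3.0 * np.sin(2 * np.pi * t)      # assumed "true" signal, positive
t = np.arange(1, n + 1) / n
Y = np.random.default_rng(0).poisson(f(t))
print(np.mean(nll_poisson(Y, f(t))))                  # empirical risk at the true f
```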
1 Maximum Likelihood Estimation

Before we investigate maximum likelihood estimation for model selection, let's review some of the basic concepts. Let $\Theta$ denote a parameter space (e.g., $\Theta = \mathbb{R}$, or $\Theta = \{\text{smooth functions}\}$). Assume we have observations

$$Y_i \overset{iid}{\sim} p_{\theta^*}(y), \quad i = 1, \dots, n,$$

where $\theta^* \in \Theta$ is a parameter determining the density of the $\{Y_i\}$. The ML estimator of $\theta^*$ is

$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(Y_i) = \arg\max_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \log p_\theta(Y_i) = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n -\log p_\theta(Y_i).$$

Note that by the strong law of large numbers

$$\frac{1}{n}\sum_{i=1}^n -\log p_\theta(Y_i) \overset{a.s.}{\longrightarrow} E[-\log p_\theta(Y)].$$

So we can use the negative log-likelihood as a proxy for $E[-\log p_\theta(Y)]$. Let's see why this is the thing to do (all expectations here are with respect to $Y \sim p_{\theta^*}$):

$$E[\log p_{\theta^*}(Y) - \log p_\theta(Y)] = E\left[\log \frac{p_{\theta^*}(Y)}{p_\theta(Y)}\right] = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)}\, p_{\theta^*}(y)\, dy \equiv K(p_{\theta^*}, p_\theta),$$

the KL divergence, which is $\ge 0$ with equality iff $p_\theta = p_{\theta^*}$. Here $K$ is the Kullback-Leibler divergence between two densities. It is a measure of the distinguishability between two different random variables. It is not a symmetric function, so the order of the arguments is important. Furthermore, it is always non-negative, and zero only if the two densities are identical:

$$E\left[\log \frac{p_\theta(Y)}{p_{\theta^*}(Y)}\right] \le \log E\left[\frac{p_\theta(Y)}{p_{\theta^*}(Y)}\right] = \log \int \frac{p_\theta(y)}{p_{\theta^*}(y)}\, p_{\theta^*}(y)\, dy = \log \int p_\theta(y)\, dy = \log 1 = 0,$$

by Jensen's inequality, and therefore $K(p_{\theta^*}, p_\theta) = -E[\log(p_\theta(Y)/p_{\theta^*}(Y))] \ge 0$. By showing that $E[\log p_{\theta^*}(Y) - \log p_\theta(Y)] \ge 0$ we immediately see that minimizing $E[-\log p_\theta(Y)]$ with respect to $\theta$ gets us close to the true model $\theta^*$, which is exactly what we want to do.
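As a quick numerical sanity check (an assumption of this write-up, not something worked out in the notes), the sketch below uses a Gaussian family with $\sigma = 1$: the gap between the empirical negative log-likelihood at a candidate $\theta$ and at $\theta^*$ should approach $K(p_{\theta^*}, p_\theta) = (\theta - \theta^*)^2/2$ as $n$ grows.

```python
import numpy as np

# Empirical negative log-likelihood gap vs. the KL divergence, N(theta, 1) family.
rng = np.random.default_rng(0)
theta_star, theta, n = 0.0, 1.5, 200_000
Y = rng.normal(theta_star, 1.0, size=n)

def neg_loglik(theta, y):
    # -log p_theta(y) for a N(theta, 1) density
    return 0.5 * np.log(2 * np.pi) + 0.5 * (y - theta) ** 2

gap = np.mean(neg_loglik(theta, Y)) - np.mean(neg_loglik(theta_star, Y))
print(gap, (theta - theta_star) ** 2 / 2)   # the two numbers should be close
```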
1.1 Likelihood as a Loss Function

We can restate the maximum likelihood estimator in the general terms we are using in this course. We have i.i.d. observations drawn from an unknown distribution,

$$Y_i \overset{i.i.d.}{\sim} p_{\theta^*}, \quad i \in \{1, \dots, n\},$$

where $\theta^* \in \Theta$. We can view $p_{\theta^*}$ as a member of a parametric class of distributions, $\mathcal{P} = \{p_\theta\}_{\theta \in \Theta}$. Our goal is to use the observations $\{Y_i\}$ to select an appropriate distribution (e.g., model) from $\mathcal{P}$. We would like the selected distribution to be close to $p_{\theta^*}$ in some sense.

We use the negative log-likelihood loss function, defined as

$$\ell(\theta, Y_i) = -\log p_\theta(Y_i).$$

The empirical risk is

$$\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^n -\log p_\theta(Y_i).$$

We select the distribution that minimizes the empirical risk,

$$\min_{p \in \mathcal{P}} \frac{1}{n}\sum_{i=1}^n -\log p(Y_i) \;\equiv\; \min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n -\log p_\theta(Y_i).$$

In other words, the distribution we select is $\hat{p}_n := p_{\hat{\theta}_n}$, where

$$\hat{\theta}_n = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n -\log p_\theta(Y_i).$$

The risk is defined as

$$R(\theta) = E[\ell(\theta, Y)] = E[-\log p_\theta(Y)],$$

and the excess risk of $\theta$ is defined as

$$R(\theta) - R(\theta^*) = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)}\, p_{\theta^*}(y)\, dy = K(p_{\theta^*}, p_\theta).$$

We recognize that the excess risk corresponding to this loss function is simply the Kullback-Leibler (KL) divergence, or relative entropy, denoted by $K(p_{\theta^*}, p_\theta)$. It is easy to see that $K(p_{\theta^*}, p_\theta)$ is always non-negative and is zero if and only if $p_\theta = p_{\theta^*}$. This shows that $\theta^*$ minimizes the risk. The KL divergence measures how different two probability distributions are, and therefore it is natural to measure the convergence of maximum likelihood procedures in terms of it.

1.2 Convergence of Log-Likelihood to KL Divergence

Since $\hat{\theta}_n$ maximizes the likelihood over $\theta \in \Theta$, we have

$$\prod_{i=1}^n p_{\hat{\theta}_n}(Y_i) \ge \prod_{i=1}^n p_{\theta^*}(Y_i).$$

Therefore,

$$\frac{1}{n}\sum_{i=1}^n \log p_{\theta^*}(Y_i) - \frac{1}{n}\sum_{i=1}^n \log p_{\hat{\theta}_n}(Y_i) \le 0,$$

or, adding and subtracting $K(p_{\theta^*}, p_{\hat{\theta}_n})$ and re-arranging,

$$K(p_{\theta^*}, p_{\hat{\theta}_n}) \;\le\; K(p_{\theta^*}, p_{\hat{\theta}_n}) - \frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)}.$$

Notice that the quantity

$$\frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)}$$

is an empirical average whose mean is $K(p_{\theta^*}, p_\theta)$. By the law of large numbers, for each $\theta \in \Theta$,

$$\frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)} - K(p_{\theta^*}, p_\theta) \overset{a.s.}{\longrightarrow} 0.$$

If this also holds for the sequence $\{\hat{\theta}_n\}$, then we have

$$K(p_{\theta^*}, p_{\hat{\theta}_n}) - \frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)} \longrightarrow 0 \quad \text{as } n \to \infty,$$

which, combined with the inequality above and the non-negativity of the KL divergence, implies that

$$K(p_{\theta^*}, p_{\hat{\theta}_n}) \longrightarrow 0,$$

which often implies that $p_{\hat{\theta}_n} \to p_{\theta^*}$ and $\hat{\theta}_n \to \theta^*$ in some appropriate sense (e.g., pointwise or in norm).
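To connect this back to empirical risk minimization in a case where the minimizer has no closed form, here is a small sketch (the Gamma family and all parameter values are assumptions chosen purely for illustration, not an example from the notes) that minimizes the empirical negative log-likelihood numerically.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma

# Empirical risk minimization with the negative log-likelihood loss,
# over the shape parameter of a Gamma(theta, scale=1) family.
rng = np.random.default_rng(1)
theta_star, n = 3.0, 5000
Y = gamma.rvs(a=theta_star, size=n, random_state=rng)

def empirical_risk(theta):
    # Rhat_n(theta) = (1/n) sum_i -log p_theta(Y_i)
    return -np.mean(gamma.logpdf(Y, a=theta))

res = minimize_scalar(empirical_risk, bounds=(0.1, 10.0), method="bounded")
print(res.x, theta_star)   # the minimizer should be close to theta_star
```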
Example 1. Gaussian Distributions. Let

$$p_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\theta)^2}{2\sigma^2}}, \quad \theta \in \Theta = \mathbb{R}, \quad \{Y_i\}_{i=1}^n \overset{iid}{\sim} p_{\theta^*}(y).$$

Then

$$K(p_{\theta^*}, p_\theta) = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)}\, p_{\theta^*}(y)\, dy$$
$$= \frac{1}{2\sigma^2} \int \big[(y-\theta)^2 - (y-\theta^*)^2\big]\, p_{\theta^*}(y)\, dy$$
$$= \frac{1}{2\sigma^2}\Big(E_{\theta^*}\big[Y^2 - 2\theta Y + \theta^2\big] - E_{\theta^*}\big[Y^2 - 2\theta^* Y + (\theta^*)^2\big]\Big)$$
$$= \frac{1}{2\sigma^2}\Big(-2(\theta - \theta^*)\,E_{\theta^*}[Y] + \theta^2 - (\theta^*)^2\Big)$$
$$= \frac{1}{2\sigma^2}\Big(-2(\theta - \theta^*)\,\theta^* + \theta^2 - (\theta^*)^2\Big)$$
$$= \frac{(\theta - \theta^*)^2}{2\sigma^2}.$$

This is minimized at $\theta = \theta^*$; equivalently, $\theta^*$ maximizes $E[\log p_\theta(Y)]$ with respect to $\theta \in \Theta$. The ML estimator is

$$\hat{\theta}_n = \arg\max_\theta \Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \theta)^2\Big\} = \arg\min_\theta \Big\{\sum_{i=1}^n (Y_i - \theta)^2\Big\} = \frac{1}{n}\sum_{i=1}^n Y_i,$$

the sample mean.
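A minimal numerical check of Example 1 (with assumed values for $\theta^*$, $\theta$, $\sigma$; not from the notes): compute $K(p_{\theta^*}, p_\theta)$ by quadrature and compare it to the closed form, and confirm that the ML estimate, the sample mean, lands near $\theta^*$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

theta_star, theta, sigma = 0.0, 2.0, 1.5

# KL divergence by numerical integration (logpdf difference avoids log(0/0) in the tails).
integrand = lambda y: norm.pdf(y, theta_star, sigma) * (
    norm.logpdf(y, theta_star, sigma) - norm.logpdf(y, theta, sigma))
kl, _ = quad(integrand, -np.inf, np.inf)
print(kl, (theta - theta_star) ** 2 / (2 * sigma ** 2))   # should agree

# The ML estimator in this model is the sample mean.
Y = np.random.default_rng(2).normal(theta_star, sigma, size=10_000)
print(Y.mean())   # close to theta_star
```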
1.3 Hellinger Distance

The KL divergence is not a distance function; in particular, it is not symmetric,

$$K(p_{\theta_1}, p_{\theta_2}) \ne K(p_{\theta_2}, p_{\theta_1}) \quad \text{in general}.$$

Therefore, it is often more convenient to work with the Hellinger metric,

$$H(p_{\theta_1}, p_{\theta_2}) = \left(\int \Big(\sqrt{p_{\theta_1}(y)} - \sqrt{p_{\theta_2}(y)}\Big)^2 dy\right)^{1/2}.$$

The Hellinger metric is non-negative and symmetric, $H(p_{\theta_1}, p_{\theta_2}) = H(p_{\theta_2}, p_{\theta_1})$, and therefore it is a distance measure. Furthermore, the squared Hellinger distance lower bounds the KL divergence, so convergence in KL divergence implies convergence in Hellinger distance.

Proposition 1. $H^2(p_{\theta^*}, p_\theta) \le K(p_{\theta^*}, p_\theta)$.

Proof:

$$H^2(p_{\theta^*}, p_\theta) = \int \Big(\sqrt{p_{\theta^*}(y)} - \sqrt{p_\theta(y)}\Big)^2 dy$$
$$= \int p_{\theta^*}(y)\, dy + \int p_\theta(y)\, dy - 2\int \sqrt{p_{\theta^*}(y)\, p_\theta(y)}\, dy$$
$$= 2\left(1 - \int \sqrt{\frac{p_\theta(y)}{p_{\theta^*}(y)}}\; p_{\theta^*}(y)\, dy\right), \quad \text{since } \int p_{\theta^*}(y)\, dy = \int p_\theta(y)\, dy = 1$$
$$= 2\left(1 - E_{\theta^*}\Big[\sqrt{p_\theta(Y)/p_{\theta^*}(Y)}\Big]\right)$$
$$\le -2\log E_{\theta^*}\Big[\sqrt{p_\theta(Y)/p_{\theta^*}(Y)}\Big], \quad \text{since } 1 - x \le -\log x$$
$$\le -2\, E_{\theta^*}\Big[\log \sqrt{p_\theta(Y)/p_{\theta^*}(Y)}\Big], \quad \text{by Jensen's inequality}$$
$$= E_{\theta^*}\big[\log\big(p_{\theta^*}(Y)/p_\theta(Y)\big)\big] = K(p_{\theta^*}, p_\theta).$$

Note that in the proof we also showed that

$$H^2(p_{\theta^*}, p_\theta) = 2\left(1 - \int \sqrt{p_{\theta^*}(y)\, p_\theta(y)}\, dy\right),$$

and using the fact $1 - x \le -\log x$ again, we have

$$H^2(p_{\theta^*}, p_\theta) \le -2\log \int \sqrt{p_{\theta^*}(y)\, p_\theta(y)}\, dy.$$

The quantity inside the log is called the affinity between $p_{\theta^*}$ and $p_\theta$:

$$A(p_{\theta^*}, p_\theta) = \int \sqrt{p_{\theta^*}(y)\, p_\theta(y)}\, dy.$$

This is another measure of closeness between $p_{\theta^*}$ and $p_\theta$.
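The chain of inequalities in the proof, $H^2 \le -2\log A \le K$, can be checked numerically. The sketch below (an assumed illustration, not from the notes) uses two Gaussians with different means and variances and computes all three quantities by quadrature.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

ps = lambda y: norm.pdf(y, 0.0, 1.0)    # plays the role of p_theta_star
pt = lambda y: norm.pdf(y, 1.0, 1.5)    # plays the role of p_theta

h2,  _ = quad(lambda y: (np.sqrt(ps(y)) - np.sqrt(pt(y))) ** 2, -np.inf, np.inf)
aff, _ = quad(lambda y: np.sqrt(ps(y) * pt(y)), -np.inf, np.inf)
kl,  _ = quad(lambda y: ps(y) * (norm.logpdf(y, 0.0, 1.0) - norm.logpdf(y, 1.0, 1.5)),
              -np.inf, np.inf)

print(h2, 2 * (1 - aff))          # H^2 = 2 (1 - A): two routes to the same number
print(h2, -2 * np.log(aff), kl)   # the chain H^2 <= -2 log A <= K from the proof
```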
Example 2. Gaussian Distributions. Let $p_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\theta)^2}{2\sigma^2}}$ as before. Then

$$-2\log \int \sqrt{p_{\theta^*}(y)\, p_\theta(y)}\, dy$$
$$= -2\log \int \sqrt{\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\theta^*)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\theta)^2}{2\sigma^2}}}\, dy$$
$$= -2\log \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\theta^*)^2}{4\sigma^2} - \frac{(y-\theta)^2}{4\sigma^2}}\, dy$$
$$= -2\log \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(y - \frac{\theta + \theta^*}{2}\right)^2}{2\sigma^2}}\, e^{-\frac{(\theta - \theta^*)^2}{8\sigma^2}}\, dy$$
$$= -2\log e^{-\frac{(\theta - \theta^*)^2}{8\sigma^2}}$$
$$= \frac{(\theta - \theta^*)^2}{4\sigma^2}.$$

Thus, for Gaussian distributions,

$$-2\log A(p_{\theta^*}, p_\theta) = \frac{(\theta - \theta^*)^2}{4\sigma^2}, \qquad H^2(p_{\theta^*}, p_\theta) \le \frac{(\theta - \theta^*)^2}{4\sigma^2}.$$

Summary

1. $Y_i \overset{iid}{\sim} p_{\theta^*}$.

2. The maximum likelihood estimator maximizes the empirical average $\frac{1}{n}\sum_{i=1}^n \log p_\theta(Y_i)$ (our empirical risk is the negative log-likelihood); $\theta^*$ maximizes the expectation $E[\log p_\theta(Y)]$ (the risk is the expected negative log-likelihood).

3. $\frac{1}{n}\sum_{i=1}^n \log p_\theta(Y_i) \overset{a.s.}{\longrightarrow} E[\log p_\theta(Y)]$, so we expect some sort of concentration of measure.

4. In particular, since $\frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)} \overset{a.s.}{\longrightarrow} K(p_{\theta^*}, p_\theta)$, we might expect that $K(p_{\theta^*}, p_{\hat{\theta}_n}) \to 0$ for the sequence of estimates $\{p_{\hat{\theta}_n}\}$.

So, the point is that the maximum likelihood estimator is just a special case of a loss function in learning. Due to its special structure, we are naturally led to consider KL divergences, Hellinger distances, and affinities.
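A small Monte Carlo sketch of item 4 (an assumed illustration, not part of the notes): for the Gaussian family of Example 1, $K(p_{\theta^*}, p_{\hat{\theta}_n}) = (\hat{\theta}_n - \theta^*)^2/(2\sigma^2)$, which shrinks as $n$ grows.

```python
import numpy as np

# KL divergence between the MLE plug-in density and the truth, Gaussian family.
rng = np.random.default_rng(3)
theta_star, sigma = 1.0, 2.0

for n in [10, 100, 1000, 10000]:
    Y = rng.normal(theta_star, sigma, size=n)
    theta_hat = Y.mean()                                  # ML estimate = sample mean
    kl = (theta_hat - theta_star) ** 2 / (2 * sigma ** 2) # closed form from Example 1
    print(n, kl)                                          # tends to 0 as n grows
```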