ECE901 Spring 2007 Statistical Learning Theory
Instructor: R. Nowak

Lecture 3: Maximum Likelihood Estimation

Summary of Last Lecture

In the last lecture we derived a risk (MSE) bound for regression problems; i.e., select an $f \in \mathcal{F}$ so that $E[(f(X)-Y)^2] - E[(f^*(X)-Y)^2]$ is small, where $f^*(x) = E[Y \mid X = x]$. The result is summarized below.

Theorem 1 (Complexity Regularization with Squared Error Loss) Let $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = [-b/2, b/2]$, $\{X_i, Y_i\}_{i=1}^n$ iid, $P_{XY}$ unknown, $\mathcal{F} = \{\text{collection of candidate functions}\}$, $f : \mathbb{R}^d \to \mathcal{Y}$, and $R(f) = E[(f(X)-Y)^2]$. Let $c(f)$, $f \in \mathcal{F}$, be positive numbers satisfying $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$, and select a function from $\mathcal{F}$ according to
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \frac{c(f) \log 2}{n\epsilon} \right\},$$
with $\epsilon \le \frac{3}{5b^2}$ and $\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n (f(X_i) - Y_i)^2$. Then
$$E[R(\hat{f}_n)] - R(f^*) \le \left(\frac{1+\alpha}{1-\alpha}\right) \min_{f \in \mathcal{F}} \left\{ R(f) - R(f^*) + \frac{c(f) \log 2}{n\epsilon} \right\} + O\!\left(\frac{1}{n}\right),$$
where $\alpha = \frac{\epsilon b^2}{1 - \epsilon b^2/3}$.

Maximum Likelihood Estimation

The focus of this lecture is to consider another approach to learning based on maximum likelihood estimation. Consider the classical signal plus noise model:
$$Y_i = f^*\!\left(\frac{i}{n}\right) + W_i, \quad i = 1, \dots, n,$$
where the $W_i$ are iid zero-mean noises. Furthermore, assume that $W_i \sim p(w)$ for some known density $p(w)$. Then
$$Y_i \sim p\!\left(y - f^*\!\left(\frac{i}{n}\right)\right) \equiv p_{f^*_i}(y),$$
since $Y_i - f^*(i/n) = W_i$. A very common and useful loss function to consider is
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \left(-\log p_{f_i}(Y_i)\right),$$
where $p_{f_i}(y) \equiv p(y - f(i/n))$ for a candidate function $f$. Minimizing $\hat{R}_n(f)$ with respect to $f$ is equivalent to maximizing $\frac{1}{n}\sum_{i=1}^n \log p_{f_i}(Y_i)$,
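The connection between the Gaussian noise model and squared error loss can be checked directly. The sketch below (function names are mine, not from the notes) assumes unit-variance Gaussian noise, in which case the empirical negative log-likelihood differs from half the average squared error only by a known constant, so both are minimized by the same $f$:

```python
import math

def gaussian_nll(f_vals, y_vals, sigma=1.0):
    """Average negative log-likelihood of y_i under Y_i ~ N(f_i, sigma^2)."""
    n = len(y_vals)
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (y - f) ** 2 / (2 * sigma ** 2)
               for f, y in zip(f_vals, y_vals)) / n

def mse(f_vals, y_vals):
    """Average squared error (1/n) sum_i (f_i - y_i)^2."""
    return sum((f - y) ** 2 for f, y in zip(f_vals, y_vals)) / len(y_vals)

# With sigma = 1, the NLL is 0.5*log(2*pi) + mse/2, so the additive
# constant does not depend on f and the two losses share a minimizer.
f = [0.0, 1.0, 2.0]
y = [0.1, 0.9, 2.3]
print(gaussian_nll(f, y) - 0.5 * mse(f, y))  # 0.5*log(2*pi), independent of f
```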
or, equivalently, maximizing the likelihood $\prod_{i=1}^n p_{f_i}(Y_i)$. Thus, using the negative log-likelihood as a loss function leads to maximum likelihood estimation. If the $W_i$ are iid zero-mean Gaussian r.v.s, then this is just the squared error loss we considered last time. If the $W_i$ are Laplacian distributed, e.g., $p(w) \propto e^{-|w|}$, then we obtain the absolute error, or $L_1$, loss function. We can also handle non-additive models such as the Poisson model
$$Y_i \sim \text{Poisson}(f^*(i/n)).$$
In this case
$$P(y \mid f^*(i/n)) = e^{-f^*(i/n)} \frac{[f^*(i/n)]^y}{y!},$$
$$-\log P(Y_i \mid f^*(i/n)) = f^*(i/n) - Y_i \log\left(f^*(i/n)\right) + \text{constant},$$
which is a very different loss function, but quite appropriate for many imaging problems.

Before we investigate maximum likelihood estimation for model selection, let's review some of the basic concepts. Let $\Theta$ denote a parameter space (e.g., $\Theta = \mathbb{R}$), and assume we have observations
$$Y_i \overset{iid}{\sim} p_{\theta^*}(y), \quad i = 1, \dots, n,$$
where $\theta^* \in \Theta$ is a parameter determining the density of the $\{Y_i\}$. The ML estimator of $\theta^*$ is
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(Y_i) = \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log p_\theta(Y_i) = \arg\min_{\theta \in \Theta} \sum_{i=1}^n -\log p_\theta(Y_i).$$
$\theta^*$ maximizes the expected log-likelihood. To see this, let's compare the expected log-likelihood of $\theta^*$ with that of any other $\theta \in \Theta$:
$$E[\log p_{\theta^*}(Y) - \log p_\theta(Y)] = E\left[\log \frac{p_{\theta^*}(Y)}{p_\theta(Y)}\right] = \int \log\left(\frac{p_{\theta^*}(y)}{p_\theta(y)}\right) p_{\theta^*}(y)\, dy = K(p_{\theta^*}, p_\theta),$$
the KL divergence, which is $\ge 0$ with equality iff $p_{\theta^*} = p_\theta$. Why?
$$E\left[\log \frac{p_\theta(Y)}{p_{\theta^*}(Y)}\right] \le \log E\left[\frac{p_\theta(Y)}{p_{\theta^*}(Y)}\right] = \log \int p_\theta(y)\, dy = 0 \implies K(p_{\theta^*}, p_\theta) \ge 0.$$
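For discrete distributions, the two key properties of the KL divergence just used — non-negativity, with equality iff the distributions agree, and lack of symmetry — are easy to check numerically. A minimal sketch (the function name is mine):

```python
import math

def kl(p, q):
    """K(p, q) = sum_y p(y) log(p(y)/q(y)) for discrete distributions.

    Terms with p(y) = 0 contribute nothing, so they are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(kl(p, p))              # 0.0: K(p, p) = 0
print(kl(p, q) > 0)          # True: K(p, q) > 0 when p != q
print(kl(p, q), kl(q, p))    # differ: KL is not symmetric
```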
3.1 Likelihood as a Loss Function

We can restate the maximum likelihood estimator in the general terms we are using in this course. We have i.i.d. observations drawn from an unknown distribution:
$$Y_i \overset{iid}{\sim} p_{\theta^*}, \quad i \in \{1, \dots, n\},$$
where $\theta^* \in \Theta$. We can view $p_{\theta^*}$ as a member of a parametric class of distributions, $\mathcal{P} = \{p_\theta\}_{\theta \in \Theta}$. Our goal is to use the observations $\{Y_i\}$ to select an appropriate distribution (e.g., model) from $\mathcal{P}$. We would like the selected distribution to be close to $p_{\theta^*}$ in some sense. We use the negative log-likelihood loss function, defined as
$$\ell(\theta, Y_i) = -\log p_\theta(Y_i).$$
The empirical risk is
$$\hat{R}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i).$$
We select the distribution that minimizes the empirical risk:
$$\min_{\theta \in \Theta} -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i) = \min_{p \in \mathcal{P}} -\frac{1}{n} \sum_{i=1}^n \log p(Y_i).$$
In other words, the distribution we select is $\hat{p} := p_{\hat{\theta}_n}$, where
$$\hat{\theta}_n = \arg\min_{\theta \in \Theta} \hat{R}_n(\theta).$$
The risk is defined as
$$R(\theta) = E[\ell(\theta, Y)] = E[-\log p_\theta(Y)].$$
And the excess risk of $\theta$ is defined as
$$R(\theta) - R(\theta^*) = \int \log\left(\frac{p_{\theta^*}(y)}{p_\theta(y)}\right) p_{\theta^*}(y)\, dy \equiv K(p_{\theta^*}, p_\theta).$$
We recognize that the excess risk corresponding to this loss function is simply the Kullback-Leibler (KL) divergence, or relative entropy, denoted by $K(p_{\theta^*}, p_\theta)$. It is easy to see that $K(p_{\theta^*}, p_\theta)$ is always non-negative and is zero if and only if $p_\theta = p_{\theta^*}$. This shows that $\theta^*$ minimizes the risk. The KL divergence measures how different two probability distributions are, and it is therefore natural to use it to measure the convergence of maximum likelihood procedures.

3.2 Convergence of Log-Likelihood to KL Divergence

Since $\hat{\theta}_n$ maximizes the likelihood over $\theta \in \Theta$, we have
$$\frac{1}{n}\sum_{i=1}^n \log p_{\hat{\theta}_n}(Y_i) \ge \frac{1}{n}\sum_{i=1}^n \log p_{\theta^*}(Y_i) \implies \frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)} \le 0.$$
Therefore,
$$\frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)} - K(p_{\theta^*}, p_{\hat{\theta}_n}) + K(p_{\theta^*}, p_{\hat{\theta}_n}) \le 0,$$
or, re-arranging,
$$K(p_{\theta^*}, p_{\hat{\theta}_n}) \le K(p_{\theta^*}, p_{\hat{\theta}_n}) - \frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)}.$$
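The empirical average on the right-hand side can be simulated. The sketch below (names and the choice of unit-variance Gaussians are mine) draws $Y_i \sim N(\theta^*, 1)$ and averages the log-ratio $\log(p_{\theta^*}(Y_i)/p_\theta(Y_i)) = ((Y_i - \theta)^2 - (Y_i - \theta^*)^2)/2$, which by the law of large numbers should approach $K(p_{\theta^*}, p_\theta) = (\theta^* - \theta)^2/2$:

```python
import random

def empirical_kl(theta_star, theta, n, seed=0):
    """(1/n) sum_i log(p_theta*(Y_i) / p_theta(Y_i)) with Y_i ~ N(theta*, 1).

    For unit-variance Gaussians the log-ratio simplifies to
    ((y - theta)^2 - (y - theta*)^2) / 2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = rng.gauss(theta_star, 1.0)
        total += ((y - theta) ** 2 - (y - theta_star) ** 2) / 2.0
    return total / n

true_kl = (1.0 - 0.0) ** 2 / 2  # K(p_1, p_0) = 0.5 for theta* = 1, theta = 0
print(true_kl, empirical_kl(1.0, 0.0, n=100_000))  # empirical average near 0.5
```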
Notice that the quantity $\frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)}$ is an empirical average whose mean is $K(p_{\theta^*}, p_\theta)$. By the law of large numbers, for each $\theta \in \Theta$,
$$\frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)} - K(p_{\theta^*}, p_\theta) \overset{a.s.}{\to} 0.$$
If this also holds for the sequence $\{\hat{\theta}_n\}$, then we have
$$K(p_{\theta^*}, p_{\hat{\theta}_n}) - \frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)} \to 0 \text{ as } n \to \infty,$$
which implies that
$$K(p_{\theta^*}, p_{\hat{\theta}_n}) \to 0,$$
which often implies that
$$p_{\hat{\theta}_n} \to p_{\theta^*}, \quad \text{i.e., } \hat{\theta}_n \to \theta^*,$$
in some appropriate sense (e.g., pointwise or in norm).

Example 1 (Gaussian Distributions) Let
$$p_\theta(y) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(y-\theta)^2}{2}}, \quad \Theta = \mathbb{R}, \quad \{Y_i\} \overset{iid}{\sim} p_{\theta^*}(y).$$
Then
$$K(p_{\theta^*}, p_\theta) = \int \log\left(\frac{p_{\theta^*}(y)}{p_\theta(y)}\right) p_{\theta^*}(y)\, dy = \frac{1}{2}\int \left[(y-\theta)^2 - (y-\theta^*)^2\right] p_{\theta^*}(y)\, dy$$
$$= \frac{1}{2}\left( E_{\theta^*}[(Y-\theta)^2] - E_{\theta^*}[(Y-\theta^*)^2] \right) = \frac{1}{2} E_{\theta^*}[Y^2 - 2Y\theta + \theta^2] - \frac{1}{2}$$
$$= \frac{(\theta^*)^2 + 1}{2} - \theta^*\theta + \frac{\theta^2}{2} - \frac{1}{2} = \frac{(\theta^* - \theta)^2}{2}.$$
So $\theta^*$ maximizes $E[\log p_\theta(Y)]$ with respect to $\theta \in \Theta$, and
$$\hat{\theta}_n = \arg\max_{\theta} \left\{ -\frac{1}{2}\sum_{i=1}^n (Y_i - \theta)^2 \right\} = \arg\min_{\theta} \left\{ \sum_{i=1}^n (Y_i - \theta)^2 \right\} = \frac{1}{n}\sum_{i=1}^n Y_i.$$
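A quick simulation of the Gaussian example (helper names are mine): the ML estimate is the sample mean, and plugging it into $K(p_{\theta^*}, p_{\hat{\theta}_n}) = (\theta^* - \hat{\theta}_n)^2/2$ shows the KL divergence shrinking as $n$ grows:

```python
import random

def mle_gaussian_mean(ys):
    """ML estimate of theta for Y_i ~ N(theta, 1): the sample mean."""
    return sum(ys) / len(ys)

def kl_gauss(theta1, theta2):
    """K(p_theta1, p_theta2) = (theta1 - theta2)^2 / 2, unit-variance Gaussians."""
    return (theta1 - theta2) ** 2 / 2

rng = random.Random(1)
theta_star = 2.0
for n in [10, 1000, 100_000]:
    ys = [rng.gauss(theta_star, 1.0) for _ in range(n)]
    theta_hat = mle_gaussian_mean(ys)
    # theta_hat concentrates near theta_star, so the KL term becomes small
    print(n, theta_hat, kl_gauss(theta_star, theta_hat))
```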
3.3 Hellinger Distance

The KL divergence is not a distance function; in general,
$$K(p_{\theta_1}, p_{\theta_2}) \ne K(p_{\theta_2}, p_{\theta_1}).$$
Therefore, it is often more convenient to work with the Hellinger metric,
$$H(p_{\theta_1}, p_{\theta_2}) = \left( \int \left( \sqrt{p_{\theta_1}(y)} - \sqrt{p_{\theta_2}(y)} \right)^2 dy \right)^{1/2}.$$
The Hellinger metric is symmetric and non-negative,
$$H(p_{\theta_1}, p_{\theta_2}) = H(p_{\theta_2}, p_{\theta_1}),$$
and therefore it is a distance measure. Furthermore, the squared Hellinger distance lower bounds the KL divergence, so convergence in KL divergence implies convergence in Hellinger distance.

Proposition 1
$$H^2(p_{\theta_1}, p_{\theta_2}) \le K(p_{\theta_1}, p_{\theta_2}).$$
Proof:
$$H^2(p_{\theta_1}, p_{\theta_2}) = \int \left( \sqrt{p_{\theta_1}(y)} - \sqrt{p_{\theta_2}(y)} \right)^2 dy$$
$$= \int p_{\theta_1}(y)\, dy + \int p_{\theta_2}(y)\, dy - 2\int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy$$
$$= 2 - 2\int \sqrt{\frac{p_{\theta_2}(y)}{p_{\theta_1}(y)}}\, p_{\theta_1}(y)\, dy, \quad \text{since } \int p_\theta(y)\, dy = 1$$
$$= 2\left( 1 - E_{\theta_1}\!\left[ \sqrt{p_{\theta_2}(Y)/p_{\theta_1}(Y)} \right] \right)$$
$$\le -2\log E_{\theta_1}\!\left[ \sqrt{p_{\theta_2}(Y)/p_{\theta_1}(Y)} \right], \quad \text{since } 1 - x \le -\log x$$
$$\le E_{\theta_1}\!\left[ -\log\left( p_{\theta_2}(Y)/p_{\theta_1}(Y) \right) \right], \quad \text{by Jensen's inequality}$$
$$= E_{\theta_1}\!\left[ \log\left( p_{\theta_1}(Y)/p_{\theta_2}(Y) \right) \right] \equiv K(p_{\theta_1}, p_{\theta_2}). \qquad \square$$
Note that in the proof we also showed that
$$H^2(p_{\theta_1}, p_{\theta_2}) = 2\left( 1 - \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy \right),$$
and using the fact $1 - x \le -\log x$ again, we have
$$H^2(p_{\theta_1}, p_{\theta_2}) \le -2\log \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy.$$
The quantity inside the log is called the affinity between $p_{\theta_1}$ and $p_{\theta_2}$:
$$A(p_{\theta_1}, p_{\theta_2}) = \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy.$$
This is another measure of closeness between $p_{\theta_1}$ and $p_{\theta_2}$.
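For discrete distributions, Proposition 1 and the affinity bound can be verified directly. A small sketch (function names are mine) checks $H^2 = 2(1 - A)$, $H^2 \le -2\log A$, and $H^2 \le K$:

```python
import math

def hellinger_sq(p, q):
    """Squared Hellinger distance: sum_y (sqrt(p(y)) - sqrt(q(y)))^2."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def affinity(p, q):
    """Affinity A(p, q) = sum_y sqrt(p(y) q(y))."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def kl(p, q):
    """K(p, q) = sum_y p(y) log(p(y)/q(y))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(hellinger_sq(p, q))             # equals 2 * (1 - affinity(p, q))
print(-2 * math.log(affinity(p, q)))  # upper bounds H^2
print(kl(p, q))                       # upper bounds H^2 (Proposition 1)
```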
Example 2 (Gaussian Distributions) Let $p_\theta(y) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(y-\theta)^2}{2}}$. Then
$$-\log \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy = -\log \int \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(y-\theta_1)^2 + (y-\theta_2)^2}{4}}\, dy$$
$$= -\log \int \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(y - \frac{\theta_1+\theta_2}{2}\right)^2 - \frac{(\theta_1-\theta_2)^2}{8}}\, dy$$
$$= -\log \left( e^{-\frac{(\theta_1-\theta_2)^2}{8}} \int \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(y - \frac{\theta_1+\theta_2}{2}\right)^2}\, dy \right) = \frac{(\theta_1-\theta_2)^2}{8}.$$
So
$$-\log A(p_{\theta_1}, p_{\theta_2}) = \frac{(\theta_1-\theta_2)^2}{8} \quad \text{for Gaussian distributions},$$
and hence
$$H^2(p_{\theta_1}, p_{\theta_2}) \le \frac{(\theta_1-\theta_2)^2}{4} \quad \text{for Gaussians}.$$

3.4 Summary

1. $Y_i \overset{iid}{\sim} p_{\theta^*}$. The maximum likelihood estimator $\hat{\theta}_n$ maximizes the empirical average $\frac{1}{n}\sum_{i=1}^n \log p_\theta(Y_i)$ (our empirical risk is the negative log-likelihood).
2. $\theta^*$ maximizes the expectation $E[\log p_\theta(Y)]$ (the risk is the expected negative log-likelihood).
3. $\frac{1}{n}\sum_{i=1}^n \log p_\theta(Y_i) \overset{a.s.}{\to} E[\log p_\theta(Y)]$, so we expect some sort of concentration of measure.
4. In particular, since $\frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)} \overset{a.s.}{\to} K(p_{\theta^*}, p_\theta)$, we might expect that $K(p_{\theta^*}, p_{\hat{\theta}_n}) \to 0$ for the sequence of estimates $\{p_{\hat{\theta}_n}\}_{n=1}^\infty$.

So, the point is that the maximum likelihood estimator is just a special case of a loss function in learning. Due to its special structure, we are naturally led to consider KL divergences, Hellinger distances, and affinities.
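As a closing sanity check, the closed form $-\log A(p_{\theta_1}, p_{\theta_2}) = (\theta_1 - \theta_2)^2/8$ from Example 2 can be confirmed by numerical integration (the trapezoid-rule helper below is mine, not part of the notes):

```python
import math

def gaussian_affinity(t1, t2, lo=-20.0, hi=20.0, steps=20_000):
    """Trapezoid-rule approximation of A = integral sqrt(p_t1(y) p_t2(y)) dy
    for unit-variance Gaussian densities with means t1 and t2."""
    def integrand(y):
        p1 = math.exp(-(y - t1) ** 2 / 2) / math.sqrt(2 * math.pi)
        p2 = math.exp(-(y - t2) ** 2 / 2) / math.sqrt(2 * math.pi)
        return math.sqrt(p1 * p2)
    h = (hi - lo) / steps
    total = 0.5 * (integrand(lo) + integrand(hi))
    total += sum(integrand(lo + i * h) for i in range(1, steps))
    return total * h

t1, t2 = 0.0, 1.0
print(-math.log(gaussian_affinity(t1, t2)))  # ~ (t1 - t2)^2 / 8 = 0.125
```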