ECE 901 Lecture 4: Maximum Likelihood Estimation and Complexity Regularization
R. Nowak, 5/7/2009

1 Review: Maximum Likelihood Estimation

We have iid observations drawn from an unknown distribution,

$$Y_i \stackrel{iid}{\sim} p_{\theta^*}, \quad i = 1, \dots, n,$$

where $\theta^* \in \Theta$. We can view $p_{\theta^*}$ as a member of a parametric class of distributions, $\mathcal{P} = \{p_\theta\}_{\theta \in \Theta}$. Our goal is to use the observations $\{Y_i\}$ to select an appropriate distribution (e.g., model) from $\mathcal{P}$. We would like the selected distribution to be close to $p_{\theta^*}$ in some sense.

We use the negative log-likelihood loss function, defined as

$$\ell(\theta, Y_i) = -\log p_\theta(Y_i).$$

The empirical risk is

$$\hat{R}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i).$$

We select the distribution that minimizes the empirical risk,

$$\min_{p \in \mathcal{P}} \; -\frac{1}{n} \sum_{i=1}^n \log p(Y_i) \;\equiv\; \min_{\theta \in \Theta} \; -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i).$$

In other words, the distribution we select is $\hat{p}_n := p_{\hat{\theta}_n}$, where

$$\hat{\theta}_n = \arg\min_{\theta \in \Theta} \; -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i).$$

The risk is defined as

$$R(\theta) = E[\ell(\theta, Y)] = E[-\log p_\theta(Y)].$$

As shown before, $\theta^*$ minimizes $R(\theta)$ over $\Theta$:

$$\theta^* = \arg\min_{\theta \in \Theta} E[-\log p_\theta(Y)] = \arg\min_{\theta \in \Theta} \; -\int \log p_\theta(y)\, p_{\theta^*}(y)\, dy.$$

Finally, the excess risk of $\theta$ is defined as

$$R(\theta) - R(\theta^*) = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)}\, p_{\theta^*}(y)\, dy = K(p_{\theta^*}, p_\theta).$$
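As a quick numerical illustration of empirical risk minimization under the negative log-likelihood loss (not part of the original notes), the following Python sketch recovers a Gaussian mean by minimizing $\hat{R}_n(\theta)$ over a grid; the model $\mathcal{N}(\theta, 1)$, the grid, and the true parameter are assumptions chosen here for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustration (assumed setup): Gaussian model p_theta = N(theta, 1),
# true parameter theta* = 2. The MLE minimizes the empirical risk
# Rhat_n(theta) = -(1/n) sum_i log p_theta(Y_i).
theta_star, n = 2.0, 500
Y = rng.normal(theta_star, 1.0, size=n)

def empirical_risk(theta):
    # negative average log-likelihood of N(theta, 1)
    return 0.5 * np.log(2 * np.pi) + 0.5 * np.mean((Y - theta) ** 2)

# Minimize over a grid of candidate parameters Theta
grid = np.linspace(0, 4, 4001)
theta_hat = grid[np.argmin([empirical_risk(t) for t in grid])]
print(theta_hat, Y.mean())  # grid minimizer ~ sample mean (the exact MLE here)
```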
We see that the excess risk corresponding to this loss function is simply the Kullback-Leibler (KL) divergence, or relative entropy, denoted by $K(p_{\theta^*}, p_\theta)$. It is easy to see that $K(p_{\theta^*}, p_\theta)$ is always non-negative and is zero if and only if $p_\theta = p_{\theta^*}$. The KL divergence measures how different two probability distributions are, and it is therefore natural for measuring the convergence of maximum likelihood procedures. However, $K(p_{\theta^*}, p_\theta)$ is not a distance metric, because it is not symmetric and does not satisfy the triangle inequality. For this reason, two other quantities play a key role in the analysis of maximum likelihood estimation, namely the Hellinger distance and the affinity. The Hellinger distance is defined as

$$H(p_{\theta_1}, p_{\theta_2}) = \left( \int \left( \sqrt{p_{\theta_1}(y)} - \sqrt{p_{\theta_2}(y)} \right)^2 dy \right)^{1/2}.$$

We proved that the squared Hellinger distance lower bounds the KL divergence:

$$H^2(p_{\theta_1}, p_{\theta_2}) \le K(p_{\theta_1}, p_{\theta_2}), \qquad H^2(p_{\theta_1}, p_{\theta_2}) \le K(p_{\theta_2}, p_{\theta_1}).$$

The affinity is defined as

$$A(p_{\theta_1}, p_{\theta_2}) = \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy.$$

We also proved that

$$H^2(p_{\theta_1}, p_{\theta_2}) \le -2 \log A(p_{\theta_1}, p_{\theta_2}).$$

Example 1 (Gaussian distribution). Suppose $Y$ is Gaussian with mean $\theta$ and variance $\sigma^2$, so that

$$p_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\theta)^2}{2\sigma^2}}.$$

First, look at the KL divergence:

$$K(p_{\theta_1}, p_{\theta_2}) = E_{\theta_1}\left[ \log \frac{p_{\theta_1}}{p_{\theta_2}} \right].$$

Since

$$\log \frac{p_{\theta_1}}{p_{\theta_2}} = \frac{1}{2\sigma^2}\left[ (y-\theta_2)^2 - (y-\theta_1)^2 \right] = \frac{1}{2\sigma^2}\left[ \theta_2^2 - \theta_1^2 + 2(\theta_1 - \theta_2)\, y \right]$$

and $E_{\theta_1}[Y] = \theta_1$, we have

$$K(p_{\theta_1}, p_{\theta_2}) = \frac{1}{2\sigma^2}\left[ \theta_2^2 - \theta_1^2 + 2(\theta_1 - \theta_2)\, \theta_1 \right] = \frac{(\theta_1 - \theta_2)^2}{2\sigma^2}.$$

Next, the affinity:

$$-2 \log A(p_{\theta_1}, p_{\theta_2}) = -2 \log \int \sqrt{\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta_1)^2}{2\sigma^2}}}\, \sqrt{\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta_2)^2}{2\sigma^2}}}\, dy = -2 \log \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\theta_1)^2 + (y-\theta_2)^2}{4\sigma^2}}\, dy.$$

Completing the square, $(y-\theta_1)^2 + (y-\theta_2)^2 = 2\left(y - \frac{\theta_1 + \theta_2}{2}\right)^2 + \frac{(\theta_1 - \theta_2)^2}{2}$, so the integral evaluates to $e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}$ and

$$-2 \log A(p_{\theta_1}, p_{\theta_2}) = \frac{(\theta_1 - \theta_2)^2}{4\sigma^2} = \frac{1}{2}\, K(p_{\theta_1}, p_{\theta_2}) \ge H^2(p_{\theta_1}, p_{\theta_2}).$$
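The closed-form Gaussian expressions above are easy to check numerically. This sketch (an illustration; $\theta_1$, $\theta_2$, and $\sigma$ are chosen arbitrarily) integrates the three quantities and confirms $H^2 \le -2\log A = K/2 \le K$.

```python
import numpy as np
from scipy.integrate import quad

# Numerical check (illustration, not part of the notes) of the Gaussian
# formulas: K = (t1-t2)^2/(2 s^2), -2 log A = (t1-t2)^2/(4 s^2) = K/2 >= H^2.
t1, t2, s = 0.0, 1.0, 0.8

def p(y, t):
    return np.exp(-(y - t) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)

K, _ = quad(lambda y: p(y, t1) * np.log(p(y, t1) / p(y, t2)), -20, 20)
H2, _ = quad(lambda y: (np.sqrt(p(y, t1)) - np.sqrt(p(y, t2))) ** 2, -20, 20)
A, _ = quad(lambda y: np.sqrt(p(y, t1) * p(y, t2)), -20, 20)

print(K, (t1 - t2) ** 2 / (2 * s ** 2))               # KL divergence
print(-2 * np.log(A), (t1 - t2) ** 2 / (4 * s ** 2))  # -2 log A = K/2
print(H2 <= -2 * np.log(A) <= K)                      # H^2 <= -2 log A <= K
```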
2 Maximum Likelihood Estimation and Complexity Regularization

Suppose that we have $n$ iid training samples, $\{X_i, Y_i\}_{i=1}^n \stackrel{iid}{\sim} p_{XY}$. Using conditional probability, $p_{XY}$ can be written as

$$p_{XY}(x, y) = p_X(x)\, p_{Y|X=x}(y).$$

Let's assume for the moment that $p_X$ is completely unknown, but that $p_{Y|X=x}(y)$ has a special form:

$$p_{Y|X=x}(y) = p_{f^*(x)}(y),$$

where $p_{f^*(x)}(y)$ is a known parametric density function with parameter $f^*(x)$.

Example 2 (Signal-plus-noise observation model). Let

$$Y_i = f^*(X_i) + W_i, \quad i = 1, \dots, n,$$

where $W_i \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2)$ and $X_i \stackrel{iid}{\sim} p_X$. In this case

$$p_{f^*(x)}(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - f^*(x))^2}{2\sigma^2}}.$$

Similarly, for counting observations $Y \mid X = x \sim \text{Poisson}(f^*(x))$, we have

$$p_{f^*(x)}(y) = \frac{e^{-f^*(x)}\, [f^*(x)]^y}{y!}.$$

The likelihood loss function is

$$\ell(f(X), Y) = -\log p_{XY}(X, Y) = -\log p_X(X) - \log p_{f(X)}(Y).$$

The expected loss is

$$E[\ell(f(X), Y)] = E\big[ E[\ell(f(X), Y) \mid X] \big] = E\big[ E[-\log p_X(X) - \log p_{f(X)}(Y) \mid X] \big] = E[-\log p_X(X)] + E[-\log p_{f(X)}(Y)].$$

Notice that the first term is a constant with respect to $f$. With that in mind, we define our risk to be

$$R(f) = E[-\log p_{f(X)}(Y)] = E\big[ E[-\log p_{f(X)}(Y) \mid X] \big] = \int \left( -\int \log p_{f(x)}(y)\, p_{f^*(x)}(y)\, dy \right) p_X(x)\, dx.$$

The function $f^*$ minimizes this risk, since $f(x) = f^*(x)$ minimizes the inner integrand for each $x$. Our empirical risk is the negative log-likelihood of the training samples:

$$\hat{R}_n(f) = -\frac{1}{n} \sum_{i=1}^n \log p_{f(X_i)}(Y_i).$$

We can regard the value $1/n$ as the empirical probability of observing $X = X_i$ (since we are making no assumptions on $P_X$). A concrete sketch of the per-sample loss for these two models follows.
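Here is a minimal sketch of the per-sample loss $-\log p_{f(x)}(y)$ for the two observation models above; the function names are ours, not from the notes.

```python
import numpy as np
from scipy.special import gammaln

# Per-sample negative log-likelihood -log p_{f(x)}(y) for the two models.

def gaussian_nll(fx, y, sigma=1.0):
    # Y | X=x ~ N(f(x), sigma^2): squared error plus a constant
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - fx) ** 2 / (2 * sigma**2)

def poisson_nll(fx, y):
    # Y | X=x ~ Poisson(f(x)), f(x) > 0; gammaln(y+1) = log(y!)
    return fx - y * np.log(fx) + gammaln(y + 1)

print(gaussian_nll(1.5, 2.0), poisson_nll(1.5, 2))
```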
3 Deterministic Designs

Often in function estimation we have control over where we sample. For illustration, let's assume $\mathcal{X} = [0, 1]$ and $\mathcal{Y} = \mathbb{R}$. Suppose we have the samples deterministically distributed somewhat uniformly over the domain $\mathcal{X}$. Let $x_i$, $i = 1, \dots, n$, denote these sample points, and assume $Y_i \sim p_{f^*(x_i)}(y)$. Then our empirical risk is

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), Y_i) = -\frac{1}{n} \sum_{i=1}^n \log p_{f(x_i)}(Y_i).$$

Note that $x_i$ is now a deterministic quantity (hence the name deterministic design). Our risk is

$$R(f) = E\left[ -\frac{1}{n} \sum_{i=1}^n \log p_{f(x_i)}(Y_i) \right] = -\frac{1}{n} \sum_{i=1}^n \int \log p_{f(x_i)}(y_i)\, p_{f^*(x_i)}(y_i)\, dy_i.$$

The risk is minimized by $f^*$. However, $f^*$ is not a unique minimizer: any $f$ that agrees with $f^*$ at the points $\{x_i\}$ also minimizes this risk.

Now we will make use of the following vector and shorthand notation. The uppercase $Y$ denotes a random vector, while the lowercase $y$ and $x$ denote deterministic quantities:

$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}.$$

Then

$$p_f(Y) = \prod_{i=1}^n p_{f(x_i)}(Y_i) \quad \text{(random)}, \qquad p_f(y) = \prod_{i=1}^n p_{f(x_i)}(y_i) \quad \text{(deterministic)}.$$

With this notation, the empirical risk and the true risk can be written as

$$\hat{R}_n(f) = -\frac{1}{n} \log p_f(Y), \qquad R(f) = E\left[ -\frac{1}{n} \log p_f(Y) \right] = -\frac{1}{n} \int \log p_f(y)\, p_{f^*}(y)\, dy.$$

4 Constructing Error Bounds

Suppose that we have a pool of candidate functions $\mathcal{F}$, and we want to select a function $f$ from $\mathcal{F}$ using the training data. Our usual approach is to show that the distribution of $\hat{R}_n(f)$ concentrates about its mean as $n$ grows. First, we assign a complexity $c(f) > 0$ to each $f \in \mathcal{F}$ so that $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. Then we apply the union bound to get a uniform concentration inequality holding for all models in $\mathcal{F}$. Finally, we use this concentration inequality to bound the expected risk of our selected model. We will essentially accomplish the same result here, but avoid the need for explicit concentration inequalities and instead make use of information-theoretic bounds.
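As a sketch of how such complexities can be assigned in practice (the indexing scheme below is an assumption for illustration, not from the notes), one standard choice is to let $c(f)$ be the codelength of a prefix code for $\mathcal{F}$, which guarantees the Kraft inequality $\sum_{f} 2^{-c(f)} \le 1$.

```python
import numpy as np

# Illustration: candidates f_k indexed by an integer level k (assumed scheme).
# Spending roughly 2*log2(k+2) + 1 bits on the index gives complexities with
# sum_f 2^{-c(f)} <= 1, so the union-bound argument below applies.
levels = np.arange(0, 50)
c = 2 * np.log2(levels + 2) + 1     # c(f_k) > 0 for every candidate
print(np.sum(2.0 ** (-c)))          # ~0.32, which is <= 1 as required
```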
We would like to select an $f \in \mathcal{F}$ so that the excess risk

$$0 \le R(f) - R(f^*) = \frac{1}{n} E\left[ \log p_{f^*}(Y) - \log p_f(Y) \right] = \frac{1}{n} E\left[ \log \frac{p_{f^*}(Y)}{p_f(Y)} \right] = \frac{1}{n} K(p_{f^*}, p_f)$$

is small, where

$$\frac{1}{n} K(p_{f^*}, p_f) = \frac{1}{n} \sum_{i=1}^n \underbrace{\int \log \frac{p_{f^*(x_i)}(y_i)}{p_{f(x_i)}(y_i)}\, p_{f^*(x_i)}(y_i)\, dy_i}_{K(p_{f^*(x_i)},\, p_{f(x_i)})}$$

is again the KL divergence. Unfortunately, as mentioned before, $K(p_{f^*}, p_f)$ is not a true distance. So instead we will focus on the expected squared Hellinger distance as our measure of performance:

$$\frac{1}{n} H^2(p_{f^*}, p_f) = \frac{1}{n} \sum_{i=1}^n \int \left( \sqrt{p_{f(x_i)}(y_i)} - \sqrt{p_{f^*(x_i)}(y_i)} \right)^2 dy_i.$$

5 Maximum Complexity-Regularized Likelihood Estimation

Theorem 1 (Li-Barron 2000, Kolaczyk-Nowak 2004). Let $\{x_i, Y_i\}_{i=1}^n$ be a sample of training data, where the $x_i$ are deterministic and the $Y_i$ are independent random variables, distributed as

$$Y_i \sim p_{f^*(x_i)}(y_i), \quad i = 1, \dots, n,$$

for some unknown function $f^*$. Suppose we have a collection of candidate functions $\mathcal{F}$ (measurable functions $f : \mathcal{X} \to \mathcal{Y}$) and complexities $c(f) > 0$, $f \in \mathcal{F}$, satisfying

$$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1.$$

Define the complexity-regularized estimator

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ -\frac{1}{n} \sum_{i=1}^n \log p_{f(x_i)}(Y_i) + \frac{2\, c(f) \log 2}{n} \right\}.$$

Then

$$E\left[ \frac{1}{n} H^2(p_{\hat{f}_n}, p_{f^*}) \right] \le -\frac{2}{n}\, E\left[ \log A(p_{\hat{f}_n}, p_{f^*}) \right] \le \min_{f \in \mathcal{F}} \left\{ \frac{1}{n} K(p_{f^*}, p_f) + \frac{2\, c(f) \log 2}{n} \right\}.$$

Before proving the theorem, let's look at a very special and important case. We will use this result quite a lot in the following lectures.
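The selection rule in Theorem 1 is straightforward to transcribe into code. The following sketch assumes each candidate comes with its negative log-likelihood on the data and its complexity (the numbers below are made up for illustration).

```python
import numpy as np

# Sketch of the selection rule in Theorem 1. `candidates` maps each model f
# to (nll, c), where nll = -sum_i log p_{f(x_i)}(Y_i) on the data and c = c(f)
# is its complexity; the Kraft condition sum_f 2^{-c(f)} <= 1 is assumed.

def select(candidates, n):
    # f_hat = argmin_f { -(1/n) sum_i log p_{f(x_i)}(Y_i) + 2 c(f) log(2) / n }
    def penalized(item):
        nll, c = item[1]
        return nll / n + 2 * c * np.log(2) / n
    return min(candidates.items(), key=penalized)[0]

# e.g. three models: f3 fits best but pays the largest complexity penalty
models = {"f1": (310.2, 2.0), "f2": (295.7, 4.0), "f3": (293.1, 8.0)}
print(select(models, n=100))  # -> "f2"
```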
Example 3 (Gaussian noise). Suppose

$$Y_i = f^*(x_i) + W_i, \qquad W_i \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2),$$

so that

$$p_{f(x_i)}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}.$$

Then

$$-\log p_f(Y) = \frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - f(x_i))^2 + \frac{n}{2} \log(2\pi\sigma^2).$$

Using the results from Example 1, we also have

$$-2 \log A\big(p_{\hat{f}(x_i)}(y_i),\, p_{f^*(x_i)}(y_i)\big) = -2 \log \int \sqrt{p_{\hat{f}(x_i)}(y_i)\, p_{f^*(x_i)}(y_i)}\, dy_i = \frac{(\hat{f}(x_i) - f^*(x_i))^2}{4\sigma^2},$$

so that

$$-\frac{2}{n} \log A\big(p_{\hat{f}}(Y),\, p_{f^*}(Y)\big) = \frac{1}{4\sigma^2} \cdot \frac{1}{n} \sum_{i=1}^n (\hat{f}(x_i) - f^*(x_i))^2,$$

and

$$\frac{1}{n} K(p_{f^*}, p_f) = \frac{1}{2\sigma^2} \cdot \frac{1}{n} \sum_{i=1}^n (f(x_i) - f^*(x_i))^2.$$

Since the additive constant in $-\log p_f(Y)$ does not depend on $f$, the complexity-regularized estimator reduces to

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \sum_{i=1}^n (Y_i - f(x_i))^2 + 4\sigma^2 c(f) \log 2 \right\}.$$

The theorem tells us that

$$\frac{1}{4\sigma^2}\, E\left[ \frac{1}{n} \sum_{i=1}^n (\hat{f}(x_i) - f^*(x_i))^2 \right] \le \min_{f \in \mathcal{F}} \left\{ \frac{1}{2\sigma^2 n} \sum_{i=1}^n (f(x_i) - f^*(x_i))^2 + \frac{2\, c(f) \log 2}{n} \right\}.$$

Combining everything together (multiplying through by $4\sigma^2$), we get

$$E\left[ \frac{1}{n} \sum_{i=1}^n (\hat{f}(x_i) - f^*(x_i))^2 \right] \le \min_{f \in \mathcal{F}} \left\{ \frac{2}{n} \sum_{i=1}^n (f(x_i) - f^*(x_i))^2 + \frac{8\sigma^2 c(f) \log 2}{n} \right\}.$$

We will now prove Theorem 1.

Proof: First,

$$\frac{1}{n} H^2(p_{\hat{f}}, p_{f^*}) = \frac{1}{n} \int \left( \sqrt{p_{\hat{f}}(y)} - \sqrt{p_{f^*}(y)} \right)^2 dy \le -\frac{2}{n} \log \underbrace{\int \sqrt{p_{\hat{f}}(y)\, p_{f^*}(y)}\, dy}_{\text{affinity } A(p_{\hat{f}},\, p_{f^*})}.$$

Notice that $A(p_{\hat{f}}, p_{f^*})$ is a random quantity (it depends on the training set through $\hat{f}$). Keeping that in mind, the above result tells us that

$$E\left[ \frac{1}{n} H^2(p_{\hat{f}}, p_{f^*}) \right] \le E\left[ -\frac{2}{n} \log A(p_{\hat{f}}, p_{f^*}) \right].$$
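In the Gaussian case the estimator is just penalized least squares. The following sketch selects a polynomial degree this way; the candidate class and the complexity assignment $c(f) = (d+1)\log_2 n$ are assumptions chosen for illustration (they satisfy $\sum_f 2^{-c(f)} \le 1$).

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian-noise sketch: choose among polynomial fits of degree d = 0..7 by
# minimizing sum_i (Y_i - f(x_i))^2 + 4 sigma^2 c(f) log 2. The complexity
# c(f) = (d+1) log2(n) bits is an illustrative choice; any assignment with
# sum_f 2^{-c(f)} <= 1 works.
n, sigma = 128, 0.5
x = np.arange(1, n + 1) / n                  # deterministic design points
f_star = lambda t: 1.0 - 2.0 * t             # truth: a degree-1 polynomial
Y = f_star(x) + rng.normal(0, sigma, n)

best, best_score = None, np.inf
for d in range(0, 8):
    coef = np.polyfit(x, Y, d)               # least-squares fit of degree d
    resid = Y - np.polyval(coef, x)
    c = (d + 1) * np.log2(n)                  # illustrative complexity (bits)
    score = np.sum(resid ** 2) + 4 * sigma**2 * c * np.log(2)
    if score < best_score:
        best, best_score = d, score
print("selected degree:", best)               # typically 1: penalty beats overfit
```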
Now, define the theoretical analog of $\hat{f}_n$:

$$\tilde{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} K(p_{f^*}, p_f) + \frac{2\, c(f) \log 2}{n} \right\}.$$

Now rewrite the definition of $\hat{f}_n$:

$$\begin{aligned}
\hat{f}_n &= \arg\min_{f \in \mathcal{F}} \left\{ -\frac{1}{n} \log p_f(Y) + \frac{2\, c(f) \log 2}{n} \right\} \\
&= \arg\max_{f \in \mathcal{F}} \left\{ \frac{1}{2} \log p_f(Y) - c(f) \log 2 \right\} \\
&= \arg\max_{f \in \mathcal{F}} \left\{ \log \left( p_f(Y)^{1/2}\, 2^{-c(f)} \right) \right\} \\
&= \arg\max_{f \in \mathcal{F}}\; p_f(Y)^{1/2}\, 2^{-c(f)}.
\end{aligned}$$

Since $\hat{f}_n$ is the function $f \in \mathcal{F}$ that maximizes $p_f(Y)^{1/2}\, 2^{-c(f)}$, we conclude that

$$p_{\hat{f}}(Y)^{1/2}\, 2^{-c(\hat{f})} \ge p_{\tilde{f}}(Y)^{1/2}\, 2^{-c(\tilde{f})}.$$

Then we can write

$$E\left[ \frac{1}{n} H^2(p_{\hat{f}}, p_{f^*}) \right] \le E\left[ -\frac{2}{n} \log A(p_{\hat{f}}, p_{f^*}) \right] \le \frac{2}{n}\, E\left[ \log \frac{p_{\hat{f}}(Y)^{1/2}\, 2^{-c(\hat{f})}}{p_{\tilde{f}}(Y)^{1/2}\, 2^{-c(\tilde{f})}\, A(p_{\hat{f}}, p_{f^*})} \right],$$

where the last inequality holds because the ratio we introduced inside the logarithm is at least one. Now, simply multiply the argument inside the log by $\frac{p_{f^*}(Y)^{1/2}}{p_{f^*}(Y)^{1/2}}$ to get

$$\begin{aligned}
E\left[ \frac{1}{n} H^2(p_{\hat{f}}, p_{f^*}) \right] &\le \frac{2}{n}\, E\left[ \log \frac{p_{f^*}(Y)^{1/2}}{p_{\tilde{f}}(Y)^{1/2}} \right] + \frac{2\, c(\tilde{f}) \log 2}{n} + \frac{2}{n}\, E\left[ \log \frac{p_{\hat{f}}(Y)^{1/2}\, 2^{-c(\hat{f})}}{p_{f^*}(Y)^{1/2}\, A(p_{\hat{f}}, p_{f^*})} \right] \\
&= \frac{1}{n} K(p_{f^*}, p_{\tilde{f}}) + \frac{2\, c(\tilde{f}) \log 2}{n} + \frac{2}{n}\, E\left[ \log \frac{p_{\hat{f}}(Y)^{1/2}\, 2^{-c(\hat{f})}}{p_{f^*}(Y)^{1/2}\, A(p_{\hat{f}}, p_{f^*})} \right].
\end{aligned}$$

The terms $\frac{1}{n} K(p_{f^*}, p_{\tilde{f}}) + \frac{2\, c(\tilde{f}) \log 2}{n}$ are precisely what we wanted for the upper bound of the theorem: by the definition of $\tilde{f}_n$ they equal the minimum over $\mathcal{F}$. So, to finish the proof we only need to show that the last term is non-positive.
Applying Jensen's inequality, we get

$$\frac{2}{n}\, E\left[ \log \frac{p_{\hat{f}}(Y)^{1/2}\, 2^{-c(\hat{f})}}{p_{f^*}(Y)^{1/2}\, A(p_{\hat{f}}, p_{f^*})} \right] \le \frac{2}{n} \log E\left[ \sqrt{\frac{p_{\hat{f}}(Y)}{p_{f^*}(Y)}}\, \frac{2^{-c(\hat{f})}}{A(p_{\hat{f}}, p_{f^*})} \right].$$

Note that in the above expression the random quantities are $Y$ and $\hat{f}$. Because $\hat{f} \in \mathcal{F}$ and we know something about this space, we can use a union bound to get the randomness of $\hat{f}$ out of the way. In this case the union bound simply says that any individual term in a summation of non-negative terms is smaller than the summation (see the note below):

$$\begin{aligned}
\frac{2}{n} \log E\left[ \sqrt{\frac{p_{\hat{f}}(Y)}{p_{f^*}(Y)}}\, \frac{2^{-c(\hat{f})}}{A(p_{\hat{f}}, p_{f^*})} \right] &\le \frac{2}{n} \log E\left[ \sum_{f \in \mathcal{F}} \sqrt{\frac{p_f(Y)}{p_{f^*}(Y)}}\, \frac{2^{-c(f)}}{A(p_f, p_{f^*})} \right] \\
&= \frac{2}{n} \log \sum_{f \in \mathcal{F}} 2^{-c(f)}\, \frac{E\left[ \sqrt{p_f(Y)/p_{f^*}(Y)} \right]}{A(p_f, p_{f^*})} \\
&\le \frac{2}{n} \log \sum_{f \in \mathcal{F}} 2^{-c(f)} \le \frac{2}{n} \log 1 = 0,
\end{aligned}$$

where the last steps of the proof follow from the facts that

$$E\left[ \sqrt{\frac{p_f(Y)}{p_{f^*}(Y)}} \right] = \int \sqrt{\frac{p_f(y)}{p_{f^*}(y)}}\, p_{f^*}(y)\, dy = \int \sqrt{p_f(y)\, p_{f^*}(y)}\, dy = A(p_f, p_{f^*}),$$

and $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. $\square$

Note: Let $z_1, z_2, \dots$ be non-negative. Then for all $i \in \mathbb{N}$,

$$z_i \le \sum_j z_j.$$
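The identity $E[\sqrt{p_f(Y)/p_{f^*}(Y)}] = A(p_f, p_{f^*})$ used in the last step is easy to confirm by Monte Carlo. This sketch uses the Gaussian model from Example 1; the parameter values are chosen arbitrarily for the check.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check (illustration only) of the identity used above:
# E_{Y ~ p_{f*}}[ sqrt(p_f(Y) / p_{f*}(Y)) ] = A(p_f, p_{f*}).
# For N(t*, s^2) vs N(t, s^2), A = exp(-(t - t*)^2 / (8 s^2)) by Example 1.
t_star, t, s = 0.0, 1.0, 1.0
Y = rng.normal(t_star, s, size=1_000_000)

def p(y, mean):
    return np.exp(-(y - mean) ** 2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

mc = np.mean(np.sqrt(p(Y, t) / p(Y, t_star)))
print(mc, np.exp(-(t - t_star) ** 2 / (8 * s**2)))  # both ~ 0.8825
```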