Maximum Likelihood Estimation and Complexity Regularization
ECE901 Spring 2004: Statistical Regularization and Learning Theory

Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

Lecturer: Rob Nowak. Scribe: Pam Limpiti.

1 Review: Maximum Likelihood Estimation

In the last lecture, we had $n$ iid observations drawn from an unknown distribution,

$$Y_i \overset{iid}{\sim} p_{\theta^*}, \quad i = 1, \ldots, n,$$

where $\theta^* \in \Theta$. With the loss function defined as $\ell(\theta, Y_i) = -\log p_\theta(Y_i)$, the empirical risk is

$$\hat{R}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i).$$

Essentially, we want to choose the distribution, from the collection of distributions indexed by the parameter space, that minimizes the empirical risk; i.e., we would like to select $p_{\hat{\theta}_n}$, where

$$\hat{\theta}_n = \arg\min_{\theta \in \Theta} \; -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i).$$

The risk is defined as

$$R(\theta) = E[\ell(\theta, Y)] = E[-\log p_\theta(Y)].$$

Note that $\theta^*$ minimizes $R(\theta)$ over $\Theta$:

$$\theta^* = \arg\min_{\theta \in \Theta} E[-\log p_\theta(Y)] = \arg\min_{\theta \in \Theta} \int -\log p_\theta(y) \, p_{\theta^*}(y) \, dy.$$

Finally, the excess risk of $\theta$ is defined as

$$R(\theta) - R(\theta^*) = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)} \, p_{\theta^*}(y) \, dy \equiv K(p_{\theta^*}, p_\theta).$$

We recognized that the excess risk corresponding to this loss function is simply the Kullback-Leibler (KL) divergence, or relative entropy, denoted by $K(p_{\theta^*}, p_\theta)$. It is easy to see that $K(p_{\theta^*}, p_\theta)$ is always non-negative and is zero if and only if $p_\theta = p_{\theta^*}$. The KL divergence measures how different two probability distributions are, and it is therefore natural for quantifying the convergence of maximum likelihood procedures. However, $K(p_{\theta^*}, p_\theta)$ is not a distance metric, because it is not symmetric and does not satisfy the triangle inequality. For this reason, two other quantities play a key role in maximum likelihood estimation: the Hellinger distance and the affinity.
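Before turning to those quantities, here is a small numerical sketch (not part of the original notes) of maximum likelihood estimation as empirical risk minimization with the negative log-likelihood loss. The Gaussian family, the grid search, and all names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown true parameter theta* and an iid sample Y_1, ..., Y_n ~ N(theta*, 1).
theta_star, n = 1.5, 200
Y = rng.normal(theta_star, 1.0, size=n)

def empirical_risk(theta, Y):
    """R_hat_n(theta) = -(1/n) sum_i log p_theta(Y_i) for the N(theta, 1) family."""
    return 0.5 * np.mean((Y - theta) ** 2) + 0.5 * np.log(2 * np.pi)

# Minimize the empirical risk over a fine grid (a stand-in for arg min over Theta).
Theta = np.linspace(-5, 5, 2001)
theta_hat = Theta[np.argmin([empirical_risk(t, Y) for t in Theta])]

# For this family the exact MLE is the sample mean; both approximate theta*.
print(theta_hat, Y.mean())
```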
The Hellinger distance is defined as

$$H(p_{\theta_1}, p_{\theta_2}) = \left( \int \left( \sqrt{p_{\theta_1}(y)} - \sqrt{p_{\theta_2}(y)} \right)^2 dy \right)^{1/2}.$$

We proved that the squared Hellinger distance lower-bounds the KL divergence:

$$H^2(p_{\theta_1}, p_{\theta_2}) \le K(p_{\theta_1}, p_{\theta_2}).$$

The affinity is defined as

$$A(p_{\theta_1}, p_{\theta_2}) = \int \sqrt{p_{\theta_1}(y) \, p_{\theta_2}(y)} \, dy.$$

We also proved that

$$H^2(p_{\theta_1}, p_{\theta_2}) \le -2 \log A(p_{\theta_1}, p_{\theta_2}) \le K(p_{\theta_1}, p_{\theta_2}).$$

Example 1 (Gaussian distribution). Let $Y$ be Gaussian with mean $\theta$ and variance $\sigma^2$:

$$p_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\theta)^2}{2\sigma^2}}.$$

First, look at

$$K(p_{\theta_1}, p_{\theta_2}) = E_{\theta_1}\!\left[ \log \frac{p_{\theta_1}}{p_{\theta_2}} \right].$$

Then

$$\log \frac{p_{\theta_1}}{p_{\theta_2}} = \frac{1}{2\sigma^2}\left[ (\theta_2^2 - \theta_1^2) - 2(\theta_2 - \theta_1)\, y \right] = \frac{\theta_2^2 - \theta_1^2}{2\sigma^2} + \frac{\theta_1 - \theta_2}{\sigma^2} \, y,$$

and, using $\int y \, p_{\theta_1}(y) \, dy = E[Y] = \theta_1$,

$$K(p_{\theta_1}, p_{\theta_2}) = \frac{\theta_2^2 - \theta_1^2}{2\sigma^2} + \frac{(\theta_1 - \theta_2)\,\theta_1}{\sigma^2} = \frac{\theta_1^2 + \theta_2^2 - 2\theta_1\theta_2}{2\sigma^2} = \frac{(\theta_1 - \theta_2)^2}{2\sigma^2}.$$

For the affinity,

$$-2 \log A(p_{\theta_1}, p_{\theta_2}) = -2 \log \int \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta_1)^2}{2\sigma^2}} \right)^{1/2} \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta_2)^2}{2\sigma^2}} \right)^{1/2} dy$$

$$= -2 \log \int \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\theta_1)^2 + (y-\theta_2)^2}{4\sigma^2}} \, dy = -2 \log \left( e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}} \int \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}\left(y - \frac{\theta_1+\theta_2}{2}\right)^2} dy \right)$$

$$= \frac{(\theta_1 - \theta_2)^2}{4\sigma^2} = \frac{1}{2} K(p_{\theta_1}, p_{\theta_2}) \ge \frac{1}{2} H^2(p_{\theta_1}, p_{\theta_2}),$$

where completing the square in the exponent shows that the remaining integral is exactly one.
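As a sanity check (not part of the original notes), the closed forms above can be verified by numerical integration; the parameter values, grid, and tolerances are arbitrary choices.

```python
import numpy as np

theta1, theta2, sigma = 0.0, 1.0, 0.7
y = np.linspace(-10.0, 11.0, 21001)  # wide uniform grid covering both densities
dy = y[1] - y[0]

def gauss(y, theta):
    return np.exp(-(y - theta) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def integrate(g):
    """Trapezoidal rule on the uniform grid y."""
    return float(np.sum((g[:-1] + g[1:]) * 0.5) * dy)

p1, p2 = gauss(y, theta1), gauss(y, theta2)
K = integrate(np.log(p1 / p2) * p1)               # KL divergence K(p1, p2)
A = integrate(np.sqrt(p1 * p2))                   # affinity A(p1, p2)
H2 = integrate((np.sqrt(p1) - np.sqrt(p2)) ** 2)  # squared Hellinger distance

d2 = (theta1 - theta2) ** 2
print(np.isclose(K, d2 / (2 * sigma**2), rtol=1e-4))               # K = d^2/(2 sigma^2)
print(np.isclose(-2 * np.log(A), d2 / (4 * sigma**2), rtol=1e-4))  # -2 log A = K/2
print(H2 <= -2 * np.log(A) <= K)                  # the chain H^2 <= -2 log A <= K
```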
2 Maximum Likelihood Estimation and Complexity Regularization

Suppose that we have $n$ iid training samples $\{X_i, Y_i\}_{i=1}^n$. Using conditional probability, $p_{XY}$ can be written as

$$p_{XY}(x, y) = p_X(x) \, p_{Y|X=x}(y).$$

Let's assume for the moment that $p_X$ is completely unknown, but that $p_{Y|X=x}(y)$ has a special form:

$$p_{Y|X=x}(y) = p_{f^*(x)}(y),$$

where $p_{f^*(x)}(y)$ is a known parametric density function with parameter $f^*(x)$.

Example 2 (Signal-plus-noise observation model). Let

$$Y_i = f^*(X_i) + W_i, \quad i = 1, \ldots, n,$$

where $W_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$ and $X_i \overset{iid}{\sim} p_X$. Then

$$p_{f^*(x)}(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y - f^*(x))^2}{2\sigma^2}}.$$

Another example is the Poisson model, $Y \mid X = x \sim \text{Poisson}(f^*(x))$, for which

$$p_{f^*(x)}(y) = \frac{e^{-f^*(x)} \, [f^*(x)]^y}{y!}.$$

The likelihood loss function is

$$\ell(f(x), y) = -\log p_{XY}(x, y) = -\log p_X(x) - \log p_{f(x)}(y).$$

The expected loss is

$$E[\ell(f(X), Y)] = E_X\!\left[ E_{Y|X}\!\left[ \ell(f(X), Y) \mid X = x \right] \right] = E_X\!\left[ -\log p_X(X) \right] + E\!\left[ -\log p_{f(X)}(Y) \right].$$

Notice that the first term is a constant with respect to $f$. Hence, we define our risk to be

$$R(f) = E\!\left[ -\log p_{f(X)}(Y) \right] = \int \left( \int -\log p_{f(x)}(y) \, p_{f^*(x)}(y) \, dy \right) p_X(x) \, dx.$$

The function $f^*$ minimizes this risk, since $f(x) = f^*(x)$ minimizes the inner integrand for each $x$. Our empirical risk is the negative log-likelihood of the training samples:

$$\hat{R}_n(f) = -\frac{1}{n} \sum_{i=1}^n \log p_{f(X_i)}(Y_i).$$
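Here is a small simulation sketch (not from the notes) of the signal-plus-noise model and the empirical risk $\hat{R}_n(f)$; the true function, the noise level, and the candidate functions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 500, 0.3

f_star = lambda x: np.sin(2 * np.pi * x)        # unknown true function (illustrative)
X = rng.uniform(0.0, 1.0, size=n)               # X_i ~ p_X (here uniform on [0, 1])
Y = f_star(X) + rng.normal(0.0, sigma, size=n)  # Y_i = f*(X_i) + W_i

def empirical_risk(f, X, Y):
    """R_hat_n(f) = -(1/n) sum_i log p_{f(X_i)}(Y_i) under Gaussian noise."""
    return np.mean((Y - f(X)) ** 2 / (2 * sigma**2)) + 0.5 * np.log(2 * np.pi * sigma**2)

print(empirical_risk(f_star, X, Y))             # near the minimum achievable value
print(empirical_risk(lambda x: 0.0 * x, X, Y))  # larger for a poor candidate
```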
The factor $\frac{1}{n}$ in the empirical risk is the empirical probability of observing $X = X_i$. Often in function estimation we have control over where we sample $X$. Let's assume that $\mathcal{X} = [0,1]^d$ and $\mathcal{Y} \subseteq \mathbb{R}$. Suppose we sample $X$ uniformly, with $n = m^d$ samples for some positive integer $m$ (i.e., take $m$ evenly spaced samples in each coordinate). Let $x_i$, $i = 1, \ldots, n$, denote these sample points, and assume that $Y_i \sim p_{f^*(x_i)}(y)$. Then our empirical risk is

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), Y_i) = -\frac{1}{n} \sum_{i=1}^n \log p_{f(x_i)}(Y_i).$$

Note that $x_i$ is now a deterministic quantity. Our risk is

$$R(f) = E\!\left[ -\frac{1}{n} \sum_{i=1}^n \log p_{f(x_i)}(Y_i) \right] = \frac{1}{n} \sum_{i=1}^n \int \left[ -\log p_{f(x_i)}(y_i) \right] p_{f^*(x_i)}(y_i) \, dy_i.$$

The risk is minimized by $f^*$. However, $f^*$ is not a unique minimizer: any $f$ that agrees with $f^*$ at the points $x_i$, $i = 1, \ldots, n$, also minimizes this risk.

Now we will make use of the following vector and shorthand notation, where the uppercase $Y$ denotes a random variable and the lowercase $y$ and $x$ denote deterministic quantities:

$$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}.$$

Then

$$p_f(Y) = \prod_{i=1}^n p_{f(x_i)}(Y_i) \ \ \text{(random)}, \qquad p_f(y) = \prod_{i=1}^n p_{f(x_i)}(y_i) \ \ \text{(deterministic)}.$$

With this notation, the empirical risk and the true risk can be written as

$$\hat{R}_n(f) = -\frac{1}{n} \log p_f(Y), \qquad R(f) = -\frac{1}{n} E[\log p_f(Y)] = -\frac{1}{n} \int \log p_f(y) \, p_{f^*}(y) \, dy.$$

3 Error Bound

Suppose that we have a pool of candidate functions $\mathcal{F}$, and we want to select a function $f$ from $\mathcal{F}$ using the training data. Our usual approach is to show that the distribution of $\hat{R}_n(f)$ concentrates about its mean as $n$ grows. First, we assign a complexity $c(f) > 0$ to each $f \in \mathcal{F}$ so that $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. Then, we apply the union bound to get a uniform concentration inequality holding for all models in $\mathcal{F}$. Finally, we use this concentration inequality to bound the expected risk of our selected model.
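To illustrate the summability condition on the complexities (a hypothetical assignment, not from the notes): the codelengths of any binary prefix code satisfy it, by the Kraft inequality.

```python
import numpy as np

# For a finite class F, the uniform assignment c(f) = log2 |F| gives
# sum_{f in F} 2^{-c(f)} = |F| * 2^{-log2 |F|} = 1 exactly.
F_size = 1024
print(F_size * 2.0 ** (-np.log2(F_size)))  # 1.0

# A non-uniform, prefix-code-style assignment also works, e.g. c(f_j) = 2j
# for models f_1, f_2, ... of increasing complexity:
j = np.arange(1, 50)
print(np.sum(2.0 ** (-2.0 * j)) <= 1.0)    # True: the sum is 1/3
```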
We will essentially accomplish the same result here, but avoid the need for explicit concentration inequalities and instead make use of information-theoretic bounds.

We would like to select an $f_n \in \mathcal{F}$ so that the excess risk

$$0 \le R(f_n) - R(f^*) = \frac{1}{n} E\!\left[ \log \frac{p_{f^*}(Y)}{p_{f_n}(Y)} \right] = \frac{1}{n} K(p_{f^*}, p_{f_n})$$

is small, where

$$K(p_{f^*}, p_f) = \sum_{i=1}^n \int \log \frac{p_{f^*(x_i)}(y_i)}{p_{f(x_i)}(y_i)} \, p_{f^*(x_i)}(y_i) \, dy_i = \sum_{i=1}^n K\!\left( p_{f^*(x_i)}, p_{f(x_i)} \right)$$

is again the KL divergence. Unfortunately, as mentioned before, $K(p_{f^*}, p_f)$ is not a true distance. So instead we will focus on the expected squared Hellinger distance as our measure of performance. We will get a bound on

$$E\!\left[ \frac{1}{n} H^2(p_{f_n}, p_{f^*}) \right] = E\!\left[ \frac{1}{n} \sum_{i=1}^n \int \left( \sqrt{p_{f_n(x_i)}(y_i)} - \sqrt{p_{f^*(x_i)}(y_i)} \right)^2 dy_i \right].$$

4 Maximum Complexity-Regularized Likelihood Estimation

Theorem 1 (Li-Barron 2000, Kolaczyk-Nowak 2002). Let $\{x_i, Y_i\}_{i=1}^n$ be a sample of training data with the $Y_i$ independent, $Y_i \sim p_{f^*(x_i)}(y_i)$, $i = 1, \ldots, n$, for some unknown function $f^*$. Suppose we have a collection of candidate functions $\mathcal{F}$ and complexities $c(f) > 0$, $f \in \mathcal{F}$, satisfying

$$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1.$$

Define the complexity-regularized estimator

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ -\frac{1}{n} \sum_{i=1}^n \log p_{f(x_i)}(Y_i) + \frac{2\, c(f) \log 2}{n} \right\}.$$

Then

$$E\!\left[ \frac{1}{n} H^2(p_{\hat{f}_n}, p_{f^*}) \right] \le -\frac{2}{n} E\!\left[ \log A(p_{\hat{f}_n}, p_{f^*}) \right] \le \min_{f \in \mathcal{F}} \left\{ \frac{1}{n} K(p_{f^*}, p_f) + \frac{2\, c(f) \log 2}{n} \right\}.$$

Before proving the theorem, let's look at a special case.
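First, though, here is a minimal sketch of how the estimator in Theorem 1 can be computed over a small finite class, using the Poisson model from Example 2; the candidate class, the complexities, and the true intensity are all hypothetical choices.

```python
import numpy as np
from scipy.special import gammaln  # log(y!) = gammaln(y + 1)

rng = np.random.default_rng(2)
n = 256
x = (np.arange(n) + 0.5) / n                # deterministic design points x_i
f_star = lambda x: 5.0 + 10.0 * (x > 0.5)   # unknown Poisson intensity (illustrative)
Y = rng.poisson(f_star(x))                  # Y_i ~ Poisson(f*(x_i))

# Hypothetical candidate class: constant intensities 1, ..., 20, with the
# uniform complexity assignment c(f) = log2 |F|, so sum_f 2^{-c(f)} = 1.
intensities = np.arange(1, 21, dtype=float)
c = np.log2(len(intensities))

def neg_loglik(lam):
    """-(1/n) sum_i log p_{f(x_i)}(Y_i) for the constant intensity lam."""
    return -np.mean(Y * np.log(lam) - lam - gammaln(Y + 1))

crit = [neg_loglik(lam) + 2 * c * np.log(2) / n for lam in intensities]
lam_hat = intensities[int(np.argmin(crit))]
print(lam_hat)  # close to the mean intensity of f*, here 10
```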
Example 3 (Gaussian noise). Suppose

$$Y_i = f^*(x_i) + W_i, \qquad W_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2).$$

Using the results from Example 1, we have

$$p_{f(x_i)}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}, \qquad -2 \log A\!\left( p_{\hat{f}_n(x_i)}, p_{f^*(x_i)} \right) = \frac{\left( \hat{f}_n(x_i) - f^*(x_i) \right)^2}{4\sigma^2},$$

and, since the affinity of a product density is the product of the individual affinities,

$$-2 \log A(p_{\hat{f}_n}, p_{f^*}) = \frac{1}{4\sigma^2} \sum_{i=1}^n \left( \hat{f}_n(x_i) - f^*(x_i) \right)^2,$$

so that

$$-\frac{2}{n} E\!\left[ \log A(p_{\hat{f}_n}, p_{f^*}) \right] = \frac{1}{4\sigma^2} \, E\!\left[ \frac{1}{n} \sum_{i=1}^n \left( \hat{f}_n(x_i) - f^*(x_i) \right)^2 \right].$$

We also have

$$K(p_{f^*}, p_f) = \frac{1}{2\sigma^2} \sum_{i=1}^n \left( f(x_i) - f^*(x_i) \right)^2, \qquad -\log p_f(Y) = \sum_{i=1}^n \frac{(Y_i - f(x_i))^2}{2\sigma^2} + \text{constant}.$$

Combining everything (and multiplying the criterion by the constant $2\sigma^2$, which does not change the minimizer), the estimator becomes a penalized least squares problem:

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^n \left( Y_i - f(x_i) \right)^2 + \frac{4\sigma^2 c(f) \log 2}{n} \right\}.$$

The theorem tells us that

$$\frac{1}{4\sigma^2} \, E\!\left[ \frac{1}{n} \sum_{i=1}^n \left( \hat{f}_n(x_i) - f^*(x_i) \right)^2 \right] \le \min_{f \in \mathcal{F}} \left\{ \frac{1}{2\sigma^2 n} \sum_{i=1}^n \left( f(x_i) - f^*(x_i) \right)^2 + \frac{2\, c(f) \log 2}{n} \right\},$$

or equivalently,

$$E\!\left[ \frac{1}{n} \sum_{i=1}^n \left( \hat{f}_n(x_i) - f^*(x_i) \right)^2 \right] \le \min_{f \in \mathcal{F}} \left\{ \frac{2}{n} \sum_{i=1}^n \left( f(x_i) - f^*(x_i) \right)^2 + \frac{8\sigma^2 c(f) \log 2}{n} \right\}.$$
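Below is a sketch (with illustrative assumptions throughout) of this penalized least-squares form of the estimator, selecting among piecewise-constant fits at dyadic resolutions; the complexity assignment $c(f) = k$ is hypothetical but summable, since $\sum_k 2^{-k} \le 1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 512, 0.25
x = (np.arange(n) + 0.5) / n
f_star = np.where(x < 0.3, 0.0, 1.0)        # unknown step function (illustrative)
Y = f_star + rng.normal(0.0, sigma, size=n)

def piecewise_constant_fit(Y, k):
    """Least-squares fit with k equal-width constant pieces (n divisible by k)."""
    return np.repeat(Y.reshape(k, -1).mean(axis=1), n // k)

# Candidate resolutions k = 1, 2, 4, ..., 128, with assumed complexity c(f) = k.
crit, fits = [], []
for k in [2 ** j for j in range(8)]:
    fhat = piecewise_constant_fit(Y, k)
    fits.append((k, fhat))
    crit.append(np.mean((Y - fhat) ** 2) + 4 * sigma**2 * np.log(2) * k / n)

k_best, f_best = fits[int(np.argmin(crit))]
print(k_best, np.mean((f_best - f_star) ** 2))  # selected resolution and its MSE
```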
Now let's come back to the proof.

Proof: We start by bounding the squared Hellinger distance by the log-affinity. Applying $1 - a \le -\log a$ to each per-sample affinity, and using the fact that the affinity of a product density is the product of the individual affinities,

$$H^2(p_{\hat{f}_n}, p_{f^*}) = \sum_{i=1}^n \int \left( \sqrt{p_{\hat{f}_n(x_i)}(y_i)} - \sqrt{p_{f^*(x_i)}(y_i)} \right)^2 dy_i \le -2 \log \int \sqrt{p_{\hat{f}_n}(y) \, p_{f^*}(y)} \, dy.$$

Taking expectations,

$$E\!\left[ H^2(p_{\hat{f}_n}, p_{f^*}) \right] \le -2\, E\!\left[ \log \int \sqrt{p_{\hat{f}_n}(y) \, p_{f^*}(y)} \, dy \right].$$

Now, define the theoretical analog of $\hat{f}_n$:

$$\tilde{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ K(p_{f^*}, p_f) + 2\, c(f) \log 2 \right\}.$$

Since

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ -\sum_{i=1}^n \log p_{f(x_i)}(Y_i) + 2\, c(f) \log 2 \right\} = \arg\max_{f \in \mathcal{F}} \left\{ \log p_f(Y) - 2\, c(f) \log 2 \right\} = \arg\max_{f \in \mathcal{F}} \log \left( p_f^{1/2}(Y) \, e^{-c(f) \log 2} \right) = \arg\max_{f \in \mathcal{F}} \; p_f^{1/2}(Y) \, e^{-c(f) \log 2},$$

we can see that

$$\frac{p_{\hat{f}_n}^{1/2}(Y) \, e^{-c(\hat{f}_n) \log 2}}{p_{\tilde{f}_n}^{1/2}(Y) \, e^{-c(\tilde{f}_n) \log 2}} \ge 1.$$

Then we can write

$$E\!\left[ H^2(p_{\hat{f}_n}, p_{f^*}) \right] \le -2\, E\!\left[ \log \int \sqrt{p_{\hat{f}_n}(y) \, p_{f^*}(y)} \, dy \right] \le -2\, E\!\left[ \log \left( \frac{p_{\tilde{f}_n}^{1/2}(Y) \, e^{-c(\tilde{f}_n) \log 2}}{p_{\hat{f}_n}^{1/2}(Y) \, e^{-c(\hat{f}_n) \log 2}} \int \sqrt{p_{\hat{f}_n}(y) \, p_{f^*}(y)} \, dy \right) \right],$$

because dividing the argument of the logarithm by the ratio above, which is at least one, makes the argument smaller and hence $-2\log(\cdot)$ larger. Now, simply multiply the argument inside the log by $\frac{p_{f^*}^{1/2}(Y)}{p_{f^*}^{1/2}(Y)}$ to get

$$E\!\left[ H^2(p_{\hat{f}_n}, p_{f^*}) \right] \le E\!\left[ \log \frac{p_{f^*}(Y)}{p_{\tilde{f}_n}(Y)} \right] + 2\, c(\tilde{f}_n) \log 2 + 2\, E\!\left[ \log \frac{p_{\hat{f}_n}^{1/2}(Y) \, e^{-c(\hat{f}_n) \log 2}}{p_{f^*}^{1/2}(Y) \int \sqrt{p_{\hat{f}_n}(y) \, p_{f^*}(y)} \, dy} \right] = K(p_{f^*}, p_{\tilde{f}_n}) + 2\, c(\tilde{f}_n) \log 2 + 2\, E\!\left[ \log \frac{p_{\hat{f}_n}^{1/2}(Y) \, e^{-c(\hat{f}_n) \log 2}}{p_{f^*}^{1/2}(Y) \int \sqrt{p_{\hat{f}_n}(y) \, p_{f^*}(y)} \, dy} \right].$$
The terms

$$K(p_{f^*}, p_{\tilde{f}_n}) + 2\, c(\tilde{f}_n) \log 2 = \min_{f \in \mathcal{F}} \left\{ K(p_{f^*}, p_f) + 2\, c(f) \log 2 \right\}$$

are precisely what we wanted for the upper bound of the theorem. So, to finish the proof, we only need to show that the last term is non-positive. Applying Jensen's inequality, we get

$$2\, E\!\left[ \log \frac{p_{\hat{f}_n}^{1/2}(Y) \, e^{-c(\hat{f}_n) \log 2}}{p_{f^*}^{1/2}(Y) \int \sqrt{p_{\hat{f}_n}(y) \, p_{f^*}(y)} \, dy} \right] \le 2 \log E\!\left[ \frac{p_{\hat{f}_n}^{1/2}(Y) \, e^{-c(\hat{f}_n) \log 2}}{p_{f^*}^{1/2}(Y) \int \sqrt{p_{\hat{f}_n}(y) \, p_{f^*}(y)} \, dy} \right].$$

Both $Y$ and $\hat{f}_n$ are random, which makes the expectation difficult to compute. However, we can simplify the problem using the union bound, which eliminates the dependence on $\hat{f}_n$:

$$2 \log E\!\left[ \frac{p_{\hat{f}_n}^{1/2}(Y) \, e^{-c(\hat{f}_n) \log 2}}{p_{f^*}^{1/2}(Y) \int \sqrt{p_{\hat{f}_n}(y) \, p_{f^*}(y)} \, dy} \right] \le 2 \log \sum_{f \in \mathcal{F}} \frac{e^{-c(f) \log 2}}{\int \sqrt{p_f(y) \, p_{f^*}(y)} \, dy} \, E\!\left[ \sqrt{\frac{p_f(Y)}{p_{f^*}(Y)}} \right] = 2 \log \sum_{f \in \mathcal{F}} 2^{-c(f)} \le 0,$$

where the last two steps come from

$$E\!\left[ \sqrt{\frac{p_f(Y)}{p_{f^*}(Y)}} \right] = \int \sqrt{\frac{p_f(y)}{p_{f^*}(y)}} \, p_{f^*}(y) \, dy = \int \sqrt{p_f(y) \, p_{f^*}(y)} \, dy$$

and

$$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1.$$

This establishes $E[H^2(p_{\hat{f}_n}, p_{f^*})] \le \min_{f \in \mathcal{F}} \{ K(p_{f^*}, p_f) + 2\, c(f) \log 2 \}$; dividing both sides by $n$ gives the theorem. $\blacksquare$