Lectures 12 and 13 - Complexity Penalized Maximum Likelihood Estimation
Rui Castro
May 5, 2013

1 Introduction

As you learned in previous courses, if we have a statistical model we can often estimate unknown parameters by the maximum likelihood principle. Suppose we have independent, but not necessarily identically distributed, data. Namely, we model the data $\{Y_i\}_{i=1}^n$ as independent random variables with densities (with respect to a common dominating measure) given by $p_i(\cdot;\theta)$, where $\theta$ is an unknown "parameter".¹ The Maximum Likelihood Estimator (MLE) of $\theta$ is simply given by
$$\hat{\theta}_n = \arg\max_{\theta\in\Theta} \prod_{i=1}^n p_i(Y_i;\theta),$$
where $\Theta$ is the set of possible parameters. If the support of $p_i(\cdot;\theta)$ is not a function of $\theta$ then we can rewrite the above estimator in terms of the log-likelihood:
$$\hat{\theta}_n = \arg\max_{\theta\in\Theta} \sum_{i=1}^n \log p_i(Y_i;\theta). \qquad (1)$$
We will consider this setting in what follows.

1.1 Why is maximum likelihood a good idea? The Kullback-Leibler divergence

To better understand this, let's introduce the Kullback-Leibler divergence.

Definition 1 (Kullback-Leibler Divergence). Let $p$ and $q$ be two densities (for simplicity, say these are with respect to the Lebesgue measure). The Kullback-Leibler divergence is defined as
$$KL(p\|q) = \begin{cases}\int p(x)\log\frac{p(x)}{q(x)}\,dx & \text{if } p\ll q,\\ \infty & \text{otherwise,}\end{cases}$$
where $p\ll q$ means $q$ dominates $p$; that is, for almost all $x$, if $q(x)=0$ then $p(x)=0$. We use the convention $0\log 0 = \lim_{x\to 0^+} x\log x = 0$.

Here are some important facts about the KL divergence.

¹The word "parameter" is in between quotes as $\theta$ can be infinite dimensional, in what is generally referred to as non-parametric estimation. For instance, this is the case in non-parametric regression.
1. $KL(p\|q)\ge 0$, with equality if and only if $p(x)=q(x)$ almost everywhere, which means $p$ and $q$ represent the same probability measure. The proof of this follows easily from Jensen's inequality:
$$-KL(p\|q) = -\int p(x)\log\frac{p(x)}{q(x)}\,dx = \mathbb{E}_p\left[\log\frac{q(X)}{p(X)}\right], \text{ where } X \text{ is a r.v. with density } p,$$
$$\le \log \mathbb{E}_p\left[\frac{q(X)}{p(X)}\right] = \log\int \frac{q(x)}{p(x)}\,p(x)\,dx = \log\int q(x)\,dx = \log 1 = 0,$$
where the inequality follows from Jensen's inequality, and it is strict unless $p(x)=q(x)$ almost everywhere (note: there are minute technicalities omitted in the above proof).

2. In general $KL(p\|q)\ne KL(q\|p)$. In other words, the KL divergence is not symmetric. It is therefore not a distance (in addition, it doesn't satisfy the triangle inequality). Let's see an example: let $p(x) = 2\cdot\mathbf{1}\{x\in[0,1/2]\}$ and $q(x) = \mathbf{1}\{x\in[0,1]\}$. Then $KL(p\|q) = \log 2$, but $KL(q\|p) = \infty$, since $q$ is not dominated by $p$.

3. The KL divergence is related to testing between two simple hypotheses: if the null hypothesis corresponds to density $p$ and the alternative corresponds to density $q$, then the probabilities of type I and type II errors are intimately related to $KL(p\|q)$ and $KL(q\|p)$ respectively. In particular, larger KL divergences correspond to smaller error probabilities.

4. The product-density property: let $p_i$ and $q_i$, $i=1,\dots,n$, be univariate densities and define the following multivariate densities:
$$p(x_1,\dots,x_n) = p_1(x_1)\,p_2(x_2)\cdots p_n(x_n), \qquad q(x_1,\dots,x_n) = q_1(x_1)\,q_2(x_2)\cdots q_n(x_n).$$
Then
$$KL(p\|q) = \sum_{i=1}^n KL(p_i\|q_i).$$
The proof is quite simple and left as an exercise.

Now let's look again at the MLE formulation in (1). Begin by assuming the $Y_i$ are independent random variables with densities $q_i$. We can rewrite the MLE as
$$\hat{\theta}_n = \arg\min_{\theta\in\Theta}\ \sum_{i=1}^n \log\frac{q_i(Y_i)}{p_i(Y_i;\theta)}.$$
The expected value of the r.h.s. of the above is simply $KL(q\|p(\theta))$, where $q \equiv q_1(\cdot)\cdots q_n(\cdot)$ and $p(\theta) \equiv p_1(\cdot;\theta)\cdots p_n(\cdot;\theta)$ are the product densities under the true data distribution and under the model parameterized by $\theta$, respectively. Therefore we see that the MLE is attempting to find the parameter $\theta\in\Theta$ that is closest to the true data distribution, where "distance" is measured by the KL divergence between the distributions of the data. Actually, with some small additional assumptions one can argue, by the strong law of large numbers, that
$$\frac{1}{n}\sum_{i=1}^n \log\frac{q_i(Y_i)}{p_i(Y_i;\theta)} - \frac{1}{n}KL(q\|p(\theta)) \overset{a.s.}{\longrightarrow} 0$$
as $n\to\infty$, and that $\frac{1}{n}KL(q\|p(\theta))$ converges to some point $c\ge 0$. So, in a sense, the MLE is asymptotically identifying the model in $\Theta$ that is closest to the true distribution of the data.

Example 1 (The regression setting with deterministic design). Let $Y_i = r(x_i) + \epsilon_i$, $i=1,\dots,n$, where the $\epsilon_i$ are i.i.d. normal random variables with variance $\sigma^2$. Then each $Y_i$ has density
$$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(Y_i - r(x_i))^2}{2\sigma^2}\right).$$
The joint likelihood of $Y_1,\dots,Y_n$ is
$$\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(Y_i - r(x_i))^2}{2\sigma^2}\right),$$
and the MLE is simply given by
$$\hat{r}_n = \arg\min_{r\in\mathcal{R}}\ \sum_{i=1}^n (Y_i - r(x_i))^2.$$
So we see that we recover our familiar sum of squared residuals in this case.

In what follows we'll make use of the KL divergence to characterize the behavior of the MLE in a variety of settings. For this we'll need two other measures between densities.

1.2 The Hellinger Distance and the Affinity

As we argued above, the KL divergence is not a proper distance (in particular, it is not symmetric and does not satisfy the triangle inequality). However, it is not hard to define other distances between densities. Notably useful is the Hellinger distance.

Definition 2 (Hellinger Distance). Let $p$ and $q$ be two densities with respect to the Lebesgue measure. The Hellinger distance is defined as
$$H(p,q) = \left(\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx\right)^{1/2}.$$
This is a proper distance: it is symmetric, since $H(p,q) = H(q,p)$, non-negative, and it satisfies the triangle inequality. Remarkably, the squared Hellinger distance provides a lower bound for the KL divergence, so convergence in KL divergence implies convergence in Hellinger distance.

Proposition 1. $H^2(p,q) \le KL(p\|q)$.

Proof.
$$H^2(p,q) = \int\left(\sqrt{p(x)}-\sqrt{q(x)}\right)^2 dx = \int p(x)\,dx + \int q(x)\,dx - 2\int\sqrt{p(x)q(x)}\,dx$$
$$= 2\left(1 - \int\sqrt{p(x)q(x)}\,dx\right) \le -2\log\int\sqrt{p(x)q(x)}\,dx, \quad\text{since } 1-x \le -\log x,$$
$$= -2\log\int\sqrt{\frac{q(x)}{p(x)}}\,p(x)\,dx = -2\log\mathbb{E}\left[\sqrt{\frac{q(X)}{p(X)}}\right], \text{ where } X \text{ has density } p,$$
$$\le -2\,\mathbb{E}\left[\log\sqrt{\frac{q(X)}{p(X)}}\right], \text{ by Jensen's inequality,}$$
$$= \mathbb{E}\left[\log\frac{p(X)}{q(X)}\right] = KL(p\|q). \qquad\square$$

Note that we have also shown that
$$H^2(p,q) = 2\left(1 - \int\sqrt{p(x)q(x)}\,dx\right) \le -2\log\int\sqrt{p(x)q(x)}\,dx.$$
The quantity inside the logarithm is called the Affinity, and is a measure of the similarity between two densities (it has value 1 if these are identical, and zero if they have disjoint supports).

Definition 3 (Affinity). Let $p$ and $q$ be two densities with respect to the Lebesgue measure. The Affinity is defined as
$$A(p,q) = \int\sqrt{p(x)q(x)}\,dx.$$

In summary, we have shown that
$$H^2(p,q) \le -2\log A(p,q) \le KL(p\|q).$$
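None of the above requires continuous densities; for distributions on a finite set (densities with respect to the counting measure) the three quantities become finite sums, and the chain $H^2 \le -2\log A \le KL$ can be checked directly. A minimal sketch (my own, not part of the notes; the example distributions are arbitrary):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions, with the convention 0 log 0 = 0
    and KL = infinity when p is not dominated by q (some q[i] = 0, p[i] > 0)."""
    s = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 log 0 = 0
        if qi == 0.0:
            return math.inf  # p is not absolutely continuous w.r.t. q
        s += pi * math.log(pi / qi)
    return s

def affinity(p, q):
    """A(p, q) = sum of sqrt(p_i q_i); 1 iff p = q, 0 for disjoint supports."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance H^2(p, q)."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.1, 0.2, 0.7]
print(kl(p, q), kl(q, p))  # two different values: KL is not symmetric
# The chain H^2 <= -2 log A <= KL from the summary above:
print(hellinger_sq(p, q) <= -2 * math.log(affinity(p, q)) <= kl(p, q))  # True
```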
1.3 The important case of Gaussian distributions

An important case to consider is when $p$ and $q$ are Gaussian densities with the same variance but different means. Let
$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\theta)^2}{2\sigma^2}}.$$
Then
$$KL(p_{\theta_1}\|p_{\theta_2}) = \int \log\frac{p_{\theta_1}(x)}{p_{\theta_2}(x)}\,p_{\theta_1}(x)\,dx = \int \frac{(x-\theta_2)^2 - (x-\theta_1)^2}{2\sigma^2}\,p_{\theta_1}(x)\,dx$$
$$= \frac{1}{2\sigma^2}\,\mathbb{E}\left[(X-\theta_2)^2 - (X-\theta_1)^2\right], \text{ where } X \text{ has density } p_{\theta_1},$$
$$= \frac{1}{2\sigma^2}\left(\mathbb{E}\left[(X-\theta_1+\theta_1-\theta_2)^2\right] - \mathbb{E}\left[(X-\theta_1)^2\right]\right) = \frac{1}{2\sigma^2}\left(\sigma^2 + (\theta_1-\theta_2)^2 - \sigma^2\right) = \frac{(\theta_1-\theta_2)^2}{2\sigma^2}.$$
So, in this case, the KL divergence is symmetric.

Now, for the affinity and Hellinger distance we proceed in a similar fashion:
$$A(p_{\theta_1}, p_{\theta_2}) = \int\sqrt{p_{\theta_1}(x)\,p_{\theta_2}(x)}\,dx = \int\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\theta_1)^2}{4\sigma^2}}\, e^{-\frac{(x-\theta_2)^2}{4\sigma^2}}\,dx.$$
Completing the square, $(x-\theta_1)^2 + (x-\theta_2)^2 = 2\left(x-\frac{\theta_1+\theta_2}{2}\right)^2 + \frac{(\theta_1-\theta_2)^2}{2}$, so
$$A(p_{\theta_1}, p_{\theta_2}) = e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}\int\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(x-\frac{\theta_1+\theta_2}{2}\right)^2}{2\sigma^2}}\,dx = e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}.$$
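The closed forms above are easy to sanity-check numerically. The following sketch (my own, not from the notes) integrates the KL divergence and affinity of two Gaussians with a midpoint rule and compares against $(\theta_1-\theta_2)^2/(2\sigma^2)$ and $e^{-(\theta_1-\theta_2)^2/(8\sigma^2)}$:

```python
import math

def gauss_pdf(x, theta, sigma):
    """Density of N(theta, sigma^2) at x."""
    return math.exp(-(x - theta) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def numeric_kl_and_affinity(t1, t2, sigma, n=100000):
    """Midpoint-rule integration of KL(p_t1 || p_t2) and A(p_t1, p_t2)
    over a window wide enough (10 sigma past each mean) for the tails."""
    lo, hi = min(t1, t2) - 10 * sigma, max(t1, t2) + 10 * sigma
    h = (hi - lo) / n
    kl = aff = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        p, q = gauss_pdf(x, t1, sigma), gauss_pdf(x, t2, sigma)
        kl += p * math.log(p / q) * h
        aff += math.sqrt(p * q) * h
    return kl, aff

t1, t2, sigma = 0.0, 1.0, 0.5
kl, aff = numeric_kl_and_affinity(t1, t2, sigma)
print(kl, (t1 - t2) ** 2 / (2 * sigma ** 2))              # both close to 2.0
print(aff, math.exp(-(t1 - t2) ** 2 / (8 * sigma ** 2)))  # both close to exp(-1/2)
```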
Therefore
$$H^2(p_{\theta_1}, p_{\theta_2}) = 2\left(1 - e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}\right) \qquad\text{and}\qquad -2\log A(p_{\theta_1}, p_{\theta_2}) = \frac{(\theta_1-\theta_2)^2}{4\sigma^2}.$$

2 Maximum Penalized Likelihood Estimation (MPLE)

Oftentimes, in non-parametric settings, it is necessary either to restrict the choice of possible models under consideration, or to penalize the MLE criterion. The second approach is more general and we will consider it here. Namely, we are going to study estimators of the form
$$\hat{\theta}_n = \arg\min_{\theta\in\Theta}\left\{-\sum_{i=1}^n \log p_i(Y_i;\theta) + \mathrm{pen}(\theta)\right\},$$
where $\mathrm{pen}(\theta)$ is intended to penalize complex models. Next we prove the following important result.

Theorem 1 (Li-Barron 2000, Kolaczyk-Nowak 2004). Let $\mathbf{Y}$ be a random vector (e.g. $\mathbf{Y} = (Y_1,\dots,Y_n)$). Assume the distribution of $\mathbf{Y}$ has unknown density $p^*$. Now suppose we have a class of probability density functions $\{p_\theta\}_{\theta\in\Theta}$ parameterized by $\theta\in\Theta$. Assume $\Theta$ is a countable set, and that for each $\theta$ we have an associated number $c(\theta)\ge 0$ so that the following inequality holds:
$$\sum_{\theta\in\Theta} 2^{-c(\theta)} \le 1. \qquad\text{(Kraft Inequality)}\qquad (2)$$
Define the Maximum Penalized Likelihood Estimator (MPLE) as
$$\hat{\theta}_n = \arg\min_{\theta\in\Theta}\left\{-\log p_\theta(\mathbf{Y}) + 2c(\theta)\log 2\right\}.$$
Then
$$\mathbb{E}\left[H^2(p^*, p_{\hat{\theta}_n})\right] \le \mathbb{E}\left[-2\log A(p^*, p_{\hat{\theta}_n})\right] \le \min_{\theta\in\Theta}\left\{KL(p^*\|p_\theta) + 2c(\theta)\log 2\right\}.$$

This type of result, known as an oracle bound, tells us that the performance of the proposed estimator is essentially as good as that of a clairvoyant estimator in which the likelihood function is replaced by its theoretical counterpart (the KL divergence). Note also that the result is extremely general - there are no assumptions made on the distribution of $\mathbf{Y}$ other than having a density. The major weakness of this result is that it only applies to countable (or finite) classes of candidate models. As we will see, this creates some inconvenience, but we can still analyze many interesting settings by discretizing/quantizing the classes of models under consideration. There are possible extensions of results of this type to uncountable classes of models, but these require several technical assumptions and heavy empirical-processes machinery to prove. Nevertheless the main ideas, intuition and results are essentially preserved. Finally, note that this result gives you performance bounds for maximum likelihood estimation over finite classes of models, as in that case one can take $c(\theta) = \log_2|\Theta|$ and the estimator reduces to the usual MLE (as the penalty term does not depend on $\theta$).
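For a finite class, the MPLE is a straightforward penalized search. The sketch below (illustrative only; the grid of candidate means, the code lengths, and the data are toy choices of mine) implements $\arg\min_\theta\{-\log p_\theta(\mathbf{Y}) + 2c(\theta)\log 2\}$ for i.i.d. Gaussian observations with unknown mean:

```python
import math

def mple(y, thetas, c, sigma=1.0):
    """Maximum penalized likelihood over a countable class:
    argmin_theta { -log p_theta(y) + 2 c(theta) log 2 },
    where p_theta is the joint density of i.i.d. N(theta, sigma^2) samples."""
    def neg_log_lik(theta):
        return (sum((yi - theta) ** 2 for yi in y) / (2 * sigma ** 2)
                + len(y) * 0.5 * math.log(2 * math.pi * sigma ** 2))
    return min(thetas, key=lambda th: neg_log_lik(th) + 2 * c(th) * math.log(2))

# Toy class: a grid of candidate means with uniform code lengths, so the
# penalty is a constant and the MPLE reduces to the plain MLE over the grid.
thetas = [k / 8 for k in range(9)]
c = lambda th: math.log2(len(thetas))
assert sum(2 ** (-c(th)) for th in thetas) <= 1 + 1e-12  # Kraft inequality

y = [0.42, 0.35, 0.58, 0.49]
print(mple(y, thetas, c))  # grid point nearest the sample mean 0.46
```

With non-uniform code lengths the same routine trades likelihood against the complexity charge $2c(\theta)\log 2$.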
2.1 The Gaussian Regression Case

Before proving the theorem, let's look at a very special and important case. We will use these results quite a lot in what follows. Recall the Gaussian regression model, used in many of the previous lectures. Suppose
$$Y_i = r^*(x_i) + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim}\mathcal{N}(0,\sigma^2),$$
where the $x_i$ are deterministic covariates (e.g., $x_i = i/n$), and $r^*$ is the true, unknown, regression function. The density of $Y_i$ is parameterized by $r$ and given by
$$p_i(y_i; r) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(y_i - r(x_i))^2}{2\sigma^2}}, \qquad (3)$$
with the true density corresponding to $r = r^*$. We have seen before that in this case
$$KL(p_i(r^*)\|p_i(r)) = \frac{1}{2\sigma^2}(r^*(x_i) - r(x_i))^2, \qquad -\log A(p_i(r^*), p_i(r)) = \frac{1}{8\sigma^2}(r^*(x_i) - r(x_i))^2.$$
This means that, for the joint density of the data, we have
$$KL(p(r^*)\|p(r)) = \frac{1}{2\sigma^2}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2, \qquad -\log A(p(r^*), p(r)) = \frac{1}{8\sigma^2}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2.$$
Finally, as we have seen,
$$-\log p(\mathbf{Y}; r) = \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - r(x_i))^2 + \text{const},$$
where the constant term does not depend on $r$. So, using all this together with the above theorem, we have the following important corollary.

Corollary 1. Let $Y_i = r^*(x_i) + \epsilon_i$, $\epsilon_i \overset{iid}{\sim}\mathcal{N}(0,\sigma^2)$, where the $x_i$ are deterministic. Let $\mathcal{R}$ be a countable class of models such that there is a map $c:\mathcal{R}\to[0,\infty)$ satisfying $\sum_{r\in\mathcal{R}} 2^{-c(r)}\le 1$. Define
$$\hat{r}_n = \arg\min_{r\in\mathcal{R}}\left\{\sum_{i=1}^n (Y_i - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\}.$$
Then
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{r\in\mathcal{R}}\left\{\frac{2}{n}\sum_{i=1}^n (r(x_i) - r^*(x_i))^2 + \frac{8\sigma^2 c(r)\log 2}{n}\right\}.$$
This corollary shows that the maximum penalized likelihood estimator can perform almost as well (we have a factor 2 in the bound) as if we had observed $r^*(x_i)$ instead of $Y_i$ (i.e., if we had removed the noise from the observations and used the same estimator).

Proof of Theorem 1. We have already shown that $H^2(p^*, p_\theta) \le -2\log A(p^*, p_\theta)$, so clearly
$$\mathbb{E}\left[H^2(p^*, p_{\hat\theta_n})\right] \le \mathbb{E}\left[-2\log A(p^*, p_{\hat\theta_n})\right].$$
We are going to bound the right-hand side of the above expression. Begin by defining the theoretical analog of $\hat\theta_n$:
$$\tilde\theta_n = \arg\min_{\theta\in\Theta}\left\{KL(p^*\|p_\theta) + 2c(\theta)\log 2\right\}.$$
Let's rewrite the definition of the MPLE:
$$\hat\theta_n = \arg\min_{\theta\in\Theta}\left\{-\log p_\theta(\mathbf{Y}) + 2c(\theta)\log 2\right\} = \arg\max_{\theta\in\Theta}\left\{\log p_\theta(\mathbf{Y}) - 2c(\theta)\log 2\right\}$$
$$= \arg\max_{\theta\in\Theta}\left\{\frac{1}{2}\log p_\theta(\mathbf{Y}) - c(\theta)\log 2\right\} = \arg\max_{\theta\in\Theta}\,\log\left(\sqrt{p_\theta(\mathbf{Y})}\,2^{-c(\theta)}\right) = \arg\max_{\theta\in\Theta}\,\sqrt{p_\theta(\mathbf{Y})}\,2^{-c(\theta)}.$$
This means that, for ANY $\theta\in\Theta$,
$$\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)} \ge \sqrt{p_\theta(\mathbf{Y})}\,2^{-c(\theta)}.$$
In particular we have
$$\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)} \ge \sqrt{p_{\tilde\theta_n}(\mathbf{Y})}\,2^{-c(\tilde\theta_n)}.$$
With this fact in hand, let's proceed with the bound:
$$\mathbb{E}\left[-2\log A(p^*, p_{\hat\theta_n})\right] = \mathbb{E}\left[\log\frac{p^*(\mathbf{Y})}{p_{\tilde\theta_n}(\mathbf{Y})}\right] + 2c(\tilde\theta_n)\log 2 + \mathbb{E}\left[2\log\frac{\sqrt{p_{\tilde\theta_n}(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\tilde\theta_n)}}{A(p^*, p_{\hat\theta_n})}\right]$$
$$\le KL(p^*\|p_{\tilde\theta_n}) + 2c(\tilde\theta_n)\log 2 + \mathbb{E}\left[2\log\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\hat\theta_n)}}{A(p^*, p_{\hat\theta_n})}\right],$$
where the inequality uses the optimality property of $\hat\theta_n$ derived above. So, the first two terms of the last line are exactly what we have in the statement of the theorem (by the definition of $\tilde\theta_n$). To conclude the proof we just need to show the last term is always smaller than zero. Note that there are two random quantities in that term, namely $\mathbf{Y}$ and $\hat\theta_n$ (which is a function of $\mathbf{Y}$). Next, we use first Jensen's inequality and then a very crude bound to essentially get rid of $\hat\theta_n$, taking advantage of the fact that $\hat\theta_n\in\Theta$. First note that, by Jensen's inequality,
$$\mathbb{E}\left[2\log\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\hat\theta_n)}}{A(p^*, p_{\hat\theta_n})}\right] \le 2\log\mathbb{E}\left[\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\hat\theta_n)}}{A(p^*, p_{\hat\theta_n})}\right].$$
Now we can make use of the Kraft inequality through a union bound. In this case the union bound is simply saying that any individual term in a summation of non-negative terms is smaller than the summation.² So, we bound the inside of the expectation by a sum over $\Theta$:
$$2\log\mathbb{E}\left[\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\hat\theta_n)}}{A(p^*, p_{\hat\theta_n})}\right] \le 2\log\mathbb{E}\left[\sum_{\theta\in\Theta}\frac{\sqrt{p_\theta(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\theta)}}{A(p^*, p_\theta)}\right]$$
$$= 2\log\sum_{\theta\in\Theta}2^{-c(\theta)}\,\frac{\mathbb{E}\left[\sqrt{p_\theta(\mathbf{Y})/p^*(\mathbf{Y})}\right]}{A(p^*, p_\theta)} = 2\log\sum_{\theta\in\Theta}2^{-c(\theta)} \le 2\log 1 = 0,$$

²Let $z_1, z_2,\dots$ be non-negative. Then for all $i\in\mathbb{N}$, $z_i \le \sum_j z_j$.
where the last step follows from Kraft's inequality, and the previous step follows simply by noting that
$$\mathbb{E}\left[\sqrt{\frac{p_\theta(\mathbf{Y})}{p^*(\mathbf{Y})}}\right] = \int\sqrt{\frac{p_\theta(\mathbf{y})}{p^*(\mathbf{y})}}\,p^*(\mathbf{y})\,d\mathbf{y} = \int\sqrt{p_\theta(\mathbf{y})\,p^*(\mathbf{y})}\,d\mathbf{y} = A(p^*, p_\theta). \qquad\square$$

2.2 Choosing the values c(θ)

In order to apply the theorem or the corollary we need to construct a map $c(\cdot)$ from $\Theta$ to $[0,\infty)$ satisfying the Kraft inequality. These values can be thought of as a measure of the complexity of each model $\theta$. If one has a proper probability mass distribution over $\Theta$, say $P_{\mathrm{prior}}:\Theta\to[0,1]$, then we can just use $c(\theta) = -\log_2 P_{\mathrm{prior}}(\theta)$. Clearly this satisfies the Kraft inequality. Moreover, this estimator has a strong Bayesian interpretation - it corresponds to the Maximum a Posteriori estimator for the prior $P_{\mathrm{prior}}$.

Another way to satisfy the Kraft inequality is to use coding arguments. This has the advantage that we can very easily devise maps that automatically satisfy the Kraft inequality. Assume we assign a binary codeword (a sequence of 0s and 1s) to each element of $\Theta$, and let $c(\theta)$ denote the length of the codeword for $\theta$. The set $\Theta$ is called the alphabet, and the idea is to encode sequences of symbols from the alphabet by concatenating the corresponding codewords. A very useful class of codes are the so-called prefix codes.

Definition 4. A code is called a prefix or instantaneous code if no codeword is a prefix of any other codeword.

The reason for such a name and definition is clarified in the following example.

Example 2 (From Cover & Thomas '91). Consider an alphabet of four symbols, say A, B, C, and D, and the four codebooks in Figure 1.

  Symbol | Singular | Nonsingular, But Not   | Uniquely Decodable,  | Prefix Code
         |          | Uniquely Decodable     | But Not a Prefix Code|
  A      | 0        | 0                      | 10                   | 0
  B      | 0        | 010                    | 00                   | 10
  C      | 0        | 01                     | 11                   | 110
  D      | 0        | 10                     | 110                  | 111

  Figure 1: Four possible codes for an alphabet of four symbols.

Suppose we want to convey the sentence ADBBAC. This will be encoded in each system respectively as 000000, 010010010001, 1011000001011, and 011110100110. It is clear that with the singular codebook we assign the same codeword to each symbol - a system that is obviously flawed!
In the second case the code is not singular, but the string 010 could represent B, or CA, or AD, so the binary sequence above is ambiguous: it can be decoded both as ADBBAC or as BBBAC, for instance. Hence it is not a uniquely decodable codebook. The third and fourth cases are both examples of uniquely decodable codebooks, but the fourth has the added feature that no codeword is a prefix of another. Prefix codes can be decoded from left to right instantaneously, since each codeword is self-punctuating - that is, you know immediately when a codeword ends. Note that, until the tenth bit in the third case, you don't know whether the sequence starts as ADBB or ACBBB, and only then can you decide. In the fourth case you know immediately when a codeword ends and another starts.
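The self-punctuating property is exactly what makes greedy left-to-right decoding work. A minimal sketch (my own, using a standard four-symbol prefix code {0, 10, 110, 111}):

```python
def decode_prefix(bits, codebook):
    """Decode a bit string under a prefix code by greedy left-to-right
    matching; prefix-freeness guarantees each codeword is self-punctuating,
    so no lookahead or backtracking is ever needed."""
    inv = {code: sym for sym, code in codebook.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:          # a codeword just ended
            out.append(inv[buf])
            buf = ""
    assert buf == "", "dangling bits: not a concatenation of codewords"
    return "".join(out)

code = {"A": "0", "B": "10", "C": "110", "D": "111"}

# Kraft's inequality holds for any prefix code: sum of 2^(-length) <= 1.
assert sum(2 ** -len(w) for w in code.values()) <= 1

msg = "ADBBAC"
bits = "".join(code[s] for s in msg)
print(bits)                       # 011110100110
print(decode_prefix(bits, code))  # ADBBAC
```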
The good thing about prefix codes is that they are easy to construct and automatically satisfy the Kraft inequality. We will use this approach often.

Theorem 2 (The Kraft Inequality). For any binary prefix code, the codeword lengths $c_1, c_2,\dots$ satisfy
$$\sum_k 2^{-c_k} \le 1.$$
Conversely, given any $c_1, c_2,\dots$ satisfying the inequality above, we can construct a prefix code with these codeword lengths.

Proof. We will prove that for any binary prefix code the codeword lengths $c_1, c_2,\dots$ satisfy $\sum_k 2^{-c_k}\le 1$. The converse is easy to prove also, but it is not central to our purposes here (for a proof, see Cover & Thomas '91). Consider a binary tree like the one shown in Figure 2.

Figure 2: A binary tree, with branches labeled 0 and 1 leading from the root down to the leaves.

The sequence of bit values leading from the root to a leaf of the tree represents a codeword. The prefix condition implies that no codeword is a descendant of any other codeword in the tree. Therefore each codeword of our code is represented by a leaf of the tree. Let $c_{\max}$ be the length of the longest codeword (also the number of branches to the deepest leaf). A node in the tree at level $c_i$ can have at most $2^{c_{\max}-c_i}$ descendants at level $c_{\max}$. Furthermore, for each leaf of the tree the set of possible descendants at level $c_{\max}$ is disjoint (since no codeword can be a prefix of another). Therefore, since the total number of possible nodes at level $c_{\max}$ is $2^{c_{\max}}$, we have
$$\sum_{i\in\text{leafs}} 2^{c_{\max}-c_i} \le 2^{c_{\max}} \quad\Longrightarrow\quad \sum_{i\in\text{leafs}} 2^{-c_i} \le 1,$$
which proves the case when the number of codewords is finite.
Suppose now that we have a countably infinite number of codewords. Let $b_1 b_2\cdots b_{c_i}$ be the $i$th codeword and let
$$r_i = \sum_{j=1}^{c_i} b_j\,2^{-j}$$
be the real number corresponding to the binary expansion of the codeword (in binary, $r_i = 0.b_1 b_2\cdots b_{c_i}$). We can associate the interval $[r_i, r_i + 2^{-c_i})$ with the $i$th codeword. This is the set of all real numbers whose binary expansion begins with $b_1 b_2\cdots b_{c_i}$. Since this is a subinterval of $[0,1]$, and all such subintervals corresponding to prefix codewords are disjoint, the sum of their lengths must be less than or equal to 1, that is, $\sum_i 2^{-c_i}\le 1$. This proves the case where the number of codewords is infinite. $\square$

3 Adapting to Unknown Parameters

Earlier in the course we saw that one can estimate Lipschitz smooth functions well. In this section we generalize those results to include smoother functions. Furthermore, we will be able to automatically adjust to the unknown level of smoothness of the functions.

3.1 Lipschitz Functions

Suppose we have a function $r^*:[0,1]\to[-R,R]$ satisfying the Lipschitz smoothness assumption
$$\forall s,t\in[0,1],\quad |r^*(s) - r^*(t)| \le L|s-t|.$$
We have seen that these functions can be well approximated by piecewise constant functions of the form
$$g(x) = \sum_{j=1}^m c_j\,\mathbf{1}\{x\in I_j\}, \qquad\text{where } I_j = \left[\frac{j-1}{m},\frac{j}{m}\right).$$
Let's use maximum likelihood to pick the best such model. Suppose we have the following regression model:
$$Y_i = r^*(x_i) + \epsilon_i, \qquad i=1,\dots,n,$$
where $x_i = i/n$ and $\epsilon_i\overset{iid}{\sim}\mathcal{N}(0,\sigma^2)$. To be able to use the corollary we derived, we need a countable or finite class of models. The easiest way to do so is to discretize/quantize the possible values of each constant piece in the candidate models. Define
$$\mathcal{R}_m = \left\{\sum_{j=1}^m c_j\,\mathbf{1}\{x\in I_j\} : c_j\in Q\right\}, \quad\text{where}\quad Q = \left\{-R + \frac{2Rk}{\sqrt{n}} : k = 0,\dots,\sqrt{n}\right\}.$$
Therefore $\mathcal{R}_m$ has exactly $(\sqrt{n}+1)^m$ elements in total. This means that, by taking
$$c(r) = \log_2\left((\sqrt{n}+1)^m\right) = m\log_2(\sqrt{n}+1)$$
for all $r\in\mathcal{R}_m$, we satisfy the Kraft inequality:
$$\sum_{r\in\mathcal{R}_m} 2^{-c(r)} = \sum_{r\in\mathcal{R}_m}\frac{1}{|\mathcal{R}_m|} = 1.$$
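As a quick numerical illustration of this counting argument (with toy values of $n$ and $m$ chosen arbitrarily), uniform code lengths over the quantized class meet Kraft's inequality with equality:

```python
import math

# Toy count check for the discretized class R_m: with sqrt(n) + 1 quantized
# levels per bin, |R_m| = (sqrt(n) + 1)^m, and the uniform code lengths
# c(r) = log2 |R_m| make the Kraft sum equal to 1.
n, m = 64, 4
levels = math.isqrt(n) + 1   # 9 quantized levels in [-R, R]
size = levels ** m           # |R_m| = 9^4 = 6561
c = math.log2(size)          # = m * log2(sqrt(n) + 1)
kraft_sum = size * 2 ** (-c)
print(size, kraft_sum)       # 6561, and a Kraft sum of (numerically) 1
```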
So we are ready to apply our oracle bound. Since $c(r)$ is just a constant (not really a function of $r$), the estimator is simply the MLE over $\mathcal{R}_m$:
$$\hat{r}_n = \arg\min_{r\in\mathcal{R}_m}\left\{\sum_{i=1}^n (Y_i - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\} = \arg\min_{r\in\mathcal{R}_m}\sum_{i=1}^n (Y_i - r(x_i))^2.$$
The corollary then says that
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{r\in\mathcal{R}_m}\left\{\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 c(r)\log 2}{n}\right\}.$$
So far the result is extremely general, as we have not made use of the Lipschitz assumption. We have seen earlier that there is a piecewise constant function $\bar{r}_m(x) = \sum_{j=1}^m c_j\mathbf{1}\{x\in I_j\}$ such that for all $x\in[0,1]$ we have $|r^*(x) - \bar{r}_m(x)| \le L/m$. The problem is that, generally, $\bar{r}_m\notin\mathcal{R}_m$, since $c_j\notin Q$. Take instead the element of $\mathcal{R}_m$ that is closest to $\bar{r}_m$, namely
$$\tilde{r}_m = \arg\min_{r\in\mathcal{R}_m}\ \sup_{x\in[0,1]}|r(x) - \bar{r}_m(x)|.$$
It is clear that $|\tilde{r}_m(x) - \bar{r}_m(x)| \le R/\sqrt{n}$ for all $x\in[0,1]$; therefore, by the triangle inequality, we have
$$|r^*(x) - \tilde{r}_m(x)| \le |r^*(x) - \bar{r}_m(x)| + |\bar{r}_m(x) - \tilde{r}_m(x)| \le \frac{L}{m} + \frac{R}{\sqrt{n}}.$$
Now, we can just use this in our bound:
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{r\in\mathcal{R}_m}\left\{\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 c(r)\log 2}{n}\right\}$$
$$\le 2\left(\frac{L}{m} + \frac{R}{\sqrt{n}}\right)^2 + \frac{8\sigma^2 m\log_2(\sqrt{n}+1)\log 2}{n} = 2\left(\frac{L}{m} + \frac{R}{\sqrt{n}}\right)^2 + \frac{8\sigma^2 m\log(\sqrt{n}+1)}{n}.$$
So, to ensure the best bound possible, we should choose $m$ minimizing the right-hand side. This yields $m \asymp (n/\log n)^{1/3}$ and
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] = O\left((n/\log n)^{-2/3}\right),$$
which, apart from the logarithmic factor, is the best we can ever hope for (this logarithmic factor is due to the discretization of the model classes, and is an artifact of this approach). If we want the truly best possible bound we need to minimize the above expression with respect to $m$, and for that we need to know $L$. Can we do better? Can we "automagically" choose $m$ using the data? The answer is yes, and for this we will start taking full advantage of our oracle bound.
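A toy implementation of this discretized least-squares fit (my own sketch: the true regression function, the noise level, and the projection onto the grid are illustrative choices, with $m$ set to the $(n/\log n)^{1/3}$ prescription):

```python
import math, random

def fit_piecewise_constant(y, m, R=1.0):
    """Least-squares fit over the discretized class R_m: m equal bins on [0,1],
    each constant level quantized to Q = {-R + 2Rk/sqrt(n) : k = 0..sqrt(n)}."""
    n = len(y)
    g = math.isqrt(n)                               # sqrt(n) grid resolution
    Q = [-R + 2 * R * k / g for k in range(g + 1)]
    fit = [0.0] * n
    for j in range(m):
        lo, hi = j * n // m, (j + 1) * n // m
        avg = sum(y[lo:hi]) / (hi - lo)             # unconstrained LS level on bin j
        level = min(Q, key=lambda q: abs(q - avg))  # project onto the grid Q
        for i in range(lo, hi):
            fit[i] = level
    return fit

random.seed(0)
n = 1024
r_star = [0.5 * math.sin(2 * math.pi * i / n) for i in range(n)]  # a Lipschitz truth
y = [r + random.gauss(0, 0.2) for r in r_star]
m = round((n / math.log(n)) ** (1 / 3))   # the (n / log n)^(1/3) prescription
fit = fit_piecewise_constant(y, m)
mse = sum((f - r) ** 2 for f, r in zip(fit, r_star)) / n
print(m, mse)  # the risk falls well below the raw noise variance 0.04
```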
Since we want to choose the best possible $m$, we must consider the following class of models:
$$\mathcal{R} = \bigcup_{m=1}^\infty \mathcal{R}_m.$$
This is clearly a countable class of models (but not finite), so we need to be a bit more careful in constructing the map $c(\cdot)$. Let's use a coding argument: begin by defining
$$m(r) = \min\{m\in\mathbb{N} : r\in\mathcal{R}_m\}.$$
Encode $r\in\mathcal{R}$ using first the bits $0\cdots01$ ($m(r)$ bits in total, a unary encoding of $m(r)$), and then $\log_2|\mathcal{R}_{m(r)}|$ bits to encode which model inside $\mathcal{R}_{m(r)}$ is $r$. This is clearly a prefix code and therefore satisfies the Kraft inequality. More formally,
$$c(r) = m(r) + \log_2|\mathcal{R}_{m(r)}| = m(r) + \log_2\left((\sqrt{n}+1)^{m(r)}\right) = m(r)\left(1 + \log_2(\sqrt{n}+1)\right).$$
Although we know, from the coding argument, that the map $c(\cdot)$ satisfies the Kraft inequality for sure, we can do a little sanity check and ensure this is indeed true:
$$\sum_{r\in\mathcal{R}} 2^{-c(r)} = \sum_{m=1}^\infty\ \sum_{r:\,m(r)=m} 2^{-m-\log_2|\mathcal{R}_m|} \le \sum_{m=1}^\infty |\mathcal{R}_m|\,\frac{2^{-m}}{|\mathcal{R}_m|} = \sum_{m=1}^\infty 2^{-m} = 1.$$
Now, similarly to what we had before,
$$\hat{r}_n = \arg\min_{r\in\mathcal{R}}\left\{\sum_{i=1}^n (Y_i - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\} = \arg\min_{r\in\mathcal{R}}\left\{\sum_{i=1}^n (Y_i - r(x_i))^2 + 4\sigma^2 m(r)\left(1 + \log_2(\sqrt{n}+1)\right)\log 2\right\},$$
which is no longer the MLE, but rather a maximum penalized likelihood estimator. Then
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{r\in\mathcal{R}}\left\{\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 m(r)(1+\log_2(\sqrt{n}+1))\log 2}{n}\right\}$$
$$= \min_{m\in\mathbb{N}}\ \min_{r\in\mathcal{R}_m}\left\{\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 m(1+\log_2(\sqrt{n}+1))\log 2}{n}\right\}$$
$$\le \min_{m\in\mathbb{N}}\left\{2\left(\frac{L}{m} + \frac{R}{\sqrt{n}}\right)^2 + \frac{8\sigma^2 m(\log 2 + \log(\sqrt{n}+1))}{n}\right\}.$$
Therefore this estimator automatically chooses the best possible number of parameters $m$. Note that the price we pay is very modest - the only change from what we had before is that the term $\log(\sqrt{n}+1)$ is replaced by $\log 2 + \log(\sqrt{n}+1)$, which is a very minute change, as $\log 2 \ll \log(\sqrt{n}+1)$ for any interesting sample size. Although this is remarkable, we can probably do much better, and adjust to unknown smoothness. We'll see how to do this next.

4 Regression of Hölder smooth functions

Let's begin by defining a class of Hölder smooth functions. Let $\alpha > 0$ and let $H^\alpha(C,R)$ be the set of all functions $r:[0,1]\to[-R,R]$ which have $\lfloor\alpha\rfloor$ derivatives and satisfy
$$|r(x) - T_y^{\lfloor\alpha\rfloor}(x)| \le C|x-y|^\alpha \qquad \forall x,y\in[0,1],$$
where $\lfloor\alpha\rfloor$ is the largest integer such that $\lfloor\alpha\rfloor < \alpha$, and $T_y^{\lfloor\alpha\rfloor}$ is the Taylor polynomial of $r$ of degree $\lfloor\alpha\rfloor$ around the point $y$. In words, a Hölder-$\alpha$ smooth function is locally well approximated by a polynomial of degree $\lfloor\alpha\rfloor$. Note that the above definition coincides with our Lipschitz function class when $\alpha = 1$. Hölder smoothness essentially measures how differentiable functions are, and therefore Taylor polynomials are the natural way to approximate Hölder smooth functions. We will focus on Hölder smooth function classes with $0 < \alpha \le 2$, but the results presented can easily be generalized to larger values of $\alpha$. Since $\alpha \le 2$ we will work with piecewise linear approximations, that is, Taylor polynomials of degree 1. If we were to consider smoother functions, $\alpha > 2$, we would need to consider higher-degree Taylor polynomial approximations, i.e. quadratic, cubic, etc.

4.1 Regression of Hölder smooth functions

Consider the usual regression model
$$Y_i = r^*(x_i) + \epsilon_i, \qquad i=1,\dots,n,$$
where $x_i = i/n$ and $\epsilon_i\overset{iid}{\sim}\mathcal{N}(0,\sigma^2)$. Let's assume $r^*\in H^\alpha(C,R)$, with unknown smoothness $0<\alpha\le 2$. Intuitively, the smoother $r^*$ is, the better we should be able to estimate it, as we can average more
observations locally. In other words, for smoother $r^*$ we should average over
larger bins. Also, we will need to exploit the extra smoothness in our models for $r^*$. To that end, we will consider candidate functions that are piecewise linear, i.e., functions of the form
$$\sum_{j=1}^m (a_j + b_j x)\,\mathbf{1}\{x\in I_j\}, \qquad\text{where } I_j = \left[\frac{j-1}{m},\frac{j}{m}\right).$$
As before, we want to consider countable/finite classes of models to be able to apply our corollary, so we will consider a slight modification of the above. Each linear piece can be described by its values at the beginning and end points of the corresponding interval, so we are going to restrict those values to lie on a grid (refer to Figure 3).

Figure 3: Example of the discretization of $r$ on the interval $[(j-1)/m, j/m)$: the endpoint values of the linear piece are restricted to a grid of quantized levels.

Define the class
$$\mathcal{R}_m = \left\{r(x) = \sum_{j=1}^m l_j(x)\,\mathbf{1}\{x\in I_j\}\right\},$$
where
$$l_j(x) = a_j\,\frac{j/m - x}{1/m} + b_j\,\frac{x - (j-1)/m}{1/m} = (j - mx)\,a_j + (mx - j + 1)\,b_j,$$
and $a_j, b_j \in \left\{-R + \frac{2Rk}{\sqrt{n}} : k = 0,\dots,\sqrt{n}\right\}$. Clearly $|\mathcal{R}_m| = (\sqrt{n}+1)^{2m}$. Since we don't know the smoothness a priori, we must choose $m$ using the data. Therefore, in the same fashion as before, we take the class
$$\mathcal{R} = \bigcup_{m=1}^\infty \mathcal{R}_m, \qquad\text{with } m(r) = \min\{m\in\mathbb{N} : r\in\mathcal{R}_m\},$$
and
$$c(r) = m(r) + \log_2|\mathcal{R}_{m(r)}| = m(r)\left(1 + \log_2\left((\sqrt{n}+1)^2\right)\right) = m(r)\left(1 + 2\log_2(\sqrt{n}+1)\right).$$
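A sketch of the resulting data-driven choice of $m$ (illustrative only: it fits plain bin averages, skipping the quantization of the levels, and uses the piecewise-constant penalty $c(m) = m(1+\log_2(\sqrt{n}+1))$ from the Lipschitz section rather than the piecewise-linear one; the signal and noise level are toy choices):

```python
import math, random

def penalized_select(y, sigma, max_m=64):
    """Pick m by minimizing SSE + 4 sigma^2 c(m) log 2 over piecewise-constant
    fits with m equal bins, with c(m) = m (1 + log2(sqrt(n) + 1))."""
    n = len(y)
    best_m, best_crit = 1, float("inf")
    for m in range(1, max_m + 1):
        sse = 0.0
        for j in range(m):
            lo, hi = j * n // m, (j + 1) * n // m
            avg = sum(y[lo:hi]) / (hi - lo)
            sse += sum((yi - avg) ** 2 for yi in y[lo:hi])
        c_m = m * (1 + math.log2(math.isqrt(n) + 1))
        crit = sse + 4 * sigma ** 2 * c_m * math.log(2)
        if crit < best_crit:
            best_m, best_crit = m, crit
    return best_m

random.seed(1)
n, sigma = 1024, 0.2
y = [0.5 * math.sin(2 * math.pi * i / n) + random.gauss(0, sigma) for i in range(n)]
print(penalized_select(y, sigma))  # a moderate m, chosen without knowing L
print(penalized_select([random.gauss(0, sigma) for _ in range(n)], sigma))  # pure noise: m stays tiny
```

The penalty automatically balances the residual sum of squares against model complexity: a wiggly signal earns more bins, pure noise does not.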
Exactly as before, define the estimator
$$\hat{r}_n = \arg\min_{r\in\mathcal{R}}\left\{\sum_{i=1}^n (Y_i - r(x_i))^2 + 4\sigma^2 m(r)\left(1 + 2\log_2(\sqrt{n}+1)\right)\log 2\right\}.$$
Then
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{r\in\mathcal{R}}\left\{\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 m(r)(1+2\log_2(\sqrt{n}+1))\log 2}{n}\right\}$$
$$= \min_{m\in\mathbb{N}}\left\{\min_{r\in\mathcal{R}_m}\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 m(1+2\log_2(\sqrt{n}+1))\log 2}{n}\right\}.$$
In the above, the first term is essentially our familiar approximation error, and the second term is, in a sense, bounding the estimation error. Therefore this estimator automatically seeks the best balance between the two.

To say something more concrete about the performance of the estimator we need to bring in the assumptions we have on $r^*$ (so far we haven't used any). First, suppose $r^*\in H^\alpha(C,R)$ for $1<\alpha\le 2$. We need to find a good model in the class $\mathcal{R}_m$ that makes the approximation error small. Take $x\in I_j$, where $j$ is arbitrary. From the definition of $H^\alpha(C,R)$ we have
$$\left|r^*(x) - r^*\!\left(\tfrac{j}{m}\right) - \tfrac{\partial}{\partial x}r^*\!\left(\tfrac{j}{m}\right)\left(x - \tfrac{j}{m}\right)\right| \le C\left|x - \tfrac{j}{m}\right|^\alpha \le C\left(\tfrac{1}{m}\right)^\alpha = Cm^{-\alpha}.$$
Therefore we can approximate $r^*(x)$, for $x\in I_j$, by a linear function, such that the error is bounded by $Cm^{-\alpha}$. In particular, the best piecewise linear approximation, defined as
$$\bar{r}_m = \arg\min_{r\ \text{piecewise linear}}\ \sup_{x\in[0,1]}|r(x) - r^*(x)|,$$
satisfies $|r^*(x) - \bar{r}_m(x)| \le Cm^{-\alpha}$ for all $x\in[0,1]$. Clearly, this function is in general not in $\mathcal{R}_m$, so we need to do some discretization. Take the function in $\mathcal{R}_m$ that is closest to $\bar{r}_m$:
$$\tilde{r}_m = \arg\min_{r\in\mathcal{R}_m}\ \sup_{x\in[0,1]}|r(x) - \bar{r}_m(x)|.$$
Because of the way we discretized, we know that $\sup_{x\in[0,1]}|\tilde{r}_m(x) - \bar{r}_m(x)| \le R/\sqrt{n}$. Using this together with the triangle inequality yields
$$\forall x\in[0,1],\quad |\tilde{r}_m(x) - r^*(x)| \le Cm^{-\alpha} + \frac{R}{\sqrt{n}}. \qquad (4)$$
If $r^*\in H^\alpha(C,R)$ for $0<\alpha\le 1$, we can proceed in a similar fashion, but simply have to note that such functions are well approximated by piecewise constant functions. Furthermore, these are a subset of $\mathcal{R}_m$, so the reasoning we used for Lipschitz functions applies directly here, and we obtain expression (4) for that case as well.
Finally, we can just plug these results into the bound of the corollary:
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{m\in\mathbb{N}}\left\{\min_{r\in\mathcal{R}_m}\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 m(1+2\log_2(\sqrt{n}+1))\log 2}{n}\right\}$$
$$\le \min_{m\in\mathbb{N}}\left\{2\left(Cm^{-\alpha} + \frac{R}{\sqrt{n}}\right)^2 + \frac{8\sigma^2 m(1+2\log_2(\sqrt{n}+1))\log 2}{n}\right\}$$
$$= \min_{m\in\mathbb{N}}\ O\left(\max\left\{m^{-2\alpha},\ \frac{1}{n},\ \frac{m\log n}{n}\right\}\right).$$
It is not hard to see that the first and last terms dominate the bound, and so we attain the minimum by taking (in the bound)
$$m \asymp \left(\frac{n}{\log n}\right)^{\frac{1}{2\alpha+1}},$$
which yields
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] = O\left(\left(\frac{\log n}{n}\right)^{\frac{2\alpha}{2\alpha+1}}\right).$$
Note that the estimator does not know $\alpha$! So we are indeed adapting to the unknown smoothness. If the regression function $r^*$ is Lipschitz, this estimator has error rate $O\left((n/\log n)^{-2/3}\right)$. However, if the function is smoother (say $\alpha = 2$), the estimator has error rate $O\left((n/\log n)^{-4/5}\right)$, which decays much more quickly to zero. More remarkably, apart from the logarithmic factor, it can be shown that this is the best one can hope for! So the logarithmic factor is the very small price we need to pay for adaptivity.
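To get a feel for what adaptivity buys, one can tabulate the bound $((\log n)/n)^{2\alpha/(2\alpha+1)}$ for a couple of smoothness levels (a small numerical sketch, not part of the notes):

```python
import math

def risk_bound_rate(n, alpha):
    """The adaptive risk bound from the notes: ((log n)/n)^(2 alpha / (2 alpha + 1)).
    Larger alpha (smoother truth) gives a larger exponent, hence faster decay."""
    return (math.log(n) / n) ** (2 * alpha / (2 * alpha + 1))

for n in (10**4, 10**6):
    # alpha = 1 (Lipschitz): exponent 2/3; alpha = 2: exponent 4/5
    print(n, risk_bound_rate(n, 1.0), risk_bound_rate(n, 2.0))
```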
More informationLecture 7: Density Estimation: k-nearest Neighbor and Basis Approach
STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.
More informationProduct measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.
Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the
More informationLecture 3: August 31
36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,
More informationStudy the bias (due to the nite dimensional approximation) and variance of the estimators
2 Series Methods 2. Geeral Approach A model has parameters (; ) where is ite-dimesioal ad is oparametric. (Sometimes, there is o :) We will focus o regressio. The fuctio is approximated by a series a ite
More information32 estimating the cumulative distribution function
32 estimatig the cumulative distributio fuctio 4.6 types of cofidece itervals/bads Let F be a class of distributio fuctios F ad let θ be some quatity of iterest, such as the mea of F or the whole fuctio
More informationOn Random Line Segments in the Unit Square
O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,
More informationLecture 3 The Lebesgue Integral
Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified
More informationEmpirical Process Theory and Oracle Inequalities
Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5
CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio
More informationProblem Set 4 Due Oct, 12
EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios
More informationFall 2013 MTH431/531 Real analysis Section Notes
Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters
More informationIntegrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number
MATH 532 Itegrable Fuctios Dr. Neal, WKU We ow shall defie what it meas for a measurable fuctio to be itegrable, show that all itegral properties of simple fuctios still hold, ad the give some coditios
More informationMath 155 (Lecture 3)
Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,
More informationLecture 10 October Minimaxity and least favorable prior sequences
STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least
More informationLecture 20: Multivariate convergence and the Central Limit Theorem
Lecture 20: Multivariate covergece ad the Cetral Limit Theorem Covergece i distributio for radom vectors Let Z,Z 1,Z 2,... be radom vectors o R k. If the cdf of Z is cotiuous, the we ca defie covergece
More informationAn Introduction to Randomized Algorithms
A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis
More informationResampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.
Jauary 1, 2019 Resamplig Methods Motivatio We have so may estimators with the property θ θ d N 0, σ 2 We ca also write θ a N θ, σ 2 /, where a meas approximately distributed as Oce we have a cosistet estimator
More informationEECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1
EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum
More informationEconomics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator
Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters
More information5.1 A mutual information bound based on metric entropy
Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local
More informationSieve Estimators: Consistency and Rates of Convergence
EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes
More informationLecture 11 and 12: Basic estimation theory
Lecture ad 2: Basic estimatio theory Sprig 202 - EE 94 Networked estimatio ad cotrol Prof. Kha March 2 202 I. MAXIMUM-LIKELIHOOD ESTIMATORS The maximum likelihood priciple is deceptively simple. Louis
More informationLet us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.
Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,
More informationLecture 7: October 18, 2017
Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem
More informationLecture 14: Graph Entropy
15-859: Iformatio Theory ad Applicatios i TCS Sprig 2013 Lecture 14: Graph Etropy March 19, 2013 Lecturer: Mahdi Cheraghchi Scribe: Euiwoog Lee 1 Recap Bergma s boud o the permaet Shearer s Lemma Number
More informationEstimation for Complete Data
Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of
More informationIf a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero?
2 Lebesgue Measure I Chapter 1 we defied the cocept of a set of measure zero, ad we have observed that every coutable set is of measure zero. Here are some atural questios: If a subset E of R cotais a
More information4.1 Data processing inequality
ECE598: Iformatio-theoretic methods i high-dimesioal statistics Sprig 206 Lecture 4: Total variatio/iequalities betwee f-divergeces Lecturer: Yihog Wu Scribe: Matthew Tsao, Feb 8, 206 [Ed. Mar 22] Recall
More informationInformation Theory and Statistics Lecture 4: Lempel-Ziv code
Iformatio Theory ad Statistics Lecture 4: Lempel-Ziv code Łukasz Dębowski ldebowsk@ipipa.waw.pl Ph. D. Programme 203/204 Etropy rate is the limitig compressio rate Theorem For a statioary process (X i)
More informationThe Growth of Functions. Theoretical Supplement
The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that
More informationStatistics 511 Additional Materials
Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability
More information2 Banach spaces and Hilbert spaces
2 Baach spaces ad Hilbert spaces Tryig to do aalysis i the ratioal umbers is difficult for example cosider the set {x Q : x 2 2}. This set is o-empty ad bouded above but does ot have a least upper boud
More informationKernel density estimator
Jauary, 07 NONPARAMETRIC ERNEL DENSITY ESTIMATION I this lecture, we discuss kerel estimatio of probability desity fuctios PDF Noparametric desity estimatio is oe of the cetral problems i statistics I
More informationEE 4TM4: Digital Communications II Probability Theory
1 EE 4TM4: Digital Commuicatios II Probability Theory I. RANDOM VARIABLES A radom variable is a real-valued fuctio defied o the sample space. Example: Suppose that our experimet cosists of tossig two fair
More information1 Convergence in Probability and the Weak Law of Large Numbers
36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec
More informationThe Boolean Ring of Intervals
MATH 532 Lebesgue Measure Dr. Neal, WKU We ow shall apply the results obtaied about outer measure to the legth measure o the real lie. Throughout, our space X will be the set of real umbers R. Whe ecessary,
More informationLecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound
Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee
More informationLaw of the sum of Bernoulli random variables
Law of the sum of Beroulli radom variables Nicolas Chevallier Uiversité de Haute Alsace, 4, rue des frères Lumière 68093 Mulhouse icolas.chevallier@uha.fr December 006 Abstract Let be the set of all possible
More informationShannon s noiseless coding theorem
18.310 lecture otes May 4, 2015 Shao s oiseless codig theorem Lecturer: Michel Goemas I these otes we discuss Shao s oiseless codig theorem, which is oe of the foudig results of the field of iformatio
More informationThe picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled
1 Lecture : Area Area ad distace traveled Approximatig area by rectagles Summatio The area uder a parabola 1.1 Area ad distace Suppose we have the followig iformatio about the velocity of a particle, how
More informationMath 2784 (or 2794W) University of Connecticut
ORDERS OF GROWTH PAT SMITH Math 2784 (or 2794W) Uiversity of Coecticut Date: Mar. 2, 22. ORDERS OF GROWTH. Itroductio Gaiig a ituitive feel for the relative growth of fuctios is importat if you really
More informationStat410 Probability and Statistics II (F16)
Some Basic Cocepts of Statistical Iferece (Sec 5.) Suppose we have a rv X that has a pdf/pmf deoted by f(x; θ) or p(x; θ), where θ is called the parameter. I previous lectures, we focus o probability problems
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 3
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe
More informationOutput Analysis and Run-Length Control
IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%
More informationA Proof of Birkhoff s Ergodic Theorem
A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed
More informationINFINITE SEQUENCES AND SERIES
INFINITE SEQUENCES AND SERIES INFINITE SEQUENCES AND SERIES I geeral, it is difficult to fid the exact sum of a series. We were able to accomplish this for geometric series ad the series /[(+)]. This is
More informationMeasure and Measurable Functions
3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies
More informationChapter 6 Principles of Data Reduction
Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a
More informationMath 61CM - Solutions to homework 3
Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet
More informationRates of Convergence by Moduli of Continuity
Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity
More informationInformation Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame
Iformatio Theory Tutorial Commuicatio over Chaels with memory Chi Zhag Departmet of Electrical Egieerig Uiversity of Notre Dame Abstract A geeral capacity formula C = sup I(; Y ), which is correct for
More informationLecture 01: the Central Limit Theorem. 1 Central Limit Theorem for i.i.d. random variables
CSCI-B609: A Theorist s Toolkit, Fall 06 Aug 3 Lecture 0: the Cetral Limit Theorem Lecturer: Yua Zhou Scribe: Yua Xie & Yua Zhou Cetral Limit Theorem for iid radom variables Let us say that we wat to aalyze
More informationSequences. Notation. Convergence of a Sequence
Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it
More informationMAT1026 Calculus II Basic Convergence Tests for Series
MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real
More informationMath 104: Homework 2 solutions
Math 04: Homework solutios. A (0, ): Sice this is a ope iterval, the miimum is udefied, ad sice the set is ot bouded above, the maximum is also udefied. if A 0 ad sup A. B { m + : m, N}: This set does
More information7 Sequences of real numbers
40 7 Sequeces of real umbers 7. Defiitios ad examples Defiitio 7... A sequece of real umbers is a real fuctio whose domai is the set N of atural umbers. Let s : N R be a sequece. The the values of s are
More informationBasics of Probability Theory (for Theory of Computation courses)
Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.
More informationOptimally Sparse SVMs
A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but
More informationAgnostic Learning and Concentration Inequalities
ECE901 Sprig 2004 Statistical Regularizatio ad Learig Theory Lecture: 7 Agostic Learig ad Cocetratio Iequalities Lecturer: Rob Nowak Scribe: Aravid Kailas 1 Itroductio 1.1 Motivatio I the last lecture
More informationRandom Variables, Sampling and Estimation
Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig
More informationECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors
ECONOMETRIC THEORY MODULE XIII Lecture - 34 Asymptotic Theory ad Stochastic Regressors Dr. Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Asymptotic theory The asymptotic
More informationExponential Families and Bayesian Inference
Computer Visio Expoetial Families ad Bayesia Iferece Lecture Expoetial Families A expoetial family of distributios is a d-parameter family f(x; havig the followig form: f(x; = h(xe g(t T (x B(, (. where
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.
More informationDepartment of Mathematics
Departmet of Mathematics Ma 3/103 KC Border Itroductio to Probability ad Statistics Witer 2017 Lecture 19: Estimatio II Relevat textbook passages: Larse Marx [1]: Sectios 5.2 5.7 19.1 The method of momets
More informationEcon 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara
Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio
More informations = and t = with C ij = A i B j F. (i) Note that cs = M and so ca i µ(a i ) I E (cs) = = c a i µ(a i ) = ci E (s). (ii) Note that s + t = M and so
3 From the otes we see that the parts of Theorem 4. that cocer us are: Let s ad t be two simple o-egative F-measurable fuctios o X, F, µ ad E, F F. The i I E cs ci E s for all c R, ii I E s + t I E s +
More informationA survey on penalized empirical risk minimization Sara A. van de Geer
A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the
More informationMachine Learning Theory (CS 6783)
Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT
More information