Lectures 12 and 13 - Complexity Penalized Maximum Likelihood Estimation


Rui Castro

May 2013

1 Introduction

As you learned in previous courses, if we have a statistical model we can often estimate unknown parameters by the maximum likelihood principle. Suppose we have independent, but not necessarily identically distributed, data. Namely, we model the data $\{Y_i\}_{i=1}^n$ as independent random variables with densities (with respect to a common dominating measure) given by $p_i(\cdot;\theta)$, where $\theta$ is an unknown "parameter"¹. The Maximum Likelihood Estimator (MLE) of $\theta$ is simply given by

$$\hat\theta_n = \arg\max_{\theta\in\Theta} \prod_{i=1}^n p_i(Y_i;\theta),$$

where $\Theta$ is the set of possible parameters. If the support of $p_i(\cdot;\theta)$ is not a function of $\theta$ then we can re-write the above estimator in terms of the log-likelihood

$$\hat\theta_n = \arg\max_{\theta\in\Theta} \sum_{i=1}^n \log p_i(Y_i;\theta). \qquad (1)$$

We will consider this setting in what follows. Why is maximum likelihood a good idea? To better understand this let's introduce the Kullback-Leibler divergence.

Definition 1 (Kullback-Leibler Divergence). Convention: $0\log 0 = \lim_{x\to 0^+} x\log x = 0$. Let $p$ and $q$ be two densities (for simplicity say these are with respect to the Lebesgue measure). Then the Kullback-Leibler divergence is defined as

$$KL(p\|q) = \begin{cases} \int p(x)\log\frac{p(x)}{q(x)}\,dx & \text{if } p\ll q\\ \infty & \text{otherwise,}\end{cases}$$

where $p\ll q$ means $q$ dominates $p$, that is, for almost all $x$, if $q(x)=0$ then $p(x)=0$.

Here are some important facts about the KL divergence.

¹ The word "parameter" is in between quotes as $\theta$ can be infinite dimensional, in what is generally referred to as non-parametric estimation. For instance, this is the case in non-parametric regression.

1. $KL(p\|q) \ge 0$, with equality if and only if $p(x) = q(x)$ almost everywhere, which means $p$ and $q$ represent the same probability measure. The proof of this follows easily from Jensen's inequality:

$$KL(p\|q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx = E_p\left[\log\frac{p(X)}{q(X)}\right], \qquad \text{where } X \text{ is a r.v. with density } p,$$
$$= -E_p\left[\log\frac{q(X)}{p(X)}\right] \ge -\log E_p\left[\frac{q(X)}{p(X)}\right] = -\log\int\frac{q(x)}{p(x)}\,p(x)dx = -\log\int q(x)dx = -\log 1 = 0,$$

where the inequality follows from Jensen's inequality and it is strict unless $p(x) = q(x)$ almost everywhere (note: there are minute technicalities omitted in the above proof).

2. $KL(p\|q) \neq KL(q\|p)$. In other words, the KL divergence is not symmetric. It is therefore not a distance (in addition, it doesn't satisfy the triangle inequality). Let's see an example: let $p(x) = \mathbf{1}\{x\in[0,1]\}$ and $q(x) = 2\cdot\mathbf{1}\{x\in[0,1/2]\}$. Then $KL(p\|q) = \infty$ and $KL(q\|p) = \log 2$. The KL divergence is related to testing between two simple hypotheses: if the null hypothesis corresponds to density $p$ and the alternative corresponds to density $q$, then the probabilities of type I and type II error are intimately related to $KL(p\|q)$ and $KL(q\|p)$ respectively. In particular, larger KL divergences correspond to smaller error probabilities.

3. The product-density property: let $p_i$ and $q_i$, $i = 1,\ldots,n$, be univariate densities and define the following multivariate densities

$$p(x_1,\ldots,x_n) = p_1(x_1)p_2(x_2)\cdots p_n(x_n), \qquad q(x_1,\ldots,x_n) = q_1(x_1)q_2(x_2)\cdots q_n(x_n).$$

Then

$$KL(p\|q) = \sum_{i=1}^n KL(p_i\|q_i).$$

The proof is quite simple and left as an exercise.
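For instance, the asymmetry in the example above is easy to check numerically. A minimal Python sketch (the grid size and function name are arbitrary choices), approximating the integrals by Riemann sums:

```python
import numpy as np

def kl_divergence(p, q, dx):
    """Riemann-sum approximation of KL(p||q) for densities tabulated on a grid of spacing dx."""
    mask = p > 0
    if np.any(q[mask] == 0):          # p is not dominated by q, so KL(p||q) = infinity
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

n_grid = 200000
x = (np.arange(n_grid) + 0.5) / n_grid     # midpoints of a fine grid on [0, 1]
dx = 1.0 / n_grid
p = np.ones_like(x)                        # p(x) = 1 on [0, 1]
q = np.where(x < 0.5, 2.0, 0.0)            # q(x) = 2 on [0, 1/2]

print(kl_divergence(p, q, dx))             # inf, since q vanishes where p does not
print(kl_divergence(q, p, dx))             # approximately 0.6931 = log 2
```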

Now let's look again at the MLE formulation in (1). Begin by assuming the $Y_i$ are independent random variables with densities $q_i$. We can re-write the MLE as

$$\hat\theta_n = \arg\min_{\theta\in\Theta} \frac1n\sum_{i=1}^n \log\frac{q_i(Y_i)}{p_i(Y_i;\theta)}.$$

The expected value of the r.h.s. of the above is simply $\frac1n KL(q\|p(\theta))$, where $q \equiv q_1(\cdot)\cdots q_n(\cdot)$ and $p(\theta) \equiv p_1(\cdot;\theta)\cdots p_n(\cdot;\theta)$ are the product densities under the true data distribution and a model parameterized by $\theta$, respectively. Therefore we see that the MLE is attempting to find the parameter $\theta\in\Theta$ that is closest to the true data distribution, where distance is measured as the KL divergence between the distributions of the data. Actually, with some small additional assumptions one can argue, by the strong law of large numbers, that

$$\frac1n\sum_{i=1}^n \log\frac{q_i(Y_i)}{p_i(Y_i;\theta)} - \frac1n KL(q\|p(\theta)) \xrightarrow{a.s.} 0$$

as $n\to\infty$, and that $\frac1n KL(q\|p(\theta))$ converges to some point $c \ge 0$. So, in a sense, the MLE is asymptotically identifying the model in $\Theta$ that is closest to the true distribution of the data.

Example 1. The regression setting with deterministic design: let

$$Y_i = r(x_i) + \epsilon_i, \qquad i = 1,\ldots,n,$$

where the $\epsilon_i$ are i.i.d. normal random variables with variance $\sigma^2$. Then each $Y_i$ has density

$$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(Y_i - r(x_i))^2}{2\sigma^2}\right).$$

The joint likelihood of $Y_1,\ldots,Y_n$ is

$$\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(Y_i - r(x_i))^2}{2\sigma^2}\right)$$

and the MLE is simply given by

$$\hat r_n = \arg\min_{r\in\mathcal R} \sum_{i=1}^n (Y_i - r(x_i))^2.$$

So we see that we recover our familiar sum of squared residuals in this case.

In what follows we'll make use of the KL divergence to characterize the behavior of the MLE for a variety of settings. For this we'll need two other measures between densities.

The Hellinger Distance and the Affinity

As we argued above, the KL divergence is not a proper distance (in particular it is not symmetric and does not satisfy the triangle inequality). However, it is not hard to define other distances between densities. Notably useful is the Hellinger distance.

Definition 2 (Hellinger Distance). Let $p$ and $q$ be two densities with respect to the Lebesgue measure. The Hellinger distance is defined as

$$H(p,q) = \left(\int\left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx\right)^{1/2}.$$

This is a proper distance, as it is symmetric, since $H(p,q) = H(q,p)$, non-negative, and it satisfies the triangle inequality. Remarkably, the squared Hellinger distance provides a lower bound for the KL divergence, so convergence in KL divergence implies convergence of the Hellinger distance.

Proposition 1. $H^2(p,q) \le KL(p\|q)$.

Proof.

$$H^2(p,q) = \int\left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx = 2 - 2\int\sqrt{p(x)q(x)}\,dx$$
$$\le -2\log\int\sqrt{p(x)q(x)}\,dx, \qquad \text{since } 1 - x \le -\log x$$
$$= -2\log\int\sqrt{\frac{q(x)}{p(x)}}\,p(x)dx = -2\log E\left[\sqrt{\frac{q(X)}{p(X)}}\right], \qquad \text{where } X \text{ has density } p$$
$$\le -2E\left[\log\sqrt{\frac{q(X)}{p(X)}}\right], \qquad \text{by Jensen's inequality}$$
$$= E\left[\log\frac{p(X)}{q(X)}\right] = KL(p\|q). \qquad\Box$$

Note that we have also showed that

$$H^2(p,q) \le -2\log\int\sqrt{p(x)q(x)}\,dx.$$

The quantity inside the $\log$ is called the Affinity, and is a measure of the similarity between two densities (its value is 1 if these are identical, and zero if they have disjoint supports).

Definition 3 (Affinity). Let $p$ and $q$ be two densities with respect to the Lebesgue measure. The Affinity is defined as

$$A(p,q) = \int\sqrt{p(x)q(x)}\,dx.$$

In summary, we have shown that

$$H^2(p,q) \le -2\log A(p,q) \le KL(p\|q).$$
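This chain of inequalities can be checked numerically on simple examples. A minimal sketch, using two exponential densities and a Riemann-sum approximation of the integrals (the truncation point and grid size are arbitrary choices):

```python
import numpy as np

# Tabulate two densities on a fine grid: Exp(1) and Exp(2); the mass beyond x = 50 is negligible.
x = np.linspace(0.0, 50.0, 1_000_001)
dx = x[1] - x[0]
p = np.exp(-x)                 # Exp(1) density
q = 2.0 * np.exp(-2.0 * x)     # Exp(2) density

kl = np.sum(p * np.log(p / q)) * dx                    # KL(p||q)
aff = np.sum(np.sqrt(p * q)) * dx                      # affinity A(p, q)
hell_sq = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx  # squared Hellinger distance

print(hell_sq, -2.0 * np.log(aff), kl)
# prints roughly 0.114, 0.118, 0.307, consistent with H^2 <= -2 log A <= KL
```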

The important case of Gaussian distributions

An important case to consider is when $p$ and $q$ are Gaussian distributions with the same variance but different means. Let

$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\theta)^2}{2\sigma^2}}.$$

Then

$$KL(p_{\theta_1}\|p_{\theta_2}) = \int\log\frac{p_{\theta_1}(x)}{p_{\theta_2}(x)}\,p_{\theta_1}(x)dx$$
$$= \int\frac{(x-\theta_2)^2 - (x-\theta_1)^2}{2\sigma^2}\,p_{\theta_1}(x)dx$$
$$= E\left[\frac{(X-\theta_2)^2 - (X-\theta_1)^2}{2\sigma^2}\right], \qquad \text{where } X \text{ has density } p_{\theta_1}$$
$$= \frac{1}{2\sigma^2}E\left[(X-\theta_1+\theta_1-\theta_2)^2\right] - \frac{1}{2\sigma^2}E\left[(X-\theta_1)^2\right]$$
$$= \frac{1}{2\sigma^2}\left(\sigma^2 + (\theta_1-\theta_2)^2\right) - \frac{\sigma^2}{2\sigma^2}$$
$$= \frac{(\theta_1-\theta_2)^2}{2\sigma^2}.$$

So, in this case, the KL divergence is symmetric. Now, for the affinity and Hellinger distance we proceed in a similar fashion:

$$A(p_{\theta_1},p_{\theta_2}) = \int\sqrt{p_{\theta_1}(x)p_{\theta_2}(x)}\,dx = \int\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\theta_1)^2}{4\sigma^2}}e^{-\frac{(x-\theta_2)^2}{4\sigma^2}}\,dx$$
$$= \int\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\theta_1)^2 + (x-\theta_2)^2}{4\sigma^2}}\,dx$$
$$= e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}\int\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{\left(x-\frac{\theta_1+\theta_2}{2}\right)^2}{2\sigma^2}}\,dx \qquad \text{(completing the square)}$$
$$= e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}.$$

Therefore

$$H^2(p_{\theta_1},p_{\theta_2}) = 2\left(1 - e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}\right)$$

and

$$-2\log A(p_{\theta_1},p_{\theta_2}) = \frac{(\theta_1-\theta_2)^2}{4\sigma^2}.$$

2 Maximum Penalized Likelihood Estimation (MPLE)

Oftentimes, in non-parametric settings, it is necessary to either restrict the choice of possible models under consideration, or to penalize the MLE criterion. The second approach is more general and we will consider it here. Namely, we are going to study estimators of the form

$$\hat\theta_n = \arg\min_{\theta\in\Theta}\left\{-\sum_{i=1}^n \log p_i(Y_i;\theta) + \mathrm{pen}(\theta)\right\},$$

where $\mathrm{pen}(\theta)$ is intended to penalize complex models. Next we prove the following important result.

Theorem 1 (Li-Barron 2000, Kolaczyk-Nowak 2004). Let $\mathbf{Y}$ be a random vector (e.g. $\mathbf{Y} = (Y_1,\ldots,Y_n)$). Assume the distribution of $\mathbf{Y}$ has unknown density $p^*$. Now suppose we have a class of probability density functions $\{p_\theta\}_{\theta\in\Theta}$ parameterized by $\theta\in\Theta$. Assume $\Theta$ is a countable set, and that for each $\theta\in\Theta$ we have an associated number $c(\theta) \ge 0$ so that the following inequality holds

$$\sum_{\theta\in\Theta} 2^{-c(\theta)} \le 1 \qquad \text{(Kraft Inequality)}. \qquad (2)$$

Define the Maximum Penalized Likelihood Estimator (MPLE) as

$$\hat\theta_n \in \arg\min_{\theta\in\Theta}\left\{-\log p_\theta(\mathbf{Y}) + 2c(\theta)\log 2\right\}.$$

Then

$$E\left[H^2(p^*,p_{\hat\theta_n})\right] \le E\left[-2\log A(p^*,p_{\hat\theta_n})\right] \le \min_{\theta\in\Theta}\left\{KL(p^*\|p_\theta) + 2c(\theta)\log 2\right\}.$$

This type of result, known as an oracle bound, tells us that the performance of the proposed estimator is essentially as good as the performance of a clairvoyant estimator where we replace the likelihood function by the theoretical counterpart (the KL divergence). Note also that the result is extremely general - there are no assumptions made on the distribution of $\mathbf{Y}$ other than having a density. The major weakness of this result is that it only applies to countable (or finite) classes of candidate models. As we will see this creates some inconvenience, but we can still analyze many interesting settings by discretizing/quantizing the classes of models under consideration. There are possible extensions of results of this type to uncountable classes of models, but these require several technical assumptions and heavy empirical processes machinery to prove. Nevertheless the main ideas, intuition and results are essentially preserved. Finally, note that this result gives you performance bounds for maximum likelihood estimation for finite classes of models, as in that case one can take $c(\theta) = \log_2|\Theta|$ and the estimator reduces to the usual MLE (as the penalty term does not depend on $\theta$).
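The Gaussian expressions above are used repeatedly in what follows, and they are easy to verify numerically. A minimal sanity check (the values of $\theta_1$, $\theta_2$ and $\sigma$ are arbitrary; integrals are approximated by Riemann sums):

```python
import numpy as np

theta1, theta2, sigma = 0.3, 1.1, 0.7          # arbitrary test values
x = np.linspace(-15.0, 15.0, 1_000_001)
dx = x[1] - x[0]

def gauss(t, theta):
    return np.exp(-(t - theta) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

p1, p2 = gauss(x, theta1), gauss(x, theta2)
log_ratio = ((x - theta2) ** 2 - (x - theta1) ** 2) / (2 * sigma ** 2)   # log(p1(x)/p2(x))

kl = np.sum(p1 * log_ratio) * dx               # should match (theta1 - theta2)^2 / (2 sigma^2)
aff = np.sum(np.sqrt(p1) * np.sqrt(p2)) * dx   # should match exp(-(theta1 - theta2)^2 / (8 sigma^2))

print(kl, (theta1 - theta2) ** 2 / (2 * sigma ** 2))
print(-2 * np.log(aff), (theta1 - theta2) ** 2 / (4 * sigma ** 2))
print(2 * (1 - aff), 2 * (1 - np.exp(-(theta1 - theta2) ** 2 / (8 * sigma ** 2))))
```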

The Gaussian Regression Case

Before proving the theorem, let's look at a very special and important case. We will use these results quite a lot in what follows. Recall the Gaussian regression model, used in many of the previous lectures. Suppose

$$Y_i = r^*(x_i) + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} \mathcal N(0,\sigma^2),$$

where the $x_i$ are deterministic covariates (e.g., $x_i = i/n$), and $r^*$ is the true, unknown, regression function. The density of $Y_i$ is parameterized by $r$ and given by

$$p_i(y_i;r) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - r(x_i))^2}{2\sigma^2}}, \qquad (3)$$

with the true density corresponding to $r = r^*$. We have seen before that for this case we have

$$KL(p_i(r^*)\|p_i(r)) = \frac{1}{2\sigma^2}(r^*(x_i) - r(x_i))^2 \quad \text{and} \quad -2\log A(p_i(r^*),p_i(r)) = \frac{1}{4\sigma^2}(r^*(x_i) - r(x_i))^2.$$

This means that, for the joint density of the data, we have

$$KL(p(r^*)\|p(r)) = \frac{1}{2\sigma^2}\sum_{i=1}^n(r^*(x_i) - r(x_i))^2 \quad \text{and} \quad -2\log A(p(r^*),p(r)) = \frac{1}{4\sigma^2}\sum_{i=1}^n(r^*(x_i) - r(x_i))^2.$$

Finally, as we have seen,

$$-\log p(\mathbf{Y};r) = \frac{1}{2\sigma^2}\sum_{i=1}^n(Y_i - r(x_i))^2 + \text{const},$$

where the constant term does not depend on $r$. So, using all this together with the above theorem, we have the following important corollary.

Corollary 1. Let $Y_i = r^*(x_i) + \epsilon_i$, $\epsilon_i \overset{iid}{\sim} \mathcal N(0,\sigma^2)$, where the $x_i$ are deterministic. Let $\mathcal R$ be a countable class of models such that there is a map $c : \mathcal R \to [0,\infty)$ satisfying $\sum_{r\in\mathcal R} 2^{-c(r)} \le 1$. Define

$$\hat r_n = \arg\min_{r\in\mathcal R}\left\{\sum_{i=1}^n(Y_i - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\}.$$

Then

$$\frac1n\sum_{i=1}^n E\left[(\hat r_n(x_i) - r^*(x_i))^2\right] \le \frac2n\min_{r\in\mathcal R}\left\{\sum_{i=1}^n(r(x_i) - r^*(x_i))^2 + 4\sigma^2 c(r)\log 2\right\}.$$

This corollary shows that the maximum penalized likelihood estimator can perform almost as well (we have a factor 2 in the bound) as if we had observed $r^*(x_i)$ instead of $Y_i$ (i.e., if we had removed the noise from the observations and used the same estimator).
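Computationally, the estimator in Corollary 1 is just penalized least squares: evaluate the residual sum of squares of each candidate, add $4\sigma^2 c(r)\log 2$, and keep the minimizer. A minimal sketch for a finite collection of candidates (the function and variable names, the toy data, and the choice $c(r) = \log_2 3$ are illustrative assumptions, not part of the notes):

```python
import numpy as np

def penalized_ls(y, x, models, sigma):
    """Return the candidate minimizing  sum_i (Y_i - r(x_i))^2 + 4 sigma^2 c(r) log 2.

    `models` is a list of pairs (r, c) where r is a function on [0, 1] and c its codelength;
    the codelengths are assumed to satisfy the Kraft inequality  sum_r 2^{-c(r)} <= 1.
    """
    best, best_score = None, np.inf
    for r, c in models:
        score = np.sum((y - r(x)) ** 2) + 4.0 * sigma ** 2 * c * np.log(2.0)
        if score < best_score:
            best, best_score = (r, c), score
    return best

# Toy usage: three candidate regression functions, each with codelength c(r) = log2(3),
# so that sum_r 2^{-c(r)} = 1 and the Kraft inequality holds with equality.
rng = np.random.default_rng(0)
n, sigma = 200, 0.5
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)   # r*(x) = sin(2 pi x)

models = [(lambda t: np.zeros_like(t), np.log2(3)),
          (lambda t: np.sin(2 * np.pi * t), np.log2(3)),
          (lambda t: t, np.log2(3))]
r_hat, _ = penalized_ls(y, x, models, sigma)
```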

Proof of Theorem 1. We have already shown that $H^2(p^*,p_\theta) \le -2\log A(p^*,p_\theta)$. So, clearly,

$$E\left[H^2(p^*,p_{\hat\theta_n})\right] \le E\left[-2\log A(p^*,p_{\hat\theta_n})\right].$$

We are going to bound the right-hand-side of the above expression. Begin by defining the theoretical analog of $\hat\theta_n$,

$$\tilde\theta_n = \arg\min_{\theta\in\Theta}\left\{KL(p^*\|p_\theta) + 2c(\theta)\log 2\right\}.$$

Let's re-write the definition of the MPLE:

$$\hat\theta_n = \arg\min_{\theta\in\Theta}\left\{-\log p_\theta(\mathbf{Y}) + 2c(\theta)\log 2\right\}$$
$$= \arg\max_{\theta\in\Theta}\left\{\log p_\theta(\mathbf{Y}) - 2c(\theta)\log 2\right\}$$
$$= \arg\max_{\theta\in\Theta}\left\{\frac12\log p_\theta(\mathbf{Y}) - c(\theta)\log 2\right\}$$
$$= \arg\max_{\theta\in\Theta}\left\{\log\sqrt{p_\theta(\mathbf{Y})} - c(\theta)\log 2\right\}$$
$$= \arg\max_{\theta\in\Theta}\ \sqrt{p_\theta(\mathbf{Y})}\,\exp\left(-c(\theta)\log 2\right)$$
$$= \arg\max_{\theta\in\Theta}\ \sqrt{p_\theta(\mathbf{Y})}\,2^{-c(\theta)}.$$

This means that, for ANY $\theta\in\Theta$,

$$\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)} \ge \sqrt{p_\theta(\mathbf{Y})}\,2^{-c(\theta)}.$$

In particular we have

$$\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)} \ge \sqrt{p_{\tilde\theta_n}(\mathbf{Y})}\,2^{-c(\tilde\theta_n)}.$$

With this fact in hand let's proceed with the bound:

$$E\left[-2\log A(p^*,p_{\hat\theta_n})\right] = E\left[2\log\frac{1}{A(p^*,p_{\hat\theta_n})}\right]$$
$$= E\left[2\log\frac{\sqrt{p^*(\mathbf{Y})}}{\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)}}\right] + E\left[2\log\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)}}{\sqrt{p^*(\mathbf{Y})}\,A(p^*,p_{\hat\theta_n})}\right]$$
$$\le E\left[2\log\frac{\sqrt{p^*(\mathbf{Y})}}{\sqrt{p_{\tilde\theta_n}(\mathbf{Y})}\,2^{-c(\tilde\theta_n)}}\right] + 2E\left[\log\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)}}{\sqrt{p^*(\mathbf{Y})}\,A(p^*,p_{\hat\theta_n})}\right]$$
$$= KL(p^*\|p_{\tilde\theta_n}) + 2c(\tilde\theta_n)\log 2 + 2E\left[\log\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)}}{\sqrt{p^*(\mathbf{Y})}\,A(p^*,p_{\hat\theta_n})}\right].$$

So, the first two terms of the last line are exactly what we have in the statement of the theorem. To conclude the proof we just need to show the last term is always smaller than or equal to zero. Note that there are two random quantities in that term, namely $\mathbf{Y}$ and $\hat\theta_n$ (which is a function of $\mathbf{Y}$). Next, we use first Jensen's inequality and then a very crude bound to essentially get rid of $\hat\theta_n$, taking advantage of the fact that $\hat\theta_n\in\Theta$.

First note that, by Jensen's inequality,

$$E\left[\log\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)}}{\sqrt{p^*(\mathbf{Y})}\,A(p^*,p_{\hat\theta_n})}\right] \le \log E\left[\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)}}{\sqrt{p^*(\mathbf{Y})}\,A(p^*,p_{\hat\theta_n})}\right].$$

Now we can make use of the Kraft inequality through a union bound. In this case the "union bound" is simply saying that any individual term in a summation of positive terms is smaller than the summation². So, we bound the inside of the expectation by a sum over $\Theta$:

$$\log E\left[\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)}}{\sqrt{p^*(\mathbf{Y})}\,A(p^*,p_{\hat\theta_n})}\right] \le \log E\left[\sum_{\theta\in\Theta}\frac{\sqrt{p_\theta(\mathbf{Y})}\,2^{-c(\theta)}}{\sqrt{p^*(\mathbf{Y})}\,A(p^*,p_\theta)}\right]$$
$$= \log\sum_{\theta\in\Theta}\frac{2^{-c(\theta)}}{A(p^*,p_\theta)}E\left[\sqrt{\frac{p_\theta(\mathbf{Y})}{p^*(\mathbf{Y})}}\right] = \log\sum_{\theta\in\Theta}2^{-c(\theta)} \le \log 1 = 0,$$

² Let $z_1, z_2, \ldots$ be non-negative. Then for all $i\in\mathbb N$, $z_i \le \sum_{j=1}^\infty z_j$.

where the last step follows from Kraft's inequality, and the previous step follows simply by noting that

$$E\left[\sqrt{\frac{p_\theta(\mathbf{Y})}{p^*(\mathbf{Y})}}\right] = \int\sqrt{\frac{p_\theta(y)}{p^*(y)}}\,p^*(y)dy = \int\sqrt{p_\theta(y)p^*(y)}\,dy = A(p^*,p_\theta). \qquad\Box$$

Choosing the values c(θ)

In order to apply the theorem or the corollary we need to construct a map $c(\cdot)$ from $\Theta$ to $[0,\infty)$. These values can be thought of as a measure of the complexity of each model $\theta$. If one has a proper probability mass distribution over $\Theta$, say $P_{\text{prior}} : \Theta \to [0,1]$, then we can just use $c(\theta) = -\log_2 P_{\text{prior}}(\theta)$. Clearly this satisfies the Kraft inequality. Moreover this estimator has a strong Bayesian interpretation - it corresponds to the Maximum a Posteriori estimator for the prior $P_{\text{prior}}$.

Another way to satisfy the Kraft inequality is to use coding arguments. This has the advantage that we can very easily devise maps that automatically satisfy the Kraft inequality. Assume we assign a binary codeword (a sequence of 0s and 1s) to each element of $\Theta$. Let $c(\theta)$ denote the size of the codeword (the length of the sequence). The set $\Theta$ is called the alphabet, and the idea is to encode sequences of symbols from the alphabet by concatenating the corresponding codewords. A very useful class of codes are called prefix codes.

Definition 4. A code is called a prefix or instantaneous code if no codeword is a prefix of any other codeword.

The reason for such a name and definition is clarified in the following example.

Example 2 (from Cover & Thomas '91). Consider an alphabet of four symbols, say A, B, C, and D, and the codebooks in Figure 1.

Symbol | Singular | Nonsingular, But Not Uniquely Decodable | Uniquely Decodable, But Not a Prefix Code | Prefix Code
A      | 0        | 0                                       | 10                                        | 0
B      | 0        | 010                                     | 00                                        | 10
C      | 0        | 01                                      | 11                                        | 110
D      | 0        | 10                                      | 110                                       | 111

Figure 1: Four possible codes for an alphabet of four symbols.

Suppose we want to convey the sentence ADBBAC. This will be encoded in each system respectively as 000000, 010010010001, 1011000001011, and 011110100110. It is clear that with the singular codebook we assign the same codeword to each symbol - a system that is obviously flawed! In the second case the codes are not singular but the codeword 010 could represent B, or CA, or AD, so the above binary sequence is ambiguous, as it can be decoded both as ADBBAC or BBBAC for instance. Hence it is not a uniquely decodable codebook. The third and fourth cases are both examples of uniquely decodable codebooks, but the fourth has the added feature that no codeword is a prefix of another. Prefix codes can be decoded from left to right instantaneously since each codeword is self-punctuating - that is, you know immediately when a codeword ends. Note that, until the tenth bit in the third case, you don't know if the sequence starts as ADBB or ACBBB, and only then can you decide. In the fourth case you know immediately when a codeword ends and another starts.
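Both the prefix property and the Kraft sum of a codebook (see the theorem below) are easy to check mechanically. A small sketch, applied to the codebooks of Figure 1 (function names are arbitrary):

```python
def is_prefix_code(codebook):
    """True if no codeword is a (weak) prefix of a different symbol's codeword."""
    words = list(codebook.values())
    return not any(i != j and words[j].startswith(words[i])
                   for i in range(len(words)) for j in range(len(words)))

def kraft_sum(codebook):
    """Sum over codewords of 2^{-length}; at most 1 for any prefix code."""
    return sum(2.0 ** -len(w) for w in codebook.values())

singular     = {'A': '0', 'B': '0', 'C': '0', 'D': '0'}
nonsingular  = {'A': '0', 'B': '010', 'C': '01', 'D': '10'}
uniquely_dec = {'A': '10', 'B': '00', 'C': '11', 'D': '110'}
prefix       = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

for name, code in [('singular', singular), ('nonsingular', nonsingular),
                   ('uniquely decodable', uniquely_dec), ('prefix', prefix)]:
    print(name, is_prefix_code(code), kraft_sum(code))
# Only the last codebook is a prefix code; its Kraft sum is 1/2 + 1/4 + 1/8 + 1/8 = 1.
```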

The good thing about prefix codes is that they are easy to construct and automatically satisfy the Kraft inequality. We will use this approach often.

The Kraft Inequality

Theorem 2. For any binary prefix code, the codeword lengths $c_1, c_2, \ldots$ satisfy

$$\sum_k 2^{-c_k} \le 1.$$

Conversely, given any $c_1, c_2, \ldots$ satisfying the inequality above we can construct a prefix code with these codeword lengths.

Proof. We will prove that for any binary prefix code, the codeword lengths $c_1, c_2, \ldots$ satisfy $\sum_k 2^{-c_k} \le 1$. The converse is easy to prove also, but it is not central to our purposes here (for a proof, see Cover & Thomas '91). Consider a binary tree like the one shown in Figure 2.

Figure 2: A binary tree.

The sequence of bit values leading from the root to a leaf of the tree represents a codeword. The prefix condition implies that no codeword is a descendant of any other codeword in the tree. Therefore we can assume each codeword is represented by a leaf of the tree. Let $c_{\max}$ be the length of the longest codeword (also the number of branches to the deepest leaf) in the tree. Consider a node in the tree at level $c_i$. This node can have at most $2^{c_{\max}-c_i}$ descendants at level $c_{\max}$. Furthermore, for each leaf of the tree the set of possible descendants at level $c_{\max}$ is disjoint (since no codeword can be a prefix of another). Therefore, since the total number of possible nodes at level $c_{\max}$ is $2^{c_{\max}}$, we have

$$\sum_{i\,\in\,\text{leafs}} 2^{c_{\max}-c_i} \le 2^{c_{\max}} \quad\Longrightarrow\quad \sum_{i\,\in\,\text{leafs}} 2^{-c_i} \le 1,$$

which proves the case when the number of codewords is finite.

Suppose now that we have a countably infinite number of codewords. Let $b_{i1}, b_{i2}, \ldots, b_{ic_i}$ be the $i$-th codeword and let

$$r_i = \sum_{j=1}^{c_i} b_{ij}2^{-j}$$

be the real number corresponding to the binary expansion of the codeword (in binary, $r_i = 0.b_{i1}b_{i2}\cdots b_{ic_i}$). We can associate the interval $[r_i, r_i + 2^{-c_i})$ with the $i$-th codeword. This is the set of all real numbers whose binary expansion begins with $b_{i1},\ldots,b_{ic_i}$. Since this is a subinterval of $[0,1]$, and all such subintervals corresponding to prefix codewords are disjoint, the sum of their lengths must be less than or equal to 1. This proves the case where the number of codewords is infinite. $\Box$

3 Adapting to Unknown Parameters

Earlier in the course we saw that one can estimate Lipschitz smooth functions well. In this section we generalize those results to include smoother functions. Furthermore, we will be able to automatically adjust to the unknown level of smoothness of the functions.

3.1 Lipschitz Functions

Suppose we have a function $r^* : [0,1] \to [-R,R]$ satisfying the Lipschitz smoothness assumption

$$\forall s,t\in[0,1] \qquad |r^*(s) - r^*(t)| \le L|s - t|.$$

We have seen these functions can be well approximated by piecewise constant functions of the form

$$g(x) = \sum_{j=1}^m c_j\mathbf{1}\{x\in I_j\}, \qquad \text{where } I_j = \left[\frac{j-1}{m},\frac{j}{m}\right).$$

Let's use maximum likelihood to pick the best such model. Suppose we have the following regression model

$$Y_i = r^*(x_i) + \epsilon_i, \qquad i = 1,\ldots,n,$$

where $x_i = i/n$ and $\epsilon_i \overset{iid}{\sim} \mathcal N(0,\sigma^2)$. To be able to use the corollary we derived we need a countable or finite class of models. The easiest way to do so is to discretize/quantize the possible values of each constant piece in the candidate models. Define

$$\mathcal R_m = \left\{\sum_{j=1}^m c_j\mathbf{1}\{x\in I_j\} : c_j\in Q\right\}, \qquad \text{where } Q = \left\{-R, -R + \frac{2R}{\sqrt n},\ldots,R\right\} = \left\{-R + \frac{2Rk}{\sqrt n},\ k = 0,\ldots,\sqrt n\right\}.$$

Therefore $\mathcal R_m$ has exactly $(\sqrt n + 1)^m$ elements in total. This means that, by taking

$$c(r) = \log_2\left((\sqrt n + 1)^m\right) = m\log_2(\sqrt n + 1)$$

for all $r\in\mathcal R_m$, we satisfy the Kraft inequality:

$$\sum_{r\in\mathcal R_m} 2^{-c(r)} = \sum_{r\in\mathcal R_m}\frac{1}{|\mathcal R_m|} = 1.$$
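For a fixed $m$, minimizing the residual sum of squares over $\mathcal R_m$ decouples across the intervals $I_j$: on each bin the best $c_j$ is simply the element of $Q$ closest to the average of the corresponding $Y_i$ (this least-squares fit is exactly the estimator used with the oracle bound next). A minimal sketch, assuming $x_i = i/n$ and using $\lfloor\sqrt n\rfloor$ quantization steps (the function name is arbitrary):

```python
import numpy as np

def fit_piecewise_constant(y, m, R):
    """Least-squares fit over R_m: per-bin mean of the Y_i, rounded to the grid Q.

    Q = {-R + 2Rk/sqrt(n), k = 0, ..., sqrt(n)} (sqrt(n) rounded down here),
    so the rounding error is at most R/sqrt(n).
    """
    n = len(y)
    grid = int(np.floor(np.sqrt(n)))                 # number of quantization steps
    x = np.arange(1, n + 1) / n                      # design points x_i = i/n
    bins = np.minimum((x * m).astype(int), m - 1)    # index j-1 of the interval I_j containing x_i
    coeffs = np.empty(m)
    for j in range(m):
        mean_j = y[bins == j].mean() if np.any(bins == j) else 0.0
        k = np.clip(np.round((mean_j + R) * grid / (2 * R)), 0, grid)   # nearest level in Q
        coeffs[j] = -R + 2 * R * k / grid
    return coeffs[bins]                              # fitted values at the design points
```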

So we are ready to apply our oracle bound. Since $c(r)$ is just a constant (not really a function of $r$) the estimator is simply the MLE

$$\hat r_n = \arg\min_{r\in\mathcal R_m}\left\{\sum_{i=1}^n(Y_i - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\} = \arg\min_{r\in\mathcal R_m}\sum_{i=1}^n(Y_i - r(x_i))^2.$$

The corollary then says that

$$\frac1n\sum_{i=1}^n E\left[(\hat r_n(x_i) - r^*(x_i))^2\right] \le \frac2n\min_{r\in\mathcal R_m}\left\{\sum_{i=1}^n(r^*(x_i) - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\}.$$

So far the result is extremely general, as we have not made use of the Lipschitz assumption. We have seen earlier that there is a piecewise constant function $r_m(x) = \sum_{j=1}^m c_j\mathbf{1}\{x\in I_j\}$ such that for all $x\in[0,1]$ we have $|r^*(x) - r_m(x)| \le L/m$. The problem is that, generally, $r_m\notin\mathcal R_m$ since $c_j\notin Q$. Take instead the element of $\mathcal R_m$ that is closest to $r_m$, namely

$$\bar r_m = \arg\min_{r\in\mathcal R_m}\sup_{x\in[0,1]}|r(x) - r_m(x)|.$$

It is clear that $|\bar r_m(x) - r_m(x)| \le R/\sqrt n$ for all $x\in[0,1]$; therefore, by the triangle inequality, we have

$$|r^*(x) - \bar r_m(x)| \le |r^*(x) - r_m(x)| + |r_m(x) - \bar r_m(x)| \le \frac{L}{m} + \frac{R}{\sqrt n}.$$

Now, we can just use this in our bound:

$$\frac1n\sum_{i=1}^n E\left[(\hat r_n(x_i) - r^*(x_i))^2\right] \le \frac2n\min_{r\in\mathcal R_m}\left\{\sum_{i=1}^n(r^*(x_i) - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\}$$
$$\le 2\left(\frac{L}{m} + \frac{R}{\sqrt n}\right)^2 + \frac{8\sigma^2 m\log_2(\sqrt n + 1)\log 2}{n}$$
$$= 2\left(\frac{L}{m} + \frac{R}{\sqrt n}\right)^2 + \frac{8\sigma^2 m\log(\sqrt n + 1)}{n}.$$

So, to ensure the best bound possible we should choose $m$ minimizing the right-hand-side. This yields $m \asymp (n/\log n)^{1/3}$ and

$$\frac1n\sum_{i=1}^n E\left[(\hat r_n(x_i) - r^*(x_i))^2\right] = O\left((n/\log n)^{-2/3}\right),$$

which, apart from the logarithmic factor, is the best we can ever hope for (this logarithmic factor is due to the discretization of the model classes, and is an artifact of this approach). If we want the truly best possible bound we need to minimize the above expression with respect to $m$, and for that we need to know $L$. Can we do better? Can we "automagically" choose $m$ using the data? The answer is yes, and for this we will start taking full advantage of our oracle bound.

Since we want to choose the best possible $m$ we must consider the following class of models

$$\mathcal R = \bigcup_{m=1}^\infty\mathcal R_m.$$

This is clearly a countable class of models (but not finite). So we need to be a bit more careful in constructing the map $c(\cdot)$. Let's use a coding argument: begin by defining

$$m(r) = \min\{m\in\mathbb N : r\in\mathcal R_m\}.$$

Encode $r\in\mathcal R$ using first the bits $0\,0\cdots 0\,1$ (in total $m(r)$ bits) to encode $m(r)$, and then $\log_2|\mathcal R_{m(r)}|$ bits to encode which model inside $\mathcal R_{m(r)}$ is $r$. This is clearly a prefix code and therefore satisfies the Kraft inequality. More formally,

$$c(r) = m(r) + \log_2|\mathcal R_{m(r)}| = m(r) + \log_2\left((\sqrt n + 1)^{m(r)}\right) = m(r)\left(1 + \log_2(\sqrt n + 1)\right).$$

Although we know, from the coding argument, that the map $c(\cdot)$ satisfies the Kraft inequality for sure, we can do a little sanity check and ensure this is indeed true:

$$\sum_{r\in\mathcal R} 2^{-c(r)} = \sum_{m=1}^\infty\sum_{r\in\mathcal R_m : m(r)=m} 2^{-m(r) - \log_2|\mathcal R_{m(r)}|} = \sum_{m=1}^\infty\sum_{r\in\mathcal R_m : m(r)=m} \frac{2^{-m}}{|\mathcal R_m|} \le \sum_{m=1}^\infty|\mathcal R_m|\,\frac{2^{-m}}{|\mathcal R_m|} = \sum_{m=1}^\infty 2^{-m} = 1.$$

Now, similarly to what we had before, define

$$\hat r_n = \arg\min_{r\in\mathcal R}\left\{\sum_{i=1}^n(Y_i - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\} = \arg\min_{r\in\mathcal R}\left\{\sum_{i=1}^n(Y_i - r(x_i))^2 + 4\sigma^2 m(r)\left(1 + \log_2(\sqrt n + 1)\right)\log 2\right\}.$$
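Computationally, this estimator amounts to sweeping over $m$, fitting the best quantized piecewise constant model for each $m$, and keeping the value of $m$ with the smallest penalized residual sum of squares. A minimal self-contained sketch (the names, the range of $m$, and the toy data are illustrative choices), assuming $x_i = i/n$:

```python
import numpy as np

def adaptive_piecewise_constant(y, R, sigma, m_max):
    """Select m and the fit by minimizing  RSS(r) + 4 sigma^2 m (1 + log2(sqrt(n)+1)) log 2."""
    n = len(y)
    grid = int(np.floor(np.sqrt(n)))                 # quantization: Q has grid+1 levels in [-R, R]
    x = np.arange(1, n + 1) / n
    best_m, best_fit, best_score = None, None, np.inf
    for m in range(1, m_max + 1):
        bins = np.minimum((x * m).astype(int), m - 1)
        fit = np.empty(n)
        for j in range(m):
            idx = bins == j
            if not np.any(idx):
                continue
            mean_j = y[idx].mean()
            k = np.clip(np.round((mean_j + R) * grid / (2 * R)), 0, grid)
            fit[idx] = -R + 2 * R * k / grid
        penalty = 4 * sigma ** 2 * m * (1 + np.log2(grid + 1)) * np.log(2)
        score = np.sum((y - fit) ** 2) + penalty
        if score < best_score:
            best_m, best_fit, best_score = m, fit, score
    return best_m, best_fit

# Toy usage: noisy samples of a Lipschitz function; the selected m balances the
# approximation and estimation terms of the bound above.
rng = np.random.default_rng(1)
n, sigma, R = 1000, 0.5, 2.0
x = np.arange(1, n + 1) / n
y = np.abs(x - 0.4) + sigma * rng.standard_normal(n)
m_hat, r_hat = adaptive_piecewise_constant(y, R, sigma, m_max=50)
```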

The estimator just defined is no longer the MLE, but rather a maximum penalized likelihood estimator. Then

$$\frac1n\sum_{i=1}^n E\left[(\hat r_n(x_i) - r^*(x_i))^2\right] \le \frac2n\min_{r\in\mathcal R}\left\{\sum_{i=1}^n(r^*(x_i) - r(x_i))^2 + 4\sigma^2 m(r)\left(1 + \log_2(\sqrt n + 1)\right)\log 2\right\}$$
$$= \frac2n\min_{m\in\mathbb N}\min_{r\in\mathcal R_m}\left\{\sum_{i=1}^n(r^*(x_i) - r(x_i))^2 + 4\sigma^2 m\left(1 + \log_2(\sqrt n + 1)\right)\log 2\right\}$$
$$\le \min_{m\in\mathbb N}\left\{2\left(\frac{L}{m} + \frac{R}{\sqrt n}\right)^2 + \frac{8\sigma^2 m\left(\log 2 + \log(\sqrt n + 1)\right)}{n}\right\}.$$

Therefore this estimator automatically chooses the best possible number of parameters $m$. Note that the price we pay is very, very modest - the only change from what we had before is that the term $\log(\sqrt n + 1)$ is replaced by $\log 2 + \log(\sqrt n + 1)$, which is a very minute change as $\log(\sqrt n + 1) \gg \log 2$ for any interesting sample size. Although this is remarkable, we can probably do much better, and adjust to unknown smoothness. We'll see how to do this next.

4 Regression of Hölder smooth functions

Let's begin by defining a class of Hölder smooth functions. Let $\alpha > 0$ and let $H^\alpha(C,R)$ be the set of all functions $r : [0,1] \to [-R,R]$ which have $k$ derivatives, where $k$ is the largest integer such that $k < \alpha$, and satisfy

$$|r(x) - T_y(x)| \le C|x - y|^\alpha \qquad \forall x,y\in[0,1],$$

where $T_y$ is the Taylor polynomial of degree $k$ of $r$ around the point $y$. In words, a Hölder-$\alpha$ smooth function is locally well approximated by a polynomial of degree $k$. Note that the above definition coincides with our Lipschitz function class when $\alpha = 1$. Hölder smoothness essentially measures how differentiable functions are, and therefore Taylor polynomials are the natural way to approximate Hölder smooth functions. We will focus on Hölder smooth function classes with $0 < \alpha \le 2$, but the results presented can be easily generalized for larger values of $\alpha$. Since $\alpha \le 2$ we will work with piecewise linear approximations, namely Taylor polynomials of degree 1. If we were to consider smoother functions, $\alpha > 2$, we would need to consider higher degree Taylor polynomial approximations, i.e. quadratic, cubic, etc.

4.1 Regression of Hölder smooth functions

Consider the usual regression model

$$Y_i = r^*(x_i) + \epsilon_i, \qquad i = 1,\ldots,n,$$

where $x_i = i/n$ and $\epsilon_i \overset{iid}{\sim} \mathcal N(0,\sigma^2)$. Let's assume $r^*\in H^\alpha(C,R)$, with unknown smoothness $0 < \alpha \le 2$. Intuitively, the smoother $r^*$ is, the better we should be able to estimate it, as we can average more observations locally. In other words, for smoother $r^*$ we should average over larger bins.

Also, we will need to exploit the extra smoothness in our models for $r^*$. To that end, we will consider candidate functions that are piecewise linear, i.e., functions of the form

$$\sum_{j=1}^m(a_j + b_jx)\mathbf{1}\{x\in I_j\}, \qquad \text{where } I_j = \left[\frac{j-1}{m},\frac{j}{m}\right).$$

As before, we want to consider countable/finite classes of models to be able to apply our corollary, so we will consider a slight modification of the above. Each linear piece can be described by its values at the beginning and end points of the corresponding interval, so we are going to restrict those values to lie on a grid. Namely, refer to Figure 3.

Figure 3: Example of the discretization of $r$ on the interval $I_j = [(j-1)/m, j/m)$.

Define the class

$$\mathcal R_m = \left\{r(x) = \sum_{j=1}^m l_j(x)\mathbf{1}\{x\in I_j\}\right\},$$

where

$$l_j(x) = \frac{x - (j-1)/m}{1/m}\,b_j + \frac{j/m - x}{1/m}\,a_j = (mx - j + 1)b_j + (j - mx)a_j,$$

and $a_j, b_j \in Q = \left\{-R + \frac{2Rk}{\sqrt n} : k = 0,\ldots,\sqrt n\right\}$. Clearly $|\mathcal R_m| = (\sqrt n + 1)^{2m}$.

Since we don't know the smoothness a priori, we must choose $m$ using the data. Therefore, in the same fashion as before, we take the class

$$\mathcal R = \bigcup_{m=1}^\infty\mathcal R_m, \qquad \text{with } m(r) = \min\{m\in\mathbb N : r\in\mathcal R_m\},$$

and

$$c(r) = m(r) + \log_2|\mathcal R_{m(r)}| = m(r)\left(1 + 2\log_2(\sqrt n + 1)\right).$$
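To make the parameterization concrete: a candidate in $\mathcal R_m$ is determined by the quantized endpoint values $a_j$ (at the left end of $I_j$) and $b_j$ (at the right end), and $l_j$ interpolates linearly between them. A short evaluation sketch (the function name and test values are arbitrary):

```python
import numpy as np

def evaluate_piecewise_linear(x, a, b):
    """Evaluate r(x) = sum_j l_j(x) 1{x in I_j} with l_j(x) = (m x - j + 1) b_j + (j - m x) a_j.

    a[j-1] and b[j-1] are the values of the j-th linear piece at the endpoints of
    I_j = [(j-1)/m, j/m); a and b must have the same length m.
    """
    m = len(a)
    j = np.minimum(np.floor(x * m).astype(int) + 1, m)   # index of the interval containing x
    return (m * x - j + 1) * b[j - 1] + (j - m * x) * a[j - 1]

# Sanity check: at the left endpoint of I_j the value is a_j, just below the right endpoint it is ~ b_j.
a = np.array([0.0, 1.0, -0.5])
b = np.array([1.0, -0.5, 0.25])
m = len(a)
left  = np.arange(m) / m                    # points (j-1)/m
right = np.arange(1, m + 1) / m - 1e-12     # points just below j/m
print(evaluate_piecewise_linear(left, a, b))    # ~ a
print(evaluate_piecewise_linear(right, a, b))   # ~ b
```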

Exactly as before, define the estimator

$$\hat r_n = \arg\min_{r\in\mathcal R}\left\{\sum_{i=1}^n(Y_i - r(x_i))^2 + 4\sigma^2 m(r)\left(1 + 2\log_2(\sqrt n + 1)\right)\log 2\right\}.$$

Then

$$\frac1n\sum_{i=1}^n E\left[(\hat r_n(x_i) - r^*(x_i))^2\right] \le \frac2n\min_{r\in\mathcal R}\left\{\sum_{i=1}^n(r^*(x_i) - r(x_i))^2 + 4\sigma^2 m(r)\left(1 + 2\log_2(\sqrt n + 1)\right)\log 2\right\}$$
$$= \frac2n\min_{m\in\mathbb N}\left\{\min_{r\in\mathcal R_m}\sum_{i=1}^n(r^*(x_i) - r(x_i))^2 + 4\sigma^2 m\left(1 + 2\log_2(\sqrt n + 1)\right)\log 2\right\}.$$

In the above, the first term is essentially our familiar approximation error, and the second term is in a sense bounding the estimation error. Therefore this estimator automatically seeks the best balance between the two. To say something more concrete about the performance of the estimator we need to bring in the assumptions we have on $r^*$ (so far we haven't used any).

First, suppose $r^*\in H^\alpha(C,R)$ for $1 < \alpha \le 2$. We need to find a good model in the class $\mathcal R_m$ that makes the approximation error small. Take $x\in I_j$, where $j$ is arbitrary. From the definition of $H^\alpha(C,R)$ we have

$$\left|r^*(x) - r^*(j/m) - r^{*\prime}(j/m)(x - j/m)\right| \le C|x - j/m|^\alpha \le C\left(\frac1m\right)^\alpha = Cm^{-\alpha}.$$

Therefore we can approximate $r^*(x)$ for $x\in I_j$ by a linear function, such that the error is bounded by $Cm^{-\alpha}$. In particular, the best piecewise linear approximation, defined as

$$r_m = \arg\min_{r\ \text{piecewise linear}}\ \sup_{x\in[0,1]}|r(x) - r^*(x)|,$$

satisfies $|r^*(x) - r_m(x)| \le Cm^{-\alpha}$ for all $x\in[0,1]$. Clearly, this function is not in $\mathcal R_m$, so we need to do some discretization. Take the function in $\mathcal R_m$ that is closest to $r_m$,

$$\bar r_m = \arg\min_{r\in\mathcal R_m}\sup_{x\in[0,1]}|r(x) - r_m(x)|.$$

Because of the way we discretized, we know that $\sup_{x\in[0,1]}|\bar r_m(x) - r_m(x)| \le R/\sqrt n$. Using this, together with the triangle inequality, yields

$$\forall x\in[0,1] \qquad |\bar r_m(x) - r^*(x)| \le Cm^{-\alpha} + \frac{R}{\sqrt n}. \qquad (4)$$

If $r^*\in H^\alpha(C,R)$ for $0 < \alpha \le 1$ we can proceed in a similar fashion, but simply have to note that such functions are well approximated by piecewise constant functions. Furthermore these are a subset of $\mathcal R_m$. So, the reasoning we used for Lipschitz functions applies directly here, and we obtain expression (4) for that case as well.
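The $Cm^{-\alpha}$ approximation rate is also easy to observe empirically. The short check below (an illustration only) uses $r^*(x) = \sin(2\pi x)$, which has a bounded second derivative and so plays the role of a Hölder-2 function, and measures the sup-norm error of the per-bin first-order Taylor approximation; the error decays roughly like $m^{-2}$:

```python
import numpy as np

r_star  = lambda t: np.sin(2 * np.pi * t)
r_prime = lambda t: 2 * np.pi * np.cos(2 * np.pi * t)

def sup_taylor_error(m, n_grid=100_000):
    """Sup over [0,1] of |r*(x) - [r*(j/m) + r*'(j/m)(x - j/m)]| with j the bin containing x."""
    x = (np.arange(n_grid) + 0.5) / n_grid
    j = np.minimum(np.floor(x * m).astype(int) + 1, m)    # x lies in I_j = [(j-1)/m, j/m)
    approx = r_star(j / m) + r_prime(j / m) * (x - j / m)
    return np.max(np.abs(r_star(x) - approx))

for m in [4, 8, 16, 32, 64]:
    print(m, sup_taylor_error(m))
# The error drops by roughly a factor of 4 each time m doubles, i.e. it behaves like m^{-2}.
```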

Finally, we can just plug these results into the bound of the corollary:

$$\frac1n\sum_{i=1}^n E\left[(\hat r_n(x_i) - r^*(x_i))^2\right] \le \frac2n\min_{m\in\mathbb N}\left\{\min_{r\in\mathcal R_m}\sum_{i=1}^n(r^*(x_i) - r(x_i))^2 + 4\sigma^2 m\left(1 + 2\log_2(\sqrt n + 1)\right)\log 2\right\}$$
$$\le \min_{m\in\mathbb N}\left\{2\left(Cm^{-\alpha} + \frac{R}{\sqrt n}\right)^2 + \frac{8\sigma^2 m\left(1 + 2\log_2(\sqrt n + 1)\right)\log 2}{n}\right\}$$
$$\le \min_{m\in\mathbb N}\ O\left(\max\left\{m^{-2\alpha},\ \frac{m^{-\alpha}}{\sqrt n},\ \frac1n,\ \frac{m\log n}{n}\right\}\right).$$

It is not hard to see that the first and last terms dominate the bound, and so we attain the minimum by taking (in the bound)

$$m \asymp \left(\frac{n}{\log n}\right)^{\frac{1}{2\alpha+1}},$$

which yields

$$\frac1n\sum_{i=1}^n E\left[(\hat r_n(x_i) - r^*(x_i))^2\right] = O\left(\left(\frac{\log n}{n}\right)^{\frac{2\alpha}{2\alpha+1}}\right).$$

Note that the estimator does not know $\alpha$!!! So we are indeed adapting to unknown smoothness. If the regression function $r^*$ is Lipschitz this estimator has error rate $O\left((n/\log n)^{-2/3}\right)$. However, if the function is smoother (say $\alpha = 2$) the estimator has error rate $O\left((n/\log n)^{-4/5}\right)$, which decays much more quickly to zero. More remarkably, apart from the logarithmic factor, it can be shown that this is the best one can hope for! So the logarithmic factor is the very small price we need to pay for adaptivity.
