Lectures 12 and 13 - Complexity Penalized Maximum Likelihood Estimation
Rui Castro
May 5, 2013

1 Introduction

As you learned in previous courses, if we have a statistical model we can often estimate unknown parameters by the maximum likelihood principle. Suppose we have independent, but not necessarily identically distributed, data. Namely, we model the data $\{Y_i\}_{i=1}^n$ as independent random variables with densities (with respect to a common dominating measure) given by $p_i(\cdot;\theta)$, where $\theta$ is an unknown "parameter".¹ The Maximum Likelihood Estimator (MLE) of $\theta$ is simply given by
$$\hat{\theta}_n = \arg\max_{\theta\in\Theta} \prod_{i=1}^n p_i(Y_i;\theta),$$
where $\Theta$ is the set of possible parameters. If the support of $p_i(\cdot;\theta)$ is not a function of $\theta$ then we can rewrite the above estimator in terms of the log-likelihood:
$$\hat{\theta}_n = \arg\max_{\theta\in\Theta} \sum_{i=1}^n \log p_i(Y_i;\theta). \qquad (1)$$
We will consider this setting in what follows.

1.1 Why is maximum likelihood a good idea? The Kullback-Leibler divergence

To better understand this, let's introduce the Kullback-Leibler divergence.

Definition 1 (Kullback-Leibler Divergence). Let $p$ and $q$ be two densities (for simplicity, say these are with respect to the Lebesgue measure). The Kullback-Leibler divergence is defined as
$$KL(p\|q) = \begin{cases}\int p(x)\log\frac{p(x)}{q(x)}\,dx & \text{if } p\ll q,\\ \infty & \text{otherwise,}\end{cases}$$
where $p\ll q$ means $q$ dominates $p$; that is, for almost all $x$, if $q(x)=0$ then $p(x)=0$. We use the convention $0\log 0 = \lim_{x\to 0^+} x\log x = 0$.

Here are some important facts about the KL divergence.

¹The word "parameter" is in between quotes as $\theta$ can be infinite dimensional, in what is generally referred to as non-parametric estimation. For instance, this is the case in non-parametric regression.
1. $KL(p\|q)\ge 0$, with equality if and only if $p(x)=q(x)$ almost everywhere, which means $p$ and $q$ represent the same probability measure. The proof of this follows easily from Jensen's inequality:
$$-KL(p\|q) = -\int p(x)\log\frac{p(x)}{q(x)}\,dx = \mathbb{E}_p\left[\log\frac{q(X)}{p(X)}\right], \text{ where } X \text{ is a r.v. with density } p,$$
$$\le \log \mathbb{E}_p\left[\frac{q(X)}{p(X)}\right] = \log\int \frac{q(x)}{p(x)}\,p(x)\,dx = \log\int q(x)\,dx = \log 1 = 0,$$
where the inequality follows from Jensen's inequality, and it is strict unless $p(x)=q(x)$ almost everywhere (note: there are minute technicalities omitted in the above proof).

2. In general $KL(p\|q)\ne KL(q\|p)$. In other words, the KL divergence is not symmetric. It is therefore not a distance (in addition, it doesn't satisfy the triangle inequality). Let's see an example: let $p(x) = 2\cdot\mathbf{1}\{x\in[0,1/2]\}$ and $q(x) = \mathbf{1}\{x\in[0,1]\}$. Then $KL(p\|q) = \log 2$, but $KL(q\|p) = \infty$, since $q$ is not dominated by $p$.

3. The KL divergence is related to testing between two simple hypotheses: if the null hypothesis corresponds to density $p$ and the alternative corresponds to density $q$, then the probabilities of type I and type II errors are intimately related to $KL(p\|q)$ and $KL(q\|p)$ respectively. In particular, larger KL divergences correspond to smaller error probabilities.

4. The product-density property: let $p_i$ and $q_i$, $i=1,\dots,n$, be univariate densities and define the following multivariate densities:
$$p(x_1,\dots,x_n) = p_1(x_1)\,p_2(x_2)\cdots p_n(x_n), \qquad q(x_1,\dots,x_n) = q_1(x_1)\,q_2(x_2)\cdots q_n(x_n).$$
Then
$$KL(p\|q) = \sum_{i=1}^n KL(p_i\|q_i).$$
The proof is quite simple and left as an exercise.

Now let's look again at the MLE formulation in (1). Begin by assuming the $Y_i$ are independent random variables with densities $q_i$. We can rewrite the MLE as
$$\hat{\theta}_n = \arg\min_{\theta\in\Theta}\ \sum_{i=1}^n \log\frac{q_i(Y_i)}{p_i(Y_i;\theta)}.$$
The expected value of the r.h.s. of the above is simply $KL(q\|p(\theta))$, where $q \equiv q_1(\cdot)\cdots q_n(\cdot)$ and $p(\theta) \equiv p_1(\cdot;\theta)\cdots p_n(\cdot;\theta)$ are the product densities under the true data distribution and under the model parameterized by $\theta$, respectively. Therefore we see that the MLE is attempting to find the parameter $\theta\in\Theta$ that is closest to the true data distribution, where "distance" is measured by the KL divergence between the distributions of the data. Actually, with some small additional assumptions one can argue, by the strong law of large numbers, that
$$\frac{1}{n}\sum_{i=1}^n \log\frac{q_i(Y_i)}{p_i(Y_i;\theta)} - \frac{1}{n}KL(q\|p(\theta)) \overset{a.s.}{\longrightarrow} 0$$
as $n\to\infty$, and that $\frac{1}{n}KL(q\|p(\theta))$ converges to some point $c\ge 0$. So, in a sense, the MLE is asymptotically identifying the model in $\Theta$ that is closest to the true distribution of the data.

Example 1 (The regression setting with deterministic design). Let $Y_i = r(x_i) + \epsilon_i$, $i=1,\dots,n$, where the $\epsilon_i$ are i.i.d. normal random variables with variance $\sigma^2$. Then each $Y_i$ has density
$$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(Y_i - r(x_i))^2}{2\sigma^2}\right).$$
The joint likelihood of $Y_1,\dots,Y_n$ is
$$\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(Y_i - r(x_i))^2}{2\sigma^2}\right),$$
and the MLE is simply given by
$$\hat{r}_n = \arg\min_{r\in\mathcal{R}}\ \sum_{i=1}^n (Y_i - r(x_i))^2.$$
So we see that we recover our familiar sum of squared residuals in this case.

In what follows we'll make use of the KL divergence to characterize the behavior of the MLE in a variety of settings. For this we'll need two other measures between densities.

1.2 The Hellinger Distance and the Affinity

As we argued above, the KL divergence is not a proper distance (in particular, it is not symmetric and does not satisfy the triangle inequality). However, it is not hard to define other distances between densities. Notably useful is the Hellinger distance.

Definition 2 (Hellinger Distance). Let $p$ and $q$ be two densities with respect to the Lebesgue measure. The Hellinger distance is defined as
$$H(p,q) = \left(\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx\right)^{1/2}.$$
This is a proper distance: it is symmetric, since $H(p,q) = H(q,p)$, non-negative, and it satisfies the triangle inequality. Remarkably, the squared Hellinger distance provides a lower bound for the KL divergence, so convergence in KL divergence implies convergence in Hellinger distance.

Proposition 1. $H^2(p,q) \le KL(p\|q)$.

Proof.
$$H^2(p,q) = \int\left(\sqrt{p(x)}-\sqrt{q(x)}\right)^2 dx = \int p(x)\,dx + \int q(x)\,dx - 2\int\sqrt{p(x)q(x)}\,dx$$
$$= 2\left(1 - \int\sqrt{p(x)q(x)}\,dx\right) \le -2\log\int\sqrt{p(x)q(x)}\,dx, \quad\text{since } 1-x \le -\log x,$$
$$= -2\log\int\sqrt{\frac{q(x)}{p(x)}}\,p(x)\,dx = -2\log\mathbb{E}\left[\sqrt{\frac{q(X)}{p(X)}}\right], \text{ where } X \text{ has density } p,$$
$$\le -2\,\mathbb{E}\left[\log\sqrt{\frac{q(X)}{p(X)}}\right], \text{ by Jensen's inequality,}$$
$$= \mathbb{E}\left[\log\frac{p(X)}{q(X)}\right] = KL(p\|q). \qquad\square$$

Note that we have also shown that
$$H^2(p,q) = 2\left(1 - \int\sqrt{p(x)q(x)}\,dx\right) \le -2\log\int\sqrt{p(x)q(x)}\,dx.$$
The quantity inside the logarithm is called the Affinity, and is a measure of the similarity between two densities (it has value 1 if these are identical, and zero if they have disjoint supports).

Definition 3 (Affinity). Let $p$ and $q$ be two densities with respect to the Lebesgue measure. The Affinity is defined as
$$A(p,q) = \int\sqrt{p(x)q(x)}\,dx.$$

In summary, we have shown that
$$H^2(p,q) \le -2\log A(p,q) \le KL(p\|q).$$
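None of the above requires continuous densities; for distributions on a finite set (densities with respect to the counting measure) the three quantities become finite sums, and the chain $H^2 \le -2\log A \le KL$ can be checked directly. A minimal sketch (my own, not part of the notes; the example distributions are arbitrary):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions, with the convention 0 log 0 = 0
    and KL = infinity when p is not dominated by q (some q[i] = 0, p[i] > 0)."""
    s = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 log 0 = 0
        if qi == 0.0:
            return math.inf  # p is not absolutely continuous w.r.t. q
        s += pi * math.log(pi / qi)
    return s

def affinity(p, q):
    """A(p, q) = sum of sqrt(p_i q_i); 1 iff p = q, 0 for disjoint supports."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance H^2(p, q)."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.1, 0.2, 0.7]
print(kl(p, q), kl(q, p))  # two different values: KL is not symmetric
# The chain H^2 <= -2 log A <= KL from the summary above:
print(hellinger_sq(p, q) <= -2 * math.log(affinity(p, q)) <= kl(p, q))  # True
```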
1.3 The important case of Gaussian distributions

An important case to consider is when $p$ and $q$ are Gaussian densities with the same variance but different means. Let
$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\theta)^2}{2\sigma^2}}.$$
Then
$$KL(p_{\theta_1}\|p_{\theta_2}) = \int \log\frac{p_{\theta_1}(x)}{p_{\theta_2}(x)}\,p_{\theta_1}(x)\,dx = \int \frac{(x-\theta_2)^2 - (x-\theta_1)^2}{2\sigma^2}\,p_{\theta_1}(x)\,dx$$
$$= \frac{1}{2\sigma^2}\,\mathbb{E}\left[(X-\theta_2)^2 - (X-\theta_1)^2\right], \text{ where } X \text{ has density } p_{\theta_1},$$
$$= \frac{1}{2\sigma^2}\left(\mathbb{E}\left[(X-\theta_1+\theta_1-\theta_2)^2\right] - \mathbb{E}\left[(X-\theta_1)^2\right]\right) = \frac{1}{2\sigma^2}\left(\sigma^2 + (\theta_1-\theta_2)^2 - \sigma^2\right) = \frac{(\theta_1-\theta_2)^2}{2\sigma^2}.$$
So, in this case, the KL divergence is symmetric.

Now, for the affinity and Hellinger distance we proceed in a similar fashion:
$$A(p_{\theta_1}, p_{\theta_2}) = \int\sqrt{p_{\theta_1}(x)\,p_{\theta_2}(x)}\,dx = \int\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\theta_1)^2}{4\sigma^2}}\, e^{-\frac{(x-\theta_2)^2}{4\sigma^2}}\,dx.$$
Completing the square, $(x-\theta_1)^2 + (x-\theta_2)^2 = 2\left(x-\frac{\theta_1+\theta_2}{2}\right)^2 + \frac{(\theta_1-\theta_2)^2}{2}$, so
$$A(p_{\theta_1}, p_{\theta_2}) = e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}\int\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(x-\frac{\theta_1+\theta_2}{2}\right)^2}{2\sigma^2}}\,dx = e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}.$$
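The closed forms above are easy to sanity-check numerically. The following sketch (my own, not from the notes) integrates the KL divergence and affinity of two Gaussians with a midpoint rule and compares against $(\theta_1-\theta_2)^2/(2\sigma^2)$ and $e^{-(\theta_1-\theta_2)^2/(8\sigma^2)}$:

```python
import math

def gauss_pdf(x, theta, sigma):
    """Density of N(theta, sigma^2) at x."""
    return math.exp(-(x - theta) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def numeric_kl_and_affinity(t1, t2, sigma, n=100000):
    """Midpoint-rule integration of KL(p_t1 || p_t2) and A(p_t1, p_t2)
    over a window wide enough (10 sigma past each mean) for the tails."""
    lo, hi = min(t1, t2) - 10 * sigma, max(t1, t2) + 10 * sigma
    h = (hi - lo) / n
    kl = aff = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        p, q = gauss_pdf(x, t1, sigma), gauss_pdf(x, t2, sigma)
        kl += p * math.log(p / q) * h
        aff += math.sqrt(p * q) * h
    return kl, aff

t1, t2, sigma = 0.0, 1.0, 0.5
kl, aff = numeric_kl_and_affinity(t1, t2, sigma)
print(kl, (t1 - t2) ** 2 / (2 * sigma ** 2))              # both close to 2.0
print(aff, math.exp(-(t1 - t2) ** 2 / (8 * sigma ** 2)))  # both close to exp(-1/2)
```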
Therefore
$$H^2(p_{\theta_1}, p_{\theta_2}) = 2\left(1 - e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}\right) \qquad\text{and}\qquad -2\log A(p_{\theta_1}, p_{\theta_2}) = \frac{(\theta_1-\theta_2)^2}{4\sigma^2}.$$

2 Maximum Penalized Likelihood Estimation (MPLE)

Oftentimes, in non-parametric settings, it is necessary either to restrict the choice of possible models under consideration, or to penalize the MLE criterion. The second approach is more general and we will consider it here. Namely, we are going to study estimators of the form
$$\hat{\theta}_n = \arg\min_{\theta\in\Theta}\left\{-\sum_{i=1}^n \log p_i(Y_i;\theta) + \mathrm{pen}(\theta)\right\},$$
where $\mathrm{pen}(\theta)$ is intended to penalize complex models. Next we prove the following important result.

Theorem 1 (Li-Barron 2000, Kolaczyk-Nowak 2004). Let $\mathbf{Y}$ be a random vector (e.g. $\mathbf{Y} = (Y_1,\dots,Y_n)$). Assume the distribution of $\mathbf{Y}$ has unknown density $p^*$. Now suppose we have a class of probability density functions $\{p_\theta\}_{\theta\in\Theta}$ parameterized by $\theta\in\Theta$. Assume $\Theta$ is a countable set, and that for each $\theta$ we have an associated number $c(\theta)\ge 0$ so that the following inequality holds:
$$\sum_{\theta\in\Theta} 2^{-c(\theta)} \le 1. \qquad\text{(Kraft Inequality)}\qquad (2)$$
Define the Maximum Penalized Likelihood Estimator (MPLE) as
$$\hat{\theta}_n = \arg\min_{\theta\in\Theta}\left\{-\log p_\theta(\mathbf{Y}) + 2c(\theta)\log 2\right\}.$$
Then
$$\mathbb{E}\left[H^2(p^*, p_{\hat{\theta}_n})\right] \le \mathbb{E}\left[-2\log A(p^*, p_{\hat{\theta}_n})\right] \le \min_{\theta\in\Theta}\left\{KL(p^*\|p_\theta) + 2c(\theta)\log 2\right\}.$$

This type of result, known as an oracle bound, tells us that the performance of the proposed estimator is essentially as good as that of a clairvoyant estimator in which the likelihood function is replaced by its theoretical counterpart (the KL divergence). Note also that the result is extremely general - there are no assumptions made on the distribution of $\mathbf{Y}$ other than having a density. The major weakness of this result is that it only applies to countable (or finite) classes of candidate models. As we will see, this creates some inconvenience, but we can still analyze many interesting settings by discretizing/quantizing the classes of models under consideration. There are possible extensions of results of this type to uncountable classes of models, but these require several technical assumptions and heavy empirical-processes machinery to prove. Nevertheless the main ideas, intuition and results are essentially preserved. Finally, note that this result gives you performance bounds for maximum likelihood estimation over finite classes of models, as in that case one can take $c(\theta) = \log_2|\Theta|$ and the estimator reduces to the usual MLE (as the penalty term does not depend on $\theta$).
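For a finite class, the MPLE is a straightforward penalized search. The sketch below (illustrative only; the grid of candidate means, the code lengths, and the data are toy choices of mine) implements $\arg\min_\theta\{-\log p_\theta(\mathbf{Y}) + 2c(\theta)\log 2\}$ for i.i.d. Gaussian observations with unknown mean:

```python
import math

def mple(y, thetas, c, sigma=1.0):
    """Maximum penalized likelihood over a countable class:
    argmin_theta { -log p_theta(y) + 2 c(theta) log 2 },
    where p_theta is the joint density of i.i.d. N(theta, sigma^2) samples."""
    def neg_log_lik(theta):
        return (sum((yi - theta) ** 2 for yi in y) / (2 * sigma ** 2)
                + len(y) * 0.5 * math.log(2 * math.pi * sigma ** 2))
    return min(thetas, key=lambda th: neg_log_lik(th) + 2 * c(th) * math.log(2))

# Toy class: a grid of candidate means with uniform code lengths, so the
# penalty is a constant and the MPLE reduces to the plain MLE over the grid.
thetas = [k / 8 for k in range(9)]
c = lambda th: math.log2(len(thetas))
assert sum(2 ** (-c(th)) for th in thetas) <= 1 + 1e-12  # Kraft inequality

y = [0.42, 0.35, 0.58, 0.49]
print(mple(y, thetas, c))  # grid point nearest the sample mean 0.46
```

With non-uniform code lengths the same routine trades likelihood against the complexity charge $2c(\theta)\log 2$.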
2.1 The Gaussian Regression Case

Before proving the theorem, let's look at a very special and important case. We will use these results quite a lot in what follows. Recall the Gaussian regression model, used in many of the previous lectures. Suppose
$$Y_i = r^*(x_i) + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim}\mathcal{N}(0,\sigma^2),$$
where the $x_i$ are deterministic covariates (e.g., $x_i = i/n$), and $r^*$ is the true, unknown, regression function. The density of $Y_i$ is parameterized by $r$ and given by
$$p_i(y_i; r) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(y_i - r(x_i))^2}{2\sigma^2}}, \qquad (3)$$
with the true density corresponding to $r = r^*$. We have seen before that in this case
$$KL(p_i(r^*)\|p_i(r)) = \frac{1}{2\sigma^2}(r^*(x_i) - r(x_i))^2, \qquad -\log A(p_i(r^*), p_i(r)) = \frac{1}{8\sigma^2}(r^*(x_i) - r(x_i))^2.$$
This means that, for the joint density of the data, we have
$$KL(p(r^*)\|p(r)) = \frac{1}{2\sigma^2}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2, \qquad -\log A(p(r^*), p(r)) = \frac{1}{8\sigma^2}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2.$$
Finally, as we have seen,
$$-\log p(\mathbf{Y}; r) = \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - r(x_i))^2 + \text{const},$$
where the constant term does not depend on $r$. So, using all this together with the above theorem, we have the following important corollary.

Corollary 1. Let $Y_i = r^*(x_i) + \epsilon_i$, $\epsilon_i \overset{iid}{\sim}\mathcal{N}(0,\sigma^2)$, where the $x_i$ are deterministic. Let $\mathcal{R}$ be a countable class of models such that there is a map $c:\mathcal{R}\to[0,\infty)$ satisfying $\sum_{r\in\mathcal{R}} 2^{-c(r)}\le 1$. Define
$$\hat{r}_n = \arg\min_{r\in\mathcal{R}}\left\{\sum_{i=1}^n (Y_i - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\}.$$
Then
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{r\in\mathcal{R}}\left\{\frac{2}{n}\sum_{i=1}^n (r(x_i) - r^*(x_i))^2 + \frac{8\sigma^2 c(r)\log 2}{n}\right\}.$$
This corollary shows that the maximum penalized likelihood estimator can perform almost as well (we have a factor 2 in the bound) as if we had observed $r^*(x_i)$ instead of $Y_i$ (i.e., if we had removed the noise from the observations and used the same estimator).

Proof of Theorem 1. We have already shown that $H^2(p^*, p_\theta) \le -2\log A(p^*, p_\theta)$, so clearly
$$\mathbb{E}\left[H^2(p^*, p_{\hat\theta_n})\right] \le \mathbb{E}\left[-2\log A(p^*, p_{\hat\theta_n})\right].$$
We are going to bound the right-hand side of the above expression. Begin by defining the theoretical analog of $\hat\theta_n$:
$$\tilde\theta_n = \arg\min_{\theta\in\Theta}\left\{KL(p^*\|p_\theta) + 2c(\theta)\log 2\right\}.$$
Let's rewrite the definition of the MPLE:
$$\hat\theta_n = \arg\min_{\theta\in\Theta}\left\{-\log p_\theta(\mathbf{Y}) + 2c(\theta)\log 2\right\} = \arg\max_{\theta\in\Theta}\left\{\log p_\theta(\mathbf{Y}) - 2c(\theta)\log 2\right\}$$
$$= \arg\max_{\theta\in\Theta}\left\{\frac{1}{2}\log p_\theta(\mathbf{Y}) - c(\theta)\log 2\right\} = \arg\max_{\theta\in\Theta}\,\log\left(\sqrt{p_\theta(\mathbf{Y})}\,2^{-c(\theta)}\right) = \arg\max_{\theta\in\Theta}\,\sqrt{p_\theta(\mathbf{Y})}\,2^{-c(\theta)}.$$
This means that, for ANY $\theta\in\Theta$,
$$\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)} \ge \sqrt{p_\theta(\mathbf{Y})}\,2^{-c(\theta)}.$$
In particular we have
$$\sqrt{p_{\hat\theta_n}(\mathbf{Y})}\,2^{-c(\hat\theta_n)} \ge \sqrt{p_{\tilde\theta_n}(\mathbf{Y})}\,2^{-c(\tilde\theta_n)}.$$
With this fact in hand, let's proceed with the bound:
$$\mathbb{E}\left[-2\log A(p^*, p_{\hat\theta_n})\right] = \mathbb{E}\left[\log\frac{p^*(\mathbf{Y})}{p_{\tilde\theta_n}(\mathbf{Y})}\right] + 2c(\tilde\theta_n)\log 2 + \mathbb{E}\left[2\log\frac{\sqrt{p_{\tilde\theta_n}(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\tilde\theta_n)}}{A(p^*, p_{\hat\theta_n})}\right]$$
$$\le KL(p^*\|p_{\tilde\theta_n}) + 2c(\tilde\theta_n)\log 2 + \mathbb{E}\left[2\log\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\hat\theta_n)}}{A(p^*, p_{\hat\theta_n})}\right],$$
where the inequality uses the optimality property of $\hat\theta_n$ derived above. So, the first two terms of the last line are exactly what we have in the statement of the theorem (by the definition of $\tilde\theta_n$). To conclude the proof we just need to show the last term is always smaller than zero. Note that there are two random quantities in that term, namely $\mathbf{Y}$ and $\hat\theta_n$ (which is a function of $\mathbf{Y}$). Next, we use first Jensen's inequality and then a very crude bound to essentially get rid of $\hat\theta_n$, taking advantage of the fact that $\hat\theta_n\in\Theta$. First note that, by Jensen's inequality,
$$\mathbb{E}\left[2\log\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\hat\theta_n)}}{A(p^*, p_{\hat\theta_n})}\right] \le 2\log\mathbb{E}\left[\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\hat\theta_n)}}{A(p^*, p_{\hat\theta_n})}\right].$$
Now we can make use of the Kraft inequality through a union bound. In this case the union bound is simply saying that any individual term in a summation of non-negative terms is smaller than the summation.² So, we bound the inside of the expectation by a sum over $\Theta$:
$$2\log\mathbb{E}\left[\frac{\sqrt{p_{\hat\theta_n}(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\hat\theta_n)}}{A(p^*, p_{\hat\theta_n})}\right] \le 2\log\mathbb{E}\left[\sum_{\theta\in\Theta}\frac{\sqrt{p_\theta(\mathbf{Y})/p^*(\mathbf{Y})}\;2^{-c(\theta)}}{A(p^*, p_\theta)}\right]$$
$$= 2\log\sum_{\theta\in\Theta}2^{-c(\theta)}\,\frac{\mathbb{E}\left[\sqrt{p_\theta(\mathbf{Y})/p^*(\mathbf{Y})}\right]}{A(p^*, p_\theta)} = 2\log\sum_{\theta\in\Theta}2^{-c(\theta)} \le 2\log 1 = 0,$$

²Let $z_1, z_2,\dots$ be non-negative. Then for all $i\in\mathbb{N}$, $z_i \le \sum_j z_j$.
where the last step follows from Kraft's inequality, and the previous step follows simply by noting that
$$\mathbb{E}\left[\sqrt{\frac{p_\theta(\mathbf{Y})}{p^*(\mathbf{Y})}}\right] = \int\sqrt{\frac{p_\theta(\mathbf{y})}{p^*(\mathbf{y})}}\,p^*(\mathbf{y})\,d\mathbf{y} = \int\sqrt{p_\theta(\mathbf{y})\,p^*(\mathbf{y})}\,d\mathbf{y} = A(p^*, p_\theta). \qquad\square$$

2.2 Choosing the values c(θ)

In order to apply the theorem or the corollary we need to construct a map $c(\cdot)$ from $\Theta$ to $[0,\infty)$ satisfying the Kraft inequality. These values can be thought of as a measure of the complexity of each model $\theta$. If one has a proper probability mass distribution over $\Theta$, say $P_{\mathrm{prior}}:\Theta\to[0,1]$, then we can just use $c(\theta) = -\log_2 P_{\mathrm{prior}}(\theta)$. Clearly this satisfies the Kraft inequality. Moreover, this estimator has a strong Bayesian interpretation - it corresponds to the Maximum a Posteriori estimator for the prior $P_{\mathrm{prior}}$.

Another way to satisfy the Kraft inequality is to use coding arguments. This has the advantage that we can very easily devise maps that automatically satisfy the Kraft inequality. Assume we assign a binary codeword (a sequence of 0s and 1s) to each element of $\Theta$, and let $c(\theta)$ denote the length of the codeword for $\theta$. The set $\Theta$ is called the alphabet, and the idea is to encode sequences of symbols from the alphabet by concatenating the corresponding codewords. A very useful class of codes are the so-called prefix codes.

Definition 4. A code is called a prefix or instantaneous code if no codeword is a prefix of any other codeword.

The reason for such a name and definition is clarified in the following example.

Example 2 (From Cover & Thomas '91). Consider an alphabet of four symbols, say A, B, C, and D, and the four codebooks in Figure 1.

  Symbol | Singular | Nonsingular, But Not   | Uniquely Decodable,  | Prefix Code
         |          | Uniquely Decodable     | But Not a Prefix Code|
  A      | 0        | 0                      | 10                   | 0
  B      | 0        | 010                    | 00                   | 10
  C      | 0        | 01                     | 11                   | 110
  D      | 0        | 10                     | 110                  | 111

  Figure 1: Four possible codes for an alphabet of four symbols.

Suppose we want to convey the sentence ADBBAC. This will be encoded in each system respectively as 000000, 010010010001, 1011000001011, and 011110100110. It is clear that with the singular codebook we assign the same codeword to each symbol - a system that is obviously flawed!
In the second case the code is not singular, but the string 010 could represent B, or CA, or AD, so the binary sequence above is ambiguous: it can be decoded both as ADBBAC or as BBBAC, for instance. Hence it is not a uniquely decodable codebook. The third and fourth cases are both examples of uniquely decodable codebooks, but the fourth has the added feature that no codeword is a prefix of another. Prefix codes can be decoded from left to right instantaneously, since each codeword is self-punctuating - that is, you know immediately when a codeword ends. Note that, until the tenth bit in the third case, you don't know whether the sequence starts as ADBB or ACBBB, and only then can you decide. In the fourth case you know immediately when a codeword ends and another starts.
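The self-punctuating property is exactly what makes greedy left-to-right decoding work. A minimal sketch (my own, using a standard four-symbol prefix code {0, 10, 110, 111}):

```python
def decode_prefix(bits, codebook):
    """Decode a bit string under a prefix code by greedy left-to-right
    matching; prefix-freeness guarantees each codeword is self-punctuating,
    so no lookahead or backtracking is ever needed."""
    inv = {code: sym for sym, code in codebook.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:          # a codeword just ended
            out.append(inv[buf])
            buf = ""
    assert buf == "", "dangling bits: not a concatenation of codewords"
    return "".join(out)

code = {"A": "0", "B": "10", "C": "110", "D": "111"}

# Kraft's inequality holds for any prefix code: sum of 2^(-length) <= 1.
assert sum(2 ** -len(w) for w in code.values()) <= 1

msg = "ADBBAC"
bits = "".join(code[s] for s in msg)
print(bits)                       # 011110100110
print(decode_prefix(bits, code))  # ADBBAC
```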
The good thing about prefix codes is that they are easy to construct and automatically satisfy the Kraft inequality. We will use this approach often.

Theorem 2 (The Kraft Inequality). For any binary prefix code, the codeword lengths $c_1, c_2,\dots$ satisfy
$$\sum_k 2^{-c_k} \le 1.$$
Conversely, given any $c_1, c_2,\dots$ satisfying the inequality above, we can construct a prefix code with these codeword lengths.

Proof. We will prove that for any binary prefix code the codeword lengths $c_1, c_2,\dots$ satisfy $\sum_k 2^{-c_k}\le 1$. The converse is easy to prove also, but it is not central to our purposes here (for a proof, see Cover & Thomas '91). Consider a binary tree like the one shown in Figure 2.

Figure 2: A binary tree, with branches labeled 0 and 1 leading from the root down to the leaves.

The sequence of bit values leading from the root to a leaf of the tree represents a codeword. The prefix condition implies that no codeword is a descendant of any other codeword in the tree. Therefore each codeword of our code is represented by a leaf of the tree. Let $c_{\max}$ be the length of the longest codeword (also the number of branches to the deepest leaf). A node in the tree at level $c_i$ can have at most $2^{c_{\max}-c_i}$ descendants at level $c_{\max}$. Furthermore, for each leaf of the tree the set of possible descendants at level $c_{\max}$ is disjoint (since no codeword can be a prefix of another). Therefore, since the total number of possible nodes at level $c_{\max}$ is $2^{c_{\max}}$, we have
$$\sum_{i\in\text{leafs}} 2^{c_{\max}-c_i} \le 2^{c_{\max}} \quad\Longrightarrow\quad \sum_{i\in\text{leafs}} 2^{-c_i} \le 1,$$
which proves the case when the number of codewords is finite.
Suppose now that we have a countably infinite number of codewords. Let $b_1 b_2\cdots b_{c_i}$ be the $i$th codeword and let
$$r_i = \sum_{j=1}^{c_i} b_j\,2^{-j}$$
be the real number corresponding to the binary expansion of the codeword (in binary, $r_i = 0.b_1 b_2\cdots b_{c_i}$). We can associate the interval $[r_i, r_i + 2^{-c_i})$ with the $i$th codeword. This is the set of all real numbers whose binary expansion begins with $b_1 b_2\cdots b_{c_i}$. Since this is a subinterval of $[0,1]$, and all such subintervals corresponding to prefix codewords are disjoint, the sum of their lengths must be less than or equal to 1, that is, $\sum_i 2^{-c_i}\le 1$. This proves the case where the number of codewords is infinite. $\square$

3 Adapting to Unknown Parameters

Earlier in the course we saw that one can estimate Lipschitz smooth functions well. In this section we generalize those results to include smoother functions. Furthermore, we will be able to automatically adjust to the unknown level of smoothness of the functions.

3.1 Lipschitz Functions

Suppose we have a function $r^*:[0,1]\to[-R,R]$ satisfying the Lipschitz smoothness assumption
$$\forall s,t\in[0,1],\quad |r^*(s) - r^*(t)| \le L|s-t|.$$
We have seen that these functions can be well approximated by piecewise constant functions of the form
$$g(x) = \sum_{j=1}^m c_j\,\mathbf{1}\{x\in I_j\}, \qquad\text{where } I_j = \left[\frac{j-1}{m},\frac{j}{m}\right).$$
Let's use maximum likelihood to pick the best such model. Suppose we have the following regression model:
$$Y_i = r^*(x_i) + \epsilon_i, \qquad i=1,\dots,n,$$
where $x_i = i/n$ and $\epsilon_i\overset{iid}{\sim}\mathcal{N}(0,\sigma^2)$. To be able to use the corollary we derived, we need a countable or finite class of models. The easiest way to do so is to discretize/quantize the possible values of each constant piece in the candidate models. Define
$$\mathcal{R}_m = \left\{\sum_{j=1}^m c_j\,\mathbf{1}\{x\in I_j\} : c_j\in Q\right\}, \quad\text{where}\quad Q = \left\{-R + \frac{2Rk}{\sqrt{n}} : k = 0,\dots,\sqrt{n}\right\}.$$
Therefore $\mathcal{R}_m$ has exactly $(\sqrt{n}+1)^m$ elements in total. This means that, by taking
$$c(r) = \log_2\left((\sqrt{n}+1)^m\right) = m\log_2(\sqrt{n}+1)$$
for all $r\in\mathcal{R}_m$, we satisfy the Kraft inequality:
$$\sum_{r\in\mathcal{R}_m} 2^{-c(r)} = \sum_{r\in\mathcal{R}_m}\frac{1}{|\mathcal{R}_m|} = 1.$$
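As a quick numerical illustration of this counting argument (with toy values of $n$ and $m$ chosen arbitrarily), uniform code lengths over the quantized class meet Kraft's inequality with equality:

```python
import math

# Toy count check for the discretized class R_m: with sqrt(n) + 1 quantized
# levels per bin, |R_m| = (sqrt(n) + 1)^m, and the uniform code lengths
# c(r) = log2 |R_m| make the Kraft sum equal to 1.
n, m = 64, 4
levels = math.isqrt(n) + 1   # 9 quantized levels in [-R, R]
size = levels ** m           # |R_m| = 9^4 = 6561
c = math.log2(size)          # = m * log2(sqrt(n) + 1)
kraft_sum = size * 2 ** (-c)
print(size, kraft_sum)       # 6561, and a Kraft sum of (numerically) 1
```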
So we are ready to apply our oracle bound. Since $c(r)$ is just a constant (not really a function of $r$), the estimator is simply the MLE over $\mathcal{R}_m$:
$$\hat{r}_n = \arg\min_{r\in\mathcal{R}_m}\left\{\sum_{i=1}^n (Y_i - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\} = \arg\min_{r\in\mathcal{R}_m}\sum_{i=1}^n (Y_i - r(x_i))^2.$$
The corollary then says that
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{r\in\mathcal{R}_m}\left\{\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 c(r)\log 2}{n}\right\}.$$
So far the result is extremely general, as we have not made use of the Lipschitz assumption. We have seen earlier that there is a piecewise constant function $\bar{r}_m(x) = \sum_{j=1}^m c_j\mathbf{1}\{x\in I_j\}$ such that for all $x\in[0,1]$ we have $|r^*(x) - \bar{r}_m(x)| \le L/m$. The problem is that, generally, $\bar{r}_m\notin\mathcal{R}_m$, since $c_j\notin Q$. Take instead the element of $\mathcal{R}_m$ that is closest to $\bar{r}_m$, namely
$$\tilde{r}_m = \arg\min_{r\in\mathcal{R}_m}\ \sup_{x\in[0,1]}|r(x) - \bar{r}_m(x)|.$$
It is clear that $|\tilde{r}_m(x) - \bar{r}_m(x)| \le R/\sqrt{n}$ for all $x\in[0,1]$; therefore, by the triangle inequality, we have
$$|r^*(x) - \tilde{r}_m(x)| \le |r^*(x) - \bar{r}_m(x)| + |\bar{r}_m(x) - \tilde{r}_m(x)| \le \frac{L}{m} + \frac{R}{\sqrt{n}}.$$
Now, we can just use this in our bound:
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{r\in\mathcal{R}_m}\left\{\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 c(r)\log 2}{n}\right\}$$
$$\le 2\left(\frac{L}{m} + \frac{R}{\sqrt{n}}\right)^2 + \frac{8\sigma^2 m\log_2(\sqrt{n}+1)\log 2}{n} = 2\left(\frac{L}{m} + \frac{R}{\sqrt{n}}\right)^2 + \frac{8\sigma^2 m\log(\sqrt{n}+1)}{n}.$$
So, to ensure the best bound possible, we should choose $m$ minimizing the right-hand side. This yields $m \asymp (n/\log n)^{1/3}$ and
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] = O\left((n/\log n)^{-2/3}\right),$$
which, apart from the logarithmic factor, is the best we can ever hope for (this logarithmic factor is due to the discretization of the model classes, and is an artifact of this approach). If we want the truly best possible bound we need to minimize the above expression with respect to $m$, and for that we need to know $L$. Can we do better? Can we "automagically" choose $m$ using the data? The answer is yes, and for this we will start taking full advantage of our oracle bound.
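A toy implementation of this discretized least-squares fit (my own sketch: the true regression function, the noise level, and the projection onto the grid are illustrative choices, with $m$ set to the $(n/\log n)^{1/3}$ prescription):

```python
import math, random

def fit_piecewise_constant(y, m, R=1.0):
    """Least-squares fit over the discretized class R_m: m equal bins on [0,1],
    each constant level quantized to Q = {-R + 2Rk/sqrt(n) : k = 0..sqrt(n)}."""
    n = len(y)
    g = math.isqrt(n)                               # sqrt(n) grid resolution
    Q = [-R + 2 * R * k / g for k in range(g + 1)]
    fit = [0.0] * n
    for j in range(m):
        lo, hi = j * n // m, (j + 1) * n // m
        avg = sum(y[lo:hi]) / (hi - lo)             # unconstrained LS level on bin j
        level = min(Q, key=lambda q: abs(q - avg))  # project onto the grid Q
        for i in range(lo, hi):
            fit[i] = level
    return fit

random.seed(0)
n = 1024
r_star = [0.5 * math.sin(2 * math.pi * i / n) for i in range(n)]  # a Lipschitz truth
y = [r + random.gauss(0, 0.2) for r in r_star]
m = round((n / math.log(n)) ** (1 / 3))   # the (n / log n)^(1/3) prescription
fit = fit_piecewise_constant(y, m)
mse = sum((f - r) ** 2 for f, r in zip(fit, r_star)) / n
print(m, mse)  # the risk falls well below the raw noise variance 0.04
```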
Since we want to choose the best possible $m$, we must consider the following class of models:
$$\mathcal{R} = \bigcup_{m=1}^\infty \mathcal{R}_m.$$
This is clearly a countable class of models (but not finite), so we need to be a bit more careful in constructing the map $c(\cdot)$. Let's use a coding argument: begin by defining
$$m(r) = \min\{m\in\mathbb{N} : r\in\mathcal{R}_m\}.$$
Encode $r\in\mathcal{R}$ using first the bits $0\cdots01$ ($m(r)$ bits in total, a unary encoding of $m(r)$), and then $\log_2|\mathcal{R}_{m(r)}|$ bits to encode which model inside $\mathcal{R}_{m(r)}$ is $r$. This is clearly a prefix code and therefore satisfies the Kraft inequality. More formally,
$$c(r) = m(r) + \log_2|\mathcal{R}_{m(r)}| = m(r) + \log_2\left((\sqrt{n}+1)^{m(r)}\right) = m(r)\left(1 + \log_2(\sqrt{n}+1)\right).$$
Although we know, from the coding argument, that the map $c(\cdot)$ satisfies the Kraft inequality for sure, we can do a little sanity check and ensure this is indeed true:
$$\sum_{r\in\mathcal{R}} 2^{-c(r)} = \sum_{m=1}^\infty\ \sum_{r:\,m(r)=m} 2^{-m-\log_2|\mathcal{R}_m|} \le \sum_{m=1}^\infty |\mathcal{R}_m|\,\frac{2^{-m}}{|\mathcal{R}_m|} = \sum_{m=1}^\infty 2^{-m} = 1.$$
Now, similarly to what we had before,
$$\hat{r}_n = \arg\min_{r\in\mathcal{R}}\left\{\sum_{i=1}^n (Y_i - r(x_i))^2 + 4\sigma^2 c(r)\log 2\right\} = \arg\min_{r\in\mathcal{R}}\left\{\sum_{i=1}^n (Y_i - r(x_i))^2 + 4\sigma^2 m(r)\left(1 + \log_2(\sqrt{n}+1)\right)\log 2\right\},$$
which is no longer the MLE, but rather a maximum penalized likelihood estimator. Then
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{r\in\mathcal{R}}\left\{\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 m(r)(1+\log_2(\sqrt{n}+1))\log 2}{n}\right\}$$
$$= \min_{m\in\mathbb{N}}\ \min_{r\in\mathcal{R}_m}\left\{\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 m(1+\log_2(\sqrt{n}+1))\log 2}{n}\right\}$$
$$\le \min_{m\in\mathbb{N}}\left\{2\left(\frac{L}{m} + \frac{R}{\sqrt{n}}\right)^2 + \frac{8\sigma^2 m(\log 2 + \log(\sqrt{n}+1))}{n}\right\}.$$
Therefore this estimator automatically chooses the best possible number of parameters $m$. Note that the price we pay is very modest - the only change from what we had before is that the term $\log(\sqrt{n}+1)$ is replaced by $\log 2 + \log(\sqrt{n}+1)$, which is a very minute change, as $\log 2 \ll \log(\sqrt{n}+1)$ for any interesting sample size. Although this is remarkable, we can probably do much better, and adjust to unknown smoothness. We'll see how to do this next.

4 Regression of Hölder smooth functions

Let's begin by defining a class of Hölder smooth functions. Let $\alpha > 0$ and let $H^\alpha(C,R)$ be the set of all functions $r:[0,1]\to[-R,R]$ which have $\lfloor\alpha\rfloor$ derivatives and satisfy
$$|r(x) - T_y^{\lfloor\alpha\rfloor}(x)| \le C|x-y|^\alpha \qquad \forall x,y\in[0,1],$$
where $\lfloor\alpha\rfloor$ is the largest integer such that $\lfloor\alpha\rfloor < \alpha$, and $T_y^{\lfloor\alpha\rfloor}$ is the Taylor polynomial of $r$ of degree $\lfloor\alpha\rfloor$ around the point $y$. In words, a Hölder-$\alpha$ smooth function is locally well approximated by a polynomial of degree $\lfloor\alpha\rfloor$. Note that the above definition coincides with our Lipschitz function class when $\alpha = 1$. Hölder smoothness essentially measures how differentiable functions are, and therefore Taylor polynomials are the natural way to approximate Hölder smooth functions. We will focus on Hölder smooth function classes with $0 < \alpha \le 2$, but the results presented can easily be generalized to larger values of $\alpha$. Since $\alpha \le 2$ we will work with piecewise linear approximations, that is, Taylor polynomials of degree 1. If we were to consider smoother functions, $\alpha > 2$, we would need to consider higher-degree Taylor polynomial approximations, i.e. quadratic, cubic, etc.

4.1 Regression of Hölder smooth functions

Consider the usual regression model
$$Y_i = r^*(x_i) + \epsilon_i, \qquad i=1,\dots,n,$$
where $x_i = i/n$ and $\epsilon_i\overset{iid}{\sim}\mathcal{N}(0,\sigma^2)$. Let's assume $r^*\in H^\alpha(C,R)$, with unknown smoothness $0<\alpha\le 2$. Intuitively, the smoother $r^*$ is, the better we should be able to estimate it, as we can average more
observations locally. In other words, for smoother $r^*$ we should average over
larger bins. Also, we will need to exploit the extra smoothness in our models for $r^*$. To that end, we will consider candidate functions that are piecewise linear, i.e., functions of the form
$$\sum_{j=1}^m (a_j + b_j x)\,\mathbf{1}\{x\in I_j\}, \qquad\text{where } I_j = \left[\frac{j-1}{m},\frac{j}{m}\right).$$
As before, we want to consider countable/finite classes of models to be able to apply our corollary, so we will consider a slight modification of the above. Each linear piece can be described by its values at the beginning and end points of the corresponding interval, so we are going to restrict those values to lie on a grid (refer to Figure 3).

Figure 3: Example of the discretization of $r$ on the interval $[(j-1)/m, j/m)$: the endpoint values of the linear piece are restricted to a grid of quantized levels.

Define the class
$$\mathcal{R}_m = \left\{r(x) = \sum_{j=1}^m l_j(x)\,\mathbf{1}\{x\in I_j\}\right\},$$
where
$$l_j(x) = a_j\,\frac{j/m - x}{1/m} + b_j\,\frac{x - (j-1)/m}{1/m} = (j - mx)\,a_j + (mx - j + 1)\,b_j,$$
and $a_j, b_j \in \left\{-R + \frac{2Rk}{\sqrt{n}} : k = 0,\dots,\sqrt{n}\right\}$. Clearly $|\mathcal{R}_m| = (\sqrt{n}+1)^{2m}$. Since we don't know the smoothness a priori, we must choose $m$ using the data. Therefore, in the same fashion as before, we take the class
$$\mathcal{R} = \bigcup_{m=1}^\infty \mathcal{R}_m, \qquad\text{with } m(r) = \min\{m\in\mathbb{N} : r\in\mathcal{R}_m\},$$
and
$$c(r) = m(r) + \log_2|\mathcal{R}_{m(r)}| = m(r)\left(1 + \log_2\left((\sqrt{n}+1)^2\right)\right) = m(r)\left(1 + 2\log_2(\sqrt{n}+1)\right).$$
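A sketch of the resulting data-driven choice of $m$ (illustrative only: it fits plain bin averages, skipping the quantization of the levels, and uses the piecewise-constant penalty $c(m) = m(1+\log_2(\sqrt{n}+1))$ from the Lipschitz section rather than the piecewise-linear one; the signal and noise level are toy choices):

```python
import math, random

def penalized_select(y, sigma, max_m=64):
    """Pick m by minimizing SSE + 4 sigma^2 c(m) log 2 over piecewise-constant
    fits with m equal bins, with c(m) = m (1 + log2(sqrt(n) + 1))."""
    n = len(y)
    best_m, best_crit = 1, float("inf")
    for m in range(1, max_m + 1):
        sse = 0.0
        for j in range(m):
            lo, hi = j * n // m, (j + 1) * n // m
            avg = sum(y[lo:hi]) / (hi - lo)
            sse += sum((yi - avg) ** 2 for yi in y[lo:hi])
        c_m = m * (1 + math.log2(math.isqrt(n) + 1))
        crit = sse + 4 * sigma ** 2 * c_m * math.log(2)
        if crit < best_crit:
            best_m, best_crit = m, crit
    return best_m

random.seed(1)
n, sigma = 1024, 0.2
y = [0.5 * math.sin(2 * math.pi * i / n) + random.gauss(0, sigma) for i in range(n)]
print(penalized_select(y, sigma))  # a moderate m, chosen without knowing L
print(penalized_select([random.gauss(0, sigma) for _ in range(n)], sigma))  # pure noise: m stays tiny
```

The penalty automatically balances the residual sum of squares against model complexity: a wiggly signal earns more bins, pure noise does not.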
Exactly as before, define the estimator
$$\hat{r}_n = \arg\min_{r\in\mathcal{R}}\left\{\sum_{i=1}^n (Y_i - r(x_i))^2 + 4\sigma^2 m(r)\left(1 + 2\log_2(\sqrt{n}+1)\right)\log 2\right\}.$$
Then
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{r\in\mathcal{R}}\left\{\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 m(r)(1+2\log_2(\sqrt{n}+1))\log 2}{n}\right\}$$
$$= \min_{m\in\mathbb{N}}\left\{\min_{r\in\mathcal{R}_m}\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 m(1+2\log_2(\sqrt{n}+1))\log 2}{n}\right\}.$$
In the above, the first term is essentially our familiar approximation error, and the second term is, in a sense, bounding the estimation error. Therefore this estimator automatically seeks the best balance between the two.

To say something more concrete about the performance of the estimator we need to bring in the assumptions we have on $r^*$ (so far we haven't used any). First, suppose $r^*\in H^\alpha(C,R)$ for $1<\alpha\le 2$. We need to find a good model in the class $\mathcal{R}_m$ that makes the approximation error small. Take $x\in I_j$, where $j$ is arbitrary. From the definition of $H^\alpha(C,R)$ we have
$$\left|r^*(x) - r^*\!\left(\tfrac{j}{m}\right) - \tfrac{\partial}{\partial x}r^*\!\left(\tfrac{j}{m}\right)\left(x - \tfrac{j}{m}\right)\right| \le C\left|x - \tfrac{j}{m}\right|^\alpha \le C\left(\tfrac{1}{m}\right)^\alpha = Cm^{-\alpha}.$$
Therefore we can approximate $r^*(x)$, for $x\in I_j$, by a linear function, such that the error is bounded by $Cm^{-\alpha}$. In particular, the best piecewise linear approximation, defined as
$$\bar{r}_m = \arg\min_{r\ \text{piecewise linear}}\ \sup_{x\in[0,1]}|r(x) - r^*(x)|,$$
satisfies $|r^*(x) - \bar{r}_m(x)| \le Cm^{-\alpha}$ for all $x\in[0,1]$. Clearly, this function is in general not in $\mathcal{R}_m$, so we need to do some discretization. Take the function in $\mathcal{R}_m$ that is closest to $\bar{r}_m$:
$$\tilde{r}_m = \arg\min_{r\in\mathcal{R}_m}\ \sup_{x\in[0,1]}|r(x) - \bar{r}_m(x)|.$$
Because of the way we discretized, we know that $\sup_{x\in[0,1]}|\tilde{r}_m(x) - \bar{r}_m(x)| \le R/\sqrt{n}$. Using this together with the triangle inequality yields
$$\forall x\in[0,1],\quad |\tilde{r}_m(x) - r^*(x)| \le Cm^{-\alpha} + \frac{R}{\sqrt{n}}. \qquad (4)$$
If $r^*\in H^\alpha(C,R)$ for $0<\alpha\le 1$, we can proceed in a similar fashion, but simply have to note that such functions are well approximated by piecewise constant functions. Furthermore, these are a subset of $\mathcal{R}_m$, so the reasoning we used for Lipschitz functions applies directly here, and we obtain expression (4) for that case as well.
Finally, we can just plug these results into the bound of the corollary:
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] \le \min_{m\in\mathbb{N}}\left\{\min_{r\in\mathcal{R}_m}\frac{2}{n}\sum_{i=1}^n (r^*(x_i) - r(x_i))^2 + \frac{8\sigma^2 m(1+2\log_2(\sqrt{n}+1))\log 2}{n}\right\}$$
$$\le \min_{m\in\mathbb{N}}\left\{2\left(Cm^{-\alpha} + \frac{R}{\sqrt{n}}\right)^2 + \frac{8\sigma^2 m(1+2\log_2(\sqrt{n}+1))\log 2}{n}\right\}$$
$$= \min_{m\in\mathbb{N}}\ O\left(\max\left\{m^{-2\alpha},\ \frac{1}{n},\ \frac{m\log n}{n}\right\}\right).$$
It is not hard to see that the first and last terms dominate the bound, and so we attain the minimum by taking (in the bound)
$$m \asymp \left(\frac{n}{\log n}\right)^{\frac{1}{2\alpha+1}},$$
which yields
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (\hat{r}_n(x_i) - r^*(x_i))^2\right] = O\left(\left(\frac{\log n}{n}\right)^{\frac{2\alpha}{2\alpha+1}}\right).$$
Note that the estimator does not know $\alpha$! So we are indeed adapting to the unknown smoothness. If the regression function $r^*$ is Lipschitz, this estimator has error rate $O\left((n/\log n)^{-2/3}\right)$. However, if the function is smoother (say $\alpha = 2$), the estimator has error rate $O\left((n/\log n)^{-4/5}\right)$, which decays much more quickly to zero. More remarkably, apart from the logarithmic factor, it can be shown that this is the best one can hope for! So the logarithmic factor is the very small price we need to pay for adaptivity.
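To get a feel for what adaptivity buys, one can tabulate the bound $((\log n)/n)^{2\alpha/(2\alpha+1)}$ for a couple of smoothness levels (a small numerical sketch, not part of the notes):

```python
import math

def risk_bound_rate(n, alpha):
    """The adaptive risk bound from the notes: ((log n)/n)^(2 alpha / (2 alpha + 1)).
    Larger alpha (smoother truth) gives a larger exponent, hence faster decay."""
    return (math.log(n) / n) ** (2 * alpha / (2 * alpha + 1))

for n in (10**4, 10**6):
    # alpha = 1 (Lipschitz): exponent 2/3; alpha = 2: exponent 4/5
    print(n, risk_bound_rate(n, 1.0), risk_bound_rate(n, 2.0))
```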
More informationLecture 7: Density Estimation: k-nearest Neighbor and Basis Approach
STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.
More informationProduct measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.
Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the
More informationLecture 3: August 31
36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,
More informationStudy the bias (due to the nite dimensional approximation) and variance of the estimators
2 Series Methods 2. Geeral Approach A model has parameters (; ) where is ite-dimesioal ad is oparametric. (Sometimes, there is o :) We will focus o regressio. The fuctio is approximated by a series a ite
More information32 estimating the cumulative distribution function
32 estimatig the cumulative distributio fuctio 4.6 types of cofidece itervals/bads Let F be a class of distributio fuctios F ad let θ be some quatity of iterest, such as the mea of F or the whole fuctio
More informationOn Random Line Segments in the Unit Square
O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,
More informationLecture 3 The Lebesgue Integral
Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified
More informationEmpirical Process Theory and Oracle Inequalities
Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5
CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio
More informationProblem Set 4 Due Oct, 12
EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios
More informationFall 2013 MTH431/531 Real analysis Section Notes
Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters
More informationIntegrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number
MATH 532 Itegrable Fuctios Dr. Neal, WKU We ow shall defie what it meas for a measurable fuctio to be itegrable, show that all itegral properties of simple fuctios still hold, ad the give some coditios
More informationMath 155 (Lecture 3)
Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,
More informationLecture 10 October Minimaxity and least favorable prior sequences
STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least
More informationLecture 20: Multivariate convergence and the Central Limit Theorem
Lecture 20: Multivariate covergece ad the Cetral Limit Theorem Covergece i distributio for radom vectors Let Z,Z 1,Z 2,... be radom vectors o R k. If the cdf of Z is cotiuous, the we ca defie covergece
More informationAn Introduction to Randomized Algorithms
A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis
More informationResampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.
Jauary 1, 2019 Resamplig Methods Motivatio We have so may estimators with the property θ θ d N 0, σ 2 We ca also write θ a N θ, σ 2 /, where a meas approximately distributed as Oce we have a cosistet estimator
More informationEECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1
EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum
More informationEconomics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator
Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters
More information5.1 A mutual information bound based on metric entropy
Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local
More informationSieve Estimators: Consistency and Rates of Convergence
EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes
More informationLecture 11 and 12: Basic estimation theory
Lecture ad 2: Basic estimatio theory Sprig 202 - EE 94 Networked estimatio ad cotrol Prof. Kha March 2 202 I. MAXIMUM-LIKELIHOOD ESTIMATORS The maximum likelihood priciple is deceptively simple. Louis
More informationLet us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.
Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,
More informationLecture 7: October 18, 2017
Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem
More informationLecture 14: Graph Entropy
15-859: Iformatio Theory ad Applicatios i TCS Sprig 2013 Lecture 14: Graph Etropy March 19, 2013 Lecturer: Mahdi Cheraghchi Scribe: Euiwoog Lee 1 Recap Bergma s boud o the permaet Shearer s Lemma Number
More informationEstimation for Complete Data
Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of
More informationIf a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero?
2 Lebesgue Measure I Chapter 1 we defied the cocept of a set of measure zero, ad we have observed that every coutable set is of measure zero. Here are some atural questios: If a subset E of R cotais a
More information4.1 Data processing inequality
ECE598: Iformatio-theoretic methods i high-dimesioal statistics Sprig 206 Lecture 4: Total variatio/iequalities betwee f-divergeces Lecturer: Yihog Wu Scribe: Matthew Tsao, Feb 8, 206 [Ed. Mar 22] Recall
More informationInformation Theory and Statistics Lecture 4: Lempel-Ziv code
Iformatio Theory ad Statistics Lecture 4: Lempel-Ziv code Łukasz Dębowski ldebowsk@ipipa.waw.pl Ph. D. Programme 203/204 Etropy rate is the limitig compressio rate Theorem For a statioary process (X i)
More informationThe Growth of Functions. Theoretical Supplement
The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that
More informationStatistics 511 Additional Materials
Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability
More information2 Banach spaces and Hilbert spaces
2 Baach spaces ad Hilbert spaces Tryig to do aalysis i the ratioal umbers is difficult for example cosider the set {x Q : x 2 2}. This set is o-empty ad bouded above but does ot have a least upper boud
More informationKernel density estimator
Jauary, 07 NONPARAMETRIC ERNEL DENSITY ESTIMATION I this lecture, we discuss kerel estimatio of probability desity fuctios PDF Noparametric desity estimatio is oe of the cetral problems i statistics I
More informationEE 4TM4: Digital Communications II Probability Theory
1 EE 4TM4: Digital Commuicatios II Probability Theory I. RANDOM VARIABLES A radom variable is a real-valued fuctio defied o the sample space. Example: Suppose that our experimet cosists of tossig two fair
More information1 Convergence in Probability and the Weak Law of Large Numbers
36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec
More informationThe Boolean Ring of Intervals
MATH 532 Lebesgue Measure Dr. Neal, WKU We ow shall apply the results obtaied about outer measure to the legth measure o the real lie. Throughout, our space X will be the set of real umbers R. Whe ecessary,
More informationLecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound
Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee
More informationLaw of the sum of Bernoulli random variables
Law of the sum of Beroulli radom variables Nicolas Chevallier Uiversité de Haute Alsace, 4, rue des frères Lumière 68093 Mulhouse icolas.chevallier@uha.fr December 006 Abstract Let be the set of all possible
More informationShannon s noiseless coding theorem
18.310 lecture otes May 4, 2015 Shao s oiseless codig theorem Lecturer: Michel Goemas I these otes we discuss Shao s oiseless codig theorem, which is oe of the foudig results of the field of iformatio
More informationThe picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled
1 Lecture : Area Area ad distace traveled Approximatig area by rectagles Summatio The area uder a parabola 1.1 Area ad distace Suppose we have the followig iformatio about the velocity of a particle, how
More informationMath 2784 (or 2794W) University of Connecticut
ORDERS OF GROWTH PAT SMITH Math 2784 (or 2794W) Uiversity of Coecticut Date: Mar. 2, 22. ORDERS OF GROWTH. Itroductio Gaiig a ituitive feel for the relative growth of fuctios is importat if you really
More informationStat410 Probability and Statistics II (F16)
Some Basic Cocepts of Statistical Iferece (Sec 5.) Suppose we have a rv X that has a pdf/pmf deoted by f(x; θ) or p(x; θ), where θ is called the parameter. I previous lectures, we focus o probability problems
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 3
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe
More informationOutput Analysis and Run-Length Control
IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%
More informationA Proof of Birkhoff s Ergodic Theorem
A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed
More informationINFINITE SEQUENCES AND SERIES
INFINITE SEQUENCES AND SERIES INFINITE SEQUENCES AND SERIES I geeral, it is difficult to fid the exact sum of a series. We were able to accomplish this for geometric series ad the series /[(+)]. This is
More informationMeasure and Measurable Functions
3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies
More informationChapter 6 Principles of Data Reduction
Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a
More informationMath 61CM - Solutions to homework 3
Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet
More informationRates of Convergence by Moduli of Continuity
Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity
More informationInformation Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame
Iformatio Theory Tutorial Commuicatio over Chaels with memory Chi Zhag Departmet of Electrical Egieerig Uiversity of Notre Dame Abstract A geeral capacity formula C = sup I(; Y ), which is correct for
More informationLecture 01: the Central Limit Theorem. 1 Central Limit Theorem for i.i.d. random variables
CSCI-B609: A Theorist s Toolkit, Fall 06 Aug 3 Lecture 0: the Cetral Limit Theorem Lecturer: Yua Zhou Scribe: Yua Xie & Yua Zhou Cetral Limit Theorem for iid radom variables Let us say that we wat to aalyze
More informationSequences. Notation. Convergence of a Sequence
Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it
More informationMAT1026 Calculus II Basic Convergence Tests for Series
MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real
More informationMath 104: Homework 2 solutions
Math 04: Homework solutios. A (0, ): Sice this is a ope iterval, the miimum is udefied, ad sice the set is ot bouded above, the maximum is also udefied. if A 0 ad sup A. B { m + : m, N}: This set does
More information7 Sequences of real numbers
40 7 Sequeces of real umbers 7. Defiitios ad examples Defiitio 7... A sequece of real umbers is a real fuctio whose domai is the set N of atural umbers. Let s : N R be a sequece. The the values of s are
More informationBasics of Probability Theory (for Theory of Computation courses)
Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.
More informationOptimally Sparse SVMs
A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but
More informationAgnostic Learning and Concentration Inequalities
ECE901 Sprig 2004 Statistical Regularizatio ad Learig Theory Lecture: 7 Agostic Learig ad Cocetratio Iequalities Lecturer: Rob Nowak Scribe: Aravid Kailas 1 Itroductio 1.1 Motivatio I the last lecture
More informationRandom Variables, Sampling and Estimation
Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig
More informationECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors
ECONOMETRIC THEORY MODULE XIII Lecture - 34 Asymptotic Theory ad Stochastic Regressors Dr. Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Asymptotic theory The asymptotic
More informationExponential Families and Bayesian Inference
Computer Visio Expoetial Families ad Bayesia Iferece Lecture Expoetial Families A expoetial family of distributios is a d-parameter family f(x; havig the followig form: f(x; = h(xe g(t T (x B(, (. where
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.
More informationDepartment of Mathematics
Departmet of Mathematics Ma 3/103 KC Border Itroductio to Probability ad Statistics Witer 2017 Lecture 19: Estimatio II Relevat textbook passages: Larse Marx [1]: Sectios 5.2 5.7 19.1 The method of momets
More informationEcon 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara
Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio
More informations = and t = with C ij = A i B j F. (i) Note that cs = M and so ca i µ(a i ) I E (cs) = = c a i µ(a i ) = ci E (s). (ii) Note that s + t = M and so
3 From the otes we see that the parts of Theorem 4. that cocer us are: Let s ad t be two simple o-egative F-measurable fuctios o X, F, µ ad E, F F. The i I E cs ci E s for all c R, ii I E s + t I E s +
More informationA survey on penalized empirical risk minimization Sara A. van de Geer
A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the
More informationMachine Learning Theory (CS 6783)
Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT
More information