Empirical risk minimization for heavy-tailed losses


Empirical risk minimization for heavy-tailed losses

Christian Brownlees, Emilie Joly, Gábor Lugosi

June 8, 2014

Abstract

The purpose of this paper is to discuss empirical risk minimization when the losses are not necessarily bounded and may have a distribution with heavy tails. In such situations the usual empirical averages may fail to provide reliable estimates and empirical risk minimization may have large excess risk. However, some robust mean estimators proposed in the literature may be used to replace empirical means. In this paper we investigate empirical risk minimization based on a robust estimate proposed by Catoni. We develop performance bounds based on chaining arguments tailored to Catoni's mean estimator.

1 Introduction

One of the basic principles of statistical learning is empirical risk minimization, which has been routinely used in a great variety of problems such as regression function estimation, classification, and clustering. The general model may be described as follows. Let $X$ be a random variable taking values in some measurable space $\mathcal{X}$ and let $\mathcal{F}$ be a set of nonnegative functions defined on $\mathcal{X}$. For each $f \in \mathcal{F}$, define the risk $m_f = \mathbb{E} f(X)$ and let $m^* = \inf_{f\in\mathcal{F}} m_f$ denote the optimal risk. In statistical learning, $n$ independent random variables $X_1,\dots,X_n$ are available, all distributed as $X$, and one aims at finding a function with small risk. To this end, one may define the empirical risk minimizer

$$ f_{\mathrm{ERM}} = \operatorname*{argmin}_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(X_i), $$

where, for the simplicity of the discussion and essentially without loss of generality, we implicitly assume that the minimizer exists. If the minimum is achieved by more than one function, one may pick one of them arbitrarily.

Remark. (loss functions and risks.) The main motivation and terminology may be explained by the following general prediction problem in statistical learning. Let $(Z_1,Y_1),\dots,(Z_n,Y_n)$ be independent, identically distributed pairs of random variables representing training data, where the $Z_i$ take their values in, say, $\mathbb{R}^d$ and the $Y_i$ are real-valued. In classification problems the $Y_i$ take discrete values. Given a new observation $Z$,
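As a minimal illustration of the definition of $f_{\mathrm{ERM}}$ above, the following sketch (our own toy example, not from the paper) minimizes the empirical risk over a finite class $f_c(x) = (x-c)^2$ indexed by a grid of values $c$; the population risk $m_{f_c} = \mathbb{E}(X-c)^2$ is minimized at $c = \mathbb{E}[X]$.

```python
import numpy as np

# Sketch of empirical risk minimization over a finite class.
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.0, size=500)   # X_1, ..., X_n

# Candidate class: f_c(x) = (x - c)^2 for c on a grid; the risk
# m_{f_c} = E (X - c)^2 is minimized at c = E[X] = 2.
cs = np.linspace(0.0, 4.0, 81)
empirical_risks = np.array([np.mean((sample - c) ** 2) for c in cs])
best_c = cs[int(np.argmin(empirical_risks))]        # argmin of (1/n) sum_i f(X_i)
```

With light-tailed data as here, the empirical minimizer lands close to the population minimizer; the point of the paper is that this can fail badly for heavy-tailed $f(X)$.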

one is interested in predicting the value of the corresponding response variable $Y$, where the pair $(Z,Y)$ has the same distribution as that of the $(Z_i,Y_i)$. A predictor is a function $g:\mathbb{R}^d\to\mathbb{R}$ whose quality is measured with the help of a loss function $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$. The risk of $g$ is then $\mathbb{E}\,\ell(g(Z),Y)$. Given a class $\mathcal{G}$ of functions $g:\mathbb{R}^d\to\mathbb{R}$, empirical risk minimization chooses one that minimizes the empirical risk $(1/n)\sum_{i=1}^n \ell(g(Z_i),Y_i)$ over all $g\in\mathcal{G}$. In the simplified notation followed in this paper, $X_i$ corresponds to the pair $(Z_i,Y_i)$, the function $f$ represents $\ell(g(\cdot),\cdot)$, and $m_f$ substitutes $\mathbb{E}\,\ell(g(Z),Y)$.

The performance of empirical risk minimization is measured by the risk of the selected function,

$$ m_{f_{\mathrm{ERM}}} = \mathbb{E}\left[ f_{\mathrm{ERM}}(X) \mid X_1,\dots,X_n \right]. $$

In particular, the main object of interest for this paper is the excess risk $m_{f_{\mathrm{ERM}}} - m^*$. The performance of empirical risk minimization has been thoroughly studied and is well understood using tools of empirical process theory. In particular, the simple observation that

$$ m_{f_{\mathrm{ERM}}} - m^* \le 2\sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - m_f \right| $$

allows one to apply the rich theory of the suprema of empirical processes to obtain upper performance bounds. The interested reader is referred to Bartlett and Mendelson [6], Boucheron, Bousquet, and Lugosi [8], Koltchinskii [14], Massart [18], Mendelson [21], van de Geer [29] for references and recent results in this area.

Essentially all of the theory of empirical minimization assumes either that the functions $f$ are uniformly bounded or that the random variables $f(X)$ have sub-Gaussian tails for all $f\in\mathcal{F}$. For example, when all $f\in\mathcal{F}$ take their values in the interval $[0,1]$, Dudley's [12] classical metric-entropy bound, together with standard symmetrization arguments, implies that there exists a universal constant $c$ such that

$$ \mathbb{E}\, m_{f_{\mathrm{ERM}}} - m^* \le \frac{c}{\sqrt{n}}\, \mathbb{E} \int_0^1 \sqrt{\log N_X(\mathcal{F},\epsilon)}\, d\epsilon, \qquad (1) $$

where for any $\epsilon>0$, $N_X(\mathcal{F},\epsilon)$ is the $\epsilon$-covering number of the class $\mathcal{F}$ under the empirical quadratic distance $d_X(f,g) = \left( \frac{1}{n}\sum_{i=1}^n (f(X_i)-g(X_i))^2 \right)^{1/2}$, defined as the minimal cardinality $N$ of any set $\{f_1,\dots,f_N\}\subset\mathcal{F}$ such that for all $f\in\mathcal{F}$ there exists an $f_j\in\{f_1,\dots,f_N\}$ with $d_X(f,f_j)\le\epsilon$.
Of course, this is one of the most basic bounds and many important refinements have been established. A tighter bound may be established by the so-called generic chaining method; see Talagrand [27]. Recall the following definition (see, e.g., [27, Definition 1.2.3]). Let $T$ be a (pseudo) metric space. An increasing sequence $(\mathcal{A}_n)$ of partitions of $T$ is called admissible if for all $n=0,1,2,\dots$, $\#\mathcal{A}_n \le 2^{2^n}$. For any $t\in T$, denote by $A_n(t)$ the unique element of $\mathcal{A}_n$ that contains $t$. Let $\Delta(A)$ denote the diameter of the set $A\subset T$. Define, for $\alpha=1,2$,

$$ \gamma_\alpha(T,d) = \inf_{\mathcal{A}} \sup_{t\in T} \sum_{n\ge 0} 2^{n/\alpha}\, \Delta(A_n(t)), $$

where the infimum is taken over all admissible sequences. Then one has

$$ \mathbb{E}\, m_{f_{\mathrm{ERM}}} - m^* \le \frac{c}{\sqrt{n}}\, \mathbb{E}\,\gamma_2(\mathcal{F},d_X) \qquad (2) $$

for some universal constant $c$. This bound implies (1) as $\gamma_2(\mathcal{F},d_X)$ is bounded by a constant multiple of the entropy integral $\int_0^1 \sqrt{\log N_X(\mathcal{F},\epsilon)}\,d\epsilon$.

However, when the functions $f$ are no longer uniformly bounded and the random variables $f(X)$ may have a heavy tail, empirical risk minimization may have a much poorer performance. This is simply due to the fact that empirical averages become poor estimates of expected values. Indeed, for heavy-tailed distributions, several estimators of the mean are known to outperform simple empirical averages. It is a natural idea to define a robust version of empirical risk minimization based on minimizing such robust estimators. In this paper we focus on an elegant and powerful estimator proposed and analyzed by Catoni [11].

(A version of) Catoni's estimator may be defined as follows. Introduce the non-decreasing differentiable truncation function

$$ \phi(x) = -\mathbf{1}_{\{x<0\}} \log\left(1 - x + \frac{x^2}{2}\right) + \mathbf{1}_{\{x\ge 0\}} \log\left(1 + x + \frac{x^2}{2}\right). \qquad (3) $$

To estimate $m_f = \mathbb{E} f(X)$ for some $f\in\mathcal{F}$, define, for all $\mu\in\mathbb{R}$,

$$ \widehat r_f(\mu) = \frac{1}{n\alpha}\sum_{i=1}^n \phi\big(\alpha(f(X_i)-\mu)\big), $$

where $\alpha>0$ is a parameter of the estimator to be specified below. Catoni's estimator of $m_f$ is defined as the unique value $\widehat\mu_f$ for which $\widehat r_f(\widehat\mu_f)=0$. (Uniqueness is ensured by the monotonicity of $\mu\mapsto\widehat r_f(\mu)$.) Catoni proves that for any fixed $f\in\mathcal{F}$ and $\delta\in[0,1]$ such that $n > 2\log(1/\delta)$, under the only assumption that $\mathrm{Var}(f(X)) \le v$, the estimator above with

$$ \alpha = \sqrt{\frac{2\log(1/\delta)}{n\left( v + \frac{2v\log(1/\delta)}{n\left(1-(2/n)\log(1/\delta)\right)} \right)}} $$

satisfies, with probability at least $1-2\delta$,

$$ |m_f - \widehat\mu_f| \le \sqrt{\frac{2v\log(1/\delta)}{n\left(1-(2/n)\log(1/\delta)\right)}}. \qquad (4) $$

In other words, the deviations of the estimate exhibit a sub-Gaussian behavior. The price to pay is that the estimator depends both on the upper bound $v$ for the variance and on the prescribed confidence $\delta$ via the parameter $\alpha$. Catoni also shows that for any $n \ge 4(1+\log(1/\delta))$, if $\mathrm{Var}(f(X)) \le v$, the choice

$$ \alpha = \sqrt{\frac{2}{nv}} $$
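The estimator is easy to compute in practice: $\widehat r_f(\mu)$ is continuous and non-increasing in $\mu$, so its root can be found by bisection. The sketch below (an illustration; it uses the simpler confidence-independent choice $\alpha=\sqrt{2/(nv)}$ discussed above, and the function names are ours) implements this.

```python
import numpy as np

def phi(x):
    # Catoni's truncation function (3): non-decreasing, phi(x) ~ x near 0.
    return np.where(x >= 0,
                    np.log1p(x + 0.5 * x**2),
                    -np.log1p(-x + 0.5 * x**2))

def catoni_mean(xs, v):
    """Root in mu of r_hat(mu) = (1/(n*alpha)) * sum_i phi(alpha*(x_i - mu)).

    Uses alpha = sqrt(2/(n*v)), where v upper-bounds Var(X).  Bisection is
    valid because r_hat is continuous and non-increasing in mu, positive
    below min(xs) and negative above max(xs).
    """
    xs = np.asarray(xs, dtype=float)
    n = len(xs)
    alpha = np.sqrt(2.0 / (n * v))
    lo, hi = xs.min() - 1.0, xs.max() + 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if np.sum(phi(alpha * (xs - mid))) > 0:   # r_hat(mid) > 0: root is above
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For heavy-tailed data (e.g. Student's $t$ with few degrees of freedom) this estimator behaves like a smoothly truncated mean, which is what drives the deviation bounds (4) and (5).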

guarantees, with probability at least $1-2\delta$,

$$ |m_f - \widehat\mu_f| \le \big(1+\log(1/\delta)\big)\sqrt{\frac{v}{n}}. \qquad (5) $$

Even though we lose the sub-Gaussian tail behavior, the estimator is independent of the required confidence level.

Given such a powerful mean estimator, it is natural to propose an empirical risk minimizer that selects a function from the class $\mathcal{F}$ that minimizes Catoni's mean estimator. Formally, define

$$ \widehat f = \operatorname*{argmin}_{f\in\mathcal{F}} \widehat\mu_f, $$

where again, for the sake of simplicity, we assume that the minimizer exists. (Otherwise one may select an appropriate approximate minimizer and all arguments go through in a trivial way.) Once again, as a first step of understanding the excess risk $m_{\widehat f} - m^*$, we may use the simple bound

$$ m_{\widehat f} - m^* = \big( m_{\widehat f} - \widehat\mu_{\widehat f} \big) + \big( \widehat\mu_{\widehat f} - m^* \big) \le 2\sup_{f\in\mathcal{F}} |m_f - \widehat\mu_f|. $$

When $\mathcal{F}$ is a finite class of cardinality, say, $|\mathcal{F}| = N$, Catoni's bound may be combined, in a straightforward way, with the union bound. Indeed, if the estimators $\widehat\mu_f$ are defined with parameter

$$ \alpha = \sqrt{\frac{2\log(N/\delta)}{n\left( v + \frac{2v\log(N/\delta)}{n\left(1-(2/n)\log(N/\delta)\right)} \right)}}, $$

then, with probability at least $1-2\delta$,

$$ \sup_{f\in\mathcal{F}} |m_f - \widehat\mu_f| \le \sqrt{\frac{2v\log(N/\delta)}{n\left(1-(2/n)\log(N/\delta)\right)}}. $$

Note that this bound requires that $\sup_{f\in\mathcal{F}} \mathrm{Var}(f(X)) \le v$, that is, the variances are uniformly bounded by a known value $v$. Throughout the paper we work with this assumption. However, this bound does not take into account the structure of the class $\mathcal{F}$ and it is useless when $\mathcal{F}$ is an infinite class. Our strategy to obtain meaningful bounds is to use chaining arguments. However, the extension is nontrivial and the argument becomes more involved. The main results of the paper present performance bounds for empirical minimization of Catoni's estimator based on generic chaining.

Remark. (median-of-means estimator.) Catoni's estimator is not the only one with sub-Gaussian deviations for heavy-tailed distributions. Indeed, the median-of-means estimator, proposed by Nemirovsky and Yudin [23] (and also independently by Alon, Matias,

and Szegedy [2]), has similar performance guarantees as (4). This estimate is obtained by dividing the data into several small blocks, calculating the sample mean within each block, and then taking the median of these means. Hsu and Sabato [13] and Minsker [22] introduce multivariate generalizations of the median-of-means estimator and use them to define and analyze certain statistical learning procedures in the presence of heavy-tailed data. The sub-Gaussian behavior is achieved under various assumptions on the loss function. Such conditions can be avoided here. As an example, we detail applications of our results in Section 4 for three different classes of loss functions. An important advantage of the median-of-means estimate over Catoni's estimate is that the parameter of the estimate (i.e., the number of blocks) only depends on the confidence level $\delta$ but not on $v$, and therefore no prior upper bound on the variance $v$ is required to compute this estimate. Also, the median-of-means estimate is useful even when the variance is infinite and only a moment of order $1+\epsilon$ exists for some $\epsilon>0$ (see Bubeck, Cesa-Bianchi, and Lugosi [10]). Lerasle and Oliveira [15] consider empirical minimization of the median-of-means estimator and obtain interesting results in various statistical learning problems. However, establishing metric-entropy bounds for minimization of this mean estimate remains a challenge.

The rest of the paper is organized as follows. In Section 2 we state and discuss the main results of the paper. Section 3 is dedicated to the proofs. In Section 4 we describe some applications to regression under the absolute and squared losses and to $k$-means clustering. Finally, in Section 5 we present some simulation results, both for regression and $k$-means clustering. Some of the more technical arguments are relegated to the Appendix.

2 Main results

The bounds we establish for the excess risk depend on the geometric structure of the class $\mathcal{F}$ under different distances. The $L_2(P)$ distance is defined, for $f,f'\in\mathcal{F}$, by

$$ d(f,f') = \left( \mathbb{E}\big( f(X) - f'(X) \big)^2 \right)^{1/2} $$

and the $L_\infty$ distance by

$$ D(f,f') = \sup_{x\in\mathcal{X}} |f(x)-f'(x)|. $$

We also work with the (random) empirical quadratic distance

$$ d_X(f,f') = \left( \frac{1}{n}\sum_{i=1}^n \big( f(X_i) - f'(X_i) \big)^2 \right)^{1/2}. $$

Denote by $f^*$ a function with minimal expectation, $f^* = \operatorname*{argmin}_{f\in\mathcal{F}} m_f$.
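For comparison with Catoni's estimator, the median-of-means estimator discussed in the remark in Section 1 admits a very short implementation. The following sketch is our own illustration; the choice of the number of blocks, $b \approx \log(1/\delta)$, is one standard option and the only tuning parameter — no variance bound is needed.

```python
import numpy as np

def median_of_means(xs, delta=0.05):
    """Median-of-means: split the data into b ~ log(1/delta) blocks,
    average within each block, and return the median of the block means.
    The number of blocks depends only on the confidence level delta."""
    xs = np.asarray(xs, dtype=float)
    b = max(1, int(np.ceil(np.log(1.0 / delta))))   # one common block-count choice
    blocks = np.array_split(xs, b)
    return float(np.median([blk.mean() for blk in blocks]))
```

A single outlier can corrupt at most one block mean, so the median of the block means retains sub-Gaussian-type deviations under only a second-moment (or even $1+\epsilon$ moment) assumption.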

Next we present two results that bound the excess risk $m_{\widehat f} - m_{f^*}$ of the minimizer $\widehat f$ of Catoni's risk estimate in terms of metric properties of the class $\mathcal{F}$. The first result involves a combination of terms involving the $\gamma_2$ and $\gamma_1$ functionals under the metrics $d$ and $D$, while the second is in terms of quantiles of $\gamma_2$ under the empirical metric $d_X$.

Theorem 1. Let $\mathcal{F}$ be a class of non-negative functions defined on a set $\mathcal{X}$ and let $X,X_1,\dots,X_n$ be i.i.d. random variables taking values in $\mathcal{X}$. Assume that there exists $v>0$ such that $\sup_{f\in\mathcal{F}} \mathrm{Var}(f(X)) \le v$. Let $\delta\in(0,1/3)$. Suppose that $\widehat f$ is selected from $\mathcal{F}$ by minimizing Catoni's mean estimator with parameter $\alpha<1$. Then there exists a universal constant $L \le 384\log 2$ such that, with probability at least $1-3\delta$ and for $n$ large enough, the risk of $\widehat f$ satisfies

$$ m_{\widehat f} - m_{f^*} \le 6\left( \alpha v + \frac{2\log(\delta^{-1})}{n\alpha} \right) + 8L\sqrt{\frac{\log(\delta^{-1})}{n}}\,\gamma_2(\mathcal{F},d) + \frac{4L}{3n}\,\gamma_1(\mathcal{F},D). $$

Theorem 2. Assume the hypotheses of Theorem 1. Set $\Gamma_\delta$ such that $P\{\gamma_2(\mathcal{F},d_X) > \Gamma_\delta\} \le \delta/2$. Then there exists a universal constant $K$ such that, with probability at least $1-3\delta$ and for $n$ large enough, the risk of $\widehat f$ satisfies

$$ m_{\widehat f} - m_{f^*} \le 6\left( \alpha v + \frac{2\log(\delta^{-1})}{n\alpha} \right) + K\Gamma_\delta\sqrt{\frac{\log(2/\delta)}{n}}. $$

In both theorems above, the choice of $\alpha$ only influences the term $\alpha v + 2\log(\delta^{-1})/(n\alpha)$. By taking $\alpha = \sqrt{2\log(\delta^{-1})/(nv)}$, this term equals $2\sqrt{2v\log(\delta^{-1})/n}$. This choice has the disadvantage that the estimator depends on the confidence level. By taking $\alpha = \sqrt{2/(nv)}$, one obtains the term $\sqrt{2v/n}\,(1+\log(\delta^{-1}))$.

Observe that the main term in the second part of the bound of Theorem 1 is of the order $L\sqrt{\frac{\log(1/\delta)}{n}}\,\gamma_2(\mathcal{F},d)$, which is comparable to the bound (2) obtained under the strong condition that the $f(X)$ are uniformly bounded. All other terms are of smaller order. Note that this part of the bound depends on the weak (distribution-dependent) $L_2(P)$ metric $d$. The quantity $\gamma_1(\mathcal{F},D) \ge \gamma_2(\mathcal{F},d)$ also enters the bound of Theorem 1, though only multiplied by $1/n$. The presence of this term requires that $\mathcal{F}$ be bounded in the $L_\infty$ distance $D$, which limits the usefulness

of the bound. In Section 4 we illustrate the bounds on two applications, to regression and to $k$-means clustering. In these applications, in spite of the presence of heavy tails, the covering numbers under the distance $D$ may be bounded in a meaningful way. Note that no such bound can hold for ordinary empirical risk minimization that minimizes the usual empirical means $(1/n)\sum_{i=1}^n f(X_i)$, because of the poor performance of empirical averages in the presence of heavy tails.

The main merit of the bound of Theorem 2 is that it does not require that the class $\mathcal{F}$ have a finite diameter under the supremum norm. Instead, the quantiles of $\gamma_2(\mathcal{F},d_X)$ enter the picture. In Section 4 we show, through the example of $L_2$ regression, how these quantiles may be estimated.

3 Proofs

The proofs of Theorems 1 and 2 are based on showing that the excess risk can be bounded as soon as the supremum of the empirical process $\{\overline X_f(\mu) : f\in\mathcal{F}\}$ is bounded for any fixed $\mu\in\mathbb{R}$, where for any $f\in\mathcal{F}$ and $\mu\in\mathbb{R}$ we define $\overline X_f(\mu) = \bar r_f(\mu) - \widehat r_f(\mu)$ with

$$ \bar r_f(\mu) = \frac{1}{\alpha}\,\mathbb{E}\big[\phi(\alpha(f(X)-\mu))\big] \qquad\text{and}\qquad \widehat r_f(\mu) = \frac{1}{n\alpha}\sum_{i=1}^n \phi\big(\alpha(f(X_i)-\mu)\big). $$

The two theorems differ in the way the supremum of this empirical process is bounded. Note first that, by the definition of Catoni's estimator, $\widehat\mu_f \ge \inf_i f(X_i)$; in particular, $\widehat\mu_f \ge 0$ for all $f\in\mathcal{F}$. Let $A_\alpha(\delta) = \alpha v + 2\log(\delta^{-1})/(n\alpha)$. Once again, we may assume, essentially without loss of generality, that the minimum exists. In case of multiple minimizers we may choose one arbitrarily. The main result in [11] states that for any $\delta>0$ such that $\alpha^2 v + 2\log(\delta^{-1})/n \le 1$, with probability at least $1-2\delta$,

$$ |\widehat\mu_f - m_f| \le A_\alpha(\delta). \qquad (6) $$

3.1 A deterministic version of $\widehat\mu_f$

We begin with a variant of the argument of Catoni [11]. It involves a deterministic version $\bar\mu_f$ of the estimator, defined, for each $f\in\mathcal{F}$, as the unique solution of the equation $\bar r_f(\mu) = 0$. In Lemma 4 below we show that $\bar\mu_f$ lies in a small (deterministic) interval centered at $m_f$. First we recall a fact from [11] in the next proposition. For any $f\in\mathcal{F}$, $\mu\in\mathbb{R}$, and

$\varepsilon>0$, define

$$ B_f^+(\mu,\varepsilon) = (m_f-\mu) + \frac{\alpha}{2}(m_f-\mu)^2 + \frac{\alpha}{2}v + \varepsilon, \qquad B_f^-(\mu,\varepsilon) = (m_f-\mu) - \frac{\alpha}{2}(m_f-\mu)^2 - \frac{\alpha}{2}v - \varepsilon, $$

and let

$$ \mu_f^+(\varepsilon) = m_f + \alpha v + 2\varepsilon, \qquad \mu_f^-(\varepsilon) = m_f - \alpha v - 2\varepsilon. $$

As a function of $\mu$, $B_f^+(\mu,\varepsilon)$ is a quadratic polynomial such that $\mu_f^+(\varepsilon)$ is an upper bound of the smallest root of $B_f^+(\mu,\varepsilon)$. Similarly, $\mu_f^-(\varepsilon)$ is a lower bound of the largest root of $B_f^-(\mu,\varepsilon)$. (Implicitly we assumed that these roots always exist. This is not always the case, but a simple condition on $\alpha$ guarantees that these roots exist. In particular, $1-\alpha^2 v - 2\alpha\varepsilon \ge 0$ guarantees that $B_f^+(\mu,\varepsilon)=0$ and $B_f^-(\mu,\varepsilon)=0$ have at least one solution. This condition will always be satisfied by our choice of $\varepsilon$ and $\alpha$.) In our notation, Proposition 2.2 in [11] is equivalent to the following.

Proposition 3. Let $\delta>0$ and $\mu\in\mathbb{R}$. For any $f\in\mathcal{F}$, the events

$$ \Omega_f^-(\mu,\delta) = \left\{ B_f^-\Big(\mu, \tfrac{\log\delta^{-1}}{n\alpha}\Big) \le \widehat r_f(\mu) \right\} \qquad\text{and}\qquad \Omega_f^+(\mu,\delta) = \left\{ \widehat r_f(\mu) \le B_f^+\Big(\mu, \tfrac{\log\delta^{-1}}{n\alpha}\Big) \right\} $$

both hold with probability at least $1-\delta$.

Let $\varepsilon = \frac{\log\delta^{-1}}{n\alpha}$ and define $\Omega_f(\delta) \stackrel{\mathrm{def}}{=} \Omega_f^-(\mu_f^-(\varepsilon),\delta) \cap \Omega_f^+(\mu_f^+(\varepsilon),\delta)$. If $\alpha^2 v + \frac{2\log\delta^{-1}}{n} \le 1$, then (6) holds on the event $\Omega_f(\delta)$. (Just replace $\varepsilon$ by $\frac{\log\delta^{-1}}{n\alpha}$ in the expressions of $\mu_f^+(\varepsilon)$ and $\mu_f^-(\varepsilon)$.) Since $\widehat\mu_f$ is the unique zero of $\widehat r_f(\mu)$, it is squeezed into the interval $[\mu_f^-(\varepsilon), \mu_f^+(\varepsilon)]$, centered at $m_f$ and of size $2A_\alpha(\delta)$. Note that $P\{\Omega_f(\delta)\} \ge 1-2\delta$.

Still following ideas of [11], the next lemma bounds $\bar r_f(\mu)$ by the quadratic polynomials $B^+$ and $B^-$. The lemma will help us compare the zero of $\bar r_f(\mu)$ to the zeros of these quadratic functions.

Lemma 4. For any fixed $f\in\mathcal{F}$ and $\mu\in\mathbb{R}$,

$$ B_f^-(\mu,0) \le \bar r_f(\mu) \le B_f^+(\mu,0), \qquad (7) $$

and therefore $m_f - \alpha v \le \bar\mu_f \le m_f + \alpha v$. In particular, $B_{\widehat f}^-(\mu,0) \le \bar r_{\widehat f}(\mu) \le B_{\widehat f}^+(\mu,0)$.

For any $\mu$ such that $\bar r_{\widehat f}(\mu) \le \varepsilon$, if $1-\alpha^2 v - 2\alpha\varepsilon \ge 0$, then

$$ m_{\widehat f} \le \mu + \alpha v + 2\varepsilon. \qquad (8) $$

Proof. Writing $Y$ for $\alpha(f(X)-\mu)$ and using the fact that $\phi(x) \le \log(1+x+x^2/2)$ for all $x\in\mathbb{R}$,

$$ \exp\big(\alpha\bar r_f(\mu)\big) \le \exp\left( \mathbb{E}\Big[\log\Big(1+Y+\frac{Y^2}{2}\Big)\Big] \right) \le \mathbb{E}\left[ 1+Y+\frac{Y^2}{2} \right] \le 1 + \alpha(m_f-\mu) + \frac{\alpha^2}{2}\big[ v + (m_f-\mu)^2 \big] \le \exp\big( \alpha B_f^+(\mu,0) \big). $$

Thus, $\bar r_f(\mu) - B_f^+(\mu,0) \le 0$. Since this last inequality is true for any $f$, $\sup_f \big( \bar r_f(\mu) - B_f^+(\mu,0) \big) \le 0$ and the second inequality of (7) is proved. The other part can be treated with the same argument.

If $\bar r_{\widehat f}(\mu) \le \varepsilon$ then $B_{\widehat f}^-(\mu,0) \le \varepsilon$, which is equivalent to $B_{\widehat f}^-(\mu,\varepsilon) \le 0$. If $1-\alpha^2 v - 2\alpha\varepsilon \ge 0$ then a solution of $B_{\widehat f}^-(\mu,\varepsilon)=0$ exists, and since $\bar r_{\widehat f}(\mu)$ is a non-increasing function, $\mu$ is above the largest of these two solutions. This implies $\mu_{\widehat f}^-(\varepsilon) \le \mu$, which gives inequality (8).

The last inequality (8) is the key tool to ensure that the risk $m_{\widehat f}$ of the minimizer $\widehat f$ can be upper bounded as soon as $\bar r_{\widehat f}$ is. It remains to find the smallest $\mu$ and $\varepsilon$ such that $\bar r_f(\mu)$ is bounded uniformly on $\mathcal{F}$.

3.2 Bounding the excess risk in terms of the supremum of an empirical process

The key to all proofs is that we link the excess risk to the supremum of the empirical process $\overline X_f(\mu) = \bar r_f(\mu) - \widehat r_f(\mu)$ as $f$ ranges through $\mathcal{F}$, for a suitably chosen value of $\mu$. For fixed $\mu\in\mathbb{R}$ and $\delta\in(0,1)$, define the $1-\delta$ quantile of $\sup_{f\in\mathcal{F}} |\overline X_f(\mu) - \overline X_{f^*}(\mu)|$ by $Q(\mu,\delta)$, that is, the infimum of all positive numbers $Q(\mu,\delta)$ such that

$$ P\left\{ \sup_{f\in\mathcal{F}} \big| \overline X_f(\mu) - \overline X_{f^*}(\mu) \big| \le Q(\mu,\delta) \right\} \ge 1-\delta. $$

First we need a few simple facts, summarized in the next lemma.

Lemma 5. Let $\mu_0 = m_{f^*} + A_\alpha(\delta)$. Then on the event $\Omega_{f^*}(\delta)$, the following inequalities hold:

1. $\widehat r_{\widehat f}(\mu_0) \le 0$;
2. $\bar r_{f^*}(\mu_0) \le 0$;
3. $|\bar r_{f^*}(\mu_0)| \le 2A_\alpha(\delta)$.

Proof. We prove each inequality separately.

1. First note that on $\Omega_{f^*}(\delta)$, equation (6) holds and we have $\widehat\mu_{f^*} \le \mu_0$; by definition, $\widehat\mu_{\widehat f} \le \widehat\mu_{f^*} \le \mu_0$. Since $\widehat r_{\widehat f}$ is a non-increasing function of $\mu$, $\widehat r_{\widehat f}(\mu_0) \le \widehat r_{\widehat f}(\widehat\mu_{\widehat f}) = 0$.

2. By (7), $\bar\mu_{f^*} \le m_{f^*} + \alpha v \le m_{f^*} + \alpha v + \frac{2\log(\delta^{-1})}{n\alpha} = \mu_0$. Since $\bar r_{f^*}$ is a non-increasing function, $\bar r_{f^*}(\mu_0) \le \bar r_{f^*}(\bar\mu_{f^*}) = 0$.

3. $\bar r_{f^*}$ is a 1-Lipschitz function, and therefore

$$ |\bar r_{f^*}(\mu_0)| = |\bar r_{f^*}(\bar\mu_{f^*}) - \bar r_{f^*}(\mu_0)| \le |\bar\mu_{f^*} - \mu_0| \le |\bar\mu_{f^*} - m_{f^*}| + |m_{f^*} - \mu_0| \le 2A_\alpha(\delta), $$

which gives $|\bar r_{f^*}(\mu_0)| \le 2A_\alpha(\delta)$.

We will use Lemma 4 with the value $\mu_0$ introduced in Lemma 5. With the notation introduced above, we see that, with probability at least $1-\delta$,

$$ \bar r_{\widehat f}(\mu_0) = \widehat r_{\widehat f}(\mu_0) + \big( \bar r_{\widehat f}(\mu_0) - \widehat r_{\widehat f}(\mu_0) \big) \le \widehat r_{\widehat f}(\mu_0) + \big( \bar r_{f^*}(\mu_0) - \widehat r_{f^*}(\mu_0) \big) + \sup_{f\in\mathcal{F}} \big| \big( \bar r_f(\mu_0) - \widehat r_f(\mu_0) \big) - \big( \bar r_{f^*}(\mu_0) - \widehat r_{f^*}(\mu_0) \big) \big| \le \widehat r_{\widehat f}(\mu_0) + \bar r_{f^*}(\mu_0) - \widehat r_{f^*}(\mu_0) + Q(\mu_0,\delta). $$

This inequality, together with Lemma 5, implies that, with probability at least $1-3\delta$,

$$ \bar r_{\widehat f}(\mu_0) \le 2A_\alpha(\delta) + Q(\mu_0,\delta). $$

Now, using Lemma 4 under the condition $1-\alpha^2 v - 4\alpha A_\alpha(\delta) - 2\alpha Q(\mu_0,\delta) \ge 0$, we have

$$ m_{\widehat f} - m_{f^*} \le \alpha v + 5A_\alpha(\delta) + 2Q(\mu_0,\delta) \le 6\left( \alpha v + \frac{2\log(\delta^{-1})}{n\alpha} \right) + 2Q(\mu_0,\delta), \qquad (9) $$

with probability at least $1-3\delta$. The condition $1-\alpha^2 v - 4\alpha A_\alpha(\delta) - 2\alpha Q(\mu_0,\delta) \ge 0$ is implied (since $\alpha \le 1$) by

$$ 6\left( \alpha v + \frac{2\log(\delta^{-1})}{n\alpha} \right) + 2Q(\mu_0,\delta) \le 1, $$

which will be seen to hold for sufficiently large $n$.

3.3 Bounding the supremum of the empirical process

Theorems 1 and 2 both follow from (9) by two different ways of bounding the quantile $Q(\mu_0,\delta)$ of $\sup_{f\in\mathcal{F}} |\overline X_f(\mu) - \overline X_{f^*}(\mu)|$. Here we present these two inequalities. Both of them use basic results of generic chaining; see Talagrand [27]. Theorem 1 follows from (9) and the next inequality.

Proposition 6. Let $\mu\in\mathbb{R}$ and $\alpha>0$. There exists a universal constant $L \le 384\log 2$ such that for any $\delta\in(0,1)$,

$$ Q(\mu,\delta) \le 4L\sqrt{\frac{\log(\delta^{-1})}{n}}\,\gamma_2(\mathcal{F},d) + \frac{2L}{3n}\,\gamma_1(\mathcal{F},D). $$

The proof is an immediate consequence of Theorem 13 and (14) in the Appendix and the following lemma.

Lemma 7. For any $\mu\in\mathbb{R}$, $\alpha>0$, $f,f'\in\mathcal{F}$, and $t>0$,

$$ P\left\{ \big| \overline X_f(\mu) - \overline X_{f'}(\mu) \big| > t \right\} \le 2\exp\left( -\frac{n t^2}{24\, d(f,f')^2 + \frac{2}{3} D(f,f')\, t} \right), $$

where the distances $d$, $D$ are defined at the beginning of Section 2.

Proof. Observe that $\overline X_f - \overline X_{f'}$ is the sum of the independent zero-mean random variables

$$ C_i(f,f') = \frac{1}{n\alpha}\phi\big(\alpha(f(X_i)-\mu)\big) - \frac{1}{n\alpha}\phi\big(\alpha(f'(X_i)-\mu)\big) - \frac{1}{n\alpha}\mathbb{E}\big[\phi(\alpha(f(X)-\mu))\big] + \frac{1}{n\alpha}\mathbb{E}\big[\phi(\alpha(f'(X)-\mu))\big]. $$

Note that, since the truncation function $\phi$ is 1-Lipschitz, we have $|C_i(f,f')| \le \frac{2}{n} D(f,f')$. Also,

$$ \mathbb{E}\big[ C_i(f,f')^2 \big] \le \frac{4}{n^2}\,\mathbb{E}\left[ \big( (f(X_i)-\mu) - (f'(X_i)-\mu) \big)^2 \right] = \frac{4}{n^2}\, d(f,f')^2. $$

The lemma follows from Bernstein's inequality (see, e.g., [9, Theorem 2.10]).

Similarly, Theorem 2 is implied by (9) and the following. Recall the notation of Theorem 2.

Theorem 8. Let $\mu\in\mathbb{R}$, $\alpha>0$, and $\delta\in(0,1)$. There exists a universal constant $K$ such that

$$ Q(\mu,\delta) \le K\Gamma_\delta\sqrt{\frac{\log(2/\delta)}{n}}. $$

Proof. The proof is based on a standard symmetrization argument. Let $(X_1',\dots,X_n')$ be an independent copy of $(X_1,\dots,X_n)$ and define

$$ Z_i(f) = \frac{1}{\alpha}\phi\big(\alpha(f(X_i)-\mu)\big) - \frac{1}{\alpha}\phi\big(\alpha(f(X_i')-\mu)\big). $$

Introduce also independent Rademacher random variables $(\varepsilon_1,\dots,\varepsilon_n)$. For any $f\in\mathcal{F}$, denote $Z(f) = \frac{1}{n}\sum_{i=1}^n \varepsilon_i Z_i(f)$. Then by Hoeffding's inequality, for all $f,g\in\mathcal{F}$ and for every $t>0$,

$$ P_{(\varepsilon_1,\dots,\varepsilon_n)}\big\{ |Z(f)-Z(g)| > t \big\} \le 2\exp\left( -\frac{n t^2}{2\, d_{X,X'}(f,g)^2} \right), \qquad (10) $$

where $P_{(\varepsilon_1,\dots,\varepsilon_n)}$ denotes probability with respect to the Rademacher variables only (i.e., conditional on the $X_i$ and $X_i'$) and

$$ d_{X,X'}(f,g) = \left( \frac{1}{n}\sum_{i=1}^n \big( Z_i(f) - Z_i(g) \big)^2 \right)^{1/2} $$

is a random distance. Denote by $\widehat r_f'(\mu)$ the independent copy of $\widehat r_f(\mu)$ that depends only on the random vector $(X_1',\dots,X_n')$. Let $\lambda>0$ be a parameter that we optimize later. Then

$$ P\left\{ \sup_{f\in\mathcal{F}} \big| \overline X_f - \overline X_{f^*} \big| \ge t \right\} \le \mathbb{E}\left[ e^{\lambda \sup_{f\in\mathcal{F}} \left| (\widehat r_f(\mu)-\mathbb{E}[\widehat r_f(\mu)]) - (\widehat r_{f^*}(\mu)-\mathbb{E}[\widehat r_{f^*}(\mu)]) \right|} \right] e^{-\lambda t} \le \mathbb{E}_{X,X'}\left[ e^{\lambda \sup_{f\in\mathcal{F}} \left| (\widehat r_f(\mu)-\widehat r_f'(\mu)) - (\widehat r_{f^*}(\mu)-\widehat r_{f^*}'(\mu)) \right|} \right] e^{-\lambda t} = \mathbb{E}_{X,X'}\left[ \mathbb{E}_\varepsilon\left[ e^{\lambda \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i [Z_i(f)-Z_i(f^*)] \right|} \right] \right] e^{-\lambda t}. $$

Using (15) in the Appendix with the distance $d_{X,X'}$ and (10), we get

$$ P\left\{ \sup_{f\in\mathcal{F}} \big| \overline X_f - \overline X_{f^*} \big| \ge t \right\} \le \mathbb{E}_{X,X'}\left[ e^{\frac{2\lambda^2 L^2}{n}\gamma_2(\mathcal{F},d_{X,X'})^2} \right] e^{-\lambda t}. $$

A few more calculations reduce the random entropy on the couple $(X,X')$ to

the random entropy on $X$ only. Since $x \mapsto \frac{1}{\alpha}\phi(\alpha x)$ is Lipschitz with constant 1,

$$ d_{X,X'}(f,g) = \left( \frac{1}{n}\sum_{i=1}^n \left( \frac{1}{\alpha}\Big[ \phi(\alpha(f(X_i)-\mu)) - \phi(\alpha(f(X_i')-\mu)) - \phi(\alpha(g(X_i)-\mu)) + \phi(\alpha(g(X_i')-\mu)) \Big] \right)^2 \right)^{1/2} \le \left( \frac{2}{n}\sum_{i=1}^n \big( f(X_i)-g(X_i) \big)^2 \right)^{1/2} + \left( \frac{2}{n}\sum_{i=1}^n \big( f(X_i')-g(X_i') \big)^2 \right)^{1/2}. $$

This implies

$$ \gamma_2(\mathcal{F}, d_{X,X'}) \le \sqrt{2}\big( \gamma_2(\mathcal{F},d_X) + \gamma_2(\mathcal{F},d_{X'}) \big), $$

and therefore

$$ \mathbb{E}_{X,X'}\left[ e^{\frac{2\lambda^2 L^2}{n}\gamma_2(\mathcal{F},d_{X,X'})^2} \right] \le \mathbb{E}_X\left[ e^{\frac{16\lambda^2 L^2}{n}\gamma_2(\mathcal{F},d_X)^2} \right]. $$

Hence,

$$ P\left\{ \sup_{f\in\mathcal{F}} \big| \overline X_f - \overline X_{f^*} \big| \ge t \right\} \le \mathbb{E}_X\left[ e^{\frac{16\lambda^2 L^2}{n}\gamma_2(\mathcal{F},d_X)^2} \right] e^{-\lambda t}. \qquad (11) $$

Recall that, by definition, $\Gamma_\delta$ is such that $P\{\gamma_2(\mathcal{F},d_X) > \Gamma_\delta\} \le \delta/2$. Thus,

$$ P\left\{ \sup_{f\in\mathcal{F}} \big| \overline X_f - \overline X_{f^*} \big| \ge t \right\} \le \frac{\delta}{2} + e^{\frac{16\lambda^2 L^2 \Gamma_\delta^2}{n} - \lambda t}. $$

Optimization in $\lambda$, with $t = 8L\Gamma_\delta\sqrt{\frac{\log(2/\delta)}{n}}$, gives

$$ P\left\{ \sup_{f\in\mathcal{F}} \big| \overline X_f - \overline X_{f^*} \big| \ge t \right\} \le \delta, $$

as desired.

4 Applications

In this section we describe two applications of Theorems 1 and 2 to simple statistical learning problems. The first is a regression estimation problem, in which we distinguish between $L_1$ and $L_2$ risks, and the second is $k$-means clustering.

4.1 Empirical risk minimization for regression

4.1.1 $L_1$ regression

Let $(Z_1,Y_1),\dots,(Z_n,Y_n)$ be independent, identically distributed pairs taking values in $\mathcal{Z}\times\mathbb{R}$, where $\mathcal{Z}$ is a bounded subset of (say) $\mathbb{R}^d$. Suppose $\mathcal{G}$ is a class of functions $\mathcal{Z}\to\mathbb{R}$ bounded in the $L_\infty$ norm, that is,

$$ \Delta \stackrel{\mathrm{def}}{=} \sup_{g,g'\in\mathcal{G}} \sup_{z\in\mathcal{Z}} |g(z)-g'(z)| < \infty. $$

First we consider the setup where the risk of each $g\in\mathcal{G}$ is defined by the $L_1$ loss $R(g) = \mathbb{E}|g(Z)-Y|$, where the pair $(Z,Y)$ has the same distribution as the $(Z_i,Y_i)$ and is independent of them. Let $g^* = \operatorname*{argmin}_{g\in\mathcal{G}} R(g)$ be a minimizer of the risk (which, without loss of generality, is assumed to exist). The statistical learning problem we consider here consists of choosing a function $\widehat g$ from the class $\mathcal{G}$ that has a risk $R(\widehat g)$ not much larger than $R(g^*)$. The standard procedure is to pick $\widehat g$ by minimizing the empirical risk $(1/n)\sum_{i=1}^n |g(Z_i)-Y_i|$ over $g\in\mathcal{G}$. However, if the response variable $Y$ is unbounded and may have a heavy tail, ordinary empirical risk minimization may fail to provide a good predictor of $Y$, as the empirical risk is an unreliable estimate of the true risk.

Here we propose choosing $\widehat g$ by minimizing Catoni's estimate. To this end, we only need to assume that the second moment of $Y$ is bounded by a known constant. More precisely, assume that $\mathbb{E} Y^2 \le \sigma^2$ for some $\sigma>0$. Then

$$ \sup_{g\in\mathcal{G}} \mathrm{Var}\big( |g(Z)-Y| \big) \le \sigma^2 + \sup_{g\in\mathcal{G}} \sup_{z\in\mathcal{Z}} g(z)^2 \stackrel{\mathrm{def}}{=} v $$

is a known and finite constant. Now for all $g\in\mathcal{G}$ and $\mu\in\mathbb{R}$, define

$$ \widehat r_g(\mu) = \frac{1}{n\alpha}\sum_{i=1}^n \phi\big( \alpha( |g(Z_i)-Y_i| - \mu ) \big), $$

where $\phi$ is the truncation function defined in (3). Define $\widehat R(g)$ as the unique value for which $\widehat r_g(\widehat R(g)) = 0$. The empirical risk minimizer based on Catoni's risk estimate is then

$$ \widehat g = \operatorname*{argmin}_{g\in\mathcal{G}} \widehat R(g). $$

By Theorem 1, the performance of $\widehat g$ may be bounded in terms of covering numbers of the class of functions $\mathcal{F} = \{ f(z,y) = |g(z)-y| : g\in\mathcal{G} \}$ based on the distance

$$ D(f,f') = \sup_{z\in\mathcal{Z},\, y\in\mathbb{R}} \Big| |g(z)-y| - |g'(z)-y| \Big| \le \sup_{z\in\mathcal{Z}} |g(z)-g'(z)|. $$

Thus, the covering numbers of $\mathcal{F}$ under the distance $D$ may be bounded in terms of the covering numbers of $\mathcal{G}$ under the $L_\infty$ distance. We obtain the following.
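The procedure just described can be sketched in a few lines for a toy one-parameter class $g_\theta(z) = \theta z$: compute, for each candidate $\theta$ on a grid, Catoni's estimate $\widehat R(g_\theta)$ of the expected absolute loss (by bisection on the root of $\widehat r_{g_\theta}$), and keep the minimizer. Everything below the function definitions (the data-generating model, the variance bound `v`, and the grid) is our own illustration, not from the paper.

```python
import numpy as np

def phi(x):
    # Truncation function (3)
    return np.where(x >= 0, np.log1p(x + 0.5 * x**2), -np.log1p(-x + 0.5 * x**2))

def catoni_risk(losses, v):
    """Catoni estimate R_hat of E[loss]: root in mu of
    sum_i phi(alpha * (loss_i - mu)) = 0, with alpha = sqrt(2/(n*v))."""
    losses = np.asarray(losses, dtype=float)
    n = len(losses)
    alpha = np.sqrt(2.0 / (n * v))
    lo, hi = losses.min() - 1.0, losses.max() + 1.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if np.sum(phi(alpha * (losses - mid))) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative L1 regression over the class g_theta(z) = theta * z:
rng = np.random.default_rng(1)
n = 400
z = rng.uniform(-1.0, 1.0, size=n)
y = 1.5 * z + rng.standard_t(3, size=n)        # heavy-tailed (Student t) noise
v = 10.0                                        # assumed variance upper bound (hypothetical)
thetas = np.linspace(-3.0, 3.0, 61)
risks = [catoni_risk(np.abs(y - t * z), v) for t in thetas]
theta_hat = thetas[int(np.argmin(risks))]
```

The grid search stands in for whatever optimization over $\mathcal{G}$ is feasible in a given application; the point is only that the objective is Catoni's estimate of the $L_1$ risk rather than the empirical average.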

Corollary 9. Consider the setup described above. Let $\alpha>0$, $\delta\in(0,1)$, and $A_\alpha(\delta) = \alpha v + \frac{2\log\delta^{-1}}{n\alpha}$. There exists a universal constant $C$ such that, with probability at least $1-3\delta$,

$$ R(\widehat g) - R(g^*) \le 6A_\alpha(\delta) + C\sqrt{\frac{\log(1/\delta)}{n}} \int_0^\Delta \sqrt{\log N_\infty(\mathcal{G},\epsilon)}\, d\epsilon + O\left(\frac{1}{n}\right). $$

Note that the bound essentially has the same form as (1), but to apply (1) it is crucial that the response variable $Y$ be bounded, or at least have sub-Gaussian tails. We get this under the weak assumption that $Y$ has a bounded second moment (with a known upper bound). The price we pay is that covering numbers under the distance $d_X$ are now replaced by covering numbers under the supremum norm.

4.1.2 $L_2$ regression

Here we consider the same setup as in Section 4.1.1, but now the risk is measured by the $L_2$ loss: the risk of each $g\in\mathcal{G}$ is defined by $R(g) = \mathbb{E}(g(Z)-Y)^2$. Note that Theorem 1 is useless here, as the difference $|R(g)-R(g')|$ is no longer bounded by the $L_\infty$ distance of $g$ and $g'$ and the covering numbers of $\mathcal{F}$ under the metric $D$ are infinite. However, Theorem 2 gives meaningful bounds. Let $g^* = \operatorname*{argmin}_{g\in\mathcal{G}} R(g)$ and again we choose $\widehat g$ by minimizing Catoni's estimate. Here we need to assume that $\mathbb{E} Y^4 \le \sigma^2$ for some $\sigma>0$. Then

$$ \sup_{g\in\mathcal{G}} \mathrm{Var}\big( (g(Z)-Y)^2 \big) \le \sigma^2 + \sup_{g\in\mathcal{G}} \sup_{z\in\mathcal{Z}} g(z)^4 \stackrel{\mathrm{def}}{=} v $$

is a known and finite constant. By Theorem 2, the performance of $\widehat g$ may be bounded in terms of covering numbers of the class of functions $\mathcal{F} = \{ f(z,y) = (g(z)-y)^2 : g\in\mathcal{G} \}$ based on the distance

$$ d_X(f,f') = \left( \frac{1}{n}\sum_{i=1}^n \big( (g(Z_i)-Y_i)^2 - (g'(Z_i)-Y_i)^2 \big)^2 \right)^{1/2}. $$

Note that, writing $B = \sup_{g\in\mathcal{G}} \sup_{z\in\mathcal{Z}} |g(z)|$,

$$ \big| (g(Z_i)-Y_i)^2 - (g'(Z_i)-Y_i)^2 \big| = \big| g(Z_i) - g'(Z_i) \big| \cdot \big| g(Z_i) + g'(Z_i) - 2Y_i \big| \le 2\, d_\infty(g,g')\big( |Y_i| + B \big), $$

and therefore

$$ d_X(f,f') \le 2\, d_\infty(g,g') \left( \frac{1}{n}\sum_{i=1}^n \big( |Y_i| + B \big)^2 \right)^{1/2} \le 2\sqrt{2}\, d_\infty(g,g') \sqrt{ \frac{1}{n}\sum_{i=1}^n Y_i^2 + B^2 }. $$

By Chebyshev's inequality,

$$ P\left\{ \frac{1}{n}\sum_{i=1}^n Y_i^2 > \mathbb{E}[Y^2] + t \right\} \le \frac{\mathrm{Var}(Y^2)}{n t^2} \le \frac{\sigma^2}{n t^2}, $$

and thus

$$ \frac{1}{n}\sum_{i=1}^n Y_i^2 > \mathbb{E}[Y^2] + \sqrt{\frac{2\sigma^2}{n\delta}} $$

occurs with probability at most $\delta/2$, and hence

$$ d_X(f,f') > 2\sqrt{2}\, d_\infty(g,g') \sqrt{ B^2 + \mathbb{E}[Y^2] + \sqrt{\frac{2\sigma^2}{n\delta}} } $$

occurs with probability bounded by $\delta/2$. Then Theorem 2 applies with

$$ \Gamma_\delta = 2\sqrt{2}\, \sqrt{ B^2 + \mathbb{E}[Y^2] + \sqrt{\frac{2\sigma^2}{n\delta}} }\; \gamma_2(\mathcal{G}, d_\infty). $$

Corollary 10. Consider the setup described above. Let $\alpha>0$, $\delta\in(0,1)$, and $A_\alpha(\delta) = \alpha v + \frac{2\log\delta^{-1}}{n\alpha}$. There exists a universal constant $C$ such that, with probability at least $1-3\delta$,

$$ R(\widehat g) - R(g^*) \le 6A_\alpha(\delta) + C\sqrt{\frac{\log(2/\delta)}{n}}\, \sqrt{ B^2 + \mathbb{E}[Y^2] + \sqrt{\frac{2\sigma^2}{n\delta}} } \int_0^\Delta \sqrt{\log N_\infty(\mathcal{G},\epsilon)}\, d\epsilon. $$

4.2 $k$-means clustering under heavy-tailed distributions

In $k$-means clustering, or vector quantization, one wishes to represent a distribution by a finite number of points. Formally, let $X$ be a random vector taking values in $\mathbb{R}^d$ and let $P$ denote the distribution of $X$. Let $k \ge 2$ be a positive integer that we fix for the rest of the section. A clustering scheme is given by a set of $k$ cluster centers $C = \{y_1,\dots,y_k\} \subset \mathbb{R}^d$ and a quantizer $q:\mathbb{R}^d\to C$. Given a distortion measure $\ell:\mathbb{R}^d\times\mathbb{R}^d\to[0,\infty)$, one wishes to find $C$ and $q$ such that the expected distortion

$$ D_k(P,q) = \mathbb{E}\,\ell(X, q(X)) $$

is as small as possible. The minimization problem is meaningful whenever $\mathbb{E}\,\ell(X,0) < \infty$, which we assume throughout. Typical distortion measures are of the form $\ell(x,y) = \|x-y\|^\alpha$, where $\|\cdot\|$ is a norm on $\mathbb{R}^d$ and $\alpha>0$ (typically $\alpha$ equals 1 or 2). Here, for concreteness and simplicity, we assume that $\ell$ is the Euclidean distance $\ell(x,y) = \|x-y\|$, though the results may be generalized in a straightforward manner to other norms. In a way equivalent to the arguments of Section 4.1.2, the results may be generalized to the case of the quadratic distortion $\ell(x,y) = \|x-y\|^2$. In order to avoid repetition of arguments, the details are omitted.

It is not difficult to see that if $\mathbb{E}\|X\| < \infty$, then there exists a (not necessarily unique) quantizer $q^*$ that is optimal, that is, $q^*$ is such that for all clustering schemes $q$,

$$ D_k(P,q) \ge D_k(P,q^*) \stackrel{\mathrm{def}}{=} D_k^*(P). $$

It is also clear that $q^*$ is a nearest-neighbor quantizer, that is, $\|x - q^*(x)\| = \min_{y_i\in C} \|x-y_i\|$. Thus, nearest-neighbor quantizers are determined by their cluster centers $C = \{y_1,\dots,y_k\}$. In fact, for all quantizers with a particular set $C$ of cluster centers, the corresponding nearest-neighbor quantizer has minimal distortion, and therefore it suffices to restrict our attention to nearest-neighbor quantizers.

In the problem of empirical quantizer design, one is given an i.i.d. sample $X_1,\dots,X_n$ drawn from the distribution $P$ and one's aim is to find a quantizer $q_n$ whose distortion

$$ D_k(P,q_n) = \mathbb{E}\big[ \|X - q_n(X)\| \mid X_1,\dots,X_n \big] $$

is as close to $D_k^*(P)$ as possible. A natural strategy is to choose a quantizer (or, equivalently, a set $C$ of cluster centers) by minimizing the empirical distortion

$$ D_k(P_n,q) = \frac{1}{n}\sum_{i=1}^n \|X_i - q(X_i)\| = \frac{1}{n}\sum_{i=1}^n \min_{j=1,\dots,k} \|X_i - y_j\|, $$

where $P_n$ denotes the standard empirical distribution based on $X_1,\dots,X_n$. If $\mathbb{E}\|X\| < \infty$, then the empirically optimal quantizer asymptotically minimizes the distortion. More precisely, if $q_n^*$ denotes the empirically optimal quantizer (i.e., $q_n^* = \operatorname*{argmin}_q D_k(P_n,q)$), then

$$ \lim_{n\to\infty} D_k(P, q_n^*) = D_k^*(P) $$

with probability 1; see Pollard [24, 26] and Abaya and Wise [1] (see also Linder [17]). The rate of convergence of $D_k(P,q_n^*)$ to $D_k^*(P)$ has drawn considerable attention; see, e.g., Pollard [25], Bartlett, Linder, and Lugosi [5], Antos [3], Antos, Györfi, and György [4], Biau, Devroye, and Lugosi [7], Maurer and Pontil [20], and Levrard [16]. Such rates are typically studied under the assumption that $X$ is almost surely bounded. Under such assumptions one can show that

$$ \mathbb{E}\, D_k(P,q_n^*) - D_k^*(P) \le C(P,k,d)\, n^{-1/2}, $$

where the constant $C(P,k,d)$ depends on $\mathrm{ess\,sup}\,\|X\|$, $k$, and the dimension $d$. The value of the constant has mostly been investigated in the case of the quadratic loss $\ell(x,y) = \|x-y\|^2$, but most proofs may be modified for the case studied here.

However, little is known about the finite-sample performance of empirically designed quantizers under possibly heavy-tailed distributions. In fact, there is no hope to extend the

results cited above to distributions with finite second moment, simply because empirical averages are poor estimators of means under such general conditions. In the recent paper of Telgarsky and Dasgupta [28], bounds on the excess risk under conditions on higher moments have been developed. They prove a bound of $O(n^{-1/2+2/p})$ for the excess distortion, where $p$ is the number of moments of $\|X\|$ that are assumed to be finite.

Here we show that there exists an empirical quantizer $\widehat q$ whose excess distortion $D_k(P,\widehat q) - D_k^*(P)$ is of the order of $n^{-1/2}$ (with high probability) under the only assumption that $\mathbb{E}[\|X\|^2]$ is finite. This may be achieved by choosing a quantizer that minimizes Catoni's estimate of the distortion. The proposed empirical quantizer uses two parameters that depend on the (unknown) distribution of $X$. For simplicity, we assume that upper bounds for these two parameters are available. (Otherwise either one may try to estimate them or, as the sample size grows, use increasing values for these parameters. The details go beyond the scope of this paper.) One of these parameters is the second moment $\mathrm{Var}(\|X\|)$, for which we let $V$ be an upper bound. The other parameter $\rho>0$ is an upper bound for the norm of the possible cluster centers. The next lemma offers an estimate.

Lemma 11. (Linder [17].) Let $2 \le m \le k$ be the unique integer such that $D_k^* = \cdots = D_m^* < D_{m-1}^*$ and define $\varepsilon = (D_{m-1}^* - D_k^*)/2$. Let $(y_1,\dots,y_m)$ be a set of cluster centers such that the distortion of the corresponding quantizer is less than $D_m^* + \varepsilon$. Let $B_r = \{x : \|x\| \le r\}$ denote the closed ball of radius $r>0$ centered at the origin. If $\rho>0$ is such that

$$ P(B_{\rho/10}) > \frac{2\,\mathbb{E}\|X\|}{\rho} \qquad\text{and}\qquad P(B_{2\rho/5}) > 1 - \frac{\varepsilon^2}{4\,\mathbb{E}[\|X\|^2]}, $$

then for all $1 \le j \le k$, $\|y_j\| \le \rho$.

Now we are prepared to describe the proposed empirical quantizer. Let $\mathcal{C}_\rho$ be the set of all collections $C = \{y_1,\dots,y_k\} \in (\mathbb{R}^d)^k$ of cluster centers with $\|y_j\| \le \rho$ for all $j=1,\dots,k$. For each $C\in\mathcal{C}_\rho$, denote by $q_C$ the corresponding nearest-neighbor quantizer.

Now for all $C\in\mathcal{C}_\rho$, we may calculate Catoni's mean estimator of the distortion

$$ D(P,q_C) = \mathbb{E}\|X - q_C(X)\| = \mathbb{E} \min_{j=1,\dots,k} \|X - y_j\|, $$

defined as the unique value $\mu\in\mathbb{R}$ for which

$$ \frac{1}{n\alpha}\sum_{i=1}^n \phi\left( \alpha\left( \min_{j=1,\dots,k} \|X_i - y_j\| - \mu \right) \right) = 0, $$

where we use the parameter value $\alpha = \sqrt{2/(nkV)}$. Denote this estimator by $\widehat D(P,q_C)$ and let $\widehat q$ be any quantizer minimizing the estimated distortion. An easy compactness argument shows that such a minimizer exists. The main result of this section is the following bound for the distortion of the chosen quantizer.
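Catoni's distortion estimate for a fixed codebook is a direct instance of the mean estimator from Section 1 applied to the losses $\min_j \|X_i - y_j\|$. The sketch below is our own illustration; the variance-bound argument `var_bound` (playing the role of $2kV$) and the sample below it are hypothetical.

```python
import numpy as np

def phi(x):
    # Truncation function (3)
    return np.where(x >= 0, np.log1p(x + 0.5 * x**2), -np.log1p(-x + 0.5 * x**2))

def catoni_distortion(sample, centers, var_bound):
    """Catoni estimate of E[min_j ||X - y_j||] for the codebook `centers`.
    `var_bound` upper-bounds Var(min_j ||X - y_j||); alpha = sqrt(2/(n*var_bound))."""
    # Nearest-center distances: losses_i = min_j ||X_i - y_j||
    dists = np.min(np.linalg.norm(sample[:, None, :] - centers[None, :, :], axis=2), axis=1)
    n = len(dists)
    alpha = np.sqrt(2.0 / (n * var_bound))
    lo, hi = dists.min() - 1.0, dists.max() + 1.0
    for _ in range(80):                      # bisection on the monotone root equation
        mid = 0.5 * (lo + hi)
        if np.sum(phi(alpha * (dists - mid))) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy usage: two well-separated clusters; a matching codebook should score
# a smaller estimated distortion than a far-away one.
rng = np.random.default_rng(2)
sample = np.vstack([rng.normal(size=(150, 2)) + [3.0, 0.0],
                    rng.normal(size=(150, 2)) - [3.0, 0.0]])
good = np.array([[3.0, 0.0], [-3.0, 0.0]])
bad = np.array([[0.0, 10.0], [0.0, 12.0]])
d_good = catoni_distortion(sample, good, var_bound=8.0)
d_bad = catoni_distortion(sample, bad, var_bound=8.0)
```

Minimizing `catoni_distortion` over codebooks (e.g. by restarting a local search from several initializations) then plays the role of $\widehat q$ in the text.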

Theorem 12. Assume that $\mathrm{Var}(\|X\|) \le V < \infty$ and $n \ge d$. Then, with probability at least $1-\delta$,

$$ D(P,\widehat q) - D(P,q^*) \le C\left( \sqrt{\frac{\log(1/\delta)\, V k}{n}} + \sqrt{\frac{dk}{n}} \right) + O\left(\frac{1}{n}\right), $$

where the constant $C$ only depends on $\rho$.

Proof. The result follows from Theorem 1. All we need to check is that $\mathrm{Var}\big( \min_{j=1,\dots,k} \|X-y_j\| \big)$ is bounded by $2kV$ and to estimate the covering numbers of the class of functions

$$ \mathcal{F}_\rho = \left\{ f_C(x) = \min_{y\in C} \|x-y\| : C\in\mathcal{C}_\rho \right\}. $$

The variance bound follows simply from the fact that for all $C\in\mathcal{C}_\rho$,

$$ \mathrm{Var}\left( \min_{j=1,\dots,k} \|X-y_j\| \right) \le \sum_{i=1}^k \mathrm{Var}\big( \|X-y_i\| \big) \le k\cdot 2\,\mathrm{Var}(\|X\|) \le 2kV. $$

In order to use the bound of Theorem 1, we need to bound the covering numbers of the class $\mathcal{F}_\rho$ under both metrics $d$ and $D$. We begin with the metric $D(f_C,f_{C'}) = \sup_{x\in\mathbb{R}^d} |f_C(x)-f_{C'}(x)|$. ($B_z(\epsilon,d)$ refers to the ball under the metric $d$ of radius $\epsilon$ centered at $z$.) Let $Z$ be a subset of $B_\rho$ such that $\mathcal{B}_{B_\rho} := \{ B_z(\epsilon,d_2) : z\in Z \}$ is a covering of the set $B_\rho$ by balls of radius $\epsilon$ under the Euclidean norm. Let $C\in\mathcal{C}_\rho$ and associate to any $y_i\in C$ one of the centers $z_i\in Z$ such that $\|y_i - z_i\| \le \epsilon$. If there is more than one possible choice for $z_i$, we pick one of them arbitrarily. We denote by $q_{C'}$ the nearest-neighbor quantizer with codebook $C' = (z_i)_i$. Finally, let $S_i = q_{C'}^{-1}(z_i)$. Now clearly, for all $i$ and $x\in S_i$,

$$ f_C(x) - f_{C'}(x) = \min_{1\le j\le k} \|x-y_j\| - \min_{1\le j\le k} \|x-z_j\| = \min_{1\le j\le k} \|x-y_j\| - \|x-z_i\| \le \|x-y_i\| - \|x-z_i\| \le \epsilon, $$

and symmetrically for $f_{C'}(x) - f_C(x)$. Then $f_C \in B_{f_{C'}}(\epsilon,D)$ and $\mathcal{B}_{\mathcal{F}_\rho} := \{ B_{f_{C'}}(\epsilon,D) : C'\in Z^k \}$

is a covering of F_ρ. Since Z can be taken such that |Z| = N_{d_2}(B_ρ, ε), we end up with

N_d(F_ρ, ε) ≤ N_D(F_ρ, ε) ≤ N_{d_2}(B_ρ, ε)^k.

By standard estimates on the covering numbers of the ball B_ρ by balls of size ε under the Euclidean metric,

N_{d_2}(B_ρ, ε) ≤ (4ρ/ε)^d

(see, e.g., Matoušek [19]). In other words, there exists a constant C_ρ that depends only on ρ such that

γ_2(F_ρ, d) ≤ ∫_0^{2ρ} √( log N_d(F_ρ, ε) ) dε ≤ C_ρ √(kd)

and

γ_1(F_ρ, D) ≤ ∫_0^{2ρ} log N_D(F_ρ, ε) dε ≤ C_ρ kd.

Theorem 1 may now be applied to the class F_ρ.

5 Simulation Study

In this closing section we present the results of two simulation exercises to assess the performance of the estimators developed in this work.

5.1 L2 Regression

The first application is an L2 regression exercise. Data are simulated from a linear model with heavy-tailed errors, and the L2 regression procedure based on Catoni's risk minimizer introduced above is used for estimation. The procedure is benchmarked against regular ("vanilla") L2 regression based on the minimization of the empirical L2 loss.

The simulation exercise is designed as follows. We simulate (Z_1, Y_1), (Z_2, Y_2), ..., (Z_n, Y_n), i.i.d. pairs of random variables in R^4 × R. Each component of the Z_i vector is drawn from a uniform distribution with support [−1, 1], while Y_i is generated as Y_i = Z_i^T θ + ε_i, where the parameter vector θ is (0.25, 0.25, 0.50, 0.70) and ε_i is drawn from a Student's t distribution with d degrees of freedom. As is well known, the degrees-of-freedom parameter determines the highest finite moment of the Student's t distribution: moments of order k ≥ d do not exist. We are interested in finding the value of θ which minimizes the L2 loss E|Y − Z^T θ|^2.

The parameter θ is estimated using the Catoni and the vanilla L2 regressions. Let R_C(θ) denote the solution (in µ) of the equation

r̂_θ(µ) = (1/(nα)) Σ_{i=1}^n ψ_α( |Y_i − Z_i^T θ|^2 − µ ) = 0;

then the Catoni L2 regression estimator is defined as θ̂_C = argmin_θ R_C(θ). The vanilla L2 regression estimator is defined as the minimizer of the empirical L2 loss,

θ̂_V = argmin_θ R_V(θ) = argmin_θ (1/n) Σ_{i=1}^n |Y_i − Z_i^T θ|^2,

which is the classic least squares estimator. The estimated risks of the Catoni and vanilla estimators are denoted R_C(θ̂_C) and R_V(θ̂_V) respectively. Expected risk is the natural index to assess the precision of the estimators:

R_C = E|Y − Z^T θ̂_C|^2,    R_V = E|Y − Z^T θ̂_V|^2.

We estimate the expected risk by simulation. For each replication of the simulation exercise, we estimate the empirical risk of the estimators using an i.i.d. sample (Z'_1, Y'_1), ..., (Z'_m, Y'_m) that is independent of the one used for estimation:

R̂_C = (1/m) Σ_{i=1}^m |Y'_i − Z'_i^T θ̂_C|^2,    R̂_V = (1/m) Σ_{i=1}^m |Y'_i − Z'_i^T θ̂_V|^2.    (12)

The simulation experiment is replicated for different values of the tail parameter d, ranging from 2 to 4, and different values of the sample size n, ranging from 25 to 200. For each combination of the degrees-of-freedom parameter d and sample size n the experiment is replicated a large number of times. Figure 1 displays the Monte Carlo estimates of R̂_C and R̂_V as functions of the tail parameter d when the sample size n is equal to 50. The left panel reports the level of the indices while the right panel reports the percentage improvement of the Catoni procedure over the benchmark. When the tails are not excessively heavy (high values of d) the difference between the procedures is small. As the tails become heavier (small values of d) the risk of both procedures increases. Importantly, the Catoni estimator becomes progressively more efficient as the tails become heavier. The improvement is roughly 10% of the benchmark when the tail parameter is close to 2. Detailed results for different values of n are reported in Table 1.
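For concreteness, the two estimators can be sketched as follows. This is an illustrative implementation under several assumptions not fixed by the text: ψ_α is taken to be Catoni's influence function sign(t) log(1 + |t| + t²/2), the variance bound used in α is ad hoc, and the outer minimization of θ ↦ R_C(θ) uses SciPy's Nelder–Mead started at the least-squares solution.

```python
import numpy as np
from scipy.optimize import brentq, minimize

def catoni_risk(theta, Z, Y, alpha):
    """R_C(theta): the root in mu of sum_i psi(alpha * (loss_i - mu)) = 0,
    where loss_i = (Y_i - Z_i^T theta)^2."""
    loss = (Y - Z @ theta) ** 2
    psi = lambda t: np.sign(t) * np.log1p(np.abs(t) + 0.5 * t * t)
    f = lambda mu: np.sum(psi(alpha * (loss - mu)))
    return brentq(f, loss.min() - 1.0, loss.max() + 1.0)  # f is decreasing in mu

rng = np.random.default_rng(1)
n, theta_true = 200, np.array([0.25, 0.25, 0.50, 0.70])
Z = rng.uniform(-1.0, 1.0, size=(n, 4))
Y = Z @ theta_true + rng.standard_t(df=2.5, size=n)   # heavy-tailed errors

# Vanilla estimator: ordinary least squares.
theta_v = np.linalg.lstsq(Z, Y, rcond=None)[0]

# Catoni estimator: minimize the robust risk estimate over theta.
alpha = np.sqrt(2.0 / (n * 20.0))   # ad hoc variance bound for the loss
theta_c = minimize(catoni_risk, theta_v, args=(Z, Y, alpha),
                   method="Nelder-Mead").x
```

Starting the derivative-free search at the least-squares fit is only a convenience; by construction the returned point never has a larger Catoni risk than the starting point.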
The pattern documented in the figures holds for different values of n, but the advantages of the Catoni approach are stronger when the sample size is smaller. Overall, the Catoni L2 regression estimator never performs significantly worse than the benchmark, and it is substantially better when the tails of the data become heavier and data are scarce.

Figure 1: L2 Regression Parameter Estimation. The figure plots the risk of the Catoni and vanilla L2 regression parameter estimators (a) and the percentage improvement of the Catoni procedure relative to the vanilla one (b) as a function of the tail parameter d for a sample size equal to 50.

Table 1: Relative Performance of the Catoni L2 Parameter Estimator. (Columns: d; n = 25, 50, 75, 100, 150, 200. Numerical entries omitted.) The table reports the improvement of the Catoni L2 parameter estimator relative to the vanilla procedure as a function of the tail parameter d and sample size n.

5.2 k-means

In the second experiment we carry out a k-means clustering exercise. Data are simulated from a heavy-tailed mixture distribution, and the cluster centers are chosen by minimizing Catoni's estimate of the L2 distortion. The performance of the algorithm is benchmarked against the vanilla k-means algorithm, in which the distortion is estimated by simple empirical averages.

The simulation exercise is designed as follows. An i.i.d. sample of random vectors X_1, ..., X_n ∈ R^2 is drawn from a four-component mixture distribution with equal weights. Each mixture component is a bivariate Student's t distribution with d degrees of freedom and independent coordinates. The k-means algorithm based on Catoni's estimator, as well as the standard ("vanilla") k-means algorithm, are used to estimate the cluster centers; the resulting quantizers are denoted respectively q̂_C and q̂_V. Analogously to the previous exercise, we summarize the performance of the clustering procedures using the expected distortion of the algorithms, that is,

R_C = D_k(P, q̂_C),    R_V = D_k(P, q̂_V).

We estimate the expected distortion by simulation. We compute the empirical distortion of the quantizers using an i.i.d. sample X'_1, ..., X'_m of vectors that is independent of the one used for estimation, that is,

R̂_C = D_k(P_m, q̂_C) = (1/m) Σ_{i=1}^m ‖X'_i − q̂_C(X'_i)‖^2,    R̂_V = D_k(P_m, q̂_V) = (1/m) Σ_{i=1}^m ‖X'_i − q̂_V(X'_i)‖^2.    (13)

The experiment is replicated for different values of the tail parameter d, ranging from 2 to 4, and different values of the sample size n, ranging from 25 to 200. For each combination of tail parameter d and sample size n the experiment is replicated a large number of times. Figure 2 displays the Monte Carlo estimates of R̂_C and R̂_V as a function of the degrees of freedom d for n = 50. The left panel reports the absolute estimated risk while the right panel reports the percentage improvement of the Catoni procedure over the benchmark. The overall results are analogous to the ones of the L2 regression application.
When the tails of the mixture are not excessively heavy (high values of d) the difference between the procedures is small. As the tails become heavier (small values of d) the risk of both procedures increases, but the Catoni algorithm becomes progressively more efficient. The percentage gains of the Catoni procedure are above 15% of the benchmark when the tail parameter is close to 2. Table 2 reports detailed results for different values of n. Overall, the Catoni k-means algorithm never performs worse than the benchmark, and it is substantially better when the tails of the mixture become heavier and the sample size is small.
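Exact minimization of Catoni's distortion estimate over all codebooks is computationally hard, so the sketch below uses a simple proxy, which is our assumption and not the algorithm used in the paper: run plain Lloyd iterations from several random initializations and keep the codebook whose Catoni estimate of the distortion is smallest. The mixture setup, the variance bound V, and the tuning α = √(2/(nkV)) are illustrative.

```python
import numpy as np

def catoni_est(v, alpha, tol=1e-9):
    """Catoni mean estimate of the sample v, computed by bisection."""
    phi = lambda t: np.sign(t) * np.log1p(np.abs(t) + 0.5 * t * t)
    lo, hi = v.min() - 1.0, v.max() + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.sum(phi(alpha * (v - mid))) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def lloyd(X, k, rng, iters=50):
    """Plain k-means (Lloyd's algorithm) from a random initialization."""
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        lab = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(lab == j):
                C[j] = X[lab == j].mean(axis=0)
    return C

# Four-component heavy-tailed mixture in R^2, as in the experiment above.
rng = np.random.default_rng(2)
k, n = 4, 400
means = np.array([[-4.0, -4.0], [-4.0, 4.0], [4.0, -4.0], [4.0, 4.0]])
X = means[rng.integers(0, k, size=n)] + rng.standard_t(df=2.5, size=(n, 2))

V = 50.0                                  # crude variance bound
alpha = np.sqrt(2.0 / (n * k * V))
cands = [lloyd(X, k, rng) for _ in range(5)]
dists = [catoni_est(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1), alpha)
         for C in cands]
q_hat = cands[int(np.argmin(dists))]      # codebook with smallest Catoni distortion
```

Selecting restarts by the robust distortion estimate, rather than by the empirical average, is what protects the choice against a few extreme draws from the t-distributed components.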

Figure 2: k-means Quantizer Estimation. The figure plots the risk of the Catoni and vanilla k-means quantizer estimators (a) and the percentage improvement of the Catoni procedure relative to the vanilla one (b) as a function of the tail parameter d for a sample size equal to 100.

Table 2: Relative Performance of the Catoni k-means Quantizer Estimator. (Columns: d; n = 25, 50, 75, 100, 150, 200. Numerical entries omitted.) The table reports the improvement of the Catoni k-means quantizer estimator relative to the vanilla procedure as a function of the tail parameter d and sample size n.

6 Appendix

6.1 A chaining theorem

The following result is a version of standard bounds based on generic chaining; see Talagrand [27]. We include the proof for completeness. Recall that if ψ is a non-negative increasing convex function defined on R_+ with ψ(0) = 0, then the Orlicz norm of a random variable X is defined by

‖X‖_ψ = inf { c > 0 : E[ ψ( |X|/c ) ] ≤ 1 }.

We consider Orlicz norms defined by

ψ_1(x) = exp(x) − 1    and    ψ_2(x) = exp(x^2) − 1.

It is easy to see that ‖X‖_{ψ_1} ≤ ‖X‖_{ψ_2} always holds. Also note that, by Markov's inequality, ‖X‖_{ψ_1} ≤ c implies that P{|X| > t} ≤ e^{−t/c} and, similarly, if ‖X‖_{ψ_2} ≤ c, then P{|X| > t} ≤ e^{−t^2/c^2}. Then

|X| ≤ ‖X‖_{ψ_1} log(δ^{−1}) with probability at least 1 − δ,    (14)
|X| ≤ ‖X‖_{ψ_2} √( log(δ^{−1}) ) with probability at least 1 − δ.

Recall the following definition (see, e.g., [27, Definition 1.2.3]). Let T be a (pseudo) metric space. An increasing sequence (A_n) of partitions of T is called admissible if, for all n = 0, 1, 2, ..., #A_n ≤ 2^{2^n}. For any t ∈ T, denote by A_n(t) the unique element of A_n that contains t. Let Δ(A) denote the diameter of the set A ⊂ T. Define, for α = 1, 2,

γ_α(T, d) = inf_A sup_{t ∈ T} Σ_{n ≥ 0} 2^{n/α} Δ(A_n(t)),

where the infimum is taken over all admissible sequences.

Theorem 13. Let (X_t)_{t ∈ T} be a stochastic process indexed by a set T on which two (pseudo) metrics, d_1 and d_2, are defined such that T is bounded with respect to both metrics. Assume that for any s, t ∈ T and for all x > 0,

P{ |X_s − X_t| > x } ≤ 2 exp( − (1/2) x^2 / ( d_2(s,t)^2 + d_1(s,t) x ) ).

Then for all t ∈ T,

‖ sup_{s ∈ T} |X_s − X_t| ‖_{ψ_1} ≤ L ( γ_1(T, d_1) + γ_2(T, d_2) )

with L ≤ 384 log 2.
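Since c ↦ E[ψ(|X|/c)] is decreasing, an Orlicz norm can be computed for an empirical distribution by bisection on the defining inequality. The following small numerical sketch is ours, not from the paper:

```python
import numpy as np

def orlicz_norm(x, psi, tol=1e-9):
    """Empirical Orlicz norm: the smallest c with mean(psi(|x|/c)) <= 1."""
    def g(c):
        with np.errstate(over="ignore"):   # psi may overflow to inf for tiny c
            return np.mean(psi(np.abs(x) / c))
    lo, hi = tol, np.max(np.abs(x)) + 1.0
    while g(hi) > 1.0:                     # enlarge the bracket if needed
        hi *= 2.0
    while hi - lo > tol:                   # g is decreasing in c
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 1.0 else (lo, mid)
    return hi

psi1 = lambda u: np.expm1(u)         # exp(u) - 1
psi2 = lambda u: np.expm1(u ** 2)    # exp(u^2) - 1

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
c1 = orlicz_norm(x, psi1)            # psi_1 norm of the empirical sample
c2 = orlicz_norm(x, psi2)            # psi_2 norm
```

At the returned c the ψ-moment equals 1 up to the bisection tolerance, which is exactly the certificate needed for the exponential and Gaussian tail bounds in (14).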

Corollary 14. Assume that for any s, t ∈ T and for all x > 0,

P{ |X_s − X_t| > x } ≤ 2 exp( − x^2 / (2 d_2(s,t)^2) ).

Then for all t ∈ T,

‖ sup_{s ∈ T} |X_s − X_t| ‖_{ψ_2} ≤ L γ_2(T, d_2)

with L ≤ 384 log 2. In particular, we obtain

E[ e^{ λ sup_{s ∈ T} |X_s − X_t| } ] ≤ e^{ (λ^2/2) L^2 γ_2(T, d_2)^2 }.    (15)

The proof of Theorem 13 uses the following lemma:

Lemma 15. (See [30].) Let a, b > 0 and assume that the random variables X_1, ..., X_m satisfy, for all x > 0,

P{ |X_i| > x } ≤ 2 exp( − (1/2) x^2 / ( b + ax ) ).

Then

‖ max_{1 ≤ i ≤ m} |X_i| ‖_{ψ_1} ≤ 48 ( a log(1 + m) + √( b log(1 + m) ) ).

Proof of Theorem 13: Consider an admissible sequence (B_n)_{n ≥ 0} such that for all t ∈ T,

Σ_{n ≥ 0} 2^n Δ_1(B_n(t)) ≤ 2 γ_1(T, d_1),

and an admissible sequence (C_n)_{n ≥ 0} such that for all t ∈ T,

Σ_{n ≥ 0} 2^{n/2} Δ_2(C_n(t)) ≤ 2 γ_2(T, d_2).

Now we may define an admissible sequence by intersecting the elements of (B_{n−1})_{n ≥ 1} and (C_{n−1})_{n ≥ 1}: set A_0 = {T} and let

A_n = { B ∩ C : B ∈ B_{n−1} and C ∈ C_{n−1} }.

(A_n)_{n ≥ 0} is an admissible sequence because (A_n) is increasing and each A_n contains at most (2^{2^{n−1}})^2 = 2^{2^n} sets. Define a sequence of finite sets T_0 = {t} ⊂ T_1 ⊂ ... ⊂ T such that

T_n contains a single point in each set of A_n. For any s ∈ T, denote by π_n(s) the unique element of T_n in A_n(s). Now for any s ∈ T, we write

X_s − X_t = Σ_{k=0}^∞ ( X_{π_{k+1}(s)} − X_{π_k(s)} ).

Then, using the fact that ‖·‖_{ψ_1} is a norm, and Lemma 15,

‖ sup_{s ∈ T} |X_s − X_t| ‖_{ψ_1} ≤ Σ_{k=0}^∞ ‖ max_{s ∈ T_{k+1}} |X_{π_{k+1}(s)} − X_{π_k(s)}| ‖_{ψ_1}
≤ 48 Σ_{k=0}^∞ ( d_1(π_{k+1}(s), π_k(s)) log(1 + 2^{2^{k+1}}) + d_2(π_{k+1}(s), π_k(s)) √( log(1 + 2^{2^{k+1}}) ) ).

Since (A_n)_{n ≥ 0} is an increasing sequence, π_{k+1}(s) and π_k(s) are both in A_k(s). By construction, A_k(s) ⊂ B_{k−1}(s), and therefore d_1(π_{k+1}(s), π_k(s)) ≤ Δ_1(B_{k−1}(s)). Similarly, d_2(π_{k+1}(s), π_k(s)) ≤ Δ_2(C_{k−1}(s)). Using log(1 + 2^{2^{k+1}}) ≤ 4 log(2) 2^k, we get

‖ sup_{s ∈ T} |X_s − X_t| ‖_{ψ_1} ≤ 192 log(2) ( Σ_{k ≥ 0} 2^k Δ_1(B_{k−1}(s)) + Σ_{k ≥ 0} 2^{k/2} Δ_2(C_{k−1}(s)) ) ≤ 384 log(2) ( γ_1(T, d_1) + γ_2(T, d_2) ).

References

[1] E. A. Abaya and G. L. Wise. Convergence of vector quantizers with applications to optimal quantization. SIAM Journal on Applied Mathematics, 44.

[2] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58.

[3] A. Antos. Improved minimax bounds on the test and training distortion of empirically designed vector quantizers. IEEE Transactions on Information Theory, 51.

[4] A. Antos, L. Györfi, and A. György. Improved convergence rates in empirical vector quantizer design. IEEE Transactions on Information Theory, 51.

[5] P. Bartlett, T. Linder, and G. Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44.

[6] P. L. Bartlett and S. Mendelson. Empirical minimization. Probability Theory and Related Fields, 135.

[7] G. Biau, L. Devroye, and G. Lugosi. On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54.

[8] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9.

[9] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

[10] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. Manuscript.

[11] O. Catoni. Challenging the empirical mean and empirical variance: a deviation study. ArXiv preprint.

[12] R. M. Dudley. Central limit theorems for empirical measures. Annals of Probability, 6.

[13] D. Hsu and S. Sabato. Approximate loss minimization with heavy tails. Computing Research Repository.

[14] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 36.

[15] M. Lerasle and R. I. Oliveira. Robust empirical mean estimators. Manuscript.

[16] C. Levrard. Fast rates for empirical vector quantization. Electronic Journal of Statistics.

[17] T. Linder. Learning-theoretic methods in vector quantization. In L. Györfi, editor, Principles of Nonparametric Learning, number 434 in CISM Courses and Lecture Notes. Springer-Verlag, New York.

[18] P. Massart. Concentration inequalities and model selection. École d'été de Probabilités de Saint-Flour. Lecture Notes in Mathematics. Springer.

[19] J. Matoušek. Lectures on Discrete Geometry. Springer.

[20] A. Maurer and M. Pontil. k-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56.

[21] S. Mendelson. Learning without concentration. ArXiv preprint.

[22] S. Minsker. Geometric median and robust estimation in Banach spaces. ArXiv preprint.

[23] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization.

[24] D. Pollard. Strong consistency of k-means clustering. Annals of Statistics, 9, no. 1.

[25] D. Pollard. A central limit theorem for k-means clustering. Annals of Probability, 10(4).

[26] D. Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, IT-28.

[27] M. Talagrand. The Generic Chaining. Springer.

[28] M. Telgarsky and S. Dasgupta. Moment-based uniform deviation bounds for k-means and friends. ArXiv preprint.

[29] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, Cambridge, UK.

[30] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York.
arXiv:1406.2462v3 [stat.ME] 17 Nov 2015. The Annals of Statistics, 2015, Vol. 43, No. 6, 2507-2536. DOI: 10.1214/15-AOS1350. © Institute of Mathematical Statistics, 2015.

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number MATH 532 Itegrable Fuctios Dr. Neal, WKU We ow shall defie what it meas for a measurable fuctio to be itegrable, show that all itegral properties of simple fuctios still hold, ad the give some coditios

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

ECE 901 Lecture 13: Maximum Likelihood Estimation

ECE 901 Lecture 13: Maximum Likelihood Estimation ECE 90 Lecture 3: Maximum Likelihood Estimatio R. Nowak 5/7/009 The focus of this lecture is to cosider aother approach to learig based o maximum likelihood estimatio. Ulike earlier approaches cosidered

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

Vector Quantization: a Limiting Case of EM

Vector Quantization: a Limiting Case of EM . Itroductio & defiitios Assume that you are give a data set X = { x j }, j { 2,,, }, of d -dimesioal vectors. The vector quatizatio (VQ) problem requires that we fid a set of prototype vectors Z = { z

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Information-based Feature Selection

Information-based Feature Selection Iformatio-based Feature Selectio Farza Faria, Abbas Kazeroui, Afshi Babveyh Email: {faria,abbask,afshib}@staford.edu 1 Itroductio Feature selectio is a topic of great iterest i applicatios dealig with

More information

4.1 Sigma Notation and Riemann Sums

4.1 Sigma Notation and Riemann Sums 0 the itegral. Sigma Notatio ad Riema Sums Oe strategy for calculatig the area of a regio is to cut the regio ito simple shapes, calculate the area of each simple shape, ad the add these smaller areas

More information

Chapter 7 Isoperimetric problem

Chapter 7 Isoperimetric problem Chapter 7 Isoperimetric problem Recall that the isoperimetric problem (see the itroductio its coectio with ido s proble) is oe of the most classical problem of a shape optimizatio. It ca be formulated

More information

Algorithms for Clustering

Algorithms for Clustering CR2: Statistical Learig & Applicatios Algorithms for Clusterig Lecturer: J. Salmo Scribe: A. Alcolei Settig: give a data set X R p where is the umber of observatio ad p is the umber of features, we wat

More information

MAS111 Convergence and Continuity

MAS111 Convergence and Continuity MAS Covergece ad Cotiuity Key Objectives At the ed of the course, studets should kow the followig topics ad be able to apply the basic priciples ad theorems therei to solvig various problems cocerig covergece

More information

1 of 7 7/16/2009 6:06 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 6. Order Statistics Defiitios Suppose agai that we have a basic radom experimet, ad that X is a real-valued radom variable

More information

Properties and Hypothesis Testing

Properties and Hypothesis Testing Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.

More information

On forward improvement iteration for stopping problems

On forward improvement iteration for stopping problems O forward improvemet iteratio for stoppig problems Mathematical Istitute, Uiversity of Kiel, Ludewig-Mey-Str. 4, D-24098 Kiel, Germay irle@math.ui-iel.de Albrecht Irle Abstract. We cosider the optimal

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Notes 19 : Martingale CLT

Notes 19 : Martingale CLT Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Estimation of the essential supremum of a regression function

Estimation of the essential supremum of a regression function Estimatio of the essetial supremum of a regressio fuctio Michael ohler, Adam rzyżak 2, ad Harro Walk 3 Fachbereich Mathematik, Techische Uiversität Darmstadt, Schlossgartestr. 7, 64289 Darmstadt, Germay,

More information

This exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.

This exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam. Probability ad Statistics FS 07 Secod Sessio Exam 09.0.08 Time Limit: 80 Miutes Name: Studet ID: This exam cotais 9 pages (icludig this cover page) ad 0 questios. A Formulae sheet is provided with the

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

A statistical method to determine sample size to estimate characteristic value of soil parameters

A statistical method to determine sample size to estimate characteristic value of soil parameters A statistical method to determie sample size to estimate characteristic value of soil parameters Y. Hojo, B. Setiawa 2 ad M. Suzuki 3 Abstract Sample size is a importat factor to be cosidered i determiig

More information

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1 Solutio Sagchul Lee October 7, 017 1 Solutios of Homework 1 Problem 1.1 Let Ω,F,P) be a probability space. Show that if {A : N} F such that A := lim A exists, the PA) = lim PA ). Proof. Usig the cotiuity

More information

Lecture 2: Concentration Bounds

Lecture 2: Concentration Bounds CSE 52: Desig ad Aalysis of Algorithms I Sprig 206 Lecture 2: Cocetratio Bouds Lecturer: Shaya Oveis Ghara March 30th Scribe: Syuzaa Sargsya Disclaimer: These otes have ot bee subjected to the usual scrutiy

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple

More information

Math 525: Lecture 5. January 18, 2018

Math 525: Lecture 5. January 18, 2018 Math 525: Lecture 5 Jauary 18, 2018 1 Series (review) Defiitio 1.1. A sequece (a ) R coverges to a poit L R (writte a L or lim a = L) if for each ǫ > 0, we ca fid N such that a L < ǫ for all N. If the

More information

Application to Random Graphs

Application to Random Graphs A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let

More information

arxiv: v1 [math.st] 17 Apr 2015

arxiv: v1 [math.st] 17 Apr 2015 Robust estimatio of U-statistics arxiv:1504.04580v1 [math.st] 17 Apr 2015 Emilie Joly Gábor Lugosi April 20, 2015 This paper is dedicated to the memory of Evarist Gié. Abstract A importat part of the legacy

More information

On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities O Equivalece of Martigale Tail Bouds ad Determiistic Regret Iequalities Sasha Rakhli Departmet of Statistics, The Wharto School Uiversity of Pesylvaia Dec 16, 2015 Joit work with K. Sridhara arxiv:1510.03925

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

Dimensionality reduction in Hilbert spaces

Dimensionality reduction in Hilbert spaces Dimesioality reductio i Hilbert spaces Maxim Ragisky October 3, 014 Dimesioality reductio is a geeric ame for ay procedure that takes a complicated object livig i a high-dimesioal (or possibly eve ifiite-dimesioal)

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

Monte Carlo Integration

Monte Carlo Integration Mote Carlo Itegratio I these otes we first review basic umerical itegratio methods (usig Riema approximatio ad the trapezoidal rule) ad their limitatios for evaluatig multidimesioal itegrals. Next we itroduce

More information

Lecture 33: Bootstrap

Lecture 33: Bootstrap Lecture 33: ootstrap Motivatio To evaluate ad compare differet estimators, we eed cosistet estimators of variaces or asymptotic variaces of estimators. This is also importat for hypothesis testig ad cofidece

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete

More information