On the Theory of Learning with Privileged Information


Dmitry Pechyony, NEC Laboratories, Princeton, NJ 08540, USA
Vladimir Vapnik, NEC Laboratories, Princeton, NJ 08540, USA

Abstract

In the Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies a learner with privileged information in the correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes the privileged information into account in order to find a good function in the decision space. We outline conditions on the correcting space that, if satisfied, allow Privileged ERM to have a much faster learning rate in the decision space than that of the regular empirical risk minimization.

1 Introduction

In the classical supervised machine learning paradigm the learner is given a labeled training set of examples and her goal is to find a decision function with small generalization error on the unknown test examples. If the learning problem is easy (e.g. if the learner's space of decision functions contains one with zero generalization error) then, as the training size increases, the decision function found by the learner converges quickly to the optimal one. However, if the learning problem is hard and the learner's space of decision functions is large, then the convergence (or learning) rate is slow. An example of such a hard learning problem is XOR when the space of decision functions is 2-dimensional hyperplanes. The obvious question is "Can we accelerate the learning rate if the learner is given additional information about the learning problem?". During the last years several new paradigms of learning with additional information were proposed that, under some conditions, provably accelerate the learning rate. For example, in semi-supervised learning such additional information is unlabeled training examples. In this paper we consider the recently proposed Learning Using Privileged Information (LUPI) paradigm [8, 9, 10], which uses additional information of a different kind.

Let X be a decision space. In the LUPI paradigm, in addition to the standard training data (x, y) ∈ X × Y, a teacher supplies the learner with privileged information x* in the correcting space X*. The privileged information is only available for the training examples and is never available for the test examples. The LUPI paradigm requires, given a training set {(x_i, x_i*, y_i)}_{i=1}^n, to find a decision function h : X → Y with small generalization error on the unknown test examples x ∈ X. The above question about accelerating the learning rate, reformulated in terms of the LUPI paradigm, is "What kind of additional information should the teacher provide to the learner in order to accelerate her learning rate?". Paraphrased, this question is essentially "Who is a good teacher?". In this paper we outline conditions on the additional information provided by the teacher that allow for a fast learning rate even in hard problems.

The LUPI paradigm emerges in a number of applications, for example time series prediction, protein classification and human computation. The experiments of [9] in these domains demonstrated a clear advantage of the LUPI paradigm over supervised learning.

The LUPI paradigm can be implemented by the SVM+ algorithm [8], which in turn is based on the well-known SVM algorithm [2]. We now present the version of SVM+ for classification; the version for regression can be found in [9]. Let h(x) = sign(w·x + b) be a decision function and φ(x_i*) = w*·x_i* + d be a correcting function. The optimization problem of SVM+ is

  min_{w,b,w*,d}  (1/2)‖w‖_2^2 + (γ/2)‖w*‖_2^2 + C Σ_{i=1}^n (w*·x_i* + d)                      (1)
  s.t.  ∀ 1 ≤ i ≤ n:  y_i (w·x_i + b) ≥ 1 − (w*·x_i* + d),
        ∀ 1 ≤ i ≤ n:  w*·x_i* + d ≥ 0.

The objective function of SVM+ contains two hyperparameters, C > 0 and γ > 0. The term γ‖w*‖_2^2 / 2 in (1) is intended to restrict the capacity (or VC-dimension) of the function space containing φ. Let l_X(h(x), y) = [1 − y(w·x + b)]_+ be the hinge loss of the decision function h = (w, b) on the example (x, y) and let l_X*(φ(x*)) = [w*·x* + d]_+ be the loss of the correcting function φ = (w*, d) on the example x*. The optimization problem (1) can be rewritten as

  min_{h=(w,b), φ=(w*,d)}  (1/2)‖w‖_2^2 + (γ/2)‖w*‖_2^2 + C Σ_{i=1}^n l_X*(φ(x_i*))             (2)
  s.t.  ∀ 1 ≤ i ≤ n:  l_X(h(x_i), y_i) ≤ l_X*(φ(x_i*)).

The following optimization problem is a simplified and generalized version of (2):

  min_{h∈H, φ∈Φ}  Σ_{i=1}^n l_X*(φ(x_i*), y_i)                                                  (3)
  s.t.  ∀ 1 ≤ i ≤ n:  l_X(h(x_i), y_i) ≤ l_X*(φ(x_i*), y_i),                                     (4)

where l_X and l_X* are arbitrary bounded loss functions, H is a space of decision functions and Φ is a space of correcting functions. Let C > 0 be a constant (that is defined later), let [t]_+ = max(t, 0), and let

  l((h, φ), (x, x*, y)) = (1/C) l_X*(φ(x*), y) + [l_X(h(x), y) − l_X*(φ(x*), y)]_+               (5)

be the loss of the composite hypothesis (h, φ) on the example (x, x*, y). In this paper we study the following relaxation of (3):

  min_{h∈H, φ∈Φ}  Σ_{i=1}^n l((h, φ), (x_i, x_i*, y_i)).                                         (6)

We refer to the learning algorithm defined by the optimization problem (6) as empirical risk minimization with privileged information, or Privileged ERM for short. The basic assumption of Privileged ERM is that if we can achieve a small loss l_X*(φ(x*), y) in the correcting space then we should also achieve a small loss l_X(h(x), y) in the decision space. This assumption reflects the human learning process, where the teacher tells the learner which are the most important examples (the ones with small loss in the correcting space) that the learner should take into account in order to find a good decision rule.

The regular empirical risk minimization (ERM) finds a hypothesis ĥ ∈ H that minimizes the training error Σ_{i=1}^n l_X(h(x_i), y_i). While the regular ERM minimizes the training error of h directly, Privileged ERM minimizes the training error of h indirectly, via the minimization of the training error of the correcting function φ and the relaxation of the constraint (4).
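
As an illustration of how the objective (6) is evaluated, the following sketch (not from the paper) computes the composite loss (5) and its training average, assuming the per-example losses of h and φ have already been computed and are passed in as arrays; with 0/1 losses the hinge term is 1 exactly on the examples where h errs but φ does not.

import numpy as np

def composite_loss(dec_losses, corr_losses, C):
    # Composite loss (5), evaluated example-wise:
    # (1/C) * l_X*(phi(x*), y) + [l_X(h(x), y) - l_X*(phi(x*), y)]_+
    dec = np.asarray(dec_losses, dtype=float)
    corr = np.asarray(corr_losses, dtype=float)
    return corr / C + np.maximum(dec - corr, 0.0)

def privileged_erm_objective(dec_losses, corr_losses, C):
    # Objective (6): the training average of the composite loss, which
    # Privileged ERM minimizes over all pairs (h, phi) in H x Phi.
    return composite_loss(dec_losses, corr_losses, C).mean()

# Toy usage with 0/1 losses of a fixed pair (h, phi) on five training examples.
print(privileged_erm_objective([0, 1, 0, 1, 0], [0, 1, 1, 0, 0], C=2.0))
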

Let h* be the best possible decision function (in terms of generalization error) in the hypothesis space H. Suppose that for each training example x_i an oracle gives us the value of the loss l_X(h*(x_i), y_i). We use these fixed losses instead of l_X*(φ(x_i*), y_i) and find h that satisfies the following system of inequalities:

  ∀ 1 ≤ i ≤ n:  l_X(h(x_i), y_i) ≤ l_X(h*(x_i), y_i).                                            (7)

We denote the learning algorithm defined by (7) as OracleERM. A straightforward generalization of the proof of Proposition 1 of [9] shows that the generalization error of the hypothesis ĥ found by OracleERM converges to the one of h* with the rate of 1/n. This rate is much faster than the worst-case convergence rate 1/√n of the regular ERM [3]. In this paper we consider a more realistic setting, where the above oracle is not available.

Our subsequent derivations rely heavily on the following definition:

Definition 1.1 A decision function h is uniformly better than the correcting function φ if for any example (x, x*, y) that has non-zero probability, l_X*(φ(x*), y) ≥ l_X(h(x), y).

Given a space H of decision functions and a space Φ of correcting functions we define Φ′ = {φ ∈ Φ | ∃ h ∈ H that is uniformly better than φ}. Note that Φ′ ⊆ Φ and that Φ′ does not contain correcting functions that are too good for H. Our results are based on the following two assumptions:

Assumption 1.2 Φ′ ≠ ∅.

This assumption is not restrictive, since it only means that the optimization problem (3) of Privileged ERM has a feasible solution when the training size n goes to infinity.

Assumption 1.3 There exists a correcting function φ̃ ∈ Φ′ such that for any (x, x*, y) that has non-zero probability, l_X(h*(x), y) = l_X*(φ̃(x*), y).

Put another way, we assume the existence of a correcting function in Φ′ that mimics the losses of h*.

Let r be the learning rate of Privileged ERM when it is run over the joint X × X* space with the space H × Φ of decision and correcting functions. We develop an upper bound for the risk of the decision function found by Privileged ERM. Under the above assumptions this bound converges to the risk of h* with the same rate r. This implies that if the correcting space is good, so that Privileged ERM in the joint X × X* space has a fast learning rate (e.g. 1/n), then Privileged ERM will have the same fast learning rate (e.g. the same 1/n) in the decision space. This is true even if the decision space is hard and the regular ERM in the decision space has a slow learning rate (e.g. 1/√n). We illustrate this result with an artificial learning problem in which the regular ERM in the decision space cannot learn with a rate faster than 1/√n, but the correcting space is good and Privileged ERM learns in the decision space with the rate of 1/n.

The paper has the following structure. In Section 2 we give additional definitions. In Section 3 we review the existing risk bounds that are used to derive our results. Section 4 contains the proof of the risk bound for Privileged ERM. In Section 5 we show an example where Privileged ERM is provably better than the regular ERM. We conclude and give directions for future research in Section 6. Due to space constraints, most of the proofs appear in the supplementary material.

Previous work. The first attempt at a theoretical analysis of LUPI was done by Vapnik and Vashist [9]. In addition to the analysis of learning with an oracle (mentioned above), they considered an algorithm which is close to, but different from, Privileged ERM. They developed a risk bound (Proposition 2 in [9]) for the decision function found by their algorithm. This bound also applies to Privileged ERM. The bound of [9] is tailored to the classification setting, with 0/1 loss functions in the decision and the correcting space. By contrast, our bound holds for any bounded loss functions and allows the loss functions l_X and l_X* to be different. The bound of [9] depends on the generalization error of the correcting function φ̂ found by Privileged ERM. Vapnik and Vashist [9] concluded that if we could bound the convergence rate of φ̂ then this bound would imply a bound on the convergence rate of the decision function found by their algorithm.
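
When the distribution has finite support, as in the example of Section 5, Definition 1.1 and the set Φ′ can be checked mechanically. The sketch below is only an illustration under that finite-support assumption; the loss tables passed in are hypothetical and stand for the losses of each hypothesis on the support points of D.

def uniformly_better(h_losses, phi_losses):
    # Definition 1.1: h is uniformly better than phi if on every point of
    # non-zero probability l_X(h(x), y) <= l_X*(phi(x*), y).
    return all(lh <= lp for lh, lp in zip(h_losses, phi_losses))

def phi_prime(H_losses, Phi_losses):
    # Phi' = {phi in Phi : there exists h in H that is uniformly better than phi}.
    # H_losses / Phi_losses map hypothesis names to their loss sequences over the
    # support points (illustrative dictionaries, not notation from the paper).
    return {p for p, lp in Phi_losses.items()
            if any(uniformly_better(lh, lp) for lh in H_losses.values())}

# Toy usage: h1 is uniformly better than phi1, but no hypothesis dominates phi2.
H_losses = {"h1": [0, 0, 1, 0], "h3": [0, 1, 0, 1]}
Phi_losses = {"phi1": [0, 0, 1, 1], "phi2": [0, 0, 0, 0]}
print(phi_prime(H_losses, Phi_losses))   # {'phi1'}
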
2 Definitions

The triple (x, x*, y) is sampled from a distribution D, which is unknown to the learner. We denote by D_X the marginal distribution over (x, y) and by D_X* the marginal distribution over (x*, y). The distribution D_X is given by nature and the distribution D_X* is constructed by the teacher. The spaces H and Φ of decision and correcting functions are chosen by the learner.

Let R(h) = E_{(x,y)∼D_X}{l_X(h(x), y)} and R(φ) = E_{(x*,y)∼D_X*}{l_X*(φ(x*), y)} be the generalization errors of the decision function h and the correcting function φ respectively. We assume that the loss functions l_X and l_X* have range [0, 1]. This assumption can be satisfied by any bounded loss function by simply dividing it by its maximal value. We denote by h* = argmin_{h∈H} R(h) and φ* = argmin_{φ∈Φ} R(φ) the decision and the correcting function with the minimal generalization error w.r.t. the loss functions l_X and l_X*. Also, we denote by l_01 the 0/1 loss, by R_01(h) = E_{(x,y)∼D_X}{l_01(h(x), y)} the generalization error of h w.r.t. the 0/1 loss, and by h*_01 = argmin_{h∈H} R_01(h) the decision function in H with the minimal generalization 0/1 error. Let

  R̂_l(h, φ) = (1/n) Σ_{i=1}^n l((h, φ), (x_i, x_i*, y_i))   and   R_l(h, φ) = E_{(x,x*,y)∼D}{l((h, φ), (x, x*, y))}     (8)

be respectively the empirical and the generalization error of the hypothesis (h, φ) w.r.t. the loss function l. We denote by (ĥ, φ̂) = argmin_{(h,φ)∈H×Φ} R̂_l(h, φ) the empirical risk minimizer and by (h′, φ′) = argmin_{(h,φ)∈H×Φ} R_l(h, φ) the minimizer of the generalization error w.r.t. the loss function l. Note that in general h′ can be different from h*, and also φ′ can be different from φ*.

Let (H, Φ)′ = {(h, φ) ∈ H × Φ | h is uniformly better than φ}. By Assumption 1.2, (H, Φ)′ ≠ ∅. We will use an additional technical assumption:

Assumption 2.1 There exists a constant A > 0 such that

  inf { E_{(x,x*,y)∼D}{[l_X(h(x), y) − l_X*(φ(x*), y)]_+}  |  (h, φ) ∉ (H, Φ)′,  R(φ) < R(φ̃) } ≥ A.

This assumption is satisfied, for example, in the classification setting when l_X and l_X* are 0/1 loss functions and the probability density function p(x, x*, y) of the underlying distribution D is bounded away from zero for all points with non-zero probability. In this case one can take A = inf{p(x, x*, y) | (x, x*, y) such that p(x, x*, y) ≠ 0}.

The following lemma (proved in Appendix A in the full version of the paper) shows that for sufficiently large C the optimization problems (3) and (6) are asymptotically (when n → ∞) equivalent:

Lemma 2.2 Suppose that Assumptions 1.2, 1.3 and 2.1 hold true. Then there exists a finite C_1 ∈ R such that for any C ≥ C_1, (h′, φ′) ∈ (H, Φ)′. Moreover, h′ = h* and φ′ = φ̃.

In all our subsequent derivations we assume that C has a finite value for which (3) and (6) are equivalent. Later on we will show how to choose the value of C that optimizes the forthcoming risk bound.

The risk bounds presented in this paper are based on the VC-dimension of various function classes. While the definition of VC-dimension for binary functions is well known in the learning community, the one for real-valued functions is less known and we review it here. Let F be a set of real-valued functions f : S → R and let T(F) = {(x, t) ∈ S × R | ∃ f ∈ F s.t. 0 ≤ f(x) ≤ t}. We say that the set T = {(x_i, t_i)}_{i=1}^{|T|} ⊆ T(F) is shattered by F if for any T′ ⊆ T there exists a function f ∈ F such that for any (x_i, t_i) ∈ T′, f(x_i) ≤ t_i, and for any (x_i, t_i) ∈ T \ T′, f(x_i) > t_i. The VC-dimension of F is defined as the VC-dimension of the set T(F), namely the maximal size of a set T ⊆ T(F) that is shattered by F.
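
The shattering condition in this definition can be checked by brute force when a small finite set of functions is used as a stand-in for F. The sketch below is only an illustration of the definition (the classes used in the paper are infinite); the constant functions in the toy example are an assumption made here.

from itertools import combinations

def is_shattered(points, functions):
    # Check the shattering condition from the definition above: `points` is a
    # list of pairs (x_i, t_i); for every subset T' of the points there must be
    # some f with f(x_i) <= t_i on T' and f(x_i) > t_i outside T'.
    # `functions` is a finite list of callables standing in for F.
    n = len(points)
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            chosen = set(subset)
            if not any(all((f(x) <= t) == (i in chosen)
                           for i, (x, t) in enumerate(points))
                       for f in functions):
                return False
    return True

# Toy example: the class of constant functions f_a(x) = a, a in {0, 0.5, 1},
# shatters a single point but not two points with crossing thresholds.
F = [lambda x, a=a: a for a in (0.0, 0.5, 1.0)]
print(is_shattered([(0.0, 0.5)], F))                  # True
print(is_shattered([(0.0, 0.25), (1.0, 0.75)], F))    # False
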

3 Review of existing excess risk bounds with fast convergence rates

We derive our risk bounds from generic excess risk bounds developed by Massart and Nédélec [6] and generalized by Giné and Koltchinskii [4] and Koltchinskii [5]. In this paper we use the version of the bounds given in [4] and [5]. Let F be a space of hypotheses f : S → S′, let l : S′ × {−1, +1} → R be a real-valued loss function such that 0 ≤ l(f(x), y) ≤ 1 for any f ∈ F and any (x, y). Let f* = argmin_{f∈F} E_{(x,y)}{l(f(x), y)}, let f̂ = argmin_{f∈F} Σ_{i=1}^n l(f(x_i), y_i), and let D > 0 be a constant such that for any f ∈ F,

  Var_{(x,y)}{l(f(x), y) − l(f*(x), y)} ≤ D · E_{(x,y)}{l(f(x), y) − l(f*(x), y)}.                (9)

This condition is a generalization of Tsybakov's low-noise condition [7] to arbitrary loss functions and arbitrary hypothesis spaces. The constant D in (9) characterizes the error surface of the hypothesis space F. Suppose that E_{(x,y)}{l(f(x), y) − l(f*(x), y)} is very small, namely f is nearly optimal. If f is almost the same as f* then the variance on the left-hand side of (9), as well as the value of D, will be small. But if f differs significantly from f* then the variance on the left-hand side of (9), as well as the value of D, will be large. Thus, if we take the variance on the left-hand side of (9) as a measure of distance between f and f*, then hypothesis spaces with large and small D can be visualized as shown in Figure 1.

[Figure 1: Visualization of the hypothesis spaces. (a) Hypothesis space with small D. (b) Hypothesis space with large D. The horizontal axis measures the distance (in terms of the variance) between a hypothesis f and the best hypothesis f* in F. The vertical axis is the minimal error of hypotheses in F at the fixed distance from f*. Note that the error function displayed in the graphs can be non-continuous. The large value of D in the hypothesis space in graph (b) is caused by hypothesis A, which is significantly different from f* but has nearly-optimal error.]

Let V be the VC-dimension of F. The following theorem is a straightforward generalization of Theorem 5.8 in [5].

Theorem 3.1 ([5]) There exists a constant K > 0 such that if n > V D^2 then for any δ > 0, with probability of at least 1 − δ,

  E_{(x,y)}{l(f̂(x), y)} ≤ E_{(x,y)}{l(f*(x), y)} + (K D / n) ( V log(n / (V D^2)) + ln(1/δ) ).    (10)

Let B_n = (V log n + log(1/δ)) / n. If the condition of Theorem 3.1 does not hold, namely if n ≤ V D^2, then we can use the following fallback risk bound:

Theorem 3.2 ([1, 8]) There exists a constant K such that for any δ > 0, with probability of at least 1 − δ,

  E_{(x,y)}{l(f̂(x), y)} ≤ E_{(x,y)}{l(f*(x), y)} + K ( √( E_{(x,y)}{l(f*(x), y)} · B_n ) + B_n ).   (11)

Definition 3.3 Let n_T = n_T(E_{(x,y)}{l(f*(x), y)}, V, δ) be a constant such that for all n < n_T it holds that E_{(x,y)}{l(f*(x), y)} < B_n.

For n ≤ n_T the bound (11) has a convergence rate of 1/n, and for n > n_T the bound (11) has a convergence rate of 1/√n. The main difference between (10) and (11) is the fast convergence rate of 1/n vs. the slow one of 1/√n in the regime n > max(n_T, V D^2). By Theorem 3.1, starting from n > n(D) = V D^2 we always have the convergence rate of 1/n. Thus, the smaller the value of D, the smaller will be the threshold n(D) for obtaining the fast convergence rate of 1/n.
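
To see the two regimes concretely, the right-hand sides of (10) and (11) can be evaluated numerically. In the sketch below the unspecified absolute constants K are simply set to 1, which the theorems do not guarantee, so the numbers only illustrate how the two bounds scale with n.

import numpy as np

def fast_excess_bound(n, V, D, delta, K=1.0):
    # Excess-risk term of bound (10); Theorem 3.1 applies only when n > V * D**2.
    # K stands for the unspecified absolute constant (set to 1 for illustration).
    assert n > V * D**2, "Theorem 3.1 does not apply for this n"
    return K * D * (V * np.log(n / (V * D**2)) + np.log(1.0 / delta)) / n

def fallback_excess_bound(n, V, risk_star, delta, K=1.0):
    # Excess-risk term of bound (11), with B_n = (V log n + log(1/delta)) / n
    # and risk_star standing for E{l(f*(x), y)}.
    B = (V * np.log(n) + np.log(1.0 / delta)) / n
    return K * (np.sqrt(risk_star * B) + B)

# Doubling n roughly halves (10), while (11) shrinks only by about sqrt(2)
# once n exceeds the threshold n_T of Definition 3.3.
for n in (1000, 2000, 4000):
    print(n, fast_excess_bound(n, V=4, D=2.0, delta=0.05),
             fallback_excess_bound(n, V=4, risk_star=0.25, delta=0.05))
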

4 Upper Risk Bound

For any C ≥ 1, any (x, x*, y), any h ∈ H and φ ∈ Φ, and any loss functions l_X and l_X*,

  l_X(h(x), y) ≤ l_X*(φ(x*), y) + C [l_X(h(x), y) − l_X*(φ(x*), y)]_+.

Hence, using (5) we obtain that

  R(ĥ) = E_{(x,y)}{l_X(ĥ(x), y)} ≤ C · E_{(x,x*,y)}{l((ĥ, φ̂), (x, x*, y))} = C · R_l(ĥ, φ̂).        (12)

Let l_1(h, h*, x, y) = l_X(h(x), y) − l_X(h*(x), y) and let D_H ≥ 0 be a constant such that for any h ∈ H,

  D_H · E_{(x,y)}{l_1(h, h*, x, y)} ≥ Var_{(x,y)}{l_1(h, h*, x, y)}.                               (13)

Similarly, let l_2(h, h′, φ, φ′, x, x*, y) = l((h, φ), (x, x*, y)) − l((h′, φ′), (x, x*, y)) and let D_{H,Φ} ≥ 0 be a constant such that for all (h, φ) ∈ H × Φ,

  D_{H,Φ} · E_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)} ≥ Var_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)}.   (14)

Let L(H, Φ) = {l((h, φ), (·, ·, ·)) | h ∈ H, φ ∈ Φ} be the set of loss functions l corresponding to hypotheses from H × Φ and let V_{L(H,Φ)} be the VC-dimension of L(H, Φ). Similarly, let L(H) = {l_X(h(·), ·) | h ∈ H} and L(Φ) = {l_X*(φ(·), ·) | φ ∈ Φ} be the sets of loss functions that correspond to the hypotheses in H and Φ, and let V_{L(H)} and V_{L(Φ)} be the VC-dimensions of L(H) and L(Φ) respectively. Note that if l_X = l_01 then V_{L(H)} is also the VC-dimension of H (the same holds for V_{L(Φ)}).

Lemma 4.1 V_{L(H,Φ)} = V_{L(H)} + V_{L(Φ)}.

Proof See Appendix C in the full version of the paper.

We apply Theorem 3.1 to the hypothesis space H × Φ and the loss function l((h, φ), (x, x*, y)) and obtain that there exists a constant K > 0 such that if n > V_{L(H,Φ)} D_{H,Φ}^2 then for any δ > 0, with probability at least 1 − δ,

  R_l(ĥ, φ̂) ≤ R_l(h′, φ′) + (K D_{H,Φ} / n) ( V_{L(H,Φ)} ln(n / (V_{L(H,Φ)} D_{H,Φ}^2)) + ln(1/δ) ).

Using (12) we obtain that

  R(ĥ) ≤ C · R_l(h′, φ′) + (C K D_{H,Φ} / n) ( V_{L(H,Φ)} ln(n / (V_{L(H,Φ)} D_{H,Φ}^2)) + ln(1/δ) ).   (15)

It follows from Assumption 1.3 and Lemma 2.2 that

  R_l(h′, φ′) = (1/C) R(φ′) = (1/C) R(φ̃) = (1/C) R(h*).                                           (16)

We substitute (16) into (15) and obtain that there exists a constant K > 0 such that if n > V_{L(H,Φ)} D_{H,Φ}^2 then for any δ > 0, with probability at least 1 − δ,

  R(ĥ) ≤ R(h*) + (C K D_{H,Φ} / n) ( V_{L(H,Φ)} ln(n / (V_{L(H,Φ)} D_{H,Φ}^2)) + ln(1/δ) ).

We bound V_{L(H,Φ)} by Lemma 4.1 and obtain our final risk bound, which is summarized in the following theorem:

Theorem 4.2 Suppose that Assumptions 1.2, 1.3 and 2.1 hold. Let D_{H,Φ} be as defined in (14), C_1 be as defined in Lemma 2.2, and V_{L(H,Φ)} = V_{L(H)} + V_{L(Φ)}. Suppose that C ≥ C_1 and n > V_{L(H,Φ)} D_{H,Φ}^2. Then for any δ > 0, with probability of at least 1 − δ,

  R(ĥ) ≤ R(h*) + (C K D_{H,Φ} / n) ( V_{L(H,Φ)} ln(n / (V_{L(H,Φ)} D_{H,Φ}^2)) + ln(1/δ) ),        (17)

where K > 0 is a constant.

According to this bound, R(ĥ) converges to R(h*) with the rate of 1/n. If Assumption 1.3 does not hold then it is easy to see that we obtain the same bound as (17), but with R(h*) replaced by R(φ′). In this case the upper bound on R(ĥ) converges to R(φ′) with the rate of 1/n.

We now provide further analysis of the risk bound (17). Let l_3(φ, φ′, x*, y) = l_X*(φ(x*), y) − l_X*(φ′(x*), y) and let D_Φ ≥ 0 be a constant such that for any φ ∈ Φ,

  D_Φ · E_{(x*,y)}{l_3(φ, φ′, x*, y)} ≥ Var_{(x*,y)}{l_3(φ, φ′, x*, y)}.                           (18)

Similarly, let D′_{H,Φ} ≥ 0 be a constant such that for all (h, φ) ∈ (H × Φ) \ (H, Φ)′,

  D′_{H,Φ} · E_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)} ≥ Var_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)}.

Lemma 4.3 D_{H,Φ} ≤ max( D_Φ / C, D′_{H,Φ} ).

Proof See Appendix B in the full version of the paper.

By Lemma 4.3, C · D_{H,Φ} ≤ max(D_Φ, C · D′_{H,Φ}). Since the loss function l_2 depends on C, the constant D_{H,Φ} depends on C too. Thus, ignoring the logarithmic term in (17), the optimal value of C is the one that is larger than C_1 and minimizes C · D_{H,Φ}. We now show that such a minimum indeed exists. By the definition of the loss function l_2,

  0 < lim_{C→∞} sup_{(h,φ) ∈ (H×Φ) \ (H,Φ)′}  Var_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)} / E_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)}  ≤ 1.   (19)

Therefore for very large C it holds that D_{H,Φ} ≥ s > 0, where s is the value of the above limit. Consequently lim_{C→∞} C · D_{H,Φ} = ∞. Since the function g(C) = C · D_{H,Φ} is continuous and finite at C = C_1, there exists a point C = C* ∈ [C_1, ∞) that minimizes it.

5 When Privileged ERM is provably better than the regular ERM

We show an example that demonstrates the difference between empirical risk minimization in the X space and empirical risk minimization with privileged information in the joint X × X* space. In particular, we show in this example that for not too small training sizes (as specified by the conditions of (11) and of Theorem 4.2) the learning rate of the regular ERM in the X space is 1/√n while the learning rate of Privileged ERM in the joint X × X* space is 1/n. We consider the classification setting and all loss functions in our example are the 0/1 loss.

Let D_X = {D_X(ε) | 0 < ε < 0.1} be an infinite family of distributions of examples in the X space. All distributions in D_X have non-zero support on four points, denoted by X_1, X_2, X_3 and X_4. We assume that these points lie on a 1-dimensional line, as shown in Figure 2(a). Figure 2(a) also shows the probability mass of each point in the distribution D_X(ε). The hypothesis space H consists of the hypotheses h_t(x) = sign(x − t) and h̄_t(x) = −sign(x − t). The best hypothesis in H is h_1 and its generalization error is 1/4 − 2ε. The hypothesis space H also contains a hypothesis h_3, which is slightly worse than h_1 and has generalization error of 1/4 + ε. It can be verified that for a fixed D_X(ε) and H the constant D_H (defined in equation (13)) is

  D_H = 1/(6ε) − 1/3 − ε ≤ 1/(6ε).                                                                 (20)

Note that the inequality in (20) is very tight since ε can be arbitrarily small. The VC-dimension V_H of H is 2. Suppose that ε is sufficiently small, such that V_H D_H^2 > n_T(1/4 − 2ε, V_H, δ), where the function n_T(·, ·, ·) is defined in Definition 3.3. In order to use the risk bound (10) with our D_X and H, the condition

  n > V_H D_H^2 ≈ 1/(18ε^2)                                                                         (21)

should be satisfied. But since ε can be very small, the condition (21) is not satisfied for a large range of n's. Hence, according to (11), for distributions D_X(ε) that satisfy n_T(1/4 − 2ε, 2, δ) ≤ 1/(18ε^2) we obtain that R_01(ĥ) converges to R_01(h*) with the rate of at least 1/√n.
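
The effect of ε on the threshold (21) can be tabulated directly from (20). This small sketch just plugs in the closed forms quoted above; it is an illustration of the statement, not a computation taken from the paper.

def D_H(eps):
    # The constant D_H of equation (20); approximately 1/(6*eps) for small eps.
    return 1.0 / (6.0 * eps) - 1.0 / 3.0 - eps

def fast_rate_threshold(eps, V_H=2):
    # Condition (21): bound (10) applies to the regular ERM in X only when
    # n > V_H * D_H**2, which is approximately 1/(18 * eps**2).
    return V_H * D_H(eps) ** 2

for eps in (0.05, 0.01, 0.001):
    print(eps, round(fast_rate_threshold(eps)))
# The threshold grows like 1/(18 * eps**2): it is already in the hundreds for
# eps = 0.01 and in the tens of thousands for eps = 0.001.
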

[Figure 2: The X and X* spaces. (a) The X space. (b) The X* space.]

The following lower bound shows that R_01(ĥ) converges to R_01(h*) with the rate of at most 1/√n.

Lemma 5.1 Suppose that ε < 1/16. Let δ′ = exp(−20nε^2). Then for any n > 256, with probability at least δ′, R_01(ĥ) ≥ R_01(h*) + √( ln(1/δ′) / (20n) ).

By combining the upper and lower bounds we obtain that the convergence rate of R_01(ĥ) to R_01(h*) is exactly 1/√n. The proof of the lower bound appears in Appendix D in the full version of the paper.

Suppose that the teacher constructed the distribution D_X*(ε) of examples in the X* space in the following way. D_X*(ε) has non-zero support on four points, denoted by X*_1, X*_2, X*_3 and X*_4, that lie on a 1-dimensional line, as shown in Figure 2(b). Figure 2(b) also shows the probability mass of each point in the X* space. We assume that the joint distribution of (X, X*) has non-zero support only on the points (X_1, X*_1), (X_2, X*_2), (X_3, X*_3) and (X_4, X*_4). The hypothesis space Φ consists of the hypotheses φ_t(x*) = sign(x* − t) and φ̄_t(x*) = −sign(x* − t). The best hypothesis in Φ is φ_2 and its generalization error is 0. However, there is no h ∈ H that is uniformly better than φ_2. The best hypothesis in Φ among those that have a uniformly better hypothesis in H is φ_1, and its generalization error is 1/4 − 2ε; h_1 is uniformly better than φ_1. It can be verified that for such D_X*(ε) and Φ the constant D_Φ (defined in equation (18)) is

  D_Φ = (11/16 − 3ε − 4ε^2) / (1/4 + 2ε) ≤ 11/4.                                                   (22)

Note that the inequality in (22) is very tight since ε can be arbitrarily small. Moreover, it can be verified that the value of C that minimizes C · D_{H,Φ} is C* = 2.6. For C = C* it holds that D′_{H,Φ} = 1.71 and D_Φ / C* is smaller, so by Lemma 4.3 D_{H,Φ} ≤ 1.71. It is easy to see that our example satisfies Assumptions 1.2 and 1.3 (the latter assumption is satisfied with φ̃ = φ_1). Also, it can be verified that Assumption 2.1 is satisfied with A = 1/4 − 2ε, and that C_1 = 1.1 < C* satisfies Lemma 2.2. The VC-dimension of Φ is 2. Hence, by Theorem 4.2 and Lemma 4.3, if n > (2 + 2) · 1.71^2 ≈ 11.7 then R_01(ĥ) converges to R_01(h*) with the rate of at least 1/n. Since our bounds on D_Φ and D′_{H,Φ} are independent of ε, the convergence rate of 1/n holds for any distribution in D_X.

We obtained that for 11.7 < n ≤ 1/(18ε^2) the upper bound (17) converges to R_01(h*) with the rate of 1/n, while the upper bound (11) converges to R_01(h*) with the rate of 1/√n. This improvement was possible due to the teacher's construction of D_X*(ε) and the learner's choice of Φ. The hypothesis h_3 caused the value of D_H to be large and thus prevented us from obtaining the 1/n convergence rate for a large range of n's. We constructed D_X*(ε) and Φ in such a way that Φ does not contain a hypothesis φ that has exactly the same dichotomy as the bad hypothesis h_3. With such a construction, any φ ∈ Φ such that h_3 is uniformly better than φ has generalization error significantly larger than the one of h_3. For example, the best hypothesis in Φ for which h_3 is uniformly better is φ_0, and its generalization error is 1/2.
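
The two thresholds in this example can be put side by side: the sample size after which the fast bound (10) applies to the regular ERM in X, and the sample size after which Theorem 4.2 applies to Privileged ERM in the joint space. A small sketch using the values quoted above (V_H = 2 and D_H ≈ 1/(6ε) for the regular ERM; V_{L(H,Φ)} = 4 and D_{H,Φ} ≤ 1.71 at C* = 2.6 for Privileged ERM):

def regular_erm_threshold(eps, V_H=2):
    # n > V_H * D_H**2, approximately 1/(18 * eps**2), before the fast-rate
    # bound (10) applies to the regular ERM in the decision space X.
    return V_H * (1.0 / (6.0 * eps)) ** 2

def privileged_erm_threshold(V_joint=4, D_joint=1.71):
    # n > V_{L(H,Phi)} * D_{H,Phi}**2, about 11.7, before the bound (17) of
    # Theorem 4.2 applies to Privileged ERM in the joint X x X* space.
    return V_joint * D_joint ** 2

for eps in (0.05, 0.01, 0.001):
    print(eps, regular_erm_threshold(eps), privileged_erm_threshold())
# Between roughly 12 examples and roughly 1/(18 * eps**2) examples Privileged ERM
# already enjoys its 1/n rate, while the regular ERM is still at the 1/sqrt(n) rate.
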
6 Conclusions

We formulated the algorithm of empirical risk minimization with privileged information and derived a risk bound for it. Our risk bound outlines the conditions on the correcting space that, if satisfied, allow fast learning in the decision space, even if the original learning problem in the decision space is very hard. We showed an example where the privileged information provably and significantly improves the learning rate.

In this paper we showed that a good correcting space can improve the learning rate from 1/√n to 1/n. But, having a good correcting space, can we achieve a learning rate faster than 1/n? Another interesting problem is to analyze Privileged ERM when the learner does not completely trust the teacher. This condition translates to the constraint l_X(h(x_i), y_i) ≤ l_X*(φ(x_i*), y_i) + ε̃ in (3) and to the term [l_X(h(x), y) − l_X*(φ(x*), y) − ε̃]_+ in (6), where ε̃ ≥ 0 is a hyperparameter. Finally, an important direction is to develop risk bounds for SVM+ (which is a regularized version of Privileged ERM) and to show when it is provably better than SVM.

References

[1] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.
[2] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[3] L. Devroye and G. Lugosi. Lower bounds in pattern recognition and learning. Pattern Recognition, 28(7):1011-1018, 1995.
[4] E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. Annals of Probability, 34(3):1143-1216, 2006.
[5] V. Koltchinskii. 2008 Saint Flour lectures: Oracle inequalities in empirical risk minimization and sparse recovery problems. Available at fodava.gatech.edu/files/reports/fodava pdf.
[6] P. Massart and E. Nédélec. Risk bounds for statistical learning. Annals of Statistics, 34(5):2326-2366, 2006.
[7] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135-166, 2004.
[8] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, 2nd edition, 2006.
[9] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544-557, 2009.
[10] V. Vapnik, A. Vashist, and N. Pavlovich. Learning using hidden information: Master class learning. In Proceedings of the NATO Workshop on Mining Massive Data Sets for Security, 2008.
