On the Theory of Learning with Privileged Information


Dmitry Pechyony, NEC Laboratories, Princeton, NJ 08540, USA
Vladimir Vapnik, NEC Laboratories, Princeton, NJ 08540, USA

Abstract

In the Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies a learner with privileged information in the correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes the privileged information into account in order to find a good function in the decision space. We outline conditions on the correcting space that, if satisfied, allow Privileged ERM to have a much faster learning rate in the decision space than that of the regular empirical risk minimization.

1 Introduction

In the classical supervised machine learning paradigm the learner is given a labeled training set of examples and her goal is to find a decision function with small generalization error on the unknown test examples. If the learning problem is easy (e.g. if the learner's space of decision functions contains one with zero generalization error) then, as the training size increases, the decision function found by the learner converges quickly to the optimal one. However, if the learning problem is hard and the learner's space of decision functions is large, then the convergence (or learning) rate is slow. An example of such a hard learning problem is XOR when the space of decision functions is 2-dimensional hyperplanes. The obvious question is "Can we accelerate the learning rate if the learner is given additional information about the learning problem?". During the last years several new paradigms of learning with additional information were proposed that, under some conditions, provably accelerate the learning rate. For example, in semi-supervised learning such additional information is unlabeled training examples. In this paper we consider the recently proposed Learning Using Privileged Information (LUPI) paradigm [8, 9, 10], which uses additional information of a different kind.

Let X be a decision space. In the LUPI paradigm, in addition to the standard training data (x, y) ∈ X × Y, a teacher supplies the learner with privileged information x* in the correcting space X*. The privileged information is only available for the training examples and is never available for the test examples. The LUPI paradigm requires, given a training set {(x_i, x_i*, y_i)}_{i=1}^n, to find a decision function h : X → Y with small generalization error on the unknown test examples x ∈ X. The above question about accelerating the learning rate, reformulated in terms of the LUPI paradigm, is "What kind of additional information should the teacher provide to the learner in order to accelerate her learning rate?". Paraphrased, this question is essentially "Who is a good teacher?". In this paper we outline conditions on the additional information provided by the teacher that allow for a fast learning rate even in hard problems.

The LUPI paradigm emerges in a number of applications, for example time series prediction, protein classification and human computation. The experiments of [9] in these domains demonstrated a clear advantage of the LUPI paradigm over supervised learning.

The LUPI paradigm can be implemented by the SVM+ algorithm [8], which in turn is based on the well-known SVM algorithm [2]. We now present the version of SVM+ for classification; the version for regression can be found in [9]. Let h(x) = sign(w·x + b) be a decision function and φ(x_i*) = w*·x_i* + d be a correcting function. The optimization problem of SVM+ is

  min_{w,b,w*,d}  (1/2)‖w‖_2^2 + (γ/2)‖w*‖_2^2 + C Σ_{i=1}^n (w*·x_i* + d)                      (1)
  s.t.  ∀ 1 ≤ i ≤ n:  y_i (w·x_i + b) ≥ 1 − (w*·x_i* + d),
        ∀ 1 ≤ i ≤ n:  w*·x_i* + d ≥ 0.

The objective function of SVM+ contains two hyperparameters, C > 0 and γ > 0. The term γ‖w*‖_2^2 / 2 in (1) is intended to restrict the capacity (or VC-dimension) of the function space containing φ. Let l_X(h(x), y) = [1 − y(w·x + b)]_+ be the hinge loss of the decision function h = (w, b) on the example (x, y) and let l_X*(φ(x*)) = [w*·x* + d]_+ be the loss of the correcting function φ = (w*, d) on the example x*. The optimization problem (1) can be rewritten as

  min_{h=(w,b), φ=(w*,d)}  (1/2)‖w‖_2^2 + (γ/2)‖w*‖_2^2 + C Σ_{i=1}^n l_X*(φ(x_i*))             (2)
  s.t.  ∀ 1 ≤ i ≤ n:  l_X(h(x_i), y_i) ≤ l_X*(φ(x_i*)).

The following optimization problem is a simplified and generalized version of (2):

  min_{h∈H, φ∈Φ}  Σ_{i=1}^n l_X*(φ(x_i*), y_i)                                                  (3)
  s.t.  ∀ 1 ≤ i ≤ n:  l_X(h(x_i), y_i) ≤ l_X*(φ(x_i*), y_i),                                     (4)

where l_X and l_X* are arbitrary bounded loss functions, H is a space of decision functions and Φ is a space of correcting functions. Let C > 0 be a constant (that is defined later), let [t]_+ = max(t, 0), and let

  l((h, φ), (x, x*, y)) = (1/C) l_X*(φ(x*), y) + [l_X(h(x), y) − l_X*(φ(x*), y)]_+               (5)

be the loss of the composite hypothesis (h, φ) on the example (x, x*, y). In this paper we study the following relaxation of (3):

  min_{h∈H, φ∈Φ}  Σ_{i=1}^n l((h, φ), (x_i, x_i*, y_i)).                                         (6)

We refer to the learning algorithm defined by the optimization problem (6) as empirical risk minimization with privileged information, or Privileged ERM for short. The basic assumption of Privileged ERM is that if we can achieve a small loss l_X*(φ(x*), y) in the correcting space then we should also achieve a small loss l_X(h(x), y) in the decision space. This assumption reflects the human learning process, where the teacher tells the learner which are the most important examples (the ones with small loss in the correcting space) that the learner should take into account in order to find a good decision rule.

The regular empirical risk minimization (ERM) finds a hypothesis ĥ ∈ H that minimizes the training error Σ_{i=1}^n l_X(h(x_i), y_i). While the regular ERM minimizes the training error of h directly, Privileged ERM minimizes the training error of h indirectly, via the minimization of the training error of the correcting function φ and the relaxation of the constraint (4).
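
As an illustration of how the objective (6) is evaluated, the following sketch (not from the paper) computes the composite loss (5) and its training average, assuming the per-example losses of h and φ have already been computed and are passed in as arrays; with 0/1 losses the hinge term is 1 exactly on the examples where h errs but φ does not.

import numpy as np

def composite_loss(dec_losses, corr_losses, C):
    # Composite loss (5), evaluated example-wise:
    # (1/C) * l_X*(phi(x*), y) + [l_X(h(x), y) - l_X*(phi(x*), y)]_+
    dec = np.asarray(dec_losses, dtype=float)
    corr = np.asarray(corr_losses, dtype=float)
    return corr / C + np.maximum(dec - corr, 0.0)

def privileged_erm_objective(dec_losses, corr_losses, C):
    # Objective (6): the training average of the composite loss, which
    # Privileged ERM minimizes over all pairs (h, phi) in H x Phi.
    return composite_loss(dec_losses, corr_losses, C).mean()

# Toy usage with 0/1 losses of a fixed pair (h, phi) on five training examples.
print(privileged_erm_objective([0, 1, 0, 1, 0], [0, 1, 1, 0, 0], C=2.0))
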

Let h* be the best possible decision function (in terms of generalization error) in the hypothesis space H. Suppose that for each training example x_i an oracle gives us the value of the loss l_X(h*(x_i), y_i). We use these fixed losses instead of l_X*(φ(x_i*), y_i) and find h that satisfies the following system of inequalities:

  ∀ 1 ≤ i ≤ n:  l_X(h(x_i), y_i) ≤ l_X(h*(x_i), y_i).                                            (7)

We denote the learning algorithm defined by (7) as OracleERM. A straightforward generalization of the proof of Proposition 1 of [9] shows that the generalization error of the hypothesis ĥ found by OracleERM converges to the one of h* with the rate of 1/n. This rate is much faster than the worst-case convergence rate 1/√n of the regular ERM [3]. In this paper we consider a more realistic setting, where the above oracle is not available.

Our subsequent derivations rely heavily on the following definition:

Definition 1.1 A decision function h is uniformly better than the correcting function φ if for any example (x, x*, y) that has non-zero probability, l_X*(φ(x*), y) ≥ l_X(h(x), y).

Given a space H of decision functions and a space Φ of correcting functions we define Φ′ = {φ ∈ Φ | ∃ h ∈ H that is uniformly better than φ}. Note that Φ′ ⊆ Φ and that Φ′ does not contain correcting functions that are too good for H. Our results are based on the following two assumptions:

Assumption 1.2 Φ′ ≠ ∅.

This assumption is not restrictive, since it only means that the optimization problem (3) of Privileged ERM has a feasible solution when the training size n goes to infinity.

Assumption 1.3 There exists a correcting function φ̃ ∈ Φ′ such that for any (x, x*, y) that has non-zero probability, l_X(h*(x), y) = l_X*(φ̃(x*), y).

Put another way, we assume the existence of a correcting function in Φ′ that mimics the losses of h*.

Let r be the learning rate of Privileged ERM when it is run over the joint X × X* space with the space H × Φ of decision and correcting functions. We develop an upper bound for the risk of the decision function found by Privileged ERM. Under the above assumptions this bound converges to the risk of h* with the same rate r. This implies that if the correcting space is good, so that Privileged ERM in the joint X × X* space has a fast learning rate (e.g. 1/n), then Privileged ERM will have the same fast learning rate (e.g. the same 1/n) in the decision space. This is true even if the decision space is hard and the regular ERM in the decision space has a slow learning rate (e.g. 1/√n). We illustrate this result with an artificial learning problem in which the regular ERM in the decision space cannot learn with a rate faster than 1/√n, but the correcting space is good and Privileged ERM learns in the decision space with the rate of 1/n.

The paper has the following structure. In Section 2 we give additional definitions. In Section 3 we review the existing risk bounds that are used to derive our results. Section 4 contains the proof of the risk bound for Privileged ERM. In Section 5 we show an example where Privileged ERM is provably better than the regular ERM. We conclude and give directions for future research in Section 6. Due to space constraints, most of the proofs appear in the supplementary material.

Previous work. The first attempt at a theoretical analysis of LUPI was done by Vapnik and Vashist [9]. In addition to the analysis of learning with an oracle (mentioned above), they considered an algorithm which is close to, but different from, Privileged ERM. They developed a risk bound (Proposition 2 in [9]) for the decision function found by their algorithm. This bound also applies to Privileged ERM. The bound of [9] is tailored to the classification setting, with 0/1 loss functions in the decision and the correcting space. By contrast, our bound holds for any bounded loss functions and allows the loss functions l_X and l_X* to be different. The bound of [9] depends on the generalization error of the correcting function φ̂ found by Privileged ERM. Vapnik and Vashist [9] concluded that if we could bound the convergence rate of φ̂ then this bound would imply a bound on the convergence rate of the decision function found by their algorithm.
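
When the distribution has finite support, as in the example of Section 5, Definition 1.1 and the set Φ′ can be checked mechanically. The sketch below is only an illustration under that finite-support assumption; the loss tables passed in are hypothetical and stand for the losses of each hypothesis on the support points of D.

def uniformly_better(h_losses, phi_losses):
    # Definition 1.1: h is uniformly better than phi if on every point of
    # non-zero probability l_X(h(x), y) <= l_X*(phi(x*), y).
    return all(lh <= lp for lh, lp in zip(h_losses, phi_losses))

def phi_prime(H_losses, Phi_losses):
    # Phi' = {phi in Phi : there exists h in H that is uniformly better than phi}.
    # H_losses / Phi_losses map hypothesis names to their loss sequences over the
    # support points (illustrative dictionaries, not notation from the paper).
    return {p for p, lp in Phi_losses.items()
            if any(uniformly_better(lh, lp) for lh in H_losses.values())}

# Toy usage: h1 is uniformly better than phi1, but no hypothesis dominates phi2.
H_losses = {"h1": [0, 0, 1, 0], "h3": [0, 1, 0, 1]}
Phi_losses = {"phi1": [0, 0, 1, 1], "phi2": [0, 0, 0, 0]}
print(phi_prime(H_losses, Phi_losses))   # {'phi1'}
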
2 Definitions

The triple (x, x*, y) is sampled from a distribution D, which is unknown to the learner. We denote by D_X the marginal distribution over (x, y) and by D_X* the marginal distribution over (x*, y). The distribution D_X is given by nature and the distribution D_X* is constructed by the teacher. The spaces H and Φ of decision and correcting functions are chosen by the learner.

Let R(h) = E_{(x,y)∼D_X}{l_X(h(x), y)} and R(φ) = E_{(x*,y)∼D_X*}{l_X*(φ(x*), y)} be the generalization errors of the decision function h and the correcting function φ respectively. We assume that the loss functions l_X and l_X* have range [0, 1]. This assumption can be satisfied by any bounded loss function by simply dividing it by its maximal value. We denote by h* = argmin_{h∈H} R(h) and φ* = argmin_{φ∈Φ} R(φ) the decision and the correcting function with the minimal generalization error w.r.t. the loss functions l_X and l_X*. Also, we denote by l_01 the 0/1 loss, by R_01(h) = E_{(x,y)∼D_X}{l_01(h(x), y)} the generalization error of h w.r.t. the 0/1 loss, and by h*_01 = argmin_{h∈H} R_01(h) the decision function in H with the minimal generalization 0/1 error. Let

  R̂_l(h, φ) = (1/n) Σ_{i=1}^n l((h, φ), (x_i, x_i*, y_i))   and   R_l(h, φ) = E_{(x,x*,y)∼D}{l((h, φ), (x, x*, y))}     (8)

be respectively the empirical and the generalization error of the hypothesis (h, φ) w.r.t. the loss function l. We denote by (ĥ, φ̂) = argmin_{(h,φ)∈H×Φ} R̂_l(h, φ) the empirical risk minimizer and by (h′, φ′) = argmin_{(h,φ)∈H×Φ} R_l(h, φ) the minimizer of the generalization error w.r.t. the loss function l. Note that in general h′ can be different from h*, and also φ′ can be different from φ*.

Let (H, Φ)′ = {(h, φ) ∈ H × Φ | h is uniformly better than φ}. By Assumption 1.2, (H, Φ)′ ≠ ∅. We will use an additional technical assumption:

Assumption 2.1 There exists a constant A > 0 such that

  inf { E_{(x,x*,y)∼D}{[l_X(h(x), y) − l_X*(φ(x*), y)]_+}  |  (h, φ) ∉ (H, Φ)′,  R(φ) < R(φ̃) } ≥ A.

This assumption is satisfied, for example, in the classification setting when l_X and l_X* are 0/1 loss functions and the probability density function p(x, x*, y) of the underlying distribution D is bounded away from zero for all points with non-zero probability. In this case one can take A = inf{p(x, x*, y) | (x, x*, y) such that p(x, x*, y) ≠ 0}.

The following lemma (proved in Appendix A in the full version of the paper) shows that for sufficiently large C the optimization problems (3) and (6) are asymptotically (when n → ∞) equivalent:

Lemma 2.2 Suppose that Assumptions 1.2, 1.3 and 2.1 hold true. Then there exists a finite C_1 ∈ R such that for any C ≥ C_1, (h′, φ′) ∈ (H, Φ)′. Moreover, h′ = h* and φ′ = φ̃.

In all our subsequent derivations we assume that C has a finite value for which (3) and (6) are equivalent. Later on we will show how to choose the value of C that optimizes the forthcoming risk bound.

The risk bounds presented in this paper are based on the VC-dimension of various function classes. While the definition of VC-dimension for binary functions is well known in the learning community, the one for real-valued functions is less known and we review it here. Let F be a set of real-valued functions f : S → R and let T(F) = {(x, t) ∈ S × R | ∃ f ∈ F s.t. 0 ≤ f(x) ≤ t}. We say that the set T = {(x_i, t_i)}_{i=1}^{|T|} ⊆ T(F) is shattered by F if for any T′ ⊆ T there exists a function f ∈ F such that for any (x_i, t_i) ∈ T′, f(x_i) ≤ t_i, and for any (x_i, t_i) ∈ T \ T′, f(x_i) > t_i. The VC-dimension of F is defined as the VC-dimension of the set T(F), namely the maximal size of a set T ⊆ T(F) that is shattered by F.
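
The shattering condition in this definition can be checked by brute force when a small finite set of functions is used as a stand-in for F. The sketch below is only an illustration of the definition (the classes used in the paper are infinite); the constant functions in the toy example are an assumption made here.

from itertools import combinations

def is_shattered(points, functions):
    # Check the shattering condition from the definition above: `points` is a
    # list of pairs (x_i, t_i); for every subset T' of the points there must be
    # some f with f(x_i) <= t_i on T' and f(x_i) > t_i outside T'.
    # `functions` is a finite list of callables standing in for F.
    n = len(points)
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            chosen = set(subset)
            if not any(all((f(x) <= t) == (i in chosen)
                           for i, (x, t) in enumerate(points))
                       for f in functions):
                return False
    return True

# Toy example: the class of constant functions f_a(x) = a, a in {0, 0.5, 1},
# shatters a single point but not two points with crossing thresholds.
F = [lambda x, a=a: a for a in (0.0, 0.5, 1.0)]
print(is_shattered([(0.0, 0.5)], F))                  # True
print(is_shattered([(0.0, 0.25), (1.0, 0.75)], F))    # False
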

3 Review of existing excess risk bounds with fast convergence rates

We derive our risk bounds from generic excess risk bounds developed by Massart and Nédélec [6] and generalized by Giné and Koltchinskii [4] and Koltchinskii [5]. In this paper we use the version of the bounds given in [4] and [5]. Let F be a space of hypotheses f : S → S′, let l : S′ × {−1, +1} → R be a real-valued loss function such that 0 ≤ l(f(x), y) ≤ 1 for any f ∈ F and any (x, y). Let f* = argmin_{f∈F} E_{(x,y)}{l(f(x), y)}, let f̂ = argmin_{f∈F} Σ_{i=1}^n l(f(x_i), y_i), and let D > 0 be a constant such that for any f ∈ F,

  Var_{(x,y)}{l(f(x), y) − l(f*(x), y)} ≤ D · E_{(x,y)}{l(f(x), y) − l(f*(x), y)}.                (9)

This condition is a generalization of Tsybakov's low-noise condition [7] to arbitrary loss functions and arbitrary hypothesis spaces. The constant D in (9) characterizes the error surface of the hypothesis space F. Suppose that E_{(x,y)}{l(f(x), y) − l(f*(x), y)} is very small, namely f is nearly optimal. If f is almost the same as f* then the variance on the left-hand side of (9), as well as the value of D, will be small. But if f differs significantly from f* then the variance on the left-hand side of (9), as well as the value of D, will be large. Thus, if we take the variance on the left-hand side of (9) as a measure of distance between f and f*, then hypothesis spaces with large and small D can be visualized as shown in Figure 1.

[Figure 1: Visualization of the hypothesis spaces. (a) Hypothesis space with small D. (b) Hypothesis space with large D. The horizontal axis measures the distance (in terms of the variance) between a hypothesis f and the best hypothesis f* in F. The vertical axis is the minimal error of hypotheses in F at the fixed distance from f*. Note that the error function displayed in the graphs can be non-continuous. The large value of D in the hypothesis space in graph (b) is caused by hypothesis A, which is significantly different from f* but has nearly-optimal error.]

Let V be the VC-dimension of F. The following theorem is a straightforward generalization of Theorem 5.8 in [5].

Theorem 3.1 ([5]) There exists a constant K > 0 such that if n > V D^2 then for any δ > 0, with probability of at least 1 − δ,

  E_{(x,y)}{l(f̂(x), y)} ≤ E_{(x,y)}{l(f*(x), y)} + (K D / n) ( V log(n / (V D^2)) + ln(1/δ) ).    (10)

Let B_n = (V log n + log(1/δ)) / n. If the condition of Theorem 3.1 does not hold, namely if n ≤ V D^2, then we can use the following fallback risk bound:

Theorem 3.2 ([1, 8]) There exists a constant K such that for any δ > 0, with probability of at least 1 − δ,

  E_{(x,y)}{l(f̂(x), y)} ≤ E_{(x,y)}{l(f*(x), y)} + K ( √( E_{(x,y)}{l(f*(x), y)} · B_n ) + B_n ).   (11)

Definition 3.3 Let n_T = n_T(E_{(x,y)}{l(f*(x), y)}, V, δ) be a constant such that for all n < n_T it holds that E_{(x,y)}{l(f*(x), y)} < B_n.

For n ≤ n_T the bound (11) has a convergence rate of 1/n, and for n > n_T the bound (11) has a convergence rate of 1/√n. The main difference between (10) and (11) is the fast convergence rate of 1/n vs. the slow one of 1/√n in the regime n > max(n_T, V D^2). By Theorem 3.1, starting from n > n(D) = V D^2 we always have the convergence rate of 1/n. Thus, the smaller the value of D, the smaller will be the threshold n(D) for obtaining the fast convergence rate of 1/n.
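
To see the two regimes concretely, the right-hand sides of (10) and (11) can be evaluated numerically. In the sketch below the unspecified absolute constants K are simply set to 1, which the theorems do not guarantee, so the numbers only illustrate how the two bounds scale with n.

import numpy as np

def fast_excess_bound(n, V, D, delta, K=1.0):
    # Excess-risk term of bound (10); Theorem 3.1 applies only when n > V * D**2.
    # K stands for the unspecified absolute constant (set to 1 for illustration).
    assert n > V * D**2, "Theorem 3.1 does not apply for this n"
    return K * D * (V * np.log(n / (V * D**2)) + np.log(1.0 / delta)) / n

def fallback_excess_bound(n, V, risk_star, delta, K=1.0):
    # Excess-risk term of bound (11), with B_n = (V log n + log(1/delta)) / n
    # and risk_star standing for E{l(f*(x), y)}.
    B = (V * np.log(n) + np.log(1.0 / delta)) / n
    return K * (np.sqrt(risk_star * B) + B)

# Doubling n roughly halves (10), while (11) shrinks only by about sqrt(2)
# once n exceeds the threshold n_T of Definition 3.3.
for n in (1000, 2000, 4000):
    print(n, fast_excess_bound(n, V=4, D=2.0, delta=0.05),
             fallback_excess_bound(n, V=4, risk_star=0.25, delta=0.05))
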

4 Upper Risk Bound

For any C ≥ 1, any (x, x*, y), any h ∈ H and φ ∈ Φ, and any loss functions l_X and l_X*,

  l_X(h(x), y) ≤ l_X*(φ(x*), y) + C [l_X(h(x), y) − l_X*(φ(x*), y)]_+.

Hence, using (5) we obtain that

  R(ĥ) = E_{(x,y)}{l_X(ĥ(x), y)} ≤ C · E_{(x,x*,y)}{l((ĥ, φ̂), (x, x*, y))} = C · R_l(ĥ, φ̂).        (12)

Let l_1(h, h*, x, y) = l_X(h(x), y) − l_X(h*(x), y) and let D_H ≥ 0 be a constant such that for any h ∈ H,

  D_H · E_{(x,y)}{l_1(h, h*, x, y)} ≥ Var_{(x,y)}{l_1(h, h*, x, y)}.                               (13)

Similarly, let l_2(h, h′, φ, φ′, x, x*, y) = l((h, φ), (x, x*, y)) − l((h′, φ′), (x, x*, y)) and let D_{H,Φ} ≥ 0 be a constant such that for all (h, φ) ∈ H × Φ,

  D_{H,Φ} · E_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)} ≥ Var_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)}.   (14)

Let L(H, Φ) = {l((h, φ), (·, ·, ·)) | h ∈ H, φ ∈ Φ} be the set of loss functions l corresponding to hypotheses from H × Φ and let V_{L(H,Φ)} be the VC-dimension of L(H, Φ). Similarly, let L(H) = {l_X(h(·), ·) | h ∈ H} and L(Φ) = {l_X*(φ(·), ·) | φ ∈ Φ} be the sets of loss functions that correspond to the hypotheses in H and Φ, and let V_{L(H)} and V_{L(Φ)} be the VC-dimensions of L(H) and L(Φ) respectively. Note that if l_X = l_01 then V_{L(H)} is also the VC-dimension of H (the same holds for V_{L(Φ)}).

Lemma 4.1 V_{L(H,Φ)} = V_{L(H)} + V_{L(Φ)}.

Proof See Appendix C in the full version of the paper.

We apply Theorem 3.1 to the hypothesis space H × Φ and the loss function l((h, φ), (x, x*, y)) and obtain that there exists a constant K > 0 such that if n > V_{L(H,Φ)} D_{H,Φ}^2 then for any δ > 0, with probability at least 1 − δ,

  R_l(ĥ, φ̂) ≤ R_l(h′, φ′) + (K D_{H,Φ} / n) ( V_{L(H,Φ)} ln(n / (V_{L(H,Φ)} D_{H,Φ}^2)) + ln(1/δ) ).

Using (12) we obtain that

  R(ĥ) ≤ C · R_l(h′, φ′) + (C K D_{H,Φ} / n) ( V_{L(H,Φ)} ln(n / (V_{L(H,Φ)} D_{H,Φ}^2)) + ln(1/δ) ).   (15)

It follows from Assumption 1.3 and Lemma 2.2 that

  R_l(h′, φ′) = (1/C) R(φ′) = (1/C) R(φ̃) = (1/C) R(h*).                                           (16)

We substitute (16) into (15) and obtain that there exists a constant K > 0 such that if n > V_{L(H,Φ)} D_{H,Φ}^2 then for any δ > 0, with probability at least 1 − δ,

  R(ĥ) ≤ R(h*) + (C K D_{H,Φ} / n) ( V_{L(H,Φ)} ln(n / (V_{L(H,Φ)} D_{H,Φ}^2)) + ln(1/δ) ).

We bound V_{L(H,Φ)} by Lemma 4.1 and obtain our final risk bound, which is summarized in the following theorem:

Theorem 4.2 Suppose that Assumptions 1.2, 1.3 and 2.1 hold. Let D_{H,Φ} be as defined in (14), C_1 be as defined in Lemma 2.2, and V_{L(H,Φ)} = V_{L(H)} + V_{L(Φ)}. Suppose that C ≥ C_1 and n > V_{L(H,Φ)} D_{H,Φ}^2. Then for any δ > 0, with probability of at least 1 − δ,

  R(ĥ) ≤ R(h*) + (C K D_{H,Φ} / n) ( V_{L(H,Φ)} ln(n / (V_{L(H,Φ)} D_{H,Φ}^2)) + ln(1/δ) ),        (17)

where K > 0 is a constant.

According to this bound, R(ĥ) converges to R(h*) with the rate of 1/n. If Assumption 1.3 does not hold then it is easy to see that we obtain the same bound as (17), but with R(h*) replaced by R(φ′). In this case the upper bound on R(ĥ) converges to R(φ′) with the rate of 1/n.

We now provide further analysis of the risk bound (17). Let l_3(φ, φ′, x*, y) = l_X*(φ(x*), y) − l_X*(φ′(x*), y) and let D_Φ ≥ 0 be a constant such that for any φ ∈ Φ,

  D_Φ · E_{(x*,y)}{l_3(φ, φ′, x*, y)} ≥ Var_{(x*,y)}{l_3(φ, φ′, x*, y)}.                           (18)

Similarly, let D′_{H,Φ} ≥ 0 be a constant such that for all (h, φ) ∈ (H × Φ) \ (H, Φ)′,

  D′_{H,Φ} · E_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)} ≥ Var_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)}.

Lemma 4.3 D_{H,Φ} ≤ max( D_Φ / C, D′_{H,Φ} ).

Proof See Appendix B in the full version of the paper.

By Lemma 4.3, C · D_{H,Φ} ≤ max(D_Φ, C · D′_{H,Φ}). Since the loss function l_2 depends on C, the constant D_{H,Φ} depends on C too. Thus, ignoring the logarithmic term in (17), the optimal value of C is the one that is larger than C_1 and minimizes C · D_{H,Φ}. We now show that such a minimum indeed exists. By the definition of the loss function l_2,

  0 < lim_{C→∞} sup_{(h,φ) ∈ (H×Φ) \ (H,Φ)′}  Var_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)} / E_{(x,x*,y)}{l_2(h, h′, φ, φ′, x, x*, y)}  ≤ 1.   (19)

Therefore for very large C it holds that D_{H,Φ} ≥ s > 0, where s is the value of the above limit. Consequently lim_{C→∞} C · D_{H,Φ} = ∞. Since the function g(C) = C · D_{H,Φ} is continuous and finite at C = C_1, there exists a point C = C* ∈ [C_1, ∞) that minimizes it.

5 When Privileged ERM is provably better than the regular ERM

We show an example that demonstrates the difference between empirical risk minimization in the X space and empirical risk minimization with privileged information in the joint X × X* space. In particular, we show in this example that for not too small training sizes (as specified by the conditions of (11) and of Theorem 4.2) the learning rate of the regular ERM in the X space is 1/√n while the learning rate of Privileged ERM in the joint X × X* space is 1/n. We consider the classification setting and all loss functions in our example are the 0/1 loss.

Let D_X = {D_X(ε) | 0 < ε < 0.1} be an infinite family of distributions of examples in the X space. All distributions in D_X have non-zero support on four points, denoted by X_1, X_2, X_3 and X_4. We assume that these points lie on a 1-dimensional line, as shown in Figure 2(a). Figure 2(a) also shows the probability mass of each point in the distribution D_X(ε). The hypothesis space H consists of the hypotheses h_t(x) = sign(x − t) and h̄_t(x) = −sign(x − t). The best hypothesis in H is h_1 and its generalization error is 1/4 − 2ε. The hypothesis space H also contains a hypothesis h_3, which is slightly worse than h_1 and has generalization error of 1/4 + ε. It can be verified that for a fixed D_X(ε) and H the constant D_H (defined in equation (13)) is

  D_H = 1/(6ε) − 1/3 − ε ≤ 1/(6ε).                                                                 (20)

Note that the inequality in (20) is very tight since ε can be arbitrarily small. The VC-dimension V_H of H is 2. Suppose that ε is sufficiently small, such that V_H D_H^2 > n_T(1/4 − 2ε, V_H, δ), where the function n_T(·, ·, ·) is defined in Definition 3.3. In order to use the risk bound (10) with our D_X and H, the condition

  n > V_H D_H^2 ≈ 1/(18ε^2)                                                                         (21)

should be satisfied. But since ε can be very small, the condition (21) is not satisfied for a large range of n's. Hence, according to (11), for distributions D_X(ε) that satisfy n_T(1/4 − 2ε, 2, δ) ≤ 1/(18ε^2) we obtain that R_01(ĥ) converges to R_01(h*) with the rate of at least 1/√n.
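
The effect of ε on the threshold (21) can be tabulated directly from (20). This small sketch just plugs in the closed forms quoted above; it is an illustration of the statement, not a computation taken from the paper.

def D_H(eps):
    # The constant D_H of equation (20); approximately 1/(6*eps) for small eps.
    return 1.0 / (6.0 * eps) - 1.0 / 3.0 - eps

def fast_rate_threshold(eps, V_H=2):
    # Condition (21): bound (10) applies to the regular ERM in X only when
    # n > V_H * D_H**2, which is approximately 1/(18 * eps**2).
    return V_H * D_H(eps) ** 2

for eps in (0.05, 0.01, 0.001):
    print(eps, round(fast_rate_threshold(eps)))
# The threshold grows like 1/(18 * eps**2): it is already in the hundreds for
# eps = 0.01 and in the tens of thousands for eps = 0.001.
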

[Figure 2: The X and X* spaces. (a) The X space. (b) The X* space.]

The following lower bound shows that R_01(ĥ) converges to R_01(h*) with the rate of at most 1/√n.

Lemma 5.1 Suppose that ε < 1/16. Let δ′ = exp(−20nε^2). Then for any n > 256, with probability at least δ′, R_01(ĥ) ≥ R_01(h*) + √( ln(1/δ′) / (20n) ).

By combining the upper and lower bounds we obtain that the convergence rate of R_01(ĥ) to R_01(h*) is exactly 1/√n. The proof of the lower bound appears in Appendix D in the full version of the paper.

Suppose that the teacher constructed the distribution D_X*(ε) of examples in the X* space in the following way. D_X*(ε) has non-zero support on four points, denoted by X*_1, X*_2, X*_3 and X*_4, that lie on a 1-dimensional line, as shown in Figure 2(b). Figure 2(b) also shows the probability mass of each point in the X* space. We assume that the joint distribution of (X, X*) has non-zero support only on the points (X_1, X*_1), (X_2, X*_2), (X_3, X*_3) and (X_4, X*_4). The hypothesis space Φ consists of the hypotheses φ_t(x*) = sign(x* − t) and φ̄_t(x*) = −sign(x* − t). The best hypothesis in Φ is φ_2 and its generalization error is 0. However, there is no h ∈ H that is uniformly better than φ_2. The best hypothesis in Φ among those that have a uniformly better hypothesis in H is φ_1, and its generalization error is 1/4 − 2ε; h_1 is uniformly better than φ_1. It can be verified that for such D_X*(ε) and Φ the constant D_Φ (defined in equation (18)) is

  D_Φ = (11/16 − 3ε − 4ε^2) / (1/4 + 2ε) ≤ 11/4.                                                   (22)

Note that the inequality in (22) is very tight since ε can be arbitrarily small. Moreover, it can be verified that the value of C that minimizes C · D_{H,Φ} is C* = 2.6. For C = C* it holds that D′_{H,Φ} = 1.71 and D_Φ / C* is smaller, so by Lemma 4.3 D_{H,Φ} ≤ 1.71. It is easy to see that our example satisfies Assumptions 1.2 and 1.3 (the latter assumption is satisfied with φ̃ = φ_1). Also, it can be verified that Assumption 2.1 is satisfied with A = 1/4 − 2ε, and that C_1 = 1.1 < C* satisfies Lemma 2.2. The VC-dimension of Φ is 2. Hence, by Theorem 4.2 and Lemma 4.3, if n > (2 + 2) · 1.71^2 ≈ 11.7 then R_01(ĥ) converges to R_01(h*) with the rate of at least 1/n. Since our bounds on D_Φ and D′_{H,Φ} are independent of ε, the convergence rate of 1/n holds for any distribution in D_X.

We obtained that for 11.7 < n ≤ 1/(18ε^2) the upper bound (17) converges to R_01(h*) with the rate of 1/n, while the upper bound (11) converges to R_01(h*) with the rate of 1/√n. This improvement was possible due to the teacher's construction of D_X*(ε) and the learner's choice of Φ. The hypothesis h_3 caused the value of D_H to be large and thus prevented us from obtaining the 1/n convergence rate for a large range of n's. We constructed D_X*(ε) and Φ in such a way that Φ does not contain a hypothesis φ that has exactly the same dichotomy as the bad hypothesis h_3. With such a construction, any φ ∈ Φ such that h_3 is uniformly better than φ has generalization error significantly larger than the one of h_3. For example, the best hypothesis in Φ for which h_3 is uniformly better is φ_0, and its generalization error is 1/2.
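
The two thresholds in this example can be put side by side: the sample size after which the fast bound (10) applies to the regular ERM in X, and the sample size after which Theorem 4.2 applies to Privileged ERM in the joint space. A small sketch using the values quoted above (V_H = 2 and D_H ≈ 1/(6ε) for the regular ERM; V_{L(H,Φ)} = 4 and D_{H,Φ} ≤ 1.71 at C* = 2.6 for Privileged ERM):

def regular_erm_threshold(eps, V_H=2):
    # n > V_H * D_H**2, approximately 1/(18 * eps**2), before the fast-rate
    # bound (10) applies to the regular ERM in the decision space X.
    return V_H * (1.0 / (6.0 * eps)) ** 2

def privileged_erm_threshold(V_joint=4, D_joint=1.71):
    # n > V_{L(H,Phi)} * D_{H,Phi}**2, about 11.7, before the bound (17) of
    # Theorem 4.2 applies to Privileged ERM in the joint X x X* space.
    return V_joint * D_joint ** 2

for eps in (0.05, 0.01, 0.001):
    print(eps, regular_erm_threshold(eps), privileged_erm_threshold())
# Between roughly 12 examples and roughly 1/(18 * eps**2) examples Privileged ERM
# already enjoys its 1/n rate, while the regular ERM is still at the 1/sqrt(n) rate.
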
6 Conclusions

We formulated the algorithm of empirical risk minimization with privileged information and derived a risk bound for it. Our risk bound outlines the conditions on the correcting space that, if satisfied, allow fast learning in the decision space, even if the original learning problem in the decision space is very hard. We showed an example where the privileged information provably and significantly improves the learning rate.

In this paper we showed that a good correcting space can improve the learning rate from 1/√n to 1/n. But, having a good correcting space, can we achieve a learning rate faster than 1/n? Another interesting problem is to analyze Privileged ERM when the learner does not completely trust the teacher. This condition translates to the constraint l_X(h(x_i), y_i) ≤ l_X*(φ(x_i*), y_i) + ε̃ in (3) and to the term [l_X(h(x), y) − l_X*(φ(x*), y) − ε̃]_+ in (6), where ε̃ ≥ 0 is a hyperparameter. Finally, an important direction is to develop risk bounds for SVM+ (which is a regularized version of Privileged ERM) and to show when it is provably better than SVM.

References

[1] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.
[2] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[3] L. Devroye and G. Lugosi. Lower bounds in pattern recognition and learning. Pattern Recognition, 28(7):1011-1018, 1995.
[4] E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. Annals of Probability, 34(3):1143-1216, 2006.
[5] V. Koltchinskii. 2008 Saint Flour lectures: Oracle inequalities in empirical risk minimization and sparse recovery problems. Available at fodava.gatech.edu/files/reports/fodava pdf.
[6] P. Massart and E. Nédélec. Risk bounds for statistical learning. Annals of Statistics, 34(5):2326-2366, 2006.
[7] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135-166, 2004.
[8] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, 2nd edition, 2006.
[9] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544-557, 2009.
[10] V. Vapnik, A. Vashist, and N. Pavlovich. Learning using hidden information: Master class learning. In Proceedings of the NATO Workshop on Mining Massive Data Sets for Security, 2008.
