Exponential Convergence Rates in Classification

Exponential Convergence Rates in Classification

Vladimir Koltchinskii and Olexandra Beznosova

Department of Mathematics and Statistics, The University of New Mexico, Albuquerque, NM 87131, U.S.A.
vlad@math.unm.edu, beznosik@math.unm.edu

Abstract. Let (X, Y) be a random couple, X being an observable instance and Y ∈ {−1, 1} being a binary label to be predicted based on an observation of the instance. Let (X_i, Y_i), i = 1, ..., n be training data consisting of n independent copies of (X, Y). Consider a real valued classifier f̂_n that minimizes the following penalized empirical risk

n^{−1} Σ_{i=1}^n l(Y_i f(X_i)) + λ ||f||² → min, f ∈ H,

over a Hilbert space H of functions with norm ||·||, l being a convex loss function and λ > 0 being a regularization parameter. In particular, H might be a Sobolev space or a reproducing kernel Hilbert space. We provide some conditions under which the generalization error of the corresponding binary classifier sign(f̂_n) converges to the Bayes risk exponentially fast.

1 Introduction

Let (S, d) be a metric space and (X, Y) be a random couple taking values in S × {−1, 1} with joint distribution P. The distribution of X (which is a measure on the Borel σ-algebra in S) will be denoted by Π. Let (X_i, Y_i), i ≥ 1 be a sequence of independent copies of (X, Y). Here and in what follows all random variables are defined on some probability space (Ω, Σ, P). Let H be a Hilbert space of functions on S such that H is dense in the space C(S) of all continuous functions on S and, in addition,

∀ x, y ∈ S: |f(x)| ≤ ||f|| and |f(x) − f(y)| ≤ ||f|| d(x, y). (1)

Here ||·|| = ||·||_H is the norm of H and ⟨·, ·⟩ = ⟨·, ·⟩_H is its inner product. We have in mind two main examples. In the first one, S is a compact domain in R^d with smooth boundary. For any s ≥ 1, one can define the following inner product in the space C^∞(S) of all infinitely differentiable functions in S:

⟨f, g⟩_s := Σ_{|α| ≤ s} ∫_S D^α f D^α g dx.

(Partially supported by NSF grant DMS.)

Here α = (α_1, ..., α_d), α_j = 0, 1, ..., |α| := Σ_{i=1}^d α_i and

D^α f = ∂^{|α|} f / (∂x_1^{α_1} ... ∂x_d^{α_d}).

The Sobolev space H^s(S) is the completion of (C^∞(S), ⟨·, ·⟩_s). There is also a version of the definition for any real s > 0 that utilizes Fourier transforms. If s > d/2 + 1, then it follows from Sobolev's embedding theorems that conditions (1) hold with the metric d being the Euclidean distance (possibly, after a proper rescaling of the inner product or of the metric d to make the constants equal to 1).

In the second example, S is a metric compact and H = H_K is the reproducing kernel Hilbert space (RKHS) generated by a Mercer kernel K. This means that K is a continuous symmetric nonnegatively definite kernel and H_K is defined as the completion of the linear span of the functions {K_x : x ∈ S}, K_x(y) := K(x, y), with respect to the following inner product:

⟨Σ_i α_i K_{x_i}, Σ_j β_j K_{y_j}⟩_K := Σ_{i,j} α_i β_j K(x_i, y_j).

It is well known that H_K can be identified with a subset of C(S) and

f ∈ H_K ⟹ f(x) = ⟨f, K_x⟩_K,

implying that

|f(x)| ≤ ||f||_K sup_{x ∈ S} ||K_x||_K and |f(x) − f(y)| ≤ ||f||_K ||K_x − K_y||_K,

so again conditions (1) hold with d(x, y) := ||K_x − K_y||_K (as before, a simple rescaling is needed to ensure that the constants are equal to 1).

In binary classification problems, it is common to look for a real valued classifier f̂_n that solves the following penalized empirical risk minimization problem

n^{−1} Σ_{i=1}^n l(Y_i f(X_i)) + λ ||f||² → min, f ∈ H, (2)

where l is a nonnegative decreasing convex loss function such that l ≥ I_{(−∞,0]} and λ > 0 is a regularization parameter. For instance, if l is the hinge loss, i.e. l(u) = (1 − u) ∨ 0, and ||·|| is an RKHS-norm, this is a standard approach in kernel machines classification. Given a real valued classifier f : S → R, the corresponding binary classifier is typically defined as x ↦ sign(f(x)), where sign(u) = +1 for u ≥ 0 and −1 otherwise. The generalization error, or risk, of f is then

R_P(f) := P{(x, y) : y ≠ sign(f(x))}.
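To make problem (2) concrete, here is a minimal numerical sketch (not from the paper): penalized empirical risk minimization over the RKHS generated by a Gaussian kernel, with the logit loss used later in the paper. By the representer theorem the minimizer has the form f = Σ_j c_j K(X_j, ·) with ||f||² = Σ_{i,j} c_i c_j K(X_i, X_j), so plain gradient descent on the coefficients c suffices; the toy data, kernel width, step size and iteration count are illustrative assumptions.

```python
import math

# Toy 1-D training sample with labels Y_i in {-1, +1} (illustrative data).
X = [-2.0, -1.0, 1.0, 2.0]
Y = [-1, -1, 1, 1]
n = len(X)
lam = 0.01  # regularization parameter lambda

def kernel(s, t):
    # Gaussian (Mercer) kernel; any continuous symmetric
    # nonnegatively definite kernel would do.
    return math.exp(-(s - t) ** 2)

K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]

LN2 = math.log(2.0)
def loss(u):   # logit loss l(u) = log_2(1 + e^{-u})
    return math.log(1.0 + math.exp(-u)) / LN2
def dloss(u):  # l'(u) = -1 / ((1 + e^u) ln 2)
    return -1.0 / ((1.0 + math.exp(u)) * LN2)

def objective(c):
    # n^{-1} sum_i l(Y_i f(X_i)) + lam ||f||^2 with f = sum_j c_j K(X_j, .)
    f = [sum(K[i][j] * c[j] for j in range(n)) for i in range(n)]
    norm2 = sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))
    return sum(loss(Y[i] * f[i]) for i in range(n)) / n + lam * norm2

def gradient(c):
    f = [sum(K[i][j] * c[j] for j in range(n)) for i in range(n)]
    g = []
    for j in range(n):
        g_j = sum(dloss(Y[i] * f[i]) * Y[i] * K[i][j] for i in range(n)) / n
        g_j += 2.0 * lam * sum(K[j][i] * c[i] for i in range(n))
        g.append(g_j)
    return g

c = [0.0] * n
obj0 = objective(c)
for _ in range(2000):  # plain gradient descent (illustrative choice)
    g = gradient(c)
    c = [c[j] - 1.0 * g[j] for j in range(n)]
obj1 = objective(c)

def f_hat(x):
    return sum(c[j] * kernel(x, X[j]) for j in range(n))

# sign(f_hat) is the resulting binary classifier
preds = [1 if f_hat(x) >= 0 else -1 for x in X]
```

On this separable toy sample the objective decreases from its value at c = 0 and sign(f̂) recovers the labels; with a small λ the penalty only keeps ||f̂|| under control, as in the proof below.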

It is well known that the minimum of R_P(f) over all measurable functions f is attained at the regression function η defined as η(x) := E(Y | X = x). The corresponding binary classifier sign(η(x)) is called the Bayes classifier, the quantity R* := R_P(η) is called the Bayes risk and, finally, the quantity R_P(f) − R* is often referred to as the excess risk of a classifier f. Our goal in this note is to show that under some (naturally restrictive) assumptions the expectation of the excess risk of f̂_n converges to 0 exponentially fast as n → ∞. Recently, Audibert and Tsybakov [1] observed a similar phenomenon in the case of plug-in classifiers and our analysis here continues this line of work.

Denote

δ(P) := sup{δ > 0 : Π{x : |η(x)| ≤ δ} = 0}.

We will assume that

(a) η is a Lipschitz function with constant L > 0 (which, for the sake of simplicity of notations, will be assumed to be 1 in what follows): |η(x) − η(y)| ≤ L d(x, y);
(b) δ(P) > 0.

These will be the two main conditions that guarantee the possibility of exponentially fast convergence rates of the generalization error to the Bayes risk. Note that condition (b), which is an extreme case of Tsybakov's low noise assumption, means that there exists δ > 0 such that Π-a.e. either η(x) ≥ δ, or η(x) ≤ −δ. The function η (as a conditional expectation) is defined up to Π-a.e. Condition (a) means that there exists a smooth (Lipschitz) version of this conditional expectation. Since smooth functions cannot jump immediately from the value −δ to the value δ, the combination of conditions (a) and (b) essentially means that there should be a wide enough corridor between the regions {η ≥ δ} and {η ≤ −δ}, but the probability of getting into this corridor is zero.
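Condition (b) can be illustrated on a toy discrete distribution (the point masses and η values below are invented for illustration): δ(P) is the supremum of δ for which the "corridor" {x : |η(x)| ≤ δ} carries no Π-mass, and on a finite sample space it is simply the smallest attained value of |η|.

```python
# Discrete marginal Pi on four points, with regression function eta(x);
# the values are purely illustrative.
points = [0.0, 1.0, 2.0, 3.0]
pi_mass = {0.0: 0.25, 1.0: 0.25, 2.0: 0.25, 3.0: 0.25}
eta = {0.0: -0.6, 1.0: -0.3, 2.0: 0.3, 3.0: 0.6}

def mass_of_corridor(delta):
    # Pi{x : |eta(x)| <= delta}
    return sum(pi_mass[x] for x in points if abs(eta[x]) <= delta)

# delta(P) = sup{delta > 0 : Pi{|eta| <= delta} = 0}. On a finite set the
# corridor is empty exactly for delta below the smallest attained |eta|,
# so the supremum equals min |eta| (note the sup itself is not attained:
# the corridor at delta(P) already contains the points with |eta| = delta(P)).
delta_P = min(abs(eta[x]) for x in points)
corridor_mass_inside = mass_of_corridor(delta_P / 2)
```

Here δ(P) = 0.3: every corridor {|η| ≤ δ} with δ < 0.3 has Π-mass zero, while the corridor at level 0.3 already carries mass 0.5.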
The fact that in such situations it is possible to construct classifiers that converge to Bayes exponentially fast is essentially rather simple, it reduces to a large deviation type phenomenon, and it is even surprising that, to the best of our knowledge, the possibility of such superfast convergence rates in classification had not been observed before Audibert and Tsybakov [1] (we apologize if someone, in fact, did it earlier). Subtle results on convergence rates of the generalization error of large margin classifiers to the Bayes risk have been obtained relatively recently, see the papers by Bartlett, Jordan and McAuliffe [3] and by Blanchard, Lugosi and Vayatis [5] on boosting, and the papers by Blanchard, Bousquet and Massart [4] and by Scovel and Steinwart [7] on SVM. These papers rely heavily on general exponential inequalities in abstract empirical risk minimization in the spirit of the papers by Bartlett, Bousquet and Mendelson [2] or Koltchinskii [6] (or even earlier work by Birgé and Massart in the 90s). The rates of convergence in classification based on this

general approach are at best of the order O(n^{−1}). In classification problems, there are many relevant probabilistic, analytic and geometric parameters to play with when one studies the convergence rates. For instance, both papers [4] and [7] deal with SVM classifiers (so, essentially, with problem (2) in the case when H is an RKHS). In [4], the convergence rates are studied under the assumption (b) above and under some conditions on the eigenvalues of the kernel. In [7], the authors determine the convergence rates under an assumption on the entropy of the unit ball in the RKHS of the same type as our assumption (3) below, under Tsybakov's low noise assumption and some additional conditions of geometric nature. The fact that under the somewhat more restrictive assumptions imposed in this paper even exponential convergence rates are possible indicates that, probably, we have not understood to the end the rather subtle interplay between the various parameters that influence the behaviour of this type of classifiers.

2 Main Result

We now turn to the precise formulation of the results. Our goal will be to explain the main ideas rather than to give the results in full generality, so we will make below several simplifying assumptions. First, we need some conditions on the loss function l and, to get this out of the way, we will just assume that l is the so called logit loss,

l(u) = log_2(1 + e^{−u}), u ∈ R

(other loss functions of the same type that are decreasing, strictly convex, satisfy the assumption l ≥ I_{(−∞,0]} and grow slower than u² as |u| → ∞ will also do). We denote (l • f)(x, y) := l(yf(x)). For a function g on S × {−1, 1}, we write

P g = ∫_{S×{−1,1}} g dP = E g(X, Y).

Let P_n be the empirical measure based on the training data (X_i, Y_i), i = 1, ..., n. We will write

P_n g = ∫_{S×{−1,1}} g dP_n = n^{−1} Σ_{i=1}^n g(X_i, Y_i).

We use similar notations for functions defined on S. A simple and well known computation shows that the function f ↦ P(l • f) attains its minimum at f* defined by

f*(x) = log[(1 + η(x)) / (1 − η(x))].

We will assume in what follows that f* ∈ H. This assumption is rather restrictive. Since functions in H are uniformly bounded (see (1)) it means, in particular, that

η is bounded away from both +1 and −1. Although there is a version of the main result below without this assumption, we are not discussing it in this note.

Next we need an assumption on the so called uniform L_2-entropy of the unit ball in H, B_H := {f ∈ H : ||f|| ≤ 1}. Given a probability measure Q on S, let N(B_H; L_2(Q); ε) denote the minimal number of L_2(Q)-balls needed to cover B_H. Suppose that for some ρ ∈ (0, 2) and for some constant A > 0

∀ Q ∀ ε > 0 : log N(B_H; L_2(Q); ε) ≤ (A/ε)^ρ. (3)

Denote by B(x, δ) the open ball in (S, d) with center x and radius δ. Also, let H(x, δ) be the set of all functions h ∈ H satisfying the following conditions:

(i) ∀ y ∈ S: 0 ≤ h(y) ≤ 2δ;
(ii) h ≥ δ on B(x, δ/2);
(iii) ∫_{B(x,δ)^c} h dΠ ≤ δ² Π(B(x, δ/2)).

It follows from (i)–(iii) that

δ Π(B(x, δ/2)) ≤ E h(X) = ∫_S h dΠ ≤ 2δ Π(B(x, δ)) + δ² Π(B(x, δ/2)).

Since there exists a continuous function h such that 0 ≤ h ≤ (3/2)δ, h ≥ (4/3)δ on B(x, δ/2) and h = 0 on B(x, δ)^c, and, on the other hand, H is dense in C(S), it is easy to see that H(x, δ) ≠ ∅. Denote

q(x, δ) := inf_{h ∈ H(x,δ)} ||h||.

The quantity q(x, δ) is often bounded from above uniformly in x ∈ S by a decreasing function of δ, say by q(δ), and this will be assumed in what follows. Often, q(δ) grows as δ^{−γ}, δ → 0, for some γ > 0.

Example. For instance, if H = H^s(S) is a Sobolev space of functions in a compact domain S ⊂ R^d, s > d/2 + 1, define

h(y) := δ φ((x − y)/δ),

where φ ∈ C^∞(R^d), 0 ≤ φ ≤ 2, φ(x) ≥ 1 if |x| ≤ 1/2 and φ(x) = 0 if |x| ≥ 1. Then h satisfies conditions (i)–(iii) (moreover, h = 0 on B(x, δ)^c). A straightforward computation of the Sobolev norm of h shows that

||h||_{H^s(S)} ≤ C δ^{1+d/2−s},

implying that q(x, δ) is uniformly bounded from above by q(δ) = C δ^{−γ} with γ = s − d/2 − 1. Similar results are also true in the case of RKHS for some kernels.

Let

p(x, δ) := δ² Π(B(x, δ/2)).

In what follows, K, C > 0 will denote sufficiently large numerical constants (whose precise values might change from place to place). Recall our assumption that δ(P) > 0. In this case it is also natural to assume that for all δ ≤ δ(P)/K and for all x such that |η(x)| ≥ δ(P)

p(x, δ) ≥ p(δ) > 0

for some fixed function p. This would be true, for instance, if S is a domain in R^d and Π has density uniformly bounded away from 0 on the set {x : |η(x)| ≥ δ(P)}. In this case we have, for all x from this set,

p(x, δ) ≥ c δ^{d+2} =: p(δ).

Define now

r(x, δ) := p(x, δ) / q(x, δ).

Then on the set {x : |η(x)| ≥ δ(P)}

r(x, δ) ≥ p(δ) / q(δ).

We set U := K(||f*|| ∨ 1) (here and in what follows ∨ stands for the maximum and ∧ for the minimum) and define

λ_+ = λ_+(P) := (4U)^{−1} inf{ r(x; δ(P)/U) : |η(x)| ≥ δ(P) }

and, for a fixed ε ≥ K log log n,

λ_− := A^{2ρ/(2+ρ)} n^{−2/(2+ρ)} ε.

Clearly,

λ_+ ≥ p(δ(P)/U) / (4U q(δ(P)/U)) > 0,

so λ_+ is a positive constant. Then, if n is large enough and ε is not too large, we have λ_− ≤ λ_+. Now we are ready to formulate the main result.

Theorem 1. Let λ ∈ [λ_−, λ_+]. Then there exists β = β(H, P) > 0 such that

E(R_P(f̂_n) − R*) ≤ exp{−βn}.

In fact, with sufficiently large K, C > 0, β is equal to C^{−1}(p(δ(P)/U) ∧ ε/n), which is positive, establishing the exponential convergence rate.
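Before turning to the proof, the closed form f*(x) = log[(1 + η(x))/(1 − η(x))] from the previous section can be sanity-checked numerically (a sketch with an arbitrary grid and tolerance, not part of the paper's argument): for a fixed x with η(x) = η, the conditional risk E[l(Yf) | X = x] = ((1+η)/2) l(f) + ((1−η)/2) l(−f) should be minimized at that value of f.

```python
import math

LN2 = math.log(2.0)
def logit_loss(u):
    # l(u) = log_2(1 + e^{-u})
    return math.log(1.0 + math.exp(-u)) / LN2

def conditional_risk(f, eta):
    # E[l(Y f) | X = x] when P(Y = 1 | X = x) = (1 + eta) / 2
    p = (1.0 + eta) / 2.0
    return p * logit_loss(f) + (1.0 - p) * logit_loss(-f)

eta = 0.5
f_star = math.log((1.0 + eta) / (1.0 - eta))  # claimed minimizer, = log 3

# Brute-force minimization over a fine grid on [-5, 5].
grid = [-5.0 + 1e-4 * k for k in range(100001)]
f_num = min(grid, key=lambda f: conditional_risk(f, eta))
```

The grid minimizer agrees with f* up to the grid resolution; since the conditional risk is strictly convex in f, this minimizer is unique.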

3 Proof

We use a well known representation of the excess risk,

R_P(f) − R* = ∫_{{sign(f) ≠ sign(η)}} |η| dΠ,

to get the following bound:

E(R_P(f̂_n) − R*) ≤ E ∫ |η(x)| I{f̂_n(x)η(x) ≤ 0} Π(dx) = ∫ |η(x)| E I{f̂_n(x)η(x) ≤ 0} Π(dx) = ∫ |η(x)| P{f̂_n(x)η(x) ≤ 0} Π(dx). (4)

Our goal now is to bound, for a given x, P{f̂_n(x)η(x) ≤ 0}. Let us assume that η(x) = δ > 0 (the other case, when η(x) < 0, is similar). We have

P{f̂_n(x)η(x) ≤ 0} = P{f̂_n(x) ≤ 0} ≤ P{f̂_n(x) ≤ 0, ||f̂_n|| ≤ U} + P{||f̂_n|| > U}. (5)

We start with bounding the first term. For δ_0 > 0 (to be chosen later), let h ∈ H(x, δ_0). Define

L_n(α) := P_n(l • (f̂_n + αh)) + λ ||f̂_n + αh||².

Since f̂_n minimizes the functional

H ∋ f ↦ P_n(l • f) + λ ||f||²,

the function α ↦ L_n(α) attains its minimum at α = 0. This function is differentiable, implying that

(dL_n/dα)(0) = n^{−1} Σ_{j=1}^n l'(Y_j f̂_n(X_j)) Y_j h(X_j) + 2λ ⟨f̂_n, h⟩ = 0.

Assuming that η(x) = δ > 0, ||f̂_n|| ≤ U and f̂_n(x) ≤ 0, we need to bound from above

n^{−1} Σ_{j=1}^n l'(Y_j f̂_n(X_j)) Y_j h(X_j) + 2λ ⟨f̂_n, h⟩,

trying to show that, everywhere except on an event of small probability, the last expression is strictly negative. This would contradict the fact that it is equal to 0, implying a bound on the probability of the event {f̂_n(x) ≤ 0, ||f̂_n|| ≤ U}. First note that

n^{−1} Σ_{j=1}^n l'(Y_j f̂_n(X_j)) Y_j h(X_j) = n^{−1} Σ_{j: Y_j=+1} l'(f̂_n(X_j)) h(X_j) − n^{−1} Σ_{j: Y_j=−1} l'(−f̂_n(X_j)) h(X_j).

Note also that the function l' is negative and increasing, h is nonnegative and f̂_n is a Lipschitz function with Lipschitz norm bounded by ||f̂_n||. The last observation and the assumption that f̂_n(x) ≤ 0 imply that, for all y ∈ B(x, δ_0),

f̂_n(y) ≤ ||f̂_n|| δ_0 ≤ U δ_0

and, as a result,

l'(f̂_n(y)) ≤ l'(Uδ_0) and −l'(−f̂_n(y)) ≤ −l'(−Uδ_0).

Also, for all y ∈ S, |f̂_n(y)| ≤ ||f̂_n|| ≤ U, implying that

|l'(f̂_n(y))| ≤ |l'(−U)| and |l'(−f̂_n(y))| ≤ |l'(−U)|.

This leads to the following upper bound:

n^{−1} Σ_j l'(Y_j f̂_n(X_j)) Y_j h(X_j)
≤ l'(Uδ_0) n^{−1} Σ_{j: X_j ∈ B(x,δ_0), Y_j=+1} h(X_j) − l'(−Uδ_0) n^{−1} Σ_{j: X_j ∈ B(x,δ_0), Y_j=−1} h(X_j) + |l'(−U)| n^{−1} Σ_{j: X_j ∈ B(x,δ_0)^c} h(X_j)
= [(l'(Uδ_0) − l'(−Uδ_0))/2] n^{−1} Σ_j h(X_j) I_{B(x,δ_0)}(X_j) + [(l'(Uδ_0) + l'(−Uδ_0))/2] n^{−1} Σ_j Y_j h(X_j) I_{B(x,δ_0)}(X_j) + |l'(−U)| n^{−1} Σ_j h(X_j) I_{B(x,δ_0)^c}(X_j).

Using the fact that for the logit loss l'' has its maximum at 0, we get

(l'(Uδ_0) + l'(−Uδ_0))/2 ≤ l'(0) + |l'(Uδ_0) − l'(0)|/2 + |l'(−Uδ_0) − l'(0)|/2 ≤ l'(0) + l''(0) Uδ_0

and

(l'(Uδ_0) − l'(−Uδ_0))/2 ≤ l''(0) Uδ_0.

Therefore,

n^{−1} Σ_j l'(Y_j f̂_n(X_j)) Y_j h(X_j) ≤ l'(0) n^{−1} Σ_j Y_j h(X_j) I_{B(x,δ_0)}(X_j) + 2 l''(0) Uδ_0 n^{−1} Σ_j h(X_j) I_{B(x,δ_0)}(X_j) + |l'(−U)| n^{−1} Σ_j h(X_j) I_{B(x,δ_0)^c}(X_j) = n^{−1} Σ_j ξ_j, (6)

where ξ, ξ_j, j = 1, ..., n are i.i.d.,

ξ := l'(0) Y h(X) I_{B(x,δ_0)}(X) + 2 l''(0) Uδ_0 h(X) I_{B(x,δ_0)}(X) + |l'(−U)| h(X) I_{B(x,δ_0)^c}(X).

To bound the sum of the ξ_j's, we will use Bernstein's inequality. To this end, we first bound the expectation and the variance of ξ. We have

E ξ = l'(0) E Y h(X) I_{B(x,δ_0)}(X) + 2 l''(0) Uδ_0 E h(X) I_{B(x,δ_0)}(X) + |l'(−U)| E h(X) I_{B(x,δ_0)^c}(X).

Since η is Lipschitz with the Lipschitz constant L and η(x) = δ, we have η(y) ≥ δ − Lδ_0 for all y ∈ B(x, δ_0). Since also h ∈ H(x, δ_0), we have:

E Y h(X) I_{B(x,δ_0)}(X) = E η(X) h(X) I_{B(x,δ_0)}(X) ≥ (δ − Lδ_0) E h(X) I_{B(x,δ_0)}(X) ≥ (δ − Lδ_0)(1 − δ_0) E h(X),

E h(X) I_{B(x,δ_0)}(X) ≤ E h(X), E h(X) I_{B(x,δ_0)^c}(X) ≤ δ_0 E h(X).

Recall that l'(0) < 0 and l''(0) > 0. So, the following bound for the expectation of ξ is immediate:

E ξ ≤ [l'(0)(δ − Lδ_0)(1 − δ_0) + 2 l''(0) Uδ_0 + |l'(−U)| δ_0] E h(X).

We will choose δ_0 small enough to make

l'(0)(δ − Lδ_0)(1 − δ_0) + 2 l''(0) Uδ_0 + |l'(−U)| δ_0 ≤ −δ_0.

A simple computation shows that it is enough to take

δ_0 = C^{−1} δ / (L + 4U + 2),

which can always be achieved by making the numerical constant C large enough. Then the expectation satisfies the bound

E ξ ≤ −δ_0 E h(X).

As far as the variance of ξ is concerned, using the elementary bound (a + b + c)² ≤ 3a² + 3b² + 3c², it is easy to check that

Var(ξ) ≤ C δ_0 E h(X)

with a sufficiently large numerical constant C. Finally, it is also straightforward that, with some C > 0, |ξ| ≤ C δ_0. Now Bernstein's inequality easily yields, with a sufficiently large numerical constant C > 0,

P{ n^{−1} Σ_j ξ_j ≥ −(1/2) δ_0 E h(X) } ≤ 2 exp{ −n δ_0 E h(X) / C }.

Then, since δ_0 E h(X) ≥ δ_0² Π(B(x, δ_0/2)) = p(x, δ_0), we have, with probability at least 1 − 2 exp{−n p(x, δ_0)/C}:

n^{−1} Σ_j l'(Y_j f̂_n(X_j)) Y_j h(X_j) + 2λ ⟨f̂_n, h⟩ ≤ −(1/2) δ_0 E h(X) + 2λ ⟨f̂_n, h⟩ ≤ −(1/2) δ_0 E h(X) + 2λ U ||h|| ≤ −(1/2) p(x, δ_0) + 2λ U q(x, δ_0). (7)

So, if

λ < p(x, δ_0) / (4U q(x, δ_0)) = r(x, δ_0) / (4U),

then

n^{−1} Σ_j l'(Y_j f̂_n(X_j)) Y_j h(X_j) + 2λ ⟨f̂_n, h⟩ < 0

with probability at least 1 − 2 exp{−n p(x, δ_0)/C}. The conclusion is that, if η(x) = δ and λ < r(x, δ_0)/(4U), then

P{f̂_n(x) ≤ 0, ||f̂_n|| ≤ U} ≤ 2 exp{ −n p(x, δ_0) / C }.
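The Bernstein step above can be illustrated empirically. The following generic sketch (the standard Bernstein inequality for variables bounded by M, not the specific ξ_j of the proof; the sample size, deviation t and trial count are arbitrary choices) checks by Monte Carlo that the tail of a centered sum of bounded i.i.d. variables stays below exp{−t² / (2(nσ² + Mt/3))}.

```python
import math
import random

random.seed(0)
n, M = 100, 1.0        # n i.i.d. variables, each bounded in absolute value by M
sigma2 = 1.0 / 12.0    # variance of Uniform(0, 1)
t = 10.0               # deviation of the sum from its mean

# Bernstein's inequality:
#   P{ sum_j (xi_j - E xi) >= t } <= exp(-t^2 / (2 (n sigma^2 + M t / 3)))
bound = math.exp(-t * t / (2.0 * (n * sigma2 + M * t / 3.0)))

trials = 20000
hits = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n)) - n * 0.5  # centered sum
    if s >= t:
        hits += 1
freq = hits / trials  # Monte Carlo estimate of the tail probability
```

Here t is about 3.5 standard deviations of the sum, so the empirical frequency is far below the Bernstein bound, which itself is of the order of a percent.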

Thus, for λ ≤ λ_+, we have

P{f̂_n(x) ≤ 0, ||f̂_n|| ≤ U} ≤ 2 exp{ −n p(δ_0) / C }. (8)

We now turn to bounding the probability P{||f̂_n|| ≥ U} for a properly chosen U. This is the only part of the proof where the condition (3) on the uniform entropy of the unit ball B_H is needed. It relies heavily on recent excess risk bounds in Koltchinskii [6] as well as on some results in the spirit of Blanchard, Lugosi and Vayatis [5] (see their Lemma 4). We formulate the bound we need in the following lemma.

Lemma 1. Suppose that condition (3) holds and (for simplicity) that l is the logit loss. Let R ≥ 1. Then there exists a constant K > 0 such that, for any t > 0, the event on which

∀ f ∈ H with ||f|| ≤ R: (9)

P(l • f) − inf_{||g|| ≤ R} P(l • g) ≤ 2 ( P_n(l • f) − inf_{||g|| ≤ R} P_n(l • g) ) + K ( R A^{2ρ/(2+ρ)} n^{−2/(2+ρ)} + tR/n ) (10)

has probability at least 1 − e^{−t}.

The argument that follows will provide a bound that is somewhat akin to some of the bounds in [7] and in [4]. Denote by E(R) the event of the lemma (together with its version in which the roles of P and P_n are swapped). Let R ≥ ||f*|| ∨ 1. On the event E(R), the condition R/2 < ||f̂_n|| ≤ R implies

λ ||f̂_n||² ≤ P_n(l • f̂_n) − inf_{||g|| ≤ R} P_n(l • g) + λ ||f̂_n||²
= inf_{||f|| ≤ R} [ P_n(l • f) − inf_{||g|| ≤ R} P_n(l • g) + λ ||f||² ]
≤ inf_{||f|| ≤ R} [ 2 ( P(l • f) − inf_{||g|| ≤ R} P(l • g) ) + λ ||f||² + K ( R A^{2ρ/(2+ρ)} n^{−2/(2+ρ)} + tR/n ) ]
≤ 2 [ P(l • f*) − inf_{||g|| ≤ R} P(l • g) ] + 2λ ||f*||² + 2K ( R A^{2ρ/(2+ρ)} n^{−2/(2+ρ)} + tR/n )
≤ 2λ ||f*||² + 2K ( R A^{2ρ/(2+ρ)} n^{−2/(2+ρ)} + tR/n ),

which implies that

R²/4 ≤ ||f̂_n||² ≤ 2 ||f*||² + (2K/λ) ( R A^{2ρ/(2+ρ)} n^{−2/(2+ρ)} + tR/n ).

Solving this inequality with respect to R shows that, on E(R), the condition R/2 ≤ ||f̂_n|| ≤ R implies

R ≤ K ( ||f*|| ∨ 1 ∨ A^{2ρ/(2+ρ)} / (λ n^{2/(2+ρ)}) ∨ t/(λn) ).

If now t = ε and λ ≥ λ_−, then it yields

R ≤ K ( ||f*|| ∨ 1 ).

Note that

P_n(l • f̂_n) + λ ||f̂_n||² ≤ l(0)

(just plug in f = 0 in the target functional). Therefore, we have λ ||f̂_n||² ≤ l(0), or

||f̂_n|| ≤ (l(0)/λ)^{1/2} =: R_n.

Define R_k := 2^{−k} R_n, k = 0, 1, 2, ..., N := [log_2 R_n] + 1. Note that, for our choice of λ, we have N ≤ C log n with some numerical constant C > 0. Let E_k := E(R_k). Clearly, P(E_k) ≥ 1 − e^{−t} and, on the event E_k, the condition R_k/2 ≤ ||f̂_n|| ≤ R_k implies ||f̂_n|| ≤ K(||f*|| ∨ 1). Thus, ||f̂_n|| can be larger than the right hand side of the last bound only on the event ∪_{k=1}^N E_k^c, whose probability is smaller than N e^{−ε}. This establishes the following inequality:

P{ ||f̂_n|| ≥ K(||f*|| ∨ 1) } ≤ N e^{−ε} ≤ e^{−ε/2}, (11)

provided that ε ≥ K log log n, as it was assumed. Combining bounds (8) and (11) and plugging the resulting bound into (5) and then into (4) easily completes the proof (subject to a minor adjustment of the constants).

Acknowledgement. The first author is very thankful to Alexandre Tsybakov for several useful and interesting conversations on the subject of the paper.

References

1. Audibert, J.-Y. and Tsybakov, A. Fast convergence rates for plug-in estimators under margin conditions. Unpublished manuscript.
2. Bartlett, P., Bousquet, O. and Mendelson, S. Local Rademacher Complexities. Annals of Statistics, 2005, to appear.
3. Bartlett, P., Jordan, M. and McAuliffe, J. Convexity, Classification and Risk Bounds. J. American Statistical Assoc., 2004, to appear.

4. Blanchard, G., Bousquet, O. and Massart, P. Statistical Performance of Support Vector Machines. Preprint, 2003.
5. Blanchard, G., Lugosi, G. and Vayatis, N. On the rates of convergence of regularized boosting classifiers. Journal of Machine Learning Research, 2003, 4.
6. Koltchinskii, V. Local Rademacher Complexities and Oracle Inequalities in Risk Minimization. Preprint.
7. Scovel, C. and Steinwart, I. Fast Rates for Support Vector Machines. Preprint.


Maximum Likelihood Estimation and Complexity Regularization ECE90 Sprig 004 Statistical Regularizatio ad Learig Theory Lecture: 4 Maximum Likelihood Estimatio ad Complexity Regularizatio Lecturer: Rob Nowak Scribe: Pam Limpiti Review : Maximum Likelihood Estimatio

More information

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1 EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum

More information

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function. MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Linear Classifiers III

Linear Classifiers III Uiversität Potsdam Istitut für Iformatik Lehrstuhl Maschielles Lere Liear Classifiers III Blaie Nelso, Tobias Scheffer Cotets Classificatio Problem Bayesia Classifier Decisio Liear Classifiers, MAP Models

More information

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices A Hadamard-type lower boud for symmetric diagoally domiat positive matrices Christopher J. Hillar, Adre Wibisoo Uiversity of Califoria, Berkeley Jauary 7, 205 Abstract We prove a ew lower-boud form of

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

MA131 - Analysis 1. Workbook 2 Sequences I

MA131 - Analysis 1. Workbook 2 Sequences I MA3 - Aalysis Workbook 2 Sequeces I Autum 203 Cotets 2 Sequeces I 2. Itroductio.............................. 2.2 Icreasig ad Decreasig Sequeces................ 2 2.3 Bouded Sequeces..........................

More information

arxiv: v1 [math.pr] 4 Dec 2013

arxiv: v1 [math.pr] 4 Dec 2013 Squared-Norm Empirical Process i Baach Space arxiv:32005v [mathpr] 4 Dec 203 Vicet Q Vu Departmet of Statistics The Ohio State Uiversity Columbus, OH vqv@statosuedu Abstract Jig Lei Departmet of Statistics

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A. Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)

More information

Sparsity in Multiple Kernel Learning

Sparsity in Multiple Kernel Learning Sparsity i Multiple Kerel Learig Vladimir Koltchiskii School of Mathematics Georgia Istitute of Techology Atlata, GA 30332-0160 USA vlad@math.gatech.edu ad Mig Yua School of Idustrial ad Systems Egieerig

More information

Lecture 15: Learning Theory: Concentration Inequalities

Lecture 15: Learning Theory: Concentration Inequalities STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

A remark on p-summing norms of operators

A remark on p-summing norms of operators A remark o p-summig orms of operators Artem Zvavitch Abstract. I this paper we improve a result of W. B. Johso ad G. Schechtma by provig that the p-summig orm of ay operator with -dimesioal domai ca be

More information

Lecture 7: Properties of Random Samples

Lecture 7: Properties of Random Samples Lecture 7: Properties of Radom Samples 1 Cotiued From Last Class Theorem 1.1. Let X 1, X,...X be a radom sample from a populatio with mea µ ad variace σ

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

A Risk Comparison of Ordinary Least Squares vs Ridge Regression Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer

More information

5 Birkhoff s Ergodic Theorem

5 Birkhoff s Ergodic Theorem 5 Birkhoff s Ergodic Theorem Amog the most useful of the various geeralizatios of KolmogorovâĂŹs strog law of large umbers are the ergodic theorems of Birkhoff ad Kigma, which exted the validity of the

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

Sequences I. Chapter Introduction

Sequences I. Chapter Introduction Chapter 2 Sequeces I 2. Itroductio A sequece is a list of umbers i a defiite order so that we kow which umber is i the first place, which umber is i the secod place ad, for ay atural umber, we kow which

More information

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number MATH 532 Itegrable Fuctios Dr. Neal, WKU We ow shall defie what it meas for a measurable fuctio to be itegrable, show that all itegral properties of simple fuctios still hold, ad the give some coditios

More information

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities Chapter 5 Iequalities 5.1 The Markov ad Chebyshev iequalities As you have probably see o today s frot page: every perso i the upper teth percetile ears at least 1 times more tha the average salary. I other

More information

Approximation by Superpositions of a Sigmoidal Function

Approximation by Superpositions of a Sigmoidal Function Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 22 (2003, No. 2, 463 470 Approximatio by Superpositios of a Sigmoidal Fuctio G. Lewicki ad G. Mario Abstract. We geeralize

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

On Classification Based on Totally Bounded Classes of Functions when There are Incomplete Covariates

On Classification Based on Totally Bounded Classes of Functions when There are Incomplete Covariates Joural of Statistical Theory ad Applicatios Volume, Number 4, 0, pp. 353-369 ISSN 538-7887 O Classificatio Based o Totally Bouded Classes of Fuctios whe There are Icomplete Covariates Majid Mojirsheibai

More information

A REMARK ON A PROBLEM OF KLEE

A REMARK ON A PROBLEM OF KLEE C O L L O Q U I U M M A T H E M A T I C U M VOL. 71 1996 NO. 1 A REMARK ON A PROBLEM OF KLEE BY N. J. K A L T O N (COLUMBIA, MISSOURI) AND N. T. P E C K (URBANA, ILLINOIS) This paper treats a property

More information

Application to Random Graphs

Application to Random Graphs A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

6. Kalman filter implementation for linear algebraic equations. Karhunen-Loeve decomposition

6. Kalman filter implementation for linear algebraic equations. Karhunen-Loeve decomposition 6. Kalma filter implemetatio for liear algebraic equatios. Karhue-Loeve decompositio 6.1. Solvable liear algebraic systems. Probabilistic iterpretatio. Let A be a quadratic matrix (ot obligatory osigular.

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

Fast Rates for Support Vector Machines

Fast Rates for Support Vector Machines Fast Rates for Support Vector Machies Igo Steiwart ad Clit Scovel CCS-3, Los Alamos Natioal Laboratory, Los Alamos NM 87545, USA {igo,jcs}@lal.gov Abstract. We establish learig rates to the Bayes risk

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

McGill University Math 354: Honors Analysis 3 Fall 2012 Solutions to selected problems

McGill University Math 354: Honors Analysis 3 Fall 2012 Solutions to selected problems McGill Uiversity Math 354: Hoors Aalysis 3 Fall 212 Assigmet 3 Solutios to selected problems Problem 1. Lipschitz fuctios. Let Lip K be the set of all fuctios cotiuous fuctios o [, 1] satisfyig a Lipschitz

More information

Solutions to HW Assignment 1

Solutions to HW Assignment 1 Solutios to HW: 1 Course: Theory of Probability II Page: 1 of 6 Uiversity of Texas at Austi Solutios to HW Assigmet 1 Problem 1.1. Let Ω, F, {F } 0, P) be a filtered probability space ad T a stoppig time.

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete

More information

6.883: Online Methods in Machine Learning Alexander Rakhlin

6.883: Online Methods in Machine Learning Alexander Rakhlin 6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform

More information

PRELIM PROBLEM SOLUTIONS

PRELIM PROBLEM SOLUTIONS PRELIM PROBLEM SOLUTIONS THE GRAD STUDENTS + KEN Cotets. Complex Aalysis Practice Problems 2. 2. Real Aalysis Practice Problems 2. 4 3. Algebra Practice Problems 2. 8. Complex Aalysis Practice Problems

More information

Bull. Korean Math. Soc. 36 (1999), No. 3, pp. 451{457 THE STRONG CONSISTENCY OF NONLINEAR REGRESSION QUANTILES ESTIMATORS Seung Hoe Choi and Hae Kyung

Bull. Korean Math. Soc. 36 (1999), No. 3, pp. 451{457 THE STRONG CONSISTENCY OF NONLINEAR REGRESSION QUANTILES ESTIMATORS Seung Hoe Choi and Hae Kyung Bull. Korea Math. Soc. 36 (999), No. 3, pp. 45{457 THE STRONG CONSISTENCY OF NONLINEAR REGRESSION QUANTILES ESTIMATORS Abstract. This paper provides suciet coditios which esure the strog cosistecy of regressio

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Support Vector Machies ad Kerel Methods Daiel Khashabi Fall 202 Last Update: September 26, 206 Itroductio I Support Vector Machies the goal is to fid a separator betwee data which has the largest margi,

More information

1 of 7 7/16/2009 6:06 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 6. Order Statistics Defiitios Suppose agai that we have a basic radom experimet, ad that X is a real-valued radom variable

More information

Lecture 13: Maximum Likelihood Estimation

Lecture 13: Maximum Likelihood Estimation ECE90 Sprig 007 Statistical Learig Theory Istructor: R. Nowak Lecture 3: Maximum Likelihood Estimatio Summary of Lecture I the last lecture we derived a risk (MSE) boud for regressio problems; i.e., select

More information

2 Banach spaces and Hilbert spaces

2 Banach spaces and Hilbert spaces 2 Baach spaces ad Hilbert spaces Tryig to do aalysis i the ratioal umbers is difficult for example cosider the set {x Q : x 2 2}. This set is o-empty ad bouded above but does ot have a least upper boud

More information

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f. Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,

More information