Binary classification, Part 1
Maxim Raginsky
September 25, 2014

The problem of binary classification can be stated as follows. We have a random couple Z = (X, Y), where X ∈ R^d is called the feature vector and Y ∈ {−1, 1} is called the label.¹ In the spirit of the model-free framework, we assume that the relationship between the features and the labels is stochastic and described by an unknown probability distribution P ∈ P(Z), where Z = R^d × {−1, 1}. In these lectures on binary classification, I will be following mainly two excellent sources: the book by Devroye, Györfi, and Lugosi [DGL96] and the comprehensive survey article by Boucheron, Bousquet, and Lugosi [BBL05].

As usual, we consider the case when we are given an i.i.d. sample Z^n = (Z_1, ..., Z_n) of length n from P. The goal is to learn a classifier, i.e., a mapping g : R^d → {−1, 1}, such that the probability of classification error, P(g(X) ≠ Y), is small. As we have seen before, the optimal choice is the Bayes classifier

    g*(x) = { 1,  if η(x) > 1/2
            { −1, otherwise                                                (1)

where η(x) = P[Y = 1 | X = x] is the regression function. However, since we make no assumptions on P, in general we cannot hope to learn the Bayes classifier g*. Instead, we focus on a more realistic goal: we fix a collection G of classifiers and then use the training data to come up with a hypothesis ĝ ∈ G, such that

    P(ĝ(X) ≠ Y) ≈ inf_{g ∈ G} P(g(X) ≠ Y)

with high probability. By way of notation, let us write L(g) for the classification error of g, i.e., L(g) = P(g(X) ≠ Y), and let L*(G) denote the smallest classification error attainable over G:

    L*(G) = inf_{g ∈ G} L(g).

We will assume that a minimizing g ∈ G exists. For future reference, we note that

    L(g) = P(g(X) ≠ Y) = P(Y g(X) < 0).                                    (2)

Warning: in what follows, we will use C or c to denote various absolute constants; their values may change from line to line.

¹ The reason why we chose {−1, 1}, rather than {0, 1}, for the label space is merely convenience.
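To make the definition (1) concrete, here is a minimal sketch of the Bayes rule in Python. The regression function eta below is a hypothetical example chosen for illustration; in the model-free setting, η is of course unknown to the learner.

```python
import numpy as np

def bayes_classifier(eta, x):
    """Bayes classifier g*(x) from equation (1): predict 1 iff the
    regression function eta(x) = P[Y = 1 | X = x] exceeds 1/2."""
    return 1 if eta(x) > 0.5 else -1

# Hypothetical regression function: Y is more likely +1 when x[0] > 0.
eta = lambda x: 1.0 / (1.0 + np.exp(-4.0 * x[0]))

bayes_classifier(eta, np.array([0.7, -0.2]))   # eta ~ 0.94 > 1/2, predicts  1
bayes_classifier(eta, np.array([-0.7, 0.3]))   # eta ~ 0.06 < 1/2, predicts -1
```

The Bayes rule simply thresholds η at 1/2; everything that follows is about approaching its error without knowing η.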
1 Learning linear discriminant rules

One of the simplest classification rules (and one of the first to be studied) is a linear discriminant rule: given a nonzero vector w = (w^(1), ..., w^(d)) ∈ R^d and a scalar b ∈ R, let

    g(x) ≡ g_{w,b}(x) = { 1,  if ⟨w, x⟩ + b > 0
                        { −1, otherwise                                    (3)

Let G be the class of all such linear discriminant rules as w ranges over all nonzero vectors in R^d and b ranges over all reals:

    G = { g_{w,b} : w ∈ R^d \ {0}, b ∈ R }.

Given the training sample Z^n, let ĝ_n ∈ G be the output of the ERM algorithm, i.e.,

    ĝ_n ∈ argmin_{g ∈ G} (1/n) Σ_{i=1}^n 1{g(X_i) ≠ Y_i}.

In other words, ĝ_n is any classifier of the form (3) that minimizes the number of misclassifications on the training sample. Then we have the following:

Theorem 1. There exists an absolute constant C > 0, such that for any n ∈ N and any δ ∈ (0, 1), the bound

    L(ĝ_n) ≤ L*(G) + C √((d+1)/n) + √(2 log(1/δ)/n)                        (4)

holds with probability at least 1 − δ.

Proof. A standard argument leads to the bound

    L(ĝ_n) ≤ L*(G) + 2 Δ_n(Z^n),                                           (5)

where

    Δ_n(Z^n) ≡ sup_{g ∈ G} |L(g) − L_n(g)|

is the uniform deviation and L_n(g) denotes the empirical classification error of g on Z^n:

    L_n(g) = (1/n) Σ_{i=1}^n 1{g(X_i) ≠ Y_i},

which is the fraction of incorrectly labeled points in the training sample Z^n. Consider a classifier g ∈ G and define the set

    C_g ≡ { (x, y) ∈ R^d × {−1, 1} : y(⟨w, x⟩ + b) ≤ 0 }.

Then it is easy to see that

    L(g) = P(C_g) and L_n(g) = P_n(C_g),

where, as before,

    P_n = (1/n) Σ_{i=1}^n δ_{Z_i} = (1/n) Σ_{i=1}^n δ_{(X_i, Y_i)}

is the empirical distribution of the sample Z^n. Let C denote the collection of all sets of the form C = C_g for some g ∈ G. Then

    Δ_n(Z^n) = sup_{C ∈ C} |P_n(C) − P(C)|.
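The ERM objective for linear discriminant rules can be sketched as follows. Exact minimization over all halfspaces is computationally hard (more on this in Section 1.2), so this sketch only searches a small, hypothetical candidate grid of (w, b) pairs; the toy data are likewise hypothetical.

```python
import itertools
import numpy as np

def empirical_error(w, b, X, Y):
    """L_n(g_{w,b}): the fraction of training points with y(<w,x> + b) <= 0,
    i.e., misclassified by the linear discriminant rule (3)."""
    preds = np.where(X @ w + b > 0, 1, -1)
    return np.mean(preds != Y)

def erm_linear(X, Y, candidates):
    """Approximate ERM: pick the empirical-error minimizer over a finite
    candidate set of (w, b) pairs. This is only an illustration of the
    objective, not an exact ERM algorithm."""
    return min(candidates, key=lambda wb: empirical_error(wb[0], wb[1], X, Y))

# Hypothetical toy data in R^2, labeled by a halfspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
Y = np.where(X[:, 0] > 0.1, 1, -1)

grid = [(np.array([c, s]), b)
        for c, s, b in itertools.product((-1.0, 1.0, 0.5),
                                         (-1.0, 0.0, 1.0),
                                         (-0.5, 0.0, 0.5))]
w_hat, b_hat = erm_linear(X, Y, grid)
```

By construction, (w_hat, b_hat) attains the smallest empirical error among the candidates, which is exactly the quantity the uniform-deviation argument in the proof controls.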
Let F = F_C denote the class of indicator functions of the sets in C:

    F_C = { 1{· ∈ C} : C ∈ C }.

Then we know that, with probability at least 1 − δ,

    Δ_n(Z^n) ≤ 2 E R_n(F(Z^n)) + √(log(1/δ)/(2n)),                         (6)

where R_n(F(Z^n)) is the Rademacher average of the projection of F onto the sample Z^n. Now,

    F(Z^n) = { (f(Z_1), ..., f(Z_n)) : f ∈ F }
           = { (1{Z_1 ∈ C}, ..., 1{Z_n ∈ C}) : C ∈ C }.

Therefore, if we prove that C is a VC class, then

    R_n(F(Z^n)) ≤ C √(V(C)/n).

But this follows from the fact that any C ∈ C has the form

    C = { (x, y) ∈ R^d × {−1, 1} : Σ_{j=1}^d w^(j) y x^(j) + b y ≤ 0 }

for some w ∈ R^d \ {0} and some b ∈ R, and the functions (x, y) ↦ y and (x, y) ↦ y x^(j), 1 ≤ j ≤ d, span a linear space of dimension no greater than d + 1. Hence, V(C) ≤ d + 1, so that

    R_n(F(Z^n)) ≤ C √(V(C)/n) ≤ C √((d+1)/n).

Combining this with (5) and (6), we see that (4) holds with probability at least 1 − δ.

1.1 Generalized linear discriminant rules

In the same vein, we may consider classification rules of the form

    g(x) = { 1,  if Σ_{j=1}^k w^(j) ψ_j(x) + b > 0
           { −1, otherwise                                                 (7)

where k is some positive integer (not necessarily equal to d), w = (w^(1), ..., w^(k)) ∈ R^k is a nonzero vector, b ∈ R is an arbitrary scalar, and Ψ = {ψ_j : R^d → R}_{j=1}^k is some fixed dictionary of real-valued functions on R^d. For a fixed Ψ, let G denote the collection of all classifiers of the form (7) as w ranges over all nonzero vectors in R^k and b ranges over all reals. Then the ERM rule is, again, given by

    ĝ_n ∈ argmin_{g ∈ G} L_n(g) = argmin_{g ∈ G} (1/n) Σ_{i=1}^n 1{g(X_i) ≠ Y_i}.

The following result can be proved pretty much along the same lines as Theorem 1:

Theorem 2. There exists an absolute constant C > 0, such that for any n ∈ N and any δ ∈ (0, 1), the bound

    L(ĝ_n) ≤ L*(G) + C √((k+1)/n) + √(2 log(1/δ)/n)                        (8)

holds with probability at least 1 − δ.
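As an illustration of the form (7), here is a short sketch; the quadratic dictionary Ψ chosen below is a hypothetical example, picked because it yields a decision boundary that no purely linear rule of the form (3) can realize.

```python
import numpy as np

def gld_classifier(w, b, psis):
    """Generalized linear discriminant rule (7): output 1 iff
    sum_j w^(j) psi_j(x) + b > 0, and -1 otherwise."""
    def g(x):
        score = sum(wj * psi(x) for wj, psi in zip(w, psis)) + b
        return 1 if score > 0 else -1
    return g

# Hypothetical dictionary on R^2: Psi = {x^(1), x^(2), (x^(1))^2, (x^(2))^2}.
psis = [lambda x: x[0], lambda x: x[1],
        lambda x: x[0] ** 2, lambda x: x[1] ** 2]

# With these weights the rule labels a point 1 iff it lies outside the
# unit circle, a nonlinear boundary in the original feature space.
g = gld_classifier(w=[0.0, 0.0, 1.0, 1.0], b=-1.0, psis=psis)
```

The rule is still linear in the dictionary outputs ψ_1(x), ..., ψ_k(x), which is why Theorem 2 carries over with d replaced by k.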
1.2 Two fundamental issues

As Theorems 1 and 2 show, the ERM algorithm applied to the collection of all (generalized) linear discriminant rules is guaranteed to work well in the sense that the classification error of the output hypothesis will, with high probability, be close to the optimum achievable by any discriminant rule with the given structure. The same argument extends to any collection of classifiers G, for which the error sets {(x, y) : y g(x) ≤ 0}, g ∈ G, form a VC class of dimension much smaller than the sample size n. In other words, with high probability the difference

    L(ĝ_n) − L*(G) = L(ĝ_n) − inf_{g ∈ G} L(g)

will be small. However, precisely because the VC dimension of G cannot be too large, the approximation properties of G will be limited.

Another problem is computational. For instance, the problem of finding an empirically optimal linear discriminant rule is NP-hard. In other words, unless P is equal to NP, there is no hope of coming up with an efficient ERM algorithm for linear discriminant rules that would work for all feature space dimensions d. If d is fixed, then it is possible to enumerate all projections of a given sample Z^n onto the class of indicators of all halfspaces in O(n^{d−1} log n) time, which allows for an exhaustive search for an ERM solution, but the usefulness of this naive approach is limited to d < 5.

2 Risk bounds for combined classifiers via surrogate loss functions

One way to sidestep the above approximation-theoretic and computational issues is to replace the 0-1 Hamming loss function that gives rise to the probability-of-error criterion with some other loss function. What we gain is the ability to bound the performance of various complicated classifiers built up by combining simpler base classifiers in terms of the complexity (e.g., the VC dimension) of the collection of the base classifiers, as well as considerable computational advantages, especially if the problem of minimizing the empirical surrogate loss turns out to be a convex programming problem.
What we lose, though, is that, in general, we will not be able to compare the generalization error of the learned classifier to the minimum classification risk. Instead, we will have to be content with the fact that the generalization error will be close to the smallest surrogate loss.

We will consider classifiers of the form

    g_f(x) = sgn f(x) ≡ { 1,  if f(x) ≥ 0
                        { −1, otherwise                                    (9)

where f : R^d → R is some function. From (2) we have

    L(g_f) = P(g_f(X) ≠ Y) = P(Y g_f(X) < 0) = P(Y f(X) < 0).

From now on, when dealing with classifiers of the form (9), we will write L(f) instead of L(g_f) to keep the notation simple. Now we introduce the notion of a surrogate loss function.

Definition 1. A surrogate loss function is any function ϕ : R → R_+, such that

    ϕ(x) ≥ 1{x > 0}.                                                       (10)

Some examples of commonly used surrogate losses:
1. Exponential: ϕ(x) = e^x
2. Logit: ϕ(x) = log_2(1 + e^x)
3. Hinge loss: ϕ(x) = (x + 1)_+ ≡ max{x + 1, 0}

Let ϕ be a surrogate loss. Then for any (x, y) ∈ R^d × {−1, 1} and any f : R^d → R we have

    ϕ(−y f(x)) ≥ 1{−y f(x) > 0} = 1{y f(x) < 0}.                           (11)

Therefore, defining the ϕ-risk of f by

    A_ϕ(f) ≡ E[ϕ(−Y f(X))]

and its empirical version

    A_{ϕ,n}(f) ≡ (1/n) Σ_{i=1}^n ϕ(−Y_i f(X_i)),

we see from (11) that

    L(f) ≤ A_ϕ(f) and L_n(f) ≤ A_{ϕ,n}(f).                                 (12)

Now that these preliminaries are out of the way, we can state and prove the basic surrogate loss bound:

Theorem 3. Consider any learning algorithm A = {A_n}_{n=1}^∞, where, for each n, the mapping A_n receives the training sample Z^n = (Z_1, ..., Z_n) as input and produces a function f_n : R^d → R from some class F. Suppose that F and the surrogate loss ϕ are chosen so that the following conditions are satisfied:

1. There exists some constant B > 0 such that

    sup_{(x,y) ∈ R^d × {−1,1}} sup_{f ∈ F} ϕ(−y f(x)) ≤ B.

2. There exists some constant M_ϕ > 0 such that ϕ is M_ϕ-Lipschitz, i.e.,

    |ϕ(u) − ϕ(v)| ≤ M_ϕ |u − v|, for all u, v ∈ R.

Then for any n and any δ ∈ (0, 1) the following bound holds with probability at least 1 − δ:

    L(f_n) ≤ A_{ϕ,n}(f_n) + 4 M_ϕ E R_n(F(X^n)) + B √(log(1/δ)/(2n)).      (13)

Proof. Using (12), we can write

    L(f_n) ≤ A_ϕ(f_n)
           = A_{ϕ,n}(f_n) + A_ϕ(f_n) − A_{ϕ,n}(f_n)
           ≤ A_{ϕ,n}(f_n) + sup_{f ∈ F} |A_ϕ(f) − A_{ϕ,n}(f)|.
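The three surrogate losses above, together with the pointwise inequality (11) that drives (12), can be checked numerically. In the following sketch, the toy data and the score function f are hypothetical; the assertions hold for any choice, since (11) holds pointwise.

```python
import numpy as np

# The three surrogate losses listed above; each satisfies phi(x) >= 1{x > 0}.
exponential = lambda x: np.exp(x)
logit       = lambda x: np.log2(1.0 + np.exp(x))
hinge       = lambda x: np.maximum(x + 1.0, 0.0)

def empirical_phi_risk(phi, f, X, Y):
    """A_{phi,n}(f) = (1/n) sum_i phi(-Y_i f(X_i))."""
    return np.mean(phi(-Y * f(X)))

def empirical_error(f, X, Y):
    """L_n(f) = (1/n) sum_i 1{Y_i f(X_i) < 0}."""
    return np.mean(Y * f(X) < 0)

# Hypothetical sample in R^1 with noisy labels, and a fixed linear score f.
rng = np.random.default_rng(1)
X = rng.normal(size=50)
Y = np.where(X + 0.3 * rng.normal(size=50) > 0, 1, -1)
f = lambda x: 2.0 * x

# Inequality (12): the empirical 0-1 error never exceeds the phi-risk.
for phi in (exponential, logit, hinge):
    assert empirical_error(f, X, Y) <= empirical_phi_risk(phi, f, X, Y)
```

Each surrogate dominates the 0-1 indicator pointwise, so the domination survives averaging; that is all (12) says.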
Now let H denote the class of functions h : R^d × {−1, 1} → R of the form h(x, y) = −y f(x), f ∈ F. Then

    sup_{f ∈ F} |A_ϕ(f) − A_{ϕ,n}(f)|
        = sup_{f ∈ F} |E[ϕ(−Y f(X))] − (1/n) Σ_{i=1}^n ϕ(−Y_i f(X_i))|
        = sup_{h ∈ H} |P(ϕ ∘ h) − P_n(ϕ ∘ h)|,

where ϕ ∘ h(z) ≡ ϕ(h(z)) for every z = (x, y) ∈ R^d × {−1, 1}. Let

    Δ_n(Z^n) ≡ sup_{h ∈ H} |P(ϕ ∘ h) − P_n(ϕ ∘ h)|
             = sup_{h ∈ H} |P(ϕ ∘ h − ϕ(0)) − P_n(ϕ ∘ h − ϕ(0))|,

where the second line follows from the fact that adding the same constant to each ϕ ∘ h does not change the value of P_n(ϕ ∘ h) − P(ϕ ∘ h). Using the familiar symmetrization argument, we can write

    E Δ_n(Z^n) ≤ 2 E R_n(H_ϕ(Z^n)),                                        (14)

where H_ϕ denotes the class of all functions of the form (x, y) ↦ ϕ(h(x, y)) − ϕ(0), h ∈ H. We now use a very powerful result about Rademacher averages called the contraction principle, which states the following [LT91]: if A ⊂ R^n is a bounded set and F : R → R is an M-Lipschitz function satisfying F(0) = 0, then

    R_n(F ∘ A) ≤ 2 M R_n(A),                                               (15)

where F ∘ A ≡ { (F(a_1), ..., F(a_n)) : a = (a_1, ..., a_n) ∈ A }. (The proof of the contraction principle is somewhat involved, and we do not give it here.)

Consider the function F(u) = ϕ(u) − ϕ(0). This function clearly satisfies F(0) = 0, and it is M_ϕ-Lipschitz, by our assumptions on ϕ. Moreover, from our definition of H_ϕ, we immediately see that

    H_ϕ(Z^n) = { (ϕ(h(Z_1)) − ϕ(0), ..., ϕ(h(Z_n)) − ϕ(0)) : h ∈ H }
             = { (F(h(Z_1)), ..., F(h(Z_n))) : h ∈ H }
             = F ∘ H(Z^n).

Therefore, applying (15) to A = H(Z^n) and then using the resulting bound in (14), we obtain

    E Δ_n(Z^n) ≤ 4 M_ϕ E R_n(H(Z^n)).

Furthermore, letting σ^n be an i.i.d. Rademacher tuple independent of Z^n, and using the fact that −σ_i Y_i has the same distribution as σ_i, we have

    R_n(H(Z^n)) = (1/n) E_σ [ sup_{h ∈ H} Σ_{i=1}^n σ_i h(Z_i) ]
                = (1/n) E_σ [ sup_{f ∈ F} Σ_{i=1}^n σ_i (−Y_i f(X_i)) ]
                = (1/n) E_σ [ sup_{f ∈ F} Σ_{i=1}^n σ_i f(X_i) ]
                = R_n(F(X^n)),
which leads to

    E Δ_n(Z^n) ≤ 4 M_ϕ E R_n(F(X^n)).                                      (16)

Now, since every function ϕ ∘ h is bounded between 0 and B, the function Δ_n(Z^n) has bounded differences with c_1 = ... = c_n = B/n. Therefore, from (16) and from McDiarmid's inequality, we have for every t > 0 that

    P( Δ_n(Z^n) ≥ 4 M_ϕ E R_n(F(X^n)) + t ) ≤ P( Δ_n(Z^n) ≥ E Δ_n(Z^n) + t ) ≤ e^{−2n t² / B²}.

Choosing t = B √(log(1/δ)/(2n)), we see that

    Δ_n(Z^n) ≤ 4 M_ϕ E R_n(F(X^n)) + B √(log(1/δ)/(2n))

with probability at least 1 − δ. Therefore, since

    L(f_n) ≤ A_{ϕ,n}(f_n) + Δ_n(Z^n),

we see that (13) holds with probability at least 1 − δ.

What the above theorem tells us is that the performance of the learned classifier f_n is controlled by the Rademacher average of the class F, and we can always arrange it to be relatively small. We will now look at several specific examples.

3 Weighted linear combinations of classifiers

Let G = {g : R^d → {−1, 1}} be a class of base classifiers (not to be confused with Bayes classifiers!), and consider the class

    F_λ = { f = Σ_{j=1}^N c_j g_j : N ∈ N, Σ_{j=1}^N |c_j| ≤ λ; g_1, ..., g_N ∈ G },

where λ > 0 is a tunable parameter. Then for each f = Σ_{j=1}^N c_j g_j ∈ F_λ the corresponding classifier g_f of the form (9) is given by

    g_f(x) = sgn( Σ_{j=1}^N c_j g_j(x) ).

A useful way of thinking about g_f is that, upon receiving a feature x ∈ R^d, it computes the outputs g_1(x), ..., g_N(x) of the N base classifiers from G and then takes a weighted majority vote: indeed, if we had c_1 = ... = c_N = λ/N, then g_f(x) would precisely correspond to taking the majority vote among the N base classifiers. Note, by the way, that the number N of base classifiers is not fixed, and can be learned from the data.

Now, Theorem 3 tells us that the performance of any learning algorithm that accepts a training sample Z^n and produces a function f_n ∈ F_λ is controlled by the Rademacher average R_n(F_λ(X^n)). It turns out, moreover, that we can relate it to the Rademacher average of the base class G. To start, note that

    F_λ = λ · absconv G,
where

    absconv G = { Σ_{j=1}^N c_j g_j : N ∈ N; ‖c‖_1 = Σ_{j=1}^N |c_j| ≤ 1; g_1, ..., g_N ∈ G }

is the absolute convex hull of G. Therefore,

    R_n(F_λ(X^n)) = λ R_n(G(X^n)).

Now note that the functions in G are binary-valued. Therefore, assuming that the base class G is a VC class, we will have

    R_n(G(X^n)) ≤ C √(V(G)/n).

Combining these bounds with the bound of Theorem 3, we conclude that for any f_n selected from F_λ based on the training sample Z^n, the bound

    L(f_n) ≤ A_{ϕ,n}(f_n) + C λ M_ϕ √(V(G)/n) + B √(log(1/δ)/(2n))

will hold with probability at least 1 − δ, where B is the uniform upper bound on ϕ(−y f(x)), f ∈ F_λ, (x, y) ∈ R^d × {−1, 1}, and M_ϕ is the Lipschitz constant of the surrogate loss ϕ.

Note that the above bound involves only the VC dimension of the base class, which is typically small. On the other hand, the class F_λ obtained by forming weighted combinations of classifiers from G is extremely rich, and will generally have infinite VC dimension! But there is a price we pay: the first term is the empirical surrogate loss A_{ϕ,n}(f_n), rather than the empirical classification error L_n(f_n). However, it is possible to choose the surrogate ϕ in such a way that A_{ϕ,n}(·) can be bounded in terms of a quantity related to the number of misclassified training examples.

Here is an example. Fix a positive parameter γ > 0 and consider

    ϕ(x) = { 0,       if x ≤ −γ
           { 1,       if x ≥ 0
           { 1 + x/γ, otherwise

This is a valid surrogate loss with B = 1 and M_ϕ = 1/γ, but in addition we have ϕ(x) ≤ 1{x > −γ}, which implies that ϕ(−y f(x)) ≤ 1{y f(x) < γ}. Therefore, for any f we have

    A_{ϕ,n}(f) = (1/n) Σ_{i=1}^n ϕ(−Y_i f(X_i)) ≤ (1/n) Σ_{i=1}^n 1{Y_i f(X_i) < γ}.   (17)

The quantity

    L_n^γ(f) ≡ (1/n) Σ_{i=1}^n 1{Y_i f(X_i) < γ}                           (18)

is called the margin error of f. Notice that:

• For any γ > 0, L_n^γ(f) ≥ L_n(f).
• The function γ ↦ L_n^γ(f) is increasing.
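The ramp surrogate just defined, the bound (17), and the monotonicity of the margin error in γ can all be checked numerically. In the sketch below, the sample margins m[i] = Y_i f(X_i) are hypothetical; the assertions hold for any margins, since ϕ(−m) ≤ 1{m < γ} pointwise.

```python
import numpy as np

def ramp_loss(x, gamma):
    """The surrogate above: 0 for x <= -gamma, 1 for x >= 0, and
    1 + x/gamma in between; hence B = 1 and M_phi = 1/gamma."""
    return np.clip(1.0 + x / gamma, 0.0, 1.0)

def margin_error(margins, gamma):
    """L_n^gamma(f) = (1/n) sum_i 1{Y_i f(X_i) < gamma}, where
    margins[i] = Y_i f(X_i); cf. (18)."""
    return np.mean(margins < gamma)

# Hypothetical margins Y_i f(X_i) for a sample of size 100.
rng = np.random.default_rng(2)
m = rng.normal(loc=0.5, size=100)

# Inequality (17): the empirical phi-risk is dominated by the margin error.
for gamma in (0.1, 0.5, 1.0):
    phi_risk = np.mean(ramp_loss(-m, gamma))    # A_{phi,n}(f)
    assert phi_risk <= margin_error(m, gamma)

# gamma -> L_n^gamma(f) is increasing.
assert margin_error(m, 0.1) <= margin_error(m, 0.5) <= margin_error(m, 1.0)
```

The ramp interpolates between the 0-1 loss (γ → 0) and an ever more demanding notion of "correct with margin γ" as γ grows, which is exactly the tradeoff quantified in Theorem 4 below.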
Notice also that we can write

    L_n^γ(f) = (1/n) Σ_{i=1}^n 1{Y_i f(X_i) < 0} + (1/n) Σ_{i=1}^n 1{0 ≤ Y_i f(X_i) < γ},

where the first term is just L_n(f), while the second term is the fraction of training examples that were classified correctly, but only with a small "margin" (the quantity Y f(X) is often called the margin of the classifier f).

Theorem 4 (Margin-based risk bound for weighted linear combinations). For any γ > 0, the bound

    L(f_n) ≤ L_n^γ(f_n) + (C λ / γ) √(V(G)/n) + √(log(1/δ)/(2n))           (19)

holds with probability at least 1 − δ.

Remark 1. Note that the first term on the right-hand side of (19) increases with γ, while the second term decreases with γ. Hence, if the learned classifier f_n has a small margin error for a large γ, i.e., it classifies the training samples well and with high "confidence," then its generalization error will be small.

References

[BBL05] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.

[DGL96] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[LT91] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.