BINARY CLASSIFICATION

MAXIM RAGINSKY

Date: March 4, 2011.

The problem of binary classification can be stated as follows. We have a random couple $Z = (X, Y)$, where $X \in \mathbb{R}^d$ is called the feature vector and $Y \in \{-1, +1\}$ is called the label. (The reason why we chose $\{-1,+1\}$, rather than $\{0,1\}$, for the label space is merely convenience.) In the spirit of the model-free framework, we assume that the relationship between the features and the labels is stochastic and described by an unknown probability distribution $P \in \mathcal{P}(\mathsf{Z})$, where $\mathsf{Z} = \mathbb{R}^d \times \{-1,+1\}$. As usual, we consider the case when we are given an i.i.d. sample of length $n$ from $P$. The goal is to learn a classifier, i.e., a mapping $g : \mathbb{R}^d \to \{-1,+1\}$, such that the probability of classification error, $P(g(X) \neq Y)$, is small. As we have seen before, the optimal choice is the Bayes classifier

(1)    $g^*(x) = \begin{cases} +1, & \text{if } \eta(x) > 1/2 \\ -1, & \text{otherwise} \end{cases}$

where $\eta(x) \triangleq P[Y = +1 \mid X = x]$ is the regression function. However, since we make no assumptions on $P$, in general we cannot hope to learn the Bayes classifier $g^*$. Instead, we focus on a more realistic goal: we fix a collection $\mathcal{G}$ of classifiers and then use the training data to come up with a hypothesis $\hat{g}_n \in \mathcal{G}$, such that $P(\hat{g}_n(X) \neq Y) \approx \inf_{g \in \mathcal{G}} P(g(X) \neq Y)$ with high probability. By way of notation, let us write $L(g)$ for the classification error of $g$, i.e., $L(g) \triangleq P(g(X) \neq Y)$, and let $L^*(\mathcal{G})$ denote the smallest classification error attainable over $\mathcal{G}$: $L^*(\mathcal{G}) \triangleq \inf_{g \in \mathcal{G}} L(g)$. We will assume that a minimizing $g^* \in \mathcal{G}$ exists. For future reference, we note that

(2)    $L(g) = P(g(X) \neq Y) = P(Y g(X) < 0)$.

Warning: In what follows, we will use $C$ or $c$ to denote various absolute constants; their values may change from line to line.

1. Learning linear discriminant rules

One of the simplest classification rules (and one of the first to be studied) is a linear discriminant rule: given a nonzero vector $w = (w^{(1)}, \ldots, w^{(d)}) \in \mathbb{R}^d$ and a scalar $b \in \mathbb{R}$, let

(3)    $g(x) \equiv g_{w,b}(x) = \begin{cases} +1, & \text{if } \langle w, x\rangle + b > 0 \\ -1, & \text{otherwise} \end{cases}$

Let $\mathcal{G}$ be the class of all such linear discriminant rules as $w$ ranges over all nonzero vectors in $\mathbb{R}^d$ and $b$ ranges over all reals: $\mathcal{G} = \{g_{w,b} : w \in \mathbb{R}^d \setminus \{0\},\ b \in \mathbb{R}\}$.
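For concreteness, here is a minimal Python sketch of the linear discriminant rule (3) and of the empirical classification error that the ERM algorithm below will minimize. The two-blob data set is purely hypothetical and serves only as an illustration.

```python
import numpy as np

def linear_rule(w, b):
    """The linear discriminant rule g_{w,b} of (3): +1 if <w, x> + b > 0, -1 otherwise."""
    def g(X):
        return np.where(X @ w + b > 0, 1, -1)
    return g

def empirical_error(g, X, y):
    """Empirical classification error L_n(g): fraction of training points misclassified by g."""
    return float(np.mean(g(X) != y))

# toy data: two Gaussian blobs with labels -1/+1 (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n, d = 200, 2
X = np.vstack([rng.normal(-1.0, 1.0, (n // 2, d)), rng.normal(+1.0, 1.0, (n // 2, d))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])

g = linear_rule(np.array([1.0, 1.0]), 0.0)
print("empirical error L_n(g):", empirical_error(g, X, y))
```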

Given the training sample $Z^n$, let $\hat{g}_n \in \mathcal{G}$ be the output of the ERM algorithm, i.e.,

$\hat{g}_n \in \arg\min_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n 1_{\{g(X_i) \neq Y_i\}}$.

In other words, $\hat{g}_n$ is any classifier of the form (3) that minimizes the number of misclassifications on the training sample. Then we have the following:

Theorem 1. There exists an absolute constant $C > 0$, such that for any $n \in \mathbb{N}$ and any $\delta \in (0,1)$, the bound

(4)    $L(\hat{g}_n) \le L^*(\mathcal{G}) + C\sqrt{\frac{d+1}{n}} + \sqrt{\frac{2\log(1/\delta)}{n}}$

holds with probability at least $1-\delta$.

Proof. A standard argument leads to the bound

(5)    $L(\hat{g}_n) \le L^*(\mathcal{G}) + 2\Delta_n(Z^n)$,

where

$\Delta_n(Z^n) \triangleq \sup_{g \in \mathcal{G}} |L(g) - L_n(g)|$

is the uniform deviation and $L_n(g)$ denotes the empirical classification error of $g$ on $Z^n$:

$L_n(g) = \frac{1}{n}\sum_{i=1}^n 1_{\{g(X_i) \neq Y_i\}}$,

which is the fraction of incorrectly labeled points in the training sample $Z^n$. Consider a classifier $g \in \mathcal{G}$ and define the set

$C_g \triangleq \big\{(x,y) \in \mathbb{R}^d \times \{-1,+1\} : y(\langle w, x\rangle + b) \le 0\big\}$.

Then it is easy to see that

$L(g) = P(C_g)$ and $L_n(g) = P_n(C_g)$,

where, as before,

$P_n = \frac{1}{n}\sum_{i=1}^n \delta_{Z_i} = \frac{1}{n}\sum_{i=1}^n \delta_{(X_i, Y_i)}$

is the empirical distribution of the sample $Z^n$. Let $\mathcal{C}$ denote the collection of all sets of the form $C = C_g$ for some $g \in \mathcal{G}$. Then

$\Delta_n(Z^n) = \sup_{C \in \mathcal{C}} |P_n(C) - P(C)|$.

Let $\mathcal{F} = \mathcal{F}_{\mathcal{C}}$ denote the class of indicator functions of the sets in $\mathcal{C}$: $\mathcal{F}_{\mathcal{C}} = \{1_{\{\cdot \in C\}} : C \in \mathcal{C}\}$. Then we know that, with probability at least $1-\delta$,

(6)    $\Delta_n(Z^n) \le 2\,\mathbf{E} R_n(\mathcal{F}(Z^n)) + \sqrt{\frac{\log(1/\delta)}{2n}}$,

where $R_n(\mathcal{F}(Z^n))$ is the Rademacher average of the projection of $\mathcal{F}$ onto the sample $Z^n$. Now,

$\mathcal{F}(Z^n) = \{(f(Z_1), \ldots, f(Z_n)) : f \in \mathcal{F}\} = \big\{(1_{\{Z_1 \in C\}}, \ldots, 1_{\{Z_n \in C\}}) : C \in \mathcal{C}\big\}$.

Therefore, if we prove that $\mathcal{C}$ is a VC class, then

$R_n(\mathcal{F}(Z^n)) \le C\sqrt{\frac{V(\mathcal{C})}{n}}$.

But this follows from the fact that any $C \in \mathcal{C}$ has the form

$C = \Big\{(x,y) \in \mathbb{R}^d \times \{-1,+1\} : \sum_{j=1}^d w^{(j)} y x^{(j)} + b y \le 0\Big\}$

for some $w \in \mathbb{R}^d \setminus \{0\}$ and some $b \in \mathbb{R}$, and the functions $(x,y) \mapsto y$ and $(x,y) \mapsto y x^{(j)}$, $1 \le j \le d$, span a linear space of dimension no greater than $d+1$. Hence, $V(\mathcal{C}) \le d+1$, so that

$R_n(\mathcal{F}(Z^n)) \le C\sqrt{\frac{V(\mathcal{C})}{n}} \le C\sqrt{\frac{d+1}{n}}$.

Combining this with (5) and (6), we see that (4) holds with probability at least $1-\delta$. $\square$

1.1. Generalized linear discriminant rules. In the same vein, we may consider classification rules of the form

(7)    $g(x) = \begin{cases} +1, & \text{if } \sum_{j=1}^k w^{(j)}\psi_j(x) + b > 0 \\ -1, & \text{otherwise} \end{cases}$

where $k$ is some positive integer (not necessarily equal to $d$), $w = (w^{(1)}, \ldots, w^{(k)}) \in \mathbb{R}^k$ is a nonzero vector, $b \in \mathbb{R}$ is an arbitrary scalar, and $\Psi = \{\psi_j : \mathbb{R}^d \to \mathbb{R}\}_{j=1}^k$ is some fixed dictionary of real-valued functions on $\mathbb{R}^d$. For a fixed $\Psi$, let $\mathcal{G}$ denote the collection of all classifiers of the form (7) as $w$ ranges over all nonzero vectors in $\mathbb{R}^k$ and $b$ ranges over all reals. Then the ERM rule is, again, given by

$\hat{g}_n \in \arg\min_{g \in \mathcal{G}} L_n(g) \equiv \arg\min_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n 1_{\{g(X_i) \neq Y_i\}}$.

The following result can be proved pretty much along the same lines as Theorem 1:

Theorem 2. There exists an absolute constant $C > 0$, such that for any $n \in \mathbb{N}$ and any $\delta \in (0,1)$, the bound

(8)    $L(\hat{g}_n) \le L^*(\mathcal{G}) + C\sqrt{\frac{k+1}{n}} + \sqrt{\frac{2\log(1/\delta)}{n}}$

holds with probability at least $1-\delta$.

1.2. Two fundamental issues. As Theorems 1 and 2 show, the ERM algorithm applied to the collection of all (generalized) linear discriminant rules is guaranteed to work well in the sense that the classification error of the output hypothesis will, with high probability, be close to the optimum achievable by any discriminant rule with the given structure. The same argument extends to any collection of classifiers $\mathcal{G}$ for which the error sets $\{(x,y) : y g(x) \le 0\}$, $g \in \mathcal{G}$, form a VC class of dimension much smaller than the sample size $n$. In other words, with high probability the difference

$L(\hat{g}_n) - L^*(\mathcal{G}) = L(\hat{g}_n) - \inf_{g \in \mathcal{G}} L(g)$

will be small. However, precisely because the VC dimension of $\mathcal{G}$ cannot be too large, the approximation properties of $\mathcal{G}$ will be limited.

Another problem is computational. For instance, the problem of finding an empirically optimal linear discriminant rule is NP-hard. In other words, unless P is equal to NP, there is no hope of coming up with an efficient ERM algorithm for linear discriminant rules that would work for all feature space dimensions $d$. If $d$ is fixed, then it is possible to enumerate all projections of a given sample $Z^n$ onto the class of indicators of all halfspaces in $O(n^d \log n)$ time, which allows for an exhaustive search for an ERM solution, but the usefulness of this naive approach is limited to $d < 5$.
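As a small illustration of the rules (7) and of the optimization problem that ERM poses, here is a sketch with a hypothetical quadratic dictionary and a crude random search over $(w, b)$; this random search is only a stand-in for the exhaustive enumeration mentioned above, not an efficient or exact algorithm.

```python
import numpy as np

def generalized_linear_rule(psi, w, b):
    """Classifier of the form (7): +1 if sum_j w_j psi_j(x) + b > 0, -1 otherwise.
    `psi` maps an (n, d) array of feature vectors to an (n, k) array of dictionary values."""
    def g(X):
        return np.where(psi(X) @ w + b > 0, 1, -1)
    return g

def quadratic_dictionary(X):
    """A hypothetical dictionary Psi = (x^(1), ..., x^(d), (x^(1))^2, ..., (x^(d))^2), so k = 2d."""
    return np.hstack([X, X ** 2])

def random_search_erm(psi, X, y, n_candidates=2000, seed=0):
    """Crude stand-in for ERM over rules of the form (7): draw random (w, b) pairs and
    keep the one with the smallest empirical error.  (This is NOT the exhaustive
    enumeration mentioned in the text; it only illustrates the objective being minimized.)"""
    rng = np.random.default_rng(seed)
    k = psi(X[:1]).shape[1]
    best_rule, best_err = None, np.inf
    for _ in range(n_candidates):
        w, b = rng.normal(size=k), rng.normal()
        err = np.mean(np.where(psi(X) @ w + b > 0, 1, -1) != y)
        if err < best_err:
            best_rule, best_err = (w, b), err
    return best_rule, best_err
```

For example, `random_search_erm(quadratic_dictionary, X, y)` returns the best $(w, b)$ found together with its empirical error $L_n$.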

2. Risk bounds for combined classifiers via surrogate loss functions

One way to sidestep the above approximation-theoretic and computational issues is to replace the 0-1 (Hamming) loss function that gives rise to the probability-of-error criterion with some other loss function. What we gain is the ability to bound the performance of various complicated classifiers, built up by combining simpler base classifiers, in terms of the complexity (e.g., the VC dimension) of the collection of the base classifiers, as well as considerable computational advantages, especially if the problem of minimizing the empirical surrogate loss turns out to be a convex programming problem. What we lose, though, is that, in general, we will not be able to compare the generalization error of the learned classifier to the minimum classification risk. Instead, we will have to be content with the fact that the generalization error will be close to the smallest surrogate loss.

We will consider classifiers of the form

(9)    $g_f(x) = \begin{cases} +1, & \text{if } f(x) \ge 0 \\ -1, & \text{otherwise} \end{cases} = \operatorname{sgn}(f(x))$,

where $f : \mathbb{R}^d \to \mathbb{R}$ is some function. From (2) we have

$L(g_f) = P(g_f(X) \neq Y) = P(Y g_f(X) < 0) = P(Y f(X) < 0)$.

From now on, when dealing with classifiers of the form (9), we will write $L(f)$ instead of $L(g_f)$ to keep the notation simple. Now we introduce the notion of a surrogate loss function.

Definition 1. A surrogate loss function is any function $\varphi : \mathbb{R} \to \mathbb{R}_+$, such that

(10)    $\varphi(x) \ge 1_{\{x > 0\}}$.

Some examples of commonly used surrogate losses:

(1) Exponential, $\varphi(x) = e^x$
(2) Logit, $\varphi(x) = \log_2(1 + e^x)$
(3) Hinge loss, $\varphi(x) = (x+1)_+ \triangleq \max\{x+1, 0\}$

Let $\varphi$ be a surrogate loss. Then for any $(x,y) \in \mathbb{R}^d \times \{-1,+1\}$ and any $f : \mathbb{R}^d \to \mathbb{R}$ we have

(11)    $1_{\{y f(x) < 0\}} = 1_{\{-y f(x) > 0\}} \le \varphi(-y f(x))$.

Therefore, defining the $\varphi$-risk of $f$ by

$A_\varphi(f) \triangleq \mathbf{E}[\varphi(-Y f(X))]$

and its empirical version

$A_{\varphi,n}(f) \triangleq \frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i))$,

we see from (11) that

(12)    $L(f) \le A_\varphi(f)$ and $L_n(f) \le A_{\varphi,n}(f)$.

Now that these preliminaries are out of the way, we can state and prove the basic surrogate loss bound:

Theorem 3. Consider any learning algorithm $\mathcal{A} = \{\mathcal{A}_n\}_{n=1}^\infty$, where, for each $n$, the mapping $\mathcal{A}_n$ receives the training sample $Z^n = (Z_1, \ldots, Z_n)$ as input and produces a function $\hat{f}_n : \mathbb{R}^d \to \mathbb{R}$ from some class $\mathcal{F}$. Suppose that $\mathcal{F}$ and the surrogate loss $\varphi$ are chosen so that the following conditions are satisfied:

(1) There exists some constant $B > 0$ such that

$\sup_{(x,y) \in \mathbb{R}^d \times \{-1,+1\}}\ \sup_{f \in \mathcal{F}} \varphi(-y f(x)) \le B$.

(2) There exists some constant $M_\varphi > 0$ such that $\varphi$ is $M_\varphi$-Lipschitz, i.e.,

$|\varphi(u) - \varphi(v)| \le M_\varphi |u - v|, \qquad \forall u, v \in \mathbb{R}$.

Then for any $n$ and any $\delta \in (0,1)$ the following bound holds with probability at least $1-\delta$:

(13)    $L(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + 2 M_\varphi \mathbf{E} R_n(\mathcal{F}(X^n)) + B\sqrt{\frac{\log(1/\delta)}{2n}}$.

Proof. Using (12), we can write

$L(\hat{f}_n) \le A_\varphi(\hat{f}_n) = A_{\varphi,n}(\hat{f}_n) + A_\varphi(\hat{f}_n) - A_{\varphi,n}(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + \sup_{f \in \mathcal{F}} |A_\varphi(f) - A_{\varphi,n}(f)|$.

Now let $\mathcal{H}$ denote the class of functions $h : \mathbb{R}^d \times \{-1,+1\} \to \mathbb{R}$ of the form $h(x,y) = -y f(x)$, $f \in \mathcal{F}$. Then the functions $\varphi \circ h$, $h \in \mathcal{H}$, are bounded between $0$ and $B > 0$, and

$\sup_{f \in \mathcal{F}} |A_\varphi(f) - A_{\varphi,n}(f)| = \sup_{f \in \mathcal{F}} \Big|\mathbf{E}[\varphi(-Y f(X))] - \frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i))\Big| = \sup_{h \in \mathcal{H}} |P(\varphi \circ h) - P_n(\varphi \circ h)|$,

where $\varphi \circ h(z) \triangleq \varphi(h(z))$ for every $z = (x,y) \in \mathbb{R}^d \times \{-1,+1\}$. Let

$\Delta_n(Z^n) \triangleq \sup_{h \in \mathcal{H}} |P(\varphi \circ h) - P_n(\varphi \circ h)|$.

We can now use the familiar symmetrization argument to bound

(14)    $\mathbf{E}\Delta_n(Z^n) \le 2\,\mathbf{E} R_n(\varphi \circ \mathcal{H}(Z^n))$,

where $\varphi \circ \mathcal{H}$ denotes the class of all functions of the form $\varphi \circ h$, $h \in \mathcal{H}$, and

(15)    $\varphi \circ \mathcal{H}(Z^n) \triangleq \{(\varphi \circ h(Z_1), \ldots, \varphi \circ h(Z_n)) : h \in \mathcal{H}\} = \{(\varphi(h(Z_1)), \ldots, \varphi(h(Z_n))) : h \in \mathcal{H}\}$

is the projection of $\varphi \circ \mathcal{H}$ onto the random sample $Z^n$. We now use a very powerful result about the Rademacher averages called the contraction principle, which states the following: if $A \subset \mathbb{R}^n$ is a bounded set and $\varphi : \mathbb{R} \to \mathbb{R}$ is an $M_\varphi$-Lipschitz function, then

(16)    $R_n(\varphi \circ A) \le M_\varphi R_n(A)$,

where $\varphi \circ A \triangleq \{(\varphi(a_1), \ldots, \varphi(a_n)) : a = (a_1, \ldots, a_n) \in A\}$. (The proof of the contraction principle is somewhat involved, and we do not give it here.) From (15) we immediately see that $\varphi \circ \mathcal{H}(Z^n) = \varphi \circ [\mathcal{H}(Z^n)]$. Therefore, applying (16) to $A = \mathcal{H}(Z^n)$ and then using the resulting bound in (14), we obtain

$\mathbf{E}\Delta_n(Z^n) \le 2 M_\varphi \mathbf{E} R_n(\mathcal{H}(Z^n))$.

Furthermore, letting $\sigma^n$ be an i.i.d. Rademacher tuple independent of $Z^n$, we have

$R_n(\mathcal{H}(Z^n)) = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{h \in \mathcal{H}}\Big|\sum_{i=1}^n \sigma_i h(Z_i)\Big|\Big] = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{f \in \mathcal{F}}\Big|\sum_{i=1}^n \sigma_i Y_i f(X_i)\Big|\Big] = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{f \in \mathcal{F}}\Big|\sum_{i=1}^n \sigma_i f(X_i)\Big|\Big] = R_n(\mathcal{F}(X^n))$,

which leads to

(17)    $\mathbf{E}\Delta_n(Z^n) \le 2 M_\varphi \mathbf{E} R_n(\mathcal{F}(X^n))$.

Now, since every function $\varphi \circ h$ is bounded between $0$ and $B$, the function $\Delta_n(Z^n)$ has bounded differences with $c_1 = \ldots = c_n = B/n$. Therefore, from (17) and from McDiarmid's inequality, we have for every $t > 0$ that

$P\big(\Delta_n(Z^n) \ge 2 M_\varphi \mathbf{E} R_n(\mathcal{F}(X^n)) + t\big) \le P\big(\Delta_n(Z^n) \ge \mathbf{E}\Delta_n(Z^n) + t\big) \le e^{-2nt^2/B^2}$.

Choosing $t = B\sqrt{\log(1/\delta)/(2n)}$, we see that

$\Delta_n(Z^n) \le 2 M_\varphi \mathbf{E} R_n(\mathcal{F}(X^n)) + B\sqrt{\frac{\log(1/\delta)}{2n}}$

with probability at least $1-\delta$. Therefore, since $L(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + \Delta_n(Z^n)$, we see that (13) holds with probability at least $1-\delta$. $\square$

What the above theorem tells us is that the performance of the learned classifier $\hat{f}_n$ is controlled by the Rademacher average of the class $\mathcal{F}$, and we can always arrange it to be relatively small. We will now look at several specific examples.

3. Weighted linear combination of classifiers

Let $\mathcal{G} = \{g : \mathbb{R}^d \to \{-1,+1\}\}$ be a class of base classifiers (not to be confused with Bayes classifiers!), and consider the class

$\mathcal{F}_\lambda \triangleq \Big\{f = \sum_{j=1}^N c_j g_j : N \in \mathbb{N},\ \sum_{j=1}^N |c_j| \le \lambda;\ g_1, \ldots, g_N \in \mathcal{G}\Big\}$,

where $\lambda > 0$ is a tunable parameter. Then for each $f = \sum_{j=1}^N c_j g_j \in \mathcal{F}_\lambda$ the corresponding classifier $g_f$ of the form (9) is given by

$g_f(x) = \operatorname{sgn}\Big(\sum_{j=1}^N c_j g_j(x)\Big)$.

A useful way of thinking about $g_f$ is that, upon receiving a feature $x \in \mathbb{R}^d$, it computes the outputs $g_1(x), \ldots, g_N(x)$ of the $N$ base classifiers from $\mathcal{G}$ and then takes a weighted majority vote; indeed, if we had $c_1 = \ldots = c_N = \lambda/N$, then $g_f(x)$ would precisely correspond to taking the majority vote among the $N$ base classifiers. Note, by the way, that the number of base classifiers is not fixed, and can be learned from the data.
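The following short sketch puts the surrogate losses of Definition 1 and the weighted combinations above into code. The decision-stump base class is a hypothetical choice of $\mathcal{G}$, used only for illustration.

```python
import numpy as np

# Surrogate losses in the convention of Definition 1 (phi is applied to -y f(x)):
exponential = np.exp                                  # phi(x) = e^x
logit       = lambda x: np.log2(1.0 + np.exp(x))      # phi(x) = log_2(1 + e^x)
hinge       = lambda x: np.maximum(x + 1.0, 0.0)      # phi(x) = (x + 1)_+

def empirical_phi_risk(phi, f, X, y):
    """Empirical phi-risk A_{phi,n}(f) = (1/n) sum_i phi(-Y_i f(X_i))."""
    return float(np.mean(phi(-y * f(X))))

def stump(j, t):
    """A hypothetical base classifier g in G: a decision stump on coordinate j with threshold t."""
    return lambda X: np.where(X[:, j] > t, 1, -1)

def weighted_combination(base_classifiers, c):
    """An element f = sum_j c_j g_j of F_lambda (here lambda = sum_j |c_j|) together with
    the induced weighted-majority-vote classifier g_f(x) = sgn(f(x)) of (9)."""
    def f(X):
        return sum(cj * gj(X) for cj, gj in zip(c, base_classifiers))
    def g_f(X):
        return np.where(f(X) >= 0, 1, -1)
    return f, g_f
```

For instance, `f, g_f = weighted_combination([stump(0, 0.0), stump(1, 0.5)], [0.7, 0.3])` gives $\lambda = 1$, and `empirical_phi_risk(hinge, f, X, y)` upper-bounds the empirical classification error of $g_f$, as in (12).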

Now, Theorem 3 tells us that the performance of any learning algorithm that accepts a training sample $Z^n$ and produces a function $\hat{f}_n \in \mathcal{F}_\lambda$ is controlled by the Rademacher average $R_n(\mathcal{F}_\lambda(X^n))$. It turns out, moreover, that we can relate it to the Rademacher average of the base class $\mathcal{G}$. To start, note that $\mathcal{F}_\lambda = \lambda \operatorname{absconv}\mathcal{G}$, where

$\operatorname{absconv}\mathcal{G} \triangleq \Big\{\sum_{j=1}^N c_j g_j : N \in \mathbb{N};\ \sum_{j=1}^N |c_j| \le 1;\ g_1, \ldots, g_N \in \mathcal{G}\Big\}$

is the absolute convex hull of $\mathcal{G}$. Therefore

$R_n(\mathcal{F}_\lambda(X^n)) = \lambda R_n(\mathcal{G}(X^n))$.

Now note that the functions in $\mathcal{G}$ are binary-valued. Therefore, assuming that the base class $\mathcal{G}$ is a VC class, we will have

$R_n(\mathcal{G}(X^n)) \le C\sqrt{\frac{V(\mathcal{G})}{n}}$.

Combining these bounds with the bound of Theorem 3, we conclude that for any $\hat{f}_n$ selected from $\mathcal{F}_\lambda$ based on the training sample $Z^n$, the bound

$L(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + C\lambda M_\varphi\sqrt{\frac{V(\mathcal{G})}{n}} + B\sqrt{\frac{\log(1/\delta)}{2n}}$

will hold with probability at least $1-\delta$, where $B$ is the uniform upper bound on $\varphi(-y f(x))$, $f \in \mathcal{F}_\lambda$, $(x,y) \in \mathbb{R}^d \times \{-1,+1\}$, and $M_\varphi$ is the Lipschitz constant of the surrogate loss $\varphi$.

Note that the above bound involves only the VC dimension of the base class, which is typically small. On the other hand, the class $\mathcal{F}_\lambda$ obtained by forming weighted combinations of classifiers from $\mathcal{G}$ is extremely rich, and will in general have infinite VC dimension! But there is a price we pay: the first term is the empirical surrogate loss $A_{\varphi,n}(\hat{f}_n)$, rather than the empirical classification error $L_n(\hat{f}_n)$. However, it is possible to choose the surrogate $\varphi$ in such a way that $A_{\varphi,n}(\cdot)$ can be bounded in terms of a quantity related to the number of misclassified training examples. Here is an example. Fix a positive parameter $\gamma > 0$ and consider

$\varphi(x) = \begin{cases} 0, & \text{if } x \le -\gamma \\ 1, & \text{if } x \ge 0 \\ 1 + x/\gamma, & \text{otherwise} \end{cases}$

This is a valid surrogate loss with $B = 1$ and $M_\varphi = 1/\gamma$, but in addition we have $\varphi(x) \le 1_{\{x > -\gamma\}}$, which implies that $\varphi(-y f(x)) \le 1_{\{y f(x) < \gamma\}}$. Therefore, for any $f$ we have

(18)    $A_{\varphi,n}(f) = \frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i)) \le \frac{1}{n}\sum_{i=1}^n 1_{\{Y_i f(X_i) < \gamma\}}$.

The quantity

(19)    $L^\gamma_n(f) \triangleq \frac{1}{n}\sum_{i=1}^n 1_{\{Y_i f(X_i) < \gamma\}}$

is called the margin error of $f$. Notice that:

- For any $\gamma > 0$, $L^\gamma_n(f) \ge L_n(f)$.
- The function $\gamma \mapsto L^\gamma_n(f)$ is increasing.
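Continuing the sketch above, the margin error (19) and the piecewise-linear surrogate just introduced are simple enough to restate directly in code; `ramp_surrogate` and `margin_error` below are nothing more than those two formulas.

```python
import numpy as np

def ramp_surrogate(gamma):
    """The piecewise-linear surrogate above: 0 for x <= -gamma, 1 for x >= 0,
    and 1 + x/gamma in between (so B = 1 and M_phi = 1/gamma)."""
    return lambda x: np.clip(1.0 + np.asarray(x) / gamma, 0.0, 1.0)

def margin_error(f, X, y, gamma):
    """Margin error L^gamma_n(f) of (19): fraction of training points with Y_i f(X_i) < gamma."""
    return float(np.mean(y * f(X) < gamma))
```

With `phi = ramp_surrogate(gamma)`, one can verify numerically that `empirical_phi_risk(phi, f, X, y) <= margin_error(f, X, y, gamma)`, which is exactly the inequality (18).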

Notice also that we can write

$L^\gamma_n(f) = \frac{1}{n}\sum_{i=1}^n 1_{\{Y_i f(X_i) < 0\}} + \frac{1}{n}\sum_{i=1}^n 1_{\{0 \le Y_i f(X_i) < \gamma\}}$,

where the first term is just $L_n(f)$, while the second term is the fraction of training examples that were classified correctly, but only with small margin (the quantity $Y f(X)$ is often called the margin of the classifier $f$).

Theorem 4 (Margin-based risk bound for weighted linear combinations). For any $\gamma > 0$, the bound

(20)    $L(\hat{f}_n) \le L^\gamma_n(\hat{f}_n) + \frac{C\lambda}{\gamma}\sqrt{\frac{V(\mathcal{G})}{n}} + \sqrt{\frac{\log(1/\delta)}{2n}}$

holds with probability at least $1-\delta$.

Remark 1. Note that the first term on the right-hand side of (20) increases with $\gamma$, while the second term decreases with $\gamma$. Hence, if the learned classifier $\hat{f}_n$ has a small margin error for a large $\gamma$, i.e., it classifies the training samples well and with high confidence, then its generalization error will be small.

4. Kernel machines

Another powerful way of building complicated classifiers from simple functions is by means of kernels. Kernel methods are popular in machine learning for a variety of reasons, not the least of which is that any algorithm that operates in a Euclidean space and relies only on the computation of inner products between feature vectors can be modified to work with any suitably well-behaved kernel. To start with, let us define what we mean by a kernel. We will stick to Euclidean feature spaces, although everything works out for arbitrary separable metric spaces.

Definition 2. Let $\mathsf{X}$ be a closed subset of $\mathbb{R}^d$. A real-valued function $K : \mathsf{X} \times \mathsf{X} \to \mathbb{R}$ is called a Mercer kernel provided the following conditions are met:

(1) It is symmetric, i.e., $K(x, x') = K(x', x)$ for any $x, x' \in \mathsf{X}$.
(2) It is continuous, i.e., if $\{x_n\}$ is a sequence of points in $\mathsf{X}$ converging to a point $x$, then $\lim_{n \to \infty} K(x_n, x') = K(x, x')$ for all $x' \in \mathsf{X}$.
(3) It is positive semidefinite, i.e., for all $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ and all $x_1, \ldots, x_n \in \mathsf{X}$,

(21)    $\sum_{i,j=1}^n \alpha_i \alpha_j K(x_i, x_j) \ge 0$.

Remark 2. Another way to interpret the positive semidefiniteness condition is as follows. For any $n$-tuple $x^n = (x_1, \ldots, x_n) \in \mathsf{X}^n$, define the $n \times n$ kernel Gram matrix $G_K(x^n) \triangleq [K(x_i, x_j)]_{i,j=1}^n$. Then (21) is equivalent to saying that $G_K(x^n)$ is positive semidefinite in the usual sense, i.e., for any vector $v \in \mathbb{R}^n$ we have $\langle v, G_K(x^n) v\rangle \ge 0$.

Remark 3. From now on, we will just say "kernel," but always mean Mercer kernel.

Here are some examples of kernels:

(1) With $\mathsf{X} = \mathbb{R}^d$, $K(x, x') = \langle x, x'\rangle$, the usual Euclidean inner product.

(2) A more general class of kernels based on the Euclidean inner product can be constructed as follows. Let $\mathsf{X} = \{x \in \mathbb{R}^d : \|x\| \le R\}$; choose any sequence $\{a_j\}_{j=0}^\infty$ of nonnegative reals such that $\sum_{j=0}^\infty a_j R^{2j} < \infty$. Then

$K(x, x') = \sum_{j=0}^\infty a_j \langle x, x'\rangle^j$

is a kernel.

(3) Let $\mathsf{X} = \mathbb{R}^d$, and let $k : \mathbb{R}^d \to \mathbb{R}$ be a continuous function which is reflection-symmetric, i.e., $k(-x) = k(x)$ for all $x$. Then $K(x, x') \triangleq k(x - x')$ is a kernel provided the Fourier transform of $k$,

$\hat{k}(\xi) \triangleq \int_{\mathbb{R}^d} e^{-i\langle\xi, x\rangle} k(x)\,dx$,

is nonnegative. A prime example is the Gaussian kernel, induced by the function $k(x) = e^{-\gamma\|x\|^2}$.

In all of the above cases, the first two properties of a Mercer kernel are easy to check. The third, i.e., positive semidefiniteness, requires a bit more work. For details, consult Section 2.5 of the book by Cucker and Zhou [CZ07].

The importance of kernels in machine learning stems from the fact that we can use them to represent (or approximate) arbitrarily complicated continuous functions on the feature space $\mathsf{X}$. In order to take full advantage of this representational power, we must take a detour into the theory of Hilbert spaces.

4.1. A crash course on Hilbert spaces. Hilbert spaces are a powerful generalization of the usual Euclidean space with an inner product; once we have an inner product, we can introduce the notion of an angle and, consequently, orthogonality. Moreover, a Hilbert space has certain favorable convergence properties, so we can speak about (unique) linear projections of its elements onto closed linear subspaces. Let us make these ideas precise.

Definition 3. A real vector space $V$ is an inner product space if there exists a function $\langle\cdot,\cdot\rangle_V : V \times V \to \mathbb{R}$, which is:

(1) Symmetric: $\langle v, v'\rangle_V = \langle v', v\rangle_V$ for all $v, v' \in V$.
(2) Linear: $\langle\alpha v_1 + \beta v_2, v\rangle_V = \alpha\langle v_1, v\rangle_V + \beta\langle v_2, v\rangle_V$ for all $\alpha, \beta \in \mathbb{R}$ and all $v_1, v_2, v \in V$.
(3) Positive definite: $\langle v, v\rangle_V \ge 0$ for all $v \in V$, and $\langle v, v\rangle_V = 0$ if and only if $v = 0$.

Let $(V, \langle\cdot,\cdot\rangle_V)$ be an inner product space. Then we can define a norm on $V$ via $\|v\|_V \triangleq \sqrt{\langle v, v\rangle_V}$. It is easy to check that this is, indeed, a norm:

(1) It is homogeneous: for any $v \in V$ and any $\alpha \in \mathbb{R}$,

(22)    $\|\alpha v\|_V = \sqrt{\langle\alpha v, \alpha v\rangle_V} = \sqrt{\alpha^2\langle v, v\rangle_V} = |\alpha|\sqrt{\langle v, v\rangle_V} = |\alpha|\,\|v\|_V$.

(2) It satisfies the triangle inequality: for any $v, v' \in V$, $\|v + v'\|_V \le \|v\|_V + \|v'\|_V$.

To prove this, we first need to establish another key property of $\|\cdot\|_V$: the Cauchy-Schwarz inequality, which generalizes its classical Euclidean counterpart and says that

(23)    $|\langle v, v'\rangle_V| \le \|v\|_V\,\|v'\|_V$.

To prove (23), we start with the observation that

$\|v - \lambda v'\|_V^2 = \langle v - \lambda v', v - \lambda v'\rangle_V \ge 0$

for any $\lambda \in \mathbb{R}$. Expanding this, we get

$\langle v - \lambda v', v - \lambda v'\rangle_V = \lambda^2\|v'\|_V^2 - 2\lambda\langle v, v'\rangle_V + \|v\|_V^2 \ge 0$.

This is a quadratic function of $\lambda$, and from the above we see that its graph does not cross the horizontal axis. Therefore, we must have

$4\langle v, v'\rangle_V^2 - 4\|v\|_V^2\|v'\|_V^2 \le 0 \quad\Longrightarrow\quad |\langle v, v'\rangle_V| \le \|v\|_V\,\|v'\|_V$.

Now we can write

$(\|v\|_V + \|v'\|_V)^2 = \|v\|_V^2 + 2\|v\|_V\|v'\|_V + \|v'\|_V^2 \ge \|v\|_V^2 + 2\langle v, v'\rangle_V + \|v'\|_V^2 = \langle v, v\rangle_V + \langle v, v'\rangle_V + \langle v', v\rangle_V + \langle v', v'\rangle_V = \langle v + v', v + v'\rangle_V = \|v + v'\|_V^2$,

where the first step uses the Cauchy-Schwarz inequality, the second step uses the definition of $\|\cdot\|_V$ and the symmetry of $\langle\cdot,\cdot\rangle_V$, the third step uses the linearity of $\langle\cdot,\cdot\rangle_V$, and the final step is, again, by definition. Since all norms are nonnegative, we can take square roots of both sides to get the triangle inequality.

(3) Finally, $\|v\|_V \ge 0$, and $\|v\|_V = 0$ if and only if $v = 0$; this is obvious from the definitions.

Thus, an inner product space can be equipped with a norm that has certain special properties (mainly, the Cauchy-Schwarz inequality, since a lot of useful things follow from it alone). Now that we have a norm, we can talk about convergence of sequences in $V$:

Definition 4. Let $\{v_n\}_{n=1}^\infty$ be a sequence of elements of $V$. We say that it converges to $v \in V$ if

(24)    $\lim_{n \to \infty} \|v_n - v\|_V = 0$.

Remark 4. This definition is valid for any norm on $V$, not necessarily a norm induced by an inner product.

Any norm-convergent sequence has the property that, as $n$ gets larger, its elements get closer and closer to one another. Specifically, suppose that $\{v_n\}$ converges to $v$. Then (24) implies that for any $\varepsilon > 0$ we can choose $n$ large enough, so that $\|v_m - v\|_V < \varepsilon/2$ for all $m \ge n$. But the triangle inequality gives

$\|v_n - v_m\|_V \le \|v_n - v\|_V + \|v_m - v\|_V < \varepsilon, \qquad \forall m \ge n$.

In other words, we have $\lim_{m \to \infty}\|v_n - v_m\|_V = 0$. Since this holds for every such $n$, we can write

(25)    $\lim_{\min(m,n) \to \infty}\|v_n - v_m\|_V = 0$.

Any sequence $\{v_n\}$ that has the property (25) is called a Cauchy sequence. We have just proved that any convergent sequence is Cauchy. However, the converse is not necessarily true: a Cauchy sequence does not have to be convergent. This motivates the following definition:

Definition 5. A normed space $(V, \|\cdot\|_V)$ is complete if any Cauchy sequence $\{v_n\}$ of its elements is convergent. If the norm $\|\cdot\|_V$ is induced by an inner product, then we say that $V$ is a Hilbert space.

There is a standard procedure of starting with an inner product space (and the corresponding normed space) and then completing it by adding the limits of all Cauchy sequences. We will not worry too much about this procedure. Here are a few standard examples of Hilbert spaces:

(1) The Euclidean space $V = \mathbb{R}^d$ with the usual inner product

$\langle v, v'\rangle = \sum_{j=1}^d v_j v'_j$.

The corresponding norm is the familiar $\ell_2$ norm, $\|v\| = \sqrt{\langle v, v\rangle}$.

(2) More generally, if $A$ is a positive definite $d \times d$ matrix, then the inner product

$\langle v, v'\rangle_A \triangleq \langle v, A v'\rangle$

induces the $A$-weighted norm $\|v\|_A \triangleq \sqrt{\langle v, v\rangle_A} = \sqrt{\langle v, A v\rangle}$, which makes $\mathbb{R}^d$ into a Hilbert space. The preceding example is a special case with $A = I_d$, the $d \times d$ identity matrix.

(3) The space $L^2(\mathbb{R}^d)$ of all square-integrable functions $f : \mathbb{R}^d \to \mathbb{R}$, i.e., $\int_{\mathbb{R}^d} f^2(x)\,dx < \infty$, is a Hilbert space with the inner product

$\langle f, g\rangle_{L^2(\mathbb{R}^d)} \triangleq \int_{\mathbb{R}^d} f(x) g(x)\,dx$

and the corresponding norm

$\|f\|_{L^2(\mathbb{R}^d)} \triangleq \sqrt{\int_{\mathbb{R}^d} f^2(x)\,dx}$.

(4) Let $(\Omega, \mathcal{B}, P)$ be a probability space. Then the space $L^2(P)$ of all real-valued random variables $X : \Omega \to \mathbb{R}$ with finite second moment, i.e., $\mathbf{E} X^2 = \int_\Omega X^2(\omega) P(d\omega) < \infty$, is a Hilbert space with the inner product

$\langle X, X'\rangle_{L^2(P)} \triangleq \mathbf{E}[X X'] = \int_\Omega X(\omega) X'(\omega) P(d\omega)$

and the corresponding norm

$\|X\|_{L^2(P)} \triangleq \sqrt{\int_\Omega X(\omega)^2 P(d\omega)} = \sqrt{\mathbf{E} X^2}$.

From now on, we will denote a typical Hilbert space by $(\mathcal{H}, \langle\cdot,\cdot\rangle_{\mathcal{H}})$; the induced norm will be denoted by $\|\cdot\|_{\mathcal{H}}$. An enormous advantage of working with Hilbert spaces is the availability of the notions of orthogonality and orthogonal projection. Two elements $h, g$ of a Hilbert space $\mathcal{H}$ are said to be orthogonal if $\langle h, g\rangle_{\mathcal{H}} = 0$. Now consider a closed linear subspace $\mathcal{H}_0$ of $\mathcal{H}$, where "closed" means that the limit of any convergent sequence $\{h_n\}$ of elements of $\mathcal{H}_0$ is also contained in $\mathcal{H}_0$. Then we have the following basic facts:

Theorem 5. Let $\mathcal{H}_0^\perp$ be the set of all $h \in \mathcal{H}$, such that $\langle g, h\rangle_{\mathcal{H}} = 0$ for all $g \in \mathcal{H}_0$. Then:

(1) $\mathcal{H}_0^\perp$ is also a closed linear subspace of $\mathcal{H}$.
(2) Any element $g \in \mathcal{H}$ can be uniquely decomposed as $g = h + h^\perp$, where $h \in \mathcal{H}_0$ and $h^\perp \in \mathcal{H}_0^\perp$.
(3) Define the orthogonal projection $\Pi : \mathcal{H} \to \mathcal{H}_0$ onto $\mathcal{H}_0$ through $\Pi g \triangleq h$ if $g = h + h^\perp$ with $h \in \mathcal{H}_0$, $h^\perp \in \mathcal{H}_0^\perp$. Then $\Pi$ has the following properties:
(a) It is a linear operator.
(b) $\Pi^2 = \Pi$, i.e., $\Pi(\Pi g) = \Pi g$ for any $g \in \mathcal{H}$.
(c) If $g \in \mathcal{H}_0$, then $\Pi g = g$.
(d) For any $g \in \mathcal{H}$ and any $h \in \mathcal{H}_0$, $\langle\Pi g, h\rangle_{\mathcal{H}} = \langle g, h\rangle_{\mathcal{H}}$.
(e) For any $g \in \mathcal{H}$, $h^* = \Pi g \in \mathcal{H}_0$ is the unique solution of the optimization problem: minimize $\|h - g\|_{\mathcal{H}}$ subject to $h \in \mathcal{H}_0$.

Remark 5. It is important for $\mathcal{H}_0$ to be a closed linear subspace of $\mathcal{H}$ for the above results to hold.

4.2. Reproducing kernel Hilbert spaces. Now let us return to our original goal. Suppose we have a fixed kernel $K$ on our feature space $\mathsf{X}$ (which we assume to be a closed subset of $\mathbb{R}^d$). Let $\mathcal{L}_K(\mathsf{X})$ be the linear span of the set $\{K(x, \cdot) : x \in \mathsf{X}\}$, i.e., the set of all functions $f : \mathsf{X} \to \mathbb{R}$ of the form

(26)    $f(x) = \sum_{j=1}^N c_j K(x_j, x)$

for all possible choices of $N \in \mathbb{N}$, $c_1, \ldots, c_N \in \mathbb{R}$, and $x_1, \ldots, x_N \in \mathsf{X}$. It is easy to see that $\mathcal{L}_K(\mathsf{X})$ is a vector space: for any two functions $f, f'$ of the form (26), their sum is also of that form; if we multiply any $f \in \mathcal{L}_K(\mathsf{X})$ by a scalar $c \in \mathbb{R}$, we will get another element of $\mathcal{L}_K(\mathsf{X})$; and the zero function is clearly in $\mathcal{L}_K(\mathsf{X})$. It turns out that, for any (Mercer) kernel $K$, we can complete $\mathcal{L}_K(\mathsf{X})$ into a Hilbert space of functions that can potentially represent any continuous function from $\mathsf{X}$ into $\mathbb{R}$, provided $K$ is chosen appropriately. The following result is essential (for the proof, see Section 2.4 of Cucker and Zhou [CZ07]):

Theorem 6. Let $\mathsf{X}$ be a closed subset of $\mathbb{R}^d$, and let $K : \mathsf{X} \times \mathsf{X} \to \mathbb{R}$ be a Mercer kernel. Then there exists a unique Hilbert space $(\mathcal{H}_K, \langle\cdot,\cdot\rangle_K)$ of real-valued functions on $\mathsf{X}$ with the following properties:

(1) For all $x \in \mathsf{X}$, the function $K_x(\cdot) \triangleq K(x, \cdot)$ is an element of $\mathcal{H}_K$, and $\langle K_x, K_{x'}\rangle_K = K(x, x')$ for all $x, x' \in \mathsf{X}$.
(2) The linear space $\mathcal{L}_K(\mathsf{X})$ is dense in $\mathcal{H}_K$, i.e., for any $f \in \mathcal{H}_K$ and any $\varepsilon > 0$ there exist some $N \in \mathbb{N}$, $c_1, \ldots, c_N \in \mathbb{R}$, and $x_1, \ldots, x_N \in \mathsf{X}$, such that

$\Big\|f - \sum_{j=1}^N c_j K_{x_j}\Big\|_K < \varepsilon$.

(3) For all $f \in \mathcal{H}_K$ and all $x \in \mathsf{X}$,

(27)    $f(x) = \langle K_x, f\rangle_K$.

Moreover, the functions in $\mathcal{H}_K$ are continuous.

The Hilbert space $\mathcal{H}_K$ is called the Reproducing Kernel Hilbert Space (RKHS) associated with $K$; the property (27) is referred to as the reproducing kernel property.
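To make the objects just defined concrete, here is a minimal sketch of a Mercer kernel, its Gram matrix (Remark 2), a numerical check of positive semidefiniteness, and an element of $\mathcal{L}_K(\mathsf{X})$ of the form (26). The particular points and coefficients are hypothetical.

```python
import numpy as np

def gaussian_kernel(gamma):
    """The Gaussian kernel K(x, x') = exp(-gamma * ||x - x'||^2) from example (3) above."""
    def K(x, xp):
        diff = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
        return float(np.exp(-gamma * np.dot(diff, diff)))
    return K

def gram_matrix(K, xs):
    """The kernel Gram matrix G_K(x^n) = [K(x_i, x_j)] of Remark 2."""
    n = len(xs)
    return np.array([[K(xs[i], xs[j]) for j in range(n)] for i in range(n)])

def is_psd(G, tol=1e-10):
    """Numerical check of (21): all eigenvalues of the symmetric Gram matrix are >= -tol."""
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

def rkhs_element(K, centers, coeffs):
    """A function f in L_K(X) of the form (26): f(x) = sum_j c_j K(x_j, x)."""
    def f(x):
        return sum(c * K(xj, x) for c, xj in zip(coeffs, centers))
    return f

# tiny usage example on hypothetical points
K = gaussian_kernel(gamma=0.5)
xs = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])]
G = gram_matrix(K, xs)
print(is_psd(G))                      # True, as guaranteed for a Mercer kernel
f = rkhs_element(K, xs, [1.0, -0.5, 0.2])
print(f(np.array([0.5, 0.5])))        # evaluate the RKHS element at a new point
```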

Remark 6. The reproducing kernel property essentially states that the value of any function $f \in \mathcal{H}_K$ at any point $x \in \mathsf{X}$ can be extracted by projecting $f$ onto the function $K_x(\cdot) = K(x, \cdot)$, i.e., a copy of the kernel centered at the point $x$. It is easy to prove when $f \in \mathcal{L}_K(\mathsf{X})$. Indeed, if $f$ has the form (26), then

$\langle f, K_x\rangle_K = \Big\langle\sum_{j=1}^N c_j K_{x_j}, K_x\Big\rangle_K = \sum_{j=1}^N c_j\langle K_{x_j}, K_x\rangle_K = \sum_{j=1}^N c_j K(x_j, x) = f(x)$.

Since any $f \in \mathcal{H}_K$ can be expressed as a limit of functions from $\mathcal{L}_K(\mathsf{X})$, the proof of (27) for a general $f$ follows by continuity.

Now we pick a kernel $K$ on our feature space and consider classifiers of the form (9) with the underlying $f$ taken from a suitable subset of the RKHS $\mathcal{H}_K$. One choice, which underlies such things as the Support Vector Machine, is to take a ball in $\mathcal{H}_K$: given some $\lambda > 0$, let $\mathcal{F}_\lambda \triangleq \{f \in \mathcal{H}_K : \|f\|_K \le \lambda\}$. This set is the closure (in the $\|\cdot\|_K$ norm) of the convex set

$\Big\{\sum_{j=1}^N c_j K_{x_j} : N \in \mathbb{N};\ c_1, \ldots, c_N \in \mathbb{R};\ x_1, \ldots, x_N \in \mathsf{X};\ \sum_{i,j=1}^N c_i c_j K(x_i, x_j) \le \lambda^2\Big\} \subset \mathcal{L}_K(\mathsf{X})$,

and is itself convex. Now, as we already know, the performance of any learning algorithm that chooses an element $\hat{f}_n \in \mathcal{F}_\lambda$ in a data-dependent way is controlled by the Rademacher average $R_n(\mathcal{F}_\lambda(X^n))$. It turns out that this Rademacher average is fairly easy to estimate. Indeed, using the reproducing kernel property (27) and then the linearity of the inner product $\langle\cdot,\cdot\rangle_K$, we can write

$R_n(\mathcal{F}_\lambda(X^n)) = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{f:\|f\|_K \le \lambda}\Big|\sum_{i=1}^n \sigma_i f(X_i)\Big|\Big] = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{f:\|f\|_K \le \lambda}\Big|\sum_{i=1}^n \sigma_i\langle f, K_{X_i}\rangle_K\Big|\Big] = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{f:\|f\|_K \le \lambda}\Big|\Big\langle f, \sum_{i=1}^n \sigma_i K_{X_i}\Big\rangle_K\Big|\Big]$.

Now, using the Cauchy-Schwarz inequality (23), it is not hard to show that

$\sup_{f:\|f\|_K \le \lambda}|\langle f, g\rangle_K| = \lambda\|g\|_K$

for any $g \in \mathcal{H}_K$. Therefore,

$R_n(\mathcal{F}_\lambda(X^n)) = \frac{\lambda}{n}\mathbf{E}_{\sigma^n}\Big\|\sum_{i=1}^n \sigma_i K_{X_i}\Big\|_K$.

Now we exploit the following easily proved fact: for any $n$ functions $g_1, \ldots, g_n \in \mathcal{H}_K$,

(28)    $\mathbf{E}_{\sigma^n}\Big\|\sum_{i=1}^n \sigma_i g_i\Big\|_K \le \sqrt{\sum_{i=1}^n\|g_i\|_K^2}$.

The proof of this is in two steps. First, we use the concavity of the square root to write

$\mathbf{E}_{\sigma^n}\Big\|\sum_{i=1}^n \sigma_i g_i\Big\|_K \le \sqrt{\mathbf{E}_{\sigma^n}\Big\|\sum_{i=1}^n \sigma_i g_i\Big\|_K^2}$.

Then we expand the squared norm:

$\Big\|\sum_{i=1}^n \sigma_i g_i\Big\|_K^2 = \Big\langle\sum_{i=1}^n \sigma_i g_i, \sum_{j=1}^n \sigma_j g_j\Big\rangle_K = \sum_{i,j=1}^n \sigma_i\sigma_j\langle g_i, g_j\rangle_K$.

And finally we take the expectation over $\sigma^n$ and use the fact that $\mathbf{E}[\sigma_i\sigma_j] = 1$ if $i = j$ and $0$ otherwise to get

$\mathbf{E}_{\sigma^n}\Big\|\sum_{i=1}^n \sigma_i g_i\Big\|_K^2 = \sum_{i=1}^n\langle g_i, g_i\rangle_K = \sum_{i=1}^n\|g_i\|_K^2$.

Hence, we obtain

$R_n(\mathcal{F}_\lambda(X^n)) \le \frac{\lambda}{n}\sqrt{\sum_{i=1}^n\langle K_{X_i}, K_{X_i}\rangle_K} = \frac{\lambda}{n}\sqrt{\sum_{i=1}^n K(X_i, X_i)}$.

Finally, taking the expectation w.r.t. $X^n$ and once more using the concavity of the square root, we have

$\mathbf{E} R_n(\mathcal{F}_\lambda(X^n)) \le \lambda\sqrt{\frac{\mathbf{E} K(X, X)}{n}}$.

4.3. Empirical risk minimization in an RKHS. Another advantage of working with kernels is that, in many cases, a minimizer of empirical risk over a sufficiently regular subset of an RKHS will have the form of a linear combination of kernels centered at the training feature points. The results ensuring this are often referred to in the literature as representer theorems. Here is one such result (due, in a slightly different form, to Schölkopf, Herbrich, and Smola [SHS01]), sufficiently general for our purposes:

Theorem 7 (The generalized representer theorem). Let $\mathsf{X}$ be a closed subset of $\mathbb{R}^d$ and let $\mathsf{Y}$ be a subset of the reals. Consider a nonnegative loss function $\ell : \mathsf{Y} \times \mathsf{Y} \to \mathbb{R}_+$. Let $K$ be a Mercer kernel on $\mathsf{X}$, and let $\mathcal{H}_K$ be the corresponding RKHS. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an i.i.d. sample from some distribution $P = P_{XY}$ on $\mathsf{X} \times \mathsf{Y}$, let $\mathcal{H}_{K,X^n}$ be the closed linear subspace of $\mathcal{H}_K$ spanned by $\{K_{X_i} : 1 \le i \le n\}$, and let $\Pi_{X^n}$ denote the orthogonal projection onto $\mathcal{H}_{K,X^n}$. Let $\mathcal{F}$ be a subset of $\mathcal{H}_K$, such that $\Pi_{X^n}(\mathcal{F}) \subseteq \mathcal{F}$. Then

(29)    $\inf_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i)) = \inf_{f \in \Pi_{X^n}(\mathcal{F})}\frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i))$,

and if a minimizer of the left-hand side of (29) exists, then it can be taken to have the form

(30)    $\hat{f}_n = \sum_{i=1}^n c_i K_{X_i}$

for some $c_1, \ldots, c_n \in \mathbb{R}$.

Remark 7. Note that both the subspace $\mathcal{H}_{K,X^n}$ and the corresponding orthogonal projection $\Pi_{X^n}$ are random objects, since they depend on the random features $X^n$.

Proof. Since $K_{X_i} \in \mathcal{H}_{K,X^n}$ for every $i$, by Theorem 5 we have

$\langle f, K_{X_i}\rangle_K = \langle\Pi_{X^n} f, K_{X_i}\rangle_K, \qquad \forall f \in \mathcal{H}_K$.

Moreover, from the reproducing kernel property (27) we deduce that

$f(X_i) = \langle f, K_{X_i}\rangle_K = \langle\Pi_{X^n} f, K_{X_i}\rangle_K = (\Pi_{X^n} f)(X_i)$.

Therefore, for every $f \in \mathcal{F}$ we can write

$\frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i)) = \frac{1}{n}\sum_{i=1}^n \ell\big(Y_i, (\Pi_{X^n} f)(X_i)\big)$.

This implies that

(31)    $\inf_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i)) = \inf_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \ell\big(Y_i, (\Pi_{X^n} f)(X_i)\big) = \inf_{g \in \Pi_{X^n}(\mathcal{F})}\frac{1}{n}\sum_{i=1}^n \ell(Y_i, g(X_i))$.

Now suppose that $\hat{f}_n \in \mathcal{F}$ achieves the infimum on the left-hand side of (31). Then its projection $\Pi_{X^n}\hat{f}_n$ onto $\mathcal{H}_{K,X^n}$ achieves the infimum on the right-hand side. Moreover, since $\Pi_{X^n}(\mathcal{F}) \subseteq \mathcal{F}$ by hypothesis, we may take the minimizer to be $\Pi_{X^n}\hat{f}_n$, i.e., to lie in $\mathcal{H}_{K,X^n}$. Since every element of $\mathcal{H}_{K,X^n}$ has the form (30), the theorem is proved. $\square$

In the classification setting, we may take $\mathsf{Y} = \{-1, +1\}$ and consider the problem of minimizing the empirical surrogate loss

$A_{\varphi,n}(f) = \frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i))$

over the ball $\mathcal{F}_\lambda$ in a suitable RKHS $\mathcal{H}_K$. By the above theorem, we may write this problem in the following form:

(32a)    $\min_{c_1, \ldots, c_n \in \mathbb{R}}\ \frac{1}{n}\sum_{i=1}^n \varphi\Big(-Y_i\sum_{j=1}^n c_j K(X_i, X_j)\Big)$

(32b)    subject to $\sum_{i,j=1}^n c_i c_j K(X_i, X_j) \le \lambda^2$.

Suppose the surrogate loss function $\varphi$ is convex. Then the objective function in (32) is convex as well, and the decision variables $c_1, \ldots, c_n \in \mathbb{R}$ are subject to a quadratic constraint. Thus, (32) is an instance of a quadratically constrained convex program (QCCP). Moreover, when $\varphi$ is such that the objective is quadratic in $c_1, \ldots, c_n$, then we have a quadratically constrained quadratic program (QCQP), which can be solved very efficiently using interior-point methods. For detailed background, see the text of Boyd and Vandenberghe [BV04]. Many popular machine learning algorithms can be cast in the form (32). For instance, if we let $\varphi$ be the hinge loss $\varphi(u) = (u+1)_+$, then (32) corresponds to the Support Vector Machine (SVM) algorithm; more precisely, the SVM is the scalarized version of (32), i.e., it has the form

$\min_{c_1, \ldots, c_n \in \mathbb{R}}\ \frac{1}{n}\sum_{i=1}^n\Big(1 - Y_i\sum_{j=1}^n c_j K(X_i, X_j)\Big)_+ + \tau\sum_{i,j=1}^n c_i c_j K(X_i, X_j)$

for some regularization parameter $\tau > 0$.
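To make the scalarized SVM problem above concrete, here is a minimal sketch that minimizes its objective by plain subgradient descent on the coefficient vector $c$. It assumes a precomputed Gram matrix `G` (e.g., from the `gram_matrix` helper sketched earlier) and labels `y` in $\{-1,+1\}$; the step size and iteration count are arbitrary, and a real implementation would instead use a QP/QCQP solver as the text suggests.

```python
import numpy as np

def kernel_svm_subgradient(G, y, tau=0.1, lr=0.01, n_iters=2000):
    """Subgradient descent on the scalarized kernel SVM objective
        (1/n) sum_i (1 - Y_i (G c)_i)_+  +  tau * c^T G c,
    where (G c)_i = sum_j c_j K(X_i, X_j)."""
    n = G.shape[0]
    c = np.zeros(n)
    for _ in range(n_iters):
        margins = y * (G @ c)                    # Y_i f(X_i) for f = sum_j c_j K_{X_j}
        active = (margins < 1.0).astype(float)   # training points where the hinge is active
        grad = -(G @ (active * y)) / n + 2.0 * tau * (G @ c)
        c = c - lr * grad
    return c

def kernel_predict(K, train_X, c, x):
    """The induced classifier g_f(x) = sgn(f(x)) with f(x) = sum_j c_j K(X_j, x)."""
    fx = sum(cj * K(xj, x) for cj, xj in zip(c, train_X))
    return 1 if fx >= 0 else -1
```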

5. Convex risk minimization

Choosing a convex surrogate loss function $\varphi$ has many advantages in general. First of all, we may arrange things in such a way that minimizing the surrogate loss $A_\varphi(f)$ over all measurable $f : \mathsf{X} \to \mathbb{R}$ is equivalent to determining the Bayes classifier (1):

Theorem 8. Let $P = P_{XY}$ be the joint distribution of the feature $X \in \mathbb{R}^d$ and the binary label $Y \in \{-1, +1\}$, and let $\eta(x) = P[Y = +1 \mid X = x]$ be the corresponding regression function. Consider a surrogate loss function $\varphi$, which is strictly convex and differentiable. Then the unique minimizer of the surrogate loss $A_\varphi(f) = \mathbf{E}[\varphi(-Y f(X))]$ over all (measurable) functions $f : \mathsf{X} \to \mathbb{R}$ has the form

$f^*(x) = \arg\min_{u \in \mathbb{R}} h_{\eta(x)}(u)$,

where for each $\eta \in [0,1]$ we have

$h_\eta(u) \triangleq \eta\,\varphi(-u) + (1-\eta)\,\varphi(u)$.

Moreover, $f^*(x)$ is positive if and only if $\eta(x) > 1/2$, i.e., the induced sign classifier $g_{f^*}(x) = \operatorname{sgn}(f^*(x))$ is the Bayes classifier (1).

Proof. By the law of iterated expectation,

$A_\varphi(f) = \mathbf{E}[\varphi(-Y f(X))] = \mathbf{E}\big[\mathbf{E}[\varphi(-Y f(X)) \mid X]\big]$.

Hence,

$\inf_f A_\varphi(f) = \inf_f\mathbf{E}\big[\mathbf{E}[\varphi(-Y f(X)) \mid X]\big] = \mathbf{E}\Big[\inf_{u \in \mathbb{R}}\mathbf{E}[\varphi(-Y u) \mid X = x]\Big]$.

For every $x \in \mathsf{X}$, we have

$\mathbf{E}[\varphi(-Y u) \mid X = x] = P[Y = +1 \mid X = x]\varphi(-u) + P[Y = -1 \mid X = x]\varphi(u) = \eta(x)\varphi(-u) + (1-\eta(x))\varphi(u) \equiv h_{\eta(x)}(u)$.

Since $\varphi$ is strictly convex and differentiable, so is $h_\eta$ for every $\eta \in [0,1]$. Therefore, $\inf_{u \in \mathbb{R}} h_\eta(u)$ exists, and is achieved by a unique $u^*$; in particular, $f^*(x) = \arg\min_{u \in \mathbb{R}} h_{\eta(x)}(u)$. To find the $u^*$ minimizing $h_\eta$, we differentiate $h_\eta$ w.r.t. $u$ and set the derivative to zero. Since

$h_\eta'(u) = (1-\eta)\varphi'(u) - \eta\,\varphi'(-u)$,

the point of minimum $u^*$ is the solution to the equation

$\frac{\varphi'(u)}{\varphi'(-u)} = \frac{\eta}{1-\eta}$.

Suppose $\eta > 1/2$; then

$\frac{\varphi'(u^*)}{\varphi'(-u^*)} > 1$.

Since $\varphi$ is strictly convex, its derivative $\varphi'$ is strictly increasing. Hence, $u^* > -u^*$, which implies that $u^* > 0$. Conversely, if $u^* \le 0$, then $-u^* \ge u^*$, so $\varphi'(-u^*) \ge \varphi'(u^*)$, which means that $\eta/(1-\eta) \le 1$, i.e., $\eta \le 1/2$. Thus, we conclude that $f^*(x)$, which is the minimizer of $h_{\eta(x)}$, is positive if and only if $\eta(x) > 1/2$, i.e., $\operatorname{sgn}(f^*(x))$ is the Bayes classifier. $\square$
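A quick numerical check of Theorem 8 for the exponential surrogate $\varphi(x) = e^x$, where the minimizer of $h_\eta$ has the closed form $u^* = \tfrac{1}{2}\log\tfrac{\eta}{1-\eta}$ and is positive exactly when $\eta > 1/2$; the grid search below is only a crude illustration of the abstract minimization over $u$.

```python
import numpy as np

def h_eta(phi, eta, u):
    """h_eta(u) = eta * phi(-u) + (1 - eta) * phi(u), as in Theorem 8."""
    return eta * phi(-u) + (1.0 - eta) * phi(u)

phi_exp = np.exp                                  # exponential surrogate phi(x) = e^x
grid = np.linspace(-5.0, 5.0, 100001)             # crude grid search over u

for eta in (0.2, 0.5, 0.8):
    u_grid = grid[np.argmin(h_eta(phi_exp, eta, grid))]
    u_closed = 0.5 * np.log(eta / (1.0 - eta))    # closed-form minimizer for the exponential loss
    print(f"eta = {eta}: grid minimizer {u_grid:+.3f}, closed form {u_closed:+.3f}")
```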

Secondly, under some additional regularity conditions it is possible to relate the minimum surrogate loss

$A_\varphi^* \triangleq \inf_f A_\varphi(f)$

to the Bayes rate $L^* \triangleq \inf_f L(f) = \inf_f P(Y f(X) < 0)$, where both infima are over all measurable functions $f : \mathsf{X} \to \mathbb{R}$:

Theorem 9. Assume that the surrogate loss function $\varphi$ satisfies the conditions of Theorem 3, and that there exist positive constants $s$ and $c$, such that the inequality

(33)    $L(f) - L^* \le c\big(A_\varphi(f) - A_\varphi^*\big)^{1/s}$

holds for any measurable function $f : \mathsf{X} \to \mathbb{R}$. Consider the learning algorithm that minimizes the empirical surrogate loss over some class $\mathcal{F}$:

(34)    $\hat{f}_n = \arg\min_{f \in \mathcal{F}} A_{\varphi,n}(f) = \arg\min_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i))$.

Then

(35)    $L(\hat{f}_n) - L^* \le 2^{1/s} c\Big(2 M_\varphi\mathbf{E} R_n(\mathcal{F}(X^n)) + B\sqrt{\frac{\log(1/\delta)}{2n}}\Big)^{1/s} + c\Big(\inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*\Big)^{1/s}$

with probability at least $1-\delta$.

Proof. We have the following chain of inequalities:

(36)    $L(\hat{f}_n) - L^* \le c\big(A_\varphi(\hat{f}_n) - A_\varphi^*\big)^{1/s}$

(37)    $= c\Big(A_\varphi(\hat{f}_n) - \inf_{f \in \mathcal{F}} A_\varphi(f) + \inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*\Big)^{1/s}$

(38)    $\le c\Big(A_\varphi(\hat{f}_n) - \inf_{f \in \mathcal{F}} A_\varphi(f)\Big)^{1/s} + c\Big(\inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*\Big)^{1/s}$

(39)    $\le 2^{1/s} c\Big(\sup_{f \in \mathcal{F}}|A_{\varphi,n}(f) - A_\varphi(f)|\Big)^{1/s} + c\Big(\inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*\Big)^{1/s}$

(40)    $\le 2^{1/s} c\Big(2 M_\varphi\mathbf{E} R_n(\mathcal{F}(X^n)) + B\sqrt{\frac{\log(1/\delta)}{2n}}\Big)^{1/s} + c\Big(\inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*\Big)^{1/s}$ with probability at least $1-\delta$,

where:

- (36) follows from (33);
- (38) follows from the inequality $(a+b)^{1/s} \le a^{1/s} + b^{1/s}$, which holds for all $a, b \ge 0$ and all $s \ge 1$;
- (39) follows since, by the ERM property of $\hat{f}_n$, $A_\varphi(\hat{f}_n) - \inf_{f \in \mathcal{F}} A_\varphi(f) \le 2\sup_{f \in \mathcal{F}}|A_{\varphi,n}(f) - A_\varphi(f)|$;
- (39) and (40) otherwise follow from the same argument as the one used in the proof of Theorem 3.

This completes the proof. $\square$
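The right-hand side of (35) is easy to assemble numerically once its ingredients are fixed. The helper below is just that arithmetic; every input (the Rademacher-average estimate, the approximation error, and the constants $s$, $c$, $M_\varphi$, $B$) must be supplied by the user and the values in the usage line are purely illustrative.

```python
import numpy as np

def excess_risk_bound(rademacher, approx_error, s, c, M_phi, B, n, delta):
    """Evaluate the right-hand side of (35):
        2^(1/s) * c * (2 M_phi E R_n + B sqrt(log(1/delta)/(2n)))^(1/s)
        + c * (inf_F A_phi - A_phi^*)^(1/s).
    `rademacher` stands for (an estimate of) E R_n(F(X^n)) and `approx_error`
    for inf_{f in F} A_phi(f) - A_phi^*."""
    estimation = 2.0 * M_phi * rademacher + B * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return 2.0 ** (1.0 / s) * c * estimation ** (1.0 / s) + c * approx_error ** (1.0 / s)

# e.g., hinge loss (s = 1, c = 4 as in Remark 8 below), with illustrative inputs:
print(excess_risk_bound(rademacher=0.05, approx_error=0.0,
                        s=1, c=4, M_phi=1.0, B=2.0, n=1000, delta=0.05))
```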

Remark 8. Condition (33) is often easy to check. For instance, Zhang [Zha04] proved that it is satisfied provided the inequality

(41)    $\Big|\frac{1}{2} - \eta\Big|^s \le (2c)^s\Big(1 - \inf_{u \in \mathbb{R}} h_\eta(u)\Big)$

holds for all $\eta \in [0,1]$. For instance, (41) holds for the exponential loss $\varphi(u) = e^u$ and the logit loss $\varphi(u) = \log_2(1 + e^u)$ with $s = 2$ and $c = 2\sqrt{2}$; for the hinge loss $\varphi(u) = (u+1)_+$, (41) holds with $s = 1$ and $c = 4$.

What Theorem 9 says is that, assuming the expected Rademacher average $\mathbf{E} R_n(\mathcal{F}(X^n)) = O(1/\sqrt{n})$, the difference between the generalization error of the Convex Risk Minimization algorithm (34) and the Bayes rate $L^*$ is, with high probability, bounded by the combination of two terms: the $O(n^{-1/(2s)})$ estimation error term and the $(\inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*)^{1/s}$ approximation error term. If the hypothesis space $\mathcal{F}$ is rich enough, so that $\inf_{f \in \mathcal{F}} A_\varphi(f) = A_\varphi^*$, then the difference between $L(\hat{f}_n)$ and $L^*$ is, with high probability, bounded as $O(1/n^{1/(2s)})$, independently of the dimension $d$ of the feature space.

References

[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[CZ07] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.
[SHS01] B. Schölkopf, R. Herbrich, and A. Smola. A generalized representer theorem. In D. Helmbold and B. Williamson, editors, Computational Learning Theory, volume 2111 of Lecture Notes in Computer Science, pages 416-426. Springer, 2001.
[Zha04] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56-134, 2004.


More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

Statistical Learning Theory and the C-Loss cost function

Statistical Learning Theory and the C-Loss cost function Statistical Learning Theory and the C-Loss cost function Jose Principe, Ph.D. Distinguished Professor ECE, BME Computational NeuroEngineering Laboratory and principe@cnel.ufl.edu Statistical Learning Theory

More information

Learning Methods for Linear Detectors

Learning Methods for Linear Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2011/2012 Lesson 20 27 April 2012 Contents Learning Methods for Linear Detectors Learning Linear Detectors...2

More information

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

Metric Embedding for Kernel Classification Rules

Metric Embedding for Kernel Classification Rules Metric Embedding for Kernel Classification Rules Bharath K. Sriperumbudur University of California, San Diego (Joint work with Omer Lang & Gert Lanckriet) Bharath K. Sriperumbudur (UCSD) Metric Embedding

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Surrogate loss functions, divergences and decentralized detection

Surrogate loss functions, divergences and decentralized detection Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1

More information

Kernel Methods. Outline

Kernel Methods. Outline Kernel Methods Quang Nguyen University of Pittsburgh CS 3750, Fall 2011 Outline Motivation Examples Kernels Definitions Kernel trick Basic properties Mercer condition Constructing feature space Hilbert

More information

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 A. Kernels 1. Let X be a finite set. Show that the kernel

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Kernels MIT Course Notes

Kernels MIT Course Notes Kernels MIT 15.097 Course Notes Cynthia Rudin Credits: Bartlett, Schölkopf and Smola, Cristianini and Shawe-Taylor The kernel trick that I m going to show you applies much more broadly than SVM, but we

More information

Sample width for multi-category classifiers

Sample width for multi-category classifiers R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds

More information

Metric spaces and metrizability

Metric spaces and metrizability 1 Motivation Metric spaces and metrizability By this point in the course, this section should not need much in the way of motivation. From the very beginning, we have talked about R n usual and how relatively

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information