BINARY CLASSIFICATION

MAXIM RAGINSKY

Date: March 4, 2011.

The problem of binary classification can be stated as follows. We have a random couple $Z = (X, Y)$, where $X \in \mathbb{R}^d$ is called the feature vector and $Y \in \{-1, +1\}$ is called the label. (The reason why we chose $\{-1,+1\}$, rather than $\{0,1\}$, for the label space is merely convenience.) In the spirit of the model-free framework, we assume that the relationship between the features and the labels is stochastic and described by an unknown probability distribution $P \in \mathcal{P}(\mathsf{Z})$, where $\mathsf{Z} = \mathbb{R}^d \times \{-1,+1\}$. As usual, we consider the case when we are given an i.i.d. sample of length $n$ from $P$. The goal is to learn a classifier, i.e., a mapping $g : \mathbb{R}^d \to \{-1,+1\}$, such that the probability of classification error, $P(g(X) \neq Y)$, is small. As we have seen before, the optimal choice is the Bayes classifier

(1)    $g^*(x) = \begin{cases} +1, & \text{if } \eta(x) > 1/2 \\ -1, & \text{otherwise} \end{cases}$

where $\eta(x) \triangleq P[Y = +1 \mid X = x]$ is the regression function. However, since we make no assumptions on $P$, in general we cannot hope to learn the Bayes classifier $g^*$. Instead, we focus on a more realistic goal: we fix a collection $\mathcal{G}$ of classifiers and then use the training data to come up with a hypothesis $\hat{g}_n \in \mathcal{G}$, such that $P(\hat{g}_n(X) \neq Y) \approx \inf_{g \in \mathcal{G}} P(g(X) \neq Y)$ with high probability. By way of notation, let us write $L(g)$ for the classification error of $g$, i.e., $L(g) \triangleq P(g(X) \neq Y)$, and let $L^*(\mathcal{G})$ denote the smallest classification error attainable over $\mathcal{G}$: $L^*(\mathcal{G}) \triangleq \inf_{g \in \mathcal{G}} L(g)$. We will assume that a minimizing $g^* \in \mathcal{G}$ exists. For future reference, we note that

(2)    $L(g) = P(g(X) \neq Y) = P(Y g(X) < 0)$.

Warning: In what follows, we will use $C$ or $c$ to denote various absolute constants; their values may change from line to line.

1. Learning linear discriminant rules

One of the simplest classification rules (and one of the first to be studied) is a linear discriminant rule: given a nonzero vector $w = (w^{(1)}, \ldots, w^{(d)}) \in \mathbb{R}^d$ and a scalar $b \in \mathbb{R}$, let

(3)    $g(x) \equiv g_{w,b}(x) = \begin{cases} +1, & \text{if } \langle w, x\rangle + b > 0 \\ -1, & \text{otherwise} \end{cases}$

Let $\mathcal{G}$ be the class of all such linear discriminant rules as $w$ ranges over all nonzero vectors in $\mathbb{R}^d$ and $b$ ranges over all reals: $\mathcal{G} = \{g_{w,b} : w \in \mathbb{R}^d \setminus \{0\},\ b \in \mathbb{R}\}$.
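For concreteness, here is a minimal Python sketch of the linear discriminant rule (3) and of the empirical classification error that the ERM algorithm below will minimize. The two-blob data set is purely hypothetical and serves only as an illustration.

```python
import numpy as np

def linear_rule(w, b):
    """The linear discriminant rule g_{w,b} of (3): +1 if <w, x> + b > 0, -1 otherwise."""
    def g(X):
        return np.where(X @ w + b > 0, 1, -1)
    return g

def empirical_error(g, X, y):
    """Empirical classification error L_n(g): fraction of training points misclassified by g."""
    return float(np.mean(g(X) != y))

# toy data: two Gaussian blobs with labels -1/+1 (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n, d = 200, 2
X = np.vstack([rng.normal(-1.0, 1.0, (n // 2, d)), rng.normal(+1.0, 1.0, (n // 2, d))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])

g = linear_rule(np.array([1.0, 1.0]), 0.0)
print("empirical error L_n(g):", empirical_error(g, X, y))
```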

Given the training sample $Z^n$, let $\hat{g}_n \in \mathcal{G}$ be the output of the ERM algorithm, i.e.,

$\hat{g}_n \in \arg\min_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n 1_{\{g(X_i) \neq Y_i\}}$.

In other words, $\hat{g}_n$ is any classifier of the form (3) that minimizes the number of misclassifications on the training sample. Then we have the following:

Theorem 1. There exists an absolute constant $C > 0$, such that for any $n \in \mathbb{N}$ and any $\delta \in (0,1)$, the bound

(4)    $L(\hat{g}_n) \le L^*(\mathcal{G}) + C\sqrt{\frac{d+1}{n}} + \sqrt{\frac{2\log(1/\delta)}{n}}$

holds with probability at least $1-\delta$.

Proof. A standard argument leads to the bound

(5)    $L(\hat{g}_n) \le L^*(\mathcal{G}) + 2\Delta_n(Z^n)$,

where

$\Delta_n(Z^n) \triangleq \sup_{g \in \mathcal{G}} |L(g) - L_n(g)|$

is the uniform deviation and $L_n(g)$ denotes the empirical classification error of $g$ on $Z^n$:

$L_n(g) = \frac{1}{n}\sum_{i=1}^n 1_{\{g(X_i) \neq Y_i\}}$,

which is the fraction of incorrectly labeled points in the training sample $Z^n$. Consider a classifier $g \in \mathcal{G}$ and define the set

$C_g \triangleq \big\{(x,y) \in \mathbb{R}^d \times \{-1,+1\} : y(\langle w, x\rangle + b) \le 0\big\}$.

Then it is easy to see that

$L(g) = P(C_g)$ and $L_n(g) = P_n(C_g)$,

where, as before,

$P_n = \frac{1}{n}\sum_{i=1}^n \delta_{Z_i} = \frac{1}{n}\sum_{i=1}^n \delta_{(X_i, Y_i)}$

is the empirical distribution of the sample $Z^n$. Let $\mathcal{C}$ denote the collection of all sets of the form $C = C_g$ for some $g \in \mathcal{G}$. Then

$\Delta_n(Z^n) = \sup_{C \in \mathcal{C}} |P_n(C) - P(C)|$.

Let $\mathcal{F} = \mathcal{F}_{\mathcal{C}}$ denote the class of indicator functions of the sets in $\mathcal{C}$: $\mathcal{F}_{\mathcal{C}} = \{1_{\{\cdot \in C\}} : C \in \mathcal{C}\}$. Then we know that, with probability at least $1-\delta$,

(6)    $\Delta_n(Z^n) \le 2\,\mathbf{E} R_n(\mathcal{F}(Z^n)) + \sqrt{\frac{\log(1/\delta)}{2n}}$,

where $R_n(\mathcal{F}(Z^n))$ is the Rademacher average of the projection of $\mathcal{F}$ onto the sample $Z^n$. Now,

$\mathcal{F}(Z^n) = \{(f(Z_1), \ldots, f(Z_n)) : f \in \mathcal{F}\} = \big\{(1_{\{Z_1 \in C\}}, \ldots, 1_{\{Z_n \in C\}}) : C \in \mathcal{C}\big\}$.

Therefore, if we prove that $\mathcal{C}$ is a VC class, then

$R_n(\mathcal{F}(Z^n)) \le C\sqrt{\frac{V(\mathcal{C})}{n}}$.

But this follows from the fact that any $C \in \mathcal{C}$ has the form

$C = \Big\{(x,y) \in \mathbb{R}^d \times \{-1,+1\} : \sum_{j=1}^d w^{(j)} y x^{(j)} + b y \le 0\Big\}$

for some $w \in \mathbb{R}^d \setminus \{0\}$ and some $b \in \mathbb{R}$, and the functions $(x,y) \mapsto y$ and $(x,y) \mapsto y x^{(j)}$, $1 \le j \le d$, span a linear space of dimension no greater than $d+1$. Hence, $V(\mathcal{C}) \le d+1$, so that

$R_n(\mathcal{F}(Z^n)) \le C\sqrt{\frac{V(\mathcal{C})}{n}} \le C\sqrt{\frac{d+1}{n}}$.

Combining this with (5) and (6), we see that (4) holds with probability at least $1-\delta$. $\square$

1.1. Generalized linear discriminant rules. In the same vein, we may consider classification rules of the form

(7)    $g(x) = \begin{cases} +1, & \text{if } \sum_{j=1}^k w^{(j)}\psi_j(x) + b > 0 \\ -1, & \text{otherwise} \end{cases}$

where $k$ is some positive integer (not necessarily equal to $d$), $w = (w^{(1)}, \ldots, w^{(k)}) \in \mathbb{R}^k$ is a nonzero vector, $b \in \mathbb{R}$ is an arbitrary scalar, and $\Psi = \{\psi_j : \mathbb{R}^d \to \mathbb{R}\}_{j=1}^k$ is some fixed dictionary of real-valued functions on $\mathbb{R}^d$. For a fixed $\Psi$, let $\mathcal{G}$ denote the collection of all classifiers of the form (7) as $w$ ranges over all nonzero vectors in $\mathbb{R}^k$ and $b$ ranges over all reals. Then the ERM rule is, again, given by

$\hat{g}_n \in \arg\min_{g \in \mathcal{G}} L_n(g) \equiv \arg\min_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n 1_{\{g(X_i) \neq Y_i\}}$.

The following result can be proved pretty much along the same lines as Theorem 1:

Theorem 2. There exists an absolute constant $C > 0$, such that for any $n \in \mathbb{N}$ and any $\delta \in (0,1)$, the bound

(8)    $L(\hat{g}_n) \le L^*(\mathcal{G}) + C\sqrt{\frac{k+1}{n}} + \sqrt{\frac{2\log(1/\delta)}{n}}$

holds with probability at least $1-\delta$.

1.2. Two fundamental issues. As Theorems 1 and 2 show, the ERM algorithm applied to the collection of all (generalized) linear discriminant rules is guaranteed to work well in the sense that the classification error of the output hypothesis will, with high probability, be close to the optimum achievable by any discriminant rule with the given structure. The same argument extends to any collection of classifiers $\mathcal{G}$ for which the error sets $\{(x,y) : y g(x) \le 0\}$, $g \in \mathcal{G}$, form a VC class of dimension much smaller than the sample size $n$. In other words, with high probability the difference

$L(\hat{g}_n) - L^*(\mathcal{G}) = L(\hat{g}_n) - \inf_{g \in \mathcal{G}} L(g)$

will be small. However, precisely because the VC dimension of $\mathcal{G}$ cannot be too large, the approximation properties of $\mathcal{G}$ will be limited.

Another problem is computational. For instance, the problem of finding an empirically optimal linear discriminant rule is NP-hard. In other words, unless P is equal to NP, there is no hope of coming up with an efficient ERM algorithm for linear discriminant rules that would work for all feature space dimensions $d$. If $d$ is fixed, then it is possible to enumerate all projections of a given sample $Z^n$ onto the class of indicators of all halfspaces in $O(n^d \log n)$ time, which allows for an exhaustive search for an ERM solution, but the usefulness of this naive approach is limited to $d < 5$.
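As a small illustration of the rules (7) and of the optimization problem that ERM poses, here is a sketch with a hypothetical quadratic dictionary and a crude random search over $(w, b)$; this random search is only a stand-in for the exhaustive enumeration mentioned above, not an efficient or exact algorithm.

```python
import numpy as np

def generalized_linear_rule(psi, w, b):
    """Classifier of the form (7): +1 if sum_j w_j psi_j(x) + b > 0, -1 otherwise.
    `psi` maps an (n, d) array of feature vectors to an (n, k) array of dictionary values."""
    def g(X):
        return np.where(psi(X) @ w + b > 0, 1, -1)
    return g

def quadratic_dictionary(X):
    """A hypothetical dictionary Psi = (x^(1), ..., x^(d), (x^(1))^2, ..., (x^(d))^2), so k = 2d."""
    return np.hstack([X, X ** 2])

def random_search_erm(psi, X, y, n_candidates=2000, seed=0):
    """Crude stand-in for ERM over rules of the form (7): draw random (w, b) pairs and
    keep the one with the smallest empirical error.  (This is NOT the exhaustive
    enumeration mentioned in the text; it only illustrates the objective being minimized.)"""
    rng = np.random.default_rng(seed)
    k = psi(X[:1]).shape[1]
    best_rule, best_err = None, np.inf
    for _ in range(n_candidates):
        w, b = rng.normal(size=k), rng.normal()
        err = np.mean(np.where(psi(X) @ w + b > 0, 1, -1) != y)
        if err < best_err:
            best_rule, best_err = (w, b), err
    return best_rule, best_err
```

For example, `random_search_erm(quadratic_dictionary, X, y)` returns the best $(w, b)$ found together with its empirical error $L_n$.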

2. Risk bounds for combined classifiers via surrogate loss functions

One way to sidestep the above approximation-theoretic and computational issues is to replace the 0-1 (Hamming) loss function that gives rise to the probability-of-error criterion with some other loss function. What we gain is the ability to bound the performance of various complicated classifiers, built up by combining simpler base classifiers, in terms of the complexity (e.g., the VC dimension) of the collection of the base classifiers, as well as considerable computational advantages, especially if the problem of minimizing the empirical surrogate loss turns out to be a convex programming problem. What we lose, though, is that, in general, we will not be able to compare the generalization error of the learned classifier to the minimum classification risk. Instead, we will have to be content with the fact that the generalization error will be close to the smallest surrogate loss.

We will consider classifiers of the form

(9)    $g_f(x) = \begin{cases} +1, & \text{if } f(x) \ge 0 \\ -1, & \text{otherwise} \end{cases} = \operatorname{sgn}(f(x))$,

where $f : \mathbb{R}^d \to \mathbb{R}$ is some function. From (2) we have

$L(g_f) = P(g_f(X) \neq Y) = P(Y g_f(X) < 0) = P(Y f(X) < 0)$.

From now on, when dealing with classifiers of the form (9), we will write $L(f)$ instead of $L(g_f)$ to keep the notation simple. Now we introduce the notion of a surrogate loss function.

Definition 1. A surrogate loss function is any function $\varphi : \mathbb{R} \to \mathbb{R}_+$, such that

(10)    $\varphi(x) \ge 1_{\{x > 0\}}$.

Some examples of commonly used surrogate losses:

(1) Exponential, $\varphi(x) = e^x$
(2) Logit, $\varphi(x) = \log_2(1 + e^x)$
(3) Hinge loss, $\varphi(x) = (x+1)_+ \triangleq \max\{x+1, 0\}$

Let $\varphi$ be a surrogate loss. Then for any $(x,y) \in \mathbb{R}^d \times \{-1,+1\}$ and any $f : \mathbb{R}^d \to \mathbb{R}$ we have

(11)    $1_{\{y f(x) < 0\}} = 1_{\{-y f(x) > 0\}} \le \varphi(-y f(x))$.

Therefore, defining the $\varphi$-risk of $f$ by

$A_\varphi(f) \triangleq \mathbf{E}[\varphi(-Y f(X))]$

and its empirical version

$A_{\varphi,n}(f) \triangleq \frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i))$,

we see from (11) that

(12)    $L(f) \le A_\varphi(f)$ and $L_n(f) \le A_{\varphi,n}(f)$.

Now that these preliminaries are out of the way, we can state and prove the basic surrogate loss bound:

Theorem 3. Consider any learning algorithm $\mathcal{A} = \{\mathcal{A}_n\}_{n=1}^\infty$, where, for each $n$, the mapping $\mathcal{A}_n$ receives the training sample $Z^n = (Z_1, \ldots, Z_n)$ as input and produces a function $\hat{f}_n : \mathbb{R}^d \to \mathbb{R}$ from some class $\mathcal{F}$. Suppose that $\mathcal{F}$ and the surrogate loss $\varphi$ are chosen so that the following conditions are satisfied:

(1) There exists some constant $B > 0$ such that

$\sup_{(x,y) \in \mathbb{R}^d \times \{-1,+1\}}\ \sup_{f \in \mathcal{F}} \varphi(-y f(x)) \le B$.

(2) There exists some constant $M_\varphi > 0$ such that $\varphi$ is $M_\varphi$-Lipschitz, i.e.,

$|\varphi(u) - \varphi(v)| \le M_\varphi |u - v|, \qquad \forall u, v \in \mathbb{R}$.

Then for any $n$ and any $\delta \in (0,1)$ the following bound holds with probability at least $1-\delta$:

(13)    $L(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + 2 M_\varphi \mathbf{E} R_n(\mathcal{F}(X^n)) + B\sqrt{\frac{\log(1/\delta)}{2n}}$.

Proof. Using (12), we can write

$L(\hat{f}_n) \le A_\varphi(\hat{f}_n) = A_{\varphi,n}(\hat{f}_n) + A_\varphi(\hat{f}_n) - A_{\varphi,n}(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + \sup_{f \in \mathcal{F}} |A_\varphi(f) - A_{\varphi,n}(f)|$.

Now let $\mathcal{H}$ denote the class of functions $h : \mathbb{R}^d \times \{-1,+1\} \to \mathbb{R}$ of the form $h(x,y) = -y f(x)$, $f \in \mathcal{F}$. Then the functions $\varphi \circ h$, $h \in \mathcal{H}$, are bounded between $0$ and $B > 0$, and

$\sup_{f \in \mathcal{F}} |A_\varphi(f) - A_{\varphi,n}(f)| = \sup_{f \in \mathcal{F}} \Big|\mathbf{E}[\varphi(-Y f(X))] - \frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i))\Big| = \sup_{h \in \mathcal{H}} |P(\varphi \circ h) - P_n(\varphi \circ h)|$,

where $\varphi \circ h(z) \triangleq \varphi(h(z))$ for every $z = (x,y) \in \mathbb{R}^d \times \{-1,+1\}$. Let

$\Delta_n(Z^n) \triangleq \sup_{h \in \mathcal{H}} |P(\varphi \circ h) - P_n(\varphi \circ h)|$.

We can now use the familiar symmetrization argument to bound

(14)    $\mathbf{E}\Delta_n(Z^n) \le 2\,\mathbf{E} R_n(\varphi \circ \mathcal{H}(Z^n))$,

where $\varphi \circ \mathcal{H}$ denotes the class of all functions of the form $\varphi \circ h$, $h \in \mathcal{H}$, and

(15)    $\varphi \circ \mathcal{H}(Z^n) \triangleq \{(\varphi \circ h(Z_1), \ldots, \varphi \circ h(Z_n)) : h \in \mathcal{H}\} = \{(\varphi(h(Z_1)), \ldots, \varphi(h(Z_n))) : h \in \mathcal{H}\}$

is the projection of $\varphi \circ \mathcal{H}$ onto the random sample $Z^n$. We now use a very powerful result about the Rademacher averages called the contraction principle, which states the following: if $A \subset \mathbb{R}^n$ is a bounded set and $\varphi : \mathbb{R} \to \mathbb{R}$ is an $M_\varphi$-Lipschitz function, then

(16)    $R_n(\varphi \circ A) \le M_\varphi R_n(A)$,

where $\varphi \circ A \triangleq \{(\varphi(a_1), \ldots, \varphi(a_n)) : a = (a_1, \ldots, a_n) \in A\}$. (The proof of the contraction principle is somewhat involved, and we do not give it here.) From (15) we immediately see that $\varphi \circ \mathcal{H}(Z^n) = \varphi \circ [\mathcal{H}(Z^n)]$. Therefore, applying (16) to $A = \mathcal{H}(Z^n)$ and then using the resulting bound in (14), we obtain

$\mathbf{E}\Delta_n(Z^n) \le 2 M_\varphi \mathbf{E} R_n(\mathcal{H}(Z^n))$.

Furthermore, letting $\sigma^n$ be an i.i.d. Rademacher tuple independent of $Z^n$, we have

$R_n(\mathcal{H}(Z^n)) = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{h \in \mathcal{H}}\Big|\sum_{i=1}^n \sigma_i h(Z_i)\Big|\Big] = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{f \in \mathcal{F}}\Big|\sum_{i=1}^n \sigma_i Y_i f(X_i)\Big|\Big] = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{f \in \mathcal{F}}\Big|\sum_{i=1}^n \sigma_i f(X_i)\Big|\Big] = R_n(\mathcal{F}(X^n))$,

which leads to

(17)    $\mathbf{E}\Delta_n(Z^n) \le 2 M_\varphi \mathbf{E} R_n(\mathcal{F}(X^n))$.

Now, since every function $\varphi \circ h$ is bounded between $0$ and $B$, the function $\Delta_n(Z^n)$ has bounded differences with $c_1 = \ldots = c_n = B/n$. Therefore, from (17) and from McDiarmid's inequality, we have for every $t > 0$ that

$P\big(\Delta_n(Z^n) \ge 2 M_\varphi \mathbf{E} R_n(\mathcal{F}(X^n)) + t\big) \le P\big(\Delta_n(Z^n) \ge \mathbf{E}\Delta_n(Z^n) + t\big) \le e^{-2nt^2/B^2}$.

Choosing $t = B\sqrt{\log(1/\delta)/(2n)}$, we see that

$\Delta_n(Z^n) \le 2 M_\varphi \mathbf{E} R_n(\mathcal{F}(X^n)) + B\sqrt{\frac{\log(1/\delta)}{2n}}$

with probability at least $1-\delta$. Therefore, since $L(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + \Delta_n(Z^n)$, we see that (13) holds with probability at least $1-\delta$. $\square$

What the above theorem tells us is that the performance of the learned classifier $\hat{f}_n$ is controlled by the Rademacher average of the class $\mathcal{F}$, and we can always arrange it to be relatively small. We will now look at several specific examples.

3. Weighted linear combination of classifiers

Let $\mathcal{G} = \{g : \mathbb{R}^d \to \{-1,+1\}\}$ be a class of base classifiers (not to be confused with Bayes classifiers!), and consider the class

$\mathcal{F}_\lambda \triangleq \Big\{f = \sum_{j=1}^N c_j g_j : N \in \mathbb{N},\ \sum_{j=1}^N |c_j| \le \lambda;\ g_1, \ldots, g_N \in \mathcal{G}\Big\}$,

where $\lambda > 0$ is a tunable parameter. Then for each $f = \sum_{j=1}^N c_j g_j \in \mathcal{F}_\lambda$ the corresponding classifier $g_f$ of the form (9) is given by

$g_f(x) = \operatorname{sgn}\Big(\sum_{j=1}^N c_j g_j(x)\Big)$.

A useful way of thinking about $g_f$ is that, upon receiving a feature $x \in \mathbb{R}^d$, it computes the outputs $g_1(x), \ldots, g_N(x)$ of the $N$ base classifiers from $\mathcal{G}$ and then takes a weighted majority vote; indeed, if we had $c_1 = \ldots = c_N = \lambda/N$, then $g_f(x)$ would precisely correspond to taking the majority vote among the $N$ base classifiers. Note, by the way, that the number of base classifiers is not fixed, and can be learned from the data.
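The following short sketch puts the surrogate losses of Definition 1 and the weighted combinations above into code. The decision-stump base class is a hypothetical choice of $\mathcal{G}$, used only for illustration.

```python
import numpy as np

# Surrogate losses in the convention of Definition 1 (phi is applied to -y f(x)):
exponential = np.exp                                  # phi(x) = e^x
logit       = lambda x: np.log2(1.0 + np.exp(x))      # phi(x) = log_2(1 + e^x)
hinge       = lambda x: np.maximum(x + 1.0, 0.0)      # phi(x) = (x + 1)_+

def empirical_phi_risk(phi, f, X, y):
    """Empirical phi-risk A_{phi,n}(f) = (1/n) sum_i phi(-Y_i f(X_i))."""
    return float(np.mean(phi(-y * f(X))))

def stump(j, t):
    """A hypothetical base classifier g in G: a decision stump on coordinate j with threshold t."""
    return lambda X: np.where(X[:, j] > t, 1, -1)

def weighted_combination(base_classifiers, c):
    """An element f = sum_j c_j g_j of F_lambda (here lambda = sum_j |c_j|) together with
    the induced weighted-majority-vote classifier g_f(x) = sgn(f(x)) of (9)."""
    def f(X):
        return sum(cj * gj(X) for cj, gj in zip(c, base_classifiers))
    def g_f(X):
        return np.where(f(X) >= 0, 1, -1)
    return f, g_f
```

For instance, `f, g_f = weighted_combination([stump(0, 0.0), stump(1, 0.5)], [0.7, 0.3])` gives $\lambda = 1$, and `empirical_phi_risk(hinge, f, X, y)` upper-bounds the empirical classification error of $g_f$, as in (12).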

Now, Theorem 3 tells us that the performance of any learning algorithm that accepts a training sample $Z^n$ and produces a function $\hat{f}_n \in \mathcal{F}_\lambda$ is controlled by the Rademacher average $R_n(\mathcal{F}_\lambda(X^n))$. It turns out, moreover, that we can relate it to the Rademacher average of the base class $\mathcal{G}$. To start, note that $\mathcal{F}_\lambda = \lambda \operatorname{absconv}\mathcal{G}$, where

$\operatorname{absconv}\mathcal{G} \triangleq \Big\{\sum_{j=1}^N c_j g_j : N \in \mathbb{N};\ \sum_{j=1}^N |c_j| \le 1;\ g_1, \ldots, g_N \in \mathcal{G}\Big\}$

is the absolute convex hull of $\mathcal{G}$. Therefore

$R_n(\mathcal{F}_\lambda(X^n)) = \lambda R_n(\mathcal{G}(X^n))$.

Now note that the functions in $\mathcal{G}$ are binary-valued. Therefore, assuming that the base class $\mathcal{G}$ is a VC class, we will have

$R_n(\mathcal{G}(X^n)) \le C\sqrt{\frac{V(\mathcal{G})}{n}}$.

Combining these bounds with the bound of Theorem 3, we conclude that for any $\hat{f}_n$ selected from $\mathcal{F}_\lambda$ based on the training sample $Z^n$, the bound

$L(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + C\lambda M_\varphi\sqrt{\frac{V(\mathcal{G})}{n}} + B\sqrt{\frac{\log(1/\delta)}{2n}}$

will hold with probability at least $1-\delta$, where $B$ is the uniform upper bound on $\varphi(-y f(x))$, $f \in \mathcal{F}_\lambda$, $(x,y) \in \mathbb{R}^d \times \{-1,+1\}$, and $M_\varphi$ is the Lipschitz constant of the surrogate loss $\varphi$.

Note that the above bound involves only the VC dimension of the base class, which is typically small. On the other hand, the class $\mathcal{F}_\lambda$ obtained by forming weighted combinations of classifiers from $\mathcal{G}$ is extremely rich, and will in general have infinite VC dimension! But there is a price we pay: the first term is the empirical surrogate loss $A_{\varphi,n}(\hat{f}_n)$, rather than the empirical classification error $L_n(\hat{f}_n)$. However, it is possible to choose the surrogate $\varphi$ in such a way that $A_{\varphi,n}(\cdot)$ can be bounded in terms of a quantity related to the number of misclassified training examples. Here is an example. Fix a positive parameter $\gamma > 0$ and consider

$\varphi(x) = \begin{cases} 0, & \text{if } x \le -\gamma \\ 1, & \text{if } x \ge 0 \\ 1 + x/\gamma, & \text{otherwise} \end{cases}$

This is a valid surrogate loss with $B = 1$ and $M_\varphi = 1/\gamma$, but in addition we have $\varphi(x) \le 1_{\{x > -\gamma\}}$, which implies that $\varphi(-y f(x)) \le 1_{\{y f(x) < \gamma\}}$. Therefore, for any $f$ we have

(18)    $A_{\varphi,n}(f) = \frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i)) \le \frac{1}{n}\sum_{i=1}^n 1_{\{Y_i f(X_i) < \gamma\}}$.

The quantity

(19)    $L^\gamma_n(f) \triangleq \frac{1}{n}\sum_{i=1}^n 1_{\{Y_i f(X_i) < \gamma\}}$

is called the margin error of $f$. Notice that:

- For any $\gamma > 0$, $L^\gamma_n(f) \ge L_n(f)$.
- The function $\gamma \mapsto L^\gamma_n(f)$ is increasing.
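Continuing the sketch above, the margin error (19) and the piecewise-linear surrogate just introduced are simple enough to restate directly in code; `ramp_surrogate` and `margin_error` below are nothing more than those two formulas.

```python
import numpy as np

def ramp_surrogate(gamma):
    """The piecewise-linear surrogate above: 0 for x <= -gamma, 1 for x >= 0,
    and 1 + x/gamma in between (so B = 1 and M_phi = 1/gamma)."""
    return lambda x: np.clip(1.0 + np.asarray(x) / gamma, 0.0, 1.0)

def margin_error(f, X, y, gamma):
    """Margin error L^gamma_n(f) of (19): fraction of training points with Y_i f(X_i) < gamma."""
    return float(np.mean(y * f(X) < gamma))
```

With `phi = ramp_surrogate(gamma)`, one can verify numerically that `empirical_phi_risk(phi, f, X, y) <= margin_error(f, X, y, gamma)`, which is exactly the inequality (18).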

Notice also that we can write

$L^\gamma_n(f) = \frac{1}{n}\sum_{i=1}^n 1_{\{Y_i f(X_i) < 0\}} + \frac{1}{n}\sum_{i=1}^n 1_{\{0 \le Y_i f(X_i) < \gamma\}}$,

where the first term is just $L_n(f)$, while the second term is the fraction of training examples that were classified correctly, but only with small margin (the quantity $Y f(X)$ is often called the margin of the classifier $f$).

Theorem 4 (Margin-based risk bound for weighted linear combinations). For any $\gamma > 0$, the bound

(20)    $L(\hat{f}_n) \le L^\gamma_n(\hat{f}_n) + \frac{C\lambda}{\gamma}\sqrt{\frac{V(\mathcal{G})}{n}} + \sqrt{\frac{\log(1/\delta)}{2n}}$

holds with probability at least $1-\delta$.

Remark 1. Note that the first term on the right-hand side of (20) increases with $\gamma$, while the second term decreases with $\gamma$. Hence, if the learned classifier $\hat{f}_n$ has a small margin error for a large $\gamma$, i.e., it classifies the training samples well and with high confidence, then its generalization error will be small.

4. Kernel machines

Another powerful way of building complicated classifiers from simple functions is by means of kernels. Kernel methods are popular in machine learning for a variety of reasons, not the least of which is that any algorithm that operates in a Euclidean space and relies only on the computation of inner products between feature vectors can be modified to work with any suitably well-behaved kernel. To start with, let us define what we mean by a kernel. We will stick to Euclidean feature spaces, although everything works out for arbitrary separable metric spaces.

Definition 2. Let $\mathsf{X}$ be a closed subset of $\mathbb{R}^d$. A real-valued function $K : \mathsf{X} \times \mathsf{X} \to \mathbb{R}$ is called a Mercer kernel provided the following conditions are met:

(1) It is symmetric, i.e., $K(x, x') = K(x', x)$ for any $x, x' \in \mathsf{X}$.
(2) It is continuous, i.e., if $\{x_n\}$ is a sequence of points in $\mathsf{X}$ converging to a point $x$, then $\lim_{n \to \infty} K(x_n, x') = K(x, x')$ for all $x' \in \mathsf{X}$.
(3) It is positive semidefinite, i.e., for all $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ and all $x_1, \ldots, x_n \in \mathsf{X}$,

(21)    $\sum_{i,j=1}^n \alpha_i \alpha_j K(x_i, x_j) \ge 0$.

Remark 2. Another way to interpret the positive semidefiniteness condition is as follows. For any $n$-tuple $x^n = (x_1, \ldots, x_n) \in \mathsf{X}^n$, define the $n \times n$ kernel Gram matrix $G_K(x^n) \triangleq [K(x_i, x_j)]_{i,j=1}^n$. Then (21) is equivalent to saying that $G_K(x^n)$ is positive semidefinite in the usual sense, i.e., for any vector $v \in \mathbb{R}^n$ we have $\langle v, G_K(x^n) v\rangle \ge 0$.

Remark 3. From now on, we will just say "kernel," but always mean Mercer kernel.

Here are some examples of kernels:

(1) With $\mathsf{X} = \mathbb{R}^d$, $K(x, x') = \langle x, x'\rangle$, the usual Euclidean inner product.

(2) A more general class of kernels based on the Euclidean inner product can be constructed as follows. Let $\mathsf{X} = \{x \in \mathbb{R}^d : \|x\| \le R\}$; choose any sequence $\{a_j\}_{j=0}^\infty$ of nonnegative reals such that $\sum_{j=0}^\infty a_j R^{2j} < \infty$. Then

$K(x, x') = \sum_{j=0}^\infty a_j \langle x, x'\rangle^j$

is a kernel.

(3) Let $\mathsf{X} = \mathbb{R}^d$, and let $k : \mathbb{R}^d \to \mathbb{R}$ be a continuous function which is reflection-symmetric, i.e., $k(-x) = k(x)$ for all $x$. Then $K(x, x') \triangleq k(x - x')$ is a kernel provided the Fourier transform of $k$,

$\hat{k}(\xi) \triangleq \int_{\mathbb{R}^d} e^{-i\langle\xi, x\rangle} k(x)\,dx$,

is nonnegative. A prime example is the Gaussian kernel, induced by the function $k(x) = e^{-\gamma\|x\|^2}$.

In all of the above cases, the first two properties of a Mercer kernel are easy to check. The third, i.e., positive semidefiniteness, requires a bit more work. For details, consult Section 2.5 of the book by Cucker and Zhou [CZ07].

The importance of kernels in machine learning stems from the fact that we can use them to represent (or approximate) arbitrarily complicated continuous functions on the feature space $\mathsf{X}$. In order to take full advantage of this representational power, we must take a detour into the theory of Hilbert spaces.

4.1. A crash course on Hilbert spaces. Hilbert spaces are a powerful generalization of the usual Euclidean space with an inner product; once we have an inner product, we can introduce the notion of an angle and, consequently, orthogonality. Moreover, a Hilbert space has certain favorable convergence properties, so we can speak about (unique) linear projections of its elements onto closed linear subspaces. Let us make these ideas precise.

Definition 3. A real vector space $V$ is an inner product space if there exists a function $\langle\cdot,\cdot\rangle_V : V \times V \to \mathbb{R}$, which is:

(1) Symmetric: $\langle v, v'\rangle_V = \langle v', v\rangle_V$ for all $v, v' \in V$.
(2) Linear: $\langle\alpha v_1 + \beta v_2, v\rangle_V = \alpha\langle v_1, v\rangle_V + \beta\langle v_2, v\rangle_V$ for all $\alpha, \beta \in \mathbb{R}$ and all $v_1, v_2, v \in V$.
(3) Positive definite: $\langle v, v\rangle_V \ge 0$ for all $v \in V$, and $\langle v, v\rangle_V = 0$ if and only if $v = 0$.

Let $(V, \langle\cdot,\cdot\rangle_V)$ be an inner product space. Then we can define a norm on $V$ via $\|v\|_V \triangleq \sqrt{\langle v, v\rangle_V}$. It is easy to check that this is, indeed, a norm:

(1) It is homogeneous: for any $v \in V$ and any $\alpha \in \mathbb{R}$,

(22)    $\|\alpha v\|_V = \sqrt{\langle\alpha v, \alpha v\rangle_V} = \sqrt{\alpha^2\langle v, v\rangle_V} = |\alpha|\sqrt{\langle v, v\rangle_V} = |\alpha|\,\|v\|_V$.

(2) It satisfies the triangle inequality: for any $v, v' \in V$, $\|v + v'\|_V \le \|v\|_V + \|v'\|_V$.

To prove this, we first need to establish another key property of $\|\cdot\|_V$: the Cauchy-Schwarz inequality, which generalizes its classical Euclidean counterpart and says that

(23)    $|\langle v, v'\rangle_V| \le \|v\|_V\,\|v'\|_V$.

To prove (23), we start with the observation that

$\|v - \lambda v'\|_V^2 = \langle v - \lambda v', v - \lambda v'\rangle_V \ge 0$

for any $\lambda \in \mathbb{R}$. Expanding this, we get

$\langle v - \lambda v', v - \lambda v'\rangle_V = \lambda^2\|v'\|_V^2 - 2\lambda\langle v, v'\rangle_V + \|v\|_V^2 \ge 0$.

This is a quadratic function of $\lambda$, and from the above we see that its graph does not cross the horizontal axis. Therefore, we must have

$4\langle v, v'\rangle_V^2 - 4\|v\|_V^2\|v'\|_V^2 \le 0 \quad\Longrightarrow\quad |\langle v, v'\rangle_V| \le \|v\|_V\,\|v'\|_V$.

Now we can write

$(\|v\|_V + \|v'\|_V)^2 = \|v\|_V^2 + 2\|v\|_V\|v'\|_V + \|v'\|_V^2 \ge \|v\|_V^2 + 2\langle v, v'\rangle_V + \|v'\|_V^2 = \langle v, v\rangle_V + \langle v, v'\rangle_V + \langle v', v\rangle_V + \langle v', v'\rangle_V = \langle v + v', v + v'\rangle_V = \|v + v'\|_V^2$,

where the first step uses the Cauchy-Schwarz inequality, the second step uses the definition of $\|\cdot\|_V$ and the symmetry of $\langle\cdot,\cdot\rangle_V$, the third step uses the linearity of $\langle\cdot,\cdot\rangle_V$, and the final step is, again, by definition. Since all norms are nonnegative, we can take square roots of both sides to get the triangle inequality.

(3) Finally, $\|v\|_V \ge 0$, and $\|v\|_V = 0$ if and only if $v = 0$; this is obvious from the definitions.

Thus, an inner product space can be equipped with a norm that has certain special properties (mainly, the Cauchy-Schwarz inequality, since a lot of useful things follow from it alone). Now that we have a norm, we can talk about convergence of sequences in $V$:

Definition 4. Let $\{v_n\}_{n=1}^\infty$ be a sequence of elements of $V$. We say that it converges to $v \in V$ if

(24)    $\lim_{n \to \infty} \|v_n - v\|_V = 0$.

Remark 4. This definition is valid for any norm on $V$, not necessarily a norm induced by an inner product.

Any norm-convergent sequence has the property that, as $n$ gets larger, its elements get closer and closer to one another. Specifically, suppose that $\{v_n\}$ converges to $v$. Then (24) implies that for any $\varepsilon > 0$ we can choose $n$ large enough, so that $\|v_m - v\|_V < \varepsilon/2$ for all $m \ge n$. But the triangle inequality gives

$\|v_n - v_m\|_V \le \|v_n - v\|_V + \|v_m - v\|_V < \varepsilon, \qquad \forall m \ge n$.

In other words, we have $\lim_{m \to \infty}\|v_n - v_m\|_V = 0$. Since this holds for every such $n$, we can write

(25)    $\lim_{\min(m,n) \to \infty}\|v_n - v_m\|_V = 0$.

Any sequence $\{v_n\}$ that has the property (25) is called a Cauchy sequence. We have just proved that any convergent sequence is Cauchy. However, the converse is not necessarily true: a Cauchy sequence does not have to be convergent. This motivates the following definition:

Definition 5. A normed space $(V, \|\cdot\|_V)$ is complete if any Cauchy sequence $\{v_n\}$ of its elements is convergent. If the norm $\|\cdot\|_V$ is induced by an inner product, then we say that $V$ is a Hilbert space.

There is a standard procedure of starting with an inner product space (and the corresponding normed space) and then completing it by adding the limits of all Cauchy sequences. We will not worry too much about this procedure. Here are a few standard examples of Hilbert spaces:

(1) The Euclidean space $V = \mathbb{R}^d$ with the usual inner product

$\langle v, v'\rangle = \sum_{j=1}^d v_j v'_j$.

The corresponding norm is the familiar $\ell_2$ norm, $\|v\| = \sqrt{\langle v, v\rangle}$.

(2) More generally, if $A$ is a positive definite $d \times d$ matrix, then the inner product

$\langle v, v'\rangle_A \triangleq \langle v, A v'\rangle$

induces the $A$-weighted norm $\|v\|_A \triangleq \sqrt{\langle v, v\rangle_A} = \sqrt{\langle v, A v\rangle}$, which makes $\mathbb{R}^d$ into a Hilbert space. The preceding example is a special case with $A = I_d$, the $d \times d$ identity matrix.

(3) The space $L^2(\mathbb{R}^d)$ of all square-integrable functions $f : \mathbb{R}^d \to \mathbb{R}$, i.e., $\int_{\mathbb{R}^d} f^2(x)\,dx < \infty$, is a Hilbert space with the inner product

$\langle f, g\rangle_{L^2(\mathbb{R}^d)} \triangleq \int_{\mathbb{R}^d} f(x) g(x)\,dx$

and the corresponding norm

$\|f\|_{L^2(\mathbb{R}^d)} \triangleq \sqrt{\int_{\mathbb{R}^d} f^2(x)\,dx}$.

(4) Let $(\Omega, \mathcal{B}, P)$ be a probability space. Then the space $L^2(P)$ of all real-valued random variables $X : \Omega \to \mathbb{R}$ with finite second moment, i.e., $\mathbf{E} X^2 = \int_\Omega X^2(\omega) P(d\omega) < \infty$, is a Hilbert space with the inner product

$\langle X, X'\rangle_{L^2(P)} \triangleq \mathbf{E}[X X'] = \int_\Omega X(\omega) X'(\omega) P(d\omega)$

and the corresponding norm

$\|X\|_{L^2(P)} \triangleq \sqrt{\int_\Omega X(\omega)^2 P(d\omega)} = \sqrt{\mathbf{E} X^2}$.

From now on, we will denote a typical Hilbert space by $(\mathcal{H}, \langle\cdot,\cdot\rangle_{\mathcal{H}})$; the induced norm will be denoted by $\|\cdot\|_{\mathcal{H}}$. An enormous advantage of working with Hilbert spaces is the availability of the notions of orthogonality and orthogonal projection. Two elements $h, g$ of a Hilbert space $\mathcal{H}$ are said to be orthogonal if $\langle h, g\rangle_{\mathcal{H}} = 0$. Now consider a closed linear subspace $\mathcal{H}_0$ of $\mathcal{H}$, where "closed" means that the limit of any convergent sequence $\{h_n\}$ of elements of $\mathcal{H}_0$ is also contained in $\mathcal{H}_0$. Then we have the following basic facts:

Theorem 5. Let $\mathcal{H}_0^\perp$ be the set of all $h \in \mathcal{H}$, such that $\langle g, h\rangle_{\mathcal{H}} = 0$ for all $g \in \mathcal{H}_0$. Then:

(1) $\mathcal{H}_0^\perp$ is also a closed linear subspace of $\mathcal{H}$.
(2) Any element $g \in \mathcal{H}$ can be uniquely decomposed as $g = h + h^\perp$, where $h \in \mathcal{H}_0$ and $h^\perp \in \mathcal{H}_0^\perp$.
(3) Define the orthogonal projection $\Pi : \mathcal{H} \to \mathcal{H}_0$ onto $\mathcal{H}_0$ through $\Pi g \triangleq h$ if $g = h + h^\perp$ with $h \in \mathcal{H}_0$, $h^\perp \in \mathcal{H}_0^\perp$. Then $\Pi$ has the following properties:
(a) It is a linear operator.
(b) $\Pi^2 = \Pi$, i.e., $\Pi(\Pi g) = \Pi g$ for any $g \in \mathcal{H}$.
(c) If $g \in \mathcal{H}_0$, then $\Pi g = g$.
(d) For any $g \in \mathcal{H}$ and any $h \in \mathcal{H}_0$, $\langle\Pi g, h\rangle_{\mathcal{H}} = \langle g, h\rangle_{\mathcal{H}}$.
(e) For any $g \in \mathcal{H}$, $h^* = \Pi g \in \mathcal{H}_0$ is the unique solution of the optimization problem: minimize $\|h - g\|_{\mathcal{H}}$ subject to $h \in \mathcal{H}_0$.

Remark 5. It is important for $\mathcal{H}_0$ to be a closed linear subspace of $\mathcal{H}$ for the above results to hold.

4.2. Reproducing kernel Hilbert spaces. Now let us return to our original goal. Suppose we have a fixed kernel $K$ on our feature space $\mathsf{X}$ (which we assume to be a closed subset of $\mathbb{R}^d$). Let $\mathcal{L}_K(\mathsf{X})$ be the linear span of the set $\{K(x, \cdot) : x \in \mathsf{X}\}$, i.e., the set of all functions $f : \mathsf{X} \to \mathbb{R}$ of the form

(26)    $f(x) = \sum_{j=1}^N c_j K(x_j, x)$

for all possible choices of $N \in \mathbb{N}$, $c_1, \ldots, c_N \in \mathbb{R}$, and $x_1, \ldots, x_N \in \mathsf{X}$. It is easy to see that $\mathcal{L}_K(\mathsf{X})$ is a vector space: for any two functions $f, f'$ of the form (26), their sum is also of that form; if we multiply any $f \in \mathcal{L}_K(\mathsf{X})$ by a scalar $c \in \mathbb{R}$, we will get another element of $\mathcal{L}_K(\mathsf{X})$; and the zero function is clearly in $\mathcal{L}_K(\mathsf{X})$. It turns out that, for any (Mercer) kernel $K$, we can complete $\mathcal{L}_K(\mathsf{X})$ into a Hilbert space of functions that can potentially represent any continuous function from $\mathsf{X}$ into $\mathbb{R}$, provided $K$ is chosen appropriately. The following result is essential (for the proof, see Section 2.4 of Cucker and Zhou [CZ07]):

Theorem 6. Let $\mathsf{X}$ be a closed subset of $\mathbb{R}^d$, and let $K : \mathsf{X} \times \mathsf{X} \to \mathbb{R}$ be a Mercer kernel. Then there exists a unique Hilbert space $(\mathcal{H}_K, \langle\cdot,\cdot\rangle_K)$ of real-valued functions on $\mathsf{X}$ with the following properties:

(1) For all $x \in \mathsf{X}$, the function $K_x(\cdot) \triangleq K(x, \cdot)$ is an element of $\mathcal{H}_K$, and $\langle K_x, K_{x'}\rangle_K = K(x, x')$ for all $x, x' \in \mathsf{X}$.
(2) The linear space $\mathcal{L}_K(\mathsf{X})$ is dense in $\mathcal{H}_K$, i.e., for any $f \in \mathcal{H}_K$ and any $\varepsilon > 0$ there exist some $N \in \mathbb{N}$, $c_1, \ldots, c_N \in \mathbb{R}$, and $x_1, \ldots, x_N \in \mathsf{X}$, such that

$\Big\|f - \sum_{j=1}^N c_j K_{x_j}\Big\|_K < \varepsilon$.

(3) For all $f \in \mathcal{H}_K$ and all $x \in \mathsf{X}$,

(27)    $f(x) = \langle K_x, f\rangle_K$.

Moreover, the functions in $\mathcal{H}_K$ are continuous.

The Hilbert space $\mathcal{H}_K$ is called the Reproducing Kernel Hilbert Space (RKHS) associated with $K$; the property (27) is referred to as the reproducing kernel property.
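To make the objects just defined concrete, here is a minimal sketch of a Mercer kernel, its Gram matrix (Remark 2), a numerical check of positive semidefiniteness, and an element of $\mathcal{L}_K(\mathsf{X})$ of the form (26). The particular points and coefficients are hypothetical.

```python
import numpy as np

def gaussian_kernel(gamma):
    """The Gaussian kernel K(x, x') = exp(-gamma * ||x - x'||^2) from example (3) above."""
    def K(x, xp):
        diff = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
        return float(np.exp(-gamma * np.dot(diff, diff)))
    return K

def gram_matrix(K, xs):
    """The kernel Gram matrix G_K(x^n) = [K(x_i, x_j)] of Remark 2."""
    n = len(xs)
    return np.array([[K(xs[i], xs[j]) for j in range(n)] for i in range(n)])

def is_psd(G, tol=1e-10):
    """Numerical check of (21): all eigenvalues of the symmetric Gram matrix are >= -tol."""
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

def rkhs_element(K, centers, coeffs):
    """A function f in L_K(X) of the form (26): f(x) = sum_j c_j K(x_j, x)."""
    def f(x):
        return sum(c * K(xj, x) for c, xj in zip(coeffs, centers))
    return f

# tiny usage example on hypothetical points
K = gaussian_kernel(gamma=0.5)
xs = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])]
G = gram_matrix(K, xs)
print(is_psd(G))                      # True, as guaranteed for a Mercer kernel
f = rkhs_element(K, xs, [1.0, -0.5, 0.2])
print(f(np.array([0.5, 0.5])))        # evaluate the RKHS element at a new point
```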

Remark 6. The reproducing kernel property essentially states that the value of any function $f \in \mathcal{H}_K$ at any point $x \in \mathsf{X}$ can be extracted by projecting $f$ onto the function $K_x(\cdot) = K(x, \cdot)$, i.e., a copy of the kernel centered at the point $x$. It is easy to prove when $f \in \mathcal{L}_K(\mathsf{X})$. Indeed, if $f$ has the form (26), then

$\langle f, K_x\rangle_K = \Big\langle\sum_{j=1}^N c_j K_{x_j}, K_x\Big\rangle_K = \sum_{j=1}^N c_j\langle K_{x_j}, K_x\rangle_K = \sum_{j=1}^N c_j K(x_j, x) = f(x)$.

Since any $f \in \mathcal{H}_K$ can be expressed as a limit of functions from $\mathcal{L}_K(\mathsf{X})$, the proof of (27) for a general $f$ follows by continuity.

Now we pick a kernel $K$ on our feature space and consider classifiers of the form (9) with the underlying $f$ taken from a suitable subset of the RKHS $\mathcal{H}_K$. One choice, which underlies such things as the Support Vector Machine, is to take a ball in $\mathcal{H}_K$: given some $\lambda > 0$, let $\mathcal{F}_\lambda \triangleq \{f \in \mathcal{H}_K : \|f\|_K \le \lambda\}$. This set is the closure (in the $\|\cdot\|_K$ norm) of the convex set

$\Big\{\sum_{j=1}^N c_j K_{x_j} : N \in \mathbb{N};\ c_1, \ldots, c_N \in \mathbb{R};\ x_1, \ldots, x_N \in \mathsf{X};\ \sum_{i,j=1}^N c_i c_j K(x_i, x_j) \le \lambda^2\Big\} \subset \mathcal{L}_K(\mathsf{X})$,

and is itself convex. Now, as we already know, the performance of any learning algorithm that chooses an element $\hat{f}_n \in \mathcal{F}_\lambda$ in a data-dependent way is controlled by the Rademacher average $R_n(\mathcal{F}_\lambda(X^n))$. It turns out that this Rademacher average is fairly easy to estimate. Indeed, using the reproducing kernel property (27) and then the linearity of the inner product $\langle\cdot,\cdot\rangle_K$, we can write

$R_n(\mathcal{F}_\lambda(X^n)) = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{f:\|f\|_K \le \lambda}\Big|\sum_{i=1}^n \sigma_i f(X_i)\Big|\Big] = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{f:\|f\|_K \le \lambda}\Big|\sum_{i=1}^n \sigma_i\langle f, K_{X_i}\rangle_K\Big|\Big] = \frac{1}{n}\mathbf{E}_{\sigma^n}\Big[\sup_{f:\|f\|_K \le \lambda}\Big|\Big\langle f, \sum_{i=1}^n \sigma_i K_{X_i}\Big\rangle_K\Big|\Big]$.

Now, using the Cauchy-Schwarz inequality (23), it is not hard to show that

$\sup_{f:\|f\|_K \le \lambda}|\langle f, g\rangle_K| = \lambda\|g\|_K$

for any $g \in \mathcal{H}_K$. Therefore,

$R_n(\mathcal{F}_\lambda(X^n)) = \frac{\lambda}{n}\mathbf{E}_{\sigma^n}\Big\|\sum_{i=1}^n \sigma_i K_{X_i}\Big\|_K$.

Now we exploit the following easily proved fact: for any $n$ functions $g_1, \ldots, g_n \in \mathcal{H}_K$,

(28)    $\mathbf{E}_{\sigma^n}\Big\|\sum_{i=1}^n \sigma_i g_i\Big\|_K \le \sqrt{\sum_{i=1}^n\|g_i\|_K^2}$.

The proof of this is in two steps. First, we use the concavity of the square root to write

$\mathbf{E}_{\sigma^n}\Big\|\sum_{i=1}^n \sigma_i g_i\Big\|_K \le \sqrt{\mathbf{E}_{\sigma^n}\Big\|\sum_{i=1}^n \sigma_i g_i\Big\|_K^2}$.

Then we expand the squared norm:

$\Big\|\sum_{i=1}^n \sigma_i g_i\Big\|_K^2 = \Big\langle\sum_{i=1}^n \sigma_i g_i, \sum_{j=1}^n \sigma_j g_j\Big\rangle_K = \sum_{i,j=1}^n \sigma_i\sigma_j\langle g_i, g_j\rangle_K$.

And finally we take the expectation over $\sigma^n$ and use the fact that $\mathbf{E}[\sigma_i\sigma_j] = 1$ if $i = j$ and $0$ otherwise to get

$\mathbf{E}_{\sigma^n}\Big\|\sum_{i=1}^n \sigma_i g_i\Big\|_K^2 = \sum_{i=1}^n\langle g_i, g_i\rangle_K = \sum_{i=1}^n\|g_i\|_K^2$.

Hence, we obtain

$R_n(\mathcal{F}_\lambda(X^n)) \le \frac{\lambda}{n}\sqrt{\sum_{i=1}^n\langle K_{X_i}, K_{X_i}\rangle_K} = \frac{\lambda}{n}\sqrt{\sum_{i=1}^n K(X_i, X_i)}$.

Finally, taking the expectation w.r.t. $X^n$ and once more using the concavity of the square root, we have

$\mathbf{E} R_n(\mathcal{F}_\lambda(X^n)) \le \lambda\sqrt{\frac{\mathbf{E} K(X, X)}{n}}$.

4.3. Empirical risk minimization in an RKHS. Another advantage of working with kernels is that, in many cases, a minimizer of empirical risk over a sufficiently regular subset of an RKHS will have the form of a linear combination of kernels centered at the training feature points. The results ensuring this are often referred to in the literature as representer theorems. Here is one such result (due, in a slightly different form, to Schölkopf, Herbrich, and Smola [SHS01]), sufficiently general for our purposes:

Theorem 7 (The generalized representer theorem). Let $\mathsf{X}$ be a closed subset of $\mathbb{R}^d$ and let $\mathsf{Y}$ be a subset of the reals. Consider a nonnegative loss function $\ell : \mathsf{Y} \times \mathsf{Y} \to \mathbb{R}_+$. Let $K$ be a Mercer kernel on $\mathsf{X}$, and let $\mathcal{H}_K$ be the corresponding RKHS. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an i.i.d. sample from some distribution $P = P_{XY}$ on $\mathsf{X} \times \mathsf{Y}$, let $\mathcal{H}_{K,X^n}$ be the closed linear subspace of $\mathcal{H}_K$ spanned by $\{K_{X_i} : 1 \le i \le n\}$, and let $\Pi_{X^n}$ denote the orthogonal projection onto $\mathcal{H}_{K,X^n}$. Let $\mathcal{F}$ be a subset of $\mathcal{H}_K$, such that $\Pi_{X^n}(\mathcal{F}) \subseteq \mathcal{F}$. Then

(29)    $\inf_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i)) = \inf_{f \in \Pi_{X^n}(\mathcal{F})}\frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i))$,

and if a minimizer of the left-hand side of (29) exists, then it can be taken to have the form

(30)    $\hat{f}_n = \sum_{i=1}^n c_i K_{X_i}$

for some $c_1, \ldots, c_n \in \mathbb{R}$.

Remark 7. Note that both the subspace $\mathcal{H}_{K,X^n}$ and the corresponding orthogonal projection $\Pi_{X^n}$ are random objects, since they depend on the random features $X^n$.

Proof. Since $K_{X_i} \in \mathcal{H}_{K,X^n}$ for every $i$, by Theorem 5 we have

$\langle f, K_{X_i}\rangle_K = \langle\Pi_{X^n} f, K_{X_i}\rangle_K, \qquad \forall f \in \mathcal{H}_K$.

Moreover, from the reproducing kernel property (27) we deduce that

$f(X_i) = \langle f, K_{X_i}\rangle_K = \langle\Pi_{X^n} f, K_{X_i}\rangle_K = (\Pi_{X^n} f)(X_i)$.

Therefore, for every $f \in \mathcal{F}$ we can write

$\frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i)) = \frac{1}{n}\sum_{i=1}^n \ell\big(Y_i, (\Pi_{X^n} f)(X_i)\big)$.

This implies that

(31)    $\inf_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i)) = \inf_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \ell\big(Y_i, (\Pi_{X^n} f)(X_i)\big) = \inf_{g \in \Pi_{X^n}(\mathcal{F})}\frac{1}{n}\sum_{i=1}^n \ell(Y_i, g(X_i))$.

Now suppose that $\hat{f}_n \in \mathcal{F}$ achieves the infimum on the left-hand side of (31). Then its projection $\Pi_{X^n}\hat{f}_n$ onto $\mathcal{H}_{K,X^n}$ achieves the infimum on the right-hand side. Moreover, since $\Pi_{X^n}(\mathcal{F}) \subseteq \mathcal{F}$ by hypothesis, we may take the minimizer to be $\Pi_{X^n}\hat{f}_n$, i.e., to lie in $\mathcal{H}_{K,X^n}$. Since every element of $\mathcal{H}_{K,X^n}$ has the form (30), the theorem is proved. $\square$

In the classification setting, we may take $\mathsf{Y} = \{-1, +1\}$ and consider the problem of minimizing the empirical surrogate loss

$A_{\varphi,n}(f) = \frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i))$

over the ball $\mathcal{F}_\lambda$ in a suitable RKHS $\mathcal{H}_K$. By the above theorem, we may write this problem in the following form:

(32a)    $\min_{c_1, \ldots, c_n \in \mathbb{R}}\ \frac{1}{n}\sum_{i=1}^n \varphi\Big(-Y_i\sum_{j=1}^n c_j K(X_i, X_j)\Big)$

(32b)    subject to $\sum_{i,j=1}^n c_i c_j K(X_i, X_j) \le \lambda^2$.

Suppose the surrogate loss function $\varphi$ is convex. Then the objective function in (32) is convex as well, and the decision variables $c_1, \ldots, c_n \in \mathbb{R}$ are subject to a quadratic constraint. Thus, (32) is an instance of a quadratically constrained convex program (QCCP). Moreover, when $\varphi$ is such that the objective is quadratic in $c_1, \ldots, c_n$, then we have a quadratically constrained quadratic program (QCQP), which can be solved very efficiently using interior-point methods. For detailed background, see the text of Boyd and Vandenberghe [BV04]. Many popular machine learning algorithms can be cast in the form (32). For instance, if we let $\varphi$ be the hinge loss $\varphi(u) = (u+1)_+$, then (32) corresponds to the Support Vector Machine (SVM) algorithm; more precisely, the SVM is the scalarized version of (32), i.e., it has the form

$\min_{c_1, \ldots, c_n \in \mathbb{R}}\ \frac{1}{n}\sum_{i=1}^n\Big(1 - Y_i\sum_{j=1}^n c_j K(X_i, X_j)\Big)_+ + \tau\sum_{i,j=1}^n c_i c_j K(X_i, X_j)$

for some regularization parameter $\tau > 0$.
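To make the scalarized SVM problem above concrete, here is a minimal sketch that minimizes its objective by plain subgradient descent on the coefficient vector $c$. It assumes a precomputed Gram matrix `G` (e.g., from the `gram_matrix` helper sketched earlier) and labels `y` in $\{-1,+1\}$; the step size and iteration count are arbitrary, and a real implementation would instead use a QP/QCQP solver as the text suggests.

```python
import numpy as np

def kernel_svm_subgradient(G, y, tau=0.1, lr=0.01, n_iters=2000):
    """Subgradient descent on the scalarized kernel SVM objective
        (1/n) sum_i (1 - Y_i (G c)_i)_+  +  tau * c^T G c,
    where (G c)_i = sum_j c_j K(X_i, X_j)."""
    n = G.shape[0]
    c = np.zeros(n)
    for _ in range(n_iters):
        margins = y * (G @ c)                    # Y_i f(X_i) for f = sum_j c_j K_{X_j}
        active = (margins < 1.0).astype(float)   # training points where the hinge is active
        grad = -(G @ (active * y)) / n + 2.0 * tau * (G @ c)
        c = c - lr * grad
    return c

def kernel_predict(K, train_X, c, x):
    """The induced classifier g_f(x) = sgn(f(x)) with f(x) = sum_j c_j K(X_j, x)."""
    fx = sum(cj * K(xj, x) for cj, xj in zip(c, train_X))
    return 1 if fx >= 0 else -1
```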

5. Convex risk minimization

Choosing a convex surrogate loss function $\varphi$ has many advantages in general. First of all, we may arrange things in such a way that minimizing the surrogate loss $A_\varphi(f)$ over all measurable $f : \mathsf{X} \to \mathbb{R}$ is equivalent to determining the Bayes classifier (1):

Theorem 8. Let $P = P_{XY}$ be the joint distribution of the feature $X \in \mathbb{R}^d$ and the binary label $Y \in \{-1, +1\}$, and let $\eta(x) = P[Y = +1 \mid X = x]$ be the corresponding regression function. Consider a surrogate loss function $\varphi$, which is strictly convex and differentiable. Then the unique minimizer of the surrogate loss $A_\varphi(f) = \mathbf{E}[\varphi(-Y f(X))]$ over all (measurable) functions $f : \mathsf{X} \to \mathbb{R}$ has the form

$f^*(x) = \arg\min_{u \in \mathbb{R}} h_{\eta(x)}(u)$,

where for each $\eta \in [0,1]$ we have

$h_\eta(u) \triangleq \eta\,\varphi(-u) + (1-\eta)\,\varphi(u)$.

Moreover, $f^*(x)$ is positive if and only if $\eta(x) > 1/2$, i.e., the induced sign classifier $g_{f^*}(x) = \operatorname{sgn}(f^*(x))$ is the Bayes classifier (1).

Proof. By the law of iterated expectation,

$A_\varphi(f) = \mathbf{E}[\varphi(-Y f(X))] = \mathbf{E}\big[\mathbf{E}[\varphi(-Y f(X)) \mid X]\big]$.

Hence,

$\inf_f A_\varphi(f) = \inf_f\mathbf{E}\big[\mathbf{E}[\varphi(-Y f(X)) \mid X]\big] = \mathbf{E}\Big[\inf_{u \in \mathbb{R}}\mathbf{E}[\varphi(-Y u) \mid X = x]\Big]$.

For every $x \in \mathsf{X}$, we have

$\mathbf{E}[\varphi(-Y u) \mid X = x] = P[Y = +1 \mid X = x]\varphi(-u) + P[Y = -1 \mid X = x]\varphi(u) = \eta(x)\varphi(-u) + (1-\eta(x))\varphi(u) \equiv h_{\eta(x)}(u)$.

Since $\varphi$ is strictly convex and differentiable, so is $h_\eta$ for every $\eta \in [0,1]$. Therefore, $\inf_{u \in \mathbb{R}} h_\eta(u)$ exists, and is achieved by a unique $u^*$; in particular, $f^*(x) = \arg\min_{u \in \mathbb{R}} h_{\eta(x)}(u)$. To find the $u^*$ minimizing $h_\eta$, we differentiate $h_\eta$ w.r.t. $u$ and set the derivative to zero. Since

$h_\eta'(u) = (1-\eta)\varphi'(u) - \eta\,\varphi'(-u)$,

the point of minimum $u^*$ is the solution to the equation

$\frac{\varphi'(u)}{\varphi'(-u)} = \frac{\eta}{1-\eta}$.

Suppose $\eta > 1/2$; then

$\frac{\varphi'(u^*)}{\varphi'(-u^*)} > 1$.

Since $\varphi$ is strictly convex, its derivative $\varphi'$ is strictly increasing. Hence, $u^* > -u^*$, which implies that $u^* > 0$. Conversely, if $u^* \le 0$, then $-u^* \ge u^*$, so $\varphi'(-u^*) \ge \varphi'(u^*)$, which means that $\eta/(1-\eta) \le 1$, i.e., $\eta \le 1/2$. Thus, we conclude that $f^*(x)$, which is the minimizer of $h_{\eta(x)}$, is positive if and only if $\eta(x) > 1/2$, i.e., $\operatorname{sgn}(f^*(x))$ is the Bayes classifier. $\square$
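A quick numerical check of Theorem 8 for the exponential surrogate $\varphi(x) = e^x$, where the minimizer of $h_\eta$ has the closed form $u^* = \tfrac{1}{2}\log\tfrac{\eta}{1-\eta}$ and is positive exactly when $\eta > 1/2$; the grid search below is only a crude illustration of the abstract minimization over $u$.

```python
import numpy as np

def h_eta(phi, eta, u):
    """h_eta(u) = eta * phi(-u) + (1 - eta) * phi(u), as in Theorem 8."""
    return eta * phi(-u) + (1.0 - eta) * phi(u)

phi_exp = np.exp                                  # exponential surrogate phi(x) = e^x
grid = np.linspace(-5.0, 5.0, 100001)             # crude grid search over u

for eta in (0.2, 0.5, 0.8):
    u_grid = grid[np.argmin(h_eta(phi_exp, eta, grid))]
    u_closed = 0.5 * np.log(eta / (1.0 - eta))    # closed-form minimizer for the exponential loss
    print(f"eta = {eta}: grid minimizer {u_grid:+.3f}, closed form {u_closed:+.3f}")
```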

Secondly, under some additional regularity conditions it is possible to relate the minimum surrogate loss

$A_\varphi^* \triangleq \inf_f A_\varphi(f)$

to the Bayes rate $L^* \triangleq \inf_f L(f) = \inf_f P(Y f(X) < 0)$, where both infima are over all measurable functions $f : \mathsf{X} \to \mathbb{R}$:

Theorem 9. Assume that the surrogate loss function $\varphi$ satisfies the conditions of Theorem 3, and that there exist positive constants $s$ and $c$, such that the inequality

(33)    $L(f) - L^* \le c\big(A_\varphi(f) - A_\varphi^*\big)^{1/s}$

holds for any measurable function $f : \mathsf{X} \to \mathbb{R}$. Consider the learning algorithm that minimizes the empirical surrogate loss over some class $\mathcal{F}$:

(34)    $\hat{f}_n = \arg\min_{f \in \mathcal{F}} A_{\varphi,n}(f) = \arg\min_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i))$.

Then

(35)    $L(\hat{f}_n) - L^* \le 2^{1/s} c\Big(2 M_\varphi\mathbf{E} R_n(\mathcal{F}(X^n)) + B\sqrt{\frac{\log(1/\delta)}{2n}}\Big)^{1/s} + c\Big(\inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*\Big)^{1/s}$

with probability at least $1-\delta$.

Proof. We have the following chain of inequalities:

(36)    $L(\hat{f}_n) - L^* \le c\big(A_\varphi(\hat{f}_n) - A_\varphi^*\big)^{1/s}$

(37)    $= c\Big(A_\varphi(\hat{f}_n) - \inf_{f \in \mathcal{F}} A_\varphi(f) + \inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*\Big)^{1/s}$

(38)    $\le c\Big(A_\varphi(\hat{f}_n) - \inf_{f \in \mathcal{F}} A_\varphi(f)\Big)^{1/s} + c\Big(\inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*\Big)^{1/s}$

(39)    $\le 2^{1/s} c\Big(\sup_{f \in \mathcal{F}}|A_{\varphi,n}(f) - A_\varphi(f)|\Big)^{1/s} + c\Big(\inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*\Big)^{1/s}$

(40)    $\le 2^{1/s} c\Big(2 M_\varphi\mathbf{E} R_n(\mathcal{F}(X^n)) + B\sqrt{\frac{\log(1/\delta)}{2n}}\Big)^{1/s} + c\Big(\inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*\Big)^{1/s}$ with probability at least $1-\delta$,

where:

- (36) follows from (33);
- (38) follows from the inequality $(a+b)^{1/s} \le a^{1/s} + b^{1/s}$, which holds for all $a, b \ge 0$ and all $s \ge 1$;
- (39) follows since, by the ERM property of $\hat{f}_n$, $A_\varphi(\hat{f}_n) - \inf_{f \in \mathcal{F}} A_\varphi(f) \le 2\sup_{f \in \mathcal{F}}|A_{\varphi,n}(f) - A_\varphi(f)|$;
- (39) and (40) otherwise follow from the same argument as the one used in the proof of Theorem 3.

This completes the proof. $\square$
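The right-hand side of (35) is easy to assemble numerically once its ingredients are fixed. The helper below is just that arithmetic; every input (the Rademacher-average estimate, the approximation error, and the constants $s$, $c$, $M_\varphi$, $B$) must be supplied by the user and the values in the usage line are purely illustrative.

```python
import numpy as np

def excess_risk_bound(rademacher, approx_error, s, c, M_phi, B, n, delta):
    """Evaluate the right-hand side of (35):
        2^(1/s) * c * (2 M_phi E R_n + B sqrt(log(1/delta)/(2n)))^(1/s)
        + c * (inf_F A_phi - A_phi^*)^(1/s).
    `rademacher` stands for (an estimate of) E R_n(F(X^n)) and `approx_error`
    for inf_{f in F} A_phi(f) - A_phi^*."""
    estimation = 2.0 * M_phi * rademacher + B * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return 2.0 ** (1.0 / s) * c * estimation ** (1.0 / s) + c * approx_error ** (1.0 / s)

# e.g., hinge loss (s = 1, c = 4 as in Remark 8 below), with illustrative inputs:
print(excess_risk_bound(rademacher=0.05, approx_error=0.0,
                        s=1, c=4, M_phi=1.0, B=2.0, n=1000, delta=0.05))
```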

Remark 8. Condition (33) is often easy to check. For instance, Zhang [Zha04] proved that it is satisfied provided the inequality

(41)    $\Big|\frac{1}{2} - \eta\Big|^s \le (2c)^s\Big(1 - \inf_{u \in \mathbb{R}} h_\eta(u)\Big)$

holds for all $\eta \in [0,1]$. For instance, (41) holds for the exponential loss $\varphi(u) = e^u$ and the logit loss $\varphi(u) = \log_2(1 + e^u)$ with $s = 2$ and $c = 2\sqrt{2}$; for the hinge loss $\varphi(u) = (u+1)_+$, (41) holds with $s = 1$ and $c = 4$.

What Theorem 9 says is that, assuming the expected Rademacher average $\mathbf{E} R_n(\mathcal{F}(X^n)) = O(1/\sqrt{n})$, the difference between the generalization error of the Convex Risk Minimization algorithm (34) and the Bayes rate $L^*$ is, with high probability, bounded by the combination of two terms: the $O(n^{-1/(2s)})$ estimation error term and the $(\inf_{f \in \mathcal{F}} A_\varphi(f) - A_\varphi^*)^{1/s}$ approximation error term. If the hypothesis space $\mathcal{F}$ is rich enough, so that $\inf_{f \in \mathcal{F}} A_\varphi(f) = A_\varphi^*$, then the difference between $L(\hat{f}_n)$ and $L^*$ is, with high probability, bounded as $O(1/n^{1/(2s)})$, independently of the dimension $d$ of the feature space.

References

[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[CZ07] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.
[SHS01] B. Schölkopf, R. Herbrich, and A. Smola. A generalized representer theorem. In D. Helmbold and B. Williamson, editors, Computational Learning Theory, volume 2111 of Lecture Notes in Computer Science, pages 416-426. Springer, 2001.
[Zha04] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56-134, 2004.


More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

Statistical Learning Theory and the C-Loss cost function

Statistical Learning Theory and the C-Loss cost function Statistical Learning Theory and the C-Loss cost function Jose Principe, Ph.D. Distinguished Professor ECE, BME Computational NeuroEngineering Laboratory and principe@cnel.ufl.edu Statistical Learning Theory

More information

Learning Methods for Linear Detectors

Learning Methods for Linear Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2011/2012 Lesson 20 27 April 2012 Contents Learning Methods for Linear Detectors Learning Linear Detectors...2

More information

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

Metric Embedding for Kernel Classification Rules

Metric Embedding for Kernel Classification Rules Metric Embedding for Kernel Classification Rules Bharath K. Sriperumbudur University of California, San Diego (Joint work with Omer Lang & Gert Lanckriet) Bharath K. Sriperumbudur (UCSD) Metric Embedding

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Surrogate loss functions, divergences and decentralized detection

Surrogate loss functions, divergences and decentralized detection Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1

More information

Kernel Methods. Outline

Kernel Methods. Outline Kernel Methods Quang Nguyen University of Pittsburgh CS 3750, Fall 2011 Outline Motivation Examples Kernels Definitions Kernel trick Basic properties Mercer condition Constructing feature space Hilbert

More information

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 A. Kernels 1. Let X be a finite set. Show that the kernel

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Kernels MIT Course Notes

Kernels MIT Course Notes Kernels MIT 15.097 Course Notes Cynthia Rudin Credits: Bartlett, Schölkopf and Smola, Cristianini and Shawe-Taylor The kernel trick that I m going to show you applies much more broadly than SVM, but we

More information

Sample width for multi-category classifiers

Sample width for multi-category classifiers R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds

More information

Metric spaces and metrizability

Metric spaces and metrizability 1 Motivation Metric spaces and metrizability By this point in the course, this section should not need much in the way of motivation. From the very beginning, we have talked about R n usual and how relatively

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information