Using boxes and proximity to classify data into several categories


RUTCOR Research Report

Using boxes and proximity to classify data into several categories

Martin Anthony^a    Joel Ratsaby^b

RRR 7-2012, February 2012

RUTCOR, Rutgers Center for Operations Research, Rutgers University, 640 Bartholomew Road, Piscataway, New Jersey. Email: rrr@rutcor.rutgers.edu

^a Department of Mathematics, The London School of Economics and Political Science, Houghton Street, London WC2A 2AE, UK. m.anthony@lse.ac.uk

^b Electrical and Electronics Engineering Department, Ariel University Center of Samaria, Ariel 40700, Israel. ratsaby@ariel.ac.il

Using boxes and proximity to classify data into several categories

Martin Anthony    Joel Ratsaby

Abstract. The use of boxes for pattern classification has been widespread and is a fairly natural way in which to partition data into different classes or categories. In this paper we consider multi-category classifiers which are based on unions of boxes. The classification method studied may be described as follows: find boxes such that all points in the region enclosed by each box are assumed to belong to the same category, and then classify the remaining points by considering their distances to these boxes, assigning to a point the category of the nearest box. This extends the simple method of classifying by unions of boxes by incorporating a natural, proximity-based way of classifying points that lie outside the boxes. We analyse the generalization accuracy of such classifiers and obtain generalization error bounds that depend on a measure of how definitive the classification of the training points is.

Acknowledgements: This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence. Part of the work was carried out while Martin Anthony was a visitor at RUTCOR, the Rutgers Center for Operations Research.

1 Box-based multi-category classifiers

Classification in which each category or class is a union of boxes is a long-studied and natural method for pattern classification. It is central, for instance, to the methods used for logical analysis of data (see, for example, [9, 10, 15, 6]) and has been more widely studied as a geometrical classifier (see [11], for instance). More recently, unions of boxes have been used in combination with a nearest-neighbor or proximity paradigm for binary classification [5] and multi-category classification [13], enabling meaningful classification of points of the domain that lie outside any of the boxes.

In this paper, we analyse multi-category classifiers of the type described by [13]. That paper describes a set of classifiers based on boxes and nearest neighbors, where the metric used for the nearest-neighbor measure is the Manhattan or taxicab metric. (We give explicit details shortly.) The authors use an agglomerative box-clustering method to produce a set of candidate classifiers of this type, and they then select from these one that is, in a sense they define, optimal. First they focus on the classifiers which are Pareto-optimal with respect to the two dimensions of error on the sample, $E$, and complexity (number of boxes), $B$. Among these they then select a classifier that minimizes an objective function of the form $E/E_0 + B/B_0$, effecting a trade-off between error and complexity, and, if there is more than one such classifier, they choose the one that minimizes $E$. They provide some experimental evidence that this approach works.
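
The model-selection rule of [13] recalled above (Pareto-filter the candidates on $(E, B)$, then minimize $E/E_0 + B/B_0$, breaking ties by smaller $E$) is simple to state in code. The following Python sketch is only an illustration of that rule, not code from [13]: the candidate $(E, B)$ pairs and the normalizing constants $E_0$ and $B_0$ below are hypothetical.

```python
def pareto_optimal(candidates):
    """Keep the (E, B) pairs not dominated by any other pair
    (dominated = some other candidate has both E and B no larger,
    and at least one strictly smaller)."""
    keep = []
    for i, (e, b) in enumerate(candidates):
        dominated = any(
            (e2 <= e and b2 <= b) and (e2 < e or b2 < b)
            for j, (e2, b2) in enumerate(candidates) if j != i
        )
        if not dominated:
            keep.append((e, b))
    return keep

def select(candidates, e0, b0):
    """Among Pareto-optimal candidates, minimize E/E0 + B/B0,
    breaking ties in favour of smaller error E."""
    front = pareto_optimal(candidates)
    return min(front, key=lambda eb: (eb[0] / e0 + eb[1] / b0, eb[0]))

# Hypothetical (sample error, number of boxes) pairs, for illustration only.
cands = [(0.10, 3), (0.06, 5), (0.05, 9), (0.12, 2), (0.06, 7)]
print(select(cands, e0=0.10, b0=5.0))
```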

Here, we obtain generalization error bounds for the box-based classifiers of the type considered in [13], within the standard PAC model of probabilistic learning. The bounds we obtain depend on the error and the complexity, and they improve (that is, they decrease) the more definite the classification of the sample points is.

Suppose points of $[0,1]^n$ are to be classified into $C$ classes, which we will assume are labeled $1, 2, \ldots, C$. We let $[C]$ denote the set $\{1, 2, \ldots, C\}$. A box (or, more exactly, an axis-parallel box) in $\mathbb{R}^n$ is a set of the form
$$I(u,v) = \{x \in \mathbb{R}^n : u_i \le x_i \le v_i,\ 1 \le i \le n\},$$
where $u, v \in [0,1]^n$ and $u \le v$, meaning that $u_i \le v_i$ for each $i$. We consider multi-category classifiers which are based on $C$ unions of boxes, as we now describe. For $k = 1, \ldots, C$, suppose that $S_k$ is a union of some number, $B_k$, of boxes:
$$S_k = \bigcup_{j=1}^{B_k} I(u(k,j), v(k,j)).$$
Here, the $j$th box is defined by $u(k,j), v(k,j) \in [0,1]^n$ with $u(k,j) \le v(k,j)$ (so, for each $i$ between $1$ and $n$, $u(k,j)_i \le v(k,j)_i$). We assume, further, that for $k \ne l$, $S_k \cap S_l = \emptyset$. We think of $S_k$ as being a region of the domain all of whose points we assume to belong to class $k$. So, as in [13] and [15], for instance, the boxes in $S_k$ might be constructed by agglomerative box-clustering methods.

To define our classifiers, we will make use of a metric on $[0,1]^n$. To be specific, as in [13], $d$ will be the $d_1$ (Manhattan, or taxicab) metric: for $x, y \in [0,1]^n$,
$$d(x,y) = \sum_{i=1}^n |x_i - y_i|.$$
We could equally well (as in [5], where the two-class case is the focus) use the supremum or $d_\infty$ metric, defined by $d_\infty(x,y) = \max\{|x_i - y_i| : 1 \le i \le n\}$, and similar results would be obtained. For $x \in [0,1]^n$ and $S \subseteq [0,1]^n$, the distance from $x$ to $S$ is
$$d(x,S) = \inf_{y \in S} d(x,y).$$
Let $S = (S_1, S_2, \ldots, S_C)$ and denote by $h_S$ the classifier from $[0,1]^n$ into $[C]$ defined as follows: for $x \in [0,1]^n$,
$$h_S(x) = \operatorname*{argmin}_{1 \le k \le C} d(x, S_k),$$
where, if $d(x, S_k)$ is minimized for more than one value of $k$, one of these values is chosen randomly as the value of $h_S(x)$. So, in other words, the class label assigned to $x$ is $k$, where $S_k$ is the closest to $x$ of the regions $S_1, S_2, \ldots, S_C$. We refer to $B = B_1 + B_2 + \cdots + B_C$ as the number of boxes in $S$ and in $h_S$. We will denote by $H_B$ the set of all such classifiers in which the number of boxes is $B$. The set of all possible classifiers we consider is then $H = \bigcup_{B=1}^{\infty} H_B$.

These classifiers, therefore, are based, as a starting point, on regions assumed to be of particular categories. These regions are each unions of boxes, and the regions do not overlap. In practice, these boxes and the corresponding regions will likely have been constructed directly from a training sample, by finding boxes containing sample points of a particular class and merging, or agglomerating, these; see [13]. See, for example, Figure 1: the three types of boxes are indicated, and the pale gray region is the region not covered by any of the boxes.
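
To make these definitions concrete, here is a minimal Python sketch (not from the paper; all names are my own) of the $d_1$ distance from a point to an axis-parallel box $I(u,v)$, the distance to a union of boxes, and the resulting nearest-box classifier $h_S$. For simplicity it breaks ties deterministically rather than randomly.

```python
def d1_point_to_box(x, u, v):
    """Manhattan (d_1) distance from point x to the box I(u, v): in each
    coordinate the nearest point of the box is obtained by clipping x_i
    into [u_i, v_i]."""
    return sum(max(u_i - x_i, 0.0, x_i - v_i) for x_i, u_i, v_i in zip(x, u, v))

def d1_point_to_union(x, boxes):
    """Distance from x to a union of boxes S_k: the distance to a union
    is the minimum of the distances to its boxes."""
    return min(d1_point_to_box(x, u, v) for (u, v) in boxes)

def classify(x, S):
    """h_S(x): 1-based label of the closest region S_k.
    S is a list [S_1, ..., S_C], each S_k a list of (u, v) box pairs.
    Ties are broken by the smallest label here, not randomly as in the paper."""
    dists = [d1_point_to_union(x, S_k) for S_k in S]
    return 1 + dists.index(min(dists))

# Toy example in [0,1]^2 with two classes, for illustration only.
S = [
    [([0.0, 0.0], [0.2, 0.2])],                              # S_1: one box
    [([0.7, 0.7], [0.9, 0.9]), ([0.6, 0.1], [0.8, 0.3])],    # S_2: two boxes
]
print(classify([0.3, 0.3], S))  # closer to S_1, so this prints 1
```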

Figure 1: Boxes of three categories.

Then, for all other points of the domain, the classification of a point is given by the class of the region to which it is closest in the $d_1$ metric. For the initial configuration of boxes indicated in Figure 1, the final classification of the whole domain is as indicated in Figure 2. (Grid lines have been inserted in these figures to make it easier to see the correspondence between them.)

Figure 2: Classification of the whole domain resulting from the boxes of Figure 1.

These classifiers seem quite natural from a geometrical point of view and, unlike black-box classifiers such as neural networks, can be described and understood: there are box-shaped regions where we assert a known classification, and the classification elsewhere is determined by an arguably fairly sensible nearest-neighbor approach. It is also potentially useful that the classification is explicitly based on distance, and so there is a real-valued function $f$ underlying the classifier: if the classification of $x$ is $k$, then we can consider how much further $x$ is from the next-nearest category of box. This real number,
$$f(x) = \min_{l \ne k} d(x, S_l) - d(x, S_k),$$
quantifies, in a sense, how sure or definitive the classification is. In particular, if $f(x)$ is relatively large, it means that $x$ is quite far from boxes of the other categories. We could interpret the value of the function $f$ as an indicator of how confident we might be about the classification of a point. A point in the domain with a large value of $f$ will be classified more definitively than one with a small value of $f$, and we might think that the classification of the first point is more reliable than that of the second, because the larger value of $f$ indicates that the first point is further from boxes of other categories than is the second point. Furthermore, by considering the value of $f$ on the sample points, we have some measure of how robust the classifier is on the training sample. This measure of robustness plays a role in our generalization error bounds.

2 Generalization error bounds for definitive classification

Guermeur [14] describes a fairly general framework for PAC analysis of multi-category classifiers, and we will show how we can use his results to bound the generalization error of our classifiers. This involves defining a certain function class and then bounding the covering numbers of that class. First, however, we obtain a tighter bound for the case in which the margin error is zero. What this means is that we bound the error of the classifier in terms of how definitive the classification of the training sample is.

2.1 Classifiers with definitive classification on all sample points

The following definition describes what we mean by definitive classification on the sample.

Definition 2.1 Let $Z = X \times Y$ where $X = [0,1]^n$ and $Y = [C] = \{1, 2, \ldots, C\}$. For $\gamma \in (0,1)$, and for $z \in Z^m$, we say that $h_S \in H$ achieves margin $\gamma$ on the sample point $z = (x, y)$ in $z$ if, for all $l \ne y$, $d(x, S_l) > d(x, S_y) + \gamma$. We say that $h_S$ achieves margin $\gamma$ on the sample $z$ if it achieves margin $\gamma$ on each of the $(x_i, y_i)$ in $z$.
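
The definitiveness function $f$ and the margin condition of Definition 2.1 are straightforward to compute once the distances $d(x, S_k)$ are available. The sketch below (illustrative only; it re-uses the same toy list-of-boxes representation of $S$ as the earlier sketch) returns the predicted class together with $f(x)$, and checks whether $h_S$ achieves margin $\gamma$ on a labelled point $(x, y)$.

```python
def d1_point_to_union(x, boxes):
    # d_1 distance from x to a union of axis-parallel boxes (u, v).
    return min(
        sum(max(u_i - x_i, 0.0, x_i - v_i) for x_i, u_i, v_i in zip(x, u, v))
        for (u, v) in boxes
    )

def definitiveness(x, S):
    """Return (predicted class, f(x)) where
    f(x) = min_{l != k} d(x, S_l) - d(x, S_k) and k is the nearest region."""
    dists = [d1_point_to_union(x, S_k) for S_k in S]
    k = dists.index(min(dists))
    others = [d for i, d in enumerate(dists) if i != k]
    return k + 1, min(others) - dists[k]

def achieves_margin(x, y, S, gamma):
    """Definition 2.1: h_S achieves margin gamma on (x, y) iff
    d(x, S_l) > d(x, S_y) + gamma for every l != y."""
    dists = [d1_point_to_union(x, S_k) for S_k in S]
    dy = dists[y - 1]
    return all(d > dy + gamma for i, d in enumerate(dists) if i != y - 1)

# Toy data for illustration only.
S = [[([0.0, 0.0], [0.2, 0.2])], [([0.7, 0.7], [0.9, 0.9])]]
print(definitiveness([0.25, 0.25], S))          # class 1 with a positive f(x)
print(achieves_margin([0.25, 0.25], 1, S, 0.5)) # True for this toy example
```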

So, $h_S$ achieves margin $\gamma$ on a sample point $(x,y)$ if $S_y$ is the closest of the $S_k$ to $x$ (so that the class label assigned to $x$ will be $y$) and, also, every other $S_l$ has distance from $x$ that is at least $\gamma$ greater than the distance from $x$ to $S_y$.

To quantify the performance of a classifier after training, we use a form of the PAC model of computational learning theory; see, for instance, [8, 4, 19]. This assumes that we have training examples $z_i = (x_i, y_i) \in Z = [0,1]^n \times [C]$, each of which has been generated randomly according to some fixed probability measure $P$ on $Z$. Then we can regard a training sample of length $m$, which is an element of $Z^m$, as being randomly generated according to the product probability measure $P^m$. The error of a classifier $h_S$ is the probability that it does not definitely assign the correct class to a subsequent randomly drawn instance, and we include in this the cases in which there is more than one equally close $S_k$, for the random choice then made might be incorrect. So, the error is the probability that it is not true that, for $(x,y) \in Z$, we have $d(x, S_y) < d(x, S_k)$ for all $k \ne y$; that is,
$$\mathrm{er}_P(h_S) = P\left\{(x,y) : d(x, S_y) \ge \min_{k \ne y} d(x, S_k)\right\}.$$
What we would hope is that, if a classifier performs well on a large enough sample, then its error is likely to be low. The following result is of this type, where good performance on the sample means correct, and definitively correct, classification on the sample.

Theorem 2.2 Let $\delta \in (0,1)$ and suppose $P$ is a probability measure on $Z = [0,1]^n \times [C]$. With $P^m$-probability at least $1 - \delta$, $z \in Z^m$ will be such that we have the following: for all $B$ and for all $\gamma \in (0,1)$, for all $h_S \in H_B$, if $h_S$ achieves margin $\gamma$ on $z$, then
$$\mathrm{er}_P(h_S) < \frac{2}{m}\left(2nB\log_2\frac{12n}{\gamma} + 2B\log_2 C + \log_2\frac{4}{\gamma\delta}\right).$$

In particular, Theorem 2.2 provides a high-probability guarantee on the real error of a classifier once training has taken place, based on the observed margin that has been obtained (this being the largest value of $\gamma$ such that a margin of $\gamma$ has been achieved).
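
Before turning to the proof, it may help to see how the right-hand side of Theorem 2.2 behaves numerically. The snippet below simply evaluates the bound as reconstructed above; the parameter values chosen are arbitrary and purely illustrative.

```python
from math import log2

def theorem_2_2_bound(m, n, B, C, gamma, delta):
    """Right-hand side of Theorem 2.2 (zero margin error on the sample),
    as reconstructed above:
    (2/m) * (2nB log2(12n/gamma) + 2B log2(C) + log2(4/(gamma*delta)))."""
    return (2.0 / m) * (
        2 * n * B * log2(12 * n / gamma)
        + 2 * B * log2(C)
        + log2(4.0 / (gamma * delta))
    )

# Larger samples and larger observed margins give smaller bounds.
for m in (1000, 10000, 100000):
    print(m, round(theorem_2_2_bound(m, n=2, B=4, C=3, gamma=0.1, delta=0.05), 4))
```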

2.2 Proof of Theorem 2.2

We first derive a result in which $B$ and $\gamma$ are fixed in advance. We then remove the requirement that these parameters be fixed, to obtain Theorem 2.2.

For $\gamma \in (0,1)$, let $A \subseteq [0,1]$ be the set of all integer multiples of $\gamma/(4n)$ belonging to $[0,1]$, together with $1$. So,
$$A = \left\{0, \frac{\gamma}{4n}, \frac{2\gamma}{4n}, \frac{3\gamma}{4n}, \ldots, 1\right\}.$$
We have $|A| \le \frac{4n}{\gamma} + 2 \le \frac{6n}{\gamma}$. Let $\hat H_B \subseteq H_B$ be the set of all classifiers of the form $h_S$ where, for each $k$, $S_k$ is a union of boxes of the form $I(u,v)$ with $u, v \in A^n$.

Lemma 2.3 With the above notation, we have
$$|\hat H_B| \le \left(\frac{6n}{\gamma}\right)^{2nB} C^B.$$
Proof: The number of possible boxes $I(u,v)$ with $u, v \in A^n$ is at most $|A|^{2n} \le (6n/\gamma)^{2n}$. A classifier in $\hat H_B$ is obtained by choosing $B$ such boxes and labeling each with a category between $1$ and $C$. So,
$$|\hat H_B| \le \left(\left(\frac{6n}{\gamma}\right)^{2n}\right)^B C^B = \left(\frac{6n}{\gamma}\right)^{2nB} C^B,$$
as required.

For $\gamma \ge 0$, we define the $\gamma$-margin error of $h_S$ on a sample $z$. Denoted by $E_z^\gamma(h_S)$, this is simply the proportion of $(x_i, y_i)$ in $z$ on which $h_S$ does not achieve a margin of $\gamma$. In other words,
$$E_z^\gamma(h_S) = \frac{1}{m}\sum_{i=1}^m I\left\{\exists\, l \ne y_i : d(x_i, S_l) \le d(x_i, S_{y_i}) + \gamma\right\}.$$
Here, $I(A)$ denotes the indicator function of a set or event $A$. Evidently, to say that $h_S$ achieves margin $\gamma$ on $z$ is to say that $E_z^\gamma(h_S) = 0$.
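
The discretisation used in the proof is easy to realise in code: $A$ is the grid of integer multiples of $\gamma/(4n)$ in $[0,1]$ (together with $1$), a classifier is moved into $\hat H_B$ by snapping every box corner to a nearby grid point, and the $\gamma$-margin error is a simple count over the sample. The sketch below is illustrative only and uses the same toy representation of $S$ as the earlier sketches.

```python
def grid_A(n, gamma):
    """Integer multiples of gamma/(4n) lying in [0, 1], together with 1."""
    step = gamma / (4.0 * n)
    points = [i * step for i in range(int(1.0 / step) + 1)]
    if points[-1] < 1.0:
        points.append(1.0)
    return points

def snap(value, grid):
    """Nearest grid point; the move is at most gamma/(4n) by construction."""
    return min(grid, key=lambda a: abs(a - value))

def d1_point_to_union(x, boxes):
    return min(
        sum(max(u_i - x_i, 0.0, x_i - v_i) for x_i, u_i, v_i in zip(x, u, v))
        for (u, v) in boxes
    )

def margin_error(sample, S, gamma):
    """E_z^gamma(h_S): fraction of labelled points (x, y) for which some
    l != y has d(x, S_l) <= d(x, S_y) + gamma."""
    bad = 0
    for x, y in sample:
        dists = [d1_point_to_union(x, S_k) for S_k in S]
        dy = dists[y - 1]
        if any(d <= dy + gamma for i, d in enumerate(dists) if i != y - 1):
            bad += 1
    return bad / len(sample)

# Toy usage, for illustration only.
A = grid_A(n=2, gamma=0.1)
u_hat = [snap(c, A) for c in [0.137, 0.512]]   # a snapped box corner
S = [[([0.0, 0.0], [0.2, 0.2])], [([0.7, 0.7], [0.9, 0.9])]]
z = [([0.1, 0.1], 1), ([0.8, 0.8], 2), ([0.5, 0.5], 1)]
print(u_hat, margin_error(z, S, gamma=0.3))
```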

The next part of the proof uses a symmetrization technique similar to those first used in [20, 18, 12, 17], and in subsequent work extending those techniques to learning with real-valued functions, such as [16, 1, 2, 7].

Lemma 2.4 If
$$Q = \{s \in Z^m : \exists\, h_S \in H_B \text{ with } E_s^\gamma(h_S) = 0,\ \mathrm{er}_P(h_S) \ge \epsilon\}$$
and
$$T = \{(s, s') \in Z^m \times Z^m : \exists\, h_S \in H_B \text{ with } E_s^\gamma(h_S) = 0,\ E_{s'}^0(h_S) \ge \epsilon/2\},$$
then, for $m \ge 8/\epsilon$, $P^m(Q) \le 2\, P^{2m}(T)$.

Proof: We have
$$P^{2m}(T) \ge P^{2m}\left\{\exists\, h_S \in H_B : E_s^\gamma(h_S) = 0,\ \mathrm{er}_P(h_S) \ge \epsilon \text{ and } E_{s'}^0(h_S) \ge \epsilon/2\right\}$$
$$\ge \int_Q P^m\left\{s' : \exists\, h_S \in H_B,\ E_s^\gamma(h_S) = 0,\ \mathrm{er}_P(h_S) \ge \epsilon \text{ and } E_{s'}^0(h_S) \ge \epsilon/2\right\} dP^m(s) \ge \frac{1}{2}\, P^m(Q),$$
for $m \ge 8/\epsilon$. The final inequality follows from the fact that if $\mathrm{er}_P(h_S) \ge \epsilon$ then, for $m \ge 8/\epsilon$,
$$P^m\left(E_{s'}^0(h_S) \ge \epsilon/2\right) \ge 1/2$$
for any $h_S \in H$, something that follows by a Chernoff bound.

We next bound the probability of $T$ and use this to obtain a generalization error bound for fixed $\gamma$ and $B$.

Proposition 2.5 Let $B \in \mathbb{N}$, $\gamma \in (0,1)$ and $\delta \in (0,1)$. Then, with probability at least $1 - \delta$, if $h_S \in H_B$ and $E_z^\gamma(h_S) = 0$, then
$$\mathrm{er}_P(h_S) < \frac{2}{m}\left(2nB\log_2\frac{6n}{\gamma} + B\log_2 C + \log_2\frac{2}{\delta}\right).$$
Proof: Let $G$ be the permutation group (the 'swapping group') on the set $\{1, 2, \ldots, 2m\}$ generated by the transpositions $(i, m+i)$ for $i = 1, 2, \ldots, m$.

Then $G$ acts on $Z^{2m}$ by permuting the coordinates: for $\sigma \in G$, $\sigma(z_1, z_2, \ldots, z_{2m}) = (z_{\sigma(1)}, \ldots, z_{\sigma(2m)})$. By invariance of $P^{2m}$ under the action of $G$,
$$P^{2m}(T) \le \max\{\Pr(\sigma z \in T) : z \in Z^{2m}\},$$
where $\Pr$ denotes the probability over uniform choice of $\sigma$ from $G$. (See, for instance, [17, 3].)

Fix $z \in Z^{2m}$. Suppose $\tau z = (s, s') \in T$ and that $h_S \in H_B$ is such that $E_s^\gamma(h_S) = 0$ and $E_{s'}^0(h_S) \ge \epsilon/2$. Suppose that $S = (S_1, S_2, \ldots, S_C)$ where, for each $k$,
$$S_k = \bigcup_{j=1}^{B_k} I(u(k,j), v(k,j)).$$
Let $\hat u(k,j), \hat v(k,j) \in A^n$ be such that, for all $r$,
$$|\hat u(k,j)_r - u(k,j)_r| \le \frac{\gamma}{4n}, \qquad |\hat v(k,j)_r - v(k,j)_r| \le \frac{\gamma}{4n}.$$
These exist by definition of $A$. Let
$$\hat S_k = \bigcup_{j=1}^{B_k} I(\hat u(k,j), \hat v(k,j))$$
and $\hat S = (\hat S_1, \ldots, \hat S_C)$, and let $\hat h_S$ be the corresponding classifier in $\hat H_B$; that is, $\hat h_S = h_{\hat S}$.

The following (easily seen) geometrical fact will be useful: for any $k$ and any $a \in S_k$, there exists $\hat a \in \hat S_k$ such that $d(a, \hat a) \le \gamma/4$; and, conversely, for any $\hat a \in \hat S_k$, there exists $a \in S_k$ such that $d(a, \hat a) \le \gamma/4$. For any $k$, there is some $a \in S_k$ such that $d(x, S_k) = d(x, a)$. If $\hat a \in \hat S_k$ is such that $|a_r - \hat a_r| \le \gamma/(4n)$ for all $r$, then it follows that
$$d(a, \hat a) = \sum_{r=1}^n |a_r - \hat a_r| \le \frac{\gamma}{4}.$$
So,
$$d(x, \hat S_k) \le d(x, \hat a) \le d(x, a) + d(a, \hat a) \le d(x, S_k) + \frac{\gamma}{4}.$$
A similar argument shows that, for each $k$, $d(x, S_k) \le d(x, \hat S_k) + \gamma/4$.

Suppose that, for all $l \ne y$, $d(x, S_l) > d(x, S_y) + \gamma$. Then, if $l \ne y$, we have
$$d(x, \hat S_l) \ge d(x, S_l) - \frac{\gamma}{4} > d(x, S_y) + \gamma - \frac{\gamma}{4} \ge \left(d(x, \hat S_y) - \frac{\gamma}{4}\right) + \gamma - \frac{\gamma}{4} = d(x, \hat S_y) + \frac{\gamma}{2}.$$

So, if $h_S$ achieves margin $\gamma$ on $(x,y)$, then $\hat h_S$ achieves margin $\gamma/2$. It follows that if $E_s^\gamma(h_S) = 0$ then $E_s^{\gamma/2}(\hat h_S) = 0$. Now suppose that there is $l \ne y$ such that $d(x, S_l) \le d(x, S_y)$. Then
$$d(x, \hat S_l) \le d(x, S_l) + \frac{\gamma}{4} \le d(x, S_y) + \frac{\gamma}{4} \le d(x, \hat S_y) + \frac{\gamma}{4} + \frac{\gamma}{4} = d(x, \hat S_y) + \frac{\gamma}{2}.$$
This argument shows that if $E_{s'}^0(h_S) \ge \epsilon/2$, then $E_{s'}^{\gamma/2}(\hat h_S) \ge \epsilon/2$. It now follows that if $\tau z \in T$ then, for some $\hat h_S \in \hat H_B$, $\tau z \in R(\hat h_S)$, where
$$R(\hat h_S) = \{(s, s') \in Z^m \times Z^m : E_s^{\gamma/2}(\hat h_S) = 0,\ E_{s'}^{\gamma/2}(\hat h_S) \ge \epsilon/2\}.$$
By symmetry, $\Pr(\sigma z \in R(\hat h_S)) = \Pr(\sigma\tau z \in R(\hat h_S))$. Suppose that $E_{s'}^{\gamma/2}(\hat h_S) = r/m$, where $r \ge \epsilon m/2$ is the number of $(x_i, y_i)$ in $s'$ on which $\hat h_S$ does not achieve margin $\gamma/2$. Then those permutations $\sigma$ such that $\sigma\tau z \in R(\hat h_S)$ are precisely those that do not transpose these $r$ coordinates, and there are $2^{m-r} \le 2^{m - \epsilon m/2}$ such $\sigma$. It follows that, for each fixed $\hat h_S \in \hat H_B$,
$$\Pr(\sigma z \in R(\hat h_S)) \le \frac{2^{m(1 - \epsilon/2)}}{|G|} = 2^{-\epsilon m/2}.$$
We therefore have
$$\Pr(\sigma z \in T) \le \Pr\Bigl(\bigcup_{\hat h_S \in \hat H_B} \{\sigma z \in R(\hat h_S)\}\Bigr) \le \sum_{\hat h_S \in \hat H_B} \Pr(\sigma z \in R(\hat h_S)) \le |\hat H_B|\, 2^{-\epsilon m/2}.$$
So,
$$P^m(Q) \le 2\, P^{2m}(T) \le 2\, |\hat H_B|\, 2^{-\epsilon m/2} \le 2\left(\frac{6n}{\gamma}\right)^{2nB} C^B\, 2^{-\epsilon m/2}.$$
This is at most $\delta$ when
$$\epsilon = \frac{2}{m}\left(2nB\log_2\frac{6n}{\gamma} + B\log_2 C + \log_2\frac{2}{\delta}\right),$$
as stated.

Next, we use this to obtain a result in which $\gamma$ and $B$ are not prescribed in advance. For $\alpha_1, \alpha_2, \delta \in (0,1)$, let $E(\alpha_1, \alpha_2, \delta)$ be the set of $z \in Z^m$ for which there exists some $h_S \in H_B$ which achieves margin $\alpha_2$ on $z$ and which has $\mathrm{er}_P(h_S) \ge \epsilon_1(m, \alpha_1, \delta, B)$, where
$$\epsilon_1(m, \alpha_1, \delta, B) = \frac{2}{m}\left(2nB\log_2\frac{6n}{\alpha_1} + B\log_2 C + \log_2\frac{2}{\delta}\right).$$

Proposition 2.5 tells us that $P^m(E(\alpha, \alpha, \delta)) \le \delta$. It is also clear that if $\alpha_1 \le \alpha \le \alpha_2$ and $\delta_1 \le \delta$, then $E(\alpha_1, \alpha_2, \delta_1) \subseteq E(\alpha, \alpha, \delta)$. It follows, from [7], that
$$P^m\Bigl(\bigcup_{\gamma \in (0,1]} E(\gamma/2,\, \gamma,\, \gamma\delta/2)\Bigr) \le \delta.$$
In other words, for fixed $B$, with probability at least $1 - \delta$, for all $\gamma \in (0,1]$, we have that if $h_S \in H_B$ achieves margin $\gamma$ on the sample, then $\mathrm{er}_P(h_S) < \epsilon_2(m, \gamma, \delta, B)$, where
$$\epsilon_2(m, \gamma, \delta, B) = \frac{2}{m}\left(2nB\log_2\frac{12n}{\gamma} + B\log_2 C + \log_2\frac{4}{\gamma\delta}\right).$$
Note that $\gamma$ now need not be prescribed in advance. It now follows that, with probability at least $1 - \delta/2^B$, for any $\gamma \in (0,1)$, if $h_S \in H_B$ achieves margin $\gamma$, then $\mathrm{er}_P(h_S) \le \epsilon_2(m, \gamma, \delta/2^B, B)$. So, with probability at least $1 - \sum_{B=1}^{\infty}\delta/2^B = 1 - \delta$, we have: for all $B$ and for all $\gamma \in (0,1)$, if $h_S \in H_B$ achieves margin $\gamma$, then
$$\mathrm{er}_P(h_S) \le \epsilon_2(m, \gamma, \delta/2^B, B) = \frac{2}{m}\left(2nB\log_2\frac{12n}{\gamma} + B\log_2 C + \log_2\frac{4}{\gamma\delta} + B\right).$$
Theorem 2.2 now follows on noting that $\log_2 C \ge 1$ (since $C \ge 2$), so that $B \le B\log_2 C$.

2.3 Allowing less definitive classification on some sample points

As mentioned, Guermeur [14] has developed a fairly general framework in which to analyse multi-category classification, and we can apply one of his results to obtain a generalization error bound applicable to the case in which a margin of $\gamma$ is not obtained on all the training examples. We describe his result and explain how to formulate our problem in his framework. This requires us to define a certain real-valued class of functions, the values of which may themselves impart some useful information as to how confident one might be about a classification. To use his result to obtain a generalization error bound, we then bound the covering numbers of this class of functions. What we obtain is a high-probability bound of the following form: for all $B$ and for all $\gamma \in (0,1)$, if $h_S \in H_B$, then
$$\mathrm{er}_P(h_S) \le E_z^{2\gamma}(h_S) + \epsilon(m, \gamma, \delta, B),$$
where $\epsilon$ tends to $0$ as $m \to \infty$ and $\epsilon$ decreases as $\gamma$ increases. Recall that $E_z^{2\gamma}(h_S)$ is the $2\gamma$-margin error of $h_S$ on the sample. The rationale for seeking such a bound is that there is likely to be a trade-off between margin error on the sample and the value of $\epsilon$: taking $\gamma$ small, so that the margin error term is zero, might entail a large value of $\epsilon$; and, conversely, choosing $\gamma$ large will make $\epsilon$ relatively small, but may lead to a large margin error term. So, in principle, since the value $\gamma$ is free to be chosen, one could optimize the choice of $\gamma$ on the right-hand side of the bound to minimize it.
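
The trade-off just described can be resolved numerically: scan a range of candidate $\gamma$ values and, for each, add the observed margin error to the corresponding penalty term. In the sketch below both curves are hypothetical stand-ins (the empirical margin-error curve is invented, and the penalty has the $\sqrt{\,\cdot/m\,}$ shape of Theorem 2.7 below, as reconstructed in this section); only the selection logic itself is the point.

```python
from math import log, sqrt

def best_gamma(margin_err, penalty, gammas):
    """Return the candidate gamma minimising margin_err(gamma) + penalty(gamma),
    together with the minimised value of the right-hand side."""
    best = min(gammas, key=lambda g: margin_err(g) + penalty(g))
    return best, margin_err(best) + penalty(best)

# Toy stand-ins: the margin error grows with gamma, the penalty shrinks with it.
m, n, B, C, delta = 5000, 2, 4, 3, 0.05
toy_margin_err = lambda g: min(1.0, 0.02 + 0.4 * g)   # hypothetical empirical curve
toy_penalty = lambda g: sqrt((2.0 / m) * (2 * n * B * log(6 * n / g)
                                          + 2 * B * log(C) + log(2 / delta))) + 1.0 / m

gammas = [i / 100 for i in range(1, 100)]
print(best_gamma(toy_margin_err, toy_penalty, gammas))
```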

We now describe Guermeur's framework, with some adjustments to the notation to make it consistent with the notation used here. There is a set $G$ of functions from $X = [0,1]^n$ into $\mathbb{R}^C$, and a typical $g \in G$ is represented by its component functions $g = (g_k)_{k=1}^C$. Each $g \in G$ satisfies the constraint
$$\sum_{k=1}^C g_k(x) = 0, \quad x \in X.$$
A function of this type acts as a classifier as follows: it assigns category $l \in [C]$ to $x \in X$ if and only if $g_l(x) > \max_{k \ne l} g_k(x)$. If more than one value of $k$ maximizes $g_k(x)$, then the classification is left undefined (assigned some value not in $[C]$). The risk of $g \in G$, when the underlying probability measure on $X \times Y$ is $P$, is defined to be
$$R(g) = P\left\{(x,y) \in X \times [C] : g_y(x) \le \max_{k \ne y} g_k(x)\right\}.$$
For $(v, k) \in \mathbb{R}^C \times [C]$, let $M(v,k) = \frac{1}{2}\bigl(v_k - \max_{l \ne k} v_l\bigr)$ and, for $g \in G$, let $\bar g$ be the function $X \to \mathbb{R}^C$ given by
$$\bar g(x) = (\bar g_k(x))_{k=1}^C = (M(g(x), k))_{k=1}^C.$$
Given a sample $z \in (X \times [C])^m$, let
$$R_{\gamma, z}(g) = \frac{1}{m}\sum_{i=1}^m I\left\{\bar g_{y_i}(x_i) < \gamma\right\}.$$
To describe Guermeur's result, we need covering numbers. Suppose $H$ is a set of functions from $X$ to $\mathbb{R}^C$. For $x \in X^m$, define the metric $d_x$ on $H$ by
$$d_x(h, h') = \max_{1 \le i \le m} \|h(x_i) - h'(x_i)\|_\infty = \max_{1 \le i \le m}\, \max_{1 \le k \le C} |h_k(x_i) - h'_k(x_i)|.$$
For $\alpha > 0$, a finite subset $\hat H$ of $H$ is said to be a (proper) $\alpha$-cover of $H$ with respect to the metric $d_x$ if for each $h \in H$ there exists $\hat h \in \hat H$ such that $d_x(h, \hat h) \le \alpha$. The class $H$ is totally bounded if, for each $\alpha > 0$, each $m$, and each $x \in X^m$, there is a finite $\alpha$-cover of $H$ with respect to $d_x$. The smallest cardinality of an $\alpha$-cover of $H$ with respect to $d_x$ is denoted $N(\alpha, H, d_x)$. The covering number $N(\alpha, H, m)$ is defined by
$$N(\alpha, H, m) = \max_{x \in X^m} N(\alpha, H, d_x).$$
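
The metric $d_x$ used in the covering-number definitions is simply a maximum, over sample points and over components, of the absolute difference of two $\mathbb{R}^C$-valued functions. A direct rendering (illustrative only; the two toy functions and sample points are made up) is:

```python
def d_x(h, h_prime, xs):
    """d_x(h, h') = max over sample points x_i and components k of
    |h_k(x_i) - h'_k(x_i)|, for R^C-valued functions h and h'."""
    return max(
        abs(hk - hpk)
        for x in xs
        for hk, hpk in zip(h(x), h_prime(x))
    )

# Two toy R^2-valued functions on [0,1]^2 and a small "sample" of points.
h = lambda x: (x[0] - x[1], x[1] - x[0])
h_prime = lambda x: (x[0] - x[1] + 0.05, x[1] - x[0] - 0.05)
xs = [[0.1, 0.2], [0.9, 0.4], [0.5, 0.5]]
print(d_x(h, h_prime, xs))   # 0.05 (up to floating point) for this pair
```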

A result following from [14] is, in the above notation, as follows.

Theorem 2.6 Let $\delta \in (0,1)$ and suppose $P$ is a probability measure on $Z = [0,1]^n \times [C]$. With $P^m$-probability at least $1 - \delta$, $z \in Z^m$ will be such that we have the following: for all $\gamma \in (0,1)$ and for all $g \in G$,
$$R(g) \le R_{\gamma, z}(g) + \sqrt{\frac{2}{m}\left(\ln N(\gamma/4, \bar G, 2m) + \ln\frac{2}{\delta}\right)} + \frac{1}{m}.$$

We now use Theorem 2.6 with an appropriate choice of function space $G$. Let us fix $B \in \mathbb{N}$ and let $S = (S_1, \ldots, S_C)$ be, as before, $C$ unions of boxes, with $B$ boxes in total. Let $g^S$ be the function $X = [0,1]^n \to \mathbb{R}^C$ defined by $g^S = (g_k^S)_{k=1}^C$, where
$$g_k^S(x) = \frac{1}{C}\sum_{i=1}^C d(x, S_i) - d(x, S_k).$$
Let $G_B = \{g^S : h_S \in H_B\}$. These functions satisfy the constraint that their coordinate functions sum to the zero function, since
$$\sum_{k=1}^C g_k^S(x) = \sum_{k=1}^C\left(\frac{1}{C}\sum_{i=1}^C d(x, S_i) - d(x, S_k)\right) = \sum_{i=1}^C d(x, S_i) - \sum_{k=1}^C d(x, S_k) = 0.$$
For each $k$,
$$\bar g_k^S(x) = M(g^S(x), k) = \frac{1}{2}\left(g_k^S(x) - \max_{l \ne k} g_l^S(x)\right) = \frac{1}{2}\left(\frac{1}{C}\sum_{i=1}^C d(x, S_i) - d(x, S_k) - \max_{l \ne k}\left(\frac{1}{C}\sum_{i=1}^C d(x, S_i) - d(x, S_l)\right)\right) = \frac{1}{2}\left(\min_{l \ne k} d(x, S_l) - d(x, S_k)\right).$$
From the definition of $g^S$,
$$g_y^S(x) \le \max_{k \ne y} g_k^S(x) \iff \min_{k \ne y} d(x, S_k) \le d(x, S_y).$$
So it follows that $R(g^S) = \mathrm{er}_P(h_S)$. Similarly, $R_{\gamma, z}(g^S) \le E_z^{2\gamma}(h_S)$, since $\bar g_{y_i}^S(x_i) < \gamma$ precisely when $\min_{l \ne y_i} d(x_i, S_l) - d(x_i, S_{y_i}) < 2\gamma$.
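
The identity $\bar g_k^S(x) = \frac{1}{2}\bigl(\min_{l\ne k} d(x,S_l) - d(x,S_k)\bigr)$ derived above is easy to check numerically. The sketch below (illustrative; toy boxes and a toy point) computes $g^S$, applies the margin operator $M$, and compares the result with the direct formula.

```python
def d1_point_to_union(x, boxes):
    return min(
        sum(max(u_i - x_i, 0.0, x_i - v_i) for x_i, u_i, v_i in zip(x, u, v))
        for (u, v) in boxes
    )

def g_S(x, S):
    """g_k^S(x) = (1/C) * sum_i d(x, S_i) - d(x, S_k); the components sum to 0."""
    dists = [d1_point_to_union(x, S_k) for S_k in S]
    mean = sum(dists) / len(dists)
    return [mean - d for d in dists]

def g_bar(x, S):
    """Apply the margin operator M(v, k) = (v_k - max_{l != k} v_l) / 2."""
    v = g_S(x, S)
    return [
        0.5 * (v[k] - max(v[l] for l in range(len(v)) if l != k))
        for k in range(len(v))
    ]

# Toy data: three classes, one box each.
S = [[([0.0, 0.0], [0.2, 0.2])], [([0.7, 0.7], [0.9, 0.9])], [([0.4, 0.0], [0.6, 0.2])]]
x = [0.3, 0.5]
dists = [d1_point_to_union(x, S_k) for S_k in S]
direct = [0.5 * (min(d for i, d in enumerate(dists) if i != k) - dists[k])
          for k in range(len(S))]
print(g_bar(x, S))
print(direct)   # the two lists agree (up to floating point)
```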

By bounding the covering numbers of our class $\bar G_B$ of functions, and then removing the restriction that $B$ be specified in advance, we obtain the following result.

Theorem 2.7 Let $\delta \in (0,1)$ and suppose $P$ is a probability measure on $Z = [0,1]^n \times [C]$. With $P^m$-probability at least $1 - \delta$, $z \in Z^m$ will be such that we have the following: for all $B$ and for all $\gamma \in (0,1)$, for all $h_S \in H_B$,
$$\mathrm{er}_P(h_S) \le E_z^{2\gamma}(h_S) + \sqrt{\frac{2}{m}\left(2nB\ln\frac{6n}{\gamma} + 2B\ln C + \ln\frac{2}{\delta}\right)} + \frac{1}{m}.$$

2.4 Proof of Theorem 2.7

As the preceding discussion makes clear, the functions in $\bar G_B$ are all of the form $(\bar g_1^S, \ldots, \bar g_C^S)$, where
$$\bar g_k^S(x) = \frac{1}{2}\left(\min_{l \ne k} d(x, S_l) - d(x, S_k)\right).$$
Here $S$, as before, involves $B$ boxes. To simplify, we will bound the covering numbers of $2\bar G_B$, noting that an $\alpha$-cover of $2\bar G_B$ is an $\alpha/2$-cover of $\bar G_B$. So, consider the class $F = F_B = \{f^S : h_S \in H_B\}$, where $f^S = 2\bar g^S$; that is, $f^S : [0,1]^n \to \mathbb{R}^C$ is given by
$$f_k^S(x) = \min_{l \ne k} d(x, S_l) - d(x, S_k).$$
We use some of the same notation and ideas as developed in the proof of Theorem 2.2. We define $\hat F$ to be the subset of $F = F_B$ in which each $S_k$ is of the form
$$S_k = \bigcup_{j=1}^{B_k} I(\hat u(k,j), \hat v(k,j)),$$
where $\hat u(k,j), \hat v(k,j) \in A^n$. We will show that $\hat F$ is a $\gamma/2$-cover for $F_B$, and hence a $\gamma/4$-cover for $\bar G_B$.

Suppose $f^S \in F_B$, where $S = (S_1, S_2, \ldots, S_C)$. Suppose $\hat u(k,j), \hat v(k,j) \in A^n$ are chosen as in the proof of Theorem 2.2, and that $\hat S_k = \bigcup_{j=1}^{B_k} I(\hat u(k,j), \hat v(k,j))$ and $\hat S = (\hat S_1, \ldots, \hat S_C)$. Let $\hat f^S$ be the corresponding function in $\hat F_B$: $\hat f^S = f^{\hat S}$. As argued in the proof of Theorem 2.2, for all $k$ and all $x$, $|d(x, S_k) - d(x, \hat S_k)| \le \gamma/4$. For any $k$ and any $x$,
$$|f_k^S(x) - \hat f_k^S(x)| = \left|\min_{l \ne k} d(x, S_l) - d(x, S_k) - \min_{l \ne k} d(x, \hat S_l) + d(x, \hat S_k)\right| \le \left|\min_{l \ne k} d(x, S_l) - \min_{l \ne k} d(x, \hat S_l)\right| + \left|d(x, \hat S_k) - d(x, S_k)\right|.$$
The second term in this last line is bounded by $\gamma/4$, by the observation just made. Consider the first term. Suppose $\min_{l \ne k} d(x, S_l) = d(x, S_r)$.

Then, since $d(x, S_r) \ge d(x, \hat S_r) - \gamma/4$, it follows that
$$\min_{l \ne k} d(x, \hat S_l) \le d(x, \hat S_r) \le d(x, S_r) + \frac{\gamma}{4} = \min_{l \ne k} d(x, S_l) + \frac{\gamma}{4}.$$
Similarly, $\min_{l \ne k} d(x, S_l) \le \min_{l \ne k} d(x, \hat S_l) + \gamma/4$. So,
$$\left|\min_{l \ne k} d(x, S_l) - \min_{l \ne k} d(x, \hat S_l)\right| \le \frac{\gamma}{4}.$$
Therefore, $|f_k^S(x) - \hat f_k^S(x)| \le \gamma/4 + \gamma/4 = \gamma/2$. This establishes that $\hat F_B$ is a $\gamma/2$-cover of $F_B$ with respect to the supremum metric on functions $X \to \mathbb{R}^C$, meaning that for every $f^S \in F_B$ there is $\hat f^S \in \hat F_B$ with
$$\sup_{x \in X} \|f^S(x) - \hat f^S(x)\|_\infty \le \frac{\gamma}{2}.$$
In particular, therefore, for any $m$ and for any $x \in X^m$, $\hat F_B$ is a $\gamma/2$-cover for $F_B$ with respect to the $d_x$ metric. We now see that the covering number $N(\gamma/4, \bar G_B, m)$ is, for all $m$, bounded above by $|\hat F_B|$. Bounding this cardinality as in the proof of Theorem 2.2 gives
$$N(\gamma/4, \bar G_B, m) \le \left(\frac{6n}{\gamma}\right)^{2nB} C^B.$$
Taking $\delta/2^B$ in place of $\delta$, using the covering number bound now obtained, and noting that $R(g^S) = \mathrm{er}_P(h_S)$ and $R_{\gamma,z}(g^S) \le E_z^{2\gamma}(h_S)$, Theorem 2.6 shows that, with probability at least $1 - \delta/2^B$, for all $\gamma \in (0,1)$, if $h_S \in H_B$, then
$$\mathrm{er}_P(h_S) \le E_z^{2\gamma}(h_S) + \sqrt{\frac{2}{m}\left(2nB\ln\frac{6n}{\gamma} + 2B\ln C + \ln\frac{2}{\delta}\right)} + \frac{1}{m}.$$
We have used $B\ln 2 \le B\ln C$. So, with probability at least $1 - \delta$, for all $B$, this bound holds, completing the proof.

Acknowledgements

This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence. Part of the work was carried out while Martin Anthony was a visitor at RUTCOR, the Rutgers Center for Operations Research.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.

[2] M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability and Computing, 9:213–225, 2000.

[3] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[4] M. Anthony and N. Biggs. Computational Learning Theory. Cambridge Tracts in Theoretical Computer Science 30. Cambridge University Press, 1992; reprinted 1997.

[5] M. Anthony and J. Ratsaby. The performance of a new hybrid classifier based on boxes and nearest neighbors. In International Symposium on Artificial Intelligence and Mathematics, January 2012. Also RUTCOR Research Report RRR 17-2011, Rutgers University, 2011.

[6] M. Anthony and J. Ratsaby. Robust cutpoints in the logical analysis of numerical data. Discrete Applied Mathematics, 160, 2012.

[7] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.

[8] Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.

[9] Endre Boros, Peter L. Hammer, Toshihide Ibaraki, and Alexander Kogan. Logical analysis of numerical data. Mathematical Programming, 79, 1997.

[10] Endre Boros, Peter L. Hammer, Toshihide Ibaraki, Alexander Kogan, Eddy Mayoraz, and Ilya Muchnik. An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering, 12(2):292–306, 2000.

[11] N. H. Bshouty, P. W. Goldberg, S. A. Goldman, and H. D. Mathias. Exact learning of discretized geometric concepts. SIAM Journal on Computing, 28(2), 1998.

[12] R. M. Dudley. Central limit theorems for empirical measures. Annals of Probability, 6(6):899–929, 1978.

[13] Giovanni Felici, Bruno Simeone, and Vincenzo Spinelli. Classification techniques and error control in logic mining. In Data Mining 2010, volume 8 of Annals of Information Systems, 2010.

[14] Yann Guermeur. VC theory of large margin multi-category classifiers. Journal of Machine Learning Research, 8, 2007.

[15] P. L. Hammer, Y. Liu, S. Szedmák, and B. Simeone. Saturated systems of homogeneous boxes and the logical analysis of numerical data. Discrete Applied Mathematics, 2004.

[16] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, September 1992.

[17] David Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.

[18] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[19] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

[20] Vladimir N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
