The information-theoretic value of unlabeled data in semi-supervised learning


Alexander Golovnev    Dávid Pál    Balázs Szörényi

Harvard University, Cambridge, MA, USA · Yahoo Research, New York, NY, USA

January 5, 2019

Abstract

We quantify the separation between the numbers of labeled examples required to learn in two settings: settings with and without the knowledge of the distribution of the unlabeled data. More specifically, we prove a separation by a Θ(log n) multiplicative factor for the class of projections over the Boolean hypercube of dimension n. We prove that there is no separation for the class of all functions on a domain of any size. Learning with the knowledge of the distribution (a.k.a. fixed-distribution learning) can be viewed as an idealized scenario of semi-supervised learning where the number of unlabeled data points is so great that the unlabeled distribution is known exactly. For this reason, we call the separation the value of unlabeled data.

1 Introduction

Hanneke [2016] showed that for any class C of Vapnik-Chervonenkis dimension d there exists an algorithm that ε-learns any target function from C under any distribution from O((d + log(1/δ))/ε) labeled examples with probability at least 1 − δ. For this paper, it is important to stress that Hanneke's algorithm does not receive the distribution of unlabeled data as input.

On the other hand, Benedek and Itai [1991] showed that for any class C and any distribution there exists an algorithm that ε-learns any target from C from O((log N_{ε/2} + log(1/δ))/ε) labeled examples with probability at least 1 − δ, where N_{ε/2} is the size of an ε/2-cover of C with respect to the disagreement metric d(f, g) = Pr[f(x) ≠ g(x)]. Here, it is important to note that Benedek and Itai construct for each distribution a separate algorithm. In other words, they construct a family of algorithms indexed by the (uncountably many) distributions over the domain. Alternatively, we can think of Benedek-Itai's family of algorithms as a single algorithm that receives the distribution as an input.

It is known that N_{ε/2} ≤ O(1/ε)^{O(d)}; see Dudley [1978]. Thus, ignoring a log(1/ε) factor, the Benedek-Itai bound is never worse than Hanneke's bound.

As we already mentioned, Benedek-Itai's algorithm receives as input the distribution of unlabeled data. The algorithm uses it to construct an ε-cover. Unsurprisingly, there exist distributions which have a small ε-cover, and thus the sample complexity of Benedek-Itai's algorithm on such distributions is significantly lower than Hanneke's bound. For instance, a distribution concentrated on a single point has an ε-cover of size 1 for any positive ε.

However, an algorithm does not need to receive the unlabeled distribution in order to enjoy low sample complexity. For example, the empirical risk minimization (ERM) algorithm needs significantly fewer labeled examples to learn any target under some unlabeled distributions. For instance, if the distribution is concentrated on a single point, ERM needs only one labeled example to learn any target.

One could be led to believe that there exists an algorithm that does not receive the unlabeled distribution as input and achieves the Benedek-Itai bound (or a slightly worse bound) for every distribution. In fact, one could think that ERM or Hanneke's algorithm could be such an algorithm. If ERM, Hanneke's algorithm, or some other distribution-independent algorithm had sample complexity that matches (or nearly matches) the optimal distribution-specific sample complexity for every distribution, we could conclude that the knowledge of the unlabeled data distribution is completely useless. As Darnstädt et al. [2013] showed, this is not the case. They showed that any algorithm for learning projections over {0, 1}^n that does not receive the unlabeled distribution as input requires, for some unlabeled data distributions, more labeled examples than the Benedek-Itai bound. However, they did not quantify this gap beyond stating that it grows without bound as n goes to infinity.

In this paper, we quantify the gap by showing that any distribution-independent algorithm for learning the class of projections over {0, 1}^n requires, for some unlabeled distributions, Ω(log n) times as many labeled examples as the Benedek-Itai bound. Darnstädt et al. [2013] showed that the gap for any class with Vapnik-Chervonenkis dimension d is at most O(d). It is well known that the Vapnik-Chervonenkis dimension of projections over {0, 1}^n is Θ(log n). Thus our lower bound matches the upper bound O(d). To better understand the relationship of the upper and lower bounds, we illustrate the situation for the class of projections over {0, 1}^n in Figure 1.

In contrast, we show that for the class of all functions (on any domain) there is no gap between the two settings. In other words, for learning a target from the class of all functions, unlabeled data are in fact useless. This illustrates the point that the gap depends in a non-trivial way on the combinatorial structure of the function class rather than just on the Vapnik-Chervonenkis dimension.

The paper is organized as follows. In Section 2 we review prior work. Section 3 gives the necessary definitions and basic probabilistic tools. In Section 4 we give the proof of the separation result for projections. In Section 5 we prove that there is no gap for the class of all functions. For completeness, in Appendix A we give a proof of a simple upper bound (2e/ε)^d on the size of the minimum ε-cover, and in Appendix B we give a proof of Benedek-Itai's O((log N_{ε/2} + log(1/δ))/ε) sample complexity upper bound.

2 Related work

The question of whether knowledge of the unlabeled data distribution helps was proposed and initially studied by Ben-David et al. [2008]; see also Lu [2009]. However, they considered only classes with Vapnik-Chervonenkis dimension at most 1, or classes with Vapnik-Chervonenkis dimension d but only distributions for which the size of the ε-cover is Θ(1/ε)^{Θ(d)}, i.e. the ε-cover is as large as it can be. In these settings, for constant ε and δ, the separation of labeled sample complexities is at most a constant factor, which is exactly what Ben-David et al. [2008] proved.
(For any concept class with Vapnik-Chervonenkis dimension d and any distribution, the size of the smallest ε-cover is at most O(1/ε)^{O(d)}. In fact, as we show in Appendix A, the size of the smallest ε-cover is at most (2e/ε)^d.)

[Figure 1: plot omitted. Vertical axis: labeled sample complexity; horizontal axis: distributions over {0, 1}^n; curves shown: an algorithm that does not know the distribution, fixed-distribution learning, the Θ(log n) level, and a distribution on a shattered set.]

Figure 1: The graph shows sample complexity bounds of learning the class of projections over the domain {0, 1}^n under various unlabeled distributions. We assume that ε and δ are constant, say, ε = δ = 1/100. The graph shows three lines. The red horizontal line is Hanneke's bound for the class of projections, which is Θ(VC(C_n)) = Θ(log n). The green line is the Benedek-Itai bound. The green line touches the red line for certain distributions, but is lower for other distributions. In particular, for certain distributions the green line is O(1). The dashed line corresponds to a particular distribution on a shattered set. This is where the green line and the red line touch. Furthermore, here the upper bound coincides with the lower bound for that particular distribution. The black line is the sample complexity of an arbitrary distribution-independent algorithm. For example, the reader can think of ERM or Hanneke's algorithm. We prove that there exists a distribution where the black line is Ω(log n) times higher than the green line. This separation is indicated by the double arrow.

In these settings, unlabeled data are indeed useless. However, these results say nothing about distributions with an ε-cover of small size, and they ignore the dependency on the Vapnik-Chervonenkis dimension.

The question was studied in earnest by Darnstädt et al. [2013], who showed two major results. First, they show that for any non-trivial concept class C and for every distribution, the ratio of the labeled sample complexities between distribution-independent and distribution-dependent algorithms is bounded by the Vapnik-Chervonenkis dimension. Second, they show that for the class of projections over {0, 1}^n, there are distributions where the ratio grows to infinity as a function of n.

In learning theory, the disagreement metric and the ε-cover were introduced by Benedek and Itai [1991], but the ideas are much older; see e.g. Dudley [1978, 1984]. The O(1/ε)^{O(d)} upper bound on the size of the smallest ε-cover is by Dudley [1978, Lemma 7.13]; see also Devroye and Lugosi [2001, Chapter 4] and Haussler [1995].

For any distribution-independent algorithm, any class C of Vapnik-Chervonenkis dimension d, and any ε ∈ (0, 1) and δ ∈ (0, 1), there exists a distribution over the domain and a concept which requires at least Ω((d + log(1/δ))/ε) labeled examples to ε-learn with probability at least 1 − δ; see Anthony and Bartlett [1999, Theorem 5.3] and Blumer et al. [1989], Ehrenfeucht et al. [1989]. The proof of the lower bound constructs a distribution that does not depend on the algorithm. The distribution is a particular distribution over a fixed set shattered by C. So even an algorithm that knows the distribution requires Ω((d + log(1/δ))/ε) labeled examples.

3 Preliminaries

Let X be a non-empty set. We denote by {0, 1}^X the class of all functions from X to {0, 1}. A concept class over a domain X is a subset C ⊆ {0, 1}^X. A labeled example is a pair (x, y) ∈ X × {0, 1}.

A distribution-independent learning algorithm is a function A : ⋃_{m ≥ 0} (X × {0, 1})^m → {0, 1}^X. In other words, the algorithm gets as input a sequence of labeled examples (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) and outputs a function from X to {0, 1}. We allow the algorithm to output a function that does not belong to C, i.e., the algorithm can be improper. A distribution-dependent algorithm is a function that maps any probability distribution over X to a distribution-independent algorithm.

Let P be a probability distribution over a domain X. For any two functions f : X → {0, 1} and g : X → {0, 1} we define the disagreement pseudo-metric

    d_P(f, g) = Pr_{X∼P}[ f(X) ≠ g(X) ].

Let C be a concept class over X, let c ∈ C, and let ε, δ ∈ (0, 1). Let X_1, X_2, ..., X_m be an i.i.d. sample from P. We define the corresponding labeled sample

    T = ((X_1, c(X_1)), (X_2, c(X_2)), ..., (X_m, c(X_m))).

We say that an algorithm A ε-learns a target c from m samples with probability at least 1 − δ if

    Pr[ d_P(c, A(T)) ≤ ε ] ≥ 1 − δ.

The smallest non-negative integer m such that for any target c ∈ C the algorithm A ε-learns the target c from m samples with probability at least 1 − δ is denoted by m(A, C, P, ε, δ).

We recall the standard definitions from learning theory. For any concept c : X → {0, 1} and any S ⊆ X we define π(c, S) = {x ∈ S : c(x) = 1}. In other words, π(c, S) is the set of examples in S which c labels 1. A set S ⊆ X is shattered by a concept class C if for any subset S' ⊆ S there exists a classifier c ∈ C such that π(c, S) = S'. The Vapnik-Chervonenkis dimension of a concept class C is the size of the largest set S ⊆ X shattered by C. A subset C' of a concept class C is an ε-cover of C for a probability distribution P if for any c ∈ C there exists c' ∈ C' such that d_P(c, c') ≤ ε.

To prove our lower bounds we need three general probabilistic results. The first one is the standard Hoeffding bound. The other two are simple and intuitive propositions. The first proposition says that if the average error d_P(c, A(T)) is high, the algorithm fails to ε-learn with high probability. The second proposition says that the best algorithm for predicting a bit based on some side information is to compute the conditional expectation of the bit and threshold it at 1/2.

Theorem 1 (Hoeffding bound). Let X_1, X_2, ..., X_n be i.i.d. random variables that lie in an interval [a, b] with probability one and let p = (1/n) ∑_{i=1}^n E[X_i]. Then, for any t ≥ 0,

    Pr[ (1/n) ∑_{i=1}^n X_i ≥ p + t ] ≤ e^{−2nt²/(a−b)²},
    Pr[ (1/n) ∑_{i=1}^n X_i ≤ p − t ] ≤ e^{−2nt²/(a−b)²}.
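To make these definitions concrete, here is a minimal Python sketch (not part of the paper) that estimates the disagreement pseudo-metric d_P(f, g) from an i.i.d. sample and reports the two-sided confidence radius given by Theorem 1. The particular distribution P, the functions f and g, and all numeric parameters are illustrative assumptions.

```python
import math
import random

def disagreement_estimate(f, g, sample_p, m, delta):
    """Estimate d_P(f, g) = Pr_{X~P}[f(X) != g(X)] from m i.i.d. draws of sample_p().

    Returns (estimate, radius): by Hoeffding's bound applied to the 0/1 indicators
    of disagreement, the true d_P(f, g) lies within estimate +/- radius with
    probability at least 1 - delta.
    """
    xs = [sample_p() for _ in range(m)]
    estimate = sum(1 for x in xs if f(x) != g(x)) / m
    # Solve 2 * exp(-2 * m * t^2) = delta for t (the indicators lie in [0, 1]).
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * m))
    return estimate, radius

if __name__ == "__main__":
    random.seed(0)
    n = 20
    # P: product distribution on {0,1}^n, each bit equal to 1 with probability 0.3.
    sample_p = lambda: tuple(int(random.random() < 0.3) for _ in range(n))
    f = lambda x: x[0]   # projection onto the first coordinate
    g = lambda x: x[1]   # projection onto the second coordinate
    est, rad = disagreement_estimate(f, g, sample_p, m=20000, delta=0.01)
    print(f"estimate = {est:.3f} +/- {rad:.3f} (true value 2*0.3*0.7 = 0.42)")
```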

Proposition 2 (Error probability vs. expected error). Let Z be a random variable such that 0 ≤ Z ≤ 1 with probability one. Then, for any t ∈ [0, 1),

    Pr[Z > t] ≥ (E[Z] − t) / (1 − t).

Proof. We have

    E[Z] ≤ t · Pr[Z ≤ t] + 1 · Pr[Z > t] = t (1 − Pr[Z > t]) + Pr[Z > t].

Solving for Pr[Z > t] finishes the proof.

Proposition 3 (Predicting a single bit). Let 𝒰 be a finite non-empty set. Let U, V be random variables (possibly correlated) such that U ∈ 𝒰 and V ∈ {0, 1} with probability one. Let f : 𝒰 → {0, 1} be a predictor. Then,

    Pr[ f(U) ≠ V ] ≥ ∑_{u ∈ 𝒰} ( 1/2 − | 1/2 − E[V | U = u] | ) Pr[U = u].

Proof. We have

    Pr[ f(U) ≠ V ] = ∑_{u ∈ 𝒰} Pr[ f(U) ≠ V | U = u ] Pr[U = u].

It remains to show that

    Pr[ f(U) ≠ V | U = u ] ≥ 1/2 − | 1/2 − E[V | U = u] |.

Since, if U = u, the value f(U) = f(u) is fixed, we have

    Pr[ f(U) ≠ V | U = u ] ≥ min{ Pr[V = 1 | U = u], Pr[V = 0 | U = u] }
                            = min{ E[V | U = u], 1 − E[V | U = u] }
                            = 1/2 − | 1/2 − E[V | U = u] |.

We used the fact that min{x, 1 − x} = 1/2 − |1/2 − x| for all x ∈ ℝ, which can be easily verified by considering the two cases x ≥ 1/2 and x < 1/2.

4 Projections

In this section, we denote by C_n the class of projections over the domain X = {0, 1}^n. The class C_n consists of n functions c_1, c_2, ..., c_n from {0, 1}^n to {0, 1}. For any i ∈ {1, 2, ..., n} and any x ∈ {0, 1}^n, the function c_i is defined as c_i((x[1], x[2], ..., x[n])) = x[i].

For any ε ∈ (0, 1/2) and n ≥ 2, we consider a family P_{n,ε} consisting of n probability distributions P_1, P_2, ..., P_n over the Boolean hypercube {0, 1}^n. In order to describe the distribution P_i, for some i, consider a random vector X = (X[1], X[2], ..., X[n]) drawn from P_i. The distribution P_i is a product distribution, i.e., Pr[X = x] = ∏_{j=1}^n Pr[X[j] = x[j]] for any x ∈ {0, 1}^n. The marginal distributions of the coordinates are

    Pr[X[j] = 1] = 1/2   if j = i,      Pr[X[j] = 1] = ε   if j ≠ i,      for j = 1, 2, ..., n.

The reader should think of ε as a small constant that does not depend on n, say, ε = 1/100.

Proposition 4. The Vapnik-Chervonenkis dimension of C_n is ⌊log₂ n⌋.

Proof. Let us denote the Vapnik-Chervonenkis dimension by d. Recall that d is the size of the largest shattered set. Let S be any shattered set of size d. Then there must be at least 2^d distinct functions in C_n. Hence, d ≤ log₂ |C_n| = log₂ n. Since d is an integer, we conclude that d ≤ ⌊log₂ n⌋.

On the other hand, we construct a shattered set of size ⌊log₂ n⌋. The set consists of points x_1, x_2, ..., x_{⌊log₂ n⌋} ∈ {0, 1}^n. For any i ∈ {1, 2, ..., ⌊log₂ n⌋} and any j ∈ {0, 1, 2, ..., n − 1}, we define x_i[j + 1] to be the i-th bit in the binary representation of the number j. (The bit at position 1 is the least significant bit.) It is not hard to see that for any v ∈ {0, 1}^{⌊log₂ n⌋} there exists c ∈ C_n such that v = (c(x_1), c(x_2), ..., c(x_{⌊log₂ n⌋})). Indeed, given v, let k ∈ {0, 1, ..., 2^{⌊log₂ n⌋} − 1} be the number with binary representation v; then we can take c = c_{k+1}.
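The shattered set constructed in the proof of Proposition 4 is easy to verify by brute force. The following Python sketch (not part of the paper; the value of n is an arbitrary choice) builds the points x_1, ..., x_{⌊log₂ n⌋} from the binary representations and checks that the projections realize every labeling.

```python
from itertools import product

def shattered_set(n):
    """Points x_1, ..., x_d with d = floor(log2 n), where coordinate j+1 of x_i
    is the i-th bit of j (bit 1 = least significant), as in Proposition 4."""
    d = n.bit_length() - 1          # floor(log2 n)
    return [tuple((j >> (i - 1)) & 1 for j in range(n)) for i in range(1, d + 1)]

def is_shattered(points, n):
    """Check that every labeling of `points` is realized by some projection c_k."""
    realized = {tuple(x[k] for x in points) for k in range(n)}
    return all(v in realized for v in product((0, 1), repeat=len(points)))

if __name__ == "__main__":
    n = 37
    pts = shattered_set(n)
    print(len(pts), "points, shattered:", is_shattered(pts, n))   # 5 points, True
```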

Lemma 5 (Small cover). Let n ≥ 2 and ε ∈ (0, 1/2). Any distribution in P_{n,ε} has a 2ε-cover of size 2.

Proof. Consider a distribution P_i ∈ P_{n,ε} for some i ∈ {1, 2, ..., n}. Let j be an arbitrary index in {1, 2, ..., n} \ {i}. Consider the projections c_i, c_j ∈ C_n. We claim that C' = {c_i, c_j} is a 2ε-cover of C_n.

To see that C' is a 2ε-cover of C_n, consider any c_k ∈ C_n. We need to show that d_{P_i}(c_i, c_k) ≤ 2ε or d_{P_i}(c_j, c_k) ≤ 2ε. If k = i or k = j, the condition is trivially satisfied. Consider k ∈ {1, 2, ..., n} \ {i, j}. Let X ∼ P_i. Then,

    d_{P_i}(c_j, c_k) = Pr[ c_j(X) ≠ c_k(X) ]
                      = Pr[ c_j(X) = 1, c_k(X) = 0 ] + Pr[ c_j(X) = 0, c_k(X) = 1 ]
                      = Pr[ X[j] = 1, X[k] = 0 ] + Pr[ X[j] = 0, X[k] = 1 ]
                      = Pr[ X[j] = 1 ] Pr[ X[k] = 0 ] + Pr[ X[j] = 0 ] Pr[ X[k] = 1 ]
                      = 2ε(1 − ε) < 2ε.

Using the Benedek-Itai bound (Theorem 12 in Appendix B) we obtain the corollary below. The corollary states that the distribution-dependent sample complexity of learning a target in C_n under any distribution from P_{n,ε} does not depend on n.

Corollary 6 (Learning with knowledge of the distribution). Let n ≥ 2 and ε ∈ (0, 1/2). There exists a distribution-dependent algorithm such that for any distribution from P_{n,ε}, any δ ∈ (0, 1), and any target function c ∈ C_n, if the algorithm gets

    m ≥ 12 ln(2/δ) / ε

labeled examples, then with probability at least 1 − δ it 4ε-learns the target.
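To illustrate Lemma 5 and Corollary 6, here is a small Python sketch (not from the paper). It evaluates the exact pairwise disagreements under P_i, confirming that {c_i, c_j} is a 2ε-cover, and runs the cover-based learner (choose the member of the cover with the smaller empirical error) on m = ⌈12 ln(2/δ)/ε⌉ labeled examples. The concrete values of n, ε, δ, the indices, and the random seed are arbitrary choices.

```python
import math
import random

def d_Pi(i, a, b, eps):
    """Exact disagreement d_{P_i}(c_a, c_b): coordinate i is 1 w.p. 1/2,
    every other coordinate is 1 w.p. eps, independently."""
    if a == b:
        return 0.0
    p = lambda j: 0.5 if j == i else eps
    return p(a) * (1 - p(b)) + (1 - p(a)) * p(b)

def sample_Pi(i, n, eps, rng):
    return tuple(int(rng.random() < (0.5 if j == i else eps)) for j in range(n))

def cover_learner(cover, sample):
    """Pick the index in `cover` with the smallest empirical error on the sample."""
    return min(cover, key=lambda k: sum(1 for x, y in sample if x[k] != y))

if __name__ == "__main__":
    rng = random.Random(1)
    n, eps, delta = 200, 0.01, 0.05
    i, j, k = 7, 8, 9                      # target c_i, cover {c_i, c_j}, a third index k
    print("d_{P_i}(c_j, c_k) =", d_Pi(i, j, k, eps), "< 2*eps =", 2 * eps)
    print("d_{P_i}(c_i, c_k) =", d_Pi(i, i, k, eps))           # equals 1/2
    m = math.ceil(12 * math.log(2 / delta) / eps)               # sample size of Corollary 6
    sample = [(x, x[i]) for x in (sample_Pi(i, n, eps, rng) for _ in range(m))]
    h = cover_learner([i, j], sample)
    print("m =", m, " learner picked index", h, " true error =", d_Pi(i, i, h, eps))
```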

The next theorem states that without knowing the distribution, learning a target under a distribution from P_{n,ε} requires at least Ω(log n) labeled examples.

Theorem 7 (Learning without knowledge of the distribution). For any distribution-independent algorithm, any ε ∈ (0, 1/4) and any n ≥ 1600/ε³, there exists a distribution P ∈ P_{n,ε} and a target concept c ∈ C_n such that if the algorithm gets

    m ≤ ln n / (3 ln(1/ε))

labeled examples, it fails to (1/16)-learn the target concept with probability more than 1/16.

Proof. Let A be any learning algorithm. For ease of notation, we formalize it as a function

    A : ⋃_{m ≥ 0} ( {0, 1}^{m×n} × {0, 1}^m ) → {0, 1}^{{0,1}^n}.

The algorithm receives an m × n matrix and a binary vector of length m. The rows of the matrix correspond to unlabeled examples and the vector encodes the labels. The output of A is a function from {0, 1}^n to {0, 1}.

We demonstrate the existence of a pair (P, c) ∈ P_{n,ε} × C_n which cannot be learned from m samples by the probabilistic method. Let I be chosen uniformly at random from {1, 2, ..., n}. We consider the distribution P_I ∈ P_{n,ε} and the target c_I ∈ C_n. Let X_1, X_2, ..., X_m be an i.i.d. sample from P_I and let Y_1 = c_I(X_1), Y_2 = c_I(X_2), ..., Y_m = c_I(X_m) be the target labels. Let X be the m × n matrix with entries X_i[j] and let Y = (Y_1, Y_2, ..., Y_m) be the vector of labels. The output of the algorithm is A(X, Y). We will show that

    E[ d_{P_I}(c_I, A(X, Y)) ] ≥ 1/8.                                                              (1)

This means that there exists i ∈ {1, 2, ..., n} such that

    E[ d_{P_i}(c_i, A(X, Y)) | I = i ] ≥ 1/8.

By Proposition 2,

    Pr[ d_{P_i}(c_i, A(X, Y)) > 1/16 | I = i ] > 1/16.

It remains to prove (1). Let Z be a test sample drawn from P_I. That is, conditioned on I, the sequence X_1, X_2, ..., X_m, Z is i.i.d. drawn from P_I. Then, by Proposition 3,

    E[ d_{P_I}(c_I, A(X, Y)) ]
      = Pr[ A(X, Y)(Z) ≠ c_I(Z) ]
      ≥ ∑_{x ∈ {0,1}^{m×n}} ∑_{y ∈ {0,1}^m} ∑_{z ∈ {0,1}^n} ( 1/2 − | 1/2 − E[c_I(Z) | X = x, Y = y, Z = z] | ) Pr[X = x, Y = y, Z = z].    (2)

We need to compute E[c_I(Z) | X = x, Y = y, Z = z]. For that we need some additional notation. For any matrix x ∈ {0, 1}^{m×n}, let x[1], x[2], ..., x[n] be its columns. For any matrix x ∈ {0, 1}^{m×n} and any vector y ∈ {0, 1}^m, let

    k(x, y) = { i ∈ {1, 2, ..., n} : x[i] = y }

be the set of indices of the columns of x equal to the vector y. Also, we define ∥·∥ to be the sum of absolute values of the entries of a vector or a matrix. (Since we use ∥·∥ only for binary matrices and binary vectors, it is just the number of ones.) For any i ∈ {1, 2, ..., n},

    Pr[I = i, X = x, Y = y] = (1/n) (1/2)^m ε^{∥x∥ − ∥y∥} (1 − ε)^{m(n−1) − ∥x∥ + ∥y∥}   if i ∈ k(x, y),
    Pr[I = i, X = x, Y = y] = 0                                                            if i ∉ k(x, y).

Therefore, for any i ∈ {1, 2, ..., n},

    Pr[I = i | X = x, Y = y] = Pr[I = i, X = x, Y = y] / Pr[X = x, Y = y]
                             = Pr[I = i, X = x, Y = y] / ∑_{j ∈ k(x,y)} Pr[I = j, X = x, Y = y]
                             = 1/|k(x, y)|    if i ∈ k(x, y),
                             = 0              if i ∉ k(x, y).

Conditioned on I, the variables Z and (X, Y) are independent. Thus, for any z ∈ {0, 1}^n and i = 1, 2, ..., n,

    Pr[Z = z | I = i, X = x, Y = y] = Pr[Z = z | I = i]
        = (1/2) ε^{∥z∥ − 1} (1 − ε)^{n − ∥z∥}      if z[i] = 1,
        = (1/2) ε^{∥z∥} (1 − ε)^{n − 1 − ∥z∥}      if z[i] = 0.

This allows us to compute the conditional probability

    Pr[I = i, Z = z | X = x, Y = y] = Pr[Z = z | I = i, X = x, Y = y] · Pr[I = i | X = x, Y = y]
        = ε^{∥z∥ − 1} (1 − ε)^{n − ∥z∥} / (2 |k(x, y)|)      if i ∈ k(x, y) and z[i] = 1,
        = ε^{∥z∥} (1 − ε)^{n − 1 − ∥z∥} / (2 |k(x, y)|)      if i ∈ k(x, y) and z[i] = 0,
        = 0                                                   if i ∉ k(x, y).

For any z ∈ {0, 1}^n, let

    s(x, y, z) = | { i ∈ k(x, y) : z[i] = 1 } |

and note that s(x, y, z) ≤ |k(x, y)|. Then,

    Pr[Z = z | X = x, Y = y]
      = ∑_{i=1}^n Pr[Z = z, I = i | X = x, Y = y]
      = ∑_{i ∈ k(x,y), z[i]=1} Pr[Z = z, I = i | X = x, Y = y] + ∑_{i ∈ k(x,y), z[i]=0} Pr[Z = z, I = i | X = x, Y = y]
      = ( s(x, y, z) ε^{∥z∥ − 1} (1 − ε)^{n − ∥z∥} + (|k(x, y)| − s(x, y, z)) ε^{∥z∥} (1 − ε)^{n − 1 − ∥z∥} ) / (2 |k(x, y)|)
      = ε^{∥z∥ − 1} (1 − ε)^{n − 1 − ∥z∥} ( s(x, y, z)(1 − ε) + (|k(x, y)| − s(x, y, z)) ε ) / (2 |k(x, y)|).

Hence,

    E[c_I(Z) | X = x, Y = y, Z = z]
      = Pr[ Z[I] = 1 | X = x, Y = y, Z = z ]
      = Pr[ Z[I] = 1, Z = z | X = x, Y = y ] / Pr[ Z = z | X = x, Y = y ]
      = ( ∑_{i=1}^n Pr[ I = i, Z[i] = 1, Z = z | X = x, Y = y ] ) / Pr[ Z = z | X = x, Y = y ]
      = s(x, y, z) ε^{∥z∥ − 1} (1 − ε)^{n − ∥z∥} / ( s(x, y, z) ε^{∥z∥ − 1} (1 − ε)^{n − ∥z∥} + (|k(x, y)| − s(x, y, z)) ε^{∥z∥} (1 − ε)^{n − 1 − ∥z∥} )
      = (1 − ε) s(x, y, z) / ( ε |k(x, y)| + (1 − 2ε) s(x, y, z) ).

We now show that the last expression is close to 1/2. It is easy to check that

    ε |k(x, y)| / s(x, y, z) ∈ [5/6, 2]   implies   (1 − ε) s(x, y, z) / ( ε |k(x, y)| + (1 − 2ε) s(x, y, z) ) ∈ [1/4, 3/4].

Indeed, writing the last expression as (1 − ε) / ( ε |k(x, y)| / s(x, y, z) + 1 − 2ε ) and using ε ∈ (0, 1/4), we have

    ε |k(x, y)| / s(x, y, z) + 1 − 2ε ≤ 2 + 1 ≤ 4 (1 − ε)

and

    ε |k(x, y)| / s(x, y, z) + 1 − 2ε ≥ 5/6 + 1/2 = 4/3 ≥ (4/3)(1 − ε).

We now substitute this into (2). We have

    ∑_{x, y, z} ( 1/2 − | 1/2 − E[c_I(Z) | X = x, Y = y, Z = z] | ) Pr[X = x, Y = y, Z = z]
      ≥ ∑_{x, y, z : ε|k(x,y)|/s(x,y,z) ∈ [5/6, 2]} ( 1/2 − | 1/2 − E[c_I(Z) | X = x, Y = y, Z = z] | ) Pr[X = x, Y = y, Z = z]
      ≥ ∑_{x, y, z : ε|k(x,y)|/s(x,y,z) ∈ [5/6, 2]} (1/4) Pr[X = x, Y = y, Z = z]
      = (1/4) Pr[ ε |k(X, Y)| / s(X, Y, Z) ∈ [5/6, 2] ].

In order to prove (1), we need to show that ε |k(X, Y)| / s(X, Y, Z) ∈ [5/6, 2] with probability at least 1/2. To that end, we define two additional random variables

    K = |k(X, Y)|   and   S = s(X, Y, Z).

The condition ε |k(X, Y)| / s(X, Y, Z) ∈ [5/6, 2] is equivalent to

    ε/2 ≤ S/K ≤ (6/5) ε.                                                                           (3)

First, we lower bound K. For any y ∈ {0, 1}^m and any i, j ∈ {1, 2, ..., n},

    Pr[ j ∈ k(X, Y) | Y = y, I = i ] = 1                               if j = i,
    Pr[ j ∈ k(X, Y) | Y = y, I = i ] = ε^{∥y∥} (1 − ε)^{m − ∥y∥}       if j ≠ i.

Conditioned on Y = y and I = i, the random variable K − 1 = |k(X, Y) \ {I}| is a sum of n − 1 Bernoulli random variables with parameter ε^{∥y∥}(1 − ε)^{m − ∥y∥}, one for each column except for column i. The Hoeffding bound with t = εᵐ/2 and the loose lower bound ε^{∥y∥}(1 − ε)^{m − ∥y∥} ≥ εᵐ give

    Pr[ (K − 1)/(n − 1) > εᵐ/2 | Y = y, I = i ]
      = Pr[ (K − 1)/(n − 1) > εᵐ − t | Y = y, I = i ]
      ≥ Pr[ (K − 1)/(n − 1) > ε^{∥y∥}(1 − ε)^{m − ∥y∥} − t | Y = y, I = i ]
      ≥ 1 − e^{−2(n−1)t²}.

Since m ≤ ln n / (3 ln(1/ε)), we lower bound t = εᵐ/2 as

    t = εᵐ/2 ≥ (1/2) ε^{ln n / (3 ln(1/ε))} = (1/2) n^{−1/3}.

Since the lower bound is uniform over all choices of y and i, we can remove the conditioning and conclude that

    Pr[ K ≥ 1 + (n − 1) εᵐ / 2 ] ≥ 1 − exp( −(n − 1) n^{−2/3} / 2 ).

For n ≥ 25, we can simplify this further to

    Pr[ K ≥ n^{2/3} / 2 ] ≥ 3/4.

Second, conditioned on K = r, the random variable S is a sum of r − 1 Bernoulli random variables with parameter ε and one Bernoulli random variable with parameter 1/2. The Hoeffding bound for any t ≥ 0 gives

    Pr[ | S/K − ((K − 1)ε + 1/2)/K | < t  |  K = r ] ≥ 1 − 2 e^{−2rt²}.

Thus,

    Pr[ | S/K − ((K − 1)ε + 1/2)/K | < t  and  K ≥ n^{2/3}/2 ]
      = ∑_{r ≥ n^{2/3}/2} Pr[ | S/K − ((K − 1)ε + 1/2)/K | < t | K = r ] Pr[K = r]
      ≥ ∑_{r ≥ n^{2/3}/2} ( 1 − 2 e^{−2rt²} ) Pr[K = r]
      ≥ ( 1 − 2 e^{−n^{2/3} t²} ) Pr[ K ≥ n^{2/3}/2 ].

We choose t = ε/6. Since n ≥ 1600/ε³, we have 2 e^{−n^{2/3} t²} < 1/8 and thus

    Pr[ | S/K − ((K − 1)ε + 1/2)/K | < t  and  K ≥ n^{2/3}/2 ]
      ≥ ( 1 − 2 e^{−n^{2/3} t²} ) Pr[ K ≥ n^{2/3}/2 ]
      ≥ ( 1 − 1/8 ) (3/4)
      > 1/2.

We claim that t = ε/6, | S/K − ((K − 1)ε + 1/2)/K | < t and K ≥ n^{2/3}/2 imply (3). To see that, note that | S/K − ((K − 1)ε + 1/2)/K | < t is equivalent to

    ((K − 1)ε + 1/2)/K − t < S/K < ((K − 1)ε + 1/2)/K + t,

which implies that

    ε − ε/K − t < S/K < ε + 1/(2K) + t.

Since K ≥ n^{2/3}/2 and n ≥ 25, we have K > 4, which implies that

    (3/4) ε − t < S/K < ε + 1/(2K) + t.

Since K ≥ n^{2/3}/2 and n ≥ 1600/ε³, we have K ≥ 15/ε, which implies that

    (3/4) ε − t < S/K < ε + ε/30 + t.

Since t = ε/6, the condition (3) follows.
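The construction can be explored numerically. The following Python sketch (not part of the paper) draws the target index I uniformly at random, samples m ≈ ln n/(3 ln(1/ε)) labeled examples from P_I, lets a simple distribution-independent learner — ERM, implemented as the lowest-index projection consistent with the labels — produce a hypothesis, and reports how often the hypothesis has error larger than 1/16. The choice of ERM and all concrete parameter values are illustrative assumptions; the theorem applies to every distribution-independent algorithm. Because a wrong projection has error exactly 1/2 under P_I, the failure rate is simply the fraction of runs in which the learner misses the target index.

```python
import math
import random

def run_trial(n, eps, m, rng):
    """One run of the lower-bound experiment: returns the true error of ERM."""
    i = rng.randrange(n)                       # hidden target index I
    # Draw m examples from P_I: coordinate i is 1 w.p. 1/2, all others w.p. eps.
    xs = [[1 if rng.random() < (0.5 if j == i else eps) else 0 for j in range(n)]
          for _ in range(m)]
    ys = [x[i] for x in xs]                    # labels given by the projection c_I
    # ERM / consistent learner: lowest-index column that matches all the labels.
    j_hat = next(j for j in range(n) if all(x[j] == y for x, y in zip(xs, ys)))
    return 0.0 if j_hat == i else 0.5          # exact error of the output under P_I

if __name__ == "__main__":
    rng = random.Random(0)
    n, eps = 2 ** 14, 0.05
    m = max(1, int(math.log(n) / (3 * math.log(1 / eps))))   # sample size from Theorem 7
    trials = 200
    failures = sum(run_trial(n, eps, m, rng) > 1 / 16 for _ in range(trials))
    print(f"n = {n}, m = {m}, failure rate = {failures / trials:.2f}")
```

With these parameters the learner almost always outputs a wrong projection, whereas the cover-based learner of Corollary 6 succeeds from a number of labeled examples that does not grow with n.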

5 All functions

Let X be some finite domain. We say a sample T = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × {0, 1})^m of size m is self-consistent if for any i, j ∈ {1, 2, ..., m}, x_i = x_j implies y_i = y_j. A distribution-independent algorithm A is said to be consistent if for any self-consistent sample T = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × {0, 1})^m, A(T)(x_i) = y_i holds for every i = 1, 2, ..., m.

In this section we show that for C_all = {0, 1}^X, any consistent distribution-independent learner is almost as powerful as any (possibly distribution-dependent) learner. Note that, in particular, the ERM algorithm for C_all is consistent; a minimal sketch of a consistent learner is given below. In other words, for the class C_all unlabeled data do not have any information-theoretic value.
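The following Python sketch (not from the paper) shows one consistent learner for C_all: it memorizes the labeled sample and outputs a default label elsewhere. The default label 0 and the dictionary-based representation are implementation choices.

```python
def consistent_learner(sample):
    """A consistent learner for C_all = {0,1}^X.

    sample: list of (x, y) pairs with y in {0, 1}, assumed self-consistent.
    Returns a hypothesis h with h(x_i) = y_i for every sample point and a
    default label (0) everywhere else.
    """
    memory = {x: y for x, y in sample}
    return lambda x: memory.get(x, 0)

if __name__ == "__main__":
    T = [("a", 1), ("b", 0), ("a", 1)]    # a self-consistent sample
    h = consistent_learner(T)
    print(h("a"), h("b"), h("c"))         # 1 0 0 -- agrees with T, default elsewhere
```

On the probability mass not covered by the sample such a learner may of course be wrong; this uncovered mass is exactly the random variable Z controlled in the proof of Theorem 8 below.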

Theorem 8 (No gap). Let X be some finite domain, let C_all = {0, 1}^X, and let A be any consistent learning algorithm. Then, for any distribution P over X, any (possibly distribution-dependent) learning algorithm B, and any ε, δ ∈ (0, 1),

    m(A, C_all, P, 2ε, 2δ) ≤ m(B, C_all, P, ε, δ).

Proof. Fix any integer m ≥ 0 and any distribution P over X. Let X, X_1, X_2, ..., X_m be an i.i.d. sample from P. Define the random variable

    Z = Pr[ X ∉ {X_1, X_2, ..., X_m} | X_1, X_2, ..., X_m ].

In other words, Z is the probability mass not covered by X_1, X_2, ..., X_m. For any c ∈ C_all, let T_c = ((X_1, c(X_1)), (X_2, c(X_2)), ..., (X_m, c(X_m))) be the sample labeled according to c. Since A is consistent, with probability one, for any c ∈ C_all,

    d_P(A(T_c), c) ≤ Z.                                                                            (4)

Let c be chosen uniformly at random from C_all, independently of X, X_1, X_2, ..., X_m. Additionally, define ĉ ∈ C_all as

    ĉ(x) = c(x)        if x ∈ {X_1, X_2, ..., X_m},
    ĉ(x) = 1 − c(x)    otherwise,

and note that ĉ and c are distributed identically and T_c = T_ĉ, and thus

    E[ 𝟙[d_P(B(T_c), c) ≥ ε] | T_c ] = E[ 𝟙[d_P(B(T_ĉ), ĉ) ≥ ε] | T_c ].                            (5)

We have

    sup_{c ∈ C_all} Pr[ d_P(B(T_c), c) ≥ ε ]
      = sup_{c ∈ C_all} E[ 𝟙[d_P(B(T_c), c) ≥ ε] ]
      ≥ E[ 𝟙[d_P(B(T_c), c) ≥ ε] ]                                    (c now random)
      = E[ E[ 𝟙[d_P(B(T_c), c) ≥ ε] | T_c ] ]
      = E[ (1/2) E[ 𝟙[d_P(B(T_c), c) ≥ ε] + 𝟙[d_P(B(T_ĉ), ĉ) ≥ ε] | T_c ] ]                        (6)
      ≥ E[ (1/2) E[ 𝟙[Z ≥ 2ε] | T_c ] ]                                                             (7)
      = (1/2) Pr[ Z ≥ 2ε ]
      ≥ (1/2) sup_{c ∈ C_all} Pr[ d_P(A(T_c), c) ≥ 2ε ].                                            (8)

Equation (6) follows from (5). To justify inequality (7), note that since the classifiers c and ĉ disagree on the missing mass, if Z ≥ 2ε then d_P(B(T_c), c) ≥ ε or d_P(B(T_ĉ), ĉ) ≥ ε, or both. By symmetry between c and ĉ, if Z ≥ 2ε then with probability at least 1/2, d_P(B(T_c), c) ≥ ε. Inequality (8) follows from (4).

Since the inequality

    sup_{c ∈ C_all} Pr[ d_P(B(T_c), c) ≥ ε ] ≥ (1/2) sup_{c ∈ C_all} Pr[ d_P(A(T_c), c) ≥ 2ε ]

holds for arbitrary m, it implies m(A, C_all, P, 2ε, 2δ) ≤ m(B, C_all, P, ε, δ) for any ε, δ ∈ (0, 1).

6 Conclusion and open problems

Darnstädt et al. [2013] showed that the gap between the number of samples needed to learn a class of functions of Vapnik-Chervonenkis dimension d with and without knowledge of the distribution is upper-bounded by O(d). We show that this bound is tight for the class of Boolean projections. On the other hand, for the class of all functions, this gap is only a constant. These observations lead to the following research directions.

First, it will be interesting to understand the value of the gap for larger classes of functions. For example, one might consider the classes of (monotone) disjunctions over {0, 1}^n, (monotone) conjunctions over {0, 1}^n, parities over {0, 1}^n, and halfspaces over R^n. The Vapnik-Chervonenkis dimension of these classes is Θ(n), thus the gap for these classes is at least Ω(1) and at most O(n). Other than these crude bounds, the question of what the gap is for these classes is wide open.

Second, as the example with the class of all functions shows, the gap is not characterized by the Vapnik-Chervonenkis dimension. It will be interesting to study other parameters which determine this gap. In particular, it will be interesting to obtain upper bounds on the gap in terms of other quantities.

Finally, we believe that studying this question in the agnostic extension of the PAC model [Anthony and Bartlett, 1999] will be of great interest, too.

References

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Shai Ben-David, Tyler Lu, and Dávid Pál. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), Helsinki, Finland, July 9-12, 2008. Omnipress, 2008.

Gyora M. Benedek and Alon Itai. Learnability with respect to fixed distributions. Theoretical Computer Science, 86(2):377-389, 1991.

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929-965, 1989.

Malte Darnstädt, Hans Ulrich Simon, and Balázs Szörényi. Unlabeled data does provably help. In 30th International Symposium on Theoretical Aspects of Computer Science, STACS 2013, February 27 - March 2, 2013, Kiel, Germany, pages 185-196, 2013.

Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer, 2001.

Richard M. Dudley. Central limit theorems for empirical measures. The Annals of Probability, 6(6):899-929, 1978.

Richard M. Dudley. A course on empirical processes, pages 1-142. Springer, 1984.

Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247-261, 1989.

Steve Hanneke. The optimal sample complexity of PAC learning. Journal of Machine Learning Research, 17(38):1-15, 2016.

David Haussler. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217-232, 1995.

Tyler Lu. Fundamental limitations of semi-supervised learning. Master's thesis, David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1, 2009. Available at lumastersthesis_electronic.pdf.

A Size of ε-cover

In this section, we prove a (2e/ε)^d upper bound on the size of the ε-cover of any concept class of Vapnik-Chervonenkis dimension d. To prove our result, we need Sauer's lemma. Its proof can be found, for example, in Anthony and Bartlett [1999, Chapter 3].

Lemma 9 (Sauer's lemma). Let X be a non-empty domain and let C ⊆ {0, 1}^X be a concept class with Vapnik-Chervonenkis dimension d. Then, for any S ⊆ X,

    | { π(c, S) : c ∈ C } | ≤ ∑_{i=0}^d ( |S| choose i ).

We remark that if n ≥ d ≥ 1 then

    ∑_{i=0}^d ( n choose i ) ≤ (ne/d)^d,                                                           (9)

where e is the base of the natural logarithm. This follows from the following calculation:

    (d/n)^d ∑_{i=0}^d ( n choose i ) ≤ ∑_{i=0}^d ( n choose i ) (d/n)^i ≤ ∑_{i=0}^n ( n choose i ) (d/n)^i = (1 + d/n)^n ≤ e^d,

where we used in the last step that 1 + x ≤ e^x for any x ∈ ℝ.

Theorem 10 (Size of ε-cover). Let X be a non-empty domain and let C ⊆ {0, 1}^X be a concept class with Vapnik-Chervonenkis dimension d. Let P be any distribution over X. For any ε ∈ (0, 1], there exists a set C' ⊆ C such that

    |C'| ≤ (2e/ε)^d                                                                                (10)

and for any c ∈ C there exists c' ∈ C' such that d_P(c, c') ≤ ε.

Proof. We say that a set B ⊆ C is an ε-packing if

    ∀ c, c' ∈ B,  c ≠ c'  ⟹  d_P(c, c') > ε.

We claim that there exists a maximal ε-packing. In order to show that a maximal set exists, we appeal to Zorn's lemma. Consider the collection of all ε-packings. We impose a partial order on them by set inclusion. Notice that any totally ordered collection {B_i : i ∈ I} of ε-packings has an upper bound ⋃_{i∈I} B_i that is an ε-packing. Indeed, if c, c' ∈ ⋃_{i∈I} B_i are such that c ≠ c', then there exists i ∈ I such that c, c' ∈ B_i, since {B_i : i ∈ I} is totally ordered. Since B_i is an ε-packing, d_P(c, c') > ε. We conclude that ⋃_{i∈I} B_i is an ε-packing. By Zorn's lemma, there exists a maximal ε-packing.

Let C' be a maximal ε-packing. We claim that C' is also an ε-cover of C. Indeed, for any c ∈ C there exists c' ∈ C' such that d_P(c, c') ≤ ε, since otherwise C' ∪ {c} would be an ε-packing, which would contradict the maximality of C'.

It remains to upper bound |C'|. Consider any finite subset C'' ⊆ C'. It suffices to show an upper bound on |C''|; since C'' is arbitrary, the same upper bound holds for |C'|. Let M = |C''| and let c_1, c_2, ..., c_M be the concepts in C''. For any i, j ∈ {1, 2, ..., M}, i < j, let

    A_{i,j} = { x ∈ X : c_i(x) ≠ c_j(x) }.

Let X_1, X_2, ..., X_K be an i.i.d. sample from P. We will choose K later. Since d_P(c_i, c_j) > ε,

    Pr[ X_k ∈ A_{i,j} ] > ε    for k = 1, 2, ..., K.

Since there are ( M choose 2 ) sets A_{i,j}, we have

    Pr[ ∀ i < j, ∃ k, X_k ∈ A_{i,j} ] = 1 − Pr[ ∃ i < j, ∀ k, X_k ∉ A_{i,j} ]
      ≥ 1 − ∑_{i < j} Pr[ ∀ k, X_k ∉ A_{i,j} ]
      ≥ 1 − ( M choose 2 ) (1 − ε)^K.

For K = ⌈ ln ( M choose 2 ) / ε ⌉ + 1, the above probability is strictly positive. This means there exists a set S = {x_1, x_2, ..., x_K} ⊆ X such that A_{i,j} ∩ S is non-empty for every i < j. This means that for every i ≠ j, π(c_i, S) ≠ π(c_j, S), and hence

    M ≤ | { π(c, S) : c ∈ C } |.

Thus, by Sauer's lemma,

    M ≤ ∑_{i=0}^d ( K choose i ).

We now show that this inequality implies that M ≤ (2e/ε)^d. We consider five cases.

Case 1: M = 0. Then the inequality trivially follows.

Case 2: d = 0. Then M ≤ 1 and the inequality trivially follows.

Case 3: d ≥ 1 and M ≤ 2. Clearly, M ≤ 2 ≤ 2e ≤ (2e/ε)^d.

Case 4: d ≥ 1 and M ≥ 3 and K < d/ε. Then ⌈ ln ( M choose 2 ) / ε ⌉ + 1 < d/ε, which implies that ln ( M choose 2 ) < d and hence ( M choose 2 ) ≤ e^d. Since ( M choose 2 ) ≥ M for M ≥ 3,

    M ≤ ( M choose 2 ) ≤ e^d ≤ (2e/ε)^d.

Case 5: d ≥ 1 and M ≥ 3 and K ≥ d/ε. Clearly, K ≥ d, hence by (9),

    M ≤ ∑_{i=0}^d ( K choose i ) ≤ (Ke/d)^d ≤ (2e/ε)^d.
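The proof is constructive in spirit: any maximal ε-packing is an ε-cover. The following Python sketch (not from the paper; the domain, the concept class, the distribution, the value of ε, and the greedy order are illustrative choices) builds such a packing greedily for a small class of threshold functions and compares its size with the (2e/ε)^d bound of Theorem 10.

```python
import math

def d_P(f, g, points, probs):
    """Disagreement pseudo-metric d_P(f, g) over a finite domain."""
    return sum(p for x, p in zip(points, probs) if f(x) != g(x))

def greedy_packing(concepts, points, probs, eps):
    """Greedily build a maximal eps-packing; by the argument in the proof of
    Theorem 10, the result is also an eps-cover of `concepts` with respect to P."""
    packing = []
    for c in concepts:
        if all(d_P(c, c2, points, probs) > eps for c2 in packing):
            packing.append(c)
    return packing

if __name__ == "__main__":
    n = 8
    points = list(range(n))
    probs = [1 / n] * n                                              # uniform distribution P
    concepts = [lambda x, t=t: int(x >= t) for t in range(n + 1)]    # thresholds, VC dim d = 1
    eps = 0.25
    cover = greedy_packing(concepts, points, probs, eps)
    print("cover size:", len(cover), " bound (2e/eps)^d:", 2 * math.e / eps)
```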

B Fixed-distribution learning

Theorem 11 (Chernoff bound). Let X_1, X_2, ..., X_n be i.i.d. Bernoulli random variables with E[X_i] = p. Then, for any ε ∈ [0, min{p, 1 − p}),

    Pr[ (1/n) ∑_{i=1}^n X_i ≥ p + ε ] ≤ e^{−n D(p+ε ∥ p)},
    Pr[ (1/n) ∑_{i=1}^n X_i ≤ p − ε ] ≤ e^{−n D(p−ε ∥ p)},

where

    D(x ∥ y) = x ln(x/y) + (1 − x) ln((1 − x)/(1 − y))

is the Kullback-Leibler divergence between Bernoulli distributions with parameters x, y ∈ [0, 1]. We further use the inequality

    D(x ∥ y) ≥ (x − y)² / (2 max{x, y}).

Theorem 12 (Benedek-Itai). Let C ⊆ {0, 1}^X be a concept class over a non-empty domain X. Let P be a distribution over X. Let ε ∈ (0, 1] and assume that C has an ε/2-cover of size at most N. Then, there exists an algorithm such that for any δ ∈ (0, 1) and any target c ∈ C, if it gets

    m ≥ 48 ( ln N + ln(1/δ) ) / ε

labeled samples, then with probability at least 1 − δ, it ε-learns the target.

Proof. Given a labeled sample T = ((x_1, y_1), ..., (x_m, y_m)), for any c' ∈ C, we define the empirical error

    err_T(c') = (1/m) ∑_{i=1}^m 𝟙[ c'(x_i) ≠ y_i ].

Let C' ⊆ C be an (ε/2)-cover of size at most N. Consider the algorithm A that, given a labeled sample T, outputs ĉ = argmin_{c' ∈ C'} err_T(c'), breaking ties arbitrarily. We prove that A, with probability at least 1 − δ, ε-learns any target c ∈ C under the distribution P.

Consider any target c ∈ C. Then there exists c* ∈ C' such that d_P(c*, c) ≤ ε/2. Let C'' = { c' ∈ C' : d_P(c, c') > ε }. We claim that with probability at least 1 − δ, for all c' ∈ C'', err_T(c') > (2/3) ε and err_T(c*) < (2/3) ε, and hence A outputs ĉ ∈ C' \ C''.

Consider any c' ∈ C'' and note that err_T(c') is an average of m Bernoulli random variables with mean d_P(c, c') > ε. Thus, by the Chernoff bound,

    Pr[ err_T(c') ≤ (2/3) ε ] ≤ exp( −m D( (2/3) ε ∥ d_P(c, c') ) )
                              ≤ exp( −m ( d_P(c, c') − (2/3) ε )² / ( 2 d_P(c, c') ) )
                              ≤ exp( −mε/18 ),

where the last inequality follows from the inequality ( x − (2/3) ε )² ≥ x²/9, valid for any x ≥ ε.

Similarly, err_T(c*) is an average of m Bernoulli random variables with mean d_P(c*, c) ≤ ε/2. Thus, by the Chernoff bound,

    Pr[ err_T(c*) ≥ (2/3) ε ] ≤ exp( −m D( (2/3) ε ∥ d_P(c*, c) ) )
                              ≤ exp( −m ( (2/3) ε − d_P(c*, c) )² / ( 2 · (2/3) ε ) )
                              ≤ exp( −mε/48 ).

Since |C''| ≤ N, by the union bound, with probability at least 1 − (N − 1) exp(−mε/48), for all c' ∈ C'' we have err_T(c') > (2/3) ε. Finally, with probability at least 1 − N exp(−mε/48) ≥ 1 − δ, err_T(c*) < (2/3) ε and, for all c' ∈ C'', err_T(c') > (2/3) ε. This completes the proof.
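For concreteness, here is a minimal Python sketch (not from the paper) of the learner analyzed in Theorem 12: given a finite ε/2-cover and a labeled sample, it returns the cover element with the smallest empirical error. Representing concepts as Python functions and the cover as a list are implementation choices.

```python
def empirical_error(c, sample):
    """err_T(c) = fraction of labeled examples (x, y) with c(x) != y."""
    return sum(1 for x, y in sample if c(x) != y) / len(sample)

def benedek_itai_learner(cover, sample):
    """Return the element of the (eps/2)-cover with minimal empirical error.

    By Theorem 12, with m >= 48 (ln N + ln(1/delta)) / eps labeled examples the
    returned hypothesis has true error at most eps with probability >= 1 - delta.
    """
    return min(cover, key=lambda c: empirical_error(c, sample))

if __name__ == "__main__":
    # Toy usage: a cover of two projections on {0,1}^2, the target is the first one.
    c0 = lambda x: x[0]
    c1 = lambda x: x[1]
    sample = [((0, 1), 0), ((1, 1), 1), ((1, 0), 1), ((0, 0), 0)]
    h = benedek_itai_learner([c0, c1], sample)
    print("picked c0" if h is c0 else "picked c1")
```

In the setting of Section 4, Corollary 6 is exactly this learner run with the two-element cover {c_i, c_j}.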


More information

Lecture 29: Computational Learning Theory

Lecture 29: Computational Learning Theory CS 710: Complexity Theory 5/4/2010 Lecture 29: Computational Learning Theory Instructor: Dieter van Melkebeek Scribe: Dmitri Svetlov and Jake Rosin Today we will provide a brief introduction to computational

More information

Irredundant Families of Subcubes

Irredundant Families of Subcubes Irredundant Families of Subcubes David Ellis January 2010 Abstract We consider the problem of finding the maximum possible size of a family of -dimensional subcubes of the n-cube {0, 1} n, none of which

More information

Uniform-Distribution Attribute Noise Learnability

Uniform-Distribution Attribute Noise Learnability Uniform-Distribution Attribute Noise Learnability Nader H. Bshouty Technion Haifa 32000, Israel bshouty@cs.technion.ac.il Christino Tamon Clarkson University Potsdam, NY 13699-5815, U.S.A. tino@clarkson.edu

More information

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu

More information

Online Learning: Random Averages, Combinatorial Parameters, and Learnability

Online Learning: Random Averages, Combinatorial Parameters, and Learnability Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

Multiclass Learnability and the ERM principle

Multiclass Learnability and the ERM principle JMLR: Workshop and Conference Proceedings 19 2011 207 232 24th Annual Conference on Learning Theory Multiclass Learnability and the ERM principle Amit Daniely amit.daniely@mail.huji.ac.il Dept. of Mathematics,

More information

Computational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar

Computational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar Computational Learning Theory: Probably Approximately Correct (PAC) Learning Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Computational Learning Theory The Theory

More information

The Decision List Machine

The Decision List Machine The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca

More information

Yale University Department of Computer Science

Yale University Department of Computer Science Yale University Department of Computer Science Lower Bounds on Learning Random Structures with Statistical Queries Dana Angluin David Eisenstat Leonid (Aryeh) Kontorovich Lev Reyzin YALEU/DCS/TR-42 December

More information

Generalization theory

Generalization theory Generalization theory Chapter 4 T.P. Runarsson (tpr@hi.is) and S. Sigurdsson (sven@hi.is) Introduction Suppose you are given the empirical observations, (x 1, y 1 ),..., (x l, y l ) (X Y) l. Consider the

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

Impossibility Theorems for Domain Adaptation

Impossibility Theorems for Domain Adaptation Shai Ben-David and Teresa Luu Tyler Lu Dávid Pál School of Computer Science University of Waterloo Waterloo, ON, CAN {shai,t2luu}@cs.uwaterloo.ca Dept. of Computer Science University of Toronto Toronto,

More information

Active Learning and Optimized Information Gathering

Active Learning and Optimized Information Gathering Active Learning and Optimized Information Gathering Lecture 7 Learning Theory CS 101.2 Andreas Krause Announcements Project proposal: Due tomorrow 1/27 Homework 1: Due Thursday 1/29 Any time is ok. Office

More information

21 An Augmented PAC Model for Semi- Supervised Learning

21 An Augmented PAC Model for Semi- Supervised Learning 21 An Augmented PAC Model for Semi- Supervised Learning Maria-Florina Balcan Avrim Blum A PAC Model for Semi-Supervised Learning The standard PAC-learning model has proven to be a useful theoretical framework

More information

7 Influence Functions

7 Influence Functions 7 Influence Functions The influence function is used to approximate the standard error of a plug-in estimator. The formal definition is as follows. 7.1 Definition. The Gâteaux derivative of T at F in the

More information