Sample Complexity of Learning Independent of Set Theory

1 Sample Complexity of Learning Independent of Set Theory Shai Ben-David University of Waterloo, Canada Based on joint work with Pavel Hrubeš, Shay Moran, Amir Shpilka and Amir Yehudayoff Simons workshop, June 2018

2 The 3 components of this talk: Statistical learning theory. Combinatorics. Set Theory.

3 Outline of the talk 1. Introduce a general learning task, Expectation Maximization (EMX), and show that many common statistical learning problems are subcases. 2. Introduce a combinatorial notion of sample compression schemes and show that it characterizes EMX learnability. 3. Show that for certain basic classes, the existence of such compressions is independent of the set theory axioms ZFC. 4. Conclude that EMX learnability is independent of set theory. 5. Conclude that EMX learnability cannot be captured by any notion of finite-character dimension (VC-dimension-like). 6. Discuss implications.

4 A General Learning Problem -- Expectation Maximization (EMX) Let H be a collection of real-valued functions over some domain X. Input: a sample S, i.i.d. generated by some unknown probability distribution P over X. Output: Some h in H with as high as possible expectation w.r.t. P.
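
In PAC-style terms (our phrasing; the learner A and sample size m(ε, δ) are our notation, the slide leaves this informal), H is EMX learnable if there is a learner A and a sample size m(ε, δ) such that for every distribution P over X,

$$\Pr_{S \sim P^{m(\epsilon,\delta)}}\Big[\, \mathbb{E}_{x\sim P}\big[A(S)(x)\big] \;\ge\; \sup_{h\in H}\mathbb{E}_{x\sim P}\big[h(x)\big] - \epsilon \,\Big] \;\ge\; 1-\delta.$$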

5 Vapnik's generalized loss A statistical learning framework is defined by a triplet: a domain set X, a set of models M, and a loss function l: X × M → R. For a probability distribution P over X and h in M, define L_P(h) to be the expectation over x ~ P of l(h, x). The learner is given a sample S generated i.i.d. by P and aims to find h in M that minimizes L_P(h). It is easy to see that the EMX problem encapsulates this problem.
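
One way to see this encapsulation (a sketch of our own, assuming the loss is bounded above by a constant c): for each model h define g_h(x) = c − l(h, x). Then

$$\mathbb{E}_{x\sim P}\big[g_h(x)\big] \;=\; c - L_P(h),$$

so maximizing the expectation over the class {g_h : h in M} is exactly minimizing L_P over M, i.e., an EMX instance.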

6 Examples of EMX problems Binary classification prediction ("proper"). Multi-class prediction. K-center clustering problems. Linear regression. Learning when the labels are known. Statistical loss minimization.

7 Common Tools of the Trade PAC Learnability. Learnability by Empirical Risk Minimization. Weak Learnability. Uniform Convergence. Existence of Sample Compression Schemes. Finiteness of Combinatorial Dimensions.

8 Binary classification (the clean case) The Fundamental Theorem of Statistical Learning: Given a class H, all of the above are equivalent (the relevant dimension is the VC-dimension of the class). The equivalence holds quantitatively.
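
Quantitatively (a standard form of the agnostic-case bound, not spelled out on this slide): with d = VCdim(H), the sample complexity is

$$m_H(\epsilon,\delta) \;=\; \Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon^{2}}\right).$$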

9 The Theorem breaks down for multi-class classification. Theorem [Daniely, Ben-David, Sabato, Shalev-Shwartz 2010]: There are multi-class classes H for which not all ERM algorithms are equal -- some ERM algorithm may learn a class that other ERM algorithms fail to learn. Corollary: PAC learnability may hold when uniform convergence fails.

10 Non-equivalence for EMX When H is the class of all subsets of any infinite domain: Claim 1: Uniform convergence fails. Claim 2: EMX learnability holds. (Why? Note that the domain X itself belongs to H and has expectation 1 under every distribution.)

11 Our goal Characterize/figure out: Which classes are EMX learnable? Can one have a purely combinatorial characterization?

12 Sample Compression Schemes [Littlestone and Warmuth 89] Let H be a class of binary-valued functions over some domain set X. A k-size sample compression scheme for H is a function G: (X × {0,1})^k → {0,1}^X such that for every h in H and every finite labeled sample S = ((x_1, h(x_1)), ..., (x_m, h(x_m))), there is a subset S_h of S of size at most k, such that for all i ≤ m, G(S_h)(x_i) = h(x_i).
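
As a toy illustration (our own example, not from the talk): the class of threshold functions h_t(x) = 1 iff x ≥ t on the real line admits a compression scheme of size 1 -- keep the smallest positively labeled point, or nothing if the sample is all-negative.

```python
# Minimal sketch (our own example): a 1-size sample compression scheme
# for threshold classifiers h_t(x) = 1 if x >= t else 0 on the reals.

def compress(sample):
    """Keep at most one labeled point: the smallest positive example."""
    positives = [x for (x, y) in sample if y == 1]
    return [(min(positives), 1)] if positives else []

def reconstruct(compressed):
    """G: map the compressed set back to a hypothesis in {0,1}^X."""
    if not compressed:
        return lambda x: 0            # all-negative sample -> all-zero hypothesis
    t = compressed[0][0]
    return lambda x: 1 if x >= t else 0

# Any sample consistent with some threshold is reconstructed correctly:
sample = [(0.2, 0), (1.5, 1), (0.9, 0), (3.0, 1)]   # consistent with t in (0.9, 1.5]
h = reconstruct(compress(sample))
assert all(h(x) == y for (x, y) in sample)
```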

13 Recall For binary classification learning problems, PAC learnability is equivalent to the existence of such compression schemes.

14 The case of Subset Probability Maximization H is a set of (characteristic functions of) subsets of some domain set X. EMX for H is the task of finding a set h in H whose probability weight is within epsilon of the maximum weight achievable in H.

15 Monotone Compression Schemes Let H be a class of real-valued functions over some domain set X. A k-size monotone compression scheme for H is a function G: X^k → H such that for every finite domain subset S and every h in H, there is a subset S_h of S of size at most k such that for all x in S, G(S_h)(x) ≥ h(x).

16 Example class Consider H^X_Fin, the collection of all finite subsets of a set X. Monotone compression boils down to the following game: Alice: gets a finite set S, picks a subset S' of S and sends it to Bob. Bob: outputs a finite subset of X, η(S'). Their goal: to find a strategy by which, for every S, η(S') is a superset of S. Can Alice send Bob subsets of bounded size?

17 Example continued The answer depends on the cardinality of X. It is trivial if X is finite (|X| is a bound on the size of all subsets). Easy if X is countable (think of the natural numbers). It gets more interesting when X is uncountable.
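
For the countable case (a sketch of our own, identifying X with the natural numbers): Alice can send just the single largest element of S, and Bob answers with every natural number up to it.

```python
# Minimal sketch (our own, assuming X = the natural numbers): a size-1
# strategy for the Alice/Bob game on finite subsets of a countable set.

def alice(S):
    """Alice compresses a finite set S of naturals to its maximum element."""
    return {max(S)}

def bob(S_prime):
    """Bob outputs the initial segment {0, ..., max}, a finite superset of S."""
    (m,) = S_prime
    return set(range(m + 1))

S = {3, 7, 19}
assert S <= bob(alice(S))   # Bob's output is always a superset of S
```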

18 The sub-problem we focus on Let H be a family of subsets of some X. We say that H is union bounded if for every h_1, h_2 in H there is h_3 in H that contains both. Examples: union-closed classes, axis-aligned rectangles in any dimension, convex polygons.
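
For instance (our own illustration), axis-aligned rectangles are union bounded because the bounding box of two rectangles is again an axis-aligned rectangle containing both:

```python
# Minimal sketch (our own example): axis-aligned rectangles are union bounded.
# A rectangle is ((x_lo, x_hi), (y_lo, y_hi)); h3 below contains h1 and h2.

def bounding_box(r1, r2):
    """Smallest axis-aligned rectangle containing both r1 and r2."""
    (ax1, ax2), (ay1, ay2) = r1
    (bx1, bx2), (by1, by2) = r2
    return ((min(ax1, bx1), max(ax2, bx2)), (min(ay1, by1), max(ay2, by2)))

def contains(outer, inner):
    (ox1, ox2), (oy1, oy2) = outer
    (ix1, ix2), (iy1, iy2) = inner
    return ox1 <= ix1 and ix2 <= ox2 and oy1 <= iy1 and iy2 <= oy2

h1 = ((0, 2), (0, 1))
h2 = ((1, 5), (-1, 3))
h3 = bounding_box(h1, h2)
assert contains(h3, h1) and contains(h3, h2)
```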

19 The components of the main result For EMX over union-bounded classes, Theorem: The following are equivalent: 1. PAC learnability. 2. Weak learnability. 3. Finite-size monotone compression.

20 Recall that for binary classification all of the following are equivalent: PAC Learnability. Learnability by Empirical Risk Minimization. Weak Learnability. Uniform Convergence. Existence of Sample Compression Schemes. Finiteness of the VC Dimension.

21 A Quantitative version Let d_H = d_H(1/3, 1/3) be the sample size needed for performing EMX over a class H with accuracy epsilon = 1/3 and confidence delta = 1/3. Learnability ⇒ Compression: Every class H has monotone compression to size k(m) = O(d_H log(m)). Compression ⇒ Learnability: If a class has k-size monotone compression then d_H = O(k log(k)).

22 Monotone compression of H^X_Fin Theorem: For every k, H^X_Fin has monotone compression of size at most k if and only if |X| < ℵ_k. Corollary: The class of finite subsets of the real unit interval is EMX learnable if and only if 2^{ℵ_0} < ℵ_ω.

23 Proof: 1. If |X| < ℵ_k then H^X_Fin has monotone compression of size at most k: by induction on k, using a well-ordering of X of type ω_k. (Informally: Alice sends the largest element x of S under the well-ordering; the predecessors of x form a set of smaller cardinality containing the rest of S, so induction applies with one fewer compression point.) 2. If H^X_Fin has monotone compression of size at most k, then for every subset Y of X of smaller cardinality, H^Y_Fin has monotone compression of size at most k−1.

24 Corollaries 1) If |[0,1]| = ℵ_k then H^{[0,1]}_Fin is EMX-learnable and Ck/log(k) < d_H(1/3, 1/3) < Ck log(k) (for some constant C). 2) If |[0,1]| > ℵ_ω then H^{[0,1]}_Fin is not EMX-learnable.

25 A set-theoretic aspect The cardinality of the reals can be changed (using forcing) without changing the set of reals. Therefore the EMX learnability of the class H^{[0,1]}_Fin can be switched without changing either the domain set or the class.

26 A Dimension for EMX Is there a combinatorial dimension that characterizes EMX learnability the way the VC dimension characterizes binary classification PAC learning?

27 Formalizing the notion of dimension We say that a property of a domain set X and a class H is of Finite Character if it can be expressed by a first-order formula all of whose quantifiers range over X and H (and possibly over the real/natural numbers as well). Note that VCdim(H) > d is of Finite Character. So are the Fat Shattering dimension, the Graph dimension and the Natarajan dimension.
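
For instance (our own rendering of the idea), the statement VCdim(H) ≥ d quantifies only over elements of X, members of H, and finite bit strings:

$$\exists x_1,\dots,x_d \in X \;\; \forall (b_1,\dots,b_d)\in\{0,1\}^d \;\; \exists h\in H \;\; \bigwedge_{i=1}^{d} h(x_i)=b_i .$$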

28 A model-theoretic observation Claim: If M_0 is a submodel of M_1, both are models of ZFC, and they share the same sets X and H (and the same set of real numbers), then the truth value of any Finite Character property of (X, H) is the same in both models.

29 No Dimension for EMX learnability Corollary: There is no Finite Character notion of dimension that characterizes EMX learnability.

30 Discussion Where can this independence of set theory come from? The notions of sample complexity are defined in terms of learning functions (rather than learning algorithms). Functions over infinite domains, in contrast with programs, are infinitary objects.
