ICML '97 and AAAI '97 Tutorials

Size: px

Start display at page:

Download "ICML '97 and AAAI '97 Tutorials"

Bruno Ellis
6 years ago
Views:

1 A Short Course in Computational Learning Theory: ICML '97 and AAAI '97 Tutorials Michael Kearns AT&T Laboratories

2 Outline Sample Complexity/Learning Curves: nite classes, Occam's VC dimension Razor, Best Experts/Multiplicative Update Algorithms Statistical Query Learning and Noisy Data Boosting Computational Intractability Results Fourier Methods and Membership Queries

3 The Basic Model Target function f : X! Y (Y = f0; 1g or f+1;,1g), may from class F come Input distribution/density P over X, may be known or arbitrary Class of hypothesis functions H from X to Y Random sample S of m pairs hx i ;f(x i )i Generalization Error (h) =Pr P [h(x) 6= f(x)]

4 Measures of Eciency Parameters ; 2 [0; 1]: ask for (h) with probability at 1, least Input dimension n Target function \complexity" s(f) Sample and computational complexity: scale nicely with 1=;n;s(f) 1=;

5 Variations Ad Innitum Target f from known class F, distribution P arbitrary: model PAC/Valiant Fixed, known P : Distribution-specic PAC model Target f arbitrary: Agnostic model Target f from known class F, add black-box access to f: model with membership queries PAC

6 Sample Complexity/Learning Curves Algorithm-specic vs. general Assume f 2 H How many examples does an arbitrary consistent algorithm to achieve (h)? require

7 The Case of Finite H Fix unknown f 2 H, distribution P Fix \bad" h 0 2 H ((h 0 ) ) Probability h 0 survives m examples (1, ) m (Independence) Probability some bad hypothesis survives jhj(1, ) m (Union Bound) Solve jhj(1, ) m, m = ((1=) log(jhj=)) suces Example: jhj =2 n, m = ((n =) log(1=)) suces Independent of distribution P

8 Occam's Razor Assume f 2 H 0, but given m examples h is chosen from Hm, H0 H 1 Hm Same argument establishes that failure probability is bounded jhmj(1, ) m by Example: jhmj =2 n m, m = ((n m =) log(1=)) suces; m = (((n =) log(1=)) 1=(1,) ) or Compression ( <1) implies Learning

9 An Example: Covering Methods Target f is a conjunction of boolean attributes chosen from x 1 ;:::;xn Eliminate any x i such that x i =0in a positive example Any surviving x i \covers" or explains all negative examples x i =0 with Greedy approach: always is an x i covering (1, 1=k opt ) of so cover all negatives in O(k opt log(m)) remainder,

10 Innite H and the VC Dimension Still true that probability that bad h 2 H survives is (1,) m, now jhj is innite but H shatters x 1 ;:::;x d if all 2 d labelings are realized by H VC dimension d H : size of the largest shattered set

11 Examples of the VC Dimension Finite class H: need 2 d functions to shatter d points, so H log(jhj) d Axis-parallel rectangles in the plane: d H =4 Convex d-gons in the plane: d H =2d +1 Hyperplanes in n dimensions: d H = n +1 d H usually \nicely" related to number of parameters, number operations of

12 The Dichotomy Counting Function For any set of points S, dene H (S) =fh T S : h 2 Hg and = max jsjm fj H (S)jg H(m) H (m) counts maximum number of labelings realized by H m points, so H (m) 2 m always on Important Lemma: for m d H, H (m) m d H Proof is by double induction on m and d H

13 Two Clever Tricks Idea: in expression jhj(1, ) m, try to replace jhj by H (2m) Two-Sample Trick: probability some bad h 2 H survives m probability some h 2 H makes NO mistakes on examples rst m examples and m=2 mistakes on second m examples Incremental Trick: draw 2m sample S = Randomization S S2 rst, randomly split into S 1 and S 2 later 1 S Fix one of the H (2m) labelings of S which makes at least mistakes; probability (wrt split) all mistakes end up in m=2 S 2 is exponentially small in m Failure probability H (2m)2,m=2, sucient sample size is = ((d H =) log(1=)+(1=) log(1=)) m

14 Extensions and Renements Not all h 2 H with (h) have (h) = ; compute error shells (Energy vs. Entropy, Sta- distribution-specic tistical Mechanics) Replace H (m) with distribution-specic expected number dichotomies of \Uniform Convergence Happens": unrealizable f, squared log-loss, ::: error,

15 Best Experts, Multiplicative Updates, Weighted Majority::: Assume nothing about the data Input sequence x 1 ;:::;xm arbitrary Label sequence y 1 ;:::;ym arbitrary Given x m+1, want to predict y m+1 What could we hope to say?

16 A Modest Proposal Only compare performance to a xed collection of \expert h 1 ;:::;h N advisors" Expert h i predicts y i j on x j Goal: for any data sequence, match the number of mistakes by the best advisor on that sequence made Idea: to punish us, force adversary to punish all the advisors As in sample complexity analysis for probabilistic assumptions, expect performance to degrade for large N

17 A Simple Algorithm Maintain weight w i for advisor h i, w i =1=N initially Predict using weighted majority at each trial If we err, and h i erred, w i w i =2

18 A Simple Analysis If W = P N i=1 w i, then when we err W 0 (W=2) + (1=2)(W=2)=(3=4)W After k errors, W 1 (3=4) k If expert i makes ` errors, then w i (1=N)(1=2)` Total weight w i gives (3=4) k (1=N)(1=2)` k (1= log(4=3))(` + log(n)) 2:41(` + log(n)) Sparse vs. distributed representations

19 Statistical Query Learning Model algorithms that use a random sample only to compute statistics Replace source of random examples of target f drawn from P by an oracle for estimating probabilities distribution Learning algorithm submits a query : X Y!f0; 1g Example: (x; y) =1if and only if x 5 = y Response is an estimate of P =Pr P [(x; f(x)) = 1] Demand that complexity of queries and accuracy of estimates permit ecient simulation from examples Captures almost all natural algorithms

20 Noise Tolerance of SQ Algorithms Let source of examples return hx; yi, where y = f(x) with 1, and y = :f(x) with probability probability Dene X 1 = fx : (x; 0) 6= (x; 1)g, X 2 = X, X 1, and 1 =Pr P [x 2 X 1 ] p P = (p 1 =(1, 2)) Pr P; [(x; y) = 1jx 2 X 1 ] + P; [((x; y) =1)^ (x 2 X 2 )] Pr Can eciently estimate P from source of noisy examples Only known method for noise tolerance; other P decompositions give tolerance to other forms of corruption

21 Limitations of SQ Algorithms How many queries are required to learn? SQ dimension d F;P : largest number of pairwise (almost) functions in F with respect to P uncorrelated 1=3 d F;P queries required for nontrivial generalization (Easy case: queries are functions in F ) Also lower bounded by VC dimension of F Application: no SQ algorithm for small decision trees, uniform distribution (including C4.5/CART)

22 Boosting Replace source of random examples with an oracle accepting distributions On input P i, oracle returns a function h i 2 H such that P [h i (x) 6= h(x)] 1=2,, 2 (0; 1=2] (weak learning) Pr Each successive P i is a ltering of true input distribution P, can simulate from random examples so Oracle models a mediocre but nontrivial heuristic Goal: combine the h i into h such that Pr P [h(x) 6= f(x)]

23 How is it Possible? Intuition: each successive P i should force oracle to learn \new" something Example: P 1 = P ; P 2 balances h 1 (x) =f(x) and h 1 6= f(x); 3 restricted to h 1 (x) 6= h 2 (x) P h(x) is majority vote of h 1 ;h 2 ;h 3 Claim: if each h i satises Pr Pi [h i (x) 6= f(x)], then P [h(x) 6= f(x)] 3 2, 2 3 Pr Represent error algebraically, solve constrained maximization problem Now recurse

24 Measuring Performance How many rounds (ltered distributions) required to go from, to? 1=2 Recursive scheme: polynomial in log(1=) and 1=, hypothesis is ternary majority tree of weak hypotheses Adaboost: (1= 2 ) log(1=), hypothesis is linear threshold of weak hypotheses function Top-down decision tree algorithms: (1=) 1=2, hypothesis is decision tree over weak hypotheses a Advantage is actually problem- and algorithm-dependent

25 Computational Intractability Results for Learning Representation-dependent: hard to learn the functions in using hypothesis representation H F Representation-independent: hard to learn the functions F, period in

26 A Standard Reduction Given examples of f 2 F according to any P, L outputs h 2 H (h) with probability 1, in time polynomial in with 1=; 1=;n;s(f) Given sample S of m pairs hx i ;f(x i )i, run algorithm L using = 1=(2m) and drawing randomly from S to provide exam- ples So: reduce hard combinatorial problem to consistency problem for (F; H), then learning is hard unless RP = NP Note: converse from Occam's Razor Then if there exists h 2 H consistent with S, L will nd it probability 1, with

27 Examples of Representation-Dependent Intractability Consistency for (k-term DNF,k-term DNF) as hard as graph k-coloring Consistency for (k-term DNF,2k-term DNF) as hard as approximate graph coloring Consistency for (k-term DNF,k-CNF) is easy Approximate consistency for (conjunctions with errors, conjunctions) as hard as set covering

28 Representation-Independent Intractability Complexity theory: set of examples problem instance, simple hypothesis solution for instance \Read o" a k-coloring of original graph G from the sytactic of a k-term DNF formula for the associated sample S G form No restrictions on form of h: everything is learnable At least ask that hypotheses be polynomially evaluatable: h(x) in time polynomial in jxj;s(h) compute Want learning to be hard for any ecient algorithm

29 Public-Key Cryptography and Learning A sends many encrypted messages F (x 1 );:::;F(xm) to B Ecient eavesdropper E may obtain (or generate) many pairs i ;F(x i )i hx Want it to be hard for E to decrypt a \new" F (x 0 ) (generalization) Thus, E wants to \learn" inverse of F (decryption) from examples! Inverse is \simple" given \trapdoor" (fairness of learning)

30 Applications Given by PKC: encryption schemes F with polynomial-time E's task as hard as factoring (F (x) =x e mod p q) inverses, Thus, learning (small) boolean circuits is intractable Also hard: boolean formulae, nite automata, small-depth networks neural DNF and decision trees?

31 The Fourier Representation Any function f : f0; 1g n!<is a 2 n -dimensional real vector Inner product: hf; hi =(1=2 n ) P x f(x)h(x) hf; hi = E U [f(x)h(x)] = Pr U [f(x) =h(x)], Pr U [f(x) 6= h(x)] Set of 2 n orthogonal basis functions: gz(x) = +1 if x has parity on bits i where z i =1,else gz(x) =,1 even Write f(x) = P z z(f) gz(x), coecients z(f) =hf; gzi

32 Fun Fourier Facts Parseval's Identity: hf; fi = P z(z(f)) 2 =1 Small DNF f: P z:w(z)c0 log(n) ( z(f)) 2 1 Small decision tree f: spectrum is sparse, only a small number of non-zero coecients Tree-based spectrum estimation using membership queries

Computational Learning Theory. Definitions

Computational Learning Theory. Definitions Computational Learning Theory Computational learning theory is interested in theoretical analyses of the following issues. What is needed to learn effectively? Sample complexity. How many examples? Computational