A Short Course in Computational Learning Theory
ICML '97 and AAAI '97 Tutorials
Michael Kearns, AT&T Laboratories
Outline
- Sample Complexity/Learning Curves: finite classes, Occam's Razor, VC dimension
- Best Experts/Multiplicative Update Algorithms
- Statistical Query Learning and Noisy Data
- Boosting
- Computational Intractability Results
- Fourier Methods and Membership Queries
The Basic Model
- Target function $f : X \to Y$ ($Y = \{0,1\}$ or $\{+1,-1\}$), may come from a class $F$
- Input distribution/density $P$ over $X$, may be known or arbitrary
- Class of hypothesis functions $H$ from $X$ to $Y$
- Random sample $S$ of $m$ pairs $\langle x_i, f(x_i) \rangle$
- Generalization error $\epsilon(h) = \Pr_P[h(x) \neq f(x)]$
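As a concrete illustration (not part of the tutorial), here is a minimal Python sketch of this setup; the uniform distribution over $\{0,1\}^n$ and the conjunction target are hypothetical choices, and the generalization error is estimated by Monte Carlo:

```python
import random

# A hypothetical instantiation: X = {0,1}^n under the uniform
# distribution P, with a conjunction target f (illustrative choices).
n = 10
random.seed(0)

def f(x):
    # target concept: x1 AND x3
    return int(x[0] == 1 and x[2] == 1)

def draw_example():
    x = tuple(random.randint(0, 1) for _ in range(n))
    return x, f(x)                         # a pair <x, f(x)>

S = [draw_example() for _ in range(100)]   # random sample of m = 100 pairs

def generalization_error(h, trials=10000):
    # Monte Carlo estimate of eps(h) = Pr_P[h(x) != f(x)]
    errs = sum(h(x) != y for x, y in (draw_example() for _ in range(trials)))
    return errs / trials
```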
Measures of Efficiency
- Parameters $\epsilon, \delta \in [0,1]$: ask for $\epsilon(h) \leq \epsilon$ with probability at least $1-\delta$
- Input dimension $n$
- Target function "complexity" $s(f)$
- Sample and computational complexity: should scale nicely with $1/\epsilon$, $1/\delta$, $n$, $s(f)$
Variations Ad Infinitum
- Target $f$ from known class $F$, distribution $P$ arbitrary: PAC/Valiant model
- Fixed, known $P$: distribution-specific PAC model
- Target $f$ arbitrary: agnostic model
- Target $f$ from known class $F$, add black-box access to $f$: PAC model with membership queries
Sample Complexity/Learning Curves
- Algorithm-specific vs. general
- Assume $f \in H$
- How many examples does an arbitrary consistent algorithm require to achieve $\epsilon(h) \leq \epsilon$?
The Case of Finite H
- Fix unknown $f \in H$, distribution $P$
- Fix a "bad" $h' \in H$ (one with $\epsilon(h') \geq \epsilon$)
- Probability $h'$ survives $m$ examples $\leq (1-\epsilon)^m$ (independence)
- Probability some bad hypothesis survives $\leq |H|(1-\epsilon)^m$ (union bound)
- Solve $|H|(1-\epsilon)^m \leq \delta$: $m = O((1/\epsilon)\log(|H|/\delta))$ suffices
- Example: $|H| = 2^n$, so $m = O((n/\epsilon)\log(1/\delta))$ suffices
- Independent of the distribution $P$
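The solve step is mechanical; a small sketch, using the standard bound $(1-\epsilon)^m \leq e^{-\epsilon m}$:

```python
import math

def sample_size(H_size, eps, delta):
    # |H|(1-eps)^m <= |H| e^(-eps*m) <= delta  iff  m >= (1/eps) ln(|H|/delta)
    return math.ceil(math.log(H_size / delta) / eps)

print(sample_size(2 ** 10, 0.1, 0.05))   # the slide's example with n = 10
```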
Occam's Razor
- Assume $f \in H_0$, but given $m$ examples $h$ is chosen from $H_m$, where $H_0 \subseteq H_1 \subseteq \cdots \subseteq H_m \subseteq \cdots$
- Same argument establishes that the failure probability is bounded by $|H_m|(1-\epsilon)^m$
- Example: $|H_m| = 2^{nm^\alpha}$, so $m = O((nm^\alpha/\epsilon)\log(1/\delta))$ suffices, or $m = O(((n/\epsilon)\log(1/\delta))^{1/(1-\alpha)})$
- Compression ($\alpha < 1$) implies learning
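The second bound comes from solving the first self-consistently for $m$; a sketch of the algebra, dropping constants:

```latex
\[
2^{n m^\alpha}(1-\epsilon)^m \le \delta
\quad\Longleftarrow\quad
m \ \gtrsim\ \frac{n m^\alpha}{\epsilon}\log\frac{1}{\delta}
\quad\Longleftrightarrow\quad
m \ \gtrsim\ \Bigl(\frac{n}{\epsilon}\log\frac{1}{\delta}\Bigr)^{1/(1-\alpha)} .
\]
```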
An Example: Covering Methods
- Target $f$ is a conjunction of boolean attributes chosen from $x_1, \ldots, x_n$
- Eliminate any $x_i$ such that $x_i = 0$ in a positive example
- Any surviving $x_i$ "covers" or explains all negative examples with $x_i = 0$
- Greedy approach (sketched below): some $x_i$ always covers at least a $1/k_{opt}$ fraction of the remaining negatives, leaving a $(1 - 1/k_{opt})$ fraction, so all negatives are covered in $O(k_{opt}\log(m))$ steps
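A runnable sketch of the greedy covering step, assuming the sample is consistent with some monotone conjunction (the function and variable names are mine, not the tutorial's):

```python
def greedy_conjunction(positives, negatives, n):
    # Assumes the data is consistent with some monotone conjunction,
    # so every negative example is covered by at least one candidate.
    candidates = {i for i in range(n)
                  if all(x[i] == 1 for x in positives)}   # survive positives
    uncovered = list(negatives)
    chosen = []
    while uncovered:
        # Greedy step: pick the candidate covering the most remaining
        # negatives (x_i covers negative x iff x[i] == 0).
        best = max(candidates, key=lambda i: sum(x[i] == 0 for x in uncovered))
        chosen.append(best)
        candidates.discard(best)
        uncovered = [x for x in uncovered if x[best] == 1]
    return chosen   # indices of the variables in the learned conjunction
```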
Infinite H and the VC Dimension
- Still true that the probability a bad $h \in H$ survives is $\leq (1-\epsilon)^m$, but now $|H|$ is infinite
- $H$ shatters $x_1, \ldots, x_d$ if all $2^d$ labelings are realized by $H$
- VC dimension $d_H$: size of the largest shattered set
Examples of the VC Dimension
- Finite class $H$: need $2^d$ functions to shatter $d$ points, so $d_H \leq \log(|H|)$
- Axis-parallel rectangles in the plane: $d_H = 4$
- Convex $d$-gons in the plane: $d_H = 2d + 1$
- Hyperplanes in $n$ dimensions: $d_H = n + 1$
- $d_H$ usually "nicely" related to the number of parameters or number of operations
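The rectangle claim is easy to verify by brute force; a sketch checking that four points in a "diamond" configuration are shattered:

```python
from itertools import product

pts = [(0, 1), (1, 0), (2, 1), (1, 2)]   # four points in a diamond

def rectangle_realizes(labeling):
    # A labeling is realizable iff the bounding box of the positive
    # points contains no negative point.
    pos = [p for p, b in zip(pts, labeling) if b]
    if not pos:
        return True   # an empty rectangle gives the all-negative labeling
    x0, x1 = min(p[0] for p in pos), max(p[0] for p in pos)
    y0, y1 = min(p[1] for p in pos), max(p[1] for p in pos)
    return all(b or not (x0 <= p[0] <= x1 and y0 <= p[1] <= y1)
               for p, b in zip(pts, labeling))

print(all(rectangle_realizes(l) for l in product([0, 1], repeat=4)))  # True
```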
The Dichotomy Counting Function
- For any set of points $S$, define $\Pi_H(S) = \{h \cap S : h \in H\}$ and $\Pi_H(m) = \max_{|S| \leq m}\{|\Pi_H(S)|\}$
- $\Pi_H(m)$ counts the maximum number of labelings realized by $H$ on $m$ points, so $\Pi_H(m) \leq 2^m$ always
- Important Lemma: for $m \geq d_H$, $\Pi_H(m) \leq m^{d_H}$
- Proof is by double induction on $m$ and $d_H$
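A small empirical check of the lemma, using intervals on the line ($d_H = 2$) as the hypothesis class:

```python
def interval_dichotomies(m):
    # Labelings of m collinear points realizable by an interval [a, b]:
    # the contiguous runs of positives, plus the all-negative labeling.
    labelings = {(0,) * m}
    for i in range(m):
        for j in range(i, m):
            labelings.add(tuple(int(i <= k <= j) for k in range(m)))
    return len(labelings)

for m in range(2, 8):
    print(m, interval_dichotomies(m), m ** 2)   # Pi_H(m) <= m^2 since d_H = 2
```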
Two Clever Tricks
- Idea: in the expression $|H|(1-\epsilon)^m$, try to replace $|H|$ by $\Pi_H(2m)$
- Two-Sample Trick: probability some bad $h \in H$ survives $m$ examples $\leq$ probability some $h \in H$ makes NO mistakes on the first $m$ examples and $\geq \epsilon m/2$ mistakes on the second $m$ examples
- Randomization Trick: draw the $2m$-sample $S = S_1 S_2$ first, randomly split it into $S_1$ and $S_2$ later
- Fix one of the $\Pi_H(2m)$ labelings of $S$ that makes at least $\epsilon m/2$ mistakes; the probability (wrt the split) that all mistakes end up in $S_2$ is exponentially small in $\epsilon m$
- Failure probability $\leq \Pi_H(2m)2^{-\epsilon m/2}$; sufficient sample size is $m = O((d_H/\epsilon)\log(1/\epsilon) + (1/\epsilon)\log(1/\delta))$
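A sketch of the final solve (constants dropped), using the lemma's bound $\Pi_H(2m) \leq (2m)^{d_H}$:

```latex
\[
\Pi_H(2m)\,2^{-\epsilon m/2}
\;\le\; (2m)^{d_H}\,2^{-\epsilon m/2} \;\le\; \delta
\quad\Longleftarrow\quad
m \ \gtrsim\ \frac{d_H}{\epsilon}\log\frac{1}{\epsilon}
           + \frac{1}{\epsilon}\log\frac{1}{\delta} .
\]
```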
Extensions and Refinements
- Not all $h \in H$ with $\epsilon(h) \geq \epsilon$ have $\epsilon(h) = \epsilon$; compute distribution-specific error shells (Energy vs. Entropy, Statistical Mechanics)
- Replace $\Pi_H(m)$ with the distribution-specific expected number of dichotomies
- "Uniform Convergence Happens": unrealizable $f$, squared error, log-loss, ...
Best Experts, Multiplicative Updates, Weighted Majority...
- Assume nothing about the data
- Input sequence $x_1, \ldots, x_m$ arbitrary
- Label sequence $y_1, \ldots, y_m$ arbitrary
- Given $x_{m+1}$, want to predict $y_{m+1}$
- What could we hope to say?
A Modest Proposal
- Only compare performance to a fixed collection of "expert advisors" $h_1, \ldots, h_N$
- Expert $h_i$ predicts $y^i_j$ on $x_j$
- Goal: for any data sequence, match the number of mistakes made by the best advisor on that sequence
- Idea: to punish us, the adversary is forced to punish all the advisors
- As in sample complexity analysis under probabilistic assumptions, expect performance to degrade for large $N$
A Simple Algorithm
- Maintain a weight $w_i$ for advisor $h_i$, with $w_i = 1/N$ initially
- Predict using the weighted majority at each trial (sketched below)
- If we err, and $h_i$ erred, set $w_i \leftarrow w_i/2$
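In code, the algorithm is a few lines; a minimal sketch (the expert and stream representations are my own):

```python
def weighted_majority(experts, stream, beta=0.5):
    # experts: list of functions h_i: x -> 0/1; stream: (x, y) pairs,
    # arbitrary. beta is the penalty factor (the slide uses 1/2).
    weights = [1.0 / len(experts)] * len(experts)
    mistakes = 0
    for x, y in stream:
        votes = [h(x) for h in experts]
        w1 = sum(w for w, v in zip(weights, votes) if v == 1)
        w0 = sum(w for w, v in zip(weights, votes) if v == 0)
        pred = int(w1 >= w0)                 # weighted-majority prediction
        if pred != y:
            mistakes += 1
            # On our mistake, halve the weight of every advisor that erred.
            weights = [w * beta if v != y else w
                       for w, v in zip(weights, votes)]
    return mistakes
```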
A Simple Analysis
- If $W = \sum_{i=1}^N w_i$, then when we err, $W' \leq (W/2) + (1/2)(W/2) = (3/4)W$
- After $k$ errors, $W \leq 1 \cdot (3/4)^k$
- If expert $i$ makes $\ell$ errors, then $w_i \geq (1/N)(1/2)^\ell$
- Total weight $\geq w_i$ gives $(3/4)^k \geq (1/N)(1/2)^\ell$, so $k \leq (1/\log(4/3))(\ell + \log(N)) \approx 2.41(\ell + \log(N))$
- Sparse vs. distributed representations
Statistical Query Learning
- Model algorithms that use a random sample only to compute statistics
- Replace the source of random examples of target $f$ drawn from distribution $P$ by an oracle for estimating probabilities
- Learning algorithm submits a query $\chi : X \times Y \to \{0,1\}$
- Example: $\chi(x, y) = 1$ if and only if $x_5 = y$
- Response is an estimate of $P_\chi = \Pr_P[\chi(x, f(x)) = 1]$
- Demand that the complexity of queries and accuracy of estimates permit efficient simulation from examples
- Captures almost all natural algorithms
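Simulating the oracle from examples is just empirical averaging; a minimal sketch, with the query from the slide:

```python
def sq_oracle(chi, examples):
    # Empirical estimate of P_chi = Pr_P[chi(x, f(x)) = 1]; by Chernoff,
    # O((1/tau^2) log(1/delta)) examples give additive accuracy tau, whp.
    return sum(chi(x, y) for x, y in examples) / len(examples)

chi = lambda x, y: int(x[4] == y)   # the slide's query: x_5 = y
```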
Noise Tolerance of SQ Algorithms
- Let the source of examples return $\langle x, y \rangle$, where $y = f(x)$ with probability $1-\eta$ and $y = \neg f(x)$ with probability $\eta$
- Define $X_1 = \{x : \chi(x,0) \neq \chi(x,1)\}$, $X_2 = X - X_1$, and $p_1 = \Pr_P[x \in X_1]$
- $P_\chi = (p_1/(1-2\eta))\,(\Pr_{P,\eta}[\chi(x,y) = 1 \mid x \in X_1] - \eta) + \Pr_{P,\eta}[(\chi(x,y) = 1) \wedge (x \in X_2)]$
- Can efficiently estimate $P_\chi$ from a source of noisy examples
- Only known method for noise tolerance; other decompositions of $P_\chi$ give tolerance to other forms of corruption
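A sketch of the resulting estimator, assuming the noise rate $\eta < 1/2$ is known (the helper names are mine):

```python
def sq_from_noisy(chi, noisy_examples, eta):
    # Estimate P_chi from examples with classification noise rate eta < 1/2,
    # assumed known, using the decomposition over X1 and X2 above.
    in_X1 = lambda x: chi(x, 0) != chi(x, 1)    # label-sensitive inputs
    m = len(noisy_examples)
    X1 = [(x, y) for x, y in noisy_examples if in_X1(x)]
    p1 = len(X1) / m                            # estimate of Pr_P[x in X1]
    q_noisy = sum(chi(x, y) for x, y in X1) / len(X1) if X1 else 0.0
    term_X1 = (p1 / (1 - 2 * eta)) * (q_noisy - eta)   # invert the noise
    term_X2 = sum(chi(x, y) for x, y in noisy_examples
                  if not in_X1(x)) / m          # labels irrelevant on X2
    return term_X1 + term_X2
```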
Limitations of SQ Algorithms
- How many queries are required to learn?
- SQ dimension $d_{F,P}$: the largest number of pairwise (almost) uncorrelated functions in $F$ with respect to $P$
- $d_{F,P}^{1/3}$ queries required for nontrivial generalization (easy case: queries are functions in $F$)
- Also lower bounded by the VC dimension of $F$
- Application: no SQ algorithm for small decision trees under the uniform distribution (including C4.5/CART)
Boosting
- Replace the source of random examples with an oracle accepting distributions
- On input $P_i$, the oracle returns a function $h_i \in H$ such that $\Pr_{P_i}[h_i(x) \neq f(x)] \leq 1/2 - \gamma$, $\gamma \in (0, 1/2]$ (weak learning)
- Each successive $P_i$ is a filtering of the true input distribution $P$, so it can be simulated from random examples
- The oracle models a mediocre but nontrivial heuristic
- Goal: combine the $h_i$ into $h$ such that $\Pr_P[h(x) \neq f(x)] \leq \epsilon$
How is it Possible?
- Intuition: each successive $P_i$ should force the oracle to learn something "new"
- Example: $P_1 = P$; $P_2$ balances $h_1(x) = f(x)$ and $h_1(x) \neq f(x)$; $P_3$ is $P$ restricted to $h_1(x) \neq h_2(x)$
- $h(x)$ is the majority vote of $h_1, h_2, h_3$
- Claim: if each $h_i$ satisfies $\Pr_{P_i}[h_i(x) \neq f(x)] \leq \beta$, then $\Pr_P[h(x) \neq f(x)] \leq 3\beta^2 - 2\beta^3$
- Represent the error algebraically, solve a constrained maximization problem
- Now recurse
Measuring Performance
- How many rounds (filtered distributions) are required to go from $1/2 - \gamma$ to $\epsilon$?
- Recursive scheme: polynomial in $\log(1/\epsilon)$ and $1/\gamma$; hypothesis is a ternary majority tree of weak hypotheses
- AdaBoost (sketch below): $O((1/\gamma^2)\log(1/\epsilon))$; hypothesis is a linear threshold function of weak hypotheses
- Top-down decision tree algorithms: $(1/\epsilon)^{1/\gamma^2}$; hypothesis is a decision tree over weak hypotheses
- The advantage $\gamma$ is actually problem- and algorithm-dependent
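For concreteness, a compact AdaBoost sketch over an abstract weak-learning oracle (the `weak_learner` interface is hypothetical; labels are assumed to be $\pm 1$):

```python
import math

def adaboost(weak_learner, S, rounds):
    # weak_learner(S, D) is a hypothetical oracle returning some h with
    # error below 1/2 on sample S weighted by distribution D.
    m = len(S)
    D = [1.0 / m] * m
    hyps, alphas = [], []
    for _ in range(rounds):
        h = weak_learner(S, D)
        err = sum(d for d, (x, y) in zip(D, S) if h(x) != y)
        if err >= 0.5:
            break                           # no edge: weak learning failed
        err = max(err, 1e-12)               # guard against a perfect round
        alpha = 0.5 * math.log((1 - err) / err)
        hyps.append(h)
        alphas.append(alpha)
        # Reweight: up-weight mistakes, down-weight correct predictions.
        D = [d * math.exp(-alpha * y * h(x)) for d, (x, y) in zip(D, S)]
        Z = sum(D)
        D = [d / Z for d in D]
    # Final hypothesis: a linear threshold over the weak hypotheses.
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hyps)) >= 0 else -1
```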
Computational Intractability Results for Learning
- Representation-dependent: hard to learn the functions in $F$ using hypothesis representation $H$
- Representation-independent: hard to learn the functions in $F$, period
A Standard Reduction
- Suppose that, given examples of $f \in F$ according to any $P$, algorithm $L$ outputs $h \in H$ with $\epsilon(h) \leq \epsilon$ with probability $1-\delta$, in time polynomial in $1/\epsilon, 1/\delta, n, s(f)$
- Given a sample $S$ of $m$ pairs $\langle x_i, f(x_i) \rangle$, run $L$ using $\epsilon = 1/(2m)$ and drawing randomly from $S$ to provide examples
- Then if there exists $h \in H$ consistent with $S$, $L$ will find it with probability $1-\delta$ (any inconsistent $h$ has error at least $1/m > \epsilon$ under the uniform draws from $S$)
- So: reduce a hard combinatorial problem to the consistency problem for $(F, H)$; then learning is hard unless RP = NP
- Note: the converse follows from Occam's Razor
Examples of Representation-Dependent Intractability
- Consistency for (k-term DNF, k-term DNF) is as hard as graph k-coloring
- Consistency for (k-term DNF, 2k-term DNF) is as hard as approximate graph coloring
- Consistency for (k-term DNF, k-CNF) is easy
- Approximate consistency for (conjunctions with errors, conjunctions) is as hard as set covering
Representation-Independent Intractability
- Complexity theory analogy: a set of examples is a problem instance, a simple hypothesis is a solution for the instance
- "Read off" a k-coloring of the original graph $G$ from the syntactic form of a k-term DNF formula for the associated sample $S_G$
- With no restrictions on the form of $h$, everything is learnable
- At least ask that hypotheses be polynomially evaluatable: compute $h(x)$ in time polynomial in $|x|, s(h)$
- Want learning to be hard for any efficient algorithm
Public-Key Cryptography and Learning
- A sends many encrypted messages $F(x_1), \ldots, F(x_m)$ to B
- An efficient eavesdropper E may obtain (or generate) many pairs $\langle x_i, F(x_i) \rangle$
- Want it to be hard for E to decrypt a "new" $F(x')$ (generalization)
- Thus, E wants to "learn" the inverse of $F$ (decryption) from examples!
- The inverse is "simple" given the "trapdoor" (fairness of learning)
Applications
- PKC gives encryption schemes $F$ with polynomial-time inverses, where E's task is as hard as factoring ($F(x) = x^e \bmod pq$)
- Thus, learning (small) boolean circuits is intractable
- Also hard: boolean formulae, finite automata, small-depth neural networks
- DNF and decision trees?
The Fourier Representation
- Any function $f : \{0,1\}^n \to \Re$ is a $2^n$-dimensional real vector
- Inner product: $\langle f, h \rangle = (1/2^n)\sum_x f(x)h(x)$
- For $\pm 1$-valued functions, $\langle f, h \rangle = E_U[f(x)h(x)] = \Pr_U[f(x) = h(x)] - \Pr_U[f(x) \neq h(x)]$
- Set of $2^n$ orthogonal basis functions: $g_z(x) = +1$ if $x$ has even parity on the bits $i$ where $z_i = 1$, else $g_z(x) = -1$
- Write $f(x) = \sum_z \hat{f}(z)g_z(x)$, with coefficients $\hat{f}(z) = \langle f, g_z \rangle$
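A brute-force sketch of the basis and the coefficients for small $n$ (the example target is my own choice):

```python
from itertools import product

n = 3

def g(z, x):
    # Basis function g_z: +1 iff x has even parity on the bits where z_i = 1.
    return 1 if sum(xi for xi, zi in zip(x, z) if zi) % 2 == 0 else -1

def coefficient(f, z):
    # hat f(z) = <f, g_z> = (1/2^n) sum_x f(x) g_z(x)
    return sum(f(x) * g(z, x) for x in product([0, 1], repeat=n)) / 2 ** n

# Hypothetical example target: parity of the first two bits, +/-1-valued.
f = lambda x: 1 if (x[0] + x[1]) % 2 == 0 else -1
coeffs = {z: coefficient(f, z) for z in product([0, 1], repeat=n)}
print(sum(c * c for c in coeffs.values()))   # Parseval: the squares sum to 1
```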
Fun Fourier Facts
- Parseval's Identity: $\langle f, f \rangle = \sum_z (\hat{f}(z))^2 = 1$ for $\pm 1$-valued $f$
- Small DNF $f$: $\sum_{z : w(z) \leq c_0 \log(n)} (\hat{f}(z))^2 \geq 1 - \epsilon$, i.e., the spectrum is concentrated on low-order coefficients
- Small decision tree $f$: the spectrum is sparse, with only a small number of non-zero coefficients
- Tree-based spectrum estimation using membership queries