A Short Course in Computational Learning Theory
ICML '97 and AAAI '97 Tutorials
Michael Kearns, AT&T Laboratories

Outline
Sample Complexity/Learning Curves: finite classes, Occam's Razor, VC dimension
Best Experts/Multiplicative Update Algorithms
Statistical Query Learning and Noisy Data
Boosting
Computational Intractability Results
Fourier Methods and Membership Queries

The Basic Model
Target function f : X → Y (Y = {0,1} or {+1,−1}), which may come from a known class F
Input distribution/density P over X, which may be known or arbitrary
Class of hypothesis functions H from X to Y
Random sample S of m pairs ⟨x_i, f(x_i)⟩
Generalization error: ε(h) = Pr_P[h(x) ≠ f(x)]
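
To make the model concrete, here is a minimal Python sketch of this setup; the dimension n, the conjunction target f, and the choice of the uniform distribution for P are illustrative assumptions, not part of the tutorial.

```python
import random

n = 5                                          # input dimension (illustrative)
f = lambda x: int(x[0] == 1 and x[2] == 1)     # hypothetical target: the conjunction x1 AND x3

def draw_sample(m, seed=0):
    """Random sample S of m pairs <x_i, f(x_i)>, with x drawn from P (here uniform on {0,1}^n)."""
    rng = random.Random(seed)
    return [(x, f(x)) for x in (tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(m))]

def generalization_error(h, trials=100_000, seed=1):
    """Monte Carlo estimate of eps(h) = Pr_P[h(x) != f(x)] under the same P."""
    rng = random.Random(seed)
    xs = (tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(trials))
    return sum(h(x) != f(x) for x in xs) / trials
```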

Measures of Efficiency
Parameters ε, δ ∈ [0,1]: ask for ε(h) ≤ ε with probability at least 1 − δ
Input dimension n
Target function "complexity" s(f)
Sample and computational complexity: should scale nicely with 1/ε, 1/δ, n, s(f)

Variations Ad Infinitum
Target f from known class F, distribution P arbitrary: PAC/Valiant model
Fixed, known P: distribution-specific PAC model
Target f arbitrary: agnostic model
Target f from known class F, add black-box access to f: PAC model with membership queries

Sample Complexity/Learning Curves
Algorithm-specific vs. general
Assume f ∈ H
How many examples does an arbitrary consistent algorithm require to achieve ε(h) ≤ ε?

The Case of Finite H
Fix the unknown f ∈ H and the distribution P
Fix a "bad" h_0 ∈ H (ε(h_0) ≥ ε)
Probability that h_0 survives m examples ≤ (1 − ε)^m (independence)
Probability that some bad hypothesis survives ≤ |H|(1 − ε)^m (union bound)
Solve |H|(1 − ε)^m ≤ δ: m = O((1/ε) log(|H|/δ)) suffices
Example: |H| = 2^n, so m = O((n/ε) log(1/δ)) suffices
Independent of the distribution P
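
As a numeric illustration of the finite-H argument (my own sketch, not from the slides), the sufficient sample size can be obtained either by solving |H|(1 − ε)^m ≤ δ directly or from the closed-form bound m ≥ (1/ε) ln(|H|/δ):

```python
import math

def finite_class_sample_size(H_size, eps, delta):
    """Return (smallest m with H_size*(1-eps)**m <= delta,
               the closed-form bound ceil(ln(H_size/delta)/eps))."""
    closed_form = math.ceil(math.log(H_size / delta) / eps)
    exact = math.ceil(math.log(delta / H_size) / math.log(1 - eps))
    return exact, closed_form

# e.g. |H| = 2**20, eps = 0.1, delta = 0.05: the union bound says ~169 examples suffice,
# independent of the distribution P.
print(finite_class_sample_size(2 ** 20, 0.1, 0.05))
```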

Occam's Razor
Assume f ∈ H_0, but given m examples h is chosen from H_m, where H_0 ⊆ H_1 ⊆ ... ⊆ H_m
The same argument establishes that the failure probability is bounded by |H_m|(1 − ε)^m
Example: |H_m| = 2^{n m^β}; then m ≥ (n m^β/ε) log(1/δ) suffices, i.e. m = O(((n/ε) log(1/δ))^{1/(1−β)})
Compression (β < 1) implies Learning

An Example: Covering Methods
Target f is a conjunction of boolean attributes chosen from x_1, ..., x_n
Eliminate any x_i such that x_i = 0 in a positive example
Any surviving x_i "covers" or explains all negative examples with x_i = 0
Greedy approach: some x_i always covers at least a 1/k_opt fraction of the remaining negatives, so the remainder shrinks by a factor (1 − 1/k_opt) per step and all negatives are covered with O(k_opt log(m)) attributes
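
A sketch of this covering method in Python (the data layout and names are my own; this is the greedy variant described above, not the tutorial's exact pseudocode):

```python
def learn_conjunction(sample, n):
    """sample: list of (x, label) with x a 0/1 tuple of length n. Returns a
    conjunction hypothesis and the chosen attribute indices."""
    positives = [x for x, y in sample if y == 1]
    negatives = [x for x, y in sample if y == 0]

    # Step 1: elimination -- drop any attribute that is 0 in some positive example.
    candidates = {i for i in range(n) if all(x[i] == 1 for x in positives)}

    # Step 2: greedy covering -- a surviving x_i "covers" the negatives where x_i = 0;
    # repeatedly pick the attribute covering the most uncovered negatives.
    uncovered = set(range(len(negatives)))
    chosen = []
    while uncovered and candidates:
        best = max(candidates, key=lambda i: sum(1 for j in uncovered if negatives[j][i] == 0))
        covered = {j for j in uncovered if negatives[j][best] == 0}
        if not covered:
            break                      # remaining negatives are inconsistent with a conjunction
        chosen.append(best)
        candidates.discard(best)
        uncovered -= covered

    return (lambda x: int(all(x[i] == 1 for i in chosen))), chosen
```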

Infinite H and the VC Dimension
Still true that the probability a bad h ∈ H survives is ≤ (1 − ε)^m, but now |H| is infinite
H shatters x_1, ..., x_d if all 2^d labelings are realized by H
VC dimension d_H: size of the largest shattered set

Examples of the VC Dimension
Finite class H: need 2^d functions to shatter d points, so d_H ≤ log(|H|)
Axis-parallel rectangles in the plane: d_H = 4
Convex d-gons in the plane: d_H = 2d + 1
Hyperplanes in n dimensions: d_H = n + 1
d_H is usually "nicely" related to the number of parameters or number of operations
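
For small finite cases, shattering and the VC dimension can be checked by brute force; the threshold-function example below is my own illustration, not from the slides:

```python
from itertools import combinations

def shatters(H, points):
    """True iff the class H (a list of 0/1-valued callables) realizes all 2^d labelings of the points."""
    realized = {tuple(h(x) for x in points) for h in H}
    return len(realized) == 2 ** len(points)

def vc_dimension(H, domain):
    """Largest d such that some d-subset of the finite domain is shattered by H."""
    d = 0
    for size in range(1, len(domain) + 1):
        if any(shatters(H, S) for S in combinations(domain, size)):
            d = size
    return d

# 1-D thresholds h_t(x) = [x >= t]: any single point is shattered, no pair is, so d_H = 1.
domain = list(range(10))
H = [(lambda t: (lambda x: int(x >= t)))(t) for t in range(11)]
print(vc_dimension(H, domain))   # prints 1
```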

The Dichotomy Counting Function
For any set of points S, define Π_H(S) = {h ∩ S : h ∈ H} and Π_H(m) = max_{|S| ≤ m} |Π_H(S)|
Π_H(m) counts the maximum number of labelings realized by H on m points, so Π_H(m) ≤ 2^m always
Important Lemma: for m ≥ d_H, Π_H(m) ≤ m^{d_H}
Proof is by double induction on m and d_H
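
A quick numeric look at the lemma (my own illustration): for fixed d_H, the polynomial bound m^{d_H} is eventually far smaller than the trivial 2^m, which is what lets the (1 − ε)^m term win in the argument on the next slide.

```python
def dichotomy_bounds(m, d):
    """The lemma's bound on Pi_H(m) for m >= d, alongside the trivial 2^m."""
    assert m >= d
    return m ** d, 2 ** m

for m in (10, 20, 40):
    poly, trivial = dichotomy_bounds(m, d=3)   # e.g. a class of VC dimension 3
    print(m, poly, trivial)
# m=10: 1000 vs 1024;  m=20: 8000 vs ~1.0e6;  m=40: 64000 vs ~1.1e12
```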

Two Clever Tricks
Idea: in the expression |H|(1 − ε)^m, try to replace |H| by Π_H(2m)
Two-Sample Trick: the probability that some bad h ∈ H survives m examples is at most twice the probability that some h ∈ H makes NO mistakes on the first m examples and at least εm/2 mistakes on the second m examples
Incremental Randomization Trick: draw the 2m-sample S = S_1 ∪ S_2 first, randomly split it into S_1 and S_2 later
Fix one of the Π_H(2m) labelings of S that makes at least εm/2 mistakes; the probability (with respect to the split) that all of its mistakes end up in S_2 is exponentially small in m
Failure probability ≤ Π_H(2m) 2^{−εm/2}, so a sufficient sample size is m = O((d_H/ε) log(1/ε) + (1/ε) log(1/δ))

Extensions and Refinements
Not all h ∈ H with ε(h) ≥ ε have ε(h) = ε; compute distribution-specific error shells (Energy vs. Entropy, Statistical Mechanics)
Replace Π_H(m) with the distribution-specific expected number of dichotomies
"Uniform Convergence Happens": unrealizable f, squared error, log-loss, ...

Best Experts, Multiplicative Updates, Weighted Majority...
Assume nothing about the data
Input sequence x_1, ..., x_m arbitrary
Label sequence y_1, ..., y_m arbitrary
Given x_{m+1}, want to predict y_{m+1}
What could we hope to say?

A Modest Proposal
Only compare performance to a fixed collection of "expert advisors" h_1, ..., h_N
Expert h_i predicts y^i_j on x_j
Goal: for any data sequence, match the number of mistakes made by the best advisor on that sequence
Idea: to punish us, force the adversary to punish all the advisors
As in sample complexity analysis under probabilistic assumptions, expect performance to degrade for large N

A Simple Algorithm
Maintain a weight w_i for advisor h_i, with w_i = 1/N initially
Predict using a weighted majority vote at each trial
If we err and h_i erred, set w_i ← w_i/2
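
A minimal sketch of this weighted-majority scheme in Python (the interface and the 0/1 prediction encoding are my own choices):

```python
def weighted_majority(expert_predictions, labels):
    """Run the weighted majority algorithm of the slide.
    expert_predictions[t][i] is expert i's 0/1 prediction at trial t; labels[t] is the truth.
    Returns the number of mistakes the algorithm makes."""
    N = len(expert_predictions[0])
    w = [1.0 / N] * N
    mistakes = 0
    for preds, y in zip(expert_predictions, labels):
        vote_1 = sum(w[i] for i in range(N) if preds[i] == 1)
        vote_0 = sum(w[i] for i in range(N) if preds[i] == 0)
        guess = 1 if vote_1 >= vote_0 else 0
        if guess != y:
            mistakes += 1
            # halve only the weights of experts that were also wrong on this trial
            w = [w[i] / 2 if preds[i] != y else w[i] for i in range(N)]
    return mistakes
```

On any sequence, the mistake count of this scheme is bounded in terms of the best expert's mistake count, as the analysis on the next slide shows.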

A Simple Analysis
If W = Σ_{i=1}^N w_i, then when we err W' ≤ (W/2) + (1/2)(W/2) = (3/4)W
After k errors, W ≤ 1·(3/4)^k
If expert i makes ℓ errors, then w_i ≥ (1/N)(1/2)^ℓ
Total weight ≥ w_i gives (3/4)^k ≥ (1/N)(1/2)^ℓ, so k ≤ (1/log(4/3))(ℓ + log(N)) ≤ 2.41(ℓ + log(N))
Sparse vs. distributed representations
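
Written out (a worked version of the slide's chain of inequalities; logarithms are base 2):

```latex
\[
\frac{1}{N}\left(\tfrac{1}{2}\right)^{\ell}
\;\le\; w_i \;\le\; W \;\le\; \left(\tfrac{3}{4}\right)^{k}
\;\Longrightarrow\;
k \log_2\tfrac{4}{3} \;\le\; \ell + \log_2 N
\;\Longrightarrow\;
k \;\le\; \frac{\ell + \log_2 N}{\log_2 (4/3)} \;\approx\; 2.41\,(\ell + \log_2 N).
\]
```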

Statistical Query Learning
Model algorithms that use a random sample only to compute statistics
Replace the source of random examples of the target f drawn from distribution P by an oracle for estimating probabilities
The learning algorithm submits a query χ : X × Y → {0,1}
Example: χ(x, y) = 1 if and only if x_5 = y
The response is an estimate of P_χ = Pr_P[χ(x, f(x)) = 1]
Demand that the complexity of queries and the accuracy of estimates permit efficient simulation from examples
Captures almost all natural algorithms
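
A sketch of how such an oracle can be simulated from a random sample (all names are illustrative; the constants are rough and the confidence bookkeeping is omitted):

```python
import random

def make_sq_oracle(examples, tau, seed=0):
    """Simulate an SQ oracle from labeled examples <x, f(x)>: given a query
    chi: (x, y) -> {0, 1} and tolerance tau, return an estimate of
    P_chi = Pr_P[chi(x, f(x)) = 1] accurate to roughly +/- tau
    (sketch: ~1/tau^2 draws, log(1/delta) confidence factor omitted)."""
    rng = random.Random(seed)

    def oracle(chi):
        m = max(1, int(4.0 / tau ** 2))
        draws = [rng.choice(examples) for _ in range(m)]
        return sum(chi(x, y) for x, y in draws) / m

    return oracle

# The slide's example query: chi(x, y) = 1 iff the fifth bit of x equals the label y.
chi_example = lambda x, y: int(x[4] == y)
```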

Noise Tolerance of SQ Algorithms
Let the source of examples return ⟨x, y⟩, where y = f(x) with probability 1 − η and y = ¬f(x) with probability η
Define X_1 = {x : χ(x, 0) ≠ χ(x, 1)}, X_2 = X − X_1, and p_1 = Pr_P[x ∈ X_1]
P_χ = (p_1/(1 − 2η)) (Pr_{P,η}[χ(x, y) = 1 | x ∈ X_1] − η) + Pr_{P,η}[(χ(x, y) = 1) ∧ (x ∈ X_2)]
Can efficiently estimate P_χ from the source of noisy examples
Only known method for noise tolerance; other decompositions of P_χ give tolerance to other forms of corruption

Limitations of SQ Algorithms
How many queries are required to learn?
SQ dimension d_{F,P}: largest number of pairwise (almost) uncorrelated functions in F with respect to P
Roughly d_{F,P}^{1/3} queries are required for nontrivial generalization (easy case: queries are functions in F)
Also lower bounded by the VC dimension of F
Application: no SQ algorithm for small decision trees under the uniform distribution (including C4.5/CART)

Boosting
Replace the source of random examples with an oracle accepting distributions
On input P_i, the oracle returns a function h_i ∈ H such that Pr_{P_i}[h_i(x) ≠ f(x)] ≤ 1/2 − γ, γ ∈ (0, 1/2] (weak learning)
Each successive P_i is a filtering of the true input distribution P, so it can be simulated from random examples
The oracle models a mediocre but nontrivial heuristic
Goal: combine the h_i into h such that Pr_P[h(x) ≠ f(x)] ≤ ε

How is it Possible?
Intuition: each successive P_i should force the oracle to learn something "new"
Example: P_1 = P; P_2 balances h_1(x) = f(x) and h_1(x) ≠ f(x); P_3 is P restricted to h_1(x) ≠ h_2(x)
h(x) is the majority vote of h_1, h_2, h_3
Claim: if each h_i satisfies Pr_{P_i}[h_i(x) ≠ f(x)] ≤ β, then Pr_P[h(x) ≠ f(x)] ≤ 3β^2 − 2β^3
Represent the error algebraically, solve a constrained maximization problem
Now recurse

Measuring Performance
How many rounds (filtered distributions) are required to go from 1/2 − γ to ε?
Recursive scheme: polynomial in log(1/ε) and 1/γ; the hypothesis is a ternary majority tree of weak hypotheses
AdaBoost: O((1/γ^2) log(1/ε)); the hypothesis is a linear threshold function of weak hypotheses
Top-down decision tree algorithms: (1/ε)^{1/γ^2}; the hypothesis is a decision tree over weak hypotheses
The advantage γ is actually problem- and algorithm-dependent
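
A compact sketch of the AdaBoost round structure referenced above, in its usual reweighting-on-a-fixed-sample form (the weak_learner interface and all names are my own assumptions, not the tutorial's):

```python
import math

def adaboost(examples, weak_learner, rounds):
    """examples: list of (x, y) with y in {-1, +1}. weak_learner(examples, weights)
    must return a hypothesis h with weighted error below 1/2. Returns the boosted
    sign-of-weighted-sum hypothesis."""
    m = len(examples)
    D = [1.0 / m] * m                     # distribution over the sample
    committee = []                        # list of (alpha, h)
    for _ in range(rounds):
        h = weak_learner(examples, D)
        err = sum(D[i] for i, (x, y) in enumerate(examples) if h(x) != y)
        err = min(max(err, 1e-12), 0.5 - 1e-12)          # keep alpha finite and positive
        alpha = 0.5 * math.log((1 - err) / err)
        committee.append((alpha, h))
        # reweight: increase weight on examples the weak hypothesis got wrong
        D = [D[i] * math.exp(-alpha * y * h(x)) for i, (x, y) in enumerate(examples)]
        Z = sum(D)
        D = [d / Z for d in D]
    return lambda x: 1 if sum(a * h(x) for a, h in committee) >= 0 else -1
```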

Computational Intractability Results for Learning
Representation-dependent: hard to learn the functions in F using hypothesis representation H
Representation-independent: hard to learn the functions in F, period

A Standard Reduction
Given examples of f ∈ F drawn according to any P, L outputs h ∈ H with ε(h) ≤ ε with probability 1 − δ, in time polynomial in 1/ε, 1/δ, n, s(f)
Given a sample S of m pairs ⟨x_i, f(x_i)⟩, run algorithm L using ε = 1/(2m) and drawing examples randomly from S
Then if there exists h ∈ H consistent with S, L will find it with probability 1 − δ
So: reduce a hard combinatorial problem to the consistency problem for (F, H), and learning is hard unless RP = NP
Note: the converse follows from Occam's Razor

Examples of Representation-Dependent Intractability
Consistency for (k-term DNF, k-term DNF) is as hard as graph k-coloring
Consistency for (k-term DNF, 2k-term DNF) is as hard as approximate graph coloring
Consistency for (k-term DNF, k-CNF) is easy
Approximate consistency for (conjunctions with errors, conjunctions) is as hard as set covering

Representation-Independent Intractability
Complexity theory: set of examples ↔ problem instance, simple hypothesis ↔ solution for the instance
"Read off" a k-coloring of the original graph G from the syntactic form of a k-term DNF formula for the associated sample S_G
With no restrictions on the form of h, everything is learnable
At least ask that hypotheses be polynomially evaluatable: compute h(x) in time polynomial in |x|, s(h)
Want learning to be hard for any efficient algorithm

Public-Key Cryptography and Learning
A sends many encrypted messages F(x_1), ..., F(x_m) to B
An efficient eavesdropper E may obtain (or generate) many pairs ⟨x_i, F(x_i)⟩
Want it to be hard for E to decrypt a "new" F(x') (generalization)
Thus, E wants to "learn" the inverse of F (decryption) from examples!
The inverse is "simple" given the "trapdoor" (fairness of learning)

Applications
Given by PKC: encryption schemes F with polynomial-time inverses, where E's task is as hard as factoring (F(x) = x^e mod pq)
Thus, learning (small) boolean circuits is intractable
Also hard: boolean formulae, finite automata, small-depth neural networks
DNF and decision trees?

The Fourier Representation
Any function f : {0,1}^n → ℝ is a 2^n-dimensional real vector
Inner product: ⟨f, h⟩ = (1/2^n) Σ_x f(x)h(x)
For ±1-valued f, h: ⟨f, h⟩ = E_U[f(x)h(x)] = Pr_U[f(x) = h(x)] − Pr_U[f(x) ≠ h(x)]
Set of 2^n orthogonal basis functions: g_z(x) = +1 if x has even parity on the bits i where z_i = 1, else g_z(x) = −1
Write f(x) = Σ_z α_z(f) g_z(x), with coefficients α_z(f) = ⟨f, g_z⟩
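
A small sketch that builds the parity basis and computes the coefficients by the inner product above (n = 3 and the AND target are my own illustrative choices):

```python
from itertools import product

n = 3

def parity_basis(z):
    """g_z(x) = +1 if x has even parity on the bits i where z_i = 1, else -1."""
    return lambda x: 1 if sum(x[i] for i in range(n) if z[i] == 1) % 2 == 0 else -1

def fourier_coefficients(f):
    """alpha_z(f) = <f, g_z> = (1/2^n) * sum_x f(x) g_z(x), for every z in {0,1}^n."""
    cube = list(product([0, 1], repeat=n))
    return {z: sum(f(x) * parity_basis(z)(x) for x in cube) / 2 ** n for z in cube}

# Illustrative target: the AND of the first two bits, mapped to +/-1.
f = lambda x: 1 if (x[0] == 1 and x[1] == 1) else -1
coeffs = fourier_coefficients(f)
print(sum(c ** 2 for c in coeffs.values()))   # Parseval: sums to 1 for a +/-1-valued f
```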

Fun Fourier Facts
Parseval's Identity: ⟨f, f⟩ = Σ_z (α_z(f))^2 = 1
Small DNF f: Σ_{z : w(z) ≤ c_0 log(n)} (α_z(f))^2 ≈ 1, where w(z) is the number of 1s in z (almost all of the Fourier weight sits on low-order coefficients)
Small decision tree f: the spectrum is sparse, with only a small number of non-zero coefficients
Tree-based spectrum estimation using membership queries