Introduction to Computational Learning Theory

The classification problem, the consistent hypothesis model, and Probably Approximately Correct (PAC) learning.

Hung Q. Ngo (SUNY at Buffalo), CSE 694: A Fun Course

Outline
1. What is Machine Learning?
2. Learning Models and an Example
3. Probably Approximately Correct (PAC) Learning

Don't Have a Good Definition, Only Examples
- Optical character recognition
- Spam filtering
- Document classification
- (IP) Packet filtering/classification
- Face detection
- Medical diagnosis
- Insider threat detection
- Stock price prediction
- Game playing (chess, go, etc.)

Classification Problems
Input: a set of labeled examples (spam and legitimate emails).
Output: a prediction rule (is this newly received email spam?).
(Diagram: training examples drawn from the sample space feed an ML algorithm, which outputs a prediction rule; the rule then labels a new example.)
Many of the examples on the previous slide are classification problems.

Objectives
Numerous, and sometimes conflicting:
- Accuracy
- Little computational resource usage (time and space)
- Small training set
- General purpose
- Simple prediction rule (Occam's Razor)
- Prediction rule understandable by human experts (avoid black-box behavior)
Perhaps this ultimately leads to an understanding of human cognition and the induction problem! (So far the reverse is truer.)

Learning Model
To characterize these objectives mathematically, we need a mathematical model for learning.

What Do We Mean by a Learning Model?
Definition (Learning Model): a mathematical formulation of a learning problem (e.g., classification).
How do we want the model to behave? Powerful (to capture REAL learning) and simple (to be mathematically feasible). Oxymoron? Maybe not!
By powerful we mean the model should capture, at the very least:
1. What is being learned?
2. Where/how does the data come from?
3. How is the data given to the learner? (offline, online, etc.)
4. Which objective(s) to achieve/optimize? Under which constraints?

An Example: The Consistency Model
1. What is being learned?
   Ω: a domain or instance space consisting of all possible examples.
   c : Ω → {0, 1} is the target concept we want to learn.
2. Where/how does the data come from?
   Data: a subset of m examples from Ω along with their labels, i.e. S = {(x_1, c(x_1)), ..., (x_m, c(x_m))}.
3. How is the data given to the learner? (offline, online, etc.)
   S is given offline.
   C, a known class of concepts containing the unknown concept c, is also given.
4. Which objective(s) to achieve/optimize? Under which constraints?
   Output a hypothesis h ∈ C consistent with the data, or report that no such concept exists.
   The algorithm must run in polynomial time.
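To make the consistency model concrete, here is a minimal sketch (not from the slides) of a brute-force consistent learner over an explicitly enumerated concept class; the class `candidates` and the sample below are illustrative assumptions, and real concept classes are far too large for this approach, which is why the per-class algorithms that follow matter.

```python
# Minimal sketch of a consistency-model learner over a small, explicitly
# enumerated concept class C. Purely illustrative: real classes are far too
# large to enumerate, hence the specialized algorithms on later slides.

def consistent_learner(candidates, sample):
    """Return some h in `candidates` consistent with the labeled sample,
    or None if no such concept exists."""
    for h in candidates:
        if all(h(x) == label for x, label in sample):
            return h
    return None

# Hypothetical class: concepts over {0,1}^3 of the form "x_i is 1".
candidates = [lambda x, i=i: x[i] for i in range(3)]
sample = [((1, 0, 1), 1), ((0, 1, 1), 1), ((1, 1, 0), 0)]
h = consistent_learner(candidates, sample)
print([h(x) for x, _ in sample])   # [1, 1, 0] -- matches the labels
```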

Tricky Issues
C is usually very large, possibly exponential in m, or even infinite!
How do we represent an element of C, and h in particular? A truth table is out of the question, since Ω is huge.
For now, let's say:
- We agree in advance on a particular way to represent C.
- The representation of c in C has size |c| (the number of bits representing c).
- Each example x ∈ Ω has size |x| = O(n).
- The ML algorithm is required to run in time poly(m, n, |c|).

Examples of CM-Learnable and Non-CM-Learnable Concept Classes
CM-learnable concept classes:
- Monotone conjunctions
- Monotone disjunctions
- Boolean conjunctions
- k-CNF
- DNF
- Axis-aligned rectangles
- Separating hyperplanes
Concept classes that are NP-hard to learn:
- k-term DNF
- Boolean threshold functions

Example 1: Monotone Conjunctions Are Learnable
C = the set of formulae on n variables x_1, ..., x_n of the form
    φ = x_{i_1} ∧ x_{i_2} ∧ ... ∧ x_{i_q},  1 ≤ q ≤ n.
The data looks like this:

    x_1 x_2 x_3 x_4 x_5 | c(x)
     1   1   0   0   1  |  1
     1   1   1   0   0  |  0
     1   0   1   0   1  |  1
     1   1   1   0   1  |  1
     0   1   1   1   1  |  0

Output hypothesis: h = x_1 ∧ x_5.
Interpretation: x_1 = "MS Word Running", x_5 = "ActiveX Control On", and c(x) = 1 means "System Down".
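The standard way to find such a hypothesis is to keep exactly the variables that equal 1 in every positive example and then check the negatives. Below is a minimal sketch of that algorithm (the code and names are mine, not from the slides), run on the table above.

```python
# Sketch: consistency-model learner for monotone conjunctions.
# Keep every variable that equals 1 in all positive examples; the resulting
# conjunction is the most specific candidate, so if it fails on a negative
# example, no monotone conjunction fits the data.

def learn_monotone_conjunction(sample, n):
    positives = [x for x, label in sample if label == 1]
    kept = [i for i in range(n) if all(x[i] == 1 for x in positives)]
    h = lambda x: int(all(x[i] == 1 for i in kept))
    if any(h(x) == 1 for x, label in sample if label == 0):
        return None, kept        # no consistent monotone conjunction exists
    return h, kept

# The table from the slide (x_1..x_5, label c(x)).
sample = [
    ((1, 1, 0, 0, 1), 1),
    ((1, 1, 1, 0, 0), 0),
    ((1, 0, 1, 0, 1), 1),
    ((1, 1, 1, 0, 1), 1),
    ((0, 1, 1, 1, 1), 0),
]
h, kept = learn_monotone_conjunction(sample, n=5)
print([i + 1 for i in kept])     # [1, 5], i.e. h = x_1 AND x_5
```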

Example 2: Monotone Disjunctions Are Learnable
C = the set of formulae on n variables x_1, ..., x_n of the form
    φ = x_{i_1} ∨ x_{i_2} ∨ ... ∨ x_{i_q},  1 ≤ q ≤ n.
The data looks like this:

    x_1 x_2 x_3 x_4 x_5 | c(x)
     1   1   0   0   1  |  1
     0   0   1   0   0  |  0
     1   0   1   0   1  |  1
     1   1   1   0   1  |  1
     0   0   1   1   1  |  0

Output hypothesis: h = x_1 ∨ x_2.

Example 3: Boolean Conjunctions Are Learnable
C = the set of formulae on n variables x_1, ..., x_n of the form
    φ = ℓ_{i_1} ∧ ℓ_{i_2} ∧ ... ∧ ℓ_{i_q},  1 ≤ q ≤ n,
where each ℓ is a literal (a variable x_i or its negation ¬x_i).
The data looks like this:

    x_1 x_2 x_3 x_4 x_5 | c(x)
     1   1   0   0   1  |  1
     1   0   1   0   0  |  0
     1   1   0   0   1  |  1
     1   1   0   0   1  |  1
     0   1   1   1   1  |  0

Output hypothesis: h = x_2 ∧ ¬x_3.

Example 4: k-cnf is Learnable C = set of formulae on n variables x 1,..., x n of the form: Data looks like this: ϕ = ( ) ( ) ( ) }{{}}{{}}{{} k literals k literals k literals x 1 x 2 x 3 x 4 x 5 c(x) 1 0 0 0 1 1 1 0 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 0 Output hypothesis h = ( x 2 x 5 ) ( x 3 x 4 ) c Hung Q. Ngo (SUNY at Buffalo) CSE 694 A Fun Course 14 / 35

Example 5: DNF Is Learnable
C = the set of formulae on n variables x_1, ..., x_n of the form
    φ = (term) ∨ (term) ∨ ... ∨ (term),
where each term is a conjunction of literals.
The data looks like this:

    x_1 x_2 x_3 x_4 x_5 | c(x)
     1   0   0   0   1  |  1
     1   0   1   1   1  |  1
     1   0   1   0   0  |  0

The output hypothesis is trivially
    h = (x_1 ∧ ¬x_2 ∧ ¬x_3 ∧ ¬x_4 ∧ x_5) ∨ (x_1 ∧ ¬x_2 ∧ x_3 ∧ x_4 ∧ x_5).
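A quick sketch of this trivial learner (my own illustration, not from the slides): output one conjunctive term per positive example, so the hypothesis simply memorizes the training data.

```python
# Sketch: the "trivial" consistent DNF hypothesis -- one term per positive
# example. It memorizes the training data, which is exactly why consistency
# alone says nothing about generalization.

def trivial_dnf(sample):
    terms = [x for x, label in sample if label == 1]
    # h(x) is 1 iff x exactly matches one of the positive examples.
    return lambda x: int(any(x == t for t in terms))

sample = [
    ((1, 0, 0, 0, 1), 1),
    ((1, 0, 1, 1, 1), 1),
    ((1, 0, 1, 0, 0), 0),
]
h = trivial_dnf(sample)
print([h(x) for x, _ in sample])   # [1, 1, 0]
```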

Example 6: Axis-Aligned Rectangles Are Learnable
C is the set of all axis-parallel rectangles.
(Figure: positive and negative examples in the plane, with the target rectangle and a hypothesis rectangle fit around the positive examples.)
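The algorithm suggested by the figure is the usual tightest-fit rule: output the smallest axis-aligned rectangle containing all positive examples. Here is a minimal 2-D sketch under that assumption, with made-up data.

```python
# Sketch: learn an axis-aligned rectangle by taking the tightest fit
# around the positive examples (2-D version).

def learn_rectangle(sample):
    pos = [p for p, label in sample if label == 1]
    xs, ys = [p[0] for p in pos], [p[1] for p in pos]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    return lambda p: int(lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y)

# Illustrative data (not from the slide's figure).
sample = [((2, 3), 1), ((4, 5), 1), ((3, 4), 1), ((0, 0), 0), ((6, 1), 0)]
h = learn_rectangle(sample)
print([h(p) for p, _ in sample])   # [1, 1, 1, 0, 0]
```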

Example 7: Separating Hyperplanes Are Learnable
C is the set of all hyperplanes in R^n.
Solvable with an LP solver (a kind of algorithmic Farkas lemma).
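One way to realize the LP-solver remark is the feasibility program sketched below (my formulation, with made-up data, not the slide's): find w and b such that y_i(w·x_i + b) ≥ 1 for every labeled example, using scipy.optimize.linprog.

```python
# Sketch: find a consistent separating hyperplane with a linear program.
# Feasibility: y_i * (w . x_i + b) >= 1 for all examples, labels y_i in {-1,+1}.
import numpy as np
from scipy.optimize import linprog

def consistent_hyperplane(X, y):
    m, n = X.shape
    # Variables z = [w_1..w_n, b]. linprog wants A_ub @ z <= b_ub, so rewrite
    # y_i*(w.x_i + b) >= 1  as  -y_i*(x_i . w) - y_i*b <= -1.
    A_ub = -np.hstack([y[:, None] * X, y[:, None]])
    b_ub = -np.ones(m)
    c = np.zeros(n + 1)                      # pure feasibility, no objective
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1), method="highs")
    return (res.x[:n], res.x[n]) if res.success else None

# Illustrative linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = consistent_hyperplane(X, y)
print(np.sign(X @ w + b))                    # [ 1.  1. -1. -1.]
```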

Example 8: k-term DNF Is Not Learnable, k ≥ 2
C = the set of formulae on n variables x_1, ..., x_n of the form
    φ = T_1 ∨ T_2 ∨ ... ∨ T_k,
where each term T_i is a conjunction of literals.
Theorem. The problem of finding a k-term DNF formula consistent with given data S is NP-hard, for any k ≥ 2.
Proof idea: reduce graph 3-coloring to this problem.

Example 9: Boolean Threshold Functions Are Not Learnable
Each concept is represented by a vector c ∈ {0, 1}^n and a threshold b ∈ N.
An example x ∈ {0, 1}^n is positive iff c_1 x_1 + ... + c_n x_n ≥ b.

Problems with the Consistency Model
- It does not take generalization (prediction performance) into account.
- No noise is involved (real examples are never perfect).
- DNF is learnable but k-term DNF is not?
- Strict consistency often means over-fitting.

The PAC Model, Informally
1. What is being learned?
   A domain Ω and a concept c : Ω → {0, 1}.
2. Where/how does the data come from?
   Data: S = {(x_1, c(x_1)), ..., (x_m, c(x_m))}.
   Each x_i is drawn from Ω according to some fixed but unknown distribution D.
3. How is the data given to the learner? (offline, online, etc.)
   S is given offline.
   The concept class C (∋ c), along with an implicit representation, is known.
4. Which objective(s) to achieve/optimize? Under which constraints?
   Efficiently output a hypothesis h ∈ C so that, with high probability, the generalization error
       err_D(h) := Prob_{x∼D}[h(x) ≠ c(x)]
   is small.

The PAC Model: Preliminary Definition
Definition (PAC learnability). A concept class C is PAC learnable if there is an algorithm A (possibly randomized) satisfying the following: for any 0 < ε < 1/2 and 0 < δ < 1/2, and for any distribution D on Ω,
- A draws m examples from D, along with their labels;
- A outputs a hypothesis h ∈ C such that Prob[err_D(h) ≤ ε] ≥ 1 − δ.
Definition (Efficient PAC learnability). If A also runs in time poly(1/ε, 1/δ, n, |c|), then C is efficiently PAC learnable.
In particular, m must be poly(1/ε, 1/δ, n, |c|) for C to be efficiently PAC learnable.

Some Initial Thoughts on the Model
- Noise is still not explicitly part of the model. However, intuitively, if the (example, label) error rate is relatively small, the learner can cope with noise by reducing ε and δ.
- The requirement that the learner work for any D seems quite strong. It is quite amazing that non-trivial concepts are learnable. Can we do better for some problems if D is known in advance? Is there a theorem to this effect?
- The i.i.d. assumption on the samples is also somewhat strong. David Aldous and Umesh V. Vazirani, "A Markovian Extension of Valiant's Learning Model," Inf. Comput. 117(2): 181-186 (1995), show that the i.i.d. assumption can be relaxed a little.

Some Examples
Efficiently PAC-learnable classes:
- Boolean conjunctions
- Axis-aligned rectangles
- k-CNF
- k-DL (decision lists; homework!)
Classes not efficiently PAC-learnable (with their natural representations):
- k-term DNF (that nasty guy again!)
- Boolean threshold functions
- Unions of k half-spaces, k ≥ 3
- DNF
- k-juntas

1) Boolean Conjunctions Are Efficiently PAC-Learnable
We need to produce h = ℓ_1 ∧ ℓ_2 ∧ ... ∧ ℓ_k, where the ℓ_i are literals.
- Start with h = x_1 ∧ ¬x_1 ∧ ... ∧ x_n ∧ ¬x_n.
- For each example (a, c(a) = 1) drawn from D, remove from h all literals contradicting the example.
  E.g., if the example is (x_1 = 0, x_2 = 1, x_3 = 0, x_4 = 0, x_5 = 1, c(x) = 1), then we remove the literals x_1, ¬x_2, x_3, x_4, ¬x_5 from h (if they haven't been removed before).
- h always contains all literals of c; thus c(a) = 0 ⟹ h(a) = 0 for all a ∈ Ω.
- Hence h(a) ≠ c(a) iff c(a) = 1 and there is a literal ℓ ∈ h \ c such that a(ℓ) = 0.

    err_D(h) = Prob_{a∼D}[h(a) ≠ c(a)]
             = Prob_{a∼D}[c(a) = 1 and a(ℓ) = 0 for some ℓ ∈ h \ c]
             ≤ Σ_{ℓ ∈ h \ c} Prob_{a∼D}[c(a) = 1 and a(ℓ) = 0],

and we write p(ℓ) := Prob_{a∼D}[c(a) = 1 and a(ℓ) = 0].
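A minimal sketch of this elimination algorithm (the sampling oracle, the toy target, and the sample size are my own illustrations): start from all 2n literals and delete every literal contradicted by a positive example.

```python
# Sketch: the elimination algorithm for Boolean conjunctions.
# A literal is a pair (i, s): s = 1 means x_i, s = 0 means NOT x_i.
import random

def learn_conjunction(draw_example, m, n):
    """draw_example() returns (x, c(x)) with x a 0/1 tuple of length n."""
    literals = {(i, s) for i in range(n) for s in (0, 1)}   # all 2n literals
    for _ in range(m):
        x, label = draw_example()
        if label == 1:
            # A positive example contradicts the literal (i, 1 - x[i]) for each i.
            literals -= {(i, 1 - x[i]) for i in range(n)}
    return lambda x: int(all(x[i] == s for i, s in literals))

n = 5
def target(x):                    # toy target: c(x) = x_1 AND NOT x_3 (assumed)
    return int(x[0] == 1 and x[2] == 0)

def draw_example():               # uniform distribution over {0,1}^5 (assumed)
    x = tuple(random.randint(0, 1) for _ in range(n))
    return x, target(x)

h = learn_conjunction(draw_example, m=200, n=n)
test = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(1000)]
print(sum(h(x) != target(x) for x in test) / 1000)   # empirical error, near 0
```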

1) Boolean Conjunctions Are Efficiently PAC-Learnable (continued)
So, if p(ℓ) ≤ ε/(2n) for every ℓ ∈ h \ c, then we're OK (there are at most 2n such literals, so err_D(h) ≤ ε).
How many samples from D must we take to ensure p(ℓ) ≤ ε/(2n) for all ℓ ∈ h \ c with probability at least 1 − δ?
- Consider an ℓ ∈ h \ c for which p(ℓ) > ε/(2n); call it a bad literal.
- Each sample removes ℓ with probability p(ℓ).
- So ℓ survives m samples with probability at most (1 − p(ℓ))^m < (1 − ε/(2n))^m.
- Some bad literal survives with probability at most 2n(1 − ε/(2n))^m ≤ 2n e^{−εm/(2n)} ≤ δ, provided m ≥ (2n/ε)(ln(2n) + ln(1/δ)).
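As a quick numeric check of the final bound (the parameter values below are arbitrary illustrations, not from the slides):

```python
# Sketch: evaluate the sample bound m >= (2n/eps) * (ln(2n) + ln(1/delta)).
import math

def conjunction_sample_bound(n, eps, delta):
    return math.ceil((2 * n / eps) * (math.log(2 * n) + math.log(1 / delta)))

print(conjunction_sample_bound(n=20, eps=0.1, delta=0.05))   # 2674
```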

2) k-cnf is Efficiently PAC-Learnable Say k = 3 We can reduce learning 3-CNF to learning (monotone) conjunctions For every triple of literals u, v, w, create a new variable y u,v,w, for a total of O(n 3 ) variables Basic idea: (u v w) y u,v,w Each example from 3-CNF can be transformed into an example for the conjunctions problem under variables y u,v,w A hypothesis h for conjunctions can be transformed back easily. c Hung Q. Ngo (SUNY at Buffalo) CSE 694 A Fun Course 28 / 35

3) Axis-Parallel Rectangles Are Efficiently PAC-Learnable
- The algorithm is the same as in the consistency model.
- The error is the area difference between the target rectangle c and the hypothesis rectangle h, where area is measured as density according to D.
- Hence, even for a region of weight ε, the probability that all m samples miss it is (1 − ε)^m.
- So we only need m ≥ (1/ε) ln(1/δ).

4) k-term DNF Is Not Efficiently PAC-Learnable (k ≥ 2)
Pitt and Valiant, in "Computational limitations on learning from examples," Journal of the ACM, 35(4):965-984, October 1988, showed that k-term DNF is not efficiently learnable unless RP = NP.

The PAC Model: Informal Revision
- Troubling: k-term DNF ⊆ k-CNF, but the latter is learnable and the former is not. Representation matters a great deal!
- We should allow the algorithm to output a hypothesis represented differently from C.
- In particular, let H be a hypothesis class that is more expressive than C ("more expressive" meaning every c ∈ C can be represented by some h ∈ H).
- C is PAC-learnable using H if the same conditions as before hold, except that the algorithm is allowed to output h ∈ H.

The PAC Model: Final Revision
Definition (PAC learnability). A concept class C is PAC learnable using a hypothesis class H if there is an algorithm A (possibly randomized) satisfying the following: for any 0 < ε < 1/2 and 0 < δ < 1/2, and for any distribution D on Ω,
- A draws m examples from D, along with their labels;
- A outputs a hypothesis h ∈ H such that Prob[err_D(h) ≤ ε] ≥ 1 − δ.
If A also runs in time poly(1/ε, 1/δ, n, size(c)), then C is efficiently PAC learnable.
We also want each h ∈ H to be efficiently evaluatable. This is implicit!

Let's Summarize
- 1-term DNF (i.e., conjunctions) is efficiently PAC-learnable using 1-term DNF.
- k-term DNF is not efficiently PAC-learnable using k-term DNF, for any k ≥ 2.
- k-term DNF is efficiently PAC-learnable using k-CNF, for any k ≥ 2.
- k-CNF is efficiently PAC-learnable using k-CNF, for any k ≥ 2.
- Axis-parallel rectangles (in their natural representation) are efficiently PAC-learnable.

More Hardness Results
- Blum and Rivest (Neural Networks, 1989): 3-node neural networks are NP-hard to PAC-learn.
- Alekhnovich et al. (FOCS '04): some classes of Boolean functions and decision trees are hard to PAC-learn.
- Feldman (STOC '06): DNF is not learnable, even with membership queries.
- Guruswami and Raghavendra (FOCS '06): learning half-spaces (perceptrons) with noise is hard.
Main reason: we made no assumption about D, hence these are worst-case results.

Contrast with the Bayesian View
It should not be surprising that some concept classes are not learnable: computational learning theory, like other areas taking the computational viewpoint, is based on worst-case complexity.
The Bayesian viewpoint instead imposes a prior distribution over the concept class.