Computational Learning Theory


Sinh Hoa Nguyen, Hung Son Nguyen
Polish-Japanese Institute of Information Technology; Institute of Mathematics, Warsaw University
February 14, 2006

Outline
1 Introduction
2 PAC Learning
3 VC Dimension

Computational Learning Theory
What general laws constrain inductive learning? We seek a theory that relates:
Probability of successful learning
Number of training examples
Complexity of the hypothesis space
Accuracy to which the target concept is approximated
Manner in which training examples are presented

1^2 + 2^2 + ... + n^2 = n(n + 1)(2n + 1) / 6

Prototypical Concept Learning Task
Given:
Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
Target function c: EnjoySport : X -> {0, 1}
Hypotheses H: conjunctions of literals, e.g. <?, Cold, High, ?, ?, ?>
Training examples D: positive and negative examples of the target function <x_1, c(x_1)>, ..., <x_m, c(x_m)>
Determine:
A hypothesis h in H such that h(x) = c(x) for all x in D?
A hypothesis h in H such that h(x) = c(x) for all x in X?

Sample Complexity
How many training examples are sufficient to learn the target concept?
1 If the learner proposes instances as queries to the teacher: the learner proposes instance x, the teacher provides c(x)
2 If the teacher (who knows c) provides training examples: the teacher provides a sequence of examples of the form <x, c(x)>
3 If some random process (e.g., nature) proposes instances: instance x is generated randomly, the teacher provides c(x)

Sample Complexity: 1
Learner proposes instance x, teacher provides c(x) (assume c is in the learner's hypothesis space H).
Optimal query strategy: play "20 questions":
pick an instance x such that half of the hypotheses in VS classify x positive and half classify x negative.
When this is possible, about log_2 |H| queries suffice to learn c; when it is not possible, even more are needed. (See the sketch below.)
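As a concrete illustration of the "20 questions" strategy, here is a small Python sketch (the function and variable names are ours, not from the slides): the learner repeatedly queries the instance that splits the current version space most evenly, so the version space roughly halves per query and about log2 |H| queries suffice whenever a balanced split exists.

```python
import math

def query_learn(hypotheses, instances, oracle):
    """Membership-query learning sketch: `hypotheses` is a list of functions
    X -> {0, 1}, `instances` is the finite instance space, and `oracle(x)`
    returns the true label c(x)."""
    version_space = list(hypotheses)
    queries = 0
    while len(version_space) > 1:
        # Query the instance whose positive/negative split is most balanced.
        x = min(instances,
                key=lambda x: abs(sum(h(x) for h in version_space)
                                  - len(version_space) / 2))
        label = oracle(x)
        queries += 1
        version_space = [h for h in version_space if h(x) == label]
    return version_space[0], queries

# Toy usage: threshold concepts on {0, ..., 15}; target c(x) = 1 iff x >= 11.
X = list(range(16))
H = [lambda x, t=t: int(x >= t) for t in range(17)]   # |H| = 17
h, q = query_learn(H, X, oracle=lambda x: int(x >= 11))
print(q, math.ceil(math.log2(len(H))))   # q stays close to ceil(log2 |H|)
```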

Sample Complexity: 2
Teacher (who knows c) provides training examples (assume c is in the learner's hypothesis space H).
Optimal teaching strategy: depends on the H used by the learner.
Consider the case H = conjunctions of up to n boolean literals and their negations, e.g., (AirTemp = Warm) AND (Wind = Strong), where AirTemp, Wind, ... each have 2 possible values.
If there are n possible boolean attributes in H, then n + 1 examples suffice. Why?

Sample Complexity: 3
Given:
set of instances X
set of hypotheses H
set of possible target concepts C
training instances generated by a fixed, unknown probability distribution D over X
The learner observes a sequence D of training examples of the form <x, c(x)> for some target concept c in C:
instances x are drawn from distribution D
the teacher provides the target value c(x) for each
The learner must output a hypothesis h estimating c; h is evaluated by its performance on subsequent instances drawn according to D.
Note: probabilistic instances, noise-free classifications.

True Error of a Hypothesis
Definition. The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:
error_D(h) = Pr_{x ~ D}[c(x) != h(x)]
With high probability the true error er^c_D can be estimated from the observed sample error, to within a deviation governed by er^c_D(1 - er^c_D)/|D| (a binomial confidence interval).
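To make the definition concrete, error_D(h) can be approximated by Monte Carlo sampling: draw instances from D and count disagreements. A minimal sketch (the helper name and the toy distribution are our own choices, not from the slides):

```python
import random

def true_error_estimate(h, c, sample_from_D, n=100_000):
    """Monte Carlo estimate of error_D(h) = Pr_{x~D}[c(x) != h(x)].
    `sample_from_D` draws one instance from D; `h` and `c` map an
    instance to a 0/1 label."""
    mistakes = sum(h(x) != c(x) for x in (sample_from_D() for _ in range(n)))
    return mistakes / n

# Toy usage: D uniform on [0, 1], target c(x) = [x >= 0.5], hypothesis shifted by 0.1.
c = lambda x: int(x >= 0.5)
h = lambda x: int(x >= 0.6)
print(true_error_estimate(h, c, sample_from_D=random.random))  # close to 0.1
```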

Two Notions of Error
Training error of hypothesis h with respect to target concept c: how often h(x) != c(x) over the training instances.
True error of hypothesis h with respect to c: how often h(x) != c(x) over future random instances.
Our concern: can we bound the true error of h given the training error of h?
First consider the case where the training error of h is zero (i.e., h is in VS_{H,D}).

No Free Lunch Theorem
No search or learning algorithm can be the best on all possible learning or optimization problems. In fact, every algorithm is the best algorithm for the same number of problems. But only some problems are of interest.
For example: a random search algorithm is perfect for a completely random problem (the "white noise" problem), but for any search or optimization problem with structure, random search is not so good.

Outline
1 Introduction
2 PAC Learning
3 VC Dimension

Exhausting the Version Space
Definition. The version space VS_{H,D} is said to be epsilon-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has error less than epsilon with respect to c and D:
(for all h in VS_{H,D}) error_D(h) < epsilon

How many examples will epsilon-exhaust the VS?
Theorem (Haussler, 1988). If the hypothesis space H is finite, and D is a sequence of m >= 1 independent random examples of some target concept c, then for any 0 <= epsilon <= 1, the probability that the version space with respect to H and D is not epsilon-exhausted (with respect to c) is less than
|H| e^{-epsilon m}
Interesting! This bounds the probability that any consistent learner will output a hypothesis h with error(h) >= epsilon.
If we want this probability to be below delta, then |H| e^{-epsilon m} <= delta, i.e.
m >= (1/epsilon)(ln |H| + ln(1/delta))

Learning Conjunctions of Boolean Literals
How many examples are sufficient to assure, with probability at least (1 - delta), that every h in VS_{H,D} satisfies error_D(h) <= epsilon?
Use our theorem: m >= (1/epsilon)(ln |H| + ln(1/delta))
Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals). Then |H| = 3^n, and
m >= (1/epsilon)(ln 3^n + ln(1/delta)), or
m >= (1/epsilon)(n ln 3 + ln(1/delta))

How About EnjoySport?
m >= (1/epsilon)(ln |H| + ln(1/delta))
If H is as given in EnjoySport, then |H| = 973, and
m >= (1/epsilon)(ln 973 + ln(1/delta))
... if we want to assure that with probability 95% VS contains only hypotheses with error_D(h) <= 0.1, then it is sufficient to have m examples, where
m >= (1/0.1)(ln 973 + ln(1/0.05))
m >= 10(ln 973 + ln 20)
m >= 10(6.88 + 3.00)
m >= 98.8
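Since the bound is plain arithmetic, a small helper makes it easy to reproduce the numbers above; a sketch (the function name is ours, not from the slides):

```python
import math

def haussler_sample_complexity(h_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)): the slides' bound for
    epsilon-exhausting the version space of a finite H."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# EnjoySport: |H| = 973, eps = 0.1, delta = 0.05  ->  99 (i.e., 98.8 rounded up)
print(haussler_sample_complexity(973, 0.1, 0.05))

# Conjunctions of up to n boolean literals: |H| = 3^n
n = 10
print(haussler_sample_complexity(3 ** n, 0.1, 0.05))
```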

PAC Learning
Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.
Definition. C is PAC-learnable by L using H if for all c in C, distributions D over X, epsilon such that 0 < epsilon < 1/2, and delta such that 0 < delta < 1/2, learner L will with probability at least (1 - delta) output a hypothesis h in H such that error_D(h) <= epsilon, in time that is polynomial in 1/epsilon, 1/delta, n and size(c).

Example
Unbiased learner: |H| = 2^{2^n}
m >= (1/epsilon)(ln |H| + ln(1/delta)) = (1/epsilon)(2^n ln 2 + ln(1/delta))
k-term DNF: T_1 OR T_2 OR ... OR T_k. We have |H| <= (3^n)^k, thus
m >= (1/epsilon)(ln |H| + ln(1/delta)), so m >= (1/epsilon)(kn ln 3 + ln(1/delta)) suffices.
So are k-term DNFs PAC learnable?

Agnostic Learning
So far, we assumed c is in H.
Agnostic learning setting: don't assume c is in H.
What do we want then? The hypothesis h that makes the fewest errors on the training data.
What is the sample complexity in this case?
m >= (1/(2 epsilon^2))(ln |H| + ln(1/delta))
derived from the Hoeffding bound (error_S is the training error):
Pr[error_D(h) > error_S(h) + epsilon] <= e^{-2 m epsilon^2}
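The agnostic bound can be computed the same way; a sketch (again, the helper name is ours), compared against the realizable-case bound for the EnjoySport numbers:

```python
import math

def agnostic_sample_complexity(h_size, eps, delta):
    """m >= (1/(2*eps^2)) * (ln|H| + ln(1/delta)), the Hoeffding-based
    bound from the agnostic-learning slide."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * eps ** 2))

# Same EnjoySport numbers as before, but without assuming c is in H:
print(agnostic_sample_complexity(973, 0.1, 0.05))   # ~494, vs ~99 in the realizable case
```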

Outline
1 Introduction
2 PAC Learning
3 VC Dimension

Discretization problem (learning a threshold f_{lambda_0} on the real line)
The error of a threshold hypothesis f_lambda (with lambda <= lambda_0) is er^c(f_lambda) = mu((lambda, lambda_0]).
Let beta_0 = sup{beta : mu((beta, lambda_0]) < epsilon}. Then er^c(f_lambda) >= epsilon implies lambda <= beta_0.
If some training instance x_i belongs to [beta_0, lambda_0], a consistent hypothesis has error less than epsilon.
The probability that no instance belongs to [beta_0, lambda_0] is at most (1 - epsilon)^m. Hence
mu^m{D in S(m, f_{lambda_0}) : er(L(D)) < epsilon} >= 1 - (1 - epsilon)^m
and this probability is > 1 - delta if m >= m_0 = (1/epsilon) ln(1/delta).
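The m_0 = (1/epsilon) ln(1/delta) bound for threshold learning can be checked empirically. A simulation sketch under assumed choices (uniform distribution on [0, 1], target threshold 0.7, helper names our own):

```python
import math
import random

def learn_threshold(sample):
    """Consistent learner for thresholds f_t(x) = [x >= t]: return the
    smallest positive example seen (or +inf if there were none)."""
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else float("inf")

def failure_rate(true_t=0.7, eps=0.05, delta=0.05, trials=2000):
    """Empirically check the m0 = (1/eps) ln(1/delta) bound for thresholds
    under the uniform distribution on [0, 1]."""
    m = math.ceil(math.log(1 / delta) / eps)
    failures = 0
    for _ in range(trials):
        sample = [(x, int(x >= true_t)) for x in (random.random() for _ in range(m))]
        t_hat = learn_threshold(sample)
        # True error of the learned threshold is the length of [true_t, t_hat).
        if t_hat - true_t >= eps:
            failures += 1
    return failures / trials   # typically comes out below delta = 0.05

print(failure_rate())
```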

Shattering a Set of Instances
Definition. A dichotomy of a set S is a partition of S into two disjoint subsets.
Definition. A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

Three Instances Shattered
Let S = {x_1, x_2, ..., x_m} be a subset of X.
Let Pi_H(S) = {(h(x_1), ..., h(x_m)) in {0,1}^m : h in H}; note |Pi_H(S)| <= 2^m.
If |Pi_H(S)| = 2^m then we say H shatters S.
Let Pi_H(m) = max over S in X, |S| = m, of |Pi_H(S)|.
In the previous example (the space of radii), Pi_H(m) = m + 1.
In general it is hard to find a formula for Pi_H(m)!
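The growth function Pi_H(m) can be computed by brute force for small finite cases. A sketch using threshold hypotheses on the line (an analogue of the radius example; all names are our own choices), which realizes m + 1 labelings of m points:

```python
def growth_fn(hypotheses, points):
    """|Pi_H(S)|: number of distinct labelings of `points` realized by the
    hypothesis class (brute force, for small finite cases only)."""
    return len({tuple(h(x) for x in points) for h in hypotheses})

# Thresholds on the line, h_t(x) = [x >= t]: Pi_H(m) = m + 1.
points = [0.1, 0.35, 0.5, 0.8]                       # m = 4
thresholds = [0.0, 0.2, 0.4, 0.6, 0.9, 1.1]          # enough t's to realize every labeling
H = [lambda x, t=t: int(x >= t) for t in thresholds]
print(growth_fn(H, points))                          # prints 5 = m + 1
```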

The Vapnik-Chervonenkis Dimension
Definition. The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) is infinite.
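Shattering and VC dimension can likewise be checked by brute force on tiny finite examples. A sketch (helper names are ours), shown on threshold functions, which per the examples slide below have VC dimension 1 when + is always on one side:

```python
from itertools import combinations

def shatters(hypotheses, points):
    """True iff the class realizes all 2^|points| labelings of `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dimension(hypotheses, instance_space):
    """Size of the largest subset of a finite instance space shattered by the
    class -- brute force, so only feasible for tiny examples."""
    d = 0
    for k in range(1, len(instance_space) + 1):
        if any(shatters(hypotheses, list(S))
               for S in combinations(instance_space, k)):
            d = k
    return d

# Thresholds h_t(x) = [x >= t] shatter any single point but no pair
# (the labeling "left point +, right point -" is never realized), so VC = 1.
X = [0.2, 0.5, 0.8]
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.3, 0.6, 0.9)]
print(vc_dimension(H, X))   # 1
```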

Examples of VC Dim
H = {circles ...} => VC(H) = 3
H = {rectangles ...} => VC(H) = 4
H = {threshold functions ...} => VC(H) = 1 if + is always on the left; VC(H) = 2 if + can be on the left or the right
H = {intervals ...} => VC(H) = 2 if + is always in the center; VC(H) = 3 if the center can be + or -
H = {linear decision surfaces in 2D ...} => VC(H) = 3
Is there an H with VC(H) infinite?
Theorem. If |H| is finite then VCdim(H) <= log_2 |H|.
Let M_n = the set of all Boolean monomials over n variables. Since |M_n| = 3^n, we have VCdim(M_n) <= n log_2 3.
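For instance, the claim that intervals (with + inside) have VC dimension 2 can be verified by brute force over a finite grid of endpoints; a self-contained sketch (the grid and helper names are our own choices):

```python
def interval_hypotheses(grid):
    """All hypotheses h_{a,b}(x) = [a <= x <= b] with endpoints drawn from a
    finite grid (a coarse stand-in for real-valued intervals)."""
    return [lambda x, a=a, b=b: int(a <= x <= b)
            for a in grid for b in grid if a <= b]

def shatters(hypotheses, points):
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

grid = [i / 10 for i in range(11)]
H = interval_hypotheses(grid)
print(shatters(H, [0.25, 0.75]))          # True: 2 points are shattered
print(shatters(H, [0.15, 0.45, 0.85]))    # False: +,-,+ cannot be realized, so VC = 2
```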

Sample Complexity from VC Dimension
How many randomly drawn examples suffice to epsilon-exhaust VS_{H,D} with probability at least (1 - delta)?
m >= (1/epsilon)(4 log_2(2/delta) + 8 VC(H) log_2(13/epsilon))
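Plugging numbers into this bound is again simple arithmetic; a sketch (the function name is ours):

```python
import math

def vc_sample_complexity(vc_dim, eps, delta):
    """m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps)),
    the epsilon-exhaustion bound quoted on the slide."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / eps)) / eps)

# e.g. VC(H) = 4 (axis-parallel rectangles), eps = 0.1, delta = 0.05:
print(vc_sample_complexity(4, 0.1, 0.05))
```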

Potential learnability
Let D in S(m, c) and H_c(D) = {h in H : h(x_i) = c(x_i), i = 1, ..., m}.
Algorithm L is consistent if and only if L(D) is in H_c(D) for any training sample D.
Let B^c_epsilon = {h in H : er(h) >= epsilon} (true error with respect to mu and c).
We say that H is potentially learnable if, given real numbers 0 < epsilon, delta < 1, there is a positive integer m_0 = m_0(epsilon, delta) such that, whenever m >= m_0,
mu^m{D in S(m, c) : H_c(D) and B^c_epsilon are disjoint} > 1 - delta
for any probability distribution mu on X and any c in H.
Theorem. If H is potentially learnable and L is a consistent learning algorithm for H, then L is PAC.

Theorem (Haussler, 1988)
Any finite hypothesis space is potentially learnable.
Proof: Let h be in B_epsilon. Then
mu^m{D in S(m, c) : er_D(h) = 0} <= (1 - epsilon)^m
mu^m{D : H_c(D) meets B_epsilon} <= |B_epsilon|(1 - epsilon)^m <= |H|(1 - epsilon)^m
It is enough to choose m >= m_0 = (1/epsilon) ln(|H|/delta) to obtain |H|(1 - epsilon)^m < delta.

Fundamental theorem
Theorem. If a hypothesis space has infinite VC dimension then it is not potentially learnable. Conversely, finite VC dimension is sufficient for potential learnability.
Let VCdim(H) = d >= 1. Each consistent algorithm L is PAC with sample complexity
m_L(H, delta, epsilon) <= (4/epsilon)(d log_2(12/epsilon) + log_2(2/delta))
Lower bounds: for any PAC learning algorithm L for a space H of finite VC dimension,
m_L(H, delta, epsilon) >= d(1 - epsilon)
If delta <= 1/100 and epsilon <= 1/8, then m_L(H, delta, epsilon) > (d - 1)/(32 epsilon)
m_L(H, delta, epsilon) > ((1 - epsilon)/epsilon) ln(1/delta)

Combine theory with practice
Theory is when we know everything and nothing works.
Practice is when everything works and no one knows why.
We combine theory with practice: nothing works and no one knows why.

Mistake Bounds
So far: how many examples are needed to learn?
What about: how many mistakes before convergence?
Let's consider a setting similar to PAC learning:
Instances are drawn at random from X according to distribution D
The learner must classify each instance before receiving the correct classification from the teacher
Can we bound the number of mistakes the learner makes before converging?

Mistake Bounds: Find-S
Consider Find-S when H = conjunctions of boolean literals.
Find-S:
Initialize h to the most specific hypothesis l_1 AND NOT l_1 AND l_2 AND NOT l_2 AND ... AND l_n AND NOT l_n
For each positive training instance x: remove from h any literal that is not satisfied by x
Output hypothesis h.
How many mistakes before converging to the correct h? (A sketch of the algorithm in code follows.)
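A minimal Python sketch of Find-S over boolean attributes, counting mistakes (the representation of literals as (index, value) pairs is our own choice):

```python
def find_s(examples, n):
    """Find-S for conjunctions of boolean literals over n attributes, with a
    counter for mistakes.  Each example is (x, label) with x a tuple of n
    booleans.  The hypothesis is a set of literals (i, value): "attribute i
    must equal value"; the most specific hypothesis contains every literal
    and its negation."""
    h = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, label in examples:
        predicts_pos = all(x[i] == v for i, v in h)
        if label and not predicts_pos:
            mistakes += 1                      # Find-S only errs on positives
            h = {(i, v) for i, v in h if x[i] == v}
        # Negatives are ignored (never misclassified once c is in H).
    return h, mistakes

# Target: x[0] AND NOT x[2] over n = 3 attributes.
data = [((True, True, False), True), ((True, False, False), True),
        ((False, True, False), False), ((True, True, True), False)]
h, m = find_s(data, 3)
print(sorted(h), m)   # at most n + 1 mistakes in general
```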

Mistake Bounds: Halving Algorithm
Consider the Halving Algorithm:
Learn the concept using the version-space Candidate-Elimination algorithm
Classify new instances by majority vote of the version-space members
How many mistakes before converging to the correct h? ... in the worst case? ... in the best case? (A sketch follows.)
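A sketch of the Halving Algorithm on a toy finite class (the names and the threshold example are ours): predict by majority vote of the current version space, then eliminate every hypothesis that disagrees with the revealed label. Each mistake eliminates at least half of the version space, so the number of mistakes is at most log2 |H|.

```python
import math

def halving_run(hypotheses, stream, oracle):
    """Halving algorithm sketch: majority-vote prediction plus
    candidate elimination after each revealed label."""
    vs = list(hypotheses)
    mistakes = 0
    for x in stream:
        votes = sum(h(x) for h in vs)
        prediction = int(votes * 2 >= len(vs))     # majority vote
        label = oracle(x)
        if prediction != label:
            mistakes += 1                          # wrong majority gets eliminated below
        vs = [h for h in vs if h(x) == label]      # candidate elimination step
    return mistakes

# Toy run with threshold hypotheses on {0,...,31}; target threshold = 19.
X = list(range(32))
H = [lambda x, t=t: int(x >= t) for t in range(33)]     # |H| = 33
m = halving_run(H, stream=X, oracle=lambda x: int(x >= 19))
print(m, math.floor(math.log2(len(H))))                 # mistakes <= log2 |H|
```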

Optimal Mistake Bounds
Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (the maximum over all possible c in C and all possible training sequences):
M_A(C) = max_{c in C} M_A(c)
Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C):
Opt(C) = min over learning algorithms A of M_A(C)
VC(C) <= Opt(C) <= M_Halving(C) <= log_2(|C|)