Probably Approximately Correct Learning - III


Prof. Dan A. Simovici (UMB)

A property of the hypothesis space
Aim: a property of the hypothesis space H that ensures that every consistent algorithm L that outputs hypotheses from H is PAC. L is consistent if, given any training sample s, L produces a hypothesis that is consistent with s. Let H[s] be the set of hypotheses consistent with s.

Let T be a target concept. The set of ɛ-bad hypotheses is
B_ɛ = {H ∈ H : P(T △ H) ≥ ɛ},
where T △ H is the set of examples on which H disagrees with T. A consistent L produces an output in H[s] starting from s, and the PAC property requires that it is unlikely that H = L[s] is ɛ-bad.

Potential Learnability
Definition. A hypothesis space H is potentially learnable if, given real numbers δ and ɛ, there is a positive integer m_0(δ, ɛ) such that whenever m ≥ m_0(δ, ɛ),
P({s ∈ S(m, T) : H[s] ∩ B_ɛ = ∅}) > 1 - δ,
for any probability distribution P.

Theorem
If H is potentially learnable and L is a consistent learning algorithm, then L is PAC.
Proof: the proof is immediate, because if L is consistent then L[s] ∈ H[s]. Thus the condition H[s] ∩ B_ɛ = ∅ implies that the error of L[s] is less than ɛ, as required for PAC learning.

Theorem
Every finite hypothesis space is potentially learnable.
Proof: Suppose that H is a finite hypothesis space and let δ, ɛ, the target concept C, and P be given. We prove that P({s : H[s] ∩ B_ɛ ≠ ∅}) can be made less than δ by choosing m sufficiently large. By the definition of B_ɛ, for every H ∈ B_ɛ:
P(x : H(x) = C(x)) ≤ 1 - ɛ.
Thus,
P(s : H(x_i) = C(x_i) for 1 ≤ i ≤ m) ≤ (1 - ɛ)^m.
This is the probability that one fixed ɛ-bad hypothesis H is consistent with s, that is, belongs to H[s].

Proof (cont'd)
There is some ɛ-bad hypothesis in H[s] iff H[s] ∩ B_ɛ ≠ ∅. Summing the previous bound over the at most |H| hypotheses in B_ɛ, the probability of this event satisfies
P({s : H[s] ∩ B_ɛ ≠ ∅}) ≤ |H| (1 - ɛ)^m.
Since 1 - ɛ < e^{-ɛ} for ɛ > 0, we have |H| (1 - ɛ)^m < |H| e^{-ɛm}. Writing δ = |H| e^{ln(δ/|H|)}, the inequality |H| e^{-ɛm} ≤ δ holds exactly when -ɛm ≤ ln(δ/|H|), that is, when ɛm ≥ ln(|H|/δ). Hence |H| (1 - ɛ)^m < δ whenever
m ≥ (1/ɛ) ln(|H|/δ),
so H is potentially learnable with m_0(δ, ɛ) = ⌈(1/ɛ) ln(|H|/δ)⌉.
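As a quick illustration (not part of the slides), the sketch below computes the sample size promised by this bound; the function name sample_bound and the numeric values are arbitrary choices for the example.

```python
import math

def sample_bound(ln_H, eps, delta):
    """Sample size from the finite-hypothesis-space bound: m >= (1/eps) * ln(|H|/delta).
    Takes ln|H| rather than |H| so that very large hypothesis spaces do not overflow."""
    return math.ceil((ln_H + math.log(1.0 / delta)) / eps)

# Example: |H| = 100 hypotheses, accuracy eps = 0.1, confidence delta = 0.05.
print(sample_bound(math.log(100), eps=0.1, delta=0.05))  # 77 examples suffice
```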

Observations
the algorithm for learning monomials is PAC (its hypothesis space has 3^n elements);
practical limitations exist even for finite spaces; for example, there are 2^(2^n) Boolean functions of n variables, so the bound for the sample length is (2^n ln 2 + ln(1/δ))/ɛ;
even for applications of moderate size (say n = 50) this is enormous!
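To get a feel for these magnitudes, the following snippet plugs n = 50 into the bound m ≥ (1/ɛ) ln(|H|/δ) for the two hypothesis spaces just mentioned; the values ɛ = 0.1 and δ = 0.05 are illustrative choices, not taken from the slides.

```python
import math

def sample_bound(ln_H, eps, delta):
    # m >= (1/eps) * ln(|H|/delta), taking ln|H| as input to avoid overflow
    return math.ceil((ln_H + math.log(1 / delta)) / eps)

n, eps, delta = 50, 0.1, 0.05
print(sample_bound(n * math.log(3), eps, delta))        # monomials, |H| = 3^n: 580 examples
print(sample_bound(2 ** n * math.log(2), eps, delta))   # all Boolean functions, |H| = 2^(2^n): about 7.8e15
```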

A decision list is a sequence of pairs L = ((f_1, c_1), ..., (f_r, c_r)) together with a bit c, where f_i : {0,1}^n → {0,1} and c_i ∈ {0,1} for 1 ≤ i ≤ r. The Boolean function f defined by L is evaluated as shown below:
if f_1(x) = 1 then set f(x) = c_1
else if f_2(x) = 1 then set f(x) = c_2
...
else if f_r(x) = 1 then set f(x) = c_r
else set f(x) = c.
Given x ∈ {0,1}^n we evaluate f_1(x). If f_1(x) = 1, f(x) has the value c_1. Otherwise, f_2(x) is evaluated, and so on.

If K = (f_1, ..., f_r) is a sequence of Boolean functions, we denote by DL(K) the set of decision lists built on K. The value of the function defined by a decision list ((f_1, c_1), ..., (f_r, c_r)), c is
f(x) = c_j if j = min{i : f_i(x) = 1} exists, and f(x) = c otherwise.
There is no loss of generality in assuming that all functions f_i are distinct, so the length of a decision list is at most |K|.

Example
If K = MON_{3,2}, the set of monomials of length at most 2 in 3 variables, then the decision list ((u_2, 1), (u_1 ū_3, 0), (ū_1, 1)), 0 operates as follows:
the examples for which u_2 is satisfied are assigned 1: 010, 011, 110, 111;
the remaining examples for which u_1 ū_3 is satisfied are assigned 0: the only such example is 100;
the remaining examples for which ū_1 is satisfied are assigned 1: 000, 001;
the remaining example, 101, is assigned the default bit 0.
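As a sanity check (not from the slides), this sketch evaluates the decision list above on all eight examples; the monomials are encoded as hypothetical Python predicates over bit strings x = u_1 u_2 u_3.

```python
# Decision list ((u2, 1), (u1*~u3, 0), (~u1, 1)) with default bit 0; x is a bit string "u1 u2 u3".
decision_list = [
    (lambda x: x[1] == "1", 1),                  # u2
    (lambda x: x[0] == "1" and x[2] == "0", 0),  # u1 and not u3
    (lambda x: x[0] == "0", 1),                  # not u1
]
default = 0

def evaluate(dl, default, x):
    """Return c_j for the first pair (g_j, c_j) whose test fires on x, else the default bit."""
    for g, c in dl:
        if g(x):
            return c
    return default

for x in ["000", "001", "010", "011", "100", "101", "110", "111"]:
    print(x, evaluate(decision_list, default, x))
# Output: 1 on 010, 011, 110, 111 (via u2); 0 on 100 (via u1*~u3); 1 on 000, 001 (via ~u1); 0 on 101 (default).
```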

Example
Let K = (f_1, f_2). The decision list ((f_1, 1), (f_2, 1)), 0 defines the function f_1 ∨ f_2.

A Consistent Algorithm for Decision Lists
Algorithm 2.1: A Consistent Algorithm for Decision Lists
Data: a training sample s = ((x_1, b_1), ..., (x_m, b_m)) and a sequence of Boolean functions K = (g_1, ..., g_r)
Result: a decision list consistent with s
let I = {1, ..., m};
let j = 1;
repeat
    if some i ∈ I has g_j(x_i) = 1, and for all i ∈ I, g_j(x_i) = 1 implies b_i = c for a fixed bit c then
        select (g_j, c) to include in the decision list;
        delete from I all i for which g_j(x_i) = 1;
        j = 1;
    else
        j = j + 1;
    end
until I = ∅;
return the decision list;

Example
K = MON_{5,2} is listed in lexicographic order based on the ordering
u_1, ū_1, u_2, ū_2, u_3, ū_3, u_4, ū_4, u_5, ū_5.
The first few entries in the list are
(), (u_1), (u_1 u_2), (u_1 ū_2), (u_1 u_3), ...
Note that (u_1 ū_1) is not included. The training sample s is:
(x_1 = 10000, b_1 = 0), (x_2 = 01110, b_2 = 0), (x_3 = 11000, b_3 = 0), (x_4 = 10101, b_4 = 1), (x_5 = 01100, b_5 = 1), (x_6 = 10111, b_6 = 1).

(x_1 = 10000, b_1 = 0), (x_2 = 01110, b_2 = 0), (x_3 = 11000, b_3 = 0), (x_4 = 10101, b_4 = 1), (x_5 = 01100, b_5 = 1), (x_6 = 10111, b_6 = 1).
I = {1, 2, 3, 4, 5, 6}
(): no, all examples satisfy it but some have label 0 and others label 1
(u_1): no, x_1 and x_4 both satisfy it but have distinct b_i's
(u_1 u_2): yes, it is satisfied only by x_3, whose label is 0; add (u_1 u_2, 0) and delete 3
restart: (), (u_1), (u_1 ū_2): no; (u_1 u_3): yes, it is satisfied by x_4 and x_6, both with label 1; add (u_1 u_3, 1) and delete 4 and 6
restart: (): no; (u_1): yes, it is satisfied only by x_1, whose label is 0; add (u_1, 0) and delete 1
restart: ...; (ū_1 u_4): yes, it is satisfied only by x_2, whose label is 0; add (ū_1 u_4, 0) and delete 2
restart: (): yes, it is satisfied only by x_5, whose label is 1; add ((), 1) and delete 5; now I = ∅.
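The run above can be reproduced with the following sketch of Algorithm 2.1 (a minimal implementation written for these notes, not the authors' code); representing monomials as tuples of literal ranks is a choice made here for convenience.

```python
from itertools import combinations

def monomials(n, k):
    """MON_{n,k} in lexicographic order; the literal u_i has rank 2*(i-1), ~u_i has rank 2*(i-1)+1,
    and a monomial is stored as an increasing tuple of literal ranks."""
    mons = [()]
    for size in range(1, k + 1):
        for combo in combinations(range(2 * n), size):
            if len({r // 2 for r in combo}) == size:  # never both u_i and ~u_i
                mons.append(combo)
    return sorted(mons)

def satisfies(mon, x):
    """True if the bit string x satisfies every literal of the monomial."""
    return all(x[r // 2] == ("1" if r % 2 == 0 else "0") for r in mon)

def greedy_decision_list(sample, K):
    """Algorithm 2.1: pick the first g in K that fires on some remaining example and is
    label-consistent on the examples it fires on; record (g, label) and delete those examples."""
    remaining = list(sample)
    result = []
    while remaining:
        for g in K:
            fired = [(x, b) for (x, b) in remaining if satisfies(g, x)]
            if fired and len({b for _, b in fired}) == 1:
                result.append((g, fired[0][1]))
                remaining = [(x, b) for (x, b) in remaining if not satisfies(g, x)]
                break
        else:
            raise ValueError("sample is not consistent with any decision list over K")
    return result

def show(mon):
    return "".join(("u" if r % 2 == 0 else "~u") + str(r // 2 + 1) for r in mon) or "()"

sample = [("10000", 0), ("01110", 0), ("11000", 0), ("10101", 1), ("01100", 1), ("10111", 1)]
for g, c in greedy_decision_list(sample, monomials(5, 2)):
    print(show(g), c)
# Prints, one pair per line: u1u2 0, u1u3 1, u1 0, ~u1u4 0, () 1
```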

The resulting decision list:
((u_1 u_2, 0), (u_1 u_3, 1), (u_1, 0), (ū_1 u_4, 0), ((), 1)), 0.
Claim: whenever we are given a sample s for a target concept in DL(K), there is always a pair (g, c) with the required properties.

Theorem
Let K be a sequence of Boolean functions that contains the constant function 1. If f ∈ DL(K) and S is a finite non-empty sample, then there exist g ∈ K and c ∈ {0,1} such that:
the set S_g = {x ∈ S : g(x) = 1} is not empty;
for all x ∈ S_g, f(x) = c.

Proof: Since f ∈ DL(K), there is a representation of f as a decision list ((f_1, c_1), ..., (f_r, c_r)), c. If f_i(x) = 0 for all x ∈ S and all i with 1 ≤ i ≤ r, then f(x) = c for every example of S; in this case we take g to be the constant function 1 and the associated bit to be c. Otherwise, let q = min{i : f_i(x) = 1 for some x ∈ S}. By the minimality of q, f(x) = c_q for every x ∈ S with f_q(x) = 1, so we select g = f_q and c = c_q. Thus, given a training sample for f ∈ DL(K), there is a suitable choice of a pair (g, c) for the first term in the decision list. We followed here the paper [2] and the monograph [1].

[1] M. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press, Cambridge, 1997.
[2] R. L. Rivest. Learning decision lists. Machine Learning, 2:229-246, 1987.