
Universität zu Lübeck
Institut für Theoretische Informatik
Lecture notes on Knowledge-Based and Learning Systems
by Maciej Liśkiewicz

Lecture 5: Efficient PAC Learning

1 Consistent Learning: a Bound on Sample Complexity

Let $X$ be any finite learning domain, let $D$ be any probability distribution over $X$, and let $C \subseteq \wp(X)$ be a concept class. (Throughout this lecture, we consider finite learning domains only.) Furthermore, we use $H$ to denote any hypothesis space for $C$. To simplify notation, we use $|M|$ to denote the cardinality of any set $M$. Let $m \in \mathbb{N}$, $m \ge 1$; then we use $X^m$ to denote the $m$-fold Cartesian product of $X$. Let $\bar{x} \in X^m$; then we write $\bar{x} = (x_1, \dots, x_m)$. Now, let $c \in C$ be any concept. The $m$-sample of $c$ generated by $\bar{x}$ is denoted by $S(c, \bar{x}) = \langle x_1, c(x_1)\rangle, \dots, \langle x_m, c(x_m)\rangle$. A hypothesis $h \in H$ is said to be consistent for an $m$-sample $S(c, \bar{x})$ iff $h(x_i) = c(x_i)$ for all $1 \le i \le m$. A learner is said to be consistent iff, for every target concept $c$, every hypothesis $h$ it outputs on an $m$-sample $S(c, \bar{x})$ is consistent for $S(c, \bar{x})$.

The formal definition of PAC learning has already been presented above. Moreover, we showed the class of all monomials to be PAC learnable. The general idea behind the algorithm given there can be described as follows:

(1) Draw a sufficiently large sample from the oracle $EX(c, D)$.
(2) Find some $h \in H$ that is consistent with all the examples drawn.
(3) Output $h$.

Therefore, it is only natural to ask whether or not this strategy may be successful in the general finite case, too.
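
To make the three-step strategy concrete, here is a minimal sketch in Python. It assumes that the hypothesis space $H$ is given as an explicit finite list of Boolean-valued functions and that a function draw_example() plays the role of the oracle $EX(c, D)$; both names are illustrative choices, not part of the original notes.

```python
import random

def consistent_learner(hypotheses, draw_example, m):
    """Generic consistent learner over a finite hypothesis space."""
    sample = [draw_example() for _ in range(m)]            # (1) draw the m-sample S(c, x)
    for h in hypotheses:                                   # (2) search H for a consistent hypothesis
        if all(h(x) == label for x, label in sample):
            return h                                       # (3) output it
    return None                                            # cannot happen if the target concept is in H

# toy usage: X = {0, ..., 9}, target c(x) = [x >= 6], H = all threshold concepts
def draw_example():
    x = random.randrange(10)                               # D is uniform over X in this toy example
    return x, x >= 6                                       # label according to the target concept

hypotheses = [lambda x, t=t: x >= t for t in range(11)]
h = consistent_learner(hypotheses, draw_example, m=20)
print([x for x in range(10) if h(x)])                      # typically [6, 7, 8, 9]
```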

Let us assume that we have a consistent learner. Let $c \in C$ be any concept, and let $h$ be any hypothesis output by the learner on an $m$-sample $S(c, \bar{x})$, where $\bar{x}$ has been drawn with respect to the unknown probability distribution $D$. Assume $h$ to be bad, i.e., $d(c, h) > \varepsilon$. Any such hypothesis will not be consistent with $m$ randomly drawn examples unless all examples are drawn outside the symmetric difference of $c$ and $h$. Hence, the probability that a particular bad hypothesis $h$ survives $m$ examples is at most $(1-\varepsilon)^m$. Consequently, the probability that some bad hypothesis survives $m$ examples is at most $|H|\,(1-\varepsilon)^m$. Furthermore, we want $\Pr(d(c, h) > \varepsilon) < \delta$. Hence, we must require:

$|H|\,(1-\varepsilon)^m \le \delta$.

Now, the latter requirement directly allows us to lower-bound $m$. Taking the natural logarithm of both sides, we obtain $\ln|H| + m \ln(1-\varepsilon) \le \ln\delta$. Therefore, we have:

$m \ge \dfrac{\ln\delta - \ln|H|}{\ln(1-\varepsilon)}$.

Because $(1 - \tfrac{1}{z})^z < e^{-1}$ for all $z > 1$, we additionally obtain $(1-\varepsilon) = \big((1-\varepsilon)^{1/\varepsilon}\big)^{\varepsilon} < e^{-\varepsilon}$, and thus $\ln(1-\varepsilon) < -\varepsilon$. Putting it all together, we see that

$m > \dfrac{1}{\varepsilon}\left(\ln|H| + \ln\dfrac{1}{\delta}\right) = \dfrac{1}{\varepsilon}\ln\dfrac{|H|}{\delta}$

suffices. We summarize the insight obtained by the following theorem.

Theorem 1. Let $X$ be any finite learning domain, let $C \subseteq \wp(X)$ be any concept class, and let $H$ be any hypothesis space for $C$. Then every consistent learner PAC identifies $C$ with respect to $H$ with sample complexity

$m = \left\lfloor \dfrac{1}{\varepsilon}\ln\dfrac{|H|}{\delta} \right\rfloor + 1$.

The latter theorem delivers a first upper bound on the sample complexity needed to achieve efficient PAC learning. However, it does not say anything concerning the problem of computing consistent hypotheses. Clearly, there is a trivial algorithm to achieve this goal: we may just enumerate all hypotheses and then search the fixed enumeration for the first consistent one. Nevertheless, taking into account that $H$ might be huge, this method will usually take too much time. Hence, further effort is necessary to arrive at practical learning algorithms.

2 Efficient PAC Learnability - Definition

The latter observation motivates us to strengthen our requirements concerning the efficiency of PAC learning. It may not be enough to bound the number of examples; additionally, we shall demand the overall running time to be polynomial in the appropriate parameters.

Definition 1. A concept class $C$ is said to be efficiently PAC learnable with respect to the hypothesis space $H$ if $C$ is PAC learnable with respect to $H$, and there exists a PAC learning algorithm $A$ for $C$ that runs in time polynomial in $1/\varepsilon$, $1/\delta$, $n$ (the size of an instance in $X$), and $\mathrm{size}(c)$, for all $\varepsilon, \delta \in (0, 1)$ and all $c \in C$.
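
As an aside, the bound of Theorem 1 is easy to evaluate numerically. The sketch below is illustrative only; the example plugs in a hypothesis space of size $3^n + 1$, the number of monomials over $n$ Boolean variables.

```python
from math import log

def sample_size(h_size, eps, delta):
    """Sample complexity from Theorem 1: m = floor((1/eps) * ln(|H| / delta)) + 1."""
    return int((1.0 / eps) * log(h_size / delta)) + 1

# e.g. the monomials over n = 10 Boolean variables form a hypothesis space of size 3**10 + 1
print(sample_size(3**10 + 1, eps=0.1, delta=0.05))         # 140 examples suffice
```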

Using Theorem 1 we can establish the efficient PAC learnability of a couple of important finite concept classes.

3 Example: Efficient Learnability of k-CNF Formulae

By $k$-CNF we denote the class of all conjunctions of clauses such that each clause contains at most $k$ literals. The overall number of clauses containing at most $k$ literals is bounded by $2n + (2n)^2 + \dots + (2n)^k = O(n^k)$. Since a $k$-CNF formula is determined by the set of clauses it contains, $|k\text{-CNF}| \le 2^{O(n^k)}$, and hence $\ln(|k\text{-CNF}|) = O(n^k)$. Therefore, we get the following general theorem.

Theorem 2. Let $k \in \mathbb{N}^+$ be arbitrarily fixed. The class of all concepts describable by a $k$-CNF formula is efficiently PAC learnable with respect to $k$-CNF.

We leave the proof as an exercise. Next, by $k$-DNF we denote the class of all disjunctions of monomials such that each monomial contains at most $k$ literals.

Exercise. Prove the following: Let $k \in \mathbb{N}^+$ be arbitrarily fixed. The class of all concepts describable by a $k$-DNF formula is efficiently PAC learnable with respect to $k$-DNF.
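
The counting argument above is easy to double-check. The following illustrative snippet compares the exact number of clauses over $L_n$ with at most $k$ literals (identifying a clause with its set of literals) against the bound $2n + (2n)^2 + \dots + (2n)^k$.

```python
from math import comb

def count_clauses(n, k):
    """Number of non-empty clauses built from at most k of the 2n literals in L_n."""
    return sum(comb(2 * n, i) for i in range(1, k + 1))

n, k = 6, 2
bound = sum((2 * n) ** i for i in range(1, k + 1))
print(count_clauses(n, k), "<=", bound)                    # 78 <= 156, both O(n**k)
```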

4 Intractability of Learning 3-Term DNF Formulae

First of all we define 3-term DNF. Let $X = \{0, 1\}^n$, $n \ge 1$, be the Boolean learning domain. Then we again use $L_n = \{x_1, \bar{x}_1, \dots, x_n, \bar{x}_n\}$ to denote the set of all relevant literals over $X$. Now, a term is a conjunction of literals from $L_n$. The set of all disjunctions of at most three terms is called 3-term DNF; e.g., $x_1 x_3 x_5 \vee x_2 x_4 x_5 \vee x_2 x_3$ is a member of 3-term DNF.

We are going to study whether or not 3-term DNF is efficiently PAC learnable. What is an appropriate hypothesis space? Well, this is precisely the crucial problem, as we shall see. However, it might be a good idea to try 3-term DNF itself as the hypothesis space. Since the sample complexity depends on $|3\text{-term DNF}|$, we first check this quantity. As we already know, there are $3^n + 1$ monomials over $L_n$. Hence, there are $(3^n + 1)^3$ many elements in 3-term DNF. Therefore, $\ln(|3\text{-term DNF}|) = O(n)$. This looks good. Hence, the only remaining problem we have to address is the complexity of finding consistent hypotheses.

However, this is easier said than done. Many researchers have tried to find a polynomial-time algorithm for this problem, but so far nobody has succeeded. Therefore, it seems desirable to provide at least a lower bound for this complexity. Unfortunately, there is no known method to prove nontrivial super-polynomial lower bounds for particular problems. Alternatively, we may try to relate the complexity of finding consistent hypotheses to the complexity of other problems. Good candidates are problems that have been proven to be complete for some complexity class. When dealing with efficient PAC learning, the appropriate complexity class is NP. Then, assuming that NP-complete problems cannot be solved efficiently by a randomized algorithm, we get strong evidence against efficient PAC learnability. That is, unless someone comes up with a proof of NP = RP, we will have proved that 3-term DNF is not efficiently PAC learnable with respect to 3-term DNF (here RP denotes the class of decision problems solvable by a polynomial-time randomized Turing machine: for each possible input string, either there are no accepting computations, or else at least half of all computations are accepting).

Next, we formally define the consistency problem for 3-term DNF. In accordance with the notation introduced above, we use $\bar{b} = (b_1, \dots, b_m)$ to denote any $m$-tuple of Boolean vectors $b_1, \dots, b_m \in \{0, 1\}^n$. We start with the following decision problem.

Consistency Problem for 3-term DNF
Input: $m$ labeled Boolean vectors from $\{0, 1\}^n$, i.e., an $m$-sample $S(c, \bar{b})$.
Output: "yes", if there exists a consistent hypothesis $h$ in 3-term DNF for $S(c, \bar{b})$; "no", otherwise.

The complexity of the Consistency Problem for 3-term DNF is settled by our next theorem.

Theorem 3. The Consistency Problem for 3-term DNF is NP-complete.

Proof. We reduce graph 3-colorability to the Consistency Problem for 3-term DNF. This shows that the Consistency Problem for 3-term DNF is NP-hard. Since the set of all $m$-samples, $m \ge 1$, for which there exists a consistent hypothesis is obviously acceptable by a non-deterministic polynomial-time Turing machine, we are done.

Graph 3-colorability is a known NP-complete problem defined as follows (cf. Garey and Johnson [3]). Let $G = (V, E)$ be an undirected graph. $G$ is said to be 3-colorable iff there exists a function $\chi\colon V \to \{1, 2, 3\}$ such that $(i, j) \in E$ implies $\chi(i) \ne \chi(j)$.

Let $G = (V, E)$ be any given graph, where without loss of generality $V = \{1, \dots, n\}$. We consider the following reduction. For each vertex $i$, a positive example $b_i$ is created, where $b_i = u_1 u_2 \cdots u_n$ with $u_i = 0$ and $u_j = 1$ for all $j \ne i$. For each edge $(i, j) \in E$, a negative example $e_{ij}$ is created, where $e_{ij} = u_1 u_2 \cdots u_n$ with $u_i = u_j = 0$ and $u_k = 1$ for all $k \ne i, j$. The resulting sample is denoted by $S(G)$. Since $|E| \le n(n-1)/2$, this reduction is clearly polynomial-time computable. It remains to show that the reduction has been defined appropriately. This is done via the following claims.
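
The reduction itself is straightforward to implement. The sketch below is illustrative only, with the graph given as a vertex count and an edge list; it constructs the sample $S(G)$.

```python
def reduction_sample(n, edges):
    """Build S(G) for G = (V, E) with V = {1, ..., n}: a positive example b_i for
    every vertex i and a negative example e_ij for every edge (i, j)."""
    def vector(zeros):
        # the Boolean vector u_1 ... u_n with u_p = 0 exactly for p in `zeros`
        return tuple(0 if p in zeros else 1 for p in range(1, n + 1))
    positives = [(vector({i}), 1) for i in range(1, n + 1)]
    negatives = [(vector({i, j}), 0) for i, j in edges]
    return positives + negatives

# e.g. a triangle on {1, 2, 3} plus the isolated vertex 4
print(reduction_sample(4, [(1, 2), (1, 3), (2, 3)]))
```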

Claim 1. Let $G = (V, E)$ be an undirected graph, and let $S(G)$ be the sample constructed as above. If there exists a hypothesis $h$ in 3-term DNF that is consistent with $S(G)$, then $G$ is 3-colorable.

Let $h$ be any hypothesis consistent with $S(G)$. Since $h$ belongs to 3-term DNF, we may write $h = T_1 \vee T_2 \vee T_3$. Since $h$ is consistent, we have $h(b_i) = 1$ for every vertex $i \in V$. Moreover, $h$ is a disjunction; thus, for every vertex $i \in V$ there must be a term $T_r$ with $T_r(b_i) = 1$. Therefore, we may define the desired mapping $\chi$ as follows:

$\chi(i) = \min\{r \mid T_r(b_i) = 1,\ 1 \le r \le 3\}$.

Now, let $(i, j) \in E$; we have to show that $\chi(i) \ne \chi(j)$. Suppose the converse, i.e., $\chi(i) = \chi(j)$. Then the examples $b_i$ and $b_j$ satisfy the same term $T_r$. Since $(i, j) \in E$ we additionally have that $e_{ij} = b_i \wedge b_j$ (taken bitwise). However, $b_j$ and $e_{ij}$ differ just in the $i$-th bit. Taking into account that $T_r(b_i) = T_r(b_j) = 1$, it is easy to see that neither the literals $x_i$ and $\bar{x}_i$ nor the literals $x_j$ and $\bar{x}_j$ can be present in $T_r$. Thus, $T_r(e_{ij}) = 1$, a contradiction to $h(e_{ij}) = 0$. This proves Claim 1.

Claim 2. Let $G = (V, E)$ be an undirected graph that is 3-colorable. Then there exists a hypothesis $h$ in 3-term DNF that is consistent with $S(G)$.

Let $\chi$ be the mapping assigning the 3 colors to the vertices of $G$. We define

$T_r = \bigwedge_{i\colon \chi(i) \ne r} x_i$ for $r \in \{1, 2, 3\}$

and set $h = T_1 \vee T_2 \vee T_3$. We have to show that $h$ is consistent with $S(G)$. First, consider any positive example $b_i$, and let $\chi(i) = r$. Then $T_r(b_i) = 1$, since $T_r$ does not contain the variable $x_i$ and all other bits of $b_i$ equal 1. Now, let $e_{ij}$ be any negative example. Since $\chi$ assigns different colors to $i$ and $j$, we have $\chi(i) \ne \chi(j)$. Let $\chi(i) = r$; then $T_r$ cannot satisfy $e_{ij}$, since it contains $x_j$. Analogously, if $\chi(j) = g$, then $T_g$ cannot satisfy $e_{ij}$, since it contains $x_i$. Hence, only the remaining term $T_y$ with $\chi(i) \ne y \ne \chi(j)$ could possibly satisfy $e_{ij}$. However, this is impossible as well, since $T_y$ contains both $x_i$ and $x_j$. Thus, $h(e_{ij}) = 0$, and hence $h$ is consistent. This proves Claim 2 and the theorem.

Finally, Theorem 3 has a nice dual version, which we include as an exercise.

Exercise 1. For all $n \in \mathbb{N}$, $n \ge 1$, let $L_n = \{x_1, \bar{x}_1, \dots, x_n, \bar{x}_n\}$ be the set of all relevant literals. Furthermore, let $k \in \mathbb{N}$, $k \ge 3$, be arbitrarily fixed. By $k$-term CNF we denote the set of all conjunctions of at most $k$ clauses, where a clause is again any disjunction of elements from $L_n$. Finally, let $C(k) = \bigcup_{n \ge 1} k\text{-term CNF}$. Define the consistency problem for $C(k)$ and prove its NP-completeness.
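
The construction from the proof of Claim 2 can be sketched in the same style: given a 3-coloring $\chi$, the term $T_r$ is the conjunction of all variables $x_i$ with $\chi(i) \ne r$. The snippet below is illustrative and reuses reduction_sample() from the previous sketch.

```python
def coloring_to_hypothesis(n, chi):
    """Claim 2: from a 3-coloring chi (a dict vertex -> color in {1, 2, 3}) build
    h = T_1 v T_2 v T_3, where T_r contains exactly the variables x_i with chi[i] != r."""
    terms = [[i for i in range(1, n + 1) if chi[i] != r] for r in (1, 2, 3)]
    return lambda b: int(any(all(b[i - 1] == 1 for i in term) for term in terms))

n, edges = 4, [(1, 2), (1, 3), (2, 3)]
h = coloring_to_hypothesis(n, {1: 1, 2: 2, 3: 3, 4: 1})    # a valid 3-coloring of this graph
sample = reduction_sample(n, edges)                        # S(G) from the previous sketch
print(all(h(b) == label for b, label in sample))           # True: h is consistent with S(G)
```
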
As we have seen, the consistency problem for 3-term DNF is NP-complete. However, at first glance it might seem that we have dealt with the wrong question. Whenever studying the PAC learnability of 3-term DNF, all examples drawn are labeled with respect to some target concept. Hence, there is a consistent hypothesis, namely at least the target itself. The problem the learner has to solve is to construct a consistent hypothesis. Therefore, we have to investigate how the decision problem and the construction problem are related to one another. This is done by the following theorem.

Theorem 4. If there is an algorithm that efficiently PAC learns 3-term DNF with respect to 3-term DNF, then NP = RP.

Proof. Let $A$ be any algorithm that efficiently PAC learns 3-term DNF with respect to 3-term DNF. Let $q$ be any fixed polynomial such that the running time of $A$ is polynomially bounded in $1/\varepsilon$, $1/\delta$, $n$, and $\mathrm{size}(c)$. Obviously, $\mathrm{size}(c)$ can be upper-bounded by $3n$, so we essentially have to deal with $1/\varepsilon$, $1/\delta$, and $n$. Next we show how to use this algorithm $A$ to decide the Consistency Problem for 3-term DNF in randomized polynomial time. For a formal definition of randomized algorithms and the complexity class RP we refer to Cormen, Leiserson and Rivest [2].

Let $S(c, \bar{b})$ be any labeled $m$-sample, where $c$ is any Boolean concept (not necessarily from 3-term DNF). We choose $\varepsilon = 1/(m + 1)$ and an arbitrarily small $\delta > 0$, say $\delta = 0.000001$. The choice of $\varepsilon$ guarantees that a hypothesis possibly output by $A$ must be consistent with all examples contained in $S(c, \bar{b})$. Conceptually, this refers to the probability distribution where each Boolean vector $b_i$ from $S(c, \bar{b})$ is equally likely, and all other elements have probability zero. That is, $D(b_i) = 1/m$ for all $b_i$ from $S(c, \bar{b})$, and $D(b_j) = 0$ for all other Boolean vectors $b_j \in \{0, 1\}^n \setminus \{b_1, \dots, b_m\}$. Hence, if at least one label is not correctly reflected by the hypothesis $h$ possibly output by $A$, then $d(c, h) \ge 1/m > 1/(m + 1) = \varepsilon$.

Next, we run $A$ on input $\varepsilon$, $\delta$ and the $m$-sample $S(c, \bar{b})$ for at most $q(1/\varepsilon, 1/\delta, |S(c, \bar{b})|)$ steps. Since every polynomial is time-constructible, it is decidable in polynomial time (again in $1/\varepsilon$, $1/\delta$, $|S(c, \bar{b})|$) whether or not $A$ has already executed at most $q(1/\varepsilon, 1/\delta, |S(c, \bar{b})|)$ many steps. Now, we distinguish the following cases.

Case 1. $A$ does not stop after having executed $q(1/\varepsilon, 1/\delta, |S(c, \bar{b})|)$ many steps, or it stops but does not output any hypothesis. Then we conclude that there is no consistent hypothesis for $S(c, \bar{b})$ in 3-term DNF.

If there is really no consistent hypothesis for $S(c, \bar{b})$ in 3-term DNF, then this output is certainly correct. Now, assume that there is a consistent hypothesis for $S(c, \bar{b})$ in 3-term DNF. Hence, there is a concept $\hat{c}$ in 3-term DNF such that $c(b_i) = \hat{c}(b_i)$ for all $1 \le i \le m$. Since the algorithm $A$ is supposed to PAC learn $\hat{c}$ with respect to every probability distribution, it must do so for $D$. Hence, with probability at least $1 - \delta$ it has to output a hypothesis $h$ such that $d(\hat{c}, h) = d(c, h) \le \varepsilon$. As shown above, the latter inequality forces $A$ to produce a consistent guess. Since $A$ did not produce any guess, it has failed to PAC learn $\hat{c}$. However, this failure probability is bounded by $\delta$.

Case 2. $A$ stops after having executed at most $q(1/\varepsilon, 1/\delta, |S(c, \bar{b})|)$ many steps and outputs a hypothesis $h$. Obviously, we can decide in time polynomial in the length of $h$ and $|S(c, \bar{b})|$ whether or not $h$ is consistent with $S(c, \bar{b})$. In case it is, we know for sure that there exists a consistent hypothesis in 3-term DNF for $S(c, \bar{b})$. If $h$ is not consistent, then we can argue as in Case 1. Thus, with probability at least $1 - \delta$ we may conclude that there is no consistent hypothesis for $S(c, \bar{b})$ in 3-term DNF.

Putting it all together, we have arrived at an algorithm with the following properties. If there is no consistent hypothesis for $S(c, \bar{b})$ in 3-term DNF, then its output is always correct. If there exists a consistent hypothesis for $S(c, \bar{b})$ in 3-term DNF, then the algorithm produces a wrong answer with probability at most $\delta$. Hence, there exists a constant $\varepsilon_0 > 0$ such that every $m$-sample for which there exists a consistent hypothesis in 3-term DNF is accepted with probability $1 - \delta > 1/2 + \varepsilon_0$. Thus, we have an RP algorithm for the Consistency Problem for 3-term DNF. Finally, since RP $\subseteq$ NP, and because of the NP-completeness of the Consistency Problem for 3-term DNF, we can conclude that RP = NP.
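
Schematically, the randomized decision procedure built in this proof looks as follows. Everything here is an assumption for illustration: `learner` stands for the hypothetical efficient PAC learner $A$, called with $\varepsilon$, $\delta$ and an example oracle, and assumed to return a hypothesis (or None) within its time budget $q(1/\varepsilon, 1/\delta, |S|)$.

```python
import random

def decide_consistency(learner, sample, delta=1e-6):
    """RP-style decision procedure from the proof of Theorem 4; `sample` is a list of
    (boolean_vector, label) pairs, learner(eps, delta, oracle) the assumed PAC learner A."""
    m = len(sample)
    eps = 1.0 / (m + 1)                          # any eps-good hypothesis must then be consistent
    oracle = lambda: random.choice(sample)       # D: uniform on the sample vectors, zero elsewhere
    h = learner(eps, delta, oracle)              # run A within its (assumed) time budget
    if h is None:                                # Case 1: no hypothesis produced
        return False                             # answer "no consistent 3-term DNF hypothesis"
    return all(h(b) == label for b, label in sample)   # Case 2: accept iff the output is consistent
```

If a consistent hypothesis exists, this procedure errs with probability at most $\delta$; if none exists, it is always correct, which is exactly the RP-type behaviour used in the proof.
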

The proof provided above is worth analyzing a bit further. We strongly recommend solving the following exercise.

Exercise 2. Prove the following: If there exists a deterministic algorithm $A$ that constructs, for every input $m$-sample $S$ drawn in accordance with some 3-term DNF formula, a consistent hypothesis, then P = NP, provided $A$ has a running time polynomially bounded in $|S|$ and $n$.

Now, it is only natural to ask whether or not Theorem 4 implies that 3-term DNF is not efficiently PAC learnable at all. We still have the freedom to choose another hypothesis space, and as it turns out, a careful choice of the hypothesis space really does change everything. An alternative choice to 3-term DNF is 3-CNF, the class of conjunctions of clauses that contain at most 3 literals per clause, i.e., 3-CNF is the set of all formulas of the form

$\bigwedge_i (\ell_{i1} \vee \ell_{i2} \vee \ell_{i3})$,

where each $\ell_{ij} \in L_n$ or is empty. We leave it as an exercise to show that every 3-term DNF formula is equivalently representable by a 3-CNF formula. The easiest way to see this is to prove that every 3-term DNF formula $f = T_1 \vee T_2 \vee T_3$ can be rewritten as

$T_1 \vee T_2 \vee T_3 = \bigwedge_{x \in T_1,\; y \in T_2,\; z \in T_3} (x \vee y \vee z)$

(see the sketch at the end of this section). The converse is not true, i.e., 3-CNF is more expressive than 3-term DNF. Now Theorem 2 allows the following corollary.

Corollary 1. The class of all 3-term DNF is efficiently PAC learnable with respect to the hypothesis space 3-CNF.

Again, the latter corollary is easily generalized.

Exercise 3. Prove the following: For every constant $k \ge 2$, the class of all $k$-term DNF is efficiently PAC learnable with respect to the hypothesis space $k$-CNF.
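
As referenced above, the rewriting of a 3-term DNF formula into an equivalent 3-CNF formula is mechanical. A minimal sketch, with terms and clauses represented as collections of literal strings (a convention chosen here only for illustration):

```python
from itertools import product

def three_term_dnf_to_3cnf(terms):
    """Rewrite T_1 v T_2 v T_3 (each term a list of literals such as 'x1' or '~x3')
    into the equivalent 3-CNF: one clause {x, y, z} per choice of x in T_1, y in T_2, z in T_3."""
    return [set(choice) for choice in product(*terms)]

# e.g. x1 x3  v  x2 ~x3  v  ~x1 x2   ->   2 * 2 * 2 = 8 clauses of at most 3 literals each
print(three_term_dnf_to_3cnf([['x1', 'x3'], ['x2', '~x3'], ['~x1', 'x2']]))
```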

References

[1] M. Anthony and N. Biggs (1992), Computational Learning Theory, Cambridge University Press, Cambridge.
[2] T. Cormen, C. Leiserson and R. Rivest (1990), Introduction to Algorithms, The MIT Press, Cambridge, MA.
[3] M.R. Garey and D.S. Johnson (1979), Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, San Francisco.
[4] M.J. Kearns and U.V. Vazirani (1994), An Introduction to Computational Learning Theory, MIT Press.
[5] B.K. Natarajan (1991), Machine Learning, Morgan Kaufmann Publishers Inc.
[6] L.G. Valiant (1984), A theory of the learnable, Communications of the ACM 27, 1134-1142.