Universität zu Lübeck
Institut für Theoretische Informatik
Lecture notes on Knowledge-Based and Learning Systems
by Maciej Liśkiewicz

Lecture 5: Efficient PAC Learning

1 Consistent Learning: a Bound on Sample Complexity

Let X be any finite learning domain, let D be any probability distribution over X, and let C ⊆ ℘(X) be a concept class. Furthermore, we use H to denote any hypothesis space for C. To simplify notation, we use |M| to denote the cardinality of any set M. Let m ∈ ℕ, m ≥ 1; then we use X^m to denote the m-fold Cartesian product of X. Let x ∈ X^m; then we write x = (x_1, ..., x_m). Now, let c ∈ C be any concept. The m-sample of c generated by x is denoted by

S(c, x) = ⟨x_1, c(x_1)⟩, ..., ⟨x_m, c(x_m)⟩.

A hypothesis h ∈ H is said to be consistent for an m-sample S(c, x) iff h(x_i) = c(x_i) for all 1 ≤ i ≤ m. A learner is said to be consistent iff for every target concept c and for every hypothesis h it outputs on an m-sample S(c, x), h is consistent for S(c, x).

The formal definition of PAC learning has already been presented above. Moreover, we showed the class of all monomials to be PAC learnable. The general idea behind the algorithm given there can be described as follows:

(1) Draw a sufficiently large sample from the oracle EX(c, D).
(2) Find some h ∈ H that is consistent with all the examples drawn.
(3) Output h.

Therefore, it is only natural to ask whether or not this strategy may be successful in the general finite case, too. Let us assume that we have a consistent learner. Let c ∈ C be any concept, and let h be any hypothesis output by the learner on an m-sample S(c, x), where x has been drawn with respect to the unknown probability distribution D. Assume h to be bad, i.e., d(c, h) > ε. Any such hypothesis will not be consistent with m randomly drawn examples unless all examples are drawn outside the symmetric difference of c and h. Hence, the probability that a particular bad hypothesis h survives m examples is at most (1 − ε)^m.
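The survival probability just derived can be checked exactly on a toy domain. A minimal sketch (the toy domain, distribution, and concepts below are ours, purely for illustration):

```python
from itertools import product

# Toy setup: X = {0,1}^3 with the uniform distribution D.
X = list(product((0, 1), repeat=3))
D = {x: 1.0 / len(X) for x in X}

c = lambda x: x[0] == 1          # target concept
h = lambda x: x[1] == 1          # a "bad" hypothesis

# d(c, h) = probability mass of the symmetric difference of c and h.
err = sum(D[x] for x in X if c(x) != h(x))

# h survives one random example iff that example falls outside the
# symmetric difference, so the exact survival probability over m
# independent draws is (1 - err)^m.
m = 10
survival = (1 - err) ** m
print(err, survival)
```

Here err comes out to 0.5, so already ten examples shrink the survival probability below one in a thousand.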
Consequently, the probability that some bad hypothesis survives m examples is at most |H| · (1 − ε)^m. Furthermore, we want Pr(d(c, h) > ε) < δ. Hence, we must require:

|H| · (1 − ε)^m ≤ δ.

(In this lecture, we consider finite learning domains only.)
Wissensbasierte und lernende Systeme

Now, the latter requirement directly allows us to lower bound m. Taking the natural logarithm of both sides, we obtain:

ln|H| + m · ln(1 − ε) ≤ ln δ.

Therefore, we have:

m > (ln δ − ln|H|) / ln(1 − ε).

Because of (1 − 1/z)^z < e^(−1) for all z > 0, we additionally obtain

(1 − ε) = ((1 − ε)^(1/ε))^ε < e^(−ε),

and thus −ln(1 − ε) > ε. Putting it all together, we see that it suffices to have

m > (1/ε) · (ln|H| + ln(1/δ)) = (1/ε) · ln(|H|/δ).

We summarize the insight obtained by the following theorem.

Theorem 1. Let X be any finite learning domain, let C ⊆ ℘(X) be any concept class, and let H be any hypothesis space for C. Then every consistent learner PAC identifies C with respect to H with sample complexity

m = ⌈(1/ε) · ln(|H|/δ)⌉ + 1.

The latter theorem delivers a first upper bound on the sample complexity needed to achieve efficient PAC learning. However, it says nothing about the problem of computing consistent hypotheses. Clearly, there is a trivial algorithm to achieve this goal: we may just enumerate all hypotheses and then search for the first consistent one in the fixed enumeration. Nevertheless, taking into account that H might be huge, this method will usually take too much time. Hence, further effort is necessary to arrive at practical learning algorithms.

2 Efficient PAC Learnability - Definition

The latter observation motivates us to strengthen our requirements concerning the efficiency of PAC learning. It might not be enough to bound the number of examples. Additionally, we shall demand the overall running time to be polynomial in the appropriate parameters.

Definition 1. A concept class C is said to be efficiently PAC learnable with respect to the hypothesis space H if C is PAC learnable with respect to H, and there exists a PAC learning algorithm A for C that runs in time polynomial in 1/ε, 1/δ, n (the size of an instance in X), and size(c) for all ε, δ ∈ (0, 1) and all c ∈ C.
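The bound of Theorem 1 is easy to evaluate numerically. A minimal sketch (function name ours):

```python
import math

def sample_size(eps, delta, H_size):
    """Sufficient number of examples for a consistent learner over a finite
    hypothesis space H (Theorem 1): m = ceil((1/eps) * ln(|H|/delta)) + 1."""
    return math.ceil((1.0 / eps) * math.log(H_size / delta)) + 1

# E.g. monomials over n = 10 variables: |H| = 3^10 + 1.
m = sample_size(eps=0.1, delta=0.05, H_size=3**10 + 1)
print(m)

# Sanity check: with this m the failure bound |H| * (1 - eps)^m is below delta.
assert (3**10 + 1) * (1 - 0.1) ** m < 0.05
```

Note that m grows only logarithmically in |H| and 1/δ, which is what makes the bound useful even for very large finite hypothesis spaces.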
Using Theorem 1 we can establish the efficient PAC learnability of several important finite concept classes.
M. Liśkiewicz, November 2006

3 Example: Efficient Learnability of k-CNF Formulae

By k-CNF we denote the class of all conjunctions of clauses such that each clause contains at most k literals. The overall number of clauses containing at most k literals is bounded by 2n + (2n)^2 + ... + (2n)^k = O(n^k). Hence, ln(|k-CNF|) = O(n^k). Therefore, we get the following general theorem.

Theorem 2. Let k ∈ ℕ+ be arbitrarily fixed. The class of all concepts describable by a k-CNF formula is efficiently PAC learnable with respect to k-CNF.

We leave the proof as an exercise. Next, by k-DNF we denote the class of all disjunctions of monomials such that each monomial contains at most k literals.

Exercise. Prove the following: Let k ∈ ℕ+ be arbitrarily fixed. The class of all concepts describable by a k-DNF formula is efficiently PAC learnable with respect to k-DNF.

4 Intractability of Learning 3-Term DNF Formulae

First of all we define 3-term DNF. Let X = {0, 1}^n, n ≥ 1, be the Boolean learning domain. Then we again use L_n = {x_1, x̄_1, ..., x_n, x̄_n} to denote the set of all relevant literals over X. Now, a term is a conjunction of literals from L_n. The set of all disjunctions of at most three terms is called 3-term DNF; e.g., x_1 x̄_3 x_5 ∨ x̄_2 x_4 x_5 ∨ x_2 x̄_3 is a member of 3-term DNF. We are going to study whether or not 3-term DNF is efficiently PAC learnable.

What is an appropriate hypothesis space? Well, this is just the crucial problem, as we shall see. However, it might be a good idea to try 3-term DNF itself as hypothesis space. Since the sample complexity depends on |3-term DNF|, we will first check this quantity. As we already know, there are 3^n + 1 monomials over L_n. Hence, there are (3^n + 1)^3 many elements in 3-term DNF. Therefore, ln(|3-term DNF|) = O(n). This looks good. Hence, the only remaining problem we have to address is the complexity of finding consistent hypotheses. However, this is easier said than done.
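To see why finding a consistent 3-term DNF hypothesis is the bottleneck, consider the trivial enumerate-and-check learner from Section 1 instantiated for this class. A sketch (all names ours; the search space has size roughly 3^(3n), so this is hopeless beyond tiny n):

```python
from itertools import product, combinations_with_replacement

def all_terms(n):
    """All monomials over n variables: each variable occurs positively,
    negatively, or not at all. A term is a tuple of signed indices,
    +i for x_i and -i for its negation; the empty term is always true."""
    terms = []
    for signs in product((1, -1, 0), repeat=n):
        terms.append(tuple(s * (i + 1) for i, s in enumerate(signs) if s))
    return terms

def term_true(term, x):
    return all((x[abs(l) - 1] == 1) == (l > 0) for l in term)

def find_consistent_3term_dnf(sample, n):
    """Trivial consistent learner: try every triple of terms and return the
    first one agreeing with all labels, or None if none exists."""
    for triple in combinations_with_replacement(all_terms(n), 3):
        if all(any(term_true(t, x) for t in triple) == label
               for x, label in sample):
            return triple
    return None

# Tiny check on n = 2 with target x_1.
sample = [((0, 0), False), ((0, 1), False), ((1, 0), True), ((1, 1), True)]
triple = find_consistent_3term_dnf(sample, 2)
print(triple)
```

Even for this toy instance the learner enumerates on the order of a hundred candidate triples; the exponential blow-up in n is exactly what the hardness results below make precise.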
Many researchers have tried to find a polynomial-time algorithm for this problem, but nobody has succeeded so far. Therefore, it seems desirable to provide at least a lower bound for this complexity. Unfortunately, there is no known method to prove nontrivial super-polynomial lower bounds for particular problems. Alternatively, we may try to relate the complexity of finding consistent hypotheses to the complexity of other problems. Good candidates are problems that have been proven to be complete for some complexity class. When dealing with efficient PAC learning, the appropriate complexity class is NP. Then, assuming that NP-complete problems cannot be solved efficiently by a randomized algorithm, we get strong evidence against efficient PAC learnability. That is, unless someone comes up with a proof of NP = RP, we shall have proved that 3-term DNF is not efficiently PAC learnable with respect to 3-term DNF. (Here RP denotes the class of decision problems solvable by a polynomial-time randomized Turing machine: for each possible input string, either there are no accepting computations or else at least half of all computations are accepting.)

Next, we formally define the consistency problem for 3-term DNF. In accordance with the notation introduced above, we use b = (b_1, ..., b_m) to denote any m-tuple of Boolean vectors b_1, ..., b_m ∈ {0, 1}^n. We start with the following decision problem.

Consistency Problem for 3-term DNF
Input: m labeled Boolean vectors from {0, 1}^n, i.e., an m-sample S(c, b).
Output: yes, if there exists a consistent hypothesis h ∈ 3-term DNF for S(c, b); no, otherwise.

What can be said concerning the complexity of the Consistency Problem for 3-term DNF is provided by our next theorem.

Theorem 3. The Consistency Problem for 3-term DNF is NP-complete.

Proof. We reduce graph 3-colorability to the Consistency Problem for 3-term DNF. This shows that the Consistency Problem for 3-term DNF is NP-hard. Since the set of all m-samples, m ≥ 1, for which there exists a consistent hypothesis is obviously acceptable by a non-deterministic polynomial-time Turing machine, we are done.

Graph 3-colorability is a well-known NP-complete problem defined as follows (cf. Garey and Johnson [3]). Let G = (V, E) be an undirected graph. G is said to be 3-colorable iff there exists a function χ: V → {1, 2, 3} such that (i, j) ∈ E implies χ(i) ≠ χ(j).

Let G = (V, E) be any given graph, where without loss of generality V = {1, ..., n}. We consider the following reduction. For each vertex i, a positive example b_i is created, where b_i = u_1 u_2 ... u_n with u_i = 0 and u_j = 1 for all j ≠ i. For each edge (i, j) ∈ E, a negative example e_ij is created, where e_ij = u_1 u_2 ... u_n with u_i = u_j = 0 and u_k = 1 for all k ≠ i, j. The resulting sample is denoted by S(G). Since |E| ≤ n(n − 1)/2, this reduction is clearly polynomial-time computable. It remains to show that the reduction has been defined appropriately. This is done via the following claims.

Claim 1. Let G = (V, E) be an undirected graph, and let S(G) be the sample constructed as above. If there exists a hypothesis h ∈ 3-term DNF that is consistent with S(G), then G is 3-colorable.

Let h be any hypothesis consistent with S(G). Since h ∈ 3-term DNF, we may write h = T_1 ∨ T_2 ∨ T_3. Since h is consistent, we have h(b_i) = 1 for every vertex i ∈ V. Moreover, h is a disjunction. Thus, for every vertex i ∈ V there must be a term that b_i satisfies.
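The construction of S(G) is straightforward to implement; a small sketch of the reduction (function name ours):

```python
def reduction_sample(n, edges):
    """Build S(G) from a graph on vertices 1..n: one positive example per
    vertex, one negative example per edge, as in the proof of Theorem 3."""
    sample = []
    for i in range(1, n + 1):
        # b_i: bit i is 0, all other bits are 1.
        b = tuple(0 if j == i else 1 for j in range(1, n + 1))
        sample.append((b, True))
    for (i, j) in edges:
        # e_ij: bits i and j are 0, all other bits are 1.
        e = tuple(0 if v in (i, j) else 1 for v in range(1, n + 1))
        sample.append((e, False))
    return sample

# Triangle on vertices 1, 2, 3 (which is 3-colorable).
S = reduction_sample(3, [(1, 2), (1, 3), (2, 3)])
for x, label in S:
    print(x, label)
```

For the triangle this yields three positive vectors with a single zero each and three negative vectors with two zeros each, so the sample size is |V| + |E|.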
Therefore, we may define the desired mapping χ as follows:

χ(i) = min{r | T_r(b_i) = 1, 1 ≤ r ≤ 3}.

Now, let (i, j) ∈ E; we have to show that χ(i) ≠ χ(j). Suppose the converse, i.e., χ(i) = χ(j). Then the examples b_i and b_j satisfy the same term T_r. Since (i, j) ∈ E, we additionally have that e_ij = b_i ∧ b_j (taken bitwise). However, b_j and e_ij differ just in the i-th bit. Taking into account that T_r(b_i) = T_r(b_j) = 1, it is easy to see that neither the literals x_i and x̄_i nor the literals x_j and x̄_j can be present in T_r. Thus, T_r(e_ij) = 1, a contradiction to h(e_ij) = 0. This proves Claim 1.

Claim 2. Let G = (V, E) be an undirected graph that is 3-colorable. Then there exists a hypothesis h ∈ 3-term DNF that is consistent with S(G).

Let χ be the mapping assigning the 3 colors to the vertices of G. We define

T_r = ∧_{i: χ(i) ≠ r} x_i   for r = 1, 2, 3,

and set h = T_1 ∨ T_2 ∨ T_3.
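The construction of h from a coloring in Claim 2 can be made concrete as follows (a sketch; representation and names ours, with a term stored as the set of positions whose bit must be 1):

```python
def terms_from_coloring(chi):
    """T_r = conjunction of the positive literals x_i over all vertices i
    with chi(i) != r, as in Claim 2."""
    return [{i for i, c in chi.items() if c != r} for r in (1, 2, 3)]

def term_true(term, x):
    return all(x[i - 1] == 1 for i in term)

def h(terms, x):
    return any(term_true(t, x) for t in terms)

# Path 1-2-3 colored chi(1)=1, chi(2)=2, chi(3)=1.
chi = {1: 1, 2: 2, 3: 1}
terms = terms_from_coloring(chi)
b2 = (1, 0, 1)    # positive example for vertex 2
e12 = (0, 0, 1)   # negative example for edge (1, 2)
print(h(terms, b2), h(terms, e12))
```

As expected, the positive example for vertex 2 is satisfied by T_2 (its only 0-bit is at a position excluded from T_2), while the edge example e_12 falsifies all three terms.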
We have to show that h is consistent with S(G). First, consider any positive example b_i, and let χ(i) = r. Then b_i satisfies T_r: the only 0-bit of b_i is at position i, and x_i does not occur in T_r since χ(i) = r. Now, let e_ij be any negative example. Since χ assigns different colors to i and j, we have χ(i) ≠ χ(j). Let χ(i) = r; then e_ij cannot satisfy T_r, since T_r contains x_j. Analogously, if χ(j) = g, then e_ij cannot satisfy T_g, since T_g contains x_i. Hence, if at all, only the term T_y with χ(i) ≠ y ≠ χ(j) might be satisfied by e_ij. However, this is also impossible, since T_y contains both x_i and x_j. Thus, h(e_ij) = 0, and hence h is consistent.

Finally, Theorem 3 has a nice dual version which we include as an exercise.

Exercise 1. For all n ∈ ℕ, n ≥ 1, let L_n = {x_1, x̄_1, ..., x_n, x̄_n} be the set of all relevant literals. Furthermore, let k ∈ ℕ, k ≥ 3, be arbitrarily fixed. By k-term CNF we denote the set of all conjunctions of at most k clauses, where a clause is again any disjunction of elements from L_n. Finally, let C(k) = ∪_{n ≥ 1} k-term CNF. Define the consistency problem for C(k) and prove its NP-completeness.

As we have seen, the consistency problem for 3-term DNF is NP-complete. However, at first glance it might seem that we have dealt with the wrong question. When studying the PAC learnability of 3-term DNF, all examples drawn are labeled with respect to some target concept. Hence, there is a consistent hypothesis, namely at least the target itself. The problem the learner has to solve is to construct a consistent hypothesis. Therefore, we have to investigate how the decision problem and the construction problem are related to one another. This is done by the following theorem.

Theorem 4. If there is an algorithm that efficiently PAC learns 3-term DNF with respect to 3-term DNF, then NP = RP.

Proof. Let A be any algorithm that efficiently PAC learns 3-term DNF with respect to 3-term DNF. Let q be any fixed polynomial such that the running time of A is polynomially bounded in 1/ε, 1/δ, n, and size(c).
Obviously, size(c) can be upper bounded by 3n, so we essentially have to deal with 1/ε, 1/δ, and n. Next we show how to use this algorithm A to decide the Consistency Problem for 3-term DNF in randomized polynomial time. For a formal definition of randomized algorithms and the complexity class RP we refer to Cormen, Leiserson and Rivest [2].

Let S(c, b) be any labeled m-sample, where c is any Boolean concept (not necessarily from 3-term DNF). Next, we choose ε = 1/(m + 1) and an arbitrarily small δ > 0, say δ = 0.000001. The choice of ε guarantees that the hypothesis possibly output by A must be consistent with all examples contained in S(c, b). Conceptually, this refers to the probability distribution in which each Boolean vector b_i from S(c, b) is equally likely and all other elements have probability zero. That is, D(b_i) = 1/m for all b_i from S(c, b), and D(b_j) = 0 for all other Boolean vectors b_j ∈ {0, 1}^n \ {b_1, ..., b_m}. Hence, if at least one label is not correctly reflected by the hypothesis h possibly output by A, then d(c, h) ≥ 1/m > 1/(m + 1) = ε.

Next, we run A on input ε, δ and the m-sample S(c, b) for at most q(1/ε, 1/δ, |S(c, b)|) steps. Since every polynomial is time constructible, it is decidable in polynomial time (again in 1/ε, 1/δ, |S(c, b)|) whether or not A has already executed at most q(1/ε, 1/δ, |S(c, b)|) many steps. Now, we distinguish the following cases.

Case 1. A does not stop after having executed q(1/ε, 1/δ, |S(c, b)|) many steps, or it stops but does not output any hypothesis. Then we conclude that there is no consistent hypothesis for S(c, b) in 3-term DNF. If there is really no consistent hypothesis for S(c, b) in 3-term DNF, then this output is certainly correct. Now, assume that there is a consistent hypothesis for S(c, b) in 3-term DNF. Hence, there is a concept ĉ ∈ 3-term DNF such that c(b_i) = ĉ(b_i) for all 1 ≤ i ≤ m. Since the algorithm A is supposed to PAC learn ĉ with respect to every probability distribution, it must do so for D.
Hence, with probability at least 1 − δ it has to output a hypothesis h such that d(ĉ, h) = d(c, h) ≤ ε. As shown above, the latter inequality forces
A to produce a consistent guess. Since A did not produce any guess, it has failed to PAC learn ĉ. However, this failure probability is bounded by δ.

Case 2. A stops after having executed at most q(1/ε, 1/δ, |S(c, b)|) many steps and outputs a hypothesis h. Obviously, we can decide in time polynomial in the length of h and |S(c, b)| whether or not h is consistent with S(c, b). In case it is, we know for sure that there exists a consistent hypothesis in 3-term DNF for S(c, b). If h is not consistent, then we can argue as in Case 1. Thus, with probability at least 1 − δ we may conclude that there is no consistent hypothesis for S(c, b) in 3-term DNF.

Putting it all together, we have arrived at an algorithm with the following properties. If there is no consistent hypothesis for S(c, b) in 3-term DNF, then its output is always correct. If there exists a consistent hypothesis for S(c, b) in 3-term DNF, then the above algorithm produces a wrong answer with probability at most δ. Hence, there exists a constant ε_0 > 0 such that with probability 1 − δ > 1/2 + ε_0 every m-sample for which there exists a consistent hypothesis in 3-term DNF is accepted. Thus, we have an RP algorithm for the Consistency Problem for 3-term DNF. Finally, since RP ⊆ NP, and because of the NP-completeness of the Consistency Problem for 3-term DNF, we can conclude that RP = NP.

The proof provided above is worth analyzing a bit further. We strongly recommend solving the following exercise.

Exercise 2. Prove the following: If there exists a deterministic algorithm A that constructs, for every input m-sample S drawn in accordance with some 3-term DNF formula, a consistent hypothesis, then P = NP, provided A has a running time polynomially bounded in |S| and n.

Now, it is only natural to ask whether or not Theorem 4 implies that 3-term DNF is not efficiently PAC learnable at all. We still have the freedom to choose another hypothesis space.
As it turns out, a careful choice of the hypothesis space does really change everything. An alternative choice to 3-term DNF is 3-CNF. A 3-CNF formula is a conjunction of clauses that contain at most 3 literals per clause, i.e., 3-CNF is the set of all formulas of the form

∧_i (l_{i1} ∨ l_{i2} ∨ l_{i3}),

where each l_{ij} ∈ L_n or is empty. We leave it as an exercise to show that every 3-term DNF formula is equivalently representable by a 3-CNF formula. The easiest way to see this is to prove that every 3-term DNF formula f = T_1 ∨ T_2 ∨ T_3 can be rewritten as

T_1 ∨ T_2 ∨ T_3 = ∧_{x ∈ T_1, y ∈ T_2, z ∈ T_3} (x ∨ y ∨ z).

The converse is not true, i.e., 3-CNF is more expressive than 3-term DNF. Now Theorem 2 allows the following corollary.

Corollary 1. The class of all 3-term DNF is efficiently PAC learnable with respect to the hypothesis space 3-CNF.

Again, the latter corollary is easily generalized.

Exercise 3. Prove the following: For every constant k ≥ 2, the class of all k-term DNF is efficiently PAC learnable with respect to the hypothesis space k-CNF.
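The rewriting of a 3-term DNF formula into a 3-CNF formula above can be checked mechanically. A small sketch (representation ours: a literal is a signed integer, +i for x_i and -i for its negation; a term is a list of literals, a clause a set):

```python
from itertools import product

def dnf_to_cnf(t1, t2, t3):
    """Distribute T1 or T2 or T3 into the conjunction of all clauses
    (x or y or z) with x in T1, y in T2, z in T3."""
    return [set(triple) for triple in product(t1, t2, t3)]

# Example: T1 = x1 and x2, T2 = not-x3, T3 = x2 and x4.
clauses = dnf_to_cnf([1, 2], [-3], [2, 4])

# Verify equivalence on the full truth table over 4 variables.
def lit(l, x):
    return (x[abs(l) - 1] == 1) == (l > 0)

def dnf(x):
    return (all(lit(l, x) for l in [1, 2]) or lit(-3, x)
            or all(lit(l, x) for l in [2, 4]))

def cnf(x):
    return all(any(lit(l, x) for l in cl) for cl in clauses)

assert all(dnf(x) == cnf(x) for x in product((0, 1), repeat=4))
print(clauses)
```

The distribution produces |T_1| · |T_2| · |T_3| clauses, each with at most 3 literals, so the resulting formula is indeed in 3-CNF and of polynomial size.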
References

[1] Anthony, M. and Biggs, N. (1992), Computational Learning Theory, Cambridge University Press, Cambridge.
[2] Cormen, T., Leiserson, C. and Rivest, R. (1990), Introduction to Algorithms, The MIT Press, Cambridge, MA.
[3] Garey, M.R. and Johnson, D.S. (1979), Computers and Intractability: A Guide to the Theory of NP-completeness, Freeman, San Francisco.
[4] Kearns, M.J. and Vazirani, U.V. (1994), An Introduction to Computational Learning Theory, The MIT Press.
[5] Natarajan, B.K. (1991), Machine Learning, Morgan Kaufmann Publishers Inc.
[6] Valiant, L.G. (1984), A theory of the learnable, Communications of the ACM 27, 1134–1142.