
CS 880: Advanced Complexity Theory                                                          2/8/2008
Lecture 7: Passive Learning
Instructor: Dieter van Melkebeek                                                  Scribe: Tom Watson

In the previous lectures, we studied harmonic analysis as a tool for analyzing structural properties of Boolean functions, and we developed some property testers as applications of this tool. In the next few lectures, we study applications in computational learning theory. Harmonic analysis turns out to be useful for designing and analyzing efficient learning algorithms under the uniform distribution. As a concrete example, we develop algorithms for learning decision trees. Later, we will see applications to other concept classes.

1 Computational Learning Theory

Recall that in the setting of property testing, we have an underlying class of Boolean functions and oracle access to a Boolean function f, and we want a test that accepts if f is in the class and rejects with high probability if f is far from being in the class. In computational learning theory, we again have an underlying class and a function c, but now we are guaranteed that c is in the class, and we wish to know which function c is. More formally, we have the following ingredients.

- A concept class $\{C_{n,s}\}$ where $C_{n,s} \subseteq \{c : \{-1,1\}^n \to \{-1,1\}\}$. The elements of $C_{n,s}$ are called concepts. The parameter $n$ is the input length of the concepts, and the parameter $s$ is a measure of the complexity of the concepts. For example, $C_{n,s}$ might be the class of $n$-variable functions computable by decision trees of size at most $s$.

- A hypothesis class $\{H_{n,s}\}$ where $H_{n,s} \subseteq \{h : \{-1,1\}^n \to \{-1,1\}\}$. Given a concept $c \in C_{n,s}$, a learning algorithm will output a hypothesis $h \in H_{n,s}$, which is its guess for what $c$ is. Ideally we would like $H_{n,s} = C_{n,s}$ (this setting is called proper learning). However, in general we have $H_{n,s} \supseteq C_{n,s}$; i.e., a learning algorithm may be allowed to output a hypothesis that is not a concept.

- A learning process. There are several ways in which a learning algorithm may have access to the concept $c$ (a minimal interface sketch appears below):
  - Labeled samples $(x, c(x))$ where $x$ is drawn from some distribution $D$ on $\{-1,1\}^n$.
  - Membership queries, i.e., oracle access to $c$.
  Note that these two options are incomparable; the former cannot simulate the latter because the algorithm has no control over the samples it receives, and the latter cannot simulate the former if the distribution $D$ is not known. There is a third commonly studied type of access known as equivalence queries, but we will not study this type. Learning from labeled samples only is often referred to as passive learning, whereas learning using membership queries is referred to as active learning.

- A measure of success. We measure success in terms of $h$ being close to $c$. We would like $\Pr_{x \sim D}[h(x) \neq c(x)] \leq \epsilon$ for some error parameter $\epsilon$; note that the probability is with respect to the same distribution $D$ that the labeled samples are drawn from. As learning algorithms are generally randomized, the hypothesis $h$ may or may not satisfy the above inequality, but we would like it to hold with high probability, say at least $1 - \delta$, where $\delta$ is a confidence parameter.

We use the running time of a learning algorithm as our measure of efficiency. We would like the running time to be polynomial in $n$, $s$, $\frac{1}{\epsilon}$, and $\frac{1}{\delta}$. Typically, however, the dependence on $\delta$ is polynomial in $\log\frac{1}{\delta}$ rather than in $\frac{1}{\delta}$. This is because if we can learn with some particular confidence parameter less than 1/3, say, then we can iterate the process to drive the confidence parameter down exponentially in the number of iterations. In each iteration we find a new hypothesis $h$ and test whether it is sufficiently close to $c$ using labeled samples (incurring only a slight loss in confidence); if so we output $h$, and otherwise we continue. For this reason, we will usually ignore the dependence on the confidence parameter $\delta$.
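To make the two access modes concrete, here is a minimal Python sketch of passive access (labeled samples drawn from $D$) versus active access (membership queries). The helper names are illustrative, not from the lecture; the sketch only fixes the interface, and the learners themselves come later.

```python
import random
from typing import Callable, Tuple

Point = Tuple[int, ...]          # an element of {-1,1}^n

def labeled_sample(c: Callable[[Point], int],
                   draw_from_D: Callable[[], Point]) -> Tuple[Point, int]:
    """Passive access: one sample (x, c(x)) with x ~ D; the learner cannot choose x."""
    x = draw_from_D()
    return x, c(x)

def membership_query(c: Callable[[Point], int], x: Point) -> int:
    """Active access: the learner chooses x and receives c(x)."""
    return c(x)

def uniform_point(n: int) -> Point:
    """The uniform distribution on {-1,1}^n, the case studied in these lectures."""
    return tuple(random.choice((-1, 1)) for _ in range(n))

# Example: passive samples of the parity c(x) = x_0 * x_1 under the uniform distribution.
if __name__ == "__main__":
    c = lambda x: x[0] * x[1]
    print([labeled_sample(c, lambda: uniform_point(3)) for _ in range(3)])
```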

These ingredients allow us to define the model of distribution-free learning, or PAC (probably approximately correct) learning.

Definition 1. A PAC learner for $\{C_{n,s}\}$ by $\{H_{n,s}\}$ is a randomized oracle Turing machine $L$ such that for all $n$, $s$, $\epsilon > 0$, $\delta > 0$, distributions $D$ on $\{-1,1\}^n$, and concepts $c \in C_{n,s}$, if $L$ is given access to random labeled samples $(x, c(x))$ where $x$ is drawn from $D$, then it produces a hypothesis $h \in H_{n,s}$ such that
$$\Pr\Big[\Pr_{x \sim D}[h(x) \neq c(x)] \leq \epsilon\Big] \geq 1 - \delta. \qquad (1)$$
We say that $L$ is efficient if it runs in time $\mathrm{poly}(n, s, \frac{1}{\epsilon}, \frac{1}{\delta})$.

A PAC learner with membership queries is defined in the same way, except that the learner $L$ is allowed to make oracle queries to $c$ in addition to receiving random labeled samples.

Note that the outer probability in equation (1) is over the randomness used by $L$ and the randomness used to generate labeled samples, and the inner probability is an independent experiment for measuring the quality of $h$. Also note that $L$ does not know $D$, but is still required to succeed for any $D$. In this class, we will only study learning algorithms that work when the random samples are guaranteed to come from the uniform distribution (and may fail when $D$ is not uniform).

Definition 2. We say a machine $L$ learns with respect to the uniform distribution if $L$ satisfies the conditions of Definition 1 for $D$ the uniform distribution.

In this class, we will also assume that the learner $L$ is given both $n$ and $s$ as input. In practical settings, the complexity $s$ of the concept may not be known. In this case, the learner can just try $s = 2^i$ for $i = 0, 1, 2, \ldots$ until it finds a hypothesis sufficiently close to the concept (which it can test using random samples); this incurs only a small penalty in running time. Thus, our assumption about knowing $s$ is essentially without loss of generality.
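As a concrete illustration of the two remarks above (testing a candidate hypothesis with fresh random samples, and guessing the unknown complexity $s$ by doubling), here is a small Python sketch. The helper names, the assumed `learner(s, eps, delta)` interface, and the constants in the sample count are illustrative, not prescribed by the lecture.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[int, ...]

def empirical_error(h: Callable[[Point], int],
                    samples: List[Tuple[Point, int]]) -> float:
    """Fraction of labeled samples (x, c(x)) on which h disagrees with c."""
    return sum(h(x) != y for x, y in samples) / len(samples)

def learn_without_knowing_s(learner: Callable[[int, float, float], Callable[[Point], int]],
                            get_samples: Callable[[int], List[Tuple[Point, int]]],
                            eps: float, delta: float, max_s: int) -> Callable[[Point], int]:
    """Try s = 1, 2, 4, ... and return the first hypothesis that tests well.

    `learner(s, eps, delta)` is the assumed PAC learner run with complexity bound s;
    `get_samples(k)` returns k fresh labeled samples.  Distinguishing true error
    <= eps from true error >= 2*eps takes only O((1/eps) log(1/delta)) samples;
    the constant 48 below is an illustrative choice, not optimized.
    """
    s = 1
    while s <= max_s:
        h = learner(s, eps, delta)
        k = math.ceil((48 / eps) * math.log(2 / delta))
        if empirical_error(h, get_samples(k)) <= 1.5 * eps:
            return h
        s *= 2
    raise RuntimeError("no hypothesis passed the test up to max_s")
```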

2 Learning Using Harmonic Analysis

It is intuitively clear that if $C_{n,s}$ has no structure, for example if it is the set of all $n$-variable functions, then efficient learning is impossible. This is because after seeing a small set of samples, the learning algorithm has no way of guessing the value of the concept on other inputs. (This intuition can be turned into a formal argument.)

The structure we impose on $C_{n,s}$ in this lecture is that all concepts must have their Fourier weight concentrated on only a few coefficients. That is, we only allow concepts $c = \sum_S \hat{c}(S)\chi_S$ such that for some small set $\mathcal{S} \subseteq 2^{[n]}$,
$$\sum_{S \notin \mathcal{S}} (\hat{c}(S))^2 \leq \epsilon.$$
(We are using $\epsilon$ both for this bound and for the error parameter of a learning algorithm; however, these two quantities will be closely related, so we do not introduce new notation.)

How does this structural property help us learn $c$? We know that the (in general non-Boolean) function $f = \sum_{S \in \mathcal{S}} \hat{c}(S)\chi_S$ is a good approximation to $c$ in the sense that $\|c - f\|^2 \leq \epsilon$. Thus, as a first step, we can focus on finding an approximation to $f$. Assuming we know $\mathcal{S}$, we need to approximate $\hat{f}(S) = \hat{c}(S) = \mathbb{E}_x[c(x)\chi_S(x)]$ for $S \in \mathcal{S}$ (where the expectation is over the uniform distribution). If we have access to uniform samples $(x, c(x))$, then we can also get uniform samples $(x, c(x)\chi_S(x))$ because $\chi_S(x)$ is trivially efficiently computable. We can thus approximate $\mathbb{E}_x[c(x)\chi_S(x)]$ by taking the average of many samples of $c(x)\chi_S(x)$. More formally, since $|c(x)\chi_S(x)| \leq 1$ for all $x$, a Chernoff bound shows that with probability at least $1 - 2e^{-\eta^2 k/2}$, the average of $k$ samples is within an additive $\pm\eta$ of the desired expectation, where $\eta$ is a parameter we will specify shortly. We want this probability to be at least $1 - \delta/|\mathcal{S}|$; picking $k = \Theta(\frac{1}{\eta^2}\log\frac{|\mathcal{S}|}{\delta})$ suffices for this. If we let $\hat{g}(S)$ denote our approximation of $\hat{f}(S)$ for $S \in \mathcal{S}$, then the function $g = \sum_{S \in \mathcal{S}} \hat{g}(S)\chi_S$ satisfies
$$\|c - g\|^2 = \sum_{S \in \mathcal{S}} \big(\hat{f}(S) - \hat{g}(S)\big)^2 + \sum_{S \notin \mathcal{S}} (\hat{c}(S))^2 \leq |\mathcal{S}|\,\eta^2 + \epsilon \leq 2\epsilon,$$
where the last inequality holds provided we pick $\eta \leq \sqrt{\epsilon/|\mathcal{S}|}$.

Thus, assuming we know $\mathcal{S}$ and our labeled samples are drawn from the uniform distribution, we can find $g$ in time $\mathrm{poly}(n, |\mathcal{S}|, \frac{1}{\epsilon}, \log\frac{1}{\delta})$: we need to process $|\mathcal{S}|$ coefficients, for each one we need $O(\frac{1}{\eta^2}\log\frac{|\mathcal{S}|}{\delta}) = O(\frac{|\mathcal{S}|}{\epsilon}\log\frac{|\mathcal{S}|}{\delta})$ samples, and for each sample we need $O(n)$ time. After a union bound over $S \in \mathcal{S}$, with probability at least $1 - \delta$ we have $\|c - g\|^2 \leq 2\epsilon$.

The only problem with $g$ is that it might not be Boolean. The most natural thing to do is to obtain the hypothesis $h$ by rounding $g$ to the nearest Boolean function: $h(x) = \mathrm{sign}(g(x))$ (the case $g(x) = 0$ can be decided arbitrarily). How well does $h$ approximate $c$? Assuming we achieved $\|c - g\|^2 \leq 2\epsilon$, we have
$$\Pr[h(x) \neq c(x)] \leq \Pr[|c(x) - g(x)| \geq 1] \leq \mathbb{E}[(c - g)^2] \leq 2\epsilon.$$
We won't care exactly what hypothesis class $\{H_{n,s}\}$ we are using, although it does follow from the above that
$$\sum_{S \notin \mathcal{S}} (\hat{h}(S))^2 \leq \sum_{S} \big(\hat{h}(S) - \hat{g}(S)\big)^2 \leq 2\epsilon.$$
This is because $\sum_S (\hat{h}(S) - \hat{g}(S))^2 = \mathbb{E}[(h - g)^2]$ and, over all Boolean functions $\tilde{c}$, $h$ minimizes $\mathbb{E}[(\tilde{c} - g)^2]$, while $c$ achieves $\mathbb{E}[(c - g)^2] \leq 2\epsilon$.

To summarize where we are, we have four functions involved: given the concept $c$, $f$ is obtained by annihilating the Fourier coefficients not in $\mathcal{S}$, $g$ is obtained by approximating the remaining Fourier coefficients, and then $h$ is obtained by rounding to a Boolean function. We argued that by choosing the approximation parameter appropriately, we achieve at most $2\epsilon$ error with probability at least $1 - \delta$ in time $\mathrm{poly}(n, |\mathcal{S}|, \frac{1}{\epsilon}, \log\frac{1}{\delta})$. This result requires the samples to come from the uniform distribution.
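The estimation-and-rounding procedure above is simple enough to state in code. The following Python sketch (illustrative names; the sample count and the toy concept are chosen for the example, not prescribed by the lecture) estimates each coefficient $\hat{c}(S)$ for $S \in \mathcal{S}$ by averaging $c(x)\chi_S(x)$ over uniform labeled samples, forms $g$, and returns $h = \mathrm{sign}(g)$.

```python
import itertools
import random
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[int, ...]
Subset = Tuple[int, ...]

def chi(S: Subset, x: Sequence[int]) -> int:
    """Character chi_S(x) = prod_{i in S} x_i over {-1,1}^n (empty product = 1)."""
    out = 1
    for i in S:
        out *= x[i]
    return out

def estimate_coefficients(samples: List[Tuple[Point, int]],
                          family: List[Subset]) -> Dict[Subset, float]:
    """ghat(S) = average of c(x) * chi_S(x) over the uniform labeled samples."""
    return {S: sum(y * chi(S, x) for x, y in samples) / len(samples) for S in family}

def low_degree_learn(samples: List[Tuple[Point, int]],
                     family: List[Subset]) -> Callable[[Point], int]:
    """Form g = sum_{S in family} ghat(S) chi_S and return h = sign(g)."""
    ghat = estimate_coefficients(samples, family)
    def h(x: Point) -> int:
        g = sum(coef * chi(S, x) for S, coef in ghat.items())
        return 1 if g >= 0 else -1   # the tie g(x) = 0 is broken arbitrarily, as in the text
    return h

# Toy usage: learn c(x) = x_0 * x_1 on {-1,1}^3 from uniform samples,
# using the family of all subsets of size at most 2.
if __name__ == "__main__":
    n = 3
    c = lambda x: x[0] * x[1]
    samples = []
    for _ in range(2000):
        x = tuple(random.choice((-1, 1)) for _ in range(n))
        samples.append((x, c(x)))
    family = [S for r in range(3) for S in itertools.combinations(range(n), r)]
    h = low_degree_learn(samples, family)
    print("disagreements on the sample set:", sum(h(x) != y for x, y in samples))
```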

The above learning algorithm needs to somehow know a set $\mathcal{S}$ such that $\sum_{S \notin \mathcal{S}} (\hat{c}(S))^2 \leq \epsilon$. There are two cases:

- $\mathcal{S}$ is known, i.e., it can be computed efficiently given $n$ and $s$. In this case, as we showed above, we can learn efficiently using only random samples, provided $\mathcal{S}$ is not too large.

- $\mathcal{S}$ is unknown. In this case, provided we have a good upper bound on $|\mathcal{S}|$, we can compute a suitable alternative set $\mathcal{S}'$ using membership queries and then run the above algorithm with $\mathcal{S}'$. We show how to do this in the next lecture. Note that a randomized learning algorithm that can make membership queries automatically has access to labeled samples from the uniform distribution.

3 Learning From Samples

For the rest of this lecture, we focus on special concept classes where an appropriate set $\mathcal{S}$ is trivial to find. We begin by noting that the set of all small sets $S$ forms an appropriate $\mathcal{S}$ when the average sensitivity $I(c)$ is low, because if a large set has a large Fourier coefficient, then $I(c)$ must also be large by the formula $I(c) = \sum_S |S| (\hat{c}(S))^2$, proven in the previous lecture.

Theorem 1. If $I(c) \leq d$ for all $c \in C_{n,s}$, then there is an algorithm that learns $C_{n,s}$ with respect to the uniform distribution using random samples only and running in time $\mathrm{poly}(n^{d/\epsilon}, \log\frac{1}{\delta})$.

Proof. For all $t$ we have
$$\sum_{|S| \geq t} (\hat{c}(S))^2 \leq \frac{1}{t} \sum_{|S| \geq t} |S| (\hat{c}(S))^2 \leq \frac{1}{t} I(c) \leq \frac{d}{t} \leq \epsilon,$$
where the last inequality holds provided $t \geq \frac{d}{\epsilon}$. Thus we can apply the learning algorithm from the previous section with $\mathcal{S} = \{S : |S| < \frac{d}{\epsilon}\}$. The trivial upper bound $|\mathcal{S}| \leq (n+1)^{d/\epsilon}$ yields the running time.

We can translate this result to decision trees using the fact that functions computable by small decision trees have low average sensitivity. If a decision tree has depth $d$, then the average sensitivity of the function computed by it is at most $d$; in fact, the sensitivity at any input is at most the number of nodes on the corresponding path in the decision tree, which is at most $d$. Thus we have the following.

Theorem 2. If $C_{n,d}$ is the set of functions computed by decision trees of depth at most $d$, then there is an algorithm that learns $C_{n,d}$ with respect to the uniform distribution using random samples only and running in time $\mathrm{poly}(n^{d/\epsilon}, \log\frac{1}{\delta})$.
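A quick sketch of the family used in the proof of Theorem 1: all subsets of $[n]$ of size less than $t = \lceil d/\epsilon \rceil$. The Python below (illustrative names and parameter values) enumerates that family and compares its exact size against the crude $(n+1)^{d/\epsilon}$ bound quoted for the running time.

```python
import itertools
import math
from typing import Iterator, Tuple

def small_sets(n: int, t: int) -> Iterator[Tuple[int, ...]]:
    """All S subset of {0, ..., n-1} with |S| < t."""
    for r in range(t):
        yield from itertools.combinations(range(n), r)

def family_size(n: int, t: int) -> int:
    """Exact count of the family, sum of binomial coefficients C(n, r) for r < t."""
    return sum(math.comb(n, r) for r in range(t))

if __name__ == "__main__":
    n, d, eps = 20, 3, 0.5
    t = math.ceil(d / eps)            # keep sets of size < d/eps, as in the proof
    exact = family_size(n, t)
    trivial = (n + 1) ** t            # the crude bound used for the running time
    print(f"|S| = {exact}  <=  (n+1)^t = {trivial}")
```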

What can we say if $s$ refers to the size of a decision tree rather than its depth? Suppose we truncate the original tree $T$ at some level $\ell$ to obtain a tree $T'$. (The behavior of $T'$ on severed paths is arbitrary.) For large $\ell$, $T'$ is a good approximation to $T$, because each of the leaves at depth greater than $\ell$ is reached with only small probability, and there are few such leaves when $T$ is small. How large does $\ell$ need to be? We have
$$\Pr_x[T'(x) \neq T(x)] \leq \sum_{\text{leaves } v \text{ at depth} > \ell} \Pr_x[v \text{ is reached}] \leq \sum_{\text{leaves } v \text{ at depth} > \ell} 2^{-\ell} \leq s \cdot 2^{-\ell} \leq \epsilon, \qquad (2)$$
where the last inequality holds provided $\ell \geq \log\frac{s}{\epsilon}$. Now, suppose we apply our learning algorithm for decision trees of depth at most $\ell$ to $T$. The approximation $f$ in the first step satisfies
$$\|T - f\| = \Big\| \sum_{S \notin \mathcal{S}} \hat{T}(S) \chi_S \Big\| \qquad (3)$$
$$\leq \Big\| \sum_{S \notin \mathcal{S}} \big(\hat{T}(S) - \hat{T'}(S)\big) \chi_S \Big\| + \Big\| \sum_{S \notin \mathcal{S}} \hat{T'}(S) \chi_S \Big\| \qquad (4)$$
$$\leq \|T - T'\| + \Big\| \sum_{S \notin \mathcal{S}} \hat{T'}(S) \chi_S \Big\| \qquad (5)$$
$$\leq 2\sqrt{\epsilon} + \sqrt{\epsilon} = 3\sqrt{\epsilon}. \qquad (6)$$
Step (3) follows because the Fourier coefficients of $f$ agree with those of $T$ on $\mathcal{S}$ and vanish elsewhere. Step (4) follows from the triangle inequality, and step (5) from the orthonormality of the characters. Step (6) follows from the fact that $\|T - T'\| = 2\sqrt{\Pr[T \neq T']}$ together with (2), and from the defining property of $\mathcal{S}$ applied to $T'$. Thus, $\|T - f\|^2 \leq 9\epsilon$. By approximating each of the Fourier coefficients of $T$ in $\mathcal{S}$ to within $\eta$ as above, we obtain an approximation $g$ for which $\|T - g\|^2 \leq 10\epsilon$, and our resulting hypothesis $h = \mathrm{sign}(g)$ differs from $T$ on no more than a $10\epsilon$ fraction of the inputs. We have derived the following result about learning decision trees from random samples.

Theorem 3. If $C_{n,s}$ is the set of functions computed by decision trees of size at most $s$, then there is an algorithm that learns $C_{n,s}$ with respect to the uniform distribution using random samples only and running in time $\mathrm{poly}(n^{\log(s/\epsilon)/\epsilon}, \log\frac{1}{\delta})$.

In particular, if $\epsilon$ is a constant then we get a learning algorithm that runs in time quasipolynomial in $n + s$. As we begin to show next, we can bring the running time down to polynomial if membership queries are available.
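A tiny numeric sketch of the parameters in this argument (values are illustrative): the truncation depth $\ell = \lceil \log_2(s/\epsilon) \rceil$ keeps the truncation error below $\epsilon$, and the exponent in Theorem 3 comes from then keeping only Fourier coefficients on sets of size below $\ell/\epsilon$.

```python
import math

def truncation_depth(s: int, eps: float) -> int:
    """Smallest l with s * 2^(-l) <= eps, i.e. l = ceil(log2(s / eps))."""
    return math.ceil(math.log2(s / eps))

def truncation_error_bound(s: int, l: int) -> float:
    """At most s severed leaves, each reached with probability 2^(-l)."""
    return s * 2.0 ** (-l)

if __name__ == "__main__":
    s, eps = 1000, 0.1
    l = truncation_depth(s, eps)
    print("l =", l, " truncation error bound =", truncation_error_bound(s, l))  # <= eps
    print("degree cutoff l/eps =", l / eps)  # only sets smaller than this are kept
```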

4 Learning From Queries

Let us return to our generic learning algorithm that uses random samples from the uniform distribution and achieves $2\epsilon$ error and $\delta$ confidence in time $\mathrm{poly}(n, |\mathcal{S}|, \frac{1}{\epsilon}, \log\frac{1}{\delta})$, provided we know a set $\mathcal{S}$ such that $\sum_{S \notin \mathcal{S}} (\hat{c}(S))^2 \leq \epsilon$. If all we know is that such a set $\mathcal{S}$ exists and has size at most some value $M$, then we can attempt to find $\mathcal{S}$ or another suitable set. The idea is to define $\mathcal{S}' = \{S : (\hat{c}(S))^2 \geq \tau\}$ for some threshold $\tau$ and to compute $\mathcal{S}'$ using the list decoding algorithm we describe in the next lecture. Without loss of generality we can assume that $\mathcal{S}$ contains the $M$ sets with the largest coefficients in absolute value.

A first attempt at choosing $\tau$ would be to set it low enough to ensure that $\sum_{S \notin \mathcal{S}'} (\hat{c}(S))^2 \leq \epsilon$. Setting $\tau = \frac{\epsilon}{2^n}$ would certainly achieve this, since then $\sum_{S \notin \mathcal{S}'} (\hat{c}(S))^2 \leq 2^n \tau = \epsilon$. However, since the list decoding algorithm runs in time polynomial in $\tau^{-1}$, this would result in an exponential-time learning algorithm. By setting $\tau$ higher, we may not be able to guarantee our $\epsilon$ bound, but we won't be too far off: of the sets not in $\mathcal{S}'$, those not in $\mathcal{S}$ only contribute a small amount (by hypothesis), and those in $\mathcal{S}$ also only contribute a small amount (since there are only a small number $M$ of them and each has weight at most $\tau$).

[Figure: a number line from 0 to 1 marking the minimum $\tau$ needed to achieve $\epsilon$ and the minimum $\tau$ needed to achieve $2\epsilon$.]

Formally,
$$\sum_{S \notin \mathcal{S}'} (\hat{c}(S))^2 \leq \sum_{S \in \mathcal{S} \setminus \mathcal{S}'} (\hat{c}(S))^2 + \sum_{S \notin \mathcal{S}} (\hat{c}(S))^2 \leq M\tau + \epsilon.$$
Choosing $\tau = \frac{\epsilon}{M}$ ensures that this quantity is at most $2\epsilon$, and it also ensures that $|\mathcal{S}'| \leq \frac{M}{\epsilon}$, so $\mathcal{S}'$ is a suitable alternative to $\mathcal{S}$. That is, we can learn in roughly the same time as if we knew $\mathcal{S}$, provided we can find $\mathcal{S}'$. But $(\hat{c}(S))^2 \geq \frac{\epsilon}{M}$ is equivalent to saying that $\chi_S$ or $-\chi_S$ correlates well with $c$, and since the characters are exactly the Hadamard codewords, we can use a list decoding algorithm for the Hadamard code to efficiently find the set $\mathcal{S}'$ using membership queries. In the next lecture, we will spell out the list decoding algorithm in more detail, and we will apply this approach to learning decision trees.
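A final numeric sketch of the threshold choice (parameters are illustrative): with $\tau = \epsilon/M$, the sets dropped from $\mathcal{S}$ add at most $M\tau = \epsilon$ on top of the $\epsilon$ already allowed outside $\mathcal{S}$, and the list $\mathcal{S}'$ has size at most $1/\tau = M/\epsilon$ because the squared Fourier coefficients of a Boolean function sum to 1.

```python
def weight_missed_bound(M: int, eps: float, tau: float) -> float:
    """Bound on the Fourier weight outside S': at most M*tau from S \\ S', plus eps outside S."""
    return M * tau + eps

def list_size_bound(tau: float) -> int:
    """|S'| <= 1/tau, since each set in S' has squared coefficient at least tau (Parseval)."""
    return int(1 / tau)

if __name__ == "__main__":
    M, eps = 50, 0.05
    tau = eps / M
    print("missed Fourier weight <=", weight_missed_bound(M, eps, tau))  # equals 2*eps
    print("|S'| <=", list_size_bound(tau))                               # equals M/eps
```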