
CS 880: Advanced Complexity Theory                                                          2/8/2008
Lecture 7: Passive Learning
Instructor: Dieter van Melkebeek                                                  Scribe: Tom Watson

In the previous lectures, we studied harmonic analysis as a tool for analyzing structural properties of Boolean functions, and we developed some property testers as applications of this tool. In the next few lectures, we study applications in computational learning theory. Harmonic analysis turns out to be useful for designing and analyzing efficient learning algorithms under the uniform distribution. As a concrete example, we develop algorithms for learning decision trees. Later, we will see applications to other concept classes.

1 Computational Learning Theory

Recall that in the setting of property testing, we have an underlying class of Boolean functions and oracle access to a Boolean function f, and we want a test that accepts if f is in the class and rejects with high probability if f is far from being in the class. In computational learning theory, we again have an underlying class and a function c, but now we are guaranteed that c is in the class, and we wish to know which function c is. More formally, we have the following ingredients.

- A concept class $\{C_{n,s}\}$ where $C_{n,s} \subseteq \{c : \{-1,1\}^n \to \{-1,1\}\}$. The elements of $C_{n,s}$ are called concepts. The parameter $n$ is the input length of the concepts, and the parameter $s$ is a measure of the complexity of the concepts. For example, $C_{n,s}$ might be the class of $n$-variable functions computable by decision trees of size at most $s$.

- A hypothesis class $\{H_{n,s}\}$ where $H_{n,s} \subseteq \{h : \{-1,1\}^n \to \{-1,1\}\}$. Given a concept $c \in C_{n,s}$, a learning algorithm will output a hypothesis $h \in H_{n,s}$, which is its guess for what $c$ is. Ideally we would like $H_{n,s} = C_{n,s}$ (this setting is called proper learning). However, in general we have $H_{n,s} \supseteq C_{n,s}$; i.e., a learning algorithm may be allowed to output a hypothesis that is not a concept.

- A learning process. There are several ways in which a learning algorithm may have access to the concept $c$ (a minimal interface sketch appears below):
  - Labeled samples $(x, c(x))$ where $x$ is drawn from some distribution $D$ on $\{-1,1\}^n$.
  - Membership queries, i.e., oracle access to $c$.
  Note that these two options are incomparable; the former cannot simulate the latter because the algorithm has no control over the samples it receives, and the latter cannot simulate the former if the distribution $D$ is not known. There is a third commonly studied type of access known as equivalence queries, but we will not study this type. Learning from labeled samples only is often referred to as passive learning, whereas learning using membership queries is referred to as active learning.

- A measure of success. We measure success in terms of $h$ being close to $c$. We would like $\Pr_{x \sim D}[h(x) \neq c(x)] \leq \epsilon$ for some error parameter $\epsilon$; note that the probability is with respect to the same distribution $D$ that the labeled samples are drawn from. As learning algorithms are generally randomized, the hypothesis $h$ may or may not satisfy the above inequality, but we would like it to hold with high probability, say at least $1 - \delta$, where $\delta$ is a confidence parameter.

We use the running time of a learning algorithm as our measure of efficiency. We would like the running time to be polynomial in $n$, $s$, $\frac{1}{\epsilon}$, and $\frac{1}{\delta}$. Typically, however, the dependence on $\delta$ is polynomial in $\log\frac{1}{\delta}$ rather than in $\frac{1}{\delta}$. This is because if we can learn with some particular confidence parameter less than 1/3, say, then we can iterate the process to drive the confidence parameter down exponentially in the number of iterations. In each iteration we find a new hypothesis $h$ and test whether it is sufficiently close to $c$ using labeled samples (incurring only a slight loss in confidence); if so we output $h$, and otherwise we continue. For this reason, we will usually ignore the dependence on the confidence parameter $\delta$.
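To make the two access modes concrete, here is a minimal Python sketch of passive access (labeled samples drawn from $D$) versus active access (membership queries). The helper names are illustrative, not from the lecture; the sketch only fixes the interface, and the learners themselves come later.

```python
import random
from typing import Callable, Tuple

Point = Tuple[int, ...]          # an element of {-1,1}^n

def labeled_sample(c: Callable[[Point], int],
                   draw_from_D: Callable[[], Point]) -> Tuple[Point, int]:
    """Passive access: one sample (x, c(x)) with x ~ D; the learner cannot choose x."""
    x = draw_from_D()
    return x, c(x)

def membership_query(c: Callable[[Point], int], x: Point) -> int:
    """Active access: the learner chooses x and receives c(x)."""
    return c(x)

def uniform_point(n: int) -> Point:
    """The uniform distribution on {-1,1}^n, the case studied in these lectures."""
    return tuple(random.choice((-1, 1)) for _ in range(n))

# Example: passive samples of the parity c(x) = x_0 * x_1 under the uniform distribution.
if __name__ == "__main__":
    c = lambda x: x[0] * x[1]
    print([labeled_sample(c, lambda: uniform_point(3)) for _ in range(3)])
```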

These ingredients allow us to define the model of distribution-free learning, or PAC (probably approximately correct) learning.

Definition 1. A PAC learner for $\{C_{n,s}\}$ by $\{H_{n,s}\}$ is a randomized oracle Turing machine $L$ such that for all $n$, $s$, $\epsilon > 0$, $\delta > 0$, distributions $D$ on $\{-1,1\}^n$, and concepts $c \in C_{n,s}$, if $L$ is given access to random labeled samples $(x, c(x))$ where $x$ is drawn from $D$, then it produces a hypothesis $h \in H_{n,s}$ such that
$$\Pr\Big[\Pr_{x \sim D}[h(x) \neq c(x)] \leq \epsilon\Big] \geq 1 - \delta. \qquad (1)$$
We say that $L$ is efficient if it runs in time $\mathrm{poly}(n, s, \frac{1}{\epsilon}, \frac{1}{\delta})$.

A PAC learner with membership queries is defined in the same way, except that the learner $L$ is allowed to make oracle queries to $c$ in addition to receiving random labeled samples.

Note that the outer probability in equation (1) is over the randomness used by $L$ and the randomness used to generate labeled samples, and the inner probability is an independent experiment for measuring the quality of $h$. Also note that $L$ does not know $D$, but is still required to succeed for any $D$. In this class, we will only study learning algorithms that work when the random samples are guaranteed to come from the uniform distribution (and may fail when $D$ is not uniform).

Definition 2. We say a machine $L$ learns with respect to the uniform distribution if $L$ satisfies the conditions of Definition 1 for $D$ the uniform distribution.

In this class, we will also assume that the learner $L$ is given both $n$ and $s$ as input. In practical settings, the complexity $s$ of the concept may not be known. In this case, the learner can just try $s = 2^i$ for $i = 0, 1, 2, \ldots$ until it finds a hypothesis sufficiently close to the concept (which it can test using random samples); this incurs only a small penalty in running time. Thus, our assumption about knowing $s$ is essentially without loss of generality.
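As a concrete illustration of the two remarks above (testing a candidate hypothesis with fresh random samples, and guessing the unknown complexity $s$ by doubling), here is a small Python sketch. The helper names, the assumed `learner(s, eps, delta)` interface, and the constants in the sample count are illustrative, not prescribed by the lecture.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[int, ...]

def empirical_error(h: Callable[[Point], int],
                    samples: List[Tuple[Point, int]]) -> float:
    """Fraction of labeled samples (x, c(x)) on which h disagrees with c."""
    return sum(h(x) != y for x, y in samples) / len(samples)

def learn_without_knowing_s(learner: Callable[[int, float, float], Callable[[Point], int]],
                            get_samples: Callable[[int], List[Tuple[Point, int]]],
                            eps: float, delta: float, max_s: int) -> Callable[[Point], int]:
    """Try s = 1, 2, 4, ... and return the first hypothesis that tests well.

    `learner(s, eps, delta)` is the assumed PAC learner run with complexity bound s;
    `get_samples(k)` returns k fresh labeled samples.  Distinguishing true error
    <= eps from true error >= 2*eps takes only O((1/eps) log(1/delta)) samples;
    the constant 48 below is an illustrative choice, not optimized.
    """
    s = 1
    while s <= max_s:
        h = learner(s, eps, delta)
        k = math.ceil((48 / eps) * math.log(2 / delta))
        if empirical_error(h, get_samples(k)) <= 1.5 * eps:
            return h
        s *= 2
    raise RuntimeError("no hypothesis passed the test up to max_s")
```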

2 Learning Using Harmonic Analysis

It is intuitively clear that if $C_{n,s}$ has no structure, for example if it is the set of all $n$-variable functions, then efficient learning is impossible. This is because after seeing a small set of samples, the learning algorithm has no way of guessing the value of the concept on other inputs. (This intuition can be turned into a formal argument.)

The structure we impose on $C_{n,s}$ in this lecture is that all concepts must have their Fourier weight concentrated on only a few coefficients. That is, we only allow concepts $c = \sum_S \hat{c}(S)\chi_S$ such that for some small set $\mathcal{S} \subseteq 2^{[n]}$,
$$\sum_{S \notin \mathcal{S}} (\hat{c}(S))^2 \leq \epsilon.$$
(We are using $\epsilon$ both for this bound and for the error parameter of a learning algorithm; however, these two quantities will be closely related, so we do not introduce new notation.)

How does this structural property help us learn $c$? We know that the (in general non-Boolean) function $f = \sum_{S \in \mathcal{S}} \hat{c}(S)\chi_S$ is a good approximation to $c$ in the sense that $\|c - f\|^2 \leq \epsilon$. Thus, as a first step, we can focus on finding an approximation to $f$. Assuming we know $\mathcal{S}$, we need to approximate $\hat{f}(S) = \hat{c}(S) = \mathbb{E}_x[c(x)\chi_S(x)]$ for $S \in \mathcal{S}$ (where the expectation is over the uniform distribution). If we have access to uniform samples $(x, c(x))$, then we can also get uniform samples $(x, c(x)\chi_S(x))$ because $\chi_S(x)$ is trivially efficiently computable. We can thus approximate $\mathbb{E}_x[c(x)\chi_S(x)]$ by taking the average of many samples of $c(x)\chi_S(x)$. More formally, since $|c(x)\chi_S(x)| \leq 1$ for all $x$, a Chernoff bound shows that with probability at least $1 - 2e^{-\eta^2 k/2}$, the average of $k$ samples is within an additive $\pm\eta$ of the desired expectation, where $\eta$ is a parameter we will specify shortly. We want this probability to be at least $1 - \delta/|\mathcal{S}|$; picking $k = \Theta(\frac{1}{\eta^2}\log\frac{|\mathcal{S}|}{\delta})$ suffices for this. If we let $\hat{g}(S)$ denote our approximation of $\hat{f}(S)$ for $S \in \mathcal{S}$, then the function $g = \sum_{S \in \mathcal{S}} \hat{g}(S)\chi_S$ satisfies
$$\|c - g\|^2 = \sum_{S \in \mathcal{S}} \big(\hat{f}(S) - \hat{g}(S)\big)^2 + \sum_{S \notin \mathcal{S}} (\hat{c}(S))^2 \leq |\mathcal{S}|\,\eta^2 + \epsilon \leq 2\epsilon,$$
where the last inequality holds provided we pick $\eta \leq \sqrt{\epsilon/|\mathcal{S}|}$.

Thus, assuming we know $\mathcal{S}$ and our labeled samples are drawn from the uniform distribution, we can find $g$ in time $\mathrm{poly}(n, |\mathcal{S}|, \frac{1}{\epsilon}, \log\frac{1}{\delta})$: we need to process $|\mathcal{S}|$ coefficients, for each one we need $O(\frac{1}{\eta^2}\log\frac{|\mathcal{S}|}{\delta}) = O(\frac{|\mathcal{S}|}{\epsilon}\log\frac{|\mathcal{S}|}{\delta})$ samples, and for each sample we need $O(n)$ time. After a union bound over $S \in \mathcal{S}$, with probability at least $1 - \delta$ we have $\|c - g\|^2 \leq 2\epsilon$.

The only problem with $g$ is that it might not be Boolean. The most natural thing to do is to obtain the hypothesis $h$ by rounding $g$ to the nearest Boolean function: $h(x) = \mathrm{sign}(g(x))$ (the case $g(x) = 0$ can be decided arbitrarily). How well does $h$ approximate $c$? Assuming we achieved $\|c - g\|^2 \leq 2\epsilon$, we have
$$\Pr[h(x) \neq c(x)] \leq \Pr[|c(x) - g(x)| \geq 1] \leq \mathbb{E}[(c - g)^2] \leq 2\epsilon.$$
We won't care exactly what hypothesis class $\{H_{n,s}\}$ we are using, although it does follow from the above that
$$\sum_{S \notin \mathcal{S}} (\hat{h}(S))^2 \leq \sum_{S} \big(\hat{h}(S) - \hat{g}(S)\big)^2 \leq 2\epsilon.$$
This is because $\sum_S (\hat{h}(S) - \hat{g}(S))^2 = \mathbb{E}[(h - g)^2]$ and, over all Boolean functions $\tilde{c}$, $h$ minimizes $\mathbb{E}[(\tilde{c} - g)^2]$, while $c$ achieves $\mathbb{E}[(c - g)^2] \leq 2\epsilon$.

To summarize where we are, we have four functions involved: given the concept $c$, $f$ is obtained by annihilating the Fourier coefficients not in $\mathcal{S}$, $g$ is obtained by approximating the remaining Fourier coefficients, and then $h$ is obtained by rounding to a Boolean function. We argued that by choosing the approximation parameter appropriately, we achieve at most $2\epsilon$ error with probability at least $1 - \delta$ in time $\mathrm{poly}(n, |\mathcal{S}|, \frac{1}{\epsilon}, \log\frac{1}{\delta})$. This result requires the samples to come from the uniform distribution.
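The estimation-and-rounding procedure above is simple enough to state in code. The following Python sketch (illustrative names; the sample count and the toy concept are chosen for the example, not prescribed by the lecture) estimates each coefficient $\hat{c}(S)$ for $S \in \mathcal{S}$ by averaging $c(x)\chi_S(x)$ over uniform labeled samples, forms $g$, and returns $h = \mathrm{sign}(g)$.

```python
import itertools
import random
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[int, ...]
Subset = Tuple[int, ...]

def chi(S: Subset, x: Sequence[int]) -> int:
    """Character chi_S(x) = prod_{i in S} x_i over {-1,1}^n (empty product = 1)."""
    out = 1
    for i in S:
        out *= x[i]
    return out

def estimate_coefficients(samples: List[Tuple[Point, int]],
                          family: List[Subset]) -> Dict[Subset, float]:
    """ghat(S) = average of c(x) * chi_S(x) over the uniform labeled samples."""
    return {S: sum(y * chi(S, x) for x, y in samples) / len(samples) for S in family}

def low_degree_learn(samples: List[Tuple[Point, int]],
                     family: List[Subset]) -> Callable[[Point], int]:
    """Form g = sum_{S in family} ghat(S) chi_S and return h = sign(g)."""
    ghat = estimate_coefficients(samples, family)
    def h(x: Point) -> int:
        g = sum(coef * chi(S, x) for S, coef in ghat.items())
        return 1 if g >= 0 else -1   # the tie g(x) = 0 is broken arbitrarily, as in the text
    return h

# Toy usage: learn c(x) = x_0 * x_1 on {-1,1}^3 from uniform samples,
# using the family of all subsets of size at most 2.
if __name__ == "__main__":
    n = 3
    c = lambda x: x[0] * x[1]
    samples = []
    for _ in range(2000):
        x = tuple(random.choice((-1, 1)) for _ in range(n))
        samples.append((x, c(x)))
    family = [S for r in range(3) for S in itertools.combinations(range(n), r)]
    h = low_degree_learn(samples, family)
    print("disagreements on the sample set:", sum(h(x) != y for x, y in samples))
```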

The above learning algorithm needs to somehow know a set $\mathcal{S}$ such that $\sum_{S \notin \mathcal{S}} (\hat{c}(S))^2 \leq \epsilon$. There are two cases:

- $\mathcal{S}$ is known, i.e., it can be computed efficiently given $n$ and $s$. In this case, as we showed above, we can learn efficiently using only random samples, provided $\mathcal{S}$ is not too large.

- $\mathcal{S}$ is unknown. In this case, provided we have a good upper bound on $|\mathcal{S}|$, we can compute a suitable alternative set $\mathcal{S}'$ using membership queries and then run the above algorithm with $\mathcal{S}'$. We show how to do this in the next lecture. Note that a randomized learning algorithm that can make membership queries automatically has access to labeled samples from the uniform distribution.

3 Learning From Samples

For the rest of this lecture, we focus on special concept classes where an appropriate set $\mathcal{S}$ is trivial to find. We begin by noting that the set of all small sets $S$ forms an appropriate $\mathcal{S}$ when the average sensitivity $I(c)$ is low, because if a large set has a large Fourier coefficient, then $I(c)$ must also be large by the formula $I(c) = \sum_S |S| (\hat{c}(S))^2$, proven in the previous lecture.

Theorem 1. If $I(c) \leq d$ for all $c \in C_{n,s}$, then there is an algorithm that learns $C_{n,s}$ with respect to the uniform distribution using random samples only and running in time $\mathrm{poly}(n^{d/\epsilon}, \log\frac{1}{\delta})$.

Proof. For all $t$ we have
$$\sum_{|S| \geq t} (\hat{c}(S))^2 \leq \frac{1}{t} \sum_{|S| \geq t} |S| (\hat{c}(S))^2 \leq \frac{1}{t} I(c) \leq \frac{d}{t} \leq \epsilon,$$
where the last inequality holds provided $t \geq \frac{d}{\epsilon}$. Thus we can apply the learning algorithm from the previous section with $\mathcal{S} = \{S : |S| < \frac{d}{\epsilon}\}$. The trivial upper bound $|\mathcal{S}| \leq (n+1)^{d/\epsilon}$ yields the running time.

We can translate this result to decision trees using the fact that functions computable by small decision trees have low average sensitivity. If a decision tree has depth $d$, then the average sensitivity of the function computed by it is at most $d$; in fact, the sensitivity at any input is at most the number of nodes on the corresponding path in the decision tree, which is at most $d$. Thus we have the following.

Theorem 2. If $C_{n,d}$ is the set of functions computed by decision trees of depth at most $d$, then there is an algorithm that learns $C_{n,d}$ with respect to the uniform distribution using random samples only and running in time $\mathrm{poly}(n^{d/\epsilon}, \log\frac{1}{\delta})$.
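A quick sketch of the family used in the proof of Theorem 1: all subsets of $[n]$ of size less than $t = \lceil d/\epsilon \rceil$. The Python below (illustrative names and parameter values) enumerates that family and compares its exact size against the crude $(n+1)^{d/\epsilon}$ bound quoted for the running time.

```python
import itertools
import math
from typing import Iterator, Tuple

def small_sets(n: int, t: int) -> Iterator[Tuple[int, ...]]:
    """All S subset of {0, ..., n-1} with |S| < t."""
    for r in range(t):
        yield from itertools.combinations(range(n), r)

def family_size(n: int, t: int) -> int:
    """Exact count of the family, sum of binomial coefficients C(n, r) for r < t."""
    return sum(math.comb(n, r) for r in range(t))

if __name__ == "__main__":
    n, d, eps = 20, 3, 0.5
    t = math.ceil(d / eps)            # keep sets of size < d/eps, as in the proof
    exact = family_size(n, t)
    trivial = (n + 1) ** t            # the crude bound used for the running time
    print(f"|S| = {exact}  <=  (n+1)^t = {trivial}")
```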

What can we say if $s$ refers to the size of a decision tree rather than its depth? Suppose we truncate the original tree $T$ at some level $\ell$ to obtain a tree $T'$. (The behavior of $T'$ on severed paths is arbitrary.) For large $\ell$, $T'$ is a good approximation to $T$, because each of the leaves at depth greater than $\ell$ is reached with only small probability, and there are few such leaves when $T$ is small. How large does $\ell$ need to be? We have
$$\Pr_x[T'(x) \neq T(x)] \leq \sum_{\text{leaves } v \text{ at depth} > \ell} \Pr_x[v \text{ is reached}] \leq \sum_{\text{leaves } v \text{ at depth} > \ell} 2^{-\ell} \leq s \cdot 2^{-\ell} \leq \epsilon, \qquad (2)$$
where the last inequality holds provided $\ell \geq \log\frac{s}{\epsilon}$. Now, suppose we apply our learning algorithm for decision trees of depth at most $\ell$ to $T$. The approximation $f$ in the first step satisfies
$$\|T - f\| = \Big\| \sum_{S \notin \mathcal{S}} \hat{T}(S) \chi_S \Big\| \qquad (3)$$
$$\leq \Big\| \sum_{S \notin \mathcal{S}} \big(\hat{T}(S) - \hat{T'}(S)\big) \chi_S \Big\| + \Big\| \sum_{S \notin \mathcal{S}} \hat{T'}(S) \chi_S \Big\| \qquad (4)$$
$$\leq \|T - T'\| + \Big\| \sum_{S \notin \mathcal{S}} \hat{T'}(S) \chi_S \Big\| \qquad (5)$$
$$\leq 2\sqrt{\epsilon} + \sqrt{\epsilon} = 3\sqrt{\epsilon}. \qquad (6)$$
Step (3) follows because the Fourier coefficients of $f$ agree with those of $T$ on $\mathcal{S}$ and vanish elsewhere. Step (4) follows from the triangle inequality, and step (5) from the orthonormality of the characters. Step (6) follows from the fact that $\|T - T'\| = 2\sqrt{\Pr[T \neq T']}$ together with (2), and from the defining property of $\mathcal{S}$ applied to $T'$. Thus, $\|T - f\|^2 \leq 9\epsilon$. By approximating each of the Fourier coefficients of $T$ in $\mathcal{S}$ to within $\eta$ as above, we obtain an approximation $g$ for which $\|T - g\|^2 \leq 10\epsilon$, and our resulting hypothesis $h = \mathrm{sign}(g)$ differs from $T$ on no more than a $10\epsilon$ fraction of the inputs. We have derived the following result about learning decision trees from random samples.

Theorem 3. If $C_{n,s}$ is the set of functions computed by decision trees of size at most $s$, then there is an algorithm that learns $C_{n,s}$ with respect to the uniform distribution using random samples only and running in time $\mathrm{poly}(n^{\log(s/\epsilon)/\epsilon}, \log\frac{1}{\delta})$.

In particular, if $\epsilon$ is a constant then we get a learning algorithm that runs in time quasipolynomial in $n + s$. As we begin to show next, we can bring the running time down to polynomial if membership queries are available.
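A tiny numeric sketch of the parameters in this argument (values are illustrative): the truncation depth $\ell = \lceil \log_2(s/\epsilon) \rceil$ keeps the truncation error below $\epsilon$, and the exponent in Theorem 3 comes from then keeping only Fourier coefficients on sets of size below $\ell/\epsilon$.

```python
import math

def truncation_depth(s: int, eps: float) -> int:
    """Smallest l with s * 2^(-l) <= eps, i.e. l = ceil(log2(s / eps))."""
    return math.ceil(math.log2(s / eps))

def truncation_error_bound(s: int, l: int) -> float:
    """At most s severed leaves, each reached with probability 2^(-l)."""
    return s * 2.0 ** (-l)

if __name__ == "__main__":
    s, eps = 1000, 0.1
    l = truncation_depth(s, eps)
    print("l =", l, " truncation error bound =", truncation_error_bound(s, l))  # <= eps
    print("degree cutoff l/eps =", l / eps)  # only sets smaller than this are kept
```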

4 Learning From Queries

Let us return to our generic learning algorithm that uses random samples from the uniform distribution and achieves $2\epsilon$ error and $\delta$ confidence in time $\mathrm{poly}(n, |\mathcal{S}|, \frac{1}{\epsilon}, \log\frac{1}{\delta})$, provided we know a set $\mathcal{S}$ such that $\sum_{S \notin \mathcal{S}} (\hat{c}(S))^2 \leq \epsilon$. If all we know is that such a set $\mathcal{S}$ exists and has size at most some value $M$, then we can attempt to find $\mathcal{S}$ or another suitable set. The idea is to define $\mathcal{S}' = \{S : (\hat{c}(S))^2 \geq \tau\}$ for some threshold $\tau$ and to compute $\mathcal{S}'$ using the list decoding algorithm we describe in the next lecture. Without loss of generality we can assume that $\mathcal{S}$ contains the $M$ sets with the largest coefficients in absolute value.

A first attempt at choosing $\tau$ would be to set it low enough to ensure that $\sum_{S \notin \mathcal{S}'} (\hat{c}(S))^2 \leq \epsilon$. Setting $\tau = \frac{\epsilon}{2^n}$ would certainly achieve this, since then $\sum_{S \notin \mathcal{S}'} (\hat{c}(S))^2 \leq 2^n \tau = \epsilon$. However, since the list decoding algorithm runs in time polynomial in $\tau^{-1}$, this would result in an exponential-time learning algorithm. By setting $\tau$ higher, we may not be able to guarantee our $\epsilon$ bound, but we won't be too far off: of the sets not in $\mathcal{S}'$, those not in $\mathcal{S}$ only contribute a small amount (by hypothesis), and those in $\mathcal{S}$ also only contribute a small amount (since there are only a small number $M$ of them and each has weight at most $\tau$).

[Figure: a number line from 0 to 1 marking the minimum $\tau$ needed to achieve $\epsilon$ and the minimum $\tau$ needed to achieve $2\epsilon$.]

Formally,
$$\sum_{S \notin \mathcal{S}'} (\hat{c}(S))^2 \leq \sum_{S \in \mathcal{S} \setminus \mathcal{S}'} (\hat{c}(S))^2 + \sum_{S \notin \mathcal{S}} (\hat{c}(S))^2 \leq M\tau + \epsilon.$$
Choosing $\tau = \frac{\epsilon}{M}$ ensures that this quantity is at most $2\epsilon$, and it also ensures that $|\mathcal{S}'| \leq \frac{M}{\epsilon}$, so $\mathcal{S}'$ is a suitable alternative to $\mathcal{S}$. That is, we can learn in roughly the same time as if we knew $\mathcal{S}$, provided we can find $\mathcal{S}'$. But $(\hat{c}(S))^2 \geq \frac{\epsilon}{M}$ is equivalent to saying that $\chi_S$ or $-\chi_S$ correlates well with $c$, and since the characters are exactly the Hadamard codewords, we can use a list decoding algorithm for the Hadamard code to efficiently find the set $\mathcal{S}'$ using membership queries. In the next lecture, we will spell out the list decoding algorithm in more detail, and we will apply this approach to learning decision trees.
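A final numeric sketch of the threshold choice (parameters are illustrative): with $\tau = \epsilon/M$, the sets dropped from $\mathcal{S}$ add at most $M\tau = \epsilon$ on top of the $\epsilon$ already allowed outside $\mathcal{S}$, and the list $\mathcal{S}'$ has size at most $1/\tau = M/\epsilon$ because the squared Fourier coefficients of a Boolean function sum to 1.

```python
def weight_missed_bound(M: int, eps: float, tau: float) -> float:
    """Bound on the Fourier weight outside S': at most M*tau from S \\ S', plus eps outside S."""
    return M * tau + eps

def list_size_bound(tau: float) -> int:
    """|S'| <= 1/tau, since each set in S' has squared coefficient at least tau (Parseval)."""
    return int(1 / tau)

if __name__ == "__main__":
    M, eps = 50, 0.05
    tau = eps / M
    print("missed Fourier weight <=", weight_missed_bound(M, eps, tau))  # equals 2*eps
    print("|S'| <=", list_size_bound(tau))                               # equals M/eps
```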