A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997, Stanford University, July 1997. Vasant Honavar, Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University, Ames, Iowa 50011. honavar@cs.iastate.edu http://www.cs.iastate.edu/~honavar/aigroup.html

What are learning systems? Systems that improve their performance on one or more tasks with experience in their environment. Examples: pattern recognizers, adaptive control systems, adaptive intelligent agents, etc.

Computational Models of Learning Model of the Learner: Computational capabilities, sensors, effectors, knowledge representation, inference mechanisms, prior knowledge, etc. Model of the Environment: Tasks to be learned, information sources (teacher, queries, experiments), performance measures Key questions: Can a learner with a certain structure learn a specified task in a particular environment? Can the learner do so efficiently? If so, how? If not, why not?

Computational Models of Learning Overview of the tutorial: theories of learning (what are they good for?); the mistake bound model; the maximum likelihood model; the PAC (Probably Approximately Correct) model; learning from simple examples; concluding remarks.

Theories of Learning: What are they good for? To make explicit relevant aspects of the learner and the environment; to identify easy and hard learning problems (and the precise conditions under which they are easy or hard); to guide the design of learning systems; to shed light on natural learning systems; to help analyze the performance of learning systems.

Mistake Bound Model Example: Given an arbitrary, noise-free sequence of labeled examples <X_1, C(X_1)>, <X_2, C(X_2)>, ... of an unknown binary conjunctive concept C over {0,1}^N, the learner's task is to predict C(X) for a given X. Theorem: Exact online learning of conjunctive concepts can be accomplished with at most (N+1) prediction mistakes.

Mistake bound model Algorithm: Initialize L = {X_1, ~X_1, ..., X_N, ~X_N} (all 2N literals). Predict positive if and only if the instance satisfies the conjunction of the literals in L. Whenever a mistake is made on a positive example, drop the offending (violated) literals from L. E.g., with N = 4, <0111, 1> will result in L = {~X_1, X_2, X_3, X_4}, and <1110, 1> will then yield L = {X_2, X_3}.
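Below is a minimal sketch of this elimination learner (not from the tutorial; the instance encoding, helper names, and the small driver at the end are assumptions made for illustration).

```python
# Sketch of the literal-elimination learner for conjunctions (assumed details).
# Instances are 0/1 tuples; a literal is (variable index, positive?).

def satisfies(x, literals):
    """True iff instance x satisfies the conjunction of all literals."""
    return all((x[i] == 1) == pos for i, pos in literals)

def learn_conjunction(examples, n):
    """Online elimination learner; returns (final literal set, mistake count)."""
    L = {(i, pos) for i in range(n) for pos in (True, False)}  # all 2N literals
    mistakes = 0
    for x, label in examples:
        prediction = 1 if satisfies(x, L) else 0
        if prediction != label:
            mistakes += 1
            # Mistakes can only occur on positive examples here, so drop the
            # literals that the positive example violates.
            L = {(i, pos) for i, pos in L if (x[i] == 1) == pos}
    return L, mistakes

# The running example from the slide (N = 4, 0-based indices):
examples = [((0, 1, 1, 1), 1), ((1, 1, 1, 0), 1)]
L, m = learn_conjunction(examples, 4)
print(sorted(L), m)   # [(1, True), (2, True)] i.e. X_2 and X_3, with 2 mistakes
```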

Mistake bound model Proof of the theorem: No literal in C is ever eliminated from L (a positive example satisfies every literal of C, so C's literals are never among the violated ones). Each mistake eliminates at least one literal from L. The first mistake eliminates N of the 2N literals, leaving N literals in L, so there can be at most N further mistakes. Hence conjunctive concepts can be learned with at most (N+1) mistakes. Conclusion: Conjunctive concepts are easy to learn in the mistake bound model.

Optimal Mistake Bound Learning Algorithms Definition: The optimal mistake bound mbound(C) for a concept class C is the lowest possible mistake bound in the worst case (considering all concepts in C and all possible sequences of examples). Definition: An optimal learning algorithm for a concept class C (in the mistake bound framework) is one that is guaranteed to exactly learn any concept in C, using any noise-free example sequence, with at most O(mbound(C)) mistakes. Theorem: mbound(C) <= lg |C|.

The Halving Algorithm Definition: The version space V_i = { C in the concept class : C is consistent with the first i examples }. Definition: The halving algorithm predicts according to the majority of the concepts in the current version space; a mistake results in the elimination of all the offending concepts, which is at least half of the version space (this yields the lg |C| bound above). Fine print: The halving algorithm may not be efficiently implementable.
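A small sketch of the halving algorithm over an explicitly enumerated finite concept class (assumed, not from the tutorial; the toy class of "one relevant bit" concepts and the example stream are made up for illustration):

```python
# Sketch of the halving algorithm: predict by majority vote over the current
# version space, then discard every concept inconsistent with the new example.

def halving_predict(version_space, x):
    votes = sum(c(x) for c in version_space)
    return 1 if 2 * votes >= len(version_space) else 0

def halving_update(version_space, x, label):
    return [c for c in version_space if c(x) == label]

# Toy class over {0,1}^3: concept i labels an instance by its i-th bit.
V = [lambda x, i=i: x[i] for i in range(3)]
mistakes = 0
for x, y in [((1, 1, 0), 0), ((0, 0, 1), 1)]:   # target concept: the third bit
    if halving_predict(V, x) != y:
        mistakes += 1                           # each mistake removes >= half of V
    V = halving_update(V, x, y)
print(len(V), mistakes)                         # 1 concept left after 1 mistake
```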

The Halving Algorithm The halving algorithm can be practical if there is a way to compactly represent and efficiently manipulate the version space. Question: Are there any efficiently implementable optimal mistake bound learning algorithms? Answer: Yes. Littlestone's Winnow algorithm learns monotone disjunctions of at most k of n literals, using the hypothesis class of threshold functions, with O(k lg n) mistakes.
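A minimal sketch of Winnow for this setting (assumed details: the threshold n/2, promotion factor 2, and demotion to zero correspond to the Winnow1 variant; the tiny data set is illustrative only):

```python
# Winnow1 sketch: a linear threshold hypothesis with multiplicative updates;
# for a k-literal monotone disjunction it makes O(k log n) mistakes.

def winnow(examples, n):
    w = [1.0] * n                      # one weight per boolean attribute
    theta = n / 2.0                    # fixed threshold
    mistakes = 0
    for x, y in examples:
        yhat = 1 if sum(w[i] * x[i] for i in range(n)) >= theta else 0
        if yhat != y:
            mistakes += 1
            for i in range(n):
                if x[i] == 1:
                    w[i] = w[i] * 2.0 if y == 1 else 0.0  # promote or eliminate
    return w, mistakes

# Target: X_1 OR X_3 over n = 4 attributes (0-based indices 0 and 2).
data = [((1, 0, 0, 0), 1), ((0, 1, 0, 0), 0), ((0, 0, 1, 1), 1), ((0, 1, 0, 1), 0)]
print(winnow(data, 4))                 # ([2.0, 0.0, 1.0, 0.0], 2)
```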

Bounding the prediction error The mistake bound model bounds the number of mistakes that the learner will ever make before exactly learning a concept, but not the prediction error after having seen a certain number of examples. The mistake bound model also assumes that the examples are chosen arbitrarily, in the worst case by a smart, adversarial teacher. It might often be satisfactory to assume randomly drawn examples.

Probably Approximately Correct Learning (Schematic: an example oracle, driven by the target concept and the instance distribution, supplies labeled samples to the learner.)

Probably Approximately Correct Learning Consider: an instance space X; a concept space C = { C : X -> {0, 1} }; a hypothesis space H = { h : X -> {0, 1} }; and an unknown, arbitrary, not necessarily computable, stationary probability distribution D over the instance space X.

PAC Learning The oracle samples the instance space according to D and provides labeled examples of an unknown concept C to the learner The learner is tested on samples drawn from the instance space according to the same probability distribution D The learner's task is to output a hypothesis h from H that closely approximates the unknown concept C based on the examples it has encountered

PAC Learning In the PAC setting, exact learning (zero error approximation) cannot be guaranteed. In the PAC setting, even approximate learning (with bounded non-zero error) cannot be guaranteed 100% of the time. Definition: The error of a hypothesis h with respect to a target concept C and an instance distribution D is given by error(h) = Prob_D[ C(X) != h(X) ].

PAC Learning Definition: A concept class C is said to be PAC-learnable using a hypothesis class H if there exists a learning algorithm L such that for all concepts in C, for all instance distributions D on an instance space X, and for all ε, δ with 0 < ε, δ < 1, L, when given access to the Example oracle, produces, with probability at least (1 - δ), a hypothesis h from H with error no more than ε (Valiant, 1984).

Efficient PAC Learning Definition: C is said to be efficiently PAC-learnable if L runs in time that is polynomial in N (the size of the instance representation), size(c) (the size of the concept representation), 1/ε, and 1/δ. Remark: Note that lower error or increased confidence require more examples. Remark: In order for a concept class to be efficiently PAC-learnable, it must be PAC-learnable using a random sample of size polynomial in the relevant parameters.

Sample complexity of PAC Learning Definition: A consistent learner is one that returns some hypothesis h from the hypothesis class H that is consistent with a random sequence of m examples. Remark: A consistent learner is a MAP learner (one that returns a hypothesis that is most likely given the training data) if all hypotheses are a priori equally likely. Theorem: A consistent learner is guaranteed to be PAC if the number of examples m > (1/ε) ln(|H|/δ).
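As a quick numeric illustration (not part of the tutorial; the helper name and the parameter values below are arbitrary choices), the bound can be evaluated directly:

```python
# Smallest m satisfying the consistent-learner bound m > (1/eps) * ln(|H|/delta).
import math

def consistent_learner_sample_size(h_size, eps, delta):
    return math.floor(math.log(h_size / delta) / eps) + 1

# Conjunctions over N boolean variables: |H| = 3^N (each variable appears
# positively, negatively, or not at all).
N = 20
print(consistent_learner_sample_size(3 ** N, eps=0.1, delta=0.05))   # 250
```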

Sample Complexity of PAC Learning Proof: Consider a hypothesis h that is not a PAC approximation of the unknown concept C, i.e., the error of h exceeds ε. The probability that h agrees with a single random example is at most (1 - ε), so the probability that h is consistent with m independently drawn random examples is at most (1 - ε)^m. For PAC learning, we want the probability of L returning such a bad hypothesis to be small; by the union bound over the at most |H| bad hypotheses, it suffices that |H| (1 - ε)^m < δ. Since (1 - ε)^m <= e^{-εm}, this holds whenever m > (1/ε) ln(|H|/δ).

PAC-Easy and PAC-Hard Concept Classes Conjunctive concepts are easy to learn: use the same algorithm as the one used in the mistake bound framework. Sample complexity is O( (1/ε) ( N ln 3 + ln(1/δ) ) ), since |H| = 3^N for conjunctions over N variables. Time complexity is polynomial in the relevant parameters of interest. Remark: Polynomial sample complexity is necessary but not sufficient for efficient PAC learning.
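To make this concrete, here is a small simulation sketch (assumptions: a uniform instance distribution, a random 3-literal target, and the sample sizes shown are all illustrative choices); it trains the elimination learner on random examples and estimates the error of the result on a fresh sample:

```python
# Simulated PAC learning of a random conjunction over {0,1}^N with the
# elimination algorithm; the error is estimated on freshly drawn test examples.
import random

N, M_TRAIN, M_TEST = 10, 250, 5000
target = {(i, random.random() < 0.5) for i in random.sample(range(N), 3)}

def label(x, literals):
    return int(all((x[i] == 1) == pos for i, pos in literals))

def draw(m):
    return [tuple(random.randint(0, 1) for _ in range(N)) for _ in range(m)]

# Consistent learner: start with all 2N literals, drop those violated by positives.
L = {(i, pos) for i in range(N) for pos in (True, False)}
for x in draw(M_TRAIN):
    if label(x, target) == 1:
        L = {(i, pos) for i, pos in L if (x[i] == 1) == pos}

error = sum(label(x, L) != label(x, target) for x in draw(M_TEST)) / M_TEST
print(f"estimated error: {error:.4f}")   # typically very close to 0.0
```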

PAC-Easy and PAC-Hard Concept Classes Theorem: The 3-term DNF concept class (disjunctions of at most 3 conjunctions) is not efficiently PAC-learnable using the same hypothesis class (although it has polynomial sample complexity) unless P=NP. Proof: By polynomial time reduction of graph 3-colorability (a well-known NP-complete problem) to the problem of deciding whether a given set of labeled examples is consistent with some 3-term DNF formula.

Transforming Hard Problems to Easy Ones Theorem: 3-term DNF concepts are efficiently PAC-learnable using the 3-CNF (conjunctive normal form with at most 3 literals per clause) hypothesis class. Proof: Every 3-term DNF formula T1 OR T2 OR T3 can be rewritten as a 3-CNF formula by distributing the disjunction over the conjunctions: it equals the conjunction of (u OR v OR w) over all u in T1, v in T2, w in T3. Transform each example over N boolean variables into a corresponding example over O(N^3) variables (one for each possible clause of at most 3 literals). The problem reduces to learning a conjunctive concept over the transformed instance space.
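A sketch of the instance transformation (assumed details: the helper names and the enumeration order of clauses are illustrative); each new feature is the value of one candidate clause on the original example, so a conjunction over the new features is exactly a 3-CNF formula over the old variables:

```python
# Map a boolean example over N variables to one feature per candidate clause
# (a disjunction of 1 to 3 literals), evaluated on that example.
from itertools import combinations

def clause_features(x):
    n = len(x)
    lits = [(i, pos) for i in range(n) for pos in (True, False)]
    value = lambda lit: (x[lit[0]] == 1) == lit[1]
    feats = []
    for k in (1, 2, 3):
        for clause in combinations(lits, k):
            feats.append(int(any(value(l) for l in clause)))
    return tuple(feats)

# The transformed examples can be fed to the conjunction learner sketched earlier.
print(len(clause_features((1, 0, 1))))   # 41 clause-features for N = 3 (O(N^3))
```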

Transforming Hard Problems to Easy Ones Theorem: For any k >= 2, k-term DNF concepts are efficiently PAC-learnable using the k-CNF hypothesis class. Remark: In this case, enlarging the search space by using a hypothesis class that is larger than strictly necessary actually makes the problem easy! Remark: No, we have not proved that P=NP. Summary: conjunctions are easy; k-term DNF (using the same hypothesis class) is hard; k-CNF is easy; general CNF is hard.

Inductive Bias: Occam's Razor Occam's razor: Keep it simple, stupid! An Occam learning algorithm returns a simple or succinct hypothesis that is consistent with the training data. Definition: Let α >= 0 and 0 <= β < 1 be constants. A learning algorithm L is said to be an (α, β)-Occam algorithm for a concept class C using a hypothesis class H if L, given m random examples of an unknown concept c in C, outputs a hypothesis h in H such that h is consistent with the examples and size(h) <= (N size(c))^α m^β.

Sample complexity of an Occam Algorithm Theorem: An Occam algorithm is guaranteed to be PAC if the number of samples m = O( (1/ε) lg(1/δ) + ( (N size(c))^α / ε )^{1/(1-β)} ). Proof: omitted.

Occam algorithm is PAC for k-decision lists Theorem: For any fixed k, the concept class of k-decision lists (nested if-then-else statements where each if-condition is a conjunction of at most k of the N variables or their negations) is efficiently PAC-learnable using the same hypothesis class. Remark: k-decision lists constitute the most expressive boolean concept class over the boolean instance space {0,1}^N that is known to be efficiently PAC-learnable.
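The classical algorithm behind this result is Rivest's greedy procedure; the sketch below is an assumed, brute-force rendering of it (the search over terms is exhaustive, which is acceptable for fixed k):

```python
# Greedy k-decision-list learner: repeatedly find a term (conjunction of <= k
# literals) whose covered remaining examples all share one label, emit
# (term, label), discard the covered examples, and repeat until none remain.
from itertools import combinations

def learn_k_decision_list(examples, n, k):
    lits = [(i, pos) for i in range(n) for pos in (True, False)]
    holds = lambda term, x: all((x[i] == 1) == pos for i, pos in term)
    remaining, dlist = list(examples), []
    while remaining:
        found = False
        for size in range(k + 1):
            if found:
                break
            for term in combinations(lits, size):
                covered = [(x, y) for x, y in remaining if holds(term, x)]
                labels = {y for _, y in covered}
                if covered and len(labels) == 1:
                    dlist.append((term, labels.pop()))
                    remaining = [(x, y) for x, y in remaining if not holds(term, x)]
                    found = True
                    break
        if not found:
            return None          # no consistent k-decision list exists
    return dlist

# Target: label = x0 AND NOT x1, expressible as a 1-decision list.
data = [((0, 0), 0), ((1, 0), 1), ((1, 1), 0), ((0, 1), 0)]
print(learn_k_decision_list(data, 2, 1))
```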

PAC Learning of Infinite Concept Classes Sample complexity results can be derived for concepts defined over R^N. Remark: Note that the cardinality of the concept and hypothesis classes can now be infinite (e.g., in the case of threshold functions over R^N). Solution: Instead of the cardinality of the concept class, use the Vapnik-Chervonenkis dimension (VC dimension) of the concept class to compute sample complexity.

VC Dimension and Sample Complexity Definition: A set S of instances is shattered by a hypothesis class H if and only if for every dichotomy of S, there exists a hypothesis in H that is consistent with the dichotomy. Definition: The VC-dimension V(H) of a hypothesis class H defined over an instance space X is the cardinality of the largest subset of X that is shattered by H. If arbitrarily large finite subsets of X can be shattered by H, V(H) is infinite.
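To make the definitions concrete, here is a brute-force shattering check for a deliberately simpler hypothesis class than the hyperplanes of the next example: intervals [a, b] on the real line (an illustrative assumption; their VC dimension is 2):

```python
# Brute-force check of the shattering definition for the class of intervals
# h_{a,b}(x) = 1 iff a <= x <= b on the real line.
from itertools import product

def intervals_shatter(points):
    """True iff every dichotomy of `points` is realized by some interval."""
    for labeling in product([0, 1], repeat=len(points)):
        positives = [x for x, y in zip(points, labeling) if y == 1]
        if not positives:
            continue                      # an empty interval realizes all-zeros
        a, b = min(positives), max(positives)   # tightest candidate interval
        if [int(a <= x <= b) for x in points] != list(labeling):
            return False                  # this dichotomy is not realizable
    return True

print(intervals_shatter([1.0, 2.0]))        # True:  VC dimension >= 2
print(intervals_shatter([1.0, 2.0, 3.0]))   # False: no 3 points are shattered
```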

VC Dimension and Sample Complexity Example: Let the instance space X be the 2-dimensional Euclidean space. Let the hypothesis space H be the set of linear separators (halfplanes bounded by 1-dimensional hyperplanes, i.e., lines) in the 2-dimensional Euclidean space. Then V(H) = 3 (a set of 3 points can be shattered by such hyperplanes as long as the points are not collinear, but no set of 4 points can be shattered).

VC Dimension and Sample Complexity Theorem: The number m of random examples needed for PAC learning of a concept class C of VC dimension V(C) = d is m = O( (1/ε) ( lg(1/δ) + d lg(1/ε) ) ). Corollary: An acyclic, layered multi-layer network of s threshold logic units, each with r inputs, has VC dimension at most 2(r + 1) s lg(e s).
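The theorem hides constants inside O(.); one commonly cited concrete form of the bound, due to Blumer et al. (1989), is m >= max( (4/ε) lg(2/δ), (8d/ε) lg(13/ε) ). A tiny calculator assuming that form (the function name and parameter choices are illustrative):

```python
# Sample size sufficient for PAC learning a class of VC dimension d, using the
# Blumer et al. (1989) constants (an assumed concrete instantiation of the O(.)).
import math

def vc_sample_size(d, eps, delta):
    return math.ceil(max((4.0 / eps) * math.log2(2.0 / delta),
                         (8.0 * d / eps) * math.log2(13.0 / eps)))

# Halfplanes in the plane have VC dimension 3 (the example above):
print(vc_sample_size(3, eps=0.1, delta=0.05))   # 1686
```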

Using a Weak learner for PAC Learning PAC learning requires learning under all distributions, for all choices of error and confidence parameters. Suppose we are given a weak learning algorithm for concept class C that works for a fixed error and/or a fixed confidence. Can we use it for PAC learning of C? YES! (Kearns & Vazirani, 94; Natarajan, 92)

Learning from Simple Examples Question: Can we relax the requirement of learning under all probability distributions over the instance space (including extremely pathological distributions) by limiting the class of distributions to a useful subset of all possible distributions? What are the implications of doing so on the learnability of concept classes that are PAC-hard? What probability distributions are natural?

Learning from Simple Examples Intuition: Suppose mother nature is kind to us: Simple instances are more likely to be made available to the learner. Question: How can we formalize this intuitive notion? Answer: Kolmogorov complexity offers a natural measure of descriptional complexity of an instance

Kolmogorov Complexity Definition: The Kolmogorov complexity of an object γ relative to a universal Turing machine M is the length (measured in number of bits) of the shortest program which, when executed on M, prints out γ and halts: K(γ) = min{ l(π) : M(π) = γ }. Remark: Simple objects (e.g., a string of all zeros) have low Kolmogorov complexity.

Kolmogorov Complexity Definition: The conditional Kolmogorov complexity of γ given λ is the length of the shortest program π for a universal Turing machine M which, given λ, outputs γ. Remark: K(γ | λ) <= K(γ). Remark: Kolmogorov complexity is machine-independent (modulo an additive constant).

Universal Distribution Definition: The universal probability distribution over an instance space X is defined by D_U(X) = η 2^{-K(X)}, where η is a normalization constant. Definition: A distribution D is simple if it is multiplicatively dominated by the universal distribution, that is, there exists a constant σ such that σ D_U(X) >= D(X) for all X. Remark: All computable distributions (including Gaussian, Poisson, etc., with finite-precision parameters) are simple.

PAC Learning Under Simple Distributions Theorem: A concept class C defined over a discrete instance space is polynomially PAC-learnable under the universal distribution iff it is polynomially PAC-learnable under each simple distribution, provided that, during the learning phase, the samples are drawn according to the universal distribution (Li & Vitanyi, 91). Remarks: This raises the possibility of learning under all simple distributions by sampling examples according to the universal distribution. But the universal distribution is not computable. Is nature characterized by the universal distribution? Can we approximate the universal distribution?

Learning from Simple Examples Suppose a knowledgeable teacher provides simple examples (i.e., examples with low Kolmogorov complexity conditioned on the teacher's knowledge of the concept to be learned). More precisely, D_r(X) = η_r 2^{-K(X | r)}, where r is a suitable representation of the unknown concept and η_r is a normalization constant. Definition: Let S_S be a set of simple examples, that is, X in S_S implies K(X | r) <= µ lg(size(r)) for a constant µ.

Learning from Simple Examples Definition (informal): A representative sample S_R is one that contains all the information necessary for identifying an unknown concept. Example: To learn a finite state machine, the representative examples provide information about all the state transitions. Theorem: If there exists a representative set of simple examples for each concept in a concept class C, then C is PAC-learnable under the distribution D_r (Denis et al., 96).

Learning from Simple Examples Theorem: The class of DFAs whose canonical representations have at most Q states is polynomially exactly learnable when examples are provided from a sample drawn according to D_r and Q is known (Parekh & Honavar, 97). Theorem: The class of DFAs is probably approximately learnable under D_r (Parekh & Honavar, 97). Remark: These are encouraging results in light of the strong evidence against efficient PAC learnability of DFAs (Kearns & Vazirani, 1994).

Concluding remarks PAC-easy learning problems lend themselves to a variety of efficient algorithms. PAC-hard learning problems can often be made PAC-easy through appropriate instance transformation and choice of hypothesis space. Occam's razor often helps. Weak learning algorithms can often be used for strong learning. Learning under restricted classes of instance distributions (e.g., the universal distribution) offers new possibilities.

Bibliography 1. Honavar, V. http://www.cs.iastate.edu/~honavar/cs673.s96.html 2. Kearns, M.J. & Vazirani, U.V. An Introduction to Computational Learning Theory. Cambridge, MA: MIT Press, 1994. 3. Langley, P. Elements of Machine Learning. Palo Alto, CA: Morgan Kaufmann, 1995. 4. Li, M. & Vitanyi, P. Kolmogorov Complexity and its Applications. New York: Springer-Verlag, 1997. 5. Mitchell, T. Machine Learning. New York: McGraw Hill, 1997. 6. Natarajan, B.K. Machine Learning: A Theoretical Approach. Palo Alto, CA: Morgan Kaufmann, 1992.