Computational Learning Theory


09s1: COMP9417 Machine Learning and Data Mining
Computational Learning Theory
May 20, 2009

Acknowledgement: Material derived from slides for the book Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997 (http://www-2.cs.cmu.edu/~tom/mlbook.html); slides by Andrew W. Moore available at http://www.cs.cmu.edu/~awm/tutorials; the book Data Mining, Ian H. Witten and Eibe Frank, Morgan Kaufmann, 2000 (http://www.cs.waikato.ac.nz/ml/weka); the book Pattern Classification, Richard O. Duda, Peter E. Hart, and David G. Stork, Copyright (c) 2001 by John Wiley & Sons, Inc.; and the book Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani and Jerome Friedman, (c) 2001, Springer.

Aims

This lecture will enable you to define and reproduce fundamental approaches and results from the theory of machine learning. Following it you should be able to:
describe a basic theoretical framework for the sample complexity of learning
define training error and true error
define Probably Approximately Correct (PAC) learning
define the Vapnik-Chervonenkis (VC) dimension
define optimal mistake bounds
reproduce the basic results in each area

[Recommended reading: Mitchell, Chapter 7]
[Recommended exercises: 7.2, 7.4, 7.5, 7.6, (7.1, 7.3, 7.7)]

Computational Learning Theory

What general laws constrain inductive learning? We seek theory to relate:
probability of successful learning
number of training examples
complexity of the hypothesis space
time complexity of the learning algorithm
accuracy to which the target concept is approximated
manner in which training examples are presented

Computational Learning Theory

Characterise classes of learning algorithms using questions such as:

Sample complexity: How many training examples are required for the learner to converge (with high probability) to a successful hypothesis?

Computational complexity: How much computational effort is required for the learner to converge (with high probability) to a successful hypothesis?

Mistake bounds: How many training examples will the learner misclassify before converging to a successful hypothesis?

What counts as a successful hypothesis? One identical to the target concept? One that mostly agrees with the target concept? One that does this most of the time?

Prototypical Concept Learning Task

Given:
Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
Target function c: EnjoySport: $X \rightarrow \{0, 1\}$
Hypotheses H: conjunctions of literals, e.g. $\langle ?, Cold, High, ?, ?, ? \rangle$
Training examples D: positive and negative examples of the target function, $\langle x_1, c(x_1) \rangle, \ldots, \langle x_m, c(x_m) \rangle$

Determine:
A hypothesis h in H such that h(x) = c(x) for all x in D?
A hypothesis h in H such that h(x) = c(x) for all x in X?

Sample Complexity

How many training examples are sufficient to learn the target concept?

1. If the learner proposes instances, as queries to a teacher: the learner proposes instance x, the teacher provides c(x)
2. If the teacher (who knows c) provides training examples: the teacher provides a sequence of examples of the form $\langle x, c(x) \rangle$
3. If some random process (e.g., nature) proposes instances: instance x is generated randomly, the teacher provides c(x)
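To make the hypothesis representation in the prototypical task above concrete, here is a minimal Python sketch (not part of the original slides); the particular attribute values and example day are illustrative assumptions.

# A conjunctive hypothesis is a tuple of constraints, one per attribute:
# a specific value, or '?' meaning any value is acceptable.
ATTRIBUTES = ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]

def h_classifies_positive(h, x):
    """Return True iff conjunctive hypothesis h covers instance x."""
    return all(hc == '?' or hc == xv for hc, xv in zip(h, x))

# Hypothetical example: <?, Cold, High, ?, ?, ?> applied to one day.
h = ('?', 'Cold', 'High', '?', '?', '?')
x = ('Sunny', 'Cold', 'High', 'Strong', 'Warm', 'Change')
print(h_classifies_positive(h, x))  # True: every non-'?' constraint is satisfied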

Sample Complexity: 1

Learner proposes instance x, teacher provides c(x) (assume c is in the learner's hypothesis space H).

Optimal query strategy: play "20 questions"
pick instance x such that half of the hypotheses in the version space VS classify x positive, and half classify x negative
when this is possible, need $\log_2 |H|$ queries to learn c
when not possible, need even more
the learner conducts experiments (generates queries); see Version Space learning

Sample Complexity: 2

Teacher (who knows c) provides training examples (assume c is in the learner's hypothesis space H).

Optimal teaching strategy: depends on the H used by the learner.

Consider the case H = conjunctions of up to n Boolean literals, e.g. $(AirTemp = Warm) \wedge (Wind = Strong)$, where AirTemp, Wind, ... each have 2 possible values. If there are n possible Boolean attributes in H, n + 1 examples suffice. Why?

Sample Complexity: 3

Given:
a set of instances X
a set of hypotheses H
a set of possible target concepts C
training instances generated by a fixed, unknown probability distribution D over X

The learner observes a sequence D of training examples of the form $\langle x, c(x) \rangle$, for some target concept $c \in C$:
instances x are drawn from distribution D
the teacher provides the target value c(x) for each

The learner must output a hypothesis h estimating c. h is evaluated by its performance on subsequent instances drawn according to D.

Note: randomly drawn instances, noise-free classifications.

True Error of a Hypothesis

[Figure: instance space X, showing the regions covered by target concept c and by hypothesis h; the true error corresponds to the region where c and h disagree.]

True Error of a Hypothesis

Definition: The true error (denoted $error_D(h)$) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.

$error_D(h) \equiv \Pr_{x \in D}[c(x) \neq h(x)]$

Two Notions of Error

Training error of hypothesis h with respect to target concept c: how often $h(x) \neq c(x)$ over the training instances.

True error of hypothesis h with respect to c: how often $h(x) \neq c(x)$ over future random instances.

Our concern: can we bound the true error of h given the training error of h? First consider the case when the training error of h is zero (i.e., $h \in VS_{H,D}$).

Exhausting the Version Space

[Figure: hypothesis space H with the version space $VS_{H,D}$ inside it; each hypothesis is annotated with its training error r and its true error, e.g. (error = .1, r = .2), (error = .2, r = 0), (error = .3, r = .4), (error = .1, r = 0), (error = .3, r = .1), (error = .2, r = .3).]

Definition: The version space $VS_{H,D}$ is said to be $\epsilon$-exhausted with respect to c and D, if every hypothesis h in $VS_{H,D}$ has error less than $\epsilon$ with respect to c and D.

$(\forall h \in VS_{H,D})\; error_D(h) < \epsilon$
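The distinction between training error and true error can be illustrated with a small simulation. The following Python sketch (an added illustration, not from the slides) draws instances from an assumed distribution, labels them with a hypothetical target concept, and compares a hypothesis's empirical training error against a large-sample estimate of its true error.

import random

random.seed(0)

# Hypothetical target concept and hypothesis over a single real-valued feature.
def c(x):          # target: positive iff x > 0.5
    return int(x > 0.5)

def h(x):          # hypothesis: positive iff x > 0.6 (disagrees with c on (0.5, 0.6])
    return int(x > 0.6)

def draw():        # assumed instance distribution D: uniform on [0, 1]
    return random.random()

train = [draw() for _ in range(20)]
training_error = sum(h(x) != c(x) for x in train) / len(train)

# Estimate the true error error_D(h) = Pr_{x~D}[c(x) != h(x)] by sampling many instances.
big_sample = [draw() for _ in range(100000)]
true_error_estimate = sum(h(x) != c(x) for x in big_sample) / len(big_sample)

print(training_error, true_error_estimate)  # the true error is close to 0.1 here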

How many examples will ε-exhaust the VS?

Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of $m \geq 1$ independent random examples of some target concept c, then for any $0 \leq \epsilon \leq 1$, the probability that the version space with respect to H and D is not $\epsilon$-exhausted (with respect to c) is less than

$|H| e^{-\epsilon m}$

Interesting! This bounds the probability that any consistent learner will output a hypothesis h with $error(h) \geq \epsilon$.

If we want this probability to be below $\delta$,

$|H| e^{-\epsilon m} \leq \delta$

then

$m \geq \frac{1}{\epsilon} (\ln |H| + \ln(1/\delta))$

Learning Conjunctions of Boolean Literals

How many examples are sufficient to assure with probability at least $(1 - \delta)$ that every h in $VS_{H,D}$ satisfies $error_D(h) \leq \epsilon$?

Use our theorem:

$m \geq \frac{1}{\epsilon} (\ln |H| + \ln(1/\delta))$

Suppose H contains conjunctions of constraints on up to n Boolean attributes (i.e., n Boolean literals). Then $|H| = 3^n$, and

$m \geq \frac{1}{\epsilon} (\ln 3^n + \ln(1/\delta))$

or

$m \geq \frac{1}{\epsilon} (n \ln 3 + \ln(1/\delta))$
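As a quick check of the bound, the following Python sketch (an illustrative addition) computes $m \geq \frac{1}{\epsilon}(\ln|H| + \ln(1/\delta))$ for a finite hypothesis space, and the specialisation to conjunctions of n Boolean literals with $|H| = 3^n$; the example numbers are assumptions.

import math

def sample_bound_finite_H(size_H, eps, delta):
    """Number of examples sufficient to eps-exhaust the version space
    with probability at least 1 - delta, for a finite hypothesis space."""
    return (math.log(size_H) + math.log(1.0 / delta)) / eps

def sample_bound_boolean_conjunctions(n, eps, delta):
    """Same bound for conjunctions of up to n Boolean literals, |H| = 3^n."""
    return (n * math.log(3) + math.log(1.0 / delta)) / eps

print(math.ceil(sample_bound_finite_H(1000, 0.1, 0.05)))            # 100 examples for |H| = 1000
print(math.ceil(sample_bound_boolean_conjunctions(10, 0.1, 0.05)))  # 140 examples for n = 10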

How About EnjoySport?

$m \geq \frac{1}{\epsilon} (\ln |H| + \ln(1/\delta))$

If H is as given in EnjoySport then $|H| = 973$, and

$m \geq \frac{1}{\epsilon} (\ln 973 + \ln(1/\delta))$

If we want to assure that with probability 95% the version space contains only hypotheses with $error_D(h) \leq .1$, then it is sufficient to have m examples, where

$m \geq \frac{1}{.1} (\ln 973 + \ln(1/.05))$
$m \geq 10 (\ln 973 + \ln 20)$
$m \geq 10 (6.88 + 3.00)$
$m \geq 98.8$

PAC Learning

Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.

Definition: C is PAC-learnable by L using H if for all $c \in C$, distributions D over X, $\epsilon$ such that $0 < \epsilon < 1/2$, and $\delta$ such that $0 < \delta < 1/2$, learner L will with probability at least $(1 - \delta)$ output a hypothesis $h \in H$ such that $error_D(h) \leq \epsilon$, in time that is polynomial in $1/\epsilon$, $1/\delta$, n and size(c).

Agnostic Learning

So far, we assumed $c \in H$. The agnostic learning setting does not assume $c \in H$.

What do we want then? The hypothesis h that makes fewest errors on the training data.

What is the sample complexity in this case? It is derived from Hoeffding bounds:

$\Pr[error_D(h) > error_{train}(h) + \epsilon] \leq e^{-2m\epsilon^2}$

giving

$m \geq \frac{1}{2\epsilon^2} (\ln |H| + \ln(1/\delta))$
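The arithmetic above, and the corresponding agnostic bound, can be reproduced with a short Python sketch (added here for illustration):

import math

def pac_bound(size_H, eps, delta):
    """Realizable case: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return (math.log(size_H) + math.log(1.0 / delta)) / eps

def agnostic_bound(size_H, eps, delta):
    """Agnostic case (from Hoeffding): m >= (1/(2*eps^2)) * (ln|H| + ln(1/delta))."""
    return (math.log(size_H) + math.log(1.0 / delta)) / (2.0 * eps ** 2)

# EnjoySport: |H| = 973, eps = 0.1, delta = 0.05
print(pac_bound(973, 0.1, 0.05))      # about 98.8, as on the slide
print(agnostic_bound(973, 0.1, 0.05)) # about 494, i.e. 1/(2*eps) times larger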

Unbiased Learners

An unbiased concept class C contains all target concepts definable on the instance space X:

$|C| = 2^{|X|}$

Say X is defined using n Boolean features; then $|X| = 2^n$ and

$|C| = 2^{2^n}$

An unbiased learner has a hypothesis space able to represent all possible target concepts, i.e., H = C. Then

$m \geq \frac{1}{\epsilon} (2^n \ln 2 + \ln(1/\delta))$

i.e., exponential (in the number of features) sample complexity.

Bias, Variance and Model Complexity

[Figure: prediction error on a test sample and on a training sample, plotted against model complexity.]

How Complex is a Hypothesis?

We can see the behaviour of different models' predictive accuracy on a test sample and a training sample as the model complexity is varied.

How do we measure model complexity? The number of parameters? Description length?

How Complex is a Hypothesis?

[Figure: the solid curve is the function sin(50x) for $x \in [0, 1]$. The blue (solid) and green (hollow) points illustrate how the associated indicator function $I(\sin(\alpha x) > 0)$ can shatter (separate) an arbitrarily large number of points by choosing an appropriately high frequency $\alpha$.]

Classes are separated based on $\sin(\alpha x)$, for frequency $\alpha$, a single parameter.

Shattering a Set of Instances

Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.

Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

Three Instances Shattered

[Figure: an instance space X in which a set of three instances is shattered, i.e. every dichotomy of the three instances is realised by some hypothesis.]

The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then $VC(H) \equiv \infty$.
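Shattering can be checked mechanically for small cases. The sketch below (an added illustration, not from the slides) brute-forces all dichotomies of a point set on the real line against the hypothesis class of closed intervals, whose VC dimension is 2: two points can be shattered, three cannot.

from itertools import product

def interval_hypotheses(points):
    """Candidate interval hypotheses [a, b] with endpoints chosen around the given points;
    for points on a line these realise every labelling an interval can produce."""
    cuts = sorted(points)
    candidates = ([cuts[0] - 1]
                  + [(cuts[i] + cuts[i + 1]) / 2 for i in range(len(cuts) - 1)]
                  + [cuts[-1] + 1])
    return [(a, b) for a in candidates for b in candidates if a <= b] + [(p, p) for p in cuts]

def shattered_by_intervals(points):
    """True iff every dichotomy of the points is realised by some interval [a, b]."""
    hyps = interval_hypotheses(points)
    for labels in product([0, 1], repeat=len(points)):
        if not any(all((a <= x <= b) == bool(y) for x, y in zip(points, labels))
                   for a, b in hyps):
            return False
    return True

print(shattered_by_intervals([1.0, 2.0]))        # True  -> VC dimension of intervals is at least 2
print(shattered_by_intervals([1.0, 2.0, 3.0]))   # False -> the dichotomy (+, -, +) is impossible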

VC Dim. of Linear Decision Surfaces

[Figure: panels (a) and (b), illustrating which point sets in the plane can and cannot be shattered by linear decision surfaces.]

Sample Complexity from VC Dimension

How many randomly drawn examples suffice to $\epsilon$-exhaust $VS_{H,D}$ with probability at least $(1 - \delta)$?

$m \geq \frac{1}{\epsilon} (4 \log_2(2/\delta) + 8\, VC(H) \log_2(13/\epsilon))$

Mistake Bounds

So far: how many examples are needed to learn? What about: how many mistakes before convergence?

Let's consider a setting similar to PAC learning:
instances are drawn at random from X according to distribution D
the learner must classify each instance before receiving the correct classification from the teacher

Can we bound the number of mistakes the learner makes before converging?
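For comparison with the finite-|H| bound earlier, here is a small Python helper (an added illustration) that evaluates the VC-dimension-based bound; the value VC(H) = 3 in the example is an assumption based on the standard result for linear decision surfaces in the plane.

import math

def vc_sample_bound(vc_dim, eps, delta):
    """m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return (4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps

# Linear decision surfaces in the plane have VC dimension 3.
print(math.ceil(vc_sample_bound(3, 0.1, 0.05)))  # roughly 1900 examples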

Winnow

Mistake bounds following Littlestone (1987), who developed Winnow, an online, mistake-driven algorithm similar to the perceptron:
learns a linear threshold function
designed to learn in the presence of many irrelevant features
the number of mistakes grows only logarithmically with the number of irrelevant features

Winnow

While some instances are misclassified
    For each instance a
        classify a using the current weights
        If the predicted class is incorrect
            If a has class 1
                For each $a_i = 1$, $w_i \leftarrow \alpha w_i$ (if $a_i = 0$, leave $w_i$ unchanged)
            Otherwise
                For each $a_i = 1$, $w_i \leftarrow w_i / \alpha$ (if $a_i = 0$, leave $w_i$ unchanged)

Winnow

uses a user-supplied threshold $\theta$: the class is 1 if $\sum_i w_i a_i > \theta$
note the similarity to the perceptron training rule; Winnow uses multiplicative weight updates, $\alpha > 1$
will do better than the perceptron when there are many irrelevant attributes
an attribute-efficient learner
(a runnable sketch of this update appears after the Find-S description below)

Mistake Bounds: Find-S

Consider Find-S when H = conjunctions of Boolean literals.

Find-S:
Initialize h to the most specific hypothesis $l_1 \wedge \neg l_1 \wedge l_2 \wedge \neg l_2 \wedge \ldots \wedge l_n \wedge \neg l_n$
For each positive training instance x
    Remove from h any literal that is not satisfied by x
Output hypothesis h.

How many mistakes before converging to the correct h?
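A minimal runnable version of the Winnow update above might look like the following Python sketch; the training data, threshold, and alpha are illustrative assumptions, since the slide itself gives only pseudocode.

def winnow_train(examples, n, alpha=2.0, theta=None, epochs=20):
    """examples: list of (a, label) with a a 0/1 vector of length n, label in {0, 1}."""
    theta = theta if theta is not None else n / 2.0   # assumed choice of threshold
    w = [1.0] * n                                     # start with all weights equal to 1
    for _ in range(epochs):
        mistakes = 0
        for a, label in examples:
            pred = int(sum(wi * ai for wi, ai in zip(w, a)) > theta)
            if pred != label:
                mistakes += 1
                for i in range(n):
                    if a[i] == 1:
                        # promote on false negatives, demote on false positives
                        w[i] = w[i] * alpha if label == 1 else w[i] / alpha
        if mistakes == 0:
            break
    return w

# Hypothetical data generated by the concept "x1 OR x3", with several irrelevant attributes.
data = [((1,0,0,0,0,0,0,0), 1), ((0,0,1,1,0,1,0,0), 1),
        ((0,1,0,0,1,0,1,0), 0), ((0,0,0,0,0,0,0,1), 0)]
print(winnow_train(data, n=8))  # weights on the relevant attributes grow multiplicatively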

Mistake Bounds: Find-S

Find-S will converge to an error-free hypothesis in the limit if $C \subseteq H$ and the data are noise-free. How many mistakes will it make before converging to the correct h?

Find-S initially classifies all instances as negative
it generalizes for positive examples by dropping unsatisfied literals
since $c \in H$, it will never classify a negative instance as positive (if it did, there would exist a hypothesis h with $c \geq_g h$ and an instance x such that $h(x) = 1$ and $c(x) \neq 1$, contradicting the definition of the generality ordering $\geq_g$)
so the mistake bound comes from the number of positives classified as negatives

Mistake Bounds: Find-S

There are 2n terms in the initial hypothesis:
on the first mistake, half of these terms are removed, leaving n
each further mistake removes at least 1 term
in the worst case, all n remaining terms have to be removed (the result would be the most general concept: everything is positive)
so the worst-case number of mistakes is n + 1: a worst-case sequence of learning steps removes only one literal per step

Mistake Bounds: Halving Algorithm

Consider the Halving Algorithm:
learn the concept using the version space Candidate-Elimination algorithm
classify new instances by majority vote of the version space members

How many mistakes before converging to the correct h? ... in the worst case? ... in the best case?
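Before turning to the Halving Algorithm's mistake bound, here is a concrete sketch of the Find-S procedure above over n Boolean attributes (an added illustration, not from the slides); each attribute constraint starts maximally specific and literals are dropped as positive examples arrive.

def find_s(positive_examples, n):
    """Find-S for conjunctions of Boolean literals over n attributes.
    A hypothesis is a set of literals (i, v): 'attribute i must equal v'."""
    # most specific hypothesis: l1 AND not-l1 AND ... AND ln AND not-ln (unsatisfiable)
    h = {(i, v) for i in range(n) for v in (0, 1)}
    for x in positive_examples:
        # drop every literal that the positive example x does not satisfy
        h = {(i, v) for (i, v) in h if x[i] == v}
    return h

def classify(h, x):
    return int(all(x[i] == v for (i, v) in h))

# Hypothetical target: x0 AND not-x2 over 4 Boolean attributes.
positives = [(1, 0, 0, 1), (1, 1, 0, 0)]
h = find_s(positives, n=4)
print(sorted(h))                  # [(0, 1), (2, 0)]  i.e. x0 AND not-x2
print(classify(h, (1, 1, 0, 1)))  # 1
print(classify(h, (0, 1, 0, 1)))  # 0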

Mistake Bounds: Halving Algorithm

The Halving Algorithm learns exactly when the version space contains only one hypothesis, which corresponds to the target concept:
at each step, remove all hypotheses whose vote was incorrect
a mistake occurs when the majority-vote classification is incorrect

What is the mistake bound in converging to the correct h?

How many mistakes in the worst case?
on every step a mistake occurs, because the majority vote is incorrect
on each mistake, the number of hypotheses is reduced by at least half
for hypothesis space size |H|, the worst-case mistake bound is $\log_2 |H|$

How many mistakes in the best case?
on every step the majority vote is correct, so no mistake occurs
we still remove all incorrect hypotheses, up to half
in the best case, no mistakes are made in converging to the correct h

Optimal Mistake Bounds

Let $M_A(C)$ be the maximum number of mistakes made by algorithm A to learn concepts in C (the maximum over all possible $c \in C$ and all possible training sequences):

$M_A(C) \equiv \max_{c \in C} M_A(c)$

Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of $M_A(C)$:

$Opt(C) \equiv \min_{A \in \text{learning algorithms}} M_A(C)$

$VC(C) \leq Opt(C) \leq M_{Halving}(C) \leq \log_2(|C|)$
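A brute-force version of the Halving Algorithm over an explicitly enumerated finite hypothesis space (an added sketch; the hypothesis class and data here are illustrative) shows the $\log_2 |H|$ worst-case behaviour directly.

from itertools import product

def halving(hypotheses, stream):
    """hypotheses: list of callables h(x) -> 0/1; stream: iterable of (x, c(x)) pairs."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in version_space)
        prediction = int(2 * votes > len(version_space))  # majority vote (ties predict 0)
        if prediction != label:
            mistakes += 1
        # remove every hypothesis whose vote on this example was incorrect
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes

# Hypothesis space: conjunctions over 3 Boolean attributes, encoded as masks where
# mask[i] is 0, 1, or None ("don't care"); |H| = 3^3 = 27.
def make_conj(mask):
    return lambda x: int(all(m is None or x[i] == m for i, m in enumerate(mask)))

H = [make_conj(mask) for mask in product([0, 1, None], repeat=3)]
target = make_conj((1, None, 0))                              # target concept: x0 AND not-x2
stream = [(x, target(x)) for x in product([0, 1], repeat=3)]  # all 8 instances, labelled
vs, mistakes = halving(H, stream)
print(len(vs), mistakes)   # the number of mistakes never exceeds log2(27), about 4.75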

Weighted Majority

A generalisation of the Halving Algorithm:
predicts by a weighted vote of a set of prediction algorithms
learns by altering the weights for the prediction algorithms

A prediction algorithm simply predicts the value of the target concept given an instance; the pool may consist of the elements of a hypothesis space H, or of different learning algorithms. On an inconsistency between a prediction algorithm and a training example, the weight of that prediction algorithm is reduced. The number of mistakes of the ensemble can be bounded by the number of mistakes made by the best prediction algorithm.

Weighted Majority

$a_i$ is the i-th prediction algorithm; $w_i$ is the weight associated with $a_i$.

For all i, initialize $w_i \leftarrow 1$
For each training example $\langle x, c(x) \rangle$
    Initialize $q_0$ and $q_1$ to 0
    For each prediction algorithm $a_i$
        If $a_i(x) = 0$ then $q_0 \leftarrow q_0 + w_i$
        If $a_i(x) = 1$ then $q_1 \leftarrow q_1 + w_i$
    If $q_1 > q_0$ then predict $c(x) = 1$
    If $q_0 > q_1$ then predict $c(x) = 0$
    If $q_0 = q_1$ then predict 0 or 1 at random for $c(x)$
    For each prediction algorithm $a_i$
        If $a_i(x) \neq c(x)$ then $w_i \leftarrow \beta w_i$

Weighted Majority

The Weighted Majority algorithm begins by assigning a weight of 1 to each prediction algorithm. On a misclassification by a prediction algorithm, its weight is reduced by multiplication by a constant $\beta$, $0 \leq \beta < 1$. It is equivalent to the Halving Algorithm when $\beta = 0$; otherwise, it just down-weights the contribution of algorithms which make errors.

Key result: the number of mistakes made by the Weighted Majority algorithm will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the number of prediction algorithms in the pool.

Summary

Algorithm-independent learning, analysed by computational complexity methods.
An alternative viewpoint to statistics; complementary, with different goals.
Many different frameworks, no clear winner.
Many results are over-conservative from a practical viewpoint, e.g. worst-case rather than average-case analyses.
It has led to advances in practically useful algorithms, e.g. PAC learning led to Boosting, and the VC dimension led to the SVM.
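A direct transcription of the Weighted Majority pseudocode above into Python might look like this (added sketch; the pool of predictors and the data are illustrative assumptions):

import random

def weighted_majority(predictors, stream, beta=0.5):
    """predictors: list of callables a_i(x) -> 0/1; stream: iterable of (x, c(x)) pairs.
    Returns the final weights and the number of mistakes made by the ensemble."""
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q0 = sum(wi for wi, a in zip(w, predictors) if a(x) == 0)
        q1 = sum(wi for wi, a in zip(w, predictors) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0 if q0 > q1 else random.randint(0, 1)
        if prediction != label:
            mistakes += 1
        # down-weight every predictor that disagreed with the observed label
        w = [wi * beta if a(x) != label else wi for wi, a in zip(w, predictors)]
    return w, mistakes

# Hypothetical pool of threshold predictors on one feature; the second one matches the labels,
# so its weight stays at 1 while the others shrink.
predictors = [lambda x: int(x > 0.2), lambda x: int(x > 0.5), lambda x: int(x > 0.8)]
stream = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1), (0.4, 0), (0.7, 1)]
print(weighted_majority(predictors, stream))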