09s1: COMP9417 Machine Learning and Data Mining
Computational Learning Theory
May 20, 2009

Acknowledgement: Material derived from slides for the book Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997 (http://www-2.cs.cmu.edu/~tom/mlbook.html); slides by Andrew W. Moore, available at http://www.cs.cmu.edu/~awm/tutorials; the book Data Mining, Ian H. Witten and Eibe Frank, Morgan Kaufmann, 2000 (http://www.cs.waikato.ac.nz/ml/weka); the book Pattern Classification, Richard O. Duda, Peter E. Hart and David G. Stork, Copyright (c) 2001 by John Wiley & Sons, Inc.; and the book The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani and Jerome Friedman, (c) 2001, Springer.

Aims

This lecture will enable you to define and reproduce fundamental approaches and results from the theory of machine learning. Following it you should be able to:
- describe a basic theoretical framework for the sample complexity of learning
- define training error and true error
- define Probably Approximately Correct (PAC) learning
- define the Vapnik-Chervonenkis (VC) dimension
- define optimal mistake bounds
- reproduce the basic results in each area

[Recommended reading: Mitchell, Chapter 7]
[Recommended exercises: 7.2, 7.4, 7.5, 7.6, (7.1, 7.3, 7.7)]

Computational Learning Theory

What general laws constrain inductive learning? We seek a theory to relate:
- probability of successful learning
- number of training examples
- complexity of the hypothesis space
- time complexity of the learning algorithm
- accuracy to which the target concept is approximated
- manner in which training examples are presented
Computational Learning Theory

Characterise classes of learning algorithms using questions such as:

Sample complexity: how many training examples are required for the learner to converge (with high probability) to a successful hypothesis?

Computational complexity: how much computational effort is required for the learner to converge (with high probability) to a successful hypothesis?

Mistake bounds: how many training examples will the learner misclassify before converging to a successful hypothesis?

Here a successful hypothesis might mean one identical to the target concept, or one that mostly agrees with the target concept, or one that does so most of the time.

Prototypical Concept Learning Task

Given:
- Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
- Target function c: EnjoySport : X → {0, 1}
- Hypotheses H: conjunctions of literals, e.g. ⟨?, Cold, High, ?, ?, ?⟩
- Training examples D: positive and negative examples of the target function, ⟨x_1, c(x_1)⟩, ..., ⟨x_m, c(x_m)⟩

Determine:
- a hypothesis h in H such that h(x) = c(x) for all x in D?
- a hypothesis h in H such that h(x) = c(x) for all x in X?

Sample Complexity

How many training examples are sufficient to learn the target concept?

1. If the learner proposes instances, as queries to a teacher: the learner proposes instance x, the teacher provides c(x)
2. If the teacher (who knows c) provides training examples: the teacher provides a sequence of examples of the form ⟨x, c(x)⟩
3. If some random process (e.g., nature) proposes instances: instance x is generated randomly, the teacher provides c(x)
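The conjunction-of-literals hypothesis representation above can be sketched in code. This is an illustrative sketch, not code from the lecture; the function name and example instances are assumptions:

```python
# A conjunctive hypothesis over the EnjoySport attributes: '?' matches any
# value, a literal value must match exactly.

def matches(hypothesis, instance):
    """Return 1 if every constraint in the hypothesis is satisfied."""
    return int(all(c == '?' or c == v for c, v in zip(hypothesis, instance)))

h = ('?', 'Cold', 'High', '?', '?', '?')
x1 = ('Sunny', 'Cold', 'High', 'Strong', 'Warm', 'Same')
x2 = ('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same')

print(matches(h, x1))  # 1: Cold and High are both satisfied
print(matches(h, x2))  # 0: AirTemp is Warm, not Cold
```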
Sample Complexity: 1

Learner proposes instance x, teacher provides c(x) (assume c is in the learner's hypothesis space H).

Optimal query strategy: play "20 questions"
- pick instance x such that half of the hypotheses in VS classify x positive, and half classify x negative
- when this is possible, need ⌈log_2 |H|⌉ queries to learn c
- when not possible, need even more
- the learner conducts experiments (generates queries); see Version Space learning

Sample Complexity: 2

Teacher (who knows c) provides training examples (assume c is in the learner's hypothesis space H).

Optimal teaching strategy: depends on the H used by the learner.

Consider the case H = conjunctions of up to n Boolean literals, e.g. (AirTemp = Warm) ∧ (Wind = Strong), where AirTemp, Wind, ... each have 2 possible values. If there are n possible Boolean attributes in H, n + 1 examples suffice. Why?

Sample Complexity: 3

Given:
- set of instances X
- set of hypotheses H
- set of possible target concepts C
- training instances generated by a fixed, unknown probability distribution D over X

The learner observes a sequence D of training examples of the form ⟨x, c(x)⟩, for some target concept c ∈ C:
- instances x are drawn from distribution D
- the teacher provides the target value c(x) for each

The learner must output a hypothesis h estimating c; h is evaluated by its performance on subsequent instances drawn according to D.

Note: randomly drawn instances, noise-free classifications.

True Error of a Hypothesis

[Figure: instance space X, showing the regions labelled + and - by target concept c and by hypothesis h; the true error is the probability mass of the region where c and h disagree.]
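The "20 questions" query strategy can be simulated on a toy hypothesis class. The class (1-D thresholds) and the numbers are illustrative assumptions, not from the lecture; each query is chosen so it splits the current version space exactly in half:

```python
# Optimal query strategy on a toy class: h_t(x) = 1 iff x >= t, for
# thresholds t in {0, ..., 15}, so |H| = 16 and log2 |H| = 4 queries suffice.

def identify(target_t, thresholds):
    vs = sorted(thresholds)          # current version space
    queries = 0
    while len(vs) > 1:
        x = vs[len(vs) // 2 - 1]     # half of vs has t <= x, i.e. classifies x as 1
        queries += 1
        if target_t <= x:            # teacher answers c(x) = 1
            vs = [t for t in vs if t <= x]
        else:                        # teacher answers c(x) = 0
            vs = [t for t in vs if t > x]
    return vs[0], queries

t_hat, q = identify(11, range(16))
print(t_hat, q)  # 11 4: recovers the target in exactly log2(16) = 4 queries
```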
True Error of a Hypothesis

Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.

    error_D(h) ≡ Pr_{x∼D}[c(x) ≠ h(x)]

Two Notions of Error

Training error of hypothesis h with respect to target concept c: how often h(x) ≠ c(x) over the training instances.

True error of hypothesis h with respect to c: how often h(x) ≠ c(x) over future random instances.

Our concern: can we bound the true error of h given the training error of h? First consider the case when the training error of h is zero (i.e., h ∈ VS_{H,D}).

Exhausting the Version Space

[Figure: hypothesis space H, with the version space VS_{H,D} as the subset of hypotheses consistent with the training data; r = training error, error = true error. Sample hypotheses: error = .1, r = .2; error = .2, r = 0; error = .3, r = .4; error = .3, r = .1; error = .1, r = 0; error = .2, r = .3.]

Definition: The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has true error less than ε with respect to c and D:

    (∀h ∈ VS_{H,D}) error_D(h) < ε
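The gap between the two notions of error can be illustrated with a Monte Carlo estimate of error_D(h). The distribution, target and hypothesis below are illustrative assumptions: D is uniform on [0, 1), c(x) = [x < 0.5] and h(x) = [x < 0.55], so the true error is exactly 0.05, the mass of [0.5, 0.55) where c and h disagree:

```python
# Monte Carlo estimate of the true error of h with respect to c and D.

import random

random.seed(0)

def c(x): return x < 0.5
def h(x): return x < 0.55

xs = [random.random() for _ in range(100_000)]
estimate = sum(c(x) != h(x) for x in xs) / len(xs)
print(estimate)  # close to the true error 0.05
```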
How many examples will ε-exhaust the VS?

Theorem [Haussler, 1988]. If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than

    |H| e^{-εm}

Interesting! This bounds the probability that any consistent learner will output a hypothesis h with error_D(h) ≥ ε.

If we want this probability to be below δ,

    |H| e^{-εm} ≤ δ

then

    m ≥ (1/ε) (ln |H| + ln(1/δ))

Learning Conjunctions of Boolean Literals

How many examples are sufficient to assure with probability at least (1 − δ) that every h in VS_{H,D} satisfies error_D(h) ≤ ε?

Use our theorem:

    m ≥ (1/ε) (ln |H| + ln(1/δ))

Suppose H contains conjunctions of constraints on up to n Boolean attributes (i.e., n Boolean literals). Then |H| = 3^n, and

    m ≥ (1/ε) (ln 3^n + ln(1/δ))

or

    m ≥ (1/ε) (n ln 3 + ln(1/δ))
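The bound is easy to evaluate numerically. The sketch below is a direct transcription of the formula m ≥ (1/ε)(ln |H| + ln(1/δ)); the function name and the example values n = 10, ε = 0.1, δ = 0.05 are illustrative:

```python
# Sample-size calculator for the epsilon-exhaustion bound.

import math

def sample_bound(h_size, eps, delta):
    """Smallest integer m satisfying m >= (1/eps)(ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# Conjunctions over n = 10 Boolean attributes: |H| = 3^10
print(sample_bound(3 ** 10, eps=0.1, delta=0.05))  # 140
```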
How About EnjoySport?

    m ≥ (1/ε) (ln |H| + ln(1/δ))

If H is as given in EnjoySport, then |H| = 973, and

    m ≥ (1/ε) (ln 973 + ln(1/δ))

... if we want to assure that with probability 95% VS contains only hypotheses with error_D(h) ≤ .1, then it is sufficient to have m examples, where

    m ≥ (1/.1) (ln 973 + ln(1/.05))
    m ≥ 10 (ln 973 + ln 20)
    m ≥ 10 (6.88 + 3.00)
    m ≥ 98.8

PAC Learning

Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.

Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n and size(c).

Agnostic Learning

So far, we assumed c ∈ H. The agnostic learning setting does not assume c ∈ H.

What do we want then? The hypothesis h that makes the fewest errors on the training data.

What is the sample complexity in this case? Derived from Hoeffding bounds:

    Pr[error_D(h) > error_train(h) + ε] ≤ e^{-2mε²}

which gives

    m ≥ (1/(2ε²)) (ln |H| + ln(1/δ))
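The two sample-complexity formulas can be compared numerically on the EnjoySport hypothesis space (|H| = 973). The sketch below is a direct transcription of the slide formulas; function names and the ε, δ values are illustrative:

```python
# Realizable bound m >= (1/eps)(ln|H| + ln(1/delta)) versus the agnostic
# bound m >= (1/(2 eps^2))(ln|H| + ln(1/delta)).

import math

def realizable_m(h_size, eps, delta):
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

def agnostic_m(h_size, eps, delta):
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * eps ** 2))

print(realizable_m(973, 0.1, 0.05))  # 99, matching the slide's m >= 98.8
print(agnostic_m(973, 0.1, 0.05))    # 494: the agnostic setting needs far more
```

Note how dropping the assumption c ∈ H changes the dependence on ε from 1/ε to 1/ε², roughly a factor of 1/(2ε) more examples.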
Unbiased Learners

An unbiased concept class C contains all target concepts definable on the instance space X:

    |C| = 2^{|X|}

Say X is defined using n Boolean features; then |X| = 2^n and

    |C| = 2^{2^n}

An unbiased learner has a hypothesis space able to represent all possible target concepts, i.e., H = C. Then

    m ≥ (1/ε) (2^n ln 2 + ln(1/δ))

i.e., exponential (in the number of features) sample complexity.

Bias, Variance and Model Complexity

We can see the behaviour of different models' predictive accuracy on a test sample and on the training sample as the model complexity is varied.

How Complex is a Hypothesis?

How should we measure model complexity? The number of parameters? Description length?
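The exponential blow-up is worth seeing numerically. The sketch below is a direct transcription of the unbiased-learner bound m ≥ (1/ε)(2^n ln 2 + ln(1/δ)); the ε, δ defaults and the values of n are illustrative:

```python
# Sample-complexity bound for an unbiased learner over n Boolean features.

import math

def unbiased_m(n, eps=0.1, delta=0.05):
    return math.ceil(((2 ** n) * math.log(2) + math.log(1 / delta)) / eps)

for n in (5, 10, 20, 30):
    print(n, unbiased_m(n))
# the bound roughly doubles with every extra feature: learning without
# inductive bias is hopeless for realistic n
```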
How Complex is a Hypothesis?

[Figure: the solid curve is the function sin(50x) for x ∈ [0, 1]. The blue (solid) and green (hollow) points illustrate how the associated indicator function I(sin(αx) > 0) can shatter (separate) an arbitrarily large number of points by choosing an appropriately high frequency α.]

Classes are separated based on sin(αx), where the frequency α is a single parameter.

Shattering a Set of Instances

Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.

Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

Three Instances Shattered

[Figure: three instances in instance space X, shattered by H.]

The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.
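The shattering definition can be checked by brute force on a toy hypothesis space. The class below (closed intervals [a, b] on the line, with h(x) = 1 iff a ≤ x ≤ b) and the test points are illustrative assumptions; intervals shatter any 2 distinct points but no set of 3, so their VC dimension is 2:

```python
# Brute-force check of the shattering definition for 1-D intervals.

from itertools import product

def interval_consistent(points, labels):
    """Is some interval [a, b] consistent with the given dichotomy?"""
    positives = [x for x, y in zip(points, labels) if y == 1]
    if not positives:
        return True                      # an interval away from all points
    a, b = min(positives), max(positives)
    # consistent iff no negative point falls inside [a, b]
    return all(not (a <= x <= b) for x, y in zip(points, labels) if y == 0)

def shattered(points):
    """Does the interval class realise every dichotomy of the points?"""
    return all(interval_consistent(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered([1.0, 2.0]))       # True: 2 points are shattered
print(shattered([1.0, 2.0, 3.0]))  # False: the dichotomy 1, 0, 1 fails
```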
VC Dim. of Linear Decision Surfaces

[Figures (a) and (b): point sets and linear decision surfaces, illustrating which sets of points can and cannot be shattered.]

Sample Complexity from VC Dimension

How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1 − δ)?

    m ≥ (1/ε) (4 log_2(2/δ) + 8 VC(H) log_2(13/ε))

Mistake Bounds

So far: how many examples are needed to learn? What about: how many mistakes before convergence?

Let's consider a setting similar to PAC learning:
- instances drawn at random from X according to distribution D
- the learner must classify each instance before receiving the correct classification from the teacher

Can we bound the number of mistakes the learner makes before converging?
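That linear decision surfaces in the plane shatter three points in general position can be verified mechanically: for each of the 2³ dichotomies, a perceptron (which converges whenever the dichotomy is linearly separable) finds a consistent separator. An illustrative sketch; the points and the epoch limit are assumptions:

```python
# Verifying VC >= 3 for linear decision surfaces in the plane.

from itertools import product

def perceptron_separates(points, labels, epochs=1000):
    """Try to find w, b with sign(w.x + b) matching labels in {-1, +1}."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:          # a clean pass: consistent separator found
            return True
    return False                   # never converged: presumed not separable

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # three non-collinear points
print(all(perceptron_separates(pts, labels)
          for labels in product([-1, 1], repeat=3)))  # True: VC >= 3
```

Four points in an XOR configuration defeat every linear separator, which is one way to see that the VC dimension of lines in the plane is exactly 3.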
Winnow

Mistake bounds following Littlestone (1987), who developed Winnow, an online, mistake-driven algorithm similar to the perceptron:
- learns a linear threshold function
- designed to learn in the presence of many irrelevant features
- number of mistakes grows only logarithmically with the number of irrelevant features

Winnow

While some instances are misclassified
    For each instance a
        classify a using the current weights
        If the predicted class is incorrect
            If a has class 1
                For each a_i = 1, w_i ← α w_i (if a_i = 0, leave w_i unchanged)
            Otherwise
                For each a_i = 1, w_i ← w_i / α (if a_i = 0, leave w_i unchanged)

Winnow

- user-supplied threshold θ
- class is 1 if Σ_i w_i a_i > θ
- note the similarity to the perceptron training rule; Winnow uses multiplicative weight updates (α > 1)
- will do better than the perceptron with many irrelevant attributes: an attribute-efficient learner

Mistake Bounds: Find-S

Consider Find-S when H = conjunctions of Boolean literals.

Find-S:
    Initialize h to the most specific hypothesis l_1 ∧ ¬l_1 ∧ l_2 ∧ ¬l_2 ∧ ... ∧ l_n ∧ ¬l_n
    For each positive training instance x
        Remove from h any literal that is not satisfied by x
    Output hypothesis h.

How many mistakes before converging to the correct h?
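The Winnow update above can be sketched directly in code. This is a minimal sketch of the multiplicative update as stated on the slide; the target concept (a 2-literal disjunction), the synthetic data, and the parameter choices α = 2 and θ = n are illustrative assumptions, not from the lecture:

```python
# Winnow: multiplicative promotion/demotion of the weights of active attributes.

import random

def winnow_train(examples, n, alpha=2.0):
    theta = float(n)                      # user-supplied threshold
    w = [1.0] * n
    for _ in range(100):                  # passes over the data
        mistakes = 0
        for a, label in examples:
            pred = int(sum(wi * ai for wi, ai in zip(w, a)) > theta)
            if pred != label:
                mistakes += 1
                for i in range(n):
                    if a[i] == 1:         # update only the active attributes
                        w[i] = w[i] * alpha if label == 1 else w[i] / alpha
        if mistakes == 0:                 # consistent with all examples
            break
    return w, theta

# Target: x0 OR x1, among n = 20 mostly irrelevant Boolean attributes.
random.seed(1)
n = 20
data = []
for _ in range(300):
    a = [random.randint(0, 1) for _ in range(n)]
    data.append((a, int(a[0] or a[1])))

w, theta = winnow_train(data, n)
acc = sum(int(sum(wi * ai for wi, ai in zip(w, a)) > theta) == y
          for a, y in data) / len(data)
print(acc)  # high training accuracy: the two relevant weights dominate
```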
Mistake Bounds: Find-S

Find-S will converge to an error-free hypothesis in the limit, provided:
- c ∈ H
- the data are noise-free

How many mistakes before converging to the correct h?
- Find-S initially classifies all instances as negative
- it generalizes for positive examples by dropping unsatisfied literals
- since c ∈ H, it will never classify a negative instance as positive (if it did, there would exist a hypothesis h′ such that c ≥_g h′ and an instance x such that h′(x) = 1 and c(x) ≠ 1, contradicting the definition of the generality ordering ≥_g)
- so the mistake bound comes from the number of positives classified as negative

Mistake Bounds: Find-S

- 2n terms in the initial hypothesis
- on the first mistake, half of these terms are removed, leaving n
- each further mistake removes at least 1 term
- in the worst case, all n remaining terms must be removed: the result would be the most general concept, under which everything is positive
- so the worst-case number of mistakes is n + 1, for a worst-case sequence of learning steps removing only one literal per step

Mistake Bounds: Halving Algorithm

Consider the Halving Algorithm:
- learn the concept using the version space Candidate-Elimination algorithm
- classify new instances by majority vote of version space members

How many mistakes before converging to the correct h? ... in the worst case? ... in the best case?
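The mistake-counting argument can be sketched as running code. This is an illustrative implementation of Find-S for conjunctions of Boolean literals; the representation (a set of (attribute, value) literals), target concept and data are assumptions, not from the lecture:

```python
# Find-S over n Boolean attributes, counting mistakes along the way.
# h is a set of literals (i, v) meaning "attribute i must equal v".

def find_s(examples, n):
    """Return (final hypothesis, number of mistakes made while learning)."""
    # Most specific hypothesis: all 2n literals (unsatisfiable).
    h = {(i, v) for i in range(n) for v in (0, 1)}
    mistakes = 0
    for x, label in examples:
        predicted = int(all(x[i] == v for i, v in h))
        if predicted != label:
            mistakes += 1          # only positives can be misclassified
        if label == 1:
            # drop every literal the positive example does not satisfy
            h = {(i, v) for i, v in h if x[i] == v}
    return h, mistakes

# Target concept: x0 = 1 AND x1 = 0, over n = 4 attributes.
examples = [
    ([1, 0, 0, 0], 1),
    ([0, 0, 1, 1], 0),
    ([1, 0, 1, 0], 1),
    ([1, 0, 1, 1], 1),
]
h, m = find_s(examples, 4)
print(sorted(h))  # [(0, 1), (1, 0)]: exactly the target's literals remain
print(m)          # 3: here every positive example caused one mistake
```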
Mistake Bounds: Halving Algorithm

How many mistakes before converging to the correct h?
- the Halving Algorithm learns exactly when the version space contains only one hypothesis, which corresponds to the target concept
- at each step, remove all hypotheses whose vote was incorrect
- a mistake occurs when the majority vote classification is incorrect

How many mistakes in the worst case?
- on every step, a mistake: the majority vote is incorrect
- on each mistake, the number of hypotheses is reduced by at least half
- for a hypothesis space of size |H|, the worst-case mistake bound is log_2 |H|

How many mistakes in the best case?
- on every step, no mistake: the majority vote is correct
- incorrect hypotheses are still removed, up to half
- in the best case, no mistakes are made in converging to the correct h

Optimal Mistake Bounds

Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (the maximum over all possible c ∈ C, and all possible training sequences):

    M_A(C) ≡ max_{c ∈ C} M_A(c)

Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C):

    Opt(C) ≡ min_{A ∈ learning algorithms} M_A(C)

    VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log_2(|C|)
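The Halving Algorithm can be sketched on the same toy threshold class used earlier (h_t(x) = 1 iff x ≥ t, t in {0, ..., 15}). The class and the example sequence are illustrative assumptions; the point is that every mistake at least halves the version space, so mistakes never exceed log₂ |H|:

```python
# Halving Algorithm: predict by majority vote of the version space, then
# remove every hypothesis that voted incorrectly.

def halving(instances, target_t, thresholds):
    vs = list(thresholds)
    mistakes = 0
    for x in instances:
        votes_pos = sum(1 for t in vs if x >= t)
        prediction = int(votes_pos * 2 > len(vs))   # majority vote
        truth = int(x >= target_t)
        if prediction != truth:
            mistakes += 1
        # remove all hypotheses whose vote was incorrect
        vs = [t for t in vs if int(x >= t) == truth]
    return vs, mistakes

vs, m = halving(range(16), target_t=11, thresholds=range(16))
print(vs)  # [11]: the version space has converged to the target
print(m, "mistakes, at most log2(16) = 4")
```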
Weighted Majority

A generalisation of the Halving Algorithm:
- predicts by a weighted vote of a set of prediction algorithms
- learns by altering the weights of the prediction algorithms
- a prediction algorithm simply predicts the value of the target concept given an instance: for example, the elements of a hypothesis space H, or different learning algorithms
- inconsistency between a prediction algorithm and a training example reduces the weight of that prediction algorithm
- bounds the number of mistakes of the ensemble by the number of mistakes made by the best prediction algorithm

a_i is the i-th prediction algorithm; w_i is the weight associated with a_i.

Weighted Majority

For all i, initialize w_i ← 1
For each training example ⟨x, c(x)⟩
    Initialize q_0 and q_1 to 0
    For each prediction algorithm a_i
        If a_i(x) = 0 then q_0 ← q_0 + w_i
        If a_i(x) = 1 then q_1 ← q_1 + w_i
    If q_1 > q_0 then predict c(x) = 1
    If q_0 > q_1 then predict c(x) = 0
    If q_0 = q_1 then predict 0 or 1 at random for c(x)
    For each prediction algorithm a_i
        If a_i(x) ≠ c(x) then w_i ← β w_i

Weighted Majority

The Weighted Majority algorithm begins by assigning a weight of 1 to each prediction algorithm. On a misclassification by a prediction algorithm, its weight is reduced by multiplication by a constant β, 0 ≤ β < 1. It is equivalent to the Halving Algorithm when β = 0; otherwise it just down-weights the contribution of algorithms which make errors.

Key result: the number of mistakes made by the Weighted Majority algorithm will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the number of prediction algorithms in the pool.

Summary

- Algorithm-independent learning, analysed by computational complexity methods
- An alternative viewpoint to statistics; complementary, with different goals
- Many different frameworks, no clear winner
- Many results are over-conservative from a practical viewpoint, e.g. worst-case rather than average-case analyses
- Has led to advances in practically useful algorithms, e.g. PAC → Boosting, VC dimension → SVM
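The Weighted Majority pseudocode above translates almost line for line into code. This is an illustrative sketch: the pool of prediction algorithms (threshold predictors) and the data stream are assumptions, not from the lecture:

```python
# Weighted Majority: weighted vote over a pool of predictors, down-weighting
# every predictor that errs by a factor beta.

import random

def weighted_majority(predictors, stream, beta=0.5):
    """Run Weighted Majority; return (final weights, ensemble mistake count)."""
    random.seed(0)
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, cx in stream:
        q = [0.0, 0.0]
        preds = [a(x) for a in predictors]
        for wi, p in zip(w, preds):
            q[p] += wi
        if q[1] > q[0]:
            guess = 1
        elif q[0] > q[1]:
            guess = 0
        else:
            guess = random.randint(0, 1)   # break ties at random
        if guess != cx:
            mistakes += 1
        # down-weight every predictor that was wrong on this example
        w = [wi * beta if p != cx else wi for wi, p in zip(w, preds)]
    return w, mistakes

# Pool: threshold predictors h_t(x) = [x >= t]; the target is h_5.
pool = [lambda x, t=t: int(x >= t) for t in range(10)]
stream = [(x, int(x >= 5)) for x in list(range(10)) * 3]
w, m = weighted_majority(pool, stream)
print(w.index(max(w)))  # 5: the error-free predictor keeps the largest weight
```

Setting beta = 0 removes erring predictors outright, recovering the Halving Algorithm as noted above.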