Machine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012


Machine Learning, CSE6740/CS7641/ISYE6740, Fall 2012. Computational Learning Theory. Le Song. Lecture 11, September 20, 2012. Based on slides from Eric Xing, CMU. Reading: Chap. 7 of T. Mitchell's book.

Complexity of Learning. The complexity of learning is measured mainly along two axes: information and computation. Information complexity is concerned with the generalization performance of learning: how many training examples are needed? How fast do the learner's estimates converge to the true population parameters? Computational complexity concerns the computational resources applied to the training data in order to extract the learner's predictions. It often seems that when an algorithm improves with respect to one of these measures, it deteriorates with respect to the other.

What General Laws constrain Inductive Learning? Sample complexity: how many training examples are sufficient to learn the target concept? Computational complexity: what resources are required to learn the target concept? We want a theory that relates: the training examples (their quantity m, their quality, and how they are presented), the complexity of the hypothesis/concept space H, the accuracy of the approximation to the target concept, and the probability of successful learning.

Prototypical concept learning task (binary classification). Everything we say here generalizes to other problems, including regression and multiclass classification. Given: Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast. Target function c: EnjoySport: X → {0, 1}. Hypothesis space H: conjunctions of literals, e.g. ⟨?, Cold, High, ?, ?, ?⟩. Training examples S: i.i.d. positive and negative examples of the target function, (x_1, c(x_1)), ..., (x_m, c(x_m)). Determine: a hypothesis h in H such that h(x) is "good" w.r.t. c(x) for all x in S? Or a hypothesis h in H such that h(x) is "good" w.r.t. c(x) for all x under the true distribution D?
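
As an illustrative sketch (not part of the original slides), a conjunction-of-literals hypothesis over the EnjoySport attributes can be encoded as a tuple of constraints in Python, where "?" matches any value; the helper names and example values below are my own assumptions.

# Sketch of the EnjoySport-style hypothesis representation (assumed encoding).
ATTRIBUTES = ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]

def h_predict(hypothesis, instance):
    """Return 1 if the instance satisfies every constraint, else 0."""
    return int(all(c == "?" or c == v for c, v in zip(hypothesis, instance)))

def consistent(hypothesis, examples):
    """True if the hypothesis labels every training example correctly."""
    return all(h_predict(hypothesis, x) == y for x, y in examples)

# Hypothetical training sample S (values chosen only for illustration).
S = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), 1),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), 0),
]
h = ("Sunny", "?", "?", "?", "?", "?")
print(consistent(h, S))  # True on this toy sample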

Two Basic Competing Models. PAC framework: sample labels are consistent with some h in H; the learner's hypothesis is required to meet an absolute upper bound on its error. Agnostic framework: no prior restriction on the sample labels; the required upper bound on the hypothesis error is only relative (to the best hypothesis in the class).

Sample Complexity. How many training examples are sufficient to learn the target concept? Training scenarios: (1) If the learner proposes instances as queries to the teacher: the learner proposes an instance x, and the teacher provides c(x). (2) If the teacher (who knows c) provides training examples: the teacher provides a sequence of examples of the form (x, c(x)). (3) If some random process (e.g., nature) proposes instances: an instance x is generated randomly, and the teacher provides c(x).

Protocol. [Illustrative table: training records with attributes Temp, Press., Sore-Throat, Colour and label diseaseX, e.g. (35, 95, Y, Pale, No), (22, 110, N, Clear, Yes), ..., (10, 87, N, Pale, No); a new instance (32, 90, N, Pale) is fed to the learned classifier, which predicts diseaseX = No.] Given: a set of examples X, a fixed (unknown) distribution D over X, a set of hypotheses H, and a set of possible target concepts C. The learner observes a sample S = { (x_i, c(x_i)) }: instances x_i are drawn from the distribution D and labeled by a target concept c ∈ C (the learner does NOT know c(·) or D). The learner outputs h ∈ H estimating c; h is evaluated by its performance on subsequent instances drawn from D. For now: C = H (so c ∈ H), and noise-free data.

True error of a hypothesis. Definition: the true error (denoted ε_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.

Two notions of error. The training error of hypothesis h with respect to target concept c: how often h(x) ≠ c(x) over the training instances in S. The true error of hypothesis h with respect to c: how often h(x) ≠ c(x) over future instances drawn i.i.d. from D. Can we bound ε_D(h) in terms of ε_S(h)?
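
In symbols (restating the two definitions above in the slide's notation, with m the size of the training sample S):

\varepsilon_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{ h(x_i) \neq c(x_i) \},
\qquad
\varepsilon_D(h) = \Pr_{x \sim D}\left[ h(x) \neq c(x) \right].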

The Union Bound. Lemma (the union bound). Let A_1, A_2, ..., A_k be k different events (that may not be independent). Then P(A_1 ∪ A_2 ∪ ... ∪ A_k) ≤ P(A_1) + P(A_2) + ... + P(A_k). In probability theory, the union bound is usually stated as an axiom (and thus we won't try to prove it), but it also makes intuitive sense: the probability of any one of k events happening is at most the sum of the probabilities of the k individual events.

Hoeffding inequality. Lemma (Hoeffding inequality). Let Z_1, ..., Z_m be m independent and identically distributed (i.i.d.) random variables drawn from a Bernoulli(φ) distribution, i.e., P(Z_i = 1) = φ and P(Z_i = 0) = 1 − φ. Let φ̂ = (1/m) Σ_{i=1}^m Z_i be the mean of these random variables, and let any γ > 0 be fixed. Then P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m). This lemma (which in learning theory is also called the Chernoff bound) says that if we take the average of m Bernoulli(φ) random variables as our estimate of φ, then the probability of this estimate being far from the true value is small, so long as m is large.
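
A quick empirical check of the Hoeffding bound (my own sketch, with arbitrary illustrative parameter values):

import math
import random

phi, gamma, m, trials = 0.3, 0.05, 500, 20000   # assumed illustrative values

random.seed(0)
deviations = 0
for _ in range(trials):
    phi_hat = sum(random.random() < phi for _ in range(m)) / m
    if abs(phi_hat - phi) > gamma:
        deviations += 1

print("empirical P(|phi_hat - phi| > gamma):", deviations / trials)
print("Hoeffding bound 2*exp(-2*gamma^2*m): ", 2 * math.exp(-2 * gamma**2 * m))
# The empirical frequency stays below the bound (~0.164 here).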

Version Space. A hypothesis h is consistent with a set of training examples S of target concept c if and only if h(x_i) = c(x_i) for each training example (x_i, c(x_i)) in S. The version space, VS_{H,S}, with respect to hypothesis space H and training examples S is the subset of hypotheses from H that are consistent with all training examples in S.
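
For a finite H, the version space can be computed by brute force; the toy encoding below (hypotheses as constraint tuples over two binary attributes, with "?" meaning "any") is an assumption of mine, not the slides'.

from itertools import product

def h_predict(hypothesis, instance):
    """1 if every constraint is '?' or matches the attribute value."""
    return int(all(c == "?" or c == v for c, v in zip(hypothesis, instance)))

def version_space(H, S):
    """All hypotheses in H consistent with every labeled example in S."""
    return [h for h in H if all(h_predict(h, x) == y for x, y in S)]

# Finite H: all constraint tuples over two binary attributes.
H = list(product(["?", 0, 1], repeat=2))
S = [((1, 0), 1), ((0, 0), 0)]   # hypothetical labeled examples
print(version_space(H, S))       # [(1, '?'), (1, 0)]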

Exhausting the version space. Definition: the version space VS_{H,S} is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_{H,S} has true error less than ε with respect to c and D: ∀h ∈ VS_{H,S}, ε_D(h) < ε.

Probably Approximately Correct. Goal: a PAC learner produces a hypothesis ĥ that is approximately correct, err_D(ĥ) ≈ 0, with high probability, P(err_D(ĥ) ≈ 0) ≈ 1. Double hedging: "approximately" and "probably". We need both!

A Simple Setting. Classification with m data points in S. A finite number of possible hypotheses (e.g., conjunctions of literals, decision trees). A learner finds a hypothesis h that is consistent with the training data, i.e., it gets zero training error: ε_S(h) = 0. What is the probability that h has true error more than ε, i.e., ε_D(h) ≥ ε?

How many examples will ε-exhaust the VS? Theorem [Haussler, 1988]: if the hypothesis space H is finite, and S is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1/2, the probability that the version space VS_{H,S} is not ε-exhausted (i.e., contains a hypothesis with true error more than ε with respect to c and D) is less than |H| e^{−εm}. This bounds the probability that any consistent learner (one that fits the training data with zero training error) will output a hypothesis h with true error ε_D(h) ≥ ε.

Proof. How likely is a bad hypothesis h to get m data points right? A hypothesis h consistent with the training data gets all m i.i.d. data points right. h is "bad" if it gets all training points right but has high true error, ε_D(h) ≥ ε. The probability that a hypothesis with ε_D(h) ≥ ε gets one point right: Pr(h gets 1 point right) ≤ 1 − ε. The probability that a hypothesis with ε_D(h) ≥ ε gets m points right: Pr(h gets m points right) ≤ (1 − ε)^m, which is exponentially small as m increases.

Proof (cont.). But there may be many hypotheses h_1, h_2, ... in H that are consistent with the training data (the version space VS_{H,S}). In the worst case, the learner picks the one among them with the highest true error ε_D(h).

Proof (cont.). How likely is the learner to pick a bad hypothesis? As before, the probability that a hypothesis with ε_D(h) ≥ ε gets m points right is Pr(h gets m points right) ≤ (1 − ε)^m. Suppose k hypotheses are consistent with the m training points. Then Pr(some h is bad and consistent with the m data points) = Pr(h_1 is bad and consistent, or h_2 is bad and consistent, or ..., or h_k is bad and consistent).

Proof (cont.). By the union bound (for events A_1, ..., A_k that may not be independent, P(A_1 ∪ ... ∪ A_k) ≤ P(A_1) + ... + P(A_k)): if k hypotheses are consistent with the m training points, then Pr(some h is bad and consistent with the m data points) ≤ Pr(h_1 bad, consistent) + ... + Pr(h_k bad, consistent) ≤ k(1 − ε)^m ≤ |H|(1 − ε)^m ≤ |H| e^{−mε}.
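
The final step above uses the standard inequality 1 − x ≤ e^{−x} (valid for all real x), applied with x = ε:

|H| (1 - \varepsilon)^m \le |H| \left( e^{-\varepsilon} \right)^m = |H| e^{-\varepsilon m}.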

What it means. [Haussler, 1988]: the probability that the version space is not ε-exhausted after m training examples is at most |H| e^{−εm}. Suppose we want this probability to be at most δ. 1. How many training examples suffice? m ≥ (1/ε)(ln|H| + ln(1/δ)). 2. If error_train(h) = 0, then with probability at least 1 − δ: ε_D(h) ≤ (1/m)(ln|H| + ln(1/δ)).
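
A small calculator for these two bounds (a sketch; the function names are mine):

import math

def pac_sample_size(H_size, epsilon, delta):
    """m sufficient so every consistent h has true error <= epsilon w.p. >= 1 - delta."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / epsilon)

def pac_error_bound(H_size, m, delta):
    """Bound on true error, w.p. >= 1 - delta, for an h with zero training error."""
    return (math.log(H_size) + math.log(1 / delta)) / m

print(pac_sample_size(973, 0.1, 0.05))   # 99
print(pac_error_bound(973, 200, 0.05))   # ~0.049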

Learning Conjunctions of Boolean Literals. How many examples are sufficient to ensure with probability at least 1 − δ that every h in VS_{H,S} satisfies ε_D(h) ≤ ε? Use our theorem: m ≥ (1/ε)(ln|H| + ln(1/δ)). Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals). Then |H| = 3^n, and m ≥ (1/ε)(ln 3^n + ln(1/δ)), or m ≥ (1/ε)(n ln 3 + ln(1/δ)).

How about EnjoySport? m ≥ (1/ε)(ln|H| + ln(1/δ)). If H is as given in EnjoySport, then |H| = 973, and m ≥ (1/ε)(ln 973 + ln(1/δ)). If we want to ensure that with probability 95% the version space contains only hypotheses with ε_D(h) ≤ 0.1, then it is sufficient to have m examples, where m ≥ (1/0.1)(ln 973 + ln(1/0.05)), i.e. m ≥ 10(ln 973 + ln 20) = 10(6.88 + 3.00) = 98.8.
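
Checking the arithmetic on this slide (the 98.8 rounds up to 99 examples in practice):

import math

m = (math.log(973) + math.log(1 / 0.05)) / 0.1
print(round(m, 1))     # 98.8
print(math.ceil(m))    # 99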

PAC-Learning. Learner L can draw a labeled instance (x, c(x)) in unit time, with x ∈ X of length n drawn from distribution D and labeled by target concept c ∈ C. Definition: learner L PAC-learns class C using hypothesis space H if 1. for any target concept c ∈ C, any distribution D, any ε such that 0 < ε < 1/2, and any δ such that 0 < δ < 1/2, L returns h ∈ H such that, with probability ≥ 1 − δ, err_D(h) < ε; and 2. L's run-time (and hence sample complexity) is poly(|x|, size(c), 1/ε, 1/δ). Sufficient conditions: 1. only poly(...) training instances, e.g. |H| = 2^{poly(...)}; 2. only poly time per instance. Often C = H.

What if the classifier does not have zero error on the training data? A learner with zero training error may still make mistakes on the test set. What about a learner with nonzero training error ε_S(h)?

Agnostic Learning. So far we assumed c ∈ H. In the agnostic learning setting, we don't assume c ∈ H. What do we want then? The hypothesis h that makes the fewest errors on the training data. What is the sample complexity in this case? m ≥ (1/(2ε²))(ln|H| + ln(1/δ)), derived from the Hoeffding bound: Pr[error_D(h) > error_S(h) + ε] ≤ e^{−2mε²}.
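
Sketch of how the sample-size bound follows from the Hoeffding bound via the union bound over the |H| hypotheses (the same argument pattern as in the zero-training-error case):

\Pr\left[ \exists h \in H : \mathrm{error}_D(h) > \mathrm{error}_S(h) + \varepsilon \right]
\le |H| e^{-2m\varepsilon^2} \le \delta
\quad\Longrightarrow\quad
m \ge \frac{1}{2\varepsilon^2}\left( \ln|H| + \ln\frac{1}{\delta} \right).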

Empirical Risk Minimization Paradigm. Choose a hypothesis class H of subsets of X. For an input sample S, find some h in H that fits S "well". For a new point x, predict a label according to its membership in h. Example: consider linear classification, and let h_θ(x) = 1{θᵀx ≥ 0}; then empirical risk minimization (ERM) picks θ̂ = argmin_θ ε_S(h_θ) and outputs ĥ = h_θ̂. We think of ERM as the most "basic" learning algorithm, and it is this algorithm that we focus on in the remainder. In our study of learning theory, it will be useful to abstract away from the specific parameterization of hypotheses and from issues such as whether we are using a linear classifier or an ANN.
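
A minimal ERM sketch over a finite set of candidate linear classifiers h_θ(x) = 1{θᵀx ≥ 0} (the toy data and candidate θ values are my own, for illustration only):

def h_theta(theta, x):
    """Linear threshold classifier: 1 if theta . x >= 0, else 0."""
    return int(sum(t * xi for t, xi in zip(theta, x)) >= 0)

def training_error(theta, S):
    """Fraction of training examples misclassified by h_theta."""
    return sum(h_theta(theta, x) != y for x, y in S) / len(S)

# Hypothetical sample; each x includes a constant 1 for the bias term.
S = [((1, 2.0), 1), ((1, 1.5), 1), ((1, -0.5), 0), ((1, -2.0), 0)]

# A small finite grid of candidate parameter vectors (theta_0, theta_1).
candidates = [(-1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.5, -1.0)]

# ERM: pick the candidate with the lowest training error.
theta_hat = min(candidates, key=lambda th: training_error(th, S))
print(theta_hat, training_error(theta_hat, S))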

The Case of Finite H. Let H = {h_1, ..., h_k} consist of k hypotheses. We would like to give guarantees on the generalization error of ĥ. First, we will show that ε_S(h) is a reliable estimate of ε_D(h) for all h. Second, we will show that this implies an upper bound on the generalization error of ĥ.

Misclassification Probability. The outcome of a binary classifier can be viewed as a Bernoulli random variable Z = 1{h_i(x) ≠ c(x)}, where x ∼ D; for each sample, Z_j = 1{h_i(x_j) ≠ c(x_j)}. The training error ε_S(h_i) = (1/m) Σ_{j=1}^m Z_j is then the average of m i.i.d. Bernoulli(ε_D(h_i)) variables, so by the Hoeffding inequality P(|ε_D(h_i) − ε_S(h_i)| > γ) ≤ 2 exp(−2γ²m). This shows that, for our particular h_i, the training error will be close to the generalization error with high probability, assuming m is large.

Uniform Convergence. But we don't just want to guarantee that ε_S(h_i) will be close to ε_D(h_i) (with high probability) for one particular h_i; we want to prove that this is true simultaneously for all h_i ∈ H. For k hypotheses, the union bound gives P(∃h_i ∈ H: |ε_D(h_i) − ε_S(h_i)| > γ) ≤ 2k exp(−2γ²m). This means that with probability at least 1 − 2k exp(−2γ²m), we have |ε_D(h_i) − ε_S(h_i)| ≤ γ for all h_i ∈ H.

Uniform Convergence (cont.). In the discussion above, for particular values of m and γ, we gave a bound on the probability that |ε_D(h_i) − ε_S(h_i)| > γ for some h_i ∈ H. There are three quantities of interest here: m, γ, and the probability of error δ; we can bound any one of them in terms of the other two.

Sample Complexity. How many training examples do we need in order to make such a guarantee? We find that if m ≥ (1/(2γ²)) log(2k/δ), then with probability at least 1 − δ we have |ε_D(h_i) − ε_S(h_i)| ≤ γ for all h_i ∈ H. The key property of this bound is that the number of training examples needed to make the guarantee is only logarithmic in k, the number of hypotheses in H. This will be important later.

Generalization Error Bound. Similarly, we can hold m and δ fixed and solve for γ in the previous equation, and show (again, convince yourself that this is right!) that with probability 1 − δ we have, for all h_i ∈ H, |ε_S(h_i) − ε_D(h_i)| ≤ sqrt((1/(2m)) log(2k/δ)).
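
This yields the bound on the ERM hypothesis ĥ that was promised when finite H was introduced; on the uniform-convergence event, with h* = argmin_{h ∈ H} ε_D(h) (a standard final step, stated here for completeness):

\varepsilon_D(\hat{h}) \le \varepsilon_S(\hat{h}) + \gamma
\le \varepsilon_S(h^{*}) + \gamma
\le \varepsilon_D(h^{*}) + 2\gamma
= \left( \min_{h \in H} \varepsilon_D(h) \right) + 2\sqrt{ \frac{1}{2m} \log\frac{2k}{\delta} }.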

Summary. We bounded the true (generalization) error using the training error, in two cases: zero training error and nonzero training error, assuming a finite number of hypotheses. What about the case of an infinite number of hypotheses?