Learning Theory. Machine Learning 10-601B, Seyoung Kim. Many of these slides are derived from Tom Mitchell and Ziv Bar-Joseph. Thanks!


Computational Learning Theory. What general laws constrain inductive learning? We want a theory that relates: the number of training examples, the complexity of the hypothesis space, the accuracy to which the target function is approximated, the manner in which training examples are presented, and the probability of successful learning. * See the annual Conference on Computational Learning Theory (COLT).

Sample Complexity. How many training examples suffice to learn the target concept? 1. If some random process (e.g., nature) proposes instances and a teacher labels them: instances are drawn according to P(X). 2. If the learner proposes instances as queries to the teacher: the learner proposes x, the teacher provides f(x) (active learning). 3. If the teacher (who knows f(x)) proposes training examples: the teacher proposes a sequence {<x_1, f(x_1)>, ..., <x_n, f(x_n)>}.

Learning Theory. In general, we are interested in: sample complexity (how many training examples are needed for a learner to converge to a successful hypothesis?) and computational complexity (how much computational effort is needed for a learner to converge to a successful hypothesis?). The two are related. Why?

Problem Setting for Learning from Data. Given: a set of instances X, where each instance x = <x_1, ..., x_n> has n input features; a sequence of input instances drawn at random from P(X); a set of hypotheses H; a set of possible target functions C (the teacher provides noise-free labels). The learner observes a sequence D of training examples of the form <x, c(x)> for some target concept c ∈ C: instances x are drawn from P(X), and the teacher provides the target value c(x).

Problem Setting for Learning from Data. Goal: the learner must output a hypothesis h ∈ H estimating c, such that h is evaluated on subsequent instances drawn from P(X). Setting: randomly drawn instances, noise-free classification.

Function Approximation with Training Data. Given X = {x : x is Boolean, x = <x_1, ..., x_n>} with n input features: How many possible input values? |X| = 2^n. How many possible label assignments? 2^(2^n). The size of a hypothesis space that can represent all possible label assignments? |H| = 2^(2^n). To find an h that is identical to c, we would need observations for all |X| = 2^n inputs. In practice we are limited by training data, so we need to introduce inductive bias.
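As a quick sanity check on this counting argument, here is a small Python sketch (my own illustration, not from the slides) that enumerates the inputs and labelings for a tiny n:

```python
# A minimal sketch (not from the slides): brute-force check of the counting
# argument for a small number of Boolean features.
from itertools import product

n = 3                                      # number of Boolean input features
inputs = list(product([0, 1], repeat=n))   # all possible input vectors
num_inputs = len(inputs)                   # |X| = 2^n
num_labelings = 2 ** num_inputs            # 2^(2^n) distinct labelings of X

print(num_inputs)     # 8   (= 2^3)
print(num_labelings)  # 256 (= 2^(2^3)); a fully expressive H needs |H| = 256
```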

The true error of h is the probability that it will misclassify an example drawn at random from P(X).

[Figure: training examples D are drawn at random from the instance space according to the probability distribution P(X).]
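Written out (the formulas themselves are implied but not shown in the transcript), the true error and training error referred to on the next slides are:

```latex
\mathrm{error}_{\mathrm{true}}(h) \;\equiv\; \Pr_{x \sim P(X)}\big[\,c(x) \neq h(x)\,\big],
\qquad
\mathrm{error}_{\mathrm{train}}(h) \;\equiv\; \frac{1}{|D|}\sum_{\langle x,\, c(x)\rangle \in D} \mathbb{1}\big[\,c(x) \neq h(x)\,\big]
```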

Overfitting. Consider a hypothesis h, its error rate over the training data, error_train(h), and its true error rate over all data, error_true(h). We say h overfits the training data if error_true(h) > error_train(h). Amount of overfitting = error_true(h) - error_train(h). Can we bound error_true(h) in terms of error_train(h)?

If D were a set of examples drawn from P(X) and independent of h, then we could use standard statistical confidence intervals to determine that, with 95% probability, error_true(h) lies in an interval around error_D(h). But D is the training data for h, so this reasoning does not apply directly.
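The interval itself is dropped in the transcript; the usual binomial (normal-approximation) interval for m = |D| independent test examples, as in Mitchell's Chapter 5 treatment of evaluating hypotheses, would be:

```latex
\mathrm{error}_{\mathrm{true}}(h) \;\approx\; \mathrm{error}_D(h) \;\pm\; 1.96\,\sqrt{\frac{\mathrm{error}_D(h)\,\big(1 - \mathrm{error}_D(h)\big)}{m}}
```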

c : X → {0, 1}

Any(!) learner that outputs a hypothesis consistent with all training examples (i.e., an h contained in the version space VS_{H,D})

Proof: Given hypothesis space H, input space X, m labeled examples, target function c, and error tolerance ε, let h_1, h_2, ..., h_K be the hypotheses in H with true error > ε. Then: the probability that h_1 is consistent with the first training example is ≤ (1 - ε); the probability that h_1 is consistent with m independently drawn training examples is ≤ (1 - ε)^m; the probability that at least one of h_1, h_2, ..., h_K (the K "bad" hypotheses) is consistent with all m examples is ≤ K(1 - ε)^m ≤ |H|(1 - ε)^m ≤ |H| e^(-εm), since K ≤ |H| and, for 0 ≤ ε ≤ 1, (1 - ε) ≤ e^(-ε), so (1 - ε)^m ≤ e^(-εm).

What it means [Haussler, 1988]: the probability that the version space is not ε-exhausted after m training examples is at most |H| e^(-εm). Suppose we want this probability to be at most δ. How many training examples suffice?
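The slide leaves the algebra implicit; solving |H| e^(-εm) ≤ δ for m gives the familiar finite-hypothesis-space bound, which is the same formula the decision-tree examples below plug numbers into:

```latex
|H|\, e^{-\varepsilon m} \le \delta
\quad\Longleftrightarrow\quad
m \;\ge\; \frac{1}{\varepsilon}\Big(\ln|H| + \ln\tfrac{1}{\delta}\Big)
```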

E.g., X = <X_1, X_2, ..., X_n>. Each h ∈ H constrains each X_i to be 1, 0, or "don't care". In other words, each h is a rule such as: If X_2 = 0 and X_5 = 1 then Y = 1, else Y = 0.
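To make this hypothesis representation concrete, here is a minimal Python sketch (my own illustration, not from the slides); the helper name h and the constraint encoding are just for illustration:

```python
# A minimal sketch (not from the slides): encode one such rule as a per-feature
# constraint (1, 0, or None for "don't care") and evaluate it on an instance.
def h(x, constraints):
    """Return 1 if x satisfies every non-None constraint, else 0."""
    return int(all(c is None or xi == c for xi, c in zip(x, constraints)))

# The slide's rule "If X2 = 0 and X5 = 1 then Y = 1, else Y = 0", over n = 5
# Boolean features (the slide indexes features from 1; Python lists from 0).
rule = [None, 0, None, None, 1]
print(h([1, 0, 1, 0, 1], rule))  # 1: X2 = 0 and X5 = 1 both hold
print(h([1, 1, 1, 0, 1], rule))  # 0: X2 = 1 violates the rule

# Each feature independently takes one of 3 constraint values,
# so there are 3**n such constraint vectors.
```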

Example: Simple decision trees. Consider a Boolean classification problem: instances X = <X_1, ..., X_N>, where each X_i is Boolean; each hypothesis in H is a decision tree of depth 1. How many training examples m suffice to assure that, with probability at least 0.99, any consistent learner using H will output a hypothesis with true error at most 0.05? Here |H| = 4N, ε = 0.05, δ = 0.01, so m ≥ (1/0.05)(ln(4N) + ln(1/0.01)). N = 4: m ≥ 148. N = 10: m ≥ 166. N = 100: m ≥ 212.
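To verify the numbers above, here is a short Python sketch (my own, not from the slides) of the finite-|H| bound m ≥ (1/ε)(ln|H| + ln(1/δ)); the helper name pac_bound is just for illustration:

```python
# A short check (not from the slides) of the finite-|H| sample-complexity
# bound  m >= (1/eps) * (ln|H| + ln(1/delta))  for a consistent learner.
import math

def pac_bound(H_size, eps, delta):
    """Smallest integer m satisfying the bound above (illustrative helper)."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

eps, delta = 0.05, 0.01
for N in (4, 10, 100):
    print(N, pac_bound(4 * N, eps, delta))   # depth-1 trees: |H| = 4N
# -> 148, 166, 212 for N = 4, 10, 100, matching the slide
```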

Example: H is decision trees of depth 2. Consider a classification problem f : X → Y: instances X = <X_1, ..., X_N>, where each X_i is Boolean; learned hypotheses are decision trees of depth 2, using only two variables. How many training examples m suffice to assure that, with probability at least 0.99, any consistent learner will output a hypothesis with true error at most 0.05? Here |H| = N(N-1)/2 x 16 = 8N^2 - 8N, so m ≥ (1/0.05)(ln(8N^2 - 8N) + ln(1/0.01)). N = 4: m ≥ 184. N = 10: m ≥ 224. N = 100: m ≥ 318.
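Assuming the illustrative pac_bound helper (and eps, delta) from the previous sketch, the depth-2 numbers come out the same way:

```python
# Reusing the illustrative pac_bound helper defined in the depth-1 sketch above.
for N in (4, 10, 100):
    print(N, pac_bound(8 * N * N - 8 * N, eps, delta))   # |H| = 16 * N(N-1)/2
# -> 184, 224, 318 for N = 4, 10, 100, matching the slide
```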

Sufficient condition: Holds if learner L requires only a polynomial number of training examples, and processing per example is polynomial

Here ε is the difference between the training error and true error of the output hypothesis (the one with lowest training error)

Additive Hoeffding Bounds (Agnostic Learning). Given m independent flips of a coin with true Pr(heads) = θ, we can bound the error in the maximum likelihood estimate. Relevance to agnostic learning: the same bound applies to error_train(h) as an estimate of error_true(h) for any single hypothesis h. But we must consider all hypotheses in H. Now we require this probability to be bounded by δ. Then we have m ≥ (1/(2ε^2))(ln|H| + ln(1/δ)).
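Spelled out (the intermediate steps are garbled in the transcript), the chain behind that sample-complexity expression is the Hoeffding bound plus a union bound over H:

```latex
\Pr\big[\hat{\theta} > \theta + \varepsilon\big] \le e^{-2m\varepsilon^2}
\;\;\Longrightarrow\;\;
\Pr\Big[\exists\, h \in H:\ \mathrm{error}_{\mathrm{true}}(h) > \mathrm{error}_{\mathrm{train}}(h) + \varepsilon\Big]
\;\le\; |H|\, e^{-2m\varepsilon^2} \;\le\; \delta
\;\;\Longrightarrow\;\;
m \;\ge\; \frac{1}{2\varepsilon^2}\Big(\ln|H| + \ln\tfrac{1}{\delta}\Big)
```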

Question: If H = {h | h : X → Y} is infinite, what measure of complexity should we use in place of |H|? Answer: the largest subset of X for which H can guarantee zero training error (regardless of the target function c). The VC dimension of H is the size of this subset.

A dichotomy of a set S: a labeling of each member of S as positive or negative. In the figure, each ellipse corresponds to a possible dichotomy: positive inside the ellipse, negative outside the ellipse.

VC(H)=3

Sample Complexity based on VC dimension. How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1 - δ)? I.e., to guarantee that any hypothesis that perfectly fits the training data is probably (1 - δ) approximately (ε) correct. Compare to our earlier result based on |H|: m ≥ (1/ε)(ln|H| + ln(1/δ)).
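The VC-based bound itself is missing from the transcript; the standard result quoted in Mitchell's Machine Learning (Ch. 7, due to Blumer et al.) is:

```latex
m \;\ge\; \frac{1}{\varepsilon}\Big(4 \log_2 \tfrac{2}{\delta} \;+\; 8\, VC(H)\, \log_2 \tfrac{13}{\varepsilon}\Big)
```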

VC dimension: examples. Consider a 1-dimensional real-valued input X; we want to learn c : X → {0, 1}. What is the VC dimension of the interval hypothesis classes H1-H4 shown in the figure? Open intervals: VC(H1) = 1, VC(H2) = 2. Closed intervals: VC(H3) = 2, VC(H4) = 3.

VC dimension: examples. What is the VC dimension of lines in the plane? H_2 = { ((w_0 + w_1 x_1 + w_2 x_2) > 0 → y = 1) }. Answer: VC(H_2) = 3. More generally, for H_n = linear separating hyperplanes in n dimensions, VC(H_n) = n + 1.

For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H|? (Hint: yes.) Assume VC(H) = K, which means H can shatter K examples. For K examples there are 2^K possible labelings, and shattering requires a distinct hypothesis for each, so |H| ≥ 2^K. Thus K ≤ log_2 |H|.

More VC Dimension Examples to Think About. Logistic regression over n continuous features; what about n Boolean features? Linear SVM over n continuous features. Decision trees defined over n Boolean features, f : <X_1, ..., X_n> → Y. How about 1-nearest neighbor?

Tightness of Bounds on Sample Complexity. How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1 - δ) approximately (ε) correct? How tight is this bound? A lower bound on sample complexity (Ehrenfeucht et al., 1989): consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 0.01. Then there exists a distribution and a target concept in C such that, if L observes fewer examples than the quantity below, then with probability at least δ, L outputs a hypothesis h with error_true(h) > ε.
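The omitted quantity is the standard form of this lower bound (as given, e.g., in Mitchell's Machine Learning, Ch. 7):

```latex
\max\!\left[\; \frac{1}{\varepsilon}\,\log\!\Big(\frac{1}{\delta}\Big),\;\; \frac{VC(C)-1}{32\,\varepsilon} \;\right]
```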

Agnostic Learning: VC Bounds for Decision Trees [Schölkopf and Smola, 2002]. With probability at least (1 - δ), every h ∈ H satisfies the bound below.
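The bound is not legible in the transcript; the VC generalization bound usually quoted in this form (cf. Schölkopf and Smola, 2002) is:

```latex
\mathrm{error}_{\mathrm{true}}(h) \;\le\; \mathrm{error}_{\mathrm{train}}(h)
\;+\; \sqrt{\frac{VC(H)\left(\ln\frac{2m}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{m}}
```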

What You Should Know. Sample complexity varies with the learning setting: the learner actively queries the trainer, or examples arrive at random. Within the PAC learning setting, we can bound the probability that the learner will output a hypothesis with a given error: for ANY consistent learner (the case where c ∈ H), and for ANY best-fit hypothesis (agnostic learning, where perhaps c is not in H). VC dimension as a measure of the complexity of H. Conference on Learning Theory: http://www.learningtheory.org. Avrim Blum's course on Machine Learning Theory: https://www.cs.cmu.edu/~avrim/ml14/