
Machine Learning 10-601
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
October 11, 2012

Today:
- Computational Learning Theory
- Probably Approximately Correct (PAC) learning theorem
- Vapnik-Chervonenkis (VC) dimension

Recommended reading: Mitchell, Ch. 7; suggested exercises: 7.1, 7.2, 7.7

Computational Learning Theory

What general laws constrain inductive learning? We want a theory that relates:
- the number of training examples
- the complexity of the hypothesis space
- the accuracy to which the target function is approximated
- the manner in which training examples are presented
- the probability of successful learning

* See the annual Conference on Computational Learning Theory (COLT)

Sample Complexity

How many training examples suffice to learn the target concept?
1. If the learner proposes instances as queries to the teacher: the learner proposes x, the teacher provides f(x).
2. If the teacher (who knows f(x)) proposes training examples: the teacher proposes the sequence {<x_1, f(x_1)>, ..., <x_n, f(x_n)>}.
3. If some random process (e.g., nature) proposes instances and the teacher labels them: instances are drawn according to P(X).

Problem setting:
- Set of instances X
- Set of hypotheses H
- Set of possible target functions C = {c : X → {0,1}}
- Sequence of training instances drawn at random from P(X); the teacher provides the noise-free label c(x)
- Learner outputs a hypothesis h ∈ H that approximates c

Function Approximation: The Big Picture

The true error of h is the probability that it will misclassify an example drawn at random from the distribution P(X):

error_true(h) ≡ Pr_{x ~ P(X)} [ c(x) ≠ h(x) ]

[Figure: instances are drawn at random from the probability distribution P(X) to form the training examples D.]

Overfitting

Consider a hypothesis h and its
- error rate over the training data: error_train(h)
- true error rate over all data: error_true(h)

We say h overfits the training data if error_true(h) > error_train(h).

Amount of overfitting = error_true(h) − error_train(h)

Can we bound error_true(h) in terms of error_train(h)?

If D were a set of examples drawn from P(X) and independent of h, then we could use standard statistical confidence intervals to determine that, with 95% probability, error_true(h) lies in the interval

error_train(h) ± 1.96 sqrt( error_train(h) (1 − error_train(h)) / |D| )

but D is the training data for h.
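To make that interval concrete, here is a minimal sketch assuming a held-out set of labeled examples that is independent of h (the function name and the numbers are illustrative, not from the slides):

```python
import math

def error_confidence_interval(error_rate, n, z=1.96):
    """Normal-approximation confidence interval for the true error rate,
    given an observed error rate on n examples drawn independently of h."""
    half_width = z * math.sqrt(error_rate * (1.0 - error_rate) / n)
    return max(0.0, error_rate - half_width), min(1.0, error_rate + half_width)

# Example: h misclassifies 12 of 100 independent examples.
low, high = error_confidence_interval(0.12, 100)
print(f"95% interval for the true error: [{low:.3f}, {high:.3f}]")
```

The slide's point is that this interval is only valid when the examples are independent of h; on the training data itself error_train(h) is optimistically biased, which is why the PAC bounds below are needed.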

[Figure: Function Approximation: The Big Picture, with target function c: X → {0,1}.]


Any(!) learner that outputs a hypothesis consistent with all training examples (i.e., an h contained in the version space VS_{H,D})

What it means [Haussler, 1988]: the probability that the version space is not ε-exhausted after m training examples is at most |H| e^(−εm).

Suppose we want this probability to be at most δ.
1. How many training examples suffice? m ≥ (1/ε)(ln|H| + ln(1/δ)).
2. If error_train(h) = 0, then with probability at least (1 − δ): error_true(h) ≤ (1/m)(ln|H| + ln(1/δ)).

Example: H is Conjunctions of up to N Boolean Literals

Consider the classification problem f: X → Y:
- instances: X = <X_1, X_2, X_3, X_4>, where each X_i is boolean
- each hypothesis in H is a rule of the form: IF <X_1, X_2, X_3, X_4> = <0, ?, 1, ?> THEN Y = 1 ELSE Y = 0, i.e., rules constrain any subset of the X_i

How many training examples m suffice to assure that, with probability at least 0.99, any consistent learner using H will output a hypothesis with true error at most 0.05?
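As a worked check of the bound above, here is a short sketch that plugs in numbers (the helper name is mine; taking |H| = 3^4 assumes each of the four positions in the rule can be 0, 1, or '?', which matches the rule format on the slide but is my reading of how H is counted):

```python
import math

def sample_complexity(h_size, epsilon, delta):
    """Finite-|H| PAC bound for consistent learners:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Conjunction-style rules over 4 boolean features: each feature is 0, 1, or '?'.
H_size = 3 ** 4  # assumption: 81 hypotheses
print(sample_complexity(H_size, epsilon=0.05, delta=0.01))  # roughly 180 examples
```

The same helper applies to the depth-2 decision-tree example on the next slide once |H| for that hypothesis class has been counted.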

Example: H is Decision Trees of depth 2

Consider the classification problem f: X → Y:
- instances: X = <X_1, ..., X_N>, where each X_i is boolean
- learned hypotheses are decision trees of depth 2, using only two variables

How many training examples m suffice to assure that, with probability at least 0.99, any consistent learner will output a hypothesis with true error at most 0.05?

Sufficient condition: holds if learner L requires only a polynomial number of training examples, and the processing per example is polynomial.

In the agnostic case the corresponding sample-complexity bound is m ≥ (1/(2ε²))(ln|H| + ln(1/δ)); here ε is the difference between the training error and the true error of the output hypothesis (the one with the lowest training error).

Additive Hoeffding Bounds (Agnostic Learning)

Given m independent flips of a coin with true Pr(heads) = θ, we can bound the error in the maximum likelihood estimate θ̂:

Pr[ θ > θ̂ + ε ] ≤ e^(−2mε²)

Relevance to agnostic learning: for any single hypothesis h,

Pr[ error_true(h) > error_train(h) + ε ] ≤ e^(−2mε²)

But we must consider all hypotheses in H:

Pr[ (∃h ∈ H) error_true(h) > error_train(h) + ε ] ≤ |H| e^(−2mε²)

So, with probability at least (1 − δ), every h satisfies

error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2m) )

General Hoeffding Bounds

When estimating a parameter θ inside [a, b] from m examples:

Pr[ |θ̂ − θ| > ε ] ≤ 2 e^(−2mε² / (b−a)²)

When estimating a probability, θ is inside [0, 1], so

Pr[ |θ̂ − θ| > ε ] ≤ 2 e^(−2mε²)

And if we're interested in only one-sided error, then

Pr[ θ > θ̂ + ε ] ≤ e^(−2mε²)
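The last bound turns directly into an error-bar calculation. A minimal sketch (the function name is mine; it simply evaluates sqrt((ln|H| + ln(1/δ)) / (2m)) from the slide):

```python
import math

def agnostic_error_gap(h_size, m, delta):
    """Hoeffding/union-bound gap: with probability >= 1 - delta, every h in H
    satisfies error_true(h) <= error_train(h) + this value."""
    return math.sqrt((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * m))

# E.g., |H| = 3**4 rule hypotheses, 1000 training examples, delta = 0.05:
print(round(agnostic_error_gap(3 ** 4, m=1000, delta=0.05), 3))  # about 0.061
```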

Question: If H = {h | h : X → Y} is infinite, what measure of complexity should we use in place of |H|?

Answer: the largest subset of X for which H can guarantee zero training error (regardless of the target function c). The VC dimension of H is the size of this subset.

Informal intuition: VC(H) is the maximum number of points that hypotheses in H can label in every possible way (shatter).

Definitions: a dichotomy of a set S is a labeling of each member of S as positive or negative. A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

Sample Complexity based on VC dimension

How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1 − δ)? I.e., to guarantee that any hypothesis that perfectly fits the training data is probably (1 − δ) approximately (ε) correct:

m ≥ (1/ε) ( 4 log_2(2/δ) + 8 VC(H) log_2(13/ε) )

Compare to our earlier result based on |H|:

m ≥ (1/ε) ( ln(1/δ) + ln|H| )
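As a quick sketch of how the two bounds compare numerically (helper names and the example numbers are mine; they simply evaluate the two formulas above):

```python
import math

def m_from_vc(vc, epsilon, delta):
    """VC-based sample complexity: (1/eps) * (4*log2(2/delta) + 8*VC*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / epsilon)) / epsilon)

def m_from_h(h_size, epsilon, delta):
    """Finite-|H| sample complexity: (1/eps) * (ln(1/delta) + ln|H|)."""
    return math.ceil((math.log(1 / delta) + math.log(h_size)) / epsilon)

# E.g., linear separators in the plane (VC = 3) vs. a finite class of 10**6
# hypotheses, both at epsilon = 0.1 and delta = 0.05:
print(m_from_vc(3, 0.1, 0.05), m_from_h(10 ** 6, 0.1, 0.05))
```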

VC dimension: examples

Consider X = ℝ; we want to learn c: X → {0,1}. What is the VC dimension of:
- Open intervals (e.g., hypotheses of the form x > a, with or without the freedom to flip which side is positive): VC(H1) = 1, VC(H2) = 2
- Closed intervals (e.g., hypotheses of the form a ≤ x ≤ b, with or without the freedom to flip which region is positive): VC(H3) = 2, VC(H4) = 3

VC dimension: examples

What is the VC dimension of lines in a plane, H_2 = { (w_0 + w_1 x_1 + w_2 x_2 > 0) → y = 1 }?

VC(H_2) = 3. More generally, for H_n = linear separating hyperplanes in n dimensions, VC(H_n) = n + 1.
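To see concretely why VC(H_2) = 3, here is a small sketch that checks shattering by brute force: it tries every labeling of three non-collinear points and asks a linear program whether some line realizes it (the helper and the specific points are my own illustration, not part of the slides; requires numpy and scipy):

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Check whether a labeling of 2-D points is realizable by a linear
    classifier sign(w . x + b), via LP feasibility:
    find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i."""
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

# Three non-collinear points: every one of the 2^3 labelings is separable,
# so this set is shattered by lines.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(all(linearly_separable(pts, labels) for labels in product([-1, 1], repeat=3)))

# Four corners of a square with the XOR labeling are NOT separable.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(linearly_separable(square, [-1, 1, 1, -1]))
```

The same check fails on the XOR-style labeling of four points, which is consistent with the claim that no set of four points in the plane is shattered by lines.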

For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H|? (Hint: yes.)

More VC Dimension Examples to Think About
- Logistic regression over n continuous features; over n boolean features?
- Linear SVM over n continuous features
- Decision trees defined over n boolean features, F: <X_1, ..., X_n> → Y
- Decision trees of depth 2 defined over n features
- How about 1-nearest neighbor?

Tightness of Bounds on Sample Complexity

How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1 − δ) approximately (ε) correct?

m ≥ (1/ε) ( 4 log_2(2/δ) + 8 VC(H) log_2(13/ε) )

How tight is this bound?

Lower bound on sample complexity (Ehrenfeucht et al., 1989): Consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 0.01. Then there exists a distribution and a target concept in C such that if L observes fewer examples than

max( (1/ε) log(1/δ), (VC(C) − 1) / (32ε) )

then with probability at least δ, L outputs a hypothesis h with error_true(h) > ε.
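To make the "how tight" question concrete, here is a small sketch evaluating both formulas side by side (helper names and the specific ε, δ, VC values are mine; the log base in the lower bound's first term is an assumption that only changes a constant):

```python
import math

def m_upper(vc, epsilon, delta):
    """Number of examples sufficient for any consistent learner (upper bound)."""
    return (4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / epsilon)) / epsilon

def m_lower(vc, epsilon, delta):
    """Ehrenfeucht et al. lower bound; natural log assumed for the log(1/delta)
    term, which only affects the constant, not the comparison."""
    return max(math.log(1 / delta) / epsilon, (vc - 1) / (32 * epsilon))

# Valid range for the lower bound: 0 < epsilon < 1/8 and 0 < delta < 0.01.
for vc in (3, 100, 1000):
    print(vc, round(m_lower(vc, 0.1, 0.005)), round(m_upper(vc, 0.1, 0.005)))
```

The gap between the two numbers grows with VC(H), which is the sense in which the upper bound is not tight.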

Agnostic Learning: VC Bounds [Schölkopf and Smola, 2002]

With probability at least (1 − δ), every h ∈ H satisfies

error_true(h) ≤ error_train(h) + sqrt( ( VC(H) (ln(2m / VC(H)) + 1) + ln(4/δ) ) / m )

Structural Risk Minimization [Vapnik]

Which hypothesis space should we choose? Bias / variance tradeoff.

[Figure: nested hypothesis spaces H1 ⊂ H2 ⊂ H3 ⊂ H4.]

SRM: choose H to minimize the bound on the expected true error!

* Unfortunately a somewhat loose bound...
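As a closing sketch, structural risk minimization can be mimicked numerically by evaluating the bound above for each candidate hypothesis space and keeping the minimizer (the training errors and VC dimensions below are made-up numbers for illustration, not from the slides):

```python
import math

def vc_bound(train_error, vc, m, delta=0.05):
    """Agnostic VC bound: train_error + sqrt((VC*(ln(2m/VC)+1) + ln(4/delta)) / m)."""
    return train_error + math.sqrt((vc * (math.log(2 * m / vc) + 1) + math.log(4 / delta)) / m)

# Hypothetical nested spaces H1 c H2 c H3 c H4: richer spaces fit the training
# data better but pay a larger complexity penalty.
m = 2000
candidates = {"H1": (0.20, 3), "H2": (0.12, 10), "H3": (0.08, 50), "H4": (0.05, 500)}

best = min(candidates, key=lambda name: vc_bound(*candidates[name], m))
for name, (err, vc) in candidates.items():
    print(name, round(vc_bound(err, vc, m), 3))
print("SRM picks:", best)
```

Even on made-up numbers the pattern matches the slide's point: the richest space has the lowest training error but not the lowest bound, and since the bound is loose, SRM is best read as a guide to model selection rather than an exact recipe.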