Hypothesis Testing and Computational Learning Theory. EECS 349 Machine Learning With slides from Bryan Pardo, Tom Mitchell


Overview Hypothesis Testing: How do we know our learners are good? What does performance on test data imply/guarantee about future performance? Computational Learning Theory: Are there general laws that govern learning? Sample Complexity: How many training examples are needed to learn a successful hypothesis? Computational Complexity: How much computational effort is needed to learn a successful hypothesis?

Some terms
X is the set of all possible instances
C is the set of all possible concepts c, where c: X → {0,1}
H is the set of hypotheses considered by a learner, H ⊆ C
L is the learner
D is a probability distribution over X that generates observed instances

Definition The true error of hypothesis h, with respect to the target concept c and observation distribution D, is the probability that h will misclassify an instance drawn according to D: error_D(h) = Pr_{x~D}[c(x) ≠ h(x)]. In a perfect world, we'd like the true error to be 0.

Definition The sample error of hypothesis h, with respect to the target concept c and sample S, is the proportion of S that h misclassifies: error_S(h) = (1/|S|) Σ_{x∈S} δ(c(x), h(x)), where δ(c(x), h(x)) = 0 if c(x) = h(x), 1 otherwise.
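The definition above translates directly into code. This is a minimal sketch: the particular target concept and hypothesis (threshold functions on the real line) are illustrative assumptions, not from the slides.

```python
# Sample error: proportion of a labeled sample S that hypothesis h
# misclassifies with respect to the target concept c.

def sample_error(h, c, S):
    """error_S(h) = (1/|S|) * sum of delta(c(x), h(x)) over x in S."""
    return sum(1 for x in S if h(x) != c(x)) / len(S)

def c(x):   # illustrative target concept: positive iff x >= 0.5
    return 1 if x >= 0.5 else 0

def h(x):   # illustrative hypothesis: thresholds slightly too high
    return 1 if x >= 0.6 else 0

S = [0.1, 0.55, 0.58, 0.7, 0.9]   # 0.55 and 0.58 fall in the disagreement region
print(sample_error(h, c, S))      # 0.4
```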

Problems Estimating Error

Example on Independent Test Set

Estimators

Confidence Intervals Valid when n·error_S(h) and n·(1 − error_S(h)) are each > 5

Confidence Intervals Under the same conditions
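The standard normal-approximation confidence interval for true error given sample error can be sketched as below. The z-values are the usual two-sided critical values; the example numbers (10% observed error on 200 test examples) are illustrative assumptions.

```python
import math

# Approximate confidence interval for true error, valid roughly when
# n*error_S(h) and n*(1 - error_S(h)) are each > 5:
#   error_S(h) +/- z * sqrt(error_S(h) * (1 - error_S(h)) / n)

Z = {0.90: 1.64, 0.95: 1.96, 0.99: 2.58}   # two-sided critical values

def error_confidence_interval(err_s, n, level=0.95):
    """Return (low, high) bounds on true error, clipped to [0, 1]."""
    half = Z[level] * math.sqrt(err_s * (1 - err_s) / n)
    return (max(0.0, err_s - half), min(1.0, err_s + half))

lo, hi = error_confidence_interval(0.10, 200)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```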

Life Skills Want a convincing demonstration that certain enhancements improve performance? Use online Fisher Exact or Chi-Square tests to evaluate hypotheses, e.g.: http://www.socscistatistics.com/tests/chisquare2/default2.aspx

Overview Hypothesis Testing: How do we know our learners are good? What does performance on test data imply/guarantee about future performance? Computational Learning Theory: Are there general laws that govern learning? Sample Complexity: How many training examples are needed to learn a successful hypothesis? Computational Complexity: How much computational effort is needed to learn a successful hypothesis?

Computational Learning Theory Are there general laws that govern learning? No Free Lunch Theorem: the expected accuracy of any learning algorithm across all concepts is 50%. But can we still say something positive? Yes: Probably Approximately Correct (PAC) learning.

The world isn't perfect If we can't provide every instance for training, a consistent hypothesis may still have error on unobserved instances. (Diagram: instance space X with concept C, hypothesis H, and the training set.) How many training examples do we need to bound the likelihood of error to a reasonable level? When is our hypothesis Probably Approximately Correct (PAC)?

Definitions A hypothesis is consistent if it has zero error on the training examples. The version space (VS_{H,T}) is the set of all hypotheses in our hypothesis space H that are consistent with training set T (reminder: the hypothesis space is the set of concepts we're considering, e.g. depth-2 decision trees).

Definition: ε-exhausted IN ENGLISH: The set of hypotheses consistent with the training data T is ε-exhausted if, when you test them on the actual distribution of instances, all consistent hypotheses have error below ε. IN MATH: VS_{H,T} is ε-exhausted for concept c and sample distribution D if: ∀h ∈ VS_{H,T}, error_D(h) < ε

A Theorem If hypothesis space H is finite, and training set T contains m independent, randomly drawn examples of concept c, THEN, for any 0 ≤ ε ≤ 1: P(VS_{H,T} is NOT ε-exhausted) ≤ |H| e^{−εm}

Proof of Theorem If hypothesis h has true error at least ε, the probability of it getting a single random example right is: P(h got 1 example right) ≤ 1 − ε. Ergo the probability of h getting m examples right is: P(h got m examples right) ≤ (1 − ε)^m

Proof of Theorem If there are k hypotheses in H with error at least ε, call the probability that at least one of those k hypotheses gets m instances right P(at least one bad h looks good). By the union bound, this probability is BOUNDED by k(1 − ε)^m: P(at least one bad h looks good) ≤ k(1 − ε)^m

Proof of Theorem (continued) Since k ≤ |H|, it follows that k(1 − ε)^m ≤ |H|(1 − ε)^m. If 0 ≤ ε ≤ 1, then (1 − ε) ≤ e^{−ε}. Therefore: P(at least one bad h looks good) ≤ k(1 − ε)^m ≤ |H|(1 − ε)^m ≤ |H| e^{−εm}. Proof complete! We now have a bound on the likelihood that a hypothesis consistent with the training data will have error at least ε.
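The bound P(at least one bad h looks good) ≤ |H| e^{−εm} can be sanity-checked numerically. In this sketch, each of k "bad" hypotheses misclassifies every example independently with probability exactly ε (the worst case allowed for a bad hypothesis); the specific values of k, ε, and m are illustrative assumptions.

```python
import math
import random

# Simulate the chance that at least one bad hypothesis (true error
# exactly eps) classifies all m random examples correctly, and compare
# it to the theoretical bound k * e^{-eps * m}.

def survival_rate(k, eps, m, trials=20000, seed=0):
    """Fraction of trials in which some bad hypothesis gets all m examples right."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(all(rng.random() > eps for _ in range(m)) for _ in range(k)):
            hits += 1
    return hits / trials

k, eps, m = 20, 0.2, 30
print(survival_rate(k, eps, m), "<=", k * math.exp(-eps * m))
```

The empirical rate stays below the bound, which is loose both because of the union bound and because (1 − ε)^m ≤ e^{−εm}.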

Using the theorem Let's rearrange to see how many training examples we need to set a bound δ on the likelihood our true error is ≥ ε:
|H| e^{−εm} ≤ δ
e^{−εm} ≤ δ / |H|
−εm ≤ ln δ − ln |H|
εm ≥ ln |H| + ln(1/δ)
m ≥ (1/ε)(ln |H| + ln(1/δ))

Probably Approximately Correct (PAC) m ≥ (1/ε)(ln |H| + ln(1/δ)), where ε is the worst error we'll tolerate, |H| is the hypothesis space size, δ is the likelihood a hypothesis consistent with the training data will have error ≥ ε, and m is the number of training examples.

Using the bound m ≥ (1/ε)(ln |H| + ln(1/δ)) Plug in ε, δ, and |H| to get a number of training examples m that will guarantee your learner will generate a hypothesis that is Probably Approximately Correct. NOTE: This assumes that the concept is actually IN H, that H is finite, and that your training set is drawn using distribution D.
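The recipe above, m ≥ (1/ε)(ln |H| + ln(1/δ)), can be sketched as a small helper. The example numbers (|H| = 2^10 hypotheses, ε = δ = 0.05) are illustrative assumptions.

```python
import math

# Finite-hypothesis-space PAC sample complexity:
#   m >= (1/eps) * (ln|H| + ln(1/delta))

def pac_sample_size(eps, delta, h_size):
    """Smallest integer m satisfying the finite-H PAC bound."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# e.g. |H| = 2^10, tolerate 5% error, want 95% confidence:
print(pac_sample_size(0.05, 0.05, 2**10))   # 199
```

Note how m grows only logarithmically in |H| and 1/δ, but linearly in 1/ε.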

Think/Pair/Share Average accuracy of any learner across all concepts is 50%, but also: m ≥ (1/ε)(ln |H| + ln(1/δ)). How can both be true? Think

Think/Pair/Share Average accuracy of any learner across all concepts is 50%, but also: m ≥ (1/ε)(ln |H| + ln(1/δ)). How can both be true? Pair

Think/Pair/Share Average accuracy of any learner across all concepts is 50%, but also: m ≥ (1/ε)(ln |H| + ln(1/δ)). How can both be true? Share

Problems with PAC The PAC learning framework has 2 disadvantages: 1) it can lead to weak bounds; 2) the sample complexity bound cannot be established for infinite hypothesis spaces. We introduce the VC dimension to deal with these problems.

Shattering Def: A set of instances S is shattered by hypothesis set H iff for every possible concept c on S there exists a hypothesis h in H that is consistent with that concept.
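The definition of shattering can be checked by brute force on small examples. In this sketch the hypothesis class is 1-D threshold classifiers h_t(x) = [x ≥ t], an illustrative assumption rather than a class fixed by the slides.

```python
# Brute-force shattering check: H shatters S iff every one of the
# 2^|S| possible labelings of S is realized by some h in H.

def shatters(hypotheses, points):
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

def make_threshold(t):
    """Illustrative hypothesis: label 1 iff x >= t."""
    return lambda x: int(x >= t)

thresholds = [make_threshold(t) for t in [0.0, 0.5, 1.5, 2.5]]
print(shatters(thresholds, [1.0]))        # True: one point can be shattered
print(shatters(thresholds, [1.0, 2.0]))   # False: labeling (1, 0) is unreachable
```

Thresholds shatter any single point but no two points, which is why VC(thresholds) = 1.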

Can a linear separator shatter this? NO! The ability of H to shatter a set of instances is a measure of its capacity to represent target concepts defined over those instances

Can a quadratic separator shatter this?

Vapnik-Chervonenkis Dimension Def: The Vapnik-Chervonenkis dimension, VC(H) of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets can be shattered by H, then VC(H) is infinite.

How many training examples needed? Lower bound on m using VC(H): m ≥ (1/ε)(4 log₂(2/δ) + 8·VC(H)·log₂(13/ε))
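This bound, m ≥ (1/ε)(4 log₂(2/δ) + 8·VC(H)·log₂(13/ε)), can also be evaluated directly. The example uses VC dimension 3 (the VC dimension of linear separators in the plane); the ε and δ values are illustrative assumptions.

```python
import math

# VC-dimension-based sample complexity:
#   m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))

def vc_sample_size(eps, delta, vc_dim):
    """Smallest integer m satisfying the VC-based bound."""
    m = (4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps
    return math.ceil(m)

# Linear separators in 2-D have VC dimension 3:
print(vc_sample_size(0.1, 0.05, 3))   # 1899
```

Unlike the finite-|H| bound, this applies to infinite hypothesis spaces, as long as VC(H) is finite.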

Infinite VC dimension?

Think/Pair/Share What kind of classifier (that we've talked about) has infinite VC dimension? Think

Think/Pair/Share What kind of classifier (that we've talked about) has infinite VC dimension? Pair

Think/Pair/Share What kind of classifier (that we've talked about) has infinite VC dimension? Share