Computational Learning Theory. CS 486/686: Introduction to Artificial Intelligence Fall 2013


Overview: Introduction to Computational Learning Theory; PAC Learning Theory. Thanks to T. Mitchell.

Introduction. Recall how inductive learning works: given a training set of examples of the form (x, f(x)), return a function h (a hypothesis) that approximates f. Example: decision trees, which represent Boolean functions.

Computational Learning Theory. Are there general laws for inductive learning? We want a theory relating: the probability of successful learning; the number of training examples; the complexity of the hypothesis space; the accuracy to which f is approximated; and the manner in which training examples are presented.

Computational Learning Theory asks several kinds of questions. Sample complexity: how many training examples are needed to learn the target function f? Computational complexity: how much computational effort is needed to learn the target function f? Mistake bound: how many training examples will a learner misclassify before learning the target function f?

Sample Complexity. How many examples are sufficient to learn f? It depends on how instances are presented: if the learner proposes instances as queries to a teacher, the learner proposes x and the teacher provides f(x); if the teacher provides the training examples, the teacher supplies a sequence of examples of the form (x, f(x)); if some random process proposes instances, each instance x is generated randomly and the teacher provides f(x).

Function Approximation H={h:X->{0,1}} H =2 X =2 2^20 Boolean X=<x1,...,x20> X =2 20 h1 h3 h2 + + - - - 9

Function Approximation H={h:X->{0,1}} H =2 X =2 2^20 Boolean X=<x1,...,x20> X =2 20 h1 h3 h2 + + - - - How many labelled examples are needed in order to determine which of the hypothesis is the correct one? All 2 20 instances in X must be labelled! There is no free lunch! 10

Sample Complexity. Given: a set of instances X; a set of hypotheses H; a set of possible target concepts F; and training instances generated by a fixed, unknown probability distribution D over X. The learner observes a sequence D of training examples (x, f(x)): instances are drawn from the distribution and the teacher provides f(x) for each instance. The learner must output a hypothesis h estimating f; h is evaluated by its performance on future instances drawn according to the same distribution.

True Error of a Hypothesis. The true error error_D(h) of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D: error_D(h) = Pr_{x ~ D}[f(x) ≠ h(x)]. [Figure from Machine Learning by T. Mitchell.] Note: our notation is a bit different; we use f instead of c.

Training Error vs. True Error. The training error of h with respect to target function f is how often h(x) ≠ f(x) over the training instances D: error_train(h) = #{x in D : f(x) ≠ h(x)} / |D|. Note: a consistent h has error_train(h) = 0. The true error of h with respect to f is how often h(x) ≠ f(x) over future instances drawn at random from the distribution D: error_D(h) = Pr_{x ~ D}[f(x) ≠ h(x)].
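
As a concrete illustration (a minimal sketch; the target f, hypothesis h, and uniform distribution below are made up for the example, not taken from the lecture): the training error is an average over a finite sample, while the true error is a probability under D, estimated here by Monte Carlo.

```python
import random

random.seed(0)

def f(x):          # "true" target concept (assumed for illustration)
    return 1 if x < 0.5 else 0

def h(x):          # learner's hypothesis, deliberately slightly off
    return 1 if x < 0.6 else 0

# Training error: fraction of misclassified examples in a finite training set D
D_train = [random.random() for _ in range(30)]
train_error = sum(f(x) != h(x) for x in D_train) / len(D_train)

# True error: Pr_{x ~ D}[f(x) != h(x)], estimated with a large fresh sample
big_sample = [random.random() for _ in range(100_000)]
true_error_estimate = sum(f(x) != h(x) for x in big_sample) / len(big_sample)

print(f"training error  = {train_error:.3f}")
print(f"true error est. = {true_error_estimate:.3f}   (exact value is 0.1)")
```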

Version Spaces. A hypothesis h is consistent with a set of training examples D of target function f if and only if h(x) = f(x) for every training example (x, f(x)) in D. The version space VS_{H,D} is the set of all hypotheses from H that are consistent with all training examples in D.
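
A minimal sketch of computing a version space, under an assumed toy setup (a finite class of threshold hypotheses and a hidden target threshold; none of this comes from the lecture):

```python
def make_hypothesis(t):
    # h_t(x) = 1 iff x >= t
    return lambda x: 1 if x >= t else 0

# Finite hypothesis space: thresholds 0.0, 0.1, ..., 1.0
thresholds = [round(0.1 * i, 1) for i in range(11)]
H = {t: make_hypothesis(t) for t in thresholds}

# Training data D: (x, f(x)) pairs labelled by a hidden target threshold of 0.5
D = [(0.12, 0), (0.34, 0), (0.58, 1), (0.91, 1)]

# Version space: every h in H consistent with all of D
VS = [t for t, h in H.items() if all(h(x) == y for x, y in D)]
print("version space thresholds:", VS)   # only thresholds between 0.34 and 0.58 survive
```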

ε-exhausting VS_{H,D}. The version space VS_{H,D} is ε-exhausted with respect to f and D if every h in VS_{H,D} has true error less than ε with respect to f and D. [Figure: hypotheses in the version space annotated with r = training error and error = true error; the version space shown is ε-exhausted for ε > 0.2.]

How many examples will ε-exhaust the VS? Theorem [Haussler, 1988]: if H is finite and D is a sequence of m ≥ 1 independent random examples of target function f, then for any 0 ≤ ε ≤ 1, the probability that VS_{H,D} is not ε-exhausted (with respect to f) is less than |H| e^(-εm).

Interesting! This bounds the probability that any consistent learner will output a hypothesis h with error_D(h) ≥ ε.
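
A minimal sketch of checking the bound empirically, under an assumed toy setup (a finite class of 101 threshold hypotheses on [0, 1], a uniform distribution, and a known target so that true errors can be computed exactly; none of these choices come from the lecture):

```python
import math
import random

random.seed(1)

thresholds = [i / 100 for i in range(101)]      # finite hypothesis space H, |H| = 101
target_t = 0.5                                   # the (normally unknown) target threshold
epsilon, m, trials = 0.1, 60, 2000

def label(x, t):                                 # h_t(x) = 1 iff x >= t
    return 1 if x >= t else 0

def true_error(t):                               # under x ~ Uniform[0,1], error is |t - target_t|
    return abs(t - target_t)

failures = 0
for _ in range(trials):
    D = []
    for _ in range(m):                           # draw m independent labelled examples
        x = random.random()
        D.append((x, label(x, target_t)))
    version_space = [t for t in thresholds if all(label(x, t) == y for x, y in D)]
    if any(true_error(t) >= epsilon for t in version_space):   # VS not eps-exhausted
        failures += 1

print(f"observed failure rate:          {failures / trials:.4f}")
print(f"Haussler bound |H| e^(-eps m):  {len(thresholds) * math.exp(-epsilon * m):.4f}")
```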

Proof (sketch). Fix any hypothesis h in H whose true error is at least ε. The probability that h agrees with one random example is at most (1 - ε), so the probability that h is consistent with all m independent examples is at most (1 - ε)^m ≤ e^(-εm). There are at most |H| such hypotheses, so by the union bound the probability that any of them remains in VS_{H,D} is less than |H| e^(-εm).

How to interpret this result: suppose we want this probability to be at most δ. How many training examples are needed? Setting |H| e^(-εm) ≤ δ and solving for m gives m ≥ (1/ε)(ln |H| + ln(1/δ)). So if error_train(h) = 0 (h is consistent with the m training examples), then with probability at least (1 - δ) we have error_D(h) < ε.
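
A minimal sketch of evaluating this bound (the values of ε, δ, and the small |H| are illustrative; the second case uses the space of all Boolean functions over 20 attributes from the earlier slide, where ln|H| = 2^20 · ln 2):

```python
import math

def sample_complexity(ln_H, eps, delta):
    """Smallest m satisfying m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((ln_H + math.log(1 / delta)) / eps)

eps, delta = 0.05, 0.01

# A small hypothesis space, e.g. |H| = 1000
print(sample_complexity(math.log(1000), eps, delta))        # ~231 examples

# All Boolean functions over 20 attributes: |H| = 2^(2^20), so ln|H| = 2^20 * ln 2
print(sample_complexity(2**20 * math.log(2), eps, delta))   # ~14.5 million examples
```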

Decision Tree Example.

PAC Learning. Probably Approximately Correct (PAC) learning: given a class C of possible target concepts f defined over a set of instances X of length n, and a learner L using hypothesis space H, C is PAC-learnable by L using H if, for all f in C, all distributions D over X, and all 0 < ε < 1/2 and 0 < δ < 1/2, learner L will with probability at least (1 - δ) output a hypothesis h such that error_D(h) ≤ ε, in time polynomial in 1/ε, 1/δ, n, and size(f).

Agnostic Learners. So far we have assumed f is in H. What if we do not make this assumption? (Agnostic learning.) Goal: find the hypothesis h that makes the fewest errors on the training data!

Agnostic Learning Derivation. Hoeffding bound (additive Chernoff bound): given a coin with Pr(heads) = θ, after m independent coin flips the observed fraction of heads θ̂_m satisfies Pr[θ > θ̂_m + γ] ≤ e^(-2mγ²). In our case, for any single hypothesis h: Pr[error_D(h) > error_train(h) + ε] ≤ e^(-2mε²).
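
A quick empirical sanity check of the coin-flip form of the bound (a minimal sketch; the values of θ, m, and γ are arbitrary illustrative choices):

```python
import math
import random

random.seed(2)

theta, m, gamma, trials = 0.3, 50, 0.1, 20000

violations = 0
for _ in range(trials):
    theta_hat = sum(random.random() < theta for _ in range(m)) / m
    if theta > theta_hat + gamma:        # observed frequency underestimates theta by more than gamma
        violations += 1

print(f"observed rate:   {violations / trials:.4f}")
print(f"Hoeffding bound: {math.exp(-2 * m * gamma**2):.4f}")
```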

Agnostic Learning Derivation (continued). To assure that the best hypothesis found by L has its true error bounded this way, we need the bound to hold simultaneously for all h in H; by the union bound, Pr[(∃ h in H) error_D(h) > error_train(h) + ε] ≤ |H| e^(-2mε²). Requiring this to be at most δ gives the sample complexity m ≥ (1/(2ε²))(ln |H| + ln(1/δ)).
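
A minimal sketch comparing this agnostic-case sample complexity with the earlier consistent-learner bound, for illustrative values of |H|, ε, and δ:

```python
import math

def m_realizable(ln_H, eps, delta):
    # consistent-learner bound: m >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((ln_H + math.log(1 / delta)) / eps)

def m_agnostic(ln_H, eps, delta):
    # agnostic bound: m >= (1/(2*eps^2)) * (ln|H| + ln(1/delta))
    return math.ceil((ln_H + math.log(1 / delta)) / (2 * eps**2))

ln_H, eps, delta = math.log(10_000), 0.05, 0.01
print("realizable:", m_realizable(ln_H, eps, delta))   # ~277 examples
print("agnostic:  ", m_agnostic(ln_H, eps, delta))     # ~2764 examples (a factor of 1/(2*eps) = 10 more)
```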

Infinite Hypothesis Spaces. Recall the bound m ≥ (1/(2ε²))(ln |H| + ln(1/δ)). What if H = {h | h : X -> Y} is infinite? What is the right measure of complexity? The size of the largest subset of X for which H can guarantee zero training error, regardless of the target function f: the Vapnik-Chervonenkis (VC) dimension.

Shattering. A dichotomy of a set S is a partition of S into two disjoint subsets. Note: given S there are 2^|S| dichotomies. A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with that dichotomy.

VC Dimension. The VC dimension VC(H) of hypothesis space H defined over input space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.

VC Examples. Let X = R and let H be all open intervals: h_{a,b}(x) = 1 if a < x < b, and 0 otherwise. Let S = {3.1, 4.5}, so |S| = 2 and there are 4 dichotomies: both labelled 1, both labelled 0, and the two ways of labelling one 1 and the other 0. H can realize every dichotomy, so H shatters S and VC(H) is at least 2. What about S = {x, y, z} where x < y < z? No open interval can label y as 0 while labelling both x and z as 1, so no 3-point set can be shattered and VC(H) = 2.
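
A minimal brute-force check of this example (illustrative code, not from the lecture; the candidate-interval construction is an assumption that suffices for finite point sets):

```python
from itertools import product

def interval_hypothesis(a, b):
    # h_{a,b}(x) = 1 iff a < x < b
    return lambda x: 1 if a < x < b else 0

def candidate_intervals(points):
    # Endpoints just left/right of each point, plus a pair beyond the maximum so an
    # interval containing no point is always available. For a finite set, these
    # candidates realize every labelling that any open interval can realize.
    pts = sorted(points)
    gaps = [q - p for p, q in zip(pts, pts[1:])]
    delta = (min(gaps) / 4) if gaps else 0.5
    ends = sorted({e for p in pts for e in (p - delta, p + delta)} | {pts[-1] + 1, pts[-1] + 2})
    return [(a, b) for a in ends for b in ends if a < b]

def shattered(points):
    for labels in product([0, 1], repeat=len(points)):          # every dichotomy of the set
        realizable = any(
            all(interval_hypothesis(a, b)(x) == y for x, y in zip(points, labels))
            for a, b in candidate_intervals(points)
        )
        if not realizable:
            print(f"dichotomy {labels} of {points} is NOT realizable")
            return False
    return True

print(shattered([3.1, 4.5]))        # True: a 2-point set is shattered, so VC(H) >= 2
print(shattered([1.0, 2.0, 3.0]))   # False: the labelling (1, 0, 1) cannot be realized
```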

Sample Complexity. How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1 - δ)? Upper bound [Blumer et al., 1989]: m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)) examples suffice. Lower bound [Ehrenfeucht et al., 1989]: for any 0 < ε < 1/8 and 0 < δ < 0.01, there exist a distribution D and a target f in C such that if L observes fewer than max[(1/ε) log2(1/δ), (VC(C) - 1)/(32ε)] examples, then with probability at least δ, L outputs a hypothesis with error_D(h) > ε.

Changing Directions: Observations about No Free Lunch. I mentioned No Free Lunch earlier in today's lecture. What do we really mean by No Free Lunch?

No Free Lunch Principle. No learning is possible without the application of prior domain knowledge. For any learning algorithm, there is some distribution generating the data such that a learner trained on samples from that distribution will produce large error: if the training set S is much smaller than X, then the error can be close to 0.5.