VC-dimension for characterizing classifiers

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Andrew W. Moore, Associate Professor
School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm | awm@cs.cmu.edu | 412-268-7599
Nov 20th, 2001. Copyright 2001, Andrew W. Moore.

Slide 2: A learning machine
A learning machine f takes an input x and transforms it, somehow using weights α, into a predicted output y_est = +/- 1. Here α is some vector of adjustable parameters:

    x  --->  f(·, α)  --->  y_est

Slide 3: Examples
f(x,b) = sign(x·x - b), where sign(z) = -1 if z < 0, 0 if z == 0, +1 if z > 0.
[Figure: 2-d input points; one marker denotes +1 examples, the other denotes -1 examples.]

Slide 4: Examples
f(x,w) = sign(x·w), with the same sign function.
[Figure: labeled 2-d points again.]

Slide 5: Examples
f(x,w,b) = sign(x·w + b), with the same sign function.
[Figure: labeled 2-d points again.]

Slide 6: How do we characterize power?
Different machines have different amounts of power. Tradeoff between:
- More power: can model more complex classifiers, but might overfit.
- Less power: not going to overfit, but restricted in what it can model.
How do we characterize the amount of power?
You: Uh, Dan, we just did that. Number of hypotheses, remember?
Me: Yes, but... continuous parameters are weird. How many line-classifiers are there?
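
Slides 3-5 define three concrete machines. The following is a minimal Python sketch of them (NumPy assumed; the function names are mine, chosen only for illustration):

```python
import numpy as np

def sign(z):
    # sign(z) = -1 if z < 0, 0 if z == 0, +1 if z > 0, as on the slides
    return int(np.sign(z))

def circle_machine(x, b):
    """Slide 3: f(x,b) = sign(x.x - b): points inside radius sqrt(b) get -1."""
    return sign(np.dot(x, x) - b)

def origin_line_machine(x, w):
    """Slide 4: f(x,w) = sign(x.w): a linear separator through the origin."""
    return sign(np.dot(x, w))

def line_machine(x, w, b):
    """Slide 5: f(x,w,b) = sign(x.w + b): a general linear separator."""
    return sign(np.dot(x, w) + b)
```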

Slide 7: Some definitions
Given some machine f, and under the assumptions that all training points (x_k, y_k) were drawn i.i.d. from some distribution and that future test points will be drawn from the same distribution, define

    R(α) = TESTERR(α) = E[ (1/2) |y - f(x, α)| ] = Probability of Misclassification

(R(α) is the official terminology; TESTERR(α) is the terminology we'll use. Note y, f ∈ {+1, -1}.)

Slide 8: Some definitions (continued)
Also define the empirical counterpart on the training set:

    R_emp(α) = TRAINERR(α) = (1/R) Σ_{k=1..R} (1/2) |y_k - f(x_k, α)| = Fraction of training set misclassified

where R = # of training set data points.
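
TRAINERR is just the fraction of training points the machine gets wrong. A minimal sketch of that computation, using the slides' identity that (1/2)|y_k - f(x_k, α)| is 1 on a mistake and 0 otherwise (since y and f are both in {+1, -1}):

```python
def trainerr(f, alpha, X, y):
    # TRAINERR(alpha) = (1/R) * sum_k (1/2)|y_k - f(x_k, alpha)|
    #                 = fraction of the R training points misclassified
    R = len(X)
    return sum(0.5 * abs(y_k - f(x_k, alpha)) for x_k, y_k in zip(X, y)) / R
```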

Slide 9: Vapnik-Chervonenkis dimension
Recall TESTERR(α) = E[ (1/2)|y - f(x,α)| ] and TRAINERR(α) = (1/R) Σ_{k=1..R} (1/2)|y_k - f(x_k,α)|.
Given some machine f, let h be its VC dimension. h is a measure of f's power (h does not depend on the choice of training set). Vapnik showed that with probability 1-η,

    TESTERR(α) ≤ TRAINERR(α) + sqrt( [ h (log(2R/h) + 1) - log(η/4) ] / R )

This gives us a way to estimate the error on future data based only on the training error and the VC-dimension of f.

Slide 10: What VC-dimension is used for
Same bound as above. But given machine f, how do we define and compute h?
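
Once h is known, the bound is easy to evaluate. A small sketch (the choice η = 0.05 and the example numbers are mine, purely for illustration):

```python
import math

def vc_bound(train_err, h, R, eta=0.05):
    # Probable upper bound on TESTERR, holding with probability 1 - eta:
    # TRAINERR + sqrt( (h*(log(2R/h) + 1) - log(eta/4)) / R )
    vc_confidence = math.sqrt((h * (math.log(2 * R / h) + 1) - math.log(eta / 4)) / R)
    return train_err + vc_confidence

print(vc_bound(train_err=0.05, h=3, R=1000))  # e.g. a 2-d line machine trained on 1000 points
```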

Slide 11: Shattering
Machine f can shatter a set of points x_1, x_2, ..., x_r if and only if: for every possible labeling of the form (x_1,y_1), (x_2,y_2), ..., (x_r,y_r), there exists some value of α that gets zero training error.
There are 2^r such labelings to consider, each with a different combination of +1's and -1's for the y's.

Slide 12: Shattering
(Same definition, phrased with "training set" in place of "labeling".)
Question: can f(x,w) = sign(x·w) shatter the following points? [Figure: two input points in 2-d.]

Slide 13: Shattering
Answer: no problem. There are four training sets to consider, and each is handled by one of w=(0,1), w=(-2,3), w=(2,-3), w=(0,-1).

Slide 14: Shattering
Question: can f(x,b) = sign(x·x - b) shatter the following points? [Figure: two input points in 2-d.]

Slide 15: Shattering
Answer: no way, my friend.

Slide 16: Definition of VC dimension
Given a machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them.
Example: what's the VC dimension of f(x,b) = sign(x·x - b)?
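
For small point sets, shattering can be checked by brute force: enumerate all 2^r labelings and, for each, search a list of candidate parameter values for one that classifies every point correctly. A rough sketch of that idea, reusing the Slide 12-13 example (the point coordinates are invented for illustration, since the slides only show a figure; the weight vectors are the four from Slide 13):

```python
import itertools
import numpy as np

def can_shatter(machine, points, candidate_params):
    # True if, for EVERY one of the 2^r labelings of `points`, some parameter
    # setting in `candidate_params` classifies all points correctly.
    # (True proves shattering; False only means this candidate list failed.)
    points = [np.asarray(p, dtype=float) for p in points]
    for labels in itertools.product([-1, +1], repeat=len(points)):
        if not any(all(machine(x, p) == y for x, y in zip(points, labels))
                   for p in candidate_params):
            return False
    return True

origin_line = lambda x, w: int(np.sign(np.dot(x, np.asarray(w))))  # f(x,w) = sign(x.w)
two_points = [(1.0, 2.0), (3.0, 1.0)]            # hypothetical coordinates for the slide's two points
ws = [(0, 1), (-2, 3), (2, -3), (0, -1)]         # the four weight vectors from Slide 13
print(can_shatter(origin_line, two_points, ws))  # True: the two points are shattered
```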

Slide 17: VC dim of trivial circle
Example: what's the VC dimension of f(x,b) = sign(x·x - b)?
Answer = 1: we can't even shatter two points! (But it's clear we can shatter 1.)

Slide 18: Reformulated circle
Example: for 2-d inputs, what's the VC dimension of f(x,q,b) = sign(q x·x - b)?

Slide 19: Reformulated circle
Answer = 2. (The figure shows a labeling that is handled by taking q and b both negative.)

Slide 20: Reformulated circle
Answer = 2 (clearly we can't do 3). (Again, the labeling shown uses q, b negative.)

Slide 21: VC dim of separating line
Example: for 2-d inputs, what's the VC-dim of f(x,w,b) = sign(w·x + b)? Well, can f shatter these three points? [Figure: three points in 2-d.]

Slide 22: VC dim of line machine
Yes, of course:
- All - or all + is trivial.
- One + can be picked off by a line.
- One - can be picked off too.

Slide 23: VC dim of line machine
Well, can we find four points that f can shatter?

Slide 24: VC dim of line machine
We can always draw six lines between pairs of four points.

Slide 25: VC dim of line machine
We can always draw six lines between pairs of four points. Two of those lines will cross.

Slide 26: VC dim of line machine
If we put the points linked by the crossing lines in the same class, they can't be linearly separated.
So a line can shatter 3 points but not 4.
So the VC-dim of the Line Machine is 3.
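
The brute-force shattering idea from above makes this concrete: a random search over (w, b) finds a separator for every labeling of 3 points in general position, while the XOR labeling of 4 points defeats every candidate line, consistent with the geometric argument on Slides 24-26 (a failed random search is of course suggestive, not a proof). The point coordinates and the random candidate grid are my own choices:

```python
import itertools
import numpy as np

def shattered_by_line(points, params):
    # True if, for every labeling of `points`, some p = (w1, w2, b) in `params`
    # satisfies sign(w.x + b) == label on every point.
    f = lambda x, p: int(np.sign(p[0] * x[0] + p[1] * x[1] + p[2]))
    return all(any(all(f(x, p) == y for x, y in zip(points, lab)) for p in params)
               for lab in itertools.product([-1, +1], repeat=len(points)))

rng = np.random.default_rng(0)
params = [rng.normal(size=3) for _ in range(20000)]            # random candidate (w1, w2, b) settings

print(shattered_by_line([(0, 0), (1, 0), (0, 1)], params))            # True: 3 points shattered
print(shattered_by_line([(0, 0), (1, 0), (0, 1), (1, 1)], params))    # False: XOR labeling is unreachable
```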

Slide 27: VC dim of linear classifiers in m dimensions
If the input space is m-dimensional and f is sign(w·x + b), what is the VC-dimension?
Proof that h >= m: show that m points can be shattered. Can you guess how?

Slide 28: VC dim of linear classifiers in m dimensions
Define m input points thus:
    x_1 = (1,0,0,...,0)
    x_2 = (0,1,0,...,0)
    :
    x_m = (0,0,0,...,1)
So x_k[j] = 1 if k=j and 0 otherwise. Let y_1, y_2, ..., y_m be any one of the 2^m combinations of class labels. Guess how we can define w_1, w_2, ..., w_m and b to ensure sign(w·x_k + b) = y_k for all k? Note:

    sign(w·x_k + b) = sign( b + Σ_{j=1..m} w_j x_k[j] )

Slide 29: VC dim of linear classifiers in m dimensions
Answer: b = 0 and w_k = y_k for all k. Then sign(w·x_k + b) = sign(y_k) = y_k, so every one of the 2^m labelings is achieved.

Slide 30: VC dim of linear classifiers in m dimensions
Now we know that h >= m. In fact, h = m+1.
- Proof that h >= m+1 is easy.
- Proof that h < m+2 is moderate.
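
A quick numerical check of the construction (m = 4 here is just an illustrative choice):

```python
import itertools
import numpy as np

m = 4
X = np.eye(m)   # x_k = k-th standard basis vector, so x_k[j] = 1 iff j == k

# With b = 0 and w = (y_1, ..., y_m), sign(w . x_k) = sign(y_k) = y_k for every k,
# so every one of the 2^m label combinations is realized and the m points are shattered.
ok = all(
    all(np.sign(np.dot(np.array(y), x_k)) == y_k for x_k, y_k in zip(X, y))
    for y in itertools.product([-1, +1], repeat=m)
)
print(ok)   # True, confirming h >= m for the m-dimensional linear classifier
```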

Slide 31: What does VC-dim measure?
The power of a machine learning algorithm.
- If the VC dimension is high, we can learn very complex functions, but we will overfit if we don't have enough data.
- If the VC dimension is low, we can only learn simple functions, but the danger of overfitting is minimal.
If we know how much data we have, how powerful should our algorithm be?

Slide 32: Structural Risk Minimization
Let φ(f) = the set of functions representable by f. Suppose φ(f_1) ⊆ φ(f_2) ⊆ ... ⊆ φ(f_n). Then h(f_1) ≤ h(f_2) ≤ ... ≤ h(f_n). (Hey, can you formally prove this?)
We're trying to decide which machine to use. We train each machine and make a table with one row per machine f_1, ..., f_6 and columns TRAINERR, VC-Conf, Probable upper bound on TESTERR, and Choice, using

    TESTERR(α) ≤ TRAINERR(α) + sqrt( [ h (log(2R/h) + 1) - log(η/4) ] / R )
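
In code, Structural Risk Minimization amounts to "compute the bound for each machine and pick the smallest." A sketch (the machines, their training errors, and their VC dimensions are hypothetical, and η = 0.05 is my choice):

```python
import math

def srm_choice(machines, R, eta=0.05):
    # machines: list of (name, TRAINERR, h) rows; returns the name with the smallest
    # probable upper bound TRAINERR + sqrt((h*(log(2R/h)+1) - log(eta/4)) / R).
    def bound(train_err, h):
        return train_err + math.sqrt((h * (math.log(2 * R / h) + 1) - math.log(eta / 4)) / R)
    return min(machines, key=lambda m: bound(m[1], m[2]))[0]

# Nested machines with growing power: lower training error but larger h.
table = [("f1", 0.20, 3), ("f2", 0.10, 10), ("f3", 0.05, 50), ("f4", 0.02, 500)]
print(srm_choice(table, R=1000))
```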

Slide 33: Using VC-dimensionality
That's what VC-dimensionality is about. People have worked hard to find the VC-dimension for:
- Decision Trees
- Perceptrons
- Neural Nets
- Decision Lists
- Support Vector Machines
- and many, many more.
All with the goals of:
1. Understanding which learning machines are more or less powerful under which circumstances.
2. Using Structural Risk Minimization to choose the best learning machine.

Slide 34: Alternatives to VC-dim-based model selection
What could we do instead of the scheme below? (The scheme: the SRM table from Slide 32, with columns TRAINERR, VC-Conf, Probable upper bound on TESTERR, and Choice.)

Slide 35: Alternatives to VC-dim-based model selection
1. Cross-validation: build a table with one row per machine f_1, ..., f_6 and columns TRAINERR, 10-FOLD-CV-ERR, and Choice, picking the machine with the lowest cross-validation error.
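
A minimal sketch of the 10-fold cross-validation error that would fill that table's 10-FOLD-CV-ERR column (the train_fn/predict_fn interface is my own; any learner that fits parameters and predicts +/-1 labels would do):

```python
import numpy as np

def kfold_cv_error(train_fn, predict_fn, X, y, k=10, seed=0):
    # Split the data into k folds; for each fold, train on the other k-1 folds,
    # measure the misclassification rate on the held-out fold, and average.
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        params = train_fn(X[train], y[train])
        errs.append(np.mean(predict_fn(params, X[test]) != y[test]))
    return float(np.mean(errs))
```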

Slide 36: Alternatives to VC-dim-based model selection
2. AIC (Akaike Information Criterion):

    AICSCORE = LL(Data | MLE params) - (# parameters)

As the amount of data goes to infinity, AIC promises* to select the model that'll have the best likelihood for future data. (*Subject to about a million caveats.)
The table now has one row per machine f_1, ..., f_6 and columns LOGLIKE(TRAINERR), #parameters, AIC, and Choice.
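
A tiny sketch of the AIC column and choice (the log-likelihoods and parameter counts are hypothetical numbers, just to show the computation):

```python
def aic_score(loglike, n_params):
    # AICSCORE = LL(Data | MLE params) - (# parameters); larger is better.
    return loglike - n_params

models = [("f1", -520.0, 3), ("f2", -480.0, 10), ("f3", -470.0, 60)]   # (name, log-likelihood, #params)
print(max(models, key=lambda m: aic_score(m[1], m[2]))[0])             # AIC choice
```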

Slide 37: Alternatives to VC-dim-based model selection
3. BIC (Bayesian Information Criterion):

    BICSCORE = LL(Data | MLE params) - (#params / 2) log R

As the amount of data goes to infinity, BIC promises* to select the model that the data was generated from. It is more conservative than AIC. (*Another million caveats.)
The table is as for AIC, with a BIC column in place of the AIC column.

Slide 38: Which model selection method is best?
1. (CV) Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. (SRMVC) Structural Risk Minimization with VC-dimension
- AIC, BIC and SRMVC have the advantage that you only need the training error.
- CV error might have more variance.
- SRMVC is wildly conservative.
- Asymptotically, AIC and leave-one-out CV should be the same.
- Asymptotically, BIC and a carefully chosen k-fold should be the same.
- BIC is what you want if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding).
- There are many alternatives to the above, including proper Bayesian approaches. It's an emotional issue.
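
And the matching BIC sketch (same hypothetical table as the AIC example; note how the penalty now grows with the number of training points R, which is what makes BIC more conservative):

```python
import math

def bic_score(loglike, n_params, R):
    # BICSCORE = LL(Data | MLE params) - (#params / 2) * log(R); larger is better.
    return loglike - 0.5 * n_params * math.log(R)

models = [("f1", -520.0, 3), ("f2", -480.0, 10), ("f3", -470.0, 60)]   # (name, log-likelihood, #params)
print(max(models, key=lambda m: bic_score(m[1], m[2], R=1000))[0])     # BIC choice
```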

Slide 39: Extra Comments
Beware: that second VC-confidence term is usually very, very conservative (at least hundreds of times larger than the empirical overfitting effect).
An excellent tutorial on VC-dimension and Support Vector Machines (which we'll be studying soon):
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955-974, 1998. http://citeseer.nj.nec.com/burges98tutorial.html

Slide 40: What you should know
- The definition of a learning machine: f(x, α).
- The definition of shattering.
- Be able to work through simple examples of shattering.
- The definition of VC-dimension.
- Be able to work through simple examples of VC-dimension.
- Structural Risk Minimization for model selection.
- Awareness of other model selection methods.