
Lecture Slides for INTRODUCTION TO Machine Learning. By: alpaydin@boun.edu.tr (http://www.cmpe.boun.edu.tr/~ethem/i2ml). Postedited by: R. Basili

Learning a Class from Examples. Class C of a family car. Prediction: Is car x a family car? Knowledge extraction: What do people expect from a family car? Output: Positive (+) and negative (−) examples. Input representation: x₁: price, x₂: engine power. 2

Training set: X = {(xᵗ, rᵗ)}, t = 1..N, where each input is x = (x₁, x₂) and the label is r = 1 if x is positive, r = 0 if x is negative. 3

Class C: (p₁ ≤ price ≤ p₂) AND (e₁ ≤ engine power ≤ e₂). In general we do not know C(x). 4

Hypothesis class H: h(x) = 1 if h classifies x as positive, 0 if h classifies x as negative. Empirical error: E(h|X) = Σ_{t=1..N} 1(h(xᵗ) ≠ rᵗ). 5
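To make the rectangle hypothesis and its empirical error concrete, here is a minimal sketch in Python; the training points, labels, and rectangle bounds are hypothetical, chosen only for illustration.

```python
import numpy as np

def h(x, p1, p2, e1, e2):
    """Axis-aligned rectangle hypothesis: 1 if p1 <= price <= p2
    and e1 <= engine power <= e2, else 0."""
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

def empirical_error(X, r, p1, p2, e1, e2):
    """E(h|X): number of training points where h(x^t) != r^t."""
    return sum(int(h(x, p1, p2, e1, e2) != rt) for x, rt in zip(X, r))

# Hypothetical training set: (price, engine power) pairs with labels r.
X = np.array([[13.0, 90.0], [18.0, 120.0], [30.0, 200.0], [16.0, 110.0]])
r = np.array([0, 1, 0, 1])
print(empirical_error(X, r, p1=15, p2=25, e1=100, e2=150))  # -> 0
```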

S, G, and the Version Space. Most specific hypothesis, S; most general hypothesis, G. Any h ∈ H between S and G is consistent, and together they make up the version space (Mitchell, 1997). 6

Probably Approximately Correct (PAC) Learning. How many training examples are needed so that the tightest rectangle S, which will constitute our hypothesis, will probably be approximately correct? We want to be confident (above a confidence level 1 − δ) that the error probability is bounded by some value ε. 7


Probably Approximately Correct (PAC) Learning. In PAC learning, given a class C and examples drawn from some unknown but fixed distribution p(x), we want to find the number of examples N such that, with probability at least 1 − δ, the hypothesis h has error at most ε (Blumer et al., 1989): P(C Δ h) ≤ ε with probability at least 1 − δ, where C Δ h is the region of difference between C and h. 9

Probably Approximately Correct (PAC) Learning. How many training examples m should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)
Let the probability of a positive example falling in each of the four strips (between S and C) be at most ε/4.
Pr that a random example misses one strip: 1 − ε/4.
Pr that m random instances all miss one strip: (1 − ε/4)^m.
Pr that m random instances miss any of the 4 strips: at most 4(1 − ε/4)^m.
We want 1 − 4(1 − ε/4)^m ≥ 1 − δ, i.e. 4(1 − ε/4)^m ≤ δ.
Using (1 − x) ≤ exp(−x), it follows that 4 exp(−εm/4) ≤ δ; dividing by 4 and taking logs gives m ≥ (4/ε) ln(4/δ). 10

Probably Approximately Correct (PAC) Learning. How many training examples m should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989) m ≥ (4/ε) ln(4/δ). m increases only slowly with 1/ε and 1/δ. Say we want at most 1% error with confidence 95% (ε = 0.01, δ = 0.05): pick m ≥ 1752. For 10% error with confidence 95%: pick m ≥ 175. 11
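As a quick check of these numbers, a minimal sketch assuming the bound m ≥ (4/ε) ln(4/δ) from the previous slide:

```python
import math

def pac_sample_size(eps, delta):
    """Smallest integer m satisfying m >= (4/eps) * ln(4/delta)."""
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

print(pac_sample_size(0.01, 0.05))  # 1753 (the slide rounds down to 1752)
print(pac_sample_size(0.10, 0.05))  # 176  (the slide rounds down to 175)
```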

VC (Vapnik-Chervonenkis) Dimension. N points can be labeled in 2^N ways as +/−. H shatters N points if there exists a set of N points such that, for each of these 2^N labelings, some h ∈ H is consistent with it. The VC dimension, VC(H), is the largest such N. It measures the capacity of H: any learning problem definable by at most VC(H) examples can be learned with no error by a hypothesis drawn from H. What is the VC dimension of axis-aligned rectangles? 12

Formal Definition 13

VC (Vapnik-Chervonenkis) Dimension. H shatters N points if there exist N points such that, for every labeling of those N points, some h ∈ H is consistent with it. VC(axis-aligned rectangles) = 4. 14

VC (Vapnik-Chervonenkis) Dimension. What does this say about using rectangles as our hypothesis class? The VC dimension is pessimistic: in general we do not need to worry about all possible labelings. It is important to remember that one can choose the arrangement of points in the space, but then the hypothesis must be consistent with all possible labelings of those fixed points. 15

Examples: f(x, w) = sign(x · w). [Figure: input x and parameters w enter a machine f, which outputs y^est; plotted points denote +1 and −1.] 16

Examples: f(x, w, b) = sign(x · w + b). [Same diagram, now with a bias term b.] 17

Shattering. Question: Can the following f shatter the following points? f(x, w) = sign(x · w). Answer: Yes. There are four possible labelings of the two points to consider, and each is realized by one of: w = (0, 1), w = (−2, 3), w = (2, −3), w = (0, −1). 18
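The slide's figure (and hence its two points) is not in the transcript, so the points below are hypothetical, chosen so that the slide's four weight vectors happen to realize all four labelings; the brute-force check itself is a minimal sketch of what shattering means.

```python
import itertools
import numpy as np

def shatters(points, candidate_ws):
    """True if every +/-1 labeling of `points` is realized by some w
    in `candidate_ws` under f(x, w) = sign(w . x)."""
    for labeling in itertools.product([-1, 1], repeat=len(points)):
        if not any(all(np.sign(w @ x) == y for x, y in zip(points, labeling))
                   for w in candidate_ws):
            return False
    return True

# Hypothetical points; the slide's four weight vectors.
points = [np.array([1.0, 1.0]), np.array([2.0, 1.0])]
ws = [np.array(w, dtype=float) for w in [(0, 1), (-2, 3), (2, -3), (0, -1)]]
print(shatters(points, ws))  # True: all 2^2 = 4 labelings are realized
```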

VC dim of linear classifiers in m dimensions. If the input space is m-dimensional and f is sign(w · x − b), what is the VC dimension? h = m + 1: lines in 2D can shatter 3 points, planes in 3D can shatter 4 points, ..., hyperplanes in Rⁿ can shatter n + 1 points. 19

Examples: f(x, b) = sign(x · x − b). [Figure: the machine f now thresholds the squared norm x · x at b, i.e. a circle centered at the origin; plotted points denote +1 and −1.] 20

Shattering Question: Can the following f shatter the following points? f(x,b) = sign(x.x-b) 21

Shattering. Question: Can the following f shatter the following points? f(x, b) = sign(x · x − b). Answer: Yes. Hence, the VC dimension of circles centered at the origin is at least 2. (Note: as written, sign(x · x − b) always labels the outside of the circle positive, so the labeling with the inner point positive and the outer point negative also requires the complementary orientation sign(b − x · x); the "Yes" holds for circles with either side positive.) 22

Note that if we pick two points at the same distance from the origin, they cannot be shattered. But what matters is whether there exists some set of n points for which all possible labelings can be realized. How about 3 points for circles centered at the origin: can you find 3 points such that all possible labelings are realized? 23
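A minimal brute-force check of this question. The points are hypothetical; note that, to realize all four labelings of two points (as the "Yes" on the previous slide requires), the sketch allows both orientations s · sign(x · x − b) with s = ±1, i.e. either side of the circle may be the positive class.

```python
import numpy as np

def realizable_labelings(points, thresholds, both_orientations=True):
    """All +/-1 labelings of `points` realized by s * sign(x.x - b)."""
    labelings = set()
    ss = (1, -1) if both_orientations else (1,)
    for b in thresholds:
        for s in ss:
            labelings.add(tuple(s * (1 if x @ x > b else -1) for x in points))
    return labelings

def shattered(points, thresholds, **kw):
    return len(realizable_labelings(points, thresholds, **kw)) == 2 ** len(points)

bs = np.linspace(-1.0, 12.0, 1000)
two = [np.array([1.0, 0.0]), np.array([2.0, 0.0])]       # radii 1 and 2
print(shattered(two, bs))                                # True
print(shattered(two, bs, both_orientations=False))       # False: (+, -) unreachable

three = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 0.0])]
print(shattered(three, bs))  # False: labels change monotonically with radius
```

Sorting any 3 points by distance from the origin, a labeling that flips sign twice as the radius grows (e.g. +, −, +) can never be realized, so no 3 points are shattered: under the both-orientations assumption, the VC dimension of circles centered at the origin is exactly 2.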

Model Selection & Generalization. Learning is an ill-posed problem: the data alone are not sufficient to find a unique solution. Hence the need for an inductive bias: assumptions about H. Generalization: how well a model performs on new data. Different machines have different amounts of power, giving a trade-off. More power: can model more complex classifiers, but might overfit. Less power: will not overfit, but is restricted in what it can model. Overfitting: H more complex than C or f. Underfitting: H less complex than C or f. 24

Triple Trade-Off. There is a trade-off between three factors (Dietterich, 2003): 1. complexity of H, c(H); 2. training set size, N; 3. generalization error, E, on new data. As N increases, E decreases. As c(H) increases, E first decreases and then increases. 25

Complexity. Complexity is a measure of a set of classifiers, not of any specific (fixed) classifier. There are many possible measures: degrees of freedom, description length, Vapnik-Chervonenkis (VC) dimension, etc. 27

Expected and Empirical error 28

A learning machine. A learning machine f takes an input x and transforms it, somehow using weights a, into a predicted output y^est = ±1, where a is some vector of adjustable parameters. [Diagram: a and x feed into f, which outputs y^est.] 34

Some definitions. Given some machine f, define:
R(a) = TESTERR(a) = E[ (1/2) |y − f(x, a)| ] — the probability of misclassification;
R_emp(a) = TRAINERR(a) = (1/R) Σ_{k=1..R} (1/2) |y_k − f(x_k, a)| — the fraction of the training set misclassified;
R = number of training set data points. 35

Vapnik-Chervonenkis dimension. TESTERR(a) = E[ (1/2) |y − f(x, a)| ]; TRAINERR(a) = (1/R) Σ_{k=1..R} (1/2) |y_k − f(x_k, a)|. Given some machine f, let h be its VC dimension (h does not depend on the choice of training set), and let R be the number of training examples. Vapnik showed that, with probability 1 − η,
TESTERR(a) ≤ TRAINERR(a) + √[ (h (log(2R/h) + 1) − log(η/4)) / R ].
This gives us a way to estimate the error on future data based only on the training error and the VC dimension of f. 36
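A minimal sketch of the bound, with hypothetical values for TRAINERR, h, and R; vc_confidence is just the square-root term above (using natural logarithms).

```python
import math

def vc_confidence(h, R, eta=0.05):
    """The square-root term: sqrt([h(log(2R/h) + 1) - log(eta/4)] / R)."""
    return math.sqrt((h * (math.log(2.0 * R / h) + 1.0)
                      - math.log(eta / 4.0)) / R)

def testerr_bound(trainerr, h, R, eta=0.05):
    """Probable (probability 1 - eta) upper bound: TRAINERR + VC confidence."""
    return trainerr + vc_confidence(h, R, eta)

print(testerr_bound(trainerr=0.10, h=10, R=1000))  # hypothetical numbers
```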

VC-dimension as measure of complexity. Using the bound above, for a sequence of machines f_1, ..., f_6 (typically of increasing VC dimension) we can tabulate, for each choice i: the machine f_i, its TRAINERR, its VC confidence term, and the resulting probable upper bound on TESTERR, and then choose the machine whose bound is smallest. 37
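A sketch of that table with invented (TRAINERR, h) pairs for six machines of increasing capacity; the numbers are purely illustrative, not from the source.

```python
import math

def vc_confidence(h, R, eta=0.05):
    return math.sqrt((h * (math.log(2.0 * R / h) + 1.0)
                      - math.log(eta / 4.0)) / R)

R = 1000  # hypothetical training set size
machines = [("f1", 0.30, 2), ("f2", 0.18, 5), ("f3", 0.10, 12),
            ("f4", 0.05, 40), ("f5", 0.02, 120), ("f6", 0.00, 400)]
rows = [(name, err, err + vc_confidence(h, R)) for name, err, h in machines]
for name, err, bound in rows:
    print(f"{name}: TRAINERR={err:.2f}  bound on TESTERR={bound:.2f}")
print("SRM choice:", min(rows, key=lambda row: row[2])[0])
```

This is Structural Risk Minimization in miniature: training error alone would pick the most complex machine, while the bound trades training error off against capacity.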

Using VC-dimensionality. People have worked hard to find the VC dimension for decision trees, perceptrons, neural nets, decision lists, support vector machines, and many, many more, all with the goals of: 1. understanding which learning machines are more or less powerful under which circumstances; 2. using Structural Risk Minimization to choose the best learning machine. 38

Alternatives to VC-dim-based model selection: Cross-Validation. To estimate generalization error, we need data unseen during training. We split the data into a training set (50%), a validation set (25%), and a test (publication) set (25%), resampling when data are scarce. 39

Alternatives to VC-dim-based model selection. What could we do instead of the SRM scheme above? 1. Cross-validation: for each choice i, tabulate the machine f_i, its TRAINERR, and its 10-FOLD-CV-ERR, and choose the machine among f_1, ..., f_6 with the lowest cross-validation error. 40
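A minimal sketch of k-fold cross-validation; the toy 1-D threshold "machine" and the data are hypothetical stand-ins for the f_i.

```python
import numpy as np

def k_fold_cv_error(fit, X, y, k=10, seed=0):
    """Average held-out misclassification rate over k folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        predict = fit(X[train], y[train])
        errs.append(np.mean(predict(X[test]) != y[test]))
    return float(np.mean(errs))

def fit_threshold(Xtr, ytr):
    """Toy machine: pick the 1-D threshold with lowest training error."""
    t = min(np.unique(Xtr), key=lambda t: np.mean((Xtr > t).astype(int) != ytr))
    return lambda X: (X > t).astype(int)

X = np.linspace(0.0, 1.0, 40)
y = (X > 0.6).astype(int)
print(k_fold_cv_error(fit_threshold, X, y))  # small: the data are separable
```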

Extra Comments. An excellent tutorial on VC dimension and Support Vector Machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html 41

What you should know. The definition of a learning machine: f(x, a). The definition of shattering; be able to work through simple examples of shattering. The definition of VC dimension; be able to work through simple examples of VC dimension. Structural Risk Minimization for model selection. Awareness of other model selection methods. 42