COMP9444: Neural Networks. Vapnik-Chervonenkis Dimension, PAC Learning and Structural Risk Minimization

How good a classifier does a learner produce?
Training error is the percentage of incorrectly classified data among the training data. True error is the probability of misclassifying a randomly drawn data point from the domain (according to the underlying probability distribution). How does the training error relate to the true error?

How good a classifier does a learner produce?
We can estimate the true error by using randomly drawn test data from the domain that was not shown to the classifier during training. Test error is the percentage of incorrectly classified data among the test data. Validation error is the percentage of incorrectly classified data among the validation data.

Calibrating MLPs for the learning task at hand
Split the original training data set into two subsets: a training data set and a validation data set. Use the training data set for resampling to estimate which network configuration (e.g. how many hidden units to use) seems best. Once a network configuration is selected, train the network on the training set and estimate its true error using the validation set, which was not shown to the network at any time before.

Resampling methods
Resampling tries to use the same data multiple times to reduce estimation errors on the true error of a classifier being learned using a particular learning method. Frequently used resampling methods are n-fold cross validation (with n being 5 or 10) and bootstrapping.

Cross Validation
Split the available training data into n subsets $s_1, \dots, s_n$. For each i, set apart subset $s_i$ for testing, train the learner on the remaining n-1 subsets, and test the result on $s_i$. Average the test errors over all n learning runs.
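A minimal sketch of this procedure in Python (the functions train_fn and error_fn are hypothetical stand-ins for any learner and error measure; they are not part of the slides):

```python
import numpy as np

def cross_validation_error(train_fn, error_fn, X, y, n_folds=10, seed=0):
    """Estimate the true error by n-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    errors = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        clf = train_fn(X[train_idx], y[train_idx])              # train on n-1 subsets
        errors.append(error_fn(clf, X[test_idx], y[test_idx]))  # test on s_i
    return np.mean(errors)                                      # average over n runs
```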

Bootstrapping
Assume the available training data represents the actual distribution of data in the underlying domain, i.e. for k data points, assume each data point occurs with probability 1/k. Draw a set of t training data points by randomly selecting a data point from the original collection with probability 1/k for each draw (if there are multiple occurrences of the same data point, the probability of that data point being drawn is proportionally higher). Often t is chosen to be equal to k. The resulting sample will then usually contain only about 63% of the original data points, while the remaining draws (a fraction of roughly 1/e, about 37%) are duplicates.
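The 63%/37% figure follows from the expected fraction of distinct points, $1 - (1 - 1/k)^k \to 1 - 1/e$; a quick numerical check (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 100_000
sample = rng.integers(0, k, size=k)   # t = k draws with replacement
print(len(np.unique(sample)) / k)     # ~0.632 distinct original points
print(1 - 1 / np.e)                   # 0.63212...
```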

Bootstrapping
Train the learner on the randomly drawn training sample. Test the learned classifier on the items from the original data set which were not in the training sample. Repeat the random selection, training and testing a number of times and average the test error over all runs (a typical number of runs is 30 to 50).
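The whole procedure as a sketch, reusing the hypothetical train_fn and error_fn from the cross-validation example above:

```python
import numpy as np

def bootstrap_error(train_fn, error_fn, X, y, n_runs=30, seed=0):
    """Estimate the true error by bootstrapping."""
    rng = np.random.default_rng(seed)
    k = len(X)
    errors = []
    for _ in range(n_runs):
        idx = rng.integers(0, k, size=k)            # resample with replacement
        held_out = np.setdiff1d(np.arange(k), idx)  # original items not drawn (~37%)
        clf = train_fn(X[idx], y[idx])
        errors.append(error_fn(clf, X[held_out], y[held_out]))
    return np.mean(errors)                          # average over all runs
```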

Factors Determining the Difficulty of Learning
The representation of domain objects, i.e. which attributes and how many. The number of training data points. The learner, i.e. the class of functions it can potentially learn.

PAC Learning
We know a number of heuristics to ensure that learning on the training set will carry over to new data, such as using a minimally complex model and training on a representative dataset with sufficiently many samples. How can we address this issue in a principled way? Probably Approximately Correct (PAC) learning and Structural Risk Minimisation.

PAC Learning
We want to ensure that the probability of poor learning is low: $P[|E_{out} - E_{in}| > \epsilon] \le \delta$. If $|E_{out} - E_{in}|$ is small, our learner is approximately correct; the bound controls the probability that this is the case. Hoeffding's inequality bounds the probability that an empirical average deviates from its expected value: $P[|\nu - \mu| > \epsilon] \le 2e^{-2\epsilon^2 N}$. For a given hypothesis h, we can therefore say $P[|E_{out}(h) - E_{in}(h)| > \epsilon] \le 2e^{-2\epsilon^2 N}$.
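A worked instance of the single-hypothesis bound (a sketch; the numbers are illustrative and not from the slides):

```python
import numpy as np

def hoeffding_bound(epsilon, N):
    """P[|E_out(h) - E_in(h)| > epsilon] <= 2 exp(-2 epsilon^2 N)."""
    return 2 * np.exp(-2 * epsilon**2 * N)

def samples_needed(epsilon, delta):
    """Smallest N for which the bound drops below delta."""
    return int(np.ceil(np.log(2 / delta) / (2 * epsilon**2)))

print(hoeffding_bound(0.1, 1000))  # ~4.1e-09: deviation > 0.1 is very unlikely
print(samples_needed(0.1, 0.05))   # 185 examples suffice for one fixed h
```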

PAC Learning
For a chosen hypothesis g, drawn from the set of hypotheses H, the union bound gives $P[|E_{out}(g) - E_{in}(g)| > \epsilon] \le 2Me^{-2\epsilon^2 N}$, where M represents the number of hypotheses; for a perceptron M is infinite, so this is not very useful. Alternatively, we can look at how the number of effectively different hypotheses grows with the number of samples: $m_H(N) = \max_{x_1 \dots x_N} |H(x_1 \dots x_N)|$. The growth function represents the most dichotomies that H implements over all possible samples of size N, so $m_H(N) \le 2^N$.

Vapnik-Chervonenkis dimension
H shatters N points if there exists an arrangement of N points such that, for all assignments of labels to the points, there is a hypothesis h that captures the labels. The Vapnik-Chervonenkis dimension $d_{VC}(H)$ is the largest N that H can shatter (and where $m_H(N) = 2^N$). If $N \le d_{VC}$, H may be able to shatter the data. If $N > d_{VC}$, H cannot shatter the data. Bounds: it can be proved that $m_H(N) \le N^{d_{VC}} + 1$ and $P[|E_{out} - E_{in}| > \epsilon] \le 4\,m_H(2N)\,e^{-\epsilon^2 N/8}$.

Vapnik-Chervonenkis dimension
Definition. Various function sets relevant to learning systems and their VC-dimension. Bounds from probably approximately correct learning (PAC-learning). General bounds on the risk.

The Vapnik-Chervonenkis dimension
In the following, we consider a boolean classification function f on the input space X to be equivalent to the subset $\{x \in X : f(x) = 1\}$ of X. The VC-dimension is a useful combinatorial parameter on sets of subsets of the input space, e.g. on function classes H on the input space X.
Definition. We say a set $S \subseteq X$ is shattered by H if for every subset $s \subseteq S$ there is at least one function $h \in H$ such that $s = h \cap S$.
Definition. The Vapnik-Chervonenkis dimension of H, $d_{VC}(H)$, is the cardinality of the largest set $S \subseteq X$ shattered by H.

The Vapnik-Chervonenkis dimension
[Figure: four labelled points in the plane.] The VC-dimension of the set of linear decision functions in 2-dimensional Euclidean space is 3.
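A brute-force check of this claim, as a sketch (assuming NumPy and SciPy): enumerate every ±1 labeling of a point set and test with a small feasibility LP whether some linear decision function realises it.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasible iff some (w, b) satisfies y_i (w . x_i + b) >= 1 for all i."""
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.success

def shattered(X):
    """True if every labeling of the points in X is linearly realisable."""
    return all(linearly_separable(X, np.array(labels))
               for labels in product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(three))  # True: 3 points in general position are shattered
print(shattered(four))   # False: the XOR labeling of the square fails
```

The check on four points covers only this one arrangement; the full claim that no arrangement of four points in the plane is shattered requires a geometric argument (Radon's theorem).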

The Vapnik-Chervonenkis dimension
The VC-dimension of the set of linear decision functions in n-dimensional Euclidean space is equal to n+1.

VC-dimension and PAC-Learning
Theorem (Upper Bound, Blumer et al., 1989). Let L be a learning algorithm that uses H consistently, i.e. that finds an $h \in H$ that is consistent with all the data. For any $0 < \epsilon, \delta < 1$, given
$N = \frac{1}{\epsilon}\left(4\ln\frac{2}{\delta} + 8\,d_{VC}(H)\ln\frac{13}{\epsilon}\right)$
random examples, L will with probability at least $1 - \delta$ either produce a function $h \in H$ with error at most $\epsilon$ or indicate correctly that the target function is not in H.

VC-dimension and PAC-Learning
Theorem (Lower Bound, Ehrenfeucht et al., 1992). Let L be a learning algorithm that uses H consistently. For any $0 < \epsilon < \frac{1}{8}$ and $0 < \delta < \frac{1}{100}$, given fewer than
$N = \frac{d_{VC}(H) - 1}{32\epsilon}$
random examples, there is some probability distribution for which L will not produce a function $h \in H$ with $error(h) \le \epsilon$ with probability at least $1 - \delta$.

VC-dimension bounds

ε      δ      d_VC   lower bound   upper bound
5%     5%     10     6             9192
10%    5%     10     3             4040
5%     5%     4      2             3860
10%    5%     4      1             1707
10%    10%    4      1             1677
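A sketch reproducing the table from the two theorems (natural logarithms; small rounding differences against the slide's figures are expected):

```python
import numpy as np

def upper_bound(eps, delta, d_vc):
    """Blumer et al. (1989): sufficient sample size for PAC learning."""
    return (4 * np.log(2 / delta) + 8 * d_vc * np.log(13 / eps)) / eps

def lower_bound(eps, d_vc):
    """Ehrenfeucht et al. (1992): below this, some distribution defeats L."""
    return (d_vc - 1) / (32 * eps)

for eps, delta, d_vc in [(0.05, 0.05, 10), (0.10, 0.05, 10),
                         (0.05, 0.05, 4), (0.10, 0.05, 4), (0.10, 0.10, 4)]:
    print(f"{eps:.0%} {delta:.0%} {d_vc:2d} "
          f"{lower_bound(eps, d_vc):8.1f} {upper_bound(eps, delta, d_vc):8.1f}")
```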

The Vapnik-Chervonenkis dimension
Let C be the set of all axis-parallel squares in the plane. VC-dim = ...

The Vapnik-Chervonenkis dimension
Let C be the set of all axis-parallel rectangles in the plane. VC-dim = ...

The Vapnik-Chervonenkis dimension
Let C be the set of all circles in the plane. VC-dim = ...

The Vapnik-Chervonenkis dimension
Let C be the set of all triangles in the plane, allowing rotation. VC-dim = ...

The Vapnik-Chervonenkis dimension
[Figure: plots of f(x,1) and f(x,2.5) for x ∈ [-10, 10].] The VC-dimension of the set of all functions of the form $f(x, \alpha) = \sin(\alpha x)$, $\alpha \in \mathbb{R}$, is infinite!

The Vapnik-Chervonenkis dimension
The following points on the line, $x_1 = 10^{-1}, \dots, x_n = 10^{-n}$, can be shattered by functions from this set. To separate these data into two classes determined by the sequence $\delta_1, \dots, \delta_n \in \{0, 1\}$, it is sufficient to choose the value of the parameter $\alpha = \pi\left(\sum_{i=1}^{n}(1 - \delta_i)\,10^i + 1\right)$.
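A numerical verification of this construction (a sketch): for every labeling of n points $x_i = 10^{-i}$, the prescribed α makes the sign of $\sin(\alpha x)$ reproduce the labels.

```python
import numpy as np
from itertools import product

n = 4
x = 10.0 ** -np.arange(1, n + 1)                 # x_i = 10^-i
for labels in product([0, 1], repeat=n):         # all 2^n labelings
    delta = np.array(labels)
    alpha = np.pi * ((1 - delta) @ 10.0 ** np.arange(1, n + 1) + 1)
    predicted = (np.sin(alpha * x) > 0).astype(int)
    assert np.array_equal(predicted, delta)      # the labeling is realised
print(f"all {2**n} labelings of {n} points shattered by sign(sin(alpha*x))")
```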

VC-Dimensions of Neural Networks
As a good heuristic (Bartlett): VC-dimension ≈ number of parameters. For linear threshold functions on $R^n$ (where the number of parameters is n+1), $d_{VC} = n+1$. For linear threshold networks, and fixed-depth networks with piecewise polynomial squashing functions, $c_1 W \le d_{VC} \le c_2 W \log W$, where W is the number of weights in the network. Some threshold networks have $d_{VC} \ge c\,W \log W$. For sigmoid networks, $d_{VC} \le c\,W^4$.

VC-Dimensions of Neural Networks
Any function class H that can be computed by a program that takes a real input vector x and k real parameters, involves no more than t of the operations +, -, ×, / on real numbers and comparisons >, ≥, =, ≤, < on real numbers, and outputs a value $y \in \{-1, +1\}$, has VC-dimension O(kt). (See work of Peter Bartlett.)

VC-Dimensions of Neural Networks
Any function class H that can be computed by a program that takes a real input vector x and k real parameters, involves no more than t of the operations +, -, ×, /, $e^\alpha$ on real numbers and comparisons >, ≥, =, ≤, < on real numbers, and outputs a value $y \in \{-1, +1\}$, has VC-dimension $O(k^2 t^2)$. This includes sigmoid networks, RBF networks, mixtures of experts, etc. (See work of Peter Bartlett.)

Structural Risk Minimisation (SRM)
The complexity (or capacity) of the function class from which the learner chooses a function that minimises the empirical risk (i.e. the error on the training data) determines the convergence rate of the learner to the optimal function. For a given number of independently and identically distributed training examples, there is a trade-off between the degree to which the empirical risk can be minimised and the degree to which the empirical risk will deviate from the true risk (i.e. the true error: the error on unseen data according to the underlying distribution).

Structural Risk Minimisation (SRM)
The general SRM principle: choose a complexity parameter d, e.g. the number of hidden units in an MLP or the size of a decision tree, and a function $g \in H$, such that the following is minimised: $E_{out}(g) \le E_{in}(g) + c\sqrt{\frac{d_{VC}(H)}{N}}$, where N is the number of training examples. The higher the VC-dimension, the more likely it is that the empirical error will be low, but the larger the penalty term becomes. Structural risk minimisation seeks the right balance.
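A sketch of model selection by this principle (the constant c, the candidate models and their $d_{VC}$ estimates are illustrative assumptions, not values from the slides):

```python
import numpy as np

def srm_select(models, c=1.0):
    """Pick the model minimising the SRM bound E_in + c * sqrt(d_vc / N)."""
    def bound(model):
        name, e_in, d_vc, n = model
        return e_in + c * np.sqrt(d_vc / n)
    return min(models, key=bound)

# hypothetical candidates: (name, training error, estimated d_VC, N);
# d_VC taken as the parameter count of a 2-input MLP, per the heuristic above
candidates = [("MLP, 2 hidden units", 0.18, 9, 1000),
              ("MLP, 8 hidden units", 0.09, 33, 1000),
              ("MLP, 32 hidden units", 0.07, 129, 1000)]
print(srm_select(candidates))  # the middle model balances fit and capacity
```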

Structural Risk Minimisation (SRM)
[Figure: error versus VC-dimension. The in-sample error $E_{in}$ decreases with complexity, the penalty term $\Omega$ grows, and the out-of-sample error $E_{out}$ is minimised at an intermediate value $d^*_{VC}$.]
$E_{out} \le E_{in} + \Omega$ with $\Omega = \sqrt{\frac{8}{N}\ln\frac{4\,m_H(2N)}{\delta}}$; since $m_H(N) \le N^{d_{VC}(H)}$, this gives $E_{out} \le E_{in} + c\sqrt{\frac{d_{VC}(H)}{N}}$.

VC-dimension Heuristic
For neural networks and decision trees: VC-dimension ≈ size. Hence, the order of the misclassification probability is no more than training error $+ \sqrt{\frac{size}{m}}$, where m is the number of training examples. This suggests that the number of training examples should grow roughly linearly with the size of the hypothesis to be produced. If the function to be produced is too complex for the amount of data available, it is likely that the learned function is not a near-optimal one. (See work of Peter Bartlett.)

Summary
The VC-dimension is a useful combinatorial parameter of sets of functions. It can be used to estimate the true risk on the basis of the empirical risk and the number of independently and identically distributed training examples. It can also be used to determine a sufficient number of training examples to learn probably approximately correctly. A further application of the VC-dimension is choosing the most suitable subset of functions for a given number of independently and identically distributed examples, trading empirical risk against confidence in the estimate.

Summary (cont.) Limitations of the VC Learning Theory
The probabilistic bounds on the required number of examples are worst-case analyses. No preference relation on the functions of the learner is modelled. In practice, learning examples are not necessarily drawn from the same probability distribution as the test examples.