Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts

arXiv:1203.0193v2 [math.ST] 23 Jul 2012

Servane Gey

Laboratoire MAP5 - UMR 8145, Université Paris Descartes, 75270 Paris Cedex 06, France - Servane.Gey@parisdescartes.fr

July 24, 2012

Abstract

The Vapnik-Chervonenkis (VC) dimension of the set of half-spaces of $\mathbb{R}^d$ with frontiers parallel to the axes is computed exactly. It is shown to be much smaller than the intuitive value of $d$: an approximation based on Stirling's formula shows that it is of order $\log_2 d$. This result may be used, for instance, to evaluate the performance of classifiers or regressors based on dyadic partitioning of $\mathbb{R}^d$. Algorithms using axis-parallel cuts to partition $\mathbb{R}^d$ are often used to reduce the computational time of such estimators when $d$ is large.

Keywords: Vapnik-Chervonenkis dimension, axis-parallel cuts.

MSC 2010 Classification: 62G99, 62H99

1 Introduction

The VC dimension of a set of subsets was introduced by Vapnik and Chervonenkis [9, 10] to measure its complexity. The VC dimension of a real-valued function space $\mathcal{F}$ is then the VC dimension of $\{\{x ; f(x) \ge 0\} ; f \in \mathcal{F}\}$. In particular, the VC dimension of sets of classifiers or regressors commonly appears in statistical learning when evaluating their performance. For example, Vapnik's theory in the classification framework is now widely known (see [3] for instance): let $(X, Y)$ be a pair of variables taking values in $\mathbb{R}^d \times \{0 ; 1\}$, and let $\mathcal{L}$ be a sample of $n$ independent replications of $(X, Y)$. If $\hat{f}$ is a classifier minimizing the average misclassification rate of $\mathcal{L}$ on a set of classifiers having finite VC dimension $V$, then, without further assumption on the distribution $P$ of $(X, Y)$, the performance of $\hat{f}$ is evaluated as follows:

$$\mathbb{E}_{\mathcal{L}}\left[ P\left( \hat{f}(X) \ne Y \right) \right] \le C_1\, \mathrm{bias}^2(\hat{f}) + C_2 \sqrt{\frac{V}{n}}, \qquad (1)$$

where $\mathbb{E}_{\mathcal{L}}$ denotes the expectation with respect to the sample distribution, $\mathrm{bias}(\hat{f})$ denotes the bias of the classifier $\hat{f}$, and $C_1$ and $C_2$ are absolute constants.

Functional estimates defined on partitions of $\mathbb{R}^d$ (such as histograms, piecewise polynomials, or splines) are often used to estimate relationships between two variables $X \in \mathbb{R}^d$ and $Y \in \{0 ; 1\}$ or $Y \in \mathbb{R}$. In many cases, the VC dimension of the set of subsets used to construct the partition appears inside risk bounds when evaluating the performance of such estimators. For example, if the set used is the set of all half-spaces of $\mathbb{R}^d$, its VC dimension $d + 1$ often has to be taken into account. When $d$ is large, it is often computationally easier to construct partitions using axis-parallel cuts. For example, some theoretical developments on dyadic partitions of $\mathbb{R}^2$ are given in [4, 1], and the VC dimension of axis-parallel cuts appears more particularly in the results obtained on the performance of classification and regression binary decision trees (CART), introduced by Breiman et al. [2] in 1984 and theoretically studied in [8, 7, 5, 6].

2 Reminder about VC Dimension

The VC dimension of a set $\mathcal{A}$ of subsets of some measurable space $\mathcal{X}$ is based on counting the number of distinct intersections of the elements of $\mathcal{A}$ with a finite set of fixed points in $\mathcal{X}$.

Definition 1 (Vapnik-Chervonenkis Dimension). Let $\mathcal{A}$ be a set of subsets of some measurable space $\mathcal{X}$. Then $(x_1, \dots, x_n) \in \mathcal{X}^n$ is said to be shattered by $\mathcal{A}$ if all subsets of $\{x_1 ; \dots ; x_n\}$ are covered by $\mathcal{A}$, that is if

$$\left| \left\{ \{x_1, \dots, x_n\} \cap A ; A \in \mathcal{A} \right\} \right| = 2^n.$$

The Vapnik-Chervonenkis dimension $VC(\mathcal{A})$ of $\mathcal{A}$ is then defined as the maximal integer $n$ such that there exist $n$ points in $\mathcal{X}$ shattered by $\mathcal{A}$, i.e.

$$VC(\mathcal{A}) = \max\left\{ n ; \max_{(x_1, \dots, x_n) \in \mathcal{X}^n} \left| \left\{ \{x_1, \dots, x_n\} \cap A ; A \in \mathcal{A} \right\} \right| = 2^n \right\}.$$

If no such $n$ exists, then $VC(\mathcal{A}) = +\infty$.
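The counting in the definition of shattering can be made concrete with a small brute-force check. The sketch below is only an illustration under stated assumptions: it takes for $\mathcal{A}$ the axis-parallel cuts $\{x \in \mathbb{R}^d ; x^i \le a\}$, and the function names `traces_of_axis_parallel_cuts` and `is_shattered` are hypothetical helpers, not part of the paper. It enumerates the distinct intersections of a finite point set with the cuts and compares their number with $2^n$.

```python
def traces_of_axis_parallel_cuts(points):
    """Return every subset of `points` (as index sets) of the form
    {x_1, ..., x_n} ∩ A, with A = {x : x[i] <= a} an axis-parallel cut.

    Assumption: `points` is a non-empty list of equal-length tuples.
    """
    n = len(points)
    d = len(points[0])
    # A threshold below every coordinate value selects no point.
    traces = {frozenset()}
    for i in range(d):
        # The trace only changes when `a` crosses a coordinate value,
        # so it is enough to try the values taken by the points.
        for a in {p[i] for p in points}:
            traces.add(frozenset(j for j in range(n) if points[j][i] <= a))
    return traces


def is_shattered(points):
    """True if the cuts pick out all 2^n subsets of the n points."""
    return len(traces_of_axis_parallel_cuts(points)) == 2 ** len(points)


if __name__ == "__main__":
    print(is_shattered([(0, 1), (1, 0)]))                   # two points in R^2
    print(is_shattered([(0, 1), (1, 0), (2, 2)]))           # three points in R^2
    print(is_shattered([(0, 2, 1), (1, 0, 2), (2, 1, 0)]))  # three points in R^3
```

Each coordinate sorts the points in one order, so a single coordinate can only produce nested subsets; this is why the number of coordinates limits how many subsets of a given size can be covered.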

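Thus, the larger $VC(\mathcal{A})$, the more complex $\mathcal{A}$. For example, if $\mathcal{A} = \{\, ]-\infty ; x] ; x \in \mathbb{R} \}$, then $VC(\mathcal{A}) = 1$; if $\mathcal{A}$ is the set of all half-spaces in $\mathbb{R}^d$, then $VC(\mathcal{A}) = d + 1$. Since the set of axis-parallel cuts is a subset of the set of all half-spaces in $\mathbb{R}^d$, it would be natural to think that its VC dimension is of order $d$. Actually, it is shown in what follows that it is of order $\log_2 d$.

3 VC Dimension of axis-parallel cuts

We give a formula to compute the VC dimension of axis-parallel cuts in $\mathbb{R}^d$. Since the formula obtained is not always easy to handle, an approximation is also given.

Lemma 1. Let

$$\mathcal{A}_d = \left\{ \{x \in \mathbb{R}^d ; x^i \le a\} ; i = 1, \dots, d,\ a \in \mathbb{R} \right\}.$$

Then

$$VC(\mathcal{A}_d) = \max\left\{ n ; \binom{n}{\lfloor n/2 \rfloor} \le d \right\},$$

where $\lfloor x \rfloor$ denotes the integer part of $x$. Furthermore, the following approximation of $VC(\mathcal{A}_d)$ is available for all $d \ge 2$:

$$\frac{\log d}{\log 2} - 0.38 \le VC(\mathcal{A}_d) \le \frac{\log\left( d \sqrt{d + 3} \right)}{\log 2} + 0.5.$$

Figure 1 shows that $VC(\mathcal{A}_d)$ is a piecewise constant function of the space dimension $d$, which increases at a rate much smaller than the intuitive value of $d$. It also shows that the bounds computed from Stirling's formula are sharp.

Figure 1: $VC(\mathcal{A}_d)$ with respect to the space dimension $d$, and Stirling's bounds.

Proof. The idea is that, for $n$ points $(x_1, \dots, x_n)$ to be shattered by $\mathcal{A}_d$, all the subsets of $\{x_1, \dots, x_n\}$ must be covered by $\mathcal{A}_d$. But if there exists $p \le n$ such that there are at least $d + 1$ subsets of $\{x_1, \dots, x_n\}$ having $p$ elements, then $\mathcal{A}_d$ will miss at least $\binom{n}{p} - d$ subsets.

Let $n \ge 1$ and let $(x_1, \dots, x_n)$ be $n$ points in $\mathbb{R}^d$. Suppose that $n$ is such that $\binom{n}{\lfloor n/2 \rfloor} > d$. This means that there are at least $d + 1$ subsets of $\{x_1, \dots, x_n\}$ of size $\lfloor n/2 \rfloor$. For each coordinate $i = 1, \dots, d$, let us denote by $x^i_{i(\cdot)}$ the order statistic computed from the $i$-th coordinate of $(x_1, \dots, x_n)$, that is, for all $i = 1, \dots, d$,

$$x^i_{i(1)} \le x^i_{i(2)} \le \dots \le x^i_{i(n)}.$$

Let $p = \lfloor n/2 \rfloor$ and let

$$\mathcal{B}_p = \left\{ \{x_{i(1)} ; \dots ; x_{i(p)}\} ; i = 1, \dots, d \text{ and } \left| \{x_{i(1)} ; \dots ; x_{i(p)}\} \right| = p \right\},$$

$$\mathcal{B}_p^c = \left\{ B \subset \{x_1, \dots, x_n\} ; |B| = p \text{ and } B \notin \mathcal{B}_p \right\}.$$

Hence $\mathcal{B}_p$ is covered by $\mathcal{A}_d$ (by simply taking $A = \{x^i \le (x^i_{i(p)} + x^i_{i(p+1)})/2\}$ for each coordinate), and we have that

$$|\mathcal{B}_p| \le d \quad \text{and} \quad |\mathcal{B}_p^c| \ge \binom{n}{p} - d > 0.$$

Let $B \in \mathcal{B}_p^c$ and $A = \{x^i \le a\} \in \mathcal{A}_d$. If $|\{x_1, \dots, x_n\} \cap A| \ne p$, then $\{x_1, \dots, x_n\} \cap A \ne B$. Otherwise, since $\{x_1, \dots, x_n\} \cap A = \{x_j ; x^i_j \le a\}$, we have that $x^i_{i(j)} \le a$ for all $j = 1, \dots, p$, and $x^i_{i(j)} > a$ for all $j = p + 1, \dots, n$. So $\{x_1, \dots, x_n\} \cap A = \{x_{i(1)} ; \dots ; x_{i(p)}\}$ and $|\{x_{i(1)} ; \dots ; x_{i(p)}\}| = p$, leading to $\{x_1, \dots, x_n\} \cap A \in \mathcal{B}_p$, and then to $\{x_1, \dots, x_n\} \cap A \ne B$. So, for all $B \in \mathcal{B}_p^c$ and all $A \in \mathcal{A}_d$, $\{x_1, \dots, x_n\} \cap A \ne B$. Hence, if $\binom{n}{\lfloor n/2 \rfloor} > d$, $(x_1, \dots, x_n)$ cannot be shattered by $\mathcal{A}_d$. Thus

$$VC(\mathcal{A}_d) \le \max\left\{ n ; \binom{n}{\lfloor n/2 \rfloor} \le d \right\}.$$

Conversely, let $n$ be such that $\binom{n}{\lfloor n/2 \rfloor} \le d$. Let $(x_1, \dots, x_n)$ be $n$ points of $\mathbb{R}^d$ defined as follows: for each coordinate $i = 1, \dots, \binom{n}{\lfloor n/2 \rfloor}$, let $\{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$ be the $i$-th subset of $\lfloor n/2 \rfloor$ indices in $\{1 ; \dots ; n\}$, where the indices are denoted in ascending order, i.e. $i_1 < \dots < i_{\lfloor n/2 \rfloor}$. Since $\binom{n}{\lfloor n/2 \rfloor} \le d$, we obtain $\binom{n}{\lfloor n/2 \rfloor}$ distinct subsets of indices. Hence we take for each such coordinate $x^i_{i_k} = k$.

The remaining values of $(x_1, \dots, x_n)$ are then taken as follows. Since $\binom{n}{\lfloor n/2 \rfloor} \le d$, for each subset $\{i^+_1 ; \dots ; i^+_{\lfloor n/2 \rfloor + 1}\}$ of $\{1 ; \dots ; n\}$ with $\lfloor n/2 \rfloor + 1$ elements, there exists $i \in \{1 ; \dots ; \binom{n}{\lfloor n/2 \rfloor}\}$ such that $\{i^+_1 ; \dots ; i^+_{\lfloor n/2 \rfloor}\} = \{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$. Then take $x^i_{i^+_{\lfloor n/2 \rfloor + 1}} = \lfloor n/2 \rfloor + 1$. Let us note that, if $n$ is odd, there is a bijection between $i$ and $i^+$. Let $\{j_1 ; \dots ; j_m\} = \{j \notin \{i^+_1 ; \dots ; i^+_{\lfloor n/2 \rfloor + 1}\}\}$, with $j_1 < \dots < j_m$, and let $j_0 = i^+_{\lfloor n/2 \rfloor + 1}$. Then take $x^i_{j_k} = x^i_{j_{k-1}} + 1$. If not filled, the last coordinates are set equal to $n$. Hence we obtain that, for all $j \notin \{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$, $x^i_j \ge \lfloor n/2 \rfloor + 1$.

Then $(x_1, \dots, x_n)$ is shattered by $\mathcal{A}_d$: for $p \in \{0 ; \dots ; n\}$, let $B = \{x_{i_1} ; \dots ; x_{i_p}\} \subset \{x_1, \dots, x_n\}$, with $1 \le i_1 < i_2 < \dots < i_p \le n$ as soon as $p \ne 0$.

If $p = 0$, let

$$i_0 = \operatorname{argmin}_{1 \le i \le d} \min_{1 \le j \le n} x^i_j,$$

and take $A = \{x^{i_0} \le \min_j x^{i_0}_j - 1\}$. Then $B = \{x_1, \dots, x_n\} \cap A = \emptyset$.

If $p = n$, let

$$i_n = \operatorname{argmax}_{1 \le i \le d} \max_{1 \le j \le n} x^i_j,$$

and take $A = \{x^{i_n} \le \max_j x^{i_n}_j + 1\}$. Then $B = \{x_1, \dots, x_n\} \cap A = \{x_1, \dots, x_n\}$.

If $0 < p \le \lfloor n/2 \rfloor$, let $A \in \mathcal{A}_d$ be the subset defined by $A = \{x^i \le p + 1/2\}$, with $i$ the coordinate corresponding to a subset of indices $\{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$ containing $\{i_1 ; \dots ; i_p\}$. Then, by definition of $(x^i_1, \dots, x^i_n)$, $B = \{x_1, \dots, x_n\} \cap A$.

If $\lfloor n/2 \rfloor + 1 \le p < n$, let $i$ be the coordinate corresponding to the configuration $\{i_1 ; \dots ; i_{\lfloor n/2 \rfloor + 1}\}$ (as defined by $(x_1, \dots, x_n)$). Let $A \in \mathcal{A}_d$ be the subset defined by $A = \{x^i \le p + 1/2\}$. Then, by definition of $(x^i_1, \dots, x^i_n)$, $B = \{x_1, \dots, x_n\} \cap A$. Thus

$$VC(\mathcal{A}_d) \ge \max\left\{ n ; \binom{n}{\lfloor n/2 \rfloor} \le d \right\}.$$

The lower and upper bounds of $VC(\mathcal{A}_d)$ are then computed by using Stirling's formula: for all $n \ge 1$ we have

$$\sqrt{2\pi}\, e^{-(n+1)} (n+1)^{n + \frac{1}{2}} \le n! \le \sqrt{2\pi}\, e^{-(n+1)}\, e^{\frac{1}{12(n+1)}} (n+1)^{n + \frac{1}{2}}.$$

A simple calculation gives the following: if $n$ is even, then

$$\binom{n}{n/2} \le \frac{e\, e^{\frac{1}{12(n+1)}}}{\sqrt{2\pi}}\, 2^{n+1}\, \frac{(n+1)^{n + 1/2}}{(n+2)^{n+1}} \le \frac{e^{1 + \frac{1}{36}}}{\sqrt{6\pi}}\, 2^{n+1},$$

and if $n$ is odd, then

$$\binom{n}{(n-1)/2} \le \frac{e\, e^{\frac{1}{12(n+1)}}}{\sqrt{2\pi}}\, 2^{n+1}\, \frac{(n+1)^{(n+1)/2}}{(n+3)^{n/2 + 1}} \le \frac{e^{1 + \frac{1}{24}}}{\sqrt{2\pi}}\, 2^{n}.$$

Thus, if

$$\frac{e^{1 + \frac{1}{24}}}{\sqrt{6\pi}}\, 2^{n+1} \le d,$$

then $\binom{n}{\lfloor n/2 \rfloor} \le d$. Taking the logarithm leads to the lower bound of $VC(\mathcal{A}_d)$.

On the other hand, if $\binom{n}{\lfloor n/2 \rfloor} \le d$, we have that, if $n$ is even,

$$\binom{n}{n/2} \ge \frac{e\, e^{-\frac{1}{3n+6}}}{\sqrt{2\pi}}\, 2^{n+1}\, \frac{(n+1)^{n + 1/2}}{(n+2)^{n+1}} \ge \frac{e^{-\frac{1}{12}}}{\sqrt{2\pi}}\, \frac{2^{n+1}}{\sqrt{d+2}},$$

and if $n$ is odd,

$$\binom{n}{(n-1)/2} \ge \frac{e\, e^{-\frac{n+2}{3(n+3)(n+1)}}}{\sqrt{2\pi}}\, 2^{n+1}\, \frac{(n+1)^{(n+1)/2}}{(n+3)^{n/2 + 1}} \ge \frac{e^{-\frac{1}{8}}}{\sqrt{2\pi}}\, \frac{2^{n+1}}{\sqrt{d+3}}.$$

Thus, since, for all $n$ such that $\binom{n}{\lfloor n/2 \rfloor} \le d$,

$$\frac{e^{-\frac{1}{8}}}{\sqrt{2\pi}}\, \frac{2^{n+1}}{\sqrt{d+3}} \le d,$$

the upper bound of $VC(\mathcal{A}_d)$ is found by taking the logarithm of this last expression.

References

[1] Akakpo, N. Adaptation to anisotropy and inhomogeneity via dyadic piecewise polynomial selection. Mathematical Methods of Statistics 21, 1 (2012), 1–28.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification And Regression Trees. Chapman & Hall, 1984.

[3] Devroye, L., Györfi, L., and Lugosi, G. A probabilistic theory of pattern recognition, vol. 31 of Applications of Mathematics (New York). Springer-Verlag, New York, 1996.

[4] Donoho, D. L. CART and best-ortho-basis: A connection. The Annals of Statistics 25, 5 (1997), 1870–1911.

[5] Gey, S. Risk bounds for CART classifiers under a margin condition. Pattern Recognition 45 (2012), 3523–3534.

[6] Gey, S., and Mary-Huard, T. Risk bounds for embedded variable selection in classification trees. Tech. rep., arXiv:1108.0757v1, 2011.

[7] Gey, S., and Nedelec, E. Model selection for CART regression trees. IEEE Trans. Inform. Theory 51, 2 (2005), 658–670.

[8] Nobel, A. B. Analysis of a complexity-based pruning scheme for classification trees. IEEE Trans. Inform. Theory 48, 8 (2002), 2362–2368.

[9] Vapnik, V. N., and Chervonenkis, A. Y. Theory of uniform convergence of frequencies of events to their probabilities and problems of search for an optimal solution from empirical data. Avtomat. i Telemeh., 2 (1971), 42–53.

[10] Vapnik, V. N., and Chervonenkis, A. Y. Teoriya raspoznavaniya obrazov. Statisticheskie problemy obucheniya. Izdat. Nauka, Moscow, 1974.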
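
As a numerical companion to the lemma, the exact formula $\max\{n ; \binom{n}{\lfloor n/2 \rfloor} \le d\}$ can be evaluated directly and compared with the Stirling-type bounds. The sketch below is an illustration, not the author's code: the function names `vc_axis_parallel` and `stirling_bounds` are hypothetical, and the constants $0.38$ and $0.5$ are taken as we read them from the lemma, namely $\log_2 d - 0.38$ and $\log_2(d\sqrt{d+3}) + 0.5$.

```python
from math import comb, log2, sqrt


def vc_axis_parallel(d):
    """Largest n such that C(n, floor(n/2)) <= d, for an integer d >= 1.

    C(n, floor(n/2)) is increasing in n, so a linear scan is enough.
    """
    n = 1
    while comb(n + 1, (n + 1) // 2) <= d:
        n += 1
    return n


def stirling_bounds(d):
    """Lower and upper approximations as read from the lemma (d >= 2)."""
    lower = log2(d) - 0.38
    upper = log2(d * sqrt(d + 3)) + 0.5
    return lower, upper


if __name__ == "__main__":
    for d in (2, 3, 6, 10, 100, 1000):
        lower, upper = stirling_bounds(d)
        print(d, vc_axis_parallel(d), round(lower, 2), round(upper, 2))
```

For instance, $d = 100$ gives $VC(\mathcal{A}_d) = 8$, since $\binom{8}{4} = 70 \le 100 < 126 = \binom{9}{4}$; the value stays constant until $d$ reaches the next central binomial coefficient, which is the piecewise constant behaviour described for Figure 1.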