Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts

Servane Gey
Laboratoire MAP5 - UMR 8145, Université Paris Descartes, 75270 Paris Cedex 06, France - Servane.Gey@parisdescartes.fr

July 24, 2012

arXiv:1203.0193v2 [math.ST] 23 Jul 2012

Abstract

The Vapnik-Chervonenkis (VC) dimension of the set of half-spaces of $\mathbb{R}^d$ with frontiers parallel to the axes is computed exactly. It is shown that it is much smaller than the intuitive value of $d$. A good approximation based on Stirling's formula proves that it is rather of the order $\log_2 d$. This result may be used to evaluate the performance of classifiers or regressors based on dyadic partitioning of $\mathbb{R}^d$, for instance. Algorithms using axis-parallel cuts to partition $\mathbb{R}^d$ are often used to reduce the computational time of such estimators when $d$ is large.

Keywords: Vapnik-Chervonenkis dimension, axis-parallel cuts.

MSC 2010 Classification: 62G99, 62H99

1 Introduction

The VC dimension of a set of subsets was introduced by Vapnik and Chervonenkis [9, 10] to measure its complexity. The VC dimension of a real-valued function space $\mathcal{F}$ is then the VC dimension of $\{\{x ; f(x) \geq 0\} ; f \in \mathcal{F}\}$. In particular, the VC dimension of sets of classifiers or regressors appears commonly in the statistical learning area when evaluating their performance. For example, Vapnik's theory in the classification framework is now widely known (see [3] for instance): let $(X, Y)$ be a couple of variables taking values in $\mathbb{R}^d \times \{0; 1\}$, and let $L$ be a sample of $n$ independent replications of $(X, Y)$. If $\hat f$ is a classifier minimizing the average misclassification rate on $L$ over a set
of classifiers having finite VC dimension $V$, then, without further assumption on the distribution $P$ of $(X, Y)$, the performance of $\hat f$ is evaluated as follows:

$$E_L\left[P\left(\hat f(X) \neq Y\right)\right] \leq C_1\,\mathrm{bias}^2(\hat f) + C_2\,\frac{V}{n}, \qquad (1)$$

where $E_L$ denotes the expectation with respect to the sample distribution, $\mathrm{bias}(\hat f)$ denotes the bias of the classifier $\hat f$, and $C_1$ and $C_2$ are absolute constants.

Functional estimates defined on partitions of $\mathbb{R}^d$ are often used to estimate relationships between two variables $X \in \mathbb{R}^d$ and $Y \in \{0; 1\}$ or $Y \in \mathbb{R}$ (such as histograms, piecewise polynomials, or splines for example). In many cases, the VC dimension of the set of subsets used to construct the partition appears inside risk bounds when evaluating the performance of such estimators. For example, if the set used is the set of all half-spaces of $\mathbb{R}^d$, its VC dimension $d + 1$ has to be taken into account. When $d$ is large, it is often computationally easier to construct partitions using axis-parallel cuts. For example, some theoretical developments on dyadic partitions of $\mathbb{R}^2$ are given in [4, 1], and the VC dimension of axis-parallel cuts appears more particularly in the results obtained on the performance of classification and regression binary decision trees (CART), introduced by Breiman et al. [2] in 1984 and theoretically studied in [8, 7, 5, 6].

2 Reminder about VC Dimension

The VC dimension of a set $\mathcal{A}$ of subsets of some measurable space $\mathcal{X}$ is based on counting the number of distinct intersections of the elements of $\mathcal{A}$ with a finite set of fixed points in $\mathcal{X}$.

Definition 1 (Vapnik-Chervonenkis Dimension). Let $\mathcal{A}$ be a set of subsets of some measurable space $\mathcal{X}$. Then $(x_1, \dots, x_n) \in \mathcal{X}^n$ will be said to be shattered by $\mathcal{A}$ if all subsets of $\{x_1; \dots; x_n\}$ are covered by $\mathcal{A}$, that is if

$$\left|\left\{\{x_1, \dots, x_n\} \cap A \; ; \; A \in \mathcal{A}\right\}\right| = 2^n.$$

The Vapnik-Chervonenkis dimension $VC(\mathcal{A})$ of $\mathcal{A}$ is then defined as the maximal integer $n$ such that there exist $n$ points in $\mathcal{X}$ shattered by $\mathcal{A}$, i.e.

$$VC(\mathcal{A}) = \max\left\{n \; ; \; \max_{(x_1, \dots, x_n) \in \mathcal{X}^n} \left|\left\{\{x_1, \dots, x_n\} \cap A \; ; \; A \in \mathcal{A}\right\}\right| = 2^n\right\}.$$

If no such $n$ exists, then $VC(\mathcal{A}) = +\infty$.
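To make the definition concrete, the following Python sketch enumerates the traces $\{x_1, \dots, x_n\} \cap A$ of a finite point set and checks whether all $2^n$ subsets are obtained. It is only an illustration under stated assumptions: the family of sets is taken to be the one-sided axis-parallel cuts $\{x ; x^i \leq a\}$, and the names `cut_traces` and `is_shattered` are hypothetical helpers, not part of the paper.

```python
def cut_traces(points):
    """Return every subset of point indices cut out by a set {x : x[i] <= a}.

    Assumption: the family is the one-sided axis-parallel cuts. The trace of
    a cut only changes when the threshold a crosses a coordinate value of a
    point, so it is enough to try the observed values themselves.
    """
    n = len(points)
    d = len(points[0])
    # A threshold below all coordinate values yields the empty trace.
    traces = {frozenset()}
    for i in range(d):
        for a in sorted({p[i] for p in points}):
            traces.add(frozenset(j for j in range(n) if points[j][i] <= a))
    return traces


def is_shattered(points):
    """True if the cuts pick out all 2^n subsets of the n given points."""
    return len(cut_traces(points)) == 2 ** len(points)
```

For instance, `is_shattered([(0, 1), (1, 0)])` checks two points of the plane, and `is_shattered([(0, 2, 1), (1, 0, 2), (2, 1, 0)])` checks three points of $\mathbb{R}^3$ whose coordinate orderings are the three cyclic shifts. The sketch tests one given configuration only; computing the VC dimension itself requires a maximum over all configurations.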
Thus, it is easily seen that the larger $VC(\mathcal{A})$, the more complex $\mathcal{A}$. For example, if $\mathcal{A} = \{\,]-\infty; x] \; ; \; x \in \mathbb{R}\}$, then $VC(\mathcal{A}) = 1$; or if $\mathcal{A}$ is the set of all half-spaces in $\mathbb{R}^d$, then $VC(\mathcal{A}) = d + 1$. Since the set of axis-parallel cuts is a subset of the set of all half-spaces in $\mathbb{R}^d$, it could be natural to think that its VC dimension is of order $d$. Actually, it is shown in what follows that it is of order $\log_2 d$.

3 VC Dimension of axis-parallel cuts

We give a formula to compute the VC dimension of axis-parallel cuts in $\mathbb{R}^d$. Since the obtained formula is not always easy to handle, an approximation is also given.

Lemma 1. Let

$$\mathcal{A}_d = \left\{\{x \in \mathbb{R}^d \; ; \; x^i \leq a\} \; ; \; i = 1, \dots, d, \; a \in \mathbb{R}\right\}.$$

Then

$$VC(\mathcal{A}_d) = \max\left\{n \; ; \; \binom{n}{\lfloor n/2 \rfloor} \leq d\right\},$$

where $\lfloor x \rfloor$ denotes the integer part of $x$. Furthermore, the following approximation of $VC(\mathcal{A}_d)$ is available for all $d \geq 2$:

$$\frac{\log d}{\log 2} - 0.38 \leq VC(\mathcal{A}_d) \leq \frac{\log\left(d\sqrt{d + 3}\right)}{\log 2} + 0.5.$$

Figure 1 shows that $VC(\mathcal{A}_d)$ is a piecewise constant function of the space dimension $d$, which increases at a rate much smaller than the intuitive value of $d$. It also shows that the bounds computed from Stirling's formula are sharp.

Proof. The idea is that, to have $n$ points $(x_1, \dots, x_n)$ shattered by $\mathcal{A}_d$, all the subsets of $\{x_1, \dots, x_n\}$ should be covered by $\mathcal{A}_d$. But, if there exists $p \leq n$ such that there are at least $d + 1$ subsets of $\{x_1, \dots, x_n\}$ having $p$ elements, then $\mathcal{A}_d$ will miss at least $\binom{n}{p} - d$ subsets: let $n \geq 1$ and $(x_1, \dots, x_n)$ be $n$ points in $\mathbb{R}^d$. Suppose that $n$ is such that $\binom{n}{\lfloor n/2 \rfloor} > d$. This means that there are at least $d + 1$ subsets of $\{x_1, \dots, x_n\}$ of size $\lfloor n/2 \rfloor$. For each
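coordinate $i = 1, \dots, d$, let us denote by $x^i_{i(\cdot)}$ the order statistic computed from the $i$-th coordinate of $(x_1, \dots, x_n)$, that is, for all $i = 1, \dots, d$,

$$x^i_{i(1)} \leq x^i_{i(2)} \leq \dots \leq x^i_{i(n)}.$$

Let $p = \lfloor n/2 \rfloor$ and let

$$\mathcal{B}_p = \left\{\{x_{i(1)}; \dots; x_{i(p)}\} \; ; \; i = 1, \dots, d \text{ and } \left|\{x_{i(1)}; \dots; x_{i(p)}\}\right| = p\right\},$$

$$\mathcal{B}_p^c = \left\{B \subset \{x_1, \dots, x_n\} \; ; \; |B| = p \text{ and } B \notin \mathcal{B}_p\right\}.$$

Hence $\mathcal{B}_p$ is covered by $\mathcal{A}_d$ (by simply taking $A = \{x^i \leq (x^i_{i(p)} + x^i_{i(p+1)})/2\}$ for each coordinate), and we have that

$$|\mathcal{B}_p| \leq d \quad \text{and} \quad |\mathcal{B}_p^c| \geq \binom{n}{p} - d > 0.$$

Figure 1: $VC(\mathcal{A}_d)$ with respect to the space dimension $d$ and Stirling's bounds.

Let $B \in \mathcal{B}_p^c$ and $A = \{x^i \leq a\} \in \mathcal{A}_d$. If $|\{x_1, \dots, x_n\} \cap A| \neq p$, then $\{x_1, \dots, x_n\} \cap A \neq B$. Else, since $\{x_1, \dots, x_n\} \cap A = \{x_j \; ; \; x^i_j \leq a\}$, we have that $x^i_{i(j)} \leq a$ for all $j = 1, \dots, p$, and $x^i_{i(j)} > a$ for all $j = p + 1, \dots, n$. So $\{x_1, \dots, x_n\} \cap A = \{x_{i(1)}; \dots; x_{i(p)}\}$ and $|\{x_{i(1)}; \dots; x_{i(p)}\}| = p$, leading to $\{x_1, \dots, x_n\} \cap A \in \mathcal{B}_p$, and then to $\{x_1, \dots, x_n\} \cap A \neq B$. So, for all $B \in \mathcal{B}_p^c$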
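and all $A \in \mathcal{A}_d$, $\{x_1, \dots, x_n\} \cap A \neq B$. So, if $\binom{n}{\lfloor n/2 \rfloor} > d$, $(x_1, \dots, x_n)$ cannot be shattered by $\mathcal{A}_d$. Thus

$$VC(\mathcal{A}_d) \leq \max\left\{n \; ; \; \binom{n}{\lfloor n/2 \rfloor} \leq d\right\}.$$

Let $n$ be such that $\binom{n}{\lfloor n/2 \rfloor} \leq d$. Let $(x_1, \dots, x_n)$ be $n$ points of $\mathbb{R}^d$ defined as follows: for each coordinate $i = 1, \dots, \binom{n}{\lfloor n/2 \rfloor}$, let $\{i_1; \dots; i_{\lfloor n/2 \rfloor}\}$ be the $i$-th subset of $\lfloor n/2 \rfloor$ indices in $\{1; \dots; n\}$, where the indices are denoted in ascending order, i.e. $i_1 < \dots < i_{\lfloor n/2 \rfloor}$. Since $\binom{n}{\lfloor n/2 \rfloor} \leq d$, we obtain $\binom{n}{\lfloor n/2 \rfloor}$ distinct subsets of indices. Hence we take for each such coordinate $x^i_{i_k} = k$.

Then the remaining values of $(x_1, \dots, x_n)$ are taken as follows. Since $\binom{n}{\lfloor n/2 \rfloor + 1} \leq \binom{n}{\lfloor n/2 \rfloor} \leq d$, for each subset $\{i_1; \dots; i_{\lfloor n/2 \rfloor + 1}\}$ of $\{1; \dots; n\}$ with $\lfloor n/2 \rfloor + 1$ elements, there exists $i \in \{1; \dots; \binom{n}{\lfloor n/2 \rfloor}\}$ such that the $i$-th subset of indices is $\{i_1; \dots; i_{\lfloor n/2 \rfloor}\}$. Then take $x^i_{i_{\lfloor n/2 \rfloor + 1}} = \lfloor n/2 \rfloor + 1$. Let us note that, if $n$ is odd, there is a bijection between the subsets with $\lfloor n/2 \rfloor$ elements and those with $\lfloor n/2 \rfloor + 1$ elements. Let $\{j_1; \dots; j_m\} = \{j \notin \{i_1; \dots; i_{\lfloor n/2 \rfloor + 1}\}\}$, with $j_1 < \dots < j_m$, and let $j_0 = i_{\lfloor n/2 \rfloor + 1}$. Then take $x^i_{j_k} = x^i_{j_{k-1}} + 1$. If not filled, the last coordinates are set to be equal to $n$. Hence, we obtain that, for all $j \notin \{i_1; \dots; i_{\lfloor n/2 \rfloor}\}$, $x^i_j \geq \lfloor n/2 \rfloor + 1$.

Then $(x_1, \dots, x_n)$ is shattered by $\mathcal{A}_d$: for $p \in \{0; \dots; n\}$, let $B = \{x_{i_1}; \dots; x_{i_p}\} \subset \{x_1, \dots, x_n\}$, with $1 \leq i_1 < i_2 < \dots < i_p \leq n$ as soon as $p \neq 0$.

If $p = 0$, let

$$i_0 = \operatorname{argmin}_{1 \leq i \leq d} \; \min_{1 \leq j \leq n} x^i_j,$$

and take $A = \{x^{i_0} \leq \min_j x^{i_0}_j - 1\}$. Then $B = \{x_1, \dots, x_n\} \cap A = \emptyset$.

If $p = n$, let

$$i_n = \operatorname{argmax}_{1 \leq i \leq d} \; \max_{1 \leq j \leq n} x^i_j,$$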
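and take $A = \{x^{i_n} \leq \max_j x^{i_n}_j + 1\}$. Then $B = \{x_1, \dots, x_n\} \cap A = \{x_1, \dots, x_n\}$.

If $0 < p \leq \lfloor n/2 \rfloor$, let $A \in \mathcal{A}_d$ be the subset defined by $A = \{x^i \leq p + 1/2\}$, with $i$ the coordinate corresponding to a subset of indices $\{i_1; \dots; i_{\lfloor n/2 \rfloor}\}$ containing $\{i_1; \dots; i_p\}$. Then, by definition of $(x^i_1, \dots, x^i_n)$, $B = \{x_1, \dots, x_n\} \cap A$.

If $\lfloor n/2 \rfloor + 1 \leq p < n$, let $i$ be the coordinate corresponding to the configuration $\{i_1; \dots; i_{\lfloor n/2 \rfloor + 1}\}$ (as defined by $(x_1, \dots, x_n)$). Let $A \in \mathcal{A}_d$ be the subset defined by $A = \{x^i \leq p + 1/2\}$. Then, by definition of $(x^i_1, \dots, x^i_n)$, $B = \{x_1, \dots, x_n\} \cap A$. Thus

$$VC(\mathcal{A}_d) \geq \max\left\{n \; ; \; \binom{n}{\lfloor n/2 \rfloor} \leq d\right\}.$$

Then, the lower and upper bounds of $VC(\mathcal{A}_d)$ are computed by using Stirling's formula: for all $n \geq 1$ we have

$$\sqrt{2\pi}\, e^{-(n+1)} (n + 1)^{n + \frac{1}{2}} \leq n! \leq \sqrt{2\pi}\, e^{-(n+1)}\, e^{\frac{1}{12(n+1)}} (n + 1)^{n + \frac{1}{2}}.$$

A simple calculation gives the following: if $n$ is even, then

$$\binom{n}{n/2} \leq \frac{e\, e^{\frac{1}{12(n+1)}}}{\sqrt{2\pi}}\, 2^{n+1}\, \frac{(n + 1)^{n + 1/2}}{(n + 2)^{n+1}} \leq \frac{e^{1 + \frac{1}{36}}}{\sqrt{6\pi}}\, 2^{n+1},$$

and if $n$ is odd, then

$$\binom{n}{\lfloor n/2 \rfloor} \leq \frac{e\, e^{\frac{1}{12(n+1)}}}{\sqrt{2\pi}}\, 2^{n+1}\, \frac{(n + 1)^{(n+1)/2}}{(n + 3)^{n/2 + 1}} \leq \frac{e^{1 + \frac{1}{24}}}{\sqrt{2\pi}}\, 2^{n}.$$

Thus, if

$$\frac{e^{1 + \frac{1}{24}}}{\sqrt{6\pi}}\, 2^{n+1} \leq d,$$

then $\binom{n}{\lfloor n/2 \rfloor} \leq d$. Taking the logarithm leads to the lower bound of $VC(\mathcal{A}_d)$.

On the other hand, if $\binom{n}{\lfloor n/2 \rfloor} \leq d$, then $n \leq d$ and we have that, if $n$ is even,

$$\binom{n}{n/2} \geq \frac{e\, e^{-\frac{1}{3n+6}}}{\sqrt{2\pi}}\, 2^{n+1}\, \frac{(n + 1)^{n + 1/2}}{(n + 2)^{n+1}} \geq \frac{e^{-\frac{1}{12}}}{\sqrt{2\pi}\, \sqrt{d + 2}}\, 2^{n+1},$$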
and if $n$ is odd,

$$\binom{n}{\lfloor n/2 \rfloor} \geq \frac{e\, e^{-\frac{n+2}{3(n+3)(n+1)}}}{\sqrt{2\pi}}\, 2^{n+1}\, \frac{(n + 1)^{(n+1)/2}}{(n + 3)^{n/2 + 1}} \geq \frac{e^{-\frac{1}{8}}}{\sqrt{2\pi}\, \sqrt{d + 3}}\, 2^{n+1}.$$

Thus, since, for all $n$ such that $\binom{n}{\lfloor n/2 \rfloor} \leq d$,

$$\frac{e^{-\frac{1}{8}}}{\sqrt{2\pi}\, \sqrt{d + 3}}\, 2^{n+1} \leq d,$$

the upper bound of $VC(\mathcal{A}_d)$ is found by taking the logarithm of this last expression.

References

[1] Akakpo, N. Adaptation to anisotropy and inhomogeneity via dyadic piecewise polynomial selection. Mathematical Methods of Statistics 21, 1 (2012), 1-28.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification And Regression Trees. Chapman & Hall, 1984.

[3] Devroye, L., Györfi, L., and Lugosi, G. A probabilistic theory of pattern recognition, vol. 31 of Applications of Mathematics (New York). Springer-Verlag, New York, 1996.

[4] Donoho, D. L. CART and best-ortho-basis: A connection. The Annals of Statistics 25, 5 (1997), 1870-1911.

[5] Gey, S. Risk bounds for CART classifiers under a margin condition. Pattern Recognition 45 (2012), 3523-3534.

[6] Gey, S., and Mary-Huard, T. Risk bounds for embedded variable selection in classification trees. Tech. rep., arXiv, 1108.0757v1, 2011.

[7] Gey, S., and Nedelec, E. Model selection for CART regression trees. IEEE Trans. Inform. Theory 51, 2 (2005), 658-670.

[8] Nobel, A. B. Analysis of a complexity-based pruning scheme for classification trees. IEEE Trans. Inform. Theory 48, 8 (2002), 2362-2368.

[9] Vapnik, V. N., and Chervonenkis, A. Y. Theory of uniform convergence of frequencies of events to their probabilities and problems of search for an optimal solution from empirical data. Avtomat. i Telemeh., 2 (1971), 42-53.
[10] Vapnik, V. N., and Chervonenkis, A. Y. Teoriya raspoznavaniya obrazov. Statisticheskie problemy obucheniya. Izdat. Nauka, Moscow, 1974.
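
As a numerical companion to the lemma, the short Python sketch below evaluates the exact formula $VC(\mathcal{A}_d) = \max\{n \; ; \; \binom{n}{\lfloor n/2 \rfloor} \leq d\}$ and prints it next to $\log_2 d$ and to $d + 1$, the VC dimension of all half-spaces. The function name `vc_axis_parallel` is a hypothetical helper introduced for illustration only; the loop assumes (as is the case) that the central binomial coefficient $\binom{n}{\lfloor n/2 \rfloor}$ is nondecreasing in $n$.

```python
from math import comb, log2


def vc_axis_parallel(d):
    """Largest n such that C(n, floor(n/2)) <= d (exact formula of the lemma)."""
    if d < 1:
        raise ValueError("the space dimension d must be at least 1")
    n = 1  # C(1, 0) = 1 <= d always holds for d >= 1
    # The central binomial coefficient is nondecreasing in n, so we can
    # stop at the first n + 1 whose coefficient exceeds d.
    while comb(n + 1, (n + 1) // 2) <= d:
        n += 1
    return n


if __name__ == "__main__":
    # Compare the exact value with log2(d) and with d + 1 (all half-spaces).
    for d in (2, 6, 20, 100, 1000, 10**6):
        print(d, vc_axis_parallel(d), round(log2(d), 2), d + 1)
```

The table printed by the script gives a quick way to see the piecewise constant growth of $VC(\mathcal{A}_d)$ in $d$, in contrast with the linear growth of $d + 1$.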