Learnability and the Doubling Dimension

Yi Li
Genome Institute of Singapore
liy3@gis.a-star.edu.sg

Philip M. Long
Google
plong@google.com

Abstract

Given a set $F$ of classifiers and a probability distribution $D$ over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of $F$, and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of $F$ (in contrast with the metric dimension).

1 Introduction

A set $F$ of classifiers and a probability distribution $D$ over their domain induce a metric in which the distance between classifiers is the probability that they disagree on how to classify a random object. (Let us call this metric $d_D$.) Properties of metrics like this have long been used for analyzing the generalization ability of learning algorithms [11, 32]. This paper is about bounds on the number of examples required for PAC learning in terms of the doubling dimension [4] of this metric space.

The doubling dimension of a metric space is the least $d$ such that any ball can be covered by $2^d$ balls of half its radius. The doubling dimension has been frequently used lately in the analysis of algorithms [13, 20, 21, 17, 29, 14, 7, 22, 28, 6].

In the PAC-learning model, an algorithm is given examples $(x_1, f(x_1)), \ldots, (x_m, f(x_m))$ of the behavior of an arbitrary member $f$ of a known class $F$. The items $x_1, \ldots, x_m$ are chosen independently at random according to $D$. The algorithm must, with probability at least $1 - \delta$ (w.r.t. the random choice of $x_1, \ldots, x_m$), output a classifier whose distance from $f$ is at most $\epsilon$.

We show that if $(F, d_D)$ has doubling dimension $d$, then $F$ can be PAC-learned with respect to $D$ using

$$O\left(\frac{d + \log\frac{1}{\delta}}{\epsilon}\right) \qquad (1)$$

examples. If in addition the VC-dimension of $F$ is $d$, we show that any algorithm that outputs a classifier with zero training error whenever this is possible PAC-learns $F$ w.r.t. $D$ using

$$O\left(\frac{d\sqrt{\log\frac{1}{\epsilon}} + \log\frac{1}{\delta}}{\epsilon}\right) \qquad (2)$$

examples.
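The distance at the center of the paper, $d_D(f, g) = \Pr_{x \sim D}(f(x) \neq g(x))$, is easy to estimate by straightforward sampling. The following is a minimal sketch added for illustration; the threshold classifiers and the uniform sampler are stand-ins, not anything from the paper.

```python
import random

def disagreement_distance(f, g, sample_D, m=100_000, seed=0):
    """Monte Carlo estimate of d_D(f, g) = Pr_{x ~ D}[f(x) != g(x)]."""
    rng = random.Random(seed)
    points = [sample_D(rng) for _ in range(m)]
    return sum(f(x) != g(x) for x in points) / m

# Example: two threshold classifiers on [0, 1] under the uniform distribution.
# Their true distance is |0.30 - 0.45| = 0.15.
f = lambda x: 1 if x >= 0.30 else 0
g = lambda x: 1 if x >= 0.45 else 0
print(disagreement_distance(f, g, lambda rng: rng.random()))  # approximately 0.15
```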

We show that if $F$ consists of halfspaces through the origin, and $D$ is the uniform distribution over the unit ball in $\mathbf{R}^n$, then the doubling dimension of $(F, d_D)$ is $O(n)$. Thus (1) generalizes the known bound of $O\left(\frac{n + \log\frac{1}{\delta}}{\epsilon}\right)$ for learning halfspaces with respect to the uniform distribution [25], matching a known lower bound for this problem [23] up to a constant factor. Both upper bounds improve on the $O\left(\frac{n \log\frac{1}{\epsilon} + \log\frac{1}{\delta}}{\epsilon}\right)$ bound that follows from the traditional analysis; (2) is the first such improvement for a polynomial-time algorithm.

Some previous analyses of the sample complexity of learning have made use of the fact that the metric dimension [18] is at most the VC-dimension [11, 15]. Since using the doubling dimension can sometimes lead to a better bound, a natural question is whether there is also a bound on the doubling dimension in terms of the VC-dimension. We show that this is not the case: it is possible to pack $(1/\alpha)^{\Omega(d)}$ classifiers in a set of VC-dimension $d$ so that the distance between every pair is in the interval $[\alpha, 2\alpha]$. Our analysis was inspired by some previous work in computational geometry [19], but is simpler.

Combining our upper bound analysis with established techniques (see [33, 3, 8, 31, 30]), one can perform similar analyses for the more general case in which no classifier in $F$ has zero error. We have begun with the PAC model because it is a clean setting in which to illustrate the power of the doubling dimension for analyzing learning algorithms. The doubling dimension appears most useful when the best achievable error rate (the Bayes error) is of the same order as the inverse of the number of training examples (or smaller).

Bounding the doubling dimension is useful for analyzing the sample complexity of learning because it limits the richness of a subclass of $F$ near the classifier to be learned. For other analyses that exploit bounds on such local richness, please see [31, 30, 5, 25, 26, 34]. It could be that stronger results could be obtained by marrying the techniques of this paper with those. In any case, it appears that the doubling dimension is an intuitive yet powerful way to bound the local complexity of a collection of classifiers.

2 Preliminaries

2.1 Learning

For some domain $X$, an example consists of a member of $X$ and its classification in $\{0, 1\}$. A classifier is a mapping from $X$ to $\{0, 1\}$. A training set is a finite collection of examples. A learning algorithm takes as input a training set and outputs a classifier.

Suppose $D$ is a probability distribution over $X$. Then define

$$d_D(f, g) = \Pr_{x \sim D}\left(f(x) \neq g(x)\right).$$

A learning algorithm $A$ PAC learns $F$ w.r.t. $D$ with accuracy $\epsilon$ and confidence $\delta$ from $m$ examples if, for any $f \in F$, if domain elements $x_1, \ldots, x_m$ are drawn independently at random according to $D$, and $(x_1, f(x_1)), \ldots, (x_m, f(x_m))$ is passed to $A$, which outputs $h$, then

$$\Pr\left(d_D(f, h) > \epsilon\right) \leq \delta.$$

If $F$ is a set of classifiers, a learning algorithm is a consistent hypothesis finder for $F$ if it outputs an element of $F$ that correctly classifies all of the training data whenever it is possible to do so.

2.2 Metrics

Suppose $(Z, \rho)$ is a metric space. An $\alpha$-cover for $Z$ is a set $T \subseteq Z$ such that every element of $Z$ has a counterpart in $T$ that is at a distance at most $\alpha$ (with respect to $\rho$).
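One property of $d_D$ that is invoked repeatedly in the arguments below is the triangle inequality; a short derivation, added here for convenience, shows it is just a union bound:

$$\{x : f(x) \neq h(x)\} \subseteq \{x : f(x) \neq g(x)\} \cup \{x : g(x) \neq h(x)\},$$

so that

$$d_D(f, h) \leq \Pr_{x \sim D}\left(f(x) \neq g(x)\right) + \Pr_{x \sim D}\left(g(x) \neq h(x)\right) = d_D(f, g) + d_D(g, h).$$

Symmetry is immediate, and $d_D(f, g) = 0$ exactly when $f$ and $g$ agree $D$-almost everywhere, which is all the arguments below require.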

An $\alpha$-packing for $Z$ is a set $T \subseteq Z$ such that every pair of elements of $T$ are at a distance greater than $\alpha$ (again, with respect to $\rho$). The $\alpha$-ball centered at $z \in Z$ consists of all $t \in Z$ for which $\rho(z, t) \leq \alpha$. Denote the size of the smallest $\alpha$-cover by $N(\alpha)$. Denote the size of the largest $\alpha$-packing by $M(\alpha)$.

Lemma 1 ([18]) For any metric space $(Z, \rho)$ and any $\alpha > 0$, $M(2\alpha) \leq N(\alpha) \leq M(\alpha)$.

The doubling dimension of $Z$ is the least $d$ such that, for all radii $\alpha > 0$, any $\alpha$-ball in $Z$ can be covered by at most $2^d$ $\alpha/2$-balls. That is, for any $\alpha > 0$ and any $z \in Z$, there is a $C \subseteq Z$ such that $|C| \leq 2^d$ and $\{t \in Z : \rho(z, t) \leq \alpha\} \subseteq \bigcup_{c \in C} \{t \in Z : \rho(c, t) \leq \alpha/2\}$.

2.3 Probability

For a function $\phi$ and a probability distribution $D$, let $\mathbf{E}_D(\phi) = \mathbf{E}_{x \sim D}(\phi(x))$ be the expectation of $\phi$ w.r.t. $D$. We will shorten this to $\mathbf{E}(\phi)$, and if $u = (u_1, \ldots, u_m) \in X^m$, then $\mathbf{E}_u(\phi)$ will be $\frac{1}{m} \sum_{i=1}^m \phi(u_i)$. We will use $\Pr_{x \sim D}$, $\Pr_D$, and $\Pr_u$ similarly.

3 The strongest upper bound

Theorem 2 Suppose $d$ is the doubling dimension of $(F, d_D)$. There is an algorithm that PAC-learns $F$ from $O\left(\frac{d + \log\frac{1}{\delta}}{\epsilon}\right)$ examples.

The key lemma limits the extent to which points that are separated from one another can crowd around some point in a metric space with limited doubling dimension.

Lemma 3 (see [13]) Suppose $(Z, \rho)$ is a metric space with doubling dimension $d$ and $z \in Z$. For $\beta > 0$, let $B_\beta(z)$ consist of the elements $u \in Z$ such that $\rho(u, z) \leq \beta$ (that is, the $\beta$-ball centered at $z$). Then $M(\alpha, B_\beta(z)) \leq (4\beta/\alpha)^d$. (In other words, any $\alpha$-packing can have at most $(4\beta/\alpha)^d$ elements within distance $\beta$ of $z$.)

Proof: Since $Z$ has doubling dimension $d$, the set $B_\beta(z)$ can be covered by $2^d$ balls of radius $\beta/2$. Each of these can be covered by $2^d$ balls of radius $\beta/4$, and so on. Thus, $B_\beta(z)$ can be covered by $2^{d \lceil \log_2(2\beta/\alpha) \rceil} \leq (4\beta/\alpha)^d$ balls of radius $\alpha/2$. Applying Lemma 1 completes the proof.

Now we are ready to prove Theorem 2. The proof is an application of the peeling technique [1] (see [30]).

Proof of Theorem 2: Construct an $\epsilon/4$-packing $G$ greedily, by repeatedly adding an element of $F$ to $G$ for as long as this is possible. This packing is also an $\epsilon/4$-cover, since otherwise we could add another member to $G$. Consider the algorithm $A$ that outputs the element of $G$ with minimum error on the training set (a sketch of this algorithm appears below). Whatever the target, some element of $G$ has error at most $\epsilon/4$. Applying Chernoff bounds, $O\left(\frac{\log\frac{1}{\delta}}{\epsilon}\right)$ examples are sufficient that, with probability at least $1 - \delta/2$, this classifier is incorrect on at most a fraction $\epsilon/2$ of the training data. Thus, the training error of the hypothesis output by $A$ is at most $\epsilon/2$ with probability at least $1 - \delta/2$.
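The algorithm $A$ just described can be rendered concretely for the illustrative case in which $F$ is given as a finite list of classifiers and $d_D$ is replaced by an empirical estimate (the analysis above uses exact distances). This is a minimal sketch under those assumptions; the interfaces are not from the paper.

```python
def greedy_packing(classifiers, dist, alpha):
    """Greedily build an alpha-packing: keep a classifier only if it is more
    than alpha away (under dist) from every classifier kept so far.  A maximal
    packing built this way is also an alpha-cover."""
    packing = []
    for f in classifiers:
        if all(dist(f, g) > alpha for g in packing):
            packing.append(f)
    return packing

def empirical_distance(f, g, points):
    """Fraction of points on which f and g disagree (an estimate of d_D)."""
    return sum(f(x) != g(x) for x in points) / len(points)

def cover_and_minimize_training_error(classifiers, reference_points, training_set, epsilon):
    """Sketch of the learner in Theorem 2: form an (epsilon/4)-packing G of the
    class (here with respect to an empirical surrogate for d_D), then output
    the element of G with the fewest mistakes on the training set."""
    dist = lambda f, g: empirical_distance(f, g, reference_points)
    G = greedy_packing(classifiers, dist, epsilon / 4.0)
    def training_error(g):
        return sum(g(x) != y for x, y in training_set) / len(training_set)
    return min(G, key=training_error)
```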

Choose an arbitrary target $f \in F$, and let $S$ be the random training set resulting from drawing $m$ examples according to $D$ and classifying them using $f$. Define $\mathrm{err}_S(g)$ to be the fraction of examples in $S$ on which $f$ and $g$ disagree. We have

$$\Pr\left(\exists g \in G,\ d_D(f, g) > \epsilon \text{ and } \mathrm{err}_S(g) \leq \epsilon/2\right)
\leq \sum_{i=0}^{\lceil \log_2 \frac{1}{\epsilon} \rceil} \Pr\left(\exists g \in G,\ 2^i \epsilon < d_D(f, g) \leq 2^{i+1} \epsilon \text{ and } \mathrm{err}_S(g) \leq \epsilon/2\right)
\leq \sum_{i=0}^{\lceil \log_2 \frac{1}{\epsilon} \rceil} \left(2^{i+5}\right)^d e^{-2^i \epsilon m / 8}$$

by Lemma 3 and the standard Chernoff bound. Each of the following steps is a straightforward manipulation:

$$\sum_{i=0}^{\lceil \log_2 \frac{1}{\epsilon} \rceil} \left(2^{i+5}\right)^d e^{-2^i \epsilon m / 8}
\leq \sum_{i=0}^{\infty} 2^{(i+5)d} e^{-(i+1) \epsilon m / 8}
= 2^{5d} e^{-\epsilon m / 8} \sum_{i=0}^{\infty} \left(2^{d} e^{-\epsilon m / 8}\right)^{i}
\leq 2^{5d+1} e^{-\epsilon m / 8},$$

where the last step holds when $2^d e^{-\epsilon m/8} \leq 1/2$. Since $m = O\left(\frac{d + \log\frac{1}{\delta}}{\epsilon}\right)$ is sufficient for $2^d e^{-\epsilon m/8} \leq 1/2$ and $2^{5d+1} e^{-\epsilon m/8} \leq \delta/2$, this completes the proof: with probability at least $1 - \delta$, the hypothesis output by $A$ has training error at most $\epsilon/2$ while every $g \in G$ with $d_D(f, g) > \epsilon$ has training error greater than $\epsilon/2$, so the output is within distance $\epsilon$ of $f$.

4 A bound for consistent hypothesis finders

In this section we analyze algorithms that work by finding hypotheses with zero training error. This is one way to achieve computational efficiency, as is the case when $F$ consists of halfspaces. This analysis will use the notion of VC-dimension.

Definition 4 The VC-dimension of a set $F$ of $\{0,1\}$-valued functions with a common domain is the size of the largest set $\{x_1, \ldots, x_k\}$ of domain elements such that $\{(f(x_1), \ldots, f(x_k)) : f \in F\} = \{0,1\}^k$.

The following lemma generalizes the Chernoff bound to hold uniformly over a class of random variables; it concentrates on a simplified consequence of the Chernoff bound that is useful when bounding the probability that an empirical estimate is much larger than the true expectation.

Lemma 5 (see [12, 24]) Suppose $F$ is a set of $\{0,1\}$-valued functions with a common domain $X$. Let $d$ be the VC-dimension of $F$. Let $D$ be a probability distribution over $X$. Choose $\alpha > 0$ and $K > 1$. Then there is an absolute constant $c$ such that, if

$$m \geq \frac{c}{K\alpha \log K}\left(d \log\frac{1}{\alpha} + \log\frac{1}{\delta}\right),$$

then

$$\Pr_{u \sim D^m}\left(\exists \phi \in F,\ \mathbf{E}(\phi) \leq \alpha \text{ but } \mathbf{E}_u(\phi) \geq K\alpha\right) \leq \delta.$$

Now we are ready for the main analysis of this section.

Theorem 6 Suppose the doubling dimension of $(F, d_D)$ and the VC-dimension of $F$ are both at most $d$. Any consistent hypothesis finder for $F$ PAC learns $F$ from $O\left(\frac{d\sqrt{\log\frac{1}{\epsilon}} + \log\frac{1}{\delta}}{\epsilon}\right)$ examples.

Proof: Assume without loss of generality that $\epsilon \leq 1/100$. Let $\alpha = \epsilon \exp\left(-\sqrt{\ln\frac{1}{\epsilon}}\right)$; since $\epsilon \leq 1/100$, we have $\alpha \leq \epsilon/4$.

Choose a target function $f \in F$. For each $g \in F$, define $g_f : X \to \{0,1\}$ by $g_f(x) = |g(x) - f(x)|$. Let $F_f = \{g_f : g \in F\}$. Since $g_f(x) \neq h_f(x)$ exactly when $g(x) \neq h(x)$, the doubling dimension of $F_f$ is the same as the doubling dimension of $F$; the VC-dimension of $F_f$ is also known to be the same as the VC-dimension of $F$ (see [32]).

Construct an $\alpha$-packing $G$ greedily, by repeatedly adding an element of $F_f$ to $G$ for as long as this is possible. This packing is also an $\alpha$-cover. For each $g_f \in F_f$, let $\pi(g_f)$ be its nearest neighbor in $G$, so that $\mathbf{E}(|g_f - \pi(g_f)|) \leq \alpha \leq \epsilon/4$. By the triangle inequality,

$$\mathbf{E}(g_f) > \epsilon \ \text{ implies } \ \mathbf{E}(\pi(g_f)) > 3\epsilon/4. \qquad (3)$$

The triangle inequality also yields

$$\mathbf{E}_u(g_f) = 0 \ \text{ implies } \ \mathbf{E}_u(\pi(g_f)) \leq \mathbf{E}_u(|g_f - \pi(g_f)|).$$

Combining this with (3), we have

$$\Pr\left(\exists g \in F,\ \mathbf{E}(g_f) > \epsilon \text{ but } \mathbf{E}_u(g_f) = 0\right) \leq \Pr\left(\exists g \in F,\ \mathbf{E}(\pi(g_f)) > 3\epsilon/4 \text{ but } \mathbf{E}_u(\pi(g_f)) \leq \mathbf{E}_u(|g_f - \pi(g_f)|)\right). \qquad (4)$$

We have

$$\Pr\left(\exists g \in F,\ \mathbf{E}(\pi(g_f)) > 3\epsilon/4 \text{ but } \mathbf{E}_u(\pi(g_f)) \leq \mathbf{E}_u(|g_f - \pi(g_f)|)\right)$$
$$\leq\ \Pr\left(\exists g \in F,\ \mathbf{E}_u(|g_f - \pi(g_f)|) \geq \epsilon/4\right) + \Pr\left(\exists h \in G,\ \mathbf{E}(h) > 3\epsilon/4 \text{ but } \mathbf{E}_u(h) \leq \epsilon/4\right)$$
$$\leq\ \Pr\left(\exists g \in F,\ \mathbf{E}_u(|g_f - \pi(g_f)|) \geq \epsilon/4\right) + \sum_{i=0}^{\lceil \log_2 \frac{1}{\epsilon} \rceil} \left(\frac{2^{i+3}\epsilon}{\alpha}\right)^d e^{-c\,2^i \epsilon m},$$

where $c > 0$ is an absolute constant, by Lemma 3 and the standard Chernoff bound. Computing a geometric sum exactly as in the proof of Theorem 2, we have that $m \geq \frac{c'}{\epsilon}\left(d \log\frac{\epsilon}{\alpha} + d + \log\frac{1}{\delta}\right)$ makes the sum at most $\delta/2$, for an absolute constant $c' > 0$. By plugging in the value of $\alpha$ and solving, we can see that

$$m = O\left(\frac{d\sqrt{\log\frac{1}{\epsilon}} + \log\frac{1}{\delta}}{\epsilon}\right)$$

suffices for

$$\Pr\left(\exists h \in G,\ \mathbf{E}(h) > 3\epsilon/4 \text{ but } \mathbf{E}_u(h) \leq \epsilon/4\right) \leq \delta/2. \qquad (5)$$

Since $\mathbf{E}(|g_f - \pi(g_f)|) \leq \alpha$ for all $g \in F$, applying Lemma 5 with $K = \frac{\epsilon}{4\alpha}$ (to the class $\{|g_f - \pi(g_f)| : g \in F\}$, whose VC-dimension is $O(d)$), we get that there is an absolute constant $c'' > 0$ such that

$$m \geq \frac{c''}{K\alpha \log K}\left(d \log\frac{1}{\alpha} + \log\frac{1}{\delta}\right) \qquad (6)$$

also suffices for

$$\Pr\left(\exists g \in F,\ \mathbf{E}_u(|g_f - \pi(g_f)|) \geq \epsilon/4\right) \leq \delta/2.$$

Substituting the value of $\alpha$ into (6), it is sufficient that

$$m \geq \frac{c'''}{\epsilon}\left(\frac{d \log\frac{1}{\epsilon}}{\sqrt{\log\frac{1}{\epsilon}}} + \log\frac{1}{\delta}\right) = \frac{c'''}{\epsilon}\left(d\sqrt{\log\frac{1}{\epsilon}} + \log\frac{1}{\delta}\right)$$

for an absolute constant $c''' > 0$. Putting this together with (5) and (4) completes the proof.
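Section 4's motivation is computational: when $F$ consists of halfspaces through the origin, a hypothesis with zero training error can, when one exists, be found in polynomial time by linear programming. The following is a minimal sketch of such a consistent hypothesis finder, assuming scipy is available; the margin-1 normalization is a convenience choice made here, not something from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """Try to find w such that the homogeneous halfspace h(x) = 1[w . x >= 0]
    classifies every training example correctly.

    X: (m, n) array of points; y: length-m array with entries in {0, 1}.
    Returns w, or None if the LP finds no strictly separating halfspace."""
    s = 2.0 * np.asarray(y) - 1.0          # map labels {0, 1} -> {-1, +1}
    A_ub = -s[:, None] * np.asarray(X)     # s_i (w . x_i) >= 1  <=>  -s_i x_i . w <= -1
    b_ub = -np.ones(len(s))
    n = np.asarray(X).shape[1]
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n, method="highs")
    return res.x if res.status == 0 else None
```

A learner that draws a training set, runs this finder, and outputs the resulting halfspace is, up to the degenerate cases excluded by the margin normalization, a consistent hypothesis finder in the sense of Section 2.1, so Theorem 6 applies to it.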

5 Halfspaces and the uniform distribution

Proposition 7 If $U_n$ is the uniform distribution over the unit ball in $\mathbf{R}^n$, and $H_n$ is the set of halfspaces that go through the origin, then the doubling dimension of $(H_n, d_{U_n})$ is $O(n)$.

Proof: Choose $f \in H_n$ and $\alpha > 0$. We will show that the ball of radius $\alpha$ centered at $f$ can be covered by $2^{O(n)}$ balls of radius $\alpha/2$.

Suppose $U_{H_n}$ is the probability distribution over $H_n$ obtained by choosing a normal vector $w$ uniformly from the unit ball and outputting $\{x : w \cdot x \geq 0\}$. The argument will be a volume argument using $U_{H_n}$. It is known (see Lemma 4 of [25]) that

$$\Pr_{g \sim U_{H_n}}\left(d_{U_n}(f, g) \leq \alpha\right) \geq (c_1 \alpha)^n,$$

where $c_1 > 0$ is an absolute constant independent of $\alpha$ and $n$. Furthermore,

$$\Pr_{g \sim U_{H_n}}\left(d_{U_n}(f, g) \leq \alpha\right) \leq (c_2 \alpha)^n,$$

where $c_2 > 0$ is another absolute constant.

Suppose we arbitrarily choose $g_1, \ldots, g_N \in H_n$ that are at a distance at most $\alpha$ from $f$, but more than $\alpha/2$ far from one another. By the triangle inequality, the $\alpha/4$-balls centered at $g_1, \ldots, g_N$ are disjoint. Thus, the probability that a random element of $H_n$ is in a ball of radius $\alpha/4$ centered at one of $g_1, \ldots, g_N$ is at least $N (c_1 \alpha/4)^n$. On the other hand, since each $g_i$ has distance at most $\alpha$ from $f$, any element of an $\alpha/4$-ball centered at one of them is at most $\alpha + \alpha/4 \leq 2\alpha$ far from $f$. Thus, the union of the $\alpha/4$-balls centered at $g_1, \ldots, g_N$ is contained in the $2\alpha$-ball centered at $f$. Thus $N (c_1 \alpha/4)^n \leq (2 c_2 \alpha)^n$, which implies $N \leq (8 c_2/c_1)^n = 2^{O(n)}$. Since a maximal such collection is an $\alpha/2$-cover of the $\alpha$-ball centered at $f$, this completes the proof.

6 Separation

Theorem 8 For all $\alpha \in (0, 1/2]$ and all positive integers $d$, there is a set $F$ of classifiers and a probability distribution $D$ over their common domain with the following properties: the VC-dimension of $F$ is at most $d$; $|F| \geq \frac{1}{2}(2e\alpha)^{-d/2}$; and, for each pair of distinct $f, g \in F$, $\alpha \leq d_D(f, g) \leq 2\alpha$.

This proof uses the probabilistic method. We begin with the following lemma.

Lemma 9 Choose positive integers $d \leq n$ and a set $A \subseteq \{1, \ldots, n\}$. Suppose $S$ is chosen uniformly at random from among the subsets of $\{1, \ldots, n\}$ of size $d$. Then, for any $k > 0$,

$$\Pr\left(|S \cap A| \geq k\right) \leq \left(\frac{e\,\mathbf{E}(|S \cap A|)}{k}\right)^k.$$

Proof: in Appendix A.

Now we are ready for the proof of Theorem 8, which uses the deletion technique (see [2]).

Proof (of Theorem 8): Set the domain to be $\{1, \ldots, n\}$, where $n = d/\alpha$. Let $N = \lfloor (2e\alpha)^{-d/2} \rfloor$. Suppose $f_1, \ldots, f_N$ are chosen independently, uniformly at random from among the classifiers that evaluate to 1 on exactly $d$ elements of $\{1, \ldots, n\}$. For any distinct $i, j$, suppose $f_i^{-1}(1)$ is fixed, and think of the members of $f_j^{-1}(1)$ as being chosen one at a time. The probability that any given one of the $d$ elements of $f_j^{-1}(1)$ is also in $f_i^{-1}(1)$ is $d/n$. Applying the linearity of expectation, and averaging over the different possibilities for $f_i^{-1}(1)$, we get

$$\mathbf{E}\left(|f_i^{-1}(1) \cap f_j^{-1}(1)|\right) = \frac{d^2}{n}.$$
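Written out (with $A = f_i^{-1}(1)$ fixed and $B = f_j^{-1}(1)$ a uniformly random $d$-element subset of the $n$-point domain), the expectation computation in the last step is

$$\mathbf{E}\left(|A \cap B|\right) = \sum_{x \in A} \Pr(x \in B) = d \cdot \frac{d}{n} = \frac{d^2}{n} = \alpha d,$$

using $n = d/\alpha$.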

Applying Lemma 9,

$$\Pr\left(|f_i^{-1}(1) \cap f_j^{-1}(1)| \geq d/2\right) \leq \left(\frac{e \cdot \alpha d}{d/2}\right)^{d/2} = (2e\alpha)^{d/2}.$$

Thus, the expected number of pairs $i < j$ such that $|f_i^{-1}(1) \cap f_j^{-1}(1)| \geq d/2$ is at most

$$\binom{N}{2} (2e\alpha)^{d/2} \leq \frac{N^2 (2e\alpha)^{d/2}}{2} \leq \frac{N}{2}.$$

This implies that there exist $f_1, \ldots, f_N$ such that the number of pairs $i < j$ with $|f_i^{-1}(1) \cap f_j^{-1}(1)| \geq d/2$ is at most $N/2$. If we delete one element from each such pair, and form $F$ from what remains, then each pair of distinct elements $f, g$ of $F$ satisfies

$$|f^{-1}(1) \cap g^{-1}(1)| < d/2. \qquad (7)$$

If $D$ is the uniform distribution over $\{1, \ldots, n\}$, then (7) implies $d_D(f, g) \geq \alpha$: since $|f^{-1}(1)| = |g^{-1}(1)| = d$, we have $d_D(f, g) = \frac{2d - 2|f^{-1}(1) \cap g^{-1}(1)|}{n} > \frac{d}{n} = \alpha$, and also $d_D(f, g) \leq \frac{2d}{n} = 2\alpha$. The number of elements of $F$ is at least $N - N/2 = N/2$. Since each $f \in F$ has $|f^{-1}(1)| = d$, no function in $F$ evaluates to 1 on every element of any set of $d + 1$ domain elements. Thus, the VC-dimension of $F$ is at most $d$.

Theorem 8 implies that there is no bound on the doubling dimension of $(F, d_D)$ in terms of the VC-dimension of $F$. For any constraint on the VC-dimension, a set satisfying the constraint can have arbitrarily large doubling dimension by setting the value of $\alpha$ in Theorem 8 arbitrarily small.

Acknowledgement

We thank Gábor Lugosi and Tong Zhang for their help.

References

[1] K. Alexander. Rates of growth for weighted empirical processes. In Proc. of Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, volume 2, pages 475-493, 1985.
[2] N. Alon, J. H. Spencer, and P. Erdős. The Probabilistic Method. Wiley, 1992.
[3] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[4] P. Assouad. Plongements lipschitziens dans R^n. Bull. Soc. Math. France, 111(4):429-448, 1983.
[5] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005.
[6] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. ICML, 2006.
[7] H. T. H. Chan, A. Gupta, B. M. Maggs, and S. Zhou. On hierarchical routing in doubling metrics. SODA, 2005.
[8] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
[9] D. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Structures & Algorithms, 13(2):99-124, Sept. 1998.
[10] D. Dubhashi, V. Priebe, and D. Ranjan. Negative dependence through the FKG inequality. Technical Report RS-96-27, BRICS, 1996.
[11] R. M. Dudley. Central limit theorems for empirical measures. Annals of Probability, 6(6):899-929, 1978.
[12] R. M. Dudley. A course on empirical processes. Lecture Notes in Mathematics, 1097:2-142, 1984.
[13] A. Gupta, R. Krauthgamer, and J. R. Lee. Bounded geometries, fractals, and low-distortion embeddings. FOCS, 2003.
[14] S. Har-Peled and M. Mendel. Fast construction of nets in low dimensional metrics, and their applications. SICOMP, 35(5):1148-1184, 2006.

[15] D. Haussler. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217-232, 1995.
[16] K. Joag-Dev and F. Proschan. Negative association of random variables, with applications. The Annals of Statistics, 11(1):286-295, 1983.
[17] J. Kleinberg, A. Slivkins, and T. Wexler. Triangulation and embedding using small sets of beacons. FOCS, 2004.
[18] A. N. Kolmogorov and V. M. Tihomirov. ε-entropy and ε-capacity of sets in functional spaces. American Mathematical Society Translations (Ser. 2), 17:277-364, 1961.
[19] J. Komlós, J. Pach, and G. Woeginger. Almost tight bounds for epsilon-nets. Discrete and Computational Geometry, 7:163-173, 1992.
[20] R. Krauthgamer and J. R. Lee. The black-box complexity of nearest neighbor search. ICALP, 2004.
[21] R. Krauthgamer and J. R. Lee. Navigating nets: simple algorithms for proximity search. SODA, 2004.
[22] F. Kuhn, T. Moscibroda, and R. Wattenhofer. On the locality of bounded growth. PODC, 2005.
[23] P. M. Long. On the sample complexity of PAC learning halfspaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556-1559, 1995.
[24] P. M. Long. Using the pseudo-dimension to analyze approximation algorithms for integer programming. Proceedings of the Seventh International Workshop on Algorithms and Data Structures, 2001.
[25] P. M. Long. An upper bound on the sample complexity of PAC learning halfspaces with respect to the uniform distribution. Information Processing Letters, 87(5):229-234, 2003.
[26] S. Mendelson. Estimating the performance of kernel classes. Journal of Machine Learning Research, 4:759-771, 2003.
[27] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[28] A. Slivkins. Distance estimation and object location via rings of neighbors. PODC, 2005.
[29] K. Talwar. Bypassing the embedding: Approximation schemes and compact representations for low dimensional metrics. STOC, 2004.
[30] S. van de Geer. Empirical Processes in M-estimation. Cambridge Series in Statistical and Probabilistic Mathematics, 2000.
[31] A. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.
[32] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, 1982.
[33] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[34] T. Zhang. Information theoretical upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 2006, to appear.

A Proof of Lemma 9

Definition 10 ([16]) A collection $X_1, \ldots, X_n$ of random variables is negatively associated if for every disjoint pair $I, J \subseteq \{1, \ldots, n\}$ of index sets, and for every pair $f : \mathbf{R}^{|I|} \to \mathbf{R}$ and $g : \mathbf{R}^{|J|} \to \mathbf{R}$ of non-decreasing functions, we have

$$\mathbf{E}\left(f(X_i, i \in I)\, g(X_j, j \in J)\right) \leq \mathbf{E}\left(f(X_i, i \in I)\right) \mathbf{E}\left(g(X_j, j \in J)\right).$$

Lemma 11 ([10]) If $S$ is chosen uniformly at random from among the subsets of $\{1, \ldots, n\}$ with exactly $d$ elements, and $X_j = 1$ if $j \in S$ and $X_j = 0$ otherwise, then $X_1, \ldots, X_n$ are negatively associated.

Lemma 12 ([9]) Collections $X_1, \ldots, X_n$ of negatively associated random variables satisfy Chernoff bounds: for any $\lambda > 0$,

$$\mathbf{E}\left(\exp\left(\lambda \sum_{i=1}^n X_i\right)\right) \leq \prod_{i=1}^n \mathbf{E}\left(\exp(\lambda X_i)\right).$$

Proof of Lemma 9: Let $X_j \in \{0, 1\}$ indicate whether $j \in S$. By Lemma 11, $X_1, \ldots, X_n$ are negatively associated. We have $|S \cap A| = \sum_{j \in A} X_j$. Combining Lemma 12 with a standard Chernoff-Hoeffding bound (see Theorem 4.1 of [27]) completes the proof.
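As a small illustration of the deletion argument behind Theorem 8 (and of Lemma 9 at work), the construction is easy to simulate. The parameter values below are arbitrary choices for this sketch, not from the paper.

```python
import itertools
import random

def separation_construction(d, n, N, seed=0):
    """Simulate the deletion argument of Theorem 8: draw N random d-subsets of
    {0, ..., n-1}, then delete one member of every pair whose supports overlap
    in at least d/2 points."""
    rng = random.Random(seed)
    supports = [frozenset(rng.sample(range(n), d)) for _ in range(N)]
    alive = set(range(N))
    for i, j in itertools.combinations(range(N), 2):
        if i in alive and j in alive and len(supports[i] & supports[j]) >= d / 2:
            alive.discard(j)
    kept = [supports[i] for i in alive]
    # d_D(f, g) under the uniform distribution on the n domain points equals
    # |support(f) symmetric-difference support(g)| / n.
    dists = [len(a ^ b) / n for a, b in itertools.combinations(kept, 2)]
    return kept, dists

d, n, N = 6, 300, 200          # alpha = d / n = 0.02
alpha = d / n
kept, dists = separation_construction(d, n, N)
print(len(kept), "classifiers kept;",
      "all pairwise distances in [alpha, 2*alpha]:",
      all(alpha <= t <= 2 * alpha for t in dists))
```

In a typical run with these parameters very few pairs overlap in $d/2$ or more points, so almost all of the $N$ random classifiers survive the deletion step, and every surviving pair is at distance between $\alpha$ and $2\alpha$.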