On A-distance and Relative A-distance


ADAPTIVE COMMUNICATIONS AND SIGNAL PROCESSING LABORATORY
CORNELL UNIVERSITY, ITHACA, NY

On A-distance and Relative A-distance

Ting He and Lang Tong

Technical Report No. ACSP-TR
August 2004

I. INTRODUCTION

We give a method to measure the distance between two probability distributions and, based on this distance measure, we bound the probability that the distance between the empirical distribution and the actual distribution exceeds a certain level. The direct implication of our result is that, for a large sample size, one can replace the actual probability with the corresponding empirical probability at an arbitrarily small error. The proofs of the theorems are based on the Vapnik-Chervonenkis theory [6] and Anthony and Shawe-Taylor's extension of the Vapnik-Chervonenkis theory [4].

II. DISTANCE MEASURE

A-distance: Fix a measure space and let $\mathcal{A}$ be a collection of measurable sets. Let $P_1$ and $P_2$ be probability distributions over this space. The A-distance between $P_1$ and $P_2$ is defined as
$$d_{\mathcal{A}}(P_1,P_2) = \sup_{A \in \mathcal{A}} |P_1(A) - P_2(A)|.$$
For finite sample sets $S_1$ and $S_2$, $d_{\mathcal{A}}(S_1,S_2)$ is defined similarly by replacing $P_i(A)$ with $S_i(A) = |S_i \cap A| / |S_i|$.

The following notion of relative A-distance offers a way to take the relative magnitude of a change into account.

Relative A-distance: Let $P_1, P_2$ be two probability distributions over the same measure space, let $\mathcal{A}$ denote a family of measurable subsets of that space, and let $A$ be a set in $\mathcal{A}$. The relative A-distance between $P_1$ and $P_2$ is defined as
$$\phi_{\mathcal{A}}(P_1,P_2) = \sup_{A \in \mathcal{A}} \frac{|P_1(A) - P_2(A)|}{\sqrt{(P_1(A) + P_2(A))/2}}.$$
For empirical distances, simply replace $P_i(A)$ with the empirical measure $S_i(A) = |S_i \cap A| / |S_i|$.

It is easy to see that the A-distance is a metric. For the proof that the relative A-distance is a metric, see [5].
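To make the empirical quantities concrete, here is a minimal sketch (not from the report) that computes $d_{\mathcal{A}}(S_1,S_2)$ and $\phi_{\mathcal{A}}(S_1,S_2)$ for a finite family of half-lines $\mathcal{A} = \{(-\infty, t] : t \in T\}$. The threshold grid, the test distributions, and the $\sqrt{(\cdot+\cdot)/2}$ normalization coded in `relative_a_distance` follow the reconstruction above and are illustrative assumptions, not code from the report.

```python
import numpy as np

def empirical_measures(sample, thresholds):
    """S(A) = |S ∩ A| / |S| for each half-line A = (-inf, t], t in thresholds."""
    sample = np.asarray(sample, dtype=float)
    return np.array([(sample <= t).mean() for t in thresholds])

def a_distance(s1, s2, thresholds):
    """Empirical A-distance: sup over A of |S1(A) - S2(A)| (a max over the finite family)."""
    m1 = empirical_measures(s1, thresholds)
    m2 = empirical_measures(s2, thresholds)
    return np.max(np.abs(m1 - m2))

def relative_a_distance(s1, s2, thresholds):
    """Empirical relative A-distance: sup over A of |S1(A)-S2(A)| / sqrt((S1(A)+S2(A))/2),
    with the convention 0/0 = 0 when both empirical measures vanish."""
    m1 = empirical_measures(s1, thresholds)
    m2 = empirical_measures(s2, thresholds)
    num = np.abs(m1 - m2)
    den = np.sqrt((m1 + m2) / 2.0)
    ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.max(ratio)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s1 = rng.normal(0.0, 1.0, size=2000)   # sample from P1
    s2 = rng.normal(0.3, 1.0, size=2000)   # sample from P2 (shifted mean)
    ts = np.linspace(-4.0, 4.0, 81)        # finite threshold grid T
    print("d_A(S1, S2)   =", a_distance(s1, s2, ts))
    print("phi_A(S1, S2) =", relative_a_distance(s1, s2, ts))
```

Because the family here is finite, the suprema reduce to maxima over the threshold grid; for richer families one would need a cover or an explicit parametrization to evaluate them.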

III. VC BOUNDS

The following theorems, derived from [6] and [4], guarantee the rate at which the empirical distance converges to the underlying distance for both distance notions.

For the A-distance, we have the following theorems.

Theorem 3.1 (Vapnik-Chervonenkis Inequality): Let $P$ be a probability distribution over a domain $X$ and let $S$ be a collection of $n$ i.i.d. samples drawn from $P$. Then for a family $\mathcal{A}$ of subsets of $X$ and a constant $\epsilon \in (0,1)$,
$$P^n\{\sup_{A \in \mathcal{A}} |S(A) - P(A)| > \epsilon\} \le 4\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/8},$$
where $\Pi_{\mathcal{A}}(n)$ is the shatter coefficient [3]. If $\mathcal{A}$ has a finite VC dimension $d$, then by Sauer's Lemma, $\Pi_{\mathcal{A}}(n) \le (n+1)^d$ for all $n$.

Using Theorem 3.1, it is easy to derive the following corollary.

Corollary 3.2: Let $P_1$, $P_2$ be any probability distributions over some domain $X$, let $\mathcal{A}$ be a family of subsets of $X$, and let $\epsilon \in (0,1)$. If $S_1$, $S_2$ are i.i.d. $n$-samples drawn from $P_1$, $P_2$ respectively, then
$$P^n\big[\exists A \in \mathcal{A}: \big|\, |P_1(A) - P_2(A)| - |S_1(A) - S_2(A)| \,\big| \ge \epsilon\big] < 8\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/32},$$
where $P^n$ in the above inequality is the probability over the pairs of samples $(S_1,S_2)$ induced by the sample-generating distributions $(P_1,P_2)$.

Proof: Simple algebra yields the result:
$$\Pr\{\exists A \in \mathcal{A}: \big|\, |P_1(A) - P_2(A)| - |S_1(A) - S_2(A)| \,\big| \ge \epsilon\}$$
$$\le \Pr\{\sup_{A \in \mathcal{A}} |P_1(A) - P_2(A) - S_1(A) + S_2(A)| \ge \epsilon\} \quad (1)$$
$$\le \Pr\{\sup_{A \in \mathcal{A}} \big(|P_1(A) - S_1(A)| + |P_2(A) - S_2(A)|\big) \ge \epsilon\} \quad (2)$$
$$\le \Pr\{\{\sup_{A \in \mathcal{A}} |P_1(A) - S_1(A)| \ge \epsilon/2\} \cup \{\sup_{A \in \mathcal{A}} |P_2(A) - S_2(A)| \ge \epsilon/2\}\} \quad (3)$$
$$\le \Pr\{\sup_{A \in \mathcal{A}} |P_1(A) - S_1(A)| \ge \epsilon/2\} + \Pr\{\sup_{A \in \mathcal{A}} |P_2(A) - S_2(A)| \ge \epsilon/2\} \quad (4)$$
$$\le 8\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/32}, \quad (5)$$
where the last inequality comes from Theorem 3.1.
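As a rough numerical companion to Corollary 3.2 (not part of the report), the sketch below plugs Sauer's Lemma $\Pi_{\mathcal{A}}(n) \le (n+1)^d$ into the right-hand side and searches for the smallest $n$ at which $8(n+1)^d e^{-n\epsilon^2/32}$ drops below a target failure probability $\delta$. The function names and the particular values of $d$, $\epsilon$, and $\delta$ are arbitrary choices for the example.

```python
import math

def corollary_3_2_bound(n, d, eps):
    """Right-hand side of Corollary 3.2 after Sauer's Lemma: 8 (n+1)^d exp(-n eps^2 / 32)."""
    return 8.0 * (n + 1) ** d * math.exp(-n * eps ** 2 / 32.0)

def sample_size(d, eps, delta, n_max=10**9):
    """Smallest n with corollary_3_2_bound(n, d, eps) <= delta, by doubling then bisection."""
    assert 0 < delta < 1  # for delta < 1 the bound exceeds delta at n = 1, so the search is valid
    hi = 2
    while corollary_3_2_bound(hi, d, eps) > delta:
        hi *= 2
        if hi > n_max:
            raise ValueError("bound does not reach delta within n_max samples")
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if corollary_3_2_bound(mid, d, eps) <= delta:
            hi = mid
        else:
            lo = mid + 1
    return lo

if __name__ == "__main__":
    # Example: VC dimension d = 1 (half-lines), deviation eps = 0.1, failure probability delta = 0.05.
    n = sample_size(d=1, eps=0.1, delta=0.05)
    print("n =", n, " bound at n =", corollary_3_2_bound(n, 1, 0.1))
```

The resulting $n$ is large (tens of thousands for these settings) because the constants in VC-type bounds are loose; the content of the corollary is the exponential dependence on $n\epsilon^2$ rather than the absolute numbers.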

We thus have a way to bound the probability that the empirical A-distance deviates from the true A-distance in either direction. The theoretical guarantee can be improved by considering the relative A-distance: we can obtain results similar to Theorem 3.1 and Corollary 3.2 for the metric $\phi_{\mathcal{A}}(P_1,P_2)$. We start with the following result of Anthony and Shawe-Taylor [4].

Lemma 3.3: Let $\mathcal{A}$ be a family of subsets of the domain $X$ and let $P$ be any probability distribution over $X$. If $S_1$ and $S_2$ are two collections of $n$ samples each, drawn i.i.d. from $P$, then
$$P^n(\phi_{\mathcal{A}}(S_1,S_2) > \epsilon) \le \Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}$$
(where $P^n$ is the probability that $P$ induces over the choice of samples).

In [4], Anthony and Shawe-Taylor proved that
$$\Pr\Big\{\sup_{A \in \mathcal{A}} \frac{S_1(A) - S_2(A)}{\sqrt{(S_1(A) + S_2(A))/2}} > \epsilon\Big\} \le \Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}.$$
By the symmetry of $S_1(A)$ and $S_2(A)$, the result in Lemma 3.3 holds.

Theorem 3.4: Let $\mathcal{A}$ be a family of subsets of the domain $X$, let $P$ be any probability distribution over $X$, and let $S$ be a set of $n$ samples, each drawn i.i.d. from $P$. Then
$$P^n(\phi_{\mathcal{A}}(S,P) > \epsilon) \le 8\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}$$
(where $P^n$ is the $n$-th power of $P$, the probability that $P$ induces over the choice of samples). The proof of this theorem is similar to the proof in [4].

Proof: Define
$$Q = \Big\{S \in X^n : \exists A \in \mathcal{A} \text{ s.t. } \frac{P(A) - S(A)}{\sqrt{(P(A) + S(A))/2}} > \epsilon\Big\},$$
$$R = \Big\{S S' \in X^{2n} : \exists A \in \mathcal{A} \text{ s.t. } \frac{|S(A) - S'(A)|}{\sqrt{(S(A) + S'(A))/2}} > \epsilon\Big\},$$
where $S$, $S'$ are two sets of $n$ samples each, drawn i.i.d. from $P$.

We claim that $\Pr(Q) \le 4\Pr(R)$ for $n > 4/\epsilon^2$. This is true because of the following. Suppose $S \in Q$, so there is $C \in \mathcal{A}$ such that $(P(C) - S(C))/\sqrt{(P(C) + S(C))/2} > \epsilon$. Hence
$$S(C) < P(C) + \frac{\epsilon^2}{4} - \epsilon\sqrt{\frac{\epsilon^2}{16} + P(C)}.$$
Noting that $S(C) \ge 0$, some simple calculation shows that $P(C) > \epsilon^2/2$.

If we draw another set of $n$ samples $S'$, each drawn i.i.d. from $P$, and define
$$F = \frac{S'(C) - S(C)}{\sqrt{(S'(C) + S(C))/2}},$$
we have $F > \epsilon$ if $S'(C) > P(C)$. This is because the function $f(x,y) = (x - y)/\sqrt{(x + y)/2}$ is monotonically increasing in $x$ and monotonically decreasing in $y$ for $x, y \in (0,1)$ (taking derivatives easily verifies this). So the infimum of $F$ is achieved when $S(C) = P(C) + \epsilon^2/4 - \epsilon\sqrt{\epsilon^2/16 + P(C)}$ and $S'(C) = P(C)$. Plugging in yields the value $\epsilon$, and the strict inequality follows from the strict inequalities on $S(C)$ and $S'(C)$.

The random variable $nS'(C)$ has the binomial distribution $B(n, P(C))$. Since $P(C) > \epsilon^2/2$, for $n > 4/\epsilon^2$ we have $nP(C) > 2$, and then $nS'(C) > nP(C)$ with probability at least $1/4$ ([4]). Therefore, for $n > 4/\epsilon^2$, we have $\Pr(Q) \le 4\Pr(R)$.

In [4], it is proved that $\Pr(R) \le \Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}$. Thus
$$P^n\Big\{\sup_{A \in \mathcal{A}} \frac{P(A) - S(A)}{\sqrt{(P(A) + S(A))/2}} > \epsilon\Big\} \le 4\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}.$$
Note that this inequality is trivially satisfied if $n \le 4/\epsilon^2$. By symmetry, we have
$$P^n(\phi_{\mathcal{A}}(S,P) > \epsilon) \le 8\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}.$$
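The algebra behind this proof can be checked numerically. The short sketch below (mine, not the report's) confirms that $S(C) = P(C) + \epsilon^2/4 - \epsilon\sqrt{\epsilon^2/16 + P(C)}$ is exactly the boundary value at which $(P(C)-S(C))/\sqrt{(P(C)+S(C))/2} = \epsilon$, and that $f(x,y) = (x-y)/\sqrt{(x+y)/2}$ is increasing in $x$ and decreasing in $y$; it assumes the $\sqrt{(\cdot+\cdot)/2}$ normalization used throughout this reconstruction.

```python
import math
import random

def rel(p, s):
    """One-sided relative discrepancy (p - s) / sqrt((p + s) / 2), as reconstructed above."""
    return (p - s) / math.sqrt((p + s) / 2.0)

def boundary_s(p, eps):
    """Smaller root S(C) = P(C) + eps^2/4 - eps * sqrt(eps^2/16 + P(C)) from the proof."""
    return p + eps ** 2 / 4.0 - eps * math.sqrt(eps ** 2 / 16.0 + p)

# f(x, y) = (x - y) / sqrt((x + y) / 2) is increasing in x and decreasing in y.
x, y, h = 0.4, 0.2, 1e-6
assert rel(x + h, y) > rel(x, y) and rel(x, y + h) < rel(x, y)

random.seed(1)
for _ in range(5):
    eps = random.uniform(0.05, 0.5)
    p = random.uniform(eps ** 2 / 2.0 + 0.01, 0.9)  # the proof requires P(C) > eps^2/2
    s = boundary_s(p, eps)
    assert s > 0.0                                   # the smaller root is positive here
    # At the boundary the one-sided discrepancy equals eps (up to rounding), so any
    # S(C) below this value gives a discrepancy strictly greater than eps.
    assert abs(rel(p, s) - eps) < 1e-9
    print(f"eps={eps:.3f}  P(C)={p:.3f}  S(C)={s:.4f}  rel(P,S)={rel(p, s):.6f}")
```

A Monte Carlo experiment along the same lines could also be used to probe Theorem 3.4 empirically for a small VC class, though the constants make the bound loose at moderate $n$.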

Similar to Corollary 3.2, we have the following corollary of Theorem 3.4, which bounds the probability that the empirical relative A-distance deviates from the true relative A-distance.

Corollary 3.5: Let $P_1$, $P_2$ be any probability distributions over some domain $X$, let $\mathcal{A}$ be a family of subsets of $X$, and let $\epsilon \in (0,1)$. If $S_1$, $S_2$ are two collections of $n$ samples each, drawn i.i.d. from $P_1$, $P_2$ respectively, then
$$P^n\big[\, |\phi_{\mathcal{A}}(P_1,P_2) - \phi_{\mathcal{A}}(S_1,S_2)| > \epsilon \,\big] \le 16\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/16},$$
where $P^n$ in the above inequality is the probability over the pairs of samples $(S_1,S_2)$ induced by the sample-generating distributions $(P_1,P_2)$.

Proof: Because $\phi_{\mathcal{A}}(\cdot,\cdot)$ is a metric ([5]), we have
$$\phi_{\mathcal{A}}(P_1,P_2) \le \phi_{\mathcal{A}}(P_1,S_1) + \phi_{\mathcal{A}}(S_1,S_2) + \phi_{\mathcal{A}}(S_2,P_2)$$
and
$$\phi_{\mathcal{A}}(S_1,S_2) \le \phi_{\mathcal{A}}(S_1,P_1) + \phi_{\mathcal{A}}(P_1,P_2) + \phi_{\mathcal{A}}(P_2,S_2).$$
Therefore,
$$|\phi_{\mathcal{A}}(P_1,P_2) - \phi_{\mathcal{A}}(S_1,S_2)| \le \phi_{\mathcal{A}}(P_1,S_1) + \phi_{\mathcal{A}}(S_2,P_2),$$
and
$$\Pr\{|\phi_{\mathcal{A}}(P_1,P_2) - \phi_{\mathcal{A}}(S_1,S_2)| > \epsilon\} \le \Pr\{\phi_{\mathcal{A}}(P_1,S_1) + \phi_{\mathcal{A}}(P_2,S_2) > \epsilon\} \quad (6)$$
$$\le \Pr\{\phi_{\mathcal{A}}(P_1,S_1) > \epsilon/2\} + \Pr\{\phi_{\mathcal{A}}(P_2,S_2) > \epsilon/2\} \quad (7)$$
$$\le 16\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/16}, \quad (8)$$
where the last inequality comes from Theorem 3.4.

REFERENCES

[1] B. Brodsky and B. Darkovsky, Non-Parametric Methods in Change-Point Problems, Kluwer Academic, The Netherlands.
[2] J. Shao, Mathematical Statistics, Springer.
[3] L. Gyorfi, Principles of Nonparametric Learning, Springer Wien New York, 2002.
[4] M. Anthony and J. Shawe-Taylor, "A result of Vapnik with applications," Discrete Applied Mathematics, vol. 47, 1993.
[5] S. Ben-David, J. Gehrke and D. Kifer, "Detecting Change in Data Streams," in Proc. 2004 VLDB Conference, Toronto, Canada, 2004.
[6] V. N. Vapnik and A. Ya. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Theory of Probability and its Applications, vol. 16, pp. 264-280, 1971.
