Rényi divergence and asymptotic theory of minimal K-point random graphs
Alfred Hero, Dept. of EECS, The University of Michigan, Summer 1999


Outline
• Rényi Entropy and Rényi Divergence
• Euclidean k-minimal graphs
• Asymptotics for tightly coverable graphs
• Asymptotics for greedy k-minimal graphs
• Influence function for greedy k-minimal graph
• Applications

1. Rényi Entropy and Rényi Divergence
• $X \sim f(x)$: a d-dimensional random vector.
• Rényi entropy of order $\nu$:
$$H_\nu(f) = \frac{1}{1-\nu} \ln \int f^\nu(x)\,dx \qquad (1)$$
• Rényi divergence of order $\nu$:
$$I_\nu(f, f_o) = \frac{1}{\nu-1} \ln \int \left(\frac{f(x)}{f_o(x)}\right)^{\nu} f_o(x)\,dx \qquad (2)$$
• $f_o$: a dominating Lebesgue density.

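Examples:
• Hellinger affinity ($\nu = 1/2$), a monotone function of the squared Hellinger distance:
$$I_{1/2}(f, f_o) = -\ln \left( \int \sqrt{f(x) f_o(x)}\,dx \right)^2$$
• Kullback-Leibler divergence ($\nu \to 1$):
$$\lim_{\nu \to 1} I_\nu(f, f_o) = \int f(x) \ln \frac{f(x)}{f_o(x)}\,dx$$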
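The entropy and divergence definitions can be checked numerically. The following is a minimal sketch, not taken from the slides: it assumes one-dimensional densities on [0, 1], the convention $H_\nu = (1-\nu)^{-1} \ln \int f^\nu$ and $I_\nu = (\nu-1)^{-1} \ln \int (f/f_o)^\nu f_o$, and a midpoint-rule integral. The function names, the grid size, and the two test densities are illustrative assumptions.

```python
import math

def renyi_entropy(f, nu, n_grid=10000):
    """Midpoint-rule approximation of H_nu(f) = ln(int f^nu dx) / (1 - nu) on [0, 1]."""
    h = 1.0 / n_grid
    integral = sum(f((i + 0.5) * h) ** nu for i in range(n_grid)) * h
    return math.log(integral) / (1.0 - nu)

def renyi_divergence(f, f_o, nu, n_grid=10000):
    """Midpoint-rule approximation of I_nu(f, f_o) = ln(int (f/f_o)^nu f_o dx) / (nu - 1).

    Assumes f_o > 0 wherever it is evaluated (f_o dominates f).
    """
    h = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        x = (i + 0.5) * h
        total += (f(x) / f_o(x)) ** nu * f_o(x)
    return math.log(total * h) / (nu - 1.0)

def uniform(x):
    return 1.0

def triangle(x):
    # Triangular density on [0, 1] with peak value 2 at x = 1/2 (illustrative choice).
    return 4.0 * x if x < 0.5 else 4.0 * (1.0 - x)

print(renyi_entropy(uniform, 0.5))               # close to 0: the uniform density maximizes entropy on [0, 1]
print(renyi_entropy(triangle, 0.5))              # negative
print(renyi_divergence(triangle, uniform, 0.5))  # positive
```

With a uniform reference density the divergence is simply the negative of the entropy, which is the relation the later measure-transformation argument relies on.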

2. k-minimal graphs
$X_n = \{x_i\}_{i=1}^n$: n points in $\mathbb{R}^d$
• A graph $G$ of degree $l$ consists of vertices and edges
• vertices are a subset of $X_n$
• edges are denoted $\{e_{ij}\}$
• for any $i$: $\mathrm{card}\{e_{ij}\}_j \le l$
Weight (with power exponent $\gamma$) of $G$:
$$L_G(X_n) = \sum_{e \in G} \|e\|^\gamma$$

Examples: n-point Minimal Spanning Tree (MST)
Let $M(X_n)$ denote the possible sets of edges in the class of acyclic graphs spanning $X_n$ (spanning trees). The Euclidean power-weighted MST achieves
$$L_{MST}(X_n) = \min_{M(X_n)} \sum_{e \in M(X_n)} \|e\|^\gamma.$$
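The power-weighted MST length can be computed with any standard MST algorithm, because for $\gamma > 0$ the map $t \mapsto t^\gamma$ is increasing and so does not change which tree is minimal. The sketch below uses Prim's algorithm in pure Python; the function name `mst_length` and the O(n^2) implementation are illustrative choices, not something the slides specify.

```python
import math

def mst_length(points, gamma=1.0):
    """Sum of ||e||^gamma over the edges of a Euclidean MST (Prim, O(n^2)); needs gamma > 0."""
    n = len(points)
    if n < 2:
        return 0.0
    # dist_to_tree[i]: Euclidean distance from point i to the nearest tree vertex.
    dist_to_tree = [math.dist(points[0], p) for p in points]  # vertex 0 seeds the tree
    in_tree = [True] + [False] * (n - 1)
    total = 0.0
    for _ in range(n - 1):
        # Attach the outside vertex that is closest to the current tree.
        j = min((i for i in range(n) if not in_tree[i]), key=dist_to_tree.__getitem__)
        in_tree[j] = True
        total += dist_to_tree[j] ** gamma
        for i in range(n):
            if not in_tree[i]:
                dist_to_tree[i] = min(dist_to_tree[i], math.dist(points[j], points[i]))
    return total

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(mst_length(square, gamma=1.0))  # three unit edges span the unit square
```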

Figure 1. A data set (256 random samples) and the MST.

n-point Traveling Salesman Problem (TSP)
Let $T(X_n)$ be the sets of edges in the class of graphs of degree 2 spanning $X_n$. The minimal power-weighted TSP tour achieves
$$L_{TSP}(X_n) = \min_{T(X_n)} \sum_{e \in T(X_n)} \|e\|^\gamma.$$

2.1. Quasi-additive Euclidean Functionals
$L$ is a continuous subadditive functional if it satisfies the null condition, subadditivity, and continuity; a continuous superadditive functional satisfies superadditivity in place of subadditivity.
• Null condition: $L(\emptyset) = 0$, where $\emptyset$ is the null set.
• Subadditivity: there exists a constant $C_1$ with the following property: for any uniform resolution-$1/m$ partition $Q^m$,
$$L(F) \le m^{-\gamma} \sum_{i=1}^{m^d} L(m[(F \cap Q_i) - q_i]) + C_1 m^{d-\gamma}.$$
• Superadditivity: there exists a constant $C_2$ with the following property:
$$L(F) \ge m^{-\gamma} \sum_{i=1}^{m^d} L(m[(F \cap Q_i) - q_i]) - C_2 m^{d-\gamma}.$$
• Continuity: there exists a constant $C_3$ such that for all finite subsets $F$ and $G$ of $[0,1]^d$,
$$|L(F \cup G) - L(F)| \le C_3 (\mathrm{card}(G))^{(d-\gamma)/d}.$$

Definition 1 A continuous subadditive functional $L$ is said to be a quasi-additive functional when there exists a continuous superadditive functional $L^*$ which satisfies $L(F) + 1 \ge L^*(F)$ and the approximation property
$$|E[L(U_1, \ldots, U_n)] - E[L^*(U_1, \ldots, U_n)]| \le C_4 n^{(d-\gamma-1)/d} \qquad (3)$$
where $U_1, \ldots, U_n$ are i.i.d. uniform random vectors in $[0,1]^d$.

2.2. Asymptotics: the BHH Theorem and entropy estimation
Theorem 1 [Redmond&Yukich:96] Let $L$ be a quasi-additive Euclidean functional with power exponent $\gamma$, and let $X_n = \{x_1, \ldots, x_n\}$ be an i.i.d. sample drawn from a distribution on $[0,1]^d$ with an absolutely continuous component having (Lebesgue) density $f(x)$. Then
$$\lim_{n \to \infty} L(X_n)/n^{(d-\gamma)/d} = \beta_{L,\gamma} \int f(x)^{(d-\gamma)/d}\,dx \quad (a.s.) \qquad (4)$$
Or, letting $\nu = (d-\gamma)/d$,
$$\lim_{n \to \infty} L(X_n)/n^{\nu} = \beta_{L,\gamma} \exp\left((1-\nu) H_\nu(f)\right) \quad (a.s.)$$
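Inverting the limit above gives a plug-in entropy estimate: solve $L(X_n)/n^\nu \approx \beta_{L,\gamma} \exp((1-\nu) H_\nu(f))$ for $H_\nu$. The sketch below assumes the constant $\beta_{L,\gamma}$ is supplied by the caller (for instance approximated from uniform samples, for which $H_\nu = 0$); the function name and argument names are hypothetical.

```python
import math

def renyi_entropy_from_length(length, n, d, gamma, beta):
    """Plug-in estimate of H_nu(f), nu = (d - gamma) / d, from a graph length L(X_n).

    Solves length / n**nu = beta * exp((1 - nu) * H) for H. `beta` stands for the
    BHH constant beta_{L,gamma}; it is an input here, not something this code derives.
    """
    if not 0 < gamma < d:
        raise ValueError("need 0 < gamma < d so that 0 < nu < 1")
    nu = (d - gamma) / d
    return (math.log(length / n ** nu) - math.log(beta)) / (1.0 - nu)

# Round trip on synthetic numbers (all values are made up for the check):
d, gamma, beta, n, h_true = 2, 1.0, 0.7, 400, -0.3
nu = (d - gamma) / d
length = beta * n ** nu * math.exp((1.0 - nu) * h_true)
print(renyi_entropy_from_length(length, n, d, gamma, beta))
```

Because the estimate depends on the length only through $\ln L(X_n)$, an error in $\beta_{L,\gamma}$ shifts every estimate by the same constant.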

Figure 2. 2D Triangular vs. Uniform sample study for MST. Panels: sample points and MST for the uniform and the triangular 2-d distributions on $[0,1]^2$; mean MST length as a function of n for each distribution.

Figure 3. MST and log MST weights as function of number of samples for 2D uniform vs. triangular study. Panels: MST length vs. N; 2*Log(Ln/sqrt(n)) vs. number of points (o = unif, * = triang).

2.3. I-Divergence and Quasi-additive functions
• $g(x)$: a reference density on $\mathbb{R}^d$
• Assume $f \ll g$, i.e. for all $x$ such that $g(x) = 0$ we have $f(x) = 0$.
• Make the measure transformation $dx \to g(x)\,dx$ on $[0,1]^d$. Then for $Y_n$ = transformed data [Hero&Michel:HOS99]
$$\lim_{n \to \infty} L(Y_n)/n^{\nu} = \beta_{L,\gamma} \int \left(\frac{f(x)}{g(x)}\right)^{\nu} g(x)\,dx = \beta_{L,\gamma} \exp\left((\nu-1) I_\nu(f, g)\right) \quad (a.s.)$$

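Proof
1. Make the transformation of variables $x = [x_1, \ldots, x_d]^T \to y = [y_1, \ldots, y_d]^T$:
$$y_1 = G(x_1), \quad y_2 = G(x_2 \mid x_1), \quad \ldots, \quad y_d = G(x_d \mid x_{d-1}, \ldots, x_1) \qquad (5)$$
where $G(x_k \mid x_{k-1}, \ldots, x_1) = \int_{-\infty}^{x_k} g(\tilde{x}_k \mid x_{k-1}, \ldots, x_1)\,d\tilde{x}_k$.
2. The induced density $h(y)$ of the vector $y$ takes the form
$$h(y) = \frac{f(G^{-1}(y_1), \ldots, G^{-1}(y_d \mid y_{d-1}, \ldots, y_1))}{g(G^{-1}(y_1), \ldots, G^{-1}(y_d \mid y_{d-1}, \ldots, y_1))} \qquad (6)$$
where $G^{-1}$ is the inverse CDF and $x_k = G^{-1}(y_k \mid x_{k-1}, \ldots, x_1)$.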

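3. Then we know
$$\hat{H}_\nu(Y_n) \to \frac{1}{1-\nu} \ln \int h^\nu(y)\,dy \quad (a.s.)$$
4. By the Jacobian formula, $dy = \left|\frac{dy}{dx}\right| dx = g(x)\,dx$, and
$$\frac{1}{1-\nu} \ln \int h^\nu(y)\,dy = \frac{1}{1-\nu} \ln \int \left(\frac{f(x)}{g(x)}\right)^{\nu} g(x)\,dx = -I_\nu(f, g).$$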
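For a separable reference density the conditional CDFs in the transformation reduce to the one-dimensional marginal CDF applied coordinate by coordinate. The sketch below assumes, purely for illustration, that $g$ is the product of triangular densities on [0, 1] with mode 1/2 (the reference used in the later application slides); `triangle_cdf` and `to_reference_coordinates` are hypothetical names.

```python
import random

def triangle_cdf(x):
    """CDF of the triangular density g(x) = 4x on [0, 1/2] and 4(1 - x) on [1/2, 1]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return 2.0 * x * x if x < 0.5 else 1.0 - 2.0 * (1.0 - x) ** 2

def to_reference_coordinates(points):
    """Map each x to y = (G(x_1), ..., G(x_d)); valid only because g is separable here."""
    return [tuple(triangle_cdf(c) for c in p) for p in points]

# Points drawn from g itself become (approximately) uniform on [0, 1]^2 after the map,
# which corresponds to h = f / g = 1 in the induced-density formula.
random.seed(0)
sample = [(random.triangular(0, 1, 0.5), random.triangular(0, 1, 0.5)) for _ in range(1000)]
transformed = to_reference_coordinates(sample)
print(sum(y[0] for y in transformed) / len(transformed))  # expected to be near 0.5
```

A non-separable $g$ would need the full chain of conditional CDFs from step 1 of the proof, which this sketch does not attempt.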

3. Outlier Sensitivity of minimal n-point graphs
Assume $f$ is a mixture density of the form
$$f = (1-\epsilon) f_1 + \epsilon f_o, \qquad (7)$$
where
• $f_o$ is a known outlier density
• $f_1$ is an unknown target density
• $\epsilon \in [0,1]$ is an unknown mixture parameter
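Sampling from such a mixture amounts to flipping an $\epsilon$-biased coin per point and drawing from the outlier or the target density accordingly. The sketch below is an illustration under assumed densities (a Gaussian-like target around the center of the unit square and uniform outliers); the function name and the two sampler callbacks are hypothetical.

```python
import random

def sample_mixture(n, eps, sample_target, sample_outlier, rng=random):
    """Draw n points from (1 - eps) * f_1 + eps * f_o.

    Returns a list of (point, is_outlier) pairs so that the labels can be compared
    with whatever an outlier-rejection procedure decides later.
    """
    out = []
    for _ in range(n):
        is_outlier = rng.random() < eps          # outlier with probability eps
        point = sample_outlier() if is_outlier else sample_target()
        out.append((point, is_outlier))
    return out

# Illustrative densities, not the ones used for the figures in the slides.
random.seed(1)
target = lambda: (random.gauss(0.5, 0.05), random.gauss(0.5, 0.05))
outlier = lambda: (random.random(), random.random())
data = sample_mixture(200, 0.25, target, outlier)
print(sum(flag for _, flag in data), "outliers out of", len(data))
```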

Figure 4. 1st row: 2D torus density with and without the addition of uniform outliers. 2nd row: corresponding MSTs. Axes: x1, x2.

3.1. Minimal k-point Euclidean Graphs
Fix $k$, $1 \le k \le n$. Let $T_{n,k} = T(x_{i_1}, \ldots, x_{i_k})$ be a minimal graph connecting $k$ distinct vertices $x_{i_1}, \ldots, x_{i_k}$. The power-weighted k-minimal graph $T^*_{n,k} = T^*(x_{i^*_1}, \ldots, x_{i^*_k})$ is the overall minimum weight k-point graph:
$$L^*_{n,k} = L^*(X_{n,k}) = \min_{i_1, \ldots, i_k} \min_{T_{n,k}} \sum_{e \in T_{n,k}} \|e\|^\gamma$$
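The double minimization can be made concrete by brute force: enumerate every k-subset of the points and keep the one whose MST is shortest. This is only a sketch for very small n, since the number of subsets grows combinatorially; the helper names are illustrative, and the MST helper is a plain Prim implementation included so the example is self-contained.

```python
import itertools
import math

def mst_length(points, gamma=1.0):
    """Sum of ||e||^gamma over the edges of a Euclidean MST (Prim, O(n^2)); needs gamma > 0."""
    n = len(points)
    if n < 2:
        return 0.0
    dist_to_tree = [math.dist(points[0], p) for p in points]
    in_tree = [True] + [False] * (n - 1)
    total = 0.0
    for _ in range(n - 1):
        j = min((i for i in range(n) if not in_tree[i]), key=dist_to_tree.__getitem__)
        in_tree[j] = True
        total += dist_to_tree[j] ** gamma
        for i in range(n):
            if not in_tree[i]:
                dist_to_tree[i] = min(dist_to_tree[i], math.dist(points[j], points[i]))
    return total

def k_mst_bruteforce(points, k, gamma=1.0):
    """Exact k-minimal MST by exhaustive search over all k-subsets (exponential cost)."""
    def length_of(indices):
        return mst_length([points[i] for i in indices], gamma)
    best = min(itertools.combinations(range(len(points)), k), key=length_of)
    return best, length_of(best)

# Three clustered points plus two far-away outliers: the 3-MST keeps the cluster.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (-4.0, 3.0)]
print(k_mst_bruteforce(pts, 3))
```

The example shows the outlier-rejection effect the slides describe: the points left out of the k-minimal graph are the ones far from the bulk of the sample.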

Figure 5. k-MST for 2D torus density with and without the addition of uniform outliers. Panels: k = 99 (1 outlier rejection), k = 98 (2 outlier rejections), k = 62 (38 outlier rejections), k = 25 (75 outlier rejections).

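4. Extended BHH Thm for k-minimal Graphs
Fix $\alpha \in [0,1]$ and assume that the k-minimal graph is tightly coverable. If $k = \lfloor \alpha n \rfloor$, as $n \to \infty$ we have (Hero&Michel:IT99)
$$L(X^*_{n,k})/(\lfloor \alpha n \rfloor)^{\nu} \to \beta_{L,\gamma} \min_{A: P(A) \ge \alpha} \int f^\nu(x \mid x \in A)\,dx \quad (a.s.)$$
or, alternatively, with
$$H_\nu(f \mid x \in A) = \frac{1}{1-\nu} \ln \int f^\nu(x \mid x \in A)\,dx,$$
$$L(X^*_{n,k})/(\lfloor \alpha n \rfloor)^{\nu} \to \beta_{L,\gamma} \exp\left((1-\nu) \min_{A: P(A) \ge \alpha} H_\nu(f \mid x \in A)\right) \quad (a.s.)$$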

Definition 2 (Tightly Coverable Graphs) Let $Q^m$, $m = 1, 2, \ldots$, be a sequence of uniform partitions of $[0,1]^d$ of resolution $1/m$. Let $G$ be an algorithm which constructs a graph with $k = \lfloor \alpha n \rfloor$ vertices $U_{n,k} \subset U_n$, an i.i.d. uniform sample over $[0,1]^d$. Define
$$D^m_k = \bigcap_{\{C \in \sigma(Q^m):\, U_{n,k} \subset C\}} C,$$
the minimum volume set in $\sigma(Q^m)$ which covers $U_{n,k}$. The algorithm $G$ is said to generate tightly coverable subgraphs if for any $\epsilon > 0$ there exists an $M$ such that for all $m > M$
$$\limsup_{n \to \infty} \frac{\mathrm{card}(U_n \cap D^m_{\lfloor \alpha n \rfloor})}{\lfloor \alpha n \rfloor} - 1 \le \epsilon \quad (a.s.)$$

5. Greedy Partition Algorithm
Greedy approximation to k-minimal graph (Ravi&etal:94):
0) specify a uniform partition $Q^m$ of $[0,1]^d$ having $m^d$ cells $Q_i$ of resolution $1/m$;
1) find the smallest subset $B^m_k = \cup_i Q_i$ of partition elements containing at least $k$ points;
2) out of $X_n \cap B^m_k$ select $k$ points $X_{n,k}$ which minimize $L(X_{n,k})$.
Properties:
• Greedy algorithm is polynomial time, unlike the exponential time exact k-minimal algorithm.
• Greedy algorithm yields tightly coverable graphs by construction.
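The cell-selection stage of the greedy algorithm (steps 0 and 1) can be sketched as follows. Since all cells have equal volume, taking cells in order of decreasing point count yields a subset with the fewest cells that still holds at least k points. The sketch stops before step 2 (choosing the final k points among the candidates); the function name and the return format are illustrative assumptions.

```python
from collections import Counter

def greedy_cells(points, m, k):
    """Steps 0-1: choose the fewest resolution-1/m cells of [0,1]^d that hold >= k points.

    Returns (chosen_cells, candidates), where candidates are the sample points that
    fall in the chosen cells; step 2 of the algorithm would pick k of them.
    """
    def cell(p):
        # Index of the partition cell containing p; points at 1.0 go to the last cell.
        return tuple(min(int(c * m), m - 1) for c in p)

    counts = Counter(cell(p) for p in points)
    chosen, covered = set(), 0
    for c, count in counts.most_common():   # most populated cells first
        if covered >= k:
            break
        chosen.add(c)
        covered += count
    if covered < k:
        raise ValueError("the sample holds fewer than k points")
    candidates = [p for p in points if cell(p) in chosen]
    return chosen, candidates

pts = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.4), (0.6, 0.7), (0.9, 0.9), (0.2, 0.8)]
print(greedy_cells(pts, m=2, k=4))
```

When several cells tie on point count, more than one smallest subset exists; this sketch returns whichever the tie order produces.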

Figure 6. A sample of 75 points from the mixture density $f(x) = 0.25 f_1(x) + 0.75 f_o(x)$ where $f_o$ is a uniform density over $[0,1]^2$ and $f_1$ is a bivariate Gaussian density with mean $(1/2, 1/2)$ and diagonal covariance diag(…). A smallest subset $B^m_k$ is the union of the two cross hatched cells shown for the case of m = 5 and k = 7.

Figure 7. Another smallest subset $B^m_k$ containing at least k = 7 points for the mixture sample shown in Fig 6.

Minimal $1/m$-cover of Probability at least $\alpha$
If for any $C \in \sigma(Q^m)$ satisfying $P(C) \ge \alpha$ the set $A \in \sigma(Q^m)$ satisfies $P(C) \ge P(A) \ge \alpha$, then $A$ is called a minimal resolution-$1/m$ set of probability at least $\alpha$. The class of all such sets is denoted $\mathcal{A}^m_\alpha$.
$\Rightarrow$ all sets in $\mathcal{A}^m_\alpha$ have identical coverage probabilities $p_{A^m_\alpha} \ge \alpha$.

Theorem 2 Let $X_n$ be an i.i.d. sample from a distribution having Lebesgue density $f(x)$. Fix $\alpha \in [0,1]$, $\gamma \in (0,d)$. Let $f^{(d-\gamma)/d}$ be of bounded variation over $[0,1]^d$ and denote by $v(A)$ its total variation over a subset $A \subset [0,1]^d$. Let $L$ be a quasi-additive functional with power exponent $\gamma$ as in Theorem 1. Then,
$$\limsup_{n \to \infty} \left| L(X^{G_m}_{n,\lfloor \alpha n \rfloor})/n^{\nu} - \beta_{L,\gamma} \min_{A: P(A) \ge \alpha} \int f^\nu(x \mid A)\,dx \right| < \delta \quad (a.s.),$$
where
$$\delta = 2 \beta_{L,\gamma}\, m^{-d} \sum_{i=1}^{m^d} v(Q_i \cap A^m_\alpha) + C_3 (p_{A^m_\alpha} - \alpha)^{(d-\gamma)/d} = O(m^{\gamma-d}),$$
and $p_{A^m_\alpha}$ is the coverage probability of the minimizing set $A = A^m_\alpha$.

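Main idea behind proof: for large $n$ one can index over $X_{n,k} \subset X_n$ by indexing over Borel sets $A \subset [0,1]^d$:
$$E\left[\min_{X_{n,\lfloor \alpha n \rfloor}} L(X_{n,\lfloor \alpha n \rfloor})\right] \approx \inf_{A: P(A) \ge \alpha} E[L(X_n \cap A)].$$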

Proof of the theorem uses the following lemmas:
Lemma 1 For given $\alpha \in [0,1]$ and a set of $n$ i.i.d. points $X_n = [x_1, \ldots, x_n]^T$ let $B^m_n$ be the minimal resolution-$1/m$ cover of $\lfloor \alpha n \rfloor$ points produced by the greedy subset selection algorithm. Then
$$P\left(\liminf_{n \to \infty} \{X_n : B^m_n \in \mathcal{A}^m_\alpha\}\right) = 1.$$

Lemma 2 For $\nu \in [0,1]$ let $f^\nu$ be of bounded variation over $[0,1]^d$ and denote by $v(A)$ its total variation over any subset $A \subset [0,1]^d$. Define the resolution-$1/m$ block density approximation $\tilde{f}(x) = \sum_{i=1}^{m^d} \phi_i I_{Q_i}(x)$, where $\phi_i = m^d \int_{Q_i} f(x)\,dx$. Then for any $A \in \sigma(Q^m)$
$$0 \le \int_A [\tilde{f}^\nu(x) - f^\nu(x)]\,dx \le m^{-d} \sum_{i=1}^{m^d} v(Q_i \cap A).$$

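Lemma 3 Assume $f$ is of bounded total variation $v(Q_i)$ in each partition cell $Q_i \in Q^m$. Let $A$ be any set in the class $\mathcal{A}^m_\alpha$. Then for any quasi-additive functional $L$, with $L_n(B^m_{\lfloor \alpha n \rfloor}) \stackrel{def}{=} L(X_n \cap B^m_{\lfloor \alpha n \rfloor})$,
$$\limsup_{n \to \infty} \left| L_n(B^m_{\lfloor \alpha n \rfloor})/n^{(d-\gamma)/d} - \beta_{L,\gamma} \int_A f^{(d-\gamma)/d}(x)\,dx \right| < 2 \beta_{L,\gamma}\, m^{-d} \sum_{i=1}^{m^d} v(Q_i \cap A^m_\alpha) \quad (a.s.)$$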

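Lemma 4 Let $X_{n,\lfloor \alpha n \rfloor}$ be any $\lfloor \alpha n \rfloor$ points selected from $X_n \cap B^m_{\lfloor \alpha n \rfloor}$. Then, for any quasi-additive functional $L$, with $L_n(B^m_{\lfloor \alpha n \rfloor}) \stackrel{def}{=} L(X_n \cap B^m_{\lfloor \alpha n \rfloor})$,
$$\limsup_{n \to \infty} \left| L_n(B^m_{\lfloor \alpha n \rfloor}) - L(X_{n,\lfloor \alpha n \rfloor}) \right| / n^{(d-\gamma)/d} < C_3 (p_{A^m_\alpha} - \alpha)^{(d-\gamma)/d} \quad (a.s.)$$
where $p_{A^m_\alpha} = P(A^m_\alpha)$ is the coverage probability of sets $A^m_\alpha$ in $\mathcal{A}^m_\alpha$.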

Interpretations of Theorem 2:
• Bound $\delta$ is tight: $\lim_{\alpha \to 1} \delta = 0$ and the theorem reduces to BHH.
• Since $\sup_x f^{(d-\gamma)/d}(x) \le v([0,1]^d)$ and $\sum_{i=1}^{m^d} v(Q_i \cap A^m_\alpha) \le v([0,1]^d)$, $\delta$ can be upper bounded by
$$\delta \le \left[ 2 \beta_{L,\gamma} m^{-d} + C_3 m^{\gamma-d} \right] v([0,1]^d).$$
Thus if an upper bound $\bar{v}$ on the total variation of $f$ is available and the tolerance $\epsilon$ is given,
$$\left| L(X^{G_m}_{n,\lfloor \alpha n \rfloor})/(\lfloor \alpha n \rfloor)^{\nu} - \beta_{L,\gamma} \exp\{(1-\nu) R_\nu\} \right| < \epsilon,$$
we obtain a selection rule for the required partition resolution $1/m$:
$$1/m \le \frac{\epsilon}{(2 \beta_{L,\gamma} + C_3)\, \bar{v}}.$$
• $\delta$ decreases to zero as a function of resolution $1/m$ at overall rate $O(m^{\gamma-d})$. Thus the convergence rate in $1/m$ is fastest for small $\gamma$.
• Conjecture: as $|E[L_{MST}(U_n)] - \beta_{L_{MST},\gamma}\, n^{(d-\gamma)/d}| = O(\max(1, n^{(d-\gamma-1)/d}))$ [Redmond&Yukich:96], the rate of convergence in the limsup of Theorem 2 is at best $O(1/n^{1/d})$ and this rate can be attained only when $\gamma \le d - 1$.
• The k-minimal graph entropy estimator will have fastest convergence when the Rényi order parameter $\nu$ is in the $1/d \le \nu < 1$ range.
• The theorem extends easily to the I-divergence limit by measure transformation.

6. Application I: MST Discrimination
• $f_1(x)$ is the 2D separable triangle density on $[0,1]^2$
• $f_o(x)$ is the 2D uniform density on $[0,1]^2$
• $f(x) = (1-\epsilon) f_1(x) + \epsilon f_o(x)$: mixture density

Figure 8. ROC curves (PD vs. PFA; N = 256, ref = unif) for the Rényi information divergence test for detecting the triangle-uniform mixture density $f = (1-\epsilon) f_1 + \epsilon f_o$ ($H_1$) against the triangular hypothesis $f = f_1$ ($H_0$). Curves are increasing in $\epsilon \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$.

7. Application II: Nonuniform Outlier Rejection
Figure 9. Left: a scatterplot of a 256 point sample from the triangle-uniform mixture density with $\epsilon$ = …. Labels o and * mark those realizations from the uniform and triangular densities, respectively. Right: superimposed is the k-MST implemented directly on the scatterplot $X_n$ with k = ….

Figure 10. A sample (N = 256, k/n = 0.9) from the triangle-uniform mixture density f = 0.1*triang + 0.9*unif ($\epsilon = 0.9$). Labels o and * mark those realizations from the uniform and triangular densities, respectively. Left: original coordinates. Right: transformed coordinates (domain $Y_n$).

Figure 11. N = 256, k/n = 0.9, $f_o$ = unif, $f_1$ = triang. Left: the k-MST implemented on the transformed scatterplot $Y_n$ with k = 230 (clustering in the InvTriangle transformed data domain). Right: the same k-MST displayed in the original data domain.
