An Adaptive Test of Independence with Analytic Kernel Embeddings


An Adaptive Test of Independence with Analytic Kernel Embeddings
Wittawat Jitkrittum (Gatsby Unit, University College London), Zoltán Szabó (CMAP, École Polytechnique), Arthur Gretton (Gatsby Unit, University College London)
ICML 2017, Sydney, 9 August 2017

What Is Independence Testing?

Let $(X, Y) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$ be random vectors following $P_{xy}$. Given a joint sample $\{(x_i, y_i)\}_{i=1}^n \sim P_{xy}$ (unknown), test

$H_0: P_{xy} = P_x P_y$ vs. $H_1: P_{xy} \neq P_x P_y$.

Compute a test statistic $\hat{\lambda}_n$. Reject $H_0$ if $\hat{\lambda}_n > T_\alpha$ (threshold), where $T_\alpha$ is the $(1-\alpha)$-quantile of the null distribution.

[Figure: null distribution of $\hat{\lambda}_n$ under $H_0$, with the rejection threshold $T_\alpha$ marked.]
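Below is a minimal sketch of this decision rule in Python. The slide only specifies "reject if the statistic exceeds the $(1-\alpha)$-quantile of the null distribution"; here `null_samples` is an assumed stand-in for draws from (an approximation of) that null distribution, and the names are illustrative.

```python
import numpy as np

def independence_test(stat_value, null_samples, alpha=0.05):
    """Generic decision rule: reject H0 if the observed statistic exceeds
    the (1 - alpha)-quantile of samples from the null distribution."""
    threshold = np.quantile(null_samples, 1.0 - alpha)
    return {"reject_H0": bool(stat_value > threshold), "threshold": threshold}
```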

Motivations

The modern state-of-the-art test is HSIC [Gretton et al., 2005]:
- Nonparametric, i.e., no assumption on $P_{xy}$. Kernel-based.
- Slow: runtime is $O(n^2)$, where $n$ is the sample size.
- No systematic way to choose kernels.

We propose the Finite-Set Independence Criterion (FSIC):
1. Nonparametric.
2. Linear-time: runtime complexity is $O(n)$. Fast.
3. Adaptive: kernel parameters can be tuned.

Proposal: The Finite-Set Independence Criterion (FSIC)

1. Pick two kernels: $k$ for $X$ and $l$ for $Y$ (e.g., Gaussian kernels).
2. Pick a feature $(v, w) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$.
3. Transform $(x, y) \mapsto (k(x, v),\ l(y, w))$, then measure the covariance:

$\mathrm{FSIC}^2(X, Y) = \mathrm{cov}^2_{(x,y) \sim P_{xy}}[k(x, v),\ l(y, w)]$.

[Figure: scatter plots of the data and of the transformed pairs $(k(x_i, v), l(y_i, w))$ for several choices of $(v, w)$. For dependent data, the transformed sample can show strong correlation (e.g., 0.97) at a well-placed $(v, w)$ and moderate correlation elsewhere (e.g., -0.47, 0.33); for independent data, the correlation stays near zero (e.g., 0.02, 0.09) regardless of $(v, w)$.]
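A minimal NumPy sketch of the single-feature quantity above: the squared empirical covariance of $k(x, v)$ and $l(y, w)$ with Gaussian kernels. The function names and the simple plug-in (biased) covariance are illustrative choices for exposition, not the unbiased estimator used in the test later in the deck.

```python
import numpy as np

def gauss_kernel(A, b, sigma):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), rows of A vs a single point b."""
    sq_dists = np.sum((A - b) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def fsic2_single_feature(X, Y, v, w, sigma_x, sigma_y):
    """Plug-in estimate of FSIC^2 for one feature (v, w):
    the squared empirical covariance of k(x, v) and l(y, w)."""
    kx = gauss_kernel(X, v, sigma_x)   # shape (n,)
    ly = gauss_kernel(Y, w, sigma_y)   # shape (n,)
    cov = np.mean(kx * ly) - np.mean(kx) * np.mean(ly)
    return cov ** 2
```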

General Form of FSIC

$\mathrm{FSIC}^2(X, Y) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{cov}^2_{(x,y) \sim P_{xy}}[k(x, v_j),\ l(y, w_j)]$,

for $J$ features $\{(v_j, w_j)\}_{j=1}^J \subset \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$.

Proposition 1. Assume:
1. $k$ and $l$ vanish at infinity and are translation-invariant, characteristic, and real analytic (e.g., Gaussian kernels).
2. The features $\{(v_j, w_j)\}_{j=1}^J$ are drawn from a distribution with a density.

Then, for any $J \geq 1$, almost surely, $\mathrm{FSIC}(X, Y) = 0$ iff $X$ and $Y$ are independent.

Under $H_0: P_{xy} = P_x P_y$, $n \widehat{\mathrm{FSIC}^2}$ converges to a weighted sum of $J$ dependent $\chi^2$ variables, so the $(1-\alpha)$-quantile for the threshold is difficult to obtain.

Normalized FSIC (NFSIC)

Let $\hat{\mathbf{u}} := (\widehat{\mathrm{cov}}[k(x, v_1), l(y, w_1)], \ldots, \widehat{\mathrm{cov}}[k(x, v_J), l(y, w_J)])^\top \in \mathbb{R}^J$. Then $\widehat{\mathrm{FSIC}^2} = \frac{1}{J} \hat{\mathbf{u}}^\top \hat{\mathbf{u}}$.

$\widehat{\mathrm{NFSIC}^2}(X, Y) = \hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$,

with a regularization parameter $\gamma_n \geq 0$, where $\hat{\Sigma}_{ij}$ is the covariance of $\hat{u}_i$ and $\hat{u}_j$.

Theorem 1 (NFSIC test is consistent). Assume $\gamma_n \to 0$ and the same conditions on $k$ and $l$ as before.
1. Under $H_0$, $\hat{\lambda}_n \xrightarrow{d} \chi^2(J)$ as $n \to \infty$. Easy to get the threshold $T_\alpha$.
2. Under $H_1$, $\mathbb{P}(\text{reject } H_0) \to 1$ as $n \to \infty$.

Complexity: $O(J^3 + J^2 n + (d_x + d_y) J n)$. Only a small $J$ is needed.
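Since the null distribution is asymptotically $\chi^2(J)$, the threshold $T_\alpha$ is just a chi-square quantile. A minimal SciPy sketch (the values of `alpha` and `J` are illustrative):

```python
from scipy.stats import chi2

# (1 - alpha)-quantile of chi^2(J): asymptotic rejection threshold for the NFSIC test (Theorem 1).
alpha, J = 0.05, 10
T_alpha = chi2.ppf(1.0 - alpha, df=J)
```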

Tuning Features and Kernels

Split the data into training (tr) and test (te) sets. Procedure:
1. Choose $\{(v_j, w_j)\}_{j=1}^J$ and the Gaussian widths by maximizing $\hat{\lambda}_n^{(tr)}$ (i.e., $\hat{\lambda}_n$ computed on the training set), using gradient ascent.
2. Reject $H_0$ if $\hat{\lambda}_n^{(te)} > (1-\alpha)$-quantile of $\chi^2(J)$.

Splitting avoids overfitting. The optimization is also linear-time.

Theorem 2.
1. This procedure increases a lower bound on $\mathbb{P}(\text{reject } H_0 \mid H_1 \text{ true})$ (the test power).
2. Asymptotically, if $H_0$ is true, the false rejection rate is $\alpha$.

Simulation Settings

Gaussian kernels: $k(x, v) = \exp\!\left(-\frac{\|x - v\|^2}{2\sigma_x^2}\right)$ for $X$, and $l(y, w) = \exp\!\left(-\frac{\|y - w\|^2}{2\sigma_y^2}\right)$ for $Y$.

Methods compared:
1. NFSIC-opt: NFSIC with optimization. $O(n)$.
2. QHSIC [Gretton et al., 2005]: state-of-the-art HSIC. $O(n^2)$.
3. NFSIC-med: NFSIC with random features.
4. NyHSIC: linear-time HSIC with Nyström approximation.
5. FHSIC: linear-time HSIC with random Fourier features.
6. RDC [Lopez-Paz et al., 2013]: canonical correlation analysis with a cosine basis.

$J = 10$ in NFSIC.

Youtube Video (X) vs. Caption (Y)

$X \in \mathbb{R}^{2000}$: Fisher vector encoding of motion boundary histogram descriptors [Wang and Schmid, 2013]. $Y \in \mathbb{R}^{1878}$: bag of words (term frequencies). $\alpha = 0.01$.

[Figure: test power vs. sample size $n$ (2000 to 8000) for the proposed NFSIC and QHSIC; a second panel shows type-I error vs. $n$, obtained by exchanging $(X, Y)$ pairs so that $H_0$ is true.]

For large $n$, NFSIC is comparable to HSIC.

Conclusions

Proposed the Finite-Set Independence Criterion (FSIC): $\mathrm{FSIC}(X, Y) = 0 \iff X$ and $Y$ are independent.

The independence test based on FSIC is
1. nonparametric (no parametric assumption on $P_{xy}$),
2. linear-time,
3. adaptive (parameters automatically tuned).

Python code: github.com/wittawatj/fsic-test

Poster #111. Tonight.

Questions? Thank you.

Requirements on the Kernels

Definition 1 (Analytic kernels). $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be analytic if, for all $x \in \mathcal{X}$, $v \mapsto k(x, v)$ is a real analytic function on $\mathcal{X}$. Analytic means the Taylor series about $x_0$ converges for all $x_0 \in \mathcal{X}$; in particular, $k$ is infinitely differentiable.

Definition 2 (Characteristic kernels). Let $\mu_P(v) := \mathbb{E}_{z \sim P}[k(z, v)]$. $k$ is said to be characteristic if $\mu_P$ is unique for distinct $P$; equivalently, $P \mapsto \mu_P$ is injective.

[Figure: the space of distributions $P, Q$ mapped into the RKHS as mean embeddings $\mu_P, \mu_Q$, with $\mathrm{MMD}(P, Q)$ the distance between them.]

Optimization Objective = Power Lower Bound

Recall $\hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$. Let $\mathrm{NFSIC}^2(X, Y) := \lambda_n := n\, \mathbf{u}^\top \Sigma^{-1} \mathbf{u}$ (the population counterpart).

Theorem 3 (A lower bound on the test power).
1. Under some boundedness assumptions, the test power satisfies $\mathbb{P}_{H_1}(\hat{\lambda}_n \geq T_\alpha) \geq L(\lambda_n)$, where $L(\lambda_n)$ is an explicit function of $\lambda_n$, $n$, $T_\alpha$, and $\gamma_n$ consisting of three exponentially decaying correction terms with constants $\xi_1, \ldots, \xi_4, c_3 > 0$.
2. For large $n$, $L(\lambda_n)$ is increasing in $\lambda_n$.

Set the test locations and Gaussian widths to $\arg\max L(\lambda_n) = \arg\max \lambda_n$.

An Estimator of $\widehat{\mathrm{NFSIC}^2}$

$\hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$, with $J$ test locations $\{(v_i, w_i)\}_{i=1}^J$.

Let $K = [k(v_i, x_j)] \in \mathbb{R}^{J \times n}$ and $L = [l(w_i, y_j)] \in \mathbb{R}^{J \times n}$. (No $n \times n$ Gram matrix is needed.)

Estimators ($\circ$ denotes the elementwise product):
1. $\hat{\mathbf{u}} = \frac{(K \circ L)\mathbf{1}_n}{n-1} - \frac{(K\mathbf{1}_n) \circ (L\mathbf{1}_n)}{n(n-1)}$.
2. $\hat{\Sigma} = \frac{\Gamma \Gamma^\top}{n}$, where $\Gamma := \left(K - \tfrac{1}{n} K\mathbf{1}_n\mathbf{1}_n^\top\right) \circ \left(L - \tfrac{1}{n} L\mathbf{1}_n\mathbf{1}_n^\top\right) - \hat{\mathbf{u}}\mathbf{1}_n^\top$.

$\hat{\lambda}_n$ can be computed in $O(J^3 + J^2 n + (d_x + d_y) J n)$ time. Main point: linear in $n$, cubic in $J$ (small).
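A minimal NumPy sketch of this estimator, following the matrix formulation above. This is not the authors' fsic-test package; the helper `gauss_gram`, the Gaussian widths, and the default `gamma_n` are illustrative choices.

```python
import numpy as np

def gauss_gram(V, X, sigma):
    """Cross Gram matrix [k(v_i, x_j)] with a Gaussian kernel; V is (J, d), X is (n, d)."""
    sq = np.sum(V**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * V @ X.T
    return np.exp(-sq / (2.0 * sigma**2))

def nfsic_statistic(X, Y, V, W, sigma_x, sigma_y, gamma_n=1e-4):
    """NFSIC statistic lambda_hat_n = n * u^T (Sigma + gamma_n I)^{-1} u,
    using the J x n matrices K and L (no n x n Gram matrix)."""
    n = X.shape[0]
    K = gauss_gram(V, X, sigma_x)            # (J, n)
    L = gauss_gram(W, Y, sigma_y)            # (J, n)
    # u_hat: unbiased covariance estimates, one entry per feature (v_j, w_j)
    u = (K * L).sum(axis=1) / (n - 1) - (K.sum(axis=1) * L.sum(axis=1)) / (n * (n - 1))
    # Gamma: row-centred K and L multiplied elementwise, minus u_hat broadcast over columns
    Kc = K - K.mean(axis=1, keepdims=True)
    Lc = L - L.mean(axis=1, keepdims=True)
    Gamma = Kc * Lc - u[:, None]
    Sigma = Gamma @ Gamma.T / n              # (J, J) covariance estimate of u_hat
    J = V.shape[0]
    return n * u @ np.linalg.solve(Sigma + gamma_n * np.eye(J), u)
```

Comparing the returned statistic with the $\chi^2(J)$ quantile from the earlier snippet gives the linear-time test decision.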

Alternative View of FSIC

Rewrite the covariance in $\mathrm{FSIC}^2(X, Y) = \mathrm{cov}^2_{(x,y)\sim P_{xy}}[k(x, v),\ l(y, w)]$:

$\mathrm{cov}_{xy}[k(x, v), l(y, w)] = \mathbb{E}_{xy}[k(x, v)\, l(y, w)] - \mathbb{E}_x[k(x, v)]\, \mathbb{E}_y[l(y, w)] =: \mu_{xy}(v, w) - \mu_x(v)\mu_y(w)$ (the witness function),

estimated by $\hat{\mu}_{xy}(v, w) - \hat{\mu}_x(v)\hat{\mu}_y(w)$.

FSIC evaluates the witness function at $J$ locations: cost $O(Jn)$. HSIC is the RKHS norm of the witness function: cost $O(n^2)$.

HSIC vs. FSIC

Recall the witness $\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \hat{\mu}_x(v)\hat{\mu}_y(w)$.

HSIC [Gretton et al., 2005] $= \|\hat{u}\|_{\text{RKHS}}$: good when the difference between $p_{xy}$ and $p_x p_y$ is spatially diffuse, i.e., $\hat{u}$ is almost flat.

FSIC [proposed] $= \frac{1}{J}\sum_{i=1}^{J} \hat{u}^2(v_i, w_i)$: good when the difference between $p_{xy}$ and $p_x p_y$ is local, i.e., $\hat{u}$ is mostly zero but has many peaks (feature interaction).

Toy Problem 1: Independent Gaussians

$X \sim \mathcal{N}(0, I_{d_x})$ and $Y \sim \mathcal{N}(0, I_{d_y})$, independent, so $H_0$ holds. Set $\alpha := 0.05$, $d_x = d_y = 250$.

[Figure: type-I error and runtime (seconds) vs. sample size $n$ ($10^3$ to $10^5$) for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.]

Correct type-I errors (false positive rate).

Toy Problem 2: Sinusoid

$p_{xy}(x, y) \propto 1 + \sin(\omega x)\sin(\omega y)$, where $x, y \in (-\pi, \pi)$. The changes between $p_{xy}$ and $p_x p_y$ are local. Set $n = 4000$.

[Figure: heat maps of $p_{xy}$ for $\omega = 1, 2, 3, 4$, with finer oscillations as $\omega$ grows; a second panel shows test power vs. $\omega \in \{1, \ldots, 6\}$ for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.]

Main point: NFSIC handles local changes in the joint space well.
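The talk does not say how this density is sampled; one simple option is rejection sampling from a uniform proposal on $(-\pi, \pi)^2$, since the unnormalized density is bounded by 2. A sketch under that assumption:

```python
import numpy as np

def sample_sinusoid(n, omega, rng=None):
    """Rejection sampler for p_xy(x, y) ~ 1 + sin(omega*x) sin(omega*y) on (-pi, pi)^2:
    propose uniformly, accept with probability (1 + sin(omega*x) sin(omega*y)) / 2."""
    rng = np.random.default_rng() if rng is None else rng
    xs, ys = [], []
    while len(xs) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        if rng.uniform() < (1.0 + np.sin(omega * x) * np.sin(omega * y)) / 2.0:
            xs.append(x)
            ys.append(y)
    return np.array(xs)[:, None], np.array(ys)[:, None]
```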

Toy Problem 3: Gaussian Sign

$y = |Z| \prod_{i=1}^{d_x} \mathrm{sign}(x_i)$, where $x \sim \mathcal{N}(0, I_{d_x})$ and $Z \sim \mathcal{N}(0, 1)$ (noise). There is full interaction among $x_1, \ldots, x_{d_x}$: all of $x_1, \ldots, x_{d_x}$ must be considered to detect the dependency.

[Figure: test power vs. sample size $n$ ($10^3$ to $10^5$).]
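A minimal sketch of the Gaussian-sign data generation described above; the function name and output shapes are illustrative.

```python
import numpy as np

def sample_gaussian_sign(n, dx, rng=None):
    """Toy Problem 3: y = |Z| * prod_i sign(x_i), with x ~ N(0, I_dx) and Z ~ N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, dx))
    Z = rng.standard_normal(n)
    Y = np.abs(Z) * np.prod(np.sign(X), axis=1)
    return X, Y[:, None]
```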

Test Power vs. J

Test power does not always increase with $J$ (the number of test locations). Here $n = 800$, on the sinusoid problem with $\omega = 2$.

[Figure: test power vs. $J \in \{10, \ldots, 600\}$.]

As $J$ grows, accurate estimation of $\hat{\Sigma} \in \mathbb{R}^{J \times J}$ in $\hat{\lambda}_n = n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$ becomes more difficult, and a large $J$ defeats the purpose of a linear-time test.

Real Problem: Million Song Data

Song ($X$) vs. year of release ($Y$). Western commercial tracks from 1922 to 2011 [Bertin-Mahieux et al., 2011]. $X \in \mathbb{R}^{90}$ contains audio features; $Y \in \mathbb{R}$ is the year of release.

[Figure: type-I error and test power vs. sample size $n$ (500 to 2000) for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC; type-I error is measured by breaking the $(X, Y)$ pairs to simulate $H_0$.]

NFSIC-opt has the highest power among the linear-time tests.

Alternative Form of $\hat{u}(v, w)$

Recall $\widehat{\mathrm{FSIC}^2} = \frac{1}{J}\sum_{i=1}^{J} \hat{u}(v_i, w_i)^2$.

Let $\widehat{\mu_x \mu_y}(v, w)$ be an unbiased estimator of $\mu_x(v)\mu_y(w)$:
$\widehat{\mu_x \mu_y}(v, w) := \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} k(x_i, v)\, l(y_j, w)$.

An unbiased estimator of $u(v, w)$ is
$\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \widehat{\mu_x \mu_y}(v, w) = \frac{2}{n(n-1)} \sum_{i < j} h_{(v,w)}\big((x_i, y_i), (x_j, y_j)\big)$,
where $h_{(v,w)}\big((x, y), (x', y')\big) := \frac{1}{2}\big(k(x, v) - k(x', v)\big)\big(l(y, w) - l(y', w)\big)$.

Given $(v, w)$, $\hat{u}(v, w)$ is a one-sample 2nd-order U-statistic.
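A minimal NumPy sketch of this pairwise U-statistic for a single location $(v, w)$, written quadratically in $n$ for clarity (the linear-time test uses the matrix form on the estimator slide); the function name and Gaussian kernels are illustrative.

```python
import numpy as np

def u_hat(X, Y, v, w, sigma_x, sigma_y):
    """Unbiased U-statistic estimator of u(v, w) = mu_xy(v, w) - mu_x(v) mu_y(w),
    averaging h_{(v,w)} over all pairs i < j (quadratic in n; for illustration only)."""
    kx = np.exp(-np.sum((X - v) ** 2, axis=1) / (2 * sigma_x ** 2))   # k(x_i, v)
    ly = np.exp(-np.sum((Y - w) ** 2, axis=1) / (2 * sigma_y ** 2))   # l(y_i, w)
    n = len(kx)
    diff_k = kx[:, None] - kx[None, :]       # k(x_i, v) - k(x_j, v)
    diff_l = ly[:, None] - ly[None, :]       # l(y_i, w) - l(y_j, w)
    iu = np.triu_indices(n, k=1)             # all pairs with i < j
    h = 0.5 * diff_k[iu] * diff_l[iu]
    return 2.0 / (n * (n - 1)) * np.sum(h)
```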

References

Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. (2011). The Million Song Dataset. In International Conference on Music Information Retrieval (ISMIR).

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring Statistical Dependence with Hilbert-Schmidt Norms. In Algorithmic Learning Theory (ALT), pages 63-77.

Lopez-Paz, D., Hennig, P., and Schölkopf, B. (2013). The Randomized Dependence Coefficient. In Advances in Neural Information Processing Systems (NIPS), pages 1-9.

Wang, H. and Schmid, C. (2013). Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision (ICCV), pages 3551-3558.