An Adaptive Test of Independence with Analytic Kernel Embeddings


An Adaptive Test of Independence with Analytic Kernel Embeddings
Wittawat Jitkrittum (Gatsby Unit, University College London), Zoltán Szabó (CMAP, École Polytechnique), Arthur Gretton (Gatsby Unit, University College London)
ICML 2017, Sydney, 9 August 2017

What Is Independence Testing?

Let $(X, Y) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$ be random vectors following $P_{xy}$. Given a joint sample $\{(x_i, y_i)\}_{i=1}^n \sim P_{xy}$ (unknown), test

$H_0: P_{xy} = P_x P_y$ vs. $H_1: P_{xy} \neq P_x P_y$.

Compute a test statistic $\hat{\lambda}_n$. Reject $H_0$ if $\hat{\lambda}_n > T_\alpha$ (threshold), where $T_\alpha$ is the $(1-\alpha)$-quantile of the null distribution.

[Figure: null distribution of $\hat{\lambda}_n$ under $H_0$, with the rejection threshold $T_\alpha$ marked.]
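Below is a minimal sketch of this decision rule in Python. The slide only specifies "reject if the statistic exceeds the $(1-\alpha)$-quantile of the null distribution"; here `null_samples` is an assumed stand-in for draws from (an approximation of) that null distribution, and the names are illustrative.

```python
import numpy as np

def independence_test(stat_value, null_samples, alpha=0.05):
    """Generic decision rule: reject H0 if the observed statistic exceeds
    the (1 - alpha)-quantile of samples from the null distribution."""
    threshold = np.quantile(null_samples, 1.0 - alpha)
    return {"reject_H0": bool(stat_value > threshold), "threshold": threshold}
```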

Motivations

The modern state-of-the-art test is HSIC [Gretton et al., 2005]:
- Nonparametric, i.e., no assumption on $P_{xy}$. Kernel-based.
- Slow: runtime is $O(n^2)$, where $n$ is the sample size.
- No systematic way to choose kernels.

We propose the Finite-Set Independence Criterion (FSIC):
1. Nonparametric.
2. Linear-time: runtime complexity is $O(n)$. Fast.
3. Adaptive: kernel parameters can be tuned.

Proposal: The Finite-Set Independence Criterion (FSIC)

1. Pick two kernels: $k$ for $X$ and $l$ for $Y$ (e.g., Gaussian kernels).
2. Pick a feature $(v, w) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$.
3. Transform $(x, y) \mapsto (k(x, v),\ l(y, w))$, then measure the covariance:

$\mathrm{FSIC}^2(X, Y) = \mathrm{cov}^2_{(x,y) \sim P_{xy}}[k(x, v),\ l(y, w)]$.

[Figure: scatter plots of the data and of the transformed pairs $(k(x_i, v), l(y_i, w))$ for several choices of $(v, w)$. For dependent data, the transformed sample can show strong correlation (e.g., 0.97) at a well-placed $(v, w)$ and moderate correlation elsewhere (e.g., -0.47, 0.33); for independent data, the correlation stays near zero (e.g., 0.02, 0.09) regardless of $(v, w)$.]
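A minimal NumPy sketch of the single-feature quantity above: the squared empirical covariance of $k(x, v)$ and $l(y, w)$ with Gaussian kernels. The function names and the simple plug-in (biased) covariance are illustrative choices for exposition, not the unbiased estimator used in the test later in the deck.

```python
import numpy as np

def gauss_kernel(A, b, sigma):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), rows of A vs a single point b."""
    sq_dists = np.sum((A - b) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def fsic2_single_feature(X, Y, v, w, sigma_x, sigma_y):
    """Plug-in estimate of FSIC^2 for one feature (v, w):
    the squared empirical covariance of k(x, v) and l(y, w)."""
    kx = gauss_kernel(X, v, sigma_x)   # shape (n,)
    ly = gauss_kernel(Y, w, sigma_y)   # shape (n,)
    cov = np.mean(kx * ly) - np.mean(kx) * np.mean(ly)
    return cov ** 2
```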

General Form of FSIC

$\mathrm{FSIC}^2(X, Y) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{cov}^2_{(x,y) \sim P_{xy}}[k(x, v_j),\ l(y, w_j)]$,

for $J$ features $\{(v_j, w_j)\}_{j=1}^J \subset \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$.

Proposition 1. Assume:
1. $k$ and $l$ vanish at infinity and are translation-invariant, characteristic, and real analytic (e.g., Gaussian kernels).
2. The features $\{(v_j, w_j)\}_{j=1}^J$ are drawn from a distribution with a density.

Then, for any $J \geq 1$, almost surely, $\mathrm{FSIC}(X, Y) = 0$ iff $X$ and $Y$ are independent.

Under $H_0: P_{xy} = P_x P_y$, $n \widehat{\mathrm{FSIC}^2}$ converges to a weighted sum of $J$ dependent $\chi^2$ variables, so the $(1-\alpha)$-quantile for the threshold is difficult to obtain.

Normalized FSIC (NFSIC)

Let $\hat{\mathbf{u}} := (\widehat{\mathrm{cov}}[k(x, v_1), l(y, w_1)], \ldots, \widehat{\mathrm{cov}}[k(x, v_J), l(y, w_J)])^\top \in \mathbb{R}^J$. Then $\widehat{\mathrm{FSIC}^2} = \frac{1}{J} \hat{\mathbf{u}}^\top \hat{\mathbf{u}}$.

$\widehat{\mathrm{NFSIC}^2}(X, Y) = \hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$,

with a regularization parameter $\gamma_n \geq 0$, where $\hat{\Sigma}_{ij}$ is the covariance of $\hat{u}_i$ and $\hat{u}_j$.

Theorem 1 (NFSIC test is consistent). Assume $\gamma_n \to 0$ and the same conditions on $k$ and $l$ as before.
1. Under $H_0$, $\hat{\lambda}_n \xrightarrow{d} \chi^2(J)$ as $n \to \infty$. Easy to get the threshold $T_\alpha$.
2. Under $H_1$, $\mathbb{P}(\text{reject } H_0) \to 1$ as $n \to \infty$.

Complexity: $O(J^3 + J^2 n + (d_x + d_y) J n)$. Only a small $J$ is needed.
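Since the null distribution is asymptotically $\chi^2(J)$, the threshold $T_\alpha$ is just a chi-square quantile. A minimal SciPy sketch (the values of `alpha` and `J` are illustrative):

```python
from scipy.stats import chi2

# (1 - alpha)-quantile of chi^2(J): asymptotic rejection threshold for the NFSIC test (Theorem 1).
alpha, J = 0.05, 10
T_alpha = chi2.ppf(1.0 - alpha, df=J)
```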

Tuning Features and Kernels

Split the data into training (tr) and test (te) sets. Procedure:
1. Choose $\{(v_j, w_j)\}_{j=1}^J$ and the Gaussian widths by maximizing $\hat{\lambda}_n^{(tr)}$ (i.e., $\hat{\lambda}_n$ computed on the training set), using gradient ascent.
2. Reject $H_0$ if $\hat{\lambda}_n^{(te)} > (1-\alpha)$-quantile of $\chi^2(J)$.

Splitting avoids overfitting. The optimization is also linear-time.

Theorem 2.
1. This procedure increases a lower bound on $\mathbb{P}(\text{reject } H_0 \mid H_1 \text{ true})$ (the test power).
2. Asymptotically, if $H_0$ is true, the false rejection rate is $\alpha$.

Simulation Settings

Gaussian kernels: $k(x, v) = \exp\!\left(-\frac{\|x - v\|^2}{2\sigma_x^2}\right)$ for $X$, and $l(y, w) = \exp\!\left(-\frac{\|y - w\|^2}{2\sigma_y^2}\right)$ for $Y$.

Methods compared:
1. NFSIC-opt: NFSIC with optimization. $O(n)$.
2. QHSIC [Gretton et al., 2005]: state-of-the-art HSIC. $O(n^2)$.
3. NFSIC-med: NFSIC with random features.
4. NyHSIC: linear-time HSIC with Nyström approximation.
5. FHSIC: linear-time HSIC with random Fourier features.
6. RDC [Lopez-Paz et al., 2013]: canonical correlation analysis with a cosine basis.

$J = 10$ in NFSIC.

Youtube Video (X) vs. Caption (Y)

$X \in \mathbb{R}^{2000}$: Fisher vector encoding of motion boundary histogram descriptors [Wang and Schmid, 2013]. $Y \in \mathbb{R}^{1878}$: bag of words (term frequencies). $\alpha = 0.01$.

[Figure: test power vs. sample size $n$ (2000 to 8000) for the proposed NFSIC and QHSIC; a second panel shows type-I error vs. $n$, obtained by exchanging $(X, Y)$ pairs so that $H_0$ is true.]

For large $n$, NFSIC is comparable to HSIC.

Conclusions

Proposed the Finite-Set Independence Criterion (FSIC): $\mathrm{FSIC}(X, Y) = 0 \iff X$ and $Y$ are independent.

The independence test based on FSIC is
1. nonparametric (no parametric assumption on $P_{xy}$),
2. linear-time,
3. adaptive (parameters automatically tuned).

Python code: github.com/wittawatj/fsic-test

Poster #111. Tonight.

Questions? Thank you.

Requirements on the Kernels

Definition 1 (Analytic kernels). $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be analytic if, for all $x \in \mathcal{X}$, $v \mapsto k(x, v)$ is a real analytic function on $\mathcal{X}$. Analytic means the Taylor series about $x_0$ converges for all $x_0 \in \mathcal{X}$; in particular, $k$ is infinitely differentiable.

Definition 2 (Characteristic kernels). Let $\mu_P(v) := \mathbb{E}_{z \sim P}[k(z, v)]$. $k$ is said to be characteristic if $\mu_P$ is unique for distinct $P$; equivalently, $P \mapsto \mu_P$ is injective.

[Figure: the space of distributions $P, Q$ mapped into the RKHS as mean embeddings $\mu_P, \mu_Q$, with $\mathrm{MMD}(P, Q)$ the distance between them.]

Optimization Objective = Power Lower Bound

Recall $\hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$. Let $\mathrm{NFSIC}^2(X, Y) := \lambda_n := n\, \mathbf{u}^\top \Sigma^{-1} \mathbf{u}$ (the population counterpart).

Theorem 3 (A lower bound on the test power).
1. Under some boundedness assumptions, the test power satisfies $\mathbb{P}_{H_1}(\hat{\lambda}_n \geq T_\alpha) \geq L(\lambda_n)$, where $L(\lambda_n)$ is an explicit function of $\lambda_n$, $n$, $T_\alpha$, and $\gamma_n$ consisting of three exponentially decaying correction terms with constants $\xi_1, \ldots, \xi_4, c_3 > 0$.
2. For large $n$, $L(\lambda_n)$ is increasing in $\lambda_n$.

Set the test locations and Gaussian widths to $\arg\max L(\lambda_n) = \arg\max \lambda_n$.

An Estimator of $\widehat{\mathrm{NFSIC}^2}$

$\hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$, with $J$ test locations $\{(v_i, w_i)\}_{i=1}^J$.

Let $K = [k(v_i, x_j)] \in \mathbb{R}^{J \times n}$ and $L = [l(w_i, y_j)] \in \mathbb{R}^{J \times n}$. (No $n \times n$ Gram matrix is needed.)

Estimators ($\circ$ denotes the elementwise product):
1. $\hat{\mathbf{u}} = \frac{(K \circ L)\mathbf{1}_n}{n-1} - \frac{(K\mathbf{1}_n) \circ (L\mathbf{1}_n)}{n(n-1)}$.
2. $\hat{\Sigma} = \frac{\Gamma \Gamma^\top}{n}$, where $\Gamma := \left(K - \tfrac{1}{n} K\mathbf{1}_n\mathbf{1}_n^\top\right) \circ \left(L - \tfrac{1}{n} L\mathbf{1}_n\mathbf{1}_n^\top\right) - \hat{\mathbf{u}}\mathbf{1}_n^\top$.

$\hat{\lambda}_n$ can be computed in $O(J^3 + J^2 n + (d_x + d_y) J n)$ time. Main point: linear in $n$, cubic in $J$ (small).
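A minimal NumPy sketch of this estimator, following the matrix formulation above. This is not the authors' fsic-test package; the helper `gauss_gram`, the Gaussian widths, and the default `gamma_n` are illustrative choices.

```python
import numpy as np

def gauss_gram(V, X, sigma):
    """Cross Gram matrix [k(v_i, x_j)] with a Gaussian kernel; V is (J, d), X is (n, d)."""
    sq = np.sum(V**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * V @ X.T
    return np.exp(-sq / (2.0 * sigma**2))

def nfsic_statistic(X, Y, V, W, sigma_x, sigma_y, gamma_n=1e-4):
    """NFSIC statistic lambda_hat_n = n * u^T (Sigma + gamma_n I)^{-1} u,
    using the J x n matrices K and L (no n x n Gram matrix)."""
    n = X.shape[0]
    K = gauss_gram(V, X, sigma_x)            # (J, n)
    L = gauss_gram(W, Y, sigma_y)            # (J, n)
    # u_hat: unbiased covariance estimates, one entry per feature (v_j, w_j)
    u = (K * L).sum(axis=1) / (n - 1) - (K.sum(axis=1) * L.sum(axis=1)) / (n * (n - 1))
    # Gamma: row-centred K and L multiplied elementwise, minus u_hat broadcast over columns
    Kc = K - K.mean(axis=1, keepdims=True)
    Lc = L - L.mean(axis=1, keepdims=True)
    Gamma = Kc * Lc - u[:, None]
    Sigma = Gamma @ Gamma.T / n              # (J, J) covariance estimate of u_hat
    J = V.shape[0]
    return n * u @ np.linalg.solve(Sigma + gamma_n * np.eye(J), u)
```

Comparing the returned statistic with the $\chi^2(J)$ quantile from the earlier snippet gives the linear-time test decision.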

Alternative View of FSIC

Rewrite the covariance in $\mathrm{FSIC}^2(X, Y) = \mathrm{cov}^2_{(x,y)\sim P_{xy}}[k(x, v),\ l(y, w)]$:

$\mathrm{cov}_{xy}[k(x, v), l(y, w)] = \mathbb{E}_{xy}[k(x, v)\, l(y, w)] - \mathbb{E}_x[k(x, v)]\, \mathbb{E}_y[l(y, w)] =: \mu_{xy}(v, w) - \mu_x(v)\mu_y(w)$ (the witness function),

estimated by $\hat{\mu}_{xy}(v, w) - \hat{\mu}_x(v)\hat{\mu}_y(w)$.

FSIC evaluates the witness function at $J$ locations: cost $O(Jn)$. HSIC is the RKHS norm of the witness function: cost $O(n^2)$.

HSIC vs. FSIC

Recall the witness $\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \hat{\mu}_x(v)\hat{\mu}_y(w)$.

HSIC [Gretton et al., 2005] $= \|\hat{u}\|_{\text{RKHS}}$: good when the difference between $p_{xy}$ and $p_x p_y$ is spatially diffuse, i.e., $\hat{u}$ is almost flat.

FSIC [proposed] $= \frac{1}{J}\sum_{i=1}^{J} \hat{u}^2(v_i, w_i)$: good when the difference between $p_{xy}$ and $p_x p_y$ is local, i.e., $\hat{u}$ is mostly zero but has many peaks (feature interaction).

Toy Problem 1: Independent Gaussians

$X \sim \mathcal{N}(0, I_{d_x})$ and $Y \sim \mathcal{N}(0, I_{d_y})$, independent, so $H_0$ holds. Set $\alpha := 0.05$, $d_x = d_y = 250$.

[Figure: type-I error and runtime (seconds) vs. sample size $n$ ($10^3$ to $10^5$) for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.]

Correct type-I errors (false positive rate).

Toy Problem 2: Sinusoid

$p_{xy}(x, y) \propto 1 + \sin(\omega x)\sin(\omega y)$, where $x, y \in (-\pi, \pi)$. The changes between $p_{xy}$ and $p_x p_y$ are local. Set $n = 4000$.

[Figure: heat maps of $p_{xy}$ for $\omega = 1, 2, 3, 4$, with finer oscillations as $\omega$ grows; a second panel shows test power vs. $\omega \in \{1, \ldots, 6\}$ for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.]

Main point: NFSIC handles local changes in the joint space well.
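The talk does not say how this density is sampled; one simple option is rejection sampling from a uniform proposal on $(-\pi, \pi)^2$, since the unnormalized density is bounded by 2. A sketch under that assumption:

```python
import numpy as np

def sample_sinusoid(n, omega, rng=None):
    """Rejection sampler for p_xy(x, y) ~ 1 + sin(omega*x) sin(omega*y) on (-pi, pi)^2:
    propose uniformly, accept with probability (1 + sin(omega*x) sin(omega*y)) / 2."""
    rng = np.random.default_rng() if rng is None else rng
    xs, ys = [], []
    while len(xs) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        if rng.uniform() < (1.0 + np.sin(omega * x) * np.sin(omega * y)) / 2.0:
            xs.append(x)
            ys.append(y)
    return np.array(xs)[:, None], np.array(ys)[:, None]
```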

Toy Problem 3: Gaussian Sign

$y = |Z| \prod_{i=1}^{d_x} \mathrm{sign}(x_i)$, where $x \sim \mathcal{N}(0, I_{d_x})$ and $Z \sim \mathcal{N}(0, 1)$ (noise). There is full interaction among $x_1, \ldots, x_{d_x}$: all of $x_1, \ldots, x_{d_x}$ must be considered to detect the dependency.

[Figure: test power vs. sample size $n$ ($10^3$ to $10^5$).]
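A minimal sketch of the Gaussian-sign data generation described above; the function name and output shapes are illustrative.

```python
import numpy as np

def sample_gaussian_sign(n, dx, rng=None):
    """Toy Problem 3: y = |Z| * prod_i sign(x_i), with x ~ N(0, I_dx) and Z ~ N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, dx))
    Z = rng.standard_normal(n)
    Y = np.abs(Z) * np.prod(np.sign(X), axis=1)
    return X, Y[:, None]
```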

Test Power vs. J

Test power does not always increase with $J$ (the number of test locations). Here $n = 800$, on the sinusoid problem with $\omega = 2$.

[Figure: test power vs. $J \in \{10, \ldots, 600\}$.]

As $J$ grows, accurate estimation of $\hat{\Sigma} \in \mathbb{R}^{J \times J}$ in $\hat{\lambda}_n = n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$ becomes more difficult, and a large $J$ defeats the purpose of a linear-time test.

Real Problem: Million Song Data

Song ($X$) vs. year of release ($Y$). Western commercial tracks from 1922 to 2011 [Bertin-Mahieux et al., 2011]. $X \in \mathbb{R}^{90}$ contains audio features; $Y \in \mathbb{R}$ is the year of release.

[Figure: type-I error and test power vs. sample size $n$ (500 to 2000) for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC; type-I error is measured by breaking the $(X, Y)$ pairs to simulate $H_0$.]

NFSIC-opt has the highest power among the linear-time tests.

Alternative Form of $\hat{u}(v, w)$

Recall $\widehat{\mathrm{FSIC}^2} = \frac{1}{J}\sum_{i=1}^{J} \hat{u}(v_i, w_i)^2$.

Let $\widehat{\mu_x \mu_y}(v, w)$ be an unbiased estimator of $\mu_x(v)\mu_y(w)$:
$\widehat{\mu_x \mu_y}(v, w) := \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} k(x_i, v)\, l(y_j, w)$.

An unbiased estimator of $u(v, w)$ is
$\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \widehat{\mu_x \mu_y}(v, w) = \frac{2}{n(n-1)} \sum_{i < j} h_{(v,w)}\big((x_i, y_i), (x_j, y_j)\big)$,
where $h_{(v,w)}\big((x, y), (x', y')\big) := \frac{1}{2}\big(k(x, v) - k(x', v)\big)\big(l(y, w) - l(y', w)\big)$.

Given $(v, w)$, $\hat{u}(v, w)$ is a one-sample 2nd-order U-statistic.
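A minimal NumPy sketch of this pairwise U-statistic for a single location $(v, w)$, written quadratically in $n$ for clarity (the linear-time test uses the matrix form on the estimator slide); the function name and Gaussian kernels are illustrative.

```python
import numpy as np

def u_hat(X, Y, v, w, sigma_x, sigma_y):
    """Unbiased U-statistic estimator of u(v, w) = mu_xy(v, w) - mu_x(v) mu_y(w),
    averaging h_{(v,w)} over all pairs i < j (quadratic in n; for illustration only)."""
    kx = np.exp(-np.sum((X - v) ** 2, axis=1) / (2 * sigma_x ** 2))   # k(x_i, v)
    ly = np.exp(-np.sum((Y - w) ** 2, axis=1) / (2 * sigma_y ** 2))   # l(y_i, w)
    n = len(kx)
    diff_k = kx[:, None] - kx[None, :]       # k(x_i, v) - k(x_j, v)
    diff_l = ly[:, None] - ly[None, :]       # l(y_i, w) - l(y_j, w)
    iu = np.triu_indices(n, k=1)             # all pairs with i < j
    h = 0.5 * diff_k[iu] * diff_l[iu]
    return 2.0 / (n * (n - 1)) * np.sum(h)
```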

References

Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. (2011). The Million Song Dataset. In International Conference on Music Information Retrieval (ISMIR).

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring Statistical Dependence with Hilbert-Schmidt Norms. In Algorithmic Learning Theory (ALT), pages 63-77.

Lopez-Paz, D., Hennig, P., and Schölkopf, B. (2013). The Randomized Dependence Coefficient. In Advances in Neural Information Processing Systems (NIPS), pages 1-9.

Wang, H. and Schmid, C. (2013). Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision (ICCV), pages 3551-3558.