An Adaptive Test of Independence with Analytic Kernel Embeddings

An Adaptive Test of Independence with Analytic Kernel Embeddings
Wittawat Jitkrittum, Gatsby Unit, University College London, wittawat@gatsby.ucl.ac.uk
Probabilistic Graphical Model Workshop 2017, Institute of Statistical Mathematics, Tokyo, 24 Feb 2017
1/29

Reference
Coauthors: Arthur Gretton (Gatsby Unit, UCL), Zoltán Szabó (École Polytechnique).
Preprint: An Adaptive Test of Independence with Analytic Kernel Embeddings. Wittawat Jitkrittum, Zoltán Szabó, Arthur Gretton. https://arxiv.org/abs/1610.04782
Python code: https://github.com/wittawatj/fsic-test
2/29

What Is Independence Testing?
Let $X \in \mathbb{R}^{d_x}$, $Y \in \mathbb{R}^{d_y}$ be random vectors following $P_{xy}$.
Given a joint sample $\{(x_i, y_i)\}_{i=1}^n \sim P_{xy}$ (unknown), test
$H_0: P_{xy} = P_x P_y$ vs. $H_1: P_{xy} \neq P_x P_y$.
$P_{xy} = P_x P_y$ is equivalent to $X \perp Y$.
Compute a test statistic $\hat{\lambda}_n$. Reject $H_0$ if $\hat{\lambda}_n \geq T_\alpha$ (threshold), where $T_\alpha$ is the $(1 - \alpha)$-quantile of the null distribution.
[Figure: null distribution $P_{H_0}(\hat{\lambda}_n)$, alternative distribution $P_{H_1}(\hat{\lambda}_n)$, the threshold $T_\alpha$, and a scatter plot of a dependent $(x, y)$ sample.]
3/29

Goals
Want a test which is:
1. Non-parametric, i.e., no parametric assumption on $P_{xy}$.
2. Linear-time, i.e., computational complexity is $O(n)$. Fast.
3. Adaptive, i.e., has a well-defined criterion for parameter tuning.

Method                                | Non-parametric | O(n) | Adaptive
Pearson correlation                   | no             | yes  | no
HSIC [Gretton et al., 2005]           | yes            | no   | no
HSIC with RFFs* [Zhang et al., 2016]  | yes            | yes  | no
FSIC (proposed)                       | yes            | yes  | yes

*RFFs = Random Fourier Features.
Focus on cases where $n$ (the sample size) is large.
4/29

Witness Function [Gretton et al., 2012]
A function showing the differences between two distributions $P$ and $Q$.
Observe $X = \{x_1, \ldots, x_n\} \sim P$ and $Y = \{y_1, \ldots, y_n\} \sim Q$.
Gaussian kernel: $k(x, v) = \exp\!\left(-\frac{\|x - v\|^2}{2\sigma^2}\right)$.
Empirical mean embedding of $P$: $\hat{\mu}_P(v) = \frac{1}{n}\sum_{i=1}^n k(x_i, v)$.
Witness: $\hat{u}(v) = \hat{\mu}_P(v) - \hat{\mu}_Q(v)$.
Maximum Mean Discrepancy (MMD): $\widehat{\mathrm{MMD}}(P, Q) = \|\hat{u}\|_{\mathrm{RKHS}}$. $\mathrm{MMD}(P, Q) = 0$ if and only if $P = Q$.
[Figure: Gaussian kernels placed on the $x_i$ and $y_i$, the two empirical mean embeddings $\hat{\mu}_P(v)$ and $\hat{\mu}_Q(v)$, and the witness $\hat{u}(v)$.]
5/29
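
A minimal numerical sketch of the quantities on this slide (empirical mean embeddings and the MMD witness), assuming one-dimensional samples, Gaussian kernels, and a hand-picked bandwidth; illustrative only, not the authors' code.

import numpy as np

def gauss_kernel(x, v, sigma=1.0):
    # k(x, v) = exp(-(x - v)^2 / (2 sigma^2)) for scalar inputs
    return np.exp(-(x - v) ** 2 / (2.0 * sigma ** 2))

def mean_embedding(sample, v, sigma=1.0):
    # empirical mean embedding: (1/n) sum_i k(x_i, v)
    return np.mean(gauss_kernel(sample, v, sigma))

def witness(X, Y, v, sigma=1.0):
    # u_hat(v) = mu_hat_P(v) - mu_hat_Q(v)
    return mean_embedding(X, v, sigma) - mean_embedding(Y, v, sigma)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=500)   # sample from P
Y = rng.normal(1.0, 1.0, size=500)   # sample from Q (shifted mean)
grid = np.linspace(-4.0, 5.0, 200)
u_vals = np.array([witness(X, Y, v) for v in grid])
print("max |witness| on the grid:", np.abs(u_vals).max())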

Independence Test with HSIC [Gretton et al., 2005]
Hilbert-Schmidt Independence Criterion.
$\mathrm{HSIC}(X, Y) = \mathrm{MMD}(P_{xy}, P_x P_y) = \|u\|_{\mathrm{RKHS}}$ (needs two kernels: $k$ for $X$ and $l$ for $Y$).
Empirical witness: $\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \hat{\mu}_x(v)\hat{\mu}_y(w)$, where $\hat{\mu}_{xy}(v, w) = \frac{1}{n}\sum_{i=1}^n k(x_i, v)\, l(y_i, w)$.
$\mathrm{HSIC}(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
Test statistic $= \|\hat{u}\|_{\mathrm{RKHS}}$ (the "flatness" of $\hat{u}$). Complexity: $O(n^2)$.
Key question: can we measure this flatness in another way that costs only $O(n)$?
[Figure: heatmaps of the joint embedding $\hat{\mu}_{xy}(v, w)$, the product of marginal embeddings $\hat{\mu}_x(v)\hat{\mu}_y(w)$, and their difference, the witness $\hat{u}(v, w)$.]
6/29
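
For contrast with the linear-time proposal that follows, here is a minimal sketch of a standard quadratic-time (biased) HSIC estimator, $\frac{1}{n^2}\operatorname{tr}(KHLH)$ (normalization conventions vary, e.g., $1/(n-1)^2$), which needs the two $n \times n$ Gram matrices that FSIC avoids. Bandwidths are placeholder choices.

import numpy as np

def gram_gauss(A, sigma):
    # Gram matrix G_ij = exp(-||a_i - a_j||^2 / (2 sigma^2))
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_biased(X, Y, sx=1.0, sy=1.0):
    n = X.shape[0]
    K, L = gram_gauss(X, sx), gram_gauss(Y, sy)
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
Y = X + 0.3 * rng.normal(size=(300, 1))     # dependent Y
print(hsic_biased(X, Y))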

Proposal: The Finite Set Independence Criterion (FSIC)
Idea: evaluate $\hat{u}^2(v, w)$ at only finitely many test locations.
A set of $J$ random locations: $\{(v_1, w_1), \ldots, (v_J, w_J)\}$.
$\widehat{\mathrm{FSIC}}^2(X, Y) = \frac{1}{J}\sum_{i=1}^J \hat{u}^2(v_i, w_i)$.
Complexity: $O((d_x + d_y) J n)$. Linear time.
But what about an unlucky set of locations? Can $\mathrm{FSIC}^2(X, Y) = 0$ even if $X$ and $Y$ are dependent?
No. The population $\mathrm{FSIC}(X, Y) = 0$ iff $X \perp Y$, almost surely.
[Figure: heatmap of the squared witness $\hat{u}^2(v, w)$ with the $J$ test locations overlaid.]
7/29
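
A minimal sketch of the plug-in FSIC statistic defined above, with Gaussian kernels, $J$ random locations, and hand-picked bandwidths; see https://github.com/wittawatj/fsic-test for the authors' implementation.

import numpy as np

def cross_gauss(data, loc, sigma):
    # matrix of kernel values k(data_i, loc_j), shape (n, J)
    d2 = (np.sum(data ** 2, 1)[:, None] + np.sum(loc ** 2, 1)[None, :]
          - 2.0 * data @ loc.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fsic2(X, Y, V, W, sx=1.0, sy=1.0):
    # u_hat(v, w) = mean_i k(x_i, v) l(y_i, w) - (mean_i k(x_i, v)) (mean_i l(y_i, w))
    K, L = cross_gauss(X, V, sx), cross_gauss(Y, W, sy)   # (n, J) each
    u = np.mean(K * L, axis=0) - K.mean(axis=0) * L.mean(axis=0)
    return np.mean(u ** 2)                                # (1/J) sum_i u_hat^2

rng = np.random.default_rng(0)
n, J = 1000, 5
X = rng.normal(size=(n, 1))
Y = np.sin(3.0 * X) + 0.2 * rng.normal(size=(n, 1))       # dependent pair
V, W = rng.normal(size=(J, 1)), rng.normal(size=(J, 1))   # random test locations
print(fsic2(X, Y, V, W))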

Requirements on the Kernels
Definition 1 (Analytic kernels). $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be analytic if, for all $x \in \mathcal{X}$, $v \mapsto k(x, v)$ is a real analytic function on $\mathcal{X}$.
Analytic: the Taylor series about $x_0$ converges for all $x_0 \in \mathcal{X}$; in particular, $k$ is infinitely differentiable.
Definition 2 (Characteristic kernels). Let $P, Q$ be two distributions and $g$ be a kernel. Let $\mu_P(v) := \mathbb{E}_{z \sim P}[g(z, v)]$ and $\mu_Q(v) := \mathbb{E}_{z \sim Q}[g(z, v)]$. Then $g$ is said to be characteristic if $P \neq Q$ implies $\mu_P \neq \mu_Q$.
[Figure: the space of distributions embedded into the RKHS; $\mathrm{MMD}(P, Q)$ is the distance between the embeddings $\mu_P$ and $\mu_Q$.]
8/29

FSIC Is a Dependence Measure
Proposition 1. Assume
1. The product kernel $g((x, y), (x', y')) := k(x, x')\, l(y, y')$ is characteristic and analytic (e.g., $k, l$ Gaussian kernels).
2. The test locations $\{(v_i, w_i)\}_{i=1}^J$ are drawn from a distribution that has a density.
Then, almost surely over the draw of the test locations, $\mathrm{FSIC}(X, Y) = 0$ iff $X$ and $Y$ are independent.
Proof idea: Under $H_1$, $u$ is not the zero function (because $P \mapsto \mathbb{E}_{z \sim P}[g(z, \cdot)]$ is injective), and $u$ is analytic. So $R_u = \{(v, w) \mid u(v, w) = 0\}$ has zero Lebesgue measure, and $\{(v_i, w_i)\}_{i=1}^J$ will not fall in $R_u$ (with probability 1).
[Figure: heatmaps of $\hat{\mu}_{P_{xy}}(v, w)$, $\hat{\mu}_{P_x}(v)\hat{\mu}_{P_y}(w)$, and $\hat{u}^2(v, w)$ under $H_0$ and under $H_1$.]
9/29

Alternative View of the Witness $u(v, w)$
The witness $u(v, w)$ can be rewritten as
$u(v, w) := \mu_{xy}(v, w) - \mu_x(v)\mu_y(w) = \mathbb{E}_{xy}[k(x, v)\, l(y, w)] - \mathbb{E}_x[k(x, v)]\, \mathbb{E}_y[l(y, w)] = \mathrm{cov}_{xy}[k(x, v),\, l(y, w)]$.
1. Transform $x \mapsto k(x, v)$ and $y \mapsto l(y, w)$ (from $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$ to $\mathbb{R}$).
2. Then take the covariance.
The kernel transformations turn the linear covariance into a dependence measure.
10/29
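
A quick numerical check of the covariance view above: at a fixed location $(v, w)$, the plug-in witness equals the empirical covariance between the transformed variables $k(x, v)$ and $l(y, w)$. The location, bandwidths, and toy data here are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
y = np.sin(2.0 * x) + 0.3 * rng.normal(size=n)   # dependent y
v, w, sx, sy = 0.5, 0.0, 1.0, 1.0
kx = np.exp(-(x - v) ** 2 / (2.0 * sx ** 2))     # k(x, v)
ly = np.exp(-(y - w) ** 2 / (2.0 * sy ** 2))     # l(y, w)
u_hat = np.mean(kx * ly) - np.mean(kx) * np.mean(ly)
cov = np.cov(kx, ly, bias=True)[0, 1]            # covariance normalized by n
print(u_hat, cov)                                # the two numbers agree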

Alternative Form of $\hat{u}(v, w)$
Recall $\widehat{\mathrm{FSIC}}^2 = \frac{1}{J}\sum_{i=1}^J \hat{u}(v_i, w_i)^2$.
Let $\widehat{\mu_x \mu_y}(v, w)$ be an unbiased estimator of $\mu_x(v)\mu_y(w)$:
$\widehat{\mu_x \mu_y}(v, w) := \frac{1}{n(n-1)}\sum_{i=1}^n \sum_{j \neq i} k(x_i, v)\, l(y_j, w)$.
An unbiased estimator of $u(v, w)$ is
$\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \widehat{\mu_x \mu_y}(v, w) = \frac{2}{n(n-1)}\sum_{i < j} h_{(v,w)}\big((x_i, y_i), (x_j, y_j)\big)$,
where $h_{(v,w)}\big((x, y), (x', y')\big) := \frac{1}{2}\big(k(x, v) - k(x', v)\big)\big(l(y, w) - l(y', w)\big)$.
For a fixed $(v, w)$, $\hat{u}(v, w)$ is a one-sample 2nd-order U-statistic.
11/29
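
The pair sum above looks quadratic, but expanding it gives $\hat{u}(v, w) = \big(n \sum_i k_i l_i - (\sum_i k_i)(\sum_i l_i)\big) / (n(n-1))$ with $k_i = k(x_i, v)$, $l_i = l(y_i, w)$, which costs $O(n)$ (the same expression reappears in the matrix estimator later). The sketch below checks this identity against the brute-force pair sum on toy data; all names and data are illustrative.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
y = x ** 2 + 0.1 * rng.normal(size=n)
v, w, sx, sy = 0.3, 0.7, 1.0, 1.0
k = np.exp(-(x - v) ** 2 / (2.0 * sx ** 2))
l = np.exp(-(y - w) ** 2 / (2.0 * sy ** 2))

# O(n) form of the unbiased estimator
u_fast = (n * np.sum(k * l) - np.sum(k) * np.sum(l)) / (n * (n - 1))

# brute-force U-statistic: (2 / (n(n-1))) sum_{i<j} 0.5 (k_i - k_j)(l_i - l_j)
u_slow = sum(0.5 * (k[i] - k[j]) * (l[i] - l[j])
             for i, j in combinations(range(n), 2)) * 2.0 / (n * (n - 1))
print(u_fast, u_slow)   # identical up to floating-point error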

Asymptotic Distribution of $\hat{\mathbf{u}}$
$\widehat{\mathrm{FSIC}}^2(X, Y) = \frac{1}{J}\sum_{i=1}^J \hat{u}^2(v_i, w_i) = \frac{1}{J}\hat{\mathbf{u}}^\top \hat{\mathbf{u}}$, where $\hat{\mathbf{u}} = \big(\hat{u}(v_1, w_1), \ldots, \hat{u}(v_J, w_J)\big)^\top$.
Proposition 2 (Asymptotic distribution of $\hat{\mathbf{u}}$). For any fixed locations $\{(v_i, w_i)\}_{i=1}^J$, we have $\sqrt{n}(\hat{\mathbf{u}} - \mathbf{u}) \xrightarrow{d} \mathcal{N}(0, \Sigma)$, with
$\Sigma_{ij} = \mathbb{E}_{xy}\big[\tilde{k}(x, v_i)\, \tilde{l}(y, w_i)\, \tilde{k}(x, v_j)\, \tilde{l}(y, w_j)\big] - u(v_i, w_i)\, u(v_j, w_j)$,
$\tilde{k}(x, v) := k(x, v) - \mathbb{E}_{x'} k(x', v)$, $\tilde{l}(y, w) := l(y, w) - \mathbb{E}_{y'} l(y', w)$.
Under $H_0$, $n\widehat{\mathrm{FSIC}}^2 = \frac{n}{J}\hat{\mathbf{u}}^\top \hat{\mathbf{u}} \xrightarrow{d}$ a weighted sum of dependent $\chi^2$ variables. It is difficult to get the $(1 - \alpha)$-quantile for the threshold.
12/29

Normalized FSIC (NFSIC)
$\widehat{\mathrm{NFSIC}}^2(X, Y) = \hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$, with a regularization parameter $\gamma_n \geq 0$.
Key: NFSIC = FSIC normalized by the covariance.
Theorem 1 (The NFSIC test is consistent). Assume
1. The product kernel is characteristic and analytic.
2. $\lim_{n \to \infty} \gamma_n = 0$.
Then, for any $k, l$ and $\{(v_i, w_i)\}_{i=1}^J$,
1. Under $H_0$, $\hat{\lambda}_n \xrightarrow{d} \chi^2(J)$ as $n \to \infty$.
2. Under $H_1$, $\lim_{n \to \infty} \mathbb{P}(\hat{\lambda}_n \geq T_\alpha) = 1$, almost surely.
Asymptotically, the false positive rate is $\alpha$ under $H_0$, and the test always rejects under $H_1$.
13/29
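
A small example of the rejection threshold implied by Theorem 1: under $H_0$ the statistic is asymptotically $\chi^2(J)$, so $T_\alpha$ is the $(1 - \alpha)$-quantile of that distribution.

from scipy.stats import chi2

J, alpha = 10, 0.05
T_alpha = chi2.ppf(1.0 - alpha, df=J)
print(T_alpha)   # reject H_0 when the NFSIC statistic exceeds this value (about 18.31)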

An Estimator of $\widehat{\mathrm{NFSIC}}^2$
$\hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$.
Test locations $\{(v_i, w_i)\}_{i=1}^J$. $\mathbf{K} = [k(v_i, x_j)] \in \mathbb{R}^{J \times n}$, $\mathbf{L} = [l(w_i, y_j)] \in \mathbb{R}^{J \times n}$. (No $n \times n$ Gram matrix.)
Estimators:
1. $\hat{\mathbf{u}} = \frac{(\mathbf{K} \circ \mathbf{L})\mathbf{1}_n}{n - 1} - \frac{(\mathbf{K}\mathbf{1}_n) \circ (\mathbf{L}\mathbf{1}_n)}{n(n - 1)}$.
2. $\hat{\Sigma} = \frac{\Gamma\Gamma^\top}{n}$, where $\Gamma := \big(\mathbf{K} - \tfrac{1}{n}\mathbf{K}\mathbf{1}_n\mathbf{1}_n^\top\big) \circ \big(\mathbf{L} - \tfrac{1}{n}\mathbf{L}\mathbf{1}_n\mathbf{1}_n^\top\big) - \hat{\mathbf{u}}\mathbf{1}_n^\top$.
$\hat{\lambda}_n$ can be computed in $O(J^3 + J^2 n + (d_x + d_y)Jn)$ time.
Main point: linear in $n$; cubic in $J$ (which is small).
14/29
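
A minimal sketch implementing the matrix formulas above (the $J \times n$ cross-kernel matrices, the unbiased $\hat{\mathbf{u}}$, and $\hat{\Sigma} = \Gamma\Gamma^\top / n$) with Gaussian kernels. Bandwidths, $J$, and $\gamma_n$ are placeholder choices, and this is not the authors' released implementation (https://github.com/wittawatj/fsic-test).

import numpy as np

def cross_kernel(loc, data, sigma):
    # matrix [k(v_i, x_j)] of shape (J, n)
    d2 = (np.sum(loc ** 2, 1)[:, None] + np.sum(data ** 2, 1)[None, :]
          - 2.0 * loc @ data.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nfsic_stat(X, Y, V, W, sx=1.0, sy=1.0, gamma=1e-5):
    n, J = X.shape[0], V.shape[0]
    K = cross_kernel(V, X, sx)                       # (J, n)
    L = cross_kernel(W, Y, sy)                       # (J, n)
    # unbiased u_hat, computed in O(Jn)
    u = (K * L).sum(axis=1) / (n - 1) - (K.sum(axis=1) * L.sum(axis=1)) / (n * (n - 1))
    Kc = K - K.mean(axis=1, keepdims=True)           # K - (1/n) K 1 1^T
    Lc = L - L.mean(axis=1, keepdims=True)
    Gamma = Kc * Lc - u[:, None]
    Sigma = Gamma @ Gamma.T / n
    return n * u @ np.linalg.solve(Sigma + gamma * np.eye(J), u)

rng = np.random.default_rng(0)
n, J = 4000, 10
X = rng.normal(size=(n, 1))
Y = np.sin(2.0 * X) + 0.3 * rng.normal(size=(n, 1))  # dependent Y
V, W = rng.normal(size=(J, 1)), rng.normal(size=(J, 1))
print(nfsic_stat(X, Y, V, W))                        # compare with the chi2(J) threshold above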

Optimizing Test Locations $\{(v_i, w_i)\}_{i=1}^J$
The $\widehat{\mathrm{NFSIC}}^2$ test is consistent for any random locations $\{(v_i, w_i)\}_{i=1}^J$. In practice, tuning them increases the test power.
Under $H_0: X \perp Y$, $\hat{\lambda}_n \sim \chi^2(J)$ as $n \to \infty$.
Under $H_1$, $\hat{\lambda}_n$ will be large, following some distribution $P_{H_1}(\hat{\lambda}_n)$.
Test power $= \mathbb{P}(\text{reject } H_0 \mid H_1 \text{ true}) = \mathbb{P}(\hat{\lambda}_n \geq T_\alpha)$.
Idea: pick the locations and Gaussian widths to maximize (a lower bound on) the test power.
[Figure: the $\chi^2(J)$ null distribution, the threshold $T_\alpha$, and the distribution $P_{H_1}(\hat{\lambda}_n)$ under $H_1$.]
15/29

Optimization Objective = Power Lower Bound
Recall $\hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$.
Theorem 2 (A lower bound on the test power). Let $\mathrm{NFSIC}^2(X, Y) := \lambda_n := n\, \mathbf{u}^\top \Sigma^{-1} \mathbf{u}$. Under some conditions, for any $k$, $l$, and $\{(v_i, w_i)\}_{i=1}^J$, the test power satisfies $\mathbb{P}(\hat{\lambda}_n \geq T_\alpha) \geq L(\lambda_n)$, where $L(\lambda_n)$ is an explicit expression: one minus a sum of terms that decay exponentially in $n$ and $\lambda_n$, involving several positive constants (see the paper for the exact form).
For large $n$, $L(\lambda_n)$ is increasing in $\lambda_n$.
Do: set the locations and Gaussian widths $= \arg\max L(\lambda_n) = \arg\max \lambda_n$.
16/29

Optimization Procedure
$\mathrm{NFSIC}^2(X, Y) := \lambda_n := n\, \mathbf{u}^\top \Sigma^{-1} \mathbf{u}$ is unknown.
Split the data into 2 disjoint sets: a training (tr) set and a test (te) set.
Procedure:
1. Estimate $\lambda_n$ with $\hat{\lambda}_n^{(\mathrm{tr})}$ (i.e., computed on the training set).
2. Optimize all $\{(v_i, w_i)\}_{i=1}^J$ and the Gaussian widths with gradient ascent.
3. Run the independence test with $\hat{\lambda}_n^{(\mathrm{te})}$. Reject $H_0$ if $\hat{\lambda}_n^{(\mathrm{te})} \geq T_\alpha$.
Splitting avoids overfitting.
But what does this do to $\mathbb{P}(\hat{\lambda}_n \geq T_\alpha)$ when $H_0$ holds? It is still asymptotically $\alpha$: $\lambda_n = 0$ iff $X, Y$ are independent, so under $H_0$ the optimization is $\arg\max 0$, i.e., arbitrary locations, and the asymptotic null distribution is $\chi^2(J)$ for any locations.
17/29
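
A minimal sketch of the splitting logic above. The talk tunes the locations and kernel widths by gradient ascent of $\hat{\lambda}_n^{(\mathrm{tr})}$; as a simpler stand-in, this sketch picks the best of a few random candidate location sets on the training half and then tests on the held-out half. It assumes nfsic_stat from the earlier sketch is already defined; all other names are illustrative.

import numpy as np
from scipy.stats import chi2

def nfsic_test(X, Y, J=10, alpha=0.05, n_candidates=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    tr, te = perm[: n // 2], perm[n // 2:]
    best_stat, best_VW = -np.inf, None
    for _ in range(n_candidates):
        # candidate locations: training points plus a little jitter
        idx = rng.choice(tr, size=J, replace=False)
        V = X[idx] + 0.5 * rng.normal(size=(J, X.shape[1]))
        W = Y[idx] + 0.5 * rng.normal(size=(J, Y.shape[1]))
        s = nfsic_stat(X[tr], Y[tr], V, W)       # objective on the training half
        if s > best_stat:
            best_stat, best_VW = s, (V, W)
    V, W = best_VW
    stat = nfsic_stat(X[te], Y[te], V, W)        # statistic on held-out data only
    return stat, bool(stat >= chi2.ppf(1.0 - alpha, df=J))

rng = np.random.default_rng(1)
X = rng.normal(size=(4000, 1))
Y = np.sin(2.0 * X) + 0.3 * rng.normal(size=(4000, 1))
print(nfsic_test(X, Y))   # (statistic, reject H_0?)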

Demo: 2D Rotation
[Figure: for the 2D rotation demo, heatmaps of $\hat{\mu}_{xy}(v, w)$, $\hat{\mu}_x(v)\hat{\mu}_y(w)$, their difference, $\hat{\Sigma}(v, w)$, and the objective $\hat{\lambda}_n$ as functions of the test location.]
18/29

Demo: Sin Problem ($\omega = 1$)
[Figure: the joint density $p(x, y)$ of the Sin problem with $\omega = 1$, together with heatmaps of $\hat{\mu}_{xy}(v, w)$, $\hat{\mu}_x(v)\hat{\mu}_y(w)$, their difference, $\hat{\Sigma}(v, w)$, and $\hat{\lambda}_n$.]
19/29

Simulation Settings
$n$ = full sample size. All methods use Gaussian kernels for both $X$ and $Y$.
Compare 6 methods:

Method     | Description           | Tuning            | Test size | Complexity
NFSIC-opt  | Proposed              | Gradient ascent   | n/2       | O(n)
NFSIC-med  | Proposed; no tuning   | Random locations  | n         | O(n)
QHSIC      | Full HSIC             | Median heuristic  | n         | O(n^2)
NyHSIC     | Nyström HSIC          | Median heuristic  | n         | O(n)
FHSIC      | HSIC + RFFs*          | Median heuristic  | n         | O(n)
RDC        | RFFs* + CCA           | Median heuristic  | n         | O(n log n)

*RFFs: random Fourier features.
Given a problem, report the rejection rate of $H_0$. 10 features for all methods (except QHSIC); $J = 10$ in NFSIC.
20/29

Toy Problem 1: Independent Gaussians
$X \sim \mathcal{N}(0, I_{d_x})$ and $Y \sim \mathcal{N}(0, I_{d_y})$. $X, Y$ are independent, so $H_0$ holds.
Set $\alpha := 0.05$, $d_x = d_y = 250$.
[Figure: type-I error vs. sample size $n$ for all six methods, and runtime vs. $n$.]
All methods have correct type-I errors (false positive rates).
21/29

Toy Problem 2: Sinusoid
$p_{xy}(x, y) \propto 1 + \sin(\omega x)\sin(\omega y)$, where $x, y \in (-\pi, \pi)$.
The changes between $p_{xy}$ and $p_x p_y$ are local; larger $\omega$ gives finer structure. Set $n = 4000$.
[Figure: the joint density for $\omega = 1, 2, 3, 4$, and test power vs. $\omega$ for all six methods.]
Main point: NFSIC handles local changes in the joint space well.
22/29
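
A minimal sketch of drawing a sample from the Sinusoid density above by rejection sampling: $p_{xy}(x, y) \propto 1 + \sin(\omega x)\sin(\omega y)$ is bounded by 2 on $(-\pi, \pi)^2$, so a uniform proposal is accepted with probability $(1 + \sin(\omega x)\sin(\omega y))/2$.

import numpy as np

def sample_sinusoid(n, omega=1.0, seed=0):
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    while len(xs) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        # accept with probability proportional to the target density
        if rng.uniform() < (1.0 + np.sin(omega * x) * np.sin(omega * y)) / 2.0:
            xs.append(x)
            ys.append(y)
    return np.array(xs)[:, None], np.array(ys)[:, None]

X, Y = sample_sinusoid(4000, omega=2.0)
print(X.shape, Y.shape)   # feed these into an independence test, e.g., nfsic_stat above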

Toy Problem 3: Gaussian Sign
$y = |Z| \prod_{i=1}^{d_x} \mathrm{sign}(x_i)$, where $x \sim \mathcal{N}(0, I_{d_x})$ and $Z \sim \mathcal{N}(0, 1)$ (noise).
Full interaction among $x_1, \ldots, x_{d_x}$: all of $x_1, \ldots, x_{d_x}$ must be considered jointly to detect the dependency.
[Figure: test power vs. sample size $n$ for all six methods.]
Main point: NFSIC can handle feature interaction.
23/29
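
A small data generator for the Gaussian Sign problem above; for $d_x \geq 2$ each coordinate of $x$ is marginally independent of $y$, so the dependence only shows through the full interaction of all coordinates.

import numpy as np

def gaussian_sign(n, dx=3, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, dx))
    Z = rng.normal(size=n)                          # noise magnitude
    y = np.abs(Z) * np.prod(np.sign(X), axis=1)     # y = |Z| * prod_i sign(x_i)
    return X, y[:, None]

X, Y = gaussian_sign(4000, dx=4)
print(X.shape, Y.shape)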

HSIC vs. FSIC
Recall the witness $\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \hat{\mu}_x(v)\hat{\mu}_y(w)$.
HSIC [Gretton et al., 2005] $= \|\hat{u}\|_{\mathrm{RKHS}}$: good when the difference between $p_{xy}$ and $p_x p_y$ is spatially diffuse and $\hat{u}$ is almost flat.
FSIC (proposed) $= \frac{1}{J}\sum_{i=1}^J \hat{u}^2(v_i, w_i)$: good when the difference between $p_{xy}$ and $p_x p_y$ is local, i.e., $\hat{u}$ is mostly zero with a few peaks (feature interaction).
[Figure: witness surfaces illustrating the diffuse and local cases.]
24/29

Real Problem 1: Million Song Data
Song ($X$) vs. year of release ($Y$). Western commercial tracks from 1922 to 2011 [Bertin-Mahieux et al., 2011].
$X \in \mathbb{R}^{90}$ contains audio features. $Y \in \mathbb{R}$ is the year of release.
Break the $(X, Y)$ pairs to simulate $H_0$.
[Figure: type-I error vs. sample size $n$ with $H_0$ simulated, and test power vs. $n$ when $H_1$ is true.]
25/29
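
A two-line illustration of "breaking the pairs" to simulate $H_0$ on real data: randomly permuting one variable destroys any dependence while preserving both marginals. The arrays below are placeholders for the real features.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 90))     # stand-in for the audio features
Y = rng.normal(size=(2000, 1))      # stand-in for the year of release
Y_h0 = Y[rng.permutation(len(Y))]   # same marginal, but independent of X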

Real Problem 2: Videos and Captions
YouTube video ($X$) vs. caption ($Y$). VideoStory46K dataset [Habibian et al., 2014].
$X \in \mathbb{R}^{2000}$: Fisher vector encoding of motion boundary histogram descriptors [Wang and Schmid, 2013].
$Y \in \mathbb{R}^{1878}$: bag of words (term frequencies).
Break the $(X, Y)$ pairs to simulate $H_0$.
[Figure: type-I error vs. sample size $n$ with $H_0$ simulated, and test power vs. $n$ when $H_1$ is true.]
26/29

Penalize Redundant Test Locations
Consider the Sin problem with $J = 2$ locations and optimization objective $\hat{\lambda}_n$. Write $t = (v, w)$. Fix $t_1$ at a marked point and plot $t_2 \mapsto \hat{\lambda}_n(t_1, t_2)$.
[Figure: the joint density for $\omega = 1$, the objective surface $\hat{\lambda}_n(t_1, t_2)$ as a function of $t_2$, and a sample optimization trajectory of $t_2$.]
The optimized $t_1, t_2$ will not end up in the same neighbourhood.
27/29

Test Power vs. $J$
Test power does not always increase with $J$ (the number of test locations). Here $n = 800$.
[Figure: the Sin problem with $\omega = 2$, and test power vs. $J$ for $J$ up to 600.]
Accurate estimation of $\hat{\Sigma} \in \mathbb{R}^{J \times J}$ in $\hat{\lambda}_n = n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$ becomes more difficult as $J$ grows.
A large $J$ also defeats the purpose of a linear-time test.
28/29

Conclusions
Proposed the Finite Set Independence Criterion (FSIC). The independence test based on FSIC is
1. non-parametric,
2. linear-time,
3. adaptive (its parameters are automatically tuned).
Future work:
Is there a way to interpret the learned locations $\{(v_i, w_i)\}_{i=1}^J$?
Relative efficiency of FSIC vs. block HSIC and RFF-HSIC.
https://github.com/wittawatj/fsic-test
29/29

Questions? Thank you 30/29

References I
Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. (2011). The million song dataset. In International Conference on Music Information Retrieval (ISMIR).
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13:723–773.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory (ALT), pages 63–77.
31/29

References II
Habibian, A., Mensink, T., and Snoek, C. G. (2014). VideoStory: A new multimedia embedding for few-example recognition and translation of events. In ACM International Conference on Multimedia, pages 17–26.
Wang, H. and Schmid, C. (2013). Action recognition with improved trajectories. In IEEE International Conference on Computer Vision (ICCV), pages 3551–3558.
Zhang, Q., Filippi, S., Gretton, A., and Sejdinovic, D. (2016). Large-scale kernel methods for independence testing.
32/29