Hilbert Space Representations of Probability Distributions


Hilbert Space Representations of Probability Distributions
Arthur Gretton
Joint work with Karsten Borgwardt, Kenji Fukumizu, Malte Rasch, Bernhard Schölkopf, Alex Smola, Le Song, Choon Hui Teo
Max Planck Institute for Biological Cybernetics, Tübingen, Germany

Overview
The two-sample problem: are samples {x_1, ..., x_m} and {y_1, ..., y_n} generated from the same distribution?
Kernel independence testing: given a sample of m pairs {(x_1, y_1), ..., (x_m, y_m)}, are the random variables x and y independent?

Kernels, feature maps

A very short introduction to kernels
Hilbert space F of functions f from X to R.
RKHS: the evaluation operator \delta_x : F \to R, f \mapsto f(x), is continuous.
Riesz: there is a unique representer of evaluation k(x, \cdot) \in F, so that f(x) = \langle f, k(x, \cdot) \rangle_F.
k(x, \cdot) is the feature map; k : X \times X \to R is the kernel function.
Inner product between two feature maps: \langle k(x_1, \cdot), k(x_2, \cdot) \rangle_F = k(x_1, x_2).
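
As a concrete illustration of these definitions, here is a minimal Python sketch (not part of the original slides). It uses a quadratic kernel, for which an explicit finite-dimensional feature map exists, to check that the inner product of two feature maps equals a kernel evaluation, and it evaluates an RKHS function through the reproducing property. The kernel, expansion points, and coefficients are arbitrary illustrative choices.

```python
import numpy as np

# Quadratic kernel k(x, y) = (1 + x*y)^2 with an explicit finite feature map
# phi(x) = (1, sqrt(2)*x, x^2), so that <phi(x1), phi(x2)> = k(x1, x2).
def k(x, y):
    return (1.0 + x * y) ** 2

def phi(x):
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

x1, x2 = 0.7, -1.3
print(k(x1, x2), phi(x1) @ phi(x2))        # the two numbers agree

# A function in the RKHS as a kernel expansion f = sum_i alpha_i k(x_i, .),
# evaluated through the reproducing property f(x) = <f, k(x, .)>.
xi = np.array([-1.0, 0.0, 2.0])            # expansion points
alpha = np.array([0.5, -1.0, 0.25])        # expansion coefficients
f_of = lambda x: sum(a * k(c, x) for a, c in zip(alpha, xi))
print(f_of(0.3))
```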

A Kernel Method for the Two Sample Problem

The two-sample problem
Given: m samples x := {x_1, ..., x_m} drawn i.i.d. from P, and samples y drawn i.i.d. from Q.
Determine: are P and Q different?
Applications: microarray data aggregation, speaker/author identification, schema matching.
Where is our test useful? High dimensionality; low sample size; structured data (strings and graphs), where it is currently the only method.

Overview (two-sample problem)
How to detect P ≠ Q?
Distance between means in a space of features
Function revealing differences in distributions
Same thing: the MMD [Gretton et al., 2007, Borgwardt et al., 2006]
Hypothesis test using MMD
Asymptotic distribution of MMD
Large deviation bounds
Experiments

Mean discrepancy (1)
Simple example: two Gaussians with different means.
Answer: t-test.
[Figure: densities of two Gaussians P_X and Q_X with different means]

Mean discrepancy (2)
Two Gaussians with the same means but different variances.
Idea: look at the difference in means of features of the random variables.
In the Gaussian case: second-order features of the form x^2.
[Figure: densities of two Gaussians with different variances, and densities of the feature x^2 under P_X and Q_X]

Mean discrepancy (3)
Gaussian and Laplace distributions with the same mean and the same variance.
Difference in means using higher-order features.
[Figure: Gaussian and Laplace densities P_X and Q_X]

Function revealing difference in distributions (1)
Idea: avoid density estimation when testing P ≠ Q [Fortet and Mourier, 1953]:

D(P, Q; F) := \sup_{f \in F} [ E_P f(x) - E_Q f(y) ].

D(P, Q; F) = 0 iff P = Q, when F is the space of bounded continuous functions [Dudley, 2002].
D(P, Q; F) = 0 iff P = Q, when F is the unit ball in a universal RKHS F [via Steinwart, 2001]. Examples: Gaussian, Laplace kernels [see also Fukumizu et al., 2004].

Function revealing difference in distributions (2)
Gauss vs. Laplace revisited.
[Figure: witness function f for the Gaussian and Laplace densities, plotted together with the two densities]
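
To illustrate the witness function, here is a small Python sketch (not from the slides) that evaluates the empirical witness f(t), proportional to the difference of mean kernel embeddings, (1/m) \sum_i k(x_i, t) - (1/n) \sum_j k(y_j, t), for a Gaussian sample versus a variance-matched Laplace sample. The Gaussian kernel, bandwidth, and sample size are arbitrary illustrative choices.

```python
import numpy as np

def gauss_kernel(a, b, sigma=0.5):
    """Gaussian kernel matrix between 1-D samples a and b."""
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
m = 500
x = rng.normal(0.0, 1.0, size=m)              # sample from P (Gaussian)
y = rng.laplace(0.0, 1.0 / np.sqrt(2), m)     # sample from Q (Laplace, matched variance)

# Empirical witness: difference of mean embeddings evaluated on a grid.
t = np.linspace(-6, 6, 200)
witness = gauss_kernel(x, t).mean(axis=0) - gauss_kernel(y, t).mean(axis=0)

print(t[np.argmax(np.abs(witness))])  # location where the two samples differ most
```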

Function revealing difference in distributions (3)
The (kernel) MMD:

MMD(P, Q; F) = \left( \sup_{f \in F} [ E_P f(x) - E_Q f(y) ] \right)^2
             = \left( \sup_{f \in F} \langle f, \mu_x - \mu_y \rangle_F \right)^2
             = \| \mu_x - \mu_y \|_F^2
             = \langle \mu_x - \mu_y, \mu_x - \mu_y \rangle_F
             = E_{P,P} k(x, x') + E_{Q,Q} k(y, y') - 2 E_{P,Q} k(x, y),

using E_P f(x) = E_P [ \langle \phi(x), f \rangle_F ] =: \langle \mu_x, f \rangle_F and \| \mu \|_F = \sup_{\|f\|_F \le 1} \langle f, \mu \rangle_F. Here x' is a random variable independent of x with distribution P, and y' is a random variable independent of y with distribution Q.
Kernel between measures [Hein and Bousquet, 2005]: K(P, Q) = E_{P,Q} k(x, y).

Statistical test using MMD (1)
Two hypotheses:
H_0: null hypothesis (P = Q)
H_1: alternative hypothesis (P ≠ Q)
Observe samples x := {x_1, ..., x_m} from P and y from Q.
If the empirical MMD(x, y; F) is far from zero: reject H_0; close to zero: accept H_0.
How good is a test?
Type I error: we reject H_0 although it is true.
Type II error: we accept H_0 although it is false.
A good test has a low Type II error for a user-defined Type I error.

Statistical test using MMD (2)
"Far from zero" vs. "close to zero": where is the threshold?
One answer: the asymptotic distribution of MMD(x, y; F).
An unbiased empirical estimate (quadratic cost):

MMD(x, y; F) = \frac{1}{m(m-1)} \sum_{i \neq j} h((x_i, y_i), (x_j, y_j)), where
h((x_i, y_i), (x_j, y_j)) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(y_i, x_j).

When P ≠ Q, this estimate is asymptotically normal [Hoeffding, 1948, Serfling, 1980].
Expression for the variance, with z_i := (x_i, y_i):

\sigma_u^2 = \frac{2^2}{m} \left( E_z \left[ (E_{z'} h(z, z'))^2 \right] - \left[ E_{z,z'} h(z, z') \right]^2 \right) + O(m^{-2}).
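
The following Python sketch (not from the slides) computes this unbiased quadratic-time estimate for two samples of equal size m, using a Gaussian kernel whose bandwidth is an arbitrary choice for illustration.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between samples a (m x d) and b (n x d)."""
    sq = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased quadratic-time estimate of MMD^2 for equal sample sizes m."""
    m = x.shape[0]
    Kxx = gauss_kernel(x, x, sigma)
    Kyy = gauss_kernel(y, y, sigma)
    Kxy = gauss_kernel(x, y, sigma)
    # h((x_i, y_i), (x_j, y_j)) summed over i != j, divided by m(m-1).
    h = Kxx + Kyy - Kxy - Kxy.T
    np.fill_diagonal(h, 0.0)
    return h.sum() / (m * (m - 1))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(0.5, 1.0, size=(200, 2))   # shifted mean, so the estimate should be clearly positive
print(mmd2_unbiased(x, y))
```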

Statistical test using MMD (3)
Example: Laplace distributions with different variances.
[Figure: two Laplace densities P_X and Q_X with different variances; empirical MMD distribution under H_1 with a Gaussian fit]

Statistical test using MMD (4)
When P = Q, the U-statistic is degenerate: E_{z'}[h(z, z')] = 0 [Anderson et al., 1994]. The asymptotic distribution is

m \, MMD(x, y; F) \to \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right],

where the z_l ~ N(0, 2) are i.i.d. and the \lambda_i solve

\int_X \tilde{k}(x, x') \, \psi_i(x) \, dP_x(x) = \lambda_i \psi_i(x'),

with \tilde{k} the centred kernel.
[Figure: empirical density of the MMD under H_0]
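
The \lambda_l are not available in closed form. One common way to use this result in practice (a hedged sketch, not the procedure stated on the slides) is to estimate the \lambda_l by the eigenvalues of the centred Gram matrix divided by m, and then simulate the weighted sum above. The kernel, bandwidth, and sample below are arbitrary.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    sq = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma ** 2))

def null_samples_spectral(x, sigma=1.0, n_draws=10000, rng=None):
    """Approximate draws from sum_l lambda_l (z_l^2 - 2), z_l ~ N(0, 2),
    with lambda_l estimated as eigenvalues of the centred Gram matrix / m."""
    rng = np.random.default_rng() if rng is None else rng
    m = x.shape[0]
    K = gauss_kernel(x, x, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    lam = np.linalg.eigvalsh(H @ K @ H) / m      # estimated eigenvalues lambda_l
    lam = np.clip(lam, 0.0, None)                # clip small numerical negatives
    z = rng.normal(0.0, np.sqrt(2.0), size=(n_draws, m))
    return (lam * (z ** 2 - 2.0)).sum(axis=1)    # draws of m * MMD under H_0

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
null = null_samples_spectral(x, rng=rng)
print(np.quantile(null, 0.95))   # approximate 95% quantile of m * MMD under H_0
```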

Statistical test using MMD (5)
Given P = Q, we want a threshold T such that P(MMD > T) ≤ 0.05.
Bootstrap for the empirical CDF [Arcones and Giné, 1992]
Pearson curves by matching the first four moments [Johnson et al., 1994]
Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
Other...
[Figure: CDF of the MMD under H_0 and the Pearson fit]
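
A minimal sketch (not from the slides) of the resampling approach to the threshold: pool the two samples, repeatedly re-split them at random, recompute the statistic, and take the empirical 95% quantile. It assumes the mmd2_unbiased helper from the earlier sketch.

```python
import numpy as np

def bootstrap_threshold(x, y, mmd_fn, n_perm=500, level=0.05, rng=None):
    """Estimate the (1 - level) null quantile of the MMD statistic by
    randomly re-assigning the pooled sample to the two groups."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.vstack([x, y])
    m = x.shape[0]
    stats = []
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        stats.append(mmd_fn(pooled[idx[:m]], pooled[idx[m:]]))
    return np.quantile(stats, 1.0 - level)

# Usage (assumes x, y, and mmd2_unbiased from the earlier sketch):
# T = bootstrap_threshold(x, y, mmd2_unbiased)
# reject_H0 = mmd2_unbiased(x, y) > T
```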

Experiments
Small sample size: Pearson is more accurate than the bootstrap.
Large sample size: the bootstrap is faster.
Cancer subtype (m = 25, 2118 dimensions):
Pearson: Type I 3.5%, Type II 0%
Bootstrap: Type I 0.9%, Type II 0%
Neural spikes (m = 1, 1 dimensions):
Pearson: Type I 4.8%, Type II 3.4%
Bootstrap: Type I 5.4%, Type II 3.3%
Further experiments: comparison with the t-test, the Friedman-Rafsky tests [Friedman and Rafsky, 1979], the Biau-Györfi test [Biau and Györfi, 2005], and the Hall-Tajvidi test [Hall and Tajvidi, 2002].

Conclusions (two-sample problem)
The MMD: a distance between means in feature space.
When the feature space is a universal RKHS, MMD = 0 iff P = Q.
Statistical test of whether P ≠ Q using the asymptotic distribution:
Pearson approximation for small sample sizes
Bootstrap for large sample sizes
Useful in high dimensions and for structured data.

Dependence Detection with Kernels

Kernel dependence measures
Independence testing.
Given: m samples z := {(x_1, y_1), ..., (x_m, y_m)} from P.
Determine: does P = P_x P_y?
Kernel dependence measures:
are zero only at independence
take into account high-order moments
make sensible assumptions about smoothness
Covariance operators in spaces of features:
spectral norm (COCO) [Gretton et al., 2005c,d]
Hilbert-Schmidt norm (HSIC) [Gretton et al., 2005a]

Function revealing dependence (1)
Idea: avoid density estimation when testing P = P_x P_y [Rényi, 1959]:

COCO(P; F, G) := \sup_{f \in F, \, g \in G} \left( E_{x,y}[f(x) g(y)] - E_x[f(x)] \, E_y[g(y)] \right).

COCO(P; F, G) = 0 iff x and y are independent, when F and G are the respective unit balls in universal RKHSs F and G [via Steinwart, 2001]. Examples: Gaussian, Laplace kernels [see also Bach and Jordan, 2002].
In geometric terms: the covariance operator C_{xy} : G \to F is defined by

\langle f, C_{xy} g \rangle_F = E_{x,y}[f(x) g(y)] - E_x[f(x)] \, E_y[g(y)],

and COCO is the spectral norm of C_{xy} [Gretton et al., 2005c,d]:

COCO(P; F, G) := \| C_{xy} \|_S.

Function revealing dependence (2)
Ring-shaped density, correlation approximately zero [example from Fukumizu, Bach, and Gretton, 2005].
[Figure: sample from the ring-shaped density; dependence witness functions f(x) and g(y); scatter of f(X) against g(Y) with correlation ≈ 0.9 and COCO ≈ 0.14]

Function revealing dependence (3)
The empirical COCO(z; F, G) is \gamma, the largest eigenvalue of the generalized eigenvalue problem

\begin{pmatrix} 0 & \frac{1}{m} \tilde{K} \tilde{L} \\ \frac{1}{m} \tilde{L} \tilde{K} & 0 \end{pmatrix} \begin{pmatrix} c \\ d \end{pmatrix} = \gamma \begin{pmatrix} \tilde{K} & 0 \\ 0 & \tilde{L} \end{pmatrix} \begin{pmatrix} c \\ d \end{pmatrix}.

\tilde{K} and \tilde{L} are the matrices of inner products between centred observations in the respective feature spaces: \tilde{K} = H K H with H = I - \frac{1}{m} 1 1^\top, where k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_F and l(y_i, y_j) = \langle \psi(y_i), \psi(y_j) \rangle_G.
Witness function for x:

f(x) = \sum_{i=1}^{m} c_i \left[ k(x_i, x) - \frac{1}{m} \sum_{j=1}^{m} k(x_j, x) \right].
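
The following Python sketch (not from the slides) solves this generalized eigenvalue problem directly with SciPy. Because the centred Gram matrices are rank deficient, a small ridge is added to the right-hand side purely for numerical stability; the ridge value and the commented usage are illustrative assumptions, and gauss_kernel refers to the helper from the earlier sketches.

```python
import numpy as np
from scipy.linalg import eigh

def centred_gram(K):
    """Centre a Gram matrix: K_tilde = H K H with H = I - (1/m) 1 1^T."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H

def empirical_coco(K, L, eps=1e-8):
    """Largest eigenvalue of [[0, K~L~/m], [L~K~/m, 0]] v = gamma [[K~, 0], [0, L~]] v.
    eps is a small ridge on the right-hand side for numerical stability."""
    m = K.shape[0]
    Kt, Lt = centred_gram(K), centred_gram(L)
    A = np.block([[np.zeros((m, m)), Kt @ Lt / m],
                  [Lt @ Kt / m, np.zeros((m, m))]])
    B = np.block([[Kt, np.zeros((m, m))],
                  [np.zeros((m, m)), Lt]]) + eps * np.eye(2 * m)
    return eigh(A, B, eigvals_only=True)[-1]   # eigenvalues are returned in ascending order

# Usage with a Gaussian kernel (gauss_kernel as in the earlier sketches):
# rng = np.random.default_rng(0)
# t = rng.uniform(0, 2 * np.pi, 200)
# x, y = np.cos(t)[:, None], np.sin(t)[:, None]   # ring: dependent, near-zero correlation
# print(empirical_coco(gauss_kernel(x, x), gauss_kernel(y, y)))
```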

Function revealing dependence (4)
Can we do better? A second example with zero correlation.
[Figure: sample with zero correlation; first dependence witness functions f, g and second witness functions f_2, g_2 for X and Y; scatter of f(X) against g(Y) with correlation ≈ 0.8 and COCO ≈ 0.11, and of f_2(X) against g_2(Y) with correlation ≈ 0.37 and COCO_2 ≈ 0.06]

Hilbert-Schmidt Independence Criterion
Given \gamma_i := COCO_i(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005b]:

HSIC(z; F, G) := \sum_{i=1}^{m} \gamma_i^2.

In the limit of infinite samples:

HSIC(P; F, G) := \| C_{xy} \|_{HS}^2 = \langle C_{xy}, C_{xy} \rangle_{HS}
= E_{x,x',y,y'}[ k(x, x') \, l(y, y') ] + E_{x,x'}[ k(x, x') ] \, E_{y,y'}[ l(y, y') ] - 2 E_{x,y} \left[ E_{x'}[ k(x, x') ] \, E_{y'}[ l(y, y') ] \right],

where x' is an independent copy of x and y' is a copy of y.
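
A commonly used biased empirical estimate of HSIC is trace(K~ L~)/m^2 with centred Gram matrices; the sketch below (not spelled out on the slides, and some references normalise by (m-1)^2 instead) computes it on ring-shaped data like the earlier example. The Gaussian kernel, bandwidth, noise level, and sample size are arbitrary illustrative choices.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    sq = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma ** 2))

def hsic_biased(K, L):
    """Biased empirical HSIC: (1/m^2) trace(K~ L~) with centred Gram matrices."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kc, Lc = H @ K @ H, H @ L @ H
    return np.trace(Kc @ Lc) / m ** 2

# Ring-shaped data: dependent but (nearly) uncorrelated, as in the earlier example.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 300)
x = (np.cos(t) + 0.1 * rng.normal(size=300))[:, None]
y = (np.sin(t) + 0.1 * rng.normal(size=300))[:, None]
print(hsic_biased(gauss_kernel(x, x), gauss_kernel(y, y)))   # clearly above zero

yp = rng.permutation(y)   # shuffling y breaks the dependence
print(hsic_biased(gauss_kernel(x, x), gauss_kernel(yp, yp)))  # close to zero
```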

Link between HSIC and MMD (1)
Define the product space F \otimes G with kernel

\langle \Phi(x, y), \Phi(x', y') \rangle = K((x, y), (x', y')) = k(x, x') \, l(y, y').

Define the mean elements

\langle \mu_{xy}, \Phi(x, y) \rangle := E_{x',y'} \langle \Phi(x', y'), \Phi(x, y) \rangle = E_{x',y'} [ k(x, x') \, l(y, y') ]

and

\langle \mu_x \mu_y, \Phi(x, y) \rangle := E_{x'} E_{y'} \langle \Phi(x', y'), \Phi(x, y) \rangle = E_{x'} [ k(x, x') ] \, E_{y'} [ l(y, y') ].

The MMD between these two mean elements is

MMD(P, P_x P_y; F \otimes G) = \| \mu_{xy} - \mu_x \mu_y \|_{F \otimes G}^2 = \langle \mu_{xy} - \mu_x \mu_y, \mu_{xy} - \mu_x \mu_y \rangle = HSIC(P; F, G).

Link between HSIC and MMD (2)
Witness function for HSIC.
[Figure: HSIC dependence witness function plotted over (X, Y), with the sample overlaid]

Independence test: verifying ICA and ISA
HSICp: null distribution via sampling
HSICg: null distribution via moment matching
Compare with a contingency table test (PD) [Read and Cressie, 1988]
[Figure: samples after rotations θ = π/8 and θ = π/4; percentage of acceptance of H_0 as a function of the rotation angle (in units of π/4) for PD, HSICp, and HSICg, at sample size/dimension 128/1, 512/2, 1024/4, and 2048/4]
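
One simple way to realise the "null distribution via sampling" idea (a sketch, not necessarily the authors' exact procedure) is a permutation test: shuffle the y sample to destroy dependence while preserving both marginals, recompute HSIC many times, and compare the observed statistic with the resulting quantile. It assumes the gauss_kernel and hsic_biased helpers from the earlier sketches.

```python
import numpy as np

def hsic_permutation_test(x, y, kernel, hsic_fn, n_perm=500, level=0.05, rng=None):
    """Permutation ('sampled null') independence test based on HSIC."""
    rng = np.random.default_rng() if rng is None else rng
    K = kernel(x, x)
    stat = hsic_fn(K, kernel(y, y))
    null = []
    for _ in range(n_perm):
        yp = rng.permutation(y)                  # permute y along the first axis
        null.append(hsic_fn(K, kernel(yp, yp)))
    threshold = np.quantile(null, 1.0 - level)
    return stat, threshold, stat > threshold     # True means: reject independence

# Usage (with gauss_kernel and hsic_biased from the earlier sketches):
# stat, thr, reject = hsic_permutation_test(x, y, gauss_kernel, hsic_biased)
```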

Other applications of HSIC
Feature selection [Song et al., 2007c,a]
Clustering [Song et al., 2007b]

Conclusions (dependence measures)
COCO and HSIC: norms of the covariance operator between feature spaces.
When the feature spaces are universal RKHSs, COCO = HSIC = 0 iff P = P_x P_y.
A statistical test is possible using the asymptotic distribution.
Independent component analysis: high accuracy, less sensitive to initialisation.

Questions?

Bibliography

References

N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41-54, 1994.
M. Arcones and E. Giné. On the bootstrap of U and V statistics. The Annals of Statistics, 20(2):655-674, 1992.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1-48, 2002.
G. Biau and L. Györfi. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11):3965-3973, 2005.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70:266-285, 1953.
J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697-717, 1979.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73-99, 2004.
K. Fukumizu, F. Bach, and A. Gretton. Consistency of kernel canonical correlation analysis. Technical Report

Hard-to-detect dependence (1)
COCO can be small for dependent random variables with highly non-smooth densities.
Reason: the norms in the denominator,

COCO(P; F, G) := \sup_{f \in F, \, g \in G} \frac{ \mathrm{cov}(f(x), g(y)) }{ \| f \|_F \, \| g \|_G }.

Result: the dependence is not detectable with a finite sample size. More formally: see Ingster [1989].

Hard-to-detect dependence (2)
The density takes the form P_{x,y} \propto 1 + \sin(\omega x) \sin(\omega y).
[Figure: a smooth and a rough density of this form, and samples drawn from each]

Hard-to-detect dependence (3)
Example: sinusoids of increasing frequency.
[Figure: densities for ω = 1, ..., 6, and the empirical average of COCO as a function of the frequency of the non-constant density component]

Choosing kernel size (1)
The RKHS norm of f is

\| f \|_{H_X}^2 := \sum_{i=1}^{\infty} f_i^2 \, (\hat{k}_i)^{-1},

with \hat{k}_i the kernel spectrum.
If the kernel decays quickly, its spectrum decays slowly: non-smooth functions then have smaller RKHS norm.
Example: spectra of two Gaussian kernels.
[Figure: Gaussian kernels with σ^2 = 1/2 and σ^2 = 2, and their spectra]
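
To illustrate the inverse relationship numerically, here is a small sketch (not from the slides) comparing the spectra of two Gaussian kernels via the FFT; the bandwidths match the σ^2 = 1/2 and σ^2 = 2 choices in the figure, while the grid and threshold are arbitrary.

```python
import numpy as np

# Spectra of two Gaussian kernels k(x) = exp(-x^2 / (2 sigma^2)) via the FFT.
x = np.linspace(-50, 50, 4096)
dx = x[1] - x[0]
for sigma2 in (0.5, 2.0):
    k = np.exp(-x ** 2 / (2 * sigma2))
    spectrum = np.abs(np.fft.fftshift(np.fft.fft(k))) * dx
    # A narrow kernel in x (small sigma^2) gives a broad, slowly decaying spectrum.
    width = (spectrum > 1e-3 * spectrum.max()).sum()   # crude measure of spectral width
    print(sigma2, spectrum.max(), width)
```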

Choosing kernel size (2)
Could we just decrease the kernel size? Yes, but only up to a point.
[Figure: empirical average of COCO as a function of the log kernel size]