Hilbert Space Representations of Probability Distributions


Hilbert Space Representations of Probability Distributions
Arthur Gretton
Joint work with Karsten Borgwardt, Kenji Fukumizu, Malte Rasch, Bernhard Schölkopf, Alex Smola, Le Song, Choon Hui Teo
Max Planck Institute for Biological Cybernetics, Tübingen, Germany

Overview
The two-sample problem: are samples {x_1, ..., x_m} and {y_1, ..., y_n} generated from the same distribution?
Kernel independence testing: given a sample of m pairs {(x_1, y_1), ..., (x_m, y_m)}, are the random variables x and y independent?

Kernels, feature maps

A very short introduction to kernels
Hilbert space F of functions f from X to R.
RKHS: the evaluation operator \delta_x : F \to R, f \mapsto f(x), is continuous.
Riesz: there is a unique representer of evaluation k(x, \cdot) \in F, so that f(x) = \langle f, k(x, \cdot) \rangle_F.
k(x, \cdot) is the feature map; k : X \times X \to R is the kernel function.
Inner product between two feature maps: \langle k(x_1, \cdot), k(x_2, \cdot) \rangle_F = k(x_1, x_2).
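
As a concrete illustration of these definitions, here is a minimal Python sketch (not part of the original slides). It uses a quadratic kernel, for which an explicit finite-dimensional feature map exists, to check that the inner product of two feature maps equals a kernel evaluation, and it evaluates an RKHS function through the reproducing property. The kernel, expansion points, and coefficients are arbitrary illustrative choices.

```python
import numpy as np

# Quadratic kernel k(x, y) = (1 + x*y)^2 with an explicit finite feature map
# phi(x) = (1, sqrt(2)*x, x^2), so that <phi(x1), phi(x2)> = k(x1, x2).
def k(x, y):
    return (1.0 + x * y) ** 2

def phi(x):
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

x1, x2 = 0.7, -1.3
print(k(x1, x2), phi(x1) @ phi(x2))        # the two numbers agree

# A function in the RKHS as a kernel expansion f = sum_i alpha_i k(x_i, .),
# evaluated through the reproducing property f(x) = <f, k(x, .)>.
xi = np.array([-1.0, 0.0, 2.0])            # expansion points
alpha = np.array([0.5, -1.0, 0.25])        # expansion coefficients
f_of = lambda x: sum(a * k(c, x) for a, c in zip(alpha, xi))
print(f_of(0.3))
```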

A Kernel Method for the Two Sample Problem

The two-sample problem
Given: m samples x := {x_1, ..., x_m} drawn i.i.d. from P, and samples y drawn i.i.d. from Q.
Determine: are P and Q different?
Applications: microarray data aggregation, speaker/author identification, schema matching.
Where is our test useful? High dimensionality; low sample size; structured data (strings and graphs), where it is currently the only method.

Overview (two-sample problem)
How to detect P ≠ Q?
Distance between means in a space of features
Function revealing differences in distributions
Same thing: the MMD [Gretton et al., 2007, Borgwardt et al., 2006]
Hypothesis test using MMD
Asymptotic distribution of MMD
Large deviation bounds
Experiments

Mean discrepancy (1)
Simple example: two Gaussians with different means.
Answer: t-test.
[Figure: densities of two Gaussians P_X and Q_X with different means]

Mean discrepancy (2)
Two Gaussians with the same means but different variances.
Idea: look at the difference in means of features of the random variables.
In the Gaussian case: second-order features of the form x^2.
[Figure: densities of two Gaussians with different variances, and densities of the feature x^2 under P_X and Q_X]

Mean discrepancy (3)
Gaussian and Laplace distributions with the same mean and the same variance.
Difference in means using higher-order features.
[Figure: Gaussian and Laplace densities P_X and Q_X]

Function revealing difference in distributions (1)
Idea: avoid density estimation when testing P ≠ Q [Fortet and Mourier, 1953]:

D(P, Q; F) := \sup_{f \in F} [ E_P f(x) - E_Q f(y) ].

D(P, Q; F) = 0 iff P = Q, when F is the space of bounded continuous functions [Dudley, 2002].
D(P, Q; F) = 0 iff P = Q, when F is the unit ball in a universal RKHS F [via Steinwart, 2001]. Examples: Gaussian, Laplace kernels [see also Fukumizu et al., 2004].

Function revealing difference in distributions (2)
Gauss vs. Laplace revisited.
[Figure: witness function f for the Gaussian and Laplace densities, plotted together with the two densities]
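
To illustrate the witness function, here is a small Python sketch (not from the slides) that evaluates the empirical witness f(t), proportional to the difference of mean kernel embeddings, (1/m) \sum_i k(x_i, t) - (1/n) \sum_j k(y_j, t), for a Gaussian sample versus a variance-matched Laplace sample. The Gaussian kernel, bandwidth, and sample size are arbitrary illustrative choices.

```python
import numpy as np

def gauss_kernel(a, b, sigma=0.5):
    """Gaussian kernel matrix between 1-D samples a and b."""
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
m = 500
x = rng.normal(0.0, 1.0, size=m)              # sample from P (Gaussian)
y = rng.laplace(0.0, 1.0 / np.sqrt(2), m)     # sample from Q (Laplace, matched variance)

# Empirical witness: difference of mean embeddings evaluated on a grid.
t = np.linspace(-6, 6, 200)
witness = gauss_kernel(x, t).mean(axis=0) - gauss_kernel(y, t).mean(axis=0)

print(t[np.argmax(np.abs(witness))])  # location where the two samples differ most
```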

Function revealing difference in distributions (3)
The (kernel) MMD:

MMD(P, Q; F) = \left( \sup_{f \in F} [ E_P f(x) - E_Q f(y) ] \right)^2
             = \left( \sup_{f \in F} \langle f, \mu_x - \mu_y \rangle_F \right)^2
             = \| \mu_x - \mu_y \|_F^2
             = \langle \mu_x - \mu_y, \mu_x - \mu_y \rangle_F
             = E_{P,P} k(x, x') + E_{Q,Q} k(y, y') - 2 E_{P,Q} k(x, y),

using E_P f(x) = E_P [ \langle \phi(x), f \rangle_F ] =: \langle \mu_x, f \rangle_F and \| \mu \|_F = \sup_{\|f\|_F \le 1} \langle f, \mu \rangle_F. Here x' is a random variable independent of x with distribution P, and y' is a random variable independent of y with distribution Q.
Kernel between measures [Hein and Bousquet, 2005]: K(P, Q) = E_{P,Q} k(x, y).

Statistical test using MMD (1)
Two hypotheses:
H_0: null hypothesis (P = Q)
H_1: alternative hypothesis (P ≠ Q)
Observe samples x := {x_1, ..., x_m} from P and y from Q.
If the empirical MMD(x, y; F) is far from zero: reject H_0; close to zero: accept H_0.
How good is a test?
Type I error: we reject H_0 although it is true.
Type II error: we accept H_0 although it is false.
A good test has a low Type II error for a user-defined Type I error.

Statistical test using MMD (2)
"Far from zero" vs. "close to zero": where is the threshold?
One answer: the asymptotic distribution of MMD(x, y; F).
An unbiased empirical estimate (quadratic cost):

MMD(x, y; F) = \frac{1}{m(m-1)} \sum_{i \neq j} h((x_i, y_i), (x_j, y_j)), where
h((x_i, y_i), (x_j, y_j)) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(y_i, x_j).

When P ≠ Q, this estimate is asymptotically normal [Hoeffding, 1948, Serfling, 1980].
Expression for the variance, with z_i := (x_i, y_i):

\sigma_u^2 = \frac{2^2}{m} \left( E_z \left[ (E_{z'} h(z, z'))^2 \right] - \left[ E_{z,z'} h(z, z') \right]^2 \right) + O(m^{-2}).
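
The following Python sketch (not from the slides) computes this unbiased quadratic-time estimate for two samples of equal size m, using a Gaussian kernel whose bandwidth is an arbitrary choice for illustration.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between samples a (m x d) and b (n x d)."""
    sq = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased quadratic-time estimate of MMD^2 for equal sample sizes m."""
    m = x.shape[0]
    Kxx = gauss_kernel(x, x, sigma)
    Kyy = gauss_kernel(y, y, sigma)
    Kxy = gauss_kernel(x, y, sigma)
    # h((x_i, y_i), (x_j, y_j)) summed over i != j, divided by m(m-1).
    h = Kxx + Kyy - Kxy - Kxy.T
    np.fill_diagonal(h, 0.0)
    return h.sum() / (m * (m - 1))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(0.5, 1.0, size=(200, 2))   # shifted mean, so the estimate should be clearly positive
print(mmd2_unbiased(x, y))
```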

Statistical test using MMD (3)
Example: Laplace distributions with different variances.
[Figure: two Laplace densities P_X and Q_X with different variances; empirical MMD distribution under H_1 with a Gaussian fit]

Statistical test using MMD (4)
When P = Q, the U-statistic is degenerate: E_{z'}[h(z, z')] = 0 [Anderson et al., 1994]. The asymptotic distribution is

m \, MMD(x, y; F) \to \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right],

where the z_l ~ N(0, 2) are i.i.d. and the \lambda_i solve

\int_X \tilde{k}(x, x') \, \psi_i(x) \, dP_x(x) = \lambda_i \psi_i(x'),

with \tilde{k} the centred kernel.
[Figure: empirical density of the MMD under H_0]
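
The \lambda_l are not available in closed form. One common way to use this result in practice (a hedged sketch, not the procedure stated on the slides) is to estimate the \lambda_l by the eigenvalues of the centred Gram matrix divided by m, and then simulate the weighted sum above. The kernel, bandwidth, and sample below are arbitrary.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    sq = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma ** 2))

def null_samples_spectral(x, sigma=1.0, n_draws=10000, rng=None):
    """Approximate draws from sum_l lambda_l (z_l^2 - 2), z_l ~ N(0, 2),
    with lambda_l estimated as eigenvalues of the centred Gram matrix / m."""
    rng = np.random.default_rng() if rng is None else rng
    m = x.shape[0]
    K = gauss_kernel(x, x, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    lam = np.linalg.eigvalsh(H @ K @ H) / m      # estimated eigenvalues lambda_l
    lam = np.clip(lam, 0.0, None)                # clip small numerical negatives
    z = rng.normal(0.0, np.sqrt(2.0), size=(n_draws, m))
    return (lam * (z ** 2 - 2.0)).sum(axis=1)    # draws of m * MMD under H_0

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
null = null_samples_spectral(x, rng=rng)
print(np.quantile(null, 0.95))   # approximate 95% quantile of m * MMD under H_0
```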

Statistical test using MMD (5)
Given P = Q, we want a threshold T such that P(MMD > T) ≤ 0.05.
Bootstrap for the empirical CDF [Arcones and Giné, 1992]
Pearson curves by matching the first four moments [Johnson et al., 1994]
Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
Other...
[Figure: CDF of the MMD under H_0 and the Pearson fit]
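
A minimal sketch (not from the slides) of the resampling approach to the threshold: pool the two samples, repeatedly re-split them at random, recompute the statistic, and take the empirical 95% quantile. It assumes the mmd2_unbiased helper from the earlier sketch.

```python
import numpy as np

def bootstrap_threshold(x, y, mmd_fn, n_perm=500, level=0.05, rng=None):
    """Estimate the (1 - level) null quantile of the MMD statistic by
    randomly re-assigning the pooled sample to the two groups."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.vstack([x, y])
    m = x.shape[0]
    stats = []
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        stats.append(mmd_fn(pooled[idx[:m]], pooled[idx[m:]]))
    return np.quantile(stats, 1.0 - level)

# Usage (assumes x, y, and mmd2_unbiased from the earlier sketch):
# T = bootstrap_threshold(x, y, mmd2_unbiased)
# reject_H0 = mmd2_unbiased(x, y) > T
```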

Experiments
Small sample size: Pearson is more accurate than the bootstrap.
Large sample size: the bootstrap is faster.
Cancer subtype (m = 25, 2118 dimensions):
Pearson: Type I 3.5%, Type II 0%
Bootstrap: Type I 0.9%, Type II 0%
Neural spikes (m = 1, 1 dimensions):
Pearson: Type I 4.8%, Type II 3.4%
Bootstrap: Type I 5.4%, Type II 3.3%
Further experiments: comparison with the t-test, the Friedman-Rafsky tests [Friedman and Rafsky, 1979], the Biau-Györfi test [Biau and Györfi, 2005], and the Hall-Tajvidi test [Hall and Tajvidi, 2002].

Conclusions (two-sample problem)
The MMD: a distance between means in feature space.
When the feature space is a universal RKHS, MMD = 0 iff P = Q.
Statistical test of whether P ≠ Q using the asymptotic distribution:
Pearson approximation for small sample sizes
Bootstrap for large sample sizes
Useful in high dimensions and for structured data.

Dependence Detection with Kernels

Kernel dependence measures
Independence testing.
Given: m samples z := {(x_1, y_1), ..., (x_m, y_m)} from P.
Determine: does P = P_x P_y?
Kernel dependence measures:
are zero only at independence
take into account high-order moments
make sensible assumptions about smoothness
Covariance operators in spaces of features:
spectral norm (COCO) [Gretton et al., 2005c,d]
Hilbert-Schmidt norm (HSIC) [Gretton et al., 2005a]

Function revealing dependence (1)
Idea: avoid density estimation when testing P = P_x P_y [Rényi, 1959]:

COCO(P; F, G) := \sup_{f \in F, \, g \in G} \left( E_{x,y}[f(x) g(y)] - E_x[f(x)] \, E_y[g(y)] \right).

COCO(P; F, G) = 0 iff x and y are independent, when F and G are the respective unit balls in universal RKHSs F and G [via Steinwart, 2001]. Examples: Gaussian, Laplace kernels [see also Bach and Jordan, 2002].
In geometric terms: the covariance operator C_{xy} : G \to F is defined by

\langle f, C_{xy} g \rangle_F = E_{x,y}[f(x) g(y)] - E_x[f(x)] \, E_y[g(y)],

and COCO is the spectral norm of C_{xy} [Gretton et al., 2005c,d]:

COCO(P; F, G) := \| C_{xy} \|_S.

Function revealing dependence (2)
Ring-shaped density, correlation approximately zero [example from Fukumizu, Bach, and Gretton, 2005].
[Figure: sample from the ring-shaped density; dependence witness functions f(x) and g(y); scatter of f(X) against g(Y) with correlation ≈ 0.9 and COCO ≈ 0.14]

Function revealing dependence (3)
The empirical COCO(z; F, G) is \gamma, the largest eigenvalue of the generalized eigenvalue problem

\begin{pmatrix} 0 & \frac{1}{m} \tilde{K} \tilde{L} \\ \frac{1}{m} \tilde{L} \tilde{K} & 0 \end{pmatrix} \begin{pmatrix} c \\ d \end{pmatrix} = \gamma \begin{pmatrix} \tilde{K} & 0 \\ 0 & \tilde{L} \end{pmatrix} \begin{pmatrix} c \\ d \end{pmatrix}.

\tilde{K} and \tilde{L} are the matrices of inner products between centred observations in the respective feature spaces: \tilde{K} = H K H with H = I - \frac{1}{m} 1 1^\top, where k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_F and l(y_i, y_j) = \langle \psi(y_i), \psi(y_j) \rangle_G.
Witness function for x:

f(x) = \sum_{i=1}^{m} c_i \left[ k(x_i, x) - \frac{1}{m} \sum_{j=1}^{m} k(x_j, x) \right].
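
The following Python sketch (not from the slides) solves this generalized eigenvalue problem directly with SciPy. Because the centred Gram matrices are rank deficient, a small ridge is added to the right-hand side purely for numerical stability; the ridge value and the commented usage are illustrative assumptions, and gauss_kernel refers to the helper from the earlier sketches.

```python
import numpy as np
from scipy.linalg import eigh

def centred_gram(K):
    """Centre a Gram matrix: K_tilde = H K H with H = I - (1/m) 1 1^T."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H

def empirical_coco(K, L, eps=1e-8):
    """Largest eigenvalue of [[0, K~L~/m], [L~K~/m, 0]] v = gamma [[K~, 0], [0, L~]] v.
    eps is a small ridge on the right-hand side for numerical stability."""
    m = K.shape[0]
    Kt, Lt = centred_gram(K), centred_gram(L)
    A = np.block([[np.zeros((m, m)), Kt @ Lt / m],
                  [Lt @ Kt / m, np.zeros((m, m))]])
    B = np.block([[Kt, np.zeros((m, m))],
                  [np.zeros((m, m)), Lt]]) + eps * np.eye(2 * m)
    return eigh(A, B, eigvals_only=True)[-1]   # eigenvalues are returned in ascending order

# Usage with a Gaussian kernel (gauss_kernel as in the earlier sketches):
# rng = np.random.default_rng(0)
# t = rng.uniform(0, 2 * np.pi, 200)
# x, y = np.cos(t)[:, None], np.sin(t)[:, None]   # ring: dependent, near-zero correlation
# print(empirical_coco(gauss_kernel(x, x), gauss_kernel(y, y)))
```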

Function revealing dependence (4)
Can we do better? A second example with zero correlation.
[Figure: sample with zero correlation; first dependence witness functions f, g and second witness functions f_2, g_2 for X and Y; scatter of f(X) against g(Y) with correlation ≈ 0.8 and COCO ≈ 0.11, and of f_2(X) against g_2(Y) with correlation ≈ 0.37 and COCO_2 ≈ 0.06]

Hilbert-Schmidt Independence Criterion
Given \gamma_i := COCO_i(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005b]:

HSIC(z; F, G) := \sum_{i=1}^{m} \gamma_i^2.

In the limit of infinite samples:

HSIC(P; F, G) := \| C_{xy} \|_{HS}^2 = \langle C_{xy}, C_{xy} \rangle_{HS}
= E_{x,x',y,y'}[ k(x, x') \, l(y, y') ] + E_{x,x'}[ k(x, x') ] \, E_{y,y'}[ l(y, y') ] - 2 E_{x,y} \left[ E_{x'}[ k(x, x') ] \, E_{y'}[ l(y, y') ] \right],

where x' is an independent copy of x and y' is a copy of y.
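
A commonly used biased empirical estimate of HSIC is trace(K~ L~)/m^2 with centred Gram matrices; the sketch below (not spelled out on the slides, and some references normalise by (m-1)^2 instead) computes it on ring-shaped data like the earlier example. The Gaussian kernel, bandwidth, noise level, and sample size are arbitrary illustrative choices.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    sq = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma ** 2))

def hsic_biased(K, L):
    """Biased empirical HSIC: (1/m^2) trace(K~ L~) with centred Gram matrices."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kc, Lc = H @ K @ H, H @ L @ H
    return np.trace(Kc @ Lc) / m ** 2

# Ring-shaped data: dependent but (nearly) uncorrelated, as in the earlier example.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 300)
x = (np.cos(t) + 0.1 * rng.normal(size=300))[:, None]
y = (np.sin(t) + 0.1 * rng.normal(size=300))[:, None]
print(hsic_biased(gauss_kernel(x, x), gauss_kernel(y, y)))   # clearly above zero

yp = rng.permutation(y)   # shuffling y breaks the dependence
print(hsic_biased(gauss_kernel(x, x), gauss_kernel(yp, yp)))  # close to zero
```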

Link between HSIC and MMD (1)
Define the product space F \otimes G with kernel

\langle \Phi(x, y), \Phi(x', y') \rangle = K((x, y), (x', y')) = k(x, x') \, l(y, y').

Define the mean elements

\langle \mu_{xy}, \Phi(x, y) \rangle := E_{x',y'} \langle \Phi(x', y'), \Phi(x, y) \rangle = E_{x',y'} [ k(x, x') \, l(y, y') ]

and

\langle \mu_x \mu_y, \Phi(x, y) \rangle := E_{x'} E_{y'} \langle \Phi(x', y'), \Phi(x, y) \rangle = E_{x'} [ k(x, x') ] \, E_{y'} [ l(y, y') ].

The MMD between these two mean elements is

MMD(P, P_x P_y; F \otimes G) = \| \mu_{xy} - \mu_x \mu_y \|_{F \otimes G}^2 = \langle \mu_{xy} - \mu_x \mu_y, \mu_{xy} - \mu_x \mu_y \rangle = HSIC(P; F, G).

Link between HSIC and MMD (2)
Witness function for HSIC.
[Figure: HSIC dependence witness function plotted over (X, Y), with the sample overlaid]

Independence test: verifying ICA and ISA
HSICp: null distribution via sampling
HSICg: null distribution via moment matching
Compare with a contingency table test (PD) [Read and Cressie, 1988]
[Figure: samples after rotations θ = π/8 and θ = π/4; percentage of acceptance of H_0 as a function of the rotation angle (in units of π/4) for PD, HSICp, and HSICg, at sample size/dimension 128/1, 512/2, 1024/4, and 2048/4]
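
One simple way to realise the "null distribution via sampling" idea (a sketch, not necessarily the authors' exact procedure) is a permutation test: shuffle the y sample to destroy dependence while preserving both marginals, recompute HSIC many times, and compare the observed statistic with the resulting quantile. It assumes the gauss_kernel and hsic_biased helpers from the earlier sketches.

```python
import numpy as np

def hsic_permutation_test(x, y, kernel, hsic_fn, n_perm=500, level=0.05, rng=None):
    """Permutation ('sampled null') independence test based on HSIC."""
    rng = np.random.default_rng() if rng is None else rng
    K = kernel(x, x)
    stat = hsic_fn(K, kernel(y, y))
    null = []
    for _ in range(n_perm):
        yp = rng.permutation(y)                  # permute y along the first axis
        null.append(hsic_fn(K, kernel(yp, yp)))
    threshold = np.quantile(null, 1.0 - level)
    return stat, threshold, stat > threshold     # True means: reject independence

# Usage (with gauss_kernel and hsic_biased from the earlier sketches):
# stat, thr, reject = hsic_permutation_test(x, y, gauss_kernel, hsic_biased)
```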

Other applications of HSIC
Feature selection [Song et al., 2007c,a]
Clustering [Song et al., 2007b]

Conclusions (dependence measures)
COCO and HSIC: norms of the covariance operator between feature spaces.
When the feature spaces are universal RKHSs, COCO = HSIC = 0 iff P = P_x P_y.
A statistical test is possible using the asymptotic distribution.
Independent component analysis: high accuracy, less sensitive to initialisation.

Questions?

Bibliography

References

N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41-54, 1994.
M. Arcones and E. Giné. On the bootstrap of U and V statistics. The Annals of Statistics, 20(2):655-674, 1992.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1-48, 2002.
G. Biau and L. Györfi. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11):3965-3973, 2005.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70:266-285, 1953.
J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697-717, 1979.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73-99, 2004.
K. Fukumizu, F. Bach, and A. Gretton. Consistency of kernel canonical correlation analysis. Technical Report

Hard-to-detect dependence (1)
COCO can be small for dependent random variables with highly non-smooth densities.
Reason: the norms in the denominator,

COCO(P; F, G) := \sup_{f \in F, \, g \in G} \frac{ \mathrm{cov}(f(x), g(y)) }{ \| f \|_F \, \| g \|_G }.

Result: the dependence is not detectable with a finite sample size. More formally: see Ingster [1989].

Hard-to-detect dependence (2)
The density takes the form P_{x,y} \propto 1 + \sin(\omega x) \sin(\omega y).
[Figure: a smooth and a rough density of this form, and samples drawn from each]

Hard-to-detect dependence (3)
Example: sinusoids of increasing frequency.
[Figure: densities for ω = 1, ..., 6, and the empirical average of COCO as a function of the frequency of the non-constant density component]

Choosing kernel size (1)
The RKHS norm of f is

\| f \|_{H_X}^2 := \sum_{i=1}^{\infty} f_i^2 \, (\hat{k}_i)^{-1},

with \hat{k}_i the kernel spectrum.
If the kernel decays quickly, its spectrum decays slowly: non-smooth functions then have smaller RKHS norm.
Example: spectra of two Gaussian kernels.
[Figure: Gaussian kernels with σ^2 = 1/2 and σ^2 = 2, and their spectra]
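
To illustrate the inverse relationship numerically, here is a small sketch (not from the slides) comparing the spectra of two Gaussian kernels via the FFT; the bandwidths match the σ^2 = 1/2 and σ^2 = 2 choices in the figure, while the grid and threshold are arbitrary.

```python
import numpy as np

# Spectra of two Gaussian kernels k(x) = exp(-x^2 / (2 sigma^2)) via the FFT.
x = np.linspace(-50, 50, 4096)
dx = x[1] - x[0]
for sigma2 in (0.5, 2.0):
    k = np.exp(-x ** 2 / (2 * sigma2))
    spectrum = np.abs(np.fft.fftshift(np.fft.fft(k))) * dx
    # A narrow kernel in x (small sigma^2) gives a broad, slowly decaying spectrum.
    width = (spectrum > 1e-3 * spectrum.max()).sum()   # crude measure of spectral width
    print(sigma2, spectrum.max(), width)
```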

Choosing kernel size (2)
Could we just decrease the kernel size? Yes, but only up to a point.
[Figure: empirical average of COCO as a function of the log kernel size]