Hilbert Space Representations of Probability Distributions


1 Hilbert Space Representations of Probability Distributions Arthur Gretton joint work with Karsten Borgwardt, Kenji Fukumizu, Malte Rasch, Bernhard Schölkopf, Alex Smola, Le Song, Choon Hui Teo Max Planck Institute for Biological Cybernetics, Tübingen, Germany

2 Overview
- The two-sample problem: are samples {x_1, ..., x_m} and {y_1, ..., y_n} generated from the same distribution?
- Kernel independence testing: given a sample of m pairs {(x_1, y_1), ..., (x_m, y_m)}, are the random variables x and y independent?

3 Kernels, feature maps

4-6 A very short introduction to kernels
- Hilbert space of functions f ∈ F from X to R
- RKHS: the evaluation operator δ_x : F → R is continuous
- Riesz: unique representer of evaluation k(x, ·) ∈ F: f(x) = ⟨f, k(x, ·)⟩_F
- k(x, ·) is the feature map; k : X × X → R is the kernel function
- Inner product between two feature maps: ⟨k(x_1, ·), k(x_2, ·)⟩_F = k(x_1, x_2)
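
To make the feature-map picture concrete, here is a minimal Python sketch (not from the slides) of a Gaussian RBF kernel k(x, x') = exp(−‖x − x'‖^2 / (2σ^2)); the bandwidth σ is a free parameter. The same function is reused in the sketches below.

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        # Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 sigma^2));
        # X is an (m, d) array, Y is an (n, d) array.
        sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
        return np.exp(-sq / (2.0 * sigma**2))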

7 A Kernel Method for the Two Sample Problem

8-10 The two-sample problem
- Given: m samples x := {x_1, ..., x_m} drawn i.i.d. from P; samples y drawn from Q
- Determine: are P and Q different?
- Applications: microarray data aggregation; speaker/author identification; schema matching
- Where is our test useful? High dimensionality; low sample size; structured data (strings and graphs): currently the only method

11-13 Overview (two-sample problem)
- How to detect P ≠ Q? Distance between means in a space of features; a function revealing differences in distributions. Same thing: the MMD [Gretton et al., 2007, Borgwardt et al., 2006]
- Hypothesis test using the MMD: asymptotic distribution of the MMD; large deviation bounds
- Experiments

14 Mean discrepancy (1)
- Simple example: two Gaussians with different means
- Answer: t-test
(Figure: densities of two Gaussians P_X and Q_X with different means.)

15-16 Mean discrepancy (2)
- Two Gaussians with same means, different variance
- Idea: look at the difference in means of features of the RVs
- In the Gaussian case: second order features, of the form x^2
(Figure: densities of two Gaussians with different variances, and the densities of the feature X^2 under each.)

17 Mean discrepancy (3)
- Gaussian and Laplace distributions
- Same mean and same variance
- Difference in means using higher order features
(Figure: Gaussian and Laplace densities P_X and Q_X.)

18-21 Function revealing difference in distributions (1)
- Idea: avoid density estimation when testing P ≠ Q [Fortet and Mourier, 1953]:

  D(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)]

- D(P, Q; F) = 0 iff P = Q, when F = bounded continuous functions [Dudley, 2002]
- D(P, Q; F) = 0 iff P = Q, when F = the unit ball in a universal RKHS F [via Steinwart, 2001]
- Examples: Gaussian, Laplace [see also Fukumizu et al., 2004]

22 Function revealing difference in distributions (2)
- Gauss vs Laplace revisited
(Figure: the witness function f for the Gauss and Laplace densities, plotted over both densities.)
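
The witness function in the figure has a simple empirical form: up to normalisation, it is the difference of mean feature maps evaluated at a point, f(t) ∝ (1/m) Σ_i k(x_i, t) − (1/n) Σ_j k(y_j, t). A sketch, reusing gaussian_kernel from above; grid is any array of evaluation points:

    def mmd_witness(X, Y, grid, sigma=1.0):
        # Unnormalised empirical witness: mean kernel similarity to the X
        # sample minus mean kernel similarity to the Y sample, per grid point.
        return (gaussian_kernel(grid, X, sigma).mean(axis=1)
                - gaussian_kernel(grid, Y, sigma).mean(axis=1))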

23-28 Function revealing difference in distributions (3)
The (kernel) MMD:

  MMD(P, Q; F) = ( sup_{f ∈ F} [E_P f(x) − E_Q f(y)] )^2
               = ( sup_{f ∈ F} ⟨f, µ_x − µ_y⟩_F )^2
               = ‖µ_x − µ_y‖_F^2 = ⟨µ_x − µ_y, µ_x − µ_y⟩_F
               = E_{P,P} k(x, x') + E_{Q,Q} k(y, y') − 2 E_{P,Q} k(x, y)

using E_P(f(x)) = E_P[⟨φ(x), f⟩_F] =: ⟨µ_x, f⟩_F and ‖µ‖_F = sup_{f ∈ F} ⟨f, µ⟩_F.
- x' is a R.V. independent of x with distribution P
- y' is a R.V. independent of y with distribution Q
- Kernel between measures [Hein and Bousquet, 2005]: K(P, Q) = E_{P,Q} k(x, y)
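
Reading the last line of the derivation off as sample averages gives a biased (plug-in) estimate of the MMD in a few lines; a sketch reusing gaussian_kernel:

    def mmd2_biased(X, Y, sigma=1.0):
        # Plug-in estimate: mean(Kxx) + mean(Kyy) - 2 mean(Kxy),
        # matching E k(x,x') + E k(y,y') - 2 E k(x,y) above.
        return (gaussian_kernel(X, X, sigma).mean()
                + gaussian_kernel(Y, Y, sigma).mean()
                - 2.0 * gaussian_kernel(X, Y, sigma).mean())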

29-32 Statistical test using MMD (1)
- Two hypotheses: H_0: null hypothesis (P = Q); H_1: alternative hypothesis (P ≠ Q)
- Observe samples x := {x_1, ..., x_m} from P and y from Q
- If the empirical MMD(x, y; F) is far from zero: reject H_0; close to zero: accept H_0
- How good is a test? Type I error: we reject H_0 although it is true. Type II error: we accept H_0 although it is false.
- A good test has a low Type II error for a user-defined Type I error

33-37 Statistical test using MMD (2)
- "Far from zero" vs "close to zero": where is the threshold?
- One answer: asymptotic distribution of MMD(x, y; F)
- An unbiased empirical estimate (quadratic cost):

  MMD(x, y; F) = 1/(m(m−1)) Σ_{i ≠ j} [ k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j) ],

  where the bracketed summand is denoted h((x_i, y_i), (x_j, y_j))
- When P ≠ Q, asymptotically normal [Hoeffding, 1948, Serfling, 1980]
- Expression for the variance, with z_i := (x_i, y_i):

  σ_u^2 = (4/m) ( E_z[(E_{z'} h(z, z'))^2] − [E_{z,z'} h(z, z')]^2 ) + O(m^{−2})
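
The unbiased quadratic-cost estimate above just drops the i = j terms from the Gram matrices; a sketch assuming equal sample sizes m = n, reusing gaussian_kernel:

    def mmd2_unbiased(X, Y, sigma=1.0):
        # U-statistic: (1/(m(m-1))) * sum over i != j of
        # k(x_i,x_j) - k(x_i,y_j) - k(y_i,x_j) + k(y_i,y_j).
        m = len(X)
        Kxx = gaussian_kernel(X, X, sigma)
        Kyy = gaussian_kernel(Y, Y, sigma)
        Kxy = gaussian_kernel(X, Y, sigma)
        h_sum = ((Kxx.sum() - np.trace(Kxx))
                 + (Kyy.sum() - np.trace(Kyy))
                 - 2.0 * (Kxy.sum() - np.trace(Kxy)))
        return h_sum / (m * (m - 1))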

38 Statistical test using MMD (3)
- Example: Laplace distributions with different variance
(Figure: two Laplace densities with different variances; the empirical MMD distribution under H_1, with a Gaussian fit.)

39-40 Statistical test using MMD (4)
- When P = Q, the U-statistic is degenerate: E_{z'}[h(z, z')] = 0 [Anderson et al., 1994]
- The distribution of m·MMD(x, y; F) converges to

  Σ_{l=1}^∞ λ_l [z_l^2 − 2], where z_l ~ N(0, 2) i.i.d. and

  ∫_X k̃(x, x') ψ_i(x) dP_x(x) = λ_i ψ_i(x'), with k̃ the centred kernel
(Figure: empirical PDF of the MMD under H_0.)
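
One way to work with this null distribution numerically is to estimate the eigenvalues λ_l from the spectrum of the centred Gram matrix of the pooled sample, then simulate Σ_l λ_l (z_l^2 − 2) directly. A sketch under that assumption (the slides themselves use the bootstrap and Pearson curves instead; see the next slide):

    def mmd_null_samples(X, Y, sigma=1.0, n_draws=1000, rng=None):
        # Approximate draws from the H0 limit sum_l lambda_l (z_l^2 - 2),
        # with lambda_l estimated from the centred pooled Gram matrix.
        rng = np.random.default_rng(rng)
        Z = np.vstack([X, Y])
        n = len(Z)
        H = np.eye(n) - np.ones((n, n)) / n
        lam = np.linalg.eigvalsh(H @ gaussian_kernel(Z, Z, sigma) @ H) / n
        z = rng.normal(0.0, np.sqrt(2.0), size=(n_draws, n))
        return (z**2 - 2.0) @ lam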

41-43 Statistical test using MMD (5)
- Given P = Q, want a threshold T such that P(MMD > T) ≤ 0.05
- Bootstrap for the empirical CDF [Arcones and Giné, 1992]
- Pearson curves by matching the first four moments [Johnson et al., 1994]
- Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
- Other...
(Figure: CDF of the MMD under H_0 and the Pearson fit.)
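
A sketch of the resampling route: recompute the statistic under random reassignments of the pooled sample (a permutation variant of the bootstrap on the slides) and take the empirical (1 − α) quantile as the threshold T; assumes equal sample sizes and reuses mmd2_unbiased:

    def mmd_threshold(X, Y, sigma=1.0, n_resamples=500, alpha=0.05, rng=None):
        # Null threshold: (1 - alpha) quantile of the statistic over random
        # splits of the pooled sample, under which H0 holds by construction.
        rng = np.random.default_rng(rng)
        Z = np.vstack([X, Y])
        m = len(X)
        stats = []
        for _ in range(n_resamples):
            idx = rng.permutation(len(Z))
            stats.append(mmd2_unbiased(Z[idx[:m]], Z[idx[m:]], sigma))
        return np.quantile(stats, 1.0 - alpha)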

44-47 Experiments
- Small sample size: Pearson more accurate than bootstrap; large sample size: bootstrap faster
- Cancer subtype (m = 25, 2118 dimensions): for Pearson, Type I 3.5%, Type II 0%; for bootstrap, Type I 0.9%, Type II 0%
- Neural spikes: for Pearson, Type I 4.8%, Type II 3.4%; for bootstrap, Type I 5.4%, Type II 3.3%
- Further experiments: comparison with the t-test, Friedman-Rafsky tests [Friedman and Rafsky, 1979], the Biau-Györfi test [Biau and Györfi, 2005], and the Hall-Tajvidi test [Hall and Tajvidi, 2002]

48 Conclusions (two-sample problem)
- The MMD: distance between means in feature spaces
- When the feature spaces are universal RKHSs, MMD = 0 iff P = Q
- Statistical test of whether P ≠ Q using the asymptotic distribution: Pearson approximation for low sample size; bootstrap for large sample size
- Useful in high dimensions and for structured data

49 Dependence Detection with Kernels

50-52 Kernel dependence measures
- Independence testing. Given: m samples z := {(x_1, y_1), ..., (x_m, y_m)} from P. Determine: does P = P_x P_y?
- Kernel dependence measures: zero only at independence; take into account high order moments; make sensible assumptions about smoothness
- Covariance operators in spaces of features: spectral norm (COCO) [Gretton et al., 2005c,d]; Hilbert-Schmidt norm (HSIC) [Gretton et al., 2005a]

53-56 Function revealing dependence (1)
- Idea: avoid density estimation when testing P = P_x P_y [Rényi, 1959]:

  COCO(P; F, G) := sup_{f ∈ F, g ∈ G} ( E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)] )

- COCO(P; F, G) = 0 iff x, y independent, when F and G are the respective unit balls in universal RKHSs F and G [via Steinwart, 2001]
- Examples: Gaussian, Laplace [see also Bach and Jordan, 2002]
- In geometric terms: the covariance operator C_xy : G → F is such that ⟨f, C_xy g⟩_F = E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)]
- COCO is the spectral norm of C_xy [Gretton et al., 2005c,d]: COCO(P; F, G) := ‖C_xy‖_S

57-59 Function revealing dependence (2)
- Ring-shaped density, correlation approx. zero [example from Fukumizu, Bach, and Gretton, 2005]
(Figure: sample from the ring-shaped density, correlation ≈ 0; the dependence witnesses f(x) and g(y); the witness images (f(X), g(Y)) have correlation ≈ 0.9, which gives COCO.)

60-61 Function revealing dependence (3)
- Empirical COCO(z; F, G): the largest eigenvalue γ of the generalized eigenvalue problem

  [ 0            (1/m) K̃ L̃ ] [c]       [ K̃  0 ] [c]
  [ (1/m) L̃ K̃   0          ] [d]  = γ  [ 0   L̃ ] [d]

- K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces: K̃ = HKH where H = I − (1/m) 1 1^T, and k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩_F, l(y_i, y_j) = ⟨ψ(y_i), ψ(y_j)⟩_G
- Witness function for x:

  f(x) = Σ_{i=1}^m c_i [ k(x_i, x) − (1/m) Σ_{j=1}^m k(x_j, x) ]
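
The eigenproblem above can be solved directly, with the eigenvector coefficients c, d giving the witness functions. A sketch that centres the Gram matrices and adds a small ridge to the right-hand block matrix for numerical stability (the ridge is an assumption of this sketch, not on the slides):

    from scipy.linalg import eigh

    def coco(K, L, eps=1e-8):
        # Empirical COCO: largest generalized eigenvalue of
        # [[0, KcLc/m], [LcKc/m, 0]] v = gamma [[Kc, 0], [0, Lc]] v.
        m = K.shape[0]
        H = np.eye(m) - np.ones((m, m)) / m
        Kc, Lc = H @ K @ H, H @ L @ H          # centred Gram matrices
        Z = np.zeros((m, m))
        A = np.block([[Z, Kc @ Lc / m], [Lc @ Kc / m, Z]])
        B = np.block([[Kc, Z], [Z, Lc]]) + eps * np.eye(2 * m)
        return eigh(A, B, eigvals_only=True)[-1]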

62-64 Function revealing dependence (4)
- Can we do better? A second example with zero correlation
(Figure: a second sample with correlation ≈ 0; the first pair of dependence witnesses f, g maps it to points with correlation ≈ 0.8, giving COCO; a second pair of witnesses f_2, g_2 gives correlation ≈ 0.37, giving COCO_2.)

65-66 Hilbert-Schmidt Independence Criterion
- Given γ_i := COCO_i(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005b]:

  HSIC(z; F, G) := Σ_{i=1}^m γ_i^2

- In the limit of infinite samples:

  HSIC(P; F, G) := ‖C_xy‖_HS^2 = ⟨C_xy, C_xy⟩_HS
                 = E_{x,x',y,y'}[k(x, x') l(y, y')] + E_{x,x'}[k(x, x')] E_{y,y'}[l(y, y')]
                   − 2 E_{x,y}[ E_{x'}[k(x, x')] E_{y'}[l(y, y')] ],

  with x' an independent copy of x, and y' a copy of y
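
Empirically, HSIC has a one-line estimator in terms of the Gram and centring matrices, tr(K H L H)/(m−1)^2, the biased estimator of Gretton et al. [2005b]; a sketch:

    def hsic(K, L):
        # Biased empirical HSIC: trace(K H L H) / (m - 1)^2,
        # with H = I - (1/m) 1 1^T the centring matrix.
        m = K.shape[0]
        H = np.eye(m) - np.ones((m, m)) / m
        return np.trace(K @ H @ L @ H) / (m - 1)**2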

67-69 Link between HSIC and MMD (1)
- Define the product space F × G with kernel ⟨Φ(x, y), Φ(x', y')⟩ = K((x, y), (x', y')) = k(x, x') l(y, y')
- Define the mean elements

  ⟨µ_xy, Φ(x, y)⟩ := E_{x',y'} ⟨Φ(x', y'), Φ(x, y)⟩ = E_{x',y'} k(x, x') l(y, y')   (x', y' drawn from the joint)

  ⟨µ_{x·y}, Φ(x, y)⟩ := E_{x'} E_{y'} ⟨Φ(x', y'), Φ(x, y)⟩ = E_{x'} k(x, x') E_{y'} l(y, y')   (x', y' drawn from the marginals)

- The MMD between these two mean elements is

  MMD(P, P_x P_y; F × G) = ‖µ_xy − µ_{x·y}‖_{F×G}^2 = ⟨µ_xy − µ_{x·y}, µ_xy − µ_{x·y}⟩ = HSIC(P; F, G)

70 Link between HSIC and MMD (2)
- Witness function for HSIC
(Figure: the HSIC dependence witness plotted over the sample.)

71-73 Independence test: verifying ICA and ISA
- HSICp: null distribution via sampling
- HSICg: null distribution via moment matching
- Compare with the contingency table test (PD) [Read and Cressie, 1988]
(Figure: samples mixed by rotations θ = π/8 and θ = π/4; % acceptance of H_0 vs rotation angle for PD, HSICp, and HSICg, across sample sizes and dimensions, e.g. 128 samples in dimension 1 and 512 samples in dimension 2.)

74 Other applications of HSIC
- Feature selection [Song et al., 2007c,a]
- Clustering [Song et al., 2007b]

75 Conclusions (dependence measures)
- COCO and HSIC: norms of the covariance operator between feature spaces
- When the feature spaces are universal RKHSs, COCO = HSIC = 0 iff P = P_x P_y
- Statistical test possible using the asymptotic distribution
- Independent component analysis: high accuracy; less sensitive to initialisation

76 Questions?


78 Bibliography

References

N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41-54, 1994.
M. Arcones and E. Giné. On the bootstrap of U and V statistics. The Annals of Statistics, 20(2), 1992.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1-48, 2002.
G. Biau and L. Györfi. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11), 2005.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70, 1953.
J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4), 1979.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73-99, 2004.
K. Fukumizu, F. Bach, and A. Gretton. Consistency of kernel canonical correlation analysis. Technical Report.

79-80 Hard-to-detect dependence (1)
- COCO can be very small for dependent RVs with highly non-smooth densities
- Reason: norms in the denominator:

  COCO(P; F, G) := sup_{f ∈ F, g ∈ G} cov(f(x), g(y)) / ( ‖f‖_F ‖g‖_G )

- RESULT: not detectable with finite sample size
- More formally: see Ingster [1989]

81 Hard-to-detect dependence (2)
- The density takes the form P_{x,y} ∝ 1 + sin(ωx) sin(ωy)
(Figure: a smooth and a rough density of this form, with samples drawn from each.)
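
For readers who want to reproduce the effect, here is a rejection-sampling sketch for this density on [−π, π]^2 (the support is an assumption of this sketch; the slides do not state it):

    def sample_sinusoid(omega, n, rng=None):
        # p(x, y) proportional to 1 + sin(omega x) sin(omega y) on [-pi, pi]^2;
        # the unnormalised density is bounded by 2, so accept w.p. p/2.
        rng = np.random.default_rng(rng)
        samples = []
        while len(samples) < n:
            x, y = rng.uniform(-np.pi, np.pi, size=2)
            if rng.uniform(0.0, 2.0) < 1.0 + np.sin(omega * x) * np.sin(omega * y):
                samples.append((x, y))
        return np.array(samples)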

82 Hard-to-detect dependence (3)
- Example: sinusoids of increasing frequency, ω = 1, ..., 6
(Figure: empirical average of COCO as a function of the frequency ω of the non-constant density component; COCO decreases as ω grows.)

83 Choosing kernel size (1)
- The RKHS norm of f is ‖f‖_{H_X}^2 := Σ_{i=1}^∞ f_i^2 (k_i)^{−1}
- If the kernel decays quickly, its spectrum decays slowly: non-smooth functions then have smaller RKHS norm
(Figure: two Gaussian kernels, σ^2 = 1/2 and σ^2 = 2, and their spectra.)

84 Choosing kernel size (2)
- Could we just decrease the kernel size? Yes, but only up to a point
(Figure: empirical average of COCO as a function of log kernel size.)
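
A common default for the kernel size in this literature, though not discussed on these slides, is the median heuristic: set σ to the median pairwise distance in the sample. A sketch:

    def median_heuristic(X):
        # sigma = median of pairwise Euclidean distances between sample points.
        diff = X[:, None, :] - X[None, :, :]
        d = np.sqrt((diff**2).sum(-1))
        return np.median(d[np.triu_indices_from(d, k=1)])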
