Hilbert Space Representations of Probability Distributions


1 Hilbert Space Representations of Probability Distributions Arthur Gretton joint work with Karsten Borgwardt, Kenji Fukumizu, Malte Rasch, Bernhard Schölkopf, Alex Smola, Le Song, Choon Hui Teo Max Planck Institute for Biological Cybernetics, Tübingen, Germany

2 Overview
- The two-sample problem: are samples {x_1, ..., x_m} and {y_1, ..., y_n} generated from the same distribution?
- Kernel independence testing: given a sample of m pairs {(x_1, y_1), ..., (x_m, y_m)}, are the random variables x and y independent?

3 Kernels, feature maps

4-6 A very short introduction to kernels
- Hilbert space of functions f ∈ F from X to R
- RKHS: the evaluation operator δ_x : F → R is continuous
- Riesz: unique representer of evaluation k(x, ·) ∈ F: f(x) = ⟨f, k(x, ·)⟩_F
- k(x, ·) is the feature map; k : X × X → R is the kernel function
- Inner product between two feature maps: ⟨k(x_1, ·), k(x_2, ·)⟩_F = k(x_1, x_2)
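
To make the feature-map picture concrete, here is a minimal Python sketch (not from the slides) of a Gaussian RBF kernel k(x, x') = exp(−‖x − x'‖^2 / (2σ^2)); the bandwidth σ is a free parameter. The same function is reused in the sketches below.

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        # Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 sigma^2));
        # X is an (m, d) array, Y is an (n, d) array.
        sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
        return np.exp(-sq / (2.0 * sigma**2))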

7 A Kernel Method for the Two Sample Problem

8-10 The two-sample problem
- Given: m samples x := {x_1, ..., x_m} drawn i.i.d. from P; samples y drawn from Q
- Determine: are P and Q different?
- Applications: microarray data aggregation; speaker/author identification; schema matching
- Where is our test useful? High dimensionality; low sample size; structured data (strings and graphs): currently the only method

11-13 Overview (two-sample problem)
- How to detect P ≠ Q? Distance between means in a space of features; a function revealing differences in distributions. Same thing: the MMD [Gretton et al., 2007, Borgwardt et al., 2006]
- Hypothesis test using the MMD: asymptotic distribution of the MMD; large deviation bounds
- Experiments

14 Mean discrepancy (1)
- Simple example: two Gaussians with different means
- Answer: t-test
(Figure: densities of two Gaussians P_X and Q_X with different means.)

15-16 Mean discrepancy (2)
- Two Gaussians with same means, different variance
- Idea: look at the difference in means of features of the RVs
- In the Gaussian case: second order features, of the form x^2
(Figure: densities of two Gaussians with different variances, and the densities of the feature X^2 under each.)

17 Mean discrepancy (3)
- Gaussian and Laplace distributions
- Same mean and same variance
- Difference in means using higher order features
(Figure: Gaussian and Laplace densities P_X and Q_X.)

18-21 Function revealing difference in distributions (1)
- Idea: avoid density estimation when testing P ≠ Q [Fortet and Mourier, 1953]:

  D(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)]

- D(P, Q; F) = 0 iff P = Q, when F = bounded continuous functions [Dudley, 2002]
- D(P, Q; F) = 0 iff P = Q, when F = the unit ball in a universal RKHS F [via Steinwart, 2001]
- Examples: Gaussian, Laplace [see also Fukumizu et al., 2004]

22 Function revealing difference in distributions (2)
- Gauss vs Laplace revisited
(Figure: the witness function f for the Gauss and Laplace densities, plotted over both densities.)
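
The witness function in the figure has a simple empirical form: up to normalisation, it is the difference of mean feature maps evaluated at a point, f(t) ∝ (1/m) Σ_i k(x_i, t) − (1/n) Σ_j k(y_j, t). A sketch, reusing gaussian_kernel from above; grid is any array of evaluation points:

    def mmd_witness(X, Y, grid, sigma=1.0):
        # Unnormalised empirical witness: mean kernel similarity to the X
        # sample minus mean kernel similarity to the Y sample, per grid point.
        return (gaussian_kernel(grid, X, sigma).mean(axis=1)
                - gaussian_kernel(grid, Y, sigma).mean(axis=1))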

23-28 Function revealing difference in distributions (3)
The (kernel) MMD:

  MMD(P, Q; F) = ( sup_{f ∈ F} [E_P f(x) − E_Q f(y)] )^2
               = ( sup_{f ∈ F} ⟨f, µ_x − µ_y⟩_F )^2
               = ‖µ_x − µ_y‖_F^2 = ⟨µ_x − µ_y, µ_x − µ_y⟩_F
               = E_{P,P} k(x, x') + E_{Q,Q} k(y, y') − 2 E_{P,Q} k(x, y)

using E_P(f(x)) = E_P[⟨φ(x), f⟩_F] =: ⟨µ_x, f⟩_F and ‖µ‖_F = sup_{f ∈ F} ⟨f, µ⟩_F.
- x' is a R.V. independent of x with distribution P
- y' is a R.V. independent of y with distribution Q
- Kernel between measures [Hein and Bousquet, 2005]: K(P, Q) = E_{P,Q} k(x, y)
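
Reading the last line of the derivation off as sample averages gives a biased (plug-in) estimate of the MMD in a few lines; a sketch reusing gaussian_kernel:

    def mmd2_biased(X, Y, sigma=1.0):
        # Plug-in estimate: mean(Kxx) + mean(Kyy) - 2 mean(Kxy),
        # matching E k(x,x') + E k(y,y') - 2 E k(x,y) above.
        return (gaussian_kernel(X, X, sigma).mean()
                + gaussian_kernel(Y, Y, sigma).mean()
                - 2.0 * gaussian_kernel(X, Y, sigma).mean())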

29-32 Statistical test using MMD (1)
- Two hypotheses: H_0: null hypothesis (P = Q); H_1: alternative hypothesis (P ≠ Q)
- Observe samples x := {x_1, ..., x_m} from P and y from Q
- If the empirical MMD(x, y; F) is far from zero: reject H_0; close to zero: accept H_0
- How good is a test? Type I error: we reject H_0 although it is true. Type II error: we accept H_0 although it is false.
- A good test has a low Type II error for a user-defined Type I error

33-37 Statistical test using MMD (2)
- "Far from zero" vs "close to zero": where is the threshold?
- One answer: asymptotic distribution of MMD(x, y; F)
- An unbiased empirical estimate (quadratic cost):

  MMD(x, y; F) = 1/(m(m−1)) Σ_{i ≠ j} [ k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j) ],

  where the bracketed summand is denoted h((x_i, y_i), (x_j, y_j))
- When P ≠ Q, asymptotically normal [Hoeffding, 1948, Serfling, 1980]
- Expression for the variance, with z_i := (x_i, y_i):

  σ_u^2 = (4/m) ( E_z[(E_{z'} h(z, z'))^2] − [E_{z,z'} h(z, z')]^2 ) + O(m^{−2})
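
The unbiased quadratic-cost estimate above just drops the i = j terms from the Gram matrices; a sketch assuming equal sample sizes m = n, reusing gaussian_kernel:

    def mmd2_unbiased(X, Y, sigma=1.0):
        # U-statistic: (1/(m(m-1))) * sum over i != j of
        # k(x_i,x_j) - k(x_i,y_j) - k(y_i,x_j) + k(y_i,y_j).
        m = len(X)
        Kxx = gaussian_kernel(X, X, sigma)
        Kyy = gaussian_kernel(Y, Y, sigma)
        Kxy = gaussian_kernel(X, Y, sigma)
        h_sum = ((Kxx.sum() - np.trace(Kxx))
                 + (Kyy.sum() - np.trace(Kyy))
                 - 2.0 * (Kxy.sum() - np.trace(Kxy)))
        return h_sum / (m * (m - 1))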

38 Statistical test using MMD (3)
- Example: Laplace distributions with different variance
(Figure: two Laplace densities with different variances; the empirical MMD distribution under H_1, with a Gaussian fit.)

39-40 Statistical test using MMD (4)
- When P = Q, the U-statistic is degenerate: E_{z'}[h(z, z')] = 0 [Anderson et al., 1994]
- The distribution of m·MMD(x, y; F) converges to

  Σ_{l=1}^∞ λ_l [z_l^2 − 2], where z_l ~ N(0, 2) i.i.d. and

  ∫_X k̃(x, x') ψ_i(x) dP_x(x) = λ_i ψ_i(x'), with k̃ the centred kernel
(Figure: empirical PDF of the MMD under H_0.)
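
One way to work with this null distribution numerically is to estimate the eigenvalues λ_l from the spectrum of the centred Gram matrix of the pooled sample, then simulate Σ_l λ_l (z_l^2 − 2) directly. A sketch under that assumption (the slides themselves use the bootstrap and Pearson curves instead; see the next slide):

    def mmd_null_samples(X, Y, sigma=1.0, n_draws=1000, rng=None):
        # Approximate draws from the H0 limit sum_l lambda_l (z_l^2 - 2),
        # with lambda_l estimated from the centred pooled Gram matrix.
        rng = np.random.default_rng(rng)
        Z = np.vstack([X, Y])
        n = len(Z)
        H = np.eye(n) - np.ones((n, n)) / n
        lam = np.linalg.eigvalsh(H @ gaussian_kernel(Z, Z, sigma) @ H) / n
        z = rng.normal(0.0, np.sqrt(2.0), size=(n_draws, n))
        return (z**2 - 2.0) @ lam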

41-43 Statistical test using MMD (5)
- Given P = Q, want a threshold T such that P(MMD > T) ≤ 0.05
- Bootstrap for the empirical CDF [Arcones and Giné, 1992]
- Pearson curves by matching the first four moments [Johnson et al., 1994]
- Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
- Other...
(Figure: CDF of the MMD under H_0 and the Pearson fit.)
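
A sketch of the resampling route: recompute the statistic under random reassignments of the pooled sample (a permutation variant of the bootstrap on the slides) and take the empirical (1 − α) quantile as the threshold T; assumes equal sample sizes and reuses mmd2_unbiased:

    def mmd_threshold(X, Y, sigma=1.0, n_resamples=500, alpha=0.05, rng=None):
        # Null threshold: (1 - alpha) quantile of the statistic over random
        # splits of the pooled sample, under which H0 holds by construction.
        rng = np.random.default_rng(rng)
        Z = np.vstack([X, Y])
        m = len(X)
        stats = []
        for _ in range(n_resamples):
            idx = rng.permutation(len(Z))
            stats.append(mmd2_unbiased(Z[idx[:m]], Z[idx[m:]], sigma))
        return np.quantile(stats, 1.0 - alpha)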

44-47 Experiments
- Small sample size: Pearson more accurate than bootstrap; large sample size: bootstrap faster
- Cancer subtype (m = 25, 2118 dimensions): for Pearson, Type I 3.5%, Type II 0%; for bootstrap, Type I 0.9%, Type II 0%
- Neural spikes: for Pearson, Type I 4.8%, Type II 3.4%; for bootstrap, Type I 5.4%, Type II 3.3%
- Further experiments: comparison with the t-test, Friedman-Rafsky tests [Friedman and Rafsky, 1979], the Biau-Györfi test [Biau and Györfi, 2005], and the Hall-Tajvidi test [Hall and Tajvidi, 2002]

48 Conclusions (two-sample problem)
- The MMD: distance between means in feature spaces
- When the feature spaces are universal RKHSs, MMD = 0 iff P = Q
- Statistical test of whether P ≠ Q using the asymptotic distribution: Pearson approximation for low sample size; bootstrap for large sample size
- Useful in high dimensions and for structured data

49 Dependence Detection with Kernels

50-52 Kernel dependence measures
- Independence testing. Given: m samples z := {(x_1, y_1), ..., (x_m, y_m)} from P. Determine: does P = P_x P_y?
- Kernel dependence measures: zero only at independence; take into account high order moments; make sensible assumptions about smoothness
- Covariance operators in spaces of features: spectral norm (COCO) [Gretton et al., 2005c,d]; Hilbert-Schmidt norm (HSIC) [Gretton et al., 2005a]

53-56 Function revealing dependence (1)
- Idea: avoid density estimation when testing P = P_x P_y [Rényi, 1959]:

  COCO(P; F, G) := sup_{f ∈ F, g ∈ G} ( E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)] )

- COCO(P; F, G) = 0 iff x, y independent, when F and G are the respective unit balls in universal RKHSs F and G [via Steinwart, 2001]
- Examples: Gaussian, Laplace [see also Bach and Jordan, 2002]
- In geometric terms: the covariance operator C_xy : G → F is such that ⟨f, C_xy g⟩_F = E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)]
- COCO is the spectral norm of C_xy [Gretton et al., 2005c,d]: COCO(P; F, G) := ‖C_xy‖_S

57-59 Function revealing dependence (2)
- Ring-shaped density, correlation approx. zero [example from Fukumizu, Bach, and Gretton, 2005]
(Figure: sample from the ring-shaped density, correlation ≈ 0; the dependence witnesses f(x) and g(y); the witness images (f(X), g(Y)) have correlation ≈ 0.9, which gives COCO.)

60-61 Function revealing dependence (3)
- Empirical COCO(z; F, G): the largest eigenvalue γ of the generalized eigenvalue problem

  [ 0            (1/m) K̃ L̃ ] [c]       [ K̃  0 ] [c]
  [ (1/m) L̃ K̃   0          ] [d]  = γ  [ 0   L̃ ] [d]

- K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces: K̃ = HKH where H = I − (1/m) 1 1^T, and k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩_F, l(y_i, y_j) = ⟨ψ(y_i), ψ(y_j)⟩_G
- Witness function for x:

  f(x) = Σ_{i=1}^m c_i [ k(x_i, x) − (1/m) Σ_{j=1}^m k(x_j, x) ]
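
The eigenproblem above can be solved directly, with the eigenvector coefficients c, d giving the witness functions. A sketch that centres the Gram matrices and adds a small ridge to the right-hand block matrix for numerical stability (the ridge is an assumption of this sketch, not on the slides):

    from scipy.linalg import eigh

    def coco(K, L, eps=1e-8):
        # Empirical COCO: largest generalized eigenvalue of
        # [[0, KcLc/m], [LcKc/m, 0]] v = gamma [[Kc, 0], [0, Lc]] v.
        m = K.shape[0]
        H = np.eye(m) - np.ones((m, m)) / m
        Kc, Lc = H @ K @ H, H @ L @ H          # centred Gram matrices
        Z = np.zeros((m, m))
        A = np.block([[Z, Kc @ Lc / m], [Lc @ Kc / m, Z]])
        B = np.block([[Kc, Z], [Z, Lc]]) + eps * np.eye(2 * m)
        return eigh(A, B, eigvals_only=True)[-1]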

62-64 Function revealing dependence (4)
- Can we do better? A second example with zero correlation
(Figure: a second sample with correlation ≈ 0; the first pair of dependence witnesses f, g maps it to points with correlation ≈ 0.8, giving COCO; a second pair of witnesses f_2, g_2 gives correlation ≈ 0.37, giving COCO_2.)

65-66 Hilbert-Schmidt Independence Criterion
- Given γ_i := COCO_i(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005b]:

  HSIC(z; F, G) := Σ_{i=1}^m γ_i^2

- In the limit of infinite samples:

  HSIC(P; F, G) := ‖C_xy‖_HS^2 = ⟨C_xy, C_xy⟩_HS
                 = E_{x,x',y,y'}[k(x, x') l(y, y')] + E_{x,x'}[k(x, x')] E_{y,y'}[l(y, y')]
                   − 2 E_{x,y}[ E_{x'}[k(x, x')] E_{y'}[l(y, y')] ],

  with x' an independent copy of x, and y' a copy of y
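
Empirically, HSIC has a one-line estimator in terms of the Gram and centring matrices, tr(K H L H)/(m−1)^2, the biased estimator of Gretton et al. [2005b]; a sketch:

    def hsic(K, L):
        # Biased empirical HSIC: trace(K H L H) / (m - 1)^2,
        # with H = I - (1/m) 1 1^T the centring matrix.
        m = K.shape[0]
        H = np.eye(m) - np.ones((m, m)) / m
        return np.trace(K @ H @ L @ H) / (m - 1)**2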

67-69 Link between HSIC and MMD (1)
- Define the product space F × G with kernel ⟨Φ(x, y), Φ(x', y')⟩ = K((x, y), (x', y')) = k(x, x') l(y, y')
- Define the mean elements

  ⟨µ_xy, Φ(x, y)⟩ := E_{x',y'} ⟨Φ(x', y'), Φ(x, y)⟩ = E_{x',y'} k(x, x') l(y, y')   (x', y' drawn from the joint)

  ⟨µ_{x·y}, Φ(x, y)⟩ := E_{x'} E_{y'} ⟨Φ(x', y'), Φ(x, y)⟩ = E_{x'} k(x, x') E_{y'} l(y, y')   (x', y' drawn from the marginals)

- The MMD between these two mean elements is

  MMD(P, P_x P_y; F × G) = ‖µ_xy − µ_{x·y}‖_{F×G}^2 = ⟨µ_xy − µ_{x·y}, µ_xy − µ_{x·y}⟩ = HSIC(P; F, G)

70 Link between HSIC and MMD (2)
- Witness function for HSIC
(Figure: the HSIC dependence witness plotted over the sample.)

71-73 Independence test: verifying ICA and ISA
- HSICp: null distribution via sampling
- HSICg: null distribution via moment matching
- Compare with the contingency table test (PD) [Read and Cressie, 1988]
(Figure: samples mixed by rotations θ = π/8 and θ = π/4; % acceptance of H_0 vs rotation angle for PD, HSICp, and HSICg, across sample sizes and dimensions, e.g. 128 samples in dimension 1 and 512 samples in dimension 2.)

74 Other applications of HSIC
- Feature selection [Song et al., 2007c,a]
- Clustering [Song et al., 2007b]

75 Conclusions (dependence measures)
- COCO and HSIC: norms of the covariance operator between feature spaces
- When the feature spaces are universal RKHSs, COCO = HSIC = 0 iff P = P_x P_y
- Statistical test possible using the asymptotic distribution
- Independent component analysis: high accuracy; less sensitive to initialisation

76 Questions?


78 Bibliography

References

N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41-54, 1994.
M. Arcones and E. Giné. On the bootstrap of U and V statistics. The Annals of Statistics, 20(2), 1992.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1-48, 2002.
G. Biau and L. Györfi. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11), 2005.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70, 1953.
J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4), 1979.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73-99, 2004.
K. Fukumizu, F. Bach, and A. Gretton. Consistency of kernel canonical correlation analysis. Technical Report.

79-80 Hard-to-detect dependence (1)
- COCO can be very small for dependent RVs with highly non-smooth densities
- Reason: norms in the denominator:

  COCO(P; F, G) := sup_{f ∈ F, g ∈ G} cov(f(x), g(y)) / ( ‖f‖_F ‖g‖_G )

- RESULT: not detectable with finite sample size
- More formally: see Ingster [1989]

81 Hard-to-detect dependence (2)
- The density takes the form P_{x,y} ∝ 1 + sin(ωx) sin(ωy)
(Figure: a smooth and a rough density of this form, with samples drawn from each.)
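
For readers who want to reproduce the effect, here is a rejection-sampling sketch for this density on [−π, π]^2 (the support is an assumption of this sketch; the slides do not state it):

    def sample_sinusoid(omega, n, rng=None):
        # p(x, y) proportional to 1 + sin(omega x) sin(omega y) on [-pi, pi]^2;
        # the unnormalised density is bounded by 2, so accept w.p. p/2.
        rng = np.random.default_rng(rng)
        samples = []
        while len(samples) < n:
            x, y = rng.uniform(-np.pi, np.pi, size=2)
            if rng.uniform(0.0, 2.0) < 1.0 + np.sin(omega * x) * np.sin(omega * y):
                samples.append((x, y))
        return np.array(samples)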

82 Hard-to-detect dependence (3)
- Example: sinusoids of increasing frequency, ω = 1, ..., 6
(Figure: empirical average of COCO as a function of the frequency ω of the non-constant density component; COCO decreases as ω grows.)

83 Choosing kernel size (1)
- The RKHS norm of f is ‖f‖_{H_X}^2 := Σ_{i=1}^∞ f_i^2 (k_i)^{−1}
- If the kernel decays quickly, its spectrum decays slowly: non-smooth functions then have smaller RKHS norm
(Figure: two Gaussian kernels, σ^2 = 1/2 and σ^2 = 2, and their spectra.)

84 Choosing kernel size (2)
- Could we just decrease the kernel size? Yes, but only up to a point
(Figure: empirical average of COCO as a function of log kernel size.)
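
A common default for the kernel size in this literature, though not discussed on these slides, is the median heuristic: set σ to the median pairwise distance in the sample. A sketch:

    def median_heuristic(X):
        # sigma = median of pairwise Euclidean distances between sample points.
        diff = X[:, None, :] - X[None, :, :]
        d = np.sqrt((diff**2).sum(-1))
        return np.median(d[np.triu_indices_from(d, k=1)])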
