Lecture 2: Mappings of Probabilities to RKHS and Applications
1 Lecture 2: Mappings of Probabilities to RKHS and Applications Lille, 2014 Arthur Gretton Gatsby Unit, CSML, UCL
2 Detecting differences in brain signals The problem: Do local field potential (LFP) signals change when measured near a spike burst? [Figure: LFP amplitude vs. time, near a spike burst and without a spike burst]
3 Detecting differences in brain signals The problem: Do local field potential (LFP) signals change when measured near a spike burst?
5 Detecting statistical dependence How do you detect dependence... in a discrete domain? [Read and Cressie, 1988]
9 Detecting statistical dependence How do you detect dependence... in a discrete domain? [Read and Cressie, 1988] X 1 : Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet. X 2 : No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development. Y 1 : Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent. Y 2 : Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants. Are the French text extracts translations of the English ones?
10 Detecting statistical dependence, continuous domains How do you detect dependence... in a continuous domain? [Figure: a sample from P_XY; is it from the dependent P_XY, or from the independent P_XY = P_X P_Y?]
11 Detecting statistical dependence, continuous domains How do you detect dependence... in a continuous domain? [Figure: sample from P_XY, with the discretized empirical P_XY and the discretized empirical P_X P_Y]
14 Detecting statistical dependence, continuous domains How do you detect dependence... in a continuous domain? Problem: fails even in low dimensions! [NIPS07a, ALT08] X and Y in R^4, statistic = power divergence, samples = 1024, cases where dependence detected = 0/5. Too few points per bin. Can we represent and compare distributions in high dimensions?
15 Further motivating questions Compare distributions with high dimension / low sample size / complex structure: microarray data (aggregation problem); neuroscience: naturalistic stimulus, complex response; images and text on the web (kernels on structured data)
18 Outline Kernel metric on the space of probability measures: function revealing differences in distributions; distance between means in space of features (RKHS); characteristic kernels: feature-space mappings of probabilities unique. Dependence detection: covariance and correlation in feature space. Advanced topics: testing for big data; Bayesian inference without models; three-way interactions; energy distance/distance covariance as a special case of RKHS distances
19 Kernel distance between distributions
20 Feature mean difference Simple example: two Gaussians with different means. Answer: t-test. [Figure: probability densities of two Gaussians P_X, Q_X with different means]
22 Feature mean difference Two Gaussians with same means, different variance. Idea: look at the difference in means of features of the RVs. In the Gaussian case: second order features of the form φ(x) = x². [Figure: densities of P_X, Q_X; densities of the feature X²]
23 Feature mean difference Gaussian and Laplace distributions: same mean and same variance. Difference in means using higher order features... RKHS. [Figure: Gaussian and Laplace densities P_X, Q_X]
26 Reminder: feature maps and the RKHS Feature map of x ∈ R², written φ_x:
φ^(p)(x) = [x₁², x₂², √2 x₁x₂],  φ^(g)(x) = [... √λ_i e_i(x) ...] ∈ ℓ₂.
Inner product between feature maps:
⟨φ^(p)(x), φ^(p)(y)⟩_F = ⟨x, y⟩²,  ⟨φ^(g)(x), φ^(g)(y)⟩_F = exp(−λ‖x − y‖²).
In general, ⟨φ_{x₁}, φ_{x₂}⟩_F = k(x₁, x₂) for a positive definite k(x, y). Kernels are inner products of feature maps.
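The finite-dimensional identity above can be checked numerically. A minimal sketch in NumPy (the helper names `phi_poly` and `k_poly` are my own, not from the slides): it builds the explicit degree-2 feature map on R² and confirms its inner product equals the polynomial kernel ⟨x, y⟩².

```python
import numpy as np

def phi_poly(x):
    # Explicit degree-2 polynomial feature map on R^2:
    # phi(x) = [x1^2, x2^2, sqrt(2)*x1*x2]
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_poly(x, y):
    # Degree-2 polynomial kernel: k(x, y) = <x, y>^2
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

# The kernel is the inner product of the explicit feature maps
assert np.isclose(np.dot(phi_poly(x), phi_poly(y)), k_poly(x, y))
```

The Gaussian kernel plays the same game with an infinite feature expansion, which is why one evaluates k directly rather than the features.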
27 Reminder: feature view of RKHS functions Reproducing property:
f(x) = Σ_{i=1}^m α_i k(x_i, x) = Σ_{i=1}^m α_i ⟨φ(x_i), φ(x)⟩_F = ⟨f(·), φ(x)⟩_F,
where f(·) = Σ_{i=1}^m α_i φ(x_i). [Figure: f(x) plotted against x]
29 The mean in feature space For finite dimensional feature spaces, we can define expectations in terms of inner products.
φ(x) = k(·, x) = [x, x²],  f(·) = [a, b].
Then f(x) = ⟨[a, b], [x, x²]⟩ = ax + bx² = ⟨f(·), φ(x)⟩_F.
Consider a random variable x ∼ P:
E_P f(x) = a E_P(x) + b E_P(x²) = ⟨[a, b], [E_P(x), E_P(x²)]⟩ =: ⟨[a, b], µ_P⟩.
Does this reasoning translate to infinite dimensions?
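A quick numerical sketch of this finite-dimensional case (the sample size and the coefficients a, b are arbitrary choices of mine): with features φ(x) = [x, x²], the empirical mean embedding is just the vector of the first two empirical moments, and ⟨[a, b], µ_P⟩ recovers the average of f(x) = ax + bx².

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)  # sample from P

# Feature map phi(x) = [x, x^2]: the empirical mean embedding is the
# vector of the first two empirical moments
mu_P = np.array([x.mean(), (x ** 2).mean()])

# Any f(x) = a*x + b*x^2 satisfies E_P f(x) = <[a, b], mu_P>
a, b = 0.7, -0.3
lhs = (a * x + b * x ** 2).mean()  # direct Monte Carlo average of f
rhs = np.dot([a, b], mu_P)         # inner product with the mean embedding
assert np.isclose(lhs, rhs)        # equal by linearity, up to float error
```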
31 Does the feature space mean exist? Does there exist an element µ_P ∈ F such that E_P f(x) = ⟨f(·), µ_P(·)⟩_F for all f ∈ F?
Recall the concept of a bounded operator: a linear operator A : F → R is bounded when, for all f ∈ F, |Af| ≤ λ_A ‖f‖_F.
Riesz representation theorem: in a Hilbert space F, every bounded linear operator A can be written Af = ⟨f(·), g_A(·)⟩_F for some g_A ∈ F.
32 Does the feature space mean exist? Existence of the mean embedding: if E_P √k(x, x) < ∞ then µ_P ∈ F.
Proof: the linear operator T_P f := E_P f(x) for all f ∈ F is bounded under the assumption, since
|T_P f| ≤ E_P |f(x)| = E_P |⟨f(·), φ(x)⟩_F| ≤ (E_P √k(x, x)) ‖f‖_F.
Hence by Riesz (with λ_{T_P} = E_P √k(x, x)), there exists µ_P ∈ F such that T_P f = ⟨f(·), µ_P(·)⟩_F.
33 µ_P is the feature map of a probability Embedding of P to feature space: the mean embedding µ_P ∈ F satisfies ⟨µ_P(·), f(·)⟩_F = E_P f(x).
What does the feature map of a probability look like?
µ_P(x) = ⟨µ_P(·), φ(x)⟩_F = ⟨µ_P(·), k(·, x)⟩_F = E_P k(x̄, x), x̄ ∼ P. Expectation of the kernel!
Empirical estimate: µ̂_P(x) = (1/m) Σ_{i=1}^m k(x_i, x), x_i ∼ P
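The empirical estimate µ̂_P(x) = (1/m) Σ_i k(x_i, x) is straightforward to compute. A sketch with a Gaussian kernel (the bandwidth 0.5 and the sample are my arbitrary choices): the embedding is a smooth function that is large where P puts mass.

```python
import numpy as np

def gauss_k(a, b, sigma=0.5):
    # Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 sigma^2))
    return np.exp(-((a - b) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
xs = rng.normal(size=500)  # x_i ~ P = N(0, 1)

def mu_hat(t):
    # Empirical mean embedding at t: (1/m) sum_i k(x_i, t)
    return gauss_k(xs, t).mean()

# The embedding peaks where P concentrates (near 0), not in the tails
assert mu_hat(0.0) > mu_hat(3.0)
```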
35 Function Showing Difference in Distributions Are P and Q different? [Figure: samples from P and Q]
37 Function Showing Difference in Distributions Maximum mean discrepancy: smooth function for P vs Q.
MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)].
[Figure: a smooth function f(x)]
39 Function Showing Difference in Distributions What if the function is not smooth?
MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)].
[Figure: a bounded continuous function f(x)]
41 Function Showing Difference in Distributions Maximum mean discrepancy: smooth function for P vs Q.
MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)].
Gauss P vs Laplace Q. [Figure: witness f for Gauss and Laplace densities]
44 Function Showing Difference in Distributions Maximum mean discrepancy: smooth function for P vs Q.
MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)].
Classical results: MMD(P, Q; F) = 0 iff P = Q, when F = bounded continuous [Dudley, 2002]; F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]; F = bounded Lipschitz (Earth mover's distance) [Dudley, 2002].
MMD(P, Q; F) = 0 iff P = Q when F = the unit ball in a characteristic RKHS F [ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10].
How do smooth functions relate to feature maps?
45 Function view vs feature mean view The (kernel) MMD: [ISMB06, NIPS06a]
MMD²(P, Q; F) = ( sup_{f ∈ F} [E_P f(x) − E_Q f(y)] )².
[Figure: witness f for Gauss and Laplace densities]
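The witness function plotted on the slide can be estimated directly from samples, since up to normalisation it is proportional to µ_P − µ_Q. A sketch under my own choices (Gaussian kernel with bandwidth 0.5; Gaussian P and Laplace Q matched to unit variance): the witness is negative near zero, where the peakier Laplace density dominates, and positive in the shoulders where the Gaussian has more mass.

```python
import numpy as np

def gauss_k(a, b, sigma=0.5):
    return np.exp(-((a - b) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
xs = rng.normal(size=5000)                         # P: N(0, 1)
ys = rng.laplace(scale=1 / np.sqrt(2), size=5000)  # Q: Laplace, unit variance

def witness(t):
    # Unnormalised witness: f*(t) proportional to mu_P(t) - mu_Q(t)
    return gauss_k(xs, t).mean() - gauss_k(ys, t).mean()

assert witness(0.0) < 0   # Laplace is peakier at the origin
assert witness(1.4) > 0   # Gaussian carries more mass in the shoulders
```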
48 Function view vs feature mean view The (kernel) MMD: [ISMB06, NIPS06a]
MMD²(P, Q; F) = ( sup_{f ∈ F} [E_P f(x) − E_Q f(y)] )²    (use E_P f(x) =: ⟨µ_P, f⟩_F)
= ( sup_{f ∈ F} ⟨f, µ_P − µ_Q⟩_F )²    (use ‖θ‖_F = sup_{f ∈ F} ⟨f, θ⟩_F)
= ‖µ_P − µ_Q‖²_F.
Function view and feature view equivalent
55 Empirical estimate of MMD An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,
MMD² = 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − (1/m²) Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)].
Proof:
‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
= ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
= ⟨E_P φ(x), E_P φ(x')⟩ + ...
= E_P ⟨φ(x), φ(x')⟩ + ...
= E_P k(x, x') + E_Q k(y, y') − 2 E_{P,Q} k(x, y).
Then Ê k(x, x') = 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} k(x_i, x_j)
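The unbiased estimator above is a few lines of NumPy. A sketch with a Gaussian kernel (the bandwidth 1 and the sample sizes are my choices): the statistic hovers near zero when both samples come from the same Gaussian, and is clearly positive under a mean shift.

```python
import numpy as np

def gauss_k(a, b, sigma=1.0):
    # Gaussian kernel matrix between two 1-d samples
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    # Unbiased estimate of MMD^2 = ||mu_P - mu_Q||^2_F:
    # diagonal terms are dropped from the within-sample sums
    m = len(x)
    Kxx = gauss_k(x, x, sigma); np.fill_diagonal(Kxx, 0.0)
    Kyy = gauss_k(y, y, sigma); np.fill_diagonal(Kyy, 0.0)
    Kxy = gauss_k(x, y, sigma)
    return (Kxx.sum() + Kyy.sum()) / (m * (m - 1)) - 2.0 * Kxy.mean()

rng = np.random.default_rng(3)
same = mmd2_unbiased(rng.normal(size=1000), rng.normal(size=1000))
diff = mmd2_unbiased(rng.normal(size=1000), rng.normal(loc=1.0, size=1000))
assert abs(same) < 0.01  # near zero when P = Q
assert diff > 0.05       # clearly positive when P and Q differ
```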
56 MMD for independence Dependence measure: [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
( sup_f [E_{P_XY} f − E_{P_X P_Y} f] )² = ( sup_{‖f‖≤1} ⟨f, µ_{P_XY} − µ_{P_X P_Y}⟩ )² = ‖µ_{P_XY} − µ_{P_X P_Y}‖²_{F⊗G} =: HSIC(P_XY, P_X P_Y).
[Figure: dependence witness and sample, Y vs X]
57 MMD for independence Dependence measure: [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
( sup_f [E_{P_XY} f − E_{P_X P_Y} f] )² = ‖µ_{P_XY} − µ_{P_X P_Y}‖²_{F⊗G} =: HSIC(P_XY, P_X P_Y).
Kernel on the product space: κ((x, y), (x', y')) = k(x, x') · l(y, y')
58 Characteristic kernels (Version 1: Via Universality)
62 Characteristic Kernels (via universality) Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08].
Classical result: P = Q if and only if E_P(f(x)) = E_Q(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X [Dudley, 2002].
Universal RKHS: k(x, x') continuous, X compact, and F dense in C(X) with respect to L_∞ [Steinwart, 2001].
If F universal, then MMD{P, Q; F} = 0 iff P = Q
64 Characteristic Kernels (via universality) Proof: First, it is clear that P = Q implies MMD{P, Q; F} is zero.
Converse: by the universality of F, for any given ε > 0 and f ∈ C(X) there exists g ∈ F with ‖f − g‖_∞ ≤ ε.
We next make the expansion
|E_P f(x) − E_Q f(y)| ≤ |E_P f(x) − E_P g(x)| + |E_P g(x) − E_Q g(y)| + |E_Q g(y) − E_Q f(y)|.
The first and third terms satisfy |E_P f(x) − E_P g(x)| ≤ E_P |f(x) − g(x)| ≤ ε.
65 Characteristic Kernels (via universality) Proof (continued): Next, write E_P g(x) − E_Q g(y) = ⟨g(·), µ_P − µ_Q⟩_F = 0, since MMD{P, Q; F} = 0 implies µ_P = µ_Q. Hence |E_P f(x) − E_Q f(y)| ≤ 2ε for all f ∈ C(X) and ε > 0, which implies P = Q.
66 Characteristic kernels (Version 2: Via Fourier)
67 Characteristic Kernels (via Fourier) Reminder: Fourier series. For a function on [−π, π] with periodic boundary:
f(x) = Σ_{l=−∞}^∞ f̂_l exp(ılx) = Σ_{l=−∞}^∞ f̂_l (cos(lx) + ı sin(lx)).
[Figure: top-hat function f(x) and its Fourier series coefficients f̂_l]
68 Characteristic Kernels (via Fourier) Reminder: Fourier series of a kernel k(x, y) = k(x − y) = k(z):
k(z) = Σ_{l=−∞}^∞ k̂_l exp(ılz).
E.g. Gaussian: k(x) = (1/√(2πσ²)) exp(−x²/(2σ²)),  k̂_l = (1/√(2π)) exp(−σ²l²/2).
[Figure: Gaussian kernel k(x) and its Fourier series coefficients]
70 Characteristic Kernels (via Fourier) Mean embedding via Fourier series: the Fourier series of P is its characteristic function φ_P; the Fourier series of the mean embedding is the product of Fourier series (convolution theorem):
µ_P(x) = E_P k(x − x̄) = ∫_{−π}^{π} k(x − t) dP(t),  so µ̂_{P,l} = k̂_l φ_{P,l}.
MMD can be written in terms of Fourier series:
MMD(P, Q; F) := ‖ Σ_{l=−∞}^∞ (φ_{P,l} − φ_{Q,l}) k̂_l exp(ılx) ‖_F.
Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08, JMLR10]
71 Example Example: P differs from Q at one frequency. [Figure: densities P(x) and Q(x)]
72 Characteristic Kernels (2) Example: P differs from Q at (roughly) one frequency. [Figure: P(x) and Q(x) with their characteristic functions φ_{P,l}, φ_{Q,l}]
73 Characteristic Kernels (2) Example: P differs from Q at (roughly) one frequency. [Figure: P(x), Q(x), their characteristic functions, and the characteristic function difference |φ_{P,l} − φ_{Q,l}|]
75 Example Is the Gaussian kernel characteristic? YES. [Figure: Gaussian kernel and its Fourier series coefficients]
MMD(P, Q; F) := ‖ Σ_{l=−∞}^∞ (φ_{P,l} − φ_{Q,l}) k̂_l exp(ılx) ‖_F
77 Example Is the triangle kernel characteristic? NO. [Figure: triangle kernel and its Fourier series coefficients]
MMD(P, Q; F) := ‖ Σ_{l=−∞}^∞ (φ_{P,l} − φ_{Q,l}) k̂_l exp(ılx) ‖_F
80 Characteristic Kernels (via Fourier) Can we prove characteristic on R^d?
Characteristic function of P via the Fourier transform: φ_P(ω) = ∫_{R^d} e^{ı x^⊤ω} dP(x).
Translation invariant kernels: k(x, y) = k(x − y) = k(z).
Bochner's theorem: k(z) = ∫_{R^d} e^{−ı z^⊤ω} dΛ(ω), with Λ a finite non-negative Borel measure
82 Characteristic Kernels (via Fourier) Fourier representation of MMD:
MMD(P, Q; F) := ‖ [ (φ_P(ω) − φ_Q(ω)) Λ(ω) ]ˇ ‖_F,
where φ_P is the characteristic function of P, f̂ denotes the Fourier transform, ˇf the inverse Fourier transform, and µ_P := ∫ k(·, x) dP(x)
83 Example Example: P differs from Q at (roughly) one frequency. [Figure: densities P(X) and Q(X)]
84 Example Example: P differs from Q at (roughly) one frequency. [Figure: P(X) and Q(X) with their characteristic functions φ_P, φ_Q]
85 Example Example: P differs from Q at (roughly) one frequency. [Figure: P(X), Q(X), their characteristic functions, and the characteristic function difference |φ_P − φ_Q|]
86 Example Example: P differs from Q at (roughly) one frequency. Gaussian kernel. [Figure: characteristic function difference |φ_P − φ_Q| and the Gaussian kernel spectrum vs frequency]
87 Example Example: P differs from Q at (roughly) one frequency. Characteristic. [Figure: product of the Gaussian kernel spectrum and |φ_P − φ_Q| vs frequency]
88 Example Example: P differs from Q at (roughly) one frequency. Sinc kernel. [Figure: |φ_P − φ_Q| and the sinc kernel spectrum vs frequency]
89 Example Example: P differs from Q at (roughly) one frequency. NOT characteristic. [Figure: product of the sinc kernel spectrum and |φ_P − φ_Q| vs frequency]
90 Example Example: P differs from Q at (roughly) one frequency. Triangle (B-spline) kernel. [Figure: |φ_P − φ_Q| and the triangle kernel spectrum vs frequency]
91 Example Example: P differs from Q at (roughly) one frequency. ??? [Figure: product of the triangle kernel spectrum and |φ_P − φ_Q| vs frequency]
92 Example Example: P differs from Q at (roughly) one frequency. Characteristic. [Figure: product of the triangle kernel spectrum and |φ_P − φ_Q| vs frequency]
93 Choosing the kernel Gaussian kernel example. [Figure: MMD(P, Q; F) vs frequency of the perturbation to P, with the perturbed densities Q(X) shown at several perturbing frequencies]
94 Why does MMD decay with increasing perturbation freq.? Fourier series argument (notationally easier, for periodic domains only):
MMD(P, Q; F) := ‖ Σ_{l=−∞}^∞ (φ_{P,l} − φ_{Q,l}) k̂_l exp(ılx) ‖_F.
The squared norm of a function f in F is:
‖f‖²_F = ⟨f, f⟩_F = Σ_{l=−∞}^∞ |f̂_l|² / k̂_l.
The squared MMD is:
MMD²(P, Q; F) = Σ_{l=−∞}^∞ |φ_{P,l} − φ_{Q,l}|² k̂_l² / k̂_l = Σ_{l=−∞}^∞ |φ_{P,l} − φ_{Q,l}|² k̂_l
95 Choosing the kernel B-spline kernel example. [Figure: MMD vs frequency of the perturbation to P, with the perturbed densities Q(X) shown at several perturbing frequencies]
98 Summary: Characteristic Kernels Characteristic kernel: (MMD = 0 iff P = Q) [NIPS07b, COLT08].
Main theorem: k characteristic for prob. measures on R^d if and only if supp(Λ) = R^d [COLT08, JMLR10].
Corollary: continuous, compactly supported k characteristic.
Similar reasoning wherever extensions of Bochner's theorem exist: [NIPS08a] locally compact Abelian groups (periodic domains, as we saw); compact, non-Abelian groups (orthogonal matrices); the semigroup R_+^n (histograms)
99 Statistical hypothesis testing
100 Reminder: detecting differences in brain signals The problem: Do local field potential (LFP) signals change when measured near a spike burst? [Figure: LFP amplitude vs. time, near a spike burst and without a spike burst]
104 Statistical test using MMD (1) Two hypotheses: H_0: null hypothesis (P = Q); H_1: alternative hypothesis (P ≠ Q). Observe samples x := {x_1, ..., x_n} from P and y from Q. If the empirical MMD(x, y; F) is "far from zero": reject H_0; "close to zero": accept H_0
107 Statistical test using MMD (2) "Far from zero" vs "close to zero": what threshold? One answer: the asymptotic distribution of MMD(x, y; F).
An unbiased empirical estimate (quadratic cost):
MMD(x, y; F) = (1/(n(n−1))) Σ_{i≠j} [ k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j) ], with summand h((x_i, y_i), (x_j, y_j)).
When P ≠ Q, this is asymptotically normal [Hoeffding, 1948, Serfling, 1980]. Expression for the variance, with z_i := (x_i, y_i):
σ_u² = (2²/n) ( E_z[(E_{z'} h(z, z'))²] − [E_{z,z'} h(z, z')]² ) + O(n^{−2})
108 Statistical test using MMD (3) Example: Laplace distributions with different variance. [Figure: two Laplace distributions with different variances; MMD distribution and Gaussian fit under H_1]
110 Statistical test using MMD (4) When P = Q, the U-statistic is degenerate: E_z[h(z, z')] = 0 [Anderson et al., 1994]. The distribution is
n·MMD(x, y; F) → Σ_{l=1}^∞ λ_l [z_l² − 2],
where z_l ∼ N(0, 2) i.i.d. and ∫_X k̃(x, x') ψ_i(x) dP(x) = λ_i ψ_i(x'), with k̃ the centred kernel.
[Figure: MMD density under H_0, empirical PDF]
113 Statistical test using MMD (5) Given P = Q, want a threshold T such that P(MMD > T) ≤ 0.05. Options: permutation for the empirical CDF [Arcones and Giné, 1992]; Pearson curves by matching the first four moments [Johnson et al., 1994]; large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]; a consistent test using the kernel eigenspectrum [NIPS09b]. [Figure: CDF of the MMD and Pearson fit]
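The permutation option is the simplest to implement. A sketch (Gaussian kernel, 200 permutations, and the sample sizes are my own choices, not from the slides): pooling the two samples and repeatedly splitting them at random simulates the null P = Q, and the test rejects when the observed statistic exceeds the (1 − α) quantile of the permuted statistics.

```python
import numpy as np

def gauss_k(a, b, sigma=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))

def mmd2_u(x, y, sigma=1.0):
    # Unbiased quadratic-time MMD^2 estimate
    m = len(x)
    Kxx = gauss_k(x, x, sigma); np.fill_diagonal(Kxx, 0.0)
    Kyy = gauss_k(y, y, sigma); np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() + Kyy.sum()) / (m * (m - 1)) - 2.0 * gauss_k(x, y, sigma).mean()

def mmd_permutation_test(x, y, alpha=0.05, n_perm=200, seed=0):
    # Random relabellings of the pooled sample simulate the null P = Q;
    # the threshold T is the (1 - alpha) quantile of the permuted statistics
    rng = np.random.default_rng(seed)
    m, pooled = len(x), np.concatenate([x, y])
    null_stats = []
    for _ in range(n_perm):
        p = rng.permutation(len(pooled))
        null_stats.append(mmd2_u(pooled[p[:m]], pooled[p[m:]]))
    return mmd2_u(x, y) > np.quantile(null_stats, 1 - alpha)  # True = reject H0

rng = np.random.default_rng(4)
x = rng.normal(size=200)
rejected = mmd_permutation_test(x, rng.normal(loc=1.0, size=200))
assert rejected  # a unit mean shift is detected easily at this sample size
```

Under H_0 the same test rejects roughly an α fraction of the time, as it should.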
114 Consistent test w/o bootstrap (not examinable) Maximum mean discrepancy (MMD): distance between P and Q:
MMD(P, Q; F) := ‖µ_P − µ_Q‖²_F.
Is MMD significantly > 0? For P = Q, the null distribution of MMD is: n·MMD →_D Σ_{l=1}^∞ λ_l (z_l² − 2), where λ_l is the lth eigenvalue of the kernel k(x_i, x_j).
Use the Gram matrix spectrum for λ̂_l: consistent test without bootstrap.
[Figure: Type II error vs sample size m for P ≠ Q (neuro data), spectral vs permutation thresholds]
115 Kernel dependence measures
116 Reminder: MMD can be used as a dependence measure Dependence measure: [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
( sup_f [E_{P_XY} f − E_{P_X P_Y} f] )² = ( sup_{‖f‖≤1} ⟨f, µ_{P_XY} − µ_{P_X P_Y}⟩ )² = ‖µ_{P_XY} − µ_{P_X P_Y}‖²_{F⊗G} =: HSIC(P_XY, P_X P_Y).
[Figure: dependence witness and sample, Y vs X]
117 Reminder: MMD can be used as a dependence measure Dependence measure: [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
( sup_f [E_{P_XY} f − E_{P_X P_Y} f] )² = ‖µ_{P_XY} − µ_{P_X P_Y}‖²_{F⊗G} =: HSIC(P_XY, P_X P_Y).
Kernel on the product space: κ((x, y), (x', y')) = k(x, x') · l(y, y')
120 Distribution of HSIC at independence (Biased) empirical HSIC, a V-statistic: HSIC_b = (1/n²) trace(KHLH).
The associated U-statistic is degenerate when P = P_x P_y [Serfling, 1980]:
n·HSIC_b →_D Σ_{l=1}^∞ λ_l z_l²,  z_l ∼ N(0, 1) i.i.d.,
where λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r} and h_{ijqr} = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} [k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv}].
First two moments [NIPS07b]:
E(HSIC_b) = (1/n) Tr(C_xx) Tr(C_yy),  var(HSIC_b) = (2(n−4)(n−5)/(n)_4) ‖C_xx‖²_HS ‖C_yy‖²_HS + O(n^{−3}).
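The biased statistic HSIC_b = (1/n²) trace(KHLH) is one line once the Gram matrices are built. A sketch (Gaussian kernels on both variables with bandwidth 1; the toy data are my choices): dependent data give a much larger value than independent data.

```python
import numpy as np

def gauss_K(z, sigma=1.0):
    # Gram matrix of a 1-d sample under a Gaussian kernel
    return np.exp(-((z[:, None] - z[None, :]) ** 2) / (2 * sigma ** 2))

def hsic_b(x, y, sigma=1.0):
    # Biased empirical HSIC: (1/n^2) trace(K H L H)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    return np.trace(gauss_K(x, sigma) @ H @ gauss_K(y, sigma) @ H) / n ** 2

rng = np.random.default_rng(5)
x = rng.normal(size=500)
y_dep = x + 0.3 * rng.normal(size=500)  # strongly dependent on x
y_ind = rng.normal(size=500)            # independent of x
assert hsic_b(x, y_dep) > hsic_b(x, y_ind)
```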
122 Statistical testing with HSIC Given P = P_x P_y, what is the threshold T such that P(HSIC > T) < α for small α?
Null distribution via permutation [Feuerverger, 1993]: compute HSIC for {x_i, y_{π(i)}}_{i=1}^n, where π is a random permutation of the indices {1, ..., n}; this gives HSIC for independent variables. Repeat for many different permutations to get an empirical CDF. The threshold T is the 1 − α quantile of the empirical CDF.
Approximate null distribution via moment matching [Kankainen, 1995]:
n·HSIC_b(Z) ∼ x^{α−1} e^{−x/β} / (β^α Γ(α)), where α = (E(HSIC_b))² / var(HSIC_b), β = n·var(HSIC_b) / E(HSIC_b).
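The permutation option translates directly into code. A sketch (Gaussian kernels, 200 permutations, level α = 0.05; the data and parameters are my own choices): shuffling y destroys any dependence with x, so the permuted statistics sample the null distribution, and their (1 − α) quantile is the threshold T.

```python
import numpy as np

def gauss_K(z, sigma=1.0):
    return np.exp(-((z[:, None] - z[None, :]) ** 2) / (2 * sigma ** 2))

def hsic_b(x, y, sigma=1.0):
    # Biased empirical HSIC: (1/n^2) trace(K H L H)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gauss_K(x, sigma) @ H @ gauss_K(y, sigma) @ H) / n ** 2

def hsic_permutation_test(x, y, alpha=0.05, n_perm=200, seed=0):
    # Permuting y simulates the null P = P_x P_y; reject when the observed
    # HSIC exceeds the (1 - alpha) quantile of the permuted statistics
    rng = np.random.default_rng(seed)
    null_stats = [hsic_b(x, y[rng.permutation(len(y))]) for _ in range(n_perm)]
    return hsic_b(x, y) > np.quantile(null_stats, 1 - alpha)

rng = np.random.default_rng(6)
x = rng.normal(size=200)
y = x + 0.5 * rng.normal(size=200)  # dependent pair
assert hsic_permutation_test(x, y)  # dependence detected
```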
124 Experiment: dependence testing for translation (Biased) empirical HSIC: HSIC_b = (1/n²) trace(KHLH). Translation example: [NIPS07b] Canadian Hansard (agriculture), 5-line extracts, k-spectrum kernel (k = 10), 300 repetitions, sample size 10. k-spectrum kernel: average Type II error 0 (α = 0.05). Bag-of-words kernel: average Type II error 0.18
125 Advanced topics Energy distance and the MMD Two-sample testing for big data Testing three-way interactions Bayesian inference without models
126 Summary MMD: a distance between distributions [ISMB06, NIPS06a, JMLR10, JMLR12a]; handles high dimensionality and non-Euclidean data (strings, graphs); nonparametric hypothesis tests. Measure and test independence [ALT05, NIPS07a, NIPS07b, ALT08, JMLR10, JMLR12a]. Characteristic RKHS: MMD a metric [NIPS07b, COLT08, NIPS08a]. Easy to check: does supp(Λ) cover R^d?
127 Co-authors From UCL: Luca Baldassarre, Steffen Grunewalder, Guy Lever, Sam Patterson, Massimiliano Pontil, Dino Sejdinovic. External: Karsten Borgwardt, MPI; Wicher Bergsma, LSE; Kenji Fukumizu, ISM; Zaid Harchaoui, INRIA; Bernhard Schoelkopf, MPI; Alex Smola, CMU/Google; Le Song, Georgia Tech; Bharath Sriperumbudur, Cambridge
128 Selected references Characteristic kernels and mean embeddings: Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT. Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR. Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR. Two-sample, independence, conditional independence tests: Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS. Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence. Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS. Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR. Energy distance, relation to kernel distances: Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics. Three-way interaction: Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.
129 Selected references (continued) Conditional mean embedding, RKHS-valued regression: Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., and Vapnik, V. (2003). Kernel Dependency Estimation. NIPS. Micchelli, C., and Pontil, M. (2005). On Learning Vector-Valued Functions. Neural Computation. Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics. Song, L., Huang, J., Smola, A., Fukumizu, K. (2009). Hilbert Space Embeddings of Conditional Distributions. ICML. Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML. Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML. Kernel Bayes rule: Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine. Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite kernels. JMLR.
131 References N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994. M. Arcones and E. Giné. On the bootstrap of U and V statistics. The Annals of Statistics, 20(2):655–674, 1992. C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973. L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88:190–206, 2004. C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984. R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002. A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963. W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948. N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions, Volume 1. John Wiley and Sons, 2nd edition, 1994. A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995. R. Lyons. Distance covariance in metric spaces. arXiv preprint, to appear in Ann. Probab., June 2011. C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989. A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
132 T. Read and N. Cressie. Goodness-of-Fit Statistics for Discrete Multivariate Analysis. Springer-Verlag, New York, 1988. R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980. I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001. G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004. G. Székely and M. Rizzo. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005. G. Székely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 3(4):1236–1265, 2009. G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.