Lecture 2: Mappings of Probabilities to RKHS and Applications
1 Lecture 2: Mappings of Probabilities to RKHS and Applications Lille, 2014 Arthur Gretton Gatsby Unit, CSML, UCL
2 Detecting differences in brain signals The problem: Do local field potential (LFP) signals change when measured near a spike burst? [Figure: LFP amplitude vs. time, near a spike burst and without a spike burst]
3 Detecting differences in brain signals The problem: Do local field potential (LFP) signals change when measured near a spike burst?
5 Detecting statistical dependence How do you detect dependence... in a discrete domain? [Read and Cressie, 1988]
9 Detecting statistical dependence How do you detect dependence... in a discrete domain? [Read and Cressie, 1988] X 1 : Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet. X 2 : No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development. Y 1 : Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent. Y 2 : Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants. Are the French text extracts translations of the English ones?
10 Detecting statistical dependence, continuous domains How do you detect dependence... in a continuous domain? [Figure: a sample from P_XY; is it from the dependent P_XY, or from the independent P_XY = P_X P_Y?]
11 Detecting statistical dependence, continuous domains How do you detect dependence... in a continuous domain? [Figure: sample from P_XY, with the discretized empirical P_XY and the discretized empirical P_X P_Y]
14 Detecting statistical dependence, continuous domains How do you detect dependence... in a continuous domain? Problem: fails even in low dimensions! [NIPS07a, ALT08] X and Y in R^4, statistic = power divergence, samples = 1024, cases where dependence detected = 0/5. Too few points per bin. Can we represent and compare distributions in high dimensions?
15 Further motivating questions Compare distributions with high dimension / low sample size / complex structure: microarray data (aggregation problem); neuroscience: naturalistic stimulus, complex response; images and text on the web (kernels on structured data)
18 Outline Kernel metric on the space of probability measures: function revealing differences in distributions; distance between means in space of features (RKHS); characteristic kernels: feature-space mappings of probabilities unique. Dependence detection: covariance and correlation in feature space. Advanced topics: testing for big data; Bayesian inference without models; three-way interactions; energy distance/distance covariance as a special case of RKHS distances
19 Kernel distance between distributions
20 Feature mean difference Simple example: two Gaussians with different means. Answer: t-test. [Figure: probability densities of two Gaussians P_X, Q_X with different means]
22 Feature mean difference Two Gaussians with same means, different variance. Idea: look at the difference in means of features of the RVs. In the Gaussian case: second order features of the form φ(x) = x². [Figure: densities of P_X, Q_X; densities of the feature X²]
23 Feature mean difference Gaussian and Laplace distributions: same mean and same variance. Difference in means using higher order features... RKHS. [Figure: Gaussian and Laplace densities P_X, Q_X]
26 Reminder: feature maps and the RKHS Feature map of x ∈ R², written φ_x:
φ^(p)(x) = [x₁², x₂², √2 x₁x₂],  φ^(g)(x) = [... √λ_i e_i(x) ...] ∈ ℓ₂.
Inner product between feature maps:
⟨φ^(p)(x), φ^(p)(y)⟩_F = ⟨x, y⟩²,  ⟨φ^(g)(x), φ^(g)(y)⟩_F = exp(−λ‖x − y‖²).
In general, ⟨φ_{x₁}, φ_{x₂}⟩_F = k(x₁, x₂) for a positive definite k(x, y). Kernels are inner products of feature maps.
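The finite-dimensional identity above can be checked numerically. A minimal sketch in NumPy (the helper names `phi_poly` and `k_poly` are my own, not from the slides): it builds the explicit degree-2 feature map on R² and confirms its inner product equals the polynomial kernel ⟨x, y⟩².

```python
import numpy as np

def phi_poly(x):
    # Explicit degree-2 polynomial feature map on R^2:
    # phi(x) = [x1^2, x2^2, sqrt(2)*x1*x2]
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_poly(x, y):
    # Degree-2 polynomial kernel: k(x, y) = <x, y>^2
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

# The kernel is the inner product of the explicit feature maps
assert np.isclose(np.dot(phi_poly(x), phi_poly(y)), k_poly(x, y))
```

The Gaussian kernel plays the same game with an infinite feature expansion, which is why one evaluates k directly rather than the features.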
27 Reminder: feature view of RKHS functions Reproducing property:
f(x) = Σ_{i=1}^m α_i k(x_i, x) = Σ_{i=1}^m α_i ⟨φ(x_i), φ(x)⟩_F = ⟨f(·), φ(x)⟩_F,
where f(·) = Σ_{i=1}^m α_i φ(x_i). [Figure: f(x) plotted against x]
29 The mean in feature space For finite dimensional feature spaces, we can define expectations in terms of inner products.
φ(x) = k(·, x) = [x, x²],  f(·) = [a, b].
Then f(x) = ⟨[a, b], [x, x²]⟩ = ax + bx² = ⟨f(·), φ(x)⟩_F.
Consider a random variable x ∼ P:
E_P f(x) = a E_P(x) + b E_P(x²) = ⟨[a, b], [E_P(x), E_P(x²)]⟩ =: ⟨[a, b], µ_P⟩.
Does this reasoning translate to infinite dimensions?
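A quick numerical sketch of this finite-dimensional case (the sample size and the coefficients a, b are arbitrary choices of mine): with features φ(x) = [x, x²], the empirical mean embedding is just the vector of the first two empirical moments, and ⟨[a, b], µ_P⟩ recovers the average of f(x) = ax + bx².

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)  # sample from P

# Feature map phi(x) = [x, x^2]: the empirical mean embedding is the
# vector of the first two empirical moments
mu_P = np.array([x.mean(), (x ** 2).mean()])

# Any f(x) = a*x + b*x^2 satisfies E_P f(x) = <[a, b], mu_P>
a, b = 0.7, -0.3
lhs = (a * x + b * x ** 2).mean()  # direct Monte Carlo average of f
rhs = np.dot([a, b], mu_P)         # inner product with the mean embedding
assert np.isclose(lhs, rhs)        # equal by linearity, up to float error
```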
31 Does the feature space mean exist? Does there exist an element µ_P ∈ F such that E_P f(x) = ⟨f(·), µ_P(·)⟩_F for all f ∈ F?
Recall the concept of a bounded operator: a linear operator A : F → R is bounded when, for all f ∈ F, |Af| ≤ λ_A ‖f‖_F.
Riesz representation theorem: in a Hilbert space F, every bounded linear operator A can be written Af = ⟨f(·), g_A(·)⟩_F for some g_A ∈ F.
32 Does the feature space mean exist? Existence of the mean embedding: if E_P √k(x, x) < ∞ then µ_P ∈ F.
Proof: the linear operator T_P f := E_P f(x) for all f ∈ F is bounded under the assumption, since
|T_P f| ≤ E_P |f(x)| = E_P |⟨f(·), φ(x)⟩_F| ≤ (E_P √k(x, x)) ‖f‖_F.
Hence by Riesz (with λ_{T_P} = E_P √k(x, x)), there exists µ_P ∈ F such that T_P f = ⟨f(·), µ_P(·)⟩_F.
33 µ_P is the feature map of a probability Embedding of P to feature space: the mean embedding µ_P ∈ F satisfies ⟨µ_P(·), f(·)⟩_F = E_P f(x).
What does the feature map of a probability look like?
µ_P(x) = ⟨µ_P(·), φ(x)⟩_F = ⟨µ_P(·), k(·, x)⟩_F = E_P k(x̄, x), x̄ ∼ P. Expectation of the kernel!
Empirical estimate: µ̂_P(x) = (1/m) Σ_{i=1}^m k(x_i, x), x_i ∼ P
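The empirical estimate µ̂_P(x) = (1/m) Σ_i k(x_i, x) is straightforward to compute. A sketch with a Gaussian kernel (the bandwidth 0.5 and the sample are my arbitrary choices): the embedding is a smooth function that is large where P puts mass.

```python
import numpy as np

def gauss_k(a, b, sigma=0.5):
    # Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 sigma^2))
    return np.exp(-((a - b) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
xs = rng.normal(size=500)  # x_i ~ P = N(0, 1)

def mu_hat(t):
    # Empirical mean embedding at t: (1/m) sum_i k(x_i, t)
    return gauss_k(xs, t).mean()

# The embedding peaks where P concentrates (near 0), not in the tails
assert mu_hat(0.0) > mu_hat(3.0)
```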
35 Function Showing Difference in Distributions Are P and Q different? [Figure: samples from P and Q]
37 Function Showing Difference in Distributions Maximum mean discrepancy: smooth function for P vs Q.
MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)].
[Figure: a smooth function f(x)]
39 Function Showing Difference in Distributions What if the function is not smooth?
MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)].
[Figure: a bounded continuous function f(x)]
41 Function Showing Difference in Distributions Maximum mean discrepancy: smooth function for P vs Q.
MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)].
Gauss P vs Laplace Q. [Figure: witness f for Gauss and Laplace densities]
44 Function Showing Difference in Distributions Maximum mean discrepancy: smooth function for P vs Q.
MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)].
Classical results: MMD(P, Q; F) = 0 iff P = Q, when F = bounded continuous [Dudley, 2002]; F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]; F = bounded Lipschitz (Earth mover's distance) [Dudley, 2002].
MMD(P, Q; F) = 0 iff P = Q when F = the unit ball in a characteristic RKHS F [ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10].
How do smooth functions relate to feature maps?
45 Function view vs feature mean view The (kernel) MMD: [ISMB06, NIPS06a]
MMD²(P, Q; F) = ( sup_{f ∈ F} [E_P f(x) − E_Q f(y)] )².
[Figure: witness f for Gauss and Laplace densities]
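The witness function plotted on the slide can be estimated directly from samples, since up to normalisation it is proportional to µ_P − µ_Q. A sketch under my own choices (Gaussian kernel with bandwidth 0.5; Gaussian P and Laplace Q matched to unit variance): the witness is negative near zero, where the peakier Laplace density dominates, and positive in the shoulders where the Gaussian has more mass.

```python
import numpy as np

def gauss_k(a, b, sigma=0.5):
    return np.exp(-((a - b) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
xs = rng.normal(size=5000)                         # P: N(0, 1)
ys = rng.laplace(scale=1 / np.sqrt(2), size=5000)  # Q: Laplace, unit variance

def witness(t):
    # Unnormalised witness: f*(t) proportional to mu_P(t) - mu_Q(t)
    return gauss_k(xs, t).mean() - gauss_k(ys, t).mean()

assert witness(0.0) < 0   # Laplace is peakier at the origin
assert witness(1.4) > 0   # Gaussian carries more mass in the shoulders
```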
48 Function view vs feature mean view The (kernel) MMD: [ISMB06, NIPS06a]
MMD²(P, Q; F) = ( sup_{f ∈ F} [E_P f(x) − E_Q f(y)] )²    (use E_P f(x) =: ⟨µ_P, f⟩_F)
= ( sup_{f ∈ F} ⟨f, µ_P − µ_Q⟩_F )²    (use ‖θ‖_F = sup_{f ∈ F} ⟨f, θ⟩_F)
= ‖µ_P − µ_Q‖²_F.
Function view and feature view equivalent
55 Empirical estimate of MMD An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,
MMD² = 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − (1/m²) Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)].
Proof:
‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
= ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
= ⟨E_P φ(x), E_P φ(x')⟩ + ...
= E_P ⟨φ(x), φ(x')⟩ + ...
= E_P k(x, x') + E_Q k(y, y') − 2 E_{P,Q} k(x, y).
Then Ê k(x, x') = 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} k(x_i, x_j)
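The unbiased estimator above is a few lines of NumPy. A sketch with a Gaussian kernel (the bandwidth 1 and the sample sizes are my choices): the statistic hovers near zero when both samples come from the same Gaussian, and is clearly positive under a mean shift.

```python
import numpy as np

def gauss_k(a, b, sigma=1.0):
    # Gaussian kernel matrix between two 1-d samples
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    # Unbiased estimate of MMD^2 = ||mu_P - mu_Q||^2_F:
    # diagonal terms are dropped from the within-sample sums
    m = len(x)
    Kxx = gauss_k(x, x, sigma); np.fill_diagonal(Kxx, 0.0)
    Kyy = gauss_k(y, y, sigma); np.fill_diagonal(Kyy, 0.0)
    Kxy = gauss_k(x, y, sigma)
    return (Kxx.sum() + Kyy.sum()) / (m * (m - 1)) - 2.0 * Kxy.mean()

rng = np.random.default_rng(3)
same = mmd2_unbiased(rng.normal(size=1000), rng.normal(size=1000))
diff = mmd2_unbiased(rng.normal(size=1000), rng.normal(loc=1.0, size=1000))
assert abs(same) < 0.01  # near zero when P = Q
assert diff > 0.05       # clearly positive when P and Q differ
```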
56 MMD for independence Dependence measure: [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
( sup_f [E_{P_XY} f − E_{P_X P_Y} f] )² = ( sup_{‖f‖≤1} ⟨f, µ_{P_XY} − µ_{P_X P_Y}⟩ )² = ‖µ_{P_XY} − µ_{P_X P_Y}‖²_{F⊗G} =: HSIC(P_XY, P_X P_Y).
[Figure: dependence witness and sample, Y vs X]
57 MMD for independence Dependence measure: [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
( sup_f [E_{P_XY} f − E_{P_X P_Y} f] )² = ‖µ_{P_XY} − µ_{P_X P_Y}‖²_{F⊗G} =: HSIC(P_XY, P_X P_Y).
Kernel on the product space: κ((x, y), (x', y')) = k(x, x') · l(y, y')
58 Characteristic kernels (Version 1: Via Universality)
62 Characteristic Kernels (via universality) Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08].
Classical result: P = Q if and only if E_P(f(x)) = E_Q(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X [Dudley, 2002].
Universal RKHS: k(x, x') continuous, X compact, and F dense in C(X) with respect to L_∞ [Steinwart, 2001].
If F universal, then MMD{P, Q; F} = 0 iff P = Q
64 Characteristic Kernels (via universality) Proof: First, it is clear that P = Q implies MMD{P, Q; F} is zero.
Converse: by the universality of F, for any given ε > 0 and f ∈ C(X) there exists g ∈ F with ‖f − g‖_∞ ≤ ε.
We next make the expansion
|E_P f(x) − E_Q f(y)| ≤ |E_P f(x) − E_P g(x)| + |E_P g(x) − E_Q g(y)| + |E_Q g(y) − E_Q f(y)|.
The first and third terms satisfy |E_P f(x) − E_P g(x)| ≤ E_P |f(x) − g(x)| ≤ ε.
65 Characteristic Kernels (via universality) Proof (continued): Next, write E_P g(x) − E_Q g(y) = ⟨g(·), µ_P − µ_Q⟩_F = 0, since MMD{P, Q; F} = 0 implies µ_P = µ_Q. Hence |E_P f(x) − E_Q f(y)| ≤ 2ε for all f ∈ C(X) and ε > 0, which implies P = Q.
66 Characteristic kernels (Version 2: Via Fourier)
67 Characteristic Kernels (via Fourier) Reminder: Fourier series. For a function on [−π, π] with periodic boundary:
f(x) = Σ_{l=−∞}^∞ f̂_l exp(ılx) = Σ_{l=−∞}^∞ f̂_l (cos(lx) + ı sin(lx)).
[Figure: top-hat function f(x) and its Fourier series coefficients f̂_l]
68 Characteristic Kernels (via Fourier) Reminder: Fourier series of a kernel k(x, y) = k(x − y) = k(z):
k(z) = Σ_{l=−∞}^∞ k̂_l exp(ılz).
E.g. Gaussian: k(x) = (1/√(2πσ²)) exp(−x²/(2σ²)),  k̂_l = (1/√(2π)) exp(−σ²l²/2).
[Figure: Gaussian kernel k(x) and its Fourier series coefficients]
70 Characteristic Kernels (via Fourier) Mean embedding via Fourier series: the Fourier series of P is its characteristic function φ_P; the Fourier series of the mean embedding is the product of Fourier series (convolution theorem):
µ_P(x) = E_P k(x − x̄) = ∫_{−π}^{π} k(x − t) dP(t),  so µ̂_{P,l} = k̂_l φ_{P,l}.
MMD can be written in terms of Fourier series:
MMD(P, Q; F) := ‖ Σ_{l=−∞}^∞ (φ_{P,l} − φ_{Q,l}) k̂_l exp(ılx) ‖_F.
Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08, JMLR10]
71 Example Example: P differs from Q at one frequency. [Figure: densities P(x) and Q(x)]
72 Characteristic Kernels (2) Example: P differs from Q at (roughly) one frequency. [Figure: P(x) and Q(x) with their characteristic functions φ_{P,l}, φ_{Q,l}]
73 Characteristic Kernels (2) Example: P differs from Q at (roughly) one frequency. [Figure: P(x), Q(x), their characteristic functions, and the characteristic function difference |φ_{P,l} − φ_{Q,l}|]
75 Example Is the Gaussian kernel characteristic? YES. [Figure: Gaussian kernel and its Fourier series coefficients]
MMD(P, Q; F) := ‖ Σ_{l=−∞}^∞ (φ_{P,l} − φ_{Q,l}) k̂_l exp(ılx) ‖_F
77 Example Is the triangle kernel characteristic? NO. [Figure: triangle kernel and its Fourier series coefficients]
MMD(P, Q; F) := ‖ Σ_{l=−∞}^∞ (φ_{P,l} − φ_{Q,l}) k̂_l exp(ılx) ‖_F
80 Characteristic Kernels (via Fourier) Can we prove characteristic on R^d?
Characteristic function of P via the Fourier transform: φ_P(ω) = ∫_{R^d} e^{ı x^⊤ω} dP(x).
Translation invariant kernels: k(x, y) = k(x − y) = k(z).
Bochner's theorem: k(z) = ∫_{R^d} e^{−ı z^⊤ω} dΛ(ω), with Λ a finite non-negative Borel measure
82 Characteristic Kernels (via Fourier) Fourier representation of MMD:
MMD(P, Q; F) := ‖ [ (φ_P(ω) − φ_Q(ω)) Λ(ω) ]ˇ ‖_F,
where φ_P is the characteristic function of P, f̂ denotes the Fourier transform, ˇf the inverse Fourier transform, and µ_P := ∫ k(·, x) dP(x)
83 Example Example: P differs from Q at (roughly) one frequency. [Figure: densities P(X) and Q(X)]
84 Example Example: P differs from Q at (roughly) one frequency. [Figure: P(X) and Q(X) with their characteristic functions φ_P, φ_Q]
85 Example Example: P differs from Q at (roughly) one frequency. [Figure: P(X), Q(X), their characteristic functions, and the characteristic function difference |φ_P − φ_Q|]
86 Example Example: P differs from Q at (roughly) one frequency. Gaussian kernel. [Figure: characteristic function difference |φ_P − φ_Q| and the Gaussian kernel spectrum vs frequency]
87 Example Example: P differs from Q at (roughly) one frequency. Characteristic. [Figure: product of the Gaussian kernel spectrum and |φ_P − φ_Q| vs frequency]
88 Example Example: P differs from Q at (roughly) one frequency. Sinc kernel. [Figure: |φ_P − φ_Q| and the sinc kernel spectrum vs frequency]
89 Example Example: P differs from Q at (roughly) one frequency. NOT characteristic. [Figure: product of the sinc kernel spectrum and |φ_P − φ_Q| vs frequency]
90 Example Example: P differs from Q at (roughly) one frequency. Triangle (B-spline) kernel. [Figure: |φ_P − φ_Q| and the triangle kernel spectrum vs frequency]
91 Example Example: P differs from Q at (roughly) one frequency. ??? [Figure: product of the triangle kernel spectrum and |φ_P − φ_Q| vs frequency]
92 Example Example: P differs from Q at (roughly) one frequency. Characteristic. [Figure: product of the triangle kernel spectrum and |φ_P − φ_Q| vs frequency]
93 Choosing the kernel Gaussian kernel example. [Figure: MMD(P, Q; F) vs frequency of the perturbation to P, with the perturbed densities Q(X) shown at several perturbing frequencies]
94 Why does MMD decay with increasing perturbation freq.? Fourier series argument (notationally easier, for periodic domains only):
MMD(P, Q; F) := ‖ Σ_{l=−∞}^∞ (φ_{P,l} − φ_{Q,l}) k̂_l exp(ılx) ‖_F.
The squared norm of a function f in F is:
‖f‖²_F = ⟨f, f⟩_F = Σ_{l=−∞}^∞ |f̂_l|² / k̂_l.
The squared MMD is:
MMD²(P, Q; F) = Σ_{l=−∞}^∞ |φ_{P,l} − φ_{Q,l}|² k̂_l² / k̂_l = Σ_{l=−∞}^∞ |φ_{P,l} − φ_{Q,l}|² k̂_l
95 Choosing the kernel B-spline kernel example. [Figure: MMD vs frequency of the perturbation to P, with the perturbed densities Q(X) shown at several perturbing frequencies]
98 Summary: Characteristic Kernels Characteristic kernel: (MMD = 0 iff P = Q) [NIPS07b, COLT08].
Main theorem: k characteristic for prob. measures on R^d if and only if supp(Λ) = R^d [COLT08, JMLR10].
Corollary: continuous, compactly supported k characteristic.
Similar reasoning wherever extensions of Bochner's theorem exist: [NIPS08a] locally compact Abelian groups (periodic domains, as we saw); compact, non-Abelian groups (orthogonal matrices); the semigroup R_+^n (histograms)
99 Statistical hypothesis testing
100 Reminder: detecting differences in brain signals The problem: Do local field potential (LFP) signals change when measured near a spike burst? [Figure: LFP amplitude vs. time, near a spike burst and without a spike burst]
104 Statistical test using MMD (1) Two hypotheses: H_0: null hypothesis (P = Q); H_1: alternative hypothesis (P ≠ Q). Observe samples x := {x_1, ..., x_n} from P and y from Q. If the empirical MMD(x, y; F) is "far from zero": reject H_0; "close to zero": accept H_0
107 Statistical test using MMD (2) "Far from zero" vs "close to zero": what threshold? One answer: the asymptotic distribution of MMD(x, y; F).
An unbiased empirical estimate (quadratic cost):
MMD(x, y; F) = (1/(n(n−1))) Σ_{i≠j} [ k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j) ], with summand h((x_i, y_i), (x_j, y_j)).
When P ≠ Q, this is asymptotically normal [Hoeffding, 1948, Serfling, 1980]. Expression for the variance, with z_i := (x_i, y_i):
σ_u² = (2²/n) ( E_z[(E_{z'} h(z, z'))²] − [E_{z,z'} h(z, z')]² ) + O(n^{−2})
108 Statistical test using MMD (3) Example: Laplace distributions with different variance. [Figure: two Laplace distributions with different variances; MMD distribution and Gaussian fit under H_1]
110 Statistical test using MMD (4) When P = Q, the U-statistic is degenerate: E_z[h(z, z')] = 0 [Anderson et al., 1994]. The distribution is
n·MMD(x, y; F) → Σ_{l=1}^∞ λ_l [z_l² − 2],
where z_l ∼ N(0, 2) i.i.d. and ∫_X k̃(x, x') ψ_i(x) dP(x) = λ_i ψ_i(x'), with k̃ the centred kernel.
[Figure: MMD density under H_0, empirical PDF]
113 Statistical test using MMD (5) Given P = Q, want a threshold T such that P(MMD > T) ≤ 0.05. Options: permutation for the empirical CDF [Arcones and Giné, 1992]; Pearson curves by matching the first four moments [Johnson et al., 1994]; large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]; a consistent test using the kernel eigenspectrum [NIPS09b]. [Figure: CDF of the MMD and Pearson fit]
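The permutation option is the simplest to implement. A sketch (Gaussian kernel, 200 permutations, and the sample sizes are my own choices, not from the slides): pooling the two samples and repeatedly splitting them at random simulates the null P = Q, and the test rejects when the observed statistic exceeds the (1 − α) quantile of the permuted statistics.

```python
import numpy as np

def gauss_k(a, b, sigma=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))

def mmd2_u(x, y, sigma=1.0):
    # Unbiased quadratic-time MMD^2 estimate
    m = len(x)
    Kxx = gauss_k(x, x, sigma); np.fill_diagonal(Kxx, 0.0)
    Kyy = gauss_k(y, y, sigma); np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() + Kyy.sum()) / (m * (m - 1)) - 2.0 * gauss_k(x, y, sigma).mean()

def mmd_permutation_test(x, y, alpha=0.05, n_perm=200, seed=0):
    # Random relabellings of the pooled sample simulate the null P = Q;
    # the threshold T is the (1 - alpha) quantile of the permuted statistics
    rng = np.random.default_rng(seed)
    m, pooled = len(x), np.concatenate([x, y])
    null_stats = []
    for _ in range(n_perm):
        p = rng.permutation(len(pooled))
        null_stats.append(mmd2_u(pooled[p[:m]], pooled[p[m:]]))
    return mmd2_u(x, y) > np.quantile(null_stats, 1 - alpha)  # True = reject H0

rng = np.random.default_rng(4)
x = rng.normal(size=200)
rejected = mmd_permutation_test(x, rng.normal(loc=1.0, size=200))
assert rejected  # a unit mean shift is detected easily at this sample size
```

Under H_0 the same test rejects roughly an α fraction of the time, as it should.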
114 Consistent test w/o bootstrap (not examinable) Maximum mean discrepancy (MMD): distance between P and Q:
MMD(P, Q; F) := ‖µ_P − µ_Q‖²_F.
Is MMD significantly > 0? For P = Q, the null distribution of MMD is: n·MMD →_D Σ_{l=1}^∞ λ_l (z_l² − 2), where λ_l is the lth eigenvalue of the kernel k(x_i, x_j).
Use the Gram matrix spectrum for λ̂_l: consistent test without bootstrap.
[Figure: Type II error vs sample size m for P ≠ Q (neuro data), spectral vs permutation thresholds]
115 Kernel dependence measures
116 Reminder: MMD can be used as a dependence measure Dependence measure: [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
( sup_f [E_{P_XY} f − E_{P_X P_Y} f] )² = ( sup_{‖f‖≤1} ⟨f, µ_{P_XY} − µ_{P_X P_Y}⟩ )² = ‖µ_{P_XY} − µ_{P_X P_Y}‖²_{F⊗G} =: HSIC(P_XY, P_X P_Y).
[Figure: dependence witness and sample, Y vs X]
117 Reminder: MMD can be used as a dependence measure Dependence measure: [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
( sup_f [E_{P_XY} f − E_{P_X P_Y} f] )² = ‖µ_{P_XY} − µ_{P_X P_Y}‖²_{F⊗G} =: HSIC(P_XY, P_X P_Y).
Kernel on the product space: κ((x, y), (x', y')) = k(x, x') · l(y, y')
120 Distribution of HSIC at independence (Biased) empirical HSIC, a V-statistic: HSIC_b = (1/n²) trace(KHLH).
The associated U-statistic is degenerate when P = P_x P_y [Serfling, 1980]:
n·HSIC_b →_D Σ_{l=1}^∞ λ_l z_l²,  z_l ∼ N(0, 1) i.i.d.,
where λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r} and h_{ijqr} = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} [k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv}].
First two moments [NIPS07b]:
E(HSIC_b) = (1/n) Tr(C_xx) Tr(C_yy),  var(HSIC_b) = (2(n−4)(n−5)/(n)_4) ‖C_xx‖²_HS ‖C_yy‖²_HS + O(n^{−3}).
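The biased statistic HSIC_b = (1/n²) trace(KHLH) is one line once the Gram matrices are built. A sketch (Gaussian kernels on both variables with bandwidth 1; the toy data are my choices): dependent data give a much larger value than independent data.

```python
import numpy as np

def gauss_K(z, sigma=1.0):
    # Gram matrix of a 1-d sample under a Gaussian kernel
    return np.exp(-((z[:, None] - z[None, :]) ** 2) / (2 * sigma ** 2))

def hsic_b(x, y, sigma=1.0):
    # Biased empirical HSIC: (1/n^2) trace(K H L H)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    return np.trace(gauss_K(x, sigma) @ H @ gauss_K(y, sigma) @ H) / n ** 2

rng = np.random.default_rng(5)
x = rng.normal(size=500)
y_dep = x + 0.3 * rng.normal(size=500)  # strongly dependent on x
y_ind = rng.normal(size=500)            # independent of x
assert hsic_b(x, y_dep) > hsic_b(x, y_ind)
```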
122 Statistical testing with HSIC Given P = P_x P_y, what is the threshold T such that P(HSIC > T) < α for small α?
Null distribution via permutation [Feuerverger, 1993]: compute HSIC for {x_i, y_{π(i)}}_{i=1}^n, where π is a random permutation of the indices {1, ..., n}; this gives HSIC for independent variables. Repeat for many different permutations to get an empirical CDF. The threshold T is the 1 − α quantile of the empirical CDF.
Approximate null distribution via moment matching [Kankainen, 1995]:
n·HSIC_b(Z) ∼ x^{α−1} e^{−x/β} / (β^α Γ(α)), where α = (E(HSIC_b))² / var(HSIC_b), β = n·var(HSIC_b) / E(HSIC_b).
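The permutation option translates directly into code. A sketch (Gaussian kernels, 200 permutations, level α = 0.05; the data and parameters are my own choices): shuffling y destroys any dependence with x, so the permuted statistics sample the null distribution, and their (1 − α) quantile is the threshold T.

```python
import numpy as np

def gauss_K(z, sigma=1.0):
    return np.exp(-((z[:, None] - z[None, :]) ** 2) / (2 * sigma ** 2))

def hsic_b(x, y, sigma=1.0):
    # Biased empirical HSIC: (1/n^2) trace(K H L H)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gauss_K(x, sigma) @ H @ gauss_K(y, sigma) @ H) / n ** 2

def hsic_permutation_test(x, y, alpha=0.05, n_perm=200, seed=0):
    # Permuting y simulates the null P = P_x P_y; reject when the observed
    # HSIC exceeds the (1 - alpha) quantile of the permuted statistics
    rng = np.random.default_rng(seed)
    null_stats = [hsic_b(x, y[rng.permutation(len(y))]) for _ in range(n_perm)]
    return hsic_b(x, y) > np.quantile(null_stats, 1 - alpha)

rng = np.random.default_rng(6)
x = rng.normal(size=200)
y = x + 0.5 * rng.normal(size=200)  # dependent pair
assert hsic_permutation_test(x, y)  # dependence detected
```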
124 Experiment: dependence testing for translation (Biased) empirical HSIC: HSIC_b = (1/n²) trace(KHLH). Translation example: [NIPS07b] Canadian Hansard (agriculture), 5-line extracts, k-spectrum kernel (k = 10), 300 repetitions, sample size 10. k-spectrum kernel: average Type II error 0 (α = 0.05). Bag-of-words kernel: average Type II error 0.18
125 Advanced topics Energy distance and the MMD Two-sample testing for big data Testing three-way interactions Bayesian inference without models
126 Summary MMD: a distance between distributions [ISMB06, NIPS06a, JMLR10, JMLR12a]; handles high dimensionality and non-Euclidean data (strings, graphs); nonparametric hypothesis tests. Measure and test independence [ALT05, NIPS07a, NIPS07b, ALT08, JMLR10, JMLR12a]. Characteristic RKHS: MMD a metric [NIPS07b, COLT08, NIPS08a]. Easy to check: does supp(Λ) cover R^d?
127 Co-authors From UCL: Luca Baldassarre, Steffen Grunewalder, Guy Lever, Sam Patterson, Massimiliano Pontil, Dino Sejdinovic. External: Karsten Borgwardt, MPI; Wicher Bergsma, LSE; Kenji Fukumizu, ISM; Zaid Harchaoui, INRIA; Bernhard Schoelkopf, MPI; Alex Smola, CMU/Google; Le Song, Georgia Tech; Bharath Sriperumbudur, Cambridge
128 Selected references Characteristic kernels and mean embeddings: Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT. Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR. Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR. Two-sample, independence, conditional independence tests: Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS. Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence. Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS. Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR. Energy distance, relation to kernel distances: Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics. Three-way interaction: Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.
129 Selected references (continued) Conditional mean embedding, RKHS-valued regression: Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., and Vapnik, V. (2003). Kernel Dependency Estimation. NIPS. Micchelli, C., and Pontil, M. (2005). On Learning Vector-Valued Functions. Neural Computation. Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics. Song, L., Huang, J., Smola, A., Fukumizu, K. (2009). Hilbert Space Embeddings of Conditional Distributions. ICML. Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML. Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML. Kernel Bayes rule: Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine. Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite kernels. JMLR.
131 References N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994. M. Arcones and E. Giné. On the bootstrap of U and V statistics. The Annals of Statistics, 20(2):655–674, 1992. C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973. L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88:190–206, 2004. C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984. R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002. A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963. W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948. N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions, Volume 1. John Wiley and Sons, 2nd edition, 1994. A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995. R. Lyons. Distance covariance in metric spaces. arXiv preprint, to appear in Ann. Probab., June 2011. C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989. A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
132 T. Read and N. Cressie. Goodness-of-Fit Statistics for Discrete Multivariate Analysis. Springer-Verlag, New York, 1988. R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980. I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001. G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004. G. Székely and M. Rizzo. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005. G. Székely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 3(4):1236–1265, 2009. G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.