Hilbert Space Embedding of Probability Measures
1 Lecture 2 Hilbert Space Embedding of Probability Measures Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Machine Learning Summer School Tübingen, 2017
2 Recap of Lecture 1
The kernel method provides an elegant approach to obtaining non-linear algorithms from linear algorithms.
Input space, X: the space of observed data on which learning is performed.
Feature map, Φ: defined through a positive definite kernel function, k : X × X → R; x ↦ Φ(x), x ∈ X.
Constructing linear algorithms in the feature space Φ(X) translates into non-linear algorithms in X.
Elegance: no explicit construction of Φ, as ⟨Φ(x), Φ(y)⟩ = k(x, y).
Function space view: RKHS; smoothness and generalization.
Examples: ridge regression; in fact many more (kernel SVM/PCA/FDA/CCA/perceptron/logistic regression, ...).
3 Outline Motivating example: Comparing distributions Hilbert space embedding of measures Mean element Distance on probabilities (MMD) Characteristic kernels Cross-covariance operator and measure of independence Applications Two-sample testing Choice of kernel
4 Co-authors Sivaraman Balakrishnan (Statistics, Carnegie Mellon University) Kenji Fukumizu (Institute for Statistical Mathematics, Tokyo) Arthur Gretton (Gatsby Unit, University College London) Gert Lanckriet (Electrical Engineering, University of California, San Diego) Krikamol Muandet (Mathematics, Mahidol University, Bangkok) Massimiliano Pontil (Computer Science, University College London) Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, Tübingen) Dino Sejdinovic (Statistics, University of Oxford) Heiko Strathmann (Gatsby Unit, University College London) Ilya Tolstikhin (Max Planck Institute for Intelligent Systems, Tübingen)
5 Motivating Example: Coin Toss
Toss 1: T H H H T T H T T H H T H
Toss 2: H T T H T H T T H H H T T
Are the coins/tosses statistically similar?
Toss 1 is a sample from P := Bernoulli(p) and Toss 2 is a sample from Q := Bernoulli(q).
Is p = q or not? I.e., compare E_P[X] = ∫_{{0,1}} x dP(x) and E_Q[X] = ∫_{{0,1}} x dQ(x).
7 Coin Toss Example
In other words, we compare ∫ Φ(x) dP(x) and ∫ Φ(x) dQ(x), where Φ is the identity map, Φ(x) = x.
A positive definite kernel corresponding to Φ is k(x, y) = ⟨Φ(x), Φ(y)⟩ = xy, which is a linear kernel on {0, 1}.
Therefore, comparing two Bernoullis is equivalent to asking whether
∫_{{0,1}} k(y, x) dP(x) = ∫_{{0,1}} k(y, x) dQ(x) for all y ∈ {0, 1},
i.e., comparing the expectations of the kernel.
8 Comparing two Gaussians
P = N(µ₁, σ₁²) and Q = N(µ₂, σ₂²).
Comparing P and Q is equivalent to comparing µ₁ with µ₂ and σ₁² with σ₂², i.e.,
E_P[X] = ∫_R x dP(x) =? ∫_R x dQ(x) = E_Q[X]
and
E_P[X²] = ∫_R x² dP(x) =? ∫_R x² dQ(x) = E_Q[X²].
Concisely: ∫_R Φ(x) dP(x) =? ∫_R Φ(x) dQ(x), where Φ(x) = (x, x²).
Compare the first moment of the feature map.
10 Comparing two Gaussians
Using the map Φ, we can construct a positive definite kernel as
k(x, y) = ⟨Φ(x), Φ(y)⟩_{R²} = xy + x²y²,
which is a polynomial kernel of order 2. Therefore, comparing two Gaussians is equivalent to asking whether
∫_R k(y, x) dP(x) = ∫_R k(y, x) dQ(x) for all y ∈ R,
i.e., comparing the expectations of the kernel.
11 Comparing general P and Q
The moment generating function is defined as M_P(y) = ∫_R e^{xy} dP(x) and (if it exists) captures all the information about a distribution, i.e., M_P = M_Q ⟺ P = Q.
Choosing Φ(x) = (1, x, x²/√2!, ..., x^i/√i!, ...) ∈ ℓ²(N), x ∈ R,
it is easy to verify that k(x, y) = ⟨Φ(x), Φ(y)⟩_{ℓ²(N)} = e^{xy}, and so
∫_R k(x, y) dP(x) = ∫_R k(x, y) dQ(x) for all y ∈ R ⟺ P = Q.
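This feature map can be checked numerically: truncated versions of Φ give partial sums of the exponential series. A quick illustrative sketch (not from the lecture; the truncation length 30 is an arbitrary choice):

```python
import math

def phi(x, n_terms=30):
    # truncated feature map Φ(x) = (x^i / sqrt(i!)) for i = 0, ..., n_terms - 1
    return [x**i / math.sqrt(math.factorial(i)) for i in range(n_terms)]

def k(x, y, n_terms=30):
    # <Φ(x), Φ(y)> = Σ (xy)^i / i!, the partial sums of e^{xy}
    return sum(a * b for a, b in zip(phi(x, n_terms), phi(y, n_terms)))

print(abs(k(1.3, 0.7) - math.exp(1.3 * 0.7)) < 1e-10)  # True
```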
12 Two-Sample Problem
Given random samples {X₁, ..., X_m} i.i.d. ∼ P and {Y₁, ..., Y_n} i.i.d. ∼ Q.
Determine: P = Q or P ≠ Q?
Applications: microarray data (aggregation problem); speaker verification.
Independence Testing: given random samples {(X₁, Y₁), ..., (X_n, Y_n)} i.i.d. ∼ P_XY, does P_XY factorize into P_X P_Y?
Applications: feature selection (microarrays, image and text, ...).
13 Hilbert Space Embedding of Measures
14 Hilbert Space Embedding of Measures
Canonical feature map: Φ(x) = k(·, x) ∈ H, x ∈ X, where H is a reproducing kernel Hilbert space (RKHS).
Generalization to probabilities:
x ↦ δ_x (point mass at x) ↦ ∫ k(·, y) dδ_x(y) = E_{δ_x}[k(·, Y)] = k(·, x).
Based on the above, the map is extended to probability measures as
P ↦ µ_P := ∫ Φ(x) dP(x) = ∫ k(·, x) dP(x) = E_P[k(·, X)]. (Smola et al., ALT 2007)
15 Properties
µ_P is the mean of the feature map and is called the kernel mean or mean element of P.
When is µ_P well defined? ∫ √k(x, x) dP(x) < ∞ ⟹ µ_P ∈ H.
Proof: by Jensen's inequality,
‖µ_P‖_H = ‖∫ k(·, x) dP(x)‖_H ≤ ∫ ‖k(·, x)‖_H dP(x) = ∫ √k(x, x) dP(x).
We know that for any f ∈ H, f(x) = ⟨f, k(·, x)⟩_H. So, for any f ∈ H,
∫ f(x) dP(x) = ∫ ⟨f, k(·, x)⟩_H dP(x) = ⟨f, ∫ k(·, x) dP(x)⟩_H = ⟨f, µ_P⟩_H.
18 Interpretation
Suppose k is translation invariant on R^d, i.e., k(x, y) = ψ(x − y), x, y ∈ R^d. Then
µ_P = ∫_{R^d} ψ(· − x) dP(x) = ψ ∗ P,
where ∗ denotes convolution. Convolution is a smoothing operation: µ_P is a smoothed version of P.
Example: suppose P = δ_y, a point mass at y. Then µ_P = ψ ∗ P = ψ(· − y).
Example: suppose ψ ∼ N(0, σ²) and P = N(µ, τ²). Then µ_P = ψ ∗ P ∼ N(µ, σ² + τ²): µ_P is a wider Gaussian than P.
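The Gaussian example can be verified by Monte Carlo (an illustrative sketch, not from the lecture; the parameter values µ = 1, τ = 0.5, σ = 1 are arbitrary). With the unnormalized bump ψ(x) = exp(−x²/(2σ²)) and P = N(µ, τ²), the convolution ψ ∗ P works out to σ/√(σ² + τ²) · exp(−(t − µ)²/(2(σ² + τ²))), a wider Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau, sigma = 1.0, 0.5, 1.0           # P = N(mu, tau^2); psi is a Gaussian bump of width sigma
x = rng.normal(mu, tau, size=200_000)    # sample from P

def kernel_mean(t):
    # empirical kernel mean: mu_P(t) ≈ (1/n) Σ_i psi(t - x_i)
    return np.exp(-(t - x) ** 2 / (2 * sigma**2)).mean()

def closed_form(t):
    # psi * P is a scaled Gaussian of the wider width sqrt(sigma^2 + tau^2)
    s2 = sigma**2 + tau**2
    return sigma / np.sqrt(s2) * np.exp(-(t - mu) ** 2 / (2 * s2))

print(abs(kernel_mean(0.3) - closed_form(0.3)) < 0.01)  # True, up to Monte Carlo error
```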
21 Comparing Kernel Means
Define a distance (maximum mean discrepancy) on probabilities:
MMD_H(P, Q) = ‖µ_P − µ_Q‖_H. (Gretton et al., NIPS 2006; Smola et al., ALT 2007)
MMD²_H(P, Q) = ⟨µ_P, µ_P⟩_H + ⟨µ_Q, µ_Q⟩_H − 2⟨µ_P, µ_Q⟩_H
= ∫ µ_P(x) dP(x) + ∫ µ_Q(x) dQ(x) − 2 ∫ µ_P(x) dQ(x)
= ∫∫ k(x, y) dP(x) dP(y) + ∫∫ k(x, y) dQ(x) dQ(y) − 2 ∫∫ k(x, y) dP(x) dQ(y)
= E_P k(X, X′) [avg. similarity between points from P]
+ E_Q k(Y, Y′) [avg. similarity between points from Q]
− 2 E_{P,Q} k(X, Y) [avg. similarity between points from P and Q].
27 Comparing Kernel Means
In the motivating examples, we compare P and Q by comparing
µ_P(y) = ∫ k(y, x) dP(x) and µ_Q(y) = ∫ k(y, x) dQ(x), y ∈ X.
For any f ∈ H, ‖f‖_∞ = sup_y |f(y)| = sup_y |⟨f, k(·, y)⟩_H| ≤ sup_y √k(y, y) ‖f‖_H.
⟹ ‖µ_P − µ_Q‖_∞ ≤ sup_y √k(y, y) ‖µ_P − µ_Q‖_H.
Does ‖µ_P − µ_Q‖_H = 0 ⟹ P = Q? (More on this later.)
31 Integral Probability Metric
The integral probability metric between P and Q is defined as
IPM(P, Q, F) := sup_{f ∈ F} |∫ f(x) dP(x) − ∫ f(x) dQ(x)| = sup_{f ∈ F} |E_P f(X) − E_Q f(Y)|. (Müller, 1997)
F controls the degree of distinguishability between P and Q.
Related to the Bayes risk of a certain classification problem. (S et al., NIPS 2009; EJS 2012)
Example: suppose F = {a x, x ∈ R : a ∈ [−1, 1]}. Then
IPM(P, Q, F) = sup_{a ∈ [−1,1]} |a| |∫_R x dP(x) − ∫_R x dQ(x)| = |E_P[X] − E_Q[X]|.
32 Integral Probability Metric
Example: suppose F = {a x + b x², x ∈ R : a² + b² = 1}. Then
IPM(P, Q, F) = sup_{a² + b² = 1} |a ∫_R x d(P − Q) + b ∫_R x² d(P − Q)|
= √[ (∫_R x d(P − Q))² + (∫_R x² d(P − Q))² ].
How? Exercise!
The richer F is, the finer the resolvability of P and Q.
We will explore the relation of MMD_H(P, Q) to IPM(P, Q, F).
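The exercise reduces to Cauchy-Schwarz: the supremum of a·u + b·v over the unit circle a² + b² = 1 is √(u² + v²). A numerical check (u and v are hypothetical stand-ins for the two integrals):

```python
import numpy as np

u, v = 0.8, -0.3                                # stand-ins for ∫x d(P−Q) and ∫x² d(P−Q)
theta = np.linspace(0.0, 2.0 * np.pi, 100_001)  # sweep (a, b) = (cos θ, sin θ) over the unit circle
sup_grid = np.max(np.cos(theta) * u + np.sin(theta) * v)
print(abs(sup_grid - np.hypot(u, v)) < 1e-6)    # True: the supremum is sqrt(u^2 + v^2)
```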
33 Integral Probability Metric
Classical results for IPM(P, Q, F) := sup_{f ∈ F} |∫ f(x) dP(x) − ∫ f(x) dQ(x)|:
F = unit Lipschitz ball (Wasserstein distance) (Dudley, 2002)
F = unit bounded-Lipschitz ball (Dudley metric) (Dudley, 2002)
F = {1_{(−∞, t]} : t ∈ R^d} (Kolmogorov metric) (Müller, 1997)
F = unit ball in bounded measurable functions (total variation distance) (Dudley, 2002)
For all these F, IPM(P, Q, F) = 0 ⟺ P = Q.
(Gretton et al., NIPS 2006, JMLR 2012; S et al., COLT 2008): let F = unit ball in an RKHS H with bounded kernel k. Then MMD_H(P, Q) = IPM(P, Q, F).
Proof: ∫ f(x) d(P − Q)(x) = ⟨f, µ_P − µ_Q⟩_H ≤ ‖f‖_H ‖µ_P − µ_Q‖_H, with equality for f = (µ_P − µ_Q)/‖µ_P − µ_Q‖_H.
36 Two-Sample Problem
Given random samples {X₁, ..., X_m} i.i.d. ∼ P and {Y₁, ..., Y_n} i.i.d. ∼ Q. Determine: P = Q or P ≠ Q?
Approach: define ρ to be a distance on probabilities.
H₀: P = Q  ⟺  H₀: ρ(P, Q) = 0
H₁: P ≠ Q  ⟺  H₁: ρ(P, Q) > 0
If the empirical ρ is far from zero: reject H₀; close to zero: accept H₀.
39 Why MMD_H?
Related to the estimation of IPM(P, Q, F). Recall
MMD²_H(P, Q) = ‖∫ k(·, x) dP(x) − ∫ k(·, x) dQ(x)‖²_H.
A trivial approximation: P_m := (1/m) Σᵢ δ_{Xᵢ} and Q_n := (1/n) Σᵢ δ_{Yᵢ}, where δ_x represents the Dirac measure at x.
MMD²_H(P_m, Q_n) = ‖(1/m) Σᵢ k(·, Xᵢ) − (1/n) Σᵢ k(·, Yᵢ)‖²_H
= (1/m²) Σ_{i,j} k(Xᵢ, Xⱼ) + (1/n²) Σ_{i,j} k(Yᵢ, Yⱼ) − (2/(mn)) Σ_{i,j} k(Xᵢ, Yⱼ).
V-statistic; biased estimator of MMD²_H.
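A minimal sketch of this V-statistic in Python (not from the lecture; the Gaussian kernel and its bandwidth γ = 0.5 are arbitrary choices):

```python
import numpy as np

def mmd2_biased(X, Y, gamma=0.5):
    """V-statistic (biased) estimate of MMD^2 for 1-d samples,
    using the Gaussian kernel k(x, y) = exp(-gamma * (x - y)^2)."""
    k = lambda A, B: np.exp(-gamma * (A[:, None] - B[None, :]) ** 2)
    # (1/m^2) ΣΣ k(Xi, Xj) + (1/n^2) ΣΣ k(Yi, Yj) - (2/(mn)) ΣΣ k(Xi, Yj)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 500)   # sample from P = N(0, 1)
Y = rng.normal(1.0, 1.0, 500)   # sample from Q = N(1, 1)
Z = rng.normal(0.0, 1.0, 500)   # a second sample from P

print(mmd2_biased(X, Y) > mmd2_biased(X, Z))  # True: distinct distributions score higher
```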
41 Why MMD_H?
IPM(P_m, Q_n, F) is obtained by solving a linear program for F = Lipschitz and bounded-Lipschitz balls. (S et al., EJS 2012)
Quality of approximation (S et al., EJS 2012):
For F = Lipschitz and bounded-Lipschitz balls,
IPM(P_m, Q_m, F) − IPM(P, Q, F) = O_p(m^{−1/(d+1)}), d > 2.
For F = unit RKHS ball,
MMD_H(P_m, Q_m) − MMD_H(P, Q) = O_p(m^{−1/2}).
Are there any other estimators of MMD_H(P, Q) that are statistically better than MMD_H(P_m, Q_m)? NO!! (Tolstikhin et al., 2016)
In practice? YES!! (Krikamol et al., JMLR 2016; S, Bernoulli 2016)
45 Beware of Pitfalls
There are many other distances on probabilities: total variation distance, Hellinger distance, Kullback-Leibler divergence and its variants, Fisher divergence, ...
Estimating these distances is both computationally and statistically difficult.
MMD_H is computationally simpler and appears statistically powerful with no curse of dimensionality. In fact, it is NOT statistically powerful. (Ramdas et al., AAAI 2015; S, Bernoulli, 2016)
Recall: MMD_H is based on µ_P, which is a smoothed version of P. Even though P and Q can be distinguished (coming up!!) based on µ_P and µ_Q, the distinguishability is weak compared to that of the above distances. (S et al., JMLR 2010; S, Bernoulli, 2016)
NO FREE LUNCH!!
48 So far...
P ↦ µ_P := ∫ k(·, x) dP(x)
MMD_H(P, Q) = ‖µ_P − µ_Q‖_H
Computation. Estimation.
When is P ↦ µ_P one-to-one? I.e., when does MMD_H(P, Q) = 0 ⟹ P = Q?
49 Characteristic Kernel
k is said to be characteristic if MMD_H(P, Q) = 0 ⟺ P = Q for any P and Q. Not all kernels are characteristic.
Example: if k(x, y) = c > 0 for all x, y, then µ_P = ∫ k(·, x) dP(x) = c = µ_Q, and MMD_H(P, Q) = 0 for all P, Q.
Example: let k(x, y) = xy, x, y ∈ R. Then MMD_H(P, Q) = |E_P[X] − E_Q[X]|. Characteristic for Bernoullis but not for all P and Q.
Example: let k(x, y) = (1 + xy)², x, y ∈ R. Then MMD²_H(P, Q) = 2(E_P[X] − E_Q[X])² + (E_P[X²] − E_Q[X²])². Characteristic for Gaussians but not for all P and Q.
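The linear-kernel example can be seen empirically (an illustrative sketch, not from the lecture; the distributions and Gaussian bandwidth are arbitrary choices). For P = N(0, 1) and Q = N(0, 2), which share the same mean, the linear kernel reports essentially zero discrepancy while a Gaussian kernel, which is characteristic, does not:

```python
import numpy as np

def mmd2(X, Y, k):
    # V-statistic estimate of MMD^2 for a given kernel-matrix function k
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

linear   = lambda A, B: A[:, None] * B[None, :]
gaussian = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, 3000)            # P = N(0, 1)
Y = rng.normal(0.0, np.sqrt(2.0), 3000)   # Q = N(0, 2): same mean, different variance

# the linear kernel sees only first moments, so it cannot separate P from Q,
# while the Gaussian kernel picks up the variance difference
print(mmd2(X, Y, linear), mmd2(X, Y, gaussian))
```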
53 Characteristic Kernels on R^d
Translation invariant kernel: k(x, y) = ψ(x − y), x, y ∈ R^d; bounded and continuous.
Bochner's theorem: ψ(x) = ∫_{R^d} e^{−√−1 ⟨x, ω⟩} dΛ(ω), x ∈ R^d, where Λ is a non-negative finite Borel measure on R^d.
Then: k is characteristic ⟺ supp(Λ) = R^d. (S et al., COLT 2008; JMLR, 2010)
Corollary: compactly supported ψ are characteristic. (S et al., COLT 2008; JMLR, 2010)
Key idea: Fourier representation of MMD_H.
54 Fourier Representation of MMD²_H
MMD²_H(P, Q) = ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω), where φ_P is the characteristic function of P.
Proof:
MMD²_H(P, Q) = ∫∫ ψ(x − y) d(P − Q)(x) d(P − Q)(y)
(∗) = ∫∫ ( ∫_{R^d} e^{−√−1 ⟨x − y, ω⟩} dΛ(ω) ) d(P − Q)(x) d(P − Q)(y)
(†) = ∫_{R^d} ( ∫ e^{−√−1 ⟨x, ω⟩} d(P − Q)(x) ) ( ∫ e^{√−1 ⟨y, ω⟩} d(P − Q)(y) ) dΛ(ω)
= ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω),
where Bochner's theorem is used in (∗) and Fubini's theorem in (†).
Suppose Λ = Lebesgue measure, i.e., "uniform" on R^d (!!). Then MMD_H(P, Q) is the L₂ distance between the densities (if they exist) of P and Q.
56 Characteristic Kernels on R^d
Proof: suppose supp(Λ) = R^d. Then
MMD²_H(P, Q) = 0 ⟹ ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω) = 0 ⟹ φ_P = φ_Q Λ-a.e.
But characteristic functions are uniformly continuous, so φ_P = φ_Q everywhere, which implies P = Q.
Suppose supp(Λ) ⊊ R^d. Then there exists an open set U ⊂ R^d such that Λ(U) = 0. Construct P ≠ Q such that φ_P and φ_Q differ only on U; then MMD_H(P, Q) = 0 although P ≠ Q, so k is not characteristic.
If ψ is compactly supported, its Fourier transform is analytic, i.e., cannot vanish on an interval, so supp(Λ) = R^d.
59 Translation Invariant Kernels on R^d
MMD_H(P, Q) = ‖φ_P − φ_Q‖_{L²(R^d, Λ)}
Example: P differs from Q at (roughly) one frequency.
[Figures: densities of P and Q; the difference |φ_P − φ_Q| of their characteristic functions, concentrated near one frequency ω; and the spectral measure Λ of each kernel overlaid. Picture credit: A. Gretton]
Gaussian kernel: supp(Λ) = R^d, so the differing frequency is weighted — characteristic.
Sinc kernel: Λ is supported on a compact frequency band that can miss the differing frequency — NOT characteristic.
B-Spline kernel: Λ vanishes only at isolated frequencies, so the difference is still detected — characteristic.
69 Caution
The characteristic property relates a class of kernels and a class of probabilities. (S et al., COLT 2008; JMLR 2010)
Σ := supp(Λ)
71 Measuring (In)Dependence
Let X and Y be Gaussian random variables on R. Then X and Y are independent ⟺ Cov(X, Y) = E(XY) − E(X)E(Y) = 0.
In general, Cov(X, Y) = 0 ⇏ X ⊥ Y. Covariance captures only the linear relationship between X and Y.
Feature space viewpoint: how about Cov(Φ(X), Ψ(Y))?
Suppose Φ(X) = (1, X, X²) and Ψ(Y) = (1, Y, Y², Y³). Then Cov(Φ(X), Ψ(Y)) captures Cov(X^i, Y^j) for i ∈ {0, 1, 2} and j ∈ {0, 1, 2, 3}.
73 Measuring (In)Dependence
Characterization of independence: X ⊥ Y ⟺ Cov(f(X), g(Y)) = 0 for all measurable functions f and g.
Dependence measure: sup_{f,g} Cov(f(X), g(Y)) = sup_{f,g} |E[f(X)g(Y)] − E[f(X)]E[g(Y)]|.
Similar to the IPM between P_XY and P_X P_Y.
Restricting the functions to RKHSs (constrained covariance):
COCO(P_XY; H_X, H_Y) := sup_{‖f‖_{H_X}=1, ‖g‖_{H_Y}=1} |E[f(X)g(Y)] − E[f(X)]E[g(Y)]|. (Gretton et al., AISTATS 2005, JMLR 2005)
75 Covariance Operator
Let k_X and k_Y be the reproducing kernels of H_X and H_Y respectively. Then
E[f(X)] = ⟨f, µ_{P_X}⟩_{H_X} and E[g(Y)] = ⟨g, µ_{P_Y}⟩_{H_Y}.
E[f(X)]E[g(Y)] = ⟨f, µ_{P_X}⟩_{H_X} ⟨g, µ_{P_Y}⟩_{H_Y} = ⟨f ⊗ g, µ_{P_X} ⊗ µ_{P_Y}⟩_{H_X ⊗ H_Y} = ⟨f, (µ_{P_X} ⊗ µ_{P_Y}) g⟩_{H_X} = ⟨g, (µ_{P_Y} ⊗ µ_{P_X}) f⟩_{H_Y}.
E[f(X)g(Y)] = E[⟨f, k_X(·, X)⟩_{H_X} ⟨g, k_Y(·, Y)⟩_{H_Y}] = E[⟨f ⊗ g, k_X(·, X) ⊗ k_Y(·, Y)⟩_{H_X ⊗ H_Y}] = E[⟨f, (k_X(·, X) ⊗ k_Y(·, Y)) g⟩_{H_X}] = E[⟨g, (k_Y(·, Y) ⊗ k_X(·, X)) f⟩_{H_Y}].
77 Covariance Operator
Assuming E√(k_X(X, X) k_Y(Y, Y)) < ∞, we obtain
E[f(X)g(Y)] = ⟨f, E[k_X(·, X) ⊗ k_Y(·, Y)] g⟩_{H_X} = ⟨g, E[k_Y(·, Y) ⊗ k_X(·, X)] f⟩_{H_Y}
Cov(f(X), g(Y)) = ⟨f, C_XY g⟩_{H_X} = ⟨g, C_YX f⟩_{H_Y},
where C_XY := E[k_X(·, X) ⊗ k_Y(·, Y)] − µ_{P_X} ⊗ µ_{P_Y} is the cross-covariance operator from H_Y to H_X, and C_YX = C_XY^∗.
Compare to the feature space viewpoint with canonical feature maps.
78 Dependence Measures
COCO(P_XY; H_X, H_Y) = sup_{‖f‖_{H_X}=1, ‖g‖_{H_Y}=1} ⟨f, C_XY g⟩_{H_X} = ‖C_XY‖_op = ‖C_YX‖_op,
which is the maximum singular value of C_XY.
Choosing the linear kernels k_X(x, x′) = ⟨x, x′⟩ and k_Y(y, y′) = ⟨y, y′⟩: for Gaussian distributions, X ⊥ Y ⟺ C_XY = 0.
In general, does C_XY = 0 imply X ⊥ Y?
80 Dependence Measures
How about we consider the other singular values too? How about ‖C_XY‖²_HS, the sum of squared singular values of C_XY? Note ‖C_XY‖_op ≤ ‖C_XY‖_HS.
Hilbert-Schmidt Independence Criterion (HSIC). (Gretton et al., ALT 2005, JMLR 2005)
81 Dependence Measures
COCO(P_XY; H_X, H_Y) := sup_{‖f‖_{H_X}=1, ‖g‖_{H_Y}=1} |E[f(X)g(Y)] − E[f(X)]E[g(Y)]|.
How about we use a different constraint, i.e., ‖f ⊗ g‖_{H_X ⊗ H_Y} ≤ 1?
sup_{‖f ⊗ g‖ ≤ 1} Cov(f(X), g(Y)) = sup_{‖f ⊗ g‖ ≤ 1} ⟨f, C_XY g⟩_{H_X} = sup_{‖f ⊗ g‖ ≤ 1} ⟨f ⊗ g, C_XY⟩_{H_X ⊗ H_Y} = ‖C_XY‖_{H_X ⊗ H_Y} = ‖C_XY‖_HS.
‖C_XY‖_{H_X ⊗ H_Y} = ‖E[k_X(·, X) ⊗ k_Y(·, Y)] − µ_{P_X} ⊗ µ_{P_Y}‖_{H_X ⊗ H_Y}
= ‖∫ k_X(·, x) ⊗ k_Y(·, y) d(P_XY − P_X P_Y)(x, y)‖_{H_X ⊗ H_Y}
= MMD_{H_X ⊗ H_Y}(P_XY, P_X P_Y).
84 Dependence Measures
H_X ⊗ H_Y is an RKHS with kernel k_X k_Y.
If k_X k_Y is characteristic, then ‖C_XY‖_{H_X ⊗ H_Y} = 0 ⟺ P_XY = P_X P_Y ⟺ X ⊥ Y.
If k_X and k_Y are characteristic, then ‖C_XY‖_HS = 0 ⟺ X ⊥ Y. (Gretton, 2015)
Using the reproducing property,
‖C_XY‖²_HS = E_XY E_X′Y′ [k_X(X, X′) k_Y(Y, Y′)] + E_X E_X′ [k_X(X, X′)] E_Y E_Y′ [k_Y(Y, Y′)] − 2 E_X′Y′ [E_X k_X(X, X′) E_Y k_Y(Y, Y′)].
Can be estimated using a V-statistic (empirical sums).
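A sketch of the V-statistic HSIC estimate in its usual trace form (not from the lecture; Gaussian kernels with an arbitrary bandwidth). The pair below is uncorrelated yet dependent, which plain covariance misses but HSIC catches:

```python
import numpy as np

def hsic_biased(x, y, gamma=0.5):
    """V-statistic (biased) HSIC estimate for 1-d samples:
    HSIC = (1/n^2) * trace(K H L H), with H the centering matrix."""
    n = len(x)
    K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)  # kernel matrix on x
    L = np.exp(-gamma * (y[:, None] - y[None, :]) ** 2)  # kernel matrix on y
    H = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y_dep = x**2 + 0.1 * rng.normal(size=1000)  # dependent on x, yet Cov(x, y_dep) ≈ 0
y_ind = rng.normal(size=1000)               # independent of x

print(hsic_biased(x, y_dep) > hsic_biased(x, y_ind))  # True: HSIC detects the dependence
```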
85 Applications Two-sample testing Independence testing Conditional independence testing Supervised dimensionality reduction Kernel Bayes rule (filtering, prediction and smoothing) Kernel CCA,... Review paper (Muandet et al., 2016)
86 Application: Two-Sample Testing
87 Two-Sample Problem

Given random samples $\{X_1, \ldots, X_m\} \overset{i.i.d.}{\sim} P$ and $\{Y_1, \ldots, Y_n\} \overset{i.i.d.}{\sim} Q$. Determine: $P = Q$ or $P \ne Q$?

$H_0: P = Q \Leftrightarrow H_0: \mathrm{MMD}_H(P, Q) = 0$
$H_1: P \ne Q \Leftrightarrow H_1: \mathrm{MMD}_H(P, Q) > 0$

Approach: If $\mathrm{MMD}^2_H(P_m, Q_n)$ is far from zero, reject $H_0$; if close to zero, accept $H_0$.
88 Type-I and Type-II Errors

Given $P = Q$, want a threshold (critical value) $t_{1-\alpha}$ such that $\Pr_{H_0}\left( \mathrm{MMD}^2_H(P_m, Q_n) > t_{1-\alpha} \right) \le \alpha$.
89 Statistical Test: Large Deviation Bounds

Given $P = Q$, want a threshold $t$ such that $\Pr_{H_0}\left( \mathrm{MMD}^2_H(P_m, Q_n) > t \right) \le \alpha$.

We showed that (S et al., EJS 2012)

$\Pr\left( \left| \mathrm{MMD}_H(P_m, Q_n) - \mathrm{MMD}_H(P, Q) \right| \ge \sqrt{\tfrac{2(m+n)}{mn}} \left( 1 + \sqrt{2 \log \tfrac{1}{\alpha}} \right) \right) \le \alpha$.

$\alpha$-level test: Accept $H_0$ if

$\mathrm{MMD}^2_H(P_m, Q_n) < \tfrac{2(m+n)}{mn} \left( 1 + \sqrt{2 \log \tfrac{1}{\alpha}} \right)^2$

Otherwise reject. Too conservative!!
91 Statistical Test: Asymptotic Distribution (Gretton et al., NIPS 2006, JMLR 2012)

Unbiased estimator of $\mathrm{MMD}^2_H(P, Q)$ (with $m = n$): the U-statistic

$\widehat{\mathrm{MMD}}{}^2_H := \frac{1}{m(m-1)} \sum_{i \ne j}^{m} \underbrace{k(X_i, X_j) + k(Y_i, Y_j) - k(X_i, Y_j) - k(X_j, Y_i)}_{h((X_i, Y_i), (X_j, Y_j))}$

Under $H_0$,

$m\, \widehat{\mathrm{MMD}}{}^2_H \overset{w}{\to} \sum_{i=1}^{\infty} \lambda_i (\theta_i^2 - 2)$ as $m \to \infty$,

where $\theta_i \sim N(0, 2)$ i.i.d., and the $\lambda_i$ are solutions to $\int \tilde{k}(x, y) \psi_i(x)\, dP(x) = \lambda_i \psi_i(y)$, with $\tilde{k}$ the centered kernel.

Consistent (Type-II error goes to zero): Under $H_1$,

$\sqrt{m}\, \left( \widehat{\mathrm{MMD}}{}^2_H - \mathrm{MMD}^2_H(P, Q) \right) \overset{w}{\to} N(0, \sigma_h^2)$ as $m \to \infty$.
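The U-statistic can be computed with Gram matrices by zeroing their diagonals, which removes the $i = j$ terms. A sketch assuming a Gaussian kernel and $m = n$ (the bandwidth is an illustrative choice):

```python
import numpy as np

def mmd2_u(X, Y, sigma=1.0):
    # Unbiased U-statistic estimate of MMD^2 (m = n), Gaussian kernel
    m = X.shape[0]

    def gram(A, B):
        d2 = (np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    np.fill_diagonal(Kxx, 0.0)  # U-statistic: drop the i = j terms
    np.fill_diagonal(Kyy, 0.0)
    np.fill_diagonal(Kxy, 0.0)  # h excludes the i = j cross terms as well
    return (Kxx.sum() + Kyy.sum() - 2.0 * Kxy.sum()) / (m * (m - 1))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
Y_same = rng.normal(size=(300, 2))         # H0: same distribution
Y_shift = rng.normal(size=(300, 2)) + 1.0  # H1: mean-shifted
print(mmd2_u(X, Y_same), mmd2_u(X, Y_shift))
```

Because the estimator is unbiased, it can be slightly negative under $H_0$; it concentrates around the population $\mathrm{MMD}^2$ under $H_1$.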
94 Statistical Test: Asymptotic Distribution (Gretton et al., NIPS 2006, JMLR 2012)

$\alpha$-level test: Estimate the $1-\alpha$ quantile of the null distribution using the bootstrap. Picture credit: A. Gretton.

Computationally intensive!!
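In practice the null quantile is often estimated by a permutation test, i.e., repeatedly re-splitting the pooled sample; this sketch uses that in place of the bootstrap, with a plain linear-kernel statistic (not characteristic, and purely illustrative) to keep the example short:

```python
import numpy as np

def mmd2_lin_kernel(X, Y):
    # Biased MMD^2 with the linear kernel: reduces to the squared
    # Euclidean distance between the two sample means.
    d = X.mean(axis=0) - Y.mean(axis=0)
    return float(d @ d)

def permutation_threshold(X, Y, stat, alpha=0.05, B=200, seed=0):
    # (1 - alpha) quantile of the statistic over random re-splits of the
    # pooled sample, approximating the null distribution under P = Q
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    m = X.shape[0]
    null = np.array([stat(Z[p[:m]], Z[p[m:]])
                     for p in (rng.permutation(len(Z)) for _ in range(B))])
    return float(np.quantile(null, 1 - alpha))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
Y = rng.normal(size=(100, 1)) + 1.0
t = permutation_threshold(X, Y, mmd2_lin_kernel)
print(mmd2_lin_kernel(X, Y), t)  # observed statistic vs. critical value
```

Any of the MMD estimators can be passed in as `stat`; the re-splitting is exactly what makes the procedure computationally intensive.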
95 Statistical Test Without Bootstrap (Gretton et al., NIPS 2009)

Estimate the eigenvalues $\lambda_i$ from the combined sample:

Define $Z := (X_1, \ldots, X_m, Y_1, \ldots, Y_m)$ and $K_{ij} := k(Z_i, Z_j)$.

Compute the eigenvalues $\hat{\lambda}_i$ of $\tilde{K} = HKH$, where $H = I - \frac{1}{2m} \mathbf{1}_{2m} \mathbf{1}_{2m}^T$.

$\alpha$-level test: Compute the $1-\alpha$ quantile of the distribution associated with $\sum_{i=1}^{2m} \hat{\lambda}_i (\theta_i^2 - 2)$.

The test is asymptotically of level $\alpha$ and consistent.
96 Experiments (Gretton et al., NIPS 2009)

Comparison example: Canadian Hansard corpus (agriculture, fisheries and immigration).

Samples: 5-line extracts. Kernel: $k$-spectrum kernel with $k = 10$. Sample size: 10. Repetitions: 300. Compute $\widehat{\mathrm{MMD}}{}^2_H$.

$k$-spectrum kernel: average Type-II error 0 ($\alpha = 0.05$). Bag-of-words kernel: average Type-II error 0.18.

First ever test on structured data.
97 Choice of Characteristic Kernel
98 Choice of Characteristic Kernels

Let $\mathcal{X} = \mathbb{R}^d$. Suppose $k$ is a Gaussian kernel, $k_\sigma(x, y) = e^{-\|x - y\|_2^2 / (2\sigma^2)}$.

$\mathrm{MMD}_{H_\sigma}$ is a function of $\sigma$, so $(\mathrm{MMD}_{H_\sigma})_{\sigma > 0}$ is a family of metrics. Which one should we use in practice?

Note that $\mathrm{MMD}_{H_\sigma}(P_m, Q_m) \to 0$ as $\sigma \to 0$ or $\sigma \to \infty$. Therefore, the kernel choice is very critical in applications.

Heuristics:

Median: $\sigma = \mathrm{median}\left\{ \|Z_i - Z_j\|_2 : i \ne j,\ i, j = 1, \ldots, 2m \right\}$, where $Z = ((X_i)_i, (Y_i)_i)$ (Gretton et al., NIPS 2006, NIPS 2009, JMLR 2012).

Choose the test statistic to be $\mathrm{MMD}_{H_{\sigma^*}}(P_m, Q_m)$, where $\sigma^* = \arg\max_{\sigma \in (0, \infty)} \mathrm{MMD}_{H_\sigma}(P_m, Q_m)$ (S et al., NIPS 2009).
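The median heuristic above can be sketched directly from the pooled pairwise distances (`numpy` only):

```python
import numpy as np

def median_heuristic(X, Y):
    # sigma = median of pairwise Euclidean distances over the pooled sample
    Z = np.vstack([X, Y])
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))
    iu = np.triu_indices_from(D, k=1)  # i < j only, excluding the diagonal
    return float(np.median(D[iu]))

# Pooled points {0, 1, 3} have pairwise distances {1, 2, 3}, so the median is 2
print(median_heuristic(np.array([[0.0], [1.0]]), np.array([[3.0]])))  # → 2.0
```

For large samples, the $O(m^2)$ distance matrix is usually computed on a random subsample.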
102 Classes of Characteristic Kernels (S et al., NIPS 2009)

More generally, we use $\mathrm{MMD}(P, Q) := \sup_{k \in \mathcal{K}} \mathrm{MMD}_{H_k}(P, Q)$.

Examples for $\mathcal{K}$:

$\mathcal{K}_g := \{ e^{-\sigma \|x - y\|_2^2},\ x, y \in \mathbb{R}^d : \sigma \in \mathbb{R}_+ \}$.
$\mathcal{K}_{lin} := \{ k_\lambda = \sum_{i=1}^{l} \lambda_i k_i : k_\lambda \text{ is pd},\ \sum_{i=1}^{l} \lambda_i = 1 \}$.
$\mathcal{K}_{con} := \{ k_\lambda = \sum_{i=1}^{l} \lambda_i k_i : \lambda_i \ge 0,\ \sum_{i=1}^{l} \lambda_i = 1 \}$.

Test: $\alpha$-level test: Estimate the $1-\alpha$ quantile of the null distribution of $\mathrm{MMD}(P_m, Q_m)$ using the bootstrap.

Test consistency: Based on the functional central limit theorem for U-processes indexed by a VC-subgraph class $\mathcal{K}$.

Computational disadvantage!!
104 Experiments

$q = N(0, \sigma_q^2)$; $p(x) = q(x)(1 + \sin \nu x)$.

[Figure: densities $p$ and $q$ for increasing perturbation frequencies, up to $\nu = 7.5$.]

$k(x, y) = \exp(-(x - y)^2 / \sigma)$. Test statistics: $\mathrm{MMD}(P_m, Q_m)$ and $\mathrm{MMD}_{H_\sigma}(P_m, Q_m)$ for various $\sigma$.
105 Experiments

[Figure: Type-I and Type-II errors (in %) of the $\mathrm{MMD}(P_m, Q_m)$ test as a function of $\nu$.]
106 Experiments

[Figure: Type-I and Type-II errors (in %) of the $\mathrm{MMD}_{H_\sigma}(P_m, Q_m)$ test as a function of $\log \sigma$, for $\nu = 0.5, 0.75, 1.0, 1.25, 1.5$; the median heuristic and $\sigma^*$ are marked.]
107 Choice of Characteristic Kernels (Gretton et al., NIPS 2012)

Choose a kernel that minimizes the Type-II error for a given Type-I error:

$k^* \in \arg\inf_{k \in \mathcal{K} :\ \text{Type-I}(k) \le \alpha} \text{Type-II}(k)$.

Not easy to compute with the asymptotic distribution of the U-statistic $\widehat{\mathrm{MMD}}{}^2_{H_k}(P_m, Q_m)$.

Modified statistic: Average of U-statistics computed on independent blocks of size 2.

$\widehat{\mathrm{MMD}}{}^2_{H_k, lin}(P_m, Q_m) = \frac{2}{m} \sum_{i=1}^{m/2} \underbrace{k(X_{2i-1}, X_{2i}) + k(Y_{2i-1}, Y_{2i}) - k(X_{2i-1}, Y_{2i}) - k(Y_{2i-1}, X_{2i})}_{h_k(Z_i)}$,

where $Z_i = (X_{2i-1}, X_{2i}, Y_{2i-1}, Y_{2i})$.

Recall $\widehat{\mathrm{MMD}}{}^2_H := \frac{1}{m(m-1)} \sum_{i \ne j}^{m} \underbrace{k(X_i, X_j) + k(Y_i, Y_j) - k(X_i, Y_j) - k(X_j, Y_i)}_{h((X_i, Y_i), (X_j, Y_j))}$
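The block statistic is a single vectorized pass over the data. A sketch with a Gaussian kernel (the bandwidth is an illustrative assumption); it also returns the sample standard deviation of the $h_k(Z_i)$, which the test threshold on the next slide needs:

```python
import numpy as np

def mmd2_block(X, Y, sigma=1.0):
    # Average of the U-statistic kernel h_k over independent blocks of
    # size 2: an O(m) estimator of MMD^2, here with a Gaussian kernel.
    m = (min(len(X), len(Y)) // 2) * 2
    k = lambda a, b: np.exp(-((a - b) ** 2).sum(axis=-1) / (2.0 * sigma ** 2))
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(y1, x2)
    return h.mean(), h.std(ddof=1)  # statistic and plug-in sigma_hat_{h_k}

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 1))
Y0 = rng.normal(size=(2000, 1))         # H0
Y1 = rng.normal(size=(2000, 1)) + 1.0   # H1
print(mmd2_block(X, Y0), mmd2_block(X, Y1))
```

Each block contributes one i.i.d. draw of $h_k$, which is what makes the normal asymptotics on the next slides hold.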
109 Modified Statistic

Advantages:

$\widehat{\mathrm{MMD}}{}^2_{H, lin}$ is computable in $O(m)$, while $\widehat{\mathrm{MMD}}{}^2_H$ requires $O(m^2)$ computations.

Under $H_0$, $\sqrt{m}\, \widehat{\mathrm{MMD}}{}^2_{H_k, lin}(P_m, Q_m) \overset{w}{\to} N(0, 2\sigma_{h_k}^2)$, where $\sigma_{h_k}^2 = E_Z h_k^2(Z) - (E_Z h_k(Z))^2$, assuming $0 < E_Z h_k^2(Z) < \infty$.

The asymptotic distribution is normal, as opposed to a weighted sum of infinitely many $\chi^2$ variables. Therefore, the test threshold is easy to compute.

Disadvantages:

Larger variance. Smaller power.
111 Type-I and Type-II Errors

Test threshold: For a given $k$ and $\alpha$,

$t_{k, 1-\alpha} = \sqrt{\tfrac{2}{m}}\, \sigma_{h_k}\, \Phi_N^{-1}(1 - \alpha)$,

where $\Phi_N$ is the cdf of $N(0, 1)$.

Type-II error:

$\Phi_N\left( \Phi_N^{-1}(1 - \alpha) - \frac{\mathrm{MMD}^2_{H_k}(P, Q)\, \sqrt{m}}{\sqrt{2}\, \sigma_{h_k}} \right)$
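Numerically, the threshold and the Type-II error are one-liners via the standard normal quantile. This is a reconstruction from the asymptotic null $N(0, 2\sigma_{h_k}^2/m)$; the exact normalization in the original slides may differ:

```python
from math import sqrt
from statistics import NormalDist

def block_test_threshold(sigma_h, m, alpha=0.05):
    # t_{k, 1-alpha}: the (1 - alpha) quantile of N(0, 2 sigma_h^2 / m)
    return sqrt(2.0 / m) * sigma_h * NormalDist().inv_cdf(1.0 - alpha)

def type2_error(mmd2, sigma_h, m, alpha=0.05):
    # Phi( Phi^{-1}(1 - alpha) - mmd2 * sqrt(m) / (sqrt(2) * sigma_h) )
    z = NormalDist().inv_cdf(1.0 - alpha) - mmd2 * sqrt(m) / (sqrt(2.0) * sigma_h)
    return NormalDist().cdf(z)

print(block_test_threshold(1.0, 100))  # ≈ 0.2326
```

Both quantities shrink as $m$ grows, which is the handle the kernel-selection criterion on the next slides exploits.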
112 Best Kernel: Minimizes Type-II Error

Since $\Phi_N$ is a strictly increasing function, the Type-II error is minimized by maximizing $\frac{\mathrm{MMD}^2_{H_k}(P, Q)}{\sigma_{h_k}}$.

Optimal kernel: $k^* \in \arg\sup_{k \in \mathcal{K}} \frac{\mathrm{MMD}^2_{H_k}(P, Q)}{\sigma_{h_k}}$.

Since $\mathrm{MMD}^2_{H_k}$ and $\sigma_{h_k}$ depend on the unknown $P$ and $Q$, we split the data into train and test sets, estimate $k^*$ on the train set as $\hat{k}$, and evaluate the threshold $t_{\hat{k}, 1-\alpha}$ on the test set.
113 Data-Dependent Kernel

Train data: Compute $\widehat{\mathrm{MMD}}{}^2_{H_k}$ and $\hat{\sigma}_{h_k}$, and set

$\hat{k} \in \arg\sup_{k \in \mathcal{K}} \frac{\widehat{\mathrm{MMD}}{}^2_{H_k}}{\hat{\sigma}_{h_k} + \lambda_m}$

for some $\lambda_m \to 0$ as $m \to \infty$.

Test data: Compute $\widehat{\mathrm{MMD}}{}^2_{H_{\hat{k}}}$, $\hat{\sigma}_{h_{\hat{k}}}$ and $t_{\hat{k}, 1-\alpha}$. If $\widehat{\mathrm{MMD}}{}^2_{H_{\hat{k}}} > t_{\hat{k}, 1-\alpha}$, reject $H_0$; else accept.

Similar results were recently obtained for the U-statistic $\widehat{\mathrm{MMD}}{}^2_{H_k}$ (Sutherland et al., ICLR 2017).
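A sketch of the train-split selection over a grid of Gaussian bandwidths, using the linear-time block statistic as the power proxy $\widehat{\mathrm{MMD}}{}^2 / (\hat{\sigma}_h + \lambda_m)$ (the grid and the $\lambda_m$ value are illustrative assumptions):

```python
import numpy as np

def block_h(X, Y, sigma):
    # h_k values on independent blocks of size 2, Gaussian kernel
    m = (min(len(X), len(Y)) // 2) * 2
    k = lambda a, b: np.exp(-((a - b) ** 2).sum(axis=-1) / (2.0 * sigma ** 2))
    return (k(X[0:m:2], X[1:m:2]) + k(Y[0:m:2], Y[1:m:2])
            - k(X[0:m:2], Y[1:m:2]) - k(Y[0:m:2], X[1:m:2]))

def select_bandwidth(X_tr, Y_tr, sigmas, lam=1e-4):
    # Maximize the regularized ratio mean(h) / (std(h) + lambda_m) over the grid
    def ratio(s):
        h = block_h(X_tr, Y_tr, s)
        return h.mean() / (h.std(ddof=1) + lam)
    return max(sigmas, key=ratio)

rng = np.random.default_rng(3)
X_tr = rng.normal(size=(1000, 1))
Y_tr = rng.normal(size=(1000, 1)) + 1.0
sigmas = [0.1, 0.5, 1.0, 2.0, 5.0]
print(select_bandwidth(X_tr, Y_tr, sigmas))
```

The selected bandwidth is then fixed, and the statistic and threshold are recomputed on the held-out test split so that the test remains level $\alpha$.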
115 Learning the Kernel

Define the family of kernels as follows:

$\mathcal{K} := \left\{ k : k = \sum_{i=1}^{l} \beta_i k_i,\ \beta_i \ge 0,\ i \in [l] \right\}$.

If all $k_i$ are characteristic and $\beta_i > 0$ for some $i \in [l]$, then $k$ is characteristic.

$\mathrm{MMD}^2_{H_k}(P, Q) = \sum_{i=1}^{l} \beta_i\, \mathrm{MMD}^2_{H_{k_i}}(P, Q)$, $\quad \sigma_k^2 = \sum_{i,j=1}^{l} \beta_i \beta_j\, \mathrm{cov}(h_{k_i}, h_{k_j})$,

where $h_{k_i}(x, x', y, y') = k_i(x, x') + k_i(y, y') - k_i(x, y') - k_i(x', y)$.

Objective: $\beta^* = \arg\max_{\beta \ge 0} \frac{\beta^T \eta}{\sqrt{\beta^T W \beta}}$, where $\eta := (\mathrm{MMD}^2_{H_{k_i}}(P, Q))_i$ and $W := (\mathrm{cov}(h_{k_i}, h_{k_j}))_{i,j}$.
118 Optimization

$\hat{\beta}_\lambda = \arg\max_{\beta \ge 0} \frac{\beta^T \hat{\eta}}{\sqrt{\beta^T (\hat{W} + \lambda I) \beta}}$

If $\hat{\eta}$ has at least one positive element, the optimal objective value is strictly positive, and so

$\hat{\beta}_\lambda = \arg\min_{\beta} \left\{ \beta^T (\hat{W} + \lambda I) \beta : \beta^T \hat{\eta} = 1,\ \beta \ge 0 \right\}$.

On the test data: Compute $\widehat{\mathrm{MMD}}{}^2_{H_{\hat{k}}}$ using $\hat{k} = \sum_{i=1}^{l} \hat{\beta}_{\lambda, i} k_i$. Compute the test threshold $\hat{t}_{\hat{k}, 1-\alpha}$ using $\hat{\sigma}_{\hat{k}}$.
120 Experiments

$P$ and $Q$ are mixtures of two-dimensional Gaussians. $P$ has unit covariance in each component. $Q$ has correlated Gaussians, with $\varepsilon$ the ratio of the largest to the smallest covariance eigenvalue. The testing problem becomes harder as $\varepsilon \to 1$ and as the number of mixture components grows.
121 Competing Approaches

Median heuristic.

Max. MMD: $\sup_{k \in \mathcal{K}} \widehat{\mathrm{MMD}}{}^2_{H_k}(P_m, Q_m)$, i.e., choose the $k \in \mathcal{K}$ with the largest $\widehat{\mathrm{MMD}}{}^2_{H_k}(P_m, Q_m)$; same as maximizing $\beta^T \hat{\eta}$ subject to $\|\beta\|_1 \le 1$.

$\ell_2$ statistic: maximize $\beta^T \hat{\eta}$ subject to $\|\beta\|_2 \le 1$.

Cross-validation on the training set.
122 Results

$m = 10{,}000$ (for training and test). Results are averaged over 617 trials.
123 Results [Figure: optimize $\widehat{\mathrm{MMD}}{}^2_{H_k} / \hat{\sigma}_k$]

124 Results [Figure: maximize $\widehat{\mathrm{MMD}}{}^2_{H_k}$ with a constraint on $\beta$]

125 Results [Figure: median heuristic]
126 References I

Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press, Cambridge, UK.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, Cambridge, MA. MIT Press.

Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21.

Gretton, A. (2015). A simpler condition for consistency of a kernel independence test. arXiv preprint.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two sample problem. In Advances in Neural Information Processing Systems 19. MIT Press.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2012a). A kernel two-sample test. Journal of Machine Learning Research, 13.

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005a). Measuring statistical dependence with Hilbert-Schmidt norms. In Jain, S., Simon, H. U., and Tomita, E., editors, Proceedings of Algorithmic Learning Theory, pages 63-77, Berlin. Springer-Verlag.

Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. (2010). A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems 22, Cambridge, MA. MIT Press.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. (2005b). Kernel methods for measuring independence. Journal of Machine Learning Research, 6.

Gretton, A., Smola, A., Bousquet, O., Herbrich, R., Belitski, A., Augath, M., Murayama, Y., Pauls, J., Schölkopf, B., and Logothetis, N. (2005c). Kernel constrained covariance for dependence measurement. In Ghahramani, Z. and Cowell, R., editors, Proc. 10th International Workshop on Artificial Intelligence and Statistics, pages 1-8.
127 References II

Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., and Fukumizu, K. (2012b). Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems 24, Cambridge, MA. MIT Press.

Muandet, K., Fukumizu, K., Sriperumbudur, B. K., and Schölkopf, B. (2016a). Kernel mean embedding of distributions: A review and beyond. arXiv preprint.

Muandet, K., Sriperumbudur, B. K., Fukumizu, K., Gretton, A., and Schölkopf, B. (2016b). Kernel mean shrinkage estimators. Journal of Machine Learning Research, 17(48):1-41.

Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29.

Ramdas, A., Reddi, S. J., Póczos, B., Singh, A., and Wasserman, L. (2015). On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proc. of 29th AAAI Conference on Artificial Intelligence.

Simon-Gabriel, C. and Schölkopf, B. (2016). Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. arXiv preprint.

Smola, A. J., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory. Springer-Verlag, Berlin, Germany.

Sriperumbudur, B. K. (2016). On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3).

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. G. (2012). On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. G. (2011). Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12.
128 References III

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. G. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11.

Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer.

Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. (2017). Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations.

Tolstikhin, I., Sriperumbudur, B. K., and Muandet, K. (2016). Minimax estimation of kernel mean embeddings. arXiv preprint.
Forefront of the Two Sample Problem From classical to state-of-the-art methods Yuchi Matsuoka What is the Two Sample Problem? X ",, X % ~ P i. i. d Y ",, Y - ~ Q i. i. d Two Sample Problem P = Q? Example:
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationTopics in kernel hypothesis testing
Topics in kernel hypothesis testing Kacper Chwialkowski A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of University College London. Department
More informationTheory of Positive Definite Kernel and Reproducing Kernel Hilbert Space
Theory of Positive Definite Kernel and Reproducing Kernel Hilbert Space Statistical Inference with Reproducing Kernel Hilbert Space Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department
More informationarxiv: v1 [cs.lg] 22 Jun 2009
Bayesian two-sample tests arxiv:0906.4032v1 [cs.lg] 22 Jun 2009 Karsten M. Borgwardt 1 and Zoubin Ghahramani 2 1 Max-Planck-Institutes Tübingen, 2 University of Cambridge June 22, 2009 Abstract In this
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationKernel Adaptive Metropolis-Hastings
Kernel Adaptive Metropolis-Hastings Arthur Gretton,?? Gatsby Unit, CSML, University College London NIPS, December 2015 Arthur Gretton (Gatsby Unit, UCL) Kernel Adaptive Metropolis-Hastings 12/12/2015 1
More informationStrictly Positive Definite Functions on a Real Inner Product Space
Strictly Positive Definite Functions on a Real Inner Product Space Allan Pinkus Abstract. If ft) = a kt k converges for all t IR with all coefficients a k 0, then the function f< x, y >) is positive definite
More informationKernel Sequential Monte Carlo
Kernel Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) * equal contribution April 25, 2016 1 / 37 Section
More informationUnderstanding the relationship between Functional and Structural Connectivity of Brain Networks
Understanding the relationship between Functional and Structural Connectivity of Brain Networks Sashank J. Reddi Machine Learning Department, Carnegie Mellon University SJAKKAMR@CS.CMU.EDU Abstract Background.
More informationLinear and Non-Linear Dimensionality Reduction
Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.215 and 7.5.215 Overview Dimensionality Reduction Motivation Linear Projections
More informationStatistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators. Michel Besserve MPI for Intelligent Systems, Tübingen michel.besserve@tuebingen.mpg.de Nikos K. Logothetis MPI
More informationNonparametric Indepedence Tests: Space Partitioning and Kernel Approaches
Nonparametric Indepedence Tests: Space Partitioning and Kernel Approaches Arthur Gretton 1 and László Györfi 2 1. Gatsby Computational Neuroscience Unit London, UK 2. Budapest University of Technology
More informationCharacterizing Independence with Tensor Product Kernels
Characterizing Independence with Tensor Product Kernels Zolta n Szabo CMAP, E cole Polytechnique Joint work with: Bharath K. Sriperumbudur Department of Statistics, PSU December 13, 2017 Zolta n Szabo
More informationChapter 9. Support Vector Machine. Yongdai Kim Seoul National University
Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved
More informationA Kernelized Stein Discrepancy for Goodness-of-fit Tests
Qiang Liu QLIU@CS.DARTMOUTH.EDU Computer Science, Dartmouth College, NH, 3755 Jason D. Lee JASONDLEE88@EECS.BERKELEY.EDU Michael Jordan JORDAN@CS.BERKELEY.EDU Department of Electrical Engineering and Computer
More informationMIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design
MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation
More informationOn Integral Probability Metrics, φ-divergences and Binary Classification
On Integral Probability etrics, φ-divergences and Binary Classification Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf and Gert R. G. Lanckriet arxiv:090.698v4 [cs.it] 3 Oct
More informationEfficient Complex Output Prediction
Efficient Complex Output Prediction Florence d Alché-Buc Joint work with Romain Brault, Alex Lambert, Maxime Sangnier October 12, 2017 LTCI, Télécom ParisTech, Institut-Mines Télécom, Université Paris-Saclay
More informationInverse Density as an Inverse Problem: the Fredholm Equation Approach
Inverse Density as an Inverse Problem: the Fredholm Equation Approach Qichao Que, Mikhail Belkin Department of Computer Science and Engineering The Ohio State University {que,mbelkin}@cse.ohio-state.edu
More informationAdaGAN: Boosting Generative Models
AdaGAN: Boosting Generative Models Ilya Tolstikhin ilya@tuebingen.mpg.de joint work with Gelly 2, Bousquet 2, Simon-Gabriel 1, Schölkopf 1 1 MPI for Intelligent Systems 2 Google Brain Radford et al., 2015)
More informationHilbertian Metrics and Positive Definite Kernels on Probability Measures
Max Planck Institut für biologische Kybernetik Max Planck Institute for Biological Cybernetics Technical Report No. 126 Hilbertian Metrics and Positive Definite Kernels on Probability Measures Matthias
More informationModel Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao
Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More informationA Wild Bootstrap for Degenerate Kernel Tests
A Wild Bootstrap for Degenerate Kernel Tests Kacper Chwialkowski Department of Computer Science University College London London, Gower Street, WCE 6BT kacper.chwialkowski@gmail.com Dino Sejdinovic Gatsby
More informationUnlabeled Data: Now It Helps, Now It Doesn t
institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationKernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444
Kernel Methods Jean-Philippe Vert Jean-Philippe.Vert@mines.org Last update: Jan 2015 Jean-Philippe Vert (Mines ParisTech) 1 / 444 What we know how to solve Jean-Philippe Vert (Mines ParisTech) 2 / 444
More informationKernel methods and the exponential family
Kernel methods and the exponential family Stéphane Canu 1 and Alex J. Smola 2 1- PSI - FRE CNRS 2645 INSA de Rouen, France St Etienne du Rouvray, France Stephane.Canu@insa-rouen.fr 2- Statistical Machine
More informationStatistical Optimality of Stochastic Gradient Descent through Multiple Passes
Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Loucas Pillaud-Vivien
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationMeasuring Statistical Dependence with Hilbert-Schmidt Norms
Max Planck Institut für biologische Kybernetik Max Planck Institute for Biological Cybernetics Technical Report No. 140 Measuring Statistical Dependence with Hilbert-Schmidt Norms Arthur Gretton, 1 Olivier
More informationKernel Method: Data Analysis with Positive Definite Kernels
Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University
More informationBayesian Regularization
Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors
More informationStatistical inference on Lévy processes
Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline
More informationDirect Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Classification Logistic regression models for classification problem We consider two class problem: Y {0, 1}. The Bayes rule for the classification is I(P(Y = 1 X = x) > 1/2) so
More informationCausal Inference in Time Series via Supervised Learning
Causal Inference in Time Series via Supervised Learning Yoichi Chikahara and Akinori Fujino NTT Communication Science Laboratories, Kyoto 619-0237, Japan chikahara.yoichi@lab.ntt.co.jp, fujino.akinori@lab.ntt.co.jp
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More informationA Bahadur Representation of the Linear Support Vector Machine
A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support
More informationImportance Reweighting Using Adversarial-Collaborative Training
Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a
More informationSTATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization
More informationOnline Gradient Descent Learning Algorithms
DISI, Genova, December 2006 Online Gradient Descent Learning Algorithms Yiming Ying (joint work with Massimiliano Pontil) Department of Computer Science, University College London Introduction Outline
More information