Hilbert Space Embedding of Probability Measures
1 Lecture 2 Hilbert Space Embedding of Probability Measures Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Machine Learning Summer School Tübingen, 2017
2 Recap of Lecture 1
The kernel method provides an elegant approach to obtaining non-linear algorithms from linear algorithms.
Input space, X: the space of observed data on which learning is performed.
Feature map, Φ: defined through a positive definite kernel function, k : X × X → R; x ↦ Φ(x), x ∈ X.
Constructing linear algorithms in the feature space Φ(X) translates into non-linear algorithms in X.
Elegance: no explicit construction of Φ, as ⟨Φ(x), Φ(y)⟩ = k(x, y).
Function space view: RKHS; smoothness and generalization.
Examples: ridge regression; in fact many more (kernel SVM/PCA/FDA/CCA/perceptron/logistic regression, ...).
3 Outline Motivating example: Comparing distributions Hilbert space embedding of measures Mean element Distance on probabilities (MMD) Characteristic kernels Cross-covariance operator and measure of independence Applications Two-sample testing Choice of kernel
4 Co-authors Sivaraman Balakrishnan (Statistics, Carnegie Mellon University) Kenji Fukumizu (Institute for Statistical Mathematics, Tokyo) Arthur Gretton (Gatsby Unit, University College London) Gert Lanckriet (Electrical Engineering, University of California, San Diego) Krikamol Muandet (Mathematics, Mahidol University, Bangkok) Massimiliano Pontil (Computer Science, University College London) Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, Tübingen) Dino Sejdinovic (Statistics, University of Oxford) Heiko Strathmann (Gatsby Unit, University College London) Ilya Tolstikhin (Max Planck Institute for Intelligent Systems, Tübingen)
5 Motivating Example: Coin Toss
Toss 1: T H H H T T H T T H H T H
Toss 2: H T T H T H T T H H H T T
Are the coins/tosses statistically similar?
Toss 1 is a sample from P := Bernoulli(p) and Toss 2 is a sample from Q := Bernoulli(q).
Is p = q or not? I.e., compare E_P[X] = ∫_{{0,1}} x dP(x) and E_Q[X] = ∫_{{0,1}} x dQ(x).
7 Coin Toss Example
In other words, we compare ∫ Φ(x) dP(x) and ∫ Φ(x) dQ(x), where Φ is the identity map, Φ(x) = x.
A positive definite kernel corresponding to Φ is k(x, y) = ⟨Φ(x), Φ(y)⟩ = xy, which is a linear kernel on {0, 1}.
Therefore, comparing two Bernoullis is equivalent to asking whether
∫_{{0,1}} k(y, x) dP(x) = ∫_{{0,1}} k(y, x) dQ(x) for all y ∈ {0, 1},
i.e., comparing the expectations of the kernel.
8 Comparing two Gaussians
P = N(µ₁, σ₁²) and Q = N(µ₂, σ₂²).
Comparing P and Q is equivalent to comparing µ₁ with µ₂ and σ₁² with σ₂², i.e.,
E_P[X] = ∫_R x dP(x) =? ∫_R x dQ(x) = E_Q[X]
and
E_P[X²] = ∫_R x² dP(x) =? ∫_R x² dQ(x) = E_Q[X²].
Concisely: ∫_R Φ(x) dP(x) =? ∫_R Φ(x) dQ(x), where Φ(x) = (x, x²).
Compare the first moment of the feature map.
10 Comparing two Gaussians
Using the map Φ, we can construct a positive definite kernel as
k(x, y) = ⟨Φ(x), Φ(y)⟩_{R²} = xy + x²y²,
which is a polynomial kernel of order 2. Therefore, comparing two Gaussians is equivalent to asking whether
∫_R k(y, x) dP(x) = ∫_R k(y, x) dQ(x) for all y ∈ R,
i.e., comparing the expectations of the kernel.
11 Comparing general P and Q
The moment generating function is defined as M_P(y) = ∫_R e^{xy} dP(x) and (if it exists) captures all the information about a distribution, i.e., M_P = M_Q ⟺ P = Q.
Choosing Φ(x) = (1, x, x²/√2!, ..., x^i/√i!, ...) ∈ ℓ²(N), x ∈ R,
it is easy to verify that k(x, y) = ⟨Φ(x), Φ(y)⟩_{ℓ²(N)} = e^{xy}, and so
∫_R k(x, y) dP(x) = ∫_R k(x, y) dQ(x) for all y ∈ R ⟺ P = Q.
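This feature map can be checked numerically: truncated versions of Φ give partial sums of the exponential series. A quick illustrative sketch (not from the lecture; the truncation length 30 is an arbitrary choice):

```python
import math

def phi(x, n_terms=30):
    # truncated feature map Φ(x) = (x^i / sqrt(i!)) for i = 0, ..., n_terms - 1
    return [x**i / math.sqrt(math.factorial(i)) for i in range(n_terms)]

def k(x, y, n_terms=30):
    # <Φ(x), Φ(y)> = Σ (xy)^i / i!, the partial sums of e^{xy}
    return sum(a * b for a, b in zip(phi(x, n_terms), phi(y, n_terms)))

print(abs(k(1.3, 0.7) - math.exp(1.3 * 0.7)) < 1e-10)  # True
```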
12 Two-Sample Problem
Given random samples {X₁, ..., X_m} i.i.d. ∼ P and {Y₁, ..., Y_n} i.i.d. ∼ Q.
Determine: P = Q or P ≠ Q?
Applications: microarray data (aggregation problem); speaker verification.
Independence Testing: given random samples {(X₁, Y₁), ..., (X_n, Y_n)} i.i.d. ∼ P_XY, does P_XY factorize into P_X P_Y?
Applications: feature selection (microarrays, image and text, ...).
13 Hilbert Space Embedding of Measures
14 Hilbert Space Embedding of Measures
Canonical feature map: Φ(x) = k(·, x) ∈ H, x ∈ X, where H is a reproducing kernel Hilbert space (RKHS).
Generalization to probabilities:
x ↦ δ_x (point mass at x) ↦ ∫ k(·, y) dδ_x(y) = E_{δ_x}[k(·, Y)] = k(·, x).
Based on the above, the map is extended to probability measures as
P ↦ µ_P := ∫ Φ(x) dP(x) = ∫ k(·, x) dP(x) = E_P[k(·, X)]. (Smola et al., ALT 2007)
15 Properties
µ_P is the mean of the feature map and is called the kernel mean or mean element of P.
When is µ_P well defined? ∫ √k(x, x) dP(x) < ∞ ⟹ µ_P ∈ H.
Proof: by Jensen's inequality,
‖µ_P‖_H = ‖∫ k(·, x) dP(x)‖_H ≤ ∫ ‖k(·, x)‖_H dP(x) = ∫ √k(x, x) dP(x).
We know that for any f ∈ H, f(x) = ⟨f, k(·, x)⟩_H. So, for any f ∈ H,
∫ f(x) dP(x) = ∫ ⟨f, k(·, x)⟩_H dP(x) = ⟨f, ∫ k(·, x) dP(x)⟩_H = ⟨f, µ_P⟩_H.
18 Interpretation
Suppose k is translation invariant on R^d, i.e., k(x, y) = ψ(x − y), x, y ∈ R^d. Then
µ_P = ∫_{R^d} ψ(· − x) dP(x) = ψ ∗ P,
where ∗ denotes convolution. Convolution is a smoothing operation: µ_P is a smoothed version of P.
Example: suppose P = δ_y, a point mass at y. Then µ_P = ψ ∗ P = ψ(· − y).
Example: suppose ψ ∼ N(0, σ²) and P = N(µ, τ²). Then µ_P = ψ ∗ P ∼ N(µ, σ² + τ²): µ_P is a wider Gaussian than P.
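The Gaussian example can be verified by Monte Carlo (an illustrative sketch, not from the lecture; the parameter values µ = 1, τ = 0.5, σ = 1 are arbitrary). With the unnormalized bump ψ(x) = exp(−x²/(2σ²)) and P = N(µ, τ²), the convolution ψ ∗ P works out to σ/√(σ² + τ²) · exp(−(t − µ)²/(2(σ² + τ²))), a wider Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau, sigma = 1.0, 0.5, 1.0           # P = N(mu, tau^2); psi is a Gaussian bump of width sigma
x = rng.normal(mu, tau, size=200_000)    # sample from P

def kernel_mean(t):
    # empirical kernel mean: mu_P(t) ≈ (1/n) Σ_i psi(t - x_i)
    return np.exp(-(t - x) ** 2 / (2 * sigma**2)).mean()

def closed_form(t):
    # psi * P is a scaled Gaussian of the wider width sqrt(sigma^2 + tau^2)
    s2 = sigma**2 + tau**2
    return sigma / np.sqrt(s2) * np.exp(-(t - mu) ** 2 / (2 * s2))

print(abs(kernel_mean(0.3) - closed_form(0.3)) < 0.01)  # True, up to Monte Carlo error
```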
21 Comparing Kernel Means
Define a distance (maximum mean discrepancy) on probabilities:
MMD_H(P, Q) = ‖µ_P − µ_Q‖_H. (Gretton et al., NIPS 2006; Smola et al., ALT 2007)
MMD²_H(P, Q) = ⟨µ_P, µ_P⟩_H + ⟨µ_Q, µ_Q⟩_H − 2⟨µ_P, µ_Q⟩_H
= ∫ µ_P(x) dP(x) + ∫ µ_Q(x) dQ(x) − 2 ∫ µ_P(x) dQ(x)
= ∫∫ k(x, y) dP(x) dP(y) + ∫∫ k(x, y) dQ(x) dQ(y) − 2 ∫∫ k(x, y) dP(x) dQ(y)
= E_P k(X, X′) [avg. similarity between points from P]
+ E_Q k(Y, Y′) [avg. similarity between points from Q]
− 2 E_{P,Q} k(X, Y) [avg. similarity between points from P and Q].
27 Comparing Kernel Means
In the motivating examples, we compare P and Q by comparing
µ_P(y) = ∫ k(y, x) dP(x) and µ_Q(y) = ∫ k(y, x) dQ(x), y ∈ X.
For any f ∈ H, ‖f‖_∞ = sup_y |f(y)| = sup_y |⟨f, k(·, y)⟩_H| ≤ sup_y √k(y, y) ‖f‖_H.
⟹ ‖µ_P − µ_Q‖_∞ ≤ sup_y √k(y, y) ‖µ_P − µ_Q‖_H.
Does ‖µ_P − µ_Q‖_H = 0 ⟹ P = Q? (More on this later.)
31 Integral Probability Metric
The integral probability metric between P and Q is defined as
IPM(P, Q, F) := sup_{f ∈ F} |∫ f(x) dP(x) − ∫ f(x) dQ(x)| = sup_{f ∈ F} |E_P f(X) − E_Q f(Y)|. (Müller, 1997)
F controls the degree of distinguishability between P and Q.
Related to the Bayes risk of a certain classification problem. (S et al., NIPS 2009; EJS 2012)
Example: suppose F = {a x, x ∈ R : a ∈ [−1, 1]}. Then
IPM(P, Q, F) = sup_{a ∈ [−1,1]} |a| |∫_R x dP(x) − ∫_R x dQ(x)| = |E_P[X] − E_Q[X]|.
32 Integral Probability Metric
Example: suppose F = {a x + b x², x ∈ R : a² + b² = 1}. Then
IPM(P, Q, F) = sup_{a² + b² = 1} |a ∫_R x d(P − Q) + b ∫_R x² d(P − Q)|
= √[ (∫_R x d(P − Q))² + (∫_R x² d(P − Q))² ].
How? Exercise!
The richer F is, the finer the resolvability of P and Q.
We will explore the relation of MMD_H(P, Q) to IPM(P, Q, F).
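The exercise reduces to Cauchy-Schwarz: the supremum of a·u + b·v over the unit circle a² + b² = 1 is √(u² + v²). A numerical check (u and v are hypothetical stand-ins for the two integrals):

```python
import numpy as np

u, v = 0.8, -0.3                                # stand-ins for ∫x d(P−Q) and ∫x² d(P−Q)
theta = np.linspace(0.0, 2.0 * np.pi, 100_001)  # sweep (a, b) = (cos θ, sin θ) over the unit circle
sup_grid = np.max(np.cos(theta) * u + np.sin(theta) * v)
print(abs(sup_grid - np.hypot(u, v)) < 1e-6)    # True: the supremum is sqrt(u^2 + v^2)
```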
33 Integral Probability Metric
Classical results for IPM(P, Q, F) := sup_{f ∈ F} |∫ f(x) dP(x) − ∫ f(x) dQ(x)|:
F = unit Lipschitz ball (Wasserstein distance) (Dudley, 2002)
F = unit bounded-Lipschitz ball (Dudley metric) (Dudley, 2002)
F = {1_{(−∞, t]} : t ∈ R^d} (Kolmogorov metric) (Müller, 1997)
F = unit ball in bounded measurable functions (total variation distance) (Dudley, 2002)
For all these F, IPM(P, Q, F) = 0 ⟺ P = Q.
(Gretton et al., NIPS 2006, JMLR 2012; S et al., COLT 2008): let F = unit ball in an RKHS H with bounded kernel k. Then MMD_H(P, Q) = IPM(P, Q, F).
Proof: ∫ f(x) d(P − Q)(x) = ⟨f, µ_P − µ_Q⟩_H ≤ ‖f‖_H ‖µ_P − µ_Q‖_H, with equality for f = (µ_P − µ_Q)/‖µ_P − µ_Q‖_H.
36 Two-Sample Problem
Given random samples {X₁, ..., X_m} i.i.d. ∼ P and {Y₁, ..., Y_n} i.i.d. ∼ Q. Determine: P = Q or P ≠ Q?
Approach: define ρ to be a distance on probabilities.
H₀: P = Q  ⟺  H₀: ρ(P, Q) = 0
H₁: P ≠ Q  ⟺  H₁: ρ(P, Q) > 0
If the empirical ρ is far from zero: reject H₀; close to zero: accept H₀.
39 Why MMD_H?
Related to the estimation of IPM(P, Q, F). Recall
MMD²_H(P, Q) = ‖∫ k(·, x) dP(x) − ∫ k(·, x) dQ(x)‖²_H.
A trivial approximation: P_m := (1/m) Σᵢ δ_{Xᵢ} and Q_n := (1/n) Σᵢ δ_{Yᵢ}, where δ_x represents the Dirac measure at x.
MMD²_H(P_m, Q_n) = ‖(1/m) Σᵢ k(·, Xᵢ) − (1/n) Σᵢ k(·, Yᵢ)‖²_H
= (1/m²) Σ_{i,j} k(Xᵢ, Xⱼ) + (1/n²) Σ_{i,j} k(Yᵢ, Yⱼ) − (2/(mn)) Σ_{i,j} k(Xᵢ, Yⱼ).
V-statistic; biased estimator of MMD²_H.
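A minimal sketch of this V-statistic in Python (not from the lecture; the Gaussian kernel and its bandwidth γ = 0.5 are arbitrary choices):

```python
import numpy as np

def mmd2_biased(X, Y, gamma=0.5):
    """V-statistic (biased) estimate of MMD^2 for 1-d samples,
    using the Gaussian kernel k(x, y) = exp(-gamma * (x - y)^2)."""
    k = lambda A, B: np.exp(-gamma * (A[:, None] - B[None, :]) ** 2)
    # (1/m^2) ΣΣ k(Xi, Xj) + (1/n^2) ΣΣ k(Yi, Yj) - (2/(mn)) ΣΣ k(Xi, Yj)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 500)   # sample from P = N(0, 1)
Y = rng.normal(1.0, 1.0, 500)   # sample from Q = N(1, 1)
Z = rng.normal(0.0, 1.0, 500)   # a second sample from P

print(mmd2_biased(X, Y) > mmd2_biased(X, Z))  # True: distinct distributions score higher
```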
41 Why MMD_H?
IPM(P_m, Q_n, F) is obtained by solving a linear program for F = Lipschitz and bounded-Lipschitz balls. (S et al., EJS 2012)
Quality of approximation (S et al., EJS 2012):
For F = Lipschitz and bounded-Lipschitz balls,
IPM(P_m, Q_m, F) − IPM(P, Q, F) = O_p(m^{−1/(d+1)}), d > 2.
For F = unit RKHS ball,
MMD_H(P_m, Q_m) − MMD_H(P, Q) = O_p(m^{−1/2}).
Are there any other estimators of MMD_H(P, Q) that are statistically better than MMD_H(P_m, Q_m)? NO!! (Tolstikhin et al., 2016)
In practice? YES!! (Krikamol et al., JMLR 2016; S, Bernoulli 2016)
45 Beware of Pitfalls
There are many other distances on probabilities: total variation distance, Hellinger distance, Kullback-Leibler divergence and its variants, Fisher divergence, ...
Estimating these distances is both computationally and statistically difficult.
MMD_H is computationally simpler and appears statistically powerful with no curse of dimensionality. In fact, it is NOT statistically powerful. (Ramdas et al., AAAI 2015; S, Bernoulli, 2016)
Recall: MMD_H is based on µ_P, which is a smoothed version of P. Even though P and Q can be distinguished (coming up!!) based on µ_P and µ_Q, the distinguishability is weak compared to that of the above distances. (S et al., JMLR 2010; S, Bernoulli, 2016)
NO FREE LUNCH!!
48 So far...
P ↦ µ_P := ∫ k(·, x) dP(x)
MMD_H(P, Q) = ‖µ_P − µ_Q‖_H
Computation. Estimation.
When is P ↦ µ_P one-to-one? I.e., when does MMD_H(P, Q) = 0 ⟹ P = Q?
49 Characteristic Kernel
k is said to be characteristic if MMD_H(P, Q) = 0 ⟺ P = Q for any P and Q. Not all kernels are characteristic.
Example: if k(x, y) = c > 0 for all x, y, then µ_P = ∫ k(·, x) dP(x) = c = µ_Q, and MMD_H(P, Q) = 0 for all P, Q.
Example: let k(x, y) = xy, x, y ∈ R. Then MMD_H(P, Q) = |E_P[X] − E_Q[X]|. Characteristic for Bernoullis but not for all P and Q.
Example: let k(x, y) = (1 + xy)², x, y ∈ R. Then MMD²_H(P, Q) = 2(E_P[X] − E_Q[X])² + (E_P[X²] − E_Q[X²])². Characteristic for Gaussians but not for all P and Q.
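The linear-kernel example can be seen empirically (an illustrative sketch, not from the lecture; the distributions and Gaussian bandwidth are arbitrary choices). For P = N(0, 1) and Q = N(0, 2), which share the same mean, the linear kernel reports essentially zero discrepancy while a Gaussian kernel, which is characteristic, does not:

```python
import numpy as np

def mmd2(X, Y, k):
    # V-statistic estimate of MMD^2 for a given kernel-matrix function k
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

linear   = lambda A, B: A[:, None] * B[None, :]
gaussian = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, 3000)            # P = N(0, 1)
Y = rng.normal(0.0, np.sqrt(2.0), 3000)   # Q = N(0, 2): same mean, different variance

# the linear kernel sees only first moments, so it cannot separate P from Q,
# while the Gaussian kernel picks up the variance difference
print(mmd2(X, Y, linear), mmd2(X, Y, gaussian))
```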
53 Characteristic Kernels on R^d
Translation invariant kernel: k(x, y) = ψ(x − y), x, y ∈ R^d; bounded and continuous.
Bochner's theorem: ψ(x) = ∫_{R^d} e^{−√−1 ⟨x, ω⟩} dΛ(ω), x ∈ R^d, where Λ is a non-negative finite Borel measure on R^d.
Then: k is characteristic ⟺ supp(Λ) = R^d. (S et al., COLT 2008; JMLR, 2010)
Corollary: compactly supported ψ are characteristic. (S et al., COLT 2008; JMLR, 2010)
Key idea: Fourier representation of MMD_H.
54 Fourier Representation of MMD²_H
MMD²_H(P, Q) = ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω), where φ_P is the characteristic function of P.
Proof:
MMD²_H(P, Q) = ∫∫ ψ(x − y) d(P − Q)(x) d(P − Q)(y)
(∗) = ∫∫ ( ∫_{R^d} e^{−√−1 ⟨x − y, ω⟩} dΛ(ω) ) d(P − Q)(x) d(P − Q)(y)
(†) = ∫_{R^d} ( ∫ e^{−√−1 ⟨x, ω⟩} d(P − Q)(x) ) ( ∫ e^{√−1 ⟨y, ω⟩} d(P − Q)(y) ) dΛ(ω)
= ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω),
where Bochner's theorem is used in (∗) and Fubini's theorem in (†).
Suppose Λ = Lebesgue measure, i.e., "uniform" on R^d (!!). Then MMD_H(P, Q) is the L₂ distance between the densities (if they exist) of P and Q.
56 Characteristic Kernels on R^d
Proof: suppose supp(Λ) = R^d. Then
MMD²_H(P, Q) = 0 ⟹ ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω) = 0 ⟹ φ_P = φ_Q Λ-a.e.
But characteristic functions are uniformly continuous, so φ_P = φ_Q everywhere, which implies P = Q.
Suppose supp(Λ) ⊊ R^d. Then there exists an open set U ⊂ R^d such that Λ(U) = 0. Construct P ≠ Q such that φ_P and φ_Q differ only on U; then MMD_H(P, Q) = 0 although P ≠ Q, so k is not characteristic.
If ψ is compactly supported, its Fourier transform is analytic, i.e., cannot vanish on an interval, so supp(Λ) = R^d.
59 Translation Invariant Kernels on R^d
MMD_H(P, Q) = ‖φ_P − φ_Q‖_{L²(R^d, Λ)}
Example: P differs from Q at (roughly) one frequency.
[Figures: densities of P and Q; the difference |φ_P − φ_Q| of their characteristic functions, concentrated near one frequency ω; and the spectral measure Λ of each kernel overlaid. Picture credit: A. Gretton]
Gaussian kernel: supp(Λ) = R^d, so the differing frequency is weighted — characteristic.
Sinc kernel: Λ is supported on a compact frequency band that can miss the differing frequency — NOT characteristic.
B-Spline kernel: Λ vanishes only at isolated frequencies, so the difference is still detected — characteristic.
69 Caution
The characteristic property relates a class of kernels and a class of probabilities. (S et al., COLT 2008; JMLR 2010)
Σ := supp(Λ)
71 Measuring (In)Dependence
Let X and Y be Gaussian random variables on R. Then X and Y are independent ⟺ Cov(X, Y) = E(XY) − E(X)E(Y) = 0.
In general, Cov(X, Y) = 0 ⇏ X ⊥ Y. Covariance captures only the linear relationship between X and Y.
Feature space viewpoint: how about Cov(Φ(X), Ψ(Y))?
Suppose Φ(X) = (1, X, X²) and Ψ(Y) = (1, Y, Y², Y³). Then Cov(Φ(X), Ψ(Y)) captures Cov(X^i, Y^j) for i ∈ {0, 1, 2} and j ∈ {0, 1, 2, 3}.
73 Measuring (In)Dependence
Characterization of independence: X ⊥ Y ⟺ Cov(f(X), g(Y)) = 0 for all measurable functions f and g.
Dependence measure: sup_{f,g} Cov(f(X), g(Y)) = sup_{f,g} |E[f(X)g(Y)] − E[f(X)]E[g(Y)]|.
Similar to the IPM between P_XY and P_X P_Y.
Restricting the functions to RKHSs (constrained covariance):
COCO(P_XY; H_X, H_Y) := sup_{‖f‖_{H_X}=1, ‖g‖_{H_Y}=1} |E[f(X)g(Y)] − E[f(X)]E[g(Y)]|. (Gretton et al., AISTATS 2005, JMLR 2005)
75 Covariance Operator
Let k_X and k_Y be the reproducing kernels of H_X and H_Y respectively. Then
E[f(X)] = ⟨f, µ_{P_X}⟩_{H_X} and E[g(Y)] = ⟨g, µ_{P_Y}⟩_{H_Y}.
E[f(X)]E[g(Y)] = ⟨f, µ_{P_X}⟩_{H_X} ⟨g, µ_{P_Y}⟩_{H_Y} = ⟨f ⊗ g, µ_{P_X} ⊗ µ_{P_Y}⟩_{H_X ⊗ H_Y} = ⟨f, (µ_{P_X} ⊗ µ_{P_Y}) g⟩_{H_X} = ⟨g, (µ_{P_Y} ⊗ µ_{P_X}) f⟩_{H_Y}.
E[f(X)g(Y)] = E[⟨f, k_X(·, X)⟩_{H_X} ⟨g, k_Y(·, Y)⟩_{H_Y}] = E[⟨f ⊗ g, k_X(·, X) ⊗ k_Y(·, Y)⟩_{H_X ⊗ H_Y}] = E[⟨f, (k_X(·, X) ⊗ k_Y(·, Y)) g⟩_{H_X}] = E[⟨g, (k_Y(·, Y) ⊗ k_X(·, X)) f⟩_{H_Y}].
77 Covariance Operator
Assuming E√(k_X(X, X) k_Y(Y, Y)) < ∞, we obtain
E[f(X)g(Y)] = ⟨f, E[k_X(·, X) ⊗ k_Y(·, Y)] g⟩_{H_X} = ⟨g, E[k_Y(·, Y) ⊗ k_X(·, X)] f⟩_{H_Y}
Cov(f(X), g(Y)) = ⟨f, C_XY g⟩_{H_X} = ⟨g, C_YX f⟩_{H_Y},
where C_XY := E[k_X(·, X) ⊗ k_Y(·, Y)] − µ_{P_X} ⊗ µ_{P_Y} is the cross-covariance operator from H_Y to H_X, and C_YX = C_XY^∗.
Compare to the feature space viewpoint with canonical feature maps.
78 Dependence Measures
COCO(P_XY; H_X, H_Y) = sup_{‖f‖_{H_X}=1, ‖g‖_{H_Y}=1} ⟨f, C_XY g⟩_{H_X} = ‖C_XY‖_op = ‖C_YX‖_op,
which is the maximum singular value of C_XY.
Choosing the linear kernels k_X(x, x′) = ⟨x, x′⟩ and k_Y(y, y′) = ⟨y, y′⟩: for Gaussian distributions, X ⊥ Y ⟺ C_XY = 0.
In general, does C_XY = 0 imply X ⊥ Y?
80 Dependence Measures
How about we consider the other singular values too? How about ‖C_XY‖²_HS, the sum of squared singular values of C_XY? Note ‖C_XY‖_op ≤ ‖C_XY‖_HS.
Hilbert-Schmidt Independence Criterion (HSIC). (Gretton et al., ALT 2005, JMLR 2005)
81 Dependence Measures
COCO(P_XY; H_X, H_Y) := sup_{‖f‖_{H_X}=1, ‖g‖_{H_Y}=1} |E[f(X)g(Y)] − E[f(X)]E[g(Y)]|.
How about we use a different constraint, i.e., ‖f ⊗ g‖_{H_X ⊗ H_Y} ≤ 1?
sup_{‖f ⊗ g‖ ≤ 1} Cov(f(X), g(Y)) = sup_{‖f ⊗ g‖ ≤ 1} ⟨f, C_XY g⟩_{H_X} = sup_{‖f ⊗ g‖ ≤ 1} ⟨f ⊗ g, C_XY⟩_{H_X ⊗ H_Y} = ‖C_XY‖_{H_X ⊗ H_Y} = ‖C_XY‖_HS.
‖C_XY‖_{H_X ⊗ H_Y} = ‖E[k_X(·, X) ⊗ k_Y(·, Y)] − µ_{P_X} ⊗ µ_{P_Y}‖_{H_X ⊗ H_Y}
= ‖∫ k_X(·, x) ⊗ k_Y(·, y) d(P_XY − P_X P_Y)(x, y)‖_{H_X ⊗ H_Y}
= MMD_{H_X ⊗ H_Y}(P_XY, P_X P_Y).
84 Dependence Measures
H_X ⊗ H_Y is an RKHS with kernel k_X k_Y.
If k_X k_Y is characteristic, then ‖C_XY‖_{H_X ⊗ H_Y} = 0 ⟺ P_XY = P_X P_Y ⟺ X ⊥ Y.
If k_X and k_Y are characteristic, then ‖C_XY‖_HS = 0 ⟺ X ⊥ Y. (Gretton, 2015)
Using the reproducing property,
‖C_XY‖²_HS = E_XY E_X′Y′ [k_X(X, X′) k_Y(Y, Y′)] + E_X E_X′ [k_X(X, X′)] E_Y E_Y′ [k_Y(Y, Y′)] − 2 E_X′Y′ [E_X k_X(X, X′) E_Y k_Y(Y, Y′)].
Can be estimated using a V-statistic (empirical sums).
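A sketch of the V-statistic HSIC estimate in its usual trace form (not from the lecture; Gaussian kernels with an arbitrary bandwidth). The pair below is uncorrelated yet dependent, which plain covariance misses but HSIC catches:

```python
import numpy as np

def hsic_biased(x, y, gamma=0.5):
    """V-statistic (biased) HSIC estimate for 1-d samples:
    HSIC = (1/n^2) * trace(K H L H), with H the centering matrix."""
    n = len(x)
    K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)  # kernel matrix on x
    L = np.exp(-gamma * (y[:, None] - y[None, :]) ** 2)  # kernel matrix on y
    H = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y_dep = x**2 + 0.1 * rng.normal(size=1000)  # dependent on x, yet Cov(x, y_dep) ≈ 0
y_ind = rng.normal(size=1000)               # independent of x

print(hsic_biased(x, y_dep) > hsic_biased(x, y_ind))  # True: HSIC detects the dependence
```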
85 Applications Two-sample testing Independence testing Conditional independence testing Supervised dimensionality reduction Kernel Bayes rule (filtering, prediction and smoothing) Kernel CCA,... Review paper (Muandet et al., 2016)
86 Application: Two-Sample Testing
87 Two-Sample Problem

Given random samples $\{X_1, \ldots, X_m\} \overset{i.i.d.}{\sim} P$ and $\{Y_1, \ldots, Y_n\} \overset{i.i.d.}{\sim} Q$. Determine: $P = Q$ or $P \ne Q$?

$H_0: P = Q \Leftrightarrow H_0: \mathrm{MMD}_H(P, Q) = 0$
$H_1: P \ne Q \Leftrightarrow H_1: \mathrm{MMD}_H(P, Q) > 0$

Approach: If $\mathrm{MMD}^2_H(P_m, Q_n)$ is far from zero, reject $H_0$; if close to zero, accept $H_0$.
88 Type-I and Type-II Errors

Given $P = Q$, want a threshold (critical value) $t_{1-\alpha}$ such that $\Pr_{H_0}\left( \mathrm{MMD}^2_H(P_m, Q_n) > t_{1-\alpha} \right) \le \alpha$.
89 Statistical Test: Large Deviation Bounds

Given $P = Q$, want a threshold $t$ such that $\Pr_{H_0}\left( \mathrm{MMD}^2_H(P_m, Q_n) > t \right) \le \alpha$.

We showed that (S et al., EJS 2012)

$\Pr\left( \left| \mathrm{MMD}_H(P_m, Q_n) - \mathrm{MMD}_H(P, Q) \right| \ge \sqrt{\tfrac{2(m+n)}{mn}} \left( 1 + \sqrt{2 \log \tfrac{1}{\alpha}} \right) \right) \le \alpha$.

$\alpha$-level test: Accept $H_0$ if

$\mathrm{MMD}^2_H(P_m, Q_n) < \tfrac{2(m+n)}{mn} \left( 1 + \sqrt{2 \log \tfrac{1}{\alpha}} \right)^2$

Otherwise reject. Too conservative!!
91 Statistical Test: Asymptotic Distribution (Gretton et al., NIPS 2006, JMLR 2012)

Unbiased estimator of $\mathrm{MMD}^2_H(P, Q)$ (with $m = n$): the U-statistic

$\widehat{\mathrm{MMD}}{}^2_H := \frac{1}{m(m-1)} \sum_{i \ne j}^{m} \underbrace{k(X_i, X_j) + k(Y_i, Y_j) - k(X_i, Y_j) - k(X_j, Y_i)}_{h((X_i, Y_i), (X_j, Y_j))}$

Under $H_0$,

$m\, \widehat{\mathrm{MMD}}{}^2_H \overset{w}{\to} \sum_{i=1}^{\infty} \lambda_i (\theta_i^2 - 2)$ as $m \to \infty$,

where $\theta_i \sim N(0, 2)$ i.i.d., and the $\lambda_i$ are solutions to $\int \tilde{k}(x, y) \psi_i(x)\, dP(x) = \lambda_i \psi_i(y)$, with $\tilde{k}$ the centered kernel.

Consistent (Type-II error goes to zero): Under $H_1$,

$\sqrt{m}\, \left( \widehat{\mathrm{MMD}}{}^2_H - \mathrm{MMD}^2_H(P, Q) \right) \overset{w}{\to} N(0, \sigma_h^2)$ as $m \to \infty$.
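The U-statistic can be computed with Gram matrices by zeroing their diagonals, which removes the $i = j$ terms. A sketch assuming a Gaussian kernel and $m = n$ (the bandwidth is an illustrative choice):

```python
import numpy as np

def mmd2_u(X, Y, sigma=1.0):
    # Unbiased U-statistic estimate of MMD^2 (m = n), Gaussian kernel
    m = X.shape[0]

    def gram(A, B):
        d2 = (np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    np.fill_diagonal(Kxx, 0.0)  # U-statistic: drop the i = j terms
    np.fill_diagonal(Kyy, 0.0)
    np.fill_diagonal(Kxy, 0.0)  # h excludes the i = j cross terms as well
    return (Kxx.sum() + Kyy.sum() - 2.0 * Kxy.sum()) / (m * (m - 1))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
Y_same = rng.normal(size=(300, 2))         # H0: same distribution
Y_shift = rng.normal(size=(300, 2)) + 1.0  # H1: mean-shifted
print(mmd2_u(X, Y_same), mmd2_u(X, Y_shift))
```

Because the estimator is unbiased, it can be slightly negative under $H_0$; it concentrates around the population $\mathrm{MMD}^2$ under $H_1$.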
94 Statistical Test: Asymptotic Distribution (Gretton et al., NIPS 2006, JMLR 2012)

$\alpha$-level test: Estimate the $1-\alpha$ quantile of the null distribution using the bootstrap. Picture credit: A. Gretton.

Computationally intensive!!
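In practice the null quantile is often estimated by a permutation test, i.e., repeatedly re-splitting the pooled sample; this sketch uses that in place of the bootstrap, with a plain linear-kernel statistic (not characteristic, and purely illustrative) to keep the example short:

```python
import numpy as np

def mmd2_lin_kernel(X, Y):
    # Biased MMD^2 with the linear kernel: reduces to the squared
    # Euclidean distance between the two sample means.
    d = X.mean(axis=0) - Y.mean(axis=0)
    return float(d @ d)

def permutation_threshold(X, Y, stat, alpha=0.05, B=200, seed=0):
    # (1 - alpha) quantile of the statistic over random re-splits of the
    # pooled sample, approximating the null distribution under P = Q
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    m = X.shape[0]
    null = np.array([stat(Z[p[:m]], Z[p[m:]])
                     for p in (rng.permutation(len(Z)) for _ in range(B))])
    return float(np.quantile(null, 1 - alpha))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
Y = rng.normal(size=(100, 1)) + 1.0
t = permutation_threshold(X, Y, mmd2_lin_kernel)
print(mmd2_lin_kernel(X, Y), t)  # observed statistic vs. critical value
```

Any of the MMD estimators can be passed in as `stat`; the re-splitting is exactly what makes the procedure computationally intensive.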
95 Statistical Test Without Bootstrap (Gretton et al., NIPS 2009)

Estimate the eigenvalues $\lambda_i$ from the combined sample:

Define $Z := (X_1, \ldots, X_m, Y_1, \ldots, Y_m)$ and $K_{ij} := k(Z_i, Z_j)$.

Compute the eigenvalues $\hat{\lambda}_i$ of $\tilde{K} = HKH$, where $H = I - \frac{1}{2m} \mathbf{1}_{2m} \mathbf{1}_{2m}^T$.

$\alpha$-level test: Compute the $1-\alpha$ quantile of the distribution associated with $\sum_{i=1}^{2m} \hat{\lambda}_i (\theta_i^2 - 2)$.

The test is asymptotically of level $\alpha$ and consistent.
96 Experiments (Gretton et al., NIPS 2009)

Comparison example: Canadian Hansard corpus (agriculture, fisheries and immigration).

Samples: 5-line extracts. Kernel: $k$-spectrum kernel with $k = 10$. Sample size: 10. Repetitions: 300. Compute $\widehat{\mathrm{MMD}}{}^2_H$.

$k$-spectrum kernel: average Type-II error 0 ($\alpha = 0.05$). Bag-of-words kernel: average Type-II error 0.18.

First ever test on structured data.
97 Choice of Characteristic Kernel
98 Choice of Characteristic Kernels

Let $\mathcal{X} = \mathbb{R}^d$. Suppose $k$ is a Gaussian kernel, $k_\sigma(x, y) = e^{-\|x - y\|_2^2 / (2\sigma^2)}$.

$\mathrm{MMD}_{H_\sigma}$ is a function of $\sigma$, so $(\mathrm{MMD}_{H_\sigma})_{\sigma > 0}$ is a family of metrics. Which one should we use in practice?

Note that $\mathrm{MMD}_{H_\sigma}(P_m, Q_m) \to 0$ as $\sigma \to 0$ or $\sigma \to \infty$. Therefore, the kernel choice is very critical in applications.

Heuristics:

Median: $\sigma = \mathrm{median}\left\{ \|Z_i - Z_j\|_2 : i \ne j,\ i, j = 1, \ldots, 2m \right\}$, where $Z = ((X_i)_i, (Y_i)_i)$ (Gretton et al., NIPS 2006, NIPS 2009, JMLR 2012).

Choose the test statistic to be $\mathrm{MMD}_{H_{\sigma^*}}(P_m, Q_m)$, where $\sigma^* = \arg\max_{\sigma \in (0, \infty)} \mathrm{MMD}_{H_\sigma}(P_m, Q_m)$ (S et al., NIPS 2009).
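The median heuristic above can be sketched directly from the pooled pairwise distances (`numpy` only):

```python
import numpy as np

def median_heuristic(X, Y):
    # sigma = median of pairwise Euclidean distances over the pooled sample
    Z = np.vstack([X, Y])
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))
    iu = np.triu_indices_from(D, k=1)  # i < j only, excluding the diagonal
    return float(np.median(D[iu]))

# Pooled points {0, 1, 3} have pairwise distances {1, 2, 3}, so the median is 2
print(median_heuristic(np.array([[0.0], [1.0]]), np.array([[3.0]])))  # → 2.0
```

For large samples, the $O(m^2)$ distance matrix is usually computed on a random subsample.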
102 Classes of Characteristic Kernels (S et al., NIPS 2009)

More generally, we use $\mathrm{MMD}(P, Q) := \sup_{k \in \mathcal{K}} \mathrm{MMD}_{H_k}(P, Q)$.

Examples for $\mathcal{K}$:

$\mathcal{K}_g := \{ e^{-\sigma \|x - y\|_2^2},\ x, y \in \mathbb{R}^d : \sigma \in \mathbb{R}_+ \}$.
$\mathcal{K}_{lin} := \{ k_\lambda = \sum_{i=1}^{l} \lambda_i k_i : k_\lambda \text{ is pd},\ \sum_{i=1}^{l} \lambda_i = 1 \}$.
$\mathcal{K}_{con} := \{ k_\lambda = \sum_{i=1}^{l} \lambda_i k_i : \lambda_i \ge 0,\ \sum_{i=1}^{l} \lambda_i = 1 \}$.

Test: $\alpha$-level test: Estimate the $1-\alpha$ quantile of the null distribution of $\mathrm{MMD}(P_m, Q_m)$ using the bootstrap.

Test consistency: Based on the functional central limit theorem for U-processes indexed by a VC-subgraph class $\mathcal{K}$.

Computational disadvantage!!
104 Experiments

$q = N(0, \sigma_q^2)$; $p(x) = q(x)(1 + \sin \nu x)$.

[Figure: densities $p$ and $q$ for increasing perturbation frequencies, up to $\nu = 7.5$.]

$k(x, y) = \exp(-(x - y)^2 / \sigma)$. Test statistics: $\mathrm{MMD}(P_m, Q_m)$ and $\mathrm{MMD}_{H_\sigma}(P_m, Q_m)$ for various $\sigma$.
105 Experiments

[Figure: Type-I and Type-II errors (in %) of the $\mathrm{MMD}(P_m, Q_m)$ test as a function of $\nu$.]
106 Experiments

[Figure: Type-I and Type-II errors (in %) of the $\mathrm{MMD}_{H_\sigma}(P_m, Q_m)$ test as a function of $\log \sigma$, for $\nu = 0.5, 0.75, 1.0, 1.25, 1.5$; the median heuristic and $\sigma^*$ are marked.]
107 Choice of Characteristic Kernels (Gretton et al., NIPS 2012)

Choose a kernel that minimizes the Type-II error for a given Type-I error:

$k^* \in \arg\inf_{k \in \mathcal{K} :\ \text{Type-I}(k) \le \alpha} \text{Type-II}(k)$.

Not easy to compute with the asymptotic distribution of the U-statistic $\widehat{\mathrm{MMD}}{}^2_{H_k}(P_m, Q_m)$.

Modified statistic: Average of U-statistics computed on independent blocks of size 2.

$\widehat{\mathrm{MMD}}{}^2_{H_k, lin}(P_m, Q_m) = \frac{2}{m} \sum_{i=1}^{m/2} \underbrace{k(X_{2i-1}, X_{2i}) + k(Y_{2i-1}, Y_{2i}) - k(X_{2i-1}, Y_{2i}) - k(Y_{2i-1}, X_{2i})}_{h_k(Z_i)}$,

where $Z_i = (X_{2i-1}, X_{2i}, Y_{2i-1}, Y_{2i})$.

Recall $\widehat{\mathrm{MMD}}{}^2_H := \frac{1}{m(m-1)} \sum_{i \ne j}^{m} \underbrace{k(X_i, X_j) + k(Y_i, Y_j) - k(X_i, Y_j) - k(X_j, Y_i)}_{h((X_i, Y_i), (X_j, Y_j))}$
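The block statistic is a single vectorized pass over the data. A sketch with a Gaussian kernel (the bandwidth is an illustrative assumption); it also returns the sample standard deviation of the $h_k(Z_i)$, which the test threshold on the next slide needs:

```python
import numpy as np

def mmd2_block(X, Y, sigma=1.0):
    # Average of the U-statistic kernel h_k over independent blocks of
    # size 2: an O(m) estimator of MMD^2, here with a Gaussian kernel.
    m = (min(len(X), len(Y)) // 2) * 2
    k = lambda a, b: np.exp(-((a - b) ** 2).sum(axis=-1) / (2.0 * sigma ** 2))
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(y1, x2)
    return h.mean(), h.std(ddof=1)  # statistic and plug-in sigma_hat_{h_k}

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 1))
Y0 = rng.normal(size=(2000, 1))         # H0
Y1 = rng.normal(size=(2000, 1)) + 1.0   # H1
print(mmd2_block(X, Y0), mmd2_block(X, Y1))
```

Each block contributes one i.i.d. draw of $h_k$, which is what makes the normal asymptotics on the next slides hold.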
109 Modified Statistic

Advantages:

$\widehat{\mathrm{MMD}}{}^2_{H, lin}$ is computable in $O(m)$, while $\widehat{\mathrm{MMD}}{}^2_H$ requires $O(m^2)$ computations.

Under $H_0$, $\sqrt{m}\, \widehat{\mathrm{MMD}}{}^2_{H_k, lin}(P_m, Q_m) \overset{w}{\to} N(0, 2\sigma_{h_k}^2)$, where $\sigma_{h_k}^2 = E_Z h_k^2(Z) - (E_Z h_k(Z))^2$, assuming $0 < E_Z h_k^2(Z) < \infty$.

The asymptotic distribution is normal, as opposed to a weighted sum of infinitely many $\chi^2$ variables. Therefore, the test threshold is easy to compute.

Disadvantages:

Larger variance. Smaller power.
111 Type-I and Type-II Errors

Test threshold: For a given $k$ and $\alpha$,

$t_{k, 1-\alpha} = \sqrt{\tfrac{2}{m}}\, \sigma_{h_k}\, \Phi_N^{-1}(1 - \alpha)$,

where $\Phi_N$ is the cdf of $N(0, 1)$.

Type-II error:

$\Phi_N\left( \Phi_N^{-1}(1 - \alpha) - \frac{\mathrm{MMD}^2_{H_k}(P, Q)\, \sqrt{m}}{\sqrt{2}\, \sigma_{h_k}} \right)$
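Numerically, the threshold and the Type-II error are one-liners via the standard normal quantile. This is a reconstruction from the asymptotic null $N(0, 2\sigma_{h_k}^2/m)$; the exact normalization in the original slides may differ:

```python
from math import sqrt
from statistics import NormalDist

def block_test_threshold(sigma_h, m, alpha=0.05):
    # t_{k, 1-alpha}: the (1 - alpha) quantile of N(0, 2 sigma_h^2 / m)
    return sqrt(2.0 / m) * sigma_h * NormalDist().inv_cdf(1.0 - alpha)

def type2_error(mmd2, sigma_h, m, alpha=0.05):
    # Phi( Phi^{-1}(1 - alpha) - mmd2 * sqrt(m) / (sqrt(2) * sigma_h) )
    z = NormalDist().inv_cdf(1.0 - alpha) - mmd2 * sqrt(m) / (sqrt(2.0) * sigma_h)
    return NormalDist().cdf(z)

print(block_test_threshold(1.0, 100))  # ≈ 0.2326
```

Both quantities shrink as $m$ grows, which is the handle the kernel-selection criterion on the next slides exploits.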
112 Best Kernel: Minimizes Type-II Error

Since $\Phi_N$ is a strictly increasing function, the Type-II error is minimized by maximizing $\frac{\mathrm{MMD}^2_{H_k}(P, Q)}{\sigma_{h_k}}$.

Optimal kernel: $k^* \in \arg\sup_{k \in \mathcal{K}} \frac{\mathrm{MMD}^2_{H_k}(P, Q)}{\sigma_{h_k}}$.

Since $\mathrm{MMD}^2_{H_k}$ and $\sigma_{h_k}$ depend on the unknown $P$ and $Q$, we split the data into train and test sets, estimate $k^*$ on the train set as $\hat{k}$, and evaluate the threshold $t_{\hat{k}, 1-\alpha}$ on the test set.
113 Data-Dependent Kernel

Train data: Compute $\widehat{\mathrm{MMD}}{}^2_{H_k}$ and $\hat{\sigma}_{h_k}$, and set

$\hat{k} \in \arg\sup_{k \in \mathcal{K}} \frac{\widehat{\mathrm{MMD}}{}^2_{H_k}}{\hat{\sigma}_{h_k} + \lambda_m}$

for some $\lambda_m \to 0$ as $m \to \infty$.

Test data: Compute $\widehat{\mathrm{MMD}}{}^2_{H_{\hat{k}}}$, $\hat{\sigma}_{h_{\hat{k}}}$ and $t_{\hat{k}, 1-\alpha}$. If $\widehat{\mathrm{MMD}}{}^2_{H_{\hat{k}}} > t_{\hat{k}, 1-\alpha}$, reject $H_0$; else accept.

Similar results were recently obtained for the U-statistic $\widehat{\mathrm{MMD}}{}^2_{H_k}$ (Sutherland et al., ICLR 2017).
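A sketch of the train-split selection over a grid of Gaussian bandwidths, using the linear-time block statistic as the power proxy $\widehat{\mathrm{MMD}}{}^2 / (\hat{\sigma}_h + \lambda_m)$ (the grid and the $\lambda_m$ value are illustrative assumptions):

```python
import numpy as np

def block_h(X, Y, sigma):
    # h_k values on independent blocks of size 2, Gaussian kernel
    m = (min(len(X), len(Y)) // 2) * 2
    k = lambda a, b: np.exp(-((a - b) ** 2).sum(axis=-1) / (2.0 * sigma ** 2))
    return (k(X[0:m:2], X[1:m:2]) + k(Y[0:m:2], Y[1:m:2])
            - k(X[0:m:2], Y[1:m:2]) - k(Y[0:m:2], X[1:m:2]))

def select_bandwidth(X_tr, Y_tr, sigmas, lam=1e-4):
    # Maximize the regularized ratio mean(h) / (std(h) + lambda_m) over the grid
    def ratio(s):
        h = block_h(X_tr, Y_tr, s)
        return h.mean() / (h.std(ddof=1) + lam)
    return max(sigmas, key=ratio)

rng = np.random.default_rng(3)
X_tr = rng.normal(size=(1000, 1))
Y_tr = rng.normal(size=(1000, 1)) + 1.0
sigmas = [0.1, 0.5, 1.0, 2.0, 5.0]
print(select_bandwidth(X_tr, Y_tr, sigmas))
```

The selected bandwidth is then fixed, and the statistic and threshold are recomputed on the held-out test split so that the test remains level $\alpha$.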
115 Learning the Kernel

Define the family of kernels as follows:

$\mathcal{K} := \left\{ k : k = \sum_{i=1}^{l} \beta_i k_i,\ \beta_i \ge 0,\ i \in [l] \right\}$.

If all $k_i$ are characteristic and $\beta_i > 0$ for some $i \in [l]$, then $k$ is characteristic.

$\mathrm{MMD}^2_{H_k}(P, Q) = \sum_{i=1}^{l} \beta_i\, \mathrm{MMD}^2_{H_{k_i}}(P, Q)$, $\quad \sigma_k^2 = \sum_{i,j=1}^{l} \beta_i \beta_j\, \mathrm{cov}(h_{k_i}, h_{k_j})$,

where $h_{k_i}(x, x', y, y') = k_i(x, x') + k_i(y, y') - k_i(x, y') - k_i(x', y)$.

Objective: $\beta^* = \arg\max_{\beta \ge 0} \frac{\beta^T \eta}{\sqrt{\beta^T W \beta}}$, where $\eta := (\mathrm{MMD}^2_{H_{k_i}}(P, Q))_i$ and $W := (\mathrm{cov}(h_{k_i}, h_{k_j}))_{i,j}$.
118 Optimization

$\hat{\beta}_\lambda = \arg\max_{\beta \ge 0} \frac{\beta^T \hat{\eta}}{\sqrt{\beta^T (\hat{W} + \lambda I) \beta}}$

If $\hat{\eta}$ has at least one positive element, the optimal objective value is strictly positive, and so

$\hat{\beta}_\lambda = \arg\min_{\beta} \left\{ \beta^T (\hat{W} + \lambda I) \beta : \beta^T \hat{\eta} = 1,\ \beta \ge 0 \right\}$.

On the test data: Compute $\widehat{\mathrm{MMD}}{}^2_{H_{\hat{k}}}$ using $\hat{k} = \sum_{i=1}^{l} \hat{\beta}_{\lambda, i} k_i$. Compute the test threshold $\hat{t}_{\hat{k}, 1-\alpha}$ using $\hat{\sigma}_{\hat{k}}$.
120 Experiments

$P$ and $Q$ are mixtures of two-dimensional Gaussians. $P$ has unit covariance in each component. $Q$ has correlated Gaussians, with $\varepsilon$ the ratio of the largest to the smallest covariance eigenvalue. The testing problem becomes harder as $\varepsilon \to 1$ and as the number of mixture components grows.
121 Competing Approaches

Median heuristic.

Max. MMD: $\sup_{k \in \mathcal{K}} \widehat{\mathrm{MMD}}{}^2_{H_k}(P_m, Q_m)$, i.e., choose the $k \in \mathcal{K}$ with the largest $\widehat{\mathrm{MMD}}{}^2_{H_k}(P_m, Q_m)$; same as maximizing $\beta^T \hat{\eta}$ subject to $\|\beta\|_1 \le 1$.

$\ell_2$ statistic: maximize $\beta^T \hat{\eta}$ subject to $\|\beta\|_2 \le 1$.

Cross-validation on the training set.
122 Results

$m = 10{,}000$ (for training and test). Results are averaged over 617 trials.
123 Results [Figure: optimize $\widehat{\mathrm{MMD}}{}^2_{H_k} / \hat{\sigma}_k$]

124 Results [Figure: maximize $\widehat{\mathrm{MMD}}{}^2_{H_k}$ with a constraint on $\beta$]

125 Results [Figure: median heuristic]
126 References I

Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press, Cambridge, UK.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, Cambridge, MA. MIT Press.

Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21.

Gretton, A. (2015). A simpler condition for consistency of a kernel independence test. arXiv preprint.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two sample problem. In Advances in Neural Information Processing Systems 19. MIT Press.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2012a). A kernel two-sample test. Journal of Machine Learning Research, 13.

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005a). Measuring statistical dependence with Hilbert-Schmidt norms. In Jain, S., Simon, H. U., and Tomita, E., editors, Proceedings of Algorithmic Learning Theory, pages 63-77, Berlin. Springer-Verlag.

Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. (2010). A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems 22, Cambridge, MA. MIT Press.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. (2005b). Kernel methods for measuring independence. Journal of Machine Learning Research, 6.

Gretton, A., Smola, A., Bousquet, O., Herbrich, R., Belitski, A., Augath, M., Murayama, Y., Pauls, J., Schölkopf, B., and Logothetis, N. (2005c). Kernel constrained covariance for dependence measurement. In Ghahramani, Z. and Cowell, R., editors, Proc. 10th International Workshop on Artificial Intelligence and Statistics, pages 1-8.
127 References II

Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., and Fukumizu, K. (2012b). Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems 24, Cambridge, MA. MIT Press.

Muandet, K., Fukumizu, K., Sriperumbudur, B. K., and Schölkopf, B. (2016a). Kernel mean embedding of distributions: A review and beyond. arXiv preprint.

Muandet, K., Sriperumbudur, B. K., Fukumizu, K., Gretton, A., and Schölkopf, B. (2016b). Kernel mean shrinkage estimators. Journal of Machine Learning Research, 17(48):1-41.

Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29.

Ramdas, A., Reddi, S. J., Póczos, B., Singh, A., and Wasserman, L. (2015). On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proc. of 29th AAAI Conference on Artificial Intelligence.

Simon-Gabriel, C. and Schölkopf, B. (2016). Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. arXiv preprint.

Smola, A. J., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory. Springer-Verlag, Berlin, Germany.

Sriperumbudur, B. K. (2016). On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3).

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. G. (2012). On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. G. (2011). Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12.
128 References III

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. G. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11.

Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer.

Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. (2017). Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations.

Tolstikhin, I., Sriperumbudur, B. K., and Muandet, K. (2016). Minimax estimation of kernel mean embeddings. arXiv preprint.
Forefront of the Two Sample Problem From classical to state-of-the-art methods Yuchi Matsuoka What is the Two Sample Problem? X ",, X % ~ P i. i. d Y ",, Y - ~ Q i. i. d Two Sample Problem P = Q? Example:
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationTopics in kernel hypothesis testing
Topics in kernel hypothesis testing Kacper Chwialkowski A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of University College London. Department
More informationTheory of Positive Definite Kernel and Reproducing Kernel Hilbert Space
Theory of Positive Definite Kernel and Reproducing Kernel Hilbert Space Statistical Inference with Reproducing Kernel Hilbert Space Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department
More informationarxiv: v1 [cs.lg] 22 Jun 2009
Bayesian two-sample tests arxiv:0906.4032v1 [cs.lg] 22 Jun 2009 Karsten M. Borgwardt 1 and Zoubin Ghahramani 2 1 Max-Planck-Institutes Tübingen, 2 University of Cambridge June 22, 2009 Abstract In this
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationKernel Adaptive Metropolis-Hastings
Kernel Adaptive Metropolis-Hastings Arthur Gretton,?? Gatsby Unit, CSML, University College London NIPS, December 2015 Arthur Gretton (Gatsby Unit, UCL) Kernel Adaptive Metropolis-Hastings 12/12/2015 1
More informationStrictly Positive Definite Functions on a Real Inner Product Space
Strictly Positive Definite Functions on a Real Inner Product Space Allan Pinkus Abstract. If ft) = a kt k converges for all t IR with all coefficients a k 0, then the function f< x, y >) is positive definite
More informationKernel Sequential Monte Carlo
Kernel Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) * equal contribution April 25, 2016 1 / 37 Section
More informationUnderstanding the relationship between Functional and Structural Connectivity of Brain Networks
Understanding the relationship between Functional and Structural Connectivity of Brain Networks Sashank J. Reddi Machine Learning Department, Carnegie Mellon University SJAKKAMR@CS.CMU.EDU Abstract Background.
More informationLinear and Non-Linear Dimensionality Reduction
Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.215 and 7.5.215 Overview Dimensionality Reduction Motivation Linear Projections
More informationStatistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators. Michel Besserve MPI for Intelligent Systems, Tübingen michel.besserve@tuebingen.mpg.de Nikos K. Logothetis MPI
More informationNonparametric Indepedence Tests: Space Partitioning and Kernel Approaches
Nonparametric Indepedence Tests: Space Partitioning and Kernel Approaches Arthur Gretton 1 and László Györfi 2 1. Gatsby Computational Neuroscience Unit London, UK 2. Budapest University of Technology
More informationCharacterizing Independence with Tensor Product Kernels
Characterizing Independence with Tensor Product Kernels Zolta n Szabo CMAP, E cole Polytechnique Joint work with: Bharath K. Sriperumbudur Department of Statistics, PSU December 13, 2017 Zolta n Szabo
More informationChapter 9. Support Vector Machine. Yongdai Kim Seoul National University
Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved
More informationA Kernelized Stein Discrepancy for Goodness-of-fit Tests
Qiang Liu QLIU@CS.DARTMOUTH.EDU Computer Science, Dartmouth College, NH, 3755 Jason D. Lee JASONDLEE88@EECS.BERKELEY.EDU Michael Jordan JORDAN@CS.BERKELEY.EDU Department of Electrical Engineering and Computer
More informationMIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design
MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation
More informationOn Integral Probability Metrics, φ-divergences and Binary Classification
On Integral Probability etrics, φ-divergences and Binary Classification Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf and Gert R. G. Lanckriet arxiv:090.698v4 [cs.it] 3 Oct
More informationEfficient Complex Output Prediction
Efficient Complex Output Prediction Florence d Alché-Buc Joint work with Romain Brault, Alex Lambert, Maxime Sangnier October 12, 2017 LTCI, Télécom ParisTech, Institut-Mines Télécom, Université Paris-Saclay
More informationInverse Density as an Inverse Problem: the Fredholm Equation Approach
Inverse Density as an Inverse Problem: the Fredholm Equation Approach Qichao Que, Mikhail Belkin Department of Computer Science and Engineering The Ohio State University {que,mbelkin}@cse.ohio-state.edu
More informationAdaGAN: Boosting Generative Models
AdaGAN: Boosting Generative Models Ilya Tolstikhin ilya@tuebingen.mpg.de joint work with Gelly 2, Bousquet 2, Simon-Gabriel 1, Schölkopf 1 1 MPI for Intelligent Systems 2 Google Brain Radford et al., 2015)
More informationHilbertian Metrics and Positive Definite Kernels on Probability Measures
Max Planck Institut für biologische Kybernetik Max Planck Institute for Biological Cybernetics Technical Report No. 126 Hilbertian Metrics and Positive Definite Kernels on Probability Measures Matthias
More informationModel Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao
Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More informationA Wild Bootstrap for Degenerate Kernel Tests
A Wild Bootstrap for Degenerate Kernel Tests Kacper Chwialkowski Department of Computer Science University College London London, Gower Street, WCE 6BT kacper.chwialkowski@gmail.com Dino Sejdinovic Gatsby
More informationUnlabeled Data: Now It Helps, Now It Doesn t
institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationKernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444
Kernel Methods Jean-Philippe Vert Jean-Philippe.Vert@mines.org Last update: Jan 2015 Jean-Philippe Vert (Mines ParisTech) 1 / 444 What we know how to solve Jean-Philippe Vert (Mines ParisTech) 2 / 444
More informationKernel methods and the exponential family
Kernel methods and the exponential family Stéphane Canu 1 and Alex J. Smola 2 1- PSI - FRE CNRS 2645 INSA de Rouen, France St Etienne du Rouvray, France Stephane.Canu@insa-rouen.fr 2- Statistical Machine
More informationStatistical Optimality of Stochastic Gradient Descent through Multiple Passes
Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Loucas Pillaud-Vivien
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationMeasuring Statistical Dependence with Hilbert-Schmidt Norms
Max Planck Institut für biologische Kybernetik Max Planck Institute for Biological Cybernetics Technical Report No. 140 Measuring Statistical Dependence with Hilbert-Schmidt Norms Arthur Gretton, 1 Olivier
More informationKernel Method: Data Analysis with Positive Definite Kernels
Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University
More informationBayesian Regularization
Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors
More informationStatistical inference on Lévy processes
Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline
More informationDirect Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Classification Logistic regression models for classification problem We consider two class problem: Y {0, 1}. The Bayes rule for the classification is I(P(Y = 1 X = x) > 1/2) so
More informationCausal Inference in Time Series via Supervised Learning
Causal Inference in Time Series via Supervised Learning Yoichi Chikahara and Akinori Fujino NTT Communication Science Laboratories, Kyoto 619-0237, Japan chikahara.yoichi@lab.ntt.co.jp, fujino.akinori@lab.ntt.co.jp
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More informationA Bahadur Representation of the Linear Support Vector Machine
A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support
More informationImportance Reweighting Using Adversarial-Collaborative Training
Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a
More informationSTATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization
More informationOnline Gradient Descent Learning Algorithms
DISI, Genova, December 2006 Online Gradient Descent Learning Algorithms Yiming Ying (joint work with Massimiliano Pontil) Department of Computer Science, University College London Introduction Outline
More information