Hilbert Space Embedding of Probability Measures


1 Lecture 2 Hilbert Space Embedding of Probability Measures Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Machine Learning Summer School Tübingen, 2017

2 Recap of Lecture 1
Kernel methods provide an elegant way to obtain non-linear algorithms from linear algorithms.
Input space, $\mathcal{X}$: the space of observed data on which learning is performed.
Feature map, $\Phi$: defined through a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $x \mapsto \Phi(x)$, $x \in \mathcal{X}$.
Constructing linear algorithms in the feature space $\Phi(\mathcal{X})$ translates into non-linear algorithms on $\mathcal{X}$.
Elegance: no explicit construction of $\Phi$ is needed, since $\langle \Phi(x), \Phi(y)\rangle = k(x, y)$.
Function space view: RKHS; smoothness and generalization.
Examples: ridge regression. In fact many more (kernel + SVM/PCA/FDA/CCA/perceptron/logistic regression, ...).

3 Outline
Motivating example: Comparing distributions
Hilbert space embedding of measures
  Mean element
  Distance on probabilities (MMD)
  Characteristic kernels
  Cross-covariance operator and measure of independence
Applications
  Two-sample testing
  Choice of kernel

4 Co-authors Sivaraman Balakrishnan (Statistics, Carnegie Mellon University) Kenji Fukumizu (Institute for Statistical Mathematics, Tokyo) Arthur Gretton (Gatsby Unit, University College London) Gert Lanckriet (Electrical Engineering, University of California, San Diego) Krikamol Muandet (Mathematics, Mahidol University, Bangkok) Massimiliano Pontil (Computer Science, University College London) Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, Tübingen) Dino Sejdinovic (Statistics, University of Oxford) Heiko Strathmann (Gatsby Unit, University College London) Ilya Tolstikhin (Max Planck Institute for Intelligent Systems, Tübingen)

5 Motivating Example: Coin Toss
Toss 1: T H H H T T H T T H H T H
Toss 2: H T T H T H T T H H H T T
Are the coins/tosses statistically similar?
Toss 1 is a sample from P := Bernoulli(p) and Toss 2 is a sample from Q := Bernoulli(q). Is p = q or not? I.e., compare
$\mathbb{E}_P[X] = \int_{\{0,1\}} x\, dP(x)$ and $\mathbb{E}_Q[X] = \int_{\{0,1\}} x\, dQ(x)$.

7 Coin Toss Example
In other words, we compare $\int_{\{0,1\}} \Phi(x)\, dP(x)$ and $\int_{\{0,1\}} \Phi(x)\, dQ(x)$, where $\Phi$ is the identity map, $\Phi(x) = x$.
A positive definite kernel corresponding to $\Phi$ is $k(x, y) = \langle \Phi(x), \Phi(y)\rangle_{\mathbb{R}} = xy$, which is the linear kernel on $\{0, 1\}$.
Therefore, comparing two Bernoullis is equivalent to checking whether
$\int_{\{0,1\}} k(y, x)\, dP(x) \stackrel{?}{=} \int_{\{0,1\}} k(y, x)\, dQ(x)$ for all $y \in \{0, 1\}$,
i.e., we compare the expectations of the kernel.

8 Comparing two Gaussians
$P = N(\mu_1, \sigma_1^2)$ and $Q = N(\mu_2, \sigma_2^2)$.
Comparing P and Q is equivalent to comparing $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, i.e.,
$\mathbb{E}_P[X] = \int_{\mathbb{R}} x\, dP(x) \stackrel{?}{=} \int_{\mathbb{R}} x\, dQ(x) = \mathbb{E}_Q[X]$
and
$\mathbb{E}_P[X^2] = \int_{\mathbb{R}} x^2\, dP(x) \stackrel{?}{=} \int_{\mathbb{R}} x^2\, dQ(x) = \mathbb{E}_Q[X^2]$.
Concisely,
$\int_{\mathbb{R}} \Phi(x)\, dP(x) \stackrel{?}{=} \int_{\mathbb{R}} \Phi(x)\, dQ(x)$, where $\Phi(x) = (x, x^2)$.
Compare the first moment of the feature map.

10 Comparing two Gaussians
Using the map $\Phi$, we can construct a positive definite kernel as
$k(x, y) = \langle \Phi(x), \Phi(y)\rangle_{\mathbb{R}^2} = xy + x^2 y^2$,
which is a polynomial kernel of order 2.
Therefore, comparing two Gaussians is equivalent to checking whether
$\int_{\mathbb{R}} k(y, x)\, dP(x) \stackrel{?}{=} \int_{\mathbb{R}} k(y, x)\, dQ(x)$ for all $y \in \mathbb{R}$,
i.e., we compare the expectations of the kernel.

11 Comparing general P and Q
The moment generating function, $M_P(y) = \int_{\mathbb{R}} e^{xy}\, dP(x)$, (if it exists) captures all the information about a distribution, i.e., $M_P = M_Q \Leftrightarrow P = Q$.
Choosing
$\Phi(x) = \left(1, x, \tfrac{x^2}{\sqrt{2!}}, \ldots, \tfrac{x^i}{\sqrt{i!}}, \ldots\right) \in \ell^2(\mathbb{N}), \quad x \in \mathbb{R},$
it is easy to verify that $k(x, y) = \langle \Phi(x), \Phi(y)\rangle_{\ell^2(\mathbb{N})} = e^{xy}$, and so
$\int_{\mathbb{R}} k(x, y)\, dP(x) = \int_{\mathbb{R}} k(x, y)\, dQ(x), \ \forall y \in \mathbb{R} \ \Leftrightarrow\ P = Q.$
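For completeness, the verification that this feature map reproduces the exponential kernel is a one-line series computation:
$$\langle \Phi(x), \Phi(y)\rangle_{\ell^2(\mathbb{N})} = \sum_{i=0}^{\infty} \frac{x^i}{\sqrt{i!}} \cdot \frac{y^i}{\sqrt{i!}} = \sum_{i=0}^{\infty} \frac{(xy)^i}{i!} = e^{xy}.$$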

12 Two-Sample Problem
Given random samples $\{X_1, \ldots, X_m\} \stackrel{i.i.d.}{\sim} P$ and $\{Y_1, \ldots, Y_n\} \stackrel{i.i.d.}{\sim} Q$.
Determine: P = Q or P ≠ Q?
Applications: microarray data (aggregation problem), speaker verification.
Independence Testing: Given random samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\} \stackrel{i.i.d.}{\sim} P_{XY}$. Does $P_{XY}$ factorize into $P_X P_Y$?
Applications: feature selection (microarrays, image and text, ...).

13 Hilbert Space Embedding of Measures

14 Hilbert Space Embedding of Measures
Canonical feature map: $\Phi(x) = k(\cdot, x) \in \mathcal{H}$, $x \in \mathcal{X}$, where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS).
Generalization to probabilities: the point mass $\delta_x$ at $x$ is mapped to $\int k(\cdot, y)\, d\delta_x(y) = \mathbb{E}_{\delta_x}[k(\cdot, Y)] = k(\cdot, x)$.
Based on this, the map is extended to probability measures as
$P \mapsto \mu_P := \int \Phi(x)\, dP(x) = \int k(\cdot, x)\, dP(x) = \mathbb{E}_P[k(\cdot, X)].$
(Smola et al., ALT 2007)
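As a concrete illustration (not from the slides), a minimal numpy sketch of the empirical kernel mean embedding $\mu_{P_m}(\cdot) = \frac{1}{m}\sum_{i=1}^m k(\cdot, X_i)$, using a Gaussian kernel; the helper names are my own.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) between rows of X and rows of Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def empirical_kernel_mean(X, sigma=1.0):
    """Return a function t -> mu_{P_m}(t) = (1/m) sum_i k(t, X_i)."""
    return lambda T: gaussian_kernel(np.atleast_2d(T), X, sigma).mean(axis=1)

# usage: embed m = 500 samples from P = N(0, 1) and evaluate mu_{P_m} on a grid
X = np.random.randn(500, 1)
mu_Pm = empirical_kernel_mean(X, sigma=0.5)
print(mu_Pm(np.linspace(-3, 3, 5)[:, None]))
```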

15 Properties
$\mu_P$ is the mean of the feature map and is called the kernel mean or mean element of P.
When is $\mu_P$ well defined?
$\int \sqrt{k(x, x)}\, dP(x) < \infty \ \Rightarrow\ \mu_P \in \mathcal{H}.$
Proof (Jensen): $\|\mu_P\|_{\mathcal{H}} = \left\| \int k(\cdot, x)\, dP(x) \right\|_{\mathcal{H}} \le \int \|k(\cdot, x)\|_{\mathcal{H}}\, dP(x) = \int \sqrt{k(x, x)}\, dP(x).$
We know that for any $f \in \mathcal{H}$, $f(x) = \langle f, k(\cdot, x)\rangle_{\mathcal{H}}$. So, for any $f \in \mathcal{H}$,
$\int f(x)\, dP(x) = \int \langle f, k(\cdot, x)\rangle_{\mathcal{H}}\, dP(x) = \left\langle f, \int k(\cdot, x)\, dP(x) \right\rangle_{\mathcal{H}} = \langle f, \mu_P\rangle_{\mathcal{H}}.$

18 Interpretation
Suppose k is translation invariant on $\mathbb{R}^d$, i.e., $k(x, y) = \psi(x - y)$, $x, y \in \mathbb{R}^d$. Then
$\mu_P = \int_{\mathbb{R}^d} \psi(\cdot - x)\, dP(x) = \psi * P,$
where $*$ denotes the convolution of $\psi$ and P.
Convolution is a smoothing operation: $\mu_P$ is a smoothed version of P.
Example: Suppose $P = \delta_y$, a point mass at y. Then $\mu_P = \psi * P = \psi(\cdot - y)$.
Example: Suppose $\psi \sim N(0, \sigma^2)$ and $P = N(\mu, \tau^2)$. Then $\mu_P = \psi * P \sim N(\mu, \sigma^2 + \tau^2)$: $\mu_P$ is a wider Gaussian than P.

21 Comparing Kernel Means
Define a distance (maximum mean discrepancy) on probabilities:
$\mathrm{MMD}_{\mathcal{H}}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$ (Gretton et al., NIPS 2006; Smola et al., ALT 2007)
$\mathrm{MMD}^2_{\mathcal{H}}(P, Q) = \langle \mu_P, \mu_P\rangle_{\mathcal{H}} + \langle \mu_Q, \mu_Q\rangle_{\mathcal{H}} - 2\langle \mu_P, \mu_Q\rangle_{\mathcal{H}}$
$= \int \mu_P(x)\, dP(x) + \int \mu_Q(x)\, dQ(x) - 2\int \mu_P(x)\, dQ(x)$
$= \iint k(x, y)\, dP(x)\, dP(y) + \iint k(x, y)\, dQ(x)\, dQ(y) - 2\iint k(x, y)\, dP(x)\, dQ(y)$
$= \mathbb{E}_P\, k(X, X') + \mathbb{E}_Q\, k(Y, Y') - 2\, \mathbb{E}_{P,Q}\, k(X, Y)$
(average similarity between points from P, between points from Q, and between points from P and Q, respectively).
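To make the last expression concrete, here is a small sketch (my own illustration, not from the slides) that evaluates the three kernel expectations in closed form when P and Q are one-dimensional Gaussians and k is a Gaussian kernel, using the standard Gaussian integral $\mathbb{E}\, e^{-D^2/(2\sigma^2)} = \tfrac{\sigma}{\sqrt{\sigma^2+v}}\, e^{-\mu_D^2/(2(\sigma^2+v))}$ for $D \sim N(\mu_D, v)$:

```python
import numpy as np

def expected_gauss_kernel(mu_d, var_d, sigma):
    """E[exp(-D^2 / (2 sigma^2))] for D ~ N(mu_d, var_d)."""
    s2 = sigma**2 + var_d
    return sigma / np.sqrt(s2) * np.exp(-mu_d**2 / (2 * s2))

def mmd2_gaussians(m1, s1, m2, s2, sigma=1.0):
    """Population MMD^2 between N(m1, s1^2) and N(m2, s2^2) with a Gaussian kernel."""
    e_pp = expected_gauss_kernel(0.0, 2 * s1**2, sigma)          # E_P k(X, X')
    e_qq = expected_gauss_kernel(0.0, 2 * s2**2, sigma)          # E_Q k(Y, Y')
    e_pq = expected_gauss_kernel(m1 - m2, s1**2 + s2**2, sigma)  # E_{P,Q} k(X, Y)
    return e_pp + e_qq - 2 * e_pq

print(mmd2_gaussians(0.0, 1.0, 0.5, 1.0))   # distinct means -> strictly positive
print(mmd2_gaussians(0.0, 1.0, 0.0, 1.0))   # identical Gaussians -> 0
```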

27 Comparing Kernel Means
In the motivating examples, we compare P and Q by comparing
$\mu_P(y) = \int k(y, x)\, dP(x)$ and $\mu_Q(y) = \int k(y, x)\, dQ(x)$, for all y.
For any $f \in \mathcal{H}$,
$\|f\|_\infty = \sup_y |f(y)| = \sup_y |\langle f, k(\cdot, y)\rangle_{\mathcal{H}}| \le \sup_y \sqrt{k(y, y)}\, \|f\|_{\mathcal{H}}.$
$\Rightarrow\ \|\mu_P - \mu_Q\|_\infty \le \sup_y \sqrt{k(y, y)}\, \|\mu_P - \mu_Q\|_{\mathcal{H}}.$
Does $\|\mu_P - \mu_Q\|_{\mathcal{H}} = 0 \Rightarrow P = Q$? (More on this later)

30 Integral Probability Metric
The integral probability metric between P and Q is defined as
$\mathrm{IPM}(P, Q, \mathcal{F}) := \sup_{f \in \mathcal{F}} \left| \int f(x)\, dP(x) - \int f(x)\, dQ(x) \right| = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_P f(X) - \mathbb{E}_Q f(Y) \right|.$ (Müller, 1997)
$\mathcal{F}$ controls the degree of distinguishability between P and Q.
Related to the Bayes risk of a certain classification problem (S et al., NIPS 2009; EJS 2012).
Example: Suppose $\mathcal{F} = \{a\,x, x \in \mathbb{R} : a \in [-1, 1]\}$. Then
$\mathrm{IPM}(P, Q, \mathcal{F}) = \sup_{a \in [-1,1]} \left| a \int_{\mathbb{R}} x\, dP(x) - a \int_{\mathbb{R}} x\, dQ(x) \right|.$

32 Integral Probability Metric
Example: Suppose $\mathcal{F} = \{a\,x + b\,x^2, x \in \mathbb{R} : a^2 + b^2 = 1\}$. Then
$\mathrm{IPM}(P, Q, \mathcal{F}) = \sup_{a^2+b^2=1} \left| a \int_{\mathbb{R}} x\, d(P-Q)(x) + b \int_{\mathbb{R}} x^2\, d(P-Q)(x) \right| = \sqrt{ \left( \int_{\mathbb{R}} x\, d(P-Q)(x) \right)^2 + \left( \int_{\mathbb{R}} x^2\, d(P-Q)(x) \right)^2 }.$
How? Exercise!
The richer $\mathcal{F}$ is, the finer the resolvability of P and Q.
We will explore the relation of $\mathrm{MMD}_{\mathcal{H}}(P, Q)$ to $\mathrm{IPM}(P, Q, \mathcal{F})$.
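One way to see the exercise (filled in here for completeness): write $u = \int x\, d(P-Q)(x)$ and $v = \int x^2\, d(P-Q)(x)$. By Cauchy–Schwarz,
$$\sup_{a^2 + b^2 = 1} |a u + b v| = \|(u, v)\|_2 = \sqrt{u^2 + v^2},$$
with the supremum attained at $(a, b) = (u, v)/\sqrt{u^2 + v^2}$ whenever $(u, v) \neq 0$.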

33 Integral Probability Metric
Classical results for $\mathrm{IPM}(P, Q, \mathcal{F}) := \sup_{f \in \mathcal{F}} \left| \int f\, dP - \int f\, dQ \right|$:
$\mathcal{F}$ = unit Lipschitz ball: Wasserstein distance (Dudley, 2002)
$\mathcal{F}$ = unit bounded-Lipschitz ball: Dudley metric (Dudley, 2002)
$\mathcal{F} = \{\mathbf{1}_{(-\infty, t]} : t \in \mathbb{R}^d\}$: Kolmogorov metric (Müller, 1997)
$\mathcal{F}$ = unit ball in bounded measurable functions: total variation distance (Dudley, 2002)
For all these $\mathcal{F}$, $\mathrm{IPM}(P, Q, \mathcal{F}) = 0 \Leftrightarrow P = Q$.
(Gretton et al., NIPS 2006, JMLR 2012; S et al., COLT 2008): $\mathcal{F}$ = unit ball in an RKHS $\mathcal{H}$ with bounded kernel k. Then $\mathrm{MMD}_{\mathcal{H}}(P, Q) = \mathrm{IPM}(P, Q, \mathcal{F})$.
Proof: $\int f(x)\, d(P-Q)(x) = \langle f, \mu_P - \mu_Q\rangle_{\mathcal{H}} \le \|f\|_{\mathcal{H}}\, \|\mu_P - \mu_Q\|_{\mathcal{H}}$, with equality at $f = (\mu_P - \mu_Q)/\|\mu_P - \mu_Q\|_{\mathcal{H}}$.

36 Two-Sample Problem
Given random samples $\{X_1, \ldots, X_m\} \stackrel{i.i.d.}{\sim} P$ and $\{Y_1, \ldots, Y_n\} \stackrel{i.i.d.}{\sim} Q$.
Determine: P = Q or P ≠ Q?
Approach: define $\rho$ to be a distance on probabilities:
$H_0: P = Q \ \Leftrightarrow\ H_0: \rho(P, Q) = 0$
$H_1: P \neq Q \ \Leftrightarrow\ H_1: \rho(P, Q) > 0$
If the empirical $\rho$ is far from zero: reject $H_0$; close to zero: accept $H_0$.

39 Why $\mathrm{MMD}_{\mathcal{H}}$?
Related to the estimation of $\mathrm{IPM}(P, Q, \mathcal{F})$. Recall
$\mathrm{MMD}^2_{\mathcal{H}}(P, Q) = \left\| \int k(\cdot, x)\, dP(x) - \int k(\cdot, x)\, dQ(x) \right\|^2_{\mathcal{H}}.$
A trivial approximation: $P_m := \frac{1}{m}\sum_{i=1}^m \delta_{X_i}$ and $Q_n := \frac{1}{n}\sum_{i=1}^n \delta_{Y_i}$, where $\delta_x$ is the Dirac measure at x.
$\mathrm{MMD}^2_{\mathcal{H}}(P_m, Q_n) = \left\| \frac{1}{m}\sum_{i=1}^m k(\cdot, X_i) - \frac{1}{n}\sum_{i=1}^n k(\cdot, Y_i) \right\|^2_{\mathcal{H}} = \frac{1}{m^2}\sum_{i,j=1}^m k(X_i, X_j) + \frac{1}{n^2}\sum_{i,j=1}^n k(Y_i, Y_j) - \frac{2}{mn}\sum_{i,j} k(X_i, Y_j).$
This is a V-statistic; a biased estimator of $\mathrm{MMD}^2_{\mathcal{H}}$.
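A minimal numpy sketch of this plug-in (V-statistic) estimator, with a Gaussian kernel chosen for concreteness (kernel choice and function names are my own, not from the lecture):

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    """Gram matrix of the Gaussian kernel between rows of X and rows of Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

# usage: same distribution -> estimate near 0; shifted means -> clearly positive
X = np.random.randn(200, 2)
X2 = np.random.randn(200, 2)
Y = np.random.randn(200, 2) + np.array([1.0, 0.0])
print(mmd2_biased(X, X2), mmd2_biased(X, Y))
```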

41 Why $\mathrm{MMD}_{\mathcal{H}}$?
$\mathrm{IPM}(P_m, Q_n, \mathcal{F})$ is obtained by solving a linear program for $\mathcal{F}$ = Lipschitz and bounded-Lipschitz balls. (S et al., EJS 2012)
Quality of approximation (S et al., EJS 2012):
For $\mathcal{F}$ = Lipschitz and bounded-Lipschitz balls,
$|\mathrm{IPM}(P_m, Q_m, \mathcal{F}) - \mathrm{IPM}(P, Q, \mathcal{F})| = O_p\!\left(m^{-\frac{1}{d+1}}\right), \quad d > 2.$
For $\mathcal{F}$ = unit RKHS ball,
$|\mathrm{MMD}_{\mathcal{H}}(P_m, Q_m) - \mathrm{MMD}_{\mathcal{H}}(P, Q)| = O_p\!\left(m^{-\frac{1}{2}}\right).$
Are there other estimators of $\mathrm{MMD}_{\mathcal{H}}(P, Q)$ that are statistically better than $\mathrm{MMD}_{\mathcal{H}}(P_m, Q_m)$? NO!! (Tolstikhin et al., 2016) In practice? YES!! (Krikamol et al., JMLR 2016; S, Bernoulli 2016)

45 Beware of Pitfalls
There are many other distances on probabilities: total variation distance, Hellinger distance, Kullback-Leibler divergence and its variants, Fisher divergence, ...
Estimating these distances is both computationally and statistically difficult.
$\mathrm{MMD}_{\mathcal{H}}$ is computationally simpler and appears statistically powerful with no curse of dimensionality. In fact, it is NOT statistically powerful. (Ramdas et al., AAAI 2015; S, Bernoulli, 2016)
Recall: $\mathrm{MMD}_{\mathcal{H}}$ is based on $\mu_P$, which is a smoothed version of P. Even though P and Q can be distinguished (coming up!!) based on $\mu_P$ and $\mu_Q$, the distinguishability is weak compared to that of the above distances. (S et al., JMLR 2010; S, Bernoulli, 2016)
NO FREE LUNCH!!

48 So far...
$P \mapsto \mu_P := \int k(\cdot, x)\, dP(x)$
$\mathrm{MMD}_{\mathcal{H}}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$: computation, estimation.
When is $P \mapsto \mu_P$ one-to-one? I.e., when does $\mathrm{MMD}_{\mathcal{H}}(P, Q) = 0 \Rightarrow P = Q$?

49 Characteristic Kernel
k is said to be characteristic if $\mathrm{MMD}_{\mathcal{H}}(P, Q) = 0 \Leftrightarrow P = Q$ for any P and Q.
Not all kernels are characteristic.
Example: If $k(x, y) = c > 0$ for all x, y, then $\mu_P = \int k(\cdot, x)\, dP(x) = c = \mu_Q$, and $\mathrm{MMD}_{\mathcal{H}}(P, Q) = 0$ for all P, Q.
Example: Let $k(x, y) = xy$, $x, y \in \mathbb{R}$. Then $\mathrm{MMD}_{\mathcal{H}}(P, Q) = |\mathbb{E}_P[X] - \mathbb{E}_Q[X]|$. Characteristic for Bernoullis but not for all P and Q.
Example: Let $k(x, y) = (1 + xy)^2$, $x, y \in \mathbb{R}$. Then $\mathrm{MMD}^2_{\mathcal{H}}(P, Q) = 2(\mathbb{E}_P[X] - \mathbb{E}_Q[X])^2 + (\mathbb{E}_P[X^2] - \mathbb{E}_Q[X^2])^2$. Characteristic for Gaussians but not for all P and Q.

53 Characteristic Kernels on $\mathbb{R}^d$
Translation invariant kernel: $k(x, y) = \psi(x - y)$, $x, y \in \mathbb{R}^d$; bounded and continuous.
Bochner's theorem:
$\psi(x) = \int_{\mathbb{R}^d} e^{-\sqrt{-1}\langle x, \omega\rangle}\, d\Lambda(\omega), \quad x \in \mathbb{R}^d,$
where $\Lambda$ is a non-negative finite Borel measure on $\mathbb{R}^d$.
Then: k is characteristic $\Leftrightarrow$ $\mathrm{supp}(\Lambda) = \mathbb{R}^d$. (S et al., COLT 2008; JMLR, 2010)
Corollary: Compactly supported $\psi$ are characteristic (S et al., COLT 2008; JMLR, 2010).
Key idea: Fourier representation of $\mathrm{MMD}_{\mathcal{H}}$.
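Two standard one-dimensional examples, stated here for concreteness since they reappear in the pictures below (the exact normalization of $\Lambda$ is omitted):
$$\psi_{\mathrm{Gauss}}(x) = e^{-\frac{x^2}{2\sigma^2}} \ \Longleftrightarrow\ d\Lambda(\omega) \propto e^{-\frac{\sigma^2 \omega^2}{2}}\, d\omega, \quad \mathrm{supp}(\Lambda) = \mathbb{R}: \text{characteristic};$$
$$\psi_{\mathrm{sinc}}(x) = \frac{\sin x}{x} \ \Longleftrightarrow\ d\Lambda(\omega) \propto \mathbf{1}_{[-1,1]}(\omega)\, d\omega, \quad \mathrm{supp}(\Lambda) \neq \mathbb{R}: \text{not characteristic}.$$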

54 Fourier Representation of $\mathrm{MMD}^2_{\mathcal{H}}$
$\mathrm{MMD}^2_{\mathcal{H}}(P, Q) = \int_{\mathbb{R}^d} |\varphi_P(\omega) - \varphi_Q(\omega)|^2\, d\Lambda(\omega),$
where $\varphi_P$ is the characteristic function of P.
Proof:
$\mathrm{MMD}^2_{\mathcal{H}}(P, Q) = \iint_{\mathbb{R}^d \times \mathbb{R}^d} \psi(x - y)\, d(P-Q)(x)\, d(P-Q)(y)$
$\stackrel{(*)}{=} \iint_{\mathbb{R}^d \times \mathbb{R}^d} \left( \int_{\mathbb{R}^d} e^{-\sqrt{-1}\langle x - y, \omega\rangle}\, d\Lambda(\omega) \right) d(P-Q)(x)\, d(P-Q)(y)$
$\stackrel{(\dagger)}{=} \int_{\mathbb{R}^d} \left( \int_{\mathbb{R}^d} e^{-\sqrt{-1}\langle x, \omega\rangle}\, d(P-Q)(x) \right) \overline{\left( \int_{\mathbb{R}^d} e^{-\sqrt{-1}\langle y, \omega\rangle}\, d(P-Q)(y) \right)}\, d\Lambda(\omega)$
$= \int_{\mathbb{R}^d} |\varphi_P(\omega) - \varphi_Q(\omega)|^2\, d\Lambda(\omega),$
where Bochner's theorem is used in $(*)$ and Fubini's theorem in $(\dagger)$.
Suppose $d\Lambda(\omega) = d\omega$, i.e., $\Lambda$ is "uniform" on $\mathbb{R}^d$ (!!). Then $\mathrm{MMD}_{\mathcal{H}}(P, Q)$ is the $L^2$ distance between the densities (if they exist) of P and Q.

56 Characteristic Kernels on $\mathbb{R}^d$
Proof:
($\Leftarrow$) Suppose $\mathrm{supp}(\Lambda) = \mathbb{R}^d$. Then
$\mathrm{MMD}^2_{\mathcal{H}}(P, Q) = 0 \ \Rightarrow\ \int_{\mathbb{R}^d} |\varphi_P(\omega) - \varphi_Q(\omega)|^2\, d\Lambda(\omega) = 0 \ \Rightarrow\ \varphi_P = \varphi_Q \ \Lambda\text{-a.e.}$
But characteristic functions are uniformly continuous, so $\varphi_P = \varphi_Q$ everywhere, which implies P = Q.
($\Rightarrow$) Suppose $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$. Then there exists an open set $U \subset \mathbb{R}^d$ such that $\Lambda(U) = 0$. Construct $P \neq Q$ such that $\varphi_P$ and $\varphi_Q$ differ only on U; then $\mathrm{MMD}_{\mathcal{H}}(P, Q) = 0$, so k is not characteristic.
If $\psi$ is compactly supported, its Fourier transform is analytic, i.e., it cannot vanish on an interval.

59 Translation Invariant Kernels on $\mathbb{R}^d$
$\mathrm{MMD}_{\mathcal{H}}(P, Q) = \|\varphi_P - \varphi_Q\|_{L^2(\mathbb{R}^d, \Lambda)}$
Example (figure sequence): P differs from Q at (roughly) one frequency; the plots show the characteristic function difference $|\varphi_P - \varphi_Q|$ against the spectral measure $\Lambda$ of several kernels.
Gaussian kernel: characteristic.
Sinc kernel: NOT characteristic.
B-Spline kernel: characteristic.
(Picture credit: A. Gretton)

69 Caution
The characteristic property relates a class of kernels to a class of probabilities (S et al., COLT 2008; JMLR 2010). $\Sigma := \mathrm{supp}(\Lambda)$.


71 Measuring (In)Dependence
Let X and Y be Gaussian random variables on $\mathbb{R}$. Then
X and Y are independent $\Leftrightarrow$ $\mathrm{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = 0$.
In general, $\mathrm{Cov}(X, Y) = 0 \nRightarrow X \perp\!\!\!\perp Y$: covariance captures only the linear relationship between X and Y.
Feature space view point: how about $\mathrm{Cov}(\Phi(X), \Psi(Y))$?
Suppose $\Phi(X) = (1, X, X^2)$ and $\Psi(Y) = (1, Y, Y^2, Y^3)$. Then $\mathrm{Cov}(\Phi(X), \Psi(Y))$ captures $\mathrm{Cov}(X^i, Y^j)$ for $i \in \{0, 1, 2\}$ and $j \in \{0, 1, 2, 3\}$.

73 Measuring (In)Dependence
Characterization of independence: $X \perp\!\!\!\perp Y \Leftrightarrow \mathrm{Cov}(f(X), g(Y)) = 0$ for all measurable functions f and g.
Dependence measure:
$\sup_{f, g} \mathrm{Cov}(f(X), g(Y)) = \sup_{f, g} \left| \mathbb{E}[f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)] \right|$
Similar to the IPM between $P_{XY}$ and $P_X P_Y$.
Restricting the functions to RKHS unit balls gives the constrained covariance (Gretton et al., AISTATS 2005, JMLR 2005):
$\mathrm{COCO}(P_{XY}; \mathcal{H}_X, \mathcal{H}_Y) := \sup_{\|f\|_{\mathcal{H}_X} = 1,\, \|g\|_{\mathcal{H}_Y} = 1} \left| \mathbb{E}[f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)] \right|.$

75 Covariance Operator
Let $k_X$ and $k_Y$ be the reproducing kernels of $\mathcal{H}_X$ and $\mathcal{H}_Y$ respectively. Then
$\mathbb{E}[f(X)] = \langle f, \mu_{P_X}\rangle_{\mathcal{H}_X}$ and $\mathbb{E}[g(Y)] = \langle g, \mu_{P_Y}\rangle_{\mathcal{H}_Y}$.
$\mathbb{E}[f(X)]\mathbb{E}[g(Y)] = \langle f, \mu_{P_X}\rangle_{\mathcal{H}_X} \langle g, \mu_{P_Y}\rangle_{\mathcal{H}_Y} = \langle f \otimes g, \mu_{P_X} \otimes \mu_{P_Y}\rangle_{\mathcal{H}_X \otimes \mathcal{H}_Y} = \langle f, (\mu_{P_X} \otimes \mu_{P_Y}) g\rangle_{\mathcal{H}_X} = \langle g, (\mu_{P_Y} \otimes \mu_{P_X}) f\rangle_{\mathcal{H}_Y}$
$\mathbb{E}[f(X)g(Y)] = \mathbb{E}[\langle f, k_X(\cdot, X)\rangle_{\mathcal{H}_X} \langle g, k_Y(\cdot, Y)\rangle_{\mathcal{H}_Y}] = \mathbb{E}[\langle f \otimes g, k_X(\cdot, X) \otimes k_Y(\cdot, Y)\rangle_{\mathcal{H}_X \otimes \mathcal{H}_Y}] = \mathbb{E}[\langle f, (k_X(\cdot, X) \otimes k_Y(\cdot, Y)) g\rangle_{\mathcal{H}_X}] = \mathbb{E}[\langle g, (k_Y(\cdot, Y) \otimes k_X(\cdot, X)) f\rangle_{\mathcal{H}_Y}]$

77 Covariance Operator
Assuming $\mathbb{E}\sqrt{k_X(X, X)\, k_Y(Y, Y)} < \infty$, we obtain
$\mathbb{E}[f(X)g(Y)] = \langle f, \mathbb{E}[k_X(\cdot, X) \otimes k_Y(\cdot, Y)]\, g\rangle_{\mathcal{H}_X} = \langle g, \mathbb{E}[k_Y(\cdot, Y) \otimes k_X(\cdot, X)]\, f\rangle_{\mathcal{H}_Y}$
$\Rightarrow\ \mathrm{Cov}(f(X), g(Y)) = \langle f, C_{XY}\, g\rangle_{\mathcal{H}_X} = \langle g, C_{YX}\, f\rangle_{\mathcal{H}_Y},$
where $C_{XY} := \mathbb{E}[k_X(\cdot, X) \otimes k_Y(\cdot, Y)] - \mu_{P_X} \otimes \mu_{P_Y}$ is the cross-covariance operator from $\mathcal{H}_Y$ to $\mathcal{H}_X$ and $C_{YX} = C_{XY}^*$.
Compare with the feature space view point using canonical feature maps.

78 Dependence Measures
$\mathrm{COCO}(P_{XY}; \mathcal{H}_X, \mathcal{H}_Y) = \sup_{\|f\|_{\mathcal{H}_X} = 1,\, \|g\|_{\mathcal{H}_Y} = 1} \langle f, C_{XY}\, g\rangle_{\mathcal{H}_X} = \|C_{XY}\|_{op} = \|C_{YX}\|_{op},$
which is the maximum singular value of $C_{XY}$.
Choosing the linear kernels $k_X(x, x') = \langle x, x'\rangle$ and $k_Y(y, y') = \langle y, y'\rangle$: for Gaussian distributions, $X \perp\!\!\!\perp Y \Leftrightarrow C_{XY} = 0$.
In general, $X \perp\!\!\!\perp Y \stackrel{?}{\Leftrightarrow} C_{XY} = 0$.

80 Dependence Measures
How about we also consider the other singular values? How about $\|C_{XY}\|^2_{HS}$, the sum of squared singular values of $C_{XY}$?
Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., ALT 2005, JMLR 2005): $\|C_{XY}\|_{op} \le \|C_{XY}\|_{HS}$.

81 Dependence Measures
$\mathrm{COCO}(P_{XY}; \mathcal{H}_X, \mathcal{H}_Y) := \sup_{\|f\|_{\mathcal{H}_X} = 1,\, \|g\|_{\mathcal{H}_Y} = 1} \left| \mathbb{E}[f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)] \right|.$
How about using a different constraint, i.e., $\|f \otimes g\|_{\mathcal{H}_X \otimes \mathcal{H}_Y} \le 1$?
$\sup_{\|f \otimes g\|_{\mathcal{H}_X \otimes \mathcal{H}_Y} \le 1} \mathrm{Cov}(f(X), g(Y)) = \sup_{\|f \otimes g\| \le 1} \langle f, C_{XY}\, g\rangle_{\mathcal{H}_X} = \sup_{\|f \otimes g\| \le 1} \langle f \otimes g, C_{XY}\rangle_{\mathcal{H}_X \otimes \mathcal{H}_Y} = \|C_{XY}\|_{\mathcal{H}_X \otimes \mathcal{H}_Y} = \|C_{XY}\|_{HS}$
$\|C_{XY}\|_{\mathcal{H}_X \otimes \mathcal{H}_Y} = \|\mathbb{E}[k_X(\cdot, X) \otimes k_Y(\cdot, Y)] - \mu_{P_X} \otimes \mu_{P_Y}\|_{\mathcal{H}_X \otimes \mathcal{H}_Y} = \left\| \int k_X(\cdot, x) \otimes k_Y(\cdot, y)\, d(P_{XY} - P_X P_Y)(x, y) \right\|_{\mathcal{H}_X \otimes \mathcal{H}_Y} = \mathrm{MMD}_{\mathcal{H}_X \otimes \mathcal{H}_Y}(P_{XY}, P_X P_Y)$

84 Dependence Measures
$\mathcal{H}_X \otimes \mathcal{H}_Y$ is an RKHS with kernel $k_X \otimes k_Y$.
If $k_X \otimes k_Y$ is characteristic, then $\|C_{XY}\|_{\mathcal{H}_X \otimes \mathcal{H}_Y} = 0 \Leftrightarrow P_{XY} = P_X P_Y \Leftrightarrow X \perp\!\!\!\perp Y$.
If $k_X$ and $k_Y$ are characteristic, then $\|C_{XY}\|_{HS} = 0 \Leftrightarrow X \perp\!\!\!\perp Y$. (Gretton, 2015)
Using the reproducing property,
$\|C_{XY}\|^2_{HS} = \mathbb{E}_{XY}\mathbb{E}_{X'Y'}\, k_X(X, X')\, k_Y(Y, Y') + \mathbb{E}_{XX'}\, k_X(X, X')\, \mathbb{E}_{YY'}\, k_Y(Y, Y') - 2\, \mathbb{E}_{X'Y'}\!\left[ \mathbb{E}_X k_X(X, X')\, \mathbb{E}_Y k_Y(Y, Y') \right].$
This can be estimated using a V-statistic (empirical sums).
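A compact numpy sketch of the standard biased (V-statistic) HSIC estimate, $\widehat{\mathrm{HSIC}} = \frac{1}{m^2}\,\mathrm{tr}(K H L H)$ with centering matrix $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^T$; the Gaussian kernels and helper names here are my own choices for illustration:

```python
import numpy as np

def gram_gauss(X, sigma=1.0):
    """Gaussian-kernel Gram matrix of a sample X (rows are observations)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased (V-statistic) HSIC estimate: (1/m^2) tr(K H L H)."""
    m = X.shape[0]
    K, L = gram_gauss(X, sigma_x), gram_gauss(Y, sigma_y)
    H = np.eye(m) - np.ones((m, m)) / m      # centering matrix
    return np.trace(K @ H @ L @ H) / m**2

# usage: an independent pair vs. a noisy nonlinear dependence
X = np.random.randn(300, 1)
Y_indep = np.random.randn(300, 1)
Y_dep = X**2 + 0.1 * np.random.randn(300, 1)
print(hsic_biased(X, Y_indep), hsic_biased(X, Y_dep))
```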

85 Applications
Two-sample testing
Independence testing
Conditional independence testing
Supervised dimensionality reduction
Kernel Bayes' rule (filtering, prediction and smoothing)
Kernel CCA, ...
Review paper: (Muandet et al., 2016)

86 Application: Two-Sample Testing

87 Two-Sample Problem
Given random samples $\{X_1, \ldots, X_m\} \stackrel{i.i.d.}{\sim} P$ and $\{Y_1, \ldots, Y_n\} \stackrel{i.i.d.}{\sim} Q$.
Determine: P = Q or P ≠ Q?
Approach:
$H_0: P = Q \ \Leftrightarrow\ H_0: \mathrm{MMD}_{\mathcal{H}}(P, Q) = 0$
$H_1: P \neq Q \ \Leftrightarrow\ H_1: \mathrm{MMD}_{\mathcal{H}}(P, Q) > 0$
If $\mathrm{MMD}^2_{\mathcal{H}}(P_m, Q_n)$ is far from zero: reject $H_0$; close to zero: accept $H_0$.

88 Type-I and Type-II Errors
Given P = Q, we want a threshold or critical value $t_{1-\alpha}$ such that $\Pr_{H_0}\!\left( \mathrm{MMD}^2_{\mathcal{H}}(P_m, Q_n) > t_{1-\alpha} \right) \le \alpha$.

89 Statistical Test: Large Deviation Bounds
Given P = Q, want a threshold t such that $\Pr_{H_0}(\mathrm{MMD}^2_{\mathcal{H}}(P_m, Q_n) > t) \le \alpha$.
We showed that (S et al., EJS 2012)
$\Pr\!\left( \mathrm{MMD}^2_{\mathcal{H}}(P_m, Q_n) - \mathrm{MMD}^2_{\mathcal{H}}(P, Q) \ge \sqrt{\tfrac{2(m+n)}{mn}}\left( 1 + \sqrt{2 \log \tfrac{1}{\alpha}} \right) \right) \le \alpha.$
$\alpha$-level test: accept $H_0$ if
$\mathrm{MMD}^2_{\mathcal{H}}(P_m, Q_n) < \sqrt{\tfrac{2(m+n)}{mn}}\left( 1 + \sqrt{2 \log \tfrac{1}{\alpha}} \right);$
otherwise reject. Too conservative!!

91 Statistical Test: Asymptotic Distribution (Gretton et al., NIPS 2006, JMLR 2012)
Unbiased estimator of $\mathrm{MMD}^2_{\mathcal{H}}(P, Q)$ (a U-statistic, m = n):
$\widehat{\mathrm{MMD}}^2_{\mathcal{H}} := \frac{1}{m(m-1)} \sum_{i \neq j}^{m} h\big((X_i, Y_i), (X_j, Y_j)\big), \quad h\big((X_i, Y_i), (X_j, Y_j)\big) := k(X_i, X_j) + k(Y_i, Y_j) - k(X_i, Y_j) - k(X_j, Y_i).$
Under $H_0$,
$m\, \widehat{\mathrm{MMD}}^2_{\mathcal{H}} \ \stackrel{w}{\longrightarrow}\ \sum_{i=1}^{\infty} \lambda_i (\theta_i^2 - 2) \quad \text{as } m \to \infty,$
where $\theta_i \sim \mathcal{N}(0, 2)$ i.i.d., and the $\lambda_i$ are solutions of $\int \tilde{k}(x, y)\, \psi_i(x)\, dP(x) = \lambda_i \psi_i(y)$ with $\tilde{k}$ the centered kernel.
Consistent (Type-II error goes to zero): under $H_1$,
$\sqrt{m}\left( \widehat{\mathrm{MMD}}^2_{\mathcal{H}} - \mathrm{MMD}^2_{\mathcal{H}}(P, Q) \right) \ \stackrel{w}{\longrightarrow}\ \mathcal{N}(0, \sigma_h^2) \quad \text{as } m \to \infty.$

94 Statistical Test: Asymptotic Distribution (Gretton et al., NIPS 2006, JMLR 2012)
$\alpha$-level test: estimate the $1-\alpha$ quantile of the null distribution using the bootstrap. (Picture credit: A. Gretton)
Computationally intensive!!
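For concreteness, a minimal sketch of the resampling approach (a permutation variant of the bootstrap, my own illustration rather than code from the lecture; it takes any MMD² estimator, e.g. the mmd2_biased helper sketched earlier):

```python
import numpy as np

def mmd_permutation_test(X, Y, mmd2_fn, num_perms=500, alpha=0.05, rng=None):
    """Reject H0: P = Q if the observed MMD^2 exceeds the (1 - alpha) quantile
    of its permutation null distribution."""
    rng = np.random.default_rng(rng)
    m = X.shape[0]
    Z = np.vstack([X, Y])
    observed = mmd2_fn(X, Y)
    null = np.empty(num_perms)
    for b in range(num_perms):
        perm = rng.permutation(Z.shape[0])           # reshuffle the pooled sample
        null[b] = mmd2_fn(Z[perm[:m]], Z[perm[m:]])
    threshold = np.quantile(null, 1 - alpha)
    return observed > threshold, observed, threshold

# usage with the Gaussian-kernel V-statistic from the earlier sketch:
# reject, stat, thr = mmd_permutation_test(X, Y, lambda A, B: mmd2_biased(A, B, sigma=1.0))
```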

95 Statistical Test Without Bootstrap (Gretton et al., NIPS 2009)
Estimate the eigenvalues $\lambda_i$ from the combined sample:
Define $Z := (X_1, \ldots, X_m, Y_1, \ldots, Y_m)$ and $K_{ij} := k(Z_i, Z_j)$.
Compute the eigenvalues $\hat{\lambda}_i$ of $\tilde{K} = H K H$, where $H = I - \frac{1}{2m}\mathbf{1}_{2m}\mathbf{1}_{2m}^T$.
$\alpha$-level test: compute the $1-\alpha$ quantile of the distribution associated with $\sum_{i=1}^{2m} \hat{\lambda}_i (\theta_i^2 - 2)$.
The test is asymptotically of level $\alpha$ and consistent.

96 Experiments (Gretton et al., NIPS 2009)
Comparison example: Canadian Hansard corpus (agriculture, fisheries and immigration).
Samples: 5-line extracts. Kernel: k-spectrum kernel with k = 10. Sample size: 10. Repetitions: 300. Compute $\widehat{\mathrm{MMD}}^2_{\mathcal{H}}$.
k-spectrum kernel: average Type-II error 0 ($\alpha$ = 0.05). Bag-of-words kernel: average Type-II error 0.18.
First ever test on structured data.

97 Choice of Characteristic Kernel

98 Choice of Characteristic Kernels
Let $\mathcal{X} = \mathbb{R}^d$. Suppose k is a Gaussian kernel, $k_\sigma(x, y) = e^{-\frac{\|x - y\|_2^2}{2\sigma^2}}$.
$\mathrm{MMD}_{\mathcal{H}_\sigma}$ is a function of $\sigma$, so $\{\mathrm{MMD}_{\mathcal{H}_\sigma}\}_\sigma$ is a family of metrics. Which one should we use in practice?
Note that $\mathrm{MMD}_{\mathcal{H}_\sigma} \to 0$ as $\sigma \to 0$ or $\sigma \to \infty$. Therefore, the kernel choice is very critical in applications.
Heuristics:
Median: $\sigma = \mathrm{median}\{\|Z_i - Z_j\|_2 : i \neq j\}$, where $Z = ((X_i)_i, (Y_i)_i)$ is the pooled sample (Gretton et al., NIPS 2006, NIPS 2009, JMLR 2012).
Choose the test statistic to be $\mathrm{MMD}_{\mathcal{H}_{\sigma^*}}(P_m, Q_m)$, where $\sigma^* = \arg\max_{\sigma \in (0, \infty)} \mathrm{MMD}_{\mathcal{H}_\sigma}(P_m, Q_m)$ (S et al., NIPS 2009).
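A numpy sketch of the median heuristic on the pooled sample (the helper name is my own):

```python
import numpy as np

def median_heuristic_bandwidth(X, Y):
    """Gaussian-kernel bandwidth sigma = median pairwise distance of the pooled sample."""
    Z = np.vstack([X, Y])
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    dists = np.sqrt(np.maximum(sq, 0.0))
    return np.median(dists[np.triu_indices_from(dists, k=1)])  # off-diagonal pairs only

# usage with the gaussian_gram helper from the earlier sketch:
# sigma = median_heuristic_bandwidth(X, Y); K = gaussian_gram(X, Y, sigma)
```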

102 Classes of Characteristic Kernels (S et al., NIPS 2009)
More generally, we use $\mathrm{MMD}(P, Q) := \sup_{k \in \mathcal{K}} \mathrm{MMD}_{\mathcal{H}_k}(P, Q)$.
Examples for $\mathcal{K}$:
$\mathcal{K}_g := \{ e^{-\sigma \|x - y\|_2^2},\ x, y \in \mathbb{R}^d : \sigma \in \mathbb{R}_+ \}$
$\mathcal{K}_{lin} := \{ k_\lambda = \sum_{i=1}^l \lambda_i k_i : k_\lambda \text{ is pd},\ \sum_{i=1}^l \lambda_i = 1 \}$
$\mathcal{K}_{con} := \{ k_\lambda = \sum_{i=1}^l \lambda_i k_i : \lambda_i \ge 0,\ \sum_{i=1}^l \lambda_i = 1 \}$
Test: $\alpha$-level test: estimate the $1-\alpha$ quantile of the null distribution of $\mathrm{MMD}(P_m, Q_m)$ using the bootstrap.
Test consistency: based on the functional central limit theorem for U-processes indexed by a VC-subgraph class $\mathcal{K}$.
Computational disadvantage!!

104 Experiments
$q = N(0, \sigma_q^2)$, $p(x) = q(x)(1 + \sin(\nu x))$.
[Figure: the densities p(x) and q(x) for several perturbation frequencies $\nu$, e.g. $\nu = 7.5$.]
$k(x, y) = \exp(-(x - y)^2/\sigma)$.
Test statistics: $\mathrm{MMD}(P_m, Q_m)$ and $\mathrm{MMD}_{\mathcal{H}_\sigma}(P_m, Q_m)$ for various $\sigma$.

105 Experiments
[Figure: Type-I and Type-II error (in %) of the test based on $\mathrm{MMD}(P_m, Q_m)$ as a function of $\nu$.]

106 Experiments
[Figure: Type-I and Type-II error (in %) of the test based on $\mathrm{MMD}_{\mathcal{H}_\sigma}(P_m, Q_m)$ as a function of $\log \sigma$, for $\nu \in \{0.5, 0.75, 1.0, 1.25, 1.5\}$; the median-heuristic choice of $\sigma$ is marked.]

107 Choice of Characteristic Kernels (Gretton et al., NIPS 2012)
Choose a kernel that minimizes the Type-II error for a given Type-I error:
$k_* \in \arg\inf_{k \in \mathcal{K}:\ \mathrm{Type\ I}(k) \le \alpha} \mathrm{Type\ II}(k).$
Not easy to compute with the asymptotic distribution of the U-statistic $\widehat{\mathrm{MMD}}^2_{\mathcal{H}_k}(P_m, Q_m)$.
Modified statistic: average of U-statistics computed on independent blocks of size 2,
$\widehat{\mathrm{MMD}}^2_{l,k}(P_m, Q_m) = \frac{2}{m} \sum_{i=1}^{m/2} h_k(Z_i), \quad h_k(Z_i) := k(X_{2i-1}, X_{2i}) + k(Y_{2i-1}, Y_{2i}) - k(X_{2i-1}, Y_{2i}) - k(Y_{2i-1}, X_{2i}),$
where $Z_i = (X_{2i-1}, X_{2i}, Y_{2i-1}, Y_{2i})$.
Recall the quadratic-time U-statistic
$\widehat{\mathrm{MMD}}^2_{\mathcal{H}} := \frac{1}{m(m-1)} \sum_{i \neq j}^m k(X_i, X_j) + k(Y_i, Y_j) - k(X_i, Y_j) - k(X_j, Y_i).$
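A numpy sketch of this linear-time block statistic (paired blocks of size two; the Gaussian kernel and function name are my own choices for illustration):

```python
import numpy as np

def mmd2_linear_time(X, Y, sigma=1.0):
    """Linear-time MMD^2 estimate: average of h_k over disjoint blocks of size 2."""
    m = min(X.shape[0], Y.shape[0]) // 2 * 2          # use an even number of pairs
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2, axis=1) / (2 * sigma**2))
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(y1, x2)
    return h.mean(), h.var(ddof=1)   # estimate and the sample variance of h_k (used for the threshold)

# usage: stat, var_h = mmd2_linear_time(X, Y, sigma=1.0)
```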

109 Modified Statistic
Advantages:
$\widehat{\mathrm{MMD}}^2_{l,k}$ is computable in O(m), while $\widehat{\mathrm{MMD}}^2_{\mathcal{H}_k}$ requires O(m²) computations.
Under $H_0$, $\sqrt{m}\, \widehat{\mathrm{MMD}}^2_{l,k}(P_m, Q_m) \stackrel{w}{\longrightarrow} \mathcal{N}(0, 2\sigma^2_{h_k})$, where $\sigma^2_{h_k} = \mathbb{E}_Z h_k^2(Z) - (\mathbb{E}_Z h_k(Z))^2$, assuming $0 < \mathbb{E}_Z h_k^2(Z) < \infty$.
The asymptotic distribution is normal, as opposed to a weighted sum of infinitely many $\chi^2$ variables. Therefore, the test threshold is easy to compute.
Disadvantages: larger variance, smaller power.

111 Type-I and Type-II Errors
Test threshold: for a given k and $\alpha$,
$t_{k, 1-\alpha} = \sqrt{2}\, \sigma_{h_k}\, \Phi_N^{-1}(1-\alpha),$
where $\Phi_N$ is the cdf of $\mathcal{N}(0, 1)$.
Type-II error:
$\Phi_N\!\left( \Phi_N^{-1}(1-\alpha) - \frac{\mathrm{MMD}^2_{\mathcal{H}_k}(P, Q)\, \sqrt{m}}{\sqrt{2}\, \sigma_{h_k}} \right).$

112 Best Kernel: Minimizes Type-II Error
Since $\Phi_N$ is strictly increasing, the Type-II error is minimized by maximizing $\frac{\mathrm{MMD}^2_{\mathcal{H}_k}(P, Q)}{\sigma_{h_k}}$.
Optimal kernel:
$k_* \in \arg\sup_{k \in \mathcal{K}} \frac{\mathrm{MMD}^2_{\mathcal{H}_k}(P, Q)}{\sigma_{h_k}}.$
Since $\mathrm{MMD}^2_{\mathcal{H}_k}$ and $\sigma_{h_k}$ depend on the unknown P and Q, we split the data into train and test sets, estimate $k_*$ on the train data as $\hat{k}$, and evaluate the threshold $t_{\hat{k}, 1-\alpha}$ on the test data.

113 Data-Dependent Kernel
Train data: compute $\widehat{\mathrm{MMD}}^2_{l,k}$ and $\hat{\sigma}_{h_k}$, and define
$\hat{k} \in \arg\sup_{k \in \mathcal{K}} \frac{\widehat{\mathrm{MMD}}^2_{l,k}}{\hat{\sigma}_{h_k} + \lambda_m}$
for some $\lambda_m \to 0$ as $m \to \infty$.
Test data: compute $\widehat{\mathrm{MMD}}^2_{l,\hat{k}}$, $\hat{\sigma}_{h_{\hat{k}}}$ and $t_{\hat{k}, 1-\alpha}$. If $\widehat{\mathrm{MMD}}^2_{l,\hat{k}} > t_{\hat{k}, 1-\alpha}$, reject $H_0$; else accept.
Similar results have recently been obtained for the quadratic-time statistic $\widehat{\mathrm{MMD}}^2_{\mathcal{H}_k}$ (Sutherland et al., ICLR 2017).

115 Learning the Kernel
Define the family of kernels
$\mathcal{K} := \left\{ k : k = \sum_{i=1}^l \beta_i k_i,\ \beta_i \ge 0,\ i \in [l] \right\}.$
If all $k_i$ are characteristic and $\beta_i > 0$ for some $i \in [l]$, then k is characteristic.
$\mathrm{MMD}^2_{\mathcal{H}_k}(P, Q) = \sum_{i=1}^l \beta_i\, \mathrm{MMD}^2_{\mathcal{H}_{k_i}}(P, Q), \qquad \sigma^2_k = \sum_{i,j=1}^l \beta_i \beta_j\, \mathrm{cov}(h_{k_i}, h_{k_j}),$
where $h_{k_i}(x, x', y, y') = k_i(x, x') + k_i(y, y') - k_i(x, y') - k_i(x', y)$.
Objective:
$\beta^* = \arg\max_{\beta \ge 0} \frac{\beta^T \eta}{\sqrt{\beta^T W \beta}},$
where $\eta := (\mathrm{MMD}^2_{\mathcal{H}_{k_i}}(P, Q))_i$ and $W := (\mathrm{cov}(h_{k_i}, h_{k_j}))_{i,j}$.

118 Optimization
$\hat{\beta}_\lambda = \arg\max_{\beta \ge 0} \frac{\beta^T \hat{\eta}}{\sqrt{\beta^T (\hat{W} + \lambda I)\beta}}$
If $\hat{\eta}$ has at least one positive element, the objective is strictly positive, and so
$\hat{\beta}_\lambda = \arg\min_\beta \left\{ \beta^T (\hat{W} + \lambda I)\beta \ :\ \beta^T \hat{\eta} = 1,\ \beta \ge 0 \right\}.$
On the test data: compute $\widehat{\mathrm{MMD}}^2_{l,\hat{k}}$ using $\hat{k} = \sum_{i=1}^l \hat{\beta}_{\lambda, i} k_i$, and compute the test threshold $\hat{t}_{\hat{k}, 1-\alpha}$ using $\hat{\sigma}_{\hat{k}}$.
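A small sketch of this constrained quadratic program using a generic scipy solver (a stand-in, not the implementation used in the paper; hat_eta and hat_W denote the estimated $\hat{\eta}$ and $\hat{W}$):

```python
import numpy as np
from scipy.optimize import minimize

def learn_kernel_weights(hat_eta, hat_W, lam=1e-4):
    """Solve min_beta beta^T (W + lam I) beta  s.t.  beta^T eta = 1, beta >= 0."""
    l = len(hat_eta)
    A = hat_W + lam * np.eye(l)
    objective = lambda b: b @ A @ b
    constraints = [{"type": "eq", "fun": lambda b: b @ hat_eta - 1.0}]
    bounds = [(0.0, None)] * l
    res = minimize(objective, x0=np.full(l, 1.0 / l), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return np.maximum(res.x, 0.0)   # clip tiny negatives; the scale of beta does not affect the ratio

# usage: beta = learn_kernel_weights(hat_eta, hat_W)
#        then k_hat(x, y) = sum_i beta[i] * k_i(x, y) defines the learned kernel
```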

120 Experiments
P and Q are mixtures of two-dimensional Gaussians. P has unit covariance in each component. Q has correlated Gaussians, with ε the ratio of the largest to the smallest covariance eigenvalue.
The testing problem becomes harder as ε → 1 and as the number of mixture components grows.

121 Competing Approaches
Median heuristic.
Max. MMD: choose the $k \in \mathcal{K}$ with the largest $\widehat{\mathrm{MMD}}^2_{\mathcal{H}_k}(P_m, Q_m)$, i.e., $\sup_{k \in \mathcal{K}} \widehat{\mathrm{MMD}}^2_{\mathcal{H}_k}(P_m, Q_m)$; same as maximizing $\beta^T \hat{\eta}$ subject to $\|\beta\|_1 \le 1$.
$\ell_2$ statistic: maximize $\beta^T \hat{\eta}$ subject to $\|\beta\|_2 \le 1$.
Cross-validation on the training set.

122 Results
m = 10,000 (for training and test). Results are averages over 617 trials.

123 Results
[Figures: test performance of the competing kernel-selection strategies on the Gaussian-mixture benchmark: optimizing the ratio $\widehat{\mathrm{MMD}}^2_{\mathcal{H}_k}/\hat{\sigma}_k$, maximizing $\widehat{\mathrm{MMD}}^2_{\mathcal{H}_k}$ subject to the constraint on $\beta$, and the median heuristic.]

126 References I
Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press, Cambridge, UK.
Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, Cambridge, MA. MIT Press.
Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21.
Gretton, A. (2015). A simpler condition for consistency of a kernel independence test. arXiv preprint.
Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two sample problem. In Advances in Neural Information Processing Systems 19. MIT Press.
Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2012a). A kernel two-sample test. Journal of Machine Learning Research, 13.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005a). Measuring statistical dependence with Hilbert-Schmidt norms. In Jain, S., Simon, H. U., and Tomita, E., editors, Proceedings of Algorithmic Learning Theory, pages 63–77, Berlin. Springer-Verlag.
Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. (2010). A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems 22, Cambridge, MA. MIT Press.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. (2005b). Kernel methods for measuring independence. Journal of Machine Learning Research, 6.
Gretton, A., Smola, A., Bousquet, O., Herbrich, R., Belitski, A., Augath, M., Murayama, Y., Pauls, J., Schölkopf, B., and Logothetis, N. (2005c). Kernel constrained covariance for dependence measurement. In Ghahramani, Z. and Cowell, R., editors, Proc. 10th International Workshop on Artificial Intelligence and Statistics, pages 1–8.

127 References II
Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., and Fukumizu, K. (2012b). Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems 24, Cambridge, MA. MIT Press.
Muandet, K., Fukumizu, K., Sriperumbudur, B. K., and Schölkopf, B. (2016a). Kernel mean embedding of distributions: A review and beyond. arXiv preprint.
Muandet, K., Sriperumbudur, B. K., Fukumizu, K., Gretton, A., and Schölkopf, B. (2016b). Kernel mean shrinkage estimators. Journal of Machine Learning Research, 17(48):1–41.
Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29.
Ramdas, A., Reddi, S. J., Póczos, B., Singh, A., and Wasserman, L. (2015). On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proc. of 29th AAAI Conference on Artificial Intelligence.
Simon-Gabriel, C. and Schölkopf, B. (2016). Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. arXiv preprint.
Smola, A. J., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory. Springer-Verlag, Berlin, Germany.
Sriperumbudur, B. K. (2016). On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3).
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. G. (2012). On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6.
Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. G. (2011). Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12.

128 References III
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. G. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11.
Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer.
Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. (2017). Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations.
Tolstikhin, I., Sriperumbudur, B. K., and Muandet, K. (2016). Minimax estimation of kernel mean embeddings. arXiv preprint.


Distribution Regression: A Simple Technique with Minimax-optimal Guarantee

Distribution Regression: A Simple Technique with Minimax-optimal Guarantee Distribution Regression: A Simple Technique with Minimax-optimal Guarantee (CMAP, École Polytechnique) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department,

More information

Two-stage Sampled Learning Theory on Distributions

Two-stage Sampled Learning Theory on Distributions Two-stage Sampled Learning Theory on Distributions Zoltán Szabó (Gatsby Unit, UCL) Joint work with Arthur Gretton (Gatsby Unit, UCL), Barnabás Póczos (ML Department, CMU), Bharath K. Sriperumbudur (Department

More information

Recovering Distributions from Gaussian RKHS Embeddings

Recovering Distributions from Gaussian RKHS Embeddings Motonobu Kanagawa Graduate University for Advanced Studies kanagawa@ism.ac.jp Kenji Fukumizu Institute of Statistical Mathematics fukumizu@ism.ac.jp Abstract Recent advances of kernel methods have yielded

More information

Distribution Regression

Distribution Regression Zoltán Szabó (École Polytechnique) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL) Dagstuhl Seminar 16481

More information

Non-parametric Estimation of Integral Probability Metrics

Non-parametric Estimation of Integral Probability Metrics Non-parametric Estimation of Integral Probability Metrics Bharath K. Sriperumbudur, Kenji Fukumizu 2, Arthur Gretton 3,4, Bernhard Schölkopf 4 and Gert R. G. Lanckriet Department of ECE, UC San Diego,

More information

Kernel methods for comparing distributions, measuring dependence

Kernel methods for comparing distributions, measuring dependence Kernel methods for comparing distributions, measuring dependence Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Principal component analysis Given a set of M centered observations

More information

Adaptive HMC via the Infinite Exponential Family

Adaptive HMC via the Infinite Exponential Family Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University College London RegML, 2017 Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family

More information

Convergence Rates of Kernel Quadrature Rules

Convergence Rates of Kernel Quadrature Rules Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction

More information

Kernel Measures of Conditional Dependence

Kernel Measures of Conditional Dependence Kernel Measures of Conditional Dependence Kenji Fukumizu Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku Tokyo 6-8569 Japan fukumizu@ism.ac.jp Arthur Gretton Max-Planck Institute for

More information

Minimax-optimal distribution regression

Minimax-optimal distribution regression Zoltán Szabó (Gatsby Unit, UCL) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL) ISNPS, Avignon June 12,

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

A Kernel Statistical Test of Independence

A Kernel Statistical Test of Independence A Kernel Statistical Test of Independence Arthur Gretton MPI for Biological Cybernetics Tübingen, Germany arthur@tuebingen.mpg.de Kenji Fukumizu Inst. of Statistical Mathematics Tokyo Japan fukumizu@ism.ac.jp

More information

Kernel-Based Contrast Functions for Sufficient Dimension Reduction

Kernel-Based Contrast Functions for Sufficient Dimension Reduction Kernel-Based Contrast Functions for Sufficient Dimension Reduction Michael I. Jordan Departments of Statistics and EECS University of California, Berkeley Joint work with Kenji Fukumizu and Francis Bach

More information

Hilbert Schmidt Independence Criterion

Hilbert Schmidt Independence Criterion Hilbert Schmidt Independence Criterion Thanks to Arthur Gretton, Le Song, Bernhard Schölkopf, Olivier Bousquet Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au

More information

On non-parametric robust quantile regression by support vector machines

On non-parametric robust quantile regression by support vector machines On non-parametric robust quantile regression by support vector machines Andreas Christmann joint work with: Ingo Steinwart (Los Alamos National Lab) Arnout Van Messem (Vrije Universiteit Brussel) ERCIM

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Makoto Yamada and Masashi Sugiyama Department of Computer Science, Tokyo Institute of Technology

More information

Semi-Supervised Learning through Principal Directions Estimation

Semi-Supervised Learning through Principal Directions Estimation Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

B-tests: Low Variance Kernel Two-Sample Tests

B-tests: Low Variance Kernel Two-Sample Tests B-tests: Low Variance Kernel Two-Sample Tests Wojciech Zaremba, Arthur Gretton, Matthew Blaschko To cite this version: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko. B-tests: Low Variance Kernel Two-

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

On the Convergence of the Concave-Convex Procedure

On the Convergence of the Concave-Convex Procedure On the Convergence of the Concave-Convex Procedure Bharath K. Sriperumbudur and Gert R. G. Lanckriet UC San Diego OPT 2009 Outline Difference of convex functions (d.c.) program Applications in machine

More information

1 Exponential manifold by reproducing kernel Hilbert spaces

1 Exponential manifold by reproducing kernel Hilbert spaces 1 Exponential manifold by reproducing kernel Hilbert spaces 1.1 Introduction The purpose of this paper is to propose a method of constructing exponential families of Hilbert manifold, on which estimation

More information

Imitation Learning via Kernel Mean Embedding

Imitation Learning via Kernel Mean Embedding Imitation Learning via Kernel Mean Embedding Kee-Eung Kim School of Computer Science KAIST kekim@cs.kaist.ac.kr Hyun Soo Park Department of Computer Science and Engineering University of Minnesota hspark@umn.edu

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Forefront of the Two Sample Problem

Forefront of the Two Sample Problem Forefront of the Two Sample Problem From classical to state-of-the-art methods Yuchi Matsuoka What is the Two Sample Problem? X ",, X % ~ P i. i. d Y ",, Y - ~ Q i. i. d Two Sample Problem P = Q? Example:

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Topics in kernel hypothesis testing

Topics in kernel hypothesis testing Topics in kernel hypothesis testing Kacper Chwialkowski A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of University College London. Department

More information

Theory of Positive Definite Kernel and Reproducing Kernel Hilbert Space

Theory of Positive Definite Kernel and Reproducing Kernel Hilbert Space Theory of Positive Definite Kernel and Reproducing Kernel Hilbert Space Statistical Inference with Reproducing Kernel Hilbert Space Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department

More information

arxiv: v1 [cs.lg] 22 Jun 2009

arxiv: v1 [cs.lg] 22 Jun 2009 Bayesian two-sample tests arxiv:0906.4032v1 [cs.lg] 22 Jun 2009 Karsten M. Borgwardt 1 and Zoubin Ghahramani 2 1 Max-Planck-Institutes Tübingen, 2 University of Cambridge June 22, 2009 Abstract In this

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Kernel Adaptive Metropolis-Hastings

Kernel Adaptive Metropolis-Hastings Kernel Adaptive Metropolis-Hastings Arthur Gretton,?? Gatsby Unit, CSML, University College London NIPS, December 2015 Arthur Gretton (Gatsby Unit, UCL) Kernel Adaptive Metropolis-Hastings 12/12/2015 1

More information

Strictly Positive Definite Functions on a Real Inner Product Space

Strictly Positive Definite Functions on a Real Inner Product Space Strictly Positive Definite Functions on a Real Inner Product Space Allan Pinkus Abstract. If ft) = a kt k converges for all t IR with all coefficients a k 0, then the function f< x, y >) is positive definite

More information

Kernel Sequential Monte Carlo

Kernel Sequential Monte Carlo Kernel Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) * equal contribution April 25, 2016 1 / 37 Section

More information

Understanding the relationship between Functional and Structural Connectivity of Brain Networks

Understanding the relationship between Functional and Structural Connectivity of Brain Networks Understanding the relationship between Functional and Structural Connectivity of Brain Networks Sashank J. Reddi Machine Learning Department, Carnegie Mellon University SJAKKAMR@CS.CMU.EDU Abstract Background.

More information

Linear and Non-Linear Dimensionality Reduction

Linear and Non-Linear Dimensionality Reduction Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.215 and 7.5.215 Overview Dimensionality Reduction Motivation Linear Projections

More information

Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators. Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators. Michel Besserve MPI for Intelligent Systems, Tübingen michel.besserve@tuebingen.mpg.de Nikos K. Logothetis MPI

More information

Nonparametric Indepedence Tests: Space Partitioning and Kernel Approaches

Nonparametric Indepedence Tests: Space Partitioning and Kernel Approaches Nonparametric Indepedence Tests: Space Partitioning and Kernel Approaches Arthur Gretton 1 and László Györfi 2 1. Gatsby Computational Neuroscience Unit London, UK 2. Budapest University of Technology

More information

Characterizing Independence with Tensor Product Kernels

Characterizing Independence with Tensor Product Kernels Characterizing Independence with Tensor Product Kernels Zolta n Szabo CMAP, E cole Polytechnique Joint work with: Bharath K. Sriperumbudur Department of Statistics, PSU December 13, 2017 Zolta n Szabo

More information

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved

More information

A Kernelized Stein Discrepancy for Goodness-of-fit Tests

A Kernelized Stein Discrepancy for Goodness-of-fit Tests Qiang Liu QLIU@CS.DARTMOUTH.EDU Computer Science, Dartmouth College, NH, 3755 Jason D. Lee JASONDLEE88@EECS.BERKELEY.EDU Michael Jordan JORDAN@CS.BERKELEY.EDU Department of Electrical Engineering and Computer

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

On Integral Probability Metrics, φ-divergences and Binary Classification

On Integral Probability Metrics, φ-divergences and Binary Classification On Integral Probability etrics, φ-divergences and Binary Classification Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf and Gert R. G. Lanckriet arxiv:090.698v4 [cs.it] 3 Oct

More information

Efficient Complex Output Prediction

Efficient Complex Output Prediction Efficient Complex Output Prediction Florence d Alché-Buc Joint work with Romain Brault, Alex Lambert, Maxime Sangnier October 12, 2017 LTCI, Télécom ParisTech, Institut-Mines Télécom, Université Paris-Saclay

More information

Inverse Density as an Inverse Problem: the Fredholm Equation Approach

Inverse Density as an Inverse Problem: the Fredholm Equation Approach Inverse Density as an Inverse Problem: the Fredholm Equation Approach Qichao Que, Mikhail Belkin Department of Computer Science and Engineering The Ohio State University {que,mbelkin}@cse.ohio-state.edu

More information

AdaGAN: Boosting Generative Models

AdaGAN: Boosting Generative Models AdaGAN: Boosting Generative Models Ilya Tolstikhin ilya@tuebingen.mpg.de joint work with Gelly 2, Bousquet 2, Simon-Gabriel 1, Schölkopf 1 1 MPI for Intelligent Systems 2 Google Brain Radford et al., 2015)

More information

Hilbertian Metrics and Positive Definite Kernels on Probability Measures

Hilbertian Metrics and Positive Definite Kernels on Probability Measures Max Planck Institut für biologische Kybernetik Max Planck Institute for Biological Cybernetics Technical Report No. 126 Hilbertian Metrics and Positive Definite Kernels on Probability Measures Matthias

More information

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

A Wild Bootstrap for Degenerate Kernel Tests

A Wild Bootstrap for Degenerate Kernel Tests A Wild Bootstrap for Degenerate Kernel Tests Kacper Chwialkowski Department of Computer Science University College London London, Gower Street, WCE 6BT kacper.chwialkowski@gmail.com Dino Sejdinovic Gatsby

More information

Unlabeled Data: Now It Helps, Now It Doesn t

Unlabeled Data: Now It Helps, Now It Doesn t institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Kernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444

Kernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444 Kernel Methods Jean-Philippe Vert Jean-Philippe.Vert@mines.org Last update: Jan 2015 Jean-Philippe Vert (Mines ParisTech) 1 / 444 What we know how to solve Jean-Philippe Vert (Mines ParisTech) 2 / 444

More information

Kernel methods and the exponential family

Kernel methods and the exponential family Kernel methods and the exponential family Stéphane Canu 1 and Alex J. Smola 2 1- PSI - FRE CNRS 2645 INSA de Rouen, France St Etienne du Rouvray, France Stephane.Canu@insa-rouen.fr 2- Statistical Machine

More information

Statistical Optimality of Stochastic Gradient Descent through Multiple Passes

Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Loucas Pillaud-Vivien

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Measuring Statistical Dependence with Hilbert-Schmidt Norms

Measuring Statistical Dependence with Hilbert-Schmidt Norms Max Planck Institut für biologische Kybernetik Max Planck Institute for Biological Cybernetics Technical Report No. 140 Measuring Statistical Dependence with Hilbert-Schmidt Norms Arthur Gretton, 1 Olivier

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

Bayesian Regularization

Bayesian Regularization Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Direct Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Classification Logistic regression models for classification problem We consider two class problem: Y {0, 1}. The Bayes rule for the classification is I(P(Y = 1 X = x) > 1/2) so

More information

Causal Inference in Time Series via Supervised Learning

Causal Inference in Time Series via Supervised Learning Causal Inference in Time Series via Supervised Learning Yoichi Chikahara and Akinori Fujino NTT Communication Science Laboratories, Kyoto 619-0237, Japan chikahara.yoichi@lab.ntt.co.jp, fujino.akinori@lab.ntt.co.jp

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

A Bahadur Representation of the Linear Support Vector Machine

A Bahadur Representation of the Linear Support Vector Machine A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support

More information

Importance Reweighting Using Adversarial-Collaborative Training

Importance Reweighting Using Adversarial-Collaborative Training Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

More information

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization

More information

Online Gradient Descent Learning Algorithms

Online Gradient Descent Learning Algorithms DISI, Genova, December 2006 Online Gradient Descent Learning Algorithms Yiming Ying (joint work with Massimiliano Pontil) Department of Computer Science, University College London Introduction Outline

More information