An Adaptive Test of Independence with Analytic Kernel Embeddings
Wittawat Jitkrittum (Gatsby Unit, University College London), Zoltán Szabó (CMAP, École Polytechnique), Arthur Gretton (Gatsby Unit, University College London)
ICML 2017, Sydney, 9 August
What Is Independence Testing?
Let $(X, Y) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$ be random vectors following $P_{xy}$. Given a joint sample $\{(x_i, y_i)\}_{i=1}^n \sim P_{xy}$ (unknown), test
$H_0: P_{xy} = P_x P_y$ vs. $H_1: P_{xy} \neq P_x P_y$.
Compute a test statistic $\hat{\lambda}_n$. Reject $H_0$ if $\hat{\lambda}_n > T_\alpha$ (threshold), where $T_\alpha$ is the $(1-\alpha)$-quantile of the null distribution.
(Figure: null density $\mathbb{P}_{H_0}(\hat{\lambda}_n)$ with the rejection threshold $T_\alpha$ marked.)
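As a concrete illustration of this decision rule only (not the NFSIC test described later), the following minimal Python sketch approximates the null distribution by permuting $Y$ to break the pairing and compares the statistic against the $(1-\alpha)$-quantile. The function name and the permutation null are illustrative assumptions, not part of the talk.

```python
import numpy as np

def independence_test(statistic, X, Y, alpha=0.05, n_permutations=200, seed=0):
    """Reject H0 if statistic(X, Y) exceeds the (1 - alpha)-quantile of an
    approximate null distribution, obtained here by permuting Y so that the
    pairing with X is destroyed (a generic permutation null, used only to
    illustrate the decision rule)."""
    rng = np.random.default_rng(seed)
    stat = statistic(X, Y)
    null_stats = np.array([statistic(X, Y[rng.permutation(len(Y))])
                           for _ in range(n_permutations)])
    threshold = np.quantile(null_stats, 1.0 - alpha)
    return stat > threshold, stat, threshold
```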
Motivations
The modern state-of-the-art test is HSIC [Gretton et al., 2005]: nonparametric (i.e., no assumption on $P_{xy}$) and kernel-based, but slow (runtime $O(n^2)$, where $n$ is the sample size), with no systematic way to choose the kernels.
We propose the Finite-Set Independence Criterion (FSIC), which is
1. nonparametric,
2. linear-time (runtime complexity $O(n)$, fast), and
3. adaptive (kernel parameters can be tuned).
Proposal: The Finite-Set Independence Criterion (FSIC)
1. Pick two kernels: $k$ for $X$ and $l$ for $Y$ (e.g., Gaussian kernels).
2. Pick a feature $(v, w) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$.
3. Transform $(x, y) \mapsto (k(x, v),\, l(y, w))$, then measure the covariance:
$$\mathrm{FSIC}^2(X, Y) = \mathrm{cov}^2_{(x,y) \sim P_{xy}}[k(x, v),\, l(y, w)].$$
(Figures: scatter plots of the data with a candidate location $(v, w)$, next to scatter plots of the transformed pairs $(k(x, v), l(y, w))$; the correlation of the transformed pairs depends on where $(v, w)$ is placed, e.g. 0.97 at a well-placed location and 0.33 at a poorly placed one.)
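A minimal sketch of step 3 for a single feature, assuming Gaussian kernels with widths sigma_x and sigma_y; the plug-in empirical covariance here is for illustration, whereas the estimator used in the talk (see the backup slides below) is an unbiased U-statistic.

```python
import numpy as np

def gauss_kernel(A, b, sigma):
    """Gaussian kernel values k(a_i, b) = exp(-||a_i - b||^2 / (2 sigma^2))."""
    d2 = np.sum((A - b) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fsic2_single_feature(X, Y, v, w, sigma_x=1.0, sigma_y=1.0):
    """Plug-in estimate of FSIC^2 for one feature (v, w): the squared
    empirical covariance between k(x, v) and l(y, w)."""
    kx = gauss_kernel(X, v, sigma_x)   # shape (n,)
    ly = gauss_kernel(Y, w, sigma_y)   # shape (n,)
    cov = np.mean(kx * ly) - np.mean(kx) * np.mean(ly)
    return cov ** 2
```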
General Form of FSIC
$$\mathrm{FSIC}^2(X, Y) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{cov}^2_{(x,y) \sim P_{xy}}[k(x, v_j),\, l(y, w_j)],$$
for $J$ features $\{(v_j, w_j)\}_{j=1}^{J} \subset \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$.
Proposition 1. Assume
1. $k$ and $l$ vanish at infinity and are translation-invariant, characteristic, and real analytic (e.g., Gaussian kernels);
2. the features $\{(v_j, w_j)\}_{j=1}^{J}$ are drawn from a distribution with a density.
Then, for any $J \geq 1$, almost surely, $\mathrm{FSIC}(X, Y) = 0$ iff $X$ and $Y$ are independent.
Under $H_0: P_{xy} = P_x P_y$, $n\,\widehat{\mathrm{FSIC}}^2$ converges to a weighted sum of $J$ dependent $\chi^2$ variables, which makes it difficult to obtain the $(1-\alpha)$-quantile for the threshold.
Normalized FSIC (NFSIC)
Let $\hat{\mathbf{u}} := \big(\widehat{\mathrm{cov}}[k(x, v_1), l(y, w_1)], \ldots, \widehat{\mathrm{cov}}[k(x, v_J), l(y, w_J)]\big)^\top \in \mathbb{R}^J$. Then $\widehat{\mathrm{FSIC}}^2 = \frac{1}{J} \hat{\mathbf{u}}^\top \hat{\mathbf{u}}$.
$$\widehat{\mathrm{NFSIC}}^2(X, Y) = \hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top \big(\hat{\Sigma} + \gamma_n I\big)^{-1} \hat{\mathbf{u}},$$
with a regularization parameter $\gamma_n \geq 0$, where $\hat{\Sigma}_{ij}$ is the covariance of $\hat{u}_i$ and $\hat{u}_j$.
Theorem 1 (the NFSIC test is consistent). Assume $\gamma_n \to 0$ and the same conditions on $k$ and $l$ as before.
1. Under $H_0$, $\hat{\lambda}_n \xrightarrow{d} \chi^2(J)$ as $n \to \infty$, so the threshold $T_\alpha$ is easy to obtain.
2. Under $H_1$, $\mathbb{P}(\text{reject } H_0) \to 1$ as $n \to \infty$.
Complexity: $O(J^3 + J^2 n + (d_x + d_y) J n)$. Only a small $J$ is needed.
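A minimal sketch of the NFSIC decision rule, assuming estimates u_hat (length J) and Sigma_hat (J x J) are already available (see the estimator in the backup slides); SciPy's chi-squared quantile gives the threshold from Theorem 1.

```python
import numpy as np
from scipy.stats import chi2

def nfsic_statistic(u_hat, Sigma_hat, n, gamma_n=1e-4):
    """NFSIC statistic lambda_hat_n = n * u^T (Sigma + gamma_n I)^{-1} u."""
    J = len(u_hat)
    return float(n * u_hat @ np.linalg.solve(Sigma_hat + gamma_n * np.eye(J), u_hat))

def nfsic_reject(u_hat, Sigma_hat, n, alpha=0.05, gamma_n=1e-4):
    """Reject H0 if the statistic exceeds the (1 - alpha)-quantile of chi2(J),
    the asymptotic null distribution."""
    stat = nfsic_statistic(u_hat, Sigma_hat, n, gamma_n)
    threshold = chi2.ppf(1.0 - alpha, df=len(u_hat))
    return stat > threshold, stat, threshold
```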
Tuning Features and Kernels
Split the data into training (tr) and test (te) sets. Procedure:
1. Choose $\{(v_j, w_j)\}_{j=1}^{J}$ and the Gaussian widths by maximizing $\hat{\lambda}_n^{(tr)}$ (i.e., $\hat{\lambda}_n$ computed on the training set), using gradient ascent.
2. Reject $H_0$ if $\hat{\lambda}_n^{(te)} >$ the $(1-\alpha)$-quantile of $\chi^2(J)$.
Splitting avoids overfitting, and the optimization is also linear-time (a simple sketch of this split-then-tune procedure follows below).
Theorem 2.
1. This procedure increases a lower bound on $\mathbb{P}(\text{reject } H_0 \mid H_1 \text{ true})$ (the test power).
2. Asymptotically, if $H_0$ is true, the false rejection rate is $\alpha$.
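A hedged sketch of the split-then-tune procedure, assuming a helper nfsic_stat(X, Y, V, W, sigma_x, sigma_y) that computes $\hat{\lambda}_n$ from raw data (for instance, the nfsic_estimator sketch given with the backup slides below). The talk optimizes locations and widths by gradient ascent; this sketch uses a simple random search over candidates for brevity.

```python
import numpy as np
from scipy.stats import chi2

def tune_then_test(X, Y, nfsic_stat, J=10, alpha=0.05, n_candidates=20, seed=0):
    """Split the sample, pick locations and Gaussian widths that maximize the
    training-set NFSIC statistic (random search stands in for the gradient
    ascent used in the talk), then test on the held-out half."""
    rng = np.random.default_rng(seed)
    n = len(X)
    perm = rng.permutation(n)
    tr, te = perm[: n // 2], perm[n // 2:]

    best = None
    for _ in range(n_candidates):
        V = X[rng.choice(tr, size=J)]                 # candidate locations drawn from data
        W = Y[rng.choice(tr, size=J)]
        sx, sy = np.exp(rng.uniform(-1, 3, size=2))   # candidate Gaussian widths
        score = nfsic_stat(X[tr], Y[tr], V, W, sx, sy)
        if best is None or score > best[0]:
            best = (score, V, W, sx, sy)

    _, V, W, sx, sy = best
    stat_te = nfsic_stat(X[te], Y[te], V, W, sx, sy)
    return stat_te > chi2.ppf(1.0 - alpha, df=J), stat_te
```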
Simulation Settings
Gaussian kernels: $k(x, v) = \exp\!\big(-\|x - v\|^2 / (2\sigma_x^2)\big)$ for $X$ and $l(y, w) = \exp\!\big(-\|y - w\|^2 / (2\sigma_y^2)\big)$ for $Y$.
Methods compared:
1. NFSIC-opt: NFSIC with optimization. $O(n)$.
2. QHSIC [Gretton et al., 2005]: state-of-the-art HSIC. $O(n^2)$.
3. NFSIC-med: NFSIC with random features.
4. NyHSIC: linear-time HSIC with a Nyström approximation.
5. FHSIC: linear-time HSIC with random Fourier features.
6. RDC [Lopez-Paz et al., 2013]: Canonical Correlation Analysis with a cosine basis.
$J = 10$ in NFSIC.
YouTube Video (X) vs. Caption (Y)
$X \in \mathbb{R}^{2000}$: Fisher vector encoding of motion boundary histogram descriptors [Wang and Schmid, 2013]. $Y \in \mathbb{R}^{1878}$: bag of words (term frequency). $\alpha = 0.01$.
(Figures: test power vs. sample size $n$ for the proposed NFSIC and QHSIC; type-I error vs. $n$, where exchanging the $(X, Y)$ pairs makes $H_0$ hold.)
Main point: for large $n$, NFSIC is comparable to HSIC.
Conclusions
Proposed the Finite-Set Independence Criterion (FSIC): $\mathrm{FSIC}(X, Y) = 0 \iff X$ and $Y$ are independent.
The independence test based on FSIC is
1. nonparametric (no parametric assumption on $P_{xy}$),
2. linear-time, and
3. adaptive (parameters automatically tuned).
Python code: github.com/wittawatj/fsic-test
Poster #111. Tonight.
Questions? Thank you.
Requirements on the Kernels
Definition 1 (Analytic kernels). $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be analytic if, for all $x \in \mathcal{X}$, $v \mapsto k(x, v)$ is a real analytic function on $\mathcal{X}$. Analytic means the Taylor series about $x_0$ converges for all $x_0 \in \mathcal{X}$, which implies $k$ is infinitely differentiable.
Definition 2 (Characteristic kernels). Let $\mu_P(v) := \mathbb{E}_{z \sim P}[k(z, v)]$. $k$ is said to be characteristic if $\mu_P$ is unique for distinct $P$; equivalently, $P \mapsto \mu_P$ is injective.
(Figure: the space of distributions, with $P$ and $Q$ mapped to RKHS mean embeddings $\mu_P$ and $\mu_Q$ whose distance is $\mathrm{MMD}(P, Q)$.)
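To make Definition 2 concrete, here is a hedged sketch that evaluates the empirical mean embedding $\hat{\mu}_P(v) = \frac{1}{n}\sum_i k(z_i, v)$ of a sample at query points, and the corresponding plug-in squared MMD between two samples. This is standard kernel-embedding machinery, not a method specific to this talk, and the Gaussian width is an assumed parameter.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mean_embedding(Z, V, sigma=1.0):
    """Empirical mean embedding mu_hat_P(v) = (1/n) sum_i k(z_i, v),
    evaluated at each row v of V."""
    return gauss_gram(V, Z, sigma).mean(axis=1)

def mmd2_biased(Zp, Zq, sigma=1.0):
    """Biased plug-in estimate of MMD^2(P, Q) = ||mu_P - mu_Q||^2 in the RKHS."""
    return (gauss_gram(Zp, Zp, sigma).mean()
            + gauss_gram(Zq, Zq, sigma).mean()
            - 2.0 * gauss_gram(Zp, Zq, sigma).mean())
```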
Optimization Objective = Power Lower Bound
Recall $\hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$. Let $\mathrm{NFSIC}^2(X, Y) := \lambda_n := n\, \mathbf{u}^\top \Sigma^{-1} \mathbf{u}$.
Theorem 3 (A lower bound on the test power).
1. Under some boundedness assumptions, the test power satisfies $\mathbb{P}_{H_1}\big(\hat{\lambda}_n \geq T_\alpha\big) \geq L(\lambda_n)$, where
$$L(\lambda_n) = 1 - 62 e^{-\xi_1 \gamma_n^2 (\lambda_n - T_\alpha)^2 / n} - 2 e^{-\lfloor 0.5 n \rfloor (\lambda_n - T_\alpha)^2 / [\xi_2 n^2]} - 2 e^{-[(\lambda_n - T_\alpha)\gamma_n (n-1)/3 \,-\, \xi_3 n \,-\, c_3 \xi_2 \gamma_n n (n-1)]^2 / [\xi_4 n^2 (n-1)]},$$
and $\xi_1, \ldots, \xi_4, c_3 > 0$ are constants.
2. For large $n$, $L(\lambda_n)$ is increasing in $\lambda_n$.
Therefore, set the test locations and Gaussian widths to $\arg\max L(\lambda_n) = \arg\max \lambda_n$.
An Estimator of $\mathrm{NFSIC}^2$
$$\hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top \big(\hat{\Sigma} + \gamma_n I\big)^{-1} \hat{\mathbf{u}},$$
with $J$ test locations $\{(v_i, w_i)\}_{i=1}^{J}$. Let $K = [k(v_i, x_j)] \in \mathbb{R}^{J \times n}$ and $L = [l(w_i, y_j)] \in \mathbb{R}^{J \times n}$ (no $n \times n$ Gram matrix is needed).
Estimators:
1. $\hat{\mathbf{u}} = \dfrac{(K \circ L)\mathbf{1}_n}{n-1} - \dfrac{(K \mathbf{1}_n) \circ (L \mathbf{1}_n)}{n(n-1)}$.
2. $\hat{\Sigma} = \dfrac{\Gamma \Gamma^\top}{n}$, where $\Gamma := \big(K - \tfrac{1}{n} K \mathbf{1}_n \mathbf{1}_n^\top\big) \circ \big(L - \tfrac{1}{n} L \mathbf{1}_n \mathbf{1}_n^\top\big) - \hat{\mathbf{u}} \mathbf{1}_n^\top$.
Here $\circ$ denotes the elementwise product. $\hat{\lambda}_n$ can be computed in $O(J^3 + J^2 n + (d_x + d_y) J n)$ time.
Main point: linear in $n$, cubic in $J$ (small).
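A hedged sketch of this linear-time estimator with Gaussian kernels, following the formulas above; the width parameters sigma_x, sigma_y and the small ridge gamma_n are assumptions of the sketch, not values from the talk.

```python
import numpy as np

def gauss_gram(V, X, sigma):
    """K[i, j] = exp(-||v_i - x_j||^2 / (2 sigma^2)), shape (J, n)."""
    d2 = np.sum(V**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * V @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def nfsic_estimator(X, Y, V, W, sigma_x=1.0, sigma_y=1.0, gamma_n=1e-4):
    """Compute lambda_hat_n = n u^T (Sigma + gamma_n I)^{-1} u using only
    J x n kernel matrices, so the cost is linear in n."""
    n = X.shape[0]
    K = gauss_gram(V, X, sigma_x)            # (J, n)
    L = gauss_gram(W, Y, sigma_y)            # (J, n)
    # u_hat = (K o L) 1 / (n - 1)  -  (K 1) o (L 1) / (n (n - 1))
    u = (K * L).sum(1) / (n - 1) - (K.sum(1) * L.sum(1)) / (n * (n - 1))
    # Gamma = (K - row means) o (L - row means) - u 1^T;  Sigma = Gamma Gamma^T / n
    Kc = K - K.mean(axis=1, keepdims=True)
    Lc = L - L.mean(axis=1, keepdims=True)
    Gamma = Kc * Lc - u[:, None]
    Sigma = Gamma @ Gamma.T / n
    J = len(u)
    return float(n * u @ np.linalg.solve(Sigma + gamma_n * np.eye(J), u))
```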
Alternative View of FSIC
Recall $\mathrm{FSIC}^2(X, Y) = \mathrm{cov}^2_{(x,y) \sim P_{xy}}[k(x, v), l(y, w)]$. Rewriting the covariance,
$$\mathrm{cov}_{xy}[k(x, v), l(y, w)] = \mathbb{E}_{xy}[k(x, v)\, l(y, w)] - \mathbb{E}_x[k(x, v)]\, \mathbb{E}_y[l(y, w)] := \mu_{xy}(v, w) - \mu_x(v)\, \mu_y(w),$$
the witness function, estimated by $\hat{\mu}_{xy}(v, w) - \hat{\mu}_x(v)\, \hat{\mu}_y(w)$.
FSIC = evaluate the witness function at $J$ locations. Cost: $O(Jn)$.
HSIC = RKHS norm of the witness function. Cost: $O(n^2)$.
HSIC vs. FSIC
Recall the witness $\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \hat{\mu}_x(v)\, \hat{\mu}_y(w)$.
HSIC [Gretton et al., 2005] $= \|\hat{u}\|_{\mathrm{RKHS}}$: good when the difference between $p_{xy}$ and $p_x p_y$ is spatially diffuse, i.e., $\hat{u}$ is almost flat.
FSIC [proposed] $= \frac{1}{J} \sum_{i=1}^{J} \hat{u}^2(v_i, w_i)$: good when the difference between $p_{xy}$ and $p_x p_y$ is local, i.e., $\hat{u}$ is mostly zero but has many peaks (feature interaction).
(Figures: illustrative witness functions over $(v, w)$ for the diffuse and local cases.)
Toy Problem 1: Independent Gaussians
$X \sim \mathcal{N}(0, I_{d_x})$ and $Y \sim \mathcal{N}(0, I_{d_y})$, with $X$ and $Y$ independent, so $H_0$ holds. Set $\alpha := 0.05$ and $d_x = d_y = 250$.
(Figures: type-I error and runtime vs. sample size $n$ for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.)
Main point: all tests have correct type-I error (false positive rate).
Toy Problem 2: Sinusoid
$p_{xy}(x, y) \propto 1 + \sin(\omega x)\sin(\omega y)$, where $x, y \in (-\pi, \pi)$. The changes between $p_{xy}$ and $p_x p_y$ are local. The sample size $n$ is held fixed.
(Figures: joint densities for increasing frequencies $\omega$; test power vs. $\omega$ in $1 + \sin(\omega x)\sin(\omega y)$ for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.)
Main point: NFSIC handles local changes in the joint space well.
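A hedged sketch for generating samples from this sinusoid density by rejection sampling: since $1 + \sin(\omega x)\sin(\omega y) \leq 2$, a uniform proposal on $(-\pi, \pi)^2$ with acceptance probability $(1 + \sin(\omega x)\sin(\omega y))/2$ is valid. This is an illustrative way to reproduce the setup, not necessarily how the experiments in the talk were generated.

```python
import numpy as np

def sample_sinusoid(n, omega=1.0, seed=0):
    """Rejection sampling from p(x, y) proportional to 1 + sin(omega x) sin(omega y)
    on (-pi, pi)^2, using a uniform proposal and acceptance probability
    (1 + sin(omega x) sin(omega y)) / 2."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    while len(xs) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        if rng.uniform() < (1.0 + np.sin(omega * x) * np.sin(omega * y)) / 2.0:
            xs.append(x)
            ys.append(y)
    return np.array(xs)[:, None], np.array(ys)[:, None]
```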
Toy Problem 3: Gaussian Sign
$y = |Z| \prod_{i=1}^{d_x} \mathrm{sign}(x_i)$, where $x \sim \mathcal{N}(0, I_{d_x})$ and $Z \sim \mathcal{N}(0, 1)$ (noise). There is full interaction among $x_1, \ldots, x_{d_x}$: all of $x_1, \ldots, x_{d_x}$ must be considered jointly to detect the dependency.
(Figure: test power for the compared methods.)
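A small sketch generating data for this problem; the dimension and sample size here are illustrative assumptions.

```python
import numpy as np

def sample_gaussian_sign(n, dx=4, seed=0):
    """y = |Z| * prod_i sign(x_i) with x ~ N(0, I_dx) and Z ~ N(0, 1):
    y depends on x only through the full interaction of all coordinates."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dx))
    Z = rng.standard_normal(n)
    y = np.abs(Z) * np.prod(np.sign(X), axis=1)
    return X, y[:, None]
```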
Test Power vs. J
Test power does not always increase with $J$ (the number of test locations). Here $n = 800$.
(Figures: a sinusoid joint density panel and test power vs. $J$.)
As $J$ grows, accurate estimation of $\hat{\Sigma} \in \mathbb{R}^{J \times J}$ in $\hat{\lambda}_n = n\, \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$ becomes more difficult, and a large $J$ defeats the purpose of a linear-time test.
Real Problem: Million Song Data. Song (X) vs. Year of Release (Y)
Western commercial tracks from 1922 to 2011 [Bertin-Mahieux et al., 2011]. $X \in \mathbb{R}^{90}$ contains audio features; $Y \in \mathbb{R}$ is the year of release.
(Figures: type-I error and test power vs. sample size $n$ for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC; breaking the $(X, Y)$ pairs simulates $H_0$.)
Main point: NFSIC-opt has the highest power among the linear-time tests.
Alternative Form of $\hat{u}(v, w)$
Recall $\widehat{\mathrm{FSIC}}^2 = \frac{1}{J} \sum_{i=1}^{J} \hat{u}(v_i, w_i)^2$.
Let $\widehat{\mu_x \mu_y}(v, w)$ be an unbiased estimator of $\mu_x(v)\mu_y(w)$:
$$\widehat{\mu_x \mu_y}(v, w) := \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} k(x_i, v)\, l(y_j, w).$$
An unbiased estimator of $u(v, w)$ is
$$\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \widehat{\mu_x \mu_y}(v, w) = \frac{2}{n(n-1)} \sum_{i < j} h_{(v,w)}\big((x_i, y_i), (x_j, y_j)\big),$$
where
$$h_{(v,w)}\big((x, y), (x', y')\big) := \tfrac{1}{2}\big(k(x, v) - k(x', v)\big)\big(l(y, w) - l(y', w)\big).$$
So $\hat{u}(v, w)$ is a one-sample 2nd-order U-statistic, given $(v, w)$.
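A brief sketch of this unbiased estimate of $u(v, w)$, using the equivalent closed form derived above rather than the $O(n^2)$ double sum over pairs; the Gaussian-kernel form and widths are assumptions of the sketch.

```python
import numpy as np

def u_hat(X, Y, v, w, sigma_x=1.0, sigma_y=1.0):
    """Unbiased estimate of u(v, w) = mu_xy(v, w) - mu_x(v) mu_y(w),
    equal to the 2nd-order U-statistic with kernel
    h((x, y), (x', y')) = 0.5 (k(x, v) - k(x', v)) (l(y, w) - l(y', w))."""
    n = X.shape[0]
    k = np.exp(-np.sum((X - v) ** 2, axis=1) / (2 * sigma_x ** 2))
    l = np.exp(-np.sum((Y - w) ** 2, axis=1) / (2 * sigma_y ** 2))
    mu_xy = np.mean(k * l)
    # sum over i != j of k_i * l_j, divided by n (n - 1)
    mu_x_mu_y = (np.sum(k) * np.sum(l) - np.sum(k * l)) / (n * (n - 1))
    return mu_xy - mu_x_mu_y
```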
References
Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. (2011). The million song dataset. In International Conference on Music Information Retrieval (ISMIR).
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory (ALT).
Lopez-Paz, D., Hennig, P., and Schölkopf, B. (2013). The randomized dependence coefficient. In Advances in Neural Information Processing Systems (NIPS).
Wang, H. and Schmid, C. (2013). Action recognition with improved trajectories. In IEEE International Conference on Computer Vision (ICCV).