An Adaptive Test of Independence with Analytic Kernel Embeddings
Wittawat Jitkrittum
Gatsby Unit, University College London

Probabilistic Graphical Model Workshop 2017
Institute of Statistical Mathematics, Tokyo
24 Feb 2017
Reference

Coauthors: Arthur Gretton (Gatsby Unit, UCL) and Zoltán Szabó (École Polytechnique).

Preprint: An Adaptive Test of Independence with Analytic Kernel Embeddings. Wittawat Jitkrittum, Zoltán Szabó, Arthur Gretton. Python code is available.
What Is Independence Testing?

Let X ∈ R^{d_x}, Y ∈ R^{d_y} be random vectors following P_xy. Given a joint sample {(x_i, y_i)}_{i=1}^n ~ P_xy (unknown), test

    H_0: P_xy = P_x P_y   vs.   H_1: P_xy ≠ P_x P_y.

P_xy = P_x P_y is equivalent to X ⊥ Y.

Compute a test statistic λ̂_n. Reject H_0 if λ̂_n > T_α (threshold), where T_α is the (1 − α)-quantile of the null distribution of λ̂_n.

[Figure: null density P_{H_0}(λ̂_n) and alternative density P_{H_1}(λ̂_n), with the rejection threshold T_α between them.]
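The recipe above (statistic, null quantile, threshold) can be sketched with a generic permutation test. The statistic here is a plain squared Pearson correlation, used only as an illustrative stand-in for λ̂_n; all function names are mine, not from the talk.

```python
import numpy as np

def permutation_test(x, y, stat, alpha=0.05, n_perms=200, seed=0):
    """Reject H0: X independent of Y when stat(x, y) exceeds the
    (1 - alpha)-quantile of its permutation null distribution."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    # Permuting y breaks the pairing, which simulates P_x P_y under H0.
    null = np.array([stat(x, y[rng.permutation(len(y))])
                     for _ in range(n_perms)])
    threshold = np.quantile(null, 1 - alpha)
    return observed, threshold, bool(observed > threshold)

def r2(x, y):
    # Squared Pearson correlation: a simple (non-kernel) statistic.
    return np.corrcoef(x, y)[0, 1] ** 2

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2 * x + 0.1 * rng.normal(size=500)      # strongly dependent pair
stat_val, thr, reject = permutation_test(x, y, r2)
```

A permutation threshold is one generic way to obtain T_α; the talk instead derives an analytic χ² null for its normalized statistic.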
Goals

Want a test which is:
1. Non-parametric, i.e., no parametric assumption on P_xy.
2. Linear-time, i.e., computational complexity is O(n). Fast.
3. Adaptive, i.e., has a well-defined criterion for parameter tuning.

Method                                  Non-parametric   O(n)   Adaptive
Pearson correlation                           ✗            ✓        ✗
HSIC [Gretton et al., 2005]                   ✓            ✗        ✗
HSIC with RFFs* [Zhang et al., 2016]          ✓            ✓        ✗
FSIC (proposed)                               ✓            ✓        ✓

*: RFFs = Random Fourier Features.

Focus on cases where n (sample size) is large.
Witness Function [Gretton et al., 2012]

A function showing the differences between two distributions P and Q. Observe X = {x_1, …, x_n} ~ P and Y = {y_1, …, y_n} ~ Q.

Gaussian kernel: k(x, v) = exp(−‖x − v‖² / (2σ²)).

Empirical mean embedding of P: μ̂_P(v) = (1/n) Σ_{i=1}^n k(x_i, v), i.e., a Gaussian kernel placed on each x_i; μ̂_Q is built from the y_i in the same way.

Witness: û(v) = μ̂_P(v) − μ̂_Q(v).

Maximum Mean Discrepancy (MMD): the empirical MMD(P, Q) = ‖û‖_RKHS, and MMD(P, Q) = 0 if and only if P = Q.
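A minimal numerical sketch of the witness for one-dimensional samples, assuming unit-width Gaussian kernels (the function names and the N(0,1) vs. N(2,1) example are my choices):

```python
import numpy as np

def gauss(a, b, sigma2=1.0):
    # k(x, v) = exp(-||x - v||^2 / (2 sigma^2)), here for scalars.
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma2))

def witness(v, xs, ys):
    """u_hat(v) = mu_hat_P(v) - mu_hat_Q(v): the difference of the two
    empirical mean embeddings, evaluated at the locations v."""
    return gauss(v, xs).mean(axis=1) - gauss(v, ys).mean(axis=1)

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 1000)    # sample from P = N(0, 1)
ys = rng.normal(2.0, 1.0, 1000)    # sample from Q = N(2, 1)
v = np.linspace(-4.0, 6.0, 201)
u = witness(v, xs, ys)
# u is positive where P has more mass than Q, negative where Q dominates.
```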
Independence Test with HSIC [Gretton et al., 2005]

Hilbert-Schmidt Independence Criterion:

    HSIC(X, Y) = MMD(P_xy, P_x P_y) = ‖u‖_RKHS

(need two kernels: k for X, and l for Y).

Empirical witness: û(v, w) = μ̂_xy(v, w) − μ̂_x(v) μ̂_y(w), where μ̂_xy(v, w) = (1/n) Σ_{i=1}^n k(x_i, v) l(y_i, w).

HSIC(X, Y) = 0 if and only if X and Y are independent.

Test statistic = ‖û‖_RKHS (the "flatness" of û). Complexity: O(n²).

Key question: can we measure the flatness in another way that costs only O(n)?
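The quantity the later slides want to avoid computing can be written as a biased O(n²) Gram-matrix estimate; kernel widths and function names here are illustrative, not the talk's exact implementation:

```python
import numpy as np

def hsic_biased(x, y, sx=1.0, sy=1.0):
    """Biased HSIC estimate (1/n^2) tr(K H L H): the squared RKHS norm
    of the empirical witness. Needs two n x n Gram matrices -> O(n^2)."""
    n = len(x)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sx))
    L = np.exp(-(y[:, None] - y[None, :]) ** 2 / (2 * sy))
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=400)
dep = hsic_biased(x, x ** 2)                  # nonlinear dependence
ind = hsic_biased(x, rng.normal(size=400))    # independent pair
```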
Proposal: The Finite Set Independence Criterion (FSIC)

Idea: evaluate û²(v, w) at only finitely many test locations. Draw a set of J random locations {(v_1, w_1), …, (v_J, w_J)}:

    Empirical FSIC²(X, Y) = (1/J) Σ_{i=1}^J û²(v_i, w_i).

Complexity: O((d_x + d_y) J n). Linear time.

But what about an unlucky set of locations? Can FSIC²(X, Y) = 0 even if X and Y are dependent? No: the population FSIC(X, Y) = 0 iff X ⊥ Y, almost surely.
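The plug-in version of this statistic takes only a few lines: only J×n kernel evaluations are needed, never an n×n Gram matrix. The names and the J = 5 standard-normal locations are illustrative choices:

```python
import numpy as np

def fsic2(x, y, V, W, sx=1.0, sy=1.0):
    """Plug-in FSIC^2 = (1/J) sum_i u_hat(v_i, w_i)^2, where
    u_hat(v, w) = mu_hat_xy(v, w) - mu_hat_x(v) mu_hat_y(w).
    Cost O(J n): no n x n Gram matrix is ever formed."""
    Kx = np.exp(-(V[:, None] - x[None, :]) ** 2 / (2 * sx))  # (J, n)
    Ly = np.exp(-(W[:, None] - y[None, :]) ** 2 / (2 * sy))  # (J, n)
    u = (Kx * Ly).mean(axis=1) - Kx.mean(axis=1) * Ly.mean(axis=1)
    return float((u ** 2).mean())

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
V, W = rng.normal(size=5), rng.normal(size=5)    # J = 5 random locations
dep = fsic2(x, x + 0.1 * rng.normal(size=n), V, W)   # dependent pair
ind = fsic2(x, rng.normal(size=n), V, W)             # independent pair
```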
Requirements on the Kernels

Definition 1 (Analytic kernels). k : X × X → R is said to be analytic if, for all x ∈ X, v ↦ k(x, v) is a real analytic function on X. Analytic means the Taylor series about any x_0 ∈ X converges; this implies k is infinitely differentiable.

Definition 2 (Characteristic kernels). Let P, Q be two distributions, and g be a kernel. Let μ_P(v) := E_{z~P}[g(z, v)] and μ_Q(v) := E_{z~Q}[g(z, v)]. g is said to be characteristic if P ≠ Q implies μ_P ≠ μ_Q.

[Figure: the space of distributions mapped into the RKHS, P ↦ μ_P and Q ↦ μ_Q; MMD(P, Q) is the RKHS distance between μ_P and μ_Q.]
FSIC Is a Dependence Measure

Proposition 1. Assume
1. The product kernel g((x, y), (x', y')) := k(x, x') l(y, y') is characteristic and analytic (e.g., k, l Gaussian kernels).
2. The test locations {(v_i, w_i)}_{i=1}^J ~ η, where η has a density.
Then, η-almost surely, FSIC(X, Y) = 0 iff X and Y are independent.

Proof idea: under H_1, u is not the zero function (P ↦ E_{z~P}[g(z, ·)] is injective), and u is analytic. So R_u = {(v, w) | u(v, w) = 0} has zero Lebesgue measure, and {(v_i, w_i)}_{i=1}^J will not fall in R_u (with probability 1).

[Figure: plots of μ̂_{P_xy}(v, w), μ̂_{P_x} μ̂_{P_y}(v, w), and û²(v, w); under H_0 the witness is flat, under H_1 it is not.]
Alternative View of the Witness u(v, w)

The witness u(v, w) can be rewritten as

    u(v, w) := μ_xy(v, w) − μ_x(v) μ_y(w)
             = E_xy[k(x, v) l(y, w)] − E_x[k(x, v)] E_y[l(y, w)]
             = cov_xy[k(x, v), l(y, w)].

1. Transform x ↦ k(x, v) (from R^{d_x} to R) and y ↦ l(y, w) (from R^{d_y} to R).
2. Then take the covariance.

The kernel transformations turn the linear covariance into a dependence measure.
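A quick numerical illustration of this view: take Y = X² plus noise, which is dependent on X but uncorrelated with it. The plain covariance is near zero, while the kernelized covariance u(v, w) is clearly nonzero at some locations. The unit kernel widths and the location grid are my choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x ** 2 + 0.1 * rng.normal(size=5000)   # dependent, yet cov(X, Y) = 0

lin = np.cov(x, y)[0, 1]                   # plain covariance misses this

def u(v, w):
    # u(v, w) = cov(k(x, v), l(y, w)) with unit-width Gaussian kernels.
    kx = np.exp(-(x - v) ** 2 / 2)
    ly = np.exp(-(y - w) ** 2 / 2)
    return np.cov(kx, ly)[0, 1]

# Scan a small grid of test locations: the kernelized covariance is
# far from zero somewhere, so the dependence becomes visible.
grid = np.linspace(-2.0, 2.0, 9)
max_ker = max(abs(u(v, w)) for v in grid for w in grid)
```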
Alternative Form of û(v, w)

Recall the empirical FSIC² = (1/J) Σ_{i=1}^J û(v_i, w_i)².

Let μ̂_x μ̂_y(v, w) be an unbiased estimator of μ_x(v) μ_y(w):

    μ̂_x μ̂_y(v, w) := (1/(n(n−1))) Σ_{i=1}^n Σ_{j≠i} k(x_i, v) l(y_j, w).

An unbiased estimator of u(v, w) is

    û(v, w) = μ̂_xy(v, w) − μ̂_x μ̂_y(v, w)
            = (2/(n(n−1))) Σ_{i<j} h_{(v,w)}((x_i, y_i), (x_j, y_j)),

where h_{(v,w)}((x, y), (x', y')) := (1/2) (k(x, v) − k(x', v)) (l(y, w) − l(y', w)).

For a fixed (v, w), û(v, w) is a one-sample 2nd-order U-statistic.
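The U-statistic form and the direct form above are algebraically identical, which the sketch below checks numerically with an O(n) and an O(n²) implementation (function names are mine; unit kernel widths assumed):

```python
import numpy as np

def u_hat(x, y, v, w):
    """u_hat(v, w) = mu_hat_xy(v, w) minus the unbiased estimate of
    mu_x(v) mu_y(w), computed in O(n) via
    sum_{i != j} a_i b_j = (sum a)(sum b) - sum_i a_i b_i."""
    n = len(x)
    kx = np.exp(-(x - v) ** 2 / 2)
    ly = np.exp(-(y - w) ** 2 / 2)
    mu_xy = (kx * ly).mean()
    mu_x_mu_y = (kx.sum() * ly.sum() - (kx * ly).sum()) / (n * (n - 1))
    return mu_xy - mu_x_mu_y

def u_hat_ustat(x, y, v, w):
    """Same quantity via the 2nd-order U-statistic: the average of
    h((x_i, y_i), (x_j, y_j)) over all pairs i < j (O(n^2))."""
    n = len(x)
    kx = np.exp(-(x - v) ** 2 / 2)
    ly = np.exp(-(y - w) ** 2 / 2)
    total = sum(0.5 * (kx[i] - kx[j]) * (ly[i] - ly[j])
                for i in range(n) for j in range(i + 1, n))
    return 2.0 * total / (n * (n - 1))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.1 * rng.normal(size=200)
a = u_hat(x, y, 0.5, 0.5)
b = u_hat_ustat(x, y, 0.5, 0.5)   # agrees with `a` up to float rounding
```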
Asymptotic Distribution of û

    Empirical FSIC²(X, Y) = (1/J) Σ_{i=1}^J û²(v_i, w_i) = (1/J) û^T û,

where û = (û(v_1, w_1), …, û(v_J, w_J))^T.

Proposition 2 (Asymptotic distribution of û). For any fixed locations {(v_i, w_i)}_{i=1}^J, we have √n (û − u) →_d N(0, Σ), where

    Σ_ij = E_xy[k̃(x, v_i) l̃(y, w_i) k̃(x, v_j) l̃(y, w_j)] − u(v_i, w_i) u(v_j, w_j),

k̃(x, v) := k(x, v) − E_{x'} k(x', v), and l̃(y, w) := l(y, w) − E_{y'} l(y', w).

Under H_0, n × (empirical FSIC²) = (n/J) û^T û converges to a weighted sum of dependent χ² variables. Difficult to get the (1 − α)-quantile for the threshold.
Normalized FSIC (NFSIC)

    Empirical NFSIC²(X, Y) = λ̂_n := n û^T (Σ̂ + γ_n I)^{−1} û,

with a regularization parameter γ_n ≥ 0. Key: NFSIC = FSIC normalized by the covariance.

Theorem 1 (NFSIC test is consistent). Assume
1. The product kernel is characteristic and analytic.
2. lim_{n→∞} γ_n = 0.
Then, for any k, l and {(v_i, w_i)}_{i=1}^J,
1. Under H_0, λ̂_n →_d χ²(J) as n → ∞.
2. Under H_1, lim_{n→∞} P(λ̂_n > T_α) = 1, η-almost surely.

Asymptotically, the false positive rate is at α under H_0, and the test always rejects under H_1.
An Estimator of NFSIC²

    λ̂_n := n û^T (Σ̂ + γ_n I)^{−1} û.

Test locations {(v_i, w_i)}_{i=1}^J. Let K = [k(v_i, x_j)] ∈ R^{J×n} and L = [l(w_i, y_j)] ∈ R^{J×n}. (No n × n Gram matrix.)

Estimators (∘ is the elementwise product):
1. û = (K ∘ L) 1_n / (n − 1) − (K 1_n) ∘ (L 1_n) / (n(n − 1)).
2. Σ̂ = Γ Γ^T / n, where Γ := (K − (1/n) K 1_n 1_n^T) ∘ (L − (1/n) L 1_n 1_n^T) − û 1_n^T.

λ̂_n can be computed in O(J³ + J² n + (d_x + d_y) J n) time. Main point: linear in n; cubic in J (small).
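These matrix expressions transcribe directly into NumPy. This is a sketch: the kernel widths, the value of γ_n, and the J = 3 random locations are illustrative choices, not the talk's tuned settings.

```python
import numpy as np

def nfsic(x, y, V, W, sx=1.0, sy=1.0, gamma=1e-5):
    """lambda_hat_n = n u_hat^T (Sigma_hat + gamma I)^{-1} u_hat, using
    only the J x n kernel matrices K and L, so the cost is linear in n."""
    n = len(x)
    K = np.exp(-(V[:, None] - x[None, :]) ** 2 / (2 * sx))   # (J, n)
    L = np.exp(-(W[:, None] - y[None, :]) ** 2 / (2 * sy))   # (J, n)
    u = (K * L).sum(axis=1) / (n - 1) \
        - K.sum(axis=1) * L.sum(axis=1) / (n * (n - 1))
    Kc = K - K.mean(axis=1, keepdims=True)    # K - (1/n) K 1 1^T
    Lc = L - L.mean(axis=1, keepdims=True)    # L - (1/n) L 1 1^T
    Gamma = Kc * Lc - u[:, None]              # Gamma = Kc o Lc - u 1^T
    Sigma = Gamma @ Gamma.T / n
    Jdim = len(V)
    return float(n * u @ np.linalg.solve(Sigma + gamma * np.eye(Jdim), u))

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
V, W = rng.normal(size=3), rng.normal(size=3)          # J = 3 locations
lam_dep = nfsic(x, x + 0.3 * rng.normal(size=n), V, W)
lam_ind = nfsic(x, rng.normal(size=n), V, W)   # approx. chi^2(3) draw
```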
Optimizing Test Locations {(v_i, w_i)}_{i=1}^J

The NFSIC² test is consistent for any random locations {(v_i, w_i)}_{i=1}^J. In practice, tuning them will increase the test power.

Under H_0: X ⊥ Y, we have λ̂_n ~ χ²(J) as n → ∞. Under H_1, λ̂_n will be large, following some distribution P_{H_1}(λ̂_n) far to the right of the threshold T_α.

Test power = P(reject H_0 | H_1 true) = P(λ̂_n > T_α).

Idea: pick the locations and Gaussian widths to maximize (a lower bound on) the test power.

[Figure: the χ²(J) null density with threshold T_α, and the alternative density P_{H_1}(λ̂_n) to its right.]
Optimization Objective = Power Lower Bound

Recall λ̂_n := n û^T (Σ̂ + γ_n I)^{−1} û. Let NFSIC²(X, Y) := λ_n := n u^T Σ^{−1} u (the population counterpart).

Theorem 2 (A lower bound on the test power). Under some conditions, for any k, l, and {(v_i, w_i)}_{i=1}^J, the test power satisfies P(λ̂_n > T_α) ≥ L(λ_n), where

    L(λ_n) = 1 − 62 e^{−ξ_1² γ_n (λ_n − T_α)² / n}
               − 2 e^{−⌊0.5n⌋ (λ_n − T_α)² / [ξ_2 n²]}
               − 2 e^{−[(λ_n − T_α) γ_n (n−1)/3 − ξ_3 n − c_3 ξ_2 γ_n² n(n−1)]² / [ξ_4 n² (n−1)]},

and ξ_1, …, ξ_4, c_3 > 0 are constants.

For large n, L(λ_n) is increasing in λ_n.

Do: locations and Gaussian widths = arg max L(λ_n) = arg max λ_n.
Optimization Procedure

NFSIC²(X, Y) := λ_n := n u^T Σ^{−1} u is unknown. Split the data into 2 disjoint sets: a training (tr) set and a test (te) set.

Procedure:
1. Estimate λ_n with λ̂_n^{(tr)} (i.e., computed on the training set).
2. Optimize all {(v_i, w_i)}_{i=1}^J and the Gaussian widths with gradient ascent.
3. Run the independence test with λ̂_n^{(te)}. Reject H_0 if λ̂_n^{(te)} > T_α.

Splitting avoids overfitting.

But what does this do to P(λ̂_n > T_α) when H_0 holds? Still asymptotically at α: λ_n = 0 iff X, Y are independent, so under H_0 we do arg max 0 = arbitrary locations, and the asymptotic null distribution is χ²(J) for any locations.
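The split-optimize-test loop can be sketched end to end with a simplified one-location (J = 1) statistic and a grid search in place of gradient ascent; the threshold is the χ²(1) 0.95-quantile, ≈ 3.84. Everything here is an illustrative simplification of the talk's procedure:

```python
import numpy as np

def stat_j1(x, y, v, w, gamma=1e-5):
    """One-location NFSIC-style statistic n u_hat^2 / (sigma_hat^2 + gamma),
    asymptotically chi^2(1) under independence."""
    n = len(x)
    kx = np.exp(-(x - v) ** 2 / 2)
    ly = np.exp(-(y - w) ** 2 / 2)
    u = (kx * ly).sum() / (n - 1) - kx.sum() * ly.sum() / (n * (n - 1))
    f = (kx - kx.mean()) * (ly - ly.mean()) - u
    return n * u ** 2 / (f @ f / n + gamma)

rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=n)
y = x + 0.3 * rng.normal(size=n)

# 1) Split into disjoint training and test halves.
xtr, xte = x[: n // 2], x[n // 2:]
ytr, yte = y[: n // 2], y[n // 2:]

# 2) Pick the location maximizing the training statistic (grid search
#    here; the talk optimizes locations and widths by gradient ascent).
grid = [(v, w) for v in np.linspace(-2, 2, 5) for w in np.linspace(-2, 2, 5)]
v_best, w_best = max(grid, key=lambda vw: stat_j1(xtr, ytr, *vw))

# 3) Test on the held-out half against the chi^2(1) threshold.
reject = stat_j1(xte, yte, v_best, w_best) > 3.84
```

Because step 3 never reuses the training half, the null distribution of the test-half statistic is unaffected by the optimization.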
Demo: 2D Rotation

[Figure: μ̂_xy(v, w), μ̂_x(v) μ̂_y(w), the witness μ̂_xy(v, w) − μ̂_x(v) μ̂_y(w), Σ̂(v, w), and λ̂_n, shown as the test locations are optimized.]
Demo: Sin Problem (ω = 1)

[Figure: the same quantities for the Sin problem, whose joint density p(x, y) oscillates with frequency ω.]
80 Simulation Settings
n = full sample size. All methods use Gaussian kernels for both X and Y. Compare 6 methods:

Method      Description                   Tuning           Test size   Complexity
NFSIC-opt   Proposed                      Gradient ascent  n/2         O(n)
NFSIC-med   No tuning; random locations   —                n           O(n)
QHSIC       Full HSIC                     Median heu.      n           O(n²)
NyHSIC      Nyström HSIC                  Median heu.      n           O(n)
FHSIC       HSIC + RFFs                   Median heu.      n           O(n)
RDC         RFFs + CCA                    Median heu.      n           O(n log n)

RFFs: random Fourier features. Given a problem, report the rejection rate of H_0. The same number of features is used for all methods (except QHSIC, which uses the full kernel); J = 10 in NFSIC. 20/29
86 Toy Problem 1: Independent Gaussians
X ~ N(0, I_dx) and Y ~ N(0, I_dy). X and Y are independent, so H_0 holds. Set α := 0.05, d_x = d_y = 250.
[Figure: type-I error vs. sample size n for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC, with a second panel of runtime (s) vs. sample size n.]
All methods attain the correct type-I error (false positive rate). 21/29
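The "correct type-I error" claim can be checked empirically for any calibrated test: run the test on many fresh H_0 samples and confirm the rejection rate concentrates around α. The sketch below is illustrative only (it uses a one-dimensional permutation test with |Pearson correlation| as a stand-in statistic, not any of the kernel tests from the slides; `perm_test` and `type_one_rate` are made-up names).

```python
import numpy as np

def perm_test(x, y, n_perms=200, alpha=0.05, rng=None):
    """Permutation test with |Pearson correlation| as the statistic
    (a simple stand-in for any independence statistic)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    stat = abs(np.corrcoef(x, y)[0, 1])
    null = [abs(np.corrcoef(x, y[rng.permutation(len(y))])[0, 1])
            for _ in range(n_perms)]
    return stat >= np.quantile(null, 1 - alpha)

def type_one_rate(trials=200, n=100, seed=0):
    """Empirical rejection rate when H0 truly holds (independent Gaussians);
    a well-calibrated test should give a rate close to alpha = 0.05."""
    rng = np.random.default_rng(seed)
    return sum(perm_test(rng.normal(size=n), rng.normal(size=n), rng=rng)
               for _ in range(trials)) / trials
```

The same loop, with a kernel independence test plugged in for `perm_test`, is what the type-I-error panels on this slide report.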
89 Toy Problem 2: Sinusoid
p_xy(x, y) ∝ 1 + sin(ωx) sin(ωy) where x, y ∈ (−π, π). Local changes between p_xy and p_x p_y. Set n =
[Figure: sample plots of p_xy for increasing ω, and test power vs. ω in 1 + sin(ωx) sin(ωy) for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.]
Main point: NFSIC can handle well the local changes in the joint space. 22/29
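The sinusoid density is easy to sample by rejection, which is handy for reproducing this kind of experiment. A minimal sketch, assuming the domain (−π, π)² shown in the plots (`sample_sin` is a name made up for this example):

```python
import numpy as np

def sample_sin(n, omega=1, seed=0):
    """Rejection-sample n points from p(x, y) ∝ 1 + sin(ωx) sin(ωy)
    on (-π, π)². Proposal: uniform on the square; since the unnormalized
    density lies in [0, 2], accepting with probability (1 + sin·sin)/2 ≤ 1
    is a valid rejection step."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        if rng.uniform() < (1 + np.sin(omega * x) * np.sin(omega * y)) / 2:
            out.append((x, y))
    return np.array(out)
```

Both marginals of p_xy are uniform on (−π, π), so all the dependence lives in the local "checkerboard" structure of the joint density, which is exactly what makes this problem hard for diffuse statistics.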
97 Toy Problem 3: Gaussian Sign
y = |Z| ∏_{i=1}^{d_x} sign(x_i), where x ~ N(0, I_{d_x}) and Z ~ N(0, 1) (noise). Full interaction among x_1, …, x_{d_x}: all of x_1, …, x_{d_x} must be considered jointly to detect the dependency.
[Figure: test power vs. sample size n for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.]
Main point: NFSIC can handle feature interaction. 23/29
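Generating data for this problem is a one-liner; the interesting property is that each individual coordinate x_i is marginally independent of y, so the dependence is only visible through all coordinates jointly. A sketch (`gaussian_sign` is a made-up name):

```python
import numpy as np

def gaussian_sign(n, dx, seed=0):
    """y = |Z| * prod_i sign(x_i). The sign of y equals the product of the
    coordinate signs, so no strict subset of coordinates predicts y."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, dx))
    Z = rng.normal(size=n)
    y = np.abs(Z) * np.prod(np.sign(X), axis=1)
    return X, y
```

For d_x ≥ 2, E[x_i · y] = 0 for every coordinate, so any test that looks at pairwise or low-order structure sees nothing.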
99 HSIC vs. FSIC
Recall the witness û(v, w) = μ̂_xy(v, w) − μ̂_x(v) μ̂_y(w).
HSIC [Gretton et al., 2005] = ‖û‖_RKHS. Good when the difference between p_xy and p_x p_y is spatially diffuse, i.e., û is almost flat.
FSIC [proposed] = (1/J) Σ_{i=1}^{J} û²(v_i, w_i). Good when the difference between p_xy and p_x p_y is local, i.e., û is mostly zero but has many peaks (feature interaction). 24/29
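The witness and the FSIC statistic are straightforward to estimate from a joint sample. A minimal sketch with Gaussian kernels — these are the plain plug-in estimators, not the studentized NFSIC statistic λ̂_n, and the function names are made up for this example:

```python
import numpy as np

def gauss(A, c, sig):
    """Gaussian kernel k(a, c) = exp(-||a - c||² / (2 sig²)), row-wise in A."""
    return np.exp(-np.sum((A - c) ** 2, axis=1) / (2 * sig ** 2))

def witness(X, Y, v, w, sigx=1.0, sigy=1.0):
    """û(v, w) = μ̂_xy(v, w) − μ̂_x(v) μ̂_y(w), from a paired sample (X, Y)."""
    kx, ky = gauss(X, v, sigx), gauss(Y, w, sigy)
    return np.mean(kx * ky) - np.mean(kx) * np.mean(ky)

def fsic2(X, Y, V, W, sigx=1.0, sigy=1.0):
    """FSIC²(X, Y) = (1/J) Σ_j û²(v_j, w_j) over the J test locations."""
    return np.mean([witness(X, Y, v, w, sigx, sigy) ** 2
                    for v, w in zip(V, W)])
```

Each location costs O(n), so the whole statistic is O(Jn) — linear in the sample size, in contrast to the O(n²) full HSIC.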
100 Real Problem 1: Million Song Data
Song (X) vs. year of release (Y). Western commercial tracks from 1922 to 2011 [Bertin-Mahieux et al., 2011]. X ∈ R^90 contains audio features. Y ∈ R is the year of release.
Break the (X, Y) pairs to simulate H_0.
[Figure: type-I error vs. sample size n with broken pairs (H_0 true), and test power vs. sample size n when H_1 is true, for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.] 25/29
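"Breaking the pairs" just means re-ordering one sample relative to the other: the marginals P_x and P_y are untouched, but the pairing — and with it any dependence — is destroyed, yielding a sample from P_x P_y. A sketch (`break_pairs` is a made-up name):

```python
import numpy as np

def break_pairs(X, Y, seed=0):
    """Return (X, Y_perm) with Y randomly re-ordered: marginal distributions
    are unchanged, but the joint sample now follows P_x P_y (H_0)."""
    rng = np.random.default_rng(seed)
    return X, Y[rng.permutation(len(Y))]
```

Running a test on broken pairs is how the type-I-error panels for the real-data problems are produced.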
102 Real Problem 2: Videos and Captions
YouTube video (X) vs. caption (Y). VideoStory46K [Habibian et al., 2014]. X ∈ R^2000: Fisher vector encoding of motion boundary histogram descriptors [Wang and Schmid, 2013]. Y ∈ R^1878: bag of words with TF weighting.
Break the (X, Y) pairs to simulate H_0.
[Figure: type-I error vs. sample size n with broken pairs (H_0 true), and test power vs. sample size n when H_1 is true, for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.] 26/29
104 Penalize Redundant Test Locations
Consider the Sin problem. Use J = 2 locations. Optimization objective: λ̂_n. Write t = (v, w). Fix t_1 at a marked location and plot t_2 ↦ λ̂_n(t_1, t_2).
[Figure: a sample from the Sin problem with t_1 fixed, and the trajectory of t_2 on the surface λ̂_n(t_1, t_2).]
The optimized t_1, t_2 will not be in the same neighbourhood. 27/29
105 Test Power vs. J
Test power does not always increase with J (the number of test locations).
[Figure: test power vs. J on the Sin problem.]
Accurate estimation of Σ̂ ∈ R^{J×J} in λ̂_n = n û^⊤ (Σ̂ + γ_n I)^{-1} û becomes more difficult, and a large J defeats the purpose of a linear-time test. 28/29
106 Conclusions
Proposed the Finite Set Independence Criterion (FSIC). The independence test based on FSIC is
1. non-parametric,
2. linear-time,
3. adaptive (parameters are automatically tuned).
Future work:
Any way to interpret the learned {(v_i, w_i)}_{i=1}^J?
Relative efficiency of FSIC vs. block HSIC and RFF-HSIC. 29/29
108 Questions? Thank you 30/29
109 References I
Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. (2011). The Million Song Dataset. In International Conference on Music Information Retrieval (ISMIR).
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring Statistical Dependence with Hilbert-Schmidt Norms. In Algorithmic Learning Theory (ALT). 31/29
110 References II
Habibian, A., Mensink, T., and Snoek, C. G. (2014). VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events. In ACM International Conference on Multimedia.
Wang, H. and Schmid, C. (2013). Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision (ICCV).
Zhang, Q., Filippi, S., Gretton, A., and Sejdinovic, D. (2016). Large-Scale Kernel Methods for Independence Testing. 32/29
More informationGeneralized clustering via kernel embeddings
Generalized clustering via kernel embeddings Stefanie Jegelka 1, Arthur Gretton 2,1, Bernhard Schölkopf 1, Bharath K. Sriperumbudur 3, and Ulrike von Luxburg 1 1 Max Planck Institute for Biological Cybernetics,
More informationRecent advances in Time Series Classification
Distance Shapelet BoW Kernels CCL Recent advances in Time Series Classification Simon Malinowski, LinkMedia Research Team Classification day #3 S. Malinowski Time Series Classification 21/06/17 1 / 55
More informationPattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods
Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationarxiv: v1 [math.st] 24 Aug 2017
Submitted to the Annals of Statistics MULTIVARIATE DEPENDENCY MEASURE BASED ON COPULA AND GAUSSIAN KERNEL arxiv:1708.07485v1 math.st] 24 Aug 2017 By AngshumanRoy and Alok Goswami and C. A. Murthy We propose
More informationStatistical Convergence of Kernel CCA
Statistical Convergence of Kernel CCA Kenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp Francis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris,
More information