Tensor Product Kernels: Characteristic Property, Universality


1 Tensor Product Kernels: Characteristic Property, Universality. Zoltán Szabó (CMAP, École Polytechnique). Joint work with Bharath K. Sriperumbudur. Hangzhou International Conference on Frontiers of Data Science, May 19, 2018.

2-6 Motivation: Classical Information Theory

Kullback-Leibler divergence:
  KL(P, Q) = ∫_{R^d} p(x) log[p(x)/q(x)] dx.

Mutual information:
  I(P) = KL(P, ⊗_{m=1}^M P_m).

Properties:
  1. I(P) ≥ 0.
  2. I(P) = 0 ⇔ P = ⊗_{m=1}^M P_m.

Alternatives: Rényi, Tsallis, L2 divergence, ... Typically: X = R^d.
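These two definitions are easy to check numerically; the following is a minimal sketch (not from the talk) using a toy 2x2 discrete joint, where I(P) = KL(P, P_1 ⊗ P_2) reduces to a finite sum:

```python
import numpy as np

def kl(p, q):
    # Discrete KL(P, Q) = sum_x p(x) log(p(x)/q(x)); assumes supp(p) is in supp(q)
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# Toy dependent joint on {1,2} x {1,2}
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])
P1, P2 = P.sum(axis=1), P.sum(axis=0)   # marginals

# Mutual information as a KL divergence to the product of marginals
I = kl(P.ravel(), np.outer(P1, P2).ravel())
print(I > 0)  # True: the joint is dependent, so I(P) > 0
```

For this joint, I(P) = 0.8·log(1.6) + 0.2·log(0.4) ≈ 0.19, and it vanishes exactly when P equals the product of its marginals.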

7-10 From R^d to a Diverse Set of Domains: Kernel Examples

X = R^d, γ > 0; k(x, y) = ⟨φ(x), φ(y)⟩_H:
  polynomial:  k_p(x, y) = (⟨x, y⟩ + γ)^p,
  exponential: k_e(x, y) = e^{-γ‖x-y‖_2},
  Gaussian:    k_g(x, y) = e^{-γ‖x-y‖_2²},
  Cauchy:      k_C(x, y) = 1/(1 + γ‖x-y‖_2²).

X = strings, texts: r-spectrum kernel (# of common substrings of length ≤ r).
X = time series: dynamic time-warping.
X = graphs, sets, permutations, ...
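The four R^d kernels listed above are one-liners; a hedged sketch (parameter choices γ = 1, p = 2 are illustrative, not from the talk):

```python
import numpy as np

def k_poly(x, y, gamma=1.0, p=2):
    return (np.dot(x, y) + gamma) ** p               # polynomial: (<x,y> + gamma)^p

def k_exp(x, y, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(x - y))    # exponential: exp(-gamma ||x-y||_2)

def k_gauss(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))     # Gaussian: exp(-gamma ||x-y||_2^2)

def k_cauchy(x, y, gamma=1.0):
    return 1.0 / (1.0 + gamma * np.sum((x - y) ** 2))  # Cauchy: 1/(1 + gamma ||x-y||_2^2)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
vals = [k(x, y) for k in (k_poly, k_exp, k_gauss, k_cauchy)]
```

Here ‖x-y‖_2² = 2, so e.g. k_gauss(x, y) = e^{-2} and k_cauchy(x, y) = 1/3.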

11-15 Distribution Representation

Mean embedding: P ↦ μ_P = ∫_X φ(x) dP(x).

Analogues:
  Cdf: P ↦ F_P(z) = E_{x~P} χ_{(-∞,z)}(x).
  Characteristic function: P ↦ c_P(z) = ∫ e^{i⟨z,x⟩} dP(x).
  Moment-generating function: P ↦ M_P(z) = ∫ e^{⟨z,x⟩} dP(x).

Trick: the feature map φ is available on any kernel-endowed domain!

16-18 Objects of Interest

KL divergence & mutual information on kernel-endowed domains.

Mean embedding:
  μ_k(P) := ∫_X φ(x) dP(x) = ∫_X k(·, x) dP(x) ∈ H_k.

Maximum mean discrepancy:
  MMD_k(P, Q) := ‖μ_k(P) - μ_k(Q)‖_{H_k}.

Hilbert-Schmidt independence criterion, with k = ⊗_{m=1}^M k_m:
  HSIC_k(P) := MMD_k(P, ⊗_{m=1}^M P_m).
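MMD has a simple plug-in estimate: replace μ_k(P) by the empirical embedding (1/n)Σ_i k(·, x_i), which yields a sum of Gram-matrix means. A sketch with a Gaussian kernel (γ = 1 and the sample sizes are illustrative assumptions):

```python
import numpy as np

def gauss_gram(X, Y, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||X_i - Y_j||^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2_biased(X, Y, gamma=1.0):
    # Plug-in (biased) estimate of ||mu_k(P) - mu_k(Q)||^2_{H_k}
    return (gauss_gram(X, X, gamma).mean()
            - 2.0 * gauss_gram(X, Y, gamma).mean()
            + gauss_gram(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 1))
Y = rng.normal(1.0, 1.0, size=(300, 1))   # shifted distribution

same = mmd2_biased(X, X)   # identical samples: estimate is exactly 0
diff = mmd2_biased(X, Y)   # different distributions: clearly positive
```

Being a squared RKHS norm, the estimate is nonnegative, and it vanishes exactly when the two empirical embeddings coincide.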

19-22 RKHS Intuition (X := X_m, k := k_m)

Given: a set X and a H(ilbert space). Kernel:
  k(a, b) = ⟨φ(a), φ(b)⟩_H.

Reproducing kernel of H ⊂ R^X:
  k(·, b) ∈ H,   f(b) = ⟨f, k(·, b)⟩_H   (reproducing property)
  ⇒ (specializing f = k(·, a))   k(a, b) = ⟨k(·, a), k(·, b)⟩_H.

H_k = {Σ_{i=1}^n α_i k(·, x_i)}.
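One immediate consequence of k(a, b) = ⟨φ(a), φ(b)⟩_H is that every Gram matrix is positive semidefinite, since αᵀKα = ‖Σ_i α_i φ(x_i)‖²_H ≥ 0. A quick numerical sanity check for the Gaussian kernel (illustrative sketch only):

```python
import numpy as np

# Gram matrix of the Gaussian kernel on random points; it must be PSD
# because alpha^T K alpha = || sum_i alpha_i phi(x_i) ||_H^2 >= 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2)

min_eig = np.linalg.eigvalsh(K).min()
print(min_eig >= -1e-10)  # True: PSD up to numerical round-off
```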

23 Mean Embedding, MMD: Applications & Review

Applications: two-sample testing [Borgwardt et al., 2006, Gretton et al., 2012], domain adaptation [Zhang et al., 2013] and domain generalization [Blanchard et al., 2017], kernel Bayesian inference [Song et al., 2011, Fukumizu et al., 2013], approximate Bayesian computation [Park et al., 2016], probabilistic programming [Schölkopf et al., 2015], model criticism [Lloyd et al., 2014, Kim et al., 2016], goodness-of-fit testing [Balasubramanian et al., 2017], distribution classification [Muandet et al., 2011, Lopez-Paz et al., 2015, Zaheer et al., 2017], distribution regression [Szabó et al., 2016, Law et al., 2018], topological data analysis [Kusano et al., 2016]. Review: [Muandet et al., 2017]. Switching to HSIC...

24-25 MMD → HSIC

MMD with k = ⊗_{m=1}^M k_m:
  k(x, x') := ∏_{m=1}^M k_m(x_m, x_m'),
  HSIC_k(P) := MMD_k(P, ⊗_{m=1}^M P_m).

Applications: blind source separation [Gretton et al., 2005], feature selection [Song et al., 2012], post-selection inference [Yamada et al., 2018], independence testing [Gretton et al., 2008], causal inference [Mooij et al., 2016, Pfister et al., 2017, Strobl et al., 2017].
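For samples rather than known distributions, HSIC with M = 2 has a classic biased (V-statistic) estimate via centered Gram matrices, HSIC_b = tr(KHLH)/n² with H = I - (1/n)11ᵀ [Gretton et al., 2005]. A hedged sketch (bandwidths and sample sizes are illustrative assumptions):

```python
import numpy as np

def gauss_gram(X, gamma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def hsic_biased(X, Y, gamma=1.0):
    # Biased V-statistic estimate of HSIC for M = 2:
    # HSIC_b = tr(K H L H) / n^2, with the centering matrix H = I - (1/n) 11^T
    n = X.shape[0]
    K, L = gauss_gram(X, gamma), gauss_gram(Y, gamma)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y_dep = X + 0.1 * rng.normal(size=(200, 1))   # strongly dependent on X
Y_ind = rng.normal(size=(200, 1))             # independent of X
```

On these samples hsic_biased(X, Y_dep) is much larger than hsic_biased(X, Y_ind); the latter is small but positive, reflecting the O(1/n) bias of the V-statistic.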

26-29 Central in Applications: Characteristic Property

MMD: k is called characteristic [Fukumizu et al., 2008] if
  MMD_k(P, Q) = 0 ⇔ P = Q.
Injectivity of P ↦ μ_P on finite signed measures: universality [Steinwart, 2001].

HSIC: k = ⊗_{m=1}^M k_m will be called I-characteristic if
  HSIC_k(P) = 0 ⇔ P = ⊗_{m=1}^M P_m.

For ⊗_{m=1}^M k_m: universal ⇒ characteristic ⇒ I-characteristic.

Wanted: the characteristic properties of ⊗_{m=1}^M k_m in terms of the k_m's.

30-31 Well-Known: Shift-Invariant Kernels on R^d

Theorem ([Sriperumbudur et al., 2010]). k is characteristic iff supp(Λ) = R^d, where
  k(x, x') = k_0(x - x') = ∫_{R^d} e^{i⟨x-x', ω⟩} dΛ(ω).

Example on R:
  Gaussian:  k_0(x) = e^{-x²/(2σ²)},  k̂_0(ω) = σ e^{-σ²ω²/2},         supp(k̂_0) = R.
  Laplacian: k_0(x) = e^{-σ|x|},      k̂_0(ω) = √(2/π) σ/(σ² + ω²),   supp(k̂_0) = R.
  Sinc:      k_0(x) = sin(σx)/x,      k̂_0(ω) = √(π/2) χ_{[-σ,σ]}(ω), supp(k̂_0) = [-σ, σ].
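The spectral representation k_0(x - x') = ∫ e^{i⟨x-x', ω⟩} dΛ(ω) can be checked by Monte Carlo: for the Gaussian kernel k_0(u) = e^{-u²/2}, the spectral measure Λ is the standard normal, so averaging cos(ωu) over ω ~ N(0, 1) recovers k_0(u) (the imaginary part cancels by symmetry of Λ). A sketch with σ = 1 assumed:

```python
import numpy as np

# Bochner representation check: k0(u) = E_{omega ~ N(0,1)}[cos(omega * u)]
# for the Gaussian kernel k0(u) = exp(-u^2 / 2).
rng = np.random.default_rng(0)
omega = rng.normal(size=200_000)   # draws from the spectral measure Lambda

u = 1.3
mc = np.cos(omega * u).mean()      # Monte Carlo estimate of k0(u)
exact = np.exp(-u ** 2 / 2)

print(abs(mc - exact) < 1e-2)  # True: estimate matches k0(u)
```

This is exactly the random Fourier feature construction; the theorem above says the kernel is characteristic because this Λ has full support.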

32-34 Well-Known: M = 2

[Blanchard et al., 2011, Gretton, 2015]: k_1, k_2 universal ⇒ k_1 ⊗ k_2 universal (⇒ I-characteristic).
Distance covariance [Lyons, 2013, Sejdinovic et al., 2013]: k_1, k_2 characteristic ⇔ k_1 ⊗ k_2 I-characteristic.

Goal: extension to arbitrary M ≥ 2.

35 k_1, k_2, k_3 Characteristic ⇏ ⊗_{m=1}^3 k_m I-Characteristic

Example: X_m = {1, 2}, τ_{X_m} = P({1, 2}), k_m(x, x') = 2δ_{x,x'} - 1, M = 3.
Then (k_m)_{m=1}^3 are characteristic, but ⊗_{m=1}^3 k_m is not I-characteristic. Witness:
  p_{1,1,1} = 1/5,  p_{1,1,2} = 1/10, p_{1,2,1} = 1/10, p_{1,2,2} = 1/10,
  p_{2,1,1} = 1/5,  p_{2,1,2} = 1/10, p_{2,2,1} = 1/10, p_{2,2,2} = 1/10.
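This counterexample can be verified with a few lines of linear algebra: for a discrete P, HSIC_k(P)² = dᵀKd, where d = p - ⊗_m p_m and K is the Gram matrix of ⊗³k_m over the 8 joint states. One assumption in the sketch below: the slide's witness table is truncated at its last entry, so p_{2,2,2} = 1/10 is filled in by normalization.

```python
import numpy as np

# Witness distribution on {1,2}^3 (array index 0 <-> state 1, 1 <-> state 2);
# p[1,1,1] = 1/10 is completed by normalization (the slide's table is truncated).
p = np.zeros((2, 2, 2))
p[0, 0, 0], p[0, 0, 1], p[0, 1, 0], p[0, 1, 1] = 1/5, 1/10, 1/10, 1/10
p[1, 0, 0], p[1, 0, 1], p[1, 1, 0], p[1, 1, 1] = 1/5, 1/10, 1/10, 1/10

# Product of the marginals and the difference signed measure d = p - q
P1, P2, P3 = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))
q = np.einsum('i,j,k->ijk', P1, P2, P3)
d = p - q

# k_m(x, x') = 2*delta_{x,x'} - 1 has Gram matrix S; the tensor kernel's Gram
# matrix is S (x) S (x) S, and HSIC^2 = d^T (S (x) S (x) S) d.
S = np.array([[1.0, -1.0], [-1.0, 1.0]])
hsic_sq = np.einsum('ijk,ia,jb,kc,abc->', d, S, S, S, d)

print(abs(hsic_sq) < 1e-12, np.abs(d).max() > 1e-3)  # HSIC = 0, yet P is dependent
```

The two printed booleans are both True: HSIC vanishes even though P differs from the product of its marginals (e.g. p_{1,1,1} = 0.2 versus P_1(1)P_2(1)P_3(1) = 0.18).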

36-39 Non-I-Characteristicity: Analytical Solution

Parameter: z = (z_0, z_1, ..., z_5) ∈ [0, 1]^6.

Example:
p_{1,1,1} = z_2 + z_1 + z_4 + z_5 - 3z_2z_1 - 4z_2z_4 - 4z_1z_4 - z_2z_3 - 2z_2z_0 - 2z_1z_3 - 3z_2z_5 - 2z_4z_3 - z_1z_0 - 3z_1z_5 - 2z_4z_0 - 4z_4z_5 - z_3z_0 - z_3z_5 - z_0z_5 + 2z_2z_1² + 2z_2²z_1 + 4z_2z_4² + 2z_2²z_4 + 4z_1z_4² + 2z_1²z_4 + 2z_2²z_0 + 2z_1²z_3 + 2z_2z_5² + 2z_2²z_5 + 2z_4²z_3 + 2z_1z_5² + 2z_1²z_5 + 2z_4²z_0 + 2z_4z_5² + 4z_4²z_5 - z_2² - z_1² - 3z_4² + 2z_4³ - z_5² + 6z_2z_1z_4 + 2z_2z_1z_3 + 2z_2z_4z_3 + 2z_2z_1z_0 + 4z_2z_1z_5 + 4z_2z_4z_0 + 4z_1z_4z_3 + 6z_2z_4z_5 + 2z_1z_4z_0 + 6z_1z_4z_5 + 2z_2z_3z_0 + 2z_2z_3z_5 + 2z_1z_3z_0 + 2z_2z_0z_5 + 2z_1z_3z_5 + 2z_4z_3z_0 + 2z_4z_3z_5 + 2z_1z_0z_5 + 2z_4z_0z_5.

2z_2z_1 - z_1 - 2z_4z_3 - z_0 - 2z_5z_2 + 2z_2z_4 + 2z_1z_4 + 2z_2z_0 + 2z_1z_3 + 2z_2z_5 + 2z_4z_3 + 2z_1z_5 + 2z_4z_0 + 4z_4z_5 + 2z_3z_0 + 2z_3z_5 + 2z_0z_5 + 2z_4² + 2z_5².

We chose: z = (1/10, 1/10, 1/10, 1/10, 1/10, 1/10).

Universality: does it help?

40 k_1, k_2 Universal, k_3 Characteristic ⇏ ⊗_{m=1}^3 k_m I-Characteristic

Example: X_m = {1, 2}, τ_{X_m} = P({1, 2}), M = 3.
  k_1(x, x') = k_2(x, x') = δ_{x,x'}: universal.
  k_3(x, x') = 2δ_{x,x'} - 1: characteristic.
Different constraints and P(z) solution; the same witness is useful:
  p_{1,1,1} = 1/5,  p_{1,1,2} = 1/10, p_{1,2,1} = 1/10, p_{1,2,2} = 1/10,
  p_{2,1,1} = 1/5,  p_{2,1,2} = 1/10, p_{2,2,1} = 1/10, p_{2,2,2} = 1/10.

41-42 Results [Szabó and Sriperumbudur, 2017]

Proposition (characteristic property).
  ⊗_{m=1}^M k_m characteristic ⇒ (k_m)_{m=1}^M are characteristic.
  The converse fails [|X_m| = 2, k_m(x, x') = 2δ_{x,x'} - 1].

Proposition (I-characteristic property).
  k_1, k_2 characteristic ⇒ k_1 ⊗ k_2 I-characteristic; ⇐ holds when |X_m| ≥ 2.
  k_1, k_2, k_3 characteristic ⇏ ⊗_{m=1}^3 k_m I-characteristic [example above].
  k_1, k_2 universal, k_3 characteristic ⇏ ⊗_{m=1}^3 k_m I-characteristic [example above].

43-44 Results, Continued [Szabó and Sriperumbudur, 2017]

Proposition (X_m = R^{d_m}; k_m continuous, shift-invariant, bounded).
  (k_m)_{m=1}^M are characteristic ⇔ ⊗_{m=1}^M k_m is I-characteristic ⇔ ⊗_{m=1}^M k_m is characteristic.

Proposition (universality).
  ⊗_{m=1}^M k_m universal ⇔ (k_m)_{m=1}^M are universal.

45-46 Summary

We studied the validity of HSIC.
HSIC ⇒ product structure:
  Space: X = ×_{m=1}^M X_m.
  Kernel: k = ⊗_{m=1}^M k_m.
Complete answer in terms of the k_m's.
ITE toolkit; preprint (with minor revision in JMLR).

47 Thank you for your attention!

Acknowledgements: part of this work was carried out while BKS was visiting ZSz at CMAP, École Polytechnique. BKS is supported by NSF-DMS.

48-57 References

Balasubramanian, K., Li, T., and Yuan, M. (2017). On the optimality of kernel-embedding based goodness-of-fit tests. Technical report.
Blanchard, G., Deshmukh, A. A., Dogan, U., Lee, G., and Scott, C. (2017). Domain generalization by marginal transfer learning. Technical report.
Blanchard, G., Lee, G., and Scott, C. (2011). Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems (NIPS).
Borgwardt, K., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22.
Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Neural Information Processing Systems (NIPS).
Fukumizu, K., Song, L., and Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14.
Gretton, A. (2015). A simpler condition for consistency of a kernel independence test. Technical report, University College London.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory (ALT).
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. (2008). A kernel statistical test of independence. In Neural Information Processing Systems (NIPS).
Kim, B., Khanna, R., and Koyejo, O. O. (2016). Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems (NIPS).
Kusano, G., Fukumizu, K., and Hiraoka, Y. (2016). Persistence weighted Gaussian kernel for topological data analysis. In International Conference on Machine Learning (ICML).
Law, H. C. L., Sutherland, D. J., Sejdinovic, D., and Flaxman, S. (2018). Bayesian approaches to distribution regression. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2014). Automatic construction and natural-language description of nonparametric regression models. In AAAI Conference on Artificial Intelligence.
Lopez-Paz, D., Muandet, K., Schölkopf, B., and Tolstikhin, I. (2015). Towards a learning theory of cause-effect inference. In International Conference on Machine Learning (ICML; PMLR), volume 37.
Lyons, R. (2013). Distance covariance in metric spaces. The Annals of Probability, 41.
Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. (2016). Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research, 17.
Muandet, K., Fukumizu, K., Dinuzzo, F., and Schölkopf, B. (2011). Learning from distributions via support measure machines. In Neural Information Processing Systems (NIPS).
Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2).
Park, M., Jitkrittum, W., and Sejdinovic, D. (2016). K2-ABC: Approximate Bayesian computation with kernel embeddings. In International Conference on Artificial Intelligence and Statistics (AISTATS; PMLR), volume 51.
Pfister, N., Bühlmann, P., Schölkopf, B., and Peters, J. (2017). Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Schölkopf, B., Muandet, K., Fukumizu, K., Harmeling, S., and Peters, J. (2015). Computing functions of random variables via reproducing kernel Hilbert space representations. Statistics and Computing, 25(4).
Sejdinovic, D., Sriperumbudur, B. K., Gretton, A., and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41.
Song, L., Gretton, A., Bickson, D., Low, Y., and Guestrin, C. (2011). Kernel belief propagation. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. (2012). Feature selection via dependence maximization. Journal of Machine Learning Research, 13.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11.
Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research.
Strobl, E. V., Visweswaran, S., and Zhang, K. (2017). Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Technical report.
Szabó, Z. and Sriperumbudur, B. (2017). Characteristic and universal tensor product kernels. Technical report.
Szabó, Z., Sriperumbudur, B., Póczos, B., and Gretton, A. (2016). Learning theory for distribution regression. Journal of Machine Learning Research, 17(152):1-40.
Yamada, M., Umezu, Y., Fukumizu, K., and Takeuchi, I. (2018). Post selection inference with kernels. In International Conference on Artificial Intelligence and Statistics (AISTATS; PMLR), volume 84.
Zaheer, M., Kottur, S., Ravanbakhsh, S., Póczos, B., Salakhutdinov, R. R., and Smola, A. J. (2017). Deep sets. In Advances in Neural Information Processing Systems (NIPS).
Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. (2013). Domain adaptation under target and conditional shift. Journal of Machine Learning Research, 28(3).


A Linear-Time Kernel Goodness-of-Fit Test

A Linear-Time Kernel Goodness-of-Fit Test A Linear-Time Kernel Goodness-of-Fit Test Wittawat Jitkrittum Gatsby Unit, UCL wittawatj@gmail.com Wenkai Xu Gatsby Unit, UCL wenkaix@gatsby.ucl.ac.uk Zoltán Szabó CMAP, École Polytechnique zoltan.szabo@polytechnique.edu

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Causal Inference on Discrete Data via Estimating Distance Correlations

Causal Inference on Discrete Data via Estimating Distance Correlations ARTICLE CommunicatedbyAapoHyvärinen Causal Inference on Discrete Data via Estimating Distance Correlations Furui Liu frliu@cse.cuhk.edu.hk Laiwan Chan lwchan@cse.euhk.edu.hk Department of Computer Science

More information

Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators. Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators. Michel Besserve MPI for Intelligent Systems, Tübingen michel.besserve@tuebingen.mpg.de Nikos K. Logothetis MPI

More information

Forefront of the Two Sample Problem

Forefront of the Two Sample Problem Forefront of the Two Sample Problem From classical to state-of-the-art methods Yuchi Matsuoka What is the Two Sample Problem? X ",, X % ~ P i. i. d Y ",, Y - ~ Q i. i. d Two Sample Problem P = Q? Example:

More information

Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models

Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models Kun Zhang Dept of Computer Science and HIIT University of Helsinki 14 Helsinki, Finland kun.zhang@cs.helsinki.fi Aapo Hyvärinen

More information

Nonparametric Tree Graphical Models via Kernel Embeddings

Nonparametric Tree Graphical Models via Kernel Embeddings Le Song, 1 Arthur Gretton, 1,2 Carlos Guestrin 1 1 School of Computer Science, Carnegie Mellon University; 2 MPI for Biological Cybernetics Abstract We introduce a nonparametric representation for graphical

More information

Gaussian Processes. Recommended reading: Rasmussen/Williams: Chapters 1, 2, 4, 5 Deisenroth & Ng (2015)[3] Data Analysis and Probabilistic Inference

Gaussian Processes. Recommended reading: Rasmussen/Williams: Chapters 1, 2, 4, 5 Deisenroth & Ng (2015)[3] Data Analysis and Probabilistic Inference Data Analysis and Probabilistic Inference Gaussian Processes Recommended reading: Rasmussen/Williams: Chapters 1, 2, 4, 5 Deisenroth & Ng (2015)[3] Marc Deisenroth Department of Computing Imperial College

More information

Manifold Regularization

Manifold Regularization Manifold Regularization Vikas Sindhwani Department of Computer Science University of Chicago Joint Work with Mikhail Belkin and Partha Niyogi TTI-C Talk September 14, 24 p.1 The Problem of Learning is

More information

A Kernel Test for Three-Variable Interactions

A Kernel Test for Three-Variable Interactions A Kernel Test for Three-Variable Interactions Dino Sejdinovic, Arthur Gretton Gatsby Unit, CSML, UCL, UK {dino.sejdinovic, arthur.gretton}@gmail.com Wicher Bergsma Department of Statistics, LSE, UK w.p.bergsma@lse.ac.uk

More information

Adaptive HMC via the Infinite Exponential Family

Adaptive HMC via the Infinite Exponential Family Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University College London RegML, 2017 Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family

More information

Maximum Mean Discrepancy

Maximum Mean Discrepancy Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia

More information

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,

More information

Robust Support Vector Machines for Probability Distributions

Robust Support Vector Machines for Probability Distributions Robust Support Vector Machines for Probability Distributions Andreas Christmann joint work with Ingo Steinwart (Los Alamos National Lab) ICORS 2008, Antalya, Turkey, September 8-12, 2008 Andreas Christmann,

More information

Hypothesis Testing with Kernels

Hypothesis Testing with Kernels (Gatsby Unit, UCL) PRNI, Trento June 22, 2016 Motivation: detecting differences in AM signals Amplitude modulation: simple technique to transmit voice over radio. in the example: 2 songs. Fragments from

More information

Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems

Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems with Applications to Dynamical Systems Le Song Jonathan Huang School of Computer Science, Carnegie Mellon University Alex Smola Yahoo! Research, Santa Clara, CA, USA Kenji Fukumizu Institute of Statistical

More information

10-701/ Recitation : Kernels

10-701/ Recitation : Kernels 10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer

More information

Distinguishing between Cause and Effect: Estimation of Causal Graphs with two Variables

Distinguishing between Cause and Effect: Estimation of Causal Graphs with two Variables Distinguishing between Cause and Effect: Estimation of Causal Graphs with two Variables Jonas Peters ETH Zürich Tutorial NIPS 2013 Workshop on Causality 9th December 2013 F. H. Messerli: Chocolate Consumption,

More information

Causal Inference via Kernel Deviance Measures

Causal Inference via Kernel Deviance Measures Causal Inference via Kernel Deviance Measures Jovana Mitrovic Department of Statistics University of Oxford Dino Sejdinovic Department of Statistics University of Oxford Yee Whye Teh Department of Statistics

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems

Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems with Applications to Dynamical Systems Le Song Jonathan Huang School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA Alex Smola Yahoo! Research, Santa Clara, CA 95051, USA Kenji

More information

A Kernelized Stein Discrepancy for Goodness-of-fit Tests

A Kernelized Stein Discrepancy for Goodness-of-fit Tests Qiang Liu QLIU@CS.DARTMOUTH.EDU Computer Science, Dartmouth College, NH, 3755 Jason D. Lee JASONDLEE88@EECS.BERKELEY.EDU Michael Jordan JORDAN@CS.BERKELEY.EDU Department of Electrical Engineering and Computer

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

Independent Subspace Analysis

Independent Subspace Analysis Independent Subspace Analysis Barnabás Póczos Supervisor: Dr. András Lőrincz Eötvös Loránd University Neural Information Processing Group Budapest, Hungary MPI, Tübingen, 24 July 2007. Independent Component

More information

Mathematische Methoden der Unsicherheitsquantifizierung

Mathematische Methoden der Unsicherheitsquantifizierung Mathematische Methoden der Unsicherheitsquantifizierung Oliver Ernst Professur Numerische Mathematik Sommersemester 2014 Contents 1 Introduction 1.1 What is Uncertainty Quantification? 1.2 A Case Study:

More information

arxiv: v2 [stat.ml] 3 Sep 2015

arxiv: v2 [stat.ml] 3 Sep 2015 Nonparametric Independence Testing for Small Sample Sizes arxiv:46.9v [stat.ml] 3 Sep 5 Aaditya Ramdas Dept. of Statistics and Machine Learning Dept. Carnegie Mellon University aramdas@cs.cmu.edu September

More information

Unsupervised Nonparametric Anomaly Detection: A Kernel Method

Unsupervised Nonparametric Anomaly Detection: A Kernel Method Fifty-second Annual Allerton Conference Allerton House, UIUC, Illinois, USA October - 3, 24 Unsupervised Nonparametric Anomaly Detection: A Kernel Method Shaofeng Zou Yingbin Liang H. Vincent Poor 2 Xinghua

More information

Learning with Noisy Labels. Kate Niehaus Reading group 11-Feb-2014

Learning with Noisy Labels. Kate Niehaus Reading group 11-Feb-2014 Learning with Noisy Labels Kate Niehaus Reading group 11-Feb-2014 Outline Motivations Generative model approach: Lawrence, N. & Scho lkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

Causal Modeling with Generative Neural Networks

Causal Modeling with Generative Neural Networks Causal Modeling with Generative Neural Networks Michele Sebag TAO, CNRS INRIA LRI Université Paris-Sud Joint work: D. Kalainathan, O. Goudet, I. Guyon, M. Hajaiej, A. Decelle, C. Furtlehner https://arxiv.org/abs/1709.05321

More information

arxiv: v1 [cs.lg] 3 Jan 2017

arxiv: v1 [cs.lg] 3 Jan 2017 Deep Convolutional Neural Networks for Pairwise Causality Karamjit Singh, Garima Gupta, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal TCS Research, New-Delhi, India January 4, 2017 arxiv:1701.00597v1

More information

Convergence Rates of Kernel Quadrature Rules

Convergence Rates of Kernel Quadrature Rules Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction

More information

Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

Inference of Cause and Effect with Unsupervised Inverse Regression

Inference of Cause and Effect with Unsupervised Inverse Regression Inference of Cause and Effect with Unsupervised Inverse Regression Eleni Sgouritsa Dominik Janzing Philipp Hennig Bernhard Schölkopf Max Planck Institute for Intelligent Systems, Tübingen, Germany {eleni.sgouritsa,

More information

Lecture 2: Mappings of Probabilities to RKHS and Applications

Lecture 2: Mappings of Probabilities to RKHS and Applications Lecture 2: Mappings of Probabilities to RKHS and Applications Lille, 214 Arthur Gretton Gatsby Unit, CSML, UCL Detecting differences in brain signals The problem: Dolocalfieldpotential(LFP)signalschange

More information

Distributed Gaussian Processes

Distributed Gaussian Processes Distributed Gaussian Processes Marc Deisenroth Department of Computing Imperial College London http://wp.doc.ic.ac.uk/sml/marc-deisenroth Gaussian Process Summer School, University of Sheffield 15th September

More information

Kernel Adaptive Metropolis-Hastings

Kernel Adaptive Metropolis-Hastings Kernel Adaptive Metropolis-Hastings Arthur Gretton,?? Gatsby Unit, CSML, University College London NIPS, December 2015 Arthur Gretton (Gatsby Unit, UCL) Kernel Adaptive Metropolis-Hastings 12/12/2015 1

More information

Semi-Supervised Learning through Principal Directions Estimation

Semi-Supervised Learning through Principal Directions Estimation Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de

More information

Foundations of Causal Inference

Foundations of Causal Inference Foundations of Causal Inference Causal inference is usually concerned with exploring causal relations among random variables X 1,..., X n after observing sufficiently many samples drawn from the joint

More information

Feature Selection via Dependence Maximization

Feature Selection via Dependence Maximization Journal of Machine Learning Research 1 (2007) xxx-xxx Submitted xx/07; Published xx/07 Feature Selection via Dependence Maximization Le Song lesong@it.usyd.edu.au NICTA, Statistical Machine Learning Program,

More information

arxiv: v1 [cs.ai] 22 Apr 2015

arxiv: v1 [cs.ai] 22 Apr 2015 Distinguishing Cause from Effect Based on Exogeneity [Extended Abstract] Kun Zhang MPI for Intelligent Systems 7276 Tübingen, Germany & Info. Sci. Inst., USC 4676 Admiralty Way, CA 9292 kzhang@tuebingen.mpg.de

More information

Importance Reweighting Using Adversarial-Collaborative Training

Importance Reweighting Using Adversarial-Collaborative Training Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

More information

Non-Linear Regression for Bag-of-Words Data via Gaussian Process Latent Variable Set Model

Non-Linear Regression for Bag-of-Words Data via Gaussian Process Latent Variable Set Model Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Non-Linear Regression for Bag-of-Words Data via Gaussian Process Latent Variable Set Model Yuya Yoshikawa Nara Institute of Science

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Kernels for Multi task Learning

Kernels for Multi task Learning Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano

More information

Human-level concept learning through probabilistic program induction

Human-level concept learning through probabilistic program induction B.M Lake, R. Salakhutdinov, J.B. Tenenbaum Human-level concept learning through probabilistic program induction journal club at two aspects in which machine learning spectacularly lags behind human learning

More information

A Kernel Statistical Test of Independence

A Kernel Statistical Test of Independence A Kernel Statistical Test of Independence Arthur Gretton MPI for Biological Cybernetics Tübingen, Germany arthur@tuebingen.mpg.de Kenji Fukumizu Inst. of Statistical Mathematics Tokyo Japan fukumizu@ism.ac.jp

More information

Doubly Stochastic Inference for Deep Gaussian Processes. Hugh Salimbeni Department of Computing Imperial College London

Doubly Stochastic Inference for Deep Gaussian Processes. Hugh Salimbeni Department of Computing Imperial College London Doubly Stochastic Inference for Deep Gaussian Processes Hugh Salimbeni Department of Computing Imperial College London 29/5/2017 Motivation DGPs promise much, but are difficult to train Doubly Stochastic

More information

The Perturbed Variation

The Perturbed Variation The Perturbed Variation Maayan Harel Department of Electrical Engineering Technion, Haifa, Israel maayanga@tx.technion.ac.il Shie Mannor Department of Electrical Engineering Technion, Haifa, Israel shie@ee.technion.ac.il

More information

Series 7, May 22, 2018 (EM Convergence)

Series 7, May 22, 2018 (EM Convergence) Exercises Introduction to Machine Learning SS 2018 Series 7, May 22, 2018 (EM Convergence) Institute for Machine Learning Dept. of Computer Science, ETH Zürich Prof. Dr. Andreas Krause Web: https://las.inf.ethz.ch/teaching/introml-s18

More information

CS8803: Statistical Techniques in Robotics Byron Boots. Hilbert Space Embeddings

CS8803: Statistical Techniques in Robotics Byron Boots. Hilbert Space Embeddings CS8803: Statistical Techniques in Robotics Byron Boots Hilbert Space Embeddings 1 Motivation CS8803: STR Hilbert Space Embeddings 2 Overview Multinomial Distributions Marginal, Joint, Conditional Sum,

More information

Ecological inference with distribution regression

Ecological inference with distribution regression Ecological inference with distribution regression Seth Flaxman 10 May 2017 Department of Politics and International Relations Ecological inference I How to draw conclusions about individuals from aggregate-level

More information

Generalized Transfer Component Analysis for Mismatched Jpeg Steganalysis

Generalized Transfer Component Analysis for Mismatched Jpeg Steganalysis Generalized Transfer Component Analysis for Mismatched Jpeg Steganalysis Xiaofeng Li 1, Xiangwei Kong 1, Bo Wang 1, Yanqing Guo 1,Xingang You 2 1 School of Information and Communication Engineering Dalian

More information

Domain Adaptation with Conditional Transferable Components

Domain Adaptation with Conditional Transferable Components Mingming Gong 1 MINGMING.GONG@STUDENT.UTS.EDU.AU Kun Zhang 2,3 KUNZ1@CMU.EDU Tongliang Liu 1 TLIANG.LIU@GMAIL.COM Dacheng Tao 1 DACHENG.TAO@UTS.EDU.AU Clark Glymour 2 CG09@ANDREW.CMU.EDU Bernhard Schölkopf

More information