Kernel-based Dependency Measures & Hypothesis Testing
1 Kernel-based Dependency Measures and Hypothesis Testing. Zoltán Szabó (CMAP, École Polytechnique). Carnegie Mellon University, November 27, 2017.
2 Outline Motivation. Diverse set of domains: kernel. Dependency measures: Kernel canonical correlation analysis. Maximum mean discrepancy. Hilbert-Schmidt independence criterion. Hypothesis testing.
3 Dependency Measures as Objective Functions
4 Outlier-robust image registration [Kybic, 2004, Neemuchwala et al., 2007] Given two images. Goal: find the transformation which takes the right one to the left.
6 Outlier-robust image registration: equations. Reference image: $y^{\text{ref}}$, test image: $y^{\text{test}}$, possible transformations: $\Theta$. Objective: $J(\theta) = \underbrace{I\big(y^{\text{ref}}, y^{\text{test}}(\theta)\big)}_{\text{similarity}} \to \max_{\theta \in \Theta}$. In the example: $I$ = KCCA.
7 Independent subspace analysis [Cardoso, 1998]: an extension of ICA. Cocktail party problem: independent groups of people / music bands; observation = mixed sources.
8 ISA equations. Observation: $x_t = A s_t$, $s = [s^1; \ldots; s^M]$. Goal: estimate $\hat{s}$ from $\{x_1, \ldots, x_T\}$. Assumptions: independent groups, $I(s^1, \ldots, s^M) = 0$; the $s^m$-s are non-Gaussian; $A$ is invertible.
9 ISA solution. Find $W$ which makes the estimated components independent: $y = Wx$, $y = [y^1; \ldots; y^M]$, $J(W) = I(y^1, \ldots, y^M) \to \min_W$.
10 Distribution regression [Póczos et al., 2013, Szabó et al., 2016]. Sustainability goal: aerosol prediction (air pollution → climate). Prediction using labelled bags: bag := multi-spectral satellite measurements over an area, label := local aerosol value.
11 Objects in the bags Examples: time-series modelling: user = set of time-series, computer vision: image = collection of patch vectors, NLP: corpus = bag of documents, network analysis: group of people = bag of friendship graphs,... Wider context (statistics): point estimation tasks.
13 Regression on labelled bags. Given: labelled bags $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^{\ell}$, where $\hat{P}_i$ is a bag of $N$ samples from $P_i$; test bag: $\hat{P}$. Estimator: $f_{\hat{z}}^{\lambda} = \arg\min_{f \in H_K} \frac{1}{\ell} \sum_{i=1}^{\ell} \big[ f\big(\underbrace{\mu_{\hat{P}_i}}_{\text{feature of } \hat{P}_i}\big) - y_i \big]^2 + \lambda \|f\|_{H_K}^2$. Prediction: $\hat{y}(\hat{P}) = g^T (G + \ell \lambda I)^{-1} y$, with $g = \big[K(\mu_{\hat{P}}, \mu_{\hat{P}_i})\big]$, $G = \big[K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})\big]$, $y = [y_i]$. Inner product of distributions $K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})$ = ?
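A minimal numpy sketch of this bag-regression pipeline, under illustrative choices that are not fixed by the slides: a Gaussian inner kernel and the linear embedding kernel $K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j}) = \langle \mu_{\hat{P}_i}, \mu_{\hat{P}_j} \rangle$, estimated by the mean of all pairwise kernel values; all function names are hypothetical.

```python
import numpy as np

def k_gauss(A, B, gamma=0.5):
    """Gaussian kernel matrix between two sample sets (rows = points)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def K_emb(bag_i, bag_j, gamma=0.5):
    """Linear embedding kernel <mu_Pi, mu_Pj>: mean of pairwise kernel values."""
    return k_gauss(bag_i, bag_j, gamma).mean()

def dist_reg_predict(bags, y, test_bag, lam=1e-3, gamma=0.5):
    """Kernel ridge regression on bags: y_hat(P) = g^T (G + l*lam*I)^{-1} y."""
    l = len(bags)
    G = np.array([[K_emb(bi, bj, gamma) for bj in bags] for bi in bags])
    g = np.array([K_emb(test_bag, bi, gamma) for bi in bags])
    return g @ np.linalg.solve(G + l * lam * np.eye(l), np.asarray(y, float))
```

On synthetic bags drawn from $N(m, 1)$ with label $m$, the predictor should recover $m$ approximately; accuracy depends on the bag size and the (assumed) hyperparameters.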
17 Feature selection. Goal: find the feature subset (# of rooms, crime rate, local taxes) most relevant for house price prediction (y).
18 Feature selection: equations. Features: $x_1, \ldots, x_F$. Subset: $S \subseteq \{1, \ldots, F\}$. MaxRelevance-MinRedundancy principle [Peng et al., 2005]: $J(S) = \frac{1}{|S|} \sum_{i \in S} I(x_i, y) - \frac{1}{|S|^2} \sum_{i,j \in S} I(x_i, x_j) \to \max_{S \subseteq \{1, \ldots, F\}}$.
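A sketch of greedy forward selection under this criterion. The dependency measure $I$ is abstracted away: the caller supplies a relevance vector and a redundancy matrix (below, absolute Pearson correlation is used as a crude stand-in for $I$; this is an assumption for illustration, not the mRMR paper's estimator).

```python
import numpy as np

def mrmr(relevance, redundancy, k):
    """Greedy forward selection maximizing
    J(S) = mean_{i in S} relevance[i] - mean_{i,j in S} redundancy[i, j].
    relevance: (F,) dependency of each feature with the target,
    redundancy: (F, F) dependency between feature pairs."""
    F = len(relevance)
    S = [int(np.argmax(relevance))]          # start with the most relevant feature
    while len(S) < k:
        best_i, best_score = None, -np.inf
        for i in range(F):
            if i in S:
                continue
            cand = S + [i]
            score = relevance[cand].mean() - redundancy[np.ix_(cand, cand)].mean()
            if score > best_score:
                best_i, best_score = i, score
        S.append(best_i)
    return S
```

With $y = x_0 + x_2$ and $x_1$ an exact copy of $x_0$, selecting two features skips the redundant copy and returns $\{0, 2\}$.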
19 Hypothesis Testing
20 Example-1 (2-sample testing): NLP. Given: 2 categories of documents (Bayesian inference, neuroscience). Task: test their distinguishability; most discriminative words → interpretability. Do $\{x_i\}$ and $\{y_j\}$ come from the same distribution, i.e. $P_x = P_y$?
22 Example-2 (2-sample testing): computer vision Given: two sets of faces (happy, angry). Task: check if they are different, determine the most discriminative features/regions.
23 Example-3 (2-sample testing): audio. Amplitude modulation: simple technique to transmit voice over radio. In the example: 2 songs. Fragments from song 1 $\sim P_x$, from song 2 $\sim P_y$.
24 Example: independence testing-1. We are given paired samples. Task: test independence. Examples: (song, year of release) pairs, (video, caption) pairs. $\{(x_i, y_i)\}_{i=1}^{n} \xrightarrow{?} P_{xy} = P_x \otimes P_y$.
27 Example: independence testing-2. How do we detect dependency? (paired samples) x 1 : Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet. x 2 : No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.... y 1 : Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent. y 2 : Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.... Are the French paragraphs translations of the English ones, or do they have nothing to do with them, i.e. is $P_{xy} = P_x \otimes P_y$?
29 Diverse Set of Domains
30 Classical information theory. Kullback-Leibler divergence: $\mathrm{KL}(P, Q) = \int_{\mathbb{R}^d} p(x) \log\left[\frac{p(x)}{q(x)}\right] \mathrm{d}x$. Mutual information: $I(P) = \mathrm{KL}\big(P, \otimes_{m=1}^{M} P_m\big)$. Properties: $I(P) \ge 0$; $I(P) = 0 \Leftrightarrow P = \otimes_{m=1}^{M} P_m$. Alternatives: Rényi, Tsallis, $L_2$ divergence, ... Typically: $\mathcal{X} = \mathbb{R}^d$.
35 Euclidean space → inner product → kernel. Extension of $k(x, y) = x^T y$ leads to kernels. Why? Classification (SVM). Representation of distributions: $P \mapsto E_{x \sim P}\, \varphi(x)$. Example: $\varphi(x) = x$: mean.
41 Representations of distributions: EX. Given: 2 Gaussians with different means. Solution: t-test. [Figure: pdfs of two Gaussian variables P, Q with different means.]
42 Representations of distributions: $EX^2$. Setup: 2 Gaussians; same means, different variances. Idea: look at the 2nd-order features of the RVs: $\varphi(x) = x^2$ ⇒ difference in $EX^2$. [Figures: pdfs of P, Q with different variances; pdfs of $X^2$ under P, Q.]
44 Representations of distributions: further moments. Setup: a Gaussian and a Laplacian distribution. Challenge: their means and variances are the same. Idea: look at higher-order features: $\varphi(x) = e^{i\langle \omega, x \rangle}$, the characteristic function, $\mathcal{X} = \mathbb{R}^d$. [Figure: pdfs of the Gaussian and Laplacian variables P, Q.]
45 Kernels: why? continued. Idea: $P_{xy} \mapsto C_{xy}$. Covariance matrix: $C_{xy} = E_{xy}\big[(x - Ex)(y - Ey)^T\big]$, $S = \|C_{xy}\|_F$. $S \ne 0$ ↔ linear dependence. Covariance operator: take features of $x$ and $y$: $C_{xy} = E_{xy}\big[\underbrace{(\varphi(x) - E_x \varphi(x)) \otimes (\psi(y) - E_y \psi(y))}_{\text{centering in feature space}}\big]$, $S = \|C_{xy}\|_{HS} =: \mathrm{HSIC}(P_{xy})$. We capture non-linear dependencies via $\varphi, \psi$!
51 Kernel: similarity between features. Given: objects $x, x' \in \mathcal{X}$ (images, texts, ...). Question: how similar are they? Define features of the objects: $\varphi(x)$, $\varphi(x')$. Kernel: inner product of these features: $k(x, x') := \langle \varphi(x), \varphi(x') \rangle_{H}$.
54 RKHS: intuition. $k$ defines an RKHS: $H_k \subset \{f : \mathcal{X} \to \mathbb{R} \text{ functions}\}$. Elements: $\varphi(x) = k(\cdot, x) \in H_k$; $f = \sum_{i=1}^{n} c_i k(\cdot, x_i)$ ↔ $c \in \mathbb{R}^n$ [SVM dual]. Inner product: $\langle \underbrace{k(\cdot, x)}_{\varphi(x)}, \underbrace{k(\cdot, x')}_{\varphi(x')} \rangle_{H_k} = k(x, x')$ [evaluation]. Practically: $f \in H_k$, $f = f(c)$; knowing $k(x, x')$ is enough. Gram matrix: $G = [k(x_i, x_j)]_{i,j=1}^{n} \succeq 0$.
59 Diverse set of domains, kernel examples. $\mathcal{X} = \mathbb{R}^d$, $\gamma > 0$: $k_P(x, y) = (\langle x, y \rangle + \gamma)^p$, $k_E(x, y) = e^{-\gamma \|x - y\|_2}$, $k_G(x, y) = e^{-\gamma \|x - y\|_2^2}$, $k_C(x, y) = \frac{1}{1 + \gamma \|x - y\|_2^2}$. $\mathcal{X}$ = strings, texts: r-spectrum kernel (# of common substrings of length ≤ r). $\mathcal{X}$ = time-series: dynamic time-warping. $\mathcal{X}$ = trees, graphs, dynamical systems, sets, permutations, ...
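The four $\mathbb{R}^d$ kernels above can be sketched directly in numpy; a quick sanity check is that each produces a symmetric, positive semi-definite Gram matrix (function names and default $\gamma$ are illustrative).

```python
import numpy as np

def k_poly(x, y, gamma=1.0, p=2):
    """Polynomial kernel: (<x, y> + gamma)^p."""
    return (x @ y + gamma) ** p

def k_exp(x, y, gamma=1.0):
    """Exponential kernel: exp(-gamma * ||x - y||_2)."""
    return np.exp(-gamma * np.linalg.norm(x - y))

def k_gauss(x, y, gamma=1.0):
    """Gaussian kernel: exp(-gamma * ||x - y||_2^2)."""
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

def k_cauchy(x, y, gamma=1.0):
    """Cauchy kernel: 1 / (1 + gamma * ||x - y||_2^2)."""
    return 1.0 / (1.0 + gamma * np.linalg.norm(x - y) ** 2)

def gram(kernel, X, **kw):
    """Gram matrix G = [k(x_i, x_j)]; PSD for a valid kernel."""
    return np.array([[kernel(a, b, **kw) for b in X] for a in X])
```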
63 Independence measures. Given: random variable $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $(x, y) \sim P_{xy}$. Goal: measure the dependence of $x$ and $y$. Desiderata for an independence measure $Q(P_{xy})$ [Rényi, 1959]: 1. $Q(P_{xy})$ is well-defined; 2. $Q(P_{xy}) \in [0, 1]$; 3. $Q(P_{xy}) = 0$ iff $x \perp y$; 4. $Q(P_{xy}) = 1$ iff $y = f(x)$ or $x = g(y)$.
65 Independence measures. He showed that $Q(P_{xy}) = \sup_{f, g} \mathrm{corr}(f(x), g(y))$ satisfies 1-4. Too ambitious: computationally intractable (too many functions).
66 Independence measures: restriction to continuous functions. $C_b(\mathcal{X}) = \{f : \mathcal{X} \to \mathbb{R} \text{ bounded, continuous}\}$ would also work. Still too large! Idea: certain $H_k$ function classes are dense in $C_b(\mathcal{X})$ and computationally tractable.
68 Wanted: independence measures, distances, inner products on probability distributions that can be estimated without density estimation!
69 Kernel Canonical Correlation Analysis (KCCA)
70 KCCA: definition. Given: $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$; associated RKHSs: $H_k$, $H_l$. KCCA measure of $(x, y) \in \mathcal{X} \times \mathcal{Y}$: $\rho_{\mathrm{KCCA}}(x, y) = \sup_{f \in H_k, g \in H_l} \mathrm{corr}(f(x), g(y))$, where $\mathrm{corr}(f(x), g(y)) = \frac{\mathrm{cov}_{xy}(f(x), g(y))}{\sqrt{\mathrm{var}_x f(x)\, \mathrm{var}_y g(y)}}$.
72 KCCA: estimation. Empirical estimate from $\{(x_i, y_i)\}_{i=1}^{n}$: $\widehat{\rho}_{\mathrm{KCCA}}(x, y) = \sup_{c, d \in \mathbb{R}^n} \frac{c^T \tilde{G}_x \tilde{G}_y d}{\sqrt{c^T (\tilde{G}_x)^2 c}\, \sqrt{d^T (\tilde{G}_y)^2 d}}$. Centered Gram matrix: $G_x = [k(x_i, x_j)]_{i,j=1}^{n}$, $\tilde{G}_x = \underbrace{H_n G_x H_n}_{\text{centering in feature space}}$, $H_n = I_n - \frac{\mathbf{1}_{n \times n}}{n}$. $c, d$ come from a generalized eigenvalue problem ($Av = \lambda B v$). A bit of regularization: $\tilde{G}_x \to \tilde{G}_x + \kappa I_n$, $\tilde{G}_y \to \ldots$
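A compact numpy sketch of regularized empirical KCCA. The generalized eigenvalue problem is solved via singular values of $(\tilde{G}_x^2 + \kappa I)^{-1/2} \tilde{G}_x \tilde{G}_y (\tilde{G}_y^2 + \kappa I)^{-1/2}$; note the ridge is added to $\tilde{G}^2$ here, a common variant of the slide's $\tilde{G} \to \tilde{G} + \kappa I$. Kernel choice, $\gamma$, and $\kappa$ are illustrative assumptions.

```python
import numpy as np

def gauss_gram(X, gamma=0.5):
    """Gaussian Gram matrix of one sample set (rows = points)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def center(G):
    """H G H with H = I - 1/n: centering in feature space."""
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ G @ H

def inv_sqrt(A):
    """Symmetric inverse square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def kcca(Gx, Gy, kappa=1.0):
    """Largest regularized kernel canonical correlation rho_hat."""
    Gx, Gy = center(Gx), center(Gy)
    n = Gx.shape[0]
    A = inv_sqrt(Gx @ Gx + kappa * np.eye(n))
    B = inv_sqrt(Gy @ Gy + kappa * np.eye(n))
    return np.linalg.svd(A @ Gx @ Gy @ B, compute_uv=False)[0]
```

Whitening both Gram factors keeps the estimate in $[0, 1]$; strongly dependent pairs score higher than independent ones.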
76 KCCA as an independence measure. If $x \perp y$, then $\rho_{\mathrm{KCCA}}(x, y; \kappa) = 0$. The opposite direction holds for rich $H_k$, $H_l$ [Bach and Jordan, 2002, Gretton et al., 2005]. Enough: a universal kernel, i.e. $H_k$ dense in $C_b(\mathcal{X})$. Examples: the Gaussian and Laplacian kernels.
79 ITE toolbox. Estimators for dependency measures (including KCCA), distances on distributions (including MMD), independence of random variables (including HSIC), and many more... Link:
81 Maximum Mean Discrepancy (MMD)
82 MMD estimator: intuition
85 $\widehat{\mathrm{MMD}}^2(P, Q) = \bar{G}_{P,P} + \bar{G}_{Q,Q} - 2 \bar{G}_{P,Q}$ (without diagonals in $\bar{G}_{P,P}$, $\bar{G}_{Q,Q}$). $\widehat{\mathrm{MMD}}$ & $\widehat{\mathrm{HSIC}}$ illustration credit: Arthur Gretton.
86 MMD estimator. Feature of a distribution: $\mu_P = E_{x \sim P}\, \varphi(x)$. MMD = difference between feature means: $\mathrm{MMD}^2(P, Q) := \|\mu_P - \mu_Q\|_{H_k}^2$, $\widehat{\mathrm{MMD}}_u^2(P, Q) = \bar{G}_{P,P} + \bar{G}_{Q,Q} - 2 \bar{G}_{P,Q}$ using samples $\{x_i\}_{i=1}^{m} \sim P$, $\{y_j\}_{j=1}^{n} \sim Q$. Computational complexity: $O\big((m + n)^2\big)$, quadratic.
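The unbiased estimator above is a few lines of numpy: Gram-block means with the within-sample diagonals dropped, minus twice the cross-block mean (Gaussian kernel and $\gamma$ are illustrative choices).

```python
import numpy as np

def gauss_gram(A, B, gamma=0.5):
    """Gaussian kernel matrix between two sample sets (rows = points)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2_u(X, Y, gamma=0.5):
    """Unbiased MMD^2: diagonal-free within-sample Gram means minus
    twice the cross-sample Gram mean. O((m + n)^2) cost."""
    m, n = len(X), len(Y)
    Kxx = gauss_gram(X, X, gamma)
    Kyy = gauss_gram(Y, Y, gamma)
    Kxy = gauss_gram(X, Y, gamma)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())
```

Being unbiased, the estimate can be slightly negative when $P = Q$; it is clearly positive for well-separated distributions.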
90 Hilbert-Schmidt Independence Criterion (HSIC)
91 HSIC: intuition. X : images, Y: descriptions
92 HSIC intuition: Gram matrices. Empirical estimate: $\widehat{\mathrm{HSIC}}^2 = \frac{1}{n^2} \langle \tilde{G}_x, \tilde{G}_y \rangle_F$. Relation to MMD: $\mathrm{HSIC}(P_{xy}) = \mathrm{MMD}(P_{xy}, P_x \otimes P_y)$.
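The empirical estimate above (Frobenius inner product of the two centered Gram matrices, scaled by $1/n^2$) can be sketched as follows; Gaussian kernels on both arguments and the $\gamma$ value are assumptions for illustration.

```python
import numpy as np

def gauss_gram(A, gamma=0.5):
    """Gaussian Gram matrix of one sample set (rows = points)."""
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def hsic2(X, Y, gamma=0.5):
    """Biased HSIC^2 estimate: (1/n^2) <H Gx H, H Gy H>_F."""
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n      # centering in feature space
    Gx = H @ gauss_gram(X, gamma) @ H
    Gy = H @ gauss_gram(Y, gamma) @ H
    return (Gx * Gy).sum() / n ** 2
```

The estimate is nonnegative (inner product of two PSD matrices) and picks up the non-linear dependence $y = x^2$, which covariance alone would miss.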
95 Cocktail party: HSIC demo
96 ISA reminder. $x = As$, $s = [s^1; \ldots; s^M]$, where the $s^m$-s are non-Gaussian & independent. Goal: from $\{x_t\}_{t=1}^{T}$ recover $W \approx A^{-1}$ and $\{s_t\}_{t=1}^{T}$. Objective function: $\hat{s} = Wx$, $J(W) = I(\hat{s}^1, \ldots, \hat{s}^M) \to \min_W$.
98 ISA: source, observation. Hidden sources (s): Observation (x):
100 ISA: estimated sources using HSIC, ambiguity. Estimated sources ($\hat{s}$): Performance ($\hat{W}A$), ambiguity:
102 Conjecture: ISA separation theorem [Cardoso, 1998]. ISA = ICA + permutation. [Figure: $\widehat{\mathrm{HSIC}}(\hat{s}^i, \hat{s}^j)$; here $\dim(s^m) = 3$.] Basis of the state-of-the-art ISA solvers. Sufficient conditions [Szabó et al., 2012]: $s^m$ spherical.
107 ISA separation theorem. For $\dim(s^m) = 2$, less is sufficient. Invariance to 90° rotation: $f(u_1, u_2) = f(-u_2, u_1) = f(-u_1, -u_2) = f(u_2, -u_1)$; permutation and sign: $f(\pm u_1, \pm u_2) = f(\pm u_2, \pm u_1)$; $L_p$-spherical: $f(u_1, u_2) = h\big(\sum_i |u_i|^p\big)$ ($p > 0$).
110 Mean embedding, MMD, HSIC: applications & review. Applications: domain adaptation [Zhang et al., 2013] and domain generalization [Blanchard et al., 2017], interpretable machine learning [Kim et al., 2016], kernel belief propagation [Song et al., 2011], kernel Bayes' rule [Fukumizu et al., 2013], model criticism [Lloyd et al., 2014], approximate Bayesian computation [Park et al., 2016], probabilistic programming [Schölkopf et al., 2015], distribution classification [Muandet et al., 2011] and distribution regression [Szabó et al., 2016], topological data analysis [Kusano et al., 2016], post-selection inference [Yamada et al., 2016], causal inference [Mooij et al., 2016, Pfister et al., 2017, Strobl et al., 2017]. Mean embedding review: [Muandet et al., 2017].
111 Hypothesis Testing
112 Two-sample testing: recall. Given: $X = \{x_i\}_{i=1}^{n} \sim P$, $Y = \{y_j\}_{j=1}^{n} \sim Q$. Example: $x_i$ = i-th happy face, $y_j$ = j-th sad face. Problem: using $X, Y$ test $H_0 : P = Q$ vs $H_1 : P \ne Q$.
114 Ingredients of a two-sample test. Test statistic: $\hat{\lambda}_n = \hat{\lambda}_n(X, Y)$, random. Significance level: $\alpha$. Under $H_0$: $P_{H_0}(\underbrace{\hat{\lambda}_n \le T_\alpha}_{\text{correctly accepting } H_0}) = 1 - \alpha$. Under $H_1$: $P_{H_1}(T_\alpha < \hat{\lambda}_n) = P(\text{correctly rejecting } H_0) =: \text{power}$. [Figure: null and alternative distributions of $\hat{\lambda}_n$ with the threshold $T_\alpha$.]
116 Two-sample test using MMD, asymptotics under $H_0$ [Gretton et al., 2007, Gretton et al., 2012]. By U-statistic theory, $n \widehat{\mathrm{MMD}}_u^2(P, P) \xrightarrow{d} \sum_{i=1}^{\infty} \lambda_i \big(z_i^2 - 2\big)$, where $z_i \sim N(0, 2)$ i.i.d., $\int \tilde{k}(x, x') v_i(x)\, \mathrm{d}P(x) = \lambda_i v_i(x')$, $\tilde{k}(x, x') = \langle \varphi(x) - \mu_P, \varphi(x') - \mu_P \rangle_{H_k}$. [Figure: empirical pdf of $\widehat{\mathrm{MMD}}_u^2$.]
117 Null approximations; test statistics: quadratic in time. Small sample size: permutation test. Medium sample size: gamma approximation [figure: gamma pdfs for several $(\alpha, \beta)$], or truncated spectrum expansion [Gretton et al., 2009]. Large sample size: online techniques [Gretton et al., 2012] (large variance); recent linear methods (soon).
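The small-sample option, a permutation test, can be sketched as follows: pool the two samples, compute the Gram matrix once, and re-split the pooled sample by permuting labels to approximate the null distribution of $\widehat{\mathrm{MMD}}_u^2$ (kernel, $\gamma$, and the number of permutations B are illustrative).

```python
import numpy as np

def mmd2_u_from_gram(K, m):
    """Unbiased MMD^2 from the pooled Gram matrix K
    (first m rows/cols: X, the rest: Y)."""
    n = K.shape[0] - m
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())

def mmd_permutation_test(X, Y, gamma=0.5, B=200, seed=0):
    """Return (statistic, p-value); permuting labels = permuting K's rows/cols."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)                       # pooled Gram: computed once
    m = len(X)
    stat = mmd2_u_from_gram(K, m)
    null = []
    for _ in range(B):
        p = rng.permutation(len(Z))
        null.append(mmd2_u_from_gram(K[np.ix_(p, p)], m))
    p_value = (1 + sum(s >= stat for s in null)) / (B + 1)
    return stat, p_value
```

The `+1` terms give a valid (slightly conservative) p-value even with few permutations.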
122 Independence testing with HSIC. Similar story [Gretton et al., 2008, Pfister et al., 2016]. Null asymptotics: $\sum_{i=1}^{\infty} \lambda_i z_i^2$, $z_i \sim N(0, 1)$. In practice: permutation test / gamma approximation.
123 Related work. 2-sample testing: block-MMD [Zaremba et al., 2013] (variance grows). 3-variable interaction [Sejdinovic et al., 2013]. Goodness-of-fit [Chwialkowski et al., 2016]. Time-series: independence (stationary → shift) [Chwialkowski and Gretton, 2014]; wild bootstrap [Chwialkowski et al., 2014, Rubenstein et al., 2016]. block-HSIC [Zhang et al., 2017]: RFF acceleration. Conditional independence & RFF [Strobl et al., 2017].
129 Linear-time Tests
130 Linear-time 2-sample test [Chwialkowski et al., 2015]. Recall: $\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{H_k}$. Idea: change this to $\rho(P, Q) := \sqrt{\frac{1}{J} \sum_{j=1}^{J} [\mu_P(v_j) - \mu_Q(v_j)]^2}$ with random test locations $\{v_j\}_{j=1}^{J}$. $\rho$ is a.s. a metric for a Gaussian $k$ (bounded, analytic, characteristic).
133 Estimation. $\hat{\rho}^2(P, Q) = \frac{1}{J} \sum_{j=1}^{J} [\hat{\mu}_P(v_j) - \hat{\mu}_Q(v_j)]^2 = \frac{1}{J} \bar{z}_n^T \bar{z}_n$, where $\bar{z}_n = \frac{1}{n} \sum_{i=1}^{n} \underbrace{[k(x_i, v_j) - k(y_i, v_j)]_{j=1}^{J}}_{z_i} \in \mathbb{R}^J$. Good news: estimation is linear in $n$! Bad news: intractable null distribution, $n \hat{\rho}^2(P, P) \xrightarrow{d}$ sum of $J$ correlated $\chi^2$-s. Modified statistic (whitening): $\hat{\lambda}_n = n \bar{z}_n^T \Sigma_n^{-1} \bar{z}_n$, $\Sigma_n = \widehat{\mathrm{cov}}(\{z_i\}_{i=1}^{n})$. Under $H_0$: $\hat{\lambda}_n \xrightarrow{d} \chi^2(J)$ ⇒ easy to get the $(1 - \alpha)$-quantile!
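A minimal sketch of the whitened statistic $\hat{\lambda}_n = n \bar{z}_n^T \Sigma_n^{-1} \bar{z}_n$. The Gaussian kernel, $\gamma$, and the small ridge added to $\Sigma_n$ for invertibility are practical assumptions, not part of the slide's derivation.

```python
import numpy as np

def me_statistic(X, Y, V, gamma=0.5, reg=1e-6):
    """Mean-embedding test statistic: n * zbar^T Sigma^{-1} zbar with
    z_i = [k(x_i, v_j) - k(y_i, v_j)]_j at J test locations V.
    Cost is O(n) in the sample size; ~ chi^2(J) under H0."""
    def k(A, W):
        d2 = ((A[:, None, :] - W[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)               # (n, J) feature matrix
    Z = k(X, V) - k(Y, V)                        # z_i stacked as rows
    n, J = Z.shape
    zbar = Z.mean(axis=0)
    Sigma = np.cov(Z, rowvar=False) + reg * np.eye(J)
    return n * zbar @ np.linalg.solve(Sigma, zbar)
```

With $J = 5$ locations, reject $H_0$ at $\alpha = 0.05$ when the statistic exceeds the $\chi^2(5)$ 0.95-quantile, about 11.07 (or use `scipy.stats.chi2.ppf` for other $J$).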
137 Optimize locations and kernel. So far the test locations ($\{v_j\}_{j=1}^{J}$) and kernel parameters were fixed. Idea [Jitkrittum et al., 2016]: optimize them for a power proxy, $\lambda_n = n \bar{z}^T \Sigma^{-1} \bar{z}$ (population counterpart of $\hat{\lambda}_n$). In practice: power is similar to the $O(n^2)$ MMD test, but in $O(n)$ time! Independence testing [Jitkrittum et al., 2017a]. Goodness-of-fit [Jitkrittum et al., 2017b]: Best Paper Award at NIPS-2017!
140 Demo-1 = NLP: most/least discriminative words. NIPS papers ( ). TF-IDF representation. Bayes vs. Neuro. Most discriminative words: spike, markov, cortex, dropout, recurr, iii, gibb. The learned test locations are highly interpretable: markov, gibb (⇐ Gibbs): Bayesian inference; spike, cortex: key terms in neuroscience.
141 Least discriminative words: circumfer, bra, dominiqu, rhino, mitra, kid, impostor.
142 Demo-2: distinguish positive/negative emotions. Karolinska Directed Emotional Faces (KDEF) [Lundqvist et al., 1998]. 70 actors = 35 females and 35 males. Grayscale images, $d = 48 \times$ . Pixel features. +: happy, neutral, surprised; -: afraid, angry, disgusted. Learned test location (averaged) = [figure].
143 Summary. Dependency measures, distances: KCCA, HSIC, MMD. Tool: kernels. Applications: image registration, ISA, distribution regression, feature selection, hypothesis testing. MMD, HSIC: $O(n^2)$; linear-time tests ($\mathbb{R}^d$).
146 Thank you for your attention!
147 References
Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48.
Blanchard, G., Deshmukh, A. A., Dogan, U., Lee, G., and Scott, C. (2017). Domain generalization by marginal transfer learning. Technical report.
Cardoso, J.-F. (1998). Multidimensional independent component analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Chwialkowski, K. and Gretton, A. (2014). A kernel independence test for random processes. In International Conference on Machine Learning (ICML; JMLR W&CP), volume 32.
Chwialkowski, K., Ramdas, A., Sejdinovic, D., and Gretton, A. (2015). Fast two-sample testing with analytic representations of probability measures. In Neural Information Processing Systems (NIPS).
Chwialkowski, K., Sejdinovic, D., and Gretton, A. (2014). A wild bootstrap for degenerate kernel tests. In Neural Information Processing Systems (NIPS).
Chwialkowski, K., Strathmann, H., and Gretton, A. (2016). A kernel test of goodness of fit. In International Conference on Machine Learning (ICML).
Fukumizu, K., Song, L., and Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13.
Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. (2009). A fast, consistent kernel two-sample test. In Neural Information Processing Systems (NIPS).
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. (2007). A kernel statistical test of independence. In Neural Information Processing Systems (NIPS).
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. (2008). A kernel statistical test of independence. In Neural Information Processing Systems (NIPS).
Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. (2005). Kernel methods for measuring independence. Journal of Machine Learning Research, 6.
Jitkrittum, W., Szabó, Z., Chwialkowski, K., and Gretton, A. (2016). Interpretable distribution features with maximum testing power. In Neural Information Processing Systems (NIPS).
Jitkrittum, W., Szabó, Z., and Gretton, A. (2017a). An adaptive test of independence with analytic kernel embeddings. In International Conference on Machine Learning (ICML), volume 70. PMLR.
Jitkrittum, W., Xu, W., Szabó, Z., Fukumizu, K., and Gretton, A. (2017b). A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems (NIPS).
Kim, B., Khanna, R., and Koyejo, O. O. (2016). Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems (NIPS).
Kusano, G., Fukumizu, K., and Hiraoka, Y. (2016). Persistence weighted Gaussian kernel for topological data analysis. In International Conference on Machine Learning (ICML).
Kybic, J. (2004). High-dimensional mutual information estimation for image registration. In IEEE International Conference on Image Processing (ICIP).
Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2014). Automatic construction and natural-language description of nonparametric regression models. In AAAI Conference on Artificial Intelligence.
Lundqvist, D., Flykt, A., and Öhman, A. (1998). The Karolinska Directed Emotional Faces - KDEF. Technical report.
Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. (2016). Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research, 17:1-102.
Muandet, K., Fukumizu, K., Dinuzzo, F., and Schölkopf, B. (2011). Learning from distributions via support measure machines. In Neural Information Processing Systems (NIPS).
Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. (2017). Kernel mean embedding of distributions: A review and beyond. Technical report.
Neemuchwala, H., Hero, A., Zabuawala, S., and Carson, P. (2007). Image registration methods in high dimensional space. International Journal of Imaging Systems and Technology, 16.
Park, M., Jitkrittum, W., and Sejdinovic, D. (2016). K2-ABC: Approximate Bayesian computation with kernel embeddings. In International Conference on Artificial Intelligence and Statistics (AISTATS; PMLR), volume 51.
Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8).
Pfister, N., Bühlmann, P., Schölkopf, B., and Peters, J. (2016). Kernel-based tests for joint independence. Technical report.
Pfister, N., Bühlmann, P., Schölkopf, B., and Peters, J. (2017). Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Póczos, B., Singh, A., Rinaldo, A., and Wasserman, L. (2013). Distribution-free distribution regression. In International Conference on AI and Statistics (AISTATS; JMLR W&CP), volume 31.
Rényi, A. (1959). On measures of dependence. Acta Mathematica Academiae Scientiarum Hungaricae, 10.
Rubenstein, P. K., Chwialkowski, K. P., and Gretton, A. (2016). A kernel test for three-variable interactions with random processes. In Conference on Uncertainty in Artificial Intelligence (UAI).
Schölkopf, B., Muandet, K., Fukumizu, K., Harmeling, S., and Peters, J. (2015). Computing functions of random variables via reproducing kernel Hilbert space representations. Statistics and Computing, 25(4).
Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A kernel test for three-variable interactions. In Neural Information Processing Systems (NIPS).
Song, L., Gretton, A., Bickson, D., Low, Y., and Guestrin, C. (2011). Kernel belief propagation. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Strobl, E. V., Visweswaran, S., and Zhang, K. (2017). Approximate kernel-based conditional independence tests for fast non-parametric causal discovery.
157 Technical report. Szabó, Z., Póczos, B., and Lörincz, A. (2012). Separation theorem for independent subspace analysis and its consequences. Pattern Recognition, 45(4): Szabó, Z., Sriperumbudur, B. K., Póczos, B., and Gretton, A. (2016). Learning theory for distribution regression. Journal of Machine Learning Research, 17(152):1 40. Yamada, M., Umezu, Y., Fukumizu, K., and Takeuchi, I. (2016). Post selection inference with kernels. Technical report. ( Zaremba, W., Gretton, A., and Blaschko, M. (2013). B-tests: Low variance kernel two-sample tests.
158 In NIPS, pages Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. (2013). Domain adaptation under target and conditional shift. Journal of Machine Learning Research, 28(3): Zhang, Q., Filippi, S., Gretton, A., and Sejdinovic, D. (2017). Large-scale kernel methods for independence testing. Statistics and Computing, pages 1 18.