Kernel-based Dependency Measures & Hypothesis Testing
1 Kernel-based Dependency Measures and Hypothesis Testing. Zoltán Szabó (CMAP, École Polytechnique). Carnegie Mellon University, November 27, 2017.
2 Outline Motivation. Diverse set of domains: kernel. Dependency measures: Kernel canonical correlation analysis. Maximum mean discrepancy. Hilbert-Schmidt independence criterion. Hypothesis testing.
3 Dependency Measures as Objective Functions
4 Outlier-robust image registration [Kybic, 2004, Neemuchwala et al., 2007] Given two images. Goal: find the transformation which takes the right one to the left.
6 Outlier-robust image registration: equations. Reference image: $y^{\text{ref}}$, test image: $y^{\text{test}}$, possible transformations: $\Theta$. Objective: $J(\theta) = \underbrace{I\big(y^{\text{ref}}, y^{\text{test}}(\theta)\big)}_{\text{similarity}} \to \max_{\theta \in \Theta}$. In the example: $I$ = KCCA.
7 Independent subspace analysis [Cardoso, 1998]: an extension of ICA. Cocktail party problem: independent groups of people / music bands; observation = mixed sources.
8 ISA equations. Observation: $x_t = A s_t$, $s = [s^1; \ldots; s^M]$. Goal: estimate $\hat{s}$ from $\{x_1, \ldots, x_T\}$. Assumptions: independent groups, $I(s^1, \ldots, s^M) = 0$; the $s^m$-s are non-Gaussian; $A$ is invertible.
9 ISA solution. Find $W$ which makes the estimated components independent: $y = Wx$, $y = [y^1; \ldots; y^M]$, $J(W) = I(y^1, \ldots, y^M) \to \min_W$.
10 Distribution regression [Póczos et al., 2013, Szabó et al., 2016]. Sustainability goal: aerosol prediction (air pollution → climate). Prediction using labelled bags: bag := multi-spectral satellite measurements over an area, label := local aerosol value.
11 Objects in the bags Examples: time-series modelling: user = set of time-series, computer vision: image = collection of patch vectors, NLP: corpus = bag of documents, network analysis: group of people = bag of friendship graphs,... Wider context (statistics): point estimation tasks.
13 Regression on labelled bags. Given: labelled bags $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^{\ell}$, where $\hat{P}_i$ is a bag of $N$ samples from $P_i$; test bag: $\hat{P}$. Estimator: $f_{\hat{z}}^{\lambda} = \arg\min_{f \in H_K} \frac{1}{\ell} \sum_{i=1}^{\ell} \big[ f\big(\underbrace{\mu_{\hat{P}_i}}_{\text{feature of } \hat{P}_i}\big) - y_i \big]^2 + \lambda \|f\|_{H_K}^2$. Prediction: $\hat{y}(\hat{P}) = g^T (G + \ell \lambda I)^{-1} y$, with $g = \big[K(\mu_{\hat{P}}, \mu_{\hat{P}_i})\big]$, $G = \big[K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})\big]$, $y = [y_i]$. Inner product of distributions $K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})$ = ?
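A minimal numpy sketch of this bag-regression pipeline, under illustrative choices that are not fixed by the slides: a Gaussian inner kernel and the linear embedding kernel $K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j}) = \langle \mu_{\hat{P}_i}, \mu_{\hat{P}_j} \rangle$, estimated by the mean of all pairwise kernel values; all function names are hypothetical.

```python
import numpy as np

def k_gauss(A, B, gamma=0.5):
    """Gaussian kernel matrix between two sample sets (rows = points)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def K_emb(bag_i, bag_j, gamma=0.5):
    """Linear embedding kernel <mu_Pi, mu_Pj>: mean of pairwise kernel values."""
    return k_gauss(bag_i, bag_j, gamma).mean()

def dist_reg_predict(bags, y, test_bag, lam=1e-3, gamma=0.5):
    """Kernel ridge regression on bags: y_hat(P) = g^T (G + l*lam*I)^{-1} y."""
    l = len(bags)
    G = np.array([[K_emb(bi, bj, gamma) for bj in bags] for bi in bags])
    g = np.array([K_emb(test_bag, bi, gamma) for bi in bags])
    return g @ np.linalg.solve(G + l * lam * np.eye(l), np.asarray(y, float))
```

On synthetic bags drawn from $N(m, 1)$ with label $m$, the predictor should recover $m$ approximately; accuracy depends on the bag size and the (assumed) hyperparameters.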
17 Feature selection. Goal: find the feature subset (# of rooms, crime rate, local taxes) most relevant for house price prediction (y).
18 Feature selection: equations. Features: $x_1, \ldots, x_F$. Subset: $S \subseteq \{1, \ldots, F\}$. MaxRelevance-MinRedundancy principle [Peng et al., 2005]: $J(S) = \frac{1}{|S|} \sum_{i \in S} I(x_i, y) - \frac{1}{|S|^2} \sum_{i,j \in S} I(x_i, x_j) \to \max_{S \subseteq \{1, \ldots, F\}}$.
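A sketch of greedy forward selection under this criterion. The dependency measure $I$ is abstracted away: the caller supplies a relevance vector and a redundancy matrix (below, absolute Pearson correlation is used as a crude stand-in for $I$; this is an assumption for illustration, not the mRMR paper's estimator).

```python
import numpy as np

def mrmr(relevance, redundancy, k):
    """Greedy forward selection maximizing
    J(S) = mean_{i in S} relevance[i] - mean_{i,j in S} redundancy[i, j].
    relevance: (F,) dependency of each feature with the target,
    redundancy: (F, F) dependency between feature pairs."""
    F = len(relevance)
    S = [int(np.argmax(relevance))]          # start with the most relevant feature
    while len(S) < k:
        best_i, best_score = None, -np.inf
        for i in range(F):
            if i in S:
                continue
            cand = S + [i]
            score = relevance[cand].mean() - redundancy[np.ix_(cand, cand)].mean()
            if score > best_score:
                best_i, best_score = i, score
        S.append(best_i)
    return S
```

With $y = x_0 + x_2$ and $x_1$ an exact copy of $x_0$, selecting two features skips the redundant copy and returns $\{0, 2\}$.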
19 Hypothesis Testing
20 Example-1 (2-sample testing): NLP. Given: 2 categories of documents (Bayesian inference, neuroscience). Task: test their distinguishability; most discriminative words → interpretability. Do $\{x_i\}$ and $\{y_j\}$ come from the same distribution, i.e. $P_x = P_y$?
22 Example-2 (2-sample testing): computer vision Given: two sets of faces (happy, angry). Task: check if they are different, determine the most discriminative features/regions.
23 Example-3 (2-sample testing): audio. Amplitude modulation: simple technique to transmit voice over radio. In the example: 2 songs. Fragments from song 1 $\sim P_x$, from song 2 $\sim P_y$.
24 Example: independence testing-1. We are given paired samples. Task: test independence. Examples: (song, year of release) pairs, (video, caption) pairs. $\{(x_i, y_i)\}_{i=1}^{n} \xrightarrow{?} P_{xy} = P_x \otimes P_y$.
27 Example: independence testing-2. How do we detect dependency? (paired samples) x 1 : Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet. x 2 : No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.... y 1 : Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent. y 2 : Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.... Are the French paragraphs translations of the English ones, or do they have nothing to do with them, i.e. is $P_{xy} = P_x \otimes P_y$?
29 Diverse Set of Domains
30 Classical information theory. Kullback-Leibler divergence: $\mathrm{KL}(P, Q) = \int_{\mathbb{R}^d} p(x) \log\left[\frac{p(x)}{q(x)}\right] \mathrm{d}x$. Mutual information: $I(P) = \mathrm{KL}\big(P, \otimes_{m=1}^{M} P_m\big)$. Properties: $I(P) \ge 0$; $I(P) = 0 \Leftrightarrow P = \otimes_{m=1}^{M} P_m$. Alternatives: Rényi, Tsallis, $L_2$ divergence, ... Typically: $\mathcal{X} = \mathbb{R}^d$.
35 Euclidean space → inner product → kernel. Extension of $k(x, y) = x^T y$ leads to kernels. Why? Classification (SVM). Representation of distributions: $P \mapsto E_{x \sim P}\, \varphi(x)$. Example: $\varphi(x) = x$: mean.
41 Representations of distributions: EX. Given: 2 Gaussians with different means. Solution: t-test. [Figure: pdfs of two Gaussian variables P, Q with different means.]
42 Representations of distributions: $EX^2$. Setup: 2 Gaussians; same means, different variances. Idea: look at the 2nd-order features of the RVs: $\varphi(x) = x^2$ ⇒ difference in $EX^2$. [Figures: pdfs of P, Q with different variances; pdfs of $X^2$ under P, Q.]
44 Representations of distributions: further moments. Setup: a Gaussian and a Laplacian distribution. Challenge: their means and variances are the same. Idea: look at higher-order features: $\varphi(x) = e^{i\langle \omega, x \rangle}$, the characteristic function, $\mathcal{X} = \mathbb{R}^d$. [Figure: pdfs of the Gaussian and Laplacian variables P, Q.]
45 Kernels: why? continued. Idea: $P_{xy} \mapsto C_{xy}$. Covariance matrix: $C_{xy} = E_{xy}\big[(x - Ex)(y - Ey)^T\big]$, $S = \|C_{xy}\|_F$. $S \ne 0$ ↔ linear dependence. Covariance operator: take features of $x$ and $y$: $C_{xy} = E_{xy}\big[\underbrace{(\varphi(x) - E_x \varphi(x)) \otimes (\psi(y) - E_y \psi(y))}_{\text{centering in feature space}}\big]$, $S = \|C_{xy}\|_{HS} =: \mathrm{HSIC}(P_{xy})$. We capture non-linear dependencies via $\varphi, \psi$!
51 Kernel: similarity between features. Given: objects $x, x' \in \mathcal{X}$ (images, texts, ...). Question: how similar are they? Define features of the objects: $\varphi(x)$, $\varphi(x')$. Kernel: inner product of these features: $k(x, x') := \langle \varphi(x), \varphi(x') \rangle_{H}$.
54 RKHS: intuition. $k$ defines an RKHS: $H_k \subset \{f : \mathcal{X} \to \mathbb{R} \text{ functions}\}$. Elements: $\varphi(x) = k(\cdot, x) \in H_k$; $f = \sum_{i=1}^{n} c_i k(\cdot, x_i)$ ↔ $c \in \mathbb{R}^n$ [SVM dual]. Inner product: $\langle \underbrace{k(\cdot, x)}_{\varphi(x)}, \underbrace{k(\cdot, x')}_{\varphi(x')} \rangle_{H_k} = k(x, x')$ [evaluation]. Practically: $f \in H_k$, $f = f(c)$; knowing $k(x, x')$ is enough. Gram matrix: $G = [k(x_i, x_j)]_{i,j=1}^{n} \succeq 0$.
59 Diverse set of domains, kernel examples. $\mathcal{X} = \mathbb{R}^d$, $\gamma > 0$: $k_P(x, y) = (\langle x, y \rangle + \gamma)^p$, $k_E(x, y) = e^{-\gamma \|x - y\|_2}$, $k_G(x, y) = e^{-\gamma \|x - y\|_2^2}$, $k_C(x, y) = \frac{1}{1 + \gamma \|x - y\|_2^2}$. $\mathcal{X}$ = strings, texts: r-spectrum kernel (# of common substrings of length ≤ r). $\mathcal{X}$ = time-series: dynamic time-warping. $\mathcal{X}$ = trees, graphs, dynamical systems, sets, permutations, ...
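The four $\mathbb{R}^d$ kernels above can be sketched directly in numpy; a quick sanity check is that each produces a symmetric, positive semi-definite Gram matrix (function names and default $\gamma$ are illustrative).

```python
import numpy as np

def k_poly(x, y, gamma=1.0, p=2):
    """Polynomial kernel: (<x, y> + gamma)^p."""
    return (x @ y + gamma) ** p

def k_exp(x, y, gamma=1.0):
    """Exponential kernel: exp(-gamma * ||x - y||_2)."""
    return np.exp(-gamma * np.linalg.norm(x - y))

def k_gauss(x, y, gamma=1.0):
    """Gaussian kernel: exp(-gamma * ||x - y||_2^2)."""
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

def k_cauchy(x, y, gamma=1.0):
    """Cauchy kernel: 1 / (1 + gamma * ||x - y||_2^2)."""
    return 1.0 / (1.0 + gamma * np.linalg.norm(x - y) ** 2)

def gram(kernel, X, **kw):
    """Gram matrix G = [k(x_i, x_j)]; PSD for a valid kernel."""
    return np.array([[kernel(a, b, **kw) for b in X] for a in X])
```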
63 Independence measures. Given: random variable $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $(x, y) \sim P_{xy}$. Goal: measure the dependence of $x$ and $y$. Desiderata for an independence measure $Q(P_{xy})$ [Rényi, 1959]: 1. $Q(P_{xy})$ is well-defined; 2. $Q(P_{xy}) \in [0, 1]$; 3. $Q(P_{xy}) = 0$ iff $x \perp y$; 4. $Q(P_{xy}) = 1$ iff $y = f(x)$ or $x = g(y)$.
65 Independence measures. He showed that $Q(P_{xy}) = \sup_{f, g} \mathrm{corr}(f(x), g(y))$ satisfies 1-4. Too ambitious: computationally intractable (too many functions).
66 Independence measures: restriction to continuous functions. $C_b(\mathcal{X}) = \{f : \mathcal{X} \to \mathbb{R} \text{ bounded, continuous}\}$ would also work. Still too large! Idea: certain $H_k$ function classes are dense in $C_b(\mathcal{X})$ and computationally tractable.
68 Wanted: independence measures, distances, inner products on probability distributions that can be estimated without density estimation!
69 Kernel Canonical Correlation Analysis (KCCA)
70 KCCA: definition. Given: $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$; associated RKHSs: $H_k$, $H_l$. KCCA measure of $(x, y) \in \mathcal{X} \times \mathcal{Y}$: $\rho_{\mathrm{KCCA}}(x, y) = \sup_{f \in H_k, g \in H_l} \mathrm{corr}(f(x), g(y))$, where $\mathrm{corr}(f(x), g(y)) = \frac{\mathrm{cov}_{xy}(f(x), g(y))}{\sqrt{\mathrm{var}_x f(x)\, \mathrm{var}_y g(y)}}$.
72 KCCA: estimation. Empirical estimate from $\{(x_i, y_i)\}_{i=1}^{n}$: $\widehat{\rho}_{\mathrm{KCCA}}(x, y) = \sup_{c, d \in \mathbb{R}^n} \frac{c^T \tilde{G}_x \tilde{G}_y d}{\sqrt{c^T (\tilde{G}_x)^2 c}\, \sqrt{d^T (\tilde{G}_y)^2 d}}$. Centered Gram matrix: $G_x = [k(x_i, x_j)]_{i,j=1}^{n}$, $\tilde{G}_x = \underbrace{H_n G_x H_n}_{\text{centering in feature space}}$, $H_n = I_n - \frac{\mathbf{1}_{n \times n}}{n}$. $c, d$ come from a generalized eigenvalue problem ($Av = \lambda B v$). A bit of regularization: $\tilde{G}_x \to \tilde{G}_x + \kappa I_n$, $\tilde{G}_y \to \ldots$
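A compact numpy sketch of regularized empirical KCCA. The generalized eigenvalue problem is solved via singular values of $(\tilde{G}_x^2 + \kappa I)^{-1/2} \tilde{G}_x \tilde{G}_y (\tilde{G}_y^2 + \kappa I)^{-1/2}$; note the ridge is added to $\tilde{G}^2$ here, a common variant of the slide's $\tilde{G} \to \tilde{G} + \kappa I$. Kernel choice, $\gamma$, and $\kappa$ are illustrative assumptions.

```python
import numpy as np

def gauss_gram(X, gamma=0.5):
    """Gaussian Gram matrix of one sample set (rows = points)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def center(G):
    """H G H with H = I - 1/n: centering in feature space."""
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ G @ H

def inv_sqrt(A):
    """Symmetric inverse square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def kcca(Gx, Gy, kappa=1.0):
    """Largest regularized kernel canonical correlation rho_hat."""
    Gx, Gy = center(Gx), center(Gy)
    n = Gx.shape[0]
    A = inv_sqrt(Gx @ Gx + kappa * np.eye(n))
    B = inv_sqrt(Gy @ Gy + kappa * np.eye(n))
    return np.linalg.svd(A @ Gx @ Gy @ B, compute_uv=False)[0]
```

Whitening both Gram factors keeps the estimate in $[0, 1]$; strongly dependent pairs score higher than independent ones.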
76 KCCA as an independence measure. If $x \perp y$, then $\rho_{\mathrm{KCCA}}(x, y; \kappa) = 0$. The opposite direction holds for rich $H_k$, $H_l$ [Bach and Jordan, 2002, Gretton et al., 2005]. Enough: a universal kernel, i.e. $H_k$ dense in $C_b(\mathcal{X})$. Examples: the Gaussian and Laplacian kernels.
79 ITE toolbox. Estimators for dependency measures (including KCCA), distances on distributions (including MMD), independence of random variables (including HSIC), and many more... Link:
81 Maximum Mean Discrepancy (MMD)
82 MMD estimator: intuition
85 $\widehat{\mathrm{MMD}}^2(P, Q) = \bar{G}_{P,P} + \bar{G}_{Q,Q} - 2 \bar{G}_{P,Q}$ (without diagonals in $\bar{G}_{P,P}$, $\bar{G}_{Q,Q}$). $\widehat{\mathrm{MMD}}$ & $\widehat{\mathrm{HSIC}}$ illustration credit: Arthur Gretton.
86 MMD estimator. Feature of a distribution: $\mu_P = E_{x \sim P}\, \varphi(x)$. MMD = difference between feature means: $\mathrm{MMD}^2(P, Q) := \|\mu_P - \mu_Q\|_{H_k}^2$, $\widehat{\mathrm{MMD}}_u^2(P, Q) = \bar{G}_{P,P} + \bar{G}_{Q,Q} - 2 \bar{G}_{P,Q}$ using samples $\{x_i\}_{i=1}^{m} \sim P$, $\{y_j\}_{j=1}^{n} \sim Q$. Computational complexity: $O\big((m + n)^2\big)$, quadratic.
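The unbiased estimator above is a few lines of numpy: Gram-block means with the within-sample diagonals dropped, minus twice the cross-block mean (Gaussian kernel and $\gamma$ are illustrative choices).

```python
import numpy as np

def gauss_gram(A, B, gamma=0.5):
    """Gaussian kernel matrix between two sample sets (rows = points)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2_u(X, Y, gamma=0.5):
    """Unbiased MMD^2: diagonal-free within-sample Gram means minus
    twice the cross-sample Gram mean. O((m + n)^2) cost."""
    m, n = len(X), len(Y)
    Kxx = gauss_gram(X, X, gamma)
    Kyy = gauss_gram(Y, Y, gamma)
    Kxy = gauss_gram(X, Y, gamma)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())
```

Being unbiased, the estimate can be slightly negative when $P = Q$; it is clearly positive for well-separated distributions.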
90 Hilbert-Schmidt Independence Criterion (HSIC)
91 HSIC: intuition. X : images, Y: descriptions
92 HSIC intuition: Gram matrices. Empirical estimate: $\widehat{\mathrm{HSIC}}^2 = \frac{1}{n^2} \langle \tilde{G}_x, \tilde{G}_y \rangle_F$. Relation to MMD: $\mathrm{HSIC}(P_{xy}) = \mathrm{MMD}(P_{xy}, P_x \otimes P_y)$.
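The empirical estimate above (Frobenius inner product of the two centered Gram matrices, scaled by $1/n^2$) can be sketched as follows; Gaussian kernels on both arguments and the $\gamma$ value are assumptions for illustration.

```python
import numpy as np

def gauss_gram(A, gamma=0.5):
    """Gaussian Gram matrix of one sample set (rows = points)."""
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def hsic2(X, Y, gamma=0.5):
    """Biased HSIC^2 estimate: (1/n^2) <H Gx H, H Gy H>_F."""
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n      # centering in feature space
    Gx = H @ gauss_gram(X, gamma) @ H
    Gy = H @ gauss_gram(Y, gamma) @ H
    return (Gx * Gy).sum() / n ** 2
```

The estimate is nonnegative (inner product of two PSD matrices) and picks up the non-linear dependence $y = x^2$, which covariance alone would miss.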
95 Cocktail party: HSIC demo
96 ISA reminder. $x = As$, $s = [s^1; \ldots; s^M]$, where the $s^m$-s are non-Gaussian & independent. Goal: from $\{x_t\}_{t=1}^{T}$ recover $W \approx A^{-1}$ and $\{s_t\}_{t=1}^{T}$. Objective function: $\hat{s} = Wx$, $J(W) = I(\hat{s}^1, \ldots, \hat{s}^M) \to \min_W$.
98 ISA: source, observation. Hidden sources (s): Observation (x):
100 ISA: estimated sources using HSIC, ambiguity. Estimated sources ($\hat{s}$): Performance ($\hat{W}A$), ambiguity:
102 Conjecture: ISA separation theorem [Cardoso, 1998]. ISA = ICA + permutation. [Figure: $\widehat{\mathrm{HSIC}}(\hat{s}^i, \hat{s}^j)$; here $\dim(s^m) = 3$.] Basis of the state-of-the-art ISA solvers. Sufficient conditions [Szabó et al., 2012]: $s^m$ spherical.
107 ISA separation theorem. For $\dim(s^m) = 2$, less is sufficient. Invariance to 90° rotation: $f(u_1, u_2) = f(-u_2, u_1) = f(-u_1, -u_2) = f(u_2, -u_1)$; permutation and sign: $f(\pm u_1, \pm u_2) = f(\pm u_2, \pm u_1)$; $L_p$-spherical: $f(u_1, u_2) = h\big(\sum_i |u_i|^p\big)$ ($p > 0$).
110 Mean embedding, MMD, HSIC: applications & review. Applications: domain adaptation [Zhang et al., 2013] and domain generalization [Blanchard et al., 2017], interpretable machine learning [Kim et al., 2016], kernel belief propagation [Song et al., 2011], kernel Bayes' rule [Fukumizu et al., 2013], model criticism [Lloyd et al., 2014], approximate Bayesian computation [Park et al., 2016], probabilistic programming [Schölkopf et al., 2015], distribution classification [Muandet et al., 2011] and distribution regression [Szabó et al., 2016], topological data analysis [Kusano et al., 2016], post-selection inference [Yamada et al., 2016], causal inference [Mooij et al., 2016, Pfister et al., 2017, Strobl et al., 2017]. Mean embedding review: [Muandet et al., 2017].
111 Hypothesis Testing
112 Two-sample testing: recall. Given: $X = \{x_i\}_{i=1}^{n} \sim P$, $Y = \{y_j\}_{j=1}^{n} \sim Q$. Example: $x_i$ = i-th happy face, $y_j$ = j-th sad face. Problem: using $X, Y$ test $H_0 : P = Q$ vs $H_1 : P \ne Q$.
114 Ingredients of a two-sample test. Test statistic: $\hat{\lambda}_n = \hat{\lambda}_n(X, Y)$, random. Significance level: $\alpha$. Under $H_0$: $P_{H_0}(\underbrace{\hat{\lambda}_n \le T_\alpha}_{\text{correctly accepting } H_0}) = 1 - \alpha$. Under $H_1$: $P_{H_1}(T_\alpha < \hat{\lambda}_n) = P(\text{correctly rejecting } H_0) =: \text{power}$. [Figure: null and alternative distributions of $\hat{\lambda}_n$ with the threshold $T_\alpha$.]
116 Two-sample test using MMD, asymptotics under $H_0$ [Gretton et al., 2007, Gretton et al., 2012]. By U-statistic theory, $n \widehat{\mathrm{MMD}}_u^2(P, P) \xrightarrow{d} \sum_{i=1}^{\infty} \lambda_i \big(z_i^2 - 2\big)$, where $z_i \sim N(0, 2)$ i.i.d., $\int \tilde{k}(x, x') v_i(x)\, \mathrm{d}P(x) = \lambda_i v_i(x')$, $\tilde{k}(x, x') = \langle \varphi(x) - \mu_P, \varphi(x') - \mu_P \rangle_{H_k}$. [Figure: empirical pdf of $\widehat{\mathrm{MMD}}_u^2$.]
117 Null approximations; test statistics: quadratic in time. Small sample size: permutation test. Medium sample size: gamma approximation [figure: gamma pdfs for several $(\alpha, \beta)$], or truncated spectrum expansion [Gretton et al., 2009]. Large sample size: online techniques [Gretton et al., 2012] (large variance); recent linear methods (soon).
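The small-sample option, a permutation test, can be sketched as follows: pool the two samples, compute the Gram matrix once, and re-split the pooled sample by permuting labels to approximate the null distribution of $\widehat{\mathrm{MMD}}_u^2$ (kernel, $\gamma$, and the number of permutations B are illustrative).

```python
import numpy as np

def mmd2_u_from_gram(K, m):
    """Unbiased MMD^2 from the pooled Gram matrix K
    (first m rows/cols: X, the rest: Y)."""
    n = K.shape[0] - m
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())

def mmd_permutation_test(X, Y, gamma=0.5, B=200, seed=0):
    """Return (statistic, p-value); permuting labels = permuting K's rows/cols."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)                       # pooled Gram: computed once
    m = len(X)
    stat = mmd2_u_from_gram(K, m)
    null = []
    for _ in range(B):
        p = rng.permutation(len(Z))
        null.append(mmd2_u_from_gram(K[np.ix_(p, p)], m))
    p_value = (1 + sum(s >= stat for s in null)) / (B + 1)
    return stat, p_value
```

The `+1` terms give a valid (slightly conservative) p-value even with few permutations.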
122 Independence testing with HSIC. Similar story [Gretton et al., 2008, Pfister et al., 2016]. Null asymptotics: $\sum_{i=1}^{\infty} \lambda_i z_i^2$, $z_i \sim N(0, 1)$. In practice: permutation test / gamma approximation.
123 Related work. 2-sample testing: block-MMD [Zaremba et al., 2013] (variance grows). 3-variable interaction [Sejdinovic et al., 2013]. Goodness-of-fit [Chwialkowski et al., 2016]. Time-series: independence (stationary → shift) [Chwialkowski and Gretton, 2014]; wild bootstrap [Chwialkowski et al., 2014, Rubenstein et al., 2016]. block-HSIC [Zhang et al., 2017]: RFF acceleration. Conditional independence & RFF [Strobl et al., 2017].
129 Linear-time Tests
130 Linear-time 2-sample test [Chwialkowski et al., 2015]. Recall: $\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{H_k}$. Idea: change this to $\rho(P, Q) := \sqrt{\frac{1}{J} \sum_{j=1}^{J} [\mu_P(v_j) - \mu_Q(v_j)]^2}$ with random test locations $\{v_j\}_{j=1}^{J}$. $\rho$ is a.s. a metric for a Gaussian $k$ (bounded, analytic, characteristic).
133 Estimation. $\hat{\rho}^2(P, Q) = \frac{1}{J} \sum_{j=1}^{J} [\hat{\mu}_P(v_j) - \hat{\mu}_Q(v_j)]^2 = \frac{1}{J} \bar{z}_n^T \bar{z}_n$, where $\bar{z}_n = \frac{1}{n} \sum_{i=1}^{n} \underbrace{[k(x_i, v_j) - k(y_i, v_j)]_{j=1}^{J}}_{z_i} \in \mathbb{R}^J$. Good news: estimation is linear in $n$! Bad news: intractable null distribution, $n \hat{\rho}^2(P, P) \xrightarrow{d}$ sum of $J$ correlated $\chi^2$-s. Modified statistic (whitening): $\hat{\lambda}_n = n \bar{z}_n^T \Sigma_n^{-1} \bar{z}_n$, $\Sigma_n = \widehat{\mathrm{cov}}(\{z_i\}_{i=1}^{n})$. Under $H_0$: $\hat{\lambda}_n \xrightarrow{d} \chi^2(J)$ ⇒ easy to get the $(1 - \alpha)$-quantile!
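A minimal sketch of the whitened statistic $\hat{\lambda}_n = n \bar{z}_n^T \Sigma_n^{-1} \bar{z}_n$. The Gaussian kernel, $\gamma$, and the small ridge added to $\Sigma_n$ for invertibility are practical assumptions, not part of the slide's derivation.

```python
import numpy as np

def me_statistic(X, Y, V, gamma=0.5, reg=1e-6):
    """Mean-embedding test statistic: n * zbar^T Sigma^{-1} zbar with
    z_i = [k(x_i, v_j) - k(y_i, v_j)]_j at J test locations V.
    Cost is O(n) in the sample size; ~ chi^2(J) under H0."""
    def k(A, W):
        d2 = ((A[:, None, :] - W[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)               # (n, J) feature matrix
    Z = k(X, V) - k(Y, V)                        # z_i stacked as rows
    n, J = Z.shape
    zbar = Z.mean(axis=0)
    Sigma = np.cov(Z, rowvar=False) + reg * np.eye(J)
    return n * zbar @ np.linalg.solve(Sigma, zbar)
```

With $J = 5$ locations, reject $H_0$ at $\alpha = 0.05$ when the statistic exceeds the $\chi^2(5)$ 0.95-quantile, about 11.07 (or use `scipy.stats.chi2.ppf` for other $J$).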
137 Optimize locations and kernel. So far the test locations ($\{v_j\}_{j=1}^{J}$) and kernel parameters were fixed. Idea [Jitkrittum et al., 2016]: optimize them for a power proxy, $\lambda_n = n \bar{z}^T \Sigma^{-1} \bar{z}$ (population counterpart of $\hat{\lambda}_n$). In practice: power is similar to the $O(n^2)$ MMD test, but in $O(n)$ time! Independence testing [Jitkrittum et al., 2017a]. Goodness-of-fit [Jitkrittum et al., 2017b]: Best Paper Award at NIPS-2017!
140 Demo-1 = NLP: most/least discriminative words. NIPS papers ( ). TF-IDF representation. Bayes vs. Neuro. Most discriminative words: spike, markov, cortex, dropout, recurr, iii, gibb. The learned test locations are highly interpretable: markov, gibb (⇐ Gibbs): Bayesian inference; spike, cortex: key terms in neuroscience.
141 Least discriminative words: circumfer, bra, dominiqu, rhino, mitra, kid, impostor.
142 Demo-2: distinguish positive/negative emotions. Karolinska Directed Emotional Faces (KDEF) [Lundqvist et al., 1998]. 70 actors = 35 females and 35 males. Grayscale images, $d = 48 \times$ . Pixel features. +: happy, neutral, surprised; -: afraid, angry, disgusted. Learned test location (averaged) = [figure].
143 Summary. Dependency measures, distances: KCCA, HSIC, MMD. Tool: kernels. Applications: image registration, ISA, distribution regression, feature selection, hypothesis testing. MMD, HSIC: $O(n^2)$; linear-time tests ($\mathbb{R}^d$).
146 Thank you for your attention!
147 References
Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48.
Blanchard, G., Deshmukh, A. A., Dogan, U., Lee, G., and Scott, C. (2017). Domain generalization by marginal transfer learning. Technical report.
Cardoso, J.-F. (1998). Multidimensional independent component analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Chwialkowski, K. and Gretton, A. (2014). A kernel independence test for random processes. In International Conference on Machine Learning (ICML; JMLR W&CP), volume 32.
Chwialkowski, K., Ramdas, A., Sejdinovic, D., and Gretton, A. (2015). Fast two-sample testing with analytic representations of probability measures. In Neural Information Processing Systems (NIPS).
Chwialkowski, K., Sejdinovic, D., and Gretton, A. (2014). A wild bootstrap for degenerate kernel tests. In Neural Information Processing Systems (NIPS).
Chwialkowski, K., Strathmann, H., and Gretton, A. (2016). A kernel test of goodness of fit. In International Conference on Machine Learning (ICML).
Fukumizu, K., Song, L., and Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13.
Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. (2009). A fast, consistent kernel two-sample test. In Neural Information Processing Systems (NIPS).
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. (2007). A kernel statistical test of independence. In Neural Information Processing Systems (NIPS).
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. (2008). A kernel statistical test of independence. In Neural Information Processing Systems (NIPS).
Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. (2005). Kernel methods for measuring independence. Journal of Machine Learning Research, 6.
Jitkrittum, W., Szabó, Z., Chwialkowski, K., and Gretton, A. (2016). Interpretable distribution features with maximum testing power. In Neural Information Processing Systems (NIPS).
Jitkrittum, W., Szabó, Z., and Gretton, A. (2017a). An adaptive test of independence with analytic kernel embeddings. In International Conference on Machine Learning (ICML), volume 70. PMLR.
Jitkrittum, W., Xu, W., Szabó, Z., Fukumizu, K., and Gretton, A. (2017b). A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems (NIPS).
Kim, B., Khanna, R., and Koyejo, O. O. (2016). Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems (NIPS).
Kusano, G., Fukumizu, K., and Hiraoka, Y. (2016). Persistence weighted Gaussian kernel for topological data analysis. In International Conference on Machine Learning (ICML).
Kybic, J. (2004). High-dimensional mutual information estimation for image registration. In IEEE International Conference on Image Processing (ICIP).
Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2014). Automatic construction and natural-language description of nonparametric regression models. In AAAI Conference on Artificial Intelligence.
Lundqvist, D., Flykt, A., and Öhman, A. (1998). The Karolinska Directed Emotional Faces - KDEF. Technical report.
Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. (2016). Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research, 17:1-102.
Muandet, K., Fukumizu, K., Dinuzzo, F., and Schölkopf, B. (2011). Learning from distributions via support measure machines. In Neural Information Processing Systems (NIPS).
Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. (2017). Kernel mean embedding of distributions: A review and beyond. Technical report.
Neemuchwala, H., Hero, A., Zabuawala, S., and Carson, P. (2007). Image registration methods in high dimensional space. International Journal of Imaging Systems and Technology, 16.
Park, M., Jitkrittum, W., and Sejdinovic, D. (2016). K2-ABC: Approximate Bayesian computation with kernel embeddings. In International Conference on Artificial Intelligence and Statistics (AISTATS; PMLR), volume 51.
Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8).
Pfister, N., Bühlmann, P., Schölkopf, B., and Peters, J. (2016). Kernel-based tests for joint independence. Technical report.
Pfister, N., Bühlmann, P., Schölkopf, B., and Peters, J. (2017). Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Póczos, B., Singh, A., Rinaldo, A., and Wasserman, L. (2013). Distribution-free distribution regression. In International Conference on AI and Statistics (AISTATS; JMLR W&CP), volume 31.
Rényi, A. (1959). On measures of dependence. Acta Mathematica Academiae Scientiarum Hungaricae, 10.
Rubenstein, P. K., Chwialkowski, K. P., and Gretton, A. (2016). A kernel test for three-variable interactions with random processes. In Conference on Uncertainty in Artificial Intelligence (UAI).
Schölkopf, B., Muandet, K., Fukumizu, K., Harmeling, S., and Peters, J. (2015). Computing functions of random variables via reproducing kernel Hilbert space representations. Statistics and Computing, 25(4).
Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A kernel test for three-variable interactions. In Neural Information Processing Systems (NIPS).
Song, L., Gretton, A., Bickson, D., Low, Y., and Guestrin, C. (2011). Kernel belief propagation. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Strobl, E. V., Visweswaran, S., and Zhang, K. (2017). Approximate kernel-based conditional independence tests for fast non-parametric causal discovery.
157 Technical report. Szabó, Z., Póczos, B., and Lörincz, A. (2012). Separation theorem for independent subspace analysis and its consequences. Pattern Recognition, 45(4): Szabó, Z., Sriperumbudur, B. K., Póczos, B., and Gretton, A. (2016). Learning theory for distribution regression. Journal of Machine Learning Research, 17(152):1 40. Yamada, M., Umezu, Y., Fukumizu, K., and Takeuchi, I. (2016). Post selection inference with kernels. Technical report. ( Zaremba, W., Gretton, A., and Blaschko, M. (2013). B-tests: Low variance kernel two-sample tests.
158 In NIPS, pages Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. (2013). Domain adaptation under target and conditional shift. Journal of Machine Learning Research, 28(3): Zhang, Q., Filippi, S., Gretton, A., and Sejdinovic, D. (2017). Large-scale kernel methods for independence testing. Statistics and Computing, pages 1 18.