Hypothesis Testing with Kernels

Size: px

Start display at page:

Download "Hypothesis Testing with Kernels"

Elvin Palmer
6 years ago
Views:

1 (Gatsby Unit, UCL) PRNI, Trento June 22, 2016

2 Motivation: detecting differences in AM signals Amplitude modulation: simple technique to transmit voice over radio. in the example: 2 songs. Fragments from song 1 P x, song 2 P y.

3 Motivation: detecting differences in AM signals Amplitude modulation: simple technique to transmit voice over radio. in the example: 2 songs. Fragments from song 1 P x, song 2 P y. Question: P x P y?

4 Motivation: discrete domain - 2-sample testing How do we compare distributions? Given: 2 sets of text fragments (fisheries, agriculture). x 1 : Now disturbing reports out of Newfoundland show that the fragile snow crab industry is in serious decline. First the west coast salmon, the east coast salmon and the cod, and now the snow crabs off Newfoundland. x 2 : To my pleasant surprise he responded that he had personally visited those wharves and that he had already announced money to fix them. What wharves did the minister visit in my riding and how much additional funding is he going to provide for Delaps Cove, Hampton, Port Lorne, y 1 : Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet. y 2 : On the grain transportation system we have had the Estey report and the Kroeger report. We could go on and on. Recently programs have been announced over and over by the government such as money for the disaster in agriculture on the prairies and across Canada....

5 Motivation: discrete domain - 2-sample testing How do we compare distributions? Given: 2 sets of text fragments (fisheries, agriculture). x 1 : Now disturbing reports out of Newfoundland show that the fragile snow crab industry is in serious decline. First the west coast salmon, the east coast salmon and the cod, and now the snow crabs off Newfoundland. x 2 : To my pleasant surprise he responded that he had personally visited those wharves and that he had already announced money to fix them. What wharves did the minister visit in my riding and how much additional funding is he going to provide for Delaps Cove, Hampton, Port Lorne, y 1 : Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet. y 2 : On the grain transportation system we have had the Estey report and the Kroeger report. We could go on and on. Recently programs have been announced over and over by the government such as money for the disaster in agriculture on the prairies and across Canada.... Do tx i u and ty j u come from the same distribution, i.e. P x P y?

6 Motivation: discrete domain - independence testing How do we detect dependency? (paired samples) x 1 : Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet. x 2 : No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.... y 1 : Honorables sénateurs, ma question s adresse au leader du gouvernement au Sénat et concerne l aide financiére qu on a annoncée pour les agriculteurs. La plupart des agriculteurs n ont encore rien reu de cet argent. y 2 : Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n a pas réduit le financement qu il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants....

7 Motivation: discrete domain - independence testing How do we detect dependency? (paired samples) x 1 : Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet. x 2 : No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.... y 1 : Honorables sénateurs, ma question s adresse au leader du gouvernement au Sénat et concerne l aide financiére qu on a annoncée pour les agriculteurs. La plupart des agriculteurs n ont encore rien reu de cet argent. y 2 : Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n a pas réduit le financement qu il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.... Are the French paragraphs translations of the English ones, or have nothing to do with it, i.e. P XY P X P Y?

8 Outline 1 RKHS based metric on probability distributions. 2 2-sample testing: Nonparametric. Distance between distribution representations.

9 Outline 1 RKHS based metric on probability distributions. 2 2-sample testing: Nonparametric. Distance between distribution representations. 3 Independence testing: Dependency detection. Distance between joint (P XY ) and product of marginals (P X P Y ).

10 Kernels

11 Kernels on numerous data types Kernels exist on essentially any data type: images, texts, graphs, time series, dynamical systems,... ñ distribution representation, hypothesis testing: on all these domains.

12 Towards representations of distributions: EX Given: 2 Gaussians with different means. Solution: t-test.

13 Towards representations of distributions: EX 2 Setup: 2 Gaussians; same means, different variances. Idea: look at the 2nd-order features of RVs.

14 Towards representations of distributions: EX 2 Setup: 2 Gaussians; same means, different variances. Idea: look at the 2nd-order features of RVs. ϕ x x 2 ñ difference in EX 2.

15 Towards representations of distributions: further moments Setup: a Gaussian and a Laplacian distribution. Challenge: their means and variances are the same. Idea: look at higher-order features. Let us consider feature representations!

16 Kernel: similarity between features Given: x and x 1 P X objects (images, texts,...).

17 Kernel: similarity between features Given: x and x 1 P X objects (images, texts,...). Question: how similar they are?

18 Kernel: similarity between features Given: x and x 1 P X objects (images, texts,...). Question: how similar they are? Define features of the objects: ϕ x : features of x, ϕ x 1 : features of x 1. Kernel: inner product of these features kpx,x 1 q : ϕ x,ϕ x 1 H.

19 Kernel examples X R d : k p px,yq p x,y `γq p, k G px,yq e γ}x y}2 2, k e px,yq e γ}x y} 1 2, kc px,yq 1 ` γ }x y} 2. 2

20 Kernel examples X R d : k p px,yq p x,y `γq p, k G px,yq e γ}x y}2 2, k e px,yq e γ}x y} 1 2, kc px,yq 1 ` γ }x y} 2. 2 X = texts, strings: bag-of-word kernel, r-spectrum kernel: # of common ď r-substrings.

21 Kernel examples X R d : k p px,yq p x,y `γq p, k G px,yq e γ}x y}2 2, k e px,yq e γ}x y} 1 2, kc px,yq 1 ` γ }x y} 2. 2 X = texts, strings: bag-of-word kernel, r-spectrum kernel: # of common ď r-substrings. X = time-series: dynamic time-warping.

22 Two-sample testing

23 Ingredient: maximum mean discrepancy

24 Ingredient: maximum mean discrepancy

25 Ingredient: maximum mean discrepancy

26 Ingredient: maximum mean discrepancy {MMD 2 pp,qq Ě K P,P ` ĘK Q,Q 2 Ę K P,Q (without diagonals in Ě K P,P, Ę K Q,Q )

27 From kernel trick to mean trick Recall: ϕ x P H: feature of x P X. Kernel: kpx,x 1 q ϕ x,ϕ x 1 H.

28 From kernel trick to mean trick Recall: ϕ x P H: feature of x P X. Kernel: kpx,x 1 q ϕ x,ϕ x 1 H. Mean embedding: Feature of P: µ P : E x P rϕ x s P Hpkq. Inner product: µ P,µ Q H E x P,x1 Qkpx,x 1 q.

29 From kernel trick to mean trick Recall: ϕ x P H: feature of x P X. Kernel: kpx,x 1 q ϕ x,ϕ x 1 H. Mean embedding: Feature of P: µ P : E x P rϕ x s P Hpkq. Inner product: µ P,µ Q H E x P,x1 Qkpx,x 1 q. µ P : well-defined for all distributions (bounded k).

30 Maximum mean discrepancy Squared difference between feature means: MMD 2 pp,qq }µ P µ Q } 2 H µ P µ Q,µ P µ Q H µ P,µ P H ` µ Q,µ Q H 2 µ P,µ Q H E P,P kpx,x 1 q `E Q,Q kpy,y 1 q 2E P,Q kpx,yq.

31 Maximum mean discrepancy Squared difference between feature means: MMD 2 pp,qq }µ P µ Q } 2 H µ P µ Q,µ P µ Q H µ P,µ P H ` µ Q,µ Q H 2 µ P,µ Q H E P,P kpx,x 1 q `E Q,Q kpy,y 1 q 2E P,Q kpx,yq. Unbiased empirical estimate for tx i u n i 1 P, ty ju n j 1 Q: {MMD 2 pp,qq Ě K P,P ` ĘK Q,Q 2 Ę K P,Q.

32 Two-sample test using MMD Two hypotheses: H 0 (null hypothesis): P Q. H 1 (alternative hypothesis): P Q.

33 Two-sample test using MMD Two hypotheses: H 0 (null hypothesis): P Q. H 1 (alternative hypothesis): P Q. Observation: tx i u n i 1 P, ty ju n j 1 Q. Decision: if { MMD 2 pp,qq is far from 0 ñ reject H 0.

34 Two-sample test using MMD Two hypotheses: H 0 (null hypothesis): P Q. H 1 (alternative hypothesis): P Q. Observation: tx i u n i 1 P, ty ju n j 1 Q. Decision: if { MMD 2 pp,qq is far from 0 ñ reject H 0. Threshold =? one answer ÝÝÝÝÝÝÑ asymptotic distribution of { MMD 2.

35 Two-sample test using MMD: H 1 Under H 1 (P Q): asymptotic distribution of { MMD 2 is Gaussian.

36 Two-sample test using MMD: H 0 Under H 0 (P Q): asymptotic distribution is n MMD { 8ÿ 2 pp,pq λ i pzi 2 2q, where z i Np0,2q i.i.d., ż kpx,x 1 qv i pxqdppxq λ i v i px 1 q, kpx,x 1 q ϕ x µ P,ϕ x 1 µ P H. X i 1

37 Two-sample test using MMD: threshold To the decision: given that P Q, we want threshold T such that Ppn { MMD 2 ą T q ď 0.05 : α.

38 Two-sample test using MMD: threshold Task: P n MMD { 2 ą T ď α. Solutions: permutation test: below, kernel eigenspectrum estimate: ˆλ i. moment matching: Gamma approximation.

39 Demo: amplitude modulated signals Question: P x P y?

40 Results: AM signals (120kHz) n 10, 000. Average over 4124 trials. Gaussian noise: added.

41 Independence testing

42 Independence testing Given: 2 kernel-endowed domain: px, kq, py, lq, paired samples: tpx i,y i qu n i 1 P XY. Hypotheses: H 0 : P XY P X P Y, H 1 : P XY P X P Y.

43 Independence testing Given: 2 kernel-endowed domain: px, kq, py, lq, paired samples: tpx i,y i qu n i 1 P XY. Hypotheses: H 0 : P XY P X P Y, H 1 : P XY P X P Y. Statistics: HSIC MMD 2 pp XY,P X P Y q }µ PXY µ PX P Y } 2 Hpǩq, ǩppx,yq, px 1,y 1 qq kpx,x 1 qlpy,y 1 q.

44 HSIC in terms of expectations HSICpP XY,P X P Y q E xy E x 1 y 1rkpx,x1 qlpy,y 1 qs Let us consider an example! `E x E x 1rkpx,x 1 qse y E y 1rlpy,y 1 qs 2E x 1 y 1 Ex kpx,x 1 qe y lpy,y 1 q

45 HSIC: intuition. X: images, Y: descriptions.

46 HSIC intuition: Gram matrices

47 HSIC intuition: Gram matrices Empirical estimate: zhsic pp XY,P X P Y q 1 n 2 phkh HLHq``, H I n n 1 11 T.

48 Independence testing: decision Under H 0 : nhsic z Ñ 8-sum of weighted χ 2... Permutation test: 1 Compute HSIC for tx i,y πpiq u n i 1 with many π-s. 2 Estimate the p1 αq-quantile from the empirical CDF.

49 Demo: translation example 5-line extracts. kernel: bag-of-words, r-spectrum (r 5) sample size: n 10. repetitions: 300. Results: r-spectrum: average Type-II error = 0 (α 0.05), bag-of-words: 0.18.

50 Summary Kernels on images, texts, graphs, time series,... RKHS based metric on probability distributions. Applications: 2-sample testing: MMD. independence testing: HSIC. No density estimation.

51 Contents AM signals. Kernel examples. Universal kernel: definition, examples. MMD: IPM representation. HSIC: Where HS is coming from?

52 AM signals s i : i th song. observation (s ÞÑ y): yptq cospω c tqpasptq `o c q `nptq, where nptq: Gaussian noise. The AM signals were sampled at 120kHz.

53 Kernel examples k G pa,bq e }a b} 2 2 2θ 2, k e pa,bq e }a b} 2 2θ 2, 1 1 k C pa,bq, k t pa,bq 1 ` }a b}2 2 1 ` }a b} θ, θ 2 2 k p pa,bq p a,b `θq p, k r pa,bq 1 }a b}2 2 }a b} 2 2 `θ, 1 k i pa,bq, b}a b} 22 `θ2 k M, 3 pa,bq 2 k M, 5 pa,bq 2? 3 }a b}2 1 ` θ? 5 }a b}2 1 ` θ? 3}a b}2 e θ, ` 5 }a b}2 2 3θ 2? 5}a b}2 e θ.

54 Universal kernel: definition Assume X: compact, metric, k : X ˆX Ñ R kernel is continuous. Then Def-1: k is universal if Hpkq is dense in pcpxq, } } 8 q.

55 Universal kernel: definition Assume Then X: compact, metric, k : X ˆX Ñ R kernel is continuous. Def-1: k is universal if Hpkq is dense in pcpxq, } } 8 q. Def-2: k is characteristic, if µ : M `1 pxq Ñ Hpkq is injective.

56 Universal kernel: definition Assume X: compact, metric, k : X ˆX Ñ R kernel is continuous. Then Def-1: k is universal if Hpkq is dense in pcpxq, } } 8 q. Def-2: k is characteristic, if µ : M `1 pxq Ñ Hpkq is injective. universal, if µ is injective on the finite signed measures of X.

57 Universal kernel: examples On compact subsets of R d (β ą 0): kpa,bq e β}a b}2 2, kpa,bq e β}a b} 1, kpa,bq e β a,b, pβ ą 0q, or more generally 8ÿ kpa,bq f p a,b q, f pxq a n x n p@a n ą 0q. Universal ñ characteristic. n 0

58 MMD: IPM represenation Let F : tf P Hpkq : }f } H ď 1u be the unit ball in H. Then MMDpP,Q;Fq : supre x P f pxq E y Q f pyqs, f PF supr f,µ P H f,µ Q H supr f,µ P µ Q H f PF f PF }µ P µ Q } H.

59 HSIC: Where HS is coming from? Players: px,kq, py,lq, P XY, P X, P Y ; C XY : Hplq Ñ Hpkq. C XY E XY rpϕ x µ PX q b pϕ y µ PY qs, f,c XY g Hpkq E XY rf pxq E X f pxqsrgpyq E Y HSIC }C XY } 2 HS.

Kernel methods. MLSS Arequipa, Peru, Arthur Gretton. Gatsby Unit, CSML, UCL

Kernel methods. MLSS Arequipa, Peru, Arthur Gretton. Gatsby Unit, CSML, UCL Kernel methods MLSS Arequipa, Peru, 2016 Arthur Gretton Gatsby Unit, CSML, UCL Some motivating questions... Detecting differences in brain signals The problem: Dolocalfieldpotential(LFP)signalschange when