Statistical Emerging Pattern Mining with Multiple Testing Correction

Size: px

Start display at page:

Download "Statistical Emerging Pattern Mining with Multiple Testing Correction"

Brianna Sutton
5 years ago
Views:

1 Statistical Emerging Pattern Mining with Multiple Testing Correction Junpei Komiyama 1, Masakazu Ishihata 2, Hiroki Arimura 2, Takashi Nishibayashi 3, Shin-Ichi Minato 2 1. U-Tokyo 2. Hokkaido Univ. 3. VOYAGE GROUP Inc. (to appear in KDD2017 research track)

2 Single-page Summary p We study Emerging Pattern Mining (EPM) with statistical guarantee. p EPM is also known as contrast set mining, subgroup mining. p We propose two-stage mining methods that control FWER or FDR. p FWER: Family-Wise Error Rate p FDR: False Discovery Rate 2017/8/3 ERATO 感謝祭 2

3 Emerging Pattern Mining p Items: I = {1,, I } p Pattern: e I p Emerging Pattern: p e that appears frequently in D + but not in D. {1, 2} {1, 3, 4} {1, 2, 3} {1, 3} {2, 4} {1, 3, 4} 2017/8/3 ERATO 感謝祭 3

4 Emerging Pattern Mining p Database: p D = { (x i, y i ) : i = 1,,N } where x i I, y i {0,1} p D + = { (x, y) D : y = 1 } p D = { (x, y) D : y = 0} p Emerging Pattern: e s.t. N e+ / N e > a p N e+ = { (x, y) D + : e x } p N e = { (x, y) D : e x } p a = a given threshold p Problems of Emerging Pattern Mining: 1. Too many insignificant patterns are found. 2. Not sure whether the found patterns are just random fluctuation of D or truly significant. 2017/8/3 ERATO 感謝祭 4

5 Statistical Emerging Pattern Mining p Assumption: (x, y) ~i.i.d.~ IP [x, y] p IP [x, y] = unknown true distribution p Positive label prob.: µ e = IP [y = 1 e x] p True / False Emerging Pattern: p ε true = {e 2 I : µ e > a} p ε false = {e 2 I : µ e a} p SEMP: Estimate ε true from D. 2017/8/3 ERATO 感謝祭 5

6 Statistical Emerging Pattern Mining p ε alg = outputs of an algorithm p Family-wise error rate (FWER): p FWER = IP [ ε alg ε false 1 ] p False discovery rate (FDR): p FDR = IE [ ε alg ε false / ε alg ] We propose two-stage mining methods that satisfy FWER q or FDR q. 2017/8/3 ERATO 感謝祭 6

7 Pattern as a hypothesis p Null hypothesis (e ε false ): p H e0 : µ e = a p Alternative hypothesis (e ε true ): p H e1 : µ e > a p P-value: 2017/8/3 ERATO 感謝祭 7

8 Multiple Testing Correction p P-value: p How data is likely to be generated under H e0. p Small p-value à Rare event p Single Hypothesis: p Reject e if p e q è We think e is a True EP. p Control FWER (and FDR) q. p Multiple hypotheses: p p Probability of including false positive gets fairly high. Peeking p-values causes selection bias. Need multiple testing correction. 2017/8/3 ERATO 感謝祭 8

9 Bonferroni correction for FWER p Reject e if p e q / m. p m = # of patterns to test. p Control FWER at level q. 2017/8/3 ERATO 感謝祭 9

10 Step-up correction for FDR p Reject e (1),,e (k) p p (i) = the i-th smaller p-value p e (i) = its corresponding pattern p p BH : c(m) = 1 p BY : c(m) = Σ m i=1 (1 / i) Step-up rejection Level p Control FDR at level q, p under independence among hypotheses (BH) p or under arbitrary correlations (BY). 2017/8/3 ERATO 感謝祭 10

11 But patterns are exponentially large p m (= # of patterns to test) can be exponentially large : 2 I. p Naïve Bonferroni / Step-up corrections cannot find much patterns. p Both corrections have q / m factor Needs to reduce # of patterns to test. 2017/8/3 ERATO 感謝祭 11

12 Existing statistical pattern mining and our result Existing Our result Two-stage Mining methods p LAMP [Terada+ 13] p Proposed for statistical association mining (SAM) p Selects testable patterns before correction, and p Controls FWER. 2017/8/3 ERATO 感謝祭 12

13 Two-stage Mining Method 1. Find a appropriate threshold τ. 2. Test patterns {e : N e > τ} with multiple testing correction. How to choose appropriate τ? We use testability just like LAMP! 2017/8/3 ERATO 感謝祭 13

14 Testability for FWER p Taroneʼs exclusion principle: p Patterns with large p-value can be omitted without testing; it cannot be significant. p FWER can be controlled by q / m test (not m). m test = # of testable patterns p ψ(n e ) = a lower bound of p e p ψ(n) = a N p e is testable if ψ(p e ) q / m test 2017/8/3 ERATO 感謝祭 14

LAMP-EP (LAMP for SEPM) p Finding the largest testable set boils down to finding the following threshold s.t.: (1) (2) : Frequent patterns with min-support τ.

15 LAMP-EP (LAMP for SEPM) p Finding the largest testable set boils down to finding the following threshold s.t.: (1) (2) : Frequent patterns with min-support τ. (1) patterns of support < τ %&'( are untestable. (2) patterns of support τ %&'( are testable. 2017/8/3 ERATO 感謝祭 15

16 Next : Controlling FDR p No principled method yet. p Major challenges: 1. No Taroneʼs exclusion in FDR: Ø We solve this by splitting the dataset into calibration and main datasets. 2. Not sure how to select a testable set. Ø We introduce quasi-testablity (approximated testability). 2017/8/3 ERATO 感謝祭 16

17 Quasi-testability for FDR p Step-up Correction for FDR: p Reject e (1),,e (k) p (i) = the i-th smaller p-value. e (i) = its corresponding pattern. p e is testable if ψ(n e ) p (k) p But p (k) must be unknown to avoid selection bias. p e is quasi-testable if ψ(n e ) p est p Instead of true p (k), we use an estimator p est. p We split D into D main and D carib, and use D carib to estimate p est. 2017/8/3 ERATO 感謝祭 17

(1) patterns of support < τ *+, are untestable, (2) patterns of support τ

18 QT-LAMP-EP for controlling FDR p Find threshold value such that (1) (2) (3) = # of patterns rejected if step-up method is conducted for. (1) patterns of support < τ *+, are untestable, (2) patterns of support τ *+, are testable, under estimated step-up rejection level of (3). 2017/8/3 ERATO 感謝祭 18

19 Computer Simulations p Statistical powers of LAMP-EP and QT- LAMP-EP are compared. p LAMP-EP used the entire dataset for testing. p QT-LAMP-EP used 20% of D as D carib for obtaining p est, and 80% as D main for testing. 2017/8/3 ERATO 感謝祭 19

20 FWER/FDR in synthetic dataset. p Synthetic patterns, a = 0.5, q = p True SEPs: subset of {1,, 10}: μ 8 = 0.7 p Other patterns: subset of {11,, 100}, μ 8 = 0.5. p, 10% of patterns are true SEPs. Results: Controlled q 2017/8/3 ERATO 感謝祭 20

21 Number of discoveries in a classification (mushroom) dataset p Controlling FDR yields more patterns than FWER. More discoveries Similar results hold for other 7 datasets. 2017/8/3 ERATO 感謝祭 21

22 Conclusion p We formulated statistical emerging pattern mining. p We propose two-stage mining methods: p LAMP-EP controls FWER. p QT-LAMP-EP controls FDR. p We empirically verified their statistical power. Thanks! 2017/8/3 ERATO 感謝祭 22

23 Contact us Paper and software are available. Paper: Software: Contact: Junpei Komiyama 2017/8/3 ERATO 感謝祭 23

Statistical Emerging Pattern Mining with Multiple Testing Correction

Statistical Emerging Pattern Mining with Multiple Testing Correction Junpei Komiyama The University of Tokyo junpei@komiyama.info Masakazu Ishihata Hokkaido University ishihata.masakazu@ist.hokudai.ac.jp