Clustering by Important Features PCA (IF-PCA)

1 Clustering by Important Features PCA (IF-PCA)

Rare/Weak Signals and Phase Diagrams

Jiashun Jin, CMU
David Donoho (Stanford)
Zheng Tracy Ke (Univ. of Chicago)
Wanjie Wang (Univ. of Pennsylvania)

August 12, 2015

2 Clustering subjects using microarray data

Ten datasets (with K classes, n subjects, and p genes each):
1. Brain: Pomeroy (02)
2. Breast Cancer: Wang et al. (05)
3. Colon Cancer: Alon et al. (99)
4. Leukemia: Golub et al. (99)
5. Lung Cancer: Gordon et al. (02)
6. Lung Cancer(2): Bhattacharjee et al. (01)
7. Lymphoma: Alizadeh et al. (00)
8. Prostate Cancer: Singh et al. (02)
9. SRBCT: Khan (01)
10. Su-Cancer: Su et al. (01)

Goal. Predict class labels.

3 Left/right singular vectors

For a data matrix X (rows: samples; columns: features):

Left singular vector: ($n$-dimensional) eigenvector of $XX'$
Right singular vector: ($p$-dimensional) eigenvector of $X'X$

4 Principal Component Analysis (PCA)

Karl Pearson (1857-1936)

Idea: a transformation achieving dimension reduction while keeping the main information.

Data = signal + noise (signal matrix: low-rank)

Microarray data (after standardization): many columns of the signal matrix are 0.

5 Important Features PCA (IF-PCA)

Idea: PCA applied to a small fraction of carefully selected features:
- Rank features by the Kolmogorov-Smirnov (KS) statistic
- Select those with the largest KS-scores
- Apply PCA to the post-selection data matrix (features, to selected features, to PCs of the samples)

Azizyan et al. (2013), Chan and Hall (2010), Fan and Lv (2008)

6 IF-PCA (microarray data)

Feature-wise normalization: $W_i(j) = [X_i(j) - \bar{X}(j)]/\hat\sigma(j)$; write $W = [w_1, \ldots, w_p]$ (columns) $= [W_1, \ldots, W_n]'$ (rows), and let $F_{n,j}(t) = \frac{1}{n}\sum_{i=1}^n 1\{W_i(j) \le t\}$.

1. Rank features with Kolmogorov-Smirnov (KS) scores
   $\psi_{n,j} = \sqrt{n}\,\sup_{-\infty < t < \infty} |F_{n,j}(t) - \Phi(t)|$   ($\Phi$: CDF of N(0,1))
2. Renormalize the KS-scores by (Efron's empirical null)
   $\psi^*_{n,j} = [\psi_{n,j} - (\text{mean of all } p \text{ KS-scores})] / (\text{SD of all } p \text{ KS-scores})$
3. Fix $t > 0$. Let $\hat{U}^{(t)} \in R^{n, K-1}$ contain the first $(K-1)$ left singular vectors of the post-selection data matrix $[w_j : \psi^*_{n,j} \ge t]$
4. Apply the classical k-means algorithm to the rows of $\hat{U}^{(t)}$

IF-PCA-HCT: $t = t_p^{HC}$, the Higher Criticism threshold (described below).
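To make the four steps concrete, here is a minimal Python sketch of this pipeline (an illustration written for these notes, not the authors' code); the function name if_pca is mine, and the threshold t is taken as given, whereas IF-PCA-HCT would set it by the Higher Criticism recipe below.

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def if_pca(X, K, t):
    """IF-PCA with a user-supplied threshold t (illustrative sketch)."""
    n, p = X.shape
    # Feature-wise normalization: W_i(j) = [X_i(j) - mean(j)] / sd(j)
    W = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 1: KS score of each feature against N(0,1):
    # psi_j = sqrt(n) * sup_t |F_{n,j}(t) - Phi(t)|
    Ws = np.sort(W, axis=0)
    i = np.arange(1, n + 1)[:, None]
    Phi = norm.cdf(Ws)
    psi = np.sqrt(n) * np.maximum(np.abs(i / n - Phi),
                                  np.abs((i - 1) / n - Phi)).max(axis=0)

    # Step 2: renormalize by the empirical null (mean/SD of all p scores)
    psi_star = (psi - psi.mean()) / psi.std(ddof=1)

    # Step 3: first (K-1) left singular vectors of the retained columns
    # (assumes at least K-1 features have psi_star >= t)
    U, _, _ = np.linalg.svd(W[:, psi_star >= t], full_matrices=False)

    # Step 4: classical k-means on the rows of U^(t)
    return KMeans(n_clusters=K, n_init=10).fit_predict(U[:, :K - 1])
```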

7 The blessing of feature selection

[Figure: x-axis: 1, 2, ..., n; y-axis: entries of $\hat{U}^{HC}$ (i.e., $\hat{U}^{(t)}$ for $t = t_p^{HC}$). Left: plot of $\hat{U}^{HC}$ (Lung Cancer; K = 2, so $\hat{U}^{HC} \in R^n$ is a vector). Right: the counterpart of $\hat{U}^{HC}$ without feature selection. Class labels: ADCA, MPM.]

8 Efron's null correction (Lung Cancer)

Efron's theoretical null: the density of $\psi_{n,j}$ when $X_i(j) \stackrel{iid}{\sim} N(u_j, \sigma_j^2)$ (it does not depend on $(j, u_j, \sigma_j)$ and is easy to simulate).

[Figure: the theoretical null is a bad fit to the $\psi_{n,j}$ (top) but a nice fit to the renormalized $\psi^*_{n,j}$ (bottom).]

9 How to set the threshold t?

- CV: not directly implementable (class labels unknown)
- FDR: needs tuning and targets Rare/Strong signals [Benjamini and Hochberg (1995), Efron (2010)]

[Figure: x-axis: t (threshold); curves: #{selected features}, feature-FDR, errors.]

10 Tukey's Higher Criticism

John W. Tukey (1915-2000)

11 Higher Criticism (HC)

Review papers: Donoho and Jin (2015), Jin and Ke (2015)

- Proposed by Donoho and Jin (2004) for sparse signal detection
- Found useful in GWAS, DNA Copy Number Variants (CNV), Cosmology and Astronomy, and disease surveillance
- Extended in many directions: Innovated HC (for colored noise), signal detection in a regression model, HC when the noise distribution is unknown/non-Gaussian, estimating the proportion of non-null effects
- Threshold choice in classification [Donoho and Jin (2008, 2009), Fan et al. (2013), Jin (2009)]

12 Threshold choice by HC for IF-PCA

Jin and Wang (2015), Jin, Ke and Wang (2015)

$HC_p(t) = \dfrac{\sqrt{p}\,[\hat{G}_p(t) - \bar{F}_0(t)]}{\sqrt{\max\{\sqrt{n}\,[\hat{G}_p(t) - \bar{F}_0(t)],\,0\} + \hat{G}_p(t)}}$  (new)

$t_p^{HC} = \mathrm{argmax}_t\{HC_p(t)\}$

$\bar{F}_0$: survival function of Efron's theoretical null
$\hat{G}_p$: empirical survival function of the renormalized KS-scores $\psi^*_{n,j}$

13 HCT: implementation

- Compute P-values: $\pi_j = \bar{F}_0(\psi^*_{n,j})$, $1 \le j \le p$
- Sort P-values: $\pi_{(1)} < \pi_{(2)} < \ldots < \pi_{(p)}$
- Define the HC functional by
  $HC_{p,k} = \dfrac{\sqrt{p}\,(k/p - \pi_{(k)})}{\sqrt{k/p + \max\{\sqrt{n}\,(k/p - \pi_{(k)}),\,0\}}}$
- Let $\hat{k} = \mathrm{argmax}_{\{1 \le k \le p/2,\ \pi_{(k)} > \log(p)/p\}}\{HC_{p,k}\}$. The HC threshold $t_p^{HC}$ is the $\hat{k}$-th largest KS-score.

Note: with $\psi_{(k)}$ denoting the sorted KS-scores,
$HC_p(t)\big|_{t = \psi_{(k)}} = \dfrac{\sqrt{p}\,[k/p - \pi_{(k)}]}{\sqrt{k/p + \sqrt{n}\,(k/p - \pi_{(k)})_+}}$.
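A minimal Python sketch of these steps (illustrative; hc_threshold is my name, not from the talk), taking the renormalized KS-scores and their P-values as inputs:

```python
import numpy as np

def hc_threshold(psi_star, pvals, n):
    """Return t_p^HC given renormalized KS scores and p-values pi_j."""
    p = len(pvals)
    pi = np.sort(pvals)                       # pi_(1) <= ... <= pi_(p)
    k = np.arange(1, p + 1)
    diff = k / p - pi
    hc = np.sqrt(p) * diff / np.sqrt(k / p + np.maximum(np.sqrt(n) * diff, 0.0))
    # maximize over 1 <= k <= p/2 with pi_(k) > log(p)/p
    ok = (k <= p / 2) & (pi > np.log(p) / p)
    k_hat = k[ok][np.argmax(hc[ok])]
    # the HC threshold is the k_hat-th largest KS score
    return np.sort(psi_star)[::-1][k_hat - 1]
```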

14 Illustration

[Figure: three panels showing the ordered KS-scores, the ordered P-values, and the HC objective function.]

15 Illustration, II (Lung Cancer)

[Figure: x-axis: # of selected features; y-axis: error rates by IF-PCA; the HC threshold is marked.]

16 Comparison

r = (error rate of IF-PCA-HCT) / (minimum error rate of all other methods)

[Table: error rates of kmeans, kmeans++, hierarchical clustering, SpecGem*, and IF-PCA-HCT, together with r, for each dataset: Brain (.09), Breast Cancer (.05), Colon Cancer (.07), Leukemia (.09), Lung Cancer (.09), Lung Cancer(2) (.00), Lymphoma (.13), Prostate Cancer (.01), SRBCT (.06), SuCancer (.05).]

*: SpecGem is classical PCA [Lee et al. (2010)]; kmeans++: Arthur and Vassilvitskii (2007)

17 Sparse PCA and variants of IF-PCA

[Table: clustering error rates of Clu-sPCA, IF-kmeans, IF-Hier, and IF-PCA-HCT on the ten datasets (Brain, Breast Cancer, Colon Cancer, Leukemia, Lung Cancer, Lung Cancer(2), Lymphoma, Prostate Cancer, SRBCT, SuCancer).]

Clu-sPCA: project to the estimated feature space (sparse PCA) and then cluster; it is unclear how to set $\lambda$ (the ideal $\lambda$ is used above); clustering $\ne$ feature estimation.

Zou et al. (2006), Witten and Tibshirani (2010)

18 Summary (so far)

IF-PCA-HCT consists of three simple steps:
- Marginal screening (KS)
- Threshold choice (empirical null + HCT)
- Post-selection PCA

It is tuning-free, fast, and yet effective, and it is easily extendable and adaptable.

Next: theoretical reasons for HCT in RW settings.

19 RW viewpoint

In many types of Big Data, the signals of interest are not only sparse (rare) but also individually weak, and we have no prior knowledge of where these RW signals are.

- Large p, small n (e.g., genetics and genomics)
- Technical limitation (e.g., astronomy)
- Early detection (e.g., disease surveillance)

[Cartoon: signal strength vs. n ($ or labor); clustering: $\alpha = 6$; classification: $\alpha = 2$.]

20 RW model and Phase Diagram

Many methods and theories target Rare/Strong signals ("if conditions XX hold and all signals are sufficiently strong...").

Our proposal:
- RW model: a parsimonious model capturing the main factors (sparsity and signal strength)
- Phase Diagram: provides sharp results characterizing when the desired goal is impossible or possible to achieve; an approach to distinguishing non-optimal and optimal procedures

21 Sparse signal detection (global testing)

Donoho and Jin (2004), Ingster (1997, 1999), Jin (2004)

$X = \mu + Z$, $X \in R^p$, $Z \sim N(0, I_p)$

$H_0^{(p)}: \mu = 0$  vs.  $H_1^{(p)}: \mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$

Calibration and subtlety of the problem:
$\epsilon_p = p^{-\beta}$, $\tau_p = \sqrt{2r\log(p)}$, $0 < \beta < 1$, $r > 0$

- Rare: only a small fraction of the means are nonzero
- Weak: signals are only moderately strong

22 Phase Diagram (signal detection/recovery)

Standard phase function: $r = \rho(\beta)$, where
$\rho(\beta) = \begin{cases} 0, & 0 < \beta \le 1/2 \\ \beta - 1/2, & 1/2 < \beta \le 3/4 \\ (1 - \sqrt{1 - \beta})^2, & 3/4 < \beta < 1 \end{cases}$

[Figure: phase diagrams in the $(\beta, r)$ plane. Left: detection, with Detectable and Undetectable regions. Right: recovery, with Undetectable, Detectable, Estimable (Almost Full Recovery), and Exact Recovery regions.]
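As a small helper, a direct transcription of the piecewise phase function above into Python (the name rho is mine):

```python
import numpy as np

def rho(beta):
    """Standard phase function rho(beta) for 0 < beta < 1."""
    if beta <= 0.5:
        return 0.0
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - np.sqrt(1.0 - beta)) ** 2
```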

23 RW settings: why we care?

- Growing awareness of irreproducibility
- Many methods and theories focus on Rare/Strong signals and do not cope well with RW settings:
  - FDR-controlling methods (shown before)
  - L0/L1-penalization methods
  - Minimax framework

Ioannidis (2005), "Why most published research findings are false"; Donoho and Stark (1989); Chen et al. (1995); Tibshirani (1996); Abramovich et al. (2006); Jin et al. (2015); Ke et al. (2015)

24 A two-class model for clustering

Jin, Ke and Wang (2015)

$X_i = \ell_i \mu + Z_i$, $Z_i \stackrel{iid}{\sim} N(0, I_p)$, $i = 1, \ldots, n$  ($p \gg n$)

- $\ell_i = \pm 1$: unknown class labels (the main interest)
- $\mu \in R^p$: feature vector
- RW: only a small fraction of the entries of $\mu$ are nonzero, and each contributes weakly to the clustering

Goal. Theoretical insight on HCT (and more).
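The model is easy to simulate; a minimal sketch (the values of n, p, eps, and tau below are illustrative choices of mine, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10_000
eps, tau = 0.01, 0.5                        # sparsity and strength (illustrative)

l = rng.choice([-1, 1], size=n)             # unknown class labels l_i
mu = tau * rng.binomial(1, eps, size=p)     # rare/weak feature vector mu
X = np.outer(l, mu) + rng.standard_normal((n, p))   # X_i = l_i * mu + Z_i
```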

25 IF-PCA simplified to the two-class model

Step | Microarray | Two-class model
pre-normalization | yes | skipped
feature-wise screening | Kolmogorov-Smirnov: $\psi_j = \sup_t |F_{n,j}(t) - \Phi(t)|$ | chi-square: $\psi_j = (\|x_j\|^2 - n)/\sqrt{2n}$
re-normalization | Efron's null correction | skipped
threshold choice | HCT | same
post-selection PCA | same | same

26 Asymptotic Rare/Weak (ARW) model

$X = \ell\mu' + Z \in R^{n,p}$, $Z$: iid N(0,1) entries

- $\ell_i = \pm 1$ with equal probability
- $\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$
- Large p, small n: $n = p^\theta$, $0 < \theta < 1$
- RW signal: $\epsilon_p = p^{-\beta}$, $\tau_p = \sqrt[4]{(4/n)\,r\log(p)}$

Note: $\psi_j = (\|x_j\|^2 - n)/\sqrt{2n} \approx N(0, 1)$ or $N(\sqrt{2r\log(p)}, 1)$.
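The chi-square screening statistic is one line of code; a sketch continuing the simulation above (chi_square_scores is my name):

```python
import numpy as np

def chi_square_scores(X):
    """psi_j = (||x_j||^2 - n) / sqrt(2n), one score per feature column."""
    n = X.shape[0]
    return ((X ** 2).sum(axis=0) - n) / np.sqrt(2 * n)

# On the simulated X above, null columns give scores roughly N(0, 1):
# psi = chi_square_scores(X)
# print(psi[mu == 0].mean(), psi[mu == 0].std())
```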

27 Three functions: HC, idealhc, SNR Given X = lµ + Z, we retain feature j if ψ j t F0 (t): survival function of normalized χ 2 2n (0) Ĝ p (t): empirical survival function ψ j Û (t) : first left singular vector of [x j : ψ j t] 1. HC(t) : p[ Ĝ p (t) F 0 (t)]/[ n[ĝp(t) F 0 (t)]+ĝp(t)] idealhc(t) : HC(t) with Ĝp(t) replaced by Ḡp(t) 3. SNR(t): Û (t) SNR(t) l + z + rem, z N(0, I n ) Jiashun Jin, CMU Clustering by Important Features PCA (IF-PCA) 27 / 42

28 Three thresholds: $t_p^{HC} \approx t_p^{idealHC} \approx t_p^{ideal}$

[Figure: HC, idealHC, and SNR as functions of the threshold t, with the HCT marked.]

$SNR(t) = \dfrac{E[\|\mu^{(t)}\|^2]}{\sqrt{E[\|\mu^{(t)}\|^2] + E[|\hat{S}(t)|]/n}}$,  compare  $idealHC(t) = \dfrac{\sqrt{p}\,(\bar{G}_p(t) - \bar{F}_0(t))}{\sqrt{\sqrt{n}\,[\bar{G}_p(t) - \bar{F}_0(t)]_+ + \bar{G}_p(t)}}$

$\mu^{(t)}$: $\mu$ restricted to $\hat{S}(t)$ (the index set of all retained features), with

$E[\|\mu^{(t)}\|^2] \approx \tau_p^2\, p\,[\bar{G}_p(t) - \bar{F}_0(t)]$,
$E[\|\mu^{(t)}\|^2] + E[|\hat{S}(t)|]/n \approx p\,\{\tau_p^2\,[\bar{G}_p(t) - \bar{F}_0(t)] + \bar{G}_p(t)/n\}$

29 Impossibility

Let $\rho(\beta)$ be the standard phase function (defined earlier). Define
$\rho_\theta(\beta) = (1 - \theta)\,\rho(\beta/\theta)$, recalling $n = p^\theta$.

Consider IF-PCA with threshold $t > 0$: let $\hat{U}^{(t)} \in R^n$ be the first left singular vector of the post-selection data matrix $[x_j : \psi_j \ge t]$. Consider ARW with
$n = p^\theta$, $\epsilon_p = p^{-\beta}$, $\tau_p = \sqrt[4]{(4/n)\,r\log(p)}$.

If $r < \rho_\theta(\beta)$, then for any threshold t,
$|\mathrm{Cos}(\hat{U}^{(t)}, \ell)| \le c_0 < 1$  (IF-PCA partially fails)

30 Possibility

If $r > \rho_\theta(\beta)$, then:

- For some t, $\mathrm{Cos}(\hat{U}^{(t)}, \ell) \to 1$
- There is a non-stochastic function $SNR(t)$ such that for some t, $SNR(t) \gg 1$ and $\hat{U}^{(t)} \approx SNR(t)\,\ell + z + rem$, $z \sim N(0, I_n)$
- HCT yields the right threshold choice: $t_p^{HC}/t_p^{ideal} \to 1$ in probability, where $t_p^{ideal} = \mathrm{argmax}_t\{SNR(t)\}$
- IF-PCA-HCT yields successful clustering: $\mathrm{Hamm}_p(\hat{\ell}^{HCT}; \beta, r, \theta) \to 0$, where
  $\mathrm{Hamm}_p(\hat{\ell}; \beta, r, \theta) = (n/2)^{-1} \min_{b = \pm\mathrm{sgn}(\hat{\ell})} \big\{\sum_{i=1}^n P\big(b_i \ne \mathrm{sgn}(\ell_i)\big)\big\}$
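An empirical counterpart of this Hamming error is easy to compute; a small sketch (hamming_error is my name), minimizing over the global sign flip:

```python
import numpy as np

def hamming_error(l_hat, l):
    """Fraction of misclustered labels (relative to n/2), up to a sign flip."""
    errs = lambda b: np.sum(np.sign(b * l_hat) != np.sign(l))
    return min(errs(1.0), errs(-1.0)) / (len(l) / 2)
```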

31 Phase Diagram (IF-PCA)

#(useful features) $\approx p^{1-\beta}$, $\tau_p = \sqrt[4]{(4/n)\,r\log(p)}$, $n = p^\theta$ ($\theta = .6$)

[Figure: phase diagrams in the $(\beta, r)$ plane showing Success and Failure regions.]

32 Summary (so far)

- Big Data are here, but the signals are RW
- RW model and Phase Diagrams: a theoretical framework specifically for RW settings, allowing for insights that would otherwise be overlooked
- HC provides the right threshold choice
- Limitation: IF-PCA only works at a specific signal strength and in a limited sparsity range

We want a more complete story: statistical limits for clustering under ARW.

33 ARW for statistical limits (a slight change)

Model: $X = \ell\mu' + Z$; $\ell_i = \pm 1$ with equal probability; $\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$; $n = p^\theta$, $\epsilon_p = p^{-\beta}$.

Calibration | For statistical limits | For IF-PCA
$\tau_p$ | $\tau_p = p^{-\alpha}$, $\alpha > 0$ | $\tau_p = \sqrt[4]{(4/n)\,r\log(p)}$
$s := \#\{\text{signals}\}$ | $1 \le s \le p$ | $\sqrt{n} \le s \le \sqrt{p}$

[Figure: the most interesting range for IF-PCA in the plane of sparsity (sparser to the right) vs. signal strength (weaker upward), marked by the lines $\tau = n^{-1/4}$, $s = p^{1/2}$, and $s = n^{1/2}$.]

34 Aggregation methods

Simple Aggregation: $\hat{\ell}^{sa} = \mathrm{sgn}(\bar{x})$

Sparse Aggregation: $\hat{\ell}^{sa}_N = \mathrm{sgn}(\bar{x}^{\hat{S}})$, where $\hat{S} = \hat{S}(N) = \mathrm{argmax}_{\{S: |S| = N\}}\{\|\bar{x}^S\|_1\}$

[Figure: cartoons of the data matrix (samples by features, columns $x_1, x_2, \ldots, x_p$) aggregated over all columns ($\bar{x}$) and over a selected subset of columns ($\bar{x}^S$).]
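Simple aggregation is one line; a sketch (simple_aggregation is my name; exact sparse aggregation searches over all size-N subsets and is NP-hard, so it is not attempted here):

```python
import numpy as np

def simple_aggregation(X):
    """l_hat = sgn(x_bar): sign of the average of all feature columns."""
    return np.sign(X.mean(axis=1))           # one label per subject (row)
```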

35 Comparison of methods

Method | Simple Agg. ($\hat{\ell}^{sa}$) | PCA ($\hat{\ell}^{if}$) | Sparse Agg. ($\hat{\ell}^{sa}_N$, $N \ll p$) | IF-PCA ($\hat{\ell}^{if}_t$, $t > 0$)
Sparsity | dense | dense | sparse | sparse
Strength | weak | weak | strong* | strong
F. Selection | No | No | Yes | Yes
Complexity | Poly. | Poly. | NP-hard | Poly.
Tuning | No | No | Yes | Yes**

Notation. $\hat{\ell}^{if}_t$: IF-PCA adapted to the two-class model; $\hat{\ell}^{if}$: classical PCA (a special case).
*: signals are comparably stronger but still weak. **: a tuning-free version exists.

36 Statistical limits (clustering)

$\mathrm{Hamm}_p(\hat{\ell}; \beta, \alpha, \theta) = (n/2)^{-1} \min_{b = \pm\mathrm{sgn}(\hat{\ell})} \big\{\sum_{i=1}^n P\big(b_i \ne \mathrm{sgn}(\ell_i)\big)\big\}$

$\eta_\theta^{clu}(\beta) = \begin{cases} (1 - 2\beta)/2, & 0 < \beta < (1 - \theta)/2 \\ \theta/2, & (1 - \theta)/2 < \beta < 1 - \theta \\ (1 - \beta)/2, & \beta > 1 - \theta \end{cases}$

Consider ARW where $n = p^\theta$, $\epsilon_p = p^{-\beta}$, $\tau_p = p^{-\alpha}$.

- When $\alpha > \eta_\theta^{clu}(\beta)$: $\mathrm{Hamm}_p(\hat{\ell}; \beta, \alpha, \theta) \gtrsim 1$
- When $\alpha < \eta_\theta^{clu}(\beta)$:
  $\mathrm{Hamm}_p(\hat{\ell}^{sa}; \beta, \alpha, \theta) \to 0$ for $\beta < (1 - \theta)/2$;
  $\mathrm{Hamm}_p(\hat{\ell}^{sa}_N; \beta, \alpha, \theta) \to 0$ for $\beta > (1 - \theta)/2$  ($N = p^{1-\beta}$)
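The boundary transcribes directly into code (eta_clu is my name):

```python
def eta_clu(beta, theta):
    """Clustering phase boundary eta_theta^clu(beta), 0 < beta < 1."""
    if beta < (1 - theta) / 2:
        return (1 - 2 * beta) / 2
    if beta < 1 - theta:
        return theta / 2
    return (1 - beta) / 2
```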

37 Phase Diagram (clustering; $\theta = 0.6$)

[Figure: two phase diagrams in the $(\beta, \alpha)$ plane, with arrows toward weaker signals and sparser settings. Left (statistical limits, $\tau_p = p^{-\alpha}$, $0 < \beta < 1$): Impossible, Simple Aggregation, and Sparse Aggregation (NP-hard) regions, with boundaries $\tau = p^{1/2}/s$, $\tau = n^{-1/2}$, and $\tau = s^{-1/2}$. Right (IF-PCA, $\tau_p = \sqrt[4]{(4/n)\,r\log(p)}$, $1/2 < \beta < 1 - \theta/2$): Impossible, Simple Aggregation, NP-hard, Classical PCA, and IF-PCA regions.]

38 Two closely related problems

$X = \ell\mu' + Z$, $Z$: iid N(0,1) entries

(sig). Estimate the support of $\mu$ (signal recovery):
$\mathrm{Hamm}_p(\hat{\mu}; \beta, r, \theta) = (p\epsilon_p)^{-1} \sum_{j=1}^p P\big(\mathrm{sgn}(\hat{\mu}_j) \ne \mathrm{sgn}(\mu_j)\big)$

(hyp). Test $H_0^{(p)}$: $X = Z$ against the alternative $H_1^{(p)}$: $X = \ell\mu' + Z$ (global hypothesis testing):
$\mathrm{TestErr}_p(\hat{T}; \beta, r, \theta) = P_{H_0^{(p)}}(\text{Reject } H_0^{(p)}) + P_{H_1^{(p)}}(\text{Accept } H_0^{(p)})$

Arias-Castro and Verzelen (2015), Johnstone and Lu (2001), Berthet and Rigollet (2013)

39 Statistical limits (three problems)

$\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$, $n = p^\theta$, $s \approx \#\{\text{useful features}\} \approx p^{1-\beta}$, signal strength $\tau_p = p^{-\alpha}$

[Figure: phase boundaries in the $(\beta, \alpha)$ plane for Clustering, Signal Recovery, and Hypothesis Testing, annotated with the curves $\tau = n^{-1/4}$, $\tau = p^{1/2}/s$, $\tau = n^{-1/2}$, $\tau = (sn)^{-1/4}$, $\tau = s^{-1/2}$, and the line $s = n$.]

40 Computable upper bounds (two problems)

Left: signal recovery. Right: (global) hypothesis testing.

$\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$, $n = p^\theta$, $s \approx \#\{\text{useful features}\} \approx p^{1-\beta}$, signal strength $\tau_p = p^{-\alpha}$

[Figure: phase boundaries in the $(\beta, \alpha)$ plane comparing statistical and computable limits, alongside the clustering boundary. Left: Clustering, Signal Recovery (stat), Signal Recovery (compu), Hypothesis Testing. Right: Clustering, Signal Recovery, Hypothesis Testing (stat), Hypothesis Testing (compu). Annotated with curves such as $\tau = n^{-1/4}(\sqrt{p}/s)$, $\tau = n^{-1/2}$, $\tau = (p/n)^{1/4} s^{-1/2}$, $\tau = n^{-1/4}$, and the line $s = p^{1/2}$.]

41 Take-home message

- Big Data are here, but your signals are RW; we need to do many things very differently
- RW model and Phase Diagrams are a theoretical framework specifically designed for RW settings, exposing many insights we do not see with more traditional frameworks
- IF-PCA and Higher Criticism are simple and easy-to-adapt methods that are provably effective for analyzing real data, especially in Rare/Weak settings

42 Acknowledgements

Many thanks to David Donoho, Stephen Fienberg, Peter Hall, Jon Wellner, and many others for inspiration, collaboration, and encouragement!

Main references.
- Jin, J. and Wang, W. (2015). Important Features PCA for high dimensional clustering. arXiv.
- Jin, J., Ke, Z. and Wang, W. (2015). Phase transitions for high dimensional clustering and related problems. arXiv.

Two review papers on HC.
- Donoho, D. and Jin, J. (2015). Higher Criticism for large-scale inference, especially for rare and weak effects. Statistical Science, 30(1).
- Jin, J. and Ke, Z. (2015). Rare and weak effects in large-scale inference: methods and phase diagrams. arXiv.
