Clustering by Important Features PCA (IF-PCA)
1 Clustering by Important Features PCA (IF-PCA): Rare/Weak Signals and Phase Diagrams
Jiashun Jin (CMU), Zheng Tracy Ke (Univ. of Chicago), Wanjie Wang (Univ. of Pennsylvania)
August 5, 2015
Jiashun Jin, CMU Clustering by Important Features PCA (IF-PCA) 1 / 32
2 Clustering subjects using microarray data
Ten datasets (the table's K = # of classes, n = # of subjects, and p = # of genes entries did not survive transcription):
1. Brain, Pomeroy (02)
2. Breast Cancer, Wang et al. (05)
3. Colon Cancer, Alon et al. (99)
4. Leukemia, Golub et al. (99)
5. Lung Cancer, Gordon et al. (02)
6. Lung Cancer(2), Bhattacharjee et al. (01)
7. Lymphoma, Alizadeh et al. (00)
8. Prostate Cancer, Singh et al. (02)
9. SRBCT, Khan (01)
10. Su-Cancer, Su et al. (01)
Goal: predict the class labels.
3 Principal Component Analysis (PCA)
Karl Pearson ( )
Idea: a transformation for dimension reduction that keeps the main information.
Data = signal + noise (the signal matrix is low-rank).
*For microarray data, many columns of the signal matrix are 0 after standardization.
4 Important Features PCA (IF-PCA)
Idea: PCA applied to a small fraction of carefully selected features:
- Rank the features by the Kolmogorov-Smirnov statistic
- Select those with the largest KS-scores
- Apply PCA to the post-selection data matrix
(features → selected features → PC's of the samples)
Azizyan et al. (2013), Chan and Hall (2010), Fan and Lv (2008)
5 IF-PCA & IF-PCA-HCT (microarray data)
Feature-wise normalization: W_i(j) = [X_i(j) − X̄(j)]/σ̂(j); write W = [w_1, ..., w_p] = [W_1, ..., W_n]′ and let F_{n,j}(t) = (1/n) Σ_{i=1}^n 1{W_i(j) ≤ t}.
1. Rank the features with Kolmogorov-Smirnov (KS) scores
   ψ_{n,j} = √n · sup_{−∞<t<∞} |F_{n,j}(t) − Φ(t)|,  (Φ: CDF of N(0,1))
2. Renormalize the KS scores by (Efron's empirical null)
   ψ*_{n,j} = [ψ_{n,j} − (mean of all p KS-scores)] / (SD of all p KS-scores)
3. Fix a threshold t > 0. Let Û(t) ∈ R^{n,K−1} be the first (K−1) left singular vectors of the matrix [w_j : ψ*_{n,j} ≥ t].
   Our proposal: t = t_p^HC, the Higher Criticism threshold (TBA).
4. Apply the classical k-means algorithm to Û_HC = Û(t)|_{t = t_p^HC}.
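As a concrete illustration, the four steps above can be sketched in Python with NumPy/SciPy. This is a minimal sketch on synthetic data, not the authors' implementation; the helper names (`ks_scores`, `if_pca`) and all parameter values are my own, and the HC threshold is replaced by a fixed threshold for brevity.

```python
import numpy as np
from scipy.stats import norm
from scipy.cluster.vq import kmeans2

def ks_scores(W):
    """KS scores psi_{n,j} = sqrt(n) * sup_t |F_{n,j}(t) - Phi(t)|, one per feature."""
    n, p = W.shape
    scores = np.empty(p)
    for j in range(p):
        x = np.sort(W[:, j])
        cdf = norm.cdf(x)
        d_plus = (np.arange(1, n + 1) / n - cdf).max()   # empirical CDF above Phi
        d_minus = (cdf - np.arange(0, n) / n).max()      # empirical CDF below Phi
        scores[j] = np.sqrt(n) * max(d_plus, d_minus)
    return scores

def if_pca(X, K, t):
    """IF-PCA with a fixed threshold t (HCT omitted for brevity)."""
    W = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # feature-wise normalization
    psi = ks_scores(W)
    psi_star = (psi - psi.mean()) / psi.std()          # empirical-null renormalization
    selected = psi_star >= t                           # keep large KS scores only
    U = np.linalg.svd(W[:, selected], full_matrices=False)[0]
    U_hat = U[:, :K - 1]                               # first K-1 left singular vectors
    return kmeans2(U_hat, K, minit='++', seed=0)[1]    # classical k-means labels

# Toy data: 30 useful features out of p = 500, two classes of n = 100 subjects.
rng = np.random.default_rng(0)
n, p = 100, 500
ell = rng.choice([-1, 1], size=n)                      # true class labels
mu = np.zeros(p)
mu[:30] = 1.5
X = np.outer(ell, mu) + rng.standard_normal((n, p))
labels = if_pca(X, K=2, t=1.5)
```

Only the sign pattern of the estimated labels is meaningful, so any accuracy check must allow a global relabeling of the two clusters.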
6 The blessing of feature selection
Left: plot of Û_HC (Lung Cancer; K = 2, so Û_HC ∈ R^n is a vector). x-axis: 1, 2, ..., n; y-axis: entries of Û_HC.
Right: the counterpart of Û_HC without feature selection.
[Figure: two scatter plots, each with the samples labeled ADCA or MPM]
7 Efron's null correction (Lung Cancer)
Efron's theoretical null: the density of ψ_{n,j} when X_i(j) ~iid N(u_j, σ_j²) (it does not depend on (j, u_j, σ_j) and is easy to simulate).
The theoretical null (red) is a bad fit to the ψ_{n,j} (top) but a nice fit to the renormalized ψ*_{n,j} (bottom).
8 How to set the threshold t?
- CV: not directly implementable (the class labels are unknown)
- FDR: needs tuning and targets Rare/Strong signals [Benjamini and Hochberg (1995), Efron (2010)]
Our proposal: set t by Higher Criticism (for RW settings).
t (threshold) ↔ #{selected features} ↔ feature-FDR ↔ errors
9 Tukey's Higher Criticism
John W. Tukey ( )
10 Higher Criticism (HC)
- Review papers: Donoho and Jin (2015), Jin and Ke (2015)
- First proposed by Donoho and Jin (2004) for detecting sparse signals
- Also useful for setting thresholds in cancer classification [Donoho and Jin (2008, 2009), Jin (2009)]
- Found useful in GWAS, DNA Copy Number Variants (CNV), cosmology and astronomy, and disease surveillance
- Extended in many different directions
11 Higher Criticism Threshold (HCT)
Obtain P-values: π_j = 1 − F_0(ψ_{n,j}), 1 ≤ j ≤ p [F_0: CDF of Efron's theoretical null].
Sort the P-values: π_(1) < π_(2) < ... < π_(p).
Define the HC functional by
   HC_{p,k} = √p (k/p − π_(k)) / √( k/p + max{√n (k/p − π_(k)), 0} )
Let k̂ = argmax_{1 ≤ k ≤ p/2, π_(k) > log(p)/p} {HC_{p,k}}. The HC threshold t_p^HC is the k̂-th largest KS-score.
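Given the P-values, the HC functional and k̂ can be computed in a few lines. A sketch following the definition on this slide; the function name and the toy inputs are my own.

```python
import numpy as np

def hc_threshold_index(pvals, n):
    """k-hat = argmax of HC_{p,k} over {1 <= k <= p/2, pi_(k) > log(p)/p} (1-based)."""
    p = len(pvals)
    pi = np.sort(pvals)                               # pi_(1) <= ... <= pi_(p)
    k = np.arange(1, p + 1)
    gap = k / p - pi                                  # k/p - pi_(k)
    hc = np.sqrt(p) * gap / np.sqrt(k / p + np.maximum(np.sqrt(n) * gap, 0))
    allowed = (k <= p / 2) & (pi > np.log(p) / p)     # the maximization range
    hc[~allowed] = -np.inf
    return int(np.argmax(hc)) + 1

# Toy example: 20 signal-like (tiny) P-values among 980 uniform nulls.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(0, 1e-4, 20), rng.uniform(0, 1, 980)])
k_hat = hc_threshold_index(pvals, n=100)
# The HC threshold t_p^HC is then the k_hat-th largest KS-score.
```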
12 Illustration
[Figure: the ordered KS scores, the ordered P-values, and the HC objective function]
13 Illustration, II (Lung Cancer)
[Figure: error rates of IF-PCA against the number of selected features, with the HC threshold marked; x-axis: # of selected features; y-axis: error rates of IF-PCA]
14 Comparison
r = (error rate of IF-PCA-HCT) / (minimum error rate of all other methods)
Table columns: Data set, K, kmeans, kmeans++, Hier, SpecGem, IF-PCA-HCT, r; most numeric entries did not survive transcription:
1 Brain (.09); 2 Breast Cancer (.05); 3 Colon Cancer (.07); 4 Leukemia (.09); 5 Lung Cancer (.09); 6 Lung Cancer(2) (.00); 7 Lymphoma (.13); 8 Prostate Cancer (.01); 9 SRBCT (.06); 10 SuCancer (.05)
Arthur and Vassilvitskii (2007), Hastie et al. (2009), Lee et al. (2010)
15 Sparse PCA and variants of IF-PCA
Table columns: Data set, K, Clu-sPCA, IF-kmeans, IF-Hier, IF-PCA-HCT; rows: Brain, Breast Cancer, Colon Cancer, Leukemia, Lung Cancer, Lung Cancer(2), Lymphoma, Prostate Cancer, SRBCT, SuCancer (the numeric entries did not survive transcription).
Clu-sPCA: project onto the estimated feature space (sparse PCA) and then cluster.
It is unclear how to set λ (the ideal λ is used above); clustering ≠ feature estimation.
Zou et al. (2006), Witten and Tibshirani (2010)
16 Summary (so far)
IF-PCA-HCT consists of three simple steps:
- Marginal screening (KS)
- Threshold choice (empirical null + HCT)
- Post-selection PCA
It is tuning-free, fast, and yet effective; it is also easily extendable and adaptable.
Remaining problems: why HCT works, and the statistical limits for clustering and related problems.
17 The RW viewpoint
In many types of Big Data, the signals of interest are not only sparse (rare) but also individually weak, and we have no prior knowledge of where these RW signals are:
- Large p, small n (e.g., genetics and genomics); signal strength grows with n, and n is limited by $ or manpower (clustering: α = 4 or 6, classification: α = 2)
- Technical limitation (e.g., astronomy)
- Early detection (e.g., disease surveillance)
18 RW model and Phase Diagram
Many scientific findings are not reproducible [Ioannidis (2005), "Why most published research findings are false"], and many methods/theories target Rare/Strong signals ["if conditions XX hold and all signals are sufficiently strong..."].
Our proposal:
- RW model: a parsimonious model capturing the main factors (sparsity and signal strength)
- Phase Diagram: provides sharp results that characterize when the desired goal is impossible or possible to achieve; an approach to distinguishing non-optimal from optimal procedures
19 A two-class model for clustering [Jin, Ke & Wang (2015)]
X_i = l_i μ + Z_i,  Z_i ~iid N(0, I_p),  i = 1, ..., n  (p ≫ n)
- l_i = ±1: unknown class labels (the main interest)
- μ ∈ R^p: feature vector
- RW: only a small fraction of the entries of μ are nonzero, and each contributes weakly to the clustering
Interest: the rationale for HCT, and statistical limits.
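When the signal is strong enough, the labels in this model can be read off, up to a global sign flip, from the first left singular vector of the data matrix. A small simulation sketch; all parameter values are illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 400
ell = rng.choice([-1.0, 1.0], size=n)            # unknown class labels l_i
mu = np.zeros(p)
mu[rng.choice(p, size=50, replace=False)] = 1.0  # sparse feature vector mu
X = np.outer(ell, mu) + rng.standard_normal((n, p))   # X_i = l_i * mu + Z_i

u = np.linalg.svd(X, full_matrices=False)[0][:, 0]    # first left singular vector
ell_hat = np.sign(u)
# The labels are identifiable only up to a global sign flip:
agree = max(np.mean(ell_hat == ell), np.mean(ell_hat == -ell))
```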
20 IF-PCA simplified to the two-class model
Step by step (microarray setting vs. two-class model):
- Pre-normalization: yes vs. skipped
- Feature-wise screening: Kolmogorov-Smirnov, ψ_j = sup_t |F_{n,j}(t) − Φ(t)|, vs. chi-square, ψ_j = (‖x_j‖² − n)/√(2n)
- Re-normalization: Efron's null correction vs. skipped
- Threshold choice: HCT vs. the same
- Post-selection PCA: the same vs. the same
Notation: l̂_t^if is IF-PCA with threshold t > 0; l̂^if is classical PCA (a special case).
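In the two-class model the screening step reduces to the standardized chi-square statistic above, which is roughly N(0,1) for a pure-noise column. A quick sketch; the function name and the toy parameters are my own.

```python
import numpy as np

def chi2_scores(X):
    """psi_j = (||x_j||^2 - n) / sqrt(2n), one score per feature column."""
    n = X.shape[0]
    return ((X ** 2).sum(axis=0) - n) / np.sqrt(2 * n)

rng = np.random.default_rng(4)
n, p = 200, 1000
X = rng.standard_normal((n, p))
ell = rng.choice([-1.0, 1.0], size=n)
X[:, :25] += 0.5 * ell[:, None]        # 25 useful features of strength 0.5
psi = chi2_scores(X)                   # useful columns get elevated scores
```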
21 Aggregation methods
Simple Aggregation: l̂^sa = sgn(x̄).
Sparse Aggregation: l̂_N^sa = sgn(x̄_Ŝ), where Ŝ = Ŝ(N) = argmax_{S: |S| = N} ‖x̄_S‖_1.
[Figure: schematics of the n × p data matrix (columns x_1, x_2, ..., x_p): Simple Aggregation averages all columns into x̄; Sparse Aggregation averages only the columns in S into x̄_S]
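Simple aggregation is one line of code. The sketch below assumes, as in the ARW model later, that the nonzero entries of μ share a common sign, which is what makes the plain column average informative; all parameter values are illustrative.

```python
import numpy as np

def simple_agg(X):
    """l-hat^sa = sgn(x-bar): the sign of the average of all p feature columns."""
    return np.sign(X.mean(axis=1))

rng = np.random.default_rng(3)
n, p = 100, 2000
ell = rng.choice([-1.0, 1.0], size=n)
mu = np.zeros(p)
mu[:200] = 0.5                          # 200 weak features, all of the same sign
X = np.outer(ell, mu) + rng.standard_normal((n, p))

ell_hat = simple_agg(X)
# Labels are identifiable only up to a global sign flip:
agree = max(np.mean(ell_hat == ell), np.mean(ell_hat == -ell))
```

The combinatorial sparse aggregation (the argmax over all size-N subsets) is NP-hard, as the next slide notes, so it is not sketched here.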
22 Comparison of methods
Method:            Simple Agg. (l̂^sa) | PCA (l̂^if) | Sparse Agg. (l̂_N^sa, N ≤ p) | IF-PCA (l̂_t^if, t > 0)
Sparsity:          dense | dense | sparse | sparse
Strength:          weak | weak | strong* | strong
Feature Selection: No | No | Yes | Yes
Complexity:        Poly. | Poly. | NP-hard | Poly.
Tuning:            No | No | Yes | Yes**
*: the signals are comparably stronger but still weak
**: a tuning-free version exists
23 Asymptotic Rare/Weak (ARW) model
X = lμ′ + Z ∈ R^{n,p}, Z: iid N(0,1) entries
l_i = ±1 with equal probability, μ(j) ~iid (1 − ε_p) ν_0 + ε_p ν_{τ_p}
Large p, small n: n = p^θ, 0 < θ < 1
Rare signals: ε_p = p^{−β}
Range for sparsity/signal strength: full for the statistical limits, more specific for IF-PCA:
- For statistical limits: τ_p = p^{−α}, α > 0; range of β: 0 < β < 1
- For IF-PCA: τ_p = ((2r log p)/n)^{1/4}; range of β: 1/2 < β < 1 − θ/2
24 Phase function for IF-PCA
Fix 0 < θ < 1 and define ρ*_θ(β) = (1 − θ) ρ((β − θ/2)/(1 − θ)), where ρ(β) is the standard phase function [Donoho and Jin (2004), Ingster (1997, 1999), Jin (2009)]:
ρ(β) = 0,                  0 < β ≤ 1/2
ρ(β) = β − 1/2,            1/2 < β ≤ 3/4
ρ(β) = (1 − √(1 − β))²,    3/4 < β < 1
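The standard phase function is easy to code. A sketch: `rho` follows the standard definition of Donoho and Jin (2004), while the argument rescaling inside `rho_theta` (chosen so that the IF-PCA range (1/2, 1 − θ/2) maps onto (1/2, 1)) is an assumption; both function names are mine.

```python
import math

def rho(beta):
    """Standard phase function of Donoho-Jin (2004), defined for 0 < beta < 1."""
    if beta <= 0.5:
        return 0.0
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2

def rho_theta(beta, theta):
    """Phase function for IF-PCA on 1/2 < beta < 1 - theta/2 (rescaling assumed)."""
    return (1.0 - theta) * rho((beta - theta / 2.0) / (1.0 - theta))
```

Note that rho is continuous at both break points: rho(0.75) = 0.25 on either branch, and rho(0.5) = 0.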
25 Phase transition for IF-PCA
Theorem 1. Consider IF-PCA with threshold t, and let Û(t) ∈ R^n be the first left singular vector of the post-selection data matrix.
- Impossibility. If r < ρ*_θ(β), then for any threshold t, Angle(Û(t), l) ≥ c_0 (IF-PCA partially fails).
- Possibility. If r > ρ*_θ(β), then:
  - For t in an appropriate range, Angle(Û(t), l) → 0 and Û(t) ≈ SNR(t)·l + z + rem, with SNR(t) ≫ 1 and z ~ N(0, I_n);
  - HCT yields the right threshold choice: t_p^HC / t_p^ideal → 1 in probability, where t_p^ideal = argmax_t {SNR(t)};
  - IF-PCA-HCT yields successful clustering: Hamm_p(l̂^HCT; β, r, θ) → 0.
26 Phase Diagram (IF-PCA)
#(useful features) ≈ p^{1−β}, τ_p = ((2r log p)/n)^{1/4}; n = p^θ (θ = .6)
[Figure: phase diagram in the (β, r) plane, with a Success region above the phase curve and a Failure region below it]
27 Statistical limits (clustering)
Hamm_p(l̂; β, α, θ) = (n/2)^{−1} min_{b = ±sgn(l̂)} Σ_{i=1}^n P(b_i ≠ sgn(l_i))

η_θ^clu(β) =
   (1 − 2β)/2,   0 < β < (1 − θ)/2
   θ/2,          (1 − θ)/2 < β < 1 − θ
   (1 − β)/2,    β > 1 − θ

Theorem 1. When α > η_θ^clu(β), Hamm_p(l̂; β, α, θ) ≳ 1. When α < η_θ^clu(β), Hamm_p(l̂^sa; β, α, θ) → 0 for β < (1 − θ)/2, and Hamm_p(l̂_N^sa; β, α, θ) → 0 for β > (1 − θ)/2 with N = p^{1−β}.
28 Phase Diagram (clustering; θ = 0.6)
For statistical limits: τ_p = p^{−α}, α > 0; range of β: 0 < β < 1. For IF-PCA: τ_p = ((2r log p)/n)^{1/4}; range of β: 1/2 < β < 1 − θ/2.
[Figure: two phase diagrams in the (β, α) plane, with weaker signals upward and sparser signals rightward. Left: an Impossible region on top and, below it, the regions where Simple Aggregation and Sparse Aggregation (NP-hard) succeed, with boundary curves labeled τ = p^{1/2}/s, τ = n^{−1/2}, τ = s^{−1/2}. Right: the Impossible, Simple Aggreg., NP-hard, Classical PCA, and IF-PCA regions]
29 Two closely related problems
X = lμ′ + Z, Z: iid N(0,1) entries
(sig) Estimate the support of μ (signal recovery):
   Hamm_p(μ̂; β, r, θ) = (pε_p)^{−1} Σ_{j=1}^p P(sgn(μ̂_j) ≠ sgn(μ_j))
(hyp) Test the null H_0^(p) that X = Z against the alternative H_1^(p) that X = lμ′ + Z (global hypothesis testing):
   TestErr_p(T̂; β, r, θ) = P_{H_0^(p)}(Reject H_0^(p)) + P_{H_1^(p)}(Accept H_0^(p))
Arias-Castro & Verzelen (2015), Johnstone & Lu (2001), Berthet & Rigollet (2013)
30 Statistical limits (three problems)
μ(j) ~iid (1 − ε_p) ν_0 + ε_p ν_{τ_p}, n = p^θ, s = #{useful features} ≈ p^{1−β}, signal strength τ = p^{−α}
[Figure: phase boundaries in the (β, α) plane for Clustering, Signal Recovery, and Hypothesis Testing, with boundary curves labeled τ = n^{−1/4}, τ = p^{1/2}/s, τ = n^{−1/2}, s = n, τ = (sn)^{−1/4}, τ = s^{−1/2}]
31 Computable upper bounds (two problems)
Left: feature estimation. Right: (global) hypothesis testing.
μ(j) ~iid (1 − ε_p) ν_0 + ε_p ν_{τ_p}, n = p^θ, s = #{useful features} ≈ p^{1−β}, signal strength τ = p^{−α}
[Figure: two panels comparing statistical and computable limits. Left legend: Clustering, Signal Recovery, Hypothesis Testing (stat), Hypothesis Testing (compu). Right legend: Clustering, Signal Recovery (stat), Hypothesis Testing, Signal Recovery (compu). Boundary curves labeled τ = n^{−1/4}(√p/s), τ = n^{−1/4}, τ = (p/n)^{1/4} s^{−1/2}, s = p^{1/2}]
32 Take-home message
- RW settings are found in many scientific problems and call for new methods/theory
- Proposed IF-PCA-HCT as an easy-to-adapt and fast method, especially for RW settings
- Showed effective results on real data
- Studied phase transitions for IF-PCA, clustering, signal recovery, and global testing
References:
- Jin J, Wang W (2014). Important Features PCA for high dimensional clustering (arXiv)
- Jin J, Ke Z, Wang W (2015). Phase transitions for high dimensional clustering and related problems (arXiv)
Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work
More information3 Comparison with Other Dummy Variable Methods
Stats 300C: Theory of Statistics Spring 2018 Lecture 11 April 25, 2018 Prof. Emmanuel Candès Scribe: Emmanuel Candès, Michael Celentano, Zijun Gao, Shuangning Li 1 Outline Agenda: Knockoffs 1. Introduction
More informationVariable selection with error control: Another look at Stability Selection
Variable selection with error control: Another look at Stability Selection Richard J. Samworth and Rajen D. Shah University of Cambridge RSS Journal Webinar 25 October 2017 Samworth & Shah (Cambridge)
More informationResearch Article Sample Size Calculation for Controlling False Discovery Proportion
Probability and Statistics Volume 2012, Article ID 817948, 13 pages doi:10.1155/2012/817948 Research Article Sample Size Calculation for Controlling False Discovery Proportion Shulian Shang, 1 Qianhe Zhou,
More informationStatistical Applications in Genetics and Molecular Biology
Statistical Applications in Genetics and Molecular Biology Volume 3, Issue 1 2004 Article 14 Multiple Testing. Part II. Step-Down Procedures for Control of the Family-Wise Error Rate Mark J. van der Laan
More informationStat 206: Estimation and testing for a mean vector,
Stat 206: Estimation and testing for a mean vector, Part II James Johndrow 2016-12-03 Comparing components of the mean vector In the last part, we talked about testing the hypothesis H 0 : µ 1 = µ 2 where
More informationRATE ANALYSIS FOR DETECTION OF SPARSE MIXTURES
RATE ANALYSIS FOR DETECTION OF SPARSE MIXTURES Jonathan G. Ligo, George V. Moustakides and Venugopal V. Veeravalli ECE and CSL, University of Illinois at Urbana-Champaign, Urbana, IL 61801 University of
More informationThe Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA
The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:
More informationL. Brown. Statistics Department, Wharton School University of Pennsylvania
Non-parametric Empirical Bayes and Compound Bayes Estimation of Independent Normal Means Joint work with E. Greenshtein L. Brown Statistics Department, Wharton School University of Pennsylvania lbrown@wharton.upenn.edu
More informationStatistical testing. Samantha Kleinberg. October 20, 2009
October 20, 2009 Intro to significance testing Significance testing and bioinformatics Gene expression: Frequently have microarray data for some group of subjects with/without the disease. Want to find
More informationOptimal Rates of Convergence for Estimating the Null Density and Proportion of Non-Null Effects in Large-Scale Multiple Testing
Optimal Rates of Convergence for Estimating the Null Density and Proportion of Non-Null Effects in Large-Scale Multiple Testing T. Tony Cai 1 and Jiashun Jin 2 Abstract An important estimation problem
More informationGeneralized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence
Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence Sunil Kumar Dhar Center for Applied Mathematics and Statistics, Department of Mathematical Sciences, New Jersey
More informationAFT Models and Empirical Likelihood
AFT Models and Empirical Likelihood Mai Zhou Department of Statistics, University of Kentucky Collaborators: Gang Li (UCLA); A. Bathke; M. Kim (Kentucky) Accelerated Failure Time (AFT) models: Y = log(t
More informationMath 152. Rumbos Fall Solutions to Exam #2
Math 152. Rumbos Fall 2009 1 Solutions to Exam #2 1. Define the following terms: (a) Significance level of a hypothesis test. Answer: The significance level, α, of a hypothesis test is the largest probability
More informationChapter 10. Semi-Supervised Learning
Chapter 10. Semi-Supervised Learning Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Outline
More informationThe Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies
The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies Ian Barnett, Rajarshi Mukherjee & Xihong Lin Harvard University ibarnett@hsph.harvard.edu August 5, 2014 Ian Barnett
More informationEstimating empirical null distributions for Chi-squared and Gamma statistics with application to multiple testing in RNA-seq
Estimating empirical null distributions for Chi-squared and Gamma statistics with application to multiple testing in RNA-seq Xing Ren 1, Jianmin Wang 1,2,, Song Liu 1,2, and Jeffrey C. Miecznikowski 1,2,
More informationOn testing the significance of sets of genes
On testing the significance of sets of genes Bradley Efron and Robert Tibshirani August 17, 2006 Abstract This paper discusses the problem of identifying differentially expressed groups of genes from a
More informationLarge-Scale Hypothesis Testing
Chapter 2 Large-Scale Hypothesis Testing Progress in statistics is usually at the mercy of our scientific colleagues, whose data is the nature from which we work. Agricultural experimentation in the early
More informationSparse PCA with applications in finance
Sparse PCA with applications in finance A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon 1 Introduction
More informationTESTING FOR EQUAL DISTRIBUTIONS IN HIGH DIMENSION
TESTING FOR EQUAL DISTRIBUTIONS IN HIGH DIMENSION Gábor J. Székely Bowling Green State University Maria L. Rizzo Ohio University October 30, 2004 Abstract We propose a new nonparametric test for equality
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More informationNon-Negative Factorization for Clustering of Microarray Data
INT J COMPUT COMMUN, ISSN 1841-9836 9(1):16-23, February, 2014. Non-Negative Factorization for Clustering of Microarray Data L. Morgos Lucian Morgos Dept. of Electronics and Telecommunications Faculty
More informationSparse Partial Least Squares Regression with an Application to. Genome Scale Transcription Factor Analysis
Sparse Partial Least Squares Regression with an Application to Genome Scale Transcription Factor Analysis Hyonho Chun Department of Statistics University of Wisconsin Madison, WI 53706 email: chun@stat.wisc.edu
More informationHunting for significance with multiple testing
Hunting for significance with multiple testing Etienne Roquain 1 1 Laboratory LPMA, Université Pierre et Marie Curie (Paris 6), France Séminaire MODAL X, 19 mai 216 Etienne Roquain Hunting for significance
More informationFalse discovery rate and related concepts in multiple comparisons problems, with applications to microarray data
False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data Ståle Nygård Trial Lecture Dec 19, 2008 1 / 35 Lecture outline Motivation for not using
More information