Clustering by Important Features PCA (IF-PCA)

1 Clustering by Important Features PCA (IF-PCA)

Rare/Weak Signals and Phase Diagrams

Jiashun Jin, CMU
David Donoho (Stanford)
Zheng Tracy Ke (Univ. of Chicago)
Wanjie Wang (Univ. of Pennsylvania)

August 12, 2015

2 Clustering subjects using microarray data

Ten datasets (with K classes, n subjects, and p genes each):
1. Brain: Pomeroy (02)
2. Breast Cancer: Wang et al. (05)
3. Colon Cancer: Alon et al. (99)
4. Leukemia: Golub et al. (99)
5. Lung Cancer: Gordon et al. (02)
6. Lung Cancer(2): Bhattacharjee et al. (01)
7. Lymphoma: Alizadeh et al. (00)
8. Prostate Cancer: Singh et al. (02)
9. SRBCT: Khan (01)
10. Su-Cancer: Su et al. (01)

Goal. Predict class labels.

3 Left/right singular vectors

For a data matrix X (rows: samples; columns: features):

Left singular vector: ($n$-dimensional) eigenvector of $XX'$
Right singular vector: ($p$-dimensional) eigenvector of $X'X$

4 Principal Component Analysis (PCA)

Karl Pearson (1857-1936)

Idea: a transformation achieving dimension reduction while keeping the main information.

Data = signal + noise (signal matrix: low-rank)

Microarray data (after standardization): many columns of the signal matrix are 0.

5 Important Features PCA (IF-PCA)

Idea: PCA applied to a small fraction of carefully selected features:
- Rank features by the Kolmogorov-Smirnov (KS) statistic
- Select those with the largest KS-scores
- Apply PCA to the post-selection data matrix (features, to selected features, to PCs of the samples)

Azizyan et al. (2013), Chan and Hall (2010), Fan and Lv (2008)

6 IF-PCA (microarray data)

Feature-wise normalization: $W_i(j) = [X_i(j) - \bar{X}(j)]/\hat\sigma(j)$; write $W = [w_1, \ldots, w_p]$ (columns) $= [W_1, \ldots, W_n]'$ (rows), and let $F_{n,j}(t) = \frac{1}{n}\sum_{i=1}^n 1\{W_i(j) \le t\}$.

1. Rank features with Kolmogorov-Smirnov (KS) scores
   $\psi_{n,j} = \sqrt{n}\,\sup_{-\infty < t < \infty} |F_{n,j}(t) - \Phi(t)|$   ($\Phi$: CDF of N(0,1))
2. Renormalize the KS-scores by (Efron's empirical null)
   $\psi^*_{n,j} = [\psi_{n,j} - (\text{mean of all } p \text{ KS-scores})] / (\text{SD of all } p \text{ KS-scores})$
3. Fix $t > 0$. Let $\hat{U}^{(t)} \in R^{n, K-1}$ contain the first $(K-1)$ left singular vectors of the post-selection data matrix $[w_j : \psi^*_{n,j} \ge t]$
4. Apply the classical k-means algorithm to the rows of $\hat{U}^{(t)}$

IF-PCA-HCT: $t = t_p^{HC}$, the Higher Criticism threshold (described below).
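To make the four steps concrete, here is a minimal Python sketch of this pipeline (an illustration written for these notes, not the authors' code); the function name if_pca is mine, and the threshold t is taken as given, whereas IF-PCA-HCT would set it by the Higher Criticism recipe below.

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def if_pca(X, K, t):
    """IF-PCA with a user-supplied threshold t (illustrative sketch)."""
    n, p = X.shape
    # Feature-wise normalization: W_i(j) = [X_i(j) - mean(j)] / sd(j)
    W = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 1: KS score of each feature against N(0,1):
    # psi_j = sqrt(n) * sup_t |F_{n,j}(t) - Phi(t)|
    Ws = np.sort(W, axis=0)
    i = np.arange(1, n + 1)[:, None]
    Phi = norm.cdf(Ws)
    psi = np.sqrt(n) * np.maximum(np.abs(i / n - Phi),
                                  np.abs((i - 1) / n - Phi)).max(axis=0)

    # Step 2: renormalize by the empirical null (mean/SD of all p scores)
    psi_star = (psi - psi.mean()) / psi.std(ddof=1)

    # Step 3: first (K-1) left singular vectors of the retained columns
    # (assumes at least K-1 features have psi_star >= t)
    U, _, _ = np.linalg.svd(W[:, psi_star >= t], full_matrices=False)

    # Step 4: classical k-means on the rows of U^(t)
    return KMeans(n_clusters=K, n_init=10).fit_predict(U[:, :K - 1])
```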

7 The blessing of feature selection

[Figure: x-axis: 1, 2, ..., n; y-axis: entries of $\hat{U}^{HC}$ (i.e., $\hat{U}^{(t)}$ for $t = t_p^{HC}$). Left: plot of $\hat{U}^{HC}$ (Lung Cancer; K = 2, so $\hat{U}^{HC} \in R^n$ is a vector). Right: the counterpart of $\hat{U}^{HC}$ without feature selection. Class labels: ADCA, MPM.]

8 Efron's null correction (Lung Cancer)

Efron's theoretical null: the density of $\psi_{n,j}$ when $X_i(j) \stackrel{iid}{\sim} N(u_j, \sigma_j^2)$ (it does not depend on $(j, u_j, \sigma_j)$ and is easy to simulate).

[Figure: the theoretical null is a bad fit to the $\psi_{n,j}$ (top) but a nice fit to the renormalized $\psi^*_{n,j}$ (bottom).]

9 How to set the threshold t?

- CV: not directly implementable (class labels unknown)
- FDR: needs tuning and targets Rare/Strong signals [Benjamini and Hochberg (1995), Efron (2010)]

[Figure: x-axis: t (threshold); curves: #{selected features}, feature-FDR, errors.]

10 Tukey's Higher Criticism

John W. Tukey (1915-2000)

11 Higher Criticism (HC)

Review papers: Donoho and Jin (2015), Jin and Ke (2015)

- Proposed by Donoho and Jin (2004) for sparse signal detection
- Found useful in GWAS, DNA Copy Number Variants (CNV), Cosmology and Astronomy, and disease surveillance
- Extended in many directions: Innovated HC (for colored noise), signal detection in a regression model, HC when the noise distribution is unknown/non-Gaussian, estimating the proportion of non-null effects
- Threshold choice in classification [Donoho and Jin (2008, 2009), Fan et al. (2013), Jin (2009)]

12 Threshold choice by HC for IF-PCA

Jin and Wang (2015), Jin, Ke and Wang (2015)

$HC_p(t) = \dfrac{\sqrt{p}\,[\hat{G}_p(t) - \bar{F}_0(t)]}{\sqrt{\max\{\sqrt{n}\,[\hat{G}_p(t) - \bar{F}_0(t)],\,0\} + \hat{G}_p(t)}}$  (new)

$t_p^{HC} = \mathrm{argmax}_t\{HC_p(t)\}$

$\bar{F}_0$: survival function of Efron's theoretical null
$\hat{G}_p$: empirical survival function of the renormalized KS-scores $\psi^*_{n,j}$

13 HCT: implementation

- Compute P-values: $\pi_j = \bar{F}_0(\psi^*_{n,j})$, $1 \le j \le p$
- Sort P-values: $\pi_{(1)} < \pi_{(2)} < \ldots < \pi_{(p)}$
- Define the HC functional by
  $HC_{p,k} = \dfrac{\sqrt{p}\,(k/p - \pi_{(k)})}{\sqrt{k/p + \max\{\sqrt{n}\,(k/p - \pi_{(k)}),\,0\}}}$
- Let $\hat{k} = \mathrm{argmax}_{\{1 \le k \le p/2,\ \pi_{(k)} > \log(p)/p\}}\{HC_{p,k}\}$. The HC threshold $t_p^{HC}$ is the $\hat{k}$-th largest KS-score.

Note: with $\psi_{(k)}$ denoting the sorted KS-scores,
$HC_p(t)\big|_{t = \psi_{(k)}} = \dfrac{\sqrt{p}\,[k/p - \pi_{(k)}]}{\sqrt{k/p + \sqrt{n}\,(k/p - \pi_{(k)})_+}}$.
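A minimal Python sketch of these steps (illustrative; hc_threshold is my name, not from the talk), taking the renormalized KS-scores and their P-values as inputs:

```python
import numpy as np

def hc_threshold(psi_star, pvals, n):
    """Return t_p^HC given renormalized KS scores and p-values pi_j."""
    p = len(pvals)
    pi = np.sort(pvals)                       # pi_(1) <= ... <= pi_(p)
    k = np.arange(1, p + 1)
    diff = k / p - pi
    hc = np.sqrt(p) * diff / np.sqrt(k / p + np.maximum(np.sqrt(n) * diff, 0.0))
    # maximize over 1 <= k <= p/2 with pi_(k) > log(p)/p
    ok = (k <= p / 2) & (pi > np.log(p) / p)
    k_hat = k[ok][np.argmax(hc[ok])]
    # the HC threshold is the k_hat-th largest KS score
    return np.sort(psi_star)[::-1][k_hat - 1]
```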

14 Illustration

[Figure: three panels showing the ordered KS-scores, the ordered P-values, and the HC objective function.]

15 Illustration, II (Lung Cancer)

[Figure: x-axis: # of selected features; y-axis: error rates by IF-PCA; the HC threshold is marked.]

16 Comparison

r = (error rate of IF-PCA-HCT) / (minimum error rate of all other methods)

[Table: error rates of kmeans, kmeans++, hierarchical clustering, SpecGem*, and IF-PCA-HCT, together with r, for each dataset: Brain (.09), Breast Cancer (.05), Colon Cancer (.07), Leukemia (.09), Lung Cancer (.09), Lung Cancer(2) (.00), Lymphoma (.13), Prostate Cancer (.01), SRBCT (.06), SuCancer (.05).]

*: SpecGem is classical PCA [Lee et al. (2010)]; kmeans++: Arthur and Vassilvitskii (2007)

17 Sparse PCA and variants of IF-PCA

[Table: clustering error rates of Clu-sPCA, IF-kmeans, IF-Hier, and IF-PCA-HCT on the ten datasets (Brain, Breast Cancer, Colon Cancer, Leukemia, Lung Cancer, Lung Cancer(2), Lymphoma, Prostate Cancer, SRBCT, SuCancer).]

Clu-sPCA: project to the estimated feature space (sparse PCA) and then cluster; it is unclear how to set $\lambda$ (the ideal $\lambda$ is used above); clustering $\ne$ feature estimation.

Zou et al. (2006), Witten and Tibshirani (2010)

18 Summary (so far)

IF-PCA-HCT consists of three simple steps:
- Marginal screening (KS)
- Threshold choice (empirical null + HCT)
- Post-selection PCA

It is tuning-free, fast, and yet effective, and it is easily extendable and adaptable.

Next: theoretical reasons for HCT in RW settings.

19 RW viewpoint

In many types of Big Data, the signals of interest are not only sparse (rare) but also individually weak, and we have no prior knowledge of where these RW signals are.

- Large p, small n (e.g., genetics and genomics)
- Technical limitation (e.g., astronomy)
- Early detection (e.g., disease surveillance)

[Cartoon: signal strength vs. n ($ or labor); clustering: $\alpha = 6$; classification: $\alpha = 2$.]

20 RW model and Phase Diagram

Many methods and theories target Rare/Strong signals ("if conditions XX hold and all signals are sufficiently strong...").

Our proposal:
- RW model: a parsimonious model capturing the main factors (sparsity and signal strength)
- Phase Diagram: provides sharp results characterizing when the desired goal is impossible or possible to achieve; an approach to distinguishing non-optimal and optimal procedures

21 Sparse signal detection (global testing)

Donoho and Jin (2004), Ingster (1997, 1999), Jin (2004)

$X = \mu + Z$, $X \in R^p$, $Z \sim N(0, I_p)$

$H_0^{(p)}: \mu = 0$  vs.  $H_1^{(p)}: \mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$

Calibration and subtlety of the problem:
$\epsilon_p = p^{-\beta}$, $\tau_p = \sqrt{2r\log(p)}$, $0 < \beta < 1$, $r > 0$

- Rare: only a small fraction of the means are nonzero
- Weak: signals are only moderately strong

22 Phase Diagram (signal detection/recovery)

Standard phase function: $r = \rho(\beta)$, where
$\rho(\beta) = \begin{cases} 0, & 0 < \beta \le 1/2 \\ \beta - 1/2, & 1/2 < \beta \le 3/4 \\ (1 - \sqrt{1 - \beta})^2, & 3/4 < \beta < 1 \end{cases}$

[Figure: phase diagrams in the $(\beta, r)$ plane. Left: detection, with Detectable and Undetectable regions. Right: recovery, with Undetectable, Detectable, Estimable (Almost Full Recovery), and Exact Recovery regions.]
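As a small helper, a direct transcription of the piecewise phase function above into Python (the name rho is mine):

```python
import numpy as np

def rho(beta):
    """Standard phase function rho(beta) for 0 < beta < 1."""
    if beta <= 0.5:
        return 0.0
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - np.sqrt(1.0 - beta)) ** 2
```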

23 RW settings: why we care?

- Growing awareness of irreproducibility
- Many methods and theories focus on Rare/Strong signals and do not cope well with RW settings:
  - FDR-controlling methods (shown before)
  - L0/L1-penalization methods
  - Minimax framework

Ioannidis (2005), "Why most published research findings are false"; Donoho and Stark (1989); Chen et al. (1995); Tibshirani (1996); Abramovich et al. (2006); Jin et al. (2015); Ke et al. (2015)

24 A two-class model for clustering

Jin, Ke and Wang (2015)

$X_i = \ell_i \mu + Z_i$, $Z_i \stackrel{iid}{\sim} N(0, I_p)$, $i = 1, \ldots, n$  ($p \gg n$)

- $\ell_i = \pm 1$: unknown class labels (the main interest)
- $\mu \in R^p$: feature vector
- RW: only a small fraction of the entries of $\mu$ are nonzero, and each contributes weakly to the clustering

Goal. Theoretical insight on HCT (and more).
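The model is easy to simulate; a minimal sketch (the values of n, p, eps, and tau below are illustrative choices of mine, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10_000
eps, tau = 0.01, 0.5                        # sparsity and strength (illustrative)

l = rng.choice([-1, 1], size=n)             # unknown class labels l_i
mu = tau * rng.binomial(1, eps, size=p)     # rare/weak feature vector mu
X = np.outer(l, mu) + rng.standard_normal((n, p))   # X_i = l_i * mu + Z_i
```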

25 IF-PCA simplified to the two-class model

Step | Microarray | Two-class model
pre-normalization | yes | skipped
feature-wise screening | Kolmogorov-Smirnov: $\psi_j = \sup_t |F_{n,j}(t) - \Phi(t)|$ | chi-square: $\psi_j = (\|x_j\|^2 - n)/\sqrt{2n}$
re-normalization | Efron's null correction | skipped
threshold choice | HCT | same
post-selection PCA | same | same

26 Asymptotic Rare/Weak (ARW) model

$X = \ell\mu' + Z \in R^{n,p}$, $Z$: iid N(0,1) entries

- $\ell_i = \pm 1$ with equal probability
- $\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$
- Large p, small n: $n = p^\theta$, $0 < \theta < 1$
- RW signal: $\epsilon_p = p^{-\beta}$, $\tau_p = \sqrt[4]{(4/n)\,r\log(p)}$

Note: $\psi_j = (\|x_j\|^2 - n)/\sqrt{2n} \approx N(0, 1)$ or $N(\sqrt{2r\log(p)}, 1)$.
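The chi-square screening statistic is one line of code; a sketch continuing the simulation above (chi_square_scores is my name):

```python
import numpy as np

def chi_square_scores(X):
    """psi_j = (||x_j||^2 - n) / sqrt(2n), one score per feature column."""
    n = X.shape[0]
    return ((X ** 2).sum(axis=0) - n) / np.sqrt(2 * n)

# On the simulated X above, null columns give scores roughly N(0, 1):
# psi = chi_square_scores(X)
# print(psi[mu == 0].mean(), psi[mu == 0].std())
```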

27 Three functions: HC, idealhc, SNR Given X = lµ + Z, we retain feature j if ψ j t F0 (t): survival function of normalized χ 2 2n (0) Ĝ p (t): empirical survival function ψ j Û (t) : first left singular vector of [x j : ψ j t] 1. HC(t) : p[ Ĝ p (t) F 0 (t)]/[ n[ĝp(t) F 0 (t)]+ĝp(t)] idealhc(t) : HC(t) with Ĝp(t) replaced by Ḡp(t) 3. SNR(t): Û (t) SNR(t) l + z + rem, z N(0, I n ) Jiashun Jin, CMU Clustering by Important Features PCA (IF-PCA) 27 / 42

28 Three thresholds: $t_p^{HC} \approx t_p^{idealHC} \approx t_p^{ideal}$

[Figure: HC, idealHC, and SNR as functions of the threshold t, with the HCT marked.]

$SNR(t) = \dfrac{E[\|\mu^{(t)}\|^2]}{\sqrt{E[\|\mu^{(t)}\|^2] + E[|\hat{S}(t)|]/n}}$,  compare  $idealHC(t) = \dfrac{\sqrt{p}\,(\bar{G}_p(t) - \bar{F}_0(t))}{\sqrt{\sqrt{n}\,[\bar{G}_p(t) - \bar{F}_0(t)]_+ + \bar{G}_p(t)}}$

$\mu^{(t)}$: $\mu$ restricted to $\hat{S}(t)$ (the index set of all retained features), with

$E[\|\mu^{(t)}\|^2] \approx \tau_p^2\, p\,[\bar{G}_p(t) - \bar{F}_0(t)]$,
$E[\|\mu^{(t)}\|^2] + E[|\hat{S}(t)|]/n \approx p\,\{\tau_p^2\,[\bar{G}_p(t) - \bar{F}_0(t)] + \bar{G}_p(t)/n\}$

29 Impossibility

Let $\rho(\beta)$ be the standard phase function (defined earlier). Define
$\rho_\theta(\beta) = (1 - \theta)\,\rho(\beta/\theta)$, recalling $n = p^\theta$.

Consider IF-PCA with threshold $t > 0$: let $\hat{U}^{(t)} \in R^n$ be the first left singular vector of the post-selection data matrix $[x_j : \psi_j \ge t]$. Consider ARW with
$n = p^\theta$, $\epsilon_p = p^{-\beta}$, $\tau_p = \sqrt[4]{(4/n)\,r\log(p)}$.

If $r < \rho_\theta(\beta)$, then for any threshold t,
$|\mathrm{Cos}(\hat{U}^{(t)}, \ell)| \le c_0 < 1$  (IF-PCA partially fails)

30 Possibility

If $r > \rho_\theta(\beta)$, then:

- For some t, $\mathrm{Cos}(\hat{U}^{(t)}, \ell) \to 1$
- There is a non-stochastic function $SNR(t)$ such that for some t, $SNR(t) \gg 1$ and $\hat{U}^{(t)} \approx SNR(t)\,\ell + z + rem$, $z \sim N(0, I_n)$
- HCT yields the right threshold choice: $t_p^{HC}/t_p^{ideal} \to 1$ in probability, where $t_p^{ideal} = \mathrm{argmax}_t\{SNR(t)\}$
- IF-PCA-HCT yields successful clustering: $\mathrm{Hamm}_p(\hat{\ell}^{HCT}; \beta, r, \theta) \to 0$, where
  $\mathrm{Hamm}_p(\hat{\ell}; \beta, r, \theta) = (n/2)^{-1} \min_{b = \pm\mathrm{sgn}(\hat{\ell})} \big\{\sum_{i=1}^n P\big(b_i \ne \mathrm{sgn}(\ell_i)\big)\big\}$
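An empirical counterpart of this Hamming error is easy to compute; a small sketch (hamming_error is my name), minimizing over the global sign flip:

```python
import numpy as np

def hamming_error(l_hat, l):
    """Fraction of misclustered labels (relative to n/2), up to a sign flip."""
    errs = lambda b: np.sum(np.sign(b * l_hat) != np.sign(l))
    return min(errs(1.0), errs(-1.0)) / (len(l) / 2)
```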

31 Phase Diagram (IF-PCA)

#(useful features) $\approx p^{1-\beta}$, $\tau_p = \sqrt[4]{(4/n)\,r\log(p)}$, $n = p^\theta$ ($\theta = .6$)

[Figure: phase diagrams in the $(\beta, r)$ plane showing Success and Failure regions.]

32 Summary (so far)

- Big Data are here, but the signals are RW
- RW model and Phase Diagrams: a theoretical framework specifically for RW settings, allowing for insights that would otherwise be overlooked
- HC provides the right threshold choice
- Limitation: IF-PCA only works at a specific signal strength and in a limited sparsity range

We want a more complete story: statistical limits for clustering under ARW.

33 ARW for statistical limits (a slight change)

Model: $X = \ell\mu' + Z$; $\ell_i = \pm 1$ with equal probability; $\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$; $n = p^\theta$, $\epsilon_p = p^{-\beta}$.

Calibration | For statistical limits | For IF-PCA
$\tau_p$ | $\tau_p = p^{-\alpha}$, $\alpha > 0$ | $\tau_p = \sqrt[4]{(4/n)\,r\log(p)}$
$s := \#\{\text{signals}\}$ | $1 \le s \le p$ | $\sqrt{n} \le s \le \sqrt{p}$

[Figure: the most interesting range for IF-PCA in the plane of sparsity (sparser to the right) vs. signal strength (weaker upward), marked by the lines $\tau = n^{-1/4}$, $s = p^{1/2}$, and $s = n^{1/2}$.]

34 Aggregation methods

Simple Aggregation: $\hat{\ell}^{sa} = \mathrm{sgn}(\bar{x})$

Sparse Aggregation: $\hat{\ell}^{sa}_N = \mathrm{sgn}(\bar{x}^{\hat{S}})$, where $\hat{S} = \hat{S}(N) = \mathrm{argmax}_{\{S: |S| = N\}}\{\|\bar{x}^S\|_1\}$

[Figure: cartoons of the data matrix (samples by features, columns $x_1, x_2, \ldots, x_p$) aggregated over all columns ($\bar{x}$) and over a selected subset of columns ($\bar{x}^S$).]
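Simple aggregation is one line; a sketch (simple_aggregation is my name; exact sparse aggregation searches over all size-N subsets and is NP-hard, so it is not attempted here):

```python
import numpy as np

def simple_aggregation(X):
    """l_hat = sgn(x_bar): sign of the average of all feature columns."""
    return np.sign(X.mean(axis=1))           # one label per subject (row)
```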

35 Comparison of methods

Method | Simple Agg. ($\hat{\ell}^{sa}$) | PCA ($\hat{\ell}^{if}$) | Sparse Agg. ($\hat{\ell}^{sa}_N$, $N \ll p$) | IF-PCA ($\hat{\ell}^{if}_t$, $t > 0$)
Sparsity | dense | dense | sparse | sparse
Strength | weak | weak | strong* | strong
F. Selection | No | No | Yes | Yes
Complexity | Poly. | Poly. | NP-hard | Poly.
Tuning | No | No | Yes | Yes**

Notation. $\hat{\ell}^{if}_t$: IF-PCA adapted to the two-class model; $\hat{\ell}^{if}$: classical PCA (a special case).
*: signals are comparably stronger but still weak. **: a tuning-free version exists.

36 Statistical limits (clustering)

$\mathrm{Hamm}_p(\hat{\ell}; \beta, \alpha, \theta) = (n/2)^{-1} \min_{b = \pm\mathrm{sgn}(\hat{\ell})} \big\{\sum_{i=1}^n P\big(b_i \ne \mathrm{sgn}(\ell_i)\big)\big\}$

$\eta_\theta^{clu}(\beta) = \begin{cases} (1 - 2\beta)/2, & 0 < \beta < (1 - \theta)/2 \\ \theta/2, & (1 - \theta)/2 < \beta < 1 - \theta \\ (1 - \beta)/2, & \beta > 1 - \theta \end{cases}$

Consider ARW where $n = p^\theta$, $\epsilon_p = p^{-\beta}$, $\tau_p = p^{-\alpha}$.

- When $\alpha > \eta_\theta^{clu}(\beta)$: $\mathrm{Hamm}_p(\hat{\ell}; \beta, \alpha, \theta) \gtrsim 1$
- When $\alpha < \eta_\theta^{clu}(\beta)$:
  $\mathrm{Hamm}_p(\hat{\ell}^{sa}; \beta, \alpha, \theta) \to 0$ for $\beta < (1 - \theta)/2$;
  $\mathrm{Hamm}_p(\hat{\ell}^{sa}_N; \beta, \alpha, \theta) \to 0$ for $\beta > (1 - \theta)/2$  ($N = p^{1-\beta}$)
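The boundary transcribes directly into code (eta_clu is my name):

```python
def eta_clu(beta, theta):
    """Clustering phase boundary eta_theta^clu(beta), 0 < beta < 1."""
    if beta < (1 - theta) / 2:
        return (1 - 2 * beta) / 2
    if beta < 1 - theta:
        return theta / 2
    return (1 - beta) / 2
```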

37 Phase Diagram (clustering; $\theta = 0.6$)

[Figure: two phase diagrams in the $(\beta, \alpha)$ plane, with arrows toward weaker signals and sparser settings. Left (statistical limits, $\tau_p = p^{-\alpha}$, $0 < \beta < 1$): Impossible, Simple Aggregation, and Sparse Aggregation (NP-hard) regions, with boundaries $\tau = p^{1/2}/s$, $\tau = n^{-1/2}$, and $\tau = s^{-1/2}$. Right (IF-PCA, $\tau_p = \sqrt[4]{(4/n)\,r\log(p)}$, $1/2 < \beta < 1 - \theta/2$): Impossible, Simple Aggregation, NP-hard, Classical PCA, and IF-PCA regions.]

38 Two closely related problems

$X = \ell\mu' + Z$, $Z$: iid N(0,1) entries

(sig). Estimate the support of $\mu$ (signal recovery):
$\mathrm{Hamm}_p(\hat{\mu}; \beta, r, \theta) = (p\epsilon_p)^{-1} \sum_{j=1}^p P\big(\mathrm{sgn}(\hat{\mu}_j) \ne \mathrm{sgn}(\mu_j)\big)$

(hyp). Test $H_0^{(p)}$: $X = Z$ against the alternative $H_1^{(p)}$: $X = \ell\mu' + Z$ (global hypothesis testing):
$\mathrm{TestErr}_p(\hat{T}; \beta, r, \theta) = P_{H_0^{(p)}}(\text{Reject } H_0^{(p)}) + P_{H_1^{(p)}}(\text{Accept } H_0^{(p)})$

Arias-Castro and Verzelen (2015), Johnstone and Lu (2001), Berthet and Rigollet (2013)

39 Statistical limits (three problems)

$\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$, $n = p^\theta$, $s \approx \#\{\text{useful features}\} \approx p^{1-\beta}$, signal strength $\tau_p = p^{-\alpha}$

[Figure: phase boundaries in the $(\beta, \alpha)$ plane for Clustering, Signal Recovery, and Hypothesis Testing, annotated with the curves $\tau = n^{-1/4}$, $\tau = p^{1/2}/s$, $\tau = n^{-1/2}$, $\tau = (sn)^{-1/4}$, $\tau = s^{-1/2}$, and the line $s = n$.]

40 Computable upper bounds (two problems)

Left: signal recovery. Right: (global) hypothesis testing.

$\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$, $n = p^\theta$, $s \approx \#\{\text{useful features}\} \approx p^{1-\beta}$, signal strength $\tau_p = p^{-\alpha}$

[Figure: phase boundaries in the $(\beta, \alpha)$ plane comparing statistical and computable limits, alongside the clustering boundary. Left: Clustering, Signal Recovery (stat), Signal Recovery (compu), Hypothesis Testing. Right: Clustering, Signal Recovery, Hypothesis Testing (stat), Hypothesis Testing (compu). Annotated with curves such as $\tau = n^{-1/4}(\sqrt{p}/s)$, $\tau = n^{-1/2}$, $\tau = (p/n)^{1/4} s^{-1/2}$, $\tau = n^{-1/4}$, and the line $s = p^{1/2}$.]

41 Take-home message

- Big Data are here, but your signals are RW; we need to do many things very differently
- RW model and Phase Diagrams are a theoretical framework specifically designed for RW settings, exposing many insights we do not see with more traditional frameworks
- IF-PCA and Higher Criticism are simple and easy-to-adapt methods that are provably effective for analyzing real data, especially in Rare/Weak settings

42 Acknowledgements

Many thanks to David Donoho, Stephen Fienberg, Peter Hall, Jon Wellner, and many others for inspiration, collaboration, and encouragement!

Main references.
- Jin, J. and Wang, W. (2015). Important Features PCA for high dimensional clustering. arXiv.
- Jin, J., Ke, Z. and Wang, W. (2015). Phase transitions for high dimensional clustering and related problems. arXiv.

Two review papers on HC.
- Donoho, D. and Jin, J. (2015). Higher Criticism for large-scale inference, especially for rare and weak effects. Statistical Science, 30(1).
- Jin, J. and Ke, Z. (2015). Rare and weak effects in large-scale inference: methods and phase diagrams. arXiv.
