Clustering by Important Features PCA (IF-PCA)


Slide 1: Clustering by Important Features PCA (IF-PCA)

Rare/Weak Signals and Phase Diagrams

Jiashun Jin (CMU), Zheng Tracy Ke (Univ. of Chicago), Wanjie Wang (Univ. of Pennsylvania)

August 5, 2015

Slide 2: Clustering subjects using microarray data

For each dataset, K = # of classes, n = # of subjects, p = # of genes.

#  | Data Name       | Source
1  | Brain           | Pomeroy (02)
2  | Breast Cancer   | Wang et al. (05)
3  | Colon Cancer    | Alon et al. (99)
4  | Leukemia        | Golub et al. (99)
5  | Lung Cancer     | Gordon et al. (02)
6  | Lung Cancer(2)  | Bhattacharjee et al. (01)
7  | Lymphoma        | Alizadeh et al. (00)
8  | Prostate Cancer | Singh et al. (02)
9  | SRBCT           | Kahn (01)
10 | Su-Cancer       | Su et al. (01)

Goal: predict the class labels.

Slide 3: Principal Component Analysis (PCA)

Karl Pearson (1857-1936)

Idea: a transformation for dimension reduction that keeps the main information.

Data = signal + noise (the signal matrix is low-rank).

Microarray data: many columns of the signal matrix are 0 after standardization.
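The signal-plus-noise picture can be made concrete in a few lines of numpy. The sketch below is illustrative only (the sizes, the signal level 0.8, and the choice of 50 signal features are assumptions, not the slides' data): a rank-one signal matrix plus Gaussian noise, with the top singular vector recovering the two-group split.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2000

# Rank-one signal: two classes with mean vectors +mu and -mu; only a few
# features carry signal, mirroring the sparse columns on the slide.
mu = np.zeros(p)
mu[:50] = 0.8
labels = rng.choice([-1, 1], size=n)
X = np.outer(labels, mu) + rng.standard_normal((n, p))   # data = signal + noise

# PCA: the top left singular vector of the centered data tracks the class split.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.mean(np.sign(U[:, 0]) == labels))   # close to 0 or 1 (sign is arbitrary)
```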

Slide 4: Important Features PCA (IF-PCA)

Idea: apply PCA to a small fraction of carefully selected features:
- Rank the features by the Kolmogorov-Smirnov statistic
- Select those with the largest KS scores
- Apply PCA to the post-selection data matrix

[Diagram: the samples-by-features data matrix is reduced to the selected features, then to the PCs.]

Azizyan et al. (2013), Chan and Hall (2010), Fan and Lv (2008)

Slide 5: IF-PCA and IF-PCA-HCT (microarray data)

Feature-wise normalization: $W_i(j) = [X_i(j) - \bar{X}(j)]/\hat\sigma(j)$; write $W = [w_1, \ldots, w_p] = [W_1, \ldots, W_n]'$, and let $F_{n,j}(t) = \frac{1}{n}\sum_{i=1}^n 1\{W_i(j) \le t\}$ be the empirical CDF of feature $j$.

1. Rank the features by their Kolmogorov-Smirnov (KS) scores
   $\psi_{n,j} = \sqrt{n}\,\sup_{-\infty < t < \infty} |F_{n,j}(t) - \Phi(t)|$   ($\Phi$: CDF of $N(0,1)$).

2. Renormalize the KS scores (Efron's empirical null):
   $\psi^*_{n,j} = \dfrac{\psi_{n,j} - \text{mean of all } p \text{ KS scores}}{\text{SD of all } p \text{ KS scores}}$.

3. Fix a threshold $t > 0$. Let $\hat{U}^{(t)} \in R^{n,K-1}$ contain the first $(K-1)$ left singular vectors of the post-selection matrix $[w_j : \psi^*_{n,j} \ge t]$. Our proposal: $t = t_p^{HC}$, the Higher Criticism threshold (TBA).

4. Apply the classical k-means algorithm to $\hat{U}_{HC} = \hat{U}^{(t)}\big|_{t = t_p^{HC}}$.
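Read as pseudocode, the four steps map onto a short Python sketch. The function name if_pca and the defaults are illustrative, not the authors' implementation; the threshold t is taken as given here, with the HC-based choice sketched after slide 11.

```python
import numpy as np
from scipy.stats import kstest
from sklearn.cluster import KMeans

def if_pca(X, K, t):
    """A sketch of IF-PCA on an n-by-p matrix X (rows = subjects).

    t is the threshold on the renormalized KS scores; the HC-based choice
    t_p^HC is sketched separately after slide 11.
    """
    n, p = X.shape

    # Feature-wise normalization: W_i(j) = (X_i(j) - mean_j) / sd_j
    W = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 1: KS score of each feature against N(0,1): psi = sqrt(n)*sup|F_n - Phi|
    psi = np.sqrt(n) * np.array([kstest(W[:, j], "norm").statistic
                                 for j in range(p)])

    # Step 2: renormalize by Efron's empirical null (mean and SD of all p scores)
    psi_star = (psi - psi.mean()) / psi.std(ddof=1)

    # Step 3: PCA on the selected columns: first K-1 left singular vectors
    U, _, _ = np.linalg.svd(W[:, psi_star >= t], full_matrices=False)

    # Step 4: classical k-means on the post-selection singular vectors
    return KMeans(n_clusters=K, n_init=10).fit_predict(U[:, :K - 1])
```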

Slide 6: The blessing of feature selection

Left: plot of $\hat{U}_{HC}$ (Lung Cancer; $K = 2$, so $\hat{U}_{HC} \in R^n$ is a vector); x-axis: $1, 2, \ldots, n$; y-axis: the entries of $\hat{U}_{HC}$.
Right: the counterpart of $\hat{U}_{HC}$ without feature selection.
[Both panels mark the two classes, ADCA and MPM.]

Slide 7: Efron's null correction (Lung Cancer)

Efron's theoretical null: the density of $\psi_{n,j}$ when $X_i(j) \stackrel{iid}{\sim} N(u_j, \sigma_j^2)$. It does not depend on $(j, u_j, \sigma_j)$ and is easy to simulate.

The theoretical null (red) is a bad fit to $\psi_{n,j}$ (top) but a nice fit to $\psi^*_{n,j}$ (bottom).
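Because the feature-wise normalization removes $(u_j, \sigma_j)$, the theoretical null can be simulated once and reused for every feature. A minimal Monte Carlo sketch (the function name and replication count are my choices):

```python
import numpy as np
from scipy.stats import kstest

def simulate_theoretical_null(n, n_rep=10000, seed=0):
    """Monte Carlo draws of the KS score under the null X_i(j) iid N(u_j, sigma_j^2).

    After a column is standardized by its own sample mean and SD, the law of
    the score no longer depends on (u_j, sigma_j), so we may sample with
    u = 0, sigma = 1 and reuse the draws for every feature j.
    """
    rng = np.random.default_rng(seed)
    psi = np.empty(n_rep)
    for r in range(n_rep):
        x = rng.standard_normal(n)
        w = (x - x.mean()) / x.std(ddof=1)          # same normalization as the data
        psi[r] = np.sqrt(n) * kstest(w, "norm").statistic
    return psi   # its histogram is the theoretical-null curve on the slide
```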

Slide 8: How to set the threshold t?

- CV: not directly implementable (the class labels are unknown).
- FDR: needs tuning and targets Rare/Strong signals [Benjamini and Hochberg (1995), Efron (2010)].

Our proposal: set t by Higher Criticism (RW settings).

t (threshold) -> #{selected features} -> feature-FDR -> errors

Slide 9: Tukey's Higher Criticism

John W. Tukey (1915-2000)

Slide 10: Higher Criticism (HC)

- Review papers: Donoho and Jin (2015), Jin and Ke (2015)
- First proposed by Donoho and Jin (2004) for detecting sparse signals
- Also useful for setting thresholds in cancer classification [Donoho and Jin (2008, 2009), Jin (2009)]
- Found useful in GWAS, DNA Copy Number Variants (CNV), cosmology and astronomy, and disease surveillance
- Extended in many different directions

Slide 11: Higher Criticism Threshold (HCT)

- Obtain P-values: $\pi_j = 1 - F_0(\psi_{n,j})$, $1 \le j \le p$ [$F_0$: CDF of Efron's theoretical null]
- Sort the P-values: $\pi_{(1)} < \pi_{(2)} < \ldots < \pi_{(p)}$
- Define the HC functional by
  $HC_{p,k} = \dfrac{\sqrt{p}\,(k/p - \pi_{(k)})}{\sqrt{k/p + \max\{\sqrt{n}\,(k/p - \pi_{(k)}),\ 0\}}}$
- Let $\hat{k} = \mathrm{argmax}_{\{1 \le k \le p/2,\ \pi_{(k)} > \log(p)/p\}}\{HC_{p,k}\}$. The HC threshold $t_p^{HC}$ is the $\hat{k}$-th largest KS score.
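A direct transcription of these steps, under the assumption that the HC functional reconstructed above is the intended one. The function name is mine, and norm.cdf stands in for $F_0$, which in practice is the CDF of the simulated theoretical null; plugging the returned threshold into the if_pca sketch above completes IF-PCA-HCT.

```python
import numpy as np
from scipy.stats import norm

def higher_criticism_threshold(psi_star, n):
    """Sketch of the HCT rule: return the KS-score threshold t_p^HC.

    psi_star: the p renormalized KS scores. The standard normal CDF below is
    only a stand-in for F_0, the CDF of Efron's theoretical null.
    """
    p = len(psi_star)
    pvals = np.sort(1.0 - norm.cdf(psi_star))        # pi_(1) <= ... <= pi_(p)
    k = np.arange(1, p + 1)
    hc = np.sqrt(p) * (k / p - pvals) / np.sqrt(
        k / p + np.maximum(np.sqrt(n) * (k / p - pvals), 0.0))
    ok = (k <= p / 2) & (pvals > np.log(p) / p)      # range of the argmax
    k_hat = k[ok][np.argmax(hc[ok])]
    return np.sort(psi_star)[::-1][k_hat - 1]        # the k_hat-th largest score
```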

Slide 12: Illustration

[Three panels: the ordered KS scores; the ordered P-values; the HC objective function.]

Slide 13: Illustration, II (Lung Cancer)

[Plot with the HC threshold marked; x-axis: # of selected features; y-axis: error rates of IF-PCA.]

Slide 14: Comparison

$r = \dfrac{\text{error rate of IF-PCA-HCT}}{\text{minimum error rate of all other methods}}$

[Table: for each of the ten datasets (with K classes), the error rates of kmeans, kmeans++, Hier, SpecGem, and IF-PCA-HCT, and the ratio r; parenthesized values: 1 Brain (.09), 2 Breast Cancer (.05), 3 Colon Cancer (.07), 4 Leukemia (.09), 5 Lung Cancer (.09), 6 Lung Cancer(2) (.00), 7 Lymphoma (.13), 8 Prostate Cancer (.01), 9 SRBCT (.06), 10 SuCancer (.05).]

Arthur and Vassilvitskii (2007), Hastie et al. (2009), Lee et al. (2010)

Slide 15: Sparse PCA and variants of IF-PCA

[Diagram: the samples-by-features data matrix is reduced to the selected features, then to the PCs.]

[Table: error rates of Clu-sPCA, IF-kmeans, IF-Hier, and IF-PCA-HCT on the ten datasets.]

Clu-sPCA: project to the estimated feature space (sparse PCA) and then cluster.
- Unclear how to set $\lambda$ (the ideal $\lambda$ is used above)
- Clustering $\ne$ feature estimation

Zou et al. (2006), Witten and Tibshirani (2010)

Slide 16: Summary (so far)

IF-PCA-HCT consists of three simple steps:
- Marginal screening (KS)
- Threshold choice (empirical null + HCT)
- Post-selection PCA

It is tuning-free, fast, and yet effective; it is also easily extendable and adaptable.

Remaining problems:
- Why HCT?
- Statistical limits for clustering and related problems

Slide 17: The RW viewpoint

In many types of Big Data, the signals of interest are not only sparse (rare) but also individually weak, and we have no prior knowledge of where these RW signals are.

- Large p, small n (e.g., genetics and genomics)
- (Signal strength) $\asymp n^{-1/\alpha}$, and increasing $n$ costs money or manpower; clustering: $\alpha = 4$ or $6$, classification: $\alpha = 2$
- Technical limitations (e.g., astronomy)
- Early detection (e.g., disease surveillance)

Slide 18: The RW model and the Phase Diagram

Many scientific findings are not reproducible [Ioannidis (2005), "Why most published research findings are false"], and many methods and theories target Rare/Strong signals ["if conditions XX hold and all signals are sufficiently strong..."].

Our proposal:
- RW model: a parsimonious model capturing the main factors (sparsity and signal strength)
- Phase Diagram: provides sharp results that characterize when the desired goal is impossible or possible to achieve; an approach for distinguishing non-optimal from optimal procedures

Slide 19: A two-class model for clustering [Jin, Ke & Wang (2015)]

$X_i = \ell_i \mu + Z_i$, $Z_i \stackrel{iid}{\sim} N(0, I_p)$, $i = 1, \ldots, n$ ($p \gg n$)

- $\ell_i = \pm 1$: unknown class labels (the main interest)
- $\mu \in R^p$: feature vector
- RW: only a small fraction of the entries of $\mu$ are nonzero, and each contributes weakly to clustering

Interest: the rationale for HCT, and statistical limits.

Slide 20: IF-PCA simplified to the two-class model

[Diagram: the samples-by-features data matrix is reduced to the selected features, then to the PCs.]

Step                   | microarray                                              | two-class model
pre-normalization      | yes                                                     | skipped
feature-wise screening | Kolmogorov-Smirnov: $\psi_j = \sup_t |F_{n,j}(t) - \Phi(t)|$ | chi-square: $\psi_j = (\|x_j\|^2 - n)/\sqrt{2n}$
re-normalization       | Efron's null correction                                 | skipped
threshold choice       | HCT                                                     | same
post-selection PCA     | same                                                    | same

Notation: $\hat\ell^{if}_t$: IF-PCA with threshold $t > 0$. $\hat\ell^{if}$: classical PCA (a special case).
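The chi-square screening statistic in the right-hand column is essentially one line of numpy. A sketch (the function name is illustrative):

```python
import numpy as np

def chi_square_scores(X):
    """Chi-square screening for the two-class model: psi_j = (||x_j||^2 - n)/sqrt(2n).

    Under mu(j) = 0, the column x_j satisfies ||x_j||^2 ~ chi-square(n), so
    psi_j is approximately N(0,1); a nonzero mu(j) inflates ||x_j||^2 no matter
    what the label signs are, which is why the statistic works for clustering.
    """
    n = X.shape[0]
    return (np.sum(X ** 2, axis=0) - n) / np.sqrt(2 * n)
```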

Slide 21: Aggregation methods

Simple Aggregation: $\hat\ell^{sa} = \mathrm{sgn}(\bar{x})$

Sparse Aggregation: $\hat\ell^{sa}_N = \mathrm{sgn}(\bar{x}_{\hat{S}})$, where $\hat{S} = \hat{S}(N) = \mathrm{argmax}_{\{S:\ |S| = N\}}\{\|\bar{x}_S\|\}$

[Diagram: the columns $x_1, \ldots, x_p$ are averaged over all features (simple aggregation) or over a selected subset $S$ of the features (sparse aggregation).]
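Simple aggregation is immediate to code. Exact sparse aggregation is NP-hard (see the next slide), so the sketch below swaps in a marginal surrogate, keeping the N columns with the largest absolute column means; that surrogate is an assumption of this sketch, not the slides' definition.

```python
import numpy as np

def simple_aggregation(X):
    """l_hat = sgn(x_bar): the sign of each subject's average over all p features."""
    return np.sign(X.mean(axis=1))

def sparse_aggregation_surrogate(X, N):
    """Sparse aggregation restricted to an estimated feature set S with |S| = N.

    The exact rule maximizes ||x_bar_S|| over all size-N subsets and is NP-hard;
    here S is chosen marginally (largest |column mean|) as a cheap stand-in.
    """
    col_mean = X.mean(axis=0)
    S = np.argsort(-np.abs(col_mean))[:N]
    return np.sign(X[:, S].mean(axis=1))
```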

Slide 22: Comparison of methods

Method       | Simple Agg.      | PCA             | Sparse Agg.                    | IF-PCA
Estimator    | $\hat\ell^{sa}$  | $\hat\ell^{if}$ | $\hat\ell^{sa}_N$ ($N \ll p$)  | $\hat\ell^{if}_t$ ($t > 0$)
Sparsity     | dense            | dense           | sparse                         | sparse
Strength     | weak             | weak            | strong*                        | strong
F. Selection | No               | No              | Yes                            | Yes
Complexity   | Poly.            | Poly.           | NP-hard                        | Poly.
Tuning       | No               | No              | Yes                            | Yes**

*: the signals are comparably stronger but still weak
**: a tuning-free version exists

Slide 23: The Asymptotic Rare/Weak (ARW) model

$X = \ell\mu' + Z \in R^{n,p}$, $Z$: iid $N(0,1)$ entries

- $\ell_i = \pm 1$ with equal probability; $\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p \nu_{\tau_p}$
- Large p, small n: $n = p^\theta$, $0 < \theta < 1$
- Rare signals: $\epsilon_p = p^{-\beta}$

Ranges for sparsity/signal strength (full for statistical limits, more specific for IF-PCA):

                 | For statistical limits           | For IF-PCA
$\tau_p$         | $\tau_p = p^{-\alpha}$, $\alpha > 0$ | $\tau_p = \sqrt[4]{2r\log(p)/n}$
Range of $\beta$ | $0 < \beta < 1$                  | $1/2 < \beta < 1 - \theta/2$
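A sampler for the ARW model under the IF-PCA calibration, assuming the fourth-root reading of $\tau_p$ reconstructed above (the function name and seed handling are my choices):

```python
import numpy as np

def sample_arw(p, theta, beta, r, seed=0):
    """Draw (X, labels) from the ARW model with the IF-PCA calibration:

    n = p^theta, eps_p = p^(-beta), tau_p = (2 r log(p) / n)^(1/4).
    """
    rng = np.random.default_rng(seed)
    n = int(round(p ** theta))
    eps = p ** (-beta)
    tau = (2 * r * np.log(p) / n) ** 0.25
    mu = tau * rng.binomial(1, eps, size=p)        # (1-eps) nu_0 + eps nu_tau
    labels = rng.choice([-1, 1], size=n)           # l_i = +/-1 with equal prob.
    X = np.outer(labels, mu) + rng.standard_normal((n, p))
    return X, labels
```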

Slide 24: The phase function for IF-PCA

Fix $0 < \theta < 1$ and define $\rho_\theta(\beta) = (1-\theta)\,\rho\!\big(\tfrac{\beta - \theta/2}{1-\theta}\big)$, where $\rho(\beta)$ is the standard phase function [Donoho and Jin (2004), Ingster (1997, 1999), Jin (2009)]:

$\rho(\beta) = \begin{cases} 0, & 0 < \beta < 1/2 \\ \beta - 1/2, & 1/2 < \beta < 3/4 \\ (1 - \sqrt{1-\beta})^2, & 3/4 < \beta < 1 \end{cases}$
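In code, assuming the rescaled argument $(\beta - \theta/2)/(1-\theta)$, which maps the IF-PCA range $1/2 < \beta < 1 - \theta/2$ onto $(1/2, 1)$ (the rescaling is an assumption of this sketch):

```python
import numpy as np

def rho(beta):
    """Standard phase function of Donoho and Jin (2004) / Ingster (1997, 1999)."""
    if beta <= 0.5:
        return 0.0
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - np.sqrt(1.0 - beta)) ** 2

def rho_theta(beta, theta):
    """Phase function for IF-PCA: (1-theta) * rho((beta - theta/2)/(1-theta)).

    The argument rescaling is a reconstruction, chosen so that the IF-PCA
    range 1/2 < beta < 1 - theta/2 maps onto the domain (1/2, 1) of rho.
    """
    return (1.0 - theta) * rho((beta - theta / 2) / (1.0 - theta))

# Example: the critical r along the beta-range for theta = 0.6
print([round(rho_theta(b, 0.6), 3) for b in (0.55, 0.6, 0.65)])
```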

Slide 25: Phase transition for IF-PCA

Theorem 1. Consider IF-PCA with threshold $t$, and let $\hat{U}^{(t)} \in R^n$ be the first left singular vector of the post-selection data matrix.

- Impossibility. If $r < \rho_\theta(\beta)$, then for any threshold $t$, $\mathrm{Angle}(\hat{U}^{(t)}, \ell) \ge c_0$ (IF-PCA partially fails).
- Possibility. If $r > \rho_\theta(\beta)$, then:
  - For $t$ in an appropriate range, $\mathrm{Angle}(\hat{U}^{(t)}, \ell) \to 0$, and $\hat{U}^{(t)} \propto \mathrm{SNR}(t)\,\ell + z + \mathrm{rem}$, with $\mathrm{SNR}(t) \gg 1$ and $z \sim N(0, I_n)$;
  - HCT yields the right threshold choice: $t_p^{HC}/t_p^{ideal} \to 1$ in probability, where $t_p^{ideal} = \mathrm{argmax}_t\{\mathrm{SNR}(t)\}$;
  - IF-PCA-HCT yields successful clustering: $\mathrm{Hamm}_p(\hat\ell^{HCT}; \beta, r, \theta) \to 0$.

Slide 26: Phase Diagram (IF-PCA)

#(useful features) $= p^{1-\beta}$, $\tau_p = \sqrt[4]{2r\log(p)/n}$, $n = p^\theta$ ($\theta = .6$)

[Two panels in the $(\beta, r)$ plane, each split by the phase curve into a Success region (above) and a Failure region (below).]

Slide 27: Statistical limits (clustering)

$\mathrm{Hamm}_p(\hat\ell; \beta, \alpha, \theta) = (n/2)^{-1} \min_{b = \pm\mathrm{sgn}(\hat\ell)} \Big\{ \sum_{i=1}^n P\big(b_i \ne \mathrm{sgn}(\ell_i)\big) \Big\}$

$\eta_\theta^{clu}(\beta) = \begin{cases} (1 - 2\beta)/2, & 0 < \beta < (1-\theta)/2 \\ \theta/2, & (1-\theta)/2 < \beta < 1-\theta \\ (1 - \beta)/2, & \beta > 1-\theta \end{cases}$

Theorem 1.
- When $\alpha > \eta_\theta^{clu}(\beta)$, $\mathrm{Hamm}_p(\hat\ell; \beta, \alpha, \theta) \to 1$ for any estimate $\hat\ell$.
- When $\alpha < \eta_\theta^{clu}(\beta)$, $\mathrm{Hamm}_p(\hat\ell^{sa}; \beta, \alpha, \theta) \to 0$ for $\beta < \frac{1-\theta}{2}$, and $\mathrm{Hamm}_p(\hat\ell^{sa}_N; \beta, \alpha, \theta) \to 0$ for $\beta > \frac{1-\theta}{2}$, with $N = p^{1-\beta}$.
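The Hamming error has a direct empirical counterpart when scoring an estimated labeling against known labels; a minimal sketch (the function name is mine):

```python
import numpy as np

def hamming_error(l_hat, l_true):
    """Empirical version of Hamm_p: the number of mislabeled subjects,
    minimized over the global sign flip b = +/- l_hat (class names are only
    defined up to a swap), and normalized by n/2 as on the slide."""
    l_hat, l_true = np.sign(l_hat), np.sign(l_true)
    n = len(l_true)
    err = np.sum(l_hat != l_true)
    return min(err, n - err) / (n / 2)
```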

Slide 28: Phase Diagram (clustering; $\theta = 0.6$)

                 | For statistical limits           | For IF-PCA
$\tau_p$         | $\tau_p = p^{-\alpha}$, $\alpha > 0$ | $\tau_p = \sqrt[4]{2r\log(p)/n}$
Range of $\beta$ | $0 < \beta < 1$                  | $1/2 < \beta < 1 - \theta/2$

[Two panels in the $(\beta, \alpha)$ plane ($\beta$: sparser to the right; $\alpha$: weaker signals upward). Left (statistical limits): regions Impossible, Simple Aggregation, and Sparse Aggregation (NP-hard), with boundaries $\tau = p^{1/2}/s$, $\tau = n^{-1/2}$, and $\tau = s^{-1/2}$. Right: regions Impossible, Simple Aggreg., NP-hard, Classical PCA, and IF-PCA.]

Slide 29: Two closely related problems

$X = \ell\mu' + Z$, $Z$: iid $N(0,1)$ entries

(sig) Estimate the support of $\mu$ (signal recovery):
$\mathrm{Hamm}_p(\hat\mu; \beta, r, \theta) = (p\epsilon_p)^{-1} \sum_{j=1}^p P\big(\mathrm{sgn}(\hat\mu_j) \ne \mathrm{sgn}(\mu_j)\big)$

(hyp) Test $H_0^{(p)}$: $X = Z$ against the alternative $H_1^{(p)}$: $X = \ell\mu' + Z$ (global hypothesis testing):
$\mathrm{TestErr}_p(\hat{T}; \beta, r, \theta) = P_{H_0^{(p)}}\big(\text{Reject } H_0^{(p)}\big) + P_{H_1^{(p)}}\big(\text{Accept } H_0^{(p)}\big)$

Arias-Castro & Verzelen (2015), Johnstone & Lu (2001), Rigollet & Berthet (2013)

Slide 30: Statistical limits (three problems)

$\mu(j) \stackrel{iid}{\sim} (1-\epsilon_p)\nu_0 + \epsilon_p\nu_{\tau_p}$, $n = p^\theta$, $s = \#\{\text{useful features}\} = p^{1-\beta}$, signal strength $= p^{-\alpha}$

[One panel in the $(\beta, \alpha)$ plane showing the limits for Clustering, Signal Recovery, and Hypothesis Testing, with boundary curves labeled $\tau = n^{-1/4}\,p^{1/2}/s$, $\tau = p^{1/2}/s$, $\tau = n^{-1/2}$, $\tau = (sn)^{-1/4}$, $\tau = n^{-1/4}$, $\tau = s^{-1/2}$, and $s = n$.]

Slide 31: Computable upper bounds (two problems)

Left: feature estimation. Right: (global) hypothesis testing.

$\mu(j) \stackrel{iid}{\sim} (1-\epsilon_p)\nu_0 + \epsilon_p\nu_{\tau_p}$, $n = p^\theta$, $s = \#\{\text{useful features}\} = p^{1-\beta}$, signal strength $= p^{-\alpha}$

[Two panels in the $(\beta, \alpha)$ plane. Left: Clustering, Signal Recovery (stat), and Signal Recovery (compu); right: Clustering, Signal Recovery, Hypothesis Testing (stat), and Hypothesis Testing (compu). Boundary curves are labeled $\tau = n^{-1/4}(\sqrt{p}/s)$, $\tau = n^{-1/4}$, $\tau = (p/n)^{1/4}s^{-1/2}$, and $s = p^{1/2}$.]

Slide 32: Take-home message

- RW settings are found in many scientific problems and call for new methods and theory.
- We proposed IF-PCA-HCT as a fast, easy-to-adapt method, especially for RW settings.
- We showed effective results on real data.
- We studied phase transitions for IF-PCA, clustering, signal recovery, and global testing.

Jin J, Wang W (2014). Important Features PCA for high dimensional clustering. arXiv.
Jin J, Ke Z, Wang W (2015). Phase transitions for high dimensional clustering and related problems. arXiv.
