Factor-Adjusted Robust Multiple Test. Jianqing Fan (Princeton University)


Transcription

1 Factor-Adjusted Robust Multiple Test Jianqing Fan Princeton University with Koushiki Bose, Qiang Sun, Wenxin Zhou August 11, 2017

2 Outline: 1. Introduction; 2. A principle of robustification; 3. Adaptive Huber estimation; 4. FARM-test; 5. Numerical studies

3 Introduction

4 Heavy-tailed distributions are ubiquitous in modern statistics and machine learning: financial returns and macroeconomic time series; high-throughput data such as microarrays, proteomics and fMRI. They arise easily in high-dimensional data and are at odds with sub-Gaussian or sub-exponential assumptions.

5 Example 1: Macroeconomic time series. 131 macroeconomic series (Stock & Watson, 10; Ludvigson & Ng, 10; McCracken & Ng, 16). [Figure: histogram of kurtosis; vertical axis: frequency.] Series heavier than $t_5$! All financial return time series.

6 Example 2: RNA-seq Data. Gene expressions for 104 autism patients and controls. [Figure: distribution of kurtosis.] /19K gene expressions heavier than $t_5$! By chance, some have heavier tails in high dimensions. Aim: significantly reduce the tail assumptions.

7 Example 3: Protein and Gene Expressions. NCI-60: 60 human cancer cell lines (Shankavaram et al., 2007). [Figures: histogram of kurtosis of protein expressions; histogram of kurtosis of gene expressions; vertical axes: frequency.] Protein: 49/162; gene: 6542/17924 heavier than $t_5$!

8 Large-Scale Hypothesis Testing. $X = (X_1,\dots,X_p)^T$ has mean $\mu = (\mu_1,\dots,\mu_p)^T$. Multiple test: $H_{0j}: \mu_j = 0$ vs $H_{1j}: \mu_j \neq 0$, for $j = 1,\dots,p$. Challenge: strong dependence between $X_1,\dots,X_p$ (Benjamini & Hochberg, 95; Storey, 02; Donoho & Jin, 04; Genovese & Wasserman, 04; Efron, 07, 10; Fan et al., 12; Desai & Storey, 12; Barber & Candès, 15; Fan & Han, 17+). Global test: $H_0: \mu = 0$ vs $H_1: \mu \neq 0$.

9 A common model for dependence. Factor model: $X_{ij} = \mu_j + b_j^T f_i + u_{ij}$, $i = 1,\dots,n$, $j = 1,\dots,p$.

10 Importance of Factor Adjustments. A synthetic three-factor model: $X_i = \mu + B f_i + u_i$, $i = 1,\dots,n$, with $f_i \sim N(0, I_3)$, $B = (b_{jl})$ IID $U(-1,1)$ and $u_i \sim t_3(0, I_p)$. Model setup: $(n,p) = (100,500)$, $\mu_j = 0.6$ for $j \le p/4$; 0, otherwise. [Figures: histogram of sample means; histogram of sample means with factor adjustment; vertical axes: frequency.]
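The synthetic setup on this slide can be reproduced with a short simulation. This is a minimal sketch, not the authors' code: the function name `simulate_three_factor` and the seeds are hypothetical, and the adjustment uses the true loadings and factors (an oracle that is not available in practice) purely to illustrate why subtracting the factor component tightens the sample means around $\mu$.

```python
import numpy as np

def simulate_three_factor(n=100, p=500, signal=0.6, seed=0):
    """Draw X_i = mu + B f_i + u_i with t_3 noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[: p // 4] = signal                      # mu_j = 0.6 for j <= p/4, else 0
    B = rng.uniform(-1.0, 1.0, size=(p, 3))    # loadings b_jl ~ U(-1, 1)
    F = rng.standard_normal((n, 3))            # factors f_i ~ N(0, I_3)
    U = rng.standard_t(df=3, size=(n, p))      # heavy-tailed noise u_i ~ t_3
    X = mu + F @ B.T + U
    return X, mu, B, F

X, mu, B, F = simulate_three_factor()
raw_means = X.mean(axis=0)
# Oracle factor adjustment: remove the common component B f_bar.
adj_means = raw_means - B @ F.mean(axis=0)

print("MSE of sample means:          ", np.mean((raw_means - mu) ** 2))
print("MSE of factor-adjusted means: ", np.mean((adj_means - mu) ** 2))
```

Histograms of `raw_means` and `adj_means` correspond to the two panels described on the slide.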

11 Importance of Robust Adjustments. [Figures: histogram of robust means; histogram of robust means with factor adjustment; vertical axes: frequency.] Decreased noise!

12 Principles of Robustification

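13 Principle of Robustification I: Truncation. Data: $X_i \sim$ IID$(\mu, \sigma^2)$. Truncation: let $\tilde X_i = \mathrm{sgn}(X_i)\min(|X_i|, \tau)$. Exponential concentration: when $\tau \asymp \sigma\sqrt{n}$ (Fan, Wang, Zhu, 17),
$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^n \tilde X_i - \mu\Big| \ge t\,\frac{\sigma}{\sqrt{n}}\Big) \le 2\exp(-ct^2)$$
for a universal constant $c$, compared with $1/t^2$ for the sample mean. Fundamental to high-dimensional estimation.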
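The truncation step is a one-liner in practice. The sketch below assumes NumPy; the name `truncated_mean` is hypothetical, and the plug-in choice `tau = sd * sqrt(n)` (constant 1, non-robust standard deviation) is only an illustrative stand-in for $\tau \asymp \sigma\sqrt{n}$.

```python
import numpy as np

def truncated_mean(x, tau):
    """Mean of sgn(x_i) * min(|x_i|, tau) (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.sign(x) * np.minimum(np.abs(x), tau)))

rng = np.random.default_rng(1)
x = 1.0 + rng.standard_t(df=3, size=200)       # heavy-tailed sample with mean 1
tau = np.std(x) * np.sqrt(x.size)              # illustrative tau ~ sigma * sqrt(n)
print(truncated_mean(x, tau), x.mean())
```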

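14 Robust Covariance Inputs. Data: $X_i \sim$ IID$(0, \Sigma)$, $d$-dimensional, with $\sigma_{ij} = E(X_i X_j) - \underbrace{(EX_i)(EX_j)}_{=0}$. Elementwise truncation: $\tilde x_{ij} = \mathrm{sgn}(x_{ij})\min(|x_{ij}|, \tau)$ and $\tilde\Sigma = (1/n)\sum_{i=1}^n \tilde x_i \tilde x_i^T$. If $E|X_j X_k|^2 \le M$ and $\tau \asymp (nM/\log d)^{1/4}$, we have
$$P\Big(\|\tilde\Sigma - \Sigma\|_{\max} \ge \sqrt{\frac{aM\log d}{n}}\Big) \le d^{\,2 - a/c}$$
for any $a > 2$ and a universal constant $c$.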
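A minimal sketch of the elementwise-truncated second-moment matrix follows, assuming mean-zero rows as on the slide. The function name `truncated_covariance` is hypothetical, and the value of `M` used to set `tau` is an assumed moment bound chosen for illustration, not something given in the talk.

```python
import numpy as np

def truncated_covariance(X, tau):
    """(1/n) * sum_i x~_i x~_i^T with entries truncated at level tau."""
    X = np.asarray(X, dtype=float)
    Xt = np.sign(X) * np.minimum(np.abs(X), tau)
    return Xt.T @ Xt / X.shape[0]

rng = np.random.default_rng(2)
n, d = 200, 50
X = rng.standard_t(df=5, size=(n, d))          # mean-zero, heavy-tailed data
M = 25.0                                       # assumed bound on E|X_j X_k|^2
tau = (n * M / np.log(d)) ** 0.25              # tau ~ (nM / log d)^(1/4)
Sigma_tilde = truncated_covariance(X, tau)
print(Sigma_tilde.shape)
```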


16 Applications of Covariance. Finance: portfolio risk and management. Stat & ML: classification, graphical models, PCA. Regression: let $\Sigma = \mathrm{cov}((X^T, Y)^T)$; then $E(Y - X^T\beta)^2 = (-\beta^T, 1)\,\Sigma\,(-\beta^T, 1)^T$. Inference: Hotelling $T^2$, FDR/FDP control.

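17 Principle of High-dim Robustification II. Adaptive Huber loss (Catoni 12; Fan, Li, Wang 17):
$$\rho_\tau(x) = \begin{cases} x^2, & \text{if } |x| \le \tau, \\ \tau(2|x| - \tau), & \text{if } |x| > \tau, \end{cases} \qquad \hat\mu_\tau = \arg\min_\mu \sum_{i=1}^n \rho_\tau(Y_i - \mu).$$
Then, for $\tau = \sqrt{n}\,c/t$ with $c \ge \mathrm{SD}(Y)$ (Fan, Li, Wang 17),
$$P\Big(|\hat\mu_\tau - \mu| \ge t\,\frac{c}{\sqrt{n}}\Big) \le 2\exp(-t^2/16), \qquad t \le \sqrt{n/8},$$
compared with $1/t^2$ for the sample mean. [Figure: loss function $\ell_\tau(u)$ for $\tau = 0.5, 1, 2, 3, 4, 5$ and least squares.]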
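One way to compute the Huber mean estimator $\hat\mu_\tau$ is iteratively reweighted averaging, which solves the first-order condition $\sum_i w_i (Y_i - \mu) = 0$ with $w_i = \min(1, \tau/|Y_i - \mu|)$. The sketch below is an illustration under that choice of solver; the name `huber_mean`, the median starting point, and the tolerances are assumptions, not taken from the talk.

```python
import numpy as np

def huber_mean(y, tau, n_iter=100, tol=1e-10):
    """Minimize sum_i rho_tau(y_i - mu) by iteratively reweighted averaging."""
    y = np.asarray(y, dtype=float)
    mu = float(np.median(y))                   # robust starting value
    for _ in range(n_iter):
        r = np.abs(y - mu)
        # Weight 1 inside [-tau, tau], tau/|r| outside (guard against r = 0).
        w = np.minimum(1.0, tau / np.maximum(r, 1e-12))
        mu_new = float(np.sum(w * y) / np.sum(w))
        if abs(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu

y = np.array([0.0, 0.0, 0.0, 0.0, 100.0])      # one gross outlier
print(huber_mean(y, tau=1.0))                  # stays near the bulk of the data
print(huber_mean(y, tau=1e6))                  # large tau recovers the sample mean
```

Large $\tau$ reduces the estimator to the sample mean, while small $\tau$ caps the influence of each observation at $\tau$.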

18 Adaptive Huber estimation

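19 Adaptive Huber Estimation. Robust covariate adjustments:
$$(\hat\mu, \hat\beta^T)^T \in \arg\min_{\mu,\beta} \sum_{i=1}^n \ell_\tau(\underbrace{Y_i - \mu - X_i^T\beta}_{\varepsilon_i}).$$
Bahadur representation: if $\tau = \tau_0\sqrt{n}\,(p \vee t)^{-1/2}$ with $\tau_0 \ge \sigma = \sqrt{\mathrm{var}(\varepsilon_i)}$ and $X$ sub-Gaussian, then with
$$R_n(\tau) = \Big\| \sqrt{n}\begin{bmatrix} \hat\mu - \mu^* \\ \Sigma^{1/2}(\hat\beta - \beta^*) \end{bmatrix} - \frac{1}{\sqrt{n}}\sum_{i=1}^n \ell_\tau'(\varepsilon_i)\begin{bmatrix} 1 \\ \Sigma^{-1/2}X_i \end{bmatrix} \Big\|_2 ,$$
we have $P\{R_n(\tau) > c\,(p+t)/\sqrt{n}\} \le 8e^{-t}$. Results are extended to the case with $EX_i^4 \le C$ via truncation.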

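20 A Note on Huber Regression in Low Dimensions. Huber's M-estimator
$$\hat\beta_\tau = \arg\min_{\beta \in \mathbb{R}^d} \underbrace{(1/n)\sum_{i=1}^n \ell_\tau(y_i - x_i^T\beta)}_{L_\tau(\beta)}$$
is a sample version of $\beta^*_\tau = \arg\min_{\beta \in \mathbb{R}^d} E L_\tau(\beta)$. Heavy-tailed noise: $v_\delta = \frac{1}{n}\sum_{i=1}^n E|\varepsilon_i|^{1+\delta} < \infty$ for some $\delta \in (0,1]$. Bias: $\|\beta^*_\tau - \beta^*\|_2 \lesssim v_\delta\,\tau^{-\delta}$ for large $\tau$. Larger $\tau$ means less bias, at the cost of robustness.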

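21 Phase Transition and Optimality. Error bound: for any $t > 0$, $\tau_0 \ge v_\delta$, and $\tau = \tau_0 (n/t)^{\max\{1/(1+\delta),\,1/2\}}$,
$$P\Big\{ \|S_n^{1/2}(\hat\beta_\tau - \beta^*)\|_2 > 4\tau_0\sqrt{p}\,(t/n)^{\min\{\delta/(1+\delta),\,1/2\}} \Big\} \le (2p+1)e^{-t}.$$
[Figure: phase transition diagram; the optimal rate exponent is $\min\{\delta/(1+\delta), 1/2\}$, with an impossible region marked.]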

22 Remarks. The same behavior as in the sub-Gaussian case when the variance is finite. The results extend to random designs with sub-Gaussian tails. The Bahadur representation holds with exponential concentration. The results extend to high-dimensional settings with an $L_1$ penalty.

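23 Dependence-adjusted Test. $T = \sqrt{n}(\hat\mu_\tau - \mu^*)/\sigma$, with $\tau = \tau_0\sqrt{n}\,(p \vee w_n)^{-1/2}$. Cramér-type deviation: if $E|\varepsilon|^3 < \infty$, $w_n \to \infty$ and $w_n = o(\sqrt{n})$, then
$$\sup_{0 \le z \le o\{\min(\sqrt{w_n},\ \sqrt{n}\,w_n^{-1})\}} \Big| \frac{P(|T| \ge z)}{2\Phi(-z)} - 1 \Big| \to 0.$$
Taking $\tau \asymp n^{1/3}$ gives the optimal result: $P(T_{\tau_n} \ge x)/(1 - \Phi(x)) \to 1$ uniformly over $x \in [0, o(n^{1/6}))$. The rate has also been derived. Extended to heavy-tailed designs. $\sigma$ can be estimated by robust residual variances.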

24 Covariates and Factors-adjusted Multiple Tests

25 Covariate-adjusted Robust Test. Covariate adjustments: $X_{ij} = \mu_j + b_j^T f_i + u_{ij}$ (known factors). For each null, run the robust two-sided test. Let $P_j = 2\Phi(-|T_j|)$ be the P-value for testing $H_{0j}: \mu_j = 0$. Uniform approximation of P-values: under regularity conditions, if $\log p = o\{\min(w_n, n w_n^{-2})\}$, then
$$\max_{1 \le j \le p} \Big| \frac{P_j}{P_j^{true}} - 1 \Big|\, 1\{P_j^{true} > \alpha/p\} = o(1) \quad \text{as } n \to \infty.$$

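26 FDP approximation. Total discoveries: $R(z) = \sum_{j=1}^p 1(|T_j| \ge z)$.
$$\mathrm{FDP}(z) = \frac{\sum_{j \in \mathrm{Null}} 1(|T_j| \ge z)}{R(z)}, \qquad \mathrm{FDP}_N(z) = \frac{2p_0\Phi(-z)}{R(z)}.$$
Valid approximation of FDP: if $m_p \le p$ and $m_p \to \infty$, then
$$\max_{0 \le z \le \Phi^{-1}(1 - m_p/(2p))} \Big| \frac{\mathrm{FDP}(z)}{\mathrm{FDP}_N(z)} - 1 \Big| \to 0 \quad \text{in probability.}$$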

27 Remarks. 1. The true FDP can be well approximated by the normal distribution after factor adjustments, as if the data were weakly dependent with reduced noise. 2. It is verified that the chosen critical value $\hat z_{N,\alpha}$ truly yields FDP at level $\alpha$. 3. The proportion of true nulls $p_0/p$ can be estimated as in Storey (2002): $\hat\pi_0(\lambda) = \frac{1}{(1-\lambda)p}\sum_{j=1}^p 1(P_j > \lambda)$, where $P_j = 2\Phi(-|T_j|)$.
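The FDP approximation and Storey's estimator can be sketched as follows, using the identity $2\Phi(-z) = \mathrm{erfc}(z/\sqrt{2})$. All function names are hypothetical. Three choices here are assumptions of this sketch rather than part of the talk: capping $\hat\pi_0$ at 1, treating $R(z) = 0$ as 1 to avoid division by zero, and finding the critical value by a grid search for the smallest $z$ whose estimated FDP is at most $\alpha$.

```python
import numpy as np
from math import erfc, sqrt

def two_sided_pvalues(T):
    """P_j = 2 * Phi(-|T_j|), computed via erfc."""
    return np.array([erfc(abs(t) / sqrt(2.0)) for t in np.asarray(T, dtype=float)])

def storey_pi0(pvalues, lam=0.5):
    """Storey (2002) estimate of p0/p, capped at 1 (cap is an assumption)."""
    p = np.asarray(pvalues, dtype=float)
    return min(1.0, float(np.sum(p > lam)) / ((1.0 - lam) * p.size))

def fdp_normal_approx(T, z, pi0):
    """FDP_N(z) = 2 * p0 * Phi(-z) / R(z), with p0 = pi0 * p."""
    T = np.asarray(T, dtype=float)
    R = max(int(np.sum(np.abs(T) >= z)), 1)    # R = 0 treated as 1
    return pi0 * T.size * erfc(z / sqrt(2.0)) / R

def critical_value(T, alpha, pi0, n_grid=2000):
    """Smallest grid point z with estimated FDP at most alpha."""
    T = np.asarray(T, dtype=float)
    z_max = float(np.max(np.abs(T)))
    for z in np.linspace(0.0, z_max, n_grid):
        if fdp_normal_approx(T, z, pi0) <= alpha:
            return float(z)
    return z_max

T = np.array([0.0, 3.0, -3.0, 0.5])
pi0 = storey_pi0(two_sided_pvalues(T))
print(pi0, critical_value(T, alpha=0.05, pi0=pi0))
```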

28 Factor-adjusted Robust Test. Model: $X_i = \mu + B f_i + u_i$, $i \in [n]$; $f_i$ is unobserved. Robust estimation of realized factors: note that $\bar X_j = \mu_j + b_j^T \bar f + \bar u_j$, $j = 1,\dots,p$. Cross-sectional regression: if $B$ were known, we estimate $\bar f$ by
$$\hat f = \arg\min_{f \in \mathbb{R}^d} \sum_{j=1}^p \ell_\gamma(\bar X_j - b_j^T f),$$
where $\gamma = \gamma(n, p)$ is a robustification parameter.
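The cross-sectional Huber regression can be solved by iteratively reweighted least squares, treating the nonzero $\mu_j$ as sparse outliers. This is a sketch under that solver choice, with $B$ assumed known; the name `huber_factor_estimate`, the least-squares start, and the example values of `gamma` and the outlier size are illustrative assumptions.

```python
import numpy as np

def huber_factor_estimate(xbar, B, gamma, n_iter=200, tol=1e-10):
    """Minimize sum_j l_gamma(xbar_j - b_j^T f) over f by IRLS."""
    xbar = np.asarray(xbar, dtype=float)
    B = np.asarray(B, dtype=float)
    f = np.linalg.lstsq(B, xbar, rcond=None)[0]    # ordinary LS start
    for _ in range(n_iter):
        r = np.abs(xbar - B @ f)
        # Huber weights: 1 for small residuals, gamma/|r| for large ones.
        w = np.minimum(1.0, gamma / np.maximum(r, 1e-12))
        BW = B * w[:, None]
        f_new = np.linalg.solve(B.T @ BW, BW.T @ xbar)
        if np.max(np.abs(f_new - f)) < tol:
            return f_new
        f = f_new
    return f

rng = np.random.default_rng(3)
p, d = 200, 3
B = rng.uniform(-1.0, 1.0, size=(p, d))
f_bar = np.array([0.3, -0.2, 0.1])
mu = np.zeros(p)
mu[:10] = 5.0                                      # sparse nonzero means
xbar = mu + B @ f_bar
f_ols = np.linalg.lstsq(B, xbar, rcond=None)[0]
f_hub = huber_factor_estimate(xbar, B, gamma=0.5)
print(np.linalg.norm(f_ols - f_bar), np.linalg.norm(f_hub - f_bar))
```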


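30 FARM Test Statistics.
$$T_j = \sqrt{\frac{n}{\hat\sigma_{jj} - \|\hat b_j\|_2^2}}\;(\hat\mu_j - \hat b_j^T \hat f), \qquad \hat\sigma_{jj} = \mathrm{RVar}(X_{ij}).$$
Validity of FDP approximation: assume $p \gtrsim n\log(n)$, $\log(p) = o(n^{1/2})$ and that $\mu = (\mu_1,\dots,\mu_p)^T$ is sparse, among other regularity conditions. Then $\mathrm{FDP}_N(z) - \mathrm{FDP}(z) = o_P(1)$ as $n, p \to \infty$.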

31 Estimation of Loading Matrix. $B$ is estimated by the top $d$ unnormalized eigenvectors of a robust covariance matrix estimate of $\mathrm{var}(X)$:
1. Elementwise robust estimator: $\sigma_{ij} = E(X_i X_j) - (EX_i)(EX_j)$.
2. Robust U-type estimator: with $\psi_\tau(u) = (|u| \wedge \tau)\,\mathrm{sign}(u)$,
$$\hat\Sigma_U(\tau) = \binom{n}{2}^{-1} \sum_{j<k} \psi_\tau\Big(\tfrac{1}{2}\|X_j - X_k\|_2^2\Big)\,\frac{(X_j - X_k)(X_j - X_k)^T}{\|X_j - X_k\|_2^2}.$$
FARM-test: choose $\hat z_\alpha$ such that $\mathrm{FDP}_N(\hat z_\alpha; \hat B) = \alpha$. Asymptotic results: the validity of such a procedure in FDP control and estimation is proved.
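A direct, unoptimized sketch of the U-type estimator and the loading estimate follows. The function names are hypothetical. Reading "unnormalized eigenvectors" as eigenvectors scaled by the square roots of their eigenvalues is an assumption of this sketch, as is the illustrative value of `tau`. With $\tau = \infty$ the U-type estimator reduces to the usual sample covariance, which gives a convenient sanity check.

```python
import numpy as np

def u_type_covariance(X, tau):
    """Pairwise-difference covariance estimator with truncation psi_tau."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    S = np.zeros((p, p))
    for j in range(n - 1):
        D = X[j] - X[j + 1:]                   # differences X_j - X_k, k > j
        sq = np.sum(D ** 2, axis=1)            # squared Euclidean norms
        psi = np.minimum(0.5 * sq, tau)        # psi_tau(u) for u >= 0
        w = np.divide(psi, sq, out=np.zeros_like(sq), where=sq > 0)
        S += (D * w[:, None]).T @ D
    return S / (n * (n - 1) / 2.0)

def loadings_from_covariance(S, d):
    """Top-d eigenvectors scaled by sqrt(eigenvalue) (assumed convention)."""
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(vals)[::-1][:d]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

rng = np.random.default_rng(4)
X = rng.standard_t(df=3, size=(60, 8))
Sigma_U = u_type_covariance(X, tau=20.0)       # illustrative tau
B_hat = loadings_from_covariance(Sigma_U, d=3)
print(B_hat.shape)
```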

32 Numerical Studies

33 Model and Methods. Factor model: $X_i = \mu + B f_i + u_i$, $n \in \{100, 150, 200\}$, $p = 500$. $B = (b_{jl})$ IID $U(-1,1)$, $f_i \sim N(0, I_3)$, $u_i \sim N(0, 4I_p)$ or $t_3(0, I_p)$; $\mu_j = 0.5$ for $1 \le j \le 25$; $\mu_j = 0$, otherwise. Competing methods:
1. FARM-H: FARM-Test with adaptive Huber covariance estimator;
2. FARM-U: FARM-Test with U-type covariance estimator;
3. FAM: a non-robust counterpart of FARM (sample mean + cov.);
4. PFA: principal factor approximation (Fan and Han, 17+);
5. Naive: multiple t-tests ignoring factors.

34 FDP Control. Table: Empirical mean absolute error between estimated and oracle FDP ($t = 0.01$), $p = 500$. Columns: $u_i$, $n$, FARM-H, FARM-U, FAM, PFA, Naive; rows: Normal and $t_3$ noise. Non-robust methods break down!

35 Power Comparisons. Table: Empirical power, $p = 500$. Columns: $u_i$, $n$, FARM-H, FARM-U, FAM, PFA, Naive; rows: Normal and $t_3$ noise. Little price to pay for robustness!

36 Power Curve under Varying Signal Strength. [Figure: Empirical power versus signal strength for $t_3$-distributed noise; curves for FARM-H, FARM-U, FAM, PFA and Naive.]

37 Applications to Neuroblastoma Data. German patients diagnosed between 1989 and 2004, aged from 0 to 296 months (median 15 months); customized oligonucleotide microarray with p = 10, focus on 3-year event-free survival (49 + and 190 −); and 420 genes respectively have kurtosis heavier than $t_5$.

38 Effect of Adjustments and Differentially Expressed Genes. [Figures: negative group before adjustment and after adjustment, two pairs of before/after panels; legend: corr > 1/3, corr < −1/3.] At $t = 0.05$, the FARM-U, FAM and naive methods identify 2128, 1767 and 1131 differentially expressed genes, respectively.

39 Summary. Introduce a simple robust principle. Develop a non-asymptotic Bahadur representation for the adaptive Huber estimator. Demonstrate a phase transition phenomenon. Propose a new factor-adjusted robust multiple test: FARM-test. Verify benefits and conclusions by simulation studies.

40 The End. Thank you. A new perspective on robust M-estimation: finite sample theory and applications to dependence-adjusted multiple testing (with W.-X. Zhou, K. Bose & H. Liu). Preprint. FARM-Test: factor-adjusted robust multiple testing with false discovery control (with Y. Ke, Q. Sun & W.-X. Zhou). Preprint, 2017.


More information

Clustering by Important Features PCA (IF-PCA)

Clustering by Important Features PCA (IF-PCA) Clustering by Important Features PCA (IF-PCA) Rare/Weak Signals and Phase Diagrams Jiashun Jin, CMU Zheng Tracy Ke (Univ. of Chicago) Wanjie Wang (Univ. of Pennsylvania) August 5, 2015 Jiashun Jin, CMU

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

False Discovery Control in Spatial Multiple Testing

False Discovery Control in Spatial Multiple Testing False Discovery Control in Spatial Multiple Testing WSun 1,BReich 2,TCai 3, M Guindani 4, and A. Schwartzman 2 WNAR, June, 2012 1 University of Southern California 2 North Carolina State University 3 University

More information

6. The econometrics of Financial Markets: Empirical Analysis of Financial Time Series. MA6622, Ernesto Mordecki, CityU, HK, 2006.

6. The econometrics of Financial Markets: Empirical Analysis of Financial Time Series. MA6622, Ernesto Mordecki, CityU, HK, 2006. 6. The econometrics of Financial Markets: Empirical Analysis of Financial Time Series MA6622, Ernesto Mordecki, CityU, HK, 2006. References for Lecture 5: Quantitative Risk Management. A. McNeil, R. Frey,

More information

Stat 206: Estimation and testing for a mean vector,

Stat 206: Estimation and testing for a mean vector, Stat 206: Estimation and testing for a mean vector, Part II James Johndrow 2016-12-03 Comparing components of the mean vector In the last part, we talked about testing the hypothesis H 0 : µ 1 = µ 2 where

More information

Problem Set 7. Ideally, these would be the same observations left out when you

Problem Set 7. Ideally, these would be the same observations left out when you Business 4903 Instructor: Christian Hansen Problem Set 7. Use the data in MROZ.raw to answer this question. The data consist of 753 observations. Before answering any of parts a.-b., remove 253 observations

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Modified Simes Critical Values Under Positive Dependence

Modified Simes Critical Values Under Positive Dependence Modified Simes Critical Values Under Positive Dependence Gengqian Cai, Sanat K. Sarkar Clinical Pharmacology Statistics & Programming, BDS, GlaxoSmithKline Statistics Department, Temple University, Philadelphia

More information

Machine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.

More information

Sample Size Requirement For Some Low-Dimensional Estimation Problems

Sample Size Requirement For Some Low-Dimensional Estimation Problems Sample Size Requirement For Some Low-Dimensional Estimation Problems Cun-Hui Zhang, Rutgers University September 10, 2013 SAMSI Thanks for the invitation! Acknowledgements/References Sun, T. and Zhang,

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

UNIVERSITÄT POTSDAM Institut für Mathematik

UNIVERSITÄT POTSDAM Institut für Mathematik UNIVERSITÄT POTSDAM Institut für Mathematik Testing the Acceleration Function in Life Time Models Hannelore Liero Matthias Liero Mathematische Statistik und Wahrscheinlichkeitstheorie Universität Potsdam

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

Inference with Transposable Data: Modeling the Effects of Row and Column Correlations

Inference with Transposable Data: Modeling the Effects of Row and Column Correlations Inference with Transposable Data: Modeling the Effects of Row and Column Correlations Genevera I. Allen Department of Pediatrics-Neurology, Baylor College of Medicine, Jan and Dan Duncan Neurological Research

More information

CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING

CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING Submitted to the Annals of Statistics CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING By Jingshu Wang, Qingyuan Zhao, Trevor Hastie, Art B. Owen Stanford University We consider large-scale studies

More information

Causal Inference: Discussion

Causal Inference: Discussion Causal Inference: Discussion Mladen Kolar The University of Chicago Booth School of Business Sept 23, 2016 Types of machine learning problems Based on the information available: Supervised learning Reinforcement

More information

TO HOW MANY SIMULTANEOUS HYPOTHESIS TESTS CAN NORMAL, STUDENT S t OR BOOTSTRAP CALIBRATION BE APPLIED? Jianqing Fan Peter Hall Qiwei Yao

TO HOW MANY SIMULTANEOUS HYPOTHESIS TESTS CAN NORMAL, STUDENT S t OR BOOTSTRAP CALIBRATION BE APPLIED? Jianqing Fan Peter Hall Qiwei Yao TO HOW MANY SIMULTANEOUS HYPOTHESIS TESTS CAN NORMAL, STUDENT S t OR BOOTSTRAP CALIBRATION BE APPLIED? Jianqing Fan Peter Hall Qiwei Yao ABSTRACT. In the analysis of microarray data, and in some other

More information

13. Time Series Analysis: Asymptotics Weakly Dependent and Random Walk Process. Strict Exogeneity

13. Time Series Analysis: Asymptotics Weakly Dependent and Random Walk Process. Strict Exogeneity Outline: Further Issues in Using OLS with Time Series Data 13. Time Series Analysis: Asymptotics Weakly Dependent and Random Walk Process I. Stationary and Weakly Dependent Time Series III. Highly Persistent

More information

Introduction to Rare Event Simulation

Introduction to Rare Event Simulation Introduction to Rare Event Simulation Brown University: Summer School on Rare Event Simulation Jose Blanchet Columbia University. Department of Statistics, Department of IEOR. Blanchet (Columbia) 1 / 31

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

FDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES

FDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES FDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES Sanat K. Sarkar a a Department of Statistics, Temple University, Speakman Hall (006-00), Philadelphia, PA 19122, USA Abstract The concept

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Bootstrapping high dimensional vector: interplay between dependence and dimensionality

Bootstrapping high dimensional vector: interplay between dependence and dimensionality Bootstrapping high dimensional vector: interplay between dependence and dimensionality Xianyang Zhang Joint work with Guang Cheng University of Missouri-Columbia LDHD: Transition Workshop, 2014 Xianyang

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression In simple linear regression we are concerned about the relationship between two variables, X and Y. There are two components to such a relationship. 1. The strength of the relationship.

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Robust Sparse Quadratic Discriminantion. Jianqing Fan

Robust Sparse Quadratic Discriminantion. Jianqing Fan Robust Sparse Quadratic Discriminantion Jianqing Fan Princeton University with Tracy Ke, Han Liu and Lucy Xia May 2, 2014 Outline 1 Introduction 2 Rayleigh Quotient for sparse QDA 3 Optimization Algorithm

More information

Assessing the dependence of high-dimensional time series via sample autocovariances and correlations

Assessing the dependence of high-dimensional time series via sample autocovariances and correlations Assessing the dependence of high-dimensional time series via sample autocovariances and correlations Johannes Heiny University of Aarhus Joint work with Thomas Mikosch (Copenhagen), Richard Davis (Columbia),

More information

Nonlinear time series

Nonlinear time series Based on the book by Fan/Yao: Nonlinear Time Series Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna October 27, 2009 Outline Characteristics of

More information

Bayesian Models for Regularization in Optimization

Bayesian Models for Regularization in Optimization Bayesian Models for Regularization in Optimization Aleksandr Aravkin, UBC Bradley Bell, UW Alessandro Chiuso, Padova Michael Friedlander, UBC Gianluigi Pilloneto, Padova Jim Burke, UW MOPTA, Lehigh University,

More information

Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space

Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Jinchi Lv Data Sciences and Operations Department Marshall School of Business University of Southern California http://bcf.usc.edu/

More information

Journal Club: Higher Criticism

Journal Club: Higher Criticism Journal Club: Higher Criticism David Donoho (2002): Higher Criticism for Heterogeneous Mixtures, Technical Report No. 2002-12, Dept. of Statistics, Stanford University. Introduction John Tukey (1976):

More information

Median Cross-Validation

Median Cross-Validation Median Cross-Validation Chi-Wai Yu 1, and Bertrand Clarke 2 1 Department of Mathematics Hong Kong University of Science and Technology 2 Department of Medicine University of Miami IISA 2011 Outline Motivational

More information