Guarding against Spurious Discoveries in High Dimension. Jianqing Fan

1 Guarding against Spurious Discoveries in High Dimension. Jianqing Fan, Princeton University, with Wen-Xin Zhou. September 30, 2016

2 Outline: 1 Introduction; 2 Spurious correlation and random geometry; 3 Goodness of Spurious Fit (GOSF); 4 Asymptotic Distributions of GOSF; 5 Bootstrap and LAMM Algorithm; 6 Spurious Discoveries and Model Selection.

4 Introduction

5 Data revolution. Enormous datasets are routinely collected. Biological sciences: genomics, genetics, neuroscience, medicine. Natural sciences: astronomy, earth sciences, meteorology. Engineering: machine learning, surveillance videos, social media, networks. Social sciences: economics, finance, marketing, management. These characterize many contemporary scientific and decision problems.

6 Growth of Dimensionality and Data. Dimensionality grows rapidly with interactions: the m-way interactions of p variables already number $\binom{p}{m}$. Synergy of two genes: colon cancer in Hanczar et al. (2007), e.g., $Y = I(X_1 + X_2 > 3)$ yet $Y \perp X_1$ marginally. [Figure: two-gene scatter plot; white = patients, black = normal.]

7 False Discoveries. Over the last two decades, many exciting data mining and statistical machine learning techniques have been developed. Example: Lasso and SCAD were introduced to associate a set of covariates to a response from a large pool. Can our discoveries be spurious? Are our assumptions verifiable?

10 First Illustrative Example. Data: gene expressions of 90 Asians from the HapMap project. Response: expression of CHRNA6, cholinergic receptor, nicotinic, alpha 6 (554 SNPs within 1 MB). Covariates: all other expressions (p = 47292). Lasso: picks 25 genes! $\widehat{\mathrm{corr}}(\hat y_{\mathrm{Lasso}}, y) = 0.90$, $\widehat{\mathrm{corr}}(\hat y_{\mathrm{post\text{-}Lasso}}, y) = 0.92$. Any better than chance? The reference distribution depends on n, p, $\hat s$, and X under the null.

12 Any better than chance? Reference distribution: the maximum spurious correlation
$R_n(s,p) = \max_{\|\beta\|_0 = s} \widehat{\mathrm{corr}}_n(Y, \beta^T X)$ when $X \perp Y$.
[Figure: null distributions with 95%-tiles marked; (a) validating exogeneity via Lasso, (b) validating exogeneity via SCAD.]

13 Second Illustrative Example. Data: gene expression profiles from the German Neuroblastoma Trials. Response: 3-year event-free survival (binary outcome). Covariates: gene expressions (p = 10707). We need a new way to assess correlation. Measure of goodness of fit: the likelihood ratio (LR) statistic.

15 Spurious Correlation and Random Geometry

16 Big dimension ⇒ big spurious correlation. Confirmatory: $Y, X \sim N(0,1)$ with $\mathrm{corr}(Y,X) = 0.5$; the sample correlation $\widehat{\mathrm{corr}}(Y,X)$ over 50 i.i.d. observations comes out near 0.5. Conclusion: Y and X are reasonably correlated. Selective: $Y, X_1, \dots, X_{1000}$ i.i.d. N(0,1). Based on n = 50, $\widehat{\mathrm{corr}}(Y, X_{542}) = \max_{j \le 1000} \widehat{\mathrm{corr}}(Y, X_j)$ attains a large value. Wrong conclusion: Y and $X_{542}$ are reasonably correlated. We need methods to guard against spurious discoveries; see the sketch below.
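The contrast above is easy to reproduce. Below is a minimal Python sketch (not from the talk) of the confirmatory versus selective experiment with n = 50 and p = 1000; every variable is simulated independently in the selective case, so the large correlation it finds is spurious by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 1000

# Confirmatory: X and Y truly correlated at 0.5.
x = rng.standard_normal(n)
y = 0.5 * x + np.sqrt(1 - 0.25) * rng.standard_normal(n)
print("confirmatory corr:", round(np.corrcoef(x, y)[0, 1], 3))

# Selective: Y independent of all p covariates, yet the best of them
# still shows a sizable sample correlation.
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
j_star = int(np.argmax(np.abs(corrs)))
print(f"selective: max |corr| = {abs(corrs[j_star]):.3f} at covariate {j_star}")
```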

18 Big Data ⇒ Big Assumption. Assumption: $Y = X_S^T \beta_S + \varepsilon$ with $E(\varepsilon X_j) = 0$ for $j = 1, \dots, p$. Is the exogeneity assumption verifiable? For each fixed j under the null, $\widehat{\mathrm{corr}}(X_j, \varepsilon) \approx N(0, \tfrac{1}{n})$.
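A quick sanity check of the N(0, 1/n) approximation, assuming independent standard normal $X_j$ and $\varepsilon$ (a sketch, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 5000

# Sample correlations between an exogenous covariate and the error term.
vals = np.array([np.corrcoef(rng.standard_normal(n),
                             rng.standard_normal(n))[0, 1]
                 for _ in range(reps)])
print("empirical sd:", round(vals.std(), 4),
      " vs 1/sqrt(n):", round(n ** -0.5, 4))
```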

19 Relation with Random Geometry. Throw p points uniformly on the n-dimensional sphere. Distribution of the minimum angle? Distribution of the angles? Distribution of pairwise angles? (Cai, Fan, Jiang, 13). Mutual incoherence (Donoho and Huo, 01).

22 A more general statement. Distribution of the minimum principal angle among $\binom{p}{s}$ choices?
$R_n(s,p) = \max_{S:\, |S| = s} \widehat{\mathrm{corr}}_n(\varepsilon, \beta_S^T X_S)$.
Fan, Shao & Zhou (2015) show that $n R_n(s,p)^2 \stackrel{a}{\sim} Z^2_{(1)} + \cdots + Z^2_{(s)}$, the sum of the s largest squares of p independent standard normals. What if it is an ellipse?

23 An Experiment: Distribution of Spurious Correlations. Multiple spurious correlation: $R_s = \max_{\|\beta\|_0 = s} \widehat{\mathrm{corr}}(\beta^T X, \varepsilon)$. Example: n = 50, p = 1000, s = 1, 2, 5, 10. [Figure: estimated densities of $R_s$ for s = 1, 2, 5, 10.] The increments from s to s+1 behave like the order statistics of $\chi^2_1/n$ (Fan, Shao and Zhou, 15).
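For s = 1 the claim can be checked directly by Monte Carlo: under the null, $n R_n(1,p)^2$ should match the largest of p (approximately independent, since $\Sigma = I$) $\chi^2_1$ variables. A sketch under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 50, 1000, 500

stat = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, p))
    eps = rng.standard_normal(n)
    Xc = (X - X.mean(0)) / X.std(0)          # standardize columns
    ec = (eps - eps.mean()) / eps.std()
    corrs = Xc.T @ ec / n                    # all p marginal correlations
    stat[r] = n * np.abs(corrs).max() ** 2   # n * R_n(1, p)^2

ref = rng.chisquare(1, size=(reps, p)).max(axis=1)  # top chi^2_1 order stat
print("empirical 95%-tile:", round(np.quantile(stat, 0.95), 2),
      " reference 95%-tile:", round(np.quantile(ref, 0.95), 2))
```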

24 Asymptotic distribution of $R_n(s,p)$. Theorem 1 (Berry-Esseen bound; Fan, Shao and Zhou, 15):
$\sup_{t \ge 0} \bigl| P\{\sqrt{n}\, R_n(s,p) \le t\} - P\{R^*(s,p) \le t\} \bigr| \lesssim \bigl[s\{\log(\gamma_s p/s) + \log n\}\bigr]^{7/8} \big/ n^{1/8}$,
where $R^*(s,p) = \max_{S \subseteq \{1,\dots,p\},\, |S| = s} \bigl(Z_S^T \Sigma_{SS}^{-1} Z_S\bigr)^{1/2}$ and $Z \sim N_p(0, \Sigma)$.
Proof tool: Gaussian approximation (Chernozhukov, Chetverikov, and Kato, 13, 14).

25 Goodness of Spurious Fit Beyond Linear Models

26 Maximum Spurious Correlation. Data: i.i.d. sample $\{(Y_i, X_i^T)\}_{i=1}^n$ with $EY^2 = \sigma^2$ and $EXX^T = \Sigma$. Maximum spurious correlation: when Y and X are independent,
$R_n(s,p) = \max_{\beta \in \mathbb{R}^p:\, \|\beta\|_0 \le s} \widehat{\mathrm{corr}}_n(Y, \beta^T X)$
is the maximum spurious correlation over s-sparse linear combinations. Asymptotics of $R_n(s,p)$: Fan, Shao and Zhou (2015).
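To make the definition concrete, the following sketch computes $R_n(s,p)$ exactly by scanning all $\binom{p}{s}$ subsets; this is feasible only for toy sizes, which is precisely why asymptotics and the LAMM algorithm are needed later in the talk.

```python
import itertools
import numpy as np

def max_spurious_corr(X, y, s):
    """Exact R_n(s, p): max multiple correlation of y with any s columns."""
    yc = y - y.mean()
    best = 0.0
    for S in itertools.combinations(range(X.shape[1]), s):
        Xs = X[:, S] - X[:, S].mean(0)
        beta, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
        fit = Xs @ beta                      # best s-sparse linear fit
        r = fit @ yc / np.sqrt((fit @ fit) * (yc @ yc))
        best = max(best, r)
    return best

rng = np.random.default_rng(3)
X, y = rng.standard_normal((50, 12)), rng.standard_normal(50)  # y independent of X
print([round(max_spurious_corr(X, y, s), 3) for s in (1, 2, 3)])
```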

27 Goodness of spurious fit. Normal model: $2\{\ell(0) - \ell(\hat\beta_S)\} = -n\log(1 - R_S^2)$ is the log maximum likelihood ratio of the null versus the fitted model. Quasi-likelihood: $L_n(\beta)$ from the sample $\{(Y_i, X_i^T)\}_{i=1}^n$. Best subset: $\hat\beta(s) = \operatorname{argmin}_{\beta \in \mathbb{R}^p:\, \|\beta\|_0 \le s} L_n(\beta)$. Goodness of Spurious Fit (GOSF): if $Y \perp X$, $LR_n(s,p) = L_n(0) - L_n(\hat\beta(s))$.

29 Examples. Generalized linear models: $L_n(\beta) = \sum_{i=1}^n \{b(X_i^T\beta) - Y_i X_i^T\beta\}$, where b is a model-dependent convex function. $L_1$ regression: $L_n(\beta) = \sum_{i=1}^n |Y_i - X_i^T\beta|$. Support vector machine (hinge loss): $L_n(\beta) = \sum_{i=1}^n (1 - Y_i X_i^T\beta)_+$. AdaBoost: $L_n(\beta) = \sum_{i=1}^n \exp(-Y_i X_i^T\beta)$.
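Written out in code, the four losses above look as follows (a sketch; the logistic choice $b(u) = \log(1 + e^u)$ is one standard instance of the GLM function b):

```python
import numpy as np

def glm_logistic(beta, X, y):   # b(u) = log(1 + exp(u)); y in {0, 1}
    u = X @ beta
    return np.sum(np.logaddexp(0.0, u) - y * u)

def l1_regression(beta, X, y):  # least absolute deviations
    return np.sum(np.abs(y - X @ beta))

def svm_hinge(beta, X, y):      # hinge loss; y in {-1, +1}
    return np.sum(np.maximum(0.0, 1.0 - y * (X @ beta)))

def adaboost(beta, X, y):       # exponential loss; y in {-1, +1}
    return np.sum(np.exp(-y * (X @ beta)))
```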

30 Asymptotic Distributions of GOSF

31 Conditions for GLIM.
Cond 1: X is sub-Gaussian, $EXX^T = \Sigma$ with $\mathrm{diag}(\Sigma) = I_p$: $A_0 := \sup_{\alpha \in S^{p-1}} \|\alpha^T(\Sigma^{-1/2}X)\|_{\psi_2} < \infty$.
Cond 2: the s-sparse condition number $\gamma_s = \sqrt{\lambda_{\max}(s)/\lambda_{\min}(s)}$ is well-defined, where $\lambda_{\max}(s) = \max_{\|\alpha\|_0 \le s} \frac{\alpha^T\Sigma\alpha}{\alpha^T\alpha}$ and $\lambda_{\min}(s) = \min_{\|\alpha\|_0 \le s} \frac{\alpha^T\Sigma\alpha}{\alpha^T\alpha}$.
Cond 3: $E e^{u(Y-\mu_Y)/\sigma_Y} \le e^{a_0 u^2/2}$ for all $u \in \mathbb{R}$ and some $a_0 > 0$, where $\mu_Y = EY$ and $\sigma_Y^2 = \mathrm{var}(Y)$.
Cond 4: $\min_{|u| \le 1} b''(u) \ge a_1$ and $\max_{|u| \le 1} b''(u) \le A_1$.

33 Asymptotics of $LR_n(s,p)$ for GLIM.
$R_0(s,p) = \max_{S \subseteq \{1,\dots,p\},\, |S| = s} \|\Sigma_{SS}^{-1/2} Z_S\|_2$ with $Z \sim N_p(0,\Sigma)$; this equals $\sqrt{Z^2_{(1)} + \cdots + Z^2_{(s)}}$ when $\Sigma = I$.
Theorem 2 (Berry-Esseen bound): under Conditions 1-4,
$\sup_{t \ge 0}\bigl|P\{2\, LR_n(s,p) \le t\} - P\{R_0^2(s,p) \le t\}\bigr| \lesssim \frac{\{s\log(\gamma_s pn)\}^{7/8}}{n^{1/8}} + \frac{\gamma_s^{1/2}\{s\log(\gamma_s pn)\}^2}{n^{1/2}}$.
Wilks' theorem: when s = p, $R_0^2(s,p) \sim \chi^2_p$.

35 Linear Median Regression. Model and data: i.i.d. sample from $(Y, X^T)$ with $Y = X^T\beta^* + \varepsilon$. Loss function: $L_n(\beta) = \sum_{i=1}^n |Y_i - X_i^T\beta|$. $\ell_0$-constrained LR: $LR_n(s,p) = \sum_{i=1}^n |Y_i| - \min_{\beta \in \mathbb{R}^p:\, \|\beta\|_0 \le s} L_n(\beta)$.

36 Asymptotics of $LR_n(s,p)$.
Cond 5: $E|\varepsilon|^\kappa < \infty$ for some $\kappa > 1$; there exist constants $a_2 < (E|\varepsilon|)^{-1}$, $A_2$, and $A_3$ such that $2\max\{1 - F_\varepsilon(u), F_\varepsilon(-u)\} \ge (1 + a_2 u)^{-1}$ for all $u \ge 0$, $\max_u f_\varepsilon(u) \le A_2$, and $\max_{|u| \le 1}\max\{|f_\varepsilon'(u+)|, |f_\varepsilon'(u-)|\} \le A_3$.
Theorem 3: under Conditions 1, 2 & 5,
$\sup_{t \ge 0}\bigl|P\{2\, LR_n(s,p) \le t\} - P\{R_0^2(s,p) \le 2 f_\varepsilon(0)\, t\}\bigr| \lesssim n^{1-\kappa} + \frac{\{s\log(\gamma_s pn)\}^{7/8}}{n^{1/8}} + \frac{\gamma_s^{1/4}\{s\log(\gamma_s pn)\}^{3/2}}{n^{1/4}}$.

37 Bootstrap and LAMM Algorithm

38 Multiplier bootstrap approximation. The distribution of $R_0(s,p)$ depends on the unknown $\Sigma$. Bootstrap approximation: generate $\hat Z \sim N_p(0, \hat\Sigma)$ via $\hat Z = n^{-1/2}\sum_{i=1}^n e_i X_i$, where $e_1, \dots, e_n$ are i.i.d. N(0,1). Compute the bootstrap counterpart of $R_0(s,p)$:
$R_n^*(s,p) = \max_{S \subseteq \{1,\dots,p\},\, |S| = s} \|\hat\Sigma_{SS}^{-1/2} \hat Z_S\|_2$.
Validity (Fan, Shao, Zhou, 15): if $s\log(\gamma_s pn) = o(n^{1/5})$, then $\sup_{t \ge 0}\bigl|P\{R_0(s,p) \le t\} - P\{R_n^*(s,p) \le t \mid X_1, \dots, X_n\}\bigr| \stackrel{P}{\to} 0$.
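For s = 1 the maximum over subsets reduces to a coordinate-wise maximum, so the bootstrap is cheap. A minimal sketch, assuming the columns of X are standardized so that $\mathrm{diag}(\hat\Sigma) = I_p$:

```python
import numpy as np

def bootstrap_quantile(X, alpha=0.1, B=1000, rng=None):
    """Multiplier-bootstrap (1 - alpha)-quantile of R_0(1, p) given X."""
    rng = rng if rng is not None else np.random.default_rng()
    n, _ = X.shape
    Xs = (X - X.mean(0)) / X.std(0)       # enforce diag(Sigma_hat) = I_p
    stats = np.empty(B)
    for b in range(B):
        e = rng.standard_normal(n)         # i.i.d. N(0, 1) multipliers
        Z_hat = Xs.T @ e / np.sqrt(n)      # Z_hat ~ N_p(0, Sigma_hat) given X
        stats[b] = np.abs(Z_hat).max()     # bootstrap R_n*(1, p)
    return np.quantile(stats, 1 - alpha)

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 1000))
print("bootstrap 90%-tile of R_0(1, p):",
      round(bootstrap_quantile(X, rng=rng), 3))
```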

40 An LAMM Algorithm. Goal: solve $\min_{\|\beta\|_0 \le s} f(\beta)$ with $f(\beta) = L_n(\beta)$. MM: get $\beta^{(0)}$ by forward selection; compute $\beta^{(k+1)} = \operatorname{argmin}_{\|\beta\|_0 \le s} g(\beta \mid \beta^{(k)})$, where $g(\beta \mid \beta^{(k)})$ is a majorizing function (as with LQA and LLA). Non-increasing target value:
$f(\beta^{(k+1)}) \overset{\text{majorization}}{\le} g(\beta^{(k+1)} \mid \beta^{(k)}) \overset{\text{minimization}}{\le} g(\beta^{(k)} \mid \beta^{(k)}) \overset{\text{initialization}}{=} f(\beta^{(k)})$.

42 Implementation. Majorize $f(\beta)$ at $\beta^{(k)}$ by an isotropic quadratic:
$g_\lambda(\beta \mid \beta^{(k)}) = f(\beta^{(k)}) + \langle \nabla f(\beta^{(k)}), \beta - \beta^{(k)}\rangle + \frac{\lambda}{2}\|\beta - \beta^{(k)}\|_2^2$,
valid when $\lambda \ge \max_\beta \|\nabla^2 f(\beta)\|_2$. Analytic solution:
$\beta^{(k+1)}_\lambda = \operatorname{argmin}_{\|\beta\|_0 \le s}\, g_\lambda(\beta \mid \beta^{(k)}) = \{\beta^{(k)} - \lambda^{-1}\nabla f(\beta^{(k)})\}_{[1:s]}$,
i.e., keep the s largest entries in magnitude and set the rest to zero. LAMM: inflate $\lambda_l$ until $f(\beta^{(k+1)}_{\lambda_l}) \le g_{\lambda_l}(\beta^{(k+1)}_{\lambda_l} \mid \beta^{(k)})$. A sketch follows.
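A minimal LAMM sketch under these formulas, with generic callables f and grad standing in for the quasi-likelihood and its gradient (the helper names are illustrative, not from the paper):

```python
import numpy as np

def truncate(v, s):
    """Keep the s largest entries of v in magnitude; zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def lamm(f, grad, beta0, s, lam=1.0, gamma=2.0, iters=100):
    # (The talk initializes beta0 by forward selection; a truncated start
    # is used here for brevity.)
    beta = truncate(beta0, s)
    for _ in range(iters):
        g0, d = f(beta), grad(beta)
        while True:                        # inflate lambda until majorized
            cand = truncate(beta - d / lam, s)
            major = g0 + d @ (cand - beta) + 0.5 * lam * np.sum((cand - beta) ** 2)
            if f(cand) <= major:
                break
            lam *= gamma
        beta = cand
    return beta

# Example: sparse least squares, f(beta) = 0.5 * ||y - X beta||^2.
rng = np.random.default_rng(5)
X, y = rng.standard_normal((50, 20)), rng.standard_normal(50)
f = lambda b: 0.5 * np.sum((y - X @ b) ** 2)
grad = lambda b: -X.T @ (y - X @ b)
print("support:", np.nonzero(lamm(f, grad, np.zeros(20), s=3))[0])
```

Because each candidate minimizes the isotropic quadratic over the s-sparse set and the current iterate is feasible, the accepted step can never increase f, which is exactly the non-increasing-target property on slide 40.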

44 Spurious Discoveries and Model Selection

45 Spurious Discoveries? How large should a model be? Pick the models that fit better than spurious fit: check whether $2\, LR_s > q^2_{n,\alpha}(s,p)$, with the quantile taken from the null distribution (via multiplier bootstrap). $q_{n,\alpha}(s,p)$ provides a critical value and a yardstick for judging whether a discovery is spurious.

46 An Empirical Illustration. Data: 246 patients from the German Neuroblastoma Trials. Response: 3-year event-free survival, 56 positives. Covariates: gene expressions over p = 10707 probe sites. Using 10-fold CV, Lasso selects $\hat s_{cv} = 40$ probes.

47 Results: Selected models and spurious discoveries. [Table: for each λ on the Lasso path, columns $\hat s_\lambda$, $(2\, LR_\lambda)^{1/2}$, $q_{n,0.1}(\hat s_\lambda, p)$, $q_{n,0.05}(\hat s_\lambda, p)$, and mean CV error.] Based on 2000 bootstrap samples.

48 Simulation Setup for Detecting Spurious Discoveries. Logistic regression model with p = 400 and n = 120, 160, 200. $\beta^* = (3, -1, 3, -1, 3, 0, \dots, 0)^T \in \mathbb{R}^p$. $X_1, \dots, X_n$ drawn from $N(0, \Sigma)$ with $\Sigma = I_p$ or $\Sigma = (0.5^{|j-k|})_{1 \le j,k \le p}$. Repeat 200 times.
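A data-generating sketch of this design (the sign pattern of β is read off the garbled slide and may differ from the original deck):

```python
import numpy as np

def simulate(n, p=400, dependent=True, rng=None):
    """Logistic model with beta = (3, -1, 3, -1, 3, 0, ..., 0)."""
    rng = rng if rng is not None else np.random.default_rng()
    beta = np.zeros(p)
    beta[:5] = [3, -1, 3, -1, 3]
    if dependent:                          # Sigma = (0.5^|j-k|)
        Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    else:                                  # Sigma = I_p
        X = rng.standard_normal((n, p))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
    return X, y, beta

X, y, beta = simulate(n=120, rng=np.random.default_rng(6))
print(X.shape, y.mean())
```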

49 Simulation Results. $\hat\beta_{cv}$: 5-fold cross-validated Lasso estimator; $\hat s_{cv} = \|\hat\beta_{cv}\|_0$: number of selected variables. For $\alpha \in (0,1)$, the rejection probability (power) is $P\{2\, LR \ge q^2_{n,\alpha}(\hat s_{cv}, p)\}$. With α = 10%, $q_{n,\alpha}(\hat s_{cv}, p)$ is computed via 1000 bootstrap replications.

50 Rejection probability and median model size. [Table: rejection probability and median $\hat s_{cv}$ (values in parentheses: 13.43, 11.94, 13.81, 12.69, 14.18, 14.18) for n = 120, 160, 200 under independent and dependent Σ.] Lasso with CV gives a far larger model size than the truth ($s_{\text{true}} = 5$); the bias of the Lasso forces CV to choose a smaller λ.

51 Applications to Model Selection. Select among the models that fit better than spurious discoveries. For each λ on the Lasso path, compute $LR_\lambda$ and compare with $q_{n,\alpha}(s,p)\big|_{s = \hat s_\lambda}$ for α = 0.1. Starting from the largest λ, find $\lambda_{\text{fit}}$, the smallest λ satisfying $2\, LR_\lambda \ge q^2_{n,\alpha}(\hat s_\lambda, p)$, and use the selected model; a scanning sketch follows.
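In skeleton form, the scan might look as follows; fit_path, lr_statistic, and q_alpha are hypothetical placeholders for the Lasso path solver, the GOSF statistic, and the multiplier-bootstrap quantile from the earlier slides.

```python
import numpy as np

def select_lambda_fit(lambdas, fit_path, lr_statistic, q_alpha):
    """lambdas sorted from largest to smallest; return lambda_fit, the
    smallest lambda whose model still fits better than spurious fit."""
    chosen = (None, None)
    for lam in lambdas:
        beta_hat = fit_path(lam)                 # sparse fit at this penalty
        s_hat = int(np.count_nonzero(beta_hat))
        if s_hat == 0:
            continue
        if 2.0 * lr_statistic(beta_hat) >= q_alpha(s_hat) ** 2:
            chosen = (lam, beta_hat)             # still beats spurious fit
        else:
            break                                # first failure ends the scan
    return chosen
```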

52 Simulation Results: Continued. $(n,p) = (200,400)$, $\Sigma = (0.5^{|j-k|})_{1 \le j,k \le p}$. Take α = 0.1 in $q_{n,\alpha}(s,p)$; repeat 200 times. The median of $\hat s_{\text{fit}}$ is 9 with robust SD 1.87, and the power is 100%.

53 Summary. Define a generalized goodness of spurious fit, which extends the maximum spurious correlation in linear models. Develop the asymptotic distributions of this goodness of spurious fit for GLIM and $L_1$ regression. Propose a simple and effective LAMM algorithm. Guard against spurious discoveries. Perform model selection from the perspective of guarding against spurious discoveries.

55 The End. Thank you! Fan, J. and Zhou, W.-X. (2016). Guarding against Spurious Discoveries in High Dimension. Preprint.
