Uncertainty quantification in high-dimensional statistics
1 Uncertainty quantification in high-dimensional statistics. Peter Bühlmann, ETH Zürich. Based on joint work with Sara van de Geer, Nicolai Meinshausen and Lukas Meier.
2 High-dimensional data. Behavioral economics and genetics (with Ernst Fehr, U. Zurich): n persons with genetic information (p SNPs) and response variables measuring behavior; p ≫ n. Goal: find significant associations between behavioral responses and genetic markers. [Figure: number of significant target SNPs per phenotype, plotted against the phenotype index.]
3 Ars Conjectandi in the light of modern applications. From Stigler (1986) and Bolthausen (2010) I learned: Jakob Bernoulli developed the Law of Large Numbers for Bernoulli-distributed random variables; in the proof, a result about large deviations (a concentration inequality; no mention of the variance). Point estimation; rate of convergence.
4 Regarding high-dimensional statistics: a lot of progress has been achieved over the last 8-10 years for point estimation and rates of convergence, and a substantial mathematical part relies on concentration inequalities and results on large deviations. Link to Ars Conjectandi: the general approach is still the same, and it was established 300 years ago.
5 For high-dimensional statistics: very little work on assigning measures of uncertainty, p-values, confidence intervals. And in Ars Conjectandi...? I'll come to this in a moment.
7 we need uncertainty quantification! (the core of statistics)
8 we need uncertainty quantification! (the core of statistics) did Jakob Bernoulli address this point?
9 Jakob Bernoulli wrote: "... the most important part is missing, where I show how the fundamental principles of Ars Conjectandi can be applied to civil, moral and economic matters". Bernoulli describes how one can approximate unknown probabilities by finite-sample quantities (i.e., a frequentist view). Leibniz was not very convinced: how can one describe complex phenomena like diseases with probabilities (potential environmental changes, ...)? We certainly do nowadays ("personalized medicine")! Bernoulli gives a brief response addressing "the criticisms raised by some academics"; probably Bernoulli did not have enough time to present a more detailed treatment. There are indications that Bernoulli had something like a confidence interval in mind.
12 Goal (regarding the title of the talk): p-values/confidence intervals for a high-dimensional linear model (and we can then generalize to other models).
13 Motif regression and variable selection for finding HIF1α transcription factor binding sites in DNA sequences (Müller, Meier, PB & Ricci). For coarse DNA segments i = 1, ..., n: predictor X_i = (X_i^(1), ..., X_i^(p)) ∈ R^p, the abundance scores of candidate motifs j = 1, ..., p in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN); univariate response Y_i ∈ R, the binding intensity of HIF1α to the coarse DNA segment (from ChIP-chip experiments).
14 Question: what is the relation between the binding intensity Y and the abundance of short candidate motifs? A linear model is often reasonable: motif regression (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003), Y_i = Σ_{j=1}^p β_j⁰ X_i^(j) + ε_i, i = 1, ..., n, with n = 143, p = 195. Goal: variable selection and significance of variables, i.e. find the relevant motifs among the p = 195 candidates.
15 Lasso (Tibshirani, 1996). Lasso for linear models: β̂(λ) = argmin_β ( n⁻¹ ‖Y − Xβ‖² + λ ‖β‖₁ ), λ ≥ 0, where ‖β‖₁ = Σ_{j=1}^p |β_j|. Well-known facts: convex optimization; the Lasso does variable selection, i.e. some of the β̂_j(λ) = 0 (because of the ℓ₁-geometry); β̂(λ) is a shrunken OLS estimate.
16 Lasso for variable selection: Ŝ(λ) = {j : β̂_j(λ) ≠ 0}, an estimate of the active set S₀ = {j : β_j⁰ ≠ 0}. No significance testing involved; it's convex optimization only! And it's very popular (Meinshausen & PB, 2006; Zhao & Yu, 2006; Wainwright, 2009; ...).
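A minimal numerical sketch of these two slides, with an invented toy data set and a plain coordinate-descent Lasso (`lasso_cd`, written only for this illustration, not the code behind the talk): fit β̂(λ) and read off Ŝ(λ) from the nonzero coefficients.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=60):
    """Coordinate descent for the Lasso
    beta_hat(lam) = argmin_beta n^{-1} ||y - X beta||^2 + lam ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    cn = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # correlation of column j with the partial residual (j left out)
            rho = X[:, j] @ (y - X @ beta + X[:, j] * beta[j]) / n
            # soft-thresholding update from the subgradient condition
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / cn[j]
    return beta

# toy data: n = 100, p = 200, true active set S0 = {0, 1, 2}
rng = np.random.default_rng(0)
n, p = 100, 200
beta0 = np.zeros(p)
beta0[:3] = 2.0
X = rng.standard_normal((n, p))
y = X @ beta0 + 0.5 * rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam=0.2)   # lam of order sqrt(log(p)/n)
S_hat = {j for j in range(p) if beta_hat[j] != 0}
```

In this easy setting Ŝ contains the true active set plus a few false positives, which is exactly the screening behavior discussed later; no significance statement is attached to any selected variable.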
17 For motif regression (finding HIF1α transcription factor binding sites), n = 143, p = 195: the Lasso selects 26 covariates when choosing λ = λ̂_CV via cross-validation, with resulting R² ≈ 50%, i.e. 26 interesting candidate motifs. How significant are the findings?
19 [Figure: estimated coefficients β̂(λ̂_CV) for the original data, plotted against the variable index.] P-values for H_{0,j}: β_j⁰ = 0?
20 High-dimensional linear models and what statistical theory tells us. Y = Xβ⁰ + ε, p ≫ n, with fixed (deterministic) design X. Problem of identifiability: for p > n, Xβ⁰ = Xθ for any θ = β⁰ + ξ with ξ in the null space of X; we cannot say anything about β̂ − β⁰ without further assumptions!
21 Assumption 1: design conditions in terms of restricted eigenvalues. The minimal eigenvalue of Σ̂ = XᵀX/n equals zero (if p > n); consider instead the smallest restricted eigenvalue (or compatibility constant) and require it to be bounded away from zero (van de Geer, 2007; Candès & Tao, 2007; ...; Bickel, Ritov & Tsybakov, 2009; Wainwright, 2009; ...). Example: X has i.i.d. rows with sub-Gaussian distribution and Cov(X_i) is e.g. a Toeplitz matrix, or equi-correlation with 0 < ρ < 1; then, with high probability, the smallest restricted eigenvalue of Σ̂ is bounded away from 0.
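The rank deficiency behind this assumption is easy to verify numerically; a small check (dimensions invented for this sketch) counts the zero eigenvalues of Σ̂ = XᵀX/n when p > n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 120
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n          # p x p Gram matrix, rank at most n < p

eigvals = np.linalg.eigvalsh(Sigma_hat)
# at least p - n eigenvalues are (numerically) zero
print(np.sum(eigvals < 1e-8))    # prints 70 (= p - n)
```

So the unrestricted minimal eigenvalue is useless for p > n; restricted eigenvalue or compatibility conditions look only at directions compatible with sparse vectors, which is why they can still be bounded away from zero.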
22 [Diagram: various conditions and their relations (van de Geer & PB, 2009): RIP, weak (S, 2s)-irrepresentable, weak (S, 2s)-RIP, coherence, adaptive (S, 2s)- and (S, s)-restricted regression, (S, 2s)-irrepresentable, (S, 2s)- and (S, s)-restricted eigenvalue, (S, s)-uniform irrepresentable, S-compatibility; arrows indicate which conditions imply oracle inequalities for prediction and estimation.]
23 Consider the Lasso β̂(λ) = argmin_β ( n⁻¹ ‖Y − Xβ‖² + λ ‖β‖₁ ). Assuming the restricted ℓ₁-eigenvalue (compatibility) condition: n⁻¹ ‖X(β̂ − β⁰)‖² = O_P(s₀ log(p)/n) for λ ≍ √(log(p)/n), and ‖β̂ − β⁰‖₁ = O_P(s₀ √(log(p)/n)) for λ ≍ √(log(p)/n), where s₀ = |S₀| is the cardinality of the active set. That is: β⁰ is identifiable (if s₀ ≪ n/log(p)).
24 Assumption 2: beta-min condition. Beta-min condition: min_{j ∈ S₀} |β_j⁰| ≫ s₀ √(log(p)/n) (or √(s₀ log(p)/n), or √(log(p)/n)). From ‖β̂ − β⁰‖₁ = O_P(s₀ √(log(p)/n)) we immediately obtain variable screening: Ŝ ⊇ S₀ with high probability, i.e., we will not miss a true variable! But we may (typically) have too many false positive selections.
25 [Figure: estimated coefficients β̂(λ̂_CV) for the original data, plotted against the variable index.] Which variables in Ŝ are false positives? P-values would be very useful!
26 P-values for high-dimensional linear models. Y = Xβ⁰ + ε; goal: statistical hypothesis testing of H_{0,j}: β_j⁰ = 0, or H_{0,G}: β_j⁰ = 0 for all j ∈ G ⊆ {1, ..., p}. Background: if we could handle the asymptotic distribution of the Lasso β̂(λ) under the null hypothesis, we could construct p-values. This is very difficult! The asymptotic distribution of β̂ has some point mass at zero, ... (Knight and Fu (2000), for fixed p and n → ∞).
27 Standard bootstrapping and subsampling cannot be used either, but there are recent proposals using adaptations of standard resampling methods (Chatterjee & Lahiri, 2013; Liu & Yu, 2013); non-uniformity/super-efficiency issues remain...
28 Low-dimensional projections and bias correction, or: de-sparsifying the Lasso estimator (related work by Zhang and Zhang (2011)). Motivation: the OLS estimator satisfies β̂_{OLS,j} = projection of Y onto the residuals X_j − X_{−j} γ̂_{OLS}^{(j)}; this projection is not well defined if p > n. Instead, use regularized residuals from a Lasso of X_j on the other X-variables: Z_j = X_j − X_{−j} γ̂_{Lasso}^{(j)}.
29 Using Y = Xβ⁰ + ε: Z_jᵀY = Z_jᵀX_j β_j⁰ + Σ_{k≠j} Z_jᵀX_k β_k⁰ + Z_jᵀε, and hence Z_jᵀY / (Z_jᵀX_j) = β_j⁰ + Σ_{k≠j} (Z_jᵀX_k / Z_jᵀX_j) β_k⁰ [bias] + Z_jᵀε / (Z_jᵀX_j) [noise component]. De-sparsified estimator, with the bias term estimated via the Lasso (Lasso-estimated bias correction): b̂_j = Z_jᵀY / (Z_jᵀX_j) − Σ_{k≠j} (Z_jᵀX_k / Z_jᵀX_j) β̂_{Lasso;k}.
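The construction above can be written out numerically. A minimal sketch, assuming simulated Gaussian data and using a plain coordinate-descent Lasso (`lasso_cd`, written for this illustration) both for the initial estimator β̂_Lasso and for the nodewise regressions producing Z_j:

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=60):
    # plain coordinate-descent Lasso (helper written for this sketch)
    n, p = X.shape
    beta = np.zeros(p)
    cn = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ (y - X @ beta + X[:, j] * beta[j]) / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / cn[j]
    return beta

rng = np.random.default_rng(1)
n, p, lam = 100, 120, 0.2
beta0 = np.zeros(p)
beta0[:3] = 1.0                       # true active set {0, 1, 2}
X = rng.standard_normal((n, p))
y = X @ beta0 + 0.5 * rng.standard_normal(n)

beta_lasso = lasso_cd(X, y, lam)      # initial (sparse) Lasso estimate

def desparsify(j):
    # Z_j: residuals of a nodewise Lasso of X_j on the other columns
    X_rest = np.delete(X, j, axis=1)
    gamma = lasso_cd(X_rest, X[:, j], lam)
    Z = X[:, j] - X_rest @ gamma
    denom = Z @ X[:, j]
    # b_j = Z_j'Y / Z_j'X_j  minus the Lasso-estimated bias term
    bias = (Z @ X_rest) @ np.delete(beta_lasso, j) / denom
    return Z @ y / denom - bias

b_hat = np.array([desparsify(j) for j in range(4)])  # first 4 coordinates
```

Note that b̂_j is generally nonzero even when the Lasso sets β̂_j to zero: the de-sparsified estimator is not sparse, which is exactly the point of the next slide.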
30 b̂_j is not sparse! ... and this is crucial to obtain the Gaussian limit; nevertheless, it is optimal (see later). Target: the low-dimensional component β_j⁰; η := {β_k⁰ ; k ≠ j} is a high-dimensional nuisance parameter, exactly as in semiparametric modeling, and it is sparsely estimated (e.g. with the Lasso).
32 Asymptotic pivot and optimality. Theorem (van de Geer, PB & Ritov, 2013): √n (b̂_j − β_j⁰) ⇒ N(0, σ_ε² Ω_jj) (j = 1, ..., p), with Ω_jj an explicit expression, Ω_jj → (Σ⁻¹)_jj. Optimal! Reaching the semiparametric information bound; asymptotically optimal p-values and confidence intervals. If we assume: the population covariance Cov(X) = Σ has minimal eigenvalue ≥ M > 0; sparsity of the regression Y vs. X: s₀ = o(√n / log(p)) (quite sparse); sparsity of the design: Σ⁻¹ sparse, i.e. sparse regressions of X_j vs. X_{−j}: s_j = o(√n / log(p)) (may not be OK).
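Given the pivot, p-values and confidence intervals follow directly. A sketch assuming σ_ε and Ω_jj are given (in practice both must be estimated consistently); the helper name and the numbers are hypothetical:

```python
import math

def desparsified_pvalue(b_j, sigma_eps, omega_jj, n, beta_null=0.0):
    """Two-sided p-value for H_{0,j}: beta_j = beta_null, based on the
    pivot sqrt(n) (b_j - beta_j) ~ N(0, sigma_eps^2 * Omega_jj)."""
    se = sigma_eps * math.sqrt(omega_jj / n)
    z = (b_j - beta_null) / se
    # two-sided standard normal tail via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))

# hypothetical numbers: b_j = 0.25, sigma_eps = 1, Omega_jj = 1, n = 100,
# so z = 2.5 and the two-sided p-value is about 0.0124
pval = desparsified_pvalue(0.25, 1.0, 1.0, 100)

# a 95% confidence interval from the same pivot
se = 1.0 * math.sqrt(1.0 / 100)
ci = (0.25 - 1.96 * se, 0.25 + 1.96 * se)
```

The same pivot also delivers confidence intervals b̂_j ± z_{1−α/2} σ_ε √(Ω_jj/n), which is what "asymptotically optimal p-values and confidence intervals" refers to.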
33 It is optimal! (Cramér-Rao)
34 For designs with Σ⁻¹ non-sparse: Ridge projection (PB, 2013): good type I error control, but not optimal in terms of power. A convex program instead of the Lasso for Z_j (Javanmard & Montanari, 2013; MSc thesis of Dezeure, 2013): Javanmard & Montanari prove optimality. Careful choice of regularization parameters, e.g. with the square-root Lasso (van de Geer & PB, in progress). So far there is no convincing empirical evidence that we can deal well with such scenarios (Σ⁻¹ non-sparse).
35 Uniform convergence: √n (b̂_j − β_j⁰) ⇒ N(0, σ_ε² Ω_jj) (j = 1, ..., p); the convergence is uniform over B(s₀) = {β : ‖β‖₀ ≤ s₀}, hence honest tests and confidence regions! And we can avoid post-model-selection inference (cf. Pötscher and Leeb).
36 Simultaneous inference over all components: √n (b̂ − β⁰) ≈ (W_1, ..., W_p) ∼ N_p(0, σ_ε² Ω). Can construct p-values for H_{0,G} with any G, using the test statistic max_{j∈G} |b̂_j|, since the covariance structure Ω is known; and one can easily do an efficient multiple-testing adjustment, again since Ω is known!
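Because Ω is known, the null distribution of the max statistic can be simulated directly. A Monte Carlo sketch; the equi-correlation Ω and all numbers below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
p, sigma = 20, 1.0

# hypothetical known covariance Omega: equi-correlation with rho = 0.5
rho = 0.5
Omega = (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))

# Monte Carlo null distribution of max_j |W_j|, W ~ N_p(0, sigma^2 Omega)
L = np.linalg.cholesky(Omega)
W = sigma * (rng.standard_normal((100_000, p)) @ L.T)  # rows ~ N(0, Omega)
null_max = np.abs(W).max(axis=1)

# p-value for an (invented) observed statistic max_{j in G} sqrt(n) |b_hat_j|
t_obs = 3.0
p_value = float((null_max >= t_obs).mean())
```

The same simulated null distribution gives critical values for FWER-controlling simultaneous confidence bands, which is the "efficient multiple testing adjustment" mentioned above.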
37 Alternatives? Versions of bootstrapping (Chatterjee & Lahiri, 2013): super-efficiency phenomenon (Joe Hodges), i.e. non-uniform convergence; good for estimating the zeroes (j ∈ S₀ᶜ with β_j⁰ = 0), bad for estimating the non-zeroes (j ∈ S₀ with β_j⁰ ≠ 0). Multiple sample splitting (Meinshausen, Meier & PB, 2009): split the sample repeatedly into two halves; select variables on the first half; compute p-values using the second half, based on the selected variables. This avoids (because of sample splitting) over-optimistic p-values, but potentially suffers in terms of power.
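The multiple sample-splitting procedure can be sketched in a few lines. To keep the sketch short, selection on the first half uses marginal-correlation screening (`screen_top_k`) as a stand-in for the Lasso selector, and the aggregation over splits is the simple twice-the-median rule (the γ = 1/2 case of the quantile aggregation in Meinshausen, Meier & PB, 2009); all helper names and numbers are invented for this illustration:

```python
import numpy as np
from math import erfc, sqrt

def screen_top_k(X, y, k=10):
    # stand-in selector: marginal-correlation screening instead of the Lasso
    return np.sort(np.argsort(np.abs(X.T @ y))[-k:])

def ols_pvalues(X, y):
    # classical OLS p-values (normal approximation) for an n x k design, k < n
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return np.array([erfc(abs(z) / sqrt(2.0)) for z in beta / se])

def multi_split_pvalues(X, y, B=25, k=10, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    P = np.ones((B, p))
    for b in range(B):
        idx = rng.permutation(n)
        i1, i2 = idx[: n // 2], idx[n // 2:]
        S = screen_top_k(X[i1], y[i1], k)          # select on first half
        praw = ols_pvalues(X[i2][:, S], y[i2])     # p-values on second half
        P[b, S] = np.minimum(praw * len(S), 1.0)   # Bonferroni over |S|
    # aggregate over splits: twice the median, capped at 1
    return np.minimum(2.0 * np.median(P, axis=0), 1.0)

rng = np.random.default_rng(3)
n, p = 200, 50
beta0 = np.zeros(p)
beta0[:2] = 2.0
X = rng.standard_normal((n, p))
y = X @ beta0 + rng.standard_normal(n)

pv = multi_split_pvalues(X, y)
```

Unselected variables get the p-value 1 in each split, so null coordinates aggregate to p-values near 1; this built-in conservatism is the power cost mentioned on the slide.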
38 Some empirical results (Dezeure, Meier & PB, in progress): compare power and control of the familywise error rate (FWER); always p = 500, n = 100 and s₀ = 15. Equi-correlation design, Σ_jk ≡ 0.8 (j ≠ k), so Σ⁻¹ not sparse: [Figure: power and FWER for multi sample splitting, Ridge projection and Lasso projection.] The projection estimators are unreliable for controlling the FWER!
39 Toeplitz design with banded (very sparse) Σ⁻¹: [Figure: power and FWER for the three methods.] The Lasso-projection method is slightly best (as it should be!).
40 Design with exponentially decaying (approximately sparse) Σ⁻¹: [Figure: power and FWER for the three methods.] The methods are roughly on par; the Lasso-projection method has one scenario with bad FWER.
41 Real data X: the 500 variables with highest empirical variance: [Figure: power and FWER for the three methods.] The Lasso-projection method is unreliable for controlling the FWER!
42 Real data X: the 500 variables with highest pairwise correlations: [Figure: power and FWER for the three methods.] The Lasso-projection method is slightly best; multi-sample splitting once has rather bad FWER.
43 Overall: multi sample splitting seems most reliable (for type I error control), at the price of being more conservative (less power), and there is no optimality theory for it. [Photos: Leo Breiman, Brad Efron.] Is our empirical finding true more generally? So far, theory doesn't give a clear answer.
44 Motif regression example: one significant variable, with both the de-sparsified Lasso and multi sample splitting. [Figure: motif regression coefficients vs. variable index; the legend marks the variable/motif with significant FWER-adjusted p-value, and one whose p-value is clearly larger than 0.05.] The significant variable corresponds to a known true motif.
45 For data sets with large p and n ≈ 100: often no significant variable, because the ratio log(p)/n is too extreme.
46 Behavioral economics and genome-wide association, with Ernst Fehr, University of Zurich. n = 1525 probands (all students!); m = 79 response variables measuring various behavioral characteristics (e.g. risk aversion) from well-designed experiments; 460 target SNPs (as a proxy for 10⁶ SNPs): 1380 parameters per response (but only 1341 meaningful parameters). Model: multivariate linear model Y = Xβ + ε, with Y the n × m responses, X the n × p SNP data, β the p × m coefficient matrix and ε the n × m error. Although p < n, the design matrix X (with categorical values {1, 2, 3}) does not have full rank.
47 Y (n × m) = X (n × p) β (p × m) + ε (n × m). Interested in p-values for H_{0,jk}: β_jk = 0 versus H_{A,jk}: β_jk ≠ 0, and H_{0,G}: β_jk = 0 for all (j, k) ∈ G versus H_{A,G} = H_{0,G}ᶜ, adjusted to control the familywise error rate (a conservative criterion). In total we consider a very large number of hypotheses; we test for non-marginal regression coefficients ("predictive" GWAS).
48 There is structure! Global → 79 response experiments → 23 chromosomes per response experiment → 20 target SNPs per chromosome (= 460 target SNPs per response).
49 Do a hierarchical FWER adjustment (Meinshausen, 2008): 1. test the global hypothesis; 2. if significant: test all single-response hypotheses; 3. for the significant responses: test all single-chromosome hypotheses; 4. for the significant chromosomes: test all target SNPs. Powerful multiple testing with a data-dependent adaptation of the resolution level (our analysis with 20 target SNPs per chromosome is ad hoc); cf. the general sequential testing principle (Goeman & Solari, 2010).
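The four-step descent above can be sketched generically. This is a toy rendering in the spirit of Meinshausen (2008), with the leaves_total/|cluster| inflation of cluster p-values and the rule that children are tested only when the parent is rejected; the tree, the `p_value_of` oracle and all names are hypothetical:

```python
def collect_leaves(node):
    # leaves are plain indices; internal nodes are tuples of children
    if not isinstance(node, tuple):
        return [node]
    return [leaf for child in node for leaf in collect_leaves(child)]

def hierarchical_test(node, leaves_total, p_value_of, alpha=0.05):
    """Top-down hierarchical FWER testing: the raw p-value of a cluster C
    is inflated by the factor leaves_total / |C|, and children are tested
    only if the parent was rejected.  `p_value_of` maps a list of leaves
    to a raw p-value (a stand-in oracle here)."""
    leaves = collect_leaves(node)
    p_adj = min(p_value_of(leaves) * leaves_total / len(leaves), 1.0)
    if p_adj > alpha:
        return []                                  # stop descending here
    rejected = [(tuple(leaves), p_adj)]
    if isinstance(node, tuple):
        for child in node:
            rejected += hierarchical_test(child, leaves_total,
                                          p_value_of, alpha)
    return rejected

# toy tree: global -> two "chromosomes" -> individual "SNPs"
tree = ((0, 1), (2, 3))
# hypothetical p-value oracle: only groups containing variable 0 show signal
p_value_of = lambda leaves: 1e-4 if 0 in leaves else 0.5

rejections = hierarchical_test(tree, leaves_total=4, p_value_of=p_value_of)
```

On this toy input the descent rejects the global cluster, then the cluster (0, 1), then the single leaf 0, and stops everywhere else; coarse clusters face a milder adjustment than single variables, which is the data-dependent resolution adaptation.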
50 Number of significant SNP parameters per response: [Figure: number of significant target SNPs per phenotype, plotted against the phenotype index.] Response 40 has the most significant (levels of) target SNPs.
51 Conclusions: 300 years ago vs. now: computing! But the basic principles appearing in today's books are still related to Ars Conjectandi. Our statistical inference methods are/will be available in the R package hdi ("high-dimensional inference"; Meier, 2013).
53 We can construct asymptotically optimal p-values and confidence intervals in high-dimensional models, assuming suitable conditions: sparsity of the regression Y vs. X; sparsity of the regressions X_j vs. X_{−j} (j = 1, ..., p); the design matrix X is not too ill-posed (e.g. a restricted eigenvalue assumption). These conditions are typically uncheckable... confirmatory high-dimensional inference remains challenging. Thank you!
56 References:
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methodology, Theory and Applications. Springer.
Meinshausen, N., Meier, L. and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association 104.
Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19.
van de Geer, S., Bühlmann, P. and Ritov, Y. (2013). On asymptotically optimal confidence regions and tests for high-dimensional models. arXiv preprint (v1).
Meier, L. (2013). hdi: High-dimensional inference. R package available from R-Forge.
More information11 : Gaussian Graphic Models and Ising Models
10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood
More informationHigh-dimensional Ordinary Least-squares Projection for Screening Variables
1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor
More informationBayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson
Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n
More informationPost-Selection Inference
Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis
More informationDelta Theorem in the Age of High Dimensions
Delta Theorem in the Age of High Dimensions Mehmet Caner Department of Economics Ohio State University December 15, 2016 Abstract We provide a new version of delta theorem, that takes into account of high
More informationPeter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11
Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of
More informationBayesian linear regression
Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding
More informationVariable Selection for Highly Correlated Predictors
Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu Department of Statistics, University of Illinois at Urbana-Champaign WHOA-PSI, Aug, 2017 St. Louis, Missouri 1 / 30 Background Variable
More informationEstimating LASSO Risk and Noise Level
Estimating LASSO Risk and Noise Level Mohsen Bayati Stanford University bayati@stanford.edu Murat A. Erdogdu Stanford University erdogdu@stanford.edu Andrea Montanari Stanford University montanar@stanford.edu
More informationOn Model Selection Consistency of Lasso
On Model Selection Consistency of Lasso Peng Zhao Department of Statistics University of Berkeley 367 Evans Hall Berkeley, CA 94720-3860, USA Bin Yu Department of Statistics University of Berkeley 367
More informationMarginal Screening and Post-Selection Inference
Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2
More informationMSA220/MVE440 Statistical Learning for Big Data
MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification
More informationHigh-dimensional Covariance Estimation Based On Gaussian Graphical Models
High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou, Philipp Rutimann, Min Xu and Peter Buhlmann February 3, 2012 Problem definition Want to estimate the covariance matrix
More informationA Unified Theory of Confidence Regions and Testing for High Dimensional Estimating Equations
A Unified Theory of Confidence Regions and Testing for High Dimensional Estimating Equations arxiv:1510.08986v2 [math.st] 23 Jun 2016 Matey Neykov Yang Ning Jun S. Liu Han Liu Abstract We propose a new
More informationProximity-Based Anomaly Detection using Sparse Structure Learning
Proximity-Based Anomaly Detection using Sparse Structure Learning Tsuyoshi Idé (IBM Tokyo Research Lab) Aurelie C. Lozano, Naoki Abe, and Yan Liu (IBM T. J. Watson Research Center) 2009/04/ SDM 2009 /
More informationLecture 1: Probability Fundamentals
Lecture 1: Probability Fundamentals IB Paper 7: Probability and Statistics Carl Edward Rasmussen Department of Engineering, University of Cambridge January 22nd, 2008 Rasmussen (CUED) Lecture 1: Probability
More informationSparse survival regression
Sparse survival regression Anders Gorst-Rasmussen gorst@math.aau.dk Department of Mathematics Aalborg University November 2010 1 / 27 Outline Penalized survival regression The semiparametric additive risk
More informationRegularization Path Algorithms for Detecting Gene Interactions
Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable
More informationLASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA
The Annals of Statistics 2009, Vol. 37, No. 1, 246 270 DOI: 10.1214/07-AOS582 Institute of Mathematical Statistics, 2009 LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA BY NICOLAI
More informationStatistica Sinica Preprint No: SS R2
Statistica Sinica Preprint No: SS-2017-0041.R2 Title Empirical Likelihood Ratio Tests for Coefficients in High Dimensional Heteroscedastic Linear Models Manuscript ID SS-2017-0041.R2 URL http://www.stat.sinica.edu.tw/statistica/
More informationLearning discrete graphical models via generalized inverse covariance matrices
Learning discrete graphical models via generalized inverse covariance matrices Duzhe Wang, Yiming Lv, Yongjoon Kim, Young Lee Department of Statistics University of Wisconsin-Madison {dwang282, lv23, ykim676,
More informationBayesian Sparse Linear Regression with Unknown Symmetric Error
Bayesian Sparse Linear Regression with Unknown Symmetric Error Minwoo Chae 1 Joint work with Lizhen Lin 2 David B. Dunson 3 1 Department of Mathematics, The University of Texas at Austin 2 Department of
More informationBiostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences
Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per
More informationA Practical Scheme and Fast Algorithm to Tune the Lasso With Optimality Guarantees
Journal of Machine Learning Research 17 (2016) 1-20 Submitted 11/15; Revised 9/16; Published 12/16 A Practical Scheme and Fast Algorithm to Tune the Lasso With Optimality Guarantees Michaël Chichignoud
More informationSummary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club
Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club 36-825 1 Introduction Jisu Kim and Veeranjaneyulu Sadhanala In this report
More information25 : Graphical induced structured input/output models
10-708: Probabilistic Graphical Models 10-708, Spring 2013 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche) 1 Graph
More informationKnockoffs as Post-Selection Inference
Knockoffs as Post-Selection Inference Lucas Janson Harvard University Department of Statistics blank line blank line WHOA-PSI, August 12, 2017 Controlled Variable Selection Conditional modeling setup:
More informationA significance test for the lasso
1 Gold medal address, SSC 2013 Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Reaping the benefits of LARS: A special thanks to Brad Efron,
More informationPeter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11
Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of
More informationRegularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008
Regularized Estimation of High Dimensional Covariance Matrices Peter Bickel Cambridge January, 2008 With Thanks to E. Levina (Joint collaboration, slides) I. M. Johnstone (Slides) Choongsoon Bae (Slides)
More informationAsymptotic Equivalence of Regularization Methods in Thresholded Parameter Space
Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Jinchi Lv Data Sciences and Operations Department Marshall School of Business University of Southern California http://bcf.usc.edu/
More informationRegression Shrinkage and Selection via the Lasso
Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,
More informationThe Iterated Lasso for High-Dimensional Logistic Regression
The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of
More informationOr How to select variables Using Bayesian LASSO
Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection
More informationISyE 691 Data mining and analytics
ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationBootstrap & Confidence/Prediction intervals
Bootstrap & Confidence/Prediction intervals Olivier Roustant Mines Saint-Étienne 2017/11 Olivier Roustant (EMSE) Bootstrap & Confidence/Prediction intervals 2017/11 1 / 9 Framework Consider a model with
More informationarxiv: v1 [stat.ml] 3 Nov 2010
Preprint The Lasso under Heteroscedasticity Jinzhu Jia, Karl Rohe and Bin Yu, arxiv:0.06v stat.ml 3 Nov 00 Department of Statistics and Department of EECS University of California, Berkeley Abstract: The
More informationMinwise hashing for large-scale regression and classification with sparse data
Minwise hashing for large-scale regression and classification with sparse data Nicolai Meinshausen (Seminar für Statistik, ETH Zürich) joint work with Rajen Shah (Statslab, University of Cambridge) Simons
More informationGWAS IV: Bayesian linear (variance component) models
GWAS IV: Bayesian linear (variance component) models Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS IV: Bayesian
More informationLasso-type recovery of sparse representations for high-dimensional data
Lasso-type recovery of sparse representations for high-dimensional data Nicolai Meinshausen and Bin Yu Department of Statistics, UC Berkeley December 5, 2006 Abstract The Lasso (Tibshirani, 1996) is an
More informationarxiv: v1 [math.st] 13 Feb 2012
Sparse Matrix Inversion with Scaled Lasso Tingni Sun and Cun-Hui Zhang Rutgers University arxiv:1202.2723v1 [math.st] 13 Feb 2012 Address: Department of Statistics and Biostatistics, Hill Center, Busch
More information