LARGE NUMBERS OF EXPLANATORY VARIABLES. H.S. Battey. WHAO-PSI, St Louis, 9 September 2018

Size: px

Start display at page:

Download "LARGE NUMBERS OF EXPLANATORY VARIABLES. H.S. Battey. WHAO-PSI, St Louis, 9 September 2018"

Junior Robbins
5 years ago
Views:

1 LARGE NUMBERS OF EXPLANATORY VARIABLES HS Battey Department of Mathematics, Imperial College London WHAO-PSI, St Louis, 9 September 2018

2 Regression, broadly defined Response variable Y i, eg, blood pressure, disease status, survival time, Vector X i of v potential explanatory variables Observed on units, eg patients, i = 1,, n v n, eg X i arising from gene expression analysis

3 Goal: scientific understanding Sparsity is critical for interpretability and statistical stability Lasso (Tibshirani, 1996): penalized LS Minimize Y X β λ β 1 More generally penalized MLE There results a single model Cox, D R and Battey, H S (2017), Proc Nat Acad Sci, 114, : aim for a confidence set of models

4 Goal: scientific understanding Sparsity is critical for interpretability and statistical stability Lasso (Tibshirani, 1996): penalized LS Minimize Y X β λ β 1 More generally penalized MLE There results a single model Cox, D R and Battey, H S (2017), Proc Nat Acad Sci, 114, : aim for a confidence set of models

5 Arrange variable indices in a k k k hypercube where k 15 Traverse the cube: rows, columns, etc Assess each variable several times alongside k 1 different companions

6 Arrange variable indices in a k k k hypercube where k 15 Traverse the cube: rows, columns, etc Assess each variable several times alongside k 1 different companions

7 Arrange variable indices in a k k k hypercube where k 15 Traverse the cube: rows, columns, etc Assess each variable several times alongside k 1 different companions

8 Arrange variable indices in a k k k hypercube where k 15 Traverse the cube: rows, columns, etc Assess each variable several times alongside k 1 different companions Retain or discard variables according to a decision rule Repeat Informal checks Test low dimensional subsets for compatibility with the data A confidence set of models

9 Arrange variable indices in a k k k hypercube where k 15 Traverse the cube: rows, columns, etc Assess each variable several times alongside k 1 different companions Retain or discard variables according to a decision rule Repeat Informal checks Test low dimensional subsets for compatibility with the data A confidence set of models

10 Arrange variable indices in a k k k hypercube where k 15 Traverse the cube: rows, columns, etc Assess each variable several times alongside k 1 different companions Retain or discard variables according to a decision rule Repeat Informal checks Test low dimensional subsets for compatibility with the data A confidence set of models

11 TYPICAL OUTPUT Model Recoded variable indices This arose from an example with n =129 individuals: 105 with osteoarthritis, 24 controls v 50, 000 genetic variables Statistics can take us no further Any choice between such models would require additional data or subject matter expertise

12 THREE PHASES A reduction phase uses significance tests as an informal guide to discarding variables On the reduced set, an exploratory phase allows assessment of anomalies A model selection phase assesses candidate models for their compatibility with the data

13 REDUCTION PHASE

14 REMARKS ON THE REDUCTION PHASE No restriction to perfect (hyper)cubes Partially balanced incomplete block designs (Yates, 1936) A more severe reduction than marginal screening Arrangement rerandomization Prior assessment of importance Binary responses and separating hyperplanes

15 KEY ASPECTS OF ANALYSIS What is the probability that a signal variable is falsely discarded? How many of the variables ultimately suggested as potentially important are noise variables?

16 Battey, HS and Cox, DR (2018), Proc R Soc Lond A, 474: Specify behaviour under idealized conditions Provide guidance on decision rules The goal is not to set up a procedure to achieve pre-assigned error rates

17 SOME FIRST-STAGE REDUCTION STRATEGIES Retain the single variable with highest score (lowest p-value); Retain the two variables with highest scores; Retain all variables, if any, whose scores exceed a threshold In the second stage of the procedure, the third strategy is always used

18 A SIMPLIFYING APPROXIMATION Analyses involving the same variable are treated as independent Slightly pessimistic assessment of the proportion of correctly retained signal variables Slightly optimistic assessment of the number of falsely retained noise variables blanksignal variables are on an equal footing (same signal strength)

19 Behaviour is governed largely by θ, the probability that a particular signal variable survives reduction in any single analysis in which it appears This depends on the reduction strategy used

20 In a set of k variables chosen at random for test, the number of signal variables has a Poisson distribution of mean a kv S0 /v 1 e a (1 + a) lim a 0 1 e a = 0 So θ ϑ for small a, where ϑ is the survival probability conditioned on the event that there is exactly one signal variable in the set

21 SOME ANALYSIS OF FIRST-STAGE REDUCTION To avoid distributional assumptions on the response variable we frame the discussion in terms of p-values a For noise variables these are uniformly distributed on (0, 1) For signal variables, we model their density as (1 γ)x γ, 0 < γ < 1 a A parallel analysis for the normal theory linear model gives qualitatively the same conclusions

22 STRATEGY 1 ϑ (1) = pr{signal beats best of (k 1) noise} = (1 γ) 1 0 dx(1 x) k 1 x γ Γ(1 γ)γ(k) = (1 γ) Γ(1 γ + k) Γ(2 γ)/k1 γ Γ( ) is close to one over the range of interest If γ = 0 the notional signal variable is selected wp 1/k, ie at random

23 STRATEGY 2 ϑ (2) = pr(signal among 2 most significant) = ϑ (1) + ϑ (21), where ϑ (21) = pr(signal comes second) = 1 0 dx(k 1)(1 x) k 2 x(1 γ)x γ (1 γ)γ(2 γ)/k 1 γ ϑ (2) ϑ (1) is negligible for γ close to 1 If γ = 0, ϑ (2) = 2/k

24 APPROXIMATION ERROR Exact Approx Exact Approx Figure: Exact and approximate values of ϑ (1) (left) and ϑ (21) (right) over γ (0, 1) for k = 10

25 STRATEGY 3 If we set a critical level α, then ϑ (3) = α 0 dx(1 γ)x γ = α 1 γ

26 COMPARISON OF STRATEGIES Choose α to equalize survival probabilities of signal variables and compare expected number of retained noise variables Stategies 1 and 3 are equivalent at this α 2 dominates 3, ie, strategy 3 selects more noise variables on average for the same probability of selecting a signal variable Strategy 2 increases number of retained noise variables by roughly a factor of four over strategy 1

27 EXPLORATORY PHASE The procedure is intended to be largely exploratory Data on the reduced set of variables should be subjected to a thorough exploratory analysis eg inspect interaction plots, using significance tests as an informal guide for identifying potential interactions

28 FINAL PHASE: SETS OF WELL-FITTING MODELS

29 A POST-SELECTION INFERENCE PROBLEM Confidence sets of models comprise all low-dimensional subsets that pass a LR test against the comprehensive model Comprehensive model selected in the light of the data Fits better than an arbitrary model nesting the one to be tested Reject H 0 too often in hypothetical repeated sampling Conditional coverage probabilities of model confidence sets are lower than nominal A wasteful solution: split the sample

30 In each Monte Carlo replication: A SIMPLE EXPERIMENT Generate n = 10 2 replicates of v = 10 3 variables from a N(0, PΣP 1 ), P a permutation matrix Σ is an identity matrix with one diag block replaced by a correlation matrix of dim v S0 + v C0 and equal correlation ρ

31 A SIMPLE EXPERIMENT (CONTINUED) For each i = 1,, n, v S0 of the v S0 + v C0 correlated variables are multiplied by a constant signal and added, together with standard normal noise This is Y i Arrange variable indices in cube and reduce by strategy 2 followed by strategy 3 at the 01% level LR test against the comprehensive model

32 MC ESTIMATES: 2 4 FACTORIAL EXPERIMENT pr(s Ŝ) pr(s M) E M\S v S0 v C0 ρ signal noise lasso full sample split sample full sample split sample full sample split sample (004) 100 (000) 100 (000) 057 (050) 099 (008) 68 (90) 157 (300) (021) 096 (021) 074 (044) 045 (050) 074 (044) 54 (65) 131 (256) (000) 100 (000) 100 (004) 055 (050) 098 (013) 44 (59) 106 (498) (012) 096 (019) 076 (043) 041 (049) 076 (043) 26 (37) 110 (278) (004) 100 (006) 098 (013) 067 (047) 097 (016) 273 (273) 684 (108) (030) 093 (026) 073 (045) 048 (050) 072 (045) 235 (211) 398 (902) (000) 100 (000) 099 (008) 062 (049) 099 (012) 113 (174) 122 (300) (010) 095 (022) 076 (043) 038 (049) 075 (043) 41 (62) 947 (206) (008) 100 (000) 100 (004) 095 (021) 098 (013) 75 (81) 105 (143) (041) 099 (009) 095 (022) 088 (033) 095 (023) 462 (398) 182 (255) (000) 100 (000) 100 (000) 096 (019) 099 (009) 00 (00) 154 (261) (004) 100 (000) 098 (014) 090 (031) 097 (017) 14 (26) 807 (120) (010) 100 (000) 100 (004) 096 (019) 099 (011) 175 (145) 303 (317) (043) 098 (013) 091 (028) 091 (029) 090 (030) 116 (93) 430 (405) (000) 100 (000) 100 (004) 098 (015) 099 (011) 00 (04) 413 (799) (004) 100 (000) 096 (019) 092 (028) 095 (021) 36 (55) 217 (325) Table: S is the true set of signal variables, Ŝ is the set of variables surviving the reduction phase, M is the set of low dimensional models whose likelihood ratio test against the comprehensive model is not rejected at the 1% level Empirical standard errors in parenthesis Lasso is the full sample version, tuned to pick as many variables as are retained through the reduction phase

33 ESTIMATED MAIN EFFECTS outcome v S0 v C0 ρ signal/noise I1{S Ŝ} 1 88 (0 11) 0 19 (0 09) 0 28 (0 09) 3 94 (0 26) I1{S M} 1 56 (0 10) 0 19 (0 09) 0 26 (0 09) 2 64 (0 14) M\S 0 51 (0 00) 0 48 (0 00) 2 67 (0 01) 1 35 (0 01) Table: Monte Carlo estimates of effects of v S0, v C0, ρ and signal/noise (dichotomized to aid interpretation) on the logit scale for binary outcomes (first two rows) and on the log scale for counts (third row)

34 SUMMARY If the data are compatible with several alternative reasonable explanations, one should aim to specify as many as is feasible This view is in contraposition to that implicit in the use of the lasso and similar methods Any choice between well-fitting models must be based on subject-matter expertise or additional data

35 REFERENCES Cox, D R and Battey, H S (2017) Large numbers of explanatory variables, a semi-descriptive analysis, Proc Nat Acad Sci, 114, Battey, HS and Cox, D R (2018), Large numbers of explanatory variables: a probabilistic assessment, Proc R Soc Lond A, 474 SOFTWARE Matlab code is available at hbattey/softwarecube An R package, HCmodelSets, written by H H Hoeltgebaum is available on the Comprehensive R Archive Network

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong