Sparse survival regression

1 Sparse survival regression
Anders Gorst-Rasmussen
Department of Mathematics, Aalborg University
November

2 Outline
Penalized survival regression
    The semiparametric additive risk model.
    Theoretical results.
    Software.
Ultra high dimension
    Independence screening.

3 Recap: The semiparametric additive risk (SAR) model
Assume the hazard function given covariates is
$$\lambda(t \mid Z_i) = \lambda_0(t) + Z_i^\top \beta^0$$
for some unspecified baseline $\lambda_0$ and covariate $Z_i \in \mathbb{R}^p$.
Lin & Ying (1994): estimate $\beta^0$ as the solution to $S_n \beta = s_n$, where
$$S_n = n^{-1} \sum_{i=1}^n \int_0^\tau Y_i(t)\,(Z_i - \bar Z(t))^{\otimes 2}\,dt, \qquad s_n = n^{-1} \sum_{i=1}^n \int_0^\tau (Z_i - \bar Z(t))\,dN_i(t),$$
with $\bar Z(t)$ the at-risk average of the $Z_i$s and $Y_i$ the at-risk indicator.
Asymptotics are easy using the signal + error decomposition
$$s_n = S_n \beta^0 + n^{-1} \sum_{i=1}^n \int_0^\tau (Z_i - \bar Z(t))\,dM_i(t), \qquad M_i \text{ a martingale}.$$
Notice the similarity with the linear model, $X^\top y = X^\top X \beta + X^\top \varepsilon$.
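The estimating equation above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: covariates are time-fixed, there is no left truncation, and survival times have no ties. The function name `lin_ying` and the simulated data are hypothetical and not part of any package.

```r
# Sketch: compute S_n and s_n for time-fixed covariates, assuming no ties.
lin_ying <- function(time, status, Z) {
  Z <- as.matrix(Z)
  n <- nrow(Z); p <- ncol(Z)
  ord <- order(time)
  time <- time[ord]; status <- status[ord]; Z <- Z[ord, , drop = FALSE]
  S <- matrix(0, p, p)
  s <- numeric(p)
  prev <- 0
  for (k in seq_len(n)) {
    # Subjects k..n are at risk on the interval (prev, time[k]]
    Zr <- Z[k:n, , drop = FALSE]
    Zbar <- colMeans(Zr)
    Zc <- sweep(Zr, 2, Zbar)
    # The integrand of S_n is constant on the interval
    S <- S + (time[k] - prev) * crossprod(Zc)
    # s_n only jumps at observed events
    if (status[k] == 1) s <- s + (Z[k, ] - Zbar)
    prev <- time[k]
  }
  list(S = S / n, s = s / n)
}

# Simulated additive hazard: rate 1 + 0.5 * z, no censoring (toy data)
set.seed(1)
n <- 500
z <- rbinom(n, 1, 0.5)
time <- rexp(n, rate = 1 + 0.5 * z)
fit <- lin_ying(time, rep(1, n), z)
beta_hat <- solve(fit$S, fit$s)
```

Solving the linear system `S %*% beta = s` is the whole estimation step, which is what makes the analogy with least squares so direct.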

4 What do we do if p > n?
Course of action depends on the purpose:
1. Model selection?
2. Low prediction error?
A useful guideline is how practitioners prefer to think: "Few features convey all of the effect." Nice if statistical models can reflect this.
Classical sparsity modeling:
    Multiple testing.
    All subsets/forward stepwise regression.
    Ridge regression and truncation.
    Etc.
Computational/practical issues; weak theoretical justification.

5 Lasso for the SAR model
Least Absolute Shrinkage and Selection Operator.
SAR estimation is equivalent to minimization of
$$L(\beta) = \beta^\top S_n \beta - 2\beta^\top s_n.$$
Lasso penalized version:
$$L_1(\beta) = L(\beta) + \lambda \sum_{i=1}^p |\beta_i|,$$
with $\lambda \ge 0$ a regularization parameter (compare with ridge regression, where the penalty is the squared $L_2$-norm of $\beta$).
Convex problem (so computationally feasible).
Simple form (so theoretically tractable).
Key property: Lasso does variable selection.

6 Why lasso does variable selection
[Figure: two panels with axes $\beta_1$ and $\beta_2$ and the constraint bound $t$ marked.]
Equivalent formulation: minimize $L(\beta)$ subject to $\sum_{i=1}^p |\beta_i| \le t$.
Continuous subset selection.
No hypothesis testing, only optimization.

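7 Lasso is (typically) a shrinkage estimator
Suppose $Z_i = (Z_{i1},\ldots,Z_{ip})$ with $Z_{i1},\ldots,Z_{ip}$ independent.
Lasso solutions with diagonal $S_n$:
$$\hat\beta_j = S\big(\hat\beta_j^{LS},\ \lambda / S_{n,jj}\big),$$
with $\hat\beta_j^{LS} = S_{n,jj}^{-1} s_{n,j}$ the least-squares estimate and
$$S(x,y) = \operatorname{sign}(x)\,(|x| - y)_+$$
the soft-thresholding operator (a constant factor is absorbed into $\lambda$ relative to the criterion $L_1$). So:
    $S_{n,jj}\,|\hat\beta_j^{LS}| \le \lambda \implies \hat\beta_j = 0$;
    $S_{n,jj}\,|\hat\beta_j^{LS}| > \lambda \implies 0 < |\hat\beta_j| < |\hat\beta_j^{LS}|$ (with correct sign);
    $\lambda = 0 \implies \hat\beta = \hat\beta^{LS}$.
Also how we like to think of lasso in the non-orthogonal case.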
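The soft-thresholding operator is a one-liner in R. The sketch below (function name `soft_threshold` is our own, not from a package) shows the two regimes: values with magnitude below the threshold are set to zero, larger ones are shrunk toward zero with their sign kept.

```r
# S(x, y) = sign(x) * (|x| - y)_+, vectorized over x
soft_threshold <- function(x, y) sign(x) * pmax(abs(x) - y, 0)

shrunk <- soft_threshold(c(-3, 0.5, 2), 1)
```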

8 Example: lasso regularization path
[Figure: lasso regularization path; standardized coefficients against $|\beta| / \max|\beta|$.]

9 From regularization paths to models
Selecting a model requires a good $\lambda$. How to choose?
K-fold cross-validation using the least squares loss: i.e. with $L(\beta) = \beta^\top S_n \beta - 2\beta^\top s_n$, choose $\lambda$ to minimize
$$CV(\lambda) = K^{-1} \sum_{i=1}^K L^{(i)}\big(\hat\beta^{(-i)}(\lambda)\big),$$
with $L^{(i)}$ calculated from the $i$th fold and $\hat\beta^{(-i)}(\lambda)$ from the remaining folds.
For lasso, 5-fold CV is often considered sufficient.
Note that CV...
    ...is a stochastic procedure; stability may be poor for small $n$.
    ...focuses on prediction optimality and generally overfits.
Alternatives:
    Generalized cross-validation.
    AIC, BIC etc.
    Bootstrapping/subsampling.
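The CV criterion above can be sketched generically. The helper `cv_lambda` and its two callbacks are hypothetical names introduced for illustration: `fit_fun(train, lambda)` stands for any fitting routine run on the training indices, and `loss_fun(test, beta)` for the loss $L^{(i)}$ evaluated on the held-out fold. The toy usage applies it to a shrunken mean rather than to a survival model.

```r
# Generic K-fold cross-validation over a grid of lambda values
cv_lambda <- function(n, lambdas, fit_fun, loss_fun, K = 5) {
  fold <- sample(rep(seq_len(K), length.out = n))  # random fold labels
  cv <- sapply(lambdas, function(lam) {
    mean(sapply(seq_len(K), function(k) {
      test <- which(fold == k)
      train <- which(fold != k)
      loss_fun(test, fit_fun(train, lam))
    }))
  })
  list(lambda = lambdas, cv = cv, lambda.min = lambdas[which.min(cv)])
}

# Toy usage: shrinkage estimate of a mean (stand-in for a penalized fit)
set.seed(42)
y <- rnorm(60, mean = 5)
fit_fun <- function(train, lam) mean(y[train]) / (1 + lam)
loss_fun <- function(test, b) mean((y[test] - b)^2)
res <- cv_lambda(length(y), c(0, 1, 10), fit_fun, loss_fun)
```

Because the fold assignment is random, repeated calls without a fixed seed can select different values of lambda, which is the instability the slide warns about.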

10 Computation
Good: Choose some $\lambda$s; fit using e.g. quadprog in R.
Better: Path following algorithm (stagewise/conjugate gradient):
Start with $\beta_1 = \cdots = \beta_p = 0$; set $r = s_n$.
1. Find $j$ such that $|r_j|$ is maximal.
2. Set $\beta_j \leftarrow \beta_j + \varepsilon_j \operatorname{sign}(r_j)$.
3. Set $r \leftarrow s_n - S_n \beta$.
Least Angle Regression (LARS), Efron et al. (2004): compute $\varepsilon_j$ so that we get a new $j$ in step 1 every time.
Yields the complete regularization path in $\min\{n, p\}$ steps (piecewise linear!).
Implemented in timereg function surv.lars.
Expensive: we need both $s_n$ and $S_n$.
Best: Cyclic coordinate descent (we need only some of $S_n$).
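The three stagewise steps above translate almost literally into R. This sketch uses a fixed small step `eps` instead of the LARS step length, so it is the plain stagewise variant, not LARS itself; the function name `stagewise` and the stopping tolerance are assumptions made for illustration.

```r
# Forward stagewise fitting for the criterion based on (S, s)
stagewise <- function(S, s, eps = 0.01, steps = 1000) {
  beta <- numeric(length(s))
  r <- s                                  # residual correlation at beta = 0
  for (k in seq_len(steps)) {
    j <- which.max(abs(r))                # step 1: most correlated coordinate
    if (abs(r[j]) < 1e-8) break           # nothing left to explain
    beta[j] <- beta[j] + eps * sign(r[j]) # step 2: small move in that direction
    r <- s - drop(S %*% beta)             # step 3: update residual correlation
  }
  beta
}

b <- stagewise(diag(2), c(1, 0))
```

Recording `beta` after each step would trace out an approximate regularization path.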

11 The experimental package ahaz for R
Cyclic coordinate descent
Friedman et al. (2007); fairly recent R package glmnet.
Minimize convex $L\colon \mathbb{R}^p \to \mathbb{R}$ by
    initializing $\tilde\theta_1 = \cdots = \tilde\theta_p = 0$, say;
    sequentially iterating coordinatewise updates until convergence:
    $$\tilde\theta_i \leftarrow \operatorname{argmin}_{\theta_i} L(\tilde\theta_1,\ldots,\tilde\theta_{i-1},\theta_i,\tilde\theta_{i+1},\ldots,\tilde\theta_p).$$
Many iterations needed, but they are cheap.
The SAR model (ongoing work)
Coordinatewise updates of the form
$$\beta_j \leftarrow S\Big(s_j - \sum_{i \in A \setminus \{j\}} \beta_i S_{ij},\ \lambda\Big)\, S_{jj}^{-1}, \qquad A := \{k : \beta_k \neq 0\},$$
with $S(x,y) = \operatorname{sign}(x)\,(|x| - y)_+$.
We need only rows in $S_n$ for active variables.
Very stable (an issue for nonlinear models and $p \gg n$).
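A minimal sketch of the coordinatewise update, not the ahaz implementation. It assumes the criterion is scaled as $\tfrac12 \beta^\top S \beta - \beta^\top s + \lambda \sum_j |\beta_j|$ so that the threshold is exactly $\lambda$, and it loops over all coordinates rather than tracking the active set. The names `cd_lasso` and `soft_threshold` are hypothetical.

```r
soft_threshold <- function(x, y) sign(x) * pmax(abs(x) - y, 0)

# Cyclic coordinate descent for 0.5 * b'Sb - b's + lambda * sum(|b|)
cd_lasso <- function(S, s, lambda, tol = 1e-10, max_iter = 1000) {
  p <- length(s)
  beta <- numeric(p)
  for (iter in seq_len(max_iter)) {
    beta_old <- beta
    for (j in seq_len(p)) {
      # s_j minus the contribution of all other coordinates
      partial <- s[j] - sum(S[j, -j] * beta[-j])
      beta[j] <- soft_threshold(partial, lambda) / S[j, j]
    }
    if (max(abs(beta - beta_old)) < tol) break
  }
  beta
}

S <- matrix(c(2, 0.5, 0.5, 1), 2)
s <- c(1, 1)
b_unpen <- cd_lasso(S, s, lambda = 0)                    # unpenalized solution
b_diag <- cd_lasso(diag(c(2, 1)), c(3, 0.5), lambda = 1) # diagonal case
```

Each update touches only row `j` of `S`, which is why an implementation that skips inactive coordinates never needs the full matrix.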

12 ahaz in action
Metzeler et al. (2008) data
Predict time to acute myeloid leukaemia from gene expression.
p = ; n = 242. Additional test set of 79 patients.
100-grid lasso path: 10 seconds on a standard laptop.
Vanilla lasso + 5-fold cross-validation: 14 nonzero parameters.
Continuous risk predictor has HR = 1.55 (p = 0.003).
Metzeler et al. (2008): using 86 genes, HR = 1.8 (p = 0.001).
[Figure: lasso coefficient paths; coefficients against L1 norm, with the number of nonzero parameters as a second axis.]

13 Lasso asymptotics (no IID decompositions, sorry)
When does lasso select the right model? I.e. when does there exist a sequence $\lambda_n$ such that
$$P\big(\underbrace{\hat M(\lambda_n)}_{\text{Est. model}} = \underbrace{M}_{\text{True model}}\big) \to 1, \quad n \to \infty?$$
Sufficient: strong irrepresentable condition (Zhao & Yu 2006). This is a technical condition on $S_n$ which holds, for example, under:
    Almost orthogonal design (depending on $M$).
    Constant correlation.
    Power decay correlation.
Restrictive but close to necessary.
Meinshausen & Yu (2009): under much weaker conditions
$$\|\hat\beta(\lambda_n) - \beta^0\|^2 = \sum_{i=1}^p \big(\hat\beta_i(\lambda_n) - \beta_i^0\big)^2 \xrightarrow{P} 0.$$
So lasso should get large effects with large probability.
Greenshtein & Ritov (2004): prediction consistency.

14 The much-sought-after oracle property
If we have selection consistency, we sacrifice $\sqrt{n}$-consistency.
Consider the adaptive lasso (Zou 2006) with criterion
$$L_1(\beta) = \beta^\top S_n \beta - 2\beta^\top s_n + \lambda_n \sum_{i=1}^p w_i |\beta_i|,$$
where $w_i = |\tilde\beta_i|^{-1}$ for $\tilde\beta$ some $\sqrt{n}$-consistent estimator of $\beta^0$.
Then we can choose $\lambda_n$ such that the oracle property holds:
1. $P(\hat M(\lambda_n) = M) \to 1$.
2. $n^{1/2}(\hat\beta_M - \beta_M^0)$ is asymptotically normal with correct variance.
See Ma & Leng (2007); Martinussen & Scheike (2009).
So adaptive lasso is as good as an oracle. But:
    How to get a good $\sqrt{n}$-consistent estimator?
    Fixed-parameter asymptotics can be deceiving!
A two-stage approach can be useful: tune the 1st stage lasso to prediction; use weights $w_i = |\hat\beta_i^{\text{lasso, 1st stage}}|^{-1}$ in the 2nd stage.

15 Example: lasso and friends for the Sorlie data
    library(survival)  # Surv, coxph, survfit
    library(ahaz)      # ahaz, ahazpen, cv.ahazpen (experimental version used in these slides)
    # The data frame 'sorlie' is assumed to be loaded already
    set.seed(17)
    X <- as.matrix(sorlie[,11:ncol(sorlie)])
    # Add a small random jitter to break ties in the survival times
    surv <- Surv(sorlie$time+1e-3*runif(nrow(sorlie)),sorlie$status)

    # Lasso
    m <- ahazpen(surv,X); plot(m)
    cvla <- cv.ahazpen(surv,X,dfmax=75); plot(cvla)
    fitla <- ahazpen(surv,X,lambda=cvla$lambda.min); fitla

    # Weighted lasso (weights from the first-stage lasso fit)
    cvala <- cv.ahazpen(surv,X,penalty.factor=1/abs(fitla$beta),lambda.min=1e-4)
    fitala <- ahazpen(surv,X,penalty.factor=1/abs(fitla$beta),lambda=cvala$lambda.min); fitala

    # Lasso with a non-penalized predictor (grade)
    summary(ahaz(surv,sorlie$grade))
    m <- ahazpen(surv,cbind(sorlie$grade,X),keep=1)
    cvgra <- cv.ahazpen(surv,cbind(sorlie$grade,X),keep=1)
    fitgra <- ahazpen(surv,cbind(sorlie$grade,X),lambda=cvgra$lambda.min,keep=1); fitgra

    # Risk scores
    risksc.las <- predict(fitla,X,"lp")  # (or just X %*% fitla$beta)
    risksc.ala <- predict(fitala,X,"lp")
    risksc.gra <- scale(cbind(sorlie$grade,X) %*% fitgra$beta)

    # Compare model fit
    summary(coxph(surv~risksc.las))$rsq
    summary(coxph(surv~risksc.ala))$rsq
    summary(coxph(surv~risksc.gra))$rsq

    # Dichotomize each risk score at its median and plot survival curves
    f <- function(x) as.numeric(x>median(x))
    plot(survfit(surv~f(risksc.las)))
    lines(survfit(surv~f(risksc.ala)),col=2)
    lines(survfit(surv~f(risksc.gra)),col=3)
    legend("bottomleft",c("lasso","adaptive","w/grade"),lty=1,col=1:3)

16 How certain are the lasso estimates?
Standard errors; sandwich estimators/bootstrapping.
Consistent only for nonzero parameters; and how to use?
Monte Carlo methods may be more useful. E.g. stability selection (Meinshausen & Bühlmann 2010).
Lasso on subsamples; calculate empirical selection probability.
[Figure: empirical probability of selection against lambda.]
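The subsampling idea can be sketched as follows. The helper `stability_selection` is a hypothetical name; it accepts any selector `select_fun(idx)` that returns the indices of the variables chosen on the subsample `idx`. To keep the example self-contained, the selector used here picks the single variable most correlated with a continuous response. That is a stand-in for a lasso fit, not the procedure of Meinshausen & Bühlmann in full.

```r
# Empirical selection probabilities from repeated subsampling
stability_selection <- function(n, p, select_fun, B = 100, frac = 0.5) {
  counts <- numeric(p)
  m <- floor(frac * n)
  for (b in seq_len(B)) {
    idx <- sample.int(n, m)          # subsample without replacement
    sel <- select_fun(idx)           # variables selected on this subsample
    counts[sel] <- counts[sel] + 1
  }
  counts / B
}

# Toy data: only the first variable carries signal
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] + rnorm(n)
top1 <- function(idx) which.max(abs(cor(X[idx, ], y[idx])))
freq <- stability_selection(n, p, top1, B = 50)
```

Running the same loop over a grid of penalty values gives one selection-probability curve per variable.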

17 Lasso is a great screening method: example
Simulations from a Cox model with 5 nonzero parameters (indices randomly chosen), normally distributed covariates.
n = 200, p = 1000; block structure on the covariance matrix.
Average TPR/FPR over 25 independent simulations.
[Figure: true positive rate against false positive rate for SAR lasso and univariate Cox p-values.]

18 Additional useful knowledge
Beyond the lasso:
    Elastic net: combine lasso (L1) and ridge (L2) penalties. Yields joint selection of correlated predictors.
    SCAD penalty: convexity of the lasso penalty is responsible for poor model selection. Replace with a non-convex penalty to get the oracle property.
    MC+.
    Dantzig selector.
    Etc.
Beyond the SAR model:
    Cox model (including path following algorithms). E.g. glmnet, penalized, glcoxph in R.
    Accelerated failure time models.
But computation and theoretical analysis can be difficult.

19 The case of ultra-high dimension
$p$ of order exponential in $n$; e.g. 2nd order interactions in microarray studies.
Penalized variable selection is computationally too intensive, even with fast coordinate descent methods.
Most lasso theory works only when $p = O(n^\alpha)$ (at most!).
Alternatives?
    Prediction: Anything goes, so just pick variables marginally correlated with survival time (and apply a model of choice).
    Model selection: Much harder, but can we do something similar?

20 Sure independence screening (SIS): linear regression
Assume $y = X\beta + \varepsilon$ (with standardized predictors).
Estimated model:
$$\hat M_n = \{1 \le j \le p : |e_j^\top X^\top y| > \gamma_n\};$$
simple hard thresholding of regression coefficients.
Fan & Lv (2008): When $\gamma_n \to 0$ at a suitable rate, if $M$ denotes the true model, we have the sure screening property:
$$P(M \subseteq \hat M_n) \to 1,$$
even with $p$ exponential in $n$, assuming (with $\Sigma = E X^\top X$)
1. "Semi-orthogonality": $|e_j^\top \Sigma \beta^0|$ and $|\beta_j^0|$ are large for $j \in M$.
2. $\Sigma$ and $X^\top X$ are sufficiently regular.
Iterated SIS: Condition (1) easily fails. Heuristic iterative procedure:
Set $r_1 := X^\top y$. For $i = 1, 2, \ldots$
1. Calculate $\hat M_i$ by SIS with $r_i$.
2. Estimate $\beta^0_{\hat M_i}$ by (penalized) regression, assuming SAR model.
3. Take $r_{i+1} = r_i - (X^\top X)\hat\beta_{\hat M_i}$ (residual correlation).
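Vanilla SIS for the linear model amounts to ranking $|X^\top y|$. In this sketch the threshold $\gamma_n$ is replaced by keeping the `d` largest entries, which is the same rule expressed through the model size; the function name `sis_screen` and the simulated data are assumptions for illustration.

```r
# Keep the d predictors with the largest absolute marginal correlation
sis_screen <- function(X, y, d) {
  Xs <- scale(X)                              # standardized predictors
  omega <- drop(crossprod(Xs, y - mean(y)))   # componentwise X'y
  sort(order(abs(omega), decreasing = TRUE)[seq_len(d)])
}

# Toy data: variables 3 and 10 are the true model
set.seed(7)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 3] - 2 * X[, 10] + rnorm(n)
keep <- sis_screen(X, y, d = 5)
```

The cost is a single matrix-vector product, which is why the approach remains feasible when $p$ is far too large for penalized regression.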

21 SIS: generally
SIS is essentially a study of misspecification: assuming some joint model, when can marginal parameters be used to decide the sign of the (joint) parameter?
Fan et al. (2009): if model fitting corresponds to minimizing a pseudo-likelihood $L(\beta) = \sum_{i=1}^n Q(Z_i^\top \beta)$:
1. Vanilla SIS according to marginal utilities
$$L_j = \min_{\beta_j} n^{-1} \sum_{i=1}^n Q(Z_{ij}\beta_j); \quad j = 1,\ldots,p;$$
rank according to size, the smaller the more important.
2. Iterated SIS: if the 1st stage model is $\hat M$, calculate
$$L_j^{(2)} = \min_{\beta_j, \beta_{\hat M}} n^{-1} \sum_{i=1}^n Q\big(Z_{ij}\beta_j + Z_{i,\hat M}^\top \beta_{\hat M}\big); \quad j \in \{1,\ldots,p\} \setminus \hat M;$$
and combine the 2nd stage model based on $\{L_j^{(2)}\}$ with $\hat M$ using penalized regression (allows deletion of features). Iterate until stable.
Refer to Fan et al. (2009) for a bag of tricks.

22 Independence screening: the Cox model
R-package SIS. Uses a Cox adaptation of SCAD penalization for the iterated variant.
Everything is so far completely heuristic; but works well.
But caution is advised: even for independent covariates, marginal Cox estimates are well known to be inconsistent.
SIS example:
    library(survival)  # Surv
    library(SIS)       # COXvanISISscad
    # Read and break ties ('sorlie' is assumed to be loaded already)
    X <- as.matrix(sorlie[,11:ncol(sorlie)])
    surv <- Surv(sorlie$time+1e-3*runif(nrow(sorlie)),sorlie$status)
    # Example ISIS
    cox.van.sis <- COXvanISISscad(X, surv[,1], surv[,2])
    cox.van.sis$isisind

23 Theoretically justified survival SIS: the SAR model
Ongoing work. Recall the linear model similarity
$$\underbrace{s_n}_{X^\top y} = \underbrace{S_n \beta^0}_{(X^\top X)\beta^0} + \underbrace{\text{martingale integral}}_{X^\top \varepsilon}.$$
This suggests screening based on the "correlation" $s_n$.
Formally sensible beyond the SAR model; if administrative censoring at $t = \tau$ and centered $Z_i$s:
$$E s_{nj} = \operatorname{Cov}\big(Z_{1j}, F_T(\tau \mid Z_1)\big) + \int_0^\tau \operatorname{Cov}\big(Z_{1j}, F_T(t \mid Z_1)\big) K(t)\,dt,$$
with $F_T(t \mid Z_1) = P(T_1 \le t \mid Z_1)$ and $K$ a strictly positive function.
So $|E s_{nj}|$ is large if $\operatorname{Cov}(Z_{1j}, F_T(t \mid Z_1))$ is consistently large.
Checkable if e.g. $F_T(t \mid Z_1) = \Lambda(t, Z_1^\top \alpha)$; $\Lambda(t, \cdot)$ monotone.

24 SAR iterated SIS
Within the SAR model, the sure screening property will hold if $|e_j^\top E S_n \beta^0|$ and $|\beta_j^0|$ are both large whenever $\beta_j^0 \neq 0$.
Such semi-orthogonality may fail (although not clear when). Iterative screening:
Set $s_1 := s_n$. For $i = 1, 2, \ldots$
1. Estimate $\hat M_i$ by independence screening with $s_i$.
2. Estimate $\beta^0_{\hat M_i}$ by (penalized) regression, assuming SAR model.
3. Take $s_{i+1} = s_i - S_n \hat\beta_{\hat M_i}$ ("residual correlation").
Currently:
    What does $E S_n$ mean? How do we deal with censoring?
    (How) does it work in practice?

25 Concluding remarks
We can deal with $p \gg n$ under the assumption of sparsity.
Modern sparsity modeling is a very elaborate exercise in not ignoring the correlation structure.
It is not a silver bullet:
    We can make good, interpretable prediction models.
    But model selection is primarily model filtering.
    Sample size still matters.
    (How important/meaningful is theoretical model selection?)
Difficult to translate to applied sciences, but progress!

26 Exercise: survival predictions for DLBCL data
1. Load the data set bair.rda, which consists of 240 patients diagnosed with diffuse large B-cell lymphoma. Variables are as follows:
    time and status: self-explanatory;
    train (binary): indicates whether the subject is in the training set;
    X4,...,X7998 (continuous): gene expressions.
2. Build and validate survival prediction models based on the gene expression data. Specifically, basing estimation on the training set, predict (linear) risk scores in the test set based on:
    Prediction-tuned SAR lasso.
    SAR PLS with 1 component (a scaled version of $\hat\beta^{PLS}$ suffices: use ahaz(surv,X,univariate=TRUE)$s).
    Cox iterated SIS: the standard Cox model including the relevant covariates obtained from COXvanISISscad.
3. Dichotomize the three different risk scores and compare their log-rank test p-values from coxph. Also plot the corresponding survival curves.
4. Obtain test set risk scores based on the truncated, scaled 1-component PLS estimator $s_n I(|s_n| > \gamma)$ for different $\gamma$ (e.g. quantiles of $|s_n|$). Calculate the log-rank test p-values for the corresponding collection of continuous predictors; plot versus the number of nonzero entries in the truncated PLS estimator. Interpret.

27 Good night stories
1. Lin DY & Ying Z (1994). Semiparametric analysis of the additive risk model. Biometrika, 81.
2. Tibshirani R (1996). Regression shrinkage and selection via the lasso. JRSS B, 58.
3. Efron B et al. (2004). Least angle regression. Ann. Statist., 32.
4. Friedman et al. (2007). Pathwise coordinate optimization. Ann. Appl. Stat., 1.
5. Metzeler et al. (2008). An 86-probe-set gene-expression signature predicts survival in cytogenetically normal acute myeloid leukemia. Blood, 112.
6. Zhao P & Yu B (2006). On model selection consistency of LASSO. J Machine Learning Research, 7.
7. Meinshausen N & Yu B (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist., 37.
8. Ma S & Leng C (2007). Path consistent model selection in additive risk model via lasso. Statist. Med., 26.
9. Martinussen T & Scheike TH (2009). Covariate selection for the semiparametric additive risk model. Scand. J. Statist., 36.
10. Greenshtein E & Ritov Y (2004). Persistence in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli, 10.
11. Zou H (2006). The adaptive lasso and its oracle properties. JASA, 101.
12. Meinshausen N & Bühlmann P (2010). Stability selection. JRSS B, 72.
13. Fan J & Lv J (2008). Sure independence screening for ultra-high dimensional feature space. JRSS B, 70.
14. Fan J, Samworth R & Wu Y (2009). Ultrahigh dimensional feature selection: beyond the linear model. J Machine Learning Research, 10.
15. Fan J, Feng Y & Wu Y (2010). High-dimensional variable selection for Cox's proportional hazards model. Preprint.
