
NEW METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO SURVIVAL ANALYSIS AND STATISTICAL REDUNDANCY ANALYSIS USING GENE EXPRESSION DATA

by

SIMIN HU

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Dissertation Adviser: Dr. J. SUNIL RAO

Department of Epidemiology and Biostatistics

CASE WESTERN RESERVE UNIVERSITY

January 2007

CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of candidate for the Ph.D. degree*.

(signed) (chair of the committee) (date)

*We also certify that written approval has been obtained for any proprietary material contained therein.

Table of Contents

List of Tables
List of Figures
Acknowledgements
Abstract

Penalized Weighted Least Squares Method for Estimation and Variable Selection in Accelerated Failure Time Models with High Dimensional Gene Expression Data
    Introduction
    Survival analysis and right censored data, AFT models, weighted least squares method and common regularization techniques
        Survival analysis and right censored data
        AFT models
        Stute's weighted least squares method for AFT models
    Penalized weighted least squares method
        Censoring constraints
        Ridge and LASSO penalization
    Algorithms for finding the penalized solutions
        Quadratic programming
        Variable selection
        Algorithms for finding the penalized solutions
        Parameter tuning
    Simulation studies
        Simulation 1: comparison of model estimation between algorithm QP and algorithm QPCC when p < n
        Simulation 2: comparison of the performances in variable selection and prediction accuracy between algorithms QP, QPCC, QPLASSO and QPCCLASSO when p < n
        Simulation 3: comparison of the performances in variable selection between algorithms QP, QPCC, QPLASSO and QPCCLASSO when p > n
        Simulation 4: comparison of the performances in variable selection and prediction accuracy between algorithms QP, QPCC, QPLASSO and QPCCLASSO when p > n
    Real data examples
        Stanford heart transplant data (p < n)
        Diffuse large B-cell lymphoma survival data (p > n)
    Discussion

Statistical Redundancy Testing for Improved Gene Selection in Cancer Classification Using Microarray Data
    Introduction
    Background on gene expression data and related algorithms
        DIANA (Divisive Analysis) clustering algorithm
        FLDA
    Statistical redundancy testing and gene selection algorithms
        Test statistic: eigenvalue-ratio
        Bootstrap re-sampling test for statistical redundancy
        Gene selection methods
    Simulations
        Simulation 1
        Simulation 2
        Simulation 3
    Gene selection for microarray datasets
        Gene selection for leukemia study
        Gene selection for colon cancer study
    Discussion

Appendix I
Appendix II
Appendix III
Bibliography

List of Tables

Table 1.1 Distributions for AFT models
Table 1.2 Simulated data
Table 1.3 Model fitted under different λ0
Table 1.4 Estimate of β2 and standard deviation
Table 1.5 Estimates of βs with standard deviations using QP and QPCC when the censoring percentage is 30%
Table 1.6 Estimates of βs with standard deviations using QP and QPCC when the censoring percentage is 70%
Table 1.7 Mean-squared errors for test data and model selection for training data at 2 different censoring percentages, based on 100 replications. The 3rd and 5th columns show the frequencies of selecting exactly the correct model; the numbers in parentheses are the frequencies of retaining all 6 relevant variables in the final model
Table 1.8 Number of variables selected and number of relevant variables selected at different censoring percentages. The numbers in parentheses are the numbers of relevant variables selected in the final model
Table 1.9 Mean mean-squared errors for test data at 2 different censoring percentages, based on 50 replications. The 3rd and 5th columns show the frequencies of selecting exactly the correct model; the numbers in parentheses are the frequencies of retaining all 20 relevant variables in the final model
Table 1.10 Survival analysis results of Stanford heart transplant data with different estimation methods
Table 1.11 DLBCL WLSE and MSE of selected models using different algorithms
Table 1.12 DLBCL log-rank test results for groups defined according to linear risk scores
Table 1.13 DLBCL log-rank test results for groups defined according to Cox scores
Table 1.14 Genes selected by the four algorithms that belong to the 17 genes used to build the Cox model in Rosenwald et al. (2002)
Table 1.15 Genes selected with the QPCCLASSO algorithm
Table 1.16 Number of common genes between the lists of genes selected by the four algorithms
Table 2.1 Gene p-values for statistical redundancy test and t-test
Table 2.2 Gene filtering process in Algorithm
Table 2.3 Genes selected under different pair-wise correlations
Table 2.4 Selected genes for leukemia data
Table 2.5 Golub's 50 genes that are selected and/or filtered
Table 2.6 LOOCV errors of different classification models and gene selection methods for leukemia data
Table 2.7 The 46 selected genes for colon cancer data
Table 2.8 LOOCV errors of different classification models and gene selection methods for colon cancer data
Table I.1 DLBCL survival training data results with the four algorithms

List of Figures

Figure 1.1 Line plot in calendar time for four subjects in a hypothetical follow-up study
Figure 1.2 Line plot in the time scale for four subjects
Figure 1.3 Plot of models fitted under different λ0 (λ0 = 0, 0.1, 0.5, ...). o uncensored data; censored data
Figure 1.4 Slope and intercept vs. λ0
Figure 1.5 Data: n = 30, p = 10, corr(i, j) = 0.5, σ = 2, and censoring percentage = 30%. Tuning parameters: λ0 = 1 and λ2. Each plot has an overlaid line. o uncensored data; censored data. The overlaid lines have intercept 0 and slope 1. The plots in the left panel show the predicted survival times versus the observed survival times; the plots in the right panel show the predicted survival times versus the true survival times
Figure 1.6 Data: n = 30, p = 10, corr(i, j) = 0.5, σ = 2, and censoring percentage = 70%. Tuning parameters: λ0 = 1 and λ2. Each plot has an overlaid line. o uncensored data; censored data. The overlaid lines have intercept 0 and slope 1. The plots in the left panel show the predicted survival times versus the observed survival times; the plots in the right panel show the predicted survival times versus the true survival times
Figure 1.7 Box-plots of mean-squared error on test data when p = 8 and n = 50. Methods 1, 2, 3 and 4 represent algorithms QP, QPCC, QPLASSO and QPCCLASSO, respectively. The left panel shows the results when the censoring percentage is 30%; the right panel shows the results when the censoring percentage is 70%
Figure 1.8 Scatter plots of predicted log-survival time vs. actual log-survival time. The top left panel is for the model selected by QP; the top right panel for QPCC; the bottom left panel for QPLASSO; the bottom right panel for QPCCLASSO. o uncensored data; censored data. The overlaid lines have intercept 0 and slope 1. Censoring percentage = 30%
Figure 1.9 Scatter plots of predicted log-survival time vs. actual log-survival time. The top left panel is for the model selected by QP; the top right panel for QPCC; the bottom left panel for QPLASSO; the bottom right panel for QPCCLASSO. o uncensored data; censored data. The overlaid lines have intercept 0 and slope 1. Censoring percentage = 50%
Figure 1.10 Scatter plots of predicted log-survival time vs. actual log-survival time. The top left panel is for the model selected by QP; the top right panel for QPCC; the bottom left panel for QPLASSO; the bottom right panel for QPCCLASSO. o uncensored data; censored data. The overlaid lines have intercept 0 and slope 1. Censoring percentage = 70%
Figure 1.11 Box-plots of mean-squared error on test data when p = 200 and n = 100. Methods 1, 2, 3 and 4 represent algorithms QP, QPCC, QPLASSO and QPCCLASSO, respectively. The left panel shows the results when the censoring percentage is 30%; the right panel shows the results when the censoring percentage is 70%
Figure 1.12 Scatter plots of predicted log-survival time vs. log-survival time for the training data. The top left panel is for the model selected by QP; the top right panel for QPCC; the bottom left panel for QPLASSO; the bottom right panel for QPCCLASSO. o uncensored data; censored data. The overlaid lines have intercept 0 and slope 1
Figure 1.13 Scatter plots of predicted log-survival time vs. log-survival time for the validation data. The top left panel is for the model selected by QP; the top right panel for QPCC; the bottom left panel for QPLASSO; the bottom right panel for QPCCLASSO. o uncensored data; censored data. The overlaid lines have intercept 0 and slope 1
Figure I.1 Scatter plots of predicted log-survival time vs. log-survival time for the training data. The top left panel is for the model selected by QP; the top right panel for QPCC; the bottom left panel for QPLASSO; the bottom right panel for QPCCLASSO. o uncensored data; censored data. The overlaid lines have intercept 0 and slope 1
Figure I.2 Scatter plots of predicted log-survival time vs. log-survival time for the validation data. The top left panel is for the model selected by QP; the top right panel for QPCC; the bottom left panel for QPLASSO; the bottom right panel for QPCCLASSO. o uncensored data; censored data. The overlaid lines have intercept 0 and slope 1

Acknowledgements

My greatest gratitude goes to Professor J. Sunil Rao, my thesis advisor. His insightful guidance, persistent support and encouragement have given me the greatest experience. Thanks also go to Professors Mark Adams, Jeffrey M. Albert and Pingfu Fu; I am honored to have them on my Ph.D. committee, and their advice has greatly improved this work. Thanks to Dr. Albert for being my academic advisor and supporting me throughout these years, and to Dr. Fu for many inspiring discussions and constructive suggestions on my research work. I am also grateful to Guan and Jinjing for their friendship and support. This dissertation is dedicated to my mother Defeng Liang, my husband Baofu Duan and my daughter Kelly Duan; I give them my deepest love and appreciation for their unconditional love and constant support.

New Methods for Variable Selection with Applications to Survival Analysis and Statistical Redundancy Analysis Using Gene Expression Data

Abstract

by

SIMIN HU

An important application of microarray research is to develop cancer diagnostic and prognostic tools based on tumor genetic profiles. For ease of interpretation, such studies aim to identify, from at least thousands of genes, a small fraction of genes with which to build molecular predictors of clinical outcomes, and thus require methodologies that can model high dimensional covariates and accomplish variable selection simultaneously. One interesting area is modeling cancer patients' survival time, or time to cancer recurrence, with gene expression data. In the first part of this dissertation, we propose a new penalized weighted least squares method for model estimation and variable selection in accelerated failure time models. In this method, right censored observations are used as censoring constraints in optimizing the weighted least squares objective function. We also include a ridge penalty to deal with singularity caused by collinearity and high dimensionality, and use the least absolute shrinkage and selection operator to achieve model parsimony. Simulation studies demonstrate that adding censoring constraints improves model estimation and variable selection, especially for data with high dimensional covariates. Real data examples show that our method is able to identify genes that are relevant to patient survival times.

Another interesting area is cancer subtype classification using gene expression profiles. One important issue is to reduce the redundancy caused by correlation among genes. Since genes with correlated expression levels may be co-expressed or belong to the same biological pathway related to the disease, including such genes in classifiers provides very little additional information. In the second part of the dissertation, we define an eigenvalue-ratio statistic to measure a gene's contribution to the joint discriminability of a set of genes. Based on this eigenvalue-ratio statistic, we define a novel hypothesis test for gene statistical redundancy and propose two gene selection methods. Simulation studies illustrate the agreement between the statistical redundancy test and the gene selection methods. Real data examples show the effectiveness of our eigenvalue-ratio-based gene selection methods. We also demonstrate that the selected compact gene subsets can not only be used to build high quality cancer classifiers but also have biological relevance.

Penalized Weighted Least Squares Method for Estimation and Variable Selection in Accelerated Failure Time Models with High Dimensional Gene Expression Data

Introduction

Recently there has been a great deal of interest in survival analysis with right censored data and high-dimensional covariates. This is mainly driven by studies using gene expression profiles to predict patient survival time or time to cancer recurrence (e.g., Rosenwald et al., 2002; van't Veer et al., 2002; Beer et al., 2002). For ease of interpretation, such studies aim to identify the best predictors of survival times from at least thousands of genes, and thus need methodologies that can simultaneously address the censoring problem, deal with high dimensionality and accomplish variable selection. Dimension reduction techniques such as supervised principal component analysis, sliced inverse regression, the partial least squares method and kernel methods have been extended to Cox models (e.g., Bair and Tibshirani, 2004; Li and Li, 2004; Li and Gui, 2004; Park et al., 2003; Li and Luan, 2003) and accelerated failure time models (e.g., Huang and Harrington, 2002). There are also some survival ensemble methods developed for survival prediction with gene expression data. Li and Luan (2005) proposed a smoothing spline based boosting algorithm for the non-parametric additive Cox model. Hothorn et al. (2006) introduced a random forest algorithm and a generic gradient boosting algorithm to construct prognostic accelerated failure time models. None of these methods is a rigorous variable selection method, since all the covariates are used to build compound covariates or ensemble models.

Tibshirani (1997) used the LASSO method for variable selection and shrinkage in Cox's proportional hazards model by minimizing the log partial likelihood subject to the sum of the absolute values of the parameters being bounded by a constant. The LASSO imposes an L1-penalty on the regression coefficients and tends to produce some exactly zero coefficients, and thus can produce a parsimonious model (Tibshirani, 1996). This LASSO method for the Cox model cannot be directly applied to high-dimensional gene expression data. Some efforts have been made in the literature for modeling and variable selection with right censored data and high-dimensional covariates. Gui and Li (2005) proposed L1 penalized estimation for the Cox model to select genes that are associated with patients' survival. They used the Least Angle Regression method (LARS, Efron et al., 2004) to overcome the computational difficulty for high-dimensional data. Their method can in principle select n − 1 genes, where n is the sample size.

Although the Cox model (Cox, 1972) is the most widely used procedure for modeling the relationship of covariates to a right censored outcome, a drawback of this method is that it assumes the relative hazard for any two individuals to be independent of time when the covariates are fixed over time. When the proportional hazards assumption is questionable, a very attractive practical alternative to the Cox model for failure time regression is the accelerated failure time model (AFT, Kalbfleisch and Prentice, 1980). Another desirable property is that AFT models (e.g., James, 1998) reduce to an ordinary linear regression model in the absence of censoring, whereas the Cox model does not. Furthermore, an AFT model can provide information for assessing the baseline hazard function if it models the data reasonably well, but a Cox model cannot. In the literature, the estimation approaches developed for AFT models include semi-parametric rank-based methods (e.g., Tsiatis, 1990; Wei et al., 1990; Lai & Ying, 1992; Ying, 1993; Fygenson & Ritov, 1994; Jin et al., 2003), least squares based and M-estimation methods (e.g., Miller, 1976; Buckley & James, 1979; Ritov, 1990; Lai & Ying, 1991), and the weighted least squares method (Stute, 1993, 1996). Wang et al. (2006) proposed a doubly penalized Buckley-James method with both L1-norm and L2-norm penalties to relate high-dimensional genomic data to censored survival outcomes. Huang et al. (2005a) used two regularization approaches based on Stute's method, the LASSO (Tibshirani, 1994) and the threshold gradient directed regularization (TGDR, Friedman & Popescu, 2003), for estimation and variable selection in the accelerated failure time model with high-dimensional covariates. In another paper, Huang et al. (2005b) used the Kaplan-Meier weights in a least-absolute-deviations objective function. Datta et al. (2006) applied the LARS method in AFT modeling for variable selection after censored data imputation with re-weighting, mean imputation, and multiple imputation approaches.

Besides the estimation problem, another very important issue is model evaluation, because prediction is an important inferential goal in survival analysis. A key piece of information given by right censored data is that, for censored observations, the censored time is less than the true survival time. It is therefore natural to use the predictions for censored observations as a goodness measure for AFT models that are built on uncensored data only, with censored observations assigned zero Kaplan-Meier weights. In other words, if an AFT model describes the survival times of the underlying population very well, it should not underestimate too many censored observations, nor underestimate some censored observations too much. On the other hand, we may expect to improve model prediction accuracy by making use of this information. This is the main motivation of the method proposed in this dissertation work.

To handle the aforementioned estimation problem for accelerated failure time models and the high dimensionality of gene expression data, we consider a new penalized weighted least squares estimator for the accelerated failure time model. First, we propose to include the censored observations in our penalized weighted least squares estimator as constraints, which we call censoring constraints, to limit the model space and thus improve model prediction accuracy. Second, besides the censoring constraints, our penalized weighted least squares estimator also includes both L1 and L2 penalties. Such a combination of LASSO and ridge penalties has been used in Zou & Hastie's (2004) regularization and variable selection method, the elastic net, for the linear regression model; it simultaneously performs automatic variable selection and continuous shrinkage and can thus select groups of correlated variables. We use quadratic programming (Goldfarb and Idnani, 1982, 1983) to solve Stute's weighted least squares objective function with censoring constraints and L1 and L2 penalties. Although this method is motivated by the aforementioned problems in survival analysis with high dimensional covariates, it can equally be applied to cases where p < n.

In Section 2, we describe survival analysis, right censored data and the usual setup of such data, and briefly summarize the AFT model, Stute's weighted least squares estimator, the LASSO and ridge regression. In Section 3 we define our new penalized weighted least squares method. In Section 4, we discuss computational issues and solutions: K-fold cross validation is used for tuning parameter selection and the bootstrap for variance estimation. Section 5 presents some simulation studies, and the analyses of real datasets are carried out in Section 6. The last section summarizes the proposed method and discusses future work in this research area.

Survival analysis and right censored data, AFT models, weighted least squares method and common regularization techniques

Survival analysis and right censored data

Survival analysis was originally the analysis of survival times for medical patients diagnosed with some fatal disease; it is now a well-developed field of statistical methodology, widely applied to modeling and testing hypotheses about time-to-event data in many scientific fields including biology, medicine, public health, epidemiology, engineering, and economics. In medical studies, time-to-event data are generically referred to as survival data, and the event of interest may be death, the end of a period spent in remission from a disease, relief from symptoms, or the recurrence of a particular condition.

A special feature of survival data is that survival times are frequently censored, i.e., the end-point of interest has not been observed for some subjects. Mechanisms that can lead to incomplete observation of survival time include censoring and truncation. Common types of censoring are right censoring, left censoring and interval censoring. Right censoring occurs after a subject has entered a study, i.e., to the right of the last known survival time; a right censored survival time is thus less than the actual, but unknown, survival time. A typical example of right censoring in medical studies is that subjects who are still alive at the end of the study, or who have been lost to follow-up, only have follow-up times, which are referred to as censored times. Only the subjects who died have actually observed survival times.

Since this dissertation work focuses on right censored data, we describe such data with an example from Hosmer and Lemeshow (1999). Consider a hypothetical 2-year-long study in which four patients are enrolled during the first year. Patient 1 entered the study on 01/01/1990 and died on 03/01/1991; patient 2 entered the study on 02/01/1990 and was lost to follow-up on 02/01/1991; patient 3 entered the study on 06/01/1990 and was still alive on 12/01/1991, the end of the study; and patient 4 entered the study on 09/01/1990 and died on 04/01/1991. The calendar times for these four patients are shown in Figure 1.1 and the survival times are shown in Figure 1.2. In this study, patients 1 and 4 have actual observed survival times while patients 2 and 3 have right-censored observations of survival time.

Figure 1.1 Line plot in calendar time for four subjects in a hypothetical follow-up study

Figure 1.2 Line plot in the time scale for four subjects

We consider the usual form of right censored data (Y, δ, X), where Y = min{T, C}, with T the response variable (the true survival time) and C the censoring variable (the censoring time), δ = I{T ≤ C} the censoring indicator, and X the p-dimensional covariates. For survival time and gene expression data, X is an n × p matrix with x_i = (x_i1, ..., x_ip), i = 1, ..., n, the vector of expression levels of the p genes for the ith sample.

Modeling of survival time is based on non-parametric and parametric approaches. Non-parametric and semi-parametric methods include the Kaplan-Meier estimate of survival (Kaplan & Meier, 1958), the Cox proportional hazards regression model (Cox, 1972) and extensions due to Andersen and Gill (1982). Parametric accelerated failure time models (Kalbfleisch and Prentice, 1980) assume an underlying distribution for the survival time such as the exponential, Weibull, gamma, Gompertz, log-normal, or log-logistic distribution.
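As a small, self-contained illustration of this data structure (not taken from the dissertation's simulations), the Python sketch below generates hypothetical right censored data in the (Y, δ, X) form, assuming standard normal covariates, a log-linear survival model and exponentially distributed censoring times.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                                  # hypothetical sample size and number of genes
X = rng.normal(size=(n, p))                    # n x p covariate (expression) matrix
beta = np.array([1.0, -0.5, 0.0, 0.0, 0.0])    # assumed true coefficients

T = np.exp(0.5 + X @ beta + 0.5 * rng.normal(size=n))    # true survival times, AFT form
C = rng.exponential(scale=np.exp(1.0), size=n)           # assumed censoring times

Y = np.minimum(T, C)                           # observed time Y = min{T, C}
delta = (T <= C).astype(int)                   # censoring indicator delta = I{T <= C}
print("censoring percentage:", round(100 * (1 - delta.mean()), 1))
```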

AFT models

AFT models specify that the effect of a fixed covariate X_j, j = 1, ..., p, acts multiplicatively on the failure time T, or additively on the logarithm or another known monotone transformation of T. An AFT model can be written as

t(y_i) = β_0 + x_i β + ε_i    (1)

where i = 1, ..., n, β_0 is the intercept term, β is the unknown p × 1 vector of true regression coefficients, and the ε_i (i = 1, ..., n) are unobservable independent random errors with mean zero and a common distribution function. Table 1.1 displays the distributions assumed for ε_i and the corresponding distributions of T.

Table 1.1 Distributions for AFT models

Distribution of ε                Distribution of T
Extreme value (2 parameters)     Weibull
Extreme value (1 parameter)      Exponential
Log-gamma                        Gamma
Logistic                         Log-logistic
Normal                           Log-normal

The usual choice for t is log(y_i), so the AFT model has the following form, which is the same as that of a linear regression model:

log(y_i) = β_0 + x_i β + ε_i    (2)

For simplicity, we hereafter write Equation (2) as

y_i = β_0 + x_i β + ε_i    (3)

where y_i denotes the logarithm of the response variable.
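The correspondence in Table 1.1 can be checked by simulation. The sketch below, with arbitrary assumed values for the intercept and the error scale, draws extreme-value (Gumbel) errors for an AFT model without covariates and compares the resulting survival times with a directly sampled Weibull distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
mu, sigma = 0.5, 0.8                     # assumed intercept and error scale

# log T = mu + sigma * W, where W follows the standard minimum extreme-value law;
# -W is a standard Gumbel (maximum) variable, which numpy samples directly.
eps = -sigma * rng.gumbel(size=n)
T_aft = np.exp(mu + eps)

# Weibull with shape 1/sigma and scale exp(mu), sampled directly for comparison.
T_weibull = np.exp(mu) * rng.weibull(1.0 / sigma, size=n)

print(np.quantile(T_aft, [0.25, 0.5, 0.75]))      # the two sets of quantiles should
print(np.quantile(T_weibull, [0.25, 0.5, 0.75]))  # agree up to Monte Carlo error
```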

Stute's weighted least squares method for AFT models

Suppose t_1, ..., t_n are independent observations from an unknown distribution function F and c_1, ..., c_n are independent observations from an unknown distribution function G. Let y_(1) ≤ ... ≤ y_(n) be the ordered observations, δ_(1), ..., δ_(n) the concomitant censoring indicators, and x_(1), ..., x_(n) the concomitant covariates. For a right censored response with distribution function F, the Kaplan-Meier estimator F̂_KM(t) (Kaplan & Meier, 1958) is defined by

F̂_KM(t) = 1 − ∏_{y_(i) ≤ t} ( (n − i) / (n − i + 1) )^{δ_(i)}    (4)

Under this non-parametric maximum likelihood estimator, the mass attached to y_(i) is

w_ni = ( δ_(i) / (n − i + 1) ) · ∏_{j=1}^{i−1} ( (n − j) / (n − j + 1) )^{δ_(j)}    (5)

Stute (1993, 1996) used this attached mass to weight the observations in linear regression for right censored data, and extended F̂_KM to a multivariate distribution function F̂_n^0(x, t), which serves as an estimator of the joint distribution function F^0 of (X, T) when covariates X are available:

F̂_n^0(x, t) = Σ_{i=1}^{n} w_ni · I{ X_(i) ≤ x, y_(i) ≤ t }    (6)

Without censoring, F̂_KM(t) reduces to the usual sample distribution function F̂, which assigns weight 1/n to each observation, and the weighted least squares estimator then reduces to ordinary least squares.

Stute (1993, 1996) defined the weighted least squares estimate β̂ of β as

β̂ = argmin_β (1/2) Σ_{i=1}^{n} w_ni ( y_(i) − β_0 − x_(i) β )²    (7)

Stute (1993, 1996) proved the consistency and asymptotic normality of this weighted least squares estimator under some integrability assumptions, E(ε | X) = 0, and the following assumptions on the random censoring mechanism:

1. T and C are independent, and F and G have no jumps in common;
2. P(T ≤ C | X, T) = P(T ≤ C | T), i.e., δ and X are independent conditionally on T; and
3. τ_T < τ_C, where τ_T and τ_C are respectively the least upper bounds of the supports of the distribution functions of T and C.
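A minimal Python sketch of the Kaplan-Meier weights of Equation (5) and the weighted least squares fit of Equation (7) is given below. It is an illustration only, not the dissertation's implementation; ties are handled simply through stable ordering, and the response is assumed to be on the log-time scale already.

```python
import numpy as np

def kaplan_meier_weights(y, delta):
    """Kaplan-Meier weights w_ni of Equation (5), returned in the original ordering."""
    n = len(y)
    order = np.argsort(y, kind="stable")
    d = np.asarray(delta, dtype=float)[order]
    w_sorted = np.zeros(n)
    prod = 1.0                       # running product of ((n - j)/(n - j + 1))^delta_(j), j < i
    for i in range(n):               # 0-based i corresponds to i + 1 in Equation (5)
        w_sorted[i] = prod * d[i] / (n - i)
        prod *= ((n - i - 1) / (n - i)) ** d[i]
    w = np.zeros(n)
    w[order] = w_sorted
    return w

def stute_wls(X, y, delta):
    """Stute's weighted least squares estimate of Equation (7), computed as ordinary
    least squares on sqrt(w)-scaled data with an intercept column."""
    w = kaplan_meier_weights(y, delta)
    Xd = np.column_stack([np.ones(len(y)), np.asarray(X)])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * Xd, sw * y, rcond=None)
    return beta, w
```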

Penalized weighted least squares method

Although the information from censored observations is incorporated into the Kaplan-Meier weights in Stute's weighted least squares method, in real applications with p < n we sometimes observe that the predicted survival times of too many censored observations are less than their censoring times, or that some censored observations are predicted very poorly, i.e., some censored samples have predicted survival time T̂ << C. The problem is more severe when using Stute's method in the case p >> n, e.g., in survival analysis with high dimensional gene expression data. Another problem is singularity caused by collinearity and the high dimensionality of gene expression data. These problems motivate the new variable selection method in this dissertation work.

To address these problems, two issues have to be considered when using Stute's weighted least squares method. One is variable selection, i.e., how to select only the small fraction of variables that are really important predictors of survival time; the other is to pay more attention to the censored data in both variable and model selection, so that they are not underestimated too much. We propose a new penalized weighted least squares method to address these two issues. In this method, we minimize the weighted least squares objective function while adding the censored observations as constraints. L1 and L2 penalties are also added, and a quadratic programming algorithm is used to optimize the objective function. In the following, we elaborate how to add the censoring constraints and the L1 and L2 penalties within a quadratic programming framework.

The properties of the censoring constraints and of their combination with the LASSO and ridge penalties are discussed below, and the results are developed in a series of lemmas.

Censoring constraints

Since the Kaplan-Meier weights for censored observations are 0, we can rewrite the objective function of Stute's weighted least squares method as

(1/2) (Y_U − β_0 − X_U β)ᵀ W_U (Y_U − β_0 − X_U β)    (8)

where X_U and Y_U are the predictor and response observations for the uncensored data, and W_U is a diagonal matrix holding the Kaplan-Meier weights of the uncensored data. Denote by X_C and Y_C the predictors and responses of the censored data. Since Y_C ≤ T_C for right censored data, we can add the constraints Y_C ≤ X_C β + β_0 to the weighted least squares method. These constraints may be too stringent due to random noise; instead, we relax them to Y_C ≤ X_C β + β_0 + ξ, where ξ is a vector of non-negative values measuring the severity of the violations of the constraints. The weighted least squares estimate β̂ with censoring constraints (CC) is defined as

β̂_CC = argmin_{β, ξ} (1/2) (Y_U − β_0 − X_U β)ᵀ W_U (Y_U − β_0 − X_U β) + (λ_0 / 2n) ξᵀ ξ
subject to Y_C ≤ X_C β + β_0 + ξ    (9)

where λ_0 is a positive value accounting for the penalties on violations of the constraints, and n appears in the penalty term only for scaling, to match W_U.

Equation (9) is equivalent to the following:

β̂_CC = argmin_β (1/2) (Y_U − β_0 − X_U β)ᵀ W_U (Y_U − β_0 − X_U β) + (λ_0 / 2) (Y_C − β_0 − X_C β)ᵀ W_C (Y_C − β_0 − X_C β)    (10)

where W_C is a diagonal matrix with diagonal element (W_C)_{i,i} = 1/n if Y_Ci > X_Ci β + β_0 and 0 otherwise. As can be seen from the objective function in Equation (10), for a censored observation (X_Ci, Y_Ci) whose estimated survival time is no less than the censoring time, the difference between the estimated survival time and the censoring time is ignored; otherwise, the difference (which we call the censoring error) enters the objective function just like the residuals of the uncensored observations.
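To make the role of the censoring errors concrete, the following sketch simply evaluates the criterion of Equation (10) for a given coefficient vector; it illustrates the criterion only, not the estimation algorithm (the next paragraph explains why the constrained form (9) is preferred for optimization).

```python
import numpy as np

def penalized_wls_objective(beta0, beta, Xu, Yu, wu, Xc, Yc, lam0):
    """Criterion of Equation (10): Kaplan-Meier weighted squared error on the uncensored
    data plus a penalty on censored observations whose predicted survival time falls
    below the censoring time (the "censoring errors")."""
    n = len(Yu) + len(Yc)
    res_u = Yu - beta0 - Xu @ beta                   # residuals of the uncensored data
    fit_term = 0.5 * np.sum(wu * res_u ** 2)
    viol = np.maximum(Yc - beta0 - Xc @ beta, 0.0)   # only under-predicted censored cases count
    cc_term = 0.5 * lam0 * np.sum(viol ** 2) / n     # W_C places mass 1/n on violated rows
    return fit_term + cc_term
```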

However, Equation (10) is not easy to solve with traditional regression algorithms because its objective function is not continuously differentiable over the model space. Instead we use Equation (9), in which the dimension of the model space is expanded to include the additional parameters ξ, making the objective function continuously differentiable over a constrained (β, ξ) space. Equation (9) can be solved using the quadratic programming algorithm, as discussed later. Notice that in Equation (9) the constraints that ξ be non-negative, i.e., ξ ≥ 0, are not imposed explicitly in the optimization problem, but they are guaranteed by the penalty term, as the following Lemma 1 shows.

Lemma 1. If the pair (β̂, ξ̂) is the minimizer of Equation (9) with λ_0 > 0, then ξ̂ ≥ 0.

Proof. Define

f(β, ξ) = (1/2) (Y_U − β_0 − X_U β)ᵀ W_U (Y_U − β_0 − X_U β) + (λ_0 / 2n) ξᵀ ξ

Assume that ξ̂_j < 0 for some j. Choose ν with 0 < ν < −ξ̂_j and let V be the vector with V_j = ν and V_i = 0 for i ≠ j. Then

f(β̂, ξ̂ + V) = f(β̂, ξ̂) + (λ_0 / 2n) ν (2 ξ̂_j + ν) < f(β̂, ξ̂)

and

Y_C ≤ X_C β̂ + β_0 + ξ̂ ≤ X_C β̂ + β_0 + ξ̂ + V

This contradicts (β̂, ξ̂) being the minimizer. Hence the assumption that ξ̂_j < 0 for some j is false.

The censoring constraints in Equation (9) limit the model space using the right-censored data. To illustrate how the censoring constraints affect model estimation, we consider the following simple example. For simplicity, assume the model Y = 1 + X + ε, with a total of 4 observations available of which 2 are censored. Table 1.2 shows the 4 observations.

Table 1.2 Simulated data

i    X_i    Y_i    δ_i

We choose different values for the censoring parameter λ_0 and fit the models. Table 1.3 presents the models fitted under the different λ_0.

Table 1.3 Model fitted under different λ_0

λ_0    intercept    slope

Figure 1.3 plots these fitted models, and Figure 1.4 plots the estimated intercept and slope as functions of λ_0. We can see that if λ_0 = 0, i.e., no censoring constraints are used, a perfect model is fitted to the two uncensored observations. However, due to random noise, this perfect model does not predict censored observation 4 very well: its predicted survival time is less than the censored time. If censoring constraints are used, the estimate increases the prediction for observation 4 to reduce the amount of violation of the censoring constraints; at the same time, it also changes the slope to balance the residuals of the uncensored data. The amount of adjustment depends on the censoring parameter λ_0. As λ_0 increases, the model intercept increases slightly, but the model slope moves closer to that of the true model. The other censored observation, observation 2, has no effect on the estimate since its predicted survival time is greater than its censoring time.

Figure 1.3 Plot of models fitted under different λ_0 (λ_0 = 0, 0.1, 0.5, ...). o uncensored data; censored data

Figure 1.4 Slope and intercept vs. λ_0

The next example illustrates how censoring constraints affect model selection. We generate data from the model

Y = β_0 + x_1 β_1 + x_2 β_2 + ε_Y

where β_0 = 1, β_1 = 1, β_2 = 0, corr(x_1, x_2) = 0, ε_Y ~ N(0, σ²), and n = 10. Table 1.4 shows the estimate of β_2 and its standard deviation, based on 500 replications, under various combinations of censoring percentage and σ for the two modeling methods. Under each combination of censoring percentage and σ, the estimate obtained with censoring constraints is closer to 0 and has a lower standard deviation than the estimate obtained without censoring constraints, especially at high censoring percentages.

Since the true model has β_2 = 0 in this simulation, this example shows that adding censoring constraints makes correct model selection more likely, especially when the available uncensored data are very limited.

Table 1.4 Estimate of β_2 and standard deviation

Censoring percentage    σ    Without censoring constraints    With censoring constraints
30%                          (0.295)                          0.009 (0.274)
                             (0.590)                          (0.519)
                             (1.180)                          (0.987)
50%                          (0.446)                          (0.349)
                             (0.893)                          (0.659)
                             (1.785)                          (1.247)
70%                          (2.911)                          0.021 (0.801)
                             (5.821)                          (1.513)
                             (11.643)                         0.076 (2.941)

Ridge and LASSO penalization

Ridge regression (Hastie et al., 2001) minimizes a penalized residual sum of squares,

β̂_ridge = argmin_β (1/2) Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )² + (λ/2) Σ_{j=1}^{p} β_j²    (11)

which can be written in matrix form as

RSS(λ) = (1/2) (y − β_0 − Xβ)ᵀ (y − β_0 − Xβ) + (λ/2) βᵀ β    (12)

The ridge regression estimate can be written as

β̂_ridge = (XᵀX + λI)⁻¹ Xᵀ y    (13)

where I is the p × p identity matrix. So the ridge penalty is equivalent to adding a positive constant to the diagonal of XᵀX before inversion. When there is collinearity between the covariates, or when XᵀX is not of full rank because p >> n, adding λI makes XᵀX non-singular. The effect of this penalty is to constrain the size of the coefficients by shrinking them toward zero; a more subtle effect is that coefficients of correlated variables are shrunk toward each other as well as toward zero (Hastie & Tibshirani, 2004).

To deal with singularity caused by collinearity between the covariates, or by X_UᵀX_U not being of full rank because p >> n_u, where n_u is the number of uncensored observations, we add the ridge penalty, and the penalized weighted least squares objective function becomes

β̂_CC = argmin_{β, ξ} (1/2) (Y_U − β_0 − X_U β)ᵀ W_U (Y_U − β_0 − X_U β) + (λ_0 / 2n) ξᵀ ξ + (λ_2 / 2) βᵀ β
subject to Y_C ≤ X_C β + β_0 + ξ    (14)

where λ_2 is a positive value accounting for the ridge penalty.

Tibshirani (1994) proposed the L1 LASSO penalty Σ_{j=1}^{p} |β_j| for estimation in linear regression. The LASSO estimate is defined by

β̂_LASSO = argmin_β (1/2) Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )²  subject to  Σ_{j=1}^{p} |β_j| ≤ s    (15)

where the L1 LASSO penalty is added as a constraint: Σ_j |β_j| ≤ s. Because of the nature of this constraint, it tends to produce some coefficients that are exactly zero. The constraint also makes the solutions nonlinear in the y_i, and further modifications are needed in order to use it within the quadratic programming algorithm.
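For reference, the closed form of Equation (13) is a one-line computation; this minimal sketch assumes the data have already been centered so that no intercept is needed.

```python
import numpy as np

def ridge_fit(X, y, lam2):
    """Closed-form ridge estimate of Equation (13): (X'X + lam2 * I)^(-1) X'y,
    assuming X and y are already centered."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam2 * np.eye(p), X.T @ y)
```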

One possible way, used in Tibshirani (1996), to add the LASSO constraint is to rewrite β = β⁺ − β⁻, where β⁺ and β⁻ are non-negative. The LASSO constraint then becomes: β⁺ ≥ 0, β⁻ ≥ 0 and Σ_j (β_j⁺ + β_j⁻) ≤ s. The weighted least squares estimate β̂ with censoring and LASSO constraints (CC-LASSO) is now

(β̂⁺, β̂⁻)_CC-LASSO = argmin_{β⁺, β⁻, ξ} (1/2) [Y_U − β_0 − X_U (β⁺ − β⁻)]ᵀ W_U [Y_U − β_0 − X_U (β⁺ − β⁻)] + (λ_0 / 2n) ξᵀ ξ + (λ_2 / 2) β⁺ᵀ β⁺ + (λ_2 / 2) β⁻ᵀ β⁻

subject to Y_C ≤ X_C (β⁺ − β⁻) + β_0 + ξ,  β⁺ ≥ 0,  β⁻ ≥ 0  and  Σ_j (β_j⁺ + β_j⁻) ≤ s    (16)

In Equation (16), the ridge penalty is written as (λ_2 / 2) β⁺ᵀ β⁺ + (λ_2 / 2) β⁻ᵀ β⁻ instead of (λ_2 / 2) (β⁺ − β⁻)ᵀ (β⁺ − β⁻), which is justified by the following Lemma 2.

Lemma 2. If the pair (β̂⁺, β̂⁻) is the minimizer of Equation (16) with λ_2 > 0, then β̂_j⁺ β̂_j⁻ = 0, j = 1, ..., p.

Proof. Define

f(β⁺, β⁻, ξ) = (1/2) [Y_U − β_0 − X_U (β⁺ − β⁻)]ᵀ W_U [Y_U − β_0 − X_U (β⁺ − β⁻)] + (λ_0 / 2n) ξᵀ ξ + (λ_2 / 2) β⁺ᵀ β⁺ + (λ_2 / 2) β⁻ᵀ β⁻

Assume that β̂_j⁺ β̂_j⁻ > 0 for some j, i.e., both are positive. Choose ν with 0 < ν < min(β̂_j⁺, β̂_j⁻), and let V be the vector with V_j = ν and V_i = 0 for i ≠ j. Then β̂ = (β̂⁺ − V) − (β̂⁻ − V), so the pair (β̂⁺ − V, β̂⁻ − V) satisfies all the constraints in Equation (16), and

f(β̂⁺ − V, β̂⁻ − V, ξ̂) = f(β̂⁺, β̂⁻, ξ̂) + λ_2 ν (ν − β̂_j⁺ − β̂_j⁻) < f(β̂⁺, β̂⁻, ξ̂)

since 0 < ν < min(β̂_j⁺, β̂_j⁻). This contradicts (β̂⁺, β̂⁻) being the minimizer. Hence the assumption that β̂_j⁺ β̂_j⁻ > 0 for some j is false.

Up to now we have formulated the optimization problem for the penalized weighted least squares method with censoring constraints and L1 and L2 penalties. The optimization problem in this new form (16) can be solved using the standard quadratic programming algorithm.

Algorithms for finding the penalized solutions

Typically an intercept β_0 is included in the model; however, in all the estimators described above the intercept β_0 has been left out for notational simplicity, and we do not penalize β_0. In least squares regression, this is achieved by simply centering the predictors and responses. Similarly, in weighted least squares regression the samples can be centered by their weighted means X_W = Σ_i w_ni X_(i) / Σ_i w_ni and Y_W = Σ_i w_ni y_(i) / Σ_i w_ni (Tibshirani, 1994). If we denote the weighted data by (Y_(i)^(w), δ_(i), X_(i)^(w)), where Y_(i)^(w) = w_ni^{1/2} (Y_(i) − Y_W) and X_(i)^(w) = w_ni^{1/2} (X_(i) − X_W), it is easy to see that the weighted least squares estimator in Equation (7) is equivalent to the ordinary least squares estimator, without intercept, applied to the weighted data with Kaplan-Meier weights. For censored observations, whose Kaplan-Meier weights are 0, we simply use the weight 1/n and then use the weighted censored data in the censoring constraints. For simplicity, we denote the weighted data by (Y_(i), δ_(i), X_(i)) hereafter; Equations (14) and (16) then simplify to

β̂_CC = argmin_{β, ξ} (1/2) (Y_U − X_U β)ᵀ (Y_U − X_U β) + (λ_0 / 2) ξᵀ ξ + (λ_2 / 2) βᵀ β
subject to Y_C ≤ X_C β + ξ    (17)

(β̂⁺, β̂⁻)_CC-LASSO = argmin_{β⁺, β⁻, ξ} (1/2) [Y_U − X_U (β⁺ − β⁻)]ᵀ [Y_U − X_U (β⁺ − β⁻)] + (λ_0 / 2) ξᵀ ξ + (λ_2 / 2) β⁺ᵀ β⁺ + (λ_2 / 2) β⁻ᵀ β⁻
subject to Y_C ≤ X_C (β⁺ − β⁻) + ξ,  β⁺ ≥ 0,  β⁻ ≥ 0  and  Σ_j (β_j⁺ + β_j⁻) ≤ s    (18)
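A small sketch of the weighted centering and square-root-weight scaling described above (an illustration only, not the dissertation's code):

```python
import numpy as np

def weighted_center_and_scale(X, y, w):
    """Center X and y by their Kaplan-Meier weighted means and scale each observation
    by sqrt(w_ni), so that weighted least squares becomes ordinary least squares
    without an intercept."""
    w = np.asarray(w, dtype=float)
    x_bar = (w[:, None] * X).sum(axis=0) / w.sum()   # weighted column means X_W
    y_bar = (w * y).sum() / w.sum()                  # weighted mean Y_W
    Xw = np.sqrt(w)[:, None] * (X - x_bar)
    yw = np.sqrt(w) * (y - y_bar)
    return Xw, yw, x_bar, y_bar
```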

Quadratic programming

A standard quadratic program is defined as

QP:  argmin_b  dᵀ b + (1/2) bᵀ D b   subject to  A b ≥ b_0    (19)

where b ∈ R^p. If D is positive semi-definite, i.e., bᵀ D b ≥ 0 for all b, the problem is convex. If D is positive definite, i.e., bᵀ D b > 0 for all b ≠ 0, the problem is strictly convex, and for a strictly convex problem the minimizer is unique if it exists.

The weighted least squares estimator with censoring constraints β̂_CC defined in Equation (17) can be converted to the standard quadratic program by defining

b = (β, ξ)ᵀ ∈ R^{p+n_c},   d = (−X_Uᵀ Y_U, 0_{n_c})ᵀ,

D = [ X_Uᵀ X_U + λ_2 I_p      0
      0                       λ_0 I_{n_c} ]   ((p+n_c) × (p+n_c)),

A = [ X_C   I_{n_c} ]   (n_c × (p+n_c)),    b_0 = Y_C    (20)

where n_c is the number of censored observations.
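The sketch below assembles d, D, A and b_0 as in Equation (20) for weighted, centered data and solves the resulting quadratic program. SciPy's general-purpose SLSQP solver is used here only as a stand-in; the dissertation relies on the dedicated Goldfarb-Idnani quadratic programming algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def solve_qpcc(Xu, Yu, Xc, Yc, lam0=1.0, lam2=1.0):
    """Assemble and solve the QP of Equation (20) for weighted, centered data."""
    p, nc = Xu.shape[1], Xc.shape[0]
    d = np.concatenate([-Xu.T @ Yu, np.zeros(nc)])
    D = np.zeros((p + nc, p + nc))
    D[:p, :p] = Xu.T @ Xu + lam2 * np.eye(p)
    D[p:, p:] = lam0 * np.eye(nc)
    A = np.hstack([Xc, np.eye(nc)])                  # constraint rows: Xc @ beta + xi >= Yc

    obj = lambda b: d @ b + 0.5 * b @ D @ b
    grad = lambda b: d + D @ b
    cons = {"type": "ineq", "fun": lambda b: A @ b - Yc, "jac": lambda b: A}
    b_start = np.concatenate([np.zeros(p), Yc])      # the feasible point (0, Y_C) of Lemma 3
    res = minimize(obj, b_start, jac=grad, constraints=[cons], method="SLSQP")
    return res.x[:p], res.x[p:]                      # beta_hat, xi_hat
```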

Lemma 3. If λ_0 and λ_2 are both positive and finite, there is a unique global minimizer of the quadratic program with censoring constraints and ridge penalty defined in Equation (20).

Proof. In Equation (20), D is positive definite if λ_0 and λ_2 are both positive. Also, there is a feasible solution

b_f = (0_p, Y_C)ᵀ    (21)

Therefore there is a unique global minimizer of the quadratic program with censoring constraints and ridge penalty defined in Equation (20).

Similarly, the weighted least squares estimator with censoring and LASSO constraints (β̂⁺, β̂⁻)_CC-LASSO defined in Equation (18) can be converted to the standard quadratic program by defining

b = (β⁺, β⁻, ξ)ᵀ ∈ R^{2p+n_c},   d = (−X_Uᵀ Y_U, X_Uᵀ Y_U, 0_{n_c})ᵀ,

D = [ X_Uᵀ X_U + λ_2 I_p    −X_Uᵀ X_U               0
      −X_Uᵀ X_U             X_Uᵀ X_U + λ_2 I_p      0
      0                     0                       λ_0 I_{n_c} ]   ((2p+n_c) × (2p+n_c)),

A = [ X_C    −X_C    I_{n_c}
      I_p     0       0
      0       I_p     0
      −1ᵀ    −1ᵀ      0 ],    b_0 = (Y_C, 0_{2p}, −s)ᵀ    (22)

where the last row of A and the last entry of b_0 encode the LASSO constraint Σ_j (β_j⁺ + β_j⁻) ≤ s.

Lemma 4. If λ_0 and λ_2 are both positive and finite, there is a unique global minimizer of the quadratic program with censoring and LASSO constraints and ridge penalty defined in Equation (22).

Proof. D in Equation (22) is positive definite if λ_0 and λ_2 are both positive, and a feasible solution is

b_f = (0_{2p}, Y_C)ᵀ    (23)

Therefore there also exists a unique global minimizer of the quadratic program with censoring and LASSO constraints and ridge penalty defined in Equation (22).

Variable selection

For problems with p > n, it is desirable to build a parsimonious model. If we use Equation (14) to fit the AFT model, one possible approach is to recursively remove variables with very small coefficients and rebuild the model. We remove one variable at a time, the one with the smallest absolute coefficient value, if p is small; when p is large, we can remove a small set of variables with the smallest absolute coefficient values at a time to speed up the process. If we use Equation (16) to fit the AFT model, the LASSO bound can be used to achieve model parsimony and select important covariates, since the LASSO constraint tends to produce zero coefficients. However, unlike the model selection algorithm LARS (Efron et al., 2004), whose lasso solution paths grow piecewise linearly in a predictable one-at-a-time way and can thus produce exactly zero coefficients, our method is found to have the same issue as discussed in Tibshirani (1996): like model selection, the LASSO is a tool for achieving parsimony, but in actuality an exactly zero coefficient is unlikely to occur in the numerical solution.

For this reason, in our variable selection algorithms with the LASSO constraint, discussed in the next section, we choose a very small ε (ε = 1e-8 in this study) and set to zero the coefficients whose absolute values are less than ε, thus producing parsimonious models.

There are several well-known model selection criteria such as Cp (Mallows, 1973), AIC (Akaike, 1973, 1974) and BIC (Schwartz, 1978). We use AIC (Akaike's Information Criterion) in this dissertation work. Akaike (1973, 1974) defined AIC as

AIC = −2 log L(θ̂ | y) + 2K    (24)

where log L(θ̂ | y) is the maximized log-likelihood and K is the number of estimable parameters in the model. For model selection here, we replace the log-likelihood term with the weighted K-fold cross-validation error, i.e., the sum of squared residuals of the uncensored data multiplied by the Kaplan-Meier weights, and take K to be the number of variables with non-zero coefficients. Thus the AIC criterion is formulated as

AIC = 2 · n_u · log(CV-WLSE) + 2K    (25)

where n_u is the number of uncensored observations and CV-WLSE is the weighted K-fold cross-validation error. This AIC criterion is appropriate when the total number of covariates is not too large relative to the sample size; we use AIC in all the simulation studies. AIC may perform poorly and drastically underestimate the corresponding risk function, i.e., lead to over-fitting, when a candidate model is over-specified and there are too many parameters relative to the sample size (Sugiura 1978, Sakamoto et al. 1986).

Hurvich and Tsai (1989) studied this small-sample (second order) bias adjustment and derived a corrected criterion, AIC_c:

AIC_c = −2 log L(θ̂ | y) + 2K · N / (N − K − 1)    (26)

where N is the sample size. For datasets with high dimensional covariates, such as survival time and gene expression data, we use AIC_c, which here is formulated as

AIC_c = n_u · log(CV-WLSE) + 2K · n_u / (n_u − K − 1)    (27)
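The corrected criterion of Equation (27) is a one-line computation; a small sketch:

```python
import numpy as np

def aic_c(cv_wlse, n_u, k):
    """AIC_c of Equation (27), where cv_wlse is the weighted K-fold cross-validation
    error, n_u the number of uncensored observations and k the number of variables
    with non-zero coefficients."""
    return n_u * np.log(cv_wlse) + 2.0 * k * n_u / (n_u - k - 1)
```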

Algorithms for finding the penalized solutions

Based on the above discussion, we provide below a few algorithms for building AFT models with different kinds of penalization.

Algorithm 1: Quadratic Programming with Censoring Constraints (QPCC)

1) Initialize the variable set FS = {1, ..., p}.
2) Divide the data into K folds. Leave out one fold at a time and solve the problem defined in Equation (14) using the standard quadratic programming algorithm.
3) Average the K models built in 2) by simply averaging their coefficients. Evaluate model quality by computing the weighted least squares error of the averaged model. Evaluate the AIC or AIC_c score, and save the averaged model if it has the lowest AIC or AIC_c score so far.
4) Stop if there is only one variable left, and return the model with the lowest AIC or AIC_c score.
5) Remove from FS the variable, or a small set of variables, with the smallest absolute coefficient values. Go to 2).

(A schematic sketch of this backward-elimination loop is given after Algorithm 2 below.)

Algorithm 2: Quadratic Programming (QP)

This algorithm is used for comparison purposes only. It is the same as Algorithm 1 except that the censoring constraints are excluded when solving Equation (14) in step 2).
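The following is a schematic Python sketch of the backward-elimination loop of Algorithm 1. The fold-wise QPCC fit and the weighted cross-validation error are supplied by the caller as functions (hypothetical interfaces, not part of the dissertation), and the AIC_c of Equation (27) is computed inline.

```python
import numpy as np

def qpcc_backward_selection(X, y, delta, fit_fold, cv_error, n_folds=5, seed=0):
    """Schematic sketch of Algorithm 1 (QPCC).

    fit_fold(X, y, delta, test_idx): solves Equation (14) with the rows in test_idx
        left out and returns one coefficient per column of X (caller-supplied).
    cv_error(X, y, delta, coef): weighted K-fold cross-validation error of an
        averaged model (caller-supplied).
    """
    rng = np.random.default_rng(seed)
    n_u = int(np.sum(delta))                         # number of uncensored observations
    active = list(range(X.shape[1]))                 # step 1: start from all variables
    best_score, best_model = np.inf, None
    while active:
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        coefs = [fit_fold(X[:, active], y, delta, idx) for idx in folds]    # step 2
        avg_coef = np.mean(coefs, axis=0)            # step 3: average the K fold models
        cv = cv_error(X[:, active], y, delta, avg_coef)
        k = len(active)
        score = n_u * np.log(cv) + 2.0 * k * n_u / (n_u - k - 1)   # AIC_c, Equation (27)
        if score < best_score:
            best_score, best_model = score, (list(active), avg_coef)
        if len(active) == 1:                         # step 4: stop with one variable left
            break
        active.pop(int(np.argmin(np.abs(avg_coef)))) # step 5: drop the smallest |coefficient|
    return best_model, best_score
```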

Algorithm 3: Quadratic Programming with Censoring and LASSO Constraints (QPCCLASSO)

1) For each of a given set of LASSO bounds, use the standard quadratic programming algorithm to solve Equation (16).
2) From the solution, form the variable set FS of variables whose absolute coefficient values are larger than a small ε (ε = 1e-8 in this study).
3) Using the variable set from step 2), divide the data into K folds. Leave out one fold at a time and solve the problem defined in Equation (16) using the standard quadratic programming algorithm.
4) Average the K models built in 3) by simply averaging their coefficients. Evaluate model quality by computing the weighted least squares error of the averaged model. Evaluate the AIC or AIC_c score, and save the averaged model if it has the lowest AIC or AIC_c score so far.
5) Repeat steps 1) to 4) until all the bounds are exhausted. Return the model with the best AIC or AIC_c score.

(A schematic sketch of this procedure is given after Algorithm 4 below.)

Algorithm 4: Quadratic Programming with LASSO Constraints (QPLASSO)

This algorithm is used for comparison purposes only. It is the same as Algorithm 3 except that the censoring constraints are excluded when solving Equation (16) in steps 1) and 3).
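A schematic sketch of Algorithm 3 in the same style, again with the Equation (16) solver and the fold-wise refit supplied by the caller as hypothetical interfaces:

```python
import numpy as np

def qpcclasso_selection(X, y, delta, lasso_bounds, fit_full, refit_cv, eps=1e-8):
    """Schematic sketch of Algorithm 3 (QPCCLASSO).

    fit_full(X, y, delta, s): coefficients from solving Equation (16) on all the data
        with LASSO bound s (caller-supplied).
    refit_cv(X, y, delta, s): (averaged coefficients, weighted K-fold CV error) from
        the fold-wise refits of steps 3)-4) (caller-supplied).
    """
    n_u = int(np.sum(delta))
    best_score, best_model = np.inf, None
    for s in lasso_bounds:                           # step 1: loop over the LASSO bounds
        coef = fit_full(X, y, delta, s)
        active = np.flatnonzero(np.abs(coef) > eps)  # step 2: keep coefficients above eps
        if active.size == 0:
            continue
        avg_coef, cv = refit_cv(X[:, active], y, delta, s)          # steps 3)-4)
        k = active.size
        score = n_u * np.log(cv) + 2.0 * k * n_u / (n_u - k - 1)    # AIC_c, Equation (27)
        if score < best_score:
            best_score, best_model = score, (active, avg_coef)
    return best_model, best_score                    # step 5: model with the best score
```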

Parameter tuning

When solving Equation (14) or (16), two parameters need to be tuned. One is λ_2, which accounts for the ridge penalty; there are already many discussions of how to choose appropriate ridge penalties, so we do not discuss this here. The advantage of using λ_2 in the algorithms is that it makes the quadratic programming problem strictly convex, so that a unique global optimizer can be found.

The other parameter is λ_0, which accounts for the penalties on violations of the censoring constraints. The value of this parameter depends on how stringently one wants the model to satisfy the censoring constraints: the larger the value, the more likely the model is to satisfy the censoring constraints, but the poorer its predictions for uncensored samples may become. A good practice is to balance the model quality for the uncensored data against the censoring constraints. We can build a model by simply setting λ_0 to 1 and then checking how the model predicts both censored and uncensored samples. If the uncensored samples are predicted very well but most censoring constraints are severely violated, we may want to increase λ_0 and start over; if most censoring constraints are satisfied but the uncensored samples are predicted very poorly, we may want to decrease λ_0 and start over. In our study, we find that setting λ_0 to 1 is sufficient for most problems.


More information

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR DEPARTMENT OF STATISTICS North Carolina State University 2501 Founders Drive, Campus Box 8203 Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series No. 2583 Simultaneous regression shrinkage, variable

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis

Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis Biostatistics (2010), 11, 4, pp. 599 608 doi:10.1093/biostatistics/kxq023 Advance Access publication on May 26, 2010 Simultaneous variable selection and class fusion for high-dimensional linear discriminant

More information

High-Dimensional Statistical Learning: Introduction

High-Dimensional Statistical Learning: Introduction Classical Statistics Biological Big Data Supervised and Unsupervised Learning High-Dimensional Statistical Learning: Introduction Ali Shojaie University of Washington http://faculty.washington.edu/ashojaie/

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

Prediction & Feature Selection in GLM

Prediction & Feature Selection in GLM Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY Ingo Langner 1, Ralf Bender 2, Rebecca Lenz-Tönjes 1, Helmut Küchenhoff 2, Maria Blettner 2 1

More information

Statistics 262: Intermediate Biostatistics Model selection

Statistics 262: Intermediate Biostatistics Model selection Statistics 262: Intermediate Biostatistics Model selection Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Today s class Model selection. Strategies for model selection.

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

Regularization Paths

Regularization Paths December 2005 Trevor Hastie, Stanford Statistics 1 Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Saharon Rosset, Ji Zhu, Hui Zhou, Rob Tibshirani and

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

Power and Sample Size Calculations with the Additive Hazards Model

Power and Sample Size Calculations with the Additive Hazards Model Journal of Data Science 10(2012), 143-155 Power and Sample Size Calculations with the Additive Hazards Model Ling Chen, Chengjie Xiong, J. Philip Miller and Feng Gao Washington University School of Medicine

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

TGDR: An Introduction

TGDR: An Introduction TGDR: An Introduction Julian Wolfson Student Seminar March 28, 2007 1 Variable Selection 2 Penalization, Solution Paths and TGDR 3 Applying TGDR 4 Extensions 5 Final Thoughts Some motivating examples We

More information

Longitudinal + Reliability = Joint Modeling

Longitudinal + Reliability = Joint Modeling Longitudinal + Reliability = Joint Modeling Carles Serrat Institute of Statistics and Mathematics Applied to Building CYTED-HAROSA International Workshop November 21-22, 2013 Barcelona Mainly from Rizopoulos,

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

Focused fine-tuning of ridge regression

Focused fine-tuning of ridge regression Focused fine-tuning of ridge regression Kristoffer Hellton Department of Mathematics, University of Oslo May 9, 2016 K. Hellton (UiO) Focused tuning May 9, 2016 1 / 22 Penalized regression The least-squares

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Multivariate Survival Analysis

Multivariate Survival Analysis Multivariate Survival Analysis Previously we have assumed that either (X i, δ i ) or (X i, δ i, Z i ), i = 1,..., n, are i.i.d.. This may not always be the case. Multivariate survival data can arise in

More information

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Zhenqiu Liu, Dechang Chen 2 Department of Computer Science Wayne State University, Market Street, Frederick, MD 273,

More information

The prediction of house price

The prediction of house price 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Multicategory Vertex Discriminant Analysis for High-Dimensional Data

Multicategory Vertex Discriminant Analysis for High-Dimensional Data Multicategory Vertex Discriminant Analysis for High-Dimensional Data Tong Tong Wu Department of Epidemiology and Biostatistics University of Maryland, College Park October 8, 00 Joint work with Prof. Kenneth

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress) Compressed Sensing in Cancer Biology? (A Work in Progress) M. Vidyasagar FRS Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar University

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

PENALIZED PRINCIPAL COMPONENT REGRESSION. Ayanna Byrd. (Under the direction of Cheolwoo Park) Abstract

PENALIZED PRINCIPAL COMPONENT REGRESSION. Ayanna Byrd. (Under the direction of Cheolwoo Park) Abstract PENALIZED PRINCIPAL COMPONENT REGRESSION by Ayanna Byrd (Under the direction of Cheolwoo Park) Abstract When using linear regression problems, an unbiased estimate is produced by the Ordinary Least Squares.

More information

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Jonathan Taylor - p. 1/15 Today s class Bias-Variance tradeoff. Penalized regression. Cross-validation. - p. 2/15 Bias-variance

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Regularization Paths. Theme

Regularization Paths. Theme June 00 Trevor Hastie, Stanford Statistics June 00 Trevor Hastie, Stanford Statistics Theme Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Mee-Young Park,

More information

Regularized Discriminant Analysis and Its Application in Microarray

Regularized Discriminant Analysis and Its Application in Microarray Regularized Discriminant Analysis and Its Application in Microarray Yaqian Guo, Trevor Hastie and Robert Tibshirani May 5, 2004 Abstract In this paper, we introduce a family of some modified versions of

More information

Joint Modeling of Longitudinal Item Response Data and Survival

Joint Modeling of Longitudinal Item Response Data and Survival Joint Modeling of Longitudinal Item Response Data and Survival Jean-Paul Fox University of Twente Department of Research Methodology, Measurement and Data Analysis Faculty of Behavioural Sciences Enschede,

More information

Survival Prediction Under Dependent Censoring: A Copula-based Approach

Survival Prediction Under Dependent Censoring: A Copula-based Approach Survival Prediction Under Dependent Censoring: A Copula-based Approach Yi-Hau Chen Institute of Statistical Science, Academia Sinica 2013 AMMS, National Sun Yat-Sen University December 7 2013 Joint work

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes. Donglin Zeng, Department of Biostatistics, University of North Carolina

Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes. Donglin Zeng, Department of Biostatistics, University of North Carolina Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes Introduction Method Theoretical Results Simulation Studies Application Conclusions Introduction Introduction For survival data,

More information

Exam: high-dimensional data analysis January 20, 2014

Exam: high-dimensional data analysis January 20, 2014 Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish

More information

The lasso: some novel algorithms and applications

The lasso: some novel algorithms and applications 1 The lasso: some novel algorithms and applications Newton Institute, June 25, 2008 Robert Tibshirani Stanford University Collaborations with Trevor Hastie, Jerome Friedman, Holger Hoefling, Gen Nowak,

More information

Statistical Methods for Data Mining

Statistical Methods for Data Mining Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

FULL LIKELIHOOD INFERENCES IN THE COX MODEL October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach

More information

Regularized Discriminant Analysis and Its Application in Microarrays

Regularized Discriminant Analysis and Its Application in Microarrays Biostatistics (2005), 1, 1, pp. 1 18 Printed in Great Britain Regularized Discriminant Analysis and Its Application in Microarrays By YAQIAN GUO Department of Statistics, Stanford University Stanford,

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

A simulation study of model fitting to high dimensional data using penalized logistic regression

A simulation study of model fitting to high dimensional data using penalized logistic regression A simulation study of model fitting to high dimensional data using penalized logistic regression Ellinor Krona Kandidatuppsats i matematisk statistik Bachelor Thesis in Mathematical Statistics Kandidatuppsats

More information

Grouped variable selection in high dimensional partially linear additive Cox model

Grouped variable selection in high dimensional partially linear additive Cox model University of Iowa Iowa Research Online Theses and Dissertations Fall 2010 Grouped variable selection in high dimensional partially linear additive Cox model Li Liu University of Iowa Copyright 2010 Li

More information

Extensions of Cox Model for Non-Proportional Hazards Purpose

Extensions of Cox Model for Non-Proportional Hazards Purpose PhUSE Annual Conference 2013 Paper SP07 Extensions of Cox Model for Non-Proportional Hazards Purpose Author: Jadwiga Borucka PAREXEL, Warsaw, Poland Brussels 13 th - 16 th October 2013 Presentation Plan

More information

Building a Prognostic Biomarker

Building a Prognostic Biomarker Building a Prognostic Biomarker Noah Simon and Richard Simon July 2016 1 / 44 Prognostic Biomarker for a Continuous Measure On each of n patients measure y i - single continuous outcome (eg. blood pressure,

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

arxiv: v1 [stat.me] 7 Dec 2013

arxiv: v1 [stat.me] 7 Dec 2013 VARIABLE SELECTION FOR SURVIVAL DATA WITH A CLASS OF ADAPTIVE ELASTIC NET TECHNIQUES arxiv:1312.2079v1 [stat.me] 7 Dec 2013 By Md Hasinur Rahaman Khan University of Dhaka and By J. Ewart H. Shaw University

More information

Semiparametric Regression

Semiparametric Regression Semiparametric Regression Patrick Breheny October 22 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/23 Introduction Over the past few weeks, we ve introduced a variety of regression models under

More information

Simultaneous Coefficient Penalization and Model Selection in Geographically Weighted Regression: The Geographically Weighted Lasso

Simultaneous Coefficient Penalization and Model Selection in Geographically Weighted Regression: The Geographically Weighted Lasso Simultaneous Coefficient Penalization and Model Selection in Geographically Weighted Regression: The Geographically Weighted Lasso by David C. Wheeler Technical Report 07-08 October 2007 Department of

More information

Linear Regression Models. Based on Chapter 3 of Hastie, Tibshirani and Friedman

Linear Regression Models. Based on Chapter 3 of Hastie, Tibshirani and Friedman Linear Regression Models Based on Chapter 3 of Hastie, ibshirani and Friedman Linear Regression Models Here the X s might be: p f ( X = " + " 0 j= 1 X j Raw predictor variables (continuous or coded-categorical

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective. Anastasios (Butch) Tsiatis and Xiaofei Bai

Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective. Anastasios (Butch) Tsiatis and Xiaofei Bai Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective Anastasios (Butch) Tsiatis and Xiaofei Bai Department of Statistics North Carolina State University 1/35 Optimal Treatment

More information

Least Absolute Deviations Estimation for the Accelerated Failure Time Model. University of Iowa. *

Least Absolute Deviations Estimation for the Accelerated Failure Time Model. University of Iowa. * Least Absolute Deviations Estimation for the Accelerated Failure Time Model Jian Huang 1,2, Shuangge Ma 3, and Huiliang Xie 1 1 Department of Statistics and Actuarial Science, and 2 Program in Public Health

More information

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Problems

More information