
Journal of Statistical Computation and Simulation, iFirst, 2011, 1-12

Group variable selection for data with dependent structures

Lingmin Zeng (a) and Jun Xie (b)

(a) Department of Biostatistics, MedImmune, One MedImmune Way, Gaithersburg, MD 20878, USA; (b) Department of Statistics, Purdue University, 250 N. University Street, West Lafayette, IN 47907, USA. Corresponding author: junxie@purdue.edu

(Received 19 April 2010; final version received 4 October 2010)

Variable selection methods have been widely used in the analysis of high-dimensional data, for example, gene expression microarray data and single nucleotide polymorphism data. A special feature of genomic data is that genes participating in a common metabolic pathway or sharing a similar biological function tend to have high correlations. The collinearity naturally embedded in these data requires special handling, which existing variable selection methods do not provide. In this paper, we propose a set of new methods to select variables in correlated data. The new methods follow the forward selection procedure of least angle regression (LARS) but conduct grouping and selecting at the same time. The methods are particularly suited to the case where no prior information on the group structure of the data is available. Simulations and real examples show that our proposed methods often outperform the existing variable selection methods, including LARS and elastic net, in terms of both reducing prediction error and preserving sparsity of representation.

Keywords: hard thresholding; least angle regression; variable selection

1. Introduction

Regression is a simple yet highly useful statistical method in data analysis. The goal of regression analysis is to discover the relationship between a response y and a set of predictors x_1, x_2, ..., x_p. Variable selection methods have long been used in regression analysis, for example forward selection, backward elimination, and best subset regression. The number of variables p in the traditional setting is typically 10 or at most a few dozen. Modern scientific technology, led by the microarray, has produced data dramatically above the conventional scale: p ranges from 1000 to 10,000 in gene expression microarray data, and up to 500,000 in single nucleotide polymorphism (SNP) data. To make things more complicated, the large numbers of variables in these biological data are dependent. For example, in the analysis of gene expression microarray data, it is well known that for genes that share a common biological function or participate in the same metabolic pathway, the pairwise correlations among them can be very high [1]. Traditional variable selection methods that select variables one by one may therefore miss important group effects on pathways.

Consequently, when traditional variable selection methods are applied to multiple data sets from a common biological system, the selected variables from the multiple studies may show little overlap. The vast dimension of genomic data poses a challenge to the traditional variable selection methods. Several promising techniques have been proposed to overcome the dimension limitation, including Lasso [2], LARS [3], and elastic net [4]. Lasso (least absolute shrinkage and selection operator) imposes an L1-norm bound on the regression coefficients while minimizing the residual sum of squares. Due to the nature of the L1 penalty, Lasso shrinks the regression coefficients towards 0 and produces some coefficients that are exactly 0, and hence implements variable selection. Efron et al. [3] proposed a less greedy forward variable selection algorithm named least angle regression (LARS). With a slight modification, LARS calculates all possible Lasso estimators but requires an order of magnitude less computing time than Lasso. The efficiency of the LARS algorithm has made it widely used for variable selection. However, for highly correlated variables, Lasso tends to select only one variable instead of the whole group. Zou and Hastie [4] proposed the elastic net method by combining L2 and L1 penalties on the regression coefficients. The convex function induced by the L2 penalty helps elastic net achieve a grouping effect in which highly correlated variables are in or out of the model together. More specifically, elastic net estimators of the regression coefficients of highly correlated variables tend to be equal (up to a change of sign). Analogously, Bondell and Reich [5] proposed a penalized regression method called OSCAR, with an octagonal-shaped constraint on the coefficients, to encourage selection of a group of highly correlated variables. Elastic net often outperforms Lasso in terms of prediction error on correlated data, and OSCAR has a comparable performance with elastic net [5]. However, it is unclear how groups are defined in either method.

Alternatively, for data with dependent structures, a naive approach is a two-step procedure: first cluster highly correlated variables into groups, then select among the groups. Park et al. [6] applied hierarchical clustering on gene expression data in the first step; for each cluster (group), they then averaged the genes and took the average as a supergene to fit a regression model. Yuan and Lin [7] developed group LARS and group Lasso, which extend LARS and Lasso to select groups of variables. However, group LARS (group Lasso) requires information on the group structure beforehand, so the method is in fact the second step of a two-step procedure. Another method is the supervised group Lasso proposed by Ma et al. [8], which is also a two-step procedure: after dividing genes into clusters using the K-means method, they first identify important genes within each group by Lasso and then select important clusters using the group Lasso. A limitation of the two-step procedures is that the first step of clustering analysis does not incorporate the information of the response variable y. In other words, grouping variables based on their correlations alone is not appropriate for the variable selection problem, which requires the joint information of the predictors and the response variable.

In this paper, we propose two new variable selection algorithms, glars and gridge, for data with dependent structures. Our methods conduct grouping and selecting at the same time and therefore work well when prior information on the group structure is not available. Similar to LARS, the proposed methods are forward procedures, which are computationally efficient and provide a piecewise linear solution path. In Section 2, we define a variable selection criterion in two-dimensional space and then specify the glars and gridge algorithms. We compare the proposed methods with ordinary least squares (OLS), ridge regression, LARS, and elastic net through simulations in Section 3. We study an example on prostate cancer in Section 4. In Section 5, we apply the methods to an example of genetic variation SNP data, which contains a few thousand variables (SNPs) with complicated dependence. We summarize and discuss future work in Section 6.

2. glars and gridge algorithms

Consider a linear regression model
$y = X\beta + \epsilon$,
where $y = (y_1, y_2, \ldots, y_n)^T$ is the response variable, $X = (x_1, x_2, \ldots, x_p)$ is the predictor matrix, and $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T$ is a vector of independent and identically distributed random errors with mean 0 and variance $\sigma^2$. There are n observations and p predictors, where p may be larger than n. We centre the response variable and standardize the column vectors of the predictor matrix, so there is no intercept in our model:

$\sum_{i=1}^{n} y_i = 0, \quad \sum_{i=1}^{n} x_{ij} = 0, \quad \sum_{i=1}^{n} x_{ij}^2 = 1 \quad \text{for } j = 1, 2, \ldots, p.$

2.1. Grouping definition

We start our methods with a grouping definition. Predictors form a group if they satisfy both of the following two criteria:
- they are highly correlated with the response variable (or the current residual);
- they are highly correlated with each other.

The correlation thresholds for the two criteria are data driven. The first threshold is suggested to be the 75th percentile of all correlations (in absolute value) between the current residual and the unselected predictors. The second criterion decides the group size; its threshold is chosen from a set of grids, for instance {0.9, 0.8, 0.7, 0.6}, by cross-validation (CV). In practice, the distribution of all pairwise correlations among the predictors gives an idea of the set of grids. We then decide on the threshold by CV, in addition to the tuning parameters introduced in Section 2.4. An important difference of the proposed methods from traditional selection procedures is that our variable selection criterion has two components and hence is defined by a region in two-dimensional space, i.e. $(t_1, t_2)$ in the algorithm specified next. In addition, the first requirement of selecting a variable highly correlated with the response variable is not affected by collinearity among the predictors.

2.2. glars algorithm

Before introducing the new glars algorithm, we first briefly review the LARS procedure. At the beginning of LARS, a predictor enters the model if its absolute correlation with the response is the largest among all predictors. The coefficient of this predictor grows in its OLS direction until another predictor has the same correlation with the current residual (i.e. equal angle). Next, the coefficients of both predictors move along their joint OLS direction until a third predictor has the same correlation with the current residual as the first two. The whole process continues until all predictors have entered the model. In each step, one variable is added to the model and the solution paths, which are the coefficient estimators as functions of the tuning parameter (defined later in Formula (1)), are extended in a piecewise linear fashion. After all variables enter the model, the whole LARS solution path is complete.

In the glars algorithm, we start as LARS does by selecting the predictor that has the largest correlation with the response. We call this predictor a leader element. We then build a group based on this leader element and the current residual according to the grouping definition. Note that both criteria in Section 2.1 have to be satisfied when selecting a variable. A minimal sketch of this group construction step is given below.
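The following Python sketch illustrates the grouping definition under our own naming conventions (functions and arguments such as `build_group`, `t1_quantile`, and `t2` are ours, not from the paper): given the current residual, it picks the leader predictor and collects the predictors that pass both correlation thresholds.

```python
import numpy as np

def build_group(X, residual, inactive, t2, t1_quantile=0.75):
    """Sketch of the grouping definition: X is n x p with standardized
    columns, residual is the current residual r^[k-1], inactive is a list
    of candidate column indices, t2 is the within-group correlation
    threshold chosen by cross-validation."""
    # Absolute inner products of each inactive predictor with the residual
    # (proportional to correlations, since columns are standardized).
    cors = np.abs(X[:, inactive].T @ residual)
    # Leader element: largest absolute correlation with the residual.
    leader = inactive[int(np.argmax(cors))]
    # First threshold: 75th percentile of the residual correlations.
    t1 = np.quantile(cors, t1_quantile)
    group = [leader]
    for j, c in zip(inactive, cors):
        if j == leader:
            continue
        # Both criteria: correlated with the residual and with the leader.
        if c > t1 and abs(X[:, j] @ X[:, leader]) > t2:
            group.append(j)
    return group
```

The thresholds mirror the paper's suggestions: the 75th percentile for t1 and a grid such as {0.9, 0.8, 0.7, 0.6} for t2.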

Once a group has been constructed, it is represented by a unique direction in $R^n$, namely the linear combination of the OLS directions of all variables in the group. Next, we choose another leader element, analogous to the equal-angle requirement of the LARS algorithm. A new group is formed, again following the grouping definition. We extend the solution paths in a piecewise linear format. The whole process continues until all predictors enter the model. The detailed algorithm is described below. Note that in Step 5 we adopt the idea of Park and Hastie [9] and replace the average squared correlation by the average absolute correlation to make the algorithm robust.

(1) Initialization: set the step index k = 1, $\beta^{[0]} = 0$, residual $r^{[0]} = Y$, active set $A_0 = \emptyset$, inactive set $A_0^C = \{X_1, X_2, \ldots, X_p\}$.

(2) Identify the leader predictor $x^*$ for the first group, where $x^* = \arg\max_{x_i} |x_i' r^{[k-1]}|$, $x_i \in A_{k-1}^C$.

(3) Construct the group $G_k$ with the leader predictor $x^*$ from Step 2 according to the two criteria: $|x_j' r^{[k-1]}| > t_1$ and $|x_j' x^*| > t_2$, $x_j \in A_{k-1}^C$, where $t_1$ is the 75th percentile of all correlations between $x_j$ and $r^{[k-1]}$, and $t_2 = t \in \{0.9, 0.8, 0.7, 0.6\}$. Set $A_k = A_{k-1} \cup G_k$, $A_k^C = A_{k-1}^C \setminus G_k$.

(4) Compute the current direction $\gamma$ with components $\gamma_{A_k} = (X_{A_k}' X_{A_k})^{-1} X_{A_k}' r^{[k-1]}$ and $\gamma_{A_k^C} = 0$, where $X_{A_k}$ denotes the matrix comprised of the columns of X corresponding to $A_k$.

(5) Calculate how far the glars algorithm progresses in direction $\gamma$. This divides into two small steps:
Find $x_j$ in $A_k^C$ which corresponds to the smallest $\alpha \in (0, 1]$ such that
$\|X_{G_j}'(r^{[k-1]} - \alpha X\gamma)\|_{L_1} / p_j = |x_j'(r^{[k-1]} - \alpha X\gamma)|$,
where $G_j$ is a group from $A_k$, $p_j$ is the number of variables in group $G_j$, and $\|\cdot\|_{L_1}$ represents the sum of absolute values.
As in Step 3, find the group with the leader predictor $x_j$ selected above and denote it by $X_{G_{j^*}}$. Recalculate $\alpha \in (0, 1]$ for this new group such that
$\|X_{G_j}'(r^{[k-1]} - \alpha X\gamma)\|_{L_1} / p_j = \|X_{G_{j^*}}'(r^{[k-1]} - \alpha X\gamma)\|_{L_1} / p_{j^*}$.
Update $\beta^{[k]} = \beta^{[k-1]} + \alpha\gamma$, $r^{[k]} = Y - X\beta^{[k]}$.

(6) Update k to k + 1, and set $A_k = A_{k-1} \cup G_{j^*}$, $A_k^C = A_{k-1}^C \setminus G_{j^*}$.

(7) If the number of variables in $A_k$ is smaller than min(p, n), return to Step 4. Otherwise, set $\gamma$, $\beta$, and r to be the OLS solutions and stop.
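As a concrete illustration, here is a hedged Python sketch of the direction computation and step-length search in Steps 4 and 5, under our own simplifications (a discrete grid search over alpha instead of the exact recalculation in Step 5; names such as `glars_step` are ours):

```python
import numpy as np

def avg_abs_cor(Xg, resid):
    """Average absolute inner product of a set of columns with a residual."""
    return np.mean(np.abs(Xg.T @ resid))

def glars_step(X, y, beta, active_groups, inactive,
               alphas=np.linspace(0.01, 1.0, 100)):
    """One glars iteration (sketch): move along the joint OLS direction of
    the active set until some inactive predictor catches up with the
    smallest active-group average absolute correlation with the residual."""
    active = [j for g in active_groups for j in g]
    resid = y - X @ beta
    # Step 4: OLS direction restricted to the active columns.
    gamma = np.zeros(X.shape[1])
    XA = X[:, active]
    gamma[active] = np.linalg.lstsq(XA, resid, rcond=None)[0]
    # Step 5 (grid version): find a small alpha at which an inactive
    # predictor matches some active group's average absolute correlation.
    for alpha in alphas:
        r_new = resid - alpha * (X @ gamma)
        group_level = min(avg_abs_cor(X[:, g], r_new) for g in active_groups)
        outside = np.abs(X[:, inactive].T @ r_new)
        if outside.max() >= group_level or alpha == alphas[-1]:
            leader = inactive[int(np.argmax(outside))]
            return beta + alpha * gamma, leader, alpha
```

The published algorithm then forms the new group around this leader and re-solves for alpha exactly; the grid search above merely conveys the idea.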

2.3. gridge algorithm

OLS performs poorly when the correlations among the predictors are high and/or the noise level is high. Since both LARS and glars move towards OLS directions in each step, they share this limitation. Ridge estimators, on the other hand, perform better in this situation. We therefore propose a gridge algorithm, which moves towards the ridge estimator direction in each step. The relationship between the ridge estimator $\hat\beta(\lambda)$ and the OLS estimator $\hat\beta$ satisfies

$\hat\beta(\lambda) = (X'X + \lambda I)^{-1} X'Y = (X'X(I + \lambda(X'X)^{-1}))^{-1} X'Y = (I + \lambda(X'X)^{-1})^{-1} \hat\beta = C\hat\beta,$

where $C = (I + \lambda(X'X)^{-1})^{-1}$ and $\lambda$ is the ridge parameter. The gridge algorithm is thus a simple modification of the glars algorithm. When a group is constructed, gridge represents the group by a unique direction obtained from the linear combination of the ridge directions of all variables in the group, and the variable coefficients move towards the ridge directions.

In our simulations, we notice that gridge outperforms other methods in terms of relative prediction errors (RPEs, defined in the simulations below). However, the method is limited by a relatively large number of false positives due to an over-grouping effect. In other words, although ridge shrinks small coefficients, it does not truncate them to zero for variable selection. We propose to add a hard threshold $\delta$ to the gridge estimators so that small (but nonzero) coefficients are removed, i.e. $\tilde\beta_j = \hat\beta_j I(|\hat\beta_j| > \delta)$. Based on simulations, we define the threshold as $\delta = \sigma\sqrt{\log(p)/n}$. Hence a smaller error term, a smaller number of predictors, or a larger sample size gives a smaller threshold. We name the modified gridge algorithm, with this hard threshold filtering, gridge_new. The simulation studies below show that gridge_new not only preserves low prediction errors but also greatly reduces false positives.

2.4. Tuning parameters

As LARS does, both glars and gridge produce the entire piecewise linear solution paths with all p (or n, when p > n) variables in the model. Groups of variables are selected when we stop the paths after a certain number of steps. The number of steps k is the tuning parameter. Equivalently, we may use as the tuning parameter the fraction of the L1 norm of the coefficients,

$s = \sum_{j \in \text{selected}} |\hat\beta_j| \Big/ \sum_{j} |\hat\beta_j|. \qquad (1)$

For glars, s (or k) is the only tuning parameter. We determine it by a standard five-fold CV. For gridge, there are two tuning parameters, the ridge parameter $\lambda$ in addition to s (or k). Hence we cross-validate over two dimensions: we first choose a grid for $\lambda$, say {0.01, 0.1, 1, 10, 100, 1000}; the gridge algorithm produces the entire solution paths for each $\lambda$; the parameter s (or k) is then selected by five-fold CV; at the end, we choose the $\lambda$ value that gives the smallest CV error.

3. Simulation studies

Simulation studies are used to compare the proposed glars and gridge with OLS, ridge regression, LARS, and elastic net. The simulated data are generated from the true model $y = X\beta + \sigma\epsilon$, $\epsilon \sim N(0, 1)$. Four examples, covering different scenarios, are presented in the following. Each example contains a training set and a test set. The tuning parameters are selected on the training set by five-fold CV. Variable selection methods are compared in terms of RPE [10] and selection accuracy on the test set. The RPE is defined as $\mathrm{RPE} = (\hat\beta - \beta)^T \Sigma (\hat\beta - \beta)/\sigma^2$, where $\Sigma$ is the population covariance matrix of X. Selection accuracy is illustrated by the numbers of true and false positives among the selected variables.
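The hard-threshold filter and the L1-fraction tuning parameter are simple to express; a minimal Python sketch follows, mirroring the formulas as written above (the helper names and the assumption that the coefficient estimates and noise level are available as `beta_hat` and `sigma` are ours):

```python
import numpy as np

def hard_threshold(beta_hat, sigma, n):
    """gridge_new filter (sketch): zero out coefficients whose magnitude
    falls below delta = sigma * sqrt(log(p) / n)."""
    p = len(beta_hat)
    delta = sigma * np.sqrt(np.log(p) / n)
    return np.where(np.abs(beta_hat) > delta, beta_hat, 0.0)

def l1_fraction(beta_hat, selected):
    """Tuning parameter s of Formula (1): the fraction of the L1 norm
    carried by the currently selected coefficients."""
    return np.sum(np.abs(beta_hat[selected])) / np.sum(np.abs(beta_hat))
```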

In each example, we replicate the simulation 100 times and report the median RPEs and median true/false positives. The four scenarios are as follows; a data-generation sketch for the grouped-covariate examples is given after this list.

(1) Example 1: there are 100 and 200 observations in the training and test sets, respectively. The true parameters are $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)$ and $\sigma = 3$. The pairwise correlation between $x_i$ and $x_j$ is set to be $\mathrm{corr}(x_i, x_j) = 0.5^{|i-j|}$, i, j = 1, ..., 8. This example creates a sparse model with a few large effects, and the predictors have first-order autoregressive correlations.

(2) Example 2 (adopted from [11]): we simulate 100 and 400 observations in the training and test sets, respectively. We set the true parameters as $\beta = (3, \ldots, 3, 1.5, \ldots, 1.5, 0, \ldots, 0)$, with 3 for the first 15 coefficients, 1.5 for the next 5, and 0 for the remaining 20, and $\sigma = 6$. The predictors are generated as
$x_i = Z + \epsilon_i^x$, $Z \sim N(0, 1)$, i = 1, ..., 15,
$x_i \sim N(0, 1)$, i.i.d., i = 16, ..., 40,
where the $\epsilon_i^x$ are independent identically distributed N(0, 0.01), i = 1, ..., 15. This example creates one group from the first 15 highly correlated covariates. The next five covariates are independent but provide signals on the response variable.

(3) Example 3: we simulate 100 and 400 observations in the training and test sets, respectively. We set the true parameters as $\beta = (3, \ldots, 3, 0, \ldots, 0)$, with 3 for the first 15 coefficients and 0 for the remaining 25, and $\sigma = 15$. The predictors are generated as
$x_i = Z_1 + \epsilon_i^x$, $Z_1 \sim N(0, 1)$, i = 1, ..., 5,
$x_i = Z_2 + \epsilon_i^x$, $Z_2 \sim N(0, 1)$, i = 6, ..., 10,
$x_i = Z_3 + \epsilon_i^x$, $Z_3 \sim N(0, 1)$, i = 11, ..., 15,
$x_i \sim N(0, 1)$, i.i.d., i = 16, ..., 40,
where the $\epsilon_i^x$ are independent identically distributed N(0, 0.01), i = 1, ..., 15. There are three equally important groups with five members in each, together with 25 noise variables.

(4) Example 4: we simulate 100 and 200 observations in the training and test sets, respectively. We set the true parameters as $\beta = (3, 3, 3, 0, 0, 3, 3, 3, 0, 0, 3, 3, 3, 0, 0, 0, \ldots, 0)$, with the final 0 repeated 25 times. The predictors and the error terms are the same as in Example 3. There are again three equally important groups with five members in each of them. However, in each group there are two noise variables, which have no effect on the response variable but are highly correlated with the other three important variables. There are 31 noise variables in total.
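As a concrete illustration, the following Python sketch generates data along the lines of Example 3 (three correlated groups of five plus independent noise covariates); it is our own illustration, not code from the paper:

```python
import numpy as np

def simulate_example3(n=100, sigma=15.0, seed=0):
    """Generate (X, y) roughly as in Example 3: three groups of five highly
    correlated covariates (columns 0-14) plus 25 independent noise columns."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, 40))
    for g in range(3):                       # three latent group factors
        Z = rng.normal(size=n)
        for j in range(5):
            # within-group noise epsilon_i^x ~ N(0, 0.01), i.e. sd 0.1
            X[:, 5 * g + j] = Z + rng.normal(scale=0.1, size=n)
    X[:, 15:] = rng.normal(size=(n, 25))     # independent noise covariates
    beta = np.concatenate([np.full(15, 3.0), np.zeros(25)])
    y = X @ beta + sigma * rng.normal(size=n)
    return X, y, beta
```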

Table 1 summarizes the prediction results in terms of RPE; the methods with the smallest RPE are highlighted in each example. The OLS estimate, which includes all variables in the regression model, has the highest prediction error. The other variable selection methods improve performance, with smaller prediction errors in all the examples. Specifically, elastic net produces the smallest prediction errors in Example 1, where variables are correlated but not in group structures. gridge and gridge_new produce the smallest prediction errors in most of the examples; their reductions of RPE in Examples 2-4 are 23.1%, 12.5%, and 37.8%, respectively, compared with elastic net. We also notice that, while estimating the large coefficients close to the true coefficients, gridge tends to select more variables than elastic net, owing to its over-grouping effect from ridge. After we add a hard threshold to gridge, the gridge_new estimators achieve the best performance. Our first method, glars, gives prediction errors comparable with LARS. However, the coefficient estimators show that glars has a clear pattern of selecting groups of variables in Examples 2-4; that is, variables in a common group enter the model together.

Table 1. Median relative prediction errors (RPE) and their standard errors (SE) for the four simulated examples, based on 100 replications, for OLS, ridge, LARS, elastic net, glars, gridge, and gridge_new. The SEs of the RPEs are estimated by bootstrap with B = 1000 resampling iterations on the 100 RPEs. The bold numbers indicate the methods with the smallest prediction errors.

The grouping effect of glars and gridge is partially demonstrated in Table 2, which compares the different methods in terms of variable selection accuracy. Column TP represents the median number of nonzero coefficients correctly identified as nonzero, or true positives. Column FP represents the median number of zero coefficients incorrectly specified as nonzero, or false positives. The OLS and ridge estimators have large true positives but also large false positives. The LARS algorithm greatly lowers the false positives relative to OLS but misses many truly effective variables in Examples 2-4, due to its limitation in analysing data with dependent (group) structures. Elastic net and our proposed glars improve on OLS and LARS, especially in Examples 2-4, and glars is better than elastic net in Examples 2 and 4. gridge has larger false positives than glars and elastic net. After adding the threshold, gridge_new achieves the best results. In summary, our simulations show that gridge_new outperforms the existing variable selection methods and has a greater advantage for data with group structures.

Table 2. Median numbers of true positives (TP, nonzero coefficients) and false positives (FP, zero coefficients misspecified as nonzero coefficients) for the four simulations, based on 100 replications, for OLS, ridge, LARS, elastic net, glars, gridge, and gridge_new. The bold numbers indicate the methods with the largest true positives and the smallest false positives.
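For reference, the two evaluation criteria used in Tables 1 and 2 are straightforward to compute; a small Python sketch follows (helper names are ours; `Sigma` denotes the population covariance matrix of X):

```python
import numpy as np

def relative_prediction_error(beta_hat, beta, Sigma, sigma2):
    """RPE = (beta_hat - beta)' Sigma (beta_hat - beta) / sigma^2."""
    d = beta_hat - beta
    return float(d @ Sigma @ d) / sigma2

def true_false_positives(beta_hat, beta):
    """TP: true nonzero coefficients estimated as nonzero;
    FP: true zero coefficients estimated as nonzero."""
    est_nz, true_nz = beta_hat != 0, beta != 0
    tp = int(np.sum(est_nz & true_nz))
    fp = int(np.sum(est_nz & ~true_nz))
    return tp, fp
```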

4. Prostate cancer example

An example that has been analysed by many variable selection methods, including Lasso and elastic net, concerns prostate cancer and comes from a study by Stamey et al. [12]. There were 97 observations collected from men who were about to receive a radical prostatectomy. We want to discover the relationship between the level of prostate-specific antigen and several clinical endpoints. The response variable is log(prostate-specific antigen) (lpsa). The clinical endpoints are x1 log(cancer volume) (lcavol), x2 log(prostate weight) (lweight), x3 age, x4 log(benign prostatic hyperplasia amount) (lbph), x5 seminal vesicle invasion (svi), x6 log(capsular penetration) (lcp), x7 Gleason score (gleason), and x8 percentage of Gleason scores 4 or 5 (pgg45). The correlations between x7 (gleason) and x8 (pgg45) and between x1 (lcavol) and x6 (lcp) indicate a moderate collinearity in the data. We split the data into two parts: a training set with 67 observations and a test set with 30 observations. We choose the tuning parameters by 10-fold CV on the training set. After fitting the model, we calculate prediction errors, i.e. residual sums of squares, on the test data and compare our proposed methods with OLS, LARS, and elastic net. All predictors are centred and standardized to have mean 0 and standard deviation 1.

Figure 1. Solution paths of LARS, elastic net, glars, and gridge on the prostate cancer data (standardized coefficients plotted against |beta|/max|beta|).

Figure 1 gives the solution paths of LARS, elastic net, glars, and gridge. The orders in which the variables are selected into the models are specified in Table 3. The glars and gridge methods select two groups of variables, lcavol with lcp and gleason with pgg45, due to their high correlations.

Table 3. The order of covariates entering the model in the prostate cancer example.
LARS:        lcavol, svi, lweight, pgg45, lbph, gleason, age, lcp
Elastic net: lcavol, svi, lweight, pgg45, lbph, gleason, age, lcp
glars:       {lcavol, lcp}, svi, lweight, {gleason, pgg45}, lbph, age
gridge:      {lcavol, lcp}, svi, lweight, {lbph, gleason, pgg45}, age

Table 4 gives the coefficient estimators from the different methods. All methods suggest that the logarithm of cancer volume (lcavol), the logarithm of prostate weight (lweight), and seminal vesicle invasion (svi) are important variables in explaining the response variable of prostate-specific antigen. LARS and elastic net also select the logarithm of benign prostatic hyperplasia (lbph), gleason, and pgg45, whereas glars and gridge select lcp together with lcavol, and pgg45 together with gleason, because of the grouping effect. The gridge_new method gives the sparsest model, removing gleason and pgg45 after the hard thresholding.

Table 4. Coefficient estimators from OLS, LARS, elastic net, glars, gridge, and gridge_new for the prostate cancer example.

Table 5 demonstrates the prediction errors in the test set. Our proposed methods outperform the other existing methods by reducing prediction errors. This example also shows that gridge_new has the best performance, with the lowest prediction error and the most parsimonious model.

Table 5. The prediction errors by OLS, LARS, elastic net, glars, gridge, and gridge_new on the prostate cancer data.
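A hedged sketch of the evaluation protocol used in this example follows (standardize predictors, fit on the 67-observation training set, report the residual sum of squares on the 30-observation test set); the fitting routine is left abstract and all names are ours, so this is one reasonable implementation rather than the paper's code:

```python
import numpy as np

def test_rss(X_train, y_train, X_test, y_test, fit):
    """Standardize predictors using training statistics, centre the response,
    fit a coefficient vector with the supplied `fit` routine (e.g. glars or
    gridge tuned by 10-fold CV), and return the test residual sum of squares."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    ybar = y_train.mean()
    Xtr = (X_train - mu) / sd
    Xte = (X_test - mu) / sd
    beta = fit(Xtr, y_train - ybar)          # any selection/estimation method
    pred = ybar + Xte @ beta
    return np.sum((y_test - pred) ** 2)
```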

5. An example on SNP data analysis

SNPs are the most common genetic variations in the human genome, occurring once in every several hundred base pairs. An SNP is a position at which two alternative bases occur at appreciable frequency (>1%) in the human population. SNPs can serve as genetic markers for identifying disease genes by linkage studies in families, linkage disequilibrium in isolated populations, association analysis of patients and controls, and loss-of-heterozygosity studies in tumours [13]. With advances in technology, genome-wide association studies have become popular for detecting the specific DNA variants that contribute to human phenotypes and particularly human diseases. Natural variation in the baseline expression of many genes can be considered as a heritable trait. Morley et al. [14] collected microarray and SNP data to localize the DNA variants that contribute to the expression phenotypes. The data consist of 14 families with 56 unrelated individuals (the grandparents). There are 8500 genes on the microarray and 2756 SNP markers genotyped for each individual. The expression level of a given gene is the response variable and the 2756 SNP markers are the predictors in our model. We code each SNP as 0, 1, or 2 for the wild-type homozygous, heterozygous, and mutation (rare) homozygous genotypes, respectively, according to the genotype frequency. We first screen the data to exclude SNPs that have a call rate less than 95% or a minor allele frequency less than 2.5%; the number of SNPs is reduced to 1739 after screening. Then we select the 500 most variable SNPs as the potential predictors, where the variability of an SNP is measured by its sample variance. A sketch of this coding and screening step is given after the model setup below.

Consider gene ICAP-1A, which is the top gene in Morley et al.'s [14] Table 1, with the strongest linkage evidence. ICAP-1A, also named ITGB1BP1, encodes a protein binding to the beta 1 integrin cytoplasmic domain and may play a role during integrin-dependent cell adhesion. Here, we apply multiple linear regression and the proposed group selection methods to search for an optimal set of SNPs for gene ICAP-1A. Linear regression over SNPs coded as 0, 1, and 2 captures the linear relationship between the gene expression and the SNP genotypes, and hence is commonly used as a simple tool in SNP data analysis. We split the data into a training set with 42 observations and a test set with 14 observations. Model fitting and the choices of the tuning parameters are based on the training set. The first grouping criterion is set to be the 75th percentile of all correlations between x and y. The second grouping criterion requires the correlation with the leader element to be greater than 0.6. The prediction error (residual sum of squares) is evaluated on the test data.
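As promised above, here is a minimal Python sketch of the SNP coding and screening step; it is our own illustration, with the thresholds taken from the text and the genotype matrix assumed to be coded as minor-allele counts with missing calls stored as NaN:

```python
import numpy as np

def screen_snps(genotypes, call_rate_min=0.95, maf_min=0.025, n_keep=500):
    """genotypes: n x p array of minor-allele counts (0, 1, 2), with np.nan
    for missing calls. Drop SNPs with low call rate or low minor allele
    frequency, then keep the n_keep SNPs with the largest sample variance."""
    call_rate = 1.0 - np.mean(np.isnan(genotypes), axis=0)
    # Minor allele frequency from the non-missing genotype counts.
    maf = np.nanmean(genotypes, axis=0) / 2.0
    maf = np.minimum(maf, 1.0 - maf)
    keep = (call_rate >= call_rate_min) & (maf >= maf_min)
    variances = np.nanvar(genotypes[:, keep], axis=0)
    top = np.argsort(variances)[::-1][:n_keep]
    return np.where(keep)[0][top]      # column indices of the selected SNPs
```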

Table 6. Test prediction errors, numbers of selected SNPs, and tuning parameter s for LARS, elastic net (lambda2 = 0.01), glars, and gridge (lambda2 = 0.01) on the SNP data.

Table 6 shows that glars and gridge have lower prediction errors than LARS and elastic net, with about 30 SNPs selected for gene ICAP-1A. We notice that the R-squared of the LARS fitted model with 24 SNP covariates is 0.779, which supports the hypothesis of Morley et al.'s study [14] that gene expression phenotypes are controlled by genetic variants. On the other hand, the coefficient estimators of all SNP covariates are very small, on the scale of 0.01, suggesting small additive effects of multiple SNPs.

Table 7. The first 12 steps of predictors selected by Lasso, elastic net, glars, and gridge for the SNP data.
Step  LASSO  Elastic net  glars       gridge
1     x458   x458         x458        x458
2     x481   x481         x481        x481
3     x321   x321         x321, x131  x321, x131
4     x287   x287         x287        x287
5     x240   x240         x240        x240
6     x76    x76          x406        x406
7     x406   x131         x76         x76
8     x131   x406         x102        x102
9     x345   x345         x345        x345
10    x102   x102         x498        x498
11    x498   x498         x30, x27    x30, x27
12    x30    x30          x167        x167

Table 7 lists the first 12 steps in which the predictors are selected by each algorithm. For instance, variable x458 enters the model at the first step in all algorithms. At Step 3, glars and gridge depart from LARS and elastic net due to the grouping effect. The first group consists of two SNPs, x321 and x131, which are highly correlated. The two SNPs are on chromosome 3, about 14 kb apart, in an intergenic region; the two closest genes have no functional annotation. Another group, consisting of the two SNPs x30 and x27, is selected by glars and gridge at Step 11. These two SNPs are on chromosome 7, over 2.5 million base pairs apart. SNP x30 is in an intergenic region, whereas x27 resides in the gene FOXK1. According to the Swiss-Prot functional annotation, FOXK1 is a transcriptional regulator that binds upstream of the myoglobin gene. Our results suggest more SNP associations with a gene expression phenotype than the simple linkage analysis. For example, variables x240 and x76 are jointly selected as important covariates for the expression of ICAP-1A by all four variable selection methods, yet they are not significant in simple single-SNP regression analysis. These two SNPs are located on the same chromosome as ICAP-1A (chromosome 2), nearly 27 million and 220 million base pairs, respectively, away from ICAP-1A. Figure 2 shows their locations and regression plots. One of the SNPs is in the promoter region (about 200 base pairs upstream) of gene FAM82A1; the other is in the promoter region (about 300 base pairs upstream) of gene DNER. The significant effects of these two SNPs on gene ICAP-1A may suggest associations among the corresponding genes.

Figure 2. Regression of the expression phenotype of ICAP-1A on the two nearby SNPs (expression level, log2, against SNP genotype). The bottom left arrow points to the location of ICAP-1A.

6. Discussion

Traditional forward selection is a heuristic approach that does not guarantee an optimal solution. LARS, a less greedy version of traditional forward selection methods, is however shown by Efron et al. [3] to be closely related to Lasso, which possesses optimal properties under appropriate conditions [15-17]. Our proposed methods take advantage of the LARS procedure while aiming at group selection for dependent data. The proposed methods do not require prior information on the underlying group structures but construct groups along the selection procedure. Our grouping criteria consider the joint information of x and y and therefore fit the context of variable selection better than standard clustering on x alone. On the other hand, any prior information on the model or groups can easily be incorporated into the glars and gridge algorithms by manually selecting certain variables at specific steps. The current methods may be improved by exploring different thresholds (t1, t2) in the grouping definition. One of our future directions is to extend the proposed group selection methods to general regression models, where y may depend on x through a nonlinear function. The proposed methods offer fast approaches that are more appropriate than other variable selection algorithms for data with complicated dependent structures.

Acknowledgements

This work is supported by US National Science Foundation Grant DMS.

References

[1] M. Segal, K. Dahlquist, and B. Conklin, Regression approach for microarray data analysis, J. Comput. Biol. 10(6) (2003).
[2] R. Tibshirani, Regression shrinkage and selection via the lasso, JRSS B 58(1) (1996).
[3] B. Efron, I. Johnstone, T. Hastie, and R. Tibshirani, Least angle regression, Ann. Stat. 32(2) (2004).
[4] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, JRSS B 67(2) (2005).
[5] H.D. Bondell and B.J. Reich, Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR, Biometrics 64(1) (2008).
[6] M.Y. Park, T. Hastie, and R. Tibshirani, Averaged gene expression for regression, Biostatistics 8(2) (2006).
[7] M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, JRSS B 68(1) (2006).
[8] S. Ma, X. Song, and J. Huang, Supervised group lasso with applications to microarray data analysis, BMC Bioinform. 8 (2007), 60.
[9] M.Y. Park and T. Hastie, Regularization path algorithms for detecting gene interactions, Technical Report, Department of Statistics, Stanford University.
[10] H. Zou, The adaptive lasso and its oracle properties, JASA 101(476) (2006).
[11] Z.J. Daye and X.J. Jeng, Shrinkage and model selection with correlated variables via weighted fusion, Technical Report, Department of Statistics, Purdue University.
[12] T. Stamey, J. Kabalin, J.I. Johnstone, F. Freiha, E. Redwine, and N. Young, Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate, II: Radical prostatectomy treated patients, J. Urol. 141(5) (1989).
[13] D.G. Wang, J. Fan, C. Siao, A. Berno, P. Young, R. Sapolsky, G. Ghandour, N. Perkins, E. Winchester, J. Spencer, L. Kruglyak, L. Stein, L. Hsie, T. Topaloglou, E. Hubbell, E. Robinson, M.S. Morris, N. Shen, D. Kilburn, J. Rioux, C. Nusbaum, S. Rozen, T.J. Hudson, R. Lipshutz, M. Chee, and E.S. Lander, Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome, Science 280(5366) (1998).
[14] M. Morley, C.M. Molony, T.M. Weber, J.L. Devlin, K.G. Ewens, R.S. Spielman, and V.G. Cheung, Genetic analysis of genome-wide variation in human gene expression, Nature 430(7001) (2004).
[15] K. Knight and W. Fu, Asymptotics for lasso-type estimators, Ann. Stat. 28(5) (2000).
[16] P. Zhao and B. Yu, On model selection consistency of lasso, J. Mach. Learn. Res. 7 (2006).
[17] C. Zhang and J. Huang, The sparsity and bias of the lasso selection in high-dimensional linear regression, Ann. Stat. 36(4) (2008).


More information

A Significance Test for the Lasso

A Significance Test for the Lasso A Significance Test for the Lasso Lockhart R, Taylor J, Tibshirani R, and Tibshirani R Ashley Petersen June 6, 2013 1 Motivation Problem: Many clinical covariates which are important to a certain medical

More information

Exploiting Covariate Similarity in Sparse Regression via the Pairwise Elastic Net

Exploiting Covariate Similarity in Sparse Regression via the Pairwise Elastic Net Exploiting Covariate Similarity in Sparse Regression via the Pairwise Elastic Net Alexander Lorbert, David Eis, Victoria Kostina, David M. Blei, Peter J. Ramadge Dept. of Electrical Engineering, Dept.

More information

Derivation of SPDEs for Correlated Random Walk Transport Models in One and Two Dimensions

Derivation of SPDEs for Correlated Random Walk Transport Models in One and Two Dimensions This article was downloaded by: [Texas Technology University] On: 23 April 2013, At: 07:52 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered

More information

Online publication date: 30 March 2011

Online publication date: 30 March 2011 This article was downloaded by: [Beijing University of Technology] On: 10 June 2011 Access details: Access Details: [subscription number 932491352] Publisher Taylor & Francis Informa Ltd Registered in

More information

PENALIZED METHOD BASED ON REPRESENTATIVES & NONPARAMETRIC ANALYSIS OF GAP DATA

PENALIZED METHOD BASED ON REPRESENTATIVES & NONPARAMETRIC ANALYSIS OF GAP DATA PENALIZED METHOD BASED ON REPRESENTATIVES & NONPARAMETRIC ANALYSIS OF GAP DATA A Thesis Presented to The Academic Faculty by Soyoun Park In Partial Fulfillment of the Requirements for the Degree Doctor

More information

Tutz, Binder: Boosting Ridge Regression

Tutz, Binder: Boosting Ridge Regression Tutz, Binder: Boosting Ridge Regression Sonderforschungsbereich 386, Paper 418 (2005) Online unter: http://epub.ub.uni-muenchen.de/ Projektpartner Boosting Ridge Regression Gerhard Tutz 1 & Harald Binder

More information

Least Absolute Gradient Selector: Statistical Regression via Pseudo-Hard Thresholding

Least Absolute Gradient Selector: Statistical Regression via Pseudo-Hard Thresholding Least Absolute Gradient Selector: Statistical Regression via Pseudo-Hard Thresholding Kun Yang Trevor Hastie April 0, 202 Abstract Variable selection in linear models plays a pivotal role in modern statistics.

More information

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless

More information

Lecture 8: Fitting Data Statistical Computing, Wednesday October 7, 2015

Lecture 8: Fitting Data Statistical Computing, Wednesday October 7, 2015 Lecture 8: Fitting Data Statistical Computing, 36-350 Wednesday October 7, 2015 In previous episodes Loading and saving data sets in R format Loading and saving data sets in other structured formats Intro

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Lucas Janson, Stanford Department of Statistics WADAPT Workshop, NIPS, December 2016 Collaborators: Emmanuel

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Some new ideas for post selection inference and model assessment

Some new ideas for post selection inference and model assessment Some new ideas for post selection inference and model assessment Robert Tibshirani, Stanford WHOA!! 2018 Thanks to Jon Taylor and Ryan Tibshirani for helpful feedback 1 / 23 Two topics 1. How to improve

More information

Exploratory quantile regression with many covariates: An application to adverse birth outcomes

Exploratory quantile regression with many covariates: An application to adverse birth outcomes Exploratory quantile regression with many covariates: An application to adverse birth outcomes June 3, 2011 eappendix 30 Percent of Total 20 10 0 0 1000 2000 3000 4000 5000 Birth weights efigure 1: Histogram

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

Priors on the Variance in Sparse Bayesian Learning; the demi-bayesian Lasso

Priors on the Variance in Sparse Bayesian Learning; the demi-bayesian Lasso Priors on the Variance in Sparse Bayesian Learning; the demi-bayesian Lasso Suhrid Balakrishnan AT&T Labs Research 180 Park Avenue Florham Park, NJ 07932 suhrid@research.att.com David Madigan Department

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

Penalized methods in genome-wide association studies

Penalized methods in genome-wide association studies University of Iowa Iowa Research Online Theses and Dissertations Summer 2011 Penalized methods in genome-wide association studies Jin Liu University of Iowa Copyright 2011 Jin Liu This dissertation is

More information

University, Wuhan, China c College of Physical Science and Technology, Central China Normal. University, Wuhan, China Published online: 25 Apr 2014.

University, Wuhan, China c College of Physical Science and Technology, Central China Normal. University, Wuhan, China Published online: 25 Apr 2014. This article was downloaded by: [0.9.78.106] On: 0 April 01, At: 16:7 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 10795 Registered office: Mortimer House,

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

The Fourier transform of the unit step function B. L. Burrows a ; D. J. Colwell a a

The Fourier transform of the unit step function B. L. Burrows a ; D. J. Colwell a a This article was downloaded by: [National Taiwan University (Archive)] On: 10 May 2011 Access details: Access Details: [subscription number 905688746] Publisher Taylor & Francis Informa Ltd Registered

More information

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

27: Case study with popular GM III. 1 Introduction: Gene association mapping for complex diseases 1

27: Case study with popular GM III. 1 Introduction: Gene association mapping for complex diseases 1 10-708: Probabilistic Graphical Models, Spring 2015 27: Case study with popular GM III Lecturer: Eric P. Xing Scribes: Hyun Ah Song & Elizabeth Silver 1 Introduction: Gene association mapping for complex

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

PLEASE SCROLL DOWN FOR ARTICLE. Full terms and conditions of use:

PLEASE SCROLL DOWN FOR ARTICLE. Full terms and conditions of use: This article was downloaded by: [Stanford University] On: 20 July 2010 Access details: Access Details: [subscription number 917395611] Publisher Taylor & Francis Informa Ltd Registered in England and Wales

More information

The American Statistician Publication details, including instructions for authors and subscription information:

The American Statistician Publication details, including instructions for authors and subscription information: This article was downloaded by: [National Chiao Tung University 國立交通大學 ] On: 27 April 2014, At: 23:13 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954

More information

Variable Selection for High-Dimensional Data with Spatial-Temporal Effects and Extensions to Multitask Regression and Multicategory Classification

Variable Selection for High-Dimensional Data with Spatial-Temporal Effects and Extensions to Multitask Regression and Multicategory Classification Variable Selection for High-Dimensional Data with Spatial-Temporal Effects and Extensions to Multitask Regression and Multicategory Classification Tong Tong Wu Department of Epidemiology and Biostatistics

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

STRUCTURED VARIABLE SELECTION AND ESTIMATION

STRUCTURED VARIABLE SELECTION AND ESTIMATION The Annals of Applied Statistics 2009, Vol. 3, No. 4, 1738 1757 DOI: 10.1214/09-AOAS254 Institute of Mathematical Statistics, 2009 STRUCTURED VARIABLE SELECTION AND ESTIMATION BY MING YUAN 1,V.ROSHAN JOSEPH

More information

Regression Model In The Analysis Of Micro Array Data-Gene Expression Detection

Regression Model In The Analysis Of Micro Array Data-Gene Expression Detection Jamal Fathima.J.I 1 and P.Venkatesan 1. Research Scholar -Department of statistics National Institute For Research In Tuberculosis, Indian Council For Medical Research,Chennai,India,.Department of statistics

More information

Homogeneity Pursuit. Jianqing Fan

Homogeneity Pursuit. Jianqing Fan Jianqing Fan Princeton University with Tracy Ke and Yichao Wu http://www.princeton.edu/ jqfan June 5, 2014 Get my own profile - Help Amazing Follow this author Grace Wahba 9 Followers Follow new articles

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information