The Double Dantzig

GARETH M. JAMES AND PETER RADCHENKO
Marshall School of Business, University of Southern California

Abstract

The Dantzig selector (Candes and Tao, 2007) is a new approach that has been proposed for performing variable selection and model fitting on linear regression models. It uses an L_1 penalty to shrink the regression coefficients towards zero, in a similar fashion to the Lasso. While both the Lasso and Dantzig selector potentially do a good job of selecting the correct variables, several researchers have noted that they tend to over shrink the final coefficients. This results in an unfortunate tradeoff. One can either select a high shrinkage tuning parameter that produces an accurate model but poor coefficient estimates, or a low shrinkage tuning parameter that produces more accurate coefficients but includes many irrelevant variables. We propose a new approach called the Double Dantzig. The Double Dantzig has two key advantages over both the Dantzig selector and the Lasso. First, we demonstrate that it can select the correct model without over shrinking the coefficient estimates. Second, it can be applied to all standard generalized linear model response distributions. In addition, we develop a path algorithm to simultaneously compute the coefficient estimates for all values of the tuning parameter, making our approach computationally efficient. A detailed simulation study is performed which illustrates the advantages of the Double Dantzig in relation to other possible methods. Finally, we demonstrate the Double Dantzig on several real world data sets.

Some key words: Dantzig Selector; Double Dantzig; Generalized Linear Models; Lasso; Variable Selection.

1 Introduction

In the traditional linear regression model,

    Y_i = β_0 + Σ_{j=1}^p X_{ij} β_j + ε_i,   i = 1,...,n,   (1)

a common problem is that of variable selection, i.e. determining which β_j ≠ 0. This is a classical problem with several standard approaches, such as best subset selection.

However, in the last decade, as data sets have become larger and computers more powerful, there has been a great deal of development of new model selection methods. A few examples include the Lasso (Mallat and Zhang, 1993; Tibshirani, 1996; Chen et al., 1998), the adaptive Lasso (Zou, 2006), SCAD (Fan and Li, 2001), the Elastic net (Zou and Hastie, 2005), LARS (Efron et al., 2004), CAP (Zhao et al., 2006), and the Dantzig selector (Candes and Tao, 2007). Many of these methods utilize an L_1 penalty on the regression coefficients, which has the effect of shrinking some coefficients towards zero and setting others to exactly zero, hence performing variable selection.

This paper concerns a somewhat less well studied problem, that of variable selection methods for Generalized Linear Models (GLMs) (McCullagh and Nelder, 1989). A number of the most important problems in areas such as machine learning, data mining, biology and many other fields involve performing classification on a categorical response in the presence of a large number of predictors. In this setting a linear regression model will not suffice and a GLM approach, such as logistic regression, is more appropriate. There has been some previous work in this area. Tibshirani (1996) briefly discusses using the Lasso to fit GLMs. Several other researchers have also considered using L_1 penalties to fit logistic regression models (Lokhorst, 1999; Shevade and Keerthi, 2003; Genkin et al., 2004). In addition, Zhao and Yu (2004) and Park and Hastie (2007) develop methods for fitting the entire coefficient path for GLMs and other models with L_1 penalties. Most of these previous methods provide extensions of the Lasso, so we use the term GLasso when discussing the Lasso applied to GLMs. We implement a different approach by utilizing the Dantzig selector, a new variable selection method that has received considerable attention and is currently scheduled to appear as a discussion article in the Annals of Statistics. The Dantzig selector works by minimizing the L_1 norm of β subject to an L_∞ constraint on the error terms. This approach has several nice theoretical and practical attributes, which we discuss further in Section 2.1.

Our paper provides three important contributions. First, the Dantzig selector was not designed for GLMs, so we develop an extension, called the Generalized Dantzig selector (GDS). The GDS works by reformulating the Dantzig selector criterion in terms of the maximum likelihood equations for a linear regression with a Gaussian response. The reformulation then suggests a natural extension of the criterion to arbitrary response distributions. The GDS shares the same desirable theoretical and practical properties as the Dantzig selector but can be applied to all of the usual range of GLM response distributions. Second, we present a path algorithm for simultaneously computing the GDS coefficients for all values of the tuning parameter, λ, in an efficient manner. This makes it practical to perform cross-validation on λ even for very large data sets. The third contribution, and our main focus, is the introduction of a new method, the Double Dantzig (DD), designed for fitting GLMs with large numbers of predictors but sparse coefficient vectors. As we illustrate in the simulation study of Section 4, the Dantzig selector and Lasso both suffer from an unfortunate tradeoff. They tend to either select the correct variables and produce poor coefficient estimates, or else generate more accurate coefficients at the expense of including many irrelevant variables. The Double Dantzig eliminates this drawback by utilizing two separate applications of the GDS. We first use the GDS to select a candidate set of predictors and then reapply the GDS, with a new tuning parameter, to the candidate variables to produce the final model.

We show through a detailed simulation study that this two step procedure can produce significant improvements, both in terms of the variables selected as well as the estimated coefficients, over not only the standard maximum likelihood GLM approach but also the GLasso and GDS methods.

The remainder of this paper is organized as follows. In Section 2 we begin by outlining the approach behind the Dantzig selector and show how it can be extended to GLMs via the Generalized Dantzig selector. We then present the Double Dantzig and explain more carefully the rationale behind it. Section 3 develops an iterative linear programming algorithm, analogous to that used for fitting standard GLMs, to efficiently compute the GDS solution. We then present our path algorithm for computing all possible GDS solutions, for different values of λ, in one operation. A detailed simulation study, comparing the performance of the Double Dantzig with standard GLM, GLasso and the GDS, is presented in Section 4. Two real world data sets with categorical responses are analyzed in Section 5. The first is a relatively small dataset of spam e-mails, while the second is a much larger dataset of internet advertisements containing p = 1,430 predictors. Finally, we end with a discussion of our method and possible extensions in Section 6.

2 Methodology

In this section we present the Double Dantzig methodology. We first describe the Dantzig selector in Section 2.1. Then, in Section 2.2, we present the GDS approach to extend the Dantzig selector to GLM response distributions. The GDS is interesting in its own right, but its central importance here is as a key component of the Double Dantzig, which we describe in Section 2.3.

2.1 Dantzig Selector

The Dantzig selector (Candes and Tao, 2007) was designed for linear regression models such as (1) with large p but a sparse set of coefficients, i.e. most of the β_j's are zero. For the linear regression model given by (1), the Dantzig selector estimate, β̂, is defined as the solution to

    min_{β ∈ B} ‖β‖_1   subject to   |X_j^T (Y − X β)| ≤ λ,   j = 1,..., p,   (2)

where ‖·‖_1 represents the L_1 norm, X_j is the jth column of the design matrix, X, λ is a tuning parameter and B represents the set of possible values for β, usually taken to be a subset of R^p. The L_1 minimization produces coefficient estimates that are exactly zero, in a similar fashion to the Lasso, and hence can be used as a variable selection tool. Candes and Tao (2007) provide several examples on real world problems involving large values of p where the Dantzig selector performs well. In this setup X_j is assumed to have norm one, which is rarely the case in practice. However, this difficulty is easily resolved by reparameterizing (1) such that the X_j's have norm one.

Let d_j = ‖X_j‖_2. Then computing β̂ on the standardized X_j's and returning β̂_j/d_j produces the Dantzig selector estimate for β_j on the original scale.

Notice that for Gaussian error terms, (2) can be rewritten as

    min_{β ∈ B} ‖β‖_1   subject to   |ℓ_j(β)| ≤ λ/σ²,   j = 1,..., p,   (3)

where ℓ_j is the partial derivative of the log likelihood function with respect to β_j and σ² = var(ε_i). Hence, an intuitive motivation for the Dantzig selector can be observed by noting that, for λ = 0, (3) will return the maximum likelihood estimator. Alternatively, at λ = 0 the constraint in (2) is equivalent to X^T Y = X^T X β̂, which is simply the least squares normal equation. For λ > 0 the Dantzig selector searches for the β with the smallest L_1 norm that is within a given distance of the maximum likelihood solution, i.e. the sparsest β that is still reasonably consistent with the observed data. Notice that even for p > n, where the likelihood equation will have infinitely many possible solutions, this approach can still hope to identify the correct solution, provided β is sparse, because it is only attempting to locate the sparsest β close to the peak of the likelihood function. One might imagine that minimizing the L_0 norm, which counts the number of non-zero components of a vector, would be more appropriate than the L_1 norm. However, directly minimizing the L_0 norm is computationally difficult, and one can show that, under suitable conditions, the L_1 norm will also provide a sparse solution.

The Dantzig selector has some nice properties. The first is that (2) can be formulated as a standard linear programming problem. Let β_j^+ and β_j^− respectively represent the positive and negative parts of β_j, i.e. β_j = β_j^+ − β_j^−. Then optimizing (2) is equivalent to solving

    min Σ_{j=1}^p β_j^+ + Σ_{j=1}^p β_j^−   subject to   β^+, β^− ≥ 0   and   −λ1 ≤ X^T(Y − Xβ) ≤ λ1,

where 1 is a vector of ones. As a consequence, standard linear programming software can easily be used to efficiently fit the data. For example, in Section 5 we are able to use this approach to fit a dataset with over 2,000 observations and almost 1,500 variables. More recently, James et al. (2007) develop the DASSO algorithm for fitting the entire coefficient path of the Dantzig selector. DASSO does not need to use linear programming and hence provides an even more computationally efficient means to fit the Dantzig selector.

The second useful property is theoretical. Candes and Tao (2007) prove tight nonasymptotic bounds on the error in the estimator for β. Suppose that ε_i ∼ N(0, σ²) and that β has at most S non-zero components. Then for any a ≥ 0, by computing β̂ using λ = σ√(2(1+a) log p), it will be the case that

    ‖β̂ − β‖_2² ≤ (1+a) C S σ² log p,   (4)

with probability at least 1 − (p^a √(4π log p))^{-1}, where C is a constant that depends on X. Even if we knew ahead of time which β_j's were non-zero, it would still be the case that ‖β̂ − β‖_2² would grow at the rate of S σ². Hence, (4) is optimal up to a factor of log p. Since log p grows very slowly, the Dantzig selector only pays a small price for adaptively choosing the significant variables.
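To make the linear programming formulation above concrete, the following sketch expresses (2) in exactly this form and hands it to a generic solver. It is only an illustration: the function name dantzig_selector is ours, and scipy.optimize.linprog is used simply as one convenient linear programming routine, with no attempt at the efficiency of a dedicated path algorithm such as DASSO.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Minimal sketch of the linear programming form of (2).

    Solves min sum(beta_plus + beta_minus) subject to
    -lam <= X^T (y - X beta) <= lam, with beta = beta_plus - beta_minus.
    (The theory assumes the columns of X have norm one, but the program
    itself does not require it.)
    """
    n, p = X.shape
    c = np.ones(2 * p)                      # objective: L1 norm of beta
    XtX = X.T @ X
    Xty = X.T @ y
    # The two-sided constraint, stacked as A_ub beta_pm <= b_ub:
    A_ub = np.vstack([
        np.hstack([ XtX, -XtX]),            #  X^T X beta <= X^T y + lam
        np.hstack([-XtX,  XtX]),            # -X^T X beta <= lam - X^T y
    ])
    b_ub = np.concatenate([Xty + lam, lam - Xty])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]            # beta = beta_plus - beta_minus
```

For norm-one columns this returns the Dantzig selector estimate directly; otherwise the rescaling by d_j described above can be applied afterwards.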

2.2 Generalized Dantzig Selector

Generalized linear models provide a framework for relating response and predictor variables (McCullagh and Nelder, 1989). For a random variable Y with distribution

    p(y; θ, φ) = exp{ (yθ − b(θ))/a(φ) + c(y, φ) },

we model the relationship between predictor and response as g(µ_i) = Σ_{j=1}^p X_{ij} β_j, where µ_i = E(Y; θ_i, φ) = b′(θ_i) and g is referred to as the link function. Common examples of g include the identity link used for normal response data and the logistic link used for binary response data. In a standard GLM setup, the coefficient vector β is generally estimated using maximum likelihood. For notational simplicity we will assume that g is chosen as the canonical link, though all the ideas generalize naturally to other link functions. Using the canonical link function, the score statistics are given by

    ℓ_j(β̂) = X_j^T (Y − µ),   j = 1,..., p,   (5)

where µ = g^{-1}(X β̂). Hence the maximum likelihood estimator is given by solving X_j^T(Y − µ) = 0, j = 1,..., p, which can be achieved using an iterative weighted least squares algorithm. However, when p is large relative to n, the maximum likelihood approach becomes undesirable for several reasons. First, solving (5) will not produce any coefficients that are exactly zero, so there is no form of variable selection performed. As a result, the final model is less interpretable and probably less accurate. Second, for large p the variance of the estimated coefficients will become large, and when p > n, equation (5) has no unique solution.

We address both these limitations by extending the Dantzig selector to the GLM setting. In Section 2.1 we observed that the constraints in the standard Dantzig selector can be written in terms of a Gaussian likelihood function. Hence a natural generalization to GLMs involves formulating the optimization criterion in terms of a general GLM likelihood. In particular, by combining (3) and (5) we produce our Generalized Dantzig selector (GDS) criterion,

    min_{β ∈ B} ‖β‖_1   subject to   |X_j^T (Y − µ)| ≤ λ,   j = 1,..., p,   (6)

where µ = g^{-1}(X β). For the Gaussian distribution with the identity link function, the GDS reduces to the standard Dantzig selector, but the GDS can also be applied to a much wider range of response distributions. For example, for categorical {0, 1} response data we can solve (6) using a logistic link function. The GDS maintains all the important properties of the Dantzig selector. In particular, it works well even in situations where p > n and produces sparse coefficient estimates, so it can be used for variable selection purposes.
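As a small illustration of criterion (6) in the Bernoulli case, the score vector X^T(Y − µ) under the logistic link, and the corresponding feasibility check, can be written in a few lines. This is only a sketch; the function names are ours and no optimization is performed here.

```python
import numpy as np

def gds_score(X, y, beta):
    """Score vector X^T (y - mu) for a Bernoulli GLM with the logistic link,
    i.e. the quantity constrained in the GDS criterion (6)."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))    # g^{-1}(X beta) for the logit link
    return X.T @ (y - mu)

def gds_feasible(X, y, beta, lam):
    """Check whether beta satisfies |X_j^T (y - mu)| <= lam for every j."""
    return np.all(np.abs(gds_score(X, y, beta)) <= lam)
```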

In addition, in Section 3 we show that the computational advantages of the Dantzig selector are preserved and that the entire coefficient path can be computed efficiently. Finally, while the theoretical properties of the GDS are not our central interest here, James and Radchenko (2007) prove that theoretical bounds with the same structure as (4) exist for the GDS.

Figure 1 provides an example of the GDS coefficient estimates from a simulated data set with Bernoulli responses and a logistic link function. The data was generated with n = 100 observations and p = 20 predictors, with a design matrix produced using iid Gaussian random variables. All but the first five components of the true coefficient vector were zero, i.e. the last fifteen variables were independent of the response. The responses were then generated randomly according to a binomial {0, 1} distribution. This was a fairly hard problem because there were a large number of variables and relatively few observations. In addition, each response provided relatively little information because it only took on two possible values. As noted previously, λ = 0 will give the standard GLM fit. However, as λ grows the GDS constraint relaxes until the point where all the coefficients are estimated as zero. Figure 1 plots the estimated coefficients for various values of λ, with the L_1 norm of the coefficients on the x-axis. Each line represents the coefficient for a single predictor variable, with the thicker solid lines relating to the five true predictors and dashed lines for the remaining fifteen unimportant variables. Despite the relatively small amount of information provided by the responses, the GDS clearly identifies the correct five variables first, with a significant break before adding the remaining variables.

Figure 1: Plot of estimated coefficients for different values of λ. The solid lines represent the variables that the response is related to. The dashed lines correspond to the remaining predictors.

2.3 The Double Dantzig

The Dantzig selector works in a similar fashion to the Lasso by shrinking the regression coefficients towards zero. Additionally, the L_1 norm causes certain coefficients to be set to exactly zero, which enables variable selection to be performed. While both the Lasso and Dantzig selector potentially do a good job of removing irrelevant variables, several researchers have noted that the price that is paid is to over shrink the remaining coefficients (Candes and Tao, 2007; James et al., 2007). As we illustrate in the simulation study of Section 4, for models with sparse sets of parameters this results in an unfortunate tradeoff. One can either select a high shrinkage tuning parameter that produces an accurate model but poor coefficient estimates, or a low shrinkage tuning parameter that produces more accurate coefficients but includes many irrelevant variables. The latter case is the norm when using cross-validation to select λ. The Double Dantzig eliminates this tradeoff by implementing the following two stage fitting procedure.

1. Select two tuning parameters λ_1 and λ_2.
2. Fit the GDS to the response Y_i using predictors X_1,...,X_p and tuning parameter λ_1.
3. Identify the q variables with non-zero coefficients.
4. The Double Dantzig solution is given by fitting the GDS to the response Y_i using the q selected predictors and tuning parameter λ_2.

(A schematic code sketch of this two stage procedure is given at the end of this section.)

Generally, λ_1 will be relatively large, causing a high degree of regularization, while λ_2 will be small, resulting in little shrinkage of the final coefficients. The key idea behind the Double Dantzig is that by using two different levels of shrinkage, one high and the other low, we can both select an accurate model and produce accurate coefficient estimates. The simulation and real world results of Sections 4 and 5 conclusively illustrate the potential advantages of such a method.

As with all such approaches, an important component of the DD is the choice of the tuning parameters λ_1 and λ_2. In Section 3 we introduce a path algorithm for computing the GDS for any given value of λ. Hence for any given pair of λ_1 and λ_2 we can efficiently compute the cross-validated deviance. One potential approach would then be to perform this calculation on a grid of values for the two tuning parameters and select the pair corresponding to the lowest deviance. While such an approach is feasible, in the experiments that we have performed, the final fitted models have been relatively stable for any small value of λ_2. Hence, for the results presented here we fix λ_2 = √(2 log p) and select λ_1 using cross-validation, which is relatively computationally easy. The 2 log p term is a natural adjustment for the number of variables suggested by the Dantzig selector theory (Candes and Tao, 2007). This approach seemed to give good results.
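The two stage procedure itself is straightforward once a GDS solver is available. The sketch below assumes a routine gds_fit(X, y, lam), for instance the iterative algorithm of Section 3.1; the function names, and the small threshold used to decide which coefficients are non-zero, are our own choices rather than part of the method's definition.

```python
import numpy as np

def double_dantzig(X, y, lam1, lam2, gds_fit):
    """Two stage Double Dantzig sketch.

    gds_fit(X, y, lam) is assumed to return a coefficient vector for the
    Generalized Dantzig selector; it is a placeholder for the algorithm of
    Section 3.1, not an implementation supplied by the paper.
    """
    # Stage 1: heavy regularization (lam1 large) to select the candidate set.
    beta1 = gds_fit(X, y, lam1)
    selected = np.flatnonzero(np.abs(beta1) > 1e-8)   # non-zero coefficients

    # Stage 2: refit the GDS on the selected variables with light shrinkage.
    beta2 = np.zeros(X.shape[1])
    if selected.size > 0:
        beta2[selected] = gds_fit(X[:, selected], y, lam2)
    return beta2, selected
```

Setting lam2 = 0 in this sketch corresponds to the Gauss Dantzig special case discussed in Section 4.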

3 Fitting the models

In Section 3.1 we present an iterative weighted Dantzig selector algorithm for computing the GDS solution for a given value of λ. Then in Section 3.2 we extend the fitting procedure to compute the entire path of coefficients for all possible λ.

3.1 Computing the GDS

Unless g is the identity function, the constraints in (6) are non-linear, so linear programming software cannot be directly used to compute the GDS solution. In the standard GLM setting, an iterative weighted least squares algorithm is used to solve (5). In particular, given the current estimate for β̂, an adjusted dependent variable

    Z_i = Σ_{j=1}^p X_{ij} β̂_j + (Y_i − µ_i)/V_i

is computed, where V_i = b″(θ_i) is the variance of Y_i. A new estimate for β is then computed using weighted least squares, i.e. X_j^T W(Z − X β̂) = 0 for j = 1,..., p, where W is a diagonal matrix consisting of the V_i's on the diagonal. This procedure is iterated until β̂ converges. An analogous iterative approach works well when computing the GDS solution.

1. At the (k+1)th iteration let V_i = b″(θ̂_i^{(k)}) and Z_i = Σ_{j=1}^p X_{ij} β̂_j^{(k)} + (Y_i − µ_i^{(k)})/V_i, where β̂^{(k)} denotes the corresponding estimate from the kth iteration.
2. Let Z*_i = Z_i √V_i and X*_{ij} = X_{ij} √V_i.
3. Use the Dantzig selector to compute β̂^{(k+1)} with Z* as the response and X* as the design matrix.
4. Repeat steps 1 through 3 until convergence.

Notice that step 3 is equivalent to

    min_{β ∈ B} ‖β‖_1   subject to   |X_j^T W(Z − X β)| ≤ λ,   j = 1,..., p,   (7)

which only requires a linear programming algorithm to compute. In addition, once the algorithm has converged, β̂^{(k+1)} = β̂^{(k)}, so X_j^T W(Z − X β̂^{(k+1)}) = X_j^T W W^{-1}(Y − µ) = X_j^T(Y − µ), and hence the solution to (7) will also be in the feasible region for (6).

As with the standard GLM iterative algorithm, there is no theoretical guarantee of convergence. However, in practice we have found that, for sparse solutions, our algorithm generally converges very rapidly, so the computation time is only a small multiple of that for the standard Dantzig selector. For large p and lower values of λ, nearer the GLM solution, the algorithm does not always converge, though in this circumstance it usually iterates between two similar estimates, so one can choose either point. We do not see this as a significant limitation because we are most interested in producing relatively sparse models, where the algorithm always seems to converge.
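For the Bernoulli family with the logistic link, the iterative weighted Dantzig selector above can be sketched as follows. The routine dantzig_selector is assumed to solve (2), for example the linear programming sketch of Section 2.1; the convergence tolerance and iteration cap are arbitrary choices of ours.

```python
import numpy as np

def gds_fit_logistic(X, y, lam, dantzig_selector, max_iter=50, tol=1e-6):
    """Sketch of the iterative weighted Dantzig selector of Section 3.1 for
    Bernoulli responses with the logistic (canonical) link."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        V = np.maximum(mu * (1.0 - mu), 1e-10)   # b''(theta), guarded away from 0
        Z = eta + (y - mu) / V                   # adjusted dependent variable
        Xs = X * np.sqrt(V)[:, None]             # X*_{ij} = X_{ij} sqrt(V_i)
        Zs = Z * np.sqrt(V)                      # Z*_i = Z_i sqrt(V_i)
        beta_new = dantzig_selector(Xs, Zs, lam)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

A routine of this form could serve as the gds_fit argument in the Double Dantzig sketch of Section 2.3.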

3.2 The GDS Path Algorithm

In their landmark paper, Efron et al. (2004) demonstrated that the coefficient path of the Lasso, i.e. the set of all possible coefficient vectors for different values of the tuning parameter, λ, was piecewise linear. They introduced a highly efficient algorithm called LARS and showed that it could fit the entire Lasso path. More recently, James et al. (2007) developed the DASSO algorithm, a generalization of LARS, which can fit the coefficient paths for both the Lasso and the Dantzig selector. Park and Hastie (2007) also use a variant of LARS to produce an approximation of the GLasso path. One key advantage of all these approaches is that they allow one to perform cross-validation on a fine grid of tuning parameters in a computationally efficient manner.

Here we introduce our own algorithm for fitting the GDS coefficient path. Our algorithm is a generalization of DASSO and computes the GDS path in a similar fashion to that of the GLasso algorithm. As with GLasso, our algorithm provides an approximation to the true path. In analogy with DASSO, LARS and GLasso, the GDS algorithm progresses in piecewise linear steps. At the start of each step the variables that have maximum covariance with the residual vector are identified. These variables make up the current active set, A, and determine the optimal direction that the coefficient vector, β, should move in. We then update A and compute an approximate value of the distance that can be traveled before the active set changes. At this point we correct the coefficient vector by solving the GDS optimization problem for given sets of active constraints and nonzero coefficients. An appealing feature of such a correction is that the use of an optimization package is not required. We outline the main steps of our algorithm below. The full details for each step are provided in the appendix.

1. Initialize β^1_j = 0 for j = 1,..., p. Set A^1 = ∅ and l = 1.
2. Let c = X^T(Y − µ^l) and augment A^l with argmax_j |c_j|, where c_j is the jth component of c.
3. Compute the p-dimensional direction vector h^l based on the current active set A^l and coefficient estimates β^l. Subsection A.1 presents the details for the calculation of h^l.
4. Form A^{l+1} by updating A^l. Compute γ^l, the distance to travel in the direction h^l, and set β̃^{l+1} = β^l + γ^l h^l. Subsection A.2 presents the details for these calculations.
5. Produce the corrected coefficient vector β^{l+1} based on β̃^{l+1}. Subsection A.3 presents the details for the calculation of β^{l+1}.
6. Set l ← l + 1. Repeat steps 2 through 5 until C^l = 0.

(A schematic skeleton of this loop, with the appendix computations left as placeholders, is sketched at the end of this section.)

The entire coefficient path can then be constructed by linearly interpolating β^1, β^2, ..., β^L, where L denotes the number of steps that the algorithm takes. Steps 3 and 4 above closely mirror the corresponding direction and distance steps in DASSO and essentially provide a linear approximation to the GDS path. Step 5 is a corrector step to adjust for the lack of linearity. Figure 2 shows the first segment of the GDS path on the SPAM dataset (described in Hastie et al. (2001) and discussed further in Section 5), produced by both our path algorithm and the GDS on a fine grid. Although, for non-linear link functions, the GDS coefficient path is not guaranteed to be exactly piecewise linear, we see here that the two paths are practically identical.

Figure 2: Plots of GDS coefficient paths on the SPAM data produced by the path algorithm and the GDS on a grid.
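A bare skeleton of this loop is given below. The three helper functions are placeholders standing in for the computations of Appendices A.1-A.3, which are not reproduced here; their names and signatures are ours, and the skeleton omits details such as the handling of degenerate cases.

```python
import numpy as np

def gds_path(X, y, inv_link, direction_step, distance_step, correction_step,
             max_steps=500):
    """Skeleton of the GDS path algorithm of Section 3.2.

    inv_link maps X beta to the mean vector; direction_step, distance_step
    and correction_step are placeholders for the computations detailed in
    Appendices A.1-A.3 and are not supplied by the paper in code form.
    """
    n, p = X.shape
    beta = np.zeros(p)
    active = []                                   # active set A^l
    path = [beta.copy()]
    for _ in range(max_steps):
        c = X.T @ (y - inv_link(X @ beta))        # current score vector
        if np.allclose(c, 0):                     # stop once C^l = 0
            break
        j = int(np.argmax(np.abs(c)))             # variable with maximum |c_j|
        if j not in active:
            active.append(j)
        h = direction_step(X, y, beta, active)                     # A.1
        gamma, active = distance_step(X, y, beta, h, active)       # A.2
        beta = correction_step(X, y, beta + gamma * h, active)     # A.3
        path.append(beta.copy())
    return path
```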

4 Simulation Study

In this section we present the results from several simulation studies that we conducted to test the performance of the DD in comparison to the GDS and four other possible approaches: standard GLM, the Oracle, the Gauss Dantzig and GLasso. The Oracle method represents a gold standard that cannot be used in practice, where it is assumed that the important predictors are known and GLM is used to estimate their coefficients. The Gauss Dantzig is an approach suggested by Candes and Tao (2007). They first used the Dantzig selector as a variable selection tool to identify the important predictors and then standard linear regression, on the reduced set of variables, to produce the final coefficient estimates. In our setting this process involves using the GDS to select the variables and then running a standard GLM on the subset. The Gauss Dantzig method is a special case of the Double Dantzig with λ_2 set to zero. Finally, GLasso uses the R package glmpath to compute the coefficients minimizing the GLM log likelihood subject to an L_1 penalty on β, i.e. a GLM version of the Lasso (Park and Hastie, 2007).

For our first simulation we tested the DD, GLM, GDS, Gauss Dantzig and Oracle approaches on one thousand simulated data sets over a range of values for λ. We did not include GLasso in this simulation because the tuning parameters for the GLasso and GDS methods are not easily comparable. Each data set was generated in the same fashion as described for Figure 1. In particular we used n = 100 observations and p = 20 predictors, of which only the first five had a relationship to the response. The responses were then generated randomly according to a binomial distribution. For each data set we computed the GLM and Oracle coefficient estimates. We also produced estimates for the other methods using λ = 0.025, 0.10, 0.25, 0.75, 1.25 and 2.50, where λ = 0.025 represents only a small level of regularization while at λ = 2.50 all coefficients are set to zero. For each method the root mean squared error between the estimated and true coefficients was computed for each of the data sets. The median errors are presented in Table 1.

Table 1: Root mean squared errors (root MSE) on the estimated beta coefficients compared to the truth using five estimation methods (GLM, GDS, Gauss Dantzig, DD and Oracle) and six values for lambda. These numbers correspond to the median root MSE over one thousand simulations.

For all values of λ the GDS gave considerably improved estimates over standard GLM. The Gauss Dantzig method gave good results for λ = 0.75 and 1.25 but performed only equivalently to GLM for very low values of λ, suggesting that it is somewhat sensitive to the correct choice of λ. Alternatively, the Double Dantzig gave equivalent or superior results to the GDS for all values of λ. In fact, for λ = 0.75 the DD was even superior to the Oracle. While somewhat surprising, such a result is possible if the first pass of the DD picks out the correct variables and then a small level of regularization on the second pass provides a better estimate than performing GLM alone.

In practice, a value for λ needs to be chosen automatically.
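The simulation design described above is easy to reproduce. The sketch below generates one such data set; the paper does not state the values taken by the five non-zero coefficients, so the unit values used here are purely illustrative, as is the function name.

```python
import numpy as np

def simulate_binomial_data(n=100, p=20, n_true=5, rng=None):
    """Generate one data set following the Section 4 simulation design:
    iid Gaussian design, only the first n_true coefficients non-zero,
    Bernoulli responses through the logistic link."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:n_true] = 1.0                      # assumed non-zero values
    prob = 1.0 / (1.0 + np.exp(-X @ beta))   # logistic link
    y = rng.binomial(1, prob)
    return X, y, beta
```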

For the remaining simulations, on each dataset, for the DD and Gauss Dantzig we used the value, λ_DD, that gave the lowest cross-validated deviance on the DD. Similarly, for the GLasso we used the parameter, λ_L, giving the lowest cross-validated deviance from the cv.glmpath function. For the GDS we implemented two versions. The first, GDS_L, used λ_L and the second, GDS_DD, used λ_DD. Figure 3 illustrates the results using this approach on data generated in an almost identical fashion to that for Table 1 except that p = 50 variables were used.

Figure 3: a) Boxplots of the root mean squared errors, on the log scale, for the seven different methods over the one hundred simulation runs. b) Boxplots of the number of mistakes made by each method in selecting variables for the final model. c) A plot of the estimated beta coefficients for a typical data set. Open circles correspond to standard GLM, blue filled circles to GDS and red pluses to the Double Dantzig. The solid horizontal line represents the true coefficients.

Figure 3a) provides boxplots, on the log scale, of the root MSE for each method. All methods significantly outperformed standard GLM. The DD and Oracle produced similar results, though the Oracle performed poorly for a few of the data sets. Note that this is the reason we report median rather than mean errors in Tables 1 and 2. The GLasso and GDS_L methods performed in an almost identical fashion, while GDS_DD produced slightly inferior results. The median performance of the Gauss Dantzig method was similar to the GLasso and GDS_L but it was far less stable.

Figure 3b) contains boxplots illustrating the total number of mistakes each method made, for each data set, in predicting the variables for the final model. Mistakes correspond to true predictors that are not included and unrelated variables that are included. We have not produced a boxplot for GLM because it does not perform variable selection. The median number of errors for the GLasso and GDS_L methods are both around 16, while the Gauss Dantzig, GDS_DD and DD methods all have a considerably smaller number of mistakes. Most of these errors arise from including too many variables in the final model.

The low number of mistakes for GDS_DD may seem incongruous given its relatively worse performance in Figure 3a). This inconsistency is explained by Figure 3c), which plots the estimated coefficients, for the GLM, GDS_DD and DD methods, on a typical data set. As Candes and Tao (2007) observed for the Dantzig selector, while the GDS_DD does a good job at selecting the correct variables, it tends to over shrink their coefficients. Hence the poor root MSE of GDS_DD despite its good model selection abilities. The DD allows us both to select a good model and to produce accurate coefficient estimates for the selected variables. Since the DD dominated the Gauss Dantzig method we only report results for the former in all the remaining simulations.

Table 2 provides numerical results, with λ chosen using cross-validation, for nine sets of simulations, produced as described previously, using p = 20, p = 50 or p = 105 variables and three response distributions: Binomial, Poisson and Gaussian. In all cases only five variables were related to the response. For each combination of p and response distribution we present three sets of results. First we give the median root MSE, as in Table 1. We also provide the false positive rate, i.e. the proportion of unrelated variables that are estimated to have non-zero coefficients, and the false negative rate, i.e. the proportion of important variables that are zeroed out.

Table 2: Root MSE, false positive rate (F+) and false negative rate (F-) for six different methods (GLM, GLasso, GDS_L, GDS_DD, DD and Oracle), and three different response distributions (Binomial, Poisson and Gaussian), using 100 simulations each with p = 20, p = 50 and p = 105. The sample size for each simulation was n = 100.

The relative levels of performance of the different methods are similar for all three response distributions. For p = 20 and p = 50 the standard GLM tended to be inferior to the other five approaches in terms of root MSE. For p = 105, where p > n, the GLM could not be fit while the other methods still performed well. The GLasso and GDS_L produced very similar results. This is consistent with the theory and empirical results in James et al. (2007), which show strong connections between the Lasso and Dantzig selector. Generally, the GDS_DD underperformed relative to the GLasso and GDS_L in terms of RMSE. However, it produced low false positive and negative rates. Finally, the DD did a good job of model selection while also producing RMSE levels close to, or in some cases below, those of the Oracle.
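The false positive and false negative rates just defined can be computed directly from an estimated and a true coefficient vector, as in the short sketch below (the function name and zero threshold are ours).

```python
import numpy as np

def selection_error_rates(beta_hat, beta_true, tol=1e-8):
    """False positive and false negative rates as defined in Section 4:
    F+ is the proportion of unrelated variables given non-zero estimates,
    F- is the proportion of important variables that are zeroed out."""
    selected = np.abs(beta_hat) > tol
    important = np.abs(beta_true) > tol
    false_pos = np.mean(selected[~important]) if np.any(~important) else 0.0
    false_neg = np.mean(~selected[important]) if np.any(important) else 0.0
    return false_pos, false_neg
```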

5 Applications

We chose to illustrate the performance of the GDS and DD on two real world data sets. The first is the spam data described in Hastie et al. (2001). This data consists of n = 4,601 e-mails. For each e-mail the response variable records whether it is a true e-mail or a spam e-mail. We also have p = 57 predictors recording the relative frequencies of commonly occurring words and characters. The aim is to treat this as a two class classification problem using the 57 predictors to identify the type of e-mail.

We began by computing the 10-fold DD cross validated error rate using the logistic link function on a fine grid of values for λ. The cross validation required little computational effort because the GDS and DD ran extremely fast on a data set of this size. For each value of λ we computed the average number of variables included in the model. Figure 4 plots the number of variables versus the error rate. Clearly the first five to ten variables cause a significant decline in the error rate, and from that point on there is a slow decline until approximately the thirty variable model, at which point the error rate levels out.

Figure 4: Cross validated error rates for the spam data, on the double Dantzig, using different numbers of variables. There is essentially no improvement in error rate beyond thirty variables.
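The cross-validated error rates reported here can be reproduced schematically as follows, assuming a fitting routine such as the double_dantzig sketch of Section 2.3; the fold construction and the classification threshold of 0.5 are our choices.

```python
import numpy as np

def cv_error_rate(X, y, lam1, lam2, fit_fn, n_folds=10, rng=None):
    """10-fold cross-validated misclassification rate for a given pair of
    tuning parameters.  fit_fn(X, y, lam1, lam2) is assumed to return a
    coefficient vector (for example, the double_dantzig sketch above)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    folds = rng.permutation(n) % n_folds
    errors = 0
    for k in range(n_folds):
        test = folds == k
        beta = fit_fn(X[~test], y[~test], lam1, lam2)
        prob = 1.0 / (1.0 + np.exp(-X[test] @ beta))
        errors += np.sum((prob > 0.5) != y[test])
    return errors / n
```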

Figure 5 provides a comparison of the coefficients from the thirty variable GDS and DD fits to those using standard GLM. The GLM coefficients are extremely variable, with several that have not been plotted on this scale because they range between -45 and 5. Of the remainder, interestingly, a number of the non-zero DD coefficients closely match the GLM ones. The effect of the DD appears to be to zero out the less important variables and to shrink the extreme coefficient estimates. The GDS generally returns coefficients with the same sign as the DD but with a significant level of shrinkage.

Figure 5: Coefficient estimates, ordered from largest to smallest in absolute value, on the spam data. The open circles correspond to GLM, the closed blue circles to GDS and the red pluses to DD.

Most of the reduction in error rate comes from the first few variables. Hence, in Table 3, we have listed the first twelve variables, in the order that they appear in the model, along with their DD coefficients. Negative coefficients correspond to a decreasing probability of spam and positive coefficients the opposite. The e-mails were sent to George Forman at HP. Hence words such as hp, hp1 and george are indicators of non-spam, while remove and $ suggest spam. Thus the coefficients all appear to have the correct sign, with $, 000 and remove all highly positive and george strongly negative. In addition, we used the bootstrap to produce 95% confidence intervals via the percentile method (Efron and Tibshirani, 1993). The first six variables plus the intercept all provide clear evidence of predictive power since their confidence intervals do not include zero.
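The percentile intervals can be obtained with a standard nonparametric bootstrap, sketched below for a generic fitting routine fit_fn(X, y); the function name and the number of resamples shown are ours, although 1,000 bootstrap runs are used later in Figure 6.

```python
import numpy as np

def bootstrap_percentile_ci(X, y, fit_fn, n_boot=1000, level=0.95, rng=None):
    """Percentile-method bootstrap confidence intervals for the coefficients.
    fit_fn(X, y) is assumed to return a full coefficient vector (for example,
    the Double Dantzig with fixed tuning parameters)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    boot = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample observations
        boot[b] = fit_fn(X[idx], y[idx])
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(boot, alpha, axis=0)
    upper = np.quantile(boot, 1.0 - alpha, axis=0)
    return lower, upper
```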

    Order  Variable    Coefficient   Confidence Interval
    1      hp          -1.69         (-2.67, -1.22)
    2      000                       (2.25, 5.54)
    3      remove      3.41          (2.36, 5.36)
    4      hp1                       (-2.26, -0.67)
    5      $           7.74          (4.14, 10.76)
    6      george      -3.10         (-5.69, -2.36)
    7      1999                      (-1.26, 0.02)
    8      Intercept   -0.95         (-1.78, -0.70)
    9      your        0.37          (0.00, 0.51)
    10     free        1.28          (0.00, 1.69)
    11     edu         -1.89         (-3.09, 0.00)
    12     re          -0.67         (-1.01, 0.00)

Table 3: Coefficients for the first twelve variables to enter the model when fitting the spam data. The variables are listed in the order that they entered. 95% bootstrap confidence intervals are also presented.

The other variables include a point mass at zero, so their predictive power is less clear. However, with the exception of 1999, none of these variables contain values on either side of zero, which one might expect for independent variables. Figure 6 illustrates this point with histograms of the bootstrap estimates for four of the variables. The first three are all consistently above or below zero. However, the fourth, re, has approximately two thirds of its estimates below zero and the remainder as a point mass at zero, but no points above zero. This seems to provide evidence that re may be an important variable, despite the point mass at zero, though a formal significance test would be difficult.

Figure 6: Histograms of the coefficient estimates on four variables (including hp, george and re) from 1,000 bootstrap runs.

The second dataset that we examined was the internet advertising data available at the UCI machine learning repository. This data provides instances of possible advertisements on Internet pages. Hence the response is categorical, indicating whether the image is an ad or not. The predictors record the geometry of the image as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. After removing missing values etc., the data set contained n = 2,359 observations and p = 1,430 variables. We included this dataset because its large number of predictors presents significant statistical and computational difficulties for standard approaches. For example, the GLM function in R took almost fifteen minutes to run and most coefficients produced NA estimates. However, for the ten variable model discussed below, the GDS and DD methods were able to estimate all the coefficients in under two minutes. Table 4 presents the coefficient estimates and bootstrap confidence intervals for the first ten variables to enter the model. The most important variables, in addition to the intercept, were whether the anchor url contained the words com or click and whether the image url contained the word ads.

    Order  Variable                            Coefficient   Confidence Interval
    1      intercept                           -3.11         (-3.46, -2.82)
    2      anchor url contains com?            3.69          (2.57, 5.25)
    3      image url contains ads?             3.57          (0.94, 5.52)
    4      anchor url contains click?          3.45          (0.79, 5.55)
    5      alt text contains click and here?   2.61          (0.00, 3.19)
    6      anchor url contains uk?             2.47          (0.00, 7.89)
    7      anchor url contains adclick?        6.84          (0.00, 7.71)
    8      anchor url contains http or www?    4.15          (0.00, 7.24)
    9      image url contains ad?              3.57          (0.00, 6.58)
    10     anchor url contains cat?            2.31          (0.00, 4.13)

Table 4: Coefficients for the first ten variables to enter the model when fitting the advertising data. The variables are listed in the order that they entered. 95% bootstrap confidence intervals are also presented.

6 Discussion

The Dantzig selector and Lasso are powerful tools for variable selection and model fitting on linear models. However, they both suffer from a potential over shrinkage of the coefficient estimates. Our Double Dantzig method both removes the shrinkage drawback and extends the Dantzig selector methodology to the GLM framework. Our simulation studies suggest that this approach can provide significant improvements, both in model and coefficient accuracy, over other standard linear and GLM variable selection methods. Finally, our GDS path algorithm allows the Double Dantzig to be implemented in an efficient manner.

We see several possible extensions of this work. James and Zhu (2007) develop an approach called FLiRTI to fit functional linear regression models where the predictor, X_i(t), is a function and is related to the scalar response, Y_i, via

    Y_i = β_0 + ∫ X_i(t) β(t) dt + ε_i.

FLiRTI estimates the coefficient curve, β(t), by assuming sparsity, not in the coefficients, but instead in the derivatives of β(t). For example, assuming that β″(t) = 0 at most time points produces an estimate for β(t) that is piecewise linear, with the breakpoints determined by the data. The FLiRTI approach uses the Dantzig selector to determine the locations of the non-zero derivatives and hence estimate β(t). James and Zhu (2007) show that the Dantzig selector's theoretical bounds also apply to FLiRTI.

We believe it is possible to combine the methodology and theory of the Double Dantzig and FLiRTI approaches to fit Generalized Linear Functional Models, where the predictor is functional and the response has a general distribution such as Binomial, Poisson, Gaussian, Exponential, etc. Another potential extension is to Generalized Additive Models (Hastie and Tibshirani, 1990), where the relationship between predictors, X_{i1},...,X_{ip}, and response, Y_i, is modeled via

    g(µ_i) = β_0 + f_1(X_{i1}) + ··· + f_p(X_{ip}).

Traditionally, smoothing splines or kernel methods have been used to estimate the f_j's. Alternatively, using a similar approach to that of James and Zhu (2007), one could assume sparsity in the derivatives of the f_j's. This would let the data produce piecewise linear, quadratic, cubic, etc. estimates for the curves, which would allow simple structures that could be interpreted more easily.

A Individual Steps of the GDS Algorithm

A.1 Direction Step

For simplicity of the notation, we suppress the superscript l throughout the remainder of the section. Compute X* and Z*, from Section 3.1, based on the latest path point, β. Our direction step and that for the DASSO are identical, except ours is applied to X* and Z* rather than X and Y. Let X*_A represent the K columns of X* corresponding to the elements of the active set A. The set up of the algorithm ensures that, except for some degenerate cases, the number of nonzero elements of β is either K or K − 1. Consider the latter case first. Let d be a vector with 1's for the first 2p elements and 0's for the remaining K elements. Further, let β^+ ≥ 0 and β^− ≥ 0 represent the positive and negative parts of β, i.e. β = β^+ − β^−. Finally, let S represent a K-dimensional diagonal matrix whose j-th diagonal element is 1 if the j-th component of X*_A^T(Y − µ) is positive and −1 otherwise.

The procedure to choose the direction h can be summarized in the following steps.

1. Compute the K by 2p + K matrix A = [ S X*_A^T X*   −S X*_A^T X*   I ], where I is the identity matrix. The first p columns correspond to β^+ and the next p columns to β^−.

2. We wish to identify a subset of the columns of A to form a K by K matrix B. We will construct B in the form [ B̃  A_i ], where B̃ is produced by selecting all the columns of A that correspond to the non-zero components of β^+ and β^−, and A_i is one of the remaining columns. Write the two parts of B in the following block form:

    B̃ = ( B̃_1 )      and      A_i = ( A_{i1} )
         ( B̃_2 )                      ( A_{i2} ),

where B̃_1 is a square matrix of dimension K − 1, and A_{i2} is a scalar. Finally, let d_B̃ and d_i be the components of d corresponding to the columns in B̃ and the column A_i, respectively.

3. Choose the final column for B as A_j, where

    j = argmax_{i: q_i ≠ 0, α/q_i > 0} [ d_B̃^T B̃_1^{-1} A_{i1} − d_i ] / q_i,   (8)

with q_i = A_{i2} − B̃_2 B̃_1^{-1} A_{i1} and α = 1 − B̃_2 B̃_1^{-1} 1.

4. Compute h̃ = B^{-1} 1 and let h be a p-dimensional vector of 0's. Depending on the columns chosen for B, certain elements of h̃ will correspond to elements of β^+ and β^−. For each k, set h_j = h̃_k if h̃_k corresponds to β_j^+, and set h_j = −h̃_k if h̃_k corresponds to β_j^−. Note that h has at most K nonzero components.

In the remaining case of exactly K nonzero β-coefficients, the above procedure is simplified by removing steps 2 and 3, and using the matrix B that is produced by selecting all the columns of A that correspond to the non-zero components of β^+ and β^−. See James et al. (2007) for further explanation of, and motivation for, the direction step.

A.2 Distance Step

We start by taking the distance step of the DASSO algorithm, applied to X* and Z*. If the index j in part 3 of the direction step was chosen as 2p + m for some positive m, remove the m-th active constraint from the index set A. Let X*_k represent any variable that is a member of the current active set and define

    γ_1 = min+_{j ∈ A^c} { (c_k − c_j) / ((X*_k − X*_j)^T X* h),   (c_k + c_j) / ((X*_k + X*_j)^T X* h) },

where min+ is the minimum taken over the positive components only and c_k and c_j are the kth and jth components of c.

Set γ_2 = min+_j { −β_j / h_j } and γ = min{γ_1, γ_2}. Let µ̃ be the mean vector corresponding to β̃ = β + γ h. Note that for the DASSO algorithm γ specifies the first point on the latest step of the path where a new constraint becomes active or a coefficient hits zero. Since the GDS path is not guaranteed to be exactly piecewise linear, it is possible that our candidate γ is too large. If this is the case we set γ ← γ/2, recompute µ̃ and iterate until max_{j ∈ A^c} |X_j^T(Y − µ̃)| ≤ max_{j ∈ A} |X_j^T(Y − µ̃)|.

A.3 Correction Step

Define C = ‖X^T(Y − µ̃)‖_∞ and compute X* and Z* that correspond to β̃. Form the set I by combining the index sets of nonzero coefficients of the vectors β and β̃. Define B = S X*_A^T X*_I, using S from the direction step, set β̄_I = B^{-1}(S X*_A^T Z* − C 1), then recompute X*, Z* and B. Iterate these steps until convergence, then let β̄ be the p-dimensional vector whose only nonzero coefficients are in the positions specified by I and have the values given by β̄_I. Very few iterations are typically required until convergence. Note that β̄ satisfies X_A^T(Y − µ̄) = C S 1.

References

Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics (to appear as a discussion article). Available at emmanuel/publications.html.
Chen, S., Donoho, D., and Saunders, M. (1998). Atomic decomposition by basis pursuit. SIAM Journal of Scientific Computing 20.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression (with discussion). The Annals of Statistics 32(2).
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. London: Chapman and Hall.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96.
Genkin, C., Lewis, D., and Madigan, D. (2004). Large-scale Bayesian logistic regression for text categorization. Technical Report.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall.
Hastie, T. J., Tibshirani, R. J., and Friedman, J. (2001). The Elements of Statistical Learning. Springer.
James, G. M. and Radchenko, P. (2007). The adapted Dantzig selector. In preparation.

James, G. M., Radchenko, P., and Lv, J. (2007). The DASSO algorithm for fitting the Dantzig selector and the Lasso. Under review. (Available at www-rcf.usc.edu/~gareth).
James, G. M. and Zhu, J. (2007). Functional linear regression that's interpretable. Under review. (Available at www-rcf.usc.edu/~gareth).
Lokhorst, J. (1999). The Lasso and generalized linear models. Technical Report, University of Adelaide.
Mallat, S. and Zhang, Z. (1993). Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing 41.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London, 2nd edn.
Park, M. and Hastie, T. (2007). An L1 regularization-path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B (to appear).
Shevade, S. and Keerthi, S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58.
Zhao, P., Rocha, G., and Yu, B. (2006). Grouped and hierarchical model selection through composite absolute penalties. Technical Report, Department of Statistics, University of California at Berkeley.
Zhao, P. and Yu, B. (2004). Boosted lasso. Technical Report, Department of Statistics, University of California at Berkeley.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67.


More information

Fast Regularization Paths via Coordinate Descent

Fast Regularization Paths via Coordinate Descent user! 2009 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerome Friedman and Rob Tibshirani. user! 2009 Trevor

More information

Chapter 3. Linear Models for Regression

Chapter 3. Linear Models for Regression Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Tractable Upper Bounds on the Restricted Isometry Constant

Tractable Upper Bounds on the Restricted Isometry Constant Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

LEAST ANGLE REGRESSION 469

LEAST ANGLE REGRESSION 469 LEAST ANGLE REGRESSION 469 Specifically for the Lasso, one alternative strategy for logistic regression is to use a quadratic approximation for the log-likelihood. Consider the Bayesian version of Lasso

More information

Regularization Paths. Theme

Regularization Paths. Theme June 00 Trevor Hastie, Stanford Statistics June 00 Trevor Hastie, Stanford Statistics Theme Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Mee-Young Park,

More information

Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns

Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns Aly Kane alykane@stanford.edu Ariel Sagalovsky asagalov@stanford.edu Abstract Equipped with an understanding of the factors that influence

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint B.A. Turlach School of Mathematics and Statistics (M19) The University of Western Australia 35 Stirling Highway,

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Cong Liu, Tao Shi and Yoonkyung Lee Department of Statistics, The Ohio State University Abstract Variable selection

More information

arxiv: v1 [math.st] 8 Jan 2008

arxiv: v1 [math.st] 8 Jan 2008 arxiv:0801.1158v1 [math.st] 8 Jan 2008 Hierarchical selection of variables in sparse high-dimensional regression P. J. Bickel Department of Statistics University of California at Berkeley Y. Ritov Department

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Howard D. Bondell and Brian J. Reich Department of Statistics, North Carolina State University,

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR DEPARTMENT OF STATISTICS North Carolina State University 2501 Founders Drive, Campus Box 8203 Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series No. 2583 Simultaneous regression shrinkage, variable

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection An Improved 1-norm SVM for Simultaneous Classification and Variable Selection Hui Zou School of Statistics University of Minnesota Minneapolis, MN 55455 hzou@stat.umn.edu Abstract We propose a novel extension

More information

L 1 Regularization Path Algorithm for Generalized Linear Models

L 1 Regularization Path Algorithm for Generalized Linear Models L 1 Regularization Path Algorithm for Generalized Linear Models Mee Young Park Trevor Hastie November 12, 2006 Abstract In this study, we introduce a path-following algorithm for L 1 regularized generalized

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Qiwei Yao Department of Statistics, London School of Economics q.yao@lse.ac.uk Joint work with: Hongzhi An, Chinese Academy

More information

DASSO: connections between the Dantzig selector and lasso

DASSO: connections between the Dantzig selector and lasso J. R. Statist. Soc. B (2009) 71, Part 1, pp. 127 142 DASSO: connections between the Dantzig selector and lasso Gareth M. James, Peter Radchenko and Jinchi Lv University of Southern California, Los Angeles,

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Robust Variable Selection Through MAVE

Robust Variable Selection Through MAVE Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse

More information

The OSCAR for Generalized Linear Models

The OSCAR for Generalized Linear Models Sebastian Petry & Gerhard Tutz The OSCAR for Generalized Linear Models Technical Report Number 112, 2011 Department of Statistics University of Munich http://www.stat.uni-muenchen.de The OSCAR for Generalized

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu 1,2, Daniel F. Schmidt 1, Enes Makalic 1, Guoqi Qian 2, John L. Hopper 1 1 Centre for Epidemiology and Biostatistics,

More information

Greedy and Relaxed Approximations to Model Selection: A simulation study

Greedy and Relaxed Approximations to Model Selection: A simulation study Greedy and Relaxed Approximations to Model Selection: A simulation study Guilherme V. Rocha and Bin Yu April 6, 2008 Abstract The Minimum Description Length (MDL) principle is an important tool for retrieving

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Forward Regression for Ultra-High Dimensional Variable Screening

Forward Regression for Ultra-High Dimensional Variable Screening Forward Regression for Ultra-High Dimensional Variable Screening Hansheng Wang Guanghua School of Management, Peking University This version: April 9, 2009 Abstract Motivated by the seminal theory of Sure

More information

Non-linear Supervised High Frequency Trading Strategies with Applications in US Equity Markets

Non-linear Supervised High Frequency Trading Strategies with Applications in US Equity Markets Non-linear Supervised High Frequency Trading Strategies with Applications in US Equity Markets Nan Zhou, Wen Cheng, Ph.D. Associate, Quantitative Research, J.P. Morgan nan.zhou@jpmorgan.com The 4th Annual

More information

A significance test for the lasso

A significance test for the lasso 1 First part: Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Second part: Joint work with Max Grazier G Sell, Stefan Wager and Alexandra

More information

Spatial Lasso with Applications to GIS Model Selection

Spatial Lasso with Applications to GIS Model Selection Spatial Lasso with Applications to GIS Model Selection Hsin-Cheng Huang Institute of Statistical Science, Academia Sinica Nan-Jung Hsu National Tsing-Hua University David Theobald Colorado State University

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Variable selection using Adaptive Non-linear Interaction Structures in High dimensions

Variable selection using Adaptive Non-linear Interaction Structures in High dimensions Variable selection using Adaptive Non-linear Interaction Structures in High dimensions Peter Radchenko and Gareth M. James Abstract Numerous penalization based methods have been proposed for fitting a

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

Forward Selection and Estimation in High Dimensional Single Index Models

Forward Selection and Estimation in High Dimensional Single Index Models Forward Selection and Estimation in High Dimensional Single Index Models Shikai Luo and Subhashis Ghosal North Carolina State University August 29, 2016 Abstract We propose a new variable selection and

More information

SUPPLEMENTARY APPENDICES FOR WAVELET-DOMAIN REGRESSION AND PREDICTIVE INFERENCE IN PSYCHIATRIC NEUROIMAGING

SUPPLEMENTARY APPENDICES FOR WAVELET-DOMAIN REGRESSION AND PREDICTIVE INFERENCE IN PSYCHIATRIC NEUROIMAGING Submitted to the Annals of Applied Statistics SUPPLEMENTARY APPENDICES FOR WAVELET-DOMAIN REGRESSION AND PREDICTIVE INFERENCE IN PSYCHIATRIC NEUROIMAGING By Philip T. Reiss, Lan Huo, Yihong Zhao, Clare

More information

VARIABLE SELECTION IN VERY-HIGH DIMENSIONAL REGRESSION AND CLASSIFICATION

VARIABLE SELECTION IN VERY-HIGH DIMENSIONAL REGRESSION AND CLASSIFICATION VARIABLE SELECTION IN VERY-HIGH DIMENSIONAL REGRESSION AND CLASSIFICATION PETER HALL HUGH MILLER UNIVERSITY OF MELBOURNE & UC DAVIS 1 LINEAR MODELS A variety of linear model-based methods have been proposed

More information

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Hujia Yu, Jiafu Wu [hujiay, jiafuwu]@stanford.edu 1. Introduction Housing prices are an important

More information

L 1 -regularization path algorithm for generalized linear models

L 1 -regularization path algorithm for generalized linear models J. R. Statist. Soc. B (2007) 69, Part 4, pp. 659 677 L 1 -regularization path algorithm for generalized linear models Mee Young Park Google Inc., Mountain View, USA and Trevor Hastie Stanford University,

More information

Logistic Regression with the Nonnegative Garrote

Logistic Regression with the Nonnegative Garrote Logistic Regression with the Nonnegative Garrote Enes Makalic Daniel F. Schmidt Centre for MEGA Epidemiology The University of Melbourne 24th Australasian Joint Conference on Artificial Intelligence 2011

More information

PENALIZING YOUR MODELS

PENALIZING YOUR MODELS PENALIZING YOUR MODELS AN OVERVIEW OF THE GENERALIZED REGRESSION PLATFORM Michael Crotty & Clay Barker Research Statisticians JMP Division, SAS Institute Copyr i g ht 2012, SAS Ins titut e Inc. All rights

More information

Lecture 24 May 30, 2018

Lecture 24 May 30, 2018 Stats 3C: Theory of Statistics Spring 28 Lecture 24 May 3, 28 Prof. Emmanuel Candes Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candes Outline Agenda: High-dimensional Statistical Estimation. Lasso

More information