The Double Dantzig

GARETH M. JAMES AND PETER RADCHENKO
Marshall School of Business, University of Southern California

Abstract

The Dantzig selector (Candes and Tao, 2007) is a new approach that has been proposed for performing variable selection and model fitting on linear regression models. It uses an L_1 penalty to shrink the regression coefficients towards zero, in a similar fashion to the Lasso. While both the Lasso and Dantzig selector potentially do a good job of selecting the correct variables, several researchers have noted that they tend to over shrink the final coefficients. This results in an unfortunate tradeoff. One can either select a high shrinkage tuning parameter that produces an accurate model but poor coefficient estimates, or a low shrinkage tuning parameter that produces more accurate coefficients but includes many irrelevant variables. We propose a new approach called the Double Dantzig. The Double Dantzig has two key advantages over both the Dantzig selector and the Lasso. First, we demonstrate that it can select the correct model without over shrinking the coefficient estimates. Second, it can be applied to all standard generalized linear model response distributions. In addition, we develop a path algorithm to simultaneously compute the coefficient estimates for all values of the tuning parameter, making our approach computationally efficient. A detailed simulation study is performed which illustrates the advantages of the Double Dantzig in relation to other possible methods. Finally, we demonstrate the Double Dantzig on several real world data sets.

Some key words: Dantzig Selector; Double Dantzig; Generalized Linear Models; Lasso; Variable Selection.

1 Introduction

In the traditional linear regression model,

    Y_i = β_0 + Σ_{j=1}^p X_{ij} β_j + ε_i,   i = 1,...,n,   (1)

a common problem is that of variable selection, i.e. determining which β_j ≠ 0. This is a classical problem with several standard approaches, such as best subset selection.

However, in the last decade, as data sets have become larger and computers more powerful, there has been a great deal of development of new model selection methods. A few examples include the Lasso (Mallat and Zhang, 1993; Tibshirani, 1996; Chen et al., 1998), the adaptive Lasso (Zou, 2006), SCAD (Fan and Li, 2001), the Elastic net (Zou and Hastie, 2005), LARS (Efron et al., 2004), CAP (Zhao et al., 2006), and the Dantzig selector (Candes and Tao, 2007). Many of these methods utilize an L_1 penalty on the regression coefficients, which has the effect of shrinking some coefficients towards zero and setting others to exactly zero, hence performing variable selection.

This paper concerns a somewhat less well studied problem, that of variable selection methods for Generalized Linear Models (GLMs) (McCullagh and Nelder, 1989). A number of the most important problems in areas such as machine learning, data mining, biology and many other fields involve performing classification on a categorical response in the presence of a large number of predictors. In this setting a linear regression model will not suffice and a GLM approach, such as logistic regression, is more appropriate. There has been some previous work in this area. Tibshirani (1996) briefly discusses using the Lasso to fit GLMs. Several other researchers have also considered using L_1 penalties to fit logistic regression models (Lokhorst, 1999; Shevade and Keerthi, 2003; Genkin et al., 2004). In addition, Zhao and Yu (2004) and Park and Hastie (2007) develop methods for fitting the entire coefficient path for GLMs and other models with L_1 penalties. Most of these previous methods provide extensions of the Lasso, so we use the term GLasso when discussing the Lasso applied to GLMs. We implement a different approach by utilizing the Dantzig selector, a new variable selection method that has received considerable attention and is currently scheduled to appear as a discussion article in the Annals of Statistics. The Dantzig selector works by minimizing the L_1 norm of β subject to an L_∞ constraint on the error terms. This approach has several nice theoretical and practical attributes, which we discuss further in Section 2.1.

Our paper provides three important contributions. First, the Dantzig selector was not designed for GLMs, so we develop an extension, called the Generalized Dantzig selector (GDS). The GDS works by reformulating the Dantzig selector criterion in terms of the maximum likelihood equations for a linear regression with a Gaussian response. The reformulation then suggests a natural extension of the criterion to arbitrary response distributions. The GDS shares the same desirable theoretical and practical properties as the Dantzig selector but can be applied to all of the usual range of GLM response distributions. Second, we present a path algorithm for simultaneously computing the GDS coefficients for all values of the tuning parameter, λ, in an efficient manner. This makes it practical to perform cross-validation on λ even for very large data sets. The third contribution, and our main focus, is the introduction of a new method, the Double Dantzig (DD), designed for fitting GLMs with large numbers of predictors but sparse coefficient vectors. As we illustrate in the simulation study of Section 4, the Dantzig selector and Lasso both suffer from an unfortunate tradeoff. They tend to either select the correct variables and produce poor coefficient estimates, or else generate more accurate coefficients at the expense of including many irrelevant variables. The Double Dantzig eliminates this drawback by utilizing two separate applications of the GDS. We first use the GDS to select a candidate set of predictors and then reapply the GDS, with a new tuning parameter, to the candidate variables to produce the final model.

We show through a detailed simulation study that this two step procedure can produce significant improvements, both in terms of the variables selected as well as the estimated coefficients, over not only the standard maximum likelihood GLM approach but also the GLasso and GDS methods.

The remainder of this paper is organized as follows. In Section 2 we begin by outlining the approach behind the Dantzig selector and show how it can be extended to GLMs via the Generalized Dantzig selector. We then present the Double Dantzig and explain more carefully the rationale behind it. Section 3 develops an iterative linear programming algorithm, analogous to that used for fitting standard GLMs, to efficiently compute the GDS solution. We then present our path algorithm for computing all possible GDS solutions, for different values of λ, in one operation. A detailed simulation study, comparing the performance of the Double Dantzig with standard GLM, GLasso and the GDS, is presented in Section 4. Two real world data sets with categorical responses are analyzed in Section 5. The first is a relatively small dataset of spam e-mails, while the second is a much larger dataset of internet advertisements containing p = 1,430 predictors. Finally, we end with a discussion of our method and possible extensions in Section 6.

2 Methodology

In this section we present the Double Dantzig methodology. We first describe the Dantzig selector in Section 2.1. Then, in Section 2.2, we present the GDS approach to extend the Dantzig selector to GLM response distributions. The GDS is interesting in its own right, but its central importance here is as a key component of the Double Dantzig, which we describe in Section 2.3.

2.1 Dantzig Selector

The Dantzig selector (Candes and Tao, 2007) was designed for linear regression models such as (1) with large p but a sparse set of coefficients, i.e. most of the β_j's are zero. For the linear regression model given by (1), the Dantzig selector estimate, β̂, is defined as the solution to

    min_{β ∈ B} ‖β‖_1   subject to   |X_j^T (Y − X β)| ≤ λ,   j = 1,..., p,   (2)

where ‖·‖_1 represents the L_1 norm, X_j is the jth column of the design matrix, X, λ is a tuning parameter and B represents the set of possible values for β, usually taken to be a subset of R^p. The L_1 minimization produces coefficient estimates that are exactly zero, in a similar fashion to the Lasso, and hence can be used as a variable selection tool. Candes and Tao (2007) provide several examples on real world problems involving large values of p where the Dantzig selector performs well. In this setup X_j is assumed to have norm one, which is rarely the case in practice. However, this difficulty is easily resolved by reparameterizing (1) such that the X_j's have norm one.

Let d_j = ‖X_j‖_2. Then computing β̂ on the standardized X_j's and returning β̂_j/d_j produces the Dantzig selector estimate for β_j on the original scale.

Notice that for Gaussian error terms, (2) can be rewritten as

    min_{β ∈ B} ‖β‖_1   subject to   |ℓ_j(β)| ≤ λ/σ²,   j = 1,..., p,   (3)

where ℓ_j is the partial derivative of the log likelihood function with respect to β_j and σ² = var(ε_i). Hence, an intuitive motivation for the Dantzig selector can be observed by noting that, for λ = 0, (3) will return the maximum likelihood estimator. Alternatively, at λ = 0 the constraint in (2) is equivalent to X^T Y = X^T X β̂, which is simply the least squares normal equation. For λ > 0 the Dantzig selector searches for the β with the smallest L_1 norm that is within a given distance of the maximum likelihood solution, i.e. the sparsest β that is still reasonably consistent with the observed data. Notice that even for p > n, where the likelihood equation will have infinitely many possible solutions, this approach can still hope to identify the correct solution, provided β is sparse, because it is only attempting to locate the sparsest β close to the peak of the likelihood function. One might imagine that minimizing the L_0 norm, which counts the number of non-zero components of a vector, would be more appropriate than the L_1 norm. However, directly minimizing the L_0 norm is computationally difficult, and one can show that, under suitable conditions, the L_1 norm will also provide a sparse solution.

The Dantzig selector has some nice properties. The first is that (2) can be formulated as a standard linear programming problem. Let β_j^+ and β_j^− respectively represent the positive and negative parts of β_j, i.e. β_j = β_j^+ − β_j^−. Then optimizing (2) is equivalent to solving

    min Σ_{j=1}^p β_j^+ + Σ_{j=1}^p β_j^−   subject to   β^+, β^− ≥ 0   and   −λ1 ≤ X^T(Y − Xβ) ≤ λ1,

where 1 is a vector of ones. As a consequence, standard linear programming software can easily be used to efficiently fit the data. For example, in Section 5 we are able to use this approach to fit a dataset with over 2,000 observations and almost 1,500 variables. More recently, James et al. (2007) develop the DASSO algorithm for fitting the entire coefficient path of the Dantzig selector. DASSO does not need to use linear programming and hence provides an even more computationally efficient means to fit the Dantzig selector.

The second useful property is theoretical. Candes and Tao (2007) prove tight nonasymptotic bounds on the error in the estimator for β. Suppose that ε_i ∼ N(0, σ²) and that β has at most S non-zero components. Then for any a ≥ 0, by computing β̂ using λ = σ√(2(1+a) log p), it will be the case that

    ‖β̂ − β‖_2² ≤ (1+a) C S σ² log p,   (4)

with probability at least 1 − (p^a √(4π log p))^{-1}, where C is a constant that depends on X. Even if we knew ahead of time which β_j's were non-zero, it would still be the case that ‖β̂ − β‖_2² would grow at the rate of S σ². Hence, (4) is optimal up to a factor of log p. Since log p grows very slowly, the Dantzig selector only pays a small price for adaptively choosing the significant variables.
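To make the linear programming formulation above concrete, the following sketch expresses (2) in exactly this form and hands it to a generic solver. It is only an illustration: the function name dantzig_selector is ours, and scipy.optimize.linprog is used simply as one convenient linear programming routine, with no attempt at the efficiency of a dedicated path algorithm such as DASSO.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Minimal sketch of the linear programming form of (2).

    Solves min sum(beta_plus + beta_minus) subject to
    -lam <= X^T (y - X beta) <= lam, with beta = beta_plus - beta_minus.
    (The theory assumes the columns of X have norm one, but the program
    itself does not require it.)
    """
    n, p = X.shape
    c = np.ones(2 * p)                      # objective: L1 norm of beta
    XtX = X.T @ X
    Xty = X.T @ y
    # The two-sided constraint, stacked as A_ub beta_pm <= b_ub:
    A_ub = np.vstack([
        np.hstack([ XtX, -XtX]),            #  X^T X beta <= X^T y + lam
        np.hstack([-XtX,  XtX]),            # -X^T X beta <= lam - X^T y
    ])
    b_ub = np.concatenate([Xty + lam, lam - Xty])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]            # beta = beta_plus - beta_minus
```

For norm-one columns this returns the Dantzig selector estimate directly; otherwise the rescaling by d_j described above can be applied afterwards.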

2.2 Generalized Dantzig Selector

Generalized linear models provide a framework for relating response and predictor variables (McCullagh and Nelder, 1989). For a random variable Y with distribution

    p(y; θ, φ) = exp{ (yθ − b(θ))/a(φ) + c(y, φ) },

we model the relationship between predictor and response as g(µ_i) = Σ_{j=1}^p X_{ij} β_j, where µ_i = E(Y; θ_i, φ) = b′(θ_i) and g is referred to as the link function. Common examples of g include the identity link used for normal response data and the logistic link used for binary response data. In a standard GLM setup, the coefficient vector β is generally estimated using maximum likelihood. For notational simplicity we will assume that g is chosen as the canonical link, though all the ideas generalize naturally to other link functions. Using the canonical link function, the score statistics are given by

    ℓ_j(β̂) = X_j^T (Y − µ),   j = 1,..., p,   (5)

where µ = g^{-1}(X β̂). Hence the maximum likelihood estimator is given by solving X_j^T(Y − µ) = 0, j = 1,..., p, which can be achieved using an iterative weighted least squares algorithm. However, when p is large relative to n, the maximum likelihood approach becomes undesirable for several reasons. First, solving (5) will not produce any coefficients that are exactly zero, so there is no form of variable selection performed. As a result, the final model is less interpretable and probably less accurate. Second, for large p the variance of the estimated coefficients will become large, and when p > n, equation (5) has no unique solution.

We address both these limitations by extending the Dantzig selector to the GLM setting. In Section 2.1 we observed that the constraints in the standard Dantzig selector can be written in terms of a Gaussian likelihood function. Hence a natural generalization to GLMs involves formulating the optimization criterion in terms of a general GLM likelihood. In particular, by combining (3) and (5) we produce our Generalized Dantzig selector (GDS) criterion,

    min_{β ∈ B} ‖β‖_1   subject to   |X_j^T (Y − µ)| ≤ λ,   j = 1,..., p,   (6)

where µ = g^{-1}(X β). For the Gaussian distribution with the identity link function, the GDS reduces to the standard Dantzig selector, but the GDS can also be applied to a much wider range of response distributions. For example, for categorical {0, 1} response data we can solve (6) using a logistic link function. The GDS maintains all the important properties of the Dantzig selector. In particular, it works well even in situations where p > n and produces sparse coefficient estimates, so it can be used for variable selection purposes.
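As a small illustration of criterion (6) in the Bernoulli case, the score vector X^T(Y − µ) under the logistic link, and the corresponding feasibility check, can be written in a few lines. This is only a sketch; the function names are ours and no optimization is performed here.

```python
import numpy as np

def gds_score(X, y, beta):
    """Score vector X^T (y - mu) for a Bernoulli GLM with the logistic link,
    i.e. the quantity constrained in the GDS criterion (6)."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))    # g^{-1}(X beta) for the logit link
    return X.T @ (y - mu)

def gds_feasible(X, y, beta, lam):
    """Check whether beta satisfies |X_j^T (y - mu)| <= lam for every j."""
    return np.all(np.abs(gds_score(X, y, beta)) <= lam)
```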

In addition, in Section 3 we show that the computational advantages of the Dantzig selector are preserved and that the entire coefficient path can be computed efficiently. Finally, while the theoretical properties of the GDS are not our central interest here, James and Radchenko (2007) prove that theoretical bounds with the same structure as (4) exist for the GDS.

Figure 1 provides an example of the GDS coefficient estimates from a simulated data set with Bernoulli responses and a logistic link function. The data was generated with n = 100 observations and p = 20 predictors, with a design matrix produced using iid Gaussian random variables. All but the first five components of the true coefficient vector were zero, i.e. the last fifteen variables were independent of the response. The responses were then generated randomly according to a binomial {0, 1} distribution. This was a fairly hard problem because there were a large number of variables and relatively few observations. In addition, each response provided relatively little information because it only took on two possible values. As noted previously, λ = 0 will give the standard GLM fit. However, as λ grows the GDS constraint relaxes until the point where all the coefficients are estimated as zero. Figure 1 plots the estimated coefficients for various values of λ, with the L_1 norm of the coefficients on the x-axis. Each line represents the coefficient for a single predictor variable, with the thicker solid lines relating to the five true predictors and dashed lines for the remaining fifteen unimportant variables. Despite the relatively small amount of information provided by the responses, the GDS clearly identifies the correct five variables first, with a significant break before adding the remaining variables.

Figure 1: Plot of estimated coefficients for different values of λ. The solid lines represent the variables that the response is related to. The dashed lines correspond to the remaining predictors.

2.3 The Double Dantzig

The Dantzig selector works in a similar fashion to the Lasso by shrinking the regression coefficients towards zero. Additionally, the L_1 norm causes certain coefficients to be set to exactly zero, which enables variable selection to be performed. While both the Lasso and Dantzig selector potentially do a good job of removing irrelevant variables, several researchers have noted that the price that is paid is to over shrink the remaining coefficients (Candes and Tao, 2007; James et al., 2007). As we illustrate in the simulation study of Section 4, for models with sparse sets of parameters this results in an unfortunate tradeoff. One can either select a high shrinkage tuning parameter that produces an accurate model but poor coefficient estimates, or a low shrinkage tuning parameter that produces more accurate coefficients but includes many irrelevant variables. The latter case is the norm when using cross-validation to select λ. The Double Dantzig eliminates this tradeoff by implementing the following two stage fitting procedure.

1. Select two tuning parameters λ_1 and λ_2.
2. Fit the GDS to the response Y_i using predictors X_1,...,X_p and tuning parameter λ_1.
3. Identify the q variables with non-zero coefficients.
4. The Double Dantzig solution is given by fitting the GDS to the response Y_i using the q selected predictors and tuning parameter λ_2.

(A schematic code sketch of this two stage procedure is given at the end of this section.)

Generally, λ_1 will be relatively large, causing a high degree of regularization, while λ_2 will be small, resulting in little shrinkage of the final coefficients. The key idea behind the Double Dantzig is that by using two different levels of shrinkage, one high and the other low, we can both select an accurate model and produce accurate coefficient estimates. The simulation and real world results of Sections 4 and 5 conclusively illustrate the potential advantages of such a method.

As with all such approaches, an important component of the DD is the choice of the tuning parameters λ_1 and λ_2. In Section 3 we introduce a path algorithm for computing the GDS for any given value of λ. Hence for any given pair of λ_1 and λ_2 we can efficiently compute the cross-validated deviance. One potential approach would then be to perform this calculation on a grid of values for the two tuning parameters and select the pair corresponding to the lowest deviance. While such an approach is feasible, in the experiments that we have performed, the final fitted models have been relatively stable for any small value of λ_2. Hence, for the results presented here we fix λ_2 = √(2 log p) and select λ_1 using cross-validation, which is relatively computationally easy. The 2 log p term is a natural adjustment for the number of variables suggested by the Dantzig selector theory (Candes and Tao, 2007). This approach seemed to give good results.
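The two stage procedure itself is straightforward once a GDS solver is available. The sketch below assumes a routine gds_fit(X, y, lam), for instance the iterative algorithm of Section 3.1; the function names, and the small threshold used to decide which coefficients are non-zero, are our own choices rather than part of the method's definition.

```python
import numpy as np

def double_dantzig(X, y, lam1, lam2, gds_fit):
    """Two stage Double Dantzig sketch.

    gds_fit(X, y, lam) is assumed to return a coefficient vector for the
    Generalized Dantzig selector; it is a placeholder for the algorithm of
    Section 3.1, not an implementation supplied by the paper.
    """
    # Stage 1: heavy regularization (lam1 large) to select the candidate set.
    beta1 = gds_fit(X, y, lam1)
    selected = np.flatnonzero(np.abs(beta1) > 1e-8)   # non-zero coefficients

    # Stage 2: refit the GDS on the selected variables with light shrinkage.
    beta2 = np.zeros(X.shape[1])
    if selected.size > 0:
        beta2[selected] = gds_fit(X[:, selected], y, lam2)
    return beta2, selected
```

Setting lam2 = 0 in this sketch corresponds to the Gauss Dantzig special case discussed in Section 4.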

3 Fitting the models

In Section 3.1 we present an iterative weighted Dantzig selector algorithm for computing the GDS solution for a given value of λ. Then in Section 3.2 we extend the fitting procedure to compute the entire path of coefficients for all possible λ.

3.1 Computing the GDS

Unless g is the identity function, the constraints in (6) are non-linear, so linear programming software cannot be directly used to compute the GDS solution. In the standard GLM setting, an iterative weighted least squares algorithm is used to solve (5). In particular, given the current estimate for β̂, an adjusted dependent variable

    Z_i = Σ_{j=1}^p X_{ij} β̂_j + (Y_i − µ_i)/V_i

is computed, where V_i = b″(θ_i) is the variance of Y_i. A new estimate for β is then computed using weighted least squares, i.e. X_j^T W(Z − X β̂) = 0 for j = 1,..., p, where W is a diagonal matrix consisting of the V_i's on the diagonal. This procedure is iterated until β̂ converges. An analogous iterative approach works well when computing the GDS solution.

1. At the (k+1)th iteration let V_i = b″(θ̂_i^{(k)}) and Z_i = Σ_{j=1}^p X_{ij} β̂_j^{(k)} + (Y_i − µ_i^{(k)})/V_i, where β̂^{(k)} denotes the corresponding estimate from the kth iteration.
2. Let Z*_i = Z_i √V_i and X*_{ij} = X_{ij} √V_i.
3. Use the Dantzig selector to compute β̂^{(k+1)} with Z* as the response and X* as the design matrix.
4. Repeat steps 1 through 3 until convergence.

Notice that step 3 is equivalent to

    min_{β ∈ B} ‖β‖_1   subject to   |X_j^T W(Z − X β)| ≤ λ,   j = 1,..., p,   (7)

which only requires a linear programming algorithm to compute. In addition, once the algorithm has converged, β̂^{(k+1)} = β̂^{(k)}, so X_j^T W(Z − X β̂^{(k+1)}) = X_j^T W W^{-1}(Y − µ) = X_j^T(Y − µ), and hence the solution to (7) will also be in the feasible region for (6).

As with the standard GLM iterative algorithm, there is no theoretical guarantee of convergence. However, in practice we have found that, for sparse solutions, our algorithm generally converges very rapidly, so the computation time is only a small multiple of that for the standard Dantzig selector. For large p and lower values of λ, nearer the GLM solution, the algorithm does not always converge, though in this circumstance it usually iterates between two similar estimates, so one can choose either point. We do not see this as a significant limitation because we are most interested in producing relatively sparse models, where the algorithm always seems to converge.
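For the Bernoulli family with the logistic link, the iterative weighted Dantzig selector above can be sketched as follows. The routine dantzig_selector is assumed to solve (2), for example the linear programming sketch of Section 2.1; the convergence tolerance and iteration cap are arbitrary choices of ours.

```python
import numpy as np

def gds_fit_logistic(X, y, lam, dantzig_selector, max_iter=50, tol=1e-6):
    """Sketch of the iterative weighted Dantzig selector of Section 3.1 for
    Bernoulli responses with the logistic (canonical) link."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        V = np.maximum(mu * (1.0 - mu), 1e-10)   # b''(theta), guarded away from 0
        Z = eta + (y - mu) / V                   # adjusted dependent variable
        Xs = X * np.sqrt(V)[:, None]             # X*_{ij} = X_{ij} sqrt(V_i)
        Zs = Z * np.sqrt(V)                      # Z*_i = Z_i sqrt(V_i)
        beta_new = dantzig_selector(Xs, Zs, lam)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

A routine of this form could serve as the gds_fit argument in the Double Dantzig sketch of Section 2.3.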

3.2 The GDS Path Algorithm

In their landmark paper, Efron et al. (2004) demonstrated that the coefficient path of the Lasso, i.e. the set of all possible coefficient vectors for different values of the tuning parameter, λ, was piecewise linear. They introduced a highly efficient algorithm called LARS and showed that it could fit the entire Lasso path. More recently, James et al. (2007) developed the DASSO algorithm, a generalization of LARS, which can fit the coefficient paths for both the Lasso and the Dantzig selector. Park and Hastie (2007) also use a variant of LARS to produce an approximation of the GLasso path. One key advantage of all these approaches is that they allow one to perform cross-validation on a fine grid of tuning parameters in a computationally efficient manner.

Here we introduce our own algorithm for fitting the GDS coefficient path. Our algorithm is a generalization of DASSO and computes the GDS path in a similar fashion to that of the GLasso algorithm. As with GLasso, our algorithm provides an approximation to the true path. In analogy with DASSO, LARS and GLasso, the GDS algorithm progresses in piecewise linear steps. At the start of each step the variables that have maximum covariance with the residual vector are identified. These variables make up the current active set, A, and determine the optimal direction that the coefficient vector, β, should move in. We then update A and compute an approximate value of the distance that can be traveled before the active set changes. At this point we correct the coefficient vector by solving the GDS optimization problem for given sets of active constraints and nonzero coefficients. An appealing feature of such a correction is that the use of an optimization package is not required. We outline the main steps of our algorithm below. The full details for each step are provided in the appendix.

1. Initialize β^1_j = 0 for j = 1,..., p. Set A^1 = ∅ and l = 1.
2. Let c = X^T(Y − µ^l) and augment A^l with argmax_j |c_j|, where c_j is the jth component of c.
3. Compute the p-dimensional direction vector h^l based on the current active set A^l and coefficient estimates β^l. Subsection A.1 presents the details for the calculation of h^l.
4. Form A^{l+1} by updating A^l. Compute γ^l, the distance to travel in the direction h^l, and set β̃^{l+1} = β^l + γ^l h^l. Subsection A.2 presents the details for these calculations.
5. Produce the corrected coefficient vector β^{l+1} based on β̃^{l+1}. Subsection A.3 presents the details for the calculation of β^{l+1}.
6. Set l ← l + 1. Repeat steps 2 through 5 until C^l = 0.

(A schematic skeleton of this loop, with the appendix computations left as placeholders, is sketched at the end of this section.)

The entire coefficient path can then be constructed by linearly interpolating β^1, β^2, ..., β^L, where L denotes the number of steps that the algorithm takes. Steps 3 and 4 above closely mirror the corresponding direction and distance steps in DASSO and essentially provide a linear approximation to the GDS path. Step 5 is a corrector step to adjust for the lack of linearity. Figure 2 shows the first segment of the GDS path on the SPAM dataset (described in Hastie et al. (2001) and discussed further in Section 5), produced by both our path algorithm and the GDS on a fine grid. Although, for non-linear link functions, the GDS coefficient path is not guaranteed to be exactly piecewise linear, we see here that the two paths are practically identical.

Figure 2: Plots of GDS coefficient paths on the SPAM data produced by the path algorithm and the GDS on a grid.
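A bare skeleton of this loop is given below. The three helper functions are placeholders standing in for the computations of Appendices A.1-A.3, which are not reproduced here; their names and signatures are ours, and the skeleton omits details such as the handling of degenerate cases.

```python
import numpy as np

def gds_path(X, y, inv_link, direction_step, distance_step, correction_step,
             max_steps=500):
    """Skeleton of the GDS path algorithm of Section 3.2.

    inv_link maps X beta to the mean vector; direction_step, distance_step
    and correction_step are placeholders for the computations detailed in
    Appendices A.1-A.3 and are not supplied by the paper in code form.
    """
    n, p = X.shape
    beta = np.zeros(p)
    active = []                                   # active set A^l
    path = [beta.copy()]
    for _ in range(max_steps):
        c = X.T @ (y - inv_link(X @ beta))        # current score vector
        if np.allclose(c, 0):                     # stop once C^l = 0
            break
        j = int(np.argmax(np.abs(c)))             # variable with maximum |c_j|
        if j not in active:
            active.append(j)
        h = direction_step(X, y, beta, active)                     # A.1
        gamma, active = distance_step(X, y, beta, h, active)       # A.2
        beta = correction_step(X, y, beta + gamma * h, active)     # A.3
        path.append(beta.copy())
    return path
```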

4 Simulation Study

In this section we present the results from several simulation studies that we conducted to test the performance of the DD in comparison to the GDS and four other possible approaches: standard GLM, the Oracle, the Gauss Dantzig and GLasso. The Oracle method represents a gold standard that cannot be used in practice, where it is assumed that the important predictors are known and GLM is used to estimate their coefficients. The Gauss Dantzig is an approach suggested by Candes and Tao (2007). They first used the Dantzig selector as a variable selection tool to identify the important predictors and then standard linear regression, on the reduced set of variables, to produce the final coefficient estimates. In our setting this process involves using the GDS to select the variables and then running a standard GLM on the subset. The Gauss Dantzig method is a special case of the Double Dantzig with λ_2 set to zero. Finally, GLasso uses the R package glmpath to compute the coefficients minimizing the GLM log likelihood subject to an L_1 penalty on β, i.e. a GLM version of the Lasso (Park and Hastie, 2007).

For our first simulation we tested the DD, GLM, GDS, Gauss Dantzig and Oracle approaches on one thousand simulated data sets over a range of values for λ. We did not include GLasso in this simulation because the tuning parameters for the GLasso and GDS methods are not easily comparable. Each data set was generated in the same fashion as described for Figure 1. In particular we used n = 100 observations and p = 20 predictors, of which only the first five had a relationship to the response. The responses were then generated randomly according to a binomial distribution. For each data set we computed the GLM and Oracle coefficient estimates. We also produced estimates for the other methods using λ = 0.025, 0.10, 0.25, 0.75, 1.25 and 2.50, where λ = 0.025 represents only a small level of regularization while at λ = 2.50 all coefficients are set to zero. For each method the root mean squared error between the estimated and true coefficients was computed for each of the data sets. The median errors are presented in Table 1.

Table 1: Root mean squared errors (root MSE) on the estimated beta coefficients compared to the truth using five estimation methods (GLM, GDS, Gauss Dantzig, DD and Oracle) and six values for lambda. These numbers correspond to the median root MSE over one thousand simulations.

For all values of λ the GDS gave considerably improved estimates over standard GLM. The Gauss Dantzig method gave good results for λ = 0.75 and 1.25 but performed only equivalently to GLM for very low values of λ, suggesting that it is somewhat sensitive to the correct choice of λ. Alternatively, the Double Dantzig gave equivalent or superior results to the GDS for all values of λ. In fact, for λ = 0.75 the DD was even superior to the Oracle. While somewhat surprising, such a result is possible if the first pass of the DD picks out the correct variables and then a small level of regularization on the second pass provides a better estimate than performing GLM alone.

In practice, a value for λ needs to be chosen automatically.
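The simulation design described above is easy to reproduce. The sketch below generates one such data set; the paper does not state the values taken by the five non-zero coefficients, so the unit values used here are purely illustrative, as is the function name.

```python
import numpy as np

def simulate_binomial_data(n=100, p=20, n_true=5, rng=None):
    """Generate one data set following the Section 4 simulation design:
    iid Gaussian design, only the first n_true coefficients non-zero,
    Bernoulli responses through the logistic link."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:n_true] = 1.0                      # assumed non-zero values
    prob = 1.0 / (1.0 + np.exp(-X @ beta))   # logistic link
    y = rng.binomial(1, prob)
    return X, y, beta
```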

For the remaining simulations, on each dataset, for the DD and Gauss Dantzig we used the value, λ_DD, that gave the lowest cross-validated deviance on the DD. Similarly, for the GLasso we used the parameter, λ_L, giving the lowest cross-validated deviance from the cv.glmpath function. For the GDS we implemented two versions. The first, GDS_L, used λ_L and the second, GDS_DD, used λ_DD. Figure 3 illustrates the results using this approach on data generated in an almost identical fashion to that for Table 1 except that p = 50 variables were used.

Figure 3: a) Boxplots of the root mean squared errors, on the log scale, for the seven different methods over the one hundred simulation runs. b) Boxplots of the number of mistakes made by each method in selecting variables for the final model. c) A plot of the estimated beta coefficients for a typical data set. Open circles correspond to standard GLM, blue filled circles to GDS and red pluses to the Double Dantzig. The solid horizontal line represents the true coefficients.

Figure 3a) provides boxplots, on the log scale, of the root MSE for each method. All methods significantly outperformed standard GLM. The DD and Oracle produced similar results, though the Oracle performed poorly for a few of the data sets. Note that this is the reason we report median rather than mean errors in Tables 1 and 2. The GLasso and GDS_L methods performed in an almost identical fashion, while GDS_DD produced slightly inferior results. The median performance of the Gauss Dantzig method was similar to the GLasso and GDS_L but it was far less stable.

Figure 3b) contains boxplots illustrating the total number of mistakes each method made, for each data set, in predicting the variables for the final model. Mistakes correspond to true predictors that are not included and unrelated variables that are included. We have not produced a boxplot for GLM because it does not perform variable selection. The median number of errors for the GLasso and GDS_L methods are both around 16, while the Gauss Dantzig, GDS_DD and DD methods all have a considerably smaller number of mistakes. Most of these errors arise from including too many variables in the final model.

The low number of mistakes for GDS_DD may seem incongruous given its relatively worse performance in Figure 3a). This inconsistency is explained by Figure 3c), which plots the estimated coefficients, for the GLM, GDS_DD and DD methods, on a typical data set. As Candes and Tao (2007) observed for the Dantzig selector, while the GDS_DD does a good job at selecting the correct variables, it tends to over shrink their coefficients. Hence the poor root MSE of GDS_DD despite its good model selection abilities. The DD allows us both to select a good model and to produce accurate coefficient estimates for the selected variables. Since the DD dominated the Gauss Dantzig method we only report results for the former in all the remaining simulations.

Table 2 provides numerical results, with λ chosen using cross-validation, for nine sets of simulations, produced as described previously, using p = 20, p = 50 or p = 105 variables and three response distributions: Binomial, Poisson and Gaussian. In all cases only five variables were related to the response. For each combination of p and response distribution we present three sets of results. First we give the median root MSE, as in Table 1. We also provide the false positive rate, i.e. the proportion of unrelated variables that are estimated to have non-zero coefficients, and the false negative rate, i.e. the proportion of important variables that are zeroed out.

Table 2: Root MSE, false positive rate (F+) and false negative rate (F-) for six different methods (GLM, GLasso, GDS_L, GDS_DD, DD and Oracle), and three different response distributions (Binomial, Poisson and Gaussian), using 100 simulations each with p = 20, p = 50 and p = 105. The sample size for each simulation was n = 100.

The relative levels of performance of the different methods are similar for all three response distributions. For p = 20 and p = 50 the standard GLM tended to be inferior to the other five approaches in terms of root MSE. For p = 105, where p > n, the GLM could not be fit while the other methods still performed well. The GLasso and GDS_L produced very similar results. This is consistent with the theory and empirical results in James et al. (2007), which show strong connections between the Lasso and Dantzig selector. Generally, the GDS_DD underperformed relative to the GLasso and GDS_L in terms of RMSE. However, it produced low false positive and negative rates. Finally, the DD did a good job of model selection while also producing RMSE levels close to, or in some cases below, those of the Oracle.
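The false positive and false negative rates just defined can be computed directly from an estimated and a true coefficient vector, as in the short sketch below (the function name and zero threshold are ours).

```python
import numpy as np

def selection_error_rates(beta_hat, beta_true, tol=1e-8):
    """False positive and false negative rates as defined in Section 4:
    F+ is the proportion of unrelated variables given non-zero estimates,
    F- is the proportion of important variables that are zeroed out."""
    selected = np.abs(beta_hat) > tol
    important = np.abs(beta_true) > tol
    false_pos = np.mean(selected[~important]) if np.any(~important) else 0.0
    false_neg = np.mean(~selected[important]) if np.any(important) else 0.0
    return false_pos, false_neg
```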

5 Applications

We chose to illustrate the performance of the GDS and DD on two real world data sets. The first is the spam data described in Hastie et al. (2001). This data consists of n = 4,601 e-mails. For each e-mail the response variable records whether it is a true e-mail or a spam e-mail. We also have p = 57 predictors recording the relative frequencies of commonly occurring words and characters. The aim is to treat this as a two class classification problem using the 57 predictors to identify the type of e-mail.

We began by computing the 10-fold DD cross validated error rate using the logistic link function on a fine grid of values for λ. The cross validation required little computational effort because the GDS and DD ran extremely fast on a data set of this size. For each value of λ we computed the average number of variables included in the model. Figure 4 plots the number of variables versus the error rate. Clearly the first five to ten variables cause a significant decline in the error rate, and from that point on there is a slow decline until approximately the thirty variable model, at which point the error rate levels out.

Figure 4: Cross validated error rates for the spam data, on the double Dantzig, using different numbers of variables. There is essentially no improvement in error rate beyond thirty variables.
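The cross-validated error rates reported here can be reproduced schematically as follows, assuming a fitting routine such as the double_dantzig sketch of Section 2.3; the fold construction and the classification threshold of 0.5 are our choices.

```python
import numpy as np

def cv_error_rate(X, y, lam1, lam2, fit_fn, n_folds=10, rng=None):
    """10-fold cross-validated misclassification rate for a given pair of
    tuning parameters.  fit_fn(X, y, lam1, lam2) is assumed to return a
    coefficient vector (for example, the double_dantzig sketch above)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    folds = rng.permutation(n) % n_folds
    errors = 0
    for k in range(n_folds):
        test = folds == k
        beta = fit_fn(X[~test], y[~test], lam1, lam2)
        prob = 1.0 / (1.0 + np.exp(-X[test] @ beta))
        errors += np.sum((prob > 0.5) != y[test])
    return errors / n
```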

Figure 5 provides a comparison of the coefficients from the thirty variable GDS and DD fits to those using standard GLM. The GLM coefficients are extremely variable, with several that have not been plotted on this scale because they range between -45 and 5. Of the remainder, interestingly, a number of the non-zero DD coefficients closely match the GLM ones. The effect of the DD appears to be to zero out the less important variables and to shrink the extreme coefficient estimates. The GDS generally returns coefficients with the same sign as the DD but with a significant level of shrinkage.

Figure 5: Coefficient estimates, ordered from largest to smallest in absolute value, on the spam data. The open circles correspond to GLM, the closed blue circles to GDS and the red pluses to DD.

Most of the reduction in error rate comes from the first few variables. Hence, in Table 3, we have listed the first twelve variables, in the order that they appear in the model, along with their DD coefficients. Negative coefficients correspond to a decreasing probability of spam and positive coefficients the opposite. The e-mails were sent to George Forman at HP. Hence words such as hp, hp1 and george are indicators of non-spam, while remove and $ suggest spam. Thus the coefficients all appear to have the correct sign, with $, 000 and remove all highly positive and george strongly negative. In addition, we used the bootstrap to produce 95% confidence intervals via the percentile method (Efron and Tibshirani, 1993). The first six variables plus the intercept all provide clear evidence of predictive power since their confidence intervals do not include zero.
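The percentile intervals can be obtained with a standard nonparametric bootstrap, sketched below for a generic fitting routine fit_fn(X, y); the function name and the number of resamples shown are ours, although 1,000 bootstrap runs are used later in Figure 6.

```python
import numpy as np

def bootstrap_percentile_ci(X, y, fit_fn, n_boot=1000, level=0.95, rng=None):
    """Percentile-method bootstrap confidence intervals for the coefficients.
    fit_fn(X, y) is assumed to return a full coefficient vector (for example,
    the Double Dantzig with fixed tuning parameters)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    boot = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample observations
        boot[b] = fit_fn(X[idx], y[idx])
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(boot, alpha, axis=0)
    upper = np.quantile(boot, 1.0 - alpha, axis=0)
    return lower, upper
```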

    Order  Variable    Coefficient   Confidence Interval
    1      hp          -1.69         (-2.67, -1.22)
    2      000                       (2.25, 5.54)
    3      remove      3.41          (2.36, 5.36)
    4      hp1                       (-2.26, -0.67)
    5      $           7.74          (4.14, 10.76)
    6      george      -3.10         (-5.69, -2.36)
    7      1999                      (-1.26, 0.02)
    8      Intercept   -0.95         (-1.78, -0.70)
    9      your        0.37          (0.00, 0.51)
    10     free        1.28          (0.00, 1.69)
    11     edu         -1.89         (-3.09, 0.00)
    12     re          -0.67         (-1.01, 0.00)

Table 3: Coefficients for the first twelve variables to enter the model when fitting the spam data. The variables are listed in the order that they entered. 95% bootstrap confidence intervals are also presented.

The other variables include a point mass at zero, so their predictive power is less clear. However, with the exception of 1999, none of these variables contain values on either side of zero, which one might expect for independent variables. Figure 6 illustrates this point with histograms of the bootstrap estimates for four of the variables. The first three are all consistently above or below zero. However, the fourth, re, has approximately two thirds of its estimates below zero and the remainder as a point mass at zero, but no points above zero. This seems to provide evidence that re may be an important variable, despite the point mass at zero, though a formal significance test would be difficult.

Figure 6: Histograms of the coefficient estimates on four variables (including hp, george and re) from 1,000 bootstrap runs.

The second dataset that we examined was the internet advertising data available at the UCI machine learning repository. This data provides instances of possible advertisements on Internet pages. Hence the response is categorical, indicating whether the image is an ad or not. The predictors record the geometry of the image as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. After removing missing values etc., the data set contained n = 2,359 observations and p = 1,430 variables. We included this dataset because its large number of predictors presents significant statistical and computational difficulties for standard approaches. For example, the GLM function in R took almost fifteen minutes to run and most coefficients produced NA estimates. However, for the ten variable model discussed below, the GDS and DD methods were able to estimate all the coefficients in under two minutes. Table 4 presents the coefficient estimates and bootstrap confidence intervals for the first ten variables to enter the model. The most important variables, in addition to the intercept, were whether the anchor url contained the words com or click and whether the image url contained the word ads.

    Order  Variable                            Coefficient   Confidence Interval
    1      intercept                           -3.11         (-3.46, -2.82)
    2      anchor url contains com?            3.69          (2.57, 5.25)
    3      image url contains ads?             3.57          (0.94, 5.52)
    4      anchor url contains click?          3.45          (0.79, 5.55)
    5      alt text contains click and here?   2.61          (0.00, 3.19)
    6      anchor url contains uk?             2.47          (0.00, 7.89)
    7      anchor url contains adclick?        6.84          (0.00, 7.71)
    8      anchor url contains http or www?    4.15          (0.00, 7.24)
    9      image url contains ad?              3.57          (0.00, 6.58)
    10     anchor url contains cat?            2.31          (0.00, 4.13)

Table 4: Coefficients for the first ten variables to enter the model when fitting the advertising data. The variables are listed in the order that they entered. 95% bootstrap confidence intervals are also presented.

6 Discussion

The Dantzig selector and Lasso are powerful tools for variable selection and model fitting on linear models. However, they both suffer from a potential over shrinkage of the coefficient estimates. Our Double Dantzig method both removes the shrinkage drawback and extends the Dantzig selector methodology to the GLM framework. Our simulation studies suggest that this approach can provide significant improvements, both in model and coefficient accuracy, over other standard linear and GLM variable selection methods. Finally, our GDS path algorithm allows the Double Dantzig to be implemented in an efficient manner.

We see several possible extensions of this work. James and Zhu (2007) develop an approach called FLiRTI to fit functional linear regression models where the predictor, X_i(t), is a function and is related to the scalar response, Y_i, via

    Y_i = β_0 + ∫ X_i(t) β(t) dt + ε_i.

FLiRTI estimates the coefficient curve, β(t), by assuming sparsity, not in the coefficients, but instead in the derivatives of β(t). For example, assuming that β″(t) = 0 at most time points produces an estimate for β(t) that is piecewise linear, with the breakpoints determined by the data. The FLiRTI approach uses the Dantzig selector to determine the locations of the non-zero derivatives and hence estimate β(t). James and Zhu (2007) show that the Dantzig selector's theoretical bounds also apply to FLiRTI.

We believe it is possible to combine the methodology and theory of the Double Dantzig and FLiRTI approaches to fit Generalized Linear Functional Models, where the predictor is functional and the response has a general distribution such as Binomial, Poisson, Gaussian, Exponential, etc. Another potential extension is to Generalized Additive Models (Hastie and Tibshirani, 1990), where the relationship between predictors, X_{i1},...,X_{ip}, and response, Y_i, is modeled via

    g(µ_i) = β_0 + f_1(X_{i1}) + ··· + f_p(X_{ip}).

Traditionally, smoothing splines or kernel methods have been used to estimate the f_j's. Alternatively, using a similar approach to that of James and Zhu (2007), one could assume sparsity in the derivatives of the f_j's. This would let the data produce piecewise linear, quadratic, cubic, etc. estimates for the curves, which would allow simple structures that could be interpreted more easily.

A Individual Steps of the GDS Algorithm

A.1 Direction Step

For simplicity of the notation, we suppress the superscript l throughout the remainder of the section. Compute X* and Z*, from Section 3.1, based on the latest path point, β. Our direction step and that for the DASSO are identical, except ours is applied to X* and Z* rather than X and Y. Let X*_A represent the K columns of X* corresponding to the elements of the active set A. The set up of the algorithm ensures that, except for some degenerate cases, the number of nonzero elements of β is either K or K − 1. Consider the latter case first. Let d be a vector with 1's for the first 2p elements and 0's for the remaining K elements. Further, let β^+ ≥ 0 and β^− ≥ 0 represent the positive and negative parts of β, i.e. β = β^+ − β^−. Finally, let S represent a K-dimensional diagonal matrix whose j-th diagonal element is 1 if the j-th component of X*_A^T(Y − µ) is positive and −1 otherwise.

The procedure to choose the direction h can be summarized in the following steps.

1. Compute the K by 2p + K matrix A = [ S X*_A^T X*   −S X*_A^T X*   I ], where I is the identity matrix. The first p columns correspond to β^+ and the next p columns to β^−.

2. We wish to identify a subset of the columns of A to form a K by K matrix B. We will construct B in the form [ B̃  A_i ], where B̃ is produced by selecting all the columns of A that correspond to the non-zero components of β^+ and β^−, and A_i is one of the remaining columns. Write the two parts of B in the following block form:

    B̃ = ( B̃_1 )      and      A_i = ( A_{i1} )
         ( B̃_2 )                      ( A_{i2} ),

where B̃_1 is a square matrix of dimension K − 1, and A_{i2} is a scalar. Finally, let d_B̃ and d_i be the components of d corresponding to the columns in B̃ and the column A_i, respectively.

3. Choose the final column for B as A_j, where

    j = argmax_{i: q_i ≠ 0, α/q_i > 0} [ d_B̃^T B̃_1^{-1} A_{i1} − d_i ] / q_i,   (8)

with q_i = A_{i2} − B̃_2 B̃_1^{-1} A_{i1} and α = 1 − B̃_2 B̃_1^{-1} 1.

4. Compute h̃ = B^{-1} 1 and let h be a p-dimensional vector of 0's. Depending on the columns chosen for B, certain elements of h̃ will correspond to elements of β^+ and β^−. For each k, set h_j = h̃_k if h̃_k corresponds to β_j^+, and set h_j = −h̃_k if h̃_k corresponds to β_j^−. Note that h has at most K nonzero components.

In the remaining case of exactly K nonzero β-coefficients, the above procedure is simplified by removing steps 2 and 3, and using the matrix B that is produced by selecting all the columns of A that correspond to the non-zero components of β^+ and β^−. See James et al. (2007) for further explanation of, and motivation for, the direction step.

A.2 Distance Step

We start by taking the distance step of the DASSO algorithm, applied to X* and Z*. If the index j in part 3 of the direction step was chosen as 2p + m for some positive m, remove the m-th active constraint from the index set A. Let X*_k represent any variable that is a member of the current active set and define

    γ_1 = min+_{j ∈ A^c} { (c_k − c_j) / ((X*_k − X*_j)^T X* h),   (c_k + c_j) / ((X*_k + X*_j)^T X* h) },

where min+ is the minimum taken over the positive components only and c_k and c_j are the kth and jth components of c.

Set γ_2 = min+_j { −β_j / h_j } and γ = min{γ_1, γ_2}. Let µ̃ be the mean vector corresponding to β̃ = β + γ h. Note that for the DASSO algorithm γ specifies the first point on the latest step of the path where a new constraint becomes active or a coefficient hits zero. Since the GDS path is not guaranteed to be exactly piecewise linear, it is possible that our candidate γ is too large. If this is the case we set γ ← γ/2, recompute µ̃ and iterate until max_{j ∈ A^c} |X_j^T(Y − µ̃)| ≤ max_{j ∈ A} |X_j^T(Y − µ̃)|.

A.3 Correction Step

Define C = ‖X^T(Y − µ̃)‖_∞ and compute X* and Z* that correspond to β̃. Form the set I by combining the index sets of nonzero coefficients of the vectors β and β̃. Define B = S X*_A^T X*_I, using S from the direction step, set β̄_I = B^{-1}(S X*_A^T Z* − C 1), then recompute X*, Z* and B. Iterate these steps until convergence, then let β̄ be the p-dimensional vector whose only nonzero coefficients are in the positions specified by I and have the values given by β̄_I. Very few iterations are typically required until convergence. Note that β̄ satisfies X_A^T(Y − µ̄) = C S 1.

References

Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics (to appear as a discussion article). Available at emmanuel/publications.html.
Chen, S., Donoho, D., and Saunders, M. (1998). Atomic decomposition by basis pursuit. SIAM Journal of Scientific Computing 20.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression (with discussion). The Annals of Statistics 32(2).
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. London: Chapman and Hall.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96.
Genkin, C., Lewis, D., and Madigan, D. (2004). Large-scale Bayesian logistic regression for text categorization. Technical Report.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall.
Hastie, T. J., Tibshirani, R. J., and Friedman, J. (2001). The Elements of Statistical Learning. Springer.
James, G. M. and Radchenko, P. (2007). The adapted Dantzig selector. In preparation.

James, G. M., Radchenko, P., and Lv, J. (2007). The DASSO algorithm for fitting the Dantzig selector and the Lasso. Under review. (Available at www-rcf.usc.edu/~gareth).
James, G. M. and Zhu, J. (2007). Functional linear regression that's interpretable. Under review. (Available at www-rcf.usc.edu/~gareth).
Lokhorst, J. (1999). The Lasso and generalized linear models. Technical Report, University of Adelaide.
Mallat, S. and Zhang, Z. (1993). Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing 41.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London, 2nd edn.
Park, M. and Hastie, T. (2007). An L1 regularization-path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B (to appear).
Shevade, S. and Keerthi, S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58.
Zhao, P., Rocha, G., and Yu, B. (2006). Grouped and hierarchical model selection through composite absolute penalties. Technical Report, Department of Statistics, University of California at Berkeley.
Zhao, P. and Yu, B. (2004). Boosted lasso. Technical Report, Department of Statistics, University of California at Berkeley.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67.


More information

Fast Regularization Paths via Coordinate Descent

Fast Regularization Paths via Coordinate Descent user! 2009 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerome Friedman and Rob Tibshirani. user! 2009 Trevor

More information

Chapter 3. Linear Models for Regression

Chapter 3. Linear Models for Regression Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Tractable Upper Bounds on the Restricted Isometry Constant

Tractable Upper Bounds on the Restricted Isometry Constant Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

LEAST ANGLE REGRESSION 469

LEAST ANGLE REGRESSION 469 LEAST ANGLE REGRESSION 469 Specifically for the Lasso, one alternative strategy for logistic regression is to use a quadratic approximation for the log-likelihood. Consider the Bayesian version of Lasso

More information

Regularization Paths. Theme

Regularization Paths. Theme June 00 Trevor Hastie, Stanford Statistics June 00 Trevor Hastie, Stanford Statistics Theme Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Mee-Young Park,

More information

Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns

Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns Aly Kane alykane@stanford.edu Ariel Sagalovsky asagalov@stanford.edu Abstract Equipped with an understanding of the factors that influence

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint B.A. Turlach School of Mathematics and Statistics (M19) The University of Western Australia 35 Stirling Highway,

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Cong Liu, Tao Shi and Yoonkyung Lee Department of Statistics, The Ohio State University Abstract Variable selection

More information

arxiv: v1 [math.st] 8 Jan 2008

arxiv: v1 [math.st] 8 Jan 2008 arxiv:0801.1158v1 [math.st] 8 Jan 2008 Hierarchical selection of variables in sparse high-dimensional regression P. J. Bickel Department of Statistics University of California at Berkeley Y. Ritov Department

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Howard D. Bondell and Brian J. Reich Department of Statistics, North Carolina State University,

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR DEPARTMENT OF STATISTICS North Carolina State University 2501 Founders Drive, Campus Box 8203 Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series No. 2583 Simultaneous regression shrinkage, variable

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection An Improved 1-norm SVM for Simultaneous Classification and Variable Selection Hui Zou School of Statistics University of Minnesota Minneapolis, MN 55455 hzou@stat.umn.edu Abstract We propose a novel extension

More information

L 1 Regularization Path Algorithm for Generalized Linear Models

L 1 Regularization Path Algorithm for Generalized Linear Models L 1 Regularization Path Algorithm for Generalized Linear Models Mee Young Park Trevor Hastie November 12, 2006 Abstract In this study, we introduce a path-following algorithm for L 1 regularized generalized

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Qiwei Yao Department of Statistics, London School of Economics q.yao@lse.ac.uk Joint work with: Hongzhi An, Chinese Academy

More information

DASSO: connections between the Dantzig selector and lasso

DASSO: connections between the Dantzig selector and lasso J. R. Statist. Soc. B (2009) 71, Part 1, pp. 127 142 DASSO: connections between the Dantzig selector and lasso Gareth M. James, Peter Radchenko and Jinchi Lv University of Southern California, Los Angeles,

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Robust Variable Selection Through MAVE

Robust Variable Selection Through MAVE Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse

More information

The OSCAR for Generalized Linear Models

The OSCAR for Generalized Linear Models Sebastian Petry & Gerhard Tutz The OSCAR for Generalized Linear Models Technical Report Number 112, 2011 Department of Statistics University of Munich http://www.stat.uni-muenchen.de The OSCAR for Generalized

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu 1,2, Daniel F. Schmidt 1, Enes Makalic 1, Guoqi Qian 2, John L. Hopper 1 1 Centre for Epidemiology and Biostatistics,

More information

Greedy and Relaxed Approximations to Model Selection: A simulation study

Greedy and Relaxed Approximations to Model Selection: A simulation study Greedy and Relaxed Approximations to Model Selection: A simulation study Guilherme V. Rocha and Bin Yu April 6, 2008 Abstract The Minimum Description Length (MDL) principle is an important tool for retrieving

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Forward Regression for Ultra-High Dimensional Variable Screening

Forward Regression for Ultra-High Dimensional Variable Screening Forward Regression for Ultra-High Dimensional Variable Screening Hansheng Wang Guanghua School of Management, Peking University This version: April 9, 2009 Abstract Motivated by the seminal theory of Sure

More information

Non-linear Supervised High Frequency Trading Strategies with Applications in US Equity Markets

Non-linear Supervised High Frequency Trading Strategies with Applications in US Equity Markets Non-linear Supervised High Frequency Trading Strategies with Applications in US Equity Markets Nan Zhou, Wen Cheng, Ph.D. Associate, Quantitative Research, J.P. Morgan nan.zhou@jpmorgan.com The 4th Annual

More information

A significance test for the lasso

A significance test for the lasso 1 First part: Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Second part: Joint work with Max Grazier G Sell, Stefan Wager and Alexandra

More information

Spatial Lasso with Applications to GIS Model Selection

Spatial Lasso with Applications to GIS Model Selection Spatial Lasso with Applications to GIS Model Selection Hsin-Cheng Huang Institute of Statistical Science, Academia Sinica Nan-Jung Hsu National Tsing-Hua University David Theobald Colorado State University

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Variable selection using Adaptive Non-linear Interaction Structures in High dimensions

Variable selection using Adaptive Non-linear Interaction Structures in High dimensions Variable selection using Adaptive Non-linear Interaction Structures in High dimensions Peter Radchenko and Gareth M. James Abstract Numerous penalization based methods have been proposed for fitting a

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

Forward Selection and Estimation in High Dimensional Single Index Models

Forward Selection and Estimation in High Dimensional Single Index Models Forward Selection and Estimation in High Dimensional Single Index Models Shikai Luo and Subhashis Ghosal North Carolina State University August 29, 2016 Abstract We propose a new variable selection and

More information

SUPPLEMENTARY APPENDICES FOR WAVELET-DOMAIN REGRESSION AND PREDICTIVE INFERENCE IN PSYCHIATRIC NEUROIMAGING

SUPPLEMENTARY APPENDICES FOR WAVELET-DOMAIN REGRESSION AND PREDICTIVE INFERENCE IN PSYCHIATRIC NEUROIMAGING Submitted to the Annals of Applied Statistics SUPPLEMENTARY APPENDICES FOR WAVELET-DOMAIN REGRESSION AND PREDICTIVE INFERENCE IN PSYCHIATRIC NEUROIMAGING By Philip T. Reiss, Lan Huo, Yihong Zhao, Clare

More information

VARIABLE SELECTION IN VERY-HIGH DIMENSIONAL REGRESSION AND CLASSIFICATION

VARIABLE SELECTION IN VERY-HIGH DIMENSIONAL REGRESSION AND CLASSIFICATION VARIABLE SELECTION IN VERY-HIGH DIMENSIONAL REGRESSION AND CLASSIFICATION PETER HALL HUGH MILLER UNIVERSITY OF MELBOURNE & UC DAVIS 1 LINEAR MODELS A variety of linear model-based methods have been proposed

More information

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Hujia Yu, Jiafu Wu [hujiay, jiafuwu]@stanford.edu 1. Introduction Housing prices are an important

More information

L 1 -regularization path algorithm for generalized linear models

L 1 -regularization path algorithm for generalized linear models J. R. Statist. Soc. B (2007) 69, Part 4, pp. 659 677 L 1 -regularization path algorithm for generalized linear models Mee Young Park Google Inc., Mountain View, USA and Trevor Hastie Stanford University,

More information

Logistic Regression with the Nonnegative Garrote

Logistic Regression with the Nonnegative Garrote Logistic Regression with the Nonnegative Garrote Enes Makalic Daniel F. Schmidt Centre for MEGA Epidemiology The University of Melbourne 24th Australasian Joint Conference on Artificial Intelligence 2011

More information

PENALIZING YOUR MODELS

PENALIZING YOUR MODELS PENALIZING YOUR MODELS AN OVERVIEW OF THE GENERALIZED REGRESSION PLATFORM Michael Crotty & Clay Barker Research Statisticians JMP Division, SAS Institute Copyr i g ht 2012, SAS Ins titut e Inc. All rights

More information

Lecture 24 May 30, 2018

Lecture 24 May 30, 2018 Stats 3C: Theory of Statistics Spring 28 Lecture 24 May 3, 28 Prof. Emmanuel Candes Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candes Outline Agenda: High-dimensional Statistical Estimation. Lasso

More information