The Constrained Lasso


Gareth M. James, Courtney Paulson and Paat Rusmevichientong
Marshall School of Business, University of Southern California

Abstract

Motivated by applications in areas as diverse as finance, image reconstruction, and curve estimation, we introduce the constrained lasso problem, where the underlying parameters satisfy a collection of linear constraints. We show that many statistical methods, such as the fused lasso, monotone curve estimation and the generalized lasso, are all special cases of the constrained lasso. Computing the constrained lasso poses some technical challenges, but we develop an efficient algorithm for fitting it over a grid of tuning parameters. Non-asymptotic error bounds are developed which suggest that the constrained lasso should outperform the lasso in situations where the true parameters satisfy the underlying constraints. Extensive numerical experiments show that our method performs well, both computationally and statistically. Finally, we apply the constrained lasso to a real data set to estimate the demand curve for a particular type of auto loan as a function of interest rate, and demonstrate that it can outperform more standard approaches.

Some key words: Lasso; Linear constraints; Penalized regression; Demand function; Generalized lasso.

1. Introduction

A great deal of attention has recently been focused on the problem of fitting the traditional regression model

Y_i = β_0 + Σ_{j=1}^p X_{ij} β_j + ε_i,   i = 1, ..., n,

in situations where the number of predictors, p, is large relative to the sample size, n. In this setting there are many approaches that outperform ordinary least squares (Frank and Friedman, 1993). Penalized regression methods are commonly adopted for these large-p problems. A small sampling of these approaches includes SCAD (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), the adaptive lasso (Zou, 2006), CAP (Zhao et al., 2006), the Dantzig selector (Candes and Tao, 2007), the relaxed lasso (Meinshausen, 2007), and VISA (Radchenko and James, 2008). However, one of the most popular of these approaches is the lasso (Tibshirani, 1996), which places an ℓ1 penalty on the coefficients. In particular, the lasso is defined as the solution to

argmin_β ||Y − Xβ||^2 + λ||β||_1,   (1)

where we have assumed that the response and predictors have been standardized to have mean zero, so the intercept can be ignored. The strengths and limitations of the lasso have been well documented. One of its major advantages is that the ℓ1 penalty has the effect of shrinking certain regression coefficients to exactly zero, and hence it performs variable selection. Another appealing aspect of the lasso is that there are several efficient algorithms for computing its solution. In particular, the LARS algorithm (Efron et al., 2004) computes the entire path of solutions for all values of λ. More recently, coordinate descent algorithms have been shown to be even more efficient for computing the lasso solution (Friedman et al., 2007, 2010).

In this article we are interested in solving

argmin_β ||Y − Xβ||^2 + λ||β||_1   subject to   Cβ ≤ b,   (2)

where C ∈ R^{m×p} and b ∈ R^m are a predefined matrix and vector. Equation (2) places a set of m linear constraints on the coefficients, so we call this problem the constrained lasso, or classo.
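To make the criterion in (2) concrete, the sketch below solves it directly as a convex program on hypothetical data. This is only an illustration of the optimization problem, not the path algorithm developed later in the paper; the use of cvxpy, the dimensions, and the value of λ are all assumptions made for the example.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, m = 50, 20, 3                      # hypothetical dimensions
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
C = rng.standard_normal((m, p))          # constraint matrix C
b = rng.standard_normal(m)               # constraint vector b
lam = 1.0                                # assumed tuning parameter

beta = cp.Variable(p)
objective = cp.Minimize(cp.sum_squares(Y - X @ beta) + lam * cp.norm1(beta))
constraints = [C @ beta <= b]            # the m linear constraints C beta <= b
cp.Problem(objective, constraints).solve()
print(np.round(beta.value, 3))
```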

Equations (1) and (2) appear similar, so why is the classo interesting? We address this question in three parts. First, we demonstrate that many commonly applied statistical problems, such as the fused lasso, the generalized lasso, monotone curve estimation, etc., can be expressed as special cases of (2). Hence, an efficient algorithm for solving the classo can be used to fit a large class of statistical methodologies. Second, it turns out that solving (2) is a non-trivial problem. There does not appear to be any simple extension of the LARS path algorithm to solve the classo, and a naive implementation of coordinate descent generally results in solutions that fail to reach the global optimum of (2). However, we present a modified coordinate descent algorithm that avoids these convergence issues. In addition, our algorithm can be implemented using standard lasso software packages, which greatly improves the robustness and computational speed of the classo. Third, we extend the theoretical bounds on the errors in the lasso coefficient estimates to the classo. The lasso and classo bounds have similar forms, but our bounds clearly demonstrate the potential improvements in accuracy that can be derived from adding the additional constraints in the classo.

The paper is structured as follows. In Section 2 we provide a number of motivating examples which illustrate the wide range of situations where the constrained lasso is applicable. Our modified coordinate descent algorithm is presented in Section 3. We first consider the situation where (2) involves equality constraints and then show how the algorithm can be extended to the more general inequality setting through the use of slack variables. Several different error bounds have been developed for the lasso coefficients (Negahban et al., 2010). In Section 4 we demonstrate that these bounds can be extended to the classo and that, provided the constraints hold for the true coefficients, the classo estimates will generally have smaller error bounds. There are many settings, such as monotone curve estimation or portfolio optimization, where we know that certain linear constraints must hold on the parameters we are estimating. Our theoretical results show that, in these settings, directly incorporating the constraints will produce superior estimates. We compare the classo to the standard lasso and relaxed lasso on a series of simulated data sets in Section 5. The classo is also implemented on a real data set consisting of loan applications for a large number of potential customers. The aim here is to estimate the demand curve as a function of interest rate. We conclude with a discussion of future extensions of this work in Section 6.

2. Motivating Examples

In this section we briefly describe a few of the situations in which the classo can be applied.

2.1 Portfolio Optimization

Suppose we have p assets, indexed by 1, 2, ..., p, whose covariance matrix is denoted by Σ. Markowitz (1952, 1959) developed the seminal framework for mean-variance analysis.

In particular, his approach involved choosing asset weights, w, to minimize the portfolio risk

w^T Σ w   (3)

subject to w^T 1 = 1. One may also choose to impose additional constraints on w to control the expected return of the portfolio, the allocations among sectors or industries, or the exposures to certain known risk factors. In practice Σ is unobserved, so it must be estimated using the sample covariance matrix, Σ̂. However, it has been well documented in the finance literature that when p is large, which is the norm in real world applications, minimizing (3) using the sample covariance matrix gives poor estimates for w. One possible solution involves regularizing Σ̂, but more recently attention has focused on directly penalizing or constraining the weights, an approach analogous to penalizing the coefficients in a regression setting. Fan et al. (2012) recently adopted this framework using a portfolio optimization problem with a gross-exposure parameter c, defined by:

minimize w^T Σ̂ w   subject to   ||w||_1 ≤ c,   w^T 1 = 1,   (4)

where c is a tuning parameter. Fan et al. (2012) reformulated (4) as a penalized regression problem and used LARS to estimate the weights. However, as they point out, their regression formulation only approximately solves (4). It is not hard to verify that (4) can be expressed in the form (2), where C has at least one row (to constrain w to sum to one) but may also have additional rows if we place constraints on the expected return, industry weightings, etc. Hence, the classo can be used to exactly solve the gross-exposure portfolio optimization problem.
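As an illustration, the sketch below sets up the gross-exposure problem (4) as a convex program on simulated returns. The dimensions, the gross-exposure bound c, the Cholesky factorization of the sample covariance, and the use of a generic convex solver are all assumptions for this example and not part of the paper's algorithm.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
R = rng.standard_normal((250, 10))                        # hypothetical return history: 250 days, 10 assets
Sigma_hat = np.cov(R, rowvar=False)                       # sample covariance matrix
L = np.linalg.cholesky(Sigma_hat + 1e-8 * np.eye(10))     # so that w' Sigma_hat w = ||L' w||^2
c = 1.6                                                   # assumed gross-exposure bound

w = cp.Variable(10)
prob = cp.Problem(cp.Minimize(cp.sum_squares(L.T @ w)),   # portfolio risk
                  [cp.norm1(w) <= c,                      # gross-exposure constraint ||w||_1 <= c
                   cp.sum(w) == 1])                       # weights sum to one
prob.solve()
print(np.round(w.value, 3))
```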

2.2 Monotone Curve Estimation

Consider the problem of fitting a smooth function, g(x), to a set of observations {(x_1, y_1), ..., (x_n, y_n)}, subject to the constraint that g must be monotone, e.g. non-increasing. There are many examples of such a setting. For instance, when estimating demand as a function of price we need to ensure that the curve is non-increasing. Alternatively, when estimating a cumulative distribution function (CDF) we need to ensure that the function is non-decreasing. Similarly, when performing curve registration to align a set of curves, the estimated warping functions must be non-decreasing.

Figure 1: Auto loan data demand curves as a function of the interest rate of the offered loan, with the probability of loan acceptance on the vertical axis, for the classo (black), logistic regression (red), and the standard lasso (blue).

If g is modeled as a parametric function, the monotonicity constraint is not hard to impose. However, one often wishes to produce a more flexible fit using a non-parametric approach. For example, we could model g(x) = b(x)^T β, where b(x) is a p-dimensional basis function. One natural approach is to choose a large value for p, to produce a very flexible estimate for g(x), and then to smooth the estimate by minimizing the lasso criterion

||Y − Bβ||^2 + λ||β||_1,   (5)

where the ith row of B is given by b(x_i)^T. However, minimizing (5) does not ensure a monotone curve. Hence, we need to constrain the coefficients such that

Cβ ≤ 0,   (6)

where the lth row of C is given by b′(u_l)^T and u_1, ..., u_m are a fine grid of points over the range of x. Enforcing (6) ensures that the derivative of g is non-positive, so g will be monotone decreasing. Of course, minimizing (5) subject to (6) is a special case of the classo.
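The sketch below illustrates this construction on synthetic data. The paper only specifies a generic basis b(x); here a cubic truncated power spline basis is an assumption chosen so that both B and the derivative rows b′(u_l)^T have simple closed forms, and the resulting problem is handed to a generic convex solver.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.exp(-3 * x) + 0.1 * rng.standard_normal(n)         # noisy, monotone-decreasing truth

knots = np.linspace(0.1, 0.9, 8)                          # assumed knot locations
def basis(t):                                             # cubic truncated power basis b(t)
    t = np.atleast_1d(t)
    return np.column_stack([np.ones_like(t), t, t**2, t**3] +
                           [np.maximum(t - k, 0) ** 3 for k in knots])
def basis_deriv(t):                                       # its derivative b'(t)
    t = np.atleast_1d(t)
    return np.column_stack([np.zeros_like(t), np.ones_like(t), 2 * t, 3 * t**2] +
                           [3 * np.maximum(t - k, 0) ** 2 for k in knots])

B = basis(x)                                              # rows b(x_i)^T, as in (5)
u = np.linspace(0, 1, 50)                                 # grid for the derivative constraint (6)
D = basis_deriv(u)                                        # rows b'(u_l)^T

beta = cp.Variable(B.shape[1])
prob = cp.Problem(cp.Minimize(cp.sum_squares(y - B @ beta) + 0.1 * cp.norm1(beta)),
                  [D @ beta <= 0])                        # non-positive derivative on the grid
prob.solve()
g_hat = basis(u) @ beta.value                             # fitted monotone-decreasing curve
```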

Figure 1 provides an example of this approach applied to the auto loan data considered in Section 5.3. We have used the standard lasso (blue), logistic regression (red) and the classo (black) to estimate demand for loans as a function of interest rate. The logistic curve is constrained in the shape it can model, while the lasso fit is not sensible because it lacks a monotone shape. However, the classo is capable of fitting a flexible, but still monotone, curve.

2.3 Generalized Lasso and Related Methods

Tibshirani and Taylor (2011) introduce the generalized lasso problem:

argmin_θ ||Y − Xθ||^2 + λ||Dθ||_1,   (7)

where D ∈ R^{r×p}. When rank(D) = r, and thus r ≤ p, Tibshirani and Taylor (2011) show that the generalized lasso can be converted to the classical lasso problem. However, if r > p then such a reformulation is not possible. Lemma 1 shows that when r > p and D has full column rank, the generalized lasso is a special case of the classo with equality constraints and b = 0.

Lemma 1. If r > p and rank(D) = p, then there exist matrices A, C and X̃ such that, for all values of λ, the solution to (7) is equal to θ̂ = Aβ̂, where β̂ is given by:

argmin_β ||Y − X̃β||^2 + λ||β||_1   subject to   Cβ = 0.   (8)

The proof of Lemma 1 is provided in Appendix A. Hence, any problem that falls into the generalized lasso paradigm can be solved as a classo problem. Tibshirani and Taylor (2011) demonstrate that a variety of common statistical methods can be formulated as special cases of the generalized lasso. One example is the 1d fused lasso (Tibshirani et al., 2005), which is defined as the solution to

argmin_β ||Y − Xβ||^2 + λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=2}^p |β_j − β_{j−1}|.

The fused lasso encourages blocks of adjacent estimated coefficients to all have the same value. This type of structure often makes sense in situations where there is a natural ordering in the coefficients.
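As a small concrete example of the generalized lasso formulation, the snippet below builds the penalty matrix D for the 1d fused lasso by stacking identity rows (the sparsity penalty) on first-difference rows (the fusion penalty). This gives r = 2p − 1 > p rows of full column rank, which is exactly the regime covered by Lemma 1; the relative weights λ_1 and λ_2 can be absorbed by rescaling the two row blocks. The dimension p is arbitrary.

```python
import numpy as np

p = 6
D_sparsity = np.eye(p)                    # rows producing lambda_1 * sum_j |beta_j|
D_fusion = np.diff(np.eye(p), axis=0)     # rows producing lambda_2 * sum_j |beta_j - beta_{j-1}|
D = np.vstack([D_sparsity, D_fusion])     # D is r x p with r = 2p - 1 > p

print(D.shape)                            # (11, 6)
print(np.linalg.matrix_rank(D))           # 6, i.e. full column rank, so Lemma 1 applies
```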

If the data instead have a two-dimensional ordering, such as in image reconstruction, this idea can be extended to the 2d fused lasso,

argmin_β ||Y − Xβ||^2 + λ_1 Σ_{j,j′} |β_{j,j′}| + λ_2 Σ_{j,j′} |β_{j,j′} − β_{j−1,j′}| + λ_3 Σ_{j,j′} |β_{j,j′} − β_{j,j′−1}|,

where β_{j,j′} is the coefficient at location (j, j′). Other examples of statistical methodologies that fall into the generalized (and hence constrained) lasso setting include polynomial trend filtering, where one penalizes discrete differences to produce smooth piecewise polynomial curves, wavelet smoothing, and the FLiRTI method (James et al., 2009). See Tibshirani and Taylor (2011) for more details on these methods and further examples of the generalized lasso paradigm. She (2010) also considers a criterion similar to (7) and discusses special cases such as the clustered lasso.

3. A Constrained Lasso Fitting Algorithm

To develop our algorithm for fitting the classo we start by defining some notation. For any matrix Z, let Z_A represent the m columns of Z associated with an index set A of size m. Similarly, let Z_Ā correspond to the remaining columns of Z. To reduce notation we assume, without loss of generality, that A corresponds to the first m columns of Z, so Z = [Z_A Z_Ā].

3.1 Equality Constraints

A common approach for dealing with inequality constraints involves augmenting the set of parameters to include slack variables, δ, and hence reparameterizing the problem using equality constraints. We show in Section 3.2 that this augmentation approach also works for the classo, so we begin by considering the equality constrained version of (2):

argmin_β ||Y − Xβ||^2 + λ||β||_1   such that   Cβ = b.   (9)

Unfortunately, the constraint on the coefficients makes it difficult to solve (9) directly. However, Lemma 2 provides an approach for reformulating (9) as an unconstrained optimization problem.

Lemma 2. For a given index set, A, define θ_A by:

θ_A = argmin_θ ||Y* − X*θ||^2 + λ||θ||_1 + λ||C_A^{-1}(b − C_Ā θ)||_1,   (10)

where Y* = Y − X_A C_A^{-1} b and X* = X_Ā − X_A C_A^{-1} C_Ā. Then, for any index set, A, such that C_A is non-singular, the solution to (9) is given by

β̂ = ( C_A^{-1}(b − C_Ā θ_A),  θ_A ).
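A quick numerical sanity check of the reparameterization in Lemma 2 is sketched below: for an index set A with C_A invertible, any value of θ mapped back through β_A = C_A^{-1}(b − C_Ā θ) automatically satisfies Cβ = b. All data and dimensions are hypothetical, and A is taken to be the first m coordinates as in the paper's convention.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 40, 12, 4
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
C = rng.standard_normal((m, p))
b = rng.standard_normal(m)

A, Abar = np.arange(m), np.arange(m, p)               # A = first m coordinates (WLOG)
CA_inv = np.linalg.inv(C[:, A])

Y_star = Y - X[:, A] @ CA_inv @ b                     # transformed response in (10)
X_star = X[:, Abar] - X[:, A] @ CA_inv @ C[:, Abar]   # transformed design in (10)

theta = rng.standard_normal(p - m)                    # any value of theta ...
beta = np.concatenate([CA_inv @ (b - C[:, Abar] @ theta), theta])
print(np.allclose(C @ beta, b))                       # ... yields a beta satisfying C beta = b
```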

Again, to reduce notation, we assume without loss of generality that the elements of β are ordered so that the first m correspond to A. Equation (10) has the advantage that one does not need to incorporate any constraints on the parameters. Hence, one might imagine using a coordinate descent algorithm to compute θ_A. However, for such an algorithm to reach a global optimum, the parameters must be separable in the non-differentiable ℓ1 penalty, i.e. the criterion must be of the form

f(θ_1, ..., θ_{p−m}) = g(θ_1, ..., θ_{p−m}) + Σ_{j=1}^{p−m} h_j(θ_j),   (11)

where g(·) is differentiable and convex, and the h_j(·) are convex. The standard lasso criterion (1) satisfies (11) because ||β||_1 = Σ_{j=1}^p |β_j|. Unfortunately, however, the second ℓ1 term in (10) cannot be separated into additive functions of θ_j. Hence, there does not appear to be any simple approach for directly computing θ_A. Fortunately, we can make use of Lemma 3, which shows that an alternative, more tractable, criterion can be used to compute θ_A.

Lemma 3. For a given index set, A, and vector s, define θ_{A,s} by:

θ_{A,s} = argmin_θ ||Ỹ − X*θ||^2 + λ||θ||_1,   (12)

where Ỹ = Y* + λ X̄ (C_A^{-1} C_Ā)^T s and X̄ is a matrix such that X̄^T X* = I. Then, for any index set, A, it will be the case that θ_A = θ_{A,s} provided

s = sign(θ̄_{A,s}),   (13)

where θ̄_{A,s} = C_A^{-1}(b − C_Ā θ_{A,s}) and sign(a) is a vector of the same dimension as a with ith element equal to 1 or −1 depending on the sign of a_i.

The proofs of Lemmas 2 and 3 are provided in Appendix B. There is a simple intuition behind Lemma 3. The difficulty in computing (10) lies in the non-differentiability (and non-separability) of the second ℓ1 penalty. However, if (13) holds, then

||C_A^{-1}(b − C_Ā θ_{A,s})||_1 = s^T C_A^{-1}(b − C_Ā θ_{A,s})

and we can replace the ℓ1 penalty by a differentiable term which no longer needs to be separable. Of course, the key to this approach is selecting A and s such that (13) holds. Note that θ̄_{A,s} corresponds to the elements of β indexed by A, so our general principle for choosing A and s will be to:

1. Produce β_0, an initial estimate for β.
2. Set A equal to the indices of the m largest elements of β_0.
3. Set s equal to the signs of the corresponding m elements of β_0.

In order for this strategy to work we only require that the m largest elements of β_0 are sign consistent, i.e. they have the same sign as the corresponding elements in the final solution to (9). Corollary 1 shows that Lemmas 2 and 3 provide the solution to the classo.

Corollary 1. By Lemmas 2 and 3, provided A and s are chosen such that C_A is non-singular and (13) holds, the solution to (9) is given by β̂ = (θ̄_{A,s}, θ_{A,s}).

Lemma 3 and Corollary 1 provide an attractive means of computing the classo solution, because (12) is a standard lasso criterion and so can be solved using any one of the many highly efficient lasso optimization approaches, such as the LARS algorithm or coordinate descent. These results suggest Algorithm 1 (below) for solving (9) over a grid of λ. This algorithm relies on the fact that β̂_λ will be continuous in λ. Hence, if we take a small enough step in λ, the signs of the m largest elements of β_k and β_{k−1} will be identical. If this is not the case, then we have taken too large a step, so we reduce the step size and recompute β_k. In practice, unless m is very large or α is set too large, (13) generally holds without needing to decrease the step size. Step 3 is the main computational component of this algorithm, but θ_{A_k,s_k} is easy to compute because (12) is just a standard lasso criterion, so we can use any one of a number of optimization tools.

Algorithm 1: Constrained Lasso with Equality Constraints

1. Initialize β_0 by solving (9) using λ_0 = λ_max.
2. At step k, select A_k and s_k using the m largest elements of β_{k−1} and set λ_k ← 10^{−α} λ_{k−1}, where α > 0 controls the step size.
3. Compute θ_{A_k,s_k} by solving (12). Let θ̄_{A_k,s_k} = C_{A_k}^{-1}(b − C_{Ā_k} θ_{A_k,s_k}).
4. If (13) holds, set β_k = (θ̄_{A_k,s_k}, θ_{A_k,s_k}), k ← k + 1, and return to 2.
5. If (13) does not hold, then one of the m largest elements of β_{k−1} has changed sign. Since the coefficient paths are continuous, this means our step size in λ was too large. Hence, set λ_k ← λ_{k−1} − (λ_{k−1} − λ_k)/2 and return to 3.
6. Iterate until λ_k < λ_min.

The initial solution, β_0, can be computed by noting that, as λ → ∞, the solution to (9) will be

argmin_β ||β||_1   such that   Cβ = b,   (14)

which is a linear programming problem that can be efficiently solved using standard algorithms. We also implement a reversed version of this algorithm, where we first set λ_0 = λ_min, compute β_0 as the solution to a quadratic programming problem, and then increase λ at each step until λ_k > λ_max. We discuss some additional implementation details in Appendix C.
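The following Python skeleton sketches this path strategy under stated assumptions: the initialization solves (14) as a linear program, λ is decreased geometrically, and a failed sign-consistency check (13) triggers step-halving. The inner lasso fit of Step 3 is left as a hypothetical solve_inner_lasso() callback (the construction of the criterion (12) is not reproduced here), and the 10^{−α} schedule follows the reconstruction of Algorithm 1 above.

```python
import numpy as np
from scipy.optimize import linprog

def initialize_beta(C, b):
    """Initial estimate (14): argmin ||beta||_1 subject to C beta = b, as a linear program."""
    m, p = C.shape
    cost = np.ones(2 * p)                              # write beta = u - v with u, v >= 0
    res = linprog(cost, A_eq=np.hstack([C, -C]), b_eq=b, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

def classo_path(X, Y, C, b, solve_inner_lasso, lam_max, lam_min, alpha=0.05):
    m, _ = C.shape
    beta = initialize_beta(C, b)
    path = [(lam_max, beta)]
    lam_prev, lam = lam_max, 10 ** (-alpha) * lam_max
    while lam > lam_min:
        A = np.argsort(-np.abs(beta))[:m]              # indices of the m largest |beta_j|
        s = np.sign(beta[A])
        beta_new = solve_inner_lasso(X, Y, C, b, A, s, lam)   # Step 3 (hypothetical callback)
        if np.array_equal(np.sign(beta_new[A]), s):    # sign condition (13) holds
            beta = beta_new
            path.append((lam, beta))
            lam_prev, lam = lam, 10 ** (-alpha) * lam
        else:                                          # step too large: halve the step in lambda
            lam = lam_prev - 0.5 * (lam_prev - lam)
    return path
```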

Figure 2 provides a comparison of the lasso with the classo on a small simulated data set with n = 50 observations and p = 5 predictors. The left-hand panel plots the 5 lasso coefficient paths as a function of λ, one curve for each coefficient. As λ increases, the lasso coefficients are all shrunk to zero. The center panel plots the same set of coefficients computed using our classo algorithm after incorporating a set of m = 5 equality constraints. The left-hand sides of the two plots, where λ is small, are similar. However, for large values of λ, the lasso and classo coefficient estimates are significantly different from each other because the linear constraints do not allow the coefficients to be set to zero. The right-hand panel illustrates the solution to the classo computed using standard quadratic programming software. The two sets of paths are identical, confirming that our algorithm is computing the correct set of solutions.

Figure 2: Left: Lasso coefficient paths for a simulated data set with n = 50 observations and p = 5 predictors. Center: Constrained lasso coefficient paths computed using Algorithm 1, after incorporating m = 5 linear constraints. Right: Constrained lasso coefficient paths computed using quadratic programming software. In each panel the horizontal axis shows λ on a log scale.

One might ask why Algorithm 1 is required, since Figure 2 demonstrates that the classo path of coefficients can be computed using standard quadratic programming software. Of course, the same question could be raised for the standard lasso, which can also be solved as a quadratic programming problem. The difficulty is that quadratic programming becomes extremely inefficient for large values of p. Figure 3 shows the number of seconds required to fit the classo, as a function of p, on a single simulated data set with n = 50 observations and m = 5 constraints. We computed the classo solution using both Algorithm 1 and quadratic programming on the same set of grid points for λ. The methods are roughly comparable until p = 00, but beyond this point quadratic programming becomes computationally infeasible while Algorithm 1 still performs well.

Figure 3: Time (in seconds) required to fit the classo on a single simulated data set, as a function of p, using Algorithm 1 (black) and quadratic programming (red).

3.2 Inequality Constraints

We now consider the more general optimization problem given by (2). One might imagine that a reasonable approach would be to initialize with β such that

Cβ ≤ b   (15)

and then apply a coordinate descent algorithm while ensuring that (15) is not violated at any update. Unfortunately, this approach typically gets stuck at a constraint boundary point where no improvement is possible by changing a single coordinate. In this setting the criterion can often be improved by moving along the constraint boundary, but such a move requires adjusting multiple coefficients simultaneously, which is not possible using coordinate descent because it only updates one coefficient at a time. Instead, we introduce a set of m slack variables, δ, so that (15) can be equivalently expressed as

Cβ + δ = b,  δ ≥ 0,   or   C̃β̃ = b,  δ ≥ 0,   (16)

where β̃ = (β, δ) is a (p + m)-dimensional vector, C̃ = [C I] and I is the m-dimensional identity matrix.
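A minimal sketch of this augmentation, with hypothetical dimensions: the design gains m zero columns (so the slacks never affect the fit) and the constraint matrix gains an identity block, turning Cβ ≤ b into the equality constraint C̃β̃ = b with δ ≥ 0.

```python
import numpy as np

def augment_for_inequality(X, C):
    """Build X_tilde = [X 0] and C_tilde = [C I] as in (16)-(17)."""
    n, _ = X.shape
    m, _ = C.shape
    X_tilde = np.hstack([X, np.zeros((n, m))])   # slack variables do not enter the fit
    C_tilde = np.hstack([C, np.eye(m)])          # C beta + delta = b
    return X_tilde, C_tilde

# delta = b - C beta recovers equality in (16); delta >= 0 exactly when C beta <= b holds.
rng = np.random.default_rng(4)
X = rng.standard_normal((30, 8)); C = rng.standard_normal((3, 8)); b = rng.standard_normal(3)
beta = np.zeros(8); delta = b - C @ beta
X_t, C_t = augment_for_inequality(X, C)
print(np.allclose(C_t @ np.concatenate([beta, delta]), b))
```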

Let e_δ(a) denote the function that selects the elements of a corresponding to δ, and e_{−δ}(a) the remaining elements; for example, e_δ(β̃) = δ while e_{−δ}(β̃) = β. Then the inclusion of the slack variables in (16) allows us to re-express the classo criterion (2) as

argmin_β̃ ||Y − X̃β̃||^2 + λ||e_{−δ}(β̃)||_1   such that   C̃β̃ = b,  e_δ(β̃) ≥ 0,   (17)

where X̃ = [X 0] and 0 is an n by m matrix of zeros. The criterion in (17) is very similar to the equality constrained lasso (9). The only differences are that the components of β̃ corresponding to δ do not appear in the penalty term and are required to be non-negative. Even with these minor differences, the same basic approach from Section 3.1 can still be adopted for fitting (17). In particular, Lemma 4 provides a set of conditions under which (2) can be solved.

Lemma 4. For a given index set, A, and vector s such that e_δ(s) = 0, define θ_{A,s} by:

θ_{A,s} = argmin_θ ||Ỹ − X*θ||^2 + λ||e_{−δ}(θ)||_1   such that   e_δ(θ) ≥ 0,   (18)

where Ỹ = Y* + λ X̄ (C̃_A^{-1} C̃_Ā)^T s, Y* = Y − X̃_A C̃_A^{-1} b, X* = X̃_Ā − X̃_A C̃_A^{-1} C̃_Ā, and X̄ is a matrix such that X̄^T X* = I. Suppose that

e_{−δ}(s) = sign( e_{−δ}(θ̄_{A,s}) ),   (19)

e_δ(θ̄_{A,s}) ≥ 0,   (20)

where θ̄_{A,s} = C̃_A^{-1}(b − C̃_Ā θ_{A,s}). Then the solution to the classo criterion (2) is given by β̂ = (θ̄_{A,s}, θ_{A,s}).

The proof of this result is similar to that for Lemma 3, so we omit it here. Lemma 4 shows that, provided an appropriate A and s are chosen, the solution to the classo can still be computed by solving a lasso-type criterion, (18). However, we must now ensure that both (19) and (20) hold. Condition (19) is equivalent to (13) in the equality constraint setting, while (20), along with the constraint in (18), ensures that δ ≥ 0. We use the same strategy as in Section 3.1. First, obtain an initial coefficient estimate, β̃_0. Next, select A corresponding to the m largest elements of β̃_0, say β̃_A. Finally, the m-dimensional s vector is chosen by fixing e_{−δ}(s) = sign(e_{−δ}(β̃_A)) and setting the remaining elements of s to zero, i.e. e_δ(s) = 0.

β̃_A is our initial guess for θ̄_{A,s}, so as long as β̃_A is sign consistent for θ̄_{A,s} then (19) will hold. Similarly, by only including the largest current values of δ in A, for a small enough step in λ none of these elements will become negative and (20) will hold. Hence, Algorithm 2 can be used to fit the inequality constrained classo.

Algorithm 2: Constrained Lasso with Inequality Constraints

1. Initialize β̃_0 by solving (17) using λ_0 = λ_max.
2. At step k, select A_k and s_k using the m largest elements of β̃_{k−1} and set λ_k ← 10^{−α} λ_{k−1}.
3. Compute θ_{A_k,s_k} by solving (18). Let θ̄_{A_k,s_k} = C̃_{A_k}^{-1}(b − C̃_{Ā_k} θ_{A_k,s_k}).
4. If (19) and (20) hold, set β̃_k = (θ̄_{A_k,s_k}, θ_{A_k,s_k}), k ← k + 1, and return to 2.
5. If (19) or (20) does not hold, then the step size must be too large. Hence, set λ_k ← λ_{k−1} − (λ_{k−1} − λ_k)/2 and return to 3.
6. Iterate until λ_k < λ_min.

Notice that Algorithm 2 involves only slight changes from Algorithm 1. In particular, solving (18) in Step 3 poses little additional complication over fitting the standard lasso. The only differences are that the elements of θ that correspond to δ, i.e. e_δ(θ), have zero penalty and must be non-negative. However, these changes are simple to incorporate into a coordinate descent algorithm. For any θ_j that is an element of e_δ(θ), we first compute the unshrunk least squares estimate, θ̂_j, and then set

θ_j = [θ̂_j]_+.   (21)

It is not hard to show that (21) enforces the non-negativity constraint on δ while also ensuring that no penalty term is applied to the slack variables. The update step for the original β coefficients, e_{−δ}(θ) (those that do not involve δ), is identical to that for the standard lasso.
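The sketch below spells out the two coordinate updates this implies, under stated assumptions: theta_hat_j stands for the unshrunk single-coordinate least squares solution (its computation from the partial residual is omitted) and lam is the correspondingly scaled threshold, so this is a schematic of the update rule rather than a full coordinate descent implementation.

```python
import numpy as np

def soft_threshold(z, lam):
    """Standard lasso soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def update_coordinate(theta_hat_j, lam, is_slack):
    """One coordinate update for (18): slack coordinates are unpenalized but clipped
    at zero as in (21); original coefficients receive the usual lasso shrinkage."""
    if is_slack:
        return max(theta_hat_j, 0.0)          # equation (21): [theta_hat_j]_+
    return soft_threshold(theta_hat_j, lam)
```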

The initial solution, β̃_0, can still be computed by solving a standard linear programming problem, now subject to inequality constraints.

4. Statistical Properties: Error Bounds

In this section we show that, assuming the true coefficients satisfy a given set of constraints, one can obtain sharper error bounds for the classo estimate relative to the traditional unconstrained lasso estimate. To describe the results, let us introduce the statistical model for our problem. We assume that Y ∈ R^n is generated from

Y = Xβ* + ε,

where X ∈ R^{n×p}, β* ∈ R^p, and ε = (ε_1, ε_2, ..., ε_n) are i.i.d. mean-zero random variables. Let β̂(λ) denote a solution of the classo problem defined by:

β̂(λ) = argmin_β L(β; Y) + λ||β||_1   subject to   Cβ = b,   (22)

where L(β; Y) = ||Y − Xβ||^2 / (2n). If the solution to (22) is not unique, we can take β̂(λ) to be any optimal solution; our error bounds hold for all such solutions. We make the following two assumptions about the underlying parameter β*.

Assumption 1 (Sparsity). Let supp(β*) = {j : β*_j ≠ 0}. The parameter β* is s-sparse, with |supp(β*)| = s.

Assumption 2 (Constraint Matrix). There exists a matrix C ∈ R^{m×p} with rank(C) = m ≤ s such that Cβ* = b. Moreover, the matrix C_A is non-singular for some A ⊆ supp(β*) with |A| = m.

By Assumption 2 we can decompose β* = [β*_A, β*_Ā]. Let S = Ā ∩ supp(β*) denote the support of the vector β*_Ā. Note that Assumptions 1 and 2 imply that |S| = s − m. For any vector u, let Π_S(u) and Π_S̄(u) be defined by:

[Π_S(u)]_j = u_j if j ∈ S and 0 otherwise,   [Π_S̄(u)]_j = u_j if j ∉ S and 0 otherwise,

so that u = Π_S(u) + Π_S̄(u). Finally, for any c > 1, define a cone 𝒞 ⊆ R^p consisting of all Δ ∈ R^p such that CΔ = 0 and ||Π_S̄(Δ_Ā)||_1 is bounded by a constant multiple, depending only on c, of ||Π_S(Δ_Ā)||_1 + ||Δ_A||_1.

It is easy to verify that 𝒞 is a cone, because if x ∈ 𝒞 then tx ∈ 𝒞 for all t ≥ 0. Negahban et al. (2010) show that, provided Assumption 1 holds and λ ≥ 2||∇L(β*; Y)||_∞, the traditional unconstrained lasso estimate β̂_L(λ) satisfies error bounds of the form

||β̂_L(λ) − β*||_1 ≤ K_1 s λ / κ_L   and   ||β̂_L(λ) − β*||_2 ≤ K_2 √s λ / κ_L,

where κ_L is a constant which we define shortly and K_1, K_2 are universal constants. Hence, the ℓ2 error of β̂_L(λ) is of order O(√s). In Theorem 1 we show that, provided m is not too large, the error in the classo estimate is instead of order O(√(s − m)).

Theorem 1 (An Error Bound for the Classo). Under Assumptions 1 and 2, suppose λ ≥ c||∇L(β*; Y)||_∞ for some c > 1, and the loss function L(·; Y) satisfies restricted strong convexity on 𝒞 with parameter κ_L > 0; that is, for all Δ ∈ 𝒞,

L(β* + Δ; Y) − L(β*; Y) − ∇L(β*; Y)^T Δ ≥ κ_L ||Δ||_2^2.

Then

||β̂(λ) − β*||_1 ≤ K_1(c) max{s − m, m} λ / κ_L   and   ||β̂(λ) − β*||_2 ≤ K_2(c) √(max{s − m, m}) λ / κ_L,

where K_1(c) and K_2(c) are constants that depend only on c.

Note that max{s − m, m} < s for m < s. Moreover, setting c = 5 in Theorem 1 gives constants that are no larger than the constants in the standard lasso error bound, and by using different values of c we can also extend the admissible range of λ. For m ≤ s/2, our ℓ2 bound is of order O(√(s − m)). This rate follows from the fact that Assumption 2 implies

β*_A = C_A^{-1}(b − C_Ā β*_Ā).   (23)

Hence, the m coordinates of β*_A are completely determined by the remaining p − m coordinates, β*_Ā. Since |supp(β*_Ā)| = s − m, the vector β*_Ā is (s − m)-sparse, which can be interpreted as the adjusted sparsity level. Why then do the bounds in Theorem 1 also depend on m?

In fact, this term reflects the error due to model selection. To see this, note that when m = s, it follows from Assumptions 1 and 2 and Equation (23) that we can exactly recover β*, but only if we know the locations, A, of the non-zero entries. There are (p choose s) possible locations for the non-zero entries. Intuitively, this corresponds to O(p^s) possible hypotheses, and information-theoretic arguments show that, even if we have m = s constraints, we still need n to be of order log (p choose s) ≍ s log p to identify the correct hypothesis, and hence to ensure that our error ||β̂(λ) − β*|| is small.

This observation is formalized in the following corollary, which provides a high-probability error bound for the classo estimate. The proof follows from Theorem 1.

Corollary 2 (High Probability Error Bound). Under Assumptions 1 and 2, suppose the matrix X satisfies the restricted eigenvalue condition on 𝒞, namely ||Xu||_2^2 / (2n) ≥ κ_L ||u||_2^2 for all u ∈ 𝒞. If λ is chosen of order σ√(log p / n) and ε = (ε_1, ε_2, ..., ε_n) are i.i.d. mean-zero sub-Gaussian random variables with variance σ^2, then, with probability at least 1 − 1/p,

||β̂(λ) − β*||_2 ≤ √(max{s − m, m}) · (8σ/κ_L) · √(log p / n).   (24)

By comparison, Negahban et al. (2010) show that, for λ = 4σ√(log p / n), the standard unconstrained lasso estimate β̂_L(λ) satisfies the error bound

||β̂_L(λ) − β*||_2 ≤ √s · (8σ/κ_L) · √(log p / n),

with probability at least 1 − 1/p. Hence, the classo and lasso bounds differ only in the first term, with the classo bound strictly lower than that of the lasso for m < s. It is worth noting that the constant in (24) can be lowered further by choosing a slightly smaller value for λ, but we have not done so because our goal here is to highlight the difference between the classo and lasso in the dependence on s. A bound of the same form as (24) can also be produced for the ℓ1 error, although a somewhat larger value of λ is optimal in that case.

We note that the bounds in Theorem 1 depend only on s and m, and not on the constraint matrix C. It is possible to develop an error bound that depends on C, in which case the bound scales with √(s − m + m ρ_max(C_A^{-1} C_Ā)), where ρ_max(C_A^{-1} C_Ā) is the largest singular value of C_A^{-1} C_Ā. This error bound is useful, for example, when the magnitudes of the elements of C_Ā are much smaller than those of C_A. For example, in the extreme case where the constraints

only involve the m largest coefficients, then C_Ā = 0 and ρ_max = 0, so the bound scales with √(s − m). The proof of this bound is similar to that of Theorem 1, so we omit the details.

In this section we have concentrated on results for the equality constrained classo. Results in the inequality setting are more complicated. Intuitively, one can see that if β* lies well inside the region Cβ ≤ b, then the classo and lasso should give similar answers, because the constraints will play little role in the fit. However, if β* lies close to the constraint boundary, then the classo should offer improvements relative to the lasso. As shown in Section 3, one can reformulate the inequality classo using equality constraints with the addition of a set of m slack variables, δ. If β* lies close to the boundary, then these slack variables will be close to, but not exactly, zero. Hence, to develop bounds for the classo in this setting one must adapt Assumption 1 to one where the augmented set of coefficients is not necessarily sparse, but can be well approximated by a sparse vector. Negahban et al. (2010) provide bounds for the standard lasso in such an approximately sparse setting, but extending them to the classo is beyond the scope of this paper.

5. Empirical Results

5.1 Equality Constraints

In this section we compare the performance of the classo to the regular lasso (Tibshirani, 1996) in nine different simulation settings. For each setting we randomly generated a training data set, fit each method to the data, and computed the root mean squared error

RMSE = sqrt( (1/N) Σ_{i=1}^N ( Ŷ_i − E(Y_i | X_i) )^2 )

over a separate test set of N = 0000 observations. This process was repeated 00 times for each of the nine settings. The training data sets were produced using a random design matrix generated from a normal distribution. Three different correlation structures were investigated: ρ_ij = 0, ρ_ij = 0.5, and ρ_ij = 0.5^|i−j| (where ρ_ij is the correlation between the ith and jth variables). In all cases the m by p constraint matrix C and the constraint vector b were randomly generated from a standard normal distribution. The true coefficient vector β* was produced by first generating β*_Ā using 5 non-zero random normal components and p − m − 5 zero entries, and then computing β*_A from (23). Note that this process resulted in β* being (m + 5)-sparse and ensured that the constraints held for the true coefficient vector.
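The snippet below sketches this data-generating process for a single training set. The particular values of n, p, m, the number of non-zero coefficients, and the choice of the first m coordinates as A are illustrative assumptions rather than the exact settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, m, n_nonzero = 100, 50, 5, 5        # illustrative values only
rho = 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # rho_ij = 0.5^|i-j|
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

C = rng.standard_normal((m, p))           # random constraint matrix and vector
b = rng.standard_normal(m)

beta_Abar = np.zeros(p - m)               # sparse free block ...
beta_Abar[:n_nonzero] = rng.standard_normal(n_nonzero)
beta_A = np.linalg.solve(C[:, :m], b - C[:, m:] @ beta_Abar)   # ... and beta_A from (23)
beta_star = np.concatenate([beta_A, beta_Abar])

Y = X @ beta_star + rng.standard_normal(n)
print(np.allclose(C @ beta_star, b))      # constraints hold exactly for the true coefficients
```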

For each set of simulations, the optimal value of λ was chosen by minimizing RMSE on a separate validation set, which was independently generated using the same parameters as the given simulation setting. We explored three combinations of n, p and m, and for each setting tested all three correlation structures. The test RMSE values for the nine resulting settings are displayed in Table 1.

ρ Lasso Classo Relaxed Lasso Relaxed Classo
n = (0.0) 0.50(0.0) 0.43(0.0) 0.9(0.0)
p = (0.0) 0.67(0.0) 0.66(0.0) 0.53(0.0)
m = i j 0.59(0.0) 0.46(0.0) 0.50(0.0) 0.3(0.0)
n = (0.08) 3.8(0.09) 3.9(0.09) 3.07(0.0)
p = (0.07) 3.78(0.08) 3.68(0.08) 3.68(0.09)
m = i j.6(0.08).39(0.09).50(0.09).3(0.0)
n = (0.07).5(0.03) 6.84(0.08) 0.94(0.03)
p = (0.07).67(0.04) 6.98(0.08).38(0.04)
m = i j 6.46(0.06).30(0.03) 6.6(0.07).0(0.03)

Table 1: Average RMSE over 00 training data sets, for the four lasso methods, in three different simulation settings and three different correlation structures. The numbers in parentheses are standard errors.

We compared four methods: the standard lasso, our classo, the relaxed lasso (Meinshausen, 2007), and the relaxed classo. The latter two methods use a two-step approach in an attempt to reduce the over-shrinkage problem commonly exhibited by the lasso. In the first step the lasso or classo is used to select a candidate set of predictors. In the second step the final model is produced using an unshrunk ordinary least squares fit on the variables selected in the first step. This approach is fairly simple for the relaxed lasso. However, a constrained least squares fit must be implemented for the relaxed classo in order to ensure that the constraints are still satisfied.

The first set of three results corresponds to a setting with n = 00, p = 50 and m = 5. This is a relatively low dimensional problem with only a small number of constraints, m = 5. Hence, one might imagine that the classo would have a comparatively small advantage over the lasso. However, even with a low value for m the classo shows highly statistically

significant improvements over the lasso. Both relaxed methods display lower error rates than their unrelaxed counterparts, but the relaxed classo has a large advantage over the relaxed lasso. The higher correlations in the ρ = 0.5 design structure moderately increase the error rates for all methods but do not change the relative rankings of the four approaches.

The second set of results represents a very high dimensional setting with n = 50, p = 500 and m = 0. In this situation the number of constraints is very low relative to the number of predictors, so the classo generally has only a small advantage over the lasso, although the relaxed classo still appears to offer some real improvements in accuracy. Our main purpose in examining this setting was to demonstrate that the classo algorithm is still efficient enough to optimize the constrained criterion even for very high dimensional data.

The final setting examined data with n = 50, p = 00 and a very large number of constraints, m = 60. This setting is statistically favorable for the classo because it has the potential to produce significantly more accurate regression coefficients by correctly incorporating the large number of constraints. However, this is also a computationally difficult setting for the classo because the large value of m causes the coefficient paths to be highly variable. Nevertheless, the large improvements in accuracy for both the classo and relaxed classo in Table 1 demonstrate that our algorithm is quite capable of dealing with this added complexity.

Figure 4 provides a better feeling for the general improvement in accuracy from incorporating the constraints into our fit. Here we display boxplots of the RMSE, for each of our four comparison methods, over the 00 training data sets on two of the nine simulation settings. The improvement ranges from moderate in the first setting, where m is low, to dramatic in the second setting, where m is much higher.

Figure 4: Boxplots of the 00 simulation RMSE values for all four lasso methods using the ρ_ij = 0.5^|i−j| correlation structure. Left: the n = 00, p = 50, m = 5 setting. Right: the n = 50, p = 00, m = 60 setting.

5.2 Violations of Constraints

The results presented in the previous section all correspond to an ideal situation where the true regression coefficients exactly match the equality constraints. We also investigated the sensitivity of the classo to deviations of the regression coefficients from the assumed constraints. In particular, we generated the true regression coefficients according to

Cβ* = (1 + u) ∘ b,   (25)

where u = (u_1, ..., u_m), u_l ∼ Unif(0, a) for l = 1, ..., m, and the product in (25) is taken elementwise.
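A short sketch of this perturbation follows, with hypothetical dimensions and a set to the smallest level tested: β* is built to satisfy the perturbed constraint (25), while the classo would still be fit with the unperturbed pair (C, b).

```python
import numpy as np

rng = np.random.default_rng(6)
m, p, a = 5, 50, 0.25                      # a is the perturbation level
C = rng.standard_normal((m, p))
b = rng.standard_normal(m)

u = rng.uniform(0.0, a, size=m)
b_perturbed = (1.0 + u) * b                # elementwise product in (25)

beta_Abar = np.zeros(p - m)
beta_Abar[:5] = rng.standard_normal(5)
beta_A = np.linalg.solve(C[:, :m], b_perturbed - C[:, m:] @ beta_Abar)
beta_star = np.concatenate([beta_A, beta_Abar])

print(np.allclose(C @ beta_star, b_perturbed))   # beta* satisfies the perturbed constraints
print(np.allclose(C @ beta_star, b))             # ... but not the constraints used in the fit
```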

The classo and relaxed classo were then fit using the usual constraint, Cβ = b. Table 2 reports the new RMSE values for the three settings considered in Section 5.1, under the ρ = 0 correlation structure. We tested four values for a: 0.25, 0.50, 0.75 and 1.00. The largest value of a corresponds to a 50% average error in the constraint. The results suggest that the classo and relaxed classo are surprisingly robust to random violations of the constraints. While both methods deteriorated slightly as a increased, they were still superior to their unconstrained counterparts for all values of a and all settings.

a Lasso Classo Relaxed Lasso Relaxed Classo
n = (0.0) 0.5(0.0) 0.43(0.0) 0.30(0.0)
p = (0.0) 0.5(0.0) 0.4(0.0) 0.33(0.0)
m = (0.0) 0.5(0.0) 0.4(0.0) 0.35(0.0)
(0.0) 0.53(0.0) 0.4(0.0) 0.38(0.0)
n = (0.08) 3.(0.09) 3.4(0.09).98(0.0)
p = (0.08) 3.(0.09) 3.3(0.09).99(0.0)
m = (0.08) 3.8(0.09) 3.6(0.09) 3.07(0.0)
(0.08) 3.8(0.09) 3.3(0.09) 3.06(0.0)
n = (0.07).7(0.03) 6.80(0.08) 0.95(0.03)
p = (0.07).8(0.03) 6.8(0.08) 0.99(0.03)
m = (0.07).(0.03) 6.79(0.08).03(0.03)
(0.07).7(0.04) 6.86(0.08).08(0.04)

Table 2: Average RMSE over 00 training data sets in the three simulation settings, using the ρ = 0 correlation structure. The numbers in parentheses are standard errors. The true regression coefficients were generated according to (25).

5.3 On-line Auto Loan Data

We now consider an application of the classo to estimating a non-parametric demand function on the On-line Auto Lending data set, courtesy of The Center for Pricing and Revenue Management of the Columbia Business School. As the name suggests, the data contains observations on all United States auto loans approved by an online lender. Each record corresponds to a specific customer who was approved for a loan and then either chose to accept or decline the offer. The data also includes characteristics of the customer and the offered loan, such as FICO score and loan duration. In particular, the data is segmented by three variables: credit tier (Tier I-VII, with Tier I being the highest), length of loan (0-36 months, 37-48

months, 49-60 months, or over 60 months), and type of car loan (new, used, or refinanced). As with previous analyses of the auto lending data set (Besbes et al., 2010), the analysis here is performed on a segment of the overall data: Tier I loans for new cars with a duration of 0 to 36 months. This subset contained observations on 8698 customers, which we randomly divided into two-thirds training data and one-third validation data. We used the offered rate as the predictor and the loan's funding status, 1 if the customer accepted the loan and 0 otherwise, as the response. A cubic spline basis, b(t), was used to model the demand curve as a non-linear function of the interest rate. To ensure that the demand function was monotonically decreasing in the interest rate, we computed the first derivative of the spline over a fine grid of m interest rates between the minimum and the maximum observed rate. We then fit the classo over the training data, with the ith row of X derived from b(x_i), where x_i is the ith interest rate, and the first derivative constrained to be non-positive at each grid point, i.e. b′(u_l)^T β ≤ 0 for l = 1, ..., m. We compared the classo fit to an unconstrained lasso with the same spline design matrix. The degrees of freedom for b(t) and the values of λ for the lasso and classo were both chosen by minimizing RMSE on the validation data.

To provide a comparison with a traditional parametric approach, we also fit a logistic regression curve to the data. Figure 1 plots the three resulting demand functions, with the classo in black, the traditional logistic fit in red, and the standard lasso in blue. Jittered versions of the test data are also provided for context. As expected, the logistic curve provides a smooth, monotone decreasing fit to the data. However, the constrained parametric form of the logistic means that it is not capable of simultaneously modeling the high acceptance rate near .5% and the much lower acceptance rate near 5%. By comparison, the lasso fit using the spline basis is far more flexible, modeling both the high acceptance rate near .5% and the much lower acceptance rate near 5%. However, because the curve is unconstrained, it has an unrealistic shape which suggests that demand is increasing and decreasing in wild oscillations. Of course, if the validation data had suggested a smaller degrees of freedom, the lasso would produce a smoother curve, but even in that case the resulting fit will generally not be monotone. By comparison, the classo fit combines the flexibility of the lasso with the intuitively reasonable monotone decreasing shape of the logistic curve. The classo suggests a sharp decline in demand between interest rates of .5% and 3.5%, followed by a relatively constant demand between 3.5% and 4.0%, and then a further decline for rates above 4%. It is interesting that both 3.5% and 4.0% seem to represent distinct change points in demand, possibly corresponding to psychological effects from mentally rounding to the nearest half percent.

5.4 Demand Function Simulation

We ran a simulation study to test the accuracy of the classo in a setting such as the on-line auto loan demand estimation problem. To mimic the real-life setting, we generated Bernoulli responses with

P(Y_i = 1 | X_i) = (1 − γ) L(X_i) + γ CL(X_i),   0 ≤ γ ≤ 1,   (26)

where L(X) and CL(X) respectively correspond to the logistic and classo fits to the auto loan data. With γ = 0 the data was generated from a smooth underlying demand function, while for larger values of γ the true demand function had more flexibility. We tested γ = 0, 1/3, 2/3 and 1, and for each value generated 00 data sets according to (26). The simulated interest rates were generated along an equally spaced grid from .5%

24 γ Lasso Logit Classo 0.64(0.08) 0.45(0.04).9(0.07) /3.68(0.09) 0.87(0.04).(0.08) /3.03(0.0).00(0.05).50(0.08).4(0.0) 3.8(0.07).67(0.0) Table 3: A comparison of the prediction accuracies of the lasso, logistic, and classo methods. The values are the mean integrated squared differences over 00 simulation runs (multiplied by a factor of 0 3 for comparison purposes, as are the standard errors in parentheses). to 5.0% in step sizes of 0.00% for a total of 5 observations. Each data set was then randomly divided into one-half training data, one-quarter validation data, and one-quarter testing data. The classo, lasso and logistic regression were then fit in the same fashion as in Section 5.3. In particular the classo and lasso were each fit over a range of degrees of freedom (df) and λ values, then the combination of df and λ which yielded the minimum SSE = i (Y i Ŷi) value on the validation set was chosen as the optimal model for a particular simulation. That model was then used to fit the testing data. Table 3 provides a comparison of the prediction accuracies over the testing data for the three methods on each of the four choices of γ. The values in the table are the estimated mean integrated squared differences (ISD) ( P (Y = X) Ŷ (x) ) dx computed by averaging the test observations and the 00 simulation runs, where Ŷ (x) is the predicted demand function at x for a given method. As one would expect, the logistic regression provides superior results for low values of γ. However, as the true demand function becomes more flexible the classo starts to dominate. At γ = the classo has an ISD that is less than half that of the logistic regression fit. While the lasso also improves relative to the logistic fit for larger values of γ, it is clearly inferior to the classo in all settings. 6. Conclusion In this paper we have illustrated a few of the wide range of statistical applications for the classo and developed computationally efficient path algorithms for computing its solutions. 4

25 Our theoretical results demonstrate that, for correctly chosen constraints, the classo solution provides the potential for improvements in prediction accuracy. Furthermore, our simulation results show that the classo generally outperforms the lasso, not only when the constraints hold exactly, but also when there is some reasonable error in the assumption on the constraints. The relaxed version of classo appears to offer even greater improvements. We have concentrated here on the lasso optimization problem. However, we believe that a similar approach could be adopted to fit more general optimization problems of the form: arg min β Y Xβ + λ p ρ(β j ) subject to Cβ = b, (7) j= where ρ is a general penalty function, such as the SCAD penalty (Fan and Li, 00). A common approach for fitting the unconstrained version of (7) is to iteratively approximate ρ using a first order Taylor series, and then perform the lasso on the new criterion. A similar approach could be used for solving (7). In this setting one could again incorporate the constraint into the optimization function, use a prior estimate of the coefficients to select the terms we believe will be non-zero in the final solution, and incorporate these terms into the differentiable part of the criterion. Such an approach suggests a promising area of future research. A. Proof of Lemma Since D has full column rank, by reordering the rows if necessary, we can write D as D = D D where D R p p is an invertible matrix and D R r p p. Then, Y Xθ + λ Dθ = Y XD D θ = Y ( XD )D θ + λ D θ + λ D θ + λ D θ + λ D D D θ Using the change of variables β = D θ, β = D D D θ = D D β, and β = β β, 5

26 we can rewrite the generalized lasso problem as follows: min Y θ R p Xθ { + λ Dθ = min Y β R r where X = [ ] XD 0 = min β R r XD β { Y Xβ + λ β } + λ β D D β β = 0 } Cβ = 0 and C = [ D D I ]. Note that θ = [D 0] thus, the generalized lasso is a special case of the constrained lasso. B. Proofs of Lemmas and 3, β β, and Consider any index set A such that C A is non-singular. The constraint Cβ = b can be written as C A β A + C Ā β Ā = b β A = C A (b CĀ β Ā ), and thus, we can determine β A from β Ā. Then, for any β such that Cβ = b, Y Xβ + λ β = Y XĀ β Ā X A β A + λ βā + λ β A = = Y XĀ β Ā X A C A (b CĀ β Ā ) + λ βā + λ C A (b CĀ β Ā ) [ Y X A C A b] ( X Ā X A C A CĀ ) βā + λ βā + λ C A (b CĀ β Ā ). By using the change of variable θ = β Ā, the classo problem is equivalent to the following unconstrained optimization problem: min θ Y X θ + λ θ + λ C A (b CĀ θ), and let θ A denote a solution to the above optimization problem. Then, a solution to the original classo problem is given β = and this completes Lemma. β A β Ā = C A (b CĀ θ A ) θ A,, 6

27 To prove Lemma 3, consider an arbitrary θ A,s and s such that s = sign ( C A (b CĀ θ A,s ) ). Let F : R p m R + denote the objective function of the optimization problem in Equation 0; that is, for each θ, F (θ) = Y X θ + λ θ + λ C A (b CĀ θ). By definition of θ A, we have F (θ A ) F (θ A,s ). To complete the proof, it suffices to show that F (θ A,s ) F (θ A ). Suppose, on the contrary, that F (θ A,s ) > F (θ A ). For each α [0, ], let θ α R p m and g(α) R + be defined by: θ α ( α)θ A,s + αθ A and g(α) F (θ α ) = F (( α)θ A,s + αθ A ). Note that g is a convex function on [0, ] because F ( ) is convex. Moreover, we have g(0) = F (θ A,s ) > F (θ A ) = g(). Thus, for all 0 < α, g(α) = g(α + ( α) 0) αg() + ( α)g(0) < g(0). By our hypothesis, s i = for all i, and thus, every coordinate of the vector C A (b CĀ θ A,s ) is bounded away from zero. So, we can choose α 0 sufficiently small so that Then, it follows that sign ( C A (b CĀ θ α0 ) ) = sign ( C A (b CĀ θ A,s ) ). F (θ α0 ) = Y X θ α0 + λ θ α 0 + λs T C A (b CĀ θ α0 ) = (Y ) T Y = (Y ) T Y (Y ) T X θ α0 λs T C A CĀ θ α0 + (X θ α0 ) T X θ α0 (Y ) T X θ α0 ( λx ( C A CĀ ) T s ) T X θ α0 + (X θ α0 ) T X θ α0 + λ θ α0 + λs T C A b = Ỹ X θ α0 + λ θ α0 + d + λ θ α0 + λs T C A b where the last equality follows from Ỹ = Y + λx ( C A CĀ ) T s, and d is defined by d = λ(y ) T X ( C A CĀ ) T s [ λx ( C A CĀ ) T s ] T [ λx ( C A CĀ ) T s ] Also, the third equality follows from the fact that (X ) T X = I. 7 + λs T C A b.

Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising

Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising Gareth M. James, Courtney Paulson, and Paat Rusmevichientong Gareth M. James is the E. Morgan Stanley Chair

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

The Double Dantzig. Some key words: Dantzig Selector; Double Dantzig; Generalized Linear Models; Lasso; Variable Selection.

The Double Dantzig. Some key words: Dantzig Selector; Double Dantzig; Generalized Linear Models; Lasso; Variable Selection. The Double Dantzig GARETH M. JAMES AND PETER RADCHENKO Abstract The Dantzig selector (Candes and Tao, 2007) is a new approach that has been proposed for performing variable selection and model fitting

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

By Gareth M. James, Jing Wang and Ji Zhu University of Southern California, University of Michigan and University of Michigan

By Gareth M. James, Jing Wang and Ji Zhu University of Southern California, University of Michigan and University of Michigan The Annals of Statistics 2009, Vol. 37, No. 5A, 2083 2108 DOI: 10.1214/08-AOS641 c Institute of Mathematical Statistics, 2009 FUNCTIONAL LINEAR REGRESSION THAT S INTERPRETABLE 1 arxiv:0908.2918v1 [math.st]

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Recap from previous lecture

Recap from previous lecture Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Sparse Gaussian conditional random fields

Sparse Gaussian conditional random fields Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Variable selection using Adaptive Non-linear Interaction Structures in High dimensions

Variable selection using Adaptive Non-linear Interaction Structures in High dimensions Variable selection using Adaptive Non-linear Interaction Structures in High dimensions Peter Radchenko and Gareth M. James Abstract Numerous penalization based methods have been proposed for fitting a

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

In Search of Desirable Compounds

In Search of Desirable Compounds In Search of Desirable Compounds Adrijo Chakraborty University of Georgia Email: adrijoc@uga.edu Abhyuday Mandal University of Georgia Email: amandal@stat.uga.edu Kjell Johnson Arbor Analytics, LLC Email:

More information

Optimization Problems

Optimization Problems Optimization Problems The goal in an optimization problem is to find the point at which the minimum (or maximum) of a real, scalar function f occurs and, usually, to find the value of the function at that

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Regularization and Variable Selection via the Elastic Net

Regularization and Variable Selection via the Elastic Net p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

High dimensional thresholded regression and shrinkage effect

High dimensional thresholded regression and shrinkage effect J. R. Statist. Soc. B (014) 76, Part 3, pp. 67 649 High dimensional thresholded regression and shrinkage effect Zemin Zheng, Yingying Fan and Jinchi Lv University of Southern California, Los Angeles, USA

More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2013 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche) 1 Graph

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

A Fast Algorithm for Computing High-dimensional Risk Parity Portfolios

A Fast Algorithm for Computing High-dimensional Risk Parity Portfolios A Fast Algorithm for Computing High-dimensional Risk Parity Portfolios Théophile Griveau-Billion Quantitative Research Lyxor Asset Management, Paris theophile.griveau-billion@lyxor.com Jean-Charles Richard

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Chapter 3. Linear Models for Regression

Chapter 3. Linear Models for Regression Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear

More information

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club 36-825 1 Introduction Jisu Kim and Veeranjaneyulu Sadhanala In this report

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Outlier detection and variable selection via difference based regression model and penalized regression

Outlier detection and variable selection via difference based regression model and penalized regression Journal of the Korean Data & Information Science Society 2018, 29(3), 815 825 http://dx.doi.org/10.7465/jkdi.2018.29.3.815 한국데이터정보과학회지 Outlier detection and variable selection via difference based regression

More information

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

EE 381V: Large Scale Optimization Fall Lecture 24 April 11 EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Machine Learning. Regression basics. Marc Toussaint University of Stuttgart Summer 2015

Machine Learning. Regression basics. Marc Toussaint University of Stuttgart Summer 2015 Machine Learning Regression basics Linear regression, non-linear features (polynomial, RBFs, piece-wise), regularization, cross validation, Ridge/Lasso, kernel trick Marc Toussaint University of Stuttgart

More information

Robust Variable Selection Through MAVE

Robust Variable Selection Through MAVE Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse

More information

DEGREES OF FREEDOM AND MODEL SEARCH

DEGREES OF FREEDOM AND MODEL SEARCH Statistica Sinica 25 (2015), 1265-1296 doi:http://dx.doi.org/10.5705/ss.2014.147 DEGREES OF FREEDOM AND MODEL SEARCH Ryan J. Tibshirani Carnegie Mellon University Abstract: Degrees of freedom is a fundamental

More information

Sparse PCA with applications in finance

Sparse PCA with applications in finance Sparse PCA with applications in finance A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon 1 Introduction

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

arxiv: v2 [math.st] 2 Jul 2017

arxiv: v2 [math.st] 2 Jul 2017 A Relaxed Approach to Estimating Large Portfolios Mehmet Caner Esra Ulasan Laurent Callot A.Özlem Önder July 4, 2017 arxiv:1611.07347v2 [math.st] 2 Jul 2017 Abstract This paper considers three aspects

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space

Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Jinchi Lv Data Sciences and Operations Department Marshall School of Business University of Southern California http://bcf.usc.edu/

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Within Group Variable Selection through the Exclusive Lasso

Within Group Variable Selection through the Exclusive Lasso Within Group Variable Selection through the Exclusive Lasso arxiv:1505.07517v1 [stat.me] 28 May 2015 Frederick Campbell Department of Statistics, Rice University and Genevera Allen Department of Statistics,

More information

Sparsity and the Lasso

Sparsity and the Lasso Sparsity and the Lasso Statistical Machine Learning, Spring 205 Ryan Tibshirani (with Larry Wasserman Regularization and the lasso. A bit of background If l 2 was the norm of the 20th century, then l is

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

19.1 Problem setup: Sparse linear regression

19.1 Problem setup: Sparse linear regression ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 19: Minimax rates for sparse linear regression Lecturer: Yihong Wu Scribe: Subhadeep Paul, April 13/14, 2016 In

More information

2.3. Clustering or vector quantization 57

2.3. Clustering or vector quantization 57 Multivariate Statistics non-negative matrix factorisation and sparse dictionary learning The PCA decomposition is by construction optimal solution to argmin A R n q,h R q p X AH 2 2 under constraint :

More information

A Short Introduction to the Lasso Methodology

A Short Introduction to the Lasso Methodology A Short Introduction to the Lasso Methodology Michael Gutmann sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology March 9, 2016 Michael

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Prediction & Feature Selection in GLM

Prediction & Feature Selection in GLM Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis

More information

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Jonathan Taylor - p. 1/15 Today s class Bias-Variance tradeoff. Penalized regression. Cross-validation. - p. 2/15 Bias-variance

More information

Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models

Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Christopher Paciorek, Department of Statistics, University

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress) Compressed Sensing in Cancer Biology? (A Work in Progress) M. Vidyasagar FRS Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar University

More information

Sparse and Robust Optimization and Applications

Sparse and Robust Optimization and Applications Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Introduction to the genlasso package

Introduction to the genlasso package Introduction to the genlasso package Taylor B. Arnold, Ryan Tibshirani Abstract We present a short tutorial and introduction to using the R package genlasso, which is used for computing the solution path

More information

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information