The Constrained Lasso


Gareth M. James, Courtney Paulson and Paat Rusmevichientong
Marshall School of Business, University of Southern California

Abstract

Motivated by applications in areas as diverse as finance, image reconstruction, and curve estimation, we introduce the constrained lasso problem, where the underlying parameters satisfy a collection of linear constraints. We show that many statistical methods, such as the fused lasso, monotone curve estimation and the generalized lasso, are all special cases of the constrained lasso. Computing the constrained lasso poses some technical challenges, but we develop an efficient algorithm for fitting it over a grid of tuning parameters. Non-asymptotic error bounds are developed which suggest that the constrained lasso should outperform the lasso in situations where the true parameters satisfy the underlying constraints. Extensive numerical experiments show that our method performs well, both computationally and statistically. Finally, we apply the constrained lasso to a real data set to estimate the demand curve for a particular type of auto loan as a function of interest rate, and demonstrate that it can outperform more standard approaches.

Some key words: Lasso; Linear constraints; Penalized regression; Demand function; Generalized lasso.

1. Introduction

A great deal of attention has recently been focused on the problem of fitting the traditional regression model

Y_i = β_0 + Σ_{j=1}^p X_{ij} β_j + ε_i,   i = 1, ..., n,

in situations where the number of predictors, p, is large relative to the sample size, n. In this setting there are many approaches that outperform ordinary least squares (Frank and Friedman, 1993). Penalized regression methods are commonly adopted for these large-p problems. A small sampling of these approaches includes SCAD (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), the adaptive lasso (Zou, 2006), CAP (Zhao et al., 2006), the Dantzig selector (Candes and Tao, 2007), the relaxed lasso (Meinshausen, 2007), and VISA (Radchenko and James, 2008). However, one of the most popular of these approaches is the lasso (Tibshirani, 1996), which places an ℓ1 penalty on the coefficients. In particular, the lasso is defined as the solution to

argmin_β ||Y − Xβ||^2 + λ||β||_1,   (1)

where we have assumed that the response and predictors have been standardized to have mean zero, so the intercept can be ignored. The strengths and limitations of the lasso have been well documented. One of its major advantages is that the ℓ1 penalty has the effect of shrinking certain regression coefficients to exactly zero, and hence it performs variable selection. Another appealing aspect of the lasso is that there are several efficient algorithms for computing its solution. In particular, the LARS algorithm (Efron et al., 2004) computes the entire path of solutions for all values of λ. More recently, coordinate descent algorithms have been shown to be even more efficient for computing the lasso solution (Friedman et al., 2007, 2010).

In this article we are interested in solving

argmin_β ||Y − Xβ||^2 + λ||β||_1   subject to   Cβ ≤ b,   (2)

where C ∈ R^{m×p} and b ∈ R^m are a predefined matrix and vector. Equation (2) places a set of m linear constraints on the coefficients, so we call this problem the constrained lasso, or classo.
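To make the criterion in (2) concrete, the sketch below solves it directly as a convex program on hypothetical data. This is only an illustration of the optimization problem, not the path algorithm developed later in the paper; the use of cvxpy, the dimensions, and the value of λ are all assumptions made for the example.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, m = 50, 20, 3                      # hypothetical dimensions
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
C = rng.standard_normal((m, p))          # constraint matrix C
b = rng.standard_normal(m)               # constraint vector b
lam = 1.0                                # assumed tuning parameter

beta = cp.Variable(p)
objective = cp.Minimize(cp.sum_squares(Y - X @ beta) + lam * cp.norm1(beta))
constraints = [C @ beta <= b]            # the m linear constraints C beta <= b
cp.Problem(objective, constraints).solve()
print(np.round(beta.value, 3))
```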

Equations (1) and (2) appear similar, so why is the classo interesting? We address this question in three parts. First, we demonstrate that many commonly applied statistical problems, such as the fused lasso, the generalized lasso, monotone curve estimation, etc., can be expressed as special cases of (2). Hence, an efficient algorithm for solving the classo can be used to fit a large class of statistical methodologies. Second, it turns out that solving (2) is a non-trivial problem. There does not appear to be any simple extension of the LARS path algorithm to solve the classo, and a naive implementation of coordinate descent generally results in solutions that fail to reach the global optimum of (2). However, we present a modified coordinate descent algorithm that avoids these convergence issues. In addition, our algorithm can be implemented using standard lasso software packages, which greatly improves the robustness and computational speed of the classo. Third, we extend the theoretical bounds on the errors in the lasso coefficient estimates to the classo. The lasso and classo bounds have similar forms, but our bounds clearly demonstrate the potential improvements in accuracy that can be derived from adding the additional constraints in the classo.

The paper is structured as follows. In Section 2 we provide a number of motivating examples which illustrate the wide range of situations where the constrained lasso is applicable. Our modified coordinate descent algorithm is presented in Section 3. We first consider the situation where (2) involves equality constraints and then show how the algorithm can be extended to the more general inequality setting through the use of slack variables. Several different error bounds have been developed for the lasso coefficients (Negahban et al., 2010). In Section 4 we demonstrate that these bounds can be extended to the classo and that, provided the constraints hold for the true coefficients, the classo estimates will generally have smaller error bounds. There are many settings, such as monotone curve estimation or portfolio optimization, where we know that certain linear constraints must hold on the parameters we are estimating. Our theoretical results show that, in these settings, directly incorporating the constraints will produce superior estimates. We compare the classo to the standard lasso and relaxed lasso on a series of simulated data sets in Section 5. The classo is also implemented on a real data set consisting of loan applications for a large number of potential customers. The aim here is to estimate the demand curve as a function of interest rate. We conclude with a discussion of future extensions of this work in Section 6.

2. Motivating Examples

In this section we briefly describe a few of the situations in which the classo can be applied.

2.1 Portfolio Optimization

Suppose we have p assets, indexed by 1, 2, ..., p, whose covariance matrix is denoted by Σ. Markowitz (1952, 1959) developed the seminal framework for mean-variance analysis.

In particular, his approach involved choosing asset weights, w, to minimize the portfolio risk

w^T Σ w   (3)

subject to w^T 1 = 1. One may also choose to impose additional constraints on w to control the expected return of the portfolio, the allocations among sectors or industries, or the exposures to certain known risk factors. In practice Σ is unobserved, so it must be estimated using the sample covariance matrix, Σ̂. However, it has been well documented in the finance literature that when p is large, which is the norm in real world applications, minimizing (3) using the sample covariance matrix gives poor estimates for w. One possible solution involves regularizing Σ̂, but more recently attention has focused on directly penalizing or constraining the weights, an approach analogous to penalizing the coefficients in a regression setting. Fan et al. (2012) recently adopted this framework using a portfolio optimization problem with a gross-exposure parameter c, defined by:

minimize w^T Σ̂ w   subject to   ||w||_1 ≤ c,   w^T 1 = 1,   (4)

where c is a tuning parameter. Fan et al. (2012) reformulated (4) as a penalized regression problem and used LARS to estimate the weights. However, as they point out, their regression formulation only approximately solves (4). It is not hard to verify that (4) can be expressed in the form (2), where C has at least one row (to constrain w to sum to one) but may also have additional rows if we place constraints on the expected return, industry weightings, etc. Hence, the classo can be used to exactly solve the gross-exposure portfolio optimization problem.
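As an illustration, the sketch below sets up the gross-exposure problem (4) as a convex program on simulated returns. The dimensions, the gross-exposure bound c, the Cholesky factorization of the sample covariance, and the use of a generic convex solver are all assumptions for this example and not part of the paper's algorithm.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
R = rng.standard_normal((250, 10))                        # hypothetical return history: 250 days, 10 assets
Sigma_hat = np.cov(R, rowvar=False)                       # sample covariance matrix
L = np.linalg.cholesky(Sigma_hat + 1e-8 * np.eye(10))     # so that w' Sigma_hat w = ||L' w||^2
c = 1.6                                                   # assumed gross-exposure bound

w = cp.Variable(10)
prob = cp.Problem(cp.Minimize(cp.sum_squares(L.T @ w)),   # portfolio risk
                  [cp.norm1(w) <= c,                      # gross-exposure constraint ||w||_1 <= c
                   cp.sum(w) == 1])                       # weights sum to one
prob.solve()
print(np.round(w.value, 3))
```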

2.2 Monotone Curve Estimation

Consider the problem of fitting a smooth function, g(x), to a set of observations {(x_1, y_1), ..., (x_n, y_n)}, subject to the constraint that g must be monotone, e.g. non-increasing. There are many examples of such a setting. For instance, when estimating demand as a function of price we need to ensure that the curve is non-increasing. Alternatively, when estimating a cumulative distribution function (CDF) we need to ensure that the function is non-decreasing. Similarly, when performing curve registration to align a set of curves, the estimated warping functions must be non-decreasing.

Figure 1: Auto loan data demand curves as a function of the interest rate of the offered loan, with the probability of loan acceptance on the vertical axis, for the classo (black), logistic regression (red), and the standard lasso (blue).

If g is modeled as a parametric function, the monotonicity constraint is not hard to impose. However, one often wishes to produce a more flexible fit using a non-parametric approach. For example, we could model g(x) = b(x)^T β, where b(x) is a p-dimensional basis function. One natural approach is to choose a large value for p, to produce a very flexible estimate for g(x), and then to smooth the estimate by minimizing the lasso criterion

||Y − Bβ||^2 + λ||β||_1,   (5)

where the ith row of B is given by b(x_i)^T. However, minimizing (5) does not ensure a monotone curve. Hence, we need to constrain the coefficients such that

Cβ ≤ 0,   (6)

where the lth row of C is given by b′(u_l)^T and u_1, ..., u_m are a fine grid of points over the range of x. Enforcing (6) ensures that the derivative of g is non-positive, so g will be monotone decreasing. Of course, minimizing (5) subject to (6) is a special case of the classo.
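The sketch below illustrates this construction on synthetic data. The paper only specifies a generic basis b(x); here a cubic truncated power spline basis is an assumption chosen so that both B and the derivative rows b′(u_l)^T have simple closed forms, and the resulting problem is handed to a generic convex solver.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.exp(-3 * x) + 0.1 * rng.standard_normal(n)         # noisy, monotone-decreasing truth

knots = np.linspace(0.1, 0.9, 8)                          # assumed knot locations
def basis(t):                                             # cubic truncated power basis b(t)
    t = np.atleast_1d(t)
    return np.column_stack([np.ones_like(t), t, t**2, t**3] +
                           [np.maximum(t - k, 0) ** 3 for k in knots])
def basis_deriv(t):                                       # its derivative b'(t)
    t = np.atleast_1d(t)
    return np.column_stack([np.zeros_like(t), np.ones_like(t), 2 * t, 3 * t**2] +
                           [3 * np.maximum(t - k, 0) ** 2 for k in knots])

B = basis(x)                                              # rows b(x_i)^T, as in (5)
u = np.linspace(0, 1, 50)                                 # grid for the derivative constraint (6)
D = basis_deriv(u)                                        # rows b'(u_l)^T

beta = cp.Variable(B.shape[1])
prob = cp.Problem(cp.Minimize(cp.sum_squares(y - B @ beta) + 0.1 * cp.norm1(beta)),
                  [D @ beta <= 0])                        # non-positive derivative on the grid
prob.solve()
g_hat = basis(u) @ beta.value                             # fitted monotone-decreasing curve
```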

Figure 1 provides an example of this approach applied to the auto loan data considered in Section 5.3. We have used the standard lasso (blue), logistic regression (red) and the classo (black) to estimate demand for loans as a function of interest rate. The logistic curve is constrained in the shape it can model, while the lasso fit is not sensible because it lacks a monotone shape. However, the classo is capable of fitting a flexible, but still monotone, curve.

2.3 Generalized Lasso and Related Methods

Tibshirani and Taylor (2011) introduce the generalized lasso problem:

argmin_θ ||Y − Xθ||^2 + λ||Dθ||_1,   (7)

where D ∈ R^{r×p}. When rank(D) = r, and thus r ≤ p, Tibshirani and Taylor (2011) show that the generalized lasso can be converted to the classical lasso problem. However, if r > p then such a reformulation is not possible. Lemma 1 shows that when r > p and D has full column rank, the generalized lasso is a special case of the classo with equality constraints and b = 0.

Lemma 1. If r > p and rank(D) = p, then there exist matrices A, C and X̃ such that, for all values of λ, the solution to (7) is equal to θ̂ = Aβ̂, where β̂ is given by:

argmin_β ||Y − X̃β||^2 + λ||β||_1   subject to   Cβ = 0.   (8)

The proof of Lemma 1 is provided in Appendix A. Hence, any problem that falls into the generalized lasso paradigm can be solved as a classo problem. Tibshirani and Taylor (2011) demonstrate that a variety of common statistical methods can be formulated as special cases of the generalized lasso. One example is the 1d fused lasso (Tibshirani et al., 2005), which is defined as the solution to

argmin_β ||Y − Xβ||^2 + λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=2}^p |β_j − β_{j−1}|.

The fused lasso encourages blocks of adjacent estimated coefficients to all have the same value. This type of structure often makes sense in situations where there is a natural ordering in the coefficients.
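As a small concrete example of the generalized lasso formulation, the snippet below builds the penalty matrix D for the 1d fused lasso by stacking identity rows (the sparsity penalty) on first-difference rows (the fusion penalty). This gives r = 2p − 1 > p rows of full column rank, which is exactly the regime covered by Lemma 1; the relative weights λ_1 and λ_2 can be absorbed by rescaling the two row blocks. The dimension p is arbitrary.

```python
import numpy as np

p = 6
D_sparsity = np.eye(p)                    # rows producing lambda_1 * sum_j |beta_j|
D_fusion = np.diff(np.eye(p), axis=0)     # rows producing lambda_2 * sum_j |beta_j - beta_{j-1}|
D = np.vstack([D_sparsity, D_fusion])     # D is r x p with r = 2p - 1 > p

print(D.shape)                            # (11, 6)
print(np.linalg.matrix_rank(D))           # 6, i.e. full column rank, so Lemma 1 applies
```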

If the data instead have a two-dimensional ordering, such as in image reconstruction, this idea can be extended to the 2d fused lasso,

argmin_β ||Y − Xβ||^2 + λ_1 Σ_{j,j′} |β_{j,j′}| + λ_2 Σ_{j,j′} |β_{j,j′} − β_{j−1,j′}| + λ_3 Σ_{j,j′} |β_{j,j′} − β_{j,j′−1}|,

where β_{j,j′} is the coefficient at location (j, j′). Other examples of statistical methodologies that fall into the generalized (and hence constrained) lasso setting include polynomial trend filtering, where one penalizes discrete differences to produce smooth piecewise polynomial curves, wavelet smoothing, and the FLiRTI method (James et al., 2009). See Tibshirani and Taylor (2011) for more details on these methods and further examples of the generalized lasso paradigm. She (2010) also considers a criterion similar to (7) and discusses special cases such as the clustered lasso.

3. A Constrained Lasso Fitting Algorithm

To develop our algorithm for fitting the classo we start by defining some notation. For any matrix Z, let Z_A represent the m columns of Z associated with an index set A of size m. Similarly, let Z_Ā correspond to the remaining columns of Z. To reduce notation we assume, without loss of generality, that A corresponds to the first m columns of Z, so Z = [Z_A Z_Ā].

3.1 Equality Constraints

A common approach for dealing with inequality constraints involves augmenting the set of parameters to include slack variables, δ, and hence reparameterizing the problem using equality constraints. We show in Section 3.2 that this augmentation approach also works for the classo, so we begin by considering the equality constrained version of (2):

argmin_β ||Y − Xβ||^2 + λ||β||_1   such that   Cβ = b.   (9)

Unfortunately, the constraint on the coefficients makes it difficult to solve (9) directly. However, Lemma 2 provides an approach for reformulating (9) as an unconstrained optimization problem.

Lemma 2. For a given index set, A, define θ_A by:

θ_A = argmin_θ ||Y* − X*θ||^2 + λ||θ||_1 + λ||C_A^{-1}(b − C_Ā θ)||_1,   (10)

where Y* = Y − X_A C_A^{-1} b and X* = X_Ā − X_A C_A^{-1} C_Ā. Then, for any index set, A, such that C_A is non-singular, the solution to (9) is given by

β̂ = ( C_A^{-1}(b − C_Ā θ_A),  θ_A ).
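A quick numerical sanity check of the reparameterization in Lemma 2 is sketched below: for an index set A with C_A invertible, any value of θ mapped back through β_A = C_A^{-1}(b − C_Ā θ) automatically satisfies Cβ = b. All data and dimensions are hypothetical, and A is taken to be the first m coordinates as in the paper's convention.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 40, 12, 4
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
C = rng.standard_normal((m, p))
b = rng.standard_normal(m)

A, Abar = np.arange(m), np.arange(m, p)               # A = first m coordinates (WLOG)
CA_inv = np.linalg.inv(C[:, A])

Y_star = Y - X[:, A] @ CA_inv @ b                     # transformed response in (10)
X_star = X[:, Abar] - X[:, A] @ CA_inv @ C[:, Abar]   # transformed design in (10)

theta = rng.standard_normal(p - m)                    # any value of theta ...
beta = np.concatenate([CA_inv @ (b - C[:, Abar] @ theta), theta])
print(np.allclose(C @ beta, b))                       # ... yields a beta satisfying C beta = b
```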

Again, to reduce notation, we assume without loss of generality that the elements of β are ordered so that the first m correspond to A. Equation (10) has the advantage that one does not need to incorporate any constraints on the parameters. Hence, one might imagine using a coordinate descent algorithm to compute θ_A. However, for such an algorithm to reach a global optimum, the parameters must be separable in the non-differentiable ℓ1 penalty, i.e. the criterion must be of the form

f(θ_1, ..., θ_{p−m}) = g(θ_1, ..., θ_{p−m}) + Σ_{j=1}^{p−m} h_j(θ_j),   (11)

where g(·) is differentiable and convex, and the h_j(·) are convex. The standard lasso criterion (1) satisfies (11) because ||β||_1 = Σ_{j=1}^p |β_j|. Unfortunately, however, the second ℓ1 term in (10) cannot be separated into additive functions of θ_j. Hence, there does not appear to be any simple approach for directly computing θ_A. Fortunately, we can make use of Lemma 3, which shows that an alternative, more tractable, criterion can be used to compute θ_A.

Lemma 3. For a given index set, A, and vector s, define θ_{A,s} by:

θ_{A,s} = argmin_θ ||Ỹ − X*θ||^2 + λ||θ||_1,   (12)

where Ỹ = Y* + λ X̄ (C_A^{-1} C_Ā)^T s and X̄ is a matrix such that X̄^T X* = I. Then, for any index set, A, it will be the case that θ_A = θ_{A,s} provided

s = sign(θ̄_{A,s}),   (13)

where θ̄_{A,s} = C_A^{-1}(b − C_Ā θ_{A,s}) and sign(a) is a vector of the same dimension as a with ith element equal to 1 or −1 depending on the sign of a_i.

The proofs of Lemmas 2 and 3 are provided in Appendix B. There is a simple intuition behind Lemma 3. The difficulty in computing (10) lies in the non-differentiability (and non-separability) of the second ℓ1 penalty. However, if (13) holds, then

||C_A^{-1}(b − C_Ā θ_{A,s})||_1 = s^T C_A^{-1}(b − C_Ā θ_{A,s})

and we can replace the ℓ1 penalty by a differentiable term which no longer needs to be separable. Of course, the key to this approach is selecting A and s such that (13) holds. Note that θ̄_{A,s} corresponds to the elements of β indexed by A, so our general principle for choosing A and s will be to:

1. Produce β_0, an initial estimate for β.
2. Set A equal to the indices of the m largest elements of β_0.
3. Set s equal to the signs of the corresponding m elements of β_0.

In order for this strategy to work we only require that the m largest elements of β_0 are sign consistent, i.e. they have the same sign as the corresponding elements in the final solution to (9). Corollary 1 shows that Lemmas 2 and 3 provide the solution to the classo.

Corollary 1. By Lemmas 2 and 3, provided A and s are chosen such that C_A is non-singular and (13) holds, the solution to (9) is given by β̂ = (θ̄_{A,s}, θ_{A,s}).

Lemma 3 and Corollary 1 provide an attractive means of computing the classo solution, because (12) is a standard lasso criterion and so can be solved using any one of the many highly efficient lasso optimization approaches, such as the LARS algorithm or coordinate descent. These results suggest Algorithm 1 (below) for solving (9) over a grid of λ. This algorithm relies on the fact that β̂_λ will be continuous in λ. Hence, if we take a small enough step in λ, the signs of the m largest elements of β_k and β_{k−1} will be identical. If this is not the case, then we have taken too large a step, so we reduce the step size and recompute β_k. In practice, unless m is very large or α is set too large, (13) generally holds without needing to decrease the step size. Step 3 is the main computational component of this algorithm, but θ_{A_k,s_k} is easy to compute because (12) is just a standard lasso criterion, so we can use any one of a number of optimization tools.

Algorithm 1: Constrained Lasso with Equality Constraints

1. Initialize β_0 by solving (9) using λ_0 = λ_max.
2. At step k, select A_k and s_k using the m largest elements of β_{k−1} and set λ_k ← 10^{−α} λ_{k−1}, where α > 0 controls the step size.
3. Compute θ_{A_k,s_k} by solving (12). Let θ̄_{A_k,s_k} = C_{A_k}^{-1}(b − C_{Ā_k} θ_{A_k,s_k}).
4. If (13) holds, set β_k = (θ̄_{A_k,s_k}, θ_{A_k,s_k}), k ← k + 1, and return to 2.
5. If (13) does not hold, then one of the m largest elements of β_{k−1} has changed sign. Since the coefficient paths are continuous, this means our step size in λ was too large. Hence, set λ_k ← λ_{k−1} − (λ_{k−1} − λ_k)/2 and return to 3.
6. Iterate until λ_k < λ_min.

The initial solution, β_0, can be computed by noting that, as λ → ∞, the solution to (9) will be

argmin_β ||β||_1   such that   Cβ = b,   (14)

which is a linear programming problem that can be efficiently solved using standard algorithms. We also implement a reversed version of this algorithm, where we first set λ_0 = λ_min, compute β_0 as the solution to a quadratic programming problem, and then increase λ at each step until λ_k > λ_max. We discuss some additional implementation details in Appendix C.
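The following Python skeleton sketches this path strategy under stated assumptions: the initialization solves (14) as a linear program, λ is decreased geometrically, and a failed sign-consistency check (13) triggers step-halving. The inner lasso fit of Step 3 is left as a hypothetical solve_inner_lasso() callback (the construction of the criterion (12) is not reproduced here), and the 10^{−α} schedule follows the reconstruction of Algorithm 1 above.

```python
import numpy as np
from scipy.optimize import linprog

def initialize_beta(C, b):
    """Initial estimate (14): argmin ||beta||_1 subject to C beta = b, as a linear program."""
    m, p = C.shape
    cost = np.ones(2 * p)                              # write beta = u - v with u, v >= 0
    res = linprog(cost, A_eq=np.hstack([C, -C]), b_eq=b, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

def classo_path(X, Y, C, b, solve_inner_lasso, lam_max, lam_min, alpha=0.05):
    m, _ = C.shape
    beta = initialize_beta(C, b)
    path = [(lam_max, beta)]
    lam_prev, lam = lam_max, 10 ** (-alpha) * lam_max
    while lam > lam_min:
        A = np.argsort(-np.abs(beta))[:m]              # indices of the m largest |beta_j|
        s = np.sign(beta[A])
        beta_new = solve_inner_lasso(X, Y, C, b, A, s, lam)   # Step 3 (hypothetical callback)
        if np.array_equal(np.sign(beta_new[A]), s):    # sign condition (13) holds
            beta = beta_new
            path.append((lam, beta))
            lam_prev, lam = lam, 10 ** (-alpha) * lam
        else:                                          # step too large: halve the step in lambda
            lam = lam_prev - 0.5 * (lam_prev - lam)
    return path
```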

Figure 2 provides a comparison of the lasso with the classo on a small simulated data set with n = 50 observations and p = 5 predictors. The left-hand panel plots the 5 lasso coefficient paths as a function of λ, one curve for each coefficient. As λ increases, the lasso coefficients are all shrunk to zero. The center panel plots the same set of coefficients computed using our classo algorithm after incorporating a set of m = 5 equality constraints. The left-hand sides of the two plots, where λ is small, are similar. However, for large values of λ, the lasso and classo coefficient estimates are significantly different from each other because the linear constraints do not allow the coefficients to be set to zero. The right-hand panel illustrates the solution to the classo computed using standard quadratic programming software. The two sets of paths are identical, confirming that our algorithm is computing the correct set of solutions.

Figure 2: Left: Lasso coefficient paths for a simulated data set with n = 50 observations and p = 5 predictors. Center: Constrained lasso coefficient paths computed using Algorithm 1, after incorporating m = 5 linear constraints. Right: Constrained lasso coefficient paths computed using quadratic programming software. In each panel the horizontal axis shows λ on a log scale.

One might ask why Algorithm 1 is required, since Figure 2 demonstrates that the classo path of coefficients can be computed using standard quadratic programming software. Of course, the same question could be raised for the standard lasso, which can also be solved as a quadratic programming problem. The difficulty is that quadratic programming becomes extremely inefficient for large values of p. Figure 3 shows the number of seconds required to fit the classo, as a function of p, on a single simulated data set with n = 50 observations and m = 5 constraints. We computed the classo solution using both Algorithm 1 and quadratic programming on the same set of grid points for λ. The methods are roughly comparable until p = 00, but beyond this point quadratic programming becomes computationally infeasible while Algorithm 1 still performs well.

Figure 3: Time (in seconds) required to fit the classo on a single simulated data set, as a function of p, using Algorithm 1 (black) and quadratic programming (red).

3.2 Inequality Constraints

We now consider the more general optimization problem given by (2). One might imagine that a reasonable approach would be to initialize with β such that

Cβ ≤ b   (15)

and then apply a coordinate descent algorithm while ensuring that (15) is not violated at any update. Unfortunately, this approach typically gets stuck at a constraint boundary point where no improvement is possible by changing a single coordinate. In this setting the criterion can often be improved by moving along the constraint boundary, but such a move requires adjusting multiple coefficients simultaneously, which is not possible using coordinate descent because it only updates one coefficient at a time. Instead, we introduce a set of m slack variables, δ, so that (15) can be equivalently expressed as

Cβ + δ = b,  δ ≥ 0,   or   C̃β̃ = b,  δ ≥ 0,   (16)

where β̃ = (β, δ) is a (p + m)-dimensional vector, C̃ = [C I] and I is the m-dimensional identity matrix.
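A minimal sketch of this augmentation, with hypothetical dimensions: the design gains m zero columns (so the slacks never affect the fit) and the constraint matrix gains an identity block, turning Cβ ≤ b into the equality constraint C̃β̃ = b with δ ≥ 0.

```python
import numpy as np

def augment_for_inequality(X, C):
    """Build X_tilde = [X 0] and C_tilde = [C I] as in (16)-(17)."""
    n, _ = X.shape
    m, _ = C.shape
    X_tilde = np.hstack([X, np.zeros((n, m))])   # slack variables do not enter the fit
    C_tilde = np.hstack([C, np.eye(m)])          # C beta + delta = b
    return X_tilde, C_tilde

# delta = b - C beta recovers equality in (16); delta >= 0 exactly when C beta <= b holds.
rng = np.random.default_rng(4)
X = rng.standard_normal((30, 8)); C = rng.standard_normal((3, 8)); b = rng.standard_normal(3)
beta = np.zeros(8); delta = b - C @ beta
X_t, C_t = augment_for_inequality(X, C)
print(np.allclose(C_t @ np.concatenate([beta, delta]), b))
```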

Let e_δ(a) denote the function that selects the elements of a corresponding to δ, and e_{−δ}(a) the remaining elements; for example, e_δ(β̃) = δ while e_{−δ}(β̃) = β. Then the inclusion of the slack variables in (16) allows us to re-express the classo criterion (2) as

argmin_β̃ ||Y − X̃β̃||^2 + λ||e_{−δ}(β̃)||_1   such that   C̃β̃ = b,  e_δ(β̃) ≥ 0,   (17)

where X̃ = [X 0] and 0 is an n by m matrix of zeros. The criterion in (17) is very similar to the equality constrained lasso (9). The only differences are that the components of β̃ corresponding to δ do not appear in the penalty term and are required to be non-negative. Even with these minor differences, the same basic approach from Section 3.1 can still be adopted for fitting (17). In particular, Lemma 4 provides a set of conditions under which (2) can be solved.

Lemma 4. For a given index set, A, and vector s such that e_δ(s) = 0, define θ_{A,s} by:

θ_{A,s} = argmin_θ ||Ỹ − X*θ||^2 + λ||e_{−δ}(θ)||_1   such that   e_δ(θ) ≥ 0,   (18)

where Ỹ = Y* + λ X̄ (C̃_A^{-1} C̃_Ā)^T s, Y* = Y − X̃_A C̃_A^{-1} b, X* = X̃_Ā − X̃_A C̃_A^{-1} C̃_Ā, and X̄ is a matrix such that X̄^T X* = I. Suppose that

e_{−δ}(s) = sign( e_{−δ}(θ̄_{A,s}) ),   (19)

e_δ(θ̄_{A,s}) ≥ 0,   (20)

where θ̄_{A,s} = C̃_A^{-1}(b − C̃_Ā θ_{A,s}). Then the solution to the classo criterion (2) is given by β̂ = (θ̄_{A,s}, θ_{A,s}).

The proof of this result is similar to that for Lemma 3, so we omit it here. Lemma 4 shows that, provided an appropriate A and s are chosen, the solution to the classo can still be computed by solving a lasso-type criterion, (18). However, we must now ensure that both (19) and (20) hold. Condition (19) is equivalent to (13) in the equality constraint setting, while (20), along with the constraint in (18), ensures that δ ≥ 0. We use the same strategy as in Section 3.1. First, obtain an initial coefficient estimate, β̃_0. Next, select A corresponding to the m largest elements of β̃_0, say β̃_A. Finally, the m-dimensional s vector is chosen by fixing e_{−δ}(s) = sign(e_{−δ}(β̃_A)) and setting the remaining elements of s to zero, i.e. e_δ(s) = 0.

β̃_A is our initial guess for θ̄_{A,s}, so as long as β̃_A is sign consistent for θ̄_{A,s} then (19) will hold. Similarly, by only including the largest current values of δ in A, for a small enough step in λ none of these elements will become negative and (20) will hold. Hence, Algorithm 2 can be used to fit the inequality constrained classo.

Algorithm 2: Constrained Lasso with Inequality Constraints

1. Initialize β̃_0 by solving (17) using λ_0 = λ_max.
2. At step k, select A_k and s_k using the m largest elements of β̃_{k−1} and set λ_k ← 10^{−α} λ_{k−1}.
3. Compute θ_{A_k,s_k} by solving (18). Let θ̄_{A_k,s_k} = C̃_{A_k}^{-1}(b − C̃_{Ā_k} θ_{A_k,s_k}).
4. If (19) and (20) hold, set β̃_k = (θ̄_{A_k,s_k}, θ_{A_k,s_k}), k ← k + 1, and return to 2.
5. If (19) or (20) does not hold, then the step size must be too large. Hence, set λ_k ← λ_{k−1} − (λ_{k−1} − λ_k)/2 and return to 3.
6. Iterate until λ_k < λ_min.

Notice that Algorithm 2 involves only slight changes from Algorithm 1. In particular, solving (18) in Step 3 poses little additional complication over fitting the standard lasso. The only differences are that the elements of θ that correspond to δ, i.e. e_δ(θ), have zero penalty and must be non-negative. However, these changes are simple to incorporate into a coordinate descent algorithm. For any θ_j that is an element of e_δ(θ), we first compute the unshrunk least squares estimate, θ̂_j, and then set

θ_j = [θ̂_j]_+.   (21)

It is not hard to show that (21) enforces the non-negativity constraint on δ while also ensuring that no penalty term is applied to the slack variables. The update step for the original β coefficients, e_{−δ}(θ) (those that do not involve δ), is identical to that for the standard lasso.
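The sketch below spells out the two coordinate updates this implies, under stated assumptions: theta_hat_j stands for the unshrunk single-coordinate least squares solution (its computation from the partial residual is omitted) and lam is the correspondingly scaled threshold, so this is a schematic of the update rule rather than a full coordinate descent implementation.

```python
import numpy as np

def soft_threshold(z, lam):
    """Standard lasso soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def update_coordinate(theta_hat_j, lam, is_slack):
    """One coordinate update for (18): slack coordinates are unpenalized but clipped
    at zero as in (21); original coefficients receive the usual lasso shrinkage."""
    if is_slack:
        return max(theta_hat_j, 0.0)          # equation (21): [theta_hat_j]_+
    return soft_threshold(theta_hat_j, lam)
```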

The initial solution, β̃_0, can still be computed by solving a standard linear programming problem, now subject to inequality constraints.

4. Statistical Properties: Error Bounds

In this section we show that, assuming the true coefficients satisfy a given set of constraints, one can obtain sharper error bounds for the classo estimate relative to the traditional unconstrained lasso estimate. To describe the results, let us introduce the statistical model for our problem. We assume that Y ∈ R^n is generated from

Y = Xβ* + ε,

where X ∈ R^{n×p}, β* ∈ R^p, and ε = (ε_1, ε_2, ..., ε_n) are i.i.d. mean-zero random variables. Let β̂(λ) denote a solution of the classo problem defined by:

β̂(λ) = argmin_β L(β; Y) + λ||β||_1   subject to   Cβ = b,   (22)

where L(β; Y) = ||Y − Xβ||^2 / (2n). If the solution to (22) is not unique, we can take β̂(λ) to be any optimal solution; our error bounds hold for all such solutions. We make the following two assumptions about the underlying parameter β*.

Assumption 1 (Sparsity). Let supp(β*) = {j : β*_j ≠ 0}. The parameter β* is s-sparse, with |supp(β*)| = s.

Assumption 2 (Constraint Matrix). There exists a matrix C ∈ R^{m×p} with rank(C) = m ≤ s such that Cβ* = b. Moreover, the matrix C_A is non-singular for some A ⊆ supp(β*) with |A| = m.

By Assumption 2 we can decompose β* = [β*_A, β*_Ā]. Let S = Ā ∩ supp(β*) denote the support of the vector β*_Ā. Note that Assumptions 1 and 2 imply that |S| = s − m. For any vector u, let Π_S(u) and Π_S̄(u) be defined by:

[Π_S(u)]_j = u_j if j ∈ S and 0 otherwise,   [Π_S̄(u)]_j = u_j if j ∉ S and 0 otherwise,

so that u = Π_S(u) + Π_S̄(u). Finally, for any c > 1, define a cone 𝒞 ⊆ R^p consisting of all Δ ∈ R^p such that CΔ = 0 and ||Π_S̄(Δ_Ā)||_1 is bounded by a constant multiple, depending only on c, of ||Π_S(Δ_Ā)||_1 + ||Δ_A||_1.

It is easy to verify that 𝒞 is a cone, because if x ∈ 𝒞 then tx ∈ 𝒞 for all t ≥ 0. Negahban et al. (2010) show that, provided Assumption 1 holds and λ ≥ 2||∇L(β*; Y)||_∞, the traditional unconstrained lasso estimate β̂_L(λ) satisfies error bounds of the form

||β̂_L(λ) − β*||_1 ≤ K_1 s λ / κ_L   and   ||β̂_L(λ) − β*||_2 ≤ K_2 √s λ / κ_L,

where κ_L is a constant which we define shortly and K_1, K_2 are universal constants. Hence, the ℓ2 error of β̂_L(λ) is of order O(√s). In Theorem 1 we show that, provided m is not too large, the error in the classo estimate is instead of order O(√(s − m)).

Theorem 1 (An Error Bound for the Classo). Under Assumptions 1 and 2, suppose λ ≥ c||∇L(β*; Y)||_∞ for some c > 1, and the loss function L(·; Y) satisfies restricted strong convexity on 𝒞 with parameter κ_L > 0; that is, for all Δ ∈ 𝒞,

L(β* + Δ; Y) − L(β*; Y) − ∇L(β*; Y)^T Δ ≥ κ_L ||Δ||_2^2.

Then

||β̂(λ) − β*||_1 ≤ K_1(c) max{s − m, m} λ / κ_L   and   ||β̂(λ) − β*||_2 ≤ K_2(c) √(max{s − m, m}) λ / κ_L,

where K_1(c) and K_2(c) are constants that depend only on c.

Note that max{s − m, m} < s for m < s. Moreover, setting c = 5 in Theorem 1 gives constants that are no larger than the constants in the standard lasso error bound, and by using different values of c we can also extend the admissible range of λ. For m ≤ s/2, our ℓ2 bound is of order O(√(s − m)). This rate follows from the fact that Assumption 2 implies

β*_A = C_A^{-1}(b − C_Ā β*_Ā).   (23)

Hence, the m coordinates of β*_A are completely determined by the remaining p − m coordinates, β*_Ā. Since |supp(β*_Ā)| = s − m, the vector β*_Ā is (s − m)-sparse, which can be interpreted as the adjusted sparsity level. Why then do the bounds in Theorem 1 also depend on m?

In fact, this term reflects the error due to model selection. To see this, note that when m = s, it follows from Assumptions 1 and 2 and Equation (23) that we can exactly recover β*, but only if we know the locations, A, of the non-zero entries. There are (p choose s) possible locations for the non-zero entries. Intuitively, this corresponds to O(p^s) possible hypotheses, and information-theoretic arguments show that, even if we have m = s constraints, we still need n to be of order log (p choose s) ≍ s log p to identify the correct hypothesis, and hence to ensure that our error ||β̂(λ) − β*|| is small.

This observation is formalized in the following corollary, which provides a high-probability error bound for the classo estimate. The proof follows from Theorem 1.

Corollary 2 (High Probability Error Bound). Under Assumptions 1 and 2, suppose the matrix X satisfies the restricted eigenvalue condition on 𝒞, namely ||Xu||_2^2 / (2n) ≥ κ_L ||u||_2^2 for all u ∈ 𝒞. If λ is chosen of order σ√(log p / n) and ε = (ε_1, ε_2, ..., ε_n) are i.i.d. mean-zero sub-Gaussian random variables with variance σ^2, then, with probability at least 1 − 1/p,

||β̂(λ) − β*||_2 ≤ √(max{s − m, m}) · (8σ/κ_L) · √(log p / n).   (24)

By comparison, Negahban et al. (2010) show that, for λ = 4σ√(log p / n), the standard unconstrained lasso estimate β̂_L(λ) satisfies the error bound

||β̂_L(λ) − β*||_2 ≤ √s · (8σ/κ_L) · √(log p / n),

with probability at least 1 − 1/p. Hence, the classo and lasso bounds differ only in the first term, with the classo bound strictly lower than that of the lasso for m < s. It is worth noting that the constant in (24) can be lowered further by choosing a slightly smaller value for λ, but we have not done so because our goal here is to highlight the difference between the classo and lasso in the dependence on s. A bound of the same form as (24) can also be produced for the ℓ1 error, although a somewhat larger value of λ is optimal in that case.

We note that the bounds in Theorem 1 depend only on s and m, and not on the constraint matrix C. It is possible to develop an error bound that depends on C, in which case the bound scales with √(s − m + m ρ_max(C_A^{-1} C_Ā)), where ρ_max(C_A^{-1} C_Ā) is the largest singular value of C_A^{-1} C_Ā. This error bound is useful, for example, when the magnitudes of the elements of C_Ā are much smaller than those of C_A. For example, in the extreme case where the constraints

only involve the m largest coefficients, then C_Ā = 0 and ρ_max = 0, so the bound scales with √(s − m). The proof of this bound is similar to that of Theorem 1, so we omit the details.

In this section we have concentrated on results for the equality constrained classo. Results in the inequality setting are more complicated. Intuitively, one can see that if β* lies well inside the region Cβ ≤ b, then the classo and lasso should give similar answers, because the constraints will play little role in the fit. However, if β* lies close to the constraint boundary, then the classo should offer improvements relative to the lasso. As shown in Section 3, one can reformulate the inequality classo using equality constraints with the addition of a set of m slack variables, δ. If β* lies close to the boundary, then these slack variables will be close to, but not exactly, zero. Hence, to develop bounds for the classo in this setting one must adapt Assumption 1 to one where the augmented set of coefficients is not necessarily sparse, but can be well approximated by a sparse vector. Negahban et al. (2010) provide bounds for the standard lasso in such an approximately sparse setting, but extending them to the classo is beyond the scope of this paper.

5. Empirical Results

5.1 Equality Constraints

In this section we compare the performance of the classo to the regular lasso (Tibshirani, 1996) in nine different simulation settings. For each setting we randomly generated a training data set, fit each method to the data, and computed the root mean squared error

RMSE = sqrt( (1/N) Σ_{i=1}^N ( Ŷ_i − E(Y_i | X_i) )^2 )

over a separate test set of N = 0000 observations. This process was repeated 00 times for each of the nine settings. The training data sets were produced using a random design matrix generated from a normal distribution. Three different correlation structures were investigated: ρ_ij = 0, ρ_ij = 0.5, and ρ_ij = 0.5^|i−j| (where ρ_ij is the correlation between the ith and jth variables). In all cases the m by p constraint matrix C and the constraint vector b were randomly generated from a standard normal distribution. The true coefficient vector β* was produced by first generating β*_Ā using 5 non-zero random normal components and p − m − 5 zero entries, and then computing β*_A from (23). Note that this process resulted in β* being (m + 5)-sparse and ensured that the constraints held for the true coefficient vector.
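The snippet below sketches this data-generating process for a single training set. The particular values of n, p, m, the number of non-zero coefficients, and the choice of the first m coordinates as A are illustrative assumptions rather than the exact settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, m, n_nonzero = 100, 50, 5, 5        # illustrative values only
rho = 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # rho_ij = 0.5^|i-j|
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

C = rng.standard_normal((m, p))           # random constraint matrix and vector
b = rng.standard_normal(m)

beta_Abar = np.zeros(p - m)               # sparse free block ...
beta_Abar[:n_nonzero] = rng.standard_normal(n_nonzero)
beta_A = np.linalg.solve(C[:, :m], b - C[:, m:] @ beta_Abar)   # ... and beta_A from (23)
beta_star = np.concatenate([beta_A, beta_Abar])

Y = X @ beta_star + rng.standard_normal(n)
print(np.allclose(C @ beta_star, b))      # constraints hold exactly for the true coefficients
```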

For each set of simulations, the optimal value of λ was chosen by minimizing RMSE on a separate validation set, which was independently generated using the same parameters as the given simulation setting. We explored three combinations of n, p and m, and for each setting tested all three correlation structures. The test RMSE values for the nine resulting settings are displayed in Table 1.

ρ Lasso Classo Relaxed Lasso Relaxed Classo
n = (0.0) 0.50(0.0) 0.43(0.0) 0.9(0.0)
p = (0.0) 0.67(0.0) 0.66(0.0) 0.53(0.0)
m = i j 0.59(0.0) 0.46(0.0) 0.50(0.0) 0.3(0.0)
n = (0.08) 3.8(0.09) 3.9(0.09) 3.07(0.0)
p = (0.07) 3.78(0.08) 3.68(0.08) 3.68(0.09)
m = i j.6(0.08).39(0.09).50(0.09).3(0.0)
n = (0.07).5(0.03) 6.84(0.08) 0.94(0.03)
p = (0.07).67(0.04) 6.98(0.08).38(0.04)
m = i j 6.46(0.06).30(0.03) 6.6(0.07).0(0.03)

Table 1: Average RMSE over 00 training data sets, for the four lasso methods, in three different simulation settings and three different correlation structures. The numbers in parentheses are standard errors.

We compared four methods: the standard lasso, our classo, the relaxed lasso (Meinshausen, 2007), and the relaxed classo. The latter two methods use a two-step approach in an attempt to reduce the over-shrinkage problem commonly exhibited by the lasso. In the first step the lasso or classo is used to select a candidate set of predictors. In the second step the final model is produced using an unshrunk ordinary least squares fit on the variables selected in the first step. This approach is fairly simple for the relaxed lasso. However, a constrained least squares fit must be implemented for the relaxed classo in order to ensure that the constraints are still satisfied.

The first set of three results corresponds to a setting with n = 00, p = 50 and m = 5. This is a relatively low dimensional problem with only a small number of constraints, m = 5. Hence, one might imagine that the classo would have a comparatively small advantage over the lasso. However, even with a low value for m the classo shows highly statistically

significant improvements over the lasso. Both relaxed methods display lower error rates than their unrelaxed counterparts, but the relaxed classo has a large advantage over the relaxed lasso. The higher correlations in the ρ = 0.5 design structure moderately increase the error rates for all methods but do not change the relative rankings of the four approaches.

The second set of results represents a very high dimensional setting with n = 50, p = 500 and m = 0. In this situation the number of constraints is very low relative to the number of predictors, so the classo generally has only a small advantage over the lasso, although the relaxed classo still appears to offer some real improvements in accuracy. Our main purpose in examining this setting was to demonstrate that the classo algorithm is still efficient enough to optimize the constrained criterion even for very high dimensional data.

The final setting examined data with n = 50, p = 00 and a very large number of constraints, m = 60. This setting is statistically favorable for the classo because it has the potential to produce significantly more accurate regression coefficients by correctly incorporating the large number of constraints. However, this is also a computationally difficult setting for the classo because the large value of m causes the coefficient paths to be highly variable. Nevertheless, the large improvements in accuracy for both the classo and relaxed classo in Table 1 demonstrate that our algorithm is quite capable of dealing with this added complexity.

Figure 4 provides a better feeling for the general improvement in accuracy from incorporating the constraints into our fit. Here we display boxplots of the RMSE, for each of our four comparison methods, over the 00 training data sets on two of the nine simulation settings. The improvement ranges from moderate in the first setting, where m is low, to dramatic in the second setting, where m is much higher.

Figure 4: Boxplots of the 00 simulation RMSE values for all four lasso methods using the ρ_ij = 0.5^|i−j| correlation structure. Left: the n = 00, p = 50, m = 5 setting. Right: the n = 50, p = 00, m = 60 setting.

5.2 Violations of Constraints

The results presented in the previous section all correspond to an ideal situation where the true regression coefficients exactly match the equality constraints. We also investigated the sensitivity of the classo to deviations of the regression coefficients from the assumed constraints. In particular, we generated the true regression coefficients according to

Cβ* = (1 + u) ∘ b,   (25)

where u = (u_1, ..., u_m), u_l ∼ Unif(0, a) for l = 1, ..., m, and the product in (25) is taken elementwise.
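A short sketch of this perturbation follows, with hypothetical dimensions and a set to the smallest level tested: β* is built to satisfy the perturbed constraint (25), while the classo would still be fit with the unperturbed pair (C, b).

```python
import numpy as np

rng = np.random.default_rng(6)
m, p, a = 5, 50, 0.25                      # a is the perturbation level
C = rng.standard_normal((m, p))
b = rng.standard_normal(m)

u = rng.uniform(0.0, a, size=m)
b_perturbed = (1.0 + u) * b                # elementwise product in (25)

beta_Abar = np.zeros(p - m)
beta_Abar[:5] = rng.standard_normal(5)
beta_A = np.linalg.solve(C[:, :m], b_perturbed - C[:, m:] @ beta_Abar)
beta_star = np.concatenate([beta_A, beta_Abar])

print(np.allclose(C @ beta_star, b_perturbed))   # beta* satisfies the perturbed constraints
print(np.allclose(C @ beta_star, b))             # ... but not the constraints used in the fit
```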

The classo and relaxed classo were then fit using the usual constraint, Cβ = b. Table 2 reports the new RMSE values for the three settings considered in Section 5.1, under the ρ = 0 correlation structure. We tested four values for a: 0.25, 0.50, 0.75 and 1.00. The largest value of a corresponds to a 50% average error in the constraint. The results suggest that the classo and relaxed classo are surprisingly robust to random violations of the constraints. While both methods deteriorated slightly as a increased, they were still superior to their unconstrained counterparts for all values of a and all settings.

a Lasso Classo Relaxed Lasso Relaxed Classo
n = (0.0) 0.5(0.0) 0.43(0.0) 0.30(0.0)
p = (0.0) 0.5(0.0) 0.4(0.0) 0.33(0.0)
m = (0.0) 0.5(0.0) 0.4(0.0) 0.35(0.0)
(0.0) 0.53(0.0) 0.4(0.0) 0.38(0.0)
n = (0.08) 3.(0.09) 3.4(0.09).98(0.0)
p = (0.08) 3.(0.09) 3.3(0.09).99(0.0)
m = (0.08) 3.8(0.09) 3.6(0.09) 3.07(0.0)
(0.08) 3.8(0.09) 3.3(0.09) 3.06(0.0)
n = (0.07).7(0.03) 6.80(0.08) 0.95(0.03)
p = (0.07).8(0.03) 6.8(0.08) 0.99(0.03)
m = (0.07).(0.03) 6.79(0.08).03(0.03)
(0.07).7(0.04) 6.86(0.08).08(0.04)

Table 2: Average RMSE over 00 training data sets in the three simulation settings, using the ρ = 0 correlation structure. The numbers in parentheses are standard errors. The true regression coefficients were generated according to (25).

5.3 On-line Auto Loan Data

We now consider an application of the classo to estimating a non-parametric demand function on the On-line Auto Lending data set, courtesy of The Center for Pricing and Revenue Management of the Columbia Business School. As the name suggests, the data contains observations on all United States auto loans approved by an online lender. Each record corresponds to a specific customer who was approved for a loan and then either chose to accept or decline the offer. The data also includes characteristics of the customer and the offered loan, such as FICO score and loan duration. In particular, the data is segmented by three variables: credit tier (Tier I-VII, with Tier I being the highest), length of loan (0-36 months, 37-48

months, 49-60 months, or over 60 months), and type of car loan (new, used, or refinanced). As with previous analyses of the auto lending data set (Besbes et al., 2010), the analysis here is performed on a segment of the overall data: Tier I loans for new cars with a duration of 0 to 36 months. This subset contained observations on 8698 customers, which we randomly divided into two-thirds training data and one-third validation data. We used the offered rate as the predictor and the loan's funding status, 1 if the customer accepted the loan and 0 otherwise, as the response. A cubic spline basis, b(t), was used to model the demand curve as a non-linear function of the interest rate. To ensure that the demand function was monotonically decreasing in the interest rate, we computed the first derivative of the spline over a fine grid of m interest rates between the minimum and the maximum observed rate. We then fit the classo over the training data, with the ith row of X derived from b(x_i), where x_i is the ith interest rate, and the first derivative constrained to be non-positive at each grid point, i.e. b′(u_l)^T β ≤ 0 for l = 1, ..., m. We compared the classo fit to an unconstrained lasso with the same spline design matrix. The degrees of freedom for b(t) and the values of λ for the lasso and classo were both chosen by minimizing RMSE on the validation data.

To provide a comparison with a traditional parametric approach, we also fit a logistic regression curve to the data. Figure 1 plots the three resulting demand functions, with the classo in black, the traditional logistic fit in red, and the standard lasso in blue. Jittered versions of the test data are also provided for context. As expected, the logistic curve provides a smooth, monotone decreasing fit to the data. However, the constrained parametric form of the logistic means that it is not capable of simultaneously modeling the high acceptance rate near .5% and the much lower acceptance rate near 5%. By comparison, the lasso fit using the spline basis is far more flexible, modeling both the high acceptance rate near .5% and the much lower acceptance rate near 5%. However, because the curve is unconstrained, it has an unrealistic shape which suggests that demand is increasing and decreasing in wild oscillations. Of course, if the validation data had suggested a smaller degrees of freedom, the lasso would produce a smoother curve, but even in that case the resulting fit will generally not be monotone. By comparison, the classo fit combines the flexibility of the lasso with the intuitively reasonable monotone decreasing shape of the logistic curve. The classo suggests a sharp decline in demand between interest rates of .5% and 3.5%, followed by a relatively constant demand between 3.5% and 4.0%, and then a further decline for rates above 4%. It is interesting that both 3.5% and 4.0% seem to represent distinct change points in demand, possibly corresponding to psychological effects from mentally rounding to the nearest half percent.

5.4 Demand Function Simulation

We ran a simulation study to test the accuracy of the classo in a setting such as the on-line auto loan demand estimation problem. To mimic the real-life setting, we generated Bernoulli responses with

P(Y_i = 1 | X_i) = (1 − γ) L(X_i) + γ CL(X_i),   0 ≤ γ ≤ 1,   (26)

where L(X) and CL(X) respectively correspond to the logistic and classo fits to the auto loan data. With γ = 0 the data was generated from a smooth underlying demand function, while for larger values of γ the true demand function had more flexibility. We tested γ = 0, 1/3, 2/3 and 1, and for each value generated 00 data sets according to (26). The simulated interest rates were generated along an equally spaced grid from .5%

24 γ Lasso Logit Classo 0.64(0.08) 0.45(0.04).9(0.07) /3.68(0.09) 0.87(0.04).(0.08) /3.03(0.0).00(0.05).50(0.08).4(0.0) 3.8(0.07).67(0.0) Table 3: A comparison of the prediction accuracies of the lasso, logistic, and classo methods. The values are the mean integrated squared differences over 00 simulation runs (multiplied by a factor of 0 3 for comparison purposes, as are the standard errors in parentheses). to 5.0% in step sizes of 0.00% for a total of 5 observations. Each data set was then randomly divided into one-half training data, one-quarter validation data, and one-quarter testing data. The classo, lasso and logistic regression were then fit in the same fashion as in Section 5.3. In particular the classo and lasso were each fit over a range of degrees of freedom (df) and λ values, then the combination of df and λ which yielded the minimum SSE = i (Y i Ŷi) value on the validation set was chosen as the optimal model for a particular simulation. That model was then used to fit the testing data. Table 3 provides a comparison of the prediction accuracies over the testing data for the three methods on each of the four choices of γ. The values in the table are the estimated mean integrated squared differences (ISD) ( P (Y = X) Ŷ (x) ) dx computed by averaging the test observations and the 00 simulation runs, where Ŷ (x) is the predicted demand function at x for a given method. As one would expect, the logistic regression provides superior results for low values of γ. However, as the true demand function becomes more flexible the classo starts to dominate. At γ = the classo has an ISD that is less than half that of the logistic regression fit. While the lasso also improves relative to the logistic fit for larger values of γ, it is clearly inferior to the classo in all settings. 6. Conclusion In this paper we have illustrated a few of the wide range of statistical applications for the classo and developed computationally efficient path algorithms for computing its solutions. 4

25 Our theoretical results demonstrate that, for correctly chosen constraints, the classo solution provides the potential for improvements in prediction accuracy. Furthermore, our simulation results show that the classo generally outperforms the lasso, not only when the constraints hold exactly, but also when there is some reasonable error in the assumption on the constraints. The relaxed version of classo appears to offer even greater improvements. We have concentrated here on the lasso optimization problem. However, we believe that a similar approach could be adopted to fit more general optimization problems of the form: arg min β Y Xβ + λ p ρ(β j ) subject to Cβ = b, (7) j= where ρ is a general penalty function, such as the SCAD penalty (Fan and Li, 00). A common approach for fitting the unconstrained version of (7) is to iteratively approximate ρ using a first order Taylor series, and then perform the lasso on the new criterion. A similar approach could be used for solving (7). In this setting one could again incorporate the constraint into the optimization function, use a prior estimate of the coefficients to select the terms we believe will be non-zero in the final solution, and incorporate these terms into the differentiable part of the criterion. Such an approach suggests a promising area of future research. A. Proof of Lemma Since D has full column rank, by reordering the rows if necessary, we can write D as D = D D where D R p p is an invertible matrix and D R r p p. Then, Y Xθ + λ Dθ = Y XD D θ = Y ( XD )D θ + λ D θ + λ D θ + λ D θ + λ D D D θ Using the change of variables β = D θ, β = D D D θ = D D β, and β = β β, 5

26 we can rewrite the generalized lasso problem as follows: min Y θ R p Xθ { + λ Dθ = min Y β R r where X = [ ] XD 0 = min β R r XD β { Y Xβ + λ β } + λ β D D β β = 0 } Cβ = 0 and C = [ D D I ]. Note that θ = [D 0] thus, the generalized lasso is a special case of the constrained lasso. B. Proofs of Lemmas and 3, β β, and Consider any index set A such that C A is non-singular. The constraint Cβ = b can be written as C A β A + C Ā β Ā = b β A = C A (b CĀ β Ā ), and thus, we can determine β A from β Ā. Then, for any β such that Cβ = b, Y Xβ + λ β = Y XĀ β Ā X A β A + λ βā + λ β A = = Y XĀ β Ā X A C A (b CĀ β Ā ) + λ βā + λ C A (b CĀ β Ā ) [ Y X A C A b] ( X Ā X A C A CĀ ) βā + λ βā + λ C A (b CĀ β Ā ). By using the change of variable θ = β Ā, the classo problem is equivalent to the following unconstrained optimization problem: min θ Y X θ + λ θ + λ C A (b CĀ θ), and let θ A denote a solution to the above optimization problem. Then, a solution to the original classo problem is given β = and this completes Lemma. β A β Ā = C A (b CĀ θ A ) θ A,, 6

27 To prove Lemma 3, consider an arbitrary θ A,s and s such that s = sign ( C A (b CĀ θ A,s ) ). Let F : R p m R + denote the objective function of the optimization problem in Equation 0; that is, for each θ, F (θ) = Y X θ + λ θ + λ C A (b CĀ θ). By definition of θ A, we have F (θ A ) F (θ A,s ). To complete the proof, it suffices to show that F (θ A,s ) F (θ A ). Suppose, on the contrary, that F (θ A,s ) > F (θ A ). For each α [0, ], let θ α R p m and g(α) R + be defined by: θ α ( α)θ A,s + αθ A and g(α) F (θ α ) = F (( α)θ A,s + αθ A ). Note that g is a convex function on [0, ] because F ( ) is convex. Moreover, we have g(0) = F (θ A,s ) > F (θ A ) = g(). Thus, for all 0 < α, g(α) = g(α + ( α) 0) αg() + ( α)g(0) < g(0). By our hypothesis, s i = for all i, and thus, every coordinate of the vector C A (b CĀ θ A,s ) is bounded away from zero. So, we can choose α 0 sufficiently small so that Then, it follows that sign ( C A (b CĀ θ α0 ) ) = sign ( C A (b CĀ θ A,s ) ). F (θ α0 ) = Y X θ α0 + λ θ α 0 + λs T C A (b CĀ θ α0 ) = (Y ) T Y = (Y ) T Y (Y ) T X θ α0 λs T C A CĀ θ α0 + (X θ α0 ) T X θ α0 (Y ) T X θ α0 ( λx ( C A CĀ ) T s ) T X θ α0 + (X θ α0 ) T X θ α0 + λ θ α0 + λs T C A b = Ỹ X θ α0 + λ θ α0 + d + λ θ α0 + λs T C A b where the last equality follows from Ỹ = Y + λx ( C A CĀ ) T s, and d is defined by d = λ(y ) T X ( C A CĀ ) T s [ λx ( C A CĀ ) T s ] T [ λx ( C A CĀ ) T s ] Also, the third equality follows from the fact that (X ) T X = I. 7 + λs T C A b.

Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising

Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising Gareth M. James, Courtney Paulson, and Paat Rusmevichientong Gareth M. James is the E. Morgan Stanley Chair

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

The Double Dantzig. Some key words: Dantzig Selector; Double Dantzig; Generalized Linear Models; Lasso; Variable Selection.

The Double Dantzig. Some key words: Dantzig Selector; Double Dantzig; Generalized Linear Models; Lasso; Variable Selection. The Double Dantzig GARETH M. JAMES AND PETER RADCHENKO Abstract The Dantzig selector (Candes and Tao, 2007) is a new approach that has been proposed for performing variable selection and model fitting

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

By Gareth M. James, Jing Wang and Ji Zhu University of Southern California, University of Michigan and University of Michigan

By Gareth M. James, Jing Wang and Ji Zhu University of Southern California, University of Michigan and University of Michigan The Annals of Statistics 2009, Vol. 37, No. 5A, 2083 2108 DOI: 10.1214/08-AOS641 c Institute of Mathematical Statistics, 2009 FUNCTIONAL LINEAR REGRESSION THAT S INTERPRETABLE 1 arxiv:0908.2918v1 [math.st]

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Recap from previous lecture

Recap from previous lecture Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Sparse Gaussian conditional random fields

Sparse Gaussian conditional random fields Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Variable selection using Adaptive Non-linear Interaction Structures in High dimensions

Variable selection using Adaptive Non-linear Interaction Structures in High dimensions Variable selection using Adaptive Non-linear Interaction Structures in High dimensions Peter Radchenko and Gareth M. James Abstract Numerous penalization based methods have been proposed for fitting a

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

In Search of Desirable Compounds

In Search of Desirable Compounds In Search of Desirable Compounds Adrijo Chakraborty University of Georgia Email: adrijoc@uga.edu Abhyuday Mandal University of Georgia Email: amandal@stat.uga.edu Kjell Johnson Arbor Analytics, LLC Email:

More information

Optimization Problems

Optimization Problems Optimization Problems The goal in an optimization problem is to find the point at which the minimum (or maximum) of a real, scalar function f occurs and, usually, to find the value of the function at that

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Regularization and Variable Selection via the Elastic Net

Regularization and Variable Selection via the Elastic Net p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

High dimensional thresholded regression and shrinkage effect

High dimensional thresholded regression and shrinkage effect J. R. Statist. Soc. B (014) 76, Part 3, pp. 67 649 High dimensional thresholded regression and shrinkage effect Zemin Zheng, Yingying Fan and Jinchi Lv University of Southern California, Los Angeles, USA

More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2013 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche) 1 Graph

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

A Fast Algorithm for Computing High-dimensional Risk Parity Portfolios

A Fast Algorithm for Computing High-dimensional Risk Parity Portfolios A Fast Algorithm for Computing High-dimensional Risk Parity Portfolios Théophile Griveau-Billion Quantitative Research Lyxor Asset Management, Paris theophile.griveau-billion@lyxor.com Jean-Charles Richard

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Chapter 3. Linear Models for Regression

Chapter 3. Linear Models for Regression Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear

More information

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club 36-825 1 Introduction Jisu Kim and Veeranjaneyulu Sadhanala In this report

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Outlier detection and variable selection via difference based regression model and penalized regression

Outlier detection and variable selection via difference based regression model and penalized regression Journal of the Korean Data & Information Science Society 2018, 29(3), 815 825 http://dx.doi.org/10.7465/jkdi.2018.29.3.815 한국데이터정보과학회지 Outlier detection and variable selection via difference based regression

More information

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

EE 381V: Large Scale Optimization Fall Lecture 24 April 11 EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Machine Learning. Regression basics. Marc Toussaint University of Stuttgart Summer 2015

Machine Learning. Regression basics. Marc Toussaint University of Stuttgart Summer 2015 Machine Learning Regression basics Linear regression, non-linear features (polynomial, RBFs, piece-wise), regularization, cross validation, Ridge/Lasso, kernel trick Marc Toussaint University of Stuttgart

More information

Robust Variable Selection Through MAVE

Robust Variable Selection Through MAVE Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse

More information

DEGREES OF FREEDOM AND MODEL SEARCH

DEGREES OF FREEDOM AND MODEL SEARCH Statistica Sinica 25 (2015), 1265-1296 doi:http://dx.doi.org/10.5705/ss.2014.147 DEGREES OF FREEDOM AND MODEL SEARCH Ryan J. Tibshirani Carnegie Mellon University Abstract: Degrees of freedom is a fundamental

More information

Sparse PCA with applications in finance

Sparse PCA with applications in finance Sparse PCA with applications in finance A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon 1 Introduction

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

arxiv: v2 [math.st] 2 Jul 2017

arxiv: v2 [math.st] 2 Jul 2017 A Relaxed Approach to Estimating Large Portfolios Mehmet Caner Esra Ulasan Laurent Callot A.Özlem Önder July 4, 2017 arxiv:1611.07347v2 [math.st] 2 Jul 2017 Abstract This paper considers three aspects

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space

Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Jinchi Lv Data Sciences and Operations Department Marshall School of Business University of Southern California http://bcf.usc.edu/

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Within Group Variable Selection through the Exclusive Lasso

Within Group Variable Selection through the Exclusive Lasso Within Group Variable Selection through the Exclusive Lasso arxiv:1505.07517v1 [stat.me] 28 May 2015 Frederick Campbell Department of Statistics, Rice University and Genevera Allen Department of Statistics,

More information

Sparsity and the Lasso

Sparsity and the Lasso Sparsity and the Lasso Statistical Machine Learning, Spring 205 Ryan Tibshirani (with Larry Wasserman Regularization and the lasso. A bit of background If l 2 was the norm of the 20th century, then l is

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

19.1 Problem setup: Sparse linear regression

19.1 Problem setup: Sparse linear regression ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 19: Minimax rates for sparse linear regression Lecturer: Yihong Wu Scribe: Subhadeep Paul, April 13/14, 2016 In

More information

2.3. Clustering or vector quantization 57

2.3. Clustering or vector quantization 57 Multivariate Statistics non-negative matrix factorisation and sparse dictionary learning The PCA decomposition is by construction optimal solution to argmin A R n q,h R q p X AH 2 2 under constraint :

More information

A Short Introduction to the Lasso Methodology

A Short Introduction to the Lasso Methodology A Short Introduction to the Lasso Methodology Michael Gutmann sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology March 9, 2016 Michael

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Prediction & Feature Selection in GLM

Prediction & Feature Selection in GLM Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis

More information

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Jonathan Taylor - p. 1/15 Today s class Bias-Variance tradeoff. Penalized regression. Cross-validation. - p. 2/15 Bias-variance

More information

Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models

Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Christopher Paciorek, Department of Statistics, University

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress) Compressed Sensing in Cancer Biology? (A Work in Progress) M. Vidyasagar FRS Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar University

More information

Sparse and Robust Optimization and Applications

Sparse and Robust Optimization and Applications Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Introduction to the genlasso package

Introduction to the genlasso package Introduction to the genlasso package Taylor B. Arnold, Ryan Tibshirani Abstract We present a short tutorial and introduction to using the R package genlasso, which is used for computing the solution path

More information

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information