Boosted Lasso and Reverse Boosting

Peng Zhao, University of California, Berkeley, USA.
Bin Yu, University of California, Berkeley, USA.

Summary. This paper introduces the concept of a backward step, in contrast with forward-fashion algorithms like Boosting and Forward Stagewise Fitting. Like classical elimination methods, this backward step works by shrinking the model complexity of an ensemble learner. Through a step analysis, we show that this additional step is necessary for minimizing the $L_1$ penalized loss (Lasso loss). We also propose a BLasso algorithm, a combination of backward and forward steps, which is able to produce the complete regularization path for Lasso problems. Moreover, BLasso can be generalized to solve problems with a general convex loss and a general convex penalty.

1. Introduction

Boosting is one of the most successful and practical learning ideas to have recently come from the machine learning community. Since its inception in 1990 (Schapire, 1990; Freund, 1995; Freund and Schapire, 1996), it has been tried on an amazing array of data sets. The motivation behind boosting is a procedure that combines the outputs of many weak or base learners to produce a powerful ensemble. The improved performance is impressive.

Another successful and vastly popular idea that recently comes from the statistics community is the Lasso (Tibshirani, 1996, 1997). The Lasso is a shrinkage method that regularizes fitted models using an $L_1$ penalty. Its popularity can be explained in several ways. Since nonparametric models that fit the training data well often have low bias but large variance, prediction accuracy can sometimes be improved by shrinking a model or making it more sparse. The regularization resulting from the $L_1$ penalty leads to sparse solutions, where there are few basis functions with nonzero weights (among all possible choices). This statement is proved rigorously in recent work of Donoho and co-authors (1994, 2004) in the specialized setting of over-complete representations and large under-determined systems of linear equations. Furthermore, the sparse models induced by the Lasso are more interpretable and are often preferred in areas such as biostatistics and the social sciences.

While it is a natural idea to combine boosting and the Lasso to obtain a regularized boosting procedure, it is also intriguing that boosting, without any additional regularization, has its own resistance to overfitting. For specific cases, e.g. L2Boost (Friedman, 2001), this resistance is understood to some extent (Buhlmann and Yu, 2002). However, it was not until later, when Forward Stagewise Fitting was introduced as a boosting procedure with much more cautious steps (hence it is also called $\epsilon$-boosting, cf. Friedman 2001), that a striking similarity between Forward Stagewise Fitting and the Lasso was observed (Hastie et al. 2001, Efron et al. 2003). This link between the Lasso and Forward Stagewise Fitting is formally described in the linear regression case through LARS (Least Angle Regression, Efron et al. 2003).

It is also known that for special cases (e.g. orthogonal design) Forward Stagewise Fitting can approximate the Lasso path arbitrarily closely, but in general the two differ from each other despite their similarity. Nevertheless, Forward Stagewise Fitting is still used as an approximation to the Lasso for different regularization parameters, because it is computationally prohibitive to solve the Lasso for many regularization parameters even in the linear regression case.

Both the similarity and the difference between Forward Stagewise Fitting and the Lasso can be seen clearly from our analysis. Like many researchers in the statistics community, we take a functional steepest descent view of boosting (Breiman 1999, Mason et al. 1999, Friedman et al. 2000, Friedman 2001), but two things distinguish our analysis from previous ones:
(a) We look at the Lasso loss, i.e. the original loss plus the $L_1$ penalty, in contrast with previous studies where only the loss function is studied. This directly relates every fitting stage to its impact on the Lasso loss. For instance, it can be seen that, except for special cases like early fitting stages or orthogonal designs, a boosting or Forward Stagewise Fitting step can lead to an increase in the Lasso loss, which consequently parts the solutions generated by Boosting and Forward Stagewise Fitting from those generated by the Lasso.
(b) Unlike previous studies, where the steepness of a descending step is judged by the gradient, we judge steepness by the difference in Lasso loss. One technical reason for this change is that the Lasso penalty is not differentiable everywhere, so the gradient cannot be used. But using the difference is more than a compromise forced by this technical difficulty; in fact it leads to a more transparent view of the minimization of the Lasso loss and later gives rise to our backward step, which is described next.

A critical observation from our analysis is that, despite their differences, Forward Stagewise Fitting and Boosting both work in a forward fashion (hence the name Forward Stagewise Fitting). The model complexity, measured by the $L_1$ norm of the model parameters, may fluctuate slightly but holds a dominating upward trend for both methods. This is because both methods proceed only by minimizing the empirical loss, which gradually builds up the model complexity. Borrowing an idea from the Forward Addition and Backward Elimination dual, we construct a backward step. It uses the same minimization rule as the forward step to define each fitting stage, but utilizes an additional rule to force the model complexity to decrease. This backward step not only serves as the key to understanding the difference between Forward Stagewise Fitting and the Lasso, but is also crucial in dealing with situations where the initial model already overfits. Such situations often arise in online settings where the model fitted from a previous trial needs to be refitted given new data. As an immediate application of this backward step, we construct an algorithm called Reverse Boosting that uses purely backward steps, starting from an overfitted model and reducing its model complexity. This Reverse Boosting algorithm is conceptually appealing as well as practically effective for shrinking models with high model complexity.

Not only does the backward step identify the similarity and difference between the Lasso and Forward Stagewise Fitting, it also reveals a stagewise procedure by which the Lasso solutions can be approximated for general convex loss functions. We call this stagewise procedure Boosted Lasso (BLasso) to relate it to both of its ancestors: it uses a stagewise steepest descent procedure similar to Boosting and Forward Stagewise Fitting, and the solutions given by BLasso are close approximations of Lasso solutions with different regularization parameters. Since BLasso has the same order of computational complexity as Forward Stagewise Fitting and, unlike Forward Stagewise Fitting, can be seen to approximate the Lasso solutions arbitrarily closely, we think BLasso has a substantial advantage over Forward Stagewise Fitting in approximating Lasso solutions.

The fact that BLasso can also be easily generalized to give the regularization path for other penalized loss functions with general convex penalties also comes as a pleasant surprise. (Another algorithm that works in a similar fashion was developed independently by Rosset 2004.)

After a brief overview of Boosting and Forward Stagewise Fitting in Section 2.1 and of the Lasso in Section 2.2, Section 3 covers our step analysis, introduces the backward step and the Reverse Boosting algorithm as a backward complement of the forward fitting procedures, and discusses their application as a regularization method. Section 4 proposes BLasso as an algorithm that approximates the regularization path of the Lasso and of other penalized loss functions with convex penalties. In Section 5, we support the theory and algorithms with simulated and real data sets which demonstrate the attractiveness of Boosted Lasso. Finally, Section 6 contains a discussion on the choice of step size and on the application of BLasso to online learning, together with a summary of the paper.

2. Boosting, Forward Stagewise Fitting and the Lasso

The nature of stagewise fitting is responsible to a large extent for boosting's resistance to overfitting. Forward Stagewise Fitting uses more fitting stages by limiting the step size at each stage to a small fixed constant, and produces solutions that are strikingly similar to the Lasso. We first give a brief overview of these two algorithms, followed by an overview of the Lasso.

2.1. Boosting and Forward Stagewise Fitting

Boosting algorithms can be seen as functional gradient descent techniques. The task is to estimate a function $F: \mathbb{R}^d \to \mathbb{R}$ that minimizes an expected loss

$$E[C(Y, F(X))], \qquad C(\cdot,\cdot): \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+, \qquad (1)$$

based on data $Z_i = (Y_i, X_i)$, $i = 1, \dots, n$. The univariate $Y$ can be continuous (regression problem) or discrete (classification problem). The most prominent examples of the loss function $C(\cdot,\cdot)$ include the classification margin, the logit loss and the $L_2$ loss. The family of $F(\cdot)$ being considered is the set of ensembles of base learners

$$D = \Big\{F : F(x) = \sum_{j=1}^m \beta_j h_j(x),\ x \in \mathbb{R}^d,\ \beta_j \in \mathbb{R}\Big\}. \qquad (2)$$

For example, the learner $h_j(\cdot)$ could be a decision tree of size 5, where each $h_j(\cdot)$ describes a tree with different splits. Letting $\beta = (\beta_1, \dots, \beta_m)^T$, we can reparametrize the problem using

$$L(Z; \beta) := C(Y, F(X)), \qquad (3)$$

where the specification of $F$ is hidden by $L$, which simplifies our notation. The parameters $\hat\beta$ are found by minimizing the empirical loss

$$\hat\beta = \arg\min_\beta \sum_{i=1}^n L(Z_i; \beta). \qquad (4)$$

Despite the fact that the empirical loss function is often convex in $\beta$, this is usually a formidable optimization problem for a moderately rich function family, and we usually settle for suboptimal approximate solutions obtained by a progressive procedure:

$$(\hat j, \hat g) = \arg\min_{j,\, g} \sum_{i=1}^n L(Z_i; \hat\beta^t + g 1_j), \qquad (5)$$
$$\hat\beta^{t+1} = \hat\beta^t + \hat g 1_{\hat j}. \qquad (6)$$

It is often useful to further divide (5) into two parts: finding $g$ given $j$, which is typically trivial; and finding $j$, which is the difficult part and for which approximate solutions are found. Note also that finding $j$ entails estimating $g$ as well. A typical strategy is to apply functional gradient descent. This gradient descent view has been recognized and refined by various authors, including Breiman (1999), Mason et al. (1999), Friedman et al. (2000), Friedman (2001) and Buhlmann and Yu (2002). The famous AdaBoost, LogitBoost and L2Boost can all be viewed as implementations of this strategy for different loss functions.

Forward Stagewise Fitting is a similar method for approximating the minimization problem described by (5), with additional regularization. It disregards the step size $\hat g$ in (6) and instead updates $\hat\beta^t$ by a fixed step size $\epsilon$:

$$\hat\beta^{t+1} = \hat\beta^t + \epsilon\,\mathrm{sign}(\hat g) 1_{\hat j}.$$

When Forward Stagewise Fitting was introduced (Hastie et al. 2001, Efron et al. 2002), it was described only for the $L_2$ regression setting. We find it more sensible to also remove the minimization over $g$ in (5):

$$(\hat j, \hat s) = \arg\min_{j,\, s = \pm\epsilon} \sum_{i=1}^n L(Z_i; \hat\beta^t + s 1_j), \qquad (7)$$
$$\hat\beta^{t+1} = \hat\beta^t + \hat s 1_{\hat j}. \qquad (8)$$

Notice that this is only a change of form; the underlying mechanics of the algorithm remain unchanged in the $L_2$ regression setting, as can be seen later in Section 4.

Initially all coefficients are zero. At each successive step, the coefficient whose update best reduces the empirical loss is selected; that coefficient $\hat\beta_{\hat j}$ is then incremented or decremented by a small amount, while all other coefficients $\hat\beta_j$, $j \neq \hat j$, are left unchanged. By taking small cautious steps, Forward Stagewise Fitting imposes some implicit regularization. After $T < \infty$ iterations, many of the coefficients will be zero, namely those that have yet to be incremented. The others will tend to have absolute values smaller than their corresponding best fits. This shrinkage and sparsity property is explained by the striking similarity between the solutions given by Forward Stagewise Fitting and the Lasso, which we overview next.
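To make the update rule (7)-(8) concrete, here is a minimal sketch of Forward Stagewise Fitting for squared loss with a linear model, where each base learner is simply one column of the design matrix. The function name, the NumPy usage and the default parameter values are our own illustration and are not part of the original paper.

```python
import numpy as np

def forward_stagewise(X, y, epsilon=0.01, n_steps=5000):
    """Minimal sketch of Forward Stagewise Fitting for squared loss.

    At each step, try every coordinate j and both signs s = +/-epsilon,
    and take the move that most reduces the empirical squared loss.
    """
    n, m = X.shape
    beta = np.zeros(m)
    for _ in range(n_steps):
        residual = y - X @ beta
        best = None
        for j in range(m):
            for s in (epsilon, -epsilon):
                # empirical loss after the candidate step beta_j <- beta_j + s
                loss = np.sum((residual - s * X[:, j]) ** 2)
                if best is None or loss < best[0]:
                    best = (loss, j, s)
        _, j_hat, s_hat = best
        beta[j_hat] += s_hat
    return beta
```

With standardized covariates this reproduces the epsilon-boosting behaviour described above: coefficients grow in small increments of $\pm\epsilon$, so the number of iterations plays the role of a regularization parameter.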

2.2. The Lasso

Suppose that we have data $\{Z_i = (Y_i, X_i);\ i = 1, \dots, n\}$, where the $X_i$ are the predictor variables and the $Y_i$ are the responses. As in the usual model fitting set-up, we are given some loss function $L(Z; \beta)$ which is convex in $\beta$ given $Z$; the empirical loss is calculated as $\sum_{i=1}^n L(Z_i; \beta)$, which is then convex in $\beta$ as well. The most common loss functions, such as the classification margin, the logit loss, the $L_2$ loss and the Huber loss, all satisfy this convexity assumption.

The $L_1$ penalty $T(\beta)$ is defined as the $L_1$ norm of $\beta = (\beta_1, \dots, \beta_m)^T$,

$$T(\beta) = \|\beta\|_1 = \sum_{j=1}^m |\beta_j|. \qquad (9)$$

Letting $\hat\beta = (\hat\beta_1, \dots, \hat\beta_m)^T$, the Lasso estimate is defined by

$$\hat\beta = \arg\min_\beta \sum_{i=1}^n L(Z_i; \beta) \quad \text{subject to } T(\beta) \le t. \qquad (10)$$

Here $t \ge 0$ is a tuning parameter. For some choice of regularization parameter $\lambda$ this is equivalent to minimizing the Lasso loss:

$$\Gamma(\beta; \lambda) = \sum_{i=1}^n L(Z_i; \beta) + \lambda T(\beta). \qquad (11)$$

The parameter $\lambda \ge 0$ controls the amount of regularization applied to the estimates. Setting $\lambda = 0$ reduces the Lasso problem to minimizing the unregularized empirical loss. On the other hand, a very large $\lambda$ completely shrinks $\hat\beta$ to 0, leading to an empty model. In general, moderate values of $\lambda$ cause shrinkage of the solutions towards 0, and some coefficients may be exactly equal to 0. This sparsity of Lasso solutions has been studied extensively, e.g. Efron et al. (2002), Donoho (1995) and Donoho et al. (2004).

Computation of the solution of the Lasso problem for a fixed $\lambda$ has been studied for special cases. Specifically, for least squares regression it is a quadratic programming problem with linear inequality constraints; for the SVM it can be transformed into a linear programming problem. But to obtain a fitted model that performs well on future data, we need to select an appropriate value of the tuning parameter $\lambda$. Efficient algorithms have been proposed for least squares regression (LARS, Efron et al. 2002) and the SVM (1-norm SVM, Zhu et al. 2003) to give the entire regularization path. How to give the entire regularization path of the Lasso problem for a general convex loss function has remained open. Moreover, even for the least squares and SVM cases, the known algorithms deal only with the stationary batch-data case; an adaptive online algorithm is needed to deal with real-time data.

We link the Lasso and Forward Stagewise Fitting through a step analysis of the Lasso loss and the introduction of a backward step in the next section. This additional backward step completes Forward Stagewise Fitting and gives rise to a BLasso algorithm, which will be described in Section 4.
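For concreteness, here is a small helper that evaluates the Lasso loss (11) for the squared-loss case; it is only a convenience used by the algorithm sketches later in this note, and the names are our own rather than the paper's.

```python
import numpy as np

def lasso_loss(X, y, beta, lam):
    """Lasso loss (11) with squared error as the empirical loss:
    sum_i (y_i - x_i beta)^2 + lam * ||beta||_1."""
    empirical_loss = np.sum((y - X @ beta) ** 2)
    return empirical_loss + lam * np.sum(np.abs(beta))
```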

3. Step Analysis and Reverse Boosting

One observation is that both Boosting and Forward Stagewise Fitting use only forward steps: they only take steps that lead to a direct reduction of the empirical loss. Compared with classical model selection methods like Forward Selection and Backward Elimination, or the growing and pruning of a classification tree, a backward counterpart is missing. In this section, we uncover this missing backward step through a step analysis of Forward Stagewise Fitting and the Lasso, and describe a Reverse Boosting algorithm which consists entirely of backward steps. As will be shown in the next section, the backward step turns out to be the last piece of the puzzle of why boosting-like procedures resemble Lasso solutions.

For a given $\beta$ and $\lambda > 0$, consider the impact of a small $\epsilon > 0$ change of $\beta_j$ on the Lasso loss $\Gamma(\beta; \lambda)$. For an $s = \pm\epsilon$,

$$\Delta_j \Gamma = \Big(\sum_{i=1}^n L(Z_i; \beta + s 1_j) - \sum_{i=1}^n L(Z_i; \beta)\Big) + \lambda\big(T(\beta + s 1_j) - T(\beta)\big) \qquad (12)$$
$$\phantom{\Delta_j \Gamma} = \Delta_j\Big(\sum_{i=1}^n L(Z_i; \beta)\Big) + \lambda\, \Delta_j T(\beta). \qquad (13)$$

Since $T(\beta)$ is simply the $L_1$ norm of $\beta$, $\Delta_j T(\beta)$ reduces to a simple form:

$$\Delta_j T(\beta) = \|\beta + s 1_j\|_1 - \|\beta\|_1 = |\beta_j + s| - |\beta_j| = \mathrm{sign}^+(s\beta_j)\,\epsilon, \qquad (14)$$

where $\mathrm{sign}^+$ is the usual sign function except that 0 is mapped to 1, i.e. $\mathrm{sign}^+(x) = 1$ if $x \ge 0$ and $\mathrm{sign}^+(x) = -1$ if $x < 0$. Equation (14) shows that an $\epsilon$-step's impact on the penalty has the same magnitude $\epsilon$ for every $j$; only the sign of the impact may vary.

Suppose that, for a given $\beta$, the forward steps for different $j$ have impacts on the penalty of the same sign. Then $\Delta_j T$ is a constant in (13) for all $j$, and minimizing the Lasso loss using fixed-size steps is equivalent to minimizing the empirical loss directly. At the early stages of Forward Stagewise Fitting, all forward steps move parameters away from zero, so the impacts of all forward steps on the penalty have positive sign. As the algorithm proceeds into later stages, some of the signs may change to negative, and minimizing the empirical loss is no longer equivalent to minimizing the Lasso loss. Thus, in the beginning, Forward Stagewise Fitting carries out a steepest descent algorithm that minimizes the Lasso loss and follows the Lasso regularization path; as it goes into later stages, the equivalence is broken and the two part ways. In fact, except for special cases like orthogonally designed covariates, the signs of the forward steps' impacts on the penalty can change from positive to negative. Such steps reduce the empirical loss and the penalty simultaneously and should therefore be preferred over other forward steps. Moreover, there can also be occasions where a step goes backward to reduce the penalty with a small sacrifice in empirical loss. In general, to minimize the Lasso loss, one needs to go back and forth to trade off the penalty against the empirical loss for different regularization parameters.

We call a direction that leads to a reduction of the penalty a backward direction, and define a backward step as follows. For a given $\hat\beta$, a backward step is a change $\Delta\hat\beta = s 1_j$, for some $j$, subject to $\hat\beta_j \neq 0$, $\mathrm{sign}(s) = -\mathrm{sign}(\hat\beta_j)$ and $|s| = \epsilon$. Making such a step reduces the penalty by a fixed amount $\lambda\epsilon$, but its impact on the empirical loss may vary; therefore we also require

$$\hat j = \arg\min_j \sum_{i=1}^n L(Z_i; \hat\beta + s_j 1_j) \quad \text{subject to } \hat\beta_j \neq 0 \text{ and } s_j = -\mathrm{sign}(\hat\beta_j)\,\epsilon,$$

i.e. $\hat j$ is picked such that the empirical loss after making the step is as small as possible.

While forward steps try to reduce the Lasso loss through minimizing the empirical loss, backward steps try to reduce the Lasso loss through minimizing the Lasso penalty. However, unlike other backward-forward duals, here the backward and forward steps do not necessarily contradict each other. During a fitting process, although it rarely happens, it is possible for a step to reduce both the empirical loss and the Lasso penalty. We do not distinguish such steps, which are both forward and backward, and they will not create any trouble in our study.
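The step analysis above can be written down directly in code. The sketch below, with our own function names and the squared loss used only for illustration, computes the change in Lasso loss (12)-(14) produced by a candidate step of size $\pm\epsilon$ on coordinate $j$; a negative value means the step is a descent step for the Lasso loss.

```python
import numpy as np

def lasso_loss_change(X, y, beta, lam, j, s):
    """Delta_j Gamma from (12)-(13): the change in Lasso loss when taking the
    step beta_j <- beta_j + s, with |s| = epsilon."""
    beta_new = beta.copy()
    beta_new[j] += s
    delta_empirical = np.sum((y - X @ beta_new) ** 2) - np.sum((y - X @ beta) ** 2)
    # Equation (14): the penalty change is +|s| when the step moves beta_j away
    # from zero (or beta_j == 0), and -|s| when it moves beta_j towards zero.
    delta_penalty = np.abs(beta_new[j]) - np.abs(beta[j])
    return delta_empirical + lam * delta_penalty
```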

Starting from a fitted model, identified by parameters $\hat\beta^0$, and using purely backward steps, we have the following algorithm.

Reverse Boosting

Step 1 (initialization). Given data $Z_i = (Y_i, X_i)$, $i = 1, \dots, n$, a small step size constant $\epsilon > 0$ and a starting point $\hat\beta^0 = (\hat\beta^0_1, \dots, \hat\beta^0_m)^T$, set the initial active index set $I^0_A = \{j;\ \hat\beta^0_j \neq 0\}$. Set the iteration counter $t = 0$.

Step 2 (backward stepping). Find which backward step to make:

$$\hat j = \arg\min_{j \in I^t_A} \sum_{i=1}^n L(Z_i; \hat\beta^t + s_j 1_j), \quad \text{where } s_j = -\mathrm{sign}(\hat\beta^t_j)\,\epsilon.$$

When $\hat\beta^t_{\hat j}$ is within one step size of zero, it is set to 0 and the active index set is reduced; otherwise a normal backward step is taken:

If $|\hat\beta^t_{\hat j}| \le \epsilon$, then $\hat\beta^{t+1} = \hat\beta^t - \hat\beta^t_{\hat j} 1_{\hat j}$ and $I^{t+1}_A = I^t_A \setminus \{\hat j\}$; otherwise $\hat\beta^{t+1} = \hat\beta^t + s_{\hat j} 1_{\hat j}$.

Step 3 (iteration). Increase $t$ by one and repeat Steps 2 and 3. Stop when $I^t_A = \emptyset$.

Compared with Forward Stagewise Fitting, Reverse Boosting is greedy in the opposite way. It always shrinks the model: $\|\hat\beta^t\|_1$ strictly decreases and eventually reaches 0 after roughly $\|\hat\beta^0\|_1 / \epsilon$ iterations. It can be used for pruning a fully boosted additive tree or, in general, for shrinking overfitted models. However, just as Forward Stagewise Fitting uses only forward steps, Reverse Boosting is equally incomplete since it uses only backward steps.
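A minimal sketch of Reverse Boosting for squared loss follows; it mirrors Steps 1-3 above, and the function name, default step size and the decision to return the whole shrinkage path are our own choices rather than the paper's.

```python
import numpy as np

def reverse_boosting(X, y, beta0, epsilon=0.01):
    """Sketch of Reverse Boosting (squared loss): purely backward steps that
    shrink ||beta||_1 by roughly epsilon per iteration until the model is empty."""
    beta = np.array(beta0, dtype=float)
    active = {j for j in range(len(beta)) if beta[j] != 0}
    path = [beta.copy()]
    while active:
        # Backward step: among active coefficients, move towards zero the one
        # whose step yields the smallest empirical loss.
        best = None
        for j in active:
            s_j = -np.sign(beta[j]) * epsilon
            cand = beta.copy()
            cand[j] += s_j
            loss = np.sum((y - X @ cand) ** 2)
            if best is None or loss < best[0]:
                best = (loss, j, s_j)
        _, j_hat, s_hat = best
        if abs(beta[j_hat]) <= epsilon:   # within one step of zero: drop it
            beta[j_hat] = 0.0
            active.remove(j_hat)
        else:
            beta[j_hat] += s_hat
        path.append(beta.copy())
    return path
```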

On the other hand, the backward step is a greatly important concept. It provides the freedom of going back instead of having to go forward, so models can be fitted using starting points other than 0. For example, to update a model using new data, or to fit models adaptively for time series, the previously fitted model can be used as the starting point, and only a few iterations are needed to fit the new model. By combining both forward and backward steps, we introduce the following Boosted Lasso algorithm.

4. Steepest Descent on the Lasso Loss and Boosted Lasso

Mixing forward and backward steps, and calculating the regularization parameter $\lambda$ along the way, gives rise to the Boosted Lasso (BLasso) algorithm. As the algorithm proceeds through its iterations, Lasso solutions along the whole regularization path are produced.

Boosted Lasso

Step 1 (initialization). Given data $Z_i = (Y_i, X_i)$, $i = 1, \dots, n$, and a small step size constant $\epsilon > 0$, take an initial forward step

$$(\hat j, \hat s_{\hat j}) = \arg\min_{j,\, s = \pm\epsilon} \sum_{i=1}^n L(Z_i; s 1_j), \qquad \hat\beta^0 = \hat s_{\hat j} 1_{\hat j}.$$

Then calculate the initial regularization parameter

$$\lambda^0 = \frac{1}{\epsilon}\Big(\sum_{i=1}^n L(Z_i; 0) - \sum_{i=1}^n L(Z_i; \hat\beta^0)\Big).$$

Set the active index set $I^0_A = \{\hat j\}$. Set $t = 0$.

Step 2 (backward and forward steps). Find the backward step that leads to the minimal empirical loss:

$$\hat j = \arg\min_{j \in I^t_A} \sum_{i=1}^n L(Z_i; \hat\beta^t + s_j 1_j), \quad \text{where } s_j = -\mathrm{sign}(\hat\beta^t_j)\,\epsilon.$$

Take this step if it leads to a decrease in the Lasso loss; otherwise force a forward step and relax $\lambda$ if necessary:

If $\Gamma(\hat\beta^t + s_{\hat j} 1_{\hat j}; \lambda^t) < \Gamma(\hat\beta^t; \lambda^t)$, then

$$\hat\beta^{t+1} = \hat\beta^t + s_{\hat j} 1_{\hat j}, \qquad \lambda^{t+1} = \lambda^t.$$

Otherwise,

$$(\hat j, \hat s_{\hat j}) = \arg\min_{j,\, s = \pm\epsilon} \sum_{i=1}^n L(Z_i; \hat\beta^t + s 1_j),$$
$$\hat\beta^{t+1} = \hat\beta^t + \hat s_{\hat j} 1_{\hat j},$$
$$\lambda^{t+1} = \min\Big[\lambda^t,\ \frac{1}{\epsilon}\Big(\sum_{i=1}^n L(Z_i; \hat\beta^t) - \sum_{i=1}^n L(Z_i; \hat\beta^{t+1})\Big)\Big],$$
$$I^{t+1}_A = I^t_A \cup \{\hat j\}.$$

Step 3 (iteration). Increase $t$ by one and repeat Steps 2 and 3. Stop when $\lambda^t \le 0$.

BLasso works by choosing between backward and forward steps by examining the Lasso loss, and relaxes the regularization once the minimum of the Lasso loss under the current regularization is reached.
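The following is a compact sketch of BLasso for the classical Lasso case (squared loss, $L_1$ penalty) following Steps 1-3; the function names, the safeguard `max_steps` and the snapping of near-zero coefficients are our own practical choices, so this is an illustration rather than the authors' reference implementation.

```python
import numpy as np

def blasso(X, y, epsilon=0.01, max_steps=100000):
    """Sketch of Boosted Lasso (BLasso) for squared loss with an L1 penalty.
    Returns the list of (lambda, beta) pairs visited along the path."""
    n, m = X.shape

    def emp(b):                       # empirical loss
        return np.sum((y - X @ b) ** 2)

    def gamma(b, lam):                # Lasso loss (11)
        return emp(b) + lam * np.sum(np.abs(b))

    def one_hot(j):
        e = np.zeros(m)
        e[j] = 1.0
        return e

    def best_forward(b):
        # (loss, j, s) minimizing the empirical loss over j and s = +/-epsilon
        return min((emp(b + s * one_hot(j)), j, s)
                   for j in range(m) for s in (epsilon, -epsilon))

    # Step 1: initial forward step and initial lambda
    beta = np.zeros(m)
    loss0 = emp(beta)
    loss1, j_hat, s_hat = best_forward(beta)
    beta[j_hat] += s_hat
    lam = (loss0 - loss1) / epsilon
    active = {j_hat}
    path = [(lam, beta.copy())]

    for _ in range(max_steps):
        if lam <= 0:
            break
        # Step 2: best backward step over the active (nonzero) coordinates
        backward = None
        for j in active:
            s_j = -np.sign(beta[j]) * epsilon
            cand = beta + s_j * one_hot(j)
            if backward is None or emp(cand) < backward[0]:
                backward = (emp(cand), cand, j)
        if backward is not None and gamma(backward[1], lam) < gamma(beta, lam):
            beta = backward[1]                      # backward step, lambda unchanged
            if abs(beta[backward[2]]) < epsilon / 2:
                beta[backward[2]] = 0.0             # coefficient returned to zero
                active.discard(backward[2])
        else:
            loss_before = emp(beta)                 # forced forward step
            loss_after, j_hat, s_hat = best_forward(beta)
            beta = beta + s_hat * one_hot(j_hat)
            active.add(j_hat)
            lam = min(lam, (loss_before - loss_after) / epsilon)
        path.append((lam, beta.copy()))
    return path
```

The returned path starts at the largest $\lambda$ (one nonzero coefficient) and ends near the unregularized fit, which is the direction in which the algorithm naturally traverses the regularization path.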

As the algorithm proceeds, the solutions from successive iterations coincide with the solutions along the complete Lasso regularization path.

From an optimization point of view, for a given $\lambda$ BLasso solves the Lasso problem as a convex optimization problem by fixed step size coordinate descent using both forward and backward steps. When the minimum of the Lasso loss is reached, a forward step is forced and $\lambda$ is recalculated. Since the Lasso solutions for two adjacent values of $\lambda$ are very close, the simple fixed step size coordinate descent takes only one to a few steps to reach the next solution. And since nothing other than fixed step coordinate descent is used, BLasso can be applied to a wide range of convex loss functions and can be generalized to work with general convex penalties as well. We defer the generalization to the end of this section and first take a more careful look at a specific loss function.

For the most common special case, least squares regression, the forward steps, the backward steps and BLasso all become simpler and more intuitive. To see this, we write the empirical loss function $\sum_{i=1}^n L(Z_i; \beta)$ in its $L_2$ form,

$$\sum_{i=1}^n L(Z_i; \beta) = \sum_{i=1}^n (Y_i - X_i\beta)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 = \sum_{i=1}^n \eta_i^2,$$

where $\hat Y = (\hat Y_1, \dots, \hat Y_n)^T$ are the fitted values and $\eta = (\eta_1, \dots, \eta_n)^T$ are the residuals. Recall that in the penalized regression set-up $X_i = (X_{i1}, \dots, X_{im})$, where every covariate $X_j = (X_{1j}, \dots, X_{nj})^T$ is normalized, i.e. $\|X_j\|^2 = \sum_{i=1}^n X_{ij}^2 = 1$ and $\sum_{i=1}^n X_{ij} = 0$. For a given $\beta = (\beta_1, \dots, \beta_m)^T$, the impact on the empirical loss of a step $s$ of size $|s| = \epsilon$ along $\beta_j$ can be written as

$$\Delta\Big(\sum_{i=1}^n L(Z_i; \beta)\Big) = \sum_{i=1}^n \big[(Y_i - X_i(\beta + s 1_j))^2 - (Y_i - X_i\beta)^2\big] = \sum_{i=1}^n \big[(\eta_i - s X_{ij})^2 - \eta_i^2\big] = \sum_{i=1}^n \big(-2 s \eta_i X_{ij} + s^2 X_{ij}^2\big) = -2 s (\eta' X_j) + s^2.$$

The last expression delivers a strong message: in least squares regression, given the step size, the impact of a step on the empirical loss function is determined solely by the correlation between the fitted residuals and the corresponding covariate. Specifically, it is proportional to the negative correlation between the fitted residuals and the covariate, plus the step size squared. Therefore, steepest descent with a fixed step size on the empirical loss function is equivalent to finding the covariate with the maximum absolute correlation with the fitted residuals and then proceeding along that direction. This is in principle the same as Forward Stagewise Fitting. Translating this for the forward step, where originally $(\hat j, \hat s_{\hat j}) = \arg\min_{j,\, s = \pm\epsilon} \sum_{i=1}^n L(Z_i; \beta + s 1_j)$, we get

$$\hat j = \arg\max_j |\eta' X_j| \quad \text{and} \quad \hat s = \mathrm{sign}(\eta' X_{\hat j})\,\epsilon,$$

which coincides exactly with the stagewise procedure described in Efron et al. (2002) and is in general the same principle as L2Boosting, i.e. recursively refitting the regression residuals along the most correlated direction, except for the difference in step size choice (Friedman 2001, Buhlmann and Yu 2002).
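In code, the least squares forward step therefore reduces to a single matrix-vector product. This small sketch (our own names; standardized columns assumed) makes the equivalence explicit.

```python
import numpy as np

def forward_step_ls(X, y, beta, epsilon):
    """Least squares forward step via residual correlations: with standardized
    columns, minimizing the empirical loss over (j, s = +/-epsilon) is the same
    as choosing j with the largest |eta' X_j| and stepping in that sign."""
    eta = y - X @ beta              # current residuals
    corr = X.T @ eta                # eta' X_j for every covariate j
    j_hat = int(np.argmax(np.abs(corr)))
    s_hat = np.sign(corr[j_hat]) * epsilon
    beta_new = beta.copy()
    beta_new[j_hat] += s_hat
    return beta_new, j_hat, s_hat
```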

Also, under this simplification, a backward step becomes

$$\hat j = \arg\min_j \big(-s_j(\eta' X_j)\big) \quad \text{subject to } \hat\beta_j \neq 0 \text{ and } s_j = -\mathrm{sign}(\hat\beta_j)\,\epsilon.$$

Ultimately, since both forward and backward steps are based solely on the correlations between the fitted residuals and the covariates, in the $L_2$ case BLasso reduces to finding the best direction in both the forward and the backward direction by examining these correlations, and then deciding whether to go forward or backward based on the regularization parameter.

As stated earlier, BLasso not only works for general convex loss functions; it can also be generalized to convex penalties other than the $L_1$ penalty. For the Lasso problem, the BLasso algorithm performs fixed step size coordinate descent to minimize the penalized loss. Since the penalty is the $L_1$ norm and (14) holds, the coordinate descent takes the form of backward and forward steps. For general convex penalties this nice feature is lost, but the algorithm still works. Assume $T(\beta): \mathbb{R}^m \to \mathbb{R}$ is a penalty function that is convex in $\beta$. We now describe the Generalized Boosted Lasso algorithm.

Generalized Boosted Lasso (BLasso)

Step 1 (initialization). Given data $Z_i = (Y_i, X_i)$, $i = 1, \dots, n$, and a small step size constant $\epsilon > 0$, take an initial forward step

$$(\hat j, \hat s_{\hat j}) = \arg\min_{j,\, s = \pm\epsilon} \sum_{i=1}^n L(Z_i; s 1_j), \qquad \hat\beta^0 = \hat s_{\hat j} 1_{\hat j}.$$

Then calculate the corresponding regularization parameter

$$\lambda^0 = \frac{\sum_{i=1}^n L(Z_i; 0) - \sum_{i=1}^n L(Z_i; \hat\beta^0)}{T(\hat\beta^0) - T(0)}.$$

Set $t = 0$.

Step 2 (steepest descent on the Lasso loss). Find the steepest coordinate descent direction on the penalized loss,

$$(\hat j, \hat s_{\hat j}) = \arg\min_{j,\, s = \pm\epsilon} \Gamma(\hat\beta^t + s 1_j; \lambda^t).$$

Update $\hat\beta$ if this reduces the penalized loss; otherwise force $\hat\beta$ to minimize the empirical loss and recalculate the regularization parameter:

If $\Gamma(\hat\beta^t + \hat s_{\hat j} 1_{\hat j}; \lambda^t) < \Gamma(\hat\beta^t; \lambda^t)$, then

$$\hat\beta^{t+1} = \hat\beta^t + \hat s_{\hat j} 1_{\hat j}, \qquad \lambda^{t+1} = \lambda^t.$$

Otherwise,

$$(\hat j, \hat s_{\hat j}) = \arg\min_{j,\, s = \pm\epsilon} \sum_{i=1}^n L(Z_i; \hat\beta^t + s 1_j),$$
$$\hat\beta^{t+1} = \hat\beta^t + \hat s_{\hat j} 1_{\hat j}, \qquad \lambda^{t+1} = \min\Big[\lambda^t,\ \frac{\sum_{i=1}^n L(Z_i; \hat\beta^t) - \sum_{i=1}^n L(Z_i; \hat\beta^{t+1})}{T(\hat\beta^{t+1}) - T(\hat\beta^t)}\Big].$$

Step 3 (iteration). Increase $t$ by one and repeat Steps 2 and 3. Stop when $\lambda^t \le 0$.

In the Generalized Boosted Lasso algorithm, explicit forward and backward steps are no longer seen. However, the mechanics remain the same: minimize the penalized loss function for each $\lambda$, and relax the regularization by reducing $\lambda$ when the minimum is reached. An algorithm of a similar fashion was developed independently by Rosset (2004). There, an incremental quadratic algorithm is used to minimize the penalized loss for each $\lambda$; then $\lambda$ is relaxed by a fixed amount each time the minimum is reached. That algorithm involves calculation of the Hessian matrix and an additional step size parameter for relaxing $\lambda$. In comparison, BLasso uses much simpler and computationally less intensive operations, and $\lambda$ is calculated automatically along the way. Also, for situations like boosting trees, where the number of base learners is huge and at each step the minimization of the empirical loss is done through a clever approximation, BLasso is able to adapt naturally in the same way as Forward Stagewise Fitting.
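A sketch of the generalized algorithm differs from the earlier BLasso sketch only in that the penalty T is a pluggable function and the coordinate descent in Step 2 searches over both signs of every coordinate rather than over explicit forward and backward steps. All names are again our own, the squared loss is used purely for illustration, and the lambda update assumes, as in the displayed formula, that the forced step increases the penalty.

```python
import numpy as np

def generalized_blasso(X, y, penalty, epsilon=0.01, max_steps=100000):
    """Sketch of Generalized BLasso: fixed step size coordinate descent on the
    penalized loss, relaxing lambda whenever no descent step exists.
    `penalty` is any convex function of beta, e.g. lambda b: np.sum(np.abs(b))."""
    n, m = X.shape
    emp = lambda b: np.sum((y - X @ b) ** 2)   # empirical loss (illustrative choice)

    def candidates(b):
        # all 2m candidate points one epsilon-step away from b
        for j in range(m):
            for s in (epsilon, -epsilon):
                cand = b.copy()
                cand[j] += s
                yield cand

    # Step 1: initial forward step and initial lambda
    beta = np.zeros(m)
    best = min(candidates(beta), key=emp)
    lam = (emp(beta) - emp(best)) / (penalty(best) - penalty(beta))
    beta = best
    path = [(lam, beta.copy())]

    for _ in range(max_steps):
        if lam <= 0:
            break
        # Step 2: steepest fixed-size coordinate step on the penalized loss
        gamma = lambda b: emp(b) + lam * penalty(b)
        best = min(candidates(beta), key=gamma)
        if gamma(best) < gamma(beta):
            beta = best                         # descent step, lambda unchanged
        else:
            # minimum reached for this lambda: force the empirical-loss step
            best = min(candidates(beta), key=emp)
            # assumes penalty(best) > penalty(beta), as in the lambda formula above
            lam = min(lam, (emp(beta) - emp(best))
                           / (penalty(best) - penalty(beta)))
            beta = best
        path.append((lam, beta.copy()))
    return path
```

Calling `generalized_blasso(X, y, penalty=lambda b: np.sqrt(np.sum(b ** 2)))` would, under these assumptions, trace a path for an $L_2$-type penalty in the spirit of the experiment in Section 5.2.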

5. Experiments

Three experiments are carried out to illustrate BLasso with both simulated and real data sets. We first run BLasso on a diabetes data set (cf. Efron et al. 2002) in the classical Lasso setting, i.e. $L_2$ regression with an $L_1$ penalty; BLasso completely replicates the entire regularization path of the Lasso. We then run BLasso on the same data set but in an unconventional setting, $L_1$ regression with an $L_2$ penalty, to show that BLasso works for a general problem with a convex loss function and a convex penalty. Finally, switching from regression to classification, we use simulated data to illustrate BLasso solving a regularized classification problem in the 1-norm SVM setting.

5.1. L2 Regression with L1 Penalty (Classical Lasso)

The data set used in this and the following experiment is from a diabetes study in which diabetes patients were measured on 10 baseline variables. A prediction model was desired for the response variable. The ten baseline variables (age, sex, body mass index, average blood pressure and six blood serum measurements) were obtained for each of $n = 442$ diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline. The statisticians were asked to construct a model that predicted the response $Y$ from the covariates $X_1, X_2, \dots, X_{10}$. Two hopes were evident: that the model would produce accurate baseline predictions of the response for future patients, and that the form of the model would suggest which covariates are important factors in disease progression. The classical Lasso, $L_2$ regression with an $L_1$ penalty, is used for this purpose.

Let $X_1, X_2, \dots, X_m$ be $n$-vectors representing the covariates and $Y$ the vector of responses for the $n$ cases; $m = 10$ and $n = 442$ in this study. Location and scale transformations are applied so that all covariates are standardized to have mean 0 and unit length, and the response has mean zero. The penalized loss function has the form

$$\Gamma(\beta; \lambda) = \sum_{i=1}^n (Y_i - X_i\beta)^2 + \lambda \|\beta\|_1. \qquad (15)$$

Figure 1. Estimates of the regression coefficients $\hat\beta_j$, $j = 1, 2, \dots, 10$, for the diabetes data. Left panel: Lasso solutions (given by LARS) as a function of $t = \|\beta\|_1$. The covariates enter the regression equation sequentially as $t$ increases, in the order $j = 3, 9, 4, 7, \dots, 1$. Right panel: BLasso solutions, which are seen to be identical to the Lasso solutions.

The left and right panels of Figure 1 are virtually identical, which empirically verifies that BLasso replicates the Lasso regularization path.

5.2. L1 Regression with L2 Penalty

To illustrate BLasso in an unconventional setting, we run an $L_1$ regression with an $L_2$ penalty on the same diabetes data as in the previous section. Instead of (15), the penalized loss function now takes the form

$$\Gamma(\beta; \lambda) = \sum_{i=1}^n |Y_i - X_i\beta| + \lambda \|\beta\|_2, \qquad (16)$$

where $\|\beta\|_2 = \big(\sum_{j=1}^m \beta_j^2\big)^{1/2}$.

Figure 2. Estimates of the regression coefficients $\hat\beta_j$, $j = 1, 2, \dots, 10$, for the diabetes data as in (16). Solutions are plotted as functions of $t = \big(\sum_{j=1}^m \beta_j^2\big)^{1/2}$.

Figure 2 shows the results given by BLasso for solving (16). The difference between Figure 1 and Figure 2 is significant: because an $L_2$ penalty is used in (16), there is no sparsity in the regression coefficients for any $\lambda$.

5.3. Classification with 1-norm SVM (Hinge Loss)

In addition to the regression experiments in the previous two sections, we now look at binary classification. We generate 50 training data points in each of two classes. The first class has two standard normal independent inputs $X_1$ and $X_2$ and class label $Y = 1$. The second class also has two standard normal independent inputs, but conditioned on $4.5 \le (X_1)^2 + (X_2)^2 \le 8$, and has class label $Y = -1$. We wish to find a classification rule from the training data so that, given a new input, we can assign to it a class label $Y$ from $\{1, -1\}$.

To handle this problem, the 1-norm SVM (Zhu et al. 2003) is considered:

$$(\hat\beta_0, \hat\beta) = \arg\min_{\beta_0, \beta} \sum_{i=1}^n \Big(1 - Y_i\big(\beta_0 + \sum_{j=1}^m \beta_j h_j(X_i)\big)\Big)_+ + \lambda \|\beta\|_1, \qquad (17)$$

where the $h_j$ are basis functions and $\lambda$ is the regularization parameter. The dictionary of basis functions considered here is $D = \{\sqrt{2}X_1, \sqrt{2}X_2, \sqrt{2}X_1X_2, (X_1)^2, (X_2)^2\}$. The fitted model is $\hat f(x) = \hat\beta_0 + \sum_{j=1}^m \hat\beta_j h_j(x)$, and the classification rule is given by $\mathrm{sign}(\hat f(x))$.

Figure 3. Estimates of the 1-norm SVM coefficients $\hat\beta_j$, $j = 1, \dots, 5$, for the simulated two-class classification data. Left panel: BLasso solutions as a function of $t = \sum_{j=1}^5 |\hat\beta_j|$. Right panel: scatter plot of the data points with labels: + for $y = 1$; o for $y = -1$. The covariates enter the regression equation sequentially as $t$ increases, in the following order: the two quadratic terms first, followed by the interaction term and then the two linear terms.

The penalty used here is the $L_1$ norm, the same as in the Lasso; however, the intercept coefficient $\beta_0$ is not penalized but still needs to be estimated. Since the penalty function remains convex, BLasso gives the solutions successfully.
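To connect this experiment with the generalized sketch above, the following shows (with our own names, as an assumption-laden illustration rather than the authors' code) how the basis expansion and the hinge empirical loss of (17) could be set up; handling of the unpenalized intercept is simplified here to treating $\beta_0$ as an extra argument.

```python
import numpy as np

def basis_expansion(X):
    """Dictionary D = {sqrt(2) X1, sqrt(2) X2, sqrt(2) X1*X2, X1^2, X2^2}
    for two-dimensional inputs X of shape (n, 2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.sqrt(2) * x1, np.sqrt(2) * x2,
                            np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

def hinge_empirical_loss(H, y, beta0, beta):
    """Empirical part of (17): sum_i (1 - y_i (beta0 + h(x_i) beta))_+ ."""
    margins = y * (beta0 + H @ beta)
    return np.sum(np.maximum(0.0, 1.0 - margins))
```

Replacing the squared empirical loss in the BLasso sketches with `hinge_empirical_loss` (and leaving $\beta_0$ out of the penalty) would reproduce, in spirit, the path computation used in this experiment.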

6. Discussion and Concluding Remarks

As seen from the experiments, BLasso is effective for solving the Lasso problem and general convex penalized loss minimization problems. One practical issue left undiscussed is the choice of step size. In general, BLasso takes $O(1/\epsilon)$ steps. For simple $L_2$ regression with $m$ covariates, each step uses $O(mn)$ basic operations. Depending on the actual loss function, the base learners and the minimization trick used in each step, the actual computational complexity varies. Nonetheless, smaller steps give a smoother and closer approximation to the regularization path but also incur more computation. However, we observe that the actual coefficient estimates are quite accurate even for relatively large step sizes.

Figure 4. Estimates of the regression coefficient $\hat\beta_3$ for the diabetes data as in Section 5.1. Solutions are plotted as functions of $\lambda$. Dotted line: estimates using step size $\epsilon = 0.05$. Solid line: estimates using step size $\epsilon = 10$. Dash-dot line: estimates using step size $\epsilon = 50$.

As can be seen from Figure 4, for the small step size $\epsilon = 0.05$ the solution path cannot be distinguished from the exact regularization path. However, even when the step size is as large as $\epsilon = 10$ or $\epsilon = 50$ (200 and 1000 times bigger), the solutions are still good approximations.

As mentioned earlier, BLasso has only one step size parameter. This parameter controls both how closely BLasso approximates the minimizing coefficients for each $\lambda$ and how closely two adjacent values of $\lambda$ on the regularization path are placed. As can be seen from Figure 4, a smaller step size leads to a closer approximation to the solutions and also to a finer grid for $\lambda$. We argue that, if $\lambda$ is sampled on a coarse grid, there is no point in wasting computational power on finding a much more accurate approximation of the coefficients for each $\lambda$; instead, the available computational power should be balanced between these two coupled tasks. BLasso's one-parameter set-up automatically balances these two aspects of the approximation, which is graphically expressed by the staircase shape of the solution paths.

Another rich topic that deserves a closer look is the application of BLasso in an online setting. Since BLasso has both forward and backward steps, it should be easily transformed into an adaptive online learning algorithm where it goes back and forth to track the best regularization parameter and the corresponding model.

In this paper, we introduced a backward step as a necessary complement to forward fitting procedures like Forward Stagewise Fitting for minimizing the Lasso loss. We also proposed the Reverse Boosting and the Boosted Lasso algorithms, the latter of which is able to produce the complete regularization path for general convex loss functions with convex penalties. To summarize, we showed that:

(a) A backward step is necessary for boosting-like procedures to produce the Lasso regularization path.
(b) The construction of such a backward step involves reducing the Lasso penalty while keeping the increase of the empirical loss as small as possible.
(c) Using only backward steps, a Reverse Boosting algorithm can be constructed which is concise, computationally light and similar to the classical backward elimination algorithm.
(d) Combining both forward and backward steps and trading off the Lasso penalty against the empirical loss, a Boosted Lasso (BLasso) algorithm can be constructed to produce the Lasso solutions for general convex loss functions.
(e) The BLasso algorithm can be generalized to efficiently compute the regularization path for general convex loss functions with convex penalties.
(f) The results are verified by experiments using both real and simulated data for both regression and classification problems.

References

[1] Breiman, L. (1998). Arcing classifiers. Ann. Statist. 26.
[2] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 11.
[3] Buhlmann, P. and Yu, B. (2001). Boosting with the L2 loss: regression and classification. J. Am. Statist. Ass. 98.
[4] Donoho, D. and Elad, M. (2004). Optimally sparse representation in general (nonorthogonal) dictionaries by l1 minimization. Technical report, Statistics Department, Stanford University.
[5] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2002). Least angle regression. Ann. Statist. 32 (2004), no. 2.
[6] Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation 121.
[7] Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proc. Thirteenth International Conference. Morgan Kauffman, San Francisco.
[8] Friedman, J.H., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28.
[9] Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Ann. Statist. 29.
[10] Hansen, M. and Yu, B. (2001). Model selection and the principle of minimum description length. J. Am. Statist. Ass. 96.
[11] Hastie, T., Tibshirani, R. and Friedman, J.H. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag, New York.

[12] Li, S. and Zhang, Z. (2004). FloatBoost learning and statistical face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 26.
[13] Mason, L., Baxter, J., Bartlett, P. and Frean, M. (1999). Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press.
[14] Rosset, S. (2004). Tracking curved regularized optimization solution paths. NIPS 2004, to appear.
[15] Schapire, R.E. (1990). The strength of weak learnability. Machine Learning 5 (2).
[16] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58 (1).
[17] Zhu, J., Rosset, S., Hastie, T. and Tibshirani, R. (2003). 1-norm support vector machines. Advances in Neural Information Processing Systems 16. MIT Press.


More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Lecture 17: October 27

Lecture 17: October 27 0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Saharon Rosset 1 and Ji Zhu 2

Saharon Rosset 1 and Ji Zhu 2 Aust. N. Z. J. Stat. 46(3), 2004, 505 510 CORRECTED PROOF OF THE RESULT OF A PREDICTION ERROR PROPERTY OF THE LASSO ESTIMATOR AND ITS GENERALIZATION BY HUANG (2003) Saharon Rosset 1 and Ji Zhu 2 IBM T.J.

More information

JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA

JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 1 SEPARATING SIGNAL FROM BACKGROUND USING ENSEMBLES OF RULES JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305 E-mail: jhf@stanford.edu

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

Ensemble Methods: Jay Hyer

Ensemble Methods: Jay Hyer Ensemble Methods: committee-based learning Jay Hyer linkedin.com/in/jayhyer @adatahead Overview Why Ensemble Learning? What is learning? How is ensemble learning different? Boosting Weak and Strong Learners

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

6.036 midterm review. Wednesday, March 18, 15

6.036 midterm review. Wednesday, March 18, 15 6.036 midterm review 1 Topics covered supervised learning labels available unsupervised learning no labels available semi-supervised learning some labels available - what algorithms have you learned that

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

arxiv: v1 [math.st] 2 May 2007

arxiv: v1 [math.st] 2 May 2007 arxiv:0705.0269v1 [math.st] 2 May 2007 Electronic Journal of Statistics Vol. 1 (2007) 1 29 ISSN: 1935-7524 DOI: 10.1214/07-EJS004 Forward stagewise regression and the monotone lasso Trevor Hastie Departments

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

A Brief Introduction to Adaboost

A Brief Introduction to Adaboost A Brief Introduction to Adaboost Hongbo Deng 6 Feb, 2007 Some of the slides are borrowed from Derek Hoiem & Jan ˇSochman. 1 Outline Background Adaboost Algorithm Theory/Interpretations 2 What s So Good

More information

Pathwise coordinate optimization

Pathwise coordinate optimization Stanford University 1 Pathwise coordinate optimization Jerome Friedman, Trevor Hastie, Holger Hoefling, Robert Tibshirani Stanford University Acknowledgements: Thanks to Stephen Boyd, Michael Saunders,

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

arxiv: v2 [stat.ml] 22 Feb 2008

arxiv: v2 [stat.ml] 22 Feb 2008 arxiv:0710.0508v2 [stat.ml] 22 Feb 2008 Electronic Journal of Statistics Vol. 2 (2008) 103 117 ISSN: 1935-7524 DOI: 10.1214/07-EJS125 Structured variable selection in support vector machines Seongho Wu

More information

SPECIAL INVITED PAPER

SPECIAL INVITED PAPER The Annals of Statistics 2000, Vol. 28, No. 2, 337 407 SPECIAL INVITED PAPER ADDITIVE LOGISTIC REGRESSION: A STATISTICAL VIEW OF BOOSTING By Jerome Friedman, 1 Trevor Hastie 2 3 and Robert Tibshirani 2

More information

Lasso Regression: Regularization for feature selection

Lasso Regression: Regularization for feature selection Lasso Regression: Regularization for feature selection Emily Fox University of Washington January 18, 2017 1 Feature selection task 2 1 Why might you want to perform feature selection? Efficiency: - If

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

TDT4173 Machine Learning

TDT4173 Machine Learning TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

A ROBUST ENSEMBLE LEARNING USING ZERO-ONE LOSS FUNCTION

A ROBUST ENSEMBLE LEARNING USING ZERO-ONE LOSS FUNCTION Journal of the Operations Research Society of Japan 008, Vol. 51, No. 1, 95-110 A ROBUST ENSEMBLE LEARNING USING ZERO-ONE LOSS FUNCTION Natsuki Sano Hideo Suzuki Masato Koda Central Research Institute

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress) Compressed Sensing in Cancer Biology? (A Work in Progress) M. Vidyasagar FRS Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar University

More information

Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,

More information

Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, Emily Fox 2014

Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, Emily Fox 2014 Case Study 3: fmri Prediction Fused LASSO LARS Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, 2014 Emily Fox 2014 1 LASSO Regression

More information