Backfitting algorithms for total-variation and empirical-norm penalized additive modeling with high-dimensional data

The ISI's Journal for the Rapid Dissemination of Statistics Research (wileyonlinelibrary.com) DOI: 10.100X/sta

Ting Yang and Zhiqiang Tan, Department of Statistics, Rutgers University, New Jersey, USA (ztan@stat.rutgers.edu)

Received 1 May 2018

Additive modeling is useful in studying nonlinear relationships between a response and covariates. We develop backfitting algorithms to implement the doubly penalized method in Tan & Zhang (2017), using total-variation and empirical-norm penalties. Use of the total-variation penalty leads to an automatic selection of knots for each component function, whereas use of the empirical-norm penalty can result in zero solutions for component functions and hence facilitates component selection in high-dimensional settings. Within each backfitting cycle, each component function is updated by thresholding a solution to a Lasso problem, which is computed using an active-set descent method. Screening rules are also derived to determine zero solutions without solving the Lasso problem directly. We present numerical experiments to demonstrate the effectiveness of the proposed algorithms for linear and logistic additive modeling.

Keywords: Additive model; High-dimensional data; Total variation; Nonparametric smoothing; Penalized estimation; Trend filtering.

1. Introduction

Additive models provide a useful extension of linear models to allow nonlinear dependency of a mean response on covariates (Stone, 1986). For a sample of size $n$, let $y_i$ be a response and $x_i = (x_{i1}, \ldots, x_{ip})^T$ be a $p$-dimensional covariate vector for $i = 1, \ldots, n$. Consider a nonparametric additive model
$$y_i = \mu + f(x_i) + \varepsilon_i = \mu + \sum_{j=1}^p f_j(x_{ij}) + \varepsilon_i,$$
where $\mu$ is a constant term, $f_j(\cdot)$ is an unknown function of the $j$th covariate, and $\varepsilon_i$ is a noise with mean zero and finite variance $\sigma^2$. Theory and methods for generalized additive models are well studied in low-dimensional settings ($p \ll n$) (Hastie & Tibshirani, 1990; Wood, 2017). Recently, there has been considerable research on sparse additive models

in high-dimensional settings, where $p$ is close to or greater than $n$, but the number of nonzero functions $f_j(\cdot)$, up to some centering, is still smaller than $n$. See Section 2 for a review of related work.

In this article, we propose a backfitting algorithm, called block descent and thresholding (BDT), to implement doubly penalized estimation studied in Tan & Zhang (2017), using total-variation and empirical-norm penalties. For a function $g_j(\cdot)$ of the $j$th covariate in an interval $[a_j, b_j]$, the total variation is defined as
$$\mathrm{TV}(g_j) = \sup\left\{\sum_{i=1}^{k}|g_j(z_i) - g_j(z_{i-1})| : z_0 < z_1 < \cdots < z_k \text{ is a partition of } [a_j, b_j], \text{ for any } k\right\}.$$
If $g_j$ is differentiable with derivative $g_j^{(1)}$, then $\mathrm{TV}(g_j) = \int |g_j^{(1)}(z)|\,\mathrm{d}z$. The empirical norm (precisely, the empirical $L_2$ norm) of $g_j$ is defined as $\|g_j\|_n = \{n^{-1}\sum_{i=1}^n g_j^2(x_{ij})\}^{1/2}$. For some integer $m \ge 1$ and tuning parameters $(\lambda, \rho)$, the doubly penalized estimator $(\hat\mu, \hat f_1, \ldots, \hat f_p)$ is defined as a minimizer of the following penalized loss
$$\frac{1}{2n}\sum_{i=1}^n \{y_i - \mu - f(x_i)\}^2 + \sum_{j=1}^p\left\{\lambda\,\mathrm{TV}(f_j^{(m-1)}) + \rho\,\|f_j\|_n\right\}, \qquad (1)$$
over $(\mu, f_1, \ldots, f_p)$, where $f = \sum_{j=1}^p f_j$, and $f_j^{(m-1)}$ is the $(m-1)$th derivative of $f_j$ with $f_j^{(0)} \equiv f_j$. If $f_j$ is $m$-times differentiable, then $\mathrm{TV}(f_j^{(m-1)}) = \int |f_j^{(m)}(z)|\,\mathrm{d}z$. By extending Mammen & van de Geer (1997) for univariate smoothing, Tan & Zhang (2017) showed that each $\hat f_j$ can be chosen to be a spline of order $m$, that is, a piecewise polynomial of degree $m-1$ and, if $m \ge 2$, an $(m-2)$-times continuously differentiable function. Moreover, the knots of each $\hat f_j$ can be obtained from the data points $\{x_{ij} : i = 1, \ldots, n\}$ if $m = 1$ or $2$, but not necessarily so if $m \ge 3$. For simplicity, we always restrict each $f_j(\cdot)$ to be a spline of order $m$ with knots from the data points.

The penalty term in (1) involves two penalties, playing complementary roles, for each component function $f_j$. The total variation, $\mathrm{TV}(f_j^{(m-1)})$, is used to induce smoothness of $f_j$. In fact, as shall be seen in Section 3.1, the total-variation penalty leads to an automatic selection of knots from data points for each $f_j$, similarly as Lasso selection of nonzero coefficients in linear regression (Tibshirani, 1996). The empirical norm $\|f_j\|_n$ is used to induce sparsity for $f_j$ (i.e., setting $f_j$ to zero). See Lemma 1 for how a zero solution for $f_j$ can be caused by the presence of the empirical-norm penalty. In the special case where $f_j(z) = \beta_j z$ is linear for all $j$, the overall penalty in (1) reduces to the Lasso penalty $\rho\sum_{j=1}^p|\beta_j|$ up to some scaling for linear regression. In other words, the penalty in (1) can be regarded as a functional extension of the Lasso penalty, with $|\beta_j|$ replaced by a combination of $\mathrm{TV}(f_j^{(m-1)})$ and $\|f_j\|_n$.

The theoretical analysis in Tan & Zhang (2017) shows under some technical conditions that if $\lambda$ and $\rho$ are properly specified, then the prediction error of $\hat f = \sum_{j=1}^p \hat f_j$ achieves a minimax rate
$$\|\hat f - f\|_n^2 = O_p(1)\,(M_F + M_0)\left\{n^{-\frac{2m}{2m+1}} + \frac{\log(p)}{n}\right\}, \qquad (2)$$
for any decomposition $f = \sum_{j=1}^p f_j$ such that $\#\{j : f_j \not\equiv 0\} \le M_0$ and $\sum_{j=1}^p \mathrm{TV}(f_j^{(m-1)}) \le M_F$. A similar result holds for out-of-sample prediction errors. The error bound (2) consists of two terms, reflecting the errors from respectively nonparametric smoothing and variable selection. The first term, of order $n^{-2m/(2m+1)}$, is the minimax rate of estimation in univariate smoothing ($p = 1$) over the function class $\{f_1 : \mathrm{TV}(f_1^{(m-1)}) \le C\}$ for a constant $C$ (Mammen & van de Geer, 1997). The second term, of order $M_0\log(p)/n$, is known as the Lasso fast rate of prediction errors in linear regression with at most $M_0$ nonzero regression coefficients (Bickel et al., 2009).
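To make the objective concrete, the following sketch (our illustration, not part of the authors' implementation; function names and the simulated inputs are ours) evaluates the doubly penalized loss (1) for $m = 1$, where each component is piecewise constant and its total variation over the data range reduces to the sum of absolute jumps between consecutive sorted design points.

```python
import numpy as np

def empirical_norm(values):
    """Empirical L2 norm ||g||_n = {n^-1 sum_i g(x_i)^2}^{1/2}."""
    return np.sqrt(np.mean(values ** 2))

def total_variation_m1(x_j, f_j_values):
    """TV of a piecewise-constant fit: sum of absolute jumps between
    consecutive design points after sorting the j-th covariate."""
    order = np.argsort(x_j)
    return np.sum(np.abs(np.diff(f_j_values[order])))

def penalized_loss_m1(y, mu, F, X, lam, rho):
    """Objective (1) with m = 1.  F[:, j] holds f_j evaluated at x_{ij}."""
    fit = mu + F.sum(axis=1)
    loss = 0.5 * np.mean((y - fit) ** 2)
    penalty = sum(lam * total_variation_m1(X[:, j], F[:, j])
                  + rho * empirical_norm(F[:, j])
                  for j in range(X.shape[1]))
    return loss + penalty

# Toy check on simulated data (values are illustrative only).
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.uniform(-2.5, 2.5, size=(n, p))
F = np.zeros((n, p))
F[:, 0] = np.where(X[:, 0] > 0, 1.0, -1.0)   # one piecewise-constant component
y = 1.0 + F.sum(axis=1) + rng.normal(size=n)
print(penalized_loss_m1(y, 1.0, F, X, lam=0.1, rho=0.05))
```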
Our backfitting algorithm can be modified to implement doubly penalized estimation using $L_2$ Sobolev-seminorm and empirical-norm penalties, where the penalty term in (1) is replaced by $\sum_{j=1}^p\{\lambda\|f_j^{(m)}\|_{L_2} + \rho\|f_j\|_n\}$, with $\|f_j^{(m)}\|_{L_2} = \{\int (f_j^{(m)}(z))^2\,\mathrm{d}z\}^{1/2}$.

Denote by $(\tilde f_1, \ldots, \tilde f_p)$ the resulting estimators. There are, however, two notable differences from when the total-variation penalty is used. First, each $\tilde f_j$ is a smoothing spline with all the data points $\{x_{ij} : i = 1, \ldots, n\}$ as knots (Meier et al., 2009). No selection of knots is achieved. Second, the error bound (2) is also valid with $\hat f$ replaced by $\tilde f = \sum_{j=1}^p \tilde f_j$ and $M_F$ replaced by $\sum_{j=1}^p\|f_j^{(m)}\|_{L_2}$ (Koltchinskii & Yuan, 2010; Raskutti et al., 2012). But a bounded function class $\{f_1 : \|f_1^{(m)}\|_{L_2} \le C\}$ is strictly smaller than $\{f_1 : \mathrm{TV}(f_1^{(m-1)}) \le C\}$ for any constant $C$. For univariate smoothing ($p = 1$), smoothing splines cannot achieve the rate $n^{-2m/(2m+1)}$ as do total-variation splines in the larger function class $\{f_1 : \mathrm{TV}(f_1^{(m-1)}) \le C\}$ (Donoho & Johnstone, 1994; Mammen & van de Geer, 1997).

We use the following notation. For a function $h(y, x)$, the empirical norm of $h$ is defined as $\|h\|_n = \{n^{-1}\sum_{i=1}^n h^2(y_i, x_i)\}^{1/2}$. For example, $\|y - f\|_n^2 = n^{-1}\sum_{i=1}^n\{y_i - f(x_i)\}^2$. The sample average of $h$ is $\bar h = n^{-1}\sum_{i=1}^n h(y_i, x_i)$. For example, $\bar y = n^{-1}\sum_{i=1}^n y_i$. Moreover, for a vector $u = (u_1, \ldots, u_n)^T$, denote $\|u\|_n = (n^{-1}\sum_{i=1}^n u_i^2)^{1/2}$. For a vector $v = (v_1, \ldots, v_k)^T$, denote $\|v\|_1 = \sum_{j=1}^k|v_j|$ and $\|v\|_\infty = \max_{j=1,\ldots,k}|v_j|$.

The rest of the paper is organized as follows. Section 2 reviews related work. In Section 3, we develop a backfitting algorithm for minimizing the penalized loss (1) in linear additive modeling. In Section 4, we extend the algorithm to logistic additive modeling. Section 5 presents numerical experiments. Section 6 concludes the paper. The Supplementary Material contains proofs and additional discussion and results.

2. Related work

Recently, there has been considerable research on penalized estimation for additive modeling, beyond earlier work as in Hastie & Tibshirani (1990). For example, Lin & Zhang (2006) used a penalty $\lambda\sum_{j=1}^p\|f_j^{(m)}\|_{L_2}$. Huang et al. (2010) studied a similar method using adaptive group Lasso. Ravikumar et al. (2009) used a penalty $\rho\sum_{j=1}^p\|f_j\|_n$, and restricted each $f_j$ to lie in the span of an $n \times d$ pre-specified basis matrix. Meier et al. (2009) used a penalty in the form $\lambda_1\sum_{j=1}^p\{\|f_j\|_n^2 + \lambda_2\|f_j^{(m)}\|_{L_2}^2\}^{1/2}$, and parameterized $f_j$ using B-spline basis functions with $K$ pre-specified knots in numerical implementation. Koltchinskii & Yuan (2010) and Raskutti et al. (2012) used a penalty term $\sum_{j=1}^p\{\lambda\|f_j^{(m)}\|_{L_2} + \rho\|f_j\|_n\}$, but did not present numerical algorithms for minimizing the penalized loss. As mentioned above, such doubly penalized estimation can be handled by extending our backfitting algorithm. The doubly penalized method in Tan & Zhang (2017) differs from Koltchinskii & Yuan (2010) and Raskutti et al. (2012) in the use of the total-variation penalty, which leads to automatic knot selection for each component $f_j$ and allows the same rate of convergence to be achieved in much larger bounded-variation function classes.

Use of total-variation penalties seems to be first studied by Mammen & van de Geer (1997) for univariate smoothing, where the penalized loss is $\|y - f_1\|_n^2/2 + \lambda\,\mathrm{TV}(f_1^{(m-1)})$. Recently, Kim et al. (2009) proposed a related method for univariate smoothing, called trend filtering, by minimizing the penalized loss over $\theta$,
$$\|y - \theta\|_2^2/2 + \lambda\|D^{(m)}P\theta\|_1,$$
where $\|\cdot\|_2$ denotes the $L_2$ norm, $y = (y_1, \ldots, y_n)^T$, $\theta = (f(x_1), \ldots, f(x_n))^T$, $P$ is the permutation matrix that sorts $(x_1, \ldots, x_n)$ in ascending order, and $D^{(m)}$ is the $m$th-order difference matrix.
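For reference, the $m$th-order difference matrix is easily built by composing first-difference operators; the sketch below (our illustration with numpy, using the evenly spaced formulation of Kim et al., 2009; all names are ours) constructs $D^{(m)}$ and evaluates the trend-filtering penalty $\lambda\|D^{(m)}P\theta\|_1$ after sorting the design points.

```python
import numpy as np

def difference_matrix(n, m):
    """m-th order difference matrix D^{(m)} of shape (n - m, n),
    obtained by applying the first-difference operator m times."""
    D = np.eye(n)
    for _ in range(m):
        D = np.diff(D, axis=0)   # each pass differences adjacent rows
    return D

def trend_filter_penalty(x, theta, m, lam):
    """lam * || D^{(m)} P theta ||_1, with P sorting the design points."""
    order = np.argsort(x)                 # permutation P
    D = difference_matrix(len(x), m)
    return lam * np.sum(np.abs(D @ theta[order]))

# Example: penalty of a quadratic trend under second-order differencing.
x = np.linspace(0.0, 1.0, 20)
theta = 3.0 * x ** 2
print(trend_filter_penalty(x, theta, m=2, lam=1.0))
```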
Tibshirani (2014) showed that trend filtering is equivalent to total-variation splines in Mammen & van de Geer (1997) for $m = 1$ and $2$, but not in general for $m \ge 3$. In addition, trend filtering is shown to achieve the same (minimax) rate of convergence as total-variation splines in bounded-variation classes when the data points are evenly spaced, for $m \ge 1$. For additive modeling, Petersen et al. (2016) proposed a fused lasso additive model (FLAM), by minimizing the penalized loss over $(\mu, \theta_1, \ldots, \theta_p)$:
$$\frac{1}{2}\Big\|y - \mu - \sum_{j=1}^p\theta_j\Big\|_2^2 + \lambda\sum_{j=1}^p\left\{\alpha\,\|D^{(1)}P_j\theta_j\|_1 + (1-\alpha)\,\|\theta_j\|_2\right\},$$

where $\lambda$ is a tuning parameter, $\alpha \in [0, 1]$, $\theta_j = (f_j(x_{1j}), \ldots, f_j(x_{nj}))^T$, and $P_j$ is the permutation matrix that sorts the data points $(x_{1j}, \ldots, x_{nj})$. This fitting procedure is easily shown to be statistically equivalent to that of the doubly penalized method with $m = 1$ in Tan & Zhang (2017), up to a transformation between the tuning parameters $(\lambda, \alpha)$ above and $(\lambda, \rho)$ in (1). The fitted values $\hat\mu + \sum_{j=1}^p\hat\theta_j$ from FLAM and $(\hat f(x_1), \ldots, \hat f(x_n))^T$ from our method can be matched. However, there is a subtle difference in how out-of-sample predictions are made. The fitted function $\hat f_j$ of the $j$th covariate is defined by linear interpolation of the fitted values $\hat\theta_j$ in FLAM, but directly by piecewise-constant interpolation in our method, which is more aligned with the first-order total-variation penalty used. To obtain fitted functions $\hat f_j$ that are piecewise linear, our method uses the second-order total-variation penalty (i.e., $m = 2$). In general, our method also accommodates higher-order total-variation penalties.

Sadhanala & Tibshirani (2017) also proposed additive modeling with trend filtering, by minimizing the penalized loss
$$\frac{1}{2}\Big\|y - \sum_{j=1}^p\theta_j\Big\|_2^2 + \lambda\sum_{j=1}^p\|D^{(m)}P_j\theta_j\|_1.$$
This method allows higher-order trend filtering on component functions, but does not incorporate empirical-norm penalties, which are crucial to achieve sparsity in high-dimensional settings. The theory and numerical evaluations in Sadhanala & Tibshirani (2017) are provided only in low-dimensional settings. Our backfitting algorithm can be modified to handle both trend-filtering and empirical-norm penalties, i.e., a penalty term $\sum_{j=1}^p\{\lambda\|D^{(m)}P_j\theta_j\|_1 + \rho\|\theta_j\|_n\}$, by recasting trend filtering using the falling factorial basis functions (Wang et al., 2014).

3. Linear additive modeling

We develop a backfitting algorithm for minimizing the objective function (1). As discussed in Section 1, we restrict each $f_j$ to be a spline of order $m$ with knots from the data points $\{x_{ij} : i = 1, \ldots, n\}$. For $1 \le j \le p$, define the knot superset for $f_j$ as (Mammen & van de Geer, 1997)
$$\{t_{j(1)}, \ldots, t_{j(n-m)}\} = \begin{cases}\{x_{j((m-1)/2+2)}, \ldots, x_{j(n-(m-1)/2)}\} & \text{if } m \text{ is odd},\\ \{x_{j(m/2+1)}, \ldots, x_{j(n-m/2)}\} & \text{if } m \text{ is even},\end{cases}$$
where $x_{j(1)} < \cdots < x_{j(n)}$ are the ordered values of the $j$th covariate, and $t_{j(1)} < \cdots < t_{j(n-m)}$. The data points near the left and right boundaries are removed to avoid over-parameterization. As shown in Tan & Zhang (2017), a solution $(\hat f_1, \ldots, \hat f_p)$ obtained with this restriction is also an unrestricted minimizer of (1) if $m = 1$ or $2$, but not necessarily so if $m \ge 3$. For univariate smoothing, Mammen & van de Geer (1997) also showed that the restricted and unrestricted solutions achieve the same rate of convergence under a mild condition on the maximum spacing between the data points. A similar result can be expected to hold for additive models.

To represent splines of order $m$, we use the truncated power basis, defined as
$$\phi_{k,j}(z) = z^k, \quad k = 1, \ldots, m-1; \qquad \phi_{m-1+k,j}(z) = (z - t_{j(k)})_+^{m-1}, \quad k = 1, 2, \ldots, n-m,$$
where $(c)_+ = \max(0, c)$ and $(c)_+^0 = 0$ if $c < 0$ or $1$ if $c \ge 0$. Denote $f_j = \beta_{j0} + \Phi_j\beta_j$, where $\Phi_j = (\phi_{1,j}, \ldots, \phi_{n-1,j})$ and $\beta_j = (\beta_{j1}, \ldots, \beta_{j,n-1})^T$. After simple algebra, the objective function (1) can be written as
$$\frac{1}{2}\|y - \mu - f\|_n^2 + \sum_{j=1}^p\left\{\lambda\|D_j\beta_j\|_1 + \rho\|\beta_{j0} + \Phi_j\beta_j\|_n\right\}, \qquad (3)$$
where $D_j$ is a diagonal matrix with the only nonzero elements $d_m = \cdots = d_{n-1} = 1$. Using the truncated power basis transforms the total variation $\mathrm{TV}(f_j^{(m-1)})$ into a Lasso penalty $\|D_j\beta_j\|_1 = \sum_{k=1}^{n-m}|\beta_{j,m-1+k}|$.
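For concreteness, a small sketch is given below (our own illustration; the helper names are ours) that builds the knot superset and the truncated power basis matrix for one covariate, for either parity of $m$; centering the columns then yields the matrix $\tilde\Phi_j$ used in (4).

```python
import numpy as np

def knot_superset(x, m):
    """Knots from the ordered data points, dropping points near the
    boundaries (n - m knots in total), following the definition above."""
    xs = np.sort(x)
    n = len(xs)
    if m % 2 == 1:                       # m odd
        lo, hi = (m - 1) // 2 + 2, n - (m - 1) // 2
    else:                                # m even
        lo, hi = m // 2 + 1, n - m // 2
    return xs[lo - 1:hi]                 # 1-based index range [lo, hi]

def truncated_power_basis(x, m):
    """n x (n-1) basis: z^k for k = 1,...,m-1, then (z - t_k)_+^{m-1} at each knot."""
    knots = knot_superset(x, m)
    diff = x[:, None] - knots[None, :]
    if m == 1:
        trunc = (diff >= 0).astype(float)          # (c)_+^0 = 1{c >= 0}
    else:
        trunc = np.maximum(diff, 0.0) ** (m - 1)
    poly = (np.column_stack([x ** k for k in range(1, m)])
            if m >= 2 else np.empty((len(x), 0)))
    Phi = np.hstack([poly, trunc])
    return Phi - Phi.mean(axis=0)        # empirically centered version (Phi tilde)

x = np.random.default_rng(4).uniform(-2.5, 2.5, size=30)
print(truncated_power_basis(x, m=2).shape)   # (30, 29)
```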

Because $\mu$ is non-penalized, it follows that for (3) to be minimized, $\hat\mu = \bar y$, and each $\hat f_j$ is empirically centered, that is, $\hat\beta_{j0} = -\bar\Phi_j\hat\beta_j$, where $\bar\Phi_j = n^{-1}\sum_{i=1}^n\Phi_j(x_{ij})$. Then minimization of (3) reduces to that of
$$\frac{1}{2}\|y - \bar y - f\|_n^2 + \sum_{j=1}^p\left\{\lambda\|D_j\beta_j\|_1 + \rho\|\tilde\Phi_j\beta_j\|_n\right\}, \qquad (4)$$
over $(\beta_1, \ldots, \beta_p)$, where $f = \sum_{j=1}^p f_j$, $f_j = \tilde\Phi_j\beta_j$, and $\tilde\Phi_j = \Phi_j - \bar\Phi_j$, the empirically centered version of $\Phi_j$.

3.1. Backfitting

To minimize (3) or, equivalently, (4), a backfitting algorithm involves solving $p$ sub-problems, updating the components $(f_1, \ldots, f_p)$ sequentially. For any $1 \le j \le p$, the $j$th sub-problem is
$$\min_{\beta_j}\left\{\frac{1}{2}\|r_j - \tilde\Phi_j\beta_j\|_n^2 + \lambda\|D_j\beta_j\|_1 + \rho\|\tilde\Phi_j\beta_j\|_n\right\}, \qquad (5)$$
where $r_j = y - \bar y - \sum_{k\neq j}\hat f_k$ and $\{\hat f_k : k \neq j\}$ are the current estimates. By abuse of notation, $\tilde\Phi_j$ can also be treated as the $n\times(n-1)$ matrix with $i$th row $\tilde\Phi_j(x_{ij})$, and $r_j$ can be treated as the $n\times 1$ vector with $i$th element $y_i - \bar y - \sum_{k\neq j}\hat f_k(x_{ik})$.

Proposition 1 provides the main idea of our algorithm for solving problem (5). This result serves as a generalization of Corollary 3.1 in Petersen et al. (2016).

Proposition 1. Suppose that $\tilde\beta_j$ is a solution to
$$\min_{\beta_j}\left\{\frac{1}{2}\|r_j - \tilde\Phi_j\beta_j\|_n^2 + \lambda\|D_j\beta_j\|_1\right\}. \qquad (6)$$
Then a solution to problem (5) is
$$\hat\beta_j = \left(1 - \frac{\rho}{\|\tilde\Phi_j\tilde\beta_j\|_n}\right)_+\tilde\beta_j.$$

Proposition 1 says that a solution to problem (5) is determined by directly thresholding a solution of the Lasso problem (6). From our proof, a more general result holds where $r_j$ is a "response" vector, $\tilde\Phi_j$ is a data matrix, and $\lambda\|D_j\beta_j\|_1$ is replaced by a semi-norm penalty $R(\beta_j)$, including $R(\beta_j) = \lambda\|D_j\beta_j\|_2$ in the case where the $L_2$ semi-norm $\|f_j^{(m)}\|_{L_2}$ is used instead of $\mathrm{TV}(f_j^{(m-1)})$ in (1). Proposition 1 sheds light on the consequences of using the two penalties in (5). Use of the total-variation or Lasso penalty can induce a sparse solution $\tilde\beta_j$ with only a few nonzero components, corresponding to an automatic selection of knots for $f_j$, as shown in Osborne et al. (1998) for univariate smoothing. Use of the empirical-norm penalty can result in a zero solution for $f_j$ via thresholding $\tilde\beta_j$ and hence achieve sparsity in high-dimensional additive modeling.

From Proposition 1, we propose the following backfitting algorithm to minimize (4). To solve the Lasso problem (6), it is possible to use a variety of numerical methods, including coordinate descent (Friedman et al., 2010; Wu & Lange, 2008), gradient-descent related methods (Beck & Teboulle, 2009; Kim et al., 2007), and active-set descent (Osborne et al., 2000). In particular, Petersen et al. (2016) used a fast fused-lasso algorithm (Höfling, 2010) for solving (6) with $m = 1$, which seems not applicable for $m \ge 2$. We employ a variant of active-set descent, which is attractive for the following reasons. First, the performance of backfitting depends on how accurately the sub-problem (6) is solved within each block. More accurate within-block solutions may result in fewer backfitting cycles to achieve convergence by a certain criterion. The active-set method finds an exact solution after a finite number of iterations, and the computational cost is often reasonable in sparse settings. When the estimate $\tilde\beta_j$ from the previous backfitting cycle is used as an initial value, the method also allows problem (6) to be solved with one or a few iterations if the previous estimate $\tilde\beta_j$ is close to the desired solution. See the Supplementary Material for details of the active-set method.
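To illustrate the block update summarized in Algorithm 1 below, the following Python sketch (ours, not the authors' R implementation) solves the Lasso problem (6) for a single block with a plain proximal-gradient loop as a stand-in for the active-set method advocated above, and then applies the thresholding of Proposition 1; the helper names and toy data are illustrative assumptions.

```python
import numpy as np

def lasso_block(Phi, r, lam, penalized, n_iter=500):
    """Approximately solve (6): min 0.5*||r - Phi b||_n^2 + lam * sum_{k in penalized} |b_k|,
    by proximal gradient (ISTA); a simple stand-in for the active-set solver."""
    n = Phi.shape[0]
    beta = np.zeros(Phi.shape[1])
    step = 1.0 / (np.linalg.eigvalsh(Phi.T @ Phi / n).max() + 1e-12)
    for _ in range(n_iter):
        grad = -Phi.T @ (r - Phi @ beta) / n
        beta = beta - step * grad
        # soft-threshold only the truncated-power (penalized) coefficients
        beta[penalized] = np.sign(beta[penalized]) * np.maximum(
            np.abs(beta[penalized]) - step * lam, 0.0)
    return beta

def threshold_block(Phi, beta_tilde, rho):
    """Proposition 1: shrink the Lasso solution by (1 - rho/||Phi beta||_n)_+."""
    fit_norm = np.linalg.norm(Phi @ beta_tilde) / np.sqrt(Phi.shape[0])
    if fit_norm <= rho:
        return np.zeros_like(beta_tilde)
    return (1.0 - rho / fit_norm) * beta_tilde

# One block update on toy data (illustrative only).
rng = np.random.default_rng(3)
n, d, m = 100, 20, 2
Phi = rng.normal(size=(n, d))
Phi -= Phi.mean(axis=0)                  # empirically centered basis
r = Phi[:, 5] - 0.5 * Phi[:, 12] + 0.3 * rng.normal(size=n)
penalized = np.arange(m - 1, d)          # first m-1 columns are unpenalized
beta_tilde = lasso_block(Phi, r, lam=0.05, penalized=penalized)
beta_hat = threshold_block(Phi, beta_tilde, rho=0.1)
print(np.count_nonzero(beta_hat))
```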

Algorithm 1 Block Descent and Thresholding (BDT) algorithm
1: Initialize: Set $\hat\beta_j = 0$ for $j = 1, \ldots, p$.
2: for $j = 1, 2, \ldots, p$ do
3:   if any screening condition (Section 3.2) is satisfied then
4:     Return $\hat\beta_j$ (as determined by the screening rules in Algorithm 2).
5:   else
6:     Update the residual: $r_j = y - \bar y - \sum_{k\neq j}\tilde\Phi_k\hat\beta_k$.
7:     Compute a solution $\tilde\beta_j$ to problem (6).
8:     Threshold the solution: $\hat\beta_j = \big(1 - \rho/\|\tilde\Phi_j\tilde\beta_j\|_n\big)_+\tilde\beta_j$.
9:   end if
10: end for
11: Repeat lines 2-10 until convergence of the objective (4).

Many other methods, notably coordinate descent, only provide an approximate solution with a pre-specified precision. To obtain a more precise solution, more iterations are needed. In addition, the active-set method is tuning-free, whereas gradient-descent related methods involve tuning of step sizes to achieve satisfactory convergence. Selection of such step sizes can be cumbersome for all $p$ Lasso problems (6) in backfitting.

3.2. Screening rules

In principle, to solve sub-problem (5), we first compute $\tilde\beta_j$ and then threshold it to $\hat\beta_j$. For relatively large $\rho$, even if $\tilde\beta_j$ is nonzero, the solution $\hat\beta_j$ after thresholding may become 0. For relatively large $\lambda$, it may occur that only the first $m-1$ components of $\hat\beta_j$ (not penalized by $\|D_j\beta_j\|_1$) are nonzero. To speed up computation, we derive screening rules to directly detect these scenarios and determine $\hat\beta_j$, without directly solving the Lasso problem (6).

By the matrix interpretation of $\tilde\Phi_j$ and $r_j$, rewrite the sub-problem (5) as
$$\min_{(\beta_j^{(1)},\,\beta_j^{(2)})}\left\{\frac{1}{2}\|r_j - \tilde\Phi_j^{(1)}\beta_j^{(1)} - \tilde\Phi_j^{(2)}\beta_j^{(2)}\|_n^2 + \lambda\|\beta_j^{(2)}\|_1 + \rho\|\tilde\Phi_j^{(1)}\beta_j^{(1)} + \tilde\Phi_j^{(2)}\beta_j^{(2)}\|_n\right\}, \qquad (7)$$
where $\tilde\Phi_j$ is partitioned into $(\tilde\Phi_j^{(1)}, \tilde\Phi_j^{(2)})$ with $\tilde\Phi_j^{(1)}$ giving the first $m-1$ columns of $\tilde\Phi_j$, and accordingly $\beta_j$ is partitioned into $(\beta_j^{(1)T}, \beta_j^{(2)T})^T$, with $\beta_j^{(1)}$ giving the first $m-1$ components of $\beta_j$ (not penalized by the Lasso). If $m = 1$, then $\tilde\Phi_j^{(1)}$ and $\beta_j^{(1)}$ become degenerate. Proposition 2 provides the main result for our construction of screening rules.

Proposition 2. For $m \ge 2$, the following results hold for a solution $\hat\beta_j = (\hat\beta_j^{(1)T}, \hat\beta_j^{(2)T})^T$ to problem (7).
1. $\hat\beta_j^{(1)} \neq 0$ and $\hat\beta_j^{(2)} = 0$ if
$$\|\Pi_j r_j\|_n > \rho \quad\text{and}\quad \|\tilde\Phi_j^{(2)T}(r_j - \Pi_j r_j)\|_\infty/n \le \lambda, \qquad (8)$$
where $\Pi_j$ is the projection ("hat") matrix onto the column space of $\tilde\Phi_j^{(1)}$.
2. $\hat\beta_j = 0$ if there exists a vector $u \in \mathbb{R}^n$ satisfying
$$\|u\|_n \le \rho \quad\text{and}\quad \tilde\Phi_j^{(1)T}(r_j - u) = 0 \quad\text{and}\quad \|\tilde\Phi_j^{(2)T}(r_j - u)\|_\infty/n \le \lambda. \qquad (9)$$
For $m = 1$, result 2 holds with the second equality in (9) removed.

For ease of application, Corollary 1 gives a reformulation of condition (9), in terms of the existence of a suitable scalar for an arbitrarily fixed vector $h \in \mathbb{R}^n$.

Corollary 1. For $m \ge 2$, $\hat\beta_j = 0$ is a solution to problem (7) if for any $h \in \mathbb{R}^n$, there exists $\alpha \in \mathbb{R}$ satisfying
$$\|r_j - (1-\alpha)(h - \Pi_j h)\|_n \le \rho \quad\text{and}\quad (1-\alpha)\|\tilde\Phi_j^{(2)T}(h - \Pi_j h)\|_\infty/n \le \lambda. \qquad (10)$$
For $m = 1$, the result holds with $h - \Pi_j h$ replaced by $h$ or, equivalently, $\Pi_j$ set to 0.

From these results, we obtain the following screening rules. For $m = 1$, we skip step 1 and set $\Pi_j = 0$.

Algorithm 2 Screening rules
1: Return $\hat\beta_j^{(1)} = \big(1 - \rho/\|\Pi_j r_j\|_n\big)\big(\tilde\Phi_j^{(1)T}\tilde\Phi_j^{(1)}\big)^{-1}\tilde\Phi_j^{(1)T}r_j$ and $\hat\beta_j^{(2)} = 0$ if
  (1a) $\|\Pi_j r_j\|_n > \rho$ and $\|\tilde\Phi_j^{(2)T}(r_j - \Pi_j r_j)\|_\infty/n \le \lambda$.
2: Return $\hat\beta_j^{(1)} = 0$ and $\hat\beta_j^{(2)} = 0$ if one of the following conditions is satisfied:
  (2a) $\|r_j\|_n \le \rho$.
  (2b) $\|r_j\|_n > \rho$ and $\|\Pi_j r_j\|_n \le \rho$ and $(1-\alpha)\|\tilde\Phi_j^{(2)T}(r_j - \Pi_j r_j)\|_\infty/n \le \lambda$, where $\alpha = \{(n\rho^2 - r_j^T\Pi_j r_j)/(r_j^T r_j - r_j^T\Pi_j r_j)\}^{1/2}\ (< 1)$.
  (2c) $\|r_j\|_n > \rho$ and $(1-\alpha)\|\tilde\Phi_j^{(2)T}(h - \Pi_j h)\|_\infty/n \le \lambda$, where $h = r_j - \tilde\Phi_j\tilde\beta_j$ with $\tilde\beta_j$ obtained from the previous backfitting cycle, and $\alpha$ is determined such that $\|r_j - (1-\alpha)(h - \Pi_j h)\|_n = \rho$.

Condition 1(a) is derived from (8), and condition 2(a) is from (9) with $u = r_j$. Condition 2(b) is from (10) with $h = r_j$, where the first inequality in (10) reduces to $\alpha^2(r_j^T r_j - r_j^T\Pi_j r_j) \le n\rho^2 - r_j^T\Pi_j r_j$. Condition 2(c) is from (10) with $h = r_j - \tilde\Phi_j\tilde\beta_j$, where $\tilde\beta_j$ is a solution of Lasso problem (6), and hence $\tilde\Phi_j\tilde\beta_j$ is the vector of fitted values before thresholding, from the previous backfitting cycle. The motivation for this choice is that if the "response" vector $r_j$ is similar to the previous one, then $\|\tilde\Phi_j^{(2)T}(h - \Pi_j h)\|_\infty/n$ would remain similar to $\lambda$. The screening rules in Algorithm 2 are more effective than would be derived by first detecting $\tilde\beta_j = 0$ for problem (6) and then deciding $\hat\beta_j = 0$. For $m = 1$, it holds that $\hat\beta_j = 0$ if either $\|r_j\|_n \le \rho$ or $\|r_j\|_n > \rho$ and $(1 - \rho/\|r_j\|_n)\|\tilde\Phi_j^T r_j\|_\infty/n \le \lambda$, whereas a necessary and sufficient condition for $\tilde\beta_j = 0$ is $\|\tilde\Phi_j^T r_j\|_\infty/n \le \lambda$.

As another use of the screening rules, we also obtain the following conditions on the tuning parameters $(\lambda, \rho)$ to imply a completely zero solution to problem (4), $\hat\beta_1 = \cdots = \hat\beta_p = 0$.

Corollary 2. It holds that $\hat\beta_1 = \cdots = \hat\beta_p = 0$ is a solution to problem (4) if either (a) $\|y - \bar y\|_n \le \rho$, or (b) $\|y - \bar y\|_n > \rho$, $\|\Pi_j(y - \bar y)\|_n \le \rho$, and $\|\tilde\Phi_j^{(2)T}\{(y - \bar y) - \Pi_j(y - \bar y)\}\|_\infty/n \le \lambda$ for $j = 1, \ldots, p$.

In numerical implementation, we restrict the search of $(\lambda, \rho)$ to a grid such that $\lambda \le \lambda_{\max}$ and $\rho \le \rho_{\max}$, where $\rho_{\max} = \|y - \bar y\|_n$ and $\lambda_{\max} = \max_{j=1,\ldots,p}\|\tilde\Phi_j^{(2)T}\{(y - \bar y) - \Pi_j(y - \bar y)\}\|_\infty/n$. This region includes all $(\lambda, \rho)$ yielding a nonzero solution to (4) for $m = 1$ (with $\Pi_j = 0$), and may be practically sufficient for $m \ge 2$. In addition, because the theoretical analysis of Tan & Zhang (2017) suggests choosing $\rho = O(1)\lambda^2$ under sparsity, we also restrict $\rho \le \lambda$ in the search.
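As a quick illustration of the cheapest of these checks, the sketch below (our own, with hypothetical names; it is not the authors' implementation) evaluates rule 2(a) and rule 1(a) for one block under the formulation above, computing the projection $\Pi_j r_j$ by least squares.

```python
import numpy as np

def screen_block(Phi1, Phi2, r, lam, rho):
    """Screening checks for one block, following rules 2(a) and 1(a).

    Phi1 : (n, m-1) unpenalized (centered polynomial) columns, or None if m = 1
    Phi2 : (n, n-m) penalized (centered truncated-power) columns
    r    : (n,) partial residual for this block
    Returns a string describing which rule fired, or None.
    """
    n = r.shape[0]
    norm_n = lambda v: np.linalg.norm(v) / np.sqrt(n)

    # Rule 2(a): ||r_j||_n <= rho  =>  the whole block is zero.
    if norm_n(r) <= rho:
        return "zero block (rule 2a)"

    # Projection of r onto the column space of Phi1 (Pi_j r_j); zero if m = 1.
    Pr = np.zeros(n) if Phi1 is None else Phi1 @ np.linalg.lstsq(Phi1, r, rcond=None)[0]

    # Rule 1(a): only the unpenalized components survive.
    if norm_n(Pr) > rho and np.max(np.abs(Phi2.T @ (r - Pr))) / n <= lam:
        return "unpenalized components only (rule 1a)"

    return None  # no screening rule fired; solve the Lasso sub-problem

# Tiny example with random inputs (illustrative only).
rng = np.random.default_rng(1)
n = 50
Phi1 = rng.normal(size=(n, 1))
Phi2 = rng.normal(size=(n, 10))
r = 0.01 * rng.normal(size=n)
print(screen_block(Phi1, Phi2, r, lam=0.5, rho=0.2))
```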

4. Logistic additive modeling

As an extension, we provide a backfitting algorithm for logistic additive modeling with a binary response:
$$P(y_i = 1 \mid x_i) = \mathrm{expit}\Big\{\mu + \sum_{j=1}^p f_j(x_{ij})\Big\},$$
where $\mathrm{expit}(c) = \{1 + \exp(-c)\}^{-1}$. Doubly penalized estimation involves minimizing the objective function
$$\frac{1}{n}\sum_{i=1}^n \ell\big(y_i;\, \mu + f(x_i)\big) + \sum_{j=1}^p\left\{\lambda\|D_j\beta_j\|_1 + \rho\|\tilde\Phi_j\beta_j\|_n\right\}, \qquad (11)$$
where $\ell(y_i;\, \mu + f(x_i)) = \log\{1 + \exp(\mu + f(x_i))\} - y_i(\mu + f(x_i))$ with $f = \sum_{j=1}^p f_j$ and $f_j = \tilde\Phi_j\beta_j$ as in (4), but $\mu$ cannot be directly solved for. For each cycle of backfitting, the $j$th sub-problem is
$$\min_{(\mu,\,\beta_j)}\; \frac{1}{n}\sum_{i=1}^n \ell\big(y_i;\, \mu + \hat f^{(-j)}(x_i) + \tilde\Phi_j(x_{ij})\beta_j\big) + \lambda\|D_j\beta_j\|_1 + \rho\|\tilde\Phi_j\beta_j\|_n, \qquad (12)$$
where $\hat f^{(-j)} = \sum_{k\neq j}\hat f_k$ and $\{\hat f_k : k \neq j\}$ are the current estimates. Similarly as in Friedman et al. (2010), we form a quadratic approximation to the negative log-likelihood term (via a Taylor expansion about the previous estimates $\hat\mu$ and $\hat\beta_j$) and solve the following problem, similar to (5) but with weighted least squares:
$$\min_{(\mu,\,\beta_j)}\; \frac{1}{2n}\sum_{i=1}^n \hat w_i\big(z_i - \mu - \tilde\Phi_j(x_{ij})\beta_j\big)^2 + \lambda\|D_j\beta_j\|_1 + \rho\|\tilde\Phi_j\beta_j\|_n, \qquad (13)$$
where $z_i = \hat\mu + \tilde\Phi_j(x_{ij})\hat\beta_j + \hat w_i^{-1}(y_i - \hat p_i)$, $\hat p_i = \mathrm{expit}(\hat\mu + \hat f(x_i))$, and $\hat w_i = \hat p_i(1 - \hat p_i)$.

Proposition 1 can be easily extended for solving (13). However, we employ a simple modification to the Lasso problem associated with (13) when using the active-set algorithm. If the weights $(\hat w_1, \ldots, \hat w_n)$ are updated during backfitting, there can be a substantial cost in accordingly updating the Cholesky decomposition for the active-set algorithm. Therefore, we replace each $\hat w_i$ by the constant $1/4$ and then solve problem (13) in the same way as (5). Because $\hat p_i(1 - \hat p_i)$ is upper bounded by $1/4$, it can be shown that the resulting update of $(\hat\mu, \hat\beta_j)$ remains a descent update, which decreases the objective value (12), by the quadratic lower-bound principle (Böhning & Lindsay, 1988; Wu & Lange, 2010). We summarize the backfitting algorithm as Algorithm 3.

Algorithm 3 Block Descent and Thresholding algorithm for logistic modeling (BDT-Logit)
1: Initialize: Set $\hat\mu = 0$ and $\hat\beta_j = 0$ for $j = 1, \ldots, p$. Set $w_0 = 1/4$.
2: for $j = 1, 2, \ldots, p$ do
3:   Compute $\hat p = \mathrm{expit}\big(\hat\mu + \sum_{k=1}^p\tilde\Phi_k\hat\beta_k\big)$ and $z = \hat\mu + \tilde\Phi_j\hat\beta_j + w_0^{-1}(y - \hat p)$.
4:   Update $\hat\mu = \bar z$, the sample average of $z$.
5:   Update $\hat\beta_j$ as a solution to (using Algorithm 1, lines 3-9)
$$\min_{\beta_j}\;\Big\{\frac{w_0}{2}\|z - \bar z - \tilde\Phi_j\beta_j\|_n^2 + \lambda\|D_j\beta_j\|_1 + \rho\|\tilde\Phi_j\beta_j\|_n\Big\}.$$
6: end for
7: Repeat lines 2-6 until convergence of the objective (11).
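To make the constant-weight update concrete, the sketch below (ours, not the authors' code; names and toy data are illustrative) builds the working response of Algorithm 3 with $w_0 = 1/4$ and the centered working residual for the block update; multiplying the squared-error term by $w_0$ is equivalent to reusing a solver for (5) with penalties rescaled to $(\lambda/w_0, \rho/w_0)$.

```python
import numpy as np

def expit(c):
    return 1.0 / (1.0 + np.exp(-c))

def logistic_working_response(y, mu_hat, Phi_list, beta_list, j, w0=0.25):
    """Working response z for the j-th block in Algorithm 3 (BDT-Logit):
    z = mu_hat + Phi_j beta_j + w0^{-1} (y - p_hat), with constant weight w0 = 1/4."""
    eta = mu_hat + sum(Phi @ b for Phi, b in zip(Phi_list, beta_list))
    p_hat = expit(eta)
    z = mu_hat + Phi_list[j] @ beta_list[j] + (y - p_hat) / w0
    mu_new = z.mean()                    # line 4 of Algorithm 3
    r = z - mu_new                       # centered working residual for the sub-problem
    # The block update then solves  (w0/2)||r - Phi_j b||_n^2 + lam||D_j b||_1 + rho||Phi_j b||_n,
    # i.e. the Gaussian sub-problem (5) with penalties (lam/w0, rho/w0).
    return mu_new, r

# Illustrative call with random centered basis matrices (not the paper's data).
rng = np.random.default_rng(2)
n, p, d = 80, 3, 6
Phi_list = [rng.normal(size=(n, d)) for _ in range(p)]
beta_list = [np.zeros(d) for _ in range(p)]
y = rng.integers(0, 2, size=n).astype(float)
mu_new, r = logistic_working_response(y, 0.0, Phi_list, beta_list, j=0)
print(mu_new, r[:3])
```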

5. Numerical experiments

We evaluate the BDT algorithms and doubly penalized additive modeling (dpam) in two aspects. One is computational performance: we compare active-set and coordinate-descent methods for solving the Lasso sub-problem (6) and investigate the effectiveness of the screening rules. The other is statistical performance in terms of mean squared errors or logistic losses: we compare the estimators obtained from dpam and the related methods SpAM (Ravikumar et al., 2009) and hgam (Meier et al., 2009), implemented in the R packages SAM (Zhao et al., 2014) and hgam (Frick et al., 2013).

5.1. Linear additive modeling

We generate data according to $y_i = \sum_{j=1}^p f_j(x_{ij}) + \varepsilon_i$ with $x_{ij} \sim \mathrm{Uniform}[-2.5, 2.5]$ and $\varepsilon_i \sim N(0, 1)$ for $i = 1, \ldots, n$ ($= 100$). Consider four scenarios (piecewise constant, piecewise linear, smooth, and mixed), where four nonzero functions, $f_1, \ldots, f_4$, are specified (Figure 1). The first scenario (piecewise-constant) is the same as in Petersen et al. (2016). The remaining functions, $f_5, \ldots, f_p$, are zero.

5.1.1. Computational speed

We compare the active-set (AS) and coordinate-descent (CD) methods for solving sub-problem (6) in BDT. To focus on this comparison, the two versions of BDT without screening rules are denoted as AS-BDT and CD-BDT. The total number of basis functions from the $p$ covariates is large, $(n-1)p$. Instead of storing the basis matrices $(\tilde\Phi_1, \ldots, \tilde\Phi_p)$, we only store the inner-product matrices $(\tilde\Phi_1^T\tilde\Phi_{1,S_1}, \ldots, \tilde\Phi_p^T\tilde\Phi_{p,S_p})$, where $\tilde\Phi_{j,S_j}$ is the submatrix of $\tilde\Phi_j$ with column indices in $S_j$, and $S_j$ indicates the subset of basis functions of the $j$th covariate that are ever included in the active set before the termination of training. For the $j$th covariate (or block), the active set is defined as the subset of basis functions whose coefficients are currently estimated as nonzero when using either the AS or CD method. The total column set of stored inner-product matrices is $S = S_1 \cup \cdots \cup S_p$, and its size is denoted as $|S|$.

For $m = 2, 3$, we apply AS-BDT and CD-BDT with a range of tuning parameters, $\lambda = \lambda_{\max}/4^k$ and $\rho = \lambda_{\max}/4^l$ for $2 \le k \le l \le 4$. Within each block, the coordinate-descent algorithm is terminated with the tolerance $10^{-5}$, $10^{-6}$, or $10^{-7}$ in the decrease of objective values when solving the Lasso sub-problem (6). The backfitting cycles using AS-BDT or CD-BDT are terminated when the decrease of the objective (4) is smaller than $10^{-4}$.

Figures 2 and 3 show trace plots of the objective values, with six choices of $(\lambda, \rho)$ in scenario 2 (piecewise linear) for $m = 2, 3$. Similar plots in the other cases are presented in the Supplementary Material. It is evident that not only is AS-BDT 10-100 times faster than CD-BDT to reach the same stopping criterion, but also AS-BDT achieves smaller objective values than CD-BDT when the stopping criterion is met. The smaller the tuning parameters $(\lambda, \rho)$ are, the more substantial the speed gain of AS-BDT is over CD-BDT. In addition, the smaller the within-block tolerance, from $10^{-5}$ to $10^{-7}$, the longer it takes CD-BDT to reach the stopping criterion.

Table 1 summarizes several performance measures for scenario 2 and $m = 2, 3$. Similar results in the other cases are presented in the Supplementary Material. In addition to the objective values achieved ("obj"), we also study the number of backfitting cycles ("cycle"), the average number of iterations per cycle ("iter"), and the column size of stored inner-product matrices ("$|S|$"). For each cycle of backfitting, the number of iterations is defined by summing the numbers of iterations over all $p$ blocks.
The number of iterations within the $j$th block is how many times the descent direction is adjusted when using the active-set algorithm, or how many scans are performed over the basis functions of the $j$th covariate when using the coordinate-descent algorithm, to solve the $j$th Lasso sub-problem. The results from Table 1 are consistent with Figures 2 and 3. Compared with CD-BDT, AS-BDT achieves smaller objective values, with smaller numbers of backfitting cycles or iterations per cycle, especially with small $(\lambda, \rho)$. The number of backfitting cycles from CD-BDT is large when the within-block tolerance is relatively large, whereas the

average number of iterations per cycle increases substantially when the within-block tolerance is reduced. In addition, the column size of stored inner-product matrices is much smaller in AS-BDT than in CD-BDT. The active-set algorithm is more careful than coordinate descent in selecting basis functions into, and removing them from, the active set.

We also apply AS-BDT with screening rules, denoted as AS-BDT-S. Within each block, if screening is successful, then the number of iterations is 0. From Table 1, the average number of iterations per cycle from AS-BDT-S is considerably smaller than that from AS-BDT, especially when $(\lambda, \rho)$ become large. Additional results on the relative frequencies with which the screening rules in Algorithm 2 are successful are provided in the Supplementary Material.

5.1.2. Statistical performance

We generate training, validation, and test sets, each with $p = 100$ covariates and $n = 100$ observations. The tuning parameters $(\lambda, \rho)$ for dpam are selected to minimize the mean squared error (MSE) on the validation set, calculated from the model fitted on the training set, over a grid with $\lambda \le \lambda_{\max}$, $\rho \le \rho_{\max}$, and $\rho \le \lambda$ (Section 3.2). The model is then fitted with the selected $(\lambda, \rho)$ on the training set, and the test MSE is calculated on the test set. The calculation is repeated over 100 datasets, and the average test MSE is reported in Table 3. Similarly, average test MSEs are calculated for SpAM with $d = 3, 6, 10$ and hgam with $K = 5, 20, 30$. From Table 3, the smallest MSE is achieved by dpam with $m = 1$ in scenario 1 (piecewise constant), and by dpam with $m = 2$ in the other three scenarios. Both SpAM and hgam appear to yield MSEs considerably larger than the minimum MSEs obtained by dpam, even in scenario 3 (smooth).

5.2. Logistic additive modeling

We generate data according to $y_i \sim \mathrm{Bernoulli}\big(\mathrm{expit}\big(\sum_{j=1}^p f_j(x_{ij})\big)\big)$, where $x_{ij} \sim \mathrm{Uniform}[-2.5, 2.5]$, the functions $f_1, \ldots, f_4$ are the same as in Section 5.1, and the remaining functions $f_5, \ldots, f_p$ are zero.

5.2.1. Computational speed

We apply three versions of BDT-Logit: AS-BDT-Logit and AS-BDT-Logit-S, using the active-set method without or with the screening rules, and CD-BDT-Logit, using the coordinate-descent method. Table 2 summarizes several performance measures similarly as in Table 1. It is evident that AS-BDT-Logit outperforms CD-BDT-Logit, in achieving smaller logistic losses, smaller numbers of backfitting cycles or iterations per cycle, and smaller sizes of stored inner-product matrices. In addition, AS-BDT-Logit-S outperforms AS-BDT-Logit, with smaller average numbers of iterations per cycle, due to the screening rules.

5.2.2. Statistical performance

Similarly as in Section 5.1, we conduct 100 repeated simulations, each with training and validation sets, to calculate test logistic losses on a test set for dpam and SpAM (Table 4). There is currently no logistic modeling allowed in the R package for hgam. To obtain a meaningful comparison, we increase the sample size to $n = 500$, because the sample size $n = 100$ appears to be insufficient to achieve reasonable estimation for logistic modeling. As shown in Table 4, dpam with $m = 1$ yields the smallest logistic losses in scenario 1 (piecewise constant), whereas dpam with $m = 2$ gives the smallest losses in the other three scenarios.

6. Conclusion

We develop backfitting algorithms for doubly penalized additive modeling using total-variation and empirical-norm penalties, and demonstrate the computational and statistical effectiveness of the proposed method.
For solving the Lasso sub-problems (6), we advocate the use of the active-set method over the coordinate-descent method. It can be of interest to conduct more simulations and investigate possible improvements and extensions.

References

Beck, A & Teboulle, M (2009), 'A fast iterative shrinkage-thresholding algorithm for linear inverse problems', SIAM Journal on Imaging Sciences, 2(1).

Bickel, PJ, Ritov, Y & Tsybakov, AB (2009), 'Simultaneous analysis of Lasso and Dantzig selector', Annals of Statistics, 37(4).

Böhning, D & Lindsay, BG (1988), 'Monotonicity of quadratic approximation algorithms', Annals of the Institute of Statistical Mathematics, 40(4).

Donoho, DL & Johnstone, JM (1994), 'Ideal spatial adaptation by wavelet shrinkage', Biometrika, 81(3).

Frick, H, Kondofersky, I, Kuehnle, OS, Lindenlaub, C, Pfundstein, G, Speidel, M, Spindler, M, Straub, A, Wickler, F, Zink, K, Eugster, M & Hothorn, T (2013), hgam: High-dimensional Additive Modelling, R package.

Friedman, J, Hastie, T & Tibshirani, R (2010), 'Regularization paths for generalized linear models via coordinate descent', Journal of Statistical Software, 33(1), pp. 1-22.

Hastie, T & Tibshirani, R (1990), Generalized Additive Models, Wiley.

Höfling, H (2010), 'A path algorithm for the fused Lasso signal approximator', Journal of Computational and Graphical Statistics, 19(4).

Huang, J, Horowitz, JL & Wei, F (2010), 'Variable selection in nonparametric additive models', Annals of Statistics, 38(4).

Kim, SJ, Koh, K, Boyd, S & Gorinevsky, D (2009), 'ℓ1 trend filtering', SIAM Review, 51(2).

Kim, SJ, Koh, K, Lustig, M, Boyd, S & Gorinevsky, D (2007), 'An interior-point method for large-scale ℓ1-regularized least squares', IEEE Journal of Selected Topics in Signal Processing, 1(4).

Koltchinskii, V & Yuan, M (2010), 'Sparsity in multiple kernel learning', Annals of Statistics.

Lin, Y & Zhang, HH (2006), 'Component selection and smoothing in multivariate nonparametric regression', Annals of Statistics, 34(5).

Mammen, E & van de Geer, S (1997), 'Locally adaptive regression splines', Annals of Statistics, 25(1).

Meier, L, van de Geer, S & Bühlmann, P (2009), 'High-dimensional additive modeling', Annals of Statistics, 37(6B).

Osborne, MR, Presnell, B & Turlach, BA (1998), 'Knot selection for regression splines via the lasso', Computing Science and Statistics, 30.

Osborne, MR, Presnell, B & Turlach, BA (2000), 'A new approach to variable selection in least squares problems', IMA Journal of Numerical Analysis, 20(3).

Petersen, A, Witten, D & Simon, N (2016), 'Fused Lasso additive model', Journal of Computational and Graphical Statistics, 25(4).

Raskutti, G, Wainwright, MJ & Yu, B (2012), 'Minimax-optimal rates for sparse additive models over kernel classes via convex programming', Journal of Machine Learning Research, 13.

Ravikumar, P, Liu, H, Lafferty, J & Wasserman, L (2009), 'SpAM: sparse additive models', Journal of the Royal Statistical Society, Series B, 71(5).

Sadhanala, V & Tibshirani, RJ (2017), 'Additive models with trend filtering', arXiv preprint.

Stone, CJ (1986), 'The dimensionality reduction principle for generalized additive models', Annals of Statistics, 14(2).

Tan, Z & Zhang, CH (2017), 'Penalized estimation in additive regression with high-dimensional data', arXiv preprint.

Tibshirani, R (1996), 'Regression shrinkage and selection via the Lasso', Journal of the Royal Statistical Society, Series B.

Tibshirani, RJ (2014), 'Adaptive piecewise polynomial estimation via trend filtering', Annals of Statistics, 42(1).

Wang, YX, Smola, A & Tibshirani, R (2014), 'The falling factorial basis and its statistical applications', in International Conference on Machine Learning (ICML).

Wood, SN (2017), Generalized Additive Models: An Introduction with R, CRC Press.

Wu, TT & Lange, K (2008), 'Coordinate descent algorithms for Lasso penalized regression', Annals of Applied Statistics.

Wu, TT & Lange, K (2010), 'The MM alternative to EM', Statistical Science, 25(4).

Zhao, T, Li, X, Liu, H & Roeder, K (2014), SAM: Sparse Additive Modelling, R package version 1.0.5.

Figure 1. Nonzero functions $f_1, \ldots, f_4$ used to generate data: (1) scenario 1 (piecewise-constant); (2) scenario 2 (piecewise-linear); (3) scenario 3 (smooth); (4) scenario 4 (combination).

Figure 2. Objective values over running time from AS-BDT and from CD-BDT with within-block tolerance $10^{-5}$, $10^{-6}$, and $10^{-7}$, for scenario 2 and $m = 2$ in the regression setting. The six panels correspond to $(\rho, \lambda) = (\lambda_{\max}/16, \lambda_{\max}/16)$, $(\lambda_{\max}/64, \lambda_{\max}/16)$, $(\lambda_{\max}/256, \lambda_{\max}/16)$, $(\lambda_{\max}/64, \lambda_{\max}/64)$, $(\lambda_{\max}/256, \lambda_{\max}/64)$, and $(\lambda_{\max}/256, \lambda_{\max}/256)$.

Figure 3. As in Figure 2, but for $m = 3$.

Figure 4. Objective values over running time from AS-BDT-Logit and from CD-BDT-Logit with within-block tolerance $10^{-5}$, $10^{-6}$, and $10^{-7}$, for scenario 2 and $m = 2$ in the classification setting. The six panels use the same $(\rho, \lambda)$ combinations as in Figure 2.

Figure 5. As in Figure 4, but for $m = 3$.

Table 1. Computation comparison for scenario 2 in the regression setting. For $m = 2$ and $m = 3$ and the six $(\lambda, \rho)$ combinations shown in Figures 2 and 3, the table reports the objective value achieved ("obj"), the number of backfitting cycles ("cycle"), the average number of iterations per cycle ("iter"), and the column size $|S|$ of stored inner-product matrices, for AS-BDT, AS-BDT-S, and CD-BDT with within-block tolerances $10^{-5}$, $10^{-6}$, and $10^{-7}$. (Numerical entries omitted.)

Table 2. Computation comparison for scenario 2 in the classification setting. For $m = 2$ and $m = 3$ and the six $(\lambda, \rho)$ combinations shown in Figures 4 and 5, the table reports the logistic loss achieved ("obj"), the number of backfitting cycles ("cycle"), the average number of iterations per cycle ("iter"), and the column size $|S|$ of stored inner-product matrices, for AS-BDT-Logit, AS-BDT-Logit-S, and CD-BDT-Logit with within-block tolerances $10^{-5}$, $10^{-6}$, and $10^{-7}$. (Numerical entries omitted.)

Table 3. Test MSEs from linear additive modeling (standard errors in parentheses)

Method                              Scenario 1     Scenario 2    Scenario 3    Scenario 4
SpAM (Ravikumar et al., 2009)
  d = 3                             - (0.05)       1.75 (0.03)   1.62 (0.02)   1.78 (0.03)
  d = 6                             - (0.05)       2.4 (0.04)    2.03 (0.03)   2.3 (0.04)
  d = 10                            - (0.06)       3.8 (0.05)    2.69 (0.05)   2.6 (0.05)
hgam (Meier et al., 2009)
  K = 5                             - (0.04)       1.59 (0.03)   1.97 (0.03)   1.84 (0.02)
  K = 20                            - (0.04)       1.57 (0.03)   1.95 (0.03)   1.82 (0.02)
  K = 30                            - (0.04)       1.53 (0.03)   1.89 (0.03)   1.78 (0.02)
dpam (this paper)
  m = 1                             2.03 (0.04)    1.88 (0.03)   1.76 (0.03)   1.74 (0.02)
  m = 2                             - (0.04)       1.40 (0.02)   1.40 (0.02)   1.60 (0.02)
  m = 3                             - (0.04)       1.5 (0.02)    1.46 (0.02)   1.64 (0.02)

Note: FLAM (Petersen et al., 2016) corresponds to dpam with m = 1, except that linear interpolation is used when evaluating the fitted functions on the validation and test sets.

Table 4. Test logistic losses (x10) from logistic additive modeling (standard errors in parentheses)

Method                              Scenario 1     Scenario 2    Scenario 3    Scenario 4
SpAM (Ravikumar et al., 2009)
  d = 3                             - (0.01)       5.38 (0.01)   5.33 (0.01)   5.08 (0.01)
  d = 6                             5.8 (0.01)     5.49 (0.01)   5.42 (0.01)   5.8 (0.01)
  d = 10                            5.3 (0.02)     5.55 (0.01)   5.46 (0.01)   5.25 (0.01)
dpam (this paper)
  m = 1                             4.97 (0.02)    5.08 (0.01)   5.8 (0.01)    5.04 (0.02)
  m = 2                             5.1 (0.02)     4.88 (0.01)   5.0 (0.01)    5.0 (0.01)
  m = 3                             5.4 (0.02)     4.88 (0.01)   5.04 (0.01)   5.0 (0.01)

Note: Logistic modeling is currently not implemented in the R package for hgam (Meier et al., 2009).


More information

Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Statistics for high-dimensional data: Group Lasso and additive models

Statistics for high-dimensional data: Group Lasso and additive models Statistics for high-dimensional data: Group Lasso and additive models Peter Bühlmann and Sara van de Geer Seminar für Statistik, ETH Zürich May 2012 The Group Lasso (Yuan & Lin, 2006) high-dimensional

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

LASSO Isotone for High-Dimensional Additive Isotonic Regression

LASSO Isotone for High-Dimensional Additive Isotonic Regression Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/jcgs LASSO Isotone for High-Dimensional Additive Isotonic Regression Zhou FANG and Nicolai MEINSHAUSEN

More information

An algorithm for the multivariate group lasso with covariance estimation

An algorithm for the multivariate group lasso with covariance estimation An algorithm for the multivariate group lasso with covariance estimation arxiv:1512.05153v1 [stat.co] 16 Dec 2015 Ines Wilms and Christophe Croux Leuven Statistics Research Centre, KU Leuven, Belgium Abstract

More information

2 Tikhonov Regularization and ERM

2 Tikhonov Regularization and ERM Introduction Here we discusses how a class of regularization methods originally designed to solve ill-posed inverse problems give rise to regularized learning algorithms. These algorithms are kernel methods

More information

Sparse Permutation Invariant Covariance Estimation: Motivation, Background and Key Results

Sparse Permutation Invariant Covariance Estimation: Motivation, Background and Key Results Sparse Permutation Invariant Covariance Estimation: Motivation, Background and Key Results David Prince Biostat 572 dprince3@uw.edu April 19, 2012 David Prince (UW) SPICE April 19, 2012 1 / 11 Electronic

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

Learning discrete graphical models via generalized inverse covariance matrices

Learning discrete graphical models via generalized inverse covariance matrices Learning discrete graphical models via generalized inverse covariance matrices Duzhe Wang, Yiming Lv, Yongjoon Kim, Young Lee Department of Statistics University of Wisconsin-Madison {dwang282, lv23, ykim676,

More information

Variable Selection for Generalized Additive Mixed Models by Likelihood-based Boosting

Variable Selection for Generalized Additive Mixed Models by Likelihood-based Boosting Variable Selection for Generalized Additive Mixed Models by Likelihood-based Boosting Andreas Groll 1 and Gerhard Tutz 2 1 Department of Statistics, University of Munich, Akademiestrasse 1, D-80799, Munich,

More information

Saharon Rosset 1 and Ji Zhu 2

Saharon Rosset 1 and Ji Zhu 2 Aust. N. Z. J. Stat. 46(3), 2004, 505 510 CORRECTED PROOF OF THE RESULT OF A PREDICTION ERROR PROPERTY OF THE LASSO ESTIMATOR AND ITS GENERALIZATION BY HUANG (2003) Saharon Rosset 1 and Ji Zhu 2 IBM T.J.

More information

arxiv: v1 [math.st] 1 Dec 2014

arxiv: v1 [math.st] 1 Dec 2014 HOW TO MONITOR AND MITIGATE STAIR-CASING IN L TREND FILTERING Cristian R. Rojas and Bo Wahlberg Department of Automatic Control and ACCESS Linnaeus Centre School of Electrical Engineering, KTH Royal Institute

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:

More information

COS 424: Interacting with Data

COS 424: Interacting with Data COS 424: Interacting with Data Lecturer: Rob Schapire Lecture #14 Scribe: Zia Khan April 3, 2007 Recall from previous lecture that in regression we are trying to predict a real value given our data. Specically,

More information

Computational and Statistical Aspects of Statistical Machine Learning. John Lafferty Department of Statistics Retreat Gleacher Center

Computational and Statistical Aspects of Statistical Machine Learning. John Lafferty Department of Statistics Retreat Gleacher Center Computational and Statistical Aspects of Statistical Machine Learning John Lafferty Department of Statistics Retreat Gleacher Center Outline Modern nonparametric inference for high dimensional data Nonparametric

More information

The Nonparanormal skeptic

The Nonparanormal skeptic The Nonpara skeptic Han Liu Johns Hopkins University, 615 N. Wolfe Street, Baltimore, MD 21205 USA Fang Han Johns Hopkins University, 615 N. Wolfe Street, Baltimore, MD 21205 USA Ming Yuan Georgia Institute

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Fast Regularization Paths via Coordinate Descent

Fast Regularization Paths via Coordinate Descent KDD August 2008 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerry Friedman and Rob Tibshirani. KDD August 2008

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

Sparsity and the Lasso

Sparsity and the Lasso Sparsity and the Lasso Statistical Machine Learning, Spring 205 Ryan Tibshirani (with Larry Wasserman Regularization and the lasso. A bit of background If l 2 was the norm of the 20th century, then l is

More information

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) Contents 1 Vector Spaces 1 1.1 The Formal Denition of a Vector Space.................................. 1 1.2 Subspaces...................................................

More information

Supplement to A Generalized Least Squares Matrix Decomposition. 1 GPMF & Smoothness: Ω-norm Penalty & Functional Data

Supplement to A Generalized Least Squares Matrix Decomposition. 1 GPMF & Smoothness: Ω-norm Penalty & Functional Data Supplement to A Generalized Least Squares Matrix Decomposition Genevera I. Allen 1, Logan Grosenic 2, & Jonathan Taylor 3 1 Department of Statistics and Electrical and Computer Engineering, Rice University

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE 2010 2935 Variance-Component Based Sparse Signal Reconstruction and Model Selection Kun Qiu, Student Member, IEEE, and Aleksandar Dogandzic,

More information

arxiv: v1 [stat.me] 4 Oct 2013

arxiv: v1 [stat.me] 4 Oct 2013 Monotone Splines Lasso Linn Cecilie Bergersen, Kukatharmini Tharmaratnam and Ingrid K. Glad Department of Mathematics, University of Oslo arxiv:1310.1282v1 [stat.me] 4 Oct 2013 Abstract We consider the

More information

Pathwise coordinate optimization

Pathwise coordinate optimization Stanford University 1 Pathwise coordinate optimization Jerome Friedman, Trevor Hastie, Holger Hoefling, Robert Tibshirani Stanford University Acknowledgements: Thanks to Stephen Boyd, Michael Saunders,

More information

290 J.M. Carnicer, J.M. Pe~na basis (u 1 ; : : : ; u n ) consisting of minimally supported elements, yet also has a basis (v 1 ; : : : ; v n ) which f

290 J.M. Carnicer, J.M. Pe~na basis (u 1 ; : : : ; u n ) consisting of minimally supported elements, yet also has a basis (v 1 ; : : : ; v n ) which f Numer. Math. 67: 289{301 (1994) Numerische Mathematik c Springer-Verlag 1994 Electronic Edition Least supported bases and local linear independence J.M. Carnicer, J.M. Pe~na? Departamento de Matematica

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

Introduction Wavelet shrinage methods have been very successful in nonparametric regression. But so far most of the wavelet regression methods have be

Introduction Wavelet shrinage methods have been very successful in nonparametric regression. But so far most of the wavelet regression methods have be Wavelet Estimation For Samples With Random Uniform Design T. Tony Cai Department of Statistics, Purdue University Lawrence D. Brown Department of Statistics, University of Pennsylvania Abstract We show

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"

Kneib, Fahrmeir: Supplement to Structured additive regression for categorical space-time data: A mixed model approach Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/

More information

Statistical Machine Learning for Structured and High Dimensional Data

Statistical Machine Learning for Structured and High Dimensional Data Statistical Machine Learning for Structured and High Dimensional Data (FA9550-09- 1-0373) PI: Larry Wasserman (CMU) Co- PI: John Lafferty (UChicago and CMU) AFOSR Program Review (Jan 28-31, 2013, Washington,

More information

Chapter 3. Linear Models for Regression

Chapter 3. Linear Models for Regression Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear

More information

Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, Emily Fox 2014

Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, Emily Fox 2014 Case Study 3: fmri Prediction Fused LASSO LARS Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, 2014 Emily Fox 2014 1 LASSO Regression

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Exact Hybrid Covariance Thresholding for Joint Graphical Lasso

Exact Hybrid Covariance Thresholding for Joint Graphical Lasso Exact Hybrid Covariance Thresholding for Joint Graphical Lasso Qingming Tang Chao Yang Jian Peng Jinbo Xu Toyota Technological Institute at Chicago Massachusetts Institute of Technology Abstract. This

More information

Sparse Additive machine

Sparse Additive machine Sparse Additive machine Tuo Zhao Han Liu Department of Biostatistics and Computer Science, Johns Hopkins University Abstract We develop a high dimensional nonparametric classification method named sparse

More information