Backfitting algorithms for total-variation and empirical-norm penalized additive modeling with high-dimensional data
The ISI's Journal for the Rapid Dissemination of Statistics Research (wileyonlinelibrary.com) DOI: 10.100X/sta

Backfitting algorithms for total-variation and empirical-norm penalized additive modeling with high-dimensional data

Ting Yang^a, Zhiqiang Tan^a

Received 10 May 2018; Accepted 00 Month 2018

Additive modeling is useful in studying nonlinear relationships between a response and covariates. We develop backfitting algorithms to implement the doubly penalized method in Tan & Zhang (2017), using total-variation and empirical-norm penalties. Use of the total-variation penalty leads to an automatic selection of knots for each component function, whereas use of the empirical-norm penalty can result in zero solutions for component functions and hence facilitates component selection in high-dimensional settings. For a backfitting cycle, each component function is updated by thresholding a solution to a Lasso problem, which is computed using an active-set descent method. Screening rules are also derived to determine zero solutions without solving the Lasso problem directly. We present numerical experiments to demonstrate the effectiveness of the proposed algorithms for linear and logistic additive modeling. Copyright © 2012 John Wiley & Sons, Ltd.

Keywords: Additive model; High-dimensional data; Total variation; Nonparametric smoothing; Penalized estimation; Trend filtering.

1. Introduction

Additive models provide a useful extension of linear models to allow nonlinear dependency of a mean response on covariates (Stone, 1986). For a sample of size $n$, let $y_i$ be a response and $x_i = (x_{i1}, \ldots, x_{ip})^T$ be a $p$-dimensional covariate vector for $i = 1, \ldots, n$. Consider a nonparametric additive model

$$y_i = \mu + f(x_i) + \varepsilon_i = \mu + \sum_{j=1}^p f_j(x_{ij}) + \varepsilon_i,$$

where $\mu$ is a constant term, $f_j(\cdot)$ is an unknown function of the $j$th covariate, and $\varepsilon_i$ is a noise with mean zero and finite variance $\sigma^2$.
^a Department of Statistics, Rutgers University, New Jersey, USA. Email: ztan@stat.rutgers.edu

Stat 2012, Copyright © 2012 John Wiley & Sons, Ltd. [Version: 2012/05/12 v1.00]

Theory and methods for generalized additive models are well studied in low-dimensional settings ($p \ll n$) (Hastie & Tibshirani, 1990; Wood, 2017). Recently, there has been considerable research on sparse additive models
in high-dimensional settings, where $p$ is close to or greater than $n$, but the number of nonzero functions $f_j(\cdot)$, up to some centering, is still smaller than $n$. See Section 2 for a review of related work.

In this article, we propose a backfitting algorithm, called block descent and thresholding (BDT), to implement doubly penalized estimation studied in Tan & Zhang (2017), using total-variation and empirical-norm penalties. For a function $g_j(\cdot)$ of the $j$th covariate in an interval $[a_j, b_j]$, the total variation is defined as

$$\mathrm{TV}(g_j) = \sup\Big\{ \sum_{i=1}^k |g_j(z_i) - g_j(z_{i-1})| : z_0 < z_1 < \cdots < z_k \text{ is a partition of } [a_j, b_j], \text{ for any } k \Big\}.$$

If $g_j$ is differentiable with derivative $g_j^{(1)}$, then $\mathrm{TV}(g_j) = \int |g_j^{(1)}(z)|\,dz$. The empirical norm (precisely, empirical $L_2$ norm) of $g_j$ is defined as $\|g_j\|_n = \{n^{-1}\sum_{i=1}^n g_j^2(x_{ij})\}^{1/2}$. For some integer $m \ge 1$ and tuning parameters $(\lambda, \rho)$, the doubly penalized estimator $(\hat\mu, \hat f_1, \ldots, \hat f_p)$ is defined as a minimizer of the following penalized loss

$$\frac{1}{2n}\sum_{i=1}^n (y_i - \mu - f(x_i))^2 + \sum_{j=1}^p \big\{ \lambda\,\mathrm{TV}(f_j^{(m-1)}) + \rho\,\|f_j\|_n \big\}, \qquad (1)$$

over $(\mu, f_1, \ldots, f_p)$, where $f = \sum_{j=1}^p f_j$, and $f_j^{(m-1)}$ is the $(m-1)$th derivative of $f_j$ with $f_j^{(0)} \equiv f_j$. If $f_j$ is $m$-times differentiable, then $\mathrm{TV}(f_j^{(m-1)}) = \int |f_j^{(m)}(z)|\,dz$. By extending Mammen & van de Geer (1997) for univariate smoothing, Tan & Zhang (2017) showed that each $\hat f_j$ can be chosen to be a spline of order $m$, that is, a piecewise polynomial of degree $m-1$ and, if $m \ge 2$, an $(m-2)$-times continuously differentiable function. Moreover, the knots of each $\hat f_j$ can be obtained from the data points $\{x_{ij} : i = 1, \ldots, n\}$ if $m = 1$ or 2, but not necessarily so if $m \ge 3$. For simplicity, we always restrict each $f_j(\cdot)$ to be a spline of order $m$ with knots from the data points.

The penalty term in (1) involves two penalties, playing complementary roles, for each component function $f_j$. The total variation $\mathrm{TV}(f_j^{(m-1)})$ is used to induce smoothness of $f_j$.
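To make the roles of the two penalties in (1) concrete, here is a minimal pure-Python sketch (function names are ours, not from the paper), assuming $m = 1$ and covariate values sorted in increasing order, so that $\mathrm{TV}(f_j^{(0)}) = \mathrm{TV}(f_j)$ is the sum of absolute jumps of a piecewise-constant fit:

```python
def total_variation(values):
    # TV of the piecewise-constant interpolant through `values`,
    # assuming the underlying x-values are sorted in increasing order.
    return sum(abs(values[i + 1] - values[i]) for i in range(len(values) - 1))

def empirical_norm(values):
    # ||f_j||_n = { n^{-1} sum_i f_j(x_ij)^2 }^{1/2}.
    n = len(values)
    return (sum(v * v for v in values) / n) ** 0.5

def penalized_loss(y, components, mu, lam, rho):
    # Objective (1) with m = 1: squared-error loss plus, for each component,
    # lam * TV(f_j) + rho * ||f_j||_n.
    n = len(y)
    fit = [mu + sum(f[i] for f in components) for i in range(n)]
    loss = sum((y[i] - fit[i]) ** 2 for i in range(n)) / (2.0 * n)
    penalty = sum(lam * total_variation(f) + rho * empirical_norm(f)
                  for f in components)
    return loss + penalty
```

For instance, a component taking values (0, 1, 1, 0) at four sorted data points has total variation 2, and a component that matches the centered data exactly contributes nothing to the squared-error part of the loss.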
In fact, as shall be seen in Section 3.1, the total-variation penalty leads to an automatic selection of knots from data points for each $f_j$, similarly as Lasso selection of nonzero coefficients in linear regression (Tibshirani, 1996). The empirical norm $\|f_j\|_n$ is used to induce sparsity for $f_j$ (i.e., setting $f_j$ to zero). See Lemma 1 for how a zero solution for $f_j$ can be caused by the presence of the empirical-norm penalty. In the special case where $f_j(z) = \beta_j z$ is linear for all $j$, the overall penalty in (1) reduces to the Lasso penalty $\sum_{j=1}^p |\beta_j|$ up to some scaling for linear regression. In other words, the penalty in (1) can be regarded as a functional extension of the Lasso penalty, with $|\beta_j|$ replaced by a combination of $\mathrm{TV}(f_j^{(m-1)})$ and $\|f_j\|_n$.

The theoretical analysis in Tan & Zhang (2017) shows under some technical conditions that if $\lambda$ and $\rho$ are properly specified, then the prediction error of $\hat f = \sum_{j=1}^p \hat f_j$ achieves a minimax rate

$$\|\hat f - f\|_n^2 = O_p(1)\,(M_F + M_0)\Big\{ n^{-2m/(2m+1)} + \frac{\log(p)}{n} \Big\}, \qquad (2)$$

for any decomposition $f = \sum_{j=1}^p f_j$ such that $\#\{j : f_j \not\equiv 0\} \le M_0$ and $\sum_{j=1}^p \mathrm{TV}(f_j^{(m-1)}) \le M_F$. A similar result holds for out-of-sample prediction errors. The error bound (2) consists of two terms, reflecting the errors from respectively nonparametric smoothing and variable selection. The first term, of order $n^{-2m/(2m+1)}$, is the minimax rate of estimation in univariate smoothing ($p = 1$) over the function class $\{f_1 : \mathrm{TV}(f_1^{(m-1)}) \le C\}$ for a constant $C$ (Mammen & van de Geer, 1997). The second term, of order $M_0\log(p)/n$, is known as the Lasso fast rate of prediction errors in linear regression with at most $M_0$ nonzero regression coefficients (Bickel et al., 2009).

Our backfitting algorithm can be modified to implement doubly penalized estimation using $L_2$ Sobolev-seminorm and empirical-norm penalties, where the penalty term in (1) is replaced by $\sum_{j=1}^p \{\lambda\|f_j^{(m)}\|_{L_2} + \rho\|f_j\|_n\}$, with $\|f_j^{(m)}\|_{L_2} =$
$\{\int (f_j^{(m)}(z))^2\,dz\}^{1/2}$. Denote by $(\tilde f_1, \ldots, \tilde f_p)$ the resulting estimators. There are, however, two notable differences from when the total-variation penalty is used. First, each $\tilde f_j$ is a smoothing spline with all the data points $\{x_{ij} : i = 1, \ldots, n\}$ as knots (Meier et al., 2009). No selection of knots is achieved. Second, the error bound (2) is also valid with $\hat f$ replaced by $\tilde f = \sum_{j=1}^p \tilde f_j$ and $M_F$ replaced by $\sum_{j=1}^p \|f_j^{(m)}\|_{L_2}$ (Koltchinskii & Yuan, 2010; Raskutti et al., 2012). But a bounded function class $\{f_1 : \|f_1^{(m)}\|_{L_2} \le C\}$ is strictly smaller than $\{f_1 : \mathrm{TV}(f_1^{(m-1)}) \le C\}$ for any constant $C$. For univariate smoothing ($p = 1$), smoothing splines cannot achieve the rate $n^{-2m/(2m+1)}$ as do total-variation splines in the larger function class $\{f_1 : \mathrm{TV}(f_1^{(m-1)}) \le C\}$ (Donoho & Johnstone, 1994; Mammen & van de Geer, 1997).

We use the following notation. For a function $h(y, x)$, the empirical norm of $h$ is defined as $\|h\|_n = \{n^{-1}\sum_{i=1}^n h^2(y_i, x_i)\}^{1/2}$. For example, $\|y - f\|_n^2 = n^{-1}\sum_{i=1}^n (y_i - f(x_i))^2$. The sample average of $h$ is $\bar h = n^{-1}\sum_{i=1}^n h(y_i, x_i)$. For example, $\bar y = n^{-1}\sum_{i=1}^n y_i$. Moreover, for a vector $u = (u_1, \ldots, u_n)^T$, denote $\|u\|_n = (n^{-1}\sum_{i=1}^n u_i^2)^{1/2}$. For a vector $v = (v_1, \ldots, v_k)^T$, denote $\|v\|_1 = \sum_{j=1}^k |v_j|$ and $\|v\|_\infty = \max_{j=1,\ldots,k} |v_j|$.

The rest of the paper is organized as follows. Section 2 reviews related work. In Section 3, we develop a backfitting algorithm for minimizing the penalized loss (1) in linear additive modeling. In Section 4, we extend the algorithm to logistic additive modeling. Section 5 presents numerical experiments. Section 6 concludes the paper. The Supplementary Material contains proofs and additional discussion and results.

2. Related work

Recently, there has been considerable research on penalized estimation for additive modeling, beyond earlier work as in Hastie & Tibshirani (1990). For example, Lin & Zhang (2006) used a penalty $\lambda\sum_{j=1}^p \|f_j^{(m)}\|_{L_2}$. Huang et al.
(2010) studied a similar method using adaptive group Lasso. Ravikumar et al. (2009) used a penalty $\lambda\sum_{j=1}^p \|f_j\|_n$, and restricted each $f_j$ to the span of an $n \times d$ basis matrix, which is pre-specified. Meier et al. (2009) used a penalty in the form $\lambda\sum_{j=1}^p \{\|f_j\|_n^2 + \lambda_2\|f_j^{(m)}\|_{L_2}^2\}^{1/2}$, and parameterized $f_j$ using B-spline basis functions with pre-specified $K$ knots in numerical implementation. Koltchinskii & Yuan (2010) and Raskutti et al. (2012) used a penalty term $\sum_{j=1}^p \{\lambda\|f_j^{(m)}\|_{L_2} + \rho\|f_j\|_n\}$, but did not present numerical algorithms for minimizing the penalized loss. As mentioned above, such doubly penalized estimation can be handled by extending our backfitting algorithm. The doubly penalized method in Tan & Zhang (2017) differs from Koltchinskii & Yuan (2010) and Raskutti et al. (2012) in the use of the total-variation penalty, which leads to automatic knot selection for each component $f_j$ and allows the same rate of convergence achieved in much larger bounded-variation function classes.

Use of total-variation penalties seems to be first studied by Mammen & van de Geer (1997) for univariate smoothing, where the penalized loss is $\|y - f_1\|_n^2/2 + \lambda\,\mathrm{TV}(f_1^{(m-1)})$. Recently, Kim et al. (2009) proposed a related method for univariate smoothing, called trend filtering, by minimizing the penalized loss over $\beta_1$,

$$\frac{1}{2}\|y - \beta_1\|_2^2 + \lambda\|D^{(m)} P_1 \beta_1\|_1,$$

where $\|\cdot\|_2$ denotes the $L_2$ norm, $y = (y_1, \ldots, y_n)^T$, $\beta_1 = (f_1(x_{11}), \ldots, f_1(x_{n1}))^T$, $P_1$ is the permutation matrix that sorts $(x_{11}, \ldots, x_{n1})$ in the ascending order, and $D^{(m)}$ is the $m$th-order difference matrix. Tibshirani (2014) showed that trend filtering is equivalent to total-variation splines in Mammen & van de Geer (1997) for $m = 1$ and 2, but not in general for $m \ge 3$. In addition, trend filtering is shown to achieve the same (minimax) rate of convergence as total-variation splines in bounded-variation classes when the data points are evenly spaced for $m \ge 1$. For additive modeling, Petersen et al.
(2016) proposed a fused lasso additive model (FLAM), by minimizing the penalized loss over $(\mu, \beta_1, \ldots, \beta_p)$:

$$\frac{1}{2}\Big\|y - \mu - \sum_{j=1}^p \beta_j\Big\|_2^2 + \lambda\sum_{j=1}^p \big\{\alpha\|D^{(1)} P_j \beta_j\|_1 + (1-\alpha)\|\beta_j\|_2\big\},$$
where $\lambda$ is a tuning parameter, $\alpha \in [0, 1]$, $\beta_j = (f_j(x_{1j}), \ldots, f_j(x_{nj}))^T$, and $P_j$ is the permutation matrix that sorts the data points $(x_{1j}, \ldots, x_{nj})$. This fitting procedure is easily shown to be statistically equivalent to that of the doubly penalized method with $m = 1$ in Tan & Zhang (2017), up to a transformation between the tuning parameters $(\lambda, \alpha)$ above and $(\lambda, \rho)$ in (1). The fitted values $\hat\mu + \sum_{j=1}^p \hat\beta_j$ from FLAM and $\hat\mu + \sum_{j=1}^p (\hat f_j(x_{1j}), \ldots, \hat f_j(x_{nj}))^T$ from our method can be matched. However, there is a subtle difference in how out-of-sample predictions are made. The fitted function $\hat f_j$ of the $j$th covariate is defined by linear interpolation of the fitted values $\hat\beta_j$ in FLAM, but directly by piecewise-constant interpolation in our method, which is more aligned with the first-order total-variation penalty used. To obtain fitted functions $\hat f_j$ that are piecewise linear, our method uses the second-order total-variation penalty (i.e., $m = 2$). In general, our method also accommodates use of higher-order total-variation penalties.

Sadhanala & Tibshirani (2017) also proposed additive modeling with trend filtering, by minimizing the penalized loss

$$\frac{1}{2}\Big\|y - \mu - \sum_{j=1}^p \beta_j\Big\|_2^2 + \lambda\sum_{j=1}^p \|D^{(m)} P_j \beta_j\|_1.$$

This method allows higher-order trend filtering on component functions, but does not incorporate empirical-norm penalties, which are crucial to achieve sparsity in high-dimensional settings. The theory and numerical evaluations are provided only in low-dimensional settings in Sadhanala & Tibshirani (2017). Our backfitting algorithm can be modified to handle both trend filtering and empirical-norm penalties, i.e., a penalty term $\sum_{j=1}^p \{\lambda\|D^{(m)} P_j \beta_j\|_1 + \rho\|\beta_j\|_n\}$, by recasting trend filtering using the falling factorial basis functions (Wang et al., 2014).

3. Linear additive modeling

We develop a backfitting algorithm for minimizing the objective function (1). As discussed in Section 1, we restrict each $f_j$ to be a spline of order $m$ with knots from the data points $\{x_{ij} : i = 1, \ldots, n\}$.
For $1 \le j \le p$, define the knot superset for $f_j$ as (Mammen & van de Geer, 1997)

$$\{t_{(1)}, \ldots, t_{(n-m)}\} = \begin{cases} \{x_{((m-1)/2+2)}, \ldots, x_{(n-(m-1)/2)}\} & \text{if } m \text{ is odd}, \\ \{x_{(m/2+1)}, \ldots, x_{(n-m/2)}\} & \text{if } m \text{ is even}, \end{cases}$$

where $x_{(1)} < \cdots < x_{(n)}$ are the ordered values of the $j$th covariate, and $t_{(1)} < \cdots < t_{(n-m)}$. The data points near the left and right boundaries are removed to avoid over-parameterization. As shown in Tan & Zhang (2017), a solution $(\hat f_1, \ldots, \hat f_p)$ obtained with this restriction is also an unrestricted minimizer of (1) if $m = 1$ or 2, but not necessarily so if $m \ge 3$. For univariate smoothing, Mammen & van de Geer (1997) also showed that the restricted and unrestricted solutions achieve the same rate of convergence under a mild condition on the maximum spacing between the data points. A similar result can be expected to hold for additive models.

To represent splines of order $m$, we use the truncated power basis, defined as

$$\Phi_{k,j}(z) = z^k, \quad k = 1, \ldots, m-1; \qquad \Phi_{m-1+k,j}(z) = (z - t_{(k)})_+^{m-1}, \quad k = 1, 2, \ldots, n-m,$$

where $(c)_+ = \max(0, c)$ and $(c)_+^0 = 0$ if $c \le 0$ or $1$ if $c > 0$. Denote $f_j = \theta_{0,j} + \Phi_j\theta_j$, where $\Phi_j = (\Phi_{1,j}, \ldots, \Phi_{n-1,j})$ and $\theta_j = (\theta_{1,j}, \ldots, \theta_{n-1,j})^T$. After simple algebra, the objective function (1) can be written as

$$\frac{1}{2}\|y - \mu - f\|_n^2 + \sum_{j=1}^p \big\{\lambda\|D\theta_j\|_1 + \rho\|\theta_{0,j} + \Phi_j\theta_j\|_n\big\}, \qquad (3)$$

where $D$ is a diagonal matrix with the only nonzero elements $d_m = \cdots = d_{n-1} = 1$. Using the truncated power basis transforms the total variation $\mathrm{TV}(f_j^{(m-1)})$ into a Lasso penalty $\|D\theta_j\|_1 = \sum_{k=1}^{n-m} |\theta_{m-1+k,j}|$.
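As a concrete illustration, the truncated power basis can be evaluated in a few lines of pure Python (the helper below is our own sketch, not the paper's implementation); note the special case $m = 1$, where $(z - t)_+^0$ is the indicator of $z > t$:

```python
def truncated_power_basis(z, knots, m):
    # Evaluate the (m-1) + len(knots) truncated power basis functions at z:
    # z^k for k = 1, ..., m-1, then (z - t_(k))_+^{m-1} over the knots.
    poly = [float(z) ** k for k in range(1, m)]
    if m == 1:
        # (c)_+^0 = 1 if c > 0 and 0 otherwise: piecewise-constant splines.
        trunc = [1.0 if z > t else 0.0 for t in knots]
    else:
        trunc = [max(z - t, 0.0) ** (m - 1) for t in knots]
    return poly + trunc
```

For $m = 2$ this yields a linear term plus hinge functions at the knots, which is exactly the piecewise-linear spline representation used when the second-order total-variation penalty is applied.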
Because $\mu$ is non-penalized, it follows that for (3) to be minimized, $\hat\mu = \bar y$, and each $\hat f_j$ is empirically centered, that is, $\hat\theta_{0,j} = -\bar\Phi_j\hat\theta_j$, where $\bar\Phi_j = n^{-1}\sum_{i=1}^n \Phi_j(x_{ij})$. Then minimization of (3) reduces to that of

$$\frac{1}{2}\|y - \bar y - f\|_n^2 + \sum_{j=1}^p \big\{\lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n\big\}, \qquad (4)$$

over $(\theta_1, \ldots, \theta_p)$, where $f = \sum_{j=1}^p f_j$, $f_j = \tilde\Phi_j\theta_j$, and $\tilde\Phi_j = \Phi_j - \bar\Phi_j$, the empirically centered version of $\Phi_j$.

3.1. Backfitting

To minimize (3) or, equivalently, (4), a backfitting algorithm involves solving $p$ sub-problems, updating the components $(f_1, \ldots, f_p)$ sequentially. For any $1 \le j \le p$, the $j$th sub-problem is

$$\min_{\theta_j}\ \Big\{\frac{1}{2}\|r_j - \tilde\Phi_j\theta_j\|_n^2 + \lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n\Big\}, \qquad (5)$$

where $r_j = y - \bar y - \sum_{k\ne j}\hat f_k$ and $\{\hat f_k : k \ne j\}$ are the current estimates. By abuse of notation, $\tilde\Phi_j$ can also be treated as the $n \times (n-1)$ matrix with $i$th row $\tilde\Phi_j(x_{ij})$, and $r_j$ be treated as the $n \times 1$ vector with $i$th element $y_i - \bar y - \sum_{k\ne j}\hat f_k(x_{ik})$. Proposition 1 provides the main idea of our algorithm for solving problem (5). This result serves as a generalization of Corollary 3.1 in Petersen et al. (2016).

Proposition 1 Suppose that $\tilde\theta_j$ is a solution to

$$\min_{\theta_j}\ \Big\{\frac{1}{2}\|r_j - \tilde\Phi_j\theta_j\|_n^2 + \lambda\|D\theta_j\|_1\Big\}. \qquad (6)$$

Then a solution to problem (5) is

$$\hat\theta_j = \Big(1 - \frac{\rho}{\|\tilde\Phi_j\tilde\theta_j\|_n}\Big)_+ \tilde\theta_j.$$

Proposition 1 says that a solution to problem (5) is determined by directly thresholding a solution of the Lasso problem (6). From our proof, a more general result holds where $r_j$ is a "response" vector, $\tilde\Phi_j$ is a data matrix, and $\lambda\|D\theta_j\|_1$ is replaced by a semi-norm penalty $R(\theta_j)$, including $R(\theta_j) = \lambda\|D\theta_j\|_2$ in the case of the $L_2$ seminorm $\|f_j^{(m)}\|_{L_2}$ used instead of $\mathrm{TV}(f_j^{(m-1)})$ in (1). Proposition 1 sheds light on the consequences of using the two penalties in (5). Use of the total-variation or Lasso penalty can induce a sparse solution $\tilde\theta_j$ with only a few nonzero components, corresponding to an automatic selection of knots for $f_j$, as shown in Osborne et al. (1998) for univariate smoothing. Use of the empirical-norm penalty can result in a zero solution for $f_j$ via thresholding $\tilde\theta_j$ and hence achieve sparsity in high-dimensional additive modelling.
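The thresholding step in Proposition 1 translates directly into code. Below is a hedged pure-Python sketch (names are ours): given a Lasso solution for (6) and its vector of fitted values, the whole coefficient block is scaled by the factor $(1 - \rho/\|\tilde\Phi_j\tilde\theta_j\|_n)_+$, which zeroes the component whenever its fitted values are small in empirical norm.

```python
def threshold_block(theta_tilde, fitted, rho):
    # Proposition 1: hat(theta) = (1 - rho / ||Phi theta_tilde||_n)_+ theta_tilde,
    # where `fitted` holds the n fitted values Phi_j @ theta_tilde.
    n = len(fitted)
    norm_n = (sum(v * v for v in fitted) / n) ** 0.5
    if norm_n <= rho:
        # Empirical-norm penalty zeroes the whole component function.
        return [0.0] * len(theta_tilde)
    shrink = 1.0 - rho / norm_n
    return [shrink * t for t in theta_tilde]
```

The knot selection itself happens inside the Lasso solve for (6); the code above only performs the outer group-style shrinkage induced by the empirical-norm penalty.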
From Proposition 1, we propose the following backfitting algorithm, Algorithm 1, to minimize (4). To solve the Lasso problem (6), it is possible to use a variety of numerical methods including coordinate descent (Friedman et al., 2010; Wu & Lange, 2008), gradient-descent related methods (Beck & Teboulle, 2009; Kim et al., 2007), and active-set descent (Osborne et al., 2000). In particular, Petersen et al. (2016) used a fast fused-lasso algorithm (Hoefling, 2010) for solving (6) with $m = 1$, which seems not applicable for $m \ge 2$. We employ a variant of active-set descent, which is attractive for the following reasons. First, the performance of backfitting depends on how accurately the sub-problem (6) is solved within each block. More accurate within-block solutions may result in fewer backfitting cycles to achieve convergence by a certain criterion. The active-set method finds an exact solution after a finite number of iterations, and the computational cost is often reasonable in sparse settings. When the estimate $\tilde\theta_j$ from the previous backfitting cycle is used as an initial value, the method also allows problem (6) to be solved with one or a few iterations if the previous estimate $\tilde\theta_j$ is close to the desired solution. See the Supplementary Material for details of the active-set method.
Algorithm 1 Block Descent and Thresholding (BDT) algorithm
1: Initialize: Set $\hat\theta_j = 0$ for $j = 1, \ldots, p$.
2: for $j = 1, 2, \ldots, p$ do
3:   if any screening condition (Section 3.2) is satisfied then
4:     Return $\hat\theta_j$.
5:   else
6:     Update the residual: $r_j = y - \bar y - \sum_{k\ne j}\tilde\Phi_k\hat\theta_k$.
7:     Compute a solution $\tilde\theta_j$ to problem (6).
8:     Threshold the solution: $\hat\theta_j = \big(1 - \rho/\|\tilde\Phi_j\tilde\theta_j\|_n\big)_+\tilde\theta_j$.
9:   end if
10: end for
11: Repeat lines 2-10 until convergence of the objective (4).

Many other methods, notably coordinate descent, only provide an approximate solution with a pre-specified precision. To obtain a more precise solution, more iterations are needed. In addition, the active-set method is tuning-free, whereas gradient-descent related methods involve tuning of step sizes to achieve satisfactory convergence. Selection of such step sizes can be cumbersome for all $p$ Lasso problems (6) in backfitting.

3.2. Screening rules

In principle, to solve the sub-problem (5), we first compute $\tilde\theta_j$ and then threshold it to $\hat\theta_j$. For relatively large $\rho$, even if $\tilde\theta_j$ is nonzero, the solution $\hat\theta_j$ after thresholding may become 0. For relatively large $\lambda$, it may occur that only the first $m-1$ components of $\hat\theta_j$ (not penalized by $\lambda\|D\theta_j\|_1$) are nonzero. To speed up computation, we derive screening rules to directly detect these scenarios and determine $\hat\theta_j$, without solving the Lasso problem (5). By the matrix interpretation for $\tilde\Phi_j$ and $r_j$, rewrite the sub-problem (5) as

$$\min_{(\theta_j^{(1)}, \theta_j^{(2)})}\ \Big\{\frac{1}{2}\|r_j - \tilde\Phi_j^{(1)}\theta_j^{(1)} - \tilde\Phi_j^{(2)}\theta_j^{(2)}\|_n^2 + \lambda\|\theta_j^{(2)}\|_1 + \rho\|\tilde\Phi_j^{(1)}\theta_j^{(1)} + \tilde\Phi_j^{(2)}\theta_j^{(2)}\|_n\Big\}, \qquad (7)$$

where $\tilde\Phi_j$ is partitioned into $(\tilde\Phi_j^{(1)}, \tilde\Phi_j^{(2)})$ with $\tilde\Phi_j^{(1)}$ giving the first $m-1$ columns of $\tilde\Phi_j$, and accordingly $\theta_j$ is partitioned into $(\theta_j^{(1)T}, \theta_j^{(2)T})^T$, with $\theta_j^{(1)}$ giving the first $m-1$ components of $\theta_j$ (not penalized by Lasso). If $m = 1$, then $\tilde\Phi_j^{(1)}$ and $\theta_j^{(1)}$ become degenerate. Proposition 2 provides the main result for our construction of screening rules.

Proposition 2 For $m \ge 2$, the following results hold for a solution $\hat\theta_j = (\hat\theta_j^{(1)T}, \hat\theta_j^{(2)T})^T$ to problem (7).
1. $\hat\theta_j^{(1)} \ne 0$ and $\hat\theta_j^{(2)} = 0$ if

$$\|\Pi_j r_j\|_n > \rho \quad\text{and}\quad \|\tilde\Phi_j^{(2)T}(r_j - \Pi_j r_j)\|_\infty/n \le \lambda, \qquad (8)$$

where $\Pi_j$ is the projection ("hat") matrix onto the column space of $\tilde\Phi_j^{(1)}$.

2. $\hat\theta_j = 0$ if there exists a vector $u \in \mathbb{R}^n$ satisfying

$$\|u\|_n \le \rho \quad\text{and}\quad \tilde\Phi_j^{(1)T}(r_j - u) = 0 \quad\text{and}\quad \|\tilde\Phi_j^{(2)T}(r_j - u)\|_\infty/n \le \lambda. \qquad (9)$$

For $m = 1$, result 2 holds with the second equality in (9) removed.
For ease of application, Corollary 1 gives a reformulation of condition (9), in terms of the existence of a suitable scalar $\gamma$ for an arbitrarily fixed vector $h \in \mathbb{R}^n$.

Corollary 1 For $m \ge 2$, $\hat\theta_j = 0$ is a solution to problem (7) if for any $h \in \mathbb{R}^n$, there exists $\gamma \in \mathbb{R}$ satisfying

$$\|r_j - (1-\gamma)(h - \Pi_j h)\|_n \le \rho \quad\text{and}\quad \|(1-\gamma)\tilde\Phi_j^{(2)T}(h - \Pi_j h)\|_\infty/n \le \lambda. \qquad (10)$$

For $m = 1$, the result holds with $h - \Pi_j h$ replaced by $h$ or, equivalently, $\Pi_j$ set to 0.

From these results, we obtain the following screening rules. For $m = 1$, we skip step 1 and set $\Pi_j = 0$.

Algorithm 2 Screening rules
1: Return $\hat\theta_j^{(1)} = (1 - \rho/\|\Pi_j r_j\|_n)(\tilde\Phi_j^{(1)T}\tilde\Phi_j^{(1)})^{-1}\tilde\Phi_j^{(1)T} r_j$ and $\hat\theta_j^{(2)} = 0$ if
  (1a) $\|\Pi_j r_j\|_n > \rho$ and $\|\tilde\Phi_j^{(2)T}(r_j - \Pi_j r_j)\|_\infty/n \le \lambda$.
2: Return $\hat\theta_j^{(1)} = 0$ and $\hat\theta_j^{(2)} = 0$ if one of the following conditions is satisfied:
  (2a) $\|r_j\|_n \le \rho$.
  (2b) $\|r_j\|_n > \rho$ and $\|\Pi_j r_j\|_n \le \rho$ and $\|(1-\gamma)\tilde\Phi_j^{(2)T}(r_j - \Pi_j r_j)\|_\infty/n \le \lambda$, where $\gamma = \{(n\rho^2 - r_j^T\Pi_j r_j)/(r_j^T r_j - r_j^T\Pi_j r_j)\}^{1/2}\ (< 1)$.
  (2c) $\|r_j\|_n > \rho$ and $\|(1-\gamma)\tilde\Phi_j^{(2)T}(h - \Pi_j h)\|_\infty/n \le \lambda$, where $h = r_j - \tilde\Phi_j\tilde\theta_j$ with $\tilde\theta_j$ obtained from the previous backfitting cycle, and $\gamma$ is determined such that $\|r_j - (1-\gamma)(h - \Pi_j h)\|_n = \rho$.

Condition (1a) is derived from (8), and condition (2a) is from (9) with $u = r_j$. Condition (2b) is from (10) with $h = r_j$, where the first inequality in (10) reduces to $\gamma^2(r_j^T r_j - r_j^T\Pi_j r_j) \le n\rho^2 - r_j^T\Pi_j r_j$. Condition (2c) is from (10) with $h = r_j - \tilde\Phi_j\tilde\theta_j$, where $\tilde\theta_j$ is a solution of the Lasso problem (6), and hence $\tilde\Phi_j\tilde\theta_j$ is the vector of fitted values before thresholding, from the previous backfitting cycle. The motivation for this choice is that if the "response" vector $r_j$ is similar to the previous one, then $\|\tilde\Phi_j^{(2)T}(h - \Pi_j h)\|_\infty/n$ would remain similar to $\lambda$.

The screening rules in Algorithm 2 are more effective than would be derived by first detecting $\tilde\theta_j = 0$ for problem (6) and then deciding $\hat\theta_j = 0$. For $m = 1$, it holds that $\hat\theta_j = 0$ if either $\|r_j\|_n \le \rho$, or $\|r_j\|_n > \rho$ and $(1 - \rho/\|r_j\|_n)\|\tilde\Phi_j^T r_j\|_\infty/n \le \lambda$, whereas a necessary and sufficient condition for $\tilde\theta_j = 0$ is $\|\tilde\Phi_j^T r_j\|_\infty/n \le \lambda$.
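As an illustration of how cheap these checks are, here is a pure-Python sketch (our own, not the paper's code) of the $m = 1$ screening just described: the component can be declared zero when $\|r_j\|_n \le \rho$, or when $(1 - \rho/\|r_j\|_n)\|\tilde\Phi_j^T r_j\|_\infty/n \le \lambda$, in either case without solving the Lasso sub-problem.

```python
def screen_zero_m1(r, phi_columns, lam, rho):
    # Screening for m = 1: returns True when hat(theta)_j = 0 can be declared
    # without solving (6). `phi_columns` holds the columns of the centered
    # basis matrix, each as a length-n list.
    n = len(r)
    r_norm = (sum(v * v for v in r) / n) ** 0.5
    if r_norm <= rho:                       # condition (2a)
        return True
    # max_k |phi_k^T r_j|, the gradient-based check after thresholding
    grad = max(abs(sum(c[i] * r[i] for i in range(n))) for c in phi_columns)
    return (1.0 - rho / r_norm) * grad / n <= lam
```

The second check is strictly weaker than requiring $\|\tilde\Phi_j^T r_j\|_\infty/n \le \lambda$ whenever $\rho > 0$, which is the sense in which these rules are more effective than detecting $\tilde\theta_j = 0$ alone.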
As another use of the screening rules, we also obtain the following conditions on the tuning parameters $(\lambda, \rho)$ to imply a completely zero solution to problem (4), $\hat\theta_1 = \cdots = \hat\theta_p = 0$.

Corollary 2 It holds that $\hat\theta_1 = \cdots = \hat\theta_p = 0$ is a solution to problem (4) if either (a) $\|y - \bar y\|_n \le \rho$, or (b) $\|y - \bar y\|_n > \rho$, $\|\Pi_j(y - \bar y)\|_n \le \rho$, and $\|\tilde\Phi_j^{(2)T}(y - \bar y - \Pi_j(y - \bar y))\|_\infty/n \le \lambda$ for $j = 1, \ldots, p$.

In numerical implementation, we restrict the search of $(\lambda, \rho)$ to a grid such that $\lambda \le \lambda_{\max}$ and $\rho \le \rho_{\max}$, where $\rho_{\max} = \|y - \bar y\|_n$ and $\lambda_{\max} = \max_j \|\tilde\Phi_j^{(2)T}(y - \bar y - \Pi_j(y - \bar y))\|_\infty/n$. This region includes all $(\lambda, \rho)$ yielding a nonzero solution to (4) for $m = 1$ with $\Pi_j = 0$, and may be practically sufficient for $m \ge 2$. In addition, because the theoretical analysis of Tan & Zhang (2017) suggests choosing $\lambda = O(1)\rho^2$ under sparsity, we also restrict $\lambda \le \rho^2$ in the search.
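A small sketch of the tuning-parameter grid (ours; names and layout are illustrative). The quantity $\rho_{\max}$ implements Corollary 2(a), and the geometric spacing by powers of 4 follows the experiments in Section 5; computing $\lambda_{\max}$ would additionally require the centered basis matrices, so it is taken as given here.

```python
def rho_max(y):
    # Corollary 2(a): any rho >= ||y - ybar||_n yields the all-zero solution.
    n = len(y)
    ybar = sum(y) / n
    return (sum((v - ybar) ** 2 for v in y) / n) ** 0.5

def tuning_grid(lam_max, r_max, powers=(1, 2, 3, 4), factor=4.0):
    # Candidate pairs (lambda, rho) on a geometric grid lam_max / 4^k,
    # rho_max / 4^l, as used in the numerical experiments.
    lams = [lam_max / factor ** k for k in powers]
    rhos = [r_max / factor ** k for k in powers]
    return [(lam, rho) for lam in lams for rho in rhos]
```

Any additional restriction on the grid (such as the sparsity-motivated one above) can then be applied by filtering the returned pairs.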
4. Logistic additive modeling

As an extension, we provide a backfitting algorithm for logistic additive modeling with a binary response:

$$P(y_i = 1 \mid x_i) = \mathrm{expit}\Big\{\mu + \sum_{j=1}^p f_j(x_{ij})\Big\},$$

where $\mathrm{expit}(c) = \{1 + \exp(-c)\}^{-1}$. Doubly penalized estimation involves minimizing the objective function

$$n^{-1}\sum_{i=1}^n \ell(y_i, \mu + f(x_i)) + \sum_{j=1}^p \{\lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n\}, \qquad (11)$$

where $\ell(y_i, \mu + f(x_i)) = \log\{1 + \exp(\mu + f(x_i))\} - y_i(\mu + f(x_i))$ with $f = \sum_{j=1}^p f_j$ and $f_j = \tilde\Phi_j\theta_j$ as in (4), but this cannot be directly solved. For each cycle of backfitting, the $j$th sub-problem is

$$\min_{(\mu, \theta_j)}\ n^{-1}\sum_{i=1}^n \ell\big(y_i, \mu + \hat f^{(-j)}(x_i) + \tilde\Phi_j(x_{ij})\theta_j\big) + \lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n, \qquad (12)$$

where $\hat f^{(-j)} = \sum_{k\ne j}\hat f_k$ and $\{\hat f_k : k \ne j\}$ are the current estimates. Similarly as in Friedman et al. (2010), we form a quadratic approximation to the negative log-likelihood term (via a Taylor expansion about the previous estimates $\hat\mu$ and $\hat\theta_j$) and solve the following problem, similar to (5) but with weighted least squares:

$$\min_{(\mu, \theta_j)}\ \frac{1}{2n}\sum_{i=1}^n w_i(\zeta_i - \mu - \tilde\Phi_j(x_{ij})\theta_j)^2 + \lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n, \qquad (13)$$

where $\zeta_i = \hat\mu + \tilde\Phi_j(x_{ij})\hat\theta_j + \hat w_i^{-1}(y_i - \hat p_i)$, $\hat p_i = \mathrm{expit}(\hat\mu + \hat f(x_i))$, and $\hat w_i = \hat p_i(1 - \hat p_i)$. Proposition 1 can be easily extended for solving (13). However, we employ a simple modification to the Lasso problem associated with (13) when using the active-set algorithm. If the weights $(w_1, \ldots, w_n)$ are updated during backfitting, there can be substantial cost in accordingly updating the Cholesky decomposition for the active-set algorithm. Therefore, we replace each $w_i$ by the constant $1/4$ and then solve problem (13) in the same way as (5). Because $\hat p_i(1 - \hat p_i)$ is upper bounded by $1/4$, it can be shown that the resulting update of $(\hat\mu, \hat\theta_j)$ remains a descent update, which decreases the objective value (12), by the quadratic lower bound principle (Böhning & Lindsay, 1988; Wu & Lange, 2010). We summarize the backfitting algorithm as Algorithm 3.

Algorithm 3 Block Descent and Thresholding algorithm for logistic modeling (BDT-Logit)
1: Initialize: Set $\hat\mu = 0$ and $\hat\theta_j = 0$ for $j = 1, \ldots, p$. Set $w_0 = 1/4$.
2: for $j = 1, 2, \ldots, p$ do
3:   Compute $\hat p = \mathrm{expit}(\hat\mu + \sum_{k=1}^p \tilde\Phi_k\hat\theta_k)$ and $\zeta = \hat\mu + \tilde\Phi_j\hat\theta_j + w_0^{-1}(y - \hat p)$.
4:   Update $\hat\mu = \bar\zeta$, the sample average of $\zeta$.
5:   Update $\hat\theta_j$ as a solution to (using Algorithm 1, lines 3-9)

$$\min_{\theta_j}\ \Big\{\frac{w_0}{2}\|\zeta - \bar\zeta - \tilde\Phi_j\theta_j\|_n^2 + \lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n\Big\}.$$

6: end for
7: Repeat lines 2-6 until convergence of the objective (11).
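The quadratic-approximation step of Algorithm 3 can be sketched in a few lines of pure Python (our own illustration, with hypothetical names): the working response uses the constant weight $w_0 = 1/4$, which upper-bounds $p(1-p)$ and thereby preserves the descent property.

```python
import math

def expit(c):
    # expit(c) = 1 / (1 + exp(-c))
    return 1.0 / (1.0 + math.exp(-c))

def working_response(y, eta, w0=0.25):
    # zeta_i = eta_i + (y_i - p_i) / w0 with p_i = expit(eta_i), as in line 3
    # of Algorithm 3; `eta` is the current linear predictor for each i.
    return [e + (yi - expit(e)) / w0 for yi, e in zip(y, eta)]
```

Since $p(1-p) \le 1/4$ for all $p \in (0,1)$, replacing the IRLS weights by $1/4$ gives a majorizing quadratic, which is the quadratic lower bound principle invoked in the text.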
5. Numerical experiments

We evaluate the BDT algorithms and doubly penalized additive modeling (dpam) in two aspects. One is computational performance: we compare active-set and coordinate-descent methods for solving the Lasso sub-problem (6) and investigate the effectiveness of the screening rules. The other is statistical performance in terms of mean squared errors or logistic losses: we compare the estimators obtained from dpam and the related methods SpAM (Ravikumar et al., 2009) and hgam (Meier et al., 2009), implemented in the R packages SAM (Zhao et al., 2014) and hgam (Frick et al., 2013).

5.1. Linear additive modeling

We generate data according to $y_i = \sum_{j=1}^p f_j(x_{ij}) + \varepsilon_i$ with $x_{ij} \sim$ Uniform$[-2.5, 2.5]$ and $\varepsilon_i \sim N(0, 1)$ for $i = 1, \ldots, n$ ($= 100$). Consider four scenarios (piecewise constant, piecewise linear, smooth, and mixed), where four nonzero functions, $f_1, \ldots, f_4$, are specified (Figure 1). The first scenario (piecewise constant) is the same as in Petersen et al. (2016). The remaining functions, $f_5, \ldots, f_p$, are zero, with $p = 100$.

5.1.1. Computational speed

We compare the active-set (AS) and coordinate-descent (CD) methods for solving sub-problem (6) in BDT. To focus on this comparison, the two versions of BDT without screening rules are denoted as AS-BDT and CD-BDT. The total number of basis functions from the $p$ covariates is large, $(n-1)p$. Instead of storing the basis matrices $(\tilde\Phi_1, \ldots, \tilde\Phi_p)$, we only store the inner-product matrices $(\tilde\Phi_1^T\tilde\Phi_{1,S_1}, \ldots, \tilde\Phi_p^T\tilde\Phi_{p,S_p})$, where $\tilde\Phi_{j,S_j}$ is a submatrix of $\tilde\Phi_j$ with column indices in $S_j$, and $S_j$ indicates the subset of basis functions of the $j$th covariate that are ever included in the active set before the termination of training. For the $j$th covariate (or block), the active set is defined as the subset of basis functions whose coefficients are currently estimated as nonzero when using either the AS or CD method. The total column set of stored inner-product matrices is $S = S_1 \cup \cdots \cup S_p$, and its size is denoted as $|S|$.
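The storage strategy just described can be sketched as a lazy cache of inner products (a simplified illustration with hypothetical names, not the paper's data structure): Gram entries are materialized only when a basis column first enters the active set, rather than storing the full $n \times (n-1)$ matrix per covariate.

```python
class InnerProductCache:
    # Lazily computed Gram entries <phi_k, phi_l> for basis columns that have
    # entered the active set, instead of storing full basis matrices.
    def __init__(self, columns):
        self.columns = columns   # list of basis columns, each a length-n list
        self.cache = {}          # (k, l) with k <= l  ->  inner product

    def inner(self, k, l):
        key = (min(k, l), max(k, l))
        if key not in self.cache:
            a, b = self.columns[key[0]], self.columns[key[1]]
            self.cache[key] = sum(x * y for x, y in zip(a, b))
        return self.cache[key]
```

In sparse regimes, where only a small fraction of the $(n-1)p$ basis functions ever become active, the cache size $|S|$ stays far below the full Gram dimension.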
For $m = 2, 3$, we apply AS-BDT and CD-BDT with a range of tuning parameters, $\lambda = \lambda_{\max}/4^k$ and $\rho = \rho_{\max}/4^l$ for $2 \le k \le l \le 4$. Within each block, the coordinate-descent algorithm is terminated with the tolerance $10^{-5}$, $10^{-6}$, or $10^{-7}$ in the decrease of objective values when solving the Lasso sub-problem (6). The backfitting cycles using AS-BDT or CD-BDT are terminated when the decrease of the objective (4) is smaller than $10^{-4}$.

Figures 2 and 3 show trace plots of the objective values, with six choices of $(\lambda, \rho)$ in scenario 2 (piecewise linear) for $m = 2, 3$. Similar plots in the other cases are presented in the Supplementary Material. It is evident that not only is AS-BDT 10-100 times faster than CD-BDT to reach the same stopping criterion, but also AS-BDT achieves smaller objective values than CD-BDT when the stopping criterion is met. The smaller the tuning parameters $(\lambda, \rho)$ are, the more substantial the speed gain of AS-BDT is over CD-BDT. In addition, the smaller the within-block tolerance, from $10^{-5}$ to $10^{-7}$, the longer it takes CD-BDT to reach the stopping criterion.

Table 1 summarizes several performance measures for scenario 2 and $m = 2, 3$. Similar results in the other cases are presented in the Supplementary Material. In addition to the objective values achieved ("obj"), we also study the number of backfitting cycles ("cycle"), the average number of iterations per cycle ("iter"), and the column size of stored inner-product matrices ("$|S|$"). For each cycle of backfitting, the number of iterations is defined by summing the numbers of iterations over all $p$ blocks. The number of iterations within the $j$th block is how many times the descent direction is adjusted when using the active-set algorithm, or how many scans are performed over the basis functions of the $j$th covariate when using the coordinate-descent algorithm, to solve the $j$th Lasso sub-problem. The results from Table 1 are consistent with Figures 2 and 3.
Compared with CD-BDT, AS-BDT achieves smaller objective values, with smaller numbers of backfitting cycles or iterations per cycle, especially with small $(\lambda, \rho)$. The number of backfitting cycles from CD-BDT is large when the within-block tolerance is relatively large, whereas the
average number of iterations per cycle increases substantially when the within-block tolerance is reduced. In addition, the column size of stored inner-product matrices is much smaller in AS-BDT than CD-BDT. The active-set algorithm is more careful than coordinate descent in selecting basis functions into and removing them from the active set.

We also apply AS-BDT with screening rules, denoted as AS-BDT-S. Within each block, if screening is successful, then the number of iterations is 0. From Table 1, the average number of iterations per cycle from AS-BDT-S is considerably smaller than from AS-BDT, especially when $(\lambda, \rho)$ become large. Additional results are provided in the Supplementary Material on relative frequencies of when the screening rules in Algorithm 2 are successful.

5.1.2. Statistical performance

We generate training, validation, and test sets, each with $p = 100$ covariates and $n = 100$ observations. The tuning parameters $(\lambda, \rho)$ for dpam are selected to minimize the mean squared error (MSE) on the validation set, calculated from the model fitted on the training set, over a grid $\lambda \le \lambda_{\max}$, $\rho \le \rho_{\max}$, and $\lambda \le \rho^2$ (Section 3.2). The model is then fitted with the selected $(\lambda, \rho)$ on the training set, and the test MSE is calculated on the test set. The calculation is repeated over 100 datasets, and the average test MSE is reported in Table 3. Similarly, average test MSEs are calculated for SpAM with $d = 3, 6, 10$ and hgam with $K = 5, 20, 30$. From Table 3, the smallest MSE is achieved by dpam with $m = 1$ in scenario 1 (piecewise constant), and by dpam with $m = 2$ in the other three scenarios.
Both SpAM and hgam appear to yield MSEs considerably larger than the minimum MSEs obtained by dpam, even in scenario 3 (smooth).

5.2. Logistic additive modeling

We generate data according to $y_i \sim$ Bernoulli$(\mathrm{expit}(\sum_{j=1}^p f_j(x_{ij})))$, where $x_{ij} \sim$ Uniform$[-2.5, 2.5]$, the functions $f_1, \ldots, f_4$ are the same as in Section 5.1, and the remaining $f_5, \ldots, f_p$ are zero, with $p = 100$.

5.2.1. Computational speed

We apply three versions of BDT-Logit: AS-BDT-logit and AS-BDT-logit-S, using the active-set method without or with the screening rules, and CD-BDT-logit, using the coordinate-descent method. Table 2 summarizes several performance measures similarly as in Table 1. It is evident that AS-BDT-logit outperforms CD-BDT-logit, in achieving smaller logistic losses, smaller numbers of backfitting cycles or iterations per cycle, and smaller sizes of stored inner-product matrices. In addition, AS-BDT-logit-S outperforms AS-BDT-logit, with smaller average numbers of iterations per cycle, due to the screening rules.

5.2.2. Statistical performance

Similarly as in Section 5.1, we conduct 100 repeated simulations, each with training and validation sets, to calculate test logistic losses on a test set for dpam and SpAM (Table 2). There is currently no logistic modeling allowed in the R package for hgam. To obtain a meaningful comparison, we increase the sample size to $n = 500$, because the sample size $n = 100$ appears to be insufficient to achieve reasonable estimation for logistic modeling. As shown in Table 2, dpam with $m = 1$ yields the smallest logistic losses in scenario 1 (piecewise constant), whereas dpam with $m = 2$ gives the smallest losses in the other three scenarios.

6. Conclusion

We develop backfitting algorithms for doubly penalized additive modeling using total-variation and empirical-norm penalties, and demonstrate the computational and statistical effectiveness of the proposed method. For solving the Lasso sub-problems (6), we advocate the use of the active-set method when compared with the coordinate-descent method.
It can be of interest to conduct further simulations and to investigate possible improvements and extensions.
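The backfitting scheme summarized above, cycling through covariates and refitting each component function to the partial residuals of the others, can be sketched generically as follows. This is plain backfitting with an arbitrary one-dimensional fitting routine, not the paper's doubly penalized update (which thresholds the solution of a Lasso sub-problem at each step); `linear_fit1d` is a placeholder component fitter.

```python
import numpy as np

def backfit(X, y, fit1d, n_cycles=50):
    """Generic backfitting: `fit1d(x, r)` returns fitted values of a
    one-dimensional fit of the partial residuals r on covariate x."""
    n, p = X.shape
    alpha = y.mean()
    F = np.zeros((n, p))                           # component fits f_j(x_ij)
    for _ in range(n_cycles):
        for j in range(p):
            r = y - alpha - F.sum(axis=1) + F[:, j]   # partial residuals
            fj = fit1d(X[:, j], r)
            F[:, j] = fj - fj.mean()               # center for identifiability
    return alpha, F

def linear_fit1d(x, r):
    """Simple-linear-regression fit, as a placeholder component fitter."""
    xc = x - x.mean()
    return xc * (xc @ r) / (xc @ xc)
```

With linear component fitters, backfitting reduces to Gauss–Seidel iteration on the normal equations; the paper replaces `fit1d` by a thresholded Lasso solution over a per-covariate basis.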
References

Beck, A & Teboulle, M (2009), `A fast iterative shrinkage-thresholding algorithm for linear inverse problems,' SIAM Journal on Imaging Sciences, 2(1).
Bickel, PJ, Ritov, Y & Tsybakov, AB (2009), `Simultaneous analysis of Lasso and Dantzig selector,' Annals of Statistics, 37(4).
Böhning, D & Lindsay, BG (1988), `Monotonicity of quadratic approximation algorithms,' Annals of the Institute of Statistical Mathematics, 40(4).
Donoho, DL & Johnstone, JM (1994), `Ideal spatial adaptation by wavelet shrinkage,' Biometrika, 81(3).
Frick, H, Kondofersky, I, Kuehnle, OS, Lindenlaub, C, Pfundstein, G, Speidel, M, Spindler, M, Straub, A, Wickler, F, Zink, K, Eugster, M & Hothorn, T (2013), hgam: High-dimensional Additive Modelling, R package.
Friedman, J, Hastie, T & Tibshirani, R (2010), `Regularization paths for generalized linear models via coordinate descent,' Journal of Statistical Software, 33(1), pp. 1-22.
Hastie, T & Tibshirani, R (1990), Generalized Additive Models, Wiley.
Hoefling, H (2010), `A path algorithm for the fused Lasso signal approximator,' Journal of Computational and Graphical Statistics, 19(4).
Huang, J, Horowitz, JL & Wei, F (2010), `Variable selection in nonparametric additive models,' Annals of Statistics, 38(4).
Kim, SJ, Koh, K, Boyd, S & Gorinevsky, D (2009), `ℓ1 trend filtering,' SIAM Review, 51(2).
Kim, SJ, Koh, K, Lustig, M, Boyd, S & Gorinevsky, D (2007), `An interior-point method for large-scale ℓ1-regularized least squares,' IEEE Journal of Selected Topics in Signal Processing, 1(4).
Koltchinskii, V & Yuan, M (2010), `Sparsity in multiple kernel learning,' Annals of Statistics.
Lin, Y & Zhang, HH (2006), `Component selection and smoothing in multivariate nonparametric regression,' Annals of Statistics, 34(5).
Mammen, E & van de Geer, S (1997), `Locally adaptive regression splines,' Annals of Statistics, 25(1).
Meier, L, van de Geer, S & Bühlmann, P (2009), `High-dimensional additive modeling,' Annals of Statistics, 37(6B).
Osborne, MR, Presnell, B & Turlach, BA (1998), `Knot selection for regression splines via the lasso,' Computing Science and Statistics, 30.
Osborne, MR, Presnell, B & Turlach, BA (2000), `A new approach to variable selection in least squares problems,' IMA Journal of Numerical Analysis, 20(3).
Petersen, A, Witten, D & Simon, N (2016), `Fused Lasso additive model,' Journal of Computational and Graphical Statistics, 25(4).
Raskutti, G, Wainwright, MJ & Yu, B (2012), `Minimax-optimal rates for sparse additive models over kernel classes via convex programming,' Journal of Machine Learning Research, 13.
Ravikumar, P, Liu, H, Lafferty, J & Wasserman, L (2009), `SpAM: sparse additive models,' Journal of the Royal Statistical Society, Series B, 71(5).
Sadhanala, V & Tibshirani, RJ (2017), `Additive models with trend filtering,' arXiv preprint.
Stone, CJ (1986), `The dimensionality reduction principle for generalized additive models,' Annals of Statistics, 14(2).
Tan, Z & Zhang, CH (2017), `Penalized estimation in additive regression with high-dimensional data,' arXiv preprint.
Tibshirani, R (1996), `Regression shrinkage and selection via the Lasso,' Journal of the Royal Statistical Society, Series B.
Tibshirani, RJ (2014), `Adaptive piecewise polynomial estimation via trend filtering,' Annals of Statistics, 42(1).
Wang, YX, Smola, A & Tibshirani, R (2014), `The falling factorial basis and its statistical applications,' in International Conference on Machine Learning (ICML).
Wood, SN (2017), Generalized Additive Models: An Introduction with R, CRC Press.
Wu, TT & Lange, K (2008), `Coordinate descent algorithms for Lasso penalized regression,' Annals of Applied Statistics.
Wu, TT & Lange, K (2010), `The MM alternative to EM,' Statistical Science, 25(4).
Zhao, T, Li, X, Liu, H & Roeder, K (2014), SAM: Sparse Additive Modelling, R package version 1.0.5.
Figure 1. Nonzero functions f_1, …, f_4 used to generate data: (1) scenario 1 (piecewise constant); (2) scenario 2 (piecewise linear); (3) scenario 3 (smooth); (4) scenario 4 (combination). [Four panels plotting f(x) against x; curves not reproduced.]
Figure 2. Objective values over running time from ASD-BDT and from CD-BDT with tolerances 10^-5, 10^-6 and 10^-7, for scenario 2 and m = 2 in the regression setting. Panels correspond to (ρ, λ) = (λmax/16, λmax/16), (λmax/64, λmax/16), (λmax/256, λmax/16), (λmax/64, λmax/64), (λmax/256, λmax/64) and (λmax/256, λmax/256). [Curves not reproduced.]

Figure 3. As Figure 2 but with m = 3: objective values over running time from ASD-BDT and from CD-BDT with tolerances 10^-5, 10^-6 and 10^-7, for scenario 2 in the regression setting. [Curves not reproduced.]
Figure 4. Objective values over running time from ASD-BDT-Logit and from CD-BDT-Logit with tolerances 10^-5, 10^-6 and 10^-7, for scenario 2 and m = 2 in the classification setting. Panels correspond to (ρ, λ) = (λmax/16, λmax/16), (λmax/64, λmax/16), (λmax/256, λmax/16), (λmax/64, λmax/64), (λmax/256, λmax/64) and (λmax/256, λmax/256). [Curves not reproduced.]

Figure 5. As Figure 4 but with m = 3: objective values over running time from ASD-BDT-Logit and from CD-BDT-Logit with tolerances 10^-5, 10^-6 and 10^-7, for scenario 2 in the classification setting. [Curves not reproduced.]
Table 1. Computation comparison for scenario 2 in the regression setting. For m = 2 and m = 3, the methods AS-BDT, AS-BDT-S, and CD-BDT (with within-block tolerances 10^-5, 10^-6 and 10^-7) are compared on the metrics obj, cycle, iter and S, over the settings λ = λmax/16, λmax/64, λmax/256 and ρ = λ, λ/4, λ/16. [Numeric entries not recoverable from the extracted text.]
Table 2. Computation comparison for scenario 2 in the classification setting. For m = 2 and m = 3, the methods AS-BDT-Logit, AS-BDT-Logit-S, and CD-BDT-Logit (with within-block tolerances 10^-5, 10^-6 and 10^-7) are compared on the metrics obj, cycle, iter and S, over the settings λ = λmax/16, λmax/64, λmax/256 and ρ = λ, λ/4, λ/16. [Numeric entries not recoverable from the extracted text.]
Table 3. Test MSEs from linear additive modeling

                                Scenario 1   Scenario 2   Scenario 3   Scenario 4
SpAM (Ravikumar et al., 2009)
  d = 3                         –(0.05)      1.75(0.03)   1.62(0.02)   1.78(0.03)
  d = 6                         –(0.05)      2.4(0.04)    2.03(0.03)   2.3(0.04)
  d = 10                        –(0.06)      3.8(0.05)    2.69(0.05)   2.6(0.05)
hgam (Meier et al., 2009)
  K = 5                         –(0.04)      1.59(0.03)   1.97(0.03)   1.84(0.02)
  K = 20                        –(0.04)      1.57(0.03)   1.95(0.03)   1.82(0.02)
  K = 30                        –(0.04)      1.53(0.03)   1.89(0.03)   1.78(0.02)
dpam (this paper)
  m = 1                         2.03(0.04)   1.88(0.03)   1.76(0.03)   1.74(0.02)
  m = 2                         –(0.04)      1.40(0.02)   1.40(0.02)   1.60(0.02)
  m = 3                         –(0.04)      1.5(0.02)    1.46(0.02)   1.64(0.02)

Note: FLAM (Petersen et al., 2016) corresponds to dpam with m = 1, except that linear interpolation is used when evaluating the fitted functions on the validation and test sets. Standard errors are in parentheses; entries marked – (and some digits) were not recoverable from the extracted text.

Table 4. Test logistic losses (×10) from logistic additive modeling

                                Scenario 1   Scenario 2   Scenario 3   Scenario 4
SpAM (Ravikumar et al., 2009)
  d = 3                         –(0.01)      5.38(0.01)   5.33(0.01)   5.08(0.01)
  d = 6                         5.8(0.01)    5.49(0.01)   5.42(0.01)   5.8(0.01)
  d = 10                        5.3(0.02)    5.55(0.01)   5.46(0.01)   5.25(0.01)
dpam (this paper)
  m = 1                         4.97(0.02)   5.08(0.01)   5.8(0.01)    5.04(0.02)
  m = 2                         5.1(0.02)    4.88(0.01)   5.0(0.01)    5.0(0.01)
  m = 3                         5.4(0.02)    4.88(0.01)   5.04(0.01)   5.0(0.01)

Note: Logistic modeling is currently not implemented in the R package for hgam (Meier et al., 2009). Standard errors are in parentheses; entries marked – (and some digits) were not recoverable from the extracted text.
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04
More informationLASSO Review, Fused LASSO, Parallel LASSO Solvers
Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable
More informationSparsity Regularization
Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation
More informationKneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"
Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/
More informationStatistical Machine Learning for Structured and High Dimensional Data
Statistical Machine Learning for Structured and High Dimensional Data (FA9550-09- 1-0373) PI: Larry Wasserman (CMU) Co- PI: John Lafferty (UChicago and CMU) AFOSR Program Review (Jan 28-31, 2013, Washington,
More informationChapter 3. Linear Models for Regression
Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear
More informationMachine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, Emily Fox 2014
Case Study 3: fmri Prediction Fused LASSO LARS Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, 2014 Emily Fox 2014 1 LASSO Regression
More informationPermutation-invariant regularization of large covariance matrices. Liza Levina
Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationOr How to select variables Using Bayesian LASSO
Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection
More informationLeast squares under convex constraint
Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption
More informationExact Hybrid Covariance Thresholding for Joint Graphical Lasso
Exact Hybrid Covariance Thresholding for Joint Graphical Lasso Qingming Tang Chao Yang Jian Peng Jinbo Xu Toyota Technological Institute at Chicago Massachusetts Institute of Technology Abstract. This
More informationSparse Additive machine
Sparse Additive machine Tuo Zhao Han Liu Department of Biostatistics and Computer Science, Johns Hopkins University Abstract We develop a high dimensional nonparametric classification method named sparse
More information