Backfitting algorithms for total-variation and empirical-norm penalized additive modeling with high-dimensional data
The ISI's Journal for the Rapid Dissemination of Statistics Research (wileyonlinelibrary.com) DOI: 10.100X/sta

Backfitting algorithms for total-variation and empirical-norm penalized additive modeling with high-dimensional data

Ting Yang^a, Zhiqiang Tan^a

Received 10 May 2018; Accepted 00 Month 2018

Additive modeling is useful in studying nonlinear relationships between a response and covariates. We develop backfitting algorithms to implement the doubly penalized method in Tan & Zhang (2017), using total-variation and empirical-norm penalties. Use of the total-variation penalty leads to an automatic selection of knots for each component function, whereas use of the empirical-norm penalty can result in zero solutions for component functions and hence facilitates component selection in high-dimensional settings. For a backfitting cycle, each component function is updated by thresholding a solution to a Lasso problem, which is computed using an active-set descent method. Screening rules are also derived to determine zero solutions without solving the Lasso problem directly. We present numerical experiments to demonstrate the effectiveness of the proposed algorithms for linear and logistic additive modeling. Copyright © 2012 John Wiley & Sons, Ltd.

Keywords: Additive model; High-dimensional data; Total variation; Nonparametric smoothing; Penalized estimation; Trend filtering.

1. Introduction

Additive models provide a useful extension of linear models to allow nonlinear dependency of a mean response on covariates (Stone, 1986). For a sample of size $n$, let $y_i$ be a response and $x_i = (x_{i1}, \ldots, x_{ip})^T$ be a $p$-dimensional covariate vector for $i = 1, \ldots, n$. Consider a nonparametric additive model

$$y_i = \mu + f(x_i) + \varepsilon_i = \mu + \sum_{j=1}^p f_j(x_{ij}) + \varepsilon_i,$$

where $\mu$ is a constant term, $f_j(\cdot)$ is an unknown function of the $j$th covariate, and $\varepsilon_i$ is a noise with mean zero and finite variance $\sigma^2$.
^a Department of Statistics, Rutgers University, New Jersey, USA. Email: ztan@stat.rutgers.edu

Stat 2012, Copyright © 2012 John Wiley & Sons, Ltd. [Version: 2012/05/12 v1.00]

Theory and methods for generalized additive models are well studied in low-dimensional settings ($p \ll n$) (Hastie & Tibshirani, 1990; Wood, 2017). Recently, there has been considerable research on sparse additive models
in high-dimensional settings, where $p$ is close to or greater than $n$, but the number of nonzero functions $f_j(\cdot)$, up to some centering, is still smaller than $n$. See Section 2 for a review of related work.

In this article, we propose a backfitting algorithm, called block descent and thresholding (BDT), to implement doubly penalized estimation studied in Tan & Zhang (2017), using total-variation and empirical-norm penalties. For a function $g_j(\cdot)$ of the $j$th covariate in an interval $[a_j, b_j]$, the total variation is defined as

$$\mathrm{TV}(g_j) = \sup\Big\{ \sum_{i=1}^k |g_j(z_i) - g_j(z_{i-1})| : z_0 < z_1 < \cdots < z_k \text{ is a partition of } [a_j, b_j], \text{ for any } k \Big\}.$$

If $g_j$ is differentiable with derivative $g_j^{(1)}$, then $\mathrm{TV}(g_j) = \int |g_j^{(1)}(z)|\,dz$. The empirical norm (precisely, empirical $L_2$ norm) of $g_j$ is defined as $\|g_j\|_n = \{n^{-1}\sum_{i=1}^n g_j^2(x_{ij})\}^{1/2}$. For some integer $m \ge 1$ and tuning parameters $(\lambda, \rho)$, the doubly penalized estimator $(\hat\mu, \hat f_1, \ldots, \hat f_p)$ is defined as a minimizer of the following penalized loss

$$\frac{1}{2n}\sum_{i=1}^n (y_i - \mu - f(x_i))^2 + \sum_{j=1}^p \big\{ \lambda\,\mathrm{TV}(f_j^{(m-1)}) + \rho\,\|f_j\|_n \big\}, \qquad (1)$$

over $(\mu, f_1, \ldots, f_p)$, where $f = \sum_{j=1}^p f_j$, and $f_j^{(m-1)}$ is the $(m-1)$th derivative of $f_j$ with $f_j^{(0)} \equiv f_j$. If $f_j$ is $m$-times differentiable, then $\mathrm{TV}(f_j^{(m-1)}) = \int |f_j^{(m)}(z)|\,dz$. By extending Mammen & van de Geer (1997) for univariate smoothing, Tan & Zhang (2017) showed that each $\hat f_j$ can be chosen to be a spline of order $m$, that is, a piecewise polynomial of degree $m-1$ and, if $m \ge 2$, an $(m-2)$-times continuously differentiable function. Moreover, the knots of each $\hat f_j$ can be obtained from the data points $\{x_{ij} : i = 1, \ldots, n\}$ if $m = 1$ or 2, but not necessarily so if $m \ge 3$. For simplicity, we always restrict each $f_j(\cdot)$ to be a spline of order $m$ with knots from the data points.

The penalty term in (1) involves two penalties, playing complementary roles, for each component function $f_j$. The total variation $\mathrm{TV}(f_j^{(m-1)})$ is used to induce smoothness of $f_j$.
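To make the roles of the two penalties in (1) concrete, here is a minimal pure-Python sketch (function names are ours, not from the paper), assuming $m = 1$ and covariate values sorted in increasing order, so that $\mathrm{TV}(f_j^{(0)}) = \mathrm{TV}(f_j)$ is the sum of absolute jumps of a piecewise-constant fit:

```python
def total_variation(values):
    # TV of the piecewise-constant interpolant through `values`,
    # assuming the underlying x-values are sorted in increasing order.
    return sum(abs(values[i + 1] - values[i]) for i in range(len(values) - 1))

def empirical_norm(values):
    # ||f_j||_n = { n^{-1} sum_i f_j(x_ij)^2 }^{1/2}.
    n = len(values)
    return (sum(v * v for v in values) / n) ** 0.5

def penalized_loss(y, components, mu, lam, rho):
    # Objective (1) with m = 1: squared-error loss plus, for each component,
    # lam * TV(f_j) + rho * ||f_j||_n.
    n = len(y)
    fit = [mu + sum(f[i] for f in components) for i in range(n)]
    loss = sum((y[i] - fit[i]) ** 2 for i in range(n)) / (2.0 * n)
    penalty = sum(lam * total_variation(f) + rho * empirical_norm(f)
                  for f in components)
    return loss + penalty
```

For instance, a component taking values (0, 1, 1, 0) at four sorted data points has total variation 2, and a component that matches the centered data exactly contributes nothing to the squared-error part of the loss.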
In fact, as shall be seen in Section 3.1, the total-variation penalty leads to an automatic selection of knots from data points for each $f_j$, similarly as Lasso selection of nonzero coefficients in linear regression (Tibshirani, 1996). The empirical norm $\|f_j\|_n$ is used to induce sparsity for $f_j$ (i.e., setting $f_j$ to zero). See Lemma 1 for how a zero solution for $f_j$ can be caused by the presence of the empirical-norm penalty. In the special case where $f_j(z) = \beta_j z$ is linear for all $j$, the overall penalty in (1) reduces to the Lasso penalty $\sum_{j=1}^p |\beta_j|$ up to some scaling for linear regression. In other words, the penalty in (1) can be regarded as a functional extension of the Lasso penalty, with $|\beta_j|$ replaced by a combination of $\mathrm{TV}(f_j^{(m-1)})$ and $\|f_j\|_n$.

The theoretical analysis in Tan & Zhang (2017) shows under some technical conditions that if $\lambda$ and $\rho$ are properly specified, then the prediction error of $\hat f = \sum_{j=1}^p \hat f_j$ achieves a minimax rate

$$\|\hat f - f\|_n^2 = O_p(1)\,(M_F + M_0)\Big\{ n^{-2m/(2m+1)} + \frac{\log(p)}{n} \Big\}, \qquad (2)$$

for any decomposition $f = \sum_{j=1}^p f_j$ such that $\#\{j : f_j \not\equiv 0\} \le M_0$ and $\sum_{j=1}^p \mathrm{TV}(f_j^{(m-1)}) \le M_F$. A similar result holds for out-of-sample prediction errors. The error bound (2) consists of two terms, reflecting the errors from respectively nonparametric smoothing and variable selection. The first term, of order $n^{-2m/(2m+1)}$, is the minimax rate of estimation in univariate smoothing ($p = 1$) over the function class $\{f_1 : \mathrm{TV}(f_1^{(m-1)}) \le C\}$ for a constant $C$ (Mammen & van de Geer, 1997). The second term, of order $M_0\log(p)/n$, is known as the Lasso fast rate of prediction errors in linear regression with at most $M_0$ nonzero regression coefficients (Bickel et al., 2009).

Our backfitting algorithm can be modified to implement doubly penalized estimation using $L_2$ Sobolev-seminorm and empirical-norm penalties, where the penalty term in (1) is replaced by $\sum_{j=1}^p \{\lambda\|f_j^{(m)}\|_{L_2} + \rho\|f_j\|_n\}$, with $\|f_j^{(m)}\|_{L_2} =$
$\{\int (f_j^{(m)}(z))^2\,dz\}^{1/2}$. Denote by $(\tilde f_1, \ldots, \tilde f_p)$ the resulting estimators. There are, however, two notable differences from when the total-variation penalty is used. First, each $\tilde f_j$ is a smoothing spline with all the data points $\{x_{ij} : i = 1, \ldots, n\}$ as knots (Meier et al., 2009). No selection of knots is achieved. Second, the error bound (2) is also valid with $\hat f$ replaced by $\tilde f = \sum_{j=1}^p \tilde f_j$ and $M_F$ replaced by $\sum_{j=1}^p \|f_j^{(m)}\|_{L_2}$ (Koltchinskii & Yuan, 2010; Raskutti et al., 2012). But a bounded function class $\{f_1 : \|f_1^{(m)}\|_{L_2} \le C\}$ is strictly smaller than $\{f_1 : \mathrm{TV}(f_1^{(m-1)}) \le C\}$ for any constant $C$. For univariate smoothing ($p = 1$), smoothing splines cannot achieve the rate $n^{-2m/(2m+1)}$ as do total-variation splines in the larger function class $\{f_1 : \mathrm{TV}(f_1^{(m-1)}) \le C\}$ (Donoho & Johnstone, 1994; Mammen & van de Geer, 1997).

We use the following notation. For a function $h(y, x)$, the empirical norm of $h$ is defined as $\|h\|_n = \{n^{-1}\sum_{i=1}^n h^2(y_i, x_i)\}^{1/2}$. For example, $\|y - f\|_n^2 = n^{-1}\sum_{i=1}^n (y_i - f(x_i))^2$. The sample average of $h$ is $\bar h = n^{-1}\sum_{i=1}^n h(y_i, x_i)$. For example, $\bar y = n^{-1}\sum_{i=1}^n y_i$. Moreover, for a vector $u = (u_1, \ldots, u_n)^T$, denote $\|u\|_n = (n^{-1}\sum_{i=1}^n u_i^2)^{1/2}$. For a vector $v = (v_1, \ldots, v_k)^T$, denote $\|v\|_1 = \sum_{j=1}^k |v_j|$ and $\|v\|_\infty = \max_{j=1,\ldots,k} |v_j|$.

The rest of the paper is organized as follows. Section 2 reviews related work. In Section 3, we develop a backfitting algorithm for minimizing the penalized loss (1) in linear additive modeling. In Section 4, we extend the algorithm to logistic additive modeling. Section 5 presents numerical experiments. Section 6 concludes the paper. The Supplementary Material contains proofs and additional discussion and results.

2. Related work

Recently, there has been considerable research on penalized estimation for additive modeling, beyond earlier work as in Hastie & Tibshirani (1990). For example, Lin & Zhang (2006) used a penalty $\lambda\sum_{j=1}^p \|f_j^{(m)}\|_{L_2}$. Huang et al.
(2010) studied a similar method using adaptive group Lasso. Ravikumar et al. (2009) used a penalty $\lambda\sum_{j=1}^p \|f_j\|_n$, and restricted each $f_j$ to the span of an $n \times d$ basis matrix, which is pre-specified. Meier et al. (2009) used a penalty in the form $\lambda\sum_{j=1}^p \{\|f_j\|_n^2 + \lambda_2\|f_j^{(m)}\|_{L_2}^2\}^{1/2}$, and parameterized $f_j$ using B-spline basis functions with pre-specified $K$ knots in numerical implementation. Koltchinskii & Yuan (2010) and Raskutti et al. (2012) used a penalty term $\sum_{j=1}^p \{\lambda\|f_j^{(m)}\|_{L_2} + \rho\|f_j\|_n\}$, but did not present numerical algorithms for minimizing the penalized loss. As mentioned above, such doubly penalized estimation can be handled by extending our backfitting algorithm. The doubly penalized method in Tan & Zhang (2017) differs from Koltchinskii & Yuan (2010) and Raskutti et al. (2012) in the use of the total-variation penalty, which leads to automatic knot selection for each component $f_j$ and allows the same rate of convergence achieved in much larger bounded-variation function classes.

Use of total-variation penalties seems to be first studied by Mammen & van de Geer (1997) for univariate smoothing, where the penalized loss is $\|y - f_1\|_n^2/2 + \lambda\,\mathrm{TV}(f_1^{(m-1)})$. Recently, Kim et al. (2009) proposed a related method for univariate smoothing, called trend filtering, by minimizing the penalized loss over $\beta_1$,

$$\frac{1}{2}\|y - \beta_1\|_2^2 + \lambda\|D^{(m)} P_1 \beta_1\|_1,$$

where $\|\cdot\|_2$ denotes the $L_2$ norm, $y = (y_1, \ldots, y_n)^T$, $\beta_1 = (f_1(x_{11}), \ldots, f_1(x_{n1}))^T$, $P_1$ is the permutation matrix that sorts $(x_{11}, \ldots, x_{n1})$ in the ascending order, and $D^{(m)}$ is the $m$th-order difference matrix. Tibshirani (2014) showed that trend filtering is equivalent to total-variation splines in Mammen & van de Geer (1997) for $m = 1$ and 2, but not in general for $m \ge 3$. In addition, trend filtering is shown to achieve the same (minimax) rate of convergence as total-variation splines in bounded-variation classes when the data points are evenly spaced for $m \ge 1$. For additive modeling, Petersen et al.
(2016) proposed a fused lasso additive model (FLAM), by minimizing the penalized loss over $(\mu, \beta_1, \ldots, \beta_p)$:

$$\frac{1}{2}\Big\|y - \mu - \sum_{j=1}^p \beta_j\Big\|_2^2 + \lambda\sum_{j=1}^p \big\{\alpha\|D^{(1)} P_j \beta_j\|_1 + (1-\alpha)\|\beta_j\|_2\big\},$$
where $\lambda$ is a tuning parameter, $\alpha \in [0, 1]$, $\beta_j = (f_j(x_{1j}), \ldots, f_j(x_{nj}))^T$, and $P_j$ is the permutation matrix that sorts the data points $(x_{1j}, \ldots, x_{nj})$. This fitting procedure is easily shown to be statistically equivalent to that of the doubly penalized method with $m = 1$ in Tan & Zhang (2017), up to a transformation between the tuning parameters $(\lambda, \alpha)$ above and $(\lambda, \rho)$ in (1). The fitted values $\hat\mu + \sum_{j=1}^p \hat\beta_j$ from FLAM and $\hat\mu + \sum_{j=1}^p (\hat f_j(x_{1j}), \ldots, \hat f_j(x_{nj}))^T$ from our method can be matched. However, there is a subtle difference in how out-of-sample predictions are made. The fitted function $\hat f_j$ of the $j$th covariate is defined by linear interpolation of the fitted values $\hat\beta_j$ in FLAM, but directly by piecewise-constant interpolation in our method, which is more aligned with the first-order total-variation penalty used. To obtain fitted functions $\hat f_j$ that are piecewise linear, our method uses the second-order total-variation penalty (i.e., $m = 2$). In general, our method also accommodates use of higher-order total-variation penalties.

Sadhanala & Tibshirani (2017) also proposed additive modeling with trend filtering, by minimizing the penalized loss

$$\frac{1}{2}\Big\|y - \mu - \sum_{j=1}^p \beta_j\Big\|_2^2 + \lambda\sum_{j=1}^p \|D^{(m)} P_j \beta_j\|_1.$$

This method allows higher-order trend filtering on component functions, but does not incorporate empirical-norm penalties, which are crucial to achieve sparsity in high-dimensional settings. The theory and numerical evaluations are provided only in low-dimensional settings in Sadhanala & Tibshirani (2017). Our backfitting algorithm can be modified to handle both trend filtering and empirical-norm penalties, i.e., a penalty term $\sum_{j=1}^p \{\lambda\|D^{(m)} P_j \beta_j\|_1 + \rho\|\beta_j\|_n\}$, by recasting trend filtering using the falling factorial basis functions (Wang et al., 2014).

3. Linear additive modeling

We develop a backfitting algorithm for minimizing the objective function (1). As discussed in Section 1, we restrict each $f_j$ to be a spline of order $m$ with knots from the data points $\{x_{ij} : i = 1, \ldots, n\}$.
For $1 \le j \le p$, define the knot superset for $f_j$ as (Mammen & van de Geer, 1997)

$$\{t_{(1)}, \ldots, t_{(n-m)}\} = \begin{cases} \{x_{((m-1)/2+2)}, \ldots, x_{(n-(m-1)/2)}\} & \text{if } m \text{ is odd}, \\ \{x_{(m/2+1)}, \ldots, x_{(n-m/2)}\} & \text{if } m \text{ is even}, \end{cases}$$

where $x_{(1)} < \cdots < x_{(n)}$ are the ordered values of the $j$th covariate, and $t_{(1)} < \cdots < t_{(n-m)}$. The data points near the left and right boundaries are removed to avoid over-parameterization. As shown in Tan & Zhang (2017), a solution $(\hat f_1, \ldots, \hat f_p)$ obtained with this restriction is also an unrestricted minimizer of (1) if $m = 1$ or 2, but not necessarily so if $m \ge 3$. For univariate smoothing, Mammen & van de Geer (1997) also showed that the restricted and unrestricted solutions achieve the same rate of convergence under a mild condition on the maximum spacing between the data points. A similar result can be expected to hold for additive models.

To represent splines of order $m$, we use the truncated power basis, defined as

$$\Phi_{k,j}(z) = z^k, \quad k = 1, \ldots, m-1; \qquad \Phi_{m-1+k,j}(z) = (z - t_{(k)})_+^{m-1}, \quad k = 1, 2, \ldots, n-m,$$

where $(c)_+ = \max(0, c)$ and $(c)_+^0 = 0$ if $c \le 0$ or $1$ if $c > 0$. Denote $f_j = \theta_{0,j} + \Phi_j\theta_j$, where $\Phi_j = (\Phi_{1,j}, \ldots, \Phi_{n-1,j})$ and $\theta_j = (\theta_{1,j}, \ldots, \theta_{n-1,j})^T$. After simple algebra, the objective function (1) can be written as

$$\frac{1}{2}\|y - \mu - f\|_n^2 + \sum_{j=1}^p \big\{\lambda\|D\theta_j\|_1 + \rho\|\theta_{0,j} + \Phi_j\theta_j\|_n\big\}, \qquad (3)$$

where $D$ is a diagonal matrix with the only nonzero elements $d_m = \cdots = d_{n-1} = 1$. Using the truncated power basis transforms the total variation $\mathrm{TV}(f_j^{(m-1)})$ into a Lasso penalty $\|D\theta_j\|_1 = \sum_{k=1}^{n-m} |\theta_{m-1+k,j}|$.
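As a concrete illustration, the truncated power basis can be evaluated in a few lines of pure Python (the helper below is our own sketch, not the paper's implementation); note the special case $m = 1$, where $(z - t)_+^0$ is the indicator of $z > t$:

```python
def truncated_power_basis(z, knots, m):
    # Evaluate the (m-1) + len(knots) truncated power basis functions at z:
    # z^k for k = 1, ..., m-1, then (z - t_(k))_+^{m-1} over the knots.
    poly = [float(z) ** k for k in range(1, m)]
    if m == 1:
        # (c)_+^0 = 1 if c > 0 and 0 otherwise: piecewise-constant splines.
        trunc = [1.0 if z > t else 0.0 for t in knots]
    else:
        trunc = [max(z - t, 0.0) ** (m - 1) for t in knots]
    return poly + trunc
```

For $m = 2$ this yields a linear term plus hinge functions at the knots, which is exactly the piecewise-linear spline representation used when the second-order total-variation penalty is applied.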
Because $\mu$ is non-penalized, it follows that for (3) to be minimized, $\hat\mu = \bar y$, and each $\hat f_j$ is empirically centered, that is, $\hat\theta_{0,j} = -\bar\Phi_j\hat\theta_j$, where $\bar\Phi_j = n^{-1}\sum_{i=1}^n \Phi_j(x_{ij})$. Then minimization of (3) reduces to that of

$$\frac{1}{2}\|y - \bar y - f\|_n^2 + \sum_{j=1}^p \big\{\lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n\big\}, \qquad (4)$$

over $(\theta_1, \ldots, \theta_p)$, where $f = \sum_{j=1}^p f_j$, $f_j = \tilde\Phi_j\theta_j$, and $\tilde\Phi_j = \Phi_j - \bar\Phi_j$, the empirically centered version of $\Phi_j$.

3.1. Backfitting

To minimize (3) or, equivalently, (4), a backfitting algorithm involves solving $p$ sub-problems, updating the components $(f_1, \ldots, f_p)$ sequentially. For any $1 \le j \le p$, the $j$th sub-problem is

$$\min_{\theta_j}\ \Big\{\frac{1}{2}\|r_j - \tilde\Phi_j\theta_j\|_n^2 + \lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n\Big\}, \qquad (5)$$

where $r_j = y - \bar y - \sum_{k\ne j}\hat f_k$ and $\{\hat f_k : k \ne j\}$ are the current estimates. By abuse of notation, $\tilde\Phi_j$ can also be treated as the $n \times (n-1)$ matrix with $i$th row $\tilde\Phi_j(x_{ij})$, and $r_j$ be treated as the $n \times 1$ vector with $i$th element $y_i - \bar y - \sum_{k\ne j}\hat f_k(x_{ik})$. Proposition 1 provides the main idea of our algorithm for solving problem (5). This result serves as a generalization of Corollary 3.1 in Petersen et al. (2016).

Proposition 1 Suppose that $\tilde\theta_j$ is a solution to

$$\min_{\theta_j}\ \Big\{\frac{1}{2}\|r_j - \tilde\Phi_j\theta_j\|_n^2 + \lambda\|D\theta_j\|_1\Big\}. \qquad (6)$$

Then a solution to problem (5) is

$$\hat\theta_j = \Big(1 - \frac{\rho}{\|\tilde\Phi_j\tilde\theta_j\|_n}\Big)_+ \tilde\theta_j.$$

Proposition 1 says that a solution to problem (5) is determined by directly thresholding a solution of the Lasso problem (6). From our proof, a more general result holds where $r_j$ is a "response" vector, $\tilde\Phi_j$ is a data matrix, and $\lambda\|D\theta_j\|_1$ is replaced by a semi-norm penalty $R(\theta_j)$, including $R(\theta_j) = \lambda\|D\theta_j\|_2$ in the case of the $L_2$ seminorm $\|f_j^{(m)}\|_{L_2}$ used instead of $\mathrm{TV}(f_j^{(m-1)})$ in (1). Proposition 1 sheds light on the consequences of using the two penalties in (5). Use of the total-variation or Lasso penalty can induce a sparse solution $\tilde\theta_j$ with only a few nonzero components, corresponding to an automatic selection of knots for $f_j$, as shown in Osborne et al. (1998) for univariate smoothing. Use of the empirical-norm penalty can result in a zero solution for $f_j$ via thresholding $\tilde\theta_j$ and hence achieve sparsity in high-dimensional additive modelling.
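The thresholding step in Proposition 1 translates directly into code. Below is a hedged pure-Python sketch (names are ours): given a Lasso solution for (6) and its vector of fitted values, the whole coefficient block is scaled by the factor $(1 - \rho/\|\tilde\Phi_j\tilde\theta_j\|_n)_+$, which zeroes the component whenever its fitted values are small in empirical norm.

```python
def threshold_block(theta_tilde, fitted, rho):
    # Proposition 1: hat(theta) = (1 - rho / ||Phi theta_tilde||_n)_+ theta_tilde,
    # where `fitted` holds the n fitted values Phi_j @ theta_tilde.
    n = len(fitted)
    norm_n = (sum(v * v for v in fitted) / n) ** 0.5
    if norm_n <= rho:
        # Empirical-norm penalty zeroes the whole component function.
        return [0.0] * len(theta_tilde)
    shrink = 1.0 - rho / norm_n
    return [shrink * t for t in theta_tilde]
```

The knot selection itself happens inside the Lasso solve for (6); the code above only performs the outer group-style shrinkage induced by the empirical-norm penalty.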
From Proposition 1, we propose the following backfitting algorithm, Algorithm 1, to minimize (4). To solve the Lasso problem (6), it is possible to use a variety of numerical methods including coordinate descent (Friedman et al., 2010; Wu & Lange, 2008), gradient-descent related methods (Beck & Teboulle, 2009; Kim et al., 2007), and active-set descent (Osborne et al., 2000). In particular, Petersen et al. (2016) used a fast fused-lasso algorithm (Hoefling, 2010) for solving (6) with $m = 1$, which seems not applicable for $m \ge 2$. We employ a variant of active-set descent, which is attractive for the following reasons. First, the performance of backfitting depends on how accurately the sub-problem (6) is solved within each block. More accurate within-block solutions may result in fewer backfitting cycles to achieve convergence by a certain criterion. The active-set method finds an exact solution after a finite number of iterations, and the computational cost is often reasonable in sparse settings. When the estimate $\tilde\theta_j$ from the previous backfitting cycle is used as an initial value, the method also allows problem (6) to be solved with one or a few iterations if the previous estimate $\tilde\theta_j$ is close to the desired solution. See the Supplementary Material for details of the active-set method.
Algorithm 1 Block Descent and Thresholding (BDT) algorithm
1: Initialize: Set $\hat\theta_j = 0$ for $j = 1, \ldots, p$.
2: for $j = 1, 2, \ldots, p$ do
3:   if any screening condition (Section 3.2) is satisfied then
4:     Return $\hat\theta_j$.
5:   else
6:     Update the residual: $r_j = y - \bar y - \sum_{k\ne j}\tilde\Phi_k\hat\theta_k$.
7:     Compute a solution $\tilde\theta_j$ to problem (6).
8:     Threshold the solution: $\hat\theta_j = \big(1 - \rho/\|\tilde\Phi_j\tilde\theta_j\|_n\big)_+\tilde\theta_j$.
9:   end if
10: end for
11: Repeat lines 2-10 until convergence of the objective (4).

Many other methods, notably coordinate descent, only provide an approximate solution with a pre-specified precision. To obtain a more precise solution, more iterations are needed. In addition, the active-set method is tuning-free, whereas gradient-descent related methods involve tuning of step sizes to achieve satisfactory convergence. Selection of such step sizes can be cumbersome for all $p$ Lasso problems (6) in backfitting.

3.2. Screening rules

In principle, to solve the sub-problem (5), we first compute $\tilde\theta_j$ and then threshold it to $\hat\theta_j$. For relatively large $\rho$, even if $\tilde\theta_j$ is nonzero, the solution $\hat\theta_j$ after thresholding may become 0. For relatively large $\lambda$, it may occur that only the first $m-1$ components of $\hat\theta_j$ (not penalized by $\lambda\|D\theta_j\|_1$) are nonzero. To speed up computation, we derive screening rules to directly detect these scenarios and determine $\hat\theta_j$, without solving the Lasso problem (5). By the matrix interpretation for $\tilde\Phi_j$ and $r_j$, rewrite the sub-problem (5) as

$$\min_{(\theta_j^{(1)}, \theta_j^{(2)})}\ \Big\{\frac{1}{2}\|r_j - \tilde\Phi_j^{(1)}\theta_j^{(1)} - \tilde\Phi_j^{(2)}\theta_j^{(2)}\|_n^2 + \lambda\|\theta_j^{(2)}\|_1 + \rho\|\tilde\Phi_j^{(1)}\theta_j^{(1)} + \tilde\Phi_j^{(2)}\theta_j^{(2)}\|_n\Big\}, \qquad (7)$$

where $\tilde\Phi_j$ is partitioned into $(\tilde\Phi_j^{(1)}, \tilde\Phi_j^{(2)})$ with $\tilde\Phi_j^{(1)}$ giving the first $m-1$ columns of $\tilde\Phi_j$, and accordingly $\theta_j$ is partitioned into $(\theta_j^{(1)T}, \theta_j^{(2)T})^T$, with $\theta_j^{(1)}$ giving the first $m-1$ components of $\theta_j$ (not penalized by Lasso). If $m = 1$, then $\tilde\Phi_j^{(1)}$ and $\theta_j^{(1)}$ become degenerate. Proposition 2 provides the main result for our construction of screening rules.

Proposition 2 For $m \ge 2$, the following results hold for a solution $\hat\theta_j = (\hat\theta_j^{(1)T}, \hat\theta_j^{(2)T})^T$ to problem (7).
1. $\hat\theta_j^{(1)} \ne 0$ and $\hat\theta_j^{(2)} = 0$ if

$$\|\Pi_j r_j\|_n > \rho \quad\text{and}\quad \|\tilde\Phi_j^{(2)T}(r_j - \Pi_j r_j)\|_\infty/n \le \lambda, \qquad (8)$$

where $\Pi_j$ is the projection ("hat") matrix onto the column space of $\tilde\Phi_j^{(1)}$.

2. $\hat\theta_j = 0$ if there exists a vector $u \in \mathbb{R}^n$ satisfying

$$\|u\|_n \le \rho \quad\text{and}\quad \tilde\Phi_j^{(1)T}(r_j - u) = 0 \quad\text{and}\quad \|\tilde\Phi_j^{(2)T}(r_j - u)\|_\infty/n \le \lambda. \qquad (9)$$

For $m = 1$, result 2 holds with the second equality in (9) removed.
For ease of application, Corollary 1 gives a reformulation of condition (9), in terms of the existence of a suitable scalar $\gamma$ for an arbitrarily fixed vector $h \in \mathbb{R}^n$.

Corollary 1 For $m \ge 2$, $\hat\theta_j = 0$ is a solution to problem (7) if for any $h \in \mathbb{R}^n$, there exists $\gamma \in \mathbb{R}$ satisfying

$$\|r_j - (1-\gamma)(h - \Pi_j h)\|_n \le \rho \quad\text{and}\quad \|(1-\gamma)\tilde\Phi_j^{(2)T}(h - \Pi_j h)\|_\infty/n \le \lambda. \qquad (10)$$

For $m = 1$, the result holds with $h - \Pi_j h$ replaced by $h$ or, equivalently, $\Pi_j$ set to 0.

From these results, we obtain the following screening rules. For $m = 1$, we skip step 1 and set $\Pi_j = 0$.

Algorithm 2 Screening rules
1: Return $\hat\theta_j^{(1)} = (1 - \rho/\|\Pi_j r_j\|_n)(\tilde\Phi_j^{(1)T}\tilde\Phi_j^{(1)})^{-1}\tilde\Phi_j^{(1)T} r_j$ and $\hat\theta_j^{(2)} = 0$ if
  (1a) $\|\Pi_j r_j\|_n > \rho$ and $\|\tilde\Phi_j^{(2)T}(r_j - \Pi_j r_j)\|_\infty/n \le \lambda$.
2: Return $\hat\theta_j^{(1)} = 0$ and $\hat\theta_j^{(2)} = 0$ if one of the following conditions is satisfied:
  (2a) $\|r_j\|_n \le \rho$.
  (2b) $\|r_j\|_n > \rho$ and $\|\Pi_j r_j\|_n \le \rho$ and $\|(1-\gamma)\tilde\Phi_j^{(2)T}(r_j - \Pi_j r_j)\|_\infty/n \le \lambda$, where $\gamma = \{(n\rho^2 - r_j^T\Pi_j r_j)/(r_j^T r_j - r_j^T\Pi_j r_j)\}^{1/2}\ (< 1)$.
  (2c) $\|r_j\|_n > \rho$ and $\|(1-\gamma)\tilde\Phi_j^{(2)T}(h - \Pi_j h)\|_\infty/n \le \lambda$, where $h = r_j - \tilde\Phi_j\tilde\theta_j$ with $\tilde\theta_j$ obtained from the previous backfitting cycle, and $\gamma$ is determined such that $\|r_j - (1-\gamma)(h - \Pi_j h)\|_n = \rho$.

Condition (1a) is derived from (8), and condition (2a) is from (9) with $u = r_j$. Condition (2b) is from (10) with $h = r_j$, where the first inequality in (10) reduces to $\gamma^2(r_j^T r_j - r_j^T\Pi_j r_j) \le n\rho^2 - r_j^T\Pi_j r_j$. Condition (2c) is from (10) with $h = r_j - \tilde\Phi_j\tilde\theta_j$, where $\tilde\theta_j$ is a solution of the Lasso problem (6), and hence $\tilde\Phi_j\tilde\theta_j$ is the vector of fitted values before thresholding, from the previous backfitting cycle. The motivation for this choice is that if the "response" vector $r_j$ is similar to the previous one, then $\|\tilde\Phi_j^{(2)T}(h - \Pi_j h)\|_\infty/n$ would remain similar to $\lambda$.

The screening rules in Algorithm 2 are more effective than would be derived by first detecting $\tilde\theta_j = 0$ for problem (6) and then deciding $\hat\theta_j = 0$. For $m = 1$, it holds that $\hat\theta_j = 0$ if either $\|r_j\|_n \le \rho$, or $\|r_j\|_n > \rho$ and $(1 - \rho/\|r_j\|_n)\|\tilde\Phi_j^T r_j\|_\infty/n \le \lambda$, whereas a necessary and sufficient condition for $\tilde\theta_j = 0$ is $\|\tilde\Phi_j^T r_j\|_\infty/n \le \lambda$.
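As an illustration of how cheap these checks are, here is a pure-Python sketch (our own, not the paper's code) of the $m = 1$ screening just described: the component can be declared zero when $\|r_j\|_n \le \rho$, or when $(1 - \rho/\|r_j\|_n)\|\tilde\Phi_j^T r_j\|_\infty/n \le \lambda$, in either case without solving the Lasso sub-problem.

```python
def screen_zero_m1(r, phi_columns, lam, rho):
    # Screening for m = 1: returns True when hat(theta)_j = 0 can be declared
    # without solving (6). `phi_columns` holds the columns of the centered
    # basis matrix, each as a length-n list.
    n = len(r)
    r_norm = (sum(v * v for v in r) / n) ** 0.5
    if r_norm <= rho:                       # condition (2a)
        return True
    # max_k |phi_k^T r_j|, the gradient-based check after thresholding
    grad = max(abs(sum(c[i] * r[i] for i in range(n))) for c in phi_columns)
    return (1.0 - rho / r_norm) * grad / n <= lam
```

The second check is strictly weaker than requiring $\|\tilde\Phi_j^T r_j\|_\infty/n \le \lambda$ whenever $\rho > 0$, which is the sense in which these rules are more effective than detecting $\tilde\theta_j = 0$ alone.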
As another use of the screening rules, we also obtain the following conditions on the tuning parameters $(\lambda, \rho)$ to imply a completely zero solution to problem (4), $\hat\theta_1 = \cdots = \hat\theta_p = 0$.

Corollary 2 It holds that $\hat\theta_1 = \cdots = \hat\theta_p = 0$ is a solution to problem (4) if either (a) $\|y - \bar y\|_n \le \rho$, or (b) $\|y - \bar y\|_n > \rho$, $\|\Pi_j(y - \bar y)\|_n \le \rho$, and $\|\tilde\Phi_j^{(2)T}(y - \bar y - \Pi_j(y - \bar y))\|_\infty/n \le \lambda$ for $j = 1, \ldots, p$.

In numerical implementation, we restrict the search of $(\lambda, \rho)$ to a grid such that $\lambda \le \lambda_{\max}$ and $\rho \le \rho_{\max}$, where $\rho_{\max} = \|y - \bar y\|_n$ and $\lambda_{\max} = \max_j \|\tilde\Phi_j^{(2)T}(y - \bar y - \Pi_j(y - \bar y))\|_\infty/n$. This region includes all $(\lambda, \rho)$ yielding a nonzero solution to (4) for $m = 1$ with $\Pi_j = 0$, and may be practically sufficient for $m \ge 2$. In addition, because the theoretical analysis of Tan & Zhang (2017) suggests choosing $\lambda = O(1)\rho^2$ under sparsity, we also restrict $\lambda \le \rho^2$ in the search.
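A small sketch of the tuning-parameter grid (ours; names and layout are illustrative). The quantity $\rho_{\max}$ implements Corollary 2(a), and the geometric spacing by powers of 4 follows the experiments in Section 5; computing $\lambda_{\max}$ would additionally require the centered basis matrices, so it is taken as given here.

```python
def rho_max(y):
    # Corollary 2(a): any rho >= ||y - ybar||_n yields the all-zero solution.
    n = len(y)
    ybar = sum(y) / n
    return (sum((v - ybar) ** 2 for v in y) / n) ** 0.5

def tuning_grid(lam_max, r_max, powers=(1, 2, 3, 4), factor=4.0):
    # Candidate pairs (lambda, rho) on a geometric grid lam_max / 4^k,
    # rho_max / 4^l, as used in the numerical experiments.
    lams = [lam_max / factor ** k for k in powers]
    rhos = [r_max / factor ** k for k in powers]
    return [(lam, rho) for lam in lams for rho in rhos]
```

Any additional restriction on the grid (such as the sparsity-motivated one above) can then be applied by filtering the returned pairs.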
4. Logistic additive modeling

As an extension, we provide a backfitting algorithm for logistic additive modeling with a binary response:

$$P(y_i = 1 \mid x_i) = \mathrm{expit}\Big\{\mu + \sum_{j=1}^p f_j(x_{ij})\Big\},$$

where $\mathrm{expit}(c) = \{1 + \exp(-c)\}^{-1}$. Doubly penalized estimation involves minimizing the objective function

$$n^{-1}\sum_{i=1}^n \ell(y_i, \mu + f(x_i)) + \sum_{j=1}^p \{\lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n\}, \qquad (11)$$

where $\ell(y_i, \mu + f(x_i)) = \log\{1 + \exp(\mu + f(x_i))\} - y_i(\mu + f(x_i))$ with $f = \sum_{j=1}^p f_j$ and $f_j = \tilde\Phi_j\theta_j$ as in (4), but this cannot be directly solved. For each cycle of backfitting, the $j$th sub-problem is

$$\min_{(\mu, \theta_j)}\ n^{-1}\sum_{i=1}^n \ell\big(y_i, \mu + \hat f^{(-j)}(x_i) + \tilde\Phi_j(x_{ij})\theta_j\big) + \lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n, \qquad (12)$$

where $\hat f^{(-j)} = \sum_{k\ne j}\hat f_k$ and $\{\hat f_k : k \ne j\}$ are the current estimates. Similarly as in Friedman et al. (2010), we form a quadratic approximation to the negative log-likelihood term (via a Taylor expansion about the previous estimates $\hat\mu$ and $\hat\theta_j$) and solve the following problem, similar to (5) but with weighted least squares:

$$\min_{(\mu, \theta_j)}\ \frac{1}{2n}\sum_{i=1}^n w_i(\zeta_i - \mu - \tilde\Phi_j(x_{ij})\theta_j)^2 + \lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n, \qquad (13)$$

where $\zeta_i = \hat\mu + \tilde\Phi_j(x_{ij})\hat\theta_j + \hat w_i^{-1}(y_i - \hat p_i)$, $\hat p_i = \mathrm{expit}(\hat\mu + \hat f(x_i))$, and $\hat w_i = \hat p_i(1 - \hat p_i)$. Proposition 1 can be easily extended for solving (13). However, we employ a simple modification to the Lasso problem associated with (13) when using the active-set algorithm. If the weights $(w_1, \ldots, w_n)$ are updated during backfitting, there can be substantial cost in accordingly updating the Cholesky decomposition for the active-set algorithm. Therefore, we replace each $w_i$ by the constant $1/4$ and then solve problem (13) in the same way as (5). Because $\hat p_i(1 - \hat p_i)$ is upper bounded by $1/4$, it can be shown that the resulting update of $(\hat\mu, \hat\theta_j)$ remains a descent update, which decreases the objective value (12), by the quadratic lower bound principle (Böhning & Lindsay, 1988; Wu & Lange, 2010). We summarize the backfitting algorithm as Algorithm 3.

Algorithm 3 Block Descent and Thresholding algorithm for logistic modeling (BDT-Logit)
1: Initialize: Set $\hat\mu = 0$ and $\hat\theta_j = 0$ for $j = 1, \ldots, p$. Set $w_0 = 1/4$.
2: for $j = 1, 2, \ldots, p$ do
3:   Compute $\hat p = \mathrm{expit}(\hat\mu + \sum_{k=1}^p \tilde\Phi_k\hat\theta_k)$ and $\zeta = \hat\mu + \tilde\Phi_j\hat\theta_j + w_0^{-1}(y - \hat p)$.
4:   Update $\hat\mu = \bar\zeta$, the sample average of $\zeta$.
5:   Update $\hat\theta_j$ as a solution to (using Algorithm 1, lines 3-9)

$$\min_{\theta_j}\ \Big\{\frac{w_0}{2}\|\zeta - \bar\zeta - \tilde\Phi_j\theta_j\|_n^2 + \lambda\|D\theta_j\|_1 + \rho\|\tilde\Phi_j\theta_j\|_n\Big\}.$$

6: end for
7: Repeat lines 2-6 until convergence of the objective (11).
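The quadratic-approximation step of Algorithm 3 can be sketched in a few lines of pure Python (our own illustration, with hypothetical names): the working response uses the constant weight $w_0 = 1/4$, which upper-bounds $p(1-p)$ and thereby preserves the descent property.

```python
import math

def expit(c):
    # expit(c) = 1 / (1 + exp(-c))
    return 1.0 / (1.0 + math.exp(-c))

def working_response(y, eta, w0=0.25):
    # zeta_i = eta_i + (y_i - p_i) / w0 with p_i = expit(eta_i), as in line 3
    # of Algorithm 3; `eta` is the current linear predictor for each i.
    return [e + (yi - expit(e)) / w0 for yi, e in zip(y, eta)]
```

Since $p(1-p) \le 1/4$ for all $p \in (0,1)$, replacing the IRLS weights by $1/4$ gives a majorizing quadratic, which is the quadratic lower bound principle invoked in the text.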
5. Numerical experiments

We evaluate the BDT algorithms and doubly penalized additive modeling (dpam) in two aspects. One is computational performance: we compare active-set and coordinate-descent methods for solving the Lasso sub-problem (6) and investigate the effectiveness of the screening rules. The other is statistical performance in terms of mean squared errors or logistic losses: we compare the estimators obtained from dpam and the related methods SpAM (Ravikumar et al., 2009) and hgam (Meier et al., 2009), implemented in the R packages SAM (Zhao et al., 2014) and hgam (Frick et al., 2013).

5.1. Linear additive modeling

We generate data according to $y_i = \sum_{j=1}^p f_j(x_{ij}) + \varepsilon_i$ with $x_{ij} \sim$ Uniform$[-2.5, 2.5]$ and $\varepsilon_i \sim N(0, 1)$ for $i = 1, \ldots, n$ ($= 100$). Consider four scenarios (piecewise constant, piecewise linear, smooth, and mixed), where four nonzero functions, $f_1, \ldots, f_4$, are specified (Figure 1). The first scenario (piecewise constant) is the same as in Petersen et al. (2016). The remaining functions, $f_5, \ldots, f_p$, are zero, with $p = 100$.

5.1.1. Computational speed

We compare the active-set (AS) and coordinate-descent (CD) methods for solving sub-problem (6) in BDT. To focus on this comparison, the two versions of BDT without screening rules are denoted as AS-BDT and CD-BDT. The total number of basis functions from the $p$ covariates is large, $(n-1)p$. Instead of storing the basis matrices $(\tilde\Phi_1, \ldots, \tilde\Phi_p)$, we only store the inner-product matrices $(\tilde\Phi_1^T\tilde\Phi_{1,S_1}, \ldots, \tilde\Phi_p^T\tilde\Phi_{p,S_p})$, where $\tilde\Phi_{j,S_j}$ is a submatrix of $\tilde\Phi_j$ with column indices in $S_j$, and $S_j$ indicates the subset of basis functions of the $j$th covariate that are ever included in the active set before the termination of training. For the $j$th covariate (or block), the active set is defined as the subset of basis functions whose coefficients are currently estimated as nonzero when using either the AS or CD method. The total column set of stored inner-product matrices is $S = S_1 \cup \cdots \cup S_p$, and its size is denoted as $|S|$.
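The storage strategy just described can be sketched as a lazy cache of inner products (a simplified illustration with hypothetical names, not the paper's data structure): Gram entries are materialized only when a basis column first enters the active set, rather than storing the full $n \times (n-1)$ matrix per covariate.

```python
class InnerProductCache:
    # Lazily computed Gram entries <phi_k, phi_l> for basis columns that have
    # entered the active set, instead of storing full basis matrices.
    def __init__(self, columns):
        self.columns = columns   # list of basis columns, each a length-n list
        self.cache = {}          # (k, l) with k <= l  ->  inner product

    def inner(self, k, l):
        key = (min(k, l), max(k, l))
        if key not in self.cache:
            a, b = self.columns[key[0]], self.columns[key[1]]
            self.cache[key] = sum(x * y for x, y in zip(a, b))
        return self.cache[key]
```

In sparse regimes, where only a small fraction of the $(n-1)p$ basis functions ever become active, the cache size $|S|$ stays far below the full Gram dimension.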
For $m = 2, 3$, we apply AS-BDT and CD-BDT with a range of tuning parameters, $\lambda = \lambda_{\max}/4^k$ and $\rho = \rho_{\max}/4^l$ for $2 \le k \le l \le 4$. Within each block, the coordinate-descent algorithm is terminated with the tolerance $10^{-5}$, $10^{-6}$, or $10^{-7}$ in the decrease of objective values when solving the Lasso sub-problem (6). The backfitting cycles using AS-BDT or CD-BDT are terminated when the decrease of the objective (4) is smaller than $10^{-4}$.

Figures 2 and 3 show trace plots of the objective values, with six choices of $(\lambda, \rho)$ in scenario 2 (piecewise linear) for $m = 2, 3$. Similar plots in the other cases are presented in the Supplementary Material. It is evident that not only is AS-BDT 10-100 times faster than CD-BDT to reach the same stopping criterion, but also AS-BDT achieves smaller objective values than CD-BDT when the stopping criterion is met. The smaller the tuning parameters $(\lambda, \rho)$ are, the more substantial the speed gain of AS-BDT is over CD-BDT. In addition, the smaller the within-block tolerance, from $10^{-5}$ to $10^{-7}$, the longer it takes CD-BDT to reach the stopping criterion.

Table 1 summarizes several performance measures for scenario 2 and $m = 2, 3$. Similar results in the other cases are presented in the Supplementary Material. In addition to the objective values achieved ("obj"), we also study the number of backfitting cycles ("cycle"), the average number of iterations per cycle ("iter"), and the column size of stored inner-product matrices ("$|S|$"). For each cycle of backfitting, the number of iterations is defined by summing the numbers of iterations over all $p$ blocks. The number of iterations within the $j$th block is how many times the descent direction is adjusted when using the active-set algorithm, or how many scans are performed over the basis functions of the $j$th covariate when using the coordinate-descent algorithm, to solve the $j$th Lasso sub-problem. The results from Table 1 are consistent with Figures 2 and 3.
Compared with CD-BDT, AS-BDT achieves smaller objective values, with smaller numbers of backfitting cycles or iterations per cycle, especially with small $(\lambda, \rho)$. The number of backfitting cycles from CD-BDT is large when the within-block tolerance is relatively large, whereas the
average number of iterations per cycle increases substantially when the within-block tolerance is reduced. In addition, the column size of stored inner-product matrices is much smaller in AS-BDT than CD-BDT. The active-set algorithm is more careful than coordinate descent in selecting basis functions into and removing them from the active set.

We also apply AS-BDT with screening rules, denoted as AS-BDT-S. Within each block, if screening is successful, then the number of iterations is 0. From Table 1, the average number of iterations per cycle from AS-BDT-S is considerably smaller than from AS-BDT, especially when $(\lambda, \rho)$ become large. Additional results are provided in the Supplementary Material on relative frequencies of when the screening rules in Algorithm 2 are successful.

5.1.2. Statistical performance

We generate training, validation, and test sets, each with $p = 100$ covariates and $n = 100$ observations. The tuning parameters $(\lambda, \rho)$ for dpam are selected to minimize the mean squared error (MSE) on the validation set, calculated from the model fitted on the training set, over a grid $\lambda \le \lambda_{\max}$, $\rho \le \rho_{\max}$, and $\lambda \le \rho^2$ (Section 3.2). The model is then fitted with the selected $(\lambda, \rho)$ on the training set, and the test MSE is calculated on the test set. The calculation is repeated over 100 datasets, and the average test MSE is reported in Table 3. Similarly, average test MSEs are calculated for SpAM with $d = 3, 6, 10$ and hgam with $K = 5, 20, 30$. From Table 3, the smallest MSE is achieved by dpam with $m = 1$ in scenario 1 (piecewise constant), and by dpam with $m = 2$ in the other three scenarios.
Both SpAM and hgam appear to yield MSEs considerably larger than the minimum MSEs obtained by dpam, even in scenario 3 (smooth).

5.2. Logistic additive modeling

We generate data according to $y_i \sim$ Bernoulli$(\mathrm{expit}(\sum_{j=1}^p f_j(x_{ij})))$, where $x_{ij} \sim$ Uniform$[-2.5, 2.5]$, the functions $f_1, \ldots, f_4$ are the same as in Section 5.1, and the remaining $f_5, \ldots, f_p$ are zero, with $p = 100$.

5.2.1. Computational speed

We apply three versions of BDT-Logit: AS-BDT-logit and AS-BDT-logit-S, using the active-set method without or with the screening rules, and CD-BDT-logit, using the coordinate-descent method. Table 2 summarizes several performance measures similarly as in Table 1. It is evident that AS-BDT-logit outperforms CD-BDT-logit, in achieving smaller logistic losses, smaller numbers of backfitting cycles or iterations per cycle, and smaller sizes of stored inner-product matrices. In addition, AS-BDT-logit-S outperforms AS-BDT-logit, with smaller average numbers of iterations per cycle, due to the screening rules.

5.2.2. Statistical performance

Similarly as in Section 5.1, we conduct 100 repeated simulations, each with training and validation sets, to calculate test logistic losses on a test set for dpam and SpAM (Table 2). There is currently no logistic modeling allowed in the R package for hgam. To obtain a meaningful comparison, we increase the sample size to $n = 500$, because the sample size $n = 100$ appears to be insufficient to achieve reasonable estimation for logistic modeling. As shown in Table 2, dpam with $m = 1$ yields the smallest logistic losses in scenario 1 (piecewise constant), whereas dpam with $m = 2$ gives the smallest losses in the other three scenarios.

6. Conclusion

We develop backfitting algorithms for doubly penalized additive modeling using total-variation and empirical-norm penalties, and demonstrate the computational and statistical effectiveness of the proposed method. For solving the Lasso sub-problems (6), we advocate the use of the active-set method when compared with the coordinate-descent method.
It can be of interest to conduct further simulations and to investigate possible improvements and extensions.
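The backfitting scheme summarized above, cycling through covariates and refitting each component function to the partial residuals of the others, can be sketched generically as follows. This is plain backfitting with an arbitrary one-dimensional fitting routine, not the paper's doubly penalized update (which thresholds the solution of a Lasso sub-problem at each step); `linear_fit1d` is a placeholder component fitter.

```python
import numpy as np

def backfit(X, y, fit1d, n_cycles=50):
    """Generic backfitting: `fit1d(x, r)` returns fitted values of a
    one-dimensional fit of the partial residuals r on covariate x."""
    n, p = X.shape
    alpha = y.mean()
    F = np.zeros((n, p))                           # component fits f_j(x_ij)
    for _ in range(n_cycles):
        for j in range(p):
            r = y - alpha - F.sum(axis=1) + F[:, j]   # partial residuals
            fj = fit1d(X[:, j], r)
            F[:, j] = fj - fj.mean()               # center for identifiability
    return alpha, F

def linear_fit1d(x, r):
    """Simple-linear-regression fit, as a placeholder component fitter."""
    xc = x - x.mean()
    return xc * (xc @ r) / (xc @ xc)
```

With linear component fitters, backfitting reduces to Gauss–Seidel iteration on the normal equations; the paper replaces `fit1d` by a thresholded Lasso solution over a per-covariate basis.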
References

Beck, A & Teboulle, M (2009), `A fast iterative shrinkage-thresholding algorithm for linear inverse problems,' SIAM Journal on Imaging Sciences, 2(1).
Bickel, PJ, Ritov, Y & Tsybakov, AB (2009), `Simultaneous analysis of Lasso and Dantzig selector,' Annals of Statistics, 37(4).
Böhning, D & Lindsay, BG (1988), `Monotonicity of quadratic approximation algorithms,' Annals of the Institute of Statistical Mathematics, 40(4).
Donoho, DL & Johnstone, JM (1994), `Ideal spatial adaptation by wavelet shrinkage,' Biometrika, 81(3).
Frick, H, Kondofersky, I, Kuehnle, OS, Lindenlaub, C, Pfundstein, G, Speidel, M, Spindler, M, Straub, A, Wickler, F, Zink, K, Eugster, M & Hothorn, T (2013), hgam: High-dimensional Additive Modelling, R package.
Friedman, J, Hastie, T & Tibshirani, R (2010), `Regularization paths for generalized linear models via coordinate descent,' Journal of Statistical Software, 33(1), pp. 1-22.
Hastie, T & Tibshirani, R (1990), Generalized Additive Models, Wiley.
Hoefling, H (2010), `A path algorithm for the fused Lasso signal approximator,' Journal of Computational and Graphical Statistics, 19(4).
Huang, J, Horowitz, JL & Wei, F (2010), `Variable selection in nonparametric additive models,' Annals of Statistics, 38(4).
Kim, SJ, Koh, K, Boyd, S & Gorinevsky, D (2009), `ℓ1 trend filtering,' SIAM Review, 51(2).
Kim, SJ, Koh, K, Lustig, M, Boyd, S & Gorinevsky, D (2007), `An interior-point method for large-scale ℓ1-regularized least squares,' IEEE Journal of Selected Topics in Signal Processing, 1(4).
Koltchinskii, V & Yuan, M (2010), `Sparsity in multiple kernel learning,' Annals of Statistics.
Lin, Y & Zhang, HH (2006), `Component selection and smoothing in multivariate nonparametric regression,' Annals of Statistics, 34(5).
Mammen, E & van de Geer, S (1997), `Locally adaptive regression splines,' Annals of Statistics, 25(1).
Meier, L, van de Geer, S & Bühlmann, P (2009), `High-dimensional additive modeling,' Annals of Statistics, 37(6B).
Osborne, MR, Presnell, B & Turlach, BA (1998), `Knot selection for regression splines via the lasso,' Computing Science and Statistics, 30.
Osborne, MR, Presnell, B & Turlach, BA (2000), `A new approach to variable selection in least squares problems,' IMA Journal of Numerical Analysis, 20(3).
Petersen, A, Witten, D & Simon, N (2016), `Fused Lasso additive model,' Journal of Computational and Graphical Statistics, 25(4).
Raskutti, G, Wainwright, MJ & Yu, B (2012), `Minimax-optimal rates for sparse additive models over kernel classes via convex programming,' Journal of Machine Learning Research, 13.
Ravikumar, P, Liu, H, Lafferty, J & Wasserman, L (2009), `SpAM: sparse additive models,' Journal of the Royal Statistical Society, Series B, 71(5).
Sadhanala, V & Tibshirani, RJ (2017), `Additive models with trend filtering,' arXiv preprint.
Stone, CJ (1986), `The dimensionality reduction principle for generalized additive models,' Annals of Statistics, 14(2).
Tan, Z & Zhang, CH (2017), `Penalized estimation in additive regression with high-dimensional data,' arXiv preprint.
Tibshirani, R (1996), `Regression shrinkage and selection via the Lasso,' Journal of the Royal Statistical Society, Series B.
Tibshirani, RJ (2014), `Adaptive piecewise polynomial estimation via trend filtering,' Annals of Statistics, 42(1).
Wang, YX, Smola, A & Tibshirani, R (2014), `The falling factorial basis and its statistical applications,' in International Conference on Machine Learning (ICML).
Wood, SN (2017), Generalized Additive Models: An Introduction with R, CRC Press.
Wu, TT & Lange, K (2008), `Coordinate descent algorithms for Lasso penalized regression,' Annals of Applied Statistics.
Wu, TT & Lange, K (2010), `The MM alternative to EM,' Statistical Science, 25(4).
Zhao, T, Li, X, Liu, H & Roeder, K (2014), SAM: Sparse Additive Modelling, R package version 1.0.5.
Figure 1. Nonzero functions f_1, …, f_4 used to generate data: (1) scenario 1 (piecewise constant); (2) scenario 2 (piecewise linear); (3) scenario 3 (smooth); (4) scenario 4 (combination). [Four panels plotting f(x) against x; curves not reproduced.]
Figure 2. Objective values over running time from ASD-BDT and from CD-BDT with tolerances 10^-5, 10^-6 and 10^-7, for scenario 2 and m = 2 in the regression setting. Panels correspond to (ρ, λ) = (λmax/16, λmax/16), (λmax/64, λmax/16), (λmax/256, λmax/16), (λmax/64, λmax/64), (λmax/256, λmax/64) and (λmax/256, λmax/256). [Curves not reproduced.]

Figure 3. As Figure 2 but with m = 3: objective values over running time from ASD-BDT and from CD-BDT with tolerances 10^-5, 10^-6 and 10^-7, for scenario 2 in the regression setting. [Curves not reproduced.]
Figure 4. Objective values over running time from ASD-BDT-Logit and from CD-BDT-Logit with tolerances 10^-5, 10^-6 and 10^-7, for scenario 2 and m = 2 in the classification setting. Panels correspond to (ρ, λ) = (λmax/16, λmax/16), (λmax/64, λmax/16), (λmax/256, λmax/16), (λmax/64, λmax/64), (λmax/256, λmax/64) and (λmax/256, λmax/256). [Curves not reproduced.]

Figure 5. As Figure 4 but with m = 3: objective values over running time from ASD-BDT-Logit and from CD-BDT-Logit with tolerances 10^-5, 10^-6 and 10^-7, for scenario 2 in the classification setting. [Curves not reproduced.]
Table 1. Computation comparison for scenario 2 in the regression setting. For m = 2 and m = 3, the methods AS-BDT, AS-BDT-S, and CD-BDT (with within-block tolerances 10^-5, 10^-6 and 10^-7) are compared on the metrics obj, cycle, iter and S, over the settings λ = λmax/16, λmax/64, λmax/256 and ρ = λ, λ/4, λ/16. [Numeric entries not recoverable from the extracted text.]
Table 2. Computation comparison for scenario 2 in the classification setting. For m = 2 and m = 3, the methods AS-BDT-Logit, AS-BDT-Logit-S, and CD-BDT-Logit (with within-block tolerances 10^-5, 10^-6 and 10^-7) are compared on the metrics obj, cycle, iter and S, over the settings λ = λmax/16, λmax/64, λmax/256 and ρ = λ, λ/4, λ/16. [Numeric entries not recoverable from the extracted text.]
Table 3. Test MSEs from linear additive modeling

                                Scenario 1   Scenario 2   Scenario 3   Scenario 4
SpAM (Ravikumar et al., 2009)
  d = 3                         –(0.05)      1.75(0.03)   1.62(0.02)   1.78(0.03)
  d = 6                         –(0.05)      2.4(0.04)    2.03(0.03)   2.3(0.04)
  d = 10                        –(0.06)      3.8(0.05)    2.69(0.05)   2.6(0.05)
hgam (Meier et al., 2009)
  K = 5                         –(0.04)      1.59(0.03)   1.97(0.03)   1.84(0.02)
  K = 20                        –(0.04)      1.57(0.03)   1.95(0.03)   1.82(0.02)
  K = 30                        –(0.04)      1.53(0.03)   1.89(0.03)   1.78(0.02)
dpam (this paper)
  m = 1                         2.03(0.04)   1.88(0.03)   1.76(0.03)   1.74(0.02)
  m = 2                         –(0.04)      1.40(0.02)   1.40(0.02)   1.60(0.02)
  m = 3                         –(0.04)      1.5(0.02)    1.46(0.02)   1.64(0.02)

Note: FLAM (Petersen et al., 2016) corresponds to dpam with m = 1, except that linear interpolation is used when evaluating the fitted functions on the validation and test sets. Standard errors are in parentheses; entries marked – (and some digits) were not recoverable from the extracted text.

Table 4. Test logistic losses (×10) from logistic additive modeling

                                Scenario 1   Scenario 2   Scenario 3   Scenario 4
SpAM (Ravikumar et al., 2009)
  d = 3                         –(0.01)      5.38(0.01)   5.33(0.01)   5.08(0.01)
  d = 6                         5.8(0.01)    5.49(0.01)   5.42(0.01)   5.8(0.01)
  d = 10                        5.3(0.02)    5.55(0.01)   5.46(0.01)   5.25(0.01)
dpam (this paper)
  m = 1                         4.97(0.02)   5.08(0.01)   5.8(0.01)    5.04(0.02)
  m = 2                         5.1(0.02)    4.88(0.01)   5.0(0.01)    5.0(0.01)
  m = 3                         5.4(0.02)    4.88(0.01)   5.04(0.01)   5.0(0.01)

Note: Logistic modeling is currently not implemented in the R package for hgam (Meier et al., 2009). Standard errors are in parentheses; entries marked – (and some digits) were not recoverable from the extracted text.
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04
More informationLASSO Review, Fused LASSO, Parallel LASSO Solvers
Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable
More informationSparsity Regularization
Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation
More informationKneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"
Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/
More informationStatistical Machine Learning for Structured and High Dimensional Data
Statistical Machine Learning for Structured and High Dimensional Data (FA9550-09- 1-0373) PI: Larry Wasserman (CMU) Co- PI: John Lafferty (UChicago and CMU) AFOSR Program Review (Jan 28-31, 2013, Washington,
More informationChapter 3. Linear Models for Regression
Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear
More informationMachine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, Emily Fox 2014
Case Study 3: fmri Prediction Fused LASSO LARS Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, 2014 Emily Fox 2014 1 LASSO Regression
More informationPermutation-invariant regularization of large covariance matrices. Liza Levina
Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationOr How to select variables Using Bayesian LASSO
Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection
More informationLeast squares under convex constraint
Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption
More informationExact Hybrid Covariance Thresholding for Joint Graphical Lasso
Exact Hybrid Covariance Thresholding for Joint Graphical Lasso Qingming Tang Chao Yang Jian Peng Jinbo Xu Toyota Technological Institute at Chicago Massachusetts Institute of Technology Abstract. This
More informationSparse Additive machine
Sparse Additive machine Tuo Zhao Han Liu Department of Biostatistics and Computer Science, Johns Hopkins University Abstract We develop a high dimensional nonparametric classification method named sparse
More information