Sparse Prediction with the k-overlap Norm


Sparse Prediction with the k-Overlap Norm

Andreas Argyriou (Toyota Technological Institute at Chicago), Rina Foygel (University of Chicago), Nathan Srebro (Toyota Technological Institute at Chicago)

Abstract

We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an $\ell_2$ penalty, and can also be interpreted as a group Lasso norm with overlaps. We show that this new norm provides a tighter relaxation than the elastic net and suggest using it as a replacement for the Lasso or the elastic net in sparse prediction problems.

1 Introduction

Regularizing with the $\ell_1$ norm, when we expect a sparse solution to a regression problem, is often justified by $\|w\|_1$ being the convex envelope of $\|w\|_0$ (the number of non-zero coordinates of a vector $w \in \mathbb{R}^d$). That is, $\|w\|_1$ is the tightest convex lower bound on $\|w\|_0$. But we must be careful with this statement: for sparse vectors with large entries, $\|w\|_0$ can be small while $\|w\|_1$ is large. In order to discuss convex lower bounds on $\|w\|_0$, we must impose some scale constraint. A more accurate statement is that $\|w\|_1 \le \|w\|_\infty \|w\|_0$, and so, when the magnitudes of the entries of $w$ are bounded by $1$, then $\|w\|_1 \le \|w\|_0$, and indeed it is the largest such convex lower bound. Viewed as a convex outer relaxation,

$$S_k^{(\infty)} := \{ w : \|w\|_0 \le k,\ \|w\|_\infty \le 1 \} \subseteq \{ w : \|w\|_1 \le k \}.$$

Intersecting the right-hand side with the $\ell_\infty$ unit ball, we get the tightest convex outer bound (convex hull) of $S_k^{(\infty)}$:

$$\{ w : \|w\|_1 \le k,\ \|w\|_\infty \le 1 \} = \mathrm{conv}\big(S_k^{(\infty)}\big).$$

However, in our view, this relationship between $\|w\|_1$ and $\|w\|_0$ yields disappointing learning guarantees, and does not appropriately capture the success of the $\ell_1$ norm as a surrogate for sparsity. In particular, the sample complexity of learning a linear predictor with $k$ non-zero entries by empirical risk minimization inside this class (an NP-hard optimization problem) scales as $O(k \log d)$, but relaxing to the constraint $\|w\|_1 \le k$ yields a sample complexity which scales as $O(k^2 \log d)$ (since the sample complexity of $\ell_1$-regularized learning scales quadratically with the $\ell_1$ norm [10, 20]).

Perhaps a better reason for the $\ell_1$ norm being a good surrogate for sparsity is that, not only do we expect the magnitude of each entry of $w$ to be bounded, but we further expect $\|w\|_2$ to be small. In a regression setting, with a vector of features $x$, this can be justified when $\mathbb{E}[(x^\top w)^2]$ is bounded (a reasonable assumption) and the features are not too correlated; see, e.g., [16]. More broadly, especially in the presence of correlations, we might require this as a modeling assumption to aid in robustness and generalization. In any case, we then have $\|w\|_1 \le \sqrt{\|w\|_0}\,\|w\|_2$, and so if we are interested in predictors with bounded $\ell_2$ norm, we can motivate the $\ell_1$ norm through the following relaxation of sparsity, where the scale is now set by the $\ell_2$ norm:

$$\{ w : \|w\|_0 \le k,\ \|w\|_2 \le B \} \subseteq \{ w : \|w\|_1 \le \sqrt{k}\,B \}.$$

The sample complexity when using the relaxation now scales as $O(k B^2 \log d)$.

Sparse + $\ell_2$ constraint. Our starting point is then that of combining sparsity and $\ell_2$ regularization, and learning a sparse predictor with small $\ell_2$ norm. We are thus interested in classes of the form

$$S_k^{(2)} := \{ w : \|w\|_0 \le k,\ \|w\|_2 \le 1 \}.$$

As discussed above, the class $\{ w : \|w\|_1 \le \sqrt{k} \}$ (corresponding to the standard Lasso) provides a convex relaxation of $S_k^{(2)}$. But it is clear that we can get a tighter convex relaxation by keeping the $\ell_2$ constraint as well:

$$S_k^{(2)} \subseteq \{ w : \|w\|_1 \le \sqrt{k},\ \|w\|_2 \le 1 \} \subseteq \{ w : \|w\|_1 \le \sqrt{k} \}. \qquad (1.1)$$

Footnote 1: The $O(k \log d)$ rate is based on bounding the VC-subgraph dimension of this class, which is essentially the effective number of parameters.
Footnote 2: The sample complexity is $O(k B^2 \log d)$, where the dependence on $B^2$ is to be expected. Note that if feature vectors are $\ell_\infty$-bounded (i.e. individual features are bounded), the sample complexity when using only $\|w\|_2 \le B$ (without a sparsity or $\ell_1$ constraint) scales as $O(B^2 d)$. That is, even after identifying the correct support, we still need a sample complexity that scales with $B^2$.
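As a quick illustration, the two inequalities that drive this construction, $\|w\|_1 \le \|w\|_\infty \|w\|_0$ and $\|w\|_1 \le \sqrt{\|w\|_0}\,\|w\|_2$, can be checked numerically on random sparse vectors. A minimal NumPy sketch (not tied to any particular implementation in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    d = int(rng.integers(2, 20))
    # a random vector with roughly half of its entries zeroed out
    w = rng.normal(size=d) * (rng.random(d) < 0.5)
    l0 = np.count_nonzero(w)
    if l0 == 0:
        continue
    l1, l2, linf = np.abs(w).sum(), np.linalg.norm(w), np.abs(w).max()
    assert l1 <= linf * l0 + 1e-12          # ||w||_1 <= ||w||_inf * ||w||_0
    assert l1 <= np.sqrt(l0) * l2 + 1e-12   # ||w||_1 <= sqrt(||w||_0) * ||w||_2
```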

Constraining (or equivalently, penalizing) both the $\ell_1$ and $\ell_2$ norms, as in (1.1), is known as the elastic net [21] and has indeed been advocated as a better alternative to the Lasso [7]. In this paper, we ask whether the elastic net is the tightest convex relaxation to sparsity plus $\ell_2$ (that is, to $S_k^{(2)}$) or whether a tighter, and better, convex relaxation is possible.

A new norm. We consider the convex hull (tightest convex outer bound) of $S_k^{(2)}$,

$$C_k := \mathrm{conv}\big(S_k^{(2)}\big) = \mathrm{conv}\{ w : \|w\|_0 \le k,\ \|w\|_2 \le 1 \}. \qquad (1.2)$$

We study the gauge function associated with this convex set, that is, the norm whose unit ball is given by (1.2), which we call the k-overlap norm. We show that, for $k > 1$, this is indeed a tighter convex relaxation than the elastic net (that is, both inequalities in (1.1) are in fact strict), and it is therefore a better convex constraint than the elastic net when seeking a sparse, low-$\ell_2$-norm linear predictor. We thus advocate using it as a replacement for the elastic net.

However, we also show that the gap between the elastic net and the k-overlap norm is at most a factor of $\sqrt{2}$, corresponding to a factor of two difference in the sample complexity. Thus, our work can also be interpreted as justifying the use of the elastic net, viewing it as a fairly good approximation to the tightest possible convex relaxation of sparsity intersected with an $\ell_2$ constraint. Still, even a factor of two should not necessarily be ignored and, as we show in our experiments, using the tighter k-overlap norm can indeed be beneficial.

To better understand the k-overlap norm, we show in Section 2 that it can also be described as the group lasso with overlap norm [9] corresponding to all $\binom{d}{k}$ subsets of $k$ features, hence its name. Despite the exponential number of groups in this description, we show that the k-overlap norm can be calculated efficiently in $O(d \log d)$ time and that its dual is given simply by the $\ell_2$ norm of the $k$ largest entries. We also provide efficient first-order optimization algorithms for learning with the k-overlap norm in Section 4. In Section 5, we demonstrate how it can indeed outperform the Lasso and the elastic net.

1.1 Related Work

In many learning problems of interest, the Lasso has been observed to shrink too many of the variables of $w$ to zero. In particular, in many applications, when a group of variables is highly correlated, the Lasso may prefer a sparse solution, but we might gain more predictive accuracy by including all the correlated variables in our model. These drawbacks have recently motivated the use of various other regularization methods, such as the elastic net [21], which penalizes the regression coefficients $w$ with a combination of $\ell_1$ and $\ell_2$ norms:

$$\min\big\{ \|Xw - y\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2 : w \in \mathbb{R}^d \big\}, \qquad (1.3)$$

where, for a sample of size $n$, $y \in \mathbb{R}^n$ is the vector of response values, and $X \in \mathbb{R}^{n \times d}$ is a matrix with column $j$ containing the values of feature $j$. The elastic net can be viewed as a trade-off between $\ell_1$ regularization (the Lasso) and $\ell_2$ regularization (Ridge regression [8]), depending on the relative values of $\lambda_1$ and $\lambda_2$. In particular, when $\lambda_2 = 0$, (1.3) is equivalent to the Lasso. This method, and the other methods discussed below, have been observed to significantly outperform the Lasso in many real applications.

The pairwise elastic net (PEN), proposed by [12], has a penalty function that accounts for similarity among features:

$$\|w\|^{\mathrm{PEN}}_R = \|w\|_2^2 + \|w\|_1^2 - |w|^\top R\, |w|,$$

where $R \in [0,1]^{p \times p}$ is a matrix with $R_{jk}$ measuring similarity between features $X_j$ and $X_k$. The trace Lasso [5] is a second method proposed to handle correlations within $X$, defined by

$$\|w\|^{\mathrm{trace}}_X = \|X \operatorname{diag}(w)\|_*,$$

where $\|\cdot\|_*$ denotes the matrix trace-norm (the sum of the singular values), which promotes a low-rank solution. If the features are orthogonal, then both the PEN and the trace Lasso are equivalent to the Lasso. If the features are all identical, then both penalties are equivalent to Ridge regression (penalizing $\|w\|_2^2$). Another existing penalty is OSCAR [3], given by

$$\|w\|^{\mathrm{OSCAR}}_c = \|w\|_1 + c \sum_{j < k} \max\{|w_j|, |w_k|\}.$$

Like the elastic net, each one of these three methods also prefers averaging similar features over selecting a single feature.

2 The k-Overlap Norm

One argument for the elastic net has been the flexibility of tuning the cardinality $k$ of the regression vector $w$. Thus, when groups of correlated variables are present, a larger $k$ may be learned, which corresponds to a higher $\lambda_2$ in (1.3). A more natural way to obtain such an effect of tuning the cardinality is to consider the convex hull of cardinality-$k$ vectors,

$$C_k = \mathrm{conv}\big(S_k^{(2)}\big) = \mathrm{conv}\{ w \in \mathbb{R}^d : \|w\|_0 \le k,\ \|w\|_2 \le 1 \}.$$

Clearly the sets $C_k$ are nested, and $C_1$ and $C_d$ are the unit balls for the $\ell_1$ and $\ell_2$ norms, respectively. Consequently we define the k-overlap norm as the norm whose unit ball equals $C_k$ (the gauge function associated with $C_k$). An equivalent definition is the following variational formula:

Definition 2.1. Let $k \in \{1, \dots, d\}$. The k-overlap norm $\|\cdot\|^{ov}_k$ is defined, for every $w \in \mathbb{R}^d$, as

$$\|w\|^{ov}_k := \min\Big\{ \sum_{I \in \mathcal{G}_k} \|v_I\|_2 \ :\ \mathrm{supp}(v_I) \subseteq I,\ \sum_{I \in \mathcal{G}_k} v_I = w \Big\},$$

Figure 1: Unit ball of the k-overlap norm for $k = 2$ (left) and of the elastic net (right) on $\mathbb{R}^3$.

where $\mathcal{G}_k$ denotes the set of all subsets of $\{1, \dots, d\}$ of cardinality at most $k$. The equivalence is immediate by rewriting $v_I = \mu_I z_I$ in the above definition, where $\mu_I \ge 0$, $z_I \in C_k$ and $\sum_{I \in \mathcal{G}_k} \mu_I = 1$. In addition, this immediately implies that $\|\cdot\|^{ov}_k$ is indeed a norm. In fact, the k-overlap norm is equivalent to the norm used by the group lasso with overlaps [9], when the set of overlapping groups is chosen to be $\mathcal{G}_k$ (although the group lasso has traditionally been used for applications with known grouped structure, unlike the case considered here). Although the variational Definition 2.1 is not amenable to computation because of the exponential growth of the set of groups $\mathcal{G}_k$, the k-overlap norm is computationally very tractable, with an $O(d \log d)$ algorithm described in Section 2.2.

As already mentioned, $\|\cdot\|^{ov}_1 = \|\cdot\|_1$ and $\|\cdot\|^{ov}_d = \|\cdot\|_2$. The unit ball of this new norm in $\mathbb{R}^3$ for $k = 2$ is depicted in Figure 1. We immediately notice several differences between this unit ball and the $\ell_1$, $\ell_2$ and elastic net unit balls. For example, at points with cardinality $k$ and $\ell_2$ norm equal to $1$, the k-overlap norm is not differentiable, but unlike the $\ell_1$ or elastic net norm, it is differentiable at points with cardinality less than $k$. Thus, the k-overlap norm is less biased towards sparse vectors than the elastic net and the $\ell_1$ norm.

2.1 The Dual Norm

It is interesting and useful to compute the dual of the k-overlap norm. We follow the notation of [2] for ordered vectors: for any $w \in \mathbb{R}^d$, $|w|$ is the vector of absolute values, and $|w|^\downarrow_i$ is the $i$-th largest element of $|w|$. We have

$$\|u\|^{ov\,*}_k = \max\{ \langle w, u \rangle : \|w\|^{ov}_k \le 1 \} = \max\Big\{ \Big( \sum_{i \in I} u_i^2 \Big)^{1/2} : I \in \mathcal{G}_k \Big\} = \Big( \sum_{i=1}^{k} (|u|^\downarrow_i)^2 \Big)^{1/2} =: \|u\|^{(2)}_{(k)}.$$

This is the $\ell_2$ norm of the $k$ largest entries in $u$, and is known as the 2-$k$ symmetric gauge norm [2]. Not surprisingly, this dual norm interpolates between the $\ell_2$ norm (when $k = d$ and all entries are taken) and the $\ell_\infty$ norm (when $k = 1$ and only the largest entry is taken).
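Since the dual norm is simply the $\ell_2$ norm of the $k$ largest-magnitude entries, it is straightforward to compute. A minimal NumPy sketch (the function name is ours, not part of the paper):

```python
import numpy as np

def dual_k_overlap_norm(u, k):
    """||u||^{(2)}_{(k)}: the l2 norm of the k largest entries of |u|."""
    top_k = np.sort(np.abs(u))[-k:]
    return float(np.linalg.norm(top_k))

u = np.array([3.0, -1.0, 0.5, 2.0])
print(dual_k_overlap_norm(u, 1))   # 3.0, the l-infinity norm
print(dual_k_overlap_norm(u, 2))   # sqrt(3^2 + 2^2) ~= 3.606
print(dual_k_overlap_norm(u, 4))   # ~= 3.775, the l2 norm
```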

This interpolation of the dual norm parallels the interpolation of the k-overlap norm between the $\ell_1$ and $\ell_2$ norms. Like the $\ell_p$ norms and the elastic net, the k-overlap norm and its dual are symmetric gauge functions, that is, sign- and permutation-invariant norms. For properties of such norms, see [2].

2.2 Computation of the Norm

In this section, we derive an alternative formula for the k-overlap norm, which leads to computation of the value of the norm in $O(d \log d)$ steps.

Proposition 2.1. For every $w \in \mathbb{R}^d$,

$$\|w\|^{ov}_k = \bigg( \sum_{i=1}^{k-r-1} \big(|w|^\downarrow_i\big)^2 + \frac{1}{r+1} \Big( \sum_{i=k-r}^{d} |w|^\downarrow_i \Big)^2 \bigg)^{1/2},$$

where $r$ is the unique integer in $\{0, \dots, k-1\}$ satisfying

$$|w|^\downarrow_{k-r-1} \ >\ \frac{1}{r+1} \sum_{i=k-r}^{d} |w|^\downarrow_i \ \ge\ |w|^\downarrow_{k-r}. \qquad (2.1)$$

(Here we use the notation $|w|^\downarrow_0 := +\infty$.)

This result shows that $\|\cdot\|^{ov}_k$ trades off between the $\ell_1$ and $\ell_2$ norms in a way that favors sparse vectors but allows for cardinality larger than $k$. It combines the uniform shrinkage of an $\ell_2$ penalty for the largest components with the sparse shrinkage of an $\ell_1$ penalty for the smallest components.
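Proposition 2.1 translates directly into a simple procedure: sort $|w|$, find the unique $r$ satisfying (2.1), and combine the $\ell_2$ contribution of the largest entries with the averaged tail. A hedged NumPy sketch of this computation (function name ours; a direct scan over $r$ after sorting):

```python
import numpy as np

def k_overlap_norm(w, k):
    """||w||^ov_k via the formula of Proposition 2.1."""
    z = np.sort(np.abs(w))[::-1]               # |w| sorted in decreasing order
    tails = np.cumsum(z[::-1])[::-1]           # tails[i] = z[i] + ... + z[d-1]
    for r in range(k):                         # r = 0, ..., k-1
        A_r = tails[k - r - 1]                 # sum_{i=k-r}^{d} |w|_i  (1-based indices)
        upper = z[k - r - 2] if k - r - 2 >= 0 else np.inf   # |w|_{k-r-1}, with |w|_0 := +inf
        lower = z[k - r - 1]                   # |w|_{k-r}
        if upper > A_r / (r + 1) >= lower:     # condition (2.1)
            head = np.sum(z[:k - r - 1] ** 2)
            return float(np.sqrt(head + A_r ** 2 / (r + 1)))
    raise ValueError("no valid r found; Proposition 2.1 guarantees one exists")

w = np.array([1.5, -1.0, 0.25, 0.0])
print(k_overlap_norm(w, 1), np.abs(w).sum())      # both 2.75   (k = 1 recovers the l1 norm)
print(k_overlap_norm(w, 4), np.linalg.norm(w))    # both ~1.820 (k = d recovers the l2 norm)
```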

Proof of Proposition 2.1. We will use the inequality $\langle w, u \rangle \le \langle |w|^\downarrow, |u|^\downarrow \rangle$ [6]. We have

$$\big(\|w\|^{ov}_k\big)^2 = \max\Big\{ 2\langle u, w \rangle - \big(\|u\|^{(2)}_{(k)}\big)^2 : u \in \mathbb{R}^d \Big\} = \max\Big\{ 2\sum_{i=1}^{d} \alpha_i |w|^\downarrow_i - \sum_{i=1}^{k} \alpha_i^2 : \alpha_1 \ge \dots \ge \alpha_d \ge 0 \Big\}$$
$$= \max\Big\{ 2\sum_{i=1}^{k-1} \alpha_i |w|^\downarrow_i + 2\alpha_k \sum_{i=k}^{d} |w|^\downarrow_i - \sum_{i=1}^{k} \alpha_i^2 : \alpha_1 \ge \dots \ge \alpha_k \ge 0 \Big\}.$$

Let $A_r := \sum_{i=k-r}^{d} |w|^\downarrow_i$ for $r \in \{0, \dots, k-1\}$. If $A_0 < |w|^\downarrow_{k-1}$ then the solution $\alpha$ is given by $\alpha_i = |w|^\downarrow_i$ for $i = 1, \dots, k-1$ and $\alpha_i = A_0$ for $i = k, \dots, d$. If $A_0 \ge |w|^\downarrow_{k-1}$ then the optimal $\alpha_{k-1}, \alpha_k$ lie between $|w|^\downarrow_{k-1}$ and $A_0$, and have to be equal. So the maximization becomes

$$\max\Big\{ 2\sum_{i=1}^{k-2} \alpha_i |w|^\downarrow_i - \sum_{i=1}^{k-2} \alpha_i^2 + 2 A_1 \alpha_{k-1} - 2\alpha_{k-1}^2 : \alpha_1 \ge \dots \ge \alpha_{k-1} \ge 0 \Big\}.$$

If $A_0 \ge |w|^\downarrow_{k-1}$ and $|w|^\downarrow_{k-2} > \frac{A_1}{2}$ then the solution is $\alpha_i = |w|^\downarrow_i$ for $i = 1, \dots, k-2$ and $\alpha_i = \frac{A_1}{2}$ for $i = k-1, \dots, d$. Otherwise we proceed as before and continue this process. At stage $r$ the process terminates if

$$A_0 \ge |w|^\downarrow_{k-1},\ \ \frac{A_1}{2} \ge |w|^\downarrow_{k-2},\ \ \dots,\ \ \frac{A_{r-1}}{r} \ge |w|^\downarrow_{k-r},\ \ \frac{A_r}{r+1} < |w|^\downarrow_{k-r-1},$$

and all but the last two inequalities are redundant. Hence the condition can be rewritten as (2.1). One optimal solution is $\alpha_i = |w|^\downarrow_i$ for $i = 1, \dots, k-r-1$ and $\alpha_i = \frac{A_r}{r+1}$ for $i = k-r, \dots, d$. This proves the claim.

2.3 Learning with the k-overlap norm

We thus propose using learning rules with k-overlap norm regularization. These are appropriate when we would like to learn a sparse predictor that also has low $\ell_2$ norm, and are especially relevant when features might be correlated (that is, in almost all learning tasks) but the correlation structure is not known in advance. For regression problems with squared error loss, the resulting learning rule is of the form

$$\min\Big\{ \tfrac{1}{2}\|Xw - y\|_2^2 + \tfrac{\lambda}{2} \big(\|w\|^{ov}_k\big)^2 : w \in \mathbb{R}^d \Big\} \qquad (2.2)$$

with $\lambda > 0$ a regularization parameter and $k \in \{1, \dots, d\}$ also a parameter to be tuned. As is typical in regularization-based methods, both $\lambda$ and $k$ can be selected by cross validation [7]. Although we have motivated this norm by considering $S_k^{(2)}$, the set of $k$-sparse unit vectors, the parameter $k$ does not necessarily correspond to the sparsity level of the fitted vector of coefficients, and should be chosen via cross-validation independently of the desired sparsity level.

3 Relation to the Elastic Net

Recall that the elastic net with penalty parameters $\lambda_1$ and $\lambda_2$ selects a vector of coefficients given by

$$\arg\min\Big\{ \|Xw - y\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2 \Big\}. \qquad (3.1)$$

For ease of comparison with the k-overlap norm, we first show that the elastic net is equivalent to regularizing with the norm

$$\|w\|^{el}_k := \max\Big\{ \frac{\|w\|_1}{\sqrt{k}},\ \|w\|_2 \Big\}$$

for some $k \in [1, d]$, corresponding to the unit ball in (1.1) (note that this $k$ is not necessarily an integer).
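Before establishing the precise relationship, it is easy to check numerically that $\|\cdot\|^{el}_k$ never exceeds the k-overlap norm and, as Proposition 3.1 below shows, falls short of it by at most a factor of $\sqrt{2}$. A minimal sketch, assuming the k_overlap_norm helper from the Section 2.2 sketch is in scope:

```python
import numpy as np

def elastic_net_norm(w, k):
    """||w||^el_k = max{ ||w||_1 / sqrt(k), ||w||_2 }."""
    return max(np.abs(w).sum() / np.sqrt(k), np.linalg.norm(w))

rng = np.random.default_rng(0)
for _ in range(1000):
    d = int(rng.integers(2, 30))
    k = int(rng.integers(1, d + 1))
    w = rng.normal(size=d)
    el, ov = elastic_net_norm(w, k), k_overlap_norm(w, k)
    assert el <= ov * (1 + 1e-9)                  # ||.||^el_k <= ||.||^ov_k
    assert ov <= np.sqrt(2) * el * (1 + 1e-9)     # the sqrt(2) bound of Proposition 3.1
```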

To see this equivalence, let $\hat{w}$ be a solution to (3.1), and let

$$k := \Big( \frac{\|\hat{w}\|_1}{\|\hat{w}\|_2} \Big)^2 \in [1, d].$$

Then for any $w \ne \hat{w}$, if $\|w\|^{el}_k \le \|\hat{w}\|^{el}_k$, then $\|w\|_p \le \|\hat{w}\|_p$ for $p = 1, 2$. Since $\hat{w}$ is a solution to (3.1), therefore, $\|Xw - y\|_2^2 \ge \|X\hat{w} - y\|_2^2$. This proves that, for some constraint parameter $B$,

$$\hat{w} = \arg\min\big\{ \|Xw - y\|_2^2 : \|w\|^{el}_k \le B \big\}.$$

Like the k-overlap norm, the elastic net interpolates between the $\ell_1$ and $\ell_2$ norms. In fact, when $k$ is an integer, any $k$-sparse unit vector $w \in \mathbb{R}^d$ must lie in the unit ball of $\|\cdot\|^{el}_k$. Since the k-overlap norm gives the convex hull of all $k$-sparse unit vectors, this immediately implies that

$$\|w\|^{el}_k \le \|w\|^{ov}_k \quad \text{for all } w \in \mathbb{R}^d.$$

The two norms are not equal, however. The difference between the two is illustrated in Figure 1, where we see that the k-overlap norm is more rounded. To see an example where the two norms are not equal, we set $d = k + 1$ for some large $k$ and let $w = (1.5, 1, 1, \dots, 1)^\top \in \mathbb{R}^d$. Then

$$\|w\|^{el}_k = \max\Big\{ \frac{k + 1.5}{\sqrt{k}},\ \sqrt{k + 2.25} \Big\} = \frac{k + 1.5}{\sqrt{k}}.$$

Taking $u = \big(1.5, \tfrac{k}{k-1}, \dots, \tfrac{k}{k-1}\big)^\top$, we have $\langle w, u \rangle = 2.25 + \tfrac{k^2}{k-1}$ and $\|u\|^{(2)}_{(k)} = \big(2.25 + \tfrac{k^2}{k-1}\big)^{1/2}$, and therefore

$$\|w\|^{ov}_k \ \ge\ \frac{\langle w, u \rangle}{\|u\|^{(2)}_{(k)}} = \Big(2.25 + \frac{k^2}{k-1}\Big)^{1/2} > \frac{k + 1.5}{\sqrt{k}} = \|w\|^{el}_k \qquad (\text{for } k \ge 4).$$

In fact, the two norms can differ by as much as a factor of $\sqrt{2}$. We now show that this is actually the most by which they can differ.

Proposition 3.1. $\|\cdot\|^{el}_k \le \|\cdot\|^{ov}_k < \sqrt{2}\, \|\cdot\|^{el}_k$.

Proof. We show that these bounds hold in the duals of the two norms. First, since $\|\cdot\|^{el}_k$ is a maximum of the scaled $\ell_1$ norm and the $\ell_2$ norm, its dual is given by the infimal convolution of their duals,

$$\|u\|^{el\,*}_k := \inf\big\{ \|a\|_2 + \sqrt{k}\, \|u - a\|_\infty : a \in \mathbb{R}^d \big\}.$$

Now take any $u \in \mathbb{R}^d$. First we show $\|u\|^{(2)}_{(k)} \le \|u\|^{el\,*}_k$. Without loss of generality, we take $u_1 \ge \dots \ge u_d \ge 0$. For any $a \in \mathbb{R}^d$,

$$\|u\|^{(2)}_{(k)} = \|u_{1:k}\|_2 \le \|a_{1:k}\|_2 + \|u_{1:k} - a_{1:k}\|_2 \le \|a\|_2 + \sqrt{k}\, \|u - a\|_\infty.$$

Finally, we show that $\|u\|^{el\,*}_k < \sqrt{2}\, \|u\|^{(2)}_{(k)}$ for $u \ne 0$. Taking $a = (u_1 - u_{k+1}, \dots, u_k - u_{k+1}, 0, \dots, 0)$, we have

$$\|u\|^{el\,*}_k \le \|a\|_2 + \sqrt{k}\, \|u - a\|_\infty = \Big( \sum_{i=1}^{k} (u_i - u_{k+1})^2 \Big)^{1/2} + \sqrt{k}\, u_{k+1}$$
$$\le \sqrt{2} \Big( \sum_{i=1}^{k} \big( (u_i - u_{k+1})^2 + u_{k+1}^2 \big) \Big)^{1/2} \le \sqrt{2} \Big( \sum_{i=1}^{k} u_i^2 \Big)^{1/2} = \sqrt{2}\, \|u\|^{(2)}_{(k)}.$$

Furthermore, this yields a strict inequality: the second inequality is strict unless $\big(\sum_{i=1}^{k} (u_i - u_{k+1})^2\big)^{1/2} = \sqrt{k}\, u_{k+1}$, and in that case (for $u \ne 0$) the last inequality is strict.

4 Optimization

Solving the optimization problem (2.2) efficiently can be done with a first-order proximal algorithm. Proximal methods (see [1, 4, 15, 18, 19] and references therein) are used to solve composite problems of the form $\min\{ f(x) + \omega(x) : x \in \mathbb{R}^d \}$, where the loss function $f(x)$ and the regularizer $\omega(x)$ are convex functions, and $f$ is smooth with an $L$-Lipschitz gradient. These methods require fast computation of the gradient $\nabla f$ and of the proximity operator [13]

$$\mathrm{prox}_\omega(x) := \arg\min\Big\{ \tfrac{1}{2}\|u - x\|_2^2 + \omega(u) : u \in \mathbb{R}^d \Big\}.$$

In particular, accelerated first-order methods, proposed by Nesterov [14, 15], require two levels of memory at each iteration and exhibit an optimal $O\big(\tfrac{1}{T^2}\big)$ convergence rate for the objective after $T$ iterations. To obtain a proximal method for k-overlap regularization, it suffices to compute the proximity map of $g = \tfrac{1}{2L} \big(\|\cdot\|^{ov}_k\big)^2$, for any $L > 0$. This can be done in $O(d(k + \log d))$ steps with Algorithm 1.

Proof. If $g$ is a symmetric gauge function then $\mathrm{prox}_g(v)$ has the same ordering and signs as $v$. This is an easy consequence of the inequality $\langle v, u \rangle \le \langle |v|^\downarrow, |u|^\downarrow \rangle$. Hence, without loss of generality, we may assume that $v_1 \ge \dots \ge v_d \ge 0$ and require that $q_1 \ge \dots \ge q_d \ge 0$. The latter follows from inequality (4.1) and the fact that $z$ is ordered.

Algorithm 1: Computation of the proximity operator.
Input: $v \in \mathbb{R}^d$. Output: $q = \mathrm{prox}_{\frac{1}{2L}(\|\cdot\|^{ov}_k)^2}(v)$.
Set $z \leftarrow |v|^\downarrow$. Find $r \in \{0, \dots, k-1\}$ and $l \in \{k, \dots, d\}$ such that

$$\frac{1}{L+1}\, z_{k-r-1} \ >\ \frac{T_{r,l}}{l - k + (L+1)r + L + 1} \ \ge\ \frac{1}{L+1}\, z_{k-r} \qquad (4.1)$$
$$z_l \ >\ \frac{T_{r,l}}{l - k + (L+1)r + L + 1} \ \ge\ z_{l+1} \qquad (4.2)$$

where $z_0 := +\infty$, $z_{d+1} := -\infty$ and $T_{r,l} := \sum_{i=k-r}^{l} z_i$. Then set

$$q_i \leftarrow \begin{cases} \frac{L}{L+1}\, z_i & \text{if } i = 1, \dots, k-r-1, \\[2pt] z_i - \frac{T_{r,l}}{l - k + (L+1)r + L + 1} & \text{if } i = k-r, \dots, l, \\[2pt] 0 & \text{if } i = l+1, \dots, d, \end{cases}$$

and finally reorder and change the signs of $q$ so that it conforms with $v$.

Algorithm 2: Accelerated k-overlap regularization.
$w_1 = \alpha_1 \in \mathbb{R}^d$, $\theta_1 \leftarrow 1$
for $t = 1, 2, \dots$ do
  $\theta_{t+1} \leftarrow \frac{1 + \sqrt{1 + 4\theta_t^2}}{2}$
  $w_{t+1} \leftarrow \mathrm{prox}_{\frac{\lambda}{2L}(\|\cdot\|^{ov}_k)^2}\big( \alpha_t - \tfrac{1}{L} X^\top (X\alpha_t - y) \big)$ using Algorithm 1
  $\alpha_{t+1} \leftarrow w_{t+1} + \frac{\theta_t - 1}{\theta_{t+1}} (w_{t+1} - w_t)$
end for

Note that $q = \mathrm{prox}_g(v)$ is equivalent to $Lz - Lq = Lv - Lq \in \partial\, \tfrac{1}{2}\big(\|\cdot\|^{ov}_k\big)^2(q)$. It suffices to show that, for $w = q$, $Lz - Lq$ is an optimal $\alpha$ in the proof of Proposition 2.1. Indeed, $A_r$ corresponds to

$$\sum_{i=k-r}^{d} q_i = \sum_{i=k-r}^{l} \Big( z_i - \frac{T_{r,l}}{l - k + (L+1)r + L + 1} \Big) = T_{r,l} - \frac{(l - k + r + 1)\, T_{r,l}}{l - k + (L+1)r + L + 1} = \frac{L (r+1)\, T_{r,l}}{l - k + (L+1)r + L + 1},$$

and (2.1) is equivalent to condition (4.1). For $i \le k-r-1$, we need $Lz_i - Lq_i = q_i$, which holds. For $k-r \le i \le l$, we need $Lz_i - Lq_i = \frac{A_r}{r+1}$, which also holds. For $i \ge l+1$, since $q_i = 0$, any $z_i$ such that $Lz_i - Lq_i \le \frac{A_r}{r+1}$ is allowed, which is true by (4.2).
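As an illustration, the proximity operator and the accelerated loop can be sketched in NumPy as follows. This is a simplified sketch rather than the paper's $O(d(k + \log d))$ routine: the pair $(r, l)$ is found by exhaustive search, comparisons are made non-strict to tolerate ties, and the function names are ours.

```python
import numpy as np

def prox_k_overlap_sq(v, k, L):
    """prox of g(w) = (1/(2L)) * (||w||^ov_k)^2 at v, by brute-force search
    over the pairs (r, l) appearing in Algorithm 1."""
    d = len(v)
    sign = np.sign(v)
    order = np.argsort(-np.abs(v))
    z = np.abs(v)[order]                               # z = |v| sorted in decreasing order
    cs = np.concatenate(([0.0], np.cumsum(z)))         # cs[j] = z_1 + ... + z_j (1-based)
    for r in range(k):                                 # r = 0, ..., k-1
        for l in range(k, d + 1):                      # l = k, ..., d (1-based)
            T = cs[l] - cs[k - r - 1]                  # T_{r,l} = z_{k-r} + ... + z_l
            val = T / (l - k + (L + 1) * r + L + 1)
            z_up = z[k - r - 2] if k - r - 2 >= 0 else np.inf   # z_{k-r-1}, z_0 := +inf
            z_lo = z[k - r - 1]                                  # z_{k-r}
            z_l, z_l1 = z[l - 1], (z[l] if l < d else -np.inf)   # z_l and z_{l+1}, z_{d+1} := -inf
            if z_up / (L + 1) >= val >= z_lo / (L + 1) and z_l >= val >= z_l1:  # (4.1)-(4.2)
                q_sorted = np.zeros(d)
                q_sorted[:k - r - 1] = (L / (L + 1)) * z[:k - r - 1]
                q_sorted[k - r - 1:l] = z[k - r - 1:l] - val
                q = np.zeros(d)
                q[order] = q_sorted                    # undo the sorting, then restore signs
                return sign * q
    raise ValueError("no valid (r, l) pair found")

def k_overlap_regression(X, y, k, lam, n_iter=500):
    """Accelerated proximal gradient (Algorithm 2 style) for
    min_w (1/2)*||Xw - y||^2 + (lam/2) * (||w||^ov_k)^2."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2                      # Lipschitz constant of the smooth part
    w = np.zeros(d)
    alpha = np.zeros(d)
    theta = 1.0
    for _ in range(n_iter):
        theta_new = (1 + np.sqrt(1 + 4 * theta ** 2)) / 2
        w_new = prox_k_overlap_sq(alpha - X.T @ (X @ alpha - y) / L, k, L / lam)
        alpha = w_new + ((theta - 1) / theta_new) * (w_new - w)
        w, theta = w_new, theta_new
    return w
```

A convenient sanity check: for $k = d$ the penalty is a ridge penalty and the prox reduces to the shrinkage $v \cdot L/(L+1)$, which the sketch reproduces.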

Combining with standard results on accelerated methods [1], we obtain the following rate.

Corollary 4.1. After $T$ iterations of Algorithm 2 with $L = \|X\|_2^2$, for every $w \in \mathbb{R}^d$ it holds that

$$\tfrac{1}{2}\|Xw_T - y\|_2^2 + \tfrac{\lambda}{2} \big(\|w_T\|^{ov}_k\big)^2 \ \le\ \tfrac{1}{2}\|Xw - y\|_2^2 + \tfrac{\lambda}{2} \big(\|w\|^{ov}_k\big)^2 + \frac{2 L \|w - w_1\|_2^2}{(T+1)^2}.$$

Table 1: Mean squared error for the simulated examples (median over 50 data sets); rows: Lasso, elastic net, k-overlap; columns: MSE at $\sigma = 15$ and at $\sigma = 1$.

5 Empirical Comparisons

5.1 Simulation

For the first simulation we follow [21, Sec. 5, example 4]. In this experimental protocol, the target (oracle) vector equals

$$w^* = (\underbrace{3, \dots, 3}_{15}, \underbrace{0, \dots, 0}_{25}).$$

The input data $X$ were generated from a normal distribution such that components $1, \dots, 5$ have the same random mean $Z_1 \sim N(0,1)$, components $6, \dots, 10$ have mean $Z_2 \sim N(0,1)$ and components $11, \dots, 15$ have mean $Z_3 \sim N(0,1)$. A total of 50 data sets were created in this way, each containing 50 training points, 50 validation points and 350 test points. The goal is to achieve good prediction performance on the test data.

We compared the k-overlap norm with the Lasso and the elastic net. We considered the range $k \in \{1, \dots, d\}$ for k-overlap norm regularization, $\lambda = 10^i$ with $i \in \{-15, \dots, 15\}$ for the regularization parameter of the Lasso and of k-overlap regularization, and the same range for the $\lambda_1, \lambda_2$ of the elastic net. For each method, the optimal set of parameters was selected based on mean squared error on the validation set. The error reported in Table 1 is the mean squared error with respect to the oracle $w^*$, namely $\mathrm{MSE} = (\hat{w} - w^*)^\top V (\hat{w} - w^*)$, where $V$ is the population covariance matrix of $X_{\mathrm{test}}$. We report results on both a high noise ($\sigma = 15$, as in [21]) and a low noise ($\sigma = 1$) experiment. In the presence of low noise, k-overlap regularization predicts significantly better than the elastic net, which is even inferior to the Lasso. However, in the presence of high noise, none of the three methods could predict well enough.

To further illustrate the effect of the k-overlap norm with the low noise data sets, in Figure 2 we show the coefficients learned by each method, in absolute value. For each image, one row corresponds to the $w$ learned for one of the 50 data sets.
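The data-generating protocol just described can be sketched as follows. The within-group noise scale, the distribution of the remaining 25 features, and $d = 40$ follow [21] and are assumptions here rather than details stated in the text above:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, d=40, sigma=15.0, group_noise=0.4):
    """One draw of n points in the spirit of Section 5.1 (example 4 of [21])."""
    w_star = np.concatenate([np.full(15, 3.0), np.zeros(d - 15)])
    Z = rng.normal(size=(n, 3))                      # shared random means Z1, Z2, Z3
    X = rng.normal(size=(n, d))                      # assumed N(0,1) for the 25 noise features
    X[:, :15] = np.repeat(Z, 5, axis=1) + group_noise * rng.normal(size=(n, 15))
    y = X @ w_star + sigma * rng.normal(size=n)
    return X, y, w_star

X_tr, y_tr, w_star = simulate(50)                    # 50 training points
X_val, y_val, _ = simulate(50)                       # 50 validation points
X_te, y_te, _ = simulate(350)                        # 350 test points

V = np.cov(X_te, rowvar=False)                       # empirical stand-in for the population covariance

def oracle_mse(w_hat):
    """MSE = (w_hat - w*)^T V (w_hat - w*), the error measure reported in Table 1."""
    delta = w_hat - w_star
    return float(delta @ V @ delta)
```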

Figure 2: Solutions learned by each method for all the simulation data sets ($\sigma = 1$). Left is k-overlap, right is Lasso and bottom is elastic net.

Whereas the elastic net can learn higher values at the relevant features, a better feature pattern with less variability emerges when using the Lasso and especially when using the k-overlap norm.

5.2 South African Heart Data

This is a classification task which has been used in [7]. There are 9 variables and 462 examples, and the response is presence/absence of coronary heart disease. We normalized the data so that each predictor variable has zero mean and unit variance. We then split the data 50 times randomly into training, validation, and test sets of sizes 400, 30, and 32, respectively. For each method, the parameters were selected using the validation data. In Table 2, we report the accuracy of each method on the test data. In Table 3, we report the mean squared errors. We observe that all three methods have identical performance.

5.3 20 Newsgroups

This is a binary classification version of 20 newsgroups created in [11], which can be found in the LIBSVM data repository (see footnote 3). The positive class consists of the 10 groups with names of the form sci.*, comp.*, or misc.forsale, and the negative class consists of the other 10 groups. To reduce the number of features, we removed the words which appear in fewer than 3 documents. We randomly split the data into a training, a validation and a test set of sizes 14000, 1000 and 4996, respectively. We report the accuracy on the test data in Table 2 and the mean squared error in Table 3. We found that k-overlap regularization gave improved prediction accuracy over both other methods (see footnote 4).

6 Summary

We introduced the k-overlap norm as the tightest convex relaxation of sparsity plus $\ell_2$ regularization, and showed that it is tighter than the elastic net by exactly a factor of $\sqrt{2}$. In our view, this sheds light on the elastic net as a close approximation to this tightest possible convex relaxation, and motivates using the k-overlap norm when a tighter relaxation is sought. This is also demonstrated in our empirical results.

Footnote 3: cjlin/libsvmtools/datasets/
Footnote 4: Regarding other sparse prediction methods, we did not manage to compare with OSCAR, due to memory limitations, or with PEN or the trace Lasso, which do not have code available online.
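In all three experiments, $(k, \lambda)$ (and $(\lambda_1, \lambda_2)$ for the elastic net) are chosen by evaluating each candidate on the validation split. A minimal sketch of that selection loop, reusing the hypothetical k_overlap_regression solver from the Section 4 sketch and the grid described in Section 5.1:

```python
import numpy as np
from itertools import product

def select_on_validation(X_tr, y_tr, X_val, y_val):
    """Pick (k, lambda) by validation mean squared error."""
    d = X_tr.shape[1]
    ks = range(1, d + 1)
    lambdas = [10.0 ** i for i in range(-15, 16)]
    best_err, best = np.inf, None
    for k, lam in product(ks, lambdas):
        w = k_overlap_regression(X_tr, y_tr, k, lam)
        err = np.mean((X_val @ w - y_val) ** 2)
        if err < best_err:
            best_err, best = err, (k, lam, w)
    return best
```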

Table 2: Accuracies (%) for the SA heart data (median over 50 replications) and for the 20 newsgroups data set; rows: Lasso, elastic net, k-overlap.

Table 3: Mean squared errors for the SA heart data (median over 50 replications) and for the 20 newsgroups data set; rows: Lasso, elastic net, k-overlap.

We note that the k-overlap norm has better prediction properties, but not necessarily better sparsity-inducing properties, as is evident from its more rounded unit ball. It is well understood that there is often a tradeoff between sparsity and good prediction, and that even if the population optimal predictor is sparse, a denser predictor often yields better predictive performance [3, 9, 21]. For example, in the presence of correlated features, it is often beneficial to include several highly correlated features rather than a single representative feature. This is exactly the behavior encouraged by $\ell_2$ norm regularization, and the elastic net is already known to yield less sparse (but more predictive) solutions. The k-overlap norm goes a step further in this direction, often yielding solutions that are even less sparse (but more predictive) compared to the elastic net.

Nevertheless, it is interesting to consider whether compressed sensing results, where $\ell_1$ regularization is of course central, can be refined by using the k-overlap norm, which might be able to handle more correlation structure within the set of features.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

[2] R. Bhatia. Matrix Analysis. Graduate Texts in Mathematics. Springer, 1997.
[3] H.D. Bondell and B.J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115-123, 2008.
[4] P.L. Combettes and V.R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168-1200, 2006.
[5] E. Grave, G. R. Obozinski, and F. Bach. Trace lasso: a trace norm regularization for correlated designs. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, 2011.
[6] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Cambridge University Press, 1934.
[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics, 2001.
[8] A.E. Hoerl and R.W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, pages 55-67, 1970.
[9] L. Jacob, G. Obozinski, and J.P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433-440. ACM, 2009.
[10] S.M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, volume 21, 2008.
[11] S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341-361, 2005.
[12] A. Lorbert, D. Eis, V. Kostina, D.M. Blei, and P.J. Ramadge. Exploiting covariate similarity in sparse regression via the pairwise elastic net. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
[13] J. J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273-299, 1965.
[14] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

[15] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper, 2007.
[16] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low-noise and fast rates. In Advances in Neural Information Processing Systems 23, 2010.
[17] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 58(1):267-288, 1996.
[18] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Preprint, 2008.
[19] P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263-295, 2010.
[20] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.
[21] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320, 2005.
