MIST: l0 Sparse Linear Regression with Momentum


MIST: l0 Sparse Linear Regression with Momentum

Goran Marjanovic, Magnus O. Ulfarsson, Alfred O. Hero III

arXiv [stat.ML], 9 Mar 2015

Goran Marjanovic is with the School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW 2052, Australia (goran.m.marjanovic@gmail.com). Alfred O. Hero III is with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA (hero@eecs.umich.edu). Magnus O. Ulfarsson is with the Department of Electrical Engineering, University of Iceland, Reykjavik 107, Iceland (mou@hi.is). This work was partly supported by the Research Fund of the University of Iceland and the Icelandic Research Fund, and partially supported by ARO grant W911NF.

Abstract: Significant attention has been given to minimizing a penalized least squares criterion for estimating sparse solutions to large linear systems of equations. The penalty is responsible for inducing sparsity, and the natural choice is the so-called l0 norm. In this paper we develop a Momentumized Iterative Shrinkage Thresholding (MIST) algorithm for minimizing the resulting non-convex criterion and prove its convergence to a local minimizer. Simulations on large data sets show superior performance of the proposed method over other methods.

Index Terms: sparsity, non-convex, l0 regularization, linear regression, iterative shrinkage thresholding, hard-thresholding, momentum

I. INTRODUCTION

In the current age of big data acquisition there has been an ever growing interest in sparse representations, which consist of representing, say, a noisy signal as a linear combination of very few components. This implies that the entire information in the signal can be approximately captured by a small number of components, which has huge benefits in the analysis, processing and storage of high dimensional signals. As a result, sparse linear regression has been widely studied, with many applications in signal and image processing, statistical inference and machine learning. Specific applications include compressed sensing, denoising, inpainting, deblurring, source separation, sparse image reconstruction, and signal classification.

The linear regression model is given by:

y = Ax + ε,   (1)

where y ∈ R^d is a vector of noisy data observations, x ∈ R^m is the sparse representation (vector) of interest, A ∈ R^{d×m} is the regression matrix and ε ∈ R^d is the noise. The estimation aim is to choose the simplest model, i.e., the sparsest x, that adequately explains the data y.

To estimate a sparse x, major attention has been given to minimizing a sparsity Penalized Least Squares (PLS) criterion [1]-[10]. The least squares term promotes goodness-of-fit of the estimator while the penalty shrinks its coefficients towards zero. Here we consider the non-convex l0 penalty since it is the natural sparsity promoting penalty and induces maximum sparsity. The resulting non-convex l0 PLS criterion is given by:

F(x) = (1/2)‖y − Ax‖² + λ‖x‖₀,   (2)

where λ > 0 is the tuning/regularization parameter and ‖x‖₀ is the l0 penalty representing the number of non-zeros in x.
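As a concrete illustration of (1) and (2), the following is a minimal NumPy sketch that generates data from the linear model and evaluates the l0 PLS criterion. The dimensions, sparsity level, noise standard deviation and λ are illustrative placeholders, not the settings used later in Section V; the helpers A, y, x_true and F are reused in the sketches that follow.

```python
import numpy as np

# Minimal sketch of the linear model (1) and the l0 PLS criterion (2).
# Dimensions, sparsity level, noise standard deviation and lambda are
# illustrative placeholders, not the settings used in Section V.
rng = np.random.default_rng(0)
d, m, n_spikes, sigma, lam = 200, 500, 10, 0.1, 0.5

A = rng.standard_normal((d, m))                            # regression matrix
x_true = np.zeros(m)
support = rng.choice(m, size=n_spikes, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=n_spikes)   # sparse vector of interest
y = A @ x_true + sigma * rng.standard_normal(d)            # model (1)

def F(x, A, y, lam):
    """l0 penalized least squares criterion (2)."""
    return 0.5 * np.sum((y - A @ x) ** 2) + lam * np.count_nonzero(x)

print(F(x_true, A, y, lam))
```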
A. Previous Work

Existing algorithms for directly minimizing (2) fall into the category of Iterative Shrinkage Thresholding (IST) and rely on Majorization-Minimization (MM) type procedures, see [1], [2]. These procedures exploit separability properties of the l0 PLS criterion, and thus rely on the minimizers of one dimensional versions of the PLS function: the so-called hard-thresholding operators.

Since the convex l1 PLS criterion has similar separability properties, some MM procedures developed for its minimization could, with modifications, be applied to minimize (2). Applicable MM procedures include first order methods and their accelerated versions [9], [11], [12]. However, when these are applied to the l0 penalized problem (2) there is no guarantee of convergence, and for [9] there is additionally no guarantee of algorithm stability. Analysis of convergence of MM algorithms for minimizing the l0 PLS criterion (2) is rendered difficult by the lack of convexity. As far as we are aware, algorithm convergence for this problem has only been shown for the Iterative Hard Thresholding (IHT) method [1], [2]. Specifically, a bounded sequence generated by IHT was shown to converge to the set of local minimizers of (2) when the singular values of A are strictly less than one. Convergence analysis of algorithms designed for minimizing the lq PLS criterion, q ∈ (0, 1], is not applicable to the case of the l0 penalized objective (2) because it relies on convex arguments when q = 1, and on continuity and/or differentiability of the criterion when q ∈ (0, 1).

B. Paper Contribution

In this paper we develop an MM algorithm with momentum acceleration, called Momentumized IST (MIST), for minimizing the l0 PLS criterion (2), and prove its convergence to a single local minimizer without imposing any assumptions on A. Simulations on large data sets are carried out, which show that the proposed algorithm outperforms existing methods for minimizing (2), including modified MM methods originally designed for the l1 PLS criterion.

The paper is organised as follows. Section II reviews some background on MM that will be used to develop the proposed convergent algorithm. The proposed algorithm is given in Section III, and Section IV contains the convergence analysis. Lastly, Sections V and VI present the simulations and the concluding remarks, respectively.

C. Notation

The i-th component of a vector v is denoted by v[i]. Given a vector f(v), where f(·) is a function, its i-th component is denoted by f(v)[i]. ‖M‖ is the spectral norm of a matrix M. I(·) is the indicator function, equaling 1 if its argument is true and zero otherwise. Given a vector v, ‖v‖₀ = Σ_i I(v[i] ≠ 0). sgn(·) is the sign function. {x_k}_{k≥0} denotes an infinite sequence, and {x_{k_n}}_{n≥0} an infinite subsequence, where k_n < k_{n+1} for all n ≥ 0.

II. PRELIMINARIES

Denoting the least squares term in (2) by:

f(x) = (1/2)‖y − Ax‖²,

the Lipschitz continuity of ∇f(·) implies:

f(z) ≤ f(x) + ∇f(x)^T(z − x) + (µ/2)‖z − x‖²  for all x, z, and µ ≥ ‖A‖².

For the proof see [9, Lemma 2.1]. As a result, the following approximation of the objective function F(·) in (2),

Q_µ(z, x) = f(x) + ∇f(x)^T(z − x) + (µ/2)‖z − x‖² + λ‖z‖₀,   (3)

is a majorizing function, i.e.,

F(z) ≤ Q_µ(z, x)  for any x, z, and µ ≥ ‖A‖².   (4)

Letting P_µ(x) be any point in the set argmin_z Q_µ(z, x), we have:

F(P_µ(x)) ≤ Q_µ(P_µ(x), x) ≤ Q_µ(x, x) = F(x),   (5)

where the first inequality follows from (4) and the second from the definition of P_µ(x). The proposed algorithm will be constructed based on the above MM framework with a momentum acceleration, described below. This momentum acceleration will be designed based on the following.

Theorem 1. Let B_µ = µI − A^T A, where µ > ‖A‖², and:

α = η (δ^T B_µ(P_µ(x) − x)) / (δ^T B_µ δ),  η ∈ [0, 2],   (6)

where δ ≠ 0. Then, F(P_µ(x + αδ)) ≤ F(x).

For the proof see the Appendix.

A. Evaluating the Operator P_µ(·)

Since (3) is non-convex there may exist multiple minimizers of Q_µ(z, ·), so that P_µ(·) may not be unique. We select a single element of the set of minimizers as described below. By simple algebraic manipulation of the quadratic quantity in (3), letting:

g(x) = x − (1/µ)∇f(x),   (7)

it is easy to show that:

Q_µ(z, x) = (µ/2)‖z − g(x)‖² + λ‖z‖₀ + f(x) − (1/(2µ))‖∇f(x)‖²,

and so P_µ(·) is given by:

P_µ(x) = argmin_z (1/2)‖z − g(x)‖² + (λ/µ)‖z‖₀.   (8)

For the proposed algorithm we fix P_µ(·) = H_{λ/µ}(g(·)), the point-to-point map defined in the following theorem.

Theorem 2. Let the hard-thresholding (point-to-point) map H_h(·), h > 0, be such that for each i = 1, ..., m:

H_h(g(v))[i] = 0               if |g(v)[i]| < √(2h),
H_h(g(v))[i] = g(v)[i] I(v[i] ≠ 0)  if |g(v)[i]| = √(2h),
H_h(g(v))[i] = g(v)[i]          if |g(v)[i]| > √(2h).   (9)

Then, H_{λ/µ}(g(v)) ∈ argmin_z Q_µ(z, v), where g(·) is from (7).

The proof is in the Appendix. Evidently Theorem 1 holds with P_µ(·) replaced by H_{λ/µ}(g(·)). The motivation for selecting this particular minimizer is Lemma 2 in Section IV.
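For reference, here is a minimal NumPy sketch of the hard-thresholding map (9) as written above: the threshold √(2h) is the point where keeping a scalar component (cost h) and zeroing it (cost g²/2) break even, and the tie is broken by the indicator I(v[i] ≠ 0). In MIST the map is applied with h = λ/µ; the function name is only illustrative.

```python
import numpy as np

def hard_threshold(g, v, h):
    """Componentwise hard-thresholding map H_h(g(v)) of (9): zero below the
    threshold sqrt(2h), identity above it, and the tie |g[i]| = sqrt(2h)
    broken by the indicator I(v[i] != 0).  In MIST, h = lambda/mu."""
    t = np.sqrt(2.0 * h)
    out = np.where(np.abs(g) > t, g, 0.0)
    tie = np.isclose(np.abs(g), t)
    out[tie] = g[tie] * (v[tie] != 0)
    return out
```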
III. THE ALGORITHM

The proposed algorithm is constructed by repeated application of Theorem 1, where δ is chosen to be the difference between the current and the previous iterate, i.e.,

x_{k+1} = H_{λ/µ}(w_k − (1/µ)∇f(w_k)),  w_k = x_k + α_k δ_k,   (10)

with α_k given by (6), where δ_k = x_k − x_{k−1}. The iteration (10) is an instance of a momentum accelerated IST algorithm, similar to the Fast IST Algorithm (FISTA) introduced in [9] for minimizing the convex l1 PLS criterion. In (10), δ_k is called the momentum term and α_k is a momentum step size parameter. A more explicit implementation of (10) is given below. Our proposed algorithm will be called Momentumized Iterative Shrinkage Thresholding (MIST).

Momentumized IST (MIST) Algorithm

Compute ȳ = (y^T A)^T off-line. Choose x_0 and let x_{−1} = x_0. Also, calculate ‖A‖² off-line, let µ > ‖A‖² and k = 0. Then:

(1) If k = 0, let α_k = 0. Otherwise, compute:
  (a) u_k = A x_k
  (b) v_k = (u_k^T A)^T
  (c) g_k = x_k − (1/µ)(v_k − ȳ)
  (d) p_k = H_{λ/µ}(g_k) − x_k
  (e) δ_k = x_k − x_{k−1} and
      γ_k = µ δ_k − v_k + v_{k−1}   (11)
  (f) Choose η_k ∈ (0, 2) and compute:
      α_k = η_k (γ_k^T p_k) / (γ_k^T δ_k)   (12)
(2) Using (c), (e) and (f) compute:
      x_{k+1} = H_{λ/µ}(g_k + (α_k/µ) γ_k)   (13)
    Let k = k + 1 and go to (1).
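A sketch of one possible implementation of the listing above, reusing the hard_threshold map from the previous sketch. The choices of η_k, µ and the fixed iteration count are placeholders; any µ > ‖A‖² and η_k ∈ (0, 2) are admissible.

```python
import numpy as np

def mist(A, y, lam, n_iter=500, eta=0.99, mu=None):
    """Sketch of the MIST iteration (steps (a)-(f) and (13)).
    eta, mu and the fixed iteration count are placeholder choices;
    any mu > ||A||^2 and eta in (0, 2) are admissible."""
    d, m = A.shape
    ybar = A.T @ y                                  # computed off-line
    if mu is None:
        mu = 1.01 * np.linalg.norm(A, 2) ** 2       # mu > ||A||^2 (squared spectral norm)
    h = lam / mu
    x_prev = x = np.zeros(m)
    v_prev = np.zeros(m)                            # A^T A x_{-1}, with x_{-1} = x_0 = 0
    for k in range(n_iter):
        u = A @ x                                   # (a)
        v = A.T @ u                                 # (b)
        g = x - (v - ybar) / mu                     # (c)
        p = hard_threshold(g, x, h) - x             # (d)
        delta = x - x_prev                          # (e)
        gamma = mu * delta - v + v_prev             # (11)
        denom = gamma @ delta
        alpha = eta * (gamma @ p) / denom if denom > 0 else 0.0   # (12); alpha_0 = 0
        x_prev, v_prev = x, v
        x = hard_threshold(g + (alpha / mu) * gamma, x_prev + alpha * delta, h)   # (13)
    return x
```

With the data from the first sketch, x_hat = mist(A, y, lam) returns the sparse estimate.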

Remark 1. Thresholding using (9) is simple, and can always be done off-line. Secondly, note that MIST requires computing only O(2md) products per iteration, which is the same order required when the momentum term δ_k is not incorporated, i.e., when η_k = 0 for all k. In this case, MIST is a generalization of IHT from [1], [2]. Other momentum methods such as FISTA [9] and its monotone version M-FISTA [11] also require computing O(2md) and O(3md) products, respectively.

IV. CONVERGENCE ANALYSIS

Here we prove that the iterates of the proposed MIST algorithm converge to a local minimizer of F(·).

Theorem 3. Suppose {x_k}_{k≥0} is a bounded sequence generated by the MIST algorithm. Then x_k → x* as k → ∞, where x* is a local minimizer of (2).

The proof is in the Appendix and requires several lemmas that are also proved in the Appendix. In Lemmas 1 and 2 it is assumed that MIST reaches a fixed point only in the limit, i.e., x_{k+1} ≠ x_k for all k. This implies that δ_k ≠ 0 for all k.

Lemma 1. ‖x_{k+1} − x_k‖ → 0 as k → ∞.

The following lemma motivates Theorem 2 and is crucial for the subsequent convergence analysis.

Lemma 2. Assume the result in Lemma 1. If, for any subsequence {x_{k_n}}_{n≥0}, x_{k_n} → x* as n → ∞, then:

H_{λ/µ}(w_{k_n} − (1/µ)∇f(w_{k_n})) → H_{λ/µ}(x* − (1/µ)∇f(x*)),   (14)

where w_{k_n} = x_{k_n} + α_{k_n} δ_{k_n}.

The following lemma characterizes the fixed points of the MIST algorithm.

Lemma 3. Suppose x* is a fixed point of MIST. Letting Z = {i : x*[i] = 0} and Z^c = {i : x*[i] ≠ 0}:

(C1) If i ∈ Z, then |∇f(x*)[i]| ≤ √(2λµ).
(C2) If i ∈ Z^c, then ∇f(x*)[i] = 0.
(C3) If i ∈ Z^c, then |x*[i]| ≥ √(2λ/µ).

Lemma 4. Suppose x* is a fixed point of MIST. Then there exists ε > 0 such that F(x*) < F(x* + d) for any d satisfying ‖d‖ ∈ (0, ε). In other words, x* is a strict local minimizer of (2).

Lemma 5. The limit points of {x_k}_{k≥0} are fixed points of MIST.

All of the above lemmas are proved in the Appendix.

V. SIMULATIONS

Here we demonstrate the performance advantages of the proposed MIST algorithm in terms of convergence speed. The methods used for comparison are the well known MM algorithms ISTA and FISTA from [9], as well as M-FISTA from [11], where the soft-thresholding map is replaced by the hard-thresholding map. In this case, ISTA becomes identical to the IHT algorithm from [1], [2], while FISTA and M-FISTA become its accelerated versions, which exploit the ideas in [13].

A popular compressed sensing scenario is considered, with the aim of reconstructing a length m sparse signal x from d observations, where d < m. The matrix A ∈ R^{d×m} is obtained by filling it with independent samples from the standard Gaussian distribution. A relatively high dimensional example is considered, where d = 2^13 = 8192 and m = 2^14 = 16384, and x contains 150 randomly placed ±1 spikes (0.9% nonzeros). The observation y is generated according to (1) with the standard deviation of the noise ε given by σ = 3, 6, 10. The Signal to Noise Ratio (SNR) is defined by:

SNR = 10 log10( ‖Ax‖² / (σ² d) ).

Figures 1, 2 and 3 show plots of Ax and of the observation noise ε for the three SNR values corresponding to the three considered values of σ.

A. Selection of the Tuning Parameter λ

Oracle results are reported, where the chosen tuning parameter λ in (2) is the minimizer of the Mean Squared Error (MSE), defined by:

MSE(λ) = ‖x̂ − x‖² / ‖x‖²,

where x̂ = x̂(λ) is the estimator of x produced by a particular algorithm. As x is generally unknown to the experimenter, we also report results of using a model selection method to select λ.
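As a sketch of the oracle tuning just described, the following loop picks λ by minimizing MSE(λ) over a grid; it reuses the mist, A, y and x_true placeholders from the earlier sketches, and the grid itself is an arbitrary choice.

```python
import numpy as np

# Oracle tuning: pick lambda by minimizing MSE(lambda) as defined above.
# Reuses the mist(), A, y and x_true placeholders from the earlier sketches;
# the lambda grid is an arbitrary choice.
lam_grid = np.geomspace(1e-3, 1e1, 20)
mse = [np.sum((mist(A, y, lam) - x_true) ** 2) / np.sum(x_true ** 2)
       for lam in lam_grid]
lam_oracle = lam_grid[int(np.argmin(mse))]
```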
Some of the classical model selection methods include the Bayesian Information Criterion (BIC) [14], the Akaike Information Criterion [15], and (generalized) cross validation [16], [17]. However, these methods tend to select a model with many spurious components when m is large and d is comparatively smaller, see [18]-[21]. As a result, we use the Extended BIC (EBIC) model selection method proposed in [18], which incurs a small loss in the positive selection rate while tightly controlling the false discovery rate, a desirable property in many applications. The EBIC is defined by:

EBIC(λ) = log( ‖y − Ax̂‖² / d ) + ( (log d)/d + 2γ (log m)/d ) ‖x̂‖₀,

and the chosen λ in (2) is the minimizer of this criterion. Note that the classical BIC criterion is obtained by setting γ = 0. As suggested in [18], we let γ = 1 − 1/(2κ), where κ is the solution of m = d^κ, i.e., κ = log m / log d ≈ 1.08.
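A small sketch of the EBIC computation as written above; the function names are illustrative, and in practice the criterion is evaluated on the estimate x̂(λ) returned by each algorithm for every candidate λ.

```python
import numpy as np

def ebic(y, A, x_hat, gamma):
    """EBIC criterion as written above; gamma = 0 recovers the classical BIC."""
    d, m = A.shape
    rss = np.sum((y - A @ x_hat) ** 2)
    df = np.count_nonzero(x_hat)
    return np.log(rss / d) + (np.log(d) / d + 2.0 * gamma * np.log(m) / d) * df

def ebic_gamma(d, m):
    """gamma = 1 - 1/(2*kappa), where kappa solves m = d**kappa, per [18]."""
    kappa = np.log(m) / np.log(d)
    return 1.0 - 1.0 / (2.0 * kappa)
```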

[Fig. 1: The first 50 entries of the (noiseless) observation Ax and of the observation noise ε for SNR = 12.]
[Fig. 2: As in Fig. 1, but for SNR = 6.]
[Fig. 3: As in Fig. 1, but for SNR = 1.7.]

B. Results

All algorithms are initialized with x_0 = 0 and are terminated when the following criterion is satisfied:

( F(x_{k−1}) − F(x_k) ) / F(x_{k−1}) < 10^{-10}.   (15)

In the MIST algorithm we let µ = 2‖A‖² and η_k = 1 − 10^{-5}. All experiments were run in MATLAB 8 on an Intel Core i7 processor with a 3.0 GHz CPU and 8 GB of RAM.

Figures 4, 5 and 6 show the percentage reduction of F(·) as a function of time and of iteration number for each algorithm. To make the comparisons fair, i.e., to make sure all the algorithms minimize the same objective function, a common λ is used, chosen to be the smallest λ from the averaged argmin_λ EBIC(λ) obtained by each algorithm (over 10 instances).

[Fig. 4: Algorithm comparisons based on the relative error (F(x_k) − F*)/F*, where F* is the final value of F(·) obtained by each algorithm at its termination, i.e., F* = F(x_k) with x_k satisfying the termination criterion (15). Here SNR = 12 and the regularization parameter λ has been selected using the EBIC criterion. In this low noise environment (σ = 3) the MIST algorithm outperforms the rest, both in terms of time and iteration number.]
[Fig. 5: Similar comparisons as in Fig. 4 except that SNR = 6. In this intermediate noise environment (σ = 6) the MIST algorithm outperforms the others, both in terms of time and iteration number.]
[Fig. 6: Similar comparisons as in Fig. 4 except that SNR = 1.7. In this high noise environment (σ = 10) the MIST algorithm outperforms the rest, both in terms of time and iteration number.]

Figures 7, 8 and 9 also show the percentage reduction of F(·) as a function of time and of iteration number for each algorithm. This time the MSE is used for the tuning of λ, and again, to make sure all the algorithms minimize the same objective function, a common λ is used, chosen to be the smallest λ from the averaged argmin_λ MSE(λ) obtained by each algorithm (over 10 instances).

Based on a large number of experiments we noticed that MIST, FISTA and ISTA outperformed M-FISTA in terms of run time. This could be due to the fact that M-FISTA requires computing a larger number of products, see Remark 1, and to the fact that it is a monotone version of a severely non-monotone FISTA. The high non-monotonicity could possibly be due to the non-convexity of the objective function F(·).

Lastly, Figures 10, 11 and 12 show the average speed of each algorithm as a function of λ.
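For concreteness, a sketch of the relative-decrease stopping rule (15) and of the relative error (F(x_k) − F*)/F* reported in Figures 4-9; the helper names are illustrative, and F* is taken as the final objective value at termination.

```python
import numpy as np

def has_converged(F_prev, F_curr, tol=1e-10):
    """Relative-decrease stopping rule (15)."""
    return (F_prev - F_curr) / F_prev < tol

def relative_error(F_history):
    """Relative error (F(x_k) - F*)/F*, with F* the final objective value
    at termination, as plotted in Figures 4-9."""
    F_history = np.asarray(F_history, dtype=float)
    F_star = F_history[-1]
    return (F_history - F_star) / F_star
```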

[Fig. 7: Algorithm comparisons based on the relative error (F(x_k) − F*)/F*, where F* is the final value of F(·) obtained by each algorithm at its termination, i.e., F* = F(x_k) with x_k satisfying the termination criterion (15). Here an oracle selects the regularization parameter λ using the minimum MSE criterion. In this low noise environment (σ = 3) the MIST algorithm outperforms the rest, both in terms of time and iteration number.]
[Fig. 8: Similar comparisons as in Fig. 7 except that SNR = 6. In this intermediate noise environment (σ = 6) the MIST algorithm outperforms the rest, both in terms of time and iteration number.]
[Fig. 9: Similar comparisons as in Fig. 7 except that SNR = 1.7. In this high noise environment (σ = 10) the MIST algorithm outperforms the rest, both in terms of time and iteration number.]
[Fig. 10: Average time vs. λ over 5 instances when SNR = 12. For the comparison, equally spaced values of λ are considered, with 10^{-4}‖A^T y‖ ≤ λ ≤ 0.1‖A^T y‖. Except in the scenario λ = 10^{-4}‖A^T y‖, the MIST algorithm outperforms the others.]
[Fig. 11: Similar comparisons as in Fig. 10 except that SNR = 6. Except in the scenario λ = 10^{-4}‖A^T y‖, the MIST algorithm outperforms the others.]
[Fig. 12: Similar comparisons as in Fig. 10 except that SNR = 1.7. Except in the scenario λ = 10^{-4}‖A^T y‖, the MIST algorithm outperforms the others.]

VI. CONCLUSION

We have developed a momentum accelerated MM algorithm, MIST, for minimizing the l0 penalized least squares criterion for linear regression problems. We have proved that MIST converges to a local minimizer without imposing any assumptions on the regression matrix A. Simulations on large data sets were carried out for different SNR values, and have shown that the MIST algorithm outperforms other popular MM algorithms in terms of time and iteration number.
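Before turning to the proofs, a quick numerical sanity check of the monotone descent guaranteed by Theorem 1 (and exploited in Lemma 1), namely that F(x_{k+1}) ≤ F(x_k) along the iteration (10)/(13). The sketch reuses the F, hard_threshold, A, y and m placeholders from the earlier sketches; µ, η and λ are arbitrary admissible choices.

```python
import numpy as np

# Numerical sanity check of the descent property of Theorem 1, i.e. that
# F(x_{k+1}) <= F(x_k) along the iteration (10)/(13).  Reuses F(),
# hard_threshold(), A, y and m from the earlier sketches; mu, eta and lam
# are arbitrary admissible placeholder choices.
mu = 1.01 * np.linalg.norm(A, 2) ** 2       # mu > ||A||^2
eta, lam = 0.99, 0.5
h, ybar = lam / mu, A.T @ y

x_prev = x = np.zeros(m)
F_hist = [F(x, A, y, lam)]
for k in range(200):
    v = A.T @ (A @ x)
    g = x - (v - ybar) / mu
    delta = x - x_prev
    gamma = mu * delta - A.T @ (A @ delta)  # equals B_mu @ delta
    denom = gamma @ delta
    p = hard_threshold(g, x, h) - x
    alpha = eta * (gamma @ p) / denom if denom > 0 else 0.0
    x_prev, x = x, hard_threshold(g + (alpha / mu) * gamma, x + alpha * delta, h)
    F_hist.append(F(x, A, y, lam))

assert all(f2 <= f1 + 1e-9 for f1, f2 in zip(F_hist, F_hist[1:]))
```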

VII. APPENDIX

Proof of Theorem 1: Let w = x + βδ. The quantities f(w), ∇f(w)^T(z − w) and ‖z − w‖² are quadratic functions, and by simple linear algebra they can easily be expanded in terms of z, x and δ. Namely,

f(w) = (1/2)‖Aw − y‖² = (1/2)‖Ax − y + βAδ‖²
     = f(x) + β(Ax − y)^T Aδ + (β²/2)‖Aδ‖²
     = f(x) + β∇f(x)^T δ + (β²/2)‖Aδ‖².   (16)

Also,

∇f(w)^T(z − w) = (∇f(x) + βA^T Aδ)^T (z − w)
  = ∇f(x)^T(z − w) + βδ^T A^T A(z − w)
  = ∇f(x)^T(z − x − βδ) + βδ^T A^T A(z − x − βδ)
  = ∇f(x)^T(z − x) − β∇f(x)^T δ + βδ^T A^T A(z − x) − β²‖Aδ‖²,   (17)

and finally:

(1/2)‖z − w‖² = (1/2)‖z − x − βδ‖² = (1/2)‖z − x‖² − βδ^T(z − x) + (β²/2)‖δ‖².   (18)

Using the above expansions and the definition of Q_µ(·, ·), we have that:

Q_µ(z, w) = Q_µ(z, x) + Φ_µ(z, δ, β),   (19)

where:

Φ_µ(z, δ, β) = (β²/2) δ^T B_µ δ − β δ^T B_µ(z − x).

Observing that δ^T B_µ δ > 0, let:

β = η (δ^T B_µ(z − x)) / (δ^T B_µ δ),  η ∈ [0, 2].   (20)

Then, one has:

Q_µ(P_µ(w), w) = min_z Q_µ(z, w)
  ≤ Q_µ(z, w)
  = Q_µ(z, x) + Φ_µ(z, δ, β)   [by (19)]
  = Q_µ(z, x) − η(2 − η) [δ^T B_µ(z − x)]² / (2 δ^T B_µ δ)   [by (20)]   (21)
  ≤ Q_µ(z, x),   (22)

which holds for any z. So, letting z = P_µ(x) implies:

F(P_µ(w)) ≤ Q_µ(P_µ(w), w) ≤ Q_µ(P_µ(x), x) ≤ F(x),

where the first and last inequalities follow from (5) and the middle one from (22), which completes the proof.

Proof of Theorem 2: Looking at (8) it is obvious that:

P_µ(v)[i] = argmin_{z[i]} (1/2)(z[i] − g(v)[i])² + (λ/µ) I(z[i] ≠ 0).

If |g(v)[i]| ≠ √(2λ/µ), by [2] P_µ(v)[i] is unique and given by H_{λ/µ}(g(v))[i]. If |g(v)[i]| = √(2λ/µ), again by [2] we now have two minimizers: P_µ(v)[i] = 0 and P_µ(v)[i] = sgn(g(v)[i]) √(2λ/µ). Hence, H_{λ/µ}(g(v))[i] ∈ P_µ(v)[i], completing the proof.

Proof of Lemma 1: From Theorem 1, 0 ≤ F(x_{k+1}) ≤ F(x_k), so the sequence {F(x_k)}_{k≥0} is bounded and non-increasing, which means it has a finite limit, say F∞. As a result:

F(x_k) − F(x_{k+1}) → F∞ − F∞ = 0.   (23)

Next, recall that w_k = x_k + α_k δ_k and g(·) = (·) − (1/µ)∇f(·). So, using (21) in the proof of Theorem 1 (where z = P_µ(x)) with w, P_µ(w), P_µ(·), x and δ respectively replaced by w_k, x_{k+1}, H_{λ/µ}(g(·)), x_k and δ_k, we have:

Q_µ(x_{k+1}, w_k) ≤ Q_µ(H_{λ/µ}(g(x_k)), x_k) − η_k(2 − η_k) [δ_k^T B_µ(H_{λ/µ}(g(x_k)) − x_k)]² / (2 δ_k^T B_µ δ_k)
  ≤ F(x_k) − σ_k α_k² δ_k^T B_µ δ_k,   (24)

where σ_k = (2 − η_k)/(2η_k) > 0. The first term in (24) follows from the fact that:

Q_µ(H_{λ/µ}(g(x_k)), x_k) = Q_µ(P_µ(x_k), x_k) ≤ F(x_k).

Now, noting that:

Q_µ(x_{k+1}, w_k) = F(x_{k+1}) + (1/2)(x_{k+1} − w_k)^T B_µ(x_{k+1} − w_k),   (25)

which easily follows from basic linear algebra, (24) and (25) together imply that:

F(x_k) − F(x_{k+1}) ≥ σ_k α_k² δ_k^T B_µ δ_k + (1/2)(x_{k+1} − w_k)^T B_µ(x_{k+1} − w_k)
  ≥ ρ σ_k α_k² ‖δ_k‖² + (ρ/2)‖x_{k+1} − w_k‖²,   (26)

where ρ > 0 is the smallest eigenvalue of B_µ ≻ 0. So, both terms on the right hand side of (26) are ≥ 0 for all k. As a result, due to (23) we can use the pinching/squeeze argument on (26) to establish that ‖x_{k+1} − w_k‖ = ‖δ_{k+1} − α_k δ_k‖ → 0 and α_k‖δ_k‖ → 0 as k → ∞. Consequently, ‖δ_{k+1}‖ → 0 as k → ∞, which completes the proof.

Proof of Lemma 2: Firstly, from Lemma 1 we have that x_{k_n} − x_{k_n−1} → 0 as n → ∞. In the last paragraph of the proof of that lemma we also have that α_k‖δ_k‖ → 0, and so α_{k_n}‖δ_{k_n}‖ → 0 as n → ∞. As a result, by the definition of w_{k_n}, w_{k_n} → w* = x*. Then, by the continuity of ∇f(·) we have that:

w_{k_n} − (1/µ)∇f(w_{k_n}) → x* − (1/µ)∇f(x*).

Now, writing u_{k_n} = w_{k_n} − (1/µ)∇f(w_{k_n}) and u* = x* − (1/µ)∇f(x*), we need to show that, if u_{k_n} → u* as n → ∞, then:

H_{λ/µ}(u_{k_n}) → H_{λ/µ}(u*).   (27)

Consider an arbitrary component of u_{k_n}, say u_{k_n}[i], in which case we must have u_{k_n}[i] → u*[i]. Without loss of generality, assume u*[i] > 0. Then, by the definition of H_{λ/µ}(·), there are two scenarios to consider:

(a) u*[i] ≠ √(2λ/µ);  (b) u*[i] = √(2λ/µ).

Regarding (a): For a large enough n = N, we must either have u_{k_n}[i] < √(2λ/µ) or u_{k_n}[i] > √(2λ/µ) for all n > N, which implies:

H_{λ/µ}(u_{k_n})[i] = 0 if u_{k_n}[i] < √(2λ/µ),  and  H_{λ/µ}(u_{k_n})[i] = u_{k_n}[i] if u_{k_n}[i] > √(2λ/µ),   (28)

for all n > N. In (28), H_{λ/µ}(·) is a continuous function of u_{k_n}[i] in both cases, which immediately implies (27).

Regarding (b): In general, u*[i] could be reached in an oscillating fashion, i.e., for some n we could have u_{k_n}[i] < √(2λ/µ) and for others u_{k_n}[i] > √(2λ/µ). If this were the case for all n, then H_{λ/µ}(u_{k_n})[i] would approach a limit set of two points, {0, √(2λ/µ)}. However, having x_{k_n} → x* and x_{k+1} − x_k → 0 implies x_{k_n+1} → x*. So, using the fact that:

x_{k_n+1}[i] = H_{λ/µ}(u_{k_n})[i],   (29)

H_{λ/µ}(u_{k_n})[i] must approach either 0 or √(2λ/µ). In other words, there has to exist a large enough n = N such that u*[i] is approached either only from the left or only from the right for all n > N, i.e.:

(b1) If u_{k_n}[i] < u*[i] = √(2λ/µ) for all n > N, from (28) we have H_{λ/µ}(u_{k_n})[i] → 0. So, noting that:

H_{λ/µ}(u_{k_n})[i] − x_{k_n}[i] = x_{k_n+1}[i] − x_{k_n}[i] → 0,   (30)

which uses (29), implies x_{k_n}[i] → 0. As a result, x*[i] = 0, and using the definition of H_{λ/µ}(·), we have:

H_{λ/µ}(u*)[i] = √(2λ/µ) I(w*[i] ≠ 0) = √(2λ/µ) I(x*[i] ≠ 0) = √(2λ/µ) · 0 = 0.

Hence, (27) is satisfied.

(b2) If u_{k_n}[i] > u*[i] = √(2λ/µ) for all n > N, from (28) we have H_{λ/µ}(u_{k_n})[i] → √(2λ/µ). So, (30) implies x_{k_n}[i] → √(2λ/µ), and using the definition of H_{λ/µ}(·), we have:

H_{λ/µ}(u*)[i] = √(2λ/µ) I(w*[i] ≠ 0) = √(2λ/µ) I(x*[i] ≠ 0) = √(2λ/µ) · 1 = √(2λ/µ).

Hence, (27) is again satisfied. Since i is arbitrary, the proof is complete.

Proof of Lemma 3: The fixed points are obviously obtained by setting x_{k+1} = x_k = x_{k−1} = x*. So, any fixed point x* satisfies the equation:

x* = H_{λ/µ}(x* − (1/µ)∇f(x*)).   (31)

The result is established by using the definition of H_{λ/µ}(g(·)) in Theorem 2. Namely, if i ∈ Z we easily obtain:

(1/µ)|∇f(x*)[i]| ≤ √(2λ/µ),

which reduces to C1. If i ∈ Z^c, we easily obtain:

x*[i] = x*[i] − (1/µ)∇f(x*)[i],   (32)

which reduces to C2. However, since:

|x*[i] − (1/µ)∇f(x*)[i]| ≥ √(2λ/µ),   (33)

(32) and (33) together imply |x*[i]| ≥ √(2λ/µ), giving C3.

Proof of Lemma 4: Letting Z = {i : x*[i] = 0} and Z^c = {i : x*[i] ≠ 0}, it can easily be shown that F(x* + d) = F(x*) + φ(d), where:

φ(d) = (1/2)‖Ad‖² + d^T ∇f(x*) + λ‖x* + d‖₀ − λ‖x*‖₀
  ≥ Σ_{i∈Z} [ d[i]∇f(x*)[i] + λ I(d[i] ≠ 0) ] + Σ_{i∈Z^c} [ d[i]∇f(x*)[i] + λ I(x*[i] + d[i] ≠ 0) − λ ]
  =: Σ_{i∈Z} φ_Z(d[i]) + Σ_{i∈Z^c} φ_{Z^c}(d[i]).

Now, φ_Z(0) = 0, so suppose |d[i]| ∈ (0, λ/√(2λµ)), i ∈ Z. Then:

φ_Z(d[i]) ≥ −|d[i]| |∇f(x*)[i]| + λ ≥ −|d[i]| √(2λµ) + λ > 0.

The second inequality above is due to C1 in Lemma 3. Lastly, note that φ_{Z^c}(0) = 0, and suppose i ∈ Z^c. From C2 in Lemma 3 we have ∇f(x*)[i] = 0. Thus, supposing |d[i]| ∈ (0, √(2λ/µ)), from C3 we have |d[i]| < |x*[i]| for all i ∈ Z^c. Thus:

I(x*[i] + d[i] ≠ 0) = I(x*[i] ≠ 0) = 1,

and so φ_{Z^c}(d[i]) = 0. Since λ/√(2λµ) < √(2λ/µ), the proof is complete after letting ε = λ/√(2λµ).

Proof of Lemma 5: Since it is assumed that {x_k}_{k≥0} is bounded, the sequence {(x_k, x_{k+1})}_{k≥0} is also bounded, and thus has at least one limit point. Denoting one of these by (x*, x**), there exists a subsequence {(x_{k_n}, x_{k_n+1})}_{n≥0} such that (x_{k_n}, x_{k_n+1}) → (x*, x**) as n → ∞. However, by Lemma 1 we must have ‖x_{k_n} − x_{k_n+1}‖ → 0, which implies x** = x*. Consequently:

x_{k_n+1} = H_{λ/µ}(w_{k_n} − (1/µ)∇f(w_{k_n})) → x*,   (34)

recalling that w_{k_n} = x_{k_n} + α_{k_n} δ_{k_n} and δ_{k_n} = x_{k_n} − x_{k_n−1}. Furthermore, the convergence:

H_{λ/µ}(w_{k_n} − (1/µ)∇f(w_{k_n})) → H_{λ/µ}(x* − (1/µ)∇f(x*))   (35)

follows from Lemma 2. Equating the limits in (34) and (35) assures that x* satisfies the fixed point equation (31), making it a fixed point of the algorithm. This completes the proof.

Proof of Theorem 3: By Lemma 1 and Ostrowski's result [23, Theorem 26.1], the bounded sequence {x_k}_{k≥0} converges to a closed and connected set, i.e., the set of its limit points is closed and connected. But, by Lemma 5 these limit points are fixed points of MIST, which by Lemma 4 are strict local minimizers.
So, since the local minimizers form a discrete set, the connected set of limit points can only contain one point, and hence the entire sequence {x_k}_{k≥0} must converge to a single local minimizer. This completes the proof.

REFERENCES

[1] T. Blumensath, M. Yaghoobi, and M. E. Davies, "Iterative hard thresholding and l0 regularisation," IEEE ICASSP, vol. 3, 2007.
[2] T. Blumensath and M. Davies, "Iterative thresholding for sparse approximations," J. Fourier Anal. Appl., vol. 14, no. 5, 2008.
[3] R. Mazumder, J. Friedman, and T. Hastie, "SparseNet: Coordinate descent with non-convex penalties," J. Am. Stat. Assoc., vol. 106, no. 495, 2011.

[4] K. Bredies, D. A. Lorenz, and S. Reiterer, "Minimization of non-smooth, non-convex functionals by iterative thresholding," Tech. Rep., 2009.
[5] P. Tseng, "Convergence of a block coordinate descent method for nondifferentiable minimization," J. Optimiz. Theory App., vol. 109, no. 3, 2001.
[6] M. Nikolova, "Description of the minimizers of least squares regularized with l0 norm. Uniqueness of the global minimizer," SIAM J. Imaging Sci., vol. 6, 2013.
[7] G. Marjanovic and V. Solo, "On exact lq denoising," IEEE ICASSP, 2013.
[8] G. Marjanovic and V. Solo, "lq sparsity penalized linear regression with cyclic descent," IEEE Trans. Signal Process., vol. 62, no. 6, 2014.
[9] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183-202, 2009.
[10] S. J. Wright, R. D. Nowak, and M. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Trans. Signal Process., vol. 57, no. 7, 2009.
[11] A. Beck and M. Teboulle, "Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems," IEEE Trans. Image Process., vol. 18, no. 11, 2009.
[12] M. Figueiredo, J. M. Bioucas-Dias, and R. D. Nowak, "Majorization-minimization algorithms for wavelet-based image restoration," IEEE Trans. Image Process., vol. 16, no. 12, 2007.
[13] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," Soviet Math. Doklady, vol. 27, 1983.
[14] U. J. Schwarz, "Mathematical-statistical description of the iterative beam removing technique (method CLEAN)," Astron. Astrophys., vol. 65, 1978.
[15] H. Akaike, "Information theory and an extension of the maximum likelihood principle," 2nd International Symposium on Information Theory, 1973.
[16] M. Stone, "Cross-validatory choice and assessment of statistical predictions (with Discussion)," J. R. Statist. Soc. B, vol. 36, 1974.
[17] P. Craven and G. Wahba, "Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation," Numer. Math., vol. 31, 1979.
[18] J. Chen and Z. Chen, "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, vol. 95, no. 3, 2008.
[19] K. W. Broman and T. P. Speed, "A model selection approach for the identification of quantitative trait loci in experimental crosses," J. R. Statist. Soc. B, vol. 64, 2002.
[20] D. Siegmund, "Model selection in irregular problems: Application to mapping quantitative trait loci," Biometrika, vol. 91, 2004.
[21] M. Bogdan, R. Doerge, and J. K. Ghosh, "Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci," Genetics, vol. 167, 2004.
[22] G. Marjanovic and A. O. Hero, "On lq estimation of sparse inverse covariance," IEEE ICASSP, 2014.
[23] A. M. Ostrowski, Solutions of Equations in Euclidean and Banach Spaces. New York: Academic Press, 1973.


ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Approximate Message Passing Algorithms

Approximate Message Passing Algorithms November 4, 2017 Outline AMP (Donoho et al., 2009, 2010a) Motivations Derivations from a message-passing perspective Limitations Extensions Generalized Approximate Message Passing (GAMP) (Rangan, 2011)

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

arxiv: v1 [stat.ml] 1 Mar 2015

arxiv: v1 [stat.ml] 1 Mar 2015 Matrix Completion with Noisy Entries and Outliers Raymond K. W. Wong 1 and Thomas C. M. Lee 2 arxiv:1503.00214v1 [stat.ml] 1 Mar 2015 1 Department of Statistics, Iowa State University 2 Department of Statistics,

More information

Descent methods. min x. f(x)

Descent methods. min x. f(x) Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,

More information

An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints

An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints Klaus Schittkowski Department of Computer Science, University of Bayreuth 95440 Bayreuth, Germany e-mail:

More information

Spectral gradient projection method for solving nonlinear monotone equations

Spectral gradient projection method for solving nonlinear monotone equations Journal of Computational and Applied Mathematics 196 (2006) 478 484 www.elsevier.com/locate/cam Spectral gradient projection method for solving nonlinear monotone equations Li Zhang, Weijun Zhou Department

More information

Least Sparsity of p-norm based Optimization Problems with p > 1

Least Sparsity of p-norm based Optimization Problems with p > 1 Least Sparsity of p-norm based Optimization Problems with p > Jinglai Shen and Seyedahmad Mousavi Original version: July, 07; Revision: February, 08 Abstract Motivated by l p -optimization arising from

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE 2010 2935 Variance-Component Based Sparse Signal Reconstruction and Model Selection Kun Qiu, Student Member, IEEE, and Aleksandar Dogandzic,

More information

arxiv: v3 [stat.me] 8 Jun 2018

arxiv: v3 [stat.me] 8 Jun 2018 Between hard and soft thresholding: optimal iterative thresholding algorithms Haoyang Liu and Rina Foygel Barber arxiv:804.0884v3 [stat.me] 8 Jun 08 June, 08 Abstract Iterative thresholding algorithms

More information