Optimization for Sparse Estimation and Structured Sparsity


1 Optimization for Sparse Estimation and Structured Sparsity Julien Mairal INRIA LEAR, Grenoble IMA, Minneapolis, June 2013 Short Course Applied Statistics and Machine Learning Julien Mairal Optimization for Sparse Estimation 1/94

2 Main topic of the lecture How do we solve sparse estimation problems in machine learning and statistics; sparse inverse problems in signal processing. Tools we use greedy algorithms; gradient methods for non-differentiable functions; other tricks. Julien Mairal Optimization for Sparse Estimation 2/94

3 1 Short Introduction on Sparsity 2 Convex Optimization for Sparse Estimation 3 Non-convex Optimization for Sparse Estimation 4 Structured Sparsity Julien Mairal Optimization for Sparse Estimation 3/94

4 Part I: Short Introduction on Sparsity Julien Mairal Optimization for Sparse Estimation 4/94

5 Sparse Linear Models: Signal Processing Point of View Let y in R^n be a signal. Let X = [x_1,...,x_p] in R^{n×p} be a set of elementary signals, which we call a dictionary. X is adapted to y if it can represent it with a few elements, that is, if there exists a sparse vector β in R^p such that y ≈ Xβ. We call β the sparse code. [Figure: the matrix-vector product y ≈ Xβ, with y in R^n, X in R^{n×p}, and β in R^p sparse.] Julien Mairal Optimization for Sparse Estimation 5/94

6 Sparse Linear Models: Signal Processing Point of View Wavelet Representation [see Mallat, 1999] y = Xβ. The coefficients β are represented in a quadtree. Julien Mairal Optimization for Sparse Estimation 6/94

7 Sparse Linear Models: Signal Processing Point of View Dictionary for natural image patches [Olshausen and Field, 1997, Elad and Aharon, 2006] Julien Mairal Optimization for Sparse Estimation 7/94

8 Sparse Linear Models: Machine Learning Point of View Let (y_i, x_i)_{i=1}^n be a training set, where the vectors x_i are in R^p and are called features. The scalars y_i are in {−1, +1} for binary classification problems, and in R for regression problems. We assume there is a relation y ≈ β^T x, and solve min_{β ∈ R^p} (1/n) Σ_{i=1}^n l(y_i, β^T x_i) [empirical risk] + λΩ(β) [regularization]. Julien Mairal Optimization for Sparse Estimation 8/94

9 Sparse Linear Models: Machine Learning Point of View A few examples. Ridge regression: min_{β ∈ R^p} (1/n) Σ_{i=1}^n (1/2)(y_i − β^T x_i)^2 + λ‖β‖_2^2. Linear SVM: min_{β ∈ R^p} (1/n) Σ_{i=1}^n max(0, 1 − y_i β^T x_i) + λ‖β‖_2^2. Logistic regression: min_{β ∈ R^p} (1/n) Σ_{i=1}^n log(1 + e^{−y_i β^T x_i}) + λ‖β‖_2^2. Julien Mairal Optimization for Sparse Estimation 9/94

10 Sparse Linear Models: Machine Learning Point of View A few examples. Ridge regression: min_{β ∈ R^p} (1/n) Σ_{i=1}^n (1/2)(y_i − β^T x_i)^2 + λ‖β‖_2^2. Linear SVM: min_{β ∈ R^p} (1/n) Σ_{i=1}^n max(0, 1 − y_i β^T x_i) + λ‖β‖_2^2. Logistic regression: min_{β ∈ R^p} (1/n) Σ_{i=1}^n log(1 + e^{−y_i β^T x_i}) + λ‖β‖_2^2. The squared l_2-norm induces smoothness in β. When one knows in advance that β should be sparse, one should use a sparsity-inducing regularization such as the l_1-norm [Chen et al., 1999, Tibshirani, 1996]. Julien Mairal Optimization for Sparse Estimation 10/94

11 Main Topic of the Lecture How do we solve min_{β ∈ R^p} (1/2)‖y − Xβ‖_2^2 [loss/data-fitting] + λΩ(β) [regularization], where X in R^{n×p} is a design matrix and Ω induces sparsity, e.g., ‖β‖_0 = #{i s.t. β_i ≠ 0} (NP-hard) or ‖β‖_1 = Σ_{i=1}^p |β_i| (convex)? When Ω is the l_1-norm, the problem is called the Lasso [Tibshirani, 1996] or basis pursuit [Chen et al., 1999]. Later in the lecture: extensions to more general smooth loss functions f and to other sparsity-inducing penalties. Julien Mairal Optimization for Sparse Estimation 11/94

12 One Important Question Why does the l_1-norm induce sparsity? Can we get some intuition from the simplest isotropic case, ˆβ(λ) = argmin_{β ∈ R^p} (1/2)‖y − β‖_2^2 + λ‖β‖_1, or equivalently from the Euclidean projection onto the l_1-ball, β(T) = argmin_{β ∈ R^p} (1/2)‖y − β‖_2^2 s.t. ‖β‖_1 ≤ T? Here equivalent means that for all λ > 0, there exists T ≥ 0 such that β(T) = ˆβ(λ). Julien Mairal Optimization for Sparse Estimation 12/94

13 Regularizing with the l_1-norm [Figure: projection of a point onto the l_1-ball {β : ‖β‖_1 ≤ T} in 2-D.] The projection onto a convex set is biased towards singularities. Julien Mairal Optimization for Sparse Estimation 13/94

14 Regularizing with the l_2-norm [Figure: projection of a point onto the l_2-ball {β : ‖β‖_2 ≤ T} in 2-D.] The l_2-norm is isotropic. Julien Mairal Optimization for Sparse Estimation 14/94

15 Regularizing with the l_∞-norm [Figure: projection of a point onto the l_∞-ball {β : ‖β‖_∞ ≤ T} in 2-D.] The l_∞-norm encourages |β_1| = |β_2|. Julien Mairal Optimization for Sparse Estimation 15/94

16 In 3D. Copyright G. Obozinski Julien Mairal Optimization for Sparse Estimation 16/94

17 Why does the l_1-norm induce sparsity? Example: quadratic problem in 1D, min_{β ∈ R} (1/2)(y − β)^2 + λ|β|. This is a piecewise quadratic function with a kink at zero. Derivative at 0^+: g_+ = −y + λ, and at 0^−: g_− = −y − λ. Optimality conditions: β is optimal iff either |β| > 0 and −(y − β) + λ sign(β) = 0, or β = 0 with g_+ ≥ 0 and g_− ≤ 0. The solution is a soft-thresholding: β* = sign(y)(|y| − λ)_+. Julien Mairal Optimization for Sparse Estimation 17/94

18 Why does the l_1-norm induce sparsity? [Figure: (a) the soft-thresholding operator β* = sign(y)(|y| − λ)_+, solution of min_β (1/2)(y − β)^2 + λ|β|; (b) the hard-thresholding operator β* = 1_{|y| ≥ √(2λ)} y, solution of min_β (1/2)(y − β)^2 + λ 1_{|β| > 0}.] Julien Mairal Optimization for Sparse Estimation 18/94
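As a quick illustration (not from the original slides), here is a minimal NumPy sketch of the two scalar operators above; the function names soft_threshold and hard_threshold and the example values are ours.

import numpy as np

def soft_threshold(y, lam):
    # prox of lam*|.|: sign(y) * max(|y| - lam, 0)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def hard_threshold(y, lam):
    # minimizer of 0.5*(y - b)^2 + lam*1{b != 0}: keep y iff |y| >= sqrt(2*lam)
    return np.where(np.abs(y) >= np.sqrt(2.0 * lam), y, 0.0)

# example: both operators applied to the same inputs
y = np.array([-3.0, -0.5, 0.2, 1.5])
print(soft_threshold(y, 1.0))   # [-2. -0.  0.  0.5]
print(hard_threshold(y, 1.0))   # entries with |y| >= sqrt(2) kept: [-3.  0.  0.  1.5]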

19 Why does the l_1-norm induce sparsity? Comparison with l_2-regularization in 1D: Ω(β) = β^2 has Ω'(β) = 2β, whereas Ω(β) = |β| has Ω'_−(β) = −1 and Ω'_+(β) = 1. The gradient of the l_2-penalty vanishes when β gets close to 0. On its differentiable part, the gradient of the l_1-norm has constant magnitude. Julien Mairal Optimization for Sparse Estimation 19/94

20 Why does the l_1-norm induce sparsity? Physical illustration. [Figure: two systems at rest, E_1 = 0, with reference position y_0.] Julien Mairal Optimization for Sparse Estimation 20/94

21 Why does the l_1-norm induce sparsity? Physical illustration. [Figure: left, a spring pulling against a second spring: E_1 = (k_1/2)(y_0 − y)^2 and E_2 = (k_2/2)y^2 (quadratic energy); right, a spring pulling against gravity: E_1 = (k_1/2)(y_0 − y)^2 and E_2 = mgy (linear energy).] Julien Mairal Optimization for Sparse Estimation 21/94

22 Why does the l_1-norm induce sparsity? Physical illustration. [Figure: at equilibrium, the quadratic energy E_2 = (k_2/2)y^2 leaves y ≠ 0, whereas the linear energy E_2 = mgy yields exactly y = 0!] Julien Mairal Optimization for Sparse Estimation 22/94

23 Examples of sparsity-inducing penalties Exploiting concave functions with a kink at zero: Ω(β) = Σ_{j=1}^p φ(|β_j|). Examples: the l_q pseudo-norm, with 0 < q < 1: Ω(β) = Σ_{j=1}^p |β_j|^q; the log penalty: Ω(β) = Σ_{j=1}^p log(|β_j| + ε). φ is any function that looks like this: [Figure: a concave function φ(|β|) with a kink at zero.] Julien Mairal Optimization for Sparse Estimation 23/94

24 Examples of sparsity-inducing penalties [Figure: open balls in 2-D corresponding to several l_q-norms and pseudo-norms: (c) l_0.5-ball, (d) l_1-ball, (e) l_2-ball.] Julien Mairal Optimization for Sparse Estimation 24/94

25 Group Lasso [Turlach et al., 2005, Yuan and Lin, 2006, Zhao et al., 2009] The l_1/l_q-norm: Ω(β) = Σ_{g ∈ G} ‖β_g‖_q, where G is a partition of {1,...,p}; q = 2 or q = ∞ in practice; it can be interpreted as the l_1-norm of [‖β_g‖_q]_{g ∈ G}. Example: Ω(β) = ‖β_{{1,2}}‖_2 + |β_3|. Julien Mairal Optimization for Sparse Estimation 25/94

26 Part II: Convex Optimization for Sparse Estimation Julien Mairal Optimization for Sparse Estimation 26/94

27 Convex Functions f(β) β Julien Mairal Optimization for Sparse Estimation 27/94

28 Why do we care about convexity? Julien Mairal Optimization for Sparse Estimation 28/94

29 Why do we care about convexity? Local observations give some information about the global optimum. [Figure: a convex function f with its minimizer β*.] ∇f(β) = 0 is a necessary and sufficient optimality condition for differentiable convex functions; it is often easy to upper-bound f(β) − f*. Julien Mairal Optimization for Sparse Estimation 28/94

30 An important inequality for smooth convex functions If f is convex, [Figure: f lies above its tangent at β_0.] f(β) ≥ f(β_0) + ∇f(β_0)^T(β − β_0) [linear approximation]; this is an equivalent definition of convexity for smooth functions. Julien Mairal Optimization for Sparse Estimation 29/94

31 An important inequality for smooth functions If ∇f is L-Lipschitz continuous, [Figure: f lies below the quadratic upper bound g tangent at β_0.] f(β) ≤ g(β) = f(β_0) + ∇f(β_0)^T(β − β_0) [linear approximation] + (L/2)‖β − β_0‖_2^2; the minimizer of g is β_1 = β_0 − (1/L)∇f(β_0) (gradient descent step). Julien Mairal Optimization for Sparse Estimation 30/94

32 Gradient Descent Algorithm Assume that f is convex and differentiable, and that ∇f is L-Lipschitz. Theorem: consider the algorithm β^t ← β^{t−1} − (1/L)∇f(β^{t−1}). Then, f(β^t) − f* ≤ L‖β^0 − β*‖_2^2 / (2t). Remarks: the convergence rate improves under additional assumptions on f (strong convexity); some variants have an O(1/t^2) convergence rate [Nesterov, 1983]. Julien Mairal Optimization for Sparse Estimation 31/94
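A minimal sketch of this iteration, written for the least-squares objective f(β) = (1/2)‖y − Xβ‖_2^2 so that ∇f(β) = X^T(Xβ − y) and L = ‖X^T X‖_2; the function name and iteration count are illustrative assumptions.

import numpy as np

def gradient_descent_ls(X, y, n_iter=500):
    # f(beta) = 0.5 * ||y - X beta||^2, gradient is X^T (X beta - y)
    L = np.linalg.norm(X.T @ X, 2)          # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta -= (X.T @ (X @ beta - y)) / L  # beta^t = beta^{t-1} - (1/L) grad f(beta^{t-1})
    return beta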

33 Proof (1/2) Proof of the main inequality for smooth functions. We want to show that for all β and α, f(β) ≤ f(α) + ∇f(α)^T(β − α) + (L/2)‖β − α‖_2^2. By using Taylor's theorem with integral form, f(β) − f(α) = ∫_0^1 ∇f(tβ + (1−t)α)^T(β − α) dt. Then, f(β) − f(α) − ∇f(α)^T(β − α) = ∫_0^1 (∇f(tβ + (1−t)α) − ∇f(α))^T(β − α) dt ≤ ∫_0^1 |(∇f(tβ + (1−t)α) − ∇f(α))^T(β − α)| dt ≤ ∫_0^1 ‖∇f(tβ + (1−t)α) − ∇f(α)‖_2 ‖β − α‖_2 dt (Cauchy-Schwarz) ≤ ∫_0^1 Lt‖β − α‖_2^2 dt = (L/2)‖β − α‖_2^2. Julien Mairal Optimization for Sparse Estimation 32/94

34 Proof (2/2) Proof of the theorem. We have shown that for all β, f(β) ≤ g_t(β) = f(β^{t−1}) + ∇f(β^{t−1})^T(β − β^{t−1}) + (L/2)‖β − β^{t−1}‖_2^2. g_t is minimized by β^t, and it can be rewritten g_t(β) = g_t(β^t) + (L/2)‖β − β^t‖_2^2. Then, f(β^t) ≤ g_t(β^t) = g_t(β*) − (L/2)‖β* − β^t‖_2^2 = f(β^{t−1}) + ∇f(β^{t−1})^T(β* − β^{t−1}) + (L/2)‖β* − β^{t−1}‖_2^2 − (L/2)‖β* − β^t‖_2^2 ≤ f* + (L/2)‖β* − β^{t−1}‖_2^2 − (L/2)‖β* − β^t‖_2^2. By summing from t = 1 to T, we obtain a telescopic sum: T(f(β^T) − f*) ≤ Σ_{t=1}^T (f(β^t) − f*) ≤ (L/2)‖β* − β^0‖_2^2 − (L/2)‖β* − β^T‖_2^2, and dividing by T gives the bound of the theorem. Julien Mairal Optimization for Sparse Estimation 33/94

35 The Proximal Gradient Method We consider a smooth function f and a non-smooth regularizer Ω: min_{β ∈ R^p} f(β) + Ω(β). For example, min_{β ∈ R^p} (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1. The objective function is not differentiable; an extension of gradient descent for such a problem is called proximal gradient descent [Beck and Teboulle, 2009, Nesterov, 2007]. Julien Mairal Optimization for Sparse Estimation 34/94

36 The Proximal Gradient Method An important inequality for composite functions. If ∇f is L-Lipschitz continuous, [Figure: the composite objective f + Ω lies below its majorizer g + Ω built at β_0.] f(β) + Ω(β) ≤ f(β_0) + ∇f(β_0)^T(β − β_0) + (L/2)‖β − β_0‖_2^2 + Ω(β); β_1 minimizes g + Ω. Julien Mairal Optimization for Sparse Estimation 35/94

37 The Proximal Gradient Method Gradient descent for minimizing f consists of β^t ← argmin_{β ∈ R^p} g_t(β), i.e., β^t ← β^{t−1} − (1/L)∇f(β^{t−1}). The proximal gradient method for minimizing f + Ω consists of β^t ← argmin_{β ∈ R^p} g_t(β) + Ω(β), which is equivalent to β^t ← argmin_{β ∈ R^p} (1/2)‖β^{t−1} − (1/L)∇f(β^{t−1}) − β‖_2^2 + (1/L)Ω(β). It requires computing efficiently the proximal operator of Ω: α ↦ argmin_{β ∈ R^p} (1/2)‖α − β‖_2^2 + Ω(β). Julien Mairal Optimization for Sparse Estimation 36/94

38 The Proximal Gradient Method Remarks: also known as the forward-backward algorithm; it has convergence rates similar to those of gradient descent [Beck and Teboulle, 2009, Nesterov, 2007]; there exist line-search schemes to automatically tune L. The case of l_1: the proximal operator of λ‖.‖_1 is the soft-thresholding operator β*_j = sign(α_j)(|α_j| − λ)_+. The resulting algorithm is called iterative soft-thresholding [Daubechies et al., 2004]. Julien Mairal Optimization for Sparse Estimation 37/94
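A minimal sketch of iterative soft-thresholding (ISTA) for the Lasso, combining the gradient step of the previous slides with the soft-thresholding proximal operator; this is only an illustration (function name and iteration count are ours), not the implementation of any particular software package.

import numpy as np

def ista(X, y, lam, n_iter=500):
    # solves min_beta 0.5*||y - X beta||_2^2 + lam*||beta||_1 by proximal gradient
    L = np.linalg.norm(X.T @ X, 2)                 # Lipschitz constant of the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        z = beta - grad / L                        # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # prox of (lam/L)*||.||_1
    return beta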

39 The Proximal Gradient Method for the Group Lasso The proximal operator for the Group Lasso penalty, min_{β ∈ R^p} (1/2)‖α − β‖_2^2 + λ Σ_{g ∈ G} ‖β_g‖_q: for q = 2, β*_g = (α_g / ‖α_g‖_2)(‖α_g‖_2 − λ)_+ for all g in G; for q = ∞, β*_g = α_g − Π_{‖.‖_1 ≤ λ}[α_g] for all g in G. These formulas generalize soft-thresholding to groups of variables. Julien Mairal Optimization for Sparse Estimation 38/94
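A minimal sketch of the group soft-thresholding formula for q = 2, for a list of index arrays forming a partition; the helper name and the group encoding are our own assumptions.

import numpy as np

def prox_group_lasso(alpha, groups, lam):
    # groups: list of index arrays partitioning {0,...,p-1}
    beta = np.zeros_like(alpha)
    for g in groups:
        norm_g = np.linalg.norm(alpha[g])
        if norm_g > lam:
            beta[g] = alpha[g] * (1.0 - lam / norm_g)   # shrink the whole group
        # else: the whole group is set exactly to zero
    return beta

# example with two groups
alpha = np.array([3.0, 4.0, 0.5])
print(prox_group_lasso(alpha, [np.array([0, 1]), np.array([2])], 1.0))
# group {0,1} has norm 5 -> scaled by 0.8; group {2} has norm 0.5 <= 1 -> set to zero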

40 Coordinate Descent for the Lasso min_{β ∈ R^p} (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1. The coordinate descent method consists of iteratively fixing all variables but one and optimizing with respect to that one: β_j ← argmin_{β ∈ R} (1/2)‖y − Σ_{i≠j} β_i x_i − βx_j‖_2^2 + λ|β|, where X = [x_1,...,x_p] and r = y − Σ_{i≠j} β_i x_i. Assuming the columns of X have unit l_2-norm, β_j ← sign(x_j^T r)(|x_j^T r| − λ)_+. This again involves the soft-thresholding operator. Julien Mairal Optimization for Sparse Estimation 39/94

41 Coordinate Descent for the Lasso Remarks: no parameter to tune! Impressive performance with five lines of code (see the sketch below). Coordinate descent applied to a nonsmooth objective is not convergent in general; here, however, the problem is equivalent to a convex smooth optimization problem with separable constraints, min_{β_+, β_− ≥ 0} (1/2)‖y − Xβ_+ + Xβ_−‖_2^2 + λβ_+^T 1 + λβ_−^T 1, and for this specific problem the algorithm is convergent [see Bertsekas, 1999]. It can be extended to the group Lasso or to other loss functions. The index j can be picked at random or by cycling (harder to analyze). Julien Mairal Optimization for Sparse Estimation 40/94
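The "five lines of code" might look like the following NumPy sketch (assuming unit-norm columns; the residual update keeps each pass in O(np); names and pass count are ours).

import numpy as np

def cd_lasso(X, y, lam, n_passes=100):
    # coordinate descent for min 0.5*||y - X beta||^2 + lam*||beta||_1,
    # assuming the columns of X have unit l2-norm
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                                   # residual y - X beta
    for _ in range(n_passes):
        for j in range(p):
            r += X[:, j] * beta[j]                 # remove variable j from the residual
            c = X[:, j] @ r
            beta[j] = np.sign(c) * max(abs(c) - lam, 0.0)   # soft-thresholding
            r -= X[:, j] * beta[j]                 # put it back
    return beta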

42 Smoothing Techniques: Reweighted l_2 Let us start from something simple: a^2 − 2ab + b^2 ≥ 0. Julien Mairal Optimization for Sparse Estimation 41/94

43 Smoothing Techniques: Reweighted l_2 Let us start from something simple: a^2 − 2ab + b^2 ≥ 0. Then |a| ≤ (1/2)(a^2/b + b), with equality iff |a| = b, and ‖β‖_1 = min_{η_j ≥ 0} (1/2) Σ_{j=1}^p (β_j^2/η_j + η_j). The formulation becomes min_{β, η_j ≥ ε} (1/2)‖y − Xβ‖_2^2 + (λ/2) Σ_{j=1}^p (β_j^2/η_j + η_j). Julien Mairal Optimization for Sparse Estimation 41/94
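A minimal sketch of the resulting alternating scheme (variable names and iteration counts are ours): for fixed β the optimal η_j is max(|β_j|, ε), and for fixed η the problem in β is a ridge-type linear system.

import numpy as np

def reweighted_l2_lasso(X, y, lam, eps=1e-3, n_iter=50):
    # alternately minimize 0.5*||y - X beta||^2 + (lam/2) * sum_j (beta_j^2/eta_j + eta_j)
    n, p = X.shape
    eta = np.ones(p)
    for _ in range(n_iter):
        # beta-step: weighted ridge regression, closed form
        A = X.T @ X + lam * np.diag(1.0 / eta)
        beta = np.linalg.solve(A, X.T @ y)
        # eta-step: eta_j = max(|beta_j|, eps); eps keeps the problem well defined
        eta = np.maximum(np.abs(beta), eps)
    return beta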

44 The Regularization Path of the Lasso [Figure: coefficient values β_j as functions of λ; the regularization path of the Lasso is piecewise linear.] min_{β ∈ R^p} (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1. Julien Mairal Optimization for Sparse Estimation 42/94

45 Lasso and optimality conditions Theorem: β* is a solution of the Lasso if and only if |x_j^T(y − Xβ*)| ≤ λ if β*_j = 0, and x_j^T(y − Xβ*) = λ sign(β*_j) otherwise. Consequence: β*_Γ(λ) = (X_Γ^T X_Γ)^{−1}(X_Γ^T y − λ sign(β*_Γ)) = A + λB, where Γ = {j s.t. β*_j ≠ 0}. If we know Γ and the signs of β*_Γ in advance, we have a closed-form solution. Following the piecewise linear regularization path is called the homotopy method [Osborne et al., 2000, Efron et al., 2004]. Julien Mairal Optimization for Sparse Estimation 43/94
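These conditions also give a simple numerical test of (approximate) optimality for any candidate β; a minimal sketch, with a tolerance and function name of our choosing.

import numpy as np

def check_lasso_optimality(X, y, beta, lam, tol=1e-6):
    # returns True if beta satisfies the Lasso optimality conditions up to tol
    corr = X.T @ (y - X @ beta)
    zero = np.abs(beta) <= tol
    cond_zero = np.all(np.abs(corr[zero]) <= lam + tol)
    cond_active = np.all(np.abs(corr[~zero] - lam * np.sign(beta[~zero])) <= tol)
    return cond_zero and cond_active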

46 LARS algorithm (Homotopy) The regularization path (λ, β*(λ)) is piecewise linear. 1 Start from the trivial solution (λ = ‖X^T y‖_∞, β*(λ) = 0). 2 Define Γ = {j s.t. |x_j^T y| = λ}. 3 Follow the regularization path: β*_Γ(λ) = A + λB, keeping β*_{Γ^c} = 0 and decreasing the value of λ, until one of the following events occurs: there exists j ∉ Γ such that |x_j^T(y − Xβ*(λ))| = λ, then Γ ← Γ ∪ {j}; or there exists j ∈ Γ such that β*_j(λ) = 0, then Γ ← Γ \ {j}. 4 Update the direction of the path and go back to 3. Hidden assumptions: the regularization path is unique; variables enter the path one at a time. Extremely efficient for small/medium-scale and/or very sparse problems (when implemented correctly). Robust to correlated features. Can solve the elastic-net. Julien Mairal Optimization for Sparse Estimation 44/94
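As a usage note, scikit-learn ships an implementation of this homotopy strategy; a minimal sketch assuming sklearn.linear_model.lars_path with its documented interface (the synthetic data below is purely illustrative).

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.RandomState(0)
X = rng.randn(200, 50)
y = X @ (rng.randn(50) * (rng.rand(50) < 0.1)) + 0.1 * rng.randn(200)

# method='lasso' follows the Lasso regularization path (allowing variable removals)
alphas, active, coefs = lars_path(X, y, method='lasso')
print(coefs.shape)   # (n_features, n_kinks): one column of coefficients per kink of the path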

47 LARS algorithm (Homotopy) - Complexity Theorem - worst-case analysis [Mairal and Yu, 2012]: in the worst case, the regularization path of the Lasso has exactly (3^p + 1)/2 linear segments. [Figure: a worst-case regularization path, coefficients (log scale) versus the kinks of the path.] Julien Mairal Optimization for Sparse Estimation 45/94

48 Empirical comparison: Lasso, small scale (n = 200, p = 200). [Figure: benchmark curves for the four regimes low/high regularization × low/high correlation.] Julien Mairal Optimization for Sparse Estimation 46/94

49 Empirical comparison: Lasso, medium scale (n = 2000, p = 10000). [Figure: benchmark curves for the four regimes low/high regularization × low/high correlation.] Julien Mairal Optimization for Sparse Estimation 47/94

50 Empirical comparison: conclusions Lasso: generic methods (subgradient descent, QP/CP solvers) are slow; homotopy is fastest in low dimension and/or for high correlation; proximal methods are competitive, especially in larger settings and/or with weak correlation, weak regularization, or low precision; coordinate descent is usually dominated by LARS, but is much simpler to implement! Smooth losses and other regularizations: LARS is not available; (block) coordinate descent and proximal gradient methods are good candidates. Julien Mairal Optimization for Sparse Estimation 48/94

51 Conclusion of this part What was not covered in this part: other penalty functions, fused lasso (total variation), hierarchical penalties, structured sparsity with overlapping groups; stochastic optimization, augmented Lagrangian, convergence rates, parallel computing, duality gaps via Fenchel duality... Software: try it yourself (Matlab/R/Python/C++). More reading on optimization: Bach, Jenatton, Mairal and Obozinski, Optimization with Sparsity-Inducing Penalties; convex optimization: [Boyd and Vandenberghe, 2004, Borwein and Lewis, 2006, Nocedal and Wright, 2006, Bertsekas, 1999, Nesterov, 2004]. Julien Mairal Optimization for Sparse Estimation 49/94

52 Part III: Non-convex Optimization for Sparse Estimation Julien Mairal Optimization for Sparse Estimation 50/94

53 Greedy Algorithms Several equivalent non-convex NP-hard problems: min_{β ∈ R^p} (1/2)‖y − Xβ‖_2^2 [residual] + λ‖β‖_0 [regularization]; min_{β ∈ R^p} ‖y − Xβ‖_2^2 s.t. ‖β‖_0 ≤ L; min_{β ∈ R^p} ‖β‖_0 s.t. ‖y − Xβ‖_2^2 ≤ ε. The solution is often approximated with a greedy algorithm. Signal processing: Matching Pursuit [Mallat and Zhang, 1993], Orthogonal Matching Pursuit [Pati et al., 1993]; statistics: L2-boosting [Bühlmann and Yu, 2003], forward selection. Julien Mairal Optimization for Sparse Estimation 51/94

54-63 Matching Pursuit [Figures: geometric illustration of Matching Pursuit in R^3 with dictionary elements x_1, x_2, x_3. The residual r starts at y with β = (0,0,0); at each step it is projected onto the most correlated atom and the corresponding coefficient is updated, giving successively β = (0, 0, 0.75), β = (0, 0.24, 0.75), and β = (0, 0.24, 0.65).] Julien Mairal Optimization for Sparse Estimation 52-61/94

64 Matching Pursuit min_{β ∈ R^p} ‖y − Xβ‖_2^2 s.t. ‖β‖_0 ≤ L
1: β ← 0
2: r ← y (residual)
3: while ‖β‖_0 < L do
4:   Select the predictor with maximum correlation with the residual: ĵ ← argmax_{j=1,...,p} |x_j^T r|
5:   Update the coefficients and the residual: β_ĵ ← β_ĵ + x_ĵ^T r;  r ← r − (x_ĵ^T r) x_ĵ
6: end while
Julien Mairal Optimization for Sparse Estimation 62/94
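A direct NumPy transcription of this pseudocode, assuming unit-norm columns of X; the function name, the safety cap on iterations, and the tolerance are our own additions.

import numpy as np

def matching_pursuit(X, y, L, max_iter=1000):
    # greedy approximation of min ||y - X beta||^2 s.t. ||beta||_0 <= L,
    # assuming the columns of X have unit l2-norm
    p = X.shape[1]
    beta = np.zeros(p)
    r = y.copy()
    for _ in range(max_iter):
        if np.count_nonzero(beta) >= L:
            break
        c = X.T @ r
        j = int(np.argmax(np.abs(c)))        # predictor most correlated with the residual
        if abs(c[j]) < 1e-12:                # residual (numerically) orthogonal to the dictionary
            break
        beta[j] += c[j]
        r -= c[j] * X[:, j]
    return beta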

65-67 Orthogonal Matching Pursuit [Figures: geometric illustration of Orthogonal Matching Pursuit in R^3. Starting from β = (0,0,0) and Γ = ∅, the active set grows to Γ = {3} with β = (0, 0, 0.75), then to Γ = {3,2} with β = (0, 0.29, 0.63); at each step the residual is obtained by orthogonal projection of y onto the span of the selected atoms.] Julien Mairal Optimization for Sparse Estimation 63-65/94

68 Orthogonal Matching Pursuit min_{β ∈ R^p} ‖y − Xβ‖_2^2 s.t. ‖β‖_0 ≤ L
1: Γ = ∅
2: for iter = 1,...,L do
3:   Select the predictor which most reduces the objective: î ← argmin_{i ∈ Γ^c} { min_{β'} ‖y − X_{Γ∪{i}} β'‖_2^2 }
4:   Update the active set: Γ ← Γ ∪ {î}
5:   Update the residual (orthogonal projection): r ← (I − X_Γ(X_Γ^T X_Γ)^{−1} X_Γ^T) y
6:   Update the coefficients: β_Γ ← (X_Γ^T X_Γ)^{−1} X_Γ^T y
7: end for
Julien Mairal Optimization for Sparse Estimation 66/94
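A naive NumPy sketch of this procedure, using a least-squares solve at each step rather than the Cholesky updates discussed on the next slide; for unit-norm columns, selecting the atom most correlated with the current residual is equivalent to the argmin rule above. Names and the early-exit test are ours.

import numpy as np

def omp(X, y, L):
    # greedy approximation of min ||y - X beta||^2 s.t. ||beta||_0 <= L
    p = X.shape[1]
    gamma = []                                   # active set
    r = y.copy()
    coef = np.array([])
    for _ in range(L):
        j = int(np.argmax(np.abs(X.T @ r)))      # most correlated atom (unit-norm columns)
        if j in gamma:                           # no informative atom left
            break
        gamma.append(j)
        # refit all active coefficients by least squares (orthogonal projection)
        coef, *_ = np.linalg.lstsq(X[:, gamma], y, rcond=None)
        r = y - X[:, gamma] @ coef
    beta = np.zeros(p)
    if gamma:
        beta[gamma] = coef
    return beta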

69 Orthogonal Matching Pursuit The keys for a good implementation: if available, use the Gram matrix G = X^T X; maintain the computation of X^T r for each signal; maintain a Cholesky decomposition of (X_Γ^T X_Γ)^{−1} for each signal. The total complexity for decomposing n L-sparse signals of size m with a dictionary of size p is O(p^2 m) [Gram matrix] + O(nL^3) [Cholesky] + O(n(pm + pL^2)) [X^T r] = O(np(m + L^2)). It is also possible to use the matrix inversion lemma instead of a Cholesky decomposition (same complexity, but less numerical stability). Julien Mairal Optimization for Sparse Estimation 67/94

70 Example with the software SPAMS Software available at >> I=double(imread('data/lena.eps'))/255; >> %extract all patches of I >> X=im2col(I,[8 8],'sliding'); >> %load a dictionary of size 64 x 256 >> D=load('dict.mat'); >> >> %set the sparsity parameter L to 10 >> param.L=10; >> alpha=mexOMP(X,D,param); On a 4-core 3.6GHz machine: signals processed per second! Julien Mairal Optimization for Sparse Estimation 68/94

71 DC (difference of convex) Programming Remember? Concave functions with a kink at zero: Ω(β) = Σ_{j=1}^p φ(|β_j|). Examples: the l_q pseudo-norm, with 0 < q < 1: Ω(β) = Σ_{j=1}^p (|β_j| + ε)^q; the log penalty: Ω(β) = Σ_{j=1}^p log(|β_j| + ε). φ is any function that looks like this: [Figure: a concave function φ(|β|) with a kink at zero.] Julien Mairal Optimization for Sparse Estimation 69/94

72 [Figure: illustration of the DC-programming approach. The non-convex part φ(|β|) = log(|β| + ε) of the objective f(β) + φ(|β|) is upper-bounded by its tangent at the current estimate β^k, yielding the convex majorizing surrogate f(β) + φ'(|β^k|)|β| + C.] Julien Mairal Optimization for Sparse Estimation 70/94

73 DC (difference of convex) Programming min_{β ∈ R^p} f(β) + λ Σ_{j=1}^p φ(|β_j|). This problem is non-convex: f is convex, and φ is concave on R_+. If β^k is the current estimate at iteration k, the algorithm solves β^{k+1} ← argmin_{β ∈ R^p} f(β) + λ Σ_{j=1}^p φ'(|β^k_j|) |β_j|, which is a reweighted-l_1 problem. Warning: it does not solve the non-convex problem; it only provides a stationary point. In practice, each iteration sets small coefficients to zero, and after 2-3 iterations the result does not change much. Julien Mairal Optimization for Sparse Estimation 71/94
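A minimal sketch of this reweighted-l_1 scheme for the least-squares loss and the log penalty, using a proximal-gradient inner solver with per-coordinate thresholds; all names and iteration counts are our own assumptions.

import numpy as np

def reweighted_l1(X, y, lam, eps=0.01, n_outer=3, n_inner=300):
    # DC programming for min 0.5*||y - X beta||^2 + lam * sum_j log(|beta_j| + eps)
    Lip = np.linalg.norm(X.T @ X, 2)
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(n_outer):
        w = lam / (np.abs(beta) + eps)          # weights lam * phi'(|beta_j^k|) of the majorizer
        for _ in range(n_inner):                # proximal gradient on the weighted-l1 subproblem
            z = beta - X.T @ (X @ beta - y) / Lip
            beta = np.sign(z) * np.maximum(np.abs(z) - w / Lip, 0.0)
    return beta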

74 The Proximal Gradient Method This inequality is also true for non-convex functions! If ∇f is L-Lipschitz continuous, [Figure: the composite objective f + Ω lies below its majorizer g + Ω built at β_0.] f(β) + Ω(β) ≤ f(β_0) + ∇f(β_0)^T(β − β_0) + (L/2)‖β − β_0‖_2^2 + Ω(β). Julien Mairal Optimization for Sparse Estimation 72/94

75 The Proximal Gradient Method As before, the method consists of the following iteration: β^t ← argmin_{β ∈ R^p} (1/2)‖β^{t−1} − (1/L)∇f(β^{t−1}) − β‖_2^2 + (1/L)Ω(β). It requires computing efficiently the proximal operator of Ω: α ↦ argmin_{β ∈ R^p} (1/2)‖α − β‖_2^2 + Ω(β). For the l_0-penalty (Ω(β) = λ‖β‖_0), this amounts to a hard-thresholding: β*_j = 1_{|α_j| ≥ √(2λ)} α_j. Trick: start with a high value for λ, and gradually decrease it. Julien Mairal Optimization for Sparse Estimation 73/94
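A minimal sketch of this hard-thresholding proximal gradient scheme with the continuation trick of gradually decreasing λ; the continuation schedule, names, and iteration counts are illustrative assumptions.

import numpy as np

def iht_l0(X, y, lam_final, n_stages=5, n_iter=200):
    # proximal gradient for min 0.5*||y - X beta||^2 + lam*||beta||_0,
    # starting from a large lam and decreasing it towards lam_final
    Lip = np.linalg.norm(X.T @ X, 2)
    beta = np.zeros(X.shape[1])
    for lam in np.geomspace(100.0 * lam_final, lam_final, n_stages):
        for _ in range(n_iter):
            z = beta - X.T @ (X @ beta - y) / Lip
            # prox of (lam/Lip)*||.||_0: hard-thresholding at sqrt(2*lam/Lip)
            beta = np.where(np.abs(z) >= np.sqrt(2.0 * lam / Lip), z, 0.0)
    return beta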

76 Part IV: Structured Sparsity Julien Mairal Optimization for Sparse Estimation 74/94

77 Wavelet coefficients Zero-tree wavelets coding [Shapiro, 1993]; block thresholding [Cai, 1999]. Julien Mairal Optimization for Sparse Estimation 75/94

78 Sparse linear models for natural image patches Image restoration Julien Mairal Optimization for Sparse Estimation 76/94

79 Sparse linear models for natural image patches Image restoration Julien Mairal Optimization for Sparse Estimation 77/94

80 Structured dictionary for natural image patches [Jenatton, Mairal, Obozinski, and Bach, 2010] Julien Mairal Optimization for Sparse Estimation 78/94

81 Structured dictionary for natural image patches [Mairal, Jenatton, Obozinski, and Bach, 2011] Julien Mairal Optimization for Sparse Estimation 79/94

82 Tree of topics [Jenatton, Mairal, Obozinski, and Bach, 2010] Julien Mairal Optimization for Sparse Estimation 80/94

83 Metabolic network of the budding yeast from Rapaport, Zinovyev, Dutreix, Barillot, and Vert [2007] Julien Mairal Optimization for Sparse Estimation 81/94

84 Metabolic network of the budding yeast from Rapaport, Zinovyev, Dutreix, Barillot, and Vert [2007] Julien Mairal Optimization for Sparse Estimation 82/94

85 Questions about structured sparsity min_{β ∈ R^p} f(β) [convex, smooth] + λΩ(β) [regularization], where Ω should encode some a priori knowledge about β. In this part, we will see how to design structured sparsity-inducing functions Ω, and how to solve the corresponding estimation/inverse problems. Out of the scope of this part: consistency, recovery, theoretical properties (statistics...). Julien Mairal Optimization for Sparse Estimation 83/94

86 In 3D. Copyright G. Obozinski Julien Mairal Optimization for Sparse Estimation 84/94

87 What about more complicated norms? Copyright G. Obozinski Julien Mairal Optimization for Sparse Estimation 85/94

88 What about more complicated norms? Copyright G. Obozinski Julien Mairal Optimization for Sparse Estimation 86/94

89 Group Lasso [Turlach et al., 2005, Yuan and Lin, 2006, Zhao et al., 2009] The l_1/l_q-norm: Ω(β) = Σ_{g ∈ G} ‖β_g‖_q, where G is a partition of {1,...,p}; q = 2 or q = ∞ in practice; it can be interpreted as the l_1-norm of [‖β_g‖_q]_{g ∈ G}. Example: Ω(β) = ‖β_{{1,2}}‖_2 + |β_3|. Julien Mairal Optimization for Sparse Estimation 87/94

90 Structured sparsity with overlapping groups Warning: Under the name structured sparsity appear in fact significantly different formulations! 1 non-convex zero-tree wavelets [Shapiro, 1993]; predefined collection of sparsity patterns: [Baraniuk et al., 2010]; select a union of groups: [Huang et al., 2009]; structure via Markov Random Fields: [Cehver et al., 2008]; 2 convex (norms) tree-structure: [Zhao et al., 2009]; select a union of groups: [Jacob et al., 2009]; zero-pattern is a union of groups: [Jenatton et al., 2009]; other norms: [Micchelli et al., 2011]. Julien Mairal Optimization for Sparse Estimation 88/94

91 Group Lasso with overlapping groups [Jenatton, Audibert, and Bach, 2009] Ω(β) = Σ_{g ∈ G} ‖β_g‖_q. What happens when the groups overlap? The pattern of non-zero variables is an intersection of groups; the zero pattern is a union of groups. Example with overlapping groups: Ω(β) = ‖β_{{1,2}}‖_2 + ‖β_{{2,3}}‖_2 + |β_3|. Julien Mairal Optimization for Sparse Estimation 89/94

92 Hierarchical Norms [Zhao, Rocha, and Yu, 2009] A node can be active only if its ancestors are active. The selected patterns are rooted subtrees. Julien Mairal Optimization for Sparse Estimation 90/94

93 Proximal Gradient methods A few proximal operators: l_0-penalty: hard-thresholding; l_1-norm: soft-thresholding; group Lasso: group soft-thresholding; fused Lasso (1D total variation): [Hoefling, 2010]; hierarchical norms: [Jenatton et al., 2010], O(p) complexity; overlapping group Lasso with the l_∞-norm: [Mairal et al., 2010] (link with network flow optimization). Julien Mairal Optimization for Sparse Estimation 91/94

94 Modelling Patterns as Unions of Groups The non-convex penalty of Huang, Zhang, and Metaxas [2009]. Warning: different point of view than in the two previous slides. φ(β) = min_{J ⊆ G} { Σ_{g ∈ J} η_g s.t. Supp(β) ⊆ ∪_{g ∈ J} g }. The pattern of non-zeros in β is a union of (a few) groups. The penalty is non-convex and is NP-hard to compute (set cover problem). It can be rewritten as a boolean linear program: φ(β) = min_{x ∈ {0,1}^{|G|}} { η^T x s.t. Nx ≥ Supp(β) }. Julien Mairal Optimization for Sparse Estimation 92/94

95 Modelling Patterns as Unions of Groups Convex relaxation and the penalty of Jacob, Obozinski, and Vert [2009]. The penalty of Huang et al. [2009]: φ(β) = min_{x ∈ {0,1}^{|G|}} { η^T x s.t. Nx ≥ Supp(β) }. A convex LP-relaxation: ψ(β) = min_{x ∈ R_+^{|G|}} { η^T x s.t. Nx ≥ |β| }. Lemma: ψ is the penalty of Jacob et al. [2009] with the l_∞-norm: ψ(β) = min_{(ξ^g ∈ R^p)_{g ∈ G}} Σ_{g ∈ G} η_g ‖ξ^g‖_∞ s.t. β = Σ_{g ∈ G} ξ^g and, for all g, Supp(ξ^g) ⊆ g. Julien Mairal Optimization for Sparse Estimation 93/94

96 Modelling Patterns as Unions of Groups The norm of Jacob et al. [2009] in 3D ψ(β) with G = {{1,2},{2,3},{1,3}}. Julien Mairal Optimization for Sparse Estimation 94/94

97 References I M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54(11): , November R. G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, to appear. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1): , D. P. Bertsekas. Nonlinear programming. Athena Scientific Belmont, Mass, J. M. Borwein and A. S. Lewis. Convex analysis and nonlinear optimization: Theory and examples. Springer, Julien Mairal Optimization for Sparse Estimation 95/94

98 References II S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, P. Bühlmann and B. Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98 (462): , T.T. Cai. Adaptive wavelet estimation: a block thresholding and oracle inequality approach. Annals of Statistics, 27(3): , V. Cehver, M. F. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems, S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33 61, F. Couzinie-Devy, J. Mairal, F. Bach, and J. Ponce. Dictionary learning for deblurring and digital zoom. preprint arxiv: , Julien Mairal Optimization for Sparse Estimation 96/94

99 References III I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math, 57: , B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of statistics, 32(2): , M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 54(12): , December K. Engan, S. O. Aase, and J. H. Husoy. Frame based signal compression using method of optimal directions (MOD). In Proceedings of the 1999 IEEE International Symposium on Circuits Systems, volume 4, A. Haar. Zur theorie der orthogonalen funktionensysteme. Mathematische Annalen, 69: , Julien Mairal Optimization for Sparse Estimation 97/94

100 References IV H. Hoefling. A path algorithm for the fused lasso signal approximator. Journal of Computational and Graphical Statistics, 19(4): , J. Huang, Z. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), R. Jenatton, J-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, preprint arxiv: v1. R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), Julien Mairal Optimization for Sparse Estimation 98/94

101 References V D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2): , J. Mairal and B. Yu. Complexity analysis of the lasso regularization path. In Proceedings of the International Conference on Machine Learning (ICML), J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53 69, January 2008a. J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modelling and Simulation, 7(1): , April 2008b. Julien Mairal Optimization for Sparse Estimation 99/94

102 References VI J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. ArXiv: v1, submitted. J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems, J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. Journal of Machine Learning Research, 12: , S. Mallat. A Wavelet Tour of Signal Processing, Second Edition. Academic Press, New York, September S. Mallat and Z. Zhang. Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41(12): , C.A. Micchelli, J.M. Morales, and M. Pontil. Regularizers for structured sparsity. preprint arxiv: v2, Julien Mairal Optimization for Sparse Estimation 100/94

103 References VII Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k 2 ). Soviet Math. Dokl., 27: , Y. Nesterov. Introductory lectures on convex optimization. Kluwer Academic Publishers, Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, J. Nocedal and SJ Wright. Numerical Optimization. Springer: New York, nd Edition. B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37: , M. R. Osborne, B. Presnell, and B. A. Turlach. On the Lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319 37, Julien Mairal Optimization for Sparse Estimation 101/94

104 References VIII Y.C. Pati, R. Rezaiifar, and PS Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, M. Protter and M. Elad. Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18(1):27 36, F. Rapaport, A. Zinovyev, M. Dutreix, E. Barillot, and J.P. Vert. Classification of microarray data using gene networks. BMC bioinformatics, 8(1):35, J.M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12): , R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B, 58(1): , Julien Mairal Optimization for Sparse Estimation 102/94

105 References IX B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3): , M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68:49 67, P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. 37(6A): , Julien Mairal Optimization for Sparse Estimation 103/94

106 Appendix Julien Mairal Optimization for Sparse Estimation 104/94

107 Basic convex optimization tools: subgradients [Figure: gradients and subgradients for smooth and non-smooth functions: (a) smooth case, the tangent at β; (b) non-smooth case, several subgradients at β.] ∂f(β) = {κ ∈ R^p : f(β) + κ^T(β' − β) ≤ f(β') for all β' ∈ R^p}. Julien Mairal Optimization for Sparse Estimation 105/94

108 Basic convex optimization tools: subgradients Some nice properties: ∂f(β) = {g} iff f is differentiable at β and g = ∇f(β); many calculus rules, e.g., ∂(αf + βg) = α∂f + β∂g for α, β > 0; for more details, see Boyd and Vandenberghe [2004], Bertsekas [1999], Borwein and Lewis [2006] and S. Boyd's course at Stanford. Optimality conditions for g : R^p → R convex: if g is differentiable, β* minimizes g iff ∇g(β*) = 0; if g is nondifferentiable, β* minimizes g iff 0 ∈ ∂g(β*). Careful: the concept of subgradient requires a function to be above its tangents; it only makes sense for convex functions! Julien Mairal Optimization for Sparse Estimation 106/94

109 Basic convex optimization tools: dual norm Definition: let κ be in R^p, ‖κ‖_* = max_{β ∈ R^p : ‖β‖ ≤ 1} β^T κ. Exercises: ‖β‖_** = ‖β‖ (true in finite dimension); l_2 is dual to itself; l_1 and l_∞ are dual to each other; l_q and l_{q'} are dual to each other if 1/q + 1/q' = 1; similar relations hold for spectral norms on matrices. Moreover, ∂‖β‖ = {κ ∈ R^p s.t. ‖κ‖_* ≤ 1 and κ^T β = ‖β‖}. Julien Mairal Optimization for Sparse Estimation 107/94

110 Optimality conditions Let f : R^p → R be convex differentiable and ‖.‖ be any norm; consider min_{β ∈ R^p} f(β) + λ‖β‖. β is a solution if and only if 0 ∈ ∂(f(β) + λ‖β‖) = ∇f(β) + λ∂‖β‖. Since ∂‖β‖ = {κ ∈ R^p s.t. ‖κ‖_* ≤ 1 and κ^T β = ‖β‖}, the general optimality conditions are ‖∇f(β)‖_* ≤ λ and −∇f(β)^T β = λ‖β‖. Julien Mairal Optimization for Sparse Estimation 108/94

111 Convex Duality Strong Duality [Figure: the primal objective f(β) with minimizer β*, and the dual objective g(κ) with maximizer κ*.] Strong duality means that max_κ g(κ) = min_β f(β). Julien Mairal Optimization for Sparse Estimation 109/94

112 Convex Duality Duality Gaps [Figure: primal f(β), dual g(κ), and the duality gap δ(β̃, κ̃) between a primal candidate β̃ and a dual candidate κ̃.] Strong duality means that max_κ g(κ) = min_β f(β). The duality gap guarantees us that 0 ≤ f(β̃) − f(β*) ≤ δ(β̃, κ̃). Julien Mairal Optimization for Sparse Estimation 110/94

113 Part V (Bonus): Dictionary Learning and Matrix Factorization Julien Mairal Optimization for Sparse Estimation 111/94

114 Matrix Factorization and Clustering Let us cluster some training vectors y^1,...,y^m into p clusters using K-means: min_{(x_j)_{j=1}^p, (l_i)_{i=1}^m} Σ_{i=1}^m ‖y^i − x_{l_i}‖_2^2. It can be equivalently formulated as min_{X ∈ R^{n×p}, B ∈ {0,1}^{p×m}} Σ_{i=1}^m ‖y^i − Xβ^i‖_2^2 s.t. β^i ≥ 0 and Σ_{j=1}^p β^i_j = 1, which is a matrix factorization problem: min_{X ∈ R^{n×p}, B ∈ {0,1}^{p×m}} ‖Y − XB‖_F^2 s.t. B ≥ 0 and Σ_{j=1}^p β^i_j = 1. Julien Mairal Optimization for Sparse Estimation 112/94

115 Matrix Factorization and Clustering Hard clustering: min_{X ∈ R^{n×p}, B ∈ {0,1}^{p×m}} ‖Y − XB‖_F^2 s.t. B ≥ 0 and Σ_{j=1}^p β^i_j = 1, where X = [x_1,...,x_p] are the centroids of the p clusters. Soft clustering: min_{X ∈ R^{n×p}, B ∈ R^{p×m}} ‖Y − XB‖_F^2 s.t. B ≥ 0 and Σ_{j=1}^p β^i_j = 1, where X = [x_1,...,x_p] are again the centroids of the p clusters. Julien Mairal Optimization for Sparse Estimation 113/94

116 Other Matrix Factorization Problems PCA: min_{X ∈ R^{m×p}, B ∈ R^{p×n}} (1/2)‖Y − XB‖_F^2 s.t. X^T X = I and BB^T is diagonal. X = [x_1,...,x_p] are the principal components. Julien Mairal Optimization for Sparse Estimation 114/94

117 Other Matrix Factorization Problems Non-negative matrix factorization [Lee and Seung, 2001]: min_{X ∈ R^{m×p}, B ∈ R^{p×n}} (1/2)‖Y − XB‖_F^2 s.t. B ≥ 0 and X ≥ 0. Julien Mairal Optimization for Sparse Estimation 115/94

118 Dictionary Learning and Matrix Factorization [Olshausen and Field, 1997] min_{X ∈ 𝒳, B ∈ R^{p×n}} Σ_{i=1}^n (1/2)‖y^i − Xβ^i‖_2^2 + λ‖β^i‖_1, which is again a matrix factorization problem: min_{X ∈ 𝒳, B ∈ R^{p×n}} (1/2)‖Y − XB‖_F^2 + λ‖B‖_1. Julien Mairal Optimization for Sparse Estimation 116/94

119 Why have a unified point of view? min_{X ∈ 𝒳, B ∈ ℬ} (1/2)‖Y − XB‖_F^2 + λψ(B). The same framework covers NMF, sparse PCA, dictionary learning, clustering, and topic modelling; one can play with various constraints/penalties on B (coefficients) and on X (loadings, dictionary, centroids); the same algorithms apply (no need to reinvent the wheel): alternating minimization, online learning [Mairal et al., 2009]. A toy sketch of the alternating strategy follows. Julien Mairal Optimization for Sparse Estimation 117/94
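A toy NumPy sketch of alternating minimization for the l_1-penalized formulation: a sparse-coding step (proximal gradient on B with X fixed) alternated with a dictionary step (least squares on X followed by projection onto unit-norm columns). This is only an illustration with names, initialization, and iteration counts of our choosing, not the online algorithm of Mairal et al. [2009].

import numpy as np

def dict_learning(Y, p, lam, n_outer=20, n_inner=50, seed=0):
    # min_{X, B} 0.5*||Y - X B||_F^2 + lam*||B||_1, with unit-norm columns for X
    rng = np.random.RandomState(seed)
    m, n = Y.shape
    X = rng.randn(m, p)
    X /= np.linalg.norm(X, axis=0, keepdims=True)
    B = np.zeros((p, n))
    for _ in range(n_outer):
        # sparse-coding step: ISTA on B with the dictionary X fixed
        L = max(np.linalg.norm(X.T @ X, 2), 1e-12)
        for _ in range(n_inner):
            Z = B - X.T @ (X @ B - Y) / L
            B = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)
        # dictionary step: least squares on X, then renormalize the columns (simple heuristic)
        X = Y @ B.T @ np.linalg.pinv(B @ B.T)
        X /= np.maximum(np.linalg.norm(X, axis=0, keepdims=True), 1e-12)
    return X, B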

120 The Image Denoising Problem y [measurements] = x_orig [original image] + ε [noise]. Julien Mairal Optimization for Sparse Estimation 118/94

121 Sparse representations for image restoration y [measurements] = x_orig [original image] + ε [noise]. Energy minimization problem - MAP estimation: E(x) = ‖y − x‖_2^2 [relation to measurements] + ψ(x) [image model (-log prior)]. Some classical priors: smoothness λ‖Lx‖_2^2; total variation λ‖∇x‖_1^2; MRF priors. Julien Mairal Optimization for Sparse Estimation 119/94

122 Sparse representations for image restoration Designed dictionaries [Haar, 1910], [Zweig, Morlet, Grossman 70s], [Meyer, Mallat, Daubechies, Coifman, Donoho, Candes 80s-today]... (see [Mallat, 1999]): wavelets, curvelets, wedgelets, bandlets,... -lets. Learned dictionaries of patches [Olshausen and Field, 1997], [Engan et al., 1999], [Lewicki and Sejnowski, 2000], [Aharon et al., 2006]: min_{β^i, X ∈ C} Σ_i (1/2)‖y^i − Xβ^i‖_2^2 [reconstruction] + λψ(β^i) [sparsity], with ψ(β) = ‖β‖_0 (l_0 pseudo-norm) or ψ(β) = ‖β‖_1 (l_1 norm). Julien Mairal Optimization for Sparse Estimation 120/94

123 Sparse representations for image restoration Solving the denoising problem [Elad and Aharon, 2006]: extract all overlapping 8 × 8 patches y^i; solve a matrix factorization problem, min_{β^i, X ∈ C} Σ_{i=1}^n (1/2)‖y^i − Xβ^i‖_2^2 [reconstruction] + λψ(β^i) [sparsity], with n > 100,000; average the reconstructions of each patch. Julien Mairal Optimization for Sparse Estimation 121/94

124 Sparse representations for image restoration K-SVD: [Elad and Aharon, 2006] Figure: Dictionary trained on a noisy version of the image boat. Julien Mairal Optimization for Sparse Estimation 122/94

125 Sparse representations for image restoration Grayscale vs color image patches Julien Mairal Optimization for Sparse Estimation 123/94

126 Sparse representations for image restoration [Mairal, Sapiro, and Elad, 2008b] Julien Mairal Optimization for Sparse Estimation 124/94

127 Sparse representations for image restoration Inpainting, [Mairal, Elad, and Sapiro, 2008a] Julien Mairal Optimization for Sparse Estimation 125/94

128 Sparse representations for image restoration Inpainting, [Mairal, Elad, and Sapiro, 2008a] Julien Mairal Optimization for Sparse Estimation 126/94

129 Sparse representations for video restoration Key ideas for video processing [Protter and Elad, 2009] Using a 3D dictionary. Processing of many frames at the same time. Dictionary propagation. Julien Mairal Optimization for Sparse Estimation 127/94

130 Sparse representations for image restoration Inpainting, [Mairal, Sapiro, and Elad, 2008b] Figure: Inpainting results. Julien Mairal Optimization for Sparse Estimation 128/94

131 Sparse representations for image restoration Inpainting, [Mairal, Sapiro, and Elad, 2008b] Figure: Inpainting results. Julien Mairal Optimization for Sparse Estimation 129/94

132 Sparse representations for image restoration Inpainting, [Mairal, Sapiro, and Elad, 2008b] Figure: Inpainting results. Julien Mairal Optimization for Sparse Estimation 130/94

133 Sparse representations for image restoration Inpainting, [Mairal, Sapiro, and Elad, 2008b] Figure: Inpainting results. Julien Mairal Optimization for Sparse Estimation 131/94

134 Sparse representations for image restoration Inpainting, [Mairal, Sapiro, and Elad, 2008b] Figure: Inpainting results. Julien Mairal Optimization for Sparse Estimation 132/94

135 Sparse representations for image restoration Color video denoising, [Mairal, Sapiro, and Elad, 2008b] Figure: Denoising results. σ = 25 Julien Mairal Optimization for Sparse Estimation 133/94

136 Sparse representations for image restoration Color video denoising, [Mairal, Sapiro, and Elad, 2008b] Figure: Denoising results. σ = 25 Julien Mairal Optimization for Sparse Estimation 134/94

137 Sparse representations for image restoration Color video denoising, [Mairal, Sapiro, and Elad, 2008b] Figure: Denoising results. σ = 25 Julien Mairal Optimization for Sparse Estimation 135/94

138 Sparse representations for image restoration Color video denoising, [Mairal, Sapiro, and Elad, 2008b] Figure: Denoising results. σ = 25 Julien Mairal Optimization for Sparse Estimation 136/94

139 Sparse representations for image restoration Color video denoising, [Mairal, Sapiro, and Elad, 2008b] Figure: Denoising results. σ = 25 Julien Mairal Optimization for Sparse Estimation 137/94

140 Digital Zooming [Couzinie-Devy et al., 2011], Original Julien Mairal Optimization for Sparse Estimation 138/94

141 Digital Zooming [Couzinie-Devy et al., 2011], Bicubic Julien Mairal Optimization for Sparse Estimation 139/94

142 Digital Zooming [Couzinie-Devy et al., 2011], Proposed method Julien Mairal Optimization for Sparse Estimation 140/94

143 Digital Zooming [Couzinie-Devy et al., 2011], Original Julien Mairal Optimization for Sparse Estimation 141/94

144 Digital Zooming [Couzinie-Devy et al., 2011], Bicubic Julien Mairal Optimization for Sparse Estimation 142/94

145 Digital Zooming [Couzinie-Devy et al., 2011], Proposed approach Julien Mairal Optimization for Sparse Estimation 143/94

146 Image Deblurring [Couzinie-Devy et al., 2011], Original Julien Mairal Optimization for Sparse Estimation 144/94

147 Image Deblurring [Couzinie-Devy et al., 2011], Blurry and Noisy Julien Mairal Optimization for Sparse Estimation 145/94

148 Image Deblurring [Couzinie-Devy et al., 2011], Result Julien Mairal Optimization for Sparse Estimation 146/94

149 Image Deblurring [Couzinie-Devy et al., 2011], Original Julien Mairal Optimization for Sparse Estimation 147/94

150 Image Deblurring [Couzinie-Devy et al., 2011], Blurry and Noisy Julien Mairal Optimization for Sparse Estimation 148/94

151 Image Deblurring [Couzinie-Devy et al., 2011], Result Julien Mairal Optimization for Sparse Estimation 149/94

152 Inverse half-toning Original Julien Mairal Optimization for Sparse Estimation 150/94

153 Inverse half-toning Reconstructed image Julien Mairal Optimization for Sparse Estimation 151/94

154 Inverse half-toning Reconstructed image Julien Mairal Optimization for Sparse Estimation 152/94

155 Inverse half-toning Original Julien Mairal Optimization for Sparse Estimation 153/94

156 Inverse half-toning Reconstructed image Julien Mairal Optimization for Sparse Estimation 154/94

157 Inverse half-toning Original Julien Mairal Optimization for Sparse Estimation 155/94

158 Inverse half-toning Reconstructed image Julien Mairal Optimization for Sparse Estimation 156/94

159 Inverse half-toning Original Julien Mairal Optimization for Sparse Estimation 157/94

160 Inverse half-toning Reconstructed image Julien Mairal Optimization for Sparse Estimation 158/94

161 Inverse half-toning Original Julien Mairal Optimization for Sparse Estimation 159/94

162 Inverse half-toning Reconstructed image Julien Mairal Optimization for Sparse Estimation 160/94


More information

Graph Wavelets to Analyze Genomic Data with Biological Networks

Graph Wavelets to Analyze Genomic Data with Biological Networks Graph Wavelets to Analyze Genomic Data with Biological Networks Yunlong Jiao and Jean-Philippe Vert "Emerging Topics in Biological Networks and Systems Biology" symposium, Swedish Collegium for Advanced

More information

2.3. Clustering or vector quantization 57

2.3. Clustering or vector quantization 57 Multivariate Statistics non-negative matrix factorisation and sparse dictionary learning The PCA decomposition is by construction optimal solution to argmin A R n q,h R q p X AH 2 2 under constraint :

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

Adaptive Compressive Imaging Using Sparse Hierarchical Learned Dictionaries

Adaptive Compressive Imaging Using Sparse Hierarchical Learned Dictionaries Adaptive Compressive Imaging Using Sparse Hierarchical Learned Dictionaries Jarvis Haupt University of Minnesota Department of Electrical and Computer Engineering Supported by Motivation New Agile Sensing

More information

Sparse Approximation and Variable Selection

Sparse Approximation and Variable Selection Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation

More information

Fast Hard Thresholding with Nesterov s Gradient Method

Fast Hard Thresholding with Nesterov s Gradient Method Fast Hard Thresholding with Nesterov s Gradient Method Volkan Cevher Idiap Research Institute Ecole Polytechnique Federale de ausanne volkan.cevher@epfl.ch Sina Jafarpour Department of Computer Science

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

Sparse Optimization Lecture: Basic Sparse Optimization Models

Sparse Optimization Lecture: Basic Sparse Optimization Models Sparse Optimization Lecture: Basic Sparse Optimization Models Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know basic l 1, l 2,1, and nuclear-norm

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.

More information

6. Regularized linear regression

6. Regularized linear regression Foundations of Machine Learning École Centrale Paris Fall 2015 6. Regularized linear regression Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe agathe.azencott@mines paristech.fr

More information

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT Outline Variable Selection Subset Selection Greedy Methods: (Orthogonal) Matching Pursuit Convex Relaxation: LASSO & Elastic Net

More information

A Family of Penalty Functions for Structured Sparsity

A Family of Penalty Functions for Structured Sparsity A Family of Penalty Functions for Structured Sparsity Charles A. Micchelli Department of Mathematics City University of Hong Kong 83 Tat Chee Avenue, Kowloon Tong Hong Kong charles micchelli@hotmail.com

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

DNNs for Sparse Coding and Dictionary Learning

DNNs for Sparse Coding and Dictionary Learning DNNs for Sparse Coding and Dictionary Learning Subhadip Mukherjee, Debabrata Mahapatra, and Chandra Sekhar Seelamantula Department of Electrical Engineering, Indian Institute of Science, Bangalore 5612,

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information

c 2011 International Press Vol. 18, No. 1, pp , March DENNIS TREDE

c 2011 International Press Vol. 18, No. 1, pp , March DENNIS TREDE METHODS AND APPLICATIONS OF ANALYSIS. c 2011 International Press Vol. 18, No. 1, pp. 105 110, March 2011 007 EXACT SUPPORT RECOVERY FOR LINEAR INVERSE PROBLEMS WITH SPARSITY CONSTRAINTS DENNIS TREDE Abstract.

More information

Online Learning for Matrix Factorization and Sparse Coding

Online Learning for Matrix Factorization and Sparse Coding Online Learning for Matrix Factorization and Sparse Coding Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro To cite this version: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro. Online

More information

Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France

Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France remi.gribonval@inria.fr Structure of the tutorial Session 1: Introduction to inverse problems & sparse

More information

Recent developments on sparse representation

Recent developments on sparse representation Recent developments on sparse representation Zeng Tieyong Department of Mathematics, Hong Kong Baptist University Email: zeng@hkbu.edu.hk Hong Kong Baptist University Dec. 8, 2008 First Previous Next Last

More information

A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases

A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases 2558 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 48, NO 9, SEPTEMBER 2002 A Generalized Uncertainty Principle Sparse Representation in Pairs of Bases Michael Elad Alfred M Bruckstein Abstract An elementary

More information

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint B.A. Turlach School of Mathematics and Statistics (M19) The University of Western Australia 35 Stirling Highway,

More information

Efficient Methods for Overlapping Group Lasso

Efficient Methods for Overlapping Group Lasso Efficient Methods for Overlapping Group Lasso Lei Yuan Arizona State University Tempe, AZ, 85287 Lei.Yuan@asu.edu Jun Liu Arizona State University Tempe, AZ, 85287 j.liu@asu.edu Jieping Ye Arizona State

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

SIGNAL SEPARATION USING RE-WEIGHTED AND ADAPTIVE MORPHOLOGICAL COMPONENT ANALYSIS

SIGNAL SEPARATION USING RE-WEIGHTED AND ADAPTIVE MORPHOLOGICAL COMPONENT ANALYSIS TR-IIS-4-002 SIGNAL SEPARATION USING RE-WEIGHTED AND ADAPTIVE MORPHOLOGICAL COMPONENT ANALYSIS GUAN-JU PENG AND WEN-LIANG HWANG Feb. 24, 204 Technical Report No. TR-IIS-4-002 http://www.iis.sinica.edu.tw/page/library/techreport/tr204/tr4.html

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

EUSIPCO

EUSIPCO EUSIPCO 013 1569746769 SUBSET PURSUIT FOR ANALYSIS DICTIONARY LEARNING Ye Zhang 1,, Haolong Wang 1, Tenglong Yu 1, Wenwu Wang 1 Department of Electronic and Information Engineering, Nanchang University,

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

Structured sparsity-inducing norms through submodular functions

Structured sparsity-inducing norms through submodular functions Structured sparsity-inducing norms through submodular functions Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with R. Jenatton, J. Mairal, G. Obozinski IPAM - February 2013 Outline

More information

On Optimal Frame Conditioners

On Optimal Frame Conditioners On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,

More information

Sparsity and the Lasso

Sparsity and the Lasso Sparsity and the Lasso Statistical Machine Learning, Spring 205 Ryan Tibshirani (with Larry Wasserman Regularization and the lasso. A bit of background If l 2 was the norm of the 20th century, then l is

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

The Iteration-Tuned Dictionary for Sparse Representations

The Iteration-Tuned Dictionary for Sparse Representations The Iteration-Tuned Dictionary for Sparse Representations Joaquin Zepeda #1, Christine Guillemot #2, Ewa Kijak 3 # INRIA Centre Rennes - Bretagne Atlantique Campus de Beaulieu, 35042 Rennes Cedex, FRANCE

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

Proximal Methods for Hierarchical Sparse Coding

Proximal Methods for Hierarchical Sparse Coding Proximal Methods for Hierarchical Sparse Coding Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach To cite this version: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Lasso applications: regularisation and homotopy

Lasso applications: regularisation and homotopy Lasso applications: regularisation and homotopy M.R. Osborne 1 mailto:mike.osborne@anu.edu.au 1 Mathematical Sciences Institute, Australian National University Abstract Ti suggested the use of an l 1 norm

More information

Tractable Upper Bounds on the Restricted Isometry Constant

Tractable Upper Bounds on the Restricted Isometry Constant Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Lecture 2 February 25th

Lecture 2 February 25th Statistical machine learning and convex optimization 06 Lecture February 5th Lecturer: Francis Bach Scribe: Guillaume Maillard, Nicolas Brosse This lecture deals with classical methods for convex optimization.

More information

Online Dictionary Learning with Group Structure Inducing Norms

Online Dictionary Learning with Group Structure Inducing Norms Online Dictionary Learning with Group Structure Inducing Norms Zoltán Szabó 1, Barnabás Póczos 2, András Lőrincz 1 1 Eötvös Loránd University, Budapest, Hungary 2 Carnegie Mellon University, Pittsburgh,

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Fast Regularization Paths via Coordinate Descent

Fast Regularization Paths via Coordinate Descent user! 2009 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerome Friedman and Rob Tibshirani. user! 2009 Trevor

More information

Compressed Sensing and Neural Networks

Compressed Sensing and Neural Networks and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

Generalized Conditional Gradient and Its Applications

Generalized Conditional Gradient and Its Applications Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized

More information

27: Case study with popular GM III. 1 Introduction: Gene association mapping for complex diseases 1

27: Case study with popular GM III. 1 Introduction: Gene association mapping for complex diseases 1 10-708: Probabilistic Graphical Models, Spring 2015 27: Case study with popular GM III Lecturer: Eric P. Xing Scribes: Hyun Ah Song & Elizabeth Silver 1 Introduction: Gene association mapping for complex

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

Basis Pursuit Denoising and the Dantzig Selector

Basis Pursuit Denoising and the Dantzig Selector BPDN and DS p. 1/16 Basis Pursuit Denoising and the Dantzig Selector West Coast Optimization Meeting University of Washington Seattle, WA, April 28 29, 2007 Michael Friedlander and Michael Saunders Dept

More information