Winter Conference in Statistics Compressed Sensing. LECTURE #7-8 Algorithms for low-dimensional models
1 Winter Conference in Statistics 2013 Compressed Sensing LECTURE #7-8 Algorithms for low-dimensional models Prof. Dr. Volkan Cevher LIONS/Laboratory for Information and Inference Systems
2 Convex Algorithms for Low-Dimensional Models And, this is how you solve huge-dimensional problems
3 The classical problem templates. Criteria seen above have the form $\|x\|_1 = \sum_{i=1}^{N} |[x]_i|$. Basis pursuit (BP) [Chen, Donoho, Saunders, 1998]: $\min_x \|x\|_1$ s.t. $Ax = u$. BP denoising (BPDN) [Chen, Donoho, Saunders, 1998]: $\min_x \|x\|_1$ s.t. $\|Ax - u\|_2 \le \varepsilon$. Also well known: the LASSO (least absolute shrinkage and selection operator) [Tibshirani, 1996]: $\min_x \|Ax - u\|_2^2$ s.t. $\|x\|_1 \le \tau$. All can be written as $\hat{x} \in \arg\min_{x \in \mathbb{R}^N} f_1(x) + f_2(x)$.
4 Convex optimization and proximal algorithms. $\hat{x} \in \arg\min_{x \in \mathbb{R}^N} f_1(x) + f_2(x)$, where $f_1 : \mathbb{R}^N \to \mathbb{R}$ is the data-fidelity term (convex, smooth; typically $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$) and $f_2 : \mathbb{R}^N \to \bar{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$ is a convex regularizer (maybe non-smooth, e.g. $\ell_1$; the non-convex case comes later). Difficulties: non-smoothness and large dimension ($N \gg 1$).
5 Constrained vs unconstrained formulations. Constrained optimization formulations can be written as $\hat{x} \in \arg\min_{x \in \mathbb{R}^N} f_1(x) + f_2(x)$ using indicator functions: $\iota_S(x) = 0$ if $x \in S$, $+\infty$ if $x \notin S$. Example: $\hat{x} \in \arg\min_{x \in \mathbb{R}^N} f_1(x) + \iota_{\{x : g(x) \le \nu\}}(x)$ is the same as minimizing $f_1(x)$ subject to $g(x) \le \nu$. Classical example: the LASSO.
6 Convex and strictly convex sets. $S$ is convex if $x, x' \in S \Rightarrow \theta x + (1-\theta)x' \in S$ for all $\theta \in [0,1]$. $S$ is strictly convex if $x, x' \in S \Rightarrow \theta x + (1-\theta)x' \in \mathrm{int}(S)$ for all $\theta \in (0,1)$. (Illustrations: convex vs. non-convex sets; a convex-but-not-strictly-convex set vs. a strictly convex set.)
7 Convex and strictly convex functions. Extended real valued function: $f : \mathbb{R}^N \to \bar{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$. Domain of a function: $\mathrm{dom}(f) = \{x : f(x) \ne +\infty\}$. $f$ is a convex function if $f(\theta x + (1-\theta)x') \le \theta f(x) + (1-\theta) f(x')$ for all $\theta \in [0,1]$ and $x, x' \in \mathrm{dom}(f)$; $f$ is strictly convex if the inequality is strict for $\theta \in (0,1)$ and $x \ne x'$. (Illustrations: non-convex, convex, strictly convex, and convex-but-not-strictly-convex functions.)
8 Convexity, coercivity, and minima. $f : \mathbb{R}^N \to \bar{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$ is coercive if $\lim_{\|x\| \to +\infty} f(x) = +\infty$. Let $G = \arg\min_x f(x)$. If $f$ is coercive, then $G$ is a non-empty set; if $f$ is strictly convex, then $G$ has at most one element. (Illustrations: coercive and strictly convex, $G = \{x^*\}$; coercive, not strictly convex; convex, not coercive, $G = \emptyset$.)
9 Euclidean projections on convex sets. Our problem: $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} f_1(x) + f_2(x)$; consider $f_2(x) = \iota_S(x)$ (convex if $S$ is convex) and $f_1(x) = \frac{1}{2}\|u - x\|_2^2$ (strictly convex). Then $\hat{x} = \arg\min_{x \in \mathbb{R}^n} f_1(x) + f_2(x) = \arg\min_{x \in S} \|u - x\|_2^2 \equiv P_S(u)$, the Euclidean projection of $u$ onto $S$.
10 Projected gradient algorithm. Our problem: $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} f_1(x) + f_2(x)$ with $f_2(x) = \iota_S(x)$ ($S$ a convex set) and $f_1$ some function, e.g. $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$. Projected gradient algorithm: $x_{k+1} = P_S\big(x_k - \gamma_k \nabla f_1(x_k)\big)$, where $\gamma_k$ is the step size; if $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$, then $x_{k+1} = P_S\big(x_k - \gamma_k A^T(Ax_k - u)\big)$.
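A minimal numerical sketch (not from the slides) of this projected-gradient iteration in Python, for $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$ with $S$ taken to be the nonnegative orthant; the random problem data and the step-size choice are illustrative assumptions.

    import numpy as np

    def projected_gradient(A, u, project, num_iters=200, gamma=None):
        """Minimize 0.5*||A x - u||_2^2 over a set S given by its projector `project`."""
        if gamma is None:
            gamma = 0.99 / np.linalg.norm(A, 2) ** 2   # step size below 1/L, L = ||A||_2^2
        x = np.zeros(A.shape[1])
        for _ in range(num_iters):
            grad = A.T @ (A @ x - u)                   # gradient of the smooth term
            x = project(x - gamma * grad)              # gradient step, then projection P_S
        return x

    # Example: S = nonnegative orthant, whose projection is a componentwise max with 0.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    u = A @ np.abs(rng.standard_normal(20))
    x_hat = projected_gradient(A, u, project=lambda z: np.maximum(z, 0.0))
    print(x_hat.min() >= 0.0)   # True: every iterate stays in S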
11 Detour: majorization-minimization (MM). Problem: $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} f(x)$. $Q(x; x_k)$ is a majorizer of $f$ at $x_k$ if $Q(x; x_k) \ge f(x)$ for all $x$ and $Q(x_k; x_k) = f(x_k)$. MM algorithm: $x_{k+1} = \arg\min_x Q(x; x_k)$. Monotonicity: $f(x_{k+1}) \le Q(x_{k+1}; x_k) \le Q(x_k; x_k) = f(x_k)$.
12 Projected gradient from majorization-minimization. Our problem: $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} f_1(x) + f_2(x)$ with $f_2(x) = \iota_S(x)$ ($S$ a convex set) and $f_1$ having an $L$-Lipschitz gradient; e.g. $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2 \Rightarrow L = \lambda_{\max}(A^T A) = \|A\|_2^2$ (the Hessian of $f_1$ is $A^T A$). A separable approximation of $f_1$: $Q(x; x_k) = f_1(x_k) + (x - x_k)^T \nabla f_1(x_k) + \frac{1}{2\gamma_k}\|x - x_k\|_2^2$.
13 Projected gradient from majorization-minimization. Our problem: $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} f_1(x) + \iota_S(x)$. Separable approximation of $f_1$: $Q(x; x_k) = f_1(x_k) + (x - x_k)^T \nabla f_1(x_k) + \frac{1}{2\gamma_k}\|x - x_k\|_2^2$. $Q(x; x_k)$ is a majorizer of $f_1$ if $\gamma_k < 1/L$, so $Q(x; x_k) + \iota_S(x)$ is a majorizer of $f_1(x) + \iota_S(x)$. MM algorithm: $x_{k+1} = \arg\min_x Q(x; x_k) + \iota_S(x) = \arg\min_x \frac{1}{2}\|x - (x_k - \gamma_k \nabla f_1(x_k))\|_2^2 + \iota_S(x) = P_S\big(x_k - \gamma_k \nabla f_1(x_k)\big)$: projected gradient.
14 Proximity operators. Our problem: $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} f_1(x) + f_2(x)$ with $f_2$ a convex function and $f_1(x) = \frac{1}{2}\|u - x\|_2^2$ (strictly convex). Then $\hat{x} = \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|u - x\|_2^2 + f_2(x) \equiv \mathrm{prox}_{f_2}(u)$: the proximity operator [Moreau 62], [Combettes 01]. It generalizes the notion of Euclidean projection.
15 Proximity operators (linear). $\mathrm{prox}_f(u) = \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|u - x\|_2^2 + f(x)$ (a map $\mathbb{R}^N \to \mathbb{R}^N$). Classical cases: squared $\ell_2$ regularizer $f(x) = \frac{\lambda}{2}\|x\|_2^2$: $\mathrm{prox}_f(u) = \arg\min_x \frac{1}{2}\|u - x\|_2^2 + \frac{\lambda}{2}\|x\|_2^2 = \frac{u}{1 + \lambda}$; squared $\ell_2$ regularizer with analysis operator $D$, $f(x) = \frac{\lambda}{2}\|Dx\|_2^2$: $\mathrm{prox}_f(u) = \arg\min_x \frac{1}{2}\|u - x\|_2^2 + \frac{\lambda}{2}\|Dx\|_2^2 = (I + \lambda D^T D)^{-1}u$; if $D$ is a circulant matrix, $O(N \log N)$ cost using the FFT.
16 Proximity operator of the $\ell_1$ norm. $\mathrm{prox}_{\lambda\|\cdot\|_1}(u) = \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|u - x\|_2^2 + \lambda\|x\|_1$. Separable: solve w.r.t. each component, $\min_x \lambda|x| + 0.5(x - u)^2$. Possible approach: write $|x| = \max_{|z| \le 1} zx$, so $\min_x \max_{|z| \le 1} \lambda z x + 0.5(x - u)^2 = \max_{|z| \le 1} \min_x \lambda z x + 0.5(x - u)^2$; the inner minimizer is $x = u - \lambda z$, leaving $\max_{|z| \le 1} -0.5\lambda^2 z^2 + \lambda z u$, whose solution is $z = u/\lambda$ if $|u| \le \lambda$, $z = 1$ if $u > \lambda$, and $z = -1$ if $u < -\lambda$.
17 Proximity operator of the $\ell_1$ norm: soft thresholding. $\mathrm{soft}(u, \lambda) = \mathrm{sign}(u)\max\{0, |u| - \lambda\}$, and $\mathrm{soft}(\cdot, \lambda) = \mathrm{prox}_{\lambda|\cdot|}$ (for vectors, $\mathrm{soft}(u, \lambda)$ is applied component-wise). The $p$-th power of the $\ell_p$ norm, $\|x\|_p^p = \sum_i |[x]_i|^p$, has a closed-form prox for $p \in \{1, 2, 4/3, 3/2, 3, 4\}$ [Combettes, Wajs, 2005].
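A direct Python implementation of the soft-thresholding operator defined above; it operates componentwise on NumPy arrays, as the slide notes.

    import numpy as np

    def soft(u, lam):
        """Soft thresholding, sign(u) * max(0, |u| - lam): the prox of lam*|.|, componentwise."""
        return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

    print(soft(np.array([-3.0, -0.5, 0.2, 2.0]), 1.0))   # -> [-2. -0.  0.  1.]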
18 Dual norms, proximity operators, and projections. Dual norm: for some norm $\|\cdot\| : \mathbb{R}^N \to \mathbb{R}_+$, its dual norm is $\|x\|_* = \max_{\|z\| \le 1} \langle x, z \rangle$. The dual norm of $\|\cdot\|_p$ is $\|\cdot\|_q$, where $\frac{1}{p} + \frac{1}{q} = 1$ (Hölder conjugates); this is a simple corollary of Hölder's inequality. Examples of Hölder conjugates: $(2, 2), (1, +\infty), (3/2, 3), \ldots$ These concepts are related through $\mathrm{prox}_{\lambda\|\cdot\|}(u) = u - P_{\{x : \|x\|_* \le \lambda\}}(u)$ [Combettes, Wajs, 2005].
19 Dual norms, proximity operators, and projections. $\mathrm{prox}_{\lambda\|\cdot\|}(u) = u - P_{\{x : \|x\|_* \le \lambda\}}(u)$. This relation underlies our earlier derivation of $\mathrm{prox}_{\lambda\|\cdot\|_1}$: $\mathrm{prox}_{\lambda\|\cdot\|_1}(u) = u - P_{\{x : \|x\|_\infty \le \lambda\}}(u)$, where $\|x\|_\infty = \max_i |[x]_i|$. It is all separable: $\mathrm{prox}_{\lambda|\cdot|}(u) = u - P_{\{x : |x| \le \lambda\}}(u) = \mathrm{soft}(u, \lambda)$.
20 Dual norms, proximity operators, and projections. $\mathrm{prox}_{\lambda\|\cdot\|}(u) = u - P_{\{x : \|x\|_* \le \lambda\}}(u)$. This relation allows deriving $\mathrm{prox}_{\lambda\|\cdot\|_\infty}$ and $\mathrm{prox}_{\lambda\|\cdot\|_2}$: $\mathrm{prox}_{\lambda\|\cdot\|_\infty}(u) = u - P_{\{x : \|x\|_1 \le \lambda\}}(u)$ (projection on the $\ell_1$ ball of radius $\lambda$, $O(n \log n)$ cost); $\mathrm{prox}_{\lambda\|\cdot\|_2}(u) = u - P_{\{x : \|x\|_2 \le \lambda\}}(u) = \frac{u}{\|u\|_2}\max\{0, \|u\|_2 - \lambda\}$: vector soft thresholding.
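A short Python sketch of the vector soft-thresholding formula above (the prox of $\lambda\|\cdot\|_2$); the example values are illustrative.

    import numpy as np

    def prox_l2_norm(u, lam):
        """Prox of lam*||.||_2 (vector soft thresholding): shrink the whole vector toward 0."""
        norm_u = np.linalg.norm(u)
        if norm_u <= lam:
            return np.zeros_like(u)
        return (1.0 - lam / norm_u) * u   # equals (u / ||u||_2) * max(0, ||u||_2 - lam)

    print(prox_l2_norm(np.array([3.0, 4.0]), lam=1.0))   # norm 5 shrinks to norm 4 -> [2.4 3.2]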
21 Proximity operators of atomic norms. $\mathrm{prox}_{\lambda\|\cdot\|}(u) = u - P_{\{x : \|x\|_* \le \lambda\}}(u)$. This relation allows deriving prox operators of atomic norms, $\|x\|_\mathcal{A} = \inf\{t > 0 : x \in t\,\mathrm{conv}(\mathcal{A})\}$. The dual of an atomic norm: $\|x\|_\mathcal{A}^* = \max_{\|z\|_\mathcal{A} \le 1} \langle z, x \rangle = \max_{z \in \mathrm{conv}(\mathcal{A})} \langle z, x \rangle = \max\{\langle a, x \rangle : a \in \mathcal{A}\}$. Hence $P_{\{x : \|x\|_\mathcal{A}^* \le \lambda\}}(u) = \arg\min_{\langle a, x \rangle \le \lambda,\ \forall a \in \mathcal{A}} \|u - x\|_2^2$ and $\mathrm{prox}_{\lambda\|\cdot\|_\mathcal{A}}(u) = u - \arg\min_{\langle a, x \rangle \le \lambda,\ \forall a \in \mathcal{A}} \|u - x\|_2^2$.
22 Proximity operators of atomic norms: $\ell_1$. Deriving $\mathrm{prox}_{\lambda\|\cdot\|_1}$ from the atomic norm view: $\|x\|_1 = \|x\|_\mathcal{A}$ with $\mathcal{A} = \{e_1, e_2, \ldots, e_N, -e_1, \ldots, -e_N\}$, $|\mathcal{A}| = 2N$. $\|x\|_\mathcal{A}^* = \max\{\langle a, x \rangle : a \in \mathcal{A}\} = \max_i |[x]_i| = \|x\|_\infty$, hence $\mathrm{prox}_{\lambda\|\cdot\|_1}(u) = u - P_{\{x : \|x\|_\infty \le \lambda\}}(u) = \mathrm{soft}(u, \lambda)$.
23 Proximity operators of atomic norms: $\ell_\infty$. Deriving $\mathrm{prox}_{\lambda\|\cdot\|_\infty}$ from the atomic norm view: $\|x\|_\infty = \|x\|_\mathcal{A}$ with $\mathcal{A} = \{-1, +1\}^N$, $|\mathcal{A}| = 2^N$. $\|x\|_\mathcal{A}^* = \max\{\langle a, x \rangle : a \in \mathcal{A}\} = \sum_{i=1}^{N} |[x]_i| = \|x\|_1$, hence $\mathrm{prox}_{\lambda\|\cdot\|_\infty}(u) = u - P_{\{x : \|x\|_1 \le \lambda\}}(u)$.
24 Proximity of atomic norms: matrix nuclear norm. Matrix nuclear norm: $\|X\|_* = \sum_i \sigma_i(X) = \sum_i \sqrt{\lambda_i(X^T X)}$. $\|X\|_* = \|X\|_\mathcal{A}$ with $\mathcal{A} = \{Z : \mathrm{rank}(Z) = 1, \|Z\|_F = 1\}$, where $\mathrm{rank}(Z) = |\{\sigma_i(Z) \ne 0\}|$ and the Frobenius norm is $\|Z\|_F^2 = \sum_{ij}[Z]_{ij}^2 = \sum_i \sigma_i^2(Z)$. The dual: $\|X\|_\mathcal{A}^* = \max\{\langle Z, X \rangle : Z \in \mathcal{A}\} = \max\{\sum_i \sigma_i(Z)\sigma_i(X) : \mathrm{rank}(Z) = 1, \sum_i \sigma_i^2(Z) = 1\} = \sigma_{\max}(X) = \|X\|_2$, the spectral norm.
25 Proximity of atomic norms: matrix nuclear norm. Euclidean matrix projection: $P_S(X) = \arg\min_{Z \in S} \|Z - X\|_F^2$. Note: for any unitary matrix $U$ ($U^T U = I$, $UU^T = I$), $\|UM\|_F^2 = \mathrm{trace}(M^T U^T U M) = \mathrm{trace}(M^T M) = \|M\|_F^2$. With the SVD $X = U\Sigma V^T$ ($\Sigma$ = singular value diagonal matrix): $\mathrm{prox}_{\lambda\|\cdot\|_*}(X) = X - P_{\{Z : \|Z\|_2 \le \lambda\}}(X) = U\Sigma V^T - P_{\{Z : \sigma_{\max}(Z) \le \lambda\}}(U\Sigma V^T)$ [Lewis, Malick, 2009] $= U\,\mathrm{diag}\big(\mathrm{diag}(\Sigma) - P_{\{x : \|x\|_\infty \le \lambda\}}(\mathrm{diag}(\Sigma))\big)V^T = U\,\mathrm{soft}(\Sigma, \lambda)\,V^T$: singular value thresholding (SVT).
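A compact Python sketch of singular value thresholding as derived above: soft-threshold the singular values and rebuild the matrix; the test matrix is an illustrative assumption.

    import numpy as np

    def svt(X, lam):
        """Singular value thresholding: the prox of lam*||.||_* (nuclear norm)."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s_shrunk = np.maximum(s - lam, 0.0)     # soft-threshold the singular values
        return (U * s_shrunk) @ Vt              # same as U @ diag(s_shrunk) @ Vt

    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 8))   # rank <= 3
    # Thresholding never increases the rank of the result.
    print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(svt(X, lam=2.0)))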
26 Proximity of atomic norms: matrix spectral norm. Matrix spectral norm: $\|X\|_2 = \sigma_{\max}(X)$. $\|X\|_2 = \|X\|_\mathcal{A}$ with $\mathcal{A} = \{Z : Z^T Z = I\} = \{Z : \sigma_i(Z) = 1, \forall i\}$ (orthogonal matrices). $\|X\|_\mathcal{A}^* = \max\{\langle Z, X \rangle : Z \in \mathcal{A}\} = \max\{\sum_i \sigma_i(Z)\sigma_i(X) : \sigma_i(Z) = 1, \forall i\} = \sum_i \sigma_i(X) = \|X\|_*$, the nuclear norm.
27 Proximity of atomic norms: matrix spectral norm. With the SVD $X = U\Sigma V^T$ ($\Sigma$ = singular value diagonal matrix): $\mathrm{prox}_{\lambda\|\cdot\|_2}(X) = X - P_{\{Z : \|Z\|_* \le \lambda\}}(X) = U\Sigma V^T - P_{\{Z : \sum_i \sigma_i(Z) \le \lambda\}}(U\Sigma V^T) = U\,\mathrm{diag}\big(\mathrm{diag}(\Sigma) - P_{\{x : \|x\|_1 \le \lambda\}}(\mathrm{diag}(\Sigma))\big)V^T$: the residual of the projection of the singular values on an $\ell_1$ ball of radius $\lambda$.
28 Proximity and atomic sets: vectors vs matrices. Vectors: $\ell_1$ ($\|x\|_1$) <> component-wise soft thresholding, $\mathcal{A} = \{\pm e_i\}$, $|\mathcal{A}| = 2N$; $\ell_\infty$ ($\|x\|_\infty$) <> residual of projection on the $\ell_1$ ball, $\mathcal{A} = \{\pm 1\}^N$, $|\mathcal{A}| = 2^N$; $\ell_2$ ($\|x\|_2$) <> vector soft thresholding, $\mathcal{A}$ = set of all unit-norm vectors, $|\mathcal{A}| = \infty$. Matrices: nuclear ($\|X\|_*$) <> singular value thresholding, $\mathcal{A}$ = set of all rank-1, unit-Frobenius-norm matrices; spectral ($\|X\|_2$) <> residual of singular-value projection on the $\ell_1$ ball, $\mathcal{A}$ = set of all orthogonal matrices; Frobenius ($\|X\|_F$) <> matrix soft thresholding, $\mathcal{A}$ = all matrices of unit Frobenius norm.
29 Proximal algorithms. Back to the problem: $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} f_1(x) + f_2(x)$ with $f_2$ a proper convex function and $f_1$ having an $L$-Lipschitz gradient; e.g. $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$ with $L = \lambda_{\max}(A^T A)$. Separable majorizer ($\gamma_k < 1/L$): $Q(x; x_k) = f_1(x_k) + (x - x_k)^T \nabla f_1(x_k) + \frac{1}{2\gamma_k}\|x - x_k\|_2^2$. Majorization-minimization algorithm: $x_{k+1} = \arg\min_x Q(x; x_k) + f_2(x) = \arg\min_x \frac{1}{2}\|x - (x_k - \gamma_k \nabla f_1(x_k))\|_2^2 + \gamma_k f_2(x)$, i.e. $x_{k+1} = \mathrm{prox}_{\gamma_k f_2}\big(x_k - \gamma_k \nabla f_1(x_k)\big)$.
30 Proximal algorithms: convergence. Problem: $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} f_1(x) + f_2(x) \equiv f(x)$; $f_1$ has an $L$-Lipschitz gradient, e.g. $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$ with $L = \lambda_{\max}(A^T A)$. Iterative shrinkage/thresholding (IST) (or forward-backward): $x_{k+1} = \mathrm{prox}_{\gamma_k f_2}\big(x_k - \gamma_k \nabla f_1(x_k)\big)$. If $\gamma_k < 1/L$, IST is a majorization-minimization algorithm, thus $f(x_{k+1}) \le f(x_k)$; since $f(x) \ge 0$, the sequence $(f(x_1), f(x_2), \ldots, f(x_k), \ldots)$ converges. Attention: this does not imply convergence of $(x_1, \ldots, x_k, \ldots)$.
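A minimal Python sketch of the IST / forward-backward iteration for $\frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$, with a constant step size $1/L$; the synthetic data, regularization weight, and iteration count are illustrative assumptions, not part of the lecture.

    import numpy as np

    def soft(u, lam):
        return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

    def ista(A, u, lam, num_iters=500):
        """IST / forward-backward for 0.5*||A x - u||_2^2 + lam*||x||_1."""
        gamma = 1.0 / np.linalg.norm(A, 2) ** 2          # step size 1/L, L = ||A||_2^2
        x = np.zeros(A.shape[1])
        for _ in range(num_iters):
            x = soft(x - gamma * (A.T @ (A @ x - u)), gamma * lam)   # gradient step + prox
        return x

    rng = np.random.default_rng(1)
    A = rng.standard_normal((60, 200))
    x_true = np.zeros(200); x_true[:5] = 3.0
    u = A @ x_true + 0.01 * rng.standard_normal(60)
    print(np.count_nonzero(np.abs(ista(A, u, lam=0.5)) > 1e-6))   # number of nonzero entries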
31 Proximal algorithms: convergence. $\hat{x} \in G = \arg\min_{x \in \mathbb{R}^n} f_1(x) + f_2(x)$. IST algorithm: $x_{k+1} = \mathrm{prox}_{\gamma_k f_2}\big(x_k - \gamma_k \nabla f_1(x_k)\big)$. If $0 < \gamma_k < 2/L$, then $(x_1, x_2, \ldots, x_k, \ldots)$ converges to a point in $G$. Inexact version (with errors $a_k, b_k$): $x_{k+1} = \mathrm{prox}_{\gamma_k f_2}\big(x_k - \gamma_k(\nabla f_1(x_k) + b_k)\big) + a_k$; convergence is still guaranteed if $\sum_{k=1}^{\infty}\|a_k\| < \infty$ and $\sum_{k=1}^{\infty}\|b_k\| < \infty$. Results and proofs in [Combettes and Wajs, 2005].
32 Proximal algorithms: convergence. Convergence of function values: $(f(x_1), \ldots, f(x_k), \ldots) \to f(\hat{x})$. Convergence of iterates: $(x_1, x_2, \ldots, x_k, \ldots) \to \hat{x}$. Convergence rates (for function values) [Beck, Teboulle, 2009]: $f(x_k) - f(\hat{x}) = O(1/k)$ for IST and $O(1/k^2)$ for FISTA. Convergence rates for the iterates require further assumptions on $f$.
33 Proximal algorithms: convergence of iterates. $\hat{x} = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + f_2(x)$. With $L = \lambda_{\max}(A^T A)$ and $l = \lambda_{\min}(A^T A) > 0$, $\kappa = l/L$ (condition number), and $G = \{\hat{x}\}$ (unique minimizer). Under-relaxed ($\beta < 1$) or over-relaxed ($\beta > 1$) IST: $x_{k+1} = (1 - \beta)x_k + \beta\,\mathrm{prox}_{\gamma f_2}\big(x_k - \gamma A^T(Ax_k - u)\big)$. Optimal choice $\gamma = \frac{2}{L + l}$: Q-linear convergence. Small $l$ $\Rightarrow$ convergence factor close to 1 $\Rightarrow$ slow convergence! [Figueiredo, Bioucas-Dias, 2007]
34 Proximal algorithms: convergence of iterates. With $\hat{x} \in G = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$ and $L = \lambda_{\max}(A^T A)$, using a step size $\gamma < 2/L$: $x_{k+1} = \mathrm{soft}\big(x_k - \gamma A^T(Ax_k - u), \gamma\lambda\big)$. Let $Z \subseteq \{1, 2, \ldots, n\}$ be such that $\hat{x} \in G \Rightarrow [\hat{x}]_Z = 0$. Then, after a finite number of iterations, $[x_k]_Z = [\hat{x}]_Z = 0$. After this, Q-linear convergence, with optimal choice $\gamma = \frac{2}{L + l}$, where $l = \lambda_{\min}(A_{\bar{Z}}^T A_{\bar{Z}}) > 0$ and $\bar{Z}$ is the complement of $Z$. [Hale, Yin, Zhang, 2008]
35 Slowness and acceleration of IST. Problem: $\hat{x} \in G = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$. IST algorithm: $x_{k+1} = \mathrm{soft}\big(x_k - \gamma A^T(Ax_k - u), \gamma\lambda\big)$. IST is slow if $A$ is very ill-conditioned and/or $\lambda$ is very small! Several proposals for accelerated variants of IST: methods with memory (TwIST, FISTA); quasi-Newton methods (SpaRSA); continuation, i.e., use a varying $\lambda$ (FPC, SpaRSA).
36 Memory-based variants of IST: FISTA. Fast IST algorithm (FISTA), based on Nesterov's work (1980s) [Beck, Teboulle, 2009]: $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$; $z_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$; $x_{k+1} = \mathrm{soft}\big(z_{k+1} - \gamma A^T(Az_{k+1} - u), \gamma\lambda\big)$. (Plot: objective vs. iterations for IST and FISTA.)
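A minimal Python sketch of the FISTA recursion above for the $\ell_1$-regularized least-squares problem; the fixed iteration count is an illustrative choice.

    import numpy as np

    def soft(u, lam):
        return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

    def fista(A, u, lam, num_iters=200):
        """FISTA for 0.5*||A x - u||_2^2 + lam*||x||_1 [Beck, Teboulle, 2009]."""
        gamma = 1.0 / np.linalg.norm(A, 2) ** 2
        x_prev = x = np.zeros(A.shape[1])
        t = 1.0
        for _ in range(num_iters):
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
            z = x + ((t - 1.0) / t_next) * (x - x_prev)              # memory / extrapolation step
            x_prev, x = x, soft(z - gamma * (A.T @ (A @ z - u)), gamma * lam)
            t = t_next
        return x

    # Usage (with A, u defined as in the IST example): x_hat = fista(A, u, lam=0.5)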
37 Memory-based variants of IST: TwIST. Inspired by two-step methods for linear systems [Frankel, 1950], [Axelsson, 1996]. TwIST (two-step IST) [Bioucas-Dias, Figueiredo, 2007]: $x_{k+1} = (\alpha - \beta)x_k + (1 - \alpha)x_{k-1} + \beta\,\mathrm{prox}_{f_2}\big(x_k - A^T(Ax_k - u)\big)$. (Plot: Q-linear convergence, TwIST vs. IST.)
38 Memory-based variants of IST: TwIST. Wavelet-based image deblurring example: $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|B\Psi x - u\|_2^2 + \lambda\|x\|_1$, where $x$ are the representation coefficients and $\Psi$ is the dictionary (e.g., wavelet basis, frame, ...). (Figures: original, blurred ($9{\times}9$ blur, 40 dB noise), and restored images; plots of objective function and SNR vs. iterations for IST, over-relaxed IST, TwIST, and the second-order "full TwIST" variant.)
39 Quasi-Newton acceleration of IST: SpaRSA. IST: $x_{k+1} = \mathrm{prox}_{\gamma_k f_2}\big(x_k - \gamma_k \nabla f_1(x_k)\big)$. A Newton step (instead of gradient descent) would be $x_{k+1} = \mathrm{prox}_{\gamma_k f_2}\big(x_k - [H(x_k)]^{-1}\nabla f_1(x_k)\big)$, with $H$ the Hessian (matrix of second derivatives)... computationally too expensive! Barzilai-Borwein approach [Barzilai, Borwein, 1988], [Wright, Nowak, Figueiredo, 2009]: choose $\frac{1}{\gamma_k} = \arg\min_{\alpha}\|\alpha(x_k - x_{k-1}) - (\nabla f(x_k) - \nabla f(x_{k-1}))\|_2^2$, so that $\frac{1}{\gamma_k}I \simeq H(x_k)$. If $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$, then $\gamma_k = \frac{\|x_k - x_{k-1}\|_2^2}{\|A(x_k - x_{k-1})\|_2^2}$.
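A small Python sketch of the Barzilai-Borwein step-size rule above for $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$; the min/max safeguards on the step are illustrative assumptions (SpaRSA-type methods typically add further safeguards, such as an acceptance test).

    import numpy as np

    def bb_step(A, x, x_prev, gamma_min=1e-8, gamma_max=1e8):
        """Barzilai-Borwein step for f_1(x) = 0.5*||A x - u||_2^2:
        gamma_k = ||x_k - x_{k-1}||^2 / ||A(x_k - x_{k-1})||^2, so that (1/gamma_k) I ~ Hessian."""
        s = x - x_prev
        As = A @ s
        denom = float(As @ As)
        if denom == 0.0:
            return gamma_max          # no curvature information: fall back to the largest step
        return float(np.clip((s @ s) / denom, gamma_min, gamma_max))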
40 Acceleration via continuation. IST: $x_{k+1} = \mathrm{soft}\big(x_k - \gamma A^T(Ax_k - u), \gamma\lambda\big)$. Slow if $\lambda$ is small. Observation: IST (as SpaRSA) benefits from warm-starting (being initialized close to the minimizer). Continuation: start with a large $\lambda$ and slowly decrease it while tracking the solution [Figueiredo, Nowak, Wright, 2007], [Hale, Yin, Zhang, 2007]. IST + continuation = fixed point continuation (FPC) [Hale, Yin, Zhang, 2007].
41 Acceleration via continuation. $\hat{x} \in G = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$, with $u = A\bar{x} + n$ and $\lambda_{\max} = \|A^T u\|_\infty$ (for $\lambda \ge \lambda_{\max}$, $\hat{x} = 0$). (Plot: effect of continuation as $\lambda$ is decreased toward its target value.)
42 Some speed comparisons, from [Lorenz, 2011]: $\hat{x} = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$ with $A = [I\ U\ R]$ and $\hat{x}$ having 120 non-zeros. (Plot: IST and accelerated variants.)
43 Proximal algorithms for matrices. $\hat{M} \in \arg\min_{M \in \mathbb{R}^{n \times n}} \frac{1}{2}\|\mathcal{A}(M) - U\|_F^2 + \mu\|M\|_*$, where $\mathcal{A}$ is a linear operator and $\mathcal{A}^*$ its adjoint. The proximal algorithm (IST) is as before: $X_{k+1} = \mathrm{svt}\big(X_k - \gamma_k \mathcal{A}^*(\mathcal{A}(X_k) - U), \gamma_k\mu\big)$. Matrix completion: $\mathcal{A}(X)$ = a subset of the entries of $X$. (Plot, from [Toh, Yun, 2009]: IST, APG (FISTA), FPC (continuation), and APG + continuation... the importance of acceleration!)
44 Another class of methods: augmented Lagrangian. The problem: $\min_x f(x)$ s.t. $Ax = u$. The augmented Lagrangian (AL), with penalty parameter $\mu$: $\mathcal{L}_\mu(x, \theta) = f(x) + \theta^T(Ax - u) + \frac{\mu}{2}\|Ax - u\|_2^2$. The AL method (ALM), a.k.a. method of multipliers [Hestenes, Powell, 1969]: $x_{k+1} = \arg\min_x \mathcal{L}_\mu(x, \theta_k)$; $\theta_{k+1} = \theta_k + \mu(Ax_{k+1} - u)$. Can be written as: $x_{k+1} = \arg\min_x f(x) + \frac{\mu}{2}\|Ax - u - d_k\|_2^2$; $d_{k+1} = d_k - (Ax_{k+1} - u)$. Similar to the Bregman method [Osher, Burger, Goldfarb, Xu, Yin, 2005], [Yin, Osher, Goldfarb, Darbon, 2008].
45 Augmented Lagrangian for variable splitting. The problem: $\min_x f_1(Ax) + f_2(x)$. Equivalent constrained formulation: $\min_{x,z} f_1(z) + f_2(x)$ s.t. $Ax - z = 0$; can be written as $\min_y f(y)$ s.t. $\bar{\Psi}y = 0$ with $y = [x; z]$ and $\bar{\Psi} = [A\ \ {-I}]$. ALM: $(x_{k+1}, z_{k+1}) = \arg\min_{x,z} f_1(z) + f_2(x) + \frac{\mu}{2}\|Ax - z - d_k\|_2^2$; $d_{k+1} = d_k - (Ax_{k+1} - z_{k+1})$.
46 Augmented Lagrangian for variable splitting. It may be hard to solve $(x_{k+1}, z_{k+1}) = \arg\min_{x,z} f_1(z) + f_2(x) + \frac{\mu}{2}\|Ax - z - d_k\|_2^2$ jointly. Alternative: $x_{k+1} = \arg\min_x f_2(x) + \frac{\mu}{2}\|Ax - z_k - d_k\|_2^2$; $z_{k+1} = \arg\min_z f_1(z) + \frac{\mu}{2}\|Ax_{k+1} - z - d_k\|_2^2$; $d_{k+1} = d_k - (Ax_{k+1} - z_{k+1})$: the alternating direction method of multipliers (ADMM) [Glowinski, Marrocco, 1975], [Gabay, Mercier, 1976], [Eckstein, Bertsekas, 1992]. When applied to $\hat{x} = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$: the split augmented Lagrangian shrinkage algorithm (SALSA) [Figueiredo, Bioucas-Dias, Afonso, 2009].
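A minimal Python sketch of an ADMM/SALSA-style solver for $\frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$, using the simple splitting $x = z$ so that both subproblems are easy (a ridge-type linear solve and a shrinkage); the direct solve and the value of $\mu$ are illustrative assumptions, not the lecture's exact variable choice.

    import numpy as np

    def soft(u, lam):
        return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

    def admm_l1(A, u, lam, mu=1.0, num_iters=100):
        """ADMM / SALSA-style iterations for 0.5*||A x - u||_2^2 + lam*||x||_1,
        with the splitting x = z: a linear solve for x and a shrinkage for z."""
        n = A.shape[1]
        M = A.T @ A + mu * np.eye(n)                     # the x-subproblem is a ridge-type system
        Atu = A.T @ u
        z = np.zeros(n)
        d = np.zeros(n)
        for _ in range(num_iters):
            x = np.linalg.solve(M, Atu + mu * (z + d))   # quadratic subproblem
            z = soft(x - d, lam / mu)                    # shrinkage subproblem
            d = d - (x - z)                              # (scaled) multiplier / Bregman update
        return z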
47 Augmented Lagrangian for variable splitting. Testing ADMM/SALSA on a typical image deblurring problem: $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|B\Psi x - u\|_2^2 + \lambda\|x\|_1$. (Figures: blurred and restored images; plot of the objective function $0.5\|y - Ax\|^2 + \lambda\,\mathrm{TV}(x)$ vs. CPU time in seconds for TwIST, FISTA, SpaRSA, and SALSA.)
48 Handling more than two functions. $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} f_0(x) + f_1(x) + \cdots + f_n(x)$, where $f_0$ has an $L$-Lipschitz gradient and $f_1, \ldots, f_n$ are convex. Possible uses: multiple regularizers, positivity constraints, ... Generalized forward-backward algorithm [Raguet, Fadili, Peyré, 2011]. Parameters: $\omega_1, \ldots, \omega_n \in (0,1)$ s.t. $\sum_j \omega_j = 1$. Initialization: $k = 0$; $z_0^1, \ldots, z_0^n$; $x_0 = \sum_{j=1}^n \omega_j z_0^j$. Repeat until convergence: for $i = 1{:}n$, $z_{k+1}^i = z_k^i + \mathrm{prox}_{\gamma_k f_i/\omega_i}\big(2x_k - z_k^i - \gamma_k \nabla f_0(x_k)\big) - x_k$; then $x_{k+1} = \sum_{i=1}^n \omega_i z_{k+1}^i$; $k \leftarrow k + 1$.
49 Handling more than two functions. $\hat{x} \in \arg\min_{x \in \mathbb{R}^n} f_1(x) + \cdots + f_n(x)$, with $f_1, \ldots, f_n$ arbitrary convex functions. ADMM-based method [Figueiredo and Bioucas-Dias, 2009], [Setzer, Steidl, Teuber, 2009]. Parameter: $\mu$. Initialization: $k = 0$; $z_0^1, \ldots, z_0^n$; $y_0^1, \ldots, y_0^n$. Repeat until convergence: $x_{k+1} = \frac{1}{n}\sum_{i=1}^n (y_k^i - z_k^i)$; for $i = 1{:}n$, $y_{k+1}^i = \mathrm{prox}_{f_i/\mu}\big(x_{k+1} + z_k^i\big)$ and $z_{k+1}^i = z_k^i + x_{k+1} - y_{k+1}^i$; $k \leftarrow k + 1$.
50 Non-Convex Algorithms for Low-Dimensional Models
51 Motivation Discrete descriptions of low-dimensional models Example: reflectivity of Lambertian surfaces [Basri and Jacobs 2001]
52 Motivation Discrete descriptions of low-dimensional models Example: graphical model selection vertex weights of the i-th edge Gauss-Markov graph <> linear regression [Meinshausen and Buhlman 2006]
53 Motivation Discrete descriptions of structure in low-dimensional models Typical of wavelet transforms of natural signals and images (piecewise smooth) [Baraniuk, C, Duarte, Hegde 2010]
54 Motivation. Non-convex criteria beyond atomic norms: 1-bit compressive sensing optimization criteria [Boufounos and Baraniuk 2008]; compressible signals in weak $\ell_p$ balls, $\ell_p$ ($p < 1$) optimization criteria [Chartrand and Yin 2008].
55 Non-convexity in this tutorial. Anything not convex <> too big to cover; convexity is in general a rare condition. Active research topic with great depth. Key lesson [Attouch et al. 2010]: convergence of the projected gradient-descent algorithm. This tutorial <> a special subset: $\min_x f_1(x) + f_2(x)$ with $f_2$ non-convex. Assumptions: 1. access to the gradient of the convex $f_1$; 2. tractable/approximate prox of the non-convex $f_2$.
56 Can we project onto non-convex sets? Running examples. Sparse signal: only K out of N coordinates nonzero; model: union of all K-dimensional subspaces aligned w/ coordinate axes. Structured sparse signal (or model-sparse): reduced set of subspaces; model: a particular union of subspaces; ex: clustered or dispersed sparse patterns. (Plot: coefficient magnitude vs. sorted index.)
57 Can we project onto non-convex sets? Running examples (repeat of the previous slide).
58 Can we project onto non-convex sets? Analysis of the prox for structured sparse sets support of the solution <> modular approximation problem indexing set
59 Can we project onto non-convex sets? Analysis of the prox for structured sparse sets support of the solution <> modular approximation problem
60 Can we project onto non-convex sets? Analysis of the prox for structured sparse sets support of the solution <> modular approximation problem
61 Can we project onto non-convex sets? Analysis of the prox for structured sparse sets support of the solution <> modular approximation problem where
62 Can we project onto non-convex sets? Analysis of the prox for structured sparse sets support of the solution <> modular approximation problem underlying optimization problem <> integer linear program : support indicator variables [Kyrillidis and C, 2011]
63 Can we project onto non-convex sets? Analysis of the prox for structured sparse sets support of the solution <> modular approximation problem underlying optimization problem <> integer linear program Class of problems we can tractably solve: PMAP Polynomial time modular epsilon-approximation property [Kyrillidis and C, 2011]
64 Can we project onto non-convex sets? PMAP-0: Matroid structured sparse models: Definition: non-emptiness heredity exchange [Nemhauser and Wolsey, 1999]
65 Can we project onto non-convex sets? PMAP-0: Matroid structured sparse models. Definition: non-emptiness, heredity, exchange. Let $\mathcal{M} = \{\{1,2\},\{3,4\}\}$ (an example collection of supports). The smallest matroid that contains $\mathcal{M}$ is ??? [Nemhauser and Wolsey, 1999]
66 Can we project onto non-convex sets? PMAP-0: Matroid structured sparse models. Definition: non-emptiness, heredity, exchange. Let $\mathcal{M} = \{\{1,2\},\{3,4\}\}$. The smallest matroid that contains $\mathcal{M}$: $\{\,\emptyset$ (by the non-emptiness property); $\{1\}, \{2\}, \{3\}, \{4\}, \{1,2\}, \{3,4\}$ (by the heredity property); $\{1,3\}, \{1,4\}, \{2,3\}, \{2,4\}$ (by the exchange property)$\,\}$.
67 Can we project onto non-convex sets? PMAP-0: Matroid structured sparse models. Definition: non-emptiness, heredity, exchange. Greedy basis algorithm efficiently solves matroid-constrained problems: sort the N weights in decreasing order; start with the empty set; then, while keeping the order, 1. take the next index, 2. add it to the current set if the result is still in the matroid, 3. repeat.
68 Can we project onto non-convex sets? PMAP-0: Matroid structured sparse models. Definition: non-emptiness, heredity, exchange. Greedy basis algorithm efficiently solves matroid constrained problems. Examples: 1. uniform matroid: hard thresholding! (Plot: coefficient magnitude vs. sorted index.)
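A small Python sketch of the uniform-matroid projection named above, i.e. hard thresholding: keep the K largest-magnitude entries and zero out the rest; the example vector is illustrative.

    import numpy as np

    def hard_threshold(u, K):
        """Projection onto K-sparse vectors: keep the K largest-magnitude entries, zero the rest."""
        x = np.zeros_like(u)
        if K > 0:
            idx = np.argpartition(np.abs(u), -K)[-K:]   # indices of the K largest |u_i|
            x[idx] = u[idx]
        return x

    print(hard_threshold(np.array([0.1, -2.0, 0.5, 3.0, -0.2]), K=2))   # keeps -2.0 and 3.0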
69 Can we project onto non-convex sets? PMAP-0: Matroid structured sparse models: Definition: Greedy basis algorithm efficiently solves matroid constrained problems Examples: [Kyrillidis and C, 2011] 1. uniform matroid <> simple sparsity intersection with the following matroids (result is still a matroid!*) 2. partition matroid <> distributed sparsity 3. graphic matroid <> spanning tree sparsity 4. matching matroid <> graph matching sparsity *: in general, the intersection of two matroids is not a matroid.
70 Can we project onto non-convex sets? PMAP-0: Linear support constraints: Definition: A and b <> integral first row of A <> all 1 s first entry of b <> K [Kyrillidis and C, 2011]
71 Can we project onto non-convex sets? PMAP-0: Linear support constraints: Definition: Example: neuronal spike model
72 Can we project onto non-convex sets? PMAP-0: Linear support constraints. Definition: We can use LP to relax the LS-constrained ILPs, but when is the result binary?
73 Can we project onto non-convex sets? PMAP-0: Linear support constraints. Definition: LP can exactly solve the LS-constrained ILPs when A is totally unimodular (TU)*: the determinant of each square submatrix is in {-1, 0, 1} [Nemhauser and Wolsey, 1999]. Examples: interval matrices, perfect matrices, network matrices. *: if we want the LP relaxation to work for all b, TU is a necessary condition.
74 Can we project onto non-convex sets? PMAP-0: Linear support constraints: Definition: Example: neuronal spike model TU [Hegde, Duarte, and C, 2009]
75 Can we project onto non-convex sets? PMAP-0: prox-sparse models Definition: define algorithmically! [Kyrillidis and C, 2011]
76 Can we project onto non-convex sets? PMAP-0: prox-sparse models Definition: define algorithmically! Example: clustered sparsity models tree-sparse <> dynamic program clustered sparse <> dynamic program [Baraniuk, C, Wakin 2010; Baraniuk, C, Duarte, Hegde 2010]
77 Can we project onto non-convex sets? Pop-quiz: A prox with convex and non-convex terms Let us consider Is it PMAP-0?
78 Can we project onto non-convex sets? Pop-answer: A prox with convex and non-convex terms. Let us consider... Hard thresholding followed by soft thresholding! YES: certified PMAP-0. (Plot: the composite thresholding rule, y vs. w.)
79 Can we project onto non-convex sets? PMAP-epsilon: Knapsack multi-knapsack constraints weighted multi-knapsack Ex: Nested group sparse problems quadratically-constrained Define algorithmically! approximate solutions for computational reasons selector variables [Kyrillidis and C, 2011]
80 Can we project onto non-convex sets? PMAP-epsilon: Knapsack: multi-knapsack constraints, weighted multi-knapsack, quadratically-constrained; ex: nested group sparse problems. Define algorithmically! Approximate solutions for computational reasons; selector variables. Pairwise overlapping groups <> quadratic binary w/ cardinality constraints: we can only approximate, and epsilon is large!
81 Can we project onto non-convex sets? PMAP-epsilon: Knapsack multi-knapsack constraints weighted multi-knapsack Ex: Nested group sparse problems quadratically-constrained Define algorithmically! approximate solutions for computational reasons selector variables Pairwise overlapping groups <> quadratic binary w/ cardinality cons. Multi-knapsack + multi-matroids [Lee et al., 2009] we can only approximate and epsilon is large!
82 Can we project onto non-convex sets? Matrix examples! Rank constrained projections
83 Can we project onto non-convex sets? Matrix examples! Rank constrained projections singular value decomposition
84 Can we project onto non-convex sets? Matrix examples! Rank constrained projections invariance to unitary transform
85 Can we project onto non-convex sets? Matrix examples! Rank constrained projections sparse approximation problem!
86 Can we project onto non-convex sets? Matrix examples! Rank constrained projections singular value (hard) thresholding
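A short Python sketch (not from the slides) of this rank-constrained projection via singular value hard thresholding; the random test matrix is an illustrative assumption.

    import numpy as np

    def project_rank(X, r):
        """Projection onto matrices of rank <= r: keep the r largest singular values."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[r:] = 0.0                      # zero out all but the r largest singular values
        return (U * s) @ Vt

    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 6))
    print(np.linalg.matrix_rank(project_rank(X, r=2)))   # -> 2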
87 Can we project onto non-convex sets? Matrix examples! Rank constrained projections singular value (hard) thresholding Non-convex spectral projections <> sets described by their eigenvalue properties exact projections >> basic operations on eigenvalues [Lewis and Malick 2008]
88 Can we project onto non-convex sets? Matrix examples! Rank constrained projections Non-convex spectral projections <> sets described by their eigenvalue properties epsilon-approximate projections (note the difference with PMAP) Two highlights: structure from randomness/power methods [Halko, Martinsson, Tropp, 2010] column subset selection approaches [Boutsidis, Mahoney, Drineas, 2010]
89 Recovery algorithms for low-dimensional models. Now that we have projections... Encoding: non-convex <> combinatorial / manifolds; convex <> atomic norm / convex relaxation; probabilistic <> compressible / sparse priors. A common criterion covering a broad set of applications: affine rank minimization, matrix completion, robust PCA [Candes and Recht 2009; Waters, Sankaranarayanan, Baraniuk, 2011]. A common algorithm: projected gradient.
90 Recovery algorithms for low-dimensional models. To highlight the salient differences, we will consider compressive sensing recovery. Encoding: non-convex <> combinatorial / manifolds; convex <> atomic norm / convex relaxation; probabilistic <> compressible / sparse priors. A common algorithm: projected gradient.
91 A tale of two algorithms Soft thresholding
92 A tale of two algorithms Soft thresholding Structure in optimization:
93 A tale of two algorithms Soft thresholding majorization-minimization Key actor: least absolute shrinkage
94 A tale of two algorithms Soft thresholding slower
95 A tale of two algorithms Soft thresholding Is x* what we are looking for? local unverifiable assumptions: ERC/URC/RSC condition coherence based conditions (local global / dual certification) [Buhlmann and van de Geer 2011]
96 A tale of two algorithms Hard thresholding Key actor: hard thresholding ALGO: sort and pick the largest K
97 A tale of two algorithms Hard thresholding percolations What could possibly go wrong with this naïve approach? [C, 2011]
98 A tale of two algorithms Hard thresholding Global unverifiable assumption: RIP condition we can tiptoe among percolations! another variant has
99 Restricted Isometry Property. Model: K-sparse coefficients. RIP: stable embedding of the union of K-planes, $(1 - \delta_K)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_K)\|x\|_2^2$ for all K-sparse $x$, where $\Phi$ is the measurement matrix; holds w.h.p. for IID sub-Gaussian matrices. Remark: it also implies convergence of convex relaxations, e.g. basis pursuit. [Candes 2008; Baraniuk et al. 2008]
100 Restricted Isometry Property Model: K-sparse coefficients Remark: implies convergence of convex relaxations also e.g., RIP: stable embedding IID sub-gaussian matrices K-planes
101 Restricted Isometry Property for Matrices! Model: rank-r matrices Remark: bi-lipschitz embedding of low-rank matrices RIP: stable embedding IID sub-gaussian matrices [Plan 2011] K-planes
102 Projected gradient method for non-convex sets Model-based hard thresholding Global unverifiable assumption: [Baraniuk, C, Duarte, Hegde 2010] Key actor: non-convex projector
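A minimal Python sketch of the projected-gradient (IHT-style) recursion with a plug-in non-convex projector; here the projector is plain hard thresholding, but a model-based projector (e.g. tree-sparse) could be substituted. The Gaussian measurement matrix, step size, and sparsity level are illustrative assumptions.

    import numpy as np

    def hard_threshold(u, K):
        x = np.zeros_like(u)
        idx = np.argpartition(np.abs(u), -K)[-K:]
        x[idx] = u[idx]
        return x

    def projected_gradient_nonconvex(A, u, project, num_iters=100, gamma=None):
        """Projected gradient for 0.5*||A x - u||_2^2 with a (possibly non-convex) projector,
        e.g. hard thresholding (IHT) or a model-based projector for structured sparsity."""
        if gamma is None:
            gamma = 1.0 / np.linalg.norm(A, 2) ** 2
        x = np.zeros(A.shape[1])
        for _ in range(num_iters):
            x = project(x - gamma * (A.T @ (A @ x - u)))
        return x

    rng = np.random.default_rng(2)
    A = rng.standard_normal((40, 100)) / np.sqrt(40)    # measurement matrix (illustrative)
    x_true = np.zeros(100); x_true[[3, 17, 42]] = [1.0, -2.0, 1.5]
    x_hat = projected_gradient_nonconvex(A, A @ x_true, project=lambda z: hard_threshold(z, K=3))
    print(np.linalg.norm(x_hat - x_true))   # small when A satisfies a RIP-like condition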
103 Example: tree-sparse recovery Model: K-sparse coefficients + significant coefficients lie on a rooted subtree Sparse approx: find best set of coefficients sorting hard thresholding Tree-sparse approx: find best rooted subtree of coefficients condensing sort and select [Baraniuk] dynamic programming [Donoho] [Baraniuk, C, Duarte, Hegde 2010]
104 Example: tree-sparse recovery Model: K-sparse coefficients RIP: stable embedding IID sub-gaussian matrices K-planes
105 Example: tree-sparse recovery Model: K-sparse coefficients + significant coefficients lie on a rooted subtree Tree-RIP: stable embedding IID sub-gaussian matrices K-planes
106 Example: tree-sparse recovery. Number of samples for correct recovery: piecewise cubic signals + wavelets. Models/algorithms: compressible (CoSaMP), tree-compressible (tree-CoSaMP). [Baraniuk, C, Duarte, Hegde 2010]
107 Recovery algorithms for low-dimensional models: the CLASH operator. Encoding: non-convex <> combinatorial / manifolds; convex <> atomic norm / convex relaxation; probabilistic <> compressible / sparse priors. Example algorithms: non-convex <> IHT, CoSaMP, SP, ALPS, OMP; convex <> basis pursuit, Lasso, basis pursuit denoising; probabilistic <> variational Bayes, EP, approximate message passing (AMP). [Kyrillidis and C, 2011]
108 Recovery algorithms for low-dimensional models
109 Recovery algorithms for low-dimensional models A subtle issue 2 solutions! Which one is correct?
110 Recovery algorithms for low-dimensional models A subtle issue 2 solutions! Greed is good. Joel Tropp 2004
111 Recovery algorithms for low-dimensional models A subtle issue 2 solutions! Which one is correct?
112 The CLASH algorithm combinatorial selection + least absolute shrinkage
113 The CLASH algorithm combinatorial selection + least absolute shrinkage combinatorial origami
114 Recovery algorithms for low-dimensional models: the CLASH operator. Encoding: non-convex <> combinatorial / manifolds; convex <> atomic norm / convex relaxation; probabilistic <> compressible / sparse priors. Example algorithms: non-convex <> IHT, CoSaMP, SP, ALPS, OMP; convex <> basis pursuit, Lasso, basis pursuit denoising; probabilistic <> variational Bayes, EP, approximate message passing (AMP). The idea is much more general. [Kyrillidis, Puy, and C, 2012]
115 Recovery algorithms for low-dimensional models. Using projected gradient with exact non-convex projections, with RIP/ERC/URC/RSC: exact low-dimensional model — noise-free measurements give exact recovery, noisy measurements give stable recovery. Approximately low-dimensional model — recovery as good as the K-model-sparse approximation: recovery error $\lesssim$ the signal's K-term model approximation error + the noise. [Baraniuk, C, Duarte, Hegde 2010]
116 Recovery algorithms for low-dimensional models. Using projected gradient with exact non-convex projections, with RIP/ERC/URC/RSC: exact low-dimensional model — noise-free measurements give exact recovery, noisy measurements give stable recovery. Approximately low-dimensional model — recovery as good as the K-model-sparse approximation: recovery error $\lesssim$ the signal's K-term model approximation error + the noise. The bound remains qualitatively the same for other models!!!
117 Recovery algorithms for low-dimensional models Projected gradient with (non)exact non-convex projections without RIP/ERC/URC/RSC Not much! convergence to stationary point with Kurdyka-Lojasiewicz [Attouch et al., 2010]
118 Examples CLASH Structured Sparsity Norm Constraints [Kyrillidis, Puy, and C, 2012]
119 Examples Coded Aperture Snapshot Spectral Imager
120 Examples. Total variation: TwIST results.
121 Examples TV-CLASH TV + sparsity in wavelets [Kyrillidis, Puy, and C, 2012]
122 Acceleration of non-convex algorithms. Several approaches: step-size selection; memory based methods similar to Nesterov acceleration / double overrelaxation; non-convex splitting; (adaptive) block coordinate descent; epsilon-approximate projections. [Kyrillidis and C, 2011]
123 Acceleration of non-convex algorithms. Several approaches: step-size selection; memory based methods similar to Nesterov acceleration / double overrelaxation; non-convex splitting; (adaptive) block coordinate descent; epsilon-approximate projections. (Example timings from the referenced comparison: 34.8 s vs. 15.8 s.) [Zhou and Tao 2011; Kyrillidis and C, 2012]
124 Final remarks. Non-convex algorithms <> low-dimensional scaffold: possible performance gains; non-convexifiable priors; matching prox operator with optimal space/time bounds; complexity of structured approximation. Non-convex algorithms vs. convex algorithms: no clear winner / scenario dependent; decades of research in both.
125 References M. Afonso, J. Bioucas-Dias, M. Figueiredo, Fast image recovery using variable splitting and constrained optimization, IEEE Transactions on Image Processing, vol. 19, H. Attouch, J. Bolte, P. Redont, A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Lojasiewicz inequality, Math. Oper. Research, 2010 O. Axelsson, Iterative Solution Methods, Cambridge University Press, R. Baraniuk, V. Cevher, M. Duarte, C. Hegde, Model-based compressive sensing, IEEE Transactions on Information Theory, vol. 56, R. Baraniuk, M. Davenport, R. de Vore, M. Wakin, A Simple Proof of the Restricted Isometry Property for Random Matrices, Constructive Approximation, R. Baraniuk, V. Cevher, M. Wakin, Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective, Proceedings of the IEEE, vol. 98, J. Barzilai and J. Borwein, Two point step size gradient methods, IMA Journal of Numer. Anal., vol. 8, R. Basri, D. Jacobs, Lambertian Reflectance and Linear Subspaces, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Science, vol. 2, J. Bioucas-Dias, M. Figueiredo, A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration, IEEE Transactions on Image Processing, vol. 16, P. Boufounos, R. Baraniuk, 1-bit compressed sensing, Proceedings of the Conference on Information Science and Systems, Princeton, C. Boutsidis, M. Mahoney, P. Drineas, An improved approximation algorithm for the column subset selection problem, Proc. 20th Annual ACM/SIAM Symposium on Discrete Algorithms, New York, NY, 2008.
126 References P. Bühlmann, S. van der Geer, Statistics for High-Dimensional Data, Springer, E. Candès, The restricted isometry property and its implications for compressed sensing, Comptes Rendus Mathematique, vol. 346, E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, vol. 9, L. Carin, R. Baraniuk, V. Cevher, D. Dunson, M. Jordan, G. Sapiro, M. Wakin, Learning low-dimensional signal models, IEEE Signal Processing Magazine, vol. 28, V. Cevher, An ALPS view of sparse recovery, Proc. ICASSP, V. Chandrasekaran, B. Recht, P. Parrilo, A. Willsky, The convex geometry of linear inverse problems, submitted, R. Chartrand, W. Yin, Iteratively reweighted algorithms for compressive sensing, Proc. ICASSP, 2008 S. Chen, D. Donoho, M. Saunders, Atomic decomposition by Basis Pursuit, SIAM Review, vol. 43, P. Combettes, V. Wajs, Signal recovery by proximal forward-backward splitting, SIAM Journal Multiscale Modeling and Simulation, vol. 4, J. Eckstein, D. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming, vol. 5, M. Figueiredo and J. Bioucas-Dias, Restoration of Poissonian images using alternating direction optimization, IEEE Transactions on Image Processing, vol. 19, M. Figueiredo, R. Nowak, S. Wright, Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems, IEEE Journal of Selected Topics in Signal Processing, vol. 1, 2007.
127 References D. Gabay, B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite-element approximations, Computers and Mathematics with Application, vol. 2, R. Glowinski, A. Marroco, Sur l approximation, par elements finis d ordre un, et la resolution, par penalisation-dualité d une classe de problemes de Dirichlet non lineares, Rev. Française d Automatique, Y. Gordon, On Milman's inequality and random subspaces which escape through a mesh in R n, in Geometric Aspects of Functional Analysis, Springer, E. Hale, W. Yin, Y. Zhang, Fixed-point continuation for l1-minimization: Methodology and convergence, SIAM Journal on Optimization, vol. 19, N. Halko, P.-G. Martinsson, J. Tropp, Finding structure with randomness: stochastic algorithms for constructing approximate matrix decompositions, SIAM Review, vol. 53, C. Hegde, M. Duarte, V. Cevher, Compressive sensing recovery of spike trains using a structured sparsity model, Proceedings of SPARS'09, Saint-Malo, France, M. Hestenes, Multiplier and gradient methods, Journal of Optimazion Theory andapplications, vol. 4, A. Kyrillidis, V. Cevher, Recipes for hard thresholding methods, Tech. Rep., EPFL, J. Lee, V. Mirrokni, V. Nagarajan, M. Sviridenko, Non-monotone submodular maximization under matroid and knapsack constraints, Proc. 41st Annual ACM Symposium on Theory of Computing, Bethesda, MD, A. Lewis, J. Malick, Alternating projections on manifolds, Math. of Operations Research, vol. 33, D. Lorenz, Constructing test instances for basis pursuit denoising, submitted, N. Meinshausen, P. Bühlmann, High-dimensional graphs and variable selection with the lasso, The Annals of Statistics, vol. 34, pp , J.-J. Moreau, Proximité et dualité dans un espace hilbertien, Bull. Soc. Mathematiques de France, vol. 93, 1965.
128 References G. Nemhauser, L. Wolsey, Integer and combinatorial optimization, Wiley, S. Osher, M. Burger, D. Goldfarb, J. Xu, W. Yin, An iterative regularization method for total variation-based image restoration, SIAM Journal on Multiscale Modeling and Simulation, vol. 4, Y. Plan, Compressed sensing, sparse approximation, low-rank matrix estimation, PhD Thesis, Caltech, 2011 M. Powell, A method for nonlinear constraints in minimization problems, in Optimization, Academic Press, H. Raguet, J. Fadili, G. Peyré, Generalized Forward-Backward splitting, Tech. report, Hal , S. Setzer, G. Steidl, T. Teuber, Deblurring Poissonian images by split Bregman techniques, Journal of Visual Communication and Image Representation, K.-C. Toh, S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized least squares Problems, Pacific Journal of Optimization, vol. 6, A. Waters, A. Sankaranarayanan, R. Baraniuk, SpaRCS: recovering low-rank and sparse matrices from compressive measurements, Neural Information Processing Systems, S. Wright, R. Nowak, M. Figueiredo, Sparse reconstruction by separable approximation, IEEE Transactions on Signal Processing, vol. 57, W. Yin, S. Osher, D. Goldfarb, J. Darbon, Bregman iterative algorithms for l1-minimization with applications to compressed sensing, SIAM Journal on Imaging Science, vol. 1, T. Zhou, D. Tao, Godec: randomized low-rank & sparse matrix decomposition in noisy case, Proc. International Conference on Machine Learning, Bellevue, WA, 2011.
More informationRobust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds
Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds Tao Wu Institute for Mathematics and Scientific Computing Karl-Franzens-University of Graz joint work with Prof.
More informationCompressed Sensing and Neural Networks
and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications
More informationThe convex algebraic geometry of rank minimization
The convex algebraic geometry of rank minimization Pablo A. Parrilo Laboratory for Information and Decision Systems Massachusetts Institute of Technology International Symposium on Mathematical Programming
More informationABSTRACT. Recovering Data with Group Sparsity by Alternating Direction Methods. Wei Deng
ABSTRACT Recovering Data with Group Sparsity by Alternating Direction Methods by Wei Deng Group sparsity reveals underlying sparsity patterns and contains rich structural information in data. Hence, exploiting
More informationADMM in Imaging Inverse Problems: Non-Periodic and Blind Deconvolution
ADMM in Imaging Inverse Problems: Non-Periodic and Blind Deconvolution Mário A. T. Figueiredo Instituto Superior Técnico, University of Lisbon, Portugal Instituto de Telecomunicações Lisbon, Portugal Joint
More informationA Dykstra-like algorithm for two monotone operators
A Dykstra-like algorithm for two monotone operators Heinz H. Bauschke and Patrick L. Combettes Abstract Dykstra s algorithm employs the projectors onto two closed convex sets in a Hilbert space to construct
More informationLeast Sparsity of p-norm based Optimization Problems with p > 1
Least Sparsity of p-norm based Optimization Problems with p > Jinglai Shen and Seyedahmad Mousavi Original version: July, 07; Revision: February, 08 Abstract Motivated by l p -optimization arising from
More informationTopographic Dictionary Learning with Structured Sparsity
Topographic Dictionary Learning with Structured Sparsity Julien Mairal 1 Rodolphe Jenatton 2 Guillaume Obozinski 2 Francis Bach 2 1 UC Berkeley 2 INRIA - SIERRA Project-Team San Diego, Wavelets and Sparsity
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationMultipath Matching Pursuit
Multipath Matching Pursuit Submitted to IEEE trans. on Information theory Authors: S. Kwon, J. Wang, and B. Shim Presenter: Hwanchol Jang Multipath is investigated rather than a single path for a greedy
More informationFrist order optimization methods for sparse inverse covariance selection
Frist order optimization methods for sparse inverse covariance selection Katya Scheinberg Lehigh University ISE Department (joint work with D. Goldfarb, Sh. Ma, I. Rish) Introduction l l l l l l The field
More informationRandom projections. 1 Introduction. 2 Dimensionality reduction. Lecture notes 5 February 29, 2016
Lecture notes 5 February 9, 016 1 Introduction Random projections Random projections are a useful tool in the analysis and processing of high-dimensional data. We will analyze two applications that use
More informationLecture 5 : Projections
Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization
More informationStructured matrix factorizations. Example: Eigenfaces
Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix
More informationAlternating Direction Optimization for Imaging Inverse Problems
Alternating Direction Optimization for Imaging Inverse Problems Mário A. T. Figueiredo Instituto Superior Técnico, and Instituto de Telecomunicações Technical University of Lisbon PORTUGAL Joint work with:
More informationPreconditioning via Diagonal Scaling
Preconditioning via Diagonal Scaling Reza Takapoui Hamid Javadi June 4, 2014 1 Introduction Interior point methods solve small to medium sized problems to high accuracy in a reasonable amount of time.
More informationThe Pros and Cons of Compressive Sensing
The Pros and Cons of Compressive Sensing Mark A. Davenport Stanford University Department of Statistics Compressive Sensing Replace samples with general linear measurements measurements sampled signal
More informationMinimizing the Difference of L 1 and L 2 Norms with Applications
1/36 Minimizing the Difference of L 1 and L 2 Norms with Department of Mathematical Sciences University of Texas Dallas May 31, 2017 Partially supported by NSF DMS 1522786 2/36 Outline 1 A nonconvex approach:
More informationGreedy Sparsity-Constrained Optimization
Greedy Sparsity-Constrained Optimization Sohail Bahmani, Petros Boufounos, and Bhiksha Raj 3 sbahmani@andrew.cmu.edu petrosb@merl.com 3 bhiksha@cs.cmu.edu Department of Electrical and Computer Engineering,
More informationWHY DUALITY? Gradient descent Newton s method Quasi-newton Conjugate gradients. No constraints. Non-differentiable ???? Constrained problems? ????
DUALITY WHY DUALITY? No constraints f(x) Non-differentiable f(x) Gradient descent Newton s method Quasi-newton Conjugate gradients etc???? Constrained problems? f(x) subject to g(x) apple 0???? h(x) =0
More informationRanking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization
Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Jialin Dong ShanghaiTech University 1 Outline Introduction FourVignettes: System Model and Problem Formulation Problem Analysis
More informationConstrained optimization
Constrained optimization DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Compressed sensing Convex constrained
More information