Low Complexity Regularization

1 Low Complexity Regularization

2 Low-dimensional signal models. Information level: pixels vs. large wavelet coefficients (blue = 0). Examples: sparse signals, low-rank matrices, nonlinear models.

3 Sparse representations. Sparse signal: only K out of N coordinates are nonzero. Sparse representation: the transform coefficients are sparse. A fundamental impact:

4

5 Recommendation systems

6 Recommendation systems Machine learning competition with a $1 million prize

7 Background extraction

8 Basis pursuit
$\min_x \{\|x\|_1 : Ax = b\}$
Find the least $\ell_1$-norm point on the affine plane $\{x : Ax = b\}$; this tends to return a sparse point (sometimes the sparsest). Geometrically, the $\ell_1$ ball touches the affine plane at a sparse point.

9 Basis pursuit denoising, LASSO
$\min_x \{\|Ax - b\|_2 : \|x\|_1 \le \tau\}$, (1a)
$\min_x \|x\|_1 + \frac{\mu}{2}\|Ax - b\|_2^2$, (1b)
$\min_x \{\|x\|_1 : \|Ax - b\|_2 \le \sigma\}$. (1c)
All three models allow $Ax \ne b$.

10 Basis pursuit denoising, LASSO
$\min_x \{\|Ax - b\|_2 : \|x\|_1 \le \tau\}$, (2a)
$\min_x \|x\|_1 + \frac{\mu}{2}\|Ax - b\|_2^2$, (2b)
$\min_x \{\|x\|_1 : \|Ax - b\|_2 \le \sigma\}$. (2c)
The $\ell_2$ norm is the most common error measure, but it can be generalized to a loss function $L$.
(2a) seeks a least-squares solution with bounded sparsity.
(2b) is known as the LASSO (least absolute shrinkage and selection operator); it seeks a balance between sparsity and data fitting.
(2c) is referred to as BPDN (basis pursuit denoising); it seeks a sparse solution in the tube-like set $\{x : \|Ax - b\|_2 \le \sigma\}$.
The three models are equivalent (see later slides). In terms of regression, they select a (sparse) set of features (i.e., columns of $A$) to linearly express the observation $b$.
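As a concrete, hedged illustration, the penalized form (2b) can be prototyped with the cvxpy modeling package; the problem sizes, sparsity level, and value of $\mu$ below are illustrative assumptions, not part of the slides:

```python
# Minimal sketch: the penalized LASSO form (2b) solved with cvxpy on synthetic data.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 40, 100
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = rng.standard_normal(5)
b = A @ x_true + 0.01 * rng.standard_normal(m)

mu = 10.0  # weight on the data-fidelity term, as in (2b)
x = cp.Variable(n)
objective = cp.Minimize(cp.norm1(x) + (mu / 2) * cp.sum_squares(A @ x - b))
cp.Problem(objective).solve()
print("nonzeros in solution:", np.sum(np.abs(x.value) > 1e-4))
```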

11 Sparse under basis Ψ / $\ell_1$-synthesis model
$\min_s \{\|s\|_1 : A\Psi s = b\}$ (3)
The signal $x$ is sparsely synthesized by atoms from $\Psi$, so the vector $s$ is sparse. $\Psi$ is referred to as the dictionary. Commonly used dictionaries include both analytic and trained ones. Analytic examples: identity, DCT, wavelets, curvelets, Gabor, etc., and their combinations; they have analytic properties and are often easy to apply (for example, multiplying a vector takes $O(n\log n)$ instead of $O(n^2)$). $\Psi$ can also be learned numerically from training data or partial signals. Dictionaries can be orthogonal, frames, or general.

12 Sparse under basis Ψ / $\ell_1$-synthesis model
If $\Psi$ is orthogonal, problem (3) is equivalent to
$\min_x \{\|\Psi^* x\|_1 : Ax = b\}$ (4)
by the change of variable $x = \Psi s$, equivalently $s = \Psi^* x$.
Related models for noise and approximate sparsity:
$\min_x \{\|Ax - b\|_2 : \|\Psi^* x\|_1 \le \tau\}$,
$\min_x \|\Psi^* x\|_1 + \frac{\mu}{2}\|Ax - b\|_2^2$,
$\min_x \{\|\Psi^* x\|_1 : \|Ax - b\|_2 \le \sigma\}$.

13 Sparse after transform / $\ell_1$-analysis model
$\min_x \{\|\Psi^* x\|_1 : Ax = b\}$ (5)
The signal $x$ becomes sparse under the transform $\Psi$ (which may not be orthogonal).
Examples of $\Psi$: DCT, wavelets, curvelets, ridgelets, ..., tight frames, Gabor, ..., (weighted) total variation.
When $\Psi$ is not orthogonal, the analysis is more difficult.

14 Joint/group sparsity
Joint sparse recovery model:
$\min_X \{\|X\|_{2,1} : \mathcal{A}(X) = b\}$ (6)
where
$\|X\|_{2,1} := \sum_{i=1}^m \|[x_{i1}\ x_{i2}\ \cdots\ x_{in}]\|_2$.
The $\ell_2$ norm is applied to each row of $X$. The $\ell_{2,1}$-norm ball has sharp boundaries across different rows, which tend to be touched by $\{X : \mathcal{A}(X) = b\}$, so the solution tends to be row-sparse. One can also use $\|X\|_{p,q}$ for $1 < p \le \infty$, which affects the magnitudes of entries on the same row. Complex-valued signals are a special case.
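As a small illustration of the row-wise $\ell_{2,1}$ norm in (6), here is a numpy sketch (the matrix is a made-up example, not from the slides):

```python
# Row-wise l2,1 norm: l2 norm of each row, then summed.
import numpy as np

X = np.array([[3.0, 4.0, 0.0],   # nonzero row, l2 norm 5
              [0.0, 0.0, 0.0],   # zero row contributes nothing
              [1.0, 2.0, 2.0]])  # nonzero row, l2 norm 3

row_norms = np.linalg.norm(X, axis=1)   # l2 norm applied to each row
l21 = row_norms.sum()                   # the l2,1 norm
print(row_norms, l21)                   # [5. 0. 3.] 8.0
```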

15 Joint/group sparsity
Decompose $\{1,\dots,n\} = G_1 \cup G_2 \cup \dots \cup G_S$.
Non-overlapping groups: $G_i \cap G_j = \emptyset$ for $i \ne j$; otherwise, groups may overlap (modeling many interesting structures).
Group-sparse recovery model:
$\min_x \{\|x\|_{G,2,1} : Ax = b\}$ (7)
where
$\|x\|_{G,2,1} = \sum_{s=1}^S w_s \|x_{G_s}\|_2$.

16 Auxiliary constraints
Auxiliary constraints introduce additional structure of the underlying signal into its recovery, which can significantly improve recovery quality:
nonnegativity: $x \ge 0$;
bound (box) constraints: $l \le x \le u$;
general inequalities: $Qx \le q$.
They can be very effective in practice. They also generate corners.

17 Nuclear-norm minimization
$\min_X \{\|X\|_* : \mathcal{A}(X) = b\}$ (8)
We can also model
$\min_X \{\|X\|_* : \|\mathcal{A}(X) - b\|_F \le \sigma\}$,
$\min_X \{\|\mathcal{A}(X) - b\|_F : \|X\|_* \le \tau\}$,
$\min_X\ \mu\|X\|_* + \frac{1}{2}\|\mathcal{A}(X) - b\|_F^2$.

18 Questions
1. Can we trust these models to return low-complexity solutions?
2. When will the solution be unique?
3. Will the solution be robust to noise?
4. How to compute it?
5. How to quantify uncertainty?
6. How to understand low complexity in a unified framework? Including: sparse, low-rank, sparse + low-rank, sign vectors, vectors from a list, permutation matrices, matrices constrained by eigenvalues, orthogonal matrices, measures.

19 Linear representation of low-dimensional models. A key notion in sparse representation: synthesis of the signal using a few vectors. A slightly different mathematical formalism for generalization. Synthesis model: a linear (positive) combination of elements from an atomic set.

20 Linear representation of low-dimensional models. A key notion in sparse representation: synthesis of the signal using a few vectors. Sparse representations via the atomic formulation. Example:

21 Linear representation of low-dimensional models Basic definitions on low-dimensional atomic representations

22 Linear representation of low-dimensional models Basic definitions on low-dimensional atomic representations : convex hull of atoms in A

23 Linear representation of low-dimensional models Basic definitions on low-dimensional atomic representations : convex hull of atoms in A atomic ball

24 Linear representation of low-dimensional models Basic definitions on low-dimensional atomic representations : convex hull of atoms in A : atomic norm* *: requires A to be centrally symmetric

25 Linear representation of low-dimensional models Basic definitions on low-dimensional atomic representations : convex hull of atoms in A : atomic norm* *: requires A to be centrally symmetric

26 Linear representation of low-dimensional models Basic definitions on low-dimensional atomic representations : convex hull of atoms in A : atomic norm* Alternative: *: requires A to be centrally symmetric

27 Linear representation of low-dimensional models. Examples with easy forms: sparse vectors, low-rank matrices*, binary vectors. (*symmetric matrices)

28 Linear representation of low-dimensional models
Examples with easy forms: sparse vectors, low-rank matrices, binary vectors.
Examples with not-so-easy forms:
A: infinite set of unit-norm rank-one tensors;
A: finite (but large) set of permutation matrices;
A: infinite set of orthogonal matrices;
A: infinite set of matrices constrained by eigenvalues;
A: infinite set of measures;
A: finite (but large) set of cut matrices.

29 A Geometrical Approach: let's turn to the blackboard!

30 A geometric perspective Other key concepts:

31 A geometric perspective Other key concepts:

32 A geometric perspective Other key concepts: Tangent cone is the set of descent directions where you do not increase the atomic norm.

33 A geometric perspective Other key concepts: Tangent cone is the set of descent directions where you do not increase the atomic norm.

34 A geometric perspective

35 A geometric perspective

36 A geometric perspective Consider the criteria:

37 A geometric perspective Consider the criteria:

38 A geometric perspective Consider the criteria:

39 A geometric perspective Consider the criteria:

40 A geometric perspective Key observation:

41 A geometric perspective How about noise?

42 A geometric perspective How about noise? Stability assumption:

43 A geometric perspective How about noise? Stability assumption:

44 A geometric perspective How about noise? Stability assumption: want epsilon large to minimize overlap between and For this 2D example: Matlab notation

45 A geometric perspective How about noise? Stability assumption:

46 A geometric perspective Can we guarantee the following?* *without knowing

47 A geometric perspective Can we guarantee the following?* YES: with randomized measurements! Gordon's Minimum Restricted Singular Values Theorem has a probabilistic characterization [Gordon 1988]. (Figure: probabilistic vs. deterministic.) *without knowing

48 A geometric perspective Can we guarantee the following?* Gordon's Minimum Restricted Singular Values Theorem has a probabilistic characterization. Key concept: width of the tangent cone! *without knowing

49 A geometric perspective Can we guarantee the following?* Gordon's Minimum Restricted Singular Values Theorem has a probabilistic characterization. *without knowing

50 A geometric perspective Can we guarantee the following?* Gordon's Minimum Restricted Singular Values Theorem has a probabilistic characterization. *without knowing

51 A geometric perspective Can we guarantee the following?* Gordon's Minimum Restricted Singular Values Theorem has a probabilistic characterization. *without knowing

52 A geometric perspective Key observation:

53 A geometric perspective How about noise? Stability assumption:

54 A geometric perspective Can we guarantee the following?* Gordon's Minimum Restricted Singular Values Theorem has a probabilistic characterization. *without knowing

55 A geometric perspective Can we guarantee the following?* *without knowing 1-sparse and 1-random measurement

56 A geometric perspective Can we guarantee the following?* *without knowing 1-sparse and 1-random measurement

57 A geometric perspective Can we guarantee the following?* *without knowing 1-sparse and 1-random measurement

58 A geometric perspective Can we guarantee the following?* A projected 6D hypercube with 64 vertices Blessing-of-dimensionality!

59 A geometric perspective Pop-quiz: What is the probability that we can determine a 2-sparse x* with 1-random measurement?

60 A geometric perspective Pop-answer: Tangent cone is too wide! Need at least 2 measurements!

61 Take-home messages
convex polytope <> atomic norm: geometry (and algebra) of representations in high dimensions;
geometric perspective <> convex criteria: convex optimization algorithms in high dimensions;
tangent cone width <> # of randomized samples: probabilistic concentration-of-measure in high dimensions.

62

63 three and do this fast with theoretical guarantees

64 Convex optimization and proximal algorithms
$\hat{x} \in \arg\min_{x\in\mathbb{R}^N} f_1(x) + f_2(x)$
$f_1 : \mathbb{R}^N \to \mathbb{R}$ is the data-fidelity term; convex, smooth; typically $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$.
$f_2 : \mathbb{R}^N \to \bar{\mathbb{R}} = \mathbb{R}\cup\{+\infty\}$ is a convex regularizer (maybe non-smooth, e.g. $\ell_1$); non-convex regularizers come later.
Difficulties: non-smoothness and large dimension ($N \gg 1$).

65 Convex and strictly convex sets
$S$ is convex if $x, x' \in S \Rightarrow \forall \lambda \in [0,1],\ \lambda x + (1-\lambda)x' \in S$.
$S$ is strictly convex if $x, x' \in S \Rightarrow \forall \lambda \in (0,1),\ \lambda x + (1-\lambda)x' \in \mathrm{int}(S)$.
(Figures: examples of convex, non-convex, strictly convex, and convex-but-not-strictly-convex sets.)

66 Convex and strictly convex functions
Extended real-valued function: $f : \mathbb{R}^N \to \bar{\mathbb{R}} = \mathbb{R}\cup\{+\infty\}$.
Domain of a function: $\mathrm{dom}(f) = \{x : f(x) \ne +\infty\}$.
$f$ is a convex function if $\forall \lambda \in [0,1],\ x, x' \in \mathrm{dom}(f):\ f(\lambda x + (1-\lambda)x') \le \lambda f(x) + (1-\lambda)f(x')$.
$f$ is a strictly convex function if $\forall \lambda \in (0,1),\ x \ne x' \in \mathrm{dom}(f):\ f(\lambda x + (1-\lambda)x') < \lambda f(x) + (1-\lambda)f(x')$.
(Figures: non-convex, convex, strictly convex, and convex-but-not-strictly examples.)

67 Convexity, coercivity, and minima
$f : \mathbb{R}^N \to \bar{\mathbb{R}} = \mathbb{R}\cup\{+\infty\}$ is coercive if $\lim_{\|x\|\to+\infty} f(x) = +\infty$.
Let $G = \arg\min_x f(x)$. If $f$ is coercive, then $G$ is a non-empty set; if $f$ is strictly convex, then $G$ has at most one element.
(Figures: coercive and strictly convex ($G = \{x^*\}$); coercive, not strictly convex; convex, not coercive ($G = \emptyset$).)

68 Euclidean projections on convex sets
Our problem: $\hat{x} \in \arg\min_{x\in\mathbb{R}^n} f_1(x) + f_2(x)$.
Consider $f_2(x) = \iota_S(x) = 0$ if $x \in S$, $+\infty$ if $x \notin S$ (convex if $S$ is convex), and $f_1(x) = \frac{1}{2}\|u - x\|_2^2$ (strictly convex).
Then $\hat{x} = \arg\min_{x\in\mathbb{R}^n} f_1(x) + f_2(x) = \arg\min_{x\in S} \|u - x\|_2^2 = P_S(u)$, the Euclidean projection.
(Figure: a convex set $S$ with $z = P_S(z)$ for $z \in S$ and $P_S(u)$ for $u \notin S$.)

69 Projected gradient algorithm
Our problem: $\hat{x} \in \arg\min_{x\in\mathbb{R}^n} f_1(x) + f_2(x)$, with $f_2(x) = \iota_S(x)$ ($S$ is a convex set) and $f_1$ some smooth function, e.g., $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$.
Projected gradient algorithm: $x_{k+1} = P_S\big(x_k - \alpha_k \nabla f_1(x_k)\big)$, where $\alpha_k$ is a step size.
If $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$, then $x_{k+1} = P_S\big(x_k - \alpha_k A^T(Ax_k - u)\big)$.
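Below is a minimal numpy sketch of this iteration for $f_1(x) = \frac{1}{2}\|Ax-u\|_2^2$ with $S$ taken as the nonnegative orthant, so that $P_S$ is a componentwise maximum with zero; the data, iteration count, and step-size choice are illustrative assumptions:

```python
# Projected gradient for min 0.5*||A x - u||^2 subject to x >= 0.
import numpy as np

def projected_gradient_nonneg(A, u, n_iter=200):
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    alpha = 1.0 / L                            # step size alpha < 2/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - u)               # gradient of f1
        x = np.maximum(x - alpha * grad, 0.0)  # projection onto {x >= 0}
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))
u = rng.standard_normal(30)
print(projected_gradient_nonneg(A, u)[:5])
```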

70 Detour: majorization-minimization (MM)
Problem: $\hat{x} \in \arg\min_{x\in\mathbb{R}^n} f(x)$.
$Q(x; x_k)$ is a majorizer of $f$ at $x_k$: $Q(x; x_k) \ge f(x)$ and $Q(x_k; x_k) = f(x_k)$.
MM algorithm: $x_{k+1} = \arg\min_x Q(x; x_k)$.
Monotonicity: $f(x_{k+1}) \le Q(x_{k+1}; x_k) \le Q(x_k; x_k) = f(x_k)$.
(Figure: $f(x)$ with successive majorizers $Q(x; x_k)$, $Q(x; x_{k+1})$ and iterates $x_k, x_{k+1}, x_{k+2}$.)

71 Projected gradient from majorization-minimization
Our problem: $\hat{x} \in \arg\min_{x\in\mathbb{R}^n} f_1(x) + f_2(x)$, with $f_2(x) = \iota_S(x)$ ($S$ is a convex set) and $f_1$ has an $L$-Lipschitz gradient; e.g., $f_1(x) = \frac{1}{2}\|Ax-u\|_2^2 \Rightarrow L = \lambda_{\max}(A^TA) = \|A\|_2^2$.
A separable approximation of $f_1$ (replacing the Hessian of $f_1$ by $\frac{1}{\alpha_k}I$):
$Q(x; x_k) = f_1(x_k) + (x - x_k)^T\nabla f_1(x_k) + \frac{1}{2\alpha_k}\|x - x_k\|_2^2$.

72 Projected gradient from majorization-minimization
Our problem: $\hat{x} \in \arg\min_{x\in\mathbb{R}^n} f_1(x) + \iota_S(x)$.
Separable approximation of $f_1$: $Q(x; x_k) = f_1(x_k) + (x - x_k)^T\nabla f_1(x_k) + \frac{1}{2\alpha_k}\|x - x_k\|_2^2$.
$Q(x; x_k)$ is a majorizer of $f_1$ if $\alpha_k < 1/L$; hence $Q(x; x_k) + \iota_S(x)$ is a majorizer of $f_1(x) + \iota_S(x)$.
MM algorithm:
$x_{k+1} = \arg\min_x Q(x; x_k) + \iota_S(x) = \arg\min_x \frac{1}{2}\|x - x_k + \alpha_k\nabla f_1(x_k)\|_2^2 + \iota_S(x) = P_S\big(x_k - \alpha_k\nabla f_1(x_k)\big)$,
the projected gradient step.

73 Proximity operators
Our problem: $\hat{x} \in \arg\min_{x\in\mathbb{R}^n} f_1(x) + f_2(x)$, with $f_2$ a convex function and $f_1(x) = \frac{1}{2}\|u - x\|_2^2$ (strictly convex).
$\hat{x} = \arg\min_{x\in\mathbb{R}^n} \frac{1}{2}\|u - x\|_2^2 + f_2(x) =: \mathrm{prox}_{f_2}(u)$,
the proximity operator [Moreau 62], [Combettes 01]; it generalizes the notion of Euclidean projection.

74 Proximity operators (linear)
$\mathrm{prox}_f(u) = \arg\min_{x\in\mathbb{R}^n} \frac{1}{2}\|u - x\|_2^2 + f(x)$ (a map $\mathbb{R}^N \to \mathbb{R}^N$).
Classical cases:
Squared $\ell_2$ regularizer, $f(x) = \frac{\lambda}{2}\|x\|_2^2$: $\mathrm{prox}_f(u) = \arg\min_x \frac{1}{2}\|u - x\|_2^2 + \frac{\lambda}{2}\|x\|_2^2 = \frac{u}{1+\lambda}$.
Squared $\ell_2$ regularizer with analysis operator $D$, $f(x) = \frac{\lambda}{2}\|Dx\|_2^2$: $\mathrm{prox}_f(u) = \arg\min_x \frac{1}{2}\|u - x\|_2^2 + \frac{\lambda}{2}\|Dx\|_2^2 = (I + \lambda D^TD)^{-1}u$.
If $D$ is a circulant matrix, this costs $O(N\log N)$ using the FFT.

75 Proximity operator of the $\ell_1$ norm
$\mathrm{prox}_{\lambda\|\cdot\|_1}(u) = \arg\min_{x\in\mathbb{R}^n} \frac{1}{2}\|u - x\|_2^2 + \lambda\|x\|_1$.
Separable: solve w.r.t. each component: $\min_x \lambda|x| + 0.5(x - u)^2$.
Possible approach: write $\lambda|x| = \max_{|z|\le 1} \lambda zx$, so
$\min_x \max_{|z|\le 1} \lambda zx + 0.5(x - u)^2 = \max_{|z|\le 1} \min_x \lambda zx + 0.5(x - u)^2 = \max_{|z|\le 1} -0.5\lambda^2 z^2 + \lambda zu$ (attained at $x = u - \lambda z$),
and the maximizer is $z^* = u/\lambda$ if $|u| \le \lambda$, $z^* = 1$ if $u > \lambda$, $z^* = -1$ if $u < -\lambda$.

76 Proximity operator of the $\ell_1$ norm: soft thresholding
$\mathrm{soft}(u; \lambda) = \mathrm{sign}(u)\max\{0, |u| - \lambda\}$, i.e., $\mathrm{soft}(u; \lambda) = \mathrm{prox}_{\lambda|\cdot|}(u)$ (for vectors, $\mathrm{soft}(u; \lambda)$ is applied component-wise).
Closed-form prox also exists for $p$-th powers of $\ell_p$ norms, $\|x\|_p^p = \sum_i |[x]_i|^p$, for $p \in \{1, 4/3, 3/2, 2, 3, 4\}$ [Combettes, Wajs, 2005].
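A direct numpy transcription of the soft-thresholding formula above (applied componentwise):

```python
# soft(u, lam) = sign(u) * max(|u| - lam, 0), applied componentwise.
import numpy as np

def soft(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

print(soft(np.array([-3.0, -0.2, 0.5, 2.0]), 1.0))  # [-2. -0.  0.  1.]
```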

77 Dual norms, proximity operators, and projections
Dual norm: given some norm $\|\cdot\| : \mathbb{R}^N \to \mathbb{R}_+$, its dual norm is $\|x\|_* = \max_{\|z\|\le 1}\langle x, z\rangle$.
The dual norm of $\|\cdot\|_p$ is $\|\cdot\|_q$, where $\frac{1}{p} + \frac{1}{q} = 1$ (Hölder conjugates); this is a simple corollary of Hölder's inequality. Examples of Hölder conjugate pairs: $(2,2)$, $(1,+\infty)$, $(3/2, 3)$, ...
These concepts are related through: $\mathrm{prox}_{\lambda\|\cdot\|}(u) = u - P_{\{x:\|x\|_*\le\lambda\}}(u)$ [Combettes, Wajs, 2005].

78 Dual norms, proximity operators, and projections
$\mathrm{prox}_{\lambda\|\cdot\|}(u) = u - P_{\{x:\|x\|_*\le\lambda\}}(u)$.
This relation underlies our earlier derivation of $\mathrm{prox}_{\lambda\|\cdot\|_1}$:
$\mathrm{prox}_{\lambda\|\cdot\|_1}(u) = u - P_{\{x:\|x\|_\infty\le\lambda\}}(u)$, since the dual of $\ell_1$ is $\ell_\infty$, $\|x\|_\infty = \max_i\{|[x]_i|\}$.
It's all separable: $\mathrm{prox}_{\lambda|\cdot|}(u) = u - P_{\{x:|x|\le\lambda\}}(u) = \mathrm{soft}(u; \lambda)$.

79 Dual norms, proximity operators, and projections
$\mathrm{prox}_{\lambda\|\cdot\|}(u) = u - P_{\{x:\|x\|_*\le\lambda\}}(u)$.
This relation allows deriving $\mathrm{prox}_{\lambda\|\cdot\|_\infty}$ and $\mathrm{prox}_{\lambda\|\cdot\|_2}$:
$\mathrm{prox}_{\lambda\|\cdot\|_\infty}(u) = u - P_{\{x:\|x\|_1\le\lambda\}}(u)$ (projection on the $\ell_1$ ball of radius $\lambda$, $O(n\log n)$ cost);
$\mathrm{prox}_{\lambda\|\cdot\|_2}(u) = u - P_{\{x:\|x\|_2\le\lambda\}}(u) = \frac{u}{\|u\|_2}\max\{0, \|u\|_2 - \lambda\}$ (vector soft thresholding).
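A numpy sketch of both operators: the projection onto the $\ell_1$ ball via the standard sort-based $O(n\log n)$ method, and the vector soft threshold $\mathrm{prox}_{\lambda\|\cdot\|_2}$; the test vector is an arbitrary example:

```python
import numpy as np

def project_l1_ball(u, radius):
    """Euclidean projection of u onto {x : ||x||_1 <= radius} (sort-based)."""
    if np.abs(u).sum() <= radius:
        return u.copy()
    a = np.sort(np.abs(u))[::-1]                     # sorted magnitudes, descending
    cumsum = np.cumsum(a)
    rho = np.nonzero(a * np.arange(1, len(u) + 1) > (cumsum - radius))[0][-1]
    theta = (cumsum[rho] - radius) / (rho + 1.0)     # soft-threshold level
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def prox_l2_norm(u, lam):
    """prox of lam*||.||_2: shrink the whole vector toward zero (vector soft threshold)."""
    norm = np.linalg.norm(u)
    if norm == 0.0:
        return u.copy()
    return (u / norm) * max(norm - lam, 0.0)

u = np.array([2.0, -1.0, 0.5])
print(project_l1_ball(u, 1.0))   # lies on the l1 ball of radius 1
print(prox_l2_norm(u, 1.0))      # same direction as u, shorter length
```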

80 Proximity operators of atomic norms
$\mathrm{prox}_{\lambda\|\cdot\|}(u) = u - P_{\{x:\|x\|_*\le\lambda\}}(u)$.
This relation allows deriving prox operators of atomic norms: $\|x\|_\mathcal{A} = \inf\{t > 0 : x \in t\,\mathrm{conv}(\mathcal{A})\}$.
The dual of an atomic norm: $\|x\|_{\mathcal{A}*} = \max_{\|z\|_\mathcal{A}\le 1}\langle z, x\rangle = \max_{z\in\mathrm{conv}(\mathcal{A})}\langle z, x\rangle = \max\{\langle a, x\rangle;\ a\in\mathcal{A}\}$.
Hence $P_{\{x:\|x\|_{\mathcal{A}*}\le\lambda\}}(u) = \arg\min_{\langle a, x\rangle\le\lambda,\ \forall a\in\mathcal{A}} \|u - x\|_2^2$ and $\mathrm{prox}_{\lambda\|\cdot\|_\mathcal{A}}(u) = u - \arg\min_{\langle a, x\rangle\le\lambda,\ \forall a\in\mathcal{A}} \|u - x\|_2^2$.

81 Proximity operators of atomic norms: $\ell_1$
Deriving $\mathrm{prox}_{\lambda\|\cdot\|_1}$ from the atomic norm view:
$\|x\|_1 = \|x\|_\mathcal{A}$ with $\mathcal{A} = \{e_1, e_2, \dots, e_N, -e_1, \dots, -e_N\}$, $|\mathcal{A}| = 2N$.
Dual: $\|x\|_{\mathcal{A}*} = \max\{\langle a, x\rangle;\ a\in\mathcal{A}\} = \max_i\{|[x]_i|\} = \|x\|_\infty$.
$\mathrm{prox}_{\lambda\|\cdot\|_1}(u) = u - P_{\{x:\|x\|_\infty\le\lambda\}}(u) = \mathrm{soft}(u; \lambda)$.

82 Proximity operators of atomic norms: $\ell_\infty$
Deriving $\mathrm{prox}_{\lambda\|\cdot\|_\infty}$ from the atomic norm view:
$\|x\|_\infty = \|x\|_\mathcal{A}$ with $\mathcal{A} = \{-1, +1\}^N$, $|\mathcal{A}| = 2^N$.
Dual: $\|x\|_{\mathcal{A}*} = \max\{\langle a, x\rangle;\ a\in\mathcal{A}\} = \sum_{i=1}^N |[x]_i| = \|x\|_1$.
$\mathrm{prox}_{\lambda\|\cdot\|_\infty}(u) = u - P_{\{x:\|x\|_1\le\lambda\}}(u)$.

83 Proximity of atomic norms: matrix nuclear norm
Matrix nuclear norm: $\|X\|_* = \sum_i \sigma_i(X) = \sum_i \sqrt{\lambda_i(X^TX)}$.
$\|X\|_* = \|X\|_\mathcal{A}$ with $\mathcal{A} = \{Z : \mathrm{rank}(Z) = 1,\ \|Z\|_F = 1\}$, where $\mathrm{rank}(Z) = |\{\sigma_i(Z)\ne 0\}|$ and $\|Z\|_F^2 = \sum_{ij}[Z]_{ij}^2 = \sum_i \sigma_i^2(Z)$ is the Frobenius norm.
Dual: $\|X\|_{\mathcal{A}*} = \max\{\langle Z, X\rangle;\ Z\in\mathcal{A}\} = \max\{\sum_i \sigma_i(Z)\sigma_i(X) : \mathrm{rank}(Z) = 1,\ \sum_i \sigma_i^2(Z) = 1\} = \sigma_{\max}(X) = \|X\|_2$, the spectral norm.

84 Proximity of atomic norms: matrix nuclear norm
Euclidean matrix projection: $P_S(X) = \arg\min_{Z\in S}\|Z - X\|_F^2$.
Note: for any unitary matrix $U$ ($U^TU = I$, $UU^T = I$), $\|UM\|_F^2 = \mathrm{trace}(M^TU^TUM) = \mathrm{trace}(M^TM) = \|M\|_F^2$.
$\mathrm{prox}_{\lambda\|\cdot\|_*}(X) = X - P_{\{Z:\|Z\|_2\le\lambda\}}(X)$
$= U\Sigma V^T - P_{\{Z:\sigma_{\max}(Z)\le\lambda\}}(U\Sigma V^T)$, where $X = U\Sigma V^T$ and $\Sigma$ is the diagonal matrix of singular values [Lewis, Malick, 2009]
$= U\,\mathrm{diag}\big(\mathrm{diag}(\Sigma) - P_{\{x:\|x\|_\infty\le\lambda\}}(\mathrm{diag}(\Sigma))\big)V^T = U\,\mathrm{soft}(\Sigma; \lambda)\,V^T$,
singular value thresholding (SVT).
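A numpy sketch of singular value thresholding as derived above (the test matrix is an arbitrary low-rank-plus-noise example):

```python
# Singular value thresholding: prox of lam*||.||_* via soft-thresholding the singular values.
import numpy as np

def svt(X, lam):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_thresh = np.maximum(s - lam, 0.0)          # soft-threshold the singular values
    return (U * s_thresh) @ Vt                   # U diag(s_thresh) V^T

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 8))  # rank-3 matrix
X_noisy = X + 0.1 * rng.standard_normal((8, 8))
Y = svt(X_noisy, 1.0)
print(np.linalg.matrix_rank(Y, tol=1e-8), np.linalg.norm(X - Y))
```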

85 Proximity of atomic norms: matrix spectral norm
Matrix spectral norm: $\|X\|_2 = \sigma_{\max}(X)$.
$\|X\|_2 = \|X\|_\mathcal{A}$ with $\mathcal{A} = \{Z : Z^TZ = I\} = \{Z : \sigma_i(Z) = 1,\ \forall i\}$, the orthogonal matrices.
Dual: $\|X\|_{\mathcal{A}*} = \max\{\langle Z, X\rangle;\ Z\in\mathcal{A}\} = \max\{\sum_i \sigma_i(Z)\sigma_i(X) : \sigma_i(Z) = 1,\ \forall i\} = \sum_i \sigma_i(X) = \|X\|_*$, the nuclear norm.

86 Proximity of atomic norms: matrix spectral norm
$\mathrm{prox}_{\lambda\|\cdot\|_2}(X) = X - P_{\{Z:\|Z\|_*\le\lambda\}}(X)$
$= U\Sigma V^T - P_{\{Z:\|Z\|_*\le\lambda\}}(U\Sigma V^T)$, where $\Sigma$ is the diagonal matrix of singular values,
$= U\big(\Sigma - P_{\{Z:\sum_i\sigma_i(Z)\le\lambda\}}(\Sigma)\big)V^T = U\,\mathrm{diag}\big(\mathrm{diag}(\Sigma) - P_{\{x:\|x\|_1\le\lambda\}}(\mathrm{diag}(\Sigma))\big)V^T$,
the residual of the projection of the singular values on an $\ell_1$ ball of radius $\lambda$.

87 Proximity and atomic sets: vectors vs. matrices
Vectors:
$\ell_1$, $\|x\|_1$: prox = component-wise soft thresholding; atomic set $\mathcal{A} = \{\pm e_i\}$, $|\mathcal{A}| = 2N$.
$\ell_\infty$, $\|x\|_\infty$: prox = residual of projection on an $\ell_1$ ball; atomic set $\mathcal{A} = \{\pm 1\}^N$, $|\mathcal{A}| = 2^N$.
$\ell_2$, $\|x\|_2$: prox = vector soft thresholding; atomic set $\mathcal{A}$ = set of all unit-norm vectors, $|\mathcal{A}| = \infty$.
Matrices:
nuclear, $\|X\|_*$: prox = singular value thresholding; $\mathcal{A}$ = set of all rank-1, unit-norm matrices.
spectral, $\|X\|_2$: prox = residual of singular-value projection on an $\ell_1$ ball; $\mathcal{A}$ = set of all orthogonal matrices.
Frobenius, $\|X\|_F$: prox = matrix soft thresholding; $\mathcal{A}$ = all matrices of unit Frobenius norm.

88 Proximal algorithms
Back to the problem: $\hat{x} \in \arg\min_{x\in\mathbb{R}^n} f_1(x) + f_2(x)$, with $f_2$ a proper convex function and $f_1$ having an $L$-Lipschitz gradient; e.g., $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$ with $L = \lambda_{\max}(A^TA)$.
Separable majorizer ($\alpha_k < 1/L$): $Q(x; x_k) = f_1(x_k) + (x - x_k)^T\nabla f_1(x_k) + \frac{1}{2\alpha_k}\|x - x_k\|_2^2$.
Majorization-minimization algorithm:
$x_{k+1} = \arg\min_x Q(x; x_k) + f_2(x) = \arg\min_x \frac{1}{2}\|x - x_k + \alpha_k\nabla f_1(x_k)\|_2^2 + \alpha_k f_2(x)$
$\Rightarrow x_{k+1} = \mathrm{prox}_{\alpha_k f_2}\big(x_k - \alpha_k\nabla f_1(x_k)\big)$.

89 Proximal algorithms: convergence
Problem: $\hat{x} \in \arg\min_{x\in\mathbb{R}^n} f(x) = f_1(x) + f_2(x)$, where $f_1$ has an $L$-Lipschitz gradient; e.g., $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$, $L = \lambda_{\max}(A^TA)$.
Iterative shrinkage/thresholding (IST) (or forward-backward): $x_{k+1} = \mathrm{prox}_{\alpha_k f_2}\big(x_k - \alpha_k\nabla f_1(x_k)\big)$.
If $\alpha_k < 1/L$, IST is a majorization-minimization algorithm, thus $f(x_{k+1}) \le f(x_k)$; since $f(x) \ge 0$, the sequence $(f(x_1), f(x_2), \dots, f(x_k), \dots)$ converges.
Attention: this does not imply convergence of $(x_1, \dots, x_k, \dots)$.

90 Proximal algorithms: convergence
$\hat{x} \in G = \arg\min_{x\in\mathbb{R}^n} f_1(x) + f_2(x)$.
IST algorithm: $x_{k+1} = \mathrm{prox}_{\alpha_k f_2}\big(x_k - \alpha_k\nabla f_1(x_k)\big)$.
If $0 < \alpha_k < 2/L$, then $(x_1, x_2, \dots, x_k, \dots)$ converges to a point in $G$.
Inexact version (with errors): $x_{k+1} = \mathrm{prox}_{\alpha_k f_2}\big(x_k - (\alpha_k\nabla f_1(x_k) + b_k)\big) + a_k$; convergence is still guaranteed if $\sum_{k=1}^\infty \|a_k\| < \infty$ and $\sum_{k=1}^\infty \|b_k\| < \infty$.
Results and proofs in [Combettes and Wajs, 2005].
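A compact numpy sketch of the IST iteration for the $\ell_1$-regularized least-squares problem; the synthetic data, fixed step size, and iteration count are illustrative assumptions:

```python
# IST / ISTA for min_x 0.5*||A x - u||^2 + lam*||x||_1.
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ist(A, u, lam, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    alpha = 1.0 / L                    # step size alpha < 2/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft(x - alpha * A.T @ (A @ x - u), alpha * lam)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 200))
x_true = np.zeros(200); x_true[:5] = 1.0
u = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = ist(A, u, lam=0.1)
print(np.count_nonzero(np.abs(x_hat) > 1e-3))
```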

91 Proximal algorithms: convergence
Convergence of function values: $(f(x_1), \dots, f(x_k), \dots) \to f(\hat{x})$.
Convergence of iterates: $(x_1, x_2, \dots, x_k, \dots) \to \hat{x}$.
Convergence rates for function values [Beck, Teboulle, 2009]: IST achieves $f(x_k) - f(\hat{x}) = O(1/k)$.
Convergence rates for the iterates require further assumptions on $f$.

92 Proximal algorithms: convergence of iterates
$\hat{x} = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + f_2(x)$.
With $L = \lambda_{\max}(A^TA)$ and $l = \lambda_{\min}(A^TA) > 0$ (condition number $L/l$), $G = \{\hat{x}\}$ (unique minimizer).
Under-relaxed ($\gamma < 1$) or over-relaxed ($\gamma > 1$) IST: $x_{k+1} = (1-\gamma)x_k + \gamma\,\mathrm{prox}_{f_2}\big(x_k - A^T(Ax_k - u)\big)$.
With the optimal choice $\gamma = \frac{2}{L+l}$: Q-linear convergence. Small $l \Rightarrow$ contraction factor $\rho \lesssim 1 \Rightarrow$ slow convergence! [F, Bioucas-Dias, 2007]

93 Proximal algorithms: convergence of iterates
With $\hat{x} \in G = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$ and $L = \lambda_{\max}(A^TA)$, using a step size $\alpha < 2/L$:
$x_{k+1} = \mathrm{soft}\big(x_k - \alpha A^T(Ax_k - u);\ \alpha\lambda\big)$.
There is a set $Z \subseteq \{1, 2, \dots, n\}$ such that $\hat{x} \in G \Rightarrow [\hat{x}]_Z = 0$. Then, after a finite number of iterations, $[x_k]_Z = [\hat{x}]_Z = 0$.
After this, Q-linear convergence, with optimal choice $\alpha = \frac{2}{L+l}$, where $l = \lambda_{\min}(A_{\bar{Z}}^T A_{\bar{Z}}) > 0$ [Hale, Yin, Zhang, 2008].

94 Slowness and acceleration of IST
Problem: $\hat{x} \in G = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$.
IST algorithm: $x_{k+1} = \mathrm{soft}\big(x_k - \alpha A^T(Ax_k - u);\ \alpha\lambda\big)$.
IST is slow if $A$ is very ill-conditioned and/or $\lambda$ is very small!
Several proposals for accelerated variants of IST: methods with memory (TwIST, FISTA); quasi-Newton methods (SpaRSA); continuation, i.e., using a varying $\lambda$ (FPC, SpaRSA).

95 Memory-based variants of IST: FISTA
Fast IST algorithm (FISTA); based on Nesterov's work (1980s) [Beck, Teboulle, 2009].
FISTA:
$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$
$z_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$
$x_{k+1} = \mathrm{soft}\big(z_{k+1} - \alpha A^T(Az_{k+1} - u);\ \alpha\lambda\big)$
(Figure: convergence of IST vs. FISTA.)
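A numpy sketch of the FISTA recursion above for the same $\ell_1$ problem (synthetic data; fixed step size $1/L$):

```python
# FISTA for min_x 0.5*||A x - u||^2 + lam*||x||_1.
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(A, u, lam, n_iter=300):
    L = np.linalg.norm(A, 2) ** 2
    alpha = 1.0 / L
    x = x_prev = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x + ((t - 1.0) / t_next) * (x - x_prev)           # momentum step
        x_prev = x
        x = soft(z - alpha * A.T @ (A @ z - u), alpha * lam)  # prox-gradient at z
        t = t_next
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 200))
u = A @ np.r_[np.ones(5), np.zeros(195)] + 0.01 * rng.standard_normal(50)
print(np.count_nonzero(np.abs(fista(A, u, 0.1)) > 1e-3))
```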

96 Memory-based variants of IST: TwIST
Inspired by two-step methods for linear systems [Frankel, 1950], [Axelsson, 1996].
TwIST (two-step IST) [Bioucas-Dias, F, 2007]:
$x_{k+1} = (\alpha - \beta)x_k + (1 - \alpha)x_{k-1} + \beta\,\mathrm{prox}_{f_2}\big(x_k - A^T(Ax_k - u)\big)$
Q-linear convergence.
(Figure: convergence of TwIST vs. IST.)

97 Memory-based variants of IST: TwIST
Image deblurring example: original, blurred (9×9 blur, 40 dB noise), and restored images, with
$\hat{x} \in \arg\min_{x\in\mathbb{R}^n} \frac{1}{2}\|B\Psi x - u\|_2^2 + \lambda\|x\|_1$,
where $x$ are the representation coefficients and $\Psi$ is the dictionary (e.g., wavelet basis, frame, ...).
(Figures: objective function and SNR vs. iterations for IST, over-relaxed IST, second-order/full TwIST, and TwIST.)

98 Quasi-Newton acceleration of IST: SpaRSA
IST: $x_{k+1} = \mathrm{prox}_{\alpha_k f_2}\big(x_k - \alpha_k\nabla f_1(x_k)\big)$.
A Newton step (instead of gradient descent) would be $x_{k+1} = \mathrm{prox}_{\alpha_k f_2}\big(x_k - [H(x_k)]^{-1}\nabla f_1(x_k)\big)$, where $H(x_k)$ is the Hessian (matrix of second derivatives) ... computationally too expensive!
Barzilai-Borwein approach [Barzilai-Borwein, 1988], [Wright, Nowak, F, 2009]: choose $\alpha_k$ so that $\frac{1}{\alpha_k}I \simeq H(x_k)$, i.e.,
$\frac{1}{\alpha_k} = \arg\min_\eta \|\eta(x_k - x_{k-1}) - (\nabla f(x_k) - \nabla f(x_{k-1}))\|_2^2$.
If $f_1(x) = \frac{1}{2}\|Ax - u\|_2^2$, then $\alpha_k = \frac{\|x_k - x_{k-1}\|_2^2}{\|A(x_k - x_{k-1})\|_2^2}$.
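A numpy sketch of IST with Barzilai-Borwein step sizes in the spirit of SpaRSA; note that the actual SpaRSA method includes safeguards (e.g., an acceptance test) that are omitted here, and the data are synthetic:

```python
# IST with Barzilai-Borwein step sizes (SpaRSA-style, no safeguards).
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ist_bb(A, u, lam, n_iter=200):
    x = np.zeros(A.shape[1])
    grad = A.T @ (A @ x - u)
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2     # initial step size
    for _ in range(n_iter):
        x_new = soft(x - alpha * grad, alpha * lam)
        grad_new = A.T @ (A @ x_new - u)
        s, y = x_new - x, grad_new - grad       # differences of iterates and gradients
        if s @ y > 1e-12:
            alpha = (s @ s) / (s @ y)           # BB step: 1/alpha approximates the Hessian
        x, grad = x_new, grad_new
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 200))
u = A @ np.r_[np.ones(5), np.zeros(195)]
print(np.count_nonzero(np.abs(ist_bb(A, u, 0.1)) > 1e-3))
```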

99 Acceleration via continuation
IST: $x_{k+1} = \mathrm{soft}\big(x_k - \alpha A^T(Ax_k - u);\ \alpha\lambda\big)$; slow if $\lambda$ is small.
Observation: IST (as SpaRSA) benefits from warm-starting (being initialized close to the minimizer).
Continuation: start with a large $\lambda$ and slowly decrease it while tracking the solution [F, Nowak, Wright, 2007], [Hale, Yin, Zhang, 2007].
IST + continuation = fixed-point continuation (FPC) [Hale, Yin, Zhang, 2007].

100 Acceleration via continuation
$\hat{x} \in G = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$, with data $u = A\bar{x} + n$.
Let $\lambda_{\max} = \|A^T u\|_\infty$; then $\lambda \ge \lambda_{\max} \Rightarrow \hat{x} = 0$.
(Figure: behavior as $\lambda$ is decreased from $\lambda_{\max}$.)
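A sketch of continuation with warm starts: run a few IST iterations for a decreasing sequence of $\lambda$ values starting below $\lambda_{\max} = \|A^Tu\|_\infty$ and reuse each solution to initialize the next stage; the schedule and inner iteration counts are illustrative assumptions:

```python
# Continuation: solve a sequence of problems with decreasing lambda, warm-starting each one.
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ist(A, u, lam, x0, n_iter=50):
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2
    x = x0.copy()
    for _ in range(n_iter):
        x = soft(x - alpha * A.T @ (A @ x - u), alpha * lam)
    return x

def ist_continuation(A, u, lam_target, n_stages=6):
    lam_max = np.abs(A.T @ u).max()            # above this, the solution is exactly zero
    lams = np.geomspace(0.5 * lam_max, lam_target, n_stages)
    x = np.zeros(A.shape[1])
    for lam in lams:                           # warm-start each stage at the previous solution
        x = ist(A, u, lam, x)
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 200))
u = A @ np.r_[np.ones(5), np.zeros(195)]
print(np.count_nonzero(np.abs(ist_continuation(A, u, 0.05)) > 1e-3))
```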

101 Some speed comparisons, from [Lorenz, 2011]
$\hat{x} = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$, with $A = [I\ U\ R]$ and a solution with 120 non-zeros.
(Figure: running-time comparison, including IST.)

102 Proximal algorithms for matrices
$\hat{M} \in \arg\min_{M\in\mathbb{R}^{n\times n}} \frac{1}{2}\|\mathcal{A}(M) - U\|_F^2 + \mu\|M\|_*$, where $\mathcal{A}$ is a linear operator with adjoint $\mathcal{A}^*$.
The proximal algorithm (IST) is as before: $X_{k+1} = \mathrm{svt}_{\mu\alpha_k}\big(X_k - \alpha_k\mathcal{A}^*(\mathcal{A}(X_k) - U)\big)$.
Matrix completion: $\mathcal{A}(X) = X_\Omega$ (a subset of entries).
(Figure, from [Toh, Yun, 2009]: IST vs. APG (FISTA) vs. FPC (continuation) vs. APG + continuation ... the importance of acceleration!)

103 Another class of methods: augmented Lagrangian
The problem: $\min_x f(x)$ s.t. $Ax = u$.
The augmented Lagrangian (AL), with penalty parameter $\mu$: $L_\mu(x, \theta) = f(x) + \theta^T(Ax - u) + \frac{\mu}{2}\|Ax - u\|_2^2$.
The AL method (ALM), a.k.a. the method of multipliers [Hestenes, Powell, 1969]:
$x_{k+1} = \arg\min_x L_\mu(x, \theta_k)$, $\quad \theta_{k+1} = \theta_k + \mu(Ax_{k+1} - u)$.
It can be written (in scaled form) as:
$x_{k+1} = \arg\min_x f(x) + \frac{\mu}{2}\|Ax - u - d_k\|_2^2$, $\quad d_{k+1} = d_k - (Ax_{k+1} - u)$.
Similar to the Bregman method [Osher, Burger, Goldfarb, Xu, Yin, 2005], [Yin, Osher, Goldfarb, Darbon, 2008].

104 Augmented Lagrangian for variable splitting
The problem: $\min_x f_1(Ax) + f_2(x)$.
Equivalent constrained formulation: $\min_{x,z} f_1(z) + f_2(x)$ s.t. $Ax - z = 0$; it can be written as $\min_y f(y)$ s.t. $Gy = 0$, with $y = (x, z)$ and $G = [A\ \ -I]$.
ALM:
$(x_{k+1}, z_{k+1}) = \arg\min_{x,z} f_1(z) + f_2(x) + \frac{\mu}{2}\|Ax - z - d_k\|_2^2$,
$d_{k+1} = d_k - (Ax_{k+1} - z_{k+1})$.

105 Augmented Lagrangian for variable splitting
It may be hard to solve $(x_{k+1}, z_{k+1}) = \arg\min_{x,z} f_1(z) + f_2(x) + \frac{\mu}{2}\|Ax - z - d_k\|_2^2$ jointly.
Alternative: alternate the minimizations,
$x_{k+1} = \arg\min_x f_2(x) + \frac{\mu}{2}\|Ax - z_k - d_k\|_2^2$,
$z_{k+1} = \arg\min_z f_1(z) + \frac{\mu}{2}\|Ax_{k+1} - z - d_k\|_2^2$,
$d_{k+1} = d_k - (Ax_{k+1} - z_{k+1})$.
This is the alternating direction method of multipliers (ADMM) [Glowinski, Marrocco, 1975], [Gabay, Mercier, 1976], [Eckstein, Bertsekas, 1992].
When applied to $\hat{x} = \arg\min_x \frac{1}{2}\|Ax - u\|_2^2 + \lambda\|x\|_1$, it yields the split augmented Lagrangian shrinkage algorithm (SALSA) [F, Bioucas-Dias, Afonso, 2009].
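A numpy sketch of ADMM applied to $\min_x \frac{1}{2}\|Ax-u\|_2^2 + \lambda\|x\|_1$ using the common $x = z$ splitting (a standard ADMM-for-LASSO variant in the spirit of SALSA, not a line-by-line transcription of the slides); the penalty $\mu$ and data are illustrative assumptions:

```python
# ADMM for min_x 0.5*||A x - u||^2 + lam*||x||_1, with the splitting x = z.
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, u, lam, mu=1.0, n_iter=200):
    n = A.shape[1]
    AtA, Atu = A.T @ A, A.T @ u
    # The x-update solves a ridge-type linear system; factor it once.
    chol = np.linalg.cholesky(AtA + mu * np.eye(n))
    z = np.zeros(n)
    d = np.zeros(n)
    for _ in range(n_iter):
        rhs = Atu + mu * (z - d)
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))  # (A^T A + mu I) x = rhs
        z = soft(x + d, lam / mu)            # prox of (lam/mu)*||.||_1
        d = d + x - z                        # scaled dual update
    return z

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 200))
u = A @ np.r_[np.ones(5), np.zeros(195)]
print(np.count_nonzero(np.abs(admm_lasso(A, u, 0.1)) > 1e-3))
```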

106 Augmented Lagrangian for variable splitting
Testing ADMM/SALSA on a typical image deblurring problem:
$\hat{x} \in \arg\min_{x\in\mathbb{R}^n} \frac{1}{2}\|B\Psi x - u\|_2^2 + \lambda\|x\|_1$.
(Figures: blurred and restored images; objective function $0.5\|y - Ax\|^2 + \lambda\,\mathrm{TV}(x)$ vs. CPU time in seconds for TwIST, FISTA, SpaRSA, and SALSA.)

107 Handling more than two functions
$\hat{x} \in \arg\min_{x\in\mathbb{R}^n} f_0(x) + f_1(x) + \dots + f_n(x)$, where $f_0$ has an $L$-Lipschitz gradient and $f_1, \dots, f_n$ are convex.
Possible uses: multiple regularizers, positivity constraints, ...
Generalized forward-backward algorithm [Raguet, Fadili, Peyré, 2011]:
Parameters: $\omega_1, \dots, \omega_n \in (0,1)$ s.t. $\sum_j \omega_j = 1$.
Initialization: $k = 0$; $z_0^1, \dots, z_0^n$; $x_0 = \sum_{j=1}^n \omega_j z_0^j$.
Repeat until convergence:
for $i = 1:n$: $z_{k+1}^i = z_k^i + \mathrm{prox}_{\alpha_k f_i/\omega_i}\big(2x_k - z_k^i - \alpha_k\nabla f_0(x_k)\big) - x_k$;
$x_{k+1} = \sum_{i=1}^n \omega_i z_{k+1}^i$; $k \leftarrow k+1$.

108 Handling more than two functions
$\hat{x} \in \arg\min_{x\in\mathbb{R}^n} f_1(x) + \dots + f_n(x)$, with $f_1, \dots, f_n$ arbitrary convex functions.
ADMM-based method [F and Bioucas-Dias, 2009], [Setzer, Steidl, Teuber, 2009]:
Parameter: $\mu$. Initialization: $k = 0$; $z_0^1, \dots, z_0^n$; $y_0^1, \dots, y_0^n$.
Repeat until convergence:
$x_{k+1} = \frac{1}{n}\sum_{i=1}^n (y_k^i - z_k^i)$;
for $i = 1:n$: $y_{k+1}^i = \mathrm{prox}_{f_i/\mu}\big(x_{k+1} + z_k^i\big)$; $z_{k+1}^i = z_k^i + x_{k+1} - y_{k+1}^i$;
$k \leftarrow k+1$.
