Sparsity Regularization

1 Sparsity Regularization. Bangti Jin. Course: Inverse Problems & Imaging.

2 Outline
1 Motivation: sparsity?
2 Mathematical preliminaries
3 $\ell_1$ solvers

3 Problem setup: finite-dimensional formulation
$x \in \mathbb{R}^p$: the unknown signal
$b = Ax + \eta$, $\eta \in \mathbb{R}^n$: additive Gaussian noise; $\epsilon = \|\eta\|$: noise level
$A \in \mathbb{R}^{n \times p}$, $p \gg n$, with normalized columns, i.e., $\|A_i\| = 1$
The problem has infinitely many solutions (if it has one); which one shall we take?
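As a concrete reference for this setup, here is a minimal numpy sketch (not from the slides; all names and sizes are illustrative) that builds a synthetic instance $b = Ax + \eta$ with normalized columns and a sparse ground truth:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 64, 256, 5                 # measurements, unknowns, sparsity level
A = rng.standard_normal((n, p))
A /= np.linalg.norm(A, axis=0)       # normalize columns so ||A_i|| = 1

x_true = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
x_true[support] = rng.standard_normal(s)

sigma = 0.01
eta = sigma * rng.standard_normal(n)
b = A @ x_true + eta                 # noisy data
eps = np.linalg.norm(eta)            # noise level
```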

4 Insights from exact data. Toy example: find a reasonable solution to the problem
$x_1 + 2x_2 = 5$
There are infinitely many solutions, e.g., $(1,2)$. Which one shall we take?
Convention: least squares (Gauss 1809): $\min_x \|x\|_2^2$ s.t. $x_1 + 2x_2 = 5$
Generalized minimum-energy solution: $\min_x |x_1|^p + |x_2|^p$, $0 < p \le 2$, s.t. $x_1 + 2x_2 = 5$

5 [Figure: level sets of $|x_1|^p + |x_2|^p$ meeting the line $x_1 + 2x_2 = 5$ for $p = 2$, $3/2$, $1$, $1/2$; the least-squares solution is $(1,2)$.]

6 In case of noisy data: Tikhonov regularization
$\tfrac12\|Ax - b\|^2 + \alpha\,\psi(x)$
Two possible choices of $\psi(x)$ (convexity...):
classical Tikhonov regularization: $\psi(x) = \tfrac12\|x\|_2^2 := \tfrac12\sum_i x_i^2$
sparsity regularization (and general analogues): $\psi(x) = \tfrac1p\|x\|_p^p := \tfrac1p\sum_i |x_i|^p$, $p \in [0,1]$

7 In case of noisy data: Tikhonov regularization
$\tfrac12\|Ax - b\|^2 + \alpha\,\psi(x)$
Assumption: i.i.d. additive Gaussian noise on the data, $b_i = (Ax)_i + \xi_i$, $\xi_i \sim N(0,\sigma^2)$, so the likelihood is $p(b \mid x) \propto e^{-\frac{1}{2\sigma^2}\|Ax - b\|^2}$
Assumption: prior knowledge, $p(x) \propto e^{-\lambda\psi(x)}$
classical Tikhonov regularization: Gaussian prior distribution; sparsity regularization: Laplace prior distribution
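The correspondence between penalty and prior comes from maximum a posteriori estimation; the short derivation below fills in this step, using the identification $\alpha = \lambda\sigma^2$ (an assumption of this sketch, not stated on the slide):
$-\log p(x \mid b) = -\log p(b \mid x) - \log p(x) + \text{const} = \tfrac{1}{2\sigma^2}\|Ax - b\|^2 + \lambda\,\psi(x) + \text{const}$
so minimizing the negative log-posterior is, up to the scaling $\alpha = \lambda\sigma^2$, the Tikhonov functional above.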

8 [Figure: Gaussian vs. Laplace probability densities.]

9 [Figure: Gaussian vs. Laplace, continued.]

10 The penalty can be more general: $\psi(x) = \psi(Wx)$ under a certain transformation $W$, e.g., wavelet, framelet, curvelet, shearlet... The discussion below extends to these more complex cases.

11 A natural idea for a sparse solution is to penalize the number of nonzero unknowns:
$\tfrac12\|Ax - b\|^2 + \alpha\|x\|_0$, where $\|x\|_0 = \#(\text{nonzeros in } x)$
Conceptually intuitive, but computationally very challenging. Approximations:
bridge penalty: $\|x\|_q^q = \sum_i |x_i|^q$, $q \in (0,1)$
$\ell_1$ penalty: $\|x\|_1 = \sum_i |x_i|$
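A tiny numerical illustration (assuming numpy; the vector is made up for illustration) of how the bridge penalties approach the counting penalty $\|x\|_0$ as $q \to 0$:

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, -0.5, 0.0, 1e-3])

print("l0 :", np.count_nonzero(x))           # ||x||_0
print("l1 :", np.abs(x).sum())               # ||x||_1
for q in (0.5, 0.1, 0.01):
    print(f"l{q}:", (np.abs(x) ** q).sum())  # ||x||_q^q, tends to ||x||_0 as q -> 0
```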


13 The $\ell_0$ penalty is the genuine choice, but VERY challenging; there are different ways to approximate it. $\ell_q$ is one approximation, and there are many others. The $\ell_1$ penalty is especially popular, since it is convex and, further, there is a solid theory.

14 Notation: $S = \{1,\dots,p\}$.
For $I \subseteq S$, $x_I$: subvector consisting of the entries of $x$ indexed by $i \in I$.
For $I \subseteq S$, $A_I$: submatrix consisting of the columns of $A$ indexed by $i \in I$.

15 Restricted isometry property (RIP): $A$ satisfies RIP of order $s$ if there is a $\delta \in (0,1)$ s.t.
$(1-\delta)\|c\|^2 \le \|A_I c\|^2 \le (1+\delta)\|c\|^2$ for all $I \subseteq S$ with $|I| \le s$,
with $\delta_s$ being the smallest constant for which RIP holds:
$\delta_s := \inf\{\delta : (1-\delta)\|c\|^2 \le \|A_I c\|^2 \le (1+\delta)\|c\|^2 \ \forall\, |I| \le s,\ c \in \mathbb{R}^{|I|}\}$
denoted by RIP$(s,\delta_s)$.
RIP$(s,\delta_s)$ implies $1-\delta_s \le \lambda_{\min}(A_I^\top A_I) \le \lambda_{\max}(A_I^\top A_I) \le 1+\delta_s$, i.e., the submatrix $A_I$ is fairly well conditioned.
RIP is difficult to compute.
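Although the RIP constant is intractable to compute in general, for tiny $p$ and $s$ one can evaluate $\delta_s$ directly from the definition by enumerating supports; a brute-force sketch (assuming numpy and itertools, names illustrative):

```python
import itertools
import numpy as np

def rip_constant(A, s):
    # Brute-force delta_s; exponential in s, only for small examples.
    # Checking |I| = s suffices: smaller supports give principal submatrices
    # of some A_I^T A_I, whose eigenvalues lie within the same range.
    p = A.shape[1]
    delta = 0.0
    for I in itertools.combinations(range(p), s):
        cols = A[:, list(I)]
        eig = np.linalg.eigvalsh(cols.T @ cols)   # eigenvalues of A_I^T A_I, ascending
        delta = max(delta, 1.0 - eig[0], eig[-1] - 1.0)
    return delta

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 30)) / np.sqrt(20)   # entries N(0, 1/n): near-unit columns
print(rip_constant(A, s=2))
```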

16 Matrices satisfying RIP:
random matrices with i.i.d. Gaussian/Bernoulli entries with zero mean and variance $1/n$: RIP holds with overwhelming probability if $s \le c\,n/\log(p/n)$
random matrices from the Fourier ensemble (select $n$ rows of the $p \times p$ discrete Fourier transform matrix uniformly at random, then re-normalize): RIP holds with overwhelming probability if $s \le c\,n/\log^6 p$
The DFT matrix is used in magnetic resonance imaging.

17 Under certain conditions on the matrix $A$ and the true solution $x^\dagger$:
$\|x_\alpha - x^\dagger\| \le C\epsilon$
The result holds under $\delta_{3s} + 3\delta_{4s} < 2$; $n$ is nearly of order $s$, i.e., $n \sim s$ up to log factors.
The reconstruction error is of the same order as the data error $\epsilon$, much better than for classical inverse problems, where only sublinear rates hold under much stronger conditions.
There are other methods that achieve similar errors.

18 [Figure: an example from radar imaging, (c) Potter et al. 2010; passive imaging, backhoe data; conventional vs. sparsity reconstruction.]

19 Convex functions: $f(x)$ is convex over its domain $\mathrm{dom}(f)$ if
$f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda)f(x_2)$ for all $\lambda \in [0,1]$, $x_1, x_2 \in \mathrm{dom}(f)$.
$f$ is concave if $-f$ is convex.
$f$ is strictly convex if $f(\lambda x_1 + (1-\lambda)x_2) < \lambda f(x_1) + (1-\lambda)f(x_2)$ for all $\lambda \in (0,1)$, $x_1 \ne x_2 \in \mathrm{dom}(f)$.
If $f$ is differentiable, $f(x_2) \ge f(x_1) + (\nabla f(x_1), x_2 - x_1)$: the first-order Taylor expansion is a global under-estimator.
How to verify: by definition; if $f$ is twice differentiable: convex $\iff \nabla^2 f \succeq 0$.

20 The $\ell_1$ term is not differentiable, but a generalized derivative exists: a vector $g \in \mathbb{R}^n$ is a subgradient of a convex function $f : \mathbb{R}^n \to \mathbb{R}$ at $x_0$ if
$f(x) - f(x_0) \ge \langle x - x_0, g \rangle$ for all $x \in \mathrm{dom}(f)$,
i.e., $f(x) \ge f(x_0) + \langle x - x_0, g \rangle$ for all $x \in \mathrm{dom}(f)$.
The set of subgradients at $x_0$ is denoted by $\partial f(x_0)$; if $f$ is differentiable at $x_0$, then $\partial f(x_0) = \{\nabla f(x_0)\}$.
Property: $x^*$ is a minimizer of $f$ if and only if $0 \in \partial f(x^*)$.
Sum rules hold (under certain mild conditions).

21 The subdifferential of $f(t) = |t|$:
at $t \ne 0$, $f$ is differentiable, so $\partial f(t) = \{f'(t)\} = \{\mathrm{sign}(t)\}$;
at $t = 0$, $f$ is not differentiable: any constant $c$ s.t. $|t| = f(t) \ge f(0) + c(t - 0) = ct$ for all $t \in \mathbb{R}$ works, i.e., $-1 \le c \le 1$, so $\partial(|\cdot|)(0) = [-1,1]$.
Hence
$\partial(|t|) = \begin{cases} 1, & t > 0, \\ -1, & t < 0, \\ [-1,1], & t = 0. \end{cases}$

22 One-dimensional example: for fixed $t$,
$f(s) = \tfrac12(t-s)^2 + \alpha|s|$
The function is strictly convex, so it has a unique minimizer.
$f(s) = \tfrac12(t-s)^2 + \alpha|s| = \begin{cases} \tfrac12(t-s)^2 + \alpha s, & s > 0, \\ \tfrac12(t-s)^2 - \alpha s, & s \le 0. \end{cases}$
Suppose $t > 0$. If the minimum is achieved at $s^* > 0$, then $s^* = t - \alpha > 0$ and $f(s^*) = \tfrac12\alpha^2 + \alpha(t-\alpha)$; at $s^* = 0$, $f(s^*) = \tfrac12 t^2$. Hence $t - \alpha \ge 0 \Rightarrow s^* = t - \alpha$, and $t - \alpha < 0 \Rightarrow s^* = 0$.

23 $s^* = S_\alpha(t) = \begin{cases} t - \alpha, & t > \alpha, \\ 0, & |t| \le \alpha, \\ t + \alpha, & t < -\alpha. \end{cases}$
The optimality condition is $0 \in (s - t) + \alpha\,\partial|s|$, i.e., $t \in s + \alpha\,\partial|s|$.
Soft-thresholding operator: $S_\alpha(t) = (\alpha\,\partial|\cdot| + I)^{-1}(t)$.
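A one-line numpy sketch of the soft-thresholding operator, applied componentwise (illustrative; the function name is not from the slides):

```python
import numpy as np

def soft_threshold(t, alpha):
    # S_alpha(t): shrink towards zero by alpha, set to zero if |t| <= alpha
    return np.sign(t) * np.maximum(np.abs(t) - alpha, 0.0)
```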

24 [Figure: the soft-thresholding operator $S_\alpha(t)$ as a function of $t$.] It shrinks the value, and zeroes it if it is small.

25 Convex approach: the $\ell_1$ penalty. Popular approach: basis pursuit, or lasso (Chen et al. 1998, Tibshirani 1996):
$\min_{x \in \mathbb{R}^p} J_\alpha(x) = \tfrac12\|Ax - b\|^2 + \alpha\|x\|_1$
How can we obtain such nice pictures numerically?

26 Iterative soft thresholding (Daubechies-Defrise-De Mol 2005): an iterative algorithm for computing the solution by a surrogate function approach (majorization-minimization, optimization transfer).
Observations on $J_\alpha(x) = \tfrac12\|Ax - b\|^2 + \alpha\|x\|_1$:
if $A = I$, the problem is easy: $J_\alpha(x) = \sum_i \left(\tfrac12(x_i - b_i)^2 + \alpha|x_i|\right)$ decouples into one-dimensional problems;
if $A$ is orthonormal, i.e., $A^\top A = I$, then $J_\alpha(x) = \tfrac12\|x - A^\top b\|^2 + \alpha\|x\|_1$ up to a constant.

27 The presence of an operator $A$ couples the components; use a surrogate function. Expanding around the current guess $x^k$,
$f(x) = \tfrac12\|Ax - b\|^2 = \tfrac12\|A(x - x^k) + Ax^k - b\|^2 = \tfrac12\|A(x - x^k)\|^2 + \langle A(x - x^k), Ax^k - b \rangle + \tfrac12\|Ax^k - b\|^2$
$\le \tfrac{\tau_k}{2}\|x - x^k\|^2 + \langle x - x^k, A^\top(Ax^k - b) \rangle + \tfrac12\|Ax^k - b\|^2 =: Q(x, x^k)$
It is easy to verify that $Q(x^k, x^k) = f(x^k)$, $\nabla Q(x^k, x^k) = \nabla f(x^k)$, and further $Q(x, x^k) \ge f(x)$ if $\tau_k \ge \|A\|^2$.
Algorithm: solve the simplified minimization problem $x^{k+1} \in \arg\min_x Q(x, x^k) + \alpha\|x\|_1$.

28 Approximate minimization problem: $x^{k+1} = \arg\min_x Q(x, x^k) + \alpha\|x\|_1$. Up to constants,
$Q(x, x^k) + \alpha\|x\|_1 = \tfrac{\tau_k}{2}\|x - x^k\|^2 + \langle A^\top(Ax^k - b), x - x^k \rangle + \alpha\|x\|_1$
$= \tfrac{\tau_k}{2}\|x - (x^k - \tau_k^{-1}A^\top(Ax^k - b))\|^2 + \alpha\|x\|_1 - \tfrac{1}{2\tau_k}\|A^\top(Ax^k - b)\|^2$
Let $\bar x^{k+1} = x^k - \tau_k^{-1}A^\top(Ax^k - b)$; then
$Q(x, x^k) + \alpha\|x\|_1 = \tfrac{\tau_k}{2}\|x - \bar x^{k+1}\|^2 + \alpha\|x\|_1 + \text{cnst} = \sum_i \left(\tfrac{\tau_k}{2}(x_i - \bar x^{k+1}_i)^2 + \alpha|x_i|\right) + \text{cnst}$

29 The one-dimensional optimization problem $\min_s \tfrac12(s - t)^2 + \alpha|s|$: the solution $S_\alpha(t)$ is given by
$S_\alpha(t) = \begin{cases} t - \alpha, & t > \alpha, \\ 0, & |t| \le \alpha, \\ t + \alpha, & t < -\alpha. \end{cases}$
It shrinks the value, and zeroes it if it is small.

30 Iterative soft thresholding (Daubechies-Defrise-De Mol, 2005): given an initial guess $x^0$, update the solution iteratively by
$\bar x^{k+1} = x^k - \tau^{-1}A^\top(Ax^k - b)$ (gradient descent)
$x^{k+1} = S_{\tau^{-1}\alpha}(\bar x^{k+1})$ (thresholding)
The iterative thresholding iteration is a (nonlinear) gradient descent method; its convergence is slow...
An adaptive choice of step size can improve convergence...
Primal dual active set (PDAS) algorithm: PDAS = Newton method, for a class of convex optimization problems.
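A minimal numpy sketch of the iteration with the constant step $\tau = \|A\|^2$, which guarantees the majorization $Q \ge f$ above (function and parameter names are illustrative):

```python
import numpy as np

def ista(A, b, alpha, n_iter=500):
    # Iterative soft thresholding for min 0.5*||Ax-b||^2 + alpha*||x||_1
    tau = np.linalg.norm(A, 2) ** 2               # spectral norm squared
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x_bar = x - A.T @ (A @ x - b) / tau       # gradient descent step
        x = np.sign(x_bar) * np.maximum(np.abs(x_bar) - alpha / tau, 0.0)  # S_{alpha/tau}
    return x
```

With the synthetic instance sketched earlier, a call such as `ista(A, b, alpha=0.1)` should place its large entries on the true support, though `alpha` needs tuning to the noise level.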

31 Choice I: Cauchy step size (Cauchy 1847)
$\tau_k = \arg\min_{\tau > 0} \|A(x^k - \tau A^\top(Ax^k - b)) - b\|^2$
i.e., with $d^k = A^\top(Ax^k - b)$,
$\tau_k = \dfrac{\|A^\top(Ax^k - b)\|^2}{\|AA^\top(Ax^k - b)\|^2} = \dfrac{\|d^k\|^2}{\|Ad^k\|^2}$

32 Choice II: Barzilai-Borwein rule (Barzilai-Borwein 1988): use the preceding two iterates to decide the step size.
General quasi-Newton method: $x^{k+1} = x^k - (B^k)^{-1}g^k$ with $B^k(x^k - x^{k-1}) = g^k - g^{k-1}$ (quasi-Newton relation).
Select $D^k = \tau_k I$ and $x^{k+1} = x^k - D^k g^k$ to mimic the quasi-Newton method (in a least-squares sense):
$\min_\tau \|(x^k - x^{k-1}) - \tau(g^k - g^{k-1})\|^2 \;\Rightarrow\; \tau_k = \dfrac{\langle x^k - x^{k-1}, g^k - g^{k-1} \rangle}{\|g^k - g^{k-1}\|^2}$
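Minimal sketches of the two step-size rules above (assuming numpy; $g$ denotes the gradient $A^\top(Ax - b)$, names illustrative):

```python
import numpy as np

def cauchy_step(A, x, b):
    d = A.T @ (A @ x - b)                  # gradient direction d^k
    return (d @ d) / (np.linalg.norm(A @ d) ** 2)

def bb_step(x, x_prev, g, g_prev):
    s, y = x - x_prev, g - g_prev          # differences of iterates and gradients
    return (s @ y) / (y @ y)               # tau_k = <s, y> / ||y||^2
```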

33 Fast iterative shrinkage-thresholding algorithm (FISTA) (Nesterov 1980s, Beck-Teboulle 2008): set $x^1 = x^0$, $z^1 = x^0$, $t_1 = 1$, and for $k \ge 1$, with the extrapolated point $z^k$:
$x^k = S_\alpha(z^k - A^\top(Az^k - b))$
$t_{k+1} = \dfrac{1 + \sqrt{1 + 4t_k^2}}{2}$
$z^{k+1} = x^k + \dfrac{t_k - 1}{t_{k+1}}(x^k - x^{k-1})$
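A minimal numpy sketch of FISTA with step size $1/L$, $L = \|A\|^2$ (a common safeguard assumed here, not stated on the slide):

```python
import numpy as np

def fista(A, b, alpha, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2
    x = x_prev = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(n_iter):
        w = z - A.T @ (A @ z - b) / L                               # gradient step at z^k
        x = np.sign(w) * np.maximum(np.abs(w) - alpha / L, 0.0)     # soft thresholding
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x + (t - 1.0) / t_next * (x - x_prev)                   # extrapolation
        x_prev, t = x, t_next
    return x
```

The extrapolation step is what distinguishes FISTA from plain iterative soft thresholding and yields the faster $O(1/k^2)$ decay of the objective.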

34 Observation: the one-dimensional problem can be solved easily! The presence of the operator $A$ messes things up, so we update the solution componentwise (theory: P. Tseng, 2001):
$x^k_1 \in \arg\min_{x_1} J_\alpha(x_1, x^{k-1}_2, \dots, x^{k-1}_p)$
$x^k_2 \in \arg\min_{x_2} J_\alpha(x^k_1, x_2, x^{k-1}_3, \dots, x^{k-1}_p)$
$\vdots$
$x^k_p \in \arg\min_{x_p} J_\alpha(x^k_1, x^k_2, \dots, x^k_{p-1}, x_p)$
The sequence has a subsequence converging to a minimizer, and the sequence of function values converges to the minimum.
Revived interest in statistics: Friedman et al. 2007.

35 Simple case $\alpha = 0$, $f(x) = \tfrac12\|Ax - b\|^2$: minimizing over $x_i$, with all $x_j$, $j \ne i$, fixed, i.e.,
$0 = \partial_i f(x) = A_i^\top(Ax - b) = A_i^\top(A_i x_i + A_{-i}x_{-i} - b) \;\Rightarrow\; x_i = \dfrac{A_i^\top(b - A_{-i}x_{-i})}{A_i^\top A_i}$
Coordinate descent repeats this for $i = 1, 2, \dots$; equivalently,
$x_i = \dfrac{A_i^\top r}{\|A_i\|^2} + x_i^{\mathrm{old}}$ with $r = b - Ax$,
an $O(n)$ operation per update.

36 $\ell_1$ problem: minimization over $x_i$, with $x_j$, $j \ne i$, fixed:
$0 \in A_i^\top A_i x_i + A_i^\top(A_{-i}x_{-i} - b) + \alpha s_i$, $s_i \in \partial|x_i|$
$\Rightarrow\; x_i = S_{\alpha/\|A_i\|^2}\!\left(\dfrac{A_i^\top(b - A_{-i}x_{-i})}{\|A_i\|^2}\right)$
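A minimal numpy sketch of coordinate descent for the $\ell_1$ problem, keeping the residual $r = b - Ax$ up to date so that each coordinate update costs $O(n)$ (names illustrative):

```python
import numpy as np

def cd_lasso(A, b, alpha, n_cycles=100):
    n, p = A.shape
    x = np.zeros(p)
    col_sq = (A ** 2).sum(axis=0)                  # ||A_i||^2
    r = b - A @ x                                  # residual
    for _ in range(n_cycles):
        for i in range(p):
            rho = A[:, i] @ r + col_sq[i] * x[i]   # A_i^T (b - A_{-i} x_{-i})
            x_new = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[i]
            r += A[:, i] * (x[i] - x_new)          # keep residual consistent
            x[i] = x_new
    return x
```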

37 Iteratively reweighted least squares (IRLS). Another viewpoint: recall that for the quadratic penalty
$\tfrac12\|Ax - b\|^2 + \tfrac{\alpha}{2}\|x\|^2$
the optimal solution $x_\alpha$ satisfies the optimality system
$A^\top(Ax_\alpha - b) + \alpha x_\alpha = 0$, i.e., $(A^\top A + \alpha I)x_\alpha = A^\top b$.
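For reference, a numpy sketch of the closed-form Tikhonov (ridge) solution via these normal equations:

```python
import numpy as np

def tikhonov(A, b, alpha):
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(p), A.T @ b)   # (A^T A + alpha I) x = A^T b
```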

38 To take advantage of the quadratic structure, we rewrite the $\ell_1$ problem (given the current estimate $x^k$) as
$\tfrac12\|Ax - b\|^2 + \alpha\, x^\top W^k x$, with $W^k = \mathrm{diag}(|x^k_i|^{-1})$.
This gives the iterative scheme (with regularization by a small $\epsilon > 0$):
$x^{k+1} = \left(A^\top A + 2\alpha\,\mathrm{diag}\big((|x^k_i| + \epsilon)^{-1}\big)\right)^{-1} A^\top b$.
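A minimal numpy sketch of the resulting IRLS iteration; the factor $2\alpha$ in the weights comes from differentiating $\alpha\, x^\top W^k x$, and the small `eps` is the smoothing parameter mentioned above (names illustrative):

```python
import numpy as np

def irls_l1(A, b, alpha, n_iter=50, eps=1e-6):
    p = A.shape[1]
    x = np.linalg.solve(A.T @ A + alpha * np.eye(p), A.T @ b)   # start from the ridge solution
    for _ in range(n_iter):
        w = 2.0 * alpha / (np.abs(x) + eps)                     # reweighting from current iterate
        x = np.linalg.solve(A.T @ A + np.diag(w), A.T @ b)
    return x
```

Each step solves a weighted ridge problem, so the cost per iteration is one linear solve.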

39 What lies beyond:
Newton-type methods
nonlinear forward operators (many medical imaging problems)
structured sparsity patterns
total variation regularization
nonconvex penalties (which can be solved efficiently)

40 [Figure: probability of exact recovery of the true support vs. sparsity level, for the $\ell_0$, $\ell_{1/2}$, $\ell_1$, SCAD, MCP and capped-$\ell_1$ penalties.] Setting: Gaussian $\Psi$ and noise, DR $= 10^3$, $\sigma = 0.01$, with 500 data points. The $\ell_1$ penalty allows exact support recovery only if the solution is very sparse; nonconvex models allow recovering far more nonzeros.
