Introduction to Sparsity in Signal Processing


Ivan Selesnick
Polytechnic Institute of New York University
Brooklyn, New York
selesi@poly.edu
2012

Under-determined linear equations

Consider an under-determined system of equations

    y = Ax    (1)

where A is an M x N matrix with M < N, y is the length-M vector [y(0), ..., y(M-1)]^T, and x is the length-N vector [x(0), ..., x(N-1)]^T. The system has more unknowns than equations; the matrix A is wider than it is tall. We assume A A^H is invertible, therefore the system of equations (1) has infinitely many solutions.

Norms

We will use the l2 and l1 norms:

    ||x||_2^2 := sum_{n=0}^{N-1} |x(n)|^2    (2)
    ||x||_1   := sum_{n=0}^{N-1} |x(n)|.     (3)

||x||_2^2, i.e. the sum of squares, is referred to as the energy of x.

Least squares

To solve y = Ax, it is common to minimize the energy of x:

    argmin_x ||x||_2^2 such that y = Ax.    (4)

Solution to (4):

    x = A^H (A A^H)^{-1} y.    (5)

When y is noisy, do not solve y = Ax exactly. Instead, find an approximate solution:

    argmin_x ||y - Ax||_2^2 + λ ||x||_2^2.    (6)

Solution to (6):

    x = (A^H A + λ I)^{-1} A^H y.    (7)

For large-scale systems, fast algorithms are needed.
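As a quick illustration (not from the slides), both closed-form solutions (5) and (7) can be computed directly with NumPy; here A is a random real matrix, so A^H reduces to A^T:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 10, 30                        # under-determined: more unknowns than equations
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)

# Minimum-energy solution (5): x = A^T (A A^T)^{-1} y
x_ls = A.T @ np.linalg.solve(A @ A.T, y)
assert np.allclose(A @ x_ls, y)      # satisfies y = Ax exactly

# Regularized solution (7): x = (A^T A + lam I)^{-1} A^T y
lam = 0.1
x_reg = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)
```

Since x_reg is the global minimizer of (6), its cost is no larger than that of any other vector, including x_ls.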

Sparse solutions

Another approach to solving y = Ax:

    argmin_x ||x||_1 such that y = Ax.    (8)

Problem (8) is the basis pursuit (BP) problem. When y is noisy, do not solve y = Ax exactly. Instead, find an approximate solution:

    argmin_x ||y - Ax||_2^2 + λ ||x||_1.    (9)

Problem (9) is the basis pursuit denoising (BPD) problem. The BP/BPD problems cannot be solved in explicit form, only by iterative numerical algorithms.

Least squares & BP/BPD

Least squares and BP/BPD solutions are quite different. Why? To minimize ||x||_2^2, the largest values of x must be made small, as they count much more than the smallest values. Least-squares solutions therefore have many small values, as they are relatively unimportant: least-squares solutions are not sparse.

[Figure: the penalty functions |x|^2 and |x|.]

Therefore, when it is known/expected that x is sparse, use the l1 norm, not the l2 norm.

Algorithms for sparse solutions

The cost function is: non-differentiable, convex, and large-scale.

Algorithms (matrix-free): ISTA, FISTA, SALSA (ADMM), and more...
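As a sketch of one such algorithm, ISTA for the BPD cost (9) alternates a gradient step on ||y - Ax||_2^2 with soft-thresholding (the implementation details below are my own, not from the slides):

```python
import numpy as np

def soft(x, t):
    """Soft-threshold: shrink each entry of x toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(y, A, lam, n_iter=200):
    """ISTA for  min_x ||y - Ax||_2^2 + lam * ||x||_1  (problem (9))."""
    alpha = np.linalg.norm(A, 2) ** 2        # alpha >= max eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft(x + (A.T @ (y - A @ x)) / alpha, lam / (2 * alpha))
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50); x_true[[4, 17, 33]] = [3.0, -2.0, 1.5]
y = A @ x_true + 0.01 * rng.standard_normal(20)
x_hat = ista(y, A, lam=0.1)              # sparse estimate of x_true
```

Each iteration needs only products with A and A^T, which is what makes the method matrix-free.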

Parseval frames

If A satisfies

    A A^H = p I,    (10)

then the columns of A are said to form a Parseval frame. A can be rectangular. The least-squares solution (5) becomes

    x = A^H (A A^H)^{-1} y = (1/p) A^H y.    (11)

No matrix inversion is needed. The least-squares solution (7) becomes

    x = (A^H A + λ I)^{-1} A^H y = 1/(λ + p) A^H y    (12)

(use the matrix inverse lemma). Again, no matrix inversion is needed. If A satisfies (10), then it is very easy to find least-squares solutions. Some algorithms for BP/BPD also become computationally easier.

Example: Sparse Fourier coefficients using BP

The Fourier transform tells how to write a signal as a sum of sinusoids, but it is not the only way. Basis pursuit gives a sparse spectrum. Suppose the M-point signal y(m) is written as

    y(m) = sum_{n=0}^{N-1} c(n) exp(j (2π/N) m n),    0 <= m <= M-1,    (13)

where c(n) is a length-N coefficient sequence, with M <= N. In matrix form,

    y = Ac    (14)

    A_{m,n} = exp(j (2π/N) m n),    0 <= m <= M-1,  0 <= n <= N-1,    (15)

with c of length N. The coefficients c(n) are frequency-domain (Fourier) coefficients.

Example: Sparse Fourier coefficients using BP

1. If N = M, then A is the inverse N-point DFT matrix.

2. If N > M, then A consists of the first M rows of the inverse N-point DFT matrix. Both A and A^H can be implemented efficiently using the FFT. For example, in Matlab, y = Ac is implemented as:

    function y = A(c, M, N)
        v = N * ifft(c);
        y = v(1:M);
    end

Similarly, c = A^H y can be obtained by zero-padding and computing the DFT. In Matlab:

    function c = AT(y, M, N)
        c = fft([y; zeros(N-M, 1)]);
    end

This enables matrix-free algorithms.

3. Due to the orthogonality properties of complex sinusoids,

    A A^H = N I_M.    (16)
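For readers working in Python, the same operators can be written with NumPy's FFT (a sketch; the function names A_op/AH_op are my own), together with a numerical check of (16):

```python
import numpy as np

def A_op(c, M, N):
    """y = A c: inverse N-point DFT of c, truncated to the first M samples."""
    v = N * np.fft.ifft(c)       # N * ifft matches the sum in (13)
    return v[:M]

def AH_op(y, M, N):
    """c = A^H y: zero-pad y to length N, then take the N-point DFT."""
    return np.fft.fft(np.concatenate([y, np.zeros(N - len(y))]))

# Check (16): A A^H = N I_M, by applying A A^H to each standard basis vector
M, N = 8, 16
AAH = np.column_stack([A_op(AH_op(e, M, N), M, N) for e in np.eye(M)])
assert np.allclose(AAH, N * np.eye(M))
```

No M x N matrix is ever formed; both operators cost O(N log N), which is what makes matrix-free algorithms practical here.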

Example: Sparse Fourier coefficients using BP

When N = M, the coefficients c satisfying y = Ac are uniquely determined. When N > M, the coefficients c are not unique; any vector c satisfying y = Ac can be considered a valid set of coefficients. To find a particular solution we can minimize either ||c||_2^2 or ||c||_1.

Least squares:

    argmin_c ||c||_2^2 such that y = Ac.    (17)

Basis pursuit:

    argmin_c ||c||_1 such that y = Ac.    (18)

The two solutions can be quite different...

Example: Sparse Fourier coefficients using BP

[Figure: the signal (real and imaginary parts) vs. time (samples).]

Least-squares solution:

    c = A^H (A A^H)^{-1} y = (1/N) A^H y.

A^H y is computed by: 1. zero-padding the length-M signal y to length N; 2. computing its DFT.

BP solution: computed using the algorithm SALSA.

Example: Sparse Fourier coefficients using BP

[Figure: Fourier coefficients obtained by (A) the DFT, (B) the least-squares solution, (C) the basis pursuit solution.]

The BP solution does not exhibit the leakage phenomenon.

Example: Sparse Fourier coefficients using BP

[Figure: cost function history of the algorithm for the basis pursuit solution, vs. iteration.]

Example: Denoising using BPD

Digital LTI filters are often used for noise reduction (denoising). But if the noise and signal overlap in the frequency domain, or if the respective frequency bands are unknown, then LTI filters are difficult to apply. However, if the signal has sparse (or relatively sparse) Fourier coefficients, then BPD can be used for noise reduction.

Example: Denoising using BPD

Noisy speech signal y(m):

    y(m) = s(m) + w(m),    0 <= m <= M-1,  M = 500,    (19)

where s(m) is the noise-free speech signal and w(m) is the noise sequence.

[Figure: the noisy signal vs. time (samples), and (A) the Fourier coefficients (FFT) of the noisy signal vs. frequency (index).]

Example: Denoising using BPD

Assume the noise-free speech signal s(m) has a sparse set of Fourier coefficients:

    y = Ac + w

where y is the noisy speech signal (length M), A is the M x N DFT matrix (15), c is the sparse Fourier coefficient vector (length N), and w is the noise (length M). As y is noisy, find c by solving the least-squares problem

    argmin_c ||y - Ac||_2^2 + λ ||c||_2^2    (20)

or the basis pursuit denoising (BPD) problem

    argmin_c ||y - Ac||_2^2 + λ ||c||_1.    (21)

Once c is found, an estimate of the speech signal is given by ŝ = Ac.

Example: Denoising using BPD

Least-squares solution:

    c = (A^H A + λ I)^{-1} A^H y = 1/(λ + N) A^H y    (A A^H = N I)    (22)

using (12) and (16). The least-squares estimate of the speech signal is therefore

    ŝ = Ac = N/(λ + N) y    (least-squares solution).

But this is only a scaled version of the noisy signal! No real filtering is achieved.

Example: Denoising using BPD

BPD solution, obtained with the algorithm SALSA:

[Figure: (B) the Fourier coefficients of the BPD solution vs. frequency (index), and the BPD-denoised signal vs. time (samples).]

Effective noise reduction, unlike least squares!

Example: Deconvolution using BPD

If the signal of interest x(m) is not only noisy but is also distorted by an LTI system with impulse response h(m), then the available data y(m) is

    y(m) = (h * x)(m) + w(m),    (23)

where * denotes linear convolution and w(m) is additive noise. Given the observed data y, we aim to estimate the signal x. We assume the sequence h is known.

Example: Deconvolution using BPD

Let x have length N, h have length L, and y have length M = N + L - 1. Then

    y = Hx + w,    (24)

where H is the M x N convolution (Toeplitz) matrix, illustrated here for L = 3 and N = 4:

        [ h(0)                ]
        [ h(1) h(0)           ]
    H = [ h(2) h(1) h(0)      ]    (25)
        [      h(2) h(1) h(0) ]
        [           h(2) h(1) ]
        [                h(2) ]

H is of size M x N with M > N (because M = N + L - 1).
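A sketch of building H and checking it against NumPy's linear convolution (the helper name conv_matrix is my own):

```python
import numpy as np

def conv_matrix(h, N):
    """M x N Toeplitz matrix H of (25): H @ x equals the linear convolution h * x."""
    L = len(h)
    H = np.zeros((N + L - 1, N))
    for n in range(N):
        H[n:n + L, n] = h        # each column is a shifted copy of h
    return H

h = np.array([1.0, 1.0, 1.0, 1.0]) / 4   # 4-point moving average (next slide)
x = np.zeros(10); x[2], x[6] = 1.0, -0.5
H = conv_matrix(h, len(x))
assert H.shape == (len(x) + len(h) - 1, len(x))
assert np.allclose(H @ x, np.convolve(h, x))
```

In practice one would not form H explicitly; products with H and H^T can be computed by (flipped) convolution, keeping algorithms matrix-free.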

Example: Deconvolution using BPD

A sparse signal is convolved by the 4-point moving average filter

    h(n) = 1/4 for n = 0, 1, 2, 3;  0 otherwise.

[Figure: the sparse signal and the observed (convolved, noisy) signal.]

Example: Deconvolution using BPD

Due to the noise w, solve the least-squares problem

    argmin_x ||y - Hx||_2^2 + λ ||x||_2^2    (26)

or the basis pursuit denoising (BPD) problem

    argmin_x ||y - Hx||_2^2 + λ ||x||_1.    (27)

Least-squares solution:

    x = (H^T H + λ I)^{-1} H^T y.    (28)

Example: Deconvolution using BPD

[Figure: deconvolution results, the least-squares solution and the BPD solution.]

The BPD solution, obtained using SALSA, is more faithful to the original signal.

Example: Filling in missing samples using BP

Due to data transmission/acquisition errors, some signal samples may be lost. Filling in missing values is used for error concealment. Part of a signal or image may also be intentionally deleted (image editing, etc.); convincingly filling in missing values according to the surrounding area is called inpainting.

[Figure: an incomplete signal with 200 missing samples, vs. time (samples).]

Example: Filling in missing samples using BP

The incomplete signal y:

    y = Sx,    (29)

where x has length M, y has length K < M, and S is a selection (or "sampling") matrix of size K x M. For example, if only the first, second, and last elements of a 5-point signal x are observed, then the matrix S is given by:

        [ 1 0 0 0 0 ]
    S = [ 0 1 0 0 0 ].    (30)
        [ 0 0 0 0 1 ]

Problem: given y and S, find x such that y = Sx. This is an underdetermined system with infinitely many solutions. The least-squares and BP solutions are very different...

Example: Filling in missing samples using BP

Properties of S:

1.  S S^T = I,    (31)

where I is the K x K identity matrix. For example, with S in (30),

            [ 1 0 0 ]
    S S^T = [ 0 1 0 ].
            [ 0 0 1 ]

2.  S^T y sets the missing samples to zero. For example, with S in (30),

            [ 1 0 0 ]              [ y(0) ]
            [ 0 1 0 ]   [ y(0) ]   [ y(1) ]
    S^T y = [ 0 0 0 ] * [ y(1) ] = [  0   ].    (32)
            [ 0 0 0 ]   [ y(2) ]   [  0   ]
            [ 0 0 1 ]              [ y(2) ]
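Both properties are easy to verify numerically; a small NumPy sketch using the S of (30):

```python
import numpy as np

M = 5
keep = [0, 1, 4]                 # observe the 1st, 2nd, and last of 5 samples
S = np.eye(M)[keep]              # K x M selection matrix, as in (30)

assert np.allclose(S @ S.T, np.eye(len(keep)))   # property (31): S S^T = I

x = np.array([10., 20., 30., 40., 50.])
y = S @ x                        # y = Sx: the observed samples
z = S.T @ y                      # S^T y: missing samples set to zero, as in (32)
assert np.allclose(z, [10., 20., 0., 0., 50.])
```

Building S by selecting rows of the identity matrix makes both properties immediate.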

Example: Filling in missing samples using BP

Suppose x has a sparse representation with respect to A:

    x = Ac,    (33)

where c is a sparse vector of length N, and A is of size M x N with M <= N. The incomplete signal y can then be written as

    y = Sx      from (29)    (34a)
      = SAc     from (33).   (34b)

Therefore, if we can find a vector c satisfying

    y = SAc,    (35)

then we can create an estimate x̂ of x by setting

    x̂ = Ac.    (36)

From (35), S x̂ = y; that is, x̂ agrees with the known samples y.

Example: Filling in missing samples using BP

Note that y is shorter than the coefficient vector c, so there are infinitely many solutions to (35); any vector c satisfying y = SAc can be considered a valid set of coefficients. To find a particular solution, solve the least-squares problem

    argmin_c ||c||_2^2 such that y = SAc    (37)

or the basis pursuit problem

    argmin_c ||c||_1 such that y = SAc.    (38)

The two solutions are very different... Let us assume A satisfies

    A A^H = p I    (39)

for some positive real number p.

Example: Filling in missing samples using BP

The least-squares solution is

    c = (SA)^H ((SA)(SA)^H)^{-1} y    (40)
      = A^H S^T (S A A^H S^T)^{-1} y    (41)
      = A^H S^T (p S S^T)^{-1} y      using (39)    (42)
      = A^H S^T (p I)^{-1} y          using (31)    (43)
      = (1/p) A^H S^T y.    (44)

Hence, the least-squares estimate x̂ is given by

    x̂ = Ac    (45)
       = (1/p) A A^H S^T y    using (44)    (46)
       = S^T y                using (39).    (47)

This estimate consists of setting all the missing values to zero! See (32). No real estimation of the missing values has been achieved. The least-squares solution is of no use here.
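The derivation above can be checked numerically; the sketch below uses the dense DFT frame of (15), for which p = N (dimensions and signal are my own illustrative choices):

```python
import numpy as np

# DFT frame A of (15), with A A^H = N I (so p = N), and a selection matrix S
M, N = 8, 16
A = np.exp(2j * np.pi * np.outer(np.arange(M), np.arange(N)) / N)
keep = [0, 1, 2, 5, 6, 7]            # samples 3 and 4 are missing
S = np.eye(M)[keep]

x = np.cos(2 * np.pi * 2 * np.arange(M) / N)    # some test signal
y = S @ x                            # the observed (incomplete) samples

B = S @ A                            # the K x N matrix in (35)
c = B.conj().T @ np.linalg.solve(B @ B.conj().T, y)   # least-squares coefficients (40)
x_hat = A @ c                        # estimate (45)
assert np.allclose(x_hat, S.T @ y)   # equals S^T y: missing samples are just zeros (47)
```

As predicted by (47), the minimum-norm coefficients reproduce the known samples and put zeros everywhere else.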

Example: Filling in missing samples using BP

Short segments of speech can be sparsely represented using the DFT; therefore we set A equal to the M x N DFT matrix (15) with N = 1024. The BP solution was obtained using 100 iterations of SALSA:

[Figure: the estimated coefficients vs. frequency (DFT index), and the estimated signal vs. time (samples).]

The missing samples have been filled in quite accurately.

Total Variation Denoising/Deconvolution

A signal is observed in additive noise: y = x + w. Total variation denoising is suitable when it is known that the signal of interest has a sparse derivative:

    argmin_x ||y - x||_2^2 + λ sum_{n=2}^{N} |x(n) - x(n-1)|.    (48)

The total variation of x is defined as

    TV(x) := sum_{n=2}^{N} |x(n) - x(n-1)| = ||Dx||_1,    (49)

where

        [ -1  1             ]
    D = [    -1  1          ]
        [         ...       ]
        [            -1  1  ]

is the first-difference matrix of size (N-1) x N.
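As a small sketch (illustrative only), D and TV(x) are easy to form directly; on a piecewise-constant signal, ||Dx||_1 simply sums the jump heights:

```python
import numpy as np

def diff_matrix(N):
    """(N-1) x N first-difference matrix D of (49): row n computes x(n+1) - x(n)."""
    return np.diff(np.eye(N), axis=0)

x = np.array([0., 0., 3., 3., 3., 1., 1.])   # piecewise constant: two jumps
D = diff_matrix(len(x))
tv = np.sum(np.abs(D @ x))                   # TV(x) = ||Dx||_1
assert np.isclose(tv, 5.0)                   # |3 - 0| + |1 - 3| = 3 + 2
```

Dx is sparse exactly where x is piecewise constant, which is why the l1 penalty on Dx preserves edges.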

Total Variation Denoising/Deconvolution

[Figure: the noise-free signal, the noisy signal, two L2-filtered signals (λ = 3.0 and λ = 8.0), and the TV-filtered signal (λ = 3.0).]

TV filtering preserves edges/jumps more accurately than least-squares filtering.

Polynomial Smoothing of Time Series with Additive Step Discontinuities

[Figure: (a) noisy data, (b) calculated TV component, (c) TV-compensated data, (d) estimated signal.]

Observed noisy data: a low-frequency signal with additive step-wise discontinuities (steps/jumps). Goal: simultaneous smoothing and change/jump detection. Conventional low-pass filtering blurs discontinuities.

Polynomial Smoothing of Time Series with Additive Step Discontinuities

Variational formulation: given noisy data y, find x (the step signal) and p (the local polynomial signal) by minimization:

    argmin_{a,x} λ sum_{n=2}^{N} |x(n) - x(n-1)| + sum_{n=1}^{N} |y(n) - p(n) - x(n)|^2,

where

    p(n) = a_1 n + ... + a_d n^d.

The cost function is non-differentiable. We use variable splitting and ADMM to derive an efficient iterative algorithm.

Polynomial Smoothing of Time Series with Additive Step Discontinuities

Nano-particle detection: the Whispering Gallery Mode (WGM) biosensor, designed to detect individual nano-particles in real time, is based on detecting discrete changes in the resonance frequency of a microspherical cavity. The adsorption of a nano-particle onto the surface of a custom-designed microsphere produces a theoretically predicted but minute change in the optical properties of the sphere [Arnold, 2003].

[Figure: measured wavelength shift (fm) vs. time (seconds).]

Polynomial Smoothing of Time Series with Additive Step Discontinuities

Nano-particle detection:

[Figure: the measured wavelength shift (fm), the estimated step signal x(t), and its derivative diff(x(t)), vs. time (seconds).]

Reference: "Polynomial Smoothing of Time Series with Additive Step Discontinuities," I. Selesnick, S. Arnold, V. Dantham. Submitted to IEEE Trans. on Signal Processing.

Summary

1. Basis pursuit
2. Basis pursuit denoising
3. Sparse Fourier coefficients using BP
4. Denoising using BPD
5. Deconvolution using BPD
6. Filling in missing samples using BP
7. Total variation filtering
8. Simultaneous least-squares & sparsity