Least Squares with Examples in Signal Processing
Ivan Selesnick
March 7, 2013
NYU-Poly

These notes address (approximate) solutions to linear equations by least squares. We deal with the easy case wherein the system matrix is full rank. If the system matrix is rank deficient, then other methods are needed, e.g., QR decomposition, singular value decomposition, or the pseudo-inverse [2, 3]. In these notes, least squares is illustrated by applying it to several basic problems in signal processing:
1. Linear prediction
2. Smoothing
3. Deconvolution
4. System identification
5. Estimating missing data
For the use of least squares in filter design, see [1].

Notation

We denote vectors in lower-case bold, i.e.,

  x = (x_1, ..., x_N)^T.   (1)

We denote matrices in upper-case bold. The transpose of a vector or matrix is indicated by a superscript T, i.e., x^T is the transpose of x. The notation ||x||_2 refers to the Euclidean length of the vector x, i.e.,

  ||x||_2 = sqrt( x_1^2 + x_2^2 + ... + x_N^2 ).   (2)

[For feedback/corrections, email selesi@poly.edu. Matlab software to reproduce the examples in these notes is available on the web or from the author: http://eeweb.poly.edu/iselesni/lecture_notes/. Last edit: November 6, 2013.]

The sum of squares of x is denoted by ||x||_2^2, i.e.,

  ||x||_2^2 = sum_n x(n)^2 = x^T x.   (3)

The "energy" of a vector x refers to ||x||_2^2. In these notes, it is assumed that all vectors and matrices are real-valued. In the complex-valued case, the conjugate transpose should be used in place of the transpose, etc.

2 Overdetermined equations

Consider the system of linear equations y = Hx. If there is no solution to this system of equations, then the system is overdetermined. This frequently happens when H is a "tall" matrix (more rows than columns) with linearly independent columns. In this case, it is common to seek a solution x minimizing the energy of the error:

  J(x) = ||y - Hx||_2^2.

Expanding J(x) gives

  J(x) = (y - Hx)^T (y - Hx)   (4)
       = y^T y - y^T Hx - x^T H^T y + x^T H^T Hx   (5)
       = y^T y - 2 y^T Hx + x^T H^T Hx.   (6)

Note that each of the four terms in (5) is a scalar. Note also that the scalar x^T H^T y is the transpose of the scalar y^T Hx, and hence x^T H^T y = y^T Hx. Taking the
derivative (see Appendix A) gives

  (d/dx) J(x) = -2 H^T y + 2 H^T Hx.

Setting the derivative to zero,

  (d/dx) J(x) = 0  ==>  H^T Hx = H^T y.
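The notes provide Matlab software for the examples; as a quick numerical check of the normal equations just derived, the following is a minimal NumPy sketch (an illustrative addition, not the notes' own code). It solves H^T H x = H^T y directly and compares with a library least squares routine.

```python
import numpy as np

# Tall, full-rank system y = H x (overdetermined, but consistent here).
rng = np.random.default_rng(0)
H = rng.standard_normal((20, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = H @ x_true

# Solve the normal equations H^T H x = H^T y.
x_ls = np.linalg.solve(H.T @ H, H.T @ y)

# np.linalg.lstsq solves the same least squares problem (via SVD).
x_lstsq, *_ = np.linalg.lstsq(H, y, rcond=None)

print(np.allclose(x_ls, x_true), np.allclose(x_lstsq, x_true))
# prints: True True
```

Since the data here are noise-free and consistent, both routes recover x exactly; with noisy y they return the minimizer of ||y - Hx||_2^2 instead.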
Let us assume that H^T H is invertible. Then the solution is given by

  x = (H^T H)^{-1} H^T y.

This is the "least squares" solution.

  min_x ||y - Hx||_2^2  ==>  x = (H^T H)^{-1} H^T y   (7)

In some situations, it is desirable to minimize the weighted square error, i.e., sum_n w_n r_n^2, where r is the residual, or error, r = y - Hx, and w_n are positive weights. This corresponds to minimizing ||W^{1/2}(y - Hx)||_2^2 where W is the diagonal matrix, [W]_{n,n} = w_n. Using (7) gives

  min_x ||W^{1/2}(y - Hx)||_2^2  ==>  x = (H^T WH)^{-1} H^T Wy   (8)

where we have used the fact that W is symmetric.

3 Underdetermined equations

Consider the system of linear equations y = Hx. If there are many solutions, then the system is underdetermined. This frequently happens when H is a "wide" matrix (more columns than rows) with linearly independent rows. In this case, it is common to seek a solution x with minimum norm. That is, we would like to solve the optimization problem

  min_x ||x||_2^2   (9)
  such that y = Hx.   (10)

Minimization with constraints can be done with Lagrange multipliers. So, define the Lagrangian:

  L(x, mu) = ||x||_2^2 + mu^T (y - Hx).

Take the derivatives of the Lagrangian:

  dL/dx  = 2x - H^T mu
  dL/dmu = y - Hx

Set the derivatives to zero to get:

  x = (1/2) H^T mu   (11)
  y = Hx   (12)

Plugging (11) into (12) gives y = (1/2) H H^T mu. Let us assume H H^T is invertible. Then

  mu = 2 (H H^T)^{-1} y.   (13)

Plugging (13) into (11) gives the least squares solution:

  x = H^T (H H^T)^{-1} y.

We can verify that x in this formula does in fact satisfy y = Hx by plugging in:

  Hx = H [ H^T (H H^T)^{-1} y ] = (H H^T)(H H^T)^{-1} y = y.

So,

  min_x ||x||_2^2 s.t. y = Hx  ==>  x = H^T (H H^T)^{-1} y   (14)

In some situations, it is desirable to minimize the weighted energy, i.e., sum_n w_n x_n^2, where w_n are positive weights. This corresponds to minimizing ||W^{1/2} x||_2^2 where W is the diagonal matrix, [W]_{n,n} = w_n. The derivation of the solution is similar, and gives

  min_x ||W^{1/2} x||_2^2 s.t. y = Hx  ==>  x = W^{-1} H^T ( H W^{-1} H^T )^{-1} y   (15)

This solution is also derived below, see (25).

4 Regularization

In the overdetermined case, we minimized ||y - Hx||_2^2. In the underdetermined case, we minimized ||x||_2^2. Another approach is to minimize the
weighted sum: c_1 ||y - Hx||_2^2 + c_2 ||x||_2^2. The solution x depends on the ratio c_2/c_1, not on c_1 and c_2 individually.

A common approach to obtain an inexact solution to a linear system is to minimize the objective function:

  J(x) = ||y - Hx||_2^2 + lambda ||x||_2^2   (16)

where lambda > 0. Taking the derivative, we get

  (d/dx) J(x) = 2 H^T (Hx - y) + 2 lambda x.

Setting the derivative to zero,

  (d/dx) J(x) = 0  ==>  (H^T H + lambda I) x = H^T y.

So the solution is given by

  x = (H^T H + lambda I)^{-1} H^T y.

So,

  min_x ||y - Hx||_2^2 + lambda ||x||_2^2  ==>  x = (H^T H + lambda I)^{-1} H^T y   (17)

This is referred to as "diagonal loading" because a constant, lambda, is added to the diagonal elements of H^T H. The approach also avoids the problem of rank deficiency, because H^T H + lambda I is invertible even if H^T H is not. In addition, the solution (17) can be used in both cases: when H is tall and when H is wide.

5 Weighted regularization

A more general form of the regularized objective function (16) is:

  J(x) = ||y - Hx||_2^2 + lambda ||Ax||_2^2

where lambda > 0. Taking the derivative, we get

  (d/dx) J(x) = 2 H^T (Hx - y) + 2 lambda A^T Ax.

Setting the derivative to zero,

  (d/dx) J(x) = 0  ==>  (H^T H + lambda A^T A) x = H^T y.

So the solution is given by

  x = (H^T H + lambda A^T A)^{-1} H^T y.

So,

  min_x ||y - Hx||_2^2 + lambda ||Ax||_2^2  ==>  x = (H^T H + lambda A^T A)^{-1} H^T y   (18)

Note that if A is the identity matrix, then equation (18) becomes (17).

6 Constrained least squares

Constrained least squares refers to the problem of finding a least squares solution that exactly satisfies additional constraints. If the additional constraints are a set of linear equations, then the solution is obtained as follows. The constrained least squares problem is of the form:

  min_x ||y - Hx||_2^2   (19)
  such that Cx = b   (20)

Define the Lagrangian,

  L(x, mu) = ||y - Hx||_2^2 + mu^T (Cx - b).

The derivatives are:

  dL/dx  = 2 H^T (Hx - y) + C^T mu
  dL/dmu = Cx - b
Setting the derivatives to zero,

  dL/dx = 0   ==>  x = (H^T H)^{-1} (H^T y - 0.5 C^T mu)   (21)
  dL/dmu = 0  ==>  Cx = b   (22)

Multiplying (21) on the left by C gives Cx, which from (22) is b, so we have

  C (H^T H)^{-1} (H^T y - 0.5 C^T mu) = b

or, expanding,

  C (H^T H)^{-1} H^T y - 0.5 C (H^T H)^{-1} C^T mu = b.

Solving for mu gives

  mu = 2 ( C (H^T H)^{-1} C^T )^{-1} ( C (H^T H)^{-1} H^T y - b ).

Plugging mu into (21) gives

  x = (H^T H)^{-1} ( H^T y - C^T ( C (H^T H)^{-1} C^T )^{-1} ( C (H^T H)^{-1} H^T y - b ) ).

Let us verify that x in this formula does in fact satisfy Cx = b. So,

  Cx = C (H^T H)^{-1} ( H^T y - C^T ( C (H^T H)^{-1} C^T )^{-1} ( C (H^T H)^{-1} H^T y - b ) )
     = C (H^T H)^{-1} H^T y - C (H^T H)^{-1} C^T ( C (H^T H)^{-1} C^T )^{-1} ( C (H^T H)^{-1} H^T y - b )
     = C (H^T H)^{-1} H^T y - ( C (H^T H)^{-1} H^T y - b )
     = b.

So,

  min_x ||y - Hx||_2^2 s.t. Cx = b  ==>
  x = (H^T H)^{-1} ( H^T y - C^T ( C (H^T H)^{-1} C^T )^{-1} ( C (H^T H)^{-1} H^T y - b ) )   (23)

6.1 Special cases

Simpler forms of (23) are frequently useful. For example, if H = I and b = 0 in (23), then we get

  min_x ||y - x||_2^2 s.t. Cx = 0  ==>  x = y - C^T ( C C^T )^{-1} C y   (24)

If y = 0 in (23), then we get

  min_x ||Hx||_2^2 s.t. Cx = b  ==>  x = (H^T H)^{-1} C^T ( C (H^T H)^{-1} C^T )^{-1} b   (25)

If y = 0 and H = I in (23), then we get

  min_x ||x||_2^2 s.t. Cx = b  ==>  x = C^T ( C C^T )^{-1} b   (26)

which is the same as (14).

7 Note

The expressions above involve matrix inverses. For example, (7) involves (H^T H)^{-1}. However, it must be emphasized that finding the least squares solution does not require computing the inverse of H^T H, even though the inverse appears in the formula. Instead, x in (7) should be obtained, in practice, by solving the system Ax = b where A = H^T H and b = H^T y. The most direct way to solve a linear system of equations is by Gaussian elimination. Gaussian elimination is much faster than computing the inverse of the matrix A.

8 Examples

8.1 Polynomial approximation

An important example of least squares is fitting a low-order polynomial to data. Suppose the N-point data is of the form (t_i, y_i) for 1 <= i <= N. The goal is to find a polynomial that approximates the data by minimizing the energy of the residual:

  E = sum_i ( y_i - p(t_i) )^2
where p(t) is a polynomial, e.g.,

  p(t) = a_1 + a_2 t + a_3 t^2.

[Figure 1: Least squares polynomial approximation. Panels: data; polynomial approximation (degree 2); polynomial approximation (degree 4).]

The problem can be viewed as solving the overdetermined system of equations,

  [ y_1 ]   [ 1  t_1  t_1^2 ]
  [ y_2 ]   [ 1  t_2  t_2^2 ] [ a_1 ]
  [  :  ] ~ [ :   :     :   ] [ a_2 ]
  [ y_N ]   [ 1  t_N  t_N^2 ] [ a_3 ]

which we denote as y ~ Ha. The energy of the residual, E, is written as

  E = ||y - Ha||_2^2.

From (7), the least squares solution is given by a = (H^T H)^{-1} H^T y. An example is illustrated in Fig. 1.

8.2 Linear prediction

One approach to predict future values of a time-series is based on linear prediction, e.g.,

  y(n) ~ a_1 y(n-1) + a_2 y(n-2) + a_3 y(n-3).   (27)

If past data y(n) is available, then the problem of finding a_i can be solved using least squares. Finding a = (a_1, a_2, a_3)^T can be viewed as one of solving an overdetermined system of equations. For example, if y(n) is available for 0 <= n <= N-1, and we seek a third-order linear predictor, then the overdetermined system of equations is given by

  [ y(3)   ]   [ y(2)    y(1)    y(0)   ]
  [ y(4)   ]   [ y(3)    y(2)    y(1)   ] [ a_1 ]
  [   :    ] ~ [   :       :       :    ] [ a_2 ]
  [ y(N-1) ]   [ y(N-2)  y(N-3)  y(N-4) ] [ a_3 ]

which we can write as ybar ~ Ha, where H is a matrix of size (N-3) x 3. From (7), the least squares solution is given by a = (H^T H)^{-1} H^T ybar. Note that H^T H is small, of size 3 x 3 only. Hence, a is obtained by solving a small linear system of equations.
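The third-order predictor above can be sketched numerically. The following NumPy fragment (an illustrative addition, not the notes' Matlab code) builds the matrix H from past samples of a synthetic signal that exactly obeys a third-order recursion, so the least squares solution recovers the true coefficients.

```python
import numpy as np

# Synthetic, exactly predictable signal: y(n) = a1 y(n-1) + a2 y(n-2) + a3 y(n-3).
N = 30
a_true = np.array([1.5, -0.8, 0.2])   # stable recursion (assumed for illustration)
y = np.zeros(N)
y[:3] = [1.0, 0.5, -0.3]              # arbitrary initial conditions
for n in range(3, N):
    y[n] = a_true[0]*y[n-1] + a_true[1]*y[n-2] + a_true[2]*y[n-3]

# Rows of H are [y(n-1), y(n-2), y(n-3)] for n = 3, ..., N-1.
H = np.column_stack([y[2:N-1], y[1:N-2], y[0:N-3]])
ybar = y[3:N]

# Least squares: solve the small 3x3 system H^T H a = H^T ybar.
a = np.linalg.solve(H.T @ H, H.T @ ybar)
print(a)   # recovers a_true = [1.5, -0.8, 0.2]
```

With real, noisy data the recovered coefficients are only approximate, and the predictor is then run forward as a recursive difference equation, as the notes describe next.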
Once the coefficients a_i are found, then y(n) for n > N can be estimated using the recursive difference equation (27). An example is illustrated in Fig. 2. One hundred samples of data are available, i.e., y(n) for 0 <= n <= 99. From these samples, a p-th order linear predictor is obtained by least squares, and the subsequent samples are predicted.

[Figure 2: Least squares linear prediction. Panels: given data to be predicted; predicted samples with p = 3; p = 4; p = 6.]

8.3 Smoothing

One approach to smooth a noisy signal is based on least squares weighted regularization. The idea is to obtain a signal similar to the noisy one, but smoother. The smoothness of a signal can be measured by the energy of its derivative (or second-order derivative). The smoother a signal is, the smaller the energy of its derivative is. Define the matrix D as

  D = [ 1 -2  1
           1 -2  1
              .  .  .
                 1 -2  1 ].   (28)

Then Dx is the second-order difference (a discrete form of the second-order derivative) of the signal x(n); see Appendix B. If x is smooth, then ||Dx||_2^2 is small in value.

If y(n) is a noisy signal, then a smooth signal x(n), that approximates y(n), can be obtained as the solution to the problem:

  min_x ||y - x||_2^2 + lambda ||Dx||_2^2   (29)

where lambda > 0 is a parameter to be specified. Minimizing ||y - x||_2^2 forces x to be similar to the noisy signal y. Minimizing ||Dx||_2^2 forces x to be smooth. Minimizing the sum in (29) forces x to be both similar to y and smooth (as far as possible, and depending on lambda). If lambda = 0, then the solution will be the noisy data, i.e., x = y, because this solution makes (29) equal to zero. In this case, no smoothing is achieved. On the other hand, the greater lambda is, the smoother x will be. Using (18), the signal x minimizing (29) is given by

  x = (I + lambda D^T D)^{-1} y.   (30)
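The smoothing formula x = (I + lambda D^T D)^{-1} y can be sketched as follows. This NumPy fragment is an illustrative addition (the notes use Matlab, and in practice a banded solver should be used, as noted below); `second_order_diff`, the test signal, and lambda = 50 are assumptions for the demo.

```python
import numpy as np

def second_order_diff(N):
    """(N-2) x N second-order difference matrix with rows [1, -2, 1]."""
    D = np.zeros((N - 2, N))
    for i in range(N - 2):
        D[i, i:i+3] = [1.0, -2.0, 1.0]
    return D

rng = np.random.default_rng(1)
N = 200
t = np.linspace(0, 1, N)
clean = np.sin(2 * np.pi * t)                 # smooth underlying signal
y = clean + 0.1 * rng.standard_normal(N)      # noisy observation

lam = 50.0
D = second_order_diff(N)
x = np.linalg.solve(np.eye(N) + lam * (D.T @ D), y)   # smoothed signal

# The smoothed x should be closer to the clean signal than y is.
print(np.linalg.norm(x - clean) < np.linalg.norm(y - clean))   # prints: True
```

Varying lam trades fidelity to y against smoothness: lam = 0 returns y unchanged, while very large lam over-smooths the signal.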
[Figure 3: Least squares smoothing. Panels: noisy data; least squares smoothing.]

Note that the matrix I + lambda D^T D is banded. (The only non-zero values are near the main diagonal.) Therefore, the solution can be obtained using fast solvers for banded systems [5, Sect. 2.4]. An example of least squares smoothing is illustrated in Fig. 3. A noisy ECG signal is smoothed using (30). We have used the ECG waveform generator ECGSYN [4].

8.4 Deconvolution

Deconvolution refers to the problem of finding the input to an LTI system when the output signal is known. Here, we assume the impulse response of the system is known. The output, y(n), is given by

  y(n) = h(0) x(n) + h(1) x(n-1) + ... + h(N) x(n-N)   (31)

where x(n) is the input signal and h(n) is the impulse response. Equation (31) can be written as y = Hx, where H is a matrix of the form

  H = [ h(0)
        h(1)  h(0)
        h(2)  h(1)  h(0)
         :     :     :   ].

This matrix is constant-valued along its diagonals. Such matrices are called Toeplitz matrices.

It may be expected that x can be obtained from y by solving the linear system y = Hx. In some situations, this is possible. However, the matrix H is often singular or almost singular. In this case, Gaussian elimination encounters division by zeros. For example, Fig. 4 illustrates an input signal, x(n), an impulse response, h(n), and the output signal, y(n). When we attempt to obtain x by solving y = Hx in Matlab, we receive the warning message: "Matrix is singular to working precision" and we obtain a vector of all NaN (not a number).

Due to H being singular, we regularize the problem. Note that the input signal in Fig. 4 is mostly zero; hence, it is reasonable to seek a solution x with small energy. The signal x we seek should also satisfy y = Hx, at least approximately. To obtain such a signal, x, we solve the problem:

  min_x ||y - Hx||_2^2 + lambda ||x||_2^2   (32)

where lambda > 0 is a parameter to be specified. Minimizing ||y - Hx||_2^2 forces x to be consistent with the output signal y. Minimizing ||x||_2^2 forces x to have low energy. Minimizing the sum in (32) forces x to be consistent with y and to have low energy (as far as possible,
and depending on lambda). Using (17), the signal x minimizing (32) is given by

  x = (H^T H + lambda I)^{-1} H^T y.   (33)

This technique is called "diagonal loading" because lambda is added to the diagonal of H^T H. A small value of lambda is sufficient to make the matrix invertible. The solution, illustrated in Fig. 4, is very similar to the original input signal, shown in the same figure.

In practice, the available data is also noisy. In this case, the data y is given by y = Hx + w, where w is the noise. The noise is often modeled as an additive white Gaussian random signal. In this case, diagonal loading with a small lambda will generally produce a noisy estimate of the input signal. In Fig. 4, we used a small value of lambda. When the same value is used with the noisy data, a noisy result is obtained, as illustrated in Fig. 5. A larger lambda is needed so as to attenuate the noise. But if lambda is too large, then the estimate of the input signal is distorted. Notice that with a moderately larger lambda, the noise is reduced
but the height of the pulses present in the original signal is somewhat attenuated. With lambda = 5, the noise is reduced slightly more, but the pulses are substantially more attenuated.

[Figure 4: Deconvolution of noise-free data by diagonal loading. Panels: input signal; impulse response; output signal (noise free); deconvolution (diagonal loading).]

To improve the deconvolution result in the presence of noise, we can minimize the energy of the derivative (or second-order derivative) of x instead. As in the smoothing example above, minimizing the energy of the second-order derivative forces x to be smooth. In order that x is consistent with the data y and is also smooth, we solve the problem:

  min_x ||y - Hx||_2^2 + lambda ||Dx||_2^2   (34)

where D is the second-order difference matrix (28). Using (18), the signal x minimizing (34) is given by

  x = (H^T H + lambda D^T D)^{-1} H^T y.   (35)

The solution obtained using (35) is illustrated in Fig. 6. Compared to the solutions obtained by diagonal loading, illustrated in Fig. 5, this solution is less noisy and less distorted.

This example illustrates the need for regularization even when the data is noise-free (an unrealistic, ideal case). It also illustrates that the choice of regularizer (i.e., ||x||_2^2, ||Dx||_2^2, or other) affects the quality of the result.

8.5 System identification

System identification refers to the problem of estimating an unknown system. In its simplest form, the system is LTI and input-output data is available. Here, we assume that the output signal is noisy. We also assume that the impulse response is relatively short. The output, y(n), of the system can be written as

  y(n) = h_0 x(n) + h_1 x(n-1) + h_2 x(n-2) + w(n)   (36)

where x(n) is the input signal and w(n) is the noise. Here, we have assumed the impulse response h_n is of length 3. We can write this in matrix form as

  [ y(0) ]   [ x(0)               ]
  [ y(1) ]   [ x(1)  x(0)         ] [ h_0 ]
  [ y(2) ] ~ [ x(2)  x(1)  x(0)   ] [ h_1 ]
  [ y(3) ]   [ x(3)  x(2)  x(1)   ] [ h_2 ]
  [  :   ]   [  :     :     :     ]
[Figure 5: Deconvolution of noisy data by diagonal loading. Panels: output signal (noisy); deconvolution (diagonal loading) for two values of lambda.]

[Figure 6: Deconvolution of noisy data by derivative regularization. Panels: output signal (noisy); deconvolution (derivative regularization).]

which we denote as y ~ Xh. If y is much longer than the length of the impulse response h, then X is a tall matrix and y ~ Xh is an overdetermined system of equations. In this case, h can be estimated from (7) as

  h = (X^T X)^{-1} X^T y.   (37)

An example is illustrated in Fig. 7. A binary input signal and noisy output signal are shown. When a length is assumed for h, we obtain the impulse response shown. The residual, i.e., r = y - Xh, is also shown in the figure.

It is informative to plot the root-mean-square error (RMSE) of the residual r as a function of the length of the impulse response. This is a decreasing function. If the data really is the input-output data of an LTI system with a finite impulse response, and if the noise is not too severe, then the RMSE tends to flatten out at the correct impulse response length. This provides an indication of the length of the unknown impulse response.

8.6 Missing sample estimation

Due to transmission errors, transient interference, or impulsive noise, some samples of a signal may be lost or so badly corrupted as to be unusable. In
[Figure 7: Least squares system identification. Panels: input signal; output signal (noisy); estimated impulse response; residual (RMSE); RMSE vs. impulse response length.]

[Figure 8: Least squares estimation of missing data. Panels: signal with missing samples; estimated samples; final output.]

this case, the missing samples should be estimated based on the available uncorrupted data. To complicate the problem, the missing samples may be randomly distributed throughout the signal. Filling in missing values in order to conceal errors is called error concealment [6].

This example shows how the missing samples can be estimated by least squares. As an example, Fig. 8 shows an ECG signal wherein a number of samples are missing. The problem is to fill in the missing samples.

To formulate the problem as a least squares problem, we introduce some notation. Let x be a signal of length N. Suppose K samples of x are known, where K < N. The K-point known signal, y, can be written as

  y = Sx   (38)

where S is a "selection" (or "sampling") matrix of size K x N. For example,
if only the first, second, and last elements of a 5-point signal x are observed, then the matrix S is given by

  S = [ 1 0 0 0 0
        0 1 0 0 0
        0 0 0 0 1 ].   (39)

The matrix S is the identity matrix with rows removed, corresponding to the missing samples. Note that S removes two samples from the signal x,

  Sx = [ 1 0 0 0 0 ] [ x(0) ]   [ x(0) ]
       [ 0 1 0 0 0 ] [ x(1) ] = [ x(1) ] = y.   (40)
       [ 0 0 0 0 1 ] [ x(2) ]   [ x(4) ]
                     [ x(3) ]
                     [ x(4) ]

The vector y consists of the known samples of x. So, the vector y is shorter than x (K < N).

The problem can be stated as: given the signal y and the matrix S, find x such that y = Sx. Of course, there are infinitely many solutions. Below, it is shown how to obtain a smooth solution by least squares.

Note that S^T y has the effect of setting the missing samples to zero. For example, with S in (39) we have

  S^T y = [ y(0), y(1), 0, 0, y(2) ]^T.   (41)

Let us define S_c as the complement of S. The matrix S_c consists of the rows of the identity matrix not appearing in S. Continuing the 5-point example,

  S_c = [ 0 0 1 0 0
          0 0 0 1 0 ].   (42)

Now, an estimate xhat can be represented as

  xhat = S^T y + S_c^T v   (43)

where y is the available data and v consists of the samples to be determined. For example,

  S^T y + S_c^T v = [ y(0), y(1), 0, 0, y(2) ]^T + [ 0, 0, v(0), v(1), 0 ]^T
                  = [ y(0), y(1), v(0), v(1), y(2) ]^T.   (44)

The problem is to estimate the vector v, which is of length N - K. Let us assume that the original signal, x, is smooth. Then it is reasonable to find v to optimize the smoothness of xhat, i.e., to minimize the energy of the second-order derivative of xhat. Therefore, v can be obtained by minimizing ||D xhat||_2^2, where D is the second-order difference matrix (28). Using (43), we find v by solving the problem

  min_v || D (S^T y + S_c^T v) ||_2^2,   (45)

i.e.,

  min_v || D S^T y + D S_c^T v ||_2^2.   (46)

From (7), the solution is given by

  v = -( S_c D^T D S_c^T )^{-1} S_c D^T D S^T y.   (47)

Once v is obtained, the estimate xhat in (43) can be constructed simply by inserting entries v(i) into y.

An example of least squares estimation of missing data using (47) is illustrated in Fig. 8. The result is a smoothly interpolated signal.

We make several remarks.

1. All matrices in (47) are banded, so the computation of v can be implemented with very high efficiency using a
fast solver for banded systems [5, Sect. 2.4]. The banded property of the matrix

  G = S_c D^T D S_c^T   (48)

arising in (47) is illustrated in Fig. 9.
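The missing-sample formula v = -(S_c D^T D S_c^T)^{-1} S_c D^T D S^T y can be sketched numerically. The fragment below is an illustrative NumPy addition (the notes use Matlab, and dense solves stand in for the banded solver); the test signal and missing-sample pattern are assumptions for the demo.

```python
import numpy as np

N = 60
n = np.arange(N)
x_orig = np.sin(2 * np.pi * n / N)          # smooth test signal

known = np.ones(N, dtype=bool)
known[[10, 11, 12, 30, 45]] = False         # indices of the missing samples

I = np.eye(N)
S, Sc = I[known], I[~known]                 # selection matrix and its complement
y = S @ x_orig                              # the available samples

# Second-order difference matrix, rows [1, -2, 1]
D = np.zeros((N - 2, N))
for i in range(N - 2):
    D[i, i:i+3] = [1.0, -2.0, 1.0]

# v = -(Sc D^T D Sc^T)^{-1} Sc D^T D S^T y, then insert v into the signal
G = Sc @ D.T @ D @ Sc.T
v = -np.linalg.solve(G, Sc @ D.T @ D @ S.T @ y)
x_hat = S.T @ y + Sc.T @ v

print(np.max(np.abs(x_hat - x_orig)))       # small interpolation error
```

The known samples are preserved exactly, and the filled-in samples interpolate smoothly because the solution minimizes the energy of the second-order difference.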
[Figure 9: Visualization of the banded matrix G in (48). All the non-zero values lie near the main diagonal.]

2. The method does not require the pattern of missing samples to have any particular structure. The missing samples can be distributed quite randomly.

3. This method (47) does not require any regularization parameter lambda be specified. However, this derivation does assume the available data, y, is noise-free. If y is noisy, then simultaneous smoothing and missing sample estimation is required (see the Exercises).

Speech de-clipping: In audio recording, if the amplitude of the audio source is too high, then the recorded waveform may suffer from clipping (i.e., saturation). Figure 10 shows a speech waveform that is clipped. All values greater than the clipping level in absolute value are lost.

To estimate the missing data, we can use the least squares approach given by (47). That is, we fill in the missing data so as to minimize the energy of the derivative of the total signal. In this example, we minimize the energy of the third derivative. This encourages the filled-in data to have the form of a parabola (second-order polynomial), because the third derivative of a parabola is zero.

[Figure 10: Estimation of speech waveform samples lost due to clipping. The lost samples are estimated by least squares. Panels: clipped speech with missing samples; estimated samples; final output.]

In order to use (47) for this problem, we only need to change the matrix D. If we define the matrix D as

  D = [ 1 -3  3 -1
           1 -3  3 -1
               .  .  .
                  1 -3  3 -1 ],   (49)

then Dx is an approximation of the third-order derivative of the signal x. Using (47) with D defined in (49), we obtain the signal shown in Fig. 10. The samples lost due to clipping have been smoothly filled in. Figure 11 shows both the clipped signal and the estimated samples on the same axis.
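The k-th order difference matrix used above (second order for smoothing, third order for de-clipping) can be built for any order from binomial coefficients. This is an illustrative NumPy sketch; the helper name `diff_matrix` is an assumption, not from the notes.

```python
import numpy as np
from math import comb

def diff_matrix(N, k):
    """(N-k) x N k-th order difference matrix.

    Each row holds the coefficients of (1 - z^{-1})^k, i.e. alternating-sign
    binomial coefficients, applied to k+1 consecutive samples.
    """
    coeffs = np.array([(-1.0) ** j * comb(k, j) for j in range(k + 1)])[::-1]
    D = np.zeros((N - k, N))
    for i in range(N - k):
        D[i, i:i + k + 1] = coeffs
    return D

D3 = diff_matrix(8, 3)
print(D3[0, :4])                       # row pattern of the 3rd-order difference

# The third difference of a parabola is zero -- the property the
# de-clipping example relies on.
parabola = np.arange(8, dtype=float) ** 2
print(np.allclose(D3 @ parabola, 0))   # prints: True
```

Passing k = 2 reproduces the [1, -2, 1] rows of the smoothing matrix, and k = 3 the [1, -3, 3, -1] rows used for de-clipping.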
[Figure 11: The available clipped speech waveform is shown in blue; the filled-in signal, estimated by least squares, is shown in red.]

9 Exercises

1. Find the solution x to the least squares problem:

  min_x ||y - Ax||_2^2 + lambda ||x - b||_2^2

2. Show that the solution x to the least squares problem

  min_x lambda_1 ||b_1 - A_1 x||_2^2 + lambda_2 ||b_2 - A_2 x||_2^2 + lambda_3 ||b_3 - A_3 x||_2^2

is

  x = ( lambda_1 A_1^T A_1 + lambda_2 A_2^T A_2 + lambda_3 A_3^T A_3 )^{-1} ( lambda_1 A_1^T b_1 + lambda_2 A_2^T b_2 + lambda_3 A_3^T b_3 ).   (50)

3. In reference to (17), why is H^T H + lambda I with lambda > 0 invertible even if H^T H is not?

4. Show (56).

5. Smoothing. Demonstrate least squares smoothing of noisy data. Use various values of lambda. What behavior do you observe when lambda is very high?

6. The second-order difference matrix (28) was used in the examples for smoothing, deconvolution, and estimating missing samples. Discuss the use of the third-order difference instead. Perform numerical experiments and compare results of 2nd- and 3rd-order difference matrices.

7. System identification. Perform a system identification experiment with varying variance of additive Gaussian noise. Plot the RMSE versus impulse response length. How does the plot of RMSE change with respect to the variance of the noise?

8. Speech de-clipping. Record your own speech and use it to artificially create a clipped signal. Perform numerical experiments to test the least squares estimation of the lost samples.

9. Suppose the available data is noisy and some samples are missing. Formulate a suitable least squares optimization problem to smooth the data and recover the missing samples. Illustrate the effectiveness by a numerical demonstration (e.g., using Matlab).

A Vector derivatives

If f(x) is a function of x_1, ..., x_N, then the derivative of f(x) with respect to x is the vector of derivatives,

  df(x)/dx = [ df(x)/dx_1, df(x)/dx_2, ..., df(x)/dx_N ]^T.   (51)

This is the gradient of f, denoted grad f. By direct calculation, we have

  (d/dx) ( x^T b ) = b   (52)

and

  (d/dx) ( b^T x ) = b.   (53)

Suppose that A is a symmetric real matrix, A^T = A. Then, by direct calculation, we also have

  (d/dx) ( x^T A x ) = 2Ax.   (54)
Also, by direct calculation,

  (d/dx) ( (y - x)^T A (y - x) ) = 2A(x - y)   (55)

and

  (d/dx) ( ||Ax - b||_2^2 ) = 2 A^T (Ax - b).   (56)

We illustrate (54) by an example. Set A as the matrix,

  A = [ 3 2
        2 5 ].   (57)

Then, by direct calculation,

  x^T A x = [ x_1 x_2 ] [ 3 2 ] [ x_1 ]  = 3 x_1^2 + 4 x_1 x_2 + 5 x_2^2,   (58)
                        [ 2 5 ] [ x_2 ]

so

  (d/dx_1) ( x^T A x ) = 6 x_1 + 4 x_2   (59)

and

  (d/dx_2) ( x^T A x ) = 4 x_1 + 10 x_2.   (60)

Let us verify that the right-hand side of (54) gives the same:

  2Ax = 2 [ 3 2 ] [ x_1 ]  = [ 6 x_1 + 4 x_2  ]   (61)
          [ 2 5 ] [ x_2 ]    [ 4 x_1 + 10 x_2 ]

B The K-th order difference

The first-order difference of a discrete-time signal x(n), n in Z, is defined as

  y(n) = x(n) - x(n-1).   (62)

This is represented as a system D with input x and output y:

  x --> [ D ] --> y

The second-order difference is obtained by taking the first-order difference twice:

  x --> [ D ] --> [ D ] --> y

which gives the difference equation

  y(n) = x(n) - 2 x(n-1) + x(n-2).   (63)

The third-order difference is obtained by taking the first-order difference three times:

  x --> [ D ] --> [ D ] --> [ D ] --> y

which gives the difference equation

  y(n) = x(n) - 3 x(n-1) + 3 x(n-2) - x(n-3).   (64)

In terms of discrete-time linear time-invariant (LTI) systems, the first-order difference is an LTI system with transfer function

  D(z) = 1 - z^{-1}.

The second-order difference has the transfer function

  D^2(z) = (1 - z^{-1})^2 = 1 - 2 z^{-1} + z^{-2}.

The third-order difference has the transfer function

  D^3(z) = (1 - z^{-1})^3 = 1 - 3 z^{-1} + 3 z^{-2} - z^{-3}.

Note that the coefficients come from Pascal's triangle:

        1
       1 1
      1 2 1
     1 3 3 1
    1 4 6 4 1
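The transfer-function view above can be checked mechanically: cascading first-order differences is polynomial multiplication, i.e., repeated convolution of the impulse response [1, -1] with itself. A short NumPy sketch (an illustrative addition, not from the notes):

```python
import numpy as np

# Impulse response of the first-order difference D(z) = 1 - z^{-1}.
d = np.array([1.0, -1.0])

# Cascading systems multiplies transfer functions, i.e. convolves
# impulse responses: these give the rows of Pascal's triangle with
# alternating signs.
d2 = np.convolve(d, d)    # second-order difference
d3 = np.convolve(d2, d)   # third-order difference

print(d2)   # prints: [ 1. -2.  1.]
print(d3)   # prints: [ 1. -3.  3. -1.]
```

These are exactly the row coefficients used in the difference matrices (28) and (49).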
C Additional Exercises for Signal Processing Students

1. For smoothing a noisy signal using least squares, we obtained (30),

  x = (I + lambda D^T D)^{-1} y,  lambda > 0.

The matrix G = (I + lambda D^T D)^{-1} can be understood as a low-pass filter. Using Matlab, compute and plot the output when the input is an impulse, delta(n - n_o), located at index n_o. Try placing the impulse at various points in the signal. For example, put the impulse around the middle of the signal. What happens when the impulse is located near the ends of the signal (n_o = 0 or n_o = N - 1)?

2. For smoothing a noisy signal using least squares, we obtained (30),

  x = (I + lambda D^T D)^{-1} y,  lambda > 0,

where D represents the K-th order derivative. Assume D is the first-order difference and that the matrices I and D are infinite in size. Then I + lambda D^T D is a convolution matrix and represents an LTI system.
(a) Find and sketch the frequency response of the system I + lambda D^T D.
(b) Based on (a), find the frequency response of G = (I + lambda D^T D)^{-1}. Sketch the frequency response of G. What kind of filter is it (low-pass, high-pass, band-pass, etc.)? What is its dc gain?

References

[1] C. S. Burrus. Least squared error design of FIR filters. Connexions Web site. http://cnx.org/content/m689/3/
[2] C. S. Burrus. General solutions of simultaneous equations. Connexions Web site, 2013. http://cnx.org/content/m956/5/
[3] C. L. Lawson and R. J. Hanson. Solving Least Squares Problems. SIAM, 1987.
[4] P. E. McSharry, G. D. Clifford, L. Tarassenko, and L. A. Smith. A dynamical model for generating synthetic electrocardiogram signals. IEEE Trans. on Biomed. Eng., 50(3):289-294, March 2003.
[5] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge University Press, 1992.
[6] Y. Wang and Q. Zhu. Error control and concealment for video communication: A review. Proc. IEEE, 86(5):974-997, May 1998.