Least Squares Estimation, Filtering, and Prediction
Statistical Signal Processing II: Linear Estimation
Eric Wan, Ph.D.
Fall 2015
Motivations
If the second-order statistics are known, the optimum estimator is given by the normal equations, or the solution to the Wiener-Hopf equations.
For most applications, the actual statistics are unknown. An alternative approach is to estimate the coefficients from observed data.
Two possible approaches:
- Estimate the required moments from the available data and build an approximate MMSE estimator
- Build an estimator that minimizes some error functional calculated from the available data
MMSE versus Least Squares
Recall that MMSE estimators are optimal in expectation across the ensemble of all stochastic processes with the same second-order statistics.
Least squares estimators minimize the error on a given block of data. In signal processing applications, the block of data is a finite-length period of time.
Note the book defines E as a sum instead of an average.
No guarantees about optimality on other data sets or other stochastic processes.
When can we infer something about the ensemble performance based on a single observation sequence of an experiment?
MMSE versus Least Squares
No guarantees about optimality on other data sets or other stochastic processes.
When can we infer something about the ensemble performance based on a single observation sequence of an experiment?
If the process is ergodic and stationary, the LSE estimator approaches the MMSE estimator as the size of the data set grows.
Will only discuss the sum of squares as the performance criterion (recall our earlier discussion about alternatives).
Rationale:
- Mathematically tractable: picking the sum of squares will permit us to obtain a closed-form optimal solution
- The solution only depends on second-order moments, which are easily estimated
Block Processing
Least Squares
Least squares is a method for finding the best fit to a linear system of N equations and M unknowns:
\[ a_{11} x_1 + a_{12} x_2 = y_1, \qquad a_{21} x_1 + a_{22} x_2 = y_2 \qquad (N = M) \]
\[ \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}, \qquad A x = y, \quad x = A^{-1} y \]
Least Squares
Least squares is a method for finding the best fit to a linear system of N equations and M unknowns:
\[ \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} \qquad (N > M) \]
Now Ax = y has no exact solution in general; A is not square, so x = A^{-1} y no longer applies. Define the error e = y - Ax. A is called the data matrix. The least squares criterion is
\[ \min_x \left( e^T e \right) = \min_x \sum_{n=1}^{N} e(n)^2 \]
(note index n starts at 1 versus 0 in the book)
Linear Least Squares Estimation and Filtering
Back to the book's notation
Linear Least Squares Estimation and Filtering
Definitions
Matrix Formulation - Multiple Sensors
(previous notation: e = y - Ax)
Matrix Formulation - Multiple Sensors
Matrix Formulation - Filtering
The principle is the same, but we need to consider edge effects. Will come back to the computational and windowing aspects of filtering later.
Matrix Formulation
Back to multiple signals
Squared Error
Squared Error Components
(note that the book does not normalize by 1/N)
Squared Error Components
Relating LSE to MMSE Estimation
Plugging in our definitions for R and d, the result should look familiar from before. Many of the concepts, solutions, etc. will be similar.
Least Squares Estimate
Three ways to solve for the least squares estimate:
1. Take the gradient and set it to zero
2. Complete the square
3. Orthogonality
Least Squares Estimate
1. Take the gradient and set it to zero:
\[ e^T e = (y - Xc)^T (y - Xc) = y^T y - 2 c^T X^T y + c^T X^T X c \]
\[ \frac{\partial\, (e^T e)}{\partial c} = -2 X^T y + 2 X^T X c = 0 \]
This yields the normal equations:
\[ X^T X c = X^T y \quad \Longleftrightarrow \quad \hat{R} c = \hat{d} \]
For N > M the problem is almost always overdetermined, and hence the columns of the data matrix X are independent. This implies X^T X is full rank, so
\[ c_{LS} = (X^T X)^{-1} X^T y = \hat{R}^{-1} \hat{d} \]
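The normal-equation solution above can be checked numerically. A minimal numpy sketch on a small made-up overdetermined system (the data here is random and purely illustrative):

```python
import numpy as np

# Hypothetical overdetermined system: N = 5 equations, M = 2 unknowns.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))        # data matrix, full column rank
y = rng.standard_normal(5)             # observations

# Normal equations: X^T X c = X^T y
c_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with a numerically robust least squares routine
c_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(c_ls, c_ref)

# Orthogonality check: the residual is orthogonal to the columns of X
e = y - X @ c_ls
assert np.allclose(X.T @ e, 0.0)
```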
Least Squares Estimate
2. Complete the square
Least Squares Estimate
2. Complete the square
Both the LSE and MSE criteria are quadratic functions of the coefficient vector.
Same form as the FIR Wiener solution. When are they equivalent? For an ergodic process, in the limit of large N.
A Toy Example: Line Fitting
Fit y = ax + b to data points (x_n, y_n):
\[ \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}, \qquad e = y - Xc \]
Note the coefficients could be for a higher-order polynomial. The system must be linear in the unknown parameters, not in the equations themselves.
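A numpy sketch of the line fit, on a small made-up noisy line (values chosen only for illustration):

```python
import numpy as np

# Fit y = a*x + b by least squares; the data is a made-up noisy line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.1, 0.0, 0.1, -0.1])

X = np.column_stack([x, np.ones_like(x)])    # each row is [x_n, 1]
a, b = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose([a, b], [1.98, 1.04])     # closed-form fit for this data

# A quadratic fit is still *linear* least squares: just add an x^2 column.
Xq = np.column_stack([x ** 2, x, np.ones_like(x)])
a2, a1, a0 = np.linalg.lstsq(Xq, y, rcond=None)[0]
```

The second fit illustrates the slide's point: the model is nonlinear in x but still linear in the unknown parameters, so the same machinery applies.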
More Applications
Applications we've seen earlier for linear estimation / Wiener filtering:
- Noise reduction
- Equalization
- Prediction
- System identification
These can be solved using LS given a block of data. Applies to FIR (not general IIR) filters. Numerical and computational aspects need further investigation.
Computational Issues
Example: Time Series Prediction (from McNames notes)
Goal: predict the S&P 500. Clearly not a stationary signal. What might we do? A common trick: difference the time series.
Example: Difference Time Series
Example: Percent Change Time Series
Example: Prediction Results
Orthogonality and Geometric Interpreta?on (2-D illustra?on) Consider the simple example:! X = # 2 " 1 $! & y = # 2 % " 2 $ & % e = y - Xc y e o Xc LS ŷ = Xc ŷ(0) = 2c ŷ(1) =1c If we want to make Xc LS as close as possible to y, then the error vector e should be orthogonal to the line (column space) Xc (Xc) T e o = 0 c c T X T (Xc LS y) = c T [X T Xc LS X T y] = 0 c P = X(X T X) 1 X T Since this must hold for all c, we must have, X T Xc LS = X T y (normal equa2ons) 29
Orthogonality and Geometric Interpreta?on (2-D illustra?on) Consider the simple example:! X = # 2 " 1 $! & y = # 2 % " 2 $ & % e = y - Xc y e o Xc LS ŷ = Xc ŷ(0) = 2c ŷ(1) =1c Subs?tu?ng directly for the LS solu?on: c LS = ( X T X) 1 X T y ŷ = X(X T X) 1 X T y The matrix P = X(X T X) 1 X T is a projec?on operator which projects y onto the space spanned by X 30
Orthogonality and Geometric Interpretation (book)
Orthogonality and Geometric Interpretation
Orthogonality and Geometric Interpretation
Uniqueness
The Pseudoinverse
We can write the least squares solution as
\[ c_{LS} = (X^T X)^{-1} X^T y = X^{+} y \]
where we have defined the pseudoinverse of a matrix X with linearly independent columns:
\[ X^{+} = (X^T X)^{-1} X^T \]
The pseudoinverse has the following properties:
i) X X^{+} X = X
ii) (X X^{+})^T = X X^{+}
It can be shown using orthogonality that any matrix X^{+} satisfying the above two conditions yields the least squares solution X^{+} y to the equation y - Xc = e.
Minimum Norm Solution
Suppose that the columns of X are not linearly independent, or simply N < M. Then X^T X cannot be inverted, and there are an infinite number of solutions which solve y = Xc exactly. Which to choose?
\[ X = \begin{bmatrix} 1 & 2 \end{bmatrix}, \quad y = \begin{bmatrix} 4 \end{bmatrix}: \qquad 1 c_1 + 2 c_2 = 4, \quad y - Xc = 0 \]
c_min is orthogonal to the subspace of solutions y - Xc = 0, i.e., c_min is orthogonal to the null space of X, so c_min lies in the range space of X^T:
\[ c_{min} = X^T \lambda \]
Minimum Norm Solution
Solving:
\[ c_{min} = X^T \lambda, \qquad y = X c_{min} = X X^T \lambda \;\Rightarrow\; \lambda = (X X^T)^{-1} y \]
\[ c_{min} = X^T (X X^T)^{-1} y = X^{+} y \]
Pseudoinverse: X^{+} = X^T (X X^T)^{-1}, with the additional properties
iii) X^{+} X X^{+} = X^{+}
iv) (X^{+} X)^T = X^{+} X
This is the Moore-Penrose pseudoinverse. For any matrix X there is only one matrix X^{+} that satisfies all four conditions.
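The minimum-norm formula can be checked on the slide's underdetermined example, where `numpy.linalg.pinv` computes the Moore-Penrose pseudoinverse:

```python
import numpy as np

# Underdetermined system from the slide: 1*c1 + 2*c2 = 4 (N = 1 < M = 2).
X = np.array([[1.0, 2.0]])
y = np.array([4.0])

# Minimum-norm solution c_min = X^T (X X^T)^{-1} y
c_min = X.T @ np.linalg.inv(X @ X.T) @ y
assert np.allclose(c_min, [0.8, 1.6])

# np.linalg.pinv implements the Moore-Penrose pseudoinverse and agrees
assert np.allclose(c_min, np.linalg.pinv(X) @ y)
# c_min solves the system exactly
assert np.allclose(X @ c_min, y)
```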
Weighted Least Squares
Weighted Least Squares
Weighted Least Squares
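The weighted least squares slides carry no extracted formulas, so the following numpy sketch follows the standard WLS solution c = (X^T W X)^{-1} X^T W y (an assumption about the slides' content; the random data is purely illustrative):

```python
import numpy as np

# Weighted least squares sketch: minimize sum_n w_n * e(n)^2.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
w = rng.uniform(0.5, 2.0, size=20)       # per-sample weights
W = np.diag(w)

# Weighted normal equations: (X^T W X) c = X^T W y
c_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent "whitened" formulation: scale each row by sqrt(w_n)
sw = np.sqrt(w)
c_alt = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
assert np.allclose(c_wls, c_alt)
```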
Properties of the LS Estimate
Assume a statistical model of how the data was generated. Some properties won't hold when the model is not accurate.
Deterministic versus Stochastic Data Matrix
Estimator Properties (Deterministic case)
Estimator Properties (Deterministic case)
Error Variance: with the definition given, this is an unbiased estimate of the true error variance. See the book for the proof.
Estimator Properties (Deterministic case)
Other properties - see the book for proofs
Estimator Properties (Stochastic case)
Another Perspective
System identification: x(n) → H(z) → y(n), with additive noise v(n); the LS model c_LS produces ŷ(n), and e(n) = y(n) - ŷ(n).
Additive noise did not affect the Wiener solution. The LS solution is still unbiased (if the model matches), but the noise adds variance to the LS solution, so you need more data for a good fit.
Example: Power Spectral Estimation
Power spectral estimation (material from chapter 7):
1. Use a non-parametric approach (e.g., the Welch method) from ECE538
2. Fit a model: e(n) → H(z) → x(n)
If e(n) is white, then \( R_x(e^{j\omega}) = \sigma_e^2 \, |H(e^{j\omega})|^2 \).
Use an autoregressive model driven by noise e(n):
\[ x(n) = \sum_{k=1}^{M} a_k \, x(n-k) + e(n) \]
Example: Power Spectral Estimation
Least squares solution:
\[ a_{LS} = c_{LS} = (X^T X)^{-1} X^T y \]
Note autoregressive models have a duality with prediction: y(n) = x(n+1), so the predictor x(n) → h_{lp}(n) → ŷ(n) produces the error e(n+1).
(We will discuss windowing aspects later.)
Example: Power Spectral Estimation
Generate some data: e(n) → H(z) → x(n), with
\[ H(z) = \frac{1}{1 + 0.3 z^{-1} + 0.6 z^{-2}} \]
[Plot: generated time series x(n), n = 0 to 1000]
Example: Power Spectral Estimation
True power spectrum
[Plot: R_xx(e^{jω}), power (dB) versus normalized frequency]
Example: Power Spectral Estimation
Periodogram
[Plot: R_xx(e^{jω}) versus |DFT|², power (dB) versus normalized frequency]
Example: Power Spectral Estimation
Welch method
[Plot: R_xx(e^{jω}) versus |DFT|² (Welch, 256-sample segments), power (dB) versus normalized frequency]
Example: Power Spectral Estimation
LS fit (M = 5)
[Plot: R_xx(e^{jω}) versus AR fit, power (dB) versus normalized frequency]
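The example above can be sketched end to end in numpy: generate data from the slides' AR(2) filter, refit the coefficients by least squares, and form the parametric PSD. (The slides use MATLAB; this numpy version is illustrative only, and uses an AR(2) fit rather than the slides' M = 5.)

```python
import numpy as np

# White e(n) drives H(z) = 1 / (1 + 0.3 z^-1 + 0.6 z^-2); refit by LS.
rng = np.random.default_rng(0)
N, a_true = 5000, np.array([0.3, 0.6])
x = np.zeros(N)
e = rng.standard_normal(N)
for n in range(2, N):                     # start at 2 so x[-1] is never used
    x[n] = e[n] - a_true[0] * x[n - 1] - a_true[1] * x[n - 2]

# One-step prediction as least squares: target x(n), regressors past samples
X = np.column_stack([x[1:-1], x[:-2]])    # rows are [x(n-1), x(n-2)]
y = x[2:]
c = np.linalg.lstsq(X, y, rcond=None)[0]
a_hat = -c                                # sign convention of H(z) above
assert np.allclose(a_hat, a_true, atol=0.05)

# Parametric PSD: S(w) = sigma_e^2 / |1 + a1 e^{-jw} + a2 e^{-j2w}|^2
w = np.linspace(0, np.pi, 256)
A = 1 + a_hat[0] * np.exp(-1j * w) + a_hat[1] * np.exp(-2j * w)
S = 1.0 / np.abs(A) ** 2
```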
Power Spectral Estimation
Mixing parametric and non-parametric:
- Use the prediction filter to whiten the error
- Use a non-parametric method on the residual
- Color the PSD estimate using the AR model (undo the pre-whitening)
Least Squares Filtering
Additional aspects
Edge conditions and windowing
Computing the correlation matrix
Derivation of the Correlation Matrix Recursion
Window options
See the text for minor modifications to the correlation recursions
More on Computationally Efficient Methods
- Methods based on forward-backward prediction and order-recursive algorithms (many details in Chapter 7, 8.5.2). We will touch on some aspects relating to prediction. Allows for an alternative windowing approach (8.4.2).
- General linear algebra approaches (come back to this): Cholesky decomposition, SVD, etc. (8.5)
Linear Prediction
Recall the Wiener solution with y(n) = x(n+1):
\[ x(n) = \sum_{k=1}^{M} a_k \, x(n-k) + e(n), \qquad R a_o = d \]
What is d?
\[ d = E[\mathbf{x}(n)\, y(n)] = E[\mathbf{x}(n)\, x(n+1)] = \begin{bmatrix} r_x(1) \\ r_x(2) \\ \vdots \\ r_x(M) \end{bmatrix} = \bar{r} \]
\[ R a_o = \bar{r} \quad \Longleftrightarrow \quad r(l) = \sum_{k=1}^{M} a_k \, r(l-k), \quad l = 1, \dots, M \]
These are known as the Yule-Walker equations. They lead to efficient order-recursive computations (Levinson-Durbin algorithms).
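A numpy sketch of the Yule-Walker route: estimate autocorrelations from data, build the Toeplitz matrix R, and solve R a = r̄ (the AR(2) process and sample sizes are made up for illustration; Levinson-Durbin would solve the same system recursively):

```python
import numpy as np

# Estimate autocorrelations, then solve the Yule-Walker equations R a = r.
rng = np.random.default_rng(0)
N, M = 20000, 2
x = np.zeros(N)
e = rng.standard_normal(N)
for n in range(2, N):                      # AR(2): a1 = -0.3, a2 = -0.6
    x[n] = e[n] - 0.3 * x[n - 1] - 0.6 * x[n - 2]

# Biased autocorrelation estimates r(0), ..., r(M)
r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(M + 1)])
R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])  # Toeplitz
a = np.linalg.solve(R, r[1:])              # Yule-Walker: R a = [r(1), ..., r(M)]

# In the convention x(n) = sum_k a_k x(n-k) + e(n), the truth is [-0.3, -0.6]
assert np.allclose(a, [-0.3, -0.6], atol=0.05)
```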
!" Backward Linear Predic?on Think of?me running in the reverse y(n) = x(n +1) a f x(n +1) x(n) a f ŷ(n)!" e f (n) y(n) = x(n M 1) x(n M 1) a b e b (n) ŷ(n) a b x(n) M x(n) = a f x(n k) + e f (n) x(n M 1) = a b x(n k) + e b (n) k k k=1 M k=1 Easy to show using the Yule-Walker equa?ons that So how do we make use of this? Note, book s nota?on slightly different (b = a b ) a f = flip(a b ) 63
Forward-Backward Linear Prediction
Minimize the forward and backward error: stack the forward data matrix F and backward data matrix B, doubling the size of the data matrix, and solve for a single coefficient vector a^{FB}.
- Lowers the variance of the LS estimate
- Correlation or modified covariance window methods
- See the book for additional details and more careful notation
- See MATLAB's ar(x,M) (uses forward-backward and short/no windows by default)
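A numpy sketch of the stacked forward-backward (modified covariance) idea. The index conventions here are my own illustrative choice, not necessarily the book's; the backward regressors are time-reversed so that both halves share one coefficient vector:

```python
import numpy as np

# Forward-backward LS on a made-up AR(2) process (a1 = -0.3, a2 = -0.6).
rng = np.random.default_rng(0)
N, M = 2000, 2
x = np.zeros(N)
e = rng.standard_normal(N)
for n in range(2, N):
    x[n] = e[n] - 0.3 * x[n - 1] - 0.6 * x[n - 2]

idx = np.arange(M, N)
# Forward: predict x(n) from [x(n-1), ..., x(n-M)]
F = np.column_stack([x[idx - k] for k in range(1, M + 1)])
yF = x[idx]
# Backward: predict x(n-M) from [x(n-M+1), ..., x(n)] (reversed window),
# which for a real stationary process uses the same coefficient vector
B = np.column_stack([x[idx - M + k] for k in range(1, M + 1)])
yB = x[idx - M]

Xfb = np.vstack([F, B])                  # stacked (doubled) data matrix
yfb = np.concatenate([yF, yB])
c_fb = np.linalg.lstsq(Xfb, yfb, rcond=None)[0]
assert np.allclose(c_fb, [-0.3, -0.6], atol=0.05)
```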
Application Example: Narrowband Interference Canceling
Application Example: Narrowband Interference Canceling
Application Example: Narrowband Interference Canceling
Application Example: Narrowband Interference Canceling
Just a D-step ahead predictor. Sometimes called a line enhancer.
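A numpy sketch of the D-step-ahead predictor on a made-up sinusoid-plus-noise signal (all parameters hypothetical): the narrowband component remains predictable D samples ahead while white noise does not, so the predictor output enhances the line and the residual keeps the broadband part.

```python
import numpy as np

# D-step-ahead predictor ("line enhancer") on sinusoid + white noise.
rng = np.random.default_rng(0)
N, M, D = 4000, 8, 25
n = np.arange(N)
s = np.sin(2 * np.pi * 0.05 * n)              # narrowband component
x = s + 0.5 * rng.standard_normal(N)          # observed signal

# Data matrix of delayed samples x(n-D), ..., x(n-D-M+1); target x(n)
rows = np.arange(D + M - 1, N)
X = np.column_stack([x[rows - D - k] for k in range(M)])
y = x[rows]
c = np.linalg.lstsq(X, y, rcond=None)[0]

y_hat = X @ c                                  # enhanced line estimate
err_pred = np.mean((y_hat - s[rows]) ** 2)     # predictor vs clean sinusoid
err_raw = np.mean((y - s[rows]) ** 2)          # raw observation vs sinusoid
assert err_pred < err_raw                      # predictor suppresses the noise
```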
Example: Microelectrode Narrowband Interference
Example: Microelectrode Narrowband Interference
Signal PSD
[Plot: input PSD (scaled) versus frequency (Hz), 0-10000 Hz]
Example: Microelectrode Narrowband Interference
Residual PSD
[Plot: output PSD (scaled) versus frequency (Hz), 0-10000 Hz]
Example: Microelectrode Narrowband Interference
Input and predicted signal: NMSE = 93.4%, D = 44 (5 seconds), M = 500
[Plot: observed and estimated signal versus time (s)]
Example: Microelectrode Narrowband Interference
Prediction filter frequency response
[Plot: magnitude response |H(e^{jω})|² versus frequency (Hz)]
Example: Microelectrode Narrowband Interference
Prediction error filter frequency response
[Plot: magnitude response |H(e^{jω})|² versus frequency (Hz)]
Example: OGI Seminar
[Plots: original recording; noise segment]
Example: OGI Seminar
D = 1, M = 500
[Plots: original recording; enhanced speech]
(Audition noise reduction)
Example Application: IIR Filtering / System ID
We previously used a predictor / AR model (all-pole) for spectral estimation. Now consider a general zero/pole IIR model. Given input x(n) and desired output y(n) (with additive noise v(n)):
\[ H_{LS}(z) = \frac{B(z)}{A(z)}, \qquad \hat{y}(n) = \sum_{k=1}^{M-1} a_k \, \hat{y}(n-k) + \sum_{k=0}^{M-1} b_k \, x(n-k) \]
IIR Filtering / System ID
Data matrices for
\[ \hat{y}(n) = \sum_{k=1}^{M-1} a_k \, \hat{y}(n-k) + \sum_{k=0}^{M-1} b_k \, x(n-k), \qquad e = y - Xc \]
What's wrong with this? How do we get ŷ(n)?
IIR Filtering / System ID
Substitute the best available values for ŷ(n-k): use the measured y(n-k) in the data matrix. This is called the Equation Error method.
It is easy to show the solution is unbiased if there is no noise, v(n) = 0 (so that ŷ(n) → y(n)).
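An equation-error sketch in numpy, on a made-up first-order pole/zero system (coefficients and orders are hypothetical): measured outputs y(n-k) replace ŷ(n-k) in the regressor, making the fit ordinary linear least squares. With no output noise the recovery is exact, matching the unbiasedness claim.

```python
import numpy as np

# Equation-error system ID: H(z) = (1 + 0.4 z^-1) / (1 - 0.5 z^-1).
rng = np.random.default_rng(0)
N = 5000
x = rng.standard_normal(N)
a_true, b_true = 0.5, np.array([1.0, 0.4])
y = np.zeros(N)
for n in range(1, N):
    y[n] = a_true * y[n - 1] + b_true[0] * x[n] + b_true[1] * x[n - 1]

# Regressor uses measured y(n-1) in place of y_hat(n-1): [y(n-1), x(n), x(n-1)]
rows = np.arange(1, N)
X = np.column_stack([y[rows - 1], x[rows], x[rows - 1]])
c = np.linalg.lstsq(X, y[rows], rcond=None)[0]

# No output noise (v(n) = 0), so the fit recovers the true coefficients exactly
assert np.allclose(c, [a_true, *b_true], atol=1e-8)
```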
Back to Numerical Methods
QR decomposition (just the basics). Any matrix with linearly independent columns can be factored as
\[ X = QR \]
where R is upper triangular and invertible (not to be confused with the autocorrelation matrix) and the columns of Q are orthonormal: Q^T Q = I. The factorization is computed using Gram-Schmidt or Householder algorithms.
Substituting into the LS equations:
\[ c_{LS} = (X^T X)^{-1} X^T y = (R^T Q^T Q R)^{-1} R^T Q^T y = (R^T R)^{-1} R^T Q^T y = R^{-1} Q^T y \]
so R c_{LS} = Q^T y, which is easily solved using back substitution since R is upper triangular. This is MATLAB's backslash command: c_LS = X \ y.
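The QR route above, sketched in numpy on random illustrative data:

```python
import numpy as np

# QR route to least squares: X = QR, then solve R c = Q^T y
# instead of forming X^T X (better numerical conditioning).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)

Q, R = np.linalg.qr(X)               # "reduced" QR: Q is 50x4, R is 4x4
c_qr = np.linalg.solve(R, Q.T @ y)   # R is triangular, so this is cheap

c_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(c_qr, c_ref)
```

Note `np.linalg.solve` uses a general solver here; a dedicated triangular solver (e.g., scipy's `solve_triangular`) would exploit the back-substitution structure directly.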
Back to Numerical Methods
Singular Value Decomposition: any matrix of rank r can be factored
Back to Numerical Methods
Singular Value Decomposition: any matrix of rank r can be factored as X = U Σ V^T. It is easy to show that the pseudoinverse is given by X^{+} = V Σ_r^{-1} U^T (inverting only the r nonzero singular values).
Thus the LS solution is just c_LS = X^{+} y.
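The SVD route, sketched in numpy on random full-rank illustrative data (so all singular values are inverted):

```python
import numpy as np

# SVD route: X = U S V^T; pseudoinverse X^+ = V S^{-1} U^T, c_LS = X^+ y.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))      # full column rank almost surely
y = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(X_pinv, np.linalg.pinv(X))

c_ls = X_pinv @ y
c_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(c_ls, c_ref)
```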
Other Topics Not Covered
- Additional details on numerical methods
- Details on signal modeling and parametric spectral estimation
- Minimum variance spectral estimation
- Harmonic models and super-resolution algorithms