ELEG-636: Statistical Signal Processing
1 ELEG-636: Statistical Signal Processing. Gonzalo R. Arce, Department of Electrical and Computer Engineering, University of Delaware, Spring 2010. Gonzalo R. Arce (ECE, Univ. of Delaware), ELEG-636: Statistical Signal Processing, Spring 2010.
2 Course Objectives & Structure. Objective: given a discrete-time sequence {x(n)}, develop: statistical and spectral signal representations; filtering, prediction, and system identification algorithms; and optimization methods that are statistical and adaptive. Course structure: weekly lectures [notes: arce]; periodic homework (theory & Matlab implementations) [15%]; midterm & final examinations [85%]. Textbook: Haykin, Adaptive Filter Theory.
3 Course Objectives & Structure. Broad applications in communications, imaging, and sensors. Emerging applications in brain-imaging techniques: brain-machine interfaces, implantable devices. Neurofeedback presents real-time physiological signals from MRIs in a visual or auditory form to provide information about brain activity. These signals are used to train the patient to alter neural activity in a desired direction. Traditionally, feedback using EEGs or other mechanisms has not focused on the brain because the resolution is not good enough.
4 Motivation: Adaptive Optimization and Filtering Methods. Adaptive optimization and filtering methods are appropriate, advantageous, or necessary when: signal statistics are not known a priori and must be learned from observed or representative samples; signal statistics evolve over time; or time or computational restrictions dictate that simple, if repetitive, operations be employed rather than solving more complex, closed-form expressions. To be considered are the following algorithms: Steepest Descent (SD) [deterministic], Least Mean Squares (LMS) [stochastic], and Recursive Least Squares (RLS) [deterministic].
5 Steepest Descent. Definition (Steepest Descent (SD)): steepest descent, also known as gradient descent, is an iterative technique for finding a local minimum of a function. Approach: given an arbitrary starting point, the current location (value) is moved in steps proportional to the negative of the gradient at the current point. SD is an old, deterministic method that is the basis for stochastic gradient-based methods. SD is a feedback approach to finding a local minimum of an error performance surface; the error surface must be known a priori. In the MSE case, SD converges to the optimal solution, w_0 = R^(-1)p, without inverting a matrix. Question: why, in the MSE case, does this converge to the global minimum rather than a local minimum?
6 Steepest Descent Example. Consider a well-structured cost function with a single minimum. [Contour plot showing the evolution of the optimization.]
7 Steepest Descent Example. Consider a gradient-ascent example in which there are multiple minima and maxima. [Surface plot showing the multiple minima and maxima; contour plot illustrating that the final result depends on the starting value.]
8 Steepest Descent. To derive the approach, consider the FIR case: {x(n)} are the WSS input samples, {d(n)} is the WSS desired output, and {d̂(n)} is the estimate of the desired signal, given by
d̂(n) = w^H(n)x(n)
where
x(n) = [x(n), x(n-1), ..., x(n-M+1)]^T  [observation vector]
w(n) = [w_0(n), w_1(n), ..., w_{M-1}(n)]^T  [time-indexed filter coefficients]
9 Steepest Descent. Then, similarly to previously considered cases,
e(n) = d(n) - d̂(n) = d(n) - w^H(n)x(n)
and the MSE at time n is
J(n) = E{|e(n)|²} = σ_d² - w^H(n)p - p^H w(n) + w^H(n)Rw(n)
where σ_d² is the variance of the desired signal, p is the cross-correlation between x(n) and d(n), and R is the correlation matrix of x(n). Note: the weight vector and cost function are time-indexed (functions of time).
10 Steepest Descent. When w(n) is set to the (optimal) Wiener solution, w(n) = w_0 = R^(-1)p and J(n) = J_min = σ_d² - p^H w_0. Use the method of steepest descent to iteratively find w_0. The optimal result is achieved since the cost function is a second-order polynomial with a single unique minimum.
11 Steepest Descent Example. Let M = 2. The MSE is a bowl-shaped surface, which is a function of the 2-D weight vector w(n). [Surface plot and contour plot of J(w) over (w_1, w_2), with the minimum at w_0.] Imagine dropping a marble at any point on the bowl-shaped surface. The ball will reach the minimum point by going through the path of steepest descent.
12 Steepest Descent. Observation: set the direction of the filter update as -∇J(n). Since ∇J(n) = -2p + 2Rw(n), the resulting update is
w(n+1) = w(n) + (1/2)µ[-∇J(n)] = w(n) + µ[p - Rw(n)],  n = 0, 1, 2, ...
where w(0) = 0 (or another appropriate value) and µ is the step size. Observation: SD uses feedback, which makes it possible for the system to be unstable. Bounds on the step size guaranteeing stability can be determined with respect to the eigenvalues of R (Widrow, 1970).
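The recursion above can be checked numerically in a few lines. This is a minimal sketch: the matrix R and vector p below are hypothetical values chosen only for illustration, not an example from the lecture.

```python
import numpy as np

# Steepest-descent sketch for the Wiener problem: w(n+1) = w(n) + mu*(p - R w(n)).
# R and p are hypothetical, illustrative values.
R = np.array([[1.0, 0.8],
              [0.8, 1.0]])
p = np.array([0.5, 0.2])

mu = 0.5                          # must satisfy 0 < mu < 2/lambda_max (here 2/1.8)
w = np.zeros(2)                   # w(0) = 0
for n in range(200):
    w = w + mu * (p - R @ w)      # feedback update; no matrix inversion needed

w0 = np.linalg.solve(R, p)        # Wiener solution w0 = R^{-1} p, for comparison
print(np.allclose(w, w0, atol=1e-8))
```

The iterate reaches the Wiener solution without ever forming R^(-1), which is the point of the method.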
13 Convergence Analysis. Define the error vector for the tap weights as c(n) = w(n) - w_0. Then, using p = Rw_0 in the update,
w(n+1) = w(n) + µ[p - Rw(n)] = w(n) + µ[Rw_0 - Rw(n)] = w(n) - µRc(n)
and subtracting w_0 from both sides,
w(n+1) - w_0 = w(n) - w_0 - µRc(n)
c(n+1) = c(n) - µRc(n) = [I - µR]c(n)
14 Convergence Analysis. Using the unitary similarity transform R = QΩQ^H, we have
c(n+1) = [I - µR]c(n) = [I - µQΩQ^H]c(n)
Q^H c(n+1) = [Q^H - µΩQ^H]c(n) = [I - µΩ]Q^H c(n)  (*)
Define the transformed coefficients as v(n) = Q^H c(n) = Q^H(w(n) - w_0). Then (*) becomes
v(n+1) = [I - µΩ]v(n)
15 Convergence Analysis. Consider the initial condition of v(n): v(0) = Q^H(w(0) - w_0) = -Q^H w_0 [if w(0) = 0]. Consider the k-th term (mode) in v(n+1) = [I - µΩ]v(n). Note that [I - µΩ] is diagonal; thus all modes are updated independently. The update for the k-th term can be written as
v_k(n+1) = (1 - µλ_k)v_k(n),  k = 1, 2, ..., M
or, using the recursion,
v_k(n) = (1 - µλ_k)^n v_k(0)
16 Convergence Analysis. Observation: convergence to the optimal solution requires
lim_{n→∞} w(n) = w_0  ⟺  lim_{n→∞} c(n) = lim_{n→∞} [w(n) - w_0] = 0
⟺ lim_{n→∞} v(n) = lim_{n→∞} Q^H c(n) = 0
⟺ lim_{n→∞} v_k(n) = 0,  k = 1, 2, ..., M  (**)
Result: according to the recursion v_k(n) = (1 - µλ_k)^n v_k(0), the limit in (**) holds if and only if |1 - µλ_k| < 1 for all k. Thus, since the eigenvalues are nonnegative, 0 < µλ_max < 2, or 0 < µ < 2/λ_max.
17 Convergence Analysis. Observation: the k-th mode has geometric decay, v_k(n) = (1 - µλ_k)^n v_k(0). The rate of decay is characterized by the time it takes to decay to e^(-1) of the initial value. Let τ_k denote this time for the k-th mode:
v_k(τ_k) = (1 - µλ_k)^{τ_k} v_k(0) = e^(-1) v_k(0)
⟹ τ_k = -1/ln(1 - µλ_k) ≈ 1/(µλ_k)  for µ ≪ 1
Result: the overall rate of decay is bounded by
-1/ln(1 - µλ_max) ≤ τ ≤ -1/ln(1 - µλ_min)
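The small-step approximation τ_k ≈ 1/(µλ_k) is easy to verify numerically. The µ and λ below are illustrative values, not numbers from the lecture.

```python
import math

# Mode time constant: tau_k = -1/ln(1 - mu*lambda_k), approximately 1/(mu*lambda_k)
# for small step sizes (illustrative mu and lambda).
mu, lam = 0.01, 2.0
tau_exact = -1.0 / math.log(1.0 - mu * lam)
tau_approx = 1.0 / (mu * lam)
print(tau_exact, tau_approx)      # ~49.5 vs 50.0

# After about tau_exact iterations a mode has decayed to roughly e^{-1} of its start.
n = round(tau_exact)
decay = (1.0 - mu * lam) ** n
print(decay)                      # close to 1/e
```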
18 Convergence Analysis Example. Consider the typical behavior of a single mode.
19 Error Analysis. Recall that
J(n) = J_min + (w(n) - w_0)^H R (w(n) - w_0)
= J_min + (w(n) - w_0)^H QΩQ^H (w(n) - w_0)
= J_min + v(n)^H Ω v(n)
= J_min + Σ_{k=1}^{M} λ_k |v_k(n)|²  [substitute v_k(n) = (1 - µλ_k)^n v_k(0)]
= J_min + Σ_{k=1}^{M} λ_k (1 - µλ_k)^{2n} |v_k(0)|²
Result: if 0 < µ < 2/λ_max, then lim_{n→∞} J(n) = J_min.
20 Example: Predictor. Consider a two-tap predictor for real-valued input. Analyze the effects of the following cases: varying the eigenvalue spread χ(R) = λ_max/λ_min while keeping µ fixed; and varying µ while keeping the eigenvalue spread χ(R) fixed.
21 Example: Predictor. SD loci plots (with J(n) contours shown) as a function of [v_1(n), v_2(n)] for step size µ = 0.3. Eigenvalue spread χ(R) = 1.22: small eigenvalue spread, modes converge at a similar rate. Eigenvalue spread χ(R) = 3: moderate eigenvalue spread, modes converge at moderately similar rates.
22 Example: Predictor. SD loci plots (with J(n) contours shown) as a function of [v_1(n), v_2(n)] for step size µ = 0.3. Eigenvalue spread χ(R) = 10: large eigenvalue spread, modes converge at different rates. Eigenvalue spread χ(R) = 100: very large eigenvalue spread, modes converge at very different rates; principal-direction convergence is fastest.
23 Example: Predictor. SD loci plots (with J(n) contours shown) as a function of [w_1(n), w_2(n)] for step size µ = 0.3. Eigenvalue spread χ(R) = 1.22: small eigenvalue spread, modes converge at a similar rate. Eigenvalue spread χ(R) = 3: moderate eigenvalue spread, modes converge at moderately similar rates.
24 Example: Predictor. SD loci plots (with J(n) contours shown) as a function of [w_1(n), w_2(n)] for step size µ = 0.3. Eigenvalue spread χ(R) = 10: large eigenvalue spread, modes converge at different rates. Eigenvalue spread χ(R) = 100: very large eigenvalue spread, modes converge at very different rates; principal-direction convergence is fastest.
25 Example: Predictor. Learning curves of the steepest-descent algorithm with step-size parameter µ = 0.3 and varying eigenvalue spread.
26 Example: Predictor. SD loci plots (with J(n) contours shown) as a function of [v_1(n), v_2(n)] with χ(R) = 10 and varying step sizes. Step size µ = 0.3: overdamped, slow convergence. Step size µ = 1: underdamped, fast (erratic) convergence.
27 Example: Predictor. SD loci plots (with J(n) contours shown) as a function of [w_1(n), w_2(n)] with χ(R) = 10 and varying step sizes. Step size µ = 0.3: overdamped, slow convergence. Step size µ = 1: underdamped, fast (erratic) convergence.
28 Example: Predictor. Consider a system identification problem: the input {x(n)} drives both the adaptive filter w(n) and the unknown system; the filter output d̂(n) is subtracted from the system output d(n) to form the error e(n). [Block diagram of the system-identification setup.] Suppose M = 2, with R_x and p given.
29 Example: Predictor. From eigen-analysis we have λ_1 = 1.8 and λ_2 = 0.2, so stability requires µ < 2/λ_max = 2/1.8 ≈ 1.11. Also,
q_1 = (1/√2)[1, 1]^T,  q_2 = (1/√2)[1, -1]^T,  Q = (1/√2)[[1, 1], [1, -1]]
and w_0 = R^(-1)p.
30 Example: Predictor. Thus v(n) = Q^H[w(n) - w_0]. Noting that v(0) = -Q^H w_0, the modes evolve as
v_1(n) = (1 - 1.8µ)^n (0.51)
v_2(n) = (1 - 0.2µ)^n (1.06)
31 Example: Predictor. SD convergence properties for two µ values. Step size µ = 0.5: overdamped, slow convergence. Step size µ = 1: underdamped, fast (erratic) convergence.
32 Least Mean Squares (LMS). Definition (Least Mean Squares (LMS) Algorithm). Motivation: the error performance surface used by the SD method is not always known a priori. Solution: use estimated values. We will use the following instantaneous estimates:
R̂(n) = x(n)x^H(n),  p̂(n) = x(n)d*(n)
Result: the estimates are random variables, and thus this leads to a stochastic optimization. Historical note: invented in 1960 by Stanford University professor Bernard Widrow and his first Ph.D. student, Ted Hoff.
33 Least Mean Squares (LMS). Recall the SD update w(n+1) = w(n) + (1/2)µ[-∇J(n)], where the gradient of the error surface at w(n) was shown to be ∇J(n) = -2p + 2Rw(n). Using the instantaneous estimates,
∇̂J(n) = -2x(n)d*(n) + 2x(n)x^H(n)w(n)
= -2x(n)[d*(n) - x^H(n)w(n)]
= -2x(n)[d*(n) - d̂*(n)]
= -2x(n)e*(n)
where e*(n) is the complex conjugate of the estimation error.
34 Least Mean Squares (LMS). Utilizing ∇̂J(n) = -2x(n)e*(n) in the update,
w(n+1) = w(n) + (1/2)µ[-∇̂J(n)] = w(n) + µx(n)e*(n)  [LMS update]
The LMS algorithm belongs to the family of stochastic gradient algorithms. The update is extremely simple. Although the instantaneous estimates may have large variance, the LMS algorithm is recursive and effectively averages these estimates. The simplicity and good performance of the LMS algorithm make it the benchmark against which other optimization algorithms are judged.
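The LMS update can be sketched as a short simulation. This is a minimal system-identification example: the "unknown" FIR system h, the noise level, and the step size are all hypothetical choices for illustration.

```python
import numpy as np

# LMS sketch: identify a short FIR system from noisy observations.
rng = np.random.default_rng(0)
h = np.array([0.8, -0.4, 0.2])            # hypothetical unknown system
M, N = 3, 20000
x = rng.standard_normal(N)
d = np.convolve(x, h)[:N] + 0.01 * rng.standard_normal(N)

mu = 0.01
w = np.zeros(M)
for n in range(M, N):
    xn = x[n:n-M:-1]                      # x(n), x(n-1), ..., x(n-M+1)
    e = d[n] - w @ xn                     # estimation error e(n)
    w = w + mu * xn * e                   # LMS update: w(n+1) = w(n) + mu x(n) e(n)

print(np.allclose(w, h, atol=0.05))
```

Despite the noisy instantaneous gradient, the recursion averages the estimates and the weights settle near the true system.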
35 Convergence Analysis. Independence Theorem: the following conditions hold: (1) the vectors x(1), x(2), ..., x(n) are statistically independent; (2) x(n) is independent of d(1), d(2), ..., d(n-1); (3) d(n) is statistically dependent on x(n), but independent of d(1), d(2), ..., d(n-1); (4) x(n) and d(n) are mutually Gaussian. The independence theorem is invoked in the LMS algorithm analysis. It is justified in some cases, e.g., beamforming, where we receive independent vector observations. In other cases it is not well justified, but it allows the analysis to proceed (i.e., when all else fails, invoke simplifying assumptions).
36 Convergence Analysis. We will invoke the independence theorem to show that w(n) converges to the optimal solution in the mean: lim_{n→∞} E{w(n)} = w_0. To prove this, evaluate the update:
w(n+1) = w(n) + µx(n)e*(n)
w(n+1) - w_0 = w(n) - w_0 + µx(n)e*(n)
c(n+1) = c(n) + µx(n)(d*(n) - x^H(n)w(n))
= c(n) + µx(n)d*(n) - µx(n)x^H(n)[w(n) - w_0 + w_0]
= c(n) + µx(n)d*(n) - µx(n)x^H(n)c(n) - µx(n)x^H(n)w_0
= [I - µx(n)x^H(n)]c(n) + µx(n)[d*(n) - x^H(n)w_0]
= [I - µx(n)x^H(n)]c(n) + µx(n)e_0*(n)
37 Convergence Analysis. Take the expectation of the update, noting that w(n) is based on past inputs and desired values, so w(n), and consequently c(n), are independent of x(n) (independence theorem). Thus
c(n+1) = [I - µx(n)x^H(n)]c(n) + µx(n)e_0*(n)
E{c(n+1)} = (I - µR)E{c(n)} + µE{x(n)e_0*(n)}  [the last expectation is 0; why?]
= (I - µR)E{c(n)}
Using arguments similar to the SD case, lim_{n→∞} E{c(n)} = 0 if 0 < µ < 2/λ_max, or equivalently, lim_{n→∞} E{w(n)} = w_0 if 0 < µ < 2/λ_max.
38 Convergence Analysis. Noting that Σ_{i=1}^{M} λ_i = trace[R], we have λ_max ≤ trace[R] = M r(0) = M σ_x². Thus a more conservative bound (and one easier to determine) is
0 < µ < 2/(M σ_x²)
Convergence in the mean, lim_{n→∞} E{w(n)} = w_0, is a weak condition that says nothing about the variance, which may even grow. A stronger condition is convergence in the mean square, which says lim_{n→∞} E{|c(n)|²} = constant.
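The relation λ_max ≤ trace[R], and the fact that the trace-based bound is the more conservative of the two, can be checked numerically. The correlation matrix below is an illustrative AR(1)-style Toeplitz matrix, not one from the lecture.

```python
import numpy as np

# Compare the exact stability bound 2/lambda_max with the conservative bound
# 2/trace(R) = 2/(M * sigma_x^2) for an illustrative correlation matrix.
R = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.5],
              [0.25, 0.5, 1.0]])
lam_max = np.linalg.eigvalsh(R).max()
bound_exact = 2.0 / lam_max
bound_conservative = 2.0 / np.trace(R)    # trace(R) = M * r(0) >= lambda_max

print(bound_conservative, bound_exact)    # conservative bound is the smaller one
```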
39 Convergence Analysis. Proving convergence in the mean square is equivalent to showing that lim_{n→∞} J(n) = lim_{n→∞} E{|e(n)|²} = constant. To evaluate the limit, write e(n) as
e(n) = d(n) - d̂(n) = d(n) - w^H(n)x(n)
= d(n) - w_0^H x(n) - [w^H(n) - w_0^H]x(n)
= e_0(n) - c^H(n)x(n)
Thus
J(n) = E{|e(n)|²} = E{(e_0(n) - c^H(n)x(n))(e_0*(n) - x^H(n)c(n))}
= J_min + E{c^H(n)x(n)x^H(n)c(n)} = J_min + J_ex(n)  [cross terms are 0; why?]
where J_ex(n) denotes the excess MSE.
40 Convergence Analysis. Since J_ex(n) is a scalar,
J_ex(n) = E{c^H(n)x(n)x^H(n)c(n)}
= E{trace[c^H(n)x(n)x^H(n)c(n)]}
= E{trace[x(n)x^H(n)c(n)c^H(n)]}
= trace[E{x(n)x^H(n)c(n)c^H(n)}]
Invoking the independence theorem,
J_ex(n) = trace[E{x(n)x^H(n)}E{c(n)c^H(n)}] = trace[RK(n)]
where K(n) = E{c(n)c^H(n)}.
41 Convergence Analysis. Thus J(n) = J_min + J_ex(n) = J_min + trace[RK(n)]. Recall Q^H RQ = Ω, or R = QΩQ^H. Set S(n) = Q^H K(n)Q, where S(n) need not be diagonal. Then
K(n) = QQ^H K(n)QQ^H  [since Q^(-1) = Q^H]
= QS(n)Q^H
42 Convergence Analysis. Utilizing R = QΩQ^H and K(n) = QS(n)Q^H in the excess-error expression,
J_ex(n) = trace[RK(n)] = trace[QΩQ^H QS(n)Q^H] = trace[QΩS(n)Q^H] = trace[Q^H QΩS(n)] = trace[ΩS(n)]
Since Ω is diagonal,
J_ex(n) = trace[ΩS(n)] = Σ_{i=1}^{M} λ_i s_i(n)
where s_1(n), s_2(n), ..., s_M(n) are the diagonal elements of S(n).
43 Convergence Analysis. The previously derived recursion E{c(n+1)} = (I - µR)E{c(n)} can be modified to yield a recursion on S(n):
S(n+1) = (I - µΩ)S(n)(I - µΩ) + µ²J_min Ω
which, for the diagonal elements, is
s_i(n+1) = (1 - µλ_i)² s_i(n) + µ²J_min λ_i,  i = 1, 2, ..., M
Suppose J_ex(n) converges; then s_i(n+1) = s_i(n), and
s_i(n) = (1 - µλ_i)² s_i(n) + µ²J_min λ_i
⟹ s_i(n) = µ²J_min λ_i / [1 - (1 - µλ_i)²] = µ²J_min λ_i / (2µλ_i - µ²λ_i²) = µJ_min / (2 - µλ_i),  i = 1, 2, ..., M
44 Convergence Analysis. Consider again J_ex(n) = trace[ΩS(n)] = Σ_{i=1}^{M} λ_i s_i(n). Taking the limit and utilizing s_i(n) = µJ_min/(2 - µλ_i),
lim_{n→∞} J_ex(n) = J_min Σ_{i=1}^{M} µλ_i/(2 - µλ_i)
The LMS misadjustment is defined as
MA = lim_{n→∞} J_ex(n)/J_min = Σ_{i=1}^{M} µλ_i/(2 - µλ_i)
Note: a misadjustment of 10% or less is generally considered acceptable.
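The misadjustment formula is straightforward to evaluate. The eigenvalues and step size below are illustrative; with a small enough µ the excess MSE stays under the roughly 10% guideline.

```python
import numpy as np

# Misadjustment MA = sum_i mu*lambda_i / (2 - mu*lambda_i) for illustrative
# eigenvalues of R and a small step size.
lams = np.array([1.8, 0.2])
mu = 0.05
ma = np.sum(mu * lams / (2.0 - mu * lams))
print(ma)   # roughly 5% excess MSE relative to J_min
```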
45 Example: First-Order Predictor. This is a one-tap predictor, x̂(n) = w(n)x(n-1). Take the underlying process to be a real first-order AR process, x(n) = -a x(n-1) + v(n). The weight update is
w(n+1) = w(n) + µx(n-1)e(n)  [LMS update for observation x(n-1)]
= w(n) + µx(n-1)[x(n) - w(n)x(n-1)]
46 Example: First-Order Predictor. Since x(n) = -a x(n-1) + v(n) [AR model] and x̂(n) = w(n)x(n-1) [one-tap predictor], the optimal weight is w_0 = -a. Note that E{x(n-1)e_0(n)} = E{x(n-1)v(n)} = 0, which proves the optimality. Set µ = 0.05 and consider two cases of a and σ_x².
47 Example: First-Order Predictor. Figure: transient behavior of the adaptive first-order predictor weight ŵ(n) for µ = 0.05.
48 Example: First-Order Predictor. Figure: transient behavior of the adaptive first-order predictor squared error for µ = 0.05.
49 Example: First-Order Predictor. Figure: mean-squared error learning curves for an adaptive first-order predictor with varying step-size parameter µ.
50 Example: First-Order Predictor. Consider the expected trajectory of w(n). Recall
w(n+1) = w(n) + µx(n-1)e(n)
= w(n) + µx(n-1)[x(n) - w(n)x(n-1)]
= [1 - µx(n-1)x(n-1)]w(n) + µx(n-1)x(n)
In this example, x(n) = -a x(n-1) + v(n). Substituting in:
w(n+1) = [1 - µx(n-1)x(n-1)]w(n) + µx(n-1)[-a x(n-1) + v(n)]
= [1 - µx(n-1)x(n-1)]w(n) - µa x(n-1)x(n-1) + µx(n-1)v(n)
Taking the expectation and invoking the independence theorem,
E{w(n+1)} = (1 - µσ_x²)E{w(n)} - µσ_x² a
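The mean recursion can be compared against an ensemble average of actual LMS runs. This is a minimal sketch: a, µ, and the run counts are hypothetical choices, and the theory curve relies on the independence approximation, so the two only agree approximately.

```python
import numpy as np

# One-tap LMS predictor for x(n) = -a x(n-1) + v(n); compare the ensemble-average
# weight with the recursion E{w(n+1)} = (1 - mu*sx2) E{w(n)} - mu*sx2*a.
rng = np.random.default_rng(1)
a, mu, N, runs = 0.5, 0.05, 400, 200
sv2 = 1.0 - a**2                       # chosen so that sigma_x^2 = 1
sx2 = sv2 / (1.0 - a**2)

w_avg = np.zeros(N + 1)                # ensemble average of the weight trajectory
for _ in range(runs):
    x = np.zeros(N + 1)
    w = 0.0
    for n in range(1, N + 1):
        x[n] = -a * x[n-1] + np.sqrt(sv2) * rng.standard_normal()
        e = x[n] - w * x[n-1]
        w = w + mu * x[n-1] * e        # LMS update
        w_avg[n] += w / runs

w_th = np.zeros(N + 1)                 # theoretical mean trajectory
for n in range(N):
    w_th[n+1] = (1 - mu * sx2) * w_th[n] - mu * sx2 * a

print(w_avg[-1], w_th[-1])             # both settle near w_0 = -a
```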
51 Example: First-Order Predictor. Figure: comparison of experimental results with theory, based on ŵ(n).
52 Example: First-Order Predictor. Next, derive a theoretical expression for J(n). The initial value of J(n) is
J(0) = E{(x(0) - w(0)x(-1))²} = E{(x(0))²} = σ_x²
and the final value is
J(∞) = J_min + J_ex = E{(v(n))²} + J_ex = σ_v² + J_min µλ_1/(2 - µλ_1)
Note that λ_1 = σ_x². Thus,
J(∞) = σ_v² + σ_v² µσ_x²/(2 - µσ_x²) = σ_v² (1 + µσ_x²/(2 - µσ_x²))
53 Example: First-Order Predictor. If µ is small,
J(∞) = σ_v²(1 + µσ_x²/(2 - µσ_x²)) ≈ σ_v²(1 + µσ_x²/2)
Putting all the components together:
J(n) = [σ_x² - σ_v²(1 + µσ_x²/2)](1 - µσ_x²)^{2n} + σ_v²(1 + µσ_x²/2)
where the bracketed term is J(0) - J(∞), the geometric factor decays to 0, and the final term is J(∞). Also, the time constant is
τ = -1/[2 ln(1 - µσ_x²)] ≈ 1/(2µσ_x²)
54 Example: First-Order Predictor. Figure: comparison of experimental results with theory for the adaptive predictor, based on the mean-square error for µ = 0.05.
55 Example: Adaptive Equalization. Objective: pass a known signal through an unknown channel to invert the effects that the channel and noise have on the signal.
56 Example: Adaptive Equalization. The signal is a Bernoulli sequence: x_n = +1 with probability 1/2, and -1 with probability 1/2. The additive noise is N(0, 0.001). The channel has a raised-cosine response:
h_n = (1/2)[1 + cos((2π/W)(n - 2))],  n = 1, 2, 3;  0 otherwise
W controls the eigenvalue spread χ(R). h_n is symmetric about n = 2 and thus introduces a delay of 2. We will use an M = 11 tap filter, which is symmetric about n = 5 and introduces a delay of 5. Thus an overall delay of δ = 2 + 5 = 7 is added to the system.
57 Example: Adaptive Equalization. [Channel response and filter response.] Figure: (a) impulse response of the channel; (b) impulse response of the optimum transversal equalizer.
58 Example: Adaptive Equalization. Consider three W values. Note the step size is bounded by the W = 3.5 case:
µ ≤ 2/(M r(0)) = 2/(11 × 1.3022) = 0.14
Choose µ = 0.075 in all cases.
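The equalization experiment can be sketched end to end. This is a minimal simulation: W = 3.1 is an assumed channel setting (the specific W values behind the lecture figures are not all given here), and the sequence length is an illustrative choice.

```python
import numpy as np

# LMS equalizer following the slide setup: Bernoulli +/-1 input, raised-cosine
# channel h(n) = 0.5*(1 + cos(2*pi/W*(n-2))) for n = 1,2,3, M = 11 taps, delay 7.
rng = np.random.default_rng(2)
W = 3.1                                   # assumed channel parameter (controls chi(R))
h = np.array([0.5 * (1 + np.cos(2*np.pi/W * (n - 2))) for n in (1, 2, 3)])

N, M, delay, mu = 5000, 11, 7, 0.075
x = rng.choice([-1.0, 1.0], size=N)       # Bernoulli sequence
u = np.convolve(x, h)[:N] + np.sqrt(0.001) * rng.standard_normal(N)

w = np.zeros(M)
err = np.zeros(N)
for n in range(M, N):
    un = u[n:n-M:-1]
    e = x[n - delay] - w @ un             # desired signal is the delayed input
    w = w + mu * un * e                   # LMS update
    err[n] = e * e

# the squared error should drop well below its initial level
print(err[M:M+500].mean(), err[-500:].mean())
```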
59 Example: Adaptive Equalization. Figure: learning curves of the LMS algorithm for an adaptive equalizer with number of taps M = 11, step-size parameter µ = 0.075, and varying eigenvalue spread χ(R).
60 Example: Adaptive Equalization. Ensemble-average impulse response of the adaptive equalizer (after 1000 iterations) for each of four different eigenvalue spreads.
61 Example: Adaptive Equalization. Figure: learning curves of the LMS algorithm for an adaptive equalizer with the number of taps M = 11, fixed eigenvalue spread, and varying step-size parameter µ.
62 Example: LMS Directionality. The speed of convergence of the LMS algorithm is faster in certain directions in the weight space. If the convergence is in the appropriate direction, the convergence can be accelerated by increased eigenvalue spread. To investigate this phenomenon, consider the deterministic signal x(n) = A_1 cos(ω_1 n) + A_2 cos(ω_2 n). Even though it is deterministic, a correlation matrix can be determined:
R = (1/2) [[A_1² + A_2², A_1² cos(ω_1) + A_2² cos(ω_2)], [A_1² cos(ω_1) + A_2² cos(ω_2), A_1² + A_2²]]
63 Example: LMS Directionality. Determining the eigenvalues and eigenvectors yields
λ_1 = (1/2)[A_1²(1 + cos(ω_1)) + A_2²(1 + cos(ω_2))]
λ_2 = (1/2)[A_1²(1 - cos(ω_1)) + A_2²(1 - cos(ω_2))]
and q_1 = [1, 1]^T, q_2 = [1, -1]^T.
Case 1: A_1 = 1, A_2 = 0.5, ω_1 = 1.2, ω_2 = 0.1, giving x_a(n) = cos(1.2n) + 0.5cos(0.1n) and χ(R) = 2.9.
Case 2: A_1 = 1, A_2 = 0.5, ω_1 = 0.6, ω_2 = 0.23, giving x_b(n) = cos(0.6n) + 0.5cos(0.23n) and χ(R) = 12.9.
64 Example: LMS Directionality. Since p is undefined, set p = λ_i q_i. Then, since p = Rw_0, we see (two cases):
p = λ_1 q_1 ⟹ Rw_0 = λ_1 q_1 ⟹ w_0 = q_1 = [1, 1]^T
p = λ_2 q_2 ⟹ Rw_0 = λ_2 q_2 ⟹ w_0 = q_2 = [1, -1]^T
Utilize 200 iterations of the algorithm. Consider the minimum eigenfilter first, w_0 = q_2 = [1, -1]^T, and the maximum eigenfilter second, w_0 = q_1 = [1, 1]^T.
65 Example: LMS Directionality. Convergence of the LMS algorithm, for a deterministic sinusoidal process, along the slow eigenvector (i.e., the minimum eigenfilter): for input x_a(n) (χ(R) = 2.9) and for input x_b(n) (χ(R) = 12.9).
66 Example: LMS Directionality. Convergence of the LMS algorithm, for a deterministic sinusoidal process, along the fast eigenvector (i.e., the maximum eigenfilter): for input x_a(n) (χ(R) = 2.9) and for input x_b(n) (χ(R) = 12.9).
67 Normalized LMS Algorithm. Observation: the LMS correction is proportional to x(n): w(n+1) = w(n) + µx(n)e*(n). If x(n) is large, the LMS update suffers from gradient noise amplification. The normalized LMS algorithm seeks to avoid gradient noise amplification. The step size is made time-varying, µ(n), and optimized to minimize the next-step error:
w(n+1) = w(n) + (1/2)µ(n)[-∇J(n)] = w(n) + µ(n)[p - Rw(n)]
Choose µ(n) such that w(n+1) produces the minimum MSE, J(n+1) = E{|e(n+1)|²}.
68 Normalized LMS Algorithm. Let ∇(n) ≡ ∇J(n), and note e(n+1) = d(n+1) - w^H(n+1)x(n+1). Objective: choose µ(n) such that it minimizes J(n+1). The optimal step size, µ_0(n), will be a function of R and ∇(n); use instantaneous estimates of these values. To determine µ_0(n), expand J(n+1):
J(n+1) = E{e(n+1)e*(n+1)}
= E{(d(n+1) - w^H(n+1)x(n+1))(d*(n+1) - x^H(n+1)w(n+1))}
= σ_d² - w^H(n+1)p - p^H w(n+1) + w^H(n+1)Rw(n+1)
69 Normalized LMS Algorithm. Now use the fact that w(n+1) = w(n) - (1/2)µ(n)∇(n):
J(n+1) = σ_d² - [w(n) - (1/2)µ(n)∇(n)]^H p - p^H[w(n) - (1/2)µ(n)∇(n)] + [w(n) - (1/2)µ(n)∇(n)]^H R [w(n) - (1/2)µ(n)∇(n)]
where the last (quadratic) term expands to
w^H(n)Rw(n) - (1/2)µ(n)w^H(n)R∇(n) - (1/2)µ(n)∇^H(n)Rw(n) + (1/4)µ²(n)∇^H(n)R∇(n)
70 Normalized LMS Algorithm. Differentiating with respect to µ(n),
∂J(n+1)/∂µ(n) = (1/2)∇^H(n)p + (1/2)p^H∇(n) - (1/2)w^H(n)R∇(n) - (1/2)∇^H(n)Rw(n) + (1/2)µ(n)∇^H(n)R∇(n)  (*)
71 Normalized LMS Algorithm. Setting (*) equal to 0,
µ_0(n)∇^H(n)R∇(n) = w^H(n)R∇(n) - p^H∇(n) + ∇^H(n)Rw(n) - ∇^H(n)p
µ_0(n) = ([w^H(n)R - p^H]∇(n) + ∇^H(n)[Rw(n) - p]) / (∇^H(n)R∇(n))
= ((1/2)∇^H(n)∇(n) + (1/2)∇^H(n)∇(n)) / (∇^H(n)R∇(n))
= ∇^H(n)∇(n) / (∇^H(n)R∇(n))
72 Normalized LMS Algorithm. Using the instantaneous estimates R̂ = x(n)x^H(n) and p̂ = x(n)d*(n),
∇̂(n) = 2[R̂w(n) - p̂] = 2[x(n)x^H(n)w(n) - x(n)d*(n)] = 2x(n)[d̂*(n) - d*(n)] = -2x(n)e*(n)
Thus
µ_0(n) = ∇̂^H(n)∇̂(n) / (∇̂^H(n)R̂∇̂(n))
= 4x^H(n)e(n)x(n)e*(n) / (2x^H(n)e(n) x(n)x^H(n) 2x(n)e*(n))
= |e(n)|² x^H(n)x(n) / (|e(n)|² (x^H(n)x(n))²)
= 1/(x^H(n)x(n)) = 1/‖x(n)‖²
73 Normalized LMS Algorithm. Result: the NLMS update is
w(n+1) = w(n) + (µ/‖x(n)‖²) x(n)e*(n)
where the time-varying step is µ(n) = µ/‖x(n)‖², and the constant µ is introduced to scale the update. To avoid problems when ‖x(n)‖² ≈ 0, we add an offset:
w(n+1) = w(n) + (µ/(a + ‖x(n)‖²)) x(n)e*(n),  where a > 0
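The NLMS update can be sketched in the same style as plain LMS. This is a minimal, hypothetical system-identification example: the FIR system h, the noise level, and the scaling constant µ = 0.5 are illustrative choices.

```python
import numpy as np

# NLMS sketch: w(n+1) = w(n) + mu/(a + ||x(n)||^2) * x(n) * e(n), with 0 < mu < 2.
rng = np.random.default_rng(3)
h = np.array([0.5, -0.3, 0.1])            # hypothetical unknown system
M, N = 3, 10000
x = rng.standard_normal(N)
d = np.convolve(x, h)[:N] + 0.01 * rng.standard_normal(N)

mu, a = 0.5, 1e-6                         # 0 < mu < 2; a > 0 guards small ||x(n)||^2
w = np.zeros(M)
for n in range(M, N):
    xn = x[n:n-M:-1]
    e = d[n] - w @ xn
    w = w + (mu / (a + xn @ xn)) * xn * e # normalized update

print(np.allclose(w, h, atol=0.05))
```

The update is insensitive to the scale of the input: dividing by ‖x(n)‖² removes the gradient noise amplification that a large x(n) would otherwise cause.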
74 NLMS Convergence. Objective: analyze the NLMS convergence. Substituting e(n) = d(n) - w^H(n)x(n) into w(n+1) = w(n) + (µ/‖x(n)‖²)x(n)e*(n):
w(n+1) = w(n) + (µ/‖x(n)‖²)x(n)[d*(n) - x^H(n)w(n)]
= [I - µ x(n)x^H(n)/‖x(n)‖²]w(n) + µ x(n)d*(n)/‖x(n)‖²
75 NLMS Convergence. Objective: compare the NLMS and LMS algorithms.
NLMS: w(n+1) = [I - µ x(n)x^H(n)/‖x(n)‖²]w(n) + µ x(n)d*(n)/‖x(n)‖²
LMS: w(n+1) = [I - µx(n)x^H(n)]w(n) + µx(n)d*(n)
By observation, we see the following corresponding terms: µ ↔ µ; x(n)x^H(n) ↔ x(n)x^H(n)/‖x(n)‖²; x(n)d*(n) ↔ x(n)d*(n)/‖x(n)‖².
76 NLMS Convergence. In the LMS case, 0 < µ < 2/trace[E{x(n)x^H(n)}] = 2/trace[R] guarantees stability. By analogy, 0 < µ < 2/trace[E{x(n)x^H(n)/‖x(n)‖²}] guarantees stability of the NLMS.
77 NLMS Convergence. To analyze the bound, make the following approximation:
E{x(n)x^H(n)/‖x(n)‖²} ≈ E{x(n)x^H(n)}/E{‖x(n)‖²}
Then
trace[E{x(n)x^H(n)/‖x(n)‖²}] ≈ trace[E{x(n)x^H(n)}]/E{‖x(n)‖²} = E{trace[x(n)x^H(n)]}/E{‖x(n)‖²} = E{trace[x^H(n)x(n)]}/E{‖x(n)‖²} = E{‖x(n)‖²}/E{‖x(n)‖²} = 1
78 NLMS Convergence. Thus 0 < µ < 2/trace[E{x(n)x^H(n)/‖x(n)‖²}] ≈ 2. Final result: the NLMS update w(n+1) = w(n) + (µ/‖x(n)‖²)x(n)e*(n) will converge if 0 < µ < 2. Note: the NLMS has a simpler convergence criterion than the LMS, and the NLMS generally converges faster than the LMS algorithm.
79 ELEG-636: Statistical Signal Processing
Variants of the LMS Algorithm
Gonzalo R. Arce
Department of Electrical and Computer Engineering
University of Delaware, Newark, DE
Fall 2013
80 Standard LMS Algorithm
FIR filters:
y(n) = w_0(n)u(n) + w_1(n)u(n−1) + ... + w_{M−1}(n)u(n−M+1)
     = Σ_{k=0}^{M−1} w_k(n) u(n−k) = w(n)^T u(n),   n = 0, 1, ...
Error between the filter output y(n) and a desired signal d(n):
e(n) = d(n) − y(n) = d(n) − w(n)^T u(n).
Update the filter parameters according to
w(n+1) = w(n) + µ u(n) e(n).
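The standard LMS recursion above can be sketched as follows (Python/NumPy used for illustration; the system-identification test setup, filter length, and step size are illustrative choices, not from the slides):

```python
import numpy as np

def lms(u, d, M, mu):
    """Standard LMS: y(n) = w(n)^T u(n), e(n) = d(n) - y(n), w <- w + mu*u(n)*e(n)."""
    w = np.zeros(M)
    e = np.zeros(len(u))
    for n in range(M - 1, len(u)):
        un = u[n - M + 1:n + 1][::-1]   # regressor [u(n), u(n-1), ..., u(n-M+1)]
        e[n] = d[n] - w @ un
        w = w + mu * un * e[n]
    return w, e
```

For example, driving an unknown FIR system with white noise and feeding its output as d(n) makes w converge to the system's coefficients.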
81 1. Normalized LMS Algorithm
Modify at time n the parameter vector from w(n) to w(n+1) so as to satisfy the constraint
d(n) = Σ_{i=0}^{M−1} w_i(n+1) u(n−i),
from which the Lagrange multiplier λ will result. In order to add an extra degree of freedom to the adaptation strategy, one constant, µ, controlling the step size will be introduced:
w_j(n+1) = w_j(n) + (µ / Σ_{i=0}^{M−1} u(n−i)²) e(n) u(n−j) = w_j(n) + (µ/‖u(n)‖²) e(n) u(n−j).
82 To overcome the possible numerical difficulties when ‖u(n)‖ is close to zero, a constant a > 0 is used:
w_j(n+1) = w_j(n) + (µ / (a + ‖u(n)‖²)) e(n) u(n−j)
This is the update used in the Normalized LMS algorithm.
The Normalized LMS algorithm converges if 0 < µ < 2.
83 Comparison of LMS and NLMS
84 Comparison of LMS and NLMS
The LMS was run with three different step-sizes: µ = [0.075; 0.025; ].
The NLMS was run with step-sizes: µ = [1.0; 0.5; 0.1].
The larger the step-size, the faster the convergence.
The smaller the step-size, the better the steady-state square error.
LMS with µ = and NLMS with µ = 0.1 achieved a similar average steady-state square error; however, NLMS was faster.
LMS with µ = and NLMS with µ = 1.0 had a similar convergence speed; however, NLMS achieved a lower steady-state average square error.
Conclusion: NLMS offers better trade-offs than LMS.
The computational complexity of NLMS is slightly higher than that of LMS.
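An experiment of the kind behind this comparison can be sketched as below (Python/NumPy; the unknown system, noise level, and step sizes here are illustrative stand-ins, since some of the slide's step-size values did not survive extraction):

```python
import numpy as np

def run_filter(u, d, M, step, normalized, a=1e-8):
    """Run LMS (normalized=False) or NLMS (normalized=True) and record e(n)^2."""
    w = np.zeros(M)
    err = []
    for n in range(M - 1, len(u)):
        un = u[n - M + 1:n + 1][::-1]
        e = d[n] - w @ un
        g = step / (a + un @ un) if normalized else step
        w = w + g * un * e
        err.append(e ** 2)
    return w, np.array(err)

rng = np.random.default_rng(2)
u = rng.standard_normal(4000)
w_true = np.array([1.0, 0.5, -0.25])
d = np.convolve(u, w_true)[:len(u)] + 0.01 * rng.standard_normal(len(u))
w_lms, e_lms = run_filter(u, d, 3, step=0.025, normalized=False)
w_nlms, e_nlms = run_filter(u, d, 3, step=0.5, normalized=True)
print("LMS final MSE:", e_lms[-500:].mean())
print("NLMS final MSE:", e_nlms[-500:].mean())
```

Plotting `e_lms` and `e_nlms` (smoothed) over several step sizes reproduces the qualitative trade-offs listed above.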
85 2. LMS Algorithm with Time-Variable Adaptation Step
Heuristics: combine the benefits of two different situations:
The convergence time constant is small for large µ.
The mean-square error in steady state is low for small µ.
The initial adaptation step µ is kept large, then it is monotonically reduced:
µ(n) = 1/(n + c).
Disadvantage for non-stationary data: for large values of n, the algorithm will not react to changes in the optimum solution.
Variable Step algorithm:
w(n+1) = w(n) + M(n) u(n) e(n)
where M(n) = diag(µ_0(n), µ_1(n), ..., µ_{M−1}(n)).
Each filter parameter w_i(n) is updated using an independent adaptation step µ_i(n).
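A minimal sketch of the decreasing-step variant (Python/NumPy; for simplicity a single step µ(n) = 1/(n + c) is shared by all coefficients, whereas the slide's diagonal M(n) allows an independent µ_i(n) per coefficient):

```python
import numpy as np

def variable_step_lms(u, d, M, c=50.0):
    """LMS with monotonically decreasing step mu(n) = 1/(n + c)."""
    w = np.zeros(M)
    for n in range(M - 1, len(u)):
        un = u[n - M + 1:n + 1][::-1]   # regressor [u(n), ..., u(n-M+1)]
        e = d[n] - w @ un
        mu_n = 1.0 / (n + c)            # large step early, small step late
        w = w + mu_n * un * e
    return w
```

The constant c trades off early speed against how quickly the step shrinks; as the slide notes, once µ(n) is tiny the filter can no longer track a drifting optimum.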
86 Comparison of LMS and variable-step LMS
µ(n) = 1/(0.01 n + c) with c = [10; 20; 50]
87 3. Sign Algorithms
In high-speed communication time is critical, thus a faster adaptation process is needed.
sgn(a) = { 1, a > 0; 0, a = 0; −1, a < 0 }
The Sign algorithm (other names: pilot LMS, or Sign Error):
w(n+1) = w(n) + µ u(n) sgn(e(n)).
The Clipped LMS (or Signed Regressor):
w(n+1) = w(n) + µ sgn(u(n)) e(n).
The Zero-Forcing LMS (or Sign-Sign):
w(n+1) = w(n) + µ sgn(u(n)) sgn(e(n)).
The Sign algorithm can be derived as an LMS algorithm for minimizing the mean absolute error (MAE) criterion
J(w) = E[|e(n)|] = E[|d(n) − w^T u(n)|].
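The three variants differ only in where sgn(·) is applied, which a compact sketch makes explicit (Python/NumPy; the `variant` keyword and its labels are illustrative names for the three updates above):

```python
import numpy as np

def sign_lms_step(w, un, dn, mu, variant="sign-error"):
    """One update of the sign LMS variants; np.sign matches the slide's sgn()."""
    e = dn - w @ un
    if variant == "sign-error":          # pilot LMS: w + mu * u(n) * sgn(e(n))
        w = w + mu * un * np.sign(e)
    elif variant == "signed-regressor":  # clipped LMS: w + mu * sgn(u(n)) * e(n)
        w = w + mu * np.sign(un) * e
    elif variant == "sign-sign":         # zero-forcing LMS: w + mu * sgn(u(n)) * sgn(e(n))
        w = w + mu * np.sign(un) * np.sign(e)
    return w, e
```

Since each update moves the coefficients by at most µ per component, the sign variants hover in a µ-sized neighborhood of the solution rather than converging exactly.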
88 Properties of Sign Algorithms
Fast computation: if µ is constrained to the form µ = 2^(−m), only shifting and addition operations are required.
Drawback: the update mechanism is degraded, compared to the LMS algorithm, by the crude quantization of the gradient estimates:
The steady-state error will increase.
The convergence rate decreases.
The fastest of them, Sign-Sign, is used in the CCITT ADPCM standard for the bps system.
89 Comparison of LMS and Sign LMS
The Sign LMS algorithm should be operated at smaller step-sizes to obtain behavior similar to that of the standard LMS algorithm.
90 4. Linear Smoothing of LMS Gradient Estimates
Lowpass filtering the noisy gradient: rename the noisy gradient
g(n) = ∇̂_w J = −2 u(n) e(n),   g_i(n) = −2 e(n) u(n−i).
Passing the signals g_i(n) through lowpass filters will prevent the large fluctuations of direction during the adaptation process:
b_i(n) = LPF(g_i(n)).
The updating process will use the filtered noisy gradient:
w(n+1) = w(n) − µ b(n).
The following versions are well known:
Averaged LMS algorithm: the LPF is the filter with impulse response h(m) = 1/N, m = 0, 1, ..., N−1, giving
w(n+1) = w(n) + (µ/N) Σ_{j=n−N+1}^{n} e(j) u(j).
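One straightforward reading of the averaged update can be sketched as below (Python/NumPy; note this sketch stores each e(j)u(j) as computed with the weights of time j, i.e. the past gradient terms are "stale", which is how the running sum is realized in practice):

```python
import numpy as np

def averaged_lms(u, d, M, mu, N):
    """Averaged LMS: w(n+1) = w(n) + (mu/N) * sum_{j=n-N+1}^{n} e(j) u(j)."""
    w = np.zeros(M)
    hist = []                            # last N gradient terms e(j) * u(j)
    for n in range(M - 1, len(u)):
        un = u[n - M + 1:n + 1][::-1]
        e = d[n] - w @ un
        hist.append(e * un)
        if len(hist) > N:
            hist.pop(0)
        w = w + (mu / N) * np.sum(hist, axis=0)
    return w
```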
91 Momentum LMS algorithm: the LPF is a first-order IIR filter,
h(0) = 1 − γ,  h(1) = γ h(0),  h(2) = γ² h(0),  ...
then
b_i(n) = LPF(g_i(n)) = γ b_i(n−1) + (1−γ) g_i(n)
b(n) = γ b(n−1) + (1−γ) g(n)
The resulting algorithm can be written as a second-order recursion:
w(n+1) = w(n) − µ b(n)
γ w(n) = γ w(n−1) − γ µ b(n−1)
w(n+1) − γ w(n) = w(n) − γ w(n−1) − µ b(n) + γ µ b(n−1)
92
w(n+1) − γ w(n) = w(n) − γ w(n−1) − µ b(n) + γ µ b(n−1)
w(n+1) = w(n) + γ (w(n) − w(n−1)) − µ (b(n) − γ b(n−1))
w(n+1) = w(n) + γ (w(n) − w(n−1)) − µ (1−γ) g(n)
w(n+1) = w(n) + γ (w(n) − w(n−1)) + 2µ (1−γ) e(n) u(n)
w(n+1) − w(n) = γ (w(n) − w(n−1)) + 2µ (1−γ) e(n) u(n)
Drawback: the convergence rate may decrease.
Advantage: the momentum term keeps the algorithm active even in the regions close to the minimum.
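The final second-order recursion can be sketched directly (Python/NumPy; step size and momentum constant are illustrative, and the same small-step stability caveats as for LMS apply):

```python
import numpy as np

def momentum_lms(u, d, M, mu, gamma):
    """Momentum LMS: w(n+1) = w(n) + gamma*(w(n)-w(n-1)) + 2*mu*(1-gamma)*e(n)*u(n)."""
    w = np.zeros(M)
    w_prev = np.zeros(M)
    for n in range(M - 1, len(u)):
        un = u[n - M + 1:n + 1][::-1]
        e = d[n] - w @ un
        # apply the momentum term and the scaled instantaneous gradient in one step
        w, w_prev = w + gamma * (w - w_prev) + 2 * mu * (1 - gamma) * e * un, w
    return w
```

Keeping only the pair (w(n), w(n−1)) avoids storing the filtered gradient b(n) explicitly, which is exactly what the derivation above establishes.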
93 5. Nonlinear Smoothing of LMS Gradient Estimates
Impulsive interference in either d(n) or u(n) drastically degrades LMS performance.
Idea: smooth the noisy gradient components using a nonlinear filter.
The Median LMS Algorithm: the adaptation equation can be implemented as
w_i(n+1) = w_i(n) + µ med(e(n)u(n−i), e(n−1)u(n−1−i), ..., e(n−N)u(n−N−i))
The smoothing effect in an impulsive noise environment is very strong.
If the environment is not impulsive, the performance of Median LMS is comparable with that of LMS.
The convergence rate is slower than in LMS.
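The median smoothing can be sketched as follows (Python/NumPy; the impulsive-noise test setup, window length, and step size are illustrative choices used to show the robustness property, not values from the slides):

```python
import numpy as np

def median_lms(u, d, M, mu, N):
    """Median LMS: each coefficient is updated with the median of the last
    N+1 gradient terms e(j)*u(j-i), which suppresses impulsive outliers."""
    w = np.zeros(M)
    hist = []                                  # rows: e(j) * regressor(j)
    for n in range(M - 1, len(u)):
        un = u[n - M + 1:n + 1][::-1]
        e = d[n] - w @ un
        hist.append(e * un)
        if len(hist) > N + 1:
            hist.pop(0)
        w = w + mu * np.median(hist, axis=0)   # componentwise median over the window
    return w
```

A single impulsive sample in d(n) produces one outlier row in the window, which the componentwise median simply ignores; plain LMS would take the full hit of that corrupted gradient.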
More informationCSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression
CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationEqualization Prof. David Johns University of Toronto (
Equalization Prof. David Johns (johns@eecg.toronto.edu) (www.eecg.toronto.edu/~johns) slide 1 of 70 Adaptive Filter Introduction Adaptive filters are used in: Noise cancellation Echo cancellation Sinusoidal
More information