ELEG-636: Statistical Signal Processing

ELEG-636: Statistical Signal Processing Gonzalo R. Arce Department of Electrical and Computer Engineering University of Delaware Spring 2010 Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 1 / 79

Course Objectives & Structure
Objective: Given a discrete time sequence {x(n)}, develop
Statistical and spectral signal representations
Filtering, prediction, and system identification algorithms
Optimization methods that are statistical and adaptive
Course Structure:
Weekly lectures [notes: www.ece.udel.edu/ arce]
Periodic homework (theory & Matlab implementations) [15%]
Midterm & Final examinations [85%]
Textbook: Haykin, Adaptive Filter Theory.
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 2 / 79

Course Objectives & Structure
Broad applications in communications, imaging, and sensors.
Emerging applications in brain-imaging techniques, brain-machine interfaces, and implantable devices.
Neurofeedback presents real-time physiological signals from MRIs in a visual or auditory form to provide information about brain activity. These signals are used to train the patient to alter neural activity in a desired direction. Traditionally, feedback using EEGs or other mechanisms has not focused on the brain because the resolution is not good enough.
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 3 / 79

Motivation
Adaptive optimization and filtering methods are appropriate, advantageous, or necessary when:
Signal statistics are not known a priori and must be learned from observed or representative samples
Signal statistics evolve over time
Time or computational restrictions dictate that simple, if repetitive, operations be employed rather than solving more complex, closed form expressions
To be considered are the following algorithms:
Steepest Descent (SD) - deterministic
Least Mean Squares (LMS) - stochastic
Recursive Least Squares (RLS) - deterministic
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 5 / 79

Steepest Descent
Definition (Steepest Descent (SD)): Steepest descent, also known as gradient descent, is an iterative technique for finding a local minimum of a function.
Approach: Given an arbitrary starting point, the current location (value) is moved in steps proportional to the negative of the gradient at the current point.
SD is an old, deterministic method that is the basis for stochastic gradient based methods
SD is a feedback approach to finding a local minimum of an error performance surface
The error surface must be known a priori
In the MSE case, SD converges to the optimal solution, $w_0 = R^{-1}p$, without inverting a matrix
Question: Why in the MSE case does this converge to the global minimum rather than a local minimum?
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 6 / 79

Steepest Descent Example
Consider a well structured cost function with a single minimum. The optimization proceeds as follows:
Contour plot showing the evolution of the optimization
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 7 / 79

Steepest Descent Example
Consider a gradient ascent example in which there are multiple minima/maxima.
Surface plot showing the multiple minima and maxima; contour plot illustrating that the final result depends on the starting value.
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 8 / 79

Steepest Descent
To derive the approach, consider the FIR case:
{x(n)} - the WSS input samples
{d(n)} - the WSS desired output
{$\hat{d}(n)$} - the estimate of the desired signal given by $\hat{d}(n) = w^H(n)x(n)$
where
$x(n) = [x(n), x(n-1), \ldots, x(n-M+1)]^T$ [obs. vector]
$w(n) = [w_0(n), w_1(n), \ldots, w_{M-1}(n)]^T$ [time indexed filter coefs.]
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 9 / 79

Steepest Descent
Then, similarly to previously considered cases,
$e(n) = d(n) - \hat{d}(n) = d(n) - w^H(n)x(n)$
and the MSE at time n is
$J(n) = E\{|e(n)|^2\} = \sigma_d^2 - w^H(n)p - p^H w(n) + w^H(n)Rw(n)$
where
$\sigma_d^2$ - variance of desired signal
$p$ - cross-correlation between x(n) and d(n)
$R$ - correlation matrix of x(n)
Note: The weight vector and cost function are time indexed (functions of time)
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 10 / 79

Steepest Descent
When w(n) is set to the (optimal) Wiener solution,
$w(n) = w_0 = R^{-1}p$
and
$J(n) = J_{\min} = \sigma_d^2 - p^H w_0$
Use the method of steepest descent to iteratively find $w_0$.
The optimal result is achieved since the cost function is a second order polynomial with a single unique minimum
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 11 / 79

Steepest Descent Example
Let M = 2. The MSE is a bowl shaped surface, which is a function of the 2-D weight vector w(n).
[Surface plot and contour plot of J(w) versus (w_1, w_2); the minimum lies at $w_0$]
Imagine dropping a marble at any point on the bowl-shaped surface. The ball will reach the minimum point by going through the path of steepest descent.
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 12 / 79

Steepest Descent
Observation: Set the direction of the filter update as the negative gradient, $-\nabla J(n)$.
Resulting update:
$w(n+1) = w(n) + \tfrac{1}{2}\mu[-\nabla J(n)]$
or, since $\nabla J(n) = -2p + 2Rw(n)$,
$w(n+1) = w(n) + \mu[p - Rw(n)] \qquad n = 0, 1, 2, \ldots$
where $w(0) = 0$ (or other appropriate value) and $\mu$ is the step size
Observation: SD uses feedback, which makes it possible for the system to be unstable
Bounds on the step size guaranteeing stability can be determined with respect to the eigenvalues of R (Widrow, 1970)
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 13 / 79
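For concreteness, a minimal NumPy sketch of the SD recursion above; the function name and the 2x2 values of R and p used in the usage lines are illustrative assumptions, not taken from the slides.

import numpy as np

def steepest_descent(R, p, mu, n_iter=200, w0=None):
    """Iterate w(n+1) = w(n) + mu*(p - R w(n)) and return the weight history."""
    M = len(p)
    w = np.zeros(M) if w0 is None else np.asarray(w0, dtype=float)
    history = [w.copy()]
    for _ in range(n_iter):
        w = w + mu * (p - R @ w)          # SD update using the true gradient
        history.append(w.copy())
    return np.array(history)

# Illustrative usage with an arbitrary 2x2 correlation matrix
R = np.array([[1.0, 0.5], [0.5, 1.0]])
p = np.array([0.5, 0.2])
mu = 0.1                                  # must satisfy 0 < mu < 2/lambda_max
W = steepest_descent(R, p, mu)
print(W[-1], np.linalg.solve(R, p))       # iterate approaches the Wiener solution R^{-1} p

Since the cost surface is quadratic with a single minimum, the iterate converges to the Wiener solution for any admissible step size.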

Convergence Analysis
Define the error vector for the tap weights as
$c(n) = w(n) - w_0$
Then, using $p = Rw_0$ in the update,
$w(n+1) = w(n) + \mu[p - Rw(n)] = w(n) + \mu[Rw_0 - Rw(n)] = w(n) - \mu Rc(n)$
and subtracting $w_0$ from both sides,
$w(n+1) - w_0 = w(n) - w_0 - \mu Rc(n)$
$c(n+1) = c(n) - \mu Rc(n) = [I - \mu R]c(n)$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 14 / 79

Convergence Analysis
Using the unitary similarity transform
$R = Q\Omega Q^H$
we have
$c(n+1) = [I - \mu R]c(n) = [I - \mu Q\Omega Q^H]c(n)$
$Q^H c(n+1) = [Q^H - \mu Q^H Q\Omega Q^H]c(n) = [I - \mu\Omega]Q^H c(n) \qquad (*)$
Define the transformed coefficients as
$v(n) = Q^H c(n) = Q^H(w(n) - w_0)$
Then (*) becomes
$v(n+1) = [I - \mu\Omega]v(n)$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 15 / 79

Convergence Analysis
Consider the initial condition of v(n):
$v(0) = Q^H(w(0) - w_0) = -Q^H w_0$ [if $w(0) = 0$]
Consider the k-th term (mode) in $v(n+1) = [I - \mu\Omega]v(n)$.
Note $[I - \mu\Omega]$ is diagonal, thus all modes are independently updated.
The update for the k-th term can be written as
$v_k(n+1) = (1 - \mu\lambda_k)v_k(n) \qquad k = 1, 2, \ldots, M$
or, using recursion,
$v_k(n) = (1 - \mu\lambda_k)^n v_k(0)$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 16 / 79

Convergence Analysis
Observation: Convergence to the optimal solution requires
$\lim_{n\to\infty} w(n) = w_0 \iff \lim_{n\to\infty} c(n) = \lim_{n\to\infty}[w(n) - w_0] = 0$
$\iff \lim_{n\to\infty} v(n) = \lim_{n\to\infty} Q^H c(n) = 0$
$\iff \lim_{n\to\infty} v_k(n) = 0 \qquad k = 1, 2, \ldots, M \qquad (*)$
Result: According to the recursion $v_k(n) = (1 - \mu\lambda_k)^n v_k(0)$, the limit in (*) holds if and only if
$|1 - \mu\lambda_k| < 1$ for all k
Thus, since the eigenvalues are nonnegative, $0 < \mu\lambda_{\max} < 2$, or
$0 < \mu < \frac{2}{\lambda_{\max}}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 17 / 79

Convergence Analysis
Observation: The k-th mode has geometric decay,
$v_k(n) = (1 - \mu\lambda_k)^n v_k(0)$
The rate of decay is characterized by the time it takes to decay to $e^{-1}$ of the initial value.
Let $\tau_k$ denote this time for the k-th mode:
$v_k(\tau_k) = (1 - \mu\lambda_k)^{\tau_k}v_k(0) = e^{-1}v_k(0)$
$e^{-1} = (1 - \mu\lambda_k)^{\tau_k} \implies \tau_k = \frac{-1}{\ln(1 - \mu\lambda_k)} \approx \frac{1}{\mu\lambda_k}$ for $\mu \ll 1$
Result: The overall rate of decay is bounded by
$\frac{-1}{\ln(1 - \mu\lambda_{\max})} \le \tau \le \frac{-1}{\ln(1 - \mu\lambda_{\min})}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 18 / 79

Convergence Analysis Example Consider the typical behavior of a single mode Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 19 / 79

Error Analysis
Recall that
$J(n) = J_{\min} + (w(n) - w_0)^H R(w(n) - w_0)$
$= J_{\min} + (w(n) - w_0)^H Q\Omega Q^H(w(n) - w_0)$
$= J_{\min} + v^H(n)\Omega v(n)$
$= J_{\min} + \sum_{k=1}^{M}\lambda_k|v_k(n)|^2$ [sub in $v_k(n) = (1 - \mu\lambda_k)^n v_k(0)$]
$= J_{\min} + \sum_{k=1}^{M}\lambda_k(1 - \mu\lambda_k)^{2n}|v_k(0)|^2$
Result: If $0 < \mu < \frac{2}{\lambda_{\max}}$, then $\lim_{n\to\infty}J(n) = J_{\min}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 20 / 79

Example: Predictor
Consider a two tap predictor for real valued input.
Analyze the effects of the following cases:
Varying the eigenvalue spread $\chi(R) = \frac{\lambda_{\max}}{\lambda_{\min}}$ while keeping $\mu$ fixed
Varying $\mu$ while keeping the eigenvalue spread $\chi(R)$ fixed
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 21 / 79

Example: Predictor
SD loci plots (with J(n) contours shown) as a function of $[v_1(n), v_2(n)]$ for step-size $\mu = 0.3$
Eigenvalue spread $\chi(R) = 1.22$: small eigenvalue spread - modes converge at a similar rate
Eigenvalue spread $\chi(R) = 3$: moderate eigenvalue spread - modes converge at moderately similar rates
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 22 / 79

Example: Predictor
SD loci plots (with J(n) contours shown) as a function of $[v_1(n), v_2(n)]$ for step-size $\mu = 0.3$
Eigenvalue spread $\chi(R) = 10$: large eigenvalue spread - modes converge at different rates
Eigenvalue spread $\chi(R) = 100$: very large eigenvalue spread - modes converge at very different rates
Principal direction convergence is fastest
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 23 / 79

Example: Predictor
SD loci plots (with J(n) contours shown) as a function of $[w_1(n), w_2(n)]$ for step-size $\mu = 0.3$
Eigenvalue spread $\chi(R) = 1.22$: small eigenvalue spread - modes converge at a similar rate
Eigenvalue spread $\chi(R) = 3$: moderate eigenvalue spread - modes converge at moderately similar rates
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 24 / 79

Example: Predictor
SD loci plots (with J(n) contours shown) as a function of $[w_1(n), w_2(n)]$ for step-size $\mu = 0.3$
Eigenvalue spread $\chi(R) = 10$: large eigenvalue spread - modes converge at different rates
Eigenvalue spread $\chi(R) = 100$: very large eigenvalue spread - modes converge at very different rates
Principal direction convergence is fastest
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 25 / 79

Example: Predictor Learning curves of steepest-descent algorithm with step-size parameter µ = 0.3 and varying eigenvalue spread. Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 26 / 79

Example: Predictor
SD loci plots (with J(n) contours shown) as a function of $[v_1(n), v_2(n)]$ with $\chi(R) = 10$ and varying step sizes
Step size $\mu = 0.3$: overdamped - slow convergence
Step size $\mu = 1$: underdamped - fast (erratic) convergence
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 27 / 79

Example: Predictor
SD loci plots (with J(n) contours shown) as a function of $[w_1(n), w_2(n)]$ with $\chi(R) = 10$ and varying step sizes
Step size $\mu = 0.3$: overdamped - slow convergence
Step size $\mu = 1$: underdamped - fast (erratic) convergence
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 28 / 79

Example: Predictor
Example: Consider a system identification problem.
[Block diagram: x(n) drives both the unknown system, producing d(n), and the adaptive filter w(n), producing $\hat{d}(n)$; the error is $e(n) = d(n) - \hat{d}(n)$]
Suppose M = 2 and
$R_x = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix} \qquad p = \begin{bmatrix} 0.8 \\ 0.5 \end{bmatrix}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 29 / 79

Example: Predictor
From eigen analysis we have $\lambda_1 = 1.8$, $\lambda_2 = 0.2$, and thus
$\mu < \frac{2}{1.8}$
Also,
$q_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix} \qquad q_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad Q = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$
and
$w_0 = R^{-1}p = \begin{bmatrix} 1.11 \\ -0.389 \end{bmatrix}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 30 / 79

Example: Predictor
Thus
$v(n) = Q^H[w(n) - w_0]$
Noting that
$v(0) = -Q^H w_0 = -\frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} 1.11 \\ -0.389 \end{bmatrix} = \begin{bmatrix} -0.51 \\ -1.06 \end{bmatrix}$
and
$v_1(n) = (1 - \mu(1.8))^n(-0.51)$
$v_2(n) = (1 - \mu(0.2))^n(-1.06)$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 31 / 79
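A short NumPy check of the numbers quoted in this example (a hypothetical verification script, not part of the course materials); the step size mu = 0.5 below matches one of the values used on the following slide.

import numpy as np

R = np.array([[1.0, 0.8], [0.8, 1.0]])
p = np.array([0.8, 0.5])

# Wiener solution and eigen-structure quoted on the slides
w0 = np.linalg.solve(R, p)                 # approximately [ 1.111, -0.389]
lam, Q = np.linalg.eigh(R)                 # eigenvalues 0.2 and 1.8
print(w0, lam)

# Transformed initial error v(0) = -Q^H w0 and its modal decay for a given mu
mu = 0.5
v = -Q.T @ w0
for n in range(50):
    v = (1 - mu * lam) * v                 # v_k(n+1) = (1 - mu*lambda_k) v_k(n)
print(np.abs(v))                           # both modes shrink since mu < 2/1.8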

Example: Predictor
SD convergence properties for two $\mu$ values
Step size $\mu = 0.5$: overdamped - slow convergence
Step size $\mu = 1$: underdamped - fast (erratic) convergence
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 32 / 79

Least Mean Squares (LMS)
Definition (Least Mean Squares (LMS) Algorithm)
Motivation: The error performance surface used by the SD method is not always known a priori
Solution: Use estimated values. We will use the following instantaneous estimates:
$\hat{R}(n) = x(n)x^H(n) \qquad \hat{p}(n) = x(n)d^*(n)$
Result: The estimates are RVs and thus this leads to a stochastic optimization
Historical Note: Invented in 1960 by Stanford University professor Bernard Widrow and his first Ph.D. student, Ted Hoff
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 33 / 79

Least Mean Squares (LMS)
Recall the SD update
$w(n+1) = w(n) + \tfrac{1}{2}\mu[-\nabla J(n)]$
where the gradient of the error surface at w(n) was shown to be
$\nabla J(n) = -2p + 2Rw(n)$
Using the instantaneous estimates,
$\hat{\nabla}J(n) = -2x(n)d^*(n) + 2x(n)x^H(n)w(n)$
$= -2x(n)[d^*(n) - x^H(n)w(n)] = -2x(n)[d^*(n) - \hat{d}^*(n)] = -2x(n)e^*(n)$
where $e^*(n)$ is the complex conjugate of the estimate error.
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 34 / 79

Least Mean Squares (LMS)
Utilizing $\hat{\nabla}J(n) = -2x(n)e^*(n)$ in the update,
$w(n+1) = w(n) + \tfrac{1}{2}\mu[-\hat{\nabla}J(n)] = w(n) + \mu x(n)e^*(n)$ [LMS Update]
The LMS algorithm belongs to the family of stochastic gradient algorithms
The update is extremely simple
Although the instantaneous estimates may have large variance, the LMS algorithm is recursive and effectively averages these estimates
The simplicity and good performance of the LMS algorithm make it the benchmark against which other optimization algorithms are judged
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 35 / 79
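A minimal NumPy sketch of the LMS update above, assuming a complex-valued data record x and desired signal d; the function name and signature are illustrative, not from the slides.

import numpy as np

def lms(x, d, M, mu):
    """Run the complex LMS update w(n+1) = w(n) + mu * x(n) e*(n) over a data record."""
    N = len(x)
    w = np.zeros(M, dtype=complex)
    e = np.zeros(N, dtype=complex)
    for n in range(M - 1, N):
        xn = x[n - M + 1:n + 1][::-1]      # regressor [x(n), x(n-1), ..., x(n-M+1)]
        y = np.vdot(w, xn)                 # filter output w^H(n) x(n)
        e[n] = d[n] - y                    # estimation error
        w = w + mu * xn * np.conj(e[n])    # stochastic gradient step
    return w, e

Each iteration costs only O(M) operations, which is the simplicity the slide refers to.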

Convergence Analysis
Independence Theorem: the following conditions hold:
1. The vectors x(1), x(2), ..., x(n) are statistically independent
2. x(n) is independent of d(1), d(2), ..., d(n-1)
3. d(n) is statistically dependent on x(n), but is independent of d(1), d(2), ..., d(n-1)
4. x(n) and d(n) are mutually Gaussian
The independence theorem is invoked in the LMS algorithm analysis.
The independence theorem is justified in some cases, e.g., beamforming where we receive independent vector observations.
In other cases it is not well justified, but allows the analysis to proceed (i.e., when all else fails, invoke simplifying assumptions).
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 36 / 79

Convergence Analysis
We will invoke the independence theorem to show that w(n) converges to the optimal solution in the mean,
$\lim_{n\to\infty}E\{w(n)\} = w_0$
To prove this, evaluate the update:
$w(n+1) = w(n) + \mu x(n)e^*(n)$
$w(n+1) - w_0 = w(n) - w_0 + \mu x(n)e^*(n)$
$c(n+1) = c(n) + \mu x(n)(d^*(n) - x^H(n)w(n))$
$= c(n) + \mu x(n)d^*(n) - \mu x(n)x^H(n)[w(n) - w_0 + w_0]$
$= c(n) + \mu x(n)d^*(n) - \mu x(n)x^H(n)c(n) - \mu x(n)x^H(n)w_0$
$= [I - \mu x(n)x^H(n)]c(n) + \mu x(n)[d^*(n) - x^H(n)w_0]$
$= [I - \mu x(n)x^H(n)]c(n) + \mu x(n)e_0^*(n)$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 37 / 79

Convergence Analysis
Take the expectation of the update, noting that w(n) is based on past inputs and desired values; thus w(n), and consequently c(n), are independent of x(n) (Independence Theorem).
$c(n+1) = [I - \mu x(n)x^H(n)]c(n) + \mu x(n)e_0^*(n)$
$E\{c(n+1)\} = (I - \mu R)E\{c(n)\} + \mu\underbrace{E\{x(n)e_0^*(n)\}}_{=0 \text{ (why?)}} = (I - \mu R)E\{c(n)\}$
Using arguments similar to the SD case we have
$\lim_{n\to\infty}E\{c(n)\} = 0$ if $0 < \mu < \frac{2}{\lambda_{\max}}$
or equivalently
$\lim_{n\to\infty}E\{w(n)\} = w_0$ if $0 < \mu < \frac{2}{\lambda_{\max}}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 38 / 79

Convergence Analysis
Noting that $\sum_{i=1}^{M}\lambda_i = \mathrm{trace}[R]$,
$\lambda_{\max} \le \mathrm{trace}[R] = Mr(0) = M\sigma_x^2$
Thus a more conservative bound (and one easier to determine) is
$0 < \mu < \frac{2}{M\sigma_x^2}$
Convergence in the mean, $\lim_{n\to\infty}E\{w(n)\} = w_0$, is a weak condition that says nothing about the variance, which may even grow.
A stronger condition is convergence in the mean square, which says
$\lim_{n\to\infty}E\{|c(n)|^2\} = \text{constant}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 39 / 79

Convergence Analysis
Proving convergence in the mean square is equivalent to showing that
$\lim_{n\to\infty}J(n) = \lim_{n\to\infty}E\{|e(n)|^2\} = \text{constant}$
To evaluate the limit, write e(n) as
$e(n) = d(n) - \hat{d}(n) = d(n) - w^H(n)x(n)$
$= d(n) - w_0^H x(n) - [w^H(n) - w_0^H]x(n) = e_0(n) - c^H(n)x(n)$
Thus
$J(n) = E\{|e(n)|^2\} = E\{(e_0(n) - c^H(n)x(n))(e_0^*(n) - x^H(n)c(n))\}$
$= J_{\min} + \underbrace{E\{c^H(n)x(n)x^H(n)c(n)\}}_{J_{ex}(n)} = J_{\min} + J_{ex}(n)$ [Cross terms $\to 0$, why?]
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 40 / 79

Convergence Analysis
Since $J_{ex}(n)$ is a scalar,
$J_{ex}(n) = E\{c^H(n)x(n)x^H(n)c(n)\}$
$= E\{\mathrm{trace}[c^H(n)x(n)x^H(n)c(n)]\}$
$= E\{\mathrm{trace}[x(n)x^H(n)c(n)c^H(n)]\}$
$= \mathrm{trace}[E\{x(n)x^H(n)c(n)c^H(n)\}]$
Invoking the independence theorem,
$J_{ex}(n) = \mathrm{trace}[E\{x(n)x^H(n)\}E\{c(n)c^H(n)\}] = \mathrm{trace}[RK(n)]$
where $K(n) = E\{c(n)c^H(n)\}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 41 / 79

Convergence Analysis
Thus
$J(n) = J_{\min} + J_{ex}(n) = J_{\min} + \mathrm{trace}[RK(n)]$
Recall $Q^H RQ = \Omega$, or $R = Q\Omega Q^H$. Set
$S(n) = Q^H K(n)Q$
where S(n) need not be diagonal. Then
$K(n) = QQ^H K(n)QQ^H$ [since $Q^{-1} = Q^H$]
$= QS(n)Q^H$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 42 / 79

Convergence Analysis
Utilizing $R = Q\Omega Q^H$ and $K(n) = QS(n)Q^H$ in the excess error expression,
$J_{ex}(n) = \mathrm{trace}[RK(n)] = \mathrm{trace}[Q\Omega Q^H QS(n)Q^H]$
$= \mathrm{trace}[Q\Omega S(n)Q^H] = \mathrm{trace}[Q^H Q\Omega S(n)] = \mathrm{trace}[\Omega S(n)]$
Since $\Omega$ is diagonal,
$J_{ex}(n) = \mathrm{trace}[\Omega S(n)] = \sum_{i=1}^{M}\lambda_i s_i(n)$
where $s_1(n), s_2(n), \ldots, s_M(n)$ are the diagonal elements of S(n).
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 43 / 79

Convergence Analysis
The previously derived recursion
$E\{c(n+1)\} = (I - \mu R)E\{c(n)\}$
can be modified to yield a recursion on S(n),
$S(n+1) = (I - \mu\Omega)S(n)(I - \mu\Omega) + \mu^2 J_{\min}\Omega$
which for the diagonal elements is
$s_i(n+1) = (1 - \mu\lambda_i)^2 s_i(n) + \mu^2 J_{\min}\lambda_i \qquad i = 1, 2, \ldots, M$
Suppose $J_{ex}(n)$ converges; then $s_i(n+1) = s_i(n)$ and
$s_i(n) = (1 - \mu\lambda_i)^2 s_i(n) + \mu^2 J_{\min}\lambda_i$
$s_i(n) = \frac{\mu^2 J_{\min}\lambda_i}{1 - (1 - \mu\lambda_i)^2} = \frac{\mu^2 J_{\min}\lambda_i}{2\mu\lambda_i - \mu^2\lambda_i^2} = \frac{\mu J_{\min}}{2 - \mu\lambda_i} \qquad i = 1, 2, \ldots, M$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 44 / 79

Convergence Analysis
Consider again
$J_{ex}(n) = \mathrm{trace}[\Omega S(n)] = \sum_{i=1}^{M}\lambda_i s_i(n)$
Taking the limit and utilizing $s_i(n) = \frac{\mu J_{\min}}{2 - \mu\lambda_i}$,
$\lim_{n\to\infty}J_{ex}(n) = J_{\min}\sum_{i=1}^{M}\frac{\mu\lambda_i}{2 - \mu\lambda_i}$
The LMS misadjustment is defined as
$\mathrm{MA} = \frac{\lim_{n\to\infty}J_{ex}(n)}{J_{\min}} = \sum_{i=1}^{M}\frac{\mu\lambda_i}{2 - \mu\lambda_i}$
Note: A misadjustment of 10% or less is generally considered acceptable.
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 45 / 79
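A small illustrative helper (not from the slides) that evaluates the misadjustment formula for a given set of eigenvalues; the eigenvalues and step sizes below are example values only.

import numpy as np

def lms_misadjustment(eigvals, mu):
    """Evaluate MA = sum_i mu*lambda_i / (2 - mu*lambda_i) for the eigenvalues of R."""
    lam = np.asarray(eigvals, dtype=float)
    return np.sum(mu * lam / (2.0 - mu * lam))

# Illustrative values: see how the misadjustment grows with the step size
lam = np.array([1.8, 0.2])
for mu in (0.05, 0.1, 0.2):
    print(mu, lms_misadjustment(lam, mu))   # roughly 5%, 11%, 23% for these eigenvalues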

Example: First-Order Predictor
Example: This is a one tap predictor,
$\hat{x}(n) = w(n)x(n-1)$
Take the underlying process to be a real order one AR process,
$x(n) = -a\,x(n-1) + v(n)$
The weight update is
$w(n+1) = w(n) + \mu x(n-1)e(n)$ [LMS update for obs. x(n-1)]
$= w(n) + \mu x(n-1)[x(n) - w(n)x(n-1)]$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 46 / 79

Example: First-Order Predictor
Since
$x(n) = -a\,x(n-1) + v(n)$ [AR model]
$\hat{x}(n) = w(n)x(n-1)$ [one tap predictor]
the optimal weight is $w_0 = -a$. Note that
$E\{x(n-1)e_0(n)\} = E\{x(n-1)v(n)\} = 0$
proves the optimality.
Set $\mu = 0.05$ and consider two cases:
a = -0.99, $\sigma_x^2 = 0.93627$
a = 0.99, $\sigma_x^2 = 0.995$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 47 / 79
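A hypothetical simulation of this one-tap predictor; only the AR model, $w_0 = -a$, and the LMS update come from the slides. The white Gaussian driving noise and its standard deviation sigma_v, the record length, and the seed are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def predictor_lms(a, sigma_v, mu, N):
    """One-tap LMS predictor for the AR(1) process x(n) = -a x(n-1) + v(n)."""
    x = np.zeros(N)
    w = np.zeros(N)                        # weight trajectory, should approach w0 = -a
    for n in range(1, N):
        x[n] = -a * x[n - 1] + sigma_v * rng.standard_normal()
        e = x[n] - w[n - 1] * x[n - 1]     # prediction error
        w[n] = w[n - 1] + mu * x[n - 1] * e  # LMS update
    return w

w = predictor_lms(a=-0.99, sigma_v=0.1, mu=0.05, N=5000)
print(w[-1])                               # hovers near -a = 0.99 on average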

Example: First-Order Predictor Figure: Transient behavior of adaptive first-order predictor weight ŵ(n) for µ = 0.05. Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 48 / 79

Example: First-Order Predictor Figure: Transient behavior of adaptive first-order predictor squared error for µ = 0.05. Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 49 / 79

Example: First-Order Predictor Figure: Mean-squared error learning curves for an adaptive first-order predictor with varying step-size parameter µ. Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 50 / 79

Example: First-Order Predictor
Consider the expected trajectory of w(n). Recall
$w(n+1) = w(n) + \mu x(n-1)e(n)$
$= w(n) + \mu x(n-1)[x(n) - w(n)x(n-1)]$
$= [1 - \mu x(n-1)x(n-1)]w(n) + \mu x(n-1)x(n)$
In this example, $x(n) = -a\,x(n-1) + v(n)$. Substituting in:
$w(n+1) = [1 - \mu x(n-1)x(n-1)]w(n) + \mu x(n-1)[-a\,x(n-1) + v(n)]$
$= [1 - \mu x(n-1)x(n-1)]w(n) - \mu a\,x(n-1)x(n-1) + \mu x(n-1)v(n)$
Taking the expectation and invoking the independence theorem,
$E\{w(n+1)\} = (1 - \mu\sigma_x^2)E\{w(n)\} - \mu\sigma_x^2 a$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 51 / 79

Example: First-Order Predictor Figure: Comparison of experimental results with theory, based on ŵ(n). Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 52 / 79

Example: First-Order Predictor
Next, derive a theoretical expression for J(n). The initial value of J(n) is
$J(0) = E\{(x(0) - w(0)x(-1))^2\} = E\{(x(0))^2\} = \sigma_x^2$
and the final value is
$J(\infty) = J_{\min} + J_{ex}$, where $J_{\min} = E\{(x(n) - w_0 x(n-1))^2\} = E\{(v(n))^2\} = \sigma_v^2$
Note $\lambda_1 = \sigma_x^2$. Thus,
$J(\infty) = \sigma_v^2 + J_{\min}\frac{\mu\lambda_1}{2 - \mu\lambda_1} = \sigma_v^2 + \sigma_v^2\frac{\mu\sigma_x^2}{2 - \mu\sigma_x^2} = \sigma_v^2\left(1 + \frac{\mu\sigma_x^2}{2 - \mu\sigma_x^2}\right)$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 53 / 79

Example: First-Order Predictor
And if $\mu$ is small,
$J(\infty) = \sigma_v^2\left(1 + \frac{\mu\sigma_x^2}{2 - \mu\sigma_x^2}\right) \approx \sigma_v^2\left(1 + \frac{\mu\sigma_x^2}{2}\right)$
Putting all the components together:
$J(n) = \underbrace{\left[\sigma_x^2 - \sigma_v^2\left(1 + \tfrac{\mu}{2}\sigma_x^2\right)\right]}_{J(0) - J(\infty)}\underbrace{(1 - \mu\sigma_x^2)^{2n}}_{1 \to 0} + \underbrace{\sigma_v^2\left(1 + \tfrac{\mu}{2}\sigma_x^2\right)}_{J(\infty)}$
Also, the time constant is
$\tau = \frac{-1}{2\ln(1 - \mu\lambda_1)} = \frac{-1}{2\ln(1 - \mu\sigma_x^2)} \approx \frac{1}{2\mu\sigma_x^2}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 54 / 79

Example: First-Order Predictor Figure: Comparison of experimental results with theory for the adaptive predictor, based on the mean-square error for µ = 0.001. Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 55 / 79

Example: Adaptive Equalization Example (Adaptive Equalization) Objective: Pass a known signal through an unknown channel to invert the effects the channel and noise have on the signal Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 56 / 79

Example: Adaptive Equalization
The signal is a Bernoulli sequence,
$x_n = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}$
The additive noise is N(0, 0.001).
The channel has a raised cosine response,
$h_n = \begin{cases} \tfrac{1}{2}\left[1 + \cos\left(\tfrac{2\pi}{W}(n-2)\right)\right] & n = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases}$
W controls the eigenvalue spread $\chi(R)$
$h_n$ is symmetric about n = 2 and thus introduces a delay of 2
We will use an M = 11 tap filter, which is symmetric about n = 5 and introduces a delay of 5
Thus an overall delay of $\delta = 5 + 2 = 7$ is added to the system
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 57 / 79
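An illustrative NumPy sketch of this equalization experiment under the stated assumptions (Bernoulli +/-1 input, 3-tap raised cosine channel, noise variance 0.001, M = 11, overall delay 7); the record length, seed, and function name are arbitrary choices, not from the slides.

import numpy as np

rng = np.random.default_rng(1)

def equalizer_experiment(W=3.5, M=11, mu=0.075, N=3000, noise_var=0.001, delay=7):
    """LMS equalizer for the raised-cosine channel described above (sketch)."""
    # h[0] = 0 so that h[n] matches the slide's h_n for n = 0..3 (channel delay of 2)
    h = np.array([0.0] + [0.5 * (1 + np.cos(2 * np.pi / W * (n - 2))) for n in (1, 2, 3)])
    x = rng.choice([-1.0, 1.0], size=N)               # Bernoulli +/-1 input
    u = np.convolve(x, h)[:N] + np.sqrt(noise_var) * rng.standard_normal(N)
    w = np.zeros(M)
    J = np.zeros(N)                                   # squared-error learning curve
    for n in range(M, N):
        un = u[n - M + 1:n + 1][::-1]                 # equalizer input vector
        e = x[n - delay] - w @ un                     # desired = delayed transmitted symbol
        w = w + mu * un * e                           # LMS update
        J[n] = e ** 2
    return w, J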

Example: Adaptive Equalization Channel response and Filter response Figure: (a) Impulse response of channel; (b) impulse response of optimum transversal equalizer. Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 58 / 79

Example: Adaptive Equalization
Consider three W values. Note the step size is bounded by the W = 3.5 case:
$\mu \le \frac{2}{Mr(0)} = \frac{2}{11(1.3022)} = 0.14$
Choose $\mu = 0.075$ in all cases.
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 59 / 79

Example: Adaptive Equalization Figure: Learning curves of the LMS algorithm for an adaptive equalizer with number of taps M = 11, step-size parameter µ = 0.075, and varying eigenvalue spread χ(r). Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 60 / 79

Example: Adaptive Equalization Ensemble-average impulse response of the adaptive equalizer (after 1000 iterations) for each of four different eigenvalue spreads. Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 61 / 79

Example: Adaptive Equalization Figure: Learning curves of the LMS algorithm for an adaptive equalizer with the number of taps M = 11, fixed eigenvalue spread, and varying step-size parameter µ. Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 62 / 79

Example: LMS Directionality
Example: Directionality of the LMS algorithm
The speed of convergence of the LMS algorithm is faster in certain directions in the weight space
If the convergence is in the appropriate direction, the convergence can be accelerated by increased eigenvalue spread
To investigate this phenomenon, consider the deterministic signal
$x(n) = A_1\cos(\omega_1 n) + A_2\cos(\omega_2 n)$
Even though it is deterministic, a correlation matrix can be determined:
$R = \frac{1}{2}\begin{bmatrix} A_1^2 + A_2^2 & A_1^2\cos(\omega_1) + A_2^2\cos(\omega_2) \\ A_1^2\cos(\omega_1) + A_2^2\cos(\omega_2) & A_1^2 + A_2^2 \end{bmatrix}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 63 / 79

Example: LMS Directionality
Determining the eigenvalues and eigenvectors yields
$\lambda_1 = \tfrac{1}{2}A_1^2(1 + \cos(\omega_1)) + \tfrac{1}{2}A_2^2(1 + \cos(\omega_2))$
$\lambda_2 = \tfrac{1}{2}A_1^2(1 - \cos(\omega_1)) + \tfrac{1}{2}A_2^2(1 - \cos(\omega_2))$
and
$q_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \qquad q_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$
Case 1: $A_1 = 1, A_2 = 0.5, \omega_1 = 1.2, \omega_2 = 0.1$
$x_a(n) = \cos(1.2n) + 0.5\cos(0.1n)$ and $\chi(R) = 2.9$
Case 2: $A_1 = 1, A_2 = 0.5, \omega_1 = 0.6, \omega_2 = 0.23$
$x_b(n) = \cos(0.6n) + 0.5\cos(0.23n)$ and $\chi(R) = 12.9$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 64 / 79

Example: LMS Directionality
Since p is undefined for this deterministic signal, set $p = \lambda_i q_i$. Then, since $p = Rw_0$, we see (two cases)
$p = \lambda_1 q_1 \implies Rw_0 = \lambda_1 q_1 \implies w_0 = q_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$
$p = \lambda_2 q_2 \implies Rw_0 = \lambda_2 q_2 \implies w_0 = q_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$
Utilize 200 iterations of the algorithm.
Consider the minimum eigenfilter first, $w_0 = q_2 = [1, -1]^T$
Consider the maximum eigenfilter second, $w_0 = q_1 = [1, 1]^T$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 65 / 79
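A short script (illustrative, not from the slides) that builds the two-tap correlation matrix of the two-tone signal and reproduces the quoted eigenvalue spreads.

import numpy as np

def two_tone_R(A1, A2, w1, w2):
    """Correlation matrix of x(n) = A1 cos(w1 n) + A2 cos(w2 n) for a two-tap filter."""
    r0 = 0.5 * (A1**2 + A2**2)
    r1 = 0.5 * (A1**2 * np.cos(w1) + A2**2 * np.cos(w2))
    return np.array([[r0, r1], [r1, r0]])

for (w1, w2) in [(1.2, 0.1), (0.6, 0.23)]:
    R = two_tone_R(1.0, 0.5, w1, w2)
    lam = np.linalg.eigvalsh(R)
    print(w1, w2, lam.max() / lam.min())   # eigenvalue spreads of roughly 2.9 and 12.9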

Example: LMS Directionality
Convergence of the LMS algorithm, for a deterministic sinusoidal process, along the slow eigenvector (i.e., minimum eigenfilter).
For input $x_a(n)$ ($\chi(R) = 2.9$)
For input $x_b(n)$ ($\chi(R) = 12.9$)
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 66 / 79

Example: LMS Directionality
Convergence of the LMS algorithm, for a deterministic sinusoidal process, along the fast eigenvector (i.e., maximum eigenfilter).
For input $x_a(n)$ ($\chi(R) = 2.9$)
For input $x_b(n)$ ($\chi(R) = 12.9$)
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 67 / 79

Normalized LMS Algorithm
Observation: The LMS correction is proportional to $\mu x(n)e^*(n)$,
$w(n+1) = w(n) + \mu x(n)e^*(n)$
If x(n) is large, the LMS update suffers from gradient noise amplification.
The normalized LMS algorithm seeks to avoid gradient noise amplification.
The step size is made time varying, $\mu(n)$, and optimized to minimize the next step error:
$w(n+1) = w(n) + \tfrac{1}{2}\mu(n)[-\nabla J(n)] = w(n) + \mu(n)[p - Rw(n)]$
Choose $\mu(n)$ such that w(n+1) produces the minimum MSE,
$J(n+1) = E\{|e(n+1)|^2\}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 68 / 79

Normalized LMS Algorithm
Let $\nabla(n) \equiv \nabla J(n)$ and note
$e(n+1) = d(n+1) - w^H(n+1)x(n+1)$
Objective: Choose $\mu(n)$ such that it minimizes $J(n+1)$.
The optimal step size, $\mu_0(n)$, will be a function of R and $\nabla(n)$. Use instantaneous estimates of these values.
To determine $\mu_0(n)$, expand $J(n+1)$:
$J(n+1) = E\{e(n+1)e^*(n+1)\}$
$= E\{(d(n+1) - w^H(n+1)x(n+1))(d^*(n+1) - x^H(n+1)w(n+1))\}$
$= \sigma_d^2 - w^H(n+1)p - p^H w(n+1) + w^H(n+1)Rw(n+1)$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 69 / 79

Normalized LMS Algorithm
Now use the fact that $w(n+1) = w(n) - \tfrac{1}{2}\mu(n)\nabla(n)$:
$J(n+1) = \sigma_d^2 - \left[w(n) - \tfrac{1}{2}\mu(n)\nabla(n)\right]^H p - p^H\left[w(n) - \tfrac{1}{2}\mu(n)\nabla(n)\right] + \left[w(n) - \tfrac{1}{2}\mu(n)\nabla(n)\right]^H R\left[w(n) - \tfrac{1}{2}\mu(n)\nabla(n)\right]$
where the last (quadratic) term expands as
$w^H(n)Rw(n) - \tfrac{1}{2}\mu(n)w^H(n)R\nabla(n) - \tfrac{1}{2}\mu(n)\nabla^H(n)Rw(n) + \tfrac{1}{4}\mu^2(n)\nabla^H(n)R\nabla(n)$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 70 / 79

Normalized LMS Algorithm
Differentiating $J(n+1)$ above with respect to $\mu(n)$,
$\frac{\partial J(n+1)}{\partial\mu(n)} = \tfrac{1}{2}\nabla^H(n)p + \tfrac{1}{2}p^H\nabla(n) - \tfrac{1}{2}w^H(n)R\nabla(n) - \tfrac{1}{2}\nabla^H(n)Rw(n) + \tfrac{1}{2}\mu(n)\nabla^H(n)R\nabla(n) \qquad (*)$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 71 / 79

Normalized LMS Algorithm
Setting (*) equal to 0,
$\mu_0(n)\nabla^H(n)R\nabla(n) = w^H(n)R\nabla(n) - p^H\nabla(n) + \nabla^H(n)Rw(n) - \nabla^H(n)p$
$\mu_0(n) = \frac{w^H(n)R\nabla(n) - p^H\nabla(n) + \nabla^H(n)Rw(n) - \nabla^H(n)p}{\nabla^H(n)R\nabla(n)}$
$= \frac{[w^H(n)R - p^H]\nabla(n) + \nabla^H(n)[Rw(n) - p]}{\nabla^H(n)R\nabla(n)}$
$= \frac{[Rw(n) - p]^H\nabla(n) + \nabla^H(n)[Rw(n) - p]}{\nabla^H(n)R\nabla(n)}$
$= \frac{\tfrac{1}{2}\nabla^H(n)\nabla(n) + \tfrac{1}{2}\nabla^H(n)\nabla(n)}{\nabla^H(n)R\nabla(n)} = \frac{\nabla^H(n)\nabla(n)}{\nabla^H(n)R\nabla(n)}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 72 / 79

Normalized LMS Algorithm
Using the instantaneous estimates
$\hat{R} = x(n)x^H(n)$ and $\hat{p} = x(n)d^*(n)$,
$\hat{\nabla}(n) = 2[\hat{R}w(n) - \hat{p}] = 2[x(n)x^H(n)w(n) - x(n)d^*(n)] = 2[x(n)(\hat{d}^*(n) - d^*(n))] = -2x(n)e^*(n)$
Thus
$\mu_0(n) = \frac{\hat{\nabla}^H(n)\hat{\nabla}(n)}{\hat{\nabla}^H(n)\hat{R}\hat{\nabla}(n)} = \frac{4x^H(n)x(n)e(n)e^*(n)}{2x^H(n)e(n)\,x(n)x^H(n)\,2x(n)e^*(n)}$
$= \frac{|e(n)|^2 x^H(n)x(n)}{|e(n)|^2(x^H(n)x(n))^2} = \frac{1}{x^H(n)x(n)} = \frac{1}{\|x(n)\|^2}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 73 / 79

Normalized LMS Algorithm
Result: The NLMS update is
$w(n+1) = w(n) + \underbrace{\frac{\mu}{\|x(n)\|^2}}_{\mu(n)}x(n)e^*(n)$
$\mu$ is introduced to scale the update.
To avoid problems when $\|x(n)\|^2 \approx 0$ we add an offset,
$w(n+1) = w(n) + \frac{\mu}{a + \|x(n)\|^2}x(n)e^*(n)$
where $a > 0$.
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 74 / 79
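A minimal NumPy sketch of the NLMS update with the offset a; the default offset value and the function signature are assumptions for illustration.

import numpy as np

def nlms(x, d, M, mu, a=1e-6):
    """NLMS: w(n+1) = w(n) + mu/(a + ||x(n)||^2) * x(n) e*(n), stable for 0 < mu < 2."""
    N = len(x)
    w = np.zeros(M, dtype=complex)
    e = np.zeros(N, dtype=complex)
    for n in range(M - 1, N):
        xn = x[n - M + 1:n + 1][::-1]
        e[n] = d[n] - np.vdot(w, xn)
        w = w + (mu / (a + np.vdot(xn, xn).real)) * xn * np.conj(e[n])
    return w, e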

NLMS Convergence
Objective: Analyze the NLMS convergence,
$w(n+1) = w(n) + \frac{\mu}{\|x(n)\|^2}x(n)e^*(n)$
Substituting $e(n) = d(n) - w^H(n)x(n)$,
$w(n+1) = w(n) + \frac{\mu}{\|x(n)\|^2}x(n)[d^*(n) - x^H(n)w(n)]$
$= \left[I - \mu\frac{x(n)x^H(n)}{\|x(n)\|^2}\right]w(n) + \mu\frac{x(n)d^*(n)}{\|x(n)\|^2}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 75 / 79

NLMS Convergence
Objective: Compare the NLMS and LMS algorithms.
NLMS: $w(n+1) = \left[I - \mu\frac{x(n)x^H(n)}{\|x(n)\|^2}\right]w(n) + \mu\frac{x(n)d^*(n)}{\|x(n)\|^2}$
LMS: $w(n+1) = [I - \mu x(n)x^H(n)]w(n) + \mu x(n)d^*(n)$
By observation, we see the following corresponding terms:
LMS: $\mu$ | NLMS: $\mu$
LMS: $x(n)x^H(n)$ | NLMS: $\frac{x(n)x^H(n)}{\|x(n)\|^2}$
LMS: $x(n)d^*(n)$ | NLMS: $\frac{x(n)d^*(n)}{\|x(n)\|^2}$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 76 / 79

NLMS Convergence
Recall the corresponding terms above.
LMS case: $0 < \mu < \frac{2}{\mathrm{trace}[E\{x(n)x^H(n)\}]} = \frac{2}{\mathrm{trace}[R]}$ guarantees stability.
By analogy,
$0 < \mu < \frac{2}{\mathrm{trace}\left[E\left\{\frac{x(n)x^H(n)}{\|x(n)\|^2}\right\}\right]}$
guarantees stability of the NLMS.
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 77 / 79

NLMS Convergence
To analyze the bound, make the following approximation:
$E\left\{\frac{x(n)x^H(n)}{\|x(n)\|^2}\right\} \approx \frac{E\{x(n)x^H(n)\}}{E\{\|x(n)\|^2\}}$
Then
$\mathrm{trace}\left[E\left\{\frac{x(n)x^H(n)}{\|x(n)\|^2}\right\}\right] \approx \frac{\mathrm{trace}[E\{x(n)x^H(n)\}]}{E\{\|x(n)\|^2\}} = \frac{E\{\mathrm{trace}[x(n)x^H(n)]\}}{E\{\|x(n)\|^2\}} = \frac{E\{\mathrm{trace}[x^H(n)x(n)]\}}{E\{\|x(n)\|^2\}} = \frac{E\{\|x(n)\|^2\}}{E\{\|x(n)\|^2\}} = 1$
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 78 / 79

NLMS Convergence
Thus
$0 < \mu < \frac{2}{\mathrm{trace}\left[E\left\{\frac{x(n)x^H(n)}{\|x(n)\|^2}\right\}\right]} = 2$
Final Result: The NLMS update
$w(n+1) = w(n) + \frac{\mu}{\|x(n)\|^2}x(n)e^*(n)$
will converge if $0 < \mu < 2$.
Note:
The NLMS has a simpler convergence criterion than the LMS
The NLMS generally converges faster than the LMS algorithm
Gonzalo R. Arce (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2010 79 / 79

ELEG 636 - Statistical Signal Processing Gonzalo R. Arce Variants of the LMS algorithm Department of Electrical and Computer Engineering University of Delaware Newark, DE, 19716 Fall 2013 (Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 1 / 15

Standard LMS Algorithm
FIR filters:
$y(n) = w_0(n)u(n) + w_1(n)u(n-1) + \ldots + w_{M-1}(n)u(n-M+1) = \sum_{k=0}^{M-1}w_k(n)u(n-k) = w(n)^T u(n), \qquad n = 0, 1, \ldots$
Error between filter output y(n) and a desired signal d(n):
$e(n) = d(n) - y(n) = d(n) - w(n)^T u(n)$
Update filter parameters according to
$w(n+1) = w(n) + \mu u(n)e(n)$
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 2 / 15

1. Normalized LMS Algorithm
Modify at time n the parameter vector from w(n) to w(n+1) so as to satisfy the constraint
$d(n) = \sum_{i=0}^{M-1}w_i(n+1)u(n-i)$
from which the Lagrange multiplier of the constrained minimization results.
In order to add an extra degree of freedom to the adaptation strategy, a constant $\mu$ controlling the step size is introduced:
$w_j(n+1) = w_j(n) + \frac{\mu}{\sum_{i=0}^{M-1}(u(n-i))^2}e(n)u(n-j) = w_j(n) + \frac{\mu}{\|u(n)\|^2}e(n)u(n-j)$
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 3 / 15

To overcome the possible numerical difficulties when $\|u(n)\|$ is close to zero, a constant $a > 0$ is used:
$w_j(n+1) = w_j(n) + \frac{\mu}{a + \|u(n)\|^2}e(n)u(n-j)$
This is the update used in the Normalized LMS algorithm.
The Normalized LMS algorithm converges if $0 < \mu < 2$.
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 4 / 15

Comparison of LMS and NLMS (Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 5 / 15

Comparison of LMS and NLMS
The LMS was run with three different step-sizes: µ = [0.075; 0.025; 0.0075].
The NLMS was run with step-sizes: µ = [1.0; 0.5; 0.1].
The larger the step-size, the faster the convergence.
The smaller the step-size, the better the steady state square error.
LMS with µ = 0.0075 and NLMS with µ = 0.1 achieved similar average steady state square error. However, NLMS was faster.
LMS with µ = 0.075 and NLMS with µ = 1.0 had a similar convergence speed. However, NLMS achieved a lower steady state average square error.
Conclusion: NLMS offers better trade-offs than LMS.
The computational complexity of NLMS is slightly higher than that of LMS.
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 6 / 15

2. LMS Algorithm with Time Variable Adaptation Step
Heuristics: combine the benefits of two different situations:
The convergence time constant is small for large µ.
The mean-square error in steady state is low for small µ.
The adaptation step µ is kept large initially, then monotonically reduced:
$\mu(n) = \frac{1}{n + c}$
Disadvantage for non-stationary data: the algorithm will not react to changes in the optimum solution for large values of n.
Variable Step algorithm:
$w(n+1) = w(n) + M(n)u(n)e(n)$
where
$M(n) = \mathrm{diag}(\mu_0(n), \mu_1(n), \ldots, \mu_{M-1}(n))$
Each filter parameter $w_i(n)$ is updated using an independent adaptation step $\mu_i(n)$.
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 7 / 15
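A brief sketch of this variant using a decaying scalar step $\mu(n) = \mu_0/(n + c)$, as a stand-in for the general diagonal M(n) form; the parameter defaults and function name are illustrative assumptions.

import numpy as np

def variable_step_lms(u, d, M, c=20.0, mu0=1.0):
    """LMS with a decaying scalar step mu(n) = mu0/(n + c); a per-coefficient diag(M(n)) works the same way."""
    N = len(u)
    w = np.zeros(M)
    for n in range(M - 1, N):
        un = u[n - M + 1:n + 1][::-1]
        e = d[n] - w @ un
        mu_n = mu0 / (n + c)               # time-variable adaptation step
        w = w + mu_n * un * e
    return w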

Comparison of LMS and variable step size LMS
$\mu(n) = \frac{\mu}{0.01n + c}$ with $c = [10; 20; 50]$
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 8 / 15

3. Sign Algorithms
In high speed communication time is critical, thus faster adaptation processes are needed.
$\mathrm{sgn}(a) = \begin{cases} 1 & a > 0 \\ 0 & a = 0 \\ -1 & a < 0 \end{cases}$
The Sign algorithm (other names: pilot LMS, or Sign Error):
$w(n+1) = w(n) + \mu u(n)\,\mathrm{sgn}(e(n))$
The Clipped LMS (or Signed Regressor):
$w(n+1) = w(n) + \mu\,\mathrm{sgn}(u(n))e(n)$
The Zero forcing LMS (or Sign Sign):
$w(n+1) = w(n) + \mu\,\mathrm{sgn}(u(n))\,\mathrm{sgn}(e(n))$
The Sign algorithm can be derived as an LMS algorithm for minimizing the mean absolute error (MAE) criterion,
$J(w) = E[|e(n)|] = E[|d(n) - w^T u(n)|]$
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 9 / 15
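A compact sketch of the three sign variants listed above (illustrative function, real-valued signals assumed).

import numpy as np

def sign_lms(u, d, M, mu, variant="sign-error"):
    """One update per sample for the three sign variants of LMS described above."""
    N = len(u)
    w = np.zeros(M)
    for n in range(M - 1, N):
        un = u[n - M + 1:n + 1][::-1]
        e = d[n] - w @ un
        if variant == "sign-error":          # pilot LMS / Sign Error
            w = w + mu * un * np.sign(e)
        elif variant == "signed-regressor":  # Clipped LMS
            w = w + mu * np.sign(un) * e
        else:                                # Sign-Sign / zero forcing
            w = w + mu * np.sign(un) * np.sign(e)
    return w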

Properties of sign algorithms
Fast computation: if µ is constrained to the form $\mu = 2^{-m}$, only shifting and addition operations are required.
Drawback: the update mechanism is degraded, compared to the LMS algorithm, by the crude quantization of the gradient estimates.
The steady state error will increase.
The convergence rate decreases.
The fastest of them, Sign-Sign, is used in the CCITT ADPCM standard for the 32000 bps system.
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 10 / 15

Comparison of LMS and Sign LMS
The Sign LMS algorithm should be operated at smaller step-sizes to get a behavior similar to the standard LMS algorithm.
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 11 / 15

4. Linear Smoothing of LMS Gradient Estimates
Lowpass filtering the noisy gradient: rename the noisy gradient
$g(n) = \hat{\nabla}_w J = -2u(n)e(n)$
$g_i(n) = -2e(n)u(n-i)$
Passing the signals $g_i(n)$ through low pass filters will prevent the large fluctuations of direction during the adaptation process:
$b_i(n) = \mathrm{LPF}(g_i(n))$
The updating process will use the filtered noisy gradient,
$w(n+1) = w(n) - \mu b(n)$
The following versions are well known:
Averaged LMS algorithm: LPF is the filter with impulse response $h(m) = \frac{1}{N}$, $m = 0, 1, \ldots, N-1$, giving
$w(n+1) = w(n) + \frac{\mu}{N}\sum_{j=n-N+1}^{n}e(j)u(j)$
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 12 / 15

Momentum LMS algorithm
LPF is a first order IIR filter: $h(0) = 1 - \gamma$, $h(1) = \gamma h(0)$, $h(2) = \gamma^2 h(0), \ldots$ Then
$b_i(n) = \mathrm{LPF}(g_i(n)) = \gamma b_i(n-1) + (1-\gamma)g_i(n)$
$b(n) = \gamma b(n-1) + (1-\gamma)g(n)$
The resulting algorithm can be written as a second order recursion:
$w(n+1) = w(n) - \mu b(n)$
$\gamma w(n) = \gamma w(n-1) - \gamma\mu b(n-1)$
$w(n+1) - \gamma w(n) = w(n) - \gamma w(n-1) - \mu b(n) + \gamma\mu b(n-1)$
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 13 / 15

$w(n+1) - \gamma w(n) = w(n) - \gamma w(n-1) - \mu b(n) + \gamma\mu b(n-1)$
$w(n+1) = w(n) + \gamma(w(n) - w(n-1)) - \mu(b(n) - \gamma b(n-1))$
$w(n+1) = w(n) + \gamma(w(n) - w(n-1)) - \mu(1-\gamma)g(n)$
$w(n+1) = w(n) + \gamma(w(n) - w(n-1)) + 2\mu(1-\gamma)e(n)u(n)$
$w(n+1) - w(n) = \gamma(w(n) - w(n-1)) + 2\mu(1-\gamma)e(n)u(n)$
Drawback: The convergence rate may decrease.
Advantage: The momentum term keeps the algorithm active even in the regions close to the minimum.
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 14 / 15
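A short sketch of the momentum LMS recursion in its final form; gamma, the defaults, and the function interface are illustrative choices, not part of the lecture.

import numpy as np

def momentum_lms(u, d, M, mu, gamma=0.9):
    """Momentum LMS: w(n+1) = w(n) + gamma*(w(n)-w(n-1)) + 2*mu*(1-gamma)*e(n)*u(n)."""
    N = len(u)
    w, w_prev = np.zeros(M), np.zeros(M)
    for n in range(M - 1, N):
        un = u[n - M + 1:n + 1][::-1]
        e = d[n] - w @ un
        w_new = w + gamma * (w - w_prev) + 2 * mu * (1 - gamma) * e * un
        w_prev, w = w, w_new
    return w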

5. Nonlinear Smoothing of LMS Gradient Estimates
Impulsive interference in either d(n) or u(n) drastically degrades LMS performance.
Smooth the noisy gradient components using a nonlinear filter.
The Median LMS Algorithm: the adaptation equation can be implemented as
$w_i(n+1) = w_i(n) + \mu\,\mathrm{med}\big(e(n)u(n-i),\, e(n-1)u(n-1-i),\, \ldots,\, e(n-N)u(n-N-i)\big)$
The smoothing effect in an impulsive noise environment is very strong.
If the environment is not impulsive, the performance of Median LMS is comparable with that of LMS.
The convergence rate is slower than in LMS.
(Variants of the LMS algorithm) Gonzalo R. Arce Spring, 2013 15 / 15
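A sketch of the Median LMS update, assuming a sliding window of the last N_win+1 error/regressor pairs per coefficient; the buffer handling and defaults are illustrative, and the first few iterations simply see zero-padded history.

import numpy as np

def median_lms(u, d, M, mu, N_win=5):
    """Median LMS: each coefficient is updated with the median of recent gradient terms e(n-k)u(n-k-i)."""
    N = len(u)
    w = np.zeros(M)
    e_hist = np.zeros(N_win + 1)           # recent errors e(n-k), k = 0..N_win
    u_hist = np.zeros((N_win + 1, M))      # recent regressors u(n-k)
    for n in range(M - 1, N):
        un = u[n - M + 1:n + 1][::-1]
        e = d[n] - w @ un
        e_hist = np.roll(e_hist, 1); e_hist[0] = e
        u_hist = np.roll(u_hist, 1, axis=0); u_hist[0] = un
        # per-coefficient median of the products e(n-k) * u(n-k-i)
        w = w + mu * np.median(e_hist[:, None] * u_hist, axis=0)
    return w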