Assessment of the efficiency of the LMS algorithm based on spectral information (Invited Paper) Aaron Flores and Bernard Widrow ISL, Department of Electrical Engineering, Stanford University, Stanford CA, USA Email: {aflores,widrow}@stanford.edu

Abstract

The LMS algorithm is used to find the optimal minimum mean-squared error (MMSE) solutions for a wide variety of problems. Unfortunately, its convergence speed depends heavily on its initial conditions when the autocorrelation matrix R of its input vector has a high eigenvalue spread. In many applications, such as system identification or channel equalization, R is Toeplitz. In this paper we exploit the Toeplitz structure of R to show that when the weight vector is initialized to zero, the convergence speed of LMS is related to the similarity between the input PSD and the power spectrum of the optimum solution.

I. INTRODUCTION

The LMS algorithm is widely used to find the minimum mean-squared error (MMSE) solutions for a variety of linear estimation problems, and its efficiency has been extensively studied during the last decades [1]–[18]. The convergence speed of LMS depends on two factors: the eigenvalue distribution of the input autocorrelation matrix R, and the chosen initial condition. Modes corresponding to small eigenvalues converge much more slowly than those corresponding to large eigenvalues, and the initialization of the algorithm determines how much excitation is received by each of these modes. In practice, it is not known how the chosen initial conditions excite each of the modes, making the speed of convergence difficult to predict. Our analysis of the convergence speed is concerned with applications where the input vector comes from a tapped delay line, inducing a Toeplitz structure in the R matrix.
Using the Toeplitz nature of R, we show that the speed of convergence of the LMS algorithm can be estimated from its input power spectral density and the spectrum of the difference between the initial weight vector and the optimum solution. The speed of convergence of LMS is qualitatively assessed by comparing it to its ideal counterpart, the LMS/Newton algorithm, which is often used as a benchmark for adaptive algorithms [17], [18]. In Section II we describe both the LMS and LMS/Newton algorithms and derive approximations to their learning curves, on which our transient analysis is based. In Section III we define the performance metric used to evaluate speed of convergence, and in Section IV we show how that metric can be estimated from spectral information. We illustrate our results with simulations in Section V and summarize our conclusions and future work in Section VI.

II. THE LMS AND LMS/NEWTON ALGORITHMS

Adaptive algorithms such as LMS and LMS/Newton are used to solve linear estimation problems like the one depicted in Fig. 1, where the input vector x_k = [x_{1k} x_{2k} ... x_{Lk}]^T and desired response d_k are jointly stationary random processes, w_k = [w_{1k} w_{2k} ... w_{Lk}]^T is the weight vector, y_k = x_k^T w_k is the output, and ε_k = d_k − y_k is the error. The Mean Square Error (MSE) is defined as ξ_k = E[ε_k²], and it is a quadratic function of the weight vector. The optimal weight vector that minimizes ξ_k is given by w* = R⁻¹p, where R = E[x_k x_k^T] is the input autocorrelation matrix (assumed to be full rank), and p = E[x_k d_k] is the crosscorrelation vector. The minimum MSE (MMSE) obtained using w* is denoted by ξ_min.

Fig. 1. Adaptive Linear Combiner.

A. The LMS algorithm

Often in practice R⁻¹p cannot be calculated due to the lack of knowledge of the statistics R and p. However, when samples of x_k and d_k are available, they can be used to iteratively adjust the weight vector to obtain an approximation of w*. The simplest and most widely used algorithm for this is LMS [1].
It performs instantaneous gradient-descent adaptation of the weight vector:

w_{k+1} = w_k + 2µ ε_k x_k.    (1)

The step size parameter is µ, and the initial weight vector w_0 is arbitrarily set by the user. The MSE sequence ξ_k corresponding
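As a concrete illustration, the update rule above can be sketched as a short runnable loop; the plant coefficients, signal length, and step size below are hypothetical values chosen for the example, not taken from the paper's simulations.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 4                                        # number of taps (hypothetical)
w_true = np.array([1.0, 0.5, -0.3, 0.2])     # unknown plant to identify (hypothetical)
mu = 0.01                                    # step size, well below 1/(3 Tr(R))
w = np.zeros(L)                              # weight vector initialized to zero

x = rng.standard_normal(5000)                # white input signal
for k in range(L, len(x)):
    x_k = x[k:k-L:-1]                        # tapped-delay-line vector [x_k ... x_{k-L+1}]
    d_k = w_true @ x_k                       # desired response from the plant (noise-free here)
    e_k = d_k - x_k @ w                      # error epsilon_k
    w = w + 2 * mu * e_k * x_k               # LMS update: w_{k+1} = w_k + 2 mu e_k x_k

print(np.round(w, 3))
```

With a noise-free desired response the weight deviation contracts toward zero, so the adapted weights approach the plant coefficients.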
to the sequence of adapted weight vectors w_k is commonly known as the learning curve. Next we derive an approximation for ξ_k that is the basis of our speed-of-convergence analysis. Since R is symmetric, we can diagonalize it as R = QΛQ^T, where Λ is a diagonal matrix of the eigenvalues λ_i of R, and Q is a real orthonormal matrix with the corresponding eigenvectors of R as columns. Let v_k = w_k − w* be the weight vector deviation from the optimal solution and v'_k = Q^T v_k. Let F_k be the vector obtained from the diagonal of E[v'_k v'_k^T], and λ = [λ_1 λ_2 ... λ_L]^T. Assuming the random processes {x_k, d_k} to be independent over time, it can be shown [3] that the MSE at time k can be expressed as

ξ_k = ξ_min + λ^T F_k.    (2)

If we further assume x_k to be Gaussian, it was shown in [5] and [3] that F_k obeys the following recursion

F_{k+1} = (I − 4µΛ + 8µ²Λ² + 4µ²λλ^T) F_k + 4µ²ξ_min λ,    (3)

where F_k is shown [5] to converge if µ < 1/(3 Tr(R)). This condition on µ allows us to approximate (3) by

F_{k+1} ≈ (I − 4µΛ) F_k + 4µ²ξ_min λ.    (4)

Denoting by 1 ∈ R^L a vector of ones and using (4),

F_k ≈ (I − 4µΛ)^k (F_0 − µξ_min 1) + µξ_min 1.    (5)

Substituting (5) in (2), we obtain the following approximation for the learning curve

ξ_k ≈ ξ_∞ + λ^T (I − 4µΛ)^k (F_0 − µξ_min 1),    (6)

where ξ_∞ = lim_{k→∞} ξ_k ≈ ξ_min (1 + µ Tr(R)).

B. The LMS/Newton algorithm

The LMS/Newton algorithm [6] is an ideal variant of the LMS algorithm that uses R to whiten its input. Although most of the time it cannot be implemented in practice due to the lack of knowledge of R, it is of theoretical importance as a benchmark for adaptive algorithms [17], [18]. The LMS/Newton algorithm is the following

w_{k+1} = w_k + 2µλ_avg R⁻¹ ε_k x_k,    (7)

where λ_avg = (1/L) Σ_{i=1}^{L} λ_i. It is well known [6] that the LMS/Newton algorithm is equivalent to the LMS algorithm with learning rate µλ_avg and input previously whitened and normalized such that Λ = I.
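The step from the approximate recursion F_{k+1} ≈ (I − 4µΛ)F_k + 4µ²ξ_min λ to its closed form F_k ≈ (I − 4µΛ)^k (F_0 − µξ_min 1) + µξ_min 1 can be sanity-checked numerically; the eigenvalues, step size, minimum MSE, and initial F_0 below are arbitrary test values.

```python
import numpy as np

lam = np.array([0.2, 1.0, 3.0])        # arbitrary eigenvalues of R
mu = 0.01                              # satisfies mu < 1/(3 Tr(R)) = 1/12.6
xi_min = 0.5                           # minimum MSE (arbitrary)
ones = np.ones_like(lam)

F0 = np.array([2.0, 1.0, 0.5])         # arbitrary initial F_0
F = F0.copy()
K = 200
for _ in range(K):                     # iterate F_{k+1} = (I - 4 mu Lam) F_k + 4 mu^2 xi_min lam
    F = (1 - 4 * mu * lam) * F + 4 * mu**2 * xi_min * lam

# closed form: F_K = (I - 4 mu Lam)^K (F_0 - mu xi_min 1) + mu xi_min 1
F_closed = (1 - 4 * mu * lam)**K * (F0 - mu * xi_min * ones) + mu * xi_min * ones
print(np.max(np.abs(F - F_closed)))
```

The fixed point of the recursion is µξ_min 1, which is why the closed form decays geometrically toward that vector.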
Therefore, assuming the same independence and Gaussian statistics of x_k and d_k as with LMS, we can replace µ by µλ_avg and Λ by I in (6) to obtain the following approximation for the LMS/Newton learning curve

ξ_k ≈ ξ_∞ + (1 − 4µλ_avg)^k λ^T (F_0 − µξ_min 1),    (8)

where the asymptotic MSE ξ_∞ is the same as for LMS.

III. TRANSIENT EFFICIENCY

Examining (6) and (8) we can see that the learning curve for LMS is a sum of geometric sequences (modes) with 1 − 4µλ_i as geometric ratios, whereas for LMS/Newton it consists of a single geometric sequence with geometric ratio 1 − 4µλ_avg. If all the eigenvalues of R are equal, the learning curves for LMS and LMS/Newton are the same. Generally, the eigenvalues of R are not equal, and LMS consists of multiple modes, some faster than LMS/Newton and some slower than it, as depicted in Fig. 2.

Fig. 2. LMS and LMS/Newton modes.

To evaluate the speed of convergence of a learning curve we use the natural measure suggested in [5],

J = Σ_{k=0}^{∞} (ξ_k − ξ_∞).    (9)

Small values of J indicate fast convergence and large values indicate slow convergence. Using (6) and (8) to obtain J for LMS and LMS/Newton respectively, we get

J_LMS ≈ (1^T F_0 − µLξ_min) / (4µ) = (v_0^T v_0 − µLξ_min) / (4µ),    (10)

J_LMS/Newton ≈ (λ^T F_0 − µλ_avg Lξ_min) / (4µλ_avg) = (v_0^T R v_0 − µλ_avg Lξ_min) / (4µλ_avg).    (11)

To compare the speed of convergence of LMS to that of LMS/Newton, we define what we call the Transient Efficiency:

Transient Efficiency = J_LMS/Newton / J_LMS.    (12)

If the Transient Efficiency is bigger/smaller than one, then LMS performs better/worse than LMS/Newton in the sense of the J metric. Assuming that the initial weight vector w_0 is far enough from w* such that v_0^T v_0 ≫ µLξ_min and v_0^T R v_0 ≫ µλ_avg Lξ_min, we obtain from (10), (11) and (12)

Transient Efficiency ≈ v_0^T R v_0 / (λ_avg v_0^T v_0).    (13)
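The closed form for J_LMS follows from summing each geometric mode, Σ_k (1 − 4µλ_i)^k = 1/(4µλ_i), so the eigenvalues cancel term by term. A quick numerical check of this identity, with an arbitrary SPD matrix R, deviation v_0, step size, and minimum MSE, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 5
A = rng.standard_normal((L, L))
R = A @ A.T + 5 * np.eye(L)            # arbitrary SPD input autocorrelation matrix
lam, Q = np.linalg.eigh(R)             # eigenvalues and orthonormal eigenvectors
mu, xi_min = 1e-3, 0.1                 # arbitrary step size and minimum MSE

v0 = rng.standard_normal(L)            # initial deviation v_0 = w_0 - w*
F0 = (Q.T @ v0)**2                     # diagonal of v'_0 v'_0^T for a deterministic w_0

# J for LMS obtained by summing the geometric modes of the learning curve
J_sum = np.sum(lam * (F0 - mu * xi_min) / (4 * mu * lam))
# closed form: (v_0^T v_0 - mu L xi_min) / (4 mu)
J_closed = (v0 @ v0 - mu * L * xi_min) / (4 * mu)
print(J_sum, J_closed)
```

The two values agree because 1^T F_0 equals v_0^T v_0 when the initial weight vector is deterministic.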
Since (13) is invariant to scaling of v_0 and R, we can assume without loss of generality ||v_0|| = λ_avg = 1, obtaining the following compact expression:

Transient Efficiency ≈ v_0^T R v_0.    (14)

This is the first contribution of this paper, and it will be used to obtain an approximation of the Transient Efficiency in terms of spectra in the next section.

IV. TRANSIENT EFFICIENCY IN TERMS OF SPECTRAL INFORMATION

In this section we show that, in the case that R is Toeplitz, the Transient Efficiency can be expressed in terms of the Fourier spectrum of v_0 and the input power spectral density. The Toeplitz structure of R arises from applications where the input vector comes from a Tapped Delay Line (TDL) as shown in Fig. 3, i.e. x_k = [x_k x_{k−1} ... x_{k−L+1}]^T, where x_k is a stationary scalar random process with autocorrelation sequence φ_xx[n] = E[x_k x_{k+n}]. The input autocorrelation matrix is therefore given by R_{k,l} = φ_xx[k − l].

Fig. 3. Tapped Delay Line.

We define the Input Power Spectral Density vector Φ_xx ∈ R^M as the M uniformly spaced samples of the DTFT of the N-point truncation of the input autocorrelation function φ_xx[n], i.e.

Φ_xx[m] = Σ_{n=−N}^{N} φ_xx[n] e^{−(2πj/M) mn},    (15)

m = 0, 1, ..., M−1, with M ≥ N + L and N ≥ L. In order to show how Φ_xx is related to the Transient Efficiency, the following Lemma will be needed.

Lemma 1 (Spectral Factorization of R): Let Ψ be an M × M diagonal matrix with Ψ_{m,m} = (1/M) Φ_xx[m] as its diagonal entries, and let U be an L × M matrix with elements U_{l,m} = e^{(2πj/M) ml}, with 0 ≤ l ≤ L−1 and 0 ≤ m ≤ M−1. Then R can be factored in the following way:

R = U Ψ U*.    (16)

Proof: Let T = U Ψ U*. Then

T_{k,l} = Σ_{m=0}^{M−1} e^{(2πj/M) km} Ψ_{m,m} e^{−(2πj/M) lm}    (17)
= (1/M) Σ_{m=0}^{M−1} Φ_xx[m] e^{(2πj/M) m(k−l)}    (18)
= (1/M) Σ_{m=0}^{M−1} Σ_{n=−N}^{N} φ_xx[n] e^{−(2πj/M) mn} e^{(2πj/M) m(k−l)}    (19)
= (1/M) Σ_{n=−N}^{N} φ_xx[n] Σ_{m=0}^{M−1} e^{(2πj/M) m(k−l−n)}    (20)
= Σ_{n=−N}^{N} φ_xx[n] δ[k − l − n]    (21)
= φ_xx[k − l] = R_{k,l},    (22)

where the step from (20) to (21) was done using the following inequalities: −L < k − l < L, −N ≤ n ≤ N, and M ≥ N + L, which guarantee |k − l − n| < M.

Let V_0 = U* v_0 be the M-point DFT of v_0, i.e.
V_0[m] = Σ_{n=0}^{L−1} v_0[n] e^{−(2πj/M) mn},    m = 0, 1, ..., M−1.    (23)

Using Lemma 1 and (23) we obtain

v_0^T R v_0 = v_0^T U Ψ U* v_0 = (1/M) Σ_{m=0}^{M−1} Φ_xx[m] |V_0[m]|².    (24)

Furthermore, inserting (24) into (14), we arrive at the following approximation for the Transient Efficiency in terms of the input PSD and the spectrum of v_0:

Transient Efficiency ≈ (1/M) Σ_{m=0}^{M−1} Φ_xx[m] |V_0[m]|².    (25)

We should note that, although the independence assumption under which the approximation in (14) was derived is violated when a tapped delay line is used, the approximation has been observed to hold in simulations. Since v_0 and R are scaled such that ||v_0|| = λ_avg = 1, we can show that

(1/M) Σ_{m=0}^{M−1} Φ_xx[m] = 1,    (1/M) Σ_{m=0}^{M−1} |V_0[m]|² = 1.    (26)

These normalizations on Φ_xx[m] and |V_0[m]|² allow us to interpret the right-hand side of (25) as a weighted average of Φ_xx[m], where the weights are given by |V_0[m]|². Therefore, if Φ_xx[m] is large in the frequency ranges where |V_0[m]|² is also large, the Transient Efficiency will likely be bigger than one, implying that LMS will outperform LMS/Newton under the J metric. On the other hand, if Φ_xx[m] tends to be small in the frequency ranges where |V_0[m]|² is large, the Transient Efficiency will likely be less than one, indicating that LMS will perform worse than LMS/Newton in the J metric sense.
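Both the spectral factorization in the Lemma and the resulting identity v_0^T R v_0 = (1/M) Σ_m Φ_xx[m] |V_0[m]|² are easy to verify numerically. The autocorrelation sequence (built from a hypothetical MA process) and the dimensions below are arbitrary test values, not taken from the paper's simulations.

```python
import numpy as np

L, N = 4, 6
M = N + L                               # satisfies M >= N + L
# arbitrary valid autocorrelation: phi_xx[n] of an MA process with impulse response h
h = np.array([1.0, 0.6, 0.3])
phi = np.correlate(h, h, mode="full")   # support |n| <= 2
phi_full = np.zeros(2 * N + 1)          # phi_full[N + n] = phi_xx[n], zero outside
phi_full[N - 2:N + 3] = phi

# PSD samples: Phi_xx[m] = sum_n phi_xx[n] e^{-2 pi j m n / M}
n = np.arange(-N, N + 1)
m = np.arange(M)
Phi = (phi_full * np.exp(-2j * np.pi * np.outer(m, n) / M)).sum(axis=1).real

# Toeplitz R from R_{k,l} = phi_xx[k - l]
R = np.array([[phi_full[N + k - l] for l in range(L)] for k in range(L)])

# Lemma: R = U Psi U*, with U_{l,m} = e^{2 pi j m l / M} and Psi = diag(Phi / M)
U = np.exp(2j * np.pi * np.outer(np.arange(L), m) / M)
R_fact = (U * (Phi / M)) @ U.conj().T
print(np.max(np.abs(R - R_fact)))

# quadratic-form identity: v^T R v = (1/M) sum_m Phi[m] |V[m]|^2, V = M-point DFT of v
v = np.array([0.5, -1.0, 0.25, 0.3])
V = np.fft.fft(v, M)
lhs = v @ R @ v
rhs = (Phi * np.abs(V)**2).sum() / M
print(lhs, rhs)
```

Note that M ≥ N + L guarantees |k − l − n| < M, which is exactly what makes the delta step in the proof (and hence the factorization) exact rather than approximate.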
An important application of (25) is the special case when the weight vector is initialized to zero, w_0 = 0, hence v_0 = −w*, which results in

|V_0[m]|² = |W*[m]|²,    (27)

where W*[m] is the M-point Discrete Fourier Transform of the optimal or Wiener solution w*, as defined in (23). Therefore, when the weight vector is initialized to zero, the speed of convergence of the LMS algorithm can be estimated by the weighted average of the input PSD with the weights given by the spectrum of the Wiener solution. This observation can be very useful for system identification applications, where prior approximate knowledge of the input PSD and the frequency response of the plant to be identified may be used to estimate LMS's performance before implementing it. For example, if the plant has a low-pass frequency response and the input PSD has very little power in the frequency range above the cut-off frequency, LMS will converge very fast. On the other hand, if the plant has a high-pass nature and the input PSD is small for high frequencies, LMS will be very slow. In channel equalization applications the spectrum of the optimum solution is the opposite of the channel frequency response; hence, if we initialize the weight vector to zero, (25) will be small, implying a slow convergence of LMS, a phenomenon often observed when using LMS in equalization or adaptive inverse control problems. We illustrate these ideas with two examples in the next section.

V. SIMULATIONS

Consider the system identification problem depicted in Fig. 4. The additive plant noise n_k is independent of the input x_k. Both the adaptive filter and the plant consist of 6 taps, so the spectrum of the optimum solution coincides with the plant frequency response. The weight vector is initialized to zero and adapted for 8 iterations. The learning rate is set so that µ Tr(R) = .5. The power spectrum of the input and the frequency response of the plant to be identified are shown in Fig. 5, where Φ_xx[m] and |W*[m]|² were computed as in (15) and (23).
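The zero-initialization prediction rule just described can be sketched as follows: compute the normalized weighted average of the input PSD with the Wiener-solution spectrum as weights, and compare against one. The autocorrelation sequence and the two hypothetical plants below (one low-pass, one high-pass) are invented for illustration and are not the paper's simulation setup.

```python
import numpy as np

def transient_efficiency(phi_xx, w_star, M=256):
    """Predicted Transient Efficiency for zero initial weights (v_0 = -w*).

    phi_xx : one-sided autocorrelation sequence phi_xx[0..N]
    w_star : Wiener solution (here, a hypothetical plant impulse response)
    """
    n = np.arange(1, len(phi_xx))
    m = np.arange(M)
    # PSD samples from the symmetric autocorrelation
    Phi = phi_xx[0] + 2 * (phi_xx[1:] * np.cos(2 * np.pi * np.outer(m, n) / M)).sum(axis=1)
    Phi = Phi / phi_xx[0]                    # normalize so (1/M) sum Phi = lambda_avg = 1
    V2 = np.abs(np.fft.fft(w_star, M))**2
    V2 = V2 * M / V2.sum()                   # normalize so (1/M) sum |V|^2 = 1
    return (Phi * V2).sum() / M              # weighted average of the normalized PSD

phi_lp = np.array([1.0, 0.8, 0.5, 0.2])          # slowly decaying (low-pass) input
w_lowpass = np.array([0.25, 0.25, 0.25, 0.25])   # low-pass plant (hypothetical)
w_highpass = np.array([0.25, -0.25, 0.25, -0.25])  # high-pass plant (hypothetical)

te_match = transient_efficiency(phi_lp, w_lowpass)
te_mismatch = transient_efficiency(phi_lp, w_highpass)
print(te_match, te_mismatch)
```

With matched spectra the weighted average exceeds one (LMS predicted faster than LMS/Newton); with mismatched spectra it falls below one (LMS predicted slower).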
The eigenvalue spread for this exercise was 579. By inspection of the input PSD and the frequency response of the plant, we expect the Transient Efficiency to be bigger than one, indicating that LMS should perform better than LMS/Newton, which is confirmed by their learning curves shown in Fig. 6. The estimate for the Transient Efficiency obtained from the simulation was .6, very close to the approximation of .69 given by (25).

Fig. 4. System Identification Example.

Fig. 5. Input Power Spectrum and Plant Response.

Fig. 6. Learning curves for the System Identification example.

To illustrate what happens if we use zero initial conditions when the input PSD and the Wiener solution spectrum are very distinct, consider the simple equalization problem depicted in Fig. 7. The input to the channel is a white signal; hence the input PSD for the adaptive filter is given by the magnitude squared of the channel frequency response. The channel frequency response and the spectrum of the Wiener solution for the adaptive equalizer are shown in Fig. 8, again computed as in (15) and (23). The eigenvalue spread of the input to the adaptive filter was 75. As intuitively expected, the input
PSD and the Wiener spectrum have opposite shapes, resulting in a small Transient Efficiency and indicating that LMS should perform worse than LMS/Newton, which was corroborated by their learning curves shown in Fig. 9. The estimate for the Transient Efficiency obtained from the simulation was .3, very close to the approximation of .37 given by (25).

Fig. 7. Equalization Example.

Fig. 8. Input PSD and Wiener Solution Spectrum.

Fig. 9. Learning curves for the equalization example.

VI. CONCLUSIONS AND FUTURE WORK

We have shown that when the LMS algorithm is used to train a transversal adaptive filter with jointly stationary input and desired response signals, its transient performance can be assessed by the inner product between the input PSD and the spectrum of the initial weight vector's deviation from the Wiener solution. This implies that when the initial weight vector is set to zero, the transient efficiency of the LMS algorithm can be predicted from approximate knowledge of the input PSD and the frequency response of the Wiener solution. We described how this theory can be applied to system identification and equalization tasks; however, our results can be useful in predicting LMS performance in any application where the input autocorrelation matrix is Toeplitz and spectral information about the input and optimum solution is available. An extension of this analysis is being made to nonstationary conditions, where the spectral characteristics of the input and optimum solutions keep a general shape, say both being always low-pass. A similar analysis based on the Mean Square Deviation (MSD) instead of the MSE is also currently under investigation.

REFERENCES

[1] B. Widrow and M. E. Hoff, Jr., "Adaptive switching circuits," in IRE WESCON Conv. Rec., pt. 4, 1960, pp. 96–104.
[2] B. Widrow, J. M. McCool, M. G. Larimore, and C. R.
Johnson, Jr., "Stationary and nonstationary learning characteristics of the LMS adaptive filter," Proc. IEEE, vol. 64, no. 8, pp. 1151–1162, Aug. 1976.
[3] L. L. Horowitz and K. D. Senne, "Performance advantage of complex LMS for controlling narrow-band adaptive arrays," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-29, no. 3, pp. 722–736, June 1981.
[4] B. Widrow and E. Walach, "On the statistical efficiency of the LMS algorithm with nonstationary inputs," IEEE Trans. Inf. Theory, vol. IT-30, no. 2, pp. 211–221, Mar. 1984.
[5] A. Feuer and E. Weinstein, "Convergence analysis of LMS filters with uncorrelated Gaussian data," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-33, no. 1, pp. 222–230, Feb. 1985.
[6] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Upper Saddle River, NJ: Prentice Hall, 1985.
[7] F. M. Boland and J. B. Foley, "Stochastic convergence of the LMS algorithm in adaptive systems," Signal Processing, vol. 13, pp. 339–352, Dec. 1987.
[8] B. Farhang-Boroujeny, "On statistical efficiency of the LMS algorithm in system modeling," IEEE Trans. Signal Process., vol. 41, no. 5, pp. 1947–1951, May 1993.
[9] D. R. Morgan, "Slow asymptotic convergence of LMS acoustic echo cancelers," IEEE Trans. Speech Audio Process., vol. 3, no. 2, pp. 126–136, Mar. 1995.
[10] S. C. Douglas and W. M. Pan, "Exact expectation analysis of the LMS adaptive filter," IEEE Trans. Signal Process., vol. 43, no. 12, pp. 2863–2871, Dec. 1995.
[11] B. Hassibi, A. H. Sayed, and T. Kailath, "H∞ optimality of the LMS algorithm," IEEE Trans. Signal Process., vol. 44, no. 2, pp. 267–280, Feb. 1996.
[12] M. Reuter and J. R. Zeidler, "Nonlinear effects in LMS adaptive equalizers," IEEE Trans. Signal Process., vol. 47, no. 6, pp. 1570–1579, June 1999.
[13] K. J. Quirk, L. B. Milstein, and J. R. Zeidler, "A performance bound for the LMS estimator," IEEE Trans. Inf. Theory, vol. 46, no. 3, pp. 1150–1158, May 2000.
[14] N. R. Yousef and A. H.
Sayed, "A unified approach to the steady-state and tracking analyses of adaptive filters," IEEE Trans. Signal Process., vol. 49, no. 2, pp. 314–324, Feb. 2001.
[15] H. J. Butterweck, "A wave theory of long adaptive filters," IEEE Trans. Circuits Syst. I: Fundam. Theory Appl., vol. 48, no. 6, pp. 739–747, June 2001.
[16] O. Dabeer and E. Masry, "Analysis of mean-square error and transient speed of the LMS adaptive algorithm," IEEE Trans. Inf. Theory, vol. 48, no. 7, pp. 1873–1894, July 2002.
[17] B. Widrow and M. Kamenetsky, "Statistical efficiency of adaptive algorithms," Neural Networks, vol. 16, no. 5–6, pp. 735–744, June–July 2003.
[18] S. Haykin and B. Widrow, Eds., Least-Mean-Square Adaptive Filters. Hoboken, NJ: Wiley-Interscience, 2003.