
TWO-LAYER LINEAR STRUCTURES FOR FAST ADAPTIVE FILTERING

a dissertation submitted to the department of electrical engineering and the committee on graduate studies of stanford university in partial fulfillment of the requirements for the degree of doctor of philosophy

By Francoise Beaufays
June 1995

© Copyright 1995 by Francoise Beaufays. All Rights Reserved.

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Bernard Widrow (Principal Advisor)

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Thomas Kailath

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Umran Inan

Approved for the University Committee on Graduate Studies:

Abstract

The least mean squares (LMS) algorithm is simple, robust, and is one of the most widely used algorithms for adaptive filtering. Unfortunately, it is highly sensitive to the conditioning of its input autocorrelation matrix: the higher the input eigenvalue spread, the slower the convergence of the adaptive weights. This problem can be overcome by preprocessing the inputs to the filter with a fixed, data-independent transformation that, at least partially, decorrelates the inputs. Typical transformations are the discrete Fourier transform (DFT) and the discrete cosine transform (DCT). The adaptive filter is then adapted using LMS with normalized learning rates. The resulting algorithms are called DFT-LMS and DCT-LMS.

We first give a brief intuitive explanation of the algorithms. We then analyze the performance of DFT/DCT-LMS for first-order Markov inputs. In particular, we show that for Markov-1 inputs of correlation ρ ∈ [0, 1], the eigenvalue spread after DFT and amplitude normalization tends to (1 + ρ)/(1 − ρ) as the size of the filter gets large, while after DCT and amplitude normalization it reduces to (1 + ρ). For comparison, the eigenvalue spread before transformation is asymptotically equal to (1 + ρ)^2/(1 − ρ)^2.

We next show that the DFT/DCT preprocessing stage can advantageously be implemented using the LMS spectrum analyzer, an adaptive filter originally proposed as an alternative way of computing the DFT of a time series. We show that this circuit is extremely robust to the propagation of round-off errors due to finite precision effects. Analytical results and computer simulations are given to support this point. The LMS spectrum analyzer concept is then extended to the cosine transform, and two alternative circuits are proposed to implement the DCT adaptively.

The overall structure composed of the preprocessing and filtering stages forms a fully adaptive two-layer linear filter, which achieves better speed performance than pure LMS while retaining its low computational cost and its extreme robustness.

Acknowledgements

I wish to express my deepest gratitude to my principal advisor, Prof. Bernard Widrow, for his support, his encouragement, and his advice throughout my studies at Stanford. He initiated me into the world of research, and I will always remember his teaching. Even more than for his academic help, I thank Prof. Widrow for his care and understanding. I gratefully acknowledge the professors on my reading and orals committees, Prof. Thomas Kailath, Umran Inan, Anoop Gupta, and Dwight Nishimura, for their time and valuable advice. I also wish to thank Prof. Amir Dembo and Istvan Kollar for their helpful suggestions. Special thanks go to Prof. Stephen Boyd for the great idea of introducing a real espresso machine in ISL! I thank all my friends from the "zoo" - past and present - Michel Bilello, Takeshi Doi (alias Keish), Boyd Fowler, Dana How, Jack Kouloharis, Michael Lehr (best known as Mister Mike), Ming-Chang Liu, Derrick Nguyen, Steven Piche, Gregory Plett, Edward Plumer (Edouard for his fellow French speakers), Raymond Shen, Maryhelen Stevenson, Linda Tomassini, and Eric Wan, for all the instructive discussions and all the fun we had together. I heartily thank Joice DeBolt for her help with everything, for her constant good humour, and for her kindness. I also wish to thank all the folks from the Italian department who added so much to my stay at Stanford, and especially Dina Viggiano, who from a professor became a great friend. Finally, I would like to thank my parents Oscar and Denise for making me understand at a young age how important and fun learning is. Last but not least, I thank my friend Luca for distracting me from my work in such a lovely way.

I acknowledge the financial help of the Belgian American Educational Foundation whose fellowship supported my first year at Stanford, and of the Zonta Foundation whose two fellowships helped me later on. I also acknowledge the Electric Power Research Institute and its project manager, John Maulbetsch, for their sponsorship during most of my PhD, and my current employer, SRI International, for facilitating the end of my dissertation.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Author's Contributions

2 Adaptation Algorithms for Linear Filtering
  2.1 Introduction to Adaptive Filters
  2.2 The LMS Algorithm
    2.2.1 Derivation of the LMS Algorithm
    2.2.2 Properties of the LMS Algorithm
    2.2.3 The Complex LMS Algorithm
    2.2.4 The Block-LMS Algorithm
  2.3 The RLS Algorithm
    2.3.1 Derivation of the RLS Algorithm
    2.3.2 Properties of the RLS Algorithm
  2.4 Transform-Domain LMS Algorithms
    2.4.1 Transform-Domain Block-LMS Algorithms
    2.4.2 Transform-Domain Non-block LMS Algorithms

3 Transform-Domain Algorithms
  3.1 General Description of DFT-LMS and DCT-LMS
  3.2 Intuitive Justifications of DFT/DCT-LMS
    3.2.1 Filtering Approach
    3.2.2 Geometrical Approach
  3.3 Towards an Analytical Study of DFT/DCT-LMS

4 Eigenvalue Spread Computation
  4.1 Introduction
  4.2 Eigenvalues and Eigenvalue Spread with Markov-1 Inputs
  4.3 Eigenvalue Distribution of DFT-LMS with Markov-1 Inputs
  4.4 Eigenvalue Distribution of DCT-LMS with Markov-1 Inputs
  4.5 Conclusion

5 Simulations
  5.1 An Adaptive Modeling Task with Markov-1 Inputs
  5.2 Adaptive Filters with Other Low-Pass Inputs
  5.3 Band-Pass Input Signals
  5.4 Conclusion

6 Implementation of the Sliding-DFT
  6.1 Introduction
  6.2 Non-adaptive Implementations of the Sliding-DFT
    6.2.1 The Straightforward Non-adaptive Implementation
    6.2.2 Shynk's Implementation
  6.3 The LMS Spectrum Analyzer
  6.4 Propagation of Limited-Precision Errors in the Sliding-DFT
  6.5 Behavior of the Sliding-DFT in Floating Point Arithmetic
  6.6 Conclusion

7 Implementation of the Sliding-DCT
  7.1 Introduction
  7.2 Derivation of a Real-Valued LMS Cosine-Spectrum Analyzer
  7.3 Derivation of a Complex LMS Cosine-Spectrum Analyzer
  7.4 Conclusion

8 Conclusions and Future Work
  8.1 Further Work

A DFT-LMS with Markov-1 Inputs
  A.1 Toeplitz Nature of D, and its Analytical Form
  A.2 Asymptotic Equivalence D̃ ~ D
  A.3 Asymptotic Equivalence X̃ ~ R^{-1} D̃

B DCT-LMS with Markov-1 Inputs
  B.1 Analytical Expression of Y ≜ R X̃, and Asymptotic Equivalence Ỹ ~ Y
  B.2 Asymptotic Equivalence C Ỹ C^T ~ diag(b)

C Ongoing Deadbeat Spectral Observers
  C.1 Introduction
    C.1.1 The Spectral Observer
  C.2 Relationship with the LMS Spectrum Analyzer

D Floating Point Representation of Numbers
  D.1 Representation of Numbers in Computers
  D.2 IEEE Single Precision Standard for Floating Point Arithmetic
  D.3 A Simple Procedure for Simulating Low Precision Processors

Bibliography

List of Tables

3.1 Summary of the DFT-LMS and DCT-LMS algorithms (u_k* denotes the complex conjugate of u_k, and u_k* = u_k is real when the data preprocessing is performed by the DCT).
3.2 Summary of the DFT-LMS and DCT-LMS algorithms: amplitude-normalization of the inputs to the LMS filter.
5.1 Eigenvalue spreads for a band-pass signal of increasing bandwidth. N is the size of the adaptive filter, R is the autocorrelation matrix as seen by LMS (i.e. without preprocessing), S_C is the autocorrelation matrix after DCT and amplitude normalization, and S_F is the autocorrelation matrix after DFT and amplitude normalization.
7.1 Schematic description of the DCT spectrum analyzer.

List of Figures

2.1 Linear adaptive filter.
2.2 Linear adaptive filter with tap-delayed inputs.
2.3 Error surface for a 2-weight adaptive filter.
3.1 DFT-LMS and DCT-LMS block diagram.
3.2 DFT-LMS and DCT-LMS block diagram: amplitude-normalization of the inputs to the LMS filter.
3.3 Magnitude of a sample transfer function for a DFT: |H_5(ω)|^2.
3.4 Magnitude of a sample transfer function for a DCT: |H_1(ω)|^2.
3.5 MSE hyperellipsoid (2-D section) (a) before transformation, (b) after DCT, (c) after amplitude normalization.
3.6 Error function ξ(x|x=x_o, y, z) for a sinusoidal input without and with additive white noise (upper and lower plots, respectively).
4.1 Eigenvalue spread of S vs. ρ (DFT-LMS).
4.2 3-D plots of the main matrices involved in the DFT-LMS eigenvalue spread derivation.
4.3 Eigenvalue spread of S vs. ρ (DCT-LMS).
4.4 3-D plots of the main matrices involved in the DCT-LMS eigenvalue spread derivation.
4.5 Similarity between the DCT basis functions and the eigenvectors of a Markov-1 autocorrelation matrix (N = 16, ρ = 0.95).
5.1 Block diagram of an adaptive modeling system.
5.2 Impulse responses of the dynamic system to be modelled (IIR filter), of the LMS adaptive filter, and of the DCT-LMS adaptive filter.
5.3 Comparison between the LMS and the DCT-LMS learning curves for the adaptive modeling application.
5.4 Eigenvalues of the 18x18 autocorrelation matrix of a Markov-1 signal (ρ = 0.9) before (points marked 'o') and after (points marked 'x') preprocessing by the DCT and amplitude normalization.
5.5 Real and imaginary parts of the autocorrelation matrix of a Markov-2 signal (ρ_1 = 0.8, ρ_2 = 0.9) after DFT and amplitude normalization.
5.6 Autocorrelation matrix of a Markov-2 signal (ρ_1 = 0.8, ρ_2 = 0.9) after DCT and amplitude normalization.
5.7 Diagonal of the matrix B = C R C^T (ρ_1 = 0.95, ρ_2 = 0.99).
6.1 The sliding-DFT.
6.2 Comparison between the exact and the modified sliding-DFTs.
6.3 The LMS spectrum analyzer.
6.4 LMS spectrum analyzer vs. non-adaptive sliding-DFT with a random perturbation hitting the system at time k = 1: (a) the sum of the squares of the DFT components plotted versus time, (b) the sum of the squared errors in the DFT components plotted versus time.
6.5 Recursive implementations of the DFT with limited precision.
7.1 Block diagram of the LMS cosine-spectrum analyzer (N = 6).
7.2 Signals used as desired outputs to the LMS filters involved in the complex LMS cosine-spectrum analyzer.
7.3 Block diagram of the complex LMS cosine-spectrum analyzer (N = 6).
8.1 Block diagram of DFT-LMS: a two-layer linear fully adaptive structure.
D.1 IEEE standard for single precision representation of real numbers in floating point arithmetic.

Chapter 1

Introduction

The first steps in the field of adaptive filtering can be traced back to the 1930's - 1940's with Wiener's early work on linear estimation of stochastic processes and the formulation of the famous Wiener-Hopf equations (see e.g. [23, 3]). These equations allow the determination of the linear filter that best maps, in a least squares sense, an input signal into some target output. Depending on the nature of the target (past, current, or future value of a given signal), the task that the filter performs is referred to as smoothing, estimation, or prediction. The Wiener-Hopf equations can handle all three cases. They can be phrased to solve continuous-time problems as well as discrete-time problems. In all cases, the impulse response of the filter that best maps the inputs into the target outputs is a function of the autocorrelation of the inputs and of the cross-correlation between the inputs and the target outputs. Writing and solving the Wiener-Hopf equations thus necessitates the knowledge of the second order statistics (correlation functions) of the input and target output signals. In most practical applications, these statistics are not known beforehand and they need to be estimated from the data samples that are presented to the system. For example, a batch of inputs and target outputs can be observed, the autocorrelation and cross-correlation functions can be estimated from these data, and the impulse response of the optimum filter can be calculated. The filter is then ready for use. The so-called adaptive filters differ from the filters we just described in that their

impulse response (a set of coefficients for discrete-time filters) is adjusted iteratively as data flow through the filter instead of being determined once and for all in a preliminary design phase. This second method has the advantage that the filter parameters can be continuously adjusted to reflect changes that may occur in the statistics of the intervening signals. The algorithms used to adjust the parameters of these adaptive filters are referred to as adaptation algorithms. One such algorithm is the least mean squares or LMS algorithm, which was invented by Widrow and Hoff in the late 1950's - early 1960's [65, 66]. The principle underlying LMS is extremely simple: it consists of defining an error function as the average square difference between the filter output and its target, and of iteratively minimizing this error function over the filter coefficient space, using a simple gradient-based optimization method. Because of its extreme simplicity, this algorithm has an elegance and robustness that are unsurpassed by other adaptation algorithms. Its main disadvantage is its very slow convergence under certain input conditions. As we will show in this thesis, some modifications can be brought to LMS to ameliorate its convergence properties. Another famous adaptation algorithm is the recursive least squares or RLS algorithm (see e.g. [18, 23]). In the RLS algorithm, the filter coefficients are made equal at each iteration to the best approximation of the Wiener solution that can be calculated based on all the data the system has seen so far. In this sense, it is an exact least squares algorithm, as opposed to LMS, whose coefficients don't follow the Wiener solution so closely during adaptation. Because of this property, the RLS algorithm generally displays better convergence performance than LMS. However, it suffers from different problems, such as a lack of robustness for certain input conditions and a higher computational cost. In spite of its slow convergence, LMS is a very popular algorithm, mostly because of its simplicity and robustness. For many decades, it has been a major component in a large number of engineering systems such as, for example, automatic controllers for linear systems (adaptive modeling filters, adaptive inverse controllers, ... [7]), various telephony and communication devices (adaptive interference and echo cancellers, adaptive equalizers, adaptive pulse code modulators, ... [36, 55, 47]), signal detection

circuits (adaptive line enhancer [56]), parameter estimation systems (adaptive spectrum analyzers, adaptive correlators, ... [64]), beamforming circuits [67, 22], and so forth. As new applications were developed and as the adaptation speed required from existing systems increased, various adaptation algorithms were developed to replace LMS, some based on RLS techniques, some based on LMS itself. In this thesis, we will concentrate on the second category. For completeness, we should also mention that interest in LMS recently increased with the advent of feedforward multi-layer neural networks. These neural networks, which can be described as layered arrangements of linear adaptive units followed by nonlinear unimodal functions, typically contain a very large number of parameters that must be adapted. The most famous algorithm for adjusting these parameters is the backpropagation algorithm [62, 5, 63], which is nothing else than a generalization of LMS to this more complicated structure. Backpropagation suffers from the same convergence speed problems as LMS, but remedying this problem turns out to be much more complicated in the case of neural networks, mostly because these circuits contain so many elements that only very simple algorithms can be used if the computational cost is to be kept reasonably low. In this thesis, we will discuss a class of LMS-based algorithms whose inputs are preprocessed by a fixed, data-independent transformation such as the discrete Fourier transform (DFT) or the discrete cosine transform (DCT). The purpose of this transformation is to decorrelate, at least partially, the inputs to the filter. The filter coefficients are then adjusted using the LMS algorithm with normalized learning rates. This has the effect of redistributing the energy of the input signal more or less evenly over all the filter inputs, thereby improving the convergence speed of the filter coefficients. The resulting algorithms are referred to as DFT-LMS and DCT-LMS. The best performance we observe with such algorithms is obtained with DCT-LMS, for low-pass input signals. In order to maintain the computational efficiency and robustness of LMS, we propose that the orthogonalizing transforms, DFT or DCT, be implemented using the so-called LMS spectrum analyzer [64], an adaptive structure that calculates the DFT of a time series efficiently and which is extremely robust to

error propagation.

The outline of the thesis is as follows. In chapter 2, we give a detailed introduction to adaptive linear filters, to LMS and RLS, and to the algorithms we will later focus on: DFT-LMS and DCT-LMS. In chapter 3, we define, model, and justify intuitively these algorithms. In chapter 4, we study in detail the convergence properties of both algorithms with the assumption that the input signals are generated by a first-order Markov system. We present computer simulations illustrating these results in chapter 5. Chapter 6 describes the LMS spectrum analyzer and demonstrates its robustness to noise propagation. In chapter 7, we generalize the LMS spectrum analyzer to the case of the DCT. We conclude in chapter 8 by summarizing the dissertation, adding some comments, and listing several points that we think would be interesting to study further.

1.1 Author's Contributions

The major contributions to knowledge of this work can be summarized as follows.

- Modeling of the DFT-LMS and DCT-LMS algorithms so as to simplify the analytical study of their performance, and development of a mathematical framework where such an analytical study can take place.
- Derivation of asymptotic results on the transformed input eigenvalues and on the speed of convergence of DFT-LMS and DCT-LMS under the assumption of first-order Markov inputs (a small numerical illustration follows this list).
- Mathematical proof and experimental demonstration of the robustness of the LMS spectrum analyzer to noise propagation.
- Generalization of the LMS spectrum analyzer to the DCT (two different structures are proposed).
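The asymptotic eigenvalue-spread result mentioned in the second item can be checked numerically. The following sketch is not part of the original dissertation; it assumes NumPy, an orthonormal DCT-II matrix, and arbitrary values of N and ρ, and it compares the eigenvalue spread of a Markov-1 autocorrelation matrix before and after DCT preprocessing and amplitude (power) normalization with the asymptotic limits ((1 + ρ)/(1 − ρ))^2 and (1 + ρ).

    import numpy as np

    N, rho = 64, 0.9

    # Markov-1 (first-order Markov) autocorrelation matrix: R(l, m) = rho^|l - m|
    n = np.arange(N)
    R = rho ** np.abs(np.subtract.outer(n, n))

    # Orthonormal DCT-II matrix (rows = basis functions)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * np.outer(n, n + 0.5) / N)
    C[0, :] = np.sqrt(1.0 / N)

    # Autocorrelation of the DCT outputs, then amplitude (power) normalization
    B = C @ R @ C.T
    D = np.diag(1.0 / np.sqrt(np.diag(B)))      # divide each branch by its RMS power
    S_C = D @ B @ D

    def spread(M):
        eig = np.linalg.eigvalsh(M)
        return eig[-1] / eig[0]

    # Finite-N values approach the asymptotic limits only slowly; expect rough agreement.
    print("before transform:", spread(R), " limit ((1+rho)/(1-rho))^2 =", ((1 + rho) / (1 - rho)) ** 2)
    print("after DCT+norm :", spread(S_C), " limit (1+rho) =", 1 + rho)

For moderate N the measured spreads only approach these limits, but the reduction of several orders of magnitude already shows the benefit of the preprocessing.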

Chapter 2

Adaptation Algorithms for Linear Filtering

In this chapter, we introduce more formally the concept of an adaptive filter and we briefly summarize the characteristics of two famous adaptation algorithms, LMS and RLS. The conclusions drawn from their comparison will lead to the introduction of another family of adaptation algorithms, the transform-domain LMS algorithms, of which DFT-LMS and DCT-LMS are two examples.

2.1 Introduction to Adaptive Filters

A discrete-time linear adaptive combiner [65, 66] of length N is shown in Fig. 2.1. At time k, a set of signals, x_k(0), x_k(1), ..., x_k(N-1), are input to the combiner. The combiner coefficients, w_k(0), w_k(1), ..., w_k(N-1), are referred to in this context as the weights of the combiner. These coefficients can be adjusted by an adaptation algorithm so as to make the output y_k resemble a given desired output signal that we denote d_k. In another set-up, the inputs could come from a tap-delay line as shown in Fig. 2.2. This second structure is a very common particular case of the general diagram of Fig. 2.1. It is used essentially for prediction and filtering applications, and it is the basic structure on which this thesis will elaborate.

Figure 2.1: Linear adaptive filter.

The task of the adaptation algorithm is to iteratively minimize some error criterion, where by error we mean a measure of how distant the actual outputs are from the desired outputs. Typically, the error criterion is chosen to be the expectation of the square of the difference e_k between the desired and the actual outputs,

    ξ(w) = E[e_k^2]                 (2.1)
         = E[(d_k - y_k)^2],        (2.2)

where the expectation E[.] is taken over the input space. Let w_k ≜ [w_k(0) w_k(1) ... w_k(N-1)] be the weight vector, and x_k ≜ [x_k(0) x_k(1) ... x_k(N-1)] be the input vector. For tap-delayed inputs, x_k = [x_k x_{k-1} ... x_{k-N+1}]. The output signal y_k can then be expressed as the dot product of the weight and the input vectors, y_k = w_k^T x_k, where the superscript T denotes the vector transpose. The mean square error (MSE) ξ(w) defined in Eq. 2.2 can be expanded as

    ξ(w) = E[(d_k - y_k)^2]                     (2.3)
         = E[d_k^2] + w^T R w - 2 p^T w,        (2.4)

Figure 2.2: Linear adaptive filter with tap-delayed inputs.

where R is the autocorrelation matrix of the inputs,

    R ≜ E[x_k x_k^T],               (2.5)

and p is the cross-correlation between the inputs and the desired outputs,

    p ≜ E[x_k d_k].                 (2.6)

When the inputs are tap-delayed, the matrix R is Toeplitz, i.e.

    R(l, m) = R(|l - m|)   for all l, m,        (2.7)

a property that will be used throughout this thesis. The error ξ(w) is a quadratic function of the weights and assumes the shape of a hyperparaboloid, as illustrated in Fig. 2.3 for a 2-weight case. The sections of the error surface, ξ = constant, are hyperellipsoids (ellipses in the 2-D case). The orientation and the shape of these ellipsoids depend on the eigenvalues of the input

autocorrelation matrix R. It is easy to show that the axes of the hyperellipsoids are aligned with the eigenvectors of R and that their lengths are inversely proportional to the square roots of the corresponding eigenvalues. In the 2-D case, if the two eigenvalues are very different the ellipses are thin and long, while if the eigenvalues are equal the ellipses degenerate into circles.

Figure 2.3: Error surface for a 2-weight adaptive filter (error contours over the plane (w(0), w(1)), minimum error ξ_min, and optimal solution w_opt).

The weight vector that minimizes the error ξ(w) corresponds to the "bottom of the bowl" (see Fig. 2.3). It is obtained mathematically by taking the derivative of ξ(w) with respect to the weights, setting it to zero, and solving for w. The solution

w_opt, which is a special case of the Wiener solution (see footnote 1), is equal to

    w_opt = arg min_w ξ(w) = R^{-1} p.          (2.8)

The minimum achievable mean square error is obtained by replacing w with w_opt in Eq. 2.4:

    ξ_min = E[d_k^2] - p^T R^{-1} p = E[d_k^2] - p^T w_opt.      (2.9)

The error function ξ(w) can thus be rewritten as

    ξ(w) = ξ_min + (w - w_opt)^T R (w - w_opt).      (2.10)

This is the function that the adaptation algorithm has to minimize. Typically, adaptation algorithms work as follows. The filter weights are initially set to zero or to small - possibly random - values. Then, at each iteration, the weights are adjusted so as to travel down the error surface and to eventually reach its minimum w_opt or a vicinity of it. The speed at which this happens, the precision of the solution after convergence, the overall robustness of the algorithm, its simplicity, its capacity to deal with non-stationary inputs and/or desired outputs, the number of calculations required per iteration, ... are all factors that must be taken into account when comparing different adaptation algorithms. In the following sections, we will successively discuss three families of algorithms: the least mean squares (LMS) algorithms, the recursive least squares (RLS) algorithms, and the transform-domain LMS algorithms.

Footnote 1: The filter that best maps, in a least squares sense, an input signal into a given desired output is in general of infinite length [3]. It therefore requires the resolution of an infinite set of equations, which may be impractical in computer implementations. Limiting the number of filter taps to N constrains the solution w_opt to be of finite length. It makes the computation of w_opt more tractable, but it also somewhat increases the minimum achievable error of the filter, ξ_min. Note that w_opt is in general different from the solution that would be obtained by solving the infinite set of equations and then truncating the solution after the N-th component. In general, it is assumed that N is chosen large enough to make these effects negligible.
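As a small numerical companion to Eqs. 2.5 to 2.10, the following sketch (not taken from the dissertation; the system-identification setup, signal length, and noise level are arbitrary assumptions) estimates R and p from data for a tap-delayed filter and computes the Wiener solution and the minimum achievable error.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 4                                   # filter length (arbitrary choice)
    K = 50_000                              # number of samples

    # Hypothetical setup: d_k is an unknown FIR filter applied to x_k, plus a little noise
    x = rng.standard_normal(K)
    w_true = np.array([1.0, -0.5, 0.25, 0.1])
    d = np.convolve(x, w_true, mode='full')[:K] + 0.01 * rng.standard_normal(K)

    # Tap-delayed input vectors x_k = [x_k, x_{k-1}, ..., x_{k-N+1}]
    X = np.stack([np.concatenate([np.zeros(i), x[:K - i]]) for i in range(N)], axis=1)

    # Sample estimates of R = E[x_k x_k^T] (Toeplitz for tap-delayed inputs) and p = E[x_k d_k]
    R = X.T @ X / K
    p = X.T @ d / K

    w_opt = np.linalg.solve(R, p)           # Wiener solution, Eq. 2.8
    xi_min = np.mean(d ** 2) - p @ w_opt    # minimum achievable MSE, Eq. 2.9
    print("w_opt  =", w_opt)                # close to w_true
    print("xi_min =", xi_min)               # close to the added noise power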

2.2 The LMS Algorithm

The Least Mean Squares (LMS) algorithm invented by Widrow and Hoff [65, 66] is the simplest and one of the most widely used adaptation algorithms. In this section, we summarize the main features of LMS, insisting only on the properties that will influence the remainder of this thesis. For more details, we refer the reader to Widrow's textbook [69] and to the original publications mentioned above.

2.2.1 Derivation of the LMS Algorithm

The LMS algorithm minimizes the error function using a stochastic steepest descent approach, that is, at each iteration, the weights are updated proportionally to an estimate of the error gradient. Let ∇_k denote the true error gradient at time k, and ∇̂_k its estimate. The true gradient of the error function is given by

    ∇_k = dξ(w_k)/dw_k                              (2.11)
        = d E[(d_k - w_k^T x_k)^2] / dw_k.          (2.12)

The gradient estimate, ∇̂_k, is simply obtained by omitting the expectation in Eq. 2.12, hence the name "stochastic gradient":

    ∇̂_k = d(d_k - w_k^T x_k)^2 / dw_k              (2.13)
         = -2 (d_k - w_k^T x_k) x_k                 (2.14)
         = -2 e_k x_k.                              (2.15)

By adjusting the weights proportionally to the stochastic gradient instead of the true gradient, LMS follows on the error surface a zig-zag path whose average course is the exact steepest descent path. The main motivation behind this stochastic approximation is to avoid the cost of computing an expectation over the whole input space at each iteration.
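The claim that the stochastic gradient follows the true gradient on average can be illustrated numerically. In the following sketch (not from the dissertation; the white-noise input and the FIR target are hypothetical choices for which R and p are known analytically), the sample average of -2 e_k x_k is compared with the gradient 2 R w - 2 p of the MSE of Eq. 2.4.

    import numpy as np

    rng = np.random.default_rng(1)
    N, K = 3, 200_000
    w_true = np.array([0.7, -0.3, 0.2])

    # Hypothetical stationary setup: unit-variance white input in a tap-delay line,
    # so R = E[x_k x_k^T] = I and p = E[x_k d_k] = R w_true = w_true analytically.
    x = rng.standard_normal(K)
    X = np.stack([np.concatenate([np.zeros(i), x[:K - i]]) for i in range(N)], axis=1)
    d = X @ w_true + 0.05 * rng.standard_normal(K)
    R, p = np.eye(N), w_true

    w = np.array([0.1, 0.0, -0.2])          # some fixed weight vector
    e = d - X @ w                           # output errors at these weights

    grad_true = 2 * R @ w - 2 * p                          # gradient of the MSE of Eq. 2.4
    grad_stoch = (-2 * X * e[:, None]).mean(axis=0)        # sample average of -2 e_k x_k (Eq. 2.15)
    print(grad_true, grad_stoch)            # the averaged stochastic gradient approximates the true one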

The LMS weight update is thus given by the simple formula:

    w_{k+1} = w_k - μ ∇̂_k                  (2.16)
            = w_k + 2μ e_k x_k,             (2.17)

where the learning rate, μ, is a constant that governs the speed of convergence of the algorithm: large μ's allow the algorithm to converge fast (bigger steps are taken towards the bottom of the bowl), but large μ's also lower the precision of the weight vector after convergence has been reached (the weight vector keeps on wandering in a large neighborhood of the optimum solution). Moreover, large μ's can create instability problems. Choosing the right value for μ is an important and difficult task.

2.2.2 Properties of the LMS Algorithm

Although it may seem counter-intuitive, the very simplicity of LMS makes its exact analysis quite complicated. Most of the published proofs of convergence of LMS are based on the average behavior of the algorithm rather than on its stochastic behavior. The early theory of LMS developed by Widrow and Hoff considers the convergence in the mean of the weight vector. Later studies have also included convergence in the mean square [57, 4, 23]. For our purposes, the former will suffice, and we will limit ourselves to a summary of Widrow's main results, referring the reader to Widrow's and Haykin's textbooks [69, 23] for more details. Widrow based his analysis on the exact steepest descent algorithm:

    w_{k+1} = w_k - μ ∇_k.                  (2.18)

The exact error gradient at time k can be expressed as

    ∇_k = dξ(w_k)/dw_k                      (2.19)
        = 2 R w_k - 2 p                     (2.20)
        = 2 R (w_k - w_opt).                (2.21)
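A minimal sketch of the LMS filter defined by Eqs. 2.16 and 2.17 is given below (not from the dissertation; the filter length, learning rate value, and test signal are illustrative assumptions).

    import numpy as np

    def lms(x, d, N, mu):
        """Tap-delayed LMS filter: returns the final weights and the output errors."""
        w = np.zeros(N)                    # weights initially set to zero
        x_vec = np.zeros(N)                # tap-delay line x_k = [x_k, x_{k-1}, ..., x_{k-N+1}]
        errors = np.empty(len(x))
        for k in range(len(x)):
            x_vec = np.concatenate(([x[k]], x_vec[:-1]))
            y = w @ x_vec                  # filter output y_k = w_k^T x_k
            e = d[k] - y                   # output error e_k
            w = w + 2 * mu * e * x_vec     # LMS update, Eq. 2.17
            errors[k] = e
        return w, errors

    # Illustrative use: identify a short FIR system driven by white noise
    rng = np.random.default_rng(2)
    x = rng.standard_normal(20_000)
    d = np.convolve(x, [1.0, -0.5, 0.25], mode='full')[:len(x)]
    w, errors = lms(x, d, N=3, mu=0.01)
    print(w)                               # converges towards [1.0, -0.5, 0.25]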

Let v_k be the translated weight vector

    v_k ≜ w_k - w_opt.                      (2.22)

The weight update formula of Eq. 2.18 can be rewritten in terms of translated weights as

    v_{k+1} = (I - 2μR) v_k,                (2.23)

where I is the identity matrix. The next step consists in performing a rotation of the translated weights,

    v'_k ≜ Q^T v_k,                         (2.24)

where the unitary matrix Q contains the eigenvectors of R, that is, R = Q Λ Q^T, where Λ is a diagonal matrix containing the eigenvalues of R. Eq. 2.23 can then be rewritten as

    v'_{k+1} = (I - 2μΛ) v'_k.              (2.25)

This last formula can be iterated from time 0 to time k to give the transformed weight vector at time k:

    v'_k = (I - 2μΛ)^k v'_0,                (2.26)

where v'_0 is the initial value of the transformed weight vector. Also the error at time k can be expressed in the transformed weight space:

    ξ_k = ξ_min + v'_k^T Λ v'_k.            (2.27)

By introducing the newly found formula for the weight vector (Eq. 2.26) in the error

function and with a little algebra, one finds

    ξ_k = ξ_min + Σ_{n=0}^{N-1} v'_0(n)^2 λ_n (1 - 2μλ_n)^{2k}.      (2.28)

The implications of this equation are extremely important. First, we see that the error decreases as a sum of geometrical series, or exponentials if we interpret the adaptation as a continuous process. Each exponential corresponds to one weight and evolves independently of the others. This is due to the decorrelation of the weights resulting from their transformation by the matrix Q. The time constants of the exponentials are given by

    τ_n = 1 / (4μλ_n),                      (2.29)

where λ_n is the eigenvalue associated with the n-th weight. Small eigenvalues (low energy modes) correspond to long time constants and slow down the overall convergence of the adaptive filter. High eigenvalues, on the other hand, can cause the modulus of (1 - 2μλ_n) to be larger than one, thereby causing the algorithm to diverge. Of course this divergence can be avoided by reducing the learning rate, but decreasing μ will have the direct consequence of slowing down even further those modes that are already slow because they correspond to small eigenvalues. Input signals with high eigenvalue spread will therefore always result in poor convergence performance. Clearly, the problem faced by LMS when its input eigenvalues are very spread apart is due to the fact that it has a single learning rate that must satisfy all the weights. The problem would be greatly reduced if we could associate to each decorrelated weight v'(n) a specific learning rate μ_n such that the product μ_n λ_n is more or less constant over n. Note however that this reasoning holds only in a weight space that has been previously orthogonalized (i.e. it holds in the weight space v' but not in the weight spaces v or w). Without this preliminary decorrelating step, each weight would be associated with a combination of modes in the error function instead of just one mode, and the learning rates μ_n could not be chosen efficiently. Another important property to discuss is the precision of the steady-state solution

of LMS. When it converges, LMS does not reach the optimal solution w_opt exactly; rather, it reaches a vicinity of the optimum solution where it keeps on wandering forever. This is due to the fact that the weight update in LMS is proportional to the stochastic error gradient instead of the true error gradient. At the bottom of the error function, the true gradient is equal to zero, but the stochastic gradient (i.e. the product of the input by the output error) is not necessarily equal to zero (except of course in the degenerate case where the minimum achievable error, ξ_min, is equal to zero). The steady-state solution found by LMS is thus noisy. Its precision is characterized by a quantity called misadjustment, which is equal to the variance of the steady-state solution normalized by the minimum achievable error ξ_min. It can be shown [69] that the misadjustment is in first approximation given by

    Misadjustment = μ Trace(R).             (2.30)

This formula shows that improving the precision of the steady-state solution can easily be achieved by decreasing the learning rate. However, this has the inconvenience of slowing down the adaptation process. A better solution, which is often used in practice, consists in starting the adaptation with a large μ and decreasing it progressively as the weights converge (see [14]). The misadjustment of LMS should therefore not be seen as a major limitation of the algorithm. Another way of interpreting the non-zero misadjustment of LMS is to note that the algorithm has no memory. Other algorithms such as RLS accumulate information about the data and use this information to progressively reduce the uncertainty about the solution. LMS does not. While for steady-state convergence this is a disadvantage because it causes steady-state misadjustment, the same feature turns out to be advantageous in non-stationary environments. By not accumulating any information about the data, LMS can more easily track a time-varying solution than RLS, whose weights are delayed in their evolution by the obsolete information they have accumulated (see e.g. [23]).
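As a small illustration of Eqs. 2.29 and 2.30 (not from the dissertation; the Markov-1 autocorrelation matrix and the learning rate are arbitrary choices), the per-mode time constants and the first-order misadjustment follow directly from the eigenvalues of R.

    import numpy as np

    mu = 0.005                                  # arbitrary learning rate
    rho, N = 0.9, 8                             # Markov-1 input example, arbitrary choices
    n = np.arange(N)
    R = rho ** np.abs(np.subtract.outer(n, n))  # input autocorrelation matrix

    lam = np.linalg.eigvalsh(R)                 # eigenvalues lambda_n
    tau = 1.0 / (4 * mu * lam)                  # per-mode time constants, Eq. 2.29
    misadjustment = mu * np.trace(R)            # first-order misadjustment, Eq. 2.30

    print("eigenvalue spread :", lam[-1] / lam[0])
    print("time constants    :", np.sort(tau))  # the slow modes come from the small eigenvalues
    print("misadjustment     :", misadjustment)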

In conclusion, aside from its slow convergence when the inputs are highly correlated, LMS displays excellent properties. Its weight update requires only O(N) computations per iteration, it is by far the simplest algorithm to implement, it is robust to error propagation in limited precision implementations [9], and it tracks non-stationarities better than other adaptation algorithms [5, 38, 7]. These properties have greatly contributed to the popularity of LMS, although the major complaint about the algorithm remains its slow convergence. In applications where it is critical to achieve fast convergence, LMS is often not a viable solution. In this respect, exact least squares algorithms such as RLS may be more attractive. We will see, however, that the convergence speed of LMS can be greatly ameliorated if an adequate preprocessing is effected on its inputs. This will bring the discussion to the family of transform-domain LMS algorithms. In order to facilitate the later description of these transform-domain algorithms, we would like to briefly introduce two extensions of LMS: complex LMS and block-LMS, after which we will turn our attention to RLS algorithms, and then to transform-domain LMS algorithms.

2.2.3 The Complex LMS Algorithm

Complex LMS [68] is the straightforward extension of real LMS to the case where the inputs, the adaptive weights, and the desired outputs are allowed to take on complex values. The error function to be minimized is defined as the expectation of the square modulus of the output error,

    ξ(w) = E[e_k e_k*],                     (2.31)

where e_k* is the complex conjugate of e_k. Applying stochastic steepest descent to ξ, one finds the following weight update formula (see footnote 2):

    w_{k+1} = w_k + 2μ e_k x_k*,            (2.32)

Footnote 2: A formal derivation of the algorithm is given in [68]. An intuitive derivation can be obtained by considering a few particular cases such as real inputs with imaginary desired outputs, etc.

with e_k = d_k - y_k = d_k - w_k^T x_k, and where the weights w_k(i) are complex numbers. Equivalently, the weight update can be formulated as

    w_{k+1} = w_k + 2μ e_k* x_k,            (2.33)

with e_k = d_k - y_k = d_k - w_k^H x_k, where the superscript H denotes the hermitian, i.e. the transpose conjugate [23]. The properties of complex LMS are very similar to those of real LMS, although slight differences can be observed in terms of mean square convergence and stability performance [24].

2.2.4 The Block-LMS Algorithm

Block-LMS is a partially batched extension of LMS [12, 1]. The instantaneous error gradient, ∇̂, is computed at each iteration as in regular non-block LMS, but rather than being used right away to update the weights, it is buffered for a certain number, L, of iterations. The weight update takes place every L iterations and is made proportional to the sum of the last L instantaneous error gradients:

    w(k + 1) = w(k) + 2μ Σ_{l=0}^{L-1} x(kL + l) e(kL + l).     (2.34)

If L = 1, the weight update of Eq. 2.34 describes regular LMS; if L is equal to the number of input patterns available for training, Eq. 2.34 describes a batched implementation of LMS. In practice, L is often made equal to N, the length of the filter. Having L > 1 essentially modifies the nature of the error gradient, making it less stochastic and closer to the true error gradient, ∇. This has the consequence of smoothing the learning curve, but it also affects other properties of the algorithm. For example, buffering the gradient for L iterations may hurt the tracking capabilities of the algorithm in non-stationary environments because of the delay it introduces in the weight update. Also, the maximum learning rate that can be used without encountering stability problems is L times smaller than the one that could be used, under

identical input conditions, with regular LMS [16]. This, of course, may adversely influence the convergence speed of the algorithm. The advantage of block-LMS comes from the fact that the block-gradient in Eq. 2.34 can be seen as a linear correlation between the input signal and the output error signal. It can therefore be implemented efficiently by taking the Fourier transforms of the two signals, computing their product, and inverse transforming the result [12, 1]. The computational efficiency of this method counterbalances the slowness of the weight convergence. We will see in section 2.4 that a whole class of transform-domain algorithms is based on this principle.

2.3 The RLS Algorithm

The Recursive Least Squares (RLS) algorithm implements recursively an exact least squares solution [18, 23]. We saw previously that the Wiener solution for an adaptive filter of finite length is given by w_opt = R^{-1} p, where R is the autocorrelation matrix of the inputs and p is the cross-correlation between inputs and desired outputs. At each time step, RLS estimates recursively R^{-1} and p based on all past data and computes the weight vector as w_k = R_k^{-1} p_k, which is thus the best to-date approximation to the Wiener solution.

2.3.1 Derivation of the RLS Algorithm

At time k, the best estimates of R and p are given by

    R_k = Σ_{i=1}^{k} λ^{k-i} x_i x_i^T         (2.35)
        = λ R_{k-1} + x_k x_k^T,                (2.36)

    p_k = Σ_{i=1}^{k} λ^{k-i} x_i d_i           (2.37)
        = λ p_{k-1} + x_k d_k,                  (2.38)

where the constant λ ∈ [0, 1] is generally chosen close to one but slightly smaller for stability reasons. The best estimate of the optimum weight vector (see footnote 3) w_opt is given by

    w_k = R_k^{-1} p_k.                         (2.39)

Applying the matrix inversion lemma (see footnote 4) to Eq. 2.39, we get:

    R_k^{-1} = λ^{-1} R_{k-1}^{-1} - [λ^{-2} R_{k-1}^{-1} x_k x_k^T R_{k-1}^{-1}] / [1 + λ^{-1} x_k^T R_{k-1}^{-1} x_k]      (2.40)
             = λ^{-1} R_{k-1}^{-1} - λ^{-1} K_k x_k^T R_{k-1}^{-1},                                                          (2.41)

where K_k, the gain vector, is defined by

    K_k = G_k x_k,                              (2.42)

and

    G_k = λ^{-1} R_{k-1}^{-1} / [1 + λ^{-1} x_k^T R_{k-1}^{-1} x_k].                (2.43)

Introducing Eqs. 2.38 and 2.41 in Eq. 2.39, and after some algebraic manipulation, we find the weight update formula

    w_k = w_{k-1} + K_k α_k                     (2.44)
        = w_{k-1} + G_k α_k x_k,                (2.45)

where

    α_k = d_k - w_{k-1}^T x_k.                  (2.46)

Footnote 3: Alternatively, the weight update formula can be found by minimizing the error function ξ_k = Σ_{i=1}^{k} λ^{k-i} e_i^2, where e_i is the output error at time i, and where the sum over i causes the weight vector at time k to take into account all the past history (i = 1 to k) of the system.

Footnote 4: Let A and B be two N x N positive definite matrices, C an N x M matrix, and D a positive definite M x M matrix. The matrix inversion lemma states that if A = B + C D^{-1} C^T, then A^{-1} = B^{-1} - B^{-1} C (D + C^T B^{-1} C)^{-1} C^T B^{-1} (see e.g. [29]).
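A minimal sketch of the RLS recursion of Eqs. 2.40 to 2.46 follows (not from the dissertation; the forgetting factor, the initialization constant δ, and the test signal are illustrative assumptions; P_k denotes the running estimate of R_k^{-1}).

    import numpy as np

    def rls(x, d, N, lam=0.999, delta=0.01):
        """Tap-delayed RLS filter following Eqs. 2.40-2.46 (P plays the role of R_k^{-1})."""
        w = np.zeros(N)                         # weights initialized to zero
        P = np.eye(N) / delta                   # R_0^{-1} = delta^{-1} I
        x_vec = np.zeros(N)
        for k in range(len(x)):
            x_vec = np.concatenate(([x[k]], x_vec[:-1]))
            Px = P @ x_vec
            K = Px / (lam + x_vec @ Px)         # gain vector K_k (Eqs. 2.42-2.43, rearranged)
            alpha = d[k] - w @ x_vec            # a priori error, Eq. 2.46
            w = w + K * alpha                   # weight update, Eqs. 2.44-2.45
            P = (P - np.outer(K, Px)) / lam     # R_k^{-1} update, Eq. 2.41
        return w

    # Illustrative use: system identification with a correlated (Markov-1) input
    rng = np.random.default_rng(3)
    u = rng.standard_normal(5_000)
    x = np.zeros_like(u)
    for k in range(1, len(u)):                  # first-order Markov (low-pass) input, rho = 0.9
        x[k] = 0.9 * x[k - 1] + u[k]
    d = np.convolve(x, [1.0, -0.5, 0.25], mode='full')[:len(x)]
    print(rls(x, d, N=3))                       # converges quickly towards [1.0, -0.5, 0.25]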

Equations 2.43, 2.45, and 2.46 summarize the algorithm. The weights are typically initialized to zero, while the R_k^{-1} matrix is initialized by R_0^{-1} = δ^{-1} I, where δ is a small positive constant and I is the identity matrix.

2.3.2 Properties of the RLS Algorithm

Note the formal resemblance between RLS and LMS. The RLS weight vector is updated proportionally to the product of the current input x_k and some error signal α_k, as in LMS. The error in RLS is defined a priori in the sense that it is based on the old weight vector w_{k-1}, whereas in LMS the error e_k = d_k - w_k^T x_k is computed a posteriori, that is, based on the current weight vector, w_k. A more important difference results from the fact that the constant learning rate μ in LMS is replaced in RLS by a matrix G_k that is time- and data-dependent. In this respect, RLS can be thought of as a sort of LMS algorithm having a matrix-controlled optimal learning rate (optimal because the weight vector at each iteration is the best achievable one in the least mean square sense, given all the past data samples). The weight update formula of Eq. 2.45 could also be rewritten as

    w_k = w_{k-1} + γ_k R_{k-1}^{-1} α_k x_k,               (2.47)

with

    γ_k = λ^{-1} / [1 + λ^{-1} x_k^T R_{k-1}^{-1} x_k].     (2.48)

This formulation places in evidence the decorrelation operation performed by RLS on the input data: the stochastic gradient α_k x_k is premultiplied by an estimate of the inverse autocorrelation matrix, R_{k-1}^{-1}, which has the effect of decorrelating the inputs to the adaptive filter. This decorrelation, along with the specific expression of the learning rate, reduces the sensitivity of the algorithm to its input eigenvalue spread, and enhances its convergence properties with respect to LMS. This premultiplication by R^{-1} can unfortunately hurt the stability of the filter if the matrix R is ill-conditioned or close to being ill-conditioned, a situation that arises each time

some of the filter inputs are linearly dependent or, in other words, each time the filter contains more weights than necessary. Another issue to be discussed is the precision of the steady-state solution. The asymptotic misadjustment in RLS can be arbitrarily decreased by increasing the parameter λ up to one. Intuitively, λ → 1 causes the entries of the matrix R_k to grow as k increases (see Eq. 2.35), and forces γ_k (Eq. 2.48) to gradually decrease down to zero (see [23] for a more formal justification). Note that choosing λ = 1 does not impair the convergence properties: the convergence speed of RLS is roughly independent of λ. As a counterpart, RLS displays poor tracking capabilities in non-stationary environments [5, 38, 7]. Intuitively, the weight vector in RLS is based on all the past history of the input signal. If the statistics of this signal change over time, it will be harder for RLS to adjust to these changes than for LMS, whose weight update is based solely on the current stochastic gradient. In addition, RLS suffers from a high computational complexity: due to matrix-vector multiplications, O(N^2) operations are required for each weight update, whereas only O(N) are necessary with LMS. In conclusion, while RLS has the advantage of a fast convergence rate and low sensitivity to the input eigenvalue spread, it is computationally intensive, prone to numerical instabilities, and inefficient at tracking non-stationarities when compared to LMS. LMS is intrinsically slow because it does not decorrelate its inputs prior to adaptive filtering, but preprocessing the inputs by an estimate of the inverse input autocorrelation matrix in the fashion of RLS leads to the problems cited above. One solution, which we will further discuss in this thesis, consists of preprocessing the inputs to the LMS filter with a fixed transformation that does not depend on the actual input data. The decorrelation will only be approximate, but the computational cost will remain of O(N), and the robustness and tracking capability of LMS will be preserved. These algorithms are generally called transform-domain LMS algorithms or frequency-domain LMS algorithms. Before leaving this section, we should mention that many other exact least squares algorithms have been studied in the literature. The main motivation for doing so was

to reduce the computational complexity of RLS and improve its robustness while maintaining its convergence characteristics. The most famous algorithms in this family are those based on the so-called lattice structure [42, 31, 32, 33]. These algorithms take advantage of the fact that in a Toeplitz matrix such as the autocorrelation matrix R, only N out of N^2 elements are distinct. This observation can be used to establish a one-to-one correspondence between these elements and so-called reflection coefficients and, with some algebraic manipulation, to reduce the computational cost of the algorithm from O(N^2) to O(N). Another characteristic that makes lattice filters very popular is the ease with which their stability can be monitored (see e.g. [23]). It has been observed, however, that finite arithmetic effects can severely degrade the algorithm performance [9]. The price for the improvement brought by the lattice structure is the increased complexity of the algorithm in terms of number of equations to be implemented, number of variables to be stored, and general complication of the algebra. Transform-domain LMS algorithms may be seen as a sort of intermediate solution that tries to combine the advantages of both LMS and RLS.

2.4 Transform-Domain LMS Algorithms

The name "transform-domain LMS algorithms" is somewhat ambiguous. It has been used in the literature to designate two different categories of algorithms: block-LMS algorithms implemented in the frequency domain, and non-block LMS algorithms whose inputs are transformed into the frequency domain prior to filtering. In this section, we give a quick overview of both families of algorithms. Since the rest of this thesis is devoted to non-block algorithms, these will be further detailed in the next chapter.

2.4.1 Transform-Domain Block-LMS Algorithms

This category of algorithms builds upon the Fourier implementation of the block-LMS algorithm (see section 2.2.4). Assume that the outputs, desired outputs, and output

errors are buffered into vectors. The output vector as well as the error gradient can be estimated in the frequency domain rather than in the time domain, since they both result from the convolution of two vectors. Because the product of two Fourier series corresponds in the time domain to a circular convolution rather than a linear one, some constraints must be implemented to restore the linearity of the convolution (see e.g. [44]). Not implementing these constraints results in wrap-around effects that affect the performance of the algorithms (biased optimal solution, extra noise in the steady-state solution, etc.) [1, 46]. Two methods have been described in the literature that calculate the linear convolution of two signals by taking the product of their Fourier transforms: the overlap-save method and the overlap-add method [44, 11]. These methods typically require one Fourier transform, one inverse Fourier transform, and a few appropriate vector manipulations (zero-padding, truncation, concatenation of vectors, ...) to calculate one convolution. Since two convolutions are calculated at each weight update (one for the error gradient and one for the filter output), frequency-domain block-LMS algorithms are quite involved (see e.g. [52] for a detailed description of the algorithms). The main advantage of these algorithms, in addition to their computational efficiency, is their potentially very fast convergence. By attributing to each transformed weight a learning rate that is inversely proportional to the energy of the corresponding input, the convergence of the algorithms can be greatly improved [15, 54].

2.4.2 Transform-Domain Non-block LMS Algorithms

This family of algorithms was first introduced by Narayan under the name transform-domain LMS algorithms [43]. Narayan's structure consists simply of an LMS filter whose inputs are preprocessed by a DFT and whose learning rates (one per weight) are adjusted as a function of the input energy levels (a full description of the algorithm is given in the next chapter). Referring back to our discussions on LMS and RLS (sections 2.2 and 2.3, respectively), the DFT is used to decorrelate the inputs of the LMS filter. The learning rate normalization is then used to optimize the convergence speed of each individual weight. Since the DFT is not a perfect decorrelator, this structure decorrelates the inputs only approximately; nevertheless, it converges considerably faster than pure LMS while keeping the computational cost at O(N) and preserving the robustness and tracking capability of LMS.
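A minimal sketch of such a transform-domain (non-block) LMS filter is given below. It is not taken from the dissertation: the tap-delayed input vector is simply multiplied by a unitary DFT matrix (rather than computed with a sliding DFT), each transformed input receives a learning rate inversely proportional to a running estimate of its power, and the weights are adapted with complex LMS. The smoothing constant beta and the small constant eps guarding the division are illustrative assumptions.

    import numpy as np

    def dft_lms(x, d, N, mu=0.05, beta=0.99, eps=1e-8):
        """Sketch of a DFT-LMS style transform-domain adaptive filter (complex LMS after a DFT)."""
        F = np.fft.fft(np.eye(N)) / np.sqrt(N)   # unitary DFT matrix applied to the tap-delay line
        w = np.zeros(N, dtype=complex)           # weights of the LMS layer (complex)
        power = np.ones(N)                       # running power estimate of each DFT output
        x_vec = np.zeros(N)
        errors = np.empty(len(x))
        for k in range(len(x)):
            x_vec = np.concatenate(([x[k]], x_vec[:-1]))
            u = F @ x_vec                        # transformed (partially decorrelated) inputs u_k
            power = beta * power + (1 - beta) * np.abs(u) ** 2
            y = np.real(w @ u)                   # filter output
            e = d[k] - y                         # output error
            w = w + 2 * mu * e * np.conj(u) / (power + eps)   # per-weight normalized complex LMS update
            errors[k] = e
        return errors

    # Illustrative use: correlated (Markov-1) input, for which plain LMS converges slowly
    rng = np.random.default_rng(4)
    u = rng.standard_normal(20_000)
    x = np.zeros_like(u)
    for k in range(1, len(u)):
        x[k] = 0.95 * x[k - 1] + u[k]
    d = np.convolve(x, [1.0, -0.5, 0.25, 0.1], mode='full')[:len(x)]
    e = dft_lms(x, d, N=8)
    print(np.mean(e[-2000:] ** 2))               # small residual MSE after convergence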


More information

Contents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces

Contents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v 250) Contents 2 Vector Spaces 1 21 Vectors in R n 1 22 The Formal Denition of a Vector Space 4 23 Subspaces 6 24 Linear Combinations and

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

Machine Learning and Adaptive Systems. Lectures 3 & 4

Machine Learning and Adaptive Systems. Lectures 3 & 4 ECE656- Lectures 3 & 4, Professor Department of Electrical and Computer Engineering Colorado State University Fall 2015 What is Learning? General Definition of Learning: Any change in the behavior or performance

More information

`First Come, First Served' can be unstable! Thomas I. Seidman. Department of Mathematics and Statistics. University of Maryland Baltimore County

`First Come, First Served' can be unstable! Thomas I. Seidman. Department of Mathematics and Statistics. University of Maryland Baltimore County revision2: 9/4/'93 `First Come, First Served' can be unstable! Thomas I. Seidman Department of Mathematics and Statistics University of Maryland Baltimore County Baltimore, MD 21228, USA e-mail: hseidman@math.umbc.edui

More information

Linear-Quadratic Optimal Control: Full-State Feedback

Linear-Quadratic Optimal Control: Full-State Feedback Chapter 4 Linear-Quadratic Optimal Control: Full-State Feedback 1 Linear quadratic optimization is a basic method for designing controllers for linear (and often nonlinear) dynamical systems and is actually

More information

Least Mean Squares Regression. Machine Learning Fall 2018

Least Mean Squares Regression. Machine Learning Fall 2018 Least Mean Squares Regression Machine Learning Fall 2018 1 Where are we? Least Squares Method for regression Examples The LMS objective Gradient descent Incremental/stochastic gradient descent Exercises

More information

Ch4: Method of Steepest Descent

Ch4: Method of Steepest Descent Ch4: Method of Steepest Descent The method of steepest descent is recursive in the sense that starting from some initial (arbitrary) value for the tap-weight vector, it improves with the increased number

More information

Learning with Ensembles: How. over-tting can be useful. Anders Krogh Copenhagen, Denmark. Abstract

Learning with Ensembles: How. over-tting can be useful. Anders Krogh Copenhagen, Denmark. Abstract Published in: Advances in Neural Information Processing Systems 8, D S Touretzky, M C Mozer, and M E Hasselmo (eds.), MIT Press, Cambridge, MA, pages 190-196, 1996. Learning with Ensembles: How over-tting

More information

Independent Component Analysis. Contents

Independent Component Analysis. Contents Contents Preface xvii 1 Introduction 1 1.1 Linear representation of multivariate data 1 1.1.1 The general statistical setting 1 1.1.2 Dimension reduction methods 2 1.1.3 Independence as a guiding principle

More information

Learning with Momentum, Conjugate Gradient Learning

Learning with Momentum, Conjugate Gradient Learning Learning with Momentum, Conjugate Gradient Learning Introduction to Neural Networks : Lecture 8 John A. Bullinaria, 2004 1. Visualising Learning 2. Learning with Momentum 3. Learning with Line Searches

More information

Advanced Digital Signal Processing -Introduction

Advanced Digital Signal Processing -Introduction Advanced Digital Signal Processing -Introduction LECTURE-2 1 AP9211- ADVANCED DIGITAL SIGNAL PROCESSING UNIT I DISCRETE RANDOM SIGNAL PROCESSING Discrete Random Processes- Ensemble Averages, Stationary

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Chapter 3 Least Squares Solution of y = A x 3.1 Introduction We turn to a problem that is dual to the overconstrained estimation problems considered s

Chapter 3 Least Squares Solution of y = A x 3.1 Introduction We turn to a problem that is dual to the overconstrained estimation problems considered s Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A. Dahleh George Verghese Department of Electrical Engineering and Computer Science Massachuasetts Institute of Technology 1 1 c Chapter

More information

A general theory of discrete ltering. for LES in complex geometry. By Oleg V. Vasilyev AND Thomas S. Lund

A general theory of discrete ltering. for LES in complex geometry. By Oleg V. Vasilyev AND Thomas S. Lund Center for Turbulence Research Annual Research Briefs 997 67 A general theory of discrete ltering for ES in complex geometry By Oleg V. Vasilyev AND Thomas S. und. Motivation and objectives In large eddy

More information

Least Mean Squares Regression

Least Mean Squares Regression Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method

More information

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) Contents 1 Vector Spaces 1 1.1 The Formal Denition of a Vector Space.................................. 1 1.2 Subspaces...................................................

More information

strain appears only after the stress has reached a certain critical level, usually specied by a Rankine-type criterion in terms of the maximum princip

strain appears only after the stress has reached a certain critical level, usually specied by a Rankine-type criterion in terms of the maximum princip Nonlocal damage models: Practical aspects and open issues Milan Jirasek LSC-DGC, Swiss Federal Institute of Technology at Lausanne (EPFL), Switzerland Milan.Jirasek@ep.ch Abstract: The purpose of this

More information

In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp 9-46, 998 Linear Quadratic Regulation using Reinforcement Learning Stephan ten Hagen? and Ben Krose Department of Mathematics,

More information

Lessons in Estimation Theory for Signal Processing, Communications, and Control

Lessons in Estimation Theory for Signal Processing, Communications, and Control Lessons in Estimation Theory for Signal Processing, Communications, and Control Jerry M. Mendel Department of Electrical Engineering University of Southern California Los Angeles, California PRENTICE HALL

More information

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module 2 Lecture 05 Linear Regression Good morning, welcome

More information

Dominant Pole Localization of FxLMS Adaptation Process in Active Noise Control

Dominant Pole Localization of FxLMS Adaptation Process in Active Noise Control APSIPA ASC 20 Xi an Dominant Pole Localization of FxLMS Adaptation Process in Active Noise Control Iman Tabatabaei Ardekani, Waleed H. Abdulla The University of Auckland, Private Bag 9209, Auckland, New

More information

Submitted to Electronics Letters. Indexing terms: Signal Processing, Adaptive Filters. The Combined LMS/F Algorithm Shao-Jen Lim and John G. Harris Co

Submitted to Electronics Letters. Indexing terms: Signal Processing, Adaptive Filters. The Combined LMS/F Algorithm Shao-Jen Lim and John G. Harris Co Submitted to Electronics Letters. Indexing terms: Signal Processing, Adaptive Filters. The Combined LMS/F Algorithm Shao-Jen Lim and John G. Harris Computational Neuro-Engineering Laboratory University

More information

Adaptive Inverse Control based on Linear and Nonlinear Adaptive Filtering

Adaptive Inverse Control based on Linear and Nonlinear Adaptive Filtering Adaptive Inverse Control based on Linear and Nonlinear Adaptive Filtering Bernard Widrow and Gregory L. Plett Department of Electrical Engineering, Stanford University, Stanford, CA 94305-9510 Abstract

More information

On Mean Curvature Diusion in Nonlinear Image Filtering. Adel I. El-Fallah and Gary E. Ford. University of California, Davis. Davis, CA

On Mean Curvature Diusion in Nonlinear Image Filtering. Adel I. El-Fallah and Gary E. Ford. University of California, Davis. Davis, CA On Mean Curvature Diusion in Nonlinear Image Filtering Adel I. El-Fallah and Gary E. Ford CIPIC, Center for Image Processing and Integrated Computing University of California, Davis Davis, CA 95616 Abstract

More information

Recursive Least Squares for an Entropy Regularized MSE Cost Function

Recursive Least Squares for an Entropy Regularized MSE Cost Function Recursive Least Squares for an Entropy Regularized MSE Cost Function Deniz Erdogmus, Yadunandana N. Rao, Jose C. Principe Oscar Fontenla-Romero, Amparo Alonso-Betanzos Electrical Eng. Dept., University

More information

IS NEGATIVE STEP SIZE LMS ALGORITHM STABLE OPERATION POSSIBLE?

IS NEGATIVE STEP SIZE LMS ALGORITHM STABLE OPERATION POSSIBLE? IS NEGATIVE STEP SIZE LMS ALGORITHM STABLE OPERATION POSSIBLE? Dariusz Bismor Institute of Automatic Control, Silesian University of Technology, ul. Akademicka 16, 44-100 Gliwice, Poland, e-mail: Dariusz.Bismor@polsl.pl

More information

Outline Introduction: Problem Description Diculties Algebraic Structure: Algebraic Varieties Rank Decient Toeplitz Matrices Constructing Lower Rank St

Outline Introduction: Problem Description Diculties Algebraic Structure: Algebraic Varieties Rank Decient Toeplitz Matrices Constructing Lower Rank St Structured Lower Rank Approximation by Moody T. Chu (NCSU) joint with Robert E. Funderlic (NCSU) and Robert J. Plemmons (Wake Forest) March 5, 1998 Outline Introduction: Problem Description Diculties Algebraic

More information

EE482: Digital Signal Processing Applications

EE482: Digital Signal Processing Applications Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 11 Adaptive Filtering 14/03/04 http://www.ee.unlv.edu/~b1morris/ee482/

More information

BLOCK LMS ADAPTIVE FILTER WITH DETERMINISTIC REFERENCE INPUTS FOR EVENT-RELATED SIGNALS

BLOCK LMS ADAPTIVE FILTER WITH DETERMINISTIC REFERENCE INPUTS FOR EVENT-RELATED SIGNALS BLOCK LMS ADAPTIVE FILTER WIT DETERMINISTIC REFERENCE INPUTS FOR EVENT-RELATED SIGNALS S. Olmos, L. Sörnmo, P. Laguna Dept. of Electroscience, Lund University, Sweden Dept. of Electronics Eng. and Communications,

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Acoustic Signal Processing. Algorithms for Reverberant. Environments

Acoustic Signal Processing. Algorithms for Reverberant. Environments Acoustic Signal Processing Algorithms for Reverberant Environments Terence Betlehem B.Sc. B.E.(Hons) ANU November 2005 A thesis submitted for the degree of Doctor of Philosophy of The Australian National

More information

Adaptive Systems Homework Assignment 1

Adaptive Systems Homework Assignment 1 Signal Processing and Speech Communication Lab. Graz University of Technology Adaptive Systems Homework Assignment 1 Name(s) Matr.No(s). The analytical part of your homework (your calculation sheets) as

More information

3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE

3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE 3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE 3.0 INTRODUCTION The purpose of this chapter is to introduce estimators shortly. More elaborated courses on System Identification, which are given

More information

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017 Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper

More information

Massoud BABAIE-ZADEH. Blind Source Separation (BSS) and Independent Componen Analysis (ICA) p.1/39

Massoud BABAIE-ZADEH. Blind Source Separation (BSS) and Independent Componen Analysis (ICA) p.1/39 Blind Source Separation (BSS) and Independent Componen Analysis (ICA) Massoud BABAIE-ZADEH Blind Source Separation (BSS) and Independent Componen Analysis (ICA) p.1/39 Outline Part I Part II Introduction

More information

ADAPTIVE FILTER THEORY

ADAPTIVE FILTER THEORY ADAPTIVE FILTER THEORY Fifth Edition Simon Haykin Communications Research Laboratory McMaster University Hamilton, Ontario, Canada International Edition contributions by Telagarapu Prabhakar Department

More information

Approximation of the Karhunen}Loève transformation and its application to colour images

Approximation of the Karhunen}Loève transformation and its application to colour images Signal Processing: Image Communication 6 (00) 54}55 Approximation of the Karhunen}Loève transformation and its application to colour images ReH mi Kouassi, Pierre Gouton*, Michel Paindavoine Laboratoire

More information

THE PROBLEMS OF ROBUST LPC PARAMETRIZATION FOR. Petr Pollak & Pavel Sovka. Czech Technical University of Prague

THE PROBLEMS OF ROBUST LPC PARAMETRIZATION FOR. Petr Pollak & Pavel Sovka. Czech Technical University of Prague THE PROBLEMS OF ROBUST LPC PARAMETRIZATION FOR SPEECH CODING Petr Polla & Pavel Sova Czech Technical University of Prague CVUT FEL K, 66 7 Praha 6, Czech Republic E-mail: polla@noel.feld.cvut.cz Abstract

More information

QM and Angular Momentum

QM and Angular Momentum Chapter 5 QM and Angular Momentum 5. Angular Momentum Operators In your Introductory Quantum Mechanics (QM) course you learned about the basic properties of low spin systems. Here we want to review that

More information

Ridge analysis of mixture response surfaces

Ridge analysis of mixture response surfaces Statistics & Probability Letters 48 (2000) 3 40 Ridge analysis of mixture response surfaces Norman R. Draper a;, Friedrich Pukelsheim b a Department of Statistics, University of Wisconsin, 20 West Dayton

More information

A Novel Approach to the 2D Analytic Signal? Thomas Bulow and Gerald Sommer. Christian{Albrechts{Universitat zu Kiel

A Novel Approach to the 2D Analytic Signal? Thomas Bulow and Gerald Sommer. Christian{Albrechts{Universitat zu Kiel A Novel Approach to the 2D Analytic Signal? Thomas Bulow and Gerald Sommer Christian{Albrechts{Universitat zu Kiel Institute of Computer Science, Cognitive Systems Preuerstrae 1{9, 24105 Kiel Tel:+49 431

More information

Chapter 7 Interconnected Systems and Feedback: Well-Posedness, Stability, and Performance 7. Introduction Feedback control is a powerful approach to o

Chapter 7 Interconnected Systems and Feedback: Well-Posedness, Stability, and Performance 7. Introduction Feedback control is a powerful approach to o Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A. Dahleh George Verghese Department of Electrical Engineering and Computer Science Massachuasetts Institute of Technology c Chapter 7 Interconnected

More information

The Growth of Functions. A Practical Introduction with as Little Theory as possible

The Growth of Functions. A Practical Introduction with as Little Theory as possible The Growth of Functions A Practical Introduction with as Little Theory as possible Complexity of Algorithms (1) Before we talk about the growth of functions and the concept of order, let s discuss why

More information

= w 2. w 1. B j. A j. C + j1j2

= w 2. w 1. B j. A j. C + j1j2 Local Minima and Plateaus in Multilayer Neural Networks Kenji Fukumizu and Shun-ichi Amari Brain Science Institute, RIKEN Hirosawa 2-, Wako, Saitama 35-098, Japan E-mail: ffuku, amarig@brain.riken.go.jp

More information

On the Use of A Priori Knowledge in Adaptive Inverse Control

On the Use of A Priori Knowledge in Adaptive Inverse Control 54 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL 47, NO 1, JANUARY 2000 On the Use of A Priori Knowledge in Adaptive Inverse Control August Kaelin, Member,

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

Riccati difference equations to non linear extended Kalman filter constraints

Riccati difference equations to non linear extended Kalman filter constraints International Journal of Scientific & Engineering Research Volume 3, Issue 12, December-2012 1 Riccati difference equations to non linear extended Kalman filter constraints Abstract Elizabeth.S 1 & Jothilakshmi.R

More information

SIMON FRASER UNIVERSITY School of Engineering Science

SIMON FRASER UNIVERSITY School of Engineering Science SIMON FRASER UNIVERSITY School of Engineering Science Course Outline ENSC 810-3 Digital Signal Processing Calendar Description This course covers advanced digital signal processing techniques. The main

More information

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Ellida M. Khazen * 13395 Coppermine Rd. Apartment 410 Herndon VA 20171 USA Abstract

More information

Algorithms to solve block Toeplitz systems and. least-squares problems by transforming to Cauchy-like. matrices

Algorithms to solve block Toeplitz systems and. least-squares problems by transforming to Cauchy-like. matrices Algorithms to solve block Toeplitz systems and least-squares problems by transforming to Cauchy-like matrices K. Gallivan S. Thirumalai P. Van Dooren 1 Introduction Fast algorithms to factor Toeplitz matrices

More information

Boxlets: a Fast Convolution Algorithm for. Signal Processing and Neural Networks. Patrice Y. Simard, Leon Bottou, Patrick Haner and Yann LeCun

Boxlets: a Fast Convolution Algorithm for. Signal Processing and Neural Networks. Patrice Y. Simard, Leon Bottou, Patrick Haner and Yann LeCun Boxlets: a Fast Convolution Algorithm for Signal Processing and Neural Networks Patrice Y. Simard, Leon Bottou, Patrick Haner and Yann LeCun AT&T Labs-Research 100 Schultz Drive, Red Bank, NJ 07701-7033

More information

A new fast algorithm for blind MA-system identication. based on higher order cumulants. K.D. Kammeyer and B. Jelonnek

A new fast algorithm for blind MA-system identication. based on higher order cumulants. K.D. Kammeyer and B. Jelonnek SPIE Advanced Signal Proc: Algorithms, Architectures & Implementations V, San Diego, -9 July 99 A new fast algorithm for blind MA-system identication based on higher order cumulants KD Kammeyer and B Jelonnek

More information

V. Adaptive filtering Widrow-Hopf Learning Rule LMS and Adaline

V. Adaptive filtering Widrow-Hopf Learning Rule LMS and Adaline V. Adaptive filtering Widrow-Hopf Learning Rule LMS and Adaline Goals Introduce Wiener-Hopf (WH) equations Introduce application of the steepest descent method to the WH problem Approximation to the Least

More information

Adaptive Filters. un [ ] yn [ ] w. yn n wun k. - Adaptive filter (FIR): yn n n w nun k. (1) Identification. Unknown System + (2) Inverse modeling

Adaptive Filters. un [ ] yn [ ] w. yn n wun k. - Adaptive filter (FIR): yn n n w nun k. (1) Identification. Unknown System + (2) Inverse modeling Adaptive Filters - Statistical digital signal processing: in many problems of interest, the signals exhibit some inherent variability plus additive noise we use probabilistic laws to model the statistical

More information

6.12 System Identification The Easy Case

6.12 System Identification The Easy Case 252 SYSTEMS 612 System Identification The Easy Case Assume that someone brings you a signal processing system enclosed in a black box The box has two connectors, one marked input and the other output Other

More information

On the Equivariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund Computer Vision Laboratory, Depa

On the Equivariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund Computer Vision Laboratory, Depa On the Invariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund LiTH-ISY-R-530 993-09-08 On the Equivariance of the Orientation and the Tensor Field

More information

Optimization for neural networks

Optimization for neural networks 0 - : Optimization for neural networks Prof. J.C. Kao, UCLA Optimization for neural networks We previously introduced the principle of gradient descent. Now we will discuss specific modifications we make

More information

Essentials of Intermediate Algebra

Essentials of Intermediate Algebra Essentials of Intermediate Algebra BY Tom K. Kim, Ph.D. Peninsula College, WA Randy Anderson, M.S. Peninsula College, WA 9/24/2012 Contents 1 Review 1 2 Rules of Exponents 2 2.1 Multiplying Two Exponentials

More information

Eigenspaces in Recursive Sequences

Eigenspaces in Recursive Sequences Eigenspaces in Recursive Sequences Ben Galin September 5, 005 One of the areas of study in discrete mathematics deals with sequences, in particular, infinite sequences An infinite sequence can be defined

More information

1 Introduction Independent component analysis (ICA) [10] is a statistical technique whose main applications are blind source separation, blind deconvo

1 Introduction Independent component analysis (ICA) [10] is a statistical technique whose main applications are blind source separation, blind deconvo The Fixed-Point Algorithm and Maximum Likelihood Estimation for Independent Component Analysis Aapo Hyvarinen Helsinki University of Technology Laboratory of Computer and Information Science P.O.Box 5400,

More information

Analog Neural Nets with Gaussian or other Common. Noise Distributions cannot Recognize Arbitrary. Regular Languages.

Analog Neural Nets with Gaussian or other Common. Noise Distributions cannot Recognize Arbitrary. Regular Languages. Analog Neural Nets with Gaussian or other Common Noise Distributions cannot Recognize Arbitrary Regular Languages Wolfgang Maass Inst. for Theoretical Computer Science, Technische Universitat Graz Klosterwiesgasse

More information

Linear stochastic approximation driven by slowly varying Markov chains

Linear stochastic approximation driven by slowly varying Markov chains Available online at www.sciencedirect.com Systems & Control Letters 50 2003 95 102 www.elsevier.com/locate/sysconle Linear stochastic approximation driven by slowly varying Marov chains Viay R. Konda,

More information

A vector from the origin to H, V could be expressed using:

A vector from the origin to H, V could be expressed using: Linear Discriminant Function: the linear discriminant function: g(x) = w t x + ω 0 x is the point, w is the weight vector, and ω 0 is the bias (t is the transpose). Two Category Case: In the two category

More information

Adaptive Beamforming Algorithms

Adaptive Beamforming Algorithms S. R. Zinka srinivasa_zinka@daiict.ac.in October 29, 2014 Outline 1 Least Mean Squares 2 Sample Matrix Inversion 3 Recursive Least Squares 4 Accelerated Gradient Approach 5 Conjugate Gradient Method Outline

More information

IMPROVEMENTS IN ACTIVE NOISE CONTROL OF HELICOPTER NOISE IN A MOCK CABIN ABSTRACT

IMPROVEMENTS IN ACTIVE NOISE CONTROL OF HELICOPTER NOISE IN A MOCK CABIN ABSTRACT IMPROVEMENTS IN ACTIVE NOISE CONTROL OF HELICOPTER NOISE IN A MOCK CABIN Jared K. Thomas Brigham Young University Department of Mechanical Engineering ABSTRACT The application of active noise control (ANC)

More information

Discrete quantum random walks

Discrete quantum random walks Quantum Information and Computation: Report Edin Husić edin.husic@ens-lyon.fr Discrete quantum random walks Abstract In this report, we present the ideas behind the notion of quantum random walks. We further

More information

Conjugate Directions for Stochastic Gradient Descent

Conjugate Directions for Stochastic Gradient Descent Conjugate Directions for Stochastic Gradient Descent Nicol N Schraudolph Thore Graepel Institute of Computational Science ETH Zürich, Switzerland {schraudo,graepel}@infethzch Abstract The method of conjugate

More information

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

weightchanges alleviates many ofthe above problems. Momentum is an example of an improvement on our simple rst order method that keeps it rst order bu

weightchanges alleviates many ofthe above problems. Momentum is an example of an improvement on our simple rst order method that keeps it rst order bu Levenberg-Marquardt Optimization Sam Roweis Abstract Levenberg-Marquardt Optimization is a virtual standard in nonlinear optimization which signicantly outperforms gradient descent and conjugate gradient

More information

PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION. A Thesis MELTEM APAYDIN

PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION. A Thesis MELTEM APAYDIN PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION A Thesis by MELTEM APAYDIN Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the

More information

No. of dimensions 1. No. of centers

No. of dimensions 1. No. of centers Contents 8.6 Course of dimensionality............................ 15 8.7 Computational aspects of linear estimators.................. 15 8.7.1 Diagonalization of circulant andblock-circulant matrices......

More information

Error Empirical error. Generalization error. Time (number of iteration)

Error Empirical error. Generalization error. Time (number of iteration) Submitted to Neural Networks. Dynamics of Batch Learning in Multilayer Networks { Overrealizability and Overtraining { Kenji Fukumizu The Institute of Physical and Chemical Research (RIKEN) E-mail: fuku@brain.riken.go.jp

More information

Institute for Advanced Computer Studies. Department of Computer Science. On Markov Chains with Sluggish Transients. G. W. Stewart y.

Institute for Advanced Computer Studies. Department of Computer Science. On Markov Chains with Sluggish Transients. G. W. Stewart y. University of Maryland Institute for Advanced Computer Studies Department of Computer Science College Park TR{94{77 TR{3306 On Markov Chains with Sluggish Transients G. W. Stewart y June, 994 ABSTRACT

More information