Ch4: Method of Steepest Descent

The method of steepest descent is recursive in the sense that, starting from some initial (arbitrary) value for the tap-weight vector, it improves with the increasing number of iterations. The final value so computed for the tap-weight vector converges to the Wiener solution. The method is descriptive of a deterministic feedback system that finds the minimum point of the ensemble-averaged error-performance surface without knowledge of the surface itself.

Mean Square Error (Revisited)

For a transversal filter of length M, the output is written as

    y(n) = w^H(n) u(n)

and the error with respect to a certain desired response d(n) is

    e(n) = d(n) - y(n) = d(n) - w^H(n) u(n)

Mean Square Error (Revisited)

Following these terms, the MSE criterion is defined as

    J(w) = E[|e(n)|^2]

Substituting e(n) and manipulating the expression, we get

    J(w) = σ_d^2 - w^H p - p^H w + w^H R w

Quadratic in w! The quantities σ_d^2, R and p are defined on the next slide.

Mean Square Error (Revisited)

For notational simplicity, express the MSE in terms of vectors/matrices, where

    R = E[u(n) u^H(n)] =
        [ r(0)      r(1)      ...  r(M-1) ]
        [ r*(1)     r(0)      ...  r(M-2) ]
        [ ...       ...       ...  ...    ]
        [ r*(M-1)   r*(M-2)   ...  r(0)   ]

(note that r(-k) = r*(k)),

σ_d^2 is the variance of the desired response d(n), and

    p = E[u(n) d*(n)] = [p(0), p(-1), ..., p(-(M-1))]^T

Mean Square Error (Revisited)

We found that the solution (the optimum filter coefficients w_o) is given by the Wiener-Hopf equations

    R w_o = p,    J_min = σ_d^2 - p^H w_o

Inversion of R can be very costly. J(w) is quadratic in w, hence convex in w; the surface has a single minimum, and it is global. Can we then reach w_o with a less demanding algorithm?
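For later comparison with the iterative method, a minimal NumPy sketch of the direct solution (solving R w_o = p with a linear solver rather than forming R^{-1} explicitly); the values of R, p and σ_d^2 below are placeholders, not values from the lecture:

    import numpy as np

    R = np.array([[1.0, 0.5],            # placeholder correlation matrix R = E[u u^H]
                  [0.5, 1.0]])
    p = np.array([0.5, 0.25])            # placeholder cross-correlation vector p = E[u d*]
    sigma_d2 = 1.0                       # placeholder variance of the desired response

    w_o = np.linalg.solve(R, p)          # Wiener-Hopf solution of R w_o = p
    J_min = sigma_d2 - p @ w_o           # minimum mean-square error
    print("w_o =", w_o, " J_min =", J_min)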

Basic Idea of the Method of Steepest Descent

Can we find w_o in an iterative manner?

Basic Idea of the Method of Steepest Descent

Starting from w(0), generate a sequence {w(n)} with the property

    J(w(n+1)) < J(w(n))

Many sequences can be found following different rules. The method of steepest descent generates points using the gradient. The gradient of J at point w, i.e. ∇J(w), gives the direction in which the function increases most; then -∇J(w) gives the direction in which the function decreases most. Release a tiny ball on the surface of J: it follows the negative gradient of the surface.


Basic Idea of the Method of Steepest Descent

For notational simplicity, let g(n) = ∇J(w(n)); then, going in the direction given by the negative gradient,

    w(n+1) = w(n) + (1/2) μ [-g(n)]

(the factor 1/2 merely cancels the factor 2 that appears in the gradient of the quadratic cost). How far we go along -g(n) is defined by the step-size parameter μ. The optimum step size can be obtained by a line search, which is difficult; generally a constant step size is taken for simplicity.
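As an aside (not part of the slides): for this quadratic cost the exact line-search step along -g actually has a closed form, α = (g^T g)/(2 g^T R g). A minimal real-valued sketch with placeholder R, p and σ_d^2, comparing it with a fixed step:

    import numpy as np

    R = np.array([[1.0, 0.5],                 # placeholder statistics (real-valued case)
                  [0.5, 1.0]])
    p = np.array([0.5, 0.25])
    sigma_d2 = 1.0

    def cost(w):
        # J(w) = sigma_d^2 - 2 w^T p + w^T R w for real data
        return sigma_d2 - 2.0 * w @ p + w @ R @ w

    w = np.zeros(2)
    g = 2.0 * (R @ w - p)                     # gradient of the quadratic cost at w
    alpha = (g @ g) / (2.0 * g @ R @ g)       # exact line-search step along -g
    print(cost(w - alpha * g))                # cost after the optimal step
    print(cost(w - 0.5 * 0.3 * g))            # cost after a fixed step (mu = 0.3, 1/2 convention)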

[Figure: error-performance surface, J plotted against ξ]

Application of SD to Wiener Filter

For w(n) with entries w_k(n) = a_k(n) + j b_k(n), the gradient is

    ∇J(n) = [ ∂J(n)/∂a_0(n)     + j ∂J(n)/∂b_0(n)     ]
            [ ∂J(n)/∂a_1(n)     + j ∂J(n)/∂b_1(n)     ]
            [ ...                                      ]
            [ ∂J(n)/∂a_{M-1}(n) + j ∂J(n)/∂b_{M-1}(n) ]

and from the theory of the Wiener filter we know that

    ∇J(n) = -2p + 2R w(n)

Then the update equation becomes

    w(n+1) = w(n) + μ [p - R w(n)],    n = 0, 1, 2, ...

which defines a feedback connection. The correction δw(n) applied to the tap-weight vector at time n + 1 is equal to μ[p - Rw(n)]. This correction may also be expressed as μ times the expectation of the product of the tap-input vector u(n) and the conjugate of the estimation error e(n):

    e(n) = d(n) - w^H(n) u(n),    δw(n) = μ E[u(n) e*(n)]

This suggests that we may use a bank of cross-correlators to compute the correction δw(n) applied to the tap-weight vector w(n).
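A minimal NumPy sketch of this recursion; R, p, μ and the iteration count are illustrative choices, not values from the lecture:

    import numpy as np

    R = np.array([[1.0, 0.5],            # illustrative autocorrelation matrix (placeholder)
                  [0.5, 1.0]])
    p = np.array([0.5, 0.25])            # illustrative cross-correlation vector (placeholder)
    mu = 0.3                             # step size; must satisfy 0 < mu < 2/lambda_max

    w = np.zeros(2)                      # initial tap-weight vector w(0) = 0
    for n in range(200):
        w = w + mu * (p - R @ w)         # steepest-descent update w(n+1) = w(n) + mu [p - R w(n)]

    print("after 200 iterations:", w)
    print("Wiener solution     :", np.linalg.solve(R, p))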


Feedback model

The update

    w(n+1) = w(n) + μ [p - R w(n)]

can be represented by a signal-flow graph. The transmittance of each branch of the graph is a scalar or a square matrix. For each branch of the graph, the signal vector flowing out equals the signal vector flowing in multiplied by the transmittance matrix of the branch. Transmittances of parallel branches are summed; transmittances of cascaded branches are multiplied.

Convergence Analysis

Feedback may cause stability problems under certain conditions, depending on
- the step size μ
- the autocorrelation matrix R

Does SD converge? Under which conditions? What is the rate of convergence? We may use the canonical representation. Let the weight-error vector be

    c(n) = w_o - w(n)

Then, since p = R w_o, the update equation becomes

    c(n+1) = (I - μR) c(n)

Convergence Analysis

Let R = Q Λ Q^H be the eigendecomposition of R (the unitary similarity transformation). Then

    c(n+1) = (I - μ Q Λ Q^H) c(n)

Using Q Q^H = I, apply the change of coordinates

    v(n) = Q^H c(n) = Q^H [w_o - w(n)]

Then, the update equation becomes

    v(n+1) = (I - μΛ) v(n)

Convergence Analysis

We know that Λ is diagonal; then the k-th natural mode is

    v_k(n+1) = (1 - μλ_k) v_k(n),    k = 1, 2, ..., M

or, with the initial values v_k(0), we have

    v_k(n) = (1 - μλ_k)^n v_k(0)

Note that each mode follows a geometric series with ratio (1 - μλ_k).

Convergence Analysis

Obviously, for stability,

    |1 - μλ_k| < 1    for all k

or, simply, -1 < 1 - μλ_k < 1, or

    0 < μ < 2/λ_max

Why? Since the eigenvalues of the correlation matrix R are all real and positive, the two inequalities reduce to μ > 0 and μ < 2/λ_k for every k, i.e. μ < 2/λ_max. The geometric series results in an exponentially decaying curve with time constant τ_k, where, letting (1 - μλ_k)^n = e^{-n/τ_k},

    τ_k = -1 / ln(1 - μλ_k)
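A quick numerical check of these quantities, with an illustrative R (not from the lecture):

    import numpy as np

    R = np.array([[1.0, 0.5],                  # illustrative correlation matrix
                  [0.5, 1.0]])
    lam = np.linalg.eigvalsh(R)                # eigenvalues of R: real and positive
    mu_max = 2.0 / lam.max()                   # stability bound on the step size

    mu = 0.3                                   # any value with 0 < mu < mu_max
    tau = -1.0 / np.log(1.0 - mu * lam)        # time constants (valid for 0 < mu*lam_k < 1)
    print("mu_max =", mu_max, " time constants =", tau)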

Convergence Analysis

We have c(n) = w_o - w(n), but w(n) = w_o - c(n) and c(n) = Q v(n); then

    w(n) = w_o - Q v(n)

We know that Q is composed of the eigenvectors q_k of R; then

    w(n) = w_o - Σ_{k=1}^{M} q_k v_k(n) = w_o - Σ_{k=1}^{M} q_k (1 - μλ_k)^n v_k(0)

Each filter coefficient decays exponentially. The overall rate of convergence is limited by the slowest and fastest modes.

Convergence Analysis

For a small step size, τ_k ≈ 1/(μλ_k).

What is v(0)? The initial value is

    v(0) = Q^H [w_o - w(0)]

For simplicity, assume that w(0) = 0; then

    v(0) = Q^H w_o

Convergence Analysis

Transient behaviour of the MSE: from the canonical form we know that

    J(n) = J_min + Σ_{k=1}^{M} λ_k |v_k(n)|^2 = J_min + Σ_{k=1}^{M} λ_k (1 - μλ_k)^{2n} |v_k(0)|^2

Then, as long as the upper limit on the step-size parameter μ is satisfied, J(n) → J_min as n → ∞, regardless of the initial point.

Convergence Analysis

The progress of J(n) for n = 0, 1, ... is called the learning curve. The learning curve of the steepest-descent algorithm consists of a sum of exponentials, each of which corresponds to a natural mode of the problem. The number of natural modes equals the number of filter taps.
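A sketch that evaluates the learning curve directly from the eigendecomposition, assuming w(0) = 0 so that v(0) = Q^H w_o; R, p and σ_d^2 are the same illustrative placeholders as before:

    import numpy as np

    R = np.array([[1.0, 0.5],                     # illustrative statistics (placeholders)
                  [0.5, 1.0]])
    p = np.array([0.5, 0.25])
    sigma_d2 = 1.0
    mu = 0.3

    lam, Q = np.linalg.eigh(R)                    # R = Q diag(lam) Q^H
    w_o = np.linalg.solve(R, p)                   # Wiener solution
    J_min = sigma_d2 - p @ w_o
    v0 = Q.conj().T @ w_o                         # v(0) = Q^H w_o, assuming w(0) = 0

    n = np.arange(50)
    # Learning curve: a sum of decaying exponentials, one per natural mode
    J = J_min + sum(lam[k] * (1.0 - mu * lam[k]) ** (2 * n) * abs(v0[k]) ** 2
                    for k in range(len(lam)))
    print(J[:5], "...", J_min)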

Example

A predictor with two taps (w_1(n) and w_2(n)) is used to find the parameters of the AR process

    u(n) + a_1 u(n-1) + a_2 u(n-2) = v(n)

Examine the transient behaviour for
- fixed step size, varying eigenvalue spread
- fixed eigenvalue spread, varying step size.

σ_v^2 is adjusted so that σ_u^2 = 1; a_1 and a_2 are chosen so that the characteristic equation has complex roots.

Example

For this AR process we had

    R = [ r(0)  r(1) ]
        [ r(1)  r(0) ]

with two eigenmodes

    λ_1 = r(0) + r(1),    λ_2 = r(0) - r(1)

and condition number (eigenvalue spread)

    χ(R) = λ_max / λ_min
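A sketch of how these quantities can be computed for a chosen pair (a_1, a_2), using the Yule-Walker relation r(1) = -a_1 r(0)/(1 + a_2) for an AR(2) process; the parameter values below are hypothetical, not the ones used in the lecture's experiments:

    import numpy as np

    # Hypothetical AR(2) parameters, chosen so that z^2 + a1 z + a2 has complex roots;
    # these are NOT the values used in the lecture's experiments.
    a1, a2 = -0.5, 0.8

    r0 = 1.0                                  # sigma_u^2 normalised to 1
    r1 = -a1 * r0 / (1.0 + a2)                # Yule-Walker relation for an AR(2) process

    R = np.array([[r0, r1],
                  [r1, r0]])
    lam1, lam2 = r0 + r1, r0 - r1             # the two eigenmodes of the 2x2 Toeplitz matrix
    chi = max(lam1, lam2) / min(lam1, lam2)   # condition number (eigenvalue spread)
    print("lambda_1 =", lam1, " lambda_2 =", lam2, " chi(R) =", chi)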

Example (Experiment 1)

Experiment 1: keep the step size fixed at μ = 0.3 and change the eigenvalue spread χ(R).

The natural modes evolve as

    v_1(n) = (1 - μλ_1)^n v_1(0),    v_2(n) = (1 - μλ_2)^n v_2(0),    n = 1, 2, ...

The optimum tap-weight vector equals

    w_o = [-a_1, -a_2]^T    and    J_min = σ_v^2

Using v(0) = Q^H w_o we have

    v(0) = [ v_1(0) ]  =  (1/√2) [ 1   1 ] [ -a_1 ]  =  -(1/√2) [ a_1 + a_2 ]
           [ v_2(0) ]            [ 1  -1 ] [ -a_2 ]             [ a_1 - a_2 ]

For fixed n, consider the locus of points satisfying

    J(n) = J_min + λ_1 v_1^2(n) + λ_2 v_2^2(n)

When λ_1 = λ_2 = λ, it represents a circle with center at the origin and radius equal to the square root of [J(n) - J_min]/λ. When λ_1 ≠ λ_2, it represents (for fixed n) an ellipse whose axes equal the square roots of [J(n) - J_min]/λ_1 and [J(n) - J_min]/λ_2 (the major axis lies along the eigenvector associated with the smaller eigenvalue).
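The loci shown in the following figures can be generated directly from these expressions; a brief sketch, reusing the hypothetical a_1, a_2 from the earlier snippet rather than the lecture's actual values:

    import numpy as np

    # Reusing the hypothetical a1, a2 from the sketch above, with step size mu = 0.3
    a1, a2, mu = -0.5, 0.8, 0.3
    r1 = -a1 / (1.0 + a2)                     # r(0) = 1
    lam1, lam2 = 1.0 + r1, 1.0 - r1

    v10 = -(a1 + a2) / np.sqrt(2.0)           # v(0) = Q^H w_o with w_o = [-a1, -a2]^T
    v20 = -(a1 - a2) / np.sqrt(2.0)

    n = np.arange(30)
    v1 = (1.0 - mu * lam1) ** n * v10         # each natural mode decays geometrically
    v2 = (1.0 - mu * lam2) ** n * v20
    print(np.column_stack((v1, v2))[:5])      # first few points of the locus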

Loci of v_1(n) versus v_2(n) for the steepest-descent algorithm with step-size parameter μ = 0.3 and varying eigenvalue spread: (a) χ(R) = 1.22; (b) χ(R) = 3; (c) χ(R) = 10; (d) χ(R) = 100.


The corresponding tap weights are

    w(n) = [ w_1(n) ]  =  [ -a_1 - (v_1(n) + v_2(n))/√2 ]
           [ w_2(n) ]     [ -a_2 - (v_1(n) - v_2(n))/√2 ]

Loci of w_1(n) versus w_2(n) for the steepest-descent algorithm with step-size parameter μ = 0.3 and varying eigenvalue spread: (a) χ(R) = 1.22; (b) χ(R) = 3; (c) χ(R) = 10; (d) χ(R) = 100.


We see that, as the eigenvalue spread increases (and the input process becomes more correlated), the minimum mean-squared error J_min decreases.

Example (Experiment 2)

Keep the eigenvalue spread fixed at χ(R) = 10 and change the step size (μ_max = 1.1).

Loci of v_1(n) versus v_2(n) for the steepest-descent algorithm with eigenvalue spread χ(R) = 10 and varying step-size parameters: (a) overdamped, μ = 0.3; (b) underdamped, μ = 1.0.

Loci of w_1(n) versus w_2(n) for the steepest-descent algorithm with eigenvalue spread χ(R) = 10 and varying step-size parameters: (a) overdamped, μ = 0.3; (b) underdamped, μ = 1.0.

Depending on the value of μ, the learning curve can be
- overdamped: moves smoothly to the minimum ((very) small μ)
- underdamped: oscillates towards the minimum (large μ < μ_max)
- critically damped.

Generally, the rate of convergence is slow in the first two cases.

Observations

SD is a deterministic algorithm, i.e. we assume that R and p are known exactly. In practice they can only be estimated - by sample averages? This can have high computational complexity. SD is a local search algorithm, but for Wiener filtering the cost surface is convex (quadratic), so convergence is guaranteed as long as μ < μ_max is satisfied.
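For instance, sample-average estimates of R and p over N snapshots might look as follows; the data here is synthetic and purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10_000                                  # number of snapshots used for the averages

    d = rng.standard_normal(N)                  # synthetic desired response (illustration only)
    u = np.column_stack([d + 0.1 * rng.standard_normal(N),   # synthetic tap-input vectors u(n),
                         np.roll(d, 1)])                      # one row per time instant

    # Sample-average estimates of R = E[u(n) u^H(n)] and p = E[u(n) d*(n)]
    R_hat = u.conj().T @ u / N
    p_hat = u.conj().T @ d / N
    print(R_hat)
    print(p_hat)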


Observations

The origin of SD is the Taylor series expansion (as with many other local search optimization algorithms), and its convergence can be very slow. To speed up the process, the second-order term of the expansion can also be included, as in Newton's method, with Hessian H = 2R. Differentiating the resulting quadratic approximation with respect to w and setting the result to zero gives the Newton update. (Note: Newton's method will also be used in the derivation of the NLMS algorithm.)

    w(n+1) = w(n) - H^{-1} ∇J(n) = w(n) + (1/2) R^{-1} [2p - 2R w(n)] = w(n) + R^{-1} [p - R w(n)] = R^{-1} p = w_o

Optimum solution in a single iteration! But: high computational complexity (matrix inversion) and possible numerical stability problems.

Hw4: Ch4, problems 2, 4, 7, 10, 14
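A closing sketch of this one-step Newton update, with the same illustrative R and p as before:

    import numpy as np

    R = np.array([[1.0, 0.5],                 # illustrative statistics (placeholders)
                  [0.5, 1.0]])
    p = np.array([0.5, 0.25])

    w = np.zeros(2)                           # arbitrary starting point
    # Newton step: w <- w - H^{-1} grad J(w) = w + R^{-1} [p - R w] = R^{-1} p = w_o
    w = w + np.linalg.solve(R, p - R @ w)
    print("after one Newton step:", w)
    print("Wiener solution      :", np.linalg.solve(R, p))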