Instructor: Dr. Benjamin Thompson Lecture 8: 3 February 2009


1 Instructor: Dr. Benjamin Thompson Lecture 8: 3 February 2009

2 Announcement: Homework 3 is due one week from today.

3 Not so long ago, in a classroom very, very close by: Unconstrained Optimization; The Method of Steepest Descent; Newton's Method; The Gauss-Newton Method. Really? Only four topics? I must be slacking.

4 Episode VIII: A New Trope: The Wiener Filter; The LMS Algorithm; Moving On (Matlab Demo: Linear Regression, Matlab Demo: The Wiener Filter, Matlab Demo: LMS); Data Sets (The Iris Classification Problem, The Sumo-Basketball Player-Jockeys Problem, Stock Market Data).

5 It involves bell-shaped figs.

6 Motivation: The motivation for this approach is to address the complexity issues without taking too big a hit on convergence. Outline for the derivation: we define a new error function as the sum of squared errors accrued up to time n; we then linearize the dependence of these error terms on the weight vector (to make the math easier); given this relationship between the error terms and the weight vector, we may find the weight vector w that minimizes the error function defined in the beginning.

7 Let's Get Started: Define our (new) cost function as the sum of the squares of all the approximation errors from the first iteration up to the current iteration:
$$E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{n} e^2(i)$$
For the least-squares problem, the error term is $e(i) = d(i) - y(i) = d(i) - \mathbf{w}^T\mathbf{x}(i)$. We linearize each of these error terms, which are functions of the weight vector w (whatever they may be), by a first-order Taylor series expansion:
$$e'(i,\mathbf{w}) = e(i) + \left[\frac{\partial e(i)}{\partial \mathbf{w}}\bigg|_{\mathbf{w}=\mathbf{w}(n)}\right]^T \left(\mathbf{w} - \mathbf{w}(n)\right)$$

8 An Illustration: [Figure: a plot of the error e(i, w) versus w, with a straight line tangent to the curve at the point (w(i), e(i, w(i))) and extended out to another weight value w1.] The slope of this line is the derivative term: this line is the linearized function described on the previous slide!

9 Still Going: We may vectorize this over all e'(n, w) (that is, the e'(i, w) for each i from 1 to n) to get:
$$\mathbf{e}'(n,\mathbf{w}) = \mathbf{e}(n) + \mathbf{J}(n)\left(\mathbf{w} - \mathbf{w}(n)\right)$$
where e(n) is the error vector of the accumulated errors, and J(n) is the Jacobian of the error vector, given as:
$$\mathbf{J}(n) = \begin{bmatrix} \dfrac{\partial e(1)}{\partial w_1} & \dfrac{\partial e(1)}{\partial w_2} & \cdots & \dfrac{\partial e(1)}{\partial w_m} \\ \dfrac{\partial e(2)}{\partial w_1} & \dfrac{\partial e(2)}{\partial w_2} & \cdots & \dfrac{\partial e(2)}{\partial w_m} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial e(n)}{\partial w_1} & \dfrac{\partial e(n)}{\partial w_2} & \cdots & \dfrac{\partial e(n)}{\partial w_m} \end{bmatrix}_{\mathbf{w}=\mathbf{w}(n)}$$
Note that this is n-by-m: the number of observations so far by the size of the parameter vector we are trying to optimize!

10 Still Going: The Jacobian may also be expressed as the transpose of the gradient matrix
$$\nabla\mathbf{e}(n) = \begin{bmatrix} \nabla e(1) & \nabla e(2) & \cdots & \nabla e(n) \end{bmatrix}$$
where each of these elements is a column vector. Now we may find the weight vector w that would minimize the errors of all the previous stimuli and the current input, which gives us the next value of w:
$$\mathbf{w}(n+1) = \arg\min_{\mathbf{w}} \frac{1}{2}\sum_{i=1}^{n} e'^2(i,\mathbf{w}) = \arg\min_{\mathbf{w}} \frac{1}{2}\left\|\mathbf{e}'(n,\mathbf{w})\right\|^2$$
Evaluating this norm gives us:
$$\frac{1}{2}\left\|\mathbf{e}'(n,\mathbf{w})\right\|^2 = \frac{1}{2}\left\|\mathbf{e}(n) + \mathbf{J}(n)\left(\mathbf{w}-\mathbf{w}(n)\right)\right\|^2$$
which expands to:
$$\frac{1}{2}\left\|\mathbf{e}(n)\right\|^2 + \mathbf{e}^T(n)\mathbf{J}(n)\left(\mathbf{w}-\mathbf{w}(n)\right) + \frac{1}{2}\left(\mathbf{w}-\mathbf{w}(n)\right)^T\mathbf{J}^T(n)\mathbf{J}(n)\left(\mathbf{w}-\mathbf{w}(n)\right)$$

11 Still Going: As usual, since we want to find the value of w that minimizes this, we take the derivative and set it equal to zero:
$$\mathbf{J}^T(n)\mathbf{e}(n) + \mathbf{J}^T(n)\mathbf{J}(n)\left(\mathbf{w}-\mathbf{w}(n)\right) = \mathbf{0}$$
And now solve for the w that minimizes this:
$$\mathbf{w} = \mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\mathbf{J}(n)\right)^{-1}\mathbf{J}^T(n)\mathbf{e}(n)$$
So the next value of w is just the previous value of w minus the error times some gradient information!
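The lecture's demos are in Matlab, but as a quick check of the algebra, here is a minimal NumPy sketch of the update just derived; the function and variable names are mine, not the lecture's. Solving the normal equations avoids forming an explicit inverse.

```python
import numpy as np

def gauss_newton_step(w, J, e):
    """One Gauss-Newton update: w(n+1) = w(n) - (J^T J)^{-1} J^T e(n).

    w : (m,)   current weight vector w(n)
    J : (n, m) Jacobian of the error vector with respect to w
    e : (n,)   accumulated error vector e(n)
    """
    # Solve (J^T J) delta = J^T e rather than inverting J^T J explicitly.
    delta = np.linalg.solve(J.T @ J, J.T @ e)
    return w - delta
```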

12 Some Caveats: $\mathbf{J}^T\mathbf{J}$ must be nonsingular so you can calculate its inverse. This is not guaranteed, so frequently $\mathbf{J}^T\mathbf{J} + \delta\mathbf{I}$ is used instead, where δ is a small positive constant that ensures the result is invertible. As it turns out, this is equivalent to minimizing the cost function
$$\frac{1}{2}\sum_{i=1}^{n} e^2(i) + \frac{\delta}{2}\left\|\mathbf{w}-\mathbf{w}(n)\right\|^2$$
where the second term penalizes the deviation from the previous weight vector, which is another form of regularization.
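If $\mathbf{J}^T\mathbf{J}$ might be singular, the δI safeguard above slots directly into the same step. A hedged NumPy sketch (again my own, with an illustrative default for δ):

```python
import numpy as np

def gauss_newton_step_damped(w, J, e, delta=1e-6):
    """Gauss-Newton step using (J^T J + delta*I) in place of J^T J."""
    m = J.shape[1]
    A = J.T @ J + delta * np.eye(m)   # delta > 0 guarantees invertibility
    return w - np.linalg.solve(A, J.T @ e)
```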

13 In Practice: Let's take a closer look at the Jacobian J(n). Recall that for our linear approximation $\mathbf{w}^T\mathbf{x}(i)$, the error is the difference between this approximation and the true output $d(i) = f(\mathbf{x}(i))$, which is given as $e(i) = d(i) - \mathbf{w}^T\mathbf{x}(i)$. So the derivative of this error with respect to the weight vector is just $-\mathbf{x}(i)$! So...

14 In Practice: For the case of linear approximation, the Jacobian becomes:
$$\mathbf{J}(n) = \begin{bmatrix} \dfrac{\partial e(1)}{\partial w_1} & \dfrac{\partial e(1)}{\partial w_2} & \cdots & \dfrac{\partial e(1)}{\partial w_m} \\ \dfrac{\partial e(2)}{\partial w_1} & \dfrac{\partial e(2)}{\partial w_2} & \cdots & \dfrac{\partial e(2)}{\partial w_m} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial e(n)}{\partial w_1} & \dfrac{\partial e(n)}{\partial w_2} & \cdots & \dfrac{\partial e(n)}{\partial w_m} \end{bmatrix}_{\mathbf{w}=\mathbf{w}(n)} = -\begin{bmatrix} x_1(1) & x_2(1) & \cdots & x_m(1) \\ x_1(2) & x_2(2) & \cdots & x_m(2) \\ \vdots & \vdots & & \vdots \\ x_1(n) & x_2(n) & \cdots & x_m(n) \end{bmatrix} = -\begin{bmatrix} \mathbf{x}(1) & \cdots & \mathbf{x}(n) \end{bmatrix}^T$$
So the Jacobian is just the negative of the transpose of all the input data up to time n!

15 In Practice: So here's how Gauss-Newton works: given a weight vector at time n, w(n), calculate the errors for all the data you've used, from x(1) up to x(n), with respect to that weight vector. Hint: this may be simplified by creating an n-by-m matrix of all the data and multiplying it by the (m-by-1) weight vector, which results in a vector of all the estimated outputs up to time n. This is then subtracted from the target vector d(n), whose elements are the target values from 1 to n, to produce e(n). The sharp student will notice that the n-by-m matrix he or she made above is the negative of the Jacobian! Then just turn the crank to get your new answer.
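A minimal sketch of this recipe for the linear case, assuming the data are stacked as rows of an n-by-m array X with target vector d (variable names are mine):

```python
import numpy as np

def gauss_newton_linear(X, d, w):
    """One Gauss-Newton step for the linear model y(i) = w^T x(i).

    X : (n, m) data matrix, rows are x(1)..x(n)
    d : (n,)   desired responses d(1)..d(n)
    w : (m,)   current weight vector w(n)
    """
    e = d - X @ w          # errors of all data under the current weights
    J = -X                 # for the linear model the Jacobian is -X (slide 14)
    return w - np.linalg.solve(J.T @ J, J.T @ e)
```

Because the error is linear in w, a single such step lands directly on the batch least-squares solution, which is exactly where the Wiener filter discussion below ends up.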

16 True story: I bought a dachshund ("Wiener Dog") for the sole purpose of naming him Norbert, after Norbert Wiener, inventor of the Wiener filter. As it happens, I also have a cat named Newton. After Isaac, not Fig. My wife actually let me get away with this twice. I have a plan to buy a second cat, name him Leibniz, and watch them duke it out.

17 The Wiener Filter: Fortunately, we already derived this, sort of. The Wiener Filter simply applies the Gauss-Newton method of optimization to the linear least-squares function approximation. That is, we want a current estimate of a linear approximator $d(n) = f(\mathbf{x}(n)) \approx \mathbf{w}^T(n)\mathbf{x}(n) = y(n)$, given the current error $e(n) = d(n) - y(n)$ and all the errors of the previous inputs applied to the current weight vector. The thing we want to estimate is the weight vector, or parameter vector.

18 To Reiterate: Here's how the Wiener Filter works. Given a current weight vector w(n), and all the previous inputs that led up to that current weight vector, x(1) up to x(n), we form the data matrix X(n) as:
$$\mathbf{X}(n) = \begin{bmatrix} \mathbf{x}(1) & \mathbf{x}(2) & \cdots & \mathbf{x}(n) \end{bmatrix}^T$$
We then calculate the cumulative errors as:
$$\mathbf{e}(n) = \mathbf{d}(n) - \mathbf{X}(n)\mathbf{w}(n)$$
where d(n) is a column vector containing all the desired responses up until now.

19 I Love It When A Plan Comes Together: Recall, to linearize this error with respect to w, we get the Jacobian:
$$\mathbf{J}(n) = \frac{\partial \mathbf{e}(n)}{\partial \mathbf{w}^T} = -\mathbf{X}(n)$$
Since the error is already linear with respect to w, Gauss-Newton will actually converge in a single iteration! So we apply it:
$$\begin{aligned} \mathbf{w}(n+1) &= \mathbf{w}(n) - \left(\mathbf{J}^T(n)\mathbf{J}(n)\right)^{-1}\mathbf{J}^T(n)\mathbf{e}(n) \\ &= \mathbf{w}(n) + \left(\mathbf{X}^T(n)\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\left(\mathbf{d}(n) - \mathbf{X}(n)\mathbf{w}(n)\right) \\ &= \left(\mathbf{X}^T(n)\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\mathbf{d}(n) \end{aligned}$$
Now THAT's simplification!
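A sketch of that closed-form solution in NumPy (my own illustration, not the lecture's Matlab demo), with a tiny synthetic check that a known weight vector is recovered:

```python
import numpy as np

def wiener_solution(X, d):
    """Batch least-squares (Wiener) weights: w = (X^T X)^{-1} X^T d."""
    return np.linalg.solve(X.T @ X, X.T @ d)

# Tiny check: recover a known weight vector from noisy linear data.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])                 # illustrative values
X = rng.standard_normal((200, 3))
d = X @ w_true + 0.01 * rng.standard_normal(200)
print(wiener_solution(X, d))                        # close to w_true
```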

20 Some Notes on Wiener: The multiplier $\left(\mathbf{X}^T(n)\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)$ is called the pseudoinverse matrix of X(n), and is denoted $\mathbf{X}^+(n)$. What does the pseudoinverse devolve into when X(n) is a nonsingular square matrix? Hint: $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$. Obviously, this can be computationally intensive for large data sets: you need to keep a matrix of all the data for all time since the beginning of your estimation! The same caveats about the invertibility of $\mathbf{X}^T\mathbf{X}$ apply as in the case of the Gauss-Newton method.
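For reference, NumPy exposes the pseudoinverse directly; a one-function sketch (assumption: when $\mathbf{X}^T\mathbf{X}$ is invertible this matches the expression above):

```python
import numpy as np

def wiener_solution_pinv(X, d):
    """Least-squares weights via the pseudoinverse: w = X^+ d."""
    return np.linalg.pinv(X) @ d
```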

21 The Wiener To The Limit Everybody Fhqwghads! Let's assume stable statistics for the underlying function we're trying to estimate. Furthermore, let's assume ergodicity, which is BTSOTC, but means that we can perfectly determine the statistics from one (maybe infinitely long) realization of this random process. Let's look at what happens to the nature of the Wiener Filter as our data set becomes infinitely large; that is, what is
$$\mathbf{w}_o = \lim_{n\to\infty} \mathbf{w}(n+1)$$
where $\mathbf{w}_o$ is the Wiener solution to the linear least-squares problem.

22 No Math This Time! Well, not really. Let's look at the Wiener Filter equation again:
$$\mathbf{w}(n+1) = \left(\mathbf{X}^T(n)\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\mathbf{d}(n)$$
and simply rewrite it in sum form:
$$\mathbf{w}_o = \lim_{n\to\infty}\left(\sum_{i=1}^{n} \mathbf{x}(i)\mathbf{x}^T(i)\right)^{-1}\left(\sum_{i=1}^{n} \mathbf{x}(i)\,d(i)\right) = \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{dx}$$
(under ergodicity the normalized sums converge to R_xx and r_dx, and the 1/n factors cancel). What do those terms look like? Looks like the Maximum Likelihood Estimator!
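A sketch of the correlation form using sample estimates (my naming; because the 1/n factors cancel, this returns the same weights as the batch formula above):

```python
import numpy as np

def wiener_limit(X, d):
    """w_o = R_xx^{-1} r_dx using sample estimates of the correlations."""
    n = X.shape[0]
    R_xx = X.T @ X / n     # sample autocorrelation matrix of the inputs
    r_dx = X.T @ d / n     # sample cross-correlation between inputs and target
    return np.linalg.solve(R_xx, r_dx)
```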

23 All that back story for this?

24 LMS Motivation: Recall that the Wiener Filter sought to minimize the cumulative error, $\frac{1}{2}\sum_{i=1}^{n} e^2(i)$. To address the memory and complexity requirements of this algorithm, the LMS algorithm seeks instead to minimize the instantaneous error, $E(\hat{\mathbf{w}}) = \frac{1}{2}e^2(n)$, where $\hat{\mathbf{w}}$ is the desired weight vector estimate. That is, it only looks at the current prediction error rather than all the previous prediction errors with respect to the current parameter estimate.

25 Derivation: We should be used to this by now. We want to minimize the error with respect to some unknown weight vector, so we take the derivative:
$$\frac{\partial E(\hat{\mathbf{w}})}{\partial \hat{\mathbf{w}}} = e(n)\,\frac{\partial e(n)}{\partial \hat{\mathbf{w}}}$$
and our error signal is determined, as usual, by:
$$e(n) = d(n) - \mathbf{x}^T(n)\hat{\mathbf{w}}(n)$$
so
$$\frac{\partial e(n)}{\partial \hat{\mathbf{w}}} = -\mathbf{x}(n)$$

26 Derivation (cont.): This gives us an instantaneous estimate of the gradient as:
$$\hat{\mathbf{g}}(n) = \frac{\partial E(\hat{\mathbf{w}})}{\partial \hat{\mathbf{w}}} = -\mathbf{x}(n)\,e(n)$$
which gives us an estimate to use for the method of steepest descent:
$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) - \eta\,\hat{\mathbf{g}}(n) \quad\Longrightarrow\quad \hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,e(n)$$
Look familiar? We've just generalized the Rosenblatt Perceptron Learning Rule for all linear approximators!

27 The LMS Algorithm: Given a training sample x(n), a desired response d(n), and some starting weight vector $\hat{\mathbf{w}}(n)$, compute the error as $e(n) = d(n) - \hat{\mathbf{w}}^T(n)\mathbf{x}(n)$. Update the weight vector as $\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,e(n)$. Repeat. Yes, it's that simple. And powerful!
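A minimal sketch of one LMS update in NumPy (the lecture's demo is in Matlab; the names and returned error are my choices):

```python
import numpy as np

def lms_step(w, x, d, eta):
    """One LMS update: e(n) = d(n) - w^T(n) x(n); w(n+1) = w(n) + eta * x(n) * e(n)."""
    e = d - w @ x          # instantaneous prediction error
    return w + eta * e * x, e
```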

28 Notes on the LMS Algorithm: While the Rosenblatt Perceptron algorithm generally assumes a fixed training set generated ahead of time, and a fixed structure after training is complete, the LMS algorithm assumes that each training pattern is shown only once and that new data is generated continuously. In this sense, the LMS is an adaptive algorithm. The LMS algorithm converges to a random walk around the optimal (Wiener) solution if the training set is stationary (doesn't statistically change over time).

29 What You Say??? The rate of convergence of the LMS algorithm toward the optimal solution (given the assumption of stationarity) is given by the learning curve
$$J(n) = J_{\min} + \eta\,J_{\min}\sum_{k=1}^{M}\frac{\lambda_k}{2-\eta\lambda_k} + \sum_{k=1}^{M}\lambda_k\,\upsilon_k^2(0)\left(1-\eta\lambda_k\right)^{2n}$$
This should be obvious by inspection, so I will not prove it in class. Just kidding. Here the $\lambda_k$'s are the eigenvalues of the correlation matrix of the data, $J_{\min}$ is the minimum mean-squared error produced by the Wiener Filter, and $\upsilon_k(0)$ is the initial state of the Markov model of the error, premultiplied by a matrix whose rows are the eigenvectors of the correlation matrix! It's all so simple!!!!

30 Which of course is ancient Greek for The Populace of Matlab

31 The Sequence of Interest: Today we'll be examining the autoregressive sequence
$$y(n) = \sum_{i=1}^{m} w_i\,y(n-i) + \varepsilon$$
We'll look at this from several standpoints: parameter estimation of the w vector using ML and MAP estimators; the Wiener filter approximation of the function; the LMS filter approximation of the function.

32 Makin' Data: The data set is constructed as:
$$\mathbf{X} = \begin{bmatrix} y(1) & y(2) & \cdots & y(n) \\ y(2) & y(3) & \cdots & y(n+1) \\ \vdots & \vdots & & \vdots \\ y(m) & y(m+1) & \cdots & y(n+m-1) \end{bmatrix}^T \qquad \mathbf{d} = \begin{bmatrix} y(m+1) & y(m+2) & \cdots & y(m+n) \end{bmatrix}^T$$
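A sketch of this construction (my own helper, not the lecture's Matlab script); it builds X with one training sample per row and the matching target vector d, which is what the transposed matrix above amounts to:

```python
import numpy as np

def make_ar_data(y, m):
    """Build X (n-by-m) and d (n,) for an order-m autoregressive fit.

    Row j of X holds [y(j), ..., y(j+m-1)] and its target is y(j+m),
    matching the slide's X and d (here with 0-based Python indexing).
    """
    n = len(y) - m
    X = np.array([y[k:k + m] for k in range(n)])
    d = np.asarray(y[m:m + n])
    return X, d
```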

33 ML and MAP: Given the entire data set a priori, we want to estimate the parameters w that generated that data, assuming the output was a linear combination of the inputs:
$$\hat{\mathbf{w}}_{ML} = \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{dx} \qquad \hat{\mathbf{w}}_{MAP} = \left(\mathbf{R}_{xx} + \lambda\mathbf{I}\right)^{-1}\mathbf{r}_{dx}$$
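A sketch of the two estimators using sample correlations (my naming; the value of λ is an illustrative prior weight, not one from the lecture):

```python
import numpy as np

def ml_map_estimates(X, d, lam=0.1):
    """ML and MAP (regularized) parameter estimates from the whole data set."""
    n = len(d)
    R_xx = X.T @ X / n                  # sample R_xx
    r_dx = X.T @ d / n                  # sample r_dx
    w_ml = np.linalg.solve(R_xx, r_dx)
    w_map = np.linalg.solve(R_xx + lam * np.eye(X.shape[1]), r_dx)
    return w_ml, w_map
```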

34 Wiener Filter: Given all the data up to time n, we want to estimate the parameters that generated that data. We derived this for the special case of a linear filter:
$$\mathbf{w}(n+1) = \left(\mathbf{X}^T(n)\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\mathbf{d}(n)$$

35 The Least Mean Square Filter: Given only the data at time n and a current weight estimate, we want to update our estimate of the parameters that generated the data:
$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,e(n)$$
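A short end-to-end sketch tying these pieces together on a synthetic AR(2) sequence: generate the data, then run a single LMS pass over the stream. The coefficients, noise level, and step size are all made up for illustration; the lecture's actual demo is a Matlab script.

```python
import numpy as np

# Generate an illustrative AR(2) sequence y(n) = a1*y(n-1) + a2*y(n-2) + noise.
rng = np.random.default_rng(0)
a1, a2 = 0.6, -0.3                       # made-up coefficients
y = np.zeros(5000)
for n in range(2, len(y)):
    y[n] = a1 * y[n - 1] + a2 * y[n - 2] + 0.1 * rng.standard_normal()

# One pass of LMS over the stream, with x(n) = [y(n-1), y(n-2)].
w = np.zeros(2)
eta = 0.5                                # illustrative step size
for n in range(2, len(y)):
    x = np.array([y[n - 1], y[n - 2]])
    e = y[n] - w @ x                     # instantaneous error e(n)
    w = w + eta * x * e                  # LMS update
print(w)                                 # should wander near [a1, a2]
```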

36 Something useful this way comes.

37 The Nature of Data: Recall that there are two primary tasks for neural networks (and the other tools we've been learning about): classification and function approximation. So far, all we've talked about is two-class classification; that is, given a data point, does it belong to one class or the other? What about more classes?

38 I Borrowed This From Some Website.

39 Data Big and Small: Example: character recognition via neural networks. Suppose we want to train a neural network to recognize handwritten characters digitized into black-and-white, 30x20-pixel images. We could vectorize each image into a SIX HUNDRED ELEMENT input vector. This just might be computationally intensive. OR we could perform feature extraction.

40 Feature Extraction: Consider some features for character recognition: number of black pixels (add them up!): a 'd' probably contains more writing than an 'i'; character height (number of pixels between the bottom-most black pixel and the top-most black pixel): 'l' and 'h' are taller than 'e' and 'c'; character width (number of pixels between the left-most black pixel and the right-most black pixel): 'e' and 'b' are wider than 'i' and 't'. Others? The (ideal) goal is to find the smallest feature set that fully allows for identification of all the input classes.
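A sketch of those three features for a binary image array (my own helper, assuming 1 marks a black pixel):

```python
import numpy as np

def char_features(img):
    """Pixel count, height, and width of a binary (0/1) character image."""
    rows, cols = np.nonzero(img)
    if rows.size == 0:
        return np.zeros(3)
    n_black = rows.size                      # total black pixels
    height = rows.max() - rows.min() + 1     # bottom-most to top-most black pixel
    width = cols.max() - cols.min() + 1      # left-most to right-most black pixel
    return np.array([n_black, height, width])
```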

41 How Much Is Too Much? Rule of thumb: more data is always better. The bigger the training set, the better the trained learning machine will be at generalization. The reality, of course, is that more data equals longer training time! Character recognition example: have lots of people write lots of individual letters of the alphabet.
