Machine Learning and Adaptive Systems. Lectures 5 & 6

ECE656 - Lectures 5 & 6, Professor, Department of Electrical and Computer Engineering, Colorado State University, Fall 2015

c. Performance Learning - LMS Algorithm (Widrow 1960)

The iterative procedure in steepest descent requires knowledge of the exact gradient $\nabla J(w_i(k))$. In practice, the gradient is not known and must be estimated from the instantaneous value of the squared error (or from instantaneous estimates of $R_{xx}$ and $R_{xd}$):

$$J(w_i(k)) \approx e_i^2(k) = (d_i(k) - net_i(k))^2, \qquad net_i(k) = w_i^t(k)\,x(k)$$

$$\frac{\partial J(w_i(k))}{\partial w_i(k)} \approx -2\,x(k)\,e_i(k)$$

Then, from the gradient descent rule (also called Stochastic Gradient Descent), with the factor of 2 absorbed into the step size,

$$w_i(k+1) = w_i(k) + \mu\,x(k)\,e_i(k)$$

Note that the same result can be obtained by using the instantaneous values of $R_{xx}$ and $R_{xd}$ instead of the actual values in the steepest descent rule, i.e.

$$\hat{R}_{xx} = x(k)\,x^t(k), \qquad \hat{R}_{xd} = x(k)\,d_i(k)$$

$$w_i(k+1) = w_i(k) + \mu\,[\hat{R}_{xd} - \hat{R}_{xx}\,w_i(k)] = w_i(k) + \mu\,x(k)\,[d_i(k) - x^t(k)\,w_i(k)] = w_i(k) + \mu\,x(k)\,e_i(k)$$
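The LMS update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name `lms`, the target weights `w_true`, the noise level, and the step size `mu = 0.05` are all assumptions chosen for the demo.

```python
import numpy as np

def lms(X, d, mu, w0=None):
    """One pass of LMS over the rows of X; returns final weights and e^2(k)."""
    K, N = X.shape
    w = np.zeros(N) if w0 is None else np.asarray(w0, dtype=float).copy()
    sq_err = np.empty(K)
    for k in range(K):
        e = d[k] - w @ X[k]       # e(k) = d(k) - w^t(k) x(k)
        w = w + mu * X[k] * e     # w(k+1) = w(k) + mu x(k) e(k)
        sq_err[k] = e ** 2        # instantaneous squared error (learning curve)
    return w, sq_err

# Synthetic data (assumption): a linear target observed with small noise.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((2000, 3))
d = X @ w_true + 0.01 * rng.standard_normal(2000)

w_lms, sq_err = lms(X, d, mu=0.05)
```

Plotting `sq_err` against `k` gives the learning curve of the run; with a smaller `mu` the curve decays more slowly but settles closer to the minimum.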

Remarks

1. The plot of $J(w_i(k))$ as a function of $k$ is the learning curve of the LMS.

2. Comparing LMS with Amari's rule, $r(w_i(k), x(k), d_i(k)) = e_i(k)$, i.e. the error is the learning signal.

3. If we use the sample correlation and cross-correlation matrices instead of the actual values in the gradient descent rule, i.e. $\hat{R}_{xx} = X^t X$ and $\hat{R}_{xd} = X^t d_i$, then

$$w_i(k+1) = w_i(k) + \mu\,X^t[d_i - X w_i(k)] = w_i(k) + \mu\,X^t e_i = w_i(k) + \mu \sum_{l=1}^{K} x(l)\,e_i(l),$$

i.e. each LS-based update accumulates the LMS corrections over one pass through the training data (hence the name Batch Gradient Descent). When $K \to \infty$ and the process is ergodic, LMS $\to$ Wiener-Hopf.

4. Owing to the presence of noise and perturbations in the gradient estimate at each iteration, LMS does not end up at the global minimum but wanders around it. This misadjustment is measured by

$$M = \frac{\text{Average Excess MSE}}{\text{Min. MSE}} = \frac{\text{Average}(J(w_i(k)) - J_{min})}{J_{min}}$$

It can be shown that

$$M = \mu\,\mathrm{tr}[R_{xx}] = \mu \sum_{i=1}^{N} \lambda_i,$$

where $\lambda_i$ is the $i$th eigenvalue of $R_{xx}$, i.e. $M$ can be reduced (not eliminated) by reducing $\mu$. This, however, presents a trade-off between speed and accuracy.
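Remark 3 can be checked numerically. The sketch below iterates the batch update $w \leftarrow w + \mu X^t(d - Xw)$ and compares the result with the least-squares solution from `np.linalg.lstsq`. The data, the number of iterations, and the choice $\mu = 1/\lambda_{max}(X^t X)$ are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 200, 3
X = rng.standard_normal((K, N))                  # rows are x^t(l)
d = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.standard_normal(K)

# Step size inside the stable range 0 < mu < 2 / lambda_max(X^t X).
lam_max = np.linalg.eigvalsh(X.T @ X).max()
mu = 1.0 / lam_max

w = np.zeros(N)
for _ in range(500):
    e = d - X @ w              # error vector over the whole training set
    w = w + mu * X.T @ e       # batch update: sum over l of x(l) e(l)

# Reference least-squares solution of X w = d.
w_ls = np.linalg.lstsq(X, d, rcond=None)[0]
```

After enough iterations `w` and `w_ls` agree to numerical precision, which is the sense in which batch gradient descent recovers the LS solution.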

Example 1:

(a) For small learning rate, show that the mean of the weight vector estimate satisfies the state equation

$$m(n) = (I - \mu R_{xx})^n\,[m(0) - m(\infty)] + m(\infty),$$

where $m(k) = E[w(k)]$.

(b) Show that, as with gradient descent, convergence requires $0 < \mu < \frac{2}{\lambda_{max}}$, where $\lambda_{max}$ is the maximum eigenvalue of $R_{xx}$.

(a) Rewrite the weight update equation using $e(k) = d(k) - x^t(k)\,w(k)$ as

$$w(k+1) = A(k)\,w(k) + \mu\,x(k)\,d(k), \qquad A(k) = I - \mu\,x(k)\,x^t(k).$$

Now define the weight error $\epsilon(k) = w(k) - w^*$, where $w^*$ stands for the optimum Wiener-Hopf solution. Then we get

$$\epsilon(k+1) = A(k)\,\epsilon(k) + f(k), \qquad f(k) = \mu\,x(k)\,[d(k) - w^{*t} x(k)].$$

Using Kushner's direct averaging method (see Section 3.8) for small learning rates and taking $E[\cdot]$ of the above stochastic state equation gives

$$E[\epsilon(k+1)] = (I - \mu R_{xx})\,E[\epsilon(k)] = \bar{A}\,E[\epsilon(k)], \qquad \bar{A} = E[A(k)],$$

since $E[f(k)] = 0$ by the orthogonality principle, $E[x(k)(d(k) - w^{*t} x(k))] = 0$.

Since $m(\infty) = w^*$, the above equation can be written as

$$m(k+1) - m(\infty) = (I - \mu R_{xx})\,(m(k) - m(\infty)),$$

which is a state equation with no excitation. Solving this state equation for the initial condition $(m(0) - m(\infty))$ yields

$$m(n) - m(\infty) = (I - \mu R_{xx})^n\,(m(0) - m(\infty)).$$

(b) Matrix $R_{xx}$ can be diagonalized using $R_{xx} = Q \Lambda Q^t$, where the diagonal matrix $\Lambda$ contains all the eigenvalues of $R_{xx}$ and $Q$ contains the associated eigenvectors as its columns. Then, using $Q Q^t = I$, we can write

$$m(n) - m(\infty) = (Q Q^t - \mu Q \Lambda Q^t)^n\,(m(0) - m(\infty)) \;\Longrightarrow\; \zeta(n) = (I - \mu \Lambda)^n\,\zeta(0),$$

where $\zeta(n) = Q^t (m(n) - m(\infty))$. For stability and convergence we require that $|1 - \mu \lambda_i| < 1 \;\; \forall i \in [1, N]$. Thus, the necessary and sufficient condition for convergence is

$$0 < \mu < \frac{2}{\lambda_{max}}.$$

Remark

Note that we can fit an exponential to $(1 - \mu \lambda_i) = e^{-1/\tau_i}$, where $\tau_i$ is the time constant of the $i$th mode. The slowest learning mode is determined by $\lambda_{min}$, while the fastest is dictated by $\lambda_{max}$. Thus, if the eigenvalues are widely spread, the settling time is decided by the smallest eigenvalue.
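The step-size bound and the modal time constants can be computed directly from the eigen-decomposition. The sketch below uses a hypothetical $2 \times 2$ correlation matrix `R_xx`, an assumed initial mean error `err0`, and a step size set to one tenth of the bound; none of these values come from the lecture.

```python
import numpy as np

# Hypothetical correlation matrix (assumption for illustration).
R_xx = np.array([[2.0, 0.5],
                 [0.5, 1.0]])

lam, Q = np.linalg.eigh(R_xx)        # R_xx = Q diag(lam) Q^t, lam ascending
mu_max = 2.0 / lam.max()             # convergence requires 0 < mu < mu_max
mu = 0.1 * mu_max

# Time constants from (1 - mu * lam_i) = exp(-1 / tau_i).
tau = -1.0 / np.log(1.0 - mu * lam)

# Mean weight error after n steps, computed two ways.
err0 = np.array([1.0, -1.0])         # m(0) - m(inf), assumed
n = 50
err_n = np.linalg.matrix_power(np.eye(2) - mu * R_xx, n) @ err0
zeta_n = (1.0 - mu * lam) ** n * (Q.T @ err0)   # decoupled modes
```

Here `tau[0]` belongs to $\lambda_{min}$ and is the largest time constant, so that mode dominates the settling time.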

d. Performance Learning - Perceptron Rule (Rosenblatt 1958)

In contrast to Widrow's ADALINE, this supervised learning rule uses the actual output to generate the learning signal, i.e. $r = d_i - o_i$, and

1. Uses BHL (or $\mathrm{sgn}(\cdot)$ function) as the activation function,
2. Uses a binary $\pm 1$ desired signal.

That is,

$$o_i(k) = \mathrm{sgn}(net_i(k)), \qquad net_i(k) = w_i^t(k)\,x(k)$$

$$e_i(k) = d_i(k) - o_i(k), \qquad \Delta w_i(k) = \mu\,x(k)\,e_i(k)$$

Thus, the Perceptron updating rule is

$$w_i(k+1) = w_i(k) + \mu\,[d_i(k) - \mathrm{sgn}(net_i(k))]\,x(k),$$

i.e. the weights are only adjusted if $o_i$ is incorrect, since $d_i = o_i = 1$ or $d_i = o_i = -1$ $\Rightarrow$ $e_i(k) = 0$, i.e. no learning. If $d_i = 1$, $o_i = -1$ then $e_i(k) = 2$, and if $d_i = -1$, $o_i = 1$ then $e_i(k) = -2$, so the learning rule becomes

$$w_i(k+1) = w_i(k) \pm 2\mu\,x(k).$$

Assuming $w_i(0) = 0$, after one epoch

$$w_i^o = 2\mu \sum_{k \in R_1} x(k) - 2\mu \sum_{k \in R_2} x(k),$$

where $R_1$, $R_2$ are the subsets of data indices that are misclassified (with $e_i(k) = 2$ and $e_i(k) = -2$, respectively).
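A minimal sketch of the Perceptron rule follows. The helper name `perceptron_train`, the AND-like toy data with a constant bias input, the convention $\mathrm{sgn}(0) = +1$, and the values `mu = 0.5` and `epochs = 20` are assumptions made for this example.

```python
import numpy as np

def perceptron_train(X, d, mu=0.5, epochs=20):
    """Perceptron rule with bipolar (+1/-1) targets; rows of X are inputs."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        n_errors = 0
        for x, target in zip(X, d):
            o = 1.0 if w @ x >= 0 else -1.0   # sgn(net), with sgn(0) = +1 (assumption)
            e = target - o                    # 0 if correct, otherwise +2 or -2
            if e != 0:
                w = w + mu * e * x            # w <- w +/- 2 mu x
                n_errors += 1
        if n_errors == 0:                     # a full pass with no updates
            break
    return w

# Linearly separable toy problem; the last column is a constant bias input.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1], dtype=float)

w = perceptron_train(X, d)
pred = np.where(X @ w >= 0, 1.0, -1.0)
```

Because the weights change only on misclassified samples, training stops as soon as one full pass produces no update.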

e. Performance Learning - Delta Rule (McClelland & Rumelhart 1986)

In contrast to the Perceptron rule, this supervised learning rule:

1. Uses continuous output (i.e. continuous activation function),
2. Uses a differentiable activation function,
3. Circumvents problems of all previous learning rules.

[Figure: cell $i$ with input $x(k)$, weights $w_{i1}, \ldots, w_{ij}, \ldots, w_{iN}$, net input $net_i(k)$, activation $f(\cdot)$, output $o_i(k)$, and a Delta Rule block driven by $f^{(1)}(\cdot)$ and the error $e_i(k) = d_i(k) - o_i(k)$.]

The Delta rule is a generalization of the Widrow-Hoff LMS rule in which we use the instantaneous error

$$\xi(k) = e_i^2(k) = (d_i(k) - o_i(k))^2 = (d_i(k) - f(net_i(k)))^2.$$

Taking the partial derivative of this error with respect to $w_i$ gives

$$\frac{\partial \xi(k)}{\partial w_i(k)} = -2\,\frac{\partial f(net_i(k))}{\partial w_i(k)}\,e_i(k).$$

Using the chain rule,

$$\frac{\partial f(net_i(k))}{\partial w_i(k)} = \frac{\partial f(net_i(k))}{\partial net_i(k)}\,\frac{\partial net_i(k)}{\partial w_i(k)}.$$

But $net_i(k) = w_i^t(k)\,x(k)$, and hence

$$\frac{\partial net_i(k)}{\partial w_i(k)} = x(k).$$

e. Performance Learning - Delta Rule (Cont.)

Thus, we get

$$\frac{\partial \xi(k)}{\partial w_i(k)} = -2\,f'(w_i^t(k)\,x(k))\,x(k)\,e_i(k) = -2\,f'(net_i(k))\,x(k)\,e_i(k),$$

where $f'(net_i(k)) = \frac{\partial f(net_i(k))}{\partial net_i(k)}$. Now, using gradient descent, we get the delta updating rule

$$w_i(k+1) = w_i(k) - \tfrac{1}{2}\,\mu\,\nabla \xi(k) = w_i(k) + \mu\,f'(w_i^t(k)\,x(k))\,x(k)\,e_i(k).$$

Remarks:

1. For the delta rule, the learning signal is $r(w_i(k), x(k), d_i(k)) = f'(w_i^t(k)\,x(k))\,e_i(k) = f'(w_i^t(k)\,x(k))\,(d_i(k) - o_i(k))$.

2. For the unipolar sigmoidal activation function $o(net_i) = f(net_i) = \frac{1}{1 + e^{-\lambda\,net_i}}$, we have $f'(net_i(k)) = \lambda\,o_i(k)\,(1 - o_i(k))$.

3. For linear neurons, the Delta rule becomes the LMS rule.
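The delta rule with the unipolar sigmoid can be sketched as below. The function names (`sigmoid`, `delta_rule_step`, `sse`), the OR-like toy data with a bias input, and the values `mu = 1.0` and 2000 epochs are assumptions for illustration. The sketch also checks the identity $f' = \lambda\,o\,(1 - o)$ against a central finite difference.

```python
import numpy as np

def sigmoid(net, lam=1.0):
    """Unipolar sigmoid f(net) = 1 / (1 + exp(-lam * net))."""
    return 1.0 / (1.0 + np.exp(-lam * net))

def delta_rule_step(w, x, d, mu, lam=1.0):
    """One delta-rule update; returns the new weights and e = d - o."""
    o = sigmoid(w @ x, lam)
    f_prime = lam * o * (1.0 - o)        # f'(net) for the unipolar sigmoid
    e = d - o
    return w + mu * f_prime * e * x, e   # w <- w + mu f'(net) x e

def sse(w, X, d, lam=1.0):
    """Sum of squared errors over the data set."""
    return float(np.sum((d - sigmoid(X @ w, lam)) ** 2))

# Finite-difference check of f'(net) = lam * o * (1 - o) at lam = 1.
net0, h = 0.3, 1e-6
numeric = (sigmoid(net0 + h) - sigmoid(net0 - h)) / (2 * h)
analytic = sigmoid(net0) * (1.0 - sigmoid(net0))

# OR-like toy problem with unipolar targets; last column is a bias input.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([0, 1, 1, 1], dtype=float)

w = np.zeros(3)
sse_start = sse(w, X, d)
for epoch in range(2000):
    for x_k, d_k in zip(X, d):
        w, _err = delta_rule_step(w, x_k, d_k, mu=1.0)
sse_end = sse(w, X, d)
```

Replacing `sigmoid` with the identity (so that `f_prime` is 1) turns `delta_rule_step` into the LMS update, in line with Remark 3.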

f. Performance Learning - RLS Rule (Azimi-Sadjadi 1992)

In contrast to the LMS rule, this supervised learning rule:

1. Uses SE with limited memory (forgetting factor) to rely mostly on recent data,
2. Uses a continuous threshold logic activation function,
3. Is significantly faster than LMS with no accuracy-speed trade-offs,
4. Is more suited for non-stationary environments.

[Figure: cell $i$ with input $x(k)$, weights $w_{i1}, \ldots, w_{ij}, \ldots, w_{iN}$, net input $net_i(k)$, threshold logic activation $f(net)$ that ramps up to 1 at $net = a$, output $o_i(k)$, and an RLS Rule block driven by the error $e_i(k) = d_i(k) - o_i(k)$.]

Objective: Given $\{x(k), d_i(k)\}_{k=1}^{K}$, find the optimum $w_i(n)$ at the (current) iteration $n$ to minimize the SE with limited memory,

$$J(w_i(n)) = \frac{1}{2} \sum_{k=1}^{n} \gamma^{n-k}\,e_i^2(k) = \frac{1}{2} \sum_{k=1}^{n} \gamma^{n-k}\,(d_i(k) - o_i(k))^2,$$

where $0 < \gamma < 1$ (e.g., $\gamma = 0.99$) is a forgetting factor that weights recent data and forgets the old data.

f. Performance Learning - RLS Rule (Cont.)

In this rule a threshold-logic activation is used, i.e. we have

$$o_i(k) = f(net_i(k)) = \begin{cases} 0 & net_i(k) \le 0 \\ net_i(k)/a & 0 < net_i(k) < a \\ 1 & net_i(k) \ge a \end{cases}$$

where $net_i(k) = w_i^t(k)\,x(k)$.

Assuming that the current weight vector $w_i(n)$ is used in place of the previous weights, i.e. $w_i(k) = w_i(n) \;\; \forall k \in [1, n-1]$, taking the derivative of $J(w_i(n))$ with respect to $w_i(n)$ and setting it to zero yields the normal equation

$$\frac{1}{a} \sum_{k=1}^{n} \gamma^{n-k}\,x(k)\,\Big(d_i(k) - \frac{1}{a}\,x^t(k)\,\hat{w}_i(n)\Big) = 0.$$

Note that learning only happens when $net_i$ is within the ramp part of the activation function.

The above normal equation can be solved iteratively using the weighted Recursive Least Squares (RLS) as follows:

$$K(n) = \frac{P(n-1)\,x(n)}{\gamma + x^t(n)\,P(n-1)\,x(n)} \quad \text{: Gain Calculation} \qquad (1)$$

$$P(n) = \gamma^{-1}\,[I - K(n)\,x^t(n)]\,P(n-1) \quad \text{: Updating Inverse Correlation} \qquad (2)$$

$$\hat{w}_i(n) = \hat{w}_i(n-1) + K(n)\,\Big[d_i(n) - \frac{1}{a}\,x^t(n)\,\hat{w}_i(n-1)\Big] \quad \text{: Weight Updating} \qquad (3)$$

where $P(n) = R^{-1}(n)$ and $R(n) = \frac{1}{a^2} \sum_{k=1}^{n} \gamma^{n-k}\,x(k)\,x^t(k)$ is the weighted correlation matrix of the data. Start with $P(0) = \delta I$ for small $\delta$, e.g. $\delta = 0.5$, and randomly initialized weights.
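Recursions (1)-(3) can be sketched as follows. To keep the example simple it makes several assumptions that are not in the slides: $a = 1$, every sample is treated as lying on the ramp (linear) part of the activation, the weights start at zero rather than at random values, and the data come from a hypothetical linear target `w_true`.

```python
import numpy as np

def rls(X, d, gamma=0.99, delta=0.5, a=1.0):
    """Weighted RLS following recursions (1)-(3); rows of X are x^t(n)."""
    K, N = X.shape
    P = delta * np.eye(N)      # P(0) = delta * I
    w = np.zeros(N)            # zero initial weights (assumption)
    for n in range(K):
        x = X[n]
        Px = P @ x
        k_gain = Px / (gamma + x @ Px)               # (1) gain calculation
        P = (P - np.outer(k_gain, x) @ P) / gamma    # (2) inverse correlation update
        w = w + k_gain * (d[n] - (x @ w) / a)        # (3) weight update
    return w

# Synthetic linear data (assumption): all samples treated as in the ramp region.
rng = np.random.default_rng(2)
w_true = np.array([0.3, -0.7, 1.2])
X = rng.standard_normal((500, 3))
d = X @ w_true + 0.01 * rng.standard_normal(500)

w_rls = rls(X, d)
```

The gain is computed from $P(n-1)$ before either update, so the order of steps (2) and (3) inside the loop does not affect the result.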

Hebbian Rule (Hebb 1949)

Donald Hebb, a neuro-psychologist, postulated a mechanism for learning at the cellular level in the brain.

Hebb's idea: When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.

Alternatively, we can break this up into two rules:

a) If two cells (A, B) (or $(i, j)$) on either side of a synaptic weight ($w_{ij}$) are activated simultaneously or synchronously, then the strength of that synapse is selectively increased.

b) Whereas if they are activated asynchronously, then the synapse is selectively weakened (or eliminated).

[Figure: pre-synaptic cell $j$ with activity $x_j$ connected through weight $w_{ij}$ to post-synaptic cell $i$ with output $o_i$.]

Thus the Hebbian synapse uses a time-dependent, highly local, and strongly interactive mechanism to increase synaptic efficiency as a function of the correlation between presynaptic and postsynaptic activities. That is,

$$\Delta w_{ij} = \mu\,x_j\,o_i.$$

Writing this for all $j \in [1, N]$ cells connected to cell $i$ yields the Hebbian learning rule

$$w_i(k+1) = w_i(k) + \mu\,x(k)\,o_i(k).$$

Hebbian Rule (Cont.)

Remarks:

1. No desired signal, hence an unsupervised learning rule.
2. The learning signal in this case is $r(w_i(k), x(k)) = o_i(k)$.
3. Used for many applications including associative memories (either auto-associative or hetero-associative), PCA extraction, etc.

Example 1:

(a) Derive Hebb's learning rule for a network with $M$ output cells.

(b) For the data set $\{x(k), o(k)\}_{k=0}^{K-1}$ and $W(0) = 0$, where $W(k) = [w_1(k), \ldots, w_M(k)]^t$ is the weight matrix ($M \times N$), show that the network in (a) is a perfect linear associator if the $x(k)$'s are orthonormal.

[Figure: single-layer network with inputs $1, 2, \ldots, N$ fed by $x(k)$, weight vectors $w_i(k)$, and outputs $o_1(k), \ldots, o_i(k), \ldots, o_M(k)$.]

(a) For cell $i$ we have

$$w_i(k+1) = w_i(k) + \mu\,x(k)\,o_i(k), \qquad \forall i \in [1, M].$$

Assuming the same $\mu$ for all $M$ cells, then

$$W(k+1) = W(k) + \mu\,o(k)\,x^t(k),$$

where $o(k) = [o_1(k), \ldots, o_M(k)]^t$ is the output vector.

Hebbian Rule (Cont.)

(b)

$$W(1) = \mu\,o(0)\,x^t(0)$$

$$W(2) = \mu\,o(0)\,x^t(0) + \mu\,o(1)\,x^t(1)$$

$$\vdots$$

$$W(K) = W = \mu \sum_{k=0}^{K-1} o(k)\,x^t(k) \quad \text{: Outer Product Sum}$$

At this point, the network is trained and has stored $K$ patterns (i.e. Storage Phase of the Associative Memory). Now, if the patterns are orthonormal, i.e.

$$x^t(i)\,x(j) = \delta(i - j) = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$

then postmultiplying the outer product sum by $x(l)$ gives

$$W x(l) = \mu \Big( \sum_{k=0}^{K-1} o(k)\,x^t(k) \Big) x(l) = \mu\,o(l),$$

i.e. a perfect associator (i.e. Recall Phase).

Notes:

(a) If the $x(k)$'s are normalized but not orthogonal, then

$$W x(l) = \mu\,o(l) + \mu \Big( \sum_{k=0,\,k \neq l}^{K-1} o(k)\,x^t(k) \Big) x(l),$$

where the second term shows the effects of cross-talk.

(b) Also, if $\tilde{x} = x(l) + \varepsilon(l)$, where $\varepsilon(l)$ represents a perturbation/deformation, then

$$W \tilde{x} = \mu\,o(l) + \mu \Big( \sum_{k=0}^{K-1} o(k)\,x^t(k) \Big) \varepsilon(l).$$
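The storage and recall phases can be sketched numerically. The dimensions, the random output patterns `O`, and the use of a QR factorization to produce orthonormal input patterns are assumptions for this illustration and are not the patterns used in the lecture examples.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 4, 3, 3

# Orthonormal input patterns x(k) as columns (built via QR; an assumption).
Q_full, _ = np.linalg.qr(rng.standard_normal((N, N)))
X = Q_full[:, :K]
O = rng.standard_normal((M, K))          # output patterns o(k) as columns

# Storage phase: W(k+1) = W(k) + mu * o(k) x^t(k), starting from W(0) = 0.
mu = 1.0
W = np.zeros((M, N))
for k in range(K):
    W += mu * np.outer(O[:, k], X[:, k])

# Recall phase: column l of W @ X is W x(l).
recall = W @ X

# Perturbed input: the recall error is the extra term W @ eps.
eps = 0.05 * rng.standard_normal(N)
noisy_recall = W @ (X[:, 0] + eps)
```

With orthonormal inputs `recall` equals `mu * O`, while `noisy_recall` differs from `mu * O[:, 0]` by exactly `W @ eps`.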

Hebbian Rule (Cont.)

Example 2: Let

$x(1) = [\ ]^t$, $o(1) = [5, 1, 0]^t$
$x(2) = [\ ]^t$, $o(2) = [2, 1, 6]^t$
$x(3) = [\ ]^t$, $o(3) = [2, 4, 3]^t$

(a) Store these patterns using Hebb's rule.

(b) Let $x = [.8, .15, .15, .2]^t$. Find the associated $o$. What is the closest output pattern to $o$ (in the $\|\cdot\|_2$ sense)?

Clearly, the $x(k)$'s are orthonormal, so the results in Example 1 apply here and the recall is perfect for the $x(k)$'s. The weight matrix after the storage phase is (assume $\mu = 1$)

$$W = \sum_{k=1}^{3} o(k)\,x^t(k) = [\ ]$$

Now, what would happen if we apply $x = [.8, .15, .15, .2]^t$, which is a perturbed version of $x(1)$?

$$o = W x = [4, 1.25, .45]^t$$

The squared Euclidean distances to the $o(k)$'s are: $\|o - o_1\|^2 = 1.26$, $\|o - o_2\|^2 = [\ ]$, $\|o - o_3\|^2 = [\ ]$, i.e. the memory generates a pattern that is closest to $o_1$.

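Hebbian Rule (Cont.)

Example 3: If the $x(k)$'s are not orthogonal, devise a way to choose the best weights to obtain a linear associator.

Here, we choose to minimize the average SE,

$$J(W) = \frac{1}{K} \sum_{k=0}^{K-1} \|o(k) - W x(k)\|^2 = \frac{1}{K} \sum_{k=0}^{K-1} \|o(k) - \hat{o}(k)\|^2.$$

Let us define the matrices

$$O = [o(0), \ldots, o(K-1)]_{M \times K}, \qquad X = [x(0), \ldots, x(K-1)]_{N \times K}.$$

Then, the average SE can be rewritten as

$$J(W) = \frac{1}{K}\,\mathrm{tr}[(O - W X)(O - W X)^t].$$

Minimizing with respect to the matrix $W$ yields

$$\frac{\partial J(W)}{\partial W} = -\frac{2}{K}\,(O - W X)\,X^t = 0 \quad \text{or} \quad W = O X^t (X X^t)^{-1}.$$

Note that to get this result we used the matrix derivative properties

$$\frac{\partial\,\mathrm{tr}(XA)}{\partial X} = A^t, \qquad \frac{\partial\,\mathrm{tr}(A X^t)}{\partial X} = A, \qquad \frac{\partial\,\mathrm{tr}(X A X^t)}{\partial X} = X A + X A^t.$$

If the $x(k)$'s are orthonormal (with $K = N$, so that $X$ is square), then $X X^t = I$, $X^t x(k) = [0 \ldots 1 \ldots 0]^t$ with the 1 in the $k$th position, and hence $W x(k) = o(k)$.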
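The closed-form weights $W = O X^t (X X^t)^{-1}$ can be compared with the plain Hebbian outer-product sum on non-orthogonal patterns. The dimensions and random patterns below are assumptions for illustration, and the sketch requires $K \ge N$ so that $X X^t$ is invertible. It uses `np.linalg.solve` on $(X X^t) W^t = X O^t$ instead of forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, K = 3, 2, 6

X = rng.standard_normal((N, K))
X /= np.linalg.norm(X, axis=0)       # normalized but not orthogonal columns x(k)
O = rng.standard_normal((M, K))      # columns o(k)

# W = O X^t (X X^t)^{-1}, obtained by solving (X X^t) W^t = X O^t.
W_ls = np.linalg.solve(X @ X.T, X @ O.T).T

# Hebbian outer-product sum with mu = 1, for comparison.
W_hebb = O @ X.T

def avg_se(W):
    """Average squared error J(W) = (1/K) * sum_k ||o(k) - W x(k)||^2."""
    return float(np.sum((O - W @ X) ** 2) / K)

# Gradient of J at W_ls; it should vanish at the minimizer.
grad = -(2.0 / K) * (O - W_ls @ X) @ X.T
```

Since `W_ls` minimizes $J(W)$, `avg_se(W_ls)` cannot exceed `avg_se(W_hebb)`; the gap between the two reflects the cross-talk in the Hebbian weights.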
