Principal Components Analysis and Unsupervised Hebbian Learning

Robert Jacobs
Department of Brain & Cognitive Sciences
University of Rochester
Rochester, NY 14627, USA

August 8, 2008

Reference: Much of the material in this note is from Haykin, S. (1994). Neural Networks. New York: Macmillan.

Here we consider training a single-layer neural network (no hidden units) with an unsupervised Hebbian learning rule. It seems sensible that we might want the activation of an output unit to vary as much as possible when given different inputs. After all, if this activation were a constant for all inputs then the activation would not be telling us anything about which input is present. In contrast, if the activation has very different values for different inputs then there is some hope of knowing something about the input based just on the output unit's activation. So we might want to train the output unit to adapt its weights so as to maximize the variance of its activation. However, one way in which the unit might do this is by making its weights arbitrarily large. This would not be a satisfying solution. We would prefer a solution in which the variance of the output activation is maximized subject to the constraint that the weights are bounded. For example, we might want to constrain the length of the weight vector to be one.

Lagrange optimization is a method for maximizing (or minimizing) a function subject to one or more constraints. At this point, we take a brief detour and give an example of Lagrange optimization. From this example, you should get the general idea of how this method works.

Example: Suppose that we want to find the shortest vector y = [y_1 y_2]^T that lies on the line 2y_1 − y_2 − 5 = 0. We write down an objective function L in which the first term states the function that we want to minimize, and the second term states the constraint:

    L = (1/2)(y_1^2 + y_2^2) + λ(2y_1 − y_2 − 5)    (1)

where λ is called a Lagrange multiplier (it is always placed in front of the constraint). The first term gives the length of the vector (actually it is 1/2 times the length of the vector squared); the second term gives the constraint. We now take derivatives, and set these derivatives equal to zero:

    ∂L/∂y_1 = y_1 + 2λ = 0    (2)
    ∂L/∂y_2 = y_2 − λ = 0    (3)
    ∂L/∂λ = 2y_1 − y_2 − 5 = 0.    (4)

Note that we have three equations and three unknown variables. From the first derivative we know that y_1 = −2λ, and from the second derivative we know that y_2 = λ. If we plug these values into the third derivative, we get −4λ − λ = 5, or λ = −1. From this, we can now solve for y_1 and y_2; in particular, y_1 = 2 and y_2 = −1. So the solution to our problem is y = [2 −1]^T.
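As a concrete check of this arithmetic, here is a small Python sketch (not part of the derivation above; it uses numpy, and the variable names are just for illustration). It solves the three stationarity equations (2)-(4) as a linear system in (y_1, y_2, λ), and then confirms numerically that t = 2 minimizes the squared length of points on the line.

```python
import numpy as np

# Stationarity conditions of L = (1/2)(y1^2 + y2^2) + lam*(2*y1 - y2 - 5),
# written as a linear system in the unknowns (y1, y2, lam):
#   y1        + 2*lam = 0
#        y2   -   lam = 0
#   2*y1 - y2         = 5
A = np.array([[1.0,  0.0,  2.0],
              [0.0,  1.0, -1.0],
              [2.0, -1.0,  0.0]])
b = np.array([0.0, 0.0, 5.0])
y1, y2, lam = np.linalg.solve(A, b)
print(y1, y2, lam)              # 2.0 -1.0 -1.0

# Sanity check: parameterize points on the line as y1 = t, y2 = 2t - 5;
# the squared length t^2 + (2t - 5)^2 is smallest at t = 2.
t = np.linspace(-5, 5, 10001)
lengths = t**2 + (2*t - 5)**2
print(t[np.argmin(lengths)])    # approximately 2.0
```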
Now let's turn to the problem that we really care about. We want to maximize the variance of the output unit's activation subject to the constraint that the length of the weight vector is equal to one. Without loss of generality, let's make things simpler by assuming that the input variables each have a mean of zero (E[x_i] = 0). Because the mapping from inputs to the output is linear (y = w^T x = x^T w), this means that the output activation will also have a mean of zero (E[y] = 0). We are now ready to write down the objective function:

    L = (1/2) Σ_p y_p^2 + λ(1 − w^T w)    (5)

where y_p is the output activation on data pattern p. The first term, (1/2) Σ_p y_p^2, is proportional to the sample variance of the output unit, and the second term, λ(1 − w^T w), gives the constraint that the weight vector should have a length of one. We now do some algebraic manipulations:

    L = (1/2) Σ_p (w^T x_p)(x_p^T w) + λ(1 − w^T w)    (6)
      = (1/2) w^T (Σ_p x_p x_p^T) w + λ(1 − w^T w)    (7)
      = (1/2) w^T Q w + λ(1 − w^T w)    (8)

where Q = Σ_p x_p x_p^T is the sample covariance matrix (up to a constant factor). Now let's take the derivative of L with respect to w, and set this derivative equal to zero:

    ∂L/∂w = Qw − 2λw = 0    (9)

which means that Qw = 2λw. In other words, w is an eigenvector of the covariance matrix Q (the corresponding eigenvalue is 2λ). So if we want to maximize the variance of the output activation (subject to the constraint that the weight vector is of length one), then we should set the weight vector to be an eigenvector of the input covariance matrix (in fact it should be the eigenvector with the largest eigenvalue). As discussed below, this has an interesting relationship to principal component analysis and unsupervised Hebbian learning.
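To see this result numerically, here is a small sketch (the data, dimensions, and number of trials are made up for the illustration): it forms a sample covariance matrix Q from zero-mean data and checks that no random unit-length weight vector gives a larger projection variance than the eigenvector of Q with the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean data: 1000 patterns in 3 dimensions with unequal variances.
X = rng.normal(size=(1000, 3)) * np.array([3.0, 1.0, 0.5])
X = X - X.mean(axis=0)

Q = X.T @ X / len(X)                      # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(Q)      # eigh: Q is symmetric
w_best = eigvecs[:, np.argmax(eigvals)]   # eigenvector with the largest eigenvalue

def proj_var(w):
    y = X @ w                             # output activations y_p = w^T x_p
    return y.var()

# No random unit-length vector should beat the leading eigenvector.
random_ws = rng.normal(size=(2000, 3))
random_ws /= np.linalg.norm(random_ws, axis=1, keepdims=True)
print(proj_var(w_best))                        # approximately the largest eigenvalue
print(max(proj_var(w) for w in random_ws))     # never larger than the line above
```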
The idea behind principal component analysis (PCA) is that we want to represent the data with respect to a new basis, where the vectors that form this new basis are the eigenvectors of the data covariance matrix. Recall that eigenvectors with large eigenvalues give directions of large variance (again, we are considering the eigenvectors of a covariance matrix), whereas eigenvectors with small eigenvalues give directions of small variance (so the eigenvector with the largest eigenvalue gives the direction in which the input data shows the greatest variance, and the eigenvector with the smallest eigenvalue gives the direction in which the input data shows the least variance). Each eigenvalue gives the value of the variance in the direction of its corresponding eigenvector (thus the eigenvalues are all nonnegative). Because the data covariance matrix is symmetric, its eigenvectors are orthogonal (as usual, we assume that the eigenvectors are of length one).

Let x^(t) = [x_1^(t), ..., x_n^(t)]^T be a data item, and e_i = [e_i1, ..., e_in]^T be the i-th eigenvector (assume that the eigenvectors are ordered such that the eigenvector with the largest eigenvalue is e_1, the eigenvector with the next largest eigenvalue is e_2, and so on). The projection of x^(t) onto e_i is denoted y_i^(t), and is given by the inner product of these two vectors:

    y_i^(t) = [x^(t)]^T e_i = e_i^T x^(t).    (10)

That is, y_i^(t) is the coordinate of x^(t) along the axis given by e_i, and y^(t) = [y_1^(t), ..., y_n^(t)]^T is the representation of x^(t) with respect to the new basis. Let E = [e_1, ..., e_n] be the matrix whose i-th column is the i-th eigenvector. Then we can move back and forth between the two representations of the same data by the linear equations:

    y^(t) = E^T x^(t),    (11)

and

    x^(t) = E y^(t).    (12)

Please make sure that you understand these equations [we have used the fact that for an orthogonal matrix E (a matrix whose columns are orthonormal vectors), its inverse, E^{-1}, is equal to its transpose E^T].

Principal component analysis is useful for at least two reasons. First, as a feature detection mechanism: it is often the case that the eigenvectors give interesting underlying features of the data. Second, as a dimensionality reduction technique. Suppose that we represent the data using only the first m eigenvectors, where m < n (that is, suppose we represent x^(t) with ŷ^(t) = [y_1^(t), ..., y_m^(t)]^T instead of y^(t) = [y_1^(t), ..., y_n^(t)]^T). Out of all linear transformations that produce an m-dimensional representation, this transformation is optimal in the sense that it preserves the most information about the data (yes, I am being vague here). That is, PCA can be used for optimal linear dimensionality reduction.
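The following sketch illustrates equations (10)-(12) and the dimensionality-reduction idea (the data and the choice m = 2 are made up for the illustration): it diagonalizes the sample covariance matrix, maps a data item into the eigenvector basis with E^T, maps it back with E, and then reconstructs it from only the first m coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Zero-mean data: 500 patterns in n = 4 dimensions.
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

C = X.T @ X / len(X)                  # data covariance matrix
eigvals, E = np.linalg.eigh(C)        # columns of E are eigenvectors
order = np.argsort(eigvals)[::-1]     # sort so e_1 has the largest eigenvalue
eigvals, E = eigvals[order], E[:, order]

x = X[0]                              # one data item x^(t)
y = E.T @ x                           # new-basis representation, eq. (11)
x_back = E @ y                        # exact reconstruction, eq. (12)
print(np.allclose(x, x_back))         # True

# Dimensionality reduction: keep only the first m coordinates.
m = 2
y_hat = y[:m]
x_approx = E[:, :m] @ y_hat           # approximate reconstruction from m numbers
print(np.linalg.norm(x - x_approx))   # reconstruction error for this item
```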
In the remainder of this handout, we prove that the use of a particular Hebbian learning rule results in a network that performs PCA. That is, the weight vector of the i-th output unit goes to the i-th eigenvector e_i of the data covariance matrix, and the output of this unit, y_i^(t), is the coordinate of the input x^(t) along the axis given by e_i. For simplicity, we assume that we are dealing with a two-layer linear network with n input units and only one output unit (the result extends to the case with n output units in which the weight vector of each output unit is a different eigenvector, but we will not prove that here). We will also assume that the input is a random variable with mean equal to zero. The learning rule is:

    w_i(t+1) = w_i(t) + ɛ[y(t) x_i(t) − y^2(t) w_i(t)]    (13)

where w_i(t) is the value of the output unit's i-th weight at time t. Note that the first term inside the brackets is the standard Hebbian learning term, and the second term is a type of weight decay term that prevents the weight from getting too big. In vector notation, this equation may be rewritten:

    w(t+1) = w(t) + ɛ[y(t) x(t) − y^2(t) w(t)].    (14)

Recall that we are using a linear network so that

    y(t) = w^T(t) x(t) = x^T(t) w(t).    (15)

Using this fact, we rewrite the weight update rule as

    w(t+1) = w(t) + ɛ[x(t) x^T(t) w(t) − w^T(t) x(t) x^T(t) w(t) w(t)].    (16)

Let C = E[x(t) x^T(t)] be the data covariance matrix (we use the fact that the input has a mean equal to zero). The update rule converges when E[w(t+1)] = E[w(t)]. Denote the expected value of w(t) at convergence as w_o. Convergence occurs when the expected value of what is inside the brackets in the update rule is equal to zero:

    0 = E[x(t) x^T(t)] w_o − w_o^T E[x(t) x^T(t)] w_o w_o    (17)
      = C w_o − (w_o^T C w_o) w_o.    (18)

In other words,

    C w_o = λ_o w_o    (19)

where

    λ_o = w_o^T C w_o.    (20)

That is, w_o is an eigenvector of the data covariance matrix C with λ_o as its eigenvalue. We know that w_o has length one because

    λ_o = w_o^T C w_o    (21)
        = w_o^T λ_o w_o    (22)
        = λ_o ||w_o||^2    (23)

which is only possible if ||w_o|| = 1 (assuming λ_o ≠ 0). We also know that λ_o is the variance in the direction of w_o because

    λ_o = w_o^T C w_o    (24)
        = w_o^T E[x(t) x^T(t)] w_o    (25)
        = E[{w_o^T x(t)}{x^T(t) w_o}]    (26)
        = E[y^2(t)]    (27)

which is the variance of the output (assuming that the output has mean zero, which is true given our assumption that the input has mean zero and that the input and output are linearly related).
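The fixed-point argument above can also be checked by simulation. Here is a minimal sketch of the update rule in equation (14) (the data, learning rate, and number of passes are made up for the illustration): the weight vector is updated with the Hebbian term plus the decay term, and at the end it is compared against the leading eigenvector of the data covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# Zero-mean training data with a clear dominant direction of variance.
X = rng.normal(size=(5000, 3)) * np.array([2.0, 1.0, 0.3])
X = X - X.mean(axis=0)

w = rng.normal(size=3) * 0.1    # small random initial weights
eps = 0.01                      # learning rate (epsilon in eqs. 13-14)

for _ in range(20):             # a few passes through the data
    for x in X:
        y = w @ x                                # linear output, eq. (15)
        w = w + eps * (y * x - (y ** 2) * w)     # update rule, eq. (14)

C = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(C)
e1 = eigvecs[:, np.argmax(eigvals)]              # leading eigenvector

print(np.linalg.norm(w))                 # approximately 1
print(abs(w @ e1))                       # approximately 1 (w converges to +/- e1)
print((X @ w).var(), eigvals.max())      # output variance vs. largest eigenvalue
```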
We still need to prove that w_o is the particular eigenvector with the largest eigenvalue, i.e., that w_o = e_1. This is trickier to prove, so we shall omit it.

This result is due to Oja (1982). The extension to networks with n output units in which the weight vector of each output unit goes to a different eigenvector is due to Sanger (1989). The extension to networks (using Hebbian and anti-Hebbian learning) with n output units in which the n weight vectors span the same space as the first n eigenvectors is due to Földiák (1990).

Földiák, P. (1990). Adaptive network for optimal linear feature extraction. Proceedings of the IEEE/INNS International Joint Conference on Neural Networks.

Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267-273.

Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459-473.