
Table of Contents

CHAPTER VI - HEBBIAN LEARNING AND PRINCIPAL COMPONENT ANALYSIS
1. Introduction
2. Effect of the Hebb Update
3. Oja's Rule
4. Principal Component Analysis
5. Anti-Hebbian Learning
6. Estimating Crosscorrelation with Hebbian Networks
7. Novelty Filters and Lateral Inhibition
8. Linear Associative Memories (LAMs)
9. LMS Learning as a Combination of Hebb Rules
10. Autoassociation
11. Nonlinear Associative Memories
Project: Use of Hebbian Networks for Data Compression and Associative Memories
12. Conclusions

Appendix and keyword entries: Long and Short Term Memory; Associative Memory; Hebbian as Gradient Search; Instability of Hebbian; Derivation of Oja's Rule; Proof of Eigen-Equation; PCA Derivation; Definition of Eigenfilter; Optimal LAMs; Hebb; Unsupervised; Sanger; Oja; APEX; ASCII; Second Order; SVD; Kohonen; Stephen Grossberg; Diamantaras; Deflation; Baldi; Hecht-Nielsen; Kay; Energy, Power and Variance; PCA, SVD, and KL Transforms; Gram-Schmidt Orthogonalization; Silva and Almeida; Information and Variance; Cover and Thomas; Foldiak; Rao and Huang

Chapter VI - Hebbian Learning and Principal Component Analysis

Version 2.0

This chapter is part of: Neural and Adaptive Systems: Fundamentals Through Simulation, by Jose C. Principe, Neil R. Euliano, and W. Curt Lefebvre. Copyright 1997 Principe.

The goal of this chapter is to introduce the concepts of Hebbian learning and its multiple applications. We will show that the rule is unstable, but that through normalization it becomes very useful. Hebbian learning associates an input with a given output through a similarity metric. A single linear PE trained with the Hebbian rule finds the direction in data space where the data has the largest projection, i.e. such a network transfers most of the input energy to the output. This concept can be extended to multiple PEs, giving rise to principal component analysis (PCA) networks. These nets can be trained on-line and produce an output that preserves the maximum information from the input, as required for signal representation. By changing the sign of the Hebbian update we also obtain a very useful network that decorrelates the input from the outputs, i.e. it can be used for finding novel information. Hebbian learning can even be related to the LMS learning rule, showing that correlation is effectively the most widely used learning principle. Finally, we show how to apply Hebbian learning to associate patterns, which gives rise to a new and very biological form of memory called associative memory.

1. Introduction
2. Effect of the Hebb update
3. Oja's rule

4. Principal Component Analysis
5. Anti-Hebbian Learning
6. Estimating crosscorrelation with Hebbian networks
7. Novelty filters
8. Linear associative memories (LAMs)
9. LMS learning as a combination of Hebb rules
10. Autoassociation
11. Nonlinear associative memories
12. Conclusions

Go to next section

1. Introduction

The neurophysiologist Donald Hebb enunciated in the 1940s a principle that became very influential in neurocomputing. By studying the communication between neurons, Hebb verified that once a neuron repeatedly excited another neuron, the threshold of excitation of the latter decreased, i.e. the communication between them was facilitated by repeated excitation. This means that repeated excitation lowered the threshold, or equivalently that the excitation effect of the first neuron was amplified (Figure 1).

Figure 1. Biological and modeled artificial system (neuron 1 exciting neuron 2 across a synapse, and the corresponding j-th PE connected to the i-th PE through the weight w_ij)

One can extend this idea to artificial systems very easily.

In artificial neural systems, neurons are equivalent to PEs, and PEs are connected through weights. Hence, Hebb's principle will increase the common weight w_ij when there is activity flowing from the j-th PE to the i-th PE. If we denote the output of the i-th PE by y_i and the activation of the j-th PE by x_j, then

Δw_ij = η x_j y_i    Equation 1

where η is our already known step size, which controls what percentage of the product is effectively used to change the weight. There are many more ways to translate Hebb's principle into equations, but Eq. 1 is the most commonly used and is called Hebb's rule.

Unlike all the learning rules studied so far (LMS and backpropagation), there is no desired signal required in Hebbian learning. In order to apply Hebb's rule, only the input signal needs to flow through the neural network. Learning rules that use only information from the input to update the weights are called unsupervised. Note that in unsupervised learning the learning machine is changing the weights according to some internal rule specified a priori (here the Hebb rule). Note also that the Hebb rule is local to the weight.

Go to the next section

2. Effect of the Hebb update

Let us see what the net effect of updating a single weight w in a linear PE with the Hebb rule is. Hebbian learning updates the weights according to

w(n+1) = w(n) + η x(n) y(n)    Equation 2

where n is the iteration number and η a stepsize. For a linear PE, y = wx, so

w(n+1) = w(n) [1 + η x²(n)]    Equation 3

If the initial value of the weight is a small positive constant (w(0) ≈ 0), irrespective of the value of η > 0 and of the input sign, the update will always be positive.

Hence, the weight value will increase with the number of iterations without bound, irrespective of the value of η. This is unlike the behavior we observed for LMS or backpropagation, where the weights would stabilize for a range of step sizes. Hence, Hebbian learning is intrinsically unstable, producing very large positive or negative weights. In biology this is not a problem because there are natural limitations to synaptic efficacy (chemical depletion, dynamic range, etc.).

NeuroSolutions Training with the Hebbian rule

In this example, we introduce the Hebbian Synapse. The Hebbian Synapse implements the weight update of Equation 2. The Hebbian network is built from an input Axon, the Hebbian Synapse and an Axon, so it is a linear network. Since the Hebbian Synapse, and all the other Unsupervised Synapses (which we will introduce soon), use an unsupervised weight update (no desired signal), they do not require a backpropagation layer. The weights are updated on a sample by sample basis. This example shows the behavior of the Hebbian weight update. The weights with the Hebbian update will always increase, no matter how small the stepsize is. We have placed a scope at the output of the net and also opened a MatrixViewer to observe the weights during learning. The only thing that the stepsize does is to control the rate of increase of the weights. Notice also that if the initial weight is positive the weights will become increasingly more positive, while if the initial weight is negative the weights become increasingly more negative.

NeuroSolutions Example
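For readers without NeuroSolutions, the same divergence can be reproduced with a few lines of code. The following is a minimal sketch (in Python/NumPy, added here and not part of the original text; the step size, number of iterations, and random input are illustrative choices) that applies the scalar Hebbian update of Equations 2 and 3 to a single linear PE:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01                       # step size (illustrative)
w = 0.01                         # small positive initial weight, w(0) ~ 0
x = rng.standard_normal(200)     # zero-mean random input samples

for n, xn in enumerate(x):
    y = w * xn                   # linear PE: y = w x
    w = w + eta * xn * y         # Hebb's rule, Eq. 2; equals w * (1 + eta * x^2), Eq. 3
    if n % 50 == 0:
        print(f"iteration {n:3d}  w = {w:.4f}")

# The weight grows monotonically: the factor (1 + eta * x^2) is always >= 1,
# so the sign of w never changes and its magnitude increases without bound.
```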

2.1. The multiple input PE

Hebbian learning is normally applied to single layer linear networks. Figure 2 shows a single linear PE with D inputs, which will be called the Hebbian PE.

Figure 2. A D-input linear PE (inputs x_1, ..., x_D connected to the output y through the weights w_1, ..., w_D)

The output is

y = Σ_{i=1}^{D} w_i x_i    Equation 4

According to Hebb's rule, the weight vector is adapted as

Δw = η [x_1 y, ..., x_D y]^T    Equation 5

It is important to get a solid understanding of the role of Hebbian learning, and we will start with a geometric interpretation. Eq. 4 in vector notation (vectors are denoted by bold letters) is simply

y = w^T x = x^T w    Equation 6

i.e. the transpose of the weight vector is multiplied with the input (which is called the inner product) to produce the scalar output y. We know that the inner product is computed as the product of the lengths of the vectors times the cosine of their angle θ,

y = ||w|| ||x|| cos(θ)    Equation 7

So, assuming normalized inputs and weights, a large y means that the input x is close to the direction of the weight vector (Figure 3), i.e. x is in the neighborhood of w.

Figure 3. The output of the linear PE in vector space (y is the projection of x onto w, with θ the angle between them)
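As a quick numerical illustration of Equations 6 and 7 (a sketch added here, not from the original text; the vectors are made up), the inner product of unit-length vectors is large when they point in similar directions and near zero when they are almost perpendicular:

```python
import numpy as np

w = np.array([1.0, 0.0])                 # unit weight vector along the x-axis
x_close = np.array([0.95, 0.31])         # roughly 18 degrees away from w
x_far   = np.array([0.05, 0.999])        # almost perpendicular to w

for name, x in [("close", x_close), ("far", x_far)]:
    x = x / np.linalg.norm(x)            # normalize the input
    y = w @ x                            # Eq. 6: y = w^T x
    theta = np.degrees(np.arccos(y))     # Eq. 7 with unit norms: y = cos(theta)
    print(f"{name}: y = {y:.3f}, angle = {theta:.1f} degrees")
```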

A small y means that the input is almost perpendicular to w (the cosine of 90 degrees is 0), i.e. x and w are far apart. So the magnitude of y measures the similarity between the input x and the weight w, using the inner product as the similarity measure. This is a very powerful interpretation. During learning the weights are exposed to the data and condense all this information in their value. This is the reason the weights should be considered the long-term memory of the network (see long and short term memory).

The Hebbian PE is a very simple system that creates a similarity measure (the inner product, Eq. 7) in its input space according to the information contained in the weights. During operation, once the weights are fixed, a large output y signifies that the present input is similar to the inputs x that created the weights during training. We can say that the output of the PE responds high or low according to the similarity of the present input with what the PE remembers from training. So, the Hebbian PE implements a type of memory that is called an associative memory.

NeuroSolutions Directions of the Hebbian update

This example shows how the Hebbian network projects the input onto the vector defined by its weights. We use an input which is composed of samples that fall on an ellipse in 2 dimensions, and allow you to select the weights. When you run the network, a custom DLL will display both the input (blue) and the projection of the input onto the weight vector (black). The default is to set the weights to [1,0], which defines a vector along the x-axis.

Thus you would be projecting the input onto the x-axis. Change the value of the weights, which will rotate the vector. Notice that in any direction the output will track the input along that direction, i.e. the output is the projection of the input along that specified direction. Notice also the Megascope display. When the input data circles the origin, the output produces a sinusoidal component in time, since the projection increases and decreases periodically with the rotation. The amplitude of the sinusoid is maximal when the weight vector is [1,0], since this is the direction that produces the largest projection for this data set. If we release the weights, i.e. if they are trained with Hebbian learning, the weights will exactly seek the direction [1,0]. It is very interesting to note the path of the evolution of the weights (it oscillates around this direction). Note also that they are becoming progressively larger.

NeuroSolutions Example

2.2. The Hamming Network as a primitive associative memory

This idea that a simple linear network embeds a similarity metric can be explored in many practical applications. Here we will exemplify its use in information transmission, where noise normally corrupts messages. We will assume that the messages are strings of bipolar binary values (-1/1), and that we know the strings of the alphabet (for instance the ASCII codes of the letters). A practical problem is to find, from a given string of 5 bits received, which string was sent. We can think of an n-bit string as a vector in n-dimensional space. The ASCII code for each letter can also be thought of as a vector. So the question of finding the value of the received string is the same as asking which is the closest ASCII vector to the received string (Figure 4). Using the argument above, we should find the ASCII vector on which the bit string produces the largest projection.

Figure 4. The problem of finding the best match to the received character in vector space: the received vector z is compared with the constellation of code vectors coded in the weights, e.g. a = [-1,-1,-1,-1,1], b = [-1,-1,-1,1,-1], ..., z = [1,1,-1,1,-1], to find the best match.

A linear network can be constructed with as many inputs as bits in an ASCII code (here we will only use 5 bits, although the ASCII code is 8 bits long) and a number of outputs equal to the size of the alphabet (here 26 letters). The weights of the network will be hard coded as the bit patterns of all the ASCII letters. More formally, the inputs are vectors x = [x_1, x_2, ..., x_5]^T, each output is a scalar, and the weight matrix S is built from rows that are our ASCII codes, represented by s_i = [s_i1, s_i2, ..., s_i5], with i = 1, ..., 26. The output of the network is y = Sx.

The remaining question is how to measure the distance between the received vector and each of the ASCII characters. Since the patterns are binary, one possibility is to ask how many bit flips are present between the received string and each of the ASCII characters. One should assign the received string to the ASCII character that has the least number of bit flips. This distance is called the Hamming distance, HD (also known as the Manhattan norm or L1 norm). When a character is received, each output i of the network is the scalar product of the input with the corresponding row vector s_i. This scalar product can be written as the total number of positions in which the vectors agree minus the number of positions in which they differ, which is quantified as their HD. Since the number of positions in which they agree is 5 - HD, we have

x^T s_i = 5 − 2 HD(s_i, x)

This equation states that if we add a bias equal to 5 to each of the outputs of our net, we can directly interpret the network output as a Hamming distance (to be exact, the weights should be multiplied by −0.5 and the bias by 0.5 to obtain the HD). A perfect match will provide an output of 5. So one just needs to look for the highest output to know which character was sent.

NeuroSolutions Signal detection with Hamming networks

In this example we create the equivalent of a Hamming net which will recognize the binary ASCII of 5 letters (A, B, L, P, Z). The input to the network is the last 5 bits of each letter. For instance, A is -1,-1,-1,-1,1, B is -1,-1,-1,1,-1, etc. Because we know ahead of time what the letters will be, we will set the weights to the expected ASCII code of each letter. But here we are not going to use the Hamming distance but the dot-product distance of Hebbian learning. According to the associative memory concept, when the input and the weight vector are the same, the output of the net will be the largest possible. For instance, if -1,-1,-1,-1,1 is input to the network, the first PE will respond with the highest possible output. Single step through the data to see that in fact the net gives the correct response. Notice also that the other outputs are not zero, since the distance between each weight vector and the input is finite (it depends on the Hamming distance between the input and weight vectors). When noise corrupts the input, this network can be used to determine which letter the input was most likely to be. Noise will affect each component of the vector, but the net will assign the highest output to the weight vector that lies closest to the noisy input. When the noise is small this still provides a good assignment. Increase the noise power to see when the system breaks down. It is amazing that such a simple device still provides the correct output most of the time when the variance of the noise is 2.

NeuroSolutions Example
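A rough equivalent of this matching scheme can be written directly (a sketch added here, not from the original text; the five-letter codebook mirrors the example above and the noise level is an arbitrary choice):

```python
import numpy as np

# Codebook: last 5 bits of the ASCII codes for A, B, L, P, Z in bipolar (-1/1) form.
letters = "ABLPZ"
S = np.array([[-1, -1, -1, -1,  1],    # A
              [-1, -1, -1,  1, -1],    # B
              [-1,  1,  1, -1, -1],    # L
              [ 1, -1, -1, -1, -1],    # P
              [ 1,  1, -1,  1, -1]])   # Z

rng = np.random.default_rng(1)
sent = S[2]                                       # transmit the code for 'L'
received = sent + rng.normal(0.0, 1.0, size=5)    # additive noise on every bit

y = S @ received                         # dot products with every stored pattern
hd = (5 - S @ np.sign(received)) / 2     # Hamming distance after hard-limiting the input
best = np.argmax(y)                      # largest projection = closest stored code

print("outputs      :", np.round(y, 2))
print("Hamming dist :", hd.astype(int))
print("decoded as   :", letters[best])
```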

Note that here we utilized the inner product metric intrinsic to the Hebbian network instead of the Hamming distance, but the result is very similar. In this example the weight matrix was constructed by hand due to the knowledge we have about the problem. In general the weight matrix has to be adapted, and this is where Hebbian learning is important. Nevertheless this example shows the power of association for information processing.

2.3. Hebbian rule as correlation learning

There is a strong reason to translate Hebb's principle as in Eq. 1. In fact, Eq. 1 prescribes a weight correction according to the product between the j-th and the i-th PE activations. Let us substitute Eq. 6 into Eq. 5 to obtain the vector equivalent

Δw(n) = η y(n) x(n) = η x(n) x^T(n) w(n)    Equation 8

In on-line learning the weight vector is repeatedly changed according to this equation using a different input sample for each n. However, in batch mode, after iterating over the input data of L patterns, the cumulative weight is the sum of the products of the input with its transpose

w(L) = w(0) + η Σ_{n=1}^{L} x(n) x^T(n) w(0)    Equation 9

Eq. 9 can be thought of as a sample approximation to the autocorrelation of the input data, which is defined as R_x = E[x x^T], where E[.] is the expectation operator (see Appendix). Effectively the Hebbian algorithm is updating the weights with a sample estimate R̂_x of the autocorrelation function

w(L) = w(0) + η L R̂_x w(0)    Equation 10

Correlation is a well known operation in signal processing and in statistics, and it measures the second order statistics of the random variable under study.

First and second order statistics are sufficient to describe signals modeled as Gaussian distributions, as we saw in Chapter II (i.e. first and second order statistics are all that is needed to completely describe the data cluster). Second order moments also describe many properties of linear systems, such as the adaptation of the linear regressor studied in Chapter I.

2.4. Power, Quadratic Forms and Hebbian Learning

As we saw, the output of a linear network is given by Eq. 6. We will define the power at the output (see energy, power and variance), given the data set {x(1), x(2), ..., x(L)}, as

P = (1/L) Σ_{n=1}^{L} y²(n) = w^T R w,   where R = (1/L) Σ_{n=1}^{L} x(n) x^T(n)    Equation 11

P in Eq. 11 is a quadratic form, and it can be interpreted as a field in the space of the weights. Since R is positive definite, we can further say that this field is a paraboloid facing upwards and passing through the origin of the weight space (Figure 5).

Figure 5. The power field P = w^T R w as a performance surface (contours of constant P in the (w_1, w_2) plane, with the gradient ∇P = 2Rw)

Let us take the gradient of P with respect to the weights:

∇P = ∂P/∂w = 2Rw

We can immediately recognize that this equation provides the basic form for the Hebbian update of Eq. 10.
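To make the connection concrete, here is a small numerical check (an added sketch in Python/NumPy, not part of the original text; the data and step size are arbitrary) that the batch Hebbian update of Eq. 9 equals η L R̂ w(0), i.e. a step along the gradient direction Rw of the power field:

```python
import numpy as np

rng = np.random.default_rng(0)
L, eta = 500, 0.01
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=L)  # L samples, 2-D
w0 = np.array([0.3, -0.2])

# Batch Hebbian update: accumulate eta * x * y with y evaluated at w(0) (Eq. 9).
delta = sum(eta * x * (w0 @ x) for x in X)

# Sample autocorrelation and the same step written as eta * L * R_hat * w(0) (Eq. 10).
R_hat = (X.T @ X) / L
delta_via_R = eta * L * R_hat @ w0

print(np.allclose(delta, delta_via_R))   # True: both expressions give the same step
print("gradient direction R w0:", R_hat @ w0 / np.linalg.norm(R_hat @ w0))
print("Hebbian step direction :", delta / np.linalg.norm(delta))
```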

If we recall the performance surface concept of Chapter I, we see immediately that the power field is the performance surface for Hebbian learning. So we conclude that when we train a network with the Hebbian rule we are doing gradient ASCENT (seeking the maximum) in the power field of the input data. The sample by sample adaptation rule of Eq. 8 is merely a stochastic version and follows the same behavior. Since the power field is unbounded upwards, we can immediately expect that Hebbian learning will diverge unless some type of normalization is applied to the update rule (see instability of Hebbian). This is a shortcoming for our computer implementations because, due to the limited dynamic range, it will produce overflow errors. But there are many ways to normalize the Hebbian update.

NeuroSolutions Instability of Hebbian

This example shows that the Hebbian update rule is unstable, since the weights grow without bound. We use a simple 2D input example to show that the weight vector grows. We have opened a MatrixViewer to see the weights, and we also plot the tip of the weight vector in the ScatterPlot as a blue dot (think of the weight vector as going from the origin to the blue dot). Notice, however, that the weight vector always diverges along the same direction. This is not by chance. Although unstable, the Hebbian network is finding the direction where the output is the largest. The more you train the network, the larger the weights get. Repeat several times to observe the behavior we describe. So the Hebbian update is not practical.

NeuroSolutions Example

2.5 Data representations in multidimensional spaces

An important question is: what does the direction of the gradient ascent represent? In order to understand the answer to this question we have to talk about data representations in multidimensional spaces. We normally collect information about real world events with sensors. Most of the time the data needed to model a real world phenomenon is multidimensional, that is, we need several sensors (such as temperature, pressure, flow, etc.).

This immediately says that the state of the real world system is multidimensional, in fact a point in a space where the axes are exactly our measurement variables. In Figure 6 we show a two dimensional example. So the system states create a cloud of points somewhere in this measurement space. An alternative way to describe the cloud of points is to define a new set of axes that are glued to the cloud of points instead of to the measurement variables. This new coordinate system is called a data dependent coordinate system. From Figure 6 we see that the data dependent representation moves the origin to the center (the mean) of our cloud of samples. But we can do more. We can also try to align one of the axes with the direction where the data has the largest projection. This is called the principal coordinate system for the data. For simplicity we would also like the principal coordinate system to be orthogonal (more on this later). Notice that the original (measurement) coordinate system and the principal coordinate system are related by a translation and a rotation, which is called an affine transform in algebra. If we know the parameters of this transformation we have captured a lot about the structure of our cloud of data.

Figure 6. The principal coordinate system (a cloud of samples plotted against measurement 1 and measurement 2, with the principal coordinate system centered on the cloud)

What we gain with the principal coordinate system is knowledge about the structure of the data and versatility. We may say, "I want to represent my data in a smaller dimensional space" to simplify the problem, or to be able to visualize the data, etc. Suppose that we are interested in preserving the variance of the cloud of points, since variance is associated with information (see information and variance).

To make the point clear, let us try to find only a single direction (i.e. a one dimensional space) to represent most of the variance of our data. What direction should we use? If you think a bit, the principal coordinate system is the one that makes the most sense, because we aligned one of its axes with the direction where the data has the largest variance. In this coordinate system we should then choose the axis where the data has the largest projected variance. Now let us go back to the Hebbian network. The weights of the network trained with the Hebbian learning rule find the direction of the input power gradient. The output of the Hebbian network (the projection of the input onto the weight vector) will then be the largest variance projection. In other words, the Hebbian network finds the axis of the principal coordinate system where the projected variance is the largest, and gives it as the output. What is amazing is that the simple Hebbian rule automatically finds this direction for us with a local learning rule! So even though Hebbian learning was biologically motivated, it is a way of creating network weights that are tuned to the second order statistics of the input data. Moreover, the network does this with a rule that is local to the weights. We can further say that Hebbian learning extracts the most information about the input, since of all possible linear projections it finds the one that maximizes the variance at the output (which is synonymous with information for Gaussian distributed variables).

Go to the next section

3. Oja's rule

Perhaps the simplest normalization of Hebb's rule was proposed by Oja. Let us divide the new value of the weight in Eq. 2 by the norm of the new weight vector connected to the PE, i.e.

w_i(n+1) = [w_i(n) + η y(n) x_i(n)] / ( Σ_j [w_j(n) + η y(n) x_j(n)]² )^(1/2)    Equation 12

We see that this expression effectively normalizes the size of the weight vector to one. So if a given weight component increases, the others have to decrease to keep the weight vector at the same length. So weight normalization is in fact a constraint. Assuming the step size is small, Oja approximated the update of Eq. 12 by

w_i(n+1) = w_i(n) + η y(n) [x_i(n) − y(n) w_i(n)] = w_i(n) [1 − η y²(n)] + η x_i(n) y(n)    Equation 13

producing Oja's rule (see derivation of Oja's rule). Note that this rule can still be considered a Hebbian update with a normalized activity x̃_i(n) = x_i(n) − y(n) w_i(n). The normalization is basically a forgetting factor proportional to the output squared (see Eq. 13). This equation describes the fundamental problem of Hebbian learning. In order to avoid unlimited growth of the weights, we applied a forgetting term. This solves the problem of weight growth but creates another problem: if a pattern is not presented frequently it will be forgotten, since the network forgets old associations.

NeuroSolutions Oja's rule

This example introduces the Oja's Synapse (look at the synapse with the label Oja). The network is still a linear network, but the Oja's Synapse implements Oja's weight update described in Eq. 13. The overall network function is similar to the Hebbian network, except that now the weights stabilize, producing a vector in the direction of maximum variance. The input data is the same as in the previous case. Notice that now the weights of the single output network produce a vector oriented along the largest axis of the cloud of input samples (45 degrees). This is the direction which produces the largest possible output. Randomize the weights several times during learning to see that the network quickly finds this direction. Depending upon the sign of the initial weights, the final weights will be either both positive or both negative, but the direction does not change.

The stepsize now controls the speed of convergence. If the stepsize is too large, the iteration will blow up, as in the gradient descent learning case. Large stepsizes also produce rattling of the final weights (note that the weights form a linear segment), which should be avoided. If the stepsize is too small, the process will converge slowly. The best approach is to start the adaptation with a large stepsize and anneal its value to a small constant to fine tune the final position of the weights. This can be accomplished with the scheduler.

NeuroSolutions Example

3.1 Oja's rule implements the principal component network

What is the meaning of the weight vector of a neural network trained with Oja's rule? In order to answer this question, let us study a single linear PE network with multiple inputs (Figure 2) using the ideas of vector spaces. The goal is to study the projection defined by the weights created with Oja's rule. We already saw that Hebbian learning finds the direction where the input data has the largest projection, but the weight vector grows without limit. Now, with Oja's rule, we have found a way to normalize the weight vector to 1. If you recall, vectors of length 1 are normally used as axes of coordinate systems. We should expect that this normalization would not change the geometric picture we developed for the Hebbian network. In fact, it is possible to show that Oja's rule finds a weight vector w = e_0 which satisfies the relation (see proof of eigen-equation)

R e_0 = λ_0 e_0    Equation 14

where R is the autocorrelation function of the input data, and λ_0 is a real-valued scalar. This equation was already encountered in Chapter I and tells us that e_0 is an eigenvector of the autocorrelation function, since rotating e_0 by R (the left side) produces a vector colinear with itself.
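The convergence of Eq. 13 to the eigen-direction of Eq. 14 is easy to check numerically. The following sketch (added here in Python/NumPy, not part of the original text; the correlated 2-D data and the step size are arbitrary choices) trains a single linear PE with Oja's rule and compares the result with the principal eigenvector of the sample autocorrelation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Zero-mean 2-D data elongated along the 45-degree direction.
X = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=2000)

w = rng.standard_normal(2) * 0.1
eta = 0.005
for epoch in range(20):
    for x in X:
        y = w @ x
        w += eta * y * (x - y * w)        # Oja's rule, Eq. 13

# Principal eigenvector of the sample autocorrelation R (Eq. 14).
R = (X.T @ X) / len(X)
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns eigenvalues in ascending order
e0 = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

print("Oja weight vector    :", np.round(w, 3), " norm =", round(np.linalg.norm(w), 3))
print("principal eigenvector:", np.round(e0, 3))
print("alignment |cos(angle)|:", round(abs(w @ e0) / np.linalg.norm(w), 3))
```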

We can further show that λ_0 is in fact the largest eigenvalue of R, so e_0 is the eigenvector that corresponds to the largest eigenvalue. We should expect this since, from eigendecomposition theory, we know that this scalar is exactly the variance of the data projected on the eigendirection, and Oja's rule seeks the gradient direction of the power field. We conclude that training the linear PE with Oja's algorithm produces a weight vector that is aligned with the direction in the input space where the input data cluster produces the largest variance (the largest projection). Figure 7 shows a simple case in 2D. It shows a data cluster (black dots) spread along the 45 degree line. The principal axis of the data is the direction in 2D space where the data has its largest power (projection variance). So imagine a line passing through the center of the cluster, and rotate it so that the data cluster produces the largest spread along the line. For this case the direction will be close to 45 degrees. The weight vector of the network of Figure 2 trained with Oja's rule coincides exactly with the principal axis, also called the principal component. The direction perpendicular to it (the minor axis) will produce a much smaller spread. For zero mean data, the direction of maximum spread coincides with the direction where most of the information about the data resides. The same thing happens when the data exists in a larger dimensionality space D, but we cannot visualize it anymore.

Figure 7. Projection of a data cluster onto the principal components (the data cluster in the (x1, x2) plane, with the principal direction giving the largest spread of the projection and the minor direction the smallest spread)

If you relate this figure with the NeuroSolutions example, the Oja's weight vector found exactly the direction where the data produced the largest projection.

This is a very important property, because the simple one-PE network trained with Oja's rule is extracting the most information that it can from the input, if we accept that information is associated with the power of the input. In engineering applications, where the input data is normally corrupted by noise, this system will provide a solution that maximizes the ratio of signal power (of the largest sinusoidal component) to noise power (see definition of eigenfilter).

Go to the next section

4. Principal Component Analysis

We saw that Oja's rule found a unit-length weight vector that is colinear with the principal component of the input data. But how can we find other directions where the data cluster still has appreciable variance? We would like to create more axes of the principal coordinate system mentioned in section 2.5. For simplicity we would like to create an orthogonal coordinate system (i.e. all the vectors are orthogonal to each other) with unit length vectors (an orthonormal coordinate system). How can we do this? Principal Component Analysis answers this question.

Principal Component Analysis, or PCA for short, is a very well known statistical procedure that has important properties. Suppose that we have input data of very large dimensionality (D dimensions). We would like to project this data onto a smaller dimensionality space M (M < D), a step that is commonly called feature extraction. Projection will always distort our data somewhat (just think of a 3-D object and its 2-D shadow). Obviously we would like to do this projection onto the M dimensional space while maximally preserving the information (variance, from a representation point of view) about the input data. The linear projection that accomplishes this goal is exactly PCA (see PCA, SVD, and KL transforms).

PCA produces an orthonormal basis that is built from the eigenvectors of the input data autocorrelation function. The variances of the projections onto each basis vector are therefore given by the eigenvalues of R.

If one orders the eigenvectors in descending order of the eigenvalues and truncates them at M (with M < D), then we will project the input data onto a linear space of (smaller) dimensionality M. In this space the projections onto the axes will produce the M largest eigenvalues, so there is no better linear projection to preserve the input signal power. The outputs of the PCA represent the input in a smaller subspace, so they are called features. So PCA is the best linear feature extractor for signal reconstruction. The error e in the approximation when we utilize M features is exactly given by

e² = Σ_{i=M+1}^{D} λ_i    Equation 15

Eq. 15 tells us that the error power is exactly equal to the sum of the eigenvalues that were discarded. For the case of Figure 7, the minimum error in representing the 2-D data set in a 1-D space is obtained when the principal direction is chosen as the projection axis. The error power is exactly given by the projection on the minor direction. If we had decided to keep the projection on the minor direction instead, the error incurred would have been much higher. This method is called subspace decomposition, and it is widely applied in signal processing and statistics to find the best subspace of a given dimension that maximally preserves the data information.
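Equation 15 can be verified directly with an eigendecomposition (a sketch in Python/NumPy added here, not from the original text; the dimensions and the random data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, L = 8, 3, 5000
A = rng.standard_normal((D, D))
X = rng.standard_normal((L, D)) @ A            # zero-mean data with correlated components

R = (X.T @ X) / L                              # sample autocorrelation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]              # sort eigenvalues in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W = eigvecs[:, :M]                             # keep the M principal eigenvectors
X_hat = (X @ W) @ W.T                          # project to M features and reconstruct

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print("reconstruction error power  :", round(mse, 4))
print("sum of discarded eigenvalues:", round(eigvals[M:].sum(), 4))   # Eq. 15
```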

There are well known algorithms that compute PCA analytically, but they have to solve matrix equations (singular value decomposition). Can we build a neural network that implements PCA on-line, with local learning rules? The answer is affirmative. We have to use a linear network with multiple outputs (equal to the dimension M of the projection space) as in Figure 8.

Figure 8. A PCA network to project the data from D to M dimensions (inputs x_1, ..., x_D fully connected through the weights w_ij to the outputs y_1, ..., y_M).

The idea is very simple. First, we compute the largest eigenvector as done above with Oja's rule. Then we project the data onto the space perpendicular to the largest eigenvector and apply the algorithm again to find the second largest principal component, and so on up to order M ≤ D. The projection onto the orthogonal space is easily accomplished by subtracting the outputs of all previous components (after convergence) from the input. This method is called the deflation method and mimics the Gram-Schmidt orthogonalization procedure (see Gram-Schmidt orthogonalization).

What is interesting is that the deflation method can be accomplished easily by slightly modifying Oja's learning rule, as was first done by Sanger. We are assuming that the network has M outputs, each given by

y_i(n) = Σ_{j=1}^{D} w_ij(n) x_j(n),   i = 1, ..., M    Equation 16

and D inputs (M ≤ D). To apply Sanger's rule the weights are updated according to

Δw_ij(n) = η y_i(n) [ x_j(n) − Σ_{k=1}^{i} w_kj(n) y_k(n) ]    Equation 17

This rule resembles Oja's update, but now the input to each PE is modified by subtracting the outputs of the preceding PEs times the respective weights. This implements the deflation method after the system converges.
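A compact way to read Eq. 17 is that output i sees the input with the reconstructions of outputs 1..i removed. The sketch below (added in Python/NumPy, not part of the original text; the data, sizes, and step size are illustrative) trains a 2-output Sanger network and checks that the rows of W approach the two principal eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, eta = 3, 2, 0.001
C = np.array([[4.0, 1.0, 0.5],
              [1.0, 2.0, 0.3],
              [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(D), C, size=3000)

W = rng.standard_normal((M, D)) * 0.1              # row i holds the weight vector of output i
for epoch in range(30):
    for x in X:
        y = W @ x                                   # Eq. 16
        # Eq. 17 in matrix form: Delta W = eta * (y x^T - lower_triangular(y y^T) W)
        W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# Compare with the two principal eigenvectors of the sample autocorrelation matrix.
R = (X.T @ X) / len(X)
eigvals, eigvecs = np.linalg.eigh(R)
for i in range(M):
    e = eigvecs[:, -1 - i]                          # i-th largest eigenvector
    print(f"output {i}: |cos| with eigenvector {i} =",
          round(abs(W[i] @ e) / np.linalg.norm(W[i]), 3))
```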

The weight update of Eq. 17 is not local, since we need all the previous network outputs to compute the weight update for weight w_ij. However, there are other rules that use local updates (such as the APEX algorithm; see Diamantaras). As we can expect from Eq. 17 and the explanation above, there is a coupling between the modes, i.e. only after convergence of the first PE weights will the second PE weights converge completely to the eigenvector that corresponds to the second largest eigenvalue. There are other on-line algorithms for the same purpose, such as the lateral inhibition network and the recursively computed APEX, but for the sake of simplicity they will be omitted here.

A two-output PCA network applied to the data of Figure 7 will have weight vectors that correspond to the principal and minor components. The two outputs will correspond to the largest and smallest eigenvalues, respectively. The interesting thing about subspace projections is that in many problems the data is already restricted to an (unknown) subspace, so PCA can effectively perform data compression, preserving the major features of the data.

NeuroSolutions Sanger's and PCA

This example introduces Sanger's rule (look at the synapse with the label Sang in the breadboard). Sanger's rule does Principal Component Analysis (PCA). The dimension M of the output determines the size of the output space, i.e. the number of eigenvectors and also the number of features used to represent the input data. PCA finds the M weight vectors which capture the most information about the input data. For instance, a 3-output Sanger's network will find 3 orthogonal vectors: the principal axis, which captures more information than any other vector in the input space, along with the two vectors which capture the second most and third most information. In this example, we take a high dimensional input, 8x8 images of the 10 digits, and project them onto their M principal components. M is a variable that you can control by setting the number of outputs of the Sanger's network. The outputs of the PCA network are the features obtained by the projection.

We then use a custom DLL to recreate the digits using only the M features. This DLL takes the output of the Sanger's network and multiplies it by the transpose of W, so it recreates a 64-output image. This image shows us how much of the original information in the input we have captured in the M dimensional subspace. When the two images are identical, we have preserved in the features the information contained in the input data. The display of the eigenvectors (the PCA weights) is not easy, since they are vectors in a 64 dimensional space. After convergence they are orthogonal. We can use the Hinton probe to visualize their values, but it is difficult to find patterns (in fact the signs should alternate more frequently towards the higher orders, meaning that finer details are being encoded). Try different values for the subspace dimension (M), and verify that PCA is very robust, i.e. even with just a few dimensions the reconstructed digits can be recognized.

A word of caution is needed at this point. PCA finds the subspace that best represents the ensemble of digits, so the best discrimination among the digits in the subspace is not guaranteed by PCA. If the goal is discrimination among the digits, then a classifier should be designed for that purpose. PCA is a linear representation mechanism, and only guarantees that the features contain the most information for reconstruction.

NeuroSolutions Example

The PCA decomposition is a very important operation in data processing because it provides knowledge about the hidden structure (latent variables) of the data. As such, there are many other possible formulations of the problem (see PCA derivation).

4.1. PCA for data compression

PCA is the optimal linear feature extractor, i.e. there is no other linear system that is able to provide better features for reconstruction. So one of the obvious PCA applications is data compression. In data compression the goal is to transmit as few bits per second as possible while preserving as much of the source information as possible. This means that we must squeeze into each bit as much information from the source as possible. We can model data compression as a projection operation where the goal is to find a set of basis vectors that produces a large concentration of signal power in only a few components. In PCA compression the receiver must know the weight matrix containing the eigenvectors, since the estimation of the input from the features is done by

x̃ = W^T y    Equation 18

The weight matrix is obtained after training with exemplars of the data to be transmitted. It has been shown that for special applications this step can be completed efficiently, and it is done only once. So the receiver can be constructed beforehand. The reconstruction step requires M×D operations, where D is the input vector dimension and M is the size of the subspace (number of features).

4.2. PCA features and classification

We may think that a system able to optimally preserve signal energy in a subspace should also be the optimal projector for classification. Unfortunately this is not the case. The reason can be seen in Figure 9, where we have represented two classes. When the PCA is computed, no distinction is made between the samples of each class, so the optimal 1-D projection for reconstruction (the principal direction) is along the x1 axis. However, it is easy to see that the best discrimination between these two clusters is along the x2 axis, which from the point of view of reconstruction is the minor direction. So PCA chooses the projections that best reconstruct the data in the chosen subspace. This may or may not coincide with the projection for best discrimination. A similar thing happened when we addressed regression and classification (first example of Chapter II). A linear regressor can be used as a classifier, but there is no guarantee that it produces the optimal classifier (which by definition minimizes the classification error).

Figure 9. The relation between eigendirections and classification (class 1 and class 2 overlap along x1, the principal direction, so classification with x1 is not perfect, while projection onto x2, the minor direction, gives perfect classification)
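The situation of Figure 9 is easy to reproduce numerically (an added Python/NumPy sketch, not from the original text; the two synthetic classes are arbitrary): the leading principal component picks the high-variance x1 axis even though the classes are only separable along x2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Both classes share a large variance along x1; they differ only in their x2 mean.
class1 = np.column_stack([rng.normal(0, 5, n), rng.normal(+1, 0.3, n)])
class2 = np.column_stack([rng.normal(0, 5, n), rng.normal(-1, 0.3, n)])
X = np.vstack([class1, class2])
labels = np.array([1] * n + [2] * n)

Xc = X - X.mean(axis=0)                      # zero-mean data for PCA
eigvals, eigvecs = np.linalg.eigh((Xc.T @ Xc) / len(Xc))
pc = eigvecs[:, -1]                          # principal direction (close to the x1 axis)
minor = eigvecs[:, 0]                        # minor direction (close to the x2 axis)

def error_rate(direction):
    """Classification error of the better threshold-at-zero labeling of the 1-D projection."""
    p = Xc @ direction
    pred = np.where(p > 0, 1, 2)
    return min(np.mean(pred != labels), np.mean(pred == labels))

print("principal direction:", np.round(pc, 2), " error ~", round(error_rate(pc), 3))
print("minor direction    :", np.round(minor, 2), " error ~", round(error_rate(minor), 3))
```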

However, PCA is appealing for classification since it is a simple procedure, and experience has shown that it normally provides good features for classification. But this depends upon the problem, and there is no guarantee that classifiers based on PCA features will work well.

NeuroSolutions PCA for preprocessing

In this example we use PCA to find the best possible linear projection in terms of reconstruction, and then we use an MLP to classify the data into one of 10 classes (the digits). Notice that in fact this problem was already solved in Chapter III with the perceptron, and we obtained perfect classification using the input data directly. The only way we can do a fair comparison is to limit the number of weights in the two systems to the same value and compare performance.

NeuroSolutions Example

Go to the next section

5. Anti-Hebbian Learning

We have seen that Hebbian learning discovers the directions in space where the input data has the largest variance. Let us make a very simple modification to the algorithm and include a minus sign in the weight update rule of Eq. 1, i.e.

Δw_ij = −η x_j y_i    Equation 19

This rule is called the anti-Hebbian rule. Let us assume that we train the system of Figure 2 with this rule. What do you expect this rule will do? The easiest reasoning is to recall that the Hebbian network maximizes the output variance by doing gradient ascent in the power field. Now, with the negative sign in the weight update equation, the adaptation will seek the minimum of the performance surface, i.e. the output variance will be minimized. Hence, the linear network trained with anti-Hebbian learning will always produce zero output, because the weights will seek the directions in the input space where the data cluster has a point projection. This is called the null (or orthogonal) space of the data. The network finds this direction by doing gradient descent in the power field. If the data fills the full input space, then the weights will have to go to zero. On the other hand, if the data exists in a subspace, the weights will find the directions where the data projects to a point. For Figure 7, anti-Hebbian learning will produce zero weights. However, if the data were one dimensional, i.e. along the 45 degree line, then the weights would be placed along the 135 degree line.

NeuroSolutions Anti-Hebbian learning

In this example we use the Hebbian synapse with a negative stepsize to implement an anti-Hebbian network. The anti-Hebbian rule minimizes the output variance, thus it will try to find a vector which is orthogonal to the input (the null space of the input) such that the projection of the data onto the weight vector is always zero.

There are two cases of importance. Either the data lies in a subspace of the input space, in which case the zero output can be achieved by adapting the weight vector perpendicular to the subspace where the input lies; or, in the second case, the input samples cover the full input space, so the only way to get a zero output is to drive the weights to zero. Notice how fast the anti-Hebbian network trains. If the data moves in the input space, notice that the weights always find the direction orthogonal to the data cluster.

NeuroSolutions Example

This behavior of anti-Hebbian learning can be described as decorrelation, i.e. a linear PE trained with anti-Hebbian learning decorrelates the output from its input. We must realize that Hebbian and anti-Hebbian learning have complementary roles in projecting the input data, which are very important for signal processing. For instance, the new high resolution spectral analysis techniques (such as MUSIC and ESPRIT; see Kay) are based on ways of finding the null space of the data, and so they can be implemented on-line using anti-Hebbian learning. We will provide an example in Chapter IX.

5.1. Convergence of the anti-Hebbian rule

Another interesting thing is that the convergence of the anti-Hebbian rule can be controlled by the step size, as in LMS or backpropagation. This means that if the step size is too large the weights will get progressively larger (diverge), but if the step size is below a given value the adaptation will converge. In fact, from the fact that the power field is a paraboloid in weight space, we know it has a single minimum. Hence the situation is like the gradient descent we studied in Chapter I. What is the value below which the weights converge to finite values? The anti-Hebbian update for one weight is

w(n+1) = w(n) [1 − η x²(n)]    Equation 20

So, if we take expectations and project onto the principal coordinate system, as we did in Chapter I to compute the largest stepsize for the LMS, we can conclude that

w(n+1) = (1 − ηλ) w(n)    Equation 21

which is stable if

η < 2/λ    Equation 22

where λ is the eigenvalue of the autocorrelation function of the input. We can immediately see the similarity to the convergence of the LMS rule. For a system with multiple inputs the requirement for convergence has to be modified to

η < 2/λ_max    Equation 23

where λ_max is the largest eigenvalue of the input autocorrelation function, as in the LMS case.

NeuroSolutions Stability of Hebbian

This example shows that the anti-Hebbian rule is stable for the range of values given by Eq. 23 when random data is used. Just change the stepsize to see the compromise between rattling and speed of convergence achieved with the anti-Hebbian rule. Since the weight update is sample by sample, when the data has deterministic structure divergence may occur at step sizes smaller than the ones predicted by Eq. 23. The same behavior was encountered with the LMS.

NeuroSolutions Example
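The bound of Eq. 23 can be probed with a short simulation of the expected-value recursion of Eq. 21 (an added Python/NumPy sketch, not part of the original text; the covariance and the two step sizes are arbitrary): the weights decay for a step size below 2/λ_max and diverge above it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=2000)
R = (X.T @ X) / len(X)
lam = np.linalg.eigvalsh(R)                 # eigenvalues of the input autocorrelation
lam_max = lam.max()
print("eigenvalues:", np.round(lam, 3), " bound 2/lambda_max =", round(2 / lam_max, 3))

# Expected weight recursion in the principal coordinate system, Eq. 21:
# each mode evolves as w_k(n+1) = (1 - eta * lambda_k) * w_k(n).
for eta in (0.9 * 2 / lam_max, 1.1 * 2 / lam_max):
    w = np.ones_like(lam)                   # one coefficient per eigen-mode
    for n in range(200):
        w = (1 - eta * lam) * w             # Eq. 21 applied mode by mode
    print(f"eta = {eta:.3f}: |w| after 200 steps = {np.linalg.norm(w):.3e}")
```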

Go to the next section

6. Estimating crosscorrelation with Hebbian networks

Suppose that we have two data sets formed by P exemplars of N-dimensional data x_1, ..., x_N and of M-dimensional data d_1, ..., d_M, and the goal is to estimate the crosscorrelation between them. The crosscorrelation is a measure of similarity between the two sets of data which extends the idea of the correlation coefficient (see Appendix and Chapter I). In practice we are often faced with the question: how similar is this data set to that other one? Crosscorrelation helps to answer exactly this question. Let us assume that the data samples are ordered by their indices. The crosscorrelation for index i, j is

r_xd(i, j) = (1/L) Σ_{k=1}^{L} x_{i,k} d_{j,k},   0 < i ≤ N, 0 < j ≤ M    Equation 24

where L is the number of patterns, N is the size of the input vector and M is the size of the desired response vector. The fundamental operation of correlation is to cross multiply the data samples and add the contributions. Define the average operator

A[u] = (1/L) Σ_{k=1}^{L} u_k    Equation 25

The crosscorrelation can then be defined as

r(i, j) = A[x_i d_j]    Equation 26

where the vector x_i = [x_{i,1}, x_{i,2}, ..., x_{i,P}] is built from the i-th sample of all the patterns in the input set (likewise for d). The crosscorrelation matrix R_xd is built from all possible index pairs i, j, i.e.

R_xd = A[ x_1 d_1   x_1 d_2   ...   x_1 d_M
          x_2 d_1   x_2 d_2   ...   x_2 d_M
          ...
          x_N d_1   x_N d_2   ...   x_N d_M ]    Equation 27

The crosscorrelation vector used in regression (Chapter I) is just the first column of this matrix. Now let us relate this formalism to the calculations of a linear network trained with Hebbian learning.

Assume we have a linear network with N inputs x and M outputs y (Figure 10).

Figure 10. A multiple input, multiple output Hebbian network (inputs x_1, x_2, ..., x_N connected through the weights w_ij to the outputs y_1, y_2, ..., each of which is paired with a desired response d_1, d_2, ...)

In order to compute the crosscorrelation between x and the data set d, we will substitute the network output y in the Hebbian rule by the data set d, i.e.

Δw_ij = η x_j d_i    Equation 28

which implements what we call forced Hebbian learning. We can write the output y_i as in Eq. 4, but now with two indices i and j:

y_i = Σ_{j=1}^{N} w_{i,j} x_j    Equation 29

The weight w_{i,j}, when adapted with forced Hebbian learning, takes the form

w_{i,j}(n+1) = w_{i,j}(n) + η x_j(n) d_i(n)    Equation 30

If w_{i,j}(0) = 0, after L iterations we get

w_{i,j}(L) = η Σ_{n=1}^{L} x_j(n) d_i(n)    Equation 31

So by comparing Eq. 24 with Eq. 31 we conclude that the weight w_ij trained with forced Hebbian learning is proportional, after L iterations, to the crosscorrelation element r_xd(j, i). If η = 1/L and the initial conditions are zero, it is exactly this element. Notice also that the elements of the crosscorrelation matrix are precisely the weights of the linear network (Eq. 27). For this reason the linear network trained with forced Hebbian learning is called a correlator or a linear associator. Hence, forced Hebbian learning is an alternate, on-line way of computing the crosscorrelation function between two data sets.
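This equivalence is easy to check numerically (an added Python/NumPy sketch, not part of the original text; the 3-input, 2-output random data set is made up and differs from the NeuroSolutions example that follows): one batch pass of forced Hebbian learning with η = 1/L and zero initial weights reproduces the sample crosscorrelation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, M = 100, 3, 2                 # L patterns, N inputs, M desired outputs
X = rng.standard_normal((L, N))
D = rng.standard_normal((L, M))

eta = 1.0 / L
W = np.zeros((M, N))                # w_ij connects input j to output i
for x, d in zip(X, D):
    W += eta * np.outer(d, x)       # forced Hebbian learning, Eq. 28 / Eq. 30

R_xd = (X.T @ D) / L                # Eq. 24: r_xd(i, j) = (1/L) sum_k x_ik d_jk
print(np.allclose(W, R_xd.T))       # True: the weights are the crosscorrelation values
```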

NeuroSolutions Forced Hebbian computes crosscorrelation

In this example we show how forced Hebbian learning simply computes the crosscorrelation of the input and desired output. We have a 3-input network which we would like to train with a desired response of 2 outputs. We have created a data set with 4 patterns. The crosscorrelation computed according to Eq. 24 is

r(0,0) = 0.5; r(0,1) = r(1,0) = 0; r(1,1) = 0.25; r(0,2) = 0.5; r(1,2) = 0.25

Let us use the Hebbian network and take a look at the final weights. Notice that we started the weights with a zero value, and stopped the network after 10 iterations of each batch (4 patterns) with a stepsize of 1/(4×10).

NeuroSolutions Example

There are two important applications of this concept that we will address in this chapter. One uses crosscorrelation with anti-Hebbian learning to find what is different between two data sets, and can be considered a novelty filter. The other is possibly even more important: a memory device called an associative memory.

Go to the next section

7. Novelty Filters and Lateral Inhibition

Let us assume that we have two data sets x and d. Taking x as the input to our system, we want to create an output y as dissimilar as possible to the data set d (Figure 11). This function is very important in signal processing (decorrelation) and in information processing (uncorrelated features), and it seems to be at the core of biological information processing. We humans filter out with extreme ease what we already know from the sensory input (either visual or acoustic). This avoids information overload.

7. Variable extraction and dimensionality reduction

7. Variable extraction and dimensionality reduction 7. Variable extraction and dimensionality reduction The goal of the variable selection in the preceding chapter was to find least useful variables so that it would be possible to reduce the dimensionality

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

The Hebb rule Neurons that fire together wire together.

The Hebb rule Neurons that fire together wire together. Unsupervised learning The Hebb rule Neurons that fire together wire together. PCA RF development with PCA Classical Conditioning and Hebbʼs rule Ear A Nose B Tongue When an axon in cell A is near enough

More information

CHAPTER 3. Pattern Association. Neural Networks

CHAPTER 3. Pattern Association. Neural Networks CHAPTER 3 Pattern Association Neural Networks Pattern Association learning is the process of forming associations between related patterns. The patterns we associate together may be of the same type or

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

Unsupervised learning: beyond simple clustering and PCA

Unsupervised learning: beyond simple clustering and PCA Unsupervised learning: beyond simple clustering and PCA Liza Rebrova Self organizing maps (SOM) Goal: approximate data points in R p by a low-dimensional manifold Unlike PCA, the manifold does not have

More information

Machine Learning (Spring 2012) Principal Component Analysis

Machine Learning (Spring 2012) Principal Component Analysis 1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks What are (Artificial) Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Using a Hopfield Network: A Nuts and Bolts Approach

Using a Hopfield Network: A Nuts and Bolts Approach Using a Hopfield Network: A Nuts and Bolts Approach November 4, 2013 Gershon Wolfe, Ph.D. Hopfield Model as Applied to Classification Hopfield network Training the network Updating nodes Sequencing of

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Iterative face image feature extraction with Generalized Hebbian Algorithm and a Sanger-like BCM rule

Iterative face image feature extraction with Generalized Hebbian Algorithm and a Sanger-like BCM rule Iterative face image feature extraction with Generalized Hebbian Algorithm and a Sanger-like BCM rule Clayton Aldern (Clayton_Aldern@brown.edu) Tyler Benster (Tyler_Benster@brown.edu) Carl Olsson (Carl_Olsson@brown.edu)

More information

Neural Networks Lecture 2:Single Layer Classifiers

Neural Networks Lecture 2:Single Layer Classifiers Neural Networks Lecture 2:Single Layer Classifiers H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi Neural

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1 Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of

More information

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, 2012 1 / 1 2 / 1 Other datatypes Other datatypes Document data You might start with

More information

Artificial Neural Networks Examination, March 2004

Artificial Neural Networks Examination, March 2004 Artificial Neural Networks Examination, March 2004 Instructions There are SIXTY questions (worth up to 60 marks). The exam mark (maximum 60) will be added to the mark obtained in the laborations (maximum

More information

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required.

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In humans, association is known to be a prominent feature of memory.

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Hebb rule book: 'The Organization of Behavior' Theory about the neural bases of learning

Hebb rule book: 'The Organization of Behavior' Theory about the neural bases of learning PCA by neurons Hebb rule 1949 book: 'The Organization of Behavior' Theory about the neural bases of learning Learning takes place in synapses. Synapses get modified, they get stronger when the pre- and

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Comparative Performance Analysis of Three Algorithms for Principal Component Analysis

Comparative Performance Analysis of Three Algorithms for Principal Component Analysis 84 R. LANDQVIST, A. MOHAMMED, COMPARATIVE PERFORMANCE ANALYSIS OF THR ALGORITHMS Comparative Performance Analysis of Three Algorithms for Principal Component Analysis Ronnie LANDQVIST, Abbas MOHAMMED Dept.

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) Principal Component Analysis (PCA) Salvador Dalí, Galatea of the Spheres CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy and Lisa Zhang Some slides from Derek Hoiem and Alysha

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017

COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

Dimensionality Reduction

Dimensionality Reduction Lecture 5 1 Outline 1. Overview a) What is? b) Why? 2. Principal Component Analysis (PCA) a) Objectives b) Explaining variability c) SVD 3. Related approaches a) ICA b) Autoencoders 2 Example 1: Sportsball

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

Covariance and Correlation Matrix

Covariance and Correlation Matrix Covariance and Correlation Matrix Given sample {x n } N 1, where x Rd, x n = x 1n x 2n. x dn sample mean x = 1 N N n=1 x n, and entries of sample mean are x i = 1 N N n=1 x in sample covariance matrix

More information

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis .. December 20, 2013 Todays lecture. (PCA) (PLS-R) (LDA) . (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageble size. The new dataset can then be used to make

More information

Lecture 7: Con3nuous Latent Variable Models

Lecture 7: Con3nuous Latent Variable Models CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/

More information

CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA)

CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) CS68: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) Tim Roughgarden & Gregory Valiant April 0, 05 Introduction. Lecture Goal Principal components analysis

More information

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 1215 A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing Da-Zheng Feng, Zheng Bao, Xian-Da Zhang

More information

Maximum variance formulation

Maximum variance formulation 12.1. Principal Component Analysis 561 Figure 12.2 Principal component analysis seeks a space of lower dimensionality, known as the principal subspace and denoted by the magenta line, such that the orthogonal

More information

The Perceptron. Volker Tresp Summer 2016

The Perceptron. Volker Tresp Summer 2016 The Perceptron Volker Tresp Summer 2016 1 Elements in Learning Tasks Collection, cleaning and preprocessing of training data Definition of a class of learning models. Often defined by the free model parameters

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Designing Information Devices and Systems II Fall 2018 Elad Alon and Miki Lustig Homework 9

Designing Information Devices and Systems II Fall 2018 Elad Alon and Miki Lustig Homework 9 EECS 16B Designing Information Devices and Systems II Fall 18 Elad Alon and Miki Lustig Homework 9 This homework is due Wednesday, October 31, 18, at 11:59pm. Self grades are due Monday, November 5, 18,

More information

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Machine Learning Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 1 / 47 Table of contents 1 Introduction

More information

Computational Intelligence Lecture 6: Associative Memory

Computational Intelligence Lecture 6: Associative Memory Computational Intelligence Lecture 6: Associative Memory Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Fall 2011 Farzaneh Abdollahi Computational Intelligence

More information

A. The Hopfield Network. III. Recurrent Neural Networks. Typical Artificial Neuron. Typical Artificial Neuron. Hopfield Network.

A. The Hopfield Network. III. Recurrent Neural Networks. Typical Artificial Neuron. Typical Artificial Neuron. Hopfield Network. Part 3A: Hopfield Network III. Recurrent Neural Networks A. The Hopfield Network 1 2 Typical Artificial Neuron Typical Artificial Neuron connection weights linear combination activation function inputs

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear Algebra for Machine Learning. Sargur N. Srihari Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques, which are widely used to analyze and visualize data. Least squares (LS)

More information

Feedforward Neural Nets and Backpropagation

Feedforward Neural Nets and Backpropagation Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23 Supervised Learning Roadmap Supervised Learning: Assume that we are given the features

More information

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang. Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

More information

A. The Hopfield Network. III. Recurrent Neural Networks. Typical Artificial Neuron. Typical Artificial Neuron. Hopfield Network.

A. The Hopfield Network. III. Recurrent Neural Networks. Typical Artificial Neuron. Typical Artificial Neuron. Hopfield Network. III. Recurrent Neural Networks A. The Hopfield Network 2/9/15 1 2/9/15 2 Typical Artificial Neuron Typical Artificial Neuron connection weights linear combination activation function inputs output net

More information

This appendix provides a very basic introduction to linear algebra concepts.

This appendix provides a very basic introduction to linear algebra concepts. APPENDIX Basic Linear Algebra Concepts This appendix provides a very basic introduction to linear algebra concepts. Some of these concepts are intentionally presented here in a somewhat simplified (not

More information

Introduction to Artificial Neural Networks

Introduction to Artificial Neural Networks Facultés Universitaires Notre-Dame de la Paix 27 March 2007 Outline 1 Introduction 2 Fundamentals Biological neuron Artificial neuron Artificial Neural Network Outline 3 Single-layer ANN Perceptron Adaline

More information

CSC 411 Lecture 12: Principal Component Analysis

CSC 411 Lecture 12: Principal Component Analysis CSC 411 Lecture 12: Principal Component Analysis Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 12-PCA 1 / 23 Overview Today we ll cover the first unsupervised

More information

IV. Matrix Approximation using Least-Squares

IV. Matrix Approximation using Least-Squares IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

More information

Lecture 4: Feed Forward Neural Networks

Lecture 4: Feed Forward Neural Networks Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training

More information

Chapter 9: The Perceptron

Chapter 9: The Perceptron Chapter 9: The Perceptron 9.1 INTRODUCTION At this point in the book, we have completed all of the exercises that we are going to do with the James program. These exercises have shown that distributed

More information

7 Rate-Based Recurrent Networks of Threshold Neurons: Basis for Associative Memory

7 Rate-Based Recurrent Networks of Threshold Neurons: Basis for Associative Memory Physics 178/278 - David Kleinfeld - Fall 2005; Revised for Winter 2017 7 Rate-Based Recurrent etworks of Threshold eurons: Basis for Associative Memory 7.1 A recurrent network with threshold elements The

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS 121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module 2 Lecture 05 Linear Regression Good morning, welcome

More information

HST.582J/6.555J/16.456J

HST.582J/6.555J/16.456J Blind Source Separation: PCA & ICA HST.582J/6.555J/16.456J Gari D. Clifford gari [at] mit. edu http://www.mit.edu/~gari G. D. Clifford 2005-2009 What is BSS? Assume an observation (signal) is a linear

More information

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1). Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes

More information

COMP 558 lecture 18 Nov. 15, 2010

COMP 558 lecture 18 Nov. 15, 2010 Least squares We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Principal Component Analysis Barnabás Póczos Contents Motivation PCA algorithms Applications Some of these slides are taken from Karl Booksh Research

More information

Neural Networks. Hopfield Nets and Auto Associators Fall 2017

Neural Networks. Hopfield Nets and Auto Associators Fall 2017 Neural Networks Hopfield Nets and Auto Associators Fall 2017 1 Story so far Neural networks for computation All feedforward structures But what about.. 2 Loopy network Θ z = ቊ +1 if z > 0 1 if z 0 y i

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Metric-based classifiers. Nuno Vasconcelos UCSD

Metric-based classifiers. Nuno Vasconcelos UCSD Metric-based classifiers Nuno Vasconcelos UCSD Statistical learning goal: given a function f. y f and a collection of eample data-points, learn what the function f. is. this is called training. two major

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

Linear Algebra & Geometry why is linear algebra useful in computer vision?

Linear Algebra & Geometry why is linear algebra useful in computer vision? Linear Algebra & Geometry why is linear algebra useful in computer vision? References: -Any book on linear algebra! -[HZ] chapters 2, 4 Some of the slides in this lecture are courtesy to Prof. Octavia

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Learning Vector Quantization

Learning Vector Quantization Learning Vector Quantization Neural Computation : Lecture 18 John A. Bullinaria, 2015 1. SOM Architecture and Algorithm 2. Vector Quantization 3. The Encoder-Decoder Model 4. Generalized Lloyd Algorithms

More information

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7. Preliminaries Linear models: the perceptron and closest centroid algorithms Chapter 1, 7 Definition: The Euclidean dot product beteen to vectors is the expression d T x = i x i The dot product is also

More information

7 Recurrent Networks of Threshold (Binary) Neurons: Basis for Associative Memory

7 Recurrent Networks of Threshold (Binary) Neurons: Basis for Associative Memory Physics 178/278 - David Kleinfeld - Winter 2019 7 Recurrent etworks of Threshold (Binary) eurons: Basis for Associative Memory 7.1 The network The basic challenge in associative networks, also referred

More information

Singular Value Decomposition and Principal Component Analysis (PCA) I

Singular Value Decomposition and Principal Component Analysis (PCA) I Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression

More information

Preliminaries. Definition: The Euclidean dot product between two vectors is the expression. i=1

Preliminaries. Definition: The Euclidean dot product between two vectors is the expression. i=1 90 8 80 7 70 6 60 0 8/7/ Preliminaries Preliminaries Linear models and the perceptron algorithm Chapters, T x + b < 0 T x + b > 0 Definition: The Euclidean dot product beteen to vectors is the expression

More information

Lecture 6. Notes on Linear Algebra. Perceptron

Lecture 6. Notes on Linear Algebra. Perceptron Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

2- AUTOASSOCIATIVE NET - The feedforward autoassociative net considered in this section is a special case of the heteroassociative net.

2- AUTOASSOCIATIVE NET - The feedforward autoassociative net considered in this section is a special case of the heteroassociative net. 2- AUTOASSOCIATIVE NET - The feedforward autoassociative net considered in this section is a special case of the heteroassociative net. - For an autoassociative net, the training input and target output

More information

Foundations of Computer Vision

Foundations of Computer Vision Foundations of Computer Vision Wesley. E. Snyder North Carolina State University Hairong Qi University of Tennessee, Knoxville Last Edited February 8, 2017 1 3.2. A BRIEF REVIEW OF LINEAR ALGEBRA Apply

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. Broadly, these techniques can be used in data analysis and visualization

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Clustering VS Classification

Clustering VS Classification MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:

More information

22 Approximations - the method of least squares (1)

22 Approximations - the method of least squares (1) 22 Approximations - the method of least squares () Suppose that for some y, the equation Ax = y has no solutions It may happpen that this is an important problem and we can t just forget about it If we

More information

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x = Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.

More information