
Table of Contents

CHAPTER VI - HEBBIAN LEARNING AND PRINCIPAL COMPONENT ANALYSIS
1. Introduction
2. Effect of the Hebb Update
3. Oja's Rule
4. Principal Component Analysis
5. Anti-Hebbian Learning
6. Estimating Crosscorrelation with Hebbian Networks
7. Novelty Filters and Lateral Inhibition
8. Linear Associative Memories (LAMs)
9. LMS Learning as a Combination of Hebb Rules
10. Autoassociation
11. Nonlinear Associative Memories
Project: Use of Hebbian Networks for Data Compression and Associative Memories
12. Conclusions

Appendix and keyword entries: Long and Short Term Memory; Associative Memory; Hebbian as Gradient Search; Instability of Hebbian; Derivation of Oja's Rule; Proof of Eigen-Equation; PCA Derivation; Definition of Eigenfilter; Optimal LAMs; Hebb; Unsupervised; Sanger; Oja; APEX; ASCII; Second Order; SVD; Kohonen; Stephen Grossberg; Diamantaras; Deflation; Baldi; Hecht-Nielsen; Kay; Energy, Power and Variance; PCA, SVD, and KL Transforms; Gram-Schmidt Orthogonalization; Silva and Almeida; Information and Variance; Cover and Thomas; Foldiak; Rao and Huang

Chapter VI - Hebbian Learning and Principal Component Analysis

Version 2.0

This chapter is part of: Neural and Adaptive Systems: Fundamentals Through Simulation, by Jose C. Principe, Neil R. Euliano, and W. Curt Lefebvre. Copyright 1997 Principe.

The goal of this chapter is to introduce the concepts of Hebbian learning and its multiple applications. We will show that the rule is unstable, but that through normalization it becomes very useful. Hebbian learning associates an input with a given output through a similarity metric. A single linear PE trained with the Hebbian rule finds the direction in data space where the data has the largest projection, i.e. such a network transfers most of the input energy to the output. This concept can be extended to multiple PEs, giving rise to principal component analysis (PCA) networks. These nets can be trained on-line and produce an output that preserves the maximum information from the input, as required for signal representation. By changing the sign of the Hebbian update we also obtain a very useful network that decorrelates the input from the outputs, i.e. it can be used for finding novel information. Hebbian learning can even be related to the LMS learning rule, showing that correlation is effectively the most widely used learning principle. Finally, we show how to apply Hebbian learning to associate patterns, which gives rise to a new and very biological form of memory called associative memory.

1. Introduction
2. Effect of the Hebb update
3. Oja's rule

4. Principal Component Analysis
5. Anti-Hebbian Learning
6. Estimating crosscorrelation with Hebbian networks
7. Novelty filters
8. Linear associative memories (LAMs)
9. LMS learning as a combination of Hebb rules
10. Autoassociation
11. Nonlinear associative memories
12. Conclusions

Go to next section

1. Introduction

The neurophysiologist Donald Hebb enunciated in the 1940s a principle that became very influential in neurocomputing. By studying the communication between neurons, Hebb verified that once a neuron repeatedly excited another neuron, the threshold of excitation of the latter decreased, i.e. the communication between them was facilitated by repeated excitation. This means that repeated excitation lowered the threshold, or equivalently that the excitation effect of the first neuron was amplified (Figure 1).

Figure 1. Biological and modeled artificial system (neuron 1 exciting neuron 2 across a synapse, and the corresponding j-th PE connected to the i-th PE through the weight w_ij)

One can extend this idea to artificial systems very easily.

In artificial neural systems, neurons are equivalent to PEs, and PEs are connected through weights. Hence, Hebb's principle will increase the common weight w_ij when there is activity flowing from the j-th PE to the i-th PE. If we denote the output of the i-th PE by y_i and the activation of the j-th PE by x_j, then

Δw_ij = η x_j y_i    Equation 1

where η is our already known step size, which controls what percentage of the product is effectively used to change the weight. There are many more ways to translate Hebb's principle into equations, but Eq. 1 is the most commonly used and is called Hebb's rule.

Unlike all the learning rules studied so far (LMS and backpropagation), there is no desired signal required in Hebbian learning. In order to apply Hebb's rule, only the input signal needs to flow through the neural network. Learning rules that use only information from the input to update the weights are called unsupervised. Note that in unsupervised learning the learning machine is changing the weights according to some internal rule specified a priori (here the Hebb rule). Note also that the Hebb rule is local to the weight.

Go to the next section

2. Effect of the Hebb update

Let us see what the net effect of updating a single weight w in a linear PE with the Hebb rule is. Hebbian learning updates the weights according to

w(n+1) = w(n) + η x(n) y(n)    Equation 2

where n is the iteration number and η a stepsize. For a linear PE, y = wx, so

w(n+1) = w(n) [1 + η x²(n)]    Equation 3

If the initial value of the weight is a small positive constant (w(0) ≈ 0), irrespective of the value of η > 0 and of the input sign, the update will always be positive.

Hence, the weight value will increase with the number of iterations without bound, irrespective of the value of η. This is unlike the behavior we observed for LMS or backpropagation, where the weights would stabilize for a range of step sizes. Hence, Hebbian learning is intrinsically unstable, producing very large positive or negative weights. In biology this is not a problem because there are natural limitations to synaptic efficacy (chemical depletion, dynamic range, etc.).

NeuroSolutions Training with the Hebbian rule

In this example, we introduce the Hebbian Synapse. The Hebbian Synapse implements the weight update of Equation 2. The Hebbian network is built from an input Axon, the Hebbian Synapse and an Axon, so it is a linear network. Since the Hebbian Synapse, and all the other Unsupervised Synapses (which we will introduce soon), use an unsupervised weight update (no desired signal), they do not require a backpropagation layer. The weights are updated on a sample by sample basis. This example shows the behavior of the Hebbian weight update. The weights with the Hebbian update will always increase, no matter how small the stepsize is. We have placed a scope at the output of the net and also opened a MatrixViewer to observe the weights during learning. The only thing that the stepsize does is to control the rate of increase of the weights. Notice also that if the initial weight is positive the weights will become increasingly more positive, while if the initial weight is negative the weights become increasingly more negative.

NeuroSolutions Example
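For readers without NeuroSolutions, the same divergence can be reproduced with a few lines of code. The following is a minimal sketch (in Python/NumPy, added here and not part of the original text; the step size, number of iterations, and random input are illustrative choices) that applies the scalar Hebbian update of Equations 2 and 3 to a single linear PE:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01                       # step size (illustrative)
w = 0.01                         # small positive initial weight, w(0) ~ 0
x = rng.standard_normal(200)     # zero-mean random input samples

for n, xn in enumerate(x):
    y = w * xn                   # linear PE: y = w x
    w = w + eta * xn * y         # Hebb's rule, Eq. 2; equals w * (1 + eta * x^2), Eq. 3
    if n % 50 == 0:
        print(f"iteration {n:3d}  w = {w:.4f}")

# The weight grows monotonically: the factor (1 + eta * x^2) is always >= 1,
# so the sign of w never changes and its magnitude increases without bound.
```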

2.1. The multiple input PE

Hebbian learning is normally applied to single layer linear networks. Figure 2 shows a single linear PE with D inputs, which will be called the Hebbian PE.

Figure 2. A D-input linear PE (inputs x_1, ..., x_D connected to the output y through the weights w_1, ..., w_D)

The output is

y = Σ_{i=1}^{D} w_i x_i    Equation 4

According to Hebb's rule, the weight vector is adapted as

Δw = η [x_1 y, ..., x_D y]^T    Equation 5

It is important to get a solid understanding of the role of Hebbian learning, and we will start with a geometric interpretation. Eq. 4 in vector notation (vectors are denoted by bold letters) is simply

y = w^T x = x^T w    Equation 6

i.e. the transpose of the weight vector is multiplied with the input (which is called the inner product) to produce the scalar output y. We know that the inner product is computed as the product of the lengths of the vectors times the cosine of their angle θ,

y = ||w|| ||x|| cos(θ)    Equation 7

So, assuming normalized inputs and weights, a large y means that the input x is close to the direction of the weight vector (Figure 3), i.e. x is in the neighborhood of w.

Figure 3. The output of the linear PE in vector space (y is the projection of x onto w, with θ the angle between them)
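As a quick numerical illustration of Equations 6 and 7 (a sketch added here, not from the original text; the vectors are made up), the inner product of unit-length vectors is large when they point in similar directions and near zero when they are almost perpendicular:

```python
import numpy as np

w = np.array([1.0, 0.0])                 # unit weight vector along the x-axis
x_close = np.array([0.95, 0.31])         # roughly 18 degrees away from w
x_far   = np.array([0.05, 0.999])        # almost perpendicular to w

for name, x in [("close", x_close), ("far", x_far)]:
    x = x / np.linalg.norm(x)            # normalize the input
    y = w @ x                            # Eq. 6: y = w^T x
    theta = np.degrees(np.arccos(y))     # Eq. 7 with unit norms: y = cos(theta)
    print(f"{name}: y = {y:.3f}, angle = {theta:.1f} degrees")
```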

A small y means that the input is almost perpendicular to w (the cosine of 90 degrees is 0), i.e. x and w are far apart. So the magnitude of y measures the similarity between the input x and the weight w, using the inner product as the similarity measure. This is a very powerful interpretation. During learning the weights are exposed to the data and condense all this information in their value. This is the reason the weights should be considered the long-term memory of the network (see long and short term memory).

The Hebbian PE is a very simple system that creates a similarity measure (the inner product, Eq. 7) in its input space according to the information contained in the weights. During operation, once the weights are fixed, a large output y signifies that the present input is similar to the inputs x that created the weights during training. We can say that the output of the PE responds high or low according to the similarity of the present input with what the PE remembers from training. So, the Hebbian PE implements a type of memory that is called an associative memory.

NeuroSolutions Directions of the Hebbian update

This example shows how the Hebbian network projects the input onto the vector defined by its weights. We use an input which is composed of samples that fall on an ellipse in 2 dimensions, and allow you to select the weights. When you run the network, a custom DLL will display both the input (blue) and the projection of the input onto the weight vector (black). The default is to set the weights to [1,0], which defines a vector along the x-axis.

Thus you would be projecting the input onto the x-axis. Change the value of the weights, which will rotate the vector. Notice that in any direction the output will track the input along that direction, i.e. the output is the projection of the input along that specified direction. Notice also the Megascope display. When the input data circles the origin, the output produces a sinusoidal component in time, since the projection increases and decreases periodically with the rotation. The amplitude of the sinusoid is maximal when the weight vector is [1,0], since this is the direction that produces the largest projection for this data set. If we release the weights, i.e. if they are trained with Hebbian learning, the weights will exactly seek the direction [1,0]. It is very interesting to note the path of the evolution of the weights (it oscillates around this direction). Note also that they are becoming progressively larger.

NeuroSolutions Example

2.2. The Hamming Network as a primitive associative memory

This idea that a simple linear network embeds a similarity metric can be explored in many practical applications. Here we will exemplify its use in information transmission, where noise normally corrupts messages. We will assume that the messages are strings of bipolar binary values (-1/1), and that we know the strings of the alphabet (for instance the ASCII codes of the letters). A practical problem is to find, from a given string of 5 bits received, which string was sent. We can think of an n-bit string as a vector in n-dimensional space. The ASCII code for each letter can also be thought of as a vector. So the question of finding the value of the received string is the same as asking which is the closest ASCII vector to the received string (Figure 4). Using the argument above, we should find the ASCII vector on which the bit string produces the largest projection.

Figure 4. The problem of finding the best match to the received character in vector space: the received vector z is compared with the constellation of code vectors coded in the weights, e.g. a = [-1,-1,-1,-1,1], b = [-1,-1,-1,1,-1], ..., z = [1,1,-1,1,-1], to find the best match.

A linear network can be constructed with as many inputs as bits in an ASCII code (here we will only use 5 bits, although the ASCII code is 8 bits long) and a number of outputs equal to the size of the alphabet (here 26 letters). The weights of the network will be hard coded as the bit patterns of all the ASCII letters. More formally, the inputs are vectors x = [x_1, x_2, ..., x_5]^T, each output is a scalar, and the weight matrix S is built from rows that are our ASCII codes, represented by s_i = [s_i1, s_i2, ..., s_i5], with i = 1, ..., 26. The output of the network is y = Sx.

The remaining question is how to measure the distance between the received vector and each of the ASCII characters. Since the patterns are binary, one possibility is to ask how many bit flips are present between the received string and each of the ASCII characters. One should assign the received string to the ASCII character that has the least number of bit flips. This distance is called the Hamming distance, HD (also known as the Manhattan norm or L1 norm). When a character is received, each output i of the network is the scalar product of the input with the corresponding row vector s_i. This scalar product can be written as the total number of positions in which the vectors agree minus the number of positions in which they differ, which is quantified as their HD. Since the number of positions in which they agree is 5 - HD, we have

x^T s_i = 5 − 2 HD(s_i, x)

This equation states that if we add a bias equal to 5 to each of the outputs of our net, we can directly interpret the network output as a Hamming distance (to be exact, the weights should be multiplied by −0.5 and the bias by 0.5 to obtain the HD). A perfect match will provide an output of 5. So one just needs to look for the highest output to know which character was sent.

NeuroSolutions Signal detection with Hamming networks

In this example we create the equivalent of a Hamming net which will recognize the binary ASCII of 5 letters (A, B, L, P, Z). The input to the network is the last 5 bits of each letter. For instance, A is -1,-1,-1,-1,1, B is -1,-1,-1,1,-1, etc. Because we know ahead of time what the letters will be, we will set the weights to the expected ASCII code of each letter. But here we are not going to use the Hamming distance but the dot-product distance of Hebbian learning. According to the associative memory concept, when the input and the weight vector are the same, the output of the net will be the largest possible. For instance, if -1,-1,-1,-1,1 is input to the network, the first PE will respond with the highest possible output. Single step through the data to see that in fact the net gives the correct response. Notice also that the other outputs are not zero, since the distance between each weight vector and the input is finite (it depends on the Hamming distance between the input and weight vectors). When noise corrupts the input, this network can be used to determine which letter the input was most likely to be. Noise will affect each component of the vector, but the net will assign the highest output to the weight vector that lies closest to the noisy input. When the noise is small this still provides a good assignment. Increase the noise power to see when the system breaks down. It is amazing that such a simple device still provides the correct output most of the time when the variance of the noise is 2.

NeuroSolutions Example
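A rough equivalent of this matching scheme can be written directly (a sketch added here, not from the original text; the five-letter codebook mirrors the example above and the noise level is an arbitrary choice):

```python
import numpy as np

# Codebook: last 5 bits of the ASCII codes for A, B, L, P, Z in bipolar (-1/1) form.
letters = "ABLPZ"
S = np.array([[-1, -1, -1, -1,  1],    # A
              [-1, -1, -1,  1, -1],    # B
              [-1,  1,  1, -1, -1],    # L
              [ 1, -1, -1, -1, -1],    # P
              [ 1,  1, -1,  1, -1]])   # Z

rng = np.random.default_rng(1)
sent = S[2]                                       # transmit the code for 'L'
received = sent + rng.normal(0.0, 1.0, size=5)    # additive noise on every bit

y = S @ received                         # dot products with every stored pattern
hd = (5 - S @ np.sign(received)) / 2     # Hamming distance after hard-limiting the input
best = np.argmax(y)                      # largest projection = closest stored code

print("outputs      :", np.round(y, 2))
print("Hamming dist :", hd.astype(int))
print("decoded as   :", letters[best])
```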

Note that here we utilized the inner product metric intrinsic to the Hebbian network instead of the Hamming distance, but the result is very similar. In this example the weight matrix was constructed by hand due to the knowledge we have about the problem. In general the weight matrix has to be adapted, and this is where Hebbian learning is important. Nevertheless this example shows the power of association for information processing.

2.3. Hebbian rule as correlation learning

There is a strong reason to translate Hebb's principle as in Eq. 1. In fact, Eq. 1 prescribes a weight correction according to the product between the j-th and the i-th PE activations. Let us substitute Eq. 6 into Eq. 5 to obtain the vector equivalent

Δw(n) = η y(n) x(n) = η x(n) x^T(n) w(n)    Equation 8

In on-line learning the weight vector is repeatedly changed according to this equation using a different input sample for each n. However, in batch mode, after iterating over the input data of L patterns, the cumulative weight is the sum of the products of the input with its transpose

w(L) = w(0) + η Σ_{n=1}^{L} x(n) x^T(n) w(0)    Equation 9

Eq. 9 can be thought of as a sample approximation to the autocorrelation of the input data, which is defined as R_x = E[x x^T], where E[.] is the expectation operator (see Appendix). Effectively the Hebbian algorithm is updating the weights with a sample estimate R̂_x of the autocorrelation function

w(L) = w(0) + η L R̂_x w(0)    Equation 10

Correlation is a well known operation in signal processing and in statistics, and it measures the second order statistics of the random variable under study.

First and second order statistics are sufficient to describe signals modeled as Gaussian distributions, as we saw in Chapter II (i.e. first and second order statistics are all that is needed to completely describe the data cluster). Second order moments also describe many properties of linear systems, such as the adaptation of the linear regressor studied in Chapter I.

2.4. Power, Quadratic Forms and Hebbian Learning

As we saw, the output of a linear network is given by Eq. 6. We will define the power at the output (see energy, power and variance), given the data set {x(1), x(2), ..., x(L)}, as

P = (1/L) Σ_{n=1}^{L} y²(n) = w^T R w,   where R = (1/L) Σ_{n=1}^{L} x(n) x^T(n)    Equation 11

P in Eq. 11 is a quadratic form, and it can be interpreted as a field in the space of the weights. Since R is positive definite, we can further say that this field is a paraboloid facing upwards and passing through the origin of the weight space (Figure 5).

Figure 5. The power field P = w^T R w as a performance surface (contours of constant P in the (w_1, w_2) plane, with the gradient ∇P = 2Rw)

Let us take the gradient of P with respect to the weights:

∇P = ∂P/∂w = 2Rw

We can immediately recognize that this equation provides the basic form for the Hebbian update of Eq. 10.
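To make the connection concrete, here is a small numerical check (an added sketch in Python/NumPy, not part of the original text; the data and step size are arbitrary) that the batch Hebbian update of Eq. 9 equals η L R̂ w(0), i.e. a step along the gradient direction Rw of the power field:

```python
import numpy as np

rng = np.random.default_rng(0)
L, eta = 500, 0.01
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=L)  # L samples, 2-D
w0 = np.array([0.3, -0.2])

# Batch Hebbian update: accumulate eta * x * y with y evaluated at w(0) (Eq. 9).
delta = sum(eta * x * (w0 @ x) for x in X)

# Sample autocorrelation and the same step written as eta * L * R_hat * w(0) (Eq. 10).
R_hat = (X.T @ X) / L
delta_via_R = eta * L * R_hat @ w0

print(np.allclose(delta, delta_via_R))   # True: both expressions give the same step
print("gradient direction R w0:", R_hat @ w0 / np.linalg.norm(R_hat @ w0))
print("Hebbian step direction :", delta / np.linalg.norm(delta))
```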

If we recall the performance surface concept of Chapter I, we see immediately that the power field is the performance surface for Hebbian learning. So we conclude that when we train a network with the Hebbian rule we are doing gradient ASCENT (seeking the maximum) in the power field of the input data. The sample by sample adaptation rule of Eq. 8 is merely a stochastic version and follows the same behavior. Since the power field is unbounded upwards, we can immediately expect that Hebbian learning will diverge unless some type of normalization is applied to the update rule (see instability of Hebbian). This is a shortcoming for our computer implementations because, due to the limited dynamic range, it will produce overflow errors. But there are many ways to normalize the Hebbian update.

NeuroSolutions Instability of Hebbian

This example shows that the Hebbian update rule is unstable, since the weights grow without bound. We use a simple 2D input example to show that the weight vector grows. We have opened a MatrixViewer to see the weights, and we also plot the tip of the weight vector in the ScatterPlot as a blue dot (think of the weight vector as going from the origin to the blue dot). Notice, however, that the weight vector always diverges along the same direction. This is not by chance. Although unstable, the Hebbian network is finding the direction where the output is the largest. The more you train the network, the larger the weights get. Repeat several times to observe the behavior we describe. So the Hebbian update is not practical.

NeuroSolutions Example

2.5 Data representations in multidimensional spaces

An important question is: what does the direction of the gradient ascent represent? In order to understand the answer to this question we have to talk about data representations in multidimensional spaces. We normally collect information about real world events with sensors. Most of the time the data needed to model a real world phenomenon is multidimensional, that is, we need several sensors (such as temperature, pressure, flow, etc.).

This immediately says that the state of the real world system is multidimensional, in fact a point in a space where the axes are exactly our measurement variables. In Figure 6 we show a two dimensional example. So the system states create a cloud of points somewhere in this measurement space. An alternative way to describe the cloud of points is to define a new set of axes that are glued to the cloud of points instead of to the measurement variables. This new coordinate system is called a data dependent coordinate system. From Figure 6 we see that the data dependent representation moves the origin to the center (the mean) of our cloud of samples. But we can do more. We can also try to align one of the axes with the direction where the data has the largest projection. This is called the principal coordinate system for the data. For simplicity we would also like the principal coordinate system to be orthogonal (more on this later). Notice that the original (measurement) coordinate system and the principal coordinate system are related by a translation and a rotation, which is called an affine transform in algebra. If we know the parameters of this transformation we have captured a lot about the structure of our cloud of data.

Figure 6. The principal coordinate system (a cloud of samples plotted against measurement 1 and measurement 2, with the principal coordinate system centered on the cloud)

What we gain with the principal coordinate system is knowledge about the structure of the data and versatility. We may say, "I want to represent my data in a smaller dimensional space" to simplify the problem, or to be able to visualize the data, etc. Suppose that we are interested in preserving the variance of the cloud of points, since variance is associated with information (see information and variance).

To make the point clear, let us try to find only a single direction (i.e. a one dimensional space) to represent most of the variance of our data. What direction should we use? If you think a bit, the principal coordinate system is the one that makes the most sense, because we aligned one of its axes with the direction where the data has the largest variance. In this coordinate system we should then choose the axis where the data has the largest projected variance. Now let us go back to the Hebbian network. The weights of the network trained with the Hebbian learning rule find the direction of the input power gradient. The output of the Hebbian network (the projection of the input onto the weight vector) will then be the largest variance projection. In other words, the Hebbian network finds the axis of the principal coordinate system where the projected variance is the largest, and gives it as the output. What is amazing is that the simple Hebbian rule automatically finds this direction for us with a local learning rule! So even though Hebbian learning was biologically motivated, it is a way of creating network weights that are tuned to the second order statistics of the input data. Moreover, the network does this with a rule that is local to the weights. We can further say that Hebbian learning extracts the most information about the input, since of all possible linear projections it finds the one that maximizes the variance at the output (which is synonymous with information for Gaussian distributed variables).

Go to the next section

3. Oja's rule

Perhaps the simplest normalization of Hebb's rule was proposed by Oja. Let us divide the new value of the weight in Eq. 2 by the norm of the new weight vector connected to the PE, i.e.

w_i(n+1) = [w_i(n) + η y(n) x_i(n)] / ( Σ_j [w_j(n) + η y(n) x_j(n)]² )^(1/2)    Equation 12

We see that this expression effectively normalizes the size of the weight vector to one. So if a given weight component increases, the others have to decrease to keep the weight vector at the same length. So weight normalization is in fact a constraint. Assuming the step size is small, Oja approximated the update of Eq. 12 by

w_i(n+1) = w_i(n) + η y(n) [x_i(n) − y(n) w_i(n)] = w_i(n) [1 − η y²(n)] + η x_i(n) y(n)    Equation 13

producing Oja's rule (see derivation of Oja's rule). Note that this rule can still be considered a Hebbian update with a normalized activity x̃_i(n) = x_i(n) − y(n) w_i(n). The normalization is basically a forgetting factor proportional to the output squared (see Eq. 13). This equation describes the fundamental problem of Hebbian learning. In order to avoid unlimited growth of the weights, we applied a forgetting term. This solves the problem of weight growth but creates another problem: if a pattern is not presented frequently it will be forgotten, since the network forgets old associations.

NeuroSolutions Oja's rule

This example introduces the Oja's Synapse (look at the synapse with the label Oja). The network is still a linear network, but the Oja's Synapse implements Oja's weight update described in Eq. 13. The overall network function is similar to the Hebbian network, except that now the weights stabilize, producing a vector in the direction of maximum variance. The input data is the same as in the previous case. Notice that now the weights of the single output network produce a vector oriented along the largest axis of the cloud of input samples (45 degrees). This is the direction which produces the largest possible output. Randomize the weights several times during learning to see that the network quickly finds this direction. Depending upon the sign of the initial weights, the final weights will be either both positive or both negative, but the direction does not change.

The stepsize now controls the speed of convergence. If the stepsize is too large, the iteration will blow up, as in the gradient descent learning case. Large stepsizes also produce rattling of the final weights (note that the weights form a linear segment), which should be avoided. If the stepsize is too small, the process will converge slowly. The best approach is to start the adaptation with a large stepsize and anneal its value to a small constant to fine tune the final position of the weights. This can be accomplished with the scheduler.

NeuroSolutions Example

3.1 Oja's rule implements the principal component network

What is the meaning of the weight vector of a neural network trained with Oja's rule? In order to answer this question, let us study a single linear PE network with multiple inputs (Figure 2) using the ideas of vector spaces. The goal is to study the projection defined by the weights created with Oja's rule. We already saw that Hebbian learning finds the direction where the input data has the largest projection, but the weight vector grows without limit. Now, with Oja's rule, we have found a way to normalize the weight vector to 1. If you recall, vectors of length 1 are normally used as axes of coordinate systems. We should expect that this normalization would not change the geometric picture we developed for the Hebbian network. In fact, it is possible to show that Oja's rule finds a weight vector w = e_0 which satisfies the relation (see proof of eigen-equation)

R e_0 = λ_0 e_0    Equation 14

where R is the autocorrelation function of the input data, and λ_0 is a real-valued scalar. This equation was already encountered in Chapter I and tells us that e_0 is an eigenvector of the autocorrelation function, since rotating e_0 by R (the left side) produces a vector colinear with itself.
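The convergence of Eq. 13 to the eigen-direction of Eq. 14 is easy to check numerically. The following sketch (added here in Python/NumPy, not part of the original text; the correlated 2-D data and the step size are arbitrary choices) trains a single linear PE with Oja's rule and compares the result with the principal eigenvector of the sample autocorrelation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Zero-mean 2-D data elongated along the 45-degree direction.
X = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=2000)

w = rng.standard_normal(2) * 0.1
eta = 0.005
for epoch in range(20):
    for x in X:
        y = w @ x
        w += eta * y * (x - y * w)        # Oja's rule, Eq. 13

# Principal eigenvector of the sample autocorrelation R (Eq. 14).
R = (X.T @ X) / len(X)
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns eigenvalues in ascending order
e0 = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

print("Oja weight vector    :", np.round(w, 3), " norm =", round(np.linalg.norm(w), 3))
print("principal eigenvector:", np.round(e0, 3))
print("alignment |cos(angle)|:", round(abs(w @ e0) / np.linalg.norm(w), 3))
```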

We can further show that λ_0 is in fact the largest eigenvalue of R, so e_0 is the eigenvector that corresponds to the largest eigenvalue. We should expect this since, from eigendecomposition theory, we know that this scalar is exactly the variance of the data projected on the eigendirection, and Oja's rule seeks the gradient direction of the power field. We conclude that training the linear PE with Oja's algorithm produces a weight vector that is aligned with the direction in the input space where the input data cluster produces the largest variance (the largest projection). Figure 7 shows a simple case in 2D. It shows a data cluster (black dots) spread along the 45 degree line. The principal axis of the data is the direction in 2D space where the data has its largest power (projection variance). So imagine a line passing through the center of the cluster, and rotate it so that the data cluster produces the largest spread along the line. For this case the direction will be close to 45 degrees. The weight vector of the network of Figure 2 trained with Oja's rule coincides exactly with the principal axis, also called the principal component. The direction perpendicular to it (the minor axis) will produce a much smaller spread. For zero mean data, the direction of maximum spread coincides with the direction where most of the information about the data resides. The same thing happens when the data exists in a larger dimensionality space D, but we cannot visualize it anymore.

Figure 7. Projection of a data cluster onto the principal components (the data cluster in the (x1, x2) plane, with the principal direction giving the largest spread of the projection and the minor direction the smallest spread)

If you relate this figure with the NeuroSolutions example, the Oja's weight vector found exactly the direction where the data produced the largest projection.

This is a very important property, because the simple one-PE network trained with Oja's rule is extracting the most information that it can from the input, if we accept that information is associated with the power of the input. In engineering applications, where the input data is normally corrupted by noise, this system will provide a solution that maximizes the ratio of signal power (of the largest sinusoidal component) to noise power (see definition of eigenfilter).

Go to the next section

4. Principal Component Analysis

We saw that Oja's rule found a unit-length weight vector that is colinear with the principal component of the input data. But how can we find other directions where the data cluster still has appreciable variance? We would like to create more axes of the principal coordinate system mentioned in section 2.5. For simplicity we would like to create an orthogonal coordinate system (i.e. all the vectors are orthogonal to each other) with unit length vectors (an orthonormal coordinate system). How can we do this? Principal Component Analysis answers this question.

Principal Component Analysis, or PCA for short, is a very well known statistical procedure that has important properties. Suppose that we have input data of very large dimensionality (D dimensions). We would like to project this data onto a smaller dimensionality space M (M < D), a step that is commonly called feature extraction. Projection will always distort our data somewhat (just think of a 3-D object and its 2-D shadow). Obviously we would like to do this projection onto the M dimensional space while maximally preserving the information (variance, from a representation point of view) about the input data. The linear projection that accomplishes this goal is exactly PCA (see PCA, SVD, and KL transforms).

PCA produces an orthonormal basis that is built from the eigenvectors of the input data autocorrelation function. The variances of the projections onto each basis vector are therefore given by the eigenvalues of R.

If one orders the eigenvectors in descending order of the eigenvalues and truncates them at M (with M < D), then we will project the input data onto a linear space of (smaller) dimensionality M. In this space the projections onto the axes will produce the M largest eigenvalues, so there is no better linear projection to preserve the input signal power. The outputs of the PCA represent the input in a smaller subspace, so they are called features. So PCA is the best linear feature extractor for signal reconstruction. The error e in the approximation when we utilize M features is exactly given by

e² = Σ_{i=M+1}^{D} λ_i    Equation 15

Eq. 15 tells us that the error power is exactly equal to the sum of the eigenvalues that were discarded. For the case of Figure 7, the minimum error in representing the 2-D data set in a 1-D space is obtained when the principal direction is chosen as the projection axis. The error power is exactly given by the projection on the minor direction. If we had decided to keep the projection on the minor direction instead, the error incurred would have been much higher. This method is called subspace decomposition, and it is widely applied in signal processing and statistics to find the best subspace of a given dimension that maximally preserves the data information.
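Equation 15 can be verified directly with an eigendecomposition (a sketch in Python/NumPy added here, not from the original text; the dimensions and the random data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, L = 8, 3, 5000
A = rng.standard_normal((D, D))
X = rng.standard_normal((L, D)) @ A            # zero-mean data with correlated components

R = (X.T @ X) / L                              # sample autocorrelation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]              # sort eigenvalues in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W = eigvecs[:, :M]                             # keep the M principal eigenvectors
X_hat = (X @ W) @ W.T                          # project to M features and reconstruct

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print("reconstruction error power  :", round(mse, 4))
print("sum of discarded eigenvalues:", round(eigvals[M:].sum(), 4))   # Eq. 15
```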

There are well known algorithms that compute PCA analytically, but they have to solve matrix equations (singular value decomposition). Can we build a neural network that implements PCA on-line, with local learning rules? The answer is affirmative. We have to use a linear network with multiple outputs (equal to the dimension M of the projection space) as in Figure 8.

Figure 8. A PCA network to project the data from D to M dimensions (inputs x_1, ..., x_D fully connected through the weights w_ij to the outputs y_1, ..., y_M).

The idea is very simple. First, we compute the largest eigenvector as done above with Oja's rule. Then we project the data onto the space perpendicular to the largest eigenvector and apply the algorithm again to find the second largest principal component, and so on up to order M ≤ D. The projection onto the orthogonal space is easily accomplished by subtracting the outputs of all previous components (after convergence) from the input. This method is called the deflation method and mimics the Gram-Schmidt orthogonalization procedure (see Gram-Schmidt orthogonalization).

What is interesting is that the deflation method can be accomplished easily by slightly modifying Oja's learning rule, as was first done by Sanger. We are assuming that the network has M outputs, each given by

y_i(n) = Σ_{j=1}^{D} w_ij(n) x_j(n),   i = 1, ..., M    Equation 16

and D inputs (M ≤ D). To apply Sanger's rule the weights are updated according to

Δw_ij(n) = η y_i(n) [ x_j(n) − Σ_{k=1}^{i} w_kj(n) y_k(n) ]    Equation 17

This rule resembles Oja's update, but now the input to each PE is modified by subtracting the outputs of the preceding PEs times the respective weights. This implements the deflation method after the system converges.
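A compact way to read Eq. 17 is that output i sees the input with the reconstructions of outputs 1..i removed. The sketch below (added in Python/NumPy, not part of the original text; the data, sizes, and step size are illustrative) trains a 2-output Sanger network and checks that the rows of W approach the two principal eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, eta = 3, 2, 0.001
C = np.array([[4.0, 1.0, 0.5],
              [1.0, 2.0, 0.3],
              [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(D), C, size=3000)

W = rng.standard_normal((M, D)) * 0.1              # row i holds the weight vector of output i
for epoch in range(30):
    for x in X:
        y = W @ x                                   # Eq. 16
        # Eq. 17 in matrix form: Delta W = eta * (y x^T - lower_triangular(y y^T) W)
        W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# Compare with the two principal eigenvectors of the sample autocorrelation matrix.
R = (X.T @ X) / len(X)
eigvals, eigvecs = np.linalg.eigh(R)
for i in range(M):
    e = eigvecs[:, -1 - i]                          # i-th largest eigenvector
    print(f"output {i}: |cos| with eigenvector {i} =",
          round(abs(W[i] @ e) / np.linalg.norm(W[i]), 3))
```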

The weight update of Eq. 17 is not local, since we need all the previous network outputs to compute the weight update for weight w_ij. However, there are other rules that use local updates (such as the APEX algorithm; see Diamantaras). As we can expect from Eq. 17 and the explanation above, there is a coupling between the modes, i.e. only after convergence of the first PE weights will the second PE weights converge completely to the eigenvector that corresponds to the second largest eigenvalue. There are other on-line algorithms for the same purpose, such as the lateral inhibition network and the recursively computed APEX, but for the sake of simplicity they will be omitted here.

A two-output PCA network applied to the data of Figure 7 will have weight vectors that correspond to the principal and minor components. The two outputs will correspond to the largest and smallest eigenvalues, respectively. The interesting thing about subspace projections is that in many problems the data is already restricted to an (unknown) subspace, so PCA can effectively perform data compression, preserving the major features of the data.

NeuroSolutions Sanger's and PCA

This example introduces Sanger's rule (look at the synapse with the label Sang in the breadboard). Sanger's rule does Principal Component Analysis (PCA). The dimension M of the output determines the size of the output space, i.e. the number of eigenvectors and also the number of features used to represent the input data. PCA finds the M weight vectors which capture the most information about the input data. For instance, a 3-output Sanger's network will find 3 orthogonal vectors: the principal axis, which captures more information than any other vector in the input space, along with the two vectors which capture the second most and third most information. In this example, we take a high dimensional input, 8x8 images of the 10 digits, and project them onto their M principal components. M is a variable that you can control by setting the number of outputs of the Sanger's network. The outputs of the PCA network are the features obtained by the projection.

We then use a custom DLL to recreate the digits using only the M features. This DLL takes the output of the Sanger's network and multiplies it by the transpose of W, so it recreates a 64-output image. This image shows us how much of the original information in the input we have captured in the M dimensional subspace. When the two images are identical, we have preserved in the features the information contained in the input data. The display of the eigenvectors (the PCA weights) is not easy, since they are vectors in a 64 dimensional space. After convergence they are orthogonal. We can use the Hinton probe to visualize their values, but it is difficult to find patterns (in fact the signs should alternate more frequently towards the higher orders, meaning that finer details are being encoded). Try different values for the subspace dimension (M), and verify that PCA is very robust, i.e. even with just a few dimensions the reconstructed digits can be recognized.

A word of caution is needed at this point. PCA finds the subspace that best represents the ensemble of digits, so the best discrimination among the digits in the subspace is not guaranteed by PCA. If the goal is discrimination among the digits, then a classifier should be designed for that purpose. PCA is a linear representation mechanism, and only guarantees that the features contain the most information for reconstruction.

NeuroSolutions Example

The PCA decomposition is a very important operation in data processing because it provides knowledge about the hidden structure (latent variables) of the data. As such, there are many other possible formulations of the problem (see PCA derivation).

4.1. PCA for data compression

PCA is the optimal linear feature extractor, i.e. there is no other linear system that is able to provide better features for reconstruction. So one of the obvious PCA applications is data compression. In data compression the goal is to transmit as few bits per second as possible while preserving as much of the source information as possible. This means that we must squeeze into each bit as much information from the source as possible. We can model data compression as a projection operation where the goal is to find a set of basis vectors that produces a large concentration of signal power in only a few components. In PCA compression the receiver must know the weight matrix containing the eigenvectors, since the estimation of the input from the features is done by

x̃ = W^T y    Equation 18

The weight matrix is obtained after training with exemplars of the data to be transmitted. It has been shown that for special applications this step can be completed efficiently, and it is done only once. So the receiver can be constructed beforehand. The reconstruction step requires M×D operations, where D is the input vector dimension and M is the size of the subspace (number of features).

4.2. PCA features and classification

We may think that a system able to optimally preserve signal energy in a subspace should also be the optimal projector for classification. Unfortunately this is not the case. The reason can be seen in Figure 9, where we have represented two classes. When the PCA is computed, no distinction is made between the samples of each class, so the optimal 1-D projection for reconstruction (the principal direction) is along the x1 axis. However, it is easy to see that the best discrimination between these two clusters is along the x2 axis, which from the point of view of reconstruction is the minor direction. So PCA chooses the projections that best reconstruct the data in the chosen subspace. This may or may not coincide with the projection for best discrimination. A similar thing happened when we addressed regression and classification (first example of Chapter II). A linear regressor can be used as a classifier, but there is no guarantee that it produces the optimal classifier (which by definition minimizes the classification error).

Figure 9. The relation between eigendirections and classification (class 1 and class 2 overlap along x1, the principal direction, so classification with x1 is not perfect, while projection onto x2, the minor direction, gives perfect classification)
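The situation of Figure 9 is easy to reproduce numerically (an added Python/NumPy sketch, not from the original text; the two synthetic classes are arbitrary): the leading principal component picks the high-variance x1 axis even though the classes are only separable along x2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Both classes share a large variance along x1; they differ only in their x2 mean.
class1 = np.column_stack([rng.normal(0, 5, n), rng.normal(+1, 0.3, n)])
class2 = np.column_stack([rng.normal(0, 5, n), rng.normal(-1, 0.3, n)])
X = np.vstack([class1, class2])
labels = np.array([1] * n + [2] * n)

Xc = X - X.mean(axis=0)                      # zero-mean data for PCA
eigvals, eigvecs = np.linalg.eigh((Xc.T @ Xc) / len(Xc))
pc = eigvecs[:, -1]                          # principal direction (close to the x1 axis)
minor = eigvecs[:, 0]                        # minor direction (close to the x2 axis)

def error_rate(direction):
    """Classification error of the better threshold-at-zero labeling of the 1-D projection."""
    p = Xc @ direction
    pred = np.where(p > 0, 1, 2)
    return min(np.mean(pred != labels), np.mean(pred == labels))

print("principal direction:", np.round(pc, 2), " error ~", round(error_rate(pc), 3))
print("minor direction    :", np.round(minor, 2), " error ~", round(error_rate(minor), 3))
```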

However, PCA is appealing for classification since it is a simple procedure, and experience has shown that it normally provides good features for classification. But this depends upon the problem, and there is no guarantee that classifiers based on PCA features will work well.

NeuroSolutions PCA for preprocessing

In this example we use PCA to find the best possible linear projection in terms of reconstruction, and then we use an MLP to classify the data into one of 10 classes (the digits). Notice that in fact this problem was already solved in Chapter III with the perceptron, and we obtained perfect classification using the input data directly. The only way we can do a fair comparison is to limit the number of weights in the two systems to the same value and compare performance.

NeuroSolutions Example

Go to the next section

5. Anti-Hebbian Learning

We have seen that Hebbian learning discovers the directions in space where the input data has the largest variance. Let us make a very simple modification to the algorithm and include a minus sign in the weight update rule of Eq. 1, i.e.

Δw_ij = −η x_j y_i    Equation 19

This rule is called the anti-Hebbian rule. Let us assume that we train the system of Figure 2 with this rule. What do you expect this rule will do? The easiest reasoning is to recall that the Hebbian network maximizes the output variance by doing gradient ascent in the power field. Now, with the negative sign in the weight update equation, the adaptation will seek the minimum of the performance surface, i.e. the output variance will be minimized. Hence, the linear network trained with anti-Hebbian learning will always produce zero output, because the weights will seek the directions in the input space where the data cluster has a point projection. This is called the null (or orthogonal) space of the data. The network finds this direction by doing gradient descent in the power field. If the data fills the full input space, then the weights will have to go to zero. On the other hand, if the data exists in a subspace, the weights will find the directions where the data projects to a point. For Figure 7, anti-Hebbian learning will produce zero weights. However, if the data were one dimensional, i.e. along the 45 degree line, then the weights would be placed along the 135 degree line.

NeuroSolutions Anti-Hebbian learning

In this example we use the Hebbian synapse with a negative stepsize to implement an anti-Hebbian network. The anti-Hebbian rule minimizes the output variance, thus it will try to find a vector which is orthogonal to the input (the null space of the input) such that the projection of the data onto the weight vector is always zero.

There are two cases of importance. Either the data lies in a subspace of the input space, in which case the zero output can be achieved by adapting the weight vector perpendicular to the subspace where the input lies; or, in the second case, the input samples cover the full input space, so the only way to get a zero output is to drive the weights to zero. Notice how fast the anti-Hebbian network trains. If the data moves in the input space, notice that the weights always find the direction orthogonal to the data cluster.

NeuroSolutions Example

This behavior of anti-Hebbian learning can be described as decorrelation, i.e. a linear PE trained with anti-Hebbian learning decorrelates the output from its input. We must realize that Hebbian and anti-Hebbian learning have complementary roles in projecting the input data, which are very important for signal processing. For instance, the new high resolution spectral analysis techniques (such as MUSIC and ESPRIT; see Kay) are based on ways of finding the null space of the data, and so they can be implemented on-line using anti-Hebbian learning. We will provide an example in Chapter IX.

5.1. Convergence of the anti-Hebbian rule

Another interesting thing is that the convergence of the anti-Hebbian rule can be controlled by the step size, as in LMS or backpropagation. This means that if the step size is too large the weights will get progressively larger (diverge), but if the step size is below a given value the adaptation will converge. In fact, from the fact that the power field is a paraboloid in weight space, we know it has a single minimum. Hence the situation is like the gradient descent we studied in Chapter I. What is the value below which the weights converge to finite values? The anti-Hebbian update for one weight is

w(n+1) = w(n) [1 − η x²(n)]    Equation 20

So, if we take expectations and project onto the principal coordinate system, as we did in Chapter I to compute the largest stepsize for the LMS, we can conclude that

w(n+1) = (1 − ηλ) w(n)    Equation 21

which is stable if

η < 2/λ    Equation 22

where λ is the eigenvalue of the autocorrelation function of the input. We can immediately see the similarity to the convergence of the LMS rule. For a system with multiple inputs the requirement for convergence has to be modified to

η < 2/λ_max    Equation 23

where λ_max is the largest eigenvalue of the input autocorrelation function, as in the LMS case.

NeuroSolutions Stability of Hebbian

This example shows that the anti-Hebbian rule is stable for the range of values given by Eq. 23 when random data is used. Just change the stepsize to see the compromise between rattling and speed of convergence achieved with the anti-Hebbian rule. Since the weight update is sample by sample, when the data has deterministic structure divergence may occur at step sizes smaller than the ones predicted by Eq. 23. The same behavior was encountered with the LMS.

NeuroSolutions Example
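The bound of Eq. 23 can be probed with a short simulation of the expected-value recursion of Eq. 21 (an added Python/NumPy sketch, not part of the original text; the covariance and the two step sizes are arbitrary): the weights decay for a step size below 2/λ_max and diverge above it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=2000)
R = (X.T @ X) / len(X)
lam = np.linalg.eigvalsh(R)                 # eigenvalues of the input autocorrelation
lam_max = lam.max()
print("eigenvalues:", np.round(lam, 3), " bound 2/lambda_max =", round(2 / lam_max, 3))

# Expected weight recursion in the principal coordinate system, Eq. 21:
# each mode evolves as w_k(n+1) = (1 - eta * lambda_k) * w_k(n).
for eta in (0.9 * 2 / lam_max, 1.1 * 2 / lam_max):
    w = np.ones_like(lam)                   # one coefficient per eigen-mode
    for n in range(200):
        w = (1 - eta * lam) * w             # Eq. 21 applied mode by mode
    print(f"eta = {eta:.3f}: |w| after 200 steps = {np.linalg.norm(w):.3e}")
```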

Go to the next section

6. Estimating crosscorrelation with Hebbian networks

Suppose that we have two data sets formed by P exemplars of N-dimensional data x_1, ..., x_N and of M-dimensional data d_1, ..., d_M, and the goal is to estimate the crosscorrelation between them. The crosscorrelation is a measure of similarity between the two sets of data which extends the idea of the correlation coefficient (see Appendix and Chapter I). In practice we are often faced with the question: how similar is this data set to that other one? Crosscorrelation helps to answer exactly this question. Let us assume that the data samples are ordered by their indices. The crosscorrelation for index i, j is

r_xd(i, j) = (1/L) Σ_{k=1}^{L} x_{i,k} d_{j,k},   0 < i ≤ N, 0 < j ≤ M    Equation 24

where L is the number of patterns, N is the size of the input vector and M is the size of the desired response vector. The fundamental operation of correlation is to cross multiply the data samples and add the contributions. Define the average operator

A[u] = (1/L) Σ_{k=1}^{L} u_k    Equation 25

The crosscorrelation can then be defined as

r(i, j) = A[x_i d_j]    Equation 26

where the vector x_i = [x_{i,1}, x_{i,2}, ..., x_{i,P}] is built from the i-th sample of all the patterns in the input set (likewise for d). The crosscorrelation matrix R_xd is built from all possible index pairs i, j, i.e.

R_xd = A[ x_1 d_1   x_1 d_2   ...   x_1 d_M
          x_2 d_1   x_2 d_2   ...   x_2 d_M
          ...
          x_N d_1   x_N d_2   ...   x_N d_M ]    Equation 27

The crosscorrelation vector used in regression (Chapter I) is just the first column of this matrix. Now let us relate this formalism to the calculations of a linear network trained with Hebbian learning.

Assume we have a linear network with N inputs x and M outputs y (Figure 10).

Figure 10. A multiple input, multiple output Hebbian network (inputs x_1, x_2, ..., x_N connected through the weights w_ij to the outputs y_1, y_2, ..., each of which is paired with a desired response d_1, d_2, ...)

In order to compute the crosscorrelation between x and the data set d, we will substitute the network output y in the Hebbian rule by the data set d, i.e.

Δw_ij = η x_j d_i    Equation 28

which implements what we call forced Hebbian learning. We can write the output y_i as in Eq. 4, but now with two indices i and j:

y_i = Σ_{j=1}^{N} w_{i,j} x_j    Equation 29

The weight w_{i,j}, when adapted with forced Hebbian learning, takes the form

w_{i,j}(n+1) = w_{i,j}(n) + η x_j(n) d_i(n)    Equation 30

If w_{i,j}(0) = 0, after L iterations we get

w_{i,j}(L) = η Σ_{n=1}^{L} x_j(n) d_i(n)    Equation 31

So by comparing Eq. 24 with Eq. 31 we conclude that the weight w_ij trained with forced Hebbian learning is proportional, after L iterations, to the crosscorrelation element r_xd(j, i). If η = 1/L and the initial conditions are zero, it is exactly this element. Notice also that the elements of the crosscorrelation matrix are precisely the weights of the linear network (Eq. 27). For this reason the linear network trained with forced Hebbian learning is called a correlator or a linear associator. Hence, forced Hebbian learning is an alternate, on-line way of computing the crosscorrelation function between two data sets.
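This equivalence is easy to check numerically (an added Python/NumPy sketch, not part of the original text; the 3-input, 2-output random data set is made up and differs from the NeuroSolutions example that follows): one batch pass of forced Hebbian learning with η = 1/L and zero initial weights reproduces the sample crosscorrelation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, M = 100, 3, 2                 # L patterns, N inputs, M desired outputs
X = rng.standard_normal((L, N))
D = rng.standard_normal((L, M))

eta = 1.0 / L
W = np.zeros((M, N))                # w_ij connects input j to output i
for x, d in zip(X, D):
    W += eta * np.outer(d, x)       # forced Hebbian learning, Eq. 28 / Eq. 30

R_xd = (X.T @ D) / L                # Eq. 24: r_xd(i, j) = (1/L) sum_k x_ik d_jk
print(np.allclose(W, R_xd.T))       # True: the weights are the crosscorrelation values
```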

NeuroSolutions Forced Hebbian computes crosscorrelation

In this example we show how forced Hebbian learning simply computes the crosscorrelation of the input and desired output. We have a 3-input network which we would like to train with a desired response of 2 outputs. We have created a data set with 4 patterns. The crosscorrelation computed according to Eq. 24 is

r(0,0) = 0.5; r(0,1) = r(1,0) = 0; r(1,1) = 0.25; r(0,2) = 0.5; r(1,2) = 0.25

Let us use the Hebbian network and take a look at the final weights. Notice that we started the weights with a zero value, and stopped the network after 10 iterations of each batch (4 patterns) with a stepsize of 1/(4×10).

NeuroSolutions Example

There are two important applications of this concept that we will address in this chapter. One uses crosscorrelation with anti-Hebbian learning to find what is different between two data sets, and can be considered a novelty filter. The other is possibly even more important: a memory device called an associative memory.

Go to the next section

7. Novelty Filters and Lateral Inhibition

Let us assume that we have two data sets x and d. Taking x as the input to our system, we want to create an output y as dissimilar as possible to the data set d (Figure 11). This function is very important in signal processing (decorrelation) and in information processing (uncorrelated features), and it seems to be at the core of biological information processing. We humans filter out with extreme ease what we already know from the sensory input (either visual or acoustic). This avoids information overload.

7. Variable extraction and dimensionality reduction

7. Variable extraction and dimensionality reduction 7. Variable extraction and dimensionality reduction The goal of the variable selection in the preceding chapter was to find least useful variables so that it would be possible to reduce the dimensionality

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

The Hebb rule Neurons that fire together wire together.

The Hebb rule Neurons that fire together wire together. Unsupervised learning The Hebb rule Neurons that fire together wire together. PCA RF development with PCA Classical Conditioning and Hebbʼs rule Ear A Nose B Tongue When an axon in cell A is near enough

More information

CHAPTER 3. Pattern Association. Neural Networks

CHAPTER 3. Pattern Association. Neural Networks CHAPTER 3 Pattern Association Neural Networks Pattern Association learning is the process of forming associations between related patterns. The patterns we associate together may be of the same type or

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

Unsupervised learning: beyond simple clustering and PCA

Unsupervised learning: beyond simple clustering and PCA Unsupervised learning: beyond simple clustering and PCA Liza Rebrova Self organizing maps (SOM) Goal: approximate data points in R p by a low-dimensional manifold Unlike PCA, the manifold does not have

More information

Machine Learning (Spring 2012) Principal Component Analysis

Machine Learning (Spring 2012) Principal Component Analysis 1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks What are (Artificial) Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Using a Hopfield Network: A Nuts and Bolts Approach

Using a Hopfield Network: A Nuts and Bolts Approach Using a Hopfield Network: A Nuts and Bolts Approach November 4, 2013 Gershon Wolfe, Ph.D. Hopfield Model as Applied to Classification Hopfield network Training the network Updating nodes Sequencing of

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Iterative face image feature extraction with Generalized Hebbian Algorithm and a Sanger-like BCM rule

Iterative face image feature extraction with Generalized Hebbian Algorithm and a Sanger-like BCM rule Iterative face image feature extraction with Generalized Hebbian Algorithm and a Sanger-like BCM rule Clayton Aldern (Clayton_Aldern@brown.edu) Tyler Benster (Tyler_Benster@brown.edu) Carl Olsson (Carl_Olsson@brown.edu)

More information

Neural Networks Lecture 2:Single Layer Classifiers

Neural Networks Lecture 2:Single Layer Classifiers Neural Networks Lecture 2:Single Layer Classifiers H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi Neural

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1 Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of

More information

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, 2012 1 / 1 2 / 1 Other datatypes Other datatypes Document data You might start with

More information

Artificial Neural Networks Examination, March 2004

Artificial Neural Networks Examination, March 2004 Artificial Neural Networks Examination, March 2004 Instructions There are SIXTY questions (worth up to 60 marks). The exam mark (maximum 60) will be added to the mark obtained in the laborations (maximum

More information

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required.

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In humans, association is known to be a prominent feature of memory.

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Hebb rule book: 'The Organization of Behavior' Theory about the neural bases of learning

Hebb rule book: 'The Organization of Behavior' Theory about the neural bases of learning PCA by neurons Hebb rule 1949 book: 'The Organization of Behavior' Theory about the neural bases of learning Learning takes place in synapses. Synapses get modified, they get stronger when the pre- and

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Comparative Performance Analysis of Three Algorithms for Principal Component Analysis

Comparative Performance Analysis of Three Algorithms for Principal Component Analysis 84 R. LANDQVIST, A. MOHAMMED, COMPARATIVE PERFORMANCE ANALYSIS OF THR ALGORITHMS Comparative Performance Analysis of Three Algorithms for Principal Component Analysis Ronnie LANDQVIST, Abbas MOHAMMED Dept.

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) Principal Component Analysis (PCA) Salvador Dalí, Galatea of the Spheres CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy and Lisa Zhang Some slides from Derek Hoiem and Alysha

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017

COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

Dimensionality Reduction

Dimensionality Reduction Lecture 5 1 Outline 1. Overview a) What is? b) Why? 2. Principal Component Analysis (PCA) a) Objectives b) Explaining variability c) SVD 3. Related approaches a) ICA b) Autoencoders 2 Example 1: Sportsball

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

Covariance and Correlation Matrix

Covariance and Correlation Matrix Covariance and Correlation Matrix Given sample {x n } N 1, where x Rd, x n = x 1n x 2n. x dn sample mean x = 1 N N n=1 x n, and entries of sample mean are x i = 1 N N n=1 x in sample covariance matrix

More information

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis .. December 20, 2013 Todays lecture. (PCA) (PLS-R) (LDA) . (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageble size. The new dataset can then be used to make

More information

Lecture 7: Con3nuous Latent Variable Models

Lecture 7: Con3nuous Latent Variable Models CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/

More information

CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA)

CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) CS68: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) Tim Roughgarden & Gregory Valiant April 0, 05 Introduction. Lecture Goal Principal components analysis

More information

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 1215 A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing Da-Zheng Feng, Zheng Bao, Xian-Da Zhang

More information

Maximum variance formulation

Maximum variance formulation 12.1. Principal Component Analysis 561 Figure 12.2 Principal component analysis seeks a space of lower dimensionality, known as the principal subspace and denoted by the magenta line, such that the orthogonal

More information

The Perceptron. Volker Tresp Summer 2016

The Perceptron. Volker Tresp Summer 2016 The Perceptron Volker Tresp Summer 2016 1 Elements in Learning Tasks Collection, cleaning and preprocessing of training data Definition of a class of learning models. Often defined by the free model parameters

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Designing Information Devices and Systems II Fall 2018 Elad Alon and Miki Lustig Homework 9

Designing Information Devices and Systems II Fall 2018 Elad Alon and Miki Lustig Homework 9 EECS 16B Designing Information Devices and Systems II Fall 18 Elad Alon and Miki Lustig Homework 9 This homework is due Wednesday, October 31, 18, at 11:59pm. Self grades are due Monday, November 5, 18,

More information

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Machine Learning Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 1 / 47 Table of contents 1 Introduction

More information

Computational Intelligence Lecture 6: Associative Memory

Computational Intelligence Lecture 6: Associative Memory Computational Intelligence Lecture 6: Associative Memory Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Fall 2011 Farzaneh Abdollahi Computational Intelligence

More information

A. The Hopfield Network. III. Recurrent Neural Networks. Typical Artificial Neuron. Typical Artificial Neuron. Hopfield Network.

A. The Hopfield Network. III. Recurrent Neural Networks. Typical Artificial Neuron. Typical Artificial Neuron. Hopfield Network. Part 3A: Hopfield Network III. Recurrent Neural Networks A. The Hopfield Network 1 2 Typical Artificial Neuron Typical Artificial Neuron connection weights linear combination activation function inputs

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear Algebra for Machine Learning. Sargur N. Srihari Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques, which are widely used to analyze and visualize data. Least squares (LS)

More information

Feedforward Neural Nets and Backpropagation

Feedforward Neural Nets and Backpropagation Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23 Supervised Learning Roadmap Supervised Learning: Assume that we are given the features

More information

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang. Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

More information

A. The Hopfield Network. III. Recurrent Neural Networks. Typical Artificial Neuron. Typical Artificial Neuron. Hopfield Network.

A. The Hopfield Network. III. Recurrent Neural Networks. Typical Artificial Neuron. Typical Artificial Neuron. Hopfield Network. III. Recurrent Neural Networks A. The Hopfield Network 2/9/15 1 2/9/15 2 Typical Artificial Neuron Typical Artificial Neuron connection weights linear combination activation function inputs output net

More information

This appendix provides a very basic introduction to linear algebra concepts.

This appendix provides a very basic introduction to linear algebra concepts. APPENDIX Basic Linear Algebra Concepts This appendix provides a very basic introduction to linear algebra concepts. Some of these concepts are intentionally presented here in a somewhat simplified (not

More information

Introduction to Artificial Neural Networks

Introduction to Artificial Neural Networks Facultés Universitaires Notre-Dame de la Paix 27 March 2007 Outline 1 Introduction 2 Fundamentals Biological neuron Artificial neuron Artificial Neural Network Outline 3 Single-layer ANN Perceptron Adaline

More information

CSC 411 Lecture 12: Principal Component Analysis

CSC 411 Lecture 12: Principal Component Analysis CSC 411 Lecture 12: Principal Component Analysis Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 12-PCA 1 / 23 Overview Today we ll cover the first unsupervised

More information

IV. Matrix Approximation using Least-Squares

IV. Matrix Approximation using Least-Squares IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

More information

Lecture 4: Feed Forward Neural Networks

Lecture 4: Feed Forward Neural Networks Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training

More information

Chapter 9: The Perceptron

Chapter 9: The Perceptron Chapter 9: The Perceptron 9.1 INTRODUCTION At this point in the book, we have completed all of the exercises that we are going to do with the James program. These exercises have shown that distributed

More information

7 Rate-Based Recurrent Networks of Threshold Neurons: Basis for Associative Memory

7 Rate-Based Recurrent Networks of Threshold Neurons: Basis for Associative Memory Physics 178/278 - David Kleinfeld - Fall 2005; Revised for Winter 2017 7 Rate-Based Recurrent etworks of Threshold eurons: Basis for Associative Memory 7.1 A recurrent network with threshold elements The

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS 121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module 2 Lecture 05 Linear Regression Good morning, welcome

More information

HST.582J/6.555J/16.456J

HST.582J/6.555J/16.456J Blind Source Separation: PCA & ICA HST.582J/6.555J/16.456J Gari D. Clifford gari [at] mit. edu http://www.mit.edu/~gari G. D. Clifford 2005-2009 What is BSS? Assume an observation (signal) is a linear

More information

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1). Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes

More information

COMP 558 lecture 18 Nov. 15, 2010

COMP 558 lecture 18 Nov. 15, 2010 Least squares We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Principal Component Analysis Barnabás Póczos Contents Motivation PCA algorithms Applications Some of these slides are taken from Karl Booksh Research

More information

Neural Networks. Hopfield Nets and Auto Associators Fall 2017

Neural Networks. Hopfield Nets and Auto Associators Fall 2017 Neural Networks Hopfield Nets and Auto Associators Fall 2017 1 Story so far Neural networks for computation All feedforward structures But what about.. 2 Loopy network Θ z = ቊ +1 if z > 0 1 if z 0 y i

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Metric-based classifiers. Nuno Vasconcelos UCSD

Metric-based classifiers. Nuno Vasconcelos UCSD Metric-based classifiers Nuno Vasconcelos UCSD Statistical learning goal: given a function f. y f and a collection of eample data-points, learn what the function f. is. this is called training. two major

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

Linear Algebra & Geometry why is linear algebra useful in computer vision?

Linear Algebra & Geometry why is linear algebra useful in computer vision? Linear Algebra & Geometry why is linear algebra useful in computer vision? References: -Any book on linear algebra! -[HZ] chapters 2, 4 Some of the slides in this lecture are courtesy to Prof. Octavia

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Learning Vector Quantization

Learning Vector Quantization Learning Vector Quantization Neural Computation : Lecture 18 John A. Bullinaria, 2015 1. SOM Architecture and Algorithm 2. Vector Quantization 3. The Encoder-Decoder Model 4. Generalized Lloyd Algorithms

More information

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7. Preliminaries Linear models: the perceptron and closest centroid algorithms Chapter 1, 7 Definition: The Euclidean dot product beteen to vectors is the expression d T x = i x i The dot product is also

More information

7 Recurrent Networks of Threshold (Binary) Neurons: Basis for Associative Memory

7 Recurrent Networks of Threshold (Binary) Neurons: Basis for Associative Memory Physics 178/278 - David Kleinfeld - Winter 2019 7 Recurrent etworks of Threshold (Binary) eurons: Basis for Associative Memory 7.1 The network The basic challenge in associative networks, also referred

More information

Singular Value Decomposition and Principal Component Analysis (PCA) I

Singular Value Decomposition and Principal Component Analysis (PCA) I Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression

More information

Preliminaries. Definition: The Euclidean dot product between two vectors is the expression. i=1

Preliminaries. Definition: The Euclidean dot product between two vectors is the expression. i=1 90 8 80 7 70 6 60 0 8/7/ Preliminaries Preliminaries Linear models and the perceptron algorithm Chapters, T x + b < 0 T x + b > 0 Definition: The Euclidean dot product beteen to vectors is the expression

More information

Lecture 6. Notes on Linear Algebra. Perceptron

Lecture 6. Notes on Linear Algebra. Perceptron Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

2- AUTOASSOCIATIVE NET - The feedforward autoassociative net considered in this section is a special case of the heteroassociative net.

2- AUTOASSOCIATIVE NET - The feedforward autoassociative net considered in this section is a special case of the heteroassociative net. 2- AUTOASSOCIATIVE NET - The feedforward autoassociative net considered in this section is a special case of the heteroassociative net. - For an autoassociative net, the training input and target output

More information

Foundations of Computer Vision

Foundations of Computer Vision Foundations of Computer Vision Wesley. E. Snyder North Carolina State University Hairong Qi University of Tennessee, Knoxville Last Edited February 8, 2017 1 3.2. A BRIEF REVIEW OF LINEAR ALGEBRA Apply

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. Broadly, these techniques can be used in data analysis and visualization

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Clustering VS Classification

Clustering VS Classification MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:

More information

22 Approximations - the method of least squares (1)

22 Approximations - the method of least squares (1) 22 Approximations - the method of least squares () Suppose that for some y, the equation Ax = y has no solutions It may happpen that this is an important problem and we can t just forget about it If we

More information

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x = Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.

More information