Computational Explorations in Cognitive Neuroscience Chapter 4: Hebbian Model Learning


4.1 Overview

Learning is a general phenomenon that allows a complex system to replicate some of the structure in its environment. Structure refers to any regularity or consistent pattern in the environment. Donald Hebb was instrumental in suggesting a way that learning could take place in the nervous system: by changing synaptic weights. For Hebb, joint activation of the units in a cell assembly strengthens their connections. We now know that there are synaptic processes that can strengthen (and weaken) synapses by following a (modified) Hebbian rule, i.e., they require joint pre- and postsynaptic depolarization to take effect.

Hebbian learning is a form of model learning: the result of learning is the construction of a model in the system that captures some of the structure of the environment. Because it does not require explicit feedback from the environment, it is called self-organizing.

4.3 Computational Objectives of Learning

The problem of constructing an internal model of the external world is a difficult one. It is inherently under-determined, or ill-posed, i.e., there is not enough information available to do a complete job. In other words, our sensory systems have an impoverished view of the outside world. They have to figure out what is going on based only on small glimpses. At the same time, another problem is that there is an overwhelming quantity of potential information available to the senses. The model construction job is made even more difficult by having to sift through all the activity going on in the sensory sheets to find information that is relevant.

In summary: our senses deliver a large quantity of low-quality information that must be highly processed to produce the apparently transparent access to the world that we experience. The strategy taken by the nervous system (and the one that we must follow in designing neural networks) is to use biases to organize and select the incoming sensory data. One useful bias is parsimony: choosing the simplest possible explanation from all the possible options.

Consider the case of vision. The visual system must construct models of the visual world based on a series of limited snapshots: two-dimensional projections on the retina of a high-dimensional space. That is, many dimensions of variation are collapsed and intertwined in the photoreceptor activity. The problem of recovering all these dimensions is under-determined because the mapping of the environment to the activity patterns is many-to-one. That is, many different environmental arrangements could be responsible for any given pattern of retinal activity. Therefore, many different possible internal models could fit equally well with the activity. Another way of expressing this idea: the interpretation of the visual world by the visual system is not sufficiently constrained by the visual input. Because many different dimensions may be collapsed in the projection onto the retina, it is difficult for the system to determine the real external causes in visual perception.

One process that helps overcome this problem is to use the constraint of temporal consistency. This constraint can be implemented by temporal integration: averaging over a sequence of individual snapshots. Temporal integration is important for the implementation of learning in neural network models. The process of slowly adding up small weight changes results in weights that represent the aggregate statistics of a large sample of experiences. If there are stable patterns in the input space, these will prevail in the final weights through this averaging process. The result is that we can train a network to represent stable sources of input over a wide range of experiences with the world.

Averaging, however, is not enough. Averaging all the snapshots of the world that we are exposed to would result in a uniformly gray image. There needs to be some filtering, i.e., some selectivity as to what is taken in. This is where biases, or prior expectations, are critical. Biases allow the system to determine what kinds of input patterns are more informative than others, and how to organize the transformations of those input patterns (remember the clustering diagrams).

For biases to be useful, they must provide a fit to the properties of the real world. How can this be done? One type of bias that is NOT very useful is the implementation of specific knowledge. This amounts to hard-wiring a system with connection patterns that represent detailed aspects of the environment. This only works if the environment is guaranteed to contain those aspects. It hinders the system's ability to flexibly adjust to different environments. Also, it is not neurobiologically realistic. On the other hand, two types of bias that are both useful and biologically plausible are:
a) architectural: built-in connectivity preferences; e.g., area a is connected preferentially to area b, and not to area c.
b) parametric: built-in differences in values such as quantity, ratio, or rate; e.g., area a has a greater proportion of inhibitory cells or a faster learning rate than area b.

There must be a balance between too much and too little use of biases in learning. This is the bias-variance dilemma: an over-reliance on biases results in not enough learning about the environment, whereas an over-reliance on experiences results in learning that is idiosyncratic and variable. Both extremes result in the construction of poor models: the first because the model may be too rigid, not accounting for important dimensions of variability in the world; the second because the model may be too prone to domination by specific instances of input and fail to account for important consistent features of the world. So it is important that biases and experiences be balanced. However, there is no general procedure that will guarantee this balance.

4.3.1 Simple Exploration of Correlational Model Learning

One powerful way for a model to find consistency in the patterns it receives from the environment is to detect correlations. In visual inputs, for example, correlations are indicative of stable features in the visual world. The simulation in this section is a very simple example showing that the weights of a network (using a Hebbian learning mechanism) can be shaped to reflect the correlation structure of the environment. The weights are initially uniform, but they change to conform to a consistent feature of the input space, i.e., a single line.

4.4 Principal Components Analysis

Understanding how Hebbian learning causes units in a network to represent correlations in the environment can be aided by the concept of principal components analysis (PCA). PCA is a technique that rotates a variable space so as to maximize the variance along one (principal) axis. It is an iterative process that orders the major directions of variability in the variable space according to the amount of variability that they account for. The first principal component is the one that accounts for the greatest amount of variability. We will see that a simple Hebbian learning mechanism operates in such a way as to represent the first principal component of the variability of the input space.
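To make the PCA computation concrete, here is a minimal sketch (our own illustrative code and data, not the book's simulator) that recovers the first principal component of a set of input patterns from their correlation matrix:

```python
# Minimal PCA sketch (illustrative): find the first principal component of
# a set of input patterns via the eigenvectors of their correlation matrix.
import numpy as np

rng = np.random.default_rng(0)

# 200 patterns over 5 input units; units 1 and 2 are reliably co-active
# (a "line" present on 70% of trials), the rest is low-level noise.
X = 0.1 * rng.random((200, 5))
line_on = rng.random(200) < 0.7
X[line_on, 1] += 1.0
X[line_on, 2] += 1.0

C = (X.T @ X) / len(X)           # correlation matrix <x_i x_k>_t
eigvals, eigvecs = np.linalg.eigh(C)
first_pc = eigvecs[:, -1]        # eigenvector with the largest eigenvalue
print(np.round(first_pc, 2))     # loads most heavily on units 1 and 2
```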

4.4.1 Simple Hebbian PCA in One Linear Unit

Consider a receiving unit (Figs. 4.5, 4.6) with the activation function:

$$y_j = \sum_k x_k w_{kj} \qquad (4.1)$$

Keep in mind that each variable is a function of time, although that is not written explicitly. At each time step, the values of these variables may change. Also, a different pattern of activity may be presented over the input units at each time step. Specifically, the Hebbian learning rule is implemented by updating the weights into the receiving unit at each time step.

For the rule to be Hebbian, the weight change (dwt in the simulator) must depend on both the pre- and post-synaptic units' activity:

$$\Delta_t w_{ij} = \epsilon \, x_i y_j \qquad (4.2)$$

The symbol ε is called the learning rate parameter (lrate in the simulator). When ε is large, the weights undergo big changes at each step; when it is small, they undergo small changes. Explicitly, the weight change enters into the weight update equation as:

$$w_{ij}(t+1) = w_{ij}(t) + \Delta_t w_{ij} \qquad (4.3)$$

Now, to observe the overall effect of learning, we need to determine how the weights change over a whole sequence of input patterns. (Remember that a different input pattern is considered to be presented at each time step.)
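As a rough sketch of (4.1)-(4.3) in code (the function name and setup are ours, not the simulator's):

```python
# Sketch of the simple Hebbian update (4.1)-(4.3); names are illustrative.
import numpy as np

def hebbian_step(w, x, lrate):
    """One time step of simple Hebbian learning on weight vector w."""
    y = x @ w                 # receiving unit activation y_j (4.1)
    return w + lrate * x * y  # w(t+1) = w(t) + lrate * x_i * y_j (4.2)-(4.3)

w = np.full(5, 0.1)           # uniform initial weights
w = hebbian_step(w, np.array([0.0, 1.0, 1.0, 0.0, 0.0]), lrate=0.005)
```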

The weight change over all input pattern presentations is:

$$\Delta w_{ij} = \epsilon \sum_t x_i y_j \qquad (4.4)$$

Notice that if we set ε equal to 1/N (where N is the total number of input patterns presented), then the right-hand side of this equation is just the temporal mean (expected value). Thus:

$$\Delta w_{ij} = \langle x_i y_j \rangle_t \qquad (4.5)$$

Now, we can substitute the expression for $y_j$ from (4.1) (assuming the weights change slowly relative to the averaging, so they can be averaged separately):

$$\Delta w_{ij} = \Big\langle x_i \sum_k x_k w_{kj} \Big\rangle_t = \sum_k \langle x_i x_k \rangle_t \, \langle w_{kj} \rangle_t = \sum_k C_{ik} \, \langle w_{kj} \rangle_t \qquad (4.6)$$

where $C_{ik}$ is an element of the correlation matrix, representing the correlation between two input units i and k. The correlation between units i and k is defined as the expected value of the product of their activity values over time. [Refer to Fig. 4.6]

The last equation says: the overall change in the weight from input unit i to receiving unit j is a weighted average, over all the different input units (indexed by k), of the correlations between these other input units and the particular input unit i. Each correlation is weighted by the average weight from the other input unit to the receiving unit. Training a network with this learning rule results in a set of weights that are dominated by the pattern of correlation that best accounts for the variability of the input space, i.e., the network learns to detect the most prevalent feature of the input space. This (strongest) pattern of correlation can be thought of as the first principal component of the input data.
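Equation (4.6) says that the averaged weight change is proportional to C w, so repeatedly applying it amounts to power iteration on the correlation matrix, which converges on the dominant eigenvector, i.e., the first principal component. A minimal sketch (the matrix and values are made up for illustration):

```python
# Sketch: iterating dw ~ C w (4.6) behaves like power iteration, pulling the
# weight vector toward the dominant eigenvector (first principal component).
import numpy as np

C = np.array([[1.0, 0.9, 0.0],    # units 0 and 1 strongly correlated
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 0.2]])   # unit 2 mostly uncorrelated
w = np.array([0.3, 0.1, 0.5])     # arbitrary initial weights
for _ in range(50):
    w = C @ w
    w /= np.linalg.norm(w)        # rescale only to keep the numbers finite
print(np.round(w, 3))             # ~dominant eigenvector of C
```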

4.4.2 Oja's Normalized Hebbian PCA

A general problem with the simple Hebbian learning rule is that there is no bound on the weights as learning continues. It is neither computationally feasible nor biologically realistic to let the weights grow without bound. A variety of different methods have been proposed for restraining weight growth. An early version was that of Oja (1982):

$$\Delta w_{ij} = \epsilon \left( x_i y_j - y_j^2 w_{ij} \right) \qquad (4.9)$$

This modified Hebbian learning rule subtracts away a portion of the weight value to keep it from growing infinitely large. At equilibrium (when the weight value is no longer changing), the weight from a given input unit represents the fraction of that input's activation relative to the total weighted activation over all the other inputs:

$$0 = \epsilon \left( x_i y_j - y_j^2 w_{ij} \right) \;\;\Rightarrow\;\; w_{ij} = \frac{x_i y_j}{y_j^2} = \frac{x_i}{y_j} = \frac{x_i}{\sum_k x_k w_{kj}} \qquad (4.10)$$
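A sketch of Oja's rule in action (the input distribution and parameters are our own illustrative choices). The point is that the weight vector converges to the first principal component with roughly unit length, rather than growing without bound:

```python
# Sketch of Oja's (1982) normalized Hebbian rule (4.9) on correlated inputs.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)

w = rng.random(2)
lrate = 0.01
for x in X:
    y = x @ w
    w += lrate * (x * y - y * y * w)  # Hebbian term minus y^2 decay (4.9)
print(np.round(w, 3), np.round(np.linalg.norm(w), 3))  # ~unit-norm first PC
```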

4.5 Conditional Principal Components Analysis

How can this simple form of learning be applied to a layer of receiving units, instead of a single one? One possibility would be to have a different receiving unit for each principal component of the input space. Called sequential principal components analysis (SPCA), this approach would train the first receiving unit on the first principal component, the second on the second, etc. Each successive principal component would represent the direction of greatest variability in the data after the previous ones had been removed.

In theory, SPCA provides a complete representation of the input space. However, it is not practical because it is based on the assumption that the input space has a hierarchical structure. The structure of the real world is more likely to be heterarchical: consisting of separate but equal categories. Interestingly, when models that are constructed to produce heterarchical representations are trained with inputs from natural scenes, they produce weight patterns that resemble receptive field properties of some neurons in the primary visual cortex, i.e., preferred tuning for bars of light of a particular thickness in a particular orientation. Although the SPCA approach is a general solution, it is not biologically realistic. Also, because of its generality, it computes correlations over the entire space of input patterns. This ignores the type of clustering that exists in real-world inputs, where meaningful correlations only exist in particular sub-regions of input space, not over the entire space.

For example, of all the patterns of light that cross your retinas, only a small subset are relevant for behavior. These patterns are the ones for which the visual pattern recognition system must represent correlations. Thus, realism would seem to require application of a type of conditionality to restrict the PCA computation to only certain input patterns. This argument motivates a version of Hebbian learning called conditional PCA (CPCA). Conditionality is imposed by determining when individual units will participate in learning about different aspects of the input space.

A conditionalizing function is used to specify the conditions under which a given unit should perform its PCA function. This function determines when a receiving unit is active, which is when PCA learning can occur. It would be desirable for the conditionalizing function to be self-organizing, where the units evolve their own conditionalizing function as a result of interactions during learning. To begin understanding CPCA, however, we assume the existence of a conditionalizing function that turns on receiving units for some inputs and not for others. Thus, the receiving unit's activation serves as a conditionalizing factor.

4.5.1 The CPCA Learning Rule

The design objective of CPCA is to have each weight represent the probability that the input unit is active conditional on the receiving unit also being active. Thus:

$$w_{ij} = P(x_i \mid y_j) \qquad (4.11)$$

In CPCA, we assume that the receiving unit represents a subset of input patterns in the environment. (4.11) states the learning objective: we want the weights to reflect the probability that a given input unit is active across the subset of input patterns represented by the receiving unit, i.e., those for which it is active.

A conditional probability of 0.5 is equivalent to zero correlation, i.e., the input unit is equally likely to be on or off when the receiving unit is active. When the conditional probability is greater than 0.5, a positive correlation exists between the input unit being on and the receiving unit being on. When the conditional probability is less than 0.5, a negative correlation exists: the input unit is more likely to be off when the receiving unit is on.

Note that the activation of a receiving unit in CPCA depends on more than one input unit. This makes the weight for any given input unit dependent on its correlation with the other input units. This is the most basic property desired from a Hebbian learning mechanism. The following weight update rule has been shown (Rumelhart & Zipser, 1986) to achieve CPCA:

$$\Delta_t w_{ij} = \epsilon \left( y_j x_i - y_j w_{ij} \right) = \epsilon \, y_j \left( x_i - w_{ij} \right)$$

The corresponding effect of weight changes over all inputs is:

$$\Delta w_{ij} = \epsilon \sum_t y_j \left( x_i - w_{ij} \right) \qquad (4.12)$$

This rule has the effect of normalizing the weights so that they do not become infinitely large as learning progresses.

Note that the weights are adjusted according to two factors:
a) how different the input unit's activation level is from the weight value;
b) the activation of the receiving unit.
The activity of the receiving unit controls the weight adjustment:
a) If the receiving unit is not active, no weight adjustment will occur.
b) If the receiving unit is active, the weight adjustment depends on its level of activity (reflecting how much it cares about the input).
In the second case, the rule tries to set the weight to match the input unit activation. That is, if the weight is equal to the input activation, no change will occur, and the farther the weight is from the input activation level, the greater the weight change will be. Thus, when the receiving unit cares about the input, it tries to match the weights from its input units to the activation levels of those units.
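A minimal sketch of this behavior for a single weight (the 0.7 co-activation probability echoes the p_right example later in the chapter; the code is ours):

```python
# Sketch: the CPCA update (4.12) drives a weight toward P(x_i | y_j).
import numpy as np

rng = np.random.default_rng(2)
w, lrate = 0.5, 0.01
for _ in range(20000):
    y = 1.0                        # receiving unit active (conditionalizing)
    x = float(rng.random() < 0.7)  # input co-active 70% of the time given y
    w += lrate * y * (x - w)       # dw = lrate * y_j * (x_i - w_ij) (4.12)
print(round(w, 2))                 # fluctuates around 0.7 = P(x|y)
```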

4.5.2 Derivation of CPCA Learning Rule

At this point, we want to show that the weight update rule of (4.12) achieves the conditional probability objective of (4.11). To do this, we assume that the activation represents the hypothesis about the input and the input pattern represents the data. This assumption allows us to replace each activation in (4.12) by the joint probability of its activation and the occurrence of a given input pattern. The joint probability is expressed in terms of the conditional probability as described in Chapter 2. The joint probability of the hypothesis (h) and the data (d) is given by P(h, d). It is equivalent to the intersection of the hypothesis and data. Conditional probability:

$$P(h \mid d) = \frac{P(h, d)}{P(d)} \qquad (2.23)$$

Interpretation: when we receive some particular input data, this equation gives the probability that the hypothesis is true. This gives us an expression for the joint probability in terms of the conditional probability:

$$P(h, d) = P(h \mid d) \, P(d) \qquad (2.23\mathrm{b})$$

Following (2.23b), for activations $x_i$ and $y_j$, and input pattern t, the joint probabilities are:

$$P(y_j, t) = P(y_j \mid t) \, P(t)$$
$$P(x_i, t) = P(x_i \mid t) \, P(t)$$

Substituting joint probabilities for activations, the weight update rule (4.12) becomes:

$$\Delta w_{ij} = \epsilon \sum_t \left[ P(y_j \mid t) \, P(x_i \mid t) - P(y_j \mid t) \, w_{ij} \right] P(t) \qquad (4.13)$$

Now, in order to observe the final outcome of updating weights based on this rule, consider the asymptotic equilibrium state, where the weights no longer change with repeated exposure to input patterns.

At equilibrium, the network has already learned the structure of the input space, and no further weight changes are necessary. Thus:

$$\Delta w_{ij} = 0 = \epsilon \sum_t P(y_j \mid t) \, P(x_i \mid t) \, P(t) \; - \; \epsilon \, w_{ij} \sum_t P(y_j \mid t) \, P(t) \qquad (4.14\mathrm{a})$$

Rearranging, we get an expression for the weight value at equilibrium:

$$w_{ij} = \frac{\sum_t P(y_j \mid t) \, P(x_i \mid t) \, P(t)}{\sum_t P(y_j \mid t) \, P(t)} \qquad (4.14\mathrm{b})$$

The numerator of this equation is the definition of the joint probability of the input ($x_i$) and receiving ($y_j$) units both being active together across all the inputs: this is $P(y_j, x_i)$. Likewise, the denominator is the probability of the receiving unit being active across the whole input space: this is $P(y_j)$.

Now, from the definition of joint probability, the final weight is equal to the probability of x conditional on y:

$$w_{ij} = \frac{P(y_j, x_i)}{P(y_j)} = P(x_i \mid y_j) \qquad (4.15)$$

This means that, by using the weight update rule in (4.12), learning of the input space will result in each final weight representing the probability that the input unit is active conditional on the receiving unit also being active. This shows that this update rule achieves the desired CPCA design objective (4.11).

4.5.3 Biological Implementation of CPCA Hebbian Learning

The CPCA learning rule (4.12) has some basic features that are similar to what would be expected from NMDA-mediated LTP/LTD. Thus:
a) when the input and receiving units are both strongly active, the weight value (synaptic strength) tends to increase. This is similar to what occurs in LTP.
b) when the receiving unit is active but the input unit is not, the weight value decreases. This is similar to what occurs in LTD, assuming that the NMDA channels are open and a small amount of calcium influx occurs.
c) when the receiving unit is not active, no weight change occurs. This is similar to the effect of blocking of the NMDA channels by magnesium ions.

Note also that the weights saturate at both extremes:
a) When the weight is large (near 1), further increases in the weight are suppressed, because it becomes less likely that the input activation level will exceed the weight value, and the increase will be smaller when that value is exceeded. Further decreases become more likely, because the input activation level will more often be less than the weight value.
b) When the weight is small (near 0), further increases in the weight become more likely and larger; likewise, further decreases become less likely and smaller.
This pattern is also consistent with experimental studies of LTP/LTD. This type of saturation, where the size of the weight change decreases as the bounds are approached (so the weight approaches its bound exponentially), is called soft weight bounding.
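A few lines of arithmetic make the soft bounding visible (values chosen for illustration):

```python
# Sketch: soft weight bounding under the CPCA rule. With y = 1, the largest
# possible increase is lrate*(1 - w) and the largest decrease is lrate*(0 - w),
# so the available step size shrinks as w approaches either bound.
lrate = 0.1
for w in (0.05, 0.5, 0.95):
    print(f"w={w:.2f}  max increase={lrate * (1 - w):+.3f}  "
          f"max decrease={lrate * (0 - w):+.3f}")
```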

4.6 Exploration of Hebbian Model Learning

We see in this simulation how a single receiving unit can learn different patterns of correlation across the input units depending on their different probabilities of being co-active with the receiving unit. The receiving unit will always be active for present purposes, so the conditional probabilities of the different correlation structures (i.e., right- and left-slanting lines) will depend only on their relative frequencies of occurrence. For example, with p_right set to 0.7 and p_left to 0.3, the weights to the receiving unit from input units that only send the right-slanted pattern will go to 0.7, and those from input units that only send the left-slanted pattern will go to 0.3. (The weight of the one common unit goes to 1 because it is always activated.) The parameter lrate in Leabra corresponds to the learning rate parameter ε in (4.12).

Question 4.1: (a) Increasing lrate from its default value to 0.1 causes the weights to fluctuate with much greater variance. (b) The weight updates are too large, and so they overcompensate for small changes in the probability structure as the weight structure evolves. (c) If the learning rate is too high, the system runs the risk of learning erroneous relations. This risk is greatest when the number of learning events is small. The problem may be mitigated by integrating over a larger number of events with a lower learning rate.

In unconditional PCA, each receiving unit is exposed to the structure of the entire input space. The relative probabilities of occurrence are disregarded. As discussed, this will tend to blur out distinctions between different correlation patterns (features) in the environment. In the next exercise, we explore unconditional PCA. The unconditional property is simulated by removing the distinction between the probabilities of the two input patterns (i.e., setting them both to 0.5).

Question 4.2: (a) Setting p_right to 0.5 causes the weights all to go to 0.5 (except the common one, which goes to 1 as before). (b) This weight pattern suggests the existence of a single "X" feature in the environment, not two separate diagonal line features. There is no way to distinguish the line features, since the weights for this X pattern are all the same. (c) This is similar to the blob solution for natural scene images, because the different input patterns (right- and left-slanting lines) are blurred together into a common pattern.

Question 4.3: (a) Setting p_right to 1 simulates the situation in which the hidden unit is only active when there is a right-leaning line in the input, never when there is a left-leaning one. (b) The weights follow the same pattern: they go to 1 for the right-leaning input units and to zero for the others. (c) This case might be more informative because the unit acts as a feature detector, i.e., it is conditionalized to be active for only one type of input pattern. (d) The architecture and training of the network could be extended by adding an additional feature detector for left-leaning lines in the input, and having each feature detector activated only when its preferred type of input was present. This arrangement would lead to each receiving unit (feature detector) having an input weight pattern corresponding to the feature that it was designed to detect. A readout unit in a higher layer could then determine which environmental feature was present by which feature detector was activated.

Next consider what happens when each feature category is represented by three subtypes rather than one. Set env_type to THREE_LINES. With p_right set to 0.7, the units that exclusively carry right-slanting inputs converge to 0.233 (.7/3) and those carrying left-slanting inputs converge to 0.1 (.3/3), reflecting the relative probabilities of the different subtypes as before. Note, however, that the weights have been diluted as compared to the previous example. Because the difference in weights between the two categories is so much smaller, the ability to distinguish between the categories is much more susceptible to noise fluctuations.

This may be a particularly serious problem during the early phase of learning, when the units are generally not very selective (i.e., they are more vulnerable to fluctuations). It would be desirable for the learning algorithm to do a better job of emphasizing the categorical differences between receiving units. It would also be desirable for the weights to have a dynamic range consistent with their full range of possible excursion, i.e., from 0 to 1, rather than being squeezed into a small range (in this case from .1 to .2333).

4.7 Renormalization and Contrast Enhancement

There are two basic problems that can occur with the CPCA algorithm as it has been developed so far:
1) insufficient dynamic range
2) insufficient selectivity
We will now look at correction factors to remedy these problems:
1) renormalization corrects the weights by accounting for the expected activity level over the input layer: if this is sparse, as is typical, renormalization will work to boost the weights.
2) contrast enhancement increases the separation between strong and weak weights by use of a sigmoid nonlinear function.
These corrections are necessitated by computational concerns, i.e., they help the algorithm perform more efficient learning. As such, they represent quantitative adjustments that do not affect the basic qualitative nature of the learning rule.

4.7.1 Renormalization

Remember our intuitive argument that a conditional probability (of x given y) of 0.5 should correspond to a situation in which the input and receiving units behave in an uncorrelated manner. Note that this argument depends on the assumption that the input unit has a 0.5 probability of being active, i.e., it is equally likely to be on or off. If we consider that input patterns are typically sparse, i.e., have low activity levels, then this assumption is violated, because any given input unit will be infrequently active. If α represents the probability that an input unit is active, then the probability of x conditional on y cannot be greater than α (given that x and y are uncorrelated). For example, if on average x_i is only active 20% of the time, then P(x) = α = 0.2, and P(x|y) cannot be larger than 0.2 if x and y are uncorrelated.

But this violates the intuition that P(x|y) should be 0.5 if x and y are uncorrelated. Also, it makes the ranges for positive and negative correlation unequal. Renormalization is meant to restore the uncorrelated probability to 0.5, making the ranges for positive and negative correlation the same size. To see how, first rewrite the CPCA update:

$$\Delta w_{ij} = \epsilon \, y_j \left( x_i - w_{ij} \right) = \epsilon \, y_j \left[ x_i \left( 1 - w_{ij} \right) + \left( 1 - x_i \right) \left( 0 - w_{ij} \right) \right] \qquad (4.17)$$

Expressed in this way, we see that the first term in the brackets acts to bring small weights up, since (1 - w_ij) is large and positive when w_ij is small. The second term in the brackets acts to bring large weights down, since (0 - w_ij) is large and negative when w_ij is large. Note that since (1 - w_ij) goes to zero as w_ij goes to one, the first term in brackets disappears at w_ij = 1. We can allow this term to be larger by replacing the 1 with a number m, where m > 1. That is:

$$\Delta w_{ij} = \epsilon \, y_j \left[ x_i \left( m - w_{ij} \right) + \left( 1 - x_i \right) \left( 0 - w_{ij} \right) \right] \qquad (4.18)$$

When m > 1, small weights are increased more than when m = 1. If we know α, the probability that the input unit is active, then we can set m = 0.5/α. This will produce larger weight increases when the probability of the input unit being active is smaller than 0.5, and smaller increases when it is larger. At α = 0.5, the increases will be unaffected.

In simulations, we can also control the amount of renormalization by using the parameter q_m (called savg_cor in the simulator):

$$\alpha_m = 0.5 - q_m \left( 0.5 - \alpha \right) \qquad (4.20)$$

and then setting m = 0.5/α_m. As an example, consider what happens if α = .1:

q_m = 0.0: α_m = 0.50, m = 1.00 (no renormalization)
q_m = 0.5: α_m = 0.30, m = 1.67 (some renormalization)
q_m = 1.0: α_m = 0.10, m = 5.00 (maximum renormalization)
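The values in this table follow directly from (4.20); a short sketch reproduces them:

```python
# Sketch reproducing the alpha_m / m values from (4.20) for alpha = 0.1.
alpha = 0.1
for q_m in (0.0, 0.5, 1.0):
    alpha_m = 0.5 - q_m * (0.5 - alpha)  # (4.20)
    m = 0.5 / alpha_m                    # renormalization factor for (4.18)
    print(f"q_m={q_m:.1f}  alpha_m={alpha_m:.2f}  m={m:.2f}")
```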

4.7.2 Contrast Enhancement

The goal of this correction is to make the weights more reflective of strong correlations in the input patterns, and less reflective of weak correlations. This motivation can be justified as a parsimony bias. Contrast enhancement is implemented by a sigmoid nonlinear function.

This function converts the linear weight into an effective weight:

$$\hat{w}_{ij} = \frac{1}{1 + \left( \frac{1 - w_{ij}}{w_{ij}} \right)^{\gamma}} \qquad (4.21)$$

The effect of this nonlinear function is to elevate weights above the threshold and suppress weights below it. The threshold is the midpoint (0.5). The term γ is the weight gain parameter. The nonlinearity collapses to the linear case when γ is 1. As γ increases above 1, the slope of the sigmoid function becomes steeper, giving more contrast between large and small weights.

The offset parameter θ (wt_off in the simulator) is added in order to change the threshold:

$$\hat{w}_{ij} = \frac{1}{1 + \left( \theta \, \frac{1 - w_{ij}}{w_{ij}} \right)^{\gamma}} \qquad (4.23)$$

When θ > 1, the threshold (midpoint) value is shifted to a greater value (as in Fig. 4.12, where θ = 1.25). It is important to distinguish the effect of contrast enhancement of the weights from the effect of the gain parameter on the activation values.

The activation gain changes a receiving unit's sensitivity to its total net input. It can make the unit more or less sensitive to all inputs around its threshold value, based on the total level of activation produced by an input, not the pattern of activation. By contrast, weight contrast enhancement makes units more or less selective for particular patterns of input. That is, weights get pushed up or down according to whether they are above or below the threshold. Increasing the weight contrast enhancement of a unit makes it a more sensitive filter for detecting input patterns.
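A sketch of the contrast-enhancement function (4.23) in code; as a check on the reconstruction, it reproduces the effective-weight values quoted in Question 4.4(b) below for a linear weight of 0.58:

```python
# Sketch of the contrast-enhancement sigmoid (4.23) with gain and offset.
def effective_weight(w, gain=6.0, off=1.0):
    """Map a linear weight w in (0, 1) to its contrast-enhanced value."""
    return 1.0 / (1.0 + (off * (1.0 - w) / w) ** gain)

print(round(effective_weight(0.58, gain=6.0, off=1.00), 2))  # ~0.87
print(round(effective_weight(0.58, gain=6.0, off=1.25), 2))  # ~0.65
print(round(effective_weight(0.50, gain=1.0, off=1.00), 2))  # 0.5 (linear case)
```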

4.7.3 Exploration of Renormalization and Contrast Enhancement in CPCA

Renormalization: The input space consists of 5 horizontal lines, each presented with equal probability. Each line co-occurs with the receiving unit with the same probability as the expected activity level over the input layer (0.2). This situation represents the case where input and receiving units are uncorrelated, because it gives the same level of co-occurrence that would result from activating input units at random, with the overall activity level on the input being 0.2.

We see that the weights all settle in near 0.2. As we discussed, we would like the weights for the uncorrelated case to go to 0.5 rather than 0.2. This is why we need renormalization. To get full renormalization, we set q_m (savg_cor) to 1. This causes the weights to settle close to 0.5.

In some applications, there may be expectations about the ratio of input features to hidden units. In those cases, the value of q_m can be set lower than 1 to prevent this ratio from being larger than expected.

Contrast enhancement sigmoid function: We now train the network to distinguish between left-slanting and right-slanting line categories, where each category has 3 subtypes. With wt_gain = 1 (the linear case), there is mild separation between the weights of the two categories (0.58 vs. 0.25). When we introduce the sigmoid nonlinearity (wt_gain = 6), the separation increases (0.85 vs. 0). That is, only the right lines are represented, and they have strong weights. [Figure: weight patterns with wt_gain = 1 vs. wt_gain = 6]

The value of increasing the weight contrast enhancement is that we can train hidden units to represent just one feature type, even when they may be partially selective for other feature types. In conclusion, contrast enhancement and renormalization work together to determine what a unit will tend to detect and what it will ignore. They are essentially correction factors that adjust the CPCA algorithm to compensate for its limitations of dynamic range and selectivity. They can increase the effectiveness of the CPCA algorithm.

Question 4.4: (a) There are 2 different types of weights. The first type is at the central input unit and its 4 horizontal and vertical neighbors. These have high weights because they are at the intersection of right- and left-slanted input patterns. The second type is at the 8 other input units receiving input from right-slanted inputs. These are lower because they only get input from the right-slanted inputs. With [env_type = THREE_LINES; p_right = 0.7; lrate = 0.005; savg_cor = 1; wt_gain = 6; wt_off = 1], the first type has a value of 1.0, and the second type has a value from 0.85 to 0.90. With [env_type = THREE_LINES; p_right = 0.7; lrate = 0.005; savg_cor = 1; wt_gain = 6; wt_off = 1.25], the first type is unchanged whereas the second type is reduced to a value from 0.63 to 0.67.

(b) The stronger weights (type 1) stay the same. The weaker weights (type 2) are decreased. Setting wt_off to 1.25 shifts the sigmoid curve to the right. For type 1, the effective weight stays the same because the linear weight remains on the upper saturation part of the sigmoid curve. For type 2, the effective weight is reduced because the linear weight is in the linear part of the sigmoid curve. A linear weight value of 0.58 is transformed to an effective weight value of 0.87 when wt_off is 1.0. When wt_off is 1.25, the same linear weight value is transformed to an effective weight value of about 0.65. This is contrast enhancement: high weights remain high while lower weights get lower. (c) A wt_off value of around 2.1 causes the non-central units of the right lines to have weights around 0.1 or less. (d) No, weights of 0.1 do not reflect the correlations in any single input pattern, which are all high. This high threshold value suppresses the representation of correlated inputs, so that only the highest are represented. (e) This representation might be useful for excluding unwanted input correlations, even when they are fairly strong, in favor of a single desired very strong feature.

Question 4.5: (a) With wt_off set to 1 and savg_cor set to 0.7, the weights of the non-central units of the right lines go down as compared to a savg_cor value of 1. (b) They go to the same low value (around .1 or less) that occurred with wt_off equal to 2.1 in the previous question. Why does this happen? In the previous question, we were using a wt_gain value of 6 to achieve contrast enhancement. This magnified the differences in weights around 0.5 (wt_off was set to 1). By now lowering savg_cor to 0.7, the effective activity level (α_m) is larger, making m smaller; that is, we are lessening the degree of renormalization, so the smaller weights remain smaller.

4.8 Self-Organizing Model Learning

Now that we have learned about CPCA, we are ready (finally!) to consider Hebbian learning in a network having multiple receiving units. The learning is now self-organizing, i.e., the receiving units compete with each other via a kWTA (k-winners-take-all) inhibitory function. In terms of CPCA, the conditionalizing function comes from the competition among receiving units: competition imposes conditions on when a unit is active, and thus on when it is allowed to learn.
(1) The probability that a receiving unit wins the competition to have its weights strengthened by any given input pattern depends on the strength of the input it receives from that pattern as compared to that of the other receiving units.
(2) For a receiving unit to win and have its weights strengthened, it must be better tuned to (more selective for) that input pattern than the other receiving units.

(3) To become better tuned, however, the receiving unit must win the competition, i.e., according to CPCA it must be active for its weights to be strengthened.
(4) Therefore, weights will be further strengthened for the most-tuned receiving units and not for less-tuned ones. This is an example of positive feedback: tuning → strengthening → more tuning → more strengthening, etc.
(5) This means that any initial small tunings of the receiving units will tend to be enhanced. (What happens if the initial weight settings are randomized?)

A system with positive feedback always runs the risk of an exaggerated response (as we saw with runaway excitation in the case of bidirectional excitatory connectivity). In self-organizing learning, the risk is that some units may become overloaded in representing input features at the expense of others. This tendency for hogging is countered by the growing selectivity of the receiving units. As their tuning increases with learning, they not only become more responsive to certain input features, but they also become less responsive to others. This tends to cause the selectivity for different features to be spread among the receiving units (as long as there is variation across their initial tunings).
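A highly simplified sketch of this dynamic, loosely inspired by the self_org.proj exploration below (the feature set, the crude top-k approximation of kWTA, and all parameter values are our own):

```python
# Toy self-organizing learning: CPCA updates gated by a crude kWTA competition.
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid, k, lrate = 10, 20, 2, 0.05
features = np.eye(n_in)                   # 10 elementary "line" features
W = rng.uniform(0.4, 0.6, (n_hid, n_in))  # small random initial tunings

for _ in range(30 * 45):                  # ~30 passes through 45 events
    i, j = rng.choice(n_in, size=2, replace=False)
    x = features[i] + features[j]         # each event: two co-active features
    winners = np.argsort(W @ x)[-k:]      # crude kWTA: top-k net inputs win
    W[winners] += lrate * (x - W[winners])  # CPCA update for active units only
print(np.round(W, 2))                     # many rows develop selective tunings
```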

4.8.1 Exploration of Self-Organizing Learning

The network in self_org.proj has a receiving layer with 20 units. Each receiving unit receives projections from all input units. The receiving layer uses the average-based kWTA inhibition function with k = 2. Initially the weights are random. [Note: to reinitialize the weights, select the View/PROCESS_CTRL option from the control panel; then click the NewInit button.]

The input space consists of all 45 pairwise combinations of the vertical and horizontal lines in the 5x5 input grid (10 vert-vert, 10 horiz-horiz, 25 vert-horiz). A training session consists of 30 passes through the 45 events.

With learning, individual receiving units develop selective representations of the line features, while ignoring the random context of the other lines. That is, they become line feature detectors. [Figure: results of 3 different training sessions]

After training, you should see that:
1) each feature (vertical or horizontal line) gets represented by at least 1 receiving unit.
2) some features get represented by more than 1 receiving unit.
3) which unit represents any given feature changes randomly from 1 training session to the next.
4) for any given training session, some units do not become selective for any feature. These are loser or dead units. This reflects the excess capacity that is required to adequately reflect the structure of the input space. Biologically, this excess capacity should be large, to allow a reserve pool of neurons for learning new features.

The ability of the network to develop this selectivity is made possible by the interaction of CPCA learning and inhibitory competition:
1) receiving units that initially have larger random weights for an input pattern win the competition.
2) the weights of these units are then tuned to be more selective for this pattern.
3) these units will then be more likely to respond to that pattern in the future, as well as to other similar patterns (i.e., ones that share one of the line features).
4) small initial differences in the weights for the two lines in the input pattern will cause receiving units to be more likely to respond to one of the line features in the pattern.
5) the small initial weight differences are enhanced with learning, so that receiving units usually become more selective for just one feature. The separation between a unit's stronger and weaker correlations is increased by contrast enhancement.
6) overall, the strongest correlations in the environment (i.e., the line features) will tend to become represented by this process. This is a combinatorial distributed representation: each input pattern is represented by a combination of receiving units.

Unique Pattern Statistic

Each time Run is clicked, a training session of 30 passes through the 45 events is initiated. TRAIN_GRAPH_LOG shows the unique pattern statistic for each of the 30 passes. This statistic records the number of unique hidden unit activity patterns that were produced as a result of probing the network with all 10 different features (i.e., vertical or horizontal lines) presented individually. For perfect performance, the unique pattern statistic is 10, meaning that all 10 features were uniquely represented. Each time Batch is clicked, a batch of 8 training sessions is initiated. BATCH_LOG shows 4 summary statistics after the 8 training sessions:
(a) average unique pattern statistic over the 8 sessions
(b) maximum unique pattern statistic over the 8 sessions
(c) minimum unique pattern statistic over the 8 sessions
(d) count of the number of times that perfect performance (i.e., a perfect 10) occurred in the 8 sessions
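A sketch of how such a statistic could be computed (our code, reusing the toy network and top-k scheme from the earlier sketch):

```python
# Sketch: count the unique hidden activity patterns produced when probing
# with each single feature (a perfect score equals the number of features).
import numpy as np

def unique_pattern_stat(W, features, k=2):
    patterns = set()
    for x in features:
        winners = tuple(sorted(np.argsort(W @ x)[-k:].tolist()))  # kWTA set
        patterns.add(winners)
    return len(patterns)  # 10 = all 10 line features uniquely represented
```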

Parameter Manipulations

Weight gain parameter (γ): Question 4.6: (a) Training the network first with wt_gain = 6 and then with wt_gain = 1 gives the following batch statistics:

wt_gain = 6: avg_unq_pats = 10, min_unq_pats = 10, max_unq_pats = 10, cnt_unq_pats = 8
wt_gain = 1: avg_unq_pats = 9, min_unq_pats = 8, max_unq_pats = 10, cnt_unq_pats = 2

(b) Over the 8 training runs, the minimum number of unique patterns that occurred was 8 out of 10, rather than 10 out of 10. The average number was 9 out of 10, instead of 10 out of 10. The number of training runs on which all 10 features were uniquely represented was 2.

(c) With wt_gain = 1, the sigmoid function reduces to the linear case, and there is weaker separation of large and small weights. This causes some hidden units to represent more than one line feature, and some features not to be uniquely represented. With wt_gain = 6, the sigmoid nonlinearity is in effect, and there is strong separation of large and small weights. The sigmoid function provides contrast enhancement, whereby the strongest weights are selected and the weaker weights are de-selected. This selectivity is important for self-organizing learning because it allows detection of distinct features.

Weight offset parameter (θ): Question 4.7: (a) Lowering wt_off below its default value of 1.25 degrades the unique pattern statistics (avg_unq_pats, min_unq_pats, max_unq_pats, cnt_unq_pats). (b) There was a noticeable change in the weight patterns compared to the default case: the number of runs having all unique representations drops from 8 to 6 to 1, indicating that training is producing more units with non-unique representations. (c) wt_off is the offset of the sigmoid function. It acts like a threshold for weight enhancement. As it is lowered, more and more weights in the mid-range are enhanced. This means that weaker weights that were being suppressed now get enhanced. This makes it more likely that hidden units will come to represent multiple features, and less likely that they will represent only one feature (since they are less exclusive).

(d) This threshold is important for self-organizing learning because it helps determine how selective the hidden units will be. Remember that selectivity is important for establishing a combinatorial distributed representation, one in which the separate features of the input space are uniquely represented by hidden units.

Renormalization parameter (q_m): Changing savg_cor from .5 to 1 results in lower unique pattern statistics. Increasing the amount of renormalization makes the weights increase more rapidly. This causes the units to develop less selective representations of the lines. A lower level of correlation is needed to produce strong weights, and units have a greater tendency to represent multiple features.

Initial mean random weight parameter: Setting wt_mean to 0.5 sets the initial random weight values to 0.5 rather than the default. The tendency for units to form unique representations is now increased, so that, most of the time, all units form unique representations. This effectively eliminates loser units, i.e., every unit now codes for a line. How can we explain this result? Starting off with larger weight values means that we will tend to get larger decreases than increases. Hidden units that were active for a given pattern will then receive less net input for a similar pattern. This is because the weights will have decreased for those input units that were off initially but are now on.

The result is that units that are initially successful at representing input patterns will not have as much of an advantage over those that were not successful. This gives the latter units a chance to catch up. So, all units end up representing unique features. This tendency can be used to counterbalance the hogging tendency, where a few units tend to represent all the features at the expense of the other units.

Learning rate parameter (ε): Question 4.8: (a) Apparently, this tenfold increase in the learning rate has NO noticeable effect on the network's unique pattern statistics. (b) This same value of lrate in Question 4.1 produced fluctuation with great variance, which interfered with the network's learning ability. Here, there is no comparable effect because the learning rate effect is compensated by other mechanisms, e.g., kWTA competition, renormalization, and contrast enhancement.

4.8.2 Summary and Discussion

Hebbian learning by CPCA + kWTA has been shown to be effective in more complex environments with real-world input spaces. One drawback: the complex interdependencies of the hidden layer units make rigorous mathematical analysis difficult.

4.9 Other Approaches to Model Learning

How does CPCA + kWTA compare to other types of Hebbian learning?
a) Does it have the same level of performance?
b) Can it accomplish the same functions?

4.9.1 Algorithms That Use CPCA-Style Hebbian Learning

There are several different Hebbian learning algorithms that use a similar learning rule, e.g., the Kohonen network (Kohonen, 1984). The CPCA + kWTA approach differs primarily in the kWTA activation dynamics, not the learning rule. The production of sparse distributed representations by CPCA + kWTA gives it a combinatorial flexibility that is lacking in the Kohonen network.

4.9.2 Clustering

The competitive learning algorithm of Rumelhart & Zipser (1986) produces a localist representation, i.e., only one unit is active at a time. Competitive learning causes each hidden unit to represent a different cluster (natural grouping) of similar patterns in the input space. Strongly correlated input patterns tend to form such clusters. The clustering metaphor makes sense for representation by single units, i.e., localist representation. It makes less sense for kWTA. However, the k active units under kWTA inhibition may be thought of as representing multiple active clusters simultaneously.

4.9.3 Topography

The Kohonen network is useful for the formation of topographic maps because it is concerned with the neighborhood of activity around the single winner. This causes hidden units to represent similar things as their neighbors. This property may be useful for understanding the formation of topographic maps in the nervous system. A CPCA + kWTA model with lateral excitation can also produce topographic maps (see Chapter 8).

4.9.4 Information Maximization and MDL

Another proposed constraint on Hebbian learning is information maximization (Linsker, 1988), in which models are trained so as to maximize the amount of information extracted from the input patterns. Information maximization is only one of many constraints that should be considered in Hebbian learning. If this constraint is unchecked, it could lead to over-extraction of information, i.e., the production of unparsimonious representations that capture all of the details of the input space.

It is usually more useful to strike a balance between information maximization and parsimony. CPCA + kWTA accomplishes this balance:
a) By having each receiving unit extract the first principal component of the correlation matrix representing its subset of the input space, CPCA maximizes the information received by that unit, because its weights are tuned to the direction of maximum input variation.
b) The kWTA inhibitory competition lowers the overall information capacity of the hidden layer, providing a counterbalance to the information maximization objective. Also, by excluding all other principal components of the correlation matrix, parsimony is enforced. Parsimony is further enforced by the weight contrast enhancement function.

4.9.5 Learning Based Primarily on Hidden Layer Constraints

The Bienenstock, Cooper, & Munro (BCM) algorithm uses different constraints:
1. each hidden unit must be active for the same percentage of time as every other one.
2. the number of hidden units must equal the number of sources (features) in the environment.
The BCM algorithm works well when the feature categories to be learned are uniformly distributed in the environment and when the numbers of hidden units and feature categories are evenly matched. However, these assumptions do not seem very realistic: it is unlikely that the number of hidden units in sensory cortical areas in any way matches the number of sensory feature categories that must be learned, and it does not seem realistic that all feature categories in the sensory environment occur with the same frequency.

Independent components analysis (ICA) is designed to solve the blind source separation problem. It works well when the basic conditions of that problem are satisfied (e.g., the cocktail party situation).
a) Like BCM, ICA requires that the number of hidden units be equal to the number of feature categories (sources) in the environment.
b) ICA also requires that the number of hidden units be equal to the number of input units.
c) ICA learning is based on making hidden units maximally independent of each other, so that what one unit learns is highly dependent on what other units have learned.
CPCA + kWTA attempts to maintain a balance between specialization of individual units and competition between units. The result is that, unlike BCM or ICA, it is much less dependent on the number of hidden units.

4.9.6 Generative Models

Another class of models are called generative (Dayan et al., 1995; Carpenter & Grossberg, 1987) because they are based on the generation of an internal model of the world to accomplish pattern recognition. Representations are learned by an iterative interaction between model synthesis and input analysis. Learning is based on the difference between what is generated and what appears in the input. This is sometimes called recognition by synthesis.
a) Generative models have the advantage of fitting nicely into the Bayesian statistical framework. The generative mode is similar to computing a likelihood: it expresses the probability that the internal model (hypothesis) would have produced the input pattern (data).
b) The disadvantage of a generative model is that it establishes a rigid hierarchy: one layer must be considered as an internal model of the one below it, and each layer uses a different kind of processing. This may limit its usefulness and biological plausibility as compared to bidirectional constraint satisfaction processing (as in Chapter 3).


Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

An Introductory Course in Computational Neuroscience

An Introductory Course in Computational Neuroscience An Introductory Course in Computational Neuroscience Contents Series Foreword Acknowledgments Preface 1 Preliminary Material 1.1. Introduction 1.1.1 The Cell, the Circuit, and the Brain 1.1.2 Physics of

More information

arxiv: v2 [cs.ne] 22 Feb 2013

arxiv: v2 [cs.ne] 22 Feb 2013 Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA

More information

Marr's Theory of the Hippocampus: Part I

Marr's Theory of the Hippocampus: Part I Marr's Theory of the Hippocampus: Part I Computational Models of Neural Systems Lecture 3.3 David S. Touretzky October, 2015 David Marr: 1945-1980 10/05/15 Computational Models of Neural Systems 2 Marr

More information

Forecasting Wind Ramps

Forecasting Wind Ramps Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators

More information

Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II

Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II Gatsby Unit University College London 27 Feb 2017 Outline Part I: Theory of ICA Definition and difference

More information

Neural networks: Unsupervised learning

Neural networks: Unsupervised learning Neural networks: Unsupervised learning 1 Previously The supervised learning paradigm: given example inputs x and target outputs t learning the mapping between them the trained network is supposed to give

More information

7 Rate-Based Recurrent Networks of Threshold Neurons: Basis for Associative Memory

7 Rate-Based Recurrent Networks of Threshold Neurons: Basis for Associative Memory Physics 178/278 - David Kleinfeld - Fall 2005; Revised for Winter 2017 7 Rate-Based Recurrent etworks of Threshold eurons: Basis for Associative Memory 7.1 A recurrent network with threshold elements The

More information

A MEAN FIELD THEORY OF LAYER IV OF VISUAL CORTEX AND ITS APPLICATION TO ARTIFICIAL NEURAL NETWORKS*

A MEAN FIELD THEORY OF LAYER IV OF VISUAL CORTEX AND ITS APPLICATION TO ARTIFICIAL NEURAL NETWORKS* 683 A MEAN FIELD THEORY OF LAYER IV OF VISUAL CORTEX AND ITS APPLICATION TO ARTIFICIAL NEURAL NETWORKS* Christopher L. Scofield Center for Neural Science and Physics Department Brown University Providence,

More information

Unsupervised Discovery of Nonlinear Structure Using Contrastive Backpropagation

Unsupervised Discovery of Nonlinear Structure Using Contrastive Backpropagation Cognitive Science 30 (2006) 725 731 Copyright 2006 Cognitive Science Society, Inc. All rights reserved. Unsupervised Discovery of Nonlinear Structure Using Contrastive Backpropagation Geoffrey Hinton,

More information

Balance of Electric and Diffusion Forces

Balance of Electric and Diffusion Forces Balance of Electric and Diffusion Forces Ions flow into and out of the neuron under the forces of electricity and concentration gradients (diffusion). The net result is a electric potential difference

More information

The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural

The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural 1 2 The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural networks. First we will look at the algorithm itself

More information

Natural Image Statistics

Natural Image Statistics Natural Image Statistics A probabilistic approach to modelling early visual processing in the cortex Dept of Computer Science Early visual processing LGN V1 retina From the eye to the primary visual cortex

More information

7 Recurrent Networks of Threshold (Binary) Neurons: Basis for Associative Memory

7 Recurrent Networks of Threshold (Binary) Neurons: Basis for Associative Memory Physics 178/278 - David Kleinfeld - Winter 2019 7 Recurrent etworks of Threshold (Binary) eurons: Basis for Associative Memory 7.1 The network The basic challenge in associative networks, also referred

More information

Efficient Coding. Odelia Schwartz 2017

Efficient Coding. Odelia Schwartz 2017 Efficient Coding Odelia Schwartz 2017 1 Levels of modeling Descriptive (what) Mechanistic (how) Interpretive (why) 2 Levels of modeling Fitting a receptive field model to experimental data (e.g., using

More information

Covariance and Correlation Matrix

Covariance and Correlation Matrix Covariance and Correlation Matrix Given sample {x n } N 1, where x Rd, x n = x 1n x 2n. x dn sample mean x = 1 N N n=1 x n, and entries of sample mean are x i = 1 N N n=1 x in sample covariance matrix

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

CIFAR Lectures: Non-Gaussian statistics and natural images

CIFAR Lectures: Non-Gaussian statistics and natural images CIFAR Lectures: Non-Gaussian statistics and natural images Dept of Computer Science University of Helsinki, Finland Outline Part I: Theory of ICA Definition and difference to PCA Importance of non-gaussianity

More information

Machine Learning, Midterm Exam

Machine Learning, Midterm Exam 10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Backpropagation Neural Net

Backpropagation Neural Net Backpropagation Neural Net As is the case with most neural networks, the aim of Backpropagation is to train the net to achieve a balance between the ability to respond correctly to the input patterns that

More information

Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

More information

TWO METHODS FOR ESTIMATING OVERCOMPLETE INDEPENDENT COMPONENT BASES. Mika Inki and Aapo Hyvärinen

TWO METHODS FOR ESTIMATING OVERCOMPLETE INDEPENDENT COMPONENT BASES. Mika Inki and Aapo Hyvärinen TWO METHODS FOR ESTIMATING OVERCOMPLETE INDEPENDENT COMPONENT BASES Mika Inki and Aapo Hyvärinen Neural Networks Research Centre Helsinki University of Technology P.O. Box 54, FIN-215 HUT, Finland ABSTRACT

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Using Variable Threshold to Increase Capacity in a Feedback Neural Network

Using Variable Threshold to Increase Capacity in a Feedback Neural Network Using Variable Threshold to Increase Capacity in a Feedback Neural Network Praveen Kuruvada Abstract: The article presents new results on the use of variable thresholds to increase the capacity of a feedback

More information

Probabilistic Models in Theoretical Neuroscience

Probabilistic Models in Theoretical Neuroscience Probabilistic Models in Theoretical Neuroscience visible unit Boltzmann machine semi-restricted Boltzmann machine restricted Boltzmann machine hidden unit Neural models of probabilistic sampling: introduction

More information

Lecture 4: Feed Forward Neural Networks

Lecture 4: Feed Forward Neural Networks Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training

More information

Tuning tuning curves. So far: Receptive fields Representation of stimuli Population vectors. Today: Contrast enhancment, cortical processing

Tuning tuning curves. So far: Receptive fields Representation of stimuli Population vectors. Today: Contrast enhancment, cortical processing Tuning tuning curves So far: Receptive fields Representation of stimuli Population vectors Today: Contrast enhancment, cortical processing Firing frequency N 3 s max (N 1 ) = 40 o N4 N 1 N N 5 2 s max

More information

Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics Processing of Time Series by Neural Circuits with iologically Realistic Synaptic Dynamics Thomas Natschläger & Wolfgang Maass Institute for Theoretical Computer Science Technische Universität Graz, ustria

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required.

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In humans, association is known to be a prominent feature of memory.

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Consider the following spike trains from two different neurons N1 and N2:

Consider the following spike trains from two different neurons N1 and N2: About synchrony and oscillations So far, our discussions have assumed that we are either observing a single neuron at a, or that neurons fire independent of each other. This assumption may be correct in

More information

CHAPTER 3. Pattern Association. Neural Networks

CHAPTER 3. Pattern Association. Neural Networks CHAPTER 3 Pattern Association Neural Networks Pattern Association learning is the process of forming associations between related patterns. The patterns we associate together may be of the same type or

More information

Does the Wake-sleep Algorithm Produce Good Density Estimators?

Does the Wake-sleep Algorithm Produce Good Density Estimators? Does the Wake-sleep Algorithm Produce Good Density Estimators? Brendan J. Frey, Geoffrey E. Hinton Peter Dayan Department of Computer Science Department of Brain and Cognitive Sciences University of Toronto

More information

Neural Network Based Response Surface Methods a Comparative Study

Neural Network Based Response Surface Methods a Comparative Study . LS-DYNA Anwenderforum, Ulm Robustheit / Optimierung II Neural Network Based Response Surface Methods a Comparative Study Wolfram Beyer, Martin Liebscher, Michael Beer, Wolfgang Graf TU Dresden, Germany

More information

fraction dt (0 < dt 1) from its present value to the goal net value: Net y (s) = Net y (s-1) + dt (GoalNet y (s) - Net y (s-1)) (2)

fraction dt (0 < dt 1) from its present value to the goal net value: Net y (s) = Net y (s-1) + dt (GoalNet y (s) - Net y (s-1)) (2) The Robustness of Relaxation Rates in Constraint Satisfaction Networks D. Randall Wilson Dan Ventura Brian Moncur fonix corporation WilsonR@fonix.com Tony R. Martinez Computer Science Department Brigham

More information

TIME-SEQUENTIAL SELF-ORGANIZATION OF HIERARCHICAL NEURAL NETWORKS. Ronald H. Silverman Cornell University Medical College, New York, NY 10021

TIME-SEQUENTIAL SELF-ORGANIZATION OF HIERARCHICAL NEURAL NETWORKS. Ronald H. Silverman Cornell University Medical College, New York, NY 10021 709 TIME-SEQUENTIAL SELF-ORGANIZATION OF HIERARCHICAL NEURAL NETWORKS Ronald H. Silverman Cornell University Medical College, New York, NY 10021 Andrew S. Noetzel polytechnic University, Brooklyn, NY 11201

More information

THE retina in general consists of three layers: photoreceptors

THE retina in general consists of three layers: photoreceptors CS229 MACHINE LEARNING, STANFORD UNIVERSITY, DECEMBER 2016 1 Models of Neuron Coding in Retinal Ganglion Cells and Clustering by Receptive Field Kevin Fegelis, SUID: 005996192, Claire Hebert, SUID: 006122438,

More information

Neural Networks. Fundamentals Framework for distributed processing Network topologies Training of ANN s Notation Perceptron Back Propagation

Neural Networks. Fundamentals Framework for distributed processing Network topologies Training of ANN s Notation Perceptron Back Propagation Neural Networks Fundamentals Framework for distributed processing Network topologies Training of ANN s Notation Perceptron Back Propagation Neural Networks Historical Perspective A first wave of interest

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

A Three-dimensional Physiologically Realistic Model of the Retina

A Three-dimensional Physiologically Realistic Model of the Retina A Three-dimensional Physiologically Realistic Model of the Retina Michael Tadross, Cameron Whitehouse, Melissa Hornstein, Vicky Eng and Evangelia Micheli-Tzanakou Department of Biomedical Engineering 617

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Left over from units..

Left over from units.. Left over from units.. Hodgkin-Huxleymodel Inet = g Na m 3 h(vm E Na )+g k n 4 (Vm E k )+(Vm E l ) m, h, n: voltage gating variables with their own dynamics that determine when channels open and close

More information

Neural Networks and Fuzzy Logic Rajendra Dept.of CSE ASCET

Neural Networks and Fuzzy Logic Rajendra Dept.of CSE ASCET Unit-. Definition Neural network is a massively parallel distributed processing system, made of highly inter-connected neural computing elements that have the ability to learn and thereby acquire knowledge

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Using a Hopfield Network: A Nuts and Bolts Approach

Using a Hopfield Network: A Nuts and Bolts Approach Using a Hopfield Network: A Nuts and Bolts Approach November 4, 2013 Gershon Wolfe, Ph.D. Hopfield Model as Applied to Classification Hopfield network Training the network Updating nodes Sequencing of

More information

Analysis of Interest Rate Curves Clustering Using Self-Organising Maps

Analysis of Interest Rate Curves Clustering Using Self-Organising Maps Analysis of Interest Rate Curves Clustering Using Self-Organising Maps M. Kanevski (1), V. Timonin (1), A. Pozdnoukhov(1), M. Maignan (1,2) (1) Institute of Geomatics and Analysis of Risk (IGAR), University

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Learning and Memory in Neural Networks

Learning and Memory in Neural Networks Learning and Memory in Neural Networks Guy Billings, Neuroinformatics Doctoral Training Centre, The School of Informatics, The University of Edinburgh, UK. Neural networks consist of computational units

More information

Fundamentals of Computational Neuroscience 2e

Fundamentals of Computational Neuroscience 2e Fundamentals of Computational Neuroscience 2e January 1, 2010 Chapter 10: The cognitive brain Hierarchical maps and attentive vision A. Ventral visual pathway B. Layered cortical maps Receptive field size

More information

Keywords- Source coding, Huffman encoding, Artificial neural network, Multilayer perceptron, Backpropagation algorithm

Keywords- Source coding, Huffman encoding, Artificial neural network, Multilayer perceptron, Backpropagation algorithm Volume 4, Issue 5, May 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Huffman Encoding

More information

Content-Addressable Memory Associative Memory Lernmatrix Association Heteroassociation Learning Retrieval Reliability of the answer

Content-Addressable Memory Associative Memory Lernmatrix Association Heteroassociation Learning Retrieval Reliability of the answer Associative Memory Content-Addressable Memory Associative Memory Lernmatrix Association Heteroassociation Learning Retrieval Reliability of the answer Storage Analysis Sparse Coding Implementation on a

More information

ADAPTIVE LATERAL INHIBITION FOR NON-NEGATIVE ICA. Mark Plumbley

ADAPTIVE LATERAL INHIBITION FOR NON-NEGATIVE ICA. Mark Plumbley Submitteed to the International Conference on Independent Component Analysis and Blind Signal Separation (ICA2) ADAPTIVE LATERAL INHIBITION FOR NON-NEGATIVE ICA Mark Plumbley Audio & Music Lab Department

More information

General idea. Firms can use competition between agents for. We mainly focus on incentives. 1 incentive and. 2 selection purposes 3 / 101

General idea. Firms can use competition between agents for. We mainly focus on incentives. 1 incentive and. 2 selection purposes 3 / 101 3 Tournaments 3.1 Motivation General idea Firms can use competition between agents for 1 incentive and 2 selection purposes We mainly focus on incentives 3 / 101 Main characteristics Agents fulll similar

More information

Undirected graphical models

Undirected graphical models Undirected graphical models Semantics of probabilistic models over undirected graphs Parameters of undirected models Example applications COMP-652 and ECSE-608, February 16, 2017 1 Undirected graphical

More information

Analysis of an Attractor Neural Network s Response to Conflicting External Inputs

Analysis of an Attractor Neural Network s Response to Conflicting External Inputs Journal of Mathematical Neuroscience (2018) 8:6 https://doi.org/10.1186/s13408-018-0061-0 RESEARCH OpenAccess Analysis of an Attractor Neural Network s Response to Conflicting External Inputs Kathryn Hedrick

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

Hidden Markov Models Part 1: Introduction

Hidden Markov Models Part 1: Introduction Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that

More information

COGS Q250 Fall Homework 7: Learning in Neural Networks Due: 9:00am, Friday 2nd November.

COGS Q250 Fall Homework 7: Learning in Neural Networks Due: 9:00am, Friday 2nd November. COGS Q250 Fall 2012 Homework 7: Learning in Neural Networks Due: 9:00am, Friday 2nd November. For the first two questions of the homework you will need to understand the learning algorithm using the delta

More information

Supervised Learning in Neural Networks

Supervised Learning in Neural Networks The Norwegian University of Science and Technology (NTNU Trondheim, Norway keithd@idi.ntnu.no March 7, 2011 Supervised Learning Constant feedback from an instructor, indicating not only right/wrong, but

More information

Artificial Neural Network

Artificial Neural Network Artificial Neural Network Contents 2 What is ANN? Biological Neuron Structure of Neuron Types of Neuron Models of Neuron Analogy with human NN Perceptron OCR Multilayer Neural Network Back propagation

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

Statistics Random Variables

Statistics Random Variables 1 Statistics Statistics are used in a variety of ways in neuroscience. Perhaps the most familiar example is trying to decide whether some experimental results are reliable, using tests such as the t-test.

More information

Introduction to Artificial Neural Networks

Introduction to Artificial Neural Networks Facultés Universitaires Notre-Dame de la Paix 27 March 2007 Outline 1 Introduction 2 Fundamentals Biological neuron Artificial neuron Artificial Neural Network Outline 3 Single-layer ANN Perceptron Adaline

More information

Nervous Systems: Neuron Structure and Function

Nervous Systems: Neuron Structure and Function Nervous Systems: Neuron Structure and Function Integration An animal needs to function like a coherent organism, not like a loose collection of cells. Integration = refers to processes such as summation

More information

Simple Neural Nets For Pattern Classification

Simple Neural Nets For Pattern Classification CHAPTER 2 Simple Neural Nets For Pattern Classification Neural Networks General Discussion One of the simplest tasks that neural nets can be trained to perform is pattern classification. In pattern classification

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

Instituto Tecnológico y de Estudios Superiores de Occidente Departamento de Electrónica, Sistemas e Informática. Introductory Notes on Neural Networks

Instituto Tecnológico y de Estudios Superiores de Occidente Departamento de Electrónica, Sistemas e Informática. Introductory Notes on Neural Networks Introductory Notes on Neural Networs Dr. José Ernesto Rayas Sánche April Introductory Notes on Neural Networs Dr. José Ernesto Rayas Sánche BIOLOGICAL NEURAL NETWORKS The brain can be seen as a highly

More information

CHAPTER 4 THE COMMON FACTOR MODEL IN THE SAMPLE. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum

CHAPTER 4 THE COMMON FACTOR MODEL IN THE SAMPLE. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum CHAPTER 4 THE COMMON FACTOR MODEL IN THE SAMPLE From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum 1997 65 CHAPTER 4 THE COMMON FACTOR MODEL IN THE SAMPLE 4.0. Introduction In Chapter

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Using Kernel PCA for Initialisation of Variational Bayesian Nonlinear Blind Source Separation Method

Using Kernel PCA for Initialisation of Variational Bayesian Nonlinear Blind Source Separation Method Using Kernel PCA for Initialisation of Variational Bayesian Nonlinear Blind Source Separation Method Antti Honkela 1, Stefan Harmeling 2, Leo Lundqvist 1, and Harri Valpola 1 1 Helsinki University of Technology,

More information

DEVS Simulation of Spiking Neural Networks

DEVS Simulation of Spiking Neural Networks DEVS Simulation of Spiking Neural Networks Rene Mayrhofer, Michael Affenzeller, Herbert Prähofer, Gerhard Höfer, Alexander Fried Institute of Systems Science Systems Theory and Information Technology Johannes

More information