FEBAM: A Feature-Extracting Bidirectional Associative Memory


Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August 12-17, 2007

FEBAM: A Feature-Extracting Bidirectional Associative Memory

Sylvain Chartier, Gyslain Giguère, Patrice Renaud, Jean-Marc Lina & Robert Proulx

Sylvain Chartier, School of Psychology, University of Ottawa (Ottawa, Canada), and Institut Philippe-Pinel de Montréal (Montréal, Canada) (chartier.sylvain@gmail.com). Gyslain Giguère, Département d'informatique, Université du Québec à Montréal (Montréal, Canada) (giguere.gyslain@uqam.ca). Patrice Renaud, Département de psychologie, Université du Québec en Outaouais (Gatineau, Canada) (patrice.renaud@uqo.ca). Jean-Marc Lina, Département de génie électrique, École de Technologie Supérieure (Montréal, Canada) (jean-marc.lina@etsmtl.ca). Robert Proulx, Département de psychologie, Université du Québec à Montréal (Montréal, Canada) (proulx.robert@uqam.ca).

Abstract — In this paper, a new model that can ultimately create its own set of perceptual features is proposed. Using a bidirectional associative memory (BAM)-inspired architecture, the resulting model inherits properties such as attractor-like behavior and successful processing of noisy inputs, while being able to achieve principal component analysis (PCA) tasks such as feature extraction and dimensionality reduction. The model is tested by simulating image reconstruction and blind source separation tasks. Simulations show that the model fares particularly well compared to current neural PCA and independent component analysis (ICA) algorithms. It is argued that the model possesses more cognitive explanative power than any other nonlinear/linear PCA and ICA algorithm.

I. INTRODUCTION

In everyday life, humans are constantly exposed to situations in which they must discriminate, identify, classify and categorize perceptual patterns (such as visual objects) in order to act upon the identity and properties of the encountered stimuli. To achieve this task, cognitive scientists have historically taken for granted that the human cognitive system uses sets of properties (or dimensions) which crisply or fuzzily define a particular object or category [1]. These properties are thought to strictly contain information that is invariant and relevant to future decision-making situations. These characteristics of human properties tend to accelerate cognitive processes by reducing the amount of information to be taken into account in a given situation.

While most researchers view properties and dimensions as symbolic (or logically based), some have argued for perceptual symbol or feature systems, in which properties would be defined in an abstract iconic fashion and created in major part by associative bottom-up processes [2, 3]. To achieve this perceptual feature creation scheme, humans are sometimes hypothesized to implicitly use an information reduction strategy. Unfortunately, while the idea of a system creating its own atomic perceptual representations has been deemed crucial to understanding many cognitive processes [3], few efforts have been made to model perceptual feature creation processes in humans.

When faced with the task of modeling this strategy, one statistical solution is to use a principal component analysis (PCA) approach [4]. PCA neural networks [5] can deal quite effectively with input variability by creating (or "extracting") statistical low-level features that represent the data's intrinsic information to a certain degree. The features are selected in such a way that the input data's dimensionality is reduced while preserving important information.
While widely used for still and moving picture compression, as well as many other computer science and engineering applications, PCA networks have not made a breakthrough in the world of cognitive modeling, even though they possess many interesting properties: they use psychologically plausible local rules, allow flexible architecture growth, can deal with noisy inputs and can extract multiple components in parallel. This is probably because cognitive science researchers are looking for solutions to apply to multiple perceptual/cognitive tasks without assuming a high number of different modules (needing ad hoc modifications of the rules and architecture) or adding multiple fitting parameters. This can hardly be done with PCA networks, since this class of neural network models is feedforward by definition, and therefore does not lead to attractor-like (or "categorical") behavior. Also, the absence of explicitly depicted connections in the model forces supervised learning to be done using an external teacher. Moreover, some PCA models cannot even perform online or local learning [5].

A class of neural networks known as recurrent autoassociative memory (RAM) models does respond to these last requirements. In psychology, AI and engineering, autoassociative memory models are widely used to store correlated patterns. One characteristic of RAMs is the use of a feedback loop, which allows for generalization to new patterns, noise filtering and pattern completion, among other uses [6]. Feedback enables a given network to shift progressively from an initial pattern towards an invariable state (namely, an attractor). Because the information is distributed among the units, the network's attractors are global. If the model is properly trained, the attractors should then correspond to the learned patterns [6].

A direct generalization of RAM models is the development of bidirectional associative memory (BAM) models [7]. BAMs can associate any two data vectors of equal or different lengths (representing, for example, a visual input and a prototype or category). These networks possess the advantage of being both autoassociative and heteroassociative memories [7], and therefore encompass both unsupervised and supervised learning. A problem with BAM models is that, contrarily to what is found in real-world situations, they store information using noise-free versions of the input patterns. In comparison, to overcome the possibly infinite number of stimuli stemming, for instance, from multiple perceptual events, humans must regroup these unique noisy inputs into a finite number of stimuli or categories. This type of learning in the presence of noise should therefore be taken into account by BAM models.

To sum up, on one hand, PCA models can deal with noise but do not have the properties of attractor networks. On the other hand, BAM models exhibit attractor-like behavior but have difficulty dealing with noise. In this paper, it will be shown that unifying both classes of networks under the same framework leads to a model that possesses a clear advantage in explanatory power for memory and cognition modeling over its predecessors. Precisely, this paper will show that a BAM-type architecture can be made to encompass some of the PCA networks' properties. The resulting model will thus exhibit attractor-like behavior, supervised and unsupervised learning, feature extraction, and other nonlinear PCA properties such as blind source separation (BSS) capacities.

II. THEORETICAL BACKGROUND

A. Principal component analysis neural nets

PCA neural networks are based on a linear output function defined by:

y = Wx    (1)

where x is the original input vector, W is the stabilized weight matrix, and y is the output. When the weight matrix maximizes preservation of the relevant dimensions, then the reverse operation:

x̂ = W^T y    (2)

should produce a result that is close to the original input. Thus, the norm of the difference vector x - x̂ should be minimized. This translates into finding a W which minimizes:

E = Σ_{i=1}^{N} ||x_i - x̂_i||^2 = Σ_{i=1}^{N} ||x_i - W^T W x_i||^2    (3)

where E is the error function to minimize and N is the input's dimensionality. Several solutions exist for Equation 3 (for a review of most of the methods, see [5]). Generalization to nonlinear PCA [8] is straightforward; in this case, a nonlinear output function is used:

y = f(Wx)    (4)

The corresponding minimization function is thus given by:

E = Σ_{i=1}^{N} ||x_i - W^T f(W x_i)||^2    (5)

Depending on the type of nonlinear output function, different types of optimization can be achieved. If the output function uses cubic or higher nonlinearity, then the network directly relates to independent component analysis (ICA) and other BSS algorithms [8, 9].

B. Bidirectional associative memories

In Kosko's standard BAM [7], learning is accomplished using a simple Hebbian rule, according to the following equation:

W = YX^T    (6)

In this expression, X and Y are matrices representing the sets of bipolar pairs to be associated, and W is the weight matrix which, in fact, corresponds to a correlation measure between the input and the output. The model uses a recurrent nonlinear output function in order to allow the network to filter out the different patterns during recall. Kosko's BAM uses a signum output function to recall noisy patterns [7].
The nonlinear output function typically used in BAM networks is expressed by the following equations:

y(t+1) = sgn(W x(t))    (7)
x(t+1) = sgn(W^T y(t))    (8)

where y(t) and x(t) represent the vectors to be associated at time t, and sgn is the signum function defined by:

sgn(z) = 1 if z > 0; 0 if z = 0; -1 if z < 0    (9)

Many enhancements to Kosko's original BAM have since been proposed over the years; these allow, for example, learning convergence, online learning and progressive recall to occur. Moreover, recent works have enabled BAMs to deal with multi-valued stimuli [10-14] in addition to bipolar (binary) stimuli. These developments directly increase the BAM's modeling capacities and range of applications.
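As a concrete illustration of the two building blocks reviewed in this section, the short NumPy sketch below computes the linear PCA reconstruction error of Equations 1-3 for a random data set, and then performs Kosko-style Hebbian storage and signum recall as in Equations 6-9. All sizes, the random data and the number of recall cycles are arbitrary choices made for the example; they are not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # --- Linear PCA reconstruction error (Equations 1-3) ---
    X = rng.standard_normal((200, 10))                  # 200 input vectors of dimension 10 (arbitrary)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)    # principal directions of the data
    W = Vt[:3]                                          # keep 3 components: a 3 x 10 weight matrix
    Y = X @ W.T                                         # y = Wx for every input (Eq. 1)
    X_hat = Y @ W                                       # x_hat = W^T y (Eq. 2)
    E = np.sum(np.linalg.norm(X - X_hat, axis=1) ** 2)  # reconstruction error (Eq. 3)
    print("PCA reconstruction error:", round(E, 2))

    # --- Kosko-style BAM: Hebbian storage and signum recall (Equations 6-9) ---
    Xb = np.where(rng.standard_normal((4, 16)) >= 0, 1.0, -1.0)  # 4 bipolar input patterns (rows)
    Yb = np.where(rng.standard_normal((4, 8)) >= 0, 1.0, -1.0)   # 4 bipolar associated patterns
    Wb = Yb.T @ Xb                                      # W = Y X^T summed over all pairs (Eq. 6)

    x = Xb[0].copy()                                    # start from the first stored pattern...
    flip = rng.choice(16, size=3, replace=False)
    x[flip] *= -1                                       # ...with 3 components flipped (noise)
    for _ in range(5):                                  # a few bidirectional recall cycles
        y = np.sign(Wb @ x)                             # y(t+1) = sgn(W x(t))   (Eqs. 7 and 9)
        x = np.sign(Wb.T @ y)                           # x(t+1) = sgn(W^T y(t)) (Eqs. 8 and 9)
    print("Recalled y equals the stored y pattern:", bool(np.array_equal(y, Yb[0])))

Depending on the random draw, recall can succeed exactly or be degraded by crosstalk between the stored pairs.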

III. MODEL DESCRIPTION

Like any artificial neural network, the Feature-Extracting Bidirectional Associative Memory (FEBAM) can be entirely described by its architecture and by its output and learning functions.

A. Architecture

FEBAM's architecture, which is nearly identical to that of the BAM proposed by [10], is illustrated in Figure 1. It consists of two Hopfield-like neural networks interconnected in head-to-toe fashion [15]. When connected, these networks allow a recurrent flow of information that is processed bidirectionally. As shown in Figure 1, the W layer returns information to the V layer and vice versa, in a kind of top-down/bottom-up process fashion. As in a standard BAM, both layers serve as a teacher for the other layer, and the connections are explicitly depicted in the model (as opposed to multi-layer perceptrons). However, to enable a BAM to perform PCA, one set of those explicit connections must be removed.

Fig. 1. FEBAM network architecture.

In a standard BAM, the external links given by the initial y(0) connections would also be available to the W layer. Here, the external links are provided solely through the initial x(0) connections on the V layer. Thus, in contrast with the standard architecture [10], y(0), the initial output, is not obtained externally, but is instead acquired by iterating once through the network, as illustrated in Figure 2.

Fig. 2. Output iterative process used for learning updates.

In the context of feature extraction, the W layer will be used to extract the features, whose dimensionality will be determined by the number of y units; the more units in that layer, the less compression will occur. If the y layer's dimensionality is greater than or equal to that of the x layer, then no compression occurs, and the network can behave as an autoassociative memory [16].

B. Output function

The output function is expressed by the following equations:

y_i(t+1) = 1, if [Wx(t)]_i > 1
         = -1, if [Wx(t)]_i < -1
         = (δ + 1)[Wx(t)]_i - δ[Wx(t)]_i^3, otherwise,   i = 1, ..., N    (10)

x_i(t+1) = 1, if [Vy(t)]_i > 1
         = -1, if [Vy(t)]_i < -1
         = (δ + 1)[Vy(t)]_i - δ[Vy(t)]_i^3, otherwise,   i = 1, ..., M    (11)

where N and M are the numbers of units in each layer, i is the index of the respective vector element, y(t+1) and x(t+1) represent the network outputs at time t+1, and δ is a general output parameter that should be fixed at a value lower than 0.5 to ensure fixed-point behavior [16]. Figure 3 illustrates the shape of the output function when δ = 0.4. This output function possesses the advantage of exhibiting continuous-valued (gray-level) attractor behavior (for a detailed example, see [10]). Such properties contrast with networks using a standard nonlinear output function, which can only exhibit bipolar attractor behavior (e.g., [7]).

Fig. 3. Output function for δ = 0.4.

C. Learning function

Learning is based on time-difference Hebbian association [10, 16-19], and is formally expressed by the following equations:

W(k+1) = W(k) + η (y(0) - y(t))(x(0) + x(t))^T    (12)
V(k+1) = V(k) + η (x(0) - x(t))(y(0) + y(t))^T    (13)

where η represents a learning parameter; y(0) and x(0), the initial patterns at t = 0; y(t) and x(t), the state vectors after t iterations through the network; and k, the learning trial. The learning rule is thus very simple, and can be shown to constitute a generalization of Hebbian/anti-Hebbian correlation in its autoassociative memory version [10, 16].
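To make the preceding definitions concrete, the following NumPy sketch (an illustration under assumed settings, not code from the paper) runs one FEBAM learning update: a single x(0) → y(0) → x(1) → y(1) cycle through the output function of Equations 10 and 11, followed by the weight updates of Equations 12 and 13 with t = 1. The layer sizes, parameter values and random input are illustrative assumptions only.

    import numpy as np

    def output_function(a, delta):
        # Equations 10-11: saturate at +/-1, cubic transmission in between
        cubic = (delta + 1.0) * a - delta * a ** 3
        return np.where(a > 1.0, 1.0, np.where(a < -1.0, -1.0, cubic))

    rng = np.random.default_rng(0)

    n_x, n_y = 25, 8              # x-layer and y-layer sizes (illustrative choice)
    eta, delta = 0.01, 0.1        # learning and output parameters (illustrative; this eta lies
                                  # below the bound of Eq. 14: 1 / (2 * (1 - 2*0.1) * 25) = 0.025)

    W = rng.uniform(-1.0, 1.0, (n_y, n_x))   # weights feeding the y layer
    V = rng.uniform(-1.0, 1.0, (n_x, n_y))   # weights feeding the x layer

    x0 = rng.uniform(-1.0, 1.0, n_x)         # one input pattern, x(0)

    # One cycle through the network (Figure 2): x(0) -> y(0) -> x(1) -> y(1)
    y0 = output_function(W @ x0, delta)      # y(0), obtained by iterating once (Eq. 10)
    x1 = output_function(V @ y0, delta)      # x(1), the reconstructed input (Eq. 11)
    y1 = output_function(W @ x1, delta)      # y(1), the recalled feature vector (Eq. 10)

    # Time-difference Hebbian updates (Equations 12 and 13)
    W += eta * np.outer(y0 - y1, x0 + x1)
    V += eta * np.outer(x0 - x1, y0 + y1)

In a full simulation, this cycle-and-update step would simply be repeated over randomly selected input vectors, which is the general learning procedure used in the simulations of Section IV.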

For weight convergence to occur, η must be set according to the following condition [16]:

η < 1 / (2(1 - 2δ) Max[N, M])    (14)

Equations 12 and 13 show that the weights can only converge when feedback is identical to the initial inputs (that is, y(t) = y(0) and x(t) = x(0)). The function therefore correlates directly with network outputs, instead of activations. As a result, the learning rule is dynamically linked to the network's output (unlike most BAMs). In this paper, the number of iterations (or network cycles) performed before the weights are updated is set to t = 1. Consequently, FEBAM can be seen as a type of nonlinear PCA which tries to minimize two cost functions, Equations 15 and 16, expressed in terms of the initial and recalled states of the y and x layers at times t = 0 and t = 1 (see [10] for details). Equations 15 and 16 are respectively minimized by learning rules 12 and 13, and these minimizations are done in parallel. For example, since one cycle through the network gives x_i(1) = f[V f[W x_i(0)]], expanding the recalled state in Equation 16 leads to an expression (Equation 17) containing the difference x_i(0) - f[V f[W x_i(0)]], which is closely related to Equation 5, the nonlinear PCA minimization function.

IV. SIMULATIONS

To test FEBAM's ability to perform like a nonlinear PCA, we simulated a classic PCA task (image reconstruction), as well as a classic ICA task (blind source separation of mixed signals).

A. Image reconstruction

The first simulation consisted in extracting a limited amount of statistical features from a natural image, and then reconstructing it to evaluate the amount of information loss. Then, using the same features, the network's generalization capacities are tested by having it reconstruct a different image with other statistical properties.

1) Methodology: The image used for learning is illustrated in Figure 4a (left side). This image has a dimension of 128 x 128 pixels. Each pixel is encoded according to an 8-bit grayscale (0 to 255). First, the image values were rescaled to obtain values in the range [-1, 1]. Using an overlapping 5 x 5 window, the image was convolved into 15,376 input vectors of 25 dimensions. Weights were randomly initialized with values ranging from -1 to 1. The learning parameter η and the output parameter δ were held at fixed values throughout the simulation. Learning followed this general procedure:

1- Random selection of an input vector;
2- Iteration through the network (one cycle, as illustrated in Figure 2) using the output function (Equations 10 and 11);
3- Weight updates (Equations 12 and 13);
4- Repetition of 1 to 3 until the desired number of learning trials k has been completed or the squared error between y(0) and y(t) is sufficiently small.

Fig. 4. (a) Original images used for learning and generalization purposes. (b-e) Simulation results. For each model, the reconstructed images following learning and generalization are shown, as well as the respective peak signal-to-noise ratio (PSNR, in dB) for each simulation. Results for: (b) APEX [5], a linear PCA neural network; (c) a nonlinear PCA network [8]; (d) fastICA [9], an ICA algorithm; (e) FEBAM, a nonlinear PCA-BAM network.

FEBAM's performance was compared to that of several algorithms, namely neural linear PCA [5], neural nonlinear PCA [8], and fastICA [9]. All algorithms were also tested using the generalization image, with other statistical properties, illustrated in Figure 4a (right side). In this situation, the new image's 5 x 5 pixel windows were presented to the network one at a time. Those resulting outputs were then used to produce the compressed image shown in Figure 4. To quantify performance, the peak signal-to-noise ratio (PSNR) was used:

PSNR(z) = 10 log_10 ( Σ_{i,j=1}^{row,col} 255^2 / Σ_{i,j=1}^{row,col} (z_{i,j} - o_{i,j})^2 )    (18)

where z is the reconstructed image, o is the original image, and row and col are the image's numbers of rows and columns.

2) Results: After 4000 learning trials, the squared error is lower than 0.03. Hence, the network's weights constitute a solution which minimizes Equations 15 and 16. The weights, illustrated using contour plots (Figure 5), seem to have converged onto salient traits that correspond to various contours. From those weights, detection feature maps (obtained from products of the W and V weight matrices) were constructed to illustrate what each unit in each layer is tuned for (Figure 6). In Figure 6a, the 5 units are tuned to detect information present at various orientations and locations. On the other hand, in Figure 6b, the 5 units are tuned to respond to four different diagonal bars and a horizontal bar. With various combinations of these features, the network is able to reconstruct both the original and the generalization images without much loss of detail. As Figure 4 shows, FEBAM's reconstruction performance is superior to that of the other algorithms, except for fastICA. In regards to reconstructing the generalization image, FEBAM's performance is comparable to that of fastICA, and superior to that of the other algorithms.

Fig. 5. Final weight connections for: (a) the W layer; (b) the V layer.

Fig. 6. Feature detection performed by each of (a) the 5 units of layer V and (b) the 5 units of layer W.

B. Blind source separation

1) Methodology: Blind source separation (BSS) consists in recovering unobserved source signals from several observed mixed signals [20]. The unobserved source signals are assumed non-Gaussian and mutually independent. BSS techniques are mainly used when one must deal with sets of multivariate data (e.g., mixed speech signals, EEG data). The two source signals used for the simulation are illustrated in Figure 7a. The first source is a sinusoid, and the second one contains uniformly distributed white noise. These source signals, noted s, have been multiplied by a mixing matrix A, following:

x = As    (19)

where A is a 2 x 2 mixing matrix. The resulting mixed signals, x, are illustrated in Figure 7b. The network's task is then to recover the original signals (which, in this case, means separating signal from noise), using only the mixed signals. This is a classic task that can be found in several BSS studies (e.g., [8, 9]).

Fig. 7. (a) Original signals. (b) Observed mixtures of the original signals.

Before estimation of the original components is performed, the observed mixtures (Figure 7b) are whitened (or sphered) by decomposing the observed signals' covariance matrix into its eigenvalues and eigenvectors, where Λ is the eigenvector matrix and λ is a diagonal matrix composed of the eigenvalues. Then, the final whitened signal can be obtained (x̃ = Λ λ^{-1/2} Λ^T x). The new signal vectors will in that case be uncorrelated, and their variances will be equal to unity (x̃ x̃^T = I).

For this simulation, the learning parameter was set to η = 0.05 and the output parameter δ was held fixed; the weights were randomly initialized with values ranging from -1 to 1, and the general learning procedure was identical to that of the image reconstruction simulation. FEBAM's performance was compared to that of the fastICA algorithm [9], which was designed for such tasks.
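The whitening (sphering) step described above can be written in a few lines of NumPy. The sketch below is an illustration only: it builds a sinusoid plus uniform-noise source pair similar to Figure 7a, mixes it through an arbitrarily chosen 2 x 2 matrix A (the paper's exact mixing values are not reproduced here), and whitens the mixtures through the eigendecomposition of their covariance matrix.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two sources: a sinusoid and uniformly distributed white noise (as in Figure 7a)
    t = np.linspace(0.0, 1.0, 1000)
    s = np.vstack([np.sin(2.0 * np.pi * 10.0 * t),       # sinusoidal source (arbitrary frequency)
                   rng.uniform(-1.0, 1.0, t.size)])      # uniform white-noise source

    # Mix the sources: x = A s (Eq. 19); this A is an arbitrary example matrix
    A = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
    x = A @ s

    # Whitening: eigendecompose the covariance of the mixtures, then rescale along the
    # eigenvectors so that the whitened signals are uncorrelated with unit variance
    C = np.cov(x)                                        # 2 x 2 covariance of the observed mixtures
    lam, Lam = np.linalg.eigh(C)                         # lam: eigenvalues, Lam: eigenvector matrix
    x_tilde = Lam @ np.diag(lam ** -0.5) @ Lam.T @ x     # x~ = Lambda lambda^(-1/2) Lambda^T x

    print(np.round(np.cov(x_tilde), 3))                  # approximately the 2 x 2 identity matrix

The whitened mixtures x_tilde would then serve as the input signals for the separation stage itself.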
2) Results: After training, the weights self-stabilize (once again, squared error < 0.03). Figure 8a shows the network's estimates of the original signals, and Figure 8b shows fastICA's estimates of the original signals. These signals closely match the originals up to multiplicative signs, as with standard BSS algorithms (e.g., [9]).

Fig. 8. (a) FEBAM's estimates of the original signals. (b) fastICA's estimates of the original signals.

V. DISCUSSION

In this paper, it was shown that the BAM introduced in [10] uses a learning function that is closely related to that of nonlinear PCA networks. Consequently, by simply removing a set of external connections from the original BAM architecture, FEBAM is able to perform feature extraction and reduction of the input dimensionality, just like PCA networks do. Because the W and V layers are connected, it is possible to observe what the developed attractors in the V layer look like. Leading to attractor-like behavior, the network can thus be used to perform prototype development from exemplar presentations. Seemingly, the model takes more trials to learn and shows a slightly lower generalization performance than fastICA [9]. However, FEBAM, being a special case of BAM [10], can also be used to simulate other applications such as categorization [16], classification [10], many-to-one association and multi-step pattern recognition [21]. Those properties cannot be simultaneously accounted for by any of the other models presented in this paper (such as fastICA). Consequently, in addition to being a candidate for modeling human perceptual feature creation, the model possesses more cognitive explanative power than any other nonlinear/linear PCA or ICA algorithm.

One characteristic of FEBAM resides in the fact that, as the network learns, different weights may converge to the same component. To enhance the network's performance, a decorrelation procedure, such as the ones proposed for ICA [9] and PCA [5], could be used. Note, however, that in cognitive psychology, redundancy is often seen as a sign of representational robustness, instead of a performance weakness.

Further research should establish FEBAM's computational complexity, as well as evaluate its capacity to learn environmental biases [22] and to extract and learn prototypes in noisy environments (a difficult task to achieve for standard RAMs/BAMs). In addition, further studies should address the possibility of FEBAM presenting an evolutive architecture in relation to clustering development.

REFERENCES

[1] G. L. Murphy, The Big Book of Concepts. Cambridge, MA: MIT Press, 2002.
[2] S. Harnad, "The symbol grounding problem," Physica D, vol. 42, pp. 335-346, 1990.
[3] P. G. Schyns, R. L. Goldstone, and J. P. Thibaut, "The development of features in object concepts," Behavioral and Brain Sciences, vol. 21, pp. 1-54, 1998.
[4] H. Abdi, D. Valentin, and B. G. Edelman, "Eigenfeatures as intermediate-level representations: The case for PCA models," Behavioral and Brain Sciences, vol. 21, 1998.
[5] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks. New York: Wiley, 1996.
[6] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences, vol. 79, pp. 2554-2558, 1982.
[7] B. Kosko, "Bidirectional associative memories," IEEE Transactions on Systems, Man, and Cybernetics, vol. 18, pp. 49-60, 1988.
[8] J. Karhunen, P. Pajunen, and E. Oja, "The nonlinear PCA criterion in blind source separation: Relations with other approaches," Neurocomputing, vol. 22, pp. 5-20, 1998.
[9] A. Hyvärinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Networks, vol. 13, pp. 411-430, 2000.
[10] S. Chartier and M. Boukadoum, "A bidirectional heteroassociative memory for binary and grey-level patterns," IEEE Transactions on Neural Networks, vol. 17, pp. 385-396, 2006.
[11] G. Costantini, D. Casali, and R. Perfetti, "Neural associative memory storing gray-coded gray-scale images," IEEE Transactions on Neural Networks, vol. 14, 2003.
[12] C.-C. Wang, S.-M. Hwang, and J.-P. Lee, "Capacity analysis of the asymptotically stable multi-valued exponential bidirectional associative memory," IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, vol. 26, 1996.
[13] D. Zhang and S. Chen, "A novel multi-valued BAM model with improved error-correcting capability," Journal of Electronics, vol. 20, 2003.
[14] J. M. Zurada, I. Cloete, and E. van der Poel, "Generalized Hopfield networks for associative memories with multi-valued stable states," Neurocomputing, vol. 13, 1996.
[15] M. H. Hassoun, "Dynamic heteroassociative neural memories," Neural Networks, vol. 2, 1989.
[16] S. Chartier and R. Proulx, "NDRAM: Nonlinear dynamic recurrent associative memory for learning bipolar and nonbipolar correlated patterns," IEEE Transactions on Neural Networks, vol. 16, pp. 1393-1400, 2005.
[17] B. Kosko, "Unsupervised learning in noise," IEEE Transactions on Neural Networks, vol. 1, pp. 44-57, 1990.
[18] E. Oja, "Neural networks, principal components, and subspaces," International Journal of Neural Systems, vol. 1, pp. 61-68, 1989.
[19] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, pp. 9-44, 1988.
[20] J. F. Cardoso, "Blind signal separation: Statistical principles," Proceedings of the IEEE, vol. 86, pp. 2009-2025, 1998.
[21] S. Chartier and M. Boukadoum, "A sequential dynamic heteroassociative memory for multistep pattern recognition and one-to-many association," IEEE Transactions on Neural Networks, vol. 17, pp. 59-68, 2006.
[22] S. Hélie, S. Chartier, and R. Proulx, "Are unsupervised neural networks ignorant? Sizing the effect of environmental distributions on unsupervised learning," Cognitive Systems Research, in press, 2006.
