
Submitted to: Network: Comput. Neural Syst.

Entropy Optimization by the PFANN Network: Application to Blind Source Separation

Simone Fiori
Dept. of Electronics and Automatics, University of Ancona, Italy
E-mail: simone@eealab.unian.it

PACS numbers: unknown

Abstract. The aim of this paper is to present a study on polynomial functional-link neural units that learn through an information-theoretic criterion. First the neuron's structure is presented and the unsupervised learning theory is explained and discussed, with particular attention paid to the capability of approximating probability density functions and cumulative distribution functions. Then a neural network formed by such neurons (Polynomial Functional-Link Artificial Neural Network, PFANN) is shown to be able to separate out linearly mixed eterokurtic source signals, that is, signals endowed with either positive or negative kurtoses. In order to compare the performance of the proposed blind separation technique with that exhibited by existing methods, the Mixture of Densities (MOD) approach by Xu et al., which is closely related to PFANN, is briefly recalled; then comparative numerical simulations performed on both synthetic and real-world signals, together with a complexity evaluation, are illustrated. These results show that the PFANN approach attains similar performance with a noticeable reduction of computational effort.

1. Introduction

Over recent years, information-theoretic optimization by neural networks has become an important research field. Since the pioneering work of Linsker and Plumbley (see for instance [12, 13, 17, 18] and references therein), several authors have studied unsupervised neural learning driven by entropy-based criteria, with applications such as blind separation of sources [,, 4, 5, ], linear estimation and time-series prediction [], probability density shaping [, 4,, 4, 9], unsupervised classification [], and blind system deconvolution [, 3, 3]. In particular, the aim of the different techniques for neural density shaping is to find a non-linear transformation of an input random process (with unknown statistics) that maximizes the entropy of the transformed process. The transformation found in this way approximates the cumulative distribution function of the input random process, and its first derivative approximates the probability density function of the input [, ], with a degree of accuracy depending on the structure of the employed neural network.

Usually, the neural topologies used in the literature involve semi-linear neural units, that is, linear combiners followed by static sigmoidal non-linearities. In this paper, a study concerning the use of more complex and flexible neural units endowed with functional links [19, 30], trained in an unsupervised way by means of an entropy-based criterion, is presented.

As used here, in a functional-link neuron the total excitation is computed as a weighted sum of different powers of the external stimulus, so that the total excitation assumes the expression of a polynomial [6, 7]. The actual response of the neuron is then computed by squashing the total excitation through a sigmoidal function. In order to test the learning and approximation capabilities of the proposed structure, cases of probability density shaping and cumulative distribution function approximation are tackled and discussed through computer simulations.

Among others, blind source separation by independent component analysis is a very interesting and challenging neural information processing task. The problem of separating out mixed, statistically independent source signals by neural networks has been widely investigated in recent years [, 6,, 5], and several algorithms and methods have been proposed by different authors. In particular, the information-theoretic approach by Bell and Sejnowski [2] has attracted a lot of interest in the neural network community. However, both analytical studies and computer experiments have shown that this algorithm cannot be guaranteed to work with all kinds of source signals, that is, its effectiveness depends on the source probability distributions. To overcome this problem, Xu et al. [25] recently proposed to use the Bell-Sejnowski algorithm modified by Amari's natural gradient learning technique (see for instance [5, 8] and references therein), which dramatically speeds up its convergence, and to employ adaptive non-linearities instead of fixed ones. These flexible functions may be `learnt' so that they approximate the cumulative distribution functions of the sources, which can be proven to be the optimal choice [6, 8]. In particular, this new technique overcomes the problem of separating both leptokurtic (i.e. with positive kurtosis) and platikurtic (i.e. with negative kurtosis) sources without the need of explicitly estimating their kurtoses, making the algorithm able to separate eterokurtic sources, that is, mixed leptokurtic and platikurtic sources.

Since functional-link units that learn by means of an entropy optimization principle show interesting probability density function and cumulative distribution function approximation abilities, we propose to extend the aforementioned learning theory to a multiple-unit version (PFANN) and to apply it to source separation: we first present the new Polynomial Functional-Link Artificial Neural Network (PFANN) learning equations, then briefly recall the Mixture of Densities (MOD) approach from [25, 26, 27], which is closely related to ours, and compare the performance and structural features of the two methods.

2. Learning of maximum entropy connection strengths in a single functional-link unit

2.1. Theory derivation

In this paper the following input-output description of a functional-link neuron is assumed:

y = \psi(x; a) = \mathrm{sgm}[\varphi(x; a)] ,   (1)

where sgm(·) is a sigmoidal function, bounded above and below, continuous and strictly increasing, and \varphi(x; a) is a strictly monotonic polynomial in x depending upon the parameters in a = [a_0 \; a_1 \; \cdots \; a_n]^T. If x is supposed to be a stationary continuous-time random process x = x(t) ∈ X with probability density function (pdf) p_x(x), then y will be a random process y = y(t) ∈ Y too, with a pdf denoted here as p_y(y; a).

The differential entropy of the random process y(t), defined as

H_y(a) \stackrel{\rm def}{=} - \int_Y p_y(\eta; a) \log p_y(\eta; a) \, d\eta ,   (2)

can be related to the differential entropy H_x of the random process x(t) by means of the fundamental formula p_y = p_x / |\psi'|, with \psi'(x; a) \stackrel{\rm def}{=} d\psi(x; a)/dx, where the prime denotes differentiation with respect to x. Using this substitution in formula (2) yields

H_y(a) = - \int_X \frac{p_x(\xi)}{|\psi'(\xi; a)|} \log \frac{p_x(\xi)}{|\psi'(\xi; a)|} \, |\psi'(\xi; a)| \, d\xi = H_x + \int_X p_x(\xi) \log |\psi'(\xi; a)| \, d\xi ,   (3)

where

H_x = - \int_X p_x(\xi) \log p_x(\xi) \, d\xi .

By definition, |\psi'| assumes the expression

|\psi'(x; a)| = \left| \frac{d\psi(x; a)}{dx} \right| = \mathrm{sgm}'[\varphi(x; a)] \, \frac{d\varphi(x; a)}{dx} ,   (4)

due to equation (1). The entropy H_y depends upon the coefficients a_k through \psi'(x; a), thus

\frac{\partial H_y(a)}{\partial a_k} = \int_X \frac{p_x(\xi)}{|\psi'(\xi; a)|} \, \frac{\partial |\psi'(\xi; a)|}{\partial a_k} \, d\xi .   (5)

It is straightforward to see that the partial derivatives involved in the integral (5) read

\frac{\partial |\psi'|}{\partial a_k} = \mathrm{sgm}''[\varphi] \, \frac{\partial \varphi}{\partial a_k} \, \frac{d\varphi}{dx} + \mathrm{sgm}'[\varphi] \, \frac{\partial}{\partial a_k} \frac{d\varphi}{dx} ,   (6)

whereby direct calculations lead to

\frac{\partial |\psi'|}{\partial a_k} = |\psi'| \left[ \frac{\mathrm{sgm}''[\varphi]}{\mathrm{sgm}'[\varphi]} \, \frac{\partial \varphi}{\partial a_k} + \frac{\partial}{\partial a_k}\!\left(\frac{d\varphi}{dx}\right) \left(\frac{d\varphi}{dx}\right)^{-1} \right] .   (7)

It is known from statistics that \psi maximizes the entropy when it approaches the cumulative distribution function of x(t), that is, when \psi' approaches the probability density function of the input process. Note that the entropy is optimized here only for fixed upper and lower bounds. Thus we wish to find a vector of parameters a, hence a configuration of the functional-link neuron, that maximizes the entropy of the neuron response y(t). To this aim, a set of continuous-time learning equations derived by the Gradient Steepest Ascent (GSA) method can be used here‡. Following this way, such equations take the form da/dt = \partial H_y / \partial a, that is:

\frac{da}{dt} = \int_X p_x(\xi) \left[ \frac{\mathrm{sgm}''[\varphi(\xi; a)]}{\mathrm{sgm}'[\varphi(\xi; a)]} \, \frac{\partial \varphi(\xi; a)}{\partial a} + \frac{1}{\varphi'(\xi; a)} \, \frac{\partial \varphi'(\xi; a)}{\partial a} \right] d\xi .   (8)

‡ It must be underlined that the presence of the bounded (saturating) non-linearity is needed in order to have fixed bounds, otherwise the optimization problem would be meaningless; using the sigmoidal function sgm(·) is a way to enforce this.
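The statistical fact invoked above, namely that the entropy is maximized when \psi approaches the cumulative distribution function of x(t), can be read directly off equation (3); the short argument below is a standard one, added here only as a reading aid, and is stated for a sigmoid ranging between 0 and 1 (such as the one used in the next subsection):

H_y(a) = \int_X p_x(\xi) \log \frac{\psi'(\xi; a)}{p_x(\xi)} \, d\xi \;\le\; \log \int_X \psi'(\xi; a) \, d\xi \;\le\; \log 1 = 0 ,

where the first step is Jensen's inequality and the second follows from \int_X \psi' \, d\xi = \mathrm{sgm}[\varphi(\sup X; a)] - \mathrm{sgm}[\varphi(\inf X; a)] \le 1. Equality holds when \psi'(\xi; a) = p_x(\xi) almost everywhere on X, i.e. when \psi(\cdot; a) coincides with the cumulative distribution function of x(t); the output y(t) is then uniformly distributed and its differential entropy attains the upper bound 0. For a sigmoid with a range of width L the bound becomes \log L, and \psi' then approaches L\,p_x.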

Also, from equations (3) and (4), the exact expression of the quantity G \stackrel{\rm def}{=} H_y - H_x, hereafter referred to as the "entropy gap", is obtained:

G(a) = \int_X p_x(\xi) \, \log\{ \mathrm{sgm}'[\varphi(\xi; a)] \, |\varphi'(\xi; a)| \} \, d\xi .   (9)

It should be noted that G(a) is not guaranteed to be positive, since it actually represents a difference between entropies. Anyway, as H_y(a) is maximized, G(a) is maximized too, since H_x is constant; thus maximizing the response entropy may be conceived of as maximizing the entropy gap between the original and the squashed processes.

2.2. A case study

Here the simple case study concerning the following neural structure is discussed:

\mathrm{sgm}(z) = \frac{1}{2}[1 + \mathrm{erf}(z)] = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{z} \exp(-u^2) \, du ,   (10)

z \stackrel{\rm def}{=} \varphi(x; a) = a_0 + a_1 x .   (11)

The excitation x is endowed with a Laplacian distribution p_x(x) = (\lambda/2) e^{-\lambda |x - \mu|}, where \lambda > 0. Equations (10) and (11) represent the input-output relationship of a sigmoidal neuron with one weight and one bias. This is an interesting case in the theory, since it is possible to find solutions of the differential system (8) in closed form. In fact, the relevant quantities involved in the integrals are found to be \mathrm{sgm}''(z)/\mathrm{sgm}'(z) = -2z and \varphi'(x; a) = a_1, whereby the others follow. Thus system (8) rewrites, in this case:

\frac{da_0}{dt} = - \int_{-\infty}^{+\infty} \lambda e^{-\lambda|\xi - \mu|} (a_0 + a_1 \xi) \, d\xi , \qquad \frac{da_1}{dt} = \frac{1}{a_1} - \int_{-\infty}^{+\infty} \lambda e^{-\lambda|\xi - \mu|} (a_0 + a_1 \xi) \, \xi \, d\xi ,

or, after direct calculations:

\frac{da_0}{dt} = -2(a_0 + a_1 \mu) , \qquad \frac{da_1}{dt} = -2\mu (a_0 + a_1 \mu) - \frac{4 a_1}{\lambda^2} + \frac{1}{a_1} .   (12)

Setting the time-derivatives to zero allows one to determine the equilibrium points a_{eq} of the above differential system. They are found to be:

a_{eq} = [+\lambda\mu/2 \;\; -\lambda/2]^T \quad \mathrm{and} \quad [-\lambda\mu/2 \;\; +\lambda/2]^T .   (13)

It is also interesting to give, in this simple case, the exact expression of the entropy gap (9), which is:

G([a_0 \; a_1]^T) = \log \frac{|a_1|}{\sqrt{\pi}} - \left[ a_0^2 + 2 a_0 a_1 \mu + \left( \mu^2 + \frac{2}{\lambda^2} \right) a_1^2 \right] .   (14)

Note that the entropy gap is invariant with respect to a simultaneous sign change of a_0 and a_1, coherently with result (13).
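The closed-form equilibria (13) lend themselves to a quick numerical check: running the stochastic counterpart of system (12) on samples drawn from the Laplacian density above should drive (a_0, a_1) towards one of the two points. The sketch below does this with NumPy; the values of λ and μ, the learning rate, the batch size and the initialization are illustrative choices and are not taken from the paper.

```python
import numpy as np

# Monte Carlo check of the case study of Section 2.2, assuming the Laplacian
# parameterization p_x(x) = (lam/2) exp(-lam*|x - mu|) used above.
rng = np.random.default_rng(0)
lam, mu = 1.5, 0.7                 # illustrative density parameters
eta, n_steps, batch = 0.01, 20000, 64

a0, a1 = 0.3, 1.0                  # arbitrary initialization, a1 > 0
for _ in range(n_steps):
    x = rng.laplace(loc=mu, scale=1.0 / lam, size=batch)
    phi = a0 + a1 * x
    # stochastic version of (8) with sgm''/sgm' = -2*phi and phi' = a1
    a0 += eta * np.mean(-2.0 * phi)
    a1 += eta * np.mean(-2.0 * phi * x + 1.0 / a1)

print("learned  :", round(a0, 3), round(a1, 3))
print("predicted:", -lam * mu / 2, lam / 2)   # equilibrium [-lam*mu/2, +lam/2] of (13)
```

Starting from a positive a_1 the iteration settles near the second equilibrium in (13); starting from a negative a_1 it settles near the first one, in agreement with the sign symmetry noted above.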

2.3. Two more complex examples

It is worth considering the more complex neural structure described by the functions:

\mathrm{sgm}(z) = \tanh(z) ,   (15)

z = \varphi(x; a) = a_0 + e^{a_1} x + e^{a_3} x^3 + \cdots + e^{a_{2r+1}} x^{2r+1} ,   (16)

with 2r+1 being the order of the polynomial. Note that, owing to the exponential structure of the coefficients and to the monotonicity of the polynomial, the property \varphi'(x; a) > 0 holds true for any value of x ∈ X (provided that a_1, a_3, \ldots, a_{2r+1} > -\infty). With the structural functions as above, the relevant learning quantities are:

\frac{\mathrm{sgm}''(z)}{\mathrm{sgm}'(z)} = -2 \tanh(z) = -2\,\mathrm{sgm}(z) = -2y ,   (17)

\varphi'(x; a) = e^{a_1} + 3 e^{a_3} x^2 + \cdots + (2r+1) e^{a_{2r+1}} x^{2r} ,   (18)

and the others follow. In order to simulate the learning equations (8) particularized with the functions (15)-(16), their instantaneous, stochastic, discrete-time approximations can be used. They are expressed by:

a_0(t+1) = a_0(t) - 2\mu_0 \, y(t) ,   (19)

a_k(t+1) = a_k(t) + \mu_k \, e^{a_k(t)} \left[ \frac{k}{\varphi'(t)} - 2\, x(t)\, y(t) \right] x^{k-1}(t) ,   (20)

a_0(0) = a_0^{\rm init} , \qquad a_k(0) = a_k^{\rm init} ,   (21)

for k = 1, 3, 5, \ldots, 2r+1, with \mu_0, \mu_1, \mu_3, \ldots, \mu_{2r+1} being sufficiently small positive learning rates, and t denoting the discrete-time index. It should be noted that when the \mu_k differ from one another, the learning equations no longer represent a gradient ascent rule in the a-space, but only resemble gradient ascent equations in the individual a_k-spaces. As a learning performance index, an estimate of the entropy gap H_y - H_x may be obtained by averaging, over the learning epochs, the instantaneous stochastic approximation of the right-hand side of expression (9):

\mathrm{Entropy\ gap\ } G \approx \langle \log[(1 - y^2)\,\varphi'] \rangle .

Since it is known that the quantity \psi'(x; a) approximates the pdf of the excitation x(t) (with a degree of accuracy related to sgm(·) and \varphi(\cdot;\cdot); see for instance [, ] and references therein), two exemplary experiments are considered in what follows in order to test the probability density shaping capability of the proposed neural structure.

First, a two-overlapping-Gaussian excitation, with a probability density function of the form

p_x(x) = \frac{1}{2\sqrt{2\pi}\,\sigma} \left[ \exp\!\left( -\frac{(x - \mu_1)^2}{2\sigma^2} \right) + \exp\!\left( -\frac{(x - \mu_2)^2}{2\sigma^2} \right) \right] ,   (22)

was presented to the functional-link neuron (15)-(16) with 2r+1 = 3. Simulation results are shown in Figure 1.

Figure 1. Entropy maximization for a two-overlapping-Gaussian excitation. (pdf_in: pdf of the stimulus; \psi'(x): approximated pdf after the neuron's learning; panels on the right: estimated entropy gap and neuron's coefficients a_0, a_1, a_3 during learning.)

Another experiment was performed with a polynomial \varphi(x; a) of degree 2r+1 = 7 and a real-world excitation (a sampled musical stream with suitable amplitude range scaling). The results are shown in Figure 2.

Figure 2. Entropy maximization for a real-world excitation. (pdf_in: histogram approximation of the stimulus pdf; \psi'(x): approximated pdf after the neuron's learning; panels on the right: estimated entropy gap and neuron's coefficients a_0, a_1, a_3, a_5, a_7 during learning.)
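The discrete-time rules (19)-(21) are straightforward to implement. The sketch below applies them to a two-Gaussian excitation of the form (22) with 2r+1 = 3, as in the first experiment; the mixture parameters, learning rates and number of samples are illustrative assumptions rather than the values used for Figure 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-overlapping-Gaussian excitation as in (22); means and sigma are assumed.
def sample_x(n, mu1=-0.5, mu2=0.5, sigma=0.2):
    centers = rng.choice([mu1, mu2], size=n)
    return rng.normal(centers, sigma)

# Odd polynomial functional-link unit (15)-(16) with 2r+1 = 3:
#   phi(x) = a0 + exp(a1)*x + exp(a3)*x**3,   y = tanh(phi(x)).
a  = {0: 0.0, 1: 0.0, 3: 0.0}        # coefficients a_0, a_1, a_3
mu = {0: 0.01, 1: 0.01, 3: 0.01}     # learning rates (assumed)

def phi(x):
    return a[0] + np.exp(a[1]) * x + np.exp(a[3]) * x**3

def phi_prime(x):
    return np.exp(a[1]) + 3.0 * np.exp(a[3]) * x**2

for x in sample_x(30000):
    y = np.tanh(phi(x))
    a[0] += mu[0] * (-2.0 * y)                                        # rule (19)
    for k in (1, 3):                                                  # rule (20)
        a[k] += mu[k] * np.exp(a[k]) * (k / phi_prime(x) - 2.0 * x * y) * x**(k - 1)

# After learning, psi'(x) = (1 - y^2) * phi'(x) approximates the input pdf.
xs = np.linspace(-1.2, 1.2, 9)
print(np.round((1.0 - np.tanh(phi(xs))**2) * phi_prime(xs), 3))
```

The printed values give a coarse sampling of the learned density estimate \psi'(x), which should display the two modes of the excitation.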

Unfortunately, the values of the coefficients a_i obtained by running the preceding simulations cannot be compared to the `optimal' ones, since the solutions of the equilibrium equation dH_y(a)/da = 0 are not available at present. However, a close examination of the simulations shows that in all cases the neural unit endowed with functional links, with a relatively small number of parameters to be learnt, seems to possess interesting approximation capabilities, which should of course be expected, as biological evidence seems to suggest [11].

3. Extension to the multiple case: the PFANN network applied to blind source separation

The problem of separating out mixed, statistically independent source signals by neural networks has been widely investigated in recent years [, 6,, 5], and several algorithms and methods have been proposed by different authors. In particular, the information-theoretic approach by Bell and Sejnowski [2], with the natural gradient modification developed by Amari [], has attracted a lot of interest in the neural network community.

However, both analytical studies and computer experiments have shown that this algorithm cannot work with all kinds of source signals, that is, its effectiveness depends on the source probability distributions. This behavior may be explained by recognizing that the Bell-Sejnowski method relies on the use of non-linearities whose optimal shapes are the cumulative distribution functions (cdf's) of the sources [, 6]; thus using fixed non-linearities such as standard sigmoids is not a good choice, as also mentioned in [, 5, 9, 8]. On the other hand, in a blind problem the cdf's of the sources are unknown, and the problem cannot be solved directly. To overcome this difficulty, Xu et al. [25, 26, 27] recently proposed to use the well-known Bell-Sejnowski algorithm [2], modified by Amari's natural gradient learning technique (readers may refer to [5, 8] and references therein), which dramatically speeds up its convergence and reduces the amount of required computation, and to employ adaptive non-linearities instead of fixed ones. These flexible functions may be `learnt' so that they approximate the required cdf's of the sources, helping the separation algorithm to perform better. In particular, they overcome the problem of separating both leptokurtic and platikurtic sources without the need of explicitly estimating their kurtoses, making the algorithm able to separate eterokurtic sources.

In the previous Section we presented a technique for approximating the pdf as well as the cdf of a signal by means of a single neural unit that learns through an information-theoretic learning rule. The aim of this part is to extend this algorithm to a multiple-unit version (PFANN) and to apply it to source separation; we first present the new PFANN learning equations, then briefly recall the MOD theory from [25], which is closely related to our approach, and compare the performance and structural features of the two methods.

3.1. Blind source separation by the entropy maximization approach

For separating out n linearly mixed independent sources, we use a neural network with n inputs and n outputs described by the relationship y = h(x) = h(Wu), where u is the network input vector and W is the weight matrix, which adapts through time in order for the entropy H_{h(x)} to be maximized [, 5], with h(x) = [h_1(x_1) \; h_2(x_2) \; \cdots \; h_n(x_n)]^T. The entropy of the squashed signal y is defined as H_y \stackrel{\rm def}{=} -E_x[\log p_y(y)], where the operator E_x[\cdot] denotes mathematical expectation with respect to x. As the learning algorithm for W, we use the natural gradient rule [5]:

W(t+1) = W(t) + \mu_W [I + g(x(t)) \, x^T(t)] \, W(t) ,   (23)

where g(x) = [g_1(x_1) \; g_2(x_2) \; \cdots \; g_n(x_n)]^T, g_j \stackrel{\rm def}{=} h_j''/h_j', and h_j should approximate the cumulative distribution function of the j-th source signal [, 5]. The mixing model is u = Ms, where M is a full-rank mixing matrix and s is the vector containing the n statistically independent source signals to be separated. It is worth remarking that in a blind separation problem both the mixing matrix M and the source signal stream s(t) are unknown. Moreover, in some real-world applications the mixtures may vary through time, making the problem non-stationary. In this paper we deal only with instantaneous mixtures, that is, mixtures given by mixing matrices M that do not change through time.

Figure 3. An odd polynomial functional-link neural unit, shown for r = 2: the odd powers x, x^3, x^5 of the input are weighted by the exponential links exp(a_1), exp(a_3), exp(a_5), summed together with the bias a_0, and squashed by the erf function.

3.2. Entropy maximization by the PFANN network

In order to approximate any cdf of the source signals, we use the following adaptive transformations:

y_j = h_j(x_j) \stackrel{\rm def}{=} \frac{1}{2}\{ 1 + \mathrm{erf}[\varphi_j(x_j; a_j)] \} , \qquad j = 1, \ldots, n ,   (24)

where the index j denotes the j-th neuron, \varphi_j(x_j; a_j) is a polynomial whose coefficients in a_j are adapted through time, and erf(·) denotes again the mathematical `error function'§. Since h_j(·) should approximate a non-decreasing function, each polynomial \varphi_j must be monotonic almost everywhere within the range of x_j, formally \varphi_j' \stackrel{\rm def}{=} d\varphi_j/dx_j \ge 0. This condition is surely fulfilled if \varphi_j(x_j; a_j) has the form:

\varphi_j(x_j; a_j) = a_{0,j} + \sum_{i=0}^{r_j} e^{a_{2i+1,j}} x_j^{2i+1} ,   (25)

where 2r_j + 1 is the order of the polynomial for each neuron. As mentioned above, the structure of the neurons gives the name Polynomial Functional-Link Artificial Neural Network to the network used for separating out the source signals. A polynomial functional-link network [19, 30] as intended here is composed of neural units endowed with exponential links. A unit of this kind is depicted in Figure 3 for r = 2. In that Figure, the blocks marked with \Pi perform the multiplication of their inputs, the block \Sigma performs the summation, and erf(·) denotes the aforementioned error function that squashes the polynomial \varphi(x).

The entropy H_y, which has to be maximized [, 5] with respect to the coefficients a_{k,j}, is defined as:

H_y(A) \stackrel{\rm def}{=} - \int p_y(y; A) \log p_y(y; A) \, dy^n ,   (26)

§ The error function in (24) may be replaced by any sigmoid that is continuously differentiable almost everywhere at least twice, non-decreasing, and ranging between 0 and 1 [6, 7]. We chose the `erf' function because it simplifies the learning equations.

where p_y(y; A) is the joint pdf of the squashed signal vector y \stackrel{\rm def}{=} [y_1 \; y_2 \; \cdots \; y_n]^T and A is the matrix formed by the columns a_j. The entropy H_y may be rewritten [, 5] as H_y = H_x + E_x[\log \Psi], where the function \Psi(x; A) is defined as:

\Psi(x; A) \stackrel{\rm def}{=} \det \begin{bmatrix} \partial h_1/\partial x_1 & \cdots & \partial h_1/\partial x_n \\ \vdots & \ddots & \vdots \\ \partial h_n/\partial x_1 & \cdots & \partial h_n/\partial x_n \end{bmatrix} = \prod_{j=1}^{n} \psi_j ,

where \psi_j \stackrel{\rm def}{=} dh_j/dx_j, so that \log \Psi = \sum_{j=1}^{n} \log \psi_j. In order for the entropy to be maximized, the recursive stochastic gradient steepest ascent technique may be used again:

\Delta a_{0,j} = \mu_0 \frac{\partial \log \psi_j}{\partial a_{0,j}} , \qquad \Delta a_{2i+1,j} = \mu_{2i+1} \frac{\partial \log \psi_j}{\partial a_{2i+1,j}} ,   (27)

where i = 0, 1, \ldots, r_j, and \mu_0, \mu_1, \mu_3, \ldots, \mu_{2r_j+1} are positive learning stepsizes. In this case, the functions \psi_j look like:

\psi_j(x_j) = \frac{1}{\sqrt{\pi}} e^{-\varphi_j^2(x_j)} \, \varphi_j'(x_j) , \qquad \varphi_j'(x_j) = \sum_{i=0}^{r_j} (2i+1) \, e^{a_{2i+1,j}} x_j^{2i} ,   (28)

where the dependence on the a_j is understood. Thus we have:

\frac{\partial \psi_j}{\partial a_{0,j}} = \frac{1}{\sqrt{\pi}} \left[ \frac{\partial e^{-\varphi_j^2(x_j)}}{\partial a_{0,j}} \, \varphi_j'(x_j) + e^{-\varphi_j^2(x_j)} \, \frac{\partial \varphi_j'(x_j)}{\partial a_{0,j}} \right] = - \frac{2}{\sqrt{\pi}} e^{-\varphi_j^2(x_j)} \, \varphi_j(x_j) \, \varphi_j'(x_j) ,   (29)

while:

\frac{\partial \psi_j}{\partial a_{2i+1,j}} = \frac{1}{\sqrt{\pi}} \left[ \frac{\partial e^{-\varphi_j^2(x_j)}}{\partial a_{2i+1,j}} \, \varphi_j'(x_j) + e^{-\varphi_j^2(x_j)} \, \frac{\partial \varphi_j'(x_j)}{\partial a_{2i+1,j}} \right] = \frac{1}{\sqrt{\pi}} e^{-\varphi_j^2(x_j)} \left[ (2i+1) - 2 \varphi_j(x_j) \varphi_j'(x_j) \, x_j \right] e^{a_{2i+1,j}} x_j^{2i} .   (30)

Therefore the adaptation algorithm writes:

\Delta a_{0,j} = -2 \mu_0 \, \varphi_j(x_j; a_j) , \qquad \Delta a_{2i+1,j} = \mu_{2i+1} \, e^{a_{2i+1,j}} \left[ \frac{2i+1}{\varphi_j'(x_j; a_j)} - 2 \varphi_j(x_j; a_j) \, x_j \right] x_j^{2i} ,   (31)

where i = 0, 1, \ldots, r_j, and the short notation \Delta a_{k,j} = a_{k,j}(t+1) - a_{k,j}(t) is used. Finally, we need to compute the non-linear functions g_j that are required for adapting the weight matrix W through (23). In our case they take the expression:

g_j(x_j; a_j) = \frac{1}{\psi_j} \frac{d\psi_j}{dx_j} = -2 \varphi_j(x_j; a_j) \, \varphi_j'(x_j; a_j) + \frac{\varphi_j''(x_j; a_j)}{\varphi_j'(x_j; a_j)} ,   (32)

where \varphi_j''(x_j; a_j) = \sum_{i=1}^{r_j} 2i(2i+1) \, e^{a_{2i+1,j}} x_j^{2i-1}.
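For concreteness, the separation stage described by (23), (25), (31) and (32) can be sketched in a few lines of NumPy. In the listing below the sources, the mixing matrix, the polynomial order, the learning rates and the normalization of the mixtures are illustrative assumptions; only the update structure follows the equations above, and no claim is made that these settings reproduce the experiments of Section 4.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic eterokurtic sources (assumed for illustration only).
T, n = 50000, 2
s = np.vstack([rng.uniform(-1.0, 1.0, T),        # platikurtic source
               rng.laplace(0.0, 0.3, T)])        # leptokurtic source
M = rng.uniform(-1.0, 1.0, size=(n, n))          # unknown full-rank mixing matrix
u = M @ s                                        # observed mixtures, u = M s
scale = u.std(axis=1, keepdims=True)             # practical amplitude scaling (assumed)
u = u / scale

# PFANN separating network: x = W u, y_j = (1 + erf(phi_j(x_j)))/2.
r = 1                                            # 2r+1 = 3: cubic polynomials
powers = 2 * np.arange(r + 1) + 1                # odd powers 1, 3, ..., 2r+1
W = np.eye(n) + 0.01 * rng.normal(size=(n, n))
a0 = np.zeros(n)                                 # a_{0,j}
a_odd = np.full((n, r + 1), -1.0)                # a_{2i+1,j}; exp(-1) keeps phi small at start
mu_W, mu_a = 1e-4, 1e-3                          # learning stepsizes (assumed)

for t in range(T):
    x = W @ u[:, t]
    e = np.exp(a_odd)
    phi = a0 + np.sum(e * x[:, None] ** powers, axis=1)                 # (25)
    phi_p = np.sum(powers * e * x[:, None] ** (powers - 1), axis=1)
    phi_pp = np.sum((powers - 1) * powers * e
                    * x[:, None] ** np.maximum(powers - 2, 0), axis=1)

    # coefficient adaptation, eq. (31)
    a0 += mu_a * (-2.0 * phi)
    a_odd += (mu_a * e * (powers / phi_p[:, None] - 2.0 * (phi * x)[:, None])
              * x[:, None] ** (powers - 1))

    # adaptive non-linearities (32) and natural-gradient update of W, eq. (23)
    g = -2.0 * phi * phi_p + phi_pp / phi_p
    W += mu_W * (np.eye(n) + np.outer(g, x)) @ W

print("overall source-to-output matrix:")
print(np.round(W @ (M / scale), 3))              # ideally close to a scaled permutation
```

After the run, the overall source-to-output matrix should be close to a scaled permutation matrix; the convergence measures defined in Section 4 quantify how close it gets.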

3.3. The MOD learning algorithm

The MOD technique, an existing approach drawn from the scientific literature, is now briefly recalled for comparison with the PFANN approach. As the functions h_j, in [25, 26] the following expression is proposed:

h_j(x_j) = \sum_{i=1}^{m_j} \alpha_{i,j} \, \sigma[ b_{i,j} (x_j - c_{i,j}) ] ,   (33)

where m_j is the size of the `mixture of densities' for the j-th neuron, (b_{i,j}, c_{i,j}) are coefficients to be adapted, and the \alpha_{i,j} measure the contribution given by each single term to the approximation; thus they should be positive and sum to one with respect to the index i. To this aim, it is useful to replace the \alpha_{i,j} with new unconstrained variables \varrho_{i,j} defined by \alpha_{i,j} = \exp(\varrho_{i,j}) / \sum_i \exp(\varrho_{i,j}). The basis function is [25] \sigma(u) = 1/(1 + \exp(-u)). Also, the first derivatives of the functions h_j have to be computed:

h_j'(x_j) = \frac{dh_j}{dx_j} = \sum_{i=1}^{m_j} \alpha_{i,j} \, b_{i,j} \, \sigma'[ b_{i,j} (x_j - c_{i,j}) ] .   (34)

Now, the MOD learning equations that maximize the entropy H_y read [25, 26, 27]:

\Delta \varrho_{i,j} = \frac{k_\varrho}{h_j'(x_j)} \sum_{k=1}^{m_j} \alpha_{k,j} \, b_{k,j} \, (v_{k,j} - v_{k,j}^2)(\delta_{i,k} - \alpha_{i,j}) ,   (35)

\Delta c_{i,j} = - \frac{k_c}{h_j'(x_j)} \, \alpha_{i,j} \, b_{i,j}^2 \, (1 - 2 v_{i,j})(v_{i,j} - v_{i,j}^2) ,   (36)

\Delta b_{i,j} = \frac{k_b \, \alpha_{i,j}}{h_j'(x_j)} \, [ 1 + b_{i,j} (x_j - c_{i,j})(1 - 2 v_{i,j}) ] (v_{i,j} - v_{i,j}^2) ,   (37)

where v_{i,j} = v_{i,j}(x_j) \stackrel{\rm def}{=} \sigma[ b_{i,j}(x_j - c_{i,j}) ], \delta_{i,j} is the Kronecker `delta', and k_\varrho, k_c and k_b are positive learning rates. In this case the non-linearities g_j(\cdot) take the expression:

g_j(x_j) = \frac{1}{h_j'(x_j)} \sum_{i=1}^{m_j} \alpha_{i,j} \, b_{i,j}^2 \, (1 - 2 v_{i,j})(v_{i,j} - v_{i,j}^2) .   (38)

These equations have been recast directly from [25, 26, 27], where a very clear derivation of the MOD learning theory is presented, along with a detailed discussion of the reasons for employing this kind of mixture of parametric densities as adaptive activation functions.
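As a reading aid, a minimal sketch of one MOD adaptive activation channel, following (33)-(38), is reported below; the mixture size, learning rates, initialization and excitation are placeholder choices, and the weight-matrix update (23) is omitted.

```python
import numpy as np

# One MOD activation h(x) = sum_i alpha_i * sigmoid(b_i * (x - c_i)),
# with alpha_i given by a softmax over unconstrained rho_i (cf. (33)).
m = 7
rho = np.zeros(m)
b = np.ones(m)
c = np.linspace(-1.0, 1.0, m)
k_rho = k_c = k_b = 0.01            # learning rates (assumed)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mod_step(x):
    """One entropy-ascent step, eqs. (35)-(37); returns g(x) of eq. (38)."""
    global rho, b, c
    alpha = np.exp(rho) / np.sum(np.exp(rho))
    v = sigmoid(b * (x - c))
    dv = v - v**2                                    # sigma'
    h_p = np.sum(alpha * b * dv)                     # h'(x), eq. (34)
    common = alpha * b * dv
    d_rho = k_rho / h_p * (common - alpha * np.sum(common))                # (35)
    d_c = -k_c / h_p * alpha * b**2 * (1.0 - 2.0 * v) * dv                 # (36)
    d_b = k_b / h_p * alpha * (1.0 + b * (x - c) * (1.0 - 2.0 * v)) * dv   # (37)
    rho, c, b = rho + d_rho, c + d_c, b + d_b
    return np.sum(alpha * b**2 * (1.0 - 2.0 * v) * dv) / h_p               # (38)

for x in np.random.default_rng(3).laplace(size=20000):   # placeholder excitation
    mod_step(x)
```

Comparing this listing with the PFANN sketch above already suggests the structural argument of Section 4.3: each MOD update touches 3m parameters per channel and evaluates m sigmoids, whereas the PFANN update adapts r+2 parameters per channel and evaluates a single squashing function.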

4. Computer simulation results and structural comparison

In this part we show computer simulation results that confirm the effectiveness of the proposed approach, and we present a complexity comparison of the MOD and PFANN separation methods.

4.1. A blind separation test of the PFANN network

The functional-link network-based PFANN separation algorithm has been tested with an input stream which is a mixture of three signals: s_1(t) = sign[cos(5t + 9cos(5t))], s_2(t) is a uniformly distributed white noise ranging in [-1, +1], and s_3(t) is an 8 kHz sampled speech signal. The signal s_1(t) has been chosen in this way because in [9] Gustafsson reported that the original Bell-Sejnowski algorithm may be ineffective in its presence. The 3 × 3 mixing matrix M has been randomly generated, and the weights in W and the coefficients a_{k,j} have been randomly initialized as well. The learning stepsize \mu_W was 10^{-4}. The separating neural network has 3 inputs, 3 outputs and thus 3 functional-link neurons structured as in Section 3, where the constants r_j have been chosen equal to a common value r.

As a convergence measure, an interference residual has been defined as the sum of the n^2 - n smallest squared entries of the product V \stackrel{\rm def}{=} WM, as in []. In fact, since V represents the overall source-to-output transfer matrix, perfect separation would imply a quasi-diagonal form of it, i.e., only one entry per row different from zero [4]; in a real-world context, however, some residual interference has to be tolerated. Figure 4 shows the interference residual; it tells that the algorithm has been able to separate the three eterokurtic sources.

Figure 4. Interference residual.

Figure 5 depicts instead the Noise-to-Signal Ratios (NSRs) of the three network outputs. The NSR at the output of each neuron measures the power of the residual interference (the `noise') with respect to the power of the separated source signal pertaining to that neuron. Formally, it is defined as [5]:

\mathrm{NSR}_j \stackrel{\rm def}{=} 10 \log_{10} \frac{\sum_{i \neq k} v_{j,i}^2}{v_{j,k}^2} , \qquad k \stackrel{\rm def}{=} \arg\max_i v_{j,i}^2 .

Figure 5. Noise-to-signal ratio (NSR) in dB.

Figure 6 illustrates the functions h_j and their derivatives \psi_j = h_j' as learnt by the functional-link units. Consider, for instance, neurons 1 and 2. Due to the definition of the sources s_1 and s_2, the pdf of the signal s_1 consists of two Dirac deltas centered at -1 and +1, while the pdf of s_2 is constant within [-1, +1] and null elsewhere. The approximation capability of neurons 1 and 2 may be evaluated by observing the first two pictures from the left on the first row of Figure 6, where the solid lines represent the estimated pdfs \psi_j of the signals s_1 and s_2, respectively. They are in good accordance with the true pdfs. The dashed lines represent the derivatives of the static sigmoidal activation functions, as used by Bell and Sejnowski, for comparison: the difference between static and adaptive (flexible) non-linearities is quite evident.

The dot-dashed lines in the same pictures represent the \psi_j functions at the beginning of PFANN learning. The pictures in the second row display instead the approximated cdfs h_j, the static sigmoidal activation functions, and the coarse approximations at the beginning of the PFANN neurons' learning. The third column relates instead to the pdf-cdf approximation for a speech signal; it is common experience that the true pdf of speech is a leptokurtic (rather peaked) function, and the simulation results seem to show that the corresponding neuron is able to retain that shape in the derivative of its activation function. In conclusion, a close examination of these results shows that the neurons have effectively been able to approximate, with a degree of accuracy compatible with the number of free parameters to adapt, both the cdf and the pdf of the sources.

Figure 6. Functions h_j and h_j' = \psi_j for each neuron. (Dot-dashed: at the beginning of learning; solid: after learning; dashed: no adaptation (hyperbolic tangent), for comparison with the static case.)
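Both convergence measures used in this Section are simple functions of the overall source-to-output matrix V = WM; a small sketch is given below, where the 10 log_10 scaling of the NSR is an assumption consistent with the dB axis of Figure 5.

```python
import numpy as np

def interference_residual(W, M):
    # Sum of the n^2 - n smallest squared entries of V = W M: it approaches
    # zero when V becomes quasi-diagonal (one dominant entry per row).
    n = M.shape[0]
    V2 = np.sort(((W @ M) ** 2).ravel())
    return float(np.sum(V2[: n * n - n]))

def nsr_db(W, M):
    # Noise-to-Signal Ratio per output: residual interference power over
    # the power of the dominant (separated) source, in dB (assumed 10*log10).
    V2 = (W @ M) ** 2
    dominant = V2.max(axis=1)
    return 10.0 * np.log10((V2.sum(axis=1) - dominant) / dominant)

# toy usage on a nearly perfect 3x3 separator
rng = np.random.default_rng(4)
M = rng.normal(size=(3, 3))
W = np.linalg.inv(M) + 0.01 * rng.normal(size=(3, 3))
print(interference_residual(W, M), nsr_db(W, M))
```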

4.2. Numerical comparison of PFANN and MOD

In order to compare the MOD and PFANN algorithms effectively, we chose the learning rates of the adaptation rules so that both algorithms show approximately the same convergence speed and steady-state precision. The aim of these simulations is to prove that the new separation algorithm may exhibit the same separation capability as MOD. In Subsection 4.3 below we show that this may be obtained with a substantial reduction of computational effort.

In Experiment 4.1, we used two i.i.d. source sequences uniformly distributed within [-1, +1]. The initial conditions relative to the MOD algorithm are the same as used in [25], and in particular for each neuron we have m_j = 7. For running the PFANN network, the orders r_j and the learning rates \mu_0 = \mu_{2i+1} were set to the same small values for every i and for each neuron. Figure 7 shows the interference residuals of the PFANN and MOD algorithms, averaged over independent trials. In this case both source streams are platikurtic, and the algorithms were able to separate them out with no external `kurtosis tuning'.

Figure 7. Two noises. Solid: MOD; dot-dashed: PFANN.

In Experiment 4.2, we considered two speech sequences, both sampled at 8 kHz. The testing conditions are similar to those of Experiment 4.1, except for the learning rates, which were set to a different common value. Note that, in order to obtain the same performance from the two algorithms, we had to use the above values of r_j for PFANN and m_j = 7 for MOD. The pictures displayed in Figure 8 have the same meaning as in the preceding Experiment. In this Experiment both source signals are leptokurtic. The algorithms were able to separate them out without requiring the user to know the signs of the kurtoses or to perform any kurtosis estimation.

Figure 8. Two voices. Solid: MOD; dot-dashed: PFANN.

In Experiment 4.3, the blind separation algorithms have been tested with three source signals: a pure sinusoid, a uniformly distributed noise and a speech signal. The learning constants are m_j = 5 for MOD, with the PFANN orders r_j and learning rates \mu_0 = \mu_{2i+1} again set to common values. The results are shown in Figure 9. Since the flat noise is platikurtic while the voice stream is leptokurtic, the PFANN separating technique has shown its ability to separate out eterokurtic sources, as MOD does, again with no prior knowledge and no kurtosis estimation.

Figure 9. Sinusoid, noise, voice. Solid: MOD; dot-dashed: PFANN.

Table 1. Complexity comparison of MOD and PFANN.

Algorithm   Multiplications   Divisions    Non-linearities   Parameters
MOD         4mn               (5m + )n     mn                3mn
PFANN       (r + 8)n          (r + )n      n                 (r + )n

4.3. Structure comparison

Suppose (as in the simulations) that r_j = r and m_j = m. A direct comparison of the experimental results obtained with the functional-link network-based PFANN and the MOD technique shows that, to obtain similar performances, it is necessary to choose m > r. In order to perform a fair complexity comparison, Table 1 reports the number of operations (multiplications, divisions and generic non-linearities) involved in, and strictly needed by, the learning equations. The PFANN algorithm appears easier to implement, as it requires a reduced computational effort. For a numerical example of computational complexity, readers may refer to [8], where the times required to run PFANN and MOD on the same blind separation problems are given.

5. Conclusion

In this work, a new learning theory for functional-link neural units, based on an information-theoretic approach, has been presented; the learning and approximation capabilities shown by different units were then investigated by solving density shaping problems. The simulation results confirm the effectiveness of the proposed learning theory and the good flexibility exhibited by the non-semi-linear structures.

The aim of this paper was also to propose a novel approach for performing blind separation of eterokurtic source signals by the functional-link neural network, based upon the functional approximation ability of (low-order) polynomials. With the aim of providing flexible non-linearities, Taleb and Jutten [22] and Roth and Baram [20] proposed the use of a multilayer perceptron, Gustafsson [9] presented a technique based upon an adjustable linear combination of a series of Gaussian basis functions with fixed mean values and variances, while Xu et al. proposed to employ linear combinations of fully adjustable sigmoids (the MOD technique) [25, 26, 27]. Here we compared the new method with the closely related MOD: both the computer simulations and the structural comparison confirm that the presented approach is effective and interesting, since it attains comparable performance with a noticeable reduction of computational effort. Extensions of the present theory to other kinds of approximating functions, such as squashed truncated Fourier series, are currently under investigation, along with numerical and structural comparisons with existing techniques, in order to gain quantitative knowledge about how the proposed method relates to other approaches drawn from the scientific literature, especially from applied statistics.

Acknowledgments

This research was supported by the Italian MURST. The author wishes to thank the anonymous reviewers, whose careful and detailed suggestions allowed the quality of the paper to be significantly improved.

References

[1] S.-i. Amari, T.-P. Chen and A. Cichocki, Stability Analysis of Learning Algorithms for Blind Source Separation, Neural Networks, Vol. 10, No. 8, pp. 1345-1351, 1997
[2] A.J. Bell and T.J. Sejnowski, An Information-Maximization Approach to Blind Separation and Blind Deconvolution, Neural Computation, Vol. 7, No. 6, pp. 1129-1159, 1995
[3] S. Bellini, Bussgang Techniques for Blind Equalization, in IEEE Global Telecommunication Conf. Rec., pp. 1634-1640, Dec. 1986
[4] P. Comon, Independent Component Analysis, A New Concept?, Signal Processing, Vol. 36, pp. 287-314, 1994
[5] J. Dehaene and N. Twum-Danso, Local Adaptive Algorithms for Information Maximization in Neural Networks, Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'97), pp. 59-6, 1997
[6] S. Fiori and F. Piazza, A Study on Functional-Link Neural Units with Maximum Entropy Response, Artificial Neural Networks, Vol. II, pp. 493-498, Springer-Verlag, 1998
[7] S. Fiori, P. Bucciarelli and F. Piazza, Blind Signal Flattening Using Warping Neural Modules, Proc. of the International Joint Conference on Neural Networks (IJCNN'98), pp. 3-37, 1998
[8] S. Fiori, Blind Source Separation by New M-WARP Algorithm, Electronics Letters, Vol. 35, No. 4, pp. 269-270, Feb. 1999
[9] M. Gustafsson, Gaussian Mixture and Kernel Based Approach to Blind Source Separation Using Neural Networks, Artificial Neural Networks, pp. 869-874, Springer-Verlag, 1998
[10] A. Hyvärinen and E. Oja, Independent Component Analysis by General Non-Linear Hebbian-Like Rules, Signal Processing, Vol. 64, No. 3, pp. 301-313, 1998
[11] S. Laughlin, A Simple Coding Procedure Enhances a Neuron's Information Capacity, Z. Naturforsch., Vol. 36, pp. 910-912, 1981
[12] R. Linsker, An Application of the Principle of Maximum Information Preservation to Linear Systems, in Advances in Neural Information Processing Systems (NIPS*88), pp. 186-194, Morgan Kaufmann, 1989
[13] R. Linsker, Local Synaptic Rules Suffice to Maximize Mutual Information in a Linear Network, Neural Computation, Vol. 4, pp. 691-702, 1992
[14] P. Moreland, Mixture of Experts Estimate A-Posteriori Probabilities, Artificial Neural Networks, pp. 499-55, Springer-Verlag, 1997
[15] J.-P. Nadal, N. Brunel and N. Parga, Nonlinear Feedforward Networks with Stochastic Inputs: Infomax Implies Redundancy Reduction, Network: Computation in Neural Systems, Vol. 9, May 1998
[16] D. Obradovic and G. Deco, Unsupervised Learning for Blind Source Separation: An Information-Theoretic Approach, Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'97), pp. 7-3, 1997
[17] M.D. Plumbley, Efficient Information Transfer and Anti-Hebbian Neural Networks, Neural Networks, Vol. 6, pp. 823-833, 1993
[18] M.D. Plumbley, Approximating Optimal Information Transmission Using Local Hebbian Algorithm in a Double Feedback Loop, Artificial Neural Networks, pp. 435-44, Springer-Verlag, 1993
[19] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley Publishing Company, 1989 (Chapter 8)
[20] Z. Roth and Y. Baram, Multidimensional Density Shaping by Sigmoids, IEEE Trans. on Neural Networks, Vol. 7, No. 5, pp. 1291-1298, Sept. 1996
[21] A. Sudjianto and M.H. Hassoun, Nonlinear Hebbian Rule: A Statistical Interpretation, Proc. of the International Conference on Neural Networks (ICNN'94), pp. 47-5, 1994
[22] A. Taleb and C. Jutten, Entropy Optimization - Application to Source Separation, Artificial Neural Networks, pp. 59-534, Springer-Verlag, 1997
[23] K. Torkkola, Blind Deconvolution, Information Maximization and Recursive Filters, Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'97), pp. 33-334, 1997
[24] V. Vapnik, The Support Vector Method, Artificial Neural Networks, pp. 63-7, Springer-Verlag, 1997
[25] L. Xu, C.C. Cheung, H.H. Yang and S.-i. Amari, Independent Component Analysis by the Information-Theoretic Approach with Mixture of Densities, Proc. of the International Joint Conference on Neural Networks (IJCNN'98), pp. 8-86, 1998
[26] L. Xu, C.C. Cheung, J. Ruan and S.-i. Amari, Nonlinearity and Separation Capability: Further Justifications for the ICA Algorithm with a Learned Mixture of Parametric Densities, Proc. of the European Symposium on Artificial Neural Networks (ESANN'97), pp. 9-96, 1997
[27] L. Xu, C.C. Cheung and S.-i. Amari, Nonlinearity, Separation Capability and Learned Parametric Mixture ICA Algorithms, to appear in a special issue of the International Journal of Neural Systems, 1998

Entropy Optimization, PFANN Network and Blind Separation 6 [8] S. Fiori, Blind Source Separation by New M{WARP Algorithm, Electronics Letters, Vol. 35, No. 4, pp. 69 { 7, Feb. 999 [9] M. Gustaffson, Gaussian Mixture and Kernel Based Approach to Blind Source Separation Using Neural Networks, Articial Neural Networks, Vol., pp. 869 { 874, 998, Springer- Verlag [] A. Hyvarinen and E. Oja, Independent Component Analysis by General Non-Linear Hebbian- Like Rules, Signal Processing, Vol. 64, No. 3, pp. 3 { 33, 998 [] S. Laughlin, A Simple Coding Procedure Enhances a Neuron's Information Capacity, Z. Naturforsch, Vol. 36, pp. 9 { 9, 98 [] R. Linsker, An Application of the Principle of Maximum Information Preservation to Linear Systems, in Advances in Neural Information Processing System, (NIPS*88), pp. 86 { 94, Morgan-Kaufmann, 989 [3] R. Linsker, Local Synaptic Rules Suce to Maximize Mutual Information in a Linear Network, Neural Computation, Vol. 4, pp. 69 { 7, 99 [4] P. Moreland, Mixture of Experts Estimate A-Posteriori Probabilities, Articial Neural Networks, pp. 499 { 55, Springer-Verlag, 997 [5] J.P. Nadal, N. Brunel, and N. Parga, Nonlinear Feedforward Networks with Stochastic Inputs: Infomax Implies Redundancy Reduction, Network: Computation in neural Systems, Vol. 9, No., May 998 [6] D. Obradovic and G. Deco, Unsupervised Learning for Blind Source Separation: An Information-Approach, Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP'97), pp. 7 { 3, 997 [7] M.D. Plumbley, Ecient Information Transfer and Anti-Hebbian Neural Networks, Neural Networks, Vol. 6, pp. 83 { 833, 993 [8] M.D. Plumbley, Approximating Optimal Information Transmission Using Local Hebbian Algorithm in a Double Feedback Loop, Articial Neural Networks, pp. 435 { 44, Springer- Verlag, 993 [9] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley Publishing Company, 989 (Chpt. 8) [] Z. Roth and Y. Baram, Multidimensional Density Shaping by Sigmoids, IEEE Trans. on Neural Networks, Vol. 7, No. 5, pp. 9 { 98, Sept. 996 [] A. Sudjianto and M.H. Hassoun, Nonlinear Hebbian Rule: A Statistical Interpretation, Proc. of International Conference on Neural Networks (ICNN'94), Vol., pp. 47 { 5, 994 [] A. Taleb and C. Jutten, Entropy Optimization - Application to Source Separation, Articial Neural Networks, pp. 59 { 534, Springer-Verlag, 997 [3] K. Torkkola, Blind Deconvolution, Information Maximization and Recursive Filters, Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP'97), pp. 33 { 334, 997 [4] V. Vapnik, The Support Vector Method, Articial Neural Networks, pp. 63 { 7, Springer- Verlag, 997 [5] L. Xu, C.C. Cheung, H.H. Yang, and S.-i. Amari, Independent Component Analysis by the Information-Theoretic Approach with Mixture of Densities, Proc. of International Joint Conference on Neural Networks (IJCNN'98), pp. 8 { 86, 998 [6] L. Xu, C.C. Cheung, J. Ruan, and S.-i. Amari, Nonlinearity and Separation Capability: Further Justications for the ICA Algorithm with a Learned Mixture of Parametric Densities, Proc. of European Symposium on Articial Neural Networks (ESANN'97), pp. 9 { 96, 997 [7] L. Xu, C.C. Cheung, and S.-i. Amari, Nonlinearity, Separation Capability and Learned Parametric Mixture ICA Algorithms, to appear on a special issue of the International Journal of Neural Systems, 998 [8] H.H. Yang and S.-i. 
Amari, Adaptive Online Learning Algorithms for Blind Separation: Maximum Entropy and Minimal Mutual Information, Neural Computation, Vol. 9, pp. 457 { 48, 997 [9] Y. Yang and A.R. Barron, An Asymptotic Property of Model Selection Criteria, IEEE Trans. on Information Theory, Vol. 44, No., pp. 95 { 6, Jan. 998 [3] S.M. Zurada, Introduction to Neural Articial Systems, West Publishing Company, 99