Robust techniques for independent component analysis (ICA) with noisy data

A. Cichocki*, S.C. Douglas, S. Amari

Brain Science Institute, Riken, Brain Information Processing Group, 2-1 Hirosawa, Wako-shi, Saitama, Japan
Department of Electrical Engineering, University of Utah, Salt Lake City, UT 84112, USA

Neurocomputing 22 (1998). Accepted 3 July 1998.

Abstract

In this contribution, we propose approaches to independent component analysis (ICA) for the case in which the measured signals are contaminated by additive noise. We extend existing adaptive algorithms with equivariant properties in order to considerably reduce the bias in the demixing matrix caused by measurement noise. Moreover, we describe a novel recurrent dynamic neural network for the simultaneous estimation of the unknown mixing matrix, blind source separation, and reduction of noise in the extracted output signals. We discuss the optimal choice of nonlinear activation functions for various noise distributions, assuming a generalized-Gaussian-distributed noise model. Computer simulations of a selected approach are provided that confirm its usefulness and excellent performance. (c) 1998 Elsevier Science B.V. All rights reserved.

Keywords: Independent component analysis (ICA); Bias removal; Noise cancellation; Natural gradient; Blind source separation; Maximum likelihood

1. Introduction

Recently, a number of efficient adaptive, on-line learning algorithms have been developed for ICA [1-19,21-31]. Although the underlying principles and approaches are different, many of the techniques have very similar forms. Most of these algorithms assume that any measurement noise within the mixed signals can be neglected.

* Corresponding author: cia@brain.riken.go.jp. A. Cichocki is on leave from the Warsaw University of Technology, Poland.

However, in real-world applications most measured signals are contaminated by additive noise. Thus the problem arises of efficiently reducing the influence of noise on the performance of ICA algorithms; in particular, methods are desired to reduce the noise in the stochastically independent extracted components. This paper addresses this difficult and challenging problem [6,7,21,27].

There are at least three definitions of ICA. In this paper, we employ the two distinct definitions given below.

Definition 1. The ICA of a noisy random vector $x(t) = [x_1(t), \ldots, x_m(t)]^T$ is obtained by finding an $n \times m$ (with $m \ge n$), full-rank, linear transformation matrix $W$ such that the output signal vector $y(t) = [y_1(t), \ldots, y_n(t)]^T$, defined as

$y(t) = W x(t)$, (1)

contains source components that are as independent as possible, as measured by an information-theoretic cost function such as the minimum Kullback-Leibler divergence.

Definition 2. For a random noisy vector $x(t)$ defined as

$x(t) = H s(t) + \nu(t)$, (2)

where $H$ is an $m \times n$ mixing matrix ($m \ge n$), $s(t) = [s_1(t), \ldots, s_n(t)]^T$ is a source vector of stochastically independent signals, and $\nu(t)$ is a vector of uncorrelated noises, ICA is obtained by estimating both the mixing matrix $H$ and the additive noise $\nu(t)$.

As the estimation of a separating (de-mixing) matrix $W$ and/or a mixing matrix $H$ in the presence of noise is rather difficult, the majority of past research efforts have been devoted to the noiseless case where $\nu(t) = 0$. The objective of this paper is to develop novel approaches and learning algorithms that are more robust with respect to noise than existing techniques, or that can reduce the noise in the estimated output vector $y(t)$. In this paper, we assume that the source signals and the additive noise components are mutually stochastically independent.

2. Bias removal techniques for pre-whitening and ICA algorithms

2.1. Bias removal for pre-whitening algorithms

Consider the standard decorrelation or pre-whitening algorithm for $x(t)$ given by [15,22]

$W(t+1) = W(t) + \eta(t)\,[I - y(t)\,y^T(t)]\,W(t)$ (3)

or its averaged version

$\Delta W(t) = \eta(t)\,[I - E\{y(t)\,y^T(t)\}]\,W(t)$, (4)

where $y(t) = W(t)\,x(t)$ and $E\{\cdot\}$ denotes statistical expectation. When $x(t)$ is noisy, such that $x(t) = \hat{x}(t) + \nu(t)$, where $\hat{x}(t)$ and $\hat{y}(t) = W(t)\,\hat{x}(t)$ are the noiseless estimates of the input and output vectors, respectively, it is easy to show that the additive noise $\nu(t)$ within $x(t)$ introduces a bias into the estimated decorrelation matrix $W$. The covariance matrix of the output can be evaluated as

$R_{yy} = E\{y(t)\,y^T(t)\} = W R_{\hat{x}\hat{x}} W^T + W R_{\nu\nu} W^T$, (5)

where $R_{\hat{x}\hat{x}} = E\{\hat{x}(t)\,\hat{x}^T(t)\}$ and $R_{\nu\nu} = E\{\nu(t)\,\nu^T(t)\}$. Assuming that the covariance matrix of the noise is known (e.g. $R_{\nu\nu} = \sigma_\nu^2 I$) or can be estimated, a proposed modified algorithm employing bias removal is given by

$\Delta W(t) = \eta(t)\,[I - E\{y(t)\,y^T(t)\} + W(t)\,R_{\nu\nu}\,W^T(t)]\,W(t)$. (6)

The stochastic gradient version of this algorithm for $R_{\nu\nu} = \sigma_\nu^2 I$ is

$\Delta W(t) = \eta(t)\,[I - y(t)\,y^T(t) + \sigma_\nu^2\,W(t)\,W^T(t)]\,W(t)$. (7)
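To make the data model (2) and the bias-removal rule (7) concrete, the following NumPy sketch generates a noisy mixture and applies the stochastic whitening update sample by sample; the dimensions, noise level, and learning rate are illustrative choices of ours, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 3, 5000                    # sources, sensors, samples (illustrative)

# Independent sources: one uniform and two binary, echoing the later experiments.
s = np.vstack([rng.uniform(-1.0, 1.0, T),
               rng.choice([-1.0, 1.0], T),
               rng.choice([-1.0, 1.0], T)])
H = rng.standard_normal((m, n))         # unknown mixing matrix
sigma2 = 0.01                           # noise variance, R_vv = sigma2 * I assumed known
x = H @ s + np.sqrt(sigma2) * rng.standard_normal((m, T))   # Eq. (2)

def whiten_step(W, xt, sigma2, eta=1e-3):
    """One stochastic update of Eq. (7): bias-corrected decorrelation.

    Without the sigma2 * W W^T term this reduces to the standard rule (3);
    the extra term cancels, on average, the noise contribution to y y^T.
    """
    y = W @ xt
    G = np.eye(W.shape[0]) - np.outer(y, y) + sigma2 * (W @ W.T)
    return W + eta * (G @ W)

W = 0.5 * np.eye(n)
for t in range(T):
    W = whiten_step(W, x[:, t], sigma2)
```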

2.2. Bias removal for ICA adaptive algorithms

A technique similar to that described above can be applied to remove the coefficient bias for a class of natural gradient algorithms for ICA [17]. In what follows, we outline these algorithms and the proposed modifications.

To formulate the ICA problem, one must define an appropriate loss or cost function that depends on the parameters of the specified neural-network ICA model. Minimization of such a loss function should cause the outputs of the model to satisfy the desired statistical conditions of stochastic independence and/or temporal and spatial mutual decorrelation [2,4,5,19,26]. Minimum entropy, ICA, and maximum likelihood lead to similar expected loss functions that measure the mutual stochastic independence of the system's output signals. A unifying loss or risk function is given by the Kullback-Leibler divergence [2]

$E(y, W) = \int p_y(y)\,\log\frac{p_y(y)}{\tilde{p}(y)}\,\mathrm{d}y$, (8)

where $p_y(y)$ is the joint probability distribution of the output signal vector and $\tilde{p}(y) = \prod_i p_i(y_i)$ is the model distribution under the assumption that $y$ contains independent components. When used in a stochastic-gradient-type algorithm, this loss function can be expressed in the simple form

$\rho(y, W) = -\tfrac{1}{2}\log\det(W W^T) - \sum_i \log p_i(y_i)$, (9)

where $p_i(y_i)$ is the assumed form of the probability density function (p.d.f.) of the $i$th output signal, $\det(W W^T)$ is the determinant of the symmetric positive-definite matrix $W W^T$, and $(\cdot)^T$ is the transpose operator. The natural gradient search method [2] has emerged as a particularly useful technique for solving iterative optimization problems.
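As a quick check of the loss (9), it can be evaluated empirically on demixed samples. A minimal sketch, assuming a user-supplied elementwise log-density model log_p (the sech-type prior in the comment is only an example):

```python
import numpy as np

def ica_loss(W, X, log_p):
    """Empirical average of the instantaneous loss (9) over the columns of X.

    W     : (n, m) demixing matrix
    X     : (m, T) data matrix
    log_p : elementwise log-density model for the outputs
    """
    Y = W @ X
    _, logdet = np.linalg.slogdet(W @ W.T)
    return -0.5 * logdet - np.mean(np.sum(log_p(Y), axis=0))

# Example: sech-type model density, matching f(y) = tanh(y) in Table 1:
# loss = ica_loss(W, x, lambda y: -np.log(np.cosh(y)))
```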

Taking into account that the gradient of the loss function can be expressed as

$\frac{\partial\rho(W)}{\partial W} = -(W W^T)^{-1} W + f(y)\,x^T$, (10)

the natural gradient learning rule is [2]

$\Delta W(t) = W(t+1) - W(t) = -\eta\,\frac{\partial\rho(W)}{\partial W}\,W^T W = \eta(t)\,[I - f(y(t))\,y^T(t)]\,W(t)$, (11)

where $f(y) = [f_1(y_1), \ldots, f_n(y_n)]^T$ with

$f_i(y_i) = -\frac{\partial\log p_i(y_i)}{\partial y_i} = -\frac{p_i'(y_i)}{p_i(y_i)}$. (12)

Typical p.d.f.s and the corresponding optimal nonlinear activation functions are shown in Table 1. Alternatively, we can use the following pre-conditioned (filtered) gradient rule [3,8,13]:

$\Delta W(t) = -\eta(t)\,W(t)\,W^T(t)\,\frac{\partial\rho(W(t))}{\partial W} = \eta(t)\,[I - y(t)\,u^T(y(t))]\,W(t)$, (13)

Table 1. Typical probability density functions $p(y)$ and corresponding activation functions $f(y) = -\mathrm{d}\log p(y)/\mathrm{d}y$ (scale factors omitted where inessential):

Gaussian: $p(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp(-y^2/2\sigma^2)$; $f(y) = y/\sigma^2$
Laplace: $p(y) = \frac{1}{2\sigma}\exp(-|y|/\sigma)$; $f(y) = \mathrm{sign}(y)/\sigma$
Cauchy: $p(y) = \frac{1}{\pi\sigma\,(1+(y/\sigma)^2)}$; $f(y) = \frac{2y}{\sigma^2+y^2}$
Hyperbolic cosine: $p(y) \propto \frac{1}{\cosh(\gamma y)}$; $f(y) = \gamma\tanh(\gamma y)$
Unimodal (logistic): $p(y) \propto \frac{\exp(-2\gamma y)}{(1+\exp(-2\gamma y))^2}$; $f(y) = 2\gamma\tanh(\gamma y)$
Triangular: $p(y) = \frac{1}{a}(1-|y|/a)$ for $|y| \le a$; $f(y) = \frac{\mathrm{sign}(y)}{a(1-|y|/a)}$
Generalized Gaussian: $p(y) = \frac{r}{2a\,\Gamma(1/r)}\exp(-(|y|/a)^r)$, $r \ge 1$; $f(y) = \frac{r}{a^r}\,|y|^{r-1}\,\mathrm{sign}(y)$
Robust generalized Gaussian: $p(y) = \frac{r}{2a\,\Gamma(1/r)}\exp(-\rho(|y|)/a^r)$, $r \ge 1$, $\rho(\cdot)$ a robust function; $f(y) \propto \rho'(|y|)\,\mathrm{sign}(y)$
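The serial update (11) amounts to one matrix operation per sample. The sketch below uses the hyperbolic-tangent activation from Table 1 (a common choice for super-Gaussian sources) and is our illustration rather than the authors' code:

```python
import numpy as np

def natural_gradient_step(W, xt, eta=1e-3, f=np.tanh):
    """One stochastic update of the natural gradient rule (11):
    W <- W + eta * (I - f(y) y^T) W, with y = W x."""
    y = W @ xt
    return W + eta * ((np.eye(len(y)) - np.outer(f(y), y)) @ W)

# The filtered-gradient rule (13) swaps the roles of the nonlinearity:
# its update uses (I - y u(y)^T) instead of (I - f(y) y^T).
```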

where $u(y) = [g_1(y_1), \ldots, g_n(y_n)]^T$ and the nonlinearities $g_i(y_i)$ are now inverse (dual) to the functions $f_i(y_i) = -p_i'(y_i)/p_i(y_i)$: for example, instead of $f_i(y_i) = y_i^{1/3}$ we use the cubic function $g_i(y_i) = y_i^3$, and instead of $f_i(y_i) = \tanh(y_i)$ we can use the inverse function $g_i(y_i) = \operatorname{artanh}(y_i) = \tfrac{1}{2}\log\frac{1+y_i}{1-y_i}$.

The performance of these learning algorithms depends strongly on the shapes of the activation functions $f_i(\cdot)$ and $g_i(\cdot)$. Moreover, the optimal selection of these functions depends on the p.d.f.s of the source signals. It has been indicated via analysis and simulation that for the specific choice of nonlinearities $f_i(y_i) = \alpha_i y_i + \tanh(\beta_i y_i)$, the learning rule in Eq. (11) is able to successfully separate the source signals if all the source p.d.f.s are heavy-tailed, similar to super-Gaussian signals, whereas the learning rule in Eq. (13) can separate source signals with light-tailed p.d.f.s, similar to sub-Gaussian signals. Alternatively, when $f_i(y_i) = \alpha_i y_i + y_i^3$, the algorithm in Eq. (11) can separate sub-Gaussian signals, whereas the algorithm in Eq. (13) can separate super-Gaussian signals. However, if the measured signals $x_i(k)$ contain mixtures of both sub-Gaussian and super-Gaussian sources, then these algorithms may fail to separate the signals reliably, in which case other approaches have been suggested [16,19].

Note that the above two learning rules can be combined to form a more general and flexible universal learning rule [11-13]

$\Delta W(t) = \eta(t)\,[I - f[y(t)]\,u^T[y(t)]]\,W(t)$, (14)

where $f[y(t)]$ and $u[y(t)]$ are suitably designed nonlinear functions, e.g.

$f_i(y_i) = \tanh(\beta_i y_i)$ for $\kappa_i(y_i) > \delta$, and $f_i(y_i) = \mathrm{sign}(y_i)\,|y_i|^{r_i}$ otherwise, (15)

$g_i(y_i) = \mathrm{sign}(y_i)\,|y_i|^{r_i}$ for $\kappa_i(y_i) > -\delta$, and $g_i(y_i) = \tanh(\beta_i y_i)$ otherwise, (16)

where $r_i \ge 2$, $\kappa_i(y_i) = E\{y_i^4\}/E^2\{y_i^2\} - 3$ is the normalized kurtosis, and $\delta \ge 0$ is a small threshold. The value of the kurtosis can be evaluated on-line using

$E\{y_i^q(k+1)\} = (1-\eta_0)\,E\{y_i^q(k)\} + \eta_0\,y_i^q(k)$, $(q = 2, 4)$. (17)

The learning algorithm (14)-(16) monitors and estimates the statistics of each output signal and, depending on the sign or value of its normalized kurtosis (a measure of the distance from Gaussianity), automatically selects (or switches between) suitable nonlinear activation functions, such that successful (stable) separation of all non-Gaussian source signals is possible; a sketch of this scheme is given below. In this approach the activation functions are adaptive, time-varying nonlinearities.
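A hedged sketch of the universal rule (14) with the kurtosis-driven switching of Eqs. (15)-(17); the exponent r = 3, gain beta = 1, threshold delta, and moment step size below are illustrative choices of ours:

```python
import numpy as np

class UniversalICA:
    """Rule (14) with activation switching via online kurtosis, Eqs. (15)-(17)."""

    def __init__(self, n, beta=1.0, r=3, delta=0.1, eta_mom=0.01):
        self.W = np.eye(n)
        self.m2 = np.ones(n)        # running E{y_i^2}
        self.m4 = 3.0 * np.ones(n)  # running E{y_i^4}
        self.beta, self.r, self.delta, self.eta_mom = beta, r, delta, eta_mom

    def step(self, xt, eta=1e-3):
        y = self.W @ xt
        # Eq. (17): exponentially weighted moment estimates, q = 2 and 4.
        self.m2 = (1 - self.eta_mom) * self.m2 + self.eta_mom * y**2
        self.m4 = (1 - self.eta_mom) * self.m4 + self.eta_mom * y**4
        kappa = self.m4 / self.m2**2 - 3.0        # normalized kurtosis
        power = np.sign(y) * np.abs(y)**self.r
        tanh_ = np.tanh(self.beta * y)
        # Eqs. (15)-(16): tanh for super-Gaussian outputs, power law otherwise.
        f = np.where(kappa > self.delta, tanh_, power)
        g = np.where(kappa > -self.delta, power, tanh_)
        n = len(y)
        self.W += eta * ((np.eye(n) - np.outer(f, g)) @ self.W)
        return y
```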

It should be noted that nonlinearities of the form $f_i(y_i) = \tanh(\beta_i y_i)$ or $g_i(y_i) = \tanh(\beta_i y_i)$ provide a degree of robustness to outliers that is not shared by nonlinearities of the form $f_i(y_i) = \mathrm{sign}(y_i)\,|y_i|^{r_i}$ (for $r_i \ge 3$) or $f_i(y_i) = \alpha_i y_i + \mathrm{sign}(\kappa(y_i))\tanh(\beta_i y_i)$. For these choices, the parameters $\beta_i \ge 2$ can be either fixed in value or adapted during the learning process as [9,24,30]

$\Delta\beta_i(k) = -\eta_\beta\,\frac{\partial\log(p_i(y_i))}{\partial\beta_i} \approx \eta_\beta\!\left[\frac{\log\cosh(\beta_i(k)\,y_i(k))}{\beta_i^2(k)} - \frac{y_i(k)\,\tanh(\beta_i(k)\,y_i(k))}{\beta_i(k)} - \frac{0.729}{\beta_i^2(k)+1.397}\right]$. (18)

The learning algorithms (11)-(14) have been shown to possess excellent performance when separating noiseless signal mixtures; however, their performance deteriorates with noisy measurements due to undesirable coefficient biases and the presence of noise in the separated signals. In order to estimate the coefficient biases, we determine Taylor series expansions of the nonlinearities $f_i(y_i)$ and $g_i(y_i)$ about the estimated noiseless values $\hat{y}_i$. The generalized covariance matrix $R_{fg}$ can be approximately evaluated as [17]

$R_{fg} = E\{f[y(t)]\,u^T[y(t)]\} = E\{f[\hat{y}(t)]\,u^T[\hat{y}(t)]\} + K_f\,W R_{\nu\nu} W^T\,K_g$, (19)

where $K_f$ and $K_g$ are diagonal matrices with entries $k_{fi} = E\{\mathrm{d}f_i(\hat{y}_i(t))/\mathrm{d}y_i\}$ and $k_{gi} = E\{\mathrm{d}g_i(\hat{y}_i(t))/\mathrm{d}y_i\}$, respectively. Thus, a modified adaptive learning algorithm with reduced coefficient bias has the form

$\Delta W(t) = \eta(t)\,[I - f[y(t)]\,u^T[y(t)] + K_f\,W(t) R_{\nu\nu} W^T(t)\,K_g]\,W(t) = \eta(t)\,[I - f[y(t)]\,u^T[y(t)] + C \odot (W(t) R_{\nu\nu} W^T(t))]\,W(t)$, (20)

where $C = [c_{ij}]$ is an $n \times n$ scaling matrix with entries $c_{ij} = k_{fi}\,k_{gj}$ and $\odot$ denotes the Hadamard product. In the special case when all of the source distributions are identical, $f_i(\cdot) = f(\cdot)$ and $g_i(\cdot) = g(\cdot)$ for all $i$, and $R_{\nu\nu} = \sigma_\nu^2 I$, the bias correction term simplifies to $B = \sigma_\nu^2\,k_f\,k_g\,W W^T$.

It is interesting to note that we can almost always select nonlinearities such that the global scaling coefficient $c = k_f k_g$ is close to zero for a wide class of signals. For example, when $f_i(y_i) = |y_i|^{r}\,\mathrm{sign}(y_i)$ and $g_i(y_i) = \tanh(\beta y_i)$ are chosen, or when $f_i(y_i) = \tanh(\beta y_i)$ and $g_i(y_i) = |y_i|^{r}\,\mathrm{sign}(y_i)$ are chosen, the scaling coefficient equals $c = k_f k_g = r\beta\,E\{|y_i(t)|^{r-1}\}\,[1 - E\{\tanh^2(\beta y_i(t))\}]$ for $r \ge 1$, which is smaller over the range $|y_i| \le 1$ than would be the case if $g_i(y_i) = y_i$ were chosen. Moreover, we can optimally design the parameters $r$ and $\beta$ so that, within a specified range of $y_i$, the absolute value of the scaling coefficient $c = k_f k_g$ is minimal.

Another possible way to mitigate the coefficient bias is to employ nonlinearities of the form $\tilde{f}_i(y_i) = f_i(y_i) - \alpha_i y_i$ and $g_i(y_i) = y_i$ with $\alpha_i \ge 0$. The motivation behind the use of the linear terms $-\alpha_i y_i$ is to reduce the values of the scaling coefficients to $c_i = k_{fi} - \alpha_i$, as well as to reduce the influence of large outliers. Alternatively, we can use the generalized Fahlman functions given by $\tanh(\beta_i y_i) - \alpha_i y_i$ for either $f_i(\cdot)$ or $g_i(\cdot)$, where appropriate [18,19].
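In the identical-distribution special case, the correction in Eq. (20) is just the term $B = \sigma_\nu^2 k_f k_g W W^T$. The sketch below assumes the caller supplies the scalars kf and kg approximating $E\{f'(y_i)\}$ and $E\{g'(y_i)\}$; the commented example values are illustrative only:

```python
import numpy as np

def bias_removed_ica_step(W, xt, f, g, sigma2, kf, kg, eta=1e-3):
    """One update of Eq. (20) with R_vv = sigma2 * I and identical source models.

    kf, kg : scalar approximations of E{f'(y_i)} and E{g'(y_i)} at the
             noiseless outputs (the diagonals of K_f, K_g in Eq. (19)).
    """
    y = W @ xt
    B = sigma2 * kf * kg * (W @ W.T)               # bias-correction term
    G = np.eye(len(y)) - np.outer(f(y), g(y)) + B
    return W + eta * (G @ W)

# Example pairing from the text: f(y) = |y|^r sign(y) with g(y) = tanh(beta*y),
# e.g. bias_removed_ica_step(W, xt, lambda y: np.abs(y)**3 * np.sign(y),
#                            lambda y: np.tanh(5*y), 0.01, kf=1.0, kg=0.5)
```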

One disadvantage of these proposed techniques for bias removal is that some of the equivariant properties of the resulting algorithm are lost when a bias-compensating term is added, and thus the algorithm may perform poorly or even fail to separate sources if the mixing matrix is very ill-conditioned. For this reason, it is necessary to design nonlinearities which correspond as closely as possible to those produced from the true p.d.f.s of the source signals while also maximally reducing the coefficient bias caused by noise.

2.3. Computer simulation experiments

We now illustrate the behavior of the bias removal algorithm in Eq. (20) via simulation; more illustrative examples are provided in [17]. In this example, a fixed $3 \times 3$ mixing matrix $H$ (21) is employed. Three independent random sources (one uniform-$[-1,1]$-distributed and two binary-$\pm 1$-distributed) are generated, and Eq. (2) is used to create $x(t)$, where each $\nu(t)$ is a jointly Gaussian random vector with covariance $R_{\nu\nu} = \sigma_\nu^2 I$ and $\sigma_\nu^2 = 0.01$. Here, $f_i(y) = y^3$ and $g_i(y) = y$ for all $1 \le i \le 3$, and a small constant learning rate $\eta(t)$ is used. Twenty trials were run, in which $W(0)$ was a different random matrix such that $W(0)\,W^T(0) = 0.25\,I$, and ensemble averages were taken in each case. Fig. 1 shows the evolution of the performance factor $\zeta(t)$, defined as

$\zeta(t) = \frac{1}{m}\sum_{i=1}^{m}\left[\frac{\sum_{j=1}^{m} b_{ij}^2(t)}{\max_l b_{il}^2(t)} - 1\right]$, (22)

for each algorithm, where $m = 3$ and $b_{ij}(t)$ is the $(i,j)$th element of the combined system matrix $W(t)H$. The value of $\zeta(t)$ measures the average source-signal crosstalk that would appear in the output signals $y_i(t)$ if no noise were present. As can be seen, the original algorithm yields a biased estimate of $W(t)$, whereas the bias removal algorithm achieves a crosstalk level that is about 7 dB lower. Also shown for comparison is the original algorithm with no measurement noise; the new algorithm's performance approaches this idealized case for small learning rates.

Fig. 1. Ensemble-averaged value of the performance factor for uncorrelated measurement noise in the first example: dotted line, original algorithm (14) with noise; dashed line, bias removal algorithm (20) with noise; continuous line, original algorithm (14) without noise.
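The crosstalk measure (22) is easy to compute from the combined matrix $B(t) = W(t)H$; a small sketch under our reading of the formula:

```python
import numpy as np

def crosstalk_index(W, H):
    """Performance factor of Eq. (22) for the combined system B = W H.

    Each row's squared entries are summed relative to that row's dominant
    entry; the index is zero iff B is a scaled permutation matrix.
    """
    P = (W @ H)**2
    return float(np.mean(P.sum(axis=1) / P.max(axis=1) - 1.0))
```

Curves such as those in Fig. 1 would then be obtained by plotting this index on a logarithmic (e.g. $10\log_{10}$) scale over the iterations.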

3. Recurrent neural network approach for noise cancellation

3.1. Basic concept and algorithm derivation

Assume that we have successfully obtained an unbiased estimate of the separating matrix $W$ via one of the previously described approaches. Then we can estimate the mixing matrix as $\hat{H} = W^{+} = H P D$, where $W^{+}$ is the pseudo-inverse of $W$, $P$ is an $n \times n$ permutation matrix, and $D$ is an $n \times n$ non-singular diagonal scaling matrix. We now propose approaches for cancelling the effects of noise in the estimated source signals. In order to develop a viable neural-network approach for noise cancellation, we define the error vector

$e(t) = x(t) - \hat{H}\,\hat{y}(t)$, (23)

where $e(t) = [e_1(t), \ldots, e_m(t)]^T$ and $\hat{y}(t)$ is an estimate of the source vector $s(t)$. To compute $\hat{y}(t)$, consider the minimum entropy (ME) cost function

$E(e(t)) = -\sum_i E\{\log p_i(e_i(t))\}$, (24)

where $p_i(e_i)$ is the true p.d.f. of the additive noise $\nu_i(t)$. It should be noted that we have assumed that the noise sources are i.i.d.; thus, stochastic gradient descent of the ME function yields stochastic independence of the error components as well as the minimization of their magnitudes in an optimal way. The resulting system of differential equations is

$\frac{\mathrm{d}\hat{y}(t)}{\mathrm{d}t} = \mu(t)\,\hat{H}^T\,\Psi[e(t)]$, (25)

where $\Psi[e(t)] = [\Psi_1[e_1(t)], \ldots, \Psi_m[e_m(t)]]^T$ with nonlinearities

$\Psi_i(e_i) = -\frac{\partial\log p_i(e_i)}{\partial e_i}$. (26)

A block diagram illustrating the implementation of the above algorithm is shown in Fig. 2, where "Learning Algorithm" denotes an appropriate bias removal learning rule such as (20). In the proposed algorithm, the optimal choices of the nonlinearities $\Psi_i(e_i)$ depend on the noise distributions. Assume that all of the noise signals have generalized Gaussian distributions of the form [20]

$p_i(e_i) = \frac{r_i}{2\sigma_i\,\Gamma(1/r_i)}\exp\!\left(-\frac{1}{r_i}\left|\frac{e_i}{\sigma_i}\right|^{r_i}\right)$, (27)

where $r_i > 0$ is a variable parameter, $\Gamma(r) = \int_0^\infty u^{r-1}\exp(-u)\,\mathrm{d}u$ is the gamma function, and $\sigma_i^{r_i} = E\{|e_i|^{r_i}\}$ is a generalized measure of the noise variance known as the dispersion. Note that a unity value of $r_i$ yields a Laplacian distribution, the value $r_i = 2$ yields the standard Gaussian distribution, and $r_i \to \infty$ yields a uniform distribution. In general, we can select any value $r_i \ge 1$, in which case the locally optimal nonlinear activation functions are of the form

$\Psi_i(e_i) = -\frac{\partial\log(p_i(e_i))}{\partial e_i} = |e_i|^{r_i-1}\,\mathrm{sign}(e_i)$, $r_i \ge 1$, (28)

up to a positive scale factor that can be absorbed into the learning rate. For very impulsive (spiky) noise with a high value of kurtosis, the optimal parameter $r_i$ typically takes a value between zero and one. In such cases, we can use the modified activation functions $\Psi_i(e_i) = e_i/[|e_i|^{2-r_i} + \varepsilon]$, where $\varepsilon$ is a small positive constant, to avoid the singularity of the function at $e_i = 0$.

Fig. 2. Neural network architecture for estimating the separating matrix and efficient noise reduction.
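A sketch of the activation family (28), including the regularized variant for impulsive noise; the constant eps is our choice:

```python
import numpy as np

def psi_generalized_gaussian(e, r, eps=1e-3):
    """Locally optimal activation for generalized-Gaussian noise, Eq. (28).

    r = 2 gives the linear (Gaussian-noise) case and r = 1 the sign
    function appropriate for Laplacian noise; for 0 < r < 1 the
    regularized form avoids the singularity at e = 0.
    """
    e = np.asarray(e, dtype=float)
    if r >= 1.0:
        return np.sign(e) * np.abs(e)**(r - 1.0)
    return e / (np.abs(e)**(2.0 - r) + eps)
```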

Moreover, when we do not have exact a priori knowledge about the noise distributions, we can adapt the value of $r_i(t)$ for each error signal $e_i(t)$ according to its estimated distance from Gaussianity. A simple gradient-based rule for adjusting each parameter $r_i(t)$ is

$\Delta r_i(t) = -\eta_r\,\frac{\partial\log(p_i(e_i))}{\partial r_i}$ (29)

$\approx \frac{\eta_r}{r_i(t)}\left[0.1\,r_i(t) + |e_i(t)|^{r_i(t)}\left(1 - \log|e_i(t)|^{r_i(t)}\right)\right]$. (30)

Similar methods can be applied for other parameterized noise distributions. For example, when $p_i(e_i)$ is a generalized Cauchy distribution, then $\Psi_i(e_i) = [(\nu r_i + 1)/(\nu\,a(r_i) + |e_i|^{r_i})]\,|e_i|^{r_i-1}\,\mathrm{sgn}(e_i)$. Similarly, for the generalized Rayleigh distribution one obtains $\Psi_i(e_i) = e_i\,|e_i|^{r_i-2}$ for complex-valued signals and coefficients.

It should be noted that the continuous-time algorithm in Eq. (25) can easily be converted into the discrete-time algorithm

$\hat{y}(t+1) = \hat{y}(t) + \eta(t)\,\hat{H}^T(t)\,\Psi[e(t)]$. (31)

The proposed system in Fig. 2 can be considered a form of nonlinear post-processing that effectively reduces the additive noise components in the estimated source signals. In the next subsection, we propose a more efficient architecture that simultaneously estimates the mixing matrix $H$ while reducing the amount of noise in the separated sources.

3.2. Simultaneous estimation of a mixing matrix and noise reduction

Consider a recurrent neural network for noisy ICA with the same number of inputs and outputs ($m = n$), described by

$y(t) = x(t) - \hat{W}(t)\,y(t)$. (32)

For $m = n$ this model is equivalent to the previously described model, since $y(t) = [I + \hat{W}(t)]^{-1} x(t) = W(t)\,x(t)$ with $W(t) = [I + \hat{W}(t)]^{-1}$. It is easy to derive an equivariant learning algorithm for this model, given by [13]

$\frac{\mathrm{d}\hat{W}(t)}{\mathrm{d}t} = -\mu_0(t)\,[I + \hat{W}(t)]\,[I - f[y(t)]\,u^T[y(t)]]$. (33)

Since the estimated mixing matrix $\hat{H}$ can be expressed as

$\hat{H} = W^{-1} = \hat{W} + I$, (34)

we replace the output vector $y(t)$ by the improved estimate $\hat{y}(t)$ to derive a novel learning algorithm (see Fig. 3) as

$\frac{\mathrm{d}\hat{H}(t)}{\mathrm{d}t} = -\mu_0(t)\,\hat{H}(t)\,[I - f[\hat{y}(t)]\,u^T[\hat{y}(t)]]$ (35)

and

$\frac{\mathrm{d}\hat{y}(t)}{\mathrm{d}t} = \mu(t)\,\hat{H}^T\,\Psi[e(t)]$, (36)

or, in discrete time,

$\Delta\hat{H}(t) = \hat{H}(t+1) - \hat{H}(t) = -\eta_0(t)\,\hat{H}(t)\,[I - f[\hat{y}(t)]\,u^T[\hat{y}(t)]]$ (37)

and

$\hat{y}(t+1) = \hat{y}(t) + \eta(t)\,\hat{H}^T(t)\,\Psi[e(t)]$, (38)

where $e(t) = x(t) - \hat{H}(t)\,\hat{y}(t)$ and $x(t) = \hat{x}(t) + \nu(t)$. A functional block diagram illustrating this algorithm's implementation is shown in Fig. 3.

Fig. 3. Neural network architecture for simultaneous noise reduction and mixing matrix estimation in discrete time $t = k$ $(k = 0, 1, 2, \ldots)$.
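The coupled recursions (37)-(38) interleave one mixing-matrix update and one source-estimate update per sample. A minimal sketch, with nonlinearities and step sizes left to the caller (the defaults are ours):

```python
import numpy as np

def simultaneous_step(H_hat, y_hat, xt, f, g, psi, eta0=1e-3, eta=1e-2):
    """One joint update of Eqs. (37)-(38).

    H_hat : (n, n) current mixing-matrix estimate (initialize e.g. to I)
    y_hat : (n,)   current denoised source estimate
    xt    : (n,)   noisy observation x(t)
    """
    n = len(y_hat)
    # Eq. (37): equivariant update of the mixing-matrix estimate.
    H_hat = H_hat - eta0 * H_hat @ (np.eye(n) - np.outer(f(y_hat), g(y_hat)))
    # Eqs. (23) and (38): error feedback refines the source estimate.
    e = xt - H_hat @ y_hat
    y_hat = y_hat + eta * (H_hat.T @ psi(e))
    return H_hat, y_hat

# Example choices echoing Section 4: f(y) = y**3, g(y) = np.tanh(10*y),
# psi(e) = e for Gaussian noise, or psi(e) = np.tanh(10*e) for impulsive noise.
```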

3.3. Pre-whitening and principal component analysis (PCA)

As an enhancement to the above approaches, we can apply, as a preprocessing step, either pre-whitening or principal component analysis to the measured sensor signals, either to reduce the effects of poor data conditioning or to reduce the effects of noise when $m \ge n$. This preprocessing step is represented in Fig. 3 by the $n \times m$ matrix $Q$. Pre-whitening for noisy data can be performed using the learning algorithm (7):

$\Delta Q(t) = \eta(t)\,[I - \tilde{x}(t)\,\tilde{x}^T(t) + \sigma_\nu^2\,Q(t)\,Q^T(t)]\,Q(t)$, (39)

where $\tilde{x}(t) = Q\,x(t) = Q\,(H s(t) + \nu(t))$.

Alternatively, for a nonsingular covariance matrix $R_{xx} = E\{x(t)\,x^T(t)\}$ with $m = n$, we can use the standard numerical algorithm

$Q = \Lambda^{-1/2}\,V^T = \langle x x^T \rangle^{-1/2}$, (40)

where $\langle\cdot\rangle$ denotes time averaging, $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_n\}$ is a diagonal matrix containing the $n$ largest eigenvalues of $R_{xx}$, and $V = [v_1, \ldots, v_n]$ is an orthogonal matrix of the corresponding eigenvectors of $R_{xx}$. As an alternative to computing the PCA whitening matrix using a numerical algorithm, we can also use the fast adaptive RLS algorithm given by [10] (see Fig. 4):

$\bar{x}_i(t) = w_i^T(t)\,x_i(t)$, (41)

$\eta_i(t) = \lambda_0\,\eta_i(t-1) + \bar{x}_i^2(t)$, (42)

$w_i(t+1) = w_i(t) + \frac{\bar{x}_i(t)}{\eta_i(t)}\,[x_i(t) - \bar{x}_i(t)\,w_i(t)]$, (43)

$x_{i+1}(t) = x_i(t) - \bar{x}_i(t)\,w_i^{*}(t)$, (44)

$x_1(t) = x(t)$, (45)

where $0 < \lambda_0 \le 1$ is a forgetting factor. After applying the above PCA learning procedure, the output signals $\bar{x}_i(t)$ are uncorrelated, with variances $\lambda_i = E\{\bar{x}_i^2\}$. To normalize them to unit variance, we use the procedure

$\tilde{x}_i(t) = \lambda_i^{-1/2}\,\bar{x}_i(t) = \lambda_i^{-1/2}\,w_i^T(t)\,x_i(t)$. (46)

It is interesting to note that after pre-whitening the global mixing matrix $A = Q H$ is an orthogonal matrix, since $R_{ss} = I$ for normalized sources and $R_{\tilde{x}\tilde{x}} = E\{\tilde{x}\,\tilde{x}^T\} = A A^T + \sigma_\nu^2 I$.

Fig. 4. Functional block diagram illustrating the implementation of the fast adaptive PCA learning algorithm.
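Our reconstruction of the deflation-type recursions (41)-(45) in code form; the transcription lost some details, so this follows the standard sequential RLS PCA pattern in the spirit of [10], with forgetting factor and initialization as illustrative assumptions:

```python
import numpy as np

def rls_pca_step(Wp, d, x, lam0=0.99):
    """One sample of the deflation-based RLS PCA update, cf. Eqs. (41)-(45).

    Wp   : (n, m) array whose rows are the eigenvector estimates w_i
    d    : (n,)   running powers eta_i(t) (RLS gain denominators)
    x    : (m,)   zero-mean input sample; x_1(t) = x(t) per Eq. (45)
    lam0 : forgetting factor
    """
    xi = x.astype(float).copy()
    xbar = np.empty(Wp.shape[0])
    for i in range(Wp.shape[0]):
        xbar[i] = Wp[i] @ xi                                 # Eq. (41)
        d[i] = lam0 * d[i] + xbar[i]**2                      # Eq. (42)
        Wp[i] += (xbar[i] / d[i]) * (xi - xbar[i] * Wp[i])   # Eq. (43)
        xi -= xbar[i] * Wp[i]                                # Eq. (44): deflation
    # Eq. (46): dividing each xbar[i] by the running standard deviation of
    # that component yields unit-variance whitened outputs.
    return Wp, d, xbar

# Usage: Wp = np.eye(n, m); d = np.ones(n); then feed samples x(t) one by one.
```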

4. Computer simulation experiments

Due to space limitations, we present only two illustrative examples indicating the performance of the proposed techniques. The three sub-Gaussian source signals shown in Fig. 5 were mixed using a fixed $3 \times 3$ mixing matrix $H$ with rows $h_1$, $h_2$, and $h_3$. Uncorrelated Gaussian noise signals with variance 1.6 were added to each of the elements of $x(t)$.

Fig. 5. Exemplary on-line simulation results of the neural network in Fig. 3 for Gaussian noise. The first three signals are the original sources, the next three signals are the measured signals, and the last three signals are the estimated source signals using the learning rule in Eqs. (37) and (38). The horizontal axis represents time in seconds.

The neural network model depicted in Fig. 3, with the associated learning rules in Eqs. (37) and (38) and the nonlinearities $f_i(y_i) = y_i^3$, $g_i(y_i) = \tanh(10 y_i)$, and $\Psi(e_i) = e_i$, was used to separate these signals, where $\hat{H}(0) = I$. Shown in Fig. 5 are the resulting separated signals, in which the source signals are accurately estimated. After 400 ms of adaptation, each row of the combined system matrix $\hat{H}^{-1}H$ contained a single dominant entry, with the remaining entries small in magnitude (the third row, for example, was $[-0.2975, -0.0061, -0.0683]$), indicating that separation has been achieved.

Note that standard algorithms that assume noiseless measurements fail to separate such noisy signals.

In the second illustrative example, the sensor signals were contaminated by additive impulsive (spiky) noise, as shown in Fig. 6. The same learning rule was employed, but with the nonlinear functions $\Psi(e_i) = \tanh(10 e_i)$. The neural network of Fig. 3 was able to considerably reduce the influence of the noise on the separated signals.

Fig. 6. Exemplary on-line simulation results of the neural network in Fig. 3 for impulsive noise. The first three signals are the mixed sensor signals contaminated by impulsive (Laplacian) noise, the next three signals are the separated signals using the learning rule (14), and the last three signals are the estimated source signals using the learning rule in Eqs. (37) and (38).

5. Conclusions

In this paper, robust methods for performing independent component analysis in the presence of measurement noise have been described. These methods simultaneously perform unbiased estimation of the separating matrix and noise reduction on the extracted sources. In addition, gradient-based rules for adjusting the shape parameters of the nonlinearities within the algorithms have been given. Simulations indicate that the algorithms perform robust estimation of the independent components when noise is present.

References

[1] S. Amari, A. Cichocki, H.H. Yang, Recurrent neural networks for blind separation of sources, Proc. Int. Symp. on Nonlinear Theory and its Applications (NOLTA-95), Las Vegas, NV, 1995.
[2] S. Amari, A. Cichocki, H.H. Yang, A new learning algorithm for blind signal separation, in: D.S. Touretzky, M.C. Mozer, M.E. Hasselmo (Eds.), Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 1996.
[3] J.J. Atick, A.N. Redlich, Convergent algorithm for sensory receptive field development, Neural Comput. 5 (1993).
[4] A.J. Bell, T.J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Comput. 7 (1995).
[5] J.F. Cardoso, B. Laheld, Equivariant adaptive source separation, IEEE Trans. Signal Process. 44 (1996).
[6] A. Cichocki, S.C. Douglas, S. Amari, P. Mierzejewski, Independent component analysis for noisy data, in: C. Fyfe (Ed.), Proc. Int. Workshop on Independence and Artificial Neural Networks, Tenerife, Spain, 9-10 February 1998.
[7] A. Cichocki, W. Kasprzak, S. Amari, Adaptive approach to blind source separation with cancellation of additive and convolutional noise, Proc. 3rd Int. Conf. on Signal Processing (ICSP'96), IEEE Press/PHEI, Beijing, vol. I, September 1996.
[8] A. Cichocki, I. Sabala, S. Choi, B. Orsier, R. Szupiluk, Self-adaptive independent component analysis for sub-Gaussian and super-Gaussian mixtures with an unknown number of sources, Proc. Int. Symp. on Nonlinear Theory and its Applications (NOLTA'97), Honolulu, HI, 1997.
[9] A. Cichocki, I. Sabala, S. Amari, Intelligent neural networks for blind signal separation with unknown number of sources, Proc. Int. Conf. Engineering of Intelligent Systems, Tenerife, Spain, February 1998.
[10] A. Cichocki, R. Unbehauen, Robust estimation of principal components in real time, Electron. Lett. 29 (1993).
[11] A. Cichocki, R. Unbehauen, E. Rummert, Robust learning algorithm for blind separation of signals, Electron. Lett. 30 (1994).

[12] A. Cichocki, R. Unbehauen, L. Moszczyński, E. Rummert, A new on-line adaptive learning algorithm for blind separation of source signals, Proc. Int. Symp. Artificial Neural Networks (ISANN-94), Tainan, Taiwan, 1994.
[13] A. Cichocki, R. Unbehauen, Robust neural networks with on-line learning for blind identification and blind separation of sources, IEEE Trans. Circuits Systems I 43 (1996).
[14] P. Comon, Independent component analysis: a new concept?, Signal Process. 36 (1994).
[15] S.C. Douglas, A. Cichocki, Neural networks for blind decorrelation of signals, IEEE Trans. Signal Process. 45 (1997).
[16] S.C. Douglas, A. Cichocki, S. Amari, Multichannel blind separation and deconvolution of sources with arbitrary distributions, Proc. IEEE Workshop on Neural Networks for Signal Processing, Amelia Island, FL, 1997.
[17] S.C. Douglas, A. Cichocki, S. Amari, Bias removal for blind source separation with noisy measurements, Electron. Lett. 34 (14) (1998).
[18] M. Girolami, C. Fyfe, Stochastic ICA contrast maximization using Oja's nonlinear PCA algorithm, Int. J. Neural Systems (1997), in press.
[19] M. Girolami, C. Fyfe, Extraction of independent signal sources using a deflationary exploratory projection pursuit network with lateral inhibition, IEE Proc. Vision, Image Signal Process. (1997), in press.
[20] W.C. Gray, Variable norm deconvolution, Ph.D. Dissertation, Stanford Univ., Stanford, CA.
[21] A. Hyvarinen, Independent component analysis in the presence of noise: a maximum likelihood approach, in: C. Fyfe (Ed.), Proc. Int. Workshop on Independence and Artificial Neural Networks, Tenerife, Spain, 9-10 February 1998.
[22] J. Karhunen, Neural approaches to independent component analysis and source separation, Proc. European Symp. on Artificial Neural Networks (ESANN-96), Bruges, Belgium, 1996.
[23] J. Karhunen, A. Cichocki, W. Kasprzak, P. Pajunen, On neural blind separation with noise suppression and redundancy reduction, Int. J. Neural Systems 8 (2) (1997).
[24] D.J.C. MacKay, Maximum likelihood and covariant algorithms for independent component analysis, Internal Report, Cavendish Laboratory, Cambridge Univ.
[25] Z. Maluche, O. Macchi, Adaptive separation of an unknown number of sources, Proc. IEEE Workshop on Higher Order Statistics, Los Alamitos, CA, 1997.
[26] K. Matsuoka, M. Ohya, M. Kawamoto, A neural net for blind separation of non-stationary signals, Neural Networks 8 (1995).
[27] E. Moulines, J.F. Cardoso, E. Gassiat, Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models, Proc. Int. Conf. Acoust. Speech Signal Processing (ICASSP-97), Munich, Germany, 1997.
[28] E. Oja, J. Karhunen, Signal separation by nonlinear Hebbian learning, in: M. Palaniswami et al. (Eds.), Computational Intelligence: A Dynamic System Perspective, IEEE Press, New York, 1995.
[29] E. Oja, The nonlinear PCA learning rule in independent component analysis, Neurocomputing 17 (1997).
[30] L. Xu, C.-C. Cheung, J. Ruan, S. Amari, Nonlinearity and separation capability: further justification for the ICA algorithm with a learned mixture of parametric densities, Proc. European Symp. on Artificial Neural Networks (ESANN'97), Bruges, Belgium, 1997.
[31] L. Xu, C.-C. Cheung, H.H. Yang, S. Amari, Independent component analysis by the information-theoretic approach with mixture of parametric densities, Proc. IEEE Int. Conf. on Neural Networks, vol. III, Houston, TX, 9-12 June 1997.

Andrzej Cichocki received the M.Sc. (with honors), Ph.D., and Habilitate Doctorate (Dr.Sc.) degrees, all in electrical engineering, from the Warsaw University of Technology (Poland) in 1972, 1975, and 1982, respectively. Since 1972, he has been with the Institute of Theory of Electrical Engineering and Electrical Measurements at the Warsaw University of Technology, where he became a Professor. He is the co-author of two books, MOS Switched-Capacitor and Continuous-Time Integrated Circuits and Systems (Springer-Verlag, 1989) and Neural Networks for Optimization and Signal Processing (Teubner-Wiley, 1993/94), and of more than 150 research papers. He spent several years at the University of Erlangen-Nuernberg (Germany) as an Alexander von Humboldt Research Fellow and Guest Professor at the Lehrstuhl fuer Allgemeine und Theoretische Elektrotechnik, directed by Professor Rolf Unbehauen. He has also worked as a team leader of the Laboratory for Artificial Brain Systems at the Frontier Research Program, Riken, Japan. He is currently working as head of the Laboratory for Open Information Systems at the Brain Science Institute, Riken, in the Brain-Style Information Processing Group directed by Professor Shun-ichi Amari. He is currently an Associate Editor of IEEE Transactions on Neural Networks. His current research interests include adaptive semi-blind signal processing, especially intelligent processing of biomedical signals, dynamic independent component analysis, neural networks, and nonlinear dynamic systems theory.

Scott C. Douglas received the B.S. (with distinction), M.S., and Ph.D. degrees in Electrical Engineering from Stanford University, Stanford, CA, in 1988, 1989, and 1992, respectively. From April to September 1992, he was with the Acoustics and Radar Technology Laboratory at SRI International, Menlo Park, CA. Since September 1992, he has been an assistant professor in the Department of Electrical Engineering at the University of Utah, Salt Lake City, UT. His research activities include adaptive filtering, active noise control, blind equalization and source separation, neural networks, and VLSI implementations of digital signal processing systems. Dr. Douglas received the Hughes Masters Fellowship Award in 1988 and the NSF Graduate Fellowship Award, and he is a recipient of the NSF CAREER Award. He has published more than 60 articles in journals and conference proceedings. He served as a section editor for The Digital Signal Processing Handbook (CRC/IEEE Press, 1998) and is currently an Associate Editor for IEEE Signal Processing Letters. He has served on the organizing committees of several international conferences and workshops and co-founded the Signal Processing/Communications Chapter of the Utah Section of the IEEE. He is a frequent consultant to industry in the areas of signal processing and adaptive filtering and is a member of Phi Beta Kappa and Tau Beta Pi.

Shun-ichi Amari graduated from the University of Tokyo in 1958, majoring in mathematical engineering, and received the Dr.Eng. degree from the University of Tokyo. He has worked at Kyushu University and then at the University of Tokyo, of which he is now a Professor Emeritus. He is currently the Group Director of the Brain-Style Information Processing Group, Brain Science Institute, Riken (The Institute of Physical and Chemical Research), Japan.
He has been engaged in research in wide areas of mathematical engineering and applied mathematics, such as topological network theory, the differential geometry of continuum mechanics, pattern recognition, and the information sciences. In particular, he has devoted himself to the mathematical foundations of neural network theory, including statistical neurodynamics, the dynamical theory of neural fields, associative memory, self-organization, and general learning theory. Another main subject of his research is information geometry, proposed by himself, which develops and applies modern differential geometry to statistical inference, information theory, and modern control theory, providing a new and powerful method for the information sciences and probability theory. Dr. Amari is a Past President of the International Neural Networks Society. He was the Conference Chair of the first International Joint Conference on Neural Networks, and he gave the Mahalanobis lecture on the differential geometry of statistical inference at the 47th session of the International Statistical Institute. He is an IEEE Fellow and the recipient of the 1995 Japan Academy Award, the 1992 IEEE Neural Networks Pioneer Award, and the 1997 IEEE Piore Award.


More information

Blind signal processing algorithms

Blind signal processing algorithms 12th Int. Workshop on Systems, Signals & Image Processing, 22-24 September 2005, Chalkida, Greece 105 Blind signal processing algorithms Athanasios Margaris and Efthimios Kotsialos Department of Applied

More information

CONTROL SYSTEMS ANALYSIS VIA BLIND SOURCE DECONVOLUTION. Kenji Sugimoto and Yoshito Kikkawa

CONTROL SYSTEMS ANALYSIS VIA BLIND SOURCE DECONVOLUTION. Kenji Sugimoto and Yoshito Kikkawa CONTROL SYSTEMS ANALYSIS VIA LIND SOURCE DECONVOLUTION Kenji Sugimoto and Yoshito Kikkawa Nara Institute of Science and Technology Graduate School of Information Science 896-5 Takayama-cho, Ikoma-city,

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 2, FEBRUARY

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 2, FEBRUARY IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 54, NO 2, FEBRUARY 2006 423 Underdetermined Blind Source Separation Based on Sparse Representation Yuanqing Li, Shun-Ichi Amari, Fellow, IEEE, Andrzej Cichocki,

More information

Higher Order Statistics

Higher Order Statistics Higher Order Statistics Matthias Hennig Neural Information Processing School of Informatics, University of Edinburgh February 12, 2018 1 0 Based on Mark van Rossum s and Chris Williams s old NIP slides

More information

Independent Component Analysis

Independent Component Analysis A Short Introduction to Independent Component Analysis with Some Recent Advances Aapo Hyvärinen Dept of Computer Science Dept of Mathematics and Statistics University of Helsinki Problem of blind source

More information

Comparative Performance Analysis of Three Algorithms for Principal Component Analysis

Comparative Performance Analysis of Three Algorithms for Principal Component Analysis 84 R. LANDQVIST, A. MOHAMMED, COMPARATIVE PERFORMANCE ANALYSIS OF THR ALGORITHMS Comparative Performance Analysis of Three Algorithms for Principal Component Analysis Ronnie LANDQVIST, Abbas MOHAMMED Dept.

More information

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007 MIT OpenCourseWare http://ocw.mit.edu HST.582J / 6.555J / 16.456J Biomedical Signal and Image Processing Spring 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

ARTEFACT DETECTION IN ASTROPHYSICAL IMAGE DATA USING INDEPENDENT COMPONENT ANALYSIS. Maria Funaro, Erkki Oja, and Harri Valpola

ARTEFACT DETECTION IN ASTROPHYSICAL IMAGE DATA USING INDEPENDENT COMPONENT ANALYSIS. Maria Funaro, Erkki Oja, and Harri Valpola ARTEFACT DETECTION IN ASTROPHYSICAL IMAGE DATA USING INDEPENDENT COMPONENT ANALYSIS Maria Funaro, Erkki Oja, and Harri Valpola Neural Networks Research Centre, Helsinki University of Technology P.O.Box

More information

Independent component analysis: algorithms and applications

Independent component analysis: algorithms and applications PERGAMON Neural Networks 13 (2000) 411 430 Invited article Independent component analysis: algorithms and applications A. Hyvärinen, E. Oja* Neural Networks Research Centre, Helsinki University of Technology,

More information

Gaussian process for nonstationary time series prediction

Gaussian process for nonstationary time series prediction Computational Statistics & Data Analysis 47 (2004) 705 712 www.elsevier.com/locate/csda Gaussian process for nonstationary time series prediction Soane Brahim-Belhouari, Amine Bermak EEE Department, Hong

More information

Blind channel deconvolution of real world signals using source separation techniques

Blind channel deconvolution of real world signals using source separation techniques Blind channel deconvolution of real world signals using source separation techniques Jordi Solé-Casals 1, Enric Monte-Moreno 2 1 Signal Processing Group, University of Vic, Sagrada Família 7, 08500, Vic

More information

Blind separation of sources that have spatiotemporal variance dependencies

Blind separation of sources that have spatiotemporal variance dependencies Blind separation of sources that have spatiotemporal variance dependencies Aapo Hyvärinen a b Jarmo Hurri a a Neural Networks Research Centre, Helsinki University of Technology, Finland b Helsinki Institute

More information

Nonlinear Blind Source Separation Using a Radial Basis Function Network

Nonlinear Blind Source Separation Using a Radial Basis Function Network 124 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 1, JANUARY 2001 Nonlinear Blind Source Separation Using a Radial Basis Function Network Ying Tan, Member, IEEE, Jun Wang, Senior Member, IEEE, and

More information

Neural Learning in Structured Parameter Spaces Natural Riemannian Gradient

Neural Learning in Structured Parameter Spaces Natural Riemannian Gradient Neural Learning in Structured Parameter Spaces Natural Riemannian Gradient Shun-ichi Amari RIKEN Frontier Research Program, RIKEN, Hirosawa 2-1, Wako-shi 351-01, Japan amari@zoo.riken.go.jp Abstract The

More information

Independent Component Analysis

Independent Component Analysis A Short Introduction to Independent Component Analysis Aapo Hyvärinen Helsinki Institute for Information Technology and Depts of Computer Science and Psychology University of Helsinki Problem of blind

More information

HST.582J/6.555J/16.456J

HST.582J/6.555J/16.456J Blind Source Separation: PCA & ICA HST.582J/6.555J/16.456J Gari D. Clifford gari [at] mit. edu http://www.mit.edu/~gari G. D. Clifford 2005-2009 What is BSS? Assume an observation (signal) is a linear

More information

Independent Component Analysis (ICA)

Independent Component Analysis (ICA) Independent Component Analysis (ICA) Université catholique de Louvain (Belgium) Machine Learning Group http://www.dice.ucl ucl.ac.be/.ac.be/mlg/ 1 Overview Uncorrelation vs Independence Blind source separation

More information

NONLINEAR INDEPENDENT FACTOR ANALYSIS BY HIERARCHICAL MODELS

NONLINEAR INDEPENDENT FACTOR ANALYSIS BY HIERARCHICAL MODELS NONLINEAR INDEPENDENT FACTOR ANALYSIS BY HIERARCHICAL MODELS Harri Valpola, Tomas Östman and Juha Karhunen Helsinki University of Technology, Neural Networks Research Centre P.O. Box 5400, FIN-02015 HUT,

More information

POLYNOMIAL SINGULAR VALUES FOR NUMBER OF WIDEBAND SOURCES ESTIMATION AND PRINCIPAL COMPONENT ANALYSIS

POLYNOMIAL SINGULAR VALUES FOR NUMBER OF WIDEBAND SOURCES ESTIMATION AND PRINCIPAL COMPONENT ANALYSIS POLYNOMIAL SINGULAR VALUES FOR NUMBER OF WIDEBAND SOURCES ESTIMATION AND PRINCIPAL COMPONENT ANALYSIS Russell H. Lambert RF and Advanced Mixed Signal Unit Broadcom Pasadena, CA USA russ@broadcom.com Marcel

More information

GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS. Mitsuru Kawamoto 1,2 and Yujiro Inouye 1

GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS. Mitsuru Kawamoto 1,2 and Yujiro Inouye 1 GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS Mitsuru Kawamoto,2 and Yuiro Inouye. Dept. of Electronic and Control Systems Engineering, Shimane University,

More information

Blind Source Separation with a Time-Varying Mixing Matrix

Blind Source Separation with a Time-Varying Mixing Matrix Blind Source Separation with a Time-Varying Mixing Matrix Marcus R DeYoung and Brian L Evans Department of Electrical and Computer Engineering The University of Texas at Austin 1 University Station, Austin,

More information

Independent Component Analysis and Unsupervised Learning

Independent Component Analysis and Unsupervised Learning Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Independent Component Analysis (ICA)

Independent Component Analysis (ICA) Independent Component Analysis (ICA) Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

Artificial Intelligence Module 2. Feature Selection. Andrea Torsello

Artificial Intelligence Module 2. Feature Selection. Andrea Torsello Artificial Intelligence Module 2 Feature Selection Andrea Torsello We have seen that high dimensional data is hard to classify (curse of dimensionality) Often however, the data does not fill all the space

More information

BLIND SEPARATION OF INSTANTANEOUS MIXTURES OF NON STATIONARY SOURCES

BLIND SEPARATION OF INSTANTANEOUS MIXTURES OF NON STATIONARY SOURCES BLIND SEPARATION OF INSTANTANEOUS MIXTURES OF NON STATIONARY SOURCES Dinh-Tuan Pham Laboratoire de Modélisation et Calcul URA 397, CNRS/UJF/INPG BP 53X, 38041 Grenoble cédex, France Dinh-Tuan.Pham@imag.fr

More information

Introduction PCA classic Generative models Beyond and summary. PCA, ICA and beyond

Introduction PCA classic Generative models Beyond and summary. PCA, ICA and beyond PCA, ICA and beyond Summer School on Manifold Learning in Image and Signal Analysis, August 17-21, 2009, Hven Technical University of Denmark (DTU) & University of Copenhagen (KU) August 18, 2009 Motivation

More information

A Novel Approach For Sensor Noise Elimination In Real Time Process

A Novel Approach For Sensor Noise Elimination In Real Time Process A Novel Approach For Sensor Noise Elimination In Real Time Process K.UMASURESH *, K.SURESH MANIC Abstract A review on the current literature on sensor noise elimination algorithms, reports the eistence

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

Second Order Nonstationary Source Separation

Second Order Nonstationary Source Separation Journal of VLSI Signal Processing,?, 1 13 (2001) cfl 2001 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. Second Order Nonstationary Source Separation SEUNGJIN CHOI Department of Computer

More information

Equivalence Probability and Sparsity of Two Sparse Solutions in Sparse Representation

Equivalence Probability and Sparsity of Two Sparse Solutions in Sparse Representation IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 12, DECEMBER 2008 2009 Equivalence Probability and Sparsity of Two Sparse Solutions in Sparse Representation Yuanqing Li, Member, IEEE, Andrzej Cichocki,

More information