Undercomplete Independent Component Analysis for Signal Separation and Dimension Reduction

John Porrill and James V Stone
Psychology Department, Sheffield University, Sheffield, S10 2UR, England.
Tel: 0114 222 6522, Fax: 0114 276 6515, Email: J.Porrill, J.V.Stone@shef.ac.uk
Category: Algorithms and Architectures. Author for correspondence: JV Stone.

Abstract

We introduce undercomplete independent component analysis (uica), a method for extracting K signals from M ≥ K mixtures of N ≤ M source signals. The mixtures x = (x_1, ..., x_M)^T are formed from a linear combination of the independent source signals s = (s_1, ..., s_N)^T using an M × N mixing matrix A, so that x = As. For the case N = K = M, Bell and Sejnowski [Bell and Sejnowski, 1995] showed that a square N × N unmixing matrix W can be found by maximising the joint entropy of the N signals (Y_1, ..., Y_N)^T = φ(Wx), where φ is a monotonic, non-linear function. Using a similar approach, we show that a K × M unmixing matrix W can be used to recover K ≤ M signals (s_1, ..., s_K)^T by maximising the joint entropy of the K signals Y = φ(Wx). The matrix W is essentially a pseudo-inverse of the M × N mixing matrix A. A different and widely used method for reducing the size of W is to perform principal component analysis (PCA) on the data set, and to use only the L principal components with the largest eigenvalues as input to ICA. This results in an L × L unmixing matrix W. However, there is no a priori reason to assume that independent components exist only in the L-dimensional subspace defined by the L principal components with the largest eigenvalues. Thus, discarding some eigenvectors may also corrupt or discard independent components. In contrast, uica does not discard any independent components in the data set, and can extract between 1 and M signals from x. The method is demonstrated on mixtures of high-kurtosis (speech and music) and Gaussian signals.

Introduction

We present a method for extracting K signals from M mixtures of N sources, where K ≤ N ≤ M. This is a generalisation of the method described by Bell and Sejnowski (B&S) [Bell and Sejnowski, 1995], who showed how K signals could be extracted from M mixtures of N source signals for K = M = N. Both methods can be described informally as follows. The amplitudes of N source signals can be represented as a point in an N-dimensional space and, considered over all times, they define a distribution of points in this space. If the signals are from different sources then they tend to be statistically independent of each other. A key observation is that, if a signal s has cumulative density function (cdf) φ, then the distribution of φ(s) has maximum entropy. Similarly, if N independent signals each have cdf φ, then the joint distribution of φ(s) = (φ(s_1), ..., φ(s_N))^T has maximum entropy. For a set of signal mixtures x = As, a linear unmixing matrix W exists such that s = Wx. Given that φ(s) has maximum entropy, s can be recovered by finding a matrix W that maximises the entropy of Y = φ(Wx), at which point Wx = s.

Why Extract Fewer Sources Than Mixtures?

Given a temporal sequence x of M (P × P) images, ICA can be used to extract M spatially independent components (ICs) with an (M × M) unmixing matrix W. However, such a matrix may be large. In such cases, we would like to be able to extract fewer sources than mixtures. A common method for reducing the size of the unmixing matrix W is to perform principal component analysis (PCA) on the data matrix x, and then to retain the L < M principal components (PCs) with the largest eigenvalues. One property of ICA is that it is insensitive to the RMS amplitude of the ICs it extracts from the mixtures x. However, for a given data set x, there is no a priori reason to suppose that ICs should reside only within the subspace defined by the L PCs with the largest eigenvalues. Thus, discarding a subspace using PCA removes any ICs that exist within it, and partially destroys any IC with a non-zero projection onto that subspace. For example, in analysing fMRI data, an IC with very small variance was found to be associated with the form of the 'on-off' experimental protocol used [McKeown et al., 1998]. Additionally, the reduced-dimensional input space defined by the L retained PCs may contain more ICs than PCs. That is, if the original M-dimensional input space x contains N ICs then, as the number L of retained PCs is reduced, there is an increasing likelihood that L < N. If L < N then a linear decomposition of the L eigenvectors into K ≤ N sources s does not exist, and therefore ICA cannot work in this case [1].

[1] Thanks to Martin McKeown for pointing this out.

Signal Separation Using Entropy Maximisation

Suppose that the outputs x = (x_1, ..., x_M)^T of M measurement devices are a linear mixture of N independent signal sources s = (s_1, ..., s_N)^T, x = As, where A is an M × N mixing matrix. We wish to find a K × M unmixing matrix W such that the K recovered components y = Wx are a subset of the original signals s (i.e. K ≤ N).
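To make the mixing model concrete, the following minimal NumPy sketch simulates M mixtures of N sources in the spirit of the experiments reported later: a few high-kurtosis sources (Laplacian samples standing in for speech and music, an assumption of ours rather than the paper's actual sound data) plus Gaussian noise sources, combined by a random M × N mixing matrix. All names are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 20000   # points per signal, as in the experiments
N = 6               # number of sources
M = 6               # number of mixtures (may exceed N)

# Three high-kurtosis stand-ins for the sound sources and three Gaussian sources.
sources = np.vstack([
    rng.laplace(size=(3, n_samples)),   # super-Gaussian stand-ins for speech/music
    rng.normal(size=(3, n_samples)),    # Gaussian noise sources
])

# Normalise each source to zero mean and unit variance, as described in the Results.
sources = (sources - sources.mean(axis=1, keepdims=True)) / sources.std(axis=1, keepdims=True)

# Random M x N mixing matrix with normally distributed independent entries.
A = rng.normal(size=(M, N))
x = A @ sources    # x is (M, n_samples): the observed signal mixtures
```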

Signal Separation For Equal Numbers of Sources and Source Mixtures

In the case K = M = N, B&S showed that the unmixing matrix W can be found by maximising the entropy H(Y) of the joint distribution Y = (Y_1, ..., Y_N) = (φ_1(y_1), ..., φ_N(y_N)), where y = Wx. The correct φ_i have the same form as the cdfs of the input signals. However, in many cases it is sufficient to approximate these cdfs by sigmoids [2], Y_i = tanh(y_i). The output entropy H(Y) can be shown to be related to the entropy of the input H(x) by

  H(Y) = H(x) + E[ log |J| ]   (1)

where E denotes expected value, and |J| is the absolute value of the determinant of the Jacobian matrix ∂Y/∂x. We can evaluate |J| using

  ∂Y/∂x = (∂Y/∂y)(∂y/∂x)

where ∂Y/∂y and ∂y/∂x are Jacobian matrices. Since ∂Y/∂y is diagonal with entries φ'_i(y_i) and ∂y/∂x = W, this gives

  |J| = ( Π_{i=1}^{N} φ'_i(y_i) ) |W|   (2)

Substituting Equation (2) in (1) yields

  H(Y) = H(x) + E[ Σ_{i=1}^{N} log φ'_i(y_i) ] + log |W|   (3)

The term H(x) is constant, and can therefore be ignored in the maximisation of H(Y). The term E[ Σ_i log φ'_i(y_i) ] can be estimated given n samples from the distribution defined by y:

  E[ Σ_{i=1}^{N} log φ'_i(y_i) ] ≈ (1/n) Σ_{j=1}^{n} Σ_{i=1}^{N} log φ'_i(y_i^(j))   (4)

Ignoring H(x), and substituting Equation (4) in (3), yields a new function that differs from H(Y) by a constant (= H(x)):

  h(W) = (1/n) Σ_{j=1}^{n} Σ_{i=1}^{N} log φ'_i(y_i^(j)) + log |W|   (5)

B&S showed how maximising this function with respect to the matrix W can be used to recover signals from linear mixtures.

[2] In fact, sources s_i normalised so that E[s_i tanh s_i] = 1/2 can be separated using tanh sigmoids if and only if the pairwise conditions κ_i κ_j > 1 are satisfied, where κ_i = 2 E[s_i^2] E[sech^2 s_i].
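As an illustration of this square case, the sketch below evaluates the objective of Equation (5) with φ_i = tanh, for which log φ'_i(y) = log(1 − tanh²(y)), together with its gradient ∇_W h = (W^T)^{-1} − (2/n) tanh(WX) X^T, and maximises it by plain gradient ascent. This is our own minimal illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def bs_objective_and_grad(W, X):
    """Bell & Sejnowski objective h(W) of Eq. (5) with tanh sigmoids.

    W : (N, N) square unmixing matrix
    X : (N, n) matrix of signal mixtures, one column per sample
    Returns (h, grad), with grad the same shape as W.
    """
    n = X.shape[1]
    Y = W @ X                                            # y = Wx for every sample
    # log phi'(y) = log(1 - tanh^2(y)); clip for numerical safety
    log_phi_prime = np.log(np.clip(1.0 - np.tanh(Y) ** 2, 1e-12, None))
    h = log_phi_prime.sum() / n + np.linalg.slogdet(W)[1]
    grad = np.linalg.inv(W).T - (2.0 / n) * np.tanh(Y) @ X.T
    return h, grad

def bs_ica(X, lr=0.05, n_iter=500, seed=0):
    """Plain gradient ascent on h(W); kept deliberately simple for illustration."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    W = np.eye(N) + rng.normal(scale=0.1, size=(N, N))   # start near the identity
    for _ in range(n_iter):
        _, grad = bs_objective_and_grad(W, X)
        W += lr * grad
    return W
```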

Signal Separation For Unequal Numbers of Sources and Source Mixtures

The assumption K = M = N is very restrictive. If the number of sources is unknown, one might like to reduce the dimensionality of the problem by looking for a small subset of the source signals, so that in general K < M. This means that a rectangular unmixing matrix is required. This can be determined using the criterion of maximum output entropy, which ensures unmixing of those independent variables with cdfs that best fit the functional forms of the φ_i (this follows from the maximum-likelihood interpretation of the maximum entropy criterion, see [Amari et al., 1996]). Note that if K < M then the mere independence of the outputs y is no longer sufficient to guarantee that we have recovered a subset of the input variables, because all linear combinations of disjoint subsets of inputs are independent. For example, given four signal mixtures of four source signals, it is possible to extract two independent signals (y_1 and y_2) that are linear combinations of disjoint pairs of source signals, y_1 = w_1 s_1 + w_2 s_2 and y_2 = w_3 s_3 + w_4 s_4, where the w's are elements of a 2 × 2 unmixing matrix W. Note that, because y_1 and y_2 are mixtures, they are approximately Gaussian. The criterion that the recovered variables have cdfs approximated by the φ_i is thus of much greater importance in the case K < M than if K = M. Thus, to simultaneously optimise both W and the φ_i would be counter-productive for K < M.

Equation (1) cannot be used when K < M, so we replace it by

  H(Y) = H(y) + E[ log |∂Y/∂y| ] = H(y) + E[ Σ_{i=1}^{K} log φ'_i(y_i) ]   (6)

In general, the entropy H(y) is difficult to calculate. We can approximate it by the entropy of a multi-dimensional Gaussian, which is given by H(y) ≈ (1/2) log |C| + (K/2)(1 + log 2π), where C = Cov[y]. However, as the algorithm converges, this approximation becomes less accurate. This is because the projection of the input data onto the subspace defined by the rows of W defines an increasingly non-Gaussian distribution as the algorithm converges. In practice, this approximation seems to work adequately (see Results). More accurate approximations involving higher moments of the distribution than the covariance can be derived if required (for example [Amari et al., 1996]). C can be re-written as C = E[yy^T] = W S W^T, where S = Cov[x]. In maximising H(Y), we can ignore the constant (K/2)(1 + log 2π). We can now define a new function which is an approximation to H(Y), and which differs from it by this constant:

  h(W) = (1/2) log |W S W^T| + E[ Σ_{i=1}^{K} log φ'_i(y_i) ]   (7)

This can be maximised using its derivative, which can be shown to be

  ∂h/∂W_ij = (W†)^T_ij + E[ (φ''_i / φ'_i) x_j ]   (8)

where W† = (S W^T)(W S W^T)^{-1} is the pseudo-inverse of W with respect to the positive definite matrix S. Note that, if K = M then W† = W^{-1}. If φ_i = tanh then this evaluates to

  ∇_W h = (W†)^T − 2 E[ Y x^T ]   (9)

where Y = tanh(y). In our experiments, this gradient was used to maximise Equation (7), using a BFGS quasi-Newton method.
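The undercomplete procedure therefore amounts to maximising Equation (7) with the gradient of Equations (8)-(9). The sketch below is our reading of that procedure, using SciPy's BFGS routine in place of whatever quasi-Newton implementation the authors used; the function name uica and its arguments are ours. Minimising the negative of h(W) lets a standard minimiser do the maximisation, and supplying the analytic gradient keeps the BFGS iterations cheap.

```python
import numpy as np
from scipy.optimize import minimize

def uica(X, K, seed=0, max_iter=1000):
    """Undercomplete ICA sketch: maximise Eq. (7) by BFGS.

    X : (M, n) matrix of signal mixtures, one column per sample
    K : number of sources to extract (K <= M)
    Returns the (K, M) unmixing matrix W.
    """
    M, n = X.shape
    S = np.cov(X)                                      # S = Cov[x]

    def neg_h_and_grad(w_flat):
        W = w_flat.reshape(K, M)
        Y = W @ X                                      # y = Wx
        C = W @ S @ W.T                                # C = W S W^T
        # h(W) = 0.5 log|C| + mean over samples of sum_i log(1 - tanh^2(y_i))
        log_phi_prime = np.log(np.clip(1.0 - np.tanh(Y) ** 2, 1e-12, None))
        h = 0.5 * np.linalg.slogdet(C)[1] + log_phi_prime.sum() / n
        # grad h = (W_dagger)^T - 2 E[tanh(y) x^T], W_dagger = S W^T C^{-1}  (Eqs. 8-9)
        W_dagger = S @ W.T @ np.linalg.inv(C)
        grad = W_dagger.T - (2.0 / n) * np.tanh(Y) @ X.T
        return -h, -grad.ravel()                       # minimise the negative

    rng = np.random.default_rng(seed)
    w0 = rng.normal(scale=0.1, size=K * M)             # random initial W
    res = minimize(neg_h_and_grad, w0, jac=True, method="BFGS",
                   options={"maxiter": max_iter})
    return res.x.reshape(K, M)
```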

Results

Extracting K Source Signals from M ≥ K Mixtures Using uica: The method has been tested by extracting signals from mixtures of natural sounds and Gaussian noise. The program was stopped when the correlation r between each of the K outputs and exactly one of the N input signals s was greater than 0.95, or when the number of iterations (evaluations of h(W)) exceeded 1000. All signals were normalised to have zero mean and unit variance. The N signals were then linearly combined using a random M × N mixing matrix (with normally distributed independent entries) to produce M signal mixtures x, which were used as input to the method. In the experiments reported here, N = 6. The signals (s_1, s_2, s_3) were a gong, Handel's Messiah, and laughter, respectively, obtained from the MatLab software package. Each of the signals s_4 to s_6 was a different random sample of Gaussian noise. Each source signal consisted of a random sample of 20,000 points.

The first task consists of extracting K = 1, 3 and 4 different sound signals from M = 6 linear mixtures of N = 6 signals. For K = 3, all and only the three non-Gaussian sources were recovered, despite different initial values for W. For K = 4, these three sources were always recovered, with the highest correlation between one output and a Gaussian source being around 0.8. For K = 1, the algorithm was run three times with different random number seeds. On each occasion, s_1 was recovered; this source has the distribution with the most kurtosis, so that its cdf is a good match to the tanh non-linearity. The results are displayed in Table 1. These experiments were repeated with M = 12 and 24, with K = 3 and 1. In each case, the required number of sources was recovered, each source was recovered once only, and none of these sources was a Gaussian source.

Extracting K Source Signals from M ≥ K Mixtures Using PCA/ICA: We compared the results obtained with uica with those obtained using a conventional dimension reduction method (PCA) to preprocess the set of M mixtures. The results presented here involve the three sound sources described in the previous section, plus three other sound sources (obtained from MatLab). Each source consisted of 10,000 samples, due to the short length of one source signal. All signals were normalised and mixed together as described in the previous section. First, we ran uica with six mixtures, and set the number of required sources to K = 4. The four extracted signals each had a correlation of |r| > 0.9 with exactly one of the source signals (see Table 2). This result is consistent with those reported in the previous section. Next, we ran ICA after preprocessing with PCA to obtain the four eigenvectors with the largest eigenvalues. These were then used as input to ICA using a 4 × 4 unmixing matrix. From Table 3, only three sources (1, 3 and 6) can reasonably be considered to have been extracted, with |r| > 0.9. The remaining sources (2, 4 and 5) have maximum correlations with extracted signals of 0.56, 0.59 and 0.50, respectively. Thus, one IC had effectively been discarded along with the two eigenvectors with smallest eigenvalues. For completeness, all six eigenvectors and a 6 × 6 unmixing matrix were also used. In this case, each source signal had a correlation |r| > 0.95 with exactly one extracted signal and |r| < 0.25 with the remaining extracted signals. These results demonstrate that using PCA to reduce the number of signal mixtures used as input to ICA can compromise the ability of ICA to extract source signals. This is because the source signals had non-zero projections onto the discarded eigenvectors. In contrast, uica does not require PCA, and can therefore extract exactly K ≤ N source signals from M ≥ N signal mixtures.
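The correlation tables reported here can be reproduced with a short evaluation step; the sketch below (names are ours) computes the |r| values between extracted signals and source signals and applies the |r| > 0.95 recovery criterion quoted above.

```python
import numpy as np

def correlation_table(extracted, sources):
    """Absolute correlation |r| between each extracted signal (rows)
    and each source signal (columns), as in Tables 2 and 3."""
    K, N = extracted.shape[0], sources.shape[0]
    table = np.empty((K, N))
    for i in range(K):
        for j in range(N):
            table[i, j] = abs(np.corrcoef(extracted[i], sources[j])[0, 1])
    return table

# Example usage with the earlier sketches (hypothetical names):
#   W = uica(x, K=4)
#   table = correlation_table(W @ x, sources)
#   recovered = table.max(axis=1) > 0.95   # stopping criterion from the text
```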

Discussion

Two alternative approaches to the problem of dimension reduction are: 1) reduce the dimension of the mixture space from M to K before separating signals, which risks corrupting ICs, or 2) find K = M ≥ N components (which could be a large number) using B&S's method, and then apply a separate method to identify the K < M most important components (see [Cichocki and Kasprzak, 1996]). With respect to 1, discarding PCs cannot inadvertently discard ICs if only PCs with zero eigenvalues are discarded. However, zero eigenvalues are rarely encountered with noisy data, so one is forced to risk corrupting ICs when using PCA to reduce the dimensionality of data used as input to ICA. The method we have described unifies approaches 1 and 2, without compromising the ability of ICA to extract ICs.

A logical modification to our algorithm would be to extract ICs one at a time, as in [Girolami and Fyfe, 1996], using projection pursuit indices. This involves extracting one IC from the M-dimensional space of signal mixtures x, 'removing' the one-dimensional subspace corresponding to that IC (using Gram-Schmidt orthonormalisation), and then extracting the next IC. This operation is repeated until all the sources have been extracted. This method has been tested on the data described above, and the results are not noticeably different from those reported here. However, we conjecture that sequential extraction of sources does not provide similar results to ICA in general. Consider image data which is a mixture of two spatial ICs, each of which has exactly one region (A and B, respectively) with non-zero grey-levels, and the same grey-levels in an overlapping region C of the IC images. The independence criterion implicit in ICA would force it to identify three ICs, corresponding to regions (A − C), (B − C) and C. In contrast, the 'cdf-matching' criterion implicit in projection pursuit methods (and in uica for K = 1) would ensure that an IC corresponding to A would be extracted first, followed by B [3].

[3] Thanks to Martin McKeown for pointing this out.

Conclusion

Using an undercomplete basis set to extract ICs is useful when the number of signal mixtures is larger than the number of source signals. This commonly occurs in high-dimensional data sets in which each signal is an entire image, or even a sequence of images. The obvious strategy of discarding PCs of the data set that have small eigenvalues can compromise ICA's ability to extract ICs for two reasons. First, ICs may have non-zero projections onto the subspace defined by these discarded PCs, so that these ICs are partially destroyed. Second, ICA is only possible if the number of source signals is equal to or less than the number of signal mixtures; using PCA to reduce the effective number of mixtures therefore increases the probability that the number of source signals is greater than the number of mixtures. In contrast, uica extracts a specified number of source signals from the original data set, and therefore precludes the problems associated with preprocessing with PCA.

Acknowledgements: Thanks to members of the Computational Neurobiology Laboratory at the Salk Institute, and to Tony Bell, for comments on this work. Thanks to Stephen Isard for comments on a previous draft of this paper. J Stone is supported by a Mathematical Biology Wellcome Fellowship (Grant number 044823).

References

[Amari et al., 1996] Amari, S., Cichocki, A., and Yang, H. (1996). A new learning algorithm for blind signal separation. In Touretzky, D., Mozer, M., and Hasselmo, M., editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA (in press).

[Bell and Sejnowski, 1995] Bell, A. and Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159.

[Cichocki and Kasprzak, 1996] Cichocki, A. and Kasprzak, W. (1996). Local adaptive learning algorithms for blind separation of natural images. Neural Network World, 6(4):515-523.

[Girolami and Fyfe, 1996] Girolami, M. and Fyfe, C. (1996). Negentropy and kurtosis as projection pursuit indices provide generalised ICA algorithms. NIPS'96 Blind Signal Separation Workshop.

[McKeown et al., 1998] McKeown, M., Makeig, S., Brown, G., Jung, T., Kindermann, S., and Sejnowski, T. (1998). Spatially independent activity patterns in functional magnetic resonance imaging data during the Stroop color-naming task. Proceedings of the National Academy of Sciences USA, 95:803-810.

Sources (N)  Mixtures (M)  Required (K)  Extracted  Iterations
6            6             3             3          140
6            6             4             3          100
6            6             1             1          60
6            12            3             3          120
6            12            4             3          90
6            24            3             3          90
6            24            1             1          40

Table 1: Performance for N = 6 signals s = (s_1, ..., s_6). (s_1, s_2, s_3) are a gong, Handel's Messiah, and laughter, respectively, and (s_4, s_5, s_6) are three Gaussian noise signals. The method was tested with different numbers M of signal mixtures and different numbers K of required outputs. Iterations denotes the number of function evaluations of h(W) required for convergence (see text).

     Src 1  Src 2  Src 3  Src 4  Src 5  Src 6
1    1.0    0.00   0.00   0.00   0.00   0.00
2    0.0    0.93   0.05   0.05   0.00   0.38
3    0.00   0.02   0.15   0.99   0.00   0.04
4    0.00   0.02   0.01   0.00   1.00   0.00

Table 2: Using uica to extract 4 signals from a mixture of 6 sound signals. Each cell specifies the absolute value of the correlation |r| between a source signal (columns) and a signal extracted by uica (rows).

     Src 1  Src 2  Src 3  Src 4  Src 5  Src 6
1    0.02   0.13   0.95   0.29   0.00   0.00
2    0.02   0.56   0.25   0.59   0.50   0.01
3    0.01   0.25   0.02   0.18   0.05   0.96
4    0.99   0.013  0.011  0.07   0.06   0.01

Table 3: Using PCA to preprocess six mixtures to obtain four PCs. The four PCs were used as input to ICA. Each cell specifies the absolute value of the correlation |r| between a source signal (columns) and an extracted signal (rows). Only three of the source signals (1, 3 and 6) can be considered to have been extracted, with correlations |r| > 0.90.