Independent Component Analysis and Blind Source Separation

Independent Component Analysis and Blind Source Separation Aapo Hyvärinen University of Helsinki and Helsinki Institute of Information Technology 1

Blind source separation. Four source signals: [figure omitted: four source waveforms over 100 time points]. Due to some external circumstances, only linear mixtures of the source signals are observed: [figure omitted: four observed mixture waveforms]. Estimate (separate) the original signals! 2

Solution by independence. Use only information on statistical independence to recover the signals: [figure omitted: the four recovered waveforms]. These are the independent components! 3

Independent Component Analysis (Jutten and Hérault, 1991). The observed random vector x is modelled by a linear latent variable model x_i = \sum_{j=1}^{m} a_{ij} s_j, i = 1,...,n (1) or in matrix form: x = As (2) where the mixing matrix A is constant (a parameter matrix) and the s_i are latent random variables called the independent components. Estimate both A and s, observing only x. 4
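
To make the model concrete, here is a minimal Python/NumPy sketch (not from the original slides; the source distributions, sample size, and variable names are illustrative choices) that generates two nongaussian independent components and mixes them with a random square matrix A, giving the observed x = As:

import numpy as np

rng = np.random.default_rng(0)
T = 10000                                   # number of observations (illustrative)

# Two nongaussian sources: a Laplacian (supergaussian) and a uniform (subgaussian)
s = np.vstack([rng.laplace(size=T), rng.uniform(-1, 1, size=T)])
s = (s - s.mean(axis=1, keepdims=True)) / s.std(axis=1, keepdims=True)

A = rng.normal(size=(2, 2))                 # unknown square mixing matrix
x = A @ s                                   # observed mixtures, x = A s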

Basic properties of the ICA model. Must assume: the s_i are mutually independent; the s_i are nongaussian. For simplicity: the matrix A is square. The s_i are defined only up to a multiplicative constant. The s_i are not ordered. 5

ICA and decorrelation. First approach: decorrelate the variables. Whitening or sphering: decorrelate and normalize so that E{xx^T} = I. Simple to do by eigenvalue decomposition of the covariance matrix. But decorrelation uses only the correlation matrix: about n^2/2 equations, while A has n^2 elements. Not enough information! 6
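
A minimal sketch of whitening by eigenvalue decomposition of the covariance matrix might look as follows (illustrative Python/NumPy; the function name whiten and the rows-as-variables data layout are assumptions of this sketch):

import numpy as np

def whiten(x):
    """Whiten data with variables on rows: return z with E{zz^T} = I."""
    xc = x - x.mean(axis=1, keepdims=True)
    C = np.cov(xc)                           # covariance matrix of the observations
    d, E = np.linalg.eigh(C)                 # eigenvalue decomposition C = E diag(d) E^T
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T  # whitening matrix V = E D^{-1/2} E^T
    return V @ xc, V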

Independence is better. Fortunately, independence is stronger than uncorrelatedness. For independent variables and any functions h_1, h_2 we have E{h_1(y_1)h_2(y_2)} - E{h_1(y_1)}E{h_2(y_2)} = 0. (3) Still, decorrelation ("whitening") is usually done before ICA for various technical reasons. For example: after decorrelation and standardization, A can be considered orthogonal. Gaussian data is determined by correlations alone, so the model cannot be estimated for gaussian data. 7

Illustration of whitening Two ICs with uniform distributions: Original variables, observed mixtures, whitened mixtures. Cf. gaussian density: symmetric in all directions. 8

Basic intuitive principle of ICA estimation: a (sloppy version of the) Central Limit Theorem (Donoho, 1981). Consider a linear combination w^T x = q^T s. A sum q_i s_i + q_j s_j is more gaussian than s_i alone. By maximizing the nongaussianity of q^T s, we can find the s_i. Also known as projection pursuit. 9

Marginal and joint densities, uniform distributions. Marginal and joint densities, whitened mixtures of uniform ICs 10

Marginal and joint densities, supergaussian distributions. Whitened mixtures of supergaussian ICs 11

Kurtosis as nongaussianity measure. Problem: how to measure nongaussianity? Definition: kurt(x) = E{x^4} - 3(E{x^2})^2. (4) If the variance is constrained to unity, this is essentially the 4th moment. Simple algebraic properties because it is a cumulant: kurt(s_1 + s_2) = kurt(s_1) + kurt(s_2) (5) kurt(αs_1) = α^4 kurt(s_1) (6) Zero for gaussian random variables, non-zero for most nongaussian random variables. Positive vs. negative kurtosis correspond to typical forms of pdf. 12
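
In code, the kurtosis of Eq. (4) and its additivity for independent variables can be checked numerically (an illustrative Python/NumPy sketch, not part of the slides):

import numpy as np

def kurt(y):
    """kurt(y) = E{y^4} - 3 (E{y^2})^2, as in Eq. (4)."""
    return np.mean(y**4) - 3 * np.mean(y**2)**2

rng = np.random.default_rng(0)
s1 = rng.laplace(size=100000)               # supergaussian: positive kurtosis
s2 = rng.uniform(-1, 1, size=100000)        # subgaussian: negative kurtosis
print(kurt(s1), kurt(s2))                   # one positive, one negative value
print(kurt(s1 + s2), kurt(s1) + kurt(s2))   # additivity for independent variables, Eq. (5)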

Left: Laplacian pdf, positive kurt ("supergaussian"). Right: uniform pdf, negative kurt ("subgaussian"). 13

The extrema of kurtosis. By the properties of kurtosis: kurt(w^T x) = kurt(q^T s) = q_1^4 kurt(s_1) + q_2^4 kurt(s_2). (7) Constrain the variance to equal unity: E{(w^T x)^2} = E{(q^T s)^2} = q_1^2 + q_2^2 = 1. (8) For simplicity, consider kurtoses equal to one. Maxima of kurtosis give the independent components (see figure). General result: the absolute value of kurtosis is maximized by the s_i (Delfosse and Loubaton, 1995). Note: the extrema are orthogonal due to whitening. 14

Optimization landscape for kurtosis. Thick curve is unit sphere, thin curves are contours where kurtosis is constant. 15

[Figure: kurtosis plotted against the angle of w.] Kurtosis as a function of the direction of projection. For positive kurtosis, kurtosis (and its absolute value) is maximized in the directions of the independent components. 16

[Figure: kurtosis plotted against the angle of w.] Case of negative kurtosis: kurtosis is minimized, and its absolute value maximized, in the directions of the independent components. 17

Basic ICA estimation procedure:
1. Whiten the data to give z.
2. Set the iteration count i = 1.
3. Take a random vector w_i.
4. Maximize the nongaussianity of w_i^T z, under the constraints ||w_i||^2 = 1 and w_i^T w_j = 0 for j < i (by a suitable algorithm, see later; a sketch follows below).
5. Increment the iteration count by one and go back to step 3.
Alternatively: maximize over all the w_i in parallel, keeping them orthogonal. 18
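
As referred to in step 4, here is a minimal sketch of the deflation procedure on whitened data z, using gradient ascent on the absolute value of kurtosis as the nongaussianity measure (step size, iteration counts, and function names are illustrative assumptions of this sketch; FastICA, described later, is the more practical choice):

import numpy as np

def deflation_ica(z, n_components, n_steps=200, mu=0.1, seed=0):
    """Estimate ICs one by one from whitened data z (rows = signals)."""
    rng = np.random.default_rng(seed)
    W = []
    for _ in range(n_components):
        w = rng.normal(size=z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(n_steps):
            y = w @ z
            k = np.mean(y**4) - 3.0                 # kurtosis of the unit-variance projection
            grad = 4.0 * (z * y**3).mean(axis=1)    # gradient of E{(w^T z)^4} w.r.t. w
            w = w + mu * np.sign(k) * grad          # ascend the absolute value of kurtosis
            for v in W:                             # stay orthogonal to earlier w_j (step 4)
                w -= (w @ v) * v
            w /= np.linalg.norm(w)                  # keep ||w|| = 1
        W.append(w)
    return np.array(W)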

Why kurtosis is not optimal. Sensitive to outliers: consider a sample of 1000 values with unit variance, with one value equal to 10. The kurtosis is then at least 10^4/1000 - 3 = 7. For supergaussian variables, the statistical performance is not optimal even without outliers. Other measures of nongaussianity should be considered. 19
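
The outlier example can be verified numerically; the following small Python/NumPy check (with a hypothetical random sample, purely for illustration) reproduces the lower bound quoted above:

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)
y = (y - y.mean()) / y.std()                 # roughly unit variance, kurtosis near zero
y[0] = 10.0                                  # a single outlier
print(np.mean(y**4) - 3 * np.mean(y**2)**2)  # well above the bound 10**4/1000 - 3 = 7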

Differential entropy as nongaussianity measure. A generalization of the ordinary discrete Shannon entropy: H(x) = -E{log p(x)}. (9) For fixed variance, it is maximized by the gaussian distribution. Often normalized to give the negentropy J(x) = H(x_gauss) - H(x). (10) Good statistical properties, but computationally difficult. 20

Approximation of negentropy. Approximations of negentropy (Hyvärinen, 1998): J_G(x) = (E{G(x)} - E{G(x_gauss)})^2 (11) where G is a nonquadratic function. This generalizes (the square of) kurtosis, which corresponds to G(x) = x^4. A good compromise? The statistical properties are not bad (for a suitable choice of G), and it is computationally simple. 21
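
A sketch of this approximation with the common choice G(x) = log cosh x is given below (Python/NumPy; estimating E{G(x_gauss)} by sampling and the function name are illustrative shortcuts of this sketch):

import numpy as np

def negentropy_approx(y, G=lambda u: np.log(np.cosh(u)), n_ref=10**6, seed=0):
    """J_G(y) = (E{G(y)} - E{G(y_gauss)})^2 for standardized y, as in Eq. (11)."""
    y = (y - y.mean()) / y.std()
    g_ref = G(np.random.default_rng(seed).normal(size=n_ref)).mean()  # E{G(x_gauss)}
    return (G(y).mean() - g_ref) ** 2

rng = np.random.default_rng(1)
print(negentropy_approx(rng.laplace(size=50000)))   # clearly larger for a supergaussian variable
print(negentropy_approx(rng.normal(size=50000)))    # near zero for a gaussian variable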

Maximum likelihood estimation (Pham and Garrat, 1997). Log-likelihood of the model (with W = Â^{-1}): L = \sum_{t=1}^{T} \sum_{i=1}^{n} log p_{s_i}(w_i^T x(t)) + T log |det W|. (12) Equivalent to the infomax approach in neural networks. Needs estimates of the p_{s_i}, but these need not be exact at all. Roughly: consistent if p_{s_i} is of the right type (sub- or supergaussian). 22
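
For illustration, the log-likelihood (12) for a candidate W can be evaluated as below, assuming a supergaussian density with log p_s(s) = -log cosh(s) up to an additive constant (an assumption of this sketch, not a choice made on the slides):

import numpy as np

def ica_loglik(W, x, log_ps=lambda s: -np.log(np.cosh(s))):
    """L = sum_t sum_i log p_s(w_i^T x(t)) + T log |det W|, up to a constant."""
    T = x.shape[1]
    y = W @ x                                   # y_i(t) = w_i^T x(t)
    return log_ps(y).sum() + T * np.log(abs(np.linalg.det(W)))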

Maximum likelihood and nongaussianity. If W is constrained to be orthogonal (whitened data), and the densities of the components are consistently estimated: lim L = -T \sum_{i=1}^{n} H(w_i^T x) + const. (13) where H is the differential entropy. This is a sum of nongaussianities (note the minus sign)! A rigorous derivation of the maximization of nongaussianities. 23

Overview of ICA estimation principles. The basic approach is maximizing the nongaussianity of the ICs, which is roughly equivalent to MLE. Basic choice: the nonquadratic function in the nongaussianity measure: for kurtosis, the fourth power; for entropy/likelihood, the log of the density; for approximations of entropy, G(s) = log cosh s or others. One-by-one estimation vs. estimation of the whole model. Estimates constrained to be white vs. no constraint. 24

Algorithms (1): adaptive gradient methods. Gradient methods for one-by-one estimation are straightforward. Stochastic gradient ascent for the likelihood (Bell and Sejnowski, 1995): ΔW ∝ (W^{-1})^T + g(Wx)x^T (14) with g = (log p_s)'. Problem: needs a matrix inversion! Better: natural/relative gradient ascent of the likelihood (Amari, 1998; Cardoso and Laheld, 1996): ΔW ∝ [I + g(y)y^T]W (15) with y = Wx. Obtained by multiplying the gradient by W^T W. 25
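
A minimal batch version of the natural-gradient update (15), with g(y) = -tanh(y) corresponding to a supergaussian density assumption (learning rate, iteration count, and initialization are illustrative), might look like this:

import numpy as np

def natural_gradient_ica(x, n_iter=200, mu=0.01, seed=0):
    """Batch natural-gradient ascent of the ICA likelihood with g(y) = -tanh(y).
    Works best on centered (and preferably whitened) data x with variables on rows."""
    rng = np.random.default_rng(seed)
    n, T = x.shape
    W = np.eye(n) + 0.1 * rng.normal(size=(n, n))
    for _ in range(n_iter):
        y = W @ x
        # Delta W proportional to [I + g(y) y^T] W, expectations estimated over the batch
        W = W + mu * (np.eye(n) + (-np.tanh(y)) @ y.T / T) @ W
    return W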

Algorithms (2): the FastICA fixed-point algorithm (Hyvärinen, 1999). An approximate Newton method in block (batch) mode. No matrix inversion, but still quadratic (or cubic) convergence. No parameters to be tuned. For a single IC (whitened data): w ← E{x g(w^T x)} - E{g'(w^T x)}w, then normalize w, where g is the derivative of G. For the likelihood: W ← W + D_1[D_2 + E{g(y)y^T}]W, then orthonormalize W. The FastICA MATLAB package is on the WWW (The FastICA Team, 1998). 26
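
A sketch of the one-unit fixed-point iteration on whitened data, with the common nonlinearity g(u) = tanh(u) (the convergence test, iteration limit, and function name are illustrative assumptions of this sketch):

import numpy as np

def fastica_one_unit(z, n_iter=100, seed=0):
    """w <- E{z g(w^T z)} - E{g'(w^T z)} w, then normalize (whitened data z)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ z
        g, g_prime = np.tanh(y), 1.0 - np.tanh(y)**2
        w_new = (z * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < 1e-8:    # converged (up to sign)
            return w_new
        w = w_new
    return w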

[Figure: kurtosis values against iteration count.] Convergence of FastICA. Vectors after 1 and 2 iterations, values of kurtosis. 27

[Figure: kurtosis values against iteration count.] Convergence of FastICA (2). Vectors after 1 and 2 iterations, values of kurtosis. 28

Stability/reliability analysis. Are the components an algorithm gives reliable, really there? All optimization methods are prone to getting stuck in local minima. [Figure: two example runs, panels a) and b).] Reliability should be analyzed by running the algorithm many times from random initial points: the Icasso software package (Himberg et al., 2004). The statistical reliability of components could also be analyzed by bootstrapping. Computation of p-values is still a subject for future research. 29

Relations to other methods (1): Projection pursuit (Friedman, 1987; Huber, 1985) Projection pursuit is a method for visualization and exploratory data analysis. Attempts to show clustering structure of data by finding interesting projections. PCA is not designed to find clustering structure. Interestingness is usually measured by nongaussianity. For example, bimodal distributions are very nongaussian. 30

Illustration of projection pursuit. The projection pursuit direction is horizontal, the principal component vertical. 31

Relations to other methods (2). Factor analysis: ICA is a nongaussian (usually noise-free) version. Blind deconvolution: obtained by constraining the mixing matrix. Principal component analysis: often the same applications, but very different statistical principles. 32

Basic ICA estimation: conclusions. ICA is very simple as a model: a linear nongaussian latent variable model. Estimation is not so simple due to nongaussianity: objective functions cannot be quadratic. Estimation by maximizing the nongaussianity of the independent components, more or less equivalent to maximum likelihood estimation. Algorithms: adaptive (natural gradient) vs. block/batch mode (FastICA). Choice of nonlinearity: cubic (kurtosis) vs. non-polynomial functions. For more information, see e.g. (Hyvärinen and Oja, 2000; Hyvärinen et al., 2001b; Cardoso, 1998a; Amari and Cardoso, 1997). 33

Blind source separation using time dependencies 34

Using autocorrelations for ICA estimation. Take the basic linear mixture model x(t) = As(t). (16) It cannot be estimated in general (e.g. for gaussian random variables). Usually in ICA, we assume the s_i to be nongaussian: higher-order statistics provide the missing information. Alternatively: assume the s_i are time-dependent signals and use time correlations to give more information. For example, a lagged covariance matrix measures the covariances of lagged signals: C_x^τ = E{x(t)x(t - τ)^T}. (17) 35

The AMUSE algorithm for using autocorrelations (Tong et al., 1991; Molgedey and Schuster, 1994). Basic principle: decorrelate each signal y = Wx with the other signals, lagged as well as not lagged. In other words: E{y_i(t) y_j(t - τ)} = 0 for all i ≠ j. To do this: 1. Whiten the data to obtain z(t) = Vx(t). 2. Find an orthogonal transformation W so that the lagged covariance matrix of y(t) = Wz(t) is the identity. This is a matrix diagonalization problem, C_x^τ = E{x(t)x(t - τ)^T} = E{As(t)s(t - τ)^T A^T} = A C_s^τ A^T, for a (more or less) symmetric matrix. 36
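
A compact sketch of this procedure in Python/NumPy (the single lag and the symmetrization of the lagged covariance matrix before the eigendecomposition are choices of this sketch, not prescriptions from the slides):

import numpy as np

def amuse(x, tau=1):
    """Separate sources with distinct autocorrelations via one lagged covariance."""
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T          # whitening matrix
    z = V @ x
    C_tau = z[:, tau:] @ z[:, :-tau].T / (z.shape[1] - tau)
    C_sym = (C_tau + C_tau.T) / 2                    # symmetrize the lagged covariance
    _, W = np.linalg.eigh(C_sym)                     # orthogonal W diagonalizes C_sym
    return W.T @ z, W.T @ V                          # estimated sources, unmixing matrix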

Pros and cons of separation by autocorrelations. Very fast to compute: a single eigenvalue decomposition, like PCA. Can only separate ICs with different autocorrelations, because the lagged covariance matrix must have distinct eigenvalues. Some improvement can be achieved by using several lags in the algorithm (Belouchrani et al., 1997), but if the signals have identical Fourier spectra, autocorrelations simply cannot separate them. 37

Combining nongaussianity and autocorrelations. The best results should be obtained by using both kinds of information, e.g. by modelling the temporal structure of the signals with ARMA models. Basic case: the linear non-gaussian autoregressive model (Hyvärinen, 2001b) s_i(t) = a s_i(t - 1) + n_i(t). (18) Straightforward formulation of the likelihood. Use a parametric model for the distribution of the innovation process. 38

Estimation using variance nonstationarity (Matsuoka et al., 1995). An alternative to autocorrelations (and nongaussianity): the variance changes slowly over time. [Figure: a signal whose variance changes slowly over 10000 time points.] Similar to autoregressive conditional heteroscedasticity (ARCH) models for each independent component. This gives enough information to estimate the model (Pham and Cardoso, 2001; Hyvärinen, 2001a). 39

Unifying autoregressive model (Hyvärinen, 2005). A simple model that incorporates the three properties of nongaussianity, distinct autocorrelations, and a smoothly changing nonstationary variance. Model each s_i by an autoregressive model s_i(t) = \sum_{τ>0} α_τ^i s_i(t - τ) + n_i(t) (19) where the innovation term n_i(t) is nongaussian and its variance can be nonstationary. 40
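
For intuition, one source from this unifying model can be simulated as below; the AR coefficients, the Laplacian innovations, and the slowly varying variance profile are all illustrative choices of this sketch, not values from the lecture:

import numpy as np

rng = np.random.default_rng(0)
T, alphas = 5000, [0.6, 0.2]                    # illustrative AR(2) coefficients
sigma = np.exp(np.sin(np.linspace(0, 20, T)))   # smoothly changing (nonstationary) variance
n = sigma * rng.laplace(size=T)                 # nongaussian innovations
s = np.zeros(T)
for t in range(2, T):
    s[t] = alphas[0] * s[t-1] + alphas[1] * s[t-2] + n[t]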

Coding complexity as a general theoretical framework (Pajunen, 1998). A more general approach: minimize coding (Kolmogorov) complexity. Find a decomposition y = Wx so that the y_i are easy to code. For whitened data z and an orthogonal W: minimize the sum of the coding lengths of the y = Wz. If only marginal distributions are used, the coding length is given by entropy, i.e. nongaussianity. If only autocorrelations are used, the coding length is related to the autocorrelations. Signals are easy to code if they are nongaussian, or have time dependencies, or have nonstationary variances. 41

Convolutive ICA. Often the signals do not arrive at the sensors at the same time. Or: latent events are not immediately manifested in the observed variables. There may be echoes as well (multi-path phenomena). Include convolution in the model: x_i(t) = \sum_{j=1}^{n} a_{ij}(t) * s_j(t) = \sum_{j=1}^{n} \sum_{k} a_{ij}(k) s_j(t - k), for i = 1,...,n. (20) In theory: estimation by the same principles as ordinary ICA. In practice: a huge number of parameters, since the (de)convolving filters may be very long, so special methods may need to be used. FastICA can be adapted (Douglas et al., 2005). 42

Modelling dependencies between components 43

Relaxing independence. For most data sets, the estimated components are not very independent. In fact, independent components cannot be found in general by a linear transformation. We attempt to model some of the remaining dependencies. The basic models group the components: Multidimensional ICA, and Independent Subspace Analysis. 44

Multidimensional ICA (Cardoso, 1998b). One approach to relaxing independence: the s_i can be divided into n-tuples, such that the s_i inside a given n-tuple may be dependent on each other, while dependencies between different n-tuples are not allowed. Every n-tuple corresponds to a subspace. 45

Invariant-feature subspaces (Kohonen, 1996). Linear filters (as in ICA) necessarily lack any invariance. Invariant-feature subspaces are an abstract approach to representing invariant features. Principle: an invariant feature is a linear subspace in a feature space. The value of the invariant feature is given by the (squared) norm of the projection on that subspace, \sum_{i=1}^{k} (w_i^T x)^2. (21) 46
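
In code, the value of one invariant feature is simply Eq. (21), the sum of squared projections onto the subspace filters (a minimal illustrative sketch; the function name and data layout are assumptions):

import numpy as np

def subspace_feature(W_sub, x):
    """Value of one invariant feature: sum_i (w_i^T x)^2 over the subspace filters.
    W_sub has the k subspace filters as rows; x has the data vectors as columns."""
    return np.sum((W_sub @ x) ** 2, axis=0)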

Independent Subspace Analysis (Hyvärinen and Hoyer, 2000) Combination of multidimensional ICA and invariant-feature subspaces. The probability density inside each subspace is spherically symmetric, i.e. depends only on the norm of the projection. Simplifies the model considerably. The nature of the invariant features is not specified. 47

[Diagram: network interpretation of independent subspace analysis. The input I is projected onto linear filters w_1,...,w_8, the projections <w_i, I> are squared, and the squares are summed within each subspace.] 48

Problem: dependencies still remain. A linear decomposition often does not give independence, even for subspaces. The remaining dependencies could be visualized or else utilized. Components can be decorrelated, so only higher-order correlations are interesting. How to visualize them? E.g. using topographic order. 49

Extending the model to include topography. Instead of having unordered components, they are arranged on a two-dimensional lattice. [Figure: lattice of components; nearby components are dependent, distant ones independent.] The components are typically sparse, but not independent. Nearby components have higher-order correlations. 50

Dependence through local variances. Related to ARCH models, where the variance variables are shared. Components are independent given their variances. In our model, the variances are not independent; instead they are correlated for nearby components, e.g. generated by another ICA model with topographic mixing. [Figure panels: independent / topographic variance dependence.] 51

Two signals that are independent given their variances. 52

Topographic ICA model (Hyvärinen et al., 2001a). [Diagram of the two-layer generative model: u_i → φ → σ_i → s_i → A → x.] Variance-generating variables u_i are generated randomly and mixed linearly inside their topographic neighbourhoods. The mixtures are transformed using a nonlinearity φ, thus giving the variances σ_i of the s_i. Finally, ordinary linear mixing. 53

Approximation of likelihood. The likelihood of the model is intractable. Approximation: \sum_{t=1}^{T} \sum_{j=1}^{n} G( \sum_{i=1}^{n} h(i,j) (w_i^T x(t))^2 ) + T log |det W| (22) where h(i,j) is the neighborhood function and G is a nonlinear function. A generalization of independent subspace analysis. A function of local energies only! 54
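
A sketch of evaluating the approximate log-likelihood (22) is given below; the one-dimensional neighborhood h(i,j), the choice G(u) = -sqrt(u + ε), and the function name are all assumptions of this sketch:

import numpy as np

def topographic_ica_loglik(W, x, width=1, G=lambda u: -np.sqrt(u + 1e-6)):
    """sum_t sum_j G( sum_i h(i,j) (w_i^T x(t))^2 ) + T log |det W|."""
    n, T = W.shape[0], x.shape[1]
    idx = np.arange(n)
    h = (np.abs(idx[:, None] - idx[None, :]) <= width).astype(float)  # neighborhood h(i,j)
    energies = (W @ x) ** 2                    # (w_i^T x(t))^2, shape n x T
    local = h.T @ energies                     # local energies around each component j
    return G(local).sum() + T * np.log(abs(np.linalg.det(W)))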

Example of blind source separation with topographic ICA. [Figure: 20 estimated components plotted over about 16000 time points.] 55

Independent subspace analysis vs. topographic ICA. In ISA, single components are not independent, but subspaces are. In topographic ICA, dependencies are modelled continuously; there is no strict division into subspaces. Topographic ICA is a generalization of ISA, incorporating the invariant-feature subspace principle as invariant-feature neighbourhoods. 56

Double-blind source separation (Hyvärinen and Hurri, 2004). Using time dependencies can help in separating independent components. In fact, it is not necessary to know or model how the variances depend on each other. Theorem: maximization of \sum_{i,j} [ cov( (w_i^T x(t))^2, (w_j^T x(t - Δt))^2 ) ]^2 (23) under the constraint of orthogonality of W gives the original sources. Assumption 1: the sources are dependent only through their variances, as in topographic ICA. Assumption 2: x(t) is spatially and temporally whitened. Assumption 3: the matrix K_{ij} = cov(s_i^2(t), s_j^2(t - Δt)) is of full rank. 57
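
The objective in the theorem can be evaluated as below (an illustrative Python/NumPy sketch; maximizing it under the orthogonality constraint on W would still require a suitable optimizer, which is not shown):

import numpy as np

def variance_dependency_objective(W, x, lag=1):
    """sum_{i,j} cov((w_i^T x(t))^2, (w_j^T x(t - lag))^2)^2 for whitened x."""
    y2 = (W @ x) ** 2                           # squared projections
    a = y2[:, lag:] - y2[:, lag:].mean(axis=1, keepdims=True)
    b = y2[:, :-lag] - y2[:, :-lag].mean(axis=1, keepdims=True)
    C = a @ b.T / a.shape[1]                    # lagged covariance matrix of the squares
    return np.sum(C ** 2)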

Final summary. ICA is a very simple model, and simplicity implies wide applicability. A nongaussian alternative to PCA or factor analysis. Decorrelation or whitening is only half of ICA; the other half uses the higher-order statistics of nongaussian variables. Alternatively, separation is possible using time dependencies: linear autocorrelations, or smoothly changing variances (ARCH). Since dependencies cannot always be cancelled, subspace or topographic versions may be useful. Nongaussianity is beautiful!? 58

References

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276.
Amari, S.-I. and Cardoso, J.-F. (1997). Blind source separation: semiparametric statistical approach. IEEE Trans. on Signal Processing, 45(11):2692-2700.
Bell, A. and Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159.
Belouchrani, A., Meraim, K. A., Cardoso, J.-F., and Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Trans. on Signal Processing, 45(2):434-444.
Cardoso, J.-F. (1998a). Blind signal separation: statistical principles. Proceedings of the IEEE, 86(10):2009-2025.
Cardoso, J.-F. (1998b). Multidimensional independent component analysis. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'98), Seattle, WA.
Cardoso, J.-F. and Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12):3017-3030.
Delfosse, N. and Loubaton, P. (1995). Adaptive blind separation of independent sources: a deflation approach. Signal Processing, 45:59-83.
Donoho, D. L. (1981). On minimum entropy deconvolution. In Applied Time Series Analysis II, pages 565-608. Academic Press.
Douglas, S., Sawada, H., and Makino, S. (2005). A spatio-temporal FastICA algorithm for separating convolutive mixtures. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2005), Philadelphia, PA. IEEE Press.
Friedman, J. (1987). Exploratory projection pursuit. J. of the American Statistical Association, 82(397):249-266.
Himberg, J., Hyvärinen, A., and Esposito, F. (2004). Validating the independent components of neuroimaging time-series via clustering and visualization. NeuroImage, 22(3):1214-1222.
Huber, P. (1985). Projection pursuit. The Annals of Statistics, 13(2):435-475.
Hyvärinen, A. (1998). New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems, volume 10, pages 273-279. MIT Press.
Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626-634.
Hyvärinen, A. (2001a). Blind source separation by nonstationarity of variance: A cumulant-based approach. IEEE Transactions on Neural Networks, 12(6):1471-1474.
Hyvärinen, A. (2001b). Complexity pursuit: Separating interesting components from time-series. Neural Computation, 13(4):883-898.
Hyvärinen, A. (2005). A unifying model for blind separation of independent sources. Signal Processing, 85(7):1419-1427.
Hyvärinen, A. and Hoyer, P. O. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705-1720.
Hyvärinen, A., Hoyer, P. O., and Inki, M. (2001a). Topographic independent component analysis. Neural Computation, 13(7):1527-1558.
Hyvärinen, A. and Hurri, J. (2004). Blind separation of sources that have spatiotemporal variance dependencies. Signal Processing, 84(2):247-254.
Hyvärinen, A., Karhunen, J., and Oja, E. (2001b). Independent Component Analysis. Wiley Interscience.
Hyvärinen, A. and Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411-430.
Jutten, C. and Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10.
Kohonen, T. (1996). Emergence of invariant-feature detectors in the adaptive-subspace self-organizing map. Biological Cybernetics, 75:281-291.
Matsuoka, K., Ohya, M., and Kawamoto, M. (1995). A neural net for blind separation of nonstationary signals. Neural Networks, 8(3):411-419.
Molgedey, L. and Schuster, H. G. (1994). Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters, 72:3634-3636.
Pajunen, P. (1998). Blind source separation using algorithmic information theory. Neurocomputing, 22:35-48.
Pham, D.-T. and Cardoso, J.-F. (2001). Blind separation of instantaneous mixtures of non stationary sources. IEEE Trans. Signal Processing, 49(9):1837-1848.
Pham, D.-T. and Garrat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. on Signal Processing, 45(7):1712-1725.
The FastICA Team (1998). The FastICA MATLAB package. Available at http://www.cis.hut.fi/projects/ica/fastica/.
Tong, L., Liu, R.-W., Soon, V., and Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Trans. on Circuits and Systems, 38:499-509.