A Short Introduction to Independent Component Analysis with Some Recent Advances
Aapo Hyvärinen
Dept of Computer Science / Dept of Mathematics and Statistics, University of Helsinki
Problem of blind source separation
There are a number of source signals. Due to some external circumstances, only linear mixtures of the source signals are observed. Goal: estimate (separate) the original signals!
Principal component analysis does not recover original signals
A solution is possible: use information on statistical independence to recover the original signals.
Independent Component Analysis (Hérault and Jutten, 1984-1991)
Observed data x_i(t) is modelled using hidden variables s_i(t):
x_i(t) = Σ_{j=1}^{m} a_ij s_j(t),  i = 1,...,n   (1)
or as a matrix decomposition
X = A S   (2)
The matrix of the a_ij is a constant parameter called the mixing matrix
The hidden random factors s_i(t) are called independent components or source signals
Problem: estimate both the a_ij and the s_j(t), observing only the x_i(t)
An unsupervised, exploratory approach
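A minimal numpy sketch of the mixing model X = A S of equation (2). The dimensions, the uniform source distribution, and the values of the mixing matrix are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n = 2 observed signals, m = 2 sources, T = 1000 samples.
n_sources, n_samples = 2, 1000

# Nongaussian source signals s_j(t), one per row of S (uniform for illustration).
S = rng.uniform(-1, 1, size=(n_sources, n_samples))

# Constant mixing matrix A of the a_ij (values chosen arbitrarily).
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])

# The ICA model in matrix form: X = A S.
X = A @ S

print(X.shape)  # same shape as S: each row of X is a linear mixture of the sources
```

Only X would be observed in practice; the estimation problem is to recover both A and S from it.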
When can the ICA model be estimated? Must assume: The s i are mutually statistically independent The s i are nongaussian (non-normal) (Optional:) Number of independent components is equal to number of observed variables Then: mixing matrix and components can be identified (Comon, 1994) A very surprising result!
Background: Principal component analysis and factor analysis
Basic idea: find directions Σ_i w_i x_i of maximum variance
Goal: explain the maximum amount of variance with few components
Noise reduction; reduction in computation; easier to interpret (?)
(Vain hope: finds original underlying components)
Why cannot PCA or FA find source signals (original components)?
They use only covariances cov(x_i, x_j)
Due to symmetry cov(x_i, x_j) = cov(x_j, x_i), only about n²/2 are available
The mixing matrix has n² parameters
Why cannot PCA or FA find source signals (original components)?
They use only covariances cov(x_i, x_j)
Due to symmetry cov(x_i, x_j) = cov(x_j, x_i), only about n²/2 are available
The mixing matrix has n² parameters
So, not enough information in the covariances
Factor rotation problem: cannot distinguish between (s_1, s_2) and ((s_1 + s_2)/√2, (s_1 - s_2)/√2)
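The factor rotation problem can be seen numerically: rotating two independent gaussian sources leaves the covariance matrix unchanged, so any covariance-based method is blind to the rotation. A small numpy check (sample size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((2, 100_000))   # two independent gaussian sources

# 45-degree rotation: maps (s1, s2) to ((s1+s2)/sqrt(2), (s1-s2)/sqrt(2)).
R = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
S_rot = R @ S

# Both the original and the rotated data have (empirically) the identity
# covariance, so PCA/FA cannot tell them apart.
print(np.cov(S))
print(np.cov(S_rot))
```

Both printed matrices are close to the 2x2 identity, even though the rotated variables are mixtures of the originals.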
ICA uses nongaussianity to find source signals
Gaussian (normal) distribution is completely determined by covariances
But: nongaussian data has structure beyond covariances, e.g. E{x_i x_j x_k x_l} instead of just E{x_i x_j}
Is it reasonable to assume data is nongaussian? Many variables may be gaussian because they are sums of many independent variables (central limit theorem), e.g. intelligence
Fortunately, signals measured by physical sensors are usually quite nongaussian
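One classic way to quantify the higher-order structure is excess kurtosis, E{x⁴}/E{x²}² - 3, which is zero for a gaussian. A small numpy demonstration on three standard distributions (sample size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000

def excess_kurtosis(x):
    """E{x^4}/E{x^2}^2 - 3: zero for a gaussian variable."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2)**2 - 3

gauss   = rng.standard_normal(T)       # excess kurtosis ~ 0
laplace = rng.laplace(size=T)          # super-gaussian (peaked): ~ +3
uniform = rng.uniform(-1, 1, size=T)   # sub-gaussian (flat): ~ -1.2

print(excess_kurtosis(gauss), excess_kurtosis(laplace), excess_kurtosis(uniform))
```

The gaussian sample gives a value near zero while both nongaussian samples give clearly nonzero values, which is exactly the extra information ICA exploits.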
Some examples of nongaussianity in signals
[Figure: three example signals (0-10 seconds) and their histograms, all clearly nongaussian]
Illustration of PCA vs. ICA with nongaussian data Two components with uniform distributions: Original components, observed mixtures, PCA, ICA PCA does not find original coordinates, ICA does!
Illustration of PCA vs. ICA with gaussian data Original components, observed mixtures, PCA Distribution after PCA is the same as distribution before mixing! Factor rotation problem
How is nongaussianity used in ICA? Classic Central Limit Theorem: Average of many independent random variables will have a distribution that is close(r) to gaussian So, roughly: Any mixture of components will be more gaussian than the components themselves
How is nongaussianity used in ICA?
Classic Central Limit Theorem: the average of many independent random variables will have a distribution that is close(r) to gaussian
So, roughly: any mixture of components will be more gaussian than the components themselves
Maximizing the nongaussianity of Σ_i w_i x_i, we can find the s_i. Also known as projection pursuit.
Cf. principal component analysis: maximize the variance of Σ_i w_i x_i.
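The "mixtures are more gaussian" heuristic can be checked directly with excess kurtosis: a mixture of two uniform components has kurtosis closer to the gaussian value of zero than either component alone. A numpy sketch (sample size and mixture weights arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100_000
s1 = rng.uniform(-1, 1, T)   # uniform: excess kurtosis about -1.2
s2 = rng.uniform(-1, 1, T)

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2)**2 - 3

mix = (s1 + s2) / np.sqrt(2)  # an equal-weight mixture of the two components

# |kurtosis| of the mixture (triangular distribution, about -0.6) is smaller
# than that of either source: mixing moves the distribution toward gaussian.
print(excess_kurtosis(s1), excess_kurtosis(mix))
```

This is why maximizing nongaussianity of a projection pulls it toward one of the original components.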
Illustration of changes in nongaussianity
[Figure: histogram and scatterplot of the original uniform distributions vs. histogram and scatterplot of the mixtures given by PCA]
ICA algorithms consist of two ingredients 1. Nongaussianity measure Kurtosis: a classic, but sensitive to outliers. Differential entropy: statistically better, but difficult to compute. Approximations of entropy: a practical compromise. 2. Optimization algorithm Gradient methods: natural gradient, infomax (Bell, Sejnowski, Amari, Cichocki et al 1994-6) Fixed-point algorithm: FastICA (Hyvärinen, 1999)
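Combining the two ingredients, a one-unit FastICA-style estimator can be sketched in a few lines of numpy: whiten the data, then iterate the kurtosis-based fixed point w ← E{z (wᵀz)³} - 3w with normalization. This is a toy sketch on synthetic data, not the full FastICA implementation (mixing matrix, source distribution, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two uniform (nongaussian) sources, mixed with an arbitrary matrix.
S = rng.uniform(-1, 1, (2, 50_000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
X = A @ S

# Whiten: zero mean and identity covariance (standard ICA preprocessing).
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d ** -0.5) @ E.T @ X

# One-unit fixed-point iteration with the kurtosis nonlinearity:
# w <- E{z (w^T z)^3} - 3 w, followed by normalization.
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(100):
    w = (Z * (w @ Z) ** 3).mean(axis=1) - 3 * w
    w /= np.linalg.norm(w)

y = w @ Z  # estimated component: matches one source up to sign and scale
corr = max(abs(np.corrcoef(y, S[0])[0, 1]), abs(np.corrcoef(y, S[1])[0, 1]))
print(corr)  # correlation with the closest original source
```

The printed correlation is close to 1: the fixed point lands on one of the original components, whereas PCA on the same data would not.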
Example of separated components in brain signals (MEG)
[Figure: for nine components, the temporal envelope (arbitrary units) over 0-300 seconds, the Fourier amplitude (arbitrary units) over 5-30 Hz, the distribution over channels, and the phase differences]
(Hyvärinen, Ramkumar, Parkkonen, Hari, NeuroImage, 2010)
Recent Advance 1: Causal analysis
A structural equation model (SEM) analyses causal connections as
x_i = Σ_{j≠i} b_ij x_j + n_i
Cannot be estimated in the gaussian case without further assumptions
Again, nongaussianity solves the problem (Shimizu et al, 2006)
[Figure: example causal network over x1,...,x7 with estimated connection strengths]
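The link between SEM and ICA follows from solving the model equation: x = B x + n gives x = (I - B)⁻¹ n, which is an ICA model with mixing matrix A = (I - B)⁻¹ and the nongaussian disturbances n as sources. A numpy sketch with a hypothetical 3-variable acyclic network (the values of B are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical acyclic SEM: x = B x + n with lower-triangular B
# (x1 causes x2, and x1, x2 cause x3) and nongaussian disturbances.
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.3, 0.5, 0.0]])
N = rng.laplace(size=(3, 10_000))   # nongaussian (laplacian) disturbances

# Solving x = B x + n gives x = (I - B)^{-1} n: an ICA model X = A S
# with A = (I - B)^{-1} and S = N.
A = np.linalg.inv(np.eye(3) - B)
X = A @ N

print(A)  # the implied ICA mixing matrix
```

Because the disturbances are nongaussian, ICA can recover A (and hence B) from X alone, which is the key idea of Shimizu et al's LiNGAM approach.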
Recent Advance 2: Analysing reliability of components (testing)
Algorithmic reliability: are there local minima? Can be analysed by rerunning from different initial points (a)
Statistical reliability: is the result just random/accidental? Can be analysed by bootstrap (b)
Our Icasso package (Himberg et al, 2004) visualizes reliability
[Figure: (a) estimates from different initial points, (b) bootstrap estimates, with clusters of similar component estimates]
New: a proper testing procedure which gives p-values (Hyvärinen, 2011)
Recent Advance 3: Three-way data ("Group ICA")
Often one measures several data matrices X_k, k = 1,...,r, e.g. different conditions, measurement sessions, subjects, etc.
Each matrix could be modelled X_k = A_k S_k, but how to connect the results? (Calhoun, 2001)
a) If the mixing matrices are the same for all X_k, concatenate in time:
(X_1, X_2, ..., X_r) = A (S_1, S_2, ..., S_r)
b) If the component values are the same for all X_k, stack the channels:
[X_1; X_2; ...; X_r] = [A_1; A_2; ...; A_r] S
If both the A_k and the S_k are the same: analyse the average data matrix, or combine ICA with PARAFAC (Beckmann and Smith, 2005).
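The two concatenation schemes correspond directly to `hstack` and `vstack` in numpy. A sketch with made-up dimensions (3 subjects, 2 components, 500 samples), just to show the bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(0)
r, n, T = 3, 2, 500   # hypothetical: 3 subjects, 2 components, 500 samples

# Case (a): one shared mixing matrix A, subject-specific sources S_k.
A = rng.standard_normal((n, n))
S_list = [rng.uniform(-1, 1, (n, T)) for _ in range(r)]
X_list = [A @ S for S in S_list]

# Concatenate in time: (X_1,...,X_r) = A (S_1,...,S_r) is one big ICA problem.
X_time = np.hstack(X_list)
print(X_time.shape)   # (n, r*T)

# Case (b): shared sources S, subject-specific mixing matrices A_k.
S = rng.uniform(-1, 1, (n, T))
A_list = [rng.standard_normal((n, n)) for _ in range(r)]

# Stack in the channel dimension: [X_1; ...; X_r] = [A_1; ...; A_r] S.
X_chan = np.vstack([Ak @ S for Ak in A_list])
print(X_chan.shape)   # (r*n, T)
```

Either concatenated matrix can then be fed to an ordinary ICA algorithm, which is what makes group ICA practical.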
Recent Advance 4: Dependencies between components
In fact, estimated components are often not independent: ICA does not have enough parameters to force independence
Many authors model correlations of squares, or simultaneous activity
[Figure: two signals that are uncorrelated but whose squares are correlated]
On-going work on even linearly correlated components (Sasaki et al, 2011)
Alternatively, in parallel time series, the innovations of a VAR model could be independent (Gómez-Herrero et al, 2008)
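Signals that are uncorrelated but have correlated squares are easy to construct: modulate two independent gaussian signals with a shared variance envelope. A numpy sketch (the envelope distribution is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000

# Shared variance envelope modulating two otherwise independent signals:
# the signals stay uncorrelated, but their squares (energies) co-vary.
sigma = np.abs(rng.standard_normal(T)) + 0.1   # common envelope, always positive
s1 = sigma * rng.standard_normal(T)
s2 = sigma * rng.standard_normal(T)

corr_signals = np.corrcoef(s1, s2)[0, 1]        # near 0: linearly uncorrelated
corr_squares = np.corrcoef(s1**2, s2**2)[0, 1]  # clearly positive
print(corr_signals, corr_squares)
```

This "simultaneous activity" structure is invisible to plain ICA, which is why models of square correlations are needed.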
Recent Advance 5: Better estimation of the basic linear mixing
For time signals, we can do ICA on time-frequency decompositions (Pham, 2002; Hyvärinen et al, 2010)
If the data is by its very nature non-negative, we could impose the same in the model (Hoyer, 2004): zero must have some special meaning as a baseline, e.g. Fourier spectra
More precise modelling of the nongaussianity of the components could also improve estimation.
Summary
ICA is a very simple model: simplicity implies wide applicability.
A nongaussian alternative to PCA or factor analysis.
Finds a linear decomposition by maximizing the nongaussianity of the components.
These (hopefully) correspond to the original sources.
Recent advances:
Causal analysis, or structural equation modelling, using ICA
Testing of independent components for statistical significance
Group ICA, i.e. ICA on three-way data
Modelling dependencies between components
Improvements in estimating the basic linear mixing model
Nongaussianity is beautiful!?