A Short Introduction to Independent Component Analysis
Aapo Hyvärinen
Helsinki Institute for Information Technology and Depts. of Computer Science and Psychology, University of Helsinki
Problem of blind source separation
There are a number of source signals. Due to some external circumstances, only linear mixtures of the source signals are observed. Goal: estimate (separate) the original signals!
Principal component analysis does not recover the original signals. A solution is nevertheless possible: use information on statistical independence to recover them.
Independent Component Analysis (Hérault and Jutten, 1984-1991)
Observed data $x_i$ is modelled using hidden variables $s_j$:

$x_i = \sum_{j=1}^{m} a_{ij} s_j, \quad i = 1, \dots, n \qquad (1)$

The matrix of the $a_{ij}$ is a constant parameter called the mixing matrix. The hidden random factors $s_j$ are called independent components or source signals. Problem: estimate both the $a_{ij}$ and the $s_j$, observing only the $x_i$. An unsupervised, exploratory approach.
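As a concrete illustration of the model in Eq. (1), here is a minimal simulation sketch in Python; the dimensions, the mixing matrix, and the uniform sources are arbitrary choices for illustration, not part of the original formulation.

```python
# Minimal simulation of the ICA model x = A s (Eq. 1), with made-up
# dimensions n = m = 2 and an arbitrary mixing matrix.
import numpy as np

rng = np.random.default_rng(0)
m, T = 2, 10000                       # two sources, 10000 samples

# Nongaussian, independent sources (uniform, zero mean, unit variance)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(m, T))

# Constant mixing matrix a_ij (unknown in practice)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])

X = A @ S                             # observed mixtures x_i = sum_j a_ij s_j
# ICA must estimate both A and S from X alone.
```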
When can the ICA model be estimated? Must assume:
The $s_i$ are mutually statistically independent.
The $s_i$ are nongaussian (non-normal).
(Optional:) The number of independent components equals the number of observed variables.
Then the mixing matrix and the components can be identified (Comon, 1994). A very surprising result!
Related methods: Principal component analysis and factor analysis
Basic idea: find directions $\sum_i w_i x_i$ of maximum variance.
Goal: explain the maximum amount of variance with few components.
Benefits: noise reduction, reduction in computation, easier to interpret (?)
Why cannot PCA or FA find the source signals (original components)?
They use only the covariances $\mathrm{cov}(x_i, x_j)$. Due to the symmetry $\mathrm{cov}(x_i, x_j) = \mathrm{cov}(x_j, x_i)$, only about $n^2/2$ distinct values are available, while the mixing matrix has $n^2$ parameters. So there is not enough information in the covariances.
Factor rotation problem: cannot distinguish between $(s_1, s_2)$ and $(\frac{\sqrt{2}}{2} s_1 + \frac{\sqrt{2}}{2} s_2,\; \frac{\sqrt{2}}{2} s_1 - \frac{\sqrt{2}}{2} s_2)$.
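To make the factor rotation problem concrete, the following sketch (with a made-up sample size) checks numerically that the rotated gaussian pair above has the same covariance, and hence the same distribution, as the original pair.

```python
# Factor rotation problem: for independent standardized gaussian data,
# the 45-degree rotation from the slide leaves the covariance (and thus
# the whole distribution) unchanged, so covariance-based methods cannot
# tell the two coordinate systems apart.
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 100000))   # independent gaussian s1, s2

c = np.sqrt(2) / 2
R = np.array([[c,  c],                 # (√2/2 s1 + √2/2 s2,
              [c, -c]])                #  √2/2 s1 − √2/2 s2)
s_rot = R @ s

print(np.cov(s))                       # ≈ identity
print(np.cov(s_rot))                   # ≈ identity as well
```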
Deep connection between covariances and gaussianity
The gaussian (normal) distribution is completely determined by its covariances; no additional information can be obtained. Using only covariances and treating the data as gaussian are thus closely related. Uncorrelated gaussian variables (i.e. zero covariance) are independent. An uncorrelated, standardized gaussian distribution is the same in all directions.
ICA uses nongaussianity to find the source signals
Nongaussian data has structure beyond covariances, e.g. fourth-order moments $E\{x_i x_j x_k x_l\}$ instead of just $E\{x_i x_j\}$. Catch: this only works when the components are nongaussian. Many psychological variables may be gaussian because they are sums of many independent variables (central limit theorem). Fortunately, signals measured by sensors are usually quite nongaussian.
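A small numerical illustration of this point, using gaussian vs. laplacian samples (an arbitrary choice of nongaussian distribution): the two match in second-order moments but differ clearly in fourth-order statistics.

```python
# Nongaussian data carries information beyond covariances: gaussian and
# laplacian samples with identical variance differ in excess kurtosis,
# a fourth-order statistic.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
g = rng.standard_normal(100000)                    # gaussian, variance 1
l = rng.laplace(scale=1 / np.sqrt(2), size=100000) # laplacian, variance 1

print(np.var(g), np.var(l))        # second moments match (both ≈ 1)
print(kurtosis(g), kurtosis(l))    # excess kurtosis ≈ 0 vs. ≈ 3
```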
Some examples of nongaussianity in signals
[Figure: three example signals and their histograms, showing different nongaussian distributions.]
Illustration of using nongaussianity
Two components with uniform distributions: original components, observed mixtures, PCA, ICA. PCA does not find the original coordinates; ICA does!
Illustration of the problem with gaussian distributions
Original components, observed mixtures, PCA. The distribution after PCA is the same as the distribution before mixing! This is the factor rotation problem.
How is nongaussianity used in ICA?
Classic Central Limit Theorem: the average of many independent random variables has a distribution that is close(r) to gaussian. So, roughly: any mixture of the components will be more gaussian than the components themselves. By maximizing the nongaussianity of $\sum_i w_i x_i$, we can find the $s_i$. This is also known as projection pursuit. Cf. principal component analysis, which maximizes the variance of $\sum_i w_i x_i$.
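The following sketch checks this CLT intuition numerically: uniform components have strongly negative excess kurtosis, and an arbitrary mixture of them is measurably closer to gaussian (kurtosis closer to zero). The mixing weights below are made up.

```python
# Mixtures are more gaussian than the components: a uniform source has
# excess kurtosis -1.2, while a mixture of two uniforms is closer to 0.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
s1 = rng.uniform(-1, 1, 100000)
s2 = rng.uniform(-1, 1, 100000)
mix = 0.7 * s1 + 0.3 * s2          # arbitrary mixing weights

print(kurtosis(s1))    # ≈ -1.2 (uniform is strongly nongaussian)
print(kurtosis(mix))   # ≈ -0.9, closer to 0: the mixture is more gaussian
```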
Illustration of changes in nongaussianity
[Figure: histogram and scatterplot of the original uniform distributions; histogram and scatterplot of the mixtures given by PCA.]
ICA algorithms consist of two ingredients
1. Nongaussianity measure
Kurtosis: a classic, but sensitive to outliers.
Differential entropy: statistically better, but difficult to compute; equivalent to maximizing likelihood or network information flow.
Approximations of entropy: a good compromise.
2. Optimization algorithm
Gradient methods: natural gradient, infomax (Bell, Sejnowski, Amari, Cichocki et al., 1994-6).
Fast fixed-point algorithm, FastICA (Hyvärinen, 1999); a minimal sketch follows below.
One-by-one estimation vs. simultaneous estimation.
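Below is a minimal one-unit sketch of the FastICA fixed-point iteration on whitened data, using the tanh nonlinearity as an entropy approximation. It is an illustrative reduction, not the full algorithm: no deflation for multiple components and no symmetric estimation.

```python
# One-unit FastICA fixed-point iteration (after Hyvärinen, 1999) on
# whitened data, with the tanh nonlinearity; a sketch, not a full library.
import numpy as np

def whiten(X):
    """Center and whiten: returns Z with cov(Z) = I."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(Xc))
    return E @ np.diag(d ** -0.5) @ E.T @ Xc

def fastica_one_unit(Z, max_iter=200, tol=1e-8, seed=0):
    """Estimate one independent component from whitened data Z (n x T)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        y = w @ Z                                   # current projection
        g, g_prime = np.tanh(y), 1 - np.tanh(y) ** 2
        # Fixed-point step: w <- E{z g(w'z)} - E{g'(w'z)} w, then normalize
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1) < tol:           # converged up to sign
            return w_new
        w = w_new
    return w

# Usage with the mixtures X from the earlier simulation sketch:
# Z = whiten(X); w = fastica_one_unit(Z); s_est = w @ Z
```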
Practical note: Combining ICA with FA/PCA
Almost always done in practice: first, find a small number of factors with PCA or FA; then, perform ICA on those factors. ICA is then a method of factor rotation, very different from varimax etc., which do not use the statistical structure and cannot find the original components (in most cases). PCA also reduces noise in the signals and reduces computation.
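One way to realize this PCA-then-ICA recipe is with scikit-learn (a plausible implementation choice, not one specified in the talk); the data matrix and the choice of 10 factors below are placeholders.

```python
# PCA for dimension reduction, then ICA as a rotation of the factors.
# `data` (samples x variables) and n_components=10 are stand-ins.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 50))    # placeholder for real data

factors = PCA(n_components=10).fit_transform(data)          # reduce first
components = FastICA(n_components=10,
                     random_state=0).fit_transform(factors) # then rotate
# `components` are the ICA-rotated factors.
```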
Practical note 2: Time-filtering of data
The ICA model still holds with the same matrix A. Filtering each observed signal,

$\tilde{x}_i(t) = f(t) * x_i(t) \qquad (2)$
$= \sum_{\tau} f(\tau)\, x_i(t - \tau) \qquad (3)$

the filtered data follows the same mixing model:

$\tilde{x}_i(t) = \sum_{j} a_{ij} \tilde{s}_j(t) \qquad (4)$

Try to find a frequency band in which the source signals are as independent and nongaussian as possible.
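A quick numerical check of Eqs. (2)-(4): applying the same linear filter to every observed signal commutes with the mixing, so the mixing matrix A is unchanged. The filter and the signals below are arbitrary examples.

```python
# Time-filtering the mixtures equals mixing the filtered sources with
# the same A, because convolution is linear.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
S = rng.laplace(size=(2, 1000))            # nongaussian source signals
X = A @ S

f = np.array([0.25, 0.5, 0.25])            # some FIR filter f(tau)

def filt(M):
    """Apply the filter f along the time axis of each row of M."""
    return np.apply_along_axis(lambda r: np.convolve(r, f, mode="same"), 1, M)

print(np.allclose(filt(X), A @ filt(S)))   # True: same mixing matrix A
```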
Recent advance: Causal analysis
In a structural equation model (SEM), causal directions can be analyzed:

$x_i = \sum_{j \neq i} b_{ij} x_j + n_i$

[Figure: estimated causal graph over variables x1-x7, with edge coefficients.]

This cannot be estimated in the gaussian case without further assumptions. Again, nongaussianity solves the problem (Shimizu et al., 2006): our LiNGAM package.
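A sketch of why nongaussianity enables causal analysis: an acyclic linear SEM x = Bx + n can be rewritten as the ICA model x = (I − B)⁻¹ n, which LiNGAM exploits. The coefficients below are made up and unrelated to the figure.

```python
# A linear SEM with acyclic B and nongaussian disturbances is an ICA
# model in disguise: x = (I - B)^{-1} n, with mixing matrix (I - B)^{-1}.
import numpy as np

rng = np.random.default_rng(0)
B = np.array([[0.0, 0.0, 0.0],             # strictly lower-triangular:
              [0.8, 0.0, 0.0],             # x1 -> x2
              [0.0, 0.5, 0.0]])            # x2 -> x3
N = rng.laplace(size=(3, 10000))           # nongaussian disturbances n_i

X = np.linalg.inv(np.eye(3) - B) @ N       # generated observations
# Running ICA on X recovers (I - B)^{-1} up to permutation and scaling,
# from which LiNGAM reconstructs the causal matrix B.
```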
Conclusions
ICA is very simple as a model: a linear, nongaussian hidden-variables model. It solves the blind source separation problem if the data (components) are nongaussian. It works by maximizing the nongaussianity of the components, which is radically different from PCA. It can be applied in almost any field where we have continuous-valued variables, e.g. electro-/magnetoencephalograms, functional magnetic resonance imaging.