1 Independent Component Analysis
Background paper: http://www-stat.stanford.edu/~hastie/papers/ica.pdf
2 ICA Problem
$X = AS$, where:
- $X$ is a random $p$-vector representing multivariate input measurements;
- $S$ is a latent source $p$-vector whose components are independently distributed random variables;
- $A$ is a $p \times p$ mixing matrix.
Given realizations $x_1, x_2, \ldots, x_N$ of $X$, the goals of ICA are to:
- estimate $A$;
- estimate the source distributions $S_j \sim f_{S_j}$, $j = 1, \ldots, p$.
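As a concrete illustration of the model (my own sketch, not from the slides; all names are illustrative), the following R code generates data from $X = AS$ with two independent uniform sources:

```r
## Minimal sketch of the ICA generative model X = AS.
## Two independent uniform sources (non-Gaussian), mixed by a random A.
set.seed(1)
N <- 1000; p <- 2
S <- matrix(runif(N * p, -sqrt(3), sqrt(3)), N, p)  # mean 0, variance 1 sources
A <- matrix(rnorm(p * p), p, p)                     # random p x p mixing matrix
X <- S %*% t(A)                                     # rows are x_i = A s_i
## ICA observes only X and must recover both A and the source densities.
```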
3 Cocktail Party Problem
In a room there are $p$ independent sources of sound, and $p$ microphones placed around the room hear different mixtures.
[Figure: Source Signals, Measured Signals, PCA Solution, ICA Solution]
Here each of the measured signals $x_{ij} = x_j(t_i)$ and the recovered sources are time series sampled uniformly at times $t_i$.
4 Independent vs Uncorrelated
WLOG we can assume that $E(S) = 0$ and $\mathrm{Cov}(S) = I$, and hence $\mathrm{Cov}(X) = \mathrm{Cov}(AS) = AA^T$.
Suppose $X = AS$ with $S$ unit variance and uncorrelated. Let $R$ be any orthogonal $p \times p$ matrix. Then $X = AS = AR^T RS = A^*S^*$, with $A^* = AR^T$, $S^* = RS$, and $\mathrm{Cov}(S^*) = I$.
It is not enough to find uncorrelated variables, since they are not unique under rotations. Hence methods based on second-order moments, like principal components and Gaussian factor analysis, cannot recover $A$. ICA uses independence and non-Gaussianity of $S$ (e.g., higher-order moments) to recover $A$.
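A quick numerical check of the rotation argument (an illustrative sketch, not from the slides): rotating unit-variance independent sources leaves them exactly uncorrelated, so second moments cannot distinguish $S$ from $RS$.

```r
## Rotation indeterminacy: Cov(RS) = R Cov(S) R^T = I for any orthogonal R.
set.seed(2)
N <- 10000
S <- matrix(sign(rnorm(2 * N)) * rexp(2 * N) / sqrt(2), N, 2)  # unit-variance Laplace sources
theta <- pi / 6
R <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)  # a rotation
round(cov(S), 2)           # ~ identity
round(cov(S %*% t(R)), 2)  # also ~ identity: covariances can't tell them apart
```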
5 Independent vs Uncorrelated Demo
[Figure: Source S, Data X, PCA Solution, ICA Solution]
Principal components are uncorrelated linear combinations of $X$, chosen to successively maximize variance. Independent components are also uncorrelated linear combinations of $X$, chosen to be as independent as possible.
6 Example: ICA Representation of Natural Images
Pixel blocks are treated as vectors, and the collection of such vectors for an image forms an image database. ICA can lead to a sparse coding for the image, using a natural basis.
See http://www.cis.hut.fi/projects/ica/imageica/ (Patrik Hoyer and Aapo Hyvärinen, Helsinki University of Technology).
7 Example: ICA and EEG Data
See http://www.cnl.salk.edu/~tewon/ica_cnl.html (Scott Makeig, Salk Institute).
8 Approaches to ICA
The ICA literature is HUGE. The recent book by Hyvärinen, Karhunen & Oja (Wiley, 2001) is a great source for learning about ICA, and for some good computational tricks.
- Mutual information and entropy, maximizing non-Gaussianity: FastICA (HKO 2001), Infomax (Bell and Sejnowski, 1995)
- Likelihood methods: ProDenICA (Hastie and Tibshirani), discussed later
- Nonlinear decorrelation: $Y_1$ independent of $Y_2$ iff $\max_{f,g} \mathrm{Corr}[f(Y_1), g(Y_2)] = 0$ (Hérault-Jutten, 1984); KernelICA (Bach and Jordan, 2001)
- Tensorial moment methods
9 Simplification
Let $\Sigma = \mathrm{Cov}(X)$ and suppose $X = AS$, with $B = A^{-1}$. Then we can write $B = W\Sigma^{-1/2}$ for some nonsingular $W$. Then $S = BX = W\Sigma^{-1/2}X$, with $\mathrm{Cov}(S) = I$ and $W$ orthonormal.
So operationally we sphere the data, $X^* = \Sigma^{-1/2}X$, and then seek an orthogonal matrix $W$ so that the components of $S = WX^*$ are independent.
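The sphering step can be implemented directly from the eigendecomposition of the sample covariance; here is a minimal sketch (my own, with illustrative names), reusing the simulated X from the earlier sketch:

```r
## Sphering ("whitening"): X* = Sigma^{-1/2} X, so that Cov(X*) = I and
## the remaining search is over orthogonal matrices W.
sphere <- function(X) {
  X <- scale(X, center = TRUE, scale = FALSE)    # center the data
  e <- eigen(cov(X), symmetric = TRUE)
  Sig.inv.sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
  X %*% Sig.inv.sqrt                             # rows are x_i* = Sigma^{-1/2} x_i
}
Xs <- sphere(X)     # X from the simulation sketch above
round(cov(Xs), 2)   # ~ identity
```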
10 Entropy and Mutual Information
Entropy: $H(Y) = -\int f(y)\log f(y)\,dy$, maximized by $f(y) = \phi(y)$, the Gaussian (for fixed variance).
Mutual information: $I(Y) = \sum_{j=1}^p H(Y_j) - H(Y)$, where
- $Y$ is a random vector with joint density $f(y)$ and entropy $H(Y)$;
- $H(Y_j)$ is the (marginal) entropy of component $Y_j$, with marginal density $f_j(y_j)$.
$I(Y)$ is the Kullback-Leibler divergence between $f(y)$ and its independence version $\prod_{j=1}^p f_j(y_j)$ (which is the KL-closest of all independence densities to $f(y)$). Hence $I(Y)$ is a measure of dependence between the components of a random vector $Y$.
11 Entropy, Mutual Information, and ICA
If $\mathrm{Cov}(X) = I$ and $W$ is orthogonal, then simple calculations show
$I(WX) = \sum_{j=1}^p H(w_j^T X) - H(X).$
Hence minimizing $I(WX)$ over $W$ amounts to:
- minimizing the dependence between the $w_j^T X$;
- minimizing the sum of the entropies of the $w_j^T X$;
- maximizing the departures from Gaussianity of the $w_j^T X$.
Many methods for ICA look for low-entropy or non-Gaussian projections (Hyvärinen, Karhunen & Oja, 2001). There are strong similarities with projection pursuit (Friedman and Tukey, 1974).
12 Negentropy and FastICA
Negentropy: $J(Y_j) = H(Z_j) - H(Y_j)$, where $Z_j$ is a Gaussian RV with the same variance as $Y_j$. It measures the departure from Gaussianity.
FastICA uses simple approximations to negentropy:
$J(w_j^T X) \approx [E\,G(w_j^T X) - E\,G(Z_j)]^2,$
with expectations replaced by sample averages on the data. They use $G(y) = \frac{1}{a}\log\cosh(ay)$, $1 \le a \le 2$.
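A small sketch of this criterion (mine, not the slides'): the sample version of the negentropy approximation for a single projection, with the logcosh $G$. The CRAN package fastICA provides the full fixed-point algorithm; the call in the final comment assumes its standard interface.

```r
## Sample negentropy approximation J(w'X) ~ [mean G(w'X) - E G(Z)]^2,
## with G(y) = (1/a) log cosh(a y) and Z standard Gaussian.
G <- function(y, a = 1) log(cosh(a * y)) / a
EGZ <- mean(G(rnorm(1e6)))                 # Monte Carlo estimate of E G(Z)
negentropy <- function(w, Xs, a = 1) {
  y <- Xs %*% (w / sqrt(sum(w^2)))         # projection onto a unit direction
  (mean(G(y, a)) - EGZ)^2                  # large => strongly non-Gaussian
}
## The full algorithm is in the CRAN package fastICA, e.g.
## fastICA(X, n.comp = 2, fun = "logcosh", alpha = 1)
```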
13 ICA Density Model
Under the ICA model, the density of $S$ is
$f_S(s) = \prod_{j=1}^p f_j(s_j),$
with each $f_j$ a univariate density with mean 0 and variance 1. Next we develop two direct ways of fitting this density model.
14 Discrimination from Gaussian
An unsupervised vs supervised idea. Consider the case of $p = 2$ variables.
Idea: the observed data are assigned to class $G = 1$, and a background sample is generated from a spherical Gaussian $g_0(x)$ and assigned to class $G = 0$. We fit to these data a projection pursuit model of the form
$\log \frac{\Pr(G = 1)}{1 - \Pr(G = 1)} = f_1(a_1^T x) + f_2(a_2^T x), \quad (1)$
where $[a_1, a_2]$ is an orthonormal basis. This implies the data density model
$g(x) = g_0(x)\, e^{f_1(a_1^T x)}\, e^{f_2(a_2^T x)}.$
One can also show that $g_0(x)$ factors into a product $h_1(a_1^T x)\, h_2(a_2^T x)$, since $[a_1, a_2]$ is an orthonormal basis.
15 Hence model (1) is a product density for the data. It also looks like a neural network. We can fit it by (a) sphering ("whitening") the data, and then (b) applying the ppr function in R, combined with IRLS (iteratively reweighted least squares); a rough sketch follows.
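Here is one way that recipe might look in code (my own construction following the slide, with illustrative names): augment the sphered data with a Gaussian background sample, then alternate IRLS steps around weighted ppr fits. Note that ppr does not enforce orthonormality of $[a_1, a_2]$, so this is only an approximation to model (1).

```r
## Fit model (1) by logistic IRLS wrapped around R's ppr (2 ridge terms).
## Xs is the sphered data from the earlier sketch.
set.seed(3)
Z  <- rbind(Xs, matrix(rnorm(length(Xs)), nrow(Xs)))   # data + Gaussian background
y  <- rep(1:0, each = nrow(Xs))                        # G = 1 for data, 0 for background
eta <- rep(0, length(y))                               # current log-odds fit
for (it in 1:10) {                                     # a fixed number of IRLS steps
  mu <- 1 / (1 + exp(-eta))
  w  <- mu * (1 - mu)                                  # IRLS weights
  zz <- eta + (y - mu) / w                             # working response
  fit <- ppr(Z, zz, weights = w, nterms = 2)           # f1(a1'x) + f2(a2'x)
  eta <- fitted(fit)
}
```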
16 Projection Pursuit Solutions
17 ICA Density Model - direct fitting
Density of $X = AS$:
$f_X(x) = |\det B| \prod_{j=1}^p f_j(b_j^T x), \quad B = A^{-1}.$
Log-likelihood, given $x_1, x_2, \ldots, x_N$:
$\ell(B, \{f_j\}_1^p) = N\log|\det B| + \sum_{i=1}^N \sum_{j=1}^p \log f_j(b_j^T x_i).$
18 ICA Density Model - fitting ctd
As before, let $\Sigma = \mathrm{Cov}(X)$; then for any $B$ we can write $B = W\Sigma^{-1/2}$ for some nonsingular $W$. Then $S = BX$ and $\mathrm{Cov}(S) = I$ imply $W$ is orthonormal. So we sphere the data, $X^* = \Sigma^{-1/2}X$, and seek an orthonormal $W$.
Let
$f_j(s_j) = \phi(s_j)\,e^{g_j(s_j)},$
a tilted Gaussian density. Here $\phi$ is the standard Gaussian density, and $g_j$ satisfies the normalization conditions. Then
$\ell(W, \Sigma, \{g_j\}_1^p) = -\frac{N}{2}\log|\Sigma| + \sum_{i=1}^N \sum_{j=1}^p \left[\log\phi(w_j^T x_i^*) + g_j(w_j^T x_i^*)\right].$
We estimate $\hat\Sigma$ by the sample covariance of the $x_i$, and ignore it thereafter: "pre-whitening" in the ICA literature.
19 ProDenICA Algorithm
Maximizes the log-likelihood on the previous slide over $W$ and $g_j(\cdot)$, $j = 1, 2, \ldots, p$, with smoothness constraints on the $g_j(\cdot)$. Details on the next three slides.
20 Restrictions on $g_j$
Our model is still over-parameterized. We maximize instead a penalized log-likelihood:
$\sum_{j=1}^p \left[ \frac{1}{N}\sum_{i=1}^N \left[ \log\phi(w_j^T x_i) + g_j(w_j^T x_i) \right] - \lambda_j \int \{g_j'''(t)\}^2\,dt \right]$
w.r.t. $W^T = (w_1, w_2, \ldots, w_p)$ and $g_j$, $j = 1, \ldots, p$, subject to:
- $W^T W = I$;
- each $f_j(s) = \phi(s)e^{g_j(s)}$ is a density, with mean 0 and variance 1.
The solutions $\hat g_j$ are piecewise quartic splines.
21 ProDenICA: Product Density ICA Algorithm
1. Initialize $W$ (random Gaussian matrix followed by orthogonalization).
2. Alternate until convergence of $W$, measured by the Amari metric:
(a) Given $W$, optimize the penalized log-likelihood w.r.t. $g_j$ (separately for each $j$), using the penalized density estimation algorithm.
(b) Given $g_j$, $j = 1, \ldots, p$, perform one step of a fixed point algorithm towards finding the optimal $W$.
22 Penalized Density Estimation for Step (a)
- Discretize the data.
- Fit a generalized additive model with the Poisson family (see text Chapter 9); a sketch follows.
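As an illustration of this Poisson trick (my sketch; the original used Splus software, and I use mgcv::gam as a stand-in smoother): the expected count in a bin of width $\delta$ at $s$ is roughly $N\delta\,\phi(s)e^{g(s)}$, so $g_j$ can be estimated by penalized Poisson regression with the Gaussian baseline entering as an offset. The normalization conditions on $g_j$ are not enforced here.

```r
## Step (a) sketch: penalized density estimation via discretization + Poisson GAM.
library(mgcv)
proj  <- Xs %*% c(1, 0)                       # projections w_j' x_i* for one j
grid  <- seq(min(proj) - 1, max(proj) + 1, length = 100)
delta <- grid[2] - grid[1]
counts <- hist(proj, breaks = c(grid - delta / 2, max(grid) + delta / 2),
               plot = FALSE)$counts           # bin counts on the grid
off <- log(length(proj) * delta * dnorm(grid))  # Gaussian baseline as offset
fit <- gam(counts ~ s(grid) + offset(off), family = poisson)
g.hat <- fit$linear.predictors - off          # estimated tilt g_j on the grid
```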
23 Simulations
[Figure: the 18 source distributions (a-r), taken from Bach and Jordan (2001).]
- Each distribution was used to generate 2-dim $S$ and $X$ (with random $A$); 30 reps each.
- 300 reps with 4-dim $S$, with the distributions of the $S_j$ picked at random.
- Compared FastICA (homegrown Splus version), KernelICA (Francis Bach's matlab version), and ProDenICA (Splus).
24 Amari Metric
We evaluate solutions by comparing their estimated $W$ with the true $W_0$, using the Amari metric (HKO 2001, Bach and Jordan 2001):
$d(W_0, W) = \frac{1}{2p}\sum_{i=1}^p \left( \frac{\sum_{j=1}^p |r_{ij}|}{\max_j |r_{ij}|} - 1 \right) + \frac{1}{2p}\sum_{j=1}^p \left( \frac{\sum_{i=1}^p |r_{ij}|}{\max_i |r_{ij}|} - 1 \right),$
where $r_{ij} = (W_0 W^{-1})_{ij}$.
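The Amari metric transcribes directly into R; here is a sketch (my own code of the formula above):

```r
## Amari distance between the true W0 and an estimate W; it is 0 iff
## W0 W^{-1} is a scaled permutation matrix (ICA's unavoidable ambiguity).
amari <- function(W0, W) {
  p <- nrow(W)
  r <- abs(W0 %*% solve(W))                          # |r_ij|
  row.term <- sum(rowSums(r) / apply(r, 1, max) - 1) # sum over rows i
  col.term <- sum(colSums(r) / apply(r, 2, max) - 1) # sum over columns j
  (row.term + col.term) / (2 * p)
}
```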
25 2-Dim Examples
[Figure: boxplots of the Amari distance from the true $W$ (log scale, roughly 0.01-0.50) for each distribution a-r, comparing FastICA, KernelICA, and ProDenICA.]
26 4-Dim Examples
[Figure: boxplots of the Amari distance from the true $W$ (scale 0.0-2.0), with means (SE) of 0.25 (0.014), 0.08 (0.004), and 0.14 (0.011) for FastICA, KernelICA, and ProDenICA, respectively.]
27 [Figure: comparison of the FastICA, KernelICA, and ProDenICA solutions.]