Independent Component Analysis
Seungjin Choi
Department of Computer Science, Pohang University of Science and Technology, Korea
seungjin@postech.ac.kr
March 4, 2009
Outline
1 Theory and Preliminaries for ICA: Model, Theory
2 Algorithms for ICA: Criteria, Unsupervised Learning Algorithms, Algebraic Algorithms
3 Beyond ICA
4 Applications of ICA
Books on ICA
- T.-W. Lee, Independent Component Analysis, 1998.
- S. Haykin, Unsupervised Adaptive Filtering, Volumes 1 and 2, 2001.
- A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, 2001.
- A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing, 2002.
ICALAB Toolbox
ICALAB is a Matlab toolbox containing various ICA algorithms. Check out http://www.bsp.brain.riken.jp/icalab
What is ICA?
ICA is a statistical method whose goal is to decompose multivariate data x ∈ Rⁿ into a linear sum of statistically independent components, i.e.,
x = s₁a₁ + s₂a₂ + ... + sₙaₙ = As,
where the {sᵢ} are coefficients (sources, latent variables, encoding variables) and the {aᵢ} are basis vectors.
Constraint: the coefficients {sᵢ} are statistically independent.
Goal: learn the basis vectors A from data samples {x(1), ..., x(N)} alone.
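The data model above can be made concrete with a toy example. This is a minimal sketch, not part of the slides: the sources (one sub-Gaussian, one super-Gaussian) and the mixing matrix A are illustrative choices.

```python
import numpy as np

# Toy instance of the ICA data model x = A s: two independent
# non-Gaussian sources mixed by a hypothetical 2x2 matrix A.
rng = np.random.default_rng(0)
N = 1000
# Independent sources: uniform (sub-Gaussian) and Laplacian (super-Gaussian).
S = np.vstack([rng.uniform(-1, 1, N), rng.laplace(0, 1, N)])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])   # basis vectors a_i are the columns of A
X = A @ S                    # observed mixtures, one sample per column
print(X.shape)               # (2, 1000)
```

ICA observes only X and must recover both A and S, using the independence of the rows of S as the sole constraint.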
ICA vs. PCA
Both: linear transform; compression (dimensionality reduction); classification (feature extraction).
PCA: second-order statistics (Gaussian); linear orthogonal transform; optimal coding in the MS sense.
ICA: higher-order statistics (non-Gaussian); linear non-orthogonal transform; related to projection pursuit (non-Gaussian is interesting); better features for classification?
An Example of PCA vs ICA
[Figure: scatter plots of the same data with basis vectors overlaid. (a) PCA, (b) ICA]
Two Aspects of ICA
Blind source separation
- Acoustic source separation (cocktail-party speech recognition)
- Biomedical data analysis (EEG, ECG, MEG, fMRI, PET)
- Digital communications (multiuser detection, blind equalization, MIMO channels)
Information representation (e.g., feature extraction)
- Natural sound/image statistics
- Computer vision (e.g., face recognition/detection)
- Empirical data analysis (stock market returns, gene expression data, etc.)
- Data visualization (lower-dimensional embedding)
Blind Source Separation
s → [Mixing A] → x → [Demixing W] → y    (A unknown)
Mixing: x = As.
Demixing: y = Wx.
An Example of EEG
[Figure: (c) raw EEG, (d) after ICA]
Transparent Transformation
Given a set of observed data X = [x(1), ..., x(N)] generated from unknown sources s through an unknown linear transform A, i.e., x = As, the task of blind source separation is to restore the sources s by estimating the mixing matrix A. To this end, we construct a demixing matrix W such that the elements of y = Wx are statistically independent. Imposing independence on the {yᵢ} leads to
y = WAs = PΛs,
where P is a permutation matrix and Λ is a scaling matrix. The transformation PΛ is referred to as a transparent transformation. For example,
[y₁]   [ 0   0   λ₃][s₁]
[y₂] = [λ₁   0   0 ][s₂]
[y₃]   [ 0   λ₂  0 ][s₃].
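The 3×3 example above can be checked numerically. This is a small sketch with made-up scale factors λᵢ and source values, showing that a transparent transformation is just a permutation combined with a diagonal scaling:

```python
import numpy as np

# Hypothetical scale factors and sources for illustration.
lam = np.array([2.0, -0.5, 3.0])
P = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)   # permutation matrix
s = np.array([1.0, 2.0, 3.0])

# P @ diag(lam) reproduces the matrix on this slide:
# y_1 = lam_3 s_3, y_2 = lam_1 s_1, y_3 = lam_2 s_2.
y = P @ np.diag(lam) @ s
```

Each yᵢ is a rescaled copy of exactly one source, so independence is preserved; this is the ambiguity that blind source separation cannot (and need not) resolve.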
Darmois Theorem
Theorem. Suppose that the random variables s₁, ..., sₙ are mutually independent, and consider two linear combinations
y₁ = α₁s₁ + ... + αₙsₙ,  y₂ = β₁s₁ + ... + βₙsₙ.
If y₁ and y₂ are statistically independent, then every sᵢ with αᵢβᵢ ≠ 0 is Gaussian.
Remark: In other words, assume that at most one of the {sᵢ} is Gaussian and that the mixing matrix has full column rank. Then pairwise independence of the {yᵢ} implies that WA is a transparent transformation.
Algorithms for ICA
Mutual Information Minimization
Mutual information is the relative entropy between the joint distribution and the product of the marginal distributions,
I(y₁, ..., yₙ) = ∫ p(y) log [ p(y) / Πᵢ pᵢ(yᵢ) ] dy = KL[ p(y) ‖ Πᵢ pᵢ(yᵢ) ],
which is always nonnegative, and its minimum is achieved only when the yᵢ are independent. Note that p(y) = p(x) / |det W|. This leads to the objective function
J = −log |det W| − Σᵢ₌₁ⁿ log pᵢ(yᵢ).
Maximum Likelihood Estimation
Consider a single factor of the log-likelihood,
L = log p(x | A, r) = log ∫ p(x | s, A) r(s) ds = −log |det A| + Σᵢ₌₁ⁿ log rᵢ(sᵢ).
Replacing rᵢ(·) by pᵢ(·), sᵢ by yᵢ, and A by W⁻¹, the negative log-likelihood becomes
−L = −log |det W| − Σᵢ₌₁ⁿ log pᵢ(yᵢ).
Hence maximum likelihood estimation = mutual information minimization in the context of ICA.
An Information Geometrical View of ICA
More Criteria...
Information maximization: Infomax seeks a linear transform such that the output entropy of f(y) = [f₁(y₁), ..., fₙ(yₙ)] is maximized. When fᵢ(·) is the cumulative distribution function of yᵢ, Infomax = MLE = MMI.
Non-Gaussianity maximization
- Negentropy maximization: the negentropy is defined by J(y) = H(y_G) − H(y), where y_G is a Gaussian random vector with the same mean and covariance matrix as y.
- Kurtosis extremization: maximize kurtosis for super-Gaussian sources and minimize it for sub-Gaussian sources.
Learning Algorithms
- Gradient descent/ascent
- Natural gradient (or relative gradient) descent/ascent
- Conjugate gradient
- Newton and quasi-Newton
- Fixed-point iteration
- Relative trust-region optimization
Relative Gradient
The conventional gradient involves the first-order approximation
J(W + E) ≈ J(W) + tr{(∇J)ᵀE},
and searches for a direction that minimizes J(W + E) under a norm constraint ‖E‖ = const.
The relative gradient involves the first-order approximation
J(W + EW) ≈ J(W) + tr{(∇J)ᵀEW} = J(W) + tr{(∇ᵣJ)ᵀE}.
This leads to ∇ᵣJ = (∇J)Wᵀ.
Natural Gradient
Let S_w = {w ∈ Rⁿ} be a parameter space on which an objective function J(w) is defined. If the coordinate system is non-orthogonal, then
‖dw‖² = Σᵢⱼ gᵢⱼ(w) dwᵢ dwⱼ,
where [gᵢⱼ(w)] is a Riemannian metric.
Theorem. The steepest descent direction of J(w) in a Riemannian space is given by the natural gradient
∇_ng J(w) = G⁻¹(w) ∇J(w).
Natural Gradient ICA
It turns out that the natural gradient in the context of ICA has the form
∇_ng J(W) = ∇J(W) WᵀW.
The natural gradient ICA algorithm is of the form
W(t+1) = W(t) + η [ I − ϕ(y(t)) yᵀ(t) ] W(t),
where ϕ(y) = [ϕ₁(y₁), ..., ϕₙ(yₙ)]ᵀ and ϕᵢ(yᵢ) = −d log pᵢ(yᵢ) / dyᵢ.
It shows relatively fast convergence (compared to the conventional gradient) and the equivariance property (uniform performance regardless of the conditioning of A).
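The update above can be sketched in a few lines of numpy. This is a minimal batch-mode illustration, not the lecture's reference implementation: the sources, mixing matrix, step size, and iteration count are all illustrative choices, and ϕ(y) = tanh(y) is the standard super-Gaussian nonlinearity from the next slide.

```python
import numpy as np

# Batch natural-gradient ICA: W <- W + eta * (I - E[phi(y) y^T]) W,
# with phi(y) = tanh(y) for super-Gaussian (here Laplacian) sources.
rng = np.random.default_rng(1)
N, n = 5000, 2
S = rng.laplace(size=(n, N))              # super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])    # "unknown" mixing matrix
X = A @ S

W = np.eye(n)
eta = 0.1
for _ in range(500):
    Y = W @ X
    C = np.tanh(Y) @ Y.T / N              # batch estimate of E[phi(y) y^T]
    W = W + eta * (np.eye(n) - C) @ W

# If separation succeeded, W A is close to a transparent
# transformation P Lambda (one dominant entry per row).
G = W @ A
```

Note the update multiplies the correction into W from the left, so performance depends on WA rather than on A alone; this is the equivariance property mentioned above.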
Hypothesized Distributions
The ICA algorithm requires pᵢ(·); in practice we use a hypothesized distribution.
- Super-Gaussian: ϕᵢ(yᵢ) = sign(yᵢ) or tanh(yᵢ).
- Sub-Gaussian: ϕᵢ(yᵢ) = yᵢ³.
- Switching nonlinearity: ϕᵢ(yᵢ) = yᵢ ± tanh(αyᵢ).
- Flexible ICA: generalized Gaussian distribution.
Generalized Gaussian Distribution
p(y; α) = α / (2λ Γ(1/α)) · exp( −|y/λ|^α ).
If α = 1, the distribution is Laplacian; if α = 2, it is Gaussian.
[Figure: densities p(y) for α = 0.8, 1, 2, 4]
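The density is simple to code up directly. This is a sketch of the formula on this slide (with λ = 1 as the default scale); the function name is my own.

```python
import numpy as np
from math import gamma

# Generalized Gaussian density p(y; alpha) = alpha / (2 lambda Gamma(1/alpha))
# * exp(-|y/lambda|^alpha), as on this slide.
def gen_gaussian_pdf(y, alpha, lam=1.0):
    c = alpha / (2.0 * lam * gamma(1.0 / alpha))
    return c * np.exp(-np.abs(y / lam) ** alpha)

y = np.linspace(-10, 10, 2001)
laplace = gen_gaussian_pdf(y, 1.0)   # alpha = 1: Laplacian
gauss = gen_gaussian_pdf(y, 2.0)     # alpha = 2: Gaussian (variance 1/2 for lam = 1)
```

Sweeping α between sub-Gaussian (α > 2) and super-Gaussian (α < 2) shapes is what makes this family useful for "flexible ICA" on the previous slide.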
Simultaneous Diagonalization
A symmetric matrix R ∈ Rⁿˣⁿ is diagonalized as R = UΣUᵀ.
The whitening transformation seeks a linear transformation such that the correlation matrix of z = Vx is the identity matrix, i.e., E[zzᵀ] = V E[xxᵀ] Vᵀ = I. It is given by V = Σ^(−1/2) Uᵀ, where E[xxᵀ] = UΣUᵀ.
Simultaneous diagonalization aims to diagonalize two symmetric matrices R₁ and R₂ by a single linear transformation:
- whitening transformation + unitary transformation, or
- the generalized eigenvalue problem R₂U = R₁UΣ.
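The whitening step can be sketched directly from its definition. This is an illustrative example with a made-up correlation structure, not a fragment of any particular toolbox:

```python
import numpy as np

# Whitening transformation V = Sigma^{-1/2} U^T, from the spectral
# decomposition of the sample correlation matrix E[x x^T] = U Sigma U^T.
rng = np.random.default_rng(2)
X = rng.standard_normal((3, 2000))
X = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.5]]) @ X        # correlated data (toy example)

R = X @ X.T / X.shape[1]                   # sample correlation matrix
Sigma, U = np.linalg.eigh(R)               # R = U diag(Sigma) U^T
V = np.diag(Sigma ** -0.5) @ U.T           # whitening matrix
Z = V @ X

Rz = Z @ Z.T / Z.shape[1]                  # should be (numerically) the identity
```

After whitening, any remaining indeterminacy is an orthogonal rotation, which is why the algebraic methods below only need to find a unitary transformation in the second stage.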
Symmetric-Definite Pencils
Definition. The set of all matrices of the form R₂ − λR₁ with λ ∈ R is said to be a pencil.
Definition. Pencils (R₁, R₂), where R₂ is symmetric and R₁ is symmetric and positive definite, are referred to as symmetric-definite pencils.
Theorem. If R₂ − λR₁ is symmetric-definite, then there exists a nonsingular matrix U = [u₁, ..., uₙ] such that
Uᵀ R₁ U = diag{γ₁(τ₁), ..., γₙ(τ₁)},   (1)
Uᵀ R₂ U = diag{γ₁(τ₂), ..., γₙ(τ₂)}.   (2)
Moreover, R₂uᵢ = λᵢR₁uᵢ for i = 1, ..., n, with λᵢ = γᵢ(τ₂)/γᵢ(τ₁).
Fundamental Theorem
Theorem. Let Λ₁, D₁ ∈ Rⁿˣⁿ be diagonal matrices with positive diagonal entries and Λ₂, D₂ ∈ Rⁿˣⁿ be diagonal matrices with nonzero diagonal entries. Suppose that G ∈ Rⁿˣⁿ satisfies the decompositions
D₁ = GΛ₁Gᵀ,  D₂ = GΛ₂Gᵀ.
Then G is a generalized permutation matrix, G = PΛ, provided that D₁⁻¹D₂ and Λ₁⁻¹Λ₂ have distinct diagonal entries.
AMUSE
A second-order method that exploits a time-delayed correlation matrix (temporal structure) as well as an equal-time correlation matrix (whitening transformation).
Whitening transformation:
- Compute the equal-time correlation matrix R_x(0) = (1/N) Σₜ₌₁ᴺ x(t)xᵀ(t) and symmetrize it, M_x(0) = (1/2){R_x(0) + R_xᵀ(0)}.
- Compute the spectral decomposition M_x(0) = UΣUᵀ.
- The whitening transformation is z = Σ^(−1/2) Uᵀ x.
Unitary transformation:
- Find a unitary transformation, y = Vᵀz, that diagonalizes E[z(t)zᵀ(t − τ)] = VΛVᵀ.
The demixing matrix is W = Vᵀ Σ^(−1/2) Uᵀ.
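The two stages above fit in a short numpy sketch. This is a toy illustration under my own assumptions: the sources are sinusoids at distinct frequencies (so their lag-τ autocorrelations differ, which AMUSE requires), and the mixing matrix is made up.

```python
import numpy as np

# AMUSE on a toy mixture of two sources with distinct temporal structure.
t = np.arange(5000)
S = np.vstack([np.sin(2 * np.pi * 0.05 * t),
               np.sin(2 * np.pi * 0.003 * t)])
A = np.array([[1.0, 0.7], [0.5, 1.0]])
X = A @ S
N = X.shape[1]

# Stage 1: whitening from the equal-time correlation matrix.
R0 = X @ X.T / N
Sig, U = np.linalg.eigh(R0)
Z = np.diag(Sig ** -0.5) @ U.T @ X

# Stage 2: diagonalize a symmetrized time-delayed correlation (tau = 1)
# of the whitened data with a unitary (orthogonal) eigenbasis.
tau = 1
Rtau = Z[:, :-tau] @ Z[:, tau:].T / (N - tau)
M = (Rtau + Rtau.T) / 2
lam, V = np.linalg.eigh(M)

W = V.T @ np.diag(Sig ** -0.5) @ U.T   # demixing matrix
Y = W @ X                              # recovered sources, up to P Lambda
```

If two sources share the same lag-τ autocorrelation, the eigenvalues in stage 2 coincide and those sources cannot be separated; SOBI (below) mitigates this by using many delays.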
Source Separation via Matrix Pencil
Simultaneous diagonalization can be solved as a generalized eigenvalue problem.
- Compute the two symmetrized correlation matrices M_x(0) and M_x(τ). Note that (M_x(0), M_x(τ)) is a symmetric-definite pencil.
- Find the generalized eigenvector matrix V of the pencil M_x(τ) − λM_x(0), which satisfies M_x(τ)V = M_x(0)VΛ.
The demixing matrix is W = Vᵀ.
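The pencil formulation makes the whole method a single library call. A minimal sketch, reusing the same toy sinusoidal sources as assumptions; `scipy.linalg.eigh(a, b)` solves exactly the symmetric-definite generalized eigenproblem a v = λ b v needed here.

```python
import numpy as np
from scipy.linalg import eigh

# Separation via the symmetric-definite pencil (M_x(0), M_x(tau)):
# solve M_x(tau) v = lambda M_x(0) v and take W = V^T.
t = np.arange(5000)
S = np.vstack([np.sin(2 * np.pi * 0.05 * t),
               np.sin(2 * np.pi * 0.003 * t)])
A = np.array([[1.0, 0.7], [0.5, 1.0]])
X = A @ S

N, tau = X.shape[1], 1
M0 = X @ X.T / N                               # equal-time correlation (positive definite)
R = X[:, :-tau] @ X[:, tau:].T / (N - tau)
Mtau = (R + R.T) / 2                           # symmetrized time-delayed correlation

lam, V = eigh(Mtau, M0)                        # generalized eigenvectors of the pencil
W = V.T
Y = W @ X                                      # recovered sources, up to P Lambda
```

Compared to the explicit two-stage AMUSE procedure, this collapses whitening and rotation into one generalized eigendecomposition.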
SOBI
A second-order method that exploits multiple time-delayed correlation matrices. It seeks a linear transformation W that jointly diagonalizes them (joint approximate diagonalization).
Whitening transformation: z = Σ^(−1/2) Uᵀ x, where R_x(0) = UΣUᵀ.
Unitary transformation: find a unitary joint diagonalizer V of {M_z(τⱼ)} which satisfies
Vᵀ M_z(τⱼ) V = Λⱼ,   (3)
where {Λⱼ} is a set of diagonal matrices.
The demixing matrix is W = Vᵀ Σ^(−1/2) Uᵀ.
More Algebraic Algorithms
- FOBI: 4th-order moment matrix
- JADE: slices of 4th-order cumulant matrices
- SEONS: a generalization of SOBI; quasi-nonstationarity
- More...
Beyond ICA
Variations and Extensions of ICA
- Spatial, temporal, and spatiotemporal ICA
- Independent subspace analysis (ISA)
- Topographic ICA (TICA)
- Nonnegative matrix factorization (NMF)
Independent Subspace Analysis (ISA)
Multidimensional ICA + invariant feature subspaces.
L(W, X) = (1/N) Σₜ₌₁ᴺ Σⱼ₌₁ᴶ log p( Σ_{i∈Fⱼ} (wᵢᵀx(t))² ) + log |det W|.
Pooling in Complex Cells
Topographic ICA
sᵢ = σᵢzᵢ = φ( Σⱼ h(i, j)uⱼ ) zᵢ.
A further extension of ISA, incorporating a topographic representation. Dependencies between nearby components are modelled by higher-order correlations.
Nonnegative Matrix Factorization (NMF)
Parts-based representation
Algorithm for NMF
Find a factorization X ≈ AS subject to Aᵢⱼ ≥ 0 and Sᵢⱼ ≥ 0. For example, we seek a factorization which minimizes the I-divergence, defined by
E = Σᵢⱼ [ Xᵢⱼ log( Xᵢⱼ / (AS)ᵢⱼ ) − Xᵢⱼ + (AS)ᵢⱼ ].
A multiplicative algorithm is given by
Sᵢⱼ ← Sᵢⱼ · [ Σₖ Aₖᵢ Xₖⱼ / (AS)ₖⱼ ] / Σₗ Aₗᵢ,
Aᵢⱼ ← Aᵢⱼ · [ Σₖ Sⱼₖ Xᵢₖ / (AS)ᵢₖ ] / Σₗ Sⱼₗ.
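The multiplicative updates above translate directly into numpy. A minimal sketch with synthetic positive data of exact rank r (the sizes and iteration count are illustrative):

```python
import numpy as np

# Multiplicative NMF updates for the I-divergence, following the
# update rules on this slide.
rng = np.random.default_rng(5)
n, m, r = 20, 30, 4
X = (rng.random((n, r)) + 0.1) @ (rng.random((r, m)) + 0.1)  # positive, rank-r data

A = rng.random((n, r)) + 0.1    # nonnegative initialization
S = rng.random((r, m)) + 0.1

def idiv(X, Y):
    """I-divergence sum_ij [X log(X/Y) - X + Y]."""
    return np.sum(X * np.log(X / Y) - X + Y)

e0 = idiv(X, A @ S)
for _ in range(200):
    AS = A @ S
    S *= (A.T @ (X / AS)) / A.sum(axis=0)[:, None]   # S_ij update
    AS = A @ S
    A *= ((X / AS) @ S.T) / S.sum(axis=1)[None, :]   # A_ij update
e1 = idiv(X, A @ S)
```

Because the updates are multiplicative, A and S stay elementwise nonnegative automatically, and the I-divergence decreases monotonically.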
Applications of ICA
Applications
- Computational/computer vision: natural image statistics, face recognition
- Speech/language: audio source separation, semantic structure of language
- Bioinformatics (gene expression data analysis)
- Dynamic PET image analysis
Compact Coding
Sparse Coding
Learn Statistical Structure of Natural Scenes
Examples of Natural Images
Learned Basis Images: PCA
Learned Basis Images: ICA
Learned Basis Images: ISA
Learned Basis Images: Topographic ICA
Eigenfaces
Factorial Faces
AR Face Database
Eigenfaces vs Factorial Faces
Performance Comparison
Multiple View Images
Learned Face Basis Images: ISA
Learned Face Basis Images: Topographic ICA
Audio Source Separation
ST-NB95 Database
Word Error Rate
Emergence of Linguistic Features and ICA
- A collection of emails sent to the connectionists mailing list.
- One hundred common words were manually selected, and the contextual information was calculated using the 2000 most common types.
- A context matrix C was formed, where Cᵢⱼ represents the number of occurrences of the jth word in the immediate context of the ith word.
Example 1: ICA Features of Contextual Data
Figure: ICA features for neuroscience and psychology.
Example 2: ICA Features of Contextual Data
Figure: ICA features for models and problems.
Example 3: ICA Features of Contextual Data
Figure: ICA features for my, his, our, and their.
Example 4: ICA Features of Contextual Data
Figure: ICA features for will, can, may, and must.
Modern Biologists
We need a different kind of tool!
Gene Expression Data
Data Preparation
Why Linear Models for Gene Expression Data Analysis?
Eigengenes and Eigenarrays
Spatial ICA
Temporal ICA
Performance Comparison: Gene Clustering
[Figure: three scatter plots of −log10(p-value), comparing tICA against stICA, PCA, and sICA]
Temporal Modes
Positron Emission Tomography
What Can PET Do for the Heart?
- Quantify the extent of heart disease.
- Calculate myocardial blood flow or metabolism quantitatively.
Dynamic PET Image Acquisition
Independent Components of Heart PET Images
ICs vs Arterial Sampling