Independent Component Analysis (ICA)
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
http://mlg.postech.ac.kr/~seungjin
Outline
- ICA vs PCA
- Blind source separation
- Darmois theorem
- Algorithms: maximum likelihood ICA, natural gradient vs. relative gradient, Infomax ICA, FastICA
Introduction to ICA
What is ICA?
ICA is a statistical method whose goal is to decompose multivariate data $\mathbf{x} \in \mathbb{R}^D$ into a linear sum of statistically independent components, i.e.,
$$\mathbf{x} = s_1 \mathbf{a}_1 + s_2 \mathbf{a}_2 + \cdots + s_D \mathbf{a}_D = \mathbf{A}\mathbf{s},$$
where the $\{s_i\}$ are coefficients (sources, latent variables, encoding variables) and the $\{\mathbf{a}_i\}$ are basis vectors.
- Constraint: the coefficients $\{s_i\}_{i=1}^{D}$ are assumed to be statistically independent.
- Goal: learn the basis vectors $\mathbf{A}$ from the data $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ alone.
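To make the generative model concrete, here is a minimal numerical sketch (an illustration added here, not from the slides); the choice of Laplacian sources and all variable names are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 3, 10_000
# Independent non-Gaussian (here: Laplacian) sources, one per row.
S = rng.laplace(size=(D, N))
# A non-orthogonal mixing matrix A, unknown in practice.
A = rng.normal(size=(D, D))
# Observed data: each column is x = s_1 a_1 + ... + s_D a_D = A s.
X = A @ S
```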
ICA vs. PCA
Common to both:
- Linear transform
- Dimensionality reduction (compression)
- Feature extraction (representation learning)
PCA:
- Second-order statistics (Gaussian)
- Linear orthogonal transform
- Optimal coding in the mean-square sense
ICA:
- Higher-order statistics (non-Gaussian)
- Linear non-orthogonal transform
- Related to projection pursuit ("non-Gaussian is interesting")
- Better features for classification?
Example: PCA vs ICA
[Figure: two panels, (a) PCA and (b) ICA, on axes spanning roughly -8 to 8.]
Two Aspects of ICA
Blind source separation:
- Acoustic source separation (cocktail-party speech recognition)
- Biomedical data analysis (EEG, ECG, MEG, fMRI, PET)
- Digital communications (multiuser detection, blind equalization, MIMO channels)
Representation learning:
- Natural sound/image statistics
- Computer vision (e.g., face recognition/detection)
- Empirical data analysis (stock market returns, gene expression data, etc.)
- Data visualization (lower-dimensional embedding)
Blind Source Separation
Blind Source Separation
Mixing: $\mathbf{x} = \mathbf{A}\mathbf{s}$. Demixing: we want $\mathbf{y} = \mathbf{W}\mathbf{x}$ to be an estimate of $\mathbf{s}$.
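As a hedged end-to-end illustration (an addition, not part of the slides), scikit-learn's FastICA, one off-the-shelf ICA implementation mentioned in the outline, can estimate a demixing matrix from mixed data generated as in the earlier sketch:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S = rng.laplace(size=(3, 10_000))     # independent sources, as before
X = rng.normal(size=(3, 3)) @ S       # mixed observations x = A s

ica = FastICA(n_components=3, random_state=0)
Y = ica.fit_transform(X.T).T          # rows of Y estimate the sources
W = ica.components_                   # learned demixing matrix: y = W x
# Recovery holds only up to permutation and scaling -- see the next slide.
```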
An Example of EEG
[Figure: (a) raw EEG recordings; (b) the same recordings after ICA.]
Transparent Transformation
Given a set of observed data $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ generated from unknown sources $\mathbf{s}$ through an unknown linear transform $\mathbf{A}$, i.e., $\mathbf{x} = \mathbf{A}\mathbf{s}$, the task of blind source separation is to restore the sources $\mathbf{S}$ by estimating the mixing matrix $\mathbf{A}$. To this end, we construct a demixing matrix $\mathbf{W}$ such that the elements of $\mathbf{y} = \mathbf{W}\mathbf{x}$ are statistically independent. Imposing independence on $\{y_i\}$ leads to
$$\mathbf{y} = \mathbf{W}\mathbf{A}\mathbf{s} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{s},$$
where $\mathbf{P}$ is a permutation matrix and $\boldsymbol{\Lambda}$ is a diagonal scaling matrix. The transformation $\mathbf{P}\boldsymbol{\Lambda}$ is referred to as a transparent transformation. For example,
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 0 & 0 & \lambda_3 \\ \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix}.$$
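A quick numerical check (an illustrative sketch; the specific matrices are arbitrary): if we build $\mathbf{W}$ from a permutation $\mathbf{P}$ and a diagonal scaling $\boldsymbol{\Lambda}$ applied to the exact inverse of $\mathbf{A}$, the product $\mathbf{W}\mathbf{A}$ is exactly the transparent transformation $\mathbf{P}\boldsymbol{\Lambda}$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))          # arbitrary mixing matrix
P = np.eye(3)[[2, 0, 1]]             # a permutation matrix
Lam = np.diag([0.5, -2.0, 3.0])      # a diagonal scaling matrix
W = P @ Lam @ np.linalg.inv(A)       # one "perfect" demixing matrix
print(np.round(W @ A, 6))            # equals P @ Lam: sources come back
                                     # reordered and rescaled, nothing more
```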
Darmois Theorem
Theorem. Suppose that random variables $s_1, \ldots, s_n$ are mutually independent, and consider two linear combinations of the $s_i$,
$$y_1 = \alpha_1 s_1 + \cdots + \alpha_n s_n, \qquad y_2 = \beta_1 s_1 + \cdots + \beta_n s_n.$$
If $y_1$ and $y_2$ are statistically independent, then every $s_i$ with $\alpha_i \beta_i \neq 0$ is Gaussian.
Remark: In other words, assume that at most one of the $\{s_i\}$ is Gaussian, and suppose that the mixing matrix has full column rank. Then pairwise independence among the $\{y_i\}$ implies that $\mathbf{W}\mathbf{A}$ is a transparent transformation.
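The Gaussian exception in the theorem can be seen numerically (an added illustration): a rotation of two independent Gaussians yields components that are still uncorrelated, and for jointly Gaussian variables uncorrelated means independent, so no independence criterion can recover the original axes.

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(2, 200_000))     # two independent Gaussian sources
c, s = np.cos(0.7), np.sin(0.7)
R = np.array([[c, -s], [s, c]])       # an arbitrary rotation as the "mixing"
Y = R @ S
# The mixed components remain uncorrelated (hence, being jointly Gaussian,
# independent) -- the mixing is statistically invisible to ICA.
print(np.round(np.corrcoef(Y), 3))
```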
Algorithms for ICA
- Mutual information minimization
- Maximum likelihood estimation
Mutual Information Minimization
Build a linear model $\mathbf{y} = \mathbf{W}\mathbf{x}$ such that we solve the following optimization:
$$\arg\min_{\mathbf{W}} \; \mathbb{E}_{p(\mathbf{y})}\left[\mathcal{J}(\mathbf{W})\right] = I(y_1, \ldots, y_D),$$
where $I(y_1, \ldots, y_D)$ is the mutual information given by
$$I(y_1, \ldots, y_D) = D_{\mathrm{KL}}\!\left[ p(\mathbf{y}) \,\Big\|\, \prod_i p_i(y_i) \right] = \int p(\mathbf{y}) \log \frac{p(\mathbf{y})}{\prod_i p_i(y_i)} \, d\mathbf{y},$$
which is always nonnegative and attains its minimum only when the $y_i$ are mutually independent. Note that $p(\mathbf{y}) = p(\mathbf{x}) / |\det \mathbf{W}|$, leading to the loss function (to be minimized)
$$\mathcal{J}(\mathbf{W}) = -\log |\det \mathbf{W}| - \sum_{i=1}^{D} \log p_i(y_i).$$
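For concreteness, here is a small sketch (an addition) of the empirical loss under hypothesized Laplacian marginals $p_i(y) = \tfrac{1}{2}e^{-|y|}$; the density choice is an assumption for illustration only.

```python
import numpy as np

def ica_loss(W, X):
    """Empirical ICA loss J(W) = -log|det W| - (1/N) sum_n sum_i log p_i(y_i),
    with hypothesized Laplacian marginals p_i(y) = 0.5 * exp(-|y|)."""
    Y = W @ X                              # y = W x; columns of X are samples
    _, logabsdet = np.linalg.slogdet(W)    # numerically stable log|det W|
    log_p = -np.abs(Y) - np.log(2.0)       # log p_i(y_i), entrywise
    return -logabsdet - log_p.sum(axis=0).mean()
```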
Maximum Likelihood Estimation
Consider the linear model $\mathbf{x} = \mathbf{A}\mathbf{s}$, where the distribution of $\mathbf{x}$ is given by
$$p(\mathbf{x}) = \frac{r(\mathbf{s})}{|\det \mathbf{A}|} = \frac{\prod_{i=1}^{D} r_i(s_i)}{|\det \mathbf{A}|}.$$
Then the log-likelihood of a single data point is given by
$$\mathcal{L} = \log p(\mathbf{x} \mid \mathbf{A}, r) = -\log |\det \mathbf{A}| + \sum_{i=1}^{D} \log r_i(s_i).$$
Replacing $r_i(\cdot) = p_i(\cdot)$ and $\mathbf{A} = \mathbf{W}^{-1}$, the negative log-likelihood becomes
$$-\mathcal{L} = -\log |\det \mathbf{W}| - \sum_{i=1}^{D} \log p_i(y_i),$$
leading to: maximum likelihood estimation = mutual information minimization in the context of ICA.
An Information Geometrical View of ICA
ICA: Gradient Descent Algorithm
The loss function is given by
$$\mathcal{J}(\mathbf{W}) = -\log |\det \mathbf{W}| - \sum_{i=1}^{D} \log p_i(y_i).$$
Define the score function
$$\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{d y_i}$$
and use the relation $\frac{\partial \log |\det \mathbf{W}|}{\partial \mathbf{W}} = \mathbf{W}^{-\top}$ to obtain
$$\nabla \mathcal{J}(\mathbf{W}) = -\mathbf{W}^{-\top} + \boldsymbol{\varphi}(\mathbf{y}) \mathbf{x}^{\top},$$
leading to the following update:
$$\mathbf{W} \leftarrow \mathbf{W} + \eta \left( \mathbf{W}^{-\top} - \boldsymbol{\varphi}(\mathbf{y}) \mathbf{x}^{\top} \right),$$
where $\boldsymbol{\varphi}(\mathbf{y}) = [\varphi_1(y_1), \ldots, \varphi_D(y_D)]^{\top}$.
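A minimal batch implementation of this update (a sketch; the tanh score corresponds to a super-Gaussian source hypothesis, and the step size and iteration count are arbitrary assumptions):

```python
import numpy as np

def gradient_ica(X, n_iter=500, eta=0.01):
    """Plain gradient descent on J(W): W <- W + eta * (W^{-T} - phi(y) x^T)."""
    D, N = X.shape
    W = np.eye(D)
    for _ in range(n_iter):
        Y = W @ X
        phi = np.tanh(Y)                          # score for a super-Gaussian prior
        grad = -np.linalg.inv(W).T + (phi @ X.T) / N
        W -= eta * grad                           # descend the batch-averaged gradient
    return W
```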
Hypothesized Distributions
The ICA algorithm requires the marginals $p_i(\cdot)$, so in practice we use a hypothesized distribution; the resulting score functions are sketched in code after this list.
- Super-Gaussian: $\varphi_i(y_i) = \mathrm{sign}(y_i)$ or $\tanh(y_i)$.
- Sub-Gaussian: $\varphi_i(y_i) = y_i^3$.
- Switching nonlinearity: $\varphi_i(y_i) = y_i \pm \tanh(\alpha y_i)$.
- Flexible ICA: generalized Gaussian distribution, leading to
$$p(y; \alpha) = \frac{\alpha}{2\lambda \Gamma(1/\alpha)} \, e^{-\left|\frac{y}{\lambda}\right|^{\alpha}}, \qquad \varphi_i(y_i) = |y_i|^{\alpha - 1} \, \mathrm{sign}(y_i).$$
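A hedged code rendering of these score functions (an addition; the kurtosis-based sign choice for the switching nonlinearity is one common convention, assuming roughly zero-mean, unit-variance outputs):

```python
import numpy as np

def phi_super(y):                 # super-Gaussian hypothesis
    return np.tanh(y)

def phi_sub(y):                   # sub-Gaussian hypothesis
    return y ** 3

def phi_switch(y, alpha=1.0):     # switching nonlinearity y_i +/- tanh(alpha y_i);
    k = np.mean(y ** 4) - 3.0     # an excess-kurtosis estimate picks the sign:
    return y + np.sign(k) * np.tanh(alpha * y)   # + for super-, - for sub-Gaussian

def phi_flexible(y, alpha=1.0):   # generalized-Gaussian score (flexible ICA)
    return np.abs(y) ** (alpha - 1.0) * np.sign(y)
```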
Natural Gradient
Natural Gradient
Let $\mathcal{S}_w = \{\mathbf{w} \in \mathbb{R}^D\}$ be a parameter space on which an objective function $\mathcal{J}(\mathbf{w})$ is defined. If the coordinate system is nonorthogonal, then
$$\|d\mathbf{w}\|^2 = \sum_{i,j} G_{ij}(\mathbf{w}) \, dw_i \, dw_j,$$
where $G_{ij}(\mathbf{w})$ is the Riemannian metric.
Theorem (Amari, 1998). The steepest descent direction of $\mathcal{J}(\mathbf{w})$ in a Riemannian space is given by
$$\widetilde{\nabla} \mathcal{J}(\mathbf{w}) = \mathbf{G}^{-1}(\mathbf{w}) \, \nabla \mathcal{J}(\mathbf{w}).$$
Natural Gradient ICA
It turns out that the natural gradient in the context of ICA has the form
$$\widetilde{\nabla} \mathcal{J}(\mathbf{W}) = \nabla \mathcal{J}(\mathbf{W}) \, \mathbf{W}^{\top} \mathbf{W}.$$
The natural gradient ICA algorithm is of the form
$$\mathbf{W} \leftarrow \mathbf{W} + \eta \left( \mathbf{I} - \boldsymbol{\varphi}(\mathbf{y}) \mathbf{y}^{\top} \right) \mathbf{W},$$
where $\boldsymbol{\varphi}(\mathbf{y}) = [\varphi_1(y_1), \ldots, \varphi_D(y_D)]^{\top}$ and $\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{d y_i}$.
It enjoys relatively fast convergence (compared to the conventional gradient) and the equivariance property (uniform performance regardless of the conditioning of $\mathbf{A}$).
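The corresponding batch update, as a sketch (with the same assumed tanh score and arbitrary hyperparameters); note that no matrix inversion is needed, which together with equivariance is the practical appeal:

```python
import numpy as np

def natural_gradient_ica(X, n_iter=500, eta=0.01):
    """Natural gradient ICA: W <- W + eta * (I - phi(y) y^T) W (batch average)."""
    D, N = X.shape
    W = np.eye(D)
    for _ in range(n_iter):
        Y = W @ X
        phi = np.tanh(Y)                               # assumed super-Gaussian score
        W += eta * (np.eye(D) - (phi @ Y.T) / N) @ W   # inverse-free update
    return W
```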
Application I: Learn statistical structure of natural scenes
Learn Statistical Structure of Natural Scenes
Examples of Natural Images
Learned Basis Images: PCA
Learned Basis Images: ICA
Application II: Face recognition
Eigenfaces
Factorial Faces [Choi and Lee, 2000]
AR Face Database
Eigenfaces vs Factorial Faces
Performance Comparison
Application III: Fetal ECG
[Figure: (a) raw ECG data, channels x_1 through x_8; (b) independent components y_1 through y_8 obtained by flexible ICA.]