Independent Component Analysis


Seungjin Choi
Department of Computer Science, Pohang University of Science and Technology, Korea
seungjin@postech.ac.kr
March 4, 2009

Outline
1. Theory and Preliminaries for ICA: Model, Theory
2. Algorithms for ICA: Criteria, Unsupervised Learning Algorithms, Algebraic Algorithms
3. Beyond ICA
4. Applications of ICA

Books on ICA
- T.-W. Lee, Independent Component Analysis, 1998.
- S. Haykin, Unsupervised Adaptive Filtering, volumes 1 and 2, 2001.
- A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, 2001.
- A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing, 2002.

ICALAB Toolbox
ICALAB is a Matlab toolbox containing various ICA algorithms. Check out http://www.bsp.brain.riken.jp/icalab

What is ICA?
ICA is a statistical method whose goal is to decompose multivariate data x ∈ R^n into a linear sum of statistically independent components, i.e.,
x = s_1 a_1 + s_2 a_2 + ... + s_n a_n = As,
where {s_i} are coefficients (sources, latent variables, encoding variables) and {a_i} are basis vectors.
Constraint: the coefficients {s_i} are statistically independent.
Goal: learn the basis vectors A from the data samples {x(1), ..., x(N)} alone.
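To make the model concrete, here is a small sketch (not from the slides) that generates two independent non-Gaussian sources, mixes them with a random A, and recovers them with scikit-learn's FastICA, one instance of the fixed-point iteration listed later under learning algorithms; the source waveforms, seeds, and parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian sources (illustrative choices).
s1 = np.sign(np.sin(3 * t))        # square wave
s2 = rng.laplace(size=t.size)      # super-Gaussian noise
S = np.c_[s1, s2]                  # shape (N, 2)

A = rng.normal(size=(2, 2))        # unknown mixing matrix
X = S @ A.T                        # observed mixtures, x = As

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)           # recovered sources, up to permutation and scaling
print(ica.mixing_.shape)           # estimated mixing matrix A, shape (2, 2)
```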

ICA vs. PCA
Both are linear transforms used for compression (dimensionality reduction) and classification (feature extraction).
PCA:
- Second-order statistics (Gaussian)
- Linear orthogonal transform
- Optimal coding in the mean-square sense
ICA:
- Higher-order statistics (non-Gaussian)
- Linear non-orthogonal transform
- Related to projection pursuit (non-Gaussian is interesting)
- Better features for classification?

An Example of PCA vs. ICA
[Figure: (a) PCA, (b) ICA]

Two Aspects of ICA
Blind source separation:
- Acoustic source separation (cocktail-party speech recognition)
- Biomedical data analysis (EEG, ECG, MEG, fMRI, PET)
- Digital communications (multiuser detection, blind equalization, MIMO channels)
Information representation (e.g., feature extraction):
- Natural sound/image statistics
- Computer vision (e.g., face recognition/detection)
- Empirical data analysis (stock market returns, gene expression data, etc.)
- Data visualization (lower-dimensional embedding)

Blind Source Separation
[Diagram: unknown sources s pass through an unknown mixing matrix A to give observations x, which are passed through a demixing matrix W to give outputs y.]
Mixing: x = As. Demixing: y = Wx.

An Example of EEG
[Figure: (c) raw EEG, (d) after ICA]

Transparent Transformation
Given a set of observed data X = [x(1), ..., x(N)] generated from unknown sources s through an unknown linear transform A, i.e., x = As, the task of blind source separation is to restore the sources S by estimating the mixing matrix A. To this end, we construct a demixing matrix W such that the elements of y = Wx are statistically independent. Imposing independence on {y_i} leads to
y = WAs = PΛs,
where P is a permutation matrix and Λ is a scaling matrix. The transformation PΛ is referred to as a transparent transformation. For example,
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 0 & 0 & \lambda_3 \\ \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix}.

Darmois Theorem
Theorem: Suppose that the random variables s_1, ..., s_n are mutually independent. Consider two linear combinations of the s_i,
y_1 = α_1 s_1 + ... + α_n s_n,   y_2 = β_1 s_1 + ... + β_n s_n.
If y_1 and y_2 are statistically independent, then α_i β_i ≠ 0 only when s_i is Gaussian.
Remark: In other words, assume that at most one of the {s_i} is Gaussian and that the mixing matrix has full column rank. Then pairwise independence of the {y_i} implies that WA is a transparent transformation.

Outline, Part 2: Algorithms for ICA (Criteria, Unsupervised Learning Algorithms, Algebraic Algorithms)

Mutual Information Minimization
Mutual information is the relative entropy between the joint distribution and the product of the marginal distributions,
I(y_1, ..., y_n) = \int p(y) \log \frac{p(y)}{\prod_i p_i(y_i)} \, dy = KL\left[ p(y) \,\Big\|\, \prod_i p_i(y_i) \right],
which is always nonnegative, and its minimum is achieved only when the y_i are independent. Note that p(y) = p(x) / |\det W|. This leads to the objective function
J = -\log |\det W| - \sum_{i=1}^{n} \log p_i(y_i).

Maximum Likelihood Estimation
Consider a single factor of the log-likelihood,
L = \log p(x \mid A, r) = \log \int p(x \mid s, A) \, r(s) \, ds = -\log |\det A| + \sum_{i=1}^{n} \log r_i(s_i).
Replacing r_i(·) = p_i(·), s_i = y_i, and A = W^{-1}, the negative log-likelihood becomes
-L = -\log |\det W| - \sum_{i=1}^{n} \log p_i(y_i).
Hence maximum likelihood estimation = mutual information minimization in the context of ICA.
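As a small numerical companion (an illustration, not part of the slides), the function below evaluates this shared objective J(W) = -log|det W| - (1/N) Σ_t Σ_i log p_i(y_i(t)) under an assumed Laplacian source model p_i(y) = 0.5 exp(-|y|):

```python
import numpy as np

def ica_objective(W, X):
    """Negative log-likelihood per sample, J(W) = -log|det W| - mean_t sum_i log p_i(y_i),
    with a Laplacian source model p_i(y) = 0.5 * exp(-|y|) (illustrative assumption).
    X has shape (n, N): n mixtures, N samples."""
    Y = W @ X
    log_prior = np.sum(np.log(0.5) - np.abs(Y), axis=0)  # sum_i log p_i(y_i) per sample
    return -np.linalg.slogdet(W)[1] - np.mean(log_prior)
```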

An Information Geometrical View of ICA

More Criteria...
- Information maximization: Infomax seeks a linear transform such that the output entropy of f(y) = [f_1(y_1), ..., f_n(y_n)] is maximized. In the case where f_i(·) is the cumulative distribution function of y_i, Infomax = MLE = MMI.
- Non-Gaussianity maximization:
  - Negentropy maximization: the negentropy is defined by J(y) = H(y_G) - H(y), where y_G is a Gaussian random vector whose mean and covariance matrix are the same as those of y.
  - Kurtosis extremization: maximize kurtosis for super-Gaussian sources and minimize kurtosis for sub-Gaussian sources.

Learning Algorithms
- Gradient descent/ascent
- Natural gradient (or relative gradient) descent/ascent
- Conjugate gradient
- Newton and quasi-Newton
- Fixed-point iteration
- Relative trust-region optimization

Relative Gradient
The conventional gradient involves the first-order approximation
J(W + E) \approx J(W) + \mathrm{tr}\{ (\nabla J)^T E \},
and searches for a direction E that minimizes J(W + E) under a norm constraint ||E|| = const.
The relative gradient involves the first-order approximation
J(W + EW) \approx J(W) + \mathrm{tr}\{ (\nabla_r J)^T E \} = J(W) + \mathrm{tr}\{ (\nabla J)^T (EW) \}.
This leads to \nabla_r J = (\nabla J) W^T.

Natural Gradient
Let S_w = {w ∈ R^n} be a parameter space on which an objective function J(w) is defined. If the coordinate system is non-orthogonal, then
|dw|^2 = \sum_{i,j} g_{ij}(w) \, dw_i \, dw_j,
where g_{ij}(w) is the Riemannian metric.
Theorem: The steepest descent direction of J(w) in a Riemannian space is given by
\nabla_{ng} J(w) = G^{-1}(w) \nabla J(w).

Natural Gradient ICA
It turns out that the natural gradient in the context of ICA has the form
\nabla_{ng} J(W) = \nabla J(W) \, W^T W.
The natural gradient ICA algorithm is of the form
W^{(t+1)} = W^{(t)} + \eta \left[ I - \varphi(y(t)) \, y^T(t) \right] W^{(t)},
where φ(y) = [φ_1(y_1), ..., φ_n(y_n)]^T and φ_i(y_i) = -d log p_i(y_i) / dy_i.
It enjoys relatively fast convergence (compared to the conventional gradient) and the equivariance property (uniform performance, regardless of the conditioning of A).
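A minimal batch NumPy sketch of this update, assuming the super-Gaussian nonlinearity φ_i(y) = tanh(y) from the next slide; the learning rate, iteration count, and batch averaging are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def natural_gradient_ica(X, n_iter=200, eta=0.1):
    """Batch natural-gradient ICA: W <- W + eta * (I - phi(Y) Y^T / N) W.
    X has shape (n, N); phi(y) = tanh(y) assumes super-Gaussian sources."""
    n, N = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        Y = W @ X
        phi = np.tanh(Y)
        W += eta * (np.eye(n) - (phi @ Y.T) / N) @ W
    return W
```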

Hypothesized Distributions
The ICA algorithm requires p_i(·); hence we use a hypothesized distribution.
- Super-Gaussian: φ_i(y_i) = sign(y_i) or tanh(y_i).
- Sub-Gaussian: φ_i(y_i) = y_i^3.
- Switching nonlinearity: φ_i(y_i) = y_i ± tanh(α y_i).
- Flexible ICA: generalized Gaussian distribution.

Generalized Gaussian Distribution
p(y; \alpha) = \frac{\alpha}{2 \lambda \, \Gamma(1/\alpha)} \exp\left( -\left| \frac{y}{\lambda} \right|^{\alpha} \right).
Note that if α = 1 the distribution becomes the Laplacian distribution, and if α = 2 it is the Gaussian distribution.
[Figure: p(y) for α = 0.8, 1, 2, 4]
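For reference, a one-function sketch that evaluates this density with SciPy; exposing the scale λ as an explicit argument with a default of 1.0 is an assumption for illustration.

```python
import numpy as np
from scipy.special import gamma

def generalized_gaussian_pdf(y, alpha, lam=1.0):
    """p(y; alpha) = alpha / (2 * lam * Gamma(1/alpha)) * exp(-|y/lam|**alpha).
    alpha=1 recovers the Laplacian and alpha=2 the Gaussian (up to the scaling of lam)."""
    return alpha / (2.0 * lam * gamma(1.0 / alpha)) * np.exp(-np.abs(y / lam) ** alpha)
```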

Simultaneous Diagonalization
A symmetric matrix R ∈ R^{n×n} is diagonalized as R = UΣU^T.
The whitening transformation seeks a linear transformation such that the correlation matrix of z = Vx is the identity matrix, i.e., E[zz^T] = V E[xx^T] V^T = I. The whitening transformation is given by V = Σ^{-1/2} U^T, where E[xx^T] = UΣU^T.
Simultaneous diagonalization aims at diagonalizing two symmetric matrices R_1 and R_2 simultaneously by a linear transformation:
- whitening transformation + unitary transformation, or
- the generalized eigenvalue problem R_2 U = R_1 U Σ.

Symmetric-Definite Pencils
Definition: The set of all matrices of the form R_2 - λR_1, with λ ∈ R, is said to be a pencil.
Definition: Pencils (R_1, R_2), where R_2 is symmetric and R_1 is symmetric and positive definite, are referred to as symmetric-definite pencils.
Theorem: If R_2 - λR_1 is symmetric-definite, then there exists a nonsingular matrix U = [u_1, ..., u_n] such that
U^T R_1 U = diag{γ_1(τ_1), ..., γ_n(τ_1)},   (1)
U^T R_2 U = diag{γ_1(τ_2), ..., γ_n(τ_2)}.   (2)
Moreover, R_2 u_i = λ_i R_1 u_i for i = 1, ..., n, with λ_i = γ_i(τ_2) / γ_i(τ_1).

Fundamental Theorem
Theorem: Let Λ_1, D_1 ∈ R^{n×n} be diagonal matrices with positive diagonal entries and Λ_2, D_2 ∈ R^{n×n} be diagonal matrices with nonzero diagonal entries. Suppose that G ∈ R^{n×n} satisfies the decompositions
D_1 = G Λ_1 G^T,   D_2 = G Λ_2 G^T.
Then G is a generalized permutation matrix, i.e., G = PΛ, if D_1^{-1} D_2 and Λ_1^{-1} Λ_2 have distinct diagonal entries.

AMUSE
A second-order method that exploits a time-delayed correlation matrix (temporal structure) as well as an equal-time correlation matrix (whitening transformation).
Whitening transformation:
- Compute an equal-time correlation matrix, R_x(0) = (1/N) Σ_{t=1}^{N} x(t) x^T(t), and make it symmetric, M_x(0) = (1/2) [R_x(0) + R_x^T(0)].
- Do the spectral decomposition R_x(0) = UΣU^T.
- The whitening transformation leads to z = Σ^{-1/2} U^T x.
Unitary transformation:
- Find a unitary transformation, y = V^T z, that diagonalizes E[z(t) z^T(t - τ)] = VΛV^T.
The demixing matrix is given by W = V^T Σ^{-1/2} U^T.
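A compact NumPy sketch of these two steps, assuming zero-mean, full-rank observations and a single delay τ; estimation details such as eigenvalue degeneracies are ignored.

```python
import numpy as np

def amuse(X, tau=1):
    """AMUSE sketch: whitening from the equal-time correlation matrix, then an
    eigendecomposition of the symmetrized time-delayed correlation of the whitened data.
    X has shape (n, N) and is assumed zero-mean; tau is the time delay."""
    n, N = X.shape
    R0 = X @ X.T / N                                # equal-time correlation R_x(0)
    d, U = np.linalg.eigh(R0)                       # spectral decomposition (assumes full rank)
    Q = np.diag(d ** -0.5) @ U.T                    # whitening transform, z = Q x
    Z = Q @ X
    Rtau = Z[:, tau:] @ Z[:, :-tau].T / (N - tau)   # time-delayed correlation of z
    Mtau = 0.5 * (Rtau + Rtau.T)                    # symmetrize
    _, V = np.linalg.eigh(Mtau)                     # unitary transform V
    return V.T @ Q                                  # demixing matrix W
```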

Source Separation via Matrix Pencil
Note that simultaneous diagonalization is solved by the generalized eigenvalue problem.
- Compute the two symmetrized correlation matrices M_x(0) and M_x(τ). Note that (M_x(0), M_x(τ)) is a symmetric-definite pencil.
- Find the generalized eigenvector matrix V of the pencil M_x(τ) - λM_x(0), which satisfies M_x(τ) V = M_x(0) V Λ.
- The demixing matrix is given by W = V^T.
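Because SciPy's symmetric generalized eigensolver handles exactly this kind of pencil, the method reduces to a few lines; the sketch below assumes zero-mean data and a positive definite M_x(0).

```python
import numpy as np
from scipy.linalg import eigh

def pencil_separation(X, tau=1):
    """Source separation via the symmetric-definite pencil (M_x(0), M_x(tau)):
    solve M_x(tau) v = lambda * M_x(0) v and take W = V^T.
    X has shape (n, N) and is assumed zero-mean."""
    n, N = X.shape
    R0 = X @ X.T / N
    Rtau = X[:, tau:] @ X[:, :-tau].T / (N - tau)
    M0 = 0.5 * (R0 + R0.T)
    Mtau = 0.5 * (Rtau + Rtau.T)
    _, V = eigh(Mtau, M0)      # generalized eigenvectors of the pencil
    return V.T                 # demixing matrix W
```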

SOBI
A second-order method that exploits multiple time-delayed correlation matrices. It seeks a linear transformation W that jointly diagonalizes multiple time-delayed correlation matrices (joint approximate diagonalization).
Whitening transformation:
- The whitening transformation leads to z = Σ^{-1/2} U^T x, where R_x(0) = UΣU^T.
Unitary transformation:
- Find a unitary joint diagonalizer V of {M_z(τ_j)} which satisfies
V^T M_z(τ_j) V = Λ_j,   (3)
where {Λ_j} is a set of diagonal matrices.
The demixing matrix is given by W = V^T Σ^{-1/2} U^T.

More Algebraic Algorithms
- FOBI: 4th-order moment matrix
- JADE: slices of 4th-order cumulant matrices
- SEONS: a generalization of SOBI; quasi-nonstationarity
- More...

Outline, Part 3: Beyond ICA

Variations and Extensions of ICA
- Spatial, temporal, and spatiotemporal ICA
- Independent subspace analysis (ISA)
- Topographic ICA (TICA)
- Nonnegative matrix factorization (NMF)

Independent Subspace Analysis (ISA)
Multidimensional ICA + invariant feature subspaces.
L(W, X) = \frac{1}{N} \sum_{t=1}^{N} \sum_{j=1}^{J} \log p\left( \sum_{i \in F_j} \left( w_i^T x(t) \right)^2 \right) + \log |\det W|.

Pooling in Complex Cells

Topographic ICA
s_i = \sigma_i z_i = \varphi\left( \sum_j h(i, j) \, u_j \right) z_i.
A further extension of ISA, incorporating a topographic representation. Dependencies between nearby components are modelled by higher-order correlations.

Nonnegative Matrix Factorization (NMF)
Parts-based representation.

Algorithm for NMF
Find a factorization such that X ≈ AS, subject to A_ij ≥ 0 and S_ij ≥ 0. For example, we seek a matrix factorization which minimizes the I-divergence, defined by
E = \sum_{i,j} \left[ X_{ij} \log \frac{X_{ij}}{(AS)_{ij}} - X_{ij} + (AS)_{ij} \right].
A multiplicative algorithm is given by
S_{ij} \leftarrow S_{ij} \, \frac{ \sum_k A_{ki} X_{kj} / (AS)_{kj} }{ \sum_l A_{li} },
A_{ij} \leftarrow A_{ij} \, \frac{ \sum_k S_{jk} X_{ik} / (AS)_{ik} }{ \sum_l S_{jl} }.
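A short NumPy sketch of these multiplicative updates; the random initialization, iteration count, and the small eps guard against division by zero are illustrative assumptions, not part of the slides.

```python
import numpy as np

def nmf_idivergence(X, r, n_iter=200, eps=1e-9):
    """Multiplicative updates minimizing the I-divergence between X and AS.
    X: nonnegative array of shape (m, n); r: number of components."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    A = rng.random((m, r)) + eps
    S = rng.random((r, n)) + eps
    for _ in range(n_iter):
        # S_ij <- S_ij * [sum_k A_ki X_kj / (AS)_kj] / sum_l A_li
        S *= (A.T @ (X / (A @ S + eps))) / A.sum(axis=0)[:, None]
        # A_ij <- A_ij * [sum_k S_jk X_ik / (AS)_ik] / sum_l S_jl
        A *= ((X / (A @ S + eps)) @ S.T) / S.sum(axis=1)[None, :]
    return A, S
```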

Outline, Part 4: Applications of ICA

Applications
- Computational/computer vision: natural image statistics, face recognition
- Speech/language: audio source separation, semantic structure of language
- Bioinformatics: gene expression data analysis
- Dynamic PET image analysis

Compact Coding

Sparse Coding

Learn Statistical Structure of Natural Scenes

Examples of Natural Images

Learned Basis Images: PCA

Learned Basis Images: ICA

Learned Basis Images: ISA

Learned Basis Images: Topographic ICA

Eigenfaces

Factorial Faces

AR Face Database

Eigenfaces vs. Factorial Faces

Performance Comparison

Multiple View Images

Learned Face Basis Images: ISA

Learned Face Basis Images: Topographic ICA

Audio Source Separation

ST-NB95 Database

Word Error Rate

Emergence of Linguistic Features and ICA
A collection of emails sent to the connectionists mailing list. One hundred common words were manually selected, and the contextual information was calculated using the 2000 most common types. A context matrix C was formed, where C_ij represents the number of occurrences of the jth word in the immediate context of the ith word.

Example 1: ICA Features of Contextual Data
Figure: ICA features for neuroscience and psychology.

Example 2: ICA Features of Contextual Data
Figure: ICA features for models and problems.

Example 3: ICA Features of Contextual Data
Figure: ICA features for my, his, our, and their.

Example 4: ICA Features of Contextual Data
Figure: ICA features for will, can, may, and must.

Modern Biologists
We need a different kind of tool!

Gene Expression Data

Data Preparation

Why Linear Models for Gene Expression Data Analysis?

Eigengenes and Eigenarrays

Spatial ICA

Temporal ICA

Performance Comparison: Gene Clustering
[Figure: scatter plots of -log10(p-value) for gene clusters, comparing tICA against stICA, PCA, and sICA.]

Temporal Modes

Positron Emission Tomography

What Can PET Do for the Heart?
- Quantify the extent of heart disease.
- Calculate myocardial blood flow or metabolism quantitatively.

Dynamic PET Image Acquisition

Independent Components of Heart PET Images

ICs vs. Arterial Sampling