Independent Component Analysis


1 Independent Component Analysis. Background paper: http://www-stat.stanford.edu/~hastie/papers/ica.pdf

2 ICA Problem. The model is $X = AS$, where $X$ is a random $p$-vector representing multivariate input measurements, $S$ is a latent source $p$-vector whose components are independently distributed random variables, and $A$ is a $p \times p$ mixing matrix. Given realizations $x_1, x_2, \ldots, x_N$ of $X$, the goals of ICA are to estimate $A$ and to estimate the source distributions $S_j \sim f_{S_j}$, $j = 1, \ldots, p$.
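
A minimal R sketch of this generative model (the uniform sources, sample size, and random mixing matrix below are illustrative choices, not part of the slides):

    ## Simulate the ICA model X = AS with p = 2 independent, non-Gaussian sources.
    set.seed(1)
    p <- 2; N <- 1000
    S <- matrix(runif(N * p, -sqrt(3), sqrt(3)), N, p)  # uniform sources: mean 0, variance 1
    A <- matrix(rnorm(p * p), p, p)                     # random (almost surely nonsingular) mixing matrix
    X <- S %*% t(A)                                     # row i holds x_i^T = (A s_i)^T

Later snippets in these notes reuse S and X from this sketch.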

3 Cocktail Party Problem. In a room there are $p$ independent sources of sound, and $p$ microphones placed around the room hear different mixtures. [Figure: panels showing the source signals, the measured signals, the PCA solution, and the ICA solution.] Here each of the measured signals $x_{ij} = x_j(t_i)$ and each recovered source is a time series sampled uniformly at times $t_i$.

4 Independent vs Uncorrelated. WLOG we can assume that $E(S) = 0$ and $\mathrm{Cov}(S) = I$, and hence $\mathrm{Cov}(X) = \mathrm{Cov}(AS) = AA^T$. Suppose $X = AS$ with $S$ unit-variance and uncorrelated, and let $R$ be any orthogonal $p \times p$ matrix. Then $X = AS = AR^T RS = A^*S^*$ with $A^* = AR^T$, $S^* = RS$, and $\mathrm{Cov}(S^*) = I$. So it is not enough to find uncorrelated variables, as they are not unique under rotations. Hence methods based on second-order moments, like principal components and Gaussian factor analysis, cannot recover $A$. ICA uses independence and non-Gaussianity of $S$ (e.g. via higher-order moments) to recover $A$.
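
A quick numerical check of this non-identifiability, reusing S from the earlier sketch (the particular rotation angle is arbitrary):

    ## An orthogonal rotation of unit-variance, uncorrelated sources is again
    ## unit-variance and uncorrelated, so second-order moments cannot identify A.
    theta <- pi / 6
    R <- matrix(c(cos(theta), -sin(theta),
                  sin(theta),  cos(theta)), 2, 2, byrow = TRUE)  # orthogonal 2 x 2 rotation
    S_star <- S %*% t(R)       # rotated sources S* = RS, stored row-wise
    round(cov(S_star), 2)      # approximately the identity, just like cov(S)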

5 Independent vs Uncorrelated: Demo. [Figure: panels showing the source S, the data X, the PCA solution, and the ICA solution.] Principal components are uncorrelated linear combinations of X, chosen to successively maximize variance. Independent components are also uncorrelated linear combinations of X, chosen to be as independent as possible.

6 Example: ICA Representation of Natural Images. Pixel blocks are treated as vectors, and the collection of such vectors for an image forms an image database. ICA can lead to a sparse coding for the image, using a natural basis. See http://www.cis.hut.fi/projects/ica/imageica/ (Patrik Hoyer and Aapo Hyvärinen, Helsinki University of Technology).

7 Example: ICA and EEG Data. See http://www.cnl.salk.edu/~tewon/ica_cnl.html (Scott Makeig, Salk Institute).

8 Approaches to ICA. The ICA literature is huge; the book by Hyvärinen, Karhunen & Oja (Wiley, 2001) is a great source for learning about ICA and some good computational tricks. The main families of methods:
- Mutual information and entropy, maximizing non-Gaussianity: FastICA (HKO 2001), Infomax (Bell and Sejnowski, 1995).
- Likelihood methods: ProDenICA (Hastie & Tibshirani), covered later.
- Nonlinear decorrelation: $Y_1$ is independent of $Y_2$ iff $\max_{f,g} \mathrm{Corr}[f(Y_1), g(Y_2)] = 0$ (Hérault-Jutten, 1984); KernelICA (Bach and Jordan, 2001).
- Tensorial (higher-order moment) methods.

9 Simplification. Let $\Sigma = \mathrm{Cov}(X)$, suppose $X = AS$, and let $B = A^{-1}$. Then we can write $B = W\Sigma^{-1/2}$ for some nonsingular $W$, so that $S = BX = W\Sigma^{-1/2}X$; since $\mathrm{Cov}(S) = I$, the matrix $W$ must be orthonormal. So operationally we sphere the data, $X^* = \Sigma^{-1/2}X$, and then seek an orthogonal matrix $W$ so that the components of $S = WX^*$ are independent.
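
A minimal R sketch of the sphering step, applied to X from the first snippet (the eigendecomposition route and the variable names are illustrative):

    ## Sphere ("whiten") the data: X* = X Sigma^{-1/2}, so that cov(X*) is approximately I.
    Xc  <- scale(X, center = TRUE, scale = FALSE)       # center the columns
    eig <- eigen(cov(Xc), symmetric = TRUE)
    Sigma_inv_sqrt <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)
    X_star <- Xc %*% Sigma_inv_sqrt                     # sphered data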

10 Entropy and Mutual Information.
Entropy: $H(Y) = -\int f(y)\log f(y)\,dy$, maximized (for fixed variance) by $f(y) = \phi(y)$, the Gaussian density.
Mutual information: $I(Y) = \sum_{j=1}^p H(Y_j) - H(Y)$, where $Y$ is a random vector with joint density $f(y)$ and entropy $H(Y)$, and $H(Y_j)$ is the (marginal) entropy of component $Y_j$, with marginal density $f_j(y_j)$.
$I(Y)$ is the Kullback-Leibler divergence between $f(y)$ and its independence version $\prod_{j=1}^p f_j(y_j)$ (which is the KL-closest of all independence densities to $f(y)$). Hence $I(Y)$ is a measure of dependence between the components of a random vector $Y$.

11 Entropy, Mutual Information, and ICA. If $\mathrm{Cov}(X) = I$ and $W$ is orthogonal, then simple calculations show
$$I(WX) = \sum_{j=1}^p H(w_j^T X) - H(X).$$
Since $H(X)$ does not depend on $W$, minimizing $I(WX)$ over $W$ amounts to minimizing the dependence between the $w_j^T X$, which is minimizing the sum of the entropies of the $w_j^T X$, which is maximizing the departures from Gaussianity of the $w_j^T X$. Many methods for ICA look for low-entropy or non-Gaussian projections (Hyvärinen, Karhunen & Oja, 2001); there are strong similarities with projection pursuit (Friedman and Tukey, 1974).

12 Negentropy and FastICA. Negentropy: $J(Y_j) = H(Z_j) - H(Y_j)$, where $Z_j$ is a Gaussian random variable with the same variance as $Y_j$; it measures the departure from Gaussianity. FastICA uses simple approximations to negentropy,
$$J(w_j^T X) \approx \left[E\,G(w_j^T X) - E\,G(Z_j)\right]^2,$$
with the expectations replaced by sample averages on data. They use $G(y) = \frac{1}{a}\log\cosh(ay)$, with $1 \le a \le 2$.
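
A small R sketch of this contrast and its sample-average approximation (the function names are made up for illustration, and this is only the criterion, not the full FastICA fixed-point iteration):

    ## G(y) = (1/a) log cosh(a y) and the negentropy approximation
    ## J(w) ~ ( mean G(w'x_i) - E G(Z) )^2, with Z a standard Gaussian.
    G <- function(y, a = 1) log(cosh(a * y)) / a
    EG_gauss <- function(a = 1) {                  # E G(Z) by numerical integration
      integrate(function(z) G(z, a) * dnorm(z), -Inf, Inf)$value
    }
    negentropy_approx <- function(w, Xs, a = 1) {  # Xs: sphered data, rows are observations
      w <- w / sqrt(sum(w^2))                      # unit-length projection direction
      (mean(G(Xs %*% w, a)) - EG_gauss(a))^2
    }

For example, negentropy_approx(c(1, 0), X_star) evaluates the criterion for the first coordinate direction of the sphered data from the earlier sketch.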

13 ICA Density Model. Under the ICA model, the density of $S$ is $f_S(s) = \prod_{j=1}^p f_j(s_j)$, with each $f_j$ a univariate density with mean 0 and variance 1. Next we develop two direct ways of fitting this density model.

14 Discrimination from Gaussian: an unsupervised-to-supervised idea. Consider the case of $p = 2$ variables. Idea: the observed data are assigned to class $G = 1$, and a background sample is generated from a spherical Gaussian $g_0(x)$ and assigned to class $G = 0$. We fit to this data a projection pursuit model of the form
$$\log\frac{\Pr(G=1)}{1 - \Pr(G=1)} = f_1(a_1^T x) + f_2(a_2^T x) \qquad (1)$$
where $[a_1, a_2]$ is an orthonormal basis. This implies the data density model
$$g(x) = g_0(x)\,\exp f_1(a_1^T x)\,\exp f_2(a_2^T x).$$
One can also show that $g_0(x)$ factors into a product $h_1(a_1^T x)\,h_2(a_2^T x)$, since $[a_1, a_2]$ is an orthonormal basis.

15 Hence model (1) is a product density model for the data. It also looks like a neural network. We can fit it by (a) sphering ("whitening") the data and then (b) applying the ppr function in R, combined with IRLS (iteratively reweighted least squares).
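
To make the unsupervised-to-supervised trick concrete, here is a rough R sketch that discriminates the sphered data from a spherical-Gaussian background sample. It fixes the two projection directions to the coordinate axes of the sphered data and uses mgcv::gam as a stand-in for the ppr-plus-IRLS fit described above, so it illustrates the idea rather than reproducing the slides' procedure:

    ## Discriminate the real (sphered) data from a spherical Gaussian background sample.
    library(mgcv)
    N0   <- nrow(X_star)
    bkgd <- matrix(rnorm(N0 * 2), N0, 2)                 # background sample from g_0
    dat  <- data.frame(s1 = c(X_star[, 1], bkgd[, 1]),
                       s2 = c(X_star[, 2], bkgd[, 2]),
                       G  = rep(c(1, 0), each = N0))     # class labels: 1 = data, 0 = background
    fit  <- gam(G ~ s(s1) + s(s2), family = binomial, data = dat)
    ## The fitted additive logit plays the role of f_1(a_1'x) + f_2(a_2'x) in model (1).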

16 Projection Pursuit Solutions

17 ICA Density Model: direct fitting. Density of $X = AS$:
$$f_X(x) = |\det B|\,\prod_{j=1}^p f_j(b_j^T x), \quad \text{with } B = A^{-1}.$$
Log-likelihood, given $x_1, x_2, \ldots, x_N$:
$$\ell(B, \{f_j\}_1^p) = \sum_{i=1}^N \left[\log|\det B| + \sum_{j=1}^p \log f_j(b_j^T x_i)\right].$$
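
A literal R transcription of this log-likelihood, with the source densities passed in as a list of functions (an illustrative interface, not from the slides):

    ## l(B, {f_j}) = sum_i [ log|det B| + sum_j log f_j(b_j' x_i) ]
    ica_loglik <- function(B, X, f_list) {
      S_hat <- X %*% t(B)                 # column j holds b_j' x_i for each observation i
      marg  <- sapply(seq_along(f_list),
                      function(j) sum(log(f_list[[j]](S_hat[, j]))))
      nrow(X) * log(abs(det(B))) + sum(marg)
    }
    ## e.g. ica_loglik(diag(2), X_star, list(dnorm, dnorm)) evaluates it under Gaussian marginals.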

18 ICA Density Model: fitting, continued. As before, let $\Sigma = \mathrm{Cov}(X)$; then for any $B$ we can write $B = W\Sigma^{-1/2}$ for some nonsingular $W$, and $S = BX$ with $\mathrm{Cov}(S) = I$ implies $W$ is orthonormal. So we sphere the data, $X^* = \Sigma^{-1/2}X$, and seek an orthonormal $W$. Let
$$f_j(s_j) = \phi(s_j)\,e^{g_j(s_j)},$$
a tilted Gaussian density; here $\phi$ is the standard Gaussian density, and $g_j$ satisfies the normalization conditions. Then
$$\ell(W, \Sigma, \{g_j\}_1^p) = -\frac{N}{2}\log|\Sigma| + \sum_{i=1}^N \sum_{j=1}^p \left[\log\phi(w_j^T x_i^*) + g_j(w_j^T x_i^*)\right].$$
We estimate $\hat\Sigma$ by the sample covariance of the $x_i$ and ignore it thereafter: this is pre-whitening in the ICA literature.

19 ProDenICA algorithm. Maximizes the log-likelihood on the previous slide over $W$ and $g_j(\cdot)$, $j = 1, 2, \ldots, p$, with smoothness constraints on the $g_j(\cdot)$. Details on the next three slides.

20 Restrictions on the $g_j$. Our model is still over-parameterized, so we maximize instead a penalized log-likelihood:
$$\sum_{j=1}^p \left[\frac{1}{N}\sum_{i=1}^N \left[\log\phi(w_j^T x_i^*) + g_j(w_j^T x_i^*)\right] - \lambda_j \int \{g_j'''(t)\}^2\,dt\right]$$
with respect to $W^T = (w_1, w_2, \ldots, w_p)$ and $g_j$, $j = 1, \ldots, p$, subject to:
- $W^T W = I$;
- each $f_j(s) = \phi(s)e^{g_j(s)}$ is a density, with mean 0 and variance 1.
The solutions $\hat g_j$ are piecewise quartic splines.

21 ProDenICA: Product Density ICA algorithm.
1. Initialize W (random Gaussian matrix followed by orthogonalization).
2. Alternate until convergence of W, as measured by the Amari metric:
(a) Given W, optimize the penalized log-likelihood w.r.t. $g_j$ (separately for each j), using the penalized density estimation algorithm.
(b) Given $g_j$, $j = 1, \ldots, p$, perform one step of a fixed-point algorithm towards finding the optimal W.

22 Penalized density estimation for step (a):
- Discretize the data.
- Fit a generalized additive model with the Poisson family of distributions (see text chapter 9).
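
A rough R sketch of this step for a single component, using Lindsey's Poisson trick with mgcv::gam as a stand-in for the penalized quartic-spline fit on the slides; the projection direction, number of bins, and smoother are all illustrative assumptions:

    ## Estimate a tilted-Gaussian density phi(t) exp(g(t)) for one projected component:
    ## bin counts are modeled as Poisson with mean N * delta * phi(t) * exp(g(t)).
    library(mgcv)
    proj   <- as.vector(X_star %*% c(1, 0))              # one illustrative projection of the sphered data
    breaks <- seq(min(proj) - 0.1, max(proj) + 0.1, length.out = 101)
    mids   <- (breaks[-1] + breaks[-length(breaks)]) / 2
    counts <- hist(proj, breaks = breaks, plot = FALSE)$counts
    delta  <- diff(breaks)[1]
    off    <- log(length(proj) * delta) + dnorm(mids, log = TRUE)   # fixed offset: log(N delta phi(t))
    fit    <- gam(counts ~ s(mids) + offset(off), family = poisson)
    g_hat  <- coef(fit)[1] + predict(fit, type = "terms")[, "s(mids)"]  # estimated g(t) at bin midpoints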

23 Simulations. [Figure: the 18 source distributions, labeled a through r, taken from Bach and Jordan (2001).] Each distribution was used to generate 2-dimensional S and X (with random A), 30 replications each, plus 300 replications with 4-dimensional S, with the distributions of the $S_j$ picked at random. We compared FastICA (homegrown Splus version), KernelICA (Francis Bach's Matlab version), and ProDenICA (Splus).

24 Amari Metric. We evaluate solutions by comparing the estimated $W$ with the true $W_0$, using the Amari metric (HKO 2001; Bach and Jordan, 2001):
$$d(W_0, W) = \frac{1}{2p}\sum_{i=1}^p \left(\frac{\sum_{j=1}^p |r_{ij}|}{\max_j |r_{ij}|} - 1\right) + \frac{1}{2p}\sum_{j=1}^p \left(\frac{\sum_{i=1}^p |r_{ij}|}{\max_i |r_{ij}|} - 1\right),$$
where $r_{ij} = (W_0 W^{-1})_{ij}$.
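
This formula translates directly into a short R function (the name is just for illustration):

    ## Amari distance between the true unmixing matrix W0 and an estimate W.
    amari_distance <- function(W0, W) {
      p <- nrow(W0)
      r <- abs(W0 %*% solve(W))                           # |r_ij|, with r = W0 W^{-1}
      row_term <- sum(rowSums(r) / apply(r, 1, max) - 1)  # sum over rows i
      col_term <- sum(colSums(r) / apply(r, 2, max) - 1)  # sum over columns j
      (row_term + col_term) / (2 * p)
    }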

25 2-Dim Examples. [Figure: boxplots of Amari distance from the true W (log scale) for each of the 18 distributions a through r, comparing FastICA, KernelICA, and ProDenICA.]

26 4-Dim Examples. [Figure: boxplots of Amari distance from the true W for the 4-dimensional simulations; mean Amari distances (standard errors) as printed on the plot: 0.25 (0.014), 0.08 (0.004), 0.14 (0.011) for FastICA, KernelICA, and ProDenICA respectively.]

27 [Figure: panels comparing the FastICA, KernelICA, and ProDenICA solutions.]