Independent Component Analysis (ICA)
Université catholique de Louvain (Belgium), Machine Learning Group
http://www.dice.ucl.ac.be/mlg/

Overview
Uncorrelation vs independence
Blind source separation & the cocktail-party problem
Equations, indeterminations & assumptions
Pre-whitening step
Some examples
The Gaussian case
Objective functions: how to recover independent components?
Non-Gaussianity approach & central limit theorem
Minimum dependence approach
Real-world examples
Extensions
What is ICA?
PCA: finding a transformation that uncorrelates the variables
ICA: finding a transformation that makes the variables as independent as possible
In this lecture: the transformation is constrained to be linear and instantaneous

Independence is stronger than uncorrelation
Uncorrelation between x and y: E[xy] = E[x]E[y]
Independence between x and y: E[f(x)g(y)] = E[f(x)]E[g(y)] for any (non-linear) functions f and g; x does not carry any information about y
In other words:
- correlation measures the existence of a linear relation between variables
- dependence measures the existence of any relation between variables
Uncorrelation vs independence: example
Let u be a random variable, uniform on [-√3, √3] (density 1/(2√3)), so that E[u] = 0 and E[u²] = 1.
Let v = u².
E[uv] = E[u³] = 0 and E[u]E[v] = 0
⇒ u and v are uncorrelated (no linear relation between u and v).
Now take f(u) = u² and g(v) = v:
E[f(u)g(v)] = E[u⁴] = 9/5, but E[f(u)]E[g(v)] = E[u²]E[u²] = 1
⇒ u and v are dependent (a link exists between u and v); the sketch below checks this numerically.
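As a quick numerical check, this NumPy sketch samples the example above (seed and sample size are arbitrary choices, not from the slides): the empirical covariance of u and v vanishes, while the fourth-moment test exposes the dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=100_000)   # E[u] = 0, E[u^2] = 1
v = u ** 2

# Uncorrelated: E[uv] - E[u]E[v] is (numerically) zero
print("cov(u, v) :", np.mean(u * v) - np.mean(u) * np.mean(v))

# Dependent: with f(u) = u^2 and g(v) = v the factorization fails
print("E[u^4]    :", np.mean(u ** 4))                     # ~ 9/5
print("E[u^2]E[v]:", np.mean(u ** 2) * np.mean(v))        # ~ 1
```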
PCA vs ICA?
PCA (whitening): maximum-variance projection. NO independence!
Remark: whiteness is conserved under any rotation.
ICA: minimum-dependence directions (one cannot say anything about y knowing x).
Remark: independence remains only for rotations by kπ/2 (k in Z), i.e. up to permutation and sign!
[Figure: scatter plots of the sources s, the mixtures x = As for a 2×2 mixing matrix A, the PCA outputs and the ICA outputs.]

Independent Component Analysis (ICA)
The «source separation» or «cocktail-party» problem.
Aims:
- to separate signals
- to use an independence criterion instead of variance maximisation (PCA)
Blind source separation
Sources S → [mixing A] → Mixtures X → [unmixing W] → Outputs Y
S and A: unknown. X: known. W: to estimate.

Method?
Under several assumptions: Y = Estim(S) = W_ICA X
Cocktail-party hypotheses:
- linear and additive mixing
- no phase delay
- signals rather than data

Why ICA rather than PCA?
Uncorrelation ≠ independence.
If W is a whitening matrix, then UW, with U s.t. UU^T = I, is also a whitening matrix
⇒ W_whitening is highly non-unique (up to any rotation matrix!)
⇒ W_ICA is unique, up to indeterminations.
The problem in equations
Notations:
- independent signals (unknown): s(t) = [s_1(t), …, s_n(t)]^T
- measured signals: x(t) = [x_1(t), …, x_n(t)]^T
- linear mixing: x(t) = A s(t)
The problem: to estimate W = A^{-1}, so that y = Wx = WAs is an estimate of the sources: y = ŝ. But A is unknown!

Independence hypothesis
A is unknown ⇒ we cannot compute W = A^{-1}.
This lack of information is compensated by the independence hypothesis.

Solution and indeterminations
Solution: we measure the independence of the signals y_i(t); when this independence is maximum, y(t) estimates s(t).
Indeterminations:
- order of the signals (independence is symmetric): a permutation matrix P
- multiplying factor on each signal: since x_i(t) = Σ_{j=1}^{n} a_ij s_j(t) = Σ_{j=1}^{n} (a_ij / α_j) (α_j s_j(t)) for any non-zero constants α_j, the scale of each source cannot be recovered: a diagonal matrix D (non-zero coefficients)
Hence any W of the form W = PDA^{-1} solves the problem, giving y(t) = Wx(t) = PDs(t), as the sketch below illustrates.
A calibration could be necessary, but these indeterminations are of low importance in applications.
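A minimal sketch of the model and of the permutation/scaling indetermination; the mixing matrix, sources and seed are arbitrary illustrative choices (in practice A is unknown and W must be estimated from x alone).

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.uniform(-1.0, 1.0, size=(2, 1_000))      # independent sources s(t)
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                        # mixing matrix (unknown in practice)
x = A @ s                                         # observations x(t) = A s(t)

W = np.linalg.inv(A)                              # ideal separator (A known here only)
P = np.array([[0, 1],
              [1, 0]])                            # permutation
D = np.diag([2.0, -0.5])                          # non-zero rescaling
y = (P @ D @ W) @ x                               # outputs y(t) = P D s(t)
print(np.allclose(y, P @ D @ s))                  # True: sources up to order and scale
```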
Solution: assumptions
- The source signals are mutually independent.
- Since the magnitude of the s_i cannot be known, it is fixed s.t. E[ss^T] = I (unit-variance sources).
- The mixing matrix is supposed to be constant in time.

Whitening: a preprocessing to ICA?
[Figure: scatter plots of the centered sources s, the observations x = As, the whitened signals z = Vx = VAs, and the output signals y = Wz.]
Whitening: a preprocessing to ICA?
Why unmix z (the whitened signals) instead of x?
If z is white, then VA is orthogonal:
E[zz^T] = (VA) E[ss^T] (VA)^T = (VA)(VA)^T = I
If VA is orthogonal, W reduces to an orthogonal matrix:
E[yy^T] = W E[zz^T] W^T = WW^T = I
Hence: only n(n-1)/2 instead of n² parameters have to be estimated (the sketch below makes this concrete).
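A minimal whitening sketch under the slides' assumptions (unit-variance independent sources, square mixing matrix; the particular A and sample size are illustrative): the whitening matrix V is built from the eigendecomposition of the covariance of x, and VA indeed comes out (numerically) orthogonal.

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 50_000))   # E[ss^T] = I
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                                    # illustrative mixing
x = A @ s

# Whitening matrix V from the eigendecomposition of the covariance of x
d, E = np.linalg.eigh(np.cov(x))
V = E @ np.diag(d ** -0.5) @ E.T
z = V @ x

print(np.round(np.cov(z), 2))     # ~ identity: E[zz^T] = I
M = V @ A
print(np.round(M @ M.T, 2))       # ~ identity: VA is orthogonal
```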
Separation of uniform signals
[Figure: the sources; one estimation with a permutation (s1 ↔ s2) and a double sign inversion; another estimation with an inversion of s1 and no permutation.]

Uncorrelation and independence [1]
[Figure: scatter plots of the sources, the mixtures, the whitened signals and the ICA outputs (FastICA).]

Uncorrelation and independence [2] (SWICA)
[Figure: sources, mixtures, whitening and ICA applied to images. F. Vrins]
Warning: the mean and variance of the original images are important!
The Gaussian case
If the sources have a Gaussian distribution:
- the temporal structure looks (but isn't!) similar to a uniform random signal
- the scatter plot is very different
[Figure: temporal structure and scatter plot of Gaussian sources.]
φ(x) = (1 / (σ√(2π))) exp(−(x−µ)² / (2σ²)) (fully described by mean and variance)

The Gaussian case
If the sources have a Gaussian distribution, then in the Gaussian case independence is equivalent to uncorrelation!
[Figure: Gaussian mixtures; maximum-variance projection and scaling (whitening).]
A rotation after whitening does not change anything!
The Gaussian case
[Figure: superimposition of the sources and of the white mixtures.]
How to find the rotation corresponding to the original sources (up to permutation/scale)?
Impossible from the statistics used so far: other information (temporal structure, etc.) is needed to separate Gaussian sources. The sketch below checks this numerically.
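The point can be verified numerically: the sketch below (seed and rotation angle are arbitrary) rotates white Gaussian data and shows that the rotated data remain white and Gaussian, so no criterion based on these statistics can identify the rotation.

```python
import numpy as np

rng = np.random.default_rng(8)
z = rng.standard_normal((2, 100_000))      # white Gaussian data
t = 0.7                                    # arbitrary rotation angle
R = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])
y = R @ z                                  # rotated data

print(np.round(np.cov(z), 2))              # ~ identity
print(np.round(np.cov(y), 2))              # ~ identity: still white
# every projection stays Gaussian: excess kurtosis ~ 0 before and after
print(round(float(np.mean(z[0] ** 4)) - 3.0, 3))
print(round(float(np.mean(y[0] ** 4)) - 3.0, 3))
```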
Main tool for ICA: independence
Discrete case: P(A, B) = P(A)P(B), or equivalently P(A|B) = P(A)
Continuous case: f_x(x) = Π_{i=1}^{n} f_{x_i}(x_i)
The problem:
- to measure the dependence between signals
- to minimize this dependence

ICA objective functions
Y = WX = WAS
Non-Gaussianity approach:
- by the central limit theorem, the PDF of a sum of n independent random variables converges to a Gaussian
- measure of non-Gaussianity: finding W such that the output PDFs are as different as possible from the Gaussian function
- one output signal at a time
Independence approach:
- find independence measures between the signals
- estimation of the PDFs or of an independence criterion
- all output signals together

Gaussianity and CLT
Central limit theorem: illustration with uniform variables (see the sketch below)
[Figure: histograms of the normalized sum of n independent uniform variables, for increasing values of n.]
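The sketch below illustrates the theorem with uniform variables: the excess kurtosis of a normalized sum of n independent uniforms (−6/5 for n = 1) shrinks towards the Gaussian value 0 as n grows. The particular values of n are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    x = (x - x.mean()) / x.std()
    return np.mean(x ** 4) - 3.0

for n in (1, 2, 5, 10, 50):
    # normalized sum of n independent uniform variables
    x = rng.uniform(-1.0, 1.0, size=(n, 100_000)).sum(axis=0) / np.sqrt(n)
    print(f"n = {n:3d}   excess kurtosis = {excess_kurtosis(x):+.3f}")
```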
Non-Gaussianity approach
- Minimum differential entropy: Gaussians have maximum differential entropy (at fixed variance)
- Maximum negentropy: equivalent to differential entropy
- Maximum positive transform of the kurtosis: Gaussians have kurtosis = 0
- Gram-Charlier expansion: measures the difference between the output PDF and the Gaussian PDF

Entropy [1/2]
Discrete case:
H(x) = −Σ_{i=1}^{K} p_i log(p_i)
H(x) = 0 (minimum) if p_i = 1 and p_j = 0 for j ≠ i
H(x) = log(K) (maximum) if p_i = 1/K
Continuous case: differential entropy
h(x) = −∫ f(u) log f(u) du
Entropy [2/2]
Continuous case (continued):
- maximum differential entropy (at fixed variance σ²): the Gaussian, with h_G(x) = (1/2) log(2πeσ²)
- differential entropy is invariant to orthogonal transforms
Minimizing h(x) pushes the PDF of x far from the Gaussian
⇒ finding W s.t. the output entropies are low (with the x_i unit-variance)

Negentropy
Negentropy: difference wrt the entropy of a Gaussian with the same variance
J(x) = h_G(x) − h(x)
Multi-dimensional case:
J(x) = ∫ f_x(u) log( f_x(u) / f_{x_G}(u) ) du
⇒ finding W s.t. J(x) is maximum (with the x_i unit-variance)
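Negentropy itself requires the PDF. A common sample-based surrogate (one of Hyvärinen's approximations with G(u) = log cosh u, named here explicitly because it is not a formula from these slides) is J(y) ≈ (E[G(y)] − E[G(ν)])², where ν is a standard Gaussian. A minimal sketch, with illustrative distributions and sample sizes:

```python
import numpy as np

rng = np.random.default_rng(4)

def negentropy_approx(y, n_ref=200_000):
    """J(y) ~ (E[G(y)] - E[G(nu)])^2 with G = log cosh, nu standard Gaussian."""
    y = (y - y.mean()) / y.std()          # standardize, as the slides assume
    nu = rng.standard_normal(n_ref)       # Gaussian reference sample
    G = lambda u: np.log(np.cosh(u))
    return (G(y).mean() - G(nu).mean()) ** 2

samples = {
    "Gaussian":  rng.standard_normal(100_000),
    "uniform":   rng.uniform(-np.sqrt(3), np.sqrt(3), 100_000),
    "Laplacian": rng.laplace(scale=1 / np.sqrt(2), size=100_000),
}
for name, sig in samples.items():
    print(f"{name:9s}  J ~ {negentropy_approx(sig):.5f}")   # ~0 only for the Gaussian
```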
Kurtosis: intuitive considerations
Definition of the (excess) kurtosis:
κ₄(x) = E[x⁴] − 3 (E[x²])²
Interesting properties:
- for a Gaussian PDF: κ₄(x_G) = 0
- for most non-Gaussian PDFs: κ₄(x) ≠ 0
⇒ finding W s.t. Σ_{i=1}^{m} κ₄²(x_i) is maximum (with the x_i unit-variance); a sketch follows below

Kurtosis: illustration
[Figure: PDFs with negative (sub-Gaussian), zero (Gaussian) and positive (super-Gaussian) kurtosis.]
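A minimal sketch of this kurtosis-based objective on a 2-D toy problem (mixing matrix, seed and grid resolution are illustrative): after pre-whitening, a single rotation angle remains, and sweeping it for the maximum of the summed squared kurtoses locates a separating rotation.

```python
import numpy as np

rng = np.random.default_rng(5)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 50_000))   # kurtosis -6/5 each
A = np.array([[1.0, 0.7],
              [0.2, 1.0]])                                    # illustrative mixing
x = A @ s
d, E = np.linalg.eigh(np.cov(x))
z = E @ np.diag(d ** -0.5) @ E.T @ x                          # pre-whitening

def kurt(y):
    """Excess kurtosis of a zero-mean, unit-variance signal."""
    return np.mean(y ** 4) - 3.0

# After whitening only a rotation remains: sweep its angle
thetas = np.linspace(0.0, np.pi / 2, 180)
scores = []
for t in thetas:
    R = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    y = R @ z
    scores.append(kurt(y[0]) ** 2 + kurt(y[1]) ** 2)          # sum of squared kurtoses
print(f"best rotation angle: {thetas[int(np.argmax(scores))]:.3f} rad")
```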
Gram-Charlier expansion
A Taylor expansion approximates a function f around f(x₀).
A Gram-Charlier expansion approximates a PDF p_x around the Gaussian function φ.
Truncated at fourth order (for a standardized x):
p_x(ξ) ≈ φ(ξ) [ 1 + κ₃(x) H₃(ξ)/3! + κ₄(x) H₄(ξ)/4! ]
where the H_i are Hermite polynomials and the bracketed correction is the «non-Gaussian part» of p_x.
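A minimal sketch evaluating this truncated expansion (the test grid and the uniform example are illustrative): for a standardized uniform variable, κ₃ = 0 and κ₄ = −6/5, and the expansion gives a rough, smooth approximation of the flat true density.

```python
import numpy as np

def gram_charlier_pdf(xi, k3, k4):
    """Fourth-order Gram-Charlier approximation of a standardized PDF."""
    phi = np.exp(-xi ** 2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian
    H3 = xi ** 3 - 3 * xi                             # Hermite polynomial H3
    H4 = xi ** 4 - 6 * xi ** 2 + 3                    # Hermite polynomial H4
    return phi * (1 + k3 * H3 / 6 + k4 * H4 / 24)

# Standardized uniform on [-sqrt(3), sqrt(3)]: skewness 0, excess kurtosis -6/5
xi = np.linspace(-2.0, 2.0, 9)
approx = gram_charlier_pdf(xi, k3=0.0, k4=-1.2)
true = np.where(np.abs(xi) <= np.sqrt(3), 1 / (2 * np.sqrt(3)), 0.0)
for point, a, t in zip(xi, approx, true):
    print(f"xi = {point:+.1f}   GC ~ {a:.3f}   true = {t:.3f}")
```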
Minimum dependence approach
- Minimum mutual information
- Minimum sum of marginal entropies
- Minimum positive transform of the cross-cumulants

Mutual information and marginal entropies
Mutual information (MI):
I(x) = ∫ f_x(u) log( f_x(u) / Π_{i=1}^{n} f_{x_i}(u_i) ) du
I(x) = 0 iff all the x_i are independent.
MI and the sum of the outputs' marginal entropies (with x = Wz and WW^T = I):
I(x) = Σ_{i=1}^{m} h(x_i) − h(x)
     = Σ_{i=1}^{m} h(x_i) − h(Wz)
     = Σ_{i=1}^{m} h(x_i) − h(z) − log|det(W)|
and log|det(W)| = 0 since WW^T = I, so h(x) = h(z) is constant.

Mutual information and marginal entropies
Mutual information:
- difficult to estimate (joint PDF of x), high computational cost
⇒ finding W s.t. I(x) is minimum
Sum of the outputs' marginal entropies:
- better than MI because no estimation of the joint PDF is needed
⇒ finding W s.t. WW^T = I minimizing Σ_{i=1}^{m} h(x_i); a sketch follows below
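A minimal sketch of this criterion (histogram bin count, mixing matrix and angle grid are illustrative): the marginal entropies are estimated with a plug-in histogram estimator, and their sum is swept over the remaining rotation angle of pre-whitened mixtures; its minimum sits at a separating angle.

```python
import numpy as np

rng = np.random.default_rng(6)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 100_000))   # unit-variance sources
A = np.array([[1.0, 0.8],
              [0.5, 1.0]])                                     # illustrative mixing
x = A @ s
d, E = np.linalg.eigh(np.cov(x))
z = E @ np.diag(d ** -0.5) @ E.T @ x                           # whitened mixtures

def marginal_entropy(y, bins=100):
    """Plug-in histogram estimate of the differential entropy h(y)."""
    p, edges = np.histogram(y, bins=bins, density=True)
    w = np.diff(edges)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]) * w[mask])

# W is constrained orthogonal, so only a rotation angle remains to estimate
thetas = np.linspace(0.0, np.pi / 2, 90)
H = []
for t in thetas:
    R = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    y = R @ z
    H.append(marginal_entropy(y[0]) + marginal_entropy(y[1]))
print(f"separating angle ~ {thetas[int(np.argmin(H))]:.3f} rad")
```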
Moments and cumulants
The probability density function captures the properties of the distribution (mean, variance, …).
Moments:
- moment of order r: µ'_r(x) = E[x^r]
- centered moment of order r: µ_r(x) = E[(x − E[x])^r]

Independence, PCA and ICA
Whitening:
- diagonalization of the covariance matrix (+ scaling)
- only moments up to order 2 are taken into account
ICA:
- diagonalization of a higher-order cumulant tensor (a hyper-matrix with four indices i, j, k, l): like a higher-order covariance!
- in order to go further than (linear) decorrelation
Independence: one should know all the cross-cumulants and then make them equal to zero!
The Gaussian case (con't)
A Gaussian distribution is perfectly defined by the mean and variance of the variable:
all its cumulants of order > 2 are strictly zero!
PCA: the data are described by covariance matrices; only moments up to order 2 are taken into account.
ICA = making the higher-order cross-statistics zero; but they are already zero for Gaussian PDFs (see the Gram-Charlier expansion)!
⇒ Decorrelation = independence for Gaussian variables!
The decorrelating transform is determined only up to a rotation: too many indeterminations for the BSS problem.
Other information is needed (temporal structure, frequency, …).

Theory and practice
In theory, independence measures require the knowledge of the PDFs (to compute mutual information, entropies, …). In practice, those PDFs are unknown. Two possibilities:
- density estimation (a difficult task)
- estimate the independence measures directly
Example of independence approximation:
- independence: all higher-order cross-cumulants must be zero
- approximation: the cross-covariance and cross-kurtosis must be zero
Dependence minimization
How to maximize independence or non-Gaussianity? For example, through cumulants or negentropy.
Often, a pre-whitening step is useful; the ICA problem then reduces to finding a rotation matrix:
++ : the number of elements to estimate reduces from n² to n(n−1)/2
−− : computational problems and errors (if PCA failed)
Objective functions (OF) to minimize: estimations or approximations of non-Gaussianity and independence measures. Local minima?
Algorithms to minimize the OF: neural (gradient-based, …) and algebraic methods, plus specific algorithms for specific problems; a fixed-point sketch follows below.

Local vs global criterion
[Figure: source scatter plot, and the criterion I(y) plotted against the rotation angle θ for y = Ws with W = [[cos θ, −sin θ], [sin θ, cos θ]]; the «local criterion» shows spurious local minima, while the «global criterion» has its minima only at the separating angles, repeating every π/2.]
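As an example of a fixed-point scheme, the sketch below runs the well-known FastICA one-unit update w ← E[z g(wᵀz)] − E[g′(wᵀz)] w with g = tanh on pre-whitened data. The mixing matrix, seed and tolerance are illustrative; this is a sketch of the standard algorithm, not code from the course.

```python
import numpy as np

rng = np.random.default_rng(7)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 100_000))  # unit-variance sources
A = np.array([[1.0, 0.6],
              [0.3, 1.0]])                                    # illustrative mixing
x = A @ s
d, E = np.linalg.eigh(np.cov(x))
z = E @ np.diag(d ** -0.5) @ E.T @ x                          # pre-whitening step

w = rng.standard_normal(2)
w /= np.linalg.norm(w)                                        # random unit start
for _ in range(100):
    wz = w @ z                                                # current projection
    g = np.tanh(wz)                                           # non-linearity g
    g_prime = 1.0 - g ** 2                                    # its derivative
    w_new = (z * g).mean(axis=1) - g_prime.mean() * w         # fixed-point update
    w_new /= np.linalg.norm(w_new)
    converged = abs(abs(w_new @ w) - 1.0) < 1e-9              # up to sign
    w = w_new
    if converged:
        break
print("estimated direction:", np.round(w, 3))
y1 = w @ z                                                    # one estimated component
```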
Image preprocessing
[Figure]

Signal separation
[Figure: two examples; the mixtures x1, x2 and the outputs y1, y2, which recover the sources up to sign (y ≈ −s).]
Biomedical application: FECG extraction
Whitening/ICA applied to signals recorded on a pregnant woman's abdomen.
[Figure: recorded signals and extracted source signals, separating the maternal ECG from the fetal ECG. F. Vrins]

Mixture of digital sources (PCA vs ICA)
[Figure: PCA and ICA outputs for a mixture of digital sources.]
Independent subimages
A.J. Bell, T.J. Sejnowski, "Edges are the Independent Components of natural scenes", NIPS 96, pp. 831-836, MIT Press, 1996.

Hands-free phone in car
N. Charkani El Hassani, "Séparation auto-adaptative de sources pour des mélanges convolutifs. Application à la téléphonie mains-libres dans les voitures" (self-adaptive source separation for convolutive mixtures, applied to hands-free telephony in cars), PhD thesis, INP Grenoble, 1996.
Multiple RF tags
Y. Deville, J. Damour, N. Charkani, "Improved multi-tag radio-frequency identification systems based on new source separation neural networks", Proc. of ICA'99, Aussois (France), January 1999, pp. 449-454.

Financial time series
ICA reconstruction (4 ICs)
A.D. Back, A.S. Weigend, "A First Application of Independent Component Analysis to Extracting Structure from Stock Returns", International Journal of Neural Systems, Vol. 8 (October 1997).
Cocktail party
Speech/music separation: observations → estimations
Speech/speech separation: observations → estimations
T.-W. Lee, Institute for Neural Computation, University of California (San Diego), http://www.cnl.salk.edu/~tewon/blind/blind_audio.html

Problem extensions [1/3]
Basic model (n measures, m sources, n = m):
x_i(t) = Σ_{j=1}^{m} a_ij s_j(t), i = 1, …, n
Extensions:
- n > m: m is first estimated; a PCA stage reduces the dimension from n to m, then source separation (ICA) is applied to the m signals
- n < m: the n most powerful sources are estimated; the result is corrupted by the m − n other sources; other techniques use sparsity, …
Problem extensions [2/3]
Extensions (ctnd):
Noisy observations:
x_i(t) = Σ_{j=1}^{m} a_ij s_j(t) + n_i(t), i = 1, …, n, i.e. x(t) = As(t) + n(t)
Even if W is estimated perfectly:
y(t) = Wx(t) = WAs(t) + Wn(t)
If n >> m: specific algorithms (projection onto the signal subspace).
Ill-conditioned mixings (rows of A are similar): specific algorithms.

Problem extensions [3/3]
Extensions (ctnd):
More complex mixtures (filtering): specific algorithms.
Convolutive mixtures: x(t) = A(t) * s(t), i.e.
x_i(t) = Σ_{j=1}^{m} Σ_{k=0}^{p−1} a_ij(k) s_j(t − k), i = 1, …, n
Post non-linear mixtures:
x_i(t) = f_i( Σ_{j=1}^{m} a_ij s_j(t) ), i = 1, …, n
Non-linear mixtures:
x_i(t) = f_i( s_1(t), …, s_m(t) ), i = 1, …, n
Sources and References
Some ideas and figures contained in these slides come from:
C. Jutten & J. Hérault, "Blind separation of sources, Part I", Signal Processing, vol. 24, pp. 1-10, 1991.
F. Vrins, J. A. Lee, V. Vigneron and C. Jutten, "Improving Independent Component Analysis Performances by Variable Selection", IEEE NNSP'03, pp. 359-368, September 17-19, 2003, Toulouse (France).
A. Paraschiv-Ionescu et al., "High performance magnetic field smart sensor arrays with source separation", Proc. 1st Int. Conf. on Modeling and Simulation of Microsystems (MSM98), Santa Clara (USA), April 6-8, 1998.
T. Cover and J. Thomas, Elements of Information Theory, Wiley and Sons, New York, 1991.
A. Hyvärinen, J. Karhunen and E. Oja, Independent Component Analysis, Wiley series on adaptive and learning systems for signal processing, communications and control, S. Haykin ed., 2001.
Thanks to Frédéric Vrins for many slides!