Independent Component Analysis
Slides from Paris Smaragdis, UIUC

A simple audio problem
Formalizing the problem
- Each mic will receive a mix of both sounds
- Sound waves superimpose linearly
- We'll ignore propagation delays for now
- The simplified mixing model is x(t) = A·s(t), i.e.:
  x1(t) = a11·s1(t) + a21·s2(t)
  x2(t) = a12·s1(t) + a22·s2(t)
- We know x(t), but nothing else
- How do we solve this system and find s(t)?

When can we solve this?
- The mixing equation is: x(t) = A·s(t)
- Our estimate of s(t) will be: ŝ(t) = A⁻¹·x(t)
- To recover s(t), A must be invertible:
  - We need as many mics as sources
  - The mics/sources must not coincide
  - All sources must be audible
- Otherwise this is a different story

A simple example
- x(t) = A·s(t), with A = [1 2; 1 1]
- A simple invertible problem:
  - s(t) contains two structured waveforms
  - A is invertible (but we don't know it)
  - x(t) looks messy and doesn't reveal s(t) clearly
- How do we solve this? Any ideas?

What to look for
- x(t) = A·s(t), and we can only use x(t)
- Is there a property we can take advantage of?
- Yes! We know that different sounds are statistically unrelated
- The plan: find a solution that enforces this unrelatedness

A first try
- Find s(t) by minimizing cross-correlation
- Our estimate of s(t) is computed by: ŝ(t) = W·x(t)
- If W ≈ A⁻¹ then we have a good solution
- The goal is an uncorrelated output: ⟨ŝi(t), ŝj(t)⟩ = 0 for i ≠ j
  (we assume here that our signals are zero mean)
- So the overall problem to solve is:
  argmin_W |⟨Σk wik·xk(t), Σk wjk·xk(t)⟩| for i ≠ j
How to solve for uncorrelatedness
- Let's use matrices instead of time series:
  x(t) → X = [x1(1) … x1(N); x2(1) … x2(N)], etc.
- So x(t) = A·s(t) and ŝ(t) = W·x(t) become X = A·S and Ŝ = W·X
- Uncorrelatedness translates to a diagonal output covariance:
  (1/N)·Ŝ·Ŝᵀ = [c11 0; 0 c22]
- We then need to diagonalize: (1/N)·Ŝ·Ŝᵀ = (1/N)·(W·X)·(W·X)ᵀ

How to solve for uncorrelatedness
- This is actually a well-known problem in linear algebra
- One solution is eigenanalysis:
  Cov(Ŝ) = (1/N)·Ŝ·Ŝᵀ = (1/N)·(W·X)·(W·X)ᵀ = W·((1/N)·X·Xᵀ)·Wᵀ = W·Cov(X)·Wᵀ
- Cov(X) is a symmetric matrix

How to solve for uncorrelatedness
- For any symmetric matrix like Cov(X) we know that: Cov(X) = U·D·Uᵀ
  where the columns ui of U are the eigenvectors and the diagonal entries di of D are the eigenvalues of Cov(X)
- In our case, if [U, D] = eig(Cov(X)) and we let W = Uᵀ:
  Cov(Ŝ) = W·Cov(X)·Wᵀ = Uᵀ·U·D·Uᵀ·U = D
  (note that Uᵀ·U = I, since U is orthonormal)
- This is actually Principal Component Analysis

Another solution
- We can also solve using a matrix inverse square root, W = Cov(X)^(−1/2):
  Cov(Ŝ) = W·Cov(X)·Wᵀ = Cov(X)^(−1/2)·Cov(X)·Cov(X)^(−1/2) = I
- It is computed from the eigendecomposition:
  Cov(X)^(−1/2) = U·diag(d1^(−1/2), …, dN^(−1/2))·Uᵀ, where [U, D] = eig(Cov(X))
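The two algebraic solutions above can be sketched in a few lines of NumPy. The sources, their variances, and the 2×2 mixing matrix below are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
S = np.vstack([rng.normal(0, 2, 10000),   # s1 ~ N(0, 2)
               rng.normal(0, 1, 10000)])  # s2 ~ N(0, 1)
A = np.array([[1.0, 2.0], [1.0, 1.0]])
X = A @ S

cov_X = X @ X.T / X.shape[1]

# Solution 1: W = U^T diagonalizes Cov(X) -- this is PCA
d, U = np.linalg.eigh(cov_X)
cov_pca = U.T @ cov_X @ U                 # = D, a diagonal matrix

# Solution 2: W = Cov(X)^(-1/2) makes the output covariance the identity
W = U @ np.diag(d ** -0.5) @ U.T
cov_white = W @ cov_X @ W.T
print(np.allclose(cov_white, np.eye(2)))  # True
```

Both solutions only fix the second-order statistics; as the following slides show, that is not yet source separation.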
Another approach
- What if we want to do this in real time?
- We can also estimate W in an online manner:
  ΔW = μ·(I − W·x(t)·x(t)ᵀ·Wᵀ)·W,  W ← W + ΔW
- Every time we see a new observation x(t) we update W
- Using this adaptive approach, Cov(W·x(t)) → I as W converges
- We'll skip the derivation details for now
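A sketch of this update rule. To make the demo deterministic and testable, the rank-one term x(t)·x(t)ᵀ is replaced by its average Cov(x), i.e. the expected update; the mixing matrix and step size μ are illustrative choices:

```python
import numpy as np

A = np.array([[1.0, 2.0], [1.0, 1.0]])
cov_x = A @ A.T          # Cov(x) for unit-variance, uncorrelated sources
W = np.eye(2)
mu = 0.01                # step size (an illustrative choice)

for _ in range(5000):
    # dW = mu (I - W x x^T W^T) W, here averaged over observations
    W = W + mu * (np.eye(2) - W @ cov_x @ W.T) @ W

# At the fixed point the unmixed covariance is the identity
print(np.allclose(W @ cov_x @ W.T, np.eye(2), atol=1e-6))  # True
```

In a true online setting each fresh sample x(t) supplies a noisy estimate of Cov(x), so W fluctuates around this same fixed point.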
Summary so far
- For a mixture X = A·S we can algebraically recover an uncorrelated output using Ŝ = W·X:
  - if W is the eigenvector matrix of Cov(X)
  - or with W = Cov(X)^(−1/2)
- Or we can use an online estimator: ΔW = μ·(I − W·x(t)·x(t)ᵀ·Wᵀ)·W

So how well does this work?
- Mixing system: x(t) = A·s(t), with A = [1 2; 1 1]
- Unmixing system: ŝ(t) = W·x(t), with the estimate W = [0.6 2.8; 0.4 3.7]
- Well, that was a waste of time…

What went wrong?
- What does decorrelation mean?
  - That the two things compared are not related on average
- Consider a mixture of two Gaussians:
  s(t) = [N(0,2); N(0,1)],  x(t) = [1 2; 1 1]·s(t)

Decorrelation
- Now let us do what we derived so far on this signal:
  a) Find the eigenvectors
  b) Rotate and scale so that the covariance is I
- After we are done, the two Gaussians are statistically independent, i.e. P(s1, s2) = P(s1)·P(s2)
- We have in effect separated the original signals
  - Save for a scaling ambiguity

Now let's try this on the original data
- x(t) = A·s(t), with A = [1 2; 1 1]
- Stating the obvious: these are not very Gaussian signals!

Doing PCA doesn't give us the right solution
- Same recipe: a) find the eigenvectors, b) rotate and scale so that the covariance is I
- The result is not what we want: we are off by a rotation
- This idea doesn't seem to work for non-Gaussian signals
So what's wrong?
- For Gaussian data, decorrelation means independence
  - Gaussians are fully described by their first two moments (1st is the mean, 2nd is the variance)
  - By minimizing the 2nd-order cross-statistics we achieve independence
  - These statistics are expressed by the 2nd-order cumulants, cum(xi, xj) = E[xi·xj], which are the entries of the covariance matrix
- But real-world data are seldom Gaussian
  - Non-Gaussian data have higher-order statistics which are not taken care of by PCA
  - We can measure their dependence using higher-order cumulants (for zero-mean data):
    3rd order: cum(xi, xj, xk) = E[xi·xj·xk]
    4th order: cum(xi, xj, xk, xl) = E[xi·xj·xk·xl] − E[xi·xj]·E[xk·xl] − E[xi·xk]·E[xj·xl] − E[xi·xl]·E[xj·xk]
Cumulants for the Gaussian vs. non-Gaussian case
- Gaussian case: cross-cumulants of all orders tend to zero
  cum(xi, xj) ≈ 1e−13
  cum(xi, xj, xk) ≈ 0.0008, 0.0004
  cum(xi, xj, xk, xl) ≈ −0.003, 0.0005, 0.0007
- Non-Gaussian case: only the 2nd-order cross-cumulants tend to zero
  cum(xi, xj) ≈ −4e−14
  cum(xi, xj, xk) ≈ −0.08, −0.1
  cum(xi, xj, xk, xl) ≈ 0.42, −0.3, 0.16

The real problem to solve
- For statistical independence we need to minimize all the cross-cumulants
  - In practice, up to 4th order is enough
- For 2nd order we minimized the off-diagonal elements of
  [cum(x1, x1) cum(x1, x2); cum(x2, x1) cum(x2, x2)]
- For 4th order we will do the same for a tensor: Q = cum(xi, xj, xk, xl), for all i, j, k, l
- The process is similar to PCA, but in more dimensions
  - We now find eigenmatrices instead of eigenvectors
- Algorithms like JADE and FOBI solve this problem
- Can you see a potential problem though?
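A quick numerical check of why 2nd-order decorrelation is not enough. Below, two uncorrelated mixtures are built by rotating independent sources, and the 4th-order cross-cumulant cum(x, x, y, y) is estimated for a Gaussian and a uniform (non-Gaussian) pair. The 45° rotation and the source choices are illustrative:

```python
import numpy as np

def cum4(x, y):
    # cum(x, x, y, y) for zero-mean signals:
    # E[x^2 y^2] - E[x^2] E[y^2] - 2 E[xy]^2
    return np.mean(x*x*y*y) - np.mean(x*x)*np.mean(y*y) - 2*np.mean(x*y)**2

rng = np.random.default_rng(2)
N = 200_000
c, s = np.cos(np.pi/4), np.sin(np.pi/4)
R = np.array([[c, -s], [s, c]])                       # 45-degree "mixing"

g = R @ rng.normal(0, 1, (2, N))                      # Gaussian sources
u = R @ rng.uniform(-np.sqrt(3), np.sqrt(3), (2, N))  # uniform, unit variance

# Both mixtures are uncorrelated (their 2nd-order cross-cumulants vanish),
# but only the Gaussian one also has a vanishing 4th-order cross-cumulant
print(abs(cum4(g[0], g[1])))   # close to 0
print(abs(cum4(u[0], u[1])))   # clearly nonzero
```

The nonzero 4th-order cross-cumulant of the uniform mixture is exactly the residual dependence that PCA cannot see.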
An alternative approach
- Tensorial methods can be very computationally intensive
- How about an online method instead?
- Independence can also be coined as non-linear decorrelation
- x and y are independent if and only if:
  E[f(x)·g(y)] = E[f(x)]·E[g(y)] for all continuous functions f and g
- This is a non-linear extension of 2nd-order independence, where f(x) = g(x) = x
- We can try solving for that then

Online ICA
- Conceptually this is very similar to online decorrelation:
  - For decorrelation: ΔW = μ·(I − W·x(t)·x(t)ᵀ·Wᵀ)·W
  - For non-linear decorrelation: ΔW = μ·(I − f(W·x(t))·g(W·x(t))ᵀ)·W
- This adaptation method is known as the Cichocki-Unbehauen update
  - But we can obtain it in many different ways
- How do we pick the non-linearities? It depends on the prior we have on the sources, e.g.:
  f(xi) = +tanh(xi) for super-Gaussian sources
  f(xi) = −tanh(xi) for sub-Gaussian sources
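A minimal sketch of the nonlinear-decorrelation rule, run after a whitening step and batch-averaged for stability; f = tanh with g = identity is one common choice for super-Gaussian (here Laplacian) sources. The mixing matrix, step size, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20000
S = rng.laplace(0, 1, (2, N))            # super-Gaussian sources
A = np.array([[1.0, 2.0], [1.0, 1.0]])
X = A @ S

# Whiten first, using the decorrelation solution derived earlier
d, U = np.linalg.eigh(X @ X.T / N)
V = np.diag(d ** -0.5) @ U.T
Z = V @ X

# dW = mu (I - f(W z) g(W z)^T) W with f = tanh, g = identity,
# averaged over the whole batch instead of one sample at a time
W = np.eye(2)
mu = 0.1
for _ in range(1000):
    Y = W @ Z
    W += mu * (np.eye(2) - np.tanh(Y) @ Y.T / N) @ W

# The total transform W V A should be a scaled permutation: each row of
# its magnitude is dominated by a single entry
P = np.abs(W @ V @ A)
print(P.max(axis=1) / P.sum(axis=1))     # both ratios close to 1
```

Unlike the linear rule, this one drives the higher-order cross-statistics to zero as well, so the rotation ambiguity left by whitening is resolved.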
Other popular approaches
- Minimum mutual information
  - Minimize the mutual information of the outputs
  - Creates maximally statistically independent outputs
- Infomax
  - Maximize the entropy of the output, or the mutual information between input and output
- Non-Gaussianity
  - Adding signals tends towards Gaussianity (central limit theorem)
  - Finding the maximally non-Gaussian outputs undoes the mixing
- Maximum likelihood
  - Less straightforward at first, but elegant nevertheless
- Geometric methods
  - Trying to eyeball the proper way to rotate

Trying this on our dataset
- x(t) = A·s(t), with A = [1 2; 1 1]

Trying this on our dataset
- ŝ(t) = W·x(t), with the estimate W = [1.39 2.5; 2.78 2.58]
- We actually separated the mixture!

Trying this on our dataset
- ŝ(t) = W·x(t), with W = [1.39 2.5; 2.78 2.58]

But something is amiss…
- There are some things that ICA will not resolve:
  - Scale: statistical independence is invariant to scale (and sign)
  - Output order: ordering is irrelevant when talking about independence
- ICA will actually recover: ŝ(t) = D·P·s(t),
  where D is a diagonal and P is a permutation matrix
This works really well for audio mixtures!
- (Input, mix, and output examples)

Problems with instantaneous mixing
- Sounds don't really mix instantaneously
- There are multiple effects:
  - Room reflections
  - Sensor response
  - Propagation delays
  - Propagation and reflection filtering
- Most can be seen as filters
- We need a convolutive mixing model
- (Figure: estimated sources when using the instantaneous model on a convolutive mix)

Convolutive mixing
- Instead of instantaneous mixing: xi(t) = Σj aij·sj(t)
- We now have convolutive mixing: xi(t) = Σj Σk aij(k)·sj(t − k)
- The mixing filters aij(k) encapsulate all the mixing effects in this model
- But how do we do ICA now? This is an ugly equation!

FIR matrix algebra
- Matrices with FIR filters as elements:
  A = [a11 a12; a21 a22], where aij = [aij(0) … aij(k−1)]
- FIR matrix multiplication performs convolution and accumulation:
  A·b = [a11 a12; a21 a22]·[b1; b2] = [a11∗b1 + a12∗b2; a21∗b1 + a22∗b2]
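FIR matrix multiplication is easy to sketch with np.convolve: convolution replaces scalar multiplication, and the results accumulate by addition. The filters and signals below are made-up short examples (all filters the same length so the convolved outputs align):

```python
import numpy as np

def fir_matmul(A, b):
    # Each A[i][j] is an FIR filter (a 1-D array); multiplying an FIR
    # matrix by a vector of signals convolves and accumulates
    return [sum(np.convolve(A[i][j], b[j]) for j in range(len(b)))
            for i in range(len(A))]

A = [[np.array([1.0, 0.5]), np.array([0.2, 0.1])],
     [np.array([0.3, 0.0]), np.array([1.0, -0.4])]]
b = [np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0, -1.0])]

x = fir_matmul(A, b)
print(x[0])  # a11 * b1 + a12 * b2 = [1.2, 2.6, 3.8, 1.4]
```

Note that each output is longer than the inputs, as linear convolution always is; setting every filter to a single scalar recovers ordinary matrix multiplication.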
Back to convolutive mixing
- Now we can rewrite convolutive mixing, xi(t) = Σj Σk aij(k)·sj(t − k), as:
  x(t) = A·s(t) = [a11∗s1(t) + a12∗s2(t); a21∗s1(t) + a22∗s2(t)]
- A tidier formulation!
- We can use the FIR matrix abstraction to solve this problem now

An easy way to solve convolutive mixing
- Straightforward translation of the instantaneous learning rules using FIR matrices:
  ΔW = μ·(I − f(W·x)·g(W·x)ᵀ)·W
- Not so easy with algebraic approaches!
- Multiple other (and more rigorous/better-behaved) approaches have been developed

Complications with this approach
- The required convolutions are expensive:
  - Real-room filters are long
  - Their FIR inverses are very long
  - FIR products can become very time consuming
- Convergence is hard to achieve:
  - Huge parameter space
  - Tightly interwoven parameter relationships
  - A slow optimization nightmare!

FIR matrix algebra, part II
- FIR matrices have frequency-domain counterparts
- And their products are simpler: convolutions become per-frequency multiplications

Yet another convolutive mixing formulation
- We can now model the process in the frequency domain: X̂ = Â·Ŝ
- For every frequency we have: X_f(t) = A_f·S_f(t), for all f, t
- Hey, that's instantaneous mixing! We can solve that!
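The identity behind this move is that convolution in time becomes per-frequency multiplication. A quick check with made-up signals, zero-padding so the linear convolution matches the DFT product exactly:

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(size=8)     # a made-up mixing filter
s = rng.normal(size=64)    # a made-up source snippet

# Pad to the full linear-convolution length so the DFT product matches
# time-domain convolution exactly (no circular wrap-around)
L = len(a) + len(s) - 1
lhs = np.fft.rfft(np.convolve(a, s), L)
rhs = np.fft.rfft(a, L) * np.fft.rfft(s, L)
print(np.allclose(lhs, rhs))   # True: convolution -> per-frequency product
```

With an STFT the same idea holds approximately per frame, which is what turns the convolutive problem into one instantaneous ICA problem per frequency bin.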
Overall flowgraph
- Convolved mixtures x1 … xN → frequency transform (M-point STFT) → mixed frequency bins → instantaneous ICA unmixers W1 … WM → unmixed frequency bins → time transform (M-point ISTFT) → recovered sources S1 … SN

Some complications
- Permutation issues
  - We don't know which source will end up in each narrowband output
  - The resulting output can have separated narrowband elements from both sounds! (extracted source with permutations)
- Scaling issues
  - Narrowband outputs can be scaled arbitrarily
  - This results in spectrally colored outputs (original vs. colored source)

Scaling issue
- One simple fix is to normalize the separating matrices: W_f ← W_f / ‖W_f‖
- This results in more reasonable scaling
- More sophisticated approaches exist, but this is not a major problem
- Some spectral coloration is, however, unavoidable (original vs. colored vs. corrected source)
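One hypothetical version of such a normalization, rescaling each row of every narrowband unmixing matrix to unit norm (other choices exist, e.g. the minimal-distortion fix diag(W_f⁻¹)·W_f; the matrices below are made up):

```python
import numpy as np

def normalize_unmixers(W_all):
    # Rescale each row of every per-frequency unmixing matrix W_f to unit
    # norm, removing the arbitrary per-bin output scaling
    return [Wf / np.linalg.norm(Wf, axis=1, keepdims=True) for Wf in W_all]

# Made-up per-frequency unmixing matrices with wildly different scales
W_all = [np.array([[3.0, 4.0], [0.0, 2.0]]),
         np.array([[0.1, 0.0], [7.0, 24.0]])]

for Wf in normalize_unmixers(W_all):
    print(np.linalg.norm(Wf, axis=1))   # [1. 1.] for every frequency
```

Since independence is scale-invariant, this rescaling does not affect the separation itself, only the spectral coloration of the outputs.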
Some solutions for permutation problems
- Continuity of the unmixing matrices
  - Adjacent unmixing matrices tend to be a little similar; we can permute/bias them accordingly
  - Doesn't work that great
- Smoothness of the spectral output
  - Narrowband components from each source tend to modulate the same way
  - Permute the unmixing matrices so that adjacent narrowband outputs are similarly modulated
  - Works fine
- The above can fail miserably for more than two sources: combinatorial explosion!
Interspeech: Microphone Array Processing
Beamforming and ICA
- If we know the placement of the sensors we can obtain the spatial response of the ICA solution
- ICA places nulls to cancel out interfering sources
  - Just as in the instantaneous case, we cancel out sources
- We can now visualize the permutation problem: out-of-place bands are bands with permutation problems

Using beamforming to resolve permutations
- Spatial information can be used to resolve permutations
- Find permutations that preserve the nulls, or that smooth out the responses
- Works fine, although it can be flaky if the array response is not that clean

The N-input N-output problem
- ICA, in either formulation, inverts a square matrix (whether scalar or FIR)
- This implies that we have the same number of inputs as outputs
  - E.g., on a street with 30 noise sources we need at least 30 mics!
- Solutions exist for M inputs and N outputs where M > N
  - If N > M we can only beamform
  - In some cases the extra sources can be treated as noise
- This can be restrictive in some situations
Separation recap
- Orthogonality is not independence!
  - Not all signals are Gaussian, which is a usual assumption
- We can model instantaneous mixtures with ICA and get good results
  - ICA algorithms can optimize a variety of objectives, but ultimately result in statistical independence between the outputs
  - The same model is useful for all sorts of mixing situations
- Convolutive mixtures are more challenging, but solvable
  - There's more ambiguity, and a closer link to signal-processing approaches

ICA for data exploration
- ICA is also great for data exploration
  - If PCA is, then ICA should be, right?
- With data of large dimensionality we want to find structure
  - PCA can reduce the dimensionality and clean up the data structure a bit
  - But ICA can find much more intuitive projections

Example cases of PCA vs. ICA
- PCA will indicate orthogonal directions of maximal variance
  - This is great for Gaussian data
  - Also great if we are into least-squares models
- The real world is not Gaussian though
  - For non-Gaussian data, ICA finds directions that are more revealing

Finding useful transforms with ICA
- Audio preprocessing example:
  - Take a lot of audio snippets, concatenate them into a big matrix, and do component analysis
  - PCA results in the DCT bases (do you see why?)
  - ICA returns time/frequency-localized sinusoids, which is a better way to analyze sounds
- Ditto for images: ICA returns localized edge filters

Enhancing PCA with ICA
- ICA cannot perform dimensionality reduction
  - The goal is to find independent components, hence there is no sense of order
- PCA does a great job at reducing dimensionality
  - It keeps the elements that carry most of the input's energy
- It turns out that PCA is a great preprocessor for ICA
  - There is no guarantee that the PCA subspace will be appropriate for the independent components, but for most practical purposes this doesn't make a big difference
- Pipeline: M × N input → PCA (M → K) → K × K ICA → K ICs
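The M × N → K front end of this pipeline can be sketched with a plain eigendecomposition; the dimensions and the random tall mixing below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, K = 5000, 10, 2
S = rng.laplace(0, 1, (K, N))    # K independent sources
A = rng.normal(size=(M, K))      # tall mixing: M sensors, K sources
X = A @ S                        # M x N input

# PCA front end: keep the K top-variance eigenvectors (M x N -> K x N)
d, U = np.linalg.eigh(X @ X.T / N)
T = U[:, -K:].T                  # K x M projection onto the top subspace
Z = T @ X                        # reduced data, ready for a K x K ICA stage
print(Z.shape)                   # (2, 5000)
```

The subsequent ICA stage then only has to estimate a small K × K unmixing matrix on Z rather than an M × M one on X.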
Example case: ICA-faces vs. eigenfaces

A video example
- The movie is a series of frames; each frame is a data point
- 126 frames of 80 × 60 pixels, so the data X will be 4800 × 126
- Using PCA/ICA: X = W·H
  - W will contain the visual components
  - H will contain their time weights

PCA results
- Nothing special about the visual components
  - They are orthogonal pictures; does this mean anything? (not really…)
- Some segmentation of constant vs. moving parts
- Some highlighting of the action in the weights

ICA results
- Much more interesting visual components; they are independent
- Unrelated elements (left/right hands, background) are now highlighted
- We get some decomposition by parts
- The components' weights now describe the scene

A video example
- The movie is a series of frames; each frame is a data point
- Input movie: 315 frames of 80 × 60 pixels, so the data X will be 4800 × 315
- Using PCA/ICA: X = W·H
  - W will contain the visual components (independent neighborhoods)
  - H will contain their time weights
What about the soundtrack?
- We can also analyze audio in a similar way
- We do a frequency transform and get an audio spectrogram X
  - X is frequencies × time
- Distinct audio elements can be seen in X
- Unlike before, we have only one input this time

PCA on audio
- Umm… it sucks!
- Orthogonality doesn't mean much for audio components
- The results are mathematically optimal, but perceptually useless

ICA on audio
- A definite improvement
- Independence helps pick up somewhat more meaningful sound objects
- The results are not too clean, but the intentions are clear; it misses some details

Audio-visual components?
- We can even take in both audio and video data and try to find structure
- Sometimes there is a very strong correlation between auditory and visual elements
- We should be able to discover that automatically (input video + input audio)

Audio/visual components

Which allows us to play with the output
- And of course, once we have such a nice description we can resynthesize at will

Recap
- A motivating example
- The theory
  - Decorrelation
  - Independence vs. decorrelation
  - Independent component analysis
- Separating sounds
  - Solving instantaneous mixtures
  - Solving convolutive mixtures
- Data exploration and independence
  - Extracting audio features
  - Extracting multimodal features