Independent Component Analysis
Slides from Paris Smaragdis, UIUC

A simple audio problem

Formalizing the problem
Each mic will receive a mix of both sounds. Sound waves superimpose linearly. We'll ignore propagation delays for now.
The simplified mixing model is x(t) = A·s(t), i.e.
x_1(t) = a_11·s_1(t) + a_21·s_2(t)
x_2(t) = a_12·s_1(t) + a_22·s_2(t)
We know x(t), but nothing else. How do we solve this system and find s(t)?

When can we solve this?
The mixing equation is x(t) = A·s(t). Our estimate of s(t) will be ŝ(t) = A⁻¹·x(t).
To recover s(t), A must be invertible:
- We need as many mics as sources
- The mics/sources must not coincide
- All sources must be audible
Otherwise this is a different story.

A simple example
x(t) = A·s(t), with A = [1 2; 1 1].
A simple invertible problem: s(t) contains two structured waveforms, A is invertible (but we don't know it), and x(t) looks messy and doesn't reveal s(t) clearly.
How do we solve this? Any ideas?

What to look for
x(t) = A·s(t), and we can only use x(t). Is there a property we can take advantage of?
Yes! We know that different sounds are statistically unrelated.
The plan: find a solution that enforces this unrelatedness.

A first try
Find s(t) by minimizing cross-correlation. Our estimate of s(t) is computed by ŝ(t) = W·x(t); if W ≈ A⁻¹ then we have a good solution.
The goal is for the outputs to become uncorrelated: ⟨ŝ_i(t), ŝ_j(t)⟩ = 0 for i ≠ j (we assume here that our signals are zero mean).
So the overall problem to solve is:
argmin_W ⟨Σ_k w_ik·x_k(t), Σ_k w_jk·x_k(t)⟩ for all i ≠ j

How to solve for uncorrelatedness
Let's use matrices instead of time series:
x(t) → X = [x_1(1) ... x_1(N); x_2(1) ... x_2(N)], etc.
The uncorrelatedness condition translates as follows: x(t) = A·s(t) and ŝ(t) = W·x(t) become X = A·S and Ŝ = W·X, and we want
(1/N)·Ŝ·Ŝᵀ = [c_11 0; 0 c_22]
We then need to diagonalize (1/N)·Ŝ·Ŝᵀ = (1/N)·(W·X)·(W·X)ᵀ.

How to solve for uncorrelatedness
This is actually a well known problem in linear algebra. One solution is eigenanalysis:
Cov(Ŝ) = (1/N)·Ŝ·Ŝᵀ = (1/N)·(W·X)·(W·X)ᵀ = W·((1/N)·X·Xᵀ)·Wᵀ = W·Cov(X)·Wᵀ
Cov(X) is a symmetric matrix.

How to solve for uncorrelatedness
For any symmetric matrix like Cov(X) we know that:
Cov(X) = U·D·Uᵀ, with U = [u_1 ... u_N] and D = diag(d_1, ..., d_N)
where the u_i and d_i are the eigenvectors and eigenvalues of Cov(X). In our case, if [U, D] = eig(Cov(X)):
Cov(Ŝ) = W·Cov(X)·Wᵀ = W·U·D·Uᵀ·Wᵀ
Let us replace W with Uᵀ:
Uᵀ·U·D·Uᵀ·U = D   (note UᵀU = I, since U is orthonormal)
This is actually Principal Component Analysis.
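
As a concrete sketch of this eigenanalysis route (not code from the slides; a minimal numpy demo in which the toy sources, the sample count, and the reuse of the slides' A = [1 2; 1 1] are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
S = np.vstack([np.sign(rng.standard_normal(N)),   # toy zero-mean source 1
               rng.uniform(-1, 1, N)])            # toy zero-mean source 2
A = np.array([[1.0, 2.0],
              [1.0, 1.0]])                        # the example mixing matrix from the slides
X = A @ S                                         # observed mixtures

C = (X @ X.T) / N                                 # Cov(X) for zero-mean data
d, U = np.linalg.eigh(C)                          # eigenvalues/eigenvectors of a symmetric matrix
W = U.T                                           # the W = U^T choice from the slide
S_hat = W @ X
print(np.round((S_hat @ S_hat.T) / N, 3))         # ~diagonal: the outputs are uncorrelated
```

The off-diagonal entries of the printed matrix are numerically zero, which is exactly the decorrelation the slide asks for.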

Another solution
We can also solve for a matrix inverse square root: choosing W = (X·Xᵀ)^(-1/2) gives
Cov(Ŝ) = W·X·Xᵀ·Wᵀ = (X·Xᵀ)^(-1/2)·X·Xᵀ·(X·Xᵀ)^(-1/2) = I
where the inverse square root is built from the eigendecomposition:
(X·Xᵀ)^(-1/2) = U·diag(d_1^(-1/2), ..., d_N^(-1/2))·Uᵀ, with [U, D] = eig(X·Xᵀ)
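
The inverse-square-root option looks like this on the same toy data (again a sketch, not the slides' code; it normalizes by N so the output covariance comes out exactly as the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
S = np.vstack([np.sign(rng.standard_normal(N)), rng.uniform(-1, 1, N)])
X = np.array([[1.0, 2.0], [1.0, 1.0]]) @ S        # same toy mixtures as before

C = (X @ X.T) / N                                 # covariance of the mixtures
d, U = np.linalg.eigh(C)
W_white = U @ np.diag(d ** -0.5) @ U.T            # symmetric inverse square root of Cov(X)
Z = W_white @ X
print(np.round((Z @ Z.T) / N, 3))                 # ~identity: uncorrelated and unit variance
```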

Another approach
What if we want to do this in real time? We can also estimate W in an online manner:
ΔW = (I − W·x(t)·x(t)ᵀ·Wᵀ)·W
Every time we see a new observation x(t) we update W: W ← W + ΔW.
Using this adaptive approach we can see that Cov(W·x(t)) → I as ΔW → 0.
We'll skip the derivation details for now.
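
A minimal sketch of the online rule; the step size mu, the identity initialization, and the function name are my additions, not specified on the slide:

```python
import numpy as np

def online_decorrelate(X, mu=0.001):
    """Online estimate of a decorrelating W for observations X (dims x samples)."""
    M = X.shape[0]
    W = np.eye(M)                                  # assumed initialization
    for t in range(X.shape[1]):
        x = X[:, t:t + 1]                          # current observation, column vector
        y = W @ x
        W += mu * (np.eye(M) - y @ y.T) @ W        # dW = (I - W x x^T W^T) W; W <- W + dW
    return W                                       # Cov(W x) approaches I as dW -> 0
```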

Summary so far
For a mixture X = A·S we can algebraically recover an uncorrelated output using Ŝ = W·X:
- if W is the eigenvector matrix of Cov(X), or
- with W = Cov(X)^(-1/2), or
- with an online estimator: ΔW = (I − W·x(t)·x(t)ᵀ·Wᵀ)·W

So how well does this work?
Mixing system: x(t) = A·s(t), with A = [1 2; 1 1].
Unmixing system: ŝ(t) = W·x(t), with W = [0.6 2.8; 0.4 3.7].
Well, that was a waste of time...

What went wrong?
What does decorrelation mean? That the two things compared are not related on average.
Consider a mixture of two Gaussians:
s(t) = [N(0,2); N(0,1)],   x(t) = [1 2; 1 1]·s(t)

Decorrelation
Now let us do what we derived so far on this signal:
a) Find the eigenvectors
b) Rotate and scale so that the covariance is I
After we are done the two Gaussians are statistically independent, i.e. P(s_1, s_2) = P(s_1)·P(s_2).
We have in effect separated the original signals, save for a scaling ambiguity.

Now let's try this on the original data
x(t) = A·s(t), with A = [1 2; 1 1].
Stating the obvious: these are not very Gaussian signals!!

Doing PCA doesn't give us the right solution
a) Find the eigenvectors
b) Rotate and scale so that the covariance is I
The result is not what we want: we are off by a rotation. This idea doesn't seem to work for non-Gaussian signals.

So what's wrong?
For Gaussian data decorrelation means independence. Gaussians are fully described by up to second-order statistics (the 1st order is the mean, the 2nd the variance), so by minimizing the 2nd-order cross-statistics we achieve independence. These statistics can be expressed by the 2nd-order cumulants, which happen to be the elements of the covariance matrix:
cum(x_1, x_2) = ⟨x_1·x_2⟩
But real-world data are seldom Gaussian. Non-Gaussian data have higher-order statistics which are not taken care of by PCA. We can measure their dependence using higher-order cumulants (for zero-mean signals):
3rd order: cum(x_i, x_j, x_k) = ⟨x_i·x_j·x_k⟩
4th order: cum(x_i, x_j, x_k, x_l) = ⟨x_i·x_j·x_k·x_l⟩ − ⟨x_i·x_j⟩⟨x_k·x_l⟩ − ⟨x_i·x_k⟩⟨x_j·x_l⟩ − ⟨x_i·x_l⟩⟨x_j·x_k⟩

Cumulants for the Gaussian vs. non-Gaussian case
Gaussian case: cross-cumulants of all orders tend to zero.
  cum(x_i, x_j) ≈ 1e-13
  cum(x_i, x_j, x_k) ≈ 0.0008, 0.0004
  cum(x_i, x_j, x_k, x_l) ≈ -0.003, 0.0005, 0.0007
Non-Gaussian case: only the 2nd-order cross-cumulants tend to zero.
  cum(x_i, x_j) ≈ -4e-14
  cum(x_i, x_j, x_k) ≈ -0.08, -0.1
  cum(x_i, x_j, x_k, x_l) ≈ 0.42, -0.3, 0.16
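
A sketch of how such empirical cross-cumulants can be estimated from two zero-mean signals (the formulas are the standard zero-mean cumulant expressions; the function name and the choice of which cross terms to report are mine, not the slides'):

```python
import numpy as np

def cross_cumulants(y1, y2):
    """A few empirical cross-cumulants of two zero-mean signals y1, y2."""
    c2 = np.mean(y1 * y2)                          # cum(y1, y2)
    c3 = np.mean(y1 * y1 * y2)                     # cum(y1, y1, y2), one 3rd-order cross term
    c4 = (np.mean(y1**2 * y2**2)
          - np.mean(y1**2) * np.mean(y2**2)
          - 2 * np.mean(y1 * y2)**2)               # cum(y1, y1, y2, y2)
    return c2, c3, c4
```

Run on decorrelated Gaussian mixtures, all three values shrink towards zero; on decorrelated non-Gaussian mixtures only the 2nd-order term vanishes, which is the behavior tabulated above.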

The real problem to solve
For statistical independence we need to minimize all cross-cumulants; in practice up to 4th order is enough.
For 2nd order we minimized the off-diagonal elements of the covariance:
[cum(x_1, x_1)  cum(x_1, x_2); cum(x_2, x_1)  cum(x_2, x_2)]
For 4th order we will do the same for a tensor Q: cum(x_i, x_j, x_k, x_l) for all i, j, k, l.
The process is similar to PCA, but in more dimensions: we now find eigenmatrices instead of eigenvectors. Algorithms like JADE and FOBI solve this problem.
Can you see a potential problem though?

An alternative approach
Tensorial methods can be very computationally intensive. How about an online method instead?
Independence can also be described as non-linear decorrelation: x and y are independent if and only if
⟨f(x)·g(y)⟩ = ⟨f(x)⟩·⟨g(y)⟩
for all continuous functions f and g. This is a non-linear extension of 2nd-order independence, where f(x) = g(x) = x.
We can try solving for that then.

Online ICA
Conceptually this is very similar to online decorrelation.
For decorrelation:            ΔW = (I − W·x(t)·x(t)ᵀ·Wᵀ)·W
For non-linear decorrelation: ΔW = (I − f(W·x(t))·g(W·x(t))ᵀ)·W
This adaptation method is known as the Cichocki-Unbehauen update, but we can obtain it in many different ways.
How do we pick the non-linearities? That depends on the prior we have on the sources, e.g.
f(x_i) = tanh(x_i) for super-Gaussian sources, x_i − tanh(x_i) for sub-Gaussian sources
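
A sketch of the nonlinear version of the earlier online loop, using f(y) = tanh(y) and g(y) = y as one common pairing for super-Gaussian sources such as speech (the step size, the initialization, and this particular f/g choice are assumptions, not prescriptions from the slide):

```python
import numpy as np

def online_ica(X, mu=0.001):
    """Cichocki-Unbehauen-style update dW = (I - f(Wx) g(Wx)^T) W on data X (dims x samples)."""
    M = X.shape[0]
    W = np.eye(M)                                      # assumed initialization
    for t in range(X.shape[1]):
        y = W @ X[:, t:t + 1]
        W += mu * (np.eye(M) - np.tanh(y) @ y.T) @ W   # f = tanh, g = identity
    return W
```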

Other popular approaches
Minimum mutual information: minimize the mutual information of the outputs; creates maximally statistically independent outputs.
Infomax: maximize the entropy of the output, or the mutual information between input and output.
Non-Gaussianity: adding signals tends towards Gaussianity (central limit theorem), so finding the maximally non-Gaussian outputs undoes the mixing.
Maximum likelihood: less straightforward at first, but elegant nevertheless.
Geometric methods: trying to eyeball the proper way to rotate.
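
As one concrete (and deliberately simple) instance of the non-Gaussianity idea, not an algorithm from the slides: after whitening, look for a unit-norm projection whose output has extreme kurtosis, here via projected gradient ascent on |kurtosis|. The function name, step size, and iteration count are made up for the sketch.

```python
import numpy as np

def max_kurtosis_direction(Z, iters=200, lr=0.1, seed=0):
    """Find a unit vector w making w^T Z maximally non-Gaussian (by |kurtosis|).
    Z must be whitened data of shape (dims, samples)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        y = w @ Z                                  # current projection
        kurt = np.mean(y**4) - 3.0                 # kurtosis (y has unit variance)
        grad = 4 * (Z * y**3).mean(axis=1)         # gradient of E[y^4] w.r.t. w
        w += lr * np.sign(kurt) * grad             # climb |kurtosis| in either direction
        w /= np.linalg.norm(w)                     # project back onto the unit sphere
    return w
```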

Trying this on our dataset
x(t) = A·s(t), with A = [1 2; 1 1].

Trying this on our dataset
x(t) = A·s(t), with A = [1 2; 1 1];   ŝ(t) = W·x(t), with W = [1.39 2.5; 2.78 2.58].
We actually separated the mixture!

Trying this on our dataset
ŝ(t) = W·x(t), with W = [1.39 2.5; 2.78 2.58].

But something is amiss...
There are some things that ICA will not resolve:
- Scale: statistical independence is invariant to scale (and sign)
- Order of inputs: the order of the inputs is irrelevant when talking about independence
ICA will actually recover ŝ(t) = D·P·s(t), where D is a diagonal and P a permutation matrix.

This works really well for audio mixtures!
[Figures: Input, Mix, Output]

Problems with instantaneous mixing
Sounds don't really mix instantaneously. There are multiple effects:
- Room reflections
- Sensor response
- Propagation delays
- Propagation and reflection filtering
Most can be seen as filters, so we need a convolutive mixing model.
[Figure: estimated sources when using the instantaneous model on a convolutive mix]

Convolutive mixing
Instead of instantaneous mixing:
x_i(t) = Σ_j a_ij·s_j(t)
we now have convolutive mixing:
x_i(t) = Σ_j Σ_k a_ij(k)·s_j(t − k)
The mixing filters a_ij(k) encapsulate all the mixing effects in this model.
But how do we do ICA now? This is an ugly equation!

FIR matrix algebra
Matrices with FIR filters as elements:
A = [a_11 a_12; a_21 a_22], with a_ij = [a_ij(0) ... a_ij(k−1)]
FIR matrix multiplication performs convolution and accumulation:
A·b = [a_11 a_12; a_21 a_22]·[b_1; b_2] = [a_11*b_1 + a_12*b_2; a_21*b_1 + a_22*b_2]
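
A minimal sketch of FIR-matrix-times-signal multiplication under these conventions (the helper name and the array shapes are my choices):

```python
import numpy as np

def fir_matmul(A, b):
    """Multiply a matrix of FIR filters A (M x N x L) with N signals b (N x T):
    output i is sum_j a_ij * b_j, i.e. convolution and accumulation."""
    M, N, L = A.shape
    T = b.shape[1]
    out = np.zeros((M, T + L - 1))
    for i in range(M):
        for j in range(N):
            out[i] += np.convolve(A[i, j], b[j])   # full convolution, then accumulate
    return out
```

With A holding the mixing filters a_ij(k) and b the source signals, fir_matmul(A, s) produces exactly the convolutive mixtures x_i(t) of the previous slide.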

Back to convolutive mixing
Now we can rewrite convolutive mixing as:
x_i(t) = Σ_j Σ_k a_ij(k)·s_j(t − k)   →   x(t) = A·s(t) = [a_11*s_1(t) + a_12*s_2(t); a_21*s_1(t) + a_22*s_2(t)]
Tidier formulation! We can use the FIR matrix abstraction to solve this problem now.

An easy way to solve convolutive mixing
Straightforward translation of the instantaneous learning rules using FIR matrices:
ΔW = (I − f(W·x)·g(W·x)ᵀ)·W
Not so easy with the algebraic approaches! Multiple other (and more rigorous/better behaved) approaches have been developed.

Complications with this approach
The required convolutions are expensive:
- Real-room filters are long
- Their FIR inverses are very long
- FIR products can become very time consuming
Convergence is hard to achieve:
- Huge parameter space
- Tightly interwoven parameter relationships
- A slow optimization nightmare!

FIR matrix algebra, part II
FIR matrices have frequency-domain counterparts: taking the DFT of each filter element gives, at every frequency, an ordinary matrix of complex gains.
And their products are simpler: the convolutions become per-frequency multiplications, so a FIR matrix product reduces to an ordinary matrix product at each frequency.
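
A small numerical check of that claim (a sketch with random filters; the filter length and FFT size are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 32
A = rng.standard_normal((2, 2, L))                 # a 2x2 FIR matrix
B = rng.standard_normal((2, 2, L))                 # another 2x2 FIR matrix
nfft = 2 * L - 1                                   # long enough to hold the linear convolutions

# Frequency-domain counterparts: DFT each filter element.
Af = np.fft.rfft(A, nfft)
Bf = np.fft.rfft(B, nfft)

# Per-frequency ordinary matrix products...
Cf = np.einsum('ikf,kjf->ijf', Af, Bf)

# ...match the FIR-matrix product (convolution and accumulation) in the time domain.
C = np.zeros((2, 2, nfft))
for i in range(2):
    for j in range(2):
        for k in range(2):
            C[i, j] += np.convolve(A[i, k], B[k, j])
print(np.allclose(np.fft.rfft(C, nfft), Cf))       # True
```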

Yet another convolutive mixing formulation
We can now model the process in the frequency domain: X̂ = Â·Ŝ.
For every frequency f we have X_f(t) = A_f·S_f(t), for all f and t.
Hey, that's instantaneous mixing! We can solve that!

Overall flowgraph
Convolved mixtures (x_1 ... x_M) → M-point STFT → mixed frequency bins → instantaneous ICA unmixers (W_1 ... W_M, one per bin) → unmixed frequency bins → M-point ISTFT → recovered sources (s_1 ... s_N)
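
A skeleton of this flowgraph, offered as a sketch rather than the slides' implementation: scipy's STFT/ISTFT handle the transforms, and `ica_unmix` stands in for any complex-valued instantaneous ICA routine applied per bin. The permutation and scaling fixes discussed on the next slides are deliberately omitted here.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_freq_domain(x_mics, ica_unmix, nperseg=1024):
    """x_mics: (num_mics, num_samples) time-domain convolutive mixtures.
    ica_unmix: function returning an unmixing matrix for one (mics x frames) bin."""
    _, _, Z = stft(x_mics, nperseg=nperseg)        # Z: (mics, freqs, frames)
    S = np.zeros_like(Z)
    for f in range(Z.shape[1]):
        Xf = Z[:, f, :]                            # instantaneous mixture at frequency f
        Wf = ica_unmix(Xf)                         # one small ICA problem per bin
        S[:, f, :] = Wf @ Xf
    _, s_time = istft(S, nperseg=nperseg)
    return s_time                                  # separated up to per-bin permutation/scale
```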

Some complications
Permutation issues: we don't know which source will end up in each narrowband output, so the result can contain separated narrowband elements from both sounds! [Figure: extracted source with permutation]
Scaling issues: narrowband outputs can be scaled arbitrarily, which results in spectrally colored outputs. [Figures: original source, colored source]

Scaling issue
One simple fix is to normalize the separating matrices (e.g. scale each W_f by the inverse of its norm), which results in more reasonable scaling.
More sophisticated approaches exist, but this is not a major problem. Some spectral coloration is however unavoidable.
[Figures: original source, colored source, corrected source]

Some solutions for permutation problems
Continuity of the unmixing matrices: adjacent unmixing matrices tend to be a little similar, so we can permute/bias them accordingly. Doesn't work that great.
Smoothness of the spectral output: narrowband components from each source tend to modulate the same way, so permute the unmixing matrices to ensure adjacent narrowband outputs are similarly modulated. Works fine.
The above can fail miserably for more than two sources: combinatorial explosion!

Beamforming and ICA
If we know the placement of the sensors we can obtain the spatial response of the ICA solution. ICA places nulls to cancel out interfering sources, just as we cancelled out sources in the instantaneous case.
We can also visualize the permutation problem now. [Figure: spatial responses; out-of-place bands indicate permutation problems]

Using beamforming to resolve permutations
Spatial information can be used to resolve permutations: find the permutations that preserve the zeros or smooth out the responses.
Works fine, although it can be flaky if the array response is not that clean.

The N-input N-output problem
ICA, in either formulation, inverts a square matrix (whether scalar or FIR). This implies that we have the same number of inputs as outputs; e.g. on a street with 30 noise sources we need at least 30 mics!
Solutions exist for M inputs and N outputs where M > N. If N > M we can only beamform, although in some cases the extra sources can be treated as noise.
This can be restrictive in some situations.

Separation recap
Orthogonality is not independence!! Not all signals are Gaussian, which is a usual assumption.
We can model instantaneous mixtures with ICA and get good results. ICA algorithms can optimize a variety of objectives, but ultimately result in statistical independence between the outputs. The same model is useful for all sorts of mixing situations.
Convolutive mixtures are more challenging but solvable: there's more ambiguity, and a closer link to signal-processing approaches.

ICA for data exploration
ICA is also great for data exploration: if PCA is, then ICA should be, right?
With data of large dimensionality we want to find structure. PCA can reduce the dimensionality and clean up the data structure a bit, but ICA can find much more intuitive projections.

Example cases of PCA vs. ICA
Motivation for using ICA vs. PCA: PCA will indicate orthogonal directions of maximal variance. This is great for Gaussian data, and also great if we are into least-squares models, but the real world is not Gaussian. ICA finds directions that are more revealing.
[Figure: PCA vs. ICA directions on non-Gaussian data]

Finding useful transforms with ICA
Audio preprocessing example: take a lot of audio snippets, concatenate them into a big matrix, and do component analysis. PCA results in the DCT bases (do you see why?). ICA returns time/frequency-localized sinusoids, which is a better way to analyze sounds.
Ditto for images: ICA returns localized edge filters.

Enhancing PCA with ICA
ICA cannot perform dimensionality reduction: the goal is to find independent components, so there is no sense of order. PCA does a great job at reducing dimensionality, keeping the elements that carry most of the input's energy.
It turns out that PCA is a great preprocessor for ICA. There is no guarantee that the PCA subspace will be appropriate for the independent components, but for most practical purposes this doesn't make a big difference.
Pipeline: M×N input → PCA → K×N reduced data → ICA → K×N independent components
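
A sketch of that pipeline with numpy for the PCA step and scikit-learn's FastICA standing in for the ICA step (the choice of FastICA, the function name, and the column-wise data layout are my assumptions, not the slides'):

```python
import numpy as np
from sklearn.decomposition import FastICA

def pca_then_ica(X, K):
    """X: M x N data matrix (one observation per column). Returns K x N ICs."""
    Xc = X - X.mean(axis=1, keepdims=True)         # center the data
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    P = U[:, :K]                                   # top-K principal directions (M x K)
    Y = P.T @ Xc                                   # K x N reduced data
    ica = FastICA(n_components=K, random_state=0)  # any ICA routine would do here
    S = ica.fit_transform(Y.T).T                   # K x N independent components
    return S, P
```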

Example case: ICA-faces vs. Eigenfaces
[Figures: ICA-faces and Eigenfaces basis images]

A Video Example
The movie is a series of frames, and each frame is a data point: 126 frames of 80×60 pixels, so the data matrix X will be 4800×126.
Using PCA/ICA we factor X = W·H: W will contain the visual components and H their time weights.

PCA Results
Nothing special about the visual components: they are orthogonal pictures, but does this mean anything? (Not really.)
There is some segmentation of constant vs. moving parts, and some highlighting of the action in the weights.

ICA Results
Much more interesting visual components: they are independent, and unrelated elements (left/right hands, background) are now highlighted. We get some decomposition by parts, and the component weights now describe the scene.

A Video Example
The movie is a series of frames, and each frame is a data point: the input movie has 315 frames of 80×60 pixels, so the data X will be 4800×315.
Using PCA/ICA we factor X = W·H: W will contain the visual components and H their time weights.
[Figure: independent neighborhoods]

What about the soundtrack?
We can also analyze audio in a similar way: we do a frequency transform and get an audio spectrogram X, where X is frequencies × time. Distinct audio elements can be seen in X.
Unlike before, we have only one input this time.

PCA on Audio
Umm... it sucks! Orthogonality doesn't mean much for audio components: the results are mathematically optimal but perceptually useless.

ICA on Audio
A definite improvement: independence helps pick up somewhat more meaningful sound objects. The results are not too clean, but the intentions are clear; it misses some details.

Audio Visual Components?
We can even take in both audio and video data and try to find structure. Sometimes there is a very strong correlation between auditory and visual elements, and we should be able to discover that automatically.
[Figures: input video, input audio]

Audio/Visual Components

Which allows us to play with the output
And of course, once we have such a nice description we can resynthesize at will.
[Resynthesis demo]

Recap
- A motivating example
- The theory: decorrelation, independence vs. decorrelation, independent component analysis
- Separating sounds: solving instantaneous mixtures, solving convolutive mixtures
- Data exploration and independence: extracting audio features, extracting multimodal features