LECTURE: ICA
Rita Osadchy
Based on lecture notes by A. Ng
Cocktail Party
Person j produces a speech signal s_j(t); microphone i records a mixture x_i(t). The microphone signals are mixed speech signals:
  x_1(t) = a_11 s_1(t) + a_12 s_2(t) + a_13 s_3(t)
  x_2(t) = a_21 s_1(t) + a_22 s_2(t) + a_23 s_3(t)
  x_3(t) = a_31 s_1(t) + a_32 s_2(t) + a_33 s_3(t)
Input: the microphone signals x_1(t), x_2(t), x_3(t)
Goal: recover the speech signals s_1(t), s_2(t), s_3(t)
http://research.ics.tkk.fi/ica/cocktail/cocktail_en.cgi
ICA vs. PCA
Similar to PCA: finds a new basis in which to represent the data.
Different from PCA: PCA removes only correlations, while ICA removes correlations and higher-order dependence. In PCA some components are more important than others (based on eigenvalues); in ICA the components are equally important.
ICA vs. PCA
PCA: principal components are orthogonal. ICA: independent components are not!
ICA vs. PCA
[Figure: maximal variance directions (PCA) vs. independent components (ICA)]
Model
Assume data x ∈ R^n, generated by n independent sources. We assume x = As, where the mixing matrix A ∈ R^{n×n} is unknown.
Model
Assume data x^(i), generated by n independent sources:
  s^(i)_j — signal from source j at time i
  x^(i)_j — signal recorded by microphone j at time i
  x^(i)_j = Σ_{k=1}^n A_jk s^(i)_k   (sum over the sources)
Problem Definition
We observe the data x^(i) = A s^(i), i = 1, ..., m.
Goal: recover the sources s^(i) that generated the data.
Let W = A^{-1} (the unmixing matrix). The goal is to find W such that s^(i) = W x^(i).
Denote the rows of W by w_1^T, ..., w_n^T; then the j-th source can be recovered by s^(i)_j = w_j^T x^(i).
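The mixing/unmixing setup above can be sketched numerically. This is a toy example with an assumed 2×2 mixing matrix; in real ICA the matrix A is unknown, so W must be estimated rather than computed as A^{-1}:

```python
import numpy as np

# Toy illustration of x^(i) = A s^(i) and s^(i) = W x^(i) with W = A^{-1}.
# The sources, sample count, and mixing matrix are illustrative choices.
rng = np.random.default_rng(0)
m = 1000                                   # number of time samples
s = rng.uniform(-1.0, 1.0, size=(m, 2))    # rows s^(i): independent sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                 # mixing matrix (unknown in practice)
x = s @ A.T                                # observations x^(i) = A s^(i)

W = np.linalg.inv(A)                       # in real ICA, W must be learned
s_rec = x @ W.T                            # recovered sources s^(i) = W x^(i)
print(np.allclose(s_rec, s))               # → True
```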
ICA Intuition
[Figure: data generated from sources s_j ~ Uniform[-1, 1] and their mixtures]
ICA Ambiguities
If we have no prior knowledge about the mixing matrix, then there are inherent ambiguities in A that are impossible to resolve. The sources can only be recovered up to:
  Permutation
  Scaling
  Sign
Permutation Ambiguity
Let P be an n×n permutation matrix. Examples:
  P = [0 1 0; 1 0 0; 0 0 1];   P = [0 1; 1 0]
If W = [w_11 w_12; w_21 w_22], then PW = [w_21 w_22; w_11 w_12].
Given only the x^(i)'s, we cannot distinguish between W and PW: the permutation of the original sources is ambiguous. Not important in most applications.
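A quick numerical sketch of the permutation ambiguity (the 2×2 matrices are illustrative): swapping the sources with P while replacing A by A P^{-1} produces exactly the same observations, so the data cannot reveal the original ordering.

```python
import numpy as np

# Permutation ambiguity: x = A s = (A P^{-1})(P s), so the pairs
# (A, s) and (A P^{-1}, P s) are indistinguishable from x alone.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])                 # swaps the two sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                 # illustrative mixing matrix
rng = np.random.default_rng(3)
s = rng.uniform(-1.0, 1.0, size=(400, 2))

x = s @ A.T
x_perm = (s @ P.T) @ (A @ P.T).T           # P^{-1} = P^T for permutations
print(np.allclose(x, x_perm))              # → True
```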
Scaling Ambiguity
x^(i) = A s^(i) = (2A)(0.5 s^(i)): doubling A while halving the sources leaves the data unchanged. More generally, scaling a column a_j of A by some α while scaling source s_j by 1/α gives the same x^(i). We cannot recover the correct scaling of the sources.
Not important in most applications: scaling a speaker's speech signal s_j by some positive factor affects only the volume of that speaker's speech. Sign changes do not matter either: s_j and -s_j sound identical when played on a speaker.
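The scale and sign ambiguities can be checked numerically. This sketch uses an assumed 2×2 mixing matrix; both rescaled and sign-flipped source/matrix pairs produce exactly the same observations x:

```python
import numpy as np

# Scaling/sign ambiguity: (2A, 0.5 s) and (-A, -s) generate the same x,
# so the scale and sign of each source cannot be identified from x alone.
rng = np.random.default_rng(1)
s = rng.uniform(-1.0, 1.0, size=(500, 2))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                 # illustrative mixing matrix

x  = s @ A.T
x2 = (0.5 * s) @ (2.0 * A).T               # rescaled sources and matrix
x3 = (-s) @ (-A).T                         # sign-flipped sources and matrix
print(np.allclose(x, x2), np.allclose(x, x3))  # → True True
```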
Gaussian Sources are Problematic
Let n = 2 and s ~ N(0, I); then x = As ~ N(0, AA^T):
  E[xx^T] = E[Ass^T A^T] = AA^T.
Let R be an arbitrary orthogonal matrix, so that RR^T = R^T R = I, and let A' = AR. Then x' = A's ~ N(0, AA^T):
  E[x'x'^T] = E[A'ss^T A'^T] = E[AR ss^T (AR)^T] = A R R^T A^T = AA^T.
Gaussian Sources are Problematic
Whether the mixing matrix is A or A', we would observe data from a N(0, AA^T) distribution. Thus, there is no way to tell whether the sources were mixed using A or A'. There is an arbitrary rotational component in the mixing matrix that cannot be determined from the data, and we cannot recover the original sources.
Reason: the Gaussian distribution is spherically symmetric.
For non-Gaussian data, it is possible, given enough data, to recover the n independent sources.
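The rotational ambiguity above can be verified directly: for any orthogonal R, the covariances AA^T and (AR)(AR)^T coincide. The particular matrix A and rotation angle below are illustrative assumptions.

```python
import numpy as np

# For s ~ N(0, I), x = As and x' = (AR)s are both N(0, AA^T), since
# (AR)(AR)^T = A R R^T A^T = A A^T for any orthogonal R. So the data
# distribution cannot distinguish A from AR.
A = np.array([[2.0, 1.0],
              [0.0, 1.0]])                       # illustrative mixing matrix
theta = np.pi / 4                                # illustrative rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # orthogonal: R R^T = I

cov_A  = A @ A.T                 # Cov[x]  for x  = A s
cov_AR = (A @ R) @ (A @ R).T     # Cov[x'] for x' = (AR) s
print(np.allclose(cov_A, cov_AR))  # → True
```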
Densities and Linear Transformations
Suppose s is a random variable drawn according to p_s(s). Let x be the random variable defined by x = As. The density of x is given by
  p_x(x) = p_s(Wx) · |W|,  where W = A^{-1}  (A is a square invertible matrix).
Example: s ~ Uniform[0,1], i.e. p_s(s) = 1{0 ≤ s ≤ 1}. Let A = 2; then x = 2s. Clearly x ~ Uniform[0,2], and thus p_x(x) = 0.5 · 1{0 ≤ x ≤ 2}.
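The worked example can be checked with a quick Monte Carlo estimate (sample size and bin count are arbitrary choices): the empirical density of x = 2s should be the constant 0.5 on [0, 2], matching p_x(x) = p_s(Wx)|W| with W = 1/2.

```python
import numpy as np

# Monte Carlo check of p_x(x) = p_s(Wx)|W| for s ~ Uniform[0,1], A = 2:
# the density of x = 2s should be 0.5 everywhere on [0, 2].
rng = np.random.default_rng(2)
s = rng.uniform(0.0, 1.0, size=200_000)
x = 2.0 * s                                    # x = As with A = 2

hist, _ = np.histogram(x, bins=20, range=(0.0, 2.0), density=True)
print(np.allclose(hist, 0.5, atol=0.05))       # → True
```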
ICA Algorithm
Assume that the distribution of each source s_j is p_s(s_j). Since the sources are independent, the joint distribution is
  p(s) = Π_{j=1}^n p_s(s_j).
Using the previous formulation with s = Wx, W = A^{-1}:
  p(x) = p_s(Wx) · |W| = ( Π_{j=1}^n p_s(w_j^T x) ) · |W|.
We must specify a density p_s for the individual sources.
ICA Algorithm
The cumulative distribution function of a real random variable z is defined by
  F(z_0) = P(z ≤ z_0) = ∫_{-∞}^{z_0} p_z(z) dz,
and the density of z can be found by p_z(z) = F'(z).
To specify a density for the s_j, specify its cdf. If you have prior knowledge that the sources' densities take a certain form, use it here; otherwise, make an assumption about the cdf.
Density of s_i
The cdf has to be a monotonic function that increases from zero to one. Options include the Gaussian CDF and the sigmoid; we choose the sigmoid
  g(s) = 1/(1 + e^{-s}),  so  p(s) = g'(s).
We assume that the data has zero mean. This is necessary because our assumption p(s) = g'(s) implies E[s] = 0, and thus E[x] = E[As] = 0.
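A small numerical sanity check of the assumed source density p(s) = g'(s) (grid limits and step size are arbitrary): it integrates to 1, since g increases from 0 to 1, and it is symmetric, which is what forces E[s] = 0.

```python
import numpy as np

# Check that p(s) = g'(s) = g(s)(1 - g(s)) is a valid zero-mean density.
def g(s):
    return 1.0 / (1.0 + np.exp(-s))

ds = 0.001
ss = np.arange(-30.0, 30.0, ds)      # the tails beyond |s|=30 are negligible
p = g(ss) * (1.0 - g(ss))            # g'(s) via the identity g' = g(1 - g)

total = np.sum(p) * ds               # ≈ 1, since g goes from 0 to 1
mean = np.sum(ss * p) * ds           # ≈ 0, since p is symmetric
print(round(total, 3))               # → 1.0
```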
ICA Algorithm
W is the parameter of our model that we want to estimate. Given a training set {x^(i); i = 1, ..., m}, the log-likelihood is:
  ℓ(W) = Σ_{i=1}^m ( Σ_{j=1}^n log g'(w_j^T x^(i)) + log|W| ).
Maximize ℓ(W) using gradient ascent: W := W + α ∇_W ℓ(W), where α is the learning rate. Equivalently, w_j := w_j + α ∇_{w_j} ℓ(W).
ICA Algorithm
By taking the derivatives of ℓ(W), using g(s) = 1/(1 + e^{-s}) and g'(s) = g(s)(1 - g(s)), we obtain the (stochastic) update rule for a single example x^(i):
  W := W + α ( [ 1 - 2g(w_1^T x^(i)); ... ; 1 - 2g(w_n^T x^(i)) ] x^(i)T + (W^T)^{-1} )
When the algorithm converges, compute s^(i) = W x^(i).
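The update rule above can be sketched end to end. This is a minimal implementation, not the definitive one: the annealed learning-rate schedule, sample count, 2×2 mixing matrix, and Laplace (super-Gaussian) test sources are all illustrative assumptions.

```python
import numpy as np

# Stochastic gradient ascent on the ICA log-likelihood with sigmoid cdf.
def g(u):
    return 1.0 / (1.0 + np.exp(-u))

def ica(x, alphas=(0.05, 0.02, 0.01, 0.005, 0.002), seed=0):
    """x: (m, n) zero-mean observations; returns the unmixing matrix W."""
    rng = np.random.default_rng(seed)
    m, n = x.shape
    W = np.eye(n)
    for alpha in alphas:                       # one shuffled pass per rate
        for i in rng.permutation(m):
            xi = x[i][:, None]                 # column vector x^(i)
            W += alpha * ((1.0 - 2.0 * g(W @ xi)) @ xi.T
                          + np.linalg.inv(W.T))
    return W

# Demo on synthetic super-Gaussian (Laplace) sources.
rng = np.random.default_rng(42)
s = rng.laplace(0.0, 1.0, size=(5000, 2))      # independent, zero-mean
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                     # "unknown" mixing matrix
x = s @ A.T
x -= x.mean(axis=0)                            # the model assumes E[x] = 0

W = ica(x)
s_rec = x @ W.T    # recovered up to permutation, scale, and sign

# Each recovered component should match one true source almost perfectly.
corr = np.corrcoef(s_rec.T, s.T)[:2, 2:]
print("separated:", np.all(np.max(np.abs(corr), axis=1) > 0.95))
```

The annealing schedule matters in practice: a fixed small step converges slowly, while a fixed large step can make W ill-conditioned through the (W^T)^{-1} term.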
Remarks
We assumed that the x^(i), i = 1, ..., m, are independent of each other. This assumption is incorrect for time series where the x^(i)'s are dependent (e.g., speech data). However, it can be shown that having correlated training examples will not hurt the performance of the algorithm if we have sufficient data.
Tip: run stochastic gradient ascent on a randomly shuffled copy of the training set.
Application domains of ICA
  Blind source separation
  Image denoising
  Medical signal processing: fMRI, ECG, EEG
  Modelling of the hippocampus and visual cortex
  Feature extraction, face recognition
  Compression, redundancy reduction
  Watermarking
  Clustering
  Time series analysis (stock market, microarray data)
  Topic extraction
  Econometrics: finding hidden factors in financial data
Slide due B. Poczos
[Fig. from Jung]
Image denoising
[Figures: original image, noisy image, Wiener filtering, ICA filtering]