1 Machine Learning for Signal Processing (11-755/18-797): Independent Component Analysis. Class of 8 Nov 2012. Instructor: Bhiksha Raj
2 A brief review of basic probability. Uncorrelated: two random variables X and Y are uncorrelated iff the average value of their product equals the product of their individual averages: E[XY] = E[X]E[Y]. Setup: each draw produces one instance of X and one instance of Y, i.e. one instance of (X, Y).
3 Uncorrelatedness. [scatter plots omitted] Which of the above represent uncorrelated RVs?
4 A brief review of basic probability. Independence: two random variables X and Y are independent iff their joint probability equals the product of their individual probabilities: P(X,Y) = P(X)P(Y). The average value of X is then the same regardless of the value of Y: E[X|Y] = E[X].
5 A brief review of basic probability. Independence: two random variables X and Y are independent iff the average value of any function of X is the same regardless of the value of Y. More generally, E[f(X)g(Y)] = E[f(X)]E[g(Y)] for all f(), g().
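A quick numerical illustration of the gap between the two definitions above (a toy distribution chosen for this note, not from the lecture): X uniform on {-1, 0, 1} together with Y = X^2 is uncorrelated with Y, yet fails the product rule for f(x) = x^2, g(y) = y.

```python
import numpy as np

# Sketch (toy discrete distribution, not from the lecture): X uniform on
# {-1, 0, 1} and Y = X^2 are uncorrelated but not independent.
xs = np.array([-1.0, 0.0, 1.0])       # equally likely values of X
ys = xs ** 2                          # Y is a deterministic function of X
E = lambda v: v.mean()                # expectation under the uniform PMF

# Uncorrelated: E[XY] = E[X]E[Y]
print(E(xs * ys), E(xs) * E(ys))                 # 0.0 0.0

# Not independent: with f(x) = x^2 and g(y) = y the product rule fails
print(E(xs ** 2 * ys), E(xs ** 2) * E(ys))       # 0.666... vs 0.444...
```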
6 Independence. [scatter plots omitted] Which of the above represent independent RVs? Which represent uncorrelated RVs?
7 A brief review of basic probability. The expected value of an odd function of an RV is 0 if the RV is zero mean and the PDF p(x) of the RV is symmetric around 0: E[f(X)] = 0 if f(x) is odd symmetric.
8 A brief review of basic info. theory. Let X take values in {T(all), M(ed), S(hort)} and Y in {M, F}. Entropy: H(X) = sum_X P(X)[-log P(X)], the minimum average number of bits to transmit to convey a symbol X. Joint entropy: H(X,Y) = sum_{X,Y} P(X,Y)[-log P(X,Y)], the minimum average number of bits to convey sets (pairs, here) of symbols.
9 A brief review of basic info. theory. Conditional entropy: H(X|Y) = sum_Y P(Y) sum_X P(X|Y)[-log P(X|Y)] = sum_{X,Y} P(X,Y)[-log P(X|Y)], the minimum average number of bits to transmit to convey a symbol X after symbol Y has already been conveyed, averaged over all values of X and Y.
10 A brief review of basic info. theory. If X is independent of Y, then P(X|Y) = P(X), so H(X|Y) = sum_{X,Y} P(X,Y)[-log P(X)] = sum_X P(X)[-log P(X)] = H(X): the conditional entropy of X equals H(X). Likewise, P(X,Y) = P(X)P(Y) gives H(X,Y) = sum_{X,Y} P(X,Y)[-log P(X) - log P(Y)] = H(X) + H(Y): the joint entropy of X and Y is the sum of their entropies if they are independent.
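The identities above are easy to verify numerically. A minimal sketch with an assumed joint PMF (the marginals below are illustrative, not from the lecture):

```python
import numpy as np

# Sketch (toy marginals, assumed for illustration): entropies from a joint PMF
# over X in {T, M, S} and Y in {M, F}, built as an independent joint.
Px = np.array([0.2, 0.5, 0.3])              # marginal P(X)
Py = np.array([0.4, 0.6])                   # marginal P(Y)
P = np.outer(Px, Py)                        # P(X,Y) = P(X)P(Y): independent

H = lambda p: -(p[p > 0] * np.log2(p[p > 0])).sum()   # entropy in bits

H_X, H_Y, H_XY = H(Px), H(Py), H(P.ravel())
H_X_given_Y = H_XY - H_Y                    # chain rule: H(X|Y) = H(X,Y) - H(Y)

print(np.isclose(H_XY, H_X + H_Y))          # True: joint entropy adds up
print(np.isclose(H_X_given_Y, H_X))         # True: conditioning on Y costs nothing
```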
11 Onward..
12 Projection: multiple notes. M = [spectrogram], W = [notes]. P = W (W^T W)^-1 W^T. Projected spectrogram = P M.
13 We're actually computing a score: M ~ W H, with the transcription H = pinv(W) M.
14 How about the other way? M ~ W H. Given the transcription H, the notes are W = M pinv(H).
15 So what are we doing here? M ~ W H is an approximation. Given W, estimate H to minimize the error: H = argmin_H ||M - WH||_F^2 = argmin_H sum_i sum_j (M_ij - (WH)_ij)^2. Must ideally find the transcription of the given notes.
16 Going the other way: given H, estimate W to minimize the error: W = argmin_W ||M - WH||_F^2 = argmin_W sum_i sum_j (M_ij - (WH)_ij)^2. Must ideally find the notes corresponding to the transcription.
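Both one-sided estimates reduce to a pseudoinverse. A minimal numpy sketch with synthetic data (the dimensions and variable names are assumptions):

```python
import numpy as np

# Sketch (synthetic data; dimensions assumed): least-squares recovery of H
# given W, and of W given H, for the approximation M ~ W H.
rng = np.random.default_rng(0)
W_true = rng.random((20, 4))          # 4 "notes" over 20 frequency bins
H_true = rng.random((4, 50))          # their transcription over 50 frames
M = W_true @ H_true                   # noiseless composition

H_est = np.linalg.pinv(W_true) @ M    # argmin_H ||M - W H||_F^2
W_est = M @ np.linalg.pinv(H_true)    # argmin_W ||M - W H||_F^2

print(np.allclose(H_est, H_true), np.allclose(W_est, W_true))   # True True
```

With full-rank factors and noiseless data, both least-squares solutions recover the true factor exactly.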
17 When both parameters are unknown: M ~ W H = approx(M). Must estimate both W and H to best approximate M. Ideally, must learn both the notes and their transcription!
18 A least squares solution. Unconstrained: W, H = argmin_{W,H} ||M - WH||_F^2. For any (W, H) that minimizes the error, W' = WA, H' = A^-1 H also minimizes the error, for any invertible A. For our problem, let's consider the truth: when one note occurs, the other does not, so h_i h_j = 0 for all i != j. The rows of H are uncorrelated.
19 A least squares solution. Assume H H^T = I: normalize all rows of H to length 1 and make them orthogonal. Then pinv(H) = H^T, and projecting M onto H gives W = M pinv(H) = M H^T. So W, H = argmin_{W,H} ||M - WH||_F^2 becomes H = argmin_H ||M - M H^T H||_F^2. Constraint: rank(H) = 4.
20 Finding the notes. H = argmin_H ||M - M H^T H||_F^2. Note H^T H != I; only H H^T = I. The objective can be rewritten as argmin_H trace(M (I - H^T H) M^T) = argmin_H trace(M^T M (I - H^T H)) = argmin_H trace(Correlation(M) (I - H^T H)) = argmax_H trace(Correlation(M) H^T H), with Correlation(M) = M^T M.
21 Finding the notes. Constraint: every row of H has length 1. H = argmax_H trace(H Correlation(M) H^T). Differentiating and equating to 0: Correlation(M) H^T = H^T Lambda. Simply requiring the rows of H to be orthonormal gives us that H is the set of eigenvectors of the data in M.
22 Equivalences. H = argmax_H trace(H Correlation(M) H^T) is identical to W, H = argmin_{W,H} ||M - WH||_F^2 subject to sum_j h_ij^2 = 1 for every i and h_i^T h_j = 0 for i != j: minimize the least squares error with the constraint that the rows of H are length 1 and orthogonal to one another.
23 So how does that work? There are 12 notes in the segment, hence we try to estimate 12 notes..
24 So how does that work? The first three notes and their contributions. The spectrograms of the notes are statistically uncorrelated.
25 Finding the notes. Can find W instead of H: W = argmin_W ||M - W W^T M||_F^2. Solving the above with the constraint that the columns of W are orthonormal gives you the eigenvectors of the data in M: W = argmax_W trace(W^T Correlation(M) W), with Correlation(M) = M M^T here.
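The eigenvector characterization above can be sanity-checked numerically. A sketch with synthetic data (dimensions assumed): the top eigenvectors of M M^T capture at least as much correlation energy as any other orthonormal basis of the same size.

```python
import numpy as np

# Sketch (synthetic M): among orthonormal bases W, trace(W^T Corr(M) W) is
# maximized by the top eigenvectors of Corr(M) = M M^T.
rng = np.random.default_rng(1)
M = rng.standard_normal((20, 200))
corr = M @ M.T

eigvals, eigvecs = np.linalg.eigh(corr)     # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :4]                 # top-4 eigenvectors as columns

# Any other orthonormal 4-column basis captures no more correlation energy:
Q, _ = np.linalg.qr(rng.standard_normal((20, 4)))
print(np.trace(W.T @ corr @ W) >= np.trace(Q.T @ corr @ Q))   # True
```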
26 So how does that work? There are 12 notes in the segment, hence we try to estimate 12 notes..
27 Our notes are not orthogonal: overlapping frequencies; notes occur concurrently; the harmonica continues to resonate after the previous note. More generally, simple orthogonality will not give us the desired solution.
28 What else can we look for? Assume: the transcription of one note does not depend on what else is playing. Or, in a multi-instrument piece, the instruments are playing independently of one another. Not strictly true, but still..
29 Formulating it with independence: W, H = argmin_{W,H} ||M - WH||_F^2 such that the rows of H are independent. Impose statistical independence constraints on the decomposition.
30 Changing problems for a bit: m_1(t) = w_11 h_1(t) + w_12 h_2(t); m_2(t) = w_21 h_1(t) + w_22 h_2(t). Two people speak simultaneously, recorded by two microphones. Each recorded signal is a mixture of both signals.
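The mixing model above is easy to simulate. ICA is needed precisely because W is unknown, but with a known W the model can be verified directly (a toy sketch; the mixing weights are assumptions):

```python
import numpy as np

# Sketch of the two-mic mixing model, m(t) = W h(t), with an assumed 2x2
# mixing matrix W. When W is known, unmixing is simply A = W^-1.
rng = np.random.default_rng(2)
h = rng.laplace(size=(2, 1000))            # two source signals
W = np.array([[1.0, 0.6],
              [0.4, 1.0]])                 # mixing weights (illustrative)
M = W @ h                                  # each mic records a mix of both

A = np.linalg.inv(W)                       # unmixing matrix
print(np.allclose(A @ M, h))               # True: sources recovered exactly
```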
31 Imposing statistical constraints. M = W H, with W = [w_11 w_12; w_21 w_22]. M = mixed signal (rows: signal at mic 1, signal at mic 2), W = notes, H = transcription (rows: signal from speaker 1, signal from speaker 2). Given only M, estimate H. Ensure that the components of the vectors in the estimated H are statistically independent. Multiple approaches..
32 Imposing statistical constraints. Given only M, estimate H = W^-1 M = A M. Estimate A such that the components of A M are statistically independent. A is the unmixing matrix. Multiple approaches..
33 Statistical independence. M = W H, H = A M. Emulating independence: compute W (or A) and H such that H has statistical characteristics that are observed in statistically independent variables. Enforcing independence: compute W and H such that the components of H = A M are independent.
34 Emulating independence. The rows of H are uncorrelated: E[h_i h_j] = E[h_i]E[h_j], where h_i and h_j are the i-th and j-th components of any vector in H. The fourth order moments factor: E[h_i h_j h_k h_l] = E[h_i]E[h_j]E[h_k]E[h_l]; E[h_i^2 h_j h_k] = E[h_i^2]E[h_j]E[h_k]; E[h_i^2 h_j^2] = E[h_i^2]E[h_j^2]; etc.
35 Zero mean. It is usual to assume zero mean processes; otherwise some of the math doesn't work well. M = W H, H = A M. If mean(M) = 0 then mean(H) = 0: E[H] = A E[M] = A 0 = 0. First step of ICA: set the mean of M to 0. mu = (1/cols(M)) sum_i m_i; m_i <- m_i - mu, where the m_i are the columns of M.
36 Emulating independence: independence implies uncorrelatedness. Estimate a C such that X = CM is uncorrelated. With H = A M and A = B C, H = B C M. E[x_i x_j] = E[x_i]E[x_j] = 0 for i != j [since M is now centered]: X X^T = I. In reality we only want this to be a diagonal matrix, but we'll make it identity.
37 Decorrelating. H = A M, A = B C, H = B C M, X = C M, X X^T = I. Eigen decomposition: M M^T = E S E^T. Let C = S^(-1/2) E^T. Then C M M^T C^T = S^(-1/2) E^T (E S E^T) E S^(-1/2) = I.
38 Decorrelating. X is called the whitened version of M. The process of decorrelating M is called whitening, and C is the whitening matrix.
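The whitening step follows directly from the eigen decomposition. A sketch with synthetic data (the mixing matrix is an assumption):

```python
import numpy as np

# Sketch (synthetic data; mixing matrix assumed): center M, then whiten it
# with C = S^(-1/2) E^T, where M M^T = E S E^T.
rng = np.random.default_rng(3)
A_mix = np.array([[2.0, 0.5, 0.1],
                  [0.3, 1.5, 0.4],
                  [0.2, 0.1, 1.0]])
M = A_mix @ rng.standard_normal((3, 2000))     # rows are correlated

M = M - M.mean(axis=1, keepdims=True)          # first step of ICA: zero mean
S, E = np.linalg.eigh(M @ M.T)                 # eigen decomposition of M M^T
C = np.diag(S ** -0.5) @ E.T                   # whitening matrix
X = C @ M                                      # whitened version of M

print(np.allclose(X @ X.T, np.eye(3)))         # True: C M M^T C^T = I
```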
39 Uncorrelated != independent. Whitening merely ensures that the resulting signals are uncorrelated, i.e. E[x_i x_j] = 0 if i != j. It does not ensure that higher order moments are also decoupled, e.g. that E[x_i^2 x_j^2] = E[x_i^2]E[x_j^2], which is one of the signatures of independent RVs. Let's explicitly decouple the fourth order moments.
40 Decorrelating. H = A M, A = B C, H = B X, X = C M, X X^T = I. Will multiplying X by B re-correlate the components? Not if B is unitary: B B^T = B^T B = I. So we want to find a unitary matrix B, since the rows of H are uncorrelated (because they are independent).
41 ICA: freeing fourth moments. We have E[x_i x_j] = 0 if i != j: X has already been decorrelated. A = B C, H = B C M, X = C M, H = B X. The fourth moments of H have the form E[h_i h_j h_k h_l]. If the rows of H were independent, E[h_i h_j h_k h_l] = E[h_i]E[h_j]E[h_k]E[h_l]. Solution: compute B such that the fourth moments of H = B X are decoupled, while ensuring that B is unitary.
42 ICA: freeing fourth moments. Create a matrix of fourth moment terms that would be diagonal were the rows of H independent, and diagonalize it. A good candidate (good because it incorporates the energy in all rows of H): D with entries d_ij = E[(sum_k h_k^2) h_i h_j], i.e. D = E[h^T h h h^T], where the h are the columns of H. (Assuming h is real; else replace transposition with Hermitian transposition.)
43 ICA: the D matrix. d_ij = E[(sum_k h_k^2) h_i h_j] = (1/cols(H)) sum_m (sum_k h_mk^2) h_mi h_mj, where m indexes the columns of H: sum_k h_k^2 is the sum of squares of all components of a column, h_i h_j is the product of its i-th and j-th components, and the term (sum_k h_k^2) h_i h_j is averaged across all columns of H.
44 ICA: the D matrix. If the h_i terms were independent, then for i != j: E[(sum_k h_k^2) h_i h_j] = sum_{k != i,j} E[h_k^2]E[h_i]E[h_j] + E[h_i^3]E[h_j] + E[h_i]E[h_j^3]. Centered: E[h_i] = 0, so E[(sum_k h_k^2) h_i h_j] = 0 for i != j. For i = j the term E[(sum_k h_k^2) h_i^2] is not zero in general. Thus, if the h_i were independent, d_ij = 0 for i != j, i.e. D would be a diagonal matrix. Let us diagonalize D.
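The D matrix is a plain sample average. A sketch with toy data (three independent zero-mean Laplacian rows, chosen for illustration): for independent rows, the off-diagonal entries vanish up to sampling noise.

```python
import numpy as np

# Sketch: build D with d_ij = (1/cols(H)) sum_m (sum_k h_mk^2) h_mi h_mj and
# check that it is (near-)diagonal when the rows of H are independent.
def fourth_moment_matrix(H):
    sq = (H ** 2).sum(axis=0)              # sum_k h_k^2 for each column
    return (H * sq) @ H.T / H.shape[1]     # average of (sum_k h_k^2) h_i h_j

rng = np.random.default_rng(4)
H_indep = rng.laplace(size=(3, 200_000))   # independent zero-mean rows
D = fourth_moment_matrix(H_indep)

off_diag = D - np.diag(np.diag(D))
print(np.abs(off_diag).max() < 0.1 * np.diag(D).min())   # True: near-diagonal
```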
45 Diagonalizing D. Compose the fourth order moment matrix from X. Recall X = C M and H = B X = B C M; B is what we're trying to learn to make H independent. Compose D = E[x^T x x x^T]. Diagonalize D via eigen decomposition: D = U Lambda U^T. Set B = U^T. That's it!
46 B frees the fourth moment. D = U Lambda U^T; B = U^T. U is a unitary matrix: U^T U = U U^T = I. H = B X = U^T X, h = U^T x. The fourth moment matrix of H is E[h^T h h h^T] = E[x^T U U^T x U^T x x^T U] = E[x^T x U^T x x^T U] = U^T E[x^T x x x^T] U = U^T D U = U^T U Lambda U^T U = Lambda. The fourth moment matrix of H = U^T X is diagonal!
47 Overall solution: H = A M = B C M = B X, with A = B C = U^T C.
48 Independent Component Analysis. Goal: derive a matrix A such that the rows of A M are independent. Procedure:
1. Center M.
2. Compute the autocorrelation matrix R_MM of M.
3. Compute the whitening matrix C via eigen decomposition: R_MM = E S E^T, C = S^(-1/2) E^T.
4. Compute X = C M.
5. Compute the fourth moment matrix D = E[x^T x x x^T].
6. Diagonalize D via eigen decomposition: D = U Lambda U^T.
7. Compute A = U^T C.
The fourth moment matrix of H = A M is diagonal. Note that the autocorrelation matrix of H will also be diagonal.
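The whole procedure fits in a few lines. One caveat worth flagging: this single-matrix fourth-moment method (essentially Cardoso's FOBI) can only separate sources whose kurtoses differ, so the sketch below (synthetic data; mixing matrix assumed) mixes a Laplacian with a uniform source:

```python
import numpy as np

# Sketch of the 7-step procedure above (fourth-moment ICA on synthetic data).
def ica_fourth_moment(M):
    M = M - M.mean(axis=1, keepdims=True)          # 1. center M
    S, E = np.linalg.eigh(M @ M.T / M.shape[1])    # 2-3. R_MM = E S E^T
    C = np.diag(S ** -0.5) @ E.T                   #      whitening matrix
    X = C @ M                                      # 4. whitened signal
    sq = (X ** 2).sum(axis=0)
    D = (X * sq) @ X.T / X.shape[1]                # 5. D = E[x^T x x x^T]
    lam, U = np.linalg.eigh(D)                     # 6. D = U Lambda U^T
    return U.T @ C                                 # 7. A = U^T C

rng = np.random.default_rng(5)
N = 100_000
H_true = np.vstack([rng.laplace(size=N),           # super-Gaussian source
                    rng.uniform(-1, 1, size=N)])   # sub-Gaussian source
M = np.array([[1.0, 0.5], [0.3, 1.0]]) @ H_true

A = ica_fourth_moment(M)
H_est = A @ M

# Up to permutation, sign and scale, each row of H_est matches a true source:
corr = np.corrcoef(np.vstack([H_est, H_true]))[:2, 2:]
print(np.abs(corr).max(axis=1))                    # both values close to 1
```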
49 ICA by diagonalizing moment matrices. The procedure just outlined, while fully functional, has shortcomings: only a subset of the fourth order moments is considered. There are many other ways of constructing fourth order moment matrices that would ideally be diagonal, and diagonalizing the particular matrix we have chosen is not guaranteed to diagonalize every other fourth order moment matrix. JADE (Joint Approximate Diagonalization of Eigenmatrices, J.-F. Cardoso) jointly diagonalizes several fourth order moment matrices. It is more effective than the procedure shown, but more computationally expensive.
50 Enforcing independence. Specifically ensure that the components of H = A M are independent. Contrast function: a non-linear function that has a minimum value when the output components are independent. Define and minimize a contrast function F(A M). Contrast functions are often only approximations too..
51 A note on pre-whitening. The mixed signal is usually pre-whitened: normalize the variance along all directions and eliminate second order dependence. X = C M, with E[x_i x_j] = delta_ij for centered signals. Eigen decomposition: M M^T = E S E^T, C = S^(-1/2) E^T. Can use only the first K columns of E if only K independent sources are expected, e.g. in a microphone array setup with K < M sources.
52 The contrast function. A contrast function is a non-linear function that has a minimum value when the output components are independent. An explicit contrast function is the mutual information I(h) = sum_i H(h_i) - H(h), with the constraint H = B X, where X is whitened M.
53 Linear functions. h = B x, where h and x are individual columns of the H and X matrices: x is the mixed signal, B is the unmixing matrix. P_h(h) = P_x(B^-1 h) |B|^-1. With H(x) = -integral P(x) log P(x) dx, this gives H(h) = H(x) + log |B|.
54 The contrast function. I(h) = sum_i H(h_i) - H(h) = sum_i H(h_i) - H(x) - log |B|. Ignoring H(x) (a constant), J(h) = sum_i H(h_i) - log |B|. Minimize the above to obtain B.
55 An alternate approach. Recall PCA: M = W H with orthonormal columns of W leads to min_W ||M - W W^T M||^2, an error minimization framework to estimate W. Can we arrive at an error minimization framework for ICA? Define an error objective that represents independence.
56 An alternate approach. Definition of independence: if x and y are independent, then E[f(x)g(y)] = E[f(x)]E[g(y)]. This must hold for every f() and g()!!
57 An alternate approach. Define g(H) = g(B X), applied component-wise: the matrix with entries g(h_11), g(h_12), ..., g(h_21), g(h_22), ... Similarly define f(H) = f(B X), with entries f(h_11), f(h_12), ...
58 An alternate approach. P = g(H) f(H)^T = g(B X) f(B X)^T is a square matrix with entries P_ij = sum_k g(h_ik) f(h_jk). It must ideally equal Q, with Q_ij = (sum_k g(h_ik))(sum_l f(h_jl)) for i != j and Q_ii = sum_k g(h_ik) f(h_ik). Error = ||P - Q||_F^2.
59 An alternate approach. Ideal value for Q: Q_ij = (sum_k g(h_ik))(sum_l f(h_jl)) for i != j, Q_ii = sum_k g(h_ik) f(h_ik). If g() and f() are odd symmetric functions, then sum_j g(h_ij) = 0 for all i, since sum_j h_ij = 0 (H is centered). Q is then a diagonal matrix!
60 An alternate approach. Minimize the error ||P - Q||_F^2 with P = g(B X) f(B X)^T and Q diagonal. This leads to a simple Widrow-Hoff type iterative rule: B <- B + eta E B, with E = Diag(g(H) f(H)^T) - g(H) f(H)^T.
61 Update rules. Multiple solutions exist, under different assumptions for g() and f(). H = B X, B <- B + Delta B. Jutten-Herault (online update): Delta B_ij = f(h_i) g(h_j); actually assumed a recursive neural network. Bell-Sejnowski: Delta B = ([B^T]^-1 - g(H) X^T).
62 Update rules. Natural gradient (f() = identity): Delta B = (I - g(H) H^T) B. Cichocki-Unbehauen: Delta B = (I - g(H) f(H)^T) B.
63 What are g() and f()? They must be odd symmetric functions. Multiple functions have been proposed: g(x) = x + tanh(x) if x is super-Gaussian, g(x) = x - tanh(x) if x is sub-Gaussian. Audio signals are in general super-Gaussian. With K = +/-1 accordingly: Delta B = (I - H H^T - K tanh(H) H^T) B, or simply Delta B = (I - K tanh(H) H^T) B.
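The natural gradient rule with the super-Gaussian nonlinearity (K = 1) can be sketched as a batch iteration. This is a toy demo: the step size, iteration count and mixing matrix are assumptions, and the data is pre-whitened first, as recommended above.

```python
import numpy as np

# Sketch (toy batch demo; step size, iterations and mixing assumed):
# natural gradient ICA with the super-Gaussian rule
#   Delta B = (I - H H^T - tanh(H) H^T) B   (K = 1).
rng = np.random.default_rng(6)
N = 20_000
H_true = rng.laplace(size=(2, N))                # super-Gaussian sources
M = np.array([[1.0, 0.7], [0.5, 1.0]]) @ H_true

M = M - M.mean(axis=1, keepdims=True)            # center, then pre-whiten
S, E = np.linalg.eigh(M @ M.T / N)
X = np.diag(S ** -0.5) @ E.T @ M

B = np.eye(2)
for _ in range(800):
    H = B @ X
    dB = (np.eye(2) - H @ H.T / N - np.tanh(H) @ H.T / N) @ B
    B += 0.05 * dB                               # small assumed step size

H_est = B @ X
corr = np.corrcoef(np.vstack([H_est, H_true]))[:2, 2:]
print(np.abs(corr).max(axis=1))                  # both values close to 1
```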
64 So how does it work? Example with an instantaneous mixture of two speakers, natural gradient update. Works very well!
65 Another example! Input, mix, output.
66 Another example: three instruments..
67 The notes: three instruments..
68 ICA for data exploration. The bases in PCA represent the building blocks, ideally notes; very successfully used. So can ICA be used to do the same?
69 ICA vs PCA bases. Motivation for using ICA vs PCA: PCA will indicate orthogonal directions of maximal variance, which may not align with the data (non-Gaussian data). ICA finds directions that are independent, and more likely to align with the data.
70 Finding useful transforms with ICA. Audio preprocessing example: take a lot of audio snippets, concatenate them in a big matrix, and do component analysis. PCA results in the DCT bases; ICA returns time/frequency localized sinusoids, which is a better way to analyze sounds. Ditto for images: ICA returns localized edge filters.
71 Example case: ICA-faces vs. Eigenfaces. [ICA-faces and Eigenfaces images omitted]
72 ICA for signal enhancement. Very commonly used to enhance EEG signals, which are frequently corrupted by heartbeat and biorhythm signals; ICA can be used to separate them out.
73 So how does that work? There are 12 notes in the segment, hence we try to estimate 12 notes..
74 PCA solution. There are 12 notes in the segment, hence we try to estimate 12 notes..
75 So how does this work: ICA solution. Better.. but not much. What are the issues here?
76 ICA issues. No sense of order (unlike PCA): we get K independent directions, but there is no notion of the "best" direction, so the sources can come in any order: permutation invariance. No sense of scaling: scaling a signal does not affect independence. The outputs are scaled versions of the desired signals in permuted order, in the best case. In the worst case, the outputs are not the desired signals at all..
77 What else went wrong? We assumed the distribution of the signals is symmetric around the mean. Note energy here is not symmetric: negative values never happen. Still, this didn't affect the three-instrument case.. Also, the notes are not independent: only one note plays at a time, so if one note plays, the other notes are not playing.
78 Continue in next class.. NMF, factor analysis..
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2
MA 575 Linear Models: Cedric E Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2 1 Revision: Probability Theory 11 Random Variables A real-valued random variable is
More informationBasic Principles of Video Coding
Basic Principles of Video Coding Introduction Categories of Video Coding Schemes Information Theory Overview of Video Coding Techniques Predictive coding Transform coding Quantization Entropy coding Motion
More information4 Derivations of the Discrete-Time Kalman Filter
Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof N Shimkin 4 Derivations of the Discrete-Time
More informationLinear Algebra Section 2.6 : LU Decomposition Section 2.7 : Permutations and transposes Wednesday, February 13th Math 301 Week #4
Linear Algebra Section. : LU Decomposition Section. : Permutations and transposes Wednesday, February 1th Math 01 Week # 1 The LU Decomposition We learned last time that we can factor a invertible matrix
More informationIndependent Component Analysis of Incomplete Data
Independent Component Analysis of Incomplete Data Max Welling Markus Weber California Institute of Technology 136-93 Pasadena, CA 91125 fwelling,rmwg@vision.caltech.edu Keywords: EM, Missing Data, ICA
More informationOn Information Maximization and Blind Signal Deconvolution
On Information Maximization and Blind Signal Deconvolution A Röbel Technical University of Berlin, Institute of Communication Sciences email: roebel@kgwtu-berlinde Abstract: In the following paper we investigate
More information[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.]
Math 43 Review Notes [Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty Dot Product If v (v, v, v 3 and w (w, w, w 3, then the
More informationGetting Started with Communications Engineering. Rows first, columns second. Remember that. R then C. 1
1 Rows first, columns second. Remember that. R then C. 1 A matrix is a set of real or complex numbers arranged in a rectangular array. They can be any size and shape (provided they are rectangular). A
More informationPrincipal Component Analysis CS498
Principal Component Analysis CS498 Today s lecture Adaptive Feature Extraction Principal Component Analysis How, why, when, which A dual goal Find a good representation The features part Reduce redundancy
More information1 Singular Value Decomposition and Principal Component
Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)
More informationPCA, Kernel PCA, ICA
PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per
More informationLeast square joint diagonalization of matrices under an intrinsic scale constraint
Least square joint diagonalization of matrices under an intrinsic scale constraint Dinh-Tuan Pham 1 and Marco Congedo 2 1 Laboratory Jean Kuntzmann, cnrs - Grenoble INP - UJF (Grenoble, France) Dinh-Tuan.Pham@imag.fr
More informationCS 246 Review of Linear Algebra 01/17/19
1 Linear algebra In this section we will discuss vectors and matrices. We denote the (i, j)th entry of a matrix A as A ij, and the ith entry of a vector as v i. 1.1 Vectors and vector operations A vector
More informationSemi-Blind approaches to source separation: introduction to the special session
Semi-Blind approaches to source separation: introduction to the special session Massoud BABAIE-ZADEH 1 Christian JUTTEN 2 1- Sharif University of Technology, Tehran, IRAN 2- Laboratory of Images and Signals
More informationSingle Channel Signal Separation Using MAP-based Subspace Decomposition
Single Channel Signal Separation Using MAP-based Subspace Decomposition Gil-Jin Jang, Te-Won Lee, and Yung-Hwan Oh 1 Spoken Language Laboratory, Department of Computer Science, KAIST 373-1 Gusong-dong,
More informationLINEAR SYSTEMS (11) Intensive Computation
LINEAR SYSTEMS () Intensive Computation 27-8 prof. Annalisa Massini Viviana Arrigoni EXACT METHODS:. GAUSSIAN ELIMINATION. 2. CHOLESKY DECOMPOSITION. ITERATIVE METHODS:. JACOBI. 2. GAUSS-SEIDEL 2 CHOLESKY
More informationBlind Signal Separation: Statistical Principles
Blind Signal Separation: Statistical Principles JEAN-FRANÇOIS CARDOSO, MEMBER, IEEE Invited Paper Blind signal separation (BSS) and independent component analysis (ICA) are emerging techniques of array
More informationFile: ica tutorial2.tex. James V Stone and John Porrill, Psychology Department, Sheeld University, Tel: Fax:
File: ica tutorial2.tex Independent Component Analysis and Projection Pursuit: A Tutorial Introduction James V Stone and John Porrill, Psychology Department, Sheeld University, Sheeld, S 2UR, England.
More informationFinal Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2
Final Review Sheet The final will cover Sections Chapters 1,2,3 and 4, as well as sections 5.1-5.4, 6.1-6.2 and 7.1-7.3 from chapters 5,6 and 7. This is essentially all material covered this term. Watch
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality
More informationCS168: The Modern Algorithmic Toolbox Lecture #8: PCA and the Power Iteration Method
CS168: The Modern Algorithmic Toolbox Lecture #8: PCA and the Power Iteration Method Tim Roughgarden & Gregory Valiant April 15, 015 This lecture began with an extended recap of Lecture 7. Recall that
More informationFundamentals of Linear Algebra
11-755/18-797 achine Learning for Signal Processing Fundamentals of Linear Algebra Class 2-3. 6 Sep 211 Administrivia A imes: Anoop Ramakrishna: hursday 12.3-1.3pm anuel ragut: Friday 11am 12pm. HW1: On
More informationFoundations of Matrix Analysis
1 Foundations of Matrix Analysis In this chapter we recall the basic elements of linear algebra which will be employed in the remainder of the text For most of the proofs as well as for the details, the
More informationHOMEWORK PROBLEMS FROM STRANG S LINEAR ALGEBRA AND ITS APPLICATIONS (4TH EDITION)
HOMEWORK PROBLEMS FROM STRANG S LINEAR ALGEBRA AND ITS APPLICATIONS (4TH EDITION) PROFESSOR STEVEN MILLER: BROWN UNIVERSITY: SPRING 2007 1. CHAPTER 1: MATRICES AND GAUSSIAN ELIMINATION Page 9, # 3: Describe
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationReview of Probability Theory
Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving
More informationLecture Notes 1: Vector spaces
Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector
More informationFantope Regularization in Metric Learning
Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction
More informationMathematical Optimisation, Chpt 2: Linear Equations and inequalities
Mathematical Optimisation, Chpt 2: Linear Equations and inequalities Peter J.C. Dickinson p.j.c.dickinson@utwente.nl http://dickinson.website version: 12/02/18 Monday 5th February 2018 Peter J.C. Dickinson
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationStatistical Pattern Recognition
Statistical Pattern Recognition A Brief Mathematical Review Hamid R. Rabiee Jafar Muhammadi, Ali Jalali, Alireza Ghasemi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Probability theory
More informationA Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag
A Tutorial on Data Reduction Principal Component Analysis Theoretical Discussion By Shireen Elhabian and Aly Farag University of Louisville, CVIP Lab November 2008 PCA PCA is A backbone of modern data
More informationB553 Lecture 5: Matrix Algebra Review
B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations
More informationQueens College, CUNY, Department of Computer Science Numerical Methods CSCI 361 / 761 Spring 2018 Instructor: Dr. Sateesh Mane.
Queens College, CUNY, Department of Computer Science Numerical Methods CSCI 361 / 761 Spring 2018 Instructor: Dr. Sateesh Mane c Sateesh R. Mane 2018 8 Lecture 8 8.1 Matrices July 22, 2018 We shall study
More informationOR MSc Maths Revision Course
OR MSc Maths Revision Course Tom Byrne School of Mathematics University of Edinburgh t.m.byrne@sms.ed.ac.uk 15 September 2017 General Information Today JCMB Lecture Theatre A, 09:30-12:30 Mathematics revision
More informationA matrix over a field F is a rectangular array of elements from F. The symbol
Chapter MATRICES Matrix arithmetic A matrix over a field F is a rectangular array of elements from F The symbol M m n (F ) denotes the collection of all m n matrices over F Matrices will usually be denoted
More informationLecture 5 : Projections
Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationIntroduction to Machine Learning
10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what
More informationa b = a T b = a i b i (1) i=1 (Geometric definition) The dot product of two Euclidean vectors a and b is defined by a b = a b cos(θ a,b ) (2)
This is my preperation notes for teaching in sections during the winter 2018 quarter for course CSE 446. Useful for myself to review the concepts as well. More Linear Algebra Definition 1.1 (Dot Product).
More informationLecture: Adaptive Filtering
ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate
More informationSingular Value Decomposition
Singular Value Decomposition (Com S 477/577 Notes Yan-Bin Jia Sep, 7 Introduction Now comes a highlight of linear algebra. Any real m n matrix can be factored as A = UΣV T where U is an m m orthogonal
More informationcomponent risk analysis
273: Urban Systems Modeling Lec. 3 component risk analysis instructor: Matteo Pozzi 273: Urban Systems Modeling Lec. 3 component reliability outline risk analysis for components uncertain demand and uncertain
More information