Data Mining, SECTION 5A: Independent Component Analysis (ICA) (Mills, 2017)

Independent Component Analysis (Hyvärinen and Oja (2000), "Independent Component Analysis: Algorithms and Applications") is a variation of Principal Component Analysis (PCA) and a strong competitor to Factor Analysis. ICA is an attempt to decompose complex data into independent subparts (also known as the blind source separation problem or the cocktail party problem). It attempts to determine the source signals S given only the observed mixtures X. (It is necessary to assume independence of the source signals, i.e. the value of one signal gives no information about the other signals.)

Using the singular value decomposition X = U D V^T and writing S = sqrt(N) U and A^T = D V^T / sqrt(N), we can write X = S A^T; thus each column of X is a linear combination of the columns of S. Since U is orthogonal, and assuming that the columns of X each have mean zero, the columns of S have zero mean, are uncorrelated, and have unit variance.

We have

    X_i = sum_{j=1}^p a_ij S_j,   i = 1, ..., p

or (writing X and S as column vectors)

    X = A S = (A R^T)(R S) = A* S*   for any orthogonal p x p matrix R.

ICA assumes the S_i are statistically independent (thus determining all the cross moments) rather than merely uncorrelated (which determines only the second-order cross moments). Independence implies uncorrelatedness, so ICA constrains the estimation procedure to give uncorrelated estimates of the independent components (this reduces the number of free parameters and thus simplifies the problem). The extra moment conditions identify A uniquely.

NOTE: In Factor Analysis with q < p we have

    X_i = sum_{j=1}^q a_ij S_j + e_i,   i = 1, ..., p

or X = A S + e, where the S_j are the common factors and e represents the unique factors. ICA can be viewed as another Factor Analysis rotation method (just like varimax or quartimax); it starts essentially from a Factor Analysis solution and looks for rotations that lead to independent components.
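The SVD factorization above can be checked numerically. A minimal sketch (the data matrix and dimensions are stand-ins invented for illustration) confirming that S = sqrt(N) U has uncorrelated, unit-variance columns and that X = S A^T:

```r
# Hedged sketch of the factorization above: S <- sqrt(N)*U,
# A^T <- D V^T / sqrt(N), so that X = S A^T and cov(S) = I.
set.seed(1)
N <- 500
X <- matrix(rnorm(N * 2), N, 2) %*% matrix(c(1, 0.8, 0, 0.6), 2, 2)
X <- scale(X, center = TRUE, scale = FALSE)    # columns of X have mean zero
sv  <- svd(X)
S   <- sqrt(N) * sv$u                          # whitened variables
A.t <- diag(sv$d) %*% t(sv$v) / sqrt(N)        # A^T
max(abs(X - S %*% A.t))                        # ~0: the factorization is exact
round(crossprod(S) / N, 10)                    # identity: uncorrelated, unit variance
```

Since U has orthonormal columns, S^T S / N = I holds exactly (up to rounding), regardless of the data.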
In Factor Analysis, the S_j and e_i are generally assumed to be Gaussian; orthogonal transformations A S of Gaussians are still Gaussian, hence we can estimate the model only up to an orthogonal transformation. Thus A is not identifiable for independent Gaussian components. (If just one component is Gaussian, the ICA model can still be estimated.) We actually do not want Gaussian source variables (we allow at most one Gaussian source variable) because if the S_i are Gaussian and the mixing matrix A is orthogonal, the X_i will also be Gaussian and uncorrelated with unit variance, so the joint density will be completely symmetric and we will have no information on the directions of the columns of the mixing matrix A; hence A cannot be estimated. We avoid this identifiability problem by assuming the S_i are independent and non-Gaussian, so that (because A is orthogonal)

    S = A^{-1} X = A^T X.

We assume X has been whitened (i.e. sphered) via the SVD to have Cov(X) = I; then A is orthogonal, and solving the ICA problem means finding an orthogonal A such that the components of S = A^T X are independent and non-Gaussian.

Writing Y = W^T X and setting Z = A^T W, we obtain

    Y = W^T X = (W^T A) S = Z^T S,

which can be more Gaussian than any of the S_i, and is least Gaussian when it equals one of the S_i (i.e. when only one of the elements of Z is nonzero). We want to find W so as to maximize the non-Gaussianity of Y; this corresponds (in the transformed coordinate system) to a Z which has only one nonzero component. Thus Y = W^T X = Z^T S is one of the independent components.

Thus finding the A that minimizes the mutual information I(A^T X) means looking for the orthogonal transformation that gives the most independence between its components. This is equivalent to minimizing the sum of the entropies of the separate components of Y, which is equivalent to maximizing their departures from Gaussianity (since Gaussian variables have maximum entropy for a given variance).
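The claim that a mixture Z^T S can be more Gaussian than any single S_i is the central-limit intuition behind maximizing non-Gaussianity. A small illustration (uniform sources and the kurtosis criterion are choices made here for concreteness, not part of the derivation):

```r
# Hedged illustration: mixing two independent uniform (sub-Gaussian) sources
# pushes excess kurtosis toward 0, i.e. toward Gaussianity.
set.seed(1)
n  <- 1e5
s1 <- runif(n, -1, 1)
s2 <- runif(n, -1, 1)
y  <- (s1 + s2) / sqrt(2)                      # an equal unit-variance mixture
kurt <- function(x) mean(((x - mean(x)) / sd(x))^4) - 3
kurt(s1)   # about -1.2 for a uniform source
kurt(y)    # closer to 0, i.e. more nearly Gaussian
```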
There are two problems:

1. We cannot determine the variances of the independent components. Since S and A are both unknown, a scalar multiple of one S_i can be cancelled out by dividing the corresponding column a_i of A by the same scalar. Thus we fix the magnitudes of the independent components S_i: since they are all random variables, we assume each has unit variance and, since they have been centered, this means E[S_i^2] = 1. Note that we can multiply an independent component by -1 without affecting the model, so there is also an ambiguity of sign.

2. We cannot determine the order of the independent components. Since S and A are both unknown, we are free to change the order of the terms, setting any one of them first. Thus a permutation matrix P and its inverse can be substituted into the model to give X = (A P^{-1})(P S), where A P^{-1} is the new unknown mixing matrix to be solved for by ICA and the elements of P S are the original independent S_i but in different (i.e. permuted) order.

Read some required files:

drive <- "D:"
code.dir <- paste(drive, "DATA/Data Mining R-Code", sep = "/")
data.dir <- paste(drive, "DATA/Data Mining Data", sep = "/")
source(paste(code.dir, "BorderHist.r", sep = "/"))
source(paste(code.dir, "WaveIO.r", sep = "/"))
library(fastICA)

We will create and display two signals (Figure 16):

S.1 <- sin((1:1000)/20)
S.2 <- rep((((1:200)-100)/100), 5)
S <- cbind(S.1, S.2)
plot(S.1)
plot(S.2)
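Problem 2 can be verified directly: substituting P and its inverse leaves the observed data unchanged, so ICA cannot recover the ordering. A minimal sketch (the sources and mixing matrix here are arbitrary stand-ins):

```r
# Hedged check of X = (A P^-1)(P S): a permutation of the sources,
# compensated in the mixing matrix, reproduces X exactly.
set.seed(1)
S <- matrix(rnorm(100 * 2), 100, 2)      # stand-in sources (columns)
A <- matrix(c(2, 1, 1, 3), 2, 2)         # stand-in mixing matrix
P <- matrix(c(0, 1, 1, 0), 2, 2)         # swap the two components
X1 <- S %*% t(A)                         # X = S A^T, as in the SVD form above
X2 <- (S %*% t(P)) %*% t(A %*% solve(P)) # permuted S, adjusted A
max(abs(X1 - X2))                        # 0 up to rounding error
```

Since a permutation matrix is orthogonal (P^{-1} = P^T), the compensating factors cancel exactly.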
Figure 16. Original signals
and rotate them:

a <- pi/4
A <- matrix(c(cos(a), sin(a), -sin(a), cos(a)), 2, 2)
X <- S %*% A
plot(X[,1])
plot(X[,2])

Figure 17. Rotated signals

We then display the original and rotated signals with their histograms:

border.hist(S.1, S.2)
border.hist(X[,1], X[,2])

Figure 18. Border histograms of the original (left) and rotated signals.
Now start with the mixed signals and observe what happens to the histograms as we rotate the axes onto which the signals are projected:

b <- pi/36
W <- matrix(c(cos(b), -sin(b), sin(b), cos(b)), 2, 2)
XX <- X
for (i in 1:9) {
  XX <- XX %*% W
  border.hist(XX[,1], XX[,2])
  readline("Press Enter...")
}
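The rotation demo above can be made quantitative: scanning projection angles and tracking the kurtosis of the projected signal locates the least Gaussian direction, which is exactly what FastICA automates. A hedged sketch (the angle grid and the kurtosis criterion are choices made here, not part of the demo code):

```r
# Hedged companion to the rotation demo: find the projection angle that
# maximizes |excess kurtosis|, i.e. the least Gaussian direction.
kurt <- function(x) mean(((x - mean(x)) / sd(x))^4) - 3
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
a <- pi/4
A <- matrix(c(cos(a), sin(a), -sin(a), cos(a)), 2, 2)
X <- S %*% A                                   # mix by rotating through pi/4
angles <- seq(0, pi/2, length.out = 91)
k <- sapply(angles, function(b) abs(kurt(drop(X %*% c(cos(b), sin(b))))))
angles[which.max(k)]                           # near pi/4: undoes the mixing
```

At the maximizing angle the projection lines up with a single source, where (as argued above) the signal is least Gaussian.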
Figure 19. Effect of rotating the projection plane
We see that for the fully mixed signals the histograms appear nearly Gaussian. As we move through the different projections the histograms move away from normality. The resulting signals are:

plot(XX[,1])
plot(XX[,2])

Figure 20. Result of the ICA

Now consider what happens for 3 signals: a sine function, a sawtooth, and a pair of exponentials.

S.1 <- sin((1:1000)/20)
S.2 <- rep((((1:200)-100)/100), 5)
S.3 <- rep(c(exp(seq(0, .99, .01)) - 1.845617,
             -exp(seq(0, .99, .01)) + 1.845617), 5)
S <- cbind(S.1, S.2, S.3)
A <- matrix(runif(9), 3, 3)   # Set a random mixing matrix
X <- S %*% A
Do an ICA on the mixed data:

a <- fastICA(X, 3, alg.typ = "parallel", fun = "logcosh", alpha = 1,
             method = "R", row.norm = FALSE, maxit = 200,
             tol = 0.0001, verbose = TRUE)
Whitening
Symmetric FastICA using logcosh approx. to neg-entropy function
Iteration 1 tol = 0.1086564
Iteration 2 tol = 0.004629528
Iteration 3 tol = 0.0001178137
Iteration 4 tol = 5.028182e-06

We then plot the original, mixed, and recovered data:

oldpar <- par(mfcol = c(3, 3), mar = c(2, 2, 2, 1))
plot(1:1000, S[,1], type = "l", main = "Original Signals", xlab = "", ylab = "")
for (i in 2:3) { plot(1:1000, S[,i], type = "l", xlab = "", ylab = "") }
plot(1:1000, X[,1], type = "l", main = "Mixed Signals", xlab = "", ylab = "")
for (i in 2:3) { plot(1:1000, X[,i], type = "l", xlab = "", ylab = "") }
plot(1:1000, a$S[,1], type = "l", main = "ICA source estimates", xlab = "", ylab = "")
for (i in 2:3) { plot(1:1000, a$S[,i], type = "l", xlab = "", ylab = "") }
par(oldpar)

Figure 21. Original, mixed and recovered
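Because of the scale, sign, and order ambiguities discussed earlier, the columns of a$S will not line up with the columns of S directly. One hedged way to pair them up is to match on largest absolute correlation; the sketch below uses a synthetic stand-in for the ICA output so that it is self-contained:

```r
# Hedged matching sketch: pair estimated components with true sources by
# absolute correlation; sign and order are arbitrary in ICA output.
set.seed(1)
S.true <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
S.hat  <- cbind(-S.true[,2], S.true[,1]) +         # swapped order, flipped sign
          matrix(rnorm(2000, sd = 0.05), 1000, 2)  # plus a little noise
pairing <- apply(abs(cor(S.hat, S.true)), 1, which.max)
pairing                                            # 2 1: the order was swapped
```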
Repeat the process with four signals:

S.1 <- sin((1:1000)/20)
S.2 <- rep((((1:200)-100)/100), 5)
s.3 <- tan(seq(-pi/2 + .1, pi/2 - .1, .0118))
S.3 <- rep(s.3, 4)
S.4 <- rep(c(exp(seq(0, .99, .01)) - 1.845617,
             -exp(seq(0, .99, .01)) + 1.845617), 5)
S <- cbind(S.1, S.2, S.3, S.4)
(A <- matrix(runif(16), 4, 4))
          [,1]       [,2]        [,3]      [,4]
[1,] 0.4091777 0.79526756 0.773487999 0.7201944
[2,] 0.1084712 0.03256865 0.151097684 0.2899303
[3,] 0.8920621 0.69775810 0.281228361 0.1156242
[4,] 0.4683415 0.91346105 0.003911073 0.1033929
X <- S %*% A

a <- fastICA(X, 4, alg.typ = "parallel", fun = "logcosh", alpha = 1,
             method = "R", row.norm = FALSE, maxit = 200,
             tol = 0.0001, verbose = TRUE)
Centering
Whitening
Symmetric FastICA using logcosh approx. to neg-entropy function
Iteration 1 tol = 0.3458911
Iteration 2 tol = 0.007638039
Iteration 3 tol = 0.001150413
Iteration 4 tol = 0.0003499578
Iteration 5 tol = 9.909304e-05

oldpar <- par(mfcol = c(4, 3), mar = c(2, 2, 2, 1))
plot(1:1000, S[,1], type = "l", main = "Original Signals", xlab = "", ylab = "")
for (i in 2:4) { plot(1:1000, S[,i], type = "l", xlab = "", ylab = "") }
plot(1:1000, X[,1], type = "l", main = "Mixed Signals", xlab = "", ylab = "")
for (i in 2:4) { plot(1:1000, X[,i], type = "l", xlab = "", ylab = "") }
plot(1:1000, a$S[,1], type = "l", main = "ICA source estimates", xlab = "", ylab = "")
for (i in 2:4) { plot(1:1000, a$S[,i], type = "l", xlab = "", ylab = "") }
par(oldpar)
Figure 22.
For this example we will look at three mixtures of the 4 signals (note the warning message):

A <- matrix(runif(12), 4, 3)
X <- S %*% A
a <- fastICA(X, 4, alg.typ = "parallel", fun = "logcosh", alpha = 1,
             method = "R", row.norm = FALSE, maxit = 200,
             tol = 0.0001, verbose = TRUE)
n.comp is too large
n.comp set to 3
Centering
Whitening
Symmetric FastICA using logcosh approx. to neg-entropy function
Iteration 1 tol = 0.1473840
Iteration 2 tol = 0.003145043
Iteration 3 tol = 1.781576e-05

oldpar <- par(mfcol = c(4, 3), mar = c(2, 2, 2, 1))
plot(1:1000, S[,1], type = "l", main = "Original Signals", xlab = "", ylab = "")
for (i in 2:4) { plot(1:1000, S[,i], type = "l", xlab = "", ylab = "") }
plot(1:1000, X[,1], type = "l", main = "Mixed Signals", xlab = "", ylab = "")
for (i in 2:3) { plot(1:1000, X[,i], type = "l", xlab = "", ylab = "") }
plot(0, type = "n")   # Dummy to fill
plot(1:1000, a$S[,1], type = "l", main = "ICA source estimates", xlab = "", ylab = "")
for (i in 2:3) { plot(1:1000, a$S[,i], type = "l", xlab = "", ylab = "") }
plot(0, type = "n")   # Dummy to fill
par(oldpar)
Figure 23.
The next example uses ICA on sounds. This is a demonstration found at the Laboratory of Computer and Information Science (CIS) of the Department of Computer Science and Engineering at Helsinki University of Technology:
http://www.cis.hut.fi/projects/ica/cocktail/cocktail_en.cgi

For this example we will need to read and write .wav files. A .wav file has the basic structure described in the next function:

read.wav <- function(d.file) {
  zz <- file(d.file, "rb")   # Open binary file for reading
  # RIFF chunk
  RIFF <- readChar(zz, 4)                               # Word RIFF (4)
  file.len <- readBin(zz, integer(), 1)                 # Number of bytes in file (4)
  WAVE <- readChar(zz, 4)                               # Word WAVE (4)
  # FORMAT chunk
  fmt <- readChar(zz, 4)                                # "fmt " (4)
  len.of.format <- readBin(zz, integer(), 1)            # Format length (4)
  f.one <- readBin(zz, integer(), 1, size = 2)          # Number 1 (2)
  Channel.numbs <- readBin(zz, integer(), 1, size = 2)  # Number of channels (2)
  Sample.Rate <- readBin(zz, integer(), 1)              # Sample rate (4)
  Bytes.P.Sec <- readBin(zz, integer(), 1)              # Bytes/sec (4)
  Bytes.P.Sample <- readBin(zz, integer(), 1, size = 2) # Bytes/sample (2)
  Bits.P.Sample <- readBin(zz, integer(), 1, size = 2)  # Bits/sample (2)
  # DATA chunk
  DATA <- readChar(zz, 4)                               # Word DATA (4)
  data.len <- readBin(zz, integer(), 1)                 # Length of data (4)
  bias <- 2^(Bits.P.Sample - 1)
  wav.data <- rep(0, data.len)   # Create a place to store data
  # Read data based on above parameters
  wav.data <- readBin(zz, integer(), data.len, size = Bytes.P.Sample,
                      signed = FALSE)
  close(zz)                      # Close the file
  wav.data <- wav.data - bias    # Shift based on bias
  # Return the information for R
  list(RIFF = RIFF, File.Len = file.len, WAVE = WAVE, format = fmt,
       len.of.format = len.of.format, f.one = f.one,
       Channel.numbs = Channel.numbs, Sample.Rate = Sample.Rate,
       Bytes.P.Sec = Bytes.P.Sec, Bytes.P.Sample = Bytes.P.Sample,
       Bits.P.Sample = Bits.P.Sample, DATA = DATA, data.len = data.len,
       data = wav.data)
}

Set up variables for the data and create the file names for the input, mixed, and output files:

numb.source <- 9
in.file <- matrix(0, numb.source, 1)
mix.file <- matrix(0, numb.source, 1)
out.file <- matrix(0, numb.source, 1)
for (i in 1:numb.source) {
  in.file[i,] <- paste(data.dir, "/source", i, ".wav", sep = "")
  mix.file[i,] <- paste(data.dir, "/m", i, ".wav", sep = "")
  out.file[i,] <- paste(data.dir, "/s", i, ".wav", sep = "")
}
Read the source files into a list:

in.wav <- {}
for (m in 1:numb.source) {
  in.wav <- c(in.wav, list(read.wav(in.file[m,])))
}
We can look at the characteristics of the file with:

wav.char <- function(wav) {
  cat("RIFF", wav$RIFF, "\n")
  cat("Length", wav$File.Len, "\n")
  cat("Wave", wav$WAVE, "\n")
  cat("Format", wav$format, "\n")
  cat("Format Length", wav$len.of.format, "\n")
  cat("One", wav$f.one, "\n")
  cat("Number of Channels", wav$Channel.numbs, "\n")
  cat("Sample Rate", wav$Sample.Rate, "\n")
  cat("Bytes/Sec", wav$Bytes.P.Sec, "\n")
  cat("Bytes/Sample", wav$Bytes.P.Sample, "\n")
  cat("Bits/Sample", wav$Bits.P.Sample, "\n")
  cat("Data", wav$DATA, "\n")
  cat("Data Length", wav$data.len, "\n")
}

wav.char(in.wav[[1]])
RIFF RIFF
Length 50036
Wave WAVE
Format fmt
Format Length 16
One 1
Number of Channels 1
Sample Rate 8000
Bytes/Sec 8000
Bytes/Sample 1
Bits/Sample 8
Data data
Data Length 50000

Set up a random matrix for mixing:

A <- matrix(runif(numb.source*numb.source), numb.source, numb.source)

We will create a matrix (50000 x 9) that has one source in each column:

mixed <- {}
for (i in 1:numb.source) {
  mixed <- cbind(mixed, in.wav[[i]]$data)
}

We multiply by the 9 x 9 mixing matrix to produce a new (50000 x 9) matrix in which each column is a mixture of the 9 columns of the original matrix:

mixed <- mixed %*% A
We now plot the resulting wave forms (Figure 24):

old.par <- par(mfcol = c(numb.source, 1))
par(mar = c(2, 2, 2, 2) + 0.1)
plot(mixed[,1], type = "l", main = "Mixed")
for (m in 2:numb.source) { plot(mixed[,m], type = "l") }
if (dev.cur()[[1]] != 1) bringToTop(which = dev.cur())
par(old.par)

Figure 24. 9 signals mixed
In order to save the signal as a .wav file, we need the header information. We cheat a little by simply using the in.wav header and replacing its data part with the mixed data. The first part of the following code simply creates a mixed list from the in list and the second part does the data replacement:

mix.wav <- {}
for (m in 1:numb.source) {
  mix.wav <- c(mix.wav, list(in.wav[[m]]))
}
for (m in 1:numb.source) {
  mix.wav[[m]]$data <- mixed[,m]
  write.wav(mix.file[m,], mix.wav[[m]])
}

# Play them
Use the sound library to play the mixed sounds:

library(sound)
play(mix.file[1,])
play(mix.file[2,])
play(mix.file[3,])
play(mix.file[4,])
play(mix.file[5,])
play(mix.file[6,])
play(mix.file[7,])
play(mix.file[8,])
play(mix.file[9,])

# Unmix them
We will use fastICA to unmix the signals, then save and play the results as we did for the mixed signals:

mixed.all <- {}
for (i in 1:numb.source) {
  mixed.all <- cbind(mixed.all, mixed[,i])
}
ICA.wavs <- fastICA(mixed.all, numb.source, alg.typ = "parallel",
                    fun = "logcosh", alpha = 1, method = "R",
                    row.norm = FALSE, maxit = 200, tol = 0.0001,
                    verbose = TRUE)

# Save them
new.wav <- {}
for (m in 1:numb.source) {
  new.wav <- c(new.wav, list(in.wav[[m]]))
}
for (m in 1:numb.source) {
  new.wav[[m]]$data <- 5*ICA.wavs$S[,m]
  write.wav(out.file[m,], new.wav[[m]])
}

# Play them
play(out.file[1,])
play(out.file[2,])
play(out.file[3,])
play(out.file[4,])
play(out.file[5,])
play(out.file[6,])
play(out.file[7,])
play(out.file[8,])
play(out.file[9,])

# Plot them
old.par <- par(mfcol = c(numb.source, 3))
par(mar = c(2, 2, 2, 2) + 0.1)
plot(in.wav[[1]]$data, type = "l", main = "Original")
for (m in 2:numb.source) { plot(in.wav[[m]]$data, type = "l") }
plot(mixed[,1], type = "l", main = "Mixed")
for (m in 2:numb.source) { plot(mixed[,m], type = "l") }
plot(ICA.wavs$S[,1], type = "l")
if (dev.cur()[[1]] != 1) bringToTop(which = dev.cur())
for (m in 2:numb.source) { plot(ICA.wavs$S[,m], type = "l") }
par(old.par)
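The factor of 5 applied to ICA.wavs$S when saving above is an ad hoc volume boost, needed because ICA fixes each component only up to scale (problem 1). A hedged alternative is to rescale each recovered component to a chosen peak amplitude; the `rescale` helper and its target value are inventions of this sketch, not part of the demo:

```r
# Hedged rescaling sketch: normalize a recovered component to a target peak
# amplitude instead of multiplying by an arbitrary constant.
rescale <- function(x, target = 100) target * x / max(abs(x))
x <- rnorm(1000)          # stand-in for a column of ICA.wavs$S
y <- rescale(x)
max(abs(y))               # exactly the target, 100
```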
Figure 25.

# Original - play
play(in.file[1,])
play(in.file[2,])
play(in.file[3,])
play(in.file[4,])
play(in.file[5,])
play(in.file[6,])
play(in.file[7,])
play(in.file[8,])
play(in.file[9,])