Independent Component Analysis


Qiuping Xu, Department of Math, FSU

1 Introduction

Independent component analysis (ICA) is a statistical procedure for finding additive components of observed data under the assumption that the components are non-Gaussian and statistically independent. The method has important applications in signal decomposition (e.g., EEG signals), feature extraction, noise removal, and finding hidden factors in financial data.

2 Problem Setup

The classic example used to illustrate ICA is the so-called cocktail-party problem. Imagine two people talking at a party while two different microphones are recording. The two recordings X_1(t), X_2(t) are then both mixtures of the speech signals S_1(t), S_2(t) of the two speakers. Assuming that only an additive mixing effect exists, the scenario is described by Eq. (1):

    X_1(t) = a_11 S_1(t) + a_12 S_2(t)
    X_2(t) = a_21 S_1(t) + a_22 S_2(t)                                     (1)

In matrix form this is X = AS, or, putting the unknowns on the left, S = WX with W = A^(-1). The aim of ICA is to estimate the independent signals S_1(t), S_2(t) and the parameters a_ij from the recorded data X_1(t), X_2(t) alone. (A tiny numerical illustration of this model follows the PCA comparison in Section 3.1.)

3 Background

3.1 PCA vs. ICA

PCA is adequate if the data are Gaussian and linear; if not, ICA is more appropriate. PCA optimizes the covariance of the data, a second-order statistic, whereas ICA optimizes higher-order statistics. PCA finds uncorrelated components, while ICA finds independent components (uncorrelatedness is a weaker property than independence). PCA gives a natural ordering of the components, while ICA normally cannot, because of the magnitude/scaling ambiguity and the permutation ambiguity.
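To make the mixing model of Eq. (1) concrete, the following minimal MATLAB sketch (an illustration only, reusing the mixing matrix A that appears in EX1 below) generates two toy sources, mixes them, and then unmixes them with W = A^(-1). ICA must accomplish the same recovery without knowing A.

    % Toy illustration of X = A*S and of ideal unmixing S = W*X when A is known.
    t = 1:0.05:30;
    S = [sin(t); rand(size(t)) - 0.5];   % two toy source signals (rows)
    A = [1, 5; 0.3, 2];                  % assumed mixing matrix (same values as EX1)
    X = A*S;                             % the two "microphone" recordings, Eq. (1)
    W = inv(A);                          % ideal unmixing matrix
    S_rec = W*X;                         % recovers S exactly
    max(abs(S_rec(:) - S(:)))            % essentially zero (floating-point error only)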

3.2 Uncorrelatedness vs. Independence

Consider two scalar-valued random variables x_1 and x_2. The two variables are said to be independent if information about the value of x_1 does not give any information about the value of x_2, and vice versa. If we write p(x_1, x_2) for the joint probability density of the two variables and p_1(x_1), p_2(x_2) for their marginal densities, then independence is equivalent to Eq. (2):

    p(x_1, x_2) = p_1(x_1) p_2(x_2)                                        (2)

From Eq. (2) we can derive that Eq. (3) holds for any suitable functions f and g:

    E[f(x_1) g(x_2)] = E[f(x_1)] E[g(x_2)]                                 (3)

Uncorrelatedness, however, only requires cov(x_1, x_2) = E[x_1 x_2] - E[x_1] E[x_2] = 0, which can be rewritten as Eq. (4):

    E[x_1 x_2] = E[x_1] E[x_2]                                             (4)

Comparing Eqs. (3) and (4), we see that uncorrelatedness is just Eq. (3) restricted to the identity functions; it is a weaker form of independence, and uncorrelatedness does not imply independence (a short numerical check of this point appears just before Section 5).

4 Core of ICA

The core of ICA is independence. I prefer to think of two different families of approaches to independence. One is based on the idea that non-Gaussianity indicates independence, so the algorithm minimizes some measure of Gaussianity (equivalently, maximizes negentropy or the absolute kurtosis) to find the independent components. The other states that independence can be achieved by minimizing mutual information, so the algorithm minimizes the mutual information between components (with freedom to define different measures depending on the application).

4.1 Negentropy

The entropy H_x of a continuous random variable x with density p(x) is defined in Eq. (5):

    H_x = - ∫ p(x) log p(x) dx                                             (5)

Negentropy is defined in Eq. (6):

    J_x = H_{x_gauss} - H_x                                                (6)

where x_gauss is a Gaussian variable with the same mean and variance as x. Out of all distributions with a given mean and variance, the Gaussian distribution has the highest entropy. Negentropy is therefore always non-negative and is zero only for a purely Gaussian x, so it is often used as a measure of distance from Gaussianity. The measure is statistically well behaved but difficult to evaluate exactly; in practice an approximation is used to estimate it.
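As promised in Section 3.2, here is a small MATLAB check on an assumed toy pair (not taken from the examples below): x_2 = x_1^2 is uncorrelated with x_1, yet Eq. (3) already fails for f(x) = x^2 and g(x) = x, so the pair is not independent.

    % Uncorrelated but dependent: x2 is a deterministic function of x1.
    rng(0);                                     % reproducible sample
    x1 = 2*rand(1, 1e6) - 1;                    % uniform on (-1, 1)
    x2 = x1.^2;                                 % fully determined by x1
    mean(x1.*x2) - mean(x1)*mean(x2)            % ~0, so Eq. (4) holds
    mean(x1.^2 .* x2)                           % E[f(x1) g(x2)] with f(x)=x^2, g(x)=x: ~1/5
    mean(x1.^2) * mean(x2)                      % E[f(x1)] E[g(x2)]: ~1/9, so Eq. (3) fails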

5 Example and Computation

5.1 Algorithm

Centering. As in PCA, the first step is to make the data zero-mean. If desired, the mean can be added back to the independent components after the ICA process is complete; the quantity to add is inv(A)*m, where A is the estimated mixing matrix and m is the mean that was subtracted from the data.

Whitening. The other pre-processing step is to whiten the data, i.e., to transform the signals into uncorrelated signals, each with unit variance. One way to do this is to run PCA first and then scale each column of the SCORE matrix to unit variance by dividing it by its own standard deviation. If you are not familiar with PCA, the same result is obtained as follows: starting from the zero-mean data X, the whitened data are given by V D^(-1/2) V^T X, where V and D come from the eigenvalue (equivalently, singular value) decomposition of the covariance matrix of X, E[X X^T] = V D V^T. Note that D is a diagonal matrix, so computing D^(-1/2) only requires an element-wise operation on its diagonal.

Update. Direct estimation of entropy is generally difficult, so newer estimation methods rely on approximations built around the maximum entropy principle, such as Eq. (7):

    J1_x = [E(G(x_gauss)) - E(G(x))]^2                                     (7)

Here x_gauss is again a Gaussian variable with the same mean and variance as x; after whitening, x_gauss is simply a standard Gaussian variable. E is the expectation operator, which in computation is replaced by a sample mean (over the whole data set, or over just a portion of it). J1_x shares the key properties of J_x: it is always non-negative and is zero only for a purely Gaussian x. Given an initial guess W, the symmetric version of the update rule (which has no preference for any particular component) is given in Eq. (8):

    W_plus = E[X g(W^T X)] - E[g'(W^T X)] W
    W_next = (W_plus W_plus^T)^(-1/2) W_plus                               (8)

where S = W X. As a reminder, X is the recorded (centered and whitened) signal and S contains the independent components we want to find; g is the derivative of G, and g' is the derivative of g. The choice of g is application dependent, but good choices include g(x) = x exp(-x^2/2), g(x) = x^3, and g(x) = tanh(x).

5.2 Code and Example

A free implementation called FastICA is available online. The code is well written and comes in several languages, including MATLAB, R, C++, and Python. Next, I will use an example to illustrate how to use the MATLAB code (download the package first and unzip it into a folder on the MATLAB path).
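Before turning to the FastICA package itself, the following sketch shows how the pre-processing and update steps of Section 5.1 fit together. It is a minimal illustration of the symmetric update with g(x) = tanh(x), not the FastICA implementation; it assumes the mixed signals are stored row-wise in a d x n matrix mixedsig (rows of W playing the role of the unmixing vectors, so S = W X) and a MATLAB version with implicit expansion (R2016b or later).

    % Centering and whitening (Section 5.1).
    X  = mixedsig - mean(mixedsig, 2);            % zero-mean rows
    [V, D] = eig(cov(X'));                        % eigendecomposition of the covariance
    Xw = V * diag(diag(D).^(-1/2)) * V' * X;      % whitened data: cov(Xw') is ~identity

    % Symmetric fixed-point update, Eq. (8), with g(x) = tanh(x).
    d = size(Xw, 1);
    W = orth(randn(d));                           % random orthogonal initial guess
    for iter = 1:200
        Y      = W * Xw;                          % current source estimates
        gY     = tanh(Y);                         % g
        gPrime = 1 - gY.^2;                       % g'
        Wplus  = (gY * Xw')/size(Xw, 2) - diag(mean(gPrime, 2)) * W;
        [P, Lam] = eig(Wplus * Wplus');
        W = P * diag(diag(Lam).^(-1/2)) * P' * Wplus;   % symmetric decorrelation
    end
    S_est = W * Xw;                               % rows approximate the sources, up to sign and order

The FastICA package performs the same kind of iteration but monitors the change in W between passes and stops once it has converged, rather than running a fixed number of iterations.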

EX1. We first generate two source signals (the speech signals of the cocktail-party example), which also serve as the ground truth for later comparison. The signals are shown in Fig. 1. To make the comparison easier, I scaled the signals to unit variance (the output ICs will also have unit variance).

    sig1 = 0.5*sin([1:0.05:30]);
    sig1 = sig1./std(sig1);
    sig2 = rand(size(sig1)) - 0.5;
    sig2 = sig2./std(sig2);

    % visualize the two source signals
    subplot(2,1,1); plot(sig1)
    subplot(2,1,2); plot(sig2)

Fig. 1. Two source signals

From these two source signals we can generate the mixed signals (the recordings of the cocktail-party example). The mixed signals are shown in Fig. 2, and their histograms are plotted in Fig. 3. The histograms of the mixed signals are bell shaped, which is an indication of Gaussianity.

    A = [1, 5; 0.3, 2];
    mixedsig = A*[sig1; sig2];   % generate 2 mixed signals with 581 observations each

    % visualize the mixed signals
    for i = 1:2, subplot(2,1,i); plot(mixedsig(i,:)); end

    % histograms of the mixed signals
    for i = 1:2, subplot(2,1,i); hist(mixedsig(i,:)); end

Fig. 2. The two observed (mixed) signals

Fig. 3. Histograms of the mixed signals

We use the default settings of FastICA to find the independent components: [icasig, A, W] = fastica(mixedsig). The input must be arranged as (# of signals) x (# of time points). The rows of icasig are the ICs with the mean added back, A is the estimated mixing matrix as in X = AS, and W A = I. One thing to pay attention to: if you request only two output variables, FastICA returns A and W instead. The recovered independent components and their histograms are shown in Figs. 4 and 5.

    [icasig, A1, W] = fastica(mixedsig);   % find the ICs with the default settings;
                                           % mixedsig is (# of signals) x (# of time points)

    % visualize the two recovered signals
    for i = 1:2, subplot(2,1,i); plot(icasig(i,:)); end

    % and their histograms
    figure
    for i = 1:2, subplot(2,1,i); hist(icasig(i,:)); end

Fig. 4. The two independent signals learned by FastICA

Several things are worth mentioning. ICA imposes no natural order on the independent components, and each independent component is unique only up to sign. In this example, the first recovered component is the negative of the ground truth: the learned matrix A1 is [-0.9990, 4.9775; -0.2997, 1.9928], whose first column is the opposite of the first column of A.
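When the ground truth is available, as in this toy example, the sign and permutation ambiguities can be resolved after the fact by matching each recovered component to the source it correlates with most strongly. A small illustrative sketch (not part of FastICA; it assumes sig1, sig2, and icasig from above, and obviously cannot be applied when the true sources are unknown):

    src = [sig1; sig2];
    aligned = zeros(size(icasig));
    for i = 1:2
        c = zeros(1, 2);
        for j = 1:2
            R = corrcoef(icasig(i,:), src(j,:));
            c(j) = R(1, 2);                       % correlation with source j
        end
        [~, j] = max(abs(c));                     % best-matching source
        aligned(j,:) = sign(c(j)) * icasig(i,:);  % undo the sign flip
    end
    % rows of aligned now line up with the rows of src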

Fig. 5. Histograms of the independent signals

Since the pre-processing includes whitening, it is also common to keep only a lower-dimensional PCA representation at this stage, which removes redundant information and improves computational efficiency.

EX2. In this example I used images as input; the principle is the same as in EX1. Suppose our sources are the two color pictures shown in Fig. 6.

    % read in the sources
    I = imread('p1.png');
    J = rgb2gray(I);
    imshow(I)
    J_final(1,:) = im2double(reshape(J(1:200,1:200), 1, 40000));
    J_final(1,:) = J_final(1,:)./std(J_final(1,:));

    I = imread('p2.png');
    J = rgb2gray(I);
    imshow(I)
    J_final(2,:) = im2double(reshape(J(1:200,1:200), 1, 40000));
    J_final(2,:) = J_final(2,:)./std(J_final(2,:));

    % show the two grayscale sources
    for i = 1:2, subplot(1,2,i); imshow(reshape(J_final(i,:), 200, 200)); end

Since we only want to illustrate the idea, I used the grayscale images, scaled to unit variance, as the sources (Fig. 7).

Fig. 6. Two color pictures

Fig. 7. The two grayscale pictures with unit variance

We then generate two mixed images as our observations (Fig. 8).

    A = [0.2186, 0.2830; 0.0119, 0.3113];
    mixedsig = A*J_final;
    for i = 1:2, subplot(1,2,i); imshow(reshape(mixedsig(i,:), 200, 200)); end

Note that, because of the sign ambiguity, even with unit-variance sources there are 2^(#IC) possible solutions. In Fig. 9 I show the one that is consistent with the sources (a selection that is not possible in a real problem, where the sources are unknown).

    [icasig, A1, W] = fastica(mixedsig);

Fig. 8. The mixed pictures

    % the raw ICs, followed by the three remaining sign combinations
    for i = 1:2, subplot(1,2,i); imshow(reshape(icasig(i,:), 200, 200)); end

    index = [-1, 1];
    for i = 1:2, subplot(1,2,i); imshow(reshape(index(i)*icasig(i,:), 200, 200)); end

    index = [1, -1];
    for i = 1:2, subplot(1,2,i); imshow(reshape(index(i)*icasig(i,:), 200, 200)); end

    index = [-1, -1];
    for i = 1:2, subplot(1,2,i); imshow(reshape(index(i)*icasig(i,:), 200, 200)); end

Fig. 9. The learned independent components

References

1. Aapo Hyvärinen and Erkki Oja. Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4-5):411-430, 2000.
2. Ganesh R. Naik and Dinesh K. Kumar. An Overview of Independent Component Analysis and Its Applications. Informatica, 35 (2011), 63-81.