15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP

EFFICIENT GEOMETRIC METHODS FOR KERNEL DENSITY ESTIMATION BASED INDEPENDENT COMPONENT ANALYSIS

Hao Shen, Martin Kleinsteuber and Knut Hüper

National ICT Australia, Canberra Research Laboratory, Locked Bag 8001, Canberra ACT 2601, Australia, and Department of Information Engineering, RSISE, The Australian National University, Canberra ACT 0200, Australia

ABSTRACT

The performance of Independent Component Analysis (ICA) methods depends significantly on the choice of the contrast function and on the optimisation algorithm used to obtain the demixing matrix. It has been shown that nonparametric ICA approaches are more robust than their parametric counterparts. One basic nonparametric ICA contrast was developed by approximating mutual information using kernel density estimates. In this work we study the kernel density estimation based linear ICA problem from an optimisation point of view. Two geometric methods are proposed to optimise the kernel density estimation based linear ICA contrast function: a Jacobi-type method and an approximate Newton-like method. Rigorous analysis shows that both geometric methods converge locally quadratically fast to a correct demixing. The performance of the proposed algorithms is investigated by numerical experiments.

1. INTRODUCTION

In the past decade, Independent Component Analysis (ICA) has become a standard statistical tool for solving the Blind Source Separation (BSS) problem. It has received considerable attention in various communities. Generally, the performance of an ICA method depends significantly on the choice of the contrast function measuring statistical independence of signals, as well as on the appropriate optimisation technique. Approaches to designing ICA contrasts can be classified into two categories: parametric and nonparametric. Parametric approaches construct the ICA criterion according to certain hypotheses on the distributions of the source signals, using parameterised families of density functions. Since the distributions of the sources are generally unknown, an incorrect hypothesis on these distributions can lead to poor separation performance, and may even cause parametric ICA methods to fail completely in many real applications. For these reasons, there has been an increasing interest in developing nonparametric ICA methods. One such nonparametric ICA approach is to minimise the empirical mutual information between the recovered signals by employing the Kernel Density Estimation (KDE) technique to deal with the unknown distributions of the sources, see e.g. [14, 20, 2]. There also exist other sophisticated nonparametric ICA approaches which do not work with the probability density; refer to [6, 1, 9] and the references therein. In this work we focus only on the nonparametric linear ICA problem based on the kernel density estimation proposed in [2], from an optimisation point of view.

From an optimisation point of view, the development of efficient geometric ICA methods has been very active ever since the early work of Comon [5]. Jacobi-type methods are among the prominent methods, including Joint Approximate Diagonalisation of Eigenmatrices (JADE) [3], Extended Maximum Likelihood (EML) [21], MaxKurt [3], Jacobi optimised Empirical Characteristic Function ICA (JECFICA) [6], RADICAL [14], and so on. Although the efficiency of these Jacobi-type ICA methods has been verified by numerical experiments, the convergence properties of most Jacobi-type ICA algorithms are still theoretically unclear.
Recently, with a complete understanding of the FastICA algorithm [12, 18], the so-called approximate Newton-like method on manifolds, which is locally quadratically convergent, has been successfully applied to the problems of parallel parametric linear ICA [17] and of one-unit KDE based linear ICA [19]. The focus of this work is the development of efficient geometric methods, namely a Jacobi-type method and an approximate Newton-like method, for solving the KDE based linear ICA problem. Rigorous analysis shows that both methods are locally quadratically convergent to a correct demixing matrix under reasonable assumptions.

The paper is organised as follows. Section 2 introduces the KDE based linear ICA problem and gives a critical point analysis of the contrast function, followed by a study of the Hessian. In Section 3, we present a Jacobi-type method and an approximate Newton-like method for optimising the KDE based linear ICA contrast. The local convergence properties of both methods are studied as well. Finally, in Section 4, several numerical experiments are provided to investigate the performance of the proposed algorithms.

2. THE KDE BASED LINEAR ICA PROBLEM

2.1 Linear ICA Model and the KDE Based Contrast

The standard instantaneous noiseless linear ICA model is stated as the following relation, see [11] for details,

    Z = AS,   (1)

where S = [s_1, ..., s_n] ∈ R^{m×n} is a data matrix of n samples of m sources (m ≤ n), which are mutually statistically independent and drawn independently and identically, A ∈ R^{m×m} is the mixing matrix of full rank, and Z = [z_1, ..., z_n] ∈ R^{m×n} represents the observed mixtures. Note that mutual statistical independence can only be ensured if the sample size n tends to infinity. Nevertheless, for our theoretical analysis in Section 2.2, we assume that complete independence holds even for a finite sample size. The task of the linear ICA problem is to recover the source signals S by estimating the mixing matrix A based only on the observed mixtures Z. After the so-called whitening of the mixtures, the corresponding demixing model of (1) can be stated as

    Y = X^⊤ W,   (2)

where W = [w_1, ..., w_n] = VZ ∈ R^{m×n} is the whitened observation (V ∈ R^{m×m} is the whitening matrix), X ∈ R^{m×m} is an orthogonal matrix acting as the demixing matrix, and Y = [y_1, ..., y_n] ∈ R^{m×n} contains the reconstructed signals. It has been shown in [4] that the whitening process is statistically √n-consistent. Let SO(m) := { X ∈ R^{m×m} | X^⊤X = I, det(X) = 1 } denote the special orthogonal group of order m. In this work, we study the linear ICA problem of finding an X = [x_1, ..., x_m] ∈ SO(m) which reconstructs the sources via the model (2) based only on the whitened observations.

Minimisation of the mutual information between the recovered signals is a common criterion for ICA. In the absence of knowledge of the true distributions of the sources, an empirical mutual information between the extracted signals can be computed by employing the kernel density estimation technique, refer to [2] for more details,

    f : SO(m) → R,  X ↦ −Σ_{k=1}^{m} E_i[ log( (1/h) E_j[ φ( x_k^⊤ w_{ij} / h ) ] ) ],   (3)

where w_{ij} := w_i − w_j ∈ R^m represents the difference between the i-th and the j-th sample, E_i[·] and E_j[·] denote empirical means over the sample indices i and j, φ : R → R is an appropriate kernel function, and h ∈ R_+ is the kernel bandwidth. In this work we specify φ as the Gaussian kernel φ(t) = exp(−t²/2).

2.2 Analysis of the KDE based linear ICA contrast

Before analysing the KDE based linear ICA contrast defined in (3), we recall the tangent space of SO(m) at X,

    T_X SO(m) := { Ξ ∈ R^{m×m} | Ξ = XΩ, Ω ∈ so(m) },   (4)

where so(m) := { Ω ∈ R^{m×m} | Ω^⊤ = −Ω } denotes the set of all skew-symmetric matrices. Let Ω = [ω_1, ..., ω_m] = (ω_{kl})_{k,l=1}^{m} ∈ so(m) and define y_{ij}(x_k) := x_k^⊤ w_{ij} / h. By the chain rule, the first derivative of the contrast f in the direction XΩ ∈ T_X SO(m) is calculated as

    Df(X)(XΩ) = Σ_{1≤k<l≤m} ω_{kl} ( u_{kl}(X) − u_{lk}(X) ),   (5)

where

    u_{kl}(X) := E_i[ E_j[ φ'(y_{ij}(x_k)) y_{ij}(x_l) ] / E_j[ φ(y_{ij}(x_k)) ] ] ∈ R.   (6)

It can be shown that if X* ∈ SO(m) is a correct demixing matrix, then by the whitening properties of the sources the term u_{kl}(X*) equals zero for all k ≠ l. It follows that the first derivative of f vanishes at X*, i.e., Df(X*)(X*Ω) = 0. Therefore a correct demixing matrix X* is indeed a critical point of f. Computing the second derivative of f, one gets the quadratic form

    d²/dt² f(X e^{tΩ}) |_{t=0} = Σ_{k=1}^{m} ω_k^⊤ H^{(k)}(X) ω_k − Σ_{1≤k<l≤m} ω_{kl}² ( u_{kk}(X) + u_{ll}(X) ),   (7)

where ω_k denotes the k-th column of Ω and H^{(k)}(X) = ( h^{(k)}_{pq}(X) )_{p,q=1}^{m} ∈ R^{m×m} has the pq-th entry

    h^{(k)}_{pq}(X) = E_i[ E_j[ φ''(y_{ij}(x_k)) y_{ij}(x_p) y_{ij}(x_q) ] / E_j[ φ(y_{ij}(x_k)) ] ]
                    − E_i[ E_j[ φ'(y_{ij}(x_k)) y_{ij}(x_p) ] E_j[ φ'(y_{ij}(x_k)) y_{ij}(x_q) ] / ( E_j[ φ(y_{ij}(x_k)) ] )² ].   (8)

Let X = X*. Using the properties of the whitened signals, tedious but straightforward computations lead to

    d²/dt² f(X* e^{tΩ}) |_{t=0} = D²f(X*)(X*Ω, X*Ω) = Σ_{1≤k<l≤m} ω_{kl}² ( û_{kk}(X*) + û_{ll}(X*) ),   (9)

where

    û_{kk}(X) := (2/h²) E_i[ E_j[ φ''(y_{ij}(x_k)) ] / E_j[ φ(y_{ij}(x_k)) ] ]
               − E_i[ E_j[ φ'(y_{ij}(x_k)) y_{ij}(x_k) ] / E_j[ φ(y_{ij}(x_k)) ] ]
               − (1/h²) E_i[ ( E_j[ φ'(y_{ij}(x_k)) ] / E_j[ φ(y_{ij}(x_k)) ] )² ].   (10)

Consequently, the Hessian of f at X* is indeed diagonal.
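To make the contrast concrete, the following Python sketch evaluates (3) for the Gaussian kernel on a whitened data matrix W ∈ R^{m×n}. It is a naive O(mn²) reconstruction of the formula above for illustration, not the authors' implementation; it assumes the sign convention of (3), under which a correct demixing is a minimiser, and drops the kernel normalisation constant 1/√(2π), which only shifts f by a constant.

```python
import numpy as np

def kde_contrast(X, W, h):
    """Empirical KDE-based ICA contrast f(X) of Eq. (3).

    X : (m, m) orthogonal demixing matrix
    W : (m, n) whitened observations
    h : kernel bandwidth
    """
    Y = X.T @ W                        # recovered signals, shape (m, n)
    f = 0.0
    for k in range(Y.shape[0]):
        yk = Y[k]
        # pairwise scaled differences y_ij(x_k) = x_k^T w_ij / h
        D = (yk[:, None] - yk[None, :]) / h
        phi = np.exp(-0.5 * D**2)      # Gaussian kernel phi(t) = exp(-t^2/2)
        # inner empirical mean over j, then log, then mean over i
        f -= np.mean(np.log(np.mean(phi, axis=1) / h))
    return f
```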
It is important to notice that the properties in (9) hold true only if complete statistical independence of the sources can be ensured. It is already known that the global minimum of the actual mutual information between the reconstructed signals gives the correct separation. However, there is still no proof that the global minimum of the nonparametric empirical mutual information defined in (3) yields the best estimate of the sources. Nevertheless, in this work we assume that the best estimate of the sources is attained at the global minimum of the KDE based contrast defined in (3). In other words, we assume that any correct demixing matrix X* is a nondegenerate critical point of the contrast f, i.e., for a given set of sources, the term û_{kk}(X*) defined in (10) is nonzero.

3. GEOMETRIC METHODS FOR THE KDE BASED LINEAR ICA PROBLEM

3.1 Jacobi-type ICA Method

Jacobi-type methods are well-known tools for solving matrix eigenvalue or singular value problems [7]. They can also be considered as optimisation procedures. The local convergence properties of Jacobi-type methods have recently been discussed in [10, 13]. In this section, we develop a Jacobi-type method to optimise the ICA contrast (3). A Jacobi-type method consists of iterating so-called sweeps. Within a sweep, the Jacobi-type method uses predetermined directions, which reduces the computational burden. Once a predetermined direction is fixed, a step size is chosen to yield either a local or a global optimum of the restricted cost function. However, this is often infeasible for general cost functions. In this work, we follow a different approach based on a one-dimensional Newton optimisation step. Similar approximations of the optimal step size have already been used in [15, 8].

Let Ω_{kl} ∈ so(m) denote the basis matrix with ω_{kl} = −ω_{lk} = 1 and zeros elsewhere, and define the set of directions ∆ := { Ω_{kl} ∈ so(m) | 1 ≤ k < l ≤ m }. Recall the formula for a geodesic in SO(m) emanating from a point X in the direction XΩ ∈ T_X SO(m),

    γ_X : R → SO(m),  t ↦ X exp(tΩ),   (11)

that is, γ_X(0) = X and γ̇_X(0) = XΩ. By fixing a direction Ω ∈ so(m), we constrain the ICA contrast function f defined in (3) to a geodesic, such that in each step a one-dimensional optimisation problem

    f ∘ γ_X : R → R,  t ↦ f(X exp(tΩ))   (12)

has to be solved. Having chosen a predetermined direction Ω_{kl}, we compute the first and second derivatives of f ∘ γ_X at t = 0 as

    d/dt f(X exp(tΩ_{kl})) |_{t=0} = u_{kl}(X) − u_{lk}(X),
    d²/dt² f(X exp(tΩ_{kl})) |_{t=0} = ĥ_{kl}(X) + ĥ_{lk}(X),   (13)

where ĥ_{kl}(X) := h^{(k)}_{ll}(X) − u_{kk}(X). Hence, a Newton step λ_{kl} ∈ R for a fixed direction Ω_{kl} can be computed by

    λ_{kl} := λ_{kl}(X) := − ( u_{kl}(X) − u_{lk}(X) ) / ( ĥ_{kl}(X) + ĥ_{lk}(X) ).   (14)

By the special structure of Ω_{kl}, the matrix Φ = (φ_{pq})_{p,q=1}^{m} := exp(λ_{kl} Ω_{kl}) is a Jacobi rotation, i.e., it equals the identity matrix except for the kk, ll, kl and lk entries, with

    φ_{kk} = φ_{ll} = cos(λ_{kl}),  φ_{kl} = −φ_{lk} = sin(λ_{kl}).   (15)

To ensure that the one-dimensional Newton direction points downhill, one can force the denominator in (14) to be positive by taking the absolute value of the Hessian of f ∘ γ_X. A Jacobi-type method for optimising the KDE based linear ICA contrast function can be summarised as follows:

Jacobi-type KDE based linear ICA method
Step 1: Given an initial guess X ∈ SO(m) and the set of predetermined directions ∆ ⊂ so(m);
Step 2: Let X_old = X. For 1 ≤ k < l ≤ m:
  (i) calculate the Newton step size λ_{kl} = −( u_{kl}(X) − u_{lk}(X) ) / ( ĥ_{kl}(X) + ĥ_{lk}(X) );
  (ii) update X ← X exp(λ_{kl} Ω_{kl});
Step 3: If ‖X_old − X‖ is small enough, stop. Otherwise, go to Step 2.

To show the local convergence properties of the Jacobi-type KDE based linear ICA method, we first state the following theorem on the local quadratic convergence of a Jacobi-type method employing a Newton step rotation for optimising a general cost function on SO(m). This result can be proven by arguments similar to those explored in [10, 13]. Due to page limits, the proof is omitted here.

Theorem 1. Let f : SO(m) → R be a smooth cost function, let X* ∈ SO(m) be a local minimum of f, and let {Ω_1, ..., Ω_N} be a basis of so(m). If the Hessian H_f(X*) is nondegenerate and if X*Ω_i, X*Ω_j ∈ T_{X*}SO(m) are orthogonal with respect to H_f(X*) for all 1 ≤ i < j ≤ N, then the Jacobi-type method using a Newton step converges locally quadratically fast.

The local convergence properties of the Jacobi-type KDE based linear ICA method can now be stated.

Corollary 1. Let X* ∈ SO(m) be a correct demixing matrix of a linear ICA problem. Then the Jacobi-type KDE based linear ICA method is locally quadratically convergent to X*.

Sketch of proof. Let ∆ be the standard basis of so(m) defined above. It can be shown that D²f(X*)(X*Ω_{kl}, X*Ω_{pq}) = 0 for any Ω_{kl} ≠ Ω_{pq}. Then, by the assumption that X* is a nondegenerate critical point, the result follows.

A numerical sketch of one sweep of this method is given below.
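The following sketch implements one sweep of the Jacobi-type method (Steps 2(i) and 2(ii)), again as an illustrative reconstruction of Eqs. (6) and (13)-(15) rather than the authors' code. The helper _u_and_h evaluates u_{kl}(X) and h^{(k)}_{ll}(X) for the Gaussian kernel, and the absolute value in the denominator enforces a descent direction as discussed above.

```python
import numpy as np

def _u_and_h(Y, h, k, l):
    """Return u_kl(X) of Eq. (6) and h^{(k)}_{ll}(X) of Eq. (8) for the
    Gaussian kernel, given recovered signals Y = X^T W."""
    Dk = (Y[k][:, None] - Y[k][None, :]) / h    # y_ij(x_k)
    Dl = (Y[l][:, None] - Y[l][None, :]) / h    # y_ij(x_l)
    phi = np.exp(-0.5 * Dk**2)
    dphi = -Dk * phi                            # phi'(t)  = -t exp(-t^2/2)
    ddphi = (Dk**2 - 1.0) * phi                 # phi''(t) = (t^2 - 1) exp(-t^2/2)
    Ejphi = phi.mean(axis=1)                    # E_j[phi(y_ij(x_k))] per i
    u_kl = np.mean((dphi * Dl).mean(axis=1) / Ejphi)
    h_kll = np.mean((ddphi * Dl * Dl).mean(axis=1) / Ejphi) \
          - np.mean((dphi * Dl).mean(axis=1)**2 / Ejphi**2)
    return u_kl, h_kll

def jacobi_sweep(X, W, h):
    """One sweep of the Jacobi-type KDE-based linear ICA method (Step 2)."""
    m = X.shape[0]
    for k in range(m - 1):
        for l in range(k + 1, m):
            Y = X.T @ W                          # recompute after each rotation
            u_kl, h_kll = _u_and_h(Y, h, k, l)
            u_lk, h_lkk = _u_and_h(Y, h, l, k)
            u_kk, _ = _u_and_h(Y, h, k, k)
            u_ll, _ = _u_and_h(Y, h, l, l)
            hhat = (h_kll - u_kk) + (h_lkk - u_ll)   # hat h_kl + hat h_lk, Eq. (13)
            lam = -(u_kl - u_lk) / abs(hhat)         # Newton step, Eq. (14)
            R = np.eye(m)                            # Jacobi rotation, Eq. (15)
            R[k, k] = R[l, l] = np.cos(lam)
            R[k, l], R[l, k] = np.sin(lam), -np.sin(lam)
            X = X @ R
    return X
```

For clarity, Y is recomputed from scratch after every rotation; in practice one would update only the two affected rows of Y.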
3.2 Approximate Newton-like ICA Method

For a Newton-type method, the Hessian is generally too expensive to compute at every iteration. Therefore we propose a cheap approximation of the true Hessian based on its explicit structure at a correct demixing matrix. As (9) suggests, the Hessian of f at a correct demixing matrix X* is diagonal. We therefore approximate the Hessian of f, considered now as a linear operator T_X SO(m) → T_X SO(m), by approximating d²/dt² f(X e^{tΩ}) |_{t=0} with the expression

    Σ_{1≤k<l≤m} ω_{kl}² ( û_{kk}(X) + û_{ll}(X) ).   (16)

It is easily seen that the approximation (16) coincides with the true Hessian of f in (9) if X = X*. Hence, for an approximate Newton method, the approximate Newton direction Ω = (ω_{kl})_{k,l=1}^{m} ∈ so(m) can be computed cheaply entry-wise as

    ω_{kl} = − ( u_{kl}(X) − u_{lk}(X) ) / ( û_{kk}(X) + û_{ll}(X) ).   (17)

An approximate Newton iteration can then be closed by projecting the Newton direction Ω back onto SO(m) via the matrix exponential. However, in contrast to the Jacobi-type update, the matrix exponential of an arbitrary Ω ∈ so(m) requires an eigenvalue decomposition, an unnecessarily expensive iterative process [16]. One can overcome this issue with a first-order approximation of the matrix exponential via a QR decomposition, which still produces an orthogonal update, as follows. Let

    I + Ω = Q(Ω) R(Ω)   (18)

be the unique QR decomposition of the matrix I + Ω with Ω ∈ so(m). Note that the determinant of I + Ω is always positive, since the nonzero eigenvalues of a skew-symmetric Ω come in purely imaginary pairs ±iβ, so that det(I + Ω) is a product of factors (1 + β²). Therefore the QR decomposition is indeed unique with the diagonal entries of R(Ω) positive, and det Q(Ω) = 1. A numerical sketch of this QR-based retraction is given below.
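A minimal sketch of the retraction (18) using numpy's QR factorisation; the sign correction makes the diagonal of R positive, which yields the unique factor Q(Ω) with det Q(Ω) = 1. This is illustrative code, not the authors' implementation.

```python
import numpy as np

def qr_retraction(Omega):
    """First-order orthogonal approximation Q(Omega) of exp(Omega), Eq. (18)."""
    Q, R = np.linalg.qr(np.eye(Omega.shape[0]) + Omega)
    # numpy does not guarantee a positive diagonal of R; flip signs so that
    # diag(R) > 0, making the factorisation unique and det(Q) = 1.
    s = np.sign(np.diag(R))
    return Q * s            # multiply each column of Q by the matching sign
```

Since Q(Ω) agrees with exp(Ω) to first order in Ω, Theorem 2 below shows that replacing the exponential by this cheap retraction does not destroy local quadratic convergence.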

To summarise, we construct the approximate Newton-like KDE based linear ICA method as follows:

Approximate Newton-like KDE based linear ICA method
Step 1: Given an initial guess X ∈ SO(m);
Step 2: Let X_old = X. Compute the direction Ω = (ω_{kl})_{k,l=1}^{m} ∈ so(m) by ω_{kl} = −( u_{kl}(X) − u_{lk}(X) ) / ( û_{kk}(X) + û_{ll}(X) );
Step 3: Update X ← X Q(Ω);
Step 4: If ‖X_old − X‖ is small enough, stop. Otherwise, go to Step 2.

Finally, the local convergence properties of the approximate Newton-like KDE based linear ICA method are summarised in the following result.

Theorem 2. Let X* ∈ SO(m) be a correct demixing matrix of a linear ICA problem. Then the approximate Newton-like KDE based linear ICA method is locally quadratically convergent to X*.

Sketch of proof. The approximate Newton direction defined in (17) is locally a smooth map on SO(m) around X*, so each iteration of the approximate Newton-like KDE based linear ICA method is locally a smooth map on SO(m) around X* as well. It can be shown that the first derivative of the algorithm, considered as such a locally smooth map, vanishes at X*. The result follows by a Taylor series argument.
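Putting the pieces together, a minimal driver for the approximate Newton-like method might look as follows, reusing _u_and_h and qr_retraction from the sketches above together with a helper u_hat_kk implementing (10). All names and the stopping tolerance are ours, for illustration only.

```python
import numpy as np

def u_hat_kk(Y, h, k):
    """hat u_kk(X) of Eq. (10) for the Gaussian kernel."""
    Dk = (Y[k][:, None] - Y[k][None, :]) / h
    phi = np.exp(-0.5 * Dk**2)
    dphi, ddphi = -Dk * phi, (Dk**2 - 1.0) * phi
    Ejphi = phi.mean(axis=1)
    return (2.0 / h**2) * np.mean(ddphi.mean(axis=1) / Ejphi) \
         - np.mean((dphi * Dk).mean(axis=1) / Ejphi) \
         - (1.0 / h**2) * np.mean((dphi.mean(axis=1) / Ejphi) ** 2)

def approx_newton_ica(X, W, h, tol=1e-8, max_iter=100):
    """Approximate Newton-like KDE-based linear ICA iteration (Steps 1-4)."""
    m = X.shape[0]
    for _ in range(max_iter):
        Y = X.T @ W
        uhat = [u_hat_kk(Y, h, k) for k in range(m)]
        Omega = np.zeros((m, m))
        for k in range(m - 1):
            for l in range(k + 1, m):
                u_kl, _ = _u_and_h(Y, h, k, l)
                u_lk, _ = _u_and_h(Y, h, l, k)
                Omega[k, l] = -(u_kl - u_lk) / (uhat[k] + uhat[l])  # Eq. (17)
                Omega[l, k] = -Omega[k, l]
        X_new = X @ qr_retraction(Omega)                            # Step 3
        if np.linalg.norm(X_new - X) < tol:                         # Step 4
            return X_new
        X = X_new
    return X
```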
4. NUMERICAL EXPERIMENTS

In this section, we compare our new methods with an existing parallelised approximate Newton linear ICA algorithm, PANNICA, proposed in [19], which can be considered a nonparametric version of FastICA. First, the local convergence properties of all algorithms are investigated by applying them to an ideal dataset and to a simple speech dataset. The performance is then compared further in terms of both separation quality and convergence speed. Without fine tuning the kernel bandwidth h, the same fixed value of h is used throughout.

4.1 Convergence Properties

Convergence is measured by the distance of the accumulation point X* to the current iterate X_k, i.e., by the Frobenius norm ‖X_k − X*‖_F. All three methods are initialised with the same demixing matrix. It is worth noticing that, by the nature of the linear ICA problem, the algorithms can converge to different correct demixing matrices, where one is a column-wise permutation of another. Here we only study the cases where they converge to the same demixing matrix.

First, we construct an ideal dataset for which the properties of the Hessian given in (9) hold true. It consists of two simple signals, one a square wave and the other a sine wave. The numerical results shown in Figure 1 verify the local quadratic convergence of both methods, as stated in Corollary 1 and Theorem 2. They also indicate that PANNICA is locally quadratically convergent as well. In our implementations, all algorithms have comparable computational burden.

[Figure 1: Convergence on an ideal dataset; ‖X_k − X*‖_F versus iteration/sweep k for Jacobi ICA, ANL ICA and PANNICA.]

Now let us study the local convergence properties of these methods on a real speech dataset, which consists of three signals with n = 5,000 samples. The results in Figure 2 suggest that all three algorithms appear to converge only linearly. This is most likely because, for this speech dataset, the ideal properties of the Hessian shown in (9) are not fulfilled. Nevertheless, from this experiment one can still conclude that the Jacobi-type method converges much faster than the two approximate Newton based algorithms in a realistic environment.

[Figure 2: Convergence on a simple speech dataset; ‖X_k − X*‖_F versus iteration/sweep k for Jacobi ICA, ANL ICA and PANNICA.]

4.2 Comparison of performance

In this experiment, all methods are tested on the benchmark speech signal dataset from the Brain Science Institute, RIKEN. The dataset consists of 200 natural speech signals sampled at 4 kHz. The task is to separate m = 3 signals, randomly chosen out of the 200 sources, with a fixed sample size n = 1,000. The separation performance is measured by the average Signal-to-Interference-Ratio (SIR) index; refer to [2] for the definition (one common way of computing it is sketched below). A larger SIR index usually indicates a better separation. The convergence performance is compared by the number of iterations/sweeps each algorithm requires to reach the same error level, e.g., ‖X_k − X*‖ below a fixed small threshold. Each method is initialised with the same randomly generated demixing matrix. Replicating the experiment 100 times, Figure 3 and Figure 4 present quartile-based boxplots of the SIR index and of the number of iterations/sweeps needed by each method to converge, respectively. Figure 3 shows that the Jacobi-type method and the approximate Newton-like method reach almost identical accuracy, while the parallelised approximate Newton method performs slightly worse than the others.
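For completeness, the following sketch computes one common variant of the average SIR index from the overall system matrix G = X^⊤ V A. This is our illustrative reading, not necessarily the exact definition used in [2].

```python
import numpy as np

def average_sir_db(X, V, A):
    """Average SIR (in dB) of the overall system G = X^T V A.

    For a perfect separation G is a scaled permutation, so in every row
    the dominant entry carries the signal and the rest is interference.
    """
    G = X.T @ V @ A
    P = G**2
    sig = P.max(axis=1)                 # power of the dominant source per output
    interference = P.sum(axis=1) - sig  # power leaking in from the other sources
    return float(np.mean(10.0 * np.log10(sig / interference)))
```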

[Figure 3: Comparison of separation quality; SIR (dB) for Jacobi ICA, ANL ICA and PANNICA.]

[Figure 4: Comparison of convergence speed; number of iterations/sweeps for Jacobi ICA, ANL ICA and PANNICA.]

Comparing the convergence performance in Figure 4, the Jacobi-type method outperforms the other two methods: on average, it converges about 3 times faster.

5. ACKNOWLEDGMENT

National ICT Australia is funded by the Australian Government's Department of Communications, Information Technology and the Arts and by the Australian Research Council through Backing Australia's Ability and the ICT Research Centre of Excellence programs.

REFERENCES

[1] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.
[2] R. Boscolo, H. Pan, and V. P. Roychowdhury. Independent component analysis based on nonparametric density estimation. IEEE Transactions on Neural Networks, 15(1):55-65, 2004.
[3] J.-F. Cardoso. High-order contrasts for independent component analysis. Neural Computation, 11(1):157-192, 1999.
[4] A. Chen and P. J. Bickel. Consistent independent component analysis and prewhitening. IEEE Transactions on Signal Processing, 53(10), 2005.
[5] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287-314, 1994.
[6] J. Eriksson and V. Koivunen. Characteristic-function based independent component analysis. Signal Processing, 83(10):2195-2208, 2003.
[7] G. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 2nd edition, 1989.
[8] J. Götze, S. Paul, and M. Sauer. An efficient Jacobi-like algorithm for parallel eigenvalue computation. IEEE Transactions on Computers, 42(9):1058-1065, 1993.
[9] A. Gretton, R. Herbrich, A. J. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075-2129, 2005.
[10] K. Hüper. A Calculus Approach to Matrix Eigenvalue Algorithms. Habilitation Dissertation, Department of Mathematics, University of Würzburg, Germany, July 2002.
[11] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, New York, 2001.
[12] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483-1492, 1997.
[13] M. Kleinsteuber. Jacobi-Type Methods on Semisimple Lie Algebras: A Lie Algebraic Approach to the Symmetric Eigenvalue Problem. PhD thesis, Bayerische Julius-Maximilians-Universität Würzburg, 2006.
[14] E. G. Learned-Miller and J. W. Fisher III. ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4:1271-1295, 2003.
[15] J. J. Modi and J. D. Pryce. Efficient implementation of Jacobi's diagonalization method on the DAP. Numerische Mathematik, 46(3):443-454, 1985.
[16] C. Moler and C. van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review, 45(1):3-49, 2003.
[17] H. Shen and K. Hüper. Newton-like methods for parallel independent component analysis. In Proceedings of the 16th IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2006), Maynooth, Ireland, 2006.
[18] H. Shen, K. Hüper, and A.-K. Seghouane. Geometric optimisation and FastICA algorithms. In Proceedings of the 17th International Symposium of Mathematical Theory of Networks and Systems (MTNS 2006), Kyoto, Japan, 2006.
[19] H. Shen, K. Hüper, and A. J. Smola. Newton-like methods for nonparametric independent component analysis. In Proceedings of the 13th International Conference on Neural Information Processing, Part I (ICONIP 2006), Hong Kong, China, October 3-6, 2006, volume 4232 of Lecture Notes in Computer Science, Berlin/Heidelberg, 2006. Springer-Verlag.
[20] H. Stögbauer, A. Kraskov, S. Astakhov, and P. Grassberger. Least dependent component analysis based on mutual information. Physical Review E, 70(6):066123, 2004.
[21] V. Zarzoso and A. K. Nandi. Blind separation of independent sources for virtually any source probability density function. IEEE Transactions on Signal Processing, 47(9):2419-2432, 1999.
