Least-squares Independent Component Analysis


Neural Computation, vol.23, no.1, 2011.

Least-squares Independent Component Analysis

Taiji Suzuki
Department of Mathematical Informatics, The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan

Masashi Sugiyama
Department of Computer Science, Tokyo Institute of Technology
and PRESTO, Japan Science and Technology Agency
2-12-1 O-okayama, Meguro-ku, Tokyo, Japan

Abstract

Accurately evaluating statistical independence among random variables is a key element of Independent Component Analysis (ICA). In this paper, we employ a squared-loss variant of mutual information as an independence measure and give its estimation method. Our basic idea is to estimate the ratio of probability densities directly without going through density estimation, by which a hard task of density estimation can be avoided. In this density-ratio approach, a natural cross-validation procedure is available for hyper-parameter selection. Thus, all tuning parameters such as the kernel width or the regularization parameter can be objectively optimized. This is an advantage over recently developed kernel-based independence measures and is a highly useful property in unsupervised learning problems such as ICA. Based on this novel independence measure, we develop an ICA algorithm named Least-squares Independent Component Analysis (LICA).

1 Introduction

The purpose of Independent Component Analysis (ICA) (Hyvärinen et al., 2001) is to obtain a transformation matrix that separates mixed signals into statistically-independent source signals. A direct approach to ICA is to find a transformation matrix such that independence among separated signals is maximized under some independence measure such as mutual information (MI). A MATLAB(R) implementation of the proposed algorithm, LICA, is available online.

Various approaches to evaluating the independence among random variables from samples have been explored so far. A naive approach is to estimate probability densities based on parametric or non-parametric density estimation methods. However, finding an appropriate parametric model is not easy without strong prior knowledge, and non-parametric estimation is not accurate in high-dimensional problems. Thus this naive approach is not reliable in practice. Another approach is to approximate the negentropy (or negative entropy) based on the Gram-Charlier expansion (Cardoso & Souloumiac, 1993; Comon, 1994; Amari et al., 1996) or the Edgeworth expansion (Van Hulle, 2008). An advantage of this negentropy-based approach is that a hard task of density estimation is not directly involved. However, these expansion techniques are based on the assumption that the target density is close to normal, and violation of this assumption can cause large approximation error.

The above approaches are based on the probability densities of signals. Another line of research that does not explicitly involve probability densities employs non-linear correlation: signals are statistically independent if and only if all non-linear correlations among the signals vanish. Following this line, computationally efficient algorithms have been developed based on a contrast function (Jutten & Hérault, 1991; Hyvärinen, 1999), which is an approximation of negentropy or mutual information. However, these methods require pre-specifying non-linearities in the contrast function, and thus could be inaccurate if the predetermined non-linearities do not match the target distribution.
To cope with this problem, the kernel trick has been applied in ICA, which allows one to evaluate all non-linear correlations in a computationally efficient manner (Bach & Jordan, 2002). However, its practical performance depends on the choice of kernels (more specifically, the Gaussian kernel width), and there seems to be no theoretically justified method to determine the kernel width (see also Fukumizu et al., 2008). This is a critical problem in unsupervised learning problems such as ICA.

In this paper, we develop a new ICA algorithm that resolves the problems mentioned above. We adopt a squared-loss variant of MI (which we call squared-loss MI; SMI) as an independence measure and approximate it by estimating the ratio of probability densities contained in SMI directly, without going through density estimation. This approach, which follows the line of Sugiyama et al. (2008), Kanamori et al. (2009), and Nguyen et al. (2010), allows us to avoid a hard task of density estimation. Another practical advantage of this density-ratio approach is that a natural cross-validation (CV) procedure is available for hyper-parameter selection. Thus all tuning parameters such as the kernel width or the regularization parameter can be objectively and systematically optimized through CV. From an algorithmic point of view, our density-ratio approach analytically provides a non-parametric estimator of SMI; furthermore, its derivative can also be computed analytically, and these properties are utilized in deriving a new ICA algorithm. The proposed method is named Least-squares Independent Component Analysis (LICA). Characteristics of existing and proposed ICA methods are summarized in Table 1, highlighting the advantage of the proposed LICA approach.

The structure of this paper is as follows. In Section 2, we formulate our estimator of

Table 1: Summary of existing and proposed ICA methods.

  Method                                              | Hyper-parameter selection | Distribution
  Fast ICA (FICA) (Hyvärinen, 1999)                   | Not necessary             | Not free
  Natural-gradient ICA (NICA) (Amari et al., 1996)    | Not necessary             | Not free
  Kernel ICA (KICA) (Bach & Jordan, 2002)             | Not available             | Free
  Edgeworth-expansion ICA (EICA) (Van Hulle, 2008)    | Not necessary             | Nearly normal
  Least-squares ICA (LICA) (proposed)                 | Available                 | Free

SMI. In Section 3, we derive the LICA algorithm based on the SMI estimator. Section 4 is devoted to numerical experiments, where we show that our method properly estimates the true demixing matrix using toy datasets, and compare the performances of the proposed and existing methods on artificial and real datasets.

2 SMI Estimation for ICA

In this section, we formulate the ICA problem and introduce our independence measure, SMI. Then we give an estimation method of SMI and derive an ICA algorithm.

2.1 Problem Formulation

Suppose there is a d-dimensional random signal x = (x^(1), ..., x^(d))^T drawn from a distribution with density p(x), where {x^(m)}_{m=1}^d are statistically independent of each other, and ^T denotes the transpose of a matrix or a vector. Thus, p(x) can be factorized as

  p(x) = Π_{m=1}^d p_m(x^(m)).

We cannot directly observe the source signal x, but only a linearly mixed signal y:

  y = (y^(1), ..., y^(d))^T := A x,

where A is a d x d invertible matrix called the mixing matrix. The goal of ICA is, given samples {y_i}_{i=1}^n of the mixed signals, to obtain a demixing matrix W that recovers the original source signal x. We denote the demixed signal by z:

  z = W y.
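As a quick numerical illustration of the mixing model above (our own sketch, not the authors' code; all variable names are ours), mixing independent non-Gaussian sources with an invertible A and demixing with the ideal W = A^{-1} recovers the sources exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two statistically independent, non-Gaussian sources (uniform and Laplace).
x = np.vstack([rng.uniform(-0.5, 0.5, n),
               rng.laplace(0.0, 1.0, n)])          # shape (d, n)

theta = np.pi / 4
A = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])    # invertible mixing matrix

y = A @ x                  # observed mixed signals y = A x
W = np.linalg.inv(A)       # ideal demixing matrix W = A^{-1}
z = W @ y                  # demixed signals z = W y

print(np.allclose(z, x))   # -> True
```

In practice A is unknown, so W must be estimated from {y_i} alone, and only up to permutation and scaling of the components; that estimation is what LICA does.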

The ideal solution is W = A^{-1}, but we can only recover the source signals up to permutation and scaling of the components of x due to the non-identifiability of the ICA setup (Hyvärinen et al., 2001).

A direct approach to ICA is to determine W so that the components of z are as independent as possible. Here, we adopt SMI as the independence measure:

  I_s(Z^(1), ..., Z^(d)) := (1/2) ∫ ( q(z)/r(z) − 1 )^2 r(z) dz,   (1)

where q(z) denotes the joint density of z and r(z) denotes the product of the marginal densities {q_m(z^(m))}_{m=1}^d:

  r(z) = Π_{m=1}^d q_m(z^(m)).

Note that SMI is the Pearson divergence (Pearson, 1900; Paninski, 2003; Liese & Vajda, 2006; Cichocki et al., 2009) between q(z) and r(z), while ordinary MI is the Kullback-Leibler divergence (Kullback & Leibler, 1951). Since I_s is non-negative and vanishes if and only if q(z) = r(z), the degree of independence among {z^(m)}_{m=1}^d may be measured by SMI. Note that Eq.(1) corresponds to the f-divergence (Ali & Silvey, 1966; Csiszár, 1967) between q(z) and r(z) with the squared loss, while ordinary MI corresponds to the f-divergence with the log loss. Thus SMI could be regarded as a natural generalization of ordinary MI.

Based on the independence detection property of SMI, we try to find the demixing matrix W that minimizes SMI. Let us denote the demixed samples by {z_i | z_i = (z_i^(1), ..., z_i^(d))^T := W y_i}_{i=1}^n. Our key constraint when estimating SMI is that we want to avoid density estimation since it is a hard task (Vapnik, 1998). Below, we show how this could be accomplished.

2.2 SMI Approximation via Density Ratio Estimation

We approximate SMI via density ratio estimation. Let us denote the ratio of the densities q(z) and r(z) by

  g*(z) := q(z)/r(z).   (2)

Then SMI can be written as

  I_s(Z^(1), ..., Z^(d)) = (1/2) ∫ ( g*(z) − 1 )^2 r(z) dz
                         = (1/2) ∫ ( g*(z)^2 r(z) − 2 g*(z) r(z) + r(z) ) dz
                         = (1/2) ∫ ( g*(z) q(z) − 2 q(z) + r(z) ) dz
                         = (1/2) ∫ g*(z) q(z) dz − 1/2.   (3)

Therefore, SMI can be approximated through the estimation of ∫ g*(z) q(z) dz, the expectation of g*(z) over q(z). This can be achieved by taking the sample average of an estimator ĝ(z) of the density ratio g*(z):

  Î_s = (1/(2n)) Σ_{i=1}^n ĝ(z_i) − 1/2.   (4)

We take the least-squares approach to estimating the density ratio g*(z):

  inf_g (1/2) ∫ ( g(z) − g*(z) )^2 r(z) dz
    = inf_g [ (1/2) ∫ g(z)^2 r(z) dz − ∫ g(z) q(z) dz ] + constant,

where inf_g is taken over all measurable functions. Obviously, the optimal solution is the density ratio g*. Thus computing I_s is now reduced to solving the following optimization problem:

  inf_g [ (1/2) ∫ g(z)^2 r(z) dz − ∫ g(z) q(z) dz ].   (5)

However, directly solving problem (5) is not possible due to the following two reasons. The first reason is that finding the minimizer over all measurable functions is not tractable in practice since the search space is too vast. To overcome this problem, we restrict the search space to some linear subspace G:

  G = { α^T φ(z) | α = (α_1, ..., α_b)^T ∈ R^b },   (6)

where α is a parameter to be learned from samples, and φ(z) is a basis function vector such that φ(z) = (φ_1(z), ..., φ_b(z))^T ≥ 0_b for all z. Here 0_b denotes the b-dimensional vector with all zeros. Note that φ(z) could be dependent on the samples {z_i}_{i=1}^n, i.e., kernel models are also allowed. We explain how the basis functions φ(z) are chosen in Section 2.3.

The second reason why directly solving problem (5) is not possible is that the expectations over the true probability densities q(z) and r(z) cannot be computed since q(z) and r(z) are unknown. To cope with this problem, we approximate the expectations by their empirical averages; then the optimization problem is reduced to

  α̂ := argmin_{α ∈ R^b} [ (1/2) α^T Ĥ α − ĥ^T α + λ α^T R α ],   (7)

where we included the term λ α^T R α (λ > 0) for avoiding overfitting. λ is called the regularization

parameter, and R is some positive definite matrix. Ĥ and ĥ are defined as

  Ĥ := (1/n^d) Σ_{i_1,...,i_d=1}^n φ(z_{i_1}^(1), ..., z_{i_d}^(d)) φ(z_{i_1}^(1), ..., z_{i_d}^(d))^T,   (8)

  ĥ := (1/n) Σ_{i=1}^n φ(z_i^(1), ..., z_i^(d)).   (9)

Differentiating the objective function in Eq.(7) with respect to α and equating it to zero, we can obtain an analytic-form solution:

  α̂ = (Ĥ + λR)^{-1} ĥ.

Thus, the solution can be computed very efficiently just by solving a system of linear equations. Once the density ratio (2) has been estimated, SMI can be approximated by plugging the estimated density ratio ĝ(z) = α̂^T φ(z) into Eq.(4):

  Î_s = (1/2) ĥ^T α̂ − 1/2.   (10)

Note that we may obtain various expressions of SMI using the following identities:

  ∫ g*(z)^2 r(z) dz = ∫ g*(z) q(z) dz,   ∫ g*(z) r(z) dz = ∫ q(z) dz = 1.

Ordinary MI based on the Kullback-Leibler divergence can also be estimated similarly using the density ratio (Suzuki et al., 2008). However, the use of SMI is more advantageous due to the analytic-form solution, as described in Section 3.

2.3 Design of Basis Functions and Hyper-parameter Selection

As basis functions, we propose to use Gaussian kernels:

  φ_l(z) = exp( −||z − v_l||^2 / (2σ^2) ) = Π_{m=1}^d exp( −(z^(m) − v_l^(m))^2 / (2σ^2) ),   (11)

where {v_l | v_l = (v_l^(1), ..., v_l^(d))^T}_{l=1}^b are Gaussian centers randomly chosen from {z_i}_{i=1}^n; more precisely, we set v_l = z_{c_l}, where {c_l}_{l=1}^b are randomly chosen from {1, ..., n} without replacement. An advantage

of the Gaussian kernel lies in its factorizability in Eq.(11), which contributes to reducing the computational cost of the matrix Ĥ significantly:

  Ĥ_{l,l'} = Π_{m=1}^d [ (1/n) Σ_{i=1}^n exp( −( (z_i^(m) − v_l^(m))^2 + (z_i^(m) − v_{l'}^(m))^2 ) / (2σ^2) ) ].

We use the RKHS (Reproducing Kernel Hilbert Space) norm of α^T φ(z) induced by the Gaussian kernel as the regularization term α^T R α, which is a popular choice in the kernel method community (Schölkopf & Smola, 2002):

  R_{l,l'} = exp( −||v_l − v_{l'}||^2 / (2σ^2) ).   (12)

In the experiments, we fix the number of basis functions at b = min(300, n), and choose the Gaussian width σ and the regularization parameter λ by CV with grid search as follows. First, the samples {z_i}_{i=1}^n are divided into K disjoint subsets {Z_k}_{k=1}^K of approximately the same size (we use K = 5 in the experiments). Then an estimator α̂_{Z\Z_k} is obtained using Z\Z_k (i.e., Z without Z_k), and the approximation error for the hold-out samples Z_k is computed:

  J^(K-CV)(Z_k) = (1/2) α̂_{Z\Z_k}^T Ĥ_{Z_k} α̂_{Z\Z_k} − ĥ_{Z_k}^T α̂_{Z\Z_k},

where the matrix Ĥ_{Z_k} and the vector ĥ_{Z_k} are defined in the same way as Ĥ and ĥ, but computed only using Z_k. This procedure is repeated for k = 1, ..., K and the average is computed:

  J^(K-CV) = (1/K) Σ_{k=1}^K J^(K-CV)(Z_k).

For parameter selection, we compute J^(K-CV) for all hyper-parameter candidates (the Gaussian width σ and the regularization parameter λ in the current setting) and choose the parameter that minimizes J^(K-CV). We can show that J^(K-CV) is an almost unbiased estimator of the objective function in Eq.(5), where the "almost"-ness comes from the fact that the number of samples is reduced in the CV procedure due to data splitting (Geisser, 1975; Kohavi, 1995).

3 The LICA Algorithms

In this section, we show how the above SMI estimation idea could be employed in the context of ICA.
Here, we derive two algorithms, which we call Least-squares Independent Component Analysis (LICA), for obtaining a minimizer of Î_s with respect to the demixing matrix W: one is based on a plain gradient method (which we refer to as PG-LICA) and the other is based on a natural gradient method for whitened samples (which we refer to as NG-LICA).
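Before turning to the two algorithms, the SMI estimator of Section 2 (Eqs. (7)-(10) with the Gaussian-kernel design of Section 2.3) can be sketched compactly. This is our own illustrative implementation, not the authors' code; the function and variable names are hypothetical:

```python
import numpy as np

def smi_estimate(z, sigma=0.5, lam=1e-3, b=50, seed=0):
    """Least-squares SMI estimate from demixed samples z of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = z.shape
    b = min(b, n)
    centers = z[rng.choice(n, b, replace=False)]   # Gaussian centers v_l = z_{c_l}

    # Per-dimension kernel values K[m][i, l] = exp(-(z_i^(m) - v_l^(m))^2 / (2 sigma^2)).
    K = [np.exp(-(z[:, m:m + 1] - centers[:, m]) ** 2 / (2 * sigma ** 2))
         for m in range(d)]

    # h_l = (1/n) sum_i phi_l(z_i), where phi_l factorizes over dimensions (Eq. (9)).
    Phi = np.prod(K, axis=0)                       # (n, b)
    h = Phi.mean(axis=0)

    # H_{l,l'} also factorizes over dimensions for the Gaussian kernel (Eq. (8)).
    H = np.ones((b, b))
    for m in range(d):
        H *= (K[m].T @ K[m]) / n

    # RKHS regularizer R_{l,l'} = exp(-||v_l - v_{l'}||^2 / (2 sigma^2)) (Eq. (12)).
    diff = centers[:, None, :] - centers[None, :, :]
    R = np.exp(-(diff ** 2).sum(axis=-1) / (2 * sigma ** 2))

    alpha = np.linalg.solve(H + lam * R, h)        # analytic solution of Eq. (7)
    return 0.5 * h @ alpha - 0.5                   # Eq. (10)
```

The estimate is near zero for independent components and grows as the components become dependent; in the full method, sigma and lam would be chosen by the CV procedure of Section 2.3 rather than fixed.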

3.1 Plain Gradient Algorithm: PG-LICA

Based on the plain gradient technique, an update rule of W is given by

  W ← W − ε ∂Î_s/∂W,   (13)

where ε > 0 is the step size. As shown in the Appendix, the gradient is given by

  ∂Î_s/∂W_{k,k'} = (∂ĥ/∂W_{k,k'})^T α̂ − (1/2) α̂^T ( ∂Ĥ/∂W_{k,k'} + λ ∂R/∂W_{k,k'} ) α̂,   (14)

where, for u_l = y_{c_l} and y_i = (y_i^(1), ..., y_i^(d))^T,

  ∂ĥ_l/∂W_{k,k'} = (1/(nσ^2)) Σ_{i=1}^n (z_i^(k) − v_l^(k)) (u_l^(k') − y_i^(k')) exp( −||z_i − v_l||^2 / (2σ^2) ),   (15)

  ∂Ĥ_{l,l'}/∂W_{k,k'} = Π_{m≠k} [ (1/n) Σ_{i=1}^n exp( −( (z_i^(m) − v_l^(m))^2 + (z_i^(m) − v_{l'}^(m))^2 ) / (2σ^2) ) ]
      × (1/(nσ^2)) Σ_{i=1}^n [ (z_i^(k) − v_l^(k))(u_l^(k') − y_i^(k')) + (z_i^(k) − v_{l'}^(k))(u_{l'}^(k') − y_i^(k')) ]
        exp( −( (z_i^(k) − v_l^(k))^2 + (z_i^(k) − v_{l'}^(k))^2 ) / (2σ^2) ).   (16)

For the regularization matrix R defined by Eq.(12), the partial derivative is given by

  ∂R_{l,l'}/∂W_{k,k'} = −(1/σ^2) (v_l^(k) − v_{l'}^(k)) (u_l^(k') − u_{l'}^(k')) exp( −||v_l − v_{l'}||^2 / (2σ^2) ).

In ICA, the scaling of the components of z can be arbitrary. This implies that the above gradient updating rule can lead to a solution with poor scaling, which is not preferable from a numerical point of view. To avoid possible numerical instability, we normalize W at each gradient iteration as

  W_{k,k'} ← W_{k,k'} / sqrt( Σ_{m=1}^d W_{k,m}^2 ).   (17)

In practice, we may iteratively perform line search along the gradient and optimize the Gaussian width σ and the regularization parameter λ by CV. A pseudo code of the PG-LICA algorithm is summarized in Figure 1.
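A single PG-LICA iteration, i.e., the gradient update followed by row normalization, can be sketched as follows (our own sketch; `grad` stands for the gradient of Eq. (14), here left as a caller-supplied array):

```python
import numpy as np

def pg_step(W, grad, eps):
    """One plain-gradient update (Eq. (13)) followed by the row
    normalization of Eq. (17), which fixes the scaling ambiguity."""
    W = W - eps * grad
    return W / np.sqrt((W ** 2).sum(axis=1, keepdims=True))
```

Every row of the returned matrix has unit Euclidean norm, so the scale of each demixed component stays fixed across iterations.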

1. Initialize the demixing matrix W and normalize it by Eq.(17).
2. Optimize the Gaussian width σ and the regularization parameter λ by CV.
3. Compute the gradient ∂Î_s/∂W by Eq.(14).
4. Choose the step size ε such that Î_s (see Eq.(10)) is minimized (line search).
5. Update W by Eq.(13).
6. Normalize W by Eq.(17).
7. Repeat until W converges.

Figure 1: The LICA algorithm with plain gradient descent (PG-LICA).

3.2 Natural Gradient Algorithm for Whitened Data: NG-LICA

The second algorithm is based on a natural gradient technique (Amari, 1998). Suppose the data samples are whitened, i.e., the samples {y_i}_{i=1}^n are transformed as

  y_i ← Ĉ^{-1/2} y_i,   (18)

where Ĉ is the sample covariance matrix:

  Ĉ := (1/n) Σ_{i=1}^n ( y_i − (1/n) Σ_{j=1}^n y_j ) ( y_i − (1/n) Σ_{j=1}^n y_j )^T.

Then it can be shown that a demixing matrix which eliminates the second-order correlation is an orthogonal matrix (Hyvärinen et al., 2001). Thus, for whitened data, the search space of W can be restricted to the orthogonal group O(d) without loss of generality.

The tangent space of O(d) at W is equal to the space of all matrices U such that W^T U is skew-symmetric, i.e., U^T W = −W^T U. The steepest direction on this tangent space, which is called the natural gradient, is given as follows (Amari, 1998):

  ∇̃Î_s(W) := ∂Î_s/∂W − W (∂Î_s/∂W)^T W,   (19)

where the canonical metric ⟨G_1, G_2⟩ = (1/2) tr(G_1^T G_2) is adopted in the tangent space. Then the geodesic from W in the direction of the natural gradient over O(d) can be expressed as W exp(−t W^T ∇̃Î_s(W)), where t ∈ R and exp denotes the matrix exponential, i.e., for a square matrix D,

  exp(D) = Σ_{k=0}^∞ (1/k!) D^k.
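The whitening step and one geodesic update can be sketched as follows (our own sketch under the formulas above; function names are ours, and the gradient of Î_s is again left abstract as a caller-supplied array):

```python
import numpy as np
from scipy.linalg import expm

def whiten(Y):
    """Center and whiten the samples (rows of Y are observations), Eq. (18)."""
    Yc = Y - Y.mean(axis=0)
    C = Yc.T @ Yc / len(Yc)                       # sample covariance
    evals, evecs = np.linalg.eigh(C)
    C_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return Yc @ C_inv_sqrt                        # y_i <- C^{-1/2} y_i

def ng_step(W, grad, t):
    """One geodesic step on the orthogonal group O(d), Eqs. (19)-(21).

    grad is the plain gradient dI_s/dW; the natural gradient is
    grad - W grad^T W, and W^T times it is skew-symmetric, so the
    matrix exponential below keeps W exactly orthogonal."""
    nat = grad - W @ grad.T @ W
    return W @ expm(-t * W.T @ nat)
```

Because the update multiplies W by the exponential of a skew-symmetric matrix, orthogonality is preserved to machine precision, with no re-orthogonalization step needed during the line search over t.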

1. Whiten the data samples by Eq.(18).
2. Initialize the demixing matrix W and normalize it by Eq.(17).
3. Optimize the Gaussian width σ and the regularization parameter λ by CV.
4. Compute the natural gradient ∇̃Î_s by Eq.(19).
5. Choose the step size t such that Î_s (see Eq.(10)) is minimized over the set (20).
6. Update W by Eq.(21).
7. Repeat until W converges.

Figure 2: The LICA algorithm with natural gradient descent (NG-LICA).

Thus, when we perform line search along the geodesic in the natural gradient direction, the minimizer may be searched for within the set

  { W exp( −t W^T ∇̃Î_s(W) ) | t ≥ 0 },   (20)

i.e., t is chosen such that Î_s (see Eq.(10)) is minimized, and W is updated as

  W ← W exp( −t W^T ∇̃Î_s(W) ).   (21)

Geometry and optimization algorithms on a more general structure, the Stiefel manifold, are discussed in more detail in Nishimori and Akaho (2005). A pseudo code of the NG-LICA algorithm is summarized in Figure 2.

3.3 Remarks

The proposed LICA algorithms can be regarded as an application of the general unconstrained least-squares density-ratio estimator proposed by Kanamori et al. (2009) to SMI in the context of ICA. The optimization problem (5) can also be obtained following the line of Nguyen et al. (2010), which addresses a divergence estimation problem utilizing the Legendre-Fenchel duality. SMI defined by Eq.(1) can be expressed as

  I_s(Z^(1), ..., Z^(d)) = (1/2) ∫ ( q(z)/r(z) )^2 r(z) dz − 1/2.   (22)

If the Legendre-Fenchel duality of the convex function (1/2)x^2,

  (1/2) x^2 = sup_y [ yx − (1/2) y^2 ],

is applied to (1/2)(q(z)/r(z))^2 in Eq.(22) in a pointwise manner, we have

  I_s(Z^(1), ..., Z^(d)) = sup_g [ ∫ (q(z)/r(z)) g(z) r(z) dz − (1/2) ∫ g(z)^2 r(z) dz ] − 1/2
                         = −inf_g [ (1/2) ∫ g(z)^2 r(z) dz − ∫ g(z) q(z) dz ] − 1/2,

where sup_g and inf_g are taken over all measurable functions.

SMI is closely related to the kernel independence measures developed recently (Gretton et al., 2005a; Gretton et al., 2005b; Fukumizu et al., 2008). In particular, it has been shown that the NOrmalized Cross-Covariance Operator (NOCCO) proposed in Fukumizu et al. (2008) is also an estimator of SMI for d = 2. However, there is no reasonable hyper-parameter selection method for this and all other kernel-based independence measures (see also Bach & Jordan, 2002 and Fukumizu et al., 2008). This is a crucial limitation in unsupervised learning scenarios such as ICA. On the other hand, cross-validation can be applied to our method for hyper-parameter selection, as shown in Section 2.3.

4 Experiments

In this section, we investigate the experimental performance of the proposed method.

4.1 Illustrative Examples

First, we illustrate how the proposed method behaves using the following three 2-dimensional datasets:

(a) (Sub-Sub-Gaussians): p(x) = U(x^(1); −0.5, 0.5) U(x^(2); −0.5, 0.5),
(b) (Super-Super-Gaussians): p(x) = L(x^(1); 0, 1) L(x^(2); 0, 1),
(c) (Sub-Super-Gaussians): p(x) = U(x^(1); −0.5, 0.5) L(x^(2); 0, 1),

where U(x; a, b) (a, b ∈ R, a < b) denotes the uniform density on [a, b] and L(x; µ, v) (µ ∈ R, v > 0) denotes the Laplace density with mean µ and variance v. Let the number of samples be n = 300, and suppose we observe mixed samples {y_i}_{i=1}^n through the following mixing matrix:

  A = [ cos(π/4)  sin(π/4) ; −sin(π/4)  cos(π/4) ] = (1/√2) [ 1  1 ; −1  1 ].

The observed samples are plotted in Figure 3. We employed the NG-LICA algorithm described in Figure 2. The hyper-parameters σ and λ in LICA were chosen by 5-fold CV from 10 values in [0.1, 1] at regular intervals and 10 values in [0.001, 1] at regular intervals in log scale, respectively. The regularization term was set to the squared RKHS norm induced by the Gaussian kernel, i.e., we employed R defined by Eq.(12).

Figure 3: Observed samples (asterisks), true independent directions (dotted lines), and estimated independent directions (solid lines) for the datasets (a), (b), and (c).

Figure 4: The value of Î_s over iterations for the datasets (a), (b), and (c).

Figure 5: The elements of the demixing matrix W over iterations for the datasets (a), (b), and (c). Solid lines correspond to W_{1,1}, W_{1,2}, W_{2,1}, and W_{2,2}, respectively. The dotted lines denote the true values.

The true independent directions as well as the estimated independent directions are plotted in Figure 3. Figure 4 depicts the value of the estimated SMI (10) over iterations, and Figure 5 depicts the elements of the demixing matrix W over iterations. The results show that the estimated SMI decreases rapidly and good solutions are obtained for all the datasets. The reason why the estimated SMI in Figure 4 does not decrease monotonically is that during the natural gradient optimization procedure, the hyper-parameters λ and σ are adjusted by CV (see Figure 2), which possibly causes an increase in the objective values.

4.2 Performance Comparison

Here we compare our method with some existing methods (KICA, FICA, and JADE (Cardoso & Souloumiac, 1993)) on artificial and real datasets. We used the datasets (a), (b), and (c) in Section 4.1, the demosig dataset available in the FastICA package for MATLAB(R), and the 10halo, Sergio7, Speech4, and c5signals datasets available in the ICALAB signal processing benchmark datasets (Cichocki & Amari, 2002). The datasets (a), (b), (c), demosig, Sergio7, and c5signals are artificial datasets. The datasets 10halo and Speech4 are real datasets.

We employed the Amari index (Amari et al., 1996) as the performance measure (smaller is better):

  Amari index := (1/(2d(d−1))) Σ_{m,m'=1}^d ( |o_{m,m'}| / max_{m''} |o_{m,m''}| + |o_{m,m'}| / max_{m''} |o_{m'',m'}| ) − 1/(d−1),

where o_{m,m'} := [Ŵ A]_{m,m'} for an estimated demixing matrix Ŵ. We used the publicly available MATLAB(R) codes for KICA [3] and for FICA and JADE [4], where default parameter settings were used. The hyper-parameters σ and λ in LICA were chosen by 5-fold CV from 10 values in [0.1, 1] at regular intervals and 10 values in [0.001, 1] at regular intervals in log scale, respectively. R was set as in Eq.(12). We randomly generated the mixing matrix A and the source signals for the artificial datasets, and computed the Amari index between the true A and the Ŵ estimated by each method.
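The Amari index above can be sketched in a few lines (our own sketch, not the authors' MATLAB code); it is zero exactly when ŴA is a scaled permutation matrix, i.e., when separation is perfect up to the inherent ICA ambiguities:

```python
import numpy as np

def amari_index(W_hat, A):
    """Amari performance index; zero iff W_hat @ A is a scaled permutation."""
    O = np.abs(W_hat @ A)
    d = O.shape[0]
    row = (O / O.max(axis=1, keepdims=True)).sum()
    col = (O / O.max(axis=0, keepdims=True)).sum()
    return (row + col) / (2 * d * (d - 1)) - 1.0 / (d - 1)

# Perfect separation up to permutation and scaling gives index 0:
A = np.array([[1.0, 2.0], [3.0, 4.0]])
P = np.array([[0.0, 2.0], [-5.0, 0.0]])        # a scaled permutation matrix
print(amari_index(P @ np.linalg.inv(A), A))    # -> 0.0
```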
As training samples, we used the first n samples for Sergio7 and c5signals, and the n samples between the 1001st and the (1000+n)-th for 10halo and Speech4, where we tested n = 200 and 500. The performance of each method is summarized in Table 2, which depicts the mean and standard deviation of the Amari index over 50 trials. NG-LICA overall shows good performance. KICA tends to work reasonably well for the datasets (a), (b), (c), and demosig, but it performs poorly for the ICALAB datasets; this seems to be caused by an inappropriate choice of the Gaussian kernel width and local optima. On the other hand, FICA and JADE tend to work reasonably well for the ICALAB datasets, but perform poorly

[3] fbach/kernel-ica/index.htm
[4] cardoso/guidesepsou.htm

Table 2: Mean and standard deviation of the Amari index (smaller is better) for the benchmark datasets. The datasets (a), (b), and (c) are taken from Section 4.1. The demosig dataset is taken from the FastICA package. The 10halo, Sergio7, Speech4, and c5signals datasets are taken from the ICALAB benchmark datasets. The best method in terms of the mean Amari index and comparable ones based on the one-sided t-test at the significance level 1% are indicated by boldface. (Columns: dataset, n, NG-LICA, KICA, FICA, JADE; rows: each dataset with n = 200 and 500.)

for the datasets (a), (b), (c), and demosig; we conjecture that the contrast functions in FICA and the fourth-order statistics in JADE did not appropriately capture the non-Gaussianity of the datasets (a), (b), (c), and demosig. Overall, the proposed LICA algorithm is shown to be a promising ICA method.

5 Conclusions

In this paper, we proposed a new ICA method based on a squared-loss variant of mutual information. The proposed method, named least-squares ICA (LICA), has several preferable properties, e.g., it is distribution-free, and hyper-parameter selection by cross-validation is available.

Similarly to other ICA algorithms, the optimization problem involved in LICA is non-convex. Thus it is practically very important to develop good heuristics for initialization and for avoiding local optima in the gradient procedures, which is an open research topic to be investigated. Moreover, although our SMI estimator is analytic, the LICA algorithm is still computationally rather expensive due to the linear equations and cross-validation. Our

future work will address the computational issue, e.g., by vectorization and parallelization.

Appendix: Derivation of the Gradient of the SMI Estimator

Here we show the derivation of the gradient (14) of the SMI estimator (10). Since Î_s = (1/2) ĥ^T α̂ − 1/2 (see Eq.(10)), the derivative of Î_s with respect to W_{k,k'} is given as follows:

  ∂Î_s/∂W_{k,k'} = (1/2) (∂ĥ/∂W_{k,k'})^T α̂ + (1/2) ĥ^T (∂α̂/∂W_{k,k'}).   (23)

Recall that

  dB(x)^{-1}/dx = −B(x)^{-1} (dB(x)/dx) B(x)^{-1}

for an arbitrary invertible matrix function B(x). Then the partial derivative of α̂ = (Ĥ + λR)^{-1} ĥ with respect to W_{k,k'} is given by

  ∂α̂/∂W_{k,k'} = −(Ĥ + λR)^{-1} ( ∂Ĥ/∂W_{k,k'} + λ ∂R/∂W_{k,k'} ) (Ĥ + λR)^{-1} ĥ + (Ĥ + λR)^{-1} ∂ĥ/∂W_{k,k'}
               = −(Ĥ + λR)^{-1} ( ∂Ĥ/∂W_{k,k'} + λ ∂R/∂W_{k,k'} ) α̂ + (Ĥ + λR)^{-1} ∂ĥ/∂W_{k,k'}.

Substituting this in Eq.(23), we have

  ∂Î_s/∂W_{k,k'} = (1/2) (∂ĥ/∂W_{k,k'})^T α̂ − (1/2) α̂^T ( ∂Ĥ/∂W_{k,k'} + λ ∂R/∂W_{k,k'} ) α̂ + (1/2) α̂^T (∂ĥ/∂W_{k,k'})
                 = (∂ĥ/∂W_{k,k'})^T α̂ − (1/2) α̂^T (∂Ĥ/∂W_{k,k'}) α̂ − (λ/2) α̂^T (∂R/∂W_{k,k'}) α̂,

which gives Eq.(14).

Acknowledgments

The authors would like to thank Dr. Takafumi Kanamori for his valuable comments. T.S. was supported in part by the JSPS Research Fellowships for Young Scientists and the Global COE Program "The research and training center for new development in mathematics", MEXT, Japan. M.S. acknowledges support from SCAT, AOARD, and the JST PRESTO program.

References

Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28, 131–142.

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10.

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems. MIT Press.

Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.

Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non-Gaussian signals. Radar and Signal Processing, IEE Proceedings-F, 140.

Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing: Learning algorithms and applications. Wiley.

Cichocki, A., Zdunek, R., Phan, A. H., & Amari, S. (2009). Non-negative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. New York: Wiley.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36.

Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2.

Fukumizu, K., Bach, F. R., & Jordan, M. I. (2009). Kernel dimension reduction in regression. The Annals of Statistics, 37.

Fukumizu, K., Gretton, A., Sun, X., & Schölkopf, B. (2008). Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20. Cambridge, MA: MIT Press.

Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70.

Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005a). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory. Berlin: Springer-Verlag.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., & Schölkopf, B. (2005b). Kernel methods for measuring independence. Journal of Machine Learning Research, 6.

Hulle, M. M. Van (2008). Sequential fixed-point ICA based on mutual information minimization. Neural Computation, 20.

Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.

Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann.

Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22.

Liese, F., & Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52.

Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, to appear.

Nishimori, Y., & Akaho, S. (2005). Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing, 67.

Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15.

Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60.

Suzuki, T., Sugiyama, M., Sese, J., & Kanamori, T. (2008). Approximating mutual information by maximum likelihood density ratio estimation.
In New Challenges for Feature Selection in Data Mining and Knowledge Discovery.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.


More information

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones ASummaryofGaussianProcesses Coryn A.L. Baier-Jones Cavendish Laboratory University of Cambridge caj@mrao.cam.ac.uk Introduction A genera prediction probem can be posed as foows. We consider that the variabe

More information

The EM Algorithm applied to determining new limit points of Mahler measures

The EM Algorithm applied to determining new limit points of Mahler measures Contro and Cybernetics vo. 39 (2010) No. 4 The EM Agorithm appied to determining new imit points of Maher measures by Souad E Otmani, Georges Rhin and Jean-Marc Sac-Épée Université Pau Veraine-Metz, LMAM,

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 3 3 33 34 35 36 37 38 39 4 4 4 43 44 45 46 47 48 49 5 5 5 53 54 Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing

More information

SVM-based Supervised and Unsupervised Classification Schemes

SVM-based Supervised and Unsupervised Classification Schemes SVM-based Supervised and Unsupervised Cassification Schemes LUMINITA STATE University of Pitesti Facuty of Mathematics and Computer Science 1 Targu din Vae St., Pitesti 110040 ROMANIA state@cicknet.ro

More information

A Brief Introduction to Markov Chains and Hidden Markov Models

A Brief Introduction to Markov Chains and Hidden Markov Models A Brief Introduction to Markov Chains and Hidden Markov Modes Aen B MacKenzie Notes for December 1, 3, &8, 2015 Discrete-Time Markov Chains You may reca that when we first introduced random processes,

More information

Nonlinear Gaussian Filtering via Radial Basis Function Approximation

Nonlinear Gaussian Filtering via Radial Basis Function Approximation 51st IEEE Conference on Decision and Contro December 10-13 01 Maui Hawaii USA Noninear Gaussian Fitering via Radia Basis Function Approximation Huazhen Fang Jia Wang and Raymond A de Caafon Abstract This

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing theorem to characterize the stationary distribution of the stochastic process with SDEs in (3) Theorem 3

More information

Statistics for Applications. Chapter 7: Regression 1/43

Statistics for Applications. Chapter 7: Regression 1/43 Statistics for Appications Chapter 7: Regression 1/43 Heuristics of the inear regression (1) Consider a coud of i.i.d. random points (X i,y i ),i =1,...,n : 2/43 Heuristics of the inear regression (2)

More information

Do Schools Matter for High Math Achievement? Evidence from the American Mathematics Competitions Glenn Ellison and Ashley Swanson Online Appendix

Do Schools Matter for High Math Achievement? Evidence from the American Mathematics Competitions Glenn Ellison and Ashley Swanson Online Appendix VOL. NO. DO SCHOOLS MATTER FOR HIGH MATH ACHIEVEMENT? 43 Do Schoos Matter for High Math Achievement? Evidence from the American Mathematics Competitions Genn Eison and Ashey Swanson Onine Appendix Appendix

More information

An approximate method for solving the inverse scattering problem with fixed-energy data

An approximate method for solving the inverse scattering problem with fixed-energy data J. Inv. I-Posed Probems, Vo. 7, No. 6, pp. 561 571 (1999) c VSP 1999 An approximate method for soving the inverse scattering probem with fixed-energy data A. G. Ramm and W. Scheid Received May 12, 1999

More information

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Makoto Yamada and Masashi Sugiyama Department of Computer Science, Tokyo Institute of Technology

More information

Determining The Degree of Generalization Using An Incremental Learning Algorithm

Determining The Degree of Generalization Using An Incremental Learning Algorithm Determining The Degree of Generaization Using An Incrementa Learning Agorithm Pabo Zegers Facutad de Ingeniería, Universidad de os Andes San Caros de Apoquindo 22, Las Condes, Santiago, Chie pzegers@uandes.c

More information

An Algorithm for Pruning Redundant Modules in Min-Max Modular Network

An Algorithm for Pruning Redundant Modules in Min-Max Modular Network An Agorithm for Pruning Redundant Modues in Min-Max Moduar Network Hui-Cheng Lian and Bao-Liang Lu Department of Computer Science and Engineering, Shanghai Jiao Tong University 1954 Hua Shan Rd., Shanghai

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

Two view learning: SVM-2K, Theory and Practice

Two view learning: SVM-2K, Theory and Practice Two view earning: SVM-2K, Theory and Practice Jason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Tayor jst@ecs.soton.ac.uk

More information

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations Optimaity of Inference in Hierarchica Coding for Distributed Object-Based Representations Simon Brodeur, Jean Rouat NECOTIS, Département génie éectrique et génie informatique, Université de Sherbrooke,

More information

Deep Gaussian Processes for Multi-fidelity Modeling

Deep Gaussian Processes for Multi-fidelity Modeling Deep Gaussian Processes for Muti-fideity Modeing Kurt Cutajar EURECOM Sophia Antipois, France Mark Puin Andreas Damianou Nei Lawrence Javier Gonzáez Abstract Muti-fideity modes are prominenty used in various

More information

Cryptanalysis of PKP: A New Approach

Cryptanalysis of PKP: A New Approach Cryptanaysis of PKP: A New Approach Éiane Jaumes and Antoine Joux DCSSI 18, rue du Dr. Zamenhoff F-92131 Issy-es-Mx Cedex France eiane.jaumes@wanadoo.fr Antoine.Joux@ens.fr Abstract. Quite recenty, in

More information

II. PROBLEM. A. Description. For the space of audio signals

II. PROBLEM. A. Description. For the space of audio signals CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time

More information

Some Measures for Asymmetry of Distributions

Some Measures for Asymmetry of Distributions Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester

More information

A Comparison Study of the Test for Right Censored and Grouped Data

A Comparison Study of the Test for Right Censored and Grouped Data Communications for Statistica Appications and Methods 2015, Vo. 22, No. 4, 313 320 DOI: http://dx.doi.org/10.5351/csam.2015.22.4.313 Print ISSN 2287-7843 / Onine ISSN 2383-4757 A Comparison Study of the

More information

Appendix A: MATLAB commands for neural networks

Appendix A: MATLAB commands for neural networks Appendix A: MATLAB commands for neura networks 132 Appendix A: MATLAB commands for neura networks p=importdata('pn.xs'); t=importdata('tn.xs'); [pn,meanp,stdp,tn,meant,stdt]=prestd(p,t); for m=1:10 net=newff(minmax(pn),[m,1],{'tansig','purein'},'trainm');

More information

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model Appendix of the Paper The Roe of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Mode Caio Ameida cameida@fgv.br José Vicente jose.vaentim@bcb.gov.br June 008 1 Introduction In this

More information

C. Fourier Sine Series Overview

C. Fourier Sine Series Overview 12 PHILIP D. LOEWEN C. Fourier Sine Series Overview Let some constant > be given. The symboic form of the FSS Eigenvaue probem combines an ordinary differentia equation (ODE) on the interva (, ) with a

More information

Approximation and Fast Calculation of Non-local Boundary Conditions for the Time-dependent Schrödinger Equation

Approximation and Fast Calculation of Non-local Boundary Conditions for the Time-dependent Schrödinger Equation Approximation and Fast Cacuation of Non-oca Boundary Conditions for the Time-dependent Schrödinger Equation Anton Arnod, Matthias Ehrhardt 2, and Ivan Sofronov 3 Universität Münster, Institut für Numerische

More information

A Sparse Covariance Function for Exact Gaussian Process Inference in Large Datasets

A Sparse Covariance Function for Exact Gaussian Process Inference in Large Datasets A Covariance Function for Exact Gaussian Process Inference in Large Datasets Arman ekumyan Austraian Centre for Fied Robotics The University of Sydney NSW 26, Austraia a.mekumyan@acfr.usyd.edu.au Fabio

More information

Proceedings of the 2012 Winter Simulation Conference C. Laroque, J. Himmelspach, R. Pasupathy, O. Rose, and A. M. Uhrmacher, eds.

Proceedings of the 2012 Winter Simulation Conference C. Laroque, J. Himmelspach, R. Pasupathy, O. Rose, and A. M. Uhrmacher, eds. Proceedings of the 2012 Winter Simuation Conference C. aroque, J. Himmespach, R. Pasupathy, O. Rose, and A. M. Uhrmacher, eds. TIGHT BOUNDS FOR AMERICAN OPTIONS VIA MUTIEVE MONTE CARO Denis Beomestny Duisburg-Essen

More information

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance Send Orders for Reprints to reprints@benthamscience.ae 340 The Open Cybernetics & Systemics Journa, 015, 9, 340-344 Open Access Research of Data Fusion Method of Muti-Sensor Based on Correation Coefficient

More information

arxiv: v1 [cs.lg] 31 Oct 2017

arxiv: v1 [cs.lg] 31 Oct 2017 ACCELERATED SPARSE SUBSPACE CLUSTERING Abofaz Hashemi and Haris Vikao Department of Eectrica and Computer Engineering, University of Texas at Austin, Austin, TX, USA arxiv:7.26v [cs.lg] 3 Oct 27 ABSTRACT

More information

Learning Fully Observed Undirected Graphical Models

Learning Fully Observed Undirected Graphical Models Learning Fuy Observed Undirected Graphica Modes Sides Credit: Matt Gormey (2016) Kayhan Batmangheich 1 Machine Learning The data inspires the structures we want to predict Inference finds {best structure,

More information

Separation of Variables and a Spherical Shell with Surface Charge

Separation of Variables and a Spherical Shell with Surface Charge Separation of Variabes and a Spherica She with Surface Charge In cass we worked out the eectrostatic potentia due to a spherica she of radius R with a surface charge density σθ = σ cos θ. This cacuation

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

Uniformly Reweighted Belief Propagation: A Factor Graph Approach

Uniformly Reweighted Belief Propagation: A Factor Graph Approach Uniformy Reweighted Beief Propagation: A Factor Graph Approach Henk Wymeersch Chamers University of Technoogy Gothenburg, Sweden henkw@chamers.se Federico Penna Poitecnico di Torino Torino, Itay federico.penna@poito.it

More information

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems Bayesian Unscented Kaman Fiter for State Estimation of Noninear and Non-aussian Systems Zhong Liu, Shing-Chow Chan, Ho-Chun Wu and iafei Wu Department of Eectrica and Eectronic Engineering, he University

More information

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers Ant Coony Agorithms for Constructing Bayesian Muti-net Cassifiers Khaid M. Saama and Aex A. Freitas Schoo of Computing, University of Kent, Canterbury, UK. {kms39,a.a.freitas}@kent.ac.uk December 5, 2013

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

Stochastic Variational Inference with Gradient Linearization

Stochastic Variational Inference with Gradient Linearization Stochastic Variationa Inference with Gradient Linearization Suppementa Materia Tobias Pötz * Anne S Wannenwetsch Stefan Roth Department of Computer Science, TU Darmstadt Preface In this suppementa materia,

More information

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Brandon Maone Department of Computer Science University of Hesini February 18, 2014 Abstract This document derives, in excrutiating

More information

Converting Z-number to Fuzzy Number using. Fuzzy Expected Value

Converting Z-number to Fuzzy Number using. Fuzzy Expected Value ISSN 1746-7659, Engand, UK Journa of Information and Computing Science Vo. 1, No. 4, 017, pp.91-303 Converting Z-number to Fuzzy Number using Fuzzy Expected Vaue Mahdieh Akhbari * Department of Industria

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

Kernel pea and De-Noising in Feature Spaces

Kernel pea and De-Noising in Feature Spaces Kerne pea and De-Noising in Feature Spaces Sebastian Mika, Bernhard Schokopf, Aex Smoa Kaus-Robert Muer, Matthias Schoz, Gunnar Riitsch GMD FIRST, Rudower Chaussee 5, 12489 Berin, Germany {mika, bs, smoa,

More information

A MODEL FOR ESTIMATING THE LATERAL OVERLAP PROBABILITY OF AIRCRAFT WITH RNP ALERTING CAPABILITY IN PARALLEL RNAV ROUTES

A MODEL FOR ESTIMATING THE LATERAL OVERLAP PROBABILITY OF AIRCRAFT WITH RNP ALERTING CAPABILITY IN PARALLEL RNAV ROUTES 6 TH INTERNATIONAL CONGRESS OF THE AERONAUTICAL SCIENCES A MODEL FOR ESTIMATING THE LATERAL OVERLAP PROBABILITY OF AIRCRAFT WITH RNP ALERTING CAPABILITY IN PARALLEL RNAV ROUTES Sakae NAGAOKA* *Eectronic

More information

arxiv: v1 [math.ca] 6 Mar 2017

arxiv: v1 [math.ca] 6 Mar 2017 Indefinite Integras of Spherica Besse Functions MIT-CTP/487 arxiv:703.0648v [math.ca] 6 Mar 07 Joyon K. Boomfied,, Stephen H. P. Face,, and Zander Moss, Center for Theoretica Physics, Laboratory for Nucear

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, we as various

More information

Melodic contour estimation with B-spline models using a MDL criterion

Melodic contour estimation with B-spline models using a MDL criterion Meodic contour estimation with B-spine modes using a MDL criterion Damien Loive, Ney Barbot, Oivier Boeffard IRISA / University of Rennes 1 - ENSSAT 6 rue de Kerampont, B.P. 80518, F-305 Lannion Cedex

More information

FORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS

FORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS FORECASTING TEECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODES Niesh Subhash naawade a, Mrs. Meenakshi Pawar b a SVERI's Coege of Engineering, Pandharpur. nieshsubhash15@gmai.com

More information

Asynchronous Control for Coupled Markov Decision Systems

Asynchronous Control for Coupled Markov Decision Systems INFORMATION THEORY WORKSHOP (ITW) 22 Asynchronous Contro for Couped Marov Decision Systems Michae J. Neey University of Southern Caifornia Abstract This paper considers optima contro for a coection of

More information

Fitting Algorithms for MMPP ATM Traffic Models

Fitting Algorithms for MMPP ATM Traffic Models Fitting Agorithms for PP AT Traffic odes A. Nogueira, P. Savador, R. Vaadas University of Aveiro / Institute of Teecommunications, 38-93 Aveiro, Portuga; e-mai: (nogueira, savador, rv)@av.it.pt ABSTRACT

More information

A Better Way to Pretrain Deep Boltzmann Machines

A Better Way to Pretrain Deep Boltzmann Machines A Better Way to Pretrain Deep Botzmann Machines Rusan Saakhutdino Department of Statistics and Computer Science Uniersity of Toronto rsaakhu@cs.toronto.edu Geoffrey Hinton Department of Computer Science

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, as we as various

More information

4 1-D Boundary Value Problems Heat Equation

4 1-D Boundary Value Problems Heat Equation 4 -D Boundary Vaue Probems Heat Equation The main purpose of this chapter is to study boundary vaue probems for the heat equation on a finite rod a x b. u t (x, t = ku xx (x, t, a < x < b, t > u(x, = ϕ(x

More information

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION Hsiao-Chang Chen Dept. of Systems Engineering University of Pennsyvania Phiadephia, PA 904-635, U.S.A. Chun-Hung Chen

More information

Incremental Reformulated Automatic Relevance Determination

Incremental Reformulated Automatic Relevance Determination IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 60, NO. 9, SEPTEMBER 22 4977 Incrementa Reformuated Automatic Reevance Determination Dmitriy Shutin, Sanjeev R. Kukarni, and H. Vincent Poor Abstract In this

More information

Steepest Descent Adaptation of Min-Max Fuzzy If-Then Rules 1

Steepest Descent Adaptation of Min-Max Fuzzy If-Then Rules 1 Steepest Descent Adaptation of Min-Max Fuzzy If-Then Rues 1 R.J. Marks II, S. Oh, P. Arabshahi Λ, T.P. Caude, J.J. Choi, B.G. Song Λ Λ Dept. of Eectrica Engineering Boeing Computer Services University

More information

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION SAHAR KARIMI AND STEPHEN VAVASIS Abstract. In this paper we present a variant of the conjugate gradient (CG) agorithm in which we invoke a subspace minimization

More information

Fast Blind Recognition of Channel Codes

Fast Blind Recognition of Channel Codes Fast Bind Recognition of Channe Codes Reza Moosavi and Erik G. Larsson Linköping University Post Print N.B.: When citing this work, cite the origina artice. 213 IEEE. Persona use of this materia is permitted.

More information

https://doi.org/ /epjconf/

https://doi.org/ /epjconf/ HOW TO APPLY THE OPTIMAL ESTIMATION METHOD TO YOUR LIDAR MEASUREMENTS FOR IMPROVED RETRIEVALS OF TEMPERATURE AND COMPOSITION R. J. Sica 1,2,*, A. Haefee 2,1, A. Jaai 1, S. Gamage 1 and G. Farhani 1 1 Department

More information

Integrating Factor Methods as Exponential Integrators

Integrating Factor Methods as Exponential Integrators Integrating Factor Methods as Exponentia Integrators Borisav V. Minchev Department of Mathematica Science, NTNU, 7491 Trondheim, Norway Borko.Minchev@ii.uib.no Abstract. Recenty a ot of effort has been

More information

(This is a sample cover image for this issue. The actual cover is not yet available at this time.)

(This is a sample cover image for this issue. The actual cover is not yet available at this time.) (This is a sampe cover image for this issue The actua cover is not yet avaiabe at this time) This artice appeared in a journa pubished by Esevier The attached copy is furnished to the author for interna

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

Wavelet shrinkage estimators of Hilbert transform

Wavelet shrinkage estimators of Hilbert transform Journa of Approximation Theory 163 (2011) 652 662 www.esevier.com/ocate/jat Fu ength artice Waveet shrinkage estimators of Hibert transform Di-Rong Chen, Yao Zhao Department of Mathematics, LMIB, Beijing

More information

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries c 26 Noninear Phenomena in Compex Systems First-Order Corrections to Gutzwier s Trace Formua for Systems with Discrete Symmetries Hoger Cartarius, Jörg Main, and Günter Wunner Institut für Theoretische

More information

Soft Clustering on Graphs

Soft Clustering on Graphs Soft Custering on Graphs Kai Yu 1, Shipeng Yu 2, Voker Tresp 1 1 Siemens AG, Corporate Technoogy 2 Institute for Computer Science, University of Munich kai.yu@siemens.com, voker.tresp@siemens.com spyu@dbs.informatik.uni-muenchen.de

More information

SVM: Terminology 1(6) SVM: Terminology 2(6)

SVM: Terminology 1(6) SVM: Terminology 2(6) Andrew Kusiak Inteigent Systems Laboratory 39 Seamans Center he University of Iowa Iowa City, IA 54-57 SVM he maxima margin cassifier is simiar to the perceptron: It aso assumes that the data points are

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information

4 Separation of Variables

4 Separation of Variables 4 Separation of Variabes In this chapter we describe a cassica technique for constructing forma soutions to inear boundary vaue probems. The soution of three cassica (paraboic, hyperboic and eiptic) PDE

More information

<C 2 2. λ 2 l. λ 1 l 1 < C 1

<C 2 2. λ 2 l. λ 1 l 1 < C 1 Teecommunication Network Contro and Management (EE E694) Prof. A. A. Lazar Notes for the ecture of 7/Feb/95 by Huayan Wang (this document was ast LaT E X-ed on May 9,995) Queueing Primer for Muticass Optima

More information

Lecture Note 3: Stationary Iterative Methods

Lecture Note 3: Stationary Iterative Methods MATH 5330: Computationa Methods of Linear Agebra Lecture Note 3: Stationary Iterative Methods Xianyi Zeng Department of Mathematica Sciences, UTEP Stationary Iterative Methods The Gaussian eimination (or

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

Global sensitivity analysis using low-rank tensor approximations

Global sensitivity analysis using low-rank tensor approximations Goba sensitivity anaysis using ow-rank tensor approximations K. Konaki 1 and B. Sudret 1 1 Chair of Risk, Safety and Uncertainty Quantification, arxiv:1605.09009v1 [stat.co] 29 May 2016 ETH Zurich, Stefano-Franscini-Patz

More information

BP neural network-based sports performance prediction model applied research

BP neural network-based sports performance prediction model applied research Avaiabe onine www.jocpr.com Journa of Chemica and Pharmaceutica Research, 204, 6(7:93-936 Research Artice ISSN : 0975-7384 CODEN(USA : JCPRC5 BP neura networ-based sports performance prediction mode appied

More information

Efficient Approximate Leave-One-Out Cross-Validation for Kernel Logistic Regression

Efficient Approximate Leave-One-Out Cross-Validation for Kernel Logistic Regression Machine Learning manuscript No. (wi be inserted by the editor) Efficient Approximate Leave-One-Out Cross-Vaidation for Kerne Logistic Regression Gavin C. Cawey, Nicoa L. C. Tabot Schoo of Computing Sciences

More information

Learning Structural Changes of Gaussian Graphical Models in Controlled Experiments

Learning Structural Changes of Gaussian Graphical Models in Controlled Experiments Learning Structura Changes of Gaussian Graphica Modes in Controed Experiments Bai Zhang and Yue Wang Bradey Department of Eectrica and Computer Engineering Virginia Poytechnic Institute and State University

More information

A Ridgelet Kernel Regression Model using Genetic Algorithm

A Ridgelet Kernel Regression Model using Genetic Algorithm A Ridgeet Kerne Regression Mode using Genetic Agorithm Shuyuan Yang, Min Wang, Licheng Jiao * Institute of Inteigence Information Processing, Department of Eectrica Engineering Xidian University Xi an,

More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

A Statistical Framework for Real-time Event Detection in Power Systems

A Statistical Framework for Real-time Event Detection in Power Systems 1 A Statistica Framework for Rea-time Event Detection in Power Systems Noan Uhrich, Tim Christman, Phiip Swisher, and Xichen Jiang Abstract A quickest change detection (QCD) agorithm is appied to the probem

More information

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents MARKOV CHAINS AND MARKOV DECISION THEORY ARINDRIMA DATTA Abstract. In this paper, we begin with a forma introduction to probabiity and expain the concept of random variabes and stochastic processes. After

More information

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1 Inductive Bias: How to generaize on nove data CS 478 - Inductive Bias 1 Overfitting Noise vs. Exceptions CS 478 - Inductive Bias 2 Non-Linear Tasks Linear Regression wi not generaize we to the task beow

More information

On the evaluation of saving-consumption plans

On the evaluation of saving-consumption plans On the evauation of saving-consumption pans Steven Vanduffe Jan Dhaene Marc Goovaerts Juy 13, 2004 Abstract Knowedge of the distribution function of the stochasticay compounded vaue of a series of future

More information

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks ower Contro and Transmission Scheduing for Network Utiity Maximization in Wireess Networks Min Cao, Vivek Raghunathan, Stephen Hany, Vinod Sharma and. R. Kumar Abstract We consider a joint power contro

More information

Kernel Trick Embedded Gaussian Mixture Model

Kernel Trick Embedded Gaussian Mixture Model Kerne Trick Embedded Gaussian Mixture Mode Jingdong Wang, Jianguo Lee, and Changshui Zhang State Key Laboratory of Inteigent Technoogy and Systems Department of Automation, Tsinghua University Beijing,

More information

Discriminant Analysis: A Unified Approach

Discriminant Analysis: A Unified Approach Discriminant Anaysis: A Unified Approach Peng Zhang & Jing Peng Tuane University Eectrica Engineering & Computer Science Department New Oreans, LA 708 {zhangp,jp}@eecs.tuane.edu Norbert Riede Tuane University

More information

Related Topics Maxwell s equations, electrical eddy field, magnetic field of coils, coil, magnetic flux, induced voltage

Related Topics Maxwell s equations, electrical eddy field, magnetic field of coils, coil, magnetic flux, induced voltage Magnetic induction TEP Reated Topics Maxwe s equations, eectrica eddy fied, magnetic fied of cois, coi, magnetic fux, induced votage Principe A magnetic fied of variabe frequency and varying strength is

More information

Supervised i-vector Modeling - Theory and Applications

Supervised i-vector Modeling - Theory and Applications Supervised i-vector Modeing - Theory and Appications Shreyas Ramoji, Sriram Ganapathy Learning and Extraction of Acoustic Patterns LEAP) Lab, Eectrica Engineering, Indian Institute of Science, Bengauru,

More information

8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING

8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING 8 APPENDIX 8.1 EMPIRICAL EVALUATION OF SAMPLING We wish to evauate the empirica accuracy of our samping technique on concrete exampes. We do this in two ways. First, we can sort the eements by probabiity

More information