A kernel method for canonical correlation analysis
Shotaro Akaho
AIST Neuroscience Research Institute, Central 2, Umezono, Tsukuba, Ibaraki, Japan
s.akaho@aist.go.jp

Abstract

Canonical correlation analysis is a technique to extract common features from a pair of multivariate data. In complex situations, however, it does not extract useful features because of its linearity. On the other hand, the kernel method used in support vector machines is an efficient approach to improve such linear methods. In this paper, we investigate the effectiveness of applying the kernel method to canonical correlation analysis.

Keywords: multivariate analysis, multimodal data, kernel method, regularization

1 Introduction

This paper deals with a method to extract common features from multiple information sources. For instance, consider a learning task in pattern recognition in which an object is given as an image and its name is given by speech. For a newly given image, the system is required to answer its name by speech, and for a newly given speech, the system must answer the corresponding image. The task can be considered as a regression problem from image to speech and vice versa. However, since the dimensionalities of images and speech are generally very large, a regression analysis may not work effectively. In order to solve the problem, it is useful to map the inputs into a low-dimensional feature space and then to solve the regression problem there.

Canonical correlation analysis (CCA) has been used for such a purpose. CCA finds a linear transformation of a pair of multivariates such that the correlation coefficient between them is maximized. From an information-theoretic point of view, the transformation maximizes the mutual information between the extracted features. However, if there is a nonlinear relation between the variates, CCA does not always extract useful features.

On the other hand, support vector machines (SVMs) have attracted a lot of attention for their state-of-the-art performance in pattern recognition [8]. The kernel trick used in SVMs is applicable not only to classification but also to other linear techniques, for example kernel regression and kernel PCA [6]. In this paper, we apply the kernel method to CCA. Since the kernel method is likely to overfit the data, we incorporate a regularization technique to avoid overfitting.

(This is the full version of the paper presented at IMPS2001 (International Meeting of the Psychometric Society, Osaka, 2001); akaho/papers/imps200full.pdf)

2 Canonical correlation analysis

[Figure 1: CCA. The variates x and y are projected by a and b onto the features u and v.]

CCA was proposed by Hotelling in 1935 [3]. Suppose there is a pair of multivariates x \in R^{n_x}, y \in R^{n_y}. CCA finds a pair of linear transformations such that the correlation coefficient between the extracted features is maximized (Fig. 1). For the sake of simplicity, we assume that the averages of x and y are 0 and that the dimensionality of the features is 1. Then, with the transformations

    u = \langle a, x \rangle,    (1)
    v = \langle b, y \rangle,    (2)

where \langle a, x \rangle represents the inner product, we would like to find the transformations a, b that maximize

    \rho = E[uv] / \sqrt{Var[u] Var[v]}.    (3)

We further assume

    Var[u] = Var[v] = 1    (4)

to remove the freedom of scaling of u and v. The vectors a and b can then be found as the eigenvector corresponding to the maximal eigenvalue of a generalized eigenvalue problem. If we need more than one feature dimension, we can take the eigenvectors corresponding to the other maximal eigenvalues.
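As an illustration (not part of the original paper), the following minimal NumPy/SciPy sketch solves linear CCA as such a generalized eigenvalue problem; the function name, the block construction, and the jitter constant are our own choices, not the paper's.

import numpy as np
from scipy.linalg import eigh

def linear_cca(X, Y):
    # First canonical pair for data matrices X (N x nx) and Y (N x ny).
    # Solves  [0 Cxy; Cyx 0] w = rho [Cxx 0; 0 Cyy] w  with w = (a, b).
    N = X.shape[0]
    X = X - X.mean(axis=0)          # enforce the zero-mean assumption
    Y = Y - Y.mean(axis=0)
    Cxx, Cyy = X.T @ X / N, Y.T @ Y / N
    Cxy = X.T @ Y / N
    nx, ny = X.shape[1], Y.shape[1]
    A = np.zeros((nx + ny, nx + ny))
    A[:nx, nx:], A[nx:, :nx] = Cxy, Cxy.T
    B = np.zeros_like(A)
    B[:nx, :nx], B[nx:, nx:] = Cxx, Cyy
    B += 1e-10 * np.eye(nx + ny)    # tiny jitter so B stays positive definite
    rho, W = eigh(A, B)             # eigenvalues in ascending order
    a, b = W[:nx, -1], W[nx:, -1]   # eigenvector of the largest eigenvalue
    a /= np.sqrt(a @ Cxx @ a)       # rescale so that Var[u] = Var[v] = 1, Eq. (4)
    b /= np.sqrt(b @ Cyy @ b)
    return a, b, rho[-1]

The largest eigenvalue of this block problem equals the maximal correlation coefficient \rho, and further canonical pairs correspond to the next largest eigenvalues, as described above.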
CCA is important from an information-theoretic viewpoint, since it finds a transformation that maximizes the mutual information between the features when x and y are jointly Gaussian. Even if that assumption is not fulfilled, CCA can still be used in some cases. However, if the purpose is regression, large correlation coefficients are crucially necessary. Small correlation coefficients can arise in two cases:

1. x and y have almost no relation;
2. there is a strong nonlinear relation between x and y.

No improvement is possible in the first case. In the second case, however, the relation can be recovered by some methods. One such method is to allow nonlinear transformations: Asoh et al. [4] proposed a neural network model that approximates the optimal nonlinear canonical correlation analysis. However, this model requires a lot of computation time and also has many local optima. In this paper, we instead incorporate the kernel method, which enables nonlinear transformations with little computation and no undesired local optima.

3 Kernel CCA

[Figure 2: Kernel CCA. x and y are mapped into Hilbert spaces by \phi_x and \phi_y, then projected by a and b onto the features u and v.]

First, x and y are transformed into Hilbert spaces, \phi_x(x) \in H_x and \phi_y(y) \in H_y. By taking inner products with parameter vectors in the Hilbert spaces, a \in H_x and b \in H_y, we find features

    u = \langle a, \phi_x(x) \rangle,    (5)
    v = \langle b, \phi_y(y) \rangle,    (6)

that maximize the correlation coefficient. Now suppose we have N pairs of training samples {(x_i, y_i)}_{i=1}^N. The vectors a and b can be found by solving the Lagrangian

    L_0 = E[(u - E[u])(v - E[v])] - (\lambda_1/2) E[(u - E[u])^2] - (\lambda_2/2) E[(v - E[v])^2].    (7)

However, this Lagrangian is ill-posed as it is when the dimensionalities of the Hilbert spaces are large. Therefore, we introduce a quadratic regularization term and obtain the well-posed Lagrangian

    L = L_0 + (\eta/2) (\|a\|^2 + \|b\|^2),    (8)

where \eta is a regularization constant. Note that the average of u is given by

    E[u] = (1/N) \sum_i \langle a, \phi_x(x_i) \rangle,    (9)

and the average of uv is given by

    E[uv] = (1/N) \sum_i \langle a, \phi_x(x_i) \rangle \langle b, \phi_y(y_i) \rangle.    (10)

Now, from the condition that the derivative of L with respect to a is equal to 0, we get

    a = \sum_i \alpha_i \phi_x(x_i),    (11)

where the \alpha_i are scalars, and as a result we have

    u = \sum_i \alpha_i \langle \phi_x(x_i), \phi_x(x) \rangle.    (12)

Therefore, u can be calculated using only inner products in H_x. The kernel trick used in SVMs uses a kernel function k_x(x_1, x_2) in place of the inner product between \phi_x(x_1) and \phi_x(x_2). In practice, since we do not need an explicit form of \phi_x, we first choose a k_x that can be decomposed in the form of an inner product. By Mercer's theorem, any symmetric positive definite kernel k_x can be decomposed into inner product form.

Let us rewrite L in terms of the kernel. First, let \alpha = (\alpha_1, ..., \alpha_N)^T, \beta = (\beta_1, ..., \beta_N)^T, and define the matrices

    (K_x)_{ij} = k_x(x_i, x_j),    (13)
    (K_y)_{ij} = k_y(y_i, y_j).    (14)

Then we obtain L as

    L = \alpha^T M \beta - (\lambda_1/2) \alpha^T L \alpha - (\lambda_2/2) \beta^T N \beta,    (15)

where

    M = (1/N) K_x^T J K_y,    (16)
    L = (1/N) K_x^T J K_x + \eta_1 K_x,    (17)
    N = (1/N) K_y^T J K_y + \eta_2 K_y,    (18)
    J = I - (1/N) 1 1^T,    (19)
    1 = (1, ..., 1)^T,    (20)

and \eta_1 = \eta/\lambda_1, \eta_2 = \eta/\lambda_2. If \eta > 0, L and N are positive definite almost surely, and we can show \lambda_1 = \lambda_2 = \lambda from the constraints; as a result we have a generalized eigenvalue problem for \alpha, \beta:

    M \beta = \lambda L \alpha,    (21)
    M^T \alpha = \lambda N \beta.    (22)

It can be solved by a generalized eigenvalue problem package or via Cholesky decomposition of L and N.
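The derivation above maps directly onto code. The following is a minimal sketch under the same notation (NumPy/SciPy assumed); the stacking of Eqs. (21)-(22) into one block eigenproblem, the jitter constant, and all names are our own, not from the paper.

import numpy as np
from scipy.linalg import eigh

def kernel_cca(Kx, Ky, eta1, eta2, jitter=1e-10):
    # First kernel CCA pair from the Gram matrices of Eqs. (13)-(14).
    n = Kx.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix, Eqs. (19)-(20)
    M = Kx.T @ J @ Ky / n                 # Eq. (16)
    L = Kx.T @ J @ Kx / n + eta1 * Kx     # Eq. (17); using eta1 * np.eye(n) here
                                          # instead gives the ||alpha||^2-type
                                          # regularizer discussed in Sec. 5.1
    Nm = Ky.T @ J @ Ky / n + eta2 * Ky    # Eq. (18)
    # Stack Eqs. (21)-(22) into one symmetric generalized eigenproblem:
    # [0 M; M^T 0] w = lambda [L 0; 0 N] w,  with w = (alpha, beta)
    Z = np.zeros((n, n))
    A = np.block([[Z, M], [M.T, Z]])
    B = np.block([[L, Z], [Z, Nm]]) + jitter * np.eye(2 * n)
    lam, W = eigh(A, B)                   # ascending eigenvalues
    alpha, beta = W[:n, -1], W[n:, -1]    # top eigenvalue = maximal correlation
    return alpha, beta, lam[-1]

Features for a new input x then follow Eq. (12): u(x) = \sum_i \alpha_i k_x(x_i, x), i.e. the product of a test-versus-training Gram matrix with \alpha.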
4 Computer simulation

4.1 Simulation 1

We generate training samples and test samples independently as follows. First, \theta is drawn from the uniform distribution on [-\pi, \pi], and then a pair of two-dimensional variables x and y is generated by

    x = (\theta, \sin 3\theta)^T + \epsilon_1,    (23)
    y = e^{\theta/4} (\cos 2\theta, \sin 2\theta)^T + \epsilon_2,    (24)

where \epsilon_1, \epsilon_2 are independent two-dimensional Gaussian noises. We test on 10 training samples and 100 test samples.

The x-y scatter plot of (linear) CCA is shown in Fig. 3. The correlation coefficients are as follows, where the values for the test samples are given in parentheses:

             v_1            v_2
    u_1    0.7 (0.0)      0.00 (0.09)
    u_2        (0.00)     0.27 (0.9)

[Figure 3: Simulation 1. x-y plot for CCA. The numbers represent the increasing order of the training samples in \theta.]

The x-y plot of kernel CCA is shown in Fig. 4. We used the Gaussian kernel

    k(x_1, x_2) = \exp(-\|x_1 - x_2\|^2 / (2\sigma^2))    (25)

for both x and y, with parameters \eta = 1.0, \sigma = 1.0. The correlation coefficients are as follows, where the values for the test samples are given in parentheses:

             v_1             v_2
    u_1    0.98 (0.95)     0.00 (0.02)
    u_2         (0.02)     0.97 (0.93)

[Figure 4: Simulation 1. x-y plot for kernel CCA.]

We show only up to the second components, though kernel CCA yields higher components as well.

4.2 Simulation 2

This section examines an artificial pattern recognition task in the multimodal setting described at the beginning of the paper. Training samples x and y are generated randomly from the uniform distribution on [0, 1]^2 and are combined into random pairs; each pair of training samples represents a class center. Test samples are generated by adding independent Gaussian noise with standard deviation 0.05 to randomly chosen training samples. We test 10 training samples (classes) and 100 test samples.

The x-y plot of the CCA result is shown in Fig. 5. The correlation coefficients between the features are as follows, where the values for the test samples are given in parentheses:

             v_1             v_2
    u_1    0.0 (0.)        0.00 (-0.0)
    u_2        (-0.05)     0.3 (0.9)

[Figure 5: Simulation 2. x-y plot of CCA. The numbers represent class centers.]

The x-y plot of the kernel CCA result for the same dataset is shown in Fig. 6. We use the Gaussian kernel with parameters \eta = 0.1, \sigma = 0.1. The correlation coefficients between the features are as follows, where the values for the test samples are given in parentheses:

             v_1             v_2
    u_1    0.97 (0.90)     0.00 (0.0)
    u_2         (0.0)      0.95 (0.88)

[Figure 6: Simulation 2. x-y plot of kernel CCA.]
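For concreteness, the following sketch replays the setup of Simulation 1 using the kernel_cca function sketched in Section 3. It is illustrative only: the curve equations, sample count, noise level, and kernel parameters follow the reading given above and should be treated as assumptions rather than confirmed values.

import numpy as np

def gauss_gram(A, B, sigma):
    # Gaussian kernel matrix of Eq. (25): k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
N = 10                                   # sample count: an assumption
noise = 0.05                             # noise level: an assumption
theta = rng.uniform(-np.pi, np.pi, N)
X = np.stack([theta, np.sin(3 * theta)], axis=1) + noise * rng.standard_normal((N, 2))
Y = (np.exp(theta / 4)[:, None]
     * np.stack([np.cos(2 * theta), np.sin(2 * theta)], axis=1)
     + noise * rng.standard_normal((N, 2)))

Kx = gauss_gram(X, X, sigma=1.0)
Ky = gauss_gram(Y, Y, sigma=1.0)
alpha, beta, lam = kernel_cca(Kx, Ky, eta1=1.0, eta2=1.0)  # from Sec. 3 sketch
u, v = Kx @ alpha, Ky @ beta             # first pair of canonical features, Eq. (12)
print("training correlation of the first feature pair:", np.corrcoef(u, v)[0, 1])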
5 Concluding remarks

5.1 Kernel method and regularization

We have proposed kernel canonical correlation analysis, in which the kernel method is incorporated into canonical correlation analysis. As with SVMs, the key points are nonlinearization by the kernel method and the avoidance of overfitting by a regularization technique.

In general, it is important to determine the regularization parameter well. Moreover, the selection of the kernel form is crucial for performance. Although all parameters were determined by hand in the simulations of this paper, more systematic approaches can be taken, such as resampling methods like cross-validation, or empirical Bayes approaches [7]. Such techniques usually require iterative algorithms, which are time-consuming and are also likely to be trapped in local optima. Examining these issues is future work.

As for the regularization term, we can use \|\alpha\|^2 and \|\beta\|^2 instead of the quadratic term used in this paper. In the kernel discriminant analysis described below, such a different type of regularization term is used. The time complexities of both types are the same, and empirically we could not find a significant difference in performance. However, more realistic experiments may be needed.

5.2 Relation to kernel discriminant analysis

Canonical correlation analysis is closely related to Fisher's discriminant analysis (FDA), which finds a mapping that minimizes the within-class variance while maximizing the between-class variance for effective pattern recognition. FDA can be considered a special case of CCA. Mika et al. [5] have proposed a kernel method for FDA, which is not strictly included in kernel CCA because kernel FDA does not transform the class labels by a nonlinear mapping. In both kernel CCA and kernel FDA, it is difficult to obtain a sparse representation of the mapping. It would be a promising idea to incorporate sparsity as a utility function.

5.3 Future issues from information theory

The author's group has proposed multimodal independent component analysis (multimodal ICA), which extends CCA from an information-theoretic viewpoint [2]. Its transformation is restricted to be linear, and it has therefore sometimes been difficult to extract useful features from nonlinearly related multivariates. A natural question arises: can we integrate kernel CCA with multimodal ICA in order to extract useful features?

The answer depends on the properties of the given data. If the noise level is low, as in the simulations of this paper, the regularization constants are set to small values and the correlation coefficients are expected to be almost 1; we cannot expect multimodal ICA to improve the performance, because a correlation coefficient close to 1 already achieves a large amount of mutual information. On the other hand, when the noise level is large, multimodal ICA may improve the performance; in such a case, however, linear CCA is sometimes sufficient in practice. If we learn a multiple-valued function, as in the acquisition of multiple concepts [1], the integration may be worth trying, because the correlation coefficients are small even when the noise level is low.

Let us consider further the case where the noise level is low. From the results of the simulations in the previous section, samples are mapped into a few clusters, which will make regression between x and y difficult. In such a case, the distribution of u and v should be scattered: from the information-theoretic viewpoint, the feature space should preferably have a large amount of entropy. Since the distribution with the largest entropy under a fixed mean and variance is the Gaussian, Gaussianity can be used as the utility function; for example, the third and fourth cumulants should preferably be as small as possible. This seems to be the opposite of projection pursuit and independent component analysis, but the difference may come from the difference in purpose (ICA is for visualization, while our task is regression) and in the assumptions on noise. These issues are related to the sparsity discussed in the previous section and are also left as future work.
References

[1] S. Akaho, S. Hayamizu, O. Hasegawa, T. Yoshimura, H. Asoh: Concept acquisition from multiple information sources by the EM algorithm, Trans. of IEICE, Vol. J80-A, No. 9, 1997. (In Japanese; an English version (ETL technical report 97-8) is available at akaho/papers/etl-TR-97-8E.ps.gz)

[2] S. Akaho, S. Umeyama: Multimodal Independent Component Analysis: a method of feature extraction from multiple information sources, Electronics and Communications in Japan, Part 3: Fundamental Electronic Science, Vol. 84, pp. 21-28, 2001. (A summary version is available as an IJCNN'99 paper, akaho/papers/ijcnn99.ps.gz)

[3] T. W. Anderson: An Introduction to Multivariate Statistical Analysis, Second edition, John Wiley & Sons, 1984.

[4] H. Asoh, O. Takechi: An Approximation of Nonlinear Canonical Correlation Analysis by Multilayer Perceptrons, Proc. of Int. Conf. on Artificial Neural Networks, 1994.

[5] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller: Fisher discriminant analysis with kernels, in Y.-H. Hu et al. (eds.): Neural Networks for Signal Processing IX, pp. 41-48, IEEE, 1999.

[6] B. Schölkopf, A. Smola, K.-R. Müller: Kernel principal component analysis, in B. Schölkopf et al. (eds.): Advances in Kernel Methods: Support Vector Learning, MIT Press, 1998.

[7] M. E. Tipping: The relevance vector machine, to appear in Advances in Neural Information Processing Systems (NIPS) 12.

[8] V. N. Vapnik: Statistical Learning Theory, John Wiley & Sons, 1998.