ESANN'2001 proceedings - European Symposium on Artificial Neural Networks, Bruges (Belgium), April 2001, D-Facto publications.

Sparse Kernel Canonical Correlation Analysis

Lili Tan¹ and Colin Fyfe²*

1. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong.
2. School of Information and Communication Technologies, The University of Paisley, Scotland.

Abstract. We review the recently proposed method of Relevance Vector Machines, which is a supervised training method related to Support Vector Machines. We also review the statistical technique of Canonical Correlation Analysis and its implementation in a feature space. We show how the technique of Relevance Vectors may be applied to the method of Kernel Canonical Correlation Analysis to gain a very sparse representation of a data set, and discuss why such a representation may be beneficial to an organism.

1. Introduction

We have previously developed both neural [1] and kernel [2] methods for finding correlations in a data set which are greater than linear correlations. The kernel method was based on the supervised method of Support Vector Machines (SVMs) [7] but lacked the inherent sparseness which SVMs generate: the filters found are functions of the whole data set. In this paper, we use the ideas from Relevance Vector Machines (RVMs) to generate canonical correlation vectors which are inherently sparse.

We have previously [1] shown how canonical correlation analysis networks can be used to identify stereo correspondence in a data set. While this is still possible with kernel methods, the power of nonlinear kernels is best illustrated on more difficult correspondence problems such as the integration of information across different modalities, such as sight and sound.

* Colin Fyfe would like to gratefully acknowledge the support of the Chinese University of Hong Kong for the period when the current work was carried out.

2. Kernel Canonical Correlation Analysis

Canonical Correlation Analysis [3] is used when we have two data sets which we believe have some underlying correlation. Consider two sets of input data, from which we draw iid samples to form a pair of input vectors, $x_1$ and $x_2$. In classical CCA, we attempt to find the linear combinations of the variables which give us maximum correlation between the combinations. Let $\Sigma_{11}$ be the covariance matrix of $x_1$, and similarly for $\Sigma_{12}$ and $\Sigma_{22}$. Then define $K = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}$. We then perform a Singular Value Decomposition of $K$ to get

    K = (\alpha_1, \alpha_2, \ldots, \alpha_k) D (\beta_1, \beta_2, \ldots, \beta_k)^T    (1)

where $\alpha_i$ and $\beta_i$ are the standardised eigenvectors of $KK^T$ and $K^T K$ respectively, and $D$ is the diagonal matrix of eigenvalues. Then the first canonical correlation vectors (those which give greatest correlation) are given by

    w_1 = \Sigma_{11}^{-1/2} \alpha_1    (2)
    w_2 = \Sigma_{22}^{-1/2} \beta_1    (3)

with subsequent canonical correlation vectors defined in terms of the subsequent eigenvectors, $\alpha_i$ and $\beta_i$.
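For concreteness, the classical procedure of equations (1)-(3) may be sketched in a few lines of numpy; the function name, the use of sample covariances, and the eigendecomposition route to the inverse square roots are our own illustrative choices rather than details from the original presentation.

    import numpy as np

    def linear_cca(X1, X2):
        """Classical CCA sketch following equations (1)-(3).

        X1, X2: (N, d1) and (N, d2) arrays of paired samples (rows are iid draws).
        Returns the first canonical correlation vectors w1, w2 and the canonical
        correlations (the diagonal of D).
        """
        Xc1, Xc2 = X1 - X1.mean(0), X2 - X2.mean(0)
        n = len(X1)
        S11 = Xc1.T @ Xc1 / (n - 1)          # Sigma_11
        S22 = Xc2.T @ Xc2 / (n - 1)          # Sigma_22
        S12 = Xc1.T @ Xc2 / (n - 1)          # Sigma_12

        def inv_sqrt(S):
            # inverse square root of a symmetric positive definite matrix
            vals, vecs = np.linalg.eigh(S)
            return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

        S11_is, S22_is = inv_sqrt(S11), inv_sqrt(S22)
        K = S11_is @ S12 @ S22_is            # the matrix decomposed in eq. (1)
        alpha, D, beta_T = np.linalg.svd(K)
        w1 = S11_is @ alpha[:, 0]            # eq. (2)
        w2 = S22_is @ beta_T[0, :]           # eq. (3)
        return w1, w2, D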

We have recently extended CCA into the kernel domain [2]. Consider mapping the input data to a high dimensional (perhaps infinite dimensional) feature space, $F$. Now, assuming the data has been centred (we will actually use the same trick as [5] to centre the data later), the covariance matrices in feature space are defined by

    \Sigma_{11} = E\{\Phi(x_1)\Phi(x_1)^T\}
    \Sigma_{22} = E\{\Phi(x_2)\Phi(x_2)^T\}
    \Sigma_{12} = E\{\Phi(x_1)\Phi(x_2)^T\}

and we wish to find those values $w_1$ and $w_2$ which will maximise $w_1^T \Sigma_{12} w_2$ subject to the constraints $w_1^T \Sigma_{11} w_1 = 1$ and $w_2^T \Sigma_{22} w_2 = 1$. In practice we will approximate $\Sigma_{12}$ with the sample average $\frac{1}{M}\sum_i \Phi(x_{1i})\Phi(x_{2i})^T$.

We have shown [2] that, using $(K_1)_{ij} = \Phi^T(x_{1i})\Phi(x_{1j})$ and $(K_2)_{ij} = \Phi^T(x_{2i})\Phi(x_{2j})$, we require to maximise $\alpha^T K_1 K_2^T \beta$ subject to the constraints $\alpha^T K_1 K_1^T \alpha = 1$ and $\beta^T K_2 K_2^T \beta = 1$. Therefore, if we define $\Gamma_{11} = K_1 K_1^T$, $\Gamma_{22} = K_2 K_2^T$ and $\Gamma_{12} = K_1 K_2^T$, we solve the problem in the usual way: by forming the matrix $K = \Gamma_{11}^{-1/2} \Gamma_{12} \Gamma_{22}^{-1/2}$ and performing a singular value decomposition on it as before (this optimisation is applicable for all symmetric matrices; Theorem A.9.2, [3]) to get

    K = (\gamma_1, \gamma_2, \ldots, \gamma_k) D (\theta_1, \theta_2, \ldots, \theta_k)^T    (4)

where $\gamma_i$ and $\theta_i$ are again the standardised eigenvectors of $KK^T$ and $K^T K$ respectively, and $D$ is the diagonal matrix of eigenvalues. Then the first canonical correlation vectors in feature space are given by

    \alpha_1 = \Gamma_{11}^{-1/2} \gamma_1    (5)
    \beta_1 = \Gamma_{22}^{-1/2} \theta_1    (6)

with subsequent canonical correlation vectors defined in terms of the subsequent eigenvectors, $\gamma_i$ and $\theta_i$. Now for any new value $x_1$, we may calculate

    w_1 \cdot \Phi(x_1) = \sum_i \alpha_i \Phi(x_{1i}) \cdot \Phi(x_1) = \sum_i \alpha_i K_1(x_{1i}, x_1)    (7)

which then requires to be centered as before. We see that we are again performing a dot product in feature space (it is actually calculated in the subspace formed from projections of the $x_{1i}$).

The method requires a matrix inversion, and the data sets may be such that one data point is repeated (or almost repeated), leading to a singularity or badly conditioned matrices. One solution is to add $\mu I$, where $I$ is the identity matrix, to $\Gamma_{11}$ and $\Gamma_{22}$, a method which was also used in [4]. This gives robust and reliable solutions; however, it detracts from the precise analytical foundations of the method. The difficulty comes about because the method, like other unsupervised kernel methods, does not have the automatic identification of important data points which the supervised method of Support Vector Machines has. We will use the ideas from Relevance Vector Machines [6] to regain this feature.
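The regularised feature-space computation of equations (4)-(7) can be sketched in the same style. The Gaussian (RBF) kernel, the function names, and the omission of the centring step are illustrative assumptions on our part; the experiments reported later use a different (negative distance) kernel.

    import numpy as np

    def rbf(X, Y, sigma=1.0):
        """Gaussian kernel matrix; an illustrative choice of kernel."""
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def kernel_cca(X1, X2, mu=1e-3, sigma=1.0):
        """Regularised kernel CCA sketch following equations (4)-(7).

        K1, K2 play the role of (K1)_ij = Phi(x1i).Phi(x1j) etc.; mu*I is the
        ridge term added to Gamma_11 and Gamma_22 to avoid ill-conditioning.
        Centring in feature space is omitted here for brevity.
        """
        N = len(X1)
        K1, K2 = rbf(X1, X1, sigma), rbf(X2, X2, sigma)
        G11 = K1 @ K1.T + mu * np.eye(N)
        G22 = K2 @ K2.T + mu * np.eye(N)
        G12 = K1 @ K2.T

        def inv_sqrt(S):
            vals, vecs = np.linalg.eigh(S)
            return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

        G11_is, G22_is = inv_sqrt(G11), inv_sqrt(G22)
        gamma, D, theta_T = np.linalg.svd(G11_is @ G12 @ G22_is)   # eq. (4)
        alpha = G11_is @ gamma[:, 0]                               # eq. (5)
        beta = G22_is @ theta_T[0, :]                              # eq. (6)

        def project1(x):
            # w1 . Phi(x) = sum_i alpha_i K1(x1i, x), eq. (7)
            return rbf(X1, np.atleast_2d(x), sigma)[:, 0] @ alpha

        return alpha, beta, D, project1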

3. Relevance Vector Regression

The vectors found by the Relevance Vector Machine method are prototypical vectors of the class types, which is a very different concept from the Support Vectors, whose positions are always at the edges of clusters, thereby helping to delimit one cluster from another. Relevance Vector Regression uses a data set of input-target pairs $\{x_i, t_i\}_{i=1}^N$. It assumes that the machine can form an output $y$ from

    y(x) = \sum_{i=1}^{N} w_i K(x, x_i) + w_0    (8)

and that $p(t|x)$ is Gaussian, $N(y(x), \sigma^2)$. The likelihood of the model is given by

    p(t | w, \sigma) = (2\pi\sigma^2)^{-N/2} \exp\{ -\frac{1}{2\sigma^2} \|t - \Phi w\|^2 \}    (9)

where $t = (t_1, t_2, \ldots, t_N)$, $w = (w_0, w_1, \ldots, w_N)$, and $\Phi$ is the $N \times (N+1)$ design (data) matrix. To prevent overfitting, an automatic relevance detection prior is set over the weights

    p(w | \alpha) = \prod_{i=0}^{N} N(w_i \,|\, 0, \alpha_i^{-1})    (10)

To find the maximum likelihood of the data set with respect to $\alpha$ and $\sigma^2$, we iterate between finding the mean and variance of the weight vector and then calculating new values for $\alpha$ and $\sigma^2$ using these statistics. We find that many of the $\alpha_i$ tend to infinity, which means that the corresponding weights tend to 0. In detail, we have that the posterior of the weights is given by

    p(w | t, \alpha, \sigma^2) \propto |\Sigma|^{-1/2} \exp\{ -\frac{1}{2} (w - \mu)^T \Sigma^{-1} (w - \mu) \}    (11)

where

    \Sigma = (K^T B K + A)^{-1}, \quad \mu = \Sigma K^T B t    (12)

with $A = \mathrm{diag}(\alpha_0, \alpha_1, \ldots, \alpha_N)$ and $B = \sigma^{-2} I_N$. If we integrate out the weights, we obtain the marginal likelihood

    p(t | \alpha, \sigma^2) \propto |B^{-1} + K A^{-1} K^T|^{-1/2} \exp\{ -\frac{1}{2} t^T (B^{-1} + K A^{-1} K^T)^{-1} t \}    (13)

which can be differentiated to give, at the optimum,

    \alpha_i^{new} = \frac{\gamma_i}{\mu_i^2}    (14)

    (\sigma^2)^{new} = \frac{\|t - K\mu\|^2}{N - \sum_i \gamma_i}    (15)

where $\gamma_i = 1 - \alpha_i \Sigma_{ii}$.
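The re-estimation loop implied by equations (10)-(15) can be sketched as follows; the initialisation, the cap on the precisions, and the guard on the noise update are our own numerical safeguards rather than part of the published procedure.

    import numpy as np

    def rvm_regression(K, t, n_iter=100, alpha_cap=1e9):
        """Iterative re-estimation for Relevance Vector Regression, eqs (10)-(15).

        K: (N, M) design matrix Phi (here M = N + 1: kernel columns plus a bias
        column, as in eq. (8)); t: (N,) targets.
        """
        N, M = K.shape
        alpha = np.ones(M)                  # per-weight precisions, prior of eq. (10)
        sigma2 = 0.1 * np.var(t) + 1e-6     # initial noise variance guess

        for _ in range(n_iter):
            A = np.diag(alpha)
            B = np.eye(N) / sigma2                       # B = sigma^{-2} I_N
            Sigma = np.linalg.inv(K.T @ B @ K + A)       # posterior covariance, eq. (12)
            mu = Sigma @ K.T @ B @ t                     # posterior mean, eq. (12)
            gamma = 1.0 - alpha * np.diag(Sigma)         # gamma_i = 1 - alpha_i Sigma_ii
            alpha = np.minimum(gamma / (mu ** 2 + 1e-12), alpha_cap)     # eq. (14)
            denom = max(N - gamma.sum(), 1e-6)           # guard: keep denominator positive
            sigma2 = np.sum((t - K @ mu) ** 2) / denom   # eq. (15)

        relevant = alpha < alpha_cap        # weights whose precision stayed finite
        return mu, alpha, sigma2, relevant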

4. Application to CCA

We must first describe CCA in probabilistic terms. Consider two data sets $X_1 = \{x_{1i}, i \in 1 \ldots N\}$ and $X_2 = \{x_{2i}, i \in 1 \ldots N\}$ defined by probability density functions $p_1(x_1)$ and $p_2(x_2)$. Let there be some underlying relationship so that $y_1 = w_1^T x_1 + \epsilon_1$ is the canonical correlate corresponding to $y_2 = w_2^T x_2 + \epsilon_2$. Then $y_1$ can be used to predict the value of $y_2$ and vice versa. Let $y_1 = \rho y_2 + e$, where $\rho$ is the correlation coefficient. Then, from the perspective of $x_1$, the targets $t_1$ are given by the other input $x_2$, i.e.

    t_1 = w_2 K_2    (16)

So we are using the current value of $w_2$ to determine the target for updating the posterior probabilities for $w_1$. Similarly, we create a target for updating the probabilities for $w_2$ using the current value of $w_1$. We then simply alternate between the two Relevance Vector Machines in a way which is reminiscent of the EM algorithm: from $x_1$'s perspective, calculation of the new values of $w_2$ corresponds to the E-step while the calculation of the new values of $w_1$ corresponds to the M-step, and vice versa from $x_2$'s perspective.

We have carried out experiments on real and artificial data sets: for example, we generate data according to the prescription

    x_{11} = \sin\theta + \mu_1    (17)
    x_{12} = \cos\theta + \mu_2    (18)
    x_{21} = \theta + \mu_3    (19)
    x_{22} = \theta + \mu_4    (20)

where $\theta$ is drawn from a uniform distribution in $[-\pi, \pi]$ and the $\mu_i, i = 1, \ldots, 4$, are drawn from the zero mean Gaussian distribution $N(0, 0.1)$. Equations (17) and (18) define a circular manifold in the two dimensional input space, while equations (19) and (20) define a linear manifold within the input space, where each manifold is only approximate due to the presence of noise ($\mu_i, i = 1, \ldots, 4$). Thus $x_1 = \{x_{11}, x_{12}\}$ lies on or near the circular manifold while $x_2 = \{x_{21}, x_{22}\}$ lies on or near the line. We have previously shown [2] that the kernel method can find greater than linear correlations in such a data set.

We now report that the current Relevance Vector Kernel method also finds greater than linear correlations, but generally with a very sparse representation. Typically the resulting vectors will have zeros in all but one position, with the single non-zero value being very strong. Occasionally one vector, e.g. $w_1$, will have a single non-zero value whose correlation with all the other points will be seen in its vector $w_2$. However, in all cases we find a very strong correlation. Figure 1 shows the outputs of a trained CCA-RVM network; the high correlation between $y_1$ and $y_2$ is clear. We may also use the CCA-RVM network to find stereo correspondences as before [1]; however, this task is relatively easy and does not use the full power of the CCA-RVM network. A rough sketch of the alternating scheme on the artificial data is given below.
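The following sketch generates the artificial data of equations (17)-(20) and alternates two Relevance Vector Machines, reusing the rbf and rvm_regression sketches given above. The reading of equation (16) as targets formed from the other machine's outputs, the rescaling of the targets to unit variance, and the fixed number of alternations are our own assumptions and are not taken from the paper.

    import numpy as np

    # Artificial data of equations (17)-(20): a noisy circle paired with a noisy line.
    rng = np.random.default_rng(0)
    N = 150
    theta = rng.uniform(-np.pi, np.pi, N)
    X1 = np.column_stack([np.sin(theta), np.cos(theta)]) + rng.normal(0, 0.1, (N, 2))
    X2 = np.column_stack([theta, theta]) + rng.normal(0, 0.1, (N, 2))

    # Design matrices of eq. (8): kernel columns plus a bias column.
    # An RBF kernel is used here purely for illustration.
    K1 = np.column_stack([rbf(X1, X1), np.ones(N)])
    K2 = np.column_stack([rbf(X2, X2), np.ones(N)])

    # Alternate the two RVMs: each machine's output provides the other's targets.
    # Rescaling the targets to unit variance is our own addition to keep the
    # alternation from collapsing towards the zero solution.
    w2 = rng.normal(size=N + 1) * 0.01
    for _ in range(10):
        t1 = K2 @ w2
        t1 = t1 / (t1.std() + 1e-12)
        w1, _, _, _ = rvm_regression(K1, t1)
        t2 = K1 @ w1
        t2 = t2 / (t2.std() + 1e-12)
        w2, _, _, _ = rvm_regression(K2, t2)

    y1, y2 = K1 @ w1, K2 @ w2
    print("correlation of y1 and y2:", np.corrcoef(y1, y2)[0, 1])
    print("non-negligible weights:", np.sum(np.abs(w1) > 1e-6), np.sum(np.abs(w2) > 1e-6))

With these caveats, the printed correlation and the counts of non-negligible weights are intended only to illustrate the kind of sparse, highly correlated outputs described above.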

5. Conclusion

We have previously developed both neural [1] and kernel [2] methods for finding correlations which are greater than linear correlations in a data set. However, the previous kernel method found weight vectors which depended on the whole data set. Using the Relevance Vector Machine to determine the kernel correlations, we still get greater than linear correlations but use only a small number of data points to do so. For an organism which must integrate information from different sensory modalities (or indeed from the same modality but two different organs, e.g. two eyes), this is an important saving, since the organism need not maintain all its previous memories from the two data streams. The method automatically identifies the most relevant memories which maximise the correlation in feature space.

Figure 1: The figure shows a graph of y1 (horizontal axis) against y2 (vertical axis) for a Relevance Vector CCA network trained on the negative distance kernel. The high correlation is obvious.

References

[1] P. L. Lai and C. Fyfe. A neural network implementation of canonical correlation analysis. Neural Networks, 12(10):1391-1397, Dec. 1999.
[2] P. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 2001. (Accepted for publication.)
[3] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, 1979.
[4] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems 11, 1999.
[5] B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola. Input space vs feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10:1000-1017, 1999.
[6] M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12. MIT Press, 2000.
[7] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.