Immediate Reward Reinforcement Learning for Projective Kernel Methods


ESANN'2007 proceedings - European Symposium on Artificial Neural Networks, Bruges (Belgium), 25-27 April 2007, d-side publi., ISBN 2-930307-07-2.

Colin Fyfe (1) and Pei Ling Lai (2)
1. Applied Computational Intelligence Research Unit, The University of Paisley, Scotland. email: colin.fyfe@paisley.ac.uk
2. Southern Taiwan University of Technology, Taiwan. email: pei ling lai@hotmail.com

Abstract. We extend a reinforcement learning algorithm which has previously been shown to cluster data. We have previously applied the method to the unsupervised projection methods of principal component analysis, exploratory projection pursuit and canonical correlation analysis. We now show how the same method can be used in feature spaces to perform kernel principal component analysis and kernel canonical correlation analysis.

1 Introduction

Unsupervised learning and reinforcement learning research have tended, in the main, to be two entirely separate streams of adaptive methods. This is somewhat surprising in that the former is modelled on biological processes believed to be utilised in animal and human brains, while the latter attempts to create the learning processes which an entity would use to investigate its environment in order to maximise its long-term utility. A notable exception to this dearth of interaction is [4], which uses a (non-standard) reinforcement learning method to cluster data sets in an unsupervised manner. We have recently [1] investigated this reinforcement learning algorithm in order to perform a topology preserving clustering of the data and to perform linear projections. In this paper, we extend the method to kernel spaces and show that we can use it to perform kernel principal component analysis and kernel canonical correlation analysis.

2 Immediate Reward Reinforcement Learning

Williams [7] investigated a particular form of reinforcement learning in which the reward for an action is immediate, which is somewhat different from mainstream reinforcement learning [2]. He considered a stochastic learning unit in which the probability of any specific output is a parameterised function of its input, x. For the i-th unit this gives

P(y_i = \zeta \mid w_i, x) = f(w_i, x)    (1)

where, for example,

f(w_i, x) = \frac{1}{1 + \exp(-\|w_i - x\|^2)}    (2)
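For concreteness, the following is a minimal sketch (not from the paper; plain NumPy, with illustrative names) of such a stochastic unit: the firing probability is the logistic function of the squared distance in Eq. (2), and the output is a single Bernoulli draw.

    import numpy as np

    def output_probability(w, x):
        """P(y = 1 | w, x) as in Eq. (2)."""
        return 1.0 / (1.0 + np.exp(-np.sum((w - x) ** 2)))

    def stochastic_unit(w, x, rng=None):
        """Draw one stochastic binary output from the unit."""
        rng = rng or np.random.default_rng()
        return int(rng.random() < output_probability(w, x))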

Williams [7] considers the learning rule

\Delta w_{ij} = \alpha_{ij} (r_{i,\zeta} - b_{ij}) \frac{\partial \ln P(y_i = \zeta \mid w_i, x)}{\partial w_{ij}}    (3)

where \alpha_{ij} is the learning rate, r_{i,\zeta} is the reward for the unit outputting \zeta, and b_{ij} is a reinforcement baseline which in the following we take to be the reinforcement comparison, b_{ij} = \bar{r} = \frac{1}{K} \sum r_{i,\zeta}, where K is the number of times this unit has output \zeta. Theorem 1 of [7] shows that this learning rule causes weight changes which maximise the expected reward. [7] gave the example of a Bernoulli unit in which P(y_i = 1) = p_i and so P(y_i = 0) = 1 - p_i. [4] applies the Bernoulli model to (unsupervised) clustering, on the basis of which we have recently created a topology preserving algorithm. We are more interested here in the Gaussian learner, from which we draw a sample y ~ N(m_i, \beta_i^2), the Gaussian distribution with mean m_i and variance \beta_i^2. Each learner has two parameters to adapt, its mean and its variance. The learning rules can be derived [7] as

\Delta m = \alpha_m (r - \bar{r}) \frac{y - m}{\beta^2}    (4)

\Delta \beta = \alpha_\beta (r - \bar{r}) \frac{\|y - m\|^2 - \beta^2}{\beta^3}    (5)

We have investigated such algorithms in the context of principal component analysis, exploratory projection pursuit and canonical correlation analysis [1]. In this paper, we apply the technique in kernel space and show that we may perform kernel principal component analysis [5] and kernel canonical correlation analysis [3] with immediate reward reinforcement learning.
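The following is a minimal sketch of one adaptation step of such a Gaussian learner under Eqs. (4)-(5). It assumes plain NumPy, scalar learning rates and a caller-supplied reward function and baseline; none of these implementation details are prescribed by the paper.

    import numpy as np

    def gaussian_unit_step(m, beta, reward_fn, r_bar, alpha_m=0.01, alpha_beta=0.001, rng=None):
        """One step of a Gaussian learner: sample y ~ N(m, beta^2 I), obtain the
        immediate reward, and update the mean and width by Eqs. (4)-(5)."""
        rng = rng or np.random.default_rng()
        m = np.asarray(m, dtype=float)
        y = m + beta * rng.standard_normal(m.shape)   # draw a sample from the learner
        r = reward_fn(y)                              # immediate reward for this sample
        m_new = m + alpha_m * (r - r_bar) * (y - m) / beta**2                               # Eq. (4)
        beta_new = beta + alpha_beta * (r - r_bar) * (np.sum((y - m)**2) - beta**2) / beta**3   # Eq. (5)
        return m_new, beta_new, r

The baseline \bar{r} is maintained by the caller, e.g. as the mean of the rewards received so far, in line with the reinforcement comparison above.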

3 Unsupervised Kernel Methods

There has been a great deal of recent interest in using kernel methods both in the supervised learning paradigm [6] and in the unsupervised paradigm [5, 3]. Kernel methods first map the data into a feature space in which a linear operation (such as principal component analysis or canonical correlation analysis) is performed. If the mapping into the feature space is non-linear, any linear operation in the feature space corresponds to a nonlinear operation in the original data space. Since we are particularly interested in applying the reinforcement learning paradigm to unsupervised learning, we illustrate its use in kernel space for kernel principal component analysis and kernel canonical correlation analysis.

3.1 Kernel Principal Component Analysis (KPCA)

Let the nonlinear mapping be \phi : \Xi \to F, where \Xi is the original data space and F is the feature space to which the elements of \Xi are mapped. Kernel Principal Component Analysis then searches for the filter w = \arg\max_w E_\Xi[(w^T \phi(x))^2], i.e. the weight vector in feature space which maximises the variance of the (centered) projections (under the constraint of orthonormality of the weight vectors). The crucial insight is that w may be defined in terms of the points \phi(x) which span the subspace in which w lies. Thus w = \sum_{i=1}^N \alpha_i \phi(x_i), and the problem reduces to finding the weight vector \alpha in feature space. It is readily shown [5] that \alpha is the eigenvector of the scalar product matrix K, K_{ij} = \phi(x_i)^T \phi(x_j). Thus finding the first principal component filter w in the feature space can be done by finding \alpha = \arg\max_\beta \beta^T K \beta. Actually, in order to normalise w properly, we would need to normalise \alpha appropriately; in the following we ignore this issue since we are only interested in identifying structure in the data sets.

With the methodology defined above, we draw \alpha from the Gaussian distribution N(m, \beta I), where now m is N \times 1 and I is the N \times N identity matrix. At each iteration, we select a specific input, say x_t, and calculate the reward as r = \alpha^T K_t \alpha_t, where K_t is the t-th column of K, i.e. that corresponding to x_t, and \alpha_t is the element of \alpha corresponding to x_t.

To illustrate this we create a two dimensional data set of 100 samples: the first quarter are centered at (2,2), the next quarter at (2,-2), and so on, as shown in Figure 1. We use a squared exponential kernel so that K_{ij} = \exp(-\gamma \|x_i - x_j\|^2). We then centre this representation [5] using \tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N, where 1_N is the N \times N matrix in which every element is 1/N. Results from simulations with \gamma = 0.1 and \gamma = 1 are shown in Figure 1.

Since kernel methods in general require the construction of the scalar product (Gram) matrix K, they are often used in batch mode rather than in the online mode above. The reward function then becomes r = \alpha^T K \alpha, i.e. we calculate the reward over the whole data set at each iteration. With all other parameters held constant and \gamma = 0.1, we show the results of one such simulation on the same data in the last diagram of Figure 1, though the simulation had to be run for a slightly longer time.
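The online procedure above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the learning rates, the running reward baseline, the clipping of \beta and the per-step renormalisation of the mean vector are choices made here for the sketch; the paper explicitly leaves the proper normalisation of \alpha aside.

    import numpy as np

    def centred_gram(X, gamma):
        """Squared exponential Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2),
        centred in feature space as in [5]."""
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-gamma * sq)
        N = len(X)
        one_N = np.full((N, N), 1.0 / N)
        return K - one_N @ K - K @ one_N + one_N @ K @ one_N

    def rl_kpca(X, gamma=0.1, iters=20000, alpha_m=0.1, alpha_b=0.01, seed=0):
        """Online immediate-reward learning of the first KPCA weight vector alpha."""
        rng = np.random.default_rng(seed)
        K = centred_gram(np.asarray(X, dtype=float), gamma)
        N = K.shape[0]
        m = rng.standard_normal(N) * 0.1              # mean of the Gaussian learner
        beta, r_bar = 1.0, 0.0
        for _ in range(iters):
            a = m + beta * rng.standard_normal(N)     # sample alpha around the mean m
            t = rng.integers(N)                       # choose one input x_t
            r = (a @ K[:, t]) * a[t]                  # immediate reward r = alpha^T K_t alpha_t
            diff = a - m
            m = m + alpha_m * (r - r_bar) * diff / beta**2                             # Eq. (4)
            beta = beta + alpha_b * (r - r_bar) * (np.sum(diff**2) - beta**2) / beta**3  # Eq. (5)
            beta = float(np.clip(beta, 1e-3, 10.0))   # keep the exploration width sensible (a choice made here)
            r_bar = 0.99 * r_bar + 0.01 * r           # running reward baseline
            m /= np.linalg.norm(m)                    # keep alpha bounded; the paper leaves normalisation aside
        return m

For the four cluster data set described above, calling rl_kpca(X, gamma=0.1) and plotting the returned weights against the sample index corresponds to the quantity plotted in Figure 1.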

Fig. 1: Top left: the two dimensional data set. Top right: the \alpha weights in a simulation in which \gamma = 0.1. Bottom left: as above but with \gamma = 1. Bottom right: the batch algorithm (reward computed over the whole data set), partly converged.

3.2 Kernel Canonical Correlation Analysis

Canonical correlation analysis is a standard statistical technique for investigating two data sets which we believe have some underlying (linear) relationship. It has also been extended to utilise kernel techniques [3]. Let the covariance matrices in feature space be defined by

\Sigma_{11} = E\{(\Phi(x_1) - \mu_1)(\Phi(x_1) - \mu_1)^T\}
\Sigma_{22} = E\{(\Phi(x_2) - \mu_2)(\Phi(x_2) - \mu_2)^T\}
\Sigma_{12} = E\{(\Phi(x_1) - \mu_1)(\Phi(x_2) - \mu_2)^T\}

where now \mu_i = E(\Phi(x_i)) for i = 1, 2. Let us assume for the moment that the data has been centred in feature space (we actually use the same trick as [5] again to centre the data). Then we wish to find those values w_1 and w_2 which maximise w_1^T \Sigma_{12} w_2 subject to the constraints w_1^T \Sigma_{11} w_1 = 1 and w_2^T \Sigma_{22} w_2 = 1.

In [3], we define (K_1)_{ij} = \Phi^T(x_{1i}) \Phi(x_{1j}) and (K_2)_{ij} = \Phi^T(x_{2i}) \Phi(x_{2j}), and so maximise \alpha^T K_1 K_2^T \beta subject to the constraints \alpha^T K_1 K_1^T \alpha = 1 and \beta^T K_2 K_2^T \beta = 1. Therefore, if we define \Gamma_{11} = K_1 K_1^T, \Gamma_{22} = K_2 K_2^T and \Gamma_{12} = K_1 K_2^T, we solve the problem in the usual way: by forming the matrix K = \Gamma_{11}^{-1/2} \Gamma_{12} \Gamma_{22}^{-1/2} and performing a singular value decomposition on it as before to get

K = (\gamma_1, \gamma_2, ..., \gamma_k) D (\theta_1, \theta_2, ..., \theta_k)^T    (6)

where \gamma_i and \theta_i are again the standardised eigenvectors of K K^T and K^T K respectively and D is the diagonal matrix of eigenvalues. The first canonical correlation vectors in feature space are then given by

\alpha_1 = \Gamma_{11}^{-1/2} \gamma_1    (7)

\beta_1 = \Gamma_{22}^{-1/2} \theta_1    (8)

with subsequent canonical correlation vectors defined in terms of the subsequent eigenvectors, \gamma_i and \theta_i.

Again, we are only interested in proving the reinforcement learning method in this paper, and so we create a 100 sample artificial data set in which the first 80 samples in each stream are Gaussian clusters of standard deviation 0.3 around centres in each space and the last 20 samples are also Gaussian clusters of standard deviation 1 around different centres in each space. We then randomly initialise \gamma and \theta, generate a reward r = \gamma^T K \theta, and use this to update \gamma and \theta. Results are shown in Figure 2. We see that the two distinct regimes incorporating the different relationships are clearly visible. Again, we note that we normalise the weight vectors \gamma and \theta to length 1 when actually we should be normalising w_1 and w_2; however this only affects the magnitude of the results, not the identification of different regimes in the data.
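A batch sketch of this reinforcement formulation is given below, under stated assumptions: plain NumPy/SciPy, a fixed exploration width \beta, a small ridge term added before taking the matrix inverse square roots, and a running reward baseline; none of these choices are specified in the paper. K1 and K2 are the (centred) Gram matrices of the two data streams.

    import numpy as np
    from scipy.linalg import fractional_matrix_power

    def rl_kcca(K1, K2, iters=20000, lr=0.05, beta=1.0, ridge=1e-6, seed=0):
        """Immediate-reward learning of the first kernel canonical correlation directions."""
        rng = np.random.default_rng(seed)
        N = K1.shape[0]
        G11 = K1 @ K1.T + ridge * np.eye(N)           # small ridge for invertibility (a choice made here)
        G22 = K2 @ K2.T + ridge * np.eye(N)
        K = fractional_matrix_power(G11, -0.5) @ (K1 @ K2.T) @ fractional_matrix_power(G22, -0.5)
        K = np.real(K)                                # discard tiny numerical imaginary parts
        mg = rng.standard_normal(N); mg /= np.linalg.norm(mg)   # mean of the gamma learner
        mt = rng.standard_normal(N); mt /= np.linalg.norm(mt)   # mean of the theta learner
        r_bar = 0.0
        for _ in range(iters):
            g = mg + beta * rng.standard_normal(N)    # sample gamma
            t = mt + beta * rng.standard_normal(N)    # sample theta
            r = g @ K @ t                             # immediate reward r = gamma^T K theta
            mg = mg + lr * (r - r_bar) * (g - mg) / beta**2    # Eq. (4) applied to each learner
            mt = mt + lr * (r - r_bar) * (t - mt) / beta**2
            mg /= np.linalg.norm(mg)                  # length-1 normalisation, as in the text
            mt /= np.linalg.norm(mt)
            r_bar = 0.99 * r_bar + 0.01 * r           # running reward baseline
        return mg, mt

The returned weight vectors, plotted against the sample index of each data stream, correspond to the quantities shown in Figure 2.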

Fig. 2: The weights into the two data streams clearly identify the two clusters in the data set, each pair of which has an underlying close but nonlinear relationship. The first 80 samples from both data streams are of one type while the last 20 are of the other type.

4 Conclusion

We have taken an existing method for immediate reward reinforcement learning, applied the algorithm to two linear methods in feature space, and shown that the resulting method is clearly able to identify nonlinear structure in data sets. In this paper, we have merely introduced the method. Further research is needed to compare these methods with existing (both neural and standard statistical) methods. Further, we have discussed only the first principal component (resp. canonical correlation) projection; clearly, a Gram-Schmidt reduction is possible, but it is an open question whether the reinforcement learning paradigm might facilitate a different approach to subsequent projections. Given the early success of the method, we are encouraged to investigate more mappings with it. We will investigate both linear manifold methods such as Independent Component Analysis [3] as well as nonlinear manifolds.

References

[1] C. Fyfe and P. L. Lai. Reinforcement learning reward functions for unsupervised learning. In 4th International Symposium on Neural Networks, ISNN 2007, 2007.
[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
[3] P. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(5):365-377, 2001.
[4] A. Likas. A reinforcement learning approach to on-line clustering. Neural Computation, 1999.
[5] B. Scholkopf, A. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[6] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
[7] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.