Kernels for Multi-task Learning

Charles A. Micchelli
Department of Mathematics and Statistics, State University of New York, The University at Albany, 1400 Washington Avenue, Albany, NY 12222, USA

Massimiliano Pontil
Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK

Abstract

This paper provides a foundation for multi-task learning using reproducing kernel Hilbert spaces of vector-valued functions. In this setting, the kernel is a matrix-valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix-valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model relations between the tasks and present linear multi-task learning algorithms. Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation.

1 Introduction

This paper addresses the problem of learning a vector-valued function $f : X \to Y$, where $X$ is a set and $Y$ a Hilbert space. We focus on linear spaces of such functions that admit a reproducing kernel, see [7]. This study is valuable from a variety of perspectives. Our main motivation is the practical problem of multi-task learning, where we wish to learn many related regression or classification functions simultaneously, see e.g. [3, 5, 6]. For instance, image understanding requires the estimation of multiple binary classifiers simultaneously, where each classifier is used to detect a specific object. Specific examples include locating a car from a pool of possibly similar objects, which may include cars, buses, motorbikes, faces, people, etc. Some of these objects or tasks may share common features, so it would be useful to relate their classifier parameters. Other examples include multi-modal human-computer interfaces, which require the modeling of both, say, speech and vision, or tumor prediction in bioinformatics from multiple micro-array datasets.

Moreover, the spaces of vector-valued functions described in this paper may be useful for learning continuous transformations. In this case, $X$ is a space of parameters and $Y$ a Hilbert space of functions. For example, in face animation $x \in X$ represents the pose and expression of a face and $Y$ is a space of functions, although in practice one considers discrete images, in which case $y \in Y$ is a finite-dimensional vector whose components are associated to the image pixels.

Other problems, such as image morphing, can also be formulated as vector-valued learning.

When $Y$ is an $n$-dimensional Euclidean space, one straightforward approach to learning a vector-valued function $f = (f_\ell : \ell \in \mathbb{N}_n)$ consists in separately representing each component of $f$ by a linear space of smooth functions and then learning these components independently, for example by minimizing some regularized error functional. This approach does not capture relations between the components of $f$ (which are associated to tasks or pixels in the examples above) and should not be the method of choice when such relations occur. In this paper we investigate how kernels can be used to represent vector-valued functions. We propose to do this by using a matrix-valued kernel that reflects the interaction amongst the components of $f$. This paper provides a foundation for this approach. For example, in the case of support vector machines (SVMs) [10], appropriate choices of the matrix-valued kernel implement a trade-off between a large margin for each per-task SVM and a large margin for combinations of these SVMs, e.g. their average.

The paper is organized as follows. In section 2 we formalize the above observations and show that reproducing kernel Hilbert spaces (RKHS) of vector-valued functions admit a kernel whose values are bounded linear operators on the output space, and we characterize the form of some of these operators in section 3. Finally, in section 4 we provide a novel proof of the representer theorem, based on the notion of minimal norm interpolation, and present linear multi-task learning algorithms.

2 RKHS of vector-valued functions

Let $Y$ be a real Hilbert space with inner product $(\cdot,\cdot)$, $X$ a set, and $H$ a linear space of functions on $X$ with values in $Y$. We assume that $H$ is also a Hilbert space with inner product $\langle \cdot,\cdot \rangle$. We present two methods to enhance standard RKHS to vector-valued functions.

2.1 Matrix-valued kernels based on Aronszajn

The first approach extends the scalar case, $Y = \mathbb{R}$, in [2].

Definition 1. We say that $H$ is a reproducing kernel Hilbert space (RKHS) of functions $f : X \to Y$ when, for any $y \in Y$ and $x \in X$, the linear functional which maps $f \in H$ to $(y, f(x))$ is continuous on $H$.

We conclude from the Riesz Lemma (see, e.g., [1]) that, for every $x \in X$ and $y \in Y$, there is a linear operator $K_x : Y \to H$ such that

    $(y, f(x)) = \langle K_x y, f \rangle, \quad f \in H.$    (2.1)

For every $x, t \in X$ we also introduce the linear operator $K(x,t) : Y \to Y$ defined, for every $y \in Y$, by

    $K(x,t)y := (K_t y)(x).$    (2.2)

In the proposition below we state the main properties of the function $K$. To this end, we let $L(Y)$ be the set of all bounded linear operators from $Y$ into itself and, for every $A \in L(Y)$, we denote by $A^*$ its adjoint. We also use $L_+(Y)$ to denote the cone of positive semidefinite bounded linear operators, i.e. $A \in L_+(Y)$ provided that, for every $y \in Y$, $(y, Ay) \geq 0$. When this inequality is strict for all $y \neq 0$ we say that $A$ is positive definite. We also denote by $\mathbb{N}_m$ the set of positive integers up to and including $m$. Finally, we say that $H$ is normal provided there does not exist $x \in X$ and $y \in Y \setminus \{0\}$ such that the linear functional $(y, f(x))$ vanishes for all $f \in H$.

Proposition 1. If $K_x$ is defined, for every $x \in X$, by equation (2.1) and the kernel $K$ is defined by equation (2.2), then $K$ satisfies, for every $x, t \in X$, the following properties:

(a) For every $y \in Y$ we have that $K_x y \in H$ and $(K_x y)(t) = K(t,x)y$.

(b) $K(x,t) \in L(Y)$, $K(x,t) = K(t,x)^*$, and $K(x,x) \in L_+(Y)$. Moreover, $K(x,x)$ is positive definite for all $x \in X$ if and only if $H$ is normal.

(c) For any $m \in \mathbb{N}$, $\{x_j : j \in \mathbb{N}_m\} \subseteq X$ and $\{y_j : j \in \mathbb{N}_m\} \subseteq Y$ we have that

    $\sum_{j,\ell \in \mathbb{N}_m} (y_j, K(x_j, x_\ell)\, y_\ell) \geq 0.$    (2.3)

PROOF. We prove (a) by merely choosing $f = K_t y'$ in equation (2.1) to obtain that

    $(y, (K_t y')(x)) = \langle K_x y, K_t y' \rangle.$    (2.4)

Consequently, from this equation, we conclude that $K(x,t)$ admits an algebraic adjoint defined everywhere on $Y$ and, so, the uniform boundedness principle, see, e.g., [1, p. 48], implies that $K(x,t) \in L(Y)$ and $K(x,t)^* = K(t,x)$. Moreover, choosing $t = x$ in (a) proves that $K(x,x) \in L_+(Y)$. As for the positive definiteness of $K(x,x)$, merely use equation (2.1) and property (a). These remarks prove (b). As for (c), we again use property (a) to obtain that the left hand side of (2.3) equals $\|\sum_{j \in \mathbb{N}_m} K_{x_j} y_j\|^2 \geq 0$. This completes the proof.

For simplicity, we say that $K : X \times X \to L(Y)$ is a matrix-valued kernel (or simply a kernel if no confusion will arise) if it satisfies properties (b) and (c). So far we have seen that if $H$ is a RKHS of vector-valued functions then there exists such a kernel. In the spirit of the Moore-Aronszajn theorem for RKHS of scalar functions [2], it can be shown that if $K$ is a kernel then there exists a unique (up to an isometry) RKHS of functions from $X$ to $Y$ which admits $K$ as the reproducing kernel. The proof parallels the scalar case. Given a vector-valued function $f : X \to Y$ we associate to it a scalar-valued function $F$ defined by

    $F(x, y) := (y, f(x)), \quad x \in X,\ y \in Y.$    (2.5)

We let $\mathcal{H}$ be the linear space of all such functions. Thus, $\mathcal{H}$ consists of functions which are linear in their second variable. We make $\mathcal{H}$ into a Hilbert space by choosing $\langle F, G \rangle := \langle f, g \rangle$. It then follows that $\mathcal{H}$ is a RKHS with reproducing scalar-valued kernel defined, for all $x, t \in X$ and $y, y' \in Y$, by the formula

    $R((x, y), (t, y')) := (y, K(x, t)\, y').$    (2.6)
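As a small numerical illustration of property (c), the following minimal sketch forms the block Gram matrix of a simple matrix-valued kernel, a scalar Gaussian kernel multiplied by a fixed positive semidefinite matrix $A$, and checks that it is positive semidefinite; the dimensions, the choice of $A$, and the Gaussian width are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Sketch of Proposition 1(c): for a matrix-valued kernel K(x, t) = g(x, t) * A,
# with g a scalar Gaussian kernel and A an n x n positive semidefinite matrix,
# the m*n x m*n block Gram matrix [K(x_j, x_l)]_{j,l} must be positive
# semidefinite. All dimensions and parameter values below are illustrative.

rng = np.random.default_rng(0)
n, d, m = 2, 3, 8                      # output dim, input dim, number of points
C = rng.standard_normal((n, n))
A = C @ C.T                            # an arbitrary matrix in L_+(R^n)

def g(x, t, beta=0.5):
    """Scalar Gaussian kernel."""
    return np.exp(-beta * np.sum((x - t) ** 2))

def K(x, t):
    """Matrix-valued kernel K(x, t) = g(x, t) A, an n x n matrix."""
    return g(x, t) * A

X = rng.standard_normal((m, d))
G = np.block([[K(X[j], X[l]) for l in range(m)] for j in range(m)])
min_eig = np.linalg.eigvalsh((G + G.T) / 2).min()
print("min eigenvalue of block Gram matrix:", min_eig)   # >= 0 up to round-off
```

Any kernel of the form $g(x,t)A$ with $g$ a scalar kernel and $A \in L_+(\mathbb{R}^n)$ passes this check; such separable kernels are the simplest members of the families constructed in section 3.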

2.2 Feature map

The second approach uses the notion of feature map, see e.g. [9]. A feature map is a function $\Phi : X \to L(Y, W)$, where $W$ is a Hilbert space. A feature map representation of a kernel $K$ has the property that, for every $x, t \in X$ and $y, y' \in Y$, there holds the equation

    $(y, K(x,t)\, y') = \langle \Phi(x) y, \Phi(t) y' \rangle.$

From equation (2.4) we conclude that every kernel admits a feature map representation (a Mercer type theorem) with $W = H$ and $\Phi(x) = K_x$. With additional hypotheses on $X$ and $K$ this representation can take a familiar form: when $Y = \mathbb{R}^n$ the entries of the kernel can be written as

    $K_{\ell q}(x,t) = \langle \Phi_\ell(x), \Phi_q(t) \rangle, \quad \ell, q \in \mathbb{N}_n,$    (2.7)

for feature maps $\Phi_\ell : X \to W$.

Much more importantly, we may begin with feature maps $\Phi_\ell : X \to W$, $\ell \in \mathbb{N}_n$, where $W = \ell^2(\mathbb{N})$, this being the space of square summable sequences on $\mathbb{N}$. We wish to learn a function $f = (f_\ell : \ell \in \mathbb{N}_n) : X \to \mathbb{R}^n$ with $f_\ell(x) = \langle w, \Phi_\ell(x) \rangle$ for each $\ell \in \mathbb{N}_n$, where $w \in W$. We choose $\langle f, f \rangle := \langle w, w \rangle$ and conclude that the space of all such functions is a Hilbert space of functions from $X$ to $\mathbb{R}^n$ with kernel (2.7). These remarks connect feature maps to kernels and vice versa. Note that a kernel may have many feature maps which represent it, and that a feature map representation of a kernel may not be the appropriate way to write it for numerical computations.

3 Kernel construction

In this section we characterize a wide variety of kernels which are potentially useful for applications.

3.1 Linear kernels

A first natural question concerning RKHS of vector-valued functions is: if $X = \mathbb{R}^d$, what is the form of linear kernels? In the scalar case a linear kernel is a quadratic form, namely $K(x,t) = (x, Qt)$, where $Q$ is a $d \times d$ positive semidefinite matrix. We claim that any linear matrix-valued kernel $(K_{\ell q} : \ell, q \in \mathbb{N}_n)$ has the form

    $K_{\ell q}(x,t) = (B_\ell x, B_q t), \quad \ell, q \in \mathbb{N}_n,$    (3.8)

where the $B_\ell$ are $s \times d$ matrices for some $s \in \mathbb{N}$. To see that such a $K$ is a kernel, simply note that it is in the Mercer form (2.7) for the linear features $\Phi_\ell(x) = B_\ell x$. On the other hand, since any linear kernel has a Mercer representation with linear features, we conclude that all linear kernels have the form (3.8). A special case is provided by choosing $s = d$ and the $B_\ell$ to be diagonal matrices.

We note that the theory presented in section 2 can be naturally extended to the case where each component of the vector-valued function has a different input domain. This situation is important in multi-task learning, see e.g. [5]. To this end, we specify sets $X_\ell$, $\ell \in \mathbb{N}_n$, and functions $g_\ell : X_\ell \to \mathbb{R}$, and note that multi-task learning can be placed in the above framework by defining the input space $X := X_1 \times \cdots \times X_n$. We are then interested in vector-valued functions whose coordinates are given by $f_\ell = g_\ell \circ P_\ell$, where $\ell \in \mathbb{N}_n$ and $P_\ell$ is the projection operator defined, for every $x = (x_1, \dots, x_n) \in X$, by $P_\ell x := x_\ell$, $\ell \in \mathbb{N}_n$. For $\ell, q \in \mathbb{N}_n$ we suppose kernel functions $G_{\ell q} : X_\ell \times X_q \to \mathbb{R}$ are given such that the matrix-valued kernel whose elements are defined as $K_{\ell q}(x,t) := G_{\ell q}(P_\ell x, P_q t)$, $\ell, q \in \mathbb{N}_n$, satisfies properties (b) and (c) of Proposition 1. An example of this construction is provided again by linear functions. Specifically, we choose $X_\ell = \mathbb{R}^{d_\ell}$, where $d_\ell \in \mathbb{N}$, and $g_\ell(x_\ell) = (w, B_\ell x_\ell)$, where $w$ is a weight vector and the $B_\ell$ are matrices. In this case, the matrix-valued kernel $(K_{\ell q} : \ell, q \in \mathbb{N}_n)$ is given by

    $K_{\ell q}(x,t) = (B_\ell P_\ell x, B_q P_q t),$    (3.9)

which is of the form in equation (3.8) with linear features $\Phi_\ell(x) = B_\ell P_\ell x$.
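To make the linear kernels above concrete, here is a minimal sketch, assuming illustrative task matrices $B_\ell$, which evaluates the kernel (3.8) and verifies the equivalent parametrization $K_{\ell q}(x,t) = (x, B_\ell^* B_q\, t)$.

```python
import numpy as np

# Sketch of the linear matrix-valued kernel (3.8): K_{lq}(x, t) = (B_l x, B_q t),
# i.e. the Mercer form (2.7) with linear features Phi_l(x) = B_l x. Equivalently
# K_{lq}(x, t) = (x, Q_{lq} t) with Q_{lq} = B_l^T B_q, so a linear matrix-valued
# kernel is a block of coupled quadratic forms. All values are illustrative.

rng = np.random.default_rng(1)
n, d, s = 3, 4, 5                        # number of tasks, input dim, feature dim
B = rng.standard_normal((n, s, d))       # one s x d matrix B_l per task

def K(x, t):
    """n x n matrix with entries (B_l x, B_q t)."""
    Fx = B @ x                           # shape (n, s): row l is B_l x
    Ft = B @ t
    return Fx @ Ft.T

x, t = rng.standard_normal(d), rng.standard_normal(d)
Q = np.einsum('lsa,qsb->lqab', B, B)     # Q[l, q] = B_l^T B_q, a d x d matrix
K_alt = np.einsum('a,lqab,b->lq', x, Q, t)
print(np.allclose(K(x, t), K_alt))       # True: two parametrizations, same kernel
```

Choosing all $B_\ell$ equal decouples nothing new from the scalar case, while distinct $B_\ell$ encode how strongly the tasks share a common linear feature space.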

3.2 Combinations of kernels

The results in this section are based on a lemma by Schur which states that the elementwise product of two positive semidefinite matrices is also positive semidefinite, see [2, p. 358]. This result implies that, when $Y$ is finite-dimensional, the elementwise product of two matrix-valued kernels is also a matrix-valued kernel. Indeed, in view of the discussion at the end of section 2.2, we immediately conclude that the following two lemmas hold.

Lemma 1. If $K^{(1)}$ and $K^{(2)}$ are matrix-valued kernels, then their elementwise product $K$, defined by $K_{\ell q}(x,t) := K^{(1)}_{\ell q}(x,t)\, K^{(2)}_{\ell q}(x,t)$, is a matrix-valued kernel.

This result allows us, for example, to enhance the linear kernel (3.8) to a polynomial kernel. In particular, if $p$ is a positive integer, we define, for every $\ell, q \in \mathbb{N}_n$, $K_{\ell q}(x,t) := (B_\ell x, B_q t)^p$ and conclude that $(K_{\ell q} : \ell, q \in \mathbb{N}_n)$ is a kernel.

Lemma 2. If $G$ is a kernel on $Z \times Z$ and $D_\ell : X \to Z$, $\ell \in \mathbb{N}_n$, are vector-valued functions, then the matrix-valued function whose elements are defined, for every $x, t \in X$, by $K_{\ell q}(x,t) := G(D_\ell(x), D_q(t))$ is a matrix-valued kernel.

This lemma confirms, as a special case, that if $D_\ell(x) = B_\ell x$ with $B_\ell$ a matrix, $\ell \in \mathbb{N}_n$, and $G$ is a scalar-valued kernel, then the resulting function is a matrix-valued kernel; for the linear kernel $G(z, z') = (z, z')$ this recovers (3.8). When $G$ is chosen to be a Gaussian kernel, we conclude that

    $K_{\ell q}(x,t) = \exp(-\beta \|B_\ell x - B_q t\|^2)$

is a matrix-valued kernel.

In the scalar case it is well known that a nonnegative combination of kernels is a kernel. The next proposition extends this result to matrix-valued kernels.

Proposition 2. If $G_j$, $j \in \mathbb{N}_s$, are scalar-valued kernels and $A_j \in L_+(\mathbb{R}^n)$, $j \in \mathbb{N}_s$, then the function

    $K(x,t) := \sum_{j \in \mathbb{N}_s} A_j\, G_j(x,t)$    (3.10)

is a matrix-valued kernel.

PROOF. For any $m \in \mathbb{N}$, $\{x_i : i \in \mathbb{N}_m\} \subseteq X$ and $\{y_i : i \in \mathbb{N}_m\} \subseteq \mathbb{R}^n$ we have that

    $\sum_{i,k \in \mathbb{N}_m} (y_i, K(x_i, x_k)\, y_k) = \sum_{j \in \mathbb{N}_s} \sum_{i,k \in \mathbb{N}_m} G_j(x_i, x_k)\,(y_i, A_j y_k),$

and so the proposition follows from the Schur lemma.

Other results of this type can be found in [7]. The formula (3.10) can be used to generate a wide variety of matrix-valued kernels which have the flexibility needed for learning. For example, we obtain polynomial matrix-valued kernels by setting $X = \mathbb{R}^d$ and $G_j(x,t) := (x,t)^j$, $j \in \mathbb{N}_s$. We remark that, generally, the kernel in equation (3.10) cannot be reduced to a diagonal kernel.

An interesting case of Proposition 2 is provided by low rank kernels, which may be useful in situations where the components of $f$ are linearly related, that is, for every $x \in X$, $f(x)$ lies in a linear subspace $M \subseteq Y$. In this case it is desirable to use a kernel which has the same property, namely that $K(x,t)y \in M$ for all $x, t \in X$ and $y \in Y$. We can ensure this by an appropriate choice of the matrices $A_j$. For example, if $\{c_j : j \in \mathbb{N}_s\} \subseteq M$ we may choose $A_j = c_j c_j^*$. Matrix-valued Gaussian mixtures are obtained by choosing $X = \mathbb{R}^d$, $A_j \in L_+(\mathbb{R}^n)$, and $G_j(x,t) := \exp(-\beta_j \|x - t\|^2)$ with $\beta_j > 0$. Specifically,

    $K(x,t) = \sum_{j \in \mathbb{N}_s} A_j \exp(-\beta_j \|x - t\|^2)$

is a kernel on $\mathbb{R}^d \times \mathbb{R}^d$ for any $\beta_j > 0$ and $A_j \in L_+(\mathbb{R}^n)$, $j \in \mathbb{N}_s$.
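The following minimal sketch instantiates Proposition 2 for a matrix-valued Gaussian mixture with low-rank matrices $A_j = c_j c_j^*$; the vectors $c_j$, the widths $\beta_j$, and all dimensions are illustrative assumptions. It checks positive semidefiniteness of the resulting block Gram matrix.

```python
import numpy as np

# Sketch of Proposition 2: a matrix-valued Gaussian mixture
# K(x, t) = sum_j A_j * exp(-beta_j ||x - t||^2), with each A_j positive
# semidefinite (here rank one, A_j = c_j c_j^T). The vectors c_j, widths
# beta_j, and all dimensions are illustrative assumptions.

rng = np.random.default_rng(2)
n, d, s, m = 3, 4, 2, 7                  # output dim, input dim, mixture size, points
c = rng.standard_normal((s, n))
A = np.einsum('jp,jq->jpq', c, c)        # rank-one PSD matrices A_j = c_j c_j^T
beta = np.array([0.5, 2.0])              # one width per mixture component

def K(x, t):
    """n x n matrix-valued kernel: a PSD-weighted mixture of Gaussians."""
    g = np.exp(-beta * np.sum((x - t) ** 2))   # scalar kernels G_j(x, t)
    return np.einsum('j,jpq->pq', g, A)

X = rng.standard_normal((m, d))
G = np.block([[K(X[j], X[i]) for i in range(m)] for j in range(m)])
print("min eigenvalue:", np.linalg.eigvalsh((G + G.T) / 2).min())   # >= 0 up to round-off
```

With rank-one $A_j$ the kernel maps every output into the span of the vectors $c_j$, which is exactly the low-rank situation described above.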

4 Regularization and minimal norm interpolation

Let $Q : Y^m \times \mathbb{R}_+ \to \mathbb{R}_+$ be a prescribed function and consider the problem of minimizing the functional

    $E(f) := Q\big((f(x_j) : j \in \mathbb{N}_m),\, \|f\|\big)$    (4.11)

over all functions $f \in H$. A special case is covered by functionals of the form

    $E(f) := \sum_{j \in \mathbb{N}_m} L(y_j, f(x_j)) + \mu \|f\|^2,$    (4.12)

where $\mu$ is a positive parameter and $L : Y \times Y \to \mathbb{R}_+$ is some prescribed loss function, e.g. the square loss. Within this general setting we provide a representer theorem for any function which minimizes the functional in equation (4.11). This result is well known in the scalar case. Our proof technique uses the idea of minimal norm interpolation, a central notion in function estimation and interpolation.

Lemma 3. If $f_0$ is the minimum of the problem

    $\min\{\|f\| : f \in H,\ f(x_j) = y_j,\ j \in \mathbb{N}_m\},$

then $f_0$ is unique and admits the form

    $f_0 = \sum_{j \in \mathbb{N}_m} K_{x_j} c_j$    (4.13)

for some $c_j \in Y$, $j \in \mathbb{N}_m$.

We refer to [7] for a proof. This approach achieves both simplicity and generality. For example, it can be extended to normed linear spaces, see [8]. Our next result establishes that any local minimizer of (4.11) indeed has the same form as in Lemma 3. This result improves upon [9], where it is proven only for a global minimizer. (A function $f_0 \in H$ is a local minimum for $E$ provided that there is a positive number $\epsilon$ such that $E(f_0) \leq E(f)$ whenever $f \in H$ satisfies $\|f - f_0\| \leq \epsilon$.)

Theorem 1. If, for every $v \in Y^m$, the function $Q(v, \cdot) : \mathbb{R}_+ \to \mathbb{R}_+$ is strictly increasing and $f_0$ is a local minimum of $E$, then $f_0 = \sum_{j \in \mathbb{N}_m} K_{x_j} c_j$ for some $c_j \in Y$, $j \in \mathbb{N}_m$.

Proof: Let $g$ be any function in $H$ such that $g(x_j) = 0$, $j \in \mathbb{N}_m$, and let $\epsilon$ be a real number small enough that $E(f_0 + \epsilon g) \geq E(f_0)$. Since the data values $(f_0 + \epsilon g)(x_j) = f_0(x_j)$ are unchanged, the strict monotonicity of $Q(v, \cdot)$ implies that $\|f_0 + \epsilon g\| \geq \|f_0\|$, from which it follows that $\langle f_0, g \rangle = 0$. Thus, $f_0$ satisfies $\langle f_0, g \rangle = 0$ for every $g \in H$ vanishing at the points $x_j$, $j \in \mathbb{N}_m$, and the result follows from Lemma 3.
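For the square loss in (4.12), the representer theorem reduces the minimization over $H$ to a finite linear system in the coefficients $c_j \in Y$: with $G$ the $mn \times mn$ block Gram matrix, a minimizer is obtained by solving $(G + \mu I)c = y$ for the stacked coefficient vector. The minimal sketch below illustrates this for a separable matrix-valued kernel; the kernel, data, and parameter values are assumptions made for the example, not the paper's experiments.

```python
import numpy as np

# Sketch of the representer theorem applied to (4.12) with the square loss:
# the minimizer is f = sum_j K_{x_j} c_j and the stacked coefficients solve
# (G + mu*I) c = y, where G is the m*n x m*n block Gram matrix. The kernel,
# data, and parameter values below are assumptions made for the example.

rng = np.random.default_rng(3)
n, d, m, mu = 2, 3, 20, 0.1              # output dim, input dim, samples, regularizer
C0 = rng.standard_normal((n, n))
A = C0 @ C0.T                            # PSD matrix coupling the output components

def K(x, t, beta=0.5):
    """Separable matrix-valued kernel K(x, t) = exp(-beta ||x - t||^2) * A."""
    return np.exp(-beta * np.sum((x - t) ** 2)) * A

X = rng.standard_normal((m, d))
Ytrain = rng.standard_normal((m, n))     # one output vector y_j in R^n per input

G = np.block([[K(X[j], X[i]) for i in range(m)] for j in range(m)])
c = np.linalg.solve(G + mu * np.eye(m * n), Ytrain.reshape(-1))
C = c.reshape(m, n)                      # coefficient vectors c_j in R^n

def f(x):
    """Evaluate the minimizer at a new point: f(x) = sum_j K(x, x_j) c_j."""
    return sum(K(x, X[j]) @ C[j] for j in range(m))

print(f(rng.standard_normal(d)))         # a vector-valued prediction in R^n
```

The only difference from scalar regularized least squares is the block structure of $G$: the off-diagonal blocks are what transfer information across the components of the output, i.e. across tasks.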

4.1 Linear regularization

We comment on regularization for linear multi-task learning and therefore consider minimizing the functional

    $E(w_1, \dots, w_n) := \sum_{\ell \in \mathbb{N}_n} \sum_{j \in \mathbb{N}_m} L\big(y_{\ell j}, (w_\ell, x_j)\big) + \mu\, J(w_1, \dots, w_n)$    (4.14)

for $w_\ell \in \mathbb{R}^d$, $\ell \in \mathbb{N}_n$, where $J$ is a regularizer defined below. We set $f_\ell(x) = (w_\ell, x)$, $\ell \in \mathbb{N}_n$, and observe that the above functional is related to the functional (4.12) through

    $\|f\|^2 = J(w_1, \dots, w_n),$    (4.15)

where we have defined the minimum norm functional

    $J(w_1, \dots, w_n) := \min\{(u, u) : B_\ell^* u = w_\ell,\ \ell \in \mathbb{N}_n\}.$    (4.16)

Specifically, the optimal solution $(w_\ell : \ell \in \mathbb{N}_n)$ of the associated minimal norm interpolation problem is given by $w_\ell = \sum_{j \in \mathbb{N}_m} \sum_{q \in \mathbb{N}_n} (c_j)_q\, B_\ell^* B_q x_j$, where the coefficient vectors $c_j \in \mathbb{R}^n$ satisfy the linear equations $\sum_{j \in \mathbb{N}_m} K(x_i, x_j)\, c_j = y_i$, $i \in \mathbb{N}_m$, provided the $mn \times mn$ block matrix $(K(x_i, x_j) : i, j \in \mathbb{N}_m)$ is nonsingular. We note that this analysis can be extended to the case of different inputs across the tasks by replacing the inputs $x_j$ in equations (4.14) and (4.15) by $P_\ell x_j$ and the matrices $B_\ell$ by $B_\ell P_\ell$; see section 3.1 for the definition of these quantities.

As a special example we choose $B_\ell$ to be the $(n+1)d \times d$ matrix whose $d \times d$ blocks are all zero except for the first and the $(\ell+1)$-th block, which are equal to $I_d$ and $\sqrt{\nu}\, I_d$ respectively, where $\nu$ is a positive coupling parameter and $I_d$ is the $d$-dimensional identity matrix. From equation (3.8) the matrix-valued kernel then reduces to

    $K_{\ell q}(x,t) = (1 + \nu\, \delta_{\ell q})\,(x, t), \quad \ell, q \in \mathbb{N}_n.$    (4.17)

Moreover, in this case the minimization in (4.16) gives

    $J(w_1, \dots, w_n) = \min_{w_0 \in \mathbb{R}^d} \Big\{ \|w_0\|^2 + \frac{1}{\nu} \sum_{\ell \in \mathbb{N}_n} \|w_\ell - w_0\|^2 \Big\}.$    (4.18)

The model of minimizing (4.14) was proposed in [6] in the context of support vector machines (SVMs) for this special choice of matrices. The derivation presented here improves upon it. The regularizer (4.18) forces a trade-off between a desirably small size for the per-task parameters and closeness of each of these parameters to their average. This trade-off is controlled by the coupling parameter $\nu$: if $\nu$ is small the task parameters are related (close to their average), whereas a large value of $\nu$ means the tasks are learned independently. For SVMs, $L$ is the hinge loss function defined by $L(y, v) := \max(0, 1 - yv)$. In this case the above regularizer trades off a large margin for each per-task SVM with closeness of each SVM to the average SVM. Numerical experiments showing the good performance of the multi-task SVM compared to both independent per-task SVMs (i.e., large $\nu$ in equation (4.17)) and previous multi-task learning methods are also discussed in [6].

The analysis above can be used to derive other linear kernels. This can be done either by introducing matrices $B_\ell$ as in the previous example, or by modifying the functional on the right hand side of equation (4.15). For example, we choose an $n \times n$ symmetric matrix $A$ all of whose entries are in the unit interval, and consider the regularizer

    $J(w_1, \dots, w_n) := \frac{1}{2} \sum_{\ell, q \in \mathbb{N}_n} A_{\ell q}\, \|w_\ell - w_q\|^2 = \sum_{\ell, q \in \mathbb{N}_n} L_{\ell q}\, (w_\ell, w_q),$    (4.19)

where $L = D - A$ and $D$ is the diagonal matrix with entries $D_{\ell\ell} = \sum_{q \in \mathbb{N}_n} A_{\ell q}$. The matrix $A$ could be the weight matrix of a graph with $n$ vertices and $L$ the graph Laplacian, see e.g. [4]. The equation $A_{\ell q} = 0$ means that tasks $\ell$ and $q$ are not related, whereas $A_{\ell q} = 1$ means a strong relation. In order to derive the matrix-valued kernel we note that (4.19) can be written as $(w, E w)$, where $w = (w_\ell : \ell \in \mathbb{N}_n)$ and $E$ is the $n \times n$ block matrix whose $(\ell, q)$ block is the $d \times d$ matrix $L_{\ell q} I_d$. Thus, we define the matrices $B_\ell$ so that $B_\ell^* B_q = (E^+)_{\ell q}$ (here $E^+$ is the pseudoinverse of $E$, and $E E^+$ is the orthogonal projection of $\mathbb{R}^{nd}$ onto the range of $E$). Consequently, the feature map in equation (2.7) is linear and we conclude that

    $K_{\ell q}(x,t) = (L^+)_{\ell q}\, (x, t), \quad \ell, q \in \mathbb{N}_n.$

Finally, as discussed in section 3.2, one can form polynomials or other non-linear functions of the above linear kernels. From Theorem 1, the minimizer of (4.12) is still a linear combination of the kernel sections at the given data examples.

5 Conclusions and future directions

We have described reproducing kernel Hilbert spaces of vector-valued functions and discussed their use in multi-task learning. We have provided a wide class of matrix-valued kernels which should prove useful in applications. In the future it would be valuable to study learning methods, using convex optimization or Monte Carlo integration, for choosing the matrix-valued kernel. This problem seems more challenging than its scalar counterpart due to the possibly large dimension of the output space. Another important problem is to study error bounds for learning in these spaces. Such an analysis can clarify the role played by the spectra of the matrix-valued kernel. Finally, it would be interesting to link the choice of matrix-valued kernels to the notion of relatedness between tasks discussed in [5].

Acknowledgments

This work was partially supported by EPSRC Grant GR/T18707/01 and NSF Grant No. ITR-0312113. We are grateful to Zhongying Chen, Head of the Department of Scientific Computation at Zhongshan University, for providing both of us with the opportunity to complete this work in a scientifically stimulating and friendly environment. We also wish to thank Andrea Caponnetto, Sayan Mukherjee and Tomaso Poggio for useful discussions.

References

[1] N.I. Akhiezer and I.M. Glazman. Theory of Linear Operators in Hilbert Spaces, volume I. Dover reprint, 1993.
[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.
[3] J. Baxter. A model for inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
[5] S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning. Proc. of the 16th Annual Conference on Learning Theory (COLT 03), 2003.
[6] T. Evgeniou and M. Pontil. Regularized multi-task learning. Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
[7] C.A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 2004 (to appear).
[8] C.A. Micchelli and M. Pontil. A function representation for learning in Banach spaces. Proc. of the 17th Annual Conference on Learning Theory (COLT 04), 2004.
[9] B. Schölkopf, R. Herbrich, and A.J. Smola. A generalized representer theorem. Proc. of the 14th Annual Conference on Computational Learning Theory (COLT 01), 2001.
[10] V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

& means that tasks % and are not related, whereas means strong relation In order to derive the matrix valued kernel we note that (419) can be written as where is the block matrix whose % block is the matrix Thus, we define & so that we have (here is the pseudoinverse), where is a projection matrix from to Consequently, the feature map in equation (27) is, & ) given by and we conclude that $ ) Finally, as discussed in section 32 one can form polynomials or non-linear functions of the above linear kernels From Theorem 1 the minimizer of (412) is still a linear combination of the kernel at the given data examples 5 Conclusions and future directions We have described reproducing kernel Hilbert spaces of vector valued functions and discussed their use in multi task learning We have provided a wide class of matrix valued kernels which should proved useful in applications In the future it would be valuable to study learning methods, using convex optimization or MonteCarlo integration, for choosing the matrix valued kernel This problem seems more challenging that its scalar counterpart due to the possibly large dimension of the output space Another important problem is to study error bounds for learning in these spaces Such analysis can clarify the role played by the spectra of the matrix valued kernel Finally, it would be interesting to link the choice of matrix valued kernels to the notion of relatedness between tasks discussed in [5] Acknowledgments This work was partially supported by EPSRC Grant GR/T18707/01 and NSF Grant No ITR-0312113 We are grateful to Zhongying Chen, Head of the Department of Scientific Computation at Zhongshan University for providing both of us with the opportunity to complete this work in a scientifically stimulating and friendly environment We also wish to thank Andrea Caponnetto, Sayan Mukherjee and Tomaso Poggio for useful discussions References [1] NI Akhiezer and IM Glazman Theory of linear operators in Hilbert spaces, volume I Dover reprint, 1993 [2] N Aronszajn Theory of reproducing kernels Trans AMS, 686:337 404, 1950 [3] J Baxter A Model for Inductive Bias Learning Journal of Artificial Intelligence Research, 12, p 149 198, 2000 [4] M Belkin and P Niyogi Laplacian Eigenmaps for Dimensionality Reduction and data Representation Neural Computation, 15(6):1373 1396, 2003 [5] S Ben-David and R Schuller Exploiting Task Relatedness for Multiple Task Learning Proc of the 16 th Annual Conference on Learning Theory (COLT 03), 2003 [6] T Evgeniou and MPontil Regularized Multitask Learning Proc of 17-th SIGKDD Conf on Knowledge Discovery and Data Mining, 2004 [7] CA Micchelli and M Pontil On Learning Vector-Valued Functions Neural Computation, 2004 (to appear) [8] CA Micchelli and M Pontil A function representation for learning in Banach spaces Proc of the 17 th Annual Conf on Learning Theory (COLT 04), 2004 [9] B Schölkopf, R Herbrich, and AJ Smola A Generalized Representer Theorem Proc of the 14-th Annual Conf on Computational Learning Theory (COLT 01), 2001 [10] V N VapnikStatistical Learning Theory Wiley, New York, 1998