1 Kernel Methods
Henrik I Christensen
Robotics & Intelligent Machines @ GT
Georgia Institute of Technology, Atlanta, GA 30332-0280
hic@cc.gatech.edu

2 Outline
1 Introduction
2 Dual Representations
3 Kernel Design
4 Radial Basis Functions
5 Linear Regression Revisited
6 Gaussian Processes for Regression
7 Gaussian Processes for Classification
8 Summary

3 Introduction
Thus far the process has been about data compression and optimal discrimination.
Once the process is complete, the training set is discarded and the model is used for processing.
What if the data were kept and used directly for estimation? Why, you ask? The decision boundaries might not be simple, or the modeling might be too complicated.
We have already discussed Nearest Neighbor (NN) as an example of direct data processing; it belongs to a complete class of memory-based techniques.
Q: how do we measure similarity between a data point and the samples in memory?

4 Kernel Methods
What if we could predict based on a linear combination of features?
Assume a mapping to a new feature space using $\phi(x)$.
A kernel function is defined by $k(x, x') = \phi(x)^T \phi(x')$.
Characteristics:
The function is symmetric: $k(x, x') = k(x', x)$.
It can be used on both continuous and symbolic data.
The simplest kernel is the linear kernel, $k(x, x') = x^T x'$.
A kernel is basically an inner product performed in a feature/mapped space.
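To make the definition concrete, here is a minimal NumPy sketch (not from the slides; the feature map phi is a made-up example) showing that a kernel is just an inner product after a mapping, and that it is symmetric:

```python
import numpy as np

def phi(x):
    # a hypothetical feature map: the raw inputs plus their squares
    return np.concatenate([x, x**2])

def k(x, xp):
    # the kernel is an inner product in the mapped feature space
    return phi(x) @ phi(xp)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(k(x, xp), k(xp, x))   # symmetry: both calls give the same value
print(x @ xp)               # the linear kernel k(x, x') = x^T x'
```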

5 Kernels
Consider a complete set of data in memory.
How can we interpolate new values based on the training values? I.e.,
$$y(x) = \frac{1}{k} \sum_{n=1}^{N} k(x, x_n)\, x_n$$
Consider $k(\cdot,\cdot)$ a weight function that determines the contribution based on the distance between $x$ and $x_n$.
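As a sketch of this memory-based idea (my own illustration, assuming a Gaussian weight function and weighting the stored target values, a detail the slide leaves open):

```python
import numpy as np

def weight(x, xn, sigma=0.3):
    # contribution determined by the distance between x and x_n
    return np.exp(-np.sum((x - xn)**2) / (2 * sigma**2))

def predict(x, X_mem, t_mem):
    # keep all training data and form a weighted combination of targets
    w = np.array([weight(x, xn) for xn in X_mem])
    return (w * t_mem).sum() / w.sum()   # normalize so weights sum to 1

X_mem = np.linspace(0, 2 * np.pi, 20).reshape(-1, 1)   # stored inputs
t_mem = np.sin(X_mem).ravel()                          # stored targets
print(predict(np.array([1.0]), X_mem, t_mem))   # a smoothed estimate near sin(1.0)
```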

6 Outline 1 Introduction 2 Dual Representations 3 Kernel Design 4 Radial Basis Functions 5 Linear Regression Revisited 6 Gaussian Processes for Regression 7 Gaussian Processes for Classification 8 Summary Henrik I Christensen (RIM@GT) Kernel Methods 6 / 37

7 Dual Representation
Consider a regression problem as seen earlier,
$$J(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ w^T \phi(x_n) - t_n \right\}^2 + \frac{\lambda}{2} w^T w$$
with the solution
$$w = -\frac{1}{\lambda} \sum_{n=1}^{N} \left\{ w^T \phi(x_n) - t_n \right\} \phi(x_n) = \sum_{n=1}^{N} a_n \phi(x_n) = \Phi^T a$$
where $a$ is defined by $a_n = -\frac{1}{\lambda} \left\{ w^T \phi(x_n) - t_n \right\}$.
Substitute $w = \Phi^T a$ into $J(w)$ to obtain
$$J(a) = \frac{1}{2} a^T \Phi \Phi^T \Phi \Phi^T a - a^T \Phi \Phi^T t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T \Phi \Phi^T a$$
which is termed the dual representation.

8 Dual Representation II
Define the Gram matrix $K = \Phi \Phi^T$, where
$$K_{nm} = \phi(x_n)^T \phi(x_m) = k(x_n, x_m)$$
to get
$$J(a) = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a$$
$J(a)$ is then minimized by
$$a = (K + \lambda I_N)^{-1} t$$
Through substitution we obtain
$$y(x) = w^T \phi(x) = a^T \Phi \phi(x) = k(x)^T (K + \lambda I_N)^{-1} t$$
We have in reality mapped the problem to another (dual) space in which it is possible to optimize the regression/discrimination problem.
Typically $N \gg M$, so the immediate advantage is not obvious. See later.
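A minimal sketch of this dual (kernel ridge) solution in NumPy, assuming a linear kernel; all names are illustrative:

```python
import numpy as np

def linear_kernel(X1, X2):
    # k(x, x') = x^T x'; any valid kernel could be swapped in here
    return X1 @ X2.T

def fit_dual(X, t, lam):
    # a = (K + lambda I_N)^{-1} t: one dual coefficient per training point
    K = linear_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def predict(Xq, X, a):
    # y(x) = k(x)^T a, where k(x) has elements k(x_n, x)
    return linear_kernel(Xq, X) @ a

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
a = fit_dual(X, t, lam=1e-2)
print(predict(X[:3], X, a), t[:3])   # predictions track the targets up to the noise level
```

Note that the dual solve works with an N x N system, consistent with the remark that with $N \gg M$ the advantage is not obvious until the kernel replaces an explicit (possibly infinite) feature map.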

9 Outline

10 Constructing Kernels
How would we construct kernel functions?
One approach is to choose a mapping and find the corresponding kernel.
A one-dimensional example:
$$k(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^{M} \phi_i(x) \phi_i(x')$$
where the $\phi_i(\cdot)$ are basis functions.

11 Kernel Basis Functions - Example
(figure)

12 Construction of Kernels
We can also design kernels directly; they must correspond to a scalar product in some space.
Consider $k(x, z) = (x^T z)^2$ for a 2-dimensional space $x = (x_1, x_2)$:
$$k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$$
$$= (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T = \phi(x)^T \phi(z)$$
In general, the kernel function is valid if the Gram matrix $K$ is positive semi-definite.
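A quick numerical check of this identity, with the feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ taken from the expansion above:

```python
import numpy as np

def phi(x):
    # explicit feature map recovered from expanding (x^T z)^2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z)**2)        # kernel computed directly: 1.0
print(phi(x) @ phi(z))   # same value via the 3-D feature space
```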

13 Techniques for construction of kernels
Given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following are also valid kernels:
$k(x, x') = c\, k_1(x, x')$, with $c > 0$
$k(x, x') = f(x)\, k_1(x, x')\, f(x')$
$k(x, x') = q(k_1(x, x'))$, with $q(\cdot)$ a polynomial with nonnegative coefficients
$k(x, x') = \exp(k_1(x, x'))$
$k(x, x') = k_1(x, x') + k_2(x, x')$
$k(x, x') = k_1(x, x')\, k_2(x, x')$
$k(x, x') = x^T A x'$, with $A$ symmetric positive semidefinite
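A small sketch (my own, assuming a linear base kernel) that combines kernels with these rules and checks that the resulting Gram matrix is positive semi-definite:

```python
import numpy as np

def k1(X1, X2):
    return X1 @ X2.T                      # linear base kernel

def k_combined(X1, X2):
    # sum of valid kernels, an elementwise product (k1 * k1),
    # and an elementwise exponential: all of these preserve validity
    K = k1(X1, X2)
    return K + K**2 + np.exp(K)

X = np.random.default_rng(1).normal(size=(30, 2))
K = k_combined(X, X)
eigvals = np.linalg.eigvalsh(K)           # Gram matrix must be PSD
print(eigvals.min() >= -1e-9)             # True (up to round-off)
```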

14 More kernel examples/generalizations
We could generalize $k(x, x') = (x^T x')^2$ in various ways:
1 $k(x, x') = (x^T x' + c)^2$
2 $k(x, x') = (x^T x')^M$
3 $k(x, x') = (x^T x' + c)^M$
Example: correlation between image regions.
Another option is
$$k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$$
called the Gaussian kernel (see later).
Several more examples in the book.
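A minimal implementation of the Gaussian kernel as defined above, vectorized over all pairs (names are illustrative):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for every pair of rows
    sq = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

X = np.random.default_rng(2).normal(size=(5, 2))
K = gaussian_kernel(X, X)
print(np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))  # symmetric, k(x,x) = 1
```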

15 Outline

16 Radial Basis Functions
What is a radial basis function?
$$\phi_j(x) = h(\|x - x_j\|)$$
How do we average/smooth across the data based entirely on distance?
$$y(x) = \sum_{n=1}^{N} w_n h(\|x - x_n\|)$$
The weights $w_n$ could be estimated using least squares (LSQ); see the sketch below.
A popular interpolation strategy is
$$y(x) = \sum_{n=1}^{N} t_n h(x - x_n)$$
where
$$h(x - x_n) = \frac{\nu(x - x_n)}{\sum_j \nu(x - x_j)}$$
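A sketch of the least-squares route, assuming a Gaussian radial basis h and one basis function centered on each data point (my own illustration):

```python
import numpy as np

def h(r, sigma=0.5):
    # radial basis: depends only on the distance r = ||x - x_j||
    return np.exp(-r**2 / (2 * sigma**2))

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 2 * np.pi, 30))
t = np.sin(X) + 0.1 * rng.normal(size=30)

# design matrix with one basis function centered on each data point
Phi = h(np.abs(X[:, None] - X[None, :]))
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares weights w_n

y = Phi @ w
print(np.abs(y - t).max())   # small residual on the training data
```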

17 The effect of normalization?
(figure)

18 Nadaraya-Watson Models
Let's interpolate across all data!
Using a Parzen density estimator, we have
$$p(x, t) = \frac{1}{N} \sum_{n=1}^{N} f(x - x_n, t - t_n)$$
We can then estimate
$$y(x) = E[t \mid x] = \int t\, p(t \mid x)\, dt = \frac{\int t\, p(x, t)\, dt}{\int p(x, t)\, dt} = \frac{\sum_n g(x - x_n)\, t_n}{\sum_m g(x - x_m)} = \sum_n k(x, x_n)\, t_n$$
where $k(x, x_n) = g(x - x_n) / \sum_m g(x - x_m)$ and $g(x) = \int f(x, t)\, dt$.
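To illustrate the derivation (not from the slides), the sketch below builds the joint Parzen density with an isotropic Gaussian f and approximates E[t|x] by numerical integration over a grid of t values:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0, 2 * np.pi, 100)
T = np.sin(X) + 0.1 * rng.normal(size=100)

def parzen_joint(x, t, sigma=0.3):
    # p(x, t) = (1/N) sum_n f(x - x_n, t - t_n), isotropic Gaussian f
    return np.exp(-((x - X)**2 + (t - T)**2) / (2 * sigma**2)).sum() / len(X)

# E[t|x] = (integral of t p(x,t) dt) / (integral of p(x,t) dt),
# approximated on a grid over t
tg = np.linspace(-2, 2, 401)
px_t = np.array([parzen_joint(1.5, t) for t in tg])
print((tg * px_t).sum() / px_t.sum())   # a smoothed estimate of sin(1.5)
```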

19 Gaussian Mixture Example
Assume a particular one-dimensional function (here a sine) with noise.
Each data point is an isotropic Gaussian kernel.
Smoothing factors are determined for the interpolation.

20 Gaussian Mixture Example
(figure)

21 Gaussian Kernels
We have so far considered the basics of kernels: a distance metric and transformations to a new space.
Let's consider Gaussian Processes.
Rather than direct regression/classification, what if the mapping is probabilistic over a function space?
Example: training with noisy training data.

22 Outline

23 Linear Regression Revisited
In regression we are used to
$$y(x) = w^T \phi(x)$$
What if the weights were probabilistic,
$$p(w) = N(w \mid 0, \alpha^{-1} I)$$
We can reformulate the optimization to be
$$y = \Phi w$$

24 Gaussian Models
Considering the basic Gaussian parameters:
$$E[y] = \Phi E[w] = 0$$
$$E[y y^T] = \Phi E[w w^T] \Phi^T = \frac{1}{\alpha} \Phi \Phi^T = K$$
where $K$ is the Gram matrix that defines the kernel, i.e.
$$K_{nm} = k(x_n, x_m) = \frac{1}{\alpha} \phi(x_n)^T \phi(x_m)$$

25 Gaussian Processes
Determined in full by the first- and second-order moments.
Typically there is no prior knowledge about the mean, so a zero mean is assumed: $E[y(x)] = 0$.
Specification of the covariance is thus adequate:
$$E[y(x_n)\, y(x_m)] = k(x_n, x_m)$$
One could also define the kernel directly rather than through a set of basis functions.
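A small sketch of what this means in practice: assuming a Gaussian covariance function, any finite grid of inputs induces a joint Gaussian over the function values, from which sample functions can be drawn:

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    sq = (x1[:, None] - x2[None, :])**2
    return np.exp(-sq / (2 * sigma**2))

# a GP prior: the function values y(x_1), ..., y(x_N) are jointly
# Gaussian with zero mean and covariance K_{nm} = k(x_n, x_m)
x = np.linspace(0, 5, 100)
K = gaussian_kernel(x, x) + 1e-8 * np.eye(100)   # jitter for stability
rng = np.random.default_rng(5)
samples = rng.multivariate_normal(np.zeros(100), K, size=3)  # 3 sample functions
print(samples.shape)   # (3, 100)
```

Changing the kernel changes the family of functions the prior supports.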

26 Outline

27 Gaussian Processes for Regression
If we have noisy training data $t_n = y_n + \epsilon_n$, then we can model the data as
$$p(t_n \mid y_n) = N(t_n \mid y_n, \beta^{-1})$$
For a vector of data (your training set) we have
$$p(t \mid y) = N(t \mid y, \beta^{-1} I_N)$$
The marginal distribution over $y$ is then
$$p(y) = N(y \mid 0, K)$$
where $K$ is the Gram matrix.

28 Gaussian Processes for Regression II
We can then compute the marginal for the target:
$$p(t) = \int p(t \mid y)\, p(y)\, dy = N(t \mid 0, C)$$
where
$$C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{nm}$$
We can thus express the distribution of the targets entirely in terms of the kernel function.

29 Exponential Quadratic Gaussian Processes
A popular family of Gaussian processes is defined by
$$k(x_n, x_m) = \theta_0 \exp\left\{ -\frac{\theta_1}{2} \|x_n - x_m\|^2 \right\} + \theta_2 + \theta_3\, x_n^T x_m$$
which has a bias term, a linear term, and the exponential quadratic.
Allows representation of a broad family of functions.
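A direct transcription of this kernel into NumPy (my own sketch; the θ tuples mirror those in the figure on the next slide):

```python
import numpy as np

def exp_quad_kernel(X1, X2, theta=(1.0, 4.0, 0.0, 0.0)):
    # theta_0 * exp(-theta_1/2 ||x - x'||^2) + theta_2 + theta_3 x^T x'
    t0, t1, t2, t3 = theta
    sq = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return t0 * np.exp(-0.5 * t1 * sq) + t2 + t3 * (X1 @ X2.T)

X = np.linspace(0, 1, 4).reshape(-1, 1)
print(exp_quad_kernel(X, X, theta=(9.0, 4.0, 0.0, 0.0)))
```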

30 Quadratic Exponential Gaussian Processes
(figure: samples from the prior for $(\theta_0, \theta_1, \theta_2, \theta_3)$ = (1.00, 4.00, 0.00, 0.00), (9.00, 4.00, 0.00, 0.00), (1.00, 64.00, 0.00, 0.00), (1.00, 0.25, 0.00, 0.00), (1.00, 4.00, 10.00, 0.00), and (1.00, 4.00, 0.00, 5.00))

31 Recursive Process Regression
In temporal processes we would like to model $p(t_{N+1} \mid t_N, x_{N+1})$.
For the process we have
$$p(t_{N+1}) = N(t_{N+1} \mid 0, C_{N+1})$$
We can partition $C_{N+1}$ as
$$C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix}$$
where $k$ is composed of $k(x_i, x_{N+1})$ and $c = k(x_{N+1}, x_{N+1}) + \beta^{-1}$.

32 Recursive Process Regression II
The mean and variance are then
$$E[t_{N+1}] = k^T C_N^{-1} t$$
$$\sigma^2(x_{N+1}) = c - k^T C_N^{-1} k$$
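Putting the pieces together, a minimal GP-regression sketch (assuming a Gaussian kernel and noise precision β; all names illustrative) that evaluates this predictive mean and variance:

```python
import numpy as np

def rbf(X1, X2, sigma=0.5):
    sq = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def gp_predict(Xq, X, t, kernel, beta=25.0):
    # predictive mean k^T C_N^{-1} t and variance c - k^T C_N^{-1} k
    C = kernel(X, X) + np.eye(len(X)) / beta
    Ci = np.linalg.inv(C)        # plain inverse for clarity; prefer a Cholesky solve
    k = kernel(X, Xq)            # N x Q cross-covariances
    mean = k.T @ Ci @ t
    c = np.diag(kernel(Xq, Xq)) + 1.0 / beta
    var = c - np.einsum('nq,nm,mq->q', k, Ci, k)
    return mean, var

rng = np.random.default_rng(6)
X = rng.uniform(0, 2 * np.pi, (25, 1))
t = np.sin(X).ravel() + 0.2 * rng.normal(size=25)
Xq = np.array([[1.5], [4.0]])
mean, var = gp_predict(Xq, X, t, rbf)
print(mean, var)   # mean near sin(x); variance grows away from the data
```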

33 Outline

34 Gaussian Processes for Classification
Thus far we have considered regression over the full space.
For classification the optimization would be with respect to misclassification:
$$p(t \mid a) = \sigma(a)^t (1 - \sigma(a))^{1 - t}$$
Very similar derivations can be performed, as detailed in the book.

35 Gaussian Process Classification Example
(figure)

36 Outline

37 Summary
Memory-based methods: keeping the data!
Design of distance metrics for weighting of data in the learning set.
Kernels: a distance metric based on a dot product in some feature space.
Being creative about the design of kernels.
Gaussian processes represent a broad class of stochastic processes.
Estimation of Gaussian processes is a way to optimize the fit to data and to obtain an estimate of uncertainty as interpolation is performed away from the learning data.
A good source: C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006. Available online; includes a good Matlab toolkit.
