
Pattern Recognition and Machine Learning
Chapter 6: Kernel Methods
Vasil Khalidov, Alex Kläser
December 13, 2007

Training Data: Keep or Discard?
- Parametric methods (linear/nonlinear) so far: learn a parameter vector w or a posterior distribution p(w | t), then discard the training data t
- Non-parametric methods keep the training data:
  - Parzen probability density model: a set of kernel functions centred on the training data points
  - Nearest-neighbour technique: use the closest example from the training set
  - Memory-based methods: compare against similar examples from the training set
- Kernel methods: the prediction is based on linear combinations of a kernel function evaluated at the training data points

Kernel Function
For models based on a feature space mapping φ(x):
k(x, x') = φ(x)^T φ(x')   (6.1)
- symmetric function: k(x, x') = k(x', x)
- simple example, the linear kernel: φ(x) = x, so k(x, x') = x^T x'
- stationary kernel: k(x, x') = k(x - x')
- homogeneous kernel: k(x, x') = k(‖x - x'‖)
Any algorithm expressed in terms of scalar products can be reformulated using kernel substitution: PCA, nearest-neighbour classifiers, the Fisher discriminant, ...

Dual Representation
Many linear models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally.
Regularized linear regression model (λ ≥ 0):
J(w) = (1/2) Σ_{n=1}^N { w^T φ(x_n) - t_n }^2 + (λ/2) w^T w   (6.2)
Setting the gradient to zero gives
w = -(1/λ) Σ_{n=1}^N { w^T φ(x_n) - t_n } φ(x_n) = Φ^T a   (6.3)
Substitute w = Φ^T a and define the Gram matrix K = ΦΦ^T with elements
K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m)   (6.6)

Solution for Dual Problem
Rewriting the error function in terms of the new parameter vector a:
J(a) = (1/2) a^T K K a - a^T K t + (1/2) t^T t + (λ/2) a^T K a   (6.7)
Setting the gradient to zero:
a = (K + λ I_N)^{-1} t   (6.8)
Prediction for a new input x:
y(x) = w^T φ(x) = a^T Φ φ(x) = k(x)^T (K + λ I_N)^{-1} t   (6.9)
where k(x) = (k(x_1, x), ..., k(x_N, x))^T
We now invert an N × N matrix instead of an M × M one, but everything is expressed through the kernel alone.
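
As a concrete illustration of (6.8)-(6.9), here is a minimal NumPy sketch of kernel (ridge) regression; the Gaussian kernel, its width, and the toy sinusoidal data are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=0.3):
    """Gram matrix K[n, m] = exp(-||x_n - x_m||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# toy 1-D regression data: noisy samples of a sinusoid
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(25, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(25)

lam = 1e-2                                         # regularization parameter lambda
K = gaussian_kernel(X, X)                          # Gram matrix (6.6)
a = np.linalg.solve(K + lam * np.eye(len(X)), t)   # dual solution (6.8)

X_new = np.linspace(0, 1, 100)[:, None]
y_new = gaussian_kernel(X_new, X) @ a              # prediction (6.9): k(x)^T (K + lam I)^{-1} t
```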

Constructing Kernels - First Approach
Choose a feature space mapping φ(x) and use it to define the kernel:
k(x, x') = φ(x)^T φ(x') = Σ_{i=1}^M φ_i(x) φ_i(x')   (6.10)
[Figure: example basis functions φ_i(x) (top row) and the corresponding kernel functions k(x, x') (bottom row).]

Constructing Kernels - Second Approach
Construct the kernel function directly and verify its validity.
Simple example in the 2-D case:
k(x, z) = (x^T z)^2   (6.11)
corresponds to
k(x, z) = φ(x)^T φ(z)   (6.12)
with φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^T
To test validity without having to construct φ(x) explicitly, one can use the condition: k(x, x') is a valid kernel ⟺ the Gram matrix K is positive semidefinite for all possible choices of the set {x_n}
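
The equivalence (6.11)-(6.12) is easy to check numerically; a small sketch (the helper name `phi` and the random test points are my own):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 2-D example: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

rng = np.random.default_rng(1)
x, z = rng.standard_normal(2), rng.standard_normal(2)

k_direct = (x @ z) ** 2          # kernel evaluated directly, (6.11)
k_feature = phi(x) @ phi(z)      # inner product in feature space, (6.12)
assert np.isclose(k_direct, k_feature)
```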

Combining Kernels
Given valid kernels k_1(x, x') and k_2(x, x'), the following kernels are also valid:
k(x, x') = c k_1(x, x')   (6.13)
k(x, x') = f(x) k_1(x, x') f(x')   (6.14)
k(x, x') = q(k_1(x, x'))   (6.15)
k(x, x') = exp(k_1(x, x'))   (6.16)
k(x, x') = k_1(x, x') + k_2(x, x')   (6.17)
k(x, x') = k_1(x, x') k_2(x, x')   (6.18)
k(x, x') = k_3(φ(x), φ(x'))   (6.19)
k(x, x') = x^T A x'   (6.20)
k(x, x') = k_a(x_a, x_a') + k_b(x_b, x_b')   (6.21)
k(x, x') = k_a(x_a, x_a') k_b(x_b, x_b')   (6.22)
with corresponding conditions on c, f, q, φ, k_3, A, x_a, x_b, k_a, k_b
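
These closure rules can be sanity-checked numerically on a finite point set: the Gram matrix of a combined kernel should remain positive semidefinite. A sketch under my own choice of base kernels and toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 2))

def gram(kernel, X):
    """Gram matrix K[n, m] = kernel(x_n, x_m)."""
    return np.array([[kernel(x, z) for z in X] for x in X])

k1 = lambda x, z: x @ z                                   # linear kernel
k2 = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)     # Gaussian kernel

for k in (
    lambda x, z: 3.0 * k1(x, z),           # (6.13) scaling by a constant c > 0
    lambda x, z: k1(x, z) + k2(x, z),      # (6.17) sum of kernels
    lambda x, z: k1(x, z) * k2(x, z),      # (6.18) product of kernels
    lambda x, z: np.exp(k1(x, z)),         # (6.16) exponential of a kernel
):
    eigvals = np.linalg.eigvalsh(gram(k, X))
    assert eigvals.min() > -1e-8           # Gram matrix is (numerically) positive semidefinite
```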

Examples
Polynomial kernels:
- k(x, x') = (x^T x')^2 contains only terms of degree 2
- k(x, x') = (x^T x' + c)^2, c > 0
- k(x, x') = (x^T x')^M contains all monomials of order M
- k(x, x') = (x^T x' + c)^M, c > 0
Gaussian kernel:
k(x, x') = exp(-‖x - x'‖^2 / 2σ^2)   (6.23)
Note: x^T x' can be substituted by a nonlinear kernel κ(x, x')
Kernel on a nonvectorial space (sets A_1, A_2):
k(A_1, A_2) = 2^{|A_1 ∩ A_2|}   (6.27)
Sigmoidal kernel:
k(x, x') = tanh(a x^T x' + b)   (6.37)

Probabilistic generative models
Kernels built from a generative model p(x):
k(x, x') = p(x) p(x')   (6.28)
k(x, x') = Σ_i p(x | i) p(x' | i) p(i)   (6.29)
k(x, x') = ∫ p(x | z) p(x' | z) p(z) dz   (6.30)
Kernel for a hidden Markov model:
k(X, X') = Σ_Z p(X | Z) p(X' | Z) p(Z)   (6.31)
where X = {x_1, ..., x_L} are the observations and Z = {z_1, ..., z_L} the hidden states

Fisher kernel
Parametric generative model p(x | θ)
Fisher score:
g(θ, x) = ∇_θ ln p(x | θ)   (6.32)
Fisher kernel and Fisher information matrix:
k(x, x') = g(θ, x)^T F^{-1} g(θ, x')   (6.33)
F = E_x[ g(θ, x) g(θ, x)^T ]   (6.34)
Note: the kernel is invariant under a reparameterization θ → ψ(θ)
Simplify the matrix calculation with a sample average:
F ≈ (1/N) Σ_{n=1}^N g(θ, x_n) g(θ, x_n)^T   (6.35)
or simply omit the Fisher information matrix.
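
A minimal sketch of (6.32)-(6.35) for the simplest possible generative model, a univariate Gaussian with unknown mean µ (the model choice, data, and function names are my own illustration):

```python
import numpy as np

# Generative model: N(x | mu, sigma^2) with parameter theta = mu.
mu, sigma = 0.5, 1.0

def fisher_score(x):
    """g(theta, x) = d/d mu ln N(x | mu, sigma^2) = (x - mu) / sigma^2   (6.32)."""
    return np.atleast_1d((x - mu) / sigma ** 2)

rng = np.random.default_rng(3)
data = rng.normal(mu, sigma, size=200)

# Sample-average approximation of the Fisher information (6.35); close to 1 / sigma^2 here.
G = np.stack([fisher_score(x) for x in data])
F = (G[:, :, None] * G[:, None, :]).mean(axis=0)

def fisher_kernel(x, x_prime):
    """k(x, x') = g(theta, x)^T F^{-1} g(theta, x')   (6.33)."""
    return fisher_score(x) @ np.linalg.solve(F, fisher_score(x_prime))

print(fisher_kernel(1.2, -0.3))   # roughly (1.2 - mu) * (-0.3 - mu) / sigma^2
```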

Radial Basis Functions
By definition, each basis function depends only on the distance from a centre: φ_j(x) = h(‖x - µ_j‖)
Originally introduced for the problem of exact interpolation, f(x_n) = t_n:
f(x) = Σ_{n=1}^N w_n h(‖x - x_n‖)   (6.38)
They also arise as Green's functions for an isotropic differential operator used in a regularizer, and in the interpolation problem with noisy inputs:
E = (1/2) Σ_{n=1}^N ∫ {y(x_n + ξ) - t_n}^2 ν(ξ) dξ   (6.39)

Interpolation With Noisy Inputs
Sum-of-squares error function with input noise ξ distributed as ν(ξ):
E = (1/2) Σ_{n=1}^N ∫ {y(x_n + ξ) - t_n}^2 ν(ξ) dξ   (6.39)
Optimal solution:
y(x) = Σ_{n=1}^N t_n h(x - x_n)   (6.40)
with basis functions given by the normalized functions
h(x - x_n) = ν(x - x_n) / Σ_{m=1}^N ν(x - x_m)   (6.41)
If the noise distribution ν(ξ) is isotropic, the basis functions are radial.

Normalization effect
Normalization avoids regions of the input space where all of the basis functions take small values.
[Figure: a set of Gaussian basis functions (left) and the corresponding normalized basis functions (right).]

Reducing Size of the Basis
- Keep the number of basis functions M smaller than the number of data points N
- Centre locations µ_i are determined from the input data {x_n} alone
- Coefficients {w_i} are then determined by least squares
- Choice of centres: randomly chosen data points, orthogonal least squares (pick the centre giving the greatest error reduction), or clustering algorithms

Parzen Density Estimator
The prediction of a linear regression model is a linear combination of the t_n weighted by the equivalent kernel. The same result can be obtained starting from a Parzen density estimator of the joint distribution:
p(x, t) = (1/N) Σ_{n=1}^N f(x - x_n, t - t_n)   (6.42)
Regression function:
y(x) = E[t | x] = ∫ t p(t | x) dt = ∫ t p(x, t) dt / ∫ p(x, t) dt = Σ_n ∫ t f(x - x_n, t - t_n) dt / Σ_m ∫ f(x - x_m, t - t_m) dt   (6.43)

Nadaraya-Watson Model
Assume that the component density functions have zero mean, so that
∫ t f(x, t) dt = 0   (6.44)
for all values of x. Then, after a change of variables,
y(x) = Σ_n g(x - x_n) t_n / Σ_m g(x - x_m) = Σ_n k(x, x_n) t_n   (6.45)
with g(x) = ∫ f(x, t) dt and k(x, x_n) given by
k(x, x_n) = g(x - x_n) / Σ_m g(x - x_m)   (6.46)
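
A short sketch of (6.45)-(6.46) with Gaussian components; the bandwidth h, the function name, and the toy data are my own assumptions.

```python
import numpy as np

def nadaraya_watson(x_query, X, t, h=0.1):
    """Kernel regression y(x) = sum_n k(x, x_n) t_n with Gaussian g, cf. (6.45)-(6.46)."""
    g = np.exp(-(x_query - X) ** 2 / (2 * h ** 2))   # g(x - x_n) for every training point
    k = g / g.sum()                                  # normalized weights (6.46)
    return k @ t                                     # weighted average of the targets (6.45)

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 50)
t = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(50)

y_hat = np.array([nadaraya_watson(x, X, t) for x in np.linspace(0, 1, 100)])
```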

Illustration of the Nadaraya-Watson Model
- single input variable x
- f(x, t) is a zero-mean isotropic Gaussian with variance σ^2
The conditional distribution
p(t | x) = p(t, x) / ∫ p(t, x) dt = Σ_n f(x - x_n, t - t_n) / Σ_m ∫ f(x - x_m, t - t_m) dt   (6.47)
is given by a mixture of Gaussians.
[Figure: the Nadaraya-Watson regression curve on sinusoidal data, with the conditional mean and a two-standard-deviation band.]

Gaussian processes: Key Idea
- The idea is similar to linear regression with a fixed set of basis functions (Chapter 3): y(x) = w^T φ(x)   (6.49)
- However, we forget about the parametric model
- Instead we take an infinite number of basis functions, described by a probability distribution over functions
- This might look difficult to handle, but it is not: we only have to consider the values of the functions at the training/test data points

Linear regression revisited
Model defined by a linear combination of M fixed basis functions:
y(x) = w^T φ(x)   (6.49)
A Gaussian prior distribution over the weight vector w,
p(w) = N(w | 0, α^{-1} I)   (6.50)
governed by the hyperparameter α (the precision), then induces a probability distribution over functions y(x).

Linear regression revisited (2)
Evaluating y(x) at a set of training data points x_1, ..., x_N gives the joint distribution of y = Φw, with elements
y_n = y(x_n) = w^T φ(x_n),  n = 1, ..., N,   (6.51)
where Φ is the design matrix with elements Φ_nk = φ_k(x_n).
p(y) is Gaussian, and its mean and covariance can be shown to be
E[y] = 0   (6.52)
cov[y] = (1/α) Φ Φ^T = K   (6.53)
where K is the Gram matrix with elements
K_nm = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m).   (6.54)
Up to now we have only used the data points and the prior; no target values yet.
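
To make (6.51)-(6.54) concrete, the sketch below samples weight vectors from the prior (6.50) and checks empirically that the resulting function values have covariance K = α^{-1} Φ Φ^T; the Gaussian basis functions and all numerical settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = 2.0                                   # weight-precision hyperparameter
x = np.linspace(0, 1, 50)
centres = np.linspace(0, 1, 12)

# Design matrix of Gaussian basis functions (an arbitrary illustrative choice)
Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * 0.1 ** 2))

# Implied covariance over the function values at x: K = (1/alpha) Phi Phi^T  (6.53)-(6.54)
K = Phi @ Phi.T / alpha

# Empirical check: y = Phi w with w ~ N(0, alpha^{-1} I) has covariance close to K
W = rng.normal(0, 1 / np.sqrt(alpha), size=(12, 20000))
Y = Phi @ W                                        # each column is one sampled function y(x)
print(np.max(np.abs(np.cov(Y, bias=True) - K)))    # small for a large number of samples
```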

Linear regression revisited: summary
- The model we have just seen is one example of a Gaussian process
- A Gaussian process is defined as a probability distribution over functions y(x) such that the set of values of y(x) evaluated at an arbitrary set of points x_1, ..., x_N jointly has a Gaussian distribution
- Key point: the joint distribution is defined completely by its second-order statistics (mean and covariance)
- Usually the mean is taken to be zero; then we only need the covariance, i.e. the kernel function: E[y(x_n) y(x_m)] = k(x_n, x_m)   (6.55)
- So instead of choosing a (limited number of) basis functions, we can directly choose a kernel function (which may correspond to an infinite number of basis functions)

Gaussian process regression
To use Gaussian processes for regression, we need to model noise on the observed targets:
t_n = y_n + ε_n  with  y_n = y(x_n)   (6.57)
For a noise process with a Gaussian distribution we obtain
p(t_n | y_n) = N(t_n | y_n, β^{-1}).   (6.58)
Since the noise is assumed independent, the joint distribution of t = (t_1, ..., t_N)^T given y = (y_1, ..., y_N)^T is
p(t | y) = N(t | y, β^{-1} I_N).   (6.59)

Gaussian process regression (2)
And from the definition of a Gaussian process we have
p(y) = N(y | 0, K),   (6.60)
i.e. points that are more similar (according to the kernel function k) are more strongly correlated.
For the marginal distribution p(t), we integrate over y (see Section 2.3.3):
p(t) = ∫ p(t | y) p(y) dy = N(t | 0, C)  with  C = K + β^{-1} I.   (6.61)

Making predictions
- So far we have a model for the joint probability distribution over sets of data points
- To predict the target for a new input x_{N+1}, we need to evaluate the predictive distribution p(t_{N+1} | t)
- By partitioning the joint Gaussian distribution over x_1, ..., x_N, x_{N+1} (see Section 2.3.1), we obtain p(t_{N+1} | t) given by its mean and variance:
m(x_{N+1}) = k^T C^{-1} t,  where k = (k(x_1, x_{N+1}), ..., k(x_N, x_{N+1}))^T   (6.66)
σ^2(x_{N+1}) = c - k^T C^{-1} k,  where c = k(x_{N+1}, x_{N+1}) + β^{-1}   (6.67)
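
A compact NumPy sketch of (6.61), (6.66) and (6.67); the kernel, its parameter values, and the toy data are illustrative assumptions.

```python
import numpy as np

def kernel(X1, X2, theta0=1.0, theta1=10.0):
    """Exponential-of-quadratic kernel k(x, x') = theta0 exp(-theta1/2 (x - x')^2)."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return theta0 * np.exp(-0.5 * theta1 * d2)

rng = np.random.default_rng(6)
beta = 25.0                                   # noise precision
X = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * X) + rng.normal(0, 1 / np.sqrt(beta), 20)

C = kernel(X, X) + np.eye(len(X)) / beta      # C = K + beta^{-1} I   (6.61)
C_inv_t = np.linalg.solve(C, t)

def predict(x_new):
    k = kernel(X, np.atleast_1d(x_new))[:, 0]                            # k(x_n, x_{N+1})
    c = kernel(np.atleast_1d(x_new), np.atleast_1d(x_new))[0, 0] + 1 / beta
    mean = k @ C_inv_t                                                   # (6.66)
    var = c - k @ np.linalg.solve(C, k)                                  # (6.67)
    return mean, var

print(predict(0.3))
```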

Making predictions (2)
[Figure: Gaussian process regression on noisy samples of a sinusoid. Green curve: original sinusoidal function; blue points: training data points sampled with additive noise; red line: mean of the predictive distribution; shaded region: ±2σ predictive band.]

Distribution over functions... hm?
Sample functions from the prior p(y) with a commonly used kernel function
k(x_n, x_m) = θ_0 exp(-(θ_1/2) ‖x_n - x_m‖^2) + θ_2 + θ_3 x_n^T x_m   (6.63)
[Figure: sample functions drawn from the Gaussian process prior for several settings of the hyperparameters (θ_0, θ_1, θ_2, θ_3).]
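
Sampling such functions amounts to drawing from a multivariate Gaussian whose covariance is the kernel evaluated on a grid; a sketch with one arbitrary hyperparameter setting (the θ values and the jitter term are my own choices):

```python
import numpy as np

def kernel_663(xn, xm, theta=(1.0, 4.0, 0.0, 0.0)):
    """k(x_n, x_m) = th0 exp(-th1/2 (x_n - x_m)^2) + th2 + th3 x_n x_m   (6.63)."""
    th0, th1, th2, th3 = theta
    return th0 * np.exp(-0.5 * th1 * (xn - xm) ** 2) + th2 + th3 * xn * xm

rng = np.random.default_rng(7)
x = np.linspace(-1, 1, 200)
K = kernel_663(x[:, None], x[None, :])        # prior covariance over function values
K += 1e-8 * np.eye(len(x))                    # small jitter for numerical stability

# Five sample functions y ~ N(0, K) evaluated on the grid
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
```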

Learning hyperparameters θ
- In practice it is often preferable not to fix the kernel parameters but to infer them from the data
- Typical parameters θ: the length scale of the correlations, the precision of the noise (β)
- Simplest approach: maximize the log likelihood ln p(t | θ) with respect to θ (maximum likelihood)
- Problem: ln p(t | θ) is in general non-convex and can have multiple maxima
- Alternatively, introduce a prior p(θ) and maximize the log posterior ln p(t | θ) + ln p(θ) (maximum a posteriori)
- To be fully Bayesian, we would need the actual distribution and have to marginalize over θ; this is not tractable, so approximations are used
- The noise might not be additive but depend on x; a second Gaussian process can be introduced to represent the dependence of β on x
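
A minimal sketch of the maximum-likelihood approach, numerically optimizing the standard Gaussian log marginal likelihood ln p(t | θ) = -½ ln|C| - ½ t^T C^{-1} t - (N/2) ln 2π; the log parameterization, toy data, and optimizer choice are my own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(30)

def neg_log_marginal_likelihood(log_params):
    """-ln p(t | theta) = 1/2 ln|C| + 1/2 t^T C^{-1} t + N/2 ln(2 pi)."""
    theta0, theta1, beta = np.exp(log_params)      # log parameterization keeps values positive
    K = theta0 * np.exp(-0.5 * theta1 * (X[:, None] - X[None, :]) ** 2)
    C = K + np.eye(len(X)) / beta
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * logdet + 0.5 * t @ np.linalg.solve(C, t) + 0.5 * len(X) * np.log(2 * np.pi)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 10.0, 10.0]), method="L-BFGS-B")
theta0, theta1, beta = np.exp(res.x)
print(theta0, theta1, beta)
```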

Automatic relevance determination (ARD)
An additional hyperparameter η_i can be introduced for each input dimension, e.g.:
k(x_n, x_m) = θ_0 exp(-(1/2) Σ_{i=1}^D η_i (x_ni - x_mi)^2) + θ_2 + θ_3 x_n^T x_m   (6.72)
- Hyperparameter optimization by maximum likelihood then allows a different weighting of each input dimension
- Irrelevant dimensions (which receive small weights) can be detected and discarded
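
A small sketch of the ARD kernel (6.72); the particular η values are arbitrary and only illustrate how one dimension can dominate the other.

```python
import numpy as np

def ard_kernel(X1, X2, theta0=1.0, eta=(10.0, 0.01), theta2=0.0, theta3=0.0):
    """ARD kernel (6.72): a separate weight eta_i for each input dimension."""
    eta = np.asarray(eta)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 * eta).sum(-1)   # weighted squared distance
    return theta0 * np.exp(-0.5 * d2) + theta2 + theta3 * X1 @ X2.T

# With eta = (10.0, 0.01), the kernel is sensitive to the first input dimension
# and almost ignores the second one.
X = np.random.default_rng(9).standard_normal((5, 2))
print(ard_kernel(X, X))
```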

Gaussian process classification
- Objective: model the posterior probabilities of the target variable for a new input
- Problem: we need to map function values onto the interval (0, 1)
- Solution: use a Gaussian process together with a nonlinear activation function (e.g. the logistic sigmoid)
[Figure: a sample function a(x) drawn from a Gaussian process prior (left) and its transformation through the logistic sigmoid, σ(a(x)) ∈ (0, 1) (right).]

Gaussian process classification (2)
Consider a two-class problem with target values t ∈ {0, 1}. Define a Gaussian process over a function a(x) and transform it using the logistic sigmoid: y = σ(a(x)).
Similar to before, we need to predict the conditional distribution:
p(t_{N+1} = 1 | t) = ∫ p(t_{N+1} = 1 | a_{N+1}) p(a_{N+1} | t) da_{N+1} = ∫ σ(a_{N+1}) p(a_{N+1} | t) da_{N+1}.   (6.76)
However, the integral is analytically intractable. Approximations can be of numerical or analytical nature.

Gaussian process classification (3)
- Problem 1: we need to compute a weird integral
- Solution 1: we know how to approximate the convolution of a Gaussian and a sigmoid function (Eq. (4.153)), so we approximate the posterior distribution p(a_{N+1} | t) as a Gaussian
- Problem 2: Hm... but how do we approximate the posterior?
- Solution 2: the Laplace approximation (among others)
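
The convolution referred to in Solution 1 is commonly approximated by ∫ σ(a) N(a | µ, σ²) da ≈ σ(κ(σ²) µ) with κ(σ²) = (1 + πσ²/8)^{-1/2}. A quick numerical sanity check of that approximation (the integration grid and test values are my own):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_gaussian_convolution(mu, var, n=20001):
    """Numerically evaluate the integral of sigmoid(a) N(a | mu, var) da."""
    a = np.linspace(mu - 10 * np.sqrt(var), mu + 10 * np.sqrt(var), n)
    gauss = np.exp(-0.5 * (a - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.trapz(sigmoid(a) * gauss, a)

def probit_style_approximation(mu, var):
    """sigmoid(kappa(var) * mu) with kappa(var) = (1 + pi * var / 8)^(-1/2), cf. Eq. (4.153)."""
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return sigmoid(kappa * mu)

for mu, var in [(0.5, 1.0), (2.0, 4.0), (-1.0, 0.25)]:
    print(sigmoid_gaussian_convolution(mu, var), probit_style_approximation(mu, var))
```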

Laplace approximation
We can rewrite the posterior over a_{N+1} using Bayes' theorem:
p(a_{N+1} | t) = ∫ p(a_{N+1}, a | t) da = ... = ∫ p(a_{N+1} | a) p(a | t) da.   (6.77)
We know how to compute the mean and covariance of p(a_{N+1} | a). Now we only have to find a Gaussian approximation for p(a | t). This is done by noting that p(a | t) ∝ p(t | a) p(a), and thus working with
Ψ(a) = ln p(a | t) = ln p(t | a) + ln p(a) + const.   (6.80)

Laplace approximation (2)
- We can use the iterative reweighted least squares (IRLS) algorithm (Sec. 4.3.3) to find the mode of Ψ(a) (its first and second derivatives have to be evaluated)
- It can be shown that Ψ(a) is concave and thus has a single mode :-)
- The mode position a* and the Hessian matrix H at this position define our Gaussian approximation q(a) = N(a | a*, H^{-1})   (6.86)
- Now we can go back to the formulas, compute the integrals, and finally also p(t_{N+1} | t) from Eq. (6.76)
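
A sketch of this mode-finding step as a Newton iteration, using the gradient t − σ(a) − C^{-1}a and Hessian −(W + C^{-1}) with W = diag(σ(1 − σ)); the toy data, kernel parameters, and jitter term are my own assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_mode(K, t, n_iter=20):
    """Newton (IRLS-style) iteration for the mode a* of Psi(a) = ln p(t|a) + ln p(a)."""
    C = K + 1e-6 * np.eye(len(t))          # small jitter keeps the prior covariance invertible
    a = np.zeros(len(t))
    for _ in range(n_iter):
        s = sigmoid(a)
        W = np.diag(s * (1 - s))
        # Newton update: a_new = C (I + W C)^{-1} (t - sigma + W a)
        a = C @ np.linalg.solve(np.eye(len(t)) + W @ C, t - s + W @ a)
    s = sigmoid(a)
    H = np.diag(s * (1 - s)) + np.linalg.inv(C)   # Hessian of -Psi at the mode
    return a, H                            # q(a) = N(a | a*, H^{-1})   (6.86)

# toy binary data: the class label depends on whether x > 0.5
rng = np.random.default_rng(10)
X = rng.uniform(0, 1, 30)
t = (X > 0.5).astype(float)
K = np.exp(-0.5 * 20.0 * (X[:, None] - X[None, :]) ** 2)
a_star, H = laplace_mode(K, t)
```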

Learning hyperparameters
- To determine the parameters θ, we can maximize the likelihood function p(t | θ) = ∫ p(t | a) p(a | θ) da   (6.89)
- Again, the integral is analytically intractable, so the Laplace approximation can be applied once more
- We need an expression for the gradient of the logarithm of p(t | θ)
- Hm... this is a bit trickier, but nevertheless doable... by now we might be happy to skip the exact details :-)

Gaussian process classification (finishing up)
Putting everything together, we now have a binary classifier based on a Gaussian process.
[Figure: the resulting Gaussian process classifier on a toy two-class data set, showing the predicted decision boundary.]

Connection to neural networks
Neural networks:
- The range of representable functions is governed by the number M of hidden units
- Within the maximum likelihood framework, they overfit as M approaches the number of training samples
Bayesian neural networks:
- The prior over w in conjunction with the network function f(x, w) produces a prior distribution over functions y(x)
- This distribution over functions tends to a Gaussian process in the limit M → ∞
- The property that the outputs share hidden units (and thus borrow statistical strength from each other) is lost in this limit... and some other details