Functional modelling with Gaussian Processes


1 Functional modelling with Gaussian Processes. Lehel Csató, Faculty of Mathematics and Informatics, Babeş Bolyai University, Cluj-Napoca (Matematika és Informatika Tanszék, Babeş Bolyai Tudományegyetem, Kolozsvár; Department of Mathematics and Informatics, Babeş Bolyai University, Kolozsvár). September 2008

2 Table of Contents: 1 Machine learning and latent variable models; 2 Non-parametric models; 3 Support Vector Machines; 4 Gaussian Processes; 5 Models using Gaussian Processes; 6 References

3 Outline of Part 1: Machine Learning; Latent variable models; Estimation methods; Drawbacks of parametric models. (Remaining parts: 2 Non-parametric models; 3 Support Vector Machines; 4 Gaussian Processes; 5 Models using Gaussian Processes; 6 References.)

4 Motivation. D. Donoho: Data, Data, Data! Challenges and opportunities of the coming Data Deluge. Several data types: classification problems needed in decision systems, where the data are frequently of very high dimension and several hundred classes have to be taken into account; regression/prediction problems.

5 Machine learning, historical background / motivation: huge amounts of data that should be processed automatically; mathematics provides general solutions, i.e. solutions not tailored to a given problem; hence the need for a science that uses mathematical machinery to solve practical problems.

6 Definition of Machine Learning: a collection of methods (from statistics, probability theory) to solve problems met in practice: noise filtering for non-linear regression and/or non-Gaussian noise; classification: binary, multiclass, partially labelled; clustering, inversion problems, density estimation, novelty detection. Generally, we need to model the data.

7 Data. Real world: there is a function $y = f(x)$ and we observe the pairs $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$. Observation process: a corrupted datum is collected for a sample $x_n$, either $t_n = y_n + \epsilon$ (additive noise) or $t_n = h(y_n, \epsilon)$ with $h$ a distortion function. Problem: find the function $y = f(x)$.

8 Latent variable models. [Figure: inference of $f^*(x)$ from the data $(x_1,y_1),\dots,(x_N,y_N)$ via a function class $\mathcal{F}$ and an observation process.] A data set is collected. Assume a function class: polynomial, Fourier expansion, wavelet. The observation process encodes the noise. Find the optimal function from the class.

9 Latent variable models II. We have the data set $D = \{(x_1, y_1),\dots,(x_N, y_N)\}$. Consider a function class: (1) $\mathcal{F} = \{ w^T x + b \mid w \in \mathbb{R}^d, b \in \mathbb{R} \}$; (2) $\mathcal{F} = \{ a_0 + \sum_{k=1}^K a_k \sin(2\pi k x) + \sum_{k=1}^K b_k \cos(2\pi k x) \mid a, b \in \mathbb{R}^K, a_0 \in \mathbb{R} \}$. Assume an observation process: $y_n = f(x_n) + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$.

10 Latent variable models III. 1. The data set: $D = \{(x_1, y_1),\dots,(x_N, y_N)\}$. 2. Assume a function class, $\mathcal{F}$ polynomial, etc.: $\mathcal{F} = \{ f(x, \theta) \mid \theta \in \mathbb{R}^p \}$. 3. Assume an observation process and define a loss function $L(y_n, f(x_n, \theta))$; for Gaussian noise, $L(y_n, f(x_n, \theta)) = (y_n - f(x_n, \theta))^2$.

11 Parameter estimation. Estimating parameters means finding the optimal value of $\theta$: $\theta^* = \arg\min_{\theta \in \Omega} L(D, \theta)$, where $\Omega$ is the domain of the parameters and $L(D, \theta)$ is a loss function for the data set, for example $L(D, \theta) = \sum_{n=1}^N L(y_n, f(x_n, \theta))$.

12 Parameter estimation, M.L. $L(D, \theta)$ is the (negative log-)likelihood function. Maximum likelihood estimation of the model: $\theta^* = \arg\min_{\theta \in \Omega} L(D, \theta)$. Example, quadratic regression: $L(D, \theta) = \sum_{n=1}^N (y_n - f(x_n, \theta))^2$ (the likelihood factorises over the data). Drawback: it tends to produce a perfect fit to the data, i.e. over-fitting.
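A minimal MATLAB sketch of the maximum-likelihood estimate under the quadratic loss, fitting polynomials of increasing order by least squares; the data, target function and orders are illustrative assumptions, not taken from the slides. The training loss keeps shrinking as the order grows, which is the over-fitting drawback mentioned above.

% Maximum likelihood fit under quadratic loss: polynomials of increasing order
N = 15;
x = linspace(-1, 1, N)';
y = sin(3*x) + 0.1*randn(N, 1);         % noisy observations of a smooth function
for p = [1 3 9]
  V = ones(N, p+1);
  for k = 1:p, V(:, k+1) = x.^k; end    % design matrix: f(x, theta) = V*theta
  theta = V \ y;                        % minimiser of the summed quadratic loss
  fprintf('order %d: training loss %.4f\n', p, sum((y - V*theta).^2));
end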

13 Maximum Likelihood Over-fitting. [Figure: training set and test set with polynomial fits of order 3 and 4.]

14 Maximum a posteriori estimation. M.A.P. assigns probabilities to the data D: the (log-)likelihood function, the probability of the data drawn using $\theta$, $P(y_n \mid x_n, \theta, \mathcal{F}) \propto \exp[-L(y_n, f(x_n, \theta))]$ (a normalisation constant is missing); and to the parameters $\theta$: the probability that $\theta$ could have had a given value, $p_0(\theta) \propto \exp[-\|\theta\|^2/2]$, the prior probability.

16 M.A.P. estimation II. Combining the prior with the observation likelihood using Bayes' rule, the a posteriori probability of the parameters is $p(\theta \mid D, \mathcal{F}) = \frac{P(D \mid \theta)\, p_0(\theta)}{p(D \mid \mathcal{F})}$, where $p(D \mid \mathcal{F}) = \int_\Omega d\theta\, P(D \mid \theta)\, p_0(\theta)$ is the probability of the data set under a given family of models.

17 M.A.P. estimation III. M.A.P. estimation aims at finding the $\theta$ with the largest posterior probability: $\theta_{MAP} = \arg\max_{\theta \in \Omega} p(\theta \mid D, \mathcal{F})$. Example, using $L(y_n, f(x_n, \theta))$ to define the likelihood and a Gaussian prior: $\theta_{MAP} = \arg\max_{\theta \in \Omega}\left[ K - \sum_n L(y_n, f(x_n, \theta)) - \frac{\|\theta\|^2}{2\sigma_o^2} \right]$ (after a change of sign the max becomes a min). For $\sigma_o^2 \to \infty$ we recover maximum likelihood.
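A hedged MATLAB sketch of the corresponding M.A.P. estimate: quadratic loss plus the Gaussian prior penalty, i.e. ridge-regularised least squares; the prior variance and data are illustrative assumptions. The penalty shrinks the coefficients relative to the maximum-likelihood fit of the previous sketch.

% MAP estimate: quadratic loss + Gaussian prior on theta (ridge regularisation)
N = 15; p = 9;
x = linspace(-1, 1, N)';
y = sin(3*x) + 0.1*randn(N, 1);
V = ones(N, p+1);
for k = 1:p, V(:, k+1) = x.^k; end
sigma2_o  = 0.1;                                       % prior variance (illustrative)
theta_ml  = V \ y;                                     % maximum likelihood
theta_map = (V'*V + eye(p+1)/(2*sigma2_o)) \ (V'*y);   % minimiser of sum L + ||theta||^2/(2*sigma2_o)
fprintf('||theta_ML|| = %.2f   ||theta_MAP|| = %.2f\n', norm(theta_ml), norm(theta_map));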

18 M.A.P. Example. [Figure: true function, training data, and order-6 polynomial M.A.P. fits for different prior/noise deviations.]

19 Parameter estimation, Bayes. We use Bayes' rule, $p(\theta \mid D, \mathcal{F}) = \frac{P(D\mid\theta)\, p_0(\theta)}{p(D\mid\mathcal{F})}$ with $p(D\mid\mathcal{F}) = \int_\Omega d\theta\, P(D\mid\theta)\, p_0(\theta)$, and try to store the whole distribution of the possible values. We therefore operate with $p_{post}(\theta\mid D,\mathcal{F})$.

20 Bayes estimation II. When computing $p_{post}(\theta \mid D, \mathcal{F})$ we assumed that the posterior can be represented analytically. In general this is not the case: approximations are needed for the posterior distribution and for the predictive distribution. In Bayesian modelling an important issue is how we approximate the posterior distribution.

21 Bayes Example. [Figure: Bayesian fit with an order-6 polynomial, noise variance $\sigma^2 = 1$.]

22 Graphical Models. M.L. estimation: no prior. M.A.P. estimation: no distribution, only a point estimate. Bayes estimation: the full graphical model with hyper-parameters $\mu, \sigma$, parameter $\theta$ and observations $(x, y)$. In Bayes models, if approximations are used at the hyper-parameter level we have Level II Maximum Likelihood.

23 Drawbacks of Parametric Models. Model complexity is an important parameter: choosing $\mathcal{F}$ is an important decision in modelling the data. Example, for polynomial functions: linear is too simple; quadratic is good for medium-sized data; ... The model complexity should be changed if more data become available.

24 Outline of Part 2, Non-parametric models: Density estimation; Regression and classification. (Remaining parts: 3 Support Vector Machines; 4 Gaussian Processes; 5 Models using Gaussian Processes; 6 References.)

25 Non-parametric Models. Non-parametric models use the available data for prediction. 1. Density estimation: histogram method; Parzen window. 2. Regression/Classification: K-nearest neighbour rule; support vector machines; Gaussian processes.

27 Density Estimation. Histogram method: the parameters are the bin width and the bin locations. Sensitive to the choice of parameters; not usable in higher dimensions.

28 Density Estimation II. Parzen window: $h(x) = \frac{1}{N}\sum_{n=1}^N k(x_n, x)$. Parameters: the kernel function width and the data locations. Not usable for large data sets. Nonparametric: the summation scales with N.
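A small MATLAB sketch of the Parzen window estimate $h(x) = \frac{1}{N}\sum_n k(x_n, x)$ with a Gaussian kernel; the mixture data and the width h are illustrative assumptions.

% Parzen window density estimate with a Gaussian kernel of width h
N  = 500;
xn = [randn(N/2, 1) - 2; randn(N/2, 1) + 2];   % data from a two-component mixture
h  = 0.3;                                      % kernel width (illustrative choice)
xg = linspace(-6, 6, 200)';                    % evaluation grid
hx = zeros(size(xg));
for n = 1:N
  hx = hx + exp(-(xg - xn(n)).^2 / (2*h^2)) / sqrt(2*pi*h^2);
end
hx = hx / N;                                   % h(x) = (1/N) sum_n k(x_n, x)
plot(xg, hx);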

29 Non-parametric Classification. K-nearest neighbour (KNN): the parameter is the number of neighbours. Slow for large data sets. Nonparametric: all data are taken into account.

30 KNN for high dimensions I. We generally normalise and centre the data: $x_i \leftarrow x_i - \bar{x}$ with $\bar{x}$ the mean, and $x_{ij} \leftarrow x_{ij}/\sigma_j$ with $\sigma_j^2$ the variance of the $j$-th component, so each $x_{ij}$ has zero mean and unit variance. The squared length of the random vector $x_i$ is $l_i^2 = \sum_j x_{ij}^2$; cf. the central limit theorem, the larger the dimension, the more concentrated the length $l_i$ is around its mean.

31 KNN for high dimensions II. Example (MATLAB): dim=20; x=randn(100000,dim); s2=sum(x.^2,2)/dim; g=histc(s2,bin); [Figure: histograms of the normalised squared lengths for dim = 1, 2, 8, 20; the distribution concentrates around 1 as the dimension grows.]

32 KNN for high dimensions III. The angle (inner product): $\langle x_i, x_j\rangle = \sum_{l=1}^d x_{il} x_{jl}$, with average value 0. [Figure: histograms of $\langle x, y\rangle$ for d = 10, 20, ...] Central limit theorem: the larger the dimension, the more orthogonal random vectors are, and the more difficult it is to select a representative.

34 Outline of Part 3, Support Vector Machines: Loss functions; Function class; Kernels. (Remaining parts: 2 Non-parametric models; 4 Gaussian Processes; 5 Models using Gaussian Processes; 6 References.)

35 Support Vector Machines, Motivation I. KNN is too demanding for large data sets and works only in low dimensions. We want an algorithm that works well in high dimensions: $f^*(x) = \arg\min_{f\in\mathcal{F}}\left\{ \sum_n L(y_n, f(x_n)) + \|Pf(\cdot)\|^2 \right\}$. Learning algorithm: design a method that simultaneously minimises the empirical error (first term) and selects cleverly (second term) from a large family $\mathcal{F}$ of available functions. Details of the elements follow.

36 S.V.M.s, Motivation II. The same optimisation, $f^*(x) = \arg\min_{f\in\mathcal{F}}\left\{ \sum_n L(y_n, f(x_n)) + \|Pf(\cdot)\|^2 \right\}$: within this framework we look for candidates in a large family of functions and we penalise the complexity of the functions. Next: the loss function and the family $\mathcal{F}$.

37 S.V.M. Loss functions. Hinge loss: $L(z) = 0$ if $z \geq 1$ and $L(z) = 1 - z$ if $z < 1$. For classification with labels $y_n = \pm 1$ the loss function for the data set D is $L(D) \stackrel{def}{=} \sum_n L(y_n f(x_n))$. Minimality: $L(D)$ is minimal if all elements are separated with margin 1.

38 S.V.M. Loss Functions II. Hinge loss, in classification, penalises data away from the boundary; assuming class labels $y_n = \pm 1$, $L(y_n, f(x_n)) = H_+(1 - y_n f(x_n))$ with $H_+$ the positive part. Logit loss, in binary classification, returns the log-probabilities: $L_{logit}(y_n, f(x_n)) = \log(1 + \exp(-y_n f(x_n)))$; when $y_n f(x_n) \to \infty$, $L_{logit} \to 0$; in the probit model $(1 + \exp(-\cdot))^{-1}$ is replaced with $\Phi(\cdot)$. Quadratic loss, the quadratic error $L_{mse}(y_n, f(x_n)) = (y_n - f(x_n))^2 = (1 - y_n f(x_n))^2$ for $\pm 1$ labels, can also be applied to regression problems.

39 Support Vector Machines, Function Class O. Which family of functions? We have only the data D. Possibilities: linear might be too simple; a complex class might be too complex. In general we want a flexible function class, but we do not want a too complex solution. Solution: use a large function class $\mathcal{F}$ and define penalties on the complexity of the solution.

44 Support Vector Machines, Function Class I. Loss functions help us find the best function from a family of functions. The family of functions is important: it should be flexible enough for large data sets and allow for different length scales. The solution is of the form ($P$ a penalty operator): $f^*(x) = \arg\min_{f\in\mathcal{F}}\left\{ \sum_n L(y_n, f(x_n)) + \|Pf(\cdot)\|^2 \right\}$; the null space $\|Pf(\cdot)\|^2 = 0$ gives the unpenalised possible solutions.

45 Support Vector Machines, Function Class II. Example: $\mathcal{F}$ is the class of twice differentiable functions with continuous second derivatives; $\|Pf\|^2 \stackrel{def}{=} \int (\partial_x^2 f(\cdot))^2$, the sum (integral) of the squared second derivatives. Result: interpolating splines with knots at the data points. [Figure: spline interpolation of the data and its second derivative.] Message: the optimisation is plausible, with reliable results. Presented in Wahba 1990, Spline Models for Observational Data.

46 Support Vector Machines, Function Class III. Result of the optimisation problem: $K_s(x, x') = \frac{1}{3}\min(x,x')^3 + \frac{1}{2}|x - x'|\min(x,x')^2 + xx' + 1$. Solution of the argmin: $\hat{f}(x) = \sum_n \alpha_n K_s(x, x_n)$, a linear combination of (piecewise) polynomials. [Figure: the kernel $k(x, x')$.]

47 Kernels in RKHS. Reproducing kernel Hilbert space: assume an index set $X$ and $\mathcal{H} = \{f \mid f: X \to \mathbb{R}\}$ with the dot product $\langle\cdot,\cdot\rangle$ such that $\|f\|^2 = \langle f, f\rangle$. $\mathcal{H}$ is an RKHS if there exists $k: X\times X \to \mathbb{R}$ such that (1) $k$ has the reproducing property: $\forall f\in\mathcal{H}$, $\langle f, k(x,\cdot)\rangle = f(x)$; and (2) $k$ spans $\mathcal{H}$. Not a feature map: an RKHS is not equivalent to a feature map; there is more than a single feature map for the same $k(\cdot,\cdot)$.

48 RKHS II. $k(\cdot,\cdot)$ spans $\mathcal{H}$: every function $f\in\mathcal{H}$ is a linear combination of $\{k(x_n,\cdot)\}_{n=1}^{N_H}$: $f(x) = \sum_{n=1}^{N_H} w_n k(x, x_n)$ (with $N_H = \infty$ allowed). Note: we choose the support set, i.e. the location of the points and the set size, and we choose the weights. Empirical kernel map: the support set is the training data set, $\mathcal{H}_e = \left\{\sum_n \alpha_n k(x, x_n) \mid \alpha_n \in \mathbb{R}\right\}$.

50 RKHS, Mercer. Mercer kernel: for a positive-definite $k(\cdot,\cdot)$ there exists a set $\{\phi_j\}$ of orthogonal functions and positive constants $\{\lambda_j\}$ such that $k(x, x') = \sum_j \lambda_j \phi_j(x)\phi_j(x')$. Note: the function $k(\cdot,\cdot)$ defines the eigenfunctions $\{\phi_j\}$ and the corresponding eigenvalues $\{\lambda_j\}$, independently of the data set we are using; convergence guarantee: $\sum_j \lambda_j^2 < \infty$.

52 Kernels for S.V.M.s, Example. Kernel functions: let $\Phi: \mathbb{R}\to\mathbb{R}^3$ be given by $\Phi(x) = [1, \sqrt{2}x, x^2]^T$ and compute $\Phi(x)^T\Phi(y) = 1 + 2xy + x^2y^2 = (1 + xy)^2 \stackrel{def}{=} K(x, y)$. Kernel trick: we can translate linear algorithms into nonlinear ones using a kernel function represented as a scalar product.
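A two-line MATLAB check of this identity (the test values are arbitrary): the scalar product of the explicit features equals the kernel $(1 + xy)^2$.

Phi = @(x) [1; sqrt(2)*x; x^2];         % explicit feature map
x = 0.7; y = -1.3;
disp([Phi(x)'*Phi(y), (1 + x*y)^2]);    % the two numbers coincide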

53 The XOR problem. [Figure: the XOR configuration in the $(x_1, x_2)$ plane, which is not linearly separable in the input space.]

54 Kernels for S.V.M.s II. Kernels are two-argument functions that generalise a matrix to non-countable index sets: $K: X\times X\to\mathbb{R}$. Valid kernel functions are positive definite: for any $a = [a_1, a_2, \dots, a_L]$ and $D = [x_1, \dots, x_L]^T$: $\sum_{k=1}^L\sum_{l=1}^L a_k K(x_k, x_l)\, a_l \geq 0$. Proof idea: the kernel function is $K(x, x') = \Phi(x)^T\Phi(x')$ and $\sum_{k,l=1}^L a_k\Phi(x_k)^T\Phi(x_l)a_l = \left(\sum_i a_i\Phi(x_i)\right)^T\left(\sum_i a_i\Phi(x_i)\right) = s^Ts \geq 0$, where $s = \sum_{i=1}^L a_i\Phi(x_i)$. Question: what is the dimensionality of $\Phi(x)$?

55 S.V.M.s Summary. Developed for classification. Idea, the kernel trick: use linear algorithms; project to the feature space $\Phi$; find the solution in the feature space; back-project; obtain a non-linear solution. The solution to the problem is of the form $f(x) = \Phi_x^T\sum_{i\in SV}\alpha_i\Phi_{x_i} = \sum_{i\in SV}\alpha_i K(x, x_i)$. Nonparametric: the number of parameters is not fixed.

56 Kernel Algorithms. The projection trick is exploited, based on the success of the S.V.M.s. General recipe: find/construct a linear algorithm; re-express it in $\Phi$, the space of features; write the non-linear solution in the space of inputs using $K(x, x')$. Algorithms: kernel regression, kernel ridge regression, kernel Principal/Independent Components, kernel Fisher Discriminants.

57 Advantages / Price. Definition: in non-parametric methods the number of parameters (the complexity) may change with the data. Advantages: greater complexity; built-in regularisation; performant algorithms. Disadvantages: suitable for small data sets only; no selection of parameters.

60 Outline of Part 4, Gaussian Processes: Kernels within GP; Inference and Prediction; Gaussian Regression; Posterior Approximations; Optimising hyper-parameters; Sparse Representation. (Remaining parts: 2 Non-parametric models; 3 Support Vector Machines; 5 Models using Gaussian Processes.)

61 Bayesian Nonparametric Methods. GP models are functional latent variable models with simple random functions. The latent function appears in the likelihood: $P(D\mid f_D) = \prod_i P(y_i\mid x_i, f) = \prod_i P(y_i\mid x_i, f_{x_i})$; local dependencies only: each $y_i$ depends on $f$ only through $f_{x_i}$.

62 Gaussian processes I. Gaussian process: a generalisation of a Gaussian; $f$ is a Gaussian random function, $f = [f_{x_1}, f_{x_2}, \dots, f_{x_N}, \dots]^T$ with $x_n$ in the domain. The GP prior $p_0(f)$ is characterised by the mean function $\langle f_x\rangle_0$ and the covariance kernel $K_0(x, x')$. Property: for any sample set D, the values form a joint Gaussian random variable, $f_D = [f_{x_1}, \dots, f_{x_N}] \sim N(f_D \mid \langle f_D\rangle_0, K_0)$.

63 Gaussian processes II. A Gaussian process is a random function; its parameters are the mean function and the covariance kernel. [Figure: GP realisations before training, the GP mean function and samples from the GP.]

64 GP parameters. The mean function usually is 0. Parameters: the class of the kernel function and the parameters hidden inside the kernel function. Example: $K(x, x'\mid\theta) = \theta_0\exp\left[-\sum_{i=1}^d \theta_i (x_i - x_i')^2\right]$, with $\theta = [\theta_0, \theta_1, \dots, \theta_d]^T$ the parameter vector.

65 Kernel Functions. Kernel functions generate the covariance matrix and need to be positive definite functions/matrices: for $X = \{x_1, \dots, x_T\}$, the matrix $K_X$ with entries $[K_X]_{ij} = K(x_i, x_j)$, $i, j = 1,\dots,T$, must be positive definite. Next: a construction of kernels as covariances.
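A MATLAB sketch, in the spirit of the listing on slide 77, that builds the covariance matrix of the squared-exponential kernel from slide 64 (parameter values are illustrative assumptions), checks positive definiteness via the smallest eigenvalue, and draws prior samples with a Cholesky factor.

% Covariance matrix of a squared-exponential kernel, PD check, prior samples
theta0 = 1; theta1 = 25;                            % illustrative kernel parameters
k = @(x, y) theta0 * exp(-theta1 * (x - y).^2);
T = 200;
x = linspace(0, 1, T)';
K = zeros(T, T);
for ii = 1:T
  for jj = 1:T
    K(ii, jj) = k(x(ii), x(jj));
  end
end
fprintf('smallest eigenvalue: %g\n', min(eig(K)));  % non-negative up to round-off
L = chol(K + 1e-8*eye(T), 'lower');                 % small jitter for numerical stability
f = L * randn(T, 5);                                % five samples from the GP prior
plot(x, f);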

66 Kernel constructions I. Let $\{x_k = k\Delta_N\}_{k=1}^N$ be the locations of independent Gaussian r.v.s $v_k \sim N(0, \sigma_0^2\Delta_N)$, where $\Delta_N = 1/N$. Let $t$ denote the index where $t = k_t\Delta_N$. If $b_t \stackrel{def}{=} \sum_{i=1}^{k_t} v_i$, then $\langle b_t\rangle = 0$ and the covariance is $\langle b_s b_t\rangle = \sum_{i_s=1}^{k_s}\sum_{i_t=1}^{k_t}\langle v_{i_s} v_{i_t}\rangle = \sum_{i=1}^{\min(k_s,k_t)}\langle v_i^2\rangle = \sum_{i=1}^{\min(k_s,k_t)}\sigma_0^2\Delta_N = \min(s, t)\,\sigma_0^2$; in the following we take $\sigma_0^2 = 1$. Brownian motion: a stochastic process with covariance kernel $K(s,t) = \min(s,t)$ is a Brownian motion; its derivative $v_t$ is white noise.
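A short MATLAB sketch of this construction: cumulative sums of independent $N(0, \Delta_N)$ increments, with the empirical covariance compared against $\min(s, t)$. Grid size and number of paths are illustrative assumptions.

% Brownian motion as a cumulative sum of independent Gaussian increments
N = 200; R = 5000;                     % grid size and number of sample paths
dN = 1/N;
t = (1:N)' * dN;
v = sqrt(dN) * randn(N, R);            % v_k ~ N(0, dN), with sigma_0^2 = 1
b = cumsum(v, 1);                      % b_t = sum of the increments up to t
is = 50; it = 150;                     % compare cov(b_s, b_t) with min(s, t)
emp = mean(b(is, :) .* b(it, :));
fprintf('empirical covariance %.3f  vs  min(s,t) = %.3f\n', emp, min(t(is), t(it)));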

67 Kernel constructions II. [Figure: sample paths of the Brownian motion at different resolutions (different N's).]

68 Kernel constructions III. Let us integrate (sum) the Brownian motion. Define $s_t \stackrel{def}{=} \Delta_N\sum_{i_t=1}^{k_t} b_{i_t}$, leading to $\langle s_t\rangle = 0$ and $K(s,t) = \langle s_s s_t\rangle = \Delta_N^2\sum_{i_s=1}^{k_s}\sum_{i_t=1}^{k_t}\langle b_{i_s} b_{i_t}\rangle \to \int_0^s dz_s\int_0^t dz_t\, \min(z_s, z_t)$. For the integration we assume $s < t$, hence $z_s < t$ and $[0, t] = [0, z_s]\cup[z_s, t]$.

69 Kernel constructions IV. Using $[0,t] = [0,z_s]\cup[z_s,t]$: $K(s,t) = \int_0^s dz_s\left(\int_0^{z_s} dz_t\, z_t + \int_{z_s}^t dz_t\, z_s\right) = \int_0^s dz_s\left(\frac{z_s^2}{2} + z_s(t - z_s)\right) = \int_0^s dz_s\left(z_s t - \frac{z_s^2}{2}\right) = \frac{s^2 t}{2} - \frac{s^3}{6}$, assuming $s < t$. After symmetrisation (writing the $s > t$ case and unifying): $K(s,t) = \frac{s\,t\,\min(s,t)}{2} - \frac{\min(s,t)^3}{6} = \frac{1}{2}\min(s,t)^2\,|s-t| + \frac{1}{3}\min(s,t)^3$.

70 Kernel constructions V. [Figure: samples from the integrated Brownian motion (the underlying Brownian motion on the bottom).]

71 Kernel constructions VI. The integrated Brownian motion has $s(0) = s'(0) = 0$. For generality we add a constant and a linear term: $s_2(x) = w_0 + w_1 x + s(x)$, where $w_0, w_1$ are i.i.d. Gaussian r.v.s. This means that the kernel is $K_2(s,t) = \langle s_2(s)\, s_2(t)\rangle = 1 + st + K_s(s,t) = 1 + st + \frac{1}{2}\min(s,t)^2\,|s-t| + \frac{1}{3}\min(s,t)^3$.

72 Kernel constructions VII. [Figure: random splines, i.e. samples from the GP with the kernel $K_2$.]

73 Kernels from conditioning VIII. Consider the Brownian motion, $K_b(s,t) = \min(s,t)$. We generate random functions with $x_1 = 0$; the result is called the Brownian bridge. To calculate the covariance we have to condition the r.v.s $x_s$ and $x_t$ on $x_1 = 0$. We identify the kernel from the conditioned joint Gaussian distribution: $p(x_s, x_t \mid x_1 = 0) \propto \exp\left[-\frac{1}{2}[x_s, x_t, 0]\begin{bmatrix} K_b(s,s) & K_b(s,t) & s\\ K_b(t,s) & K_b(t,t) & t\\ s & t & 1\end{bmatrix}^{-1}[x_s, x_t, 0]^T\right] \propto \exp\left[-\frac{1}{2}[x_s, x_t]\left(\begin{bmatrix} K_b(s,s) & K_b(s,t)\\ K_b(t,s) & K_b(t,t)\end{bmatrix} - \begin{bmatrix} s\\ t\end{bmatrix}\begin{bmatrix} s & t\end{bmatrix}\right)^{-1}[x_s, x_t]^T\right]$, where we used the matrix inversion lemma; hence $K_0(s,t) = K_b(s,t) - st$.

75 Kernels from conditioning IX. [Figure: samples from a Brownian bridge.]

76 Kernels from conditioning X. Exercise: find the mean and kernel functions corresponding to the Brownian bridge where $x_1 = 1$. Consider the spline kernel $K_s(s,t) = \frac{1}{2}\min(s,t)^2\,|s-t| + \frac{1}{3}\min(s,t)^3$; similarly to the Brownian bridge, find the kernel function for the splines conditioned on $x_1 = 0$. [Figure: samples from this second family.]

77 Generating random samples XI (MATLAB):

% Sample T random functions from the Brownian-bridge kernel on (0, D)
clear all;
N = 100; T = 25; D = 0.99;
t = linspace(0, D, N+1); t = t(2:end);
% put the covariance function here
k_bb = inline('min(s,t) - s*t', 's', 't');
Ks = zeros(N, N);
for ii = 1:N
  for jj = 1:N
    Ks(ii, jj) = k_bb(t(ii), t(jj));
  end
end
kKs = chol(Ks);               % Cholesky factor of the covariance matrix
yr  = randn(T, N);            % independent standard normal samples
ys  = zeros(N+2, T);
ys(2:N+1, :) = (yr * kKs)';   % correlated samples; the endpoints stay at 0
t0 = [0, t, 1];
figure(1); cla; box on; hold on;
plot(t0, ys);

78 Gaussian Process Inference I. GP inference is an application of Bayes' rule: $p_{post}(f, f_D) \propto P(D\mid f_D)\, p_0(f_D, f)$. For any collection of indexes $X$ the posterior is $p_{post}(f_X) = \frac{1}{Z_D}\int df_D\, P(D\mid f_D)\, p_0(f_D, f_X)$, where $Z_D = \int df_D\, P(D\mid f_D)\, p_0(f_D)$ is the data probability conditioned on the model. Observe: no specific form of $P(D\mid f_D)$ is assumed.

79 Gaussian Process Inference II. Problems with the representation $p_{post}(f_X) \propto \int df_D\, P(D\mid f_D)\, p_0(f_D, f_X)$: the integral evaluation is necessary for the posterior distribution. Representation: how do we represent the posterior? We need a finite representation of the posterior process; non-Gaussian posterior processes require approximations.

80 Gaussian Process Parametrisation I. Property of Gaussian averages: $\langle f_x\rangle_{post} = \langle f_x\rangle_0 + \sum_i K_0(x, x_i)\,\alpha(i)$, where the coefficients are $\alpha(i) = \frac{\partial}{\partial\langle f_i\rangle_0}\ln\langle P(D\mid f_D)\rangle_0$. These provide a parametrisation (see Kimeldorf-Wahba).

81 Gaussian Process Parametrisation II. For the posterior kernel: $K_{post}(x, x') = K_0(x, x') + \sum_{i,j} K_0(x, x_i)\, C(ij)\, K_0(x_j, x')$, where the coefficients are $C(ij) = \frac{\partial^2}{\partial\langle f_i\rangle_0\,\partial\langle f_j\rangle_0}\ln\langle P(D\mid f_D)\rangle_0$.

82 GPs in Feature space I. $K_0(x, x')$ defines a feature space $F$: $\phi_x, \phi_{x'}\in F$ and $K_0(x, x') = \phi_x^T\phi_{x'}$. Using $F$ and the scalar product: $\langle f_x\rangle_{post} = \phi_x^T\sum_{i=1}^N\alpha(i)\phi_i = \phi_x^T\mu_{post}$ and $K_{post}(x, x') = \phi_x^T\left(I_F + \sum_{i,j=1}^N\phi_i\, C(ij)\,\phi_j^T\right)\phi_{x'} = \phi_x^T\Sigma_{post}\,\phi_{x'}$.

83 GPs in Feature space II. $K_0(x, x')$ defines a feature space $F$ with $K_0(x, x') = \phi_x^T\phi_{x'}$; $\langle f_x\rangle_{post}$ corresponds to $\mu_{post}$ and $K_{post}(x, x')$ to $\Sigma_{post}$. GP inference is thus estimating a Gaussian distribution in $F$.

84 Prediction with Gaussian processes. Given $x^*$ for which we require the answer $y^*$: $p(y^*\mid x^*, D) = \int df^*\int df_D\, p(y^*, f_D, f^*\mid x^*, D) = \int df^*\, P(y^*\mid x^*, f^*)\int df_D\, p_{post}(f^*, f_D\mid D) = \int df^*\, P(y^*\mid x^*, f^*)\, p_{post}(f^*\mid D)$, where $f^* = f_{x^*}$ is the random variable associated with $x^*$. We use the posterior process irrespective of the likelihood; if it is not Gaussian, we approximate.

85 Regression with Gaussian noise. Gaussian noise: $P(y_D\mid f_D) \propto \exp\left[-\frac{1}{2\sigma_o^2}\|y_D - f_D\|^2\right]$. Gaussian latent variables: $P(f_D\mid K(\cdot,\cdot\mid\theta)) \propto \exp\left[-\frac{1}{2}(f_D - \mu_D)^T K_D^{-1}(f_D - \mu_D)\right]$. Combining: the product of quadratic exponents is again a Gaussian.

86 Posterior distribution, Gaussian noise. Reading off the coefficients of the Gaussian distribution: $\mu_{post} = \left(K_D^{-1} + \frac{1}{\sigma_o^2}I_N\right)^{-1}\left[K_D^{-1}\mu_D + \frac{1}{\sigma_o^2}y_D\right]$ and $\Sigma_{post} = \left(K_D^{-1} + \frac{1}{\sigma_o^2}I_N\right)^{-1}$. The joint distribution of all r.v.s at the training locations is a Gaussian: $f_D \sim N(\mu_{post}, \Sigma_{post})$.

87 Posterior distribution, Brownian bridge. Assume that there was an observation for the Brownian motion $k(x, x') = \min(x, x')$: at 1 the value of the process is 0. It generates a new process with kernel $k_B(x, x') = \min(x, x') - xx'$.

88 Predictive distributions, Gaussian noise. For a test location $x^*$: $\mu^* = \alpha_D^T k^*$ and $\sigma^{*2} = K(x^*, x^*) - k^{*T} C_D\, k^*$, with $\alpha_D = C_D\, y_D$, $C_D = (K_D + \sigma_o^2 I_N)^{-1}$ and $k^* = [K(x^*, x_1), \dots, K(x^*, x_N)]^T$. Posterior mean and covariance functions: $\mu(x) = \alpha_D^T k_x$ and $K_{post}(x, x') = K(x, x') - k_x^T C_D\, k_{x'}$, where $k_x = [K(x, x_1), \dots, K(x, x_N)]^T$.
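A MATLAB sketch of these predictive formulas for Gaussian-noise regression; the kernel, noise level and data are illustrative assumptions. It builds $K_D$, computes $\alpha_D$ and $C_D$, and evaluates the predictive mean and variance at a test point.

% GP regression with Gaussian noise: predictive mean and variance at x*
k  = @(x, y) exp(-25*(x - y).^2);            % illustrative kernel
s2 = 0.01;                                   % noise variance sigma_o^2
N  = 30;
x  = rand(N, 1);
y  = sin(2*pi*x) + sqrt(s2)*randn(N, 1);
K  = zeros(N, N);
for ii = 1:N, for jj = 1:N, K(ii, jj) = k(x(ii), x(jj)); end; end
CD    = inv(K + s2*eye(N));                  % C_D = (K_D + sigma_o^2 I)^{-1}
alpha = CD * y;                              % alpha_D = C_D y_D
xs = 0.5;                                    % test location x*
ks = zeros(N, 1);
for ii = 1:N, ks(ii) = k(xs, x(ii)); end
mu_star  = alpha' * ks;                      % predictive mean
var_star = k(xs, xs) - ks' * CD * ks;        % predictive variance of the latent function
fprintf('mean %.3f   variance %.4f\n', mu_star, var_star);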

89 Non-computable Posteriors. If the likelihood is non-Gaussian, the posterior does not have an analytical form (no summarising statistics). Methods to obtain the posterior: sampling; analytic approximations.

90 Sampling from the posterior I. Sampling from $p_{post}(f_X) = \frac{1}{Z_D}\int df_D\, P(D\mid f_D)\, p_0(f_D, f_X)$: in practice, joint sampling from $p_{post}(f_X, f_D)$, keeping only $f_X$. Implementation: sampling from $p_0(f_D, f_X)$ plus weighting, $p_{post}(f_X) \approx \frac{1}{C_T}\sum_{t=1}^T P(y_N\mid f_D^{(t)})\,\delta(f_X - f_X^{(t)})$, with $C_T = \sum_{t=1}^T P(y_N\mid f_D^{(t)})$.
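A hedged MATLAB sketch of this sample-and-weight idea for a non-Gaussian (probit-type) likelihood: draw prior functions jointly at the training and test inputs, weight each sample by its likelihood, and form the weighted posterior mean at the test point. Kernel, labels and sample count are illustrative assumptions.

% Posterior by prior sampling + likelihood weighting
k   = @(x, y) exp(-4*(x - y).^2);
Phi = @(z) 0.5*(1 + erf(z/sqrt(2)));              % Gaussian cdf used as sigmoid
xD  = [-1; 0; 1];  yD = [-1; -1; 1];              % toy classification data
xs  = 0.5;                                        % test input
X   = [xD; xs];
M   = numel(X); K = zeros(M, M);
for ii = 1:M, for jj = 1:M, K(ii, jj) = k(X(ii), X(jj)); end; end
L   = chol(K + 1e-8*eye(M), 'lower');
T   = 20000;
F   = L * randn(M, T);                            % joint prior samples of [f_D; f_*]
lik = prod(Phi(bsxfun(@times, yD, F(1:numel(xD), :))), 1);  % likelihood of each sample
w   = lik / sum(lik);                             % normalised weights
fprintf('weighted posterior mean at x*: %.3f\n', F(end, :) * w');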

91 Sampling from the posterior II. Sampling methods are powerful, i.e. they allow flexibility in modelling; it is hard to assess convergence; different sampling algorithms are suited to different models; they can be incredibly slow (tempering, MCMC).

92 Laplace Approximation. Log posterior: $\log p_{post}(f_X, f_D) = K + \log P(D\mid f_D) + \log p_0(f_D, f_X) = K + \underbrace{\log P(D\mid f_D) + \log p_0(f_D)}_{g_D(f_D)} + \underbrace{\log p_0(f_X\mid f_D)}_{g_X(f_X)}$. Finding the maximum over $f_X$ and $f_D$: $\hat{f}_X = P_{X|D}\hat{f}_D$ and $\hat{f}_D = \arg\max g_D(f_D)$. Taylor expansion around $\hat{f}_D$: $g_D(f_D) \approx g_D(\hat{f}_D) - \frac{1}{2}(f_D - \hat{f}_D)^T H_g(\hat{f}_D)(f_D - \hat{f}_D)$, with $H_g$ the (negative) Hessian and the linear term vanishing at the maximum; this is a Gaussian.

93 Laplace approximation II. Gaussian approximation: $\hat{p}_{post}(f_X, f_D) \propto p_0(f_X, f_D)\,\underbrace{\frac{N_L\left(f_D\mid\hat{f}_D, H_g(\hat{f}_D)^{-1}\right)}{p_0(f_D)}}_{\hat{P}(D\mid f_D)}$. This defines an approximation to the likelihood: $\hat{P}(D\mid f_D) \propto \frac{N_L\left(f_D\mid\hat{f}_D, H_g(\hat{f}_D)^{-1}\right)}{p_0(f_D)}$.

94 Laplace approximation Summary. The Laplace approximation: (+) generates an approximation to the likelihood; (-) is applicable only for differentiable likelihood functions; (+) defines an approximation to the whole process; (-) the (negative) Hessian has to be positive definite and smaller than the prior.

98 Analytic approximations I. Aim: approximate the posterior distribution or the posterior process. GP prior, GP approximation to the posterior: a projection onto the closest GP. [Figure: the prior $p_0$, the data, the posterior $p_{post}$, and its KL-projection $\hat{p}_N$ onto the Gaussian family $\mathcal{G}$.]

99 Analytic approximations II. Choice of projection: the Kullback-Leibler divergence, $KL(GP_{post}\,\|\,GP) = \int dGP_{post}(f)\,\log\frac{dGP_{post}(f)}{dGP(f)}$. The minimiser $\widehat{GP} = \arg\min_{GP} KL(GP_{post}\,\|\,GP)$ satisfies $\langle f_x\rangle_{\widehat{GP}} \stackrel{def}{=} \langle f_x\rangle_{post}$ and $K_{\widehat{GP}}(x, x') \stackrel{def}{=} K_{post}(x, x')$. This implies that the KL-approximation is the GP parametrised by $(\alpha_D, C_D)$.

100 Computing KL-distances. Between GPs with the same prior, $GP_1 = N(\mu_1, \Sigma_1)$ and $GP_2 = N(\mu_2, \Sigma_2)$, the KL-distance is $2\,KL(GP_1\|GP_2) = (\mu_2 - \mu_1)^T\Sigma_2^{-1}(\mu_2 - \mu_1) + \mathrm{tr}\left(\Sigma_1\Sigma_2^{-1} - I_F\right) - \ln\left|\Sigma_1\Sigma_2^{-1}\right|$. With the parameters $(\alpha, C)$ and $Q_{BV} = K_{BV}^{-1}$: $2\,KL = (\alpha_2 - \alpha_1)^T(C_2 + Q_{BV})^{-1}(\alpha_2 - \alpha_1) + \mathrm{tr}\left[(C_1 - C_2)(C_2 + Q_{BV})^{-1}\right] - \ln\left|(C_1 + Q_{BV})(C_2 + Q_{BV})^{-1}\right|$. Assumption: the kernel matrix on the BV set is non-singular, $K_{BV} \succ 0$.
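A quick MATLAB check of the first formula, the KL divergence between two finite-dimensional Gaussians; the dimensions and parameters are random, illustrative choices.

% 2*KL( N(mu1,S1) || N(mu2,S2) ) for two multivariate Gaussians
d   = 3;
mu1 = zeros(d, 1);        mu2 = 0.5*ones(d, 1);
A   = randn(d);           S1  = A*A' + d*eye(d);
B   = randn(d);           S2  = B*B' + d*eye(d);
twoKL = (mu2 - mu1)'*(S2\(mu2 - mu1)) + trace(S1/S2 - eye(d)) - log(det(S1/S2));
fprintf('2*KL = %.4f\n', twoKL);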

101 Analytic Approximations III. Bayesian online learning (recursive): instead of the whole data set of size $N$ it uses one example $(x_{t+1}, y_{t+1})$ at a time, with the prior process given by $\langle f_x\rangle_t$, $K_t(x, x')$, and minimises $KL(GP_{post}^{t+1}\,\|\,GP)$. [Figure: processing $(x_1, y_1),\dots,(x_N, y_N)$ sequentially, each step followed by a KL-projection $\hat{p}_t$ onto $\mathcal{G}$; each step involves a smaller approximation.]

102 Bayesian Online Learning. Learning propagates the mean and the kernel: $\langle f_x\rangle_{t+1} = \langle f_x\rangle_t + q\, K_t(x, x_{t+1})$ and $K_{t+1}(x, x') = K_t(x, x') + r\, K_t(x, x_{t+1})\, K_t(x_{t+1}, x')$, where $q$ and $r$ are functions of the single likelihood: $q = q^{(t+1)} = \frac{\partial}{\partial\langle f_{t+1}\rangle_t}\ln\langle P(y_{t+1}\mid f_{t+1})\rangle_t$, the average $\langle\cdot\rangle_t$ being taken w.r.t. $f_{t+1}\sim N_t(\langle f_{t+1}\rangle_t, \sigma_{t+1}^2)$.
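A hedged MATLAB sketch of these online updates for the special case of Gaussian-noise regression, where $q = (y - \mu)/(\sigma^2 + \sigma_o^2)$ and $r = -1/(\sigma^2 + \sigma_o^2)$ have closed forms; the state is the coefficient pair $(\alpha, C)$ over the points seen so far. Kernel, noise level and data are illustrative assumptions; for this likelihood the recursion should reproduce batch GP regression.

% Online GP regression: propagate (alpha, C) one observation at a time
k  = @(x, y) exp(-25*(x - y).^2);
s2 = 0.01;                                     % noise variance sigma_o^2
N  = 20;
x  = rand(N, 1);  y = sin(2*pi*x) + sqrt(s2)*randn(N, 1);
alpha = zeros(0, 1); C = []; Xseen = [];
for t = 1:N
  kx = zeros(t-1, 1);
  for ii = 1:t-1, kx(ii) = k(Xseen(ii), x(t)); end
  mu  = kx' * alpha;                           % current predictive mean at x_t
  s2f = k(x(t), x(t)) + kx' * C * kx;          % current predictive variance at x_t
  q   = (y(t) - mu) / (s2f + s2);
  r   = -1 / (s2f + s2);
  s   = [C * kx; 1];                           % update direction in the extended basis
  alpha = [alpha; 0] + q * s;
  C     = [C, zeros(t-1, 1); zeros(1, t-1), 0] + r * (s * s');
  Xseen = [Xseen; x(t)];
end
% compare with the batch coefficients alpha_batch = (K + s2 I)^{-1} y
K = zeros(N);
for ii = 1:N, for jj = 1:N, K(ii, jj) = k(x(ii), x(jj)); end; end
fprintf('max |alpha_online - alpha_batch| = %.2e\n', max(abs(alpha - (K + s2*eye(N))\y)));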

103 Optimising hyper-parameters I. GP kernel parameters amount to model choice. Example, RBF kernel: $K(x, x') = A\exp\left[-\frac{(x - x')^2}{\beta}\right]$. [Figure: training set, Bayesian error bars and mean function for $\beta = 0.1$ and $\beta = 1$.]

104 Optimising hyper-parameters II. Model evidence: $Z_D(\theta) = P(D\mid\theta) = \int df_D\, P(D\mid f_D)\, p_0(f_D\mid\theta)$. Maximum Likelihood II inference: if $p(\theta\mid\mathcal{M})$ is flat, $\theta^* = \arg\max_{\theta\in\Omega} P(\theta\mid D) = \arg\max_{\theta\in\Omega}\frac{P(D\mid\theta)\, p(\theta\mid\mathcal{M})}{p(D\mid\mathcal{M})} = \arg\max_{\theta\in\Omega} Z_D(\theta)$: evidence maximisation. Gradient / conjugate-gradient methods are used (MacKay's evidence framework).
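A MATLAB sketch of evidence-based model choice for Gaussian-noise regression, where the evidence is available in closed form, $\log Z_D = -\frac{1}{2}y^T(K_\theta + \sigma_o^2 I)^{-1}y - \frac{1}{2}\log|K_\theta + \sigma_o^2 I| - \frac{N}{2}\log 2\pi$; the kernel widths compared on a grid are illustrative assumptions, and in practice one would follow the gradient instead.

% Log evidence of GP regression for a few kernel widths
s2 = 0.01; N = 40;
x  = rand(N, 1);  y = sin(2*pi*x) + sqrt(s2)*randn(N, 1);
for beta = [0.1 1 10 100]                          % kernel width parameter
  K = zeros(N);
  for ii = 1:N, for jj = 1:N, K(ii, jj) = exp(-beta*(x(ii) - x(jj))^2); end; end
  A = K + s2*eye(N);
  logdetA = 2*sum(log(diag(chol(A))));             % stable log-determinant
  logZ = -0.5*y'*(A\y) - 0.5*logdetA - 0.5*N*log(2*pi);
  fprintf('beta = %6.1f   log evidence = %8.2f\n', beta, logZ);
end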

105 Sparse representations, Motivation. Gaussian Processes are fully connected graphical models, so computing estimates is difficult. E.g. for the posterior mean, $\langle f_x\rangle_{post} = y^T(K_N + \sigma_o^2 I_N)^{-1}k_x$, the inversion requires $O(N^3)$ time.

106 Sparse representations, a solution. Condition all training locations on a set of basis locations: $f_{BV}\sim N(\mu_{BV}, \Sigma_{BV})$. The pseudo-latents $f_X$ are conditioned on $f_{BV}$: $f_X\mid f_{BV}\sim N(P\mu_{BV},\, P\Sigma_{BV}P^T)$, where $P$ is the projection matrix $P = P_{X,BV} = K_{X,BV}\, K_{BV}^{-1}$.
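A brief MATLAB sketch of this projection step: conditioning the latent values at arbitrary inputs on a small basis-vector set through $P = K_{X,BV}K_{BV}^{-1}$. The kernel, the BV locations and the mean over the BV set are illustrative assumptions.

% Project latent values at inputs X onto a small basis-vector (BV) set
k   = @(x, y) exp(-25*(x - y).^2);
xbv = (0.1:0.2:0.9)';                          % basis vector locations (illustrative)
xX  = rand(50, 1);                             % other inputs to be projected
nb  = numel(xbv); nx = numel(xX);
Kbv = zeros(nb, nb); Kxb = zeros(nx, nb);
for ii = 1:nb, for jj = 1:nb, Kbv(ii, jj) = k(xbv(ii), xbv(jj)); end; end
for ii = 1:nx, for jj = 1:nb, Kxb(ii, jj) = k(xX(ii),  xbv(jj)); end; end
P = Kxb / (Kbv + 1e-8*eye(nb));                % P = K_{X,BV} K_BV^{-1} (with jitter)
mu_bv = sin(2*pi*xbv);                         % some mean over the BV set
mu_X  = P * mu_bv;                             % conditional mean of f_X given f_BV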

107 Gaussian Regression. Artificial data: $y = \sin(x)/x$ and polynomial kernel $K_0(x, x') = (1 + x^Tx')^k$. Number of training points: 1000, with added Gaussian noise. [Figure: fit with a polynomial kernel of order 5.]

108 Robust one-sided regression. Exponential, one-sided, additive noise: $P(y\mid f_x) = \lambda\exp[-\lambda(y - f_x)]$ if $y > f_x$, and 0 otherwise. [Figure: fit; BV set size 9.]

109 Classification. For each location x we have $y = \pm 1$. The likelihood function for this model is $P(y\mid f(x)) = \mathrm{Erf}\left(\frac{y f_x}{\sigma_0}\right)$, with Erf the incomplete Gaussian (a sigmoid): $\mathrm{Erf}(x) = \int_{-\infty}^x dt\, \exp(-t^2/2)/\sqrt{2\pi}$. The posterior is not Gaussian. For a single datum the mean and variance are computable, so iterative methods can be used.

110 Toy Classification. RBF kernel: $K(x, x') = \exp\left[-b\sum_{i=1}^d \frac{(x_i - x_i')^2}{\beta_i}\right]$; behaviour of the ARD parameters $\beta_i$. [Figure: decision boundaries for BV# 20 and BV# 4, with the corresponding log-evidences; demo: demogp_class_gui.]

111 Classification, USPS data set. Handwritten-digit data set of gray-scale images with 7291 training and 2007 test patterns (RBF kernel with $\sigma_K^2 = 0.5$). [Figure: test error (%) for USPS 4 vs. non-4 as a function of the BV set size, after 1 and 4 iterations.]

112 Crab data set, using RBF kernels. [Figure: test error, logarithm of the input weights (ARD), log-evidence and BV set size.]

113 Multiclass Classification. Problem setup: for each location x we have $y\in\{1,\dots,K\}$, transformed into $y\in\{0,1\}^K$ by the coding $y = [0,\dots,0,1,0,\dots]^T$ with the 1 in the $k$-th position. $K$ independent GPs are used; the independence is a priori only. The likelihood function is $P(y\mid f(x)) = \frac{y^T s}{1^T s}$, where $s = \exp\left([f_1(x),\dots,f_K(x)]^T\right)$ element-wise. The posterior processes are not independent.

114 Multiclass Classification. [Figure: two-dimensional demo, class-conditional distributions.]

115 Multiclass Classification. [Figure: results of the two-dimensional demo.]

116 Inverse Problems. Likelihood: local observations of global wind fields $(u_i, v_i)$. A probabilistic framework is preferred due to the lack of direct observations. The uncertainty is captured in mixture density networks: $P(u_i, v_i\mid obs) = \sum_{l=1}^4 \beta_{il}\, N(u_i, v_i\mid\mu_{il}, A_{il})$, with $\beta_{il}, \mu_{il}, A_{il}$ local parameters.

117 Inverse Problems. Few basis vectors are retained; the approximation preserves information about the local uncertainty; the inference process is fast.

118 Bibliography I. D.J.C. MacKay: The evidence framework...; C. Rasmussen: Gaussian Processes for Regression; C.K.I. Williams: Prediction with Gaussian Processes...

119 Bibliography II. R. Neal: Priors for Infinite Networks; M. Seeger: Bayesian Gaussian Process Models...; L. Csató: Gaussian Processes... Sparse Approximations.

120 Bibliography III

121 NIPS 05 I. NIPS 05 GP-related articles: J. Murillo-Fuentes, F. Perez-Cruz: Gaussian Processes for Multiuser Detection in CDMA Receivers; Y. Shen, A. Ng, M. Seeger: Fast Gaussian Process Regression Using KD-Trees; A. Shon, K. Grochow, A. Hertzmann, R. Rao: Gaussian Process CCA for Image Synthesis and Robotic Imitation; R. Der, D. Lee: Beyond Gaussian Processes: On the Distributions of Infinite Networks; D. Fleet, J. Wang, A. Hertzmann: Gaussian Process Dynamical Models

122 NIPS 05 II. NIPS 05 GP-related articles: Y. Engel, P. Szabo, D. Volkinshtein: Learning With Gaussian Process Temporal Difference Methods; M. Kuss, C. Rasmussen: Assessing Approximations for Gaussian Process Classification; E. Snelson, Z. Ghahramani: Sparse Parametric Gaussian Processes; S. Kakade, M. Seeger, D. Foster: Worst-Case Bounds for Gaussian Process Models; S. Keerthi, W. Chu: A Matching Pursuit Approach to Sparse Gaussian Process Regression; E. Meeds, S. Osindero: An Alternative Infinite Mixture Of Gaussian Process Experts


More information

The Variational Gaussian Approximation Revisited

The Variational Gaussian Approximation Revisited The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Least Squares Regression

Least Squares Regression CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the

More information

Gaussian Processes in Machine Learning

Gaussian Processes in Machine Learning Gaussian Processes in Machine Learning November 17, 2011 CharmGil Hong Agenda Motivation GP : How does it make sense? Prior : Defining a GP More about Mean and Covariance Functions Posterior : Conditioning

More information

Probabilistic Reasoning in Deep Learning

Probabilistic Reasoning in Deep Learning Probabilistic Reasoning in Deep Learning Dr Konstantina Palla, PhD palla@stats.ox.ac.uk September 2017 Deep Learning Indaba, Johannesburgh Konstantina Palla 1 / 39 OVERVIEW OF THE TALK Basics of Bayesian

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

Basis Expansion and Nonlinear SVM. Kai Yu

Basis Expansion and Nonlinear SVM. Kai Yu Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion

More information

Outline Lecture 2 2(32)

Outline Lecture 2 2(32) Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II 1 Non-linear regression techniques Part - II Regression Algorithms in this Course Support Vector Machine Relevance Vector Machine Support vector regression Boosting random projections Relevance vector

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/

More information

GWAS IV: Bayesian linear (variance component) models

GWAS IV: Bayesian linear (variance component) models GWAS IV: Bayesian linear (variance component) models Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS IV: Bayesian

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

More information

20: Gaussian Processes

20: Gaussian Processes 10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

Part 1: Expectation Propagation

Part 1: Expectation Propagation Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Statistical Approaches to Learning and Discovery

Statistical Approaches to Learning and Discovery Statistical Approaches to Learning and Discovery Bayesian Model Selection Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon

More information

Lecture 1b: Linear Models for Regression

Lecture 1b: Linear Models for Regression Lecture 1b: Linear Models for Regression Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced

More information

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML) Machine Learning Fall 2017 Kernels (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang (Chap. 12 of CIML) Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR:

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray School of Informatics, University of Edinburgh The problem Learn scalar function of vector values f(x).5.5 f(x) y i.5.2.4.6.8 x f 5 5.5 x x 2.5 We have (possibly

More information

Gaussian Processes for Machine Learning

Gaussian Processes for Machine Learning Gaussian Processes for Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics Tübingen, Germany carl@tuebingen.mpg.de Carlos III, Madrid, May 2006 The actual science of

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee 9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic

More information

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

CS-E3210 Machine Learning: Basic Principles

CS-E3210 Machine Learning: Basic Principles CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction

More information

Variational Methods in Bayesian Deconvolution

Variational Methods in Bayesian Deconvolution PHYSTAT, SLAC, Stanford, California, September 8-, Variational Methods in Bayesian Deconvolution K. Zarb Adami Cavendish Laboratory, University of Cambridge, UK This paper gives an introduction to the

More information