Functional modelling with Gaussian Processes


1 Functional modelling with Gaussian Processes. Lehel Csató, Faculty of Mathematics and Informatics, Babeş Bolyai University, Cluj-Napoca (Matematika és Informatika Tanszék, Babeş Bolyai Tudományegyetem, Kolozsvár; Department of Mathematics and Informatics, Babeş Bolyai University, Kolozsvár). September 2008

2 Table of Contents: 1 Machine learning and latent variable models; 2 Non-parametric models; 3 Support Vector Machines; 4 Gaussian Processes; 5 Models using Gaussian Processes; 6 References

3 Outline of Part 1: Machine Learning; Latent variable models; Estimation methods; Drawbacks of parametric models. (Remaining parts: 2 Non-parametric models; 3 Support Vector Machines; 4 Gaussian Processes; 5 Models using Gaussian Processes; 6 References.)

4 Motivation. D. Donoho: Data, Data, Data! Challenges and opportunities of the coming Data Deluge. Several data types: classification problems needed in decision systems, where the data are frequently of very high dimension and several hundred classes have to be taken into account; regression/prediction problems.

5 Machine learning, historical background / motivation: huge amounts of data that should be processed automatically; mathematics provides general solutions, i.e. solutions not tailored to a given problem; hence the need for a science that uses mathematical machinery to solve practical problems.

6 Definition of Machine Learning: a collection of methods (from statistics, probability theory) to solve problems met in practice: noise filtering for non-linear regression and/or non-Gaussian noise; classification: binary, multiclass, partially labelled; clustering, inversion problems, density estimation, novelty detection. Generally, we need to model the data.

7 Data. Real world: there is a function $y = f(x)$ and we observe the pairs $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$. Observation process: a corrupted datum is collected for a sample $x_n$, either $t_n = y_n + \epsilon$ (additive noise) or $t_n = h(y_n, \epsilon)$ with $h$ a distortion function. Problem: find the function $y = f(x)$.

8 Latent variable models. [Figure: inference of $f^*(x)$ from the data $(x_1,y_1),\dots,(x_N,y_N)$ via a function class $\mathcal{F}$ and an observation process.] A data set is collected. Assume a function class: polynomial, Fourier expansion, wavelet. The observation process encodes the noise. Find the optimal function from the class.

9 Latent variable models II. We have the data set $D = \{(x_1, y_1),\dots,(x_N, y_N)\}$. Consider a function class: (1) $\mathcal{F} = \{ w^T x + b \mid w \in \mathbb{R}^d, b \in \mathbb{R} \}$; (2) $\mathcal{F} = \{ a_0 + \sum_{k=1}^K a_k \sin(2\pi k x) + \sum_{k=1}^K b_k \cos(2\pi k x) \mid a, b \in \mathbb{R}^K, a_0 \in \mathbb{R} \}$. Assume an observation process: $y_n = f(x_n) + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$.

10 Latent variable models III. 1. The data set: $D = \{(x_1, y_1),\dots,(x_N, y_N)\}$. 2. Assume a function class, $\mathcal{F}$ polynomial, etc.: $\mathcal{F} = \{ f(x, \theta) \mid \theta \in \mathbb{R}^p \}$. 3. Assume an observation process and define a loss function $L(y_n, f(x_n, \theta))$; for Gaussian noise, $L(y_n, f(x_n, \theta)) = (y_n - f(x_n, \theta))^2$.

11 Parameter estimation. Estimating parameters means finding the optimal value of $\theta$: $\theta^* = \arg\min_{\theta \in \Omega} L(D, \theta)$, where $\Omega$ is the domain of the parameters and $L(D, \theta)$ is a loss function for the data set, for example $L(D, \theta) = \sum_{n=1}^N L(y_n, f(x_n, \theta))$.

12 Parameter estimation, M.L. $L(D, \theta)$ is the (negative log-)likelihood function. Maximum likelihood estimation of the model: $\theta^* = \arg\min_{\theta \in \Omega} L(D, \theta)$. Example, quadratic regression: $L(D, \theta) = \sum_{n=1}^N (y_n - f(x_n, \theta))^2$ (the likelihood factorises over the data). Drawback: it tends to produce a perfect fit to the data, i.e. over-fitting.
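A minimal MATLAB sketch of the maximum-likelihood estimate under the quadratic loss, fitting polynomials of increasing order by least squares; the data, target function and orders are illustrative assumptions, not taken from the slides. The training loss keeps shrinking as the order grows, which is the over-fitting drawback mentioned above.

% Maximum likelihood fit under quadratic loss: polynomials of increasing order
N = 15;
x = linspace(-1, 1, N)';
y = sin(3*x) + 0.1*randn(N, 1);         % noisy observations of a smooth function
for p = [1 3 9]
  V = ones(N, p+1);
  for k = 1:p, V(:, k+1) = x.^k; end    % design matrix: f(x, theta) = V*theta
  theta = V \ y;                        % minimiser of the summed quadratic loss
  fprintf('order %d: training loss %.4f\n', p, sum((y - V*theta).^2));
end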

13 Maximum Likelihood Over-fitting. [Figure: training set and test set with polynomial fits of order 3 and 4.]

14 Maximum a posteriori estimation. M.A.P. assigns probabilities to the data D: the (log-)likelihood function, the probability of the data drawn using $\theta$, $P(y_n \mid x_n, \theta, \mathcal{F}) \propto \exp[-L(y_n, f(x_n, \theta))]$ (a normalisation constant is missing); and to the parameters $\theta$: the probability that $\theta$ could have had a given value, $p_0(\theta) \propto \exp[-\|\theta\|^2/2]$, the prior probability.

16 M.A.P. estimation II. Combining the prior with the observation likelihood using Bayes' rule, the a posteriori probability of the parameters is $p(\theta \mid D, \mathcal{F}) = \frac{P(D \mid \theta)\, p_0(\theta)}{p(D \mid \mathcal{F})}$, where $p(D \mid \mathcal{F}) = \int_\Omega d\theta\, P(D \mid \theta)\, p_0(\theta)$ is the probability of the data set under a given family of models.

17 M.A.P. estimation III. M.A.P. estimation aims at finding the $\theta$ with the largest posterior probability: $\theta_{MAP} = \arg\max_{\theta \in \Omega} p(\theta \mid D, \mathcal{F})$. Example, using $L(y_n, f(x_n, \theta))$ to define the likelihood and a Gaussian prior: $\theta_{MAP} = \arg\max_{\theta \in \Omega}\left[ K - \sum_n L(y_n, f(x_n, \theta)) - \frac{\|\theta\|^2}{2\sigma_o^2} \right]$ (after a change of sign the max becomes a min). For $\sigma_o^2 \to \infty$ we recover maximum likelihood.
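A hedged MATLAB sketch of the corresponding M.A.P. estimate: quadratic loss plus the Gaussian prior penalty, i.e. ridge-regularised least squares; the prior variance and data are illustrative assumptions. The penalty shrinks the coefficients relative to the maximum-likelihood fit of the previous sketch.

% MAP estimate: quadratic loss + Gaussian prior on theta (ridge regularisation)
N = 15; p = 9;
x = linspace(-1, 1, N)';
y = sin(3*x) + 0.1*randn(N, 1);
V = ones(N, p+1);
for k = 1:p, V(:, k+1) = x.^k; end
sigma2_o  = 0.1;                                       % prior variance (illustrative)
theta_ml  = V \ y;                                     % maximum likelihood
theta_map = (V'*V + eye(p+1)/(2*sigma2_o)) \ (V'*y);   % minimiser of sum L + ||theta||^2/(2*sigma2_o)
fprintf('||theta_ML|| = %.2f   ||theta_MAP|| = %.2f\n', norm(theta_ml), norm(theta_map));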

18 M.A.P. Example. [Figure: true function, training data, and order-6 polynomial M.A.P. fits for different prior/noise deviations.]

19 Parameter estimation, Bayes. We use Bayes' rule, $p(\theta \mid D, \mathcal{F}) = \frac{P(D\mid\theta)\, p_0(\theta)}{p(D\mid\mathcal{F})}$ with $p(D\mid\mathcal{F}) = \int_\Omega d\theta\, P(D\mid\theta)\, p_0(\theta)$, and try to store the whole distribution of the possible values. We therefore operate with $p_{post}(\theta\mid D,\mathcal{F})$.

20 Bayes estimation II. When computing $p_{post}(\theta \mid D, \mathcal{F})$ we assumed that the posterior can be represented analytically. In general this is not the case: approximations are needed for the posterior distribution and for the predictive distribution. In Bayesian modelling an important issue is how we approximate the posterior distribution.

21 Bayes Example. [Figure: Bayesian fit with an order-6 polynomial, noise variance $\sigma^2 = 1$.]

22 Graphical Models. M.L. estimation: no prior. M.A.P. estimation: no distribution, only a point estimate. Bayes estimation: the full graphical model with hyper-parameters $\mu, \sigma$, parameter $\theta$ and observations $(x, y)$. In Bayes models, if approximations are used at the hyper-parameter level we have Level II Maximum Likelihood.

23 Drawbacks of Parametric Models. Model complexity is an important parameter: choosing $\mathcal{F}$ is an important decision in modelling the data. Example, for polynomial functions: linear is too simple; quadratic is good for medium-sized data; ... The model complexity should be changed if more data become available.

24 Outline of Part 2, Non-parametric models: Density estimation; Regression and classification. (Remaining parts: 3 Support Vector Machines; 4 Gaussian Processes; 5 Models using Gaussian Processes; 6 References.)

25 Non-parametric Models. Non-parametric models use the available data for prediction. 1. Density estimation: histogram method; Parzen window. 2. Regression/Classification: K-nearest neighbour rule; support vector machines; Gaussian processes.

27 Density Estimation. Histogram method: the parameters are the bin width and the bin locations. Sensitive to the choice of parameters; not usable in higher dimensions.

28 Density Estimation II. Parzen window: $h(x) = \frac{1}{N}\sum_{n=1}^N k(x_n, x)$. Parameters: the kernel function width and the data locations. Not usable for large data sets. Nonparametric: the summation scales with N.
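A small MATLAB sketch of the Parzen window estimate $h(x) = \frac{1}{N}\sum_n k(x_n, x)$ with a Gaussian kernel; the mixture data and the width h are illustrative assumptions.

% Parzen window density estimate with a Gaussian kernel of width h
N  = 500;
xn = [randn(N/2, 1) - 2; randn(N/2, 1) + 2];   % data from a two-component mixture
h  = 0.3;                                      % kernel width (illustrative choice)
xg = linspace(-6, 6, 200)';                    % evaluation grid
hx = zeros(size(xg));
for n = 1:N
  hx = hx + exp(-(xg - xn(n)).^2 / (2*h^2)) / sqrt(2*pi*h^2);
end
hx = hx / N;                                   % h(x) = (1/N) sum_n k(x_n, x)
plot(xg, hx);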

29 Non-parametric Classification. K-nearest neighbour (KNN): the parameter is the number of neighbours. Slow for large data sets. Nonparametric: all data are taken into account.

30 KNN for high dimensions I. We generally normalise and centre the data: $x_i \leftarrow x_i - \bar{x}$ with $\bar{x}$ the mean, and $x_{ij} \leftarrow x_{ij}/\sigma_j$ with $\sigma_j^2$ the variance of the $j$-th component, so each $x_{ij}$ has zero mean and unit variance. The squared length of the random vector $x_i$ is $l_i^2 = \sum_j x_{ij}^2$; cf. the central limit theorem, the larger the dimension, the more concentrated the length $l_i$ is around its mean.

31 KNN for high dimensions II. Example (MATLAB): dim=20; x=randn(100000,dim); s2=sum(x.^2,2)/dim; g=histc(s2,bin); [Figure: histograms of the normalised squared lengths for dim = 1, 2, 8, 20; the distribution concentrates around 1 as the dimension grows.]

32 KNN for high dimensions III. The angle (inner product): $\langle x_i, x_j\rangle = \sum_{l=1}^d x_{il} x_{jl}$, with average value 0. [Figure: histograms of $\langle x, y\rangle$ for d = 10, 20, ...] Central limit theorem: the larger the dimension, the more orthogonal random vectors are, and the more difficult it is to select a representative.

34 Outline of Part 3, Support Vector Machines: Loss functions; Function class; Kernels. (Remaining parts: 2 Non-parametric models; 4 Gaussian Processes; 5 Models using Gaussian Processes; 6 References.)

35 Support Vector Machines, Motivation I. KNN is too demanding for large data sets and works only in low dimensions. We want an algorithm that works well in high dimensions: $f^*(x) = \arg\min_{f\in\mathcal{F}}\left\{ \sum_n L(y_n, f(x_n)) + \|Pf(\cdot)\|^2 \right\}$. Learning algorithm: design a method that simultaneously minimises the empirical error (first term) and selects cleverly (second term) from a large family $\mathcal{F}$ of available functions. Details of the elements follow.

36 S.V.M.s, Motivation II. The same optimisation, $f^*(x) = \arg\min_{f\in\mathcal{F}}\left\{ \sum_n L(y_n, f(x_n)) + \|Pf(\cdot)\|^2 \right\}$: within this framework we look for candidates in a large family of functions and we penalise the complexity of the functions. Next: the loss function and the family $\mathcal{F}$.

37 S.V.M. Loss functions. Hinge loss: $L(z) = 0$ if $z \geq 1$ and $L(z) = 1 - z$ if $z < 1$. For classification with labels $y_n = \pm 1$ the loss function for the data set D is $L(D) \stackrel{def}{=} \sum_n L(y_n f(x_n))$. Minimality: $L(D)$ is minimal if all elements are separated with margin 1.

38 S.V.M. Loss Functions II. Hinge loss, in classification, penalises data away from the boundary; assuming class labels $y_n = \pm 1$, $L(y_n, f(x_n)) = H_+(1 - y_n f(x_n))$ with $H_+$ the positive part. Logit loss, in binary classification, returns the log-probabilities: $L_{logit}(y_n, f(x_n)) = \log(1 + \exp(-y_n f(x_n)))$; when $y_n f(x_n) \to \infty$, $L_{logit} \to 0$; in the probit model $(1 + \exp(-\cdot))^{-1}$ is replaced with $\Phi(\cdot)$. Quadratic loss, the quadratic error $L_{mse}(y_n, f(x_n)) = (y_n - f(x_n))^2 = (1 - y_n f(x_n))^2$ for $\pm 1$ labels, can also be applied to regression problems.

39 Support Vector Machines, Function Class O. Which family of functions? We have only the data D. Possibilities: linear might be too simple; a complex class might be too complex. In general we want a flexible function class, but we do not want a too complex solution. Solution: use a large function class $\mathcal{F}$ and define penalties on the complexity of the solution.

44 Support Vector Machines, Function Class I. Loss functions help us find the best function from a family of functions. The family of functions is important: it should be flexible enough for large data sets and allow for different length scales. The solution is of the form ($P$ a penalty operator): $f^*(x) = \arg\min_{f\in\mathcal{F}}\left\{ \sum_n L(y_n, f(x_n)) + \|Pf(\cdot)\|^2 \right\}$; the null space $\|Pf(\cdot)\|^2 = 0$ gives the unpenalised possible solutions.

45 Support Vector Machines, Function Class II. Example: $\mathcal{F}$ is the class of twice differentiable functions with continuous second derivatives; $\|Pf\|^2 \stackrel{def}{=} \int (\partial_x^2 f(\cdot))^2$, the sum (integral) of the squared second derivatives. Result: interpolating splines with knots at the data points. [Figure: spline interpolation of the data and its second derivative.] Message: the optimisation is plausible, with reliable results. Presented in Wahba 1990, Spline Models for Observational Data.

46 Support Vector Machines, Function Class III. Result of the optimisation problem: $K_s(x, x') = \frac{1}{3}\min(x,x')^3 + \frac{1}{2}|x - x'|\min(x,x')^2 + xx' + 1$. Solution of the argmin: $\hat{f}(x) = \sum_n \alpha_n K_s(x, x_n)$, a linear combination of (piecewise) polynomials. [Figure: the kernel $k(x, x')$.]

47 Kernels in RKHS. Reproducing kernel Hilbert space: assume an index set $X$ and $\mathcal{H} = \{f \mid f: X \to \mathbb{R}\}$ with the dot product $\langle\cdot,\cdot\rangle$ such that $\|f\|^2 = \langle f, f\rangle$. $\mathcal{H}$ is an RKHS if there exists $k: X\times X \to \mathbb{R}$ such that (1) $k$ has the reproducing property: $\forall f\in\mathcal{H}$, $\langle f, k(x,\cdot)\rangle = f(x)$; and (2) $k$ spans $\mathcal{H}$. Not a feature map: an RKHS is not equivalent to a feature map; there is more than a single feature map for the same $k(\cdot,\cdot)$.

48 RKHS II. $k(\cdot,\cdot)$ spans $\mathcal{H}$: every function $f\in\mathcal{H}$ is a linear combination of $\{k(x_n,\cdot)\}_{n=1}^{N_H}$: $f(x) = \sum_{n=1}^{N_H} w_n k(x, x_n)$ (with $N_H = \infty$ allowed). Note: we choose the support set, i.e. the location of the points and the set size, and we choose the weights. Empirical kernel map: the support set is the training data set, $\mathcal{H}_e = \left\{\sum_n \alpha_n k(x, x_n) \mid \alpha_n \in \mathbb{R}\right\}$.

50 RKHS, Mercer. Mercer kernel: for a positive-definite $k(\cdot,\cdot)$ there exists a set $\{\phi_j\}$ of orthogonal functions and positive constants $\{\lambda_j\}$ such that $k(x, x') = \sum_j \lambda_j \phi_j(x)\phi_j(x')$. Note: the function $k(\cdot,\cdot)$ defines the eigenfunctions $\{\phi_j\}$ and the corresponding eigenvalues $\{\lambda_j\}$, independently of the data set we are using; convergence guarantee: $\sum_j \lambda_j^2 < \infty$.

52 Kernels for S.V.M.s, Example. Kernel functions: let $\Phi: \mathbb{R}\to\mathbb{R}^3$ be given by $\Phi(x) = [1, \sqrt{2}x, x^2]^T$ and compute $\Phi(x)^T\Phi(y) = 1 + 2xy + x^2y^2 = (1 + xy)^2 \stackrel{def}{=} K(x, y)$. Kernel trick: we can translate linear algorithms into nonlinear ones using a kernel function represented as a scalar product.
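A two-line MATLAB check of this identity (the test values are arbitrary): the scalar product of the explicit features equals the kernel $(1 + xy)^2$.

Phi = @(x) [1; sqrt(2)*x; x^2];         % explicit feature map
x = 0.7; y = -1.3;
disp([Phi(x)'*Phi(y), (1 + x*y)^2]);    % the two numbers coincide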

53 The XOR problem. [Figure: the XOR configuration in the $(x_1, x_2)$ plane, which is not linearly separable in the input space.]

54 Kernels for S.V.M.s II. Kernels are two-argument functions that generalise a matrix to non-countable index sets: $K: X\times X\to\mathbb{R}$. Valid kernel functions are positive definite: for any $a = [a_1, a_2, \dots, a_L]$ and $D = [x_1, \dots, x_L]^T$: $\sum_{k=1}^L\sum_{l=1}^L a_k K(x_k, x_l)\, a_l \geq 0$. Proof idea: the kernel function is $K(x, x') = \Phi(x)^T\Phi(x')$ and $\sum_{k,l=1}^L a_k\Phi(x_k)^T\Phi(x_l)a_l = \left(\sum_i a_i\Phi(x_i)\right)^T\left(\sum_i a_i\Phi(x_i)\right) = s^Ts \geq 0$, where $s = \sum_{i=1}^L a_i\Phi(x_i)$. Question: what is the dimensionality of $\Phi(x)$?

55 S.V.M.s Summary. Developed for classification. Idea, the kernel trick: use linear algorithms; project to the feature space $\Phi$; find the solution in the feature space; back-project; obtain a non-linear solution. The solution to the problem is of the form $f(x) = \Phi_x^T\sum_{i\in SV}\alpha_i\Phi_{x_i} = \sum_{i\in SV}\alpha_i K(x, x_i)$. Nonparametric: the number of parameters is not fixed.

56 Kernel Algorithms. The projection trick is exploited, based on the success of the S.V.M.s. General recipe: find/construct a linear algorithm; re-express it in $\Phi$, the space of features; write the non-linear solution in the space of inputs using $K(x, x')$. Algorithms: kernel regression, kernel ridge regression, kernel Principal/Independent Components, kernel Fisher Discriminants.

57 Advantages / Price. Definition: in non-parametric methods the number of parameters (the complexity) may change with the data. Advantages: greater complexity; built-in regularisation; performant algorithms. Disadvantages: suitable for small data sets only; no selection of parameters.

60 Outline of Part 4, Gaussian Processes: Kernels within GP; Inference and Prediction; Gaussian Regression; Posterior Approximations; Optimising hyper-parameters; Sparse Representation. (Remaining parts: 2 Non-parametric models; 3 Support Vector Machines; 5 Models using Gaussian Processes.)

61 Bayesian Nonparametric Methods. GP models are functional latent variable models with simple random functions. The latent function appears in the likelihood: $P(D\mid f_D) = \prod_i P(y_i\mid x_i, f) = \prod_i P(y_i\mid x_i, f_{x_i})$; local dependencies only: each $y_i$ depends on $f$ only through $f_{x_i}$.

62 Gaussian processes I. Gaussian process: a generalisation of a Gaussian; $f$ is a Gaussian random function, $f = [f_{x_1}, f_{x_2}, \dots, f_{x_N}, \dots]^T$ with $x_n$ in the domain. The GP prior $p_0(f)$ is characterised by the mean function $\langle f_x\rangle_0$ and the covariance kernel $K_0(x, x')$. Property: for any sample set D, the values form a joint Gaussian random variable, $f_D = [f_{x_1}, \dots, f_{x_N}] \sim N(f_D \mid \langle f_D\rangle_0, K_0)$.

63 Gaussian processes II. A Gaussian process is a random function; its parameters are the mean function and the covariance kernel. [Figure: GP realisations before training, the GP mean function and samples from the GP.]

64 GP parameters. The mean function usually is 0. Parameters: the class of the kernel function and the parameters hidden inside the kernel function. Example: $K(x, x'\mid\theta) = \theta_0\exp\left[-\sum_{i=1}^d \theta_i (x_i - x_i')^2\right]$, with $\theta = [\theta_0, \theta_1, \dots, \theta_d]^T$ the parameter vector.

65 Kernel Functions. Kernel functions generate the covariance matrix and need to be positive definite functions/matrices: for $X = \{x_1, \dots, x_T\}$, the matrix $K_X$ with entries $[K_X]_{ij} = K(x_i, x_j)$, $i, j = 1,\dots,T$, must be positive definite. Next: a construction of kernels as covariances.
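A MATLAB sketch, in the spirit of the listing on slide 77, that builds the covariance matrix of the squared-exponential kernel from slide 64 (parameter values are illustrative assumptions), checks positive definiteness via the smallest eigenvalue, and draws prior samples with a Cholesky factor.

% Covariance matrix of a squared-exponential kernel, PD check, prior samples
theta0 = 1; theta1 = 25;                            % illustrative kernel parameters
k = @(x, y) theta0 * exp(-theta1 * (x - y).^2);
T = 200;
x = linspace(0, 1, T)';
K = zeros(T, T);
for ii = 1:T
  for jj = 1:T
    K(ii, jj) = k(x(ii), x(jj));
  end
end
fprintf('smallest eigenvalue: %g\n', min(eig(K)));  % non-negative up to round-off
L = chol(K + 1e-8*eye(T), 'lower');                 % small jitter for numerical stability
f = L * randn(T, 5);                                % five samples from the GP prior
plot(x, f);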

66 Kernel constructions I. Let $\{x_k = k\Delta_N\}_{k=1}^N$ be the locations of independent Gaussian r.v.s $v_k \sim N(0, \sigma_0^2\Delta_N)$, where $\Delta_N = 1/N$. Let $t$ denote the index where $t = k_t\Delta_N$. If $b_t \stackrel{def}{=} \sum_{i=1}^{k_t} v_i$, then $\langle b_t\rangle = 0$ and the covariance is $\langle b_s b_t\rangle = \sum_{i_s=1}^{k_s}\sum_{i_t=1}^{k_t}\langle v_{i_s} v_{i_t}\rangle = \sum_{i=1}^{\min(k_s,k_t)}\langle v_i^2\rangle = \sum_{i=1}^{\min(k_s,k_t)}\sigma_0^2\Delta_N = \min(s, t)\,\sigma_0^2$; in the following we take $\sigma_0^2 = 1$. Brownian motion: a stochastic process with covariance kernel $K(s,t) = \min(s,t)$ is a Brownian motion; its derivative $v_t$ is white noise.
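A short MATLAB sketch of this construction: cumulative sums of independent $N(0, \Delta_N)$ increments, with the empirical covariance compared against $\min(s, t)$. Grid size and number of paths are illustrative assumptions.

% Brownian motion as a cumulative sum of independent Gaussian increments
N = 200; R = 5000;                     % grid size and number of sample paths
dN = 1/N;
t = (1:N)' * dN;
v = sqrt(dN) * randn(N, R);            % v_k ~ N(0, dN), with sigma_0^2 = 1
b = cumsum(v, 1);                      % b_t = sum of the increments up to t
is = 50; it = 150;                     % compare cov(b_s, b_t) with min(s, t)
emp = mean(b(is, :) .* b(it, :));
fprintf('empirical covariance %.3f  vs  min(s,t) = %.3f\n', emp, min(t(is), t(it)));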

67 Kernel constructions II. [Figure: sample paths of the Brownian motion at different resolutions (different N's).]

68 Kernel constructions III. Let us integrate (sum) the Brownian motion. Define $s_t \stackrel{def}{=} \Delta_N\sum_{i_t=1}^{k_t} b_{i_t}$, leading to $\langle s_t\rangle = 0$ and $K(s,t) = \langle s_s s_t\rangle = \Delta_N^2\sum_{i_s=1}^{k_s}\sum_{i_t=1}^{k_t}\langle b_{i_s} b_{i_t}\rangle \to \int_0^s dz_s\int_0^t dz_t\, \min(z_s, z_t)$. For the integration we assume $s < t$, hence $z_s < t$ and $[0, t] = [0, z_s]\cup[z_s, t]$.

69 Kernel constructions IV. Using $[0,t] = [0,z_s]\cup[z_s,t]$: $K(s,t) = \int_0^s dz_s\left(\int_0^{z_s} dz_t\, z_t + \int_{z_s}^t dz_t\, z_s\right) = \int_0^s dz_s\left(\frac{z_s^2}{2} + z_s(t - z_s)\right) = \int_0^s dz_s\left(z_s t - \frac{z_s^2}{2}\right) = \frac{s^2 t}{2} - \frac{s^3}{6}$, assuming $s < t$. After symmetrisation (writing the $s > t$ case and unifying): $K(s,t) = \frac{s\,t\,\min(s,t)}{2} - \frac{\min(s,t)^3}{6} = \frac{1}{2}\min(s,t)^2\,|s-t| + \frac{1}{3}\min(s,t)^3$.

70 Kernel constructions V. [Figure: samples from the integrated Brownian motion (the underlying Brownian motion on the bottom).]

71 Kernel constructions VI. The integrated Brownian motion has $s(0) = s'(0) = 0$. For generality we add a constant and a linear term: $s_2(x) = w_0 + w_1 x + s(x)$, where $w_0, w_1$ are i.i.d. Gaussian r.v.s. This means that the kernel is $K_2(s,t) = \langle s_2(s)\, s_2(t)\rangle = 1 + st + K_s(s,t) = 1 + st + \frac{1}{2}\min(s,t)^2\,|s-t| + \frac{1}{3}\min(s,t)^3$.

72 Kernel constructions VII. [Figure: random splines, i.e. samples from the GP with the kernel $K_2$.]

73 Kernels from conditioning VIII. Consider the Brownian motion, $K_b(s,t) = \min(s,t)$. We generate random functions with $x_1 = 0$; the result is called the Brownian bridge. To calculate the covariance we have to condition the r.v.s $x_s$ and $x_t$ on $x_1 = 0$. We identify the kernel from the conditioned joint Gaussian distribution: $p(x_s, x_t \mid x_1 = 0) \propto \exp\left[-\frac{1}{2}[x_s, x_t, 0]\begin{bmatrix} K_b(s,s) & K_b(s,t) & s\\ K_b(t,s) & K_b(t,t) & t\\ s & t & 1\end{bmatrix}^{-1}[x_s, x_t, 0]^T\right] \propto \exp\left[-\frac{1}{2}[x_s, x_t]\left(\begin{bmatrix} K_b(s,s) & K_b(s,t)\\ K_b(t,s) & K_b(t,t)\end{bmatrix} - \begin{bmatrix} s\\ t\end{bmatrix}\begin{bmatrix} s & t\end{bmatrix}\right)^{-1}[x_s, x_t]^T\right]$, where we used the matrix inversion lemma; hence $K_0(s,t) = K_b(s,t) - st$.

75 Kernels from conditioning IX. [Figure: samples from a Brownian bridge.]

76 Kernels from conditioning X. Exercise: find the mean and kernel functions corresponding to the Brownian bridge where $x_1 = 1$. Consider the spline kernel $K_s(s,t) = \frac{1}{2}\min(s,t)^2\,|s-t| + \frac{1}{3}\min(s,t)^3$; similarly to the Brownian bridge, find the kernel function for the splines conditioned on $x_1 = 0$. [Figure: samples from this second family.]

77 Generating random samples XI (MATLAB):

% Sample T random functions from the Brownian-bridge kernel on (0, D)
clear all;
N = 100; T = 25; D = 0.99;
t = linspace(0, D, N+1); t = t(2:end);
% put the covariance function here
k_bb = inline('min(s,t) - s*t', 's', 't');
Ks = zeros(N, N);
for ii = 1:N
  for jj = 1:N
    Ks(ii, jj) = k_bb(t(ii), t(jj));
  end
end
kKs = chol(Ks);               % Cholesky factor of the covariance matrix
yr  = randn(T, N);            % independent standard normal samples
ys  = zeros(N+2, T);
ys(2:N+1, :) = (yr * kKs)';   % correlated samples; the endpoints stay at 0
t0 = [0, t, 1];
figure(1); cla; box on; hold on;
plot(t0, ys);

78 Gaussian Process Inference I. GP inference is an application of Bayes' rule: $p_{post}(f, f_D) \propto P(D\mid f_D)\, p_0(f_D, f)$. For any collection of indexes $X$ the posterior is $p_{post}(f_X) = \frac{1}{Z_D}\int df_D\, P(D\mid f_D)\, p_0(f_D, f_X)$, where $Z_D = \int df_D\, P(D\mid f_D)\, p_0(f_D)$ is the data probability conditioned on the model. Observe: no specific form of $P(D\mid f_D)$ is assumed.

79 Gaussian Process Inference II. Problems with the representation $p_{post}(f_X) \propto \int df_D\, P(D\mid f_D)\, p_0(f_D, f_X)$: the integral evaluation is necessary for the posterior distribution. Representation: how do we represent the posterior? We need a finite representation of the posterior process; non-Gaussian posterior processes require approximations.

80 Gaussian Process Parametrisation I. Property of Gaussian averages: $\langle f_x\rangle_{post} = \langle f_x\rangle_0 + \sum_i K_0(x, x_i)\,\alpha(i)$, where the coefficients are $\alpha(i) = \frac{\partial}{\partial\langle f_i\rangle_0}\ln\langle P(D\mid f_D)\rangle_0$. These provide a parametrisation (see Kimeldorf-Wahba).

81 Gaussian Process Parametrisation II. For the posterior kernel: $K_{post}(x, x') = K_0(x, x') + \sum_{i,j} K_0(x, x_i)\, C(ij)\, K_0(x_j, x')$, where the coefficients are $C(ij) = \frac{\partial^2}{\partial\langle f_i\rangle_0\,\partial\langle f_j\rangle_0}\ln\langle P(D\mid f_D)\rangle_0$.

82 GPs in Feature space I. $K_0(x, x')$ defines a feature space $F$: $\phi_x, \phi_{x'}\in F$ and $K_0(x, x') = \phi_x^T\phi_{x'}$. Using $F$ and the scalar product: $\langle f_x\rangle_{post} = \phi_x^T\sum_{i=1}^N\alpha(i)\phi_i = \phi_x^T\mu_{post}$ and $K_{post}(x, x') = \phi_x^T\left(I_F + \sum_{i,j=1}^N\phi_i\, C(ij)\,\phi_j^T\right)\phi_{x'} = \phi_x^T\Sigma_{post}\,\phi_{x'}$.

83 GPs in Feature space II. $K_0(x, x')$ defines a feature space $F$ with $K_0(x, x') = \phi_x^T\phi_{x'}$; $\langle f_x\rangle_{post}$ corresponds to $\mu_{post}$ and $K_{post}(x, x')$ to $\Sigma_{post}$. GP inference is thus estimating a Gaussian distribution in $F$.

84 Prediction with Gaussian processes. Given $x^*$ for which we require the answer $y^*$: $p(y^*\mid x^*, D) = \int df^*\int df_D\, p(y^*, f_D, f^*\mid x^*, D) = \int df^*\, P(y^*\mid x^*, f^*)\int df_D\, p_{post}(f^*, f_D\mid D) = \int df^*\, P(y^*\mid x^*, f^*)\, p_{post}(f^*\mid D)$, where $f^* = f_{x^*}$ is the random variable associated with $x^*$. We use the posterior process irrespective of the likelihood; if it is not Gaussian, we approximate.

85 Regression with Gaussian noise. Gaussian noise: $P(y_D\mid f_D) \propto \exp\left[-\frac{1}{2\sigma_o^2}\|y_D - f_D\|^2\right]$. Gaussian latent variables: $P(f_D\mid K(\cdot,\cdot\mid\theta)) \propto \exp\left[-\frac{1}{2}(f_D - \mu_D)^T K_D^{-1}(f_D - \mu_D)\right]$. Combining: the product of quadratic exponents is again a Gaussian.

86 Posterior distribution, Gaussian noise. Reading off the coefficients of the Gaussian distribution: $\mu_{post} = \left(K_D^{-1} + \frac{1}{\sigma_o^2}I_N\right)^{-1}\left[K_D^{-1}\mu_D + \frac{1}{\sigma_o^2}y_D\right]$ and $\Sigma_{post} = \left(K_D^{-1} + \frac{1}{\sigma_o^2}I_N\right)^{-1}$. The joint distribution of all r.v.s at the training locations is a Gaussian: $f_D \sim N(\mu_{post}, \Sigma_{post})$.

87 Posterior distribution, Brownian bridge. Assume that there was an observation for the Brownian motion $k(x, x') = \min(x, x')$: at 1 the value of the process is 0. It generates a new process with kernel $k_B(x, x') = \min(x, x') - xx'$.

88 Predictive distributions, Gaussian noise. For a test location $x^*$: $\mu^* = \alpha_D^T k^*$ and $\sigma^{*2} = K(x^*, x^*) - k^{*T} C_D\, k^*$, with $\alpha_D = C_D\, y_D$, $C_D = (K_D + \sigma_o^2 I_N)^{-1}$ and $k^* = [K(x^*, x_1), \dots, K(x^*, x_N)]^T$. Posterior mean and covariance functions: $\mu(x) = \alpha_D^T k_x$ and $K_{post}(x, x') = K(x, x') - k_x^T C_D\, k_{x'}$, where $k_x = [K(x, x_1), \dots, K(x, x_N)]^T$.
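A MATLAB sketch of these predictive formulas for Gaussian-noise regression; the kernel, noise level and data are illustrative assumptions. It builds $K_D$, computes $\alpha_D$ and $C_D$, and evaluates the predictive mean and variance at a test point.

% GP regression with Gaussian noise: predictive mean and variance at x*
k  = @(x, y) exp(-25*(x - y).^2);            % illustrative kernel
s2 = 0.01;                                   % noise variance sigma_o^2
N  = 30;
x  = rand(N, 1);
y  = sin(2*pi*x) + sqrt(s2)*randn(N, 1);
K  = zeros(N, N);
for ii = 1:N, for jj = 1:N, K(ii, jj) = k(x(ii), x(jj)); end; end
CD    = inv(K + s2*eye(N));                  % C_D = (K_D + sigma_o^2 I)^{-1}
alpha = CD * y;                              % alpha_D = C_D y_D
xs = 0.5;                                    % test location x*
ks = zeros(N, 1);
for ii = 1:N, ks(ii) = k(xs, x(ii)); end
mu_star  = alpha' * ks;                      % predictive mean
var_star = k(xs, xs) - ks' * CD * ks;        % predictive variance of the latent function
fprintf('mean %.3f   variance %.4f\n', mu_star, var_star);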

89 Non-computable Posteriors. If the likelihood is non-Gaussian, the posterior does not have an analytical form (no summarising statistics). Methods to obtain the posterior: sampling; analytic approximations.

90 Sampling from the posterior I. Sampling from $p_{post}(f_X) = \frac{1}{Z_D}\int df_D\, P(D\mid f_D)\, p_0(f_D, f_X)$: in practice, joint sampling from $p_{post}(f_X, f_D)$, keeping only $f_X$. Implementation: sampling from $p_0(f_D, f_X)$ plus weighting, $p_{post}(f_X) \approx \frac{1}{C_T}\sum_{t=1}^T P(y_N\mid f_D^{(t)})\,\delta(f_X - f_X^{(t)})$, with $C_T = \sum_{t=1}^T P(y_N\mid f_D^{(t)})$.
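A hedged MATLAB sketch of this sample-and-weight idea for a non-Gaussian (probit-type) likelihood: draw prior functions jointly at the training and test inputs, weight each sample by its likelihood, and form the weighted posterior mean at the test point. Kernel, labels and sample count are illustrative assumptions.

% Posterior by prior sampling + likelihood weighting
k   = @(x, y) exp(-4*(x - y).^2);
Phi = @(z) 0.5*(1 + erf(z/sqrt(2)));              % Gaussian cdf used as sigmoid
xD  = [-1; 0; 1];  yD = [-1; -1; 1];              % toy classification data
xs  = 0.5;                                        % test input
X   = [xD; xs];
M   = numel(X); K = zeros(M, M);
for ii = 1:M, for jj = 1:M, K(ii, jj) = k(X(ii), X(jj)); end; end
L   = chol(K + 1e-8*eye(M), 'lower');
T   = 20000;
F   = L * randn(M, T);                            % joint prior samples of [f_D; f_*]
lik = prod(Phi(bsxfun(@times, yD, F(1:numel(xD), :))), 1);  % likelihood of each sample
w   = lik / sum(lik);                             % normalised weights
fprintf('weighted posterior mean at x*: %.3f\n', F(end, :) * w');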

91 Sampling from the posterior II. Sampling methods are powerful, i.e. they allow flexibility in modelling; it is hard to assess convergence; different sampling algorithms are suited to different models; they can be incredibly slow (tempering, MCMC).

92 Laplace Approximation. Log posterior: $\log p_{post}(f_X, f_D) = K + \log P(D\mid f_D) + \log p_0(f_D, f_X) = K + \underbrace{\log P(D\mid f_D) + \log p_0(f_D)}_{g_D(f_D)} + \underbrace{\log p_0(f_X\mid f_D)}_{g_X(f_X)}$. Finding the maximum over $f_X$ and $f_D$: $\hat{f}_X = P_{X|D}\hat{f}_D$ and $\hat{f}_D = \arg\max g_D(f_D)$. Taylor expansion around $\hat{f}_D$: $g_D(f_D) \approx g_D(\hat{f}_D) - \frac{1}{2}(f_D - \hat{f}_D)^T H_g(\hat{f}_D)(f_D - \hat{f}_D)$, with $H_g$ the (negative) Hessian and the linear term vanishing at the maximum; this is a Gaussian.

93 Laplace approximation II. Gaussian approximation: $\hat{p}_{post}(f_X, f_D) \propto p_0(f_X, f_D)\,\underbrace{\frac{N_L\left(f_D\mid\hat{f}_D, H_g(\hat{f}_D)^{-1}\right)}{p_0(f_D)}}_{\hat{P}(D\mid f_D)}$. This defines an approximation to the likelihood: $\hat{P}(D\mid f_D) \propto \frac{N_L\left(f_D\mid\hat{f}_D, H_g(\hat{f}_D)^{-1}\right)}{p_0(f_D)}$.

94 Laplace approximation Summary. The Laplace approximation: (+) generates an approximation to the likelihood; (-) is applicable only for differentiable likelihood functions; (+) defines an approximation to the whole process; (-) the (negative) Hessian has to be positive definite and smaller than the prior.

98 Analytic approximations I. Aim: approximate the posterior distribution or the posterior process. GP prior, GP approximation to the posterior: a projection onto the closest GP. [Figure: the prior $p_0$, the data, the posterior $p_{post}$, and its KL-projection $\hat{p}_N$ onto the Gaussian family $\mathcal{G}$.]

99 Analytic approximations II. Choice of projection: the Kullback-Leibler divergence, $KL(GP_{post}\,\|\,GP) = \int dGP_{post}(f)\,\log\frac{dGP_{post}(f)}{dGP(f)}$. The minimiser $\widehat{GP} = \arg\min_{GP} KL(GP_{post}\,\|\,GP)$ satisfies $\langle f_x\rangle_{\widehat{GP}} \stackrel{def}{=} \langle f_x\rangle_{post}$ and $K_{\widehat{GP}}(x, x') \stackrel{def}{=} K_{post}(x, x')$. This implies that the KL-approximation is the GP parametrised by $(\alpha_D, C_D)$.

100 Computing KL-distances. Between GPs with the same prior, $GP_1 = N(\mu_1, \Sigma_1)$ and $GP_2 = N(\mu_2, \Sigma_2)$, the KL-distance is $2\,KL(GP_1\|GP_2) = (\mu_2 - \mu_1)^T\Sigma_2^{-1}(\mu_2 - \mu_1) + \mathrm{tr}\left(\Sigma_1\Sigma_2^{-1} - I_F\right) - \ln\left|\Sigma_1\Sigma_2^{-1}\right|$. With the parameters $(\alpha, C)$ and $Q_{BV} = K_{BV}^{-1}$: $2\,KL = (\alpha_2 - \alpha_1)^T(C_2 + Q_{BV})^{-1}(\alpha_2 - \alpha_1) + \mathrm{tr}\left[(C_1 - C_2)(C_2 + Q_{BV})^{-1}\right] - \ln\left|(C_1 + Q_{BV})(C_2 + Q_{BV})^{-1}\right|$. Assumption: the kernel matrix on the BV set is non-singular, $K_{BV} \succ 0$.
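A quick MATLAB check of the first formula, the KL divergence between two finite-dimensional Gaussians; the dimensions and parameters are random, illustrative choices.

% 2*KL( N(mu1,S1) || N(mu2,S2) ) for two multivariate Gaussians
d   = 3;
mu1 = zeros(d, 1);        mu2 = 0.5*ones(d, 1);
A   = randn(d);           S1  = A*A' + d*eye(d);
B   = randn(d);           S2  = B*B' + d*eye(d);
twoKL = (mu2 - mu1)'*(S2\(mu2 - mu1)) + trace(S1/S2 - eye(d)) - log(det(S1/S2));
fprintf('2*KL = %.4f\n', twoKL);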

101 Analytic Approximations III. Bayesian online learning (recursive): instead of the whole data set of size $N$ it uses one example $(x_{t+1}, y_{t+1})$ at a time, with the prior process given by $\langle f_x\rangle_t$, $K_t(x, x')$, and minimises $KL(GP_{post}^{t+1}\,\|\,GP)$. [Figure: processing $(x_1, y_1),\dots,(x_N, y_N)$ sequentially, each step followed by a KL-projection $\hat{p}_t$ onto $\mathcal{G}$; each step involves a smaller approximation.]

102 Bayesian Online Learning. Learning propagates the mean and the kernel: $\langle f_x\rangle_{t+1} = \langle f_x\rangle_t + q\, K_t(x, x_{t+1})$ and $K_{t+1}(x, x') = K_t(x, x') + r\, K_t(x, x_{t+1})\, K_t(x_{t+1}, x')$, where $q$ and $r$ are functions of the single likelihood: $q = q^{(t+1)} = \frac{\partial}{\partial\langle f_{t+1}\rangle_t}\ln\langle P(y_{t+1}\mid f_{t+1})\rangle_t$, the average $\langle\cdot\rangle_t$ being taken w.r.t. $f_{t+1}\sim N_t(\langle f_{t+1}\rangle_t, \sigma_{t+1}^2)$.
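A hedged MATLAB sketch of these online updates for the special case of Gaussian-noise regression, where $q = (y - \mu)/(\sigma^2 + \sigma_o^2)$ and $r = -1/(\sigma^2 + \sigma_o^2)$ have closed forms; the state is the coefficient pair $(\alpha, C)$ over the points seen so far. Kernel, noise level and data are illustrative assumptions; for this likelihood the recursion should reproduce batch GP regression.

% Online GP regression: propagate (alpha, C) one observation at a time
k  = @(x, y) exp(-25*(x - y).^2);
s2 = 0.01;                                     % noise variance sigma_o^2
N  = 20;
x  = rand(N, 1);  y = sin(2*pi*x) + sqrt(s2)*randn(N, 1);
alpha = zeros(0, 1); C = []; Xseen = [];
for t = 1:N
  kx = zeros(t-1, 1);
  for ii = 1:t-1, kx(ii) = k(Xseen(ii), x(t)); end
  mu  = kx' * alpha;                           % current predictive mean at x_t
  s2f = k(x(t), x(t)) + kx' * C * kx;          % current predictive variance at x_t
  q   = (y(t) - mu) / (s2f + s2);
  r   = -1 / (s2f + s2);
  s   = [C * kx; 1];                           % update direction in the extended basis
  alpha = [alpha; 0] + q * s;
  C     = [C, zeros(t-1, 1); zeros(1, t-1), 0] + r * (s * s');
  Xseen = [Xseen; x(t)];
end
% compare with the batch coefficients alpha_batch = (K + s2 I)^{-1} y
K = zeros(N);
for ii = 1:N, for jj = 1:N, K(ii, jj) = k(x(ii), x(jj)); end; end
fprintf('max |alpha_online - alpha_batch| = %.2e\n', max(abs(alpha - (K + s2*eye(N))\y)));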

103 Optimising hyper-parameters I. GP kernel parameters amount to model choice. Example, RBF kernel: $K(x, x') = A\exp\left[-\frac{(x - x')^2}{\beta}\right]$. [Figure: training set, Bayesian error bars and mean function for $\beta = 0.1$ and $\beta = 1$.]

104 Optimising hyper-parameters II. Model evidence: $Z_D(\theta) = P(D\mid\theta) = \int df_D\, P(D\mid f_D)\, p_0(f_D\mid\theta)$. Maximum Likelihood II inference: if $p(\theta\mid\mathcal{M})$ is flat, $\theta^* = \arg\max_{\theta\in\Omega} P(\theta\mid D) = \arg\max_{\theta\in\Omega}\frac{P(D\mid\theta)\, p(\theta\mid\mathcal{M})}{p(D\mid\mathcal{M})} = \arg\max_{\theta\in\Omega} Z_D(\theta)$: evidence maximisation. Gradient / conjugate-gradient methods are used (MacKay's evidence framework).
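A MATLAB sketch of evidence-based model choice for Gaussian-noise regression, where the evidence is available in closed form, $\log Z_D = -\frac{1}{2}y^T(K_\theta + \sigma_o^2 I)^{-1}y - \frac{1}{2}\log|K_\theta + \sigma_o^2 I| - \frac{N}{2}\log 2\pi$; the kernel widths compared on a grid are illustrative assumptions, and in practice one would follow the gradient instead.

% Log evidence of GP regression for a few kernel widths
s2 = 0.01; N = 40;
x  = rand(N, 1);  y = sin(2*pi*x) + sqrt(s2)*randn(N, 1);
for beta = [0.1 1 10 100]                          % kernel width parameter
  K = zeros(N);
  for ii = 1:N, for jj = 1:N, K(ii, jj) = exp(-beta*(x(ii) - x(jj))^2); end; end
  A = K + s2*eye(N);
  logdetA = 2*sum(log(diag(chol(A))));             % stable log-determinant
  logZ = -0.5*y'*(A\y) - 0.5*logdetA - 0.5*N*log(2*pi);
  fprintf('beta = %6.1f   log evidence = %8.2f\n', beta, logZ);
end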

105 Sparse representations, Motivation. Gaussian Processes are fully connected graphical models, so computing estimates is difficult. E.g. for the posterior mean, $\langle f_x\rangle_{post} = y^T(K_N + \sigma_o^2 I_N)^{-1}k_x$, the inversion requires $O(N^3)$ time.

106 Sparse representations, a solution. Condition all training locations on a set of basis locations: $f_{BV}\sim N(\mu_{BV}, \Sigma_{BV})$. The pseudo-latents $f_X$ are conditioned on $f_{BV}$: $f_X\mid f_{BV}\sim N(P\mu_{BV},\, P\Sigma_{BV}P^T)$, where $P$ is the projection matrix $P = P_{X,BV} = K_{X,BV}\, K_{BV}^{-1}$.
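A brief MATLAB sketch of this projection step: conditioning the latent values at arbitrary inputs on a small basis-vector set through $P = K_{X,BV}K_{BV}^{-1}$. The kernel, the BV locations and the mean over the BV set are illustrative assumptions.

% Project latent values at inputs X onto a small basis-vector (BV) set
k   = @(x, y) exp(-25*(x - y).^2);
xbv = (0.1:0.2:0.9)';                          % basis vector locations (illustrative)
xX  = rand(50, 1);                             % other inputs to be projected
nb  = numel(xbv); nx = numel(xX);
Kbv = zeros(nb, nb); Kxb = zeros(nx, nb);
for ii = 1:nb, for jj = 1:nb, Kbv(ii, jj) = k(xbv(ii), xbv(jj)); end; end
for ii = 1:nx, for jj = 1:nb, Kxb(ii, jj) = k(xX(ii),  xbv(jj)); end; end
P = Kxb / (Kbv + 1e-8*eye(nb));                % P = K_{X,BV} K_BV^{-1} (with jitter)
mu_bv = sin(2*pi*xbv);                         % some mean over the BV set
mu_X  = P * mu_bv;                             % conditional mean of f_X given f_BV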

107 Gaussian Regression. Artificial data: $y = \sin(x)/x$ and polynomial kernel $K_0(x, x') = (1 + x^Tx')^k$. Number of training points: 1000, with added Gaussian noise. [Figure: fit with a polynomial kernel of order 5.]

108 Robust one-sided regression. Exponential, one-sided, additive noise: $P(y\mid f_x) = \lambda\exp[-\lambda(y - f_x)]$ if $y > f_x$, and 0 otherwise. [Figure: fit; BV set size 9.]

109 Classification. For each location x we have $y = \pm 1$. The likelihood function for this model is $P(y\mid f(x)) = \mathrm{Erf}\left(\frac{y f_x}{\sigma_0}\right)$, with Erf the incomplete Gaussian (a sigmoid): $\mathrm{Erf}(x) = \int_{-\infty}^x dt\, \exp(-t^2/2)/\sqrt{2\pi}$. The posterior is not Gaussian. For a single datum the mean and variance are computable, so iterative methods can be used.

110 Toy Classification. RBF kernel: $K(x, x') = \exp\left[-b\sum_{i=1}^d \frac{(x_i - x_i')^2}{\beta_i}\right]$; behaviour of the ARD parameters $\beta_i$. [Figure: decision boundaries for BV# 20 and BV# 4, with the corresponding log-evidences; demo: demogp_class_gui.]

111 Classification, USPS data set. Handwritten-digit data set of gray-scale images with 7291 training and 2007 test patterns (RBF kernel with $\sigma_K^2 = 0.5$). [Figure: test error (%) for USPS 4 vs. non-4 as a function of the BV set size, after 1 and 4 iterations.]

112 Crab data set, using RBF kernels. [Figure: test error, logarithm of the input weights (ARD), log-evidence and BV set size.]

113 Multiclass Classification. Problem setup: for each location x we have $y\in\{1,\dots,K\}$, transformed into $y\in\{0,1\}^K$ by the coding $y = [0,\dots,0,1,0,\dots]^T$ with the 1 in the $k$-th position. $K$ independent GPs are used; the independence is a priori only. The likelihood function is $P(y\mid f(x)) = \frac{y^T s}{1^T s}$, where $s = \exp\left([f_1(x),\dots,f_K(x)]^T\right)$ element-wise. The posterior processes are not independent.

114 Multiclass Classification. [Figure: two-dimensional demo, class-conditional distributions.]

115 Multiclass Classification. [Figure: results of the two-dimensional demo.]

116 Inverse Problems. Likelihood: local observations of global wind fields $(u_i, v_i)$. A probabilistic framework is preferred due to the lack of direct observations. The uncertainty is captured in mixture density networks: $P(u_i, v_i\mid obs) = \sum_{l=1}^4 \beta_{il}\, N(u_i, v_i\mid\mu_{il}, A_{il})$, with $\beta_{il}, \mu_{il}, A_{il}$ local parameters.

117 Inverse Problems. Few basis vectors are retained; the approximation preserves information about the local uncertainty; the inference process is fast.

118 Bibliography I. D.J.C. MacKay: The evidence framework...; C. Rasmussen: Gaussian Processes for Regression; C.K.I. Williams: Prediction with Gaussian Processes...

119 Bibliography II. R. Neal: Priors for Infinite Networks; M. Seeger: Bayesian Gaussian Process Models...; L. Csató: Gaussian Processes... Sparse Approximations.

120 Bibliography III

121 NIPS 05 I. NIPS 05 GP-related articles: J. Murillo-Fuentes, F. Perez-Cruz: Gaussian Processes for Multiuser Detection in CDMA Receivers; Y. Shen, A. Ng, M. Seeger: Fast Gaussian Process Regression Using KD-Trees; A. Shon, K. Grochow, A. Hertzmann, R. Rao: Gaussian Process CCA for Image Synthesis and Robotic Imitation; R. Der, D. Lee: Beyond Gaussian Processes: On the Distributions of Infinite Networks; D. Fleet, J. Wang, A. Hertzmann: Gaussian Process Dynamical Models

122 NIPS 05 II. NIPS 05 GP-related articles: Y. Engel, P. Szabo, D. Volkinshtein: Learning With Gaussian Process Temporal Difference Methods; M. Kuss, C. Rasmussen: Assessing Approximations for Gaussian Process Classification; E. Snelson, Z. Ghahramani: Sparse Parametric Gaussian Processes; S. Kakade, M. Seeger, D. Foster: Worst-Case Bounds for Gaussian Process Models; S. Keerthi, W. Chu: A Matching Pursuit Approach to Sparse Gaussian Process Regression; E. Meeds, S. Osindero: An Alternative Infinite Mixture Of Gaussian Process Experts


More information

The Variational Gaussian Approximation Revisited

The Variational Gaussian Approximation Revisited The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Least Squares Regression

Least Squares Regression CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the

More information

Gaussian Processes in Machine Learning

Gaussian Processes in Machine Learning Gaussian Processes in Machine Learning November 17, 2011 CharmGil Hong Agenda Motivation GP : How does it make sense? Prior : Defining a GP More about Mean and Covariance Functions Posterior : Conditioning

More information

Probabilistic Reasoning in Deep Learning

Probabilistic Reasoning in Deep Learning Probabilistic Reasoning in Deep Learning Dr Konstantina Palla, PhD palla@stats.ox.ac.uk September 2017 Deep Learning Indaba, Johannesburgh Konstantina Palla 1 / 39 OVERVIEW OF THE TALK Basics of Bayesian

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

Basis Expansion and Nonlinear SVM. Kai Yu

Basis Expansion and Nonlinear SVM. Kai Yu Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion

More information

Outline Lecture 2 2(32)

Outline Lecture 2 2(32) Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II 1 Non-linear regression techniques Part - II Regression Algorithms in this Course Support Vector Machine Relevance Vector Machine Support vector regression Boosting random projections Relevance vector

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/

More information

GWAS IV: Bayesian linear (variance component) models

GWAS IV: Bayesian linear (variance component) models GWAS IV: Bayesian linear (variance component) models Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS IV: Bayesian

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

More information

20: Gaussian Processes

20: Gaussian Processes 10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

Part 1: Expectation Propagation

Part 1: Expectation Propagation Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Statistical Approaches to Learning and Discovery

Statistical Approaches to Learning and Discovery Statistical Approaches to Learning and Discovery Bayesian Model Selection Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon

More information

Lecture 1b: Linear Models for Regression

Lecture 1b: Linear Models for Regression Lecture 1b: Linear Models for Regression Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced

More information

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML) Machine Learning Fall 2017 Kernels (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang (Chap. 12 of CIML) Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR:

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray School of Informatics, University of Edinburgh The problem Learn scalar function of vector values f(x).5.5 f(x) y i.5.2.4.6.8 x f 5 5.5 x x 2.5 We have (possibly

More information

Gaussian Processes for Machine Learning

Gaussian Processes for Machine Learning Gaussian Processes for Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics Tübingen, Germany carl@tuebingen.mpg.de Carlos III, Madrid, May 2006 The actual science of

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee 9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic

More information

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

CS-E3210 Machine Learning: Basic Principles

CS-E3210 Machine Learning: Basic Principles CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction

More information

Variational Methods in Bayesian Deconvolution

Variational Methods in Bayesian Deconvolution PHYSTAT, SLAC, Stanford, California, September 8-, Variational Methods in Bayesian Deconvolution K. Zarb Adami Cavendish Laboratory, University of Cambridge, UK This paper gives an introduction to the

More information