Kernel methods for Bayesian inference


1 Kernel methods for Bayesian inference Arthur Gretton Gatsby Computational Neuroscience Unit Lancaster, Nov. 2014

2 Motivating Example: Bayesian inference without a model. 3600 downsampled frames of RGB pixels ($Y_t \in [0,1]^{1200}$); 1800 training frames, the remainder for test; Gaussian noise added to $Y_t$. Challenges: no parametric model of the camera dynamics (only samples); no parametric model of the map from camera angle to image (only samples). Want to do filtering: Bayesian inference.

3 ABC: an approach to Bayesian inference without a model. Bayes rule: $P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$, where $P(x|y)$ is the likelihood and $\pi(y)$ is the prior.

4 ABC: an approach to Bayesian inference without a model. Bayes rule: $P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$, where $P(x|y)$ is the likelihood and $\pi(y)$ is the prior. One approach: Approximate Bayesian Computation (ABC).

5 ABC: an approach to Bayesian inference without a model. Bayes rule: $P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$, where $P(x|y)$ is the likelihood and $\pi(y)$ is the prior. One approach: Approximate Bayesian Computation (ABC). ABC generates a sample from $p(y|x^*)$ as follows: 1. generate a sample $y_t$ from the prior $\pi$; 2. generate a sample $x_t$ from $P(X|y_t)$; 3. if $D(x^*, x_t) < \tau$, accept $y = y_t$, otherwise reject; 4. go to (1). In step 3, $D$ is a distance measure and $\tau$ is a tolerance parameter.
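To make the ABC steps concrete, here is a minimal rejection-sampling sketch in Python for a toy problem (scalar Gaussian prior and likelihood); the prior, likelihood, distance and tolerance are illustrative choices, not those used elsewhere in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: prior y ~ N(0, 1), likelihood x | y ~ N(y, 0.5^2).
# We observe x_star and want samples from the posterior p(y | x_star).
x_star = 1.2
tol = 0.1          # tolerance tau
n_accept = 500     # number of accepted posterior samples we want

def sample_prior():
    return rng.normal(0.0, 1.0)

def sample_likelihood(y):
    return rng.normal(y, 0.5)

posterior_samples = []
while len(posterior_samples) < n_accept:
    y_t = sample_prior()                 # 1. draw y_t from the prior
    x_t = sample_likelihood(y_t)         # 2. draw x_t from P(X | y_t)
    if abs(x_t - x_star) < tol:          # 3. accept if D(x_star, x_t) < tau
        posterior_samples.append(y_t)    # 4. otherwise reject and repeat

print("posterior mean estimate:", np.mean(posterior_samples))
```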

6 Motivating example 2: simple Gaussian case. $p(x,y)$ is $N\big((0, \mathbf{1}_d^\top)^\top, V\big)$ with $V$ a randomly generated covariance. Posterior mean on $x$: ABC vs. kernel approach. [Figure: CPU time (sec) vs. mean squared error, 6 dimensions, comparing KBI, COND, and ABC.]

7 Overview. Introduction to reproducing kernel Hilbert spaces: Hilbert space; kernels and feature spaces; reproducing property; mapping probabilities to feature space.

8 Overview. Introduction to reproducing kernel Hilbert spaces: Hilbert space; kernels and feature spaces; reproducing property; mapping probabilities to feature space. Nonparametric Bayesian inference: learning conditional probabilities (smooth regression to an RKHS); kernelized Bayesian inference.

9 Functions in a reproducing kernel Hilbert space Kernel methods can control smoothness and avoid overfitting/underfitting.

10 Hilbert space. Inner product: let $\mathcal{H}$ be a vector space over $\mathbb{R}$. A function $\langle \cdot,\cdot\rangle_{\mathcal{H}} : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ is an inner product on $\mathcal{H}$ if: 1. linear, $\langle \alpha_1 f_1 + \alpha_2 f_2, g\rangle_{\mathcal{H}} = \alpha_1\langle f_1,g\rangle_{\mathcal{H}} + \alpha_2\langle f_2,g\rangle_{\mathcal{H}}$; 2. symmetric, $\langle f,g\rangle_{\mathcal{H}} = \langle g,f\rangle_{\mathcal{H}}$; 3. $\langle f,f\rangle_{\mathcal{H}} \ge 0$, and $\langle f,f\rangle_{\mathcal{H}} = 0$ if and only if $f = 0$.

11 Hilbert space. Inner product (definition as on slide 10). Norm induced by the inner product: $\|f\|_{\mathcal{H}} := \sqrt{\langle f,f\rangle_{\mathcal{H}}}$.

12 Hilbert space. Inner product and norm (as on slides 10-11). Hilbert space: an inner product space containing the limits of all its Cauchy sequences.

13 Kernel. Kernel: let $\mathcal{X}$ be a non-empty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel if there exists an $\mathbb{R}$-Hilbert space $\mathcal{H}$ and a map $\varphi : \mathcal{X} \to \mathcal{H}$ such that for all $x, x' \in \mathcal{X}$, $k(x,x') := \langle \varphi_x, \varphi_{x'}\rangle_{\mathcal{H}}$. Almost no conditions on $\mathcal{X}$ (e.g., $\mathcal{X}$ itself doesn't need an inner product; e.g., documents). A single kernel can correspond to several possible feature maps. A trivial example for $\mathcal{X} := \mathbb{R}$: $\varphi^{(1)}_x = x$ and $\varphi^{(2)}_x = \begin{bmatrix} x/\sqrt{2} \\ x/\sqrt{2} \end{bmatrix}$.

14 Finite dim. RKHS with polynomial features. Example: a three dimensional space of features of points in $\mathbb{R}^2$: $\varphi : \mathbb{R}^2 \to \mathbb{R}^3$, $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \varphi_x = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix}$, with kernel $k(x,y) = x_1 y_1 + x_2 y_2 + x_1 x_2\, y_1 y_2$ (the standard inner product in $\mathbb{R}^3$ between features). Denote this feature space by $\mathcal{H}$.

15 Finite dim. RKHS with polynomial features. Define a linear function of the inputs $x_1, x_2$ and their product $x_1 x_2$: $f(x) = f_1 x_1 + f_2 x_2 + f_3 x_1 x_2$. $f$ lives in a space of functions mapping from $\mathcal{X} = \mathbb{R}^2$ to $\mathbb{R}$. Equivalent representation for $f$: $f(\cdot) = \begin{bmatrix} f_1 & f_2 & f_3 \end{bmatrix}^\top$. Here $f(\cdot)$ refers to the function as an object (here a vector in $\mathbb{R}^3$), while $f(x) \in \mathbb{R}$ is the function evaluated at a point (a real number).

16 Finite dim. RKHS with polynomial features. [As on slide 15.] $f(x) = f(\cdot)^\top \varphi_x = \langle f(\cdot), \varphi_x\rangle_{\mathcal{H}}$: evaluation of $f$ at $x$ is an inner product in feature space (here the standard inner product in $\mathbb{R}^3$). $\mathcal{H}$ is a space of functions mapping $\mathbb{R}^2$ to $\mathbb{R}$.

17 Finite dim. RKHS with polynomial features. $\varphi_y$ is the feature of $y$ (a vector in $\mathbb{R}^3$), which also parametrizes a function mapping $\mathbb{R}^2$ to $\mathbb{R}$: $k(\cdot,y) := \begin{bmatrix} y_1 & y_2 & y_1 y_2\end{bmatrix}^\top = \varphi_y$. Given $y$, there is a vector $k(\cdot,y)$ in $\mathcal{H}$ such that $\langle k(\cdot,y), \varphi_x\rangle_{\mathcal{H}} = a x_1 + b x_2 + c\, x_1 x_2$, where $a = y_1$, $b = y_2$, and $c = y_1 y_2$.

18 Finite dim. RKHS with polynomial features. [As on slide 17.] By symmetry, $\langle k(\cdot,x), \varphi_y\rangle_{\mathcal{H}} = x_1 y_1 + x_2 y_2 + x_1 x_2\, y_1 y_2 = k(x,y)$. We can therefore write $\varphi_x = k(\cdot,x)$ and $\varphi_y = k(\cdot,y)$ without ambiguity: the canonical feature map.
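A small numerical check of the polynomial example (illustrative code, not part of the slides): with the feature map $\varphi_x = (x_1, x_2, x_1 x_2)$, evaluation of an RKHS function is an inner product with $\varphi_x$, and the inner product between feature maps reproduces the kernel.

```python
import numpy as np

def phi(x):
    """Feature map R^2 -> R^3 from the slides: (x1, x2, x1*x2)."""
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, y):
    """Kernel = standard inner product in R^3 between features."""
    return phi(x) @ phi(y)

x = np.array([0.3, -1.0])
y = np.array([2.0, 0.5])

# A function f(x) = f1*x1 + f2*x2 + f3*x1*x2, represented by its coefficient vector.
f = np.array([1.0, -2.0, 0.5])

# Evaluation of f at x is an inner product with the feature map of x.
assert np.isclose(f @ phi(x), 1.0 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[1])

# The canonical feature map k(., y) = phi_y reproduces the kernel.
assert np.isclose(phi(x) @ phi(y), k(x, y))
print("f(x) =", f @ phi(x), "  k(x, y) =", k(x, y))
```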

19 The reproducing property. This example illustrates the two defining features of an RKHS. The reproducing property: for all $x \in \mathcal{X}$ and all $f(\cdot) \in \mathcal{H}$, $\langle f(\cdot), k(\cdot,x)\rangle_{\mathcal{H}} = f(x)$ ...or, in shorter notation, $\langle f, \varphi_x\rangle_{\mathcal{H}}$. In particular, for any $x, y \in \mathcal{X}$, $k(x,y) = \langle k(\cdot,x), k(\cdot,y)\rangle_{\mathcal{H}}$. Note: the feature map of every point is in the feature space: for all $x \in \mathcal{X}$, $k(\cdot,x) = \varphi_x \in \mathcal{H}$.

20 Infinite dimensional feature space. Reproducing property for a function defined with the Gaussian kernel: $f(x) := \sum_{i=1}^m \alpha_i k(x_i,x) = \left\langle \sum_{i=1}^m \alpha_i \varphi_{x_i},\ \varphi_x \right\rangle_{\mathcal{H}}$. [Figure: plot of $f(x)$ against $x$.]

21 Infinite dimensional feature space. [As on slide 20.] What do the features $\varphi_x$ look like (there are infinitely many of them, and they are not unique!)? What do these features have to do with smoothness?

22 Infinite dimensional feature space. Gaussian kernel: $k(x,x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$. Its eigenvalues decay geometrically, $\lambda_k \propto b^k$ with $b < 1$, and the eigenfunctions are $e_k(x) \propto \exp\!\big(-(c-a)x^2\big)\, H_k(x\sqrt{2c})$, where $a, b, c$ are functions of $\sigma$ and $H_k$ is the $k$th-order Hermite polynomial. Then $k(x,x') = \sum_{i=1}^\infty \lambda_i\, e_i(x)\, e_i(x') = \sum_{i=1}^\infty \big(\sqrt{\lambda_i}\, e_i(x)\big)\big(\sqrt{\lambda_i}\, e_i(x')\big)$, an inner product between the features $\varphi_x$ and $\varphi_{x'}$.

23 Infinite dimensional feature space (Mercer). Let $\mathcal{X}$ be a compact metric space, $k$ a continuous kernel, and $\mu$ a finite Borel measure with $\mathrm{supp}\{\mu\} = \mathcal{X}$. Then the convergence of $k(x,y) = \sum_j \lambda_j\, e_j(x)\, e_j(y)$ is absolute and uniform (each $e_j$ is the continuous representative of its $L_2$ equivalence class). The feature map is $\varphi_x = \big[\,\ldots\ \sqrt{\lambda_i}\, e_i(x)\ \ldots\,\big]$.

24 Infinite dimensional feature space (Mercer RKHS) (Steinwart and Christmann, Theorem 4.51). Under the assumptions of Mercer's theorem, $\mathcal{H} := \left\{ \sum_i a_i \sqrt{\lambda_i}\, e_i \ :\ (a_i) \in \ell^2 \right\}$ is an RKHS with kernel $k$. Given two functions in the RKHS, $f(x) := \sum_i a_i \sqrt{\lambda_i}\, e_i(x)$ and $g(x) := \sum_i b_i \sqrt{\lambda_i}\, e_i(x)$, the inner product is $\langle f, g\rangle_{\mathcal{H}} = \sum_i a_i b_i$.
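As an aside (not on the slides), the rapid decay of the Gaussian-kernel eigenvalues can be seen empirically: the eigenvalues of the Gram matrix divided by the sample size are a rough finite-sample proxy for the Mercer eigenvalues. A sketch, with an arbitrary bandwidth and sampling measure:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
n = 200
x = rng.normal(0.0, 1.0, size=n)          # points drawn from a Gaussian measure mu

# Gaussian kernel Gram matrix K_ij = exp(-(x_i - x_j)^2 / (2 sigma^2))
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

# Eigenvalues of K / n, sorted in decreasing order, approximate the Mercer
# eigenvalues lambda_i (a standard finite-sample proxy, used here only as an illustration).
eigvals = np.linalg.eigvalsh(K / n)[::-1]
print("largest ten eigenvalues:", np.round(eigvals[:10], 4))
print("ratio lambda_10 / lambda_1:", eigvals[9] / eigvals[0])   # rapid decay
```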

25 Infinite dimensional feature space. Proof: there are two aspects requiring care: 1. Is $k(x,\cdot) \in \mathcal{H}$ for all $x \in \mathcal{X}$? (Requires Mercer's theorem.) 2. Does the reproducing property hold, $\langle f, k(\cdot,x)\rangle_{\mathcal{H}} = f(x)$?

26 Infinite dimensional feature space. Proof, first part: by the definition of $\mathcal{H}$, the function in $\mathcal{H}$ indexed by $x$ is $k(x,\cdot) = \sum_i \big(\sqrt{\lambda_i}\, e_i(x)\big)\big(\sqrt{\lambda_i}\, e_i(\cdot)\big)$. Is this function in the RKHS? Yes, if the $\ell^2$ norm of $\big(\sqrt{\lambda_i}\, e_i(x)\big)_i$ is bounded. This is due to Mercer: for all $x \in \mathcal{X}$, $\|k(x,\cdot)\|^2_{\mathcal{H}} = \big\|\big(\sqrt{\lambda_i}\, e_i(x)\big)_i\big\|^2_{\ell^2} = k(x,x) < \infty$.

27 Infinite dimensional feature space. Proof (continued), second part: the reproducing property holds. Using the inner product definition, $\langle f, k(x,\cdot)\rangle_{\mathcal{H}} = \sum_i a_i \big(\sqrt{\lambda_i}\, e_i(x)\big) = f(x)$, which is always well defined since the coefficient sequences of both $f$ and $k(x,\cdot)$ are in $\ell^2$.

28 Infinite dimensional feature space. Example RKHS function, Gaussian kernel: $f(x) := \sum_{i=1}^m \alpha_i k(x_i,x) = \sum_{i=1}^m \alpha_i \sum_{j=1}^\infty \lambda_j\, e_j(x_i)\, e_j(x) = \sum_{j=1}^\infty \big(\sqrt{\lambda_j}\, f_j\big)\, e_j(x)$, where $f_j = \sum_{i=1}^m \alpha_i \sqrt{\lambda_j}\, e_j(x_i)$. [Figure: plot of $f(x)$ against $x$.] NOTE that this enforces smoothing: the $\lambda_j$ decay as the $e_j$ become rougher, and the $f_j$ decay since $\|f\|^2_{\mathcal{H}} = \sum_j f_j^2 < \infty$.
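A brief sketch of such an RKHS function in code (centres, coefficients, and bandwidth are arbitrary choices): $f$ is a smooth weighted sum of Gaussian bumps, and its squared RKHS norm is the quadratic form $\alpha^\top K \alpha$, which equals $\sum_j f_j^2$ in the Mercer basis.

```python
import numpy as np

sigma = 0.5

def k(a, b):
    """Gaussian kernel on scalars (broadcasts over arrays)."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

centres = np.array([-1.0, 0.0, 0.7, 2.0])    # the x_i
alpha = np.array([1.0, -0.5, 2.0, 0.3])      # the alpha_i

def f(x):
    """RKHS function f(x) = sum_i alpha_i k(x_i, x): a smooth mixture of bumps."""
    return sum(a * k(c, x) for a, c in zip(alpha, centres))

# Squared RKHS norm: ||f||_H^2 = sum_{i,j} alpha_i alpha_j k(x_i, x_j) = alpha' K alpha
K = k(centres[:, None], centres[None, :])
norm_sq = alpha @ K @ alpha
print("f(0.2) =", f(0.2), "  ||f||_H^2 =", norm_sq)
```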

29 Reproducing kernel Hilbert space. Let $\mathcal{H}$ be a Hilbert space of $\mathbb{R}$-valued functions on a non-empty set $\mathcal{X}$. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a reproducing kernel of $\mathcal{H}$, and $\mathcal{H}$ is a reproducing kernel Hilbert space, if: for all $x \in \mathcal{X}$, $k(\cdot,x) \in \mathcal{H}$; and for all $x \in \mathcal{X}$, $f \in \mathcal{H}$, $\langle f(\cdot), k(\cdot,x)\rangle_{\mathcal{H}} = f(x)$ (the reproducing property). In particular, for any $x,y \in \mathcal{X}$, $k(x,y) = \langle k(\cdot,x), k(\cdot,y)\rangle_{\mathcal{H}}$. Original definition: a kernel is an inner product between feature maps; then $\varphi_x = k(\cdot,x)$ is a valid feature map.

30 Probabilities in feature space: the mean trick. The kernel trick: given $x \in \mathcal{X}$ for some set $\mathcal{X}$, define the feature map $\varphi_x \in \mathcal{H}$, $\varphi_x = \big[\,\ldots\ \sqrt{\lambda_i}\, e_i(x)\ \ldots\,\big] \in \ell^2$. For positive definite $k(x,x')$, $k(x,x') = \langle \varphi_x, \varphi_{x'}\rangle_{\mathcal{H}}$. The kernel trick: for all $f \in \mathcal{H}$, $f(x) = \langle f, \varphi_x\rangle_{\mathcal{H}}$.

31 Probabilities in feature space: the mean trick. [Kernel trick as on slide 30.] The mean trick: given $P$ a Borel probability measure on $\mathcal{X}$, define the feature map $\mu_P \in \mathcal{H}$, $\mu_P = \big[\,\ldots\ \sqrt{\lambda_i}\, \mathbf{E}_P[e_i(X)]\ \ldots\,\big] \in \ell^2$. For positive definite $k(x,x')$, $\mathbf{E}_{P,Q}\, k(X,Y) = \langle \mu_P, \mu_Q\rangle_{\mathcal{H}}$ for $X \sim P$ and $Y \sim Q$. The mean trick (we call $\mu_P$ a mean/distribution embedding): $\mathbf{E}_P[f(X)] = \mathbf{E}_P\langle \varphi_X, f\rangle_{\mathcal{H}}$.

32 Feature embeddings of probabilities. The kernel trick: $f(x) = \langle f, \varphi_x\rangle_{\mathcal{H}}$. The mean trick: $\mathbf{E}_P[f(X)] = \langle \mu_P, f\rangle_{\mathcal{H}}$. Empirical mean embedding: $\hat\mu_P = \frac{1}{m}\sum_{i=1}^m \varphi_{x_i}$, with $x_i$ i.i.d. $\sim P$. $\mu_P$ gives you expectations of all RKHS functions. When $k$ is characteristic, $\mu_P$ uniquely determines $P$ (e.g., Gauss, Laplace, ...).
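A minimal illustration of the mean trick (the toy distribution, kernel, and test function are assumptions of this sketch, not from the slides): the inner product of the empirical mean embedding with an RKHS function $f$ is simply the sample average of $f$, and it approximates $\mathbf{E}_P f(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0

def k(a, b):
    """Gaussian kernel (broadcasts over arrays)."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

# Samples x_i ~ P (here P = N(1, 0.5^2)), defining mu_hat_P = (1/m) sum_i phi(x_i)
x = rng.normal(1.0, 0.5, size=2000)

# An RKHS function f = sum_j alpha_j k(., z_j)
z = np.array([0.0, 1.0, 2.0])
alpha = np.array([0.5, -1.0, 2.0])

def f(t):
    return k(t[..., None], z) @ alpha

# <mu_hat_P, f>_H = (1/m) sum_i f(x_i): expectations of RKHS functions
# are inner products with the mean embedding.
via_embedding = np.mean(f(x))
# Monte Carlo estimate of E_P f(X) from fresh samples, for comparison.
via_fresh_mc = np.mean(f(rng.normal(1.0, 0.5, size=200000)))
print(via_embedding, via_fresh_mc)   # the two should be close
```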

33 Nonparametric Bayesian inference using distribution embeddings

34 Bayes again. Bayes rule: $P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$, where $P(x|y)$ is the likelihood and $\pi$ is the prior. How would this look with kernel embeddings?

35 Bayes again. [As on slide 34.] Define an RKHS $\mathcal{G}$ on $\mathcal{Y}$ with feature map $\psi_y$ and kernel $l(y,\cdot)$. We need a conditional mean embedding: for all $g \in \mathcal{G}$, $\mathbf{E}_{Y|x}\, g(Y) = \langle g, \mu_{P(y|x)}\rangle_{\mathcal{G}}$. This will be obtained by RKHS-valued ridge regression.

36 Ridge regression and the conditional feature mean. Ridge regression from $\mathcal{X} := \mathbb{R}^d$ to a finite vector output $\mathcal{Y} := \mathbb{R}^{d'}$ (these could be $d'$ nonlinear features of $y$). Define training data $X = \begin{bmatrix} x_1 & \ldots & x_m \end{bmatrix} \in \mathbb{R}^{d \times m}$ and $Y = \begin{bmatrix} y_1 & \ldots & y_m \end{bmatrix} \in \mathbb{R}^{d' \times m}$.

37 Ridge regression and the conditional feature mean. [Training data as on slide 36.] Solve $\hat A = \arg\min_{A \in \mathbb{R}^{d' \times d}} \big( \|Y - AX\|^2 + \lambda \|A\|^2_{\mathrm{HS}} \big)$, where $\|A\|^2_{\mathrm{HS}} = \mathrm{tr}(A^\top A) = \sum_{i=1}^{\min\{d,d'\}} \gamma^2_{A,i}$ (the squared singular values of $A$).

38 Ridge regression and the conditional feature mean. [As on slide 37.] Solution: $\hat A = C_{YX}(C_{XX} + m\lambda I)^{-1}$.

39 Ridge regression and the conditional feature mean. Prediction at a new point $x$: $\hat y = \hat A x = C_{YX}(C_{XX} + m\lambda I)^{-1} x = \sum_{i=1}^m \beta_i(x)\, y_i$, where $\beta(x) = (K + \lambda m I)^{-1} \begin{bmatrix} k(x_1,x) & \ldots & k(x_m,x) \end{bmatrix}^\top$, $K := X^\top X$, and $k(x_1,x) = x_1^\top x$.
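The primal and dual (kernel) forms of this prediction can be checked numerically. In the sketch below the covariances are taken as unnormalised data-matrix products, $C_{YX} = YX^\top$ and $C_{XX} = XX^\top$ (an assumed scaling convention), so that the $m\lambda$ regularisation on $C_{XX}$ corresponds exactly to the $\lambda m I$ term on the Gram matrix $K = X^\top X$; dimensions, data, and $\lambda$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_out, m, lam = 5, 3, 50, 0.1

X = rng.normal(size=(d, m))        # training inputs, one column per example
Y = rng.normal(size=(d_out, m))    # training outputs, one column per example

# Primal ridge solution (slide notation, with C_YX = Y X^T and C_XX = X X^T):
A = Y @ X.T @ np.linalg.inv(X @ X.T + m * lam * np.eye(d))

# Dual / kernel form: K = X^T X, beta(x) = (K + lam m I)^{-1} k_X(x),
# and the prediction is a weighted combination of the training outputs y_i.
K = X.T @ X
x_new = rng.normal(size=d)
beta = np.linalg.solve(K + lam * m * np.eye(m), X.T @ x_new)

primal_pred = A @ x_new
dual_pred = Y @ beta               # sum_i beta_i(x) y_i
print(np.allclose(primal_pred, dual_pred))   # True: both forms agree
```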

40 Ridge regression and the conditional feature mean. [Prediction formula as on slide 39.] What if we do everything in kernel space?

41 Ridge regression and the conditional feature mean. Recall our setup. Given training pairs $(x_i, y_i) \sim P_{XY}$; an RKHS $\mathcal{F}$ on $\mathcal{X}$ with feature map $\varphi_x$ and kernel $k(x,\cdot)$; an RKHS $\mathcal{G}$ on $\mathcal{Y}$ with feature map $\psi_y$ and kernel $l(y,\cdot)$. We define the covariances between feature maps, $C_{XX} = \mathbf{E}_X(\varphi_X \otimes \varphi_X)$ and $C_{XY} = \mathbf{E}_{XY}(\varphi_X \otimes \psi_Y)$, and matrices of feature-mapped training data $X = \begin{bmatrix} \varphi_{x_1} & \ldots & \varphi_{x_m} \end{bmatrix}$, $Y := \begin{bmatrix} \psi_{y_1} & \ldots & \psi_{y_m} \end{bmatrix}$.

42 Ridge regression and the conditional feature mean. Objective [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]: $\hat A = \arg\min_{A \in \mathrm{HS}(\mathcal{F},\mathcal{G})} \big( \mathbf{E}_{XY}\, \|\psi_Y - A\varphi_X\|^2_{\mathcal{G}} + \lambda \|A\|^2_{\mathrm{HS}} \big)$, with $\|A\|^2_{\mathrm{HS}} = \sum_i \gamma^2_{A,i}$. Solution, same as the vector case: $\hat A = C_{YX}(C_{XX} + m\lambda I)^{-1}$. Prediction at a new $x$ using kernels: $\hat A \varphi_x = \begin{bmatrix} \psi_{y_1} & \ldots & \psi_{y_m} \end{bmatrix} (K + \lambda m I)^{-1} \begin{bmatrix} k(x_1,x) & \ldots & k(x_m,x) \end{bmatrix}^\top = \sum_{i=1}^m \beta_i(x)\, \psi_{y_i}$, where $K_{ij} = k(x_i,x_j)$.
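A sketch of the fully kernelized conditional mean embedding on toy data (the joint distribution, kernels, bandwidths, regularisation, and the choice of $g$ are all illustrative assumptions): the weights $\beta(x)$ come from the Gram matrix on the inputs, and $\mathbf{E}[g(Y)|X=x]$ is estimated by $\sum_i \beta_i(x)\, g(y_i)$, which we compare against a Monte Carlo estimate from the known conditional.

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam = 500, 1e-3
sx, sy = 0.5, 0.5                      # kernel bandwidths on X and Y

def kx(a, b):
    """Gaussian kernel k on inputs; a, b are 1-D arrays."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sx ** 2))

# Joint samples: X ~ N(0,1), Y | X=x ~ N(sin(x), 0.1^2)
x = rng.normal(0.0, 1.0, size=m)
y = rng.normal(np.sin(x), 0.1)

# g is an RKHS function on Y: g(.) = l(., 0.5) with a Gaussian kernel l
def g(t):
    return np.exp(-(t - 0.5) ** 2 / (2 * sy ** 2))

# Conditional mean embedding weights beta(x*) = (K + lam m I)^{-1} k_X(x*)
x_star = 1.0
K = kx(x, x)
beta = np.linalg.solve(K + lam * m * np.eye(m), kx(x, np.array([x_star]))[:, 0])

est = beta @ g(y)                              # sum_i beta_i(x*) g(y_i)
truth = np.mean(g(rng.normal(np.sin(x_star), 0.1, size=100000)))  # Monte Carlo
print("embedding estimate:", est, " Monte Carlo:", truth)  # roughly comparable
```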

43 Ridge regression and the conditional feature mean. How is the loss $\|\psi_Y - A\varphi_X\|^2_{\mathcal{G}}$ relevant to the conditional expectation $\mathbf{E}_{Y|x}\, g(Y)$? Define [Song et al. (2009), Grunewalder et al. (2013)]: $\mu_{Y|x} := A\varphi_x$.

44 Ridge regression and the conditional feature mean. [As on slide 43.] We need $A$ to have the property $\mathbf{E}_{Y|x}\, g(Y) \approx \langle g, \mu_{Y|x}\rangle_{\mathcal{G}} = \langle g, A\varphi_x\rangle_{\mathcal{G}} = \langle A^* g, \varphi_x\rangle_{\mathcal{F}} = (A^* g)(x)$.

45 Ridge regression and the conditional feature mean. [As on slide 44.] Natural risk function for the conditional mean: $L(A, P_{XY}) := \sup_{\|g\|\le 1} \mathbf{E}_X \Big[ \underbrace{\big(\mathbf{E}_{Y|X}\, g(Y)\big)(X)}_{\text{Target}} - \underbrace{(A^* g)(X)}_{\text{Estimator}} \Big]^2$.

46 Ridge regression and the conditional feature mean. The squared-loss risk provides an upper bound on the natural risk: $L(A, P_{XY}) \le \mathbf{E}_{XY}\, \|\psi_Y - A\varphi_X\|^2_{\mathcal{G}}$.

47 Ridge regression and the conditional feature mean. [Bound as on slide 46.] Proof (Jensen and Cauchy-Schwarz): $L(A, P_{XY}) := \sup_{\|g\|\le 1} \mathbf{E}_X \big[ \big(\mathbf{E}_{Y|X}\, g(Y)\big)(X) - (A^* g)(X) \big]^2 \le \mathbf{E}_{XY} \sup_{\|g\|\le 1} \big[ g(Y) - (A^* g)(X) \big]^2$.

48 Ridge regression and the conditional feature mean. [Continuing the proof:] $= \mathbf{E}_{XY} \sup_{\|g\|\le 1} \big[ \langle g, \psi_Y\rangle_{\mathcal{G}} - \langle A^* g, \varphi_X\rangle_{\mathcal{F}} \big]^2$.

49 Ridge regression and the conditional feature mean. [Continuing the proof:] $= \mathbf{E}_{XY} \sup_{\|g\|\le 1} \big[ \langle g, \psi_Y\rangle_{\mathcal{G}} - \langle g, A\varphi_X\rangle_{\mathcal{G}} \big]^2$.

50 Ridge regression and the conditional feature mean. [Continuing the proof:] $= \mathbf{E}_{XY} \sup_{\|g\|\le 1} \langle g,\ \psi_Y - A\varphi_X\rangle^2_{\mathcal{G}}$.

51 Ridge regression and the conditional feature mean. [Continuing the proof:] $\le \mathbf{E}_{XY}\, \|\psi_Y - A\varphi_X\|^2_{\mathcal{G}}$.

52 Ridge regression and the conditional feature mean. [Bound and proof as on slides 46-51.] If we assume $\mathbf{E}_Y[g(Y)\,|\,X = x] \in \mathcal{F}$, then the upper bound is tight (next slide).

53 Conditions for ridge regression = conditional mean. Proof that the conditional mean is obtained by ridge regression when $\mathbf{E}_Y[g(Y)\,|\,X = x] \in \mathcal{F}$. Given a function $g \in \mathcal{G}$, assume $\mathbf{E}_{Y|X}[g(Y)\,|\,X = \cdot] \in \mathcal{F}$. Then $C_{XX}\, \mathbf{E}_{Y|X}[g(Y)\,|\,X = \cdot] = C_{XY}\, g$.

54 Conditions for ridge regression = conditional mean. [Claim as on slide 53.] Proof [Fukumizu et al., 2004]: for all $f \in \mathcal{F}$, by the definition of $C_{XX}$, $\big\langle f,\ C_{XX}\, \mathbf{E}_{Y|X}[g(Y)\,|\,X = \cdot] \big\rangle_{\mathcal{F}} = \mathrm{cov}\big( f,\ \mathbf{E}_{Y|X}[g(Y)\,|\,X = \cdot] \big) = \mathbf{E}_X\big( f(X)\, \mathbf{E}_{Y|X}[g(Y)\,|\,X] \big) = \mathbf{E}_{XY}\big( f(X)\, g(Y) \big) = \langle f, C_{XY}\, g\rangle_{\mathcal{F}}$, by the definition of $C_{XY}$.

55 Kernel Bayes law. Prior: $Y \sim \pi(y)$. Likelihood: $(X\,|\,y) \sim P(x\,|\,y)$, with some joint $P(x,y)$.

56 Kernel Bayes law. [As on slide 55.] Joint distribution: $Q(x,y) = P(x\,|\,y)\, \pi(y)$. Warning: $Q \ne P$, a change of measure from $P(y)$ to $\pi(y)$. Marginal for $x$: $Q(x) := \int P(x\,|\,y)\, \pi(y)\, dy$. Bayes law: $Q(y\,|\,x) = \frac{P(x\,|\,y)\, \pi(y)}{Q(x)}$.

57 Kernel Bayes law. Posterior embedding via the usual conditional update: $\mu_{Q(y|x)} = C_{Q(y,x)}\, C_{Q(x,x)}^{-1}\, \varphi_x$.

58 Kernel Bayes law. [As on slide 57.] Given the mean embedding of the prior, $\mu_{\pi(y)}$: define the marginal covariance $C_{Q(x,x)} = \int (\varphi_x \otimes \varphi_x)\, P(x\,|\,y)\, \pi(y)\, dx\, dy = C_{(XX)Y}\, C_{YY}^{-1}\, \mu_{\pi(y)}$, and the cross-covariance $C_{Q(y,x)} = \int (\psi_y \otimes \varphi_x)\, P(x\,|\,y)\, \pi(y)\, dx\, dy = C_{(YX)Y}\, C_{YY}^{-1}\, \mu_{\pi(y)}$.

59 Kernel Bayes law: consistency result. How to compute the posterior expectation from data? Given samples $\{(x_i, y_i)\}_{i=1}^n$ from $P_{XY}$ and $\{u_j\}_{j=1}^n$ from the prior $\pi$, we want to compute $\mathbf{E}[g(Y)\,|\,X = x]$ for $g \in \mathcal{G}$. For any $x \in \mathcal{X}$, $\big|\, \mathbf{g}_y^\top R_{Y|X}\, \mathbf{k}_X(x) - \mathbf{E}[g(Y)\,|\,X = x] \,\big| = O_p\big(n^{-4/27}\big)$ as $n \to \infty$, where $\mathbf{g}_y = (g(y_1), \ldots, g(y_n))^\top \in \mathbb{R}^n$, $\mathbf{k}_X(x) = (k(x_1,x), \ldots, k(x_n,x))^\top \in \mathbb{R}^n$, and $R_{Y|X}$ is learned from the samples (it contains the $u_j$). Smoothness assumptions: $\pi / p_Y \in \mathcal{R}(C_{YY}^{1/2})$, where $p_Y$ is the p.d.f. of $P_Y$, and $\mathbf{E}[g(Y)\,|\,X = \cdot] \in \mathcal{R}\big(C_{Q(xx)}^2\big)$.

60 Experiment: Kernel Bayes law vs EKF

61 Experiment: Kernel Bayes law vs EKF. Compare with the extended Kalman filter (EKF) on the camera orientation task: 3600 downsampled frames of RGB pixels ($Y_t \in [0,1]^{1200}$); 1800 training frames, the remainder for test; Gaussian noise added to $Y_t$.

62 Experiment: Kernel Bayes law vs EKF. [Setup as on slide 61.] Average MSE and standard errors (10 runs), comparing KBR (Gauss), KBR (Tr), Kalman (9 dim.), and Kalman (Quat.) at two noise levels $\sigma^2$. [Table values not recovered in the transcription, apart from standard errors of $\pm 0.023$ and $\pm 0.022$ in the final column.]

63 Overview. Introduction to reproducing kernel Hilbert spaces: Hilbert space; kernels and feature spaces; reproducing property. Nonparametric Bayesian inference: mapping probabilities to feature space; learning conditional probabilities (smooth regression to an RKHS); kernelized Bayesian inference.

64 Co-authors. From UCL: Luca Baldassarre, Steffen Grunewalder, Guy Lever, Sam Patterson, Massimiliano Pontil. External: Kenji Fukumizu (ISM), Bernhard Schoelkopf (MPI), Alex Smola (Google/CMU), Le Song (Georgia Tech), Bharath Sriperumbudur (Penn State).

65 Questions?

66 Selected references. Characteristic kernels and mean embeddings: Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT. Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR. Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR. Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics. Two-sample, independence, conditional independence tests: Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS. Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence. NIPS. Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS. Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

67 Selected references (continued) Conditional mean embedding, RKHS-valued regression: Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., and Vapnik, V., (2003). Kernel Dependency Estimation, NIPS. Micchelli, C., and Pontil, M., (2005). On Learning Vector-Valued Functions. Neural Computation. Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics. Song, L., and Huang, J., and Smola, A., Fukumizu, K., (2009). Hilbert Space Embeddings of Conditional Distributions. ICML. Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML. Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML. Kernel Bayes rule: Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine. Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite kernels, JMLR

68

69 References. K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73-99, 2004. L. Song, J. Huang, A. J. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions. In Proceedings of the International Conference on Machine Learning, 2009.
