Kernel methods for Bayesian inference


1 Kernel methods for Bayesian inference Arthur Gretton Gatsby Computational Neuroscience Unit Lancaster, Nov. 2014

2 Motivating Example: Bayesian inference without a model. 3600 downsampled frames of RGB pixels ($Y_t \in [0,1]^{1200}$); 1800 training frames, the remainder for test; Gaussian noise added to $Y_t$. Challenges: no parametric model of the camera dynamics (only samples); no parametric model of the map from camera angle to image (only samples). Want to do filtering: Bayesian inference.

3 ABC: an approach to Bayesian inference without a model. Bayes rule: $P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$, where $P(x|y)$ is the likelihood and $\pi(y)$ is the prior.

4 ABC: an approach to Bayesian inference without a model. Bayes rule: $P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$, where $P(x|y)$ is the likelihood and $\pi(y)$ is the prior. One approach: Approximate Bayesian Computation (ABC).

5 ABC: an approach to Bayesian inference without a model. Bayes rule: $P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$, where $P(x|y)$ is the likelihood and $\pi(y)$ is the prior. One approach: Approximate Bayesian Computation (ABC). ABC generates a sample from $p(y|x^*)$ as follows: 1. generate a sample $y_t$ from the prior $\pi$; 2. generate a sample $x_t$ from $P(X|y_t)$; 3. if $D(x^*, x_t) < \tau$, accept $y = y_t$, otherwise reject; 4. go to (1). In step 3, $D$ is a distance measure and $\tau$ is a tolerance parameter.
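To make the ABC steps concrete, here is a minimal rejection-sampling sketch in Python for a toy problem (scalar Gaussian prior and likelihood); the prior, likelihood, distance and tolerance are illustrative choices, not those used elsewhere in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: prior y ~ N(0, 1), likelihood x | y ~ N(y, 0.5^2).
# We observe x_star and want samples from the posterior p(y | x_star).
x_star = 1.2
tol = 0.1          # tolerance tau
n_accept = 500     # number of accepted posterior samples we want

def sample_prior():
    return rng.normal(0.0, 1.0)

def sample_likelihood(y):
    return rng.normal(y, 0.5)

posterior_samples = []
while len(posterior_samples) < n_accept:
    y_t = sample_prior()                 # 1. draw y_t from the prior
    x_t = sample_likelihood(y_t)         # 2. draw x_t from P(X | y_t)
    if abs(x_t - x_star) < tol:          # 3. accept if D(x_star, x_t) < tau
        posterior_samples.append(y_t)    # 4. otherwise reject and repeat

print("posterior mean estimate:", np.mean(posterior_samples))
```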

6 Motivating example 2: simple Gaussian case. $p(x,y)$ is $N\big((0, \mathbf{1}_d^\top)^\top, V\big)$ with $V$ a randomly generated covariance. Posterior mean on $x$: ABC vs. kernel approach. [Figure: CPU time (sec) vs. mean squared error, 6 dimensions, comparing KBI, COND, and ABC.]

7 Overview. Introduction to reproducing kernel Hilbert spaces: Hilbert space; kernels and feature spaces; reproducing property; mapping probabilities to feature space.

8 Overview. Introduction to reproducing kernel Hilbert spaces: Hilbert space; kernels and feature spaces; reproducing property; mapping probabilities to feature space. Nonparametric Bayesian inference: learning conditional probabilities (smooth regression to an RKHS); kernelized Bayesian inference.

9 Functions in a reproducing kernel Hilbert space Kernel methods can control smoothness and avoid overfitting/underfitting.

10 Hilbert space. Inner product: let $\mathcal{H}$ be a vector space over $\mathbb{R}$. A function $\langle \cdot,\cdot\rangle_{\mathcal{H}} : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ is an inner product on $\mathcal{H}$ if: 1. linear, $\langle \alpha_1 f_1 + \alpha_2 f_2, g\rangle_{\mathcal{H}} = \alpha_1\langle f_1,g\rangle_{\mathcal{H}} + \alpha_2\langle f_2,g\rangle_{\mathcal{H}}$; 2. symmetric, $\langle f,g\rangle_{\mathcal{H}} = \langle g,f\rangle_{\mathcal{H}}$; 3. $\langle f,f\rangle_{\mathcal{H}} \ge 0$, and $\langle f,f\rangle_{\mathcal{H}} = 0$ if and only if $f = 0$.

11 Hilbert space. Inner product (definition as on slide 10). Norm induced by the inner product: $\|f\|_{\mathcal{H}} := \sqrt{\langle f,f\rangle_{\mathcal{H}}}$.

12 Hilbert space. Inner product and norm (as on slides 10-11). Hilbert space: an inner product space containing the limits of all its Cauchy sequences.

13 Kernel. Kernel: let $\mathcal{X}$ be a non-empty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel if there exists an $\mathbb{R}$-Hilbert space $\mathcal{H}$ and a map $\varphi : \mathcal{X} \to \mathcal{H}$ such that for all $x, x' \in \mathcal{X}$, $k(x,x') := \langle \varphi_x, \varphi_{x'}\rangle_{\mathcal{H}}$. Almost no conditions on $\mathcal{X}$ (e.g., $\mathcal{X}$ itself doesn't need an inner product; e.g., documents). A single kernel can correspond to several possible feature maps. A trivial example for $\mathcal{X} := \mathbb{R}$: $\varphi^{(1)}_x = x$ and $\varphi^{(2)}_x = \begin{bmatrix} x/\sqrt{2} \\ x/\sqrt{2} \end{bmatrix}$.

14 Finite dim. RKHS with polynomial features. Example: a three dimensional space of features of points in $\mathbb{R}^2$: $\varphi : \mathbb{R}^2 \to \mathbb{R}^3$, $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \varphi_x = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix}$, with kernel $k(x,y) = x_1 y_1 + x_2 y_2 + x_1 x_2\, y_1 y_2$ (the standard inner product in $\mathbb{R}^3$ between features). Denote this feature space by $\mathcal{H}$.

15 Finite dim. RKHS with polynomial features. Define a linear function of the inputs $x_1, x_2$ and their product $x_1 x_2$: $f(x) = f_1 x_1 + f_2 x_2 + f_3 x_1 x_2$. $f$ lives in a space of functions mapping from $\mathcal{X} = \mathbb{R}^2$ to $\mathbb{R}$. Equivalent representation for $f$: $f(\cdot) = \begin{bmatrix} f_1 & f_2 & f_3 \end{bmatrix}^\top$. Here $f(\cdot)$ refers to the function as an object (here a vector in $\mathbb{R}^3$), while $f(x) \in \mathbb{R}$ is the function evaluated at a point (a real number).

16 Finite dim. RKHS with polynomial features. [As on slide 15.] $f(x) = f(\cdot)^\top \varphi_x = \langle f(\cdot), \varphi_x\rangle_{\mathcal{H}}$: evaluation of $f$ at $x$ is an inner product in feature space (here the standard inner product in $\mathbb{R}^3$). $\mathcal{H}$ is a space of functions mapping $\mathbb{R}^2$ to $\mathbb{R}$.

17 Finite dim. RKHS with polynomial features. $\varphi_y$ is the feature of $y$ (a vector in $\mathbb{R}^3$), which also parametrizes a function mapping $\mathbb{R}^2$ to $\mathbb{R}$: $k(\cdot,y) := \begin{bmatrix} y_1 & y_2 & y_1 y_2\end{bmatrix}^\top = \varphi_y$. Given $y$, there is a vector $k(\cdot,y)$ in $\mathcal{H}$ such that $\langle k(\cdot,y), \varphi_x\rangle_{\mathcal{H}} = a x_1 + b x_2 + c\, x_1 x_2$, where $a = y_1$, $b = y_2$, and $c = y_1 y_2$.

18 Finite dim. RKHS with polynomial features. [As on slide 17.] By symmetry, $\langle k(\cdot,x), \varphi_y\rangle_{\mathcal{H}} = x_1 y_1 + x_2 y_2 + x_1 x_2\, y_1 y_2 = k(x,y)$. We can therefore write $\varphi_x = k(\cdot,x)$ and $\varphi_y = k(\cdot,y)$ without ambiguity: the canonical feature map.
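A small numerical check of the polynomial example (illustrative code, not part of the slides): with the feature map $\varphi_x = (x_1, x_2, x_1 x_2)$, evaluation of an RKHS function is an inner product with $\varphi_x$, and the inner product between feature maps reproduces the kernel.

```python
import numpy as np

def phi(x):
    """Feature map R^2 -> R^3 from the slides: (x1, x2, x1*x2)."""
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, y):
    """Kernel = standard inner product in R^3 between features."""
    return phi(x) @ phi(y)

x = np.array([0.3, -1.0])
y = np.array([2.0, 0.5])

# A function f(x) = f1*x1 + f2*x2 + f3*x1*x2, represented by its coefficient vector.
f = np.array([1.0, -2.0, 0.5])

# Evaluation of f at x is an inner product with the feature map of x.
assert np.isclose(f @ phi(x), 1.0 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[1])

# The canonical feature map k(., y) = phi_y reproduces the kernel.
assert np.isclose(phi(x) @ phi(y), k(x, y))
print("f(x) =", f @ phi(x), "  k(x, y) =", k(x, y))
```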

19 The reproducing property. This example illustrates the two defining features of an RKHS. The reproducing property: for all $x \in \mathcal{X}$ and all $f(\cdot) \in \mathcal{H}$, $\langle f(\cdot), k(\cdot,x)\rangle_{\mathcal{H}} = f(x)$ ...or, in shorter notation, $\langle f, \varphi_x\rangle_{\mathcal{H}}$. In particular, for any $x, y \in \mathcal{X}$, $k(x,y) = \langle k(\cdot,x), k(\cdot,y)\rangle_{\mathcal{H}}$. Note: the feature map of every point is in the feature space: for all $x \in \mathcal{X}$, $k(\cdot,x) = \varphi_x \in \mathcal{H}$.

20 Infinite dimensional feature space. Reproducing property for a function defined with the Gaussian kernel: $f(x) := \sum_{i=1}^m \alpha_i k(x_i,x) = \left\langle \sum_{i=1}^m \alpha_i \varphi_{x_i},\ \varphi_x \right\rangle_{\mathcal{H}}$. [Figure: plot of $f(x)$ against $x$.]

21 Infinite dimensional feature space. [As on slide 20.] What do the features $\varphi_x$ look like (there are infinitely many of them, and they are not unique!)? What do these features have to do with smoothness?

22 Infinite dimensional feature space. Gaussian kernel: $k(x,x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$. Its eigenvalues decay geometrically, $\lambda_k \propto b^k$ with $b < 1$, and the eigenfunctions are $e_k(x) \propto \exp\!\big(-(c-a)x^2\big)\, H_k(x\sqrt{2c})$, where $a, b, c$ are functions of $\sigma$ and $H_k$ is the $k$th-order Hermite polynomial. Then $k(x,x') = \sum_{i=1}^\infty \lambda_i\, e_i(x)\, e_i(x') = \sum_{i=1}^\infty \big(\sqrt{\lambda_i}\, e_i(x)\big)\big(\sqrt{\lambda_i}\, e_i(x')\big)$, an inner product between the features $\varphi_x$ and $\varphi_{x'}$.

23 Infinite dimensional feature space (Mercer). Let $\mathcal{X}$ be a compact metric space, $k$ a continuous kernel, and $\mu$ a finite Borel measure with $\mathrm{supp}\{\mu\} = \mathcal{X}$. Then the convergence of $k(x,y) = \sum_j \lambda_j\, e_j(x)\, e_j(y)$ is absolute and uniform (each $e_j$ is the continuous representative of its $L_2$ equivalence class). The feature map is $\varphi_x = \big[\,\ldots\ \sqrt{\lambda_i}\, e_i(x)\ \ldots\,\big]$.

24 Infinite dimensional feature space (Mercer RKHS) (Steinwart and Christmann, Theorem 4.51). Under the assumptions of Mercer's theorem, $\mathcal{H} := \left\{ \sum_i a_i \sqrt{\lambda_i}\, e_i \ :\ (a_i) \in \ell^2 \right\}$ is an RKHS with kernel $k$. Given two functions in the RKHS, $f(x) := \sum_i a_i \sqrt{\lambda_i}\, e_i(x)$ and $g(x) := \sum_i b_i \sqrt{\lambda_i}\, e_i(x)$, the inner product is $\langle f, g\rangle_{\mathcal{H}} = \sum_i a_i b_i$.
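As an aside (not on the slides), the rapid decay of the Gaussian-kernel eigenvalues can be seen empirically: the eigenvalues of the Gram matrix divided by the sample size are a rough finite-sample proxy for the Mercer eigenvalues. A sketch, with an arbitrary bandwidth and sampling measure:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
n = 200
x = rng.normal(0.0, 1.0, size=n)          # points drawn from a Gaussian measure mu

# Gaussian kernel Gram matrix K_ij = exp(-(x_i - x_j)^2 / (2 sigma^2))
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

# Eigenvalues of K / n, sorted in decreasing order, approximate the Mercer
# eigenvalues lambda_i (a standard finite-sample proxy, used here only as an illustration).
eigvals = np.linalg.eigvalsh(K / n)[::-1]
print("largest ten eigenvalues:", np.round(eigvals[:10], 4))
print("ratio lambda_10 / lambda_1:", eigvals[9] / eigvals[0])   # rapid decay
```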

25 Infinite dimensional feature space. Proof: there are two aspects requiring care: 1. Is $k(x,\cdot) \in \mathcal{H}$ for all $x \in \mathcal{X}$? (Requires Mercer's theorem.) 2. Does the reproducing property hold, $\langle f, k(\cdot,x)\rangle_{\mathcal{H}} = f(x)$?

26 Infinite dimensional feature space. Proof, first part: by the definition of $\mathcal{H}$, the function in $\mathcal{H}$ indexed by $x$ is $k(x,\cdot) = \sum_i \big(\sqrt{\lambda_i}\, e_i(x)\big)\big(\sqrt{\lambda_i}\, e_i(\cdot)\big)$. Is this function in the RKHS? Yes, if the $\ell^2$ norm of $\big(\sqrt{\lambda_i}\, e_i(x)\big)_i$ is bounded. This is due to Mercer: for all $x \in \mathcal{X}$, $\|k(x,\cdot)\|^2_{\mathcal{H}} = \big\|\big(\sqrt{\lambda_i}\, e_i(x)\big)_i\big\|^2_{\ell^2} = k(x,x) < \infty$.

27 Infinite dimensional feature space. Proof (continued), second part: the reproducing property holds. Using the inner product definition, $\langle f, k(x,\cdot)\rangle_{\mathcal{H}} = \sum_i a_i \big(\sqrt{\lambda_i}\, e_i(x)\big) = f(x)$, which is always well defined since the coefficient sequences of both $f$ and $k(x,\cdot)$ are in $\ell^2$.

28 Infinite dimensional feature space. Example RKHS function, Gaussian kernel: $f(x) := \sum_{i=1}^m \alpha_i k(x_i,x) = \sum_{i=1}^m \alpha_i \sum_{j=1}^\infty \lambda_j\, e_j(x_i)\, e_j(x) = \sum_{j=1}^\infty \big(\sqrt{\lambda_j}\, f_j\big)\, e_j(x)$, where $f_j = \sum_{i=1}^m \alpha_i \sqrt{\lambda_j}\, e_j(x_i)$. [Figure: plot of $f(x)$ against $x$.] NOTE that this enforces smoothing: the $\lambda_j$ decay as the $e_j$ become rougher, and the $f_j$ decay since $\|f\|^2_{\mathcal{H}} = \sum_j f_j^2 < \infty$.
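A brief sketch of such an RKHS function in code (centres, coefficients, and bandwidth are arbitrary choices): $f$ is a smooth weighted sum of Gaussian bumps, and its squared RKHS norm is the quadratic form $\alpha^\top K \alpha$, which equals $\sum_j f_j^2$ in the Mercer basis.

```python
import numpy as np

sigma = 0.5

def k(a, b):
    """Gaussian kernel on scalars (broadcasts over arrays)."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

centres = np.array([-1.0, 0.0, 0.7, 2.0])    # the x_i
alpha = np.array([1.0, -0.5, 2.0, 0.3])      # the alpha_i

def f(x):
    """RKHS function f(x) = sum_i alpha_i k(x_i, x): a smooth mixture of bumps."""
    return sum(a * k(c, x) for a, c in zip(alpha, centres))

# Squared RKHS norm: ||f||_H^2 = sum_{i,j} alpha_i alpha_j k(x_i, x_j) = alpha' K alpha
K = k(centres[:, None], centres[None, :])
norm_sq = alpha @ K @ alpha
print("f(0.2) =", f(0.2), "  ||f||_H^2 =", norm_sq)
```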

29 Reproducing kernel Hilbert space. Let $\mathcal{H}$ be a Hilbert space of $\mathbb{R}$-valued functions on a non-empty set $\mathcal{X}$. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a reproducing kernel of $\mathcal{H}$, and $\mathcal{H}$ is a reproducing kernel Hilbert space, if: for all $x \in \mathcal{X}$, $k(\cdot,x) \in \mathcal{H}$; and for all $x \in \mathcal{X}$, $f \in \mathcal{H}$, $\langle f(\cdot), k(\cdot,x)\rangle_{\mathcal{H}} = f(x)$ (the reproducing property). In particular, for any $x,y \in \mathcal{X}$, $k(x,y) = \langle k(\cdot,x), k(\cdot,y)\rangle_{\mathcal{H}}$. Original definition: a kernel is an inner product between feature maps; then $\varphi_x = k(\cdot,x)$ is a valid feature map.

30 Probabilities in feature space: the mean trick. The kernel trick: given $x \in \mathcal{X}$ for some set $\mathcal{X}$, define the feature map $\varphi_x \in \mathcal{H}$, $\varphi_x = \big[\,\ldots\ \sqrt{\lambda_i}\, e_i(x)\ \ldots\,\big] \in \ell^2$. For positive definite $k(x,x')$, $k(x,x') = \langle \varphi_x, \varphi_{x'}\rangle_{\mathcal{H}}$. The kernel trick: for all $f \in \mathcal{H}$, $f(x) = \langle f, \varphi_x\rangle_{\mathcal{H}}$.

31 Probabilities in feature space: the mean trick. [Kernel trick as on slide 30.] The mean trick: given $P$ a Borel probability measure on $\mathcal{X}$, define the feature map $\mu_P \in \mathcal{H}$, $\mu_P = \big[\,\ldots\ \sqrt{\lambda_i}\, \mathbf{E}_P[e_i(X)]\ \ldots\,\big] \in \ell^2$. For positive definite $k(x,x')$, $\mathbf{E}_{P,Q}\, k(X,Y) = \langle \mu_P, \mu_Q\rangle_{\mathcal{H}}$ for $X \sim P$ and $Y \sim Q$. The mean trick (we call $\mu_P$ a mean/distribution embedding): $\mathbf{E}_P[f(X)] = \mathbf{E}_P\langle \varphi_X, f\rangle_{\mathcal{H}}$.

32 Feature embeddings of probabilities. The kernel trick: $f(x) = \langle f, \varphi_x\rangle_{\mathcal{H}}$. The mean trick: $\mathbf{E}_P[f(X)] = \langle \mu_P, f\rangle_{\mathcal{H}}$. Empirical mean embedding: $\hat\mu_P = \frac{1}{m}\sum_{i=1}^m \varphi_{x_i}$, with $x_i$ i.i.d. $\sim P$. $\mu_P$ gives you expectations of all RKHS functions. When $k$ is characteristic, $\mu_P$ uniquely determines $P$ (e.g., Gauss, Laplace, ...).
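A minimal illustration of the mean trick (the toy distribution, kernel, and test function are assumptions of this sketch, not from the slides): the inner product of the empirical mean embedding with an RKHS function $f$ is simply the sample average of $f$, and it approximates $\mathbf{E}_P f(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0

def k(a, b):
    """Gaussian kernel (broadcasts over arrays)."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

# Samples x_i ~ P (here P = N(1, 0.5^2)), defining mu_hat_P = (1/m) sum_i phi(x_i)
x = rng.normal(1.0, 0.5, size=2000)

# An RKHS function f = sum_j alpha_j k(., z_j)
z = np.array([0.0, 1.0, 2.0])
alpha = np.array([0.5, -1.0, 2.0])

def f(t):
    return k(t[..., None], z) @ alpha

# <mu_hat_P, f>_H = (1/m) sum_i f(x_i): expectations of RKHS functions
# are inner products with the mean embedding.
via_embedding = np.mean(f(x))
# Monte Carlo estimate of E_P f(X) from fresh samples, for comparison.
via_fresh_mc = np.mean(f(rng.normal(1.0, 0.5, size=200000)))
print(via_embedding, via_fresh_mc)   # the two should be close
```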

33 Nonparametric Bayesian inference using distribution embeddings

34 Bayes again. Bayes rule: $P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$, where $P(x|y)$ is the likelihood and $\pi$ is the prior. How would this look with kernel embeddings?

35 Bayes again. [As on slide 34.] Define an RKHS $\mathcal{G}$ on $\mathcal{Y}$ with feature map $\psi_y$ and kernel $l(y,\cdot)$. We need a conditional mean embedding: for all $g \in \mathcal{G}$, $\mathbf{E}_{Y|x}\, g(Y) = \langle g, \mu_{P(y|x)}\rangle_{\mathcal{G}}$. This will be obtained by RKHS-valued ridge regression.

36 Ridge regression and the conditional feature mean. Ridge regression from $\mathcal{X} := \mathbb{R}^d$ to a finite vector output $\mathcal{Y} := \mathbb{R}^{d'}$ (these could be $d'$ nonlinear features of $y$). Define training data $X = \begin{bmatrix} x_1 & \ldots & x_m \end{bmatrix} \in \mathbb{R}^{d \times m}$ and $Y = \begin{bmatrix} y_1 & \ldots & y_m \end{bmatrix} \in \mathbb{R}^{d' \times m}$.

37 Ridge regression and the conditional feature mean. [Training data as on slide 36.] Solve $\hat A = \arg\min_{A \in \mathbb{R}^{d' \times d}} \big( \|Y - AX\|^2 + \lambda \|A\|^2_{\mathrm{HS}} \big)$, where $\|A\|^2_{\mathrm{HS}} = \mathrm{tr}(A^\top A) = \sum_{i=1}^{\min\{d,d'\}} \gamma^2_{A,i}$ (the squared singular values of $A$).

38 Ridge regression and the conditional feature mean. [As on slide 37.] Solution: $\hat A = C_{YX}(C_{XX} + m\lambda I)^{-1}$.

39 Ridge regression and the conditional feature mean. Prediction at a new point $x$: $\hat y = \hat A x = C_{YX}(C_{XX} + m\lambda I)^{-1} x = \sum_{i=1}^m \beta_i(x)\, y_i$, where $\beta(x) = (K + \lambda m I)^{-1} \begin{bmatrix} k(x_1,x) & \ldots & k(x_m,x) \end{bmatrix}^\top$, $K := X^\top X$, and $k(x_1,x) = x_1^\top x$.
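The primal and dual (kernel) forms of this prediction can be checked numerically. In the sketch below the covariances are taken as unnormalised data-matrix products, $C_{YX} = YX^\top$ and $C_{XX} = XX^\top$ (an assumed scaling convention), so that the $m\lambda$ regularisation on $C_{XX}$ corresponds exactly to the $\lambda m I$ term on the Gram matrix $K = X^\top X$; dimensions, data, and $\lambda$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_out, m, lam = 5, 3, 50, 0.1

X = rng.normal(size=(d, m))        # training inputs, one column per example
Y = rng.normal(size=(d_out, m))    # training outputs, one column per example

# Primal ridge solution (slide notation, with C_YX = Y X^T and C_XX = X X^T):
A = Y @ X.T @ np.linalg.inv(X @ X.T + m * lam * np.eye(d))

# Dual / kernel form: K = X^T X, beta(x) = (K + lam m I)^{-1} k_X(x),
# and the prediction is a weighted combination of the training outputs y_i.
K = X.T @ X
x_new = rng.normal(size=d)
beta = np.linalg.solve(K + lam * m * np.eye(m), X.T @ x_new)

primal_pred = A @ x_new
dual_pred = Y @ beta               # sum_i beta_i(x) y_i
print(np.allclose(primal_pred, dual_pred))   # True: both forms agree
```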

40 Ridge regression and the conditional feature mean. [Prediction formula as on slide 39.] What if we do everything in kernel space?

41 Ridge regression and the conditional feature mean. Recall our setup. Given training pairs $(x_i, y_i) \sim P_{XY}$; an RKHS $\mathcal{F}$ on $\mathcal{X}$ with feature map $\varphi_x$ and kernel $k(x,\cdot)$; an RKHS $\mathcal{G}$ on $\mathcal{Y}$ with feature map $\psi_y$ and kernel $l(y,\cdot)$. We define the covariances between feature maps, $C_{XX} = \mathbf{E}_X(\varphi_X \otimes \varphi_X)$ and $C_{XY} = \mathbf{E}_{XY}(\varphi_X \otimes \psi_Y)$, and matrices of feature-mapped training data $X = \begin{bmatrix} \varphi_{x_1} & \ldots & \varphi_{x_m} \end{bmatrix}$, $Y := \begin{bmatrix} \psi_{y_1} & \ldots & \psi_{y_m} \end{bmatrix}$.

42 Ridge regression and the conditional feature mean. Objective [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]: $\hat A = \arg\min_{A \in \mathrm{HS}(\mathcal{F},\mathcal{G})} \big( \mathbf{E}_{XY}\, \|\psi_Y - A\varphi_X\|^2_{\mathcal{G}} + \lambda \|A\|^2_{\mathrm{HS}} \big)$, with $\|A\|^2_{\mathrm{HS}} = \sum_i \gamma^2_{A,i}$. Solution, same as the vector case: $\hat A = C_{YX}(C_{XX} + m\lambda I)^{-1}$. Prediction at a new $x$ using kernels: $\hat A \varphi_x = \begin{bmatrix} \psi_{y_1} & \ldots & \psi_{y_m} \end{bmatrix} (K + \lambda m I)^{-1} \begin{bmatrix} k(x_1,x) & \ldots & k(x_m,x) \end{bmatrix}^\top = \sum_{i=1}^m \beta_i(x)\, \psi_{y_i}$, where $K_{ij} = k(x_i,x_j)$.
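A sketch of the fully kernelized conditional mean embedding on toy data (the joint distribution, kernels, bandwidths, regularisation, and the choice of $g$ are all illustrative assumptions): the weights $\beta(x)$ come from the Gram matrix on the inputs, and $\mathbf{E}[g(Y)|X=x]$ is estimated by $\sum_i \beta_i(x)\, g(y_i)$, which we compare against a Monte Carlo estimate from the known conditional.

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam = 500, 1e-3
sx, sy = 0.5, 0.5                      # kernel bandwidths on X and Y

def kx(a, b):
    """Gaussian kernel k on inputs; a, b are 1-D arrays."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sx ** 2))

# Joint samples: X ~ N(0,1), Y | X=x ~ N(sin(x), 0.1^2)
x = rng.normal(0.0, 1.0, size=m)
y = rng.normal(np.sin(x), 0.1)

# g is an RKHS function on Y: g(.) = l(., 0.5) with a Gaussian kernel l
def g(t):
    return np.exp(-(t - 0.5) ** 2 / (2 * sy ** 2))

# Conditional mean embedding weights beta(x*) = (K + lam m I)^{-1} k_X(x*)
x_star = 1.0
K = kx(x, x)
beta = np.linalg.solve(K + lam * m * np.eye(m), kx(x, np.array([x_star]))[:, 0])

est = beta @ g(y)                              # sum_i beta_i(x*) g(y_i)
truth = np.mean(g(rng.normal(np.sin(x_star), 0.1, size=100000)))  # Monte Carlo
print("embedding estimate:", est, " Monte Carlo:", truth)  # roughly comparable
```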

43 Ridge regression and the conditional feature mean. How is the loss $\|\psi_Y - A\varphi_X\|^2_{\mathcal{G}}$ relevant to the conditional expectation $\mathbf{E}_{Y|x}\, g(Y)$? Define [Song et al. (2009), Grunewalder et al. (2013)]: $\mu_{Y|x} := A\varphi_x$.

44 Ridge regression and the conditional feature mean. [As on slide 43.] We need $A$ to have the property $\mathbf{E}_{Y|x}\, g(Y) \approx \langle g, \mu_{Y|x}\rangle_{\mathcal{G}} = \langle g, A\varphi_x\rangle_{\mathcal{G}} = \langle A^* g, \varphi_x\rangle_{\mathcal{F}} = (A^* g)(x)$.

45 Ridge regression and the conditional feature mean. [As on slide 44.] Natural risk function for the conditional mean: $L(A, P_{XY}) := \sup_{\|g\|\le 1} \mathbf{E}_X \Big[ \underbrace{\big(\mathbf{E}_{Y|X}\, g(Y)\big)(X)}_{\text{Target}} - \underbrace{(A^* g)(X)}_{\text{Estimator}} \Big]^2$.

46 Ridge regression and the conditional feature mean. The squared-loss risk provides an upper bound on the natural risk: $L(A, P_{XY}) \le \mathbf{E}_{XY}\, \|\psi_Y - A\varphi_X\|^2_{\mathcal{G}}$.

47 Ridge regression and the conditional feature mean. [Bound as on slide 46.] Proof (Jensen and Cauchy-Schwarz): $L(A, P_{XY}) := \sup_{\|g\|\le 1} \mathbf{E}_X \big[ \big(\mathbf{E}_{Y|X}\, g(Y)\big)(X) - (A^* g)(X) \big]^2 \le \mathbf{E}_{XY} \sup_{\|g\|\le 1} \big[ g(Y) - (A^* g)(X) \big]^2$.

48 Ridge regression and the conditional feature mean. [Continuing the proof:] $= \mathbf{E}_{XY} \sup_{\|g\|\le 1} \big[ \langle g, \psi_Y\rangle_{\mathcal{G}} - \langle A^* g, \varphi_X\rangle_{\mathcal{F}} \big]^2$.

49 Ridge regression and the conditional feature mean. [Continuing the proof:] $= \mathbf{E}_{XY} \sup_{\|g\|\le 1} \big[ \langle g, \psi_Y\rangle_{\mathcal{G}} - \langle g, A\varphi_X\rangle_{\mathcal{G}} \big]^2$.

50 Ridge regression and the conditional feature mean. [Continuing the proof:] $= \mathbf{E}_{XY} \sup_{\|g\|\le 1} \langle g,\ \psi_Y - A\varphi_X\rangle^2_{\mathcal{G}}$.

51 Ridge regression and the conditional feature mean. [Continuing the proof:] $\le \mathbf{E}_{XY}\, \|\psi_Y - A\varphi_X\|^2_{\mathcal{G}}$.

52 Ridge regression and the conditional feature mean. [Bound and proof as on slides 46-51.] If we assume $\mathbf{E}_Y[g(Y)\,|\,X = x] \in \mathcal{F}$, then the upper bound is tight (next slide).

53 Conditions for ridge regression = conditional mean. Proof that the conditional mean is obtained by ridge regression when $\mathbf{E}_Y[g(Y)\,|\,X = x] \in \mathcal{F}$. Given a function $g \in \mathcal{G}$, assume $\mathbf{E}_{Y|X}[g(Y)\,|\,X = \cdot] \in \mathcal{F}$. Then $C_{XX}\, \mathbf{E}_{Y|X}[g(Y)\,|\,X = \cdot] = C_{XY}\, g$.

54 Conditions for ridge regression = conditional mean. [Claim as on slide 53.] Proof [Fukumizu et al., 2004]: for all $f \in \mathcal{F}$, by the definition of $C_{XX}$, $\big\langle f,\ C_{XX}\, \mathbf{E}_{Y|X}[g(Y)\,|\,X = \cdot] \big\rangle_{\mathcal{F}} = \mathrm{cov}\big( f,\ \mathbf{E}_{Y|X}[g(Y)\,|\,X = \cdot] \big) = \mathbf{E}_X\big( f(X)\, \mathbf{E}_{Y|X}[g(Y)\,|\,X] \big) = \mathbf{E}_{XY}\big( f(X)\, g(Y) \big) = \langle f, C_{XY}\, g\rangle_{\mathcal{F}}$, by the definition of $C_{XY}$.

55 Kernel Bayes law. Prior: $Y \sim \pi(y)$. Likelihood: $(X\,|\,y) \sim P(x\,|\,y)$, with some joint $P(x,y)$.

56 Kernel Bayes law. [As on slide 55.] Joint distribution: $Q(x,y) = P(x\,|\,y)\, \pi(y)$. Warning: $Q \ne P$, a change of measure from $P(y)$ to $\pi(y)$. Marginal for $x$: $Q(x) := \int P(x\,|\,y)\, \pi(y)\, dy$. Bayes law: $Q(y\,|\,x) = \frac{P(x\,|\,y)\, \pi(y)}{Q(x)}$.

57 Kernel Bayes law. Posterior embedding via the usual conditional update: $\mu_{Q(y|x)} = C_{Q(y,x)}\, C_{Q(x,x)}^{-1}\, \varphi_x$.

58 Kernel Bayes law. [As on slide 57.] Given the mean embedding of the prior, $\mu_{\pi(y)}$: define the marginal covariance $C_{Q(x,x)} = \int (\varphi_x \otimes \varphi_x)\, P(x\,|\,y)\, \pi(y)\, dx\, dy = C_{(XX)Y}\, C_{YY}^{-1}\, \mu_{\pi(y)}$, and the cross-covariance $C_{Q(y,x)} = \int (\psi_y \otimes \varphi_x)\, P(x\,|\,y)\, \pi(y)\, dx\, dy = C_{(YX)Y}\, C_{YY}^{-1}\, \mu_{\pi(y)}$.

59 Kernel Bayes law: consistency result. How to compute the posterior expectation from data? Given samples $\{(x_i, y_i)\}_{i=1}^n$ from $P_{XY}$ and $\{u_j\}_{j=1}^n$ from the prior $\pi$, we want to compute $\mathbf{E}[g(Y)\,|\,X = x]$ for $g \in \mathcal{G}$. For any $x \in \mathcal{X}$, $\big|\, \mathbf{g}_y^\top R_{Y|X}\, \mathbf{k}_X(x) - \mathbf{E}[g(Y)\,|\,X = x] \,\big| = O_p\big(n^{-4/27}\big)$ as $n \to \infty$, where $\mathbf{g}_y = (g(y_1), \ldots, g(y_n))^\top \in \mathbb{R}^n$, $\mathbf{k}_X(x) = (k(x_1,x), \ldots, k(x_n,x))^\top \in \mathbb{R}^n$, and $R_{Y|X}$ is learned from the samples (it contains the $u_j$). Smoothness assumptions: $\pi / p_Y \in \mathcal{R}(C_{YY}^{1/2})$, where $p_Y$ is the p.d.f. of $P_Y$, and $\mathbf{E}[g(Y)\,|\,X = \cdot] \in \mathcal{R}\big(C_{Q(xx)}^2\big)$.

60 Experiment: Kernel Bayes law vs EKF

61 Experiment: Kernel Bayes law vs EKF. Compare with the extended Kalman filter (EKF) on the camera orientation task: 3600 downsampled frames of RGB pixels ($Y_t \in [0,1]^{1200}$); 1800 training frames, the remainder for test; Gaussian noise added to $Y_t$.

62 Experiment: Kernel Bayes law vs EKF. [Setup as on slide 61.] Average MSE and standard errors (10 runs), comparing KBR (Gauss), KBR (Tr), Kalman (9 dim.), and Kalman (Quat.) at two noise levels $\sigma^2$. [Table values not recovered in the transcription, apart from standard errors of $\pm 0.023$ and $\pm 0.022$ in the final column.]

63 Overview. Introduction to reproducing kernel Hilbert spaces: Hilbert space; kernels and feature spaces; reproducing property. Nonparametric Bayesian inference: mapping probabilities to feature space; learning conditional probabilities (smooth regression to an RKHS); kernelized Bayesian inference.

64 Co-authors. From UCL: Luca Baldassarre, Steffen Grunewalder, Guy Lever, Sam Patterson, Massimiliano Pontil. External: Kenji Fukumizu (ISM), Bernhard Schoelkopf (MPI), Alex Smola (Google/CMU), Le Song (Georgia Tech), Bharath Sriperumbudur (Penn State).

65 Questions?

66 Selected references. Characteristic kernels and mean embeddings: Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT. Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR. Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR. Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics. Two-sample, independence, conditional independence tests: Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS. Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence. NIPS. Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS. Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

67 Selected references (continued) Conditional mean embedding, RKHS-valued regression: Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., and Vapnik, V., (2003). Kernel Dependency Estimation, NIPS. Micchelli, C., and Pontil, M., (2005). On Learning Vector-Valued Functions. Neural Computation. Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics. Song, L., and Huang, J., and Smola, A., Fukumizu, K., (2009). Hilbert Space Embeddings of Conditional Distributions. ICML. Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML. Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML. Kernel Bayes rule: Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine. Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite kernels, JMLR

68

69 References. K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73-99, 2004. L. Song, J. Huang, A. J. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions. In Proceedings of the International Conference on Machine Learning, 2009.
