Kernel Bayes Rule: Nonparametric Bayesian inference with kernels
Kenji Fukumizu, The Institute of Statistical Mathematics
NIPS 2012 Workshop "Confluence between Kernel Methods and Graphical Models", December 8, 2012 @ Lake Tahoe
Introduction
A new kernel methodology for nonparametric inference: kernel means are used to represent and manipulate the probabilities of variables. Nonparametric Bayesian inference is also possible!
- Completely nonparametric.
- Computation is done by linear algebra with Gram matrices.
- Different from Bayesian nonparametrics.
Today's main topic.
Outline
1. Kernel mean: a method for nonparametric inference
2. Representing conditional probability
3. Kernel Bayes Rule and its applications
4. Conclusions
Kernel mean: representing probabilities
Classical nonparametric methods for representing probabilities:
- Kernel density estimation: $\hat{p}_n(x) = \frac{1}{n h_n^d}\sum_{i=1}^n K((x - X_i)/h_n)$
- Characteristic function: $\mathrm{Ch.f}_X(u) = E[e^{iu^T X}]$, estimated by $\widehat{\mathrm{Ch.f}}_X(u) = \frac{1}{n}\sum_{i=1}^n e^{iu^T X_i}$

New alternative: kernel mean.
$X$: random variable taking values on $\Omega$, with probability $P$. $k$: positive definite kernel on $\Omega$; $H_k$: RKHS associated with $k$.
Def. Kernel mean of $X$ on $H_k$: $m_P \equiv E[\Phi(X)] = \int k(\cdot, x)\, dP(x) \in H_k$, where $\Phi(x) = k(\cdot, x)$ is the feature vector.
Empirical estimation: $\hat{m}_P = \frac{1}{n}\sum_{i=1}^n \Phi(X_i)$ for $X_1, \ldots, X_n \sim P$ i.i.d.
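The empirical kernel mean above can be sketched in a few lines of numpy. This is not from the slides; the function names and the Gaussian kernel choice are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    # Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def empirical_kernel_mean(X, sigma=1.0):
    # Returns the function u -> m_hat_P(u) = (1/n) sum_i k(u, X_i),
    # i.e., the empirical kernel mean evaluated at a point u.
    X = np.asarray(X, dtype=float)
    return lambda u: np.mean(
        [gauss_kernel(np.asarray(u, dtype=float), xi, sigma) for xi in X]
    )

# Example: i.i.d. sample from N(0, 1); m_hat is largest near the mode of P
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
m_hat = empirical_kernel_mean(X)
```

Evaluating `m_hat` at a point far from the data gives a value near zero, while points in high-density regions score high.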
Reproducing expectation: $\langle f, m_P \rangle = E[f(X)]$ for all $f \in H_k$.
The kernel mean carries information on the higher-order moments of $X$: e.g., for $k(u, x) = c_0 + c_1 (ux) + c_2 (ux)^2 + \cdots$ with $c_i \geq 0$ (e.g., $e^{ux}$),
$m_P(u) = c_0 + c_1 E[X] u + c_2 E[X^2] u^2 + \cdots$  (moment generating function).
Characteristic kernel (Fukumizu et al. JMLR 2004, AoS 2009; Sriperumbudur et al. JMLR 2010)
Def. A bounded measurable kernel $k$ is characteristic if $m_P = m_Q \Rightarrow P = Q$.
The kernel mean $m_P$ with a characteristic kernel $k$ uniquely determines the probability.
Examples: Gaussian and Laplace kernels (the polynomial kernel is not characteristic).
Analogous to the characteristic function $\mathrm{Ch.f}_X(u) = E[e^{iu^T X}]$, which uniquely determines the probability of $X$. Positive definite kernels give a better alternative:
- efficient computation by the kernel trick;
- applicable to non-vectorial data.
Nonparametric inference with kernels
Principle: with characteristic kernels, inference on $P$ $\Leftrightarrow$ inference on $m_P$.
- Two-sample test: $m_P = m_Q$? (Gretton et al. NIPS 2006, JMLR 2012, NIPS 2009, 2012)
- Independence test: $m_{XY} = m_X \otimes m_Y$? (Gretton et al. NIPS 2007)
- Bayesian inference: estimate the kernel mean of the posterior, given kernel representations of the prior and the conditional probability.
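The two-sample test above compares $\|\hat{m}_P - \hat{m}_Q\|_{H_k}^2$, the (squared) maximum mean discrepancy. A minimal numpy sketch, assuming a Gaussian kernel and the biased V-statistic estimator (function names are illustrative):

```python
import numpy as np

def gram(X, Y, sigma=1.0):
    # Gaussian Gram matrix K_ij = exp(-||X_i - Y_j||^2 / (2 sigma^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    # Squared MMD ||m_hat_P - m_hat_Q||^2_H, biased V-statistic:
    # mean(Kxx) + mean(Kyy) - 2 mean(Kxy)
    Kxx, Kyy, Kxy = gram(X, X, sigma), gram(Y, Y, sigma), gram(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (300, 1))       # sample from P = N(0, 1)
Y_same = rng.normal(0, 1, (300, 1))  # another sample from P
Y_diff = rng.normal(2, 1, (300, 1))  # sample from Q = N(2, 1)
```

With a characteristic kernel, `mmd2(X, Y_same)` is close to zero while `mmd2(X, Y_diff)` is clearly larger, which is the basis of the test statistic.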
Conventional approaches to nonparametric inference:
- Smoothing kernels (not necessarily positive definite): kernel density estimation, local polynomial fitting with $h^{-d} K(x/h)$.
- Characteristic function $E[e^{iu^T X}]$, etc.
Curse of dimensionality: e.g., smoothing kernels have difficulty in high (or even moderate) dimensions.
Kernel methods for nonparametric inference: what can we do? How robust are they to high dimensionality?
Conditional probabilities
Conditional kernel mean
Conditional probabilities are important in inference:
- graphical modeling: conditional independence / dependence;
- Bayesian inference.
Kernel mean of a conditional probability: $E[\Phi(Y) \mid X = x] = \int \Phi(y)\, p(y|x)\, dy$.
Question: how can we estimate it in the kernel framework? Accurate estimation of $p(y|x)$ is not easy; take a regression approach.
Covariance
$(X, Y)$: random vector taking values on $\Omega_X \times \Omega_Y$. $(H_X, k_X)$, $(H_Y, k_Y)$: RKHSs on $\Omega_X$ and $\Omega_Y$, resp.
Def. (Uncentered) covariance operators $C_{YX}: H_X \to H_Y$, $C_{XX}: H_X \to H_X$:
$C_{YX} = E[\Phi_Y(Y)\, \Phi_X(X)^T]$, $C_{XX} = E[\Phi_X(X)\, \Phi_X(X)^T]$.
Simply an extension of the covariance matrix (linear map) $V_{YX} = E[Y X^T]$.
Reproducing property: $\langle g, C_{YX} f \rangle = E[f(X)\, g(Y)]$ for all $f \in H_X$, $g \in H_Y$.
$C_{YX}$ can be identified with the kernel mean $E[k_Y(\cdot, Y) \otimes k_X(\cdot, X)]$ on the product space $H_Y \otimes H_X$.
Conditional kernel mean
Review: $X, Y$ Gaussian random variables (on $\mathbb{R}^m$, $\mathbb{R}^\ell$, resp.):
$\arg\min_{A \in \mathbb{R}^{\ell \times m}} \int \|Y - AX\|^2 \, dP(X, Y) = V_{YX} V_{XX}^{-1}$, and $E[Y \mid X = x] = V_{YX} V_{XX}^{-1} x$.
For general $X$ and $Y$:
$\arg\min_{F: H_X \to H_Y} \int \|\Phi_Y(Y) - F\, \Phi_X(X)\|_{H_Y}^2 \, dP(X, Y) = C_{YX} C_{XX}^{-1}$.
With a characteristic kernel $k_X$: $E[\Phi_Y(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi_X(x)$, the conditional kernel mean given $X = x$.
Empirical estimation
$\hat{E}[\Phi_Y(Y) \mid X = x] = \mathbf{k}_Y^T (G_X + n \varepsilon_n I_n)^{-1} \mathbf{k}_X(x)$,
where $\mathbf{k}_X(x) = (k_X(x, X_1), \ldots, k_X(x, X_n))^T \in \mathbb{R}^n$, $\mathbf{k}_Y = (k_Y(\cdot, Y_1), \ldots, k_Y(\cdot, Y_n))^T \in H_Y^n$, and $\varepsilon_n$ is a regularization coefficient.
Note: a joint sample $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P_{XY}$ is used to give the conditional kernel mean with $P_{Y|X}$.
c.f. kernel ridge regression: $\hat{E}[Y \mid X = x] = \mathbf{Y}^T (G_X + n \varepsilon_n I_n)^{-1} \mathbf{k}_X(x)$.
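The weight vector $(G_X + n\varepsilon_n I_n)^{-1}\mathbf{k}_X(x)$ above is all that is needed in practice: applied to $(f(Y_1), \ldots, f(Y_n))$ it estimates $E[f(Y)|X=x]$. A minimal numpy sketch (Gaussian kernels, illustrative bandwidth and regularization, hypothetical function names):

```python
import numpy as np

def gram(A, B, sigma):
    # Gaussian Gram matrix between row-wise point sets A and B
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def conditional_mean_weights(X, x, sigma=0.5, eps=1e-3):
    # Weights (G_X + n*eps*I)^{-1} k_X(x): the conditional kernel mean at X = x
    # is sum_i w_i Phi_Y(Y_i); equivalently E[f(Y)|X=x] is estimated by sum_i w_i f(Y_i).
    n = len(X)
    G = gram(X, X, sigma)
    kx = gram(X, x[None, :], sigma)[:, 0]
    return np.linalg.solve(G + n * eps * np.eye(n), kx)

# Estimate E[Y | X = 1] for Y = sin(X) + noise (the kernel ridge regression form)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (400, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=400)
w = conditional_mean_weights(X, np.array([1.0]))
est = w @ Y  # close to sin(1.0)
```

The same weights reappear below in the Gram matrix expressions of the kernel sum rule and KBR.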
Kernel Bayes Rule
Inference with conditional kernel mean
- Sum rule: $q(y) = \int p(y|x)\, \pi(x)\, dx$
- Chain rule: $q(x, y) = p(y|x)\, \pi(x)$
- Bayes rule: $q(x|y) = \dfrac{p(y|x)\, \pi(x)}{\int p(y|x)\, \pi(x)\, dx}$
Kernelization:
- express the probabilities by kernel means;
- express the statistical relations among variables with covariance operators;
- realize the above inference rules by Gram matrix computation.
Kernel Sum Rule
Sum rule: $q(y) = \int p(y|x)\, \pi(x)\, dx$.
Kernelization: $m_{Q_Y} = C_{YX} C_{XX}^{-1} m_\pi$.
Gram matrix expression. Input: prior $\hat{m}_\pi = \sum_{j=1}^{\ell} \alpha_j \Phi(\tilde{X}_j)$ and joint sample $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P_{XY}$. Output:
$\hat{m}_{Q_Y} = \sum_{i=1}^n \beta_i \Phi(Y_i)$, with $\beta = (G_X + n \varepsilon_n I_n)^{-1} G_{X\tilde{X}}\, \alpha$,
where $G_X = (k_X(X_i, X_j))_{ij}$ and $G_{X\tilde{X}} = (k_X(X_i, \tilde{X}_j))_{ij}$.
Proof: $\int \Phi(y)\, q(y)\, dy = \iint \Phi(y)\, p(y|x)\, \pi(x)\, dx\, dy = C_{YX} C_{XX}^{-1} \int \Phi(x)\, \pi(x)\, dx = C_{YX} C_{XX}^{-1} m_\pi$.
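The Gram matrix expression of the sum rule is one linear solve. A sketch in numpy, assuming Gaussian kernels, a prior represented by uniformly weighted sample points, and illustrative hyperparameters (none of these choices come from the slides); since $f(y) = y$ is not strictly in the Gaussian RKHS, using $\beta^T (Y_1, \ldots, Y_n)$ as an estimate of $E_q[Y]$ is only a heuristic check:

```python
import numpy as np

def gram(A, B, sigma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def kernel_sum_rule(X, U, alpha, sigma=0.5, eps=1e-3):
    # beta = (G_X + n*eps*I)^{-1} G_{X,U} alpha, so that m_Q = sum_i beta_i Phi(Y_i)
    n = len(X)
    G_X = gram(X, X, sigma)
    G_XU = gram(X, U, sigma)
    return np.linalg.solve(G_X + n * eps * np.eye(n), G_XU @ alpha)

# p(y|x): Y = X + small noise, learned from a joint sample with X ~ N(0,1);
# prior pi = N(1, 0.3^2), represented by sample points U with uniform weights.
rng = np.random.default_rng(1)
X = rng.normal(0, 1, (400, 1))
Y = X[:, 0] + 0.1 * rng.normal(size=400)
U = rng.normal(1.0, 0.3, (100, 1))
alpha = np.full(100, 1.0 / 100)
beta = kernel_sum_rule(X, U, alpha)
est_mean = beta @ Y  # roughly the prior mean, since q(y) concentrates near 1
```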
Kernel Chain Rule
Chain rule: $q(x, y) = p(y|x)\, \pi(x)$.
Kernelization: $m_Q = C_{(YX)X} C_{XX}^{-1} m_\pi$, where $C_{(YX)X}: H_X \to H_Y \otimes H_X$ is the covariance operator given by $E[(\Phi(Y) \otimes \Phi(X))\, \Phi(X)^T]$.
Gram matrix expression. Input: $\hat{m}_\pi = \sum_{j=1}^{\ell} \alpha_j \Phi(\tilde{X}_j)$ and joint sample $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P_{XY}$. Output:
$\hat{m}_Q = \sum_{i=1}^n \beta_i\, \Phi(Y_i) \otimes \Phi(X_i)$, with $\beta = (G_X + n \varepsilon_n I_n)^{-1} G_{X\tilde{X}}\, \alpha$.
Intuition: apply the sum rule to the conditional probability $p(y|x')\, \delta(x - x')$:
$C_{(YX)X} C_{XX}^{-1} m_\pi = \iiint \Phi(y) \otimes \Phi(x)\, p(y|x')\, \delta(x - x')\, \pi(x')\, dx'\, dx\, dy = \iint \Phi(y) \otimes \Phi(x)\, p(y|x)\, \pi(x)\, dx\, dy = m_Q$.
Kernel Bayes Rule (KBR)
Bayes rule is regression $y \to x$ with the joint probability $q(x, y) = p(y|x)\, \pi(x)$.
Kernel Bayes Rule (KBR; Fukumizu et al. NIPS 2011):
$m_{Q_{X|y}} = C_{XY}^{\pi} \left( C_{YY}^{\pi} \right)^{-1} \Phi(y)$,
where $C_{YY}^{\pi} = C_{(YY)X} C_{XX}^{-1} m_\pi$ and $C_{XY}^{\pi} = C_{(XY)X} C_{XX}^{-1} m_\pi$.
Recall: a mean on the product space is a covariance.
Gram matrix expression. Input: $\hat{m}_\pi = \sum_{j=1}^{\ell} \alpha_j \Phi(\tilde{X}_j)$ and joint sample $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P_{XY}$. Output:
$\hat{m}_{Q_{X|y}} = \sum_{i=1}^n w_i(y)\, \Phi(X_i)$, with $w(y) = R_{X|Y}\, \mathbf{k}_Y(y)$,
$R_{X|Y} = \Lambda G_Y \left( (\Lambda G_Y)^2 + \delta_n I_n \right)^{-1} \Lambda$, $\Lambda = \mathrm{Diag}\left( (G_X + n \varepsilon_n I_n)^{-1} G_{X\tilde{X}}\, \alpha \right)$.
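The Gram matrix expression of KBR translates directly into numpy: one linear solve for the sum-rule weights $\Lambda$, one for the regularized inverse of $(\Lambda G_Y)^2$. The sketch below assumes Gaussian kernels and illustrative regularization constants; the test scenario (Gaussian prior and additive Gaussian noise, where the true posterior mean is known in closed form) is my own choice, not from the slides:

```python
import numpy as np

def gram(A, B, sigma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def kbr_weights(X, Y, U, alpha, y_obs, sx=0.5, sy=0.5, eps=1e-3, delta=1e-3):
    # Posterior kernel mean sum_i w_i(y) Phi(X_i), with
    # w(y) = Lambda G_Y ((Lambda G_Y)^2 + delta I)^{-1} Lambda k_Y(y),
    # Lambda = Diag((G_X + n*eps*I)^{-1} G_{X,U} alpha)
    n = len(X)
    G_X, G_Y = gram(X, X, sx), gram(Y, Y, sy)
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), gram(X, U, sx) @ alpha)
    LG = mu[:, None] * G_Y                      # Lambda @ G_Y
    ky = gram(Y, y_obs[None, :], sy)[:, 0]
    return LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), mu * ky)

# Joint sample: X ~ N(0,1), Y = X + 0.3*noise; prior = the X-marginal itself,
# so the posterior is the true conditional p(x|y).
rng = np.random.default_rng(2)
X = rng.normal(0, 1, (300, 1))
Y = X + 0.3 * rng.normal(size=(300, 1))
U = rng.normal(0, 1, (300, 1))
alpha = np.full(300, 1.0 / 300)
m_pos = kbr_weights(X, Y, U, alpha, np.array([1.0])) @ X[:, 0]
m_neg = kbr_weights(X, Y, U, alpha, np.array([-1.0])) @ X[:, 0]
```

Here `m_pos` and `m_neg` estimate the posterior means given $y = 1$ and $y = -1$; they should have the corresponding signs.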
Inference with KBR
KBR estimates the kernel mean of the posterior $q(x|y)$, not the posterior itself. How can we use it for Bayesian inference?
- Expectation: for any $f \in H_X$, $\mathbf{f}_X^T R_{X|Y}\, \mathbf{k}_Y(y) \to \int f(x)\, q(x|y)\, dx$ (consistent), where $\mathbf{f}_X = (f(X_1), \ldots, f(X_n))^T$.
- Point estimation: $\hat{x} = \arg\min_x \|\hat{m}_{X|Y=y} - \Phi_X(x)\|_{H_X}$ (pre-image problem), solved numerically.
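Both uses above are cheap once the weights $w(y)$ are in hand: the expectation is a dot product, and for a Gaussian kernel ($k(x,x) = 1$) the pre-image problem reduces to maximizing $\sum_i w_i\, k(X_i, x)$. A sketch, using a hand-built weight vector concentrated near $x = 1$ as a stand-in for actual KBR output (the grid search and all names are illustrative assumptions):

```python
import numpy as np

def gram(A, B, sigma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def posterior_expectation(w, fX):
    # E_q[f(X)] is estimated by f_X^T w for the representation sum_i w_i Phi(X_i)
    return fX @ w

def preimage_point(w, X, grid, sigma=0.5):
    # argmin_x ||m_hat - Phi(x)||^2 over a grid: with k(x,x) = 1 this is
    # argmax_x sum_i w_i k(X_i, x)
    scores = gram(grid, X, sigma) @ w
    return grid[np.argmax(scores)]

rng = np.random.default_rng(3)
X = rng.normal(0, 1, (400, 1))
# Stand-in posterior weights concentrated near x = 1 (in place of KBR output)
w = gram(X, np.array([[1.0]]), 0.5)[:, 0]
w /= w.sum()
grid = np.linspace(-3, 3, 121)[:, None]
x_hat = preimage_point(w, X, grid)          # point estimate near 1
m1 = posterior_expectation(w, X[:, 0])      # expectation of f(x) = x
```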
A completely nonparametric way of computing Bayes rule: no parametric models are needed; data or samples are used to express the probabilistic relations nonparametrically.
Examples:
1. Nonparametric HMM (see next). [Graphical model: hidden chain $X_0 \to X_1 \to \cdots \to X_T$, with observations $X_t \to Y_t$.]
2. Kernel Approximate Bayesian Computation (Nakagome, F., Mano 2012): the explicit form of the likelihood $p(y|x)$ is unavailable, but sampling is possible; c.f. Approximate Bayesian Computation (ABC).
3. Kernelization of the Bellman equation in POMDPs (Nishiyama et al. UAI 2012).
Example: KBR for nonparametric HMM
Assume $p(X, Y) = p(X_0, Y_0) \prod_{t=1}^{T} p(Y_t | X_t)\, q(X_t | X_{t-1})$, where $p(y_t|x_t)$ and/or $q(x_t|x_{t-1})$ is not known, but data $(X_t, Y_t)_{t=0}^{T}$ is available in a training phase. [Graphical model: $X_0 \to X_1 \to \cdots \to X_T$, with $X_t \to Y_t$.]
Examples: measurement of hidden states is expensive; hidden states are measured with time delay.
Testing phase (e.g., filtering): given $y_0, \ldots, y_t$, estimate the hidden state $x_s$.
KBR point estimator: $\arg\min_{x_s} \| \hat{m}_{x_s | y_0, \ldots, y_t} - \Phi(x_s) \|_{H_X}$.
General sequential inference uses Bayes rule, with KBR applied.
Smoothing: noisy oscillation
$\begin{pmatrix} u_t \\ v_t \end{pmatrix} = \left( 1 + 0.4 \sin(8\theta_t) \right) \begin{pmatrix} \cos\theta_t \\ \sin\theta_t \end{pmatrix} + Z_t$, $\quad \theta_{t+1} = \arctan(v_t / u_t) + 0.4$,
$Y_t = (u_t, v_t)^T + W_t$, with $Z_t, W_t \sim N(0,\, 0.04 I_2)$ i.i.d.
Note: KBR does not know the dynamics, while the EKF and UKF use it.
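The dynamics above are easy to simulate, which is how training data for the nonparametric filter can be generated. A sketch in numpy; I use `arctan2` for the angle update (a numerically safe reading of $\arctan(v_t/u_t)$, my assumption), and $0.04 = 0.2^2$ for the noise standard deviation:

```python
import numpy as np

def simulate_oscillation(T, seed=0):
    # Hidden states X_t = (u_t, v_t) on a noisy, modulated circle;
    # observations Y_t = X_t + W_t, with Z_t, W_t ~ N(0, 0.04 I_2) i.i.d.
    rng = np.random.default_rng(seed)
    theta = 0.0
    X = np.zeros((T, 2))
    for t in range(T):
        r = 1.0 + 0.4 * np.sin(8 * theta)
        X[t] = r * np.array([np.cos(theta), np.sin(theta)]) + rng.normal(0, 0.2, 2)
        theta = np.arctan2(X[t, 1], X[t, 0]) + 0.4
    Y = X + rng.normal(0, 0.2, (T, 2))
    return X, Y

X_hid, Y_obs = simulate_oscillation(500)
```

Pairs $(X_t, Y_t)$ from such a run play the role of the training sample in the smoothing experiment.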
Rotation angle of camera
Hidden $X_t$: angles of a video camera located at a corner of a room. Observed $Y_t$: movie frames of the room + additive Gaussian noise; 3600 downsampled frames of 20 x 20 RGB pixels (1200 dim.). The first 1800 frames are used for training, and the second half for testing.
Average MSE for camera angles (10 runs):
noise       | KBR (Trace)   | Kalman filter (quaternion)
σ² = 10⁻⁴  | 0.15 ± <0.01  | 0.56 ± 0.02
σ² = 10⁻³  | 0.21 ± 0.01   | 0.54 ± 0.02
* For the rotation matrices, the Tr[AB⁻¹] kernel is used for KBR, and the quaternion expression for the Kalman filter.
Concluding remarks
- Kernel methods: a useful, general tool for nonparametric inference, with efficient linear-algebraic computation via Gram matrices.
- Kernel Bayes rule: inference with the kernel mean of a conditional probability; a completely nonparametric way of carrying out general Bayesian inference.
Ongoing / future works
- Combination of parametric models and kernel nonparametric methods: exact integration + kernel nonparametrics (Nishiyama et al. IBIS 2012); particle filter + kernel nonparametrics (Kanagawa et al. IBIS 2012).
- Theoretical analysis in high-dimensional situations.
- Relation to other recent nonparametric approaches? Gaussian processes, Bayesian nonparametrics.
Collaborators
Bernhard Schölkopf (MPI), Arthur Gretton (UCL/MPI), Bharath Sriperumbudur (Cambridge), Le Song (Georgia Tech), Shigeki Nakagome (ISM), Shuhei Mano (ISM), Yu Nishiyama (ISM), Motonobu Kanagawa (NAIST)