Kernel Bayes Rule: Nonparametric Bayesian inference with kernels


Kernel Bayes Rule: Nonparametric Bayesian inference with kernels. Kenji Fukumizu, The Institute of Statistical Mathematics. NIPS 2012 Workshop "Confluence between Kernel Methods and Graphical Models", December 8, 2012, Lake Tahoe.

Introduction. A new kernel methodology for nonparametric inference: kernel means are used to represent and manipulate the probabilities of variables. Nonparametric Bayesian inference is also possible: completely nonparametric, with all computation done by linear algebra with Gram matrices. This is different from Bayesian nonparametrics, and is today's main topic.

Outline: 1. Kernel mean: a method for nonparametric inference. 2. Representing conditional probability. 3. Kernel Bayes Rule and its applications. 4. Conclusions.

Kernel mean: representing probabilities. Classical nonparametric methods for representing probabilities: kernel density estimation, $\hat p_n(x) = \frac{1}{n}\sum_{i=1}^{n} K\big((x - X_i)/h_n\big)$, and the characteristic function, $\mathrm{Ch.f}_X(u) = E[e^{iuX}]$, estimated by $\widehat{\mathrm{Ch.f}}_X(u) = \frac{1}{n}\sum_{i=1}^{n} e^{iuX_i}$. New alternative: the kernel mean. Let $X$ be a random variable taking values on $\Omega$ with probability $P$, $k$ a positive definite kernel on $\Omega$, and $H_k$ the RKHS associated with $k$. Definition. The kernel mean of $X$ on $H_k$ is $m_P \equiv E[\Phi(X)] = \int k(\cdot, x)\, dP(x) \in H_k$, where $\Phi(x) = k(\cdot, x)$ is the feature vector. Empirical estimate: $\hat m_P = \frac{1}{n}\sum_{i=1}^{n} \Phi(X_i)$ for $X_1, \dots, X_n \sim P$ i.i.d.
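As a concrete illustration (not from the talk), here is a minimal sketch of the empirical kernel mean with a Gaussian kernel, evaluating $\hat m_P(u) = \frac{1}{n}\sum_i k(u, X_i)$ on a grid of query points; the function names and the bandwidth are my own choices.

```python
import numpy as np

def gauss_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_mean(X, query, sigma=1.0):
    """Evaluate the empirical kernel mean  m_P(u) = (1/n) sum_i k(u, X_i)."""
    return gauss_kernel(query, X, sigma).mean(axis=1)

# toy usage: embed a sample from N(0, 1) and evaluate m_P on a grid
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
u = np.linspace(-3, 3, 7)[:, None]
print(kernel_mean(X, u))
```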

Reproducing expectation: $\langle f, m_P \rangle = E[f(X)]$ for all $f \in H_k$. The kernel mean carries information on the higher-order moments of $X$. E.g., for $k(u, x) = c_0 + c_1 (ux) + c_2 (ux)^2 + \cdots$ with $c_i \ge 0$ (e.g., $e^{ux}$), $m_P(u) = c_0 + c_1 E[X]\, u + c_2 E[X^2]\, u^2 + \cdots$, a moment generating function.

Characteristic kernel (Fukumizu et al. JMLR 2004, AoS 2009; Sriperumbudur et al. JMLR 2010). Definition. A bounded measurable kernel $k$ is characteristic if $m_P = m_Q \iff P = Q$, i.e., the kernel mean $m_P$ with a characteristic kernel $k$ uniquely determines the probability. Examples: the Gaussian and Laplace kernels (the polynomial kernel is not characteristic). This is analogous to the characteristic function $\mathrm{Ch.f}_X(u) = E[e^{iuX}]$, which uniquely determines the probability of $X$; a positive definite kernel gives a better alternative: efficient computation by the kernel trick, and applicability to non-vectorial data.

Nonparametric inference with kernels. Principle: with characteristic kernels, inference on $P$ becomes inference on $m_P$. Two-sample test: is $m_P = m_Q$? (Gretton et al. NIPS 2006, JMLR 2012, NIPS 2009, 2012). Independence test: is $m_{XY} = m_X \otimes m_Y$? (Gretton et al. NIPS 2007). Bayesian inference: estimate the kernel mean of the posterior, given kernel representations of the prior and the conditional probability.
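For instance, the two-sample statistic $\|\hat m_P - \hat m_Q\|_{H_k}^2$ (the squared MMD of Gretton et al.) reduces to averages of Gram matrix entries. A minimal sketch, reusing gauss_kernel from the snippet above; the biased V-statistic form is shown for brevity:

```python
def mmd2_biased(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of ||m_P - m_Q||^2 in the RKHS."""
    Kxx = gauss_kernel(X, X, sigma)
    Kyy = gauss_kernel(Y, Y, sigma)
    Kxy = gauss_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
```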

Conventional approaches to nonparametric inference: smoothing kernels (not necessarily positive definite), as in kernel density estimation and local polynomial fitting, $h^{-d} K(x/h)$; the characteristic function $E[e^{iuX}]$; etc. These suffer from the curse of dimensionality; e.g., smoothing kernels have difficulty already in moderately high dimension. For kernel methods for nonparametric inference: what can we do, and how robust are they to high dimensionality?

Conditional probabilities

Conditional kernel mean. Conditional probabilities are important to inference: graphical modeling (conditional independence/dependence) and Bayesian inference. The kernel mean of a conditional probability is $E[\Phi(Y) \mid X = x] = \int \Phi(y)\, p(y \mid x)\, dy$. Question: how can we estimate it in the kernel framework? Accurate estimation of $p(y \mid x)$ is not easy; we take a regression approach.

Covariance. Let $(X, Y)$ be a random vector taking values on $\Omega_X \times \Omega_Y$, and $(H_X, k_X)$, $(H_Y, k_Y)$ RKHSs on $\Omega_X$ and $\Omega_Y$, respectively. Definition. The (uncentered) covariance operators $C_{YX}: H_X \to H_Y$ and $C_{XX}: H_X \to H_X$ are $C_{YX} = E[\Phi_Y(Y)\, \Phi_X(X)^T]$ and $C_{XX} = E[\Phi_X(X)\, \Phi_X(X)^T]$. Simply an extension of the covariance matrix (a linear map) $V_{YX} = E[Y X^T]$. Reproducing property: $\langle g, C_{YX} f \rangle = E[f(X)\, g(Y)]$ for all $f \in H_X$, $g \in H_Y$. $C_{YX}$ can be identified with the kernel mean $E[k_Y(\cdot, Y) \otimes k_X(\cdot, X)]$ on the product space $H_Y \otimes H_X$.

Conditional kernel mean. Review: for $X, Y$ Gaussian random variables (in $\mathbb{R}^m$ and $\mathbb{R}^l$, resp.), $\mathrm{argmin}_{A \in \mathbb{R}^{l \times m}} \int \|Y - AX\|^2\, dP(X, Y) = V_{YX} V_{XX}^{-1}$, and $E[Y \mid X = x] = V_{YX} V_{XX}^{-1} x$. For general $X$ and $Y$, $\mathrm{argmin}_{F \in H_X \otimes H_Y} \int \|\Phi_Y(Y) - \langle F, \Phi_X(X) \rangle\|_{H_Y}^2\, dP(X, Y) = C_{YX} C_{XX}^{-1}$, and with a characteristic kernel $k_X$, $E[\Phi_Y(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi_X(x)$: the conditional kernel mean given $X = x$.

Empirical estimation: $\widehat E[\Phi_Y(Y) \mid X = x] = \mathbf{k}_Y^T (G_X + n \varepsilon_n I_n)^{-1} \mathbf{k}_X(x)$, where $\mathbf{k}_X(x) = (k_X(x, X_1), \dots, k_X(x, X_n))^T \in \mathbb{R}^n$, $\mathbf{k}_Y = (k_Y(\cdot, Y_1), \dots, k_Y(\cdot, Y_n))^T \in H_Y^n$, and $\varepsilon_n$ is a regularization coefficient. Note: a joint sample $(X_1, Y_1), \dots, (X_n, Y_n) \sim P_{XY}$ is used to give the conditional kernel mean with $P_{Y \mid X}$. C.f. kernel ridge regression: $\widehat E[Y \mid X = x] = \mathbf{Y}^T (G_X + n \varepsilon_n I_n)^{-1} \mathbf{k}_X(x)$.
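A sketch of this estimator under the same assumptions as the earlier snippets (Gaussian kernels, hypothetical names): the weight vector $(G_X + n \varepsilon_n I_n)^{-1} \mathbf{k}_X(x)$ is shared by the conditional kernel mean and kernel ridge regression; only what it is paired with ($\Phi_Y(Y_i)$ versus $Y_i$) differs.

```python
import numpy as np

def conditional_mean_weights(X, x_query, sigma=1.0, eps=1e-3):
    """Weights (G_X + n*eps*I)^{-1} k_X(x): pairing them with Phi_Y(Y_i)
    gives the conditional kernel mean; pairing them with Y_i gives the
    kernel ridge regression prediction of E[Y | X = x]."""
    n = len(X)
    G = gauss_kernel(X, X, sigma)           # from the first sketch
    kx = gauss_kernel(X, x_query, sigma)    # shape (n, m) for m query points
    return np.linalg.solve(G + n * eps * np.eye(n), kx)

# kernel ridge regression check:  y_hat = Y.T @ conditional_mean_weights(X, xq)
```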

Kernel Bayes Rule

Inference with the conditional kernel mean. Sum rule: $q(y) = \int p(y \mid x)\, \pi(x)\, dx$. Chain rule: $q(x, y) = p(y \mid x)\, \pi(x)$. Bayes rule: $q(x \mid y) = \frac{p(y \mid x)\, \pi(x)}{\int p(y \mid x)\, \pi(x)\, dx}$. Kernelization: express the probabilities by kernel means, express the statistical relations among variables with covariance operators, and realize the above inference rules with Gram matrix computations.

Kernel Sum Rule. Sum rule: $q(y) = \int p(y \mid x)\, \pi(x)\, dx$. Kernelization: $m_{Q_Y} = C_{YX} C_{XX}^{-1} m_\pi$. Gram matrix expression. Input: $\hat m_\pi = \sum_{j=1}^{l} \alpha_j \Phi(\tilde X_j)$ and a joint sample $(X_1, Y_1), \dots, (X_n, Y_n) \sim P_{XY}$. Output: $\hat m_{Q_Y} = \sum_{i=1}^{n} \beta_i \Phi(Y_i)$ with $\beta = (G_X + n \varepsilon_n I_n)^{-1} G_{X \tilde X}\, \alpha$, where $G_X = (k_X(X_i, X_j))_{ij}$ and $G_{X \tilde X} = (k_X(X_i, \tilde X_j))_{ij}$. Proof sketch: $\int \Phi(y)\, p(y \mid x)\, dy = C_{YX} C_{XX}^{-1} \Phi(x)$, hence $\iint \Phi(y)\, p(y \mid x)\, \pi(x)\, dy\, dx = C_{YX} C_{XX}^{-1} m_\pi$.
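A minimal sketch of the Gram matrix form of the sum rule, with the same hypothetical helpers as above (the prior points $\tilde X_j$ may simply be the sample points $X_i$):

```python
import numpy as np

def kernel_sum_rule(X, X_prior, alpha, sigma=1.0, eps=1e-3):
    """beta = (G_X + n*eps*I)^{-1} G_{X,Xprior} alpha: the weights of the
    predicted embedding  m_{Q_Y} = sum_i beta_i Phi(Y_i)."""
    n = len(X)
    G_X = gauss_kernel(X, X, sigma)
    G_Xp = gauss_kernel(X, X_prior, sigma)
    return np.linalg.solve(G_X + n * eps * np.eye(n), G_Xp @ alpha)
```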

Kernel Chain Rule. Chain rule: $q(x, y) = p(y \mid x)\, \pi(x)$. Kernelization: $m_Q = C_{(YX)X} C_{XX}^{-1} m_\pi$. Gram matrix expression. Input: $\hat m_\pi = \sum_{j=1}^{l} \alpha_j \Phi(\tilde X_j)$ and $(X_1, Y_1), \dots, (X_n, Y_n) \sim P_{XY}$. Output: $\hat m_Q = \sum_{i=1}^{n} \beta_i\, \Phi(Y_i) \otimes \Phi(X_i)$ with $\beta = (G_X + n \varepsilon_n I_n)^{-1} G_{X \tilde X}\, \alpha$. Intuition: note $C_{(YX)X}: H_X \to H_Y \otimes H_X$ is $E[\Phi(Y) \otimes \Phi(X) \otimes \Phi(X)]$, which corresponds to $p(y, x)\, \delta(x - x')$. From the sum rule, $C_{(YX)X} C_{XX}^{-1} m_\pi = \iiint \Phi(y) \otimes \Phi(x)\, p(y \mid x')\, \delta(x - x')\, \pi(x')\, dy\, dx\, dx' = \iint \Phi(y) \otimes \Phi(x)\, p(y \mid x)\, \pi(x)\, dy\, dx = m_Q$.

Kernel Bayes Rule (KBR). Bayes rule is regression from $y$ to $x$ with the joint probability $q(x, y) = p(y \mid x)\, \pi(x)$. Kernel Bayes Rule (Fukumizu et al. NIPS 2011): $m_{Q_{X \mid y}} = C^\pi_{XY} (C^\pi_{YY})^{-1} \Phi(y)$, where $C^\pi_{YY} = C_{(YY)X} C_{XX}^{-1} m_\pi$ and $C^\pi_{XY} = C_{(XY)X} C_{XX}^{-1} m_\pi$ (recall: a mean on the product space is a covariance). Gram matrix expression. Input: $\hat m_\pi = \sum_{j=1}^{l} \alpha_j \Phi(\tilde X_j)$ and $(X_1, Y_1), \dots, (X_n, Y_n) \sim P_{XY}$. Output: $\hat m_{Q_{X \mid y}} = \sum_{i=1}^{n} w_i(y)\, \Phi(X_i)$ with $w(y) = R_{X \mid Y}\, \mathbf{k}_Y(y)$, where $R_{X \mid Y} = \Lambda G_Y \big( (\Lambda G_Y)^2 + \delta_n I_n \big)^{-1} \Lambda$ and $\Lambda = \mathrm{Diag}\big( (G_X + n \varepsilon_n I_n)^{-1} G_{X \tilde X}\, \alpha \big)$.
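Putting the pieces together, a sketch of the KBR weight computation as written above (hypothetical names; the regularization constants are illustrative):

```python
import numpy as np

def kbr_posterior_weights(X, Y, X_prior, alpha, y_obs,
                          sigma_x=1.0, sigma_y=1.0, eps=1e-3, delta=1e-3):
    """Posterior weights w(y), with m_{X|y} = sum_i w_i(y) Phi(X_i):
    w(y) = Lambda G_Y ((Lambda G_Y)^2 + delta*I)^{-1} Lambda k_Y(y)."""
    n = len(X)
    G_X = gauss_kernel(X, X, sigma_x)
    G_Y = gauss_kernel(Y, Y, sigma_y)
    # chain-rule weights beta, then Lambda = Diag(beta)
    beta = np.linalg.solve(G_X + n * eps * np.eye(n),
                           gauss_kernel(X, X_prior, sigma_x) @ alpha)
    LG = np.diag(beta) @ G_Y                                 # Lambda G_Y
    ky = gauss_kernel(Y, y_obs.reshape(1, -1), sigma_y).ravel()
    return LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), beta * ky)
```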

Inference with KBR. KBR estimates the kernel mean of the posterior $q(x \mid y)$, not the posterior itself. How can we use it for Bayesian inference? Expectation: for any $f \in H_X$, $\mathbf{f}_X^T R_{X \mid Y}\, \mathbf{k}_Y(y) \to \int f(x)\, q(x \mid y)\, dx$ (consistent), where $\mathbf{f}_X = (f(X_1), \dots, f(X_n))^T$. Point estimation: $\hat x = \mathrm{argmin}_x \| \hat m_{X \mid Y = y} - \Phi_X(x) \|_{H_X}$ (the pre-image problem), solved numerically.
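Both uses follow directly from the weights: posterior expectations are inner products with $w(y)$, and a crude pre-image can be found by scoring candidate points. A sketch continuing the snippets above (searching only over the training points is a simplification of the numerical pre-image problem, and $f(x) = x$ is a common heuristic even though the coordinate functions are not strictly in $H_X$ for the Gaussian kernel):

```python
w = kbr_posterior_weights(X, Y, X_prior, alpha, y_obs)

# expectation: E_q[f(X) | y] ~ (f(X_1), ..., f(X_n)) . w(y), here with f(x) = x
post_mean = w @ X

# pre-image: minimize ||m_{X|y} - Phi(x)||^2; since k(x, x) = 1 for the
# Gaussian kernel, this is equivalent to maximizing sum_i w_i k(x, X_i)
scores = gauss_kernel(X, X).dot(w)   # candidates = training points
x_hat = X[np.argmax(scores)]
```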

This is a completely nonparametric way of computing Bayes rule: no parametric models are needed; data or samples are used to express the probabilistic relations nonparametrically. Examples: 1. Nonparametric HMM (see next slide; diagram: hidden chain $X_0 \to X_1 \to \cdots \to X_T$ with observations $Y_0, \dots, Y_T$). 2. Kernel Approximate Bayesian Computation (Nakagome, Fukumizu, Mano 2012): the explicit form of the likelihood $p(y \mid x)$ is unavailable, but sampling is possible; c.f. Approximate Bayesian Computation (ABC). 3. Kernelization of the Bellman equation in POMDPs (Nishiyama et al. UAI 2012).

Example: KBR for a nonparametric HMM. Assume $p(X, Y) = p(X_0, Y_0) \prod_{t=1}^{T} p(Y_t \mid X_t)\, q(X_t \mid X_{t-1})$, where $p(y_t \mid x_t)$ and/or $q(x_t \mid x_{t-1})$ is not known, but data $(X_t, Y_t)_{t=0}^{T}$ are available in a training phase (diagram: hidden chain $X_0 \to \cdots \to X_T$ with observations $Y_t$). Examples: measurement of the hidden states is expensive, or the hidden states are measured with a time delay. Testing phase (e.g., filtering): given $y_0, \dots, y_t$, estimate the hidden state $x_s$. KBR point estimator: $\mathrm{argmin}_{x_s} \| \hat m_{x_s \mid y_0, \dots, y_t} - \Phi(x_s) \|_{H_X}$. General sequential inference uses Bayes rule, so KBR applies.

Smoothing: noisy oscillation. Dynamics: $(u_t, v_t)^T = (1 + 0.4 \sin(8 \theta_t)) (\cos \theta_t, \sin \theta_t)^T + Z_t$, $\theta_{t+1} = \arctan(v_t / u_t) + 0.4$, $Y_t = (u_t, v_t)^T + W_t$, where $Z_t, W_t \sim N(0, 0.04 I_2)$ i.i.d. Note: KBR does not know the dynamics, while the EKF and UKF use it.
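A sketch of a generator for this synthetic sequence (np.arctan2 replaces $\arctan(v/u)$ for quadrant safety; noise standard deviation 0.2 corresponds to the variance 0.04 on the slide):

```python
import numpy as np

def oscillation_data(T, seed=0):
    """Hidden states X_t = (u_t, v_t) and noisy observations Y_t from the
    noisy-oscillation model on the slide."""
    rng = np.random.default_rng(seed)
    theta, X, Y = 0.0, [], []
    for _ in range(T):
        r = 1 + 0.4 * np.sin(8 * theta)
        uv = r * np.array([np.cos(theta), np.sin(theta)])
        uv = uv + rng.normal(scale=0.2, size=2)        # Z_t ~ N(0, 0.04 I_2)
        X.append(uv)
        Y.append(uv + rng.normal(scale=0.2, size=2))   # W_t ~ N(0, 0.04 I_2)
        theta = np.arctan2(uv[1], uv[0]) + 0.4         # theta_{t+1}
    return np.array(X), np.array(Y)
```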

Rotation angle of a camera. Hidden $X_t$: angles of a video camera located at a corner of a room. Observed $Y_t$: movie frames of the room plus additive Gaussian noise; 3600 downsampled frames of 20 x 20 RGB pixels (1200 dim.). The first 1800 frames are used for training and the second half for testing.

Average MSE for camera angles (10 runs):

noise             | KBR (Trace)   | Kalman filter (quaternion)
sigma^2 = 10^-4   | 0.15 ± <0.01  | 0.56 ± 0.02
sigma^2 = 10^-3   | 0.21 ± 0.01   | 0.54 ± 0.02

* For the rotation matrices, the $\mathrm{Tr}[A B^{-1}]$ kernel is used for KBR, and the quaternion expression for the Kalman filter.

Concluding remarks. Kernel methods are a useful, general tool for nonparametric inference, with efficient linear algebraic computation with Gram matrices. The Kernel Bayes Rule: inference with the kernel mean of a conditional probability; a completely nonparametric way of carrying out general Bayesian inference.

Ongoing / future work. Combination of parametric models and kernel nonparametric methods: exact integration + kernel nonparametrics (Nishiyama et al. IBIS 2012); particle filter + kernel nonparametrics (Kanagawa et al. IBIS 2012). Theoretical analysis in high-dimensional situations. Relation to other recent nonparametric approaches: Gaussian processes, Bayesian nonparametrics?

Collaborators: Bernhard Schölkopf (MPI), Arthur Gretton (UCL/MPI), Bharath Sriperumbudur (Cambridge), Le Song (Georgia Tech), Shigeki Nakagome (ISM), Shuhei Mano (ISM), Yu Nishiyama (ISM), Motonobu Kanagawa (NAIST).