CS8803: Statistical Techniques in Robotics. Byron Boots. Hilbert Space Embeddings


Motivation

Overview
- Multinomial distributions: marginal, joint, and conditional distributions; sum, product, and Bayes rules
- Hilbert space embeddings: marginal, joint, and conditional embeddings; sum, product, and Bayes rules
- Gram/kernel matrices

Multinomial Distributions
Marginal probabilities: $P[Y]$ is represented by the probability vector $\mu_Y$.
Joint probabilities: $P[Y, X]$ is represented by the matrix $\Sigma_{YX}$, with entries $P[Y = i, X = j]$.
Conditional probabilities: $P[Y \mid X]$ is represented by the matrix $\Sigma_{Y|X}$, whose columns are the distributions $P[Y \mid X = j]$.
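For concreteness, here is a minimal numpy sketch of these representations; the 3x2 joint table is made up purely for illustration.

```python
import numpy as np

# Made-up 3x2 joint probability table: entry (i, j) is P[Y = i, X = j].
Sigma_YX = np.array([[0.10, 0.20],
                     [0.30, 0.05],
                     [0.15, 0.20]])

mu_Y = Sigma_YX.sum(axis=1)              # marginal vector P[Y]
mu_X = Sigma_YX.sum(axis=0)              # marginal vector P[X]
Sigma_XX = np.diag(mu_X)                 # diagonal matrix of marginals
Sigma_Y_given_X = Sigma_YX @ np.linalg.inv(Sigma_XX)  # columns are P[Y | X = j]

print(Sigma_Y_given_X.sum(axis=0))       # each column sums to 1
```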

Sum Rule
$$P[Y] = \sum_X P[Y, X] \quad\Longleftrightarrow\quad \mu_Y = \Sigma_{YX}\,\mathbf{1}$$

Product Rule
$$P[Y, X] = P[Y \mid X]\,P[X] \quad\Longleftrightarrow\quad \Sigma_{YX} = \Sigma_{Y|X}\,\Sigma_{XX}, \qquad \Sigma_{Y|X} = \Sigma_{YX}\,\Sigma_{XX}^{-1}$$
where $\Sigma_{XX} = \mathrm{diag}(\mu_X)$.

Sum Rule (Revisited)
$$P[Y] = \sum_X P[Y, X] = \sum_X P[Y \mid X]\,P[X] \quad\Longleftrightarrow\quad \mu_Y = \Sigma_{YX}\,\mathbf{1} = \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\mu_X = \Sigma_{Y|X}\,\mu_X$$

Conditioning
$$P[Y \mid X = x] = P[Y \mid X]\,\delta(X = x) \quad\Longleftrightarrow\quad \mu_{Y|x} = \Sigma_{Y|X}\,\mu_x = \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\mu_x$$
where $\mu_x$ is the indicator vector for $X = x$.
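The sum, product, and conditioning rules above are one-line checks on the made-up table from the earlier sketch:

```python
import numpy as np

Sigma_YX = np.array([[0.10, 0.20],
                     [0.30, 0.05],
                     [0.15, 0.20]])
mu_Y, mu_X = Sigma_YX.sum(axis=1), Sigma_YX.sum(axis=0)
Sigma_XX = np.diag(mu_X)
Sigma_Y_given_X = Sigma_YX @ np.linalg.inv(Sigma_XX)

# Sum rule: mu_Y = Sigma_YX 1 = Sigma_{Y|X} mu_X
assert np.allclose(mu_Y, Sigma_YX @ np.ones(2))
assert np.allclose(mu_Y, Sigma_Y_given_X @ mu_X)

# Product rule: Sigma_YX = Sigma_{Y|X} Sigma_XX
assert np.allclose(Sigma_YX, Sigma_Y_given_X @ Sigma_XX)

# Conditioning: mu_{Y|x} = Sigma_{Y|X} mu_x, with mu_x the indicator of X = 1
mu_x = np.array([0.0, 1.0])
print(Sigma_Y_given_X @ mu_x)            # P[Y | X = 1]
```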

Bayes Rule etc.
$$P[X \mid Y] = \frac{P[Y \mid X]\,P[X]}{P[Y]} \quad\Longleftrightarrow\quad \Sigma_{X|Y} = (\Sigma_{Y|X}\,\Sigma_{XX})^{\top}\,\Sigma_{YY}^{-1} = \Sigma_{XY}\,\Sigma_{YY}^{-1}$$
where $\Sigma_{YY} = \mathrm{diag}(\mu_Y)$.

Bayes Rule etc.
$$P[X \mid Y = y] = \frac{P[Y = y \mid X]\,P[X]}{P[Y = y]} \quad\Longleftrightarrow\quad \mu_{X|y} = (\Sigma_{Y|X}\,\Sigma_{XX})^{\top}\,\Sigma_{YY}^{-1}\,\mu_y = \Sigma_{XY}\,\Sigma_{YY}^{-1}\,\mu_y$$
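A sketch of the same discrete Bayes rule on that table: the posterior $\mu_{X|y}$ is $\Sigma_{XY}\,\Sigma_{YY}^{-1}$ applied to the indicator vector of the observed $y$.

```python
import numpy as np

Sigma_YX = np.array([[0.10, 0.20],
                     [0.30, 0.05],
                     [0.15, 0.20]])
Sigma_YY = np.diag(Sigma_YX.sum(axis=1))
Sigma_XY = Sigma_YX.T

mu_y = np.array([1.0, 0.0, 0.0])         # observed Y = 0, as an indicator vector
mu_X_given_y = Sigma_XY @ np.linalg.inv(Sigma_YY) @ mu_y
print(mu_X_given_y)                      # equals Sigma_YX[0, :] / P[Y = 0]
assert np.isclose(mu_X_given_y.sum(), 1.0)
```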

Bayes Rule etc.
$$P[X \mid Y = y, Z = z] = \frac{P[X, Y = y \mid Z]\,\delta(Z = z)}{P[Y = y \mid Z]\,\delta(Z = z)}$$
$$\Sigma_{XY|z} = \Sigma_{(XY)Z}\,\Sigma_{ZZ}^{-1}\,\mu_z, \qquad \Sigma_{YY|z} = \Sigma_{(YY)Z}\,\Sigma_{ZZ}^{-1}\,\mu_z, \qquad \mu_{X|y,z} = \Sigma_{XY|z}\,\Sigma_{YY|z}^{-1}\,\mu_y$$
Here $\Sigma_{(XY)Z}$ denotes the three-way table $\Sigma_{XYZ}$ viewed as a linear map from indicator vectors on $Z$ to joint tables over $(X, Y)$.

Learning
$$\widehat{\Sigma}_{XYZ} = \frac{1}{N}\sum_{i=1}^{N} y_i \otimes x_i \otimes z_i, \qquad \widehat{\Sigma}_{YX} = \frac{1}{N}\sum_{i=1}^{N} y_i\,x_i^{\top}, \qquad \widehat{\mu}_X = \frac{1}{N}\sum_{i=1}^{N} x_i$$
where $x_i$, $y_i$, $z_i$ are indicator vectors.
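A minimal sketch of these estimators on samples drawn from the hypothetical table above, with each observation encoded as a pair of indicator vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
N, dY, dX = 1000, 3, 2
Sigma_YX_true = np.array([[0.10, 0.20],
                          [0.30, 0.05],
                          [0.15, 0.20]])

# Sample (Y, X) pairs from the joint table.
flat = rng.choice(dY * dX, size=N, p=Sigma_YX_true.ravel())
ys, xs = np.unravel_index(flat, (dY, dX))
Y = np.eye(dY)[ys]                       # row i is the indicator vector y_i
X = np.eye(dX)[xs]                       # row i is the indicator vector x_i

Sigma_YX_hat = Y.T @ X / N               # (1/N) sum_i y_i x_i^T
mu_X_hat = X.mean(axis=0)                # (1/N) sum_i x_i
print(Sigma_YX_hat)                      # close to Sigma_YX_true for large N
```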

Generalization
How do we make a conditional probability table out of this? How do we learn the parameters (and what are the parameters)? How do we perform inference?

Could Discretize the Distribution
[Figure: a continuous density discretized into bins labeled 0 through 3.]
Discretizing loses information, and the resulting table is hard to learn when the variable has high cardinality.

Key Idea: Sufficient Statistics
$$P[Y] \;\longrightarrow\; \mu_Y = \mathbb{E}[Y]$$
Problem: lots of distributions have the same mean.
$$P[Y] \;\longrightarrow\; \mu_Y = \begin{pmatrix} \mathbb{E}[Y] \\ \mathbb{E}[Y^2] \end{pmatrix}$$
Better, but lots of distributions still have the same mean and variance!
$$P[Y] \;\longrightarrow\; \mu_Y = \begin{pmatrix} \mathbb{E}[Y] \\ \mathbb{E}[Y^2] \\ \mathbb{E}[Y^3] \end{pmatrix}$$
Even better, but lots of distributions still share their first three moments!
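A quick numerical illustration of the last point: a standard normal and a uniform on $[-\sqrt{3}, \sqrt{3}]$ agree on their first three moments, yet they are plainly different distributions (their fourth moments already disagree).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
gauss = rng.standard_normal(n)                    # N(0, 1)
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), n)    # uniform with variance 1

for k in range(1, 5):
    print(k, (gauss**k).mean().round(3), (unif**k).mean().round(3))
# Moments 1-3 agree (0, 1, 0); moment 4 differs (3 vs. 9/5), so no finite
# list of moments pins down the distribution.
```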

Key Idea: Sufficient Statistics
$$P[Y] \;\longrightarrow\; \mu_Y = \begin{pmatrix} \mathbb{E}[Y] \\ \mathbb{E}[Y^2] \\ \mathbb{E}[Y^3] \\ \vdots \end{pmatrix}$$

Overview
- Multinomial distributions: marginal, joint, and conditional distributions; sum, product, and Bayes rules
- Hilbert space embeddings: marginal, joint, and conditional embeddings; sum, product, and Bayes rules
- Gram/kernel matrices

David Hilbert

Representation
Marginal distributions: $P[Y]$. Joint distributions: $P[Y, X]$. Conditional distributions: $P[Y \mid X]$.
Use kernel representations for these distributions.

Embedding Distributions
Summary statistics for the distribution $P[Y]$:
- $\mathbb{E}[Y]$: mean
- $\mathbb{E}[YY^{\top}]$: covariance
- $\mathbb{E}[\delta_{y_0}(Y)]$: probability $P[y_0]$
- $\mathbb{E}[\phi(Y)]$: expected features
Pick a kernel $k(y, y') = \langle \phi(y), \phi(y') \rangle$, and generate a different statistic.

Embedding Marginal Distributions
$$\phi(Y) = k(Y, \cdot) \in \mathcal{F} \ \text{(an RKHS)}, \qquad \mu_Y = \mathbb{E}[\phi(Y)], \qquad \widehat{\mu}_Y = \frac{1}{T}\sum_{i=1}^{T} \phi(y_i)$$

Embedding Marginal Distributions (continued)
- The mapping from $P[Y]$ to $\mu_Y$ is one-to-one for certain kernels (e.g. Gaussian RBF and Laplacian kernels).
- Discrete probabilities are recovered with the delta kernel.
- The sample average converges to the true mean embedding at rate $O_p(m^{-1/2})$.
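A minimal sketch of the empirical mean embedding, assuming scalar samples and a Gaussian RBF kernel with an arbitrary bandwidth. Since $\widehat{\mu}_Y = \frac{1}{T}\sum_i k(y_i, \cdot)$ is a function, we evaluate it on a grid of query points:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between two batches of scalars."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
y_samples = rng.standard_normal(500)     # samples from P[Y]

grid = np.linspace(-3, 3, 7)             # query points
mu_hat_on_grid = rbf(y_samples, grid).mean(axis=0)   # hat{mu}_Y on the grid
print(mu_hat_on_grid)
```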

Embedding Joint Distributions
Embed the joint distribution $P[Y, X]$ using the outer-product feature map $\phi(Y)\,\varphi(X)^{\top}$:
$$\mu_{YX} = \mathbb{E}\!\left[\phi(Y)\,\varphi(X)^{\top}\right], \qquad \widehat{\mu}_{YX} = \frac{1}{m}\sum_{i=1}^{m} \phi(y_i)\,\varphi(x_i)^{\top}$$
- $\mu_{YX}$ is also the covariance operator $\mathcal{C}_{YX}$.
- Discrete probabilities are recovered with delta kernels.
- The empirical estimate converges at rate $O_p(m^{-1/2})$.
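A sketch of evaluating the empirical joint embedding at a point $(y_0, x_0)$ via $\langle \phi(y_0)\varphi(x_0)^{\top}, \widehat{\mu}_{YX}\rangle = \frac{1}{m}\sum_i l(y_i, y_0)\,k(x_i, x_0)$; the dependent pair $(X, Y)$ and the RBF kernels are made up for illustration.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
m = 500
x = rng.standard_normal(m)
y = np.sin(x) + 0.1 * rng.standard_normal(m)      # a made-up dependent pair

y0, x0 = 0.5, 0.6
val = (rbf(y, np.array([y0])) * rbf(x, np.array([x0]))).mean()
print(val)                               # <phi(y0) varphi(x0)^T, mu_hat_YX>
```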

Embedding Conditional Distributions
[Figure: two conditional densities $P[Y \mid x_1]$ and $P[Y \mid x_2]$ over $Y$, with embeddings $\mu_{Y|x_1}$ and $\mu_{Y|x_2}$.]
$$\mathbb{E}[\phi(Y) \mid x], \qquad \phi(Y) = l(Y, \cdot) \in \mathcal{G} \ \text{(an RKHS)}$$
For each value $X = x$, return the summary statistic for $P[Y \mid X = x]$.
Problem: some values $X = x$ are never observed.

Embedding Conditional Distributions (continued)
To avoid partitioning the data by the value of $x$, use a conditional embedding operator:
$$\mu_{Y|x} = \mathcal{U}_{Y|X}\,\varphi(x), \qquad \varphi(X) = k(X, \cdot) \in \mathcal{F} \ \text{(an RKHS)}$$

Embedding Conditional Distributions
Estimation via covariance operators:
$$\mathcal{U}_{Y|X} := \mathcal{C}_{YX}\,\mathcal{C}_{XX}^{-1}, \qquad \widehat{\mathcal{U}}_{Y|X} = \Phi\,(K + \lambda I)^{-1}\,\Upsilon^{\top}$$
where $\Phi := (\phi(y_1), \ldots, \phi(y_m))$, $L = \Phi^{\top}\Phi$, $\Upsilon := (\varphi(x_1), \ldots, \varphi(x_m))$, and $K = \Upsilon^{\top}\Upsilon$.
- Gaussian case: covariance matrices.
- Discrete case: the joint probability matrix divided by the marginal.
- The empirical estimate converges at rate $O_p\!\left(\lambda^{1/2} + (\lambda m)^{-1/2}\right)$.
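A minimal sketch, assuming scalar data, Gaussian RBF kernels, and an arbitrary ridge parameter $\lambda$. The estimator never forms $\widehat{\mathcal{U}}_{Y|X}$ explicitly; it computes the weight vector $\beta(x_*) = (K + \lambda I)^{-1}\,\Upsilon^{\top}\varphi(x_*)$, so that $\widehat{\mu}_{Y|x_*} = \sum_i \beta_i\,\phi(y_i)$.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
m, lam = 300, 0.1                        # sample size, ridge parameter (arbitrary)
x = rng.uniform(-2, 2, m)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(m)

K = rbf(x, x)                            # K = Upsilon^T Upsilon, Gram matrix of X
x_star = np.array([0.7])
beta = np.linalg.solve(K + lam * np.eye(m), rbf(x, x_star)).ravel()

# Pairing hat{mu}_{Y|x*} with the identity feature on Y approximates
# E[Y | x = 0.7]; this coincides with kernel ridge regression.
print(beta @ y, np.sin(2 * 0.7))
```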

Direct Correspondence
$$\widehat{\Sigma}_{YX} = \frac{1}{N}\sum_{i=1}^{N} y_i\,x_i^{\top} \quad\Longleftrightarrow\quad \widehat{\mathcal{C}}_{YX} = \frac{1}{N}\sum_{i=1}^{N} \phi(y_i)\,\varphi(x_i)^{\top}$$
$$\widehat{\mu}_X = \frac{1}{N}\sum_{i=1}^{N} x_i \quad\Longleftrightarrow\quad \widehat{\mu}_X = \frac{1}{N}\sum_{i=1}^{N} \varphi(x_i)$$

Key Rules for Inference
Sum rule: $P[Y] = \int_X P[Y \mid X]\,P[X]$
Product rule: $P[Y, X] = P[Y \mid X]\,P[X]$
Bayes rule: $P[X \mid Y] = \dfrac{P[Y \mid X]\,P[X]}{\int_X P[Y \mid X]\,P[X]}$
Goal: do probabilistic inference in feature space.

Product Rule
$$P[Y, X] = P[Y \mid X]\,P[X]$$
Discrete: $\Sigma_{Y|X} = \Sigma_{YX}\,\Sigma_{XX}^{-1}$, so $\Sigma_{YX} = \Sigma_{Y|X}\,\Sigma_{XX}$.
HSE: $\mathcal{C}_{Y|X} = \mathcal{C}_{YX}\,\mathcal{C}_{XX}^{-1}$, so $\mathcal{C}_{YX} = \mathcal{C}_{Y|X}\,\mathcal{C}_{XX}$.

Sum Rule
$$P[Y] = \sum_X P[Y, X] = \sum_X P[Y \mid X]\,P[X]$$
Discrete: $\mu_Y = \Sigma_{YX}\,\mathbf{1} = \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\mu_X$.
HSE: $\mu_Y = \mathcal{C}_{YX}\,\mathcal{C}_{XX}^{-1}\,\mu_X$.

Bayes Rule
$$P[X \mid Y] = \frac{P[Y \mid X]\,P[X]}{P[Y]}$$
Discrete: $\Sigma_{X|Y} = (\Sigma_{Y|X}\,\Sigma_{XX})^{\top}\,\Sigma_{YY}^{-1} = \Sigma_{XY}\,\Sigma_{YY}^{-1}$.
HSE: $\mathcal{C}_{X|Y} = (\mathcal{C}_{Y|X}\,\mathcal{C}_{XX})^{\top}\,\mathcal{C}_{YY}^{-1} = \mathcal{C}_{XY}\,\mathcal{C}_{YY}^{-1}$.

Overview
- Multinomial distributions: marginal, joint, and conditional distributions; sum, product, and Bayes rules
- Hilbert space embeddings: marginal, joint, and conditional embeddings; sum, product, and Bayes rules
- Gram/kernel matrices

Jørgen Gram

Gram/Kernel Matrices
$$\widehat{\mathcal{C}}_{YX} = \frac{1}{N}\sum_{i=1}^{N} \phi(y_i)\,\varphi(x_i)^{\top} = \frac{1}{N}\,\Phi_Y\,\Phi_X^{\top} \in \mathbb{R}^{\infty \times \infty}$$
$$\widehat{\mathcal{C}}_{XX} = \frac{1}{N}\sum_{i=1}^{N} \varphi(x_i)\,\varphi(x_i)^{\top} = \frac{1}{N}\,\Phi_X\,\Phi_X^{\top} \in \mathbb{R}^{\infty \times \infty}, \qquad \mu_x = \varphi(x) \in \mathbb{R}^{\infty \times 1}$$
We would like to calculate $\mu_{Y|x} = \widehat{\mathcal{C}}_{YX}\,\widehat{\mathcal{C}}_{XX}^{-1}\,\mu_x$.

Gram/Kernel Matrices (continued)
$$\mu_{Y|x} = \widehat{\mathcal{C}}_{YX}\,\widehat{\mathcal{C}}_{XX}^{-1}\,\mu_x$$
With ridge regularization, the $1/N$ factors cancel and
$$\widehat{\mu}_{Y|x} = \Phi_Y\,\Phi_X^{\top}\left(\Phi_X\,\Phi_X^{\top} + N\lambda I\right)^{-1}\varphi(x)$$
By the matrix inversion (Woodbury) lemma,
$$\widehat{\mu}_{Y|x} = \Phi_Y\left(\Phi_X^{\top}\Phi_X + N\lambda I\right)^{-1}\Phi_X^{\top}\varphi(x) = \Phi_Y\left(G_{XX} + N\lambda I\right)^{-1}G_{XX}(:, i)$$
where $G_{XX} = \Phi_X^{\top}\Phi_X \in \mathbb{R}^{N \times N}$, and $G_{XX}(:, i) = \Phi_X^{\top}\varphi(x_i) \in \mathbb{R}^{N \times 1}$ when the query point is a training point $x = x_i$.
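The matrix inversion step is easy to verify numerically with an explicit finite-dimensional feature map (random features here stand in for the RKHS maps, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 50, 5, 0.1                   # samples, finite feature dim, ridge

Phi_X = rng.standard_normal((d, N))      # columns are features varphi(x_i)
Phi_Y = rng.standard_normal((d, N))      # columns are features phi(y_i)
phi_x = rng.standard_normal(d)           # feature vector of a query point

# Feature-space form: Phi_Y Phi_X^T (Phi_X Phi_X^T + N lam I_d)^{-1} phi(x)
lhs = Phi_Y @ Phi_X.T @ np.linalg.solve(Phi_X @ Phi_X.T + N * lam * np.eye(d), phi_x)

# Gram-matrix form: Phi_Y (G_XX + N lam I_N)^{-1} Phi_X^T phi(x)
G_XX = Phi_X.T @ Phi_X
rhs = Phi_Y @ np.linalg.solve(G_XX + N * lam * np.eye(N), Phi_X.T @ phi_x)

print(np.allclose(lhs, rhs))             # True: both forms agree
```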

Hilbert Space Embeddings of Distributions
- An alternative to (for example) exponential families and Parzen windows (kernel density estimation).
- Represent arbitrary distributions in feature spaces, and reason with the Hilbert space sum, product, and Bayes rules.
- Learning and inference reduce to linear algebra.
- State space models can be extended nonparametrically to domains defined by kernels.