CS8803: Statistical Techniques in Robotics. Byron Boots. Hilbert Space Embeddings


Motivation

Overview
- Multinomial distributions: marginal, joint, and conditional distributions; sum, product, and Bayes rules
- Hilbert space embeddings: marginal, joint, and conditional embeddings; sum, product, and Bayes rules
- Gram/kernel matrices

Multinomial Distributions
Marginal probabilities: $P[Y]$ is represented by the probability vector $\mu_Y$.
Joint probabilities: $P[Y, X]$ is represented by the matrix $\Sigma_{YX}$, with entries $P[Y = i, X = j]$.
Conditional probabilities: $P[Y \mid X]$ is represented by the matrix $\Sigma_{Y|X}$, whose columns are the distributions $P[Y \mid X = j]$.
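For concreteness, here is a minimal numpy sketch of these representations; the 3x2 joint table is made up purely for illustration.

```python
import numpy as np

# Made-up 3x2 joint probability table: entry (i, j) is P[Y = i, X = j].
Sigma_YX = np.array([[0.10, 0.20],
                     [0.30, 0.05],
                     [0.15, 0.20]])

mu_Y = Sigma_YX.sum(axis=1)              # marginal vector P[Y]
mu_X = Sigma_YX.sum(axis=0)              # marginal vector P[X]
Sigma_XX = np.diag(mu_X)                 # diagonal matrix of marginals
Sigma_Y_given_X = Sigma_YX @ np.linalg.inv(Sigma_XX)  # columns are P[Y | X = j]

print(Sigma_Y_given_X.sum(axis=0))       # each column sums to 1
```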

Sum Rule
$$P[Y] = \sum_X P[Y, X] \quad\Longleftrightarrow\quad \mu_Y = \Sigma_{YX}\,\mathbf{1}$$

Product Rule
$$P[Y, X] = P[Y \mid X]\,P[X] \quad\Longleftrightarrow\quad \Sigma_{YX} = \Sigma_{Y|X}\,\Sigma_{XX}, \qquad \Sigma_{Y|X} = \Sigma_{YX}\,\Sigma_{XX}^{-1}$$
where $\Sigma_{XX} = \mathrm{diag}(\mu_X)$.

Sum Rule (Revisited)
$$P[Y] = \sum_X P[Y, X] = \sum_X P[Y \mid X]\,P[X] \quad\Longleftrightarrow\quad \mu_Y = \Sigma_{YX}\,\mathbf{1} = \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\mu_X = \Sigma_{Y|X}\,\mu_X$$

Conditioning
$$P[Y \mid X = x] = P[Y \mid X]\,\delta(X = x) \quad\Longleftrightarrow\quad \mu_{Y|x} = \Sigma_{Y|X}\,\mu_x = \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\mu_x$$
where $\mu_x$ is the indicator vector for $X = x$.
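The sum, product, and conditioning rules above are one-line checks on the made-up table from the earlier sketch:

```python
import numpy as np

Sigma_YX = np.array([[0.10, 0.20],
                     [0.30, 0.05],
                     [0.15, 0.20]])
mu_Y, mu_X = Sigma_YX.sum(axis=1), Sigma_YX.sum(axis=0)
Sigma_XX = np.diag(mu_X)
Sigma_Y_given_X = Sigma_YX @ np.linalg.inv(Sigma_XX)

# Sum rule: mu_Y = Sigma_YX 1 = Sigma_{Y|X} mu_X
assert np.allclose(mu_Y, Sigma_YX @ np.ones(2))
assert np.allclose(mu_Y, Sigma_Y_given_X @ mu_X)

# Product rule: Sigma_YX = Sigma_{Y|X} Sigma_XX
assert np.allclose(Sigma_YX, Sigma_Y_given_X @ Sigma_XX)

# Conditioning: mu_{Y|x} = Sigma_{Y|X} mu_x, with mu_x the indicator of X = 1
mu_x = np.array([0.0, 1.0])
print(Sigma_Y_given_X @ mu_x)            # P[Y | X = 1]
```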

Bayes Rule etc.
$$P[X \mid Y] = \frac{P[Y \mid X]\,P[X]}{P[Y]} \quad\Longleftrightarrow\quad \Sigma_{X|Y} = (\Sigma_{Y|X}\,\Sigma_{XX})^{\top}\,\Sigma_{YY}^{-1} = \Sigma_{XY}\,\Sigma_{YY}^{-1}$$
where $\Sigma_{YY} = \mathrm{diag}(\mu_Y)$.

Bayes Rule etc.
$$P[X \mid Y = y] = \frac{P[Y = y \mid X]\,P[X]}{P[Y = y]} \quad\Longleftrightarrow\quad \mu_{X|y} = (\Sigma_{Y|X}\,\Sigma_{XX})^{\top}\,\Sigma_{YY}^{-1}\,\mu_y = \Sigma_{XY}\,\Sigma_{YY}^{-1}\,\mu_y$$
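A sketch of the same discrete Bayes rule on that table: the posterior $\mu_{X|y}$ is $\Sigma_{XY}\,\Sigma_{YY}^{-1}$ applied to the indicator vector of the observed $y$.

```python
import numpy as np

Sigma_YX = np.array([[0.10, 0.20],
                     [0.30, 0.05],
                     [0.15, 0.20]])
Sigma_YY = np.diag(Sigma_YX.sum(axis=1))
Sigma_XY = Sigma_YX.T

mu_y = np.array([1.0, 0.0, 0.0])         # observed Y = 0, as an indicator vector
mu_X_given_y = Sigma_XY @ np.linalg.inv(Sigma_YY) @ mu_y
print(mu_X_given_y)                      # equals Sigma_YX[0, :] / P[Y = 0]
assert np.isclose(mu_X_given_y.sum(), 1.0)
```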

Bayes Rule etc.
$$P[X \mid Y = y, Z = z] = \frac{P[X, Y = y \mid Z]\,\delta(Z = z)}{P[Y = y \mid Z]\,\delta(Z = z)}$$
$$\Sigma_{XY|z} = \Sigma_{(XY)Z}\,\Sigma_{ZZ}^{-1}\,\mu_z, \qquad \Sigma_{YY|z} = \Sigma_{(YY)Z}\,\Sigma_{ZZ}^{-1}\,\mu_z, \qquad \mu_{X|y,z} = \Sigma_{XY|z}\,\Sigma_{YY|z}^{-1}\,\mu_y$$
Here $\Sigma_{(XY)Z}$ denotes the three-way table $\Sigma_{XYZ}$ viewed as a linear map from indicator vectors on $Z$ to joint tables over $(X, Y)$.

Learning
$$\widehat{\Sigma}_{XYZ} = \frac{1}{N}\sum_{i=1}^{N} y_i \otimes x_i \otimes z_i, \qquad \widehat{\Sigma}_{YX} = \frac{1}{N}\sum_{i=1}^{N} y_i\,x_i^{\top}, \qquad \widehat{\mu}_X = \frac{1}{N}\sum_{i=1}^{N} x_i$$
where $x_i$, $y_i$, $z_i$ are indicator vectors.
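A minimal sketch of these estimators on samples drawn from the hypothetical table above, with each observation encoded as a pair of indicator vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
N, dY, dX = 1000, 3, 2
Sigma_YX_true = np.array([[0.10, 0.20],
                          [0.30, 0.05],
                          [0.15, 0.20]])

# Sample (Y, X) pairs from the joint table.
flat = rng.choice(dY * dX, size=N, p=Sigma_YX_true.ravel())
ys, xs = np.unravel_index(flat, (dY, dX))
Y = np.eye(dY)[ys]                       # row i is the indicator vector y_i
X = np.eye(dX)[xs]                       # row i is the indicator vector x_i

Sigma_YX_hat = Y.T @ X / N               # (1/N) sum_i y_i x_i^T
mu_X_hat = X.mean(axis=0)                # (1/N) sum_i x_i
print(Sigma_YX_hat)                      # close to Sigma_YX_true for large N
```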

Generalization
How do we make a conditional probability table out of this? How do we learn the parameters (and what are the parameters)? How do we perform inference?

Could Discretize the Distribution
[Figure: a continuous density discretized into bins labeled 0 through 3.]
Discretizing loses information, and the resulting table is hard to learn when the variable has high cardinality.

Key Idea: Sufficient Statistics
$$P[Y] \;\longrightarrow\; \mu_Y = \mathbb{E}[Y]$$
Problem: lots of distributions have the same mean.
$$P[Y] \;\longrightarrow\; \mu_Y = \begin{pmatrix} \mathbb{E}[Y] \\ \mathbb{E}[Y^2] \end{pmatrix}$$
Better, but lots of distributions still have the same mean and variance!
$$P[Y] \;\longrightarrow\; \mu_Y = \begin{pmatrix} \mathbb{E}[Y] \\ \mathbb{E}[Y^2] \\ \mathbb{E}[Y^3] \end{pmatrix}$$
Even better, but lots of distributions still share their first three moments!
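A quick numerical illustration of the last point: a standard normal and a uniform on $[-\sqrt{3}, \sqrt{3}]$ agree on their first three moments, yet they are plainly different distributions (their fourth moments already disagree).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
gauss = rng.standard_normal(n)                    # N(0, 1)
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), n)    # uniform with variance 1

for k in range(1, 5):
    print(k, (gauss**k).mean().round(3), (unif**k).mean().round(3))
# Moments 1-3 agree (0, 1, 0); moment 4 differs (3 vs. 9/5), so no finite
# list of moments pins down the distribution.
```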

Key Idea: Sufficient Statistics
$$P[Y] \;\longrightarrow\; \mu_Y = \begin{pmatrix} \mathbb{E}[Y] \\ \mathbb{E}[Y^2] \\ \mathbb{E}[Y^3] \\ \vdots \end{pmatrix}$$

Overview
- Multinomial distributions: marginal, joint, and conditional distributions; sum, product, and Bayes rules
- Hilbert space embeddings: marginal, joint, and conditional embeddings; sum, product, and Bayes rules
- Gram/kernel matrices

David Hilbert

Representation
Marginal distributions: $P[Y]$. Joint distributions: $P[Y, X]$. Conditional distributions: $P[Y \mid X]$.
Use kernel representations for these distributions.

Embedding Distributions
Summary statistics for the distribution $P[Y]$:
- $\mathbb{E}[Y]$: mean
- $\mathbb{E}[YY^{\top}]$: covariance
- $\mathbb{E}[\delta_{y_0}(Y)]$: probability $P[y_0]$
- $\mathbb{E}[\phi(Y)]$: expected features
Pick a kernel $k(y, y') = \langle \phi(y), \phi(y') \rangle$, and generate a different statistic.

Embedding Marginal Distributions
$$\phi(Y) = k(Y, \cdot) \in \mathcal{F} \ \text{(an RKHS)}, \qquad \mu_Y = \mathbb{E}[\phi(Y)], \qquad \widehat{\mu}_Y = \frac{1}{T}\sum_{i=1}^{T} \phi(y_i)$$

Embedding Marginal Distributions (continued)
- The mapping from $P[Y]$ to $\mu_Y$ is one-to-one for certain kernels (e.g. Gaussian RBF and Laplacian kernels).
- Discrete probabilities are recovered with the delta kernel.
- The sample average converges to the true mean embedding at rate $O_p(m^{-1/2})$.
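A minimal sketch of the empirical mean embedding, assuming scalar samples and a Gaussian RBF kernel with an arbitrary bandwidth. Since $\widehat{\mu}_Y = \frac{1}{T}\sum_i k(y_i, \cdot)$ is a function, we evaluate it on a grid of query points:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between two batches of scalars."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
y_samples = rng.standard_normal(500)     # samples from P[Y]

grid = np.linspace(-3, 3, 7)             # query points
mu_hat_on_grid = rbf(y_samples, grid).mean(axis=0)   # hat{mu}_Y on the grid
print(mu_hat_on_grid)
```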

Embedding Joint Distributions
Embed the joint distribution $P[Y, X]$ using the outer-product feature map $\phi(Y)\,\varphi(X)^{\top}$:
$$\mu_{YX} = \mathbb{E}\!\left[\phi(Y)\,\varphi(X)^{\top}\right], \qquad \widehat{\mu}_{YX} = \frac{1}{m}\sum_{i=1}^{m} \phi(y_i)\,\varphi(x_i)^{\top}$$
- $\mu_{YX}$ is also the covariance operator $\mathcal{C}_{YX}$.
- Discrete probabilities are recovered with delta kernels.
- The empirical estimate converges at rate $O_p(m^{-1/2})$.
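A sketch of evaluating the empirical joint embedding at a point $(y_0, x_0)$ via $\langle \phi(y_0)\varphi(x_0)^{\top}, \widehat{\mu}_{YX}\rangle = \frac{1}{m}\sum_i l(y_i, y_0)\,k(x_i, x_0)$; the dependent pair $(X, Y)$ and the RBF kernels are made up for illustration.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
m = 500
x = rng.standard_normal(m)
y = np.sin(x) + 0.1 * rng.standard_normal(m)      # a made-up dependent pair

y0, x0 = 0.5, 0.6
val = (rbf(y, np.array([y0])) * rbf(x, np.array([x0]))).mean()
print(val)                               # <phi(y0) varphi(x0)^T, mu_hat_YX>
```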

Embedding Conditional Distributions
[Figure: two conditional densities $P[Y \mid x_1]$ and $P[Y \mid x_2]$ over $Y$, with embeddings $\mu_{Y|x_1}$ and $\mu_{Y|x_2}$.]
$$\mathbb{E}[\phi(Y) \mid x], \qquad \phi(Y) = l(Y, \cdot) \in \mathcal{G} \ \text{(an RKHS)}$$
For each value $X = x$, return the summary statistic for $P[Y \mid X = x]$.
Problem: some values $X = x$ are never observed.

Embedding Conditional Distributions (continued)
To avoid partitioning the data by the value of $x$, use a conditional embedding operator:
$$\mu_{Y|x} = \mathcal{U}_{Y|X}\,\varphi(x), \qquad \varphi(X) = k(X, \cdot) \in \mathcal{F} \ \text{(an RKHS)}$$

Embedding Conditional Distributions
Estimation via covariance operators:
$$\mathcal{U}_{Y|X} := \mathcal{C}_{YX}\,\mathcal{C}_{XX}^{-1}, \qquad \widehat{\mathcal{U}}_{Y|X} = \Phi\,(K + \lambda I)^{-1}\,\Upsilon^{\top}$$
where $\Phi := (\phi(y_1), \ldots, \phi(y_m))$, $L = \Phi^{\top}\Phi$, $\Upsilon := (\varphi(x_1), \ldots, \varphi(x_m))$, and $K = \Upsilon^{\top}\Upsilon$.
- Gaussian case: covariance matrices.
- Discrete case: the joint probability matrix divided by the marginal.
- The empirical estimate converges at rate $O_p\!\left(\lambda^{1/2} + (\lambda m)^{-1/2}\right)$.
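A minimal sketch, assuming scalar data, Gaussian RBF kernels, and an arbitrary ridge parameter $\lambda$. The estimator never forms $\widehat{\mathcal{U}}_{Y|X}$ explicitly; it computes the weight vector $\beta(x_*) = (K + \lambda I)^{-1}\,\Upsilon^{\top}\varphi(x_*)$, so that $\widehat{\mu}_{Y|x_*} = \sum_i \beta_i\,\phi(y_i)$.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
m, lam = 300, 0.1                        # sample size, ridge parameter (arbitrary)
x = rng.uniform(-2, 2, m)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(m)

K = rbf(x, x)                            # K = Upsilon^T Upsilon, Gram matrix of X
x_star = np.array([0.7])
beta = np.linalg.solve(K + lam * np.eye(m), rbf(x, x_star)).ravel()

# Pairing hat{mu}_{Y|x*} with the identity feature on Y approximates
# E[Y | x = 0.7]; this coincides with kernel ridge regression.
print(beta @ y, np.sin(2 * 0.7))
```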

Direct Correspondence
$$\widehat{\Sigma}_{YX} = \frac{1}{N}\sum_{i=1}^{N} y_i\,x_i^{\top} \quad\Longleftrightarrow\quad \widehat{\mathcal{C}}_{YX} = \frac{1}{N}\sum_{i=1}^{N} \phi(y_i)\,\varphi(x_i)^{\top}$$
$$\widehat{\mu}_X = \frac{1}{N}\sum_{i=1}^{N} x_i \quad\Longleftrightarrow\quad \widehat{\mu}_X = \frac{1}{N}\sum_{i=1}^{N} \varphi(x_i)$$

Key Rules for Inference
Sum rule: $P[Y] = \int_X P[Y \mid X]\,P[X]$
Product rule: $P[Y, X] = P[Y \mid X]\,P[X]$
Bayes rule: $P[X \mid Y] = \dfrac{P[Y \mid X]\,P[X]}{\int_X P[Y \mid X]\,P[X]}$
Goal: do probabilistic inference in feature space.

Product Rule
$$P[Y, X] = P[Y \mid X]\,P[X]$$
Discrete: $\Sigma_{Y|X} = \Sigma_{YX}\,\Sigma_{XX}^{-1}$, so $\Sigma_{YX} = \Sigma_{Y|X}\,\Sigma_{XX}$.
HSE: $\mathcal{C}_{Y|X} = \mathcal{C}_{YX}\,\mathcal{C}_{XX}^{-1}$, so $\mathcal{C}_{YX} = \mathcal{C}_{Y|X}\,\mathcal{C}_{XX}$.

Sum Rule
$$P[Y] = \sum_X P[Y, X] = \sum_X P[Y \mid X]\,P[X]$$
Discrete: $\mu_Y = \Sigma_{YX}\,\mathbf{1} = \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\mu_X$.
HSE: $\mu_Y = \mathcal{C}_{YX}\,\mathcal{C}_{XX}^{-1}\,\mu_X$.

Bayes Rule
$$P[X \mid Y] = \frac{P[Y \mid X]\,P[X]}{P[Y]}$$
Discrete: $\Sigma_{X|Y} = (\Sigma_{Y|X}\,\Sigma_{XX})^{\top}\,\Sigma_{YY}^{-1} = \Sigma_{XY}\,\Sigma_{YY}^{-1}$.
HSE: $\mathcal{C}_{X|Y} = (\mathcal{C}_{Y|X}\,\mathcal{C}_{XX})^{\top}\,\mathcal{C}_{YY}^{-1} = \mathcal{C}_{XY}\,\mathcal{C}_{YY}^{-1}$.

Overview
- Multinomial distributions: marginal, joint, and conditional distributions; sum, product, and Bayes rules
- Hilbert space embeddings: marginal, joint, and conditional embeddings; sum, product, and Bayes rules
- Gram/kernel matrices

Jørgen Gram

Gram/Kernel Matrices
$$\widehat{\mathcal{C}}_{YX} = \frac{1}{N}\sum_{i=1}^{N} \phi(y_i)\,\varphi(x_i)^{\top} = \frac{1}{N}\,\Phi_Y\,\Phi_X^{\top} \in \mathbb{R}^{\infty \times \infty}$$
$$\widehat{\mathcal{C}}_{XX} = \frac{1}{N}\sum_{i=1}^{N} \varphi(x_i)\,\varphi(x_i)^{\top} = \frac{1}{N}\,\Phi_X\,\Phi_X^{\top} \in \mathbb{R}^{\infty \times \infty}, \qquad \mu_x = \varphi(x) \in \mathbb{R}^{\infty \times 1}$$
We would like to calculate $\mu_{Y|x} = \widehat{\mathcal{C}}_{YX}\,\widehat{\mathcal{C}}_{XX}^{-1}\,\mu_x$.

Gram/Kernel Matrices (continued)
$$\mu_{Y|x} = \widehat{\mathcal{C}}_{YX}\,\widehat{\mathcal{C}}_{XX}^{-1}\,\mu_x$$
With ridge regularization, the $1/N$ factors cancel and
$$\widehat{\mu}_{Y|x} = \Phi_Y\,\Phi_X^{\top}\left(\Phi_X\,\Phi_X^{\top} + N\lambda I\right)^{-1}\varphi(x)$$
By the matrix inversion (Woodbury) lemma,
$$\widehat{\mu}_{Y|x} = \Phi_Y\left(\Phi_X^{\top}\Phi_X + N\lambda I\right)^{-1}\Phi_X^{\top}\varphi(x) = \Phi_Y\left(G_{XX} + N\lambda I\right)^{-1}G_{XX}(:, i)$$
where $G_{XX} = \Phi_X^{\top}\Phi_X \in \mathbb{R}^{N \times N}$, and $G_{XX}(:, i) = \Phi_X^{\top}\varphi(x_i) \in \mathbb{R}^{N \times 1}$ when the query point is a training point $x = x_i$.
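The matrix inversion step is easy to verify numerically with an explicit finite-dimensional feature map (random features here stand in for the RKHS maps, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 50, 5, 0.1                   # samples, finite feature dim, ridge

Phi_X = rng.standard_normal((d, N))      # columns are features varphi(x_i)
Phi_Y = rng.standard_normal((d, N))      # columns are features phi(y_i)
phi_x = rng.standard_normal(d)           # feature vector of a query point

# Feature-space form: Phi_Y Phi_X^T (Phi_X Phi_X^T + N lam I_d)^{-1} phi(x)
lhs = Phi_Y @ Phi_X.T @ np.linalg.solve(Phi_X @ Phi_X.T + N * lam * np.eye(d), phi_x)

# Gram-matrix form: Phi_Y (G_XX + N lam I_N)^{-1} Phi_X^T phi(x)
G_XX = Phi_X.T @ Phi_X
rhs = Phi_Y @ np.linalg.solve(G_XX + N * lam * np.eye(N), Phi_X.T @ phi_x)

print(np.allclose(lhs, rhs))             # True: both forms agree
```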

Hilbert Space Embeddings of Distributions
- An alternative to (for example) exponential families and Parzen windows (kernel density estimation).
- Represent arbitrary distributions in feature spaces, and reason with the Hilbert space sum, product, and Bayes rules.
- Learning and inference reduce to linear algebra.
- State space models can be extended nonparametrically to domains defined by kernels.