CMPE 58K Bayesian Statistics and Machine Learning
Lecture 5: Multivariate distributions: Gaussian, Bernoulli, probability tables
Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
Instructor: A. Taylan Cemgil
Fall 2009
Multivariate Bernoulli

A probability table with binary states:

            x_2 = 0   x_2 = 1
  x_1 = 0    p_{00}    p_{01}
  x_1 = 1    p_{10}    p_{11}

$p(x_1, x_2) = p_{00}^{(1-x_1)(1-x_2)}\, p_{10}^{x_1(1-x_2)}\, p_{01}^{(1-x_1)x_2}\, p_{11}^{x_1 x_2}$

$= \exp\big( (1-x_1)(1-x_2)\log p_{00} + x_1(1-x_2)\log p_{10} + (1-x_1)x_2\log p_{01} + x_1 x_2 \log p_{11} \big)$

$= \exp\big( \log p_{00} + x_1 \log\tfrac{p_{10}}{p_{00}} + x_2 \log\tfrac{p_{01}}{p_{00}} + x_1 x_2 \log\tfrac{p_{00}p_{11}}{p_{10}p_{01}} \big)$
Multivariate Bernoulli: Independent case

$p(x_1, x_2) = p(x_1)\,p(x_2)$:

            x_2 = 0   x_2 = 1
  x_1 = 0    p_0 q_0   p_0 q_1
  x_1 = 1    p_1 q_0   p_1 q_1

$p(x_1, x_2) = \exp\big( \log p_0 q_0 + x_1 \log\tfrac{p_1 q_0}{p_0 q_0} + x_2 \log\tfrac{p_0 q_1}{p_0 q_0} + x_1 x_2 \log\tfrac{p_0 q_0\, p_1 q_1}{p_1 q_0\, p_0 q_1} \big)$

$= \exp\big( \log p_0 q_0 + x_1 \log\tfrac{p_1}{p_0} + x_2 \log\tfrac{q_1}{q_0} + x_1 x_2 \log 1 \big)$

The coupling parameter is zero.
Multivariate Bernoulli: Canonical parameters

$p(x_1, x_2) = \exp\big( g + h_1 x_1 + h_2 x_2 + \theta_{1,2}\, x_1 x_2 \big)$

$h_1 = \log\tfrac{p_{10}}{p_{00}} \qquad h_2 = \log\tfrac{p_{01}}{p_{00}} \qquad \theta_{1,2} = \log\tfrac{p_{00}p_{11}}{p_{10}p_{01}}$

- $h_i$: threshold
- $\theta_{i,j}$: pairwise coupling
- The $p$'s are the moment parameters.

Coupling parameters are harder to interpret but provide direct information about the conditional independence structure.
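The table-to-canonical conversion above is mechanical. A minimal numpy sketch (the table values are illustrative, not from the slides):

```python
import numpy as np

# Convert a 2x2 probability table P[x1, x2] into the canonical
# parameters (g, h1, h2, theta12) of the bivariate Bernoulli.
P = np.array([[0.4, 0.2],
              [0.1, 0.3]])   # rows: x1 = 0, 1; columns: x2 = 0, 1

g     = np.log(P[0, 0])
h1    = np.log(P[1, 0] / P[0, 0])
h2    = np.log(P[0, 1] / P[0, 0])
theta = np.log(P[0, 0] * P[1, 1] / (P[1, 0] * P[0, 1]))

# Sanity check: exp(g + h1*x1 + h2*x2 + theta*x1*x2) recovers the table.
for x1 in (0, 1):
    for x2 in (0, 1):
        assert np.isclose(np.exp(g + h1*x1 + h2*x2 + theta*x1*x2), P[x1, x2])
```

A positive $\theta$ here (the table above gives $\theta = \log 6$) means the two variables are positively coupled; $\theta = 0$ recovers the independent case.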
Example

$x_i \in \{0, 1\}$

$p(x_1, x_2) \propto \exp(2x_1 - 0.5x_2 - 3x_1 x_2)$
$q(x_1, x_2) \propto \exp(2x_1 - 0.5x_2 + 0.01x_1 x_2)$

Are $x_1$ and $x_2$ independent, given $p$ or given $q$? What is $p(x_1)$? $q(x_1)$?
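The marginals in this example can be obtained by brute-force normalisation over $\{0,1\}^2$. A numpy sketch, reading the example's coefficients as $2x_1 - 0.5x_2$ with coupling $-3$ under $p$ and $+0.01$ under $q$ (the signs are an assumption here):

```python
import numpy as np

def table(h1, h2, theta):
    """Normalised table of exp(h1*x1 + h2*x2 + theta*x1*x2) over {0,1}^2."""
    T = np.array([[np.exp(h1*x1 + h2*x2 + theta*x1*x2)
                   for x2 in (0, 1)] for x1 in (0, 1)])
    return T / T.sum()

p = table(2.0, -0.5, -3.0)
q = table(2.0, -0.5, 0.01)

p_x1 = p.sum(axis=1)   # marginal p(x1)
q_x1 = q.sum(axis=1)   # marginal q(x1)
print(p_x1, q_x1)

# Neither coupling is exactly zero, so x1 and x2 are dependent under
# both p and q -- though under q the dependence is very weak.
```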
Odds and the Cross Product Ratio (cpr)

For two events $A$ and $B$:

$\mathrm{odds}(A) = \frac{p(A)}{p(A^c)} \qquad \mathrm{odds}(A \mid B) = \frac{p(A \mid B)}{p(A^c \mid B)}$

$\mathrm{cpr}(A, B) = \frac{p(A \cap B)\, p(A^c \cap B^c)}{p(A^c \cap B)\, p(A \cap B^c)} = \mathrm{odds}(A \mid B)\,/\,\mathrm{odds}(A \mid B^c)$

When the events $A$ and $B$ are independent, we have $p(A \mid B) = p(A)$ and

$\mathrm{cpr}(A, B) = \mathrm{odds}(A)/\mathrm{odds}(A) = 1$
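These definitions can be checked on a small joint table. A sketch, with $J[a,b] = p(A{=}a, B{=}b)$ a hypothetical table (not from the slides):

```python
import numpy as np

def odds_given(J, b):
    """odds(A=1 | B=b) = p(A=1 | B=b) / p(A=0 | B=b)."""
    return J[1, b] / J[0, b]

def cpr(J):
    """Cross-product ratio p11*p00 / (p01*p10) = odds(A|B=1) / odds(A|B=0)."""
    return (J[1, 1] * J[0, 0]) / (J[0, 1] * J[1, 0])

# Independent case: the joint table factorises, so cpr == 1.
pa, pb = np.array([0.3, 0.7]), np.array([0.6, 0.4])
J_indep = np.outer(pa, pb)
assert np.isclose(cpr(J_indep), 1.0)

# cpr is also the ratio of conditional odds, as defined above.
assert np.isclose(cpr(J_indep), odds_given(J_indep, 1) / odds_given(J_indep, 0))
```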
Odds and the Cross Product Ratio (cpr)

The univariate Bernoulli distribution is parametrised by the event $A : \{x_1 = 1\}$:

$p(x_1) = \exp\big( \log p_0 + x_1 \log\tfrac{p_1}{p_0} \big) \propto \exp\big( x_1 \log \mathrm{odds}(x_1{=}1) \big)$

Bivariate Bernoulli distribution:

$p(x_1, x_2) = \exp\big( \log p_{00} + x_1 \log\tfrac{p_{10}}{p_{00}} + x_2 \log\tfrac{p_{01}}{p_{00}} + x_1 x_2 \log\tfrac{p_{00}p_{11}}{p_{10}p_{01}} \big)$

$\propto \exp\big( x_1 \log \mathrm{odds}(x_1{=}1 \mid x_2{=}0) + x_2 \log \mathrm{odds}(x_2{=}1 \mid x_1{=}0) + x_1 x_2 \log \mathrm{cpr}(x_1{=}1, x_2{=}1) \big)$
Multivariate Bernoulli

For the trivariate case, we have pairwise and triple interactions:

$p(x_1, x_2, x_3) = p_{000}^{(1-x_1)(1-x_2)(1-x_3)}\, p_{100}^{x_1(1-x_2)(1-x_3)}\, p_{010}^{(1-x_1)x_2(1-x_3)}\, p_{110}^{x_1 x_2 (1-x_3)}\, p_{001}^{(1-x_1)(1-x_2)x_3}\, p_{101}^{x_1(1-x_2)x_3}\, p_{011}^{(1-x_1)x_2 x_3}\, p_{111}^{x_1 x_2 x_3}$

$\log p(x) = \log p_{000} + \log\tfrac{p_{100}}{p_{000}}\, x_1 + \log\tfrac{p_{010}}{p_{000}}\, x_2 + \log\tfrac{p_{001}}{p_{000}}\, x_3$
$\quad + \log\tfrac{p_{000}p_{110}}{p_{010}p_{100}}\, x_1 x_2 + \log\tfrac{p_{000}p_{101}}{p_{001}p_{100}}\, x_1 x_3 + \log\tfrac{p_{000}p_{011}}{p_{001}p_{010}}\, x_2 x_3$
$\quad + \log\tfrac{p_{001}p_{010}p_{100}p_{111}}{p_{000}p_{011}p_{101}p_{110}}\, x_1 x_2 x_3$
$\log p(x) = \log p_{000} + \log \mathrm{odds}(x_1{=}1 \mid x_2{=}0, x_3{=}0)\, x_1 + \log \mathrm{odds}(x_2{=}1 \mid x_1{=}0, x_3{=}0)\, x_2 + \log \mathrm{odds}(x_3{=}1 \mid x_1{=}0, x_2{=}0)\, x_3$
$\quad + \log \mathrm{cpr}(x_1{=}1, x_2{=}1 \mid x_3{=}0)\, x_1 x_2 + \log \mathrm{cpr}(x_1{=}1, x_3{=}1 \mid x_2{=}0)\, x_1 x_3 + \log \mathrm{cpr}(x_2{=}1, x_3{=}1 \mid x_1{=}0)\, x_2 x_3$
$\quad + \log \frac{\mathrm{cpr}(x_1{=}1, x_2{=}1 \mid x_3{=}1)}{\mathrm{cpr}(x_1{=}1, x_2{=}1 \mid x_3{=}0)}\, x_1 x_2 x_3$
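The trivariate log-linear coefficients can be read off any positive $2{\times}2{\times}2$ table and verified to reconstruct $\log p(x)$ exactly. A numpy sketch (random table, purely illustrative):

```python
import numpy as np

# Read the canonical coefficients off a 2x2x2 table P[x1, x2, x3]
# and verify that they reconstruct log p(x) at every state.
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()                # a random positive, normalised table
L = np.log(P)

lam0   = L[0, 0, 0]
lam1   = L[1, 0, 0] - L[0, 0, 0]          # log odds(x1=1 | x2=0, x3=0)
lam2   = L[0, 1, 0] - L[0, 0, 0]
lam3   = L[0, 0, 1] - L[0, 0, 0]
lam12  = L[0, 0, 0] + L[1, 1, 0] - L[0, 1, 0] - L[1, 0, 0]   # log cpr(.|x3=0)
lam13  = L[0, 0, 0] + L[1, 0, 1] - L[0, 0, 1] - L[1, 0, 0]
lam23  = L[0, 0, 0] + L[0, 1, 1] - L[0, 0, 1] - L[0, 1, 0]
lam123 = (L[1, 1, 1] + L[1, 0, 0] + L[0, 1, 0] + L[0, 0, 1]
          - L[1, 1, 0] - L[1, 0, 1] - L[0, 1, 1] - L[0, 0, 0])

for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            rec = (lam0 + lam1*x1 + lam2*x2 + lam3*x3
                   + lam12*x1*x2 + lam13*x1*x3 + lam23*x2*x3
                   + lam123*x1*x2*x3)
            assert np.isclose(rec, L[x1, x2, x3])
```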
Canonical Parameters versus Moment Parameters

- Canonical parameters are harder to interpret but give direct information about the independence structure.
- Inference in exponential family models turns out to be just conversion from a canonical representation to a moment representation. This sounds simple but can easily be intractable.
The Multivariate Gaussian Distribution

$\mathcal{N}(s; \mu, P)$: $\mu$ is the mean and $P$ is the covariance.

$\mathcal{N}(s; \mu, P) = |2\pi P|^{-1/2} \exp\big( -\tfrac{1}{2}(s - \mu)^\top P^{-1} (s - \mu) \big)$
$= \exp\big( -\tfrac{1}{2} s^\top P^{-1} s + \mu^\top P^{-1} s - \tfrac{1}{2}\mu^\top P^{-1}\mu - \tfrac{1}{2}\log|2\pi P| \big)$

$\log \mathcal{N}(s; \mu, P) = -\tfrac{1}{2} s^\top P^{-1} s + \mu^\top P^{-1} s + \text{const}$
$= -\tfrac{1}{2}\mathrm{Tr}\big(P^{-1} s s^\top\big) + \mu^\top P^{-1} s + \text{const}$
$=^{+} -\tfrac{1}{2}\mathrm{Tr}\big(P^{-1} s s^\top\big) + \mu^\top P^{-1} s$

Notation: $\log f(x) =^{+} g(x) \iff f(x) \propto \exp(g(x)) \iff \exists\, c \in \mathbb{R}: f(x) = c \exp(g(x))$
Gaussian potentials

Consider a Gaussian potential with mean $\mu$ and covariance $\Sigma$ on $x$:

$\phi(x) = \alpha\, \mathcal{N}(\mu, \Sigma) = \alpha\, |2\pi\Sigma|^{-1/2} \exp\big( -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) \big)$

where $\int dx\, \phi(x) = \alpha$ and $|2\pi\Sigma|$ is short notation for $(2\pi)^d \det\Sigma$, with $\Sigma$ being $d \times d$. If $\alpha = 1$ the potential is normalised. A general Gaussian potential $\phi$ need not be normalised, so $\alpha$ is in fact an arbitrary positive constant. The exponent is just a quadratic form.
Canonical Form

$\phi(x) = \exp\big\{ \log\alpha - \tfrac{1}{2}\log|2\pi\Sigma| - \tfrac{1}{2}\mu^\top\Sigma^{-1}\mu + \mu^\top\Sigma^{-1}x - \tfrac{1}{2}x^\top\Sigma^{-1}x \big\}$
$= \exp\big( g + h^\top x - \tfrac{1}{2} x^\top K x \big)$

An alternative to the conventional and intuitive moment form. Here we represent the potential by the polynomial coefficients $h$ and $K$, which serve as the natural parameters.
Canonical and Moment parametrisations

The moment parameters and canonical parameters are related by

$K = \Sigma^{-1}$
$h = \Sigma^{-1}\mu$
$g = \log\alpha - \tfrac{1}{2}\log|2\pi\Sigma| - \tfrac{1}{2}\mu^\top\Sigma^{-1}\Sigma\Sigma^{-1}\mu = \log\alpha + \tfrac{1}{2}\log\left|\tfrac{K}{2\pi}\right| - \tfrac{1}{2}h^\top K^{-1}h$
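These relations define an invertible change of parametrisation. A minimal numpy sketch of the round trip (example values illustrative):

```python
import numpy as np

# Convert a Gaussian potential between moment (alpha, mu, Sigma)
# and canonical (g, h, K) parametrisations, and check the round trip.
def to_canonical(alpha, mu, Sigma):
    K = np.linalg.inv(Sigma)
    h = K @ mu
    g = (np.log(alpha)
         + 0.5 * np.linalg.slogdet(K / (2 * np.pi))[1]   # (1/2) log |K / 2pi|
         - 0.5 * h @ np.linalg.solve(K, h))               # (1/2) h' K^{-1} h
    return g, h, K

def to_moment(g, h, K):
    Sigma = np.linalg.inv(K)
    mu = Sigma @ h
    log_alpha = (g
                 - 0.5 * np.linalg.slogdet(K / (2 * np.pi))[1]
                 + 0.5 * h @ mu)
    return np.exp(log_alpha), mu, Sigma

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu = np.array([1.0, -1.0])
g, h, K = to_canonical(1.0, mu, Sigma)
alpha2, mu2, Sigma2 = to_moment(g, h, K)
assert np.isclose(alpha2, 1.0) and np.allclose(mu2, mu) and np.allclose(Sigma2, Sigma)
```

Using `slogdet` rather than `det` keeps the conversion numerically stable in higher dimensions.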
Jointly Gaussian Vectors

Moment form:

$\phi(x_1, x_2) = \alpha\, \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$

$\phi = \alpha\, |2\pi\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^\top \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right)$

Canonical form:

$\phi(x_1, x_2) = \exp\left( g + \begin{pmatrix} h_1 \\ h_2 \end{pmatrix}^\top \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \tfrac{1}{2} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^\top \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right)$

We need to find a parametric representation of $K = \Sigma^{-1}$ in terms of the partitions $\Sigma_{11}, \Sigma_{12}, \Sigma_{21}, \Sigma_{22}$.
Partitioned Matrix Inverse

Strategy: we will find two matrices $L$ and $R$ such that $W = L\Sigma R$ becomes block diagonal. Then

$L\Sigma R = W \quad\Rightarrow\quad \Sigma = L^{-1} W R^{-1} \quad\Rightarrow\quad \Sigma^{-1} = R W^{-1} L = K$
Gauss Transformations

To add a multiple of row $s$ to row $t$, premultiply $\Sigma$ with $L(s, t)$, where

$L_{i,j}(s,t) = \begin{cases} 1, & i = j \\ \gamma, & i = t \text{ and } j = s \\ 0, & \text{o/w} \end{cases}$

Example: $s = 2$, $t = 1$:

$\begin{pmatrix} 1 & \gamma \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a + \gamma c & b + \gamma d \\ c & d \end{pmatrix}$

The inverse just subtracts what was added:

$\begin{pmatrix} 1 & \gamma \\ 0 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} 1 & -\gamma \\ 0 & 1 \end{pmatrix}$
Gauss Transformations

Given $\Sigma$, to add a multiple of column $s$ to column $t$, postmultiply $\Sigma$ with $R(s, t)$, where

$R_{i,j}(s,t) = \begin{cases} 1, & i = j \\ \gamma, & i = s \text{ and } j = t \\ 0, & \text{o/w} \end{cases}$

Example: $s = 2$, $t = 1$:

$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \gamma & 1 \end{pmatrix} = \begin{pmatrix} a + \gamma b & b \\ c + \gamma d & d \end{pmatrix}$
Scalar example

$L\Sigma = \begin{pmatrix} 1 & -bd^{-1} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a - bd^{-1}c & b - bd^{-1}d \\ c & d \end{pmatrix} = \begin{pmatrix} a - bd^{-1}c & 0 \\ c & d \end{pmatrix}$

$L\Sigma R = \begin{pmatrix} a - bd^{-1}c & 0 \\ c & d \end{pmatrix} \begin{pmatrix} 1 & 0 \\ -d^{-1}c & 1 \end{pmatrix} = \begin{pmatrix} a - bd^{-1}c & 0 \\ c - dd^{-1}c & d \end{pmatrix} = \begin{pmatrix} a - bd^{-1}c & 0 \\ 0 & d \end{pmatrix} = W$
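The $2 \times 2$ block-diagonalisation can be checked numerically. A numpy sketch with illustrative values, pivoting on $d$ as above:

```python
import numpy as np

# Diagonalise a 2x2 Sigma with Gauss transformations and recover its inverse.
a, b, c, d = 4.0, 1.0, 2.0, 3.0
Sigma = np.array([[a, b], [c, d]])

L = np.array([[1.0, -b / d], [0.0, 1.0]])   # clears the (1,2) entry
R = np.array([[1.0, 0.0], [-c / d, 1.0]])   # clears the (2,1) entry
W = L @ Sigma @ R
assert np.allclose(W, np.diag([a - b * c / d, d]))

# Sigma^{-1} = R W^{-1} L, since W is diagonal and easy to invert.
assert np.allclose(R @ np.linalg.inv(W) @ L, np.linalg.inv(Sigma))
```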
Scalar example (cont.)

$\Sigma = L^{-1} W R^{-1} \quad\Rightarrow\quad \Sigma^{-1} = R W^{-1} L$

$\Sigma^{-1} = \begin{pmatrix} 1 & 0 \\ -d^{-1}c & 1 \end{pmatrix} \begin{pmatrix} (a - bd^{-1}c)^{-1} & 0 \\ 0 & d^{-1} \end{pmatrix} \begin{pmatrix} 1 & -bd^{-1} \\ 0 & 1 \end{pmatrix}$

$= \begin{pmatrix} (a - bd^{-1}c)^{-1} & -(a - bd^{-1}c)^{-1} bd^{-1} \\ -d^{-1}c\,(a - bd^{-1}c)^{-1} & d^{-1} + d^{-1}c\,(a - bd^{-1}c)^{-1} bd^{-1} \end{pmatrix}$
Scalar example

We could also pivot on $a$:

$L\Sigma = \begin{pmatrix} 1 & 0 \\ -ca^{-1} & 1 \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a & b \\ 0 & d - ca^{-1}b \end{pmatrix}$

$L\Sigma R = \begin{pmatrix} a & b \\ 0 & d - ca^{-1}b \end{pmatrix} \begin{pmatrix} 1 & -a^{-1}b \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} a & 0 \\ 0 & d - ca^{-1}b \end{pmatrix} = W$

$\Sigma^{-1} = R W^{-1} L = \begin{pmatrix} a^{-1} + a^{-1}b\,(d - ca^{-1}b)^{-1} ca^{-1} & -a^{-1}b\,(d - ca^{-1}b)^{-1} \\ -(d - ca^{-1}b)^{-1} ca^{-1} & (d - ca^{-1}b)^{-1} \end{pmatrix}$
Partitioned Matrix Inverse

In the matrix case, this leads to the following dual factorisations of $\Sigma$:

$\Sigma = \begin{pmatrix} I & \Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix} \begin{pmatrix} I & 0 \\ \Sigma_{22}^{-1}\Sigma_{21} & I \end{pmatrix}$

$= \begin{pmatrix} I & 0 \\ \Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix} \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \end{pmatrix} \begin{pmatrix} I & \Sigma_{11}^{-1}\Sigma_{12} \\ 0 & I \end{pmatrix}$
The Schur Complement

We will introduce the notation

$\Sigma/\Sigma_{22} \equiv \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \qquad \Sigma/\Sigma_{11} \equiv \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$

Determinant: $|\Sigma| = |\Sigma/\Sigma_{11}|\,|\Sigma_{11}| = |\Sigma/\Sigma_{22}|\,|\Sigma_{22}|$

$\Sigma^{-1} = \begin{pmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{pmatrix} \begin{pmatrix} (\Sigma/\Sigma_{22})^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix} \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix}$

$= \begin{pmatrix} I & -\Sigma_{11}^{-1}\Sigma_{12} \\ 0 & I \end{pmatrix} \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & (\Sigma/\Sigma_{11})^{-1} \end{pmatrix} \begin{pmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}$
Partitioned Matrix Inverse

$\begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1}$

$= \begin{pmatrix} (\Sigma/\Sigma_{22})^{-1} & -(\Sigma/\Sigma_{22})^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}(\Sigma/\Sigma_{22})^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}(\Sigma/\Sigma_{22})^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{pmatrix}$

$= \begin{pmatrix} \Sigma_{11}^{-1} + \Sigma_{11}^{-1}\Sigma_{12}(\Sigma/\Sigma_{11})^{-1}\Sigma_{21}\Sigma_{11}^{-1} & -\Sigma_{11}^{-1}\Sigma_{12}(\Sigma/\Sigma_{11})^{-1} \\ -(\Sigma/\Sigma_{11})^{-1}\Sigma_{21}\Sigma_{11}^{-1} & (\Sigma/\Sigma_{11})^{-1} \end{pmatrix}$

Quite complicated-looking formulas, but straightforward to implement.

Caution: $\Sigma_{11}^{-1} \neq K_{11}$ in general!
Matrix Inversion Lemma

Read off the diagonal entries:

$\big(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big)^{-1} = \Sigma_{11}^{-1} + \Sigma_{11}^{-1}\Sigma_{12}(\Sigma/\Sigma_{11})^{-1}\Sigma_{21}\Sigma_{11}^{-1}$

$\big(A - BC^{-1}D\big)^{-1} = A^{-1} + A^{-1}B\,\big(C - DA^{-1}B\big)^{-1}\,DA^{-1}$
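The lemma is easy to verify numerically. A numpy sketch with illustrative, well-conditioned random matrices:

```python
import numpy as np

# Check (A - B C^{-1} D)^{-1} = A^{-1} + A^{-1} B (C - D A^{-1} B)^{-1} D A^{-1}
rng = np.random.default_rng(2)
A = rng.random((3, 3)) + 5 * np.eye(3)   # diagonally dominant, hence invertible
C = rng.random((2, 2)) + 5 * np.eye(2)
B = rng.random((3, 2))
D = rng.random((2, 3))

inv = np.linalg.inv
lhs = inv(A - B @ inv(C) @ D)
rhs = inv(A) + inv(A) @ B @ inv(C - D @ inv(A) @ B) @ D @ inv(A)
assert np.allclose(lhs, rhs)
```

The practical payoff: when $A$ is large but cheap to invert (e.g. diagonal) and $C$ is small, the right-hand side replaces one large inverse by one small one.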
Factorisation of Multivariate Gaussians

Consider the joint distribution over the variable

$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$

where the joint distribution is Gaussian, $p(x) = \mathcal{N}(x; \mu, \Sigma)$, with

$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_2 \end{pmatrix}$
Factorisation of Multivariate Gaussians

Find the following:

1. Conditionals: (a) $p(x_1 \mid x_2)$, (b) $p(x_2 \mid x_1)$
2. Marginals: (a) $p(x_1)$, (b) $p(x_2)$
Factorisation of Multivariate Gaussians

Using the partitioned inverse equations, we rearrange

$p(x_1, x_2) \propto \exp\left( -\tfrac{1}{2} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^\top \begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_2 \end{pmatrix}^{-1} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right)$

to bring the expression into the form $p(x_1)\,p(x_2 \mid x_1)$ or $p(x_2)\,p(x_1 \mid x_2)$, where the marginal and conditional can be easily identified. See also Bishop, section 2.3.
Factorisation of Multivariate Gaussians

We have the two decompositions

$\begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_2 \end{pmatrix}^{-1} = \begin{pmatrix} (\Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^\top)^{-1} & -(\Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^\top)^{-1}\Sigma_{12}\Sigma_2^{-1} \\ -\Sigma_2^{-1}\Sigma_{12}^\top(\Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^\top)^{-1} & \Sigma_2^{-1} + \Sigma_2^{-1}\Sigma_{12}^\top(\Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^\top)^{-1}\Sigma_{12}\Sigma_2^{-1} \end{pmatrix}$

$= \begin{pmatrix} \Sigma_1^{-1} + \Sigma_1^{-1}\Sigma_{12}(\Sigma_2 - \Sigma_{12}^\top\Sigma_1^{-1}\Sigma_{12})^{-1}\Sigma_{12}^\top\Sigma_1^{-1} & -\Sigma_1^{-1}\Sigma_{12}(\Sigma_2 - \Sigma_{12}^\top\Sigma_1^{-1}\Sigma_{12})^{-1} \\ -(\Sigma_2 - \Sigma_{12}^\top\Sigma_1^{-1}\Sigma_{12})^{-1}\Sigma_{12}^\top\Sigma_1^{-1} & (\Sigma_2 - \Sigma_{12}^\top\Sigma_1^{-1}\Sigma_{12})^{-1} \end{pmatrix}$

We let $s_i = x_i - \mu_i$ and use the first decomposition:

$p(s_1, s_2) \propto \exp\left( -\tfrac{1}{2} \begin{pmatrix} s_1 \\ s_2 \end{pmatrix}^\top \begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_2 \end{pmatrix}^{-1} \begin{pmatrix} s_1 \\ s_2 \end{pmatrix} \right)$
Completing the square in $s_1$,

$p(s_1, s_2) \propto \exp\left( -\tfrac{1}{2}\big(s_1 - \Sigma_{12}\Sigma_2^{-1}s_2\big)^\top \big(\Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^\top\big)^{-1} \big(s_1 - \Sigma_{12}\Sigma_2^{-1}s_2\big) - \tfrac{1}{2}\, s_2^\top \Sigma_2^{-1} s_2 \right)$

$\propto \mathcal{N}\big(s_1;\ \Sigma_{12}\Sigma_2^{-1}s_2,\ \Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^\top\big)\ \mathcal{N}\big(s_2;\ 0,\ \Sigma_2\big)$

$= \mathcal{N}\big(x_1;\ \mu_1 + \Sigma_{12}\Sigma_2^{-1}(x_2 - \mu_2),\ \Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^\top\big)\ \mathcal{N}\big(x_2;\ \mu_2,\ \Sigma_2\big)$

This leads to a factorisation of the form $p(x_2)\,p(x_1 \mid x_2)$. The second decomposition leads to the other factorisation, $p(x_1)\,p(x_2 \mid x_1)$.
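The factorisation $p(x_1, x_2) = p(x_2)\,p(x_1 \mid x_2)$ is exact for Gaussians, so it can be verified pointwise. A numpy sketch (dimensions and values illustrative):

```python
import numpy as np

# Check log N(x; mu, Sigma) = log N(x2; mu2, S2) + log N(x1; m(x2), C)
# with m(x2) = mu1 + S12 S2^{-1} (x2 - mu2) and C = S1 - S12 S2^{-1} S12'.
def log_gauss(x, mu, S):
    d = x - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * S)   # log |2 pi S|
    return -0.5 * (logdet + d @ np.linalg.solve(S, d))

mu1, mu2 = np.array([1.0]), np.array([-2.0])
S1  = np.array([[2.0]])
S12 = np.array([[0.7]])
S2  = np.array([[1.5]])
mu = np.concatenate([mu1, mu2])
S  = np.block([[S1, S12], [S12.T, S2]])

x1, x2 = np.array([0.3]), np.array([-1.0])
joint = log_gauss(np.concatenate([x1, x2]), mu, S)

cond_mean = mu1 + S12 @ np.linalg.solve(S2, x2 - mu2)
cond_cov  = S1 - S12 @ np.linalg.solve(S2, S12.T)
factored = log_gauss(x2, mu2, S2) + log_gauss(x1, cond_mean, cond_cov)
assert np.isclose(joint, factored)
```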
Summary

Keywords:
- Multivariate Bernoulli
- Multivariate Gaussian
- Partitioned inverse equations, matrix inversion lemma