Course 495: Advanced Statistical Machine Learning/Pattern Recognition


Course 495: Advanced Statistical Machine Learning/Pattern Recognition
Goal (Lecture): To present Probabilistic Principal Component Analysis (PPCA) using both Maximum Likelihood (ML) and Expectation Maximization (EM).
Goal (Tutorials): To provide the students with the necessary mathematical tools for deeply understanding PPCA.

Materials
Pattern Recognition & Machine Learning by C. Bishop, Chapter 12.
PPCA: Tipping, Michael E., and Christopher M. Bishop. "Probabilistic principal component analysis." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61.3 (1999): 611-622.
Mixtures of PPCA: Tipping, Michael E., and Christopher M. Bishop. "Mixtures of probabilistic principal component analyzers." Neural Computation 11.2 (1999): 443-482.

An overview
PCA: Maximize the global variance.
LDA: Minimize the within-class variance while maximizing the between-class variance.
LPP: Minimize the local variance.
ICA: Maximize independence by maximizing non-Gaussianity.
All are deterministic!

Advantages of PPCA over PCA
An EM algorithm for PCA that is computationally efficient in situations where only a few leading eigenvectors are required, and that avoids having to evaluate the data covariance matrix as an intermediate step.
A combination of a probabilistic model and EM allows us to deal with missing values in the dataset.
(Figure: data matrices with missing values.)

Advantages of PPCA over PCA
Mixtures of probabilistic PCA models can be formulated in a principled way and trained using the EM algorithm.
(Figure: mixture components with parameters $W_1$, $W_5$, $W_9$.)

Advantages of PPCA over PCA
Probabilistic PCA forms the basis for a Bayesian treatment of PCA in which the dimensionality of the principal subspace can be found automatically from the data.
Priors on W: Automatic Relevance Determination (find the relevant components), with $w_i$ the $i$-th column of $W$:
$p(W \mid \alpha) = \prod_{i=1}^{d} \left(\frac{\alpha_i}{2\pi}\right)^{F/2} \exp\left\{-\frac{1}{2}\alpha_i\, w_i^T w_i\right\}$

Advantages of PPCA over PCA
Probabilistic PCA can be used to model class-conditional densities and hence be applied to classification problems.
(Figure: class-conditional densities.)
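
As a concrete illustration (not from the slides), here is a minimal classification sketch in NumPy/SciPy. It assumes the per-class PPCA parameters $(W_c, \mu_c, \sigma_c^2)$ have already been estimated, and it uses the PPCA marginal $x \sim \mathcal{N}(\mu_c, W_c W_c^T + \sigma_c^2 I)$ that is derived later in this lecture; all names and values below are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_class_scores(x, class_params, log_priors):
    """Unnormalised log-posteriors log p(x | c) + log p(c), one per class."""
    scores = []
    for (W, mu, sigma2), log_prior in zip(class_params, log_priors):
        D = W @ W.T + sigma2 * np.eye(W.shape[0])   # PPCA marginal covariance of class c
        scores.append(multivariate_normal.logpdf(x, mean=mu, cov=D) + log_prior)
    return np.array(scores)

# Hypothetical usage: two classes, F = 5 observed dimensions, d = 2 latent dimensions
rng = np.random.default_rng(0)
class_params = [(rng.standard_normal((5, 2)), rng.standard_normal(5), 0.1) for _ in range(2)]
x = rng.standard_normal(5)
print(np.argmax(ppca_class_scores(x, class_params, np.log([0.5, 0.5]))))  # predicted class
```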

Advantages of PPCA over PCA
The probabilistic PCA model can be run generatively to provide samples from the distribution.

Probabilistic Principal Component Analysis
General concept: latent variables $y_1, y_2, y_3, \dots, y_N$ share a common linear structure that generates the observations $x_1, x_2, x_3, \dots, x_N$. We want to find the parameters of this linear mapping.

Probabilistic Principal Component Analysis
Graphical model: for each $i = 1, \dots, N$ the latent variable $y_i$ generates the observation $x_i$, with the parameters $W$, $\mu$ and $\sigma^2$ shared across the plate.
(Figure: plate diagram, together with sketches of the prior $p(y)$, the conditional $p(x \mid y)$ and the marginal $p(x)$.)
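
To make the generative process concrete, here is a minimal NumPy sketch (not from the slides) that samples from the model $y_i \sim \mathcal{N}(0, I_d)$, $x_i = W y_i + \mu + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I_F)$; the dimensions and parameter values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, d = 500, 10, 2             # number of samples, data dim, latent dim (arbitrary)
W = rng.standard_normal((F, d))  # linear mapping (ground truth for this toy example)
mu = rng.standard_normal(F)      # data mean
sigma2 = 0.1                     # isotropic noise variance

Y = rng.standard_normal((N, d))                                    # y_i ~ N(0, I_d)
X = Y @ W.T + mu + np.sqrt(sigma2) * rng.standard_normal((N, F))   # x_i = W y_i + mu + noise

# The sample covariance should be close to W W^T + sigma^2 I for large N
print(np.abs(np.cov(X.T, bias=True) - (W @ W.T + sigma2 * np.eye(F))).max())
```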

ML-PPCA
$p(x_1, \dots, x_N, y_1, \dots, y_N \mid \theta) = \prod_{i=1}^{N} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)$
$p(x_i \mid y_i, W, \mu, \sigma) = \mathcal{N}(x_i \mid W y_i + \mu,\ \sigma^2 I)$
$p(y_i) = \mathcal{N}(y_i \mid 0, I)$
$p(x_1, \dots, x_N \mid \theta) = \int p(x_1, \dots, x_N, y_1, \dots, y_N \mid \theta)\, dy_1 \cdots dy_N = \int \prod_{i=1}^{N} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)\, dy_1 \cdots dy_N$

ML-PPCA
$p(x_1, \dots, x_N \mid \theta) = \int \prod_{i=1}^{N} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)\, dy_1 \cdots dy_N = \prod_{i=1}^{N} \int_{y_i} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)\, dy_i$
$p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i) = \frac{1}{(2\pi)^{F/2}\, \sigma^{F}}\, e^{-\frac{1}{2\sigma^2}(x_i - \mu - W y_i)^T (x_i - \mu - W y_i)} \cdot \frac{1}{(2\pi)^{d/2}}\, e^{-\frac{1}{2} y_i^T y_i}$

ML-PPCA
Complete the square in the exponent (up to the common factor $-\frac{1}{2\sigma^2}$):
$(x_i - \mu - W y_i)^T (x_i - \mu - W y_i) + \sigma^2 y_i^T y_i$
Writing $x_i$ for the centered data $x_i - \mu$ from here on:
$= (x_i - W y_i)^T (x_i - W y_i) + \sigma^2 y_i^T y_i$
$= x_i^T x_i - 2 x_i^T W y_i + y_i^T W^T W y_i + \sigma^2 y_i^T y_i$
$= x_i^T x_i - 2 x_i^T W y_i + y_i^T (\sigma^2 I + W^T W) y_i$, with $M \equiv \sigma^2 I + W^T W$.

ML-PPCA
$= x_i^T x_i - 2 x_i^T W y_i + y_i^T M y_i$
$= x_i^T x_i - 2 (M^{-1} W^T x_i)^T M y_i + y_i^T M y_i + (M^{-1} W^T x_i)^T M (M^{-1} W^T x_i) - (M^{-1} W^T x_i)^T M M^{-1} W^T x_i$
$= x_i^T (I - W M^{-1} W^T) x_i + (y_i - M^{-1} W^T x_i)^T M (y_i - M^{-1} W^T x_i)$

ML-PPCA
$\int_{y_i} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)\, dy_i \propto e^{-\frac{1}{2\sigma^2} x_i^T (I - W M^{-1} W^T) x_i} \int_{y_i} e^{-\frac{1}{2\sigma^2}(y_i - M^{-1} W^T x_i)^T M (y_i - M^{-1} W^T x_i)}\, dy_i$
$= \mathcal{N}\left(x_i \mid \mu,\ \left(\sigma^{-2} I - \sigma^{-2} W M^{-1} W^T\right)^{-1}\right)$

ML-PPCA
Let's have a look at:
$\sigma^{-2} I - \sigma^{-2} W M^{-1} W^T = \sigma^{-2} I - \sigma^{-2} W (\sigma^2 I + W^T W)^{-1} W^T$
Woodbury formula: $(A + UCV)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}$
Can you prove that if $D = W W^T + \sigma^2 I$, then $D^{-1} = \sigma^{-2} I - \sigma^{-2} W (\sigma^2 I + W^T W)^{-1} W^T$?
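
As a quick numerical sanity check of the claimed identity (a sketch, with arbitrary random $W$ and $\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(1)
F, d, sigma2 = 6, 2, 0.3
W = rng.standard_normal((F, d))

D = W @ W.T + sigma2 * np.eye(F)                      # D = W W^T + sigma^2 I
M = sigma2 * np.eye(d) + W.T @ W                      # M = sigma^2 I + W^T W
D_inv_woodbury = (np.eye(F) - W @ np.linalg.solve(M, W.T)) / sigma2

print(np.allclose(np.linalg.inv(D), D_inv_woodbury))  # True: D^{-1} = sigma^{-2}(I - W M^{-1} W^T)
```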

ML-PPCA
$p(x_i \mid W, \mu, \sigma^2) = \int_{y_i} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)\, dy_i = \mathcal{N}(x_i \mid \mu, D)$, with $D = W W^T + \sigma^2 I$.
Using the above we can also easily show that the posterior is
$p(y_i \mid x_i, W, \mu, \sigma^2) = \frac{p(x_i \mid y_i, W, \mu, \sigma^2)\, p(y_i)}{p(x_i \mid W, \mu, \sigma^2)} = \mathcal{N}\left(y_i \mid M^{-1} W^T (x_i - \mu),\ \sigma^2 M^{-1}\right)$

ML-PPCA
$p(x_1, \dots, x_N \mid \theta) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, D)$
$\theta = \arg\max_\theta \ln p(x_1, \dots, x_N \mid \theta) = \arg\max_\theta \sum_{i=1}^{N} \ln \mathcal{N}(x_i \mid \mu, D)$
$\ln p(x_1, \dots, x_N \mid \theta) = -\frac{NF}{2}\ln 2\pi - \frac{N}{2}\ln|D| - \frac{1}{2}\sum_{i=1}^{N}(x_i - \mu)^T D^{-1} (x_i - \mu)$
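
For reference, a small sketch (illustrative, not from the slides) that evaluates this log-likelihood directly in NumPy; `X` is an $N \times F$ data matrix and the parameter values are arbitrary.

```python
import numpy as np

def ppca_log_likelihood(X, W, mu, sigma2):
    """ln p(x_1..x_N | theta) = sum_i ln N(x_i | mu, D), with D = W W^T + sigma^2 I."""
    N, F = X.shape
    D = W @ W.T + sigma2 * np.eye(F)
    Xc = X - mu
    _, logdet = np.linalg.slogdet(D)
    quad = np.einsum('ij,ij->', Xc @ np.linalg.inv(D), Xc)   # sum_i (x_i-mu)^T D^{-1} (x_i-mu)
    return -0.5 * (N * F * np.log(2 * np.pi) + N * logdet + quad)

# Hypothetical usage
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
print(ppca_log_likelihood(X, rng.standard_normal((5, 2)), X.mean(axis=0), 0.5))
```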

ML-PPCA
$L(W, \sigma^2, \mu) = -\frac{NF}{2}\ln 2\pi - \frac{N}{2}\ln|D| - \frac{N}{2}\,\mathrm{tr}[D^{-1} S_t]$, where $S_t = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)(x_i - \mu)^T$
$\frac{dL}{d\mu} = 0 \Rightarrow \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

ML-PPCA dl dw = N D 1 S t D 1 W D 1 W S t D 1 W = W Three different solutions: (1) W = 0 which is the minimum of the likelihood (2) D = S t WW T + σ 2 Ι = S t Assume the eigen-decomposition S t = UΛU T U square matrix of eigenvectors R is a rotation matrix R T R = I W = U(Λ σ 2 Ι) 1/2 R 20

ML-PPCA
(3) $D \neq S_t$ and $W \neq 0$, with $d < q = \mathrm{rank}(S_t)$.
Assume the SVD of $W$: $W = U L V^T$, where $U = [u_1 \cdots u_d]$ is an $F \times d$ matrix with $U^T U = I$, $V^T V = V V^T = I$, and $L = \mathrm{diag}(l_1, \dots, l_d)$.
Then $S_t D^{-1} U L V^T = U L V^T$. Let's study $D^{-1} U$.

ML-PPCA
$D^{-1} = (W W^T + \sigma^2 I)^{-1} \stackrel{W = U L V^T}{=} (U L^2 U^T + \sigma^2 I)^{-1}$
Assume a set of bases $U_{F-d}$ such that $U_{F-d}^T U = 0$ and $U_{F-d}^T U_{F-d} = I$. Then
$D^{-1} = \left([U\ U_{F-d}]\begin{bmatrix} L^2 & 0 \\ 0 & 0 \end{bmatrix}[U\ U_{F-d}]^T + [U\ U_{F-d}]\,\sigma^2 I\,[U\ U_{F-d}]^T\right)^{-1}$
$= \left([U\ U_{F-d}]\begin{bmatrix} L^2 + \sigma^2 I & 0 \\ 0 & \sigma^2 I \end{bmatrix}[U\ U_{F-d}]^T\right)^{-1}$
$= [U\ U_{F-d}]\begin{bmatrix} (L^2 + \sigma^2 I)^{-1} & 0 \\ 0 & \sigma^{-2} I \end{bmatrix}[U\ U_{F-d}]^T$

ML-PPCA
$D^{-1} U = [U\ U_{F-d}]\begin{bmatrix} (L^2 + \sigma^2 I)^{-1} & 0 \\ 0 & \sigma^{-2} I \end{bmatrix}[U\ U_{F-d}]^T U = [U\ U_{F-d}]\begin{bmatrix} (L^2 + \sigma^2 I)^{-1} & 0 \\ 0 & \sigma^{-2} I \end{bmatrix}[I\ 0]^T$
$= [U\ U_{F-d}]\begin{bmatrix} (L^2 + \sigma^2 I)^{-1} \\ 0 \end{bmatrix} = U (L^2 + \sigma^2 I)^{-1}$

ML-PPCA
$S_t D^{-1} U L V^T = U L V^T \Rightarrow S_t U (L^2 + \sigma^2 I)^{-1} = U \Rightarrow S_t U = U (L^2 + \sigma^2 I)$
It means that $S_t u_i = (l_i^2 + \sigma^2) u_i$: since $S_t = U \Lambda U^T$, the $u_i$ are eigenvectors of $S_t$ with $\lambda_i = l_i^2 + \sigma^2$, i.e. $l_i = \sqrt{\lambda_i - \sigma^2}$.
Unfortunately $V$ cannot be determined, hence there is a rotation ambiguity.

ML-PPCA
Hence the optimum is given by (keeping $d$ eigenvectors): $W_d = U_d (\Lambda_d - \sigma^2 I)^{1/2} V^T$
Having computed $W$, we need to compute the optimum $\sigma^2$:
$L(W, \sigma^2, \mu) = -\frac{NF}{2}\ln 2\pi - \frac{N}{2}\ln|D| - \frac{N}{2}\,\mathrm{tr}[D^{-1} S_t]$
$= -\frac{NF}{2}\ln 2\pi - \frac{N}{2}\ln|W W^T + \sigma^2 I| - \frac{N}{2}\,\mathrm{tr}[(W W^T + \sigma^2 I)^{-1} S_t]$

ML-PPCA
$W_d W_d^T + \sigma^2 I = [U_d\ U_{F-d}]\begin{bmatrix} \Lambda_d - \sigma^2 I & 0 \\ 0 & 0 \end{bmatrix}[U_d\ U_{F-d}]^T + [U_d\ U_{F-d}]\begin{bmatrix} \sigma^2 I & 0 \\ 0 & \sigma^2 I \end{bmatrix}[U_d\ U_{F-d}]^T = [U_d\ U_{F-d}]\begin{bmatrix} \Lambda_d & 0 \\ 0 & \sigma^2 I \end{bmatrix}[U_d\ U_{F-d}]^T$
Hence
$\ln|W_d W_d^T + \sigma^2 I| = (F - d)\ln\sigma^2 + \sum_{i=1}^{d}\ln\lambda_i$

ML-PPCA
$D^{-1} S_t = [U_d\ U_{F-d}]\begin{bmatrix} \Lambda_d^{-1} & 0 \\ 0 & \sigma^{-2} I \end{bmatrix}[U_d\ U_{F-d}]^T\, [U_d\ U_{F-d}]\begin{bmatrix} \Lambda_d & 0 & 0 \\ 0 & \Lambda_{q-d} & 0 \\ 0 & 0 & 0 \end{bmatrix}[U_d\ U_{F-d}]^T = [U_d\ U_{F-d}]\begin{bmatrix} I & 0 & 0 \\ 0 & \frac{1}{\sigma^2}\Lambda_{q-d} & 0 \\ 0 & 0 & 0 \end{bmatrix}[U_d\ U_{F-d}]^T$
Hence
$\mathrm{tr}[D^{-1} S_t] = d + \frac{1}{\sigma^2}\sum_{i=d+1}^{q}\lambda_i$

ML-PPCA
$L(\sigma^2) = -\frac{N}{2}\left\{ F\ln 2\pi + \sum_{j=1}^{d}\ln\lambda_j + (F - d)\ln\sigma^2 + \frac{1}{\sigma^2}\sum_{j=d+1}^{q}\lambda_j + d \right\}$
$\frac{\partial L}{\partial \sigma} = 0 \Rightarrow \frac{N}{\sigma^3}\sum_{j=d+1}^{q}\lambda_j - \frac{N(F - d)}{\sigma} = 0 \Rightarrow \sigma^2 = \frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j$
If I put the solution back:
$L(\sigma^2) = -\frac{N}{2}\left\{ \sum_{j=1}^{d}\ln\lambda_j + (F - d)\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j\right) + F\ln 2\pi + F \right\}$

ML-PPCA
Using $\sum_{j=1}^{d}\ln\lambda_j = \sum_{j=1}^{q}\ln\lambda_j - \sum_{j=d+1}^{q}\ln\lambda_j = \ln|S_t| - \sum_{j=d+1}^{q}\ln\lambda_j$ (over the nonzero eigenvalues of $S_t$):
$L(\sigma^2) = -\frac{N}{2}\left\{ \ln|S_t| - \sum_{j=d+1}^{q}\ln\lambda_j + (F - d)\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j\right) + F\ln 2\pi + F \right\}$
Maximizing $L$ over the choice of discarded eigenvalues is therefore equivalent to minimizing
$\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j\right) - \frac{1}{F - d}\sum_{j=d+1}^{q}\ln\lambda_j$

ML-PPCA
Jensen's inequality: $\ln\left(\frac{1}{n}\sum_{i=1}^{n} r_i\right) \ge \frac{1}{n}\sum_{i=1}^{n}\ln r_i$, so
$\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j\right) \ge \frac{1}{F - d}\sum_{j=d+1}^{q}\ln\lambda_j$, hence
$\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j\right) - \frac{1}{F - d}\sum_{j=d+1}^{q}\ln\lambda_j \ge 0$
Hence, the function is minimized when the discarded eigenvectors are the ones that correspond to the $q - d$ smallest eigenvalues.

ML-PPCA
To summarize:
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, $\sigma^2 = \frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j$, $W_d = U_d(\Lambda_d - \sigma^2 I)^{1/2} V^T$
We no longer have a projection but a posterior expectation
$E_{p(y_i \mid x_i)}\{y_i\} = M^{-1} W^T (x_i - \mu)$
and a reconstruction
$\tilde{x}_i = W\, E_{p(y_i \mid x_i)}\{y_i\} + \mu$
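
Putting the closed-form ML solution together, here is a minimal NumPy sketch (not from the slides, taking the arbitrary rotation $V = I$); variable names and the toy data are illustrative only.

```python
import numpy as np

def ppca_ml_fit(X, d):
    """Closed-form ML PPCA: returns mu, W_d (with V = I), sigma^2, latent expectations, reconstructions."""
    N, F = X.shape
    mu = X.mean(axis=0)
    S_t = np.cov(X.T, bias=True)                       # S_t = (1/N) sum_i (x_i-mu)(x_i-mu)^T
    lam, U = np.linalg.eigh(S_t)                       # eigh returns ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]                     # sort descending
    sigma2 = lam[d:].mean()                            # sigma^2 = average of the discarded eigenvalues
    W = U[:, :d] @ np.diag(np.sqrt(lam[:d] - sigma2))  # W_d = U_d (Lambda_d - sigma^2 I)^{1/2}
    M = sigma2 * np.eye(d) + W.T @ W
    E_y = (X - mu) @ W @ np.linalg.inv(M)              # E{y_i} = M^{-1} W^T (x_i - mu)
    X_rec = E_y @ W.T + mu                             # reconstruction W E{y_i} + mu
    return mu, W, sigma2, E_y, X_rec

# Hypothetical usage on correlated toy data
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10)) * 0.5
mu, W, sigma2, E_y, X_rec = ppca_ml_fit(X, d=2)
print(W.shape, round(sigma2, 3))
```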

ML-PPCA
$\lim_{\sigma^2 \to 0} W_d = U_d \Lambda_d^{1/2}$ (taking $V = I$), and $\lim_{\sigma^2 \to 0} M = W_d^T W_d = \Lambda_d$
Hence
$\lim_{\sigma^2 \to 0} E_{p(y_i \mid x_i)}\{y_i\} = (W_d^T W_d)^{-1} W_d^T (x_i - \mu) = \Lambda_d^{-1/2} U_d^T (x_i - \mu)$
which gives whitened PCA.
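
A quick numeric illustration of this limit (a sketch with arbitrary data): for near-zero noise variance, the posterior mean reduces to the whitened-PCA projection $\Lambda_d^{-1/2} U_d^T (x_i - \mu)$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 6)) @ rng.standard_normal((6, 6))
mu = X.mean(axis=0)
lam, U = np.linalg.eigh(np.cov(X.T, bias=True))
lam, U = lam[::-1], U[:, ::-1]
d, sigma2 = 2, 1e-10                                   # near-zero noise variance

W = U[:, :d] @ np.diag(np.sqrt(lam[:d] - sigma2))      # W_d -> U_d Lambda_d^{1/2}
M = sigma2 * np.eye(d) + W.T @ W                       # M -> Lambda_d
E_y = np.linalg.solve(M, W.T @ (X - mu).T).T           # posterior means M^{-1} W^T (x_i - mu)
whitened = (X - mu) @ U[:, :d] / np.sqrt(lam[:d])      # Lambda_d^{-1/2} U_d^T (x_i - mu)
print(np.allclose(E_y, whitened, atol=1e-6))           # True in the small-noise limit
```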

EM-PPCA
First step: formulate the complete-data likelihood
$p(X, Y \mid \theta) = p(X \mid Y, \theta)\, p(Y) = \prod_{i=1}^{N} p(x_i \mid y_i, \theta)\, p(y_i)$
$\ln p(X, Y \mid \theta) = \sum_{i=1}^{N}\left[\ln p(x_i \mid y_i, \theta) + \ln p(y_i)\right]$
$\ln p(x_i \mid y_i, \theta) = \ln\left[\frac{1}{(2\pi)^{F/2}(\sigma^2)^{F/2}}\, e^{-\frac{1}{2\sigma^2}(x_i - W y_i - \mu)^T (x_i - W y_i - \mu)}\right] = -\frac{F}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(x_i - W y_i - \mu)^T (x_i - W y_i - \mu)$
$\ln p(y_i) = -\frac{1}{2} y_i^T y_i - \frac{d}{2}\ln 2\pi$

EM-PPCA
$\ln p(X, Y \mid \theta) = \sum_{i=1}^{N}\left\{ -\frac{F}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(x_i - W y_i - \mu)^T (x_i - W y_i - \mu) - \frac{1}{2} y_i^T y_i - \frac{d}{2}\ln 2\pi \right\}$
We now need to optimize this with respect to $y_i$, $W$, $\mu$, $\sigma^2$. We cannot do so directly, since the $y_i$ are unobserved, hence we take expectations with respect to the posterior $p(y_i \mid x_i)$.

EM-PPCA
But first we need to expand a bit:
$\ln p(X, Y \mid \theta) = -\frac{NF}{2}\ln 2\pi\sigma^2 - \frac{Nd}{2}\ln 2\pi - \sum_{i=1}^{N}\left\{ \frac{1}{2\sigma^2}\left[(x_i - \mu)^T (x_i - \mu) - 2\,(W^T (x_i - \mu))^T y_i + y_i^T W^T W y_i\right] + \frac{1}{2} y_i^T y_i \right\}$

EM-PPCA
$E_{p(Y \mid X)}\{\ln p(X, Y \mid \theta)\} = -\frac{NF}{2}\ln 2\pi\sigma^2 - \frac{Nd}{2}\ln 2\pi - \sum_{i=1}^{N}\left\{ \frac{1}{2\sigma^2}\left[(x_i - \mu)^T (x_i - \mu) - 2\,(W^T (x_i - \mu))^T E\{y_i\} + \mathrm{tr}\left(E\{y_i y_i^T\} W^T W\right)\right] + \frac{1}{2}\mathrm{tr}\left(E\{y_i y_i^T\}\right) \right\}$

EM-PPCA
$E\{y_i\} = \int y_i\, p(y_i \mid x_i, \theta)\, dy_i$, the mean of $\mathcal{N}\left(y_i \mid M^{-1} W^T (x_i - \mu),\ \sigma^2 M^{-1}\right)$, so $E\{y_i\} = M^{-1} W^T (x_i - \mu)$
The posterior covariance is
$E\left\{(y_i - E\{y_i\})(y_i - E\{y_i\})^T\right\} = E\{y_i y_i^T\} - E\{y_i\} E\{y_i\}^T - E\{y_i\} E\{y_i\}^T + E\{y_i\} E\{y_i\}^T = E\{y_i y_i^T\} - E\{y_i\} E\{y_i\}^T = \sigma^2 M^{-1}$
hence $E\{y_i y_i^T\} = \sigma^2 M^{-1} + E\{y_i\} E\{y_i\}^T$

EM-PPCA
Expectation step: given $W$, $\mu$, $\sigma^2$ (and $M = \sigma^2 I + W^T W$):
$E\{y_i\} = M^{-1} W^T (x_i - \mu)$
$E\{y_i y_i^T\} = \sigma^2 M^{-1} + E\{y_i\} E\{y_i\}^T$
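
A minimal sketch of the E-step in NumPy (rows of `X` are the $x_i$; names are illustrative, not from the slides):

```python
import numpy as np

def e_step(X, W, mu, sigma2):
    """E{y_i} = M^{-1} W^T (x_i - mu),  E{y_i y_i^T} = sigma^2 M^{-1} + E{y_i} E{y_i}^T."""
    d = W.shape[1]
    M = sigma2 * np.eye(d) + W.T @ W
    M_inv = np.linalg.inv(M)
    E_y = (X - mu) @ W @ M_inv                          # N x d matrix of posterior means
    E_yy = sigma2 * M_inv + E_y[:, :, None] * E_y[:, None, :]   # N x d x d second moments
    return E_y, E_yy
```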

EM-PPCA
Maximization step:
$\frac{\partial E\{\ln p(X, Y \mid \theta)\}}{\partial \mu} = 0 \Rightarrow \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
$\frac{\partial E\{\ln p(X, Y \mid \theta)\}}{\partial W} = 0 \Rightarrow \sum_{i=1}^{N}\left[\frac{1}{\sigma^2}(x_i - \mu) E\{y_i\}^T - \frac{1}{\sigma^2} W E\{y_i y_i^T\}\right] = 0$
$\Rightarrow W = \left[\sum_{i=1}^{N}(x_i - \mu) E\{y_i\}^T\right]\left[\sum_{i=1}^{N} E\{y_i y_i^T\}\right]^{-1}$

EM-PPCA
$\frac{\partial E\{\ln p(X, Y \mid \theta)\}}{\partial \sigma^2} = 0 \Rightarrow \sigma^2 = \frac{1}{NF}\sum_{i=1}^{N}\left\{ \|x_i - \mu\|^2 - 2\, E\{y_i\}^T W^T (x_i - \mu) + \mathrm{tr}\left(E\{y_i y_i^T\} W^T W\right) \right\}$
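
And a matching sketch of the M-step, plus a small EM loop that alternates the two steps (it reuses `e_step` from the sketch above; initialization and iteration count are arbitrary choices):

```python
import numpy as np

def m_step(X, E_y, E_yy, mu):
    """W = [sum_i (x_i-mu) E{y_i}^T][sum_i E{y_i y_i^T}]^{-1}, then update sigma^2 with the new W."""
    N, F = X.shape
    Xc = X - mu
    W = (Xc.T @ E_y) @ np.linalg.inv(E_yy.sum(axis=0))
    sigma2 = (np.sum(Xc ** 2)
              - 2 * np.sum(E_y * (Xc @ W))              # sum_i 2 E{y_i}^T W^T (x_i - mu)
              + np.einsum('nij,ij->', E_yy, W.T @ W)    # sum_i tr(E{y_i y_i^T} W^T W)
              ) / (N * F)
    return W, sigma2

def em_ppca(X, d, n_iter=100):
    N, F = X.shape
    mu = X.mean(axis=0)
    rng = np.random.default_rng(0)
    W, sigma2 = rng.standard_normal((F, d)), 1.0        # arbitrary initialization
    for _ in range(n_iter):
        E_y, E_yy = e_step(X, W, mu, sigma2)            # e_step defined in the previous sketch
        W, sigma2 = m_step(X, E_y, E_yy, mu)
    return W, mu, sigma2

# Hypothetical usage: W_hat, mu_hat, s2_hat = em_ppca(X, d=2)
```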

Why EM-PPCA?
A complexity of $O(NFd)$, which can be significantly smaller than the $O(NF^2)$ cost of computing the covariance matrix.
A combination of a probabilistic model and EM allows us to deal with missing values in the dataset.
(Figure: data matrices with missing values.)

Why EM-PPCA?
A combination of a probabilistic model and EM allows us to deal with missing values in the dataset.
(Figure: component models $p(x_i \mid y_i, W_1, \mu_1, \sigma)$ and $p(x_i \mid y_i, W_9, \mu_9, \sigma)$.)

Summary
We saw how to perform parameter estimation and inference using ML and EM in the following graphical model.
(Figure: PPCA plate diagram with latent $y_i$, observed $x_i$, and parameters $W$, $\mu$, $\sigma^2$.)

What will we see next?
EM in the following graphical models:
Dynamic data: a chain of latent variables $y_1, y_2, y_3, \dots, y_T$ generating observations $x_1, x_2, x_3, \dots, x_T$.
Spatial data: a grid of latent variables $y_{uv}$ generating observations $x_{uv}$.