21/06/2016
Plan
1 Basic definitions: Singular Multivariate Normal Distribution
2 Mixture model and EM algorithm
3 Work in progress
Multivariate Normal Distribution: Gaussian Vector Properties

Let $X = (X_1, X_2, \dots, X_d)^T \sim N_d(\mu, \Sigma)$.

$\forall \beta \in \mathbb{R}^d$, $\beta^T X \sim N(\beta^T \mu, \beta^T \Sigma \beta)$.

Laplace transform: $L(\theta) = E(e^{\theta^T X}) = e^{\theta^T \mu + \frac{1}{2}\theta^T \Sigma \theta}$

Density function if $\Sigma$ is positive-definite ($\Sigma > 0$): $\forall x \in \mathbb{R}^d$,
$$f(x) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
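The density formula above translates directly into code. The sketch below (a hypothetical helper, assuming NumPy) evaluates the non-degenerate density; `np.linalg.solve` is used instead of forming $\Sigma^{-1}$ explicitly.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Density of N_d(mu, Sigma) at x, assuming Sigma is positive-definite."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)
```

In one dimension with $\mu = 0$, $\Sigma = 1$, the value at $0$ reduces to $1/\sqrt{2\pi} \approx 0.3989$, which is a quick sanity check.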
Singular Multivariate Normal Distribution: Definition

Let $X = (X_1, X_2, \dots, X_d)^T \sim N_d(\mu, \Sigma)$. If $r = \operatorname{rank}(\Sigma) < d$, $X$ has a singular multivariate normal distribution.

Laplace transform: $L(\theta) = E(e^{\theta^T X}) = e^{\theta^T \mu + \frac{1}{2}\theta^T \Sigma \theta}$

Spectral decomposition:
$$\Sigma = (P_1, P_2) \begin{pmatrix} \Lambda & 0 \\ 0 & 0 \end{pmatrix} (P_1, P_2)^T, \qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_r)$$

The density function is defined on an affine subspace of $\mathbb{R}^d$:
$$f(x) = (2\pi)^{-r/2}\, |\Lambda|^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^+ (x-\mu)\right), \qquad P_2^T x = P_2^T \mu$$
where $\Sigma^+$ is the Moore-Penrose pseudo-inverse of $\Sigma$.
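A minimal sketch of this density, assuming NumPy (the function name and the tolerances are illustrative choices, not part of the original): the eigendecomposition splits $\Sigma$ into $P_1$, $\Lambda$, $P_2$, the support condition $P_2^T x = P_2^T \mu$ is checked numerically, and the pseudo-inverse is applied on the retained eigenpairs only.

```python
import numpy as np

def singular_mvn_density(x, mu, Sigma, tol=1e-10):
    """Density of a rank-deficient N_d(mu, Sigma) on its supporting
    affine subspace; returns 0.0 when x lies off the support."""
    eigval, eigvec = np.linalg.eigh(Sigma)
    keep = eigval > tol                      # the r nonzero eigenvalues
    P1, lam = eigvec[:, keep], eigval[keep]  # P1 spans the support direction
    P2 = eigvec[:, ~keep]
    diff = x - mu
    if P2.size and np.max(np.abs(P2.T @ diff)) > 1e-8:
        return 0.0                           # P2^T x != P2^T mu
    r = keep.sum()
    # Sigma^+ = P1 Lambda^{-1} P1^T (Moore-Penrose pseudo-inverse)
    quad = diff @ (P1 @ np.diag(1.0 / lam) @ P1.T) @ diff
    return (2 * np.pi) ** (-r / 2) * np.prod(lam) ** (-0.5) * np.exp(-0.5 * quad)
```

For the rank-1 matrix $\Sigma = \begin{pmatrix}1&1\\1&1\end{pmatrix}$ with $\mu = 0$, the density at the origin is $(2\pi)^{-1/2} \cdot 2^{-1/2}$, while any point off the line $x_1 = x_2$ gets density $0$.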
Mixture Model: Definition

Consider a mixture of $K$ $d$-variate singular normal distributions with density:
$$f(x \mid \Theta) = \sum_{k=1}^{K} \pi_k f_k(x \mid \mu_k, \Sigma_k) \quad (1)$$
$\pi_1, \pi_2, \dots, \pi_K$ are the mixing weights ($\sum_{k=1}^{K} \pi_k = 1$)
$f_k(x \mid \mu_k, \Sigma_k)$ is the density function of the degenerate multivariate Gaussian distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$
$\Theta = \{\pi_k, \mu_k, \Sigma_k;\ k = 1, \dots, K\}$ is the parameter vector
All components are concentrated on an affine subspace $E$.
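Equation (1) is just a weighted sum of component densities. A toy sketch (assuming NumPy; one-dimensional Gaussian components stand in here for the degenerate $d$-variate ones, and both function names are illustrative):

```python
import numpy as np

def gauss1d(x, mu, var):
    """1-d Gaussian density, a stand-in for the component densities f_k."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_density(x, weights, mus, variances):
    """Mixture density f(x | Theta) = sum_k pi_k f_k(x | mu_k, var_k)."""
    return sum(p * gauss1d(x, m, v) for p, m, v in zip(weights, mus, variances))
```

An equal-weight mixture of two identical components collapses to the single-component density, which is an easy consistency check.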
Concrete Situation: Clustering Data

$Y$ is drawn from a population which consists of $K$ groups $G_1, G_2, \dots, G_K$ in proportions $\pi_1, \dots, \pi_K$.

$Z_i = (Z_{i1}, \dots, Z_{iK})$ is the $K$-dimensional component-label vector:
$$Z_{ik} = (Z_i)_k = \begin{cases} 1 & \text{if } Y_i \in G_k \\ 0 & \text{if } Y_i \notin G_k \end{cases}$$

$Z_i$ is distributed according to a multinomial distribution with probability vector $\pi = (\pi_1, \dots, \pi_K)$:
$$Z_i \sim \operatorname{Mult}_K(1, \pi) \quad \text{and} \quad P(Z_i = z_i) = \prod_{k=1}^{K} \pi_k^{z_{ik}}$$
Data in the EM Framework (Dempster et al., 1977)

$y = (y_1, \dots, y_n)$: the observed data. $z = (z_1, \dots, z_n)$: the hidden data. The complete data is $(y^T, z^T)^T$.

Likelihood function. The complete likelihood function is given by:
$$l(y_1, \dots, y_n, z_1, \dots, z_n \mid \Theta) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[\pi_k f_k(y_i \mid \mu_k, \Sigma_k)\right]^{z_{ik}}$$
The complete-data log-likelihood can be written as:
$$L(y_1, \dots, y_n, z_1, \dots, z_n \mid \Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left[\log \pi_k + \log f_k(y_i \mid \mu_k, \Sigma_k)\right]$$
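The complete-data log-likelihood can be sketched in a few lines (assuming NumPy; the function name and argument layout are illustrative). With one-hot labels $z_{ik}$, multiplying by $z$ simply selects the term of the assigned component for each observation.

```python
import numpy as np

def complete_log_likelihood(log_dens, z, weights):
    """Complete-data log-likelihood sum_i sum_k z_ik [log pi_k + log f_k(y_i)].

    log_dens: n x K matrix of log f_k(y_i); z: n x K one-hot labels."""
    return float(np.sum(z * (np.log(weights) + log_dens)))
```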
EM Algorithm: 3 Steps

Initialization:
$\pi_k^{(0)} = \frac{1}{K}$
$\mu_k^{(0)} \in E$
$\Sigma_k^{(0)}$ positive semi-definite verifying $\varphi(\Sigma_k^{(0)}) = \bar{E}$, where $\varphi(A)$ denotes the column space of a matrix $A$ and $\bar{E}$ is the vector subspace associated with $E$.

Expectation Step (E-Step): After $l$ iterations, the conditional expectation of the complete-data log-likelihood given the observed data, using the current fit, can be written as follows:
$$Q(\Theta \mid \Theta^{(l)}) = E_{\Theta^{(l)}}(L(y, z, \Theta) \mid y)$$
EM Algorithm: 3 Steps

Expectation Step (E-Step): The posterior probability that $y_i$ belongs to the $k$-th component of the mixture is:
$$\tau_{ik}^{(l)} = E_{\Theta^{(l)}}(Z_{ik} \mid y) = P_{\Theta^{(l)}}(Z_{ik} = 1 \mid y) = \frac{\pi_k^{(l)} f_k(y_i \mid \mu_k^{(l)}, \Sigma_k^{(l)})}{\sum_{j=1}^{K} \pi_j^{(l)} f_j(y_i \mid \mu_j^{(l)}, \Sigma_j^{(l)})}$$
The conditional expectation of the complete-data log-likelihood given $y$ is:
$$Q(\Theta \mid \Theta^{(l)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(l)} \left[\log \pi_k + \log f_k(y_i \mid \mu_k, \Sigma_k)\right]$$
$$Q(\Theta \mid \Theta^{(l)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(l)} \left[\log \pi_k - \frac{r}{2}\log(2\pi) - \frac{1}{2}\log|\Lambda_k| - \frac{1}{2}(y_i - \mu_k)^T \Sigma_k^+ (y_i - \mu_k)\right]$$
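The responsibilities $\tau_{ik}^{(l)}$ amount to a row-wise normalization of the weighted component densities. A vectorized sketch (assuming NumPy; the function name is illustrative):

```python
import numpy as np

def e_step(densities, weights):
    """Responsibilities tau_ik = pi_k f_k(y_i) / sum_j pi_j f_j(y_i).

    densities: n x K matrix of component densities f_k(y_i);
    weights:   length-K vector of mixing proportions pi_k."""
    weighted = densities * weights                    # broadcast pi_k over rows
    return weighted / weighted.sum(axis=1, keepdims=True)
```

Each row of the result sums to 1, i.e. every observation distributes one unit of responsibility over the $K$ components.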
EM Algorithm: 3 Steps

Maximization Step (M-Step): The global maximization of $Q(\Theta \mid \Theta^{(l)})$ with respect to $\Theta$ over the parameter space gives the updated estimate $\Theta^{(l+1)}$:
$$\Theta^{(l+1)} = \operatorname{argmax}_{\Theta} Q(\Theta \mid \Theta^{(l)})$$
$$\pi_k^{(l+1)} = \frac{n_k}{n} = \frac{1}{n}\sum_{i=1}^{n} \tau_{ik}^{(l)} \qquad \mu_k^{(l+1)} = \frac{\sum_{i=1}^{n} \tau_{ik}^{(l)} y_i}{n_k} \quad \text{where } n_k = \sum_{i=1}^{n} \tau_{ik}^{(l)}$$
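The weight and mean updates can be sketched as follows (assuming NumPy; the function name is illustrative). They are weighted averages with the responsibilities as weights:

```python
import numpy as np

def m_step_weights_means(Y, tau):
    """Update pi_k and mu_k from responsibilities.

    Y: n x d data matrix; tau: n x K responsibility matrix."""
    n_k = tau.sum(axis=0)                 # effective count per component
    pi = n_k / len(Y)                     # pi_k = n_k / n
    mu = (tau.T @ Y) / n_k[:, None]       # mu_k = sum_i tau_ik y_i / n_k
    return pi, mu
```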
EM Algorithm: 3 Steps

Maximization Step: Covariance Matrices. Let
$$S_k^{(l+1)} = \sum_{i=1}^{n} \tau_{ik}^{(l)} (y_i - \mu_k^{(l+1)})(y_i - \mu_k^{(l+1)})^T \quad (2)$$
and let $H_k$ be the $d \times r$ matrix of eigenvectors corresponding to the $r$ largest eigenvalues of $S_k^{(l+1)}$, denoted by $L_k = \operatorname{diag}(l_1, \dots, l_r)$. Then the maximum of $Q$ with respect to $\Sigma_k$ is achieved for:
$$\Sigma_k^{(l+1)} = \frac{H_k L_k H_k^T}{n_k}, \qquad \text{where } n_k = \sum_{i=1}^{n} \tau_{ik}^{(l)}$$
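This rank-$r$ covariance update can be sketched with a truncated eigendecomposition (assuming NumPy; the function name is illustrative). `np.linalg.eigh` returns eigenvalues in ascending order, so the last $r$ columns are the leading eigenvectors:

```python
import numpy as np

def m_step_covariance(Y, tau_k, mu_k, r):
    """Rank-r covariance update for one component:
    S_k = sum_i tau_ik (y_i - mu_k)(y_i - mu_k)^T, Sigma_k = H L H^T / n_k."""
    diff = Y - mu_k
    S = (tau_k[:, None] * diff).T @ diff     # weighted scatter matrix S_k
    eigval, eigvec = np.linalg.eigh(S)       # eigenvalues in ascending order
    H = eigvec[:, -r:]                       # r leading eigenvectors
    L = np.diag(eigval[-r:])                 # L_k = diag(l_1, ..., l_r)
    return H @ L @ H.T / tau_k.sum()         # divide by n_k
```

For data concentrated on a line, the update reproduces a rank-1 covariance, as expected for a singular component.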
Singular Multivariate Normal Distribution
1 Simulation study in progress.
2 Clustering real data.
3 Convergence problems: choosing good starting values.
Thank you!
Moore-Penrose Pseudo-inverse: Definition

Let $A \in M_{n,m}$; then there exists a unique $A^+ \in M_{m,n}$ that satisfies:
1 $AA^+A = A$
2 $A^+AA^+ = A^+$
3 $A^+A = (A^+A)^*$ (Hermitian)
4 $AA^+ = (AA^+)^*$ (Hermitian)
where $M^*$ is the conjugate transpose of the matrix $M$.

Remark: If $A$ is square and non-singular, it is clear that $A^+ = A^{-1}$. For $\Sigma$ we can check that:
$$\Sigma^+ = P_1 \Lambda^{-1} P_1^T$$
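The four Penrose conditions can be verified numerically with `np.linalg.pinv` (the checker function itself is an illustrative sketch, assuming NumPy):

```python
import numpy as np

def check_penrose(A, tol=1e-10):
    """Verify the four Penrose conditions for A^+ = np.linalg.pinv(A)."""
    Ap = np.linalg.pinv(A)
    return all([
        np.allclose(A @ Ap @ A, A, atol=tol),     # 1. A A^+ A = A
        np.allclose(Ap @ A @ Ap, Ap, atol=tol),   # 2. A^+ A A^+ = A^+
        np.allclose((Ap @ A).conj().T, Ap @ A),   # 3. A^+ A Hermitian
        np.allclose((A @ Ap).conj().T, A @ Ap),   # 4. A A^+ Hermitian
    ])
```

For the singular matrix $\begin{pmatrix}1&1\\1&1\end{pmatrix}$ the pseudo-inverse is $\frac{1}{4}\begin{pmatrix}1&1\\1&1\end{pmatrix}$, consistent with the formula $\Sigma^+ = P_1 \Lambda^{-1} P_1^T$ applied to its single nonzero eigenvalue $\lambda_1 = 2$.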
Maximum Likelihood Estimation

Theorem (M.S. Srivastava & D. von Rosen). Let $Y = (y_1 : y_2 : \dots : y_n)$ be a set of i.i.d. data drawn from $N_d(\mu, \Sigma)$, where $\Sigma$ is of rank $r(\Sigma) = r$ and $\Sigma = P_1 \Lambda P_1^T$. Let
$$S = Y\left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T\right) Y^T$$
and let $H$ be the $d \times r$ matrix of eigenvectors corresponding to the $r$ largest eigenvalues of $S$, denoted by $L = \operatorname{diag}(l_1, \dots, l_r)$; $\mathbf{1}$ stands for the $n \times 1$ vector of ones. Then the MLEs of $\Lambda$ and $P_1$ are respectively given by:
$$\hat{\Lambda} = \frac{1}{n} L, \qquad \hat{P}_1 = H, \qquad \hat{\Sigma} = \frac{1}{n} H L H^T$$
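The theorem's estimator can be sketched as follows (assuming NumPy; the function name is illustrative, and the data matrix is taken as $d \times n$ with observations as columns, matching $Y = (y_1 : \dots : y_n)$):

```python
import numpy as np

def singular_mle(Y, r):
    """MLE of (mu, Sigma) for rank-r normal data.

    Y: d x n matrix with observations as columns;
    S = Y (I - 11^T/n) Y^T, then Sigma_hat = H L H^T / n."""
    d, n = Y.shape
    C = np.eye(n) - np.ones((n, n)) / n      # centering matrix I - (1/n) 1 1^T
    S = Y @ C @ Y.T
    eigval, eigvec = np.linalg.eigh(S)       # ascending eigenvalues
    H, L = eigvec[:, -r:], np.diag(eigval[-r:])
    mu_hat = Y.mean(axis=1)
    return mu_hat, H @ L @ H.T / n
```

On data lying exactly on a line through the mean, the estimate is rank-1, as the theorem predicts for $r = 1$.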