Inverse Covariance Estimation with Missing Data using the Concave-Convex Procedure

Jérôme Thai (1), Timothy Hunter (1), Anayo Akametalu (1), Claire Tomlin (1), Alex Bayen (1,2)
(1) Department of Electrical Engineering & Computer Sciences, University of California at Berkeley
(2) Department of Civil & Environmental Engineering, University of California at Berkeley

December 8, 2014
Outline

- Motivation: Gaussian Markov Random Fields
- Sparse estimator with missing data
- Concave-Convex Procedure
- Comparison with Expectation-Maximization algorithm
- Conclusion
Motivation: Gaussian Markov Random Fields
Multivariate Gaussian & Gaussian Markov Random Fields

Definition: Multivariate Gaussian
A random vector $x = (X_1, \dots, X_p)^T \in \mathbb{R}^p$ is a multivariate Gaussian with mean $\mu$ and inverse covariance matrix $Q$ if its density is
$$\phi(x \mid \mu, Q^{-1}) = (2\pi)^{-p/2}\,|Q|^{1/2} \exp\!\big(-\tfrac{1}{2}(x-\mu)^T Q (x-\mu)\big)$$

Definition: Gaussian Markov Random Field
A random vector $x = (X_1, \dots, X_p)^T$ is a Gaussian Markov random field w.r.t. a graph $G = (V, E)$ if it is a multivariate Gaussian with
$$Q_{ij} = 0 \iff \{i,j\} \notin E \iff x_i \perp x_j \mid x_{-\{i,j\}}, \qquad \forall\, i \neq j$$
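To make the definition concrete, here is a small numpy sketch (our own toy example, not from the slides): a chain-graph GMRF whose tridiagonal precision matrix encodes that non-adjacent variables are conditionally independent.

```python
import numpy as np

# Zeros in the precision matrix Q encode conditional independence:
# a chain graph gives a tridiagonal Q.
p = 5
Q = np.eye(p) - 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))

# Sample from N(0, Q^{-1}): if Q = L L^T (Cholesky), then x = L^{-T} z with
# z ~ N(0, I) has covariance L^{-T} L^{-1} = Q^{-1}.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(Q)
z = rng.standard_normal((p, 200_000))
x = np.linalg.solve(L.T, z)          # columns are samples from N(0, Q^{-1})

# The empirical precision recovers the zero pattern: entries with
# |i - j| > 1 (non-edges of the chain) come out close to zero.
Q_hat = np.linalg.inv(np.cov(x))
print(np.round(Q_hat, 2))
```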
Applications of Gaussian Markov Random Fields (GMRF)

- Find a sparse $\Sigma^{-1}$ encoding dependency patterns between biological factors [1]
- Sparse estimator for a GMRF from data $y_1, \dots, y_n \in \mathbb{R}^p$ ($\hat\mu$-centered): [2, 3]
$$\hat{Q} = \arg\min_{Q \succ 0}\; -\log|Q| + \mathrm{Tr}\Big(\textstyle\sum_{j=1}^n y_j y_j^T\, Q\Big) + \lambda \|Q\|_1$$

[1] Dobra. Variable selection and dependency networks for genomewide data. Biostatistics, 2009.
[2] Banerjee, El Ghaoui, d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 2008.
[3] Friedman, Hastie, Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 2008.
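The estimator above is a convex program, so it can be prototyped directly with an off-the-shelf solver. Below is a minimal sketch using CVXPY (our choice; the slides do not prescribe a solver), with `sparse_precision`, `S`, and `lam` our own names.

```python
import cvxpy as cp
import numpy as np

# Complete-data sparse estimator: S is the empirical second-moment matrix
# of the centered samples, lam the l1 penalty weight.
def sparse_precision(S: np.ndarray, lam: float) -> np.ndarray:
    p = S.shape[0]
    Q = cp.Variable((p, p), PSD=True)
    obj = cp.Minimize(-cp.log_det(Q) + cp.trace(S @ Q) + lam * cp.norm1(Q))
    cp.Problem(obj).solve()
    return Q.value

# Usage on toy data: 200 centered samples in dimension 5.
rng = np.random.default_rng(0)
y = rng.standard_normal((200, 5))
S = y.T @ y / len(y)
print(np.round(sparse_precision(S, lam=0.1), 2))
```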
Sparse estimator with missing data
Maximum likelihood with missing data

- $\mathrm{obs}_j$ = observed entries, $\mathrm{mis}_j$ = missing entries in each sample $y_j$
- Sparse estimator (maximum likelihood) with missing data:
$$\hat{Q} = \arg\max_{Q \succ 0} \sum_{j=1}^n \log \phi(y_{j,\mathrm{obs}} \mid \Sigma_{\mathrm{obs}_j}) - \lambda \|Q\|_1$$
where $\phi(\cdot \mid \Sigma_{\mathrm{obs}_j})$ is the density of the marginal Gaussian $N(\mu_{\mathrm{obs}_j}, \Sigma_{\mathrm{obs}_j})$, with $\mu_{\mathrm{obs}_j}$, $\Sigma_{\mathrm{obs}_j}$ the subvector and submatrix over the entries $\mathrm{obs}_j$
What is the dependency of $(\Sigma_{\mathrm{obs}_j})^{-1}$ on $Q$?

- Modulo a permutation matrix $P_j$:
$$P_j y_j = \begin{bmatrix} y_{j,\mathrm{obs}} \\ y_{j,\mathrm{mis}} \end{bmatrix} \qquad P_j Q P_j^T = \begin{bmatrix} Q_{\mathrm{obs}_j} & Q_{\mathrm{obs}_j\,\mathrm{mis}_j} \\ Q_{\mathrm{mis}_j\,\mathrm{obs}_j} & Q_{\mathrm{mis}_j} \end{bmatrix}$$
- We have the Schur complement w.r.t. $Q_{\mathrm{obs}_j}$:
$$S_j(Q) := (\Sigma_{\mathrm{obs}_j})^{-1} = Q_{\mathrm{obs}_j} - Q_{\mathrm{obs}_j\,\mathrm{mis}_j} Q_{\mathrm{mis}_j}^{-1} Q_{\mathrm{mis}_j\,\mathrm{obs}_j}$$
- Explicit formulation of the sparse estimator with missing data: [4, 5]
$$\hat{Q} = \arg\min_{Q \succ 0} \sum_{j=1}^n \big\{ -\log|S_j(Q)| + \mathrm{Tr}(y_{j,\mathrm{obs}} y_{j,\mathrm{obs}}^T S_j(Q)) \big\} + \lambda \|Q\|_1$$

[4] Kolar and Xing. Estimating Sparse Precision Matrices from Data with Missing Values. ICML, 2012.
[5] Städler and Bühlmann. Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 2009.
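The Schur complement $S_j(Q)$ is straightforward to compute from a boolean mask of observed entries. A numpy sketch (helper name `schur_complement` is ours), with a sanity check against the definition $S_j(Q) = (\Sigma_{\mathrm{obs}_j})^{-1}$:

```python
import numpy as np

def schur_complement(Q: np.ndarray, obs: np.ndarray) -> np.ndarray:
    """S_j(Q) for one sample, given a boolean mask of observed entries."""
    mis = ~obs
    Q_oo = Q[np.ix_(obs, obs)]
    Q_om = Q[np.ix_(obs, mis)]
    Q_mm = Q[np.ix_(mis, mis)]
    # Q_oo - Q_om Q_mm^{-1} Q_mo, using a solve instead of an explicit inverse
    return Q_oo - Q_om @ np.linalg.solve(Q_mm, Q_om.T)

# Sanity check: S_j(Q) equals the inverse of the marginal covariance.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)              # a random positive definite precision
obs = np.array([True, True, False, True, False])
Sigma = np.linalg.inv(Q)
assert np.allclose(schur_complement(Q, obs),
                   np.linalg.inv(Sigma[np.ix_(obs, obs)]))
```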
Problem statement

- Inverse covariance estimation with missing data has a complicated non-convex objective.
- We design a novel application of the Concave-Convex Procedure to solve our program.
- It has better theoretical and experimental convergence than the Expectation-Maximization algorithm.
Concave-Convex Procedure
Difference of convex programs

Our program: $\min\; f(Q) - g(Q)$ s.t. $Q \succ 0$, with
$$f(Q) = -\sum_{j=1}^n \log|S_j(Q)| + \lambda\|Q\|_1 \qquad g(Q) = -\sum_{j=1}^n \mathrm{Tr}(y_{j,\mathrm{obs}} y_{j,\mathrm{obs}}^T S_j(Q))$$

Why are both f and g convex?

Lemma: The function $Q \mapsto S(Q)$ is concave on the set of positive definite matrices.

Proposition: The function $Q \mapsto \log|S(Q)|$ is concave on the set of positive definite matrices.
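A quick numerical illustration of the Proposition (our own check, not a proof): midpoint concavity of $Q \mapsto \log|S(Q)|$ on random positive definite precisions.

```python
import numpy as np

def random_pd(p, rng):
    A = rng.standard_normal((p, p))
    return A @ A.T + p * np.eye(p)

def log_det_schur(Q, obs):
    # log|S(Q)| with S(Q) the Schur complement of the missing block
    mis = ~obs
    S = Q[np.ix_(obs, obs)] - Q[np.ix_(obs, mis)] @ np.linalg.solve(
        Q[np.ix_(mis, mis)], Q[np.ix_(mis, obs)])
    return np.linalg.slogdet(S)[1]

rng = np.random.default_rng(1)
obs = np.array([True, False, True, True, False])
for _ in range(1000):
    Q1, Q2 = random_pd(5, rng), random_pd(5, rng)
    mid = log_det_schur(0.5 * (Q1 + Q2), obs)
    avg = 0.5 * (log_det_schur(Q1, obs) + log_det_schur(Q2, obs))
    assert mid >= avg - 1e-9        # concavity: value at midpoint dominates
print("midpoint concavity held on 1000 random pairs")
```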
Review of the Concave-Convex Procedure

Definition: Difference of Convex (DC) program
Let $f$, $g$ be two convex functions and $\mathcal{X} \subseteq \mathbb{R}^n$ a convex set; a Difference of Convex (DC) program has the form
$$\min\; f(x) - g(x) \quad \text{s.t.} \quad x \in \mathcal{X}$$

Concave-Convex Procedure (CCCP) to solve DC programs
At $x^t$, solve the convex approximation obtained by linearizing $g$ at $x^t$:
$$x^{t+1} = \arg\min_{x \in \mathcal{X}}\; f(x) - g(x^t) - \nabla g(x^t)^T (x - x^t)$$

Proposition: CCCP is a descent method: $f(x^{t+1}) - g(x^{t+1}) \le f(x^t) - g(x^t)$ [6]

[6] Yuille and Rangarajan. The Concave-Convex Procedure. Neural Computation, 2003.
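A generic CCCP loop is only a few lines. The sketch below (our own toy 1-D DC program, not the paper's estimator) minimizes $f(x) - g(x)$ with $f(x) = (x-1)^4$ and $g(x) = 2x^2$, both convex, by repeatedly linearizing $g$:

```python
from scipy.optimize import minimize

def f(x):          # convex part kept intact
    return (x - 1.0) ** 4

def g(x):          # convex part that gets linearized
    return 2.0 * x ** 2

def g_grad(x):
    return 4.0 * x

x_t = 3.0
for _ in range(100):
    def surrogate(z):   # f(x) - g(x_t) - g'(x_t) (x - x_t)
        return f(z[0]) - g(x_t) - g_grad(x_t) * (z[0] - x_t)
    x_next = float(minimize(surrogate, x0=[x_t]).x[0])
    if abs(x_next - x_t) < 1e-10:
        break
    x_t = x_next
print(x_t, f(x_t) - g(x_t))   # a stationary point of the DC objective
```

Because $g$ is convex, its linearization sits below it, so each surrogate majorizes $f - g$ and touches it at $x^t$; minimizing it gives the descent property stated above.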
Illustration of the Concave-Convex Procedure (CCCP)

[Figure]
Comparison with Expectation-Maximization algorithm
Expectation-Maximization (EM) algorithm

- Our estimate: $\hat{Q} = \arg\min_{Q \succ 0} -\sum_{j=1}^n \log \phi(y_{j,\mathrm{obs}} \mid \Sigma_{\mathrm{obs}_j}) + \lambda\|Q\|_1$, where $\phi(\cdot \mid \Sigma_{\mathrm{obs}_j})$ is the density of the marginal distribution over $\mathrm{obs}_j$
- Difficult because $(\Sigma_{\mathrm{obs}_j})^{-1} = S_j(Q) = Q_{\mathrm{obs}_j} - Q_{\mathrm{obs}_j\,\mathrm{mis}_j} Q_{\mathrm{mis}_j}^{-1} Q_{\mathrm{mis}_j\,\mathrm{obs}_j}$

The Expectation-Maximization (EM) algorithm updates $Q^t$ in two steps.
E-step: condition $x_{j,\mathrm{mis}}$ on $\{x_{j,\mathrm{obs}} = y_{j,\mathrm{obs}},\, Q = Q^t\}$ and form
$$\hat\Sigma^t = \tfrac{1}{n}\sum_j \mathbb{E}_{x_{j,\mathrm{mis}} \mid x_{j,\mathrm{obs}} = y_{j,\mathrm{obs}}}\big[y_j y_j^T\big]$$
M-step:
$$Q^{t+1} = \arg\min_{Q \succ 0} -\tfrac{1}{n}\sum_j \mathbb{E}_{x_{j,\mathrm{mis}} \mid x_{j,\mathrm{obs}} = y_{j,\mathrm{obs}}}\big[\log \phi(x_j \mid Q^{-1})\big] + \lambda\|Q\|_1 = \arg\min_{Q \succ 0} -\log|Q| + \mathrm{Tr}(\hat\Sigma^t Q) + \lambda\|Q\|_1$$
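For Gaussians, the E-step is available in closed form by conditioning through the precision matrix. A sketch of one sample's contribution to $\hat\Sigma^t$ (zero-mean case; the function name is ours):

```python
import numpy as np

def e_step_second_moment(Q: np.ndarray, y_obs: np.ndarray,
                         obs: np.ndarray) -> np.ndarray:
    """E[x x^T | x_obs = y_obs] under N(0, Q^{-1}), one sample's E-step."""
    mis = ~obs
    # Conditional law of x_mis | x_obs: mean -Q_mm^{-1} Q_mo y_obs,
    # covariance Q_mm^{-1} (standard Gaussian conditioning via the precision).
    Q_mm = Q[np.ix_(mis, mis)]
    Q_mo = Q[np.ix_(mis, obs)]
    cond_cov = np.linalg.inv(Q_mm)
    cond_mean = -cond_cov @ Q_mo @ y_obs

    p = Q.shape[0]
    M = np.zeros((p, p))
    M[np.ix_(obs, obs)] = np.outer(y_obs, y_obs)
    M[np.ix_(obs, mis)] = np.outer(y_obs, cond_mean)
    M[np.ix_(mis, obs)] = M[np.ix_(obs, mis)].T
    M[np.ix_(mis, mis)] = cond_cov + np.outer(cond_mean, cond_mean)
    return M
```

Averaging these matrices over the n samples gives $\hat\Sigma^t$, and the M-step is then a standard graphical-lasso problem.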
EM algorithm as a Concave-Convex Procedure

Proposition: For Gaussians, the EM algorithm is a CCCP with DC decomposition
$$\min\; f_{EM}(Q) - g_{EM}(Q) \quad \text{s.t.} \quad Q \succ 0$$
$$f_{EM}(Q) = -\log|Q| + \lambda\|Q\|_1 \qquad g_{EM}(Q) = -\tfrac{1}{n}\sum_{j=1}^n \big\{\log|Q_{\mathrm{mis}_j}| + \mathrm{Tr}(y_{j,\mathrm{obs}} y_{j,\mathrm{obs}}^T S_j(Q))\big\}$$

If we pose $h(Q) := -\tfrac{1}{n}\sum_{j=1}^n \log|Q_{\mathrm{mis}_j}|$:
- EM decomposition: $f_{EM} - (h + g)$
- Our decomposition: $(f_{EM} - h) - g$

Proposition: With our decomposition, CCCP is a lower bound on EM (writing $L$ for the linearization at the current iterate):
$$(f_{EM} - h) - Lg \le f_{EM} - (Lh + Lg)$$
Experimental setting

1. Generate n samples $y_1, \dots, y_n$ from $N(0, \Sigma)$ for 3 models, with dimension p = 10, 50, 100 and n = 100, 150, 200: [7]
   - Model 1 (AR(1)): $\Sigma_{ij} = 0.7^{|j-i|}$
   - Model 2: $\Sigma_{ij} = I_{i=j} + 0.4\, I_{|i-j|=1} + 0.2\, I_{|i-j| \in \{2,3\}} + 0.1\, I_{|i-j|=4}$
   - Model 3: $\Sigma = B + \delta I$, where $B_{ij} = 0$ if $i = j$, and $B_{ij} = 0$ or $0.5$ each with probability $0.5$ if $i \neq j$
2. Remove 20, 40, 60, 80% of the data at random from each sample
3. Impute $y_{j,\mathrm{mis}}$ by row means, yielding complete sufficient statistics $\hat\Sigma$
4. Initialization: $Q^0 = \arg\min_{Q \succ 0} -\log|Q| + \mathrm{Tr}(\hat\Sigma Q) + \lambda\|Q\|_1$

[7] N. Städler and P. Bühlmann. Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 2009.
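A sketch of steps 1-3 for Model 1 (our own code, since the slides show no scripts; we read "row means" as each variable's mean over its observed entries):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, missing_frac = 10, 100, 0.2

idx = np.arange(p)
Sigma = 0.7 ** np.abs(idx[:, None] - idx[None, :])   # Model 1: AR(1)
y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

mask = rng.random((n, p)) < missing_frac             # True = missing entry
y_miss = np.where(mask, np.nan, y)

# Per-variable means over observed entries (our reading of "row means"),
# used to impute and form the complete sufficient statistic Sigma_hat.
means = np.nanmean(y_miss, axis=0)
y_imputed = np.where(mask, means, y_miss)
Sigma_hat = y_imputed.T @ y_imputed / n
```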
Numerical results on synthetic datasets

[Figure: results for missing = 20%, 40%, 60%, 80%]
Numerical results on real datasets

[Figure: results for missing = 20%, 40%, 60%, 80%]
Numerical results

[Figure]
Conclusion
Summary of contributions

- The Schur complement is log-concave.
- Hence the sparse inverse covariance estimator with missing data is a DC program.
- We derive a new CCCP for this sparse inverse covariance estimator.
- It shows superior convergence to EM in the number of iterations.
- This is validated by numerical results on synthetic and real datasets.