Inverse Covariance Estimation with Missing Data using the Concave-Convex Procedure

Jérôme Thai (1), Timothy Hunter (1), Anayo Akametalu (1), Claire Tomlin (1), Alex Bayen (1,2)
(1) Department of Electrical Engineering & Computer Sciences, University of California at Berkeley
(2) Department of Civil & Environmental Engineering, University of California at Berkeley

December 8, 2014
Outline

- Motivation: Gaussian Markov Random Fields
- Sparse estimator with missing data
- Concave-Convex Procedure
- Comparison with Expectation-Maximization algorithm
- Conclusion
Motivation: Gaussian Markov Random Fields
Multivariate Gaussian & Gaussian Markov Random Fields

Definition: Multivariate Gaussian
A random vector $x = (X_1, \dots, X_p)^T \in \mathbb{R}^p$ is a multivariate Gaussian with mean $\mu$ and inverse covariance matrix $Q$ if its density is
$$\phi(x \mid \mu, Q^{-1}) = (2\pi)^{-p/2}\,|Q|^{1/2} \exp\!\big(-\tfrac{1}{2}(x-\mu)^T Q (x-\mu)\big)$$

Definition: Gaussian Markov Random Field
A random vector $x = (X_1, \dots, X_p)^T$ is a Gaussian Markov random field w.r.t. a graph $G = (V, E)$ if it is a multivariate Gaussian with
$$Q_{ij} = 0 \iff \{i,j\} \notin E \iff x_i \perp x_j \mid x_{-\{i,j\}}, \qquad \forall\, i \neq j$$
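To make the definition concrete, here is a small numpy sketch (our own toy example, not from the slides): a chain-graph GMRF whose tridiagonal precision matrix encodes that non-adjacent variables are conditionally independent.

```python
import numpy as np

# Zeros in the precision matrix Q encode conditional independence:
# a chain graph gives a tridiagonal Q.
p = 5
Q = np.eye(p) - 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))

# Sample from N(0, Q^{-1}): if Q = L L^T (Cholesky), then x = L^{-T} z with
# z ~ N(0, I) has covariance L^{-T} L^{-1} = Q^{-1}.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(Q)
z = rng.standard_normal((p, 200_000))
x = np.linalg.solve(L.T, z)          # columns are samples from N(0, Q^{-1})

# The empirical precision recovers the zero pattern: entries with
# |i - j| > 1 (non-edges of the chain) come out close to zero.
Q_hat = np.linalg.inv(np.cov(x))
print(np.round(Q_hat, 2))
```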
Applications of Gaussian Markov Random Fields (GMRF)

- Find a sparse $\Sigma^{-1}$ encoding dependency patterns between biological factors [1]
- Sparse estimator for a GMRF from data $y_1, \dots, y_n \in \mathbb{R}^p$ ($\hat\mu$-centered): [2, 3]
$$\hat{Q} = \arg\min_{Q \succ 0}\; -\log|Q| + \mathrm{Tr}\Big(\textstyle\sum_{j=1}^n y_j y_j^T\, Q\Big) + \lambda \|Q\|_1$$

[1] Dobra. Variable selection and dependency networks for genomewide data. Biostatistics, 2009.
[2] Banerjee, El Ghaoui, d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 2008.
[3] Friedman, Hastie, Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 2008.
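The estimator above is a convex program, so it can be prototyped directly with an off-the-shelf solver. Below is a minimal sketch using CVXPY (our choice; the slides do not prescribe a solver), with `sparse_precision`, `S`, and `lam` our own names.

```python
import cvxpy as cp
import numpy as np

# Complete-data sparse estimator: S is the empirical second-moment matrix
# of the centered samples, lam the l1 penalty weight.
def sparse_precision(S: np.ndarray, lam: float) -> np.ndarray:
    p = S.shape[0]
    Q = cp.Variable((p, p), PSD=True)
    obj = cp.Minimize(-cp.log_det(Q) + cp.trace(S @ Q) + lam * cp.norm1(Q))
    cp.Problem(obj).solve()
    return Q.value

# Usage on toy data: 200 centered samples in dimension 5.
rng = np.random.default_rng(0)
y = rng.standard_normal((200, 5))
S = y.T @ y / len(y)
print(np.round(sparse_precision(S, lam=0.1), 2))
```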
Sparse estimator with missing data
Maximum likelihood with missing data

- $\mathrm{obs}_j$ = observed entries, $\mathrm{mis}_j$ = missing entries in each sample $y_j$
- Sparse estimator (maximum likelihood) with missing data:
$$\hat{Q} = \arg\max_{Q \succ 0} \sum_{j=1}^n \log \phi(y_{j,\mathrm{obs}} \mid \Sigma_{\mathrm{obs}_j}) - \lambda \|Q\|_1$$
where $\phi(\cdot \mid \Sigma_{\mathrm{obs}_j})$ is the density of the marginal Gaussian $N(\mu_{\mathrm{obs}_j}, \Sigma_{\mathrm{obs}_j})$, with $\mu_{\mathrm{obs}_j}$, $\Sigma_{\mathrm{obs}_j}$ the subvector and submatrix over the entries $\mathrm{obs}_j$
What is the dependency of $(\Sigma_{\mathrm{obs}_j})^{-1}$ on $Q$?

- Modulo a permutation matrix $P_j$:
$$P_j y_j = \begin{bmatrix} y_{j,\mathrm{obs}} \\ y_{j,\mathrm{mis}} \end{bmatrix} \qquad P_j Q P_j^T = \begin{bmatrix} Q_{\mathrm{obs}_j} & Q_{\mathrm{obs}_j\,\mathrm{mis}_j} \\ Q_{\mathrm{mis}_j\,\mathrm{obs}_j} & Q_{\mathrm{mis}_j} \end{bmatrix}$$
- We have the Schur complement w.r.t. $Q_{\mathrm{obs}_j}$:
$$S_j(Q) := (\Sigma_{\mathrm{obs}_j})^{-1} = Q_{\mathrm{obs}_j} - Q_{\mathrm{obs}_j\,\mathrm{mis}_j} Q_{\mathrm{mis}_j}^{-1} Q_{\mathrm{mis}_j\,\mathrm{obs}_j}$$
- Explicit formulation of the sparse estimator with missing data: [4, 5]
$$\hat{Q} = \arg\min_{Q \succ 0} \sum_{j=1}^n \big\{ -\log|S_j(Q)| + \mathrm{Tr}(y_{j,\mathrm{obs}} y_{j,\mathrm{obs}}^T S_j(Q)) \big\} + \lambda \|Q\|_1$$

[4] Kolar and Xing. Estimating Sparse Precision Matrices from Data with Missing Values. ICML, 2012.
[5] Städler and Bühlmann. Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 2009.
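The Schur complement $S_j(Q)$ is straightforward to compute from a boolean mask of observed entries. A numpy sketch (helper name `schur_complement` is ours), with a sanity check against the definition $S_j(Q) = (\Sigma_{\mathrm{obs}_j})^{-1}$:

```python
import numpy as np

def schur_complement(Q: np.ndarray, obs: np.ndarray) -> np.ndarray:
    """S_j(Q) for one sample, given a boolean mask of observed entries."""
    mis = ~obs
    Q_oo = Q[np.ix_(obs, obs)]
    Q_om = Q[np.ix_(obs, mis)]
    Q_mm = Q[np.ix_(mis, mis)]
    # Q_oo - Q_om Q_mm^{-1} Q_mo, using a solve instead of an explicit inverse
    return Q_oo - Q_om @ np.linalg.solve(Q_mm, Q_om.T)

# Sanity check: S_j(Q) equals the inverse of the marginal covariance.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)              # a random positive definite precision
obs = np.array([True, True, False, True, False])
Sigma = np.linalg.inv(Q)
assert np.allclose(schur_complement(Q, obs),
                   np.linalg.inv(Sigma[np.ix_(obs, obs)]))
```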
Problem statement

- Inverse covariance estimation with missing data has a complicated non-convex objective.
- We design a novel application of the Concave-Convex Procedure to solve our program.
- It has better theoretical and experimental convergence than the Expectation-Maximization algorithm.
Concave-Convex Procedure
Difference of convex programs

Our program: $\min\; f(Q) - g(Q)$ s.t. $Q \succ 0$, with
$$f(Q) = -\sum_{j=1}^n \log|S_j(Q)| + \lambda\|Q\|_1 \qquad g(Q) = -\sum_{j=1}^n \mathrm{Tr}(y_{j,\mathrm{obs}} y_{j,\mathrm{obs}}^T S_j(Q))$$

Why are both f and g convex?

Lemma: The function $Q \mapsto S(Q)$ is concave on the set of positive definite matrices.

Proposition: The function $Q \mapsto \log|S(Q)|$ is concave on the set of positive definite matrices.
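A quick numerical illustration of the Proposition (our own check, not a proof): midpoint concavity of $Q \mapsto \log|S(Q)|$ on random positive definite precisions.

```python
import numpy as np

def random_pd(p, rng):
    A = rng.standard_normal((p, p))
    return A @ A.T + p * np.eye(p)

def log_det_schur(Q, obs):
    # log|S(Q)| with S(Q) the Schur complement of the missing block
    mis = ~obs
    S = Q[np.ix_(obs, obs)] - Q[np.ix_(obs, mis)] @ np.linalg.solve(
        Q[np.ix_(mis, mis)], Q[np.ix_(mis, obs)])
    return np.linalg.slogdet(S)[1]

rng = np.random.default_rng(1)
obs = np.array([True, False, True, True, False])
for _ in range(1000):
    Q1, Q2 = random_pd(5, rng), random_pd(5, rng)
    mid = log_det_schur(0.5 * (Q1 + Q2), obs)
    avg = 0.5 * (log_det_schur(Q1, obs) + log_det_schur(Q2, obs))
    assert mid >= avg - 1e-9        # concavity: value at midpoint dominates
print("midpoint concavity held on 1000 random pairs")
```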
Review of the Concave-Convex Procedure

Definition: Difference of Convex (DC) program
Let $f$, $g$ be two convex functions and $\mathcal{X} \subseteq \mathbb{R}^n$ a convex set; a Difference of Convex (DC) program has the form
$$\min\; f(x) - g(x) \quad \text{s.t.} \quad x \in \mathcal{X}$$

Concave-Convex Procedure (CCCP) to solve DC programs
At $x^t$, solve the convex approximation obtained by linearizing $g$ at $x^t$:
$$x^{t+1} = \arg\min_{x \in \mathcal{X}}\; f(x) - g(x^t) - \nabla g(x^t)^T (x - x^t)$$

Proposition: CCCP is a descent method: $f(x^{t+1}) - g(x^{t+1}) \le f(x^t) - g(x^t)$ [6]

[6] Yuille and Rangarajan. The Concave-Convex Procedure. Neural Computation, 2003.
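A generic CCCP loop is only a few lines. The sketch below (our own toy 1-D DC program, not the paper's estimator) minimizes $f(x) - g(x)$ with $f(x) = (x-1)^4$ and $g(x) = 2x^2$, both convex, by repeatedly linearizing $g$:

```python
from scipy.optimize import minimize

def f(x):          # convex part kept intact
    return (x - 1.0) ** 4

def g(x):          # convex part that gets linearized
    return 2.0 * x ** 2

def g_grad(x):
    return 4.0 * x

x_t = 3.0
for _ in range(100):
    def surrogate(z):   # f(x) - g(x_t) - g'(x_t) (x - x_t)
        return f(z[0]) - g(x_t) - g_grad(x_t) * (z[0] - x_t)
    x_next = float(minimize(surrogate, x0=[x_t]).x[0])
    if abs(x_next - x_t) < 1e-10:
        break
    x_t = x_next
print(x_t, f(x_t) - g(x_t))   # a stationary point of the DC objective
```

Because $g$ is convex, its linearization sits below it, so each surrogate majorizes $f - g$ and touches it at $x^t$; minimizing it gives the descent property stated above.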
Illustration of the Concave-Convex Procedure (CCCP)

[Figure]
Comparison with Expectation-Maximization algorithm
Expectation-Maximization (EM) algorithm

- Our estimate: $\hat{Q} = \arg\min_{Q \succ 0} -\sum_{j=1}^n \log \phi(y_{j,\mathrm{obs}} \mid \Sigma_{\mathrm{obs}_j}) + \lambda\|Q\|_1$, where $\phi(\cdot \mid \Sigma_{\mathrm{obs}_j})$ is the density of the marginal distribution over $\mathrm{obs}_j$
- Difficult because $(\Sigma_{\mathrm{obs}_j})^{-1} = S_j(Q) = Q_{\mathrm{obs}_j} - Q_{\mathrm{obs}_j\,\mathrm{mis}_j} Q_{\mathrm{mis}_j}^{-1} Q_{\mathrm{mis}_j\,\mathrm{obs}_j}$

The Expectation-Maximization (EM) algorithm updates $Q^t$ in two steps.
E-step: condition $x_{j,\mathrm{mis}}$ on $\{x_{j,\mathrm{obs}} = y_{j,\mathrm{obs}},\, Q = Q^t\}$ and form
$$\hat\Sigma^t = \tfrac{1}{n}\sum_j \mathbb{E}_{x_{j,\mathrm{mis}} \mid x_{j,\mathrm{obs}} = y_{j,\mathrm{obs}}}\big[y_j y_j^T\big]$$
M-step:
$$Q^{t+1} = \arg\min_{Q \succ 0} -\tfrac{1}{n}\sum_j \mathbb{E}_{x_{j,\mathrm{mis}} \mid x_{j,\mathrm{obs}} = y_{j,\mathrm{obs}}}\big[\log \phi(x_j \mid Q^{-1})\big] + \lambda\|Q\|_1 = \arg\min_{Q \succ 0} -\log|Q| + \mathrm{Tr}(\hat\Sigma^t Q) + \lambda\|Q\|_1$$
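For Gaussians, the E-step is available in closed form by conditioning through the precision matrix. A sketch of one sample's contribution to $\hat\Sigma^t$ (zero-mean case; the function name is ours):

```python
import numpy as np

def e_step_second_moment(Q: np.ndarray, y_obs: np.ndarray,
                         obs: np.ndarray) -> np.ndarray:
    """E[x x^T | x_obs = y_obs] under N(0, Q^{-1}), one sample's E-step."""
    mis = ~obs
    # Conditional law of x_mis | x_obs: mean -Q_mm^{-1} Q_mo y_obs,
    # covariance Q_mm^{-1} (standard Gaussian conditioning via the precision).
    Q_mm = Q[np.ix_(mis, mis)]
    Q_mo = Q[np.ix_(mis, obs)]
    cond_cov = np.linalg.inv(Q_mm)
    cond_mean = -cond_cov @ Q_mo @ y_obs

    p = Q.shape[0]
    M = np.zeros((p, p))
    M[np.ix_(obs, obs)] = np.outer(y_obs, y_obs)
    M[np.ix_(obs, mis)] = np.outer(y_obs, cond_mean)
    M[np.ix_(mis, obs)] = M[np.ix_(obs, mis)].T
    M[np.ix_(mis, mis)] = cond_cov + np.outer(cond_mean, cond_mean)
    return M
```

Averaging these matrices over the n samples gives $\hat\Sigma^t$, and the M-step is then a standard graphical-lasso problem.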
EM algorithm as a Concave-Convex Procedure

Proposition: For Gaussians, the EM algorithm is a CCCP with DC decomposition
$$\min\; f_{EM}(Q) - g_{EM}(Q) \quad \text{s.t.} \quad Q \succ 0$$
$$f_{EM}(Q) = -\log|Q| + \lambda\|Q\|_1 \qquad g_{EM}(Q) = -\tfrac{1}{n}\sum_{j=1}^n \big\{\log|Q_{\mathrm{mis}_j}| + \mathrm{Tr}(y_{j,\mathrm{obs}} y_{j,\mathrm{obs}}^T S_j(Q))\big\}$$

If we pose $h(Q) := -\tfrac{1}{n}\sum_{j=1}^n \log|Q_{\mathrm{mis}_j}|$:
- EM decomposition: $f_{EM} - (h + g)$
- Our decomposition: $(f_{EM} - h) - g$

Proposition: With our decomposition, CCCP is a lower bound on EM (writing $L$ for the linearization at the current iterate):
$$(f_{EM} - h) - Lg \le f_{EM} - (Lh + Lg)$$
Experimental setting

1. Generate n samples $y_1, \dots, y_n$ from $N(0, \Sigma)$ for 3 models, with dimension p = 10, 50, 100 and n = 100, 150, 200: [7]
   - Model 1 (AR(1)): $\Sigma_{ij} = 0.7^{|j-i|}$
   - Model 2: $\Sigma_{ij} = I_{i=j} + 0.4\, I_{|i-j|=1} + 0.2\, I_{|i-j| \in \{2,3\}} + 0.1\, I_{|i-j|=4}$
   - Model 3: $\Sigma = B + \delta I$, where $B_{ij} = 0$ if $i = j$, and $B_{ij} = 0$ or $0.5$ each with probability $0.5$ if $i \neq j$
2. Remove 20, 40, 60, 80% of the data at random from each sample
3. Impute $y_{j,\mathrm{mis}}$ by row means, yielding complete sufficient statistics $\hat\Sigma$
4. Initialization: $Q^0 = \arg\min_{Q \succ 0} -\log|Q| + \mathrm{Tr}(\hat\Sigma Q) + \lambda\|Q\|_1$

[7] N. Städler and P. Bühlmann. Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 2009.
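A sketch of steps 1-3 for Model 1 (our own code, since the slides show no scripts; we read "row means" as each variable's mean over its observed entries):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, missing_frac = 10, 100, 0.2

idx = np.arange(p)
Sigma = 0.7 ** np.abs(idx[:, None] - idx[None, :])   # Model 1: AR(1)
y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

mask = rng.random((n, p)) < missing_frac             # True = missing entry
y_miss = np.where(mask, np.nan, y)

# Per-variable means over observed entries (our reading of "row means"),
# used to impute and form the complete sufficient statistic Sigma_hat.
means = np.nanmean(y_miss, axis=0)
y_imputed = np.where(mask, means, y_miss)
Sigma_hat = y_imputed.T @ y_imputed / n
```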
Numerical results on synthetic datasets

[Figure: results for missing = 20%, 40%, 60%, 80%]
Numerical results on real datasets

[Figure: results for missing = 20%, 40%, 60%, 80%]
Numerical results

[Figure]
Conclusion
Summary of contributions

- The Schur complement is log-concave.
- Hence the sparse inverse covariance estimator with missing data is a DC program.
- We derive a new CCCP for this sparse inverse covariance estimator.
- It shows superior convergence to EM in the number of iterations.
- This is validated by numerical results on synthetic and real datasets.