Inverse Covariance Estimation with Missing Data using the Concave-Convex Procedure
Inverse Covariance Estimation with Missing Data using the Concave-Convex Procedure

Jérôme Thai (1), Timothy Hunter (1), Anayo Akametalu (1), Claire Tomlin (1), Alex Bayen (1,2)
(1) Department of Electrical Engineering & Computer Sciences, University of California at Berkeley
(2) Department of Civil & Environmental Engineering, University of California at Berkeley

December 8
Outline
- Motivation: Gaussian Markov Random Fields
- Sparse estimator with missing data
- Concave-Convex Procedure
- Comparison with Expectation-Maximization algorithm
- Conclusion
Multivariate Gaussian & Gaussian Markov Random Fields

Definition: Multivariate Gaussian. A random vector $x = (X_1, \dots, X_p)^T \in \mathbb{R}^p$ is a multivariate Gaussian with mean $\mu$ and inverse covariance matrix $Q$ if its density is
$$\varphi(x \mid \mu, Q^{-1}) = (2\pi)^{-p/2} \, |Q|^{1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^T Q (x-\mu)\right)$$

Definition: Gaussian Markov Random Field. A random vector $x = (X_1, \dots, X_p)^T$ is a Gaussian Markov random field w.r.t. a graph $G = (V, E)$ if it is a multivariate Gaussian with
$$Q_{ij} = 0 \iff \{i, j\} \notin E, \ \forall i \neq j \iff X_i \perp X_j \mid x_{-ij}$$
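As a sanity check on the precision-matrix form of the density, the following sketch (illustrative values only) evaluates $\varphi$ directly from $Q$ and compares it with the usual covariance-based formula, using $|Q| = 1/|\Sigma|$:

```python
import numpy as np

def gaussian_density_from_precision(x, mu, Q):
    """Density of N(mu, Q^{-1}) written directly in terms of the
    precision matrix Q, as in the definition above."""
    p = len(x)
    d = x - mu
    return ((2 * np.pi) ** (-p / 2) * np.sqrt(np.linalg.det(Q))
            * np.exp(-0.5 * d @ Q @ d))

# compare against the covariance-based formula on a random PD example
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)       # positive definite covariance
Q = np.linalg.inv(Sigma)              # precision matrix
x, mu = rng.standard_normal(3), np.zeros(3)
ref = ((2 * np.pi) ** (-1.5) * np.linalg.det(Sigma) ** -0.5
       * np.exp(-0.5 * (x - mu) @ Q @ (x - mu)))
assert np.isclose(gaussian_density_from_precision(x, mu, Q), ref)
```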
Applications of Gaussian Markov Random Fields (GMRF)

Find a sparse $\Sigma^{-1}$ for dependency patterns between biological factors [1].

Sparse estimator for a GMRF from data $y_1, \dots, y_n \in \mathbb{R}^p$ ($\hat\mu$-centered) [2, 3]:
$$\hat{Q} = \arg\min_{Q \succ 0} \; -\log|Q| + \mathrm{Tr}\Big(\sum_{j=1}^n y_j y_j^T \, Q\Big) + \lambda \|Q\|_1$$

[1] Dobra. Variable selection and dependency networks for genomewide data. Biostatistics.
[2] Banerjee, El Ghaoui, d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 9, 2008.
[3] Friedman, Hastie, Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 2008.
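With complete data this estimator is the graphical lasso of [3]; a minimal sketch using scikit-learn's GraphicalLasso (the synthetic data, dimension, and regularization value `alpha` are illustrative assumptions):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p = 5
# AR(1)-style covariance: dependence decays with |i - j|
Sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
y = rng.multivariate_normal(np.zeros(p), Sigma, size=500)

# alpha plays the role of lambda: it controls the sparsity of Q = Sigma^{-1}
model = GraphicalLasso(alpha=0.1).fit(y)
Q_hat = model.precision_
print(Q_hat.shape)  # (5, 5)
```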
Sparse estimator with missing data
Maximum likelihood with missing data

Let $\mathrm{obs}_j$ = observed entries and $\mathrm{mis}_j$ = missing entries in each sample $y_j$.

Sparse estimator (maximum likelihood) with missing data:
$$\hat{Q} = \arg\max_{Q \succ 0} \sum_{j=1}^n \log \varphi(y_{j,\mathrm{obs}} \mid \Sigma_{\mathrm{obs}_j}) - \lambda \|Q\|_1$$
where $\varphi(\cdot \mid \Sigma_{\mathrm{obs}_j})$ is the density of the marginal Gaussian $\mathcal{N}(\mu_{\mathrm{obs}_j}, \Sigma_{\mathrm{obs}_j})$, with $\mu_{\mathrm{obs}_j}$ and $\Sigma_{\mathrm{obs}_j}$ the subvector and submatrix over the entries $\mathrm{obs}_j$.
What is the dependency of $(\Sigma_{\mathrm{obs}_j})^{-1}$ on $Q$? [4, 5]

Modulo a permutation matrix $P_j$:
$$P_j y_j = \begin{bmatrix} y_{j,\mathrm{obs}} \\ y_{j,\mathrm{mis}} \end{bmatrix}, \qquad P_j Q P_j^T = \begin{bmatrix} Q_{\mathrm{obs}_j} & Q_{\mathrm{obs}_j \mathrm{mis}_j} \\ Q_{\mathrm{mis}_j \mathrm{obs}_j} & Q_{\mathrm{mis}_j} \end{bmatrix}$$

We have the Schur complement w.r.t. $Q_{\mathrm{obs}_j}$:
$$S_j(Q) := (\Sigma_{\mathrm{obs}_j})^{-1} = Q_{\mathrm{obs}_j} - Q_{\mathrm{obs}_j \mathrm{mis}_j} Q_{\mathrm{mis}_j}^{-1} Q_{\mathrm{mis}_j \mathrm{obs}_j}$$

Explicit formulation of the sparse estimator with missing data [4, 5]:
$$\hat{Q} = \arg\min_{Q \succ 0} \sum_{j=1}^n \big\{ -\log|S_j(Q)| + \mathrm{Tr}(y_{j,\mathrm{obs}} y_{j,\mathrm{obs}}^T \, S_j(Q)) \big\} + \lambda \|Q\|_1$$

[4] Kolar and Xing. Estimating Sparse Precision Matrices from Data with Missing Values. ICML, 2012.
[5] Städler and Bühlmann. Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 2012.
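The Schur-complement identity $S_j(Q) = (\Sigma_{\mathrm{obs}_j})^{-1}$ can be verified numerically; a short sketch with assumed index sets:

```python
import numpy as np

rng = np.random.default_rng(1)
p, obs = 6, [0, 2, 3]                       # observed indices (assumed)
mis = [i for i in range(p) if i not in obs] # the rest are missing

# random positive definite precision matrix Q and its covariance
A = rng.standard_normal((p, p))
Q = A @ A.T + p * np.eye(p)
Sigma = np.linalg.inv(Q)

# Schur complement of Q w.r.t. the observed block
S = (Q[np.ix_(obs, obs)]
     - Q[np.ix_(obs, mis)] @ np.linalg.inv(Q[np.ix_(mis, mis)]) @ Q[np.ix_(mis, obs)])

# it equals the inverse of the marginal covariance over the observed entries
assert np.allclose(S, np.linalg.inv(Sigma[np.ix_(obs, obs)]))
```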
Problem statement

- Inverse covariance estimation with missing data has a complicated non-convex objective.
- We design a novel application of the Concave-Convex Procedure to solve our program.
- It achieves better theoretical and experimental convergence than the Expectation-Maximization algorithm.
Concave-Convex Procedure
Difference of convex programs

Our program: $\min f(Q) - g(Q)$ s.t. $Q \succ 0$, with
$$f(Q) = -\sum_{j=1}^n \log|S_j(Q)| + \lambda\|Q\|_1, \qquad g(Q) = -\sum_{j=1}^n \mathrm{Tr}(y_{j,\mathrm{obs}} y_{j,\mathrm{obs}}^T \, S_j(Q))$$

Why are both $f$ and $g$ convex?

Lemma. The function $Q \mapsto S(Q)$ is concave on the set of positive definite matrices.

Proposition. The function $Q \mapsto \log|S(Q)|$ is concave on the set of positive definite matrices.
Review of the Concave-Convex Procedure [6]

Definition: Difference of Convex (DC) program. Let $f$, $g$ be two convex functions and $\mathcal{X} \subseteq \mathbb{R}^n$ a convex set; a Difference of Convex (DC) program is
$$\min f(x) - g(x) \quad \text{s.t.} \quad x \in \mathcal{X}$$

Concave-convex procedure (CCCP) to solve DC programs: at $x^t$, solve the convex approximation obtained by linearizing $g$:
$$x^{t+1} = \arg\min_{x \in \mathcal{X}} \; f(x) - g(x^t) - \nabla g(x^t)^T (x - x^t)$$

Proposition. CCCP is a descent method: $f(x^{t+1}) - g(x^{t+1}) \le f(x^t) - g(x^t)$.

[6] Yuille and Rangarajan. The Concave-Convex Procedure. Neural Computation, 2003.
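A minimal sketch of CCCP on a toy one-dimensional DC program (the functions $f$, $g$ here are illustrative, not the paper's objective), demonstrating the descent property:

```python
# Toy DC program: minimize f(x) - g(x) with f(x) = x**2 (convex)
# and g(x) = 2|x - 1| (convex); their difference is non-convex.
f = lambda x: x ** 2
g = lambda x: 2 * abs(x - 1)
grad_g = lambda x: 2.0 if x > 1 else -2.0   # a subgradient of g

obj = lambda x: f(x) - g(x)

x = 0.0
values = [obj(x)]
for _ in range(20):
    # CCCP step: linearize g at x, then minimize f(y) - grad_g(x) * y
    # exactly (set derivative 2y - grad_g(x) to zero)
    x = grad_g(x) / 2.0
    values.append(obj(x))

# CCCP is a descent method: the objective never increases
assert all(values[t + 1] <= values[t] + 1e-12 for t in range(len(values) - 1))
```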
Illustration of the Concave-Convex Procedure (CCCP)
[figure omitted]
Comparison with Expectation-Maximization algorithm
Expectation-Maximization (EM) algorithm

Our estimate:
$$\hat{Q} = \arg\min_{Q \succ 0} -\sum_{j=1}^n \log \varphi(y_{j,\mathrm{obs}} \mid \Sigma_{\mathrm{obs}_j}) + \lambda \|Q\|_1$$
where $\varphi$ is the density of the marginal distribution over $\mathrm{obs}_j$.

This is difficult because $(\Sigma_{\mathrm{obs}_j})^{-1} = S_j(Q) = Q_{\mathrm{obs}_j} - Q_{\mathrm{obs}_j \mathrm{mis}_j} Q_{\mathrm{mis}_j}^{-1} Q_{\mathrm{mis}_j \mathrm{obs}_j}$.

The EM algorithm updates $Q^t$ in two steps:

E-step: model the missing entries as $y_{j,\mathrm{mis}} \sim x_{j,\mathrm{mis}} \mid \{x_{j,\mathrm{obs}} = y_{j,\mathrm{obs}},\, Q = Q^t\}$ and form the expected sufficient statistic
$$\hat\Sigma^t = \frac{1}{n} \sum_j \mathbb{E}_{x_{j,\mathrm{mis}} \mid x_{j,\mathrm{obs}} = y_{j,\mathrm{obs}}}\big[y_j y_j^T\big]$$

M-step:
$$Q^{t+1} = \arg\min_{Q \succ 0} -\sum_j \mathbb{E}_{x_{j,\mathrm{mis}} \mid x_{j,\mathrm{obs}} = y_{j,\mathrm{obs}}} \log \varphi(x_j \mid Q^{-1}) + \lambda \|Q\|_1 = \arg\min_{Q \succ 0} -\log|Q| + \mathrm{Tr}(\hat\Sigma^t Q) + \lambda \|Q\|_1$$
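The E-step above can be sketched for a zero-mean Gaussian; the helper name and masking convention are assumptions, not the authors' code. It uses the standard conditional $x_{\mathrm{mis}} \mid x_{\mathrm{obs}} \sim \mathcal{N}(-Q_{\mathrm{mis}}^{-1} Q_{\mathrm{mis}\,\mathrm{obs}}\, x_{\mathrm{obs}},\, Q_{\mathrm{mis}}^{-1})$:

```python
import numpy as np

def e_step_stats(Y, mask, Q):
    """One EM E-step for a zero-mean Gaussian with precision Q:
    returns the expected sufficient statistic (1/n) sum_j E[y_j y_j^T | y_obs],
    where mask[j, i] is True if entry i of sample j is observed."""
    n, p = Y.shape
    S = np.zeros((p, p))
    for j in range(n):
        obs = np.where(mask[j])[0]
        mis = np.where(~mask[j])[0]
        y = np.zeros(p)
        y[obs] = Y[j, obs]
        # conditional distribution of x_mis given x_obs under precision Q:
        # mean = -Q_mm^{-1} Q_mo y_obs, covariance = Q_mm^{-1}
        Qmm_inv = np.linalg.inv(Q[np.ix_(mis, mis)])
        y[mis] = -Qmm_inv @ Q[np.ix_(mis, obs)] @ Y[j, obs]
        C = np.outer(y, y)
        # E[y y^T] adds the conditional covariance on the missing block
        C[np.ix_(mis, mis)] += Qmm_inv
        S += C
    return S / n
```

For example, with $Q = I$ (independent coordinates) and one half-observed sample, the conditional mean of the missing entry is 0 and its conditional variance is 1.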
EM algorithm as a Concave-Convex procedure

Proposition. For Gaussians, the EM algorithm is a CCCP with DC decomposition
$$\min f_{\mathrm{EM}}(Q) - g_{\mathrm{EM}}(Q) \quad \text{s.t.} \quad Q \succ 0$$
$$f_{\mathrm{EM}}(Q) = -\log|Q| + \lambda\|Q\|_1, \qquad g_{\mathrm{EM}}(Q) = -\frac{1}{n}\sum_{j=1}^n \big\{\log|Q_{\mathrm{mis}_j}| + \mathrm{Tr}(y_{j,\mathrm{obs}} y_{j,\mathrm{obs}}^T \, S_j(Q))\big\}$$

If we pose $h(Q) := -\frac{1}{n}\sum_{j=1}^n \log|Q_{\mathrm{mis}_j}|$ (so that $f_{\mathrm{EM}} - h$ matches our convex term, since $|S_j(Q)| = |Q| / |Q_{\mathrm{mis}_j}|$):

- EM decomposition: $f_{\mathrm{EM}} - (h + g)$
- Our decomposition: $(f_{\mathrm{EM}} - h) - g$

Proposition. With our decomposition, the CCCP surrogate is a lower bound on EM's:
$$(f_{\mathrm{EM}} - h) - \mathcal{L} g \le f_{\mathrm{EM}} - (\mathcal{L} h + \mathcal{L} g)$$
where $\mathcal{L}$ denotes linearization at the current iterate.
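The identity $|S_j(Q)| = |Q| / |Q_{\mathrm{mis}_j}|$ linking the two decompositions is the Schur determinant formula; a quick numerical check with assumed index sets:

```python
import numpy as np

rng = np.random.default_rng(2)
p, obs, mis = 5, [0, 1], [2, 3, 4]   # illustrative observed/missing split

A = rng.standard_normal((p, p))
Q = A @ A.T + p * np.eye(p)          # positive definite precision matrix

Qm = Q[np.ix_(mis, mis)]
S = (Q[np.ix_(obs, obs)]
     - Q[np.ix_(obs, mis)] @ np.linalg.inv(Qm) @ Q[np.ix_(mis, obs)])

# Schur determinant identity: det(S) = det(Q) / det(Q_mis)
assert np.isclose(np.linalg.det(S), np.linalg.det(Q) / np.linalg.det(Qm))
```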
Experimental setting [7]

1. Generate $n$ samples $y_1, \dots, y_n$ from $\mathcal{N}(0, \Sigma)$ for 3 models, with dimension p = 10, 50, 100 and n = 100, 150, 200:
   - Model 1 (AR(1)): $\Sigma_{ij} = 0.7^{|i-j|}$
   - Model 2: $\Sigma_{ij} = I_{i=j} + 0.4\, I_{|i-j|=1} + 0.2\, I_{|i-j| \in \{2,3\}} + 0.1\, I_{|i-j|=4}$
   - Model 3: $\Sigma = B + \delta I$, where $B_{ij} = 0$ if $i = j$, and $B_{ij} = 0$ or $0.5$ each with probability $0.5$ if $i \neq j$
2. Remove at random 20, 40, 60, 80% of the data for each sample.
3. Impute $y_{j,\mathrm{mis}}$ by row means $\Rightarrow$ complete sufficient statistic $\hat\Sigma$.
4. Initialization: $Q^0 = \arg\min_{Q \succ 0} -\log|Q| + \mathrm{Tr}(\hat\Sigma Q) + \lambda\|Q\|_1$

[7] Städler and Bühlmann. Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 2012.
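The three covariance models above can be sketched as follows (the helper name and the shift `delta` for Model 3 are assumptions; the slides do not specify how $\delta$ is chosen):

```python
import numpy as np

def model_sigma(which, p, rng=None, delta=1.0):
    """Covariance matrices for the three synthetic models.
    delta is an assumed positive-definiteness shift for Model 3."""
    idx = np.arange(p)
    d = np.abs(np.subtract.outer(idx, idx))      # |i - j|
    if which == 1:                               # AR(1): 0.7^{|i-j|}
        return 0.7 ** d
    if which == 2:                               # banded model
        return ((d == 0) + 0.4 * (d == 1)
                + 0.2 * np.isin(d, [2, 3]) + 0.1 * (d == 4)).astype(float)
    if which == 3:                               # random sparse B + delta*I
        rng = rng or np.random.default_rng(0)
        B = 0.5 * rng.integers(0, 2, size=(p, p))
        B = np.triu(B, 1)
        B = B + B.T                              # symmetric, zero diagonal
        return B + delta * np.eye(p)
    raise ValueError(which)
```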
Numerical results on synthetic datasets
[figures: missing = 20%, 40%, 60%, 80%]

Numerical results on real datasets
[figures: missing = 20%, 40%, 60%, 80%]

Numerical results
[figure omitted]
Conclusion
Summary of contributions

- The Schur complement is log-concave.
- Hence the sparse inverse covariance estimator with missing data is a DC program.
- We derive a new CCCP for this sparse inverse covariance estimator.
- It shows superior convergence in number of iterations.
- This is validated by numerical results on synthetic and real datasets.
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,
More informationLINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,
More informationHigh-dimensional statistics: Some progress and challenges ahead
High-dimensional statistics: Some progress and challenges ahead Martin Wainwright UC Berkeley Departments of Statistics, and EECS University College, London Master Class: Lecture Joint work with: Alekh
More informationCoordinate Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization / Slides adapted from Tibshirani
Coordinate Descent Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh Convex Optimization 10-725/36-725 Slides adapted from Tibshirani Coordinate descent We ve seen some pretty sophisticated methods
More information10-725/36-725: Convex Optimization Prerequisite Topics
10-725/36-725: Convex Optimization Prerequisite Topics February 3, 2015 This is meant to be a brief, informal refresher of some topics that will form building blocks in this course. The content of the
More informationLINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input
More informationSparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28
Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:
More informationGaussian Mixture Models
Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some
More informationGaussian Processes 1. Schedule
1 Schedule 17 Jan: Gaussian processes (Jo Eidsvik) 24 Jan: Hands-on project on Gaussian processes (Team effort, work in groups) 31 Jan: Latent Gaussian models and INLA (Jo Eidsvik) 7 Feb: Hands-on project
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a
More informationLinear Regression. CSL603 - Fall 2017 Narayanan C Krishnan
Linear Regression CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis Regularization
More informationBayesian Learning of Sparse Gaussian Graphical Models. Duke University Durham, NC, USA. University of South Carolina Columbia, SC, USA I.
Bayesian Learning of Sparse Gaussian Graphical Models Minhua Chen, Hao Wang, Xuejun Liao and Lawrence Carin Electrical and Computer Engineering Department Duke University Durham, NC, USA Statistics Department
More informationLinear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan
Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis
More information1 Bayesian Linear Regression (BLR)
Statistical Techniques in Robotics (STR, S15) Lecture#10 (Wednesday, February 11) Lecturer: Byron Boots Gaussian Properties, Bayesian Linear Regression 1 Bayesian Linear Regression (BLR) In linear regression,
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More informationCS281A/Stat241A Lecture 17
CS281A/Stat241A Lecture 17 p. 1/4 CS281A/Stat241A Lecture 17 Factor Analysis and State Space Models Peter Bartlett CS281A/Stat241A Lecture 17 p. 2/4 Key ideas of this lecture Factor Analysis. Recall: Gaussian
More informationBiostat 2065 Analysis of Incomplete Data
Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies
More informationLecture 9: September 28
0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationFast Direct Methods for Gaussian Processes
Fast Direct Methods for Gaussian Processes Mike O Neil Departments of Mathematics New York University oneil@cims.nyu.edu December 12, 2015 1 Collaborators This is joint work with: Siva Ambikasaran Dan
More informationTractable Upper Bounds on the Restricted Isometry Constant
Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.
More informationLearning Binary Classifiers for Multi-Class Problem
Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,
More informationA Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression
A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationRegularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008
Regularized Estimation of High Dimensional Covariance Matrices Peter Bickel Cambridge January, 2008 With Thanks to E. Levina (Joint collaboration, slides) I. M. Johnstone (Slides) Choongsoon Bae (Slides)
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More informationProperties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation
Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana
More information11 : Gaussian Graphic Models and Ising Models
10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood
More informationPDF hosted at the Radboud Repository of the Radboud University Nijmegen
PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is a preprint version which may differ from the publisher's version. For additional information about this
More informationRegulatory Inferece from Gene Expression. CMSC858P Spring 2012 Hector Corrada Bravo
Regulatory Inferece from Gene Expression CMSC858P Spring 2012 Hector Corrada Bravo 2 Graphical Model Let y be a vector- valued random variable Suppose some condi8onal independence proper8es hold for some
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationA Framework for Feature Selection in Clustering
A Framework for Feature Selection in Clustering Daniela M. Witten and Robert Tibshirani October 10, 2011 Outline Problem Past work Proposed (sparse) clustering framework Sparse K-means clustering Sparse
More informationThe lasso: some novel algorithms and applications
1 The lasso: some novel algorithms and applications Newton Institute, June 25, 2008 Robert Tibshirani Stanford University Collaborations with Trevor Hastie, Jerome Friedman, Holger Hoefling, Gen Nowak,
More informationAn Efficient Sparse Metric Learning in High-Dimensional Space via l 1 -Penalized Log-Determinant Regularization
via l 1 -Penalized Log-Determinant Regularization Guo-Jun Qi qi4@illinois.edu Depart. ECE, University of Illinois at Urbana-Champaign, 405 North Mathews Avenue, Urbana, IL 61801 USA Jinhui Tang, Zheng-Jun
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationConfidence Intervals for Low-dimensional Parameters with High-dimensional Data
Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Cun-Hui Zhang and Stephanie S. Zhang Rutgers University and Columbia University September 14, 2012 Outline Introduction Methodology
More information