High-dimensional Covariance Estimation Based On Gaussian Graphical Models

1 High-dimensional Covariance Estimation Based On Gaussian Graphical Models
Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann
February 3, 2012

2 Problem definition
Goal: estimate the covariance matrix of a Gaussian distribution, e.g. for gene expression data.
Take a random sample of vectors $X^{(1)}, \ldots, X^{(n)}$ i.i.d. $N_p(0, \Sigma_0)$, where $p$ may depend on $n$.
Let $\Theta_0 = \Sigma_0^{-1}$ denote the concentration matrix.
Sparsity: certain elements of $\Theta_0$ are zero.
Task: use the sample to estimate a set of zeros, and then an estimator for $\Theta_0$ (and $\Sigma_0$) given the pattern of zeros.
Show consistency in predictive risk and in estimating $\Theta_0$ and $\Sigma_0$ when $n, p \to \infty$.
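To make the setting concrete, here is a minimal sketch in Python with NumPy. The sizes, the seed and the tridiagonal $\Theta_0$ are illustrative assumptions, not values from the slides. It draws $n$ i.i.d. samples from $N_p(0, \Sigma_0)$ with a sparse concentration matrix and checks that, when $p > n$, the sample covariance is rank deficient, so it cannot simply be inverted to estimate $\Theta_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 20, 10  # illustrative sizes with p > n (assumption, not from the slides)

# Sparse concentration matrix: tridiagonal, so most entries of Theta0 are zero.
Theta0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma0 = np.linalg.inv(Theta0)

# Rows of X are the i.i.d. samples X^(1), ..., X^(n) ~ N_p(0, Sigma0).
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)

# Sample covariance for mean-zero data; its rank is at most n.
S_hat = X.T @ X / n
rank = np.linalg.matrix_rank(S_hat)
print("rank of sample covariance:", rank, "dimension p:", p)
```

Because the rank of `S_hat` cannot exceed `n`, some structural assumption such as sparsity of $\Theta_0$ is needed before a concentration matrix can be estimated.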

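3 Gaussian Graphical Model
$X = (X_1, \ldots, X_p) \sim N(\mu, \Sigma_0)$.
Given a graph $G = (V, E_0)$, a pair $(i, j)$ is NOT contained in $E_0$ ($\theta_{0,ij} = 0$) iff $X_i \perp X_j \mid \{X_k ; k \in V \setminus \{i, j\}\}$.
Define the predictive risk under the true $\Sigma_0$ as $R(\Theta) = \mathrm{tr}(\Theta \Sigma_0) - \log|\Theta|$, which equals $-2 E_0(\log f_\Theta(X))$ up to an additive constant,
where the Gaussian log-likelihood function (for $\mu = 0$) is $\log f_\Theta(X) = -\frac{p}{2}\log(2\pi) + \frac{1}{2}\log|\Theta| - \frac{1}{2} X^T \Theta X$.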
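The predictive risk is easy to evaluate numerically. The sketch below assumes NumPy and a small illustrative $\Theta_0$. The function name `predictive_risk` is hypothetical. It computes $R(\Theta) = \mathrm{tr}(\Theta\Sigma_0) - \log|\Theta|$ and compares the risk at the true $\Theta_0$ with the risk of the identity matrix, which ignores all dependence.

```python
import numpy as np

def predictive_risk(Theta, Sigma0):
    """R(Theta) = tr(Theta Sigma0) - log det(Theta)."""
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        raise ValueError("Theta must have a positive determinant")
    return np.trace(Theta @ Sigma0) - logdet

p = 4  # illustrative dimension (assumption)
Theta0 = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma0 = np.linalg.inv(Theta0)

risk_true = predictive_risk(Theta0, Sigma0)         # equals p - log det(Theta0)
risk_identity = predictive_risk(np.eye(p), Sigma0)  # risk when dependence is ignored
print(risk_true, risk_identity)
```

The excess risk $R(\Theta) - R(\Theta_0)$ is nonnegative for any positive definite $\Theta$, since it is twice the Kullback-Leibler divergence between the two Gaussian models.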


5 Previous work
Yuan and Lin (2007), Banerjee, El Ghaoui and d'Aspremont (2008), d'Aspremont et al. (2008), Friedman et al. (2008):
$\min_{\Omega \succ 0} \; \mathrm{tr}(\hat\Sigma \Omega) - \log\det(\Omega) + \lambda \sum_{i \neq j} |\omega_{ij}|$, where $\hat\Sigma$ is the sample covariance matrix.
Meinshausen and Bühlmann (2006): regression approach.

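6 A Regression Model
We assume a multivariate Gaussian model $X = (X_1, \ldots, X_p) \sim N_p(0, \Sigma_0)$, where $\Sigma_{0,ii} = 1$.
Regression model: for each $i = 1, \ldots, p$,
$X_i = \sum_{j \neq i} \beta^i_j X_j + V_i$, where $\beta^i_j = -\theta_{0,ij}/\theta_{0,ii}$ and $V_i \sim N(0, \sigma^2_{V_i})$ is independent of $\{X_j : j \neq i\}$.
For all $i$, $\mathrm{Var}(V_i) = 1/\theta_{0,ii}$.
$X_i \not\perp X_j \mid \{X_k ; k \in V \setminus \{i, j\}\} \iff \theta_{0,ij} \neq 0 \iff \beta^j_i \neq 0$ and $\beta^i_j \neq 0$.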
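The link between regression coefficients and the concentration matrix can be checked numerically. This sketch assumes NumPy and an illustrative tridiagonal $\Theta_0$. For simplicity it is not rescaled to unit variances, which the identities do not need. It compares $\beta^i_j = -\theta_{0,ij}/\theta_{0,ii}$ with the population least-squares coefficients computed from $\Sigma_0$, and the residual variance with $1/\theta_{0,ii}$.

```python
import numpy as np

p = 5  # illustrative dimension (assumption)
Theta0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma0 = np.linalg.inv(Theta0)

i = 2
others = np.delete(np.arange(p), i)

# Coefficients read off the concentration matrix: beta^i_j = -theta_ij / theta_ii.
beta_from_theta = -Theta0[i, others] / Theta0[i, i]

# Population least-squares coefficients of X_i on the remaining variables.
beta_from_sigma = np.linalg.solve(Sigma0[np.ix_(others, others)], Sigma0[others, i])

# Residual variance Var(V_i) from the covariance side.
resid_var = Sigma0[i, i] - Sigma0[i, others] @ beta_from_sigma
print(beta_from_theta, beta_from_sigma, resid_var)
```

Zero entries in row $i$ of $\Theta_0$ show up as zero regression coefficients, which is what makes nodewise regression usable for graph selection.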

7 Meinshausen and Bühlmann (2006)
Perform $p$ regressions using the Lasso to obtain $\hat\beta^1, \ldots, \hat\beta^p$, where $\hat\beta^i = \{\hat\beta^i_j ; j \in \{1, \ldots, p\} \setminus i\}$.
Then estimate the edge set by the OR rule.
Under sparsity and neighborhood stability conditions, they show $P(\hat E_n = E_0) \to 1$ as $n \to \infty$.

8 Essential Sparsity
For each $i$, define $s^i_{0,n}$ as the smallest integer such that
$\sum_{j=1, j \neq i}^{p} \min\{\theta_{0,ij}^2, \lambda^2 \theta_{0,ii}\} \le s^i_{0,n} \lambda^2 \theta_{0,ii}$, where $\lambda = \sqrt{2\log(p)/n}$.
Here $s^i_{0,n}$ at row $i$ describes the number of sufficiently large off-diagonal elements.
Define $S_{0,n} = \sum_{i=1}^{p} s^i_{0,n}$ as the essential sparsity of the graph.
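A small sketch may help to read this definition. It assumes NumPy. The function name `essential_sparsity` and the example matrix are hypothetical. For each row it computes the smallest integer $s^i_{0,n}$ satisfying the inequality, then sums over rows.

```python
import math
import numpy as np

def essential_sparsity(Theta0, n):
    """Return the per-row values s^i_{0,n} and their sum S_{0,n}."""
    p = Theta0.shape[0]
    lam2 = 2.0 * math.log(p) / n  # lambda^2 with lambda = sqrt(2 log(p) / n)
    s = []
    for i in range(p):
        cap = lam2 * Theta0[i, i]
        off_diag = np.delete(Theta0[i], i)
        total = np.minimum(off_diag ** 2, cap).sum()
        # Smallest integer s with total <= s * cap (small tolerance for rounding).
        s.append(math.ceil(total / cap - 1e-12))
    return s, sum(s)

p = 5  # illustrative dimension (assumption)
Theta0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))

s_large_n, S_large_n = essential_sparsity(Theta0, n=100)
s_small_n, S_small_n = essential_sparsity(Theta0, n=10)
print(s_large_n, S_large_n, s_small_n, S_small_n)
```

With $n = 100$ every nonzero off-diagonal entry exceeds the $\lambda$ scale and is counted once. With $n = 10$ the same entries fall below that scale and contribute only fractionally, so the essential sparsity is smaller than the number of nonzero entries.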

9 The estimation procedure
(1) Nodewise regressions for inferring the graph.
(2) Maximum likelihood estimation based on the graph.
(3) Choosing the regularization parameters using cross-validation.

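10 Nodewise regression for inferring the graph
Use the Lasso in combination with thresholding to infer the graph.
For each of the nodewise regressions, use the Lasso with the same $\lambda_n$:
$\beta^i_{\mathrm{init}} = \arg\min_{\beta^i} \sum_{r=1}^{n} \big( X^{(r)}_i - \sum_{j \neq i} \beta^i_j X^{(r)}_j \big)^2 + \lambda_n \sum_{j \neq i} |\beta^i_j|$.
Threshold $\beta^i_{\mathrm{init}}$ with $\tau$ to get the zero set: $D^i = \{ j : j \neq i, \; |\beta^i_{j,\mathrm{init}}| < \tau \}$.
Estimate the edge set $\hat E_n$: after thresholding, there is an edge between nodes $i$ and $j$ iff $\hat\beta^i_j \neq 0$ or $\hat\beta^j_i \neq 0$.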
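The selection step can be sketched as follows, assuming NumPy only. The Lasso is solved by a plain coordinate-descent loop written for this illustration, not by the authors' software. The helper names `lasso_cd` and `infer_graph`, the simulated chain graph, and the values of `lam` and `tau` are all assumptions. Note that `lam` is on the scale of the unnormalized sum of squares, matching the objective on the slide.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for sum_r (y_r - x_r^T b)^2 + lam * sum_k |b_k|."""
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.copy()
    for _ in range(n_sweeps):
        for k in range(d):
            resid += X[:, k] * b[k]  # take coordinate k out of the current fit
            rho = X[:, k] @ resid
            # Soft-thresholding update for the unnormalized objective.
            b[k] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[k]
            resid -= X[:, k] * b[k]
    return b

def infer_graph(X, lam, tau):
    """Nodewise Lasso, thresholding at tau, then the OR rule."""
    n, p = X.shape
    B = np.zeros((p, p))  # row i holds beta^i
    for i in range(p):
        others = np.delete(np.arange(p), i)
        B[i, others] = lasso_cd(X[:, others], X[:, i], lam)
    B[np.abs(B) < tau] = 0.0        # thresholding step
    return (B != 0) | (B != 0).T    # edge if either direction is nonzero

rng = np.random.default_rng(0)
p, n = 5, 2000  # illustrative sizes (assumption)
Theta0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta0), size=n)

adj_hat = infer_graph(X, lam=0.1 * n, tau=0.1)
true_adj = (Theta0 != 0) & ~np.eye(p, dtype=bool)
print(adj_hat.astype(int))
```

Thresholding removes small coefficients that the Lasso leaves in, which is why the estimated graph can stay sparse even when $\lambda_n$ is chosen for prediction rather than for selection.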

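11 Estimation of edge weight
Estimate the concentration matrix by maximum likelihood given the edge set $\hat E_n$.
Denote the sample correlation matrix by $\hat\Gamma_n$: $\hat\Gamma_n = \mathrm{diag}(\hat S_n)^{-1/2} \, \hat S_n \, \mathrm{diag}(\hat S_n)^{-1/2}$.
The estimator for the concentration matrix $\Theta_0$ is
$\hat\Theta_n(E) = \arg\min_{\Theta \in M_{p,E}} \big( \mathrm{tr}(\Theta \hat\Gamma_n) - \log|\Theta| \big)$,
where $M_{p,E} = \{\Theta \in R^{p \times p} ; \Theta \succ 0 \text{ and } \theta_{ij} = 0 \text{ for all } (i,j) \in D\}$ and $D = \{(i,j) : i, j = 1, \ldots, p, \; (i,j) \notin E \text{ and } i \neq j\}$.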

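12 Likelihood equation
Let $\mathrm{diag}(\hat S_n)^{1/2} = \mathrm{diag}(\hat\sigma_1, \ldots, \hat\sigma_p)$.
The following relationships hold for the maximum likelihood estimate $\hat\Theta_n$ and $\hat\Sigma_n = (\hat\Theta_n)^{-1}$:
$\hat\Sigma_{n,ii} = 1$ for $i = 1, \ldots, p$;
$\hat\Sigma_{n,ij} = \hat\Gamma_{n,ij} = \hat S_{n,ij} / (\hat\sigma_i \hat\sigma_j)$ for $(i,j) \in E$;
$\hat\Theta_{n,ij} = 0$ for $(i,j) \in D$.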
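One way to compute a maximum likelihood estimate that satisfies these likelihood equations is the regression-based iteration given as Algorithm 17.1 in Hastie, Tibshirani and Friedman, The Elements of Statistical Learning. The sketch below assumes NumPy and uses that algorithm as a stand-in; the slides do not say which solver the authors used. The function name `constrained_mle` and the 3-node correlation matrix are illustrative assumptions.

```python
import numpy as np

def constrained_mle(Gamma, adj, n_sweeps=100, tol=1e-12):
    """Return (Sigma_hat, Theta_hat) with Sigma_hat = Gamma on edges and the
    diagonal, and Theta_hat = inv(Sigma_hat) equal to zero off the edge set."""
    p = Gamma.shape[0]
    W = Gamma.copy()
    for _ in range(n_sweeps):
        W_prev = W.copy()
        for j in range(p):
            rest = np.delete(np.arange(p), j)
            nbrs = np.flatnonzero(adj[j])  # neighbours of j (adj has a False diagonal)
            beta = np.zeros(p)
            if nbrs.size:
                # Regress node j on its neighbours only, using the current W.
                beta[nbrs] = np.linalg.solve(W[np.ix_(nbrs, nbrs)], Gamma[nbrs, j])
            w_j = W[np.ix_(rest, rest)] @ beta[rest]
            W[rest, j] = w_j
            W[j, rest] = w_j
        if np.max(np.abs(W - W_prev)) < tol:
            break
    return W, np.linalg.inv(W)

# Illustrative sample correlation matrix and a chain graph 0 - 1 - 2 (assumptions).
Gamma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=bool)

Sigma_hat, Theta_hat = constrained_mle(Gamma, adj)
print(np.round(Sigma_hat, 4))
print(np.round(Theta_hat, 4))
```

In this example the fitted covariance keeps the sample correlations on the two edges, while the entry for the missing edge (0, 2) becomes 0.5 × 0.4 = 0.2, the value implied by conditional independence of nodes 0 and 2 given node 1.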

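13 Bias caused by using the estimated edge set $\hat E_n$
For a given $\hat E_n$, define an approximation $\tilde\Theta_0$ of $\Theta_0$ which is identical to $\Theta_0$ on the diagonal and on $\hat E_n$, and zero elsewhere:
$\tilde\Theta_0 = \mathrm{diag}(\Theta_0) + (\Theta_0)_{\hat E_n} = \mathrm{diag}(\Theta_0) + \Theta_{0, \hat E_n \cap E_0}$.
Define $\|\Theta_{0,D}\|_F = \|\tilde\Theta_0 - \Theta_0\|_F$.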
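The bias term can be illustrated with a few lines of NumPy. The function name `sparse_approximation`, the example matrix with one weak edge, and the assumed estimated edge set are all hypothetical. The sketch builds the approximation that keeps the diagonal of $\Theta_0$ and its entries on the estimated edge set, then measures the Frobenius norm of what was dropped.

```python
import numpy as np

def sparse_approximation(Theta0, adj_hat):
    """Keep the diagonal of Theta0 and its off-diagonal entries on the estimated edges."""
    keep = adj_hat | np.eye(Theta0.shape[0], dtype=bool)
    return np.where(keep, Theta0, 0.0)

p = 4  # illustrative dimension (assumption)
Theta0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Theta0[0, 3] = Theta0[3, 0] = 0.05  # one weak edge that selection is assumed to miss

# Assumed estimated edge set: the chain only, without the weak edge (0, 3).
adj_hat = (np.eye(p, k=1) + np.eye(p, k=-1)).astype(bool)

Theta_tilde = sparse_approximation(Theta0, adj_hat)
bias = np.linalg.norm(Theta_tilde - Theta0, "fro")
print(bias)
```

Only the missed weak edge contributes to the bias here, so missing edges with small $|\theta_{0,ij}|$ costs little in Frobenius norm.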

14 Asymptotic Properties
Set of assumptions:
(A0) The size of the neighborhood for each node $i \in V$ is upper bounded by an integer $s < p$, and the sample size satisfies, for some constant $C$, $n \ge C s \log(p/s)$.
(A1) The dimension and $S_{0,n}$ satisfy: $p = o(e^{cn})$ for some constant $0 < c < 1$ and $S_{0,n} = o(n / \log\max(n, p))$ as $n \to \infty$.
(A2) The minimal and maximal eigenvalues of $\Sigma_0$ are bounded, and there exists $v^2 > 0$ such that for all $i$ and $V_i$: $\mathrm{Var}(V_i) = 1/\theta_{0,ii} \ge v^2$.

15 Theorem 1: selection
Assume that (A0) and (A2) hold, and $\Sigma_{0,ii} = 1$ for all $i$. Under appropriately chosen $\lambda_n$ and $\tau$, we obtain with high probability an edge set $\hat E_n$ such that $|\hat E_n| \le 4 S_{0,n}$, where $|\hat E_n \setminus E_0| \le 2 S_{0,n}$.

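16 Theorem 1: estimation
Assume that, in addition, (A1) holds. Then
$\|\hat\Theta_n - \Theta_0\|_2 \le \|\hat\Theta_n - \Theta_0\|_F = O_P\big(\sqrt{S_{0,n} \log\max(n, p)/n}\big)$,
$\|\hat\Sigma_n - \Sigma_0\|_2 \le \|\hat\Sigma_n - \Sigma_0\|_F = O_P\big(\sqrt{S_{0,n} \log\max(n, p)/n}\big)$,
$R(\hat\Theta_n) - R(\Theta_0) = O_P\big(S_{0,n} \log\max(n, p)/n\big)$.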

17 Remark 3: $\Sigma_{0,ii}$ unknown
Under (A0), (A1) and (A2), $\|\hat\Theta_n - \Theta_0\|_2$ and $\|\hat\Sigma_n - \Sigma_0\|_2$ achieve the same rate as in Theorem 1.
For the Frobenius norm and the risk to converge, (A1) is to be replaced by:
(A3) $p \asymp n^c$ for some constant $0 < c < 1$ and $p + S_{0,n} = o(n / \log\max(n, p))$ as $n \to \infty$.
In this case, we have
$\|\hat\Theta_n - \Theta_0\|_F = O_P\big(\sqrt{(p + S_{0,n}) \log\max(n, p)/n}\big)$,
$\|\hat\Sigma_n - \Sigma_0\|_F = O_P\big(\sqrt{(p + S_{0,n}) \log\max(n, p)/n}\big)$,
$R(\hat\Theta_n) - R(\Theta_0) = O_P\big((p + S_{0,n}) \log\max(n, p)/n\big)$.

18 Proposition 4: Bounding the bias due to sparse approximation
Assume that (A0), (A1) and (A2) hold. With the same choices of $\lambda_n$ and $\tau$ as in Theorem 1,
$\|\Theta_{0,D}\|_F = \|\tilde\Theta_0 - \Theta_0\|_F = O_P\big(\sqrt{(p + S_{0,n}) \log\max(n, p)/n}\big)$.

19 Theorem 5: the case $E_0 \subseteq E$
Assume (A1) and (A2) hold, and define $S = |E_0|$. Assume $\Sigma_{0,ii} = 1$ for all $i$. Suppose that with high probability we obtain an edge set $E$ such that $E_0 \subseteq E$ and $|E \setminus E_0| = O(S)$. Then
$\|\hat\Theta_n(E) - \Theta_0\|_F = O_P\big(\sqrt{S \log\max(n, p)/n}\big)$.
This is the case when all non-zero $\beta^i_j$ are sufficiently large.

20 Simulation study
Methods: Gelato, GLasso and Space.
Models: AR(1)-Block model, random concentration matrix model and exponential decay model.
$p = 300$ and $n = 40, 80, 320$.
Measures: $\|\hat\Sigma_n - \Sigma_0\|_F$, $\|\hat\Theta_n - \Theta_0\|_F$ and Kullback-Leibler divergence.
Tuning parameters are selected using cross-validation.
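The error measures listed for the simulation study can be computed as in the sketch below, which assumes NumPy. The function name `evaluation_metrics` is hypothetical. The Kullback-Leibler divergence is taken here as $KL(N(0,\Sigma_0) \,\|\, N(0,\hat\Sigma_n))$ with the usual factor 1/2, which is an assumption since the slides do not state the convention. The operator norm is included because the theorems also bound it.

```python
import numpy as np

def evaluation_metrics(Sigma_hat, Theta_hat, Sigma0, Theta0):
    """Frobenius and operator norm errors plus Gaussian KL divergence."""
    p = Sigma0.shape[0]
    M = Theta_hat @ Sigma0
    # KL(N(0, Sigma0) || N(0, Sigma_hat)) = 0.5 * (tr(M) - log det(M) - p)
    kl = 0.5 * (np.trace(M) - np.linalg.slogdet(M)[1] - p)
    return {
        "sigma_frobenius": np.linalg.norm(Sigma_hat - Sigma0, "fro"),
        "theta_frobenius": np.linalg.norm(Theta_hat - Theta0, "fro"),
        "theta_operator": np.linalg.norm(Theta_hat - Theta0, 2),
        "kl": kl,
    }

p = 4  # illustrative dimension (assumption)
Theta0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma0 = np.linalg.inv(Theta0)

at_truth = evaluation_metrics(Sigma0, Theta0, Sigma0, Theta0)
naive = evaluation_metrics(np.eye(p), np.eye(p), Sigma0, Theta0)  # identity as a crude estimate
print(at_truth)
print(naive)
```

Under this convention the excess predictive risk $R(\hat\Theta_n) - R(\Theta_0)$ equals twice the KL value returned here.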

21 Simulation 1: AR-Block Model
$\Sigma_{\mathrm{Block}} = \{0.9^{|i - j|}\}$
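A block-diagonal covariance of this form can be generated as follows, assuming NumPy. The block size of 30 and the number of blocks (10, giving $p = 300$ as in the simulation study) are assumptions made for this illustration; the slide only gives the within-block entries $0.9^{|i-j|}$. The sketch also checks numerically that the corresponding concentration matrix is tridiagonal, that is, the graph is a set of chains.

```python
import numpy as np

block_size, n_blocks = 30, 10  # assumed sizes giving p = 300
idx = np.arange(block_size)
block = 0.9 ** np.abs(idx[:, None] - idx[None, :])  # entries 0.9^{|i - j|}
Sigma0 = np.kron(np.eye(n_blocks), block)           # block-diagonal covariance
Theta0 = np.linalg.inv(Sigma0)

p = Sigma0.shape[0]
i, j = np.indices((p, p))
# Largest concentration-matrix entry more than one step off the diagonal.
max_far_entry = np.max(np.abs(Theta0[np.abs(i - j) > 1]))
print(p, max_far_entry)
```

Although every within-block entry of $\Sigma_0$ is nonzero, the entries of $\Theta_0$ beyond the first off-diagonal vanish up to rounding error, so the graph to be recovered is very sparse.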

22 Simulation 1: AR-Block Model (continued)

23 Simulation 2: Random Concentration Matrix
$\Theta = B + \delta I$

24 Simulation 3: Exponential Decay Model
$\theta_{0,ij} = \exp(-2|i - j|)$

25 Real Data Application

26 Conclusion
Gelato separates the tasks of model selection and covariance estimation.
Thresholding plays a key role in obtaining a sparse approximation of the graph from a small number of samples.
With stronger conditions on the sample size, convergence rates in terms of operator and Frobenius norms and KL divergence are established.
The method is feasible in high dimensions.
