High-dimensional covariance estimation based on Gaussian graphical models
1 High-dimensional covariance estimation based on Gaussian graphical models

Shuheng Zhou
Department of Statistics, The University of Michigan, Ann Arbor

IMA workshop on High Dimensional Phenomena, Sept. 26, 2011

Joint work with Philipp Rütimann, Min Xu, and Peter Bühlmann
2 Problem definition

Want to estimate the covariance matrix for Gaussian distributions: e.g., stock prices.

Take a random sample of vectors $X^{(1)}, \dots, X^{(n)}$ i.i.d. $N_p(0, \Sigma_0)$, where $p$ is understood to depend on $n$. Let $\Theta_0 := \Sigma_0^{-1}$ denote the concentration matrix.

Sparsity: certain elements of $\Theta_0$ are assumed to be zero.

Task: use the sample to obtain a set of zeros, and then an estimator for $\Theta_0$ (and $\Sigma_0$) based on the given pattern of zeros. Show consistency in predictive risk and in estimating $\Theta_0$ and $\Sigma_0$ as $n, p \to \infty$.
3 Gaussian graphical model: representation

Let $X$ be a $p$-dimensional Gaussian random vector, $X = (X_1, \dots, X_p) \sim N(0, \Sigma_0)$, where $\Sigma_0 = \Theta_0^{-1}$.

In the Gaussian graphical model $G = (V, E_0)$ with $|V| = p$: a pair $(i, j)$ is NOT contained in $E_0$ (i.e., $\theta_{0,ij} = 0$) iff $X_i \perp X_j \mid \{X_k;\ k \in V \setminus \{i, j\}\}$.

Define the predictive risk with respect to $\Sigma_0$ as
$$R(\Sigma) = \mathrm{tr}(\Sigma^{-1}\Sigma_0) + \log|\Sigma| = -2\,\mathbb{E}_0\big(\log f_\Sigma(X)\big) - p\log(2\pi),$$
where the Gaussian log-likelihood using $\Sigma$ is
$$\log f_\Sigma(X) = -\frac{p}{2}\log 2\pi - \frac{1}{2}\log|\Sigma| - \frac{1}{2} X^T \Sigma^{-1} X$$
4 Penalized maximum likelihood estimators

To estimate a sparse model (i.e., $|\Theta_0|_0$ is small), recent work has considered $\ell_1$-penalized maximum likelihood estimators. Let $|\Theta|_1 = |\mathrm{vec}\,\Theta|_1 = \sum_{i \neq j} |\theta_{ij}|$, and
$$\hat\Theta_n = \arg\min_{\Theta \succ 0} \Big\{ \mathrm{tr}(\Theta \hat S_n) - \log|\Theta| + \lambda_n |\Theta|_1 \Big\},$$
where $\hat S_n = n^{-1} \sum_{r=1}^n X^{(r)} (X^{(r)})^T$ is the sample covariance.

The graph $\hat G_n$ is determined by the non-zeros of $\hat\Theta_n$.

References: Yuan-Lin 07, d'Aspremont-Banerjee-El Ghaoui 08, Friedman-Hastie-Tibshirani 08, Rothman et al. 08, Zhou-Lafferty-Wasserman 08, and Ravikumar et al. 08
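The $\ell_1$-penalized estimator above is what scikit-learn implements as `GraphicalLasso`; a minimal sketch on simulated data (the chain-graph $\Theta_0$ and the value of `alpha` are our own illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# A sparse chain graph on 3 variables: theta_{0,13} = 0
Theta0 = np.array([[ 2.0, -0.8,  0.0],
                   [-0.8,  2.0, -0.8],
                   [ 0.0, -0.8,  2.0]])
Sigma0 = np.linalg.inv(Theta0)

X = rng.multivariate_normal(np.zeros(3), Sigma0, size=2000)

# alpha plays the role of lambda_n; larger alpha -> sparser Theta_hat
model = GraphicalLasso(alpha=0.05).fit(X)
Theta_hat = model.precision_   # estimate of Theta_0
Sigma_hat = model.covariance_  # estimate of Sigma_0
```

The non-edge entry $\hat\theta_{13}$ is shrunk towards zero while the true edges retain clearly negative entries.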
5 Predictive risks

Fix a point of interest with $f_0 = N(0, \Sigma_0)$. For a given $L_n$, consider a constrained set of positive definite matrices:
$$\Gamma_n = \{\Sigma : \Sigma \succ 0,\ |\Sigma^{-1}|_1 \le L_n\}$$

Define the oracle estimator as $\Sigma^* = \arg\min_{\Sigma \in \Gamma_n} R(\Sigma)$; recall $R(\Sigma) = \mathrm{tr}(\Sigma^{-1}\Sigma_0) + \log|\Sigma|$.

Define $\hat\Sigma_n$ as the minimizer of $R_n(\Sigma)$ subject to $\Sigma \in \Gamma_n$:
$$\hat\Sigma_n = \arg\min_{\Sigma \in \Gamma_n} \underbrace{\Big\{ \mathrm{tr}(\Sigma^{-1} \hat S_n) + \log|\Sigma| \Big\}}_{R_n(\Sigma)}$$

$R_n(\Sigma)$ is the negative Gaussian log-likelihood function (up to scaling and constants) and $\hat S_n$ is the sample covariance.
6 Risk consistency

Persistence Theorem: Let $p < n^\xi$ for some $\xi > 0$. Given $\Gamma_n = \{\Sigma : \Sigma \succ 0,\ |\Sigma^{-1}|_1 \le L_n\}$ with $L_n = o\big(\sqrt{n/\log n}\big)$ as $n \to \infty$, we have
$$R(\hat\Sigma_n) - R(\Sigma^*_n) \stackrel{P}{\to} 0,$$
where $R(\Sigma) = \mathrm{tr}(\Sigma^{-1}\Sigma_0) + \log|\Sigma|$ and $\Sigma^*_n = \arg\min_{\Sigma \in \Gamma_n} R(\Sigma)$.

Persistence answers the asymptotic question: how large may the set $\Gamma_n$ be, so that it is still possible to select empirically a predictor whose risk is close to that of the best predictor in the set (see Greenshtein-Ritov 04)?
7 Non-edges act as the constraints

Suppose we obtain an edge set $E$ such that $E_0 \subseteq E$. Define the estimator for the concentration matrix $\Theta_0$ as
$$\hat\Theta_n(E) = \arg\min_{\Theta \in M_E} \Big( \mathrm{tr}(\Theta \hat S_n) - \log|\Theta| \Big),$$
where $M_E = \{\Theta \succ 0 \text{ and } \theta_{ij} = 0\ \forall (i,j) \notin E,\ i \neq j\}$.

Theorem. Assume that $0 < \varphi_{\min}(\Sigma_0) \le \varphi_{\max}(\Sigma_0) < \infty$. Suppose that $E_0 \subseteq E$ and $|E \setminus E_0| = O(S)$, where $S = |E_0|$. Then
$$\|\hat\Theta_n(E) - \Theta_0\|_F = O_P\Big( \sqrt{(p+S)\log\max(n,p)/n} \Big)$$

This is the same rate as Rothman et al. 08 for the $\ell_1$-penalized likelihood estimate.
8 Get rid of the dependency on p

Theorem. Assume that $0 < \varphi_{\min}(\Sigma_0) \le \varphi_{\max}(\Sigma_0) < \infty$ and that $\Sigma_{0,ii} = 1$ for all $i$. Suppose we obtain an edge set $E$ such that $E_0 \subseteq E$ and $|E \setminus E_0| = O(S)$, where $S := |E_0| = \sum_{i=1}^p s_i$. Then
$$\|\hat\Theta_n(E) - \Theta_0\|_F = O_P\Big( \sqrt{S \log\max(n,p)/n} \Big)$$

In the likelihood function, $\hat S_n$ will be replaced by the sample correlation matrix
$$\hat\Gamma_n = \mathrm{diag}(\hat S_n)^{-1/2}\, \hat S_n\, \mathrm{diag}(\hat S_n)^{-1/2}$$
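Replacing $\hat S_n$ by the sample correlation matrix is a one-line diagonal rescaling; a small numpy sketch (the data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Mean-zero data with unequal column scales
X = rng.standard_normal((200, 5)) * np.array([1.0, 2.0, 0.5, 3.0, 1.5])
S_hat = X.T @ X / X.shape[0]          # sample covariance (mean-zero model)

d = np.sqrt(np.diag(S_hat))
Gamma_hat = S_hat / np.outer(d, d)    # diag(S)^(-1/2) S diag(S)^(-1/2)
```

$\hat\Gamma_n$ has unit diagonal and entries in $[-1, 1]$, regardless of the original scales.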
9 Main questions

- How to select an edge set $E$ so that we estimate $\Theta_0$ well?
- What assumptions do we need to impose on $\Sigma_0$ or $\Theta_0$?
- How does $n$ scale with $p$, $|E|$, or the maximum node degree $\deg(G)$?
- What if some edges have very small weights? How to ensure that $|E \setminus E_0|$ is small?
- How does the edge-constrained maximum likelihood estimate behave with respect to $E_0 \setminus E$ and $E \setminus E_0$?
10 Outline

- Introduction
- The regression model
- The method
- Theoretical results
- Conclusion
11 A Regression Model

We assume a multivariate Gaussian model $X = (X_1, \dots, X_p) \sim N_p(0, \Sigma_0)$, where $\Sigma_{0,ii} = 1$.

Consider a regression formulation of the model: for all $i = 1, \dots, p$,
$$X_i = \sum_{j \neq i} \beta^i_j X_j + V_i,$$
where $\beta^i_j = -\theta_{0,ij}/\theta_{0,ii}$, and $V_i \sim N(0, \sigma^2_{V_i})$ is independent of $\{X_j;\ j \neq i\}$, for which we assume that there exists $v^2 > 0$ such that for all $i$, $\mathrm{Var}(V_i) = 1/\theta_{0,ii} \ge v^2$.

Recall: $X_i \perp X_j \mid \{X_k;\ k \in V \setminus \{i, j\}\} \iff \theta_{0,ij} = 0 \iff \beta^i_j = 0 \text{ and } \beta^j_i = 0$
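The identities $\beta^i_j = -\theta_{0,ij}/\theta_{0,ii}$ and $\mathrm{Var}(V_i) = 1/\theta_{0,ii}$ follow from the Gaussian conditional distribution and can be checked numerically; a sketch with an arbitrary positive definite $\Theta_0$ (the values are illustrative):

```python
import numpy as np

# An arbitrary positive definite concentration matrix (values illustrative)
Theta0 = np.array([[ 2.0, -0.5,  0.0],
                   [-0.5,  2.0, -0.7],
                   [ 0.0, -0.7,  2.0]])
Sigma0 = np.linalg.inv(Theta0)

i, rest = 0, [1, 2]
# Population regression of X_i on the rest: beta = Sigma_{rr}^{-1} Sigma_{r,i}
beta = np.linalg.solve(Sigma0[np.ix_(rest, rest)], Sigma0[rest, i])
beta_from_theta = -Theta0[rest, i] / Theta0[i, i]   # the slide's identity

# Residual variance Var(V_i) = Sigma_ii - Sigma_{i,r} beta = 1 / theta_{0,ii}
resid_var = Sigma0[i, i] - Sigma0[i, rest] @ beta
```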
12 Want to recover the support of $\beta^i$

Take a random sample of size $n$, and use the sample to estimate $\beta^i$ for all $i$; that is, we have for each variable $X_i$,
$$\underbrace{X_i}_{n \times 1} = \underbrace{X_{\backslash i}}_{n \times (p-1)} \underbrace{\beta^i}_{(p-1) \times 1} + \underbrace{\epsilon}_{n \times 1},$$
where we assume $p > n$; that is, we are given high-dimensional data $X$.

Lasso (Tibshirani 96), a.k.a. Basis Pursuit (Chen, Donoho, and Saunders 98, and others):
$$\hat\beta^i = \arg\min_\beta \|X_i - X_{\backslash i}\beta\|_2^2 / (2n) + \lambda_n \|\beta\|_1$$
13 Meinshausen and Bühlmann 06

Perform $p$ regressions using the Lasso to obtain $p$ vectors of regression coefficients $\hat\beta^1, \dots, \hat\beta^p$, where for each $i$, $\hat\beta^i = \{\hat\beta^i_j;\ j \in \{1, \dots, p\} \setminus \{i\}\}$.

Then estimate the edge set by the OR rule: estimate an edge between nodes $i$ and $j$ iff $\hat\beta^i_j \neq 0$ or $\hat\beta^j_i \neq 0$.

Under sparsity and Neighborhood Stability conditions, they show $P(\hat E_n = E_0) \to 1$ as $n \to \infty$.
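A minimal sketch of the nodewise-Lasso-plus-OR-rule recipe using scikit-learn's `Lasso` (the helper name `mb_or_rule` and all parameter values are our own illustrative choices, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import Lasso

def mb_or_rule(X, lam):
    """Nodewise Lasso (Meinshausen-Buhlmann): OR rule on the p supports."""
    n, p = X.shape
    keep = np.zeros((p, p), dtype=bool)
    for i in range(p):
        rest = [j for j in range(p) if j != i]
        # sklearn's Lasso minimizes ||y - Zb||_2^2 / (2n) + alpha * ||b||_1
        beta = Lasso(alpha=lam).fit(X[:, rest], X[:, i]).coef_
        keep[i, rest] = beta != 0
    return keep | keep.T   # OR rule: edge iff either regression selects it

rng = np.random.default_rng(2)
Theta0 = np.array([[ 2.0, -0.9,  0.0],
                   [-0.9,  2.0, -0.9],
                   [ 0.0, -0.9,  2.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Theta0), size=500)
E_hat = mb_or_rule(X, lam=np.sqrt(2 * np.log(3) / 500))
```

The two strong edges of the chain are recovered; whether spurious edges appear depends on the penalty level, which motivates the thresholding step below.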
14 Sparsity

At row $i$, define $s^i_{0,n}$ as the smallest integer such that
$$\sum_{j=1, j \neq i}^{p} \min\{\theta^2_{0,ij},\ \lambda^2 \theta_{0,ii}\} \le s^i_{0,n}\, \lambda^2 \theta_{0,ii}$$

The essential sparsity $s^i_{0,n}$ at row $i$ counts all $(i,j)$ such that $|\theta_{0,ij}| \ge \lambda \sqrt{\theta_{0,ii}} \iff |\beta^i_j| \ge \lambda \sigma_{V_i}$.

Define $S_{0,n} = \sum_{i=1}^p s^i_{0,n}$ as the essential sparsity of the graph, which counts all $(i,j)$ such that $|\theta_{0,ij}| \ge \lambda \min\big(\sqrt{\theta_{0,ii}}, \sqrt{\theta_{0,jj}}\big) \iff |\beta^i_j| \ge \lambda \sigma_{V_i}$ or $|\beta^j_i| \ge \lambda \sigma_{V_j}$.

Aim to keep $\le 2S_{0,n}$ edges in $E$.
15 Defining $2s_0$

Let $0 \le s_0 \le s$ be the smallest integer such that
$$\sum_{i=1}^{p-1} \min(\beta_i^2,\ \lambda^2 \sigma^2) \le s_0 \lambda^2 \sigma^2, \quad \text{where } \lambda = \sqrt{2\log p / n}$$

If we order the $\beta_j$'s in decreasing order of magnitude, $|\beta_1| \ge |\beta_2| \ge \dots \ge |\beta_{p-1}|$, then $|\beta_j| < \lambda\sigma$ for all $j > s_0$.

[Figure: sorted coefficient magnitudes against the threshold $\lambda\sigma$, marking $s_0$, $2s_0$, and $s$; here $p = 512$, $n = 500$, $s = 96$, $\sigma = 1$]

This notion of sparsity has been used in linear regression (Candès-Tao 07, Z09, Z10).
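The definition of $s_0$ can be computed directly; a small numpy sketch mirroring the slide's setting $p = 512$, $n = 500$, $s = 96$, $\sigma = 1$ (the helper name `essential_sparsity` and the split into strong and weak coefficients are our own illustrative choices):

```python
import numpy as np

def essential_sparsity(beta, lam, sigma):
    """Smallest s0 with sum_j min(beta_j^2, lam^2 sigma^2) <= s0 lam^2 sigma^2."""
    total = np.minimum(beta ** 2, (lam * sigma) ** 2).sum()
    return int(np.ceil(total / (lam * sigma) ** 2))

p, n, sigma = 512, 500, 1.0
lam = np.sqrt(2 * np.log(p) / n)   # roughly 0.158

# s = 96 nonzero coefficients, but only 8 are "strong" (above lam * sigma)
beta = np.zeros(p - 1)
beta[:8] = 1.0
beta[8:96] = 0.01
s0 = essential_sparsity(beta, lam, sigma)   # much smaller than s = 96
```

Only the strong coefficients (plus the aggregated mass of the weak ones) contribute, so $s_0$ is far below the nominal support size $s$.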
16 Selection: individual neighborhood

We use the Lasso in combination with thresholding (Z09, Z10) for inferring the graph. Let $\lambda = \sqrt{2\log p / n}$.

For each of the nodewise regressions, obtain an estimator $\hat\beta^i_{\mathrm{init}}$ using the Lasso with penalty parameter $\lambda_n \propto \lambda$:
$$\hat\beta^i_{\mathrm{init}} = \arg\min_{\beta^i} \sum_{r=1}^n \Big( X^{(r)}_i - \sum_{j \neq i} \beta^i_j X^{(r)}_j \Big)^2 + \lambda_n \sum_{j \neq i} |\beta^i_j|$$

Threshold $\hat\beta^i_{\mathrm{init}}$ with $\tau \propto \lambda$ to get the zero set: let $D_i = \{j : j \neq i,\ |\hat\beta^i_{j,\mathrm{init}}| < \tau\}$.
17 Selection: joining the neighborhoods

Define the total zero set as $D = \{(i,j) : i \neq j,\ j \in D_i \text{ and } i \in D_j\}$.

Select the edge set $E := \{(i,j) : i, j = 1, \dots, p,\ i \neq j,\ (i,j) \notin D\}$. That is, the edge set is the union of the (thresholded) neighborhoods across all nodes in the graph.

This reflects the idea that the essential sparsity $S_{0,n}$ of the graph counts all $(i,j)$ such that $|\theta_{0,ij}| \ge \lambda \min\big(\sqrt{\theta_{0,ii}}, \sqrt{\theta_{0,jj}}\big)$.
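A sketch of the two selection steps, thresholding each nodewise Lasso fit and then joining the neighborhoods, with an illustrative helper name and parameter values of our own choosing (not the authors' code):

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_edges(X, lam_n, tau):
    """Nodewise Lasso, threshold at tau, then join the neighborhoods."""
    n, p = X.shape
    keep = np.zeros((p, p), dtype=bool)   # entries below tau form the zero set D_i
    for i in range(p):
        rest = [j for j in range(p) if j != i]
        beta = Lasso(alpha=lam_n).fit(X[:, rest], X[:, i]).coef_
        keep[i, rest] = np.abs(beta) >= tau
    # (i, j) is a non-edge only if BOTH regressions threshold it out
    return keep | keep.T

rng = np.random.default_rng(3)
p, n = 8, 400
Theta0 = 2.0 * np.eye(p)
Theta0[0, 1] = Theta0[1, 0] = -0.9       # a single true edge
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta0), size=n)

lam = np.sqrt(2 * np.log(p) / n)
E_hat = select_edges(X, lam_n=lam, tau=0.5 * lam)
```

Thresholding removes small spurious coefficients that survive the Lasso, while the strong edge is clearly above $\tau$.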
18 Example: a star graph

Construct $\Sigma_0$ from a model used in Ravikumar et al. 08:
$$\Sigma_0 = \begin{pmatrix}
1 & \rho & \rho & \rho & \cdots & 0 \\
\rho & 1 & \rho^2 & \rho^2 & & \vdots \\
\rho & \rho^2 & 1 & \rho^2 & & \\
\rho & \rho^2 & \rho^2 & 1 & & \\
\vdots & & & & \ddots & \\
0 & \cdots & & & & 1
\end{pmatrix}_{p \times p}$$
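A sketch constructing this $\Sigma_0$ with numpy (the helper name `star_covariance` is ours; node 1 is the hub with $s$ spokes, and the remaining nodes are isolated):

```python
import numpy as np

def star_covariance(p, s, rho):
    """Sigma_0 for a star graph: node 1 is the hub, nodes 2..s+1 its spokes."""
    Sigma = np.eye(p)
    Sigma[0, 1:s + 1] = Sigma[1:s + 1, 0] = rho    # hub-spoke covariance rho
    spokes = np.full((s, s), rho ** 2)             # spoke-spoke covariance rho^2
    np.fill_diagonal(spokes, 1.0)
    Sigma[1:s + 1, 1:s + 1] = spokes
    return Sigma                                   # remaining nodes isolated

Sigma0 = star_covariance(p=8, s=4, rho=0.5)
Theta0 = np.linalg.inv(Sigma0)
```

By construction each spoke is $\rho$ times the hub plus independent noise, so $\Sigma_0$ is positive definite and $\Theta_0$ has the star sparsity pattern: spoke-spoke entries of the concentration matrix vanish.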
19 Example: original graph

[Figure: the true graph; $p = 128$, $n = 96$, $s = 8$, $\rho = 0.5$; $\lambda_n = 2\sqrt{2\log p / n}$, $\tau = \sqrt{\log p / n}$]
20 Example: estimated graph with n = 96

[Figure: estimated graph; $\lambda_n = 2\sqrt{2\log p / n}$]
21-27 Example: estimated graphs

[Figures: a sequence of estimated graphs, each with $\lambda_n = 2\sqrt{2\log p / n}$]
28 Example: estimated graph

[Figure: estimated graph; $\tau = \sqrt{\log p / n}$]
29 Gelato: estimation of edge weights

Given a graph with edge set $E$, we estimate the concentration matrix by maximum likelihood. Denote the sample correlation matrix by
$$\hat\Gamma_n = \mathrm{diag}(\hat S_n)^{-1/2}\, \hat S_n\, \mathrm{diag}(\hat S_n)^{-1/2}$$

The estimator for the concentration matrix $\Theta_0$ is
$$\hat\Theta_n(E) = \arg\min_{\Theta \in M_{p,E}} \Big( \mathrm{tr}(\Theta \hat\Gamma_n) - \log|\Theta| \Big),$$
where $M_{p,E} = \{\Theta \in \mathbb{R}^{p \times p};\ \Theta \succ 0 \text{ and } \theta_{ij} = 0 \text{ for all } (i,j) \in D\}$ and $D := \{(i,j) : i, j = 1, \dots, p,\ (i,j) \notin E \text{ and } i \neq j\}$.
30 Likelihood equations

Let $\mathrm{diag}(\hat S_n)^{1/2} = \mathrm{diag}(\hat\sigma_1, \dots, \hat\sigma_p)$. The following relationships hold for the maximum likelihood estimate $\hat\Theta_n$ and $\hat\Sigma_n = (\hat\Theta_n)^{-1}$:
$$\hat\Sigma_{n,ii} = 1, \quad i = 1, \dots, p$$
$$\hat\Sigma_{n,ij} = \hat\Gamma_{n,ij} = \hat S_{n,ij} / (\hat\sigma_i \hat\sigma_j), \quad (i,j) \in E$$
$$\hat\Theta_{n,ij} = 0, \quad (i,j) \in D$$

This is also known as the positive definite matrix completion problem.
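These likelihood equations can be verified on a toy problem by solving the edge-constrained maximum likelihood problem directly; the sketch below parametrizes the free entries of $\Theta$ and minimizes $\mathrm{tr}(\Theta\hat\Gamma_n) - \log|\Theta|$ with a generic scipy solver (our own illustrative implementation, not the authors' algorithm):

```python
import numpy as np
from scipy.optimize import minimize

def constrained_mle(Gamma, edges):
    """Minimize tr(Theta Gamma) - log|Theta| over Theta with zeros off E."""
    p = Gamma.shape[0]
    free = [(i, i) for i in range(p)] + list(edges)  # free entries of Theta

    def unpack(x):
        T = np.zeros((p, p))
        for v, (i, j) in zip(x, free):
            T[i, j] = T[j, i] = v
        return T

    def fun(x):
        T = unpack(x)
        sign, logdet = np.linalg.slogdet(T)
        if sign <= 0:                                # outside the PD cone
            return np.inf, np.zeros_like(x)
        S = np.linalg.inv(T)
        # d/dtheta_ij [tr(T Gamma) - log|T|] = Gamma_ij - S_ij (doubled off-diagonal)
        g = np.array([(2.0 - (i == j)) * (Gamma[i, j] - S[i, j])
                      for (i, j) in free])
        return np.trace(T @ Gamma) - logdet, g

    x0 = np.r_[np.ones(p), np.zeros(len(edges))]     # start at the identity
    return unpack(minimize(fun, x0, jac=True, method="BFGS").x)

# Chain graph 0 - 1 - 2; (0, 2) is a non-edge
Gamma = np.array([[1.0, 0.5, 0.1],
                  [0.5, 1.0, 0.4],
                  [0.1, 0.4, 1.0]])
Theta_hat = constrained_mle(Gamma, edges=[(0, 1), (1, 2)])
Sigma_hat = np.linalg.inv(Theta_hat)
```

At the optimum, $\hat\Sigma_n$ matches $\hat\Gamma_n$ on the diagonal and on $E$, while the non-edge entry $\hat\Sigma_{n,02}$ is filled in by positive definite matrix completion ($0.5 \cdot 0.4 = 0.2$ for the chain, not $\hat\Gamma_{n,02} = 0.1$).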
31 Set of assumptions

Let $c$, $C$ be absolute constants.

(A0) The size of the neighborhood for each node $i \in V$ is bounded by an integer $s < p$, and the sample size satisfies $n \ge Cs \log(Cp/s)$.

(A1) The dimension and the number of sufficiently strong edges $S_{0,n}$ satisfy: $p = o(e^{cn})$ for some $0 < c < 1$ and $S_{0,n} = o(n / \log\max(n,p))$ as $n \to \infty$.

(A2) The minimal and maximal eigenvalues of $\Sigma_0$ are bounded away from 0 and $\infty$, and $\Sigma_{0,ii} = 1$ for all $i$.
32 The main theorem: selection

Assume that (A0) and (A2) hold. Let $\lambda = \sqrt{2\log p / n}$. Let $d$, $C$, $D$ depend on sparse and restricted eigenvalues of $\Sigma_0$, and let $\lambda_n = d\lambda$ and $\tau = D\lambda_n$ be appropriately chosen. Denote the estimated edge set by $E = \hat E_n(\lambda_n, \tau)$.

Then with high probability, $|E| \le 2S_{0,n}$, where $|E \setminus E_0| \le S_{0,n}$, and
$$\|\Theta_{0,D}\|_F \le C \lambda_n \min\Big\{ \sqrt{S_{0,n} \big(\max_{i=1,\dots,p} \theta^2_{0,ii}\big)},\ \sqrt{s_0}\, \|\mathrm{diag}(\Theta_0)\|_F \Big\},$$
where $s_0 = \max_i s^i_{0,n}$ denotes the maximum essential node degree.
33 Example: $p = 128$, $s = 12$, $\rho = 0.5$

$\lambda_n = 2\sqrt{2\log p / n}$, $\tau = f\sqrt{2\log p / n}$

[Figure: false positive rate (FPR) and false negative rate (FNR) against $n$, for the Lasso and for thresholding at $f = 0.30$ and $f = 0.35$]
34 The main theorem: estimation

Assume that, in addition, (A1) holds. Then for $\hat\Theta_n$ and $\hat\Sigma_n = (\hat\Theta_n)^{-1}$,
$$\|\hat\Theta_n - \Theta_0\|_F = O_P\Big( \sqrt{S_{0,n}\log\max(n,p)/n} \Big)$$
$$\|\hat\Sigma_n - \Sigma_0\|_F = O_P\Big( \sqrt{S_{0,n}\log\max(n,p)/n} \Big)$$
$$R(\hat\Theta_n) - R(\Theta_0) = O_P\big( S_{0,n}\log\max(n,p)/n \big)$$

So $\|\hat\Theta_n - \Theta_0\|_2,\ \|\hat\Sigma_n - \Sigma_0\|_2 = O_P\Big( \sqrt{S_{0,n}\log\max(n,p)/n} \Big)$.
35 Obtaining an edge set E

Let $S^i = \{j : j \neq i,\ \beta^i_j \neq 0\}$ and $s^i = |S^i|$. Let $D$, $\lambda_n$, $C$ be the same as in the main theorem.

For each of the nodewise regressions, we apply the same thresholding rule to obtain a subset $I^i$ as follows:
$$I^i = \{j : j \neq i,\ |\hat\beta^i_{j,\mathrm{init}}| \ge \tau = D\lambda_n\}, \quad \text{and} \quad D_i := \{1, \dots, i-1, i+1, \dots, p\} \setminus I^i$$

Then we have with high probability, $|I^i| \le 2s^i_0$ and $|I^i \cup S^i| \le |S^i| + s^i_0$, and
$$\|\beta^i_D\|_2 \le C\lambda_n \sqrt{s^i_0} \quad \text{and} \quad \|\Theta^i_{0,D}\|_2 \le C\lambda_n \sqrt{\theta_{0,ii}}\, \sqrt{s^i_0}$$

The proof follows from results in Z10 on the thresholded Lasso estimator.
36 Oracle inequalities for the Lasso

Theorem (Z10). Under (A0) and (A2), for all nodewise regressions, the Lasso estimator achieves squared $\ell_2$ loss of $O_P(s_0 \sigma^2 \log p / n)$.

[Figure: sorted coefficient magnitudes with $s_0$, $2s_0$, and $s$ marked; $p = 512$, $n = 500$, $s = 96$, $\sigma = 1$]
37 Constructing a pivot point

Now clearly, by the OR rule, we have $E = \{(i,j) : j \in I^i,\ i = 1, \dots, p\}$ and
$$|E| \le \sum_{i=1}^p |I^i| \le \sum_{i=1}^p 2s^i_0 = 2S_0$$

Given a $2S_0$-sparse set of edges $E$, define a sparse approximation $\tilde\Theta_0$ of $\Theta_0$ which is identical to $\Theta_0$ on $E$ and the diagonal, and zero elsewhere:
$$\tilde\Theta_0 = \mathrm{diag}(\Theta_0) + \Theta_{0,E} = \mathrm{diag}(\Theta_0) + \Theta_{0,\, E \cap E_0}$$

Then $|\tilde\Theta_0|_0 = p + 2|E \cap E_0| \le p + 4S_0$, with $(s+1)$-sparse row (column) vectors.
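The pivot $\tilde\Theta_0$ is simply a masked copy of $\Theta_0$, kept on the diagonal and on $E$; a numpy sketch (the matrix and edge set are illustrative), which also exhibits the $\ell_0$ count $p + 2|E \cap E_0|$:

```python
import numpy as np

# Keep Theta_0 on the diagonal and on the selected edges E, zero elsewhere
Theta0 = np.array([[ 2.0, -0.8,  0.3],
                   [-0.8,  2.0, -0.8],
                   [ 0.3, -0.8,  2.0]])
E = {(0, 1), (1, 2)}                     # selected edges (illustrative)

mask = np.eye(3, dtype=bool)
for (i, j) in E:
    mask[i, j] = mask[j, i] = True
Theta0_tilde = np.where(mask, Theta0, 0.0)   # the pivot: bias is Theta0 off E
```

Here the dropped entry $(0, 2)$ is exactly the bias of the sparse approximation, and $|\tilde\Theta_0|_0 = p + 2|E| = 3 + 4 = 7$.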
38 $\tilde\Theta_0$ as a sparse approximation

The bias is small:
$$\|\tilde\Theta_0 - \Theta_0\|_F \le C \max_{i=1,\dots,p}(\theta_{0,ii})\, \sqrt{S_0 \log p / n}$$

For $q = 1, 2$:
$$\|\tilde\Theta_0 - \Theta_0\|_q \le C \max_{i=1,\dots,p}(\theta_{0,ii})\, s_0 \lambda_n$$

Note that each row vector would be $2s$-sparse if we applied the AND rule, however at the cost of a larger bias.
39 $\tilde\Theta_0$ as a pivot

The sparsity and the small bias allow us to bound
$$\|\hat\Theta_n(E) - \tilde\Theta_0\|_F = O_P\Big( \underbrace{\sqrt{S_{0,n}\log\max(n,p)/n}}_{r_n} \Big),$$
where we use the fact that both the estimator $\hat\Theta_n(E)$ and the pivot $\tilde\Theta_0$ are sparse.

By the triangle inequality, we conclude that
$$\|\hat\Theta_n(E) - \Theta_0\|_F \le \|\hat\Theta_n(E) - \tilde\Theta_0\|_F + \|\tilde\Theta_0 - \Theta_0\|_F = O_P\Big( \sqrt{S_{0,n}\log\max(n,p)/n} \Big)$$
40 Generalization of the estimation step

Assume that (A1) and (A2) hold. Let $\sigma^2_{\max} := \max_i \Sigma_{0,ii} < \infty$ and $\sigma^2_{\min} := \min_i \Sigma_{0,ii} > 0$. Let $W = \mathrm{diag}(\Sigma_0)^{1/2}$. Suppose that we obtain an edge set $E$ whose size $|E|$ is a linear function of $S_{0,n}$.

For $\tilde\Theta_0 = \mathrm{diag}(\Theta_0) + \Theta_{0,E}$:
$$\|\tilde\Theta_0 - \Theta_0\|_F \le C\sqrt{2 S_{0,n} \log(p)/n}$$

We note that this is equivalent to assuming
$$\|\tilde\Omega_0 - \Omega_0\|_F \le C\sqrt{2 S_{0,n} \log(p)/n},$$
where $\Omega_0 = W \Theta_0 W$ and $\tilde\Omega_0 = W \tilde\Theta_0 W$.
41 Generalization of the estimation step

Theorem. Suppose the sample size satisfies $n > M S_{0,n} \log\max(n,p)$ for a sufficiently large constant $M$. Then
$$\|\hat\Omega_n(E) - \Omega_0\|_F = O_P\Big( \sqrt{2 S_{0,n}\log\max(p,n)/n} \Big),$$
where $\hat\Omega_n(E)$ is the maximum likelihood estimator based on the sample correlation matrix $\hat\Gamma_n$:
$$\hat\Omega_n(E) = \arg\min_{\Omega \in M_{p,E}} \Big( \mathrm{tr}(\Omega \hat\Gamma_n) - \log|\Omega| \Big)$$
42 Generalization of the estimation step

Given $\hat W = \mathrm{diag}(\hat S_n)^{1/2}$ and $\hat\Omega_n(E)$, compute
$$\hat\Theta_n = \hat W^{-1} \hat\Omega_n \hat W^{-1} \quad \text{and} \quad \hat\Sigma_n = \hat W \big(\hat\Omega_n(E)\big)^{-1} \hat W,$$
for which the following hold:
$$\hat\Sigma_{n,ij} = \hat S_{n,ij} \quad \forall (i,j) \in E \cup \{(i,i) : i = 1, \dots, p\}$$
$$\hat\Theta_{n,ij} = 0 \quad \forall (i,j) \in D$$

Following the bound on $\hat\Omega_n(E)$ and arguments in Rothman et al. (2008),
$$\|\hat\Theta_n - \Theta_0\|_2 = O_P\Big( \sqrt{S_{0,n}\log\max(n,p)/n} \Big)$$
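The rescaling between correlation scale and covariance scale is a pair of diagonal conjugations; a numpy sanity check where $\hat\Omega$ is taken to be the exact inverse sample correlation (a stand-in for the edge-constrained MLE; data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 4)) * np.array([1.0, 3.0, 0.5, 2.0])  # unequal scales
S_hat = X.T @ X / X.shape[0]                  # sample covariance (mean-zero model)

w = np.sqrt(np.diag(S_hat))
W_hat, W_inv = np.diag(w), np.diag(1.0 / w)   # \hat W = diag(S_n)^{1/2}

Gamma_hat = W_inv @ S_hat @ W_inv             # sample correlation
Omega_hat = np.linalg.inv(Gamma_hat)          # stand-in for \hat Omega_n(E)

Theta_hat = W_inv @ Omega_hat @ W_inv         # back to concentration scale
Sigma_hat = W_hat @ np.linalg.inv(Omega_hat) @ W_hat
```

With the exact inverse as the stand-in, the conjugations recover $\hat S_n$ and $\hat S_n^{-1}$ exactly, confirming that the rescaling is the identity on the unconstrained problem.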
43 Generalization of the estimation error

For the Frobenius norm and the risk to converge to zero, (A1) is to be replaced by: $p \le n^c$ for some $0 < c < 1$ and $p + S_{0,n} = o(n/\log\max(n,p))$ as $n \to \infty$.

In this case, we have
$$\|\hat\Theta_n - \Theta_0\|_F = O_P\Big( \sqrt{(p+S_{0,n})\log\max(n,p)/n} \Big)$$
$$\|\hat\Sigma_n - \Sigma_0\|_F = O_P\Big( \sqrt{(p+S_{0,n})\log\max(n,p)/n} \Big)$$
$$R(\hat\Theta_n) - R(\Theta_0) = O_P\big( (p+S_{0,n})\log\max(n,p)/n \big)$$

We could achieve these rates with
$$\hat\Theta_n(E) = \arg\min_{\Theta \in M_{p,E}} \Big( \mathrm{tr}(\Theta \hat S_n) - \log|\Theta| \Big)$$
44 Conclusion

- Gelato separates the tasks of model selection and (inverse) covariance estimation.
- Thresholding plays a key role in obtaining a sparse approximation of the graph with a small bias, using a very small amount of sample.
- With stronger conditions on the sample size, convergence rates in terms of operator and Frobenius norms and KL divergence are established.
- The method is feasible in high dimensions: $p > n$ is allowed.
45 Related work on inverse/covariance estimation

- Regression-based selection/estimation: Meinshausen-Bühlmann 06, Peng et al. 09, Yuan 10, Verzelen 10, Cai-Liu-Luo 11
- Penalized likelihood methods based on the $\ell_1$-norm penalty: Yuan-Lin 07, d'Aspremont-Banerjee-El Ghaoui 08, Friedman-Hastie-Tibshirani 07, Rothman et al. 08, Zhou-Lafferty-Wasserman 08, Ravikumar et al. 08
- Nonconvex: Lam-Fan 09
- Sparse covariance selection/estimation: Bickel and Levina 06, 08; El Karoui 08; Levina and Vershynin 10; and more
46 References

- Rudelson, M. and Zhou, S. (2011). Reconstruction from anisotropic random measurements. ArXiv preprint.
- Zhou, S. (2009). Restricted eigenvalue conditions on subgaussian random matrices. ArXiv preprint (v2).
- Zhou, S. (2009). Thresholding procedures for high dimensional variable selection and statistical estimation. In Advances in Neural Information Processing Systems 22. MIT Press.
- Zhou, S. (2010). Thresholded Lasso for high dimensional variable selection and statistical estimation. ArXiv preprint.
- Zhou, S., Rütimann, P., Xu, M. and Bühlmann, P. (2011). High-dimensional covariance estimation based on Gaussian graphical models. ArXiv preprint (v2).
More informationThe picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R
The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework
More informationComposite Loss Functions and Multivariate Regression; Sparse PCA
Composite Loss Functions and Multivariate Regression; Sparse PCA G. Obozinski, B. Taskar, and M. I. Jordan (2009). Joint covariate selection and joint subspace selection for multiple classification problems.
More informationarxiv: v1 [math.st] 8 Jan 2008
arxiv:0801.1158v1 [math.st] 8 Jan 2008 Hierarchical selection of variables in sparse high-dimensional regression P. J. Bickel Department of Statistics University of California at Berkeley Y. Ritov Department
More informationGeneral principles for high-dimensional estimation: Statistics and computation
General principles for high-dimensional estimation: Statistics and computation Martin Wainwright Statistics, and EECS UC Berkeley Joint work with: Garvesh Raskutti, Sahand Negahban Pradeep Ravikumar, Bin
More informationEstimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation
Estimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation T. Tony Cai 1, Zhao Ren 2 and Harrison H. Zhou 3 University of Pennsylvania, University of
More informationStatistical Machine Learning for Structured and High Dimensional Data
Statistical Machine Learning for Structured and High Dimensional Data (FA9550-09- 1-0373) PI: Larry Wasserman (CMU) Co- PI: John Lafferty (UChicago and CMU) AFOSR Program Review (Jan 28-31, 2013, Washington,
More informationModel-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate
Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Lucas Janson, Stanford Department of Statistics WADAPT Workshop, NIPS, December 2016 Collaborators: Emmanuel
More informationAn algorithm for the multivariate group lasso with covariance estimation
An algorithm for the multivariate group lasso with covariance estimation arxiv:1512.05153v1 [stat.co] 16 Dec 2015 Ines Wilms and Christophe Croux Leuven Statistics Research Centre, KU Leuven, Belgium Abstract
More informationSparse Covariance Matrix Estimation with Eigenvalue Constraints
Sparse Covariance Matrix Estimation with Eigenvalue Constraints Han Liu and Lie Wang 2 and Tuo Zhao 3 Department of Operations Research and Financial Engineering, Princeton University 2 Department of Mathematics,
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57
More informationMarkov Network Estimation From Multi-attribute Data
Mladen Kolar mladenk@cs.cmu.edu Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 117 USA Han Liu hanliu@princeton.edu Department of Operations Research and Financial Engineering,
More informationLinear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1
Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1 ( OWL ) Regularization Mário A. T. Figueiredo Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationHigh-dimensional variable selection via tilting
High-dimensional variable selection via tilting Haeran Cho and Piotr Fryzlewicz September 2, 2010 Abstract This paper considers variable selection in linear regression models where the number of covariates
More informationLearning Gaussian Graphical Models with Unknown Group Sparsity
Learning Gaussian Graphical Models with Unknown Group Sparsity Kevin Murphy Ben Marlin Depts. of Statistics & Computer Science Univ. British Columbia Canada Connections Graphical models Density estimation
More informationSparse Covariance Selection using Semidefinite Programming
Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support
More informationSmoothly Clipped Absolute Deviation (SCAD) for Correlated Variables
Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)
More informationAdaptive estimation of the copula correlation matrix for semiparametric elliptical copulas
Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas Department of Mathematics Department of Statistical Science Cornell University London, January 7, 2016 Joint work
More informationThe deterministic Lasso
The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality
More informationAdaptive First-Order Methods for General Sparse Inverse Covariance Selection
Adaptive First-Order Methods for General Sparse Inverse Covariance Selection Zhaosong Lu December 2, 2008 Abstract In this paper, we consider estimating sparse inverse covariance of a Gaussian graphical
More informationInference for High Dimensional Robust Regression
Department of Statistics UC Berkeley Stanford-Berkeley Joint Colloquium, 2015 Table of Contents 1 Background 2 Main Results 3 OLS: A Motivating Example Table of Contents 1 Background 2 Main Results 3 OLS:
More informationSparse estimation of high-dimensional covariance matrices
Sparse estimation of high-dimensional covariance matrices by Adam J. Rothman A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Statistics) in The
More informationCompressed Sensing and Neural Networks
and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications
More informationHierarchical kernel learning
Hierarchical kernel learning Francis Bach Willow project, INRIA - Ecole Normale Supérieure May 2010 Outline Supervised learning and regularization Kernel methods vs. sparse methods MKL: Multiple kernel
More informationStability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models Han Liu Kathryn Roeder Larry Wasserman Carnegie Mellon University Pittsburgh, PA 15213 Abstract A challenging
More informationThe Iterated Lasso for High-Dimensional Logistic Regression
The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of
More informationTheory and Applications of High Dimensional Covariance Matrix Estimation
1 / 44 Theory and Applications of High Dimensional Covariance Matrix Estimation Yuan Liao Princeton University Joint work with Jianqing Fan and Martina Mincheva December 14, 2011 2 / 44 Outline 1 Applications
More informationDelta Theorem in the Age of High Dimensions
Delta Theorem in the Age of High Dimensions Mehmet Caner Department of Economics Ohio State University December 15, 2016 Abstract We provide a new version of delta theorem, that takes into account of high
More informationEstimation of large dimensional sparse covariance matrices
Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)
More informationGeneralized Concomitant Multi-Task Lasso for sparse multimodal regression
Generalized Concomitant Multi-Task Lasso for sparse multimodal regression Mathurin Massias https://mathurinm.github.io INRIA Saclay Joint work with: Olivier Fercoq (Télécom ParisTech) Alexandre Gramfort
More information