An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss


An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

arXiv: v1 [stat.CO] 12 Nov 2018

Cheng Wang
School of Mathematical Sciences, Shanghai Jiao Tong University

and

Binyan Jiang
Department of Applied Mathematics, The Hong Kong Polytechnic University

Abstract

The estimation of high dimensional precision matrices has been a central topic in statistical learning. However, as the number of parameters scales quadratically with the dimension p, many state-of-the-art methods do not scale well to solve problems with a very large p. In this paper, we propose a very efficient algorithm for precision matrix estimation via penalized quadratic loss functions. Under the high dimension low sample size setting, the computational complexity of our algorithm is linear in the sample size and the number of parameters, which is the same as that of computing the sample covariance matrix.

Keywords: ADMM; High dimension; Penalized quadratic loss; Precision matrix.

1 Introduction

The precision matrix plays an important role in statistical analysis. Under Gaussian assumptions, the precision matrix is often used to study the conditional independence among the random variables. Contemporary applications usually require fast methods for estimating a very high dimensional precision matrix (Meinshausen and Bühlmann, 2006).

Despite recent advances, estimation of the precision matrix remains challenging when the dimension p is very large, owing to the fact that the number of parameters to be estimated is of order O(p²). For example, in the Prostate dataset studied in this paper, 6033 genetic activity measurements are recorded for 102 subjects. The precision matrix to be estimated is of dimension 6033 × 6033, resulting in more than 18 million parameters.

One of the most well-known and popular methods for precision matrix estimation is the graphical lasso (Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008). Without loss of generality, assume that X₁, …, X_n are i.i.d. observations from a p-dimensional Gaussian distribution with mean 0 and covariance matrix Σ. To estimate the precision matrix Ω = Σ⁻¹, the graphical lasso solves the following ℓ₁ regularized negative log-likelihood:

    arg min_{Ω ∈ O_p} tr(SΩ) − log|Ω| + λ‖Ω‖₁,    (1)

where O_p is the set of non-negative definite matrices in R^{p×p}, S is the sample covariance matrix, ‖Ω‖₁ denotes the sum of the absolute values of the entries of Ω, and λ ≥ 0 is a tuning parameter. Although (1) is constructed from the Gaussian likelihood, it is known that the graphical lasso also works for non-Gaussian data (Ravikumar et al., 2011). Many algorithms have been developed to solve the graphical lasso. Friedman et al. (2008) proposed a coordinate descent procedure and Boyd et al. (2011) provided an alternating direction method of multipliers (ADMM) algorithm for solving (1). In order to obtain faster convergence of the iterations, second order methods and proximal gradient algorithms on the dual problem have also been developed; see for example Hsieh et al. (2014), Dalal and Rajaratnam (2017), and the references therein. However, since the graphical lasso involves the matrix determinant, an eigen-decomposition or the computation of the determinant of a p × p matrix is inevitable in these algorithms. Note that the computational complexity of an eigen-decomposition or a matrix determinant is of order O(p³). Thus, the computation time of these algorithms scales up cubically in p.

Recently, Zhang and Zou (2014) and Liu and Luo (2015) proposed to estimate Ω via trace based quadratic loss functions. Using Kronecker products and matrix vectorization, our interest is to estimate Ω = Σ⁻¹, or equivalently,

    vec(Ω) = vec(Σ⁻¹) = (Σ ⊗ I)⁻¹ vec(I),    (2)
    vec(Ω) = vec(Σ⁻¹) = (I ⊗ Σ)⁻¹ vec(I),    (3)

where I denotes the identity matrix and ⊗ denotes the Kronecker product.
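To make the vectorized identities (2) and (3) concrete, here is a small numerical check (a sketch in Python/numpy, not part of the paper; the covariance matrix, the dimension and the seed are arbitrary illustrative choices):

```python
import numpy as np

p = 4
rng = np.random.default_rng(0)
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)        # a positive definite covariance matrix
Omega = np.linalg.inv(Sigma)           # the target precision matrix

I = np.eye(p)
vecI = I.reshape(-1, order="F")        # vec() stacks the columns

# identity (2): vec(Omega) = (Sigma kron I)^{-1} vec(I)
rhs2 = np.linalg.solve(np.kron(Sigma, I), vecI)
# identity (3): vec(Omega) = (I kron Sigma)^{-1} vec(I)
rhs3 = np.linalg.solve(np.kron(I, Sigma), vecI)

print(np.allclose(rhs2, Omega.reshape(-1, order="F")))   # True
print(np.allclose(rhs3, Omega.reshape(-1, order="F")))   # True
```

Both checks print True because ΩΣ = ΣΩ = I, so (Σ ⊗ I)vec(Ω) = vec(ΩΣ) = vec(I) and (I ⊗ Σ)vec(Ω) = vec(ΣΩ) = vec(I).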

Motivated by (2) and the LASSO (Tibshirani, 1996), a natural way to estimate β = vec(Ω) is

    arg min_{β ∈ R^{p²}} (1/2) βᵀ(S ⊗ I)β − βᵀvec(I) + λ‖β‖₁.    (4)

To obtain a symmetric estimator we can use both (2) and (3), and estimate β by

    arg min_{β ∈ R^{p²}} (1/4) βᵀ(S ⊗ I + I ⊗ S)β − βᵀvec(I) + λ‖β‖₁.    (5)

In matrix notation, writing β = vec(Ω), (4) can be written as

    Ω̂₁ = arg min (1/2) tr(ΩSΩᵀ) − tr(Ω) + λ‖Ω‖₁.    (6)

Symmetrization can then be applied to obtain a final estimator. The loss function

    L₁(Ω) := (1/2) tr(ΩSΩᵀ) − tr(Ω)    (7)

is used in Liu and Luo (2015), where a column-wise estimation approach called SCIO is proposed. Similarly, the matrix form of (5) is

    Ω̂₂ = arg min (1/4) tr(ΩᵀSΩ) + (1/4) tr(ΩSΩᵀ) − tr(Ω) + λ‖Ω‖₁.    (8)

The loss function

    L₂(Ω) = (1/2){L₁(Ω) + L₁(Ωᵀ)} = (1/4) tr(ΩSΩᵀ) + (1/4) tr(ΩᵀSΩ) − tr(Ω)

is equivalent to the D-trace loss proposed by Zhang and Zou (2014), owing to the fact that L₂(Ω) naturally forces the solution to be symmetric. In the original papers, Zhang and Zou (2014) and Liu and Luo (2015) established consistency results for the estimators (6) and (8) and showed that their performance is comparable to that of the graphical lasso.

As can be seen from the vectorized formulations (2) and (3), the loss functions L₁(Ω) and L₂(Ω) are quadratic in β. In this note, we show that under such quadratic loss functions, explicit solutions can be obtained in each step of the ADMM algorithm. In particular, we derive explicit formulas for the inverses of (S ⊗ I + ρI) and (2⁻¹S ⊗ I + 2⁻¹I ⊗ S + ρI) for any given ρ > 0, from which we are able to solve (6) and (8), or equivalently (4) and (5), with computational complexity of order O(np²). Such a rate is in some sense optimal, as the complexity of computing S is already of order O(np²).
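The equivalence between the vectorized problems (4)-(5) and the matrix forms (6)-(8) can be checked numerically; the following sketch (numpy, with arbitrary illustrative data, not part of the paper) evaluates both versions of the quadratic losses at a random, not necessarily symmetric, Ω:

```python
import numpy as np

p, n = 5, 20
rng = np.random.default_rng(1)
X = rng.normal(size=(n, p))
S = X.T @ X / n                        # sample covariance (data treated as mean zero)
I = np.eye(p)
vecI = I.reshape(-1, order="F")

Omega = rng.normal(size=(p, p))        # an arbitrary, not necessarily symmetric, point
beta = Omega.reshape(-1, order="F")    # beta = vec(Omega), stacking columns

# matrix forms of the two quadratic losses
L1 = 0.5 * np.trace(Omega @ S @ Omega.T) - np.trace(Omega)
L2 = 0.25 * np.trace(Omega.T @ S @ Omega) + 0.25 * np.trace(Omega @ S @ Omega.T) - np.trace(Omega)

# vectorized counterparts, i.e. the smooth parts of (4) and (5)
q1 = 0.5 * beta @ np.kron(S, I) @ beta - beta @ vecI
q2 = 0.25 * beta @ (np.kron(S, I) + np.kron(I, S)) @ beta - beta @ vecI

L1_T = 0.5 * np.trace(Omega.T @ S @ Omega) - np.trace(Omega.T)   # L1 evaluated at Omega^T
print(np.allclose(L1, q1), np.allclose(L2, q2), np.allclose(L2, 0.5 * (L1 + L1_T)))
```

All three comparisons print True, reflecting the identities βᵀ(S ⊗ I)β = tr(ΩSΩᵀ) and βᵀvec(I) = tr(Ω).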

2 Main Results

For any real matrix M, we shall use ‖M‖₂ = {tr(MMᵀ)}^{1/2} to denote its Frobenius norm, and ‖M‖ to denote its spectral norm, i.e., the square root of the largest eigenvalue of MᵀM. For the quadratic loss function L(Ω) = L₁(Ω) or L₂(Ω), the augmented Lagrangian is

    L_a(Ω, A, B) = L(Ω) + (ρ/2)‖Ω − A + B‖₂² + λ‖A‖₁,

where ρ > 0 is the step size of the ADMM. By Boyd et al. (2011), the alternating iterations are

    Ω^{k+1} = arg min_Ω L_a(Ω, A^k, B^k),
    A^{k+1} = arg min_A L_a(Ω^{k+1}, A, B^k) = soft(Ω^{k+1} + B^k, λ/ρ),
    B^{k+1} = Ω^{k+1} − A^{k+1} + B^k,

where soft(a, λ) is the element-wise soft thresholding operator. Clearly the computational cost is dominated by the update of Ω^{k+1}, which amounts to solving the following problem:

    arg min_{Ω ∈ R^{p×p}} L(Ω) + (ρ/2)‖Ω − A^k + B^k‖₂².

By the convexity of the objective function, the solution satisfies

    L′(Ω) + ρ(Ω − A^k + B^k) = 0.

Consequently, for the estimators (6) and (8), we need to solve the following equations, respectively:

    SΩ + ρΩ = I + ρ(A^k − B^k),    (9)
    2⁻¹(SΩ + ΩS) + ρΩ = I + ρ(A^k − B^k).    (10)

By looking at the vectorized formulations (4) and (5), we immediately see that in order to solve (9) and (10) we need the inverses of (S ⊗ I + ρI) and (2⁻¹S ⊗ I + 2⁻¹I ⊗ S + ρI). The following proposition provides explicit expressions for these inverses.

Proposition 1 Write the decomposition of S as S = UΛUᵀ, where U ∈ R^{p×m}, m = min(n, p), UᵀU = I_m and Λ = diag{τ₁, …, τ_m} with τ₁, …, τ_m ≥ 0. For any ρ > 0, we have

    (S + ρI)⁻¹ = ρ⁻¹I − ρ⁻¹UΛ₁Uᵀ,

and

    (2⁻¹S ⊗ I + 2⁻¹I ⊗ S + ρI)⁻¹ = ρ⁻¹I − ρ⁻¹(UΛ₂Uᵀ) ⊗ I − ρ⁻¹I ⊗ (UΛ₂Uᵀ) + ρ⁻¹(U ⊗ U)diag{vec(Λ₃)}(Uᵀ ⊗ Uᵀ),

where Λ₁ = diag{τ₁/(τ₁ + ρ), …, τ_m/(τ_m + ρ)}, Λ₂ = diag{τ₁/(τ₁ + 2ρ), …, τ_m/(τ_m + 2ρ)}, and Λ₃ = (τᵢτⱼ(τᵢ + τⱼ + 4ρ)/{(τᵢ + 2ρ)(τⱼ + 2ρ)(τᵢ + τⱼ + 2ρ)})_{m×m}.
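Proposition 1 can also be verified numerically on a small example; the sketch below (numpy; the dimensions, ρ and the random data are arbitrary illustrative choices, not part of the paper) compares both closed-form inverses against brute-force inversion:

```python
import numpy as np

p, n, rho = 6, 4, 0.7
rng = np.random.default_rng(2)
X = rng.normal(size=(n, p))
S = X.T @ X / n                         # rank-deficient sample covariance (rank <= n < p)
I_p, I_p2 = np.eye(p), np.eye(p * p)

# reduced decomposition S = U diag(tau) U^T with U of size p x m
tau, V = np.linalg.eigh(S)
keep = tau > 1e-10
tau, U = tau[keep], V[:, keep]

Lam1 = np.diag(tau / (tau + rho))
Lam2 = np.diag(tau / (tau + 2 * rho))
Lam3 = (np.outer(tau, tau) * (tau[:, None] + tau[None, :] + 4 * rho)
        / ((tau[:, None] + 2 * rho) * (tau[None, :] + 2 * rho)
           * (tau[:, None] + tau[None, :] + 2 * rho)))

# first identity: (S + rho I)^{-1}
ok1 = np.allclose(np.linalg.inv(S + rho * I_p), I_p / rho - U @ Lam1 @ U.T / rho)

# second identity: (S kron I / 2 + I kron S / 2 + rho I)^{-1}
M = 0.5 * np.kron(S, I_p) + 0.5 * np.kron(I_p, S) + rho * I_p2
approx = (I_p2 / rho
          - np.kron(U @ Lam2 @ U.T, I_p) / rho
          - np.kron(I_p, U @ Lam2 @ U.T) / rho
          + np.kron(U, U) @ np.diag(Lam3.reshape(-1, order="F")) @ np.kron(U.T, U.T) / rho)
ok2 = np.allclose(np.linalg.inv(M), approx)
print(ok1, ok2)   # True True
```

The reduced decomposition keeps only the m = min(n, p) non-trivial eigenvalues, which is what makes the formulas cheap when n ≪ p.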

The main technique used to derive the above proposition is the well-known Woodbury matrix identity. The derivation involves lengthy but routine calculations and is hence omitted. Nevertheless, one can verify the proposition directly by checking, for example, that (S + ρI)(ρ⁻¹I − ρ⁻¹UΛ₁Uᵀ) = (ρ⁻¹I − ρ⁻¹UΛ₁Uᵀ)(S + ρI) = I. By Proposition 1 and the identities

    {(UΛ₂Uᵀ) ⊗ I}vec(C) = vec(CUΛ₂Uᵀ),
    {I ⊗ (UΛ₂Uᵀ)}vec(C) = vec(UΛ₂UᵀC),
    (U ⊗ U)diag{vec(Λ₃)}(Uᵀ ⊗ Uᵀ)vec(C) = vec[U{Λ₃ ∘ (UᵀCU)}Uᵀ],

where ∘ denotes the Hadamard product, we obtain the following theorem.

Theorem 1 For a given ρ > 0, (i) the solution to the equation SΩ + ρΩ = C is unique and is given by Ω = ρ⁻¹C − ρ⁻¹UΛ₁UᵀC; (ii) the solution to the equation 2⁻¹(SΩ + ΩS) + ρΩ = C is unique and is given by Ω = ρ⁻¹C − ρ⁻¹CUΛ₂Uᵀ − ρ⁻¹UΛ₂UᵀC + ρ⁻¹U{Λ₃ ∘ (UᵀCU)}Uᵀ.

Note that when S is the sample covariance matrix, U and Λ can be obtained from the singular value decomposition (SVD) of X = (X₁, …, X_n), whose complexity is of order O(mnp). On the other hand, the solutions obtained in Theorem 1 involve only elementary operations on p × m and m × m matrices, and thus the complexity of solving (9) and (10) is of order O(mnp + mp²) = O(np²).

Remark 1 More generally, we can specify a different weight for each element and consider the estimators

    Ω̂_k = arg min L_k(Ω) + λ‖W ∘ Ω‖₁, k = 1, 2,

where W = (w_{ij})_{p×p} with w_{ij} ≥ 0. For example, setting w_{ii} = 0 and w_{ij} = 1 for i ≠ j leaves the diagonal elements out of the penalization; using the local linear approximation (Zou and Li, 2008), we can set W = {p′_λ(|Ω̂_{ij}|)}_{p×p}, where Ω̂ = (Ω̂_{ij})_{p×p} is a LASSO solution and p_λ(·) is a general penalty function such as SCAD or MCP. The ADMM algorithm is the same as in the ℓ₁ penalized case except that the update of A^{k+1} becomes element-wise soft thresholding with entry-specific thresholding parameters.
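To illustrate how the closed-form update in Theorem 1(i) plugs into the ADMM iterations, here is a minimal sketch of the resulting algorithm for the ℓ₁ penalized loss (6) (Python/numpy; the function name, stopping rule and default values of lam and rho are our own illustrative choices, not those of the EQUAL package):

```python
import numpy as np

def soft(M, t):
    """Element-wise soft thresholding."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def equal_l1(X, lam, rho=1.0, max_iter=500, tol=1e-6):
    """ADMM for min_Omega 0.5*tr(Omega S Omega^T) - tr(Omega) + lam*||Omega||_1,
    with the Omega-update solved in closed form as in Theorem 1(i)."""
    n, p = X.shape
    S = X.T @ X / n                        # sample covariance (data assumed centered)
    tau, V = np.linalg.eigh(S)
    keep = tau > 1e-10                     # reduced decomposition S = U diag(tau) U^T
    tau, U = tau[keep], V[:, keep]
    d1 = tau / (tau + rho)                 # diagonal of Lambda_1

    I = np.eye(p)
    A, B = I.copy(), np.zeros((p, p))
    for _ in range(max_iter):
        C = I + rho * (A - B)
        Omega = C / rho - (U * d1) @ (U.T @ C) / rho   # solves S Omega + rho Omega = C
        A_new = soft(Omega + B, lam / rho)
        B = B + Omega - A_new
        change, A = np.max(np.abs(A_new - A)), A_new
        if change < tol:
            break
    return (A + A.T) / 2                   # symmetrize the sparse iterate

# usage sketch:
# X = np.random.default_rng(0).normal(size=(200, 400))
# Omega_hat = equal_l1(X, lam=0.2)
```

The per-iteration cost is dominated by the two p × m matrix products in the Ω-update, i.e. O(mp²); the weighted penalty of Remark 1 would only change the thresholding levels used in soft().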

Remark 2 Compared with the ADMM algorithm given in Zhang and Zou (2014), our update of Ω^{k+1} only involves operations on p × m and m × m matrices, while operations on p × p matrices are required in Zhang and Zou (2014); see for example Theorem 1 in Zhang and Zou (2014). Consequently, we obtain the order O(np²) for these updates, while Zhang and Zou (2014) requires O((n + p)p²). Our algorithm thus scales much better when n ≪ p.

Similarly, for the graphical lasso, we can also use ADMM (Boyd et al., 2011) to implement the minimization, where the loss function is L(Ω) = tr(SΩ) − log|Ω|. The update of Ω^{k+1} is obtained by solving

    ρΩ − Ω⁻¹ = ρ(A^k − B^k) − S.

Denoting the eigenvalue decomposition of ρ(A^k − B^k) − S by QᵀΛ₀Q with Λ₀ = diag{a₁, …, a_p}, we obtain the closed form solution

    Ω^{k+1} = Qᵀ diag{(a₁ + √(a₁² + 4ρ))/(2ρ), …, (a_p + √(a_p² + 4ρ))/(2ρ)} Q.

Compared with the algorithm based on quadratic loss functions, the computational complexity here is dominated by the eigenvalue decomposition of p × p matrices.
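For comparison, the eigen-decomposition based Ω-update of the graphical lasso just described can be sketched as follows (numpy; the function name and interface are our own illustrative choices):

```python
import numpy as np

def glasso_omega_update(S, A, B, rho):
    """Solve rho*Omega - Omega^{-1} = rho*(A - B) - S for the graphical lasso ADMM."""
    a, Q = np.linalg.eigh(rho * (A - B) - S)          # eigenvalues a_1, ..., a_p
    omega_eig = (a + np.sqrt(a ** 2 + 4 * rho)) / (2 * rho)
    return (Q * omega_eig) @ Q.T                      # Q diag(...) Q^T
```

Each call costs O(p³) because of the eigen-decomposition, which is exactly the step that the closed-form updates of Theorem 1 avoid.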

3 Simulations

In this section, we conduct several simulations to illustrate the efficiency and estimation accuracy of our proposed methods. We consider the following two precision matrices:

    Ω₁ = (0.5^{|i−j|})_{p×p},    Ω₂ = Ω₁⁻¹,

where Ω₁ is an asymptotically sparse matrix with many small elements and Ω₂ is a sparse (tridiagonal) matrix in which each column/row has at most three non-zero elements. For all of our simulations, we set the sample size n = 200 and generate the data X₁, …, X_n from N(0, Σ) with Σ = Ω₁⁻¹ or Ω₂⁻¹. For notational convenience, we shall use the term EQUAL to denote our proposed Efficient admm algorithm via the QUAdratic Loss (4), and similarly, use EQUALs to denote the estimator based on the symmetric quadratic loss (5). An R package EQUAL has been developed to implement our methods. For comparison, we consider the following competitors:

CLIME (Cai et al., 2011), implemented in the R package fastclime;
glasso (Friedman et al., 2008), implemented in the R package glasso;
BigQuic (Hsieh et al., 2013), implemented in the R package BigQuic;
glasso-ADMM, which solves the graphical lasso by ADMM (Boyd et al., 2011);
SCIO (Liu and Luo, 2015), implemented in the R package scio;
D-trace (Zhang and Zou, 2014), implemented using the ADMM algorithm provided in that paper.

Table 1 summarizes the computation time in seconds based on 10 replications, where all methods are run in R on a PC with a 3.3 GHz Intel Core i7 CPU and 16GB of memory. For all the methods, we solve a solution path corresponding to 50 lambda values ranging from λ_max down to λ_max√(log p/n), where λ_max is the maximum absolute element of the sample covariance matrix. Although the stopping criteria differ across methods, we can see from Table 1 the computational advantage of our methods. In particular, our proposed algorithms are much faster than the original quadratic loss based methods SCIO and D-trace for large p. In addition, we can roughly observe that the required time increases quadratically in p for our proposed algorithms.

The second simulation is designed to evaluate the estimation accuracy. Given the true precision matrix Ω and an estimator Ω̂, we report the following four loss functions:

    loss1 = p^{−1/2}‖Ω − Ω̂‖₂,
    loss2 = ‖Ω − Ω̂‖,
    loss3 = p⁻¹{tr(Ω⁻¹Ω̂) − log|Ω⁻¹Ω̂| − p},
    loss4 = p⁻¹{tr(Ω̂ᵀΩ⁻¹Ω̂)/2 − tr(Ω̂) + tr(Ω)/2},

where loss1 is the scaled Frobenius loss, loss2 is the spectral loss, loss3 is Stein's loss, which is related to the Gaussian likelihood, and loss4 is related to the quadratic loss.

Table 2 reports the simulation results, where the tuning parameter is chosen by five-fold cross-validation. We can see that the performance of the three estimators is comparable, indicating that the penalized quadratic loss estimators are also reliable for high dimensional precision matrix estimation. We also observe that the estimator based on (5) has slightly smaller estimation error than that based on (4). As shown in Table 1, the computation for the quadratic loss estimators is much faster than for glasso.
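For reference, the four accuracy measures used in Table 2 can be computed as follows (a sketch in numpy, not part of the paper; Omega denotes the true precision matrix and Omega_hat an estimate):

```python
import numpy as np

def estimation_losses(Omega, Omega_hat):
    """Return (loss1, loss2, loss3, loss4) for a true Omega and an estimate Omega_hat."""
    p = Omega.shape[0]
    diff = Omega - Omega_hat
    loss1 = np.linalg.norm(diff, "fro") / np.sqrt(p)          # scaled Frobenius loss
    loss2 = np.linalg.norm(diff, 2)                           # spectral loss
    Sigma = np.linalg.inv(Omega)                              # Omega^{-1}
    M = Sigma @ Omega_hat
    loss3 = (np.trace(M) - np.linalg.slogdet(M)[1] - p) / p   # Stein's loss
    loss4 = (np.trace(Omega_hat.T @ Sigma @ Omega_hat) / 2
             - np.trace(Omega_hat) + np.trace(Omega) / 2) / p # quadratic loss
    return loss1, loss2, loss3, loss4
```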

Table 1: The average computation time (standard deviation), in seconds, for solving a solution path for high dimensional precision matrix estimation.

              p=100          p=200          p=400          p=800          p=1600
Sparse case where Ω = Ω₂
CLIME         0.388(0.029)   2.863(0.056)   (0.341)        (0.696)        (7.953)
glasso        0.096(0.014)   0.562(0.080)   3.108(0.367)   (1.739)        (11.379)
BigQuic       2.150(0.068)   5.337(0.073)   (0.326)        (0.867)        (2.376)
glasso-ADMM   0.909(0.018)   1.788(0.045)   5.419(0.137)   (0.540)        (3.993)
SCIO          0.044(0.001)   0.272(0.021)   1.766(0.032)   (0.103)        (1.325)
EQUAL         0.064(0.002)   0.343(0.006)   1.180(0.031)   4.846(0.112)   (0.281)
D-trace       0.088(0.003)   0.504(0.010)   2.936(0.070)   (0.598)        (4.959)
EQUALs        0.113(0.003)   0.648(0.014)   1.723(0.039)   5.687(0.184)   (0.373)
Asymptotic sparse case where Ω = Ω₁
CLIME         0.382(0.026)   2.605(0.081)   (0.480)        (1.198)        (9.645)
glasso        0.052(0.007)   0.289(0.049)   1.467(0.324)   9.124(1.756)   (7.386)
BigQuic       1.816(0.043)   4.134(0.033)   (0.123)        (0.247)        (0.574)
glasso-ADMM   0.847(0.018)   1.636(0.032)   5.370(0.239)   (0.518)        (3.656)
SCIO          0.036(0.000)   0.253(0.028)   1.690(0.030)   (0.181)        (0.236)
EQUAL         0.031(0.001)   0.167(0.006)   0.614(0.029)   3.084(0.122)   (0.129)
D-trace       0.037(0.001)   0.213(0.009)   1.438(0.072)   (0.591)        (4.898)
EQUALs        0.048(0.001)   0.273(0.011)   0.856(0.043)   3.518(0.107)   (0.128)

Moreover, to check whether the estimates are singular, we also report the minimum eigenvalue of each estimate in the final column of Table 2. We observe that all the estimates are positive definite.

Finally, we apply our proposal to the Prostate dataset, which is publicly available at stanford.edu/~hastie/casi_files/data/prostate.html. The data record 6033 genetic activity measurements for the control group (50 subjects) and the prostate cancer group (52 subjects). Here, the data dimension p is 6033 and the sample size n is 50 or 52. We estimate the precision matrix for each group. Since EQUAL and EQUALs give similar results, we only report the EQUALs estimates. It took less than 20 minutes for EQUALs to obtain the solution paths, while glasso could not produce a solution due to running out of memory in R. The sparsity levels of the solution paths are plotted in the upper panel of Figure 1. To compare the precision matrices between the two groups, the network graphs of the EQUALs estimators with tuning parameter λ = 0.75 are provided in the lower panel of Figure 1.

References

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9:485-516, 2008.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

T. Cai, W. Liu, and X. Luo. A constrained ℓ₁ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594-607, 2011.

O. Dalal and B. Rajaratnam. Sparse Gaussian graphical model estimation via alternating minimization. Biometrika, 104(2):379-395, 2017.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

Table 2: The estimation error (standard deviation) for precision matrix estimation.

                 loss1          loss2          loss3          loss4          min-eigen
Sparse case where Ω = Ω₂
p = 500
EQUAL            0.508(0.009)   1.183(0.051)   0.273(0.004)   0.286(0.004)   0.555(0.018)
EQUALs           0.465(0.010)   1.122(0.050)   0.240(0.003)   0.254(0.004)   0.500(0.013)
glasso           0.554(0.008)   1.209(0.032)   0.232(0.002)   0.273(0.003)   0.298(0.011)
p = 1000
EQUAL            0.605(0.010)   1.323(0.039)   0.304(0.004)   0.326(0.005)   0.577(0.015)
EQUALs           0.550(0.008)   1.274(0.041)   0.267(0.003)   0.289(0.004)   0.527(0.014)
glasso           0.599(0.006)   1.280(0.026)   0.248(0.002)   0.292(0.002)   0.307(0.011)
p = 2000
EQUAL            0.555(0.007)   1.293(0.042)   0.291(0.003)   0.307(0.004)   0.560(0.016)
EQUALs           0.538(0.008)   1.250(0.037)   0.278(0.003)   0.293(0.004)   0.549(0.015)
glasso           0.640(0.005)   1.343(0.025)   0.263(0.001)   0.310(0.002)   0.318(0.010)
Asymptotic sparse case where Ω = Ω₁
p = 500
EQUAL            0.707(0.005)   2.028(0.016)   0.329(0.003)   0.344(0.003)   0.364(0.011)
EQUALs           0.664(0.010)   1.942(0.026)   0.297(0.005)   0.320(0.004)   0.333(0.011)
glasso           0.706(0.004)   2.019(0.010)   0.310(0.002)   0.335(0.001)   0.252(0.008)
p = 1000
EQUAL            0.701(0.008)   2.033(0.020)   0.331(0.004)   0.344(0.004)   0.361(0.012)
EQUALs           0.681(0.003)   1.983(0.013)   0.314(0.002)   0.331(0.002)   0.353(0.011)
glasso           0.728(0.003)   2.070(0.008)   0.321(0.002)   0.344(0.001)   0.261(0.008)
p = 2000
EQUAL            0.860(0.106)   2.351(0.190)   0.446(0.066)   0.426(0.049)   0.322(0.023)
EQUALs           0.666(0.020)   1.984(0.042)   0.317(0.008)   0.331(0.007)   0.348(0.011)
glasso           0.746(0.002)   2.109(0.007)   0.332(0.001)   0.353(0.001)   0.265(0.008)

Figure 1: Estimation for the Prostate data using EQUALs. Upper panel: sparsity level (average number of non-zero elements for each row/column) versus λ; (a) solution path for control subjects, (b) solution path for prostate cancer subjects. Lower panel: network graphs for the two patient groups when λ = 0.75; (c) estimated network for control subjects, (d) estimated network for prostate cancer subjects.

C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems, pages 3165-3173, 2013.

C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. QUIC: Quadratic approximation for sparse inverse covariance estimation. The Journal of Machine Learning Research, 15(1):2911-2947, 2014.

W. Liu and X. Luo. Fast and adaptive sparse precision matrix estimation in high dimensions. Journal of Multivariate Analysis, 135:153-162, 2015.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436-1462, 2006.

P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ₁-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935-980, 2011.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.

T. Zhang and H. Zou. Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika, 101(1):103-120, 2014.

H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509-1533, 2008.
