Estimation of Graphical Models with Shape Restriction


Estimation of Graphical Models with Shape Restriction

BY KHAI X. CHIONG
USC Dornsife INET, Department of Economics, University of Southern California, Los Angeles, California, U.S.A.

AND HYUNGSIK ROGER MOON
Department of Economics, USC Dornsife INET, University of Southern California, Los Angeles, California, U.S.A., and School of Economics, Yonsei University, Seoul, Korea.

SUMMARY

Gaussian graphical models have recently been used in economics to obtain networks of dependence among agents. The standard estimator is the Graphical Lasso (GLASSO), which amounts to maximum likelihood estimation regularized using the L_{1,1} matrix norm on the precision matrix Ω. The L_{1,1} norm is a lasso penalty that controls for sparsity, i.e., the number of zeros in Ω. We propose a new estimator called Structured Graphical Lasso (SGLASSO) that uses the L_{1,2} mixed norm instead of the L_{1,1} lasso norm. The L_{1,2} penalty controls for the structure of the sparsity in Ω. We show that SGLASSO is asymptotically equivalent to an infeasible problem which prioritizes the sparsity recovery of high-degree nodes. Monte Carlo simulations show that SGLASSO outperforms GLASSO both in estimating the overall precision matrix and in estimating the structure of the graphical model. In an empirical application to a classic firm-investment dataset, we obtain a network of firms' dependence that exhibits a core-periphery structure, with General Motors, General Electric and U.S. Steel forming the core group of firms.

Date: April 2017.
Some key words: Gaussian graphical models; Glasso; Inverse covariance matrices; Lasso; Precision matrices; Sparsity.

1. INTRODUCTION

The Gaussian graphical model is a graph summarizing the conditional independence relationships among a group of random variables. Two nodes i and j in the graph are not linked if and only if i and j are statistically independent conditional on all other nodes besides i and j.

As an example, consider the setting of our empirical application: the variables (X_1, ..., X_p) represent the annual investment of p firms, labeled 1, ..., p. The goal is to recover a graph (the graphical model) that describes the pairwise conditional independence relationships among the random variables (X_1, ..., X_p). If two firms i and j are linked in this graphical model, then the investment decisions of firms i and j are dependent on each other, conditional on all other firms X_k, k ≠ i, j. Hence the investment decisions of firms i and j have direct effects on each other, without the mediation of other firms. The graphical model is a popular tool in biostatistics that has seen recent applications in economics; for instance, Giudici and Spelta (2016) use graphical modeling to obtain the network of international financial flows. In our own empirical application here, we use graphical models

to obtain the network of dependence among firms' investment decisions, using a classic dataset of firms' investment. Interestingly, we recover a core-periphery network structure among the firms, with General Motors, General Electric and U.S. Steel forming the core group of firms. In a core-periphery network, there is a core group of nodes that all other nodes (including other nodes in the core) are linked to. In addition, the periphery nodes do not have links to each other; they only have links to the core.

A widely-used technique to estimate graphical models is the Graphical Lasso (GLASSO), which aims to estimate the precision matrix Ω while imposing a sparsity constraint on Ω. To see the intuition, suppose that (X_1, ..., X_p) is distributed as N(0, Σ_0). If the (i, j)-th entry of the precision matrix Σ_0^{-1} ≡ Ω_0 is zero, then it is known that X_i and X_j are independent conditional on all other X_k, k ≠ i, j. The sparsity constraint is needed, as otherwise the sample precision matrix produces a dense matrix without any zeros, and hence does not identify the non-links in the graphical model.

In this note, we propose an estimation method for graphical models which works by estimating the precision matrix Ω taking into account the structure of sparsity in Ω. The standard approach estimates the precision matrix while controlling for the sparsity of Ω, i.e., the number of zeros in Ω. However, it does not consider that these zeros could be distributed in a certain pattern or structure in Ω. For instance, in many economic applications it is reasonable to think that in the graphical model some nodes have very few links, while some hub-like nodes have many links to other nodes. Such settings are common in economic applications: for instance, we commonly observe the so-called small-world properties in social and economic networks (Jackson (2008)).

We adapt the well-known GLASSO method (e.g., Banerjee et al. (2008); Friedman et al. (2008); Yuan and Lin (2007); Fan et al. (2009)) to estimate a precision matrix taking into account the hidden structure of its sparsity. The main feature of our estimation method is the L_{1,2}-norm penalty, which corresponds to the second moment of the weighted degrees of the nodes, where the weighted degree of node i is d_i = \sum_{k=1}^{p} |\Omega_{ik}|, for i = 1, ..., p. Notice that the GLASSO uses the L_{1,1}-norm penalty, which is proportional to the first moment of the weighted degree distribution.

We derive the asymptotic distribution of the new estimator, called Structured Graphical Lasso (SGLASSO), when the sample size T goes to infinity and the dimension of Ω, p, is fixed. The main theoretical finding is that SGLASSO is asymptotically equivalent to an infeasible problem which penalizes an entry of Ω proportionally to the influence of that entry, where the influence of the (i, j)-th entry is defined as the sum of the actual weighted degrees of nodes i and j in the graphical model. Specifically, SGLASSO is asymptotically equivalent to a modified GLASSO where the element-by-element penalty factor for the (i, j)-th entry of Ω is proportional to (d_i + d_j), where d_i = \sum_{k=1}^{p} |\Omega_{0,ik}| is the true weighted degree of node i. Note that d_i is unknown to us, and hence this modified problem is infeasible to implement as an estimator. This has two implications. First, SGLASSO is similar to GLASSO in the sense that both achieve sparsity: the non-differentiability of the penalty function at zero results in entries that are estimated to be exactly zero. Second, SGLASSO differs from GLASSO in that it prioritizes recovering the zeros between two high-degree nodes.
For instance, going back to the example of the network of dependence among firms, consider two firms i, j with high weighted degrees. If Ω_{0,ij} = 0, then SGLASSO tends to estimate Ω̂_{ij} = 0 correctly with higher probability than GLASSO (Figure 6). Nodes with higher weighted degrees can be viewed as being more influential and important, and we might be more interested in uncovering the relationships between more influential agents.

In practice, we can also combine both SGLASSO and GLASSO by using both the L_{1,1} and the L_{1,2} norms as penalty terms, essentially controlling for both the first and the second moments of the weighted degree distribution.

We show by means of simulations that SGLASSO performs better than GLASSO. That is, SGLASSO achieves lower overall Kullback-Leibler and Frobenius losses in terms of estimating the overall precision matrix. In addition, in terms of recovering the structure of the graphical model, we also show that SGLASSO outperforms GLASSO.

The paper is organized as follows. Section 2 introduces the model, the L_{1,2}-norm penalty, and a motivating example. Section 3 presents asymptotic results. Section 4 contains Monte Carlo simulation results. Section 5 contains an empirical application. The appendix contains technical proofs and derivations.

1.1. Related literature

This paper is related to the literature on estimating high-dimensional inverse covariance matrices. An overview of this topic can be found in Cai et al. (2016) and Fan et al. (2016). The standard sample covariance estimator performs poorly when the dimension of the true covariance matrix is high relative to the sample size. One approach is to impose a sparsity constraint on the precision matrix, and to estimate the sparse precision matrix by maximizing a penalized likelihood. Our estimator follows this approach. A different class of procedures frames the estimation of the precision matrix as nodewise or pairwise regressions, where the Lasso or the Dantzig selector can then be used to achieve variable selection and high-dimensional regularization. Relevant papers in this class are Cai et al. (2011, 2016), Meinshausen and Bühlmann (2006), Peng et al. (2009) and Ren et al. (2015). A recent approach to estimating high-dimensional Gaussian graphical models is the Innovated Scalable Efficient Estimation proposed by Fan and Lv (2015). Instead of estimating the inverse covariance through penalized likelihoods, they convert the original problem into that of estimating a large covariance matrix.

Another type of mixed norm that is widely used in the high-dimensional linear regression setting is the L_{2,1}-norm, also known as the Group Lasso (Friedman et al. (2010); Yuan and Lin (2006)), which is useful when parameters are naturally partitioned into disjoint groups and we would like to achieve sparsity with respect to whole groups. In contrast, our proposed L_{1,2}-norm is distinct from the L_{2,1}-norm, and has not been studied in the context of covariance estimation.

2. SETUP

Let X = (X_1, ..., X_p) ∈ R^p be a p-dimensional vector distributed according to a multivariate distribution with mean zero and covariance matrix Σ_0. Our goal is to obtain the graphical model corresponding to X, which is defined as follows. The (Gaussian) graphical model of X is an undirected graph G = (V, E). The set of nodes (or vertices) of this graph is V = {1, ..., p}. Each node i corresponds to a random variable X_i. The set of links (or edges) E is such that {i, j} ∉ E if and only if X_i is independent of X_j conditional on all X_k, k ∈ V \ {i, j}. That is, the graphical model is defined such that there is no link between nodes i and j in the graph if and only if X_i and X_j are independent conditional on all other nodes besides i and j. As such, the graphical model G summarizes the pairwise conditional independence relationships among (X_1, ..., X_p). The existence of a link in G amounts to conditional dependence between the two nodes given all other ones.
Estimation of graphical models is based on the following known fact: if X is Gaussian, then the non-zeros in its precision matrix (or inverse covariance matrix) Ω_0 := Σ_0^{-1} correspond exactly

to links in the graphical model (see Hastie et al. (2015); Whittaker (1990)). That is, there is a link between nodes i and j if and only if the (i, j)-th entry of Ω_0 is non-zero. Similarly, the lack of a link {i, j} between a pair of nodes i and j is equivalent to the (i, j)-th entry of Ω_0 being zero.

Now suppose we observe i.i.d. samples X_t, t = 1, ..., T, from X = (X_1, ..., X_p). We now introduce the GLASSO estimator, as well as our proposed SGLASSO estimator, which allow us to recover the graphical model from the samples. The output of these estimators is a sparse precision matrix Ω̂. From Ω̂, the estimated graphical model is then constructed by including a link between nodes i and j if and only if Ω̂_{ij} ≠ 0.

For an m × n matrix A, define the mixed L_{p,q} norm as follows:

\|A\|_{p,q} = \left( \sum_{j=1}^{n} \left( \sum_{i=1}^{m} |a_{ij}|^{p} \right)^{q/p} \right)^{1/q}.

The following L_{1,2} norm in equation (1) will play a crucial role in our estimator:

\|A\|_{1,2} = \left( \sum_{j=1}^{n} \left( \sum_{i=1}^{m} |a_{ij}| \right)^{2} \right)^{1/2}.   (1)

Our Structured-GLASSO (SGLASSO) estimator is defined in equation (2) below, where Σ̂ is the sample covariance matrix calculated from the sample (X_t)_{t=1}^{T}:

\hat{\Omega}_{\mathrm{sglasso}} := \arg\max_{\Omega} \left\{ \log\det\Omega - \mathrm{tr}(\Omega\hat{\Sigma}) - \lambda \|\Omega\|_{1,2} \right\}.   (2)

In comparison, the GLASSO estimator (Banerjee et al. (2008); Friedman et al. (2008); Yuan and Lin (2007); Fan et al. (2009)) is defined in equation (3). While SGLASSO uses the L_{1,2} mixed norm as a penalty to the likelihood, GLASSO in equation (3) uses the L_{1,1} norm, which is the familiar lasso penalty (the sum of the absolute values of the entries of Ω):

\hat{\Omega}_{\mathrm{glasso}} := \arg\max_{\Omega} \left\{ \log\det\Omega - \mathrm{tr}(\Omega\hat{\Sigma}) - \lambda \|\Omega\|_{1,1} \right\}.   (3)

Another definition of the GLASSO is one where only Ω^{-} is penalized, where Ω^{-} denotes the matrix Ω with its diagonal entries set to zero. While Yuan and Lin (2007) and Rothman et al. (2008) use equation (3), other authors such as Banerjee et al. (2008) and Friedman et al. (2008) use the latter. For various expositional reasons, we will penalize Ω instead of Ω^{-}. However, in the Monte Carlo simulation we will additionally show results for the variants of both GLASSO and SGLASSO where the diagonals are not penalized.

Although we are comparing SGLASSO with GLASSO in this paper, we could also combine them. That is,

\hat{\Omega} := \arg\max_{\Omega} \left\{ \log\det\Omega - \mathrm{tr}(\Omega\hat{\Sigma}) - \lambda_{1} \|\Omega\|_{1,1} - \lambda_{2} \|\Omega\|_{1,2} \right\},

and use cross-validation procedures to tune the parameters λ_1 and λ_2.

We do not require that the data-generating process be Gaussian. Therefore our estimator can be seen as maximizing a Gaussian quasi-likelihood subject to regularization. When the DGP is non-Gaussian, the graphical model estimated using GLASSO or SGLASSO is still useful: it corresponds to the structure of pairwise conditional correlations (partialling out the other variables). When there is no penalty (λ = 0), the Ω̂ defined in both (2) and (3) corresponds to the quasi maximum likelihood estimator of the inverse covariance matrix, which is given by the inverse of the empirical sample covariance matrix.
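For concreteness, the mixed norms in equations (1)-(3) can be written out directly. The following is a minimal numpy sketch (our own illustration, not code from the paper), using the convention of the display above with the inner sum taken over rows within each column.

```python
import numpy as np

def mixed_norm(A, p, q):
    """||A||_{p,q} = ( sum_j ( sum_i |a_ij|^p )^{q/p} )^{1/q}, as in equation (1)."""
    inner = np.sum(np.abs(A) ** p, axis=0)      # one inner sum per column j
    return float(np.sum(inner ** (q / p)) ** (1.0 / q))

def glasso_penalty(A):
    return mixed_norm(A, 1, 1)                  # L_{1,1}: sum of absolute entries

def sglasso_penalty(A):
    return mixed_norm(A, 1, 2)                  # L_{1,2}: Euclidean norm of the column absolute sums
```

For a symmetric Ω the column absolute sums are exactly the weighted degrees d_j, so the L_{1,2} penalty is the square root of the sum of squared weighted degrees, i.e., the second-moment object referred to in the introduction.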

It is well-known that the unpenalized sample estimator behaves poorly, and it is unsuitable for graphical modeling when the zeros of the estimates are important.

2.1. Motivating example

We illustrate the difference between the L_{1,2} norm and the conventional L_{1,1} lasso norm. Define Ω_1 and Ω_2 as the precision matrices given in equations (4) and (5): Ω_1 is the precision matrix whose non-zero pattern is the star graph of Figure 1, and Ω_2 is the precision matrix whose non-zero pattern is the AR(1) chain of Figure 2. The precision matrices Ω_1 and Ω_2 give rise respectively to the graphical models in Figures 1 and 2, where links in the graphs represent non-zero entries in the precision matrices.

Fig. 1: Ω_1, star graph.   Fig. 2: Ω_2, AR(1) model.

Both the standard lasso and Frobenius norms do not distinguish between Ω_1 and Ω_2, while our proposed L_{1,2}-norm does. The reason is clear: Ω_1 and Ω_2 have the same number of zeros. They have the same sparsity, but the structure of this sparsity is very different. More concretely, for the L_{1,1} lasso norm we have L_{1,1}(Ω_1) = L_{1,1}(Ω_2), since the two matrices have the same number of non-zero off-diagonal entries of equal magnitude and identical diagonals. Similarly, for the L_{2,2} Frobenius norm we have L_{2,2}(Ω_1) = L_{2,2}(Ω_2). On the other hand, the L_{1,2}-norm can distinguish between these two structures: the L_{1,2} norm evaluated at Ω_1 (the star graph) differs from the L_{1,2} norm evaluated at Ω_2 (the AR(1) graph), because the star graph concentrates the weighted degrees on the hub node.

2.2. Remarks

Lam and Fan (2009) study a very general class of estimators that nests the GLASSO. In particular, they consider

\arg\max_{\Omega} \left\{ \log\det\Omega - \mathrm{tr}(\Omega\hat{\Sigma}) - \sum_{i \neq j} p_{\lambda}(|\Omega_{ij}|) \right\},

where p_λ(·) is a penalty function that depends on a regularization parameter λ. This general framework nests many estimators, including the conventional GLASSO (by setting p_λ(|ω|) = λ|ω|). The SGLASSO estimator does not belong to this class because ‖Ω‖_{1,2} is not additively separable in the entries of Ω. Hence, SGLASSO cannot be analyzed using the framework proposed in Lam and Fan (2009).

It is possible to incorporate other penalty functions into our estimator. Consider

\hat{\Omega}_{\mathrm{sc\text{-}glasso}} := \arg\max_{\Omega} \left\{ \log\det\Omega - \mathrm{tr}(\Omega\hat{\Sigma}) - \|p_{\lambda}(\Omega)\|_{1,2} \right\},

where p_λ(Ω) denotes the matrix (of the same dimension as Ω) such that p_λ(Ω)_{ij} = p_λ(|Ω_{ij}|). For instance, we can then let p_λ(·) be the Smoothly Clipped Absolute Deviation (SCAD) penalty function (Fan and Li (2001)).

For instance, when p/T → c > 0 and the samples are drawn i.i.d. from a multivariate Gaussian distribution, we know from Johnstone (2001) that neither the eigenvalues nor the eigenvectors of the sample covariance matrix are consistent estimators of those of the true covariance matrix.
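Returning to the motivating example of §2.1, the comparison can be reproduced numerically. The sketch below is our own illustration: the dimension p = 5, the unit diagonal and the common off-diagonal value 0.2 are assumed values chosen for illustration, since the exact entries of Ω_1 and Ω_2 are given in displays (4)-(5).

```python
import numpy as np

def precision_from_links(links, p, offdiag=0.2):
    """Unit diagonal; a common (assumed) off-diagonal value on linked pairs."""
    Omega = np.eye(p)
    for i, j in links:
        Omega[i, j] = Omega[j, i] = offdiag
    return Omega

p = 5
star = precision_from_links([(0, j) for j in range(1, p)], p)        # node 0 is the hub
ar1  = precision_from_links([(i, i + 1) for i in range(p - 1)], p)   # chain, AR(1) pattern

l11 = lambda A: np.sum(np.abs(A))                                    # lasso norm
l22 = lambda A: np.sqrt(np.sum(A ** 2))                              # Frobenius norm
l12 = lambda A: np.sqrt(np.sum(np.sum(np.abs(A), axis=0) ** 2))      # L_{1,2} norm

print(l11(star) == l11(ar1))                 # True: same lasso penalty
print(np.isclose(l22(star), l22(ar1)))       # True: same Frobenius norm
print(l12(star), l12(ar1))                   # differ: the star concentrates weighted degree
```

Both graphs have four links of equal magnitude, so the L_{1,1} and Frobenius norms coincide, while the L_{1,2} norm is larger for the star graph, whose hub carries a larger weighted degree.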

3. THEORETICAL PROPERTIES

In this section we derive the limiting distribution of SGLASSO when the number of samples T goes to infinity while p is fixed (following Yuan and Lin (2007)). The main finding of this section is that SGLASSO is asymptotically equivalent to an infeasible problem which prioritizes the sparsity recovery of high-degree nodes. More precisely, SGLASSO is asymptotically equivalent to a modified GLASSO estimator where each row of Ω is penalized proportionally to its true degree. The true degree of row (or node) i is defined by d_i = \sum_{j=1}^{p} |\Omega_{0,ij}|.

This asymptotic equivalence result has two implications. First, SGLASSO inherits the sparsity-discovery property of GLASSO, where zeros in the precision matrix are estimated to be exactly zero. Therefore SGLASSO estimates are also sparse. Secondly, SGLASSO differs from GLASSO in that it seeks out higher-degree nodes and prioritizes recovering the zero pattern associated with these nodes.

To derive the limiting distribution of our SGLASSO estimator, we impose a high-level assumption on the weak limit of the sample covariance matrix.

Assumption 1. We assume that \mathrm{vech}\!\left(\sqrt{T}(\hat{\Sigma} - \Sigma_0)\right) \Rightarrow \mathrm{vech}(W) \sim N(0, \Lambda), where vech is the half-vectorization operator that takes only the lower-triangular part of a symmetric matrix.

This high-level condition holds when the products X_{j,t} X_{k,t} have sufficiently high moments and the serial dependence is weak. When X_t is i.i.d. N(0, Σ_0), then \mathrm{vech}\!\left(\sqrt{T}(\hat{\Sigma} - \Sigma_0)\right) \Rightarrow \mathrm{vech}(W) \sim N(0, \Lambda), where Λ is such that \mathrm{Cov}(w_{ij}, w_{i'j'}) = \mathrm{Cov}(X_{i,t}X_{j,t}, X_{i',t}X_{j',t}).

Now let D_p be a p × p matrix whose (i, j) entry is d_j = \sum_{k=1}^{p} |\Omega_{0,jk}|, the true weighted degree of node j. Define

\hat{\Omega}_{D_p} = \arg\max_{\Omega} \left\{ \log\det\Omega - \mathrm{tr}(\Omega\hat{\Sigma}) - \lambda_T \|D_p \circ \Omega\|_{1,1} \right\}   (6)

to be a variant of the GLASSO estimator. Here, the operator ∘ denotes element-wise multiplication. The SGLASSO estimator is defined as

\hat{\Omega}_{\lambda_T} = \arg\max_{\Omega} \left\{ \log\det\Omega - \mathrm{tr}(\Omega\hat{\Sigma}) - \lambda_T \|\Omega\|_{1,2} \right\}.

THEOREM 1. Suppose that, as T → ∞, \sqrt{T}\,\lambda_T → λ_0 ≥ 0. Then

\sqrt{T}(\hat{\Omega}_{\lambda_T} - \Omega_0) = \sqrt{T}(\hat{\Omega}_{D_p} - \Omega_0) + o_p(1).

In addition, suppose that Assumption 1 holds. Then \sqrt{T}(\hat{\Omega}_{\lambda_T} - \Omega_0) and \sqrt{T}(\hat{\Omega}_{D_p} - \Omega_0) converge in distribution to \arg\min_{U \in R^{p\times p}} V(U), where

V(U) = \tfrac{1}{2}\,\mathrm{tr}(U\Sigma_0 U\Sigma_0) + \mathrm{tr}(UW) + \lambda_0 \sum_{i=1}^{p} \sum_{j=1}^{p} g(u_{ij})\, d_j   (7)

as T → ∞. Here Σ_0 = Ω_0^{-1} is the true covariance matrix of X, and g(u_{ij}) := u_{ij}\,\mathrm{sign}(\Omega_{0,ij})\,\mathbf{1}(\Omega_{0,ij} \neq 0) + |u_{ij}|\,\mathbf{1}(\Omega_{0,ij} = 0).

Compared to the limit in Theorem 1, the GLASSO estimator has the following limiting distribution (Yuan and Lin (2007)): \sqrt{T}(\hat{\Omega}_{\mathrm{glasso}} - \Omega_0) converges in distribution to \arg\min_{U \in R^{p\times p}} V_{\mathrm{glasso}}(U), where

V_{\mathrm{glasso}}(U) = \tfrac{1}{2}\,\mathrm{tr}(U\Sigma_0 U\Sigma_0) + \mathrm{tr}(UW) + \lambda_0 \sum_{i=1}^{p} \sum_{j=1}^{p} g(u_{ij})   (8)

and \sqrt{T}\,\lambda_T → λ_0. Comparing equations (7) and (8), we see that the SGLASSO obtains the same limiting distribution as that of a modified GLASSO with element-wise lasso penalty given by ‖D_p ∘ Ω‖_{1,1}. Since D_p is constructed using the true, unknown values of Ω_0, SGLASSO cannot simply be implemented as a GLASSO. Moreover, from Proposition 1 below, we can further say that SGLASSO is asymptotically equivalent to a variant of GLASSO where the penalty term is \tfrac{1}{2}\sum_{i=1}^{p}\sum_{j=1}^{p}(d_i + d_j)|\Omega_{ij}|. That is, each entry Ω_{ij} is given a penalty factor equal to the sum of the (true) weighted degrees of nodes i and j.

PROPOSITION 1. Let D_p be a p × p matrix whose (i, j) entry is d_j = \sum_{k=1}^{p} |\Omega_{0,jk}|. Then, for any symmetric Ω,

\|D_p \circ \Omega\|_{1,1} = \tfrac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} (d_i + d_j)\,|\Omega_{ij}|.

3.1. Illustrating the asymptotic distribution

We consider a true precision matrix Ω_0 whose non-zero pattern is an AR(1) model. The corresponding graph representing Ω_0 is depicted in Figure 3.

Fig. 3: Graph of Ω_0, a chain (AR(1)) graphical model.

In Figure 9 in the appendix, we illustrate the asymptotic distribution for both GLASSO and SGLASSO, setting λ_0 in Theorem 1 to a fixed positive value. Each circle in the plot represents a draw from the distribution over Ω̂ given by Theorem 1. The precision matrix Ω_0 contains zero entries, one of which involves a node of relatively higher degree; the estimated probability of correctly recovering that zero is higher for SGLASSO than for GLASSO. Therefore, SGLASSO does a better job at estimating a non-link that involves a node of relatively higher degree. In Figure 10 (appendix), we show the asymptotic distribution of Ω̂ without any penalty or regularization, which corresponds to the maximum likelihood estimator of the inverse covariance matrix. Comparing the two figures, we see that both the L_{1,1} and L_{1,2} norms are able to estimate entries of Ω_0 to be exactly zero, while the MLE without regularization fails to recover any sparsity structure.

Yuan and Lin (2007) consider the GLASSO estimator where the diagonals are not penalized, so that V_{\mathrm{glasso}}(U) = \tfrac{1}{2}\,\mathrm{tr}(U\Sigma_0 U\Sigma_0) + \mathrm{tr}(UW) + \lambda_0 \sum_{i=1}^{p} \sum_{j \neq i} g(u_{ij}) in their paper.
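Proposition 1 is a purely algebraic identity and is easy to check numerically. The following is our own sanity check, not code from the paper; it draws an arbitrary symmetric matrix and compares the two sides.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
Omega = rng.normal(size=(p, p))
Omega = (Omega + Omega.T) / 2                      # symmetry is all that Proposition 1 requires

d = np.sum(np.abs(Omega), axis=1)                  # weighted degrees d_i = sum_k |Omega_ik|
D_p = np.tile(d, (p, 1))                           # (i, j) entry equals d_j

lhs = np.sum(np.abs(D_p * Omega))                  # ||D_p o Omega||_{1,1}
rhs = 0.5 * np.sum((d[:, None] + d[None, :]) * np.abs(Omega))
print(np.isclose(lhs, rhs))                        # True
```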

4. NUMERICAL RESULTS

We consider ten different graphical models, as depicted in Figures 4 and 5. The models in Figure 4 have a smaller number of nodes than the models in Figure 5. From a given graphical model, we generate the true precision matrix Ω_0 such that Ω_{0,ij} ≠ 0 if and only if there is a link between nodes i and j; otherwise we set Ω_{0,ij} = 0. The non-zero off-diagonal entries are set to a common value, as are the diagonal entries Ω_{0,ii}. For each model, we draw independent datasets from N(0, Ω_0^{-1}). That is, each dataset comprises (X_t)_{t=1}^{T}, where X_t ∈ R^p is randomly drawn from N(0, Ω_0^{-1}); a larger sample size T is used for the models with the larger p. For each dataset, we use our SGLASSO estimator, as well as the GLASSO, to estimate the underlying inverse covariance matrix.

Fig. 4: The five graphical models with the smaller number of nodes (Models 1-5, panels (a)-(e)).

Fig. 5: The five graphical models with the larger number of nodes (Models 6-10, panels (a)-(e)).

Both estimators involve choosing the tuning parameter λ. First, we present the results when cross-validation is used to tune λ. More precisely, at each replication we estimate Ω_0 using both SGLASSO and GLASSO, where the optimal λ is determined using two-fold cross-validation. During each cross-validation, we use the Kullback-Leibler (KL) loss in equation (9), averaged over the two folds, to evaluate predictive accuracy. Equation (9) gives the KL loss between the Ω̂ estimated from the training set and the Ω̂ estimated from the validation set.

\mathrm{KL}(\lambda) = \log\det(\tilde{\Omega}) - \log\det(\hat{\Omega}) + \mathrm{tr}(\hat{\Omega}\tilde{\Omega}^{-1}) - p,   (9)

where Ω̂ is the estimate from the training fold and Ω̃ is the estimate from the validation fold.

Table 1: Averages and standard errors across the replications for Models 1-10. Columns 1(a) and 1(b) report the optimal λ as determined from cross-validation for (a) SGLASSO and (b) GLASSO; columns 2(a) and 2(b) report the average Kullback-Leibler losses; columns 3(a) and 3(b) report the average Frobenius losses; the last two columns report the F1 scores, a metric for how accurately the graphical model is estimated.

We report the simulation results in Table 1. In the table, the (a) columns correspond to our SGLASSO estimator, whereas the (b) columns refer to GLASSO. In columns 1(a) and 1(b), we report the optimal λ as determined by cross-validation, averaged across the replications. In columns 2(a) and 2(b), we report the Kullback-Leibler loss (averaged across replications) between the SGLASSO and GLASSO estimates Ω̂ and the true precision matrix Ω_0, given by

\mathrm{KL}(\hat{\Omega}, \Omega_0) = \log\det(\Omega_0) - \log\det(\hat{\Omega}) + \mathrm{tr}(\hat{\Omega}\Omega_0^{-1}) - p.

In columns 3(a) and 3(b), we report the average Frobenius loss between Ω̂ and Ω_0. In the last two columns of Table 1, we report the accuracy of graph recovery using SGLASSO and GLASSO. The Kullback-Leibler loss and the Frobenius norm may not fully capture how accurately the zeros are recovered. We therefore introduce an additional metric, the F1 score:

F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},

where precision is the ratio of true positives (TP) to all predicted positives (TP + FP), and recall is the ratio of true positives to all actual positives (TP + FN). The notation is further explained in the confusion matrix in Table 2. Alternatively, the F1 score can be written as F_1 = 2\mathrm{TP}/(2\mathrm{TP} + \mathrm{FP} + \mathrm{FN}).
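The two losses reported in Table 1 translate directly into code. The following is our own numpy transcription of the KL loss above and of the Frobenius loss (the cross-validation loss in equation (9) has the same form, with the validation-fold estimate in place of Ω_0); it is an illustration, not the authors' code.

```python
import numpy as np

def kl_loss(Omega_hat, Omega_ref):
    """log det(Omega_ref) - log det(Omega_hat) + tr(Omega_hat @ Omega_ref^{-1}) - p."""
    p = Omega_ref.shape[0]
    _, logdet_ref = np.linalg.slogdet(Omega_ref)
    _, logdet_hat = np.linalg.slogdet(Omega_hat)
    return logdet_ref - logdet_hat + np.trace(Omega_hat @ np.linalg.inv(Omega_ref)) - p

def frobenius_loss(Omega_hat, Omega_ref):
    return np.linalg.norm(Omega_hat - Omega_ref, ord="fro")
```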

Table 2: Confusion matrix.

                        Predicted value is 1     Predicted value is 0
Actual value is 1       True positive (TP)       False negative (FN)
Actual value is 0       False positive (FP)      True negative (TN)

Hence, the F1 score measures the quality of a binary classifier by balancing the precision and the recall of the classifier equally. The larger the F1 score, the better the classifier. The F1 score is commonly used in machine learning to evaluate binary classifiers; for instance, the Yelp competition uses the F1 score as a metric to rank competing models. The F1 score is favored over the metric Accuracy = (TP + TN)/(TP + TN + FP + FN), especially in our current setting where the graphical models are sparse. This is because a model that naively predicts all negatives will obtain a high Accuracy score simply because there are many actual negatives, so that the TN term dominates the Accuracy score.

The conclusion from Table 1 is that SGLASSO achieves significantly lower Kullback-Leibler losses across all models. In terms of Frobenius losses, SGLASSO also slightly outperforms GLASSO. Moreover, in terms of graph accuracy, SGLASSO also outperforms GLASSO, as indicated by the F1 scores. In Table 4 in the appendix, we compare SGLASSO with the version of GLASSO where the diagonals are not penalized; the conclusion that SGLASSO outperforms GLASSO still holds.

4.1. Minimum Kullback-Leibler losses

Now, for each estimator, we choose the λ that minimizes the Kullback-Leibler loss (equation (10)) between the estimated and the true inverse covariance matrix. Therefore, we are comparing the lowest Kullback-Leibler losses that can be achieved by our estimator versus the GLASSO:

\mathrm{KL}(\lambda) = \log\det(\Omega_0) - \log\det(\hat{\Omega}) + \mathrm{tr}(\hat{\Omega}\Omega_0^{-1}) - p.   (10)

We conclude from Table 3 that the superior performance of SGLASSO over GLASSO persists after taking away the randomness due to cross-validation. In particular, the lowest possible KL loss achievable by our estimator is smaller, across all models, than the corresponding KL loss for the GLASSO.

4.2. Recovering high-degree sparsities

Theorem 1 says that the SGLASSO prioritizes recovering the sparsity between nodes that have higher degrees. Here we show clear numerical evidence that SGLASSO performs better than GLASSO in terms of recovering the zeros corresponding to two high-degree nodes. In particular, for each of the models described above, we examine the pair of nodes (i, j) such that d_i + d_j is largest and Ω_{0,ij} = 0, where d_i = \sum_{k=1}^{p} |\Omega_{0,ik}| is the true weighted degree of node i. In some models, more than one pair of nodes attains this maximum; for simplicity, if there are multiple pairs of nodes that maximize d_i + d_j, we pick the pair that comes first when the adjacency matrix is vectorized. In each of the models, we calculate the fraction of replications in which GLASSO and SGLASSO correctly recover Ω_{0,ij} = 0 for the pair with the largest d_i + d_j. We plot this result in Figure 6, which shows that across all models, the probability of correctly recovering Ω_{0,ij} = 0 for high-degree nodes is greater when SGLASSO is used, compared to GLASSO.
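As a concrete reading of Table 2, the F1 score of a recovered graph can be computed from the off-diagonal zero patterns of the estimated and true precision matrices. The helper below is our own illustration (the function name and the tolerance are assumptions), following the definitions above.

```python
import numpy as np

def f1_score_graph(Omega_hat, Omega_0, tol=1e-8):
    """F1 = 2TP / (2TP + FP + FN) over the off-diagonal entries."""
    p = Omega_0.shape[0]
    off = ~np.eye(p, dtype=bool)                 # ignore the diagonal
    pred = np.abs(Omega_hat[off]) > tol          # predicted links
    true = np.abs(Omega_0[off]) > tol            # actual links
    tp = np.sum(pred & true)
    fp = np.sum(pred & ~true)
    fn = np.sum(~pred & true)
    return 2 * tp / (2 * tp + fp + fn)
```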

Table 3: Minimum possible KL and Frobenius losses across all λ, for Models 1-10. The numbers reported are the average KL and Frobenius losses evaluated at the λ that results in the lowest possible values; standard deviations across replications are reported in parentheses. In the last column ("Percentage Dominance"), we report the fraction of replications in which SGLASSO performed better than GLASSO in terms of KL loss at the corresponding λ.

In practical applications, we are often interested in the relationships between influential agents (as measured by their weighted degrees in the true graphical model). This paper suggests that we can better uncover the dependence relationships between influential agents using SGLASSO.

4.3. Computational details

Computing the SGLASSO estimator in equation (2) is a convex optimization problem. More precisely, the problem consists of minimizing a smooth convex term given by −log det Ω + tr(ΩΣ̂), and a non-differentiable convex term corresponding to λ‖Ω‖_{1,2}. Since the objective can be formulated as minimizing the sum of a differentiable convex function and a non-differentiable convex function, we can use the proximal gradient method (Beck and Teboulle (2009)) to compute SGLASSO. In this paper, we use CVX, a Matlab package for specifying and solving convex programs (Grant et al. (2008)). CVX supports convex objective functions that are non-smooth (Grant and Boyd (2008)). CVX also allows us to easily enforce symmetry and positive-definiteness of the matrix minimand, which is needed here. The code is readily available from the authors upon request.
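As noted above, the authors solve the problem with CVX in Matlab. For readers working in Python, the following is a minimal sketch of the same convex program in cvxpy, assuming a conic solver that handles log_det (for example, SCS); it is an illustrative stand-in for, not a copy of, the authors' implementation.

```python
import cvxpy as cp

def sglasso(Sigma_hat, lam):
    """Gaussian quasi-likelihood with the L_{1,2} penalty of equation (2)."""
    p = Sigma_hat.shape[0]
    Omega = cp.Variable((p, p), symmetric=True)
    degrees = cp.sum(cp.abs(Omega), axis=0)          # column-wise absolute sums
    penalty = cp.norm(degrees, 2)                    # ||Omega||_{1,2}
    objective = cp.Maximize(cp.log_det(Omega)
                            - cp.trace(Omega @ Sigma_hat)
                            - lam * penalty)
    cp.Problem(objective).solve()                    # log_det keeps Omega positive definite
    return Omega.value
```

Replacing the penalty with cp.sum(cp.abs(Omega)) gives the GLASSO penalty of equation (3); the tuning parameter lam can then be chosen by the two-fold cross-validation scheme described above.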

Fig. 6: SGLASSO more accurately recovers the zeros belonging to high-degree nodes. Panels (a)-(j) correspond to Models 1-10. Each panel shows the estimated probability of recovering the non-link (i, j) such that (d_i + d_j) is the largest in each model, as a function of the tuning parameter λ. More precisely, for a given λ, we calculate the fraction of replications in which SGLASSO and GLASSO successfully estimate the entry Ω̂_{ij} as zero, where (i, j) is such that Ω_{0,ij} = 0 and d_i + d_j has the highest value among all non-links.

5. EMPIRICAL EXAMPLE

For a real-world empirical application, we use the Grunfeld investment data, a classic dataset that has been used in undergraduate and graduate econometrics. It is widely used as a teaching example in Greene (2012), and the data can be freely downloaded from the companion website of Greene (2012). The data consist of time series of T = 20 yearly observations on p = 10 firms. The three variables are: I_{it}, the real gross investment of firm i in year t; F_{it}, the real market value of the firm; and C_{it}, the real value of its capital stock, such as plant and equipment. For each firm i, we first estimate the following linear equation (separately):

I_{it} = \beta_{0i} + \beta_{1i} F_{it} + \beta_{2i} C_{it} + \epsilon_{it}.

Fig. 7: Network of firms' dependence estimated using SGLASSO. The result using GLASSO is identical. The estimated graphical model exhibits a clear core-periphery structure. The core firms are General Motors, U.S. Steel and General Electric. With this classification, all periphery firms do not have links to each other, but are linked only to the core firms. All core firms are linked to each other.

Fig. 8: Network of firms' dependence, as estimated using the sample inverse covariance matrix of the residuals. This network is a complete graph: every node is linked to all other nodes. There are no zeros in the estimated inverse covariance matrix. Similarly, the sample covariance matrix has no zeros either.

Greene then uses the residual estimates ε̂_{it} to calculate the sample inverse covariance matrix of the firms' investment residuals. The graphical model produced by this sample inverse covariance matrix is in Figure 8. We see that the graphical model representing the firms' error dependence structure is a complete graph, where every node is linked to all other nodes.

We now use our proposed SGLASSO and the conventional GLASSO to estimate the precision matrix of the residuals ε̂_{it}, i = 1, ..., p, t = 1, ..., T. The tuning parameters are determined using a two-fold cross-validation procedure. The result is shown in Figure 7. It turns out that both SGLASSO and GLASSO yield identical graphical models. Moreover, the graphical model exhibits a clear core-periphery structure: there is a group of core firms that are linked to each other as well as to periphery firms. The periphery firms are not linked to each other, but are linked to the core firms. The core firms here are General Motors, U.S. Steel and General Electric. (With this classification, periphery firms do not have links to each other, but only to the core firms.)
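A sketch of this pipeline in Python, reusing the sglasso helper sketched earlier; the array layout (years in rows, firms in columns) and the variable names are our own assumptions for illustration, not the authors' code.

```python
import numpy as np

def firm_residuals(I, F, C):
    """OLS residuals of I_it = b0_i + b1_i*F_it + b2_i*C_it, estimated firm by firm.
    I, F, C are T x p arrays (years in rows, firms in columns)."""
    T, p = I.shape
    resid = np.empty((T, p))
    for i in range(p):
        X = np.column_stack([np.ones(T), F[:, i], C[:, i]])
        beta, *_ = np.linalg.lstsq(X, I[:, i], rcond=None)
        resid[:, i] = I[:, i] - X @ beta
    return resid

# eps = firm_residuals(I, F, C)
# Sigma_hat = np.cov(eps, rowvar=False, bias=True)   # sample covariance of the residuals
# Omega_hat = sglasso(Sigma_hat, lam)                # lam tuned by two-fold cross-validation
```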

Without using SGLASSO or GLASSO, we would not be able to recover this core-periphery structure: neither the sample covariance matrix nor the sample inverse covariance matrix contains any zeros.

6. CONCLUDING REMARKS

We propose a new estimator for precision matrices. This estimator is called Structured Graphical Lasso (SGLASSO), because it uses the L_{1,2}-norm to control for the structure of sparsity in the precision matrix. In contrast, the standard Graphical Lasso (GLASSO) uses the L_{1,1} lasso norm to control for the amount of sparsity, i.e., the number of zeros in the precision matrix. We show that SGLASSO is asymptotically equivalent to an infeasible problem which prioritizes the sparsity recovery of high-degree nodes. In other words, SGLASSO penalizes an entry of the precision matrix proportionally to the influence of that entry. SGLASSO is potentially valuable in real-world settings where we are interested in the relationships between influential agents. In a variety of numerical simulations, SGLASSO is shown to perform better than GLASSO.

APPENDIX

A.1. Proof of Theorem 1

We will use the following important result concerning the asymptotics of the minimizer of a convex random function (Geyer (1994); Hjort and Pollard (2011); Kato (2009); Pollard (1991)).

LEMMA A1. Let f_n(x, ω) : R^d × Ω → R be a random convex function, that is, f_n(x, ·) is a random variable for each x ∈ R^d. Moreover, let x*_n(ω) be the unique minimizer of f_n(x, ω) with respect to x. Suppose that f_n(x, ·) converges in distribution to some random convex function f_∞(x, ·) for each x. Then, if x*_∞(ω) is the unique minimizer of f_∞(x, ω), the random variable x*_n converges in distribution to the random variable x*_∞.

To prove Theorem 1, define the following convex random functions by the re-parameterization U = \sqrt{T}(\Omega - \Omega_0):

V_T(U) = T\left[ -\log\det\left(\Omega_0 + \tfrac{U}{\sqrt{T}}\right) + \mathrm{tr}\left\{\left(\Omega_0 + \tfrac{U}{\sqrt{T}}\right)\hat{\Sigma}\right\} + \lambda_T \left\|\Omega_0 + \tfrac{U}{\sqrt{T}}\right\|_{1,2} + \log\det\Omega_0 - \mathrm{tr}(\Omega_0\hat{\Sigma}) - \lambda_T \|\Omega_0\|_{1,2} \right],   (A1)

and

V_T^{l}(U) = T\left[ -\log\det\left(\Omega_0 + \tfrac{U}{\sqrt{T}}\right) + \mathrm{tr}\left\{\left(\Omega_0 + \tfrac{U}{\sqrt{T}}\right)\hat{\Sigma}\right\} + \lambda_T \left\|D_p \circ \left(\Omega_0 + \tfrac{U}{\sqrt{T}}\right)\right\|_{1,1} + \log\det\Omega_0 - \mathrm{tr}(\Omega_0\hat{\Sigma}) - \lambda_T \|D_p \circ \Omega_0\|_{1,1} \right].   (A2)

These functions are random because of their dependence on the sample covariance matrix Σ̂. By definition, our SGLASSO estimator Ω̂_{λ_T} satisfies

\sqrt{T}(\hat{\Omega}_{\lambda_T} - \Omega_0) = \hat{U}_T := \arg\min_{U} V_T(U).

By Lemma A1, the limit distribution of \sqrt{T}(\hat{\Omega}_{\lambda_T} - \Omega_0) follows if we show

V_T(U) \Rightarrow V(U)   (A3)

as T → ∞ and \sqrt{T}\lambda_T → λ_0, where the random limit function V(U) is defined in equation (7).

Following the argument in the proof of Theorem 1 of Yuan and Lin (2007), we have

\log\det\left(\Omega_0 + \tfrac{U}{\sqrt{T}}\right) - \log\det\Omega_0 = \tfrac{1}{\sqrt{T}}\,\mathrm{tr}(U\Sigma_0) - \tfrac{1}{2T}\,\mathrm{tr}(U\Sigma_0 U\Sigma_0) + o\!\left(\tfrac{1}{T}\right),   (A4)

\mathrm{tr}\left\{\left(\Omega_0 + \tfrac{U}{\sqrt{T}}\right)\hat{\Sigma}\right\} - \mathrm{tr}(\Omega_0\hat{\Sigma}) = \tfrac{1}{\sqrt{T}}\,\mathrm{tr}(U\hat{\Sigma}) = \tfrac{1}{\sqrt{T}}\,\mathrm{tr}(U\Sigma_0) + \tfrac{1}{T}\,\mathrm{tr}\!\left(U\sqrt{T}(\hat{\Sigma} - \Sigma_0)\right).   (A5)

Moreover, we know that when T is large enough,

\left|\Omega_{0,ij} + \tfrac{u_{ij}}{\sqrt{T}}\right| - |\Omega_{0,ij}| = \tfrac{g_{ij}}{\sqrt{T}}, \qquad g_{ij} = u_{ij}\,\mathrm{sign}(\Omega_{0,ij})\,\mathbf{1}(\Omega_{0,ij} \neq 0) + |u_{ij}|\,\mathbf{1}(\Omega_{0,ij} = 0).

Therefore the difference of the squared L_{1,2} penalty terms can be written as

\left\|\Omega_0 + \tfrac{U}{\sqrt{T}}\right\|_{1,2}^{2} - \|\Omega_0\|_{1,2}^{2}
 = \sum_{j=1}^{p}\left[\left(\sum_{i=1}^{p}\left|\Omega_{0,ij} + \tfrac{u_{ij}}{\sqrt{T}}\right|\right)^{2} - \left(\sum_{i=1}^{p}|\Omega_{0,ij}|\right)^{2}\right]
 = \tfrac{1}{T}\sum_{j=1}^{p}\sum_{i,i'} g_{ij}\,g_{i'j} + \tfrac{2}{\sqrt{T}}\sum_{j=1}^{p}\sum_{i,i'} g_{ij}\,|\Omega_{0,i'j}|.   (A6)

The second term of equation (A6) can be rewritten, using the symmetry of Ω_0, as

\sum_{j=1}^{p}\sum_{i,i'} g_{ij}\,|\Omega_{0,i'j}| = \sum_{j=1}^{p}\sum_{i=1}^{p} g_{ij}\left(\sum_{i'=1}^{p}|\Omega_{0,ji'}|\right) = \sum_{j=1}^{p}\sum_{i=1}^{p} g_{ij}\,d_j,

where d_j is the true weighted degree of node j, d_j = \sum_{k=1}^{p}|\Omega_{0,jk}|. Similarly, the difference of the penalty terms in V_T^{l}(U) is

\left\|D_p \circ \left(\Omega_0 + \tfrac{U}{\sqrt{T}}\right)\right\|_{1,1} - \|D_p \circ \Omega_0\|_{1,1} = \tfrac{1}{\sqrt{T}}\sum_{i=1}^{p}\sum_{j=1}^{p} g_{ij}\,d_j.

Combining (A4), (A5) and (A6), collecting terms, and using \sqrt{T}\lambda_T \to \lambda_0, in the limit as T → ∞ we then have

V_T(U) = \tfrac{1}{2}\,\mathrm{tr}(U\Sigma_0 U\Sigma_0) + \mathrm{tr}\!\left(U\sqrt{T}(\hat{\Sigma} - \Sigma_0)\right) + \lambda_0 \sum_{i=1}^{p}\sum_{j=1}^{p} g_{ij}\,d_j + o_p(1).

Similarly, in the limit as T → ∞, we have

V_T^{l}(U) = \tfrac{1}{2}\,\mathrm{tr}(U\Sigma_0 U\Sigma_0) + \mathrm{tr}\!\left(U\sqrt{T}(\hat{\Sigma} - \Sigma_0)\right) + \lambda_0 \sum_{i=1}^{p}\sum_{j=1}^{p} g_{ij}\,d_j + o_p(1).

Therefore, we have the first result of the theorem. Under Assumption 1, \sqrt{T}(\hat{\Sigma} - \Sigma_0) \Rightarrow W, and the second result of the theorem follows by the continuous mapping theorem and Lemma A1.

A.2. Proof of Proposition 1

Proof. Since Ω is symmetric, we have

\|D_p \circ \Omega\|_{1,1} = \sum_{i=1}^{p}\sum_{j=1}^{p} d_j\,|\Omega_{ij}|
 = \sum_{i=1}^{p}\sum_{j<i} d_j\,|\Omega_{ij}| + \sum_{j=1}^{p}\sum_{i>j} d_j\,|\Omega_{ij}| + \sum_{i=1}^{p} d_i\,|\Omega_{ii}|
 = \sum_{i=1}^{p}\sum_{j<i} d_j\,|\Omega_{ij}| + \sum_{i=1}^{p}\sum_{j<i} d_i\,|\Omega_{ij}| + \sum_{i=1}^{p} d_i\,|\Omega_{ii}|
 = \sum_{i=1}^{p}\sum_{j<i} (d_i + d_j)\,|\Omega_{ij}| + \tfrac{1}{2}\sum_{i=1}^{p} (d_i + d_i)\,|\Omega_{ii}|   (A7)
 = \tfrac{1}{2}\sum_{i=1}^{p}\sum_{j=1}^{p} (d_i + d_j)\,|\Omega_{ij}|.   (A8)

A.3. Additional tables and figures

Fig. 9: Asymptotic distributions of the two estimators, with a common λ_0 > 0. The left-hand side is GLASSO, the right-hand side is SGLASSO. For each estimator, the estimated probabilities P̂r(Ω̂_{ij} = 0) of recovering the zero entries of Ω_0 are reported; the probability of recovering the zero entry that involves the higher-degree node is larger for SGLASSO. The standard deviations of both estimators are comparable.

Fig. 10: Asymptotic distribution of the maximum likelihood estimator (MLE). This corresponds to the case where λ_0 = 0.

Table 4: Identical to Table 1, except that we use the version of GLASSO where the diagonals are not penalized. Averages and standard errors across the replications: the first pair of columns reports the optimal λ as determined from cross-validation for (a) SGLASSO and (b) GLASSO, the second pair the average Kullback-Leibler losses, and the last pair the average Frobenius losses.

REFERENCES

Banerjee, O., El Ghaoui, L., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research.
Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences.

Cai, T., Liu, W., and Luo, X. (2011). A constrained L1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association.
Cai, T. T., Liu, W., Zhou, H. H., et al. (2016). Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. The Annals of Statistics.
Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive lasso and SCAD penalties. The Annals of Applied Statistics.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association.
Fan, J., Liao, Y., and Liu, H. (2015). An overview on the estimation of large covariance and precision matrices. arXiv preprint.
Fan, J., Liao, Y., and Liu, H. (2016). An overview of the estimation of large covariance and precision matrices. The Econometrics Journal.
Fan, Y. and Lv, J. (2015). Innovated scalable efficient estimation in ultra-large Gaussian graphical models. arXiv preprint.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). A note on the group lasso and a sparse group lasso. arXiv preprint.
Geyer, C. J. (1994). On the asymptotics of constrained M-estimation. The Annals of Statistics.
Giudici, P. and Spelta, A. (2016). Graphical network models for international financial flows. Journal of Business & Economic Statistics.
Grant, M., Boyd, S., and Ye, Y. (2008). CVX: Matlab software for disciplined convex programming.
Grant, M. C. and Boyd, S. P. (2008). Graph implementations for nonsmooth convex programs. In Recent Advances in Learning and Control. Springer.
Greene, W. H. (2012). Econometric Analysis. Prentice Hall, 7th edition.
Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity. CRC Press.
Hjort, N. L. and Pollard, D. (2011). Asymptotics for minimisers of convex processes. arXiv preprint.
Jackson, M. O. (2008). Social and Economic Networks. Princeton University Press.
Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics.
Kato, K. (2009). Asymptotics for argmin processes: Convexity arguments. Journal of Multivariate Analysis.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics.
Peng, J., Wang, P., Zhou, N., and Zhu, J. (2009). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association.
Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory.
Ren, Z., Sun, T., Zhang, C.-H., Zhou, H. H., et al. (2015). Asymptotic normality and optimalities in estimation of large Gaussian graphical models. The Annals of Statistics.
Rothman, A. J., Bickel, P. J., Levina, E., Zhu, J., et al. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics.
Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika.


More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

Convex relaxation for Combinatorial Penalties

Convex relaxation for Combinatorial Penalties Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,

More information

Sparse Permutation Invariant Covariance Estimation

Sparse Permutation Invariant Covariance Estimation Sparse Permutation Invariant Covariance Estimation Adam J. Rothman University of Michigan Ann Arbor, MI 48109-1107 e-mail: ajrothma@umich.edu Peter J. Bickel University of California Berkeley, CA 94720-3860

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

MATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models

MATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models 1/13 MATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models Dominique Guillot Departments of Mathematical Sciences University of Delaware May 4, 2016 Recall

More information

Sparse Covariance Matrix Estimation with Eigenvalue Constraints

Sparse Covariance Matrix Estimation with Eigenvalue Constraints Sparse Covariance Matrix Estimation with Eigenvalue Constraints Han Liu and Lie Wang 2 and Tuo Zhao 3 Department of Operations Research and Financial Engineering, Princeton University 2 Department of Mathematics,

More information

Log Covariance Matrix Estimation

Log Covariance Matrix Estimation Log Covariance Matrix Estimation Xinwei Deng Department of Statistics University of Wisconsin-Madison Joint work with Kam-Wah Tsui (Univ. of Wisconsin-Madsion) 1 Outline Background and Motivation The Proposed

More information

Sparse and Robust Optimization and Applications

Sparse and Robust Optimization and Applications Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability

More information

Sparse Graph Learning via Markov Random Fields

Sparse Graph Learning via Markov Random Fields Sparse Graph Learning via Markov Random Fields Xin Sui, Shao Tang Sep 23, 2016 Xin Sui, Shao Tang Sparse Graph Learning via Markov Random Fields Sep 23, 2016 1 / 36 Outline 1 Introduction to graph learning

More information

Learning Local Dependence In Ordered Data

Learning Local Dependence In Ordered Data Journal of Machine Learning Research 8 07-60 Submitted 4/6; Revised 0/6; Published 4/7 Learning Local Dependence In Ordered Data Guo Yu Department of Statistical Science Cornell University, 73 Comstock

More information

Exact Hybrid Covariance Thresholding for Joint Graphical Lasso

Exact Hybrid Covariance Thresholding for Joint Graphical Lasso Exact Hybrid Covariance Thresholding for Joint Graphical Lasso Qingming Tang Chao Yang Jian Peng Jinbo Xu Toyota Technological Institute at Chicago Massachusetts Institute of Technology Abstract. This

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu Department of Statistics, University of Illinois at Urbana-Champaign WHOA-PSI, Aug, 2017 St. Louis, Missouri 1 / 30 Background Variable

More information

Journal of Multivariate Analysis. Consistency of sparse PCA in High Dimension, Low Sample Size contexts

Journal of Multivariate Analysis. Consistency of sparse PCA in High Dimension, Low Sample Size contexts Journal of Multivariate Analysis 5 (03) 37 333 Contents lists available at SciVerse ScienceDirect Journal of Multivariate Analysis journal homepage: www.elsevier.com/locate/jmva Consistency of sparse PCA

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Theory and Applications of High Dimensional Covariance Matrix Estimation

Theory and Applications of High Dimensional Covariance Matrix Estimation 1 / 44 Theory and Applications of High Dimensional Covariance Matrix Estimation Yuan Liao Princeton University Joint work with Jianqing Fan and Martina Mincheva December 14, 2011 2 / 44 Outline 1 Applications

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

Lecture 25: November 27

Lecture 25: November 27 10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

Learning Markov Network Structure using Brownian Distance Covariance

Learning Markov Network Structure using Brownian Distance Covariance arxiv:.v [stat.ml] Jun 0 Learning Markov Network Structure using Brownian Distance Covariance Ehsan Khoshgnauz May, 0 Abstract In this paper, we present a simple non-parametric method for learning the

More information

NUCLEAR NORM PENALIZED ESTIMATION OF INTERACTIVE FIXED EFFECT MODELS. Incomplete and Work in Progress. 1. Introduction

NUCLEAR NORM PENALIZED ESTIMATION OF INTERACTIVE FIXED EFFECT MODELS. Incomplete and Work in Progress. 1. Introduction NUCLEAR NORM PENALIZED ESTIMATION OF IERACTIVE FIXED EFFECT MODELS HYUNGSIK ROGER MOON AND MARTIN WEIDNER Incomplete and Work in Progress. Introduction Interactive fixed effects panel regression models

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Robust Inverse Covariance Estimation under Noisy Measurements

Robust Inverse Covariance Estimation under Noisy Measurements Jun-Kun Wang Intel-NTU, National Taiwan University, Taiwan Shou-de Lin Intel-NTU, National Taiwan University, Taiwan WANGJIM123@GMAIL.COM SDLIN@CSIE.NTU.EDU.TW Abstract This paper proposes a robust method

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

Scalable robust hypothesis tests using graphical models

Scalable robust hypothesis tests using graphical models Scalable robust hypothesis tests using graphical models Umamahesh Srinivas ipal Group Meeting October 22, 2010 Binary hypothesis testing problem Random vector x = (x 1,...,x n ) R n generated from either

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,

More information

Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models

Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models Han Liu Kathryn Roeder Larry Wasserman Carnegie Mellon University Pittsburgh, PA 15213 Abstract A challenging

More information

Statistica Sinica Preprint No: SS R2

Statistica Sinica Preprint No: SS R2 Statistica Sinica Preprint No: SS-2017-0076.R2 Title Graph Estimation for Matrix-variate Gaussian Data Manuscript ID SS-2017-0076.R2 URL http://www.stat.sinica.edu.tw/statistica/ DOI 10.5705/ss.202017.0076

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Estimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation

Estimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation Estimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation T. Tony Cai 1, Zhao Ren 2 and Harrison H. Zhou 3 University of Pennsylvania, University of

More information

Posterior convergence rates for estimating large precision. matrices using graphical models

Posterior convergence rates for estimating large precision. matrices using graphical models Biometrika (2013), xx, x, pp. 1 27 C 2007 Biometrika Trust Printed in Great Britain Posterior convergence rates for estimating large precision matrices using graphical models BY SAYANTAN BANERJEE Department

More information

Big & Quic: Sparse Inverse Covariance Estimation for a Million Variables

Big & Quic: Sparse Inverse Covariance Estimation for a Million Variables for a Million Variables Cho-Jui Hsieh The University of Texas at Austin NIPS Lake Tahoe, Nevada Dec 8, 2013 Joint work with M. Sustik, I. Dhillon, P. Ravikumar and R. Poldrack FMRI Brain Analysis Goal:

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Adaptive First-Order Methods for General Sparse Inverse Covariance Selection

Adaptive First-Order Methods for General Sparse Inverse Covariance Selection Adaptive First-Order Methods for General Sparse Inverse Covariance Selection Zhaosong Lu December 2, 2008 Abstract In this paper, we consider estimating sparse inverse covariance of a Gaussian graphical

More information

Causal Inference: Discussion

Causal Inference: Discussion Causal Inference: Discussion Mladen Kolar The University of Chicago Booth School of Business Sept 23, 2016 Types of machine learning problems Based on the information available: Supervised learning Reinforcement

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

CSC 576: Variants of Sparse Learning

CSC 576: Variants of Sparse Learning CSC 576: Variants of Sparse Learning Ji Liu Department of Computer Science, University of Rochester October 27, 205 Introduction Our previous note basically suggests using l norm to enforce sparsity in

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

CS295: Convex Optimization. Xiaohui Xie Department of Computer Science University of California, Irvine

CS295: Convex Optimization. Xiaohui Xie Department of Computer Science University of California, Irvine CS295: Convex Optimization Xiaohui Xie Department of Computer Science University of California, Irvine Course information Prerequisites: multivariate calculus and linear algebra Textbook: Convex Optimization

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information