Equivalent Partial Correlation Selection for High Dimensional Gaussian Graphical Models


Equivalent Partial Correlation Selection for High Dimensional Gaussian Graphical Models
Faming Liang, University of Florida
August 23, 2015

Abstract. Gaussian graphical models (GGMs) are frequently used to explore networks, such as gene regulatory networks, among a set of variables. Under the classical theory of GGMs, graph construction amounts to finding the pairs of variables with nonzero partial correlation coefficients. However, this is infeasible for high dimensional problems in which the number of variables exceeds the sample size. We propose a surrogate of the partial correlation coefficient that is evaluated with a reduced conditioning set and is thus feasible for high dimensional problems. Under the faithfulness condition, we show that the surrogate partial correlation coefficient is equivalent to the true one for graph construction. The proposed method is computationally efficient and outperforms existing methods, such as graphical Lasso and nodewise regression, in graph construction, especially for problems in which a large number of indirect associations are present. It is also very flexible for data integration, covariate adjustment, network comparison, etc.

Gaussian Graphical Model Consider a set of Gaussian random variables. Two measures of their dependency are the correlation coefficient and the partial correlation coefficient. The correlation coefficient is a much weaker measure, since marginally, i.e. directly or indirectly, all variables in a system are more or less correlated. The goal of GGM learning is to distinguish direct from indirect dependencies among all variables in a system. The partial correlation coefficient provides a measure of direct dependence, as it vanishes in the indirect case.

Applications Gene regulatory networks: the molecular mechanisms underlying cancer, with data available from The Cancer Genome Atlas (TCGA). Causal networks: time course data.

Covariance Selection Let X = (X^{(1)}, \ldots, X^{(p)}) denote a p-dimensional random vector drawn from a multivariate Gaussian distribution N_p(\mu, \Sigma). Denote the concentration matrix by C = \Sigma^{-1} = (C_{ij}). Then

\rho_{ij|V\setminus\{i,j\}} = -\frac{C_{ij}}{\sqrt{C_{ii} C_{jj}}}, \quad i, j = 1, \ldots, p.   (1)

The covariance selection method (Dempster, 1972; Lauritzen, 1996) identifies the nonzero elements of the concentration matrix. However, it is not feasible for high dimensional problems with p > n.
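The relation in (1) is simple to apply numerically when p < n. The following minimal numpy sketch computes the full partial correlation matrix from a sample covariance; the function name and toy data are illustrative, not part of the original slides.

```python
import numpy as np

def partial_correlations(sigma):
    """Partial correlation matrix from a covariance matrix via equation (1).
    Assumes sigma is well-conditioned, i.e. the p < n setting."""
    C = np.linalg.inv(sigma)            # concentration matrix C = Sigma^{-1}
    d = np.sqrt(np.diag(C))
    rho = -C / np.outer(d, d)           # rho_{ij|V\{i,j}} = -C_ij / sqrt(C_ii C_jj)
    np.fill_diagonal(rho, 1.0)
    return rho

# toy usage with a simulated sample covariance (n > p so the inverse exists)
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
print(partial_correlations(np.cov(X, rowvar=False)).round(2))
```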

Limited order partial correlations The idea is to use low-order partial correlations as a surrogate for the full-order partial correlation. It is widely acknowledged that limited-order partial correlation methods yield something in between the full GGM (with correlations conditioned on all p - 2 remaining variables) and the correlation graph. Related work: Spirtes et al. (2000), Magwene and Kim (2004), Wille and Bühlmann (2006), Castelo and Roverato (2006, 2009).

Nodewise Regression The idea is to use Lasso (Tibshirani, 1996) as a variable selection method to identify the neighborhood of each variable, and thus the nonzero elements of the concentration matrix. Consider the linear regression

X^{(j)} = \beta_i^{(j)} X^{(i)} + \sum_{r \in V\setminus\{j,i\}} \beta_r^{(j)} X^{(r)} + \epsilon^{(j)},   (2)

where \epsilon^{(j)} is a zero-mean Gaussian random error. Let S^{(j)} = \{r : \hat\beta_r^{(j)} \neq 0\} denote the set of explanatory variables identified by Lasso for X^{(j)}. The GGM can then be constructed using the "or" rule: estimate an edge between vertices i and j if and only if i \in S^{(j)} or j \in S^{(i)}; or the "and" rule: estimate an edge between vertices i and j if and only if i \in S^{(j)} and j \in S^{(i)}.
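As a concrete illustration, here is a hedged sketch of neighborhood selection in Python with scikit-learn: one Lasso per node, with the "and"/"or" rule applied to the selected neighborhoods. Choosing the regularization by cross-validation is an assumption made for the sketch, not the tuning used in the references.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def nodewise_graph(X, rule="and"):
    """Sketch of neighborhood selection: one Lasso per node, then combine the
    neighborhoods with the 'and'/'or' rule described above."""
    n, p = X.shape
    S = np.zeros((p, p), dtype=bool)          # S[j, i] is True iff i is in S^(j)
    for j in range(p):
        others = [i for i in range(p) if i != j]
        fit = LassoCV(cv=5).fit(X[:, others], X[:, j])
        S[j, others] = np.abs(fit.coef_) > 1e-8
    return (S & S.T) if rule == "and" else (S | S.T)

# toy usage on simulated data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
E_hat = nodewise_graph(X, rule="or")
```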

Graphical Lasso Yuan and Lin (2007) proposed to estimate the concentration matrix directly by minimizing

-\log\det(C) + \mathrm{trace}(\hat\Sigma_{MLE} C) + \lambda \|C\|_1,   (3)

where \hat\Sigma_{MLE} denotes the maximum likelihood estimator of \Sigma, \|C\|_1 denotes the \ell_1-norm of C, and \lambda is the regularization parameter. The algorithm was later accelerated by Friedman et al. (2008) and Banerjee et al. (2008).
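A minimal sketch of solving an objective of the form (3) using scikit-learn's GraphicalLasso is shown below; the data matrix and the value of alpha (the role of \lambda) are illustrative assumptions, and in practice alpha would be tuned, e.g. by cross-validation or a stability criterion.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))        # stand-in data matrix
model = GraphicalLasso(alpha=0.1).fit(X)  # alpha plays the role of lambda in (3)
C_hat = model.precision_                  # sparse estimate of the concentration matrix
E_hat = np.abs(C_hat) > 1e-8              # estimated edges: off-diagonal nonzeros
np.fill_diagonal(E_hat, False)
```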

Graph Theory An undirected graph is a pair G = (V, E), where V is the set of vertices and E = (e_{ij}) is the adjacency matrix. If two vertices i, j \in V form an edge, we say that i and j are adjacent and set e_{ij} = 1. The boundary set of a vertex v \in V, denoted by b_G(v), is the set of vertices adjacent to v, i.e., b_G(v) = \{j : e_{vj} = 1\}. A path of length l > 0 from v_0 to v_l is a sequence v_0, v_1, \ldots, v_l of distinct vertices such that e_{v_{k-1}, v_k} = 1 for all k = 1, \ldots, l. A subset U \subseteq V is said to separate I \subseteq V from J \subseteq V if for every i \in I and j \in J, all paths from i to j have at least one vertex in U. For a pair of vertices i \neq j with e_{ij} = 0, a set U \subseteq V is called an \{i, j\}-separator if it separates \{i\} and \{j\} in G. Let G_{ij} be the reduced graph of G with e_{ij} set to zero. Then both boundary sets b_{G_{ij}}(i) and b_{G_{ij}}(j) are \{i, j\}-separators in G_{ij}.

Graph Theory (continuation) Definition 1. We say that P_V satisfies the Markov property with respect to G if, for every triple of disjoint sets I, J, U \subset V, X_I \perp X_J \mid X_U holds whenever U separates I and J in G. Definition 2. We say that P_V satisfies the adjacency faithfulness condition with respect to G if, whenever two variables X^{(i)} and X^{(j)} are adjacent in G, they are dependent conditioned on any subset of X_{V\setminus\{i,j\}}.

Graph Theory (continuation) The adjacency faithfulness condition implies that if there exists a subset U \subseteq V \setminus \{i, j\} such that X^{(i)} \perp X^{(j)} \mid X_U, then X^{(i)} and X^{(j)} are not adjacent in G. Further, by the Markov property, we have

X^{(i)} \perp X^{(j)} \mid X_U \implies X^{(i)} \perp X^{(j)} \mid X_{V\setminus\{i,j\}}, for any U \subseteq V \setminus \{i, j\}.   (4)

In particular, taking U = \emptyset,

X^{(i)} and X^{(j)} are marginally independent \implies X^{(i)} \perp X^{(j)} \mid X_{V\setminus\{i,j\}},   (5)

or, equivalently,

\rho_{ij|V\setminus\{i,j\}} \neq 0 \implies \mathrm{Corr}(X^{(i)}, X^{(j)}) \neq 0.   (6)

Hence, to infer the conditional independence structure of a GGM, one may first perform a correlation screening, which can often reduce the dimensionality of the problem by a substantial amount.

Equivalent Measure Let \tilde G = (V, \tilde E) denote the correlation graph of X^{(1)}, \ldots, X^{(p)}, where \tilde E = (\tilde e_{ij}) is the edge set and \tilde e_{ij} = 1 if r_{ij} \neq 0 and 0 otherwise. Let \hat E_{\gamma_i, i \setminus j} = \{v : |\hat r_{iv}| > \gamma_i\} \setminus \{j\} denote the reduced neighborhood of node i in \tilde G_{ij}, where \tilde G_{ij} denotes the reduced graph of \tilde G with \tilde e_{ij} set to 0, and \gamma_i denotes a threshold value. Similarly, let \hat E_{\gamma_i, i} = \{v : |\hat r_{iv}| > \gamma_i\} denote the reduced neighborhood of node i in \tilde G. For any pair of vertices i and j, we define the partial correlation coefficient \psi_{ij} by

\psi_{ij} = \rho_{ij|S_{ij}},   (7)

where S_{ij} = \hat E_{\gamma_i, i \setminus j} if |\hat E_{\gamma_i, i \setminus j}| < |\hat E_{\gamma_j, j \setminus i}|, and S_{ij} = \hat E_{\gamma_j, j \setminus i} otherwise.
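To make (7) concrete, here is a hedged numpy sketch of how \psi_{ij} could be computed for one pair: threshold the sample correlations to form the two reduced neighborhoods, take the smaller one as the separator, and invert the correlation submatrix over S_{ij} \cup \{i, j\}. A single threshold gamma is used for both nodes here for simplicity, and all names are illustrative.

```python
import numpy as np

def psi_pair(R, i, j, gamma):
    """psi_{ij} of (7): a partial correlation computed on a reduced conditioning
    set.  R is the (p, p) sample correlation matrix, gamma a screening threshold."""
    p = R.shape[0]
    nbr_i = [v for v in range(p) if v not in (i, j) and abs(R[i, v]) > gamma]
    nbr_j = [v for v in range(p) if v not in (i, j) and abs(R[j, v]) > gamma]
    S = nbr_i if len(nbr_i) < len(nbr_j) else nbr_j   # smaller separator preferred
    idx = S + [i, j]
    C = np.linalg.inv(R[np.ix_(idx, idx)])            # concentration of the sub-block
    return -C[-2, -1] / np.sqrt(C[-2, -2] * C[-1, -1])
```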

Equivalent Measure Figure: Illustrative plot for calculation of ψ-partial correlation coefficients, where the solid and dotted edges indicate the direct and indirect associations, respectively.

Equivalent Measure Theorem 1. Suppose that a GGM G = (V, E) satisfies the faithfulness assumption and that b_G(i) \subseteq \hat E_{\gamma_i, i} holds for each node i. Then \psi_{ij} defined in (7) is an equivalent measure of the partial correlation coefficient \rho_{ij|V\setminus\{i,j\}} in the sense that \psi_{ij} = 0 \iff \rho_{ij|V\setminus\{i,j\}} = 0.

Equivalent Measure Remarks: 1. There are many examples in which the faithfulness assumption is violated; see Bühlmann and van de Geer (2011, pp. 446-448) for some examples. However, as pointed out in Koller and Friedman (2009, p. 1041), these examples usually correspond to particular parameter values, which form a set of measure zero within the space of all possible parameterizations. 2. There are many ways to specify the separator S_{ij}. For example, we could set S_{ij} to whichever of \hat E_{\gamma_i, i \setminus j} and \hat E_{\gamma_j, j \setminus i} has the larger cardinality, or even set S_{ij} = \hat E_{\gamma_i, i \setminus j} \cup \hat E_{\gamma_j, j \setminus i}. Under these settings, \psi_{ij} is still an equivalent measure of \rho_{ij|V\setminus\{i,j\}}. However, since \hat\psi_{ij} has an approximate variance of 1/(n - |S_{ij}| - 3), we prefer to set S_{ij} to whichever of \hat E_{\gamma_i, i \setminus j} and \hat E_{\gamma_j, j \setminus i} has the smaller cardinality.

ψ-learning Algorithm
(a) (Correlation screening; see the sketch after this list) Determine the reduced neighborhood \hat E_{\gamma_i, i} for each variable X^{(i)}:
  (i) Conduct a multiple hypothesis test to identify the pairs of vertices for which the empirical correlation coefficient is significantly different from zero. This step results in a so-called correlation network.
  (ii) For each variable X^{(i)}, identify its neighborhood in the correlation network, and reduce the size of the neighborhood to O(n/\log(n)) by removing the variables having lower correlation (in absolute value) with X^{(i)}.
(b) (ψ-calculation) For each pair of vertices i and j, identify the separator S_{ij} based on the correlation network resulting from step (a), and calculate \psi_{ij} by inverting the sample covariance matrix restricted to the variables in S_{ij} \cup \{i, j\}.
(c) (ψ-screening) Conduct a multiple hypothesis test to identify the pairs of vertices for which \psi_{ij} is significantly different from zero, and set the corresponding elements of \hat E to 1.
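The following Python sketch makes step (a) concrete under stated assumptions: it tests the Fisher-transformed sample correlations with a Benjamini-Hochberg adjustment (a stand-in for the multiple hypothesis test described above) and caps each neighborhood at n/\log(n). Function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def correlation_screening(X, q=0.05):
    """Sketch of step (a): correlation network via a BH-adjusted test on the
    Fisher-transformed correlations, then neighborhoods capped at n/log(n)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    iu = np.triu_indices(p, k=1)
    z = np.sqrt(n - 3) * np.arctanh(R[iu])            # Fisher transform of r_ij
    pvals = 2 * norm.sf(np.abs(z))
    order = np.argsort(pvals)                         # Benjamini-Hochberg step-up
    thresh = q * np.arange(1, len(pvals) + 1) / len(pvals)
    passed = pvals[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(len(pvals), dtype=bool)
    reject[order[:k]] = True
    A = np.zeros((p, p), dtype=bool)                  # correlation network
    A[iu] = reject
    A = A | A.T
    cap = int(np.floor(n / np.log(n)))                # neighborhood size cap
    nbrs = {}
    for i in range(p):
        cand = np.flatnonzero(A[i])
        nbrs[i] = list(cand[np.argsort(-np.abs(R[i, cand]))[:cap]])
    return nbrs

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
nbrs = correlation_screening(X, q=0.05)
```

Steps (b) and (c) then amount to applying a routine such as psi_pair above for every pair, with S_{ij} taken from these neighborhoods, followed by a multiple test on the resulting ψ-scores.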

Consistency
(A1) The distribution P^{(n)} satisfies the conditions: (i) P^{(n)} is multivariate Gaussian; (ii) P^{(n)} satisfies the Markov property and the adjacency faithfulness condition with respect to the undirected graph G^{(n)} for all n \in \mathbb{N}.
(A2) The dimension p_n = O(\exp(n^\delta)) for some 0 \le \delta < 1.
(A3) The correlations satisfy

\min\{|r_{ij}| : r_{ij} \neq 0,\ i, j = 1, 2, \ldots, p_n,\ i \neq j\} \ge c_0 n^{-\kappa},   (8)

for some constants c_0 > 0 and 0 < \kappa < (1 - \delta)/2, and

\max\{|r_{ij}| : i, j = 1, \ldots, p_n,\ i \neq j\} \le M_r < 1,   (9)

for some constant 0 < M_r < 1.

Consistency Define

E^{(n)} = \{(i, j) : \rho_{ij|V\setminus\{i,j\}} \neq 0,\ i, j = 1, \ldots, p_n\}, \quad \tilde E^{(n)} = \{(i, j) : r_{ij} \neq 0,\ i, j = 1, \ldots, p_n\},   (10)

as the edge sets of G^{(n)} and the correlation graph \tilde G^{(n)}, respectively. Since E^{(n)} \subseteq \tilde E^{(n)}, there exist constants c_1 > 0 and 0 < \kappa' \le \kappa such that

\min\{|r_{ij}| : (i, j) \in E^{(n)},\ i, j = 1, \ldots, p_n\} \ge c_1 n^{-\kappa'}.   (11)

Consistency Lemma 1. Assume (A1), (A2), and (A3) hold. Let \gamma_n = (2/3) c_1 n^{-\kappa'}. Then there exist constants c_2 and c_3 such that

P(E^{(n)} \subseteq \hat E_{\gamma_n}) \ge 1 - c_2 \exp(-c_3 n^{1 - 2\kappa'}),
P(b_{G^{(n)}}(i) \subseteq \hat E_{\gamma_n, i}) \ge 1 - c_2 \exp(-c_3 n^{1 - 2\kappa'}).

Consistency
(A4) There exist constants c_4 > 0 and 0 \le \tau < 1 - 2\kappa' such that \lambda_{\max}(\Sigma) \le c_4 n^{\tau}, where \Sigma denotes the covariance matrix of X and \lambda_{\max}(\Sigma) is its largest eigenvalue.
Lemma 2. Assume (A1), (A2), (A3), and (A4) hold. Let \gamma_n = (2/3) c_1 n^{-\kappa'}. Then for each node i,

P[|\hat E_{\gamma_n, i}| \le O(n^{2\kappa' + \tau})] \ge 1 - c_2 \exp(-c_3 n^{1 - 2\kappa'}),

where c_2 and c_3 are as given in Lemma 1.

Consistency Lemma 3. Assume (A1)(i), (A2), and (A3) hold. If \eta_n = (1/2) c_0 n^{-\kappa}, then P[\hat E_{\eta_n} = \tilde E^{(n)}] = 1 - o(1) as n \to \infty.

Consistency The faithfulness condition implies that E^{(n)} \subseteq \tilde E^{(n)}. Further, it follows from Lemma 3 that

P[E^{(n)} \subseteq \hat E_{\eta_n}] = 1 - o(1).   (12)

Based on Lemma 1, Lemma 2, and (12), we propose to restrict the neighborhood size of each node in the calculation of the ψ-partial correlation coefficients to

\min\{|\hat E_{\eta_n, i}|,\ n/(\xi_n \log(n))\},   (13)

where \eta_n can be determined through a multiple hypothesis test.

Consistency Let \zeta_n denote the threshold value of the ψ-partial correlation used in the ψ-screening step, and let \hat E_{\zeta_n} denote the final network obtained by thresholding the ψ-partial correlations, that is,

\hat E_{\zeta_n} = \{(i, j) : |\hat\psi_{ij}| > \zeta_n,\ i, j = 1, 2, \ldots, p_n\}.

(A5) The ψ-partial correlation coefficients satisfy

\inf\{|\psi_{ij}| : \psi_{ij} \neq 0,\ i, j = 1, \ldots, p_n,\ i \neq j,\ |S_{ij}| \le q_n\} \ge c_6 n^{-d},

where q_n = O(n^{2\kappa' + \tau}), and 0 < c_6 < \infty and 0 < d < (1 - \delta)/2 are constants. In addition,

\sup\{|\psi_{ij}| : i, j = 1, \ldots, p_n,\ i \neq j,\ |S_{ij}| \le q_n\} \le M_\psi < 1,

for some constant 0 < M_\psi < 1.

Consistency Theorem 2. Consider a GGM with distribution P^{(n)} and underlying conditional independence graph G^{(n)}. Assume (A1)-(A5) hold. Then P[\hat E_{\zeta_n} = E^{(n)}] \ge 1 - o(1) as n \to \infty.

An Illustrative Example Consider an auto-regressive process of order two with concentration matrix given by

C_{ij} = 1 if i = j; C_{ij} = 0.5 if |i - j| = 1; C_{ij} = 0.25 if |i - j| = 2; and C_{ij} = 0 otherwise,

for i, j = 1, \ldots, p. This example was used by Yuan and Lin (2007) and Mazumder and Hastie (2012) to illustrate their glasso algorithms. We are interested in it because all variables are dependent, either directly or indirectly, and thus it serves as a good test of whether the proposed ψ-learning algorithm can distinguish direct dependencies from indirect ones.
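This simulation setting is easy to reproduce; the short sketch below builds the banded AR(2) concentration matrix and draws one dataset under the (n, p) = (500, 200) configuration. The seed and helper name are illustrative choices.

```python
import numpy as np

def ar2_concentration(p):
    """Concentration matrix of the illustrative AR(2) example: 1 on the diagonal,
    0.5 on the first off-diagonals, 0.25 on the second off-diagonals."""
    C = np.eye(p)
    C += 0.5 * (np.eye(p, k=1) + np.eye(p, k=-1))
    C += 0.25 * (np.eye(p, k=2) + np.eye(p, k=-2))
    return C

# simulate one dataset as in the (n, p) = (500, 200) setting
rng = np.random.default_rng(1)
C = ar2_concentration(200)
X = rng.multivariate_normal(np.zeros(200), np.linalg.inv(C), size=500)
```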

An Illustrative Example Figure: Precision-recall curves of the ψ-learning (equivalent partial correlation), glasso, nodewise (penalized) regression, qp-average, PC, sample partial correlation, and marginal correlation methods for one dataset simulated with (n, p) = (500, 200). Under this setting, the true partial correlation is available and thus included for comparison.

An Illustrative Example Figure: Precision-recall curves of the ψ-learning, glasso, nodewise regression, qp-average, PC, and correlation methods for one dataset simulated with (n, p) = (100, 200). Under this setting, the true partial correlation is not available.

Partial Correlation Scores The true partial correlation score is defined by

z^*_{ij} = \frac{1}{2} \log\left[\frac{1 + |\rho_{ij|V\setminus\{i,j\}}|}{1 - |\rho_{ij|V\setminus\{i,j\}}|}\right], \quad z_{ij} = \Phi^{-1}\big(2\Phi(\sqrt{n - p - 1}\, z^*_{ij}) - 1\big),   (14)

where n - p - 1 is its effective sample size. The ψ-score is defined by

z^*_{ij} = \frac{1}{2} \log\left[\frac{1 + |\hat\psi_{ij}|}{1 - |\hat\psi_{ij}|}\right], \quad z_{ij} = \Phi^{-1}\big(2\Phi(\sqrt{n - |S_{ij}| - 3}\, z^*_{ij}) - 1\big),   (15)

where S_{ij} denotes the conditioning set and n - |S_{ij}| - 3 is its effective sample size.
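A small sketch of the ψ-score transformation, following the reconstructed reading of (15) above (Fisher transform with effective sample size n - |S_{ij}| - 3, then the probit map); the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def psi_score(psi_hat, sep_size, n):
    """psi-score per the reading of (15): Fisher-transform psi_hat, scale by the
    effective sample size, then map through Phi^{-1}(2*Phi(.) - 1)."""
    z_star = 0.5 * np.log((1 + abs(psi_hat)) / (1 - abs(psi_hat)))
    t = np.sqrt(n - sep_size - 3) * z_star
    return norm.ppf(2 * norm.cdf(t) - 1)
```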

An Illustrative Example Figure: Comparison of ψ-scores and sample partial correlation scores. (a) Histogram of ψ-scores. (b) Boxplots of ψ-scores for the first-order dependent nodes (labeled 1), the second-order dependent nodes (labeled 2), and other nodes (labeled 3). (c) Histogram of sample partial correlation scores. (d) Boxplots of sample partial correlation scores for the first-order dependent nodes (labeled 1), the second-order dependent nodes (labeled 2), and other nodes (labeled 3).

An Illustrative Example Table: Average areas under the precision-recall curves for the different methods: ψ-learning, true partial correlation, glasso, nodewise regression, qp-average, PC, and correlation. The numbers in parentheses are the standard deviations of the average areas.

  n     ψ                partial          glasso           nodewise         qp-ave           PC               corr
  500   0.9940 (0.0002)  0.9831 (0.0011)  0.8259 (0.0010)  0.9466 (0.0019)  0.7268 (0.0025)  0.5285 (0.0040)  0.7696 (0.0017)
  100   0.7925 (0.0086)  --               0.5336 (0.0030)  0.6207 (0.0024)  0.5819 (0.0030)  0.4945 (0.0026)  0.5215 (0.0029)

T-cell Data This dataset comes from an experiment investigating the expression response of human T cells to phorbol myristate acetate (PMA). It contains the temporal expression levels of 58 genes at 10 unequally spaced time points, with 34 separate measurements at each time point. The data have been log-transformed and quantile normalized, and are available in the R package longitudinal (Opgen-Rhein and Strimmer, 2008). As in other work, we ignore the longitudinal structure and treat the observations as independent when studying the GGM.

T-cell Data Figure: Networks identified by (a) the ψ-learning method with q-value = 0.01, (b) the ψ-learning method with q-value = 0.0001, (c) glasso, and (d) nodewise regression for the T-cell data.

Phase Transition Correlation and partial correlation screening suffer from a phase transition phenomenon: as the threshold decreases, the number of discoveries increases abruptly. This explains why the ψ-learning method outperforms the other methods in the high-precision region, while the ordering is reversed in the low-precision region.

T-cell Data: Power Law To assess the quality of the networks produced by the different methods, we fit a power law to them. A nonnegative random variable X is said to have a power law distribution if P(X = x) \propto x^{-\alpha} for some positive constant \alpha. The power law states that the majority of vertices have very low degree, although some have much higher degree, two or three orders of magnitude higher in some cases.

T-cell Data Figure: Log-log plots of the degree distributions of the four networks shown in Figure 5: (a) ψ-learning with q-value = 0.01; (b) ψ-learning with q-value = 0.0001; (c) glasso; and (d) nodewise regression.

Breast Cancer Data This dataset contains 49 breast cancer samples and 3883 genes, arising from a study of molecular phenotypes for clinical prediction (West et al., 2001). The cleaned data are available at http://strimmerlab.org/data.html. The full graph consists of 7,536,903 possible edges.

Breast Cancer Data Figure: Networks identified by (a) the ψ-learning method with q-value = 0.01, (b) glasso, and (c) nodewise regression for the breast cancer data.

Breast Cancer Data Figure: Edges of the network identified by the ψ-learning method for the breast cancer data, where the genes numbered 590, 3488, 1537, and 992 are CD44, IGFBP-5, HLA, and STARD3, respectively.

Biological Results Based on the ψ-learning result, 13 hub genes are identified by setting the connectivity cutoff at 5; the top 4 genes are STARD3, IGFBP-5, CD44, and HLA. Our examination shows very encouraging results: all 13 genes are related to breast cancer. STARD3 (also known as MLN64 and CAB1) lies in the 17q11-q21 region, in which amplification is found in about 25% of primary breast carcinomas, and its expression is associated with poor clinical outcomes such as increased risk of relapse and poor prognosis (http://atlasgeneticsoncology.org/). IGFBP-5 has been shown to play an important role in the molecular mechanism of breast cancer, especially in metastasis (Akkiprik et al., 2008), and is considered a potential therapeutic target for breast cancer. CD44 is involved in many essential biological functions associated with the pathological activities of cancer cells, and CD44 expression is commonly used as a marker for breast cancer stem cells (Louderbough et al., 2011). HLA expression is associated with genetic susceptibility (Chaudhuri et al., 2000), tumor progression, and recurrence risk of breast cancer (Kaneko et al., 2011).

Data Integration To integrate multiple sources of data, the ψ-partial correlation coefficient can be transformed to a Z-score via Fisher's transformation:

\psi_{z_{ij}} = \frac{\sqrt{n - |S_{ij}| - 3}}{2} \log\left[\frac{1 + \hat\psi_{ij}}{1 - \hat\psi_{ij}}\right], \quad i, j = 1, 2, \ldots, p.   (16)

The ψ_z-scores from different sources of data can then be combined using the meta-analysis method:

\psi_{c_{ij}} = \frac{\sum_{k=1}^K w_k \psi^{(k)}_{z_{ij}}}{\sqrt{\sum_{k=1}^K w_k^2}}, \quad i, j = 1, 2, \ldots, p,   (17)

where K denotes the total number of data sources, \psi^{(k)}_{z_{ij}} denotes the ψ_z-score from source k, and w_k denotes the weight assigned to source k.
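Equation (17) is a Stouffer-style weighted combination of Z-scores, which takes only a few lines; the sketch below assumes the per-source ψ_z-score matrices have already been computed, and all names are illustrative.

```python
import numpy as np

def combine_psi_scores(z_list, weights=None):
    """Weighted combination of psi_z-scores from K data sources, per (17).
    z_list : list of K (p, p) arrays of psi_z-scores; weights default to 1."""
    K = len(z_list)
    w = np.ones(K) if weights is None else np.asarray(weights, dtype=float)
    num = sum(wk * zk for wk, zk in zip(w, z_list))
    return num / np.sqrt(np.sum(w ** 2))

# e.g. psi_c = combine_psi_scores([z1, z2]) for two equally weighted sources
```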

Network Comparison Define

\psi_{d_{ij}} = [\psi^{(1)}_{z_{ij}} - \psi^{(2)}_{z_{ij}}] / \sqrt{2}.   (18)

(i) (ψ-correlation calculation) Perform steps (a) and (b) of the ψ-learning algorithm independently for each source of data.
(ii) (ψ_d-score calculation) Calculate the differences of ψ-scores as in (18).
(iii) (ψ_d-score screening) Conduct a multiple hypothesis test to identify the pairs of vertices for which \psi_{d_{ij}} deviates from the standard normal N(0, 1) distribution.
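A hedged sketch of steps (ii)-(iii): form the ψ_d-scores of (18) and flag pairs with a plain two-sided z-test, which stands in for the multiple hypothesis test described above.

```python
import numpy as np
from scipy.stats import norm

def differential_edges(z1, z2, alpha=0.05):
    """psi_d scores from (18) plus a simple two-sided z-test at level alpha
    (a stand-in for the multiple hypothesis test of step (iii))."""
    d = (z1 - z2) / np.sqrt(2.0)
    return np.abs(d) > norm.ppf(1 - alpha / 2), d
```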

An Illustrative Example (continuation) Generate two datasets from N(0, (1/2) C^{-1}) and N(0, C^{-1}), respectively, where C is as specified earlier with p = 200. Both datasets have a sample size of n = 100.

An Illustrative Example Figure: Histograms of \psi_{c_{ij}} (left) and \psi_{d_{ij}} (right) for the two simulated datasets.

An Illustrative Example Figure: Networks identified by the ψ-learning algorithm for the simulation experiment: (a) network for dataset 1, (precision, recall) = (0.963, 0.388); (b) network for dataset 2, (precision, recall) = (0.958, 0.343); and (c) integrated network of datasets 1 and 2, (precision, recall) = (0.934, 0.602).

Covariate Adjustment Let W_1, \ldots, W_q denote the external variables. To adjust for their effects, we can replace the empirical correlation coefficient used in step (a) of the ψ-learning algorithm by the p-value obtained from testing the hypotheses H_0: \beta_{q+1} = 0 versus H_1: \beta_{q+1} \neq 0 in the regression

X^{(i)} = \beta_0 + \beta_1 W_1 + \cdots + \beta_q W_q + \beta_{q+1} X^{(j)} + \epsilon,   (19)

where \epsilon denotes a vector of Gaussian random errors. Similarly, we replace the ψ-correlation coefficient used in step (c) of the ψ-learning algorithm by the p-value obtained from testing the hypotheses H_0: \beta_{q+1} = 0 versus H_1: \beta_{q+1} \neq 0 in the regression

X^{(i)} = \beta_0 + \beta_1 W_1 + \cdots + \beta_q W_q + \beta_{q+1} X^{(j)} + \sum_{k \in S_{ij}} \gamma_k X^{(k)} + \epsilon,   (20)

where S_{ij}, as defined previously, denotes the selected separator of X^{(i)} and X^{(j)}.
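The p-value for H_0: \beta_{q+1} = 0 in (19) or (20) is an ordinary regression t-test p-value. A minimal statsmodels sketch is shown below; the function name, argument layout, and use of OLS are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

def adjusted_pvalue(xi, xj, W, Xs=None):
    """p-value for H0: beta_{q+1} = 0 in (19)/(20): regress X^(i) on the external
    covariates W (n x q), on X^(j), and optionally on the separator variables Xs."""
    cols = [W, xj.reshape(-1, 1)]
    if Xs is not None:
        cols.append(Xs)                       # the sum over k in S_ij from (20)
    design = sm.add_constant(np.column_stack(cols))
    fit = sm.OLS(xi, design).fit()
    return fit.pvalues[1 + W.shape[1]]        # coefficient of X^(j) follows the W block
```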

Time Complexity In terms of p, the computational complexity of the ψ-learning algorithm is bounded by O(p^2 (\log p)^b), where b = 3(2\kappa' + \tau)/\delta: the O(p^2) factor accounts for the total number of ψ-scores that need to be calculated, and the O((\log p)^b) factor accounts for the cost of inverting a neighborhood matrix of size O(n^{2\kappa' + \tau}). When \delta > 0, the computational complexity of the algorithm is therefore nearly O(p^2).

Time Complexity The glasso algorithm is known to have a computational complexity of O(p^3). With its fast implementations (Witten et al., 2011; Mazumder and Hastie, 2012), which exploit the block diagonal structure of graphical lasso solutions, the complexity can be reduced to O(p^{2+\nu}), where 0 < \nu \le 1 may depend on the number of blocks and the size of each block. Since the ordinary Lasso has a computational complexity of O(np \min(n, p)) (Meinshausen, 2007), the nodewise regression algorithm has complexity O(p^3 (\log p)^{2/\delta}) if p > n and O(p^4 (\log p)^{1/\delta}) otherwise. Other algorithms usually have higher computational complexities; for example, the PC algorithm (Spirtes et al., 2000) has a complexity bounded by O(p^{2+m}), where m is the maximum neighborhood size.

Time Complexity We have compared the CPU times of ψ-learning and glasso (the fastest regularization method) for the breast cancer example. On a single core of an Intel Xeon CPU E5-2690 (2.90 GHz), ψ-learning took 61 minutes, while glasso took 6,993 minutes. Here glasso was implemented in the package huge and run under its default setting, with the regularization parameter determined using the stability approach.

Discussion We have provided a general framework for inference of GGMs based on an equivalent measure of the partial correlation coefficient. The key idea is correlation screening, and the key strategy is split-and-merge. Extensions such as multiple-dataset integration, covariate adjustment, network comparison, and handling of non-normal data are all straightforward under the proposed framework!

Acknowledgments NSF grants. Students: Qifan Song (Purdue University), Suwa Xu. Application collaborators: Peihua Qiu, Huaihou Chen, Guanghua Xiao (UT Dallas), Jianxin Shi (NIH).