Application of random matrix theory to microarray data for discovering functional gene modules

Similar documents
Random matrix analysis of complex networks

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002

Cell biology traditionally identifies proteins based on their individual actions as catalysts, signaling

ORIGINS. E.P. Wigner, Conference on Neutron Physics by Time of Flight, November 1956

Introduction to Theory of Mesoscopic Systems

Chapter 29. Quantum Chaos

Recent results in quantum chaos and its applications to nuclei and particles

Self-organized scale-free networks

Spectral Fluctuations in A=32 Nuclei Using the Framework of the Nuclear Shell Model

Effect of Unfolding on the Spectral Statistics of Adjacency Matrices of Complex Networks

A Multi-Level Lorentzian Analysis of the Basic Structures of the Daily DJIA

arxiv:physics/ v1 [physics.med-ph] 8 Jan 2003

Emergence of chaotic scattering in ultracold lanthanides.

Misleading signatures of quantum chaos

Size Effect of Diagonal Random Matrices

A new type of PT-symmetric random matrix ensembles

Estimation of Identification Methods of Gene Clusters Using GO Term Annotations from a Hierarchical Cluster Tree

Connectivity and expression in protein networks: Proteins in a complex are uniformly expressed

Modular organization of protein interaction networks

T H E J O U R N A L O F C E L L B I O L O G Y

Discovering molecular pathways from protein interaction and ge

Molecular ecological network analyses

Isospin symmetry breaking and the nuclear shell model

Spectral properties of random geometric graphs

TitleQuantum Chaos in Generic Systems.

1 Intro to RMT (Gene)

Anderson Localization Looking Forward

Analyzing Microarray Time course Genome wide Data

The Generalized Topological Overlap Matrix For Detecting Modules in Gene Networks

The architecture of complexity: the structure and dynamics of complex networks.

CHAOTIC MEAN FIELD DYNAMICS OF A BOOLEAN NETWORK WITH RANDOM CONNECTIVITY

Controlling chaos in random Boolean networks

Computational approaches for functional genomics

Weighted gene co-expression analysis. Yuehua Cui June 7, 2013

Gene expression microarray technology measures the expression levels of thousands of genes. Research Article

Genome-Scale Gene Function Prediction Using Multiple Sources of High-Throughput Data in Yeast Saccharomyces cerevisiae ABSTRACT

In order to compare the proteins of the phylogenomic matrix, we needed a similarity

Spectral Statistics of Instantaneous Normal Modes in Liquids and Random Matrices

arxiv:cond-mat/ v1 [cond-mat.stat-mech] 13 Mar 2003

The development and application of DNA and oligonucleotide microarray

Is Quantum Mechanics Chaotic? Steven Anlage

The non-backtracking operator

Self-organized Criticality and Synchronization in a Pulse-coupled Integrate-and-Fire Neuron Model Based on Small World Networks

arxiv:cond-mat/ v2 [cond-mat.stat-mech] 3 Oct 2005

The network growth model that considered community information

Herd Behavior and Phase Transition in Financial Market

Divergence Pattern of Duplicate Genes in Protein-Protein Interactions Follows the Power Law

Computational methods for predicting protein-protein interactions

Effects of Interactive Function Forms in a Self-Organized Critical Model Based on Neural Networks

Bioinformatics. Transcriptome

arxiv: v1 [q-fin.st] 5 Apr 2007

Theory of Mesoscopic Systems

Modeling Gene Expression from Microarray Expression Data with State-Space Equations. F.X. Wu, W.J. Zhang, and A.J. Kusalik

Identifying Bio-markers for EcoArray

PHYSICAL REVIEW LETTERS

biologically-inspired computing lecture 5 Informatics luis rocha 2015 biologically Inspired computing INDIANA UNIVERSITY

Universality of distribution functions in random matrix theory Arno Kuijlaars Katholieke Universiteit Leuven, Belgium

A Case Study -- Chu et al. The Transcriptional Program of Sporulation in Budding Yeast. What is Sporulation? CSE 527, W.L. Ruzzo 1

arxiv:cond-mat/ v1 [cond-mat.stat-mech] 1 Aug 2001

arxiv:cond-mat/ v1 [cond-mat.dis-nn] 4 May 2000

arxiv:physics/ v1 [physics.bio-ph] 30 Sep 2002

Erzsébet Ravasz Advisor: Albert-László Barabási

Branching Process Approach to Avalanche Dynamics on Complex Networks

Self Similar (Scale Free, Power Law) Networks (I)

Spectral Degree of Coherence of a Random Three- Dimensional Electromagnetic Field

Intensity distribution of scalar waves propagating in random media

Advances in microarray technologies (1 5) have enabled

Correlated Percolation, Fractal Structures, and Scale-Invariant Distribution of Clusters in Natural Images

Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data of Bacillus Subtilis Using Differential Equations

EIGENVALUE SPACINGS FOR REGULAR GRAPHS. 1. Introduction

ARANDOM-MATRIX-THEORY-BASEDANALYSISOF STOCKS OF MARKETS FROM DIFFERENT COUNTRIES

Percolation of networks with directed dependency links

Experimental and theoretical aspects of quantum chaos

Modeling Mass Spectrometry-Based Protein Analysis

Opinion Dynamics on Triad Scale Free Network

arxiv:physics/ v1 [physics.soc-ph] 15 May 2006

Nature Genetics: doi: /ng Supplementary Figure 1. Icm/Dot secretion system region I in 41 Legionella species.

Heuristic segmentation of a nonstationary time series

Spatial and Temporal Behaviors in a Modified Evolution Model Based on Small World Network

Supplemental Material

Interaction Network Topologies

LEVEL REPULSION IN INTEGRABLE SYSTEMS

In-Depth Assessment of Local Sequence Alignment

Evolving network with different edges

Resistance distribution in the hopping percolation model

Exploratory statistical analysis of multi-species time course gene expression

Regression Model In The Analysis Of Micro Array Data-Gene Expression Detection

arxiv:physics/ v3 [physics.data-an] 29 Nov 2006

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices

Predicting Protein Functions and Domain Interactions from Protein Interactions

arxiv:physics/ v1 [physics.soc-ph] 11 Mar 2005

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification

Biological Systems: Open Access

Network Biology: Understanding the cell s functional organization. Albert-László Barabási Zoltán N. Oltvai

Betweenness centrality of fractal and nonfractal scale-free model networks and tests on real networks

Spectral statistics of random geometric graphs arxiv: v2 [physics.soc-ph] 6 Jun 2017

Heaving Toward Speciation

Theory of selective excitation in stimulated Raman scattering

arxiv: v1 [physics.soc-ph] 7 Dec 2011

Package neat. February 23, 2018

Transcription:

Application of random matrix theory to microarray data for discovering functional gene modules Feng Luo, 1 Jianxin Zhong, 2,3, * Yunfeng Yang, 4 and Jizhong Zhou 4,5, 1 Department of Computer Science, Clemson University, 100 McAdams Hall, Clemson, South Carolina 29634, USA 2 Department of Physics, Xiangtan University, Hunan 411105, China 3 Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA 4 Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA 5 Department of Botany and Microbiology, University of Oklahoma, Norman, Oklahoma 73019, USA Received 9 June 2005; revised manuscript received 3 February 2006; published 29 March 2006 We show that spectral fluctuation of coexpression correlation matrices of yeast gene microarray profiles follows the description of the Gaussian orthogonal ensemble GOE of the random matrix theory RMT and removal of small values of the correlation coefficients results in a transition from the GOE statistics to the Poisson statistics of the RMT. This transition is directly related to the structural change of the gene expression network from a global network to a network of isolated modules. DOI: 10.1103/PhysRevE.73.031924 PACS number s : 87.17.Aa, 87.80.Vt, 89.75.Fb Understanding gene expression networks at system level is a key issue in the post genome era 1 4. The emerging microarray technology 5,6 enables massive parallel measurement of expressions of thousands of genes simultaneously. It has opened up great opportunities to unveil gene expression networks at large scale. Currently, the inference of gene expression networks from microarray profiles is harmed by the dimensionality problem, namely, the number of genes is much larger than available data points. It is essential to develop powerful computational methods to extract as much biological information as possible from microarray data. The random matrix theory RMT, initially proposed by Wigner and Dyson in the 1960s for studying the spectrum of complex nuclei 7, is a powerful approach for the identification and modeling of phase transitions and dynamics in physical systems. It has been successfully used to study the behaviors of complex systems, such as spectral properties of large atoms 8, metal insulator transitions in disordered systems 9, spectra of quasiperiodic systems 10,11, chaotic systems 12, brain responses 13, and the stock market 14. The RMT focuses on the study of statistical properties of eigenvalue spacing between consecutive eigenvalues. From the RMT, distribution of eigenvalue spacing of real and symmetrical random matrices follows two universal laws depending on the correlativity of eigenvalues. Strong correlation of eigenvalues leads to statistics described by the Gaussian orthogonal ensemble GOE. On the other hand, eigenvalue spacing distribution follows Poisson statistics if there is no correlation between eigenvalues. A typical example of the GOE distribution is a complete random matrix with a random distribution of all matrix elements. The nonzero off-diagonal elements in this matrix, which represent mutual interactions between diagonal elements, induce strong correlations of eigenvalues and thus the GOE statistics. Differently, eigenvalue spacing distribution of a random *Corresponding authors. Electronic address: zhongjn@ornl.gov Electronic address: zhouj@ornl.gov matrix with nonzero values only for its diagonal or blockdiagonal parts follows the Poisson statistics, because eigenvalues of this system are not correlated due to the absence of interaction between diagonal or block-diagonal parts. From microarry data, gene coexpression correlation matrices CCMs can be constructed. Because of the modularity of gene coexpression networks, after successive removal of lower values of correlation coefficients a CCM begins to have nonzero elements only for its block-diagonal parts corresponding to gene modules. We expect from the RMT that eigenvalue spacing distribution in the CCMs undergoes a transition from the GOE statistics to Poisson statistics. In this paper, we report our results of application of the RMT to analysis of the CCMs of gene microarray profiles. We have found that eigenvalue spacing distribution of the CCMs of yeast gene microarray profiles is described by the GOE statistics. Furthermore, removal of small values of the correlation coefficients in the CCM results in a transition from the GOE statistics to the Poisson statistics as well as disassociation of the gene coexpression network from a global network to a network of isolated modules. This transition may provide a new objective approach for identification of functional modules from microarray profiles 15. We used the standard spectral unfolding technique in the RMT to study the eigenvalue spacing statistics of the CCMs. In general, the density of eigenvalues of a matrix varies with its eigenvalue E i i=1,2,3,...n, where N is the order of the matrix. As a result, eigenvalue spacing distribution is a function of E i and thus system dependent. In order to observe universal system independent eigenvalue fluctuations of different types of matrices, the RMT requires spectral unfolding to have a constant density of eigenvalues. To fulfill this, one can replace E i by the unfolded spectrum e i, where e i =N av E i and N av is the smoothed integrated density of eigenvalues obtained by fitting the original integrated density to a cubic spline or by local density average. With the unfolded eigenvalues, one calculates the nearest neighbor spacing distribution NNSD of eigenvalues, P s, which is defined as the probability density of unfolded eigenvalue spacing s=e i+1 e i. From the RMT, P s of the GOE statis- 1539-3755/2006/73 3 /031924 5 /$23.00 031924-1 2006 The American Physical Society

LUO et al. FIG. 1. Color online Nearest neighbor spacing distributions symbols of gene coexpression correlation matrices constructed from microarray profiles of yeast mutants by different cut methods with cutoff value q. The solid green is the Wigner-Dyson distribution and the dashed line is the Poisson distribution. a Method I: c=0 if c q. b Method II: c=0 if c q. c Method III: c=0 if c q. tics closely follows the Wigner-Dyson distribution P GOE s 1 2 s exp s2 /4. In the case of Poisson statistics, P s is given by the Poisson distribution P Poisson s = exp s. The difference between the Wigner-Dyson and Poisson distributions manifests in the regime of small s, where, P GOE s 0 =0 and P Poisson s 0 =1. Matrix elements in the coexpression correlation matrix CCM for our RMT analysis are the standard Pearson correlation coefficients of gene expressions between different genes defined as c g i,g j = 1 N k=1,n g ik M gi gi g jk M gj, gj where M gi, M gj are the average gene expression of gene g i and g j, respectively, gi, gj are their corresponding standard deviations, and N is the total number of experiments. As examples, we studied two microarray expression profiles of yeast. The first one is the microarray expression data of 287 yeast mutants 16. We selected genes that have expressed in most mutant experiments for our study. As a result, there is a total of 6209 genes. The second profile is the gene expression data of yeast cells response to environment changes 17. We selected 6090 genes that have expressed in most experiments. Figures 1 and 2 show the NNSDs for the yeast mutants and the yeast cells response, respectively curves with q=0. One can see that in both cases, the NNSD is well described by the Wigner-Dyson distribution. It has been widely believed that a cell system, like many other engineering synthetic systems, is modular. Its cellular functionality is performed by a collection of modules, which are groups of physically or functionally linked genes. Given a specific condition and stage, there are particular gene modules expressed in the cell. The expression patterns of genes in the same functional module often exhibit higher similarity in microarray experiments 2 4. Therefore, the CCM of a microarray profile consists of a strong correlation part, C s, which corresponds to the correlation between genes in the same module, and a weak correlation part, C w, which corresponds to the correlation between genes in different modules or unexpressed isolated genes. To test the modularity of genes, we gradually remove lower values of correlation coefficients. Three cut methods are investigated to keep significant correlations: 1 c=0 if c q method I ; 2 c=0 if c q method II ; and 3 c=0 if c q method III ; where 0 q 1 denotes the level of cutoff. Evidently, both positive and negative significant correlations are retained in method I. In methods II and III, only positive or negative significant correlations are kept, respectively. The three different cut methods allow us to study a series of CCMs with different cutoff values. We found that the NNSDs of the CCMs constructed from microarray expression profiles of yeast using the tree cut methods all exhibit sharp transitions from a Wiger-Dyson distribution to a Poisson distribution, as the cutoff level q increases. This transition behavior is clearly shown in Figs. 1 and 2. Transition points were obtained by the chi-square test, which is a standard technique for goodness-of-fit test that determines whether a set of sample data have been drawn from a hypothetical solution. In our calculation, q corresponds to the transition point when the chi-square test to Poisson distribution is less than a critical value of confidence level 99.999%. We found that the NNSDs of microarray profiles of yeast mutants are well described by the Poisson distribution as q 0.79, 0.76, 0.77 obtained by methods I, II, and III, respectively. For microarray profiles of yeast cells response to environmental changes, we found Poisson distribution as q 0.89 by method I, q 0.88 by method II, and q 0.86 by method III. 031924-2

APPLICATION OF RANDOM MATRIX THEORY TO¼ FIG. 2. Color online Nearest neighbor spacing distributions symbols of gene coexpression correlation matrices constructed from microarray profiles of yeast cells response to environment changes by different cut methods with cutoff value q. The solid line is the Wigner-Dyson distribution and the dashed line is the Poisson distribution. a Method I: c=0 if c q. b Method II: c=0 if c q. c Method III: c=0 if c q. To view the structural change of gene coexpression networks after the removal of weaker correlations, we constructed a series of gene coexpression networks from the CCMs of both microarray profiles. The networks were visualized using Biolayout 18, where each gene in a network is a node and there is a link between two genes if the Pearson correlation between their expressions is not 0 after removal of small values of coexpression correlation coefficients. Figures 3 and 4 show yeast gene coexpression networks at different cutoff values. One can see from Figs 3 and 4 that the networks described by the Poisson distribution are very different from the networks described by the GOE statistics. Furthermore, the transition of the NNSD from the GOE distribution to the Poisson distribution is accompanied by the disassociation of coexpression networks. Isolated modules can be easily identified in Figs. 3 and 4 at the transition points. Our detailed analysis showed that, for the microarray profile of yeast mutants, the coexpression network at q=0.79 obtained by method I has 39 modules with number of genes ranging from 3 to 597; the coexpression network at q=0.76 by method II has 41 modules with number of genes ranging from 3 to 863; and the coexpression network at q=0.77 by method III has 11 modules with number of genes ranging from 3 to 526. For microarray profiles of the yeast cells response, method I at q=0.89 generates 13 modules with sizes ranging from 3 to 636; method II at q=0.88 gives 17 modules with sizes ranging from 3 to 510 genes; and method III at q=0.86 generates 2 modules with 4 and 395 genes. We have also found that the modules obtained by different cut methods have different structures, which is understandable because different cut methods emphasize different types of correlations between genes. We found that structures of gene modules found in Figs. 3 and 4 are very different from the remaining structures constructed from a random matrix after the removal of small values of matrix elements using the same cut methods. We believe that the gene modules obtained by using the RMT are functional modules and we are now experimentally evaluating their biological functionalities 15. To clearly show that the gene coexpression networks revealed by the RMT from the microarray coexpression data indeed are of a biological origin different from structures FIG. 3. Color online Coexpression networks constructed from microarray profiles of yeast mutants by different cut methods with cutoff value q. Lines represent correlations between genes. a Method I: c=0 if c q. b Method II: c=0 if c q. c Method III: c=0 if c q. 031924-3

LUO et al. FIG. 4. Color online Coexpression networks constructed from microarray profiles of yeast cells response to environment change by different cut methods with cutoff value q. Lines represent correlations between genes. a Method I: c=0 if c q. b Method II: c=0 if c q. c Method III: c=0 if c q. constructed from a random matrix, we studied network properties of two types of random correlation matrices using the RMT. The first one is a completely random correlation matrix constructed by assigning its elements with random values distributed within an interval 1,1. The second one is the column shuffled microarray correlation matrix, where the expression values of a each gene of a yeast cell response to environment changes are randomly shuffled many times 100 times in this paper before constructing its corresponding FIG. 6. Color online Networks at different cutoff level q constructed from a a completely random matrix and b shuffled microarray profiles of yeast cells response to environment changes. Lines represent correlations between genes. FIG. 5. Color online Nearest neighbor spacing distributions symbols of random correlation matrices at different cutoff level q constructed from a a 2000 2000 completely random matrix and b shuffled microarray profiles of yeast cells response to environment changes. The solid line is the Wigner-Dyson distribution and the dashed line is the Poisson distribution. correlation matrix. Small values of matrix elements in these two matrices are then removed by using cut method I to construct CCMs and networks at a different cutoff levels. We found that NNSDs of these randomized CCMs also exhibit transitions from a Wiger-Dyson distribution to a Poisson distribution, as shown in Fig. 5. The critical cutoff value for the completely random correlation matrix is q = 0.999 97. Our detailed analysis showed that the critical cutoff value increases as the matrix dimension increases. For the shuffled microarray profiles of the yeast cells response to environmental changes, we found a Poisson distribution as q 0.34. Figure 6 illustrates the gene networks constructed from these two random correlation matrices at a different cutoff value. Notably, the networks obtained from the random correlation matrices are very different from the coexpression networks constructed from the original microarray profiles, even though they are all described by the Poisson distribution. Comparing with large highly connected clusters 031924-4

APPLICATION OF RANDOM MATRIX THEORY TO¼ hundreds of genes in the coexpression network, the network of the completely random matrix at q=0.999 97 only has isolated small clusters of a few nodes 2 3 ; the network of the shuffled microarray profiles at q=0.34 only has small chainlike or treelike small clusters with size from 3 to 15 nodes. We searched the gene ontology GO concurrence of these clusters using the GO term finder 19 of the Saccharomyces Genome Database. No significant GO term exists for all top 20 large clusters, indicating that the clusters obtained from the shuffled microarray profiles are not biologically meaningful. In summary, we have provided evidence that genomewide gene coexpression, as represented by the CCMs of microarray profiles studied here, is described by the GOE statistics of the RMT. However, the successive removal of small correlations in the CCM leads to a transition from the GOE statistics to the Poisson statistics. Our results indicate that although biological systems are very different from complex physical systems 20, they follow the same universal Wigner distribution and the Poisson distribution. The transition we found may open a new avenue for the identification of functional modules from microarray profiles. Different from existing clustering methods, cutoffs or thresholds used for separating genes in our approach is determined selfconsistently by the transition given by the random matrix theory. This work was supported by The United States Department of Energy under the Genomics: GTL through Shewanella Federation, Microbial Genome Program and Natural and Accelerated Bioremediation Research Programs of the Office of Biological and Environmental Research, Office of Science. F.L. was also supported by the NSF EPSCot Grant No. EPS-0447660. J.Z. was supported by the National Natural Science Foundation of China under Grant No. 30570432 and partially by the Materials Sciences and Engineering Division Program of the DOE Office of Science. Oak Ridge National Laboratory is managed by University of Tennessee-Battelle LLC for the Department of Energy under Contract No. DE-AC05-00OR22725. 1 A.-L. Barabasi and Z. N. Oltvai, Nat. Rev. Genet. 5, 101 2004. 2 J. M. Stuart, E. Segal, D. Koller, and S. K. Kim, Science 302, 249 2002. 3 M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Bostein, Proc. Natl. Acad. Sci. U.S.A. 95, 14863 1998. 4 X. Zhou, M. C. Kao, and W. H. Wong, Proc. Natl. Acad. Sci. U.S.A. 99, 2783 2002. 5 D. J. Lockhart et al., Nat. Biotechnol. 14, 1675 1996. 6 M. Schena, D. Shalon, R. Davis, and P. Brown, Science 270, 467 1995. 7 E. P. Wigner, SIAM Rev. 9, 1 1967. 8 E. P. Wigner, Proc. Cambridge Philos. Soc. 299, 189 1951. 9 E. Hofstetter and M. Schreiber, Phys. Rev. B 48, 16979 1993. 10 J. X. Zhong, U. Grimm, R. A. Romer, and M. Schreiber, Phys. Rev. Lett. 80, 3996 1998. 11 J. X. Zhong and T. Geisel, Phys. Rev. E 59, 4071 1999. 12 O. Bohigas, M. J. Giannoni, and C. Schmit, Phys. Rev. Lett. 52, 1 1984. 13 P. Seba, Phys. Rev. Lett. 91, 198104 2003. 14 V. Plerou, P. Gopikrishnan, B. Rosenow, L. A. Nunes Amaral, and H. E. Stanley, Phys. Rev. Lett. 83, 1471 1999. 15 F. Luo, Y. F. Yang, J. X. Zhong, H. C. Gao, L. Khan, D. K. Thompson, and J. Z. Zhou unpublished. 16 T. R. Hughes et al., Cell 102, 109 2000. 17 A. P. Gasch et al., Mol. Biol. Cell 11, 4241 2000. 18 A. J. Enright and C. A. Ouzounis, Bioinformatics 17, 853 2001. 19 http://db.yeastgenome.org/cgi-bin/go/gotermfinder. 20 L. H. Hartwell, J. J. Hopfiled, S. Leibler, and A. W. Murray, Nature London 402, C47 1999. 031924-5