
Weighted gene co-expression analysis Yuehua Cui June 7, 2013

Weighted gene co-expression network (WGCNA): a type of scale-free network. A scale-free network is a network whose degree distribution follows a power law, at least asymptotically. That is, the fraction P(k) of nodes in the network having k connections to other nodes goes for large values of k as $P(k) \sim c\,k^{-\gamma}$, where c is a normalization constant and γ is a parameter whose value is typically in the range 2 < γ < 3, although occasionally it may lie outside these bounds. Scale-free networks are noteworthy because many empirically observed networks appear to be scale-free, including the world wide web, citation networks, biological networks, airline networks and some social networks. (From Wikipedia)


[Figures] Reprinted from Linked: The New Science of Networks by Albert-Laszlo Barabasi. Scale-Free Network Models in Epidemiology, adapted from J.B. Dunham and F.B. Berlin.

Flight connections and hub airports. In a scale-free network, the nodes with the largest number of links (connections) are the most important. (Courtesy of A. Barabasi)

Note: The rest of the slides about WGCNA are adapted or modified from the slides on Dr. Steve Horvath's website at: http://www.genetics.ucla.edu/labs/horvath/coexpressionnetwork/

Philosophy of Weighted Gene Co-Expression Network Analysis. Understand the system instead of reporting a list of individual parts. Describe the functioning of the engine instead of enumerating individual nuts and bolts. Focus on modules as opposed to individual genes; this greatly alleviates the multiple testing problem. Network terminology is intuitive to biologists.

How to define a gene coexpression network?

Gene Co-expression Networks. In gene co-expression networks, each gene corresponds to a node. Two genes are connected by an edge if their expression values are highly correlated. The definition of high correlation is somewhat tricky. One can use statistical significance, but we propose a criterion for picking the threshold parameter: the scale free topology criterion.

P(k) vs k in scale free networks. Scale free topology refers to the frequency distribution of the connectivity k; p(k) = proportion of nodes that have connectivity k. [Figure: histogram "Frequency Distribution of Connectivity"; x-axis: connectivity k, y-axis: frequency]

How to check Scale Free Topology? Idea: log-transform p(k) and k and look at the scatter plot. Fit a linear model to the log-transformed values; the $R^2$ index of the fit can be used to quantify goodness of fit.
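The log-log check can be sketched in a few lines of Python. This is a minimal illustration, not the WGCNA implementation: the function name `scale_free_fit` is hypothetical, connectivities are assumed to be positive integers (as in an unweighted network), and no binning of k is done.

```python
import math
from collections import Counter

def scale_free_fit(connectivities):
    """Fit log10 p(k) = intercept + slope * log10 k by least squares.

    Returns (slope, r_squared). Nodes with k = 0 are skipped because
    log10(0) is undefined.
    """
    total = len(connectivities)
    counts = Counter(k for k in connectivities if k > 0)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(counts[k] / total) for k in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared

# Toy network (made-up data): the number of nodes with degree k is
# proportional to k^-2, so the log-log plot is an exact straight line.
degrees = []
for k in range(1, 7):
    degrees.extend([k] * (3600 // (k * k)))

slope, r_squared = scale_free_fit(degrees)
print(round(slope, 3), round(r_squared, 3))
```

A slope close to -γ together with an R^2 near 1 indicates approximate scale free topology. For real weighted networks the connectivity is continuous, so some form of binning would be needed before this fit.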

Our "holistic" view. Weighted network view: all genes are connected; connection widths = connection strengths. Unweighted view: some genes are connected; all connections are equal. Hard thresholding may lead to an information loss: if two genes are correlated with r = 0.79, they are deemed unconnected with regard to a hard threshold of τ = 0.8.

Network = Adjacency Matrix. A network can be represented by an adjacency matrix, $A = [a_{ij}]$, that encodes whether/how a pair of nodes is connected. A is a symmetric matrix with entries in [0,1]. For an unweighted network, entries are 1 or 0 depending on whether or not 2 nodes are adjacent (connected). For weighted networks, the adjacency matrix reports the connection strength between gene pairs.

Connectivity. Gene connectivity = row sum of the adjacency matrix. For unweighted networks = number of direct neighbors. For weighted networks = sum of connection strengths to other nodes. Connectivity: $k_i = \sum_{j \neq i} a_{ij}$
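As a small illustration of the row-sum definition, the sketch below computes the connectivity of each node for one unweighted and one weighted toy adjacency matrix. The function name `connectivity` and both matrices are made up for this example; the diagonal is excluded so that a node is not counted as its own neighbor.

```python
def connectivity(adjacency):
    """Return k_i = sum over j != i of a_ij for every node i."""
    n = len(adjacency)
    return [sum(adjacency[i][j] for j in range(n) if j != i) for i in range(n)]

# Unweighted toy network: entries are 0 or 1, so k_i counts direct neighbors.
unweighted = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]

# Weighted toy network: entries are connection strengths in [0, 1].
weighted = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]

k_unweighted = connectivity(unweighted)
k_weighted = connectivity(weighted)
print(k_unweighted)
print([round(k, 2) for k in k_weighted])
```

In the unweighted case the result is the familiar node degree; in the weighted case it is a continuous quantity.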

How to construct a weighted gene co-expression network?

Using an adjacency function to define a network. Measure co-expression by a similarity s(i,j) in [0,1], e.g. the absolute value of the Pearson correlation. Define an adjacency matrix A(i,j) using an adjacency function AF(s(i,j)). Here we consider 2 classes of AFs. Step function: AF(s) = I(s > τ) with parameter τ (unweighted network). Power function: AF(s) = $s^b$ with parameter b (weighted network). The choice of the AF parameters (τ, b) determines the properties of the network.
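The two adjacency functions can be compared directly on a pair of similarities that fall just below and just above a hard threshold. The sketch below is illustrative only: the helper names (`pearson`, `similarity`, `step_adjacency`, `power_adjacency`) and the expression values are assumptions, not part of the WGCNA package.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def similarity(x, y):
    """Co-expression similarity s in [0, 1]: absolute Pearson correlation."""
    return abs(pearson(x, y))

def step_adjacency(s, tau):
    """Hard thresholding: 1 if s exceeds tau, else 0 (unweighted network)."""
    return 1 if s > tau else 0

def power_adjacency(s, b):
    """Soft thresholding: s raised to the power b (weighted network)."""
    return s ** b

# Two similarities on either side of a hard threshold of 0.8.
for s in (0.79, 0.81):
    print(s, step_adjacency(s, 0.8), round(power_adjacency(s, 6), 3))

# Similarity of two made-up, anti-correlated profiles: the sign is ignored.
s_anti = similarity([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```

The step function jumps from 0 to 1 between s = 0.79 and s = 0.81, while the power function changes only slightly, which is the information-preserving behavior that motivates soft thresholding.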

The power adjacency function results in a weighted gene network: $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$. Often choosing β = 6 works well, but in general we use the scale free topology criterion described in Zhang and Horvath (2005).

Comparing the power adjacency functions with the step function. [Figure: adjacency (= connection strength) plotted against gene co-expression similarity]

The scale free topology criterion for choosing the parameter values of an adjacency function. A) Consider only those parameter values that result in approximate scale free topology. B) Select the parameters that result in the highest mean number of connections. Criterion A is motivated by the finding that most metabolic networks (including gene co-expression networks, protein-protein interaction networks and cellular networks) have been found to exhibit a scale free topology. Criterion B leads to high power for detecting modules (clusters of genes) and hub genes.

General Framework for Network Analysis. (1) Define a gene co-expression similarity. (2) Define a family of adjacency functions. (3) Determine the AF parameters. (4) Define a measure of node dissimilarity. (5) Identify network modules (clustering). (6) Relate network concepts to each other. (7) Relate the network concepts to external gene or sample information.

How to detect network modules?

Steps for defining gene modules. (1) Define a dissimilarity measure between the genes. Standard choice: dissim(i,j) = 1 - abs(correlation). Choice by the network community: 1 - Topological Overlap Matrix (TOM), used here. (2) Use the dissimilarity in hierarchical clustering. (3) Define modules as branches of the hierarchical clustering tree. (4) Visualize the modules and the clustering results in a heatmap plot.

The topological overlap dissimilarity is used as input of hierarchical clustering: $\mathrm{TOM}_{ij} = \dfrac{\sum_{u \neq i,j} a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}$, $\mathrm{DistTOM}_{ij} = 1 - \mathrm{TOM}_{ij}$, with $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$. Generalized in Zhang and Horvath (2005) to the case of weighted networks. Generalized in Yip and Horvath (2006) to higher order interactions.
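A direct, unoptimized transcription of the topological overlap formula may make it easier to follow. The function name `tom_dissimilarity` and the 4-node toy network are assumptions made for illustration; the diagonal of the adjacency matrix is ignored, and the overlap of a node with itself is set to 1 by convention.

```python
def tom_dissimilarity(adjacency):
    """Return (TOM, 1 - TOM) for a symmetric adjacency matrix.

    TOM_ij = (sum_{u != i,j} a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij),
    where k_i is the connectivity of node i (row sum without the diagonal).
    """
    n = len(adjacency)
    k = [sum(adjacency[i][j] for j in range(n) if j != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]  # self-overlap is 1 by convention
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = sum(adjacency[i][u] * adjacency[u][j]
                         for u in range(n) if u != i and u != j)
            a_ij = adjacency[i][j]
            tom[i][j] = (shared + a_ij) / (min(k[i], k[j]) + 1 - a_ij)
    dist = [[1.0 - tom[i][j] for j in range(n)] for i in range(n)]
    return tom, dist

# Toy unweighted network with edges 0-1, 0-2, 1-2 and 2-3.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
tom, dist_tom = tom_dissimilarity(adjacency)
print(tom[0][1], tom[0][3], dist_tom[0][3])
```

Nodes 0 and 1 are connected and share their only other neighbor, so their overlap is maximal; nodes 0 and 3 are not connected but share neighbor 2, so their overlap is intermediate rather than zero.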

Using the TOM matrix to cluster genes. To group nodes with high topological overlap into modules (clusters), we typically use average linkage hierarchical clustering coupled with the TOM distance measure. Once a dendrogram is obtained from a hierarchical clustering method, we choose a height cutoff to arrive at a clustering. Here modules correspond to branches of the dendrogram. [Figure: TOM plot, with the hierarchical clustering dendrogram above the TOM matrix; genes correspond to rows and columns, and modules correspond to branches]
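The clustering step can be sketched as a naive average linkage procedure that stops merging once the closest pair of clusters is farther apart than the chosen height cutoff, which is equivalent to cutting the dendrogram at that height. The function name `average_linkage_modules`, the toy dissimilarity matrix and the cutoff of 0.5 are all assumptions for illustration; in practice a library routine would be used instead of this cubic-time loop.

```python
def average_linkage_modules(dist, cut_height):
    """Agglomerative clustering with average linkage and a fixed height cut.

    dist is a symmetric dissimilarity matrix (for example 1 - TOM).
    Returns the clusters as sorted lists of gene indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (average dissimilarity, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        avg, a, b = best
        if avg > cut_height:
            break  # every remaining merge would happen above the cutoff
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Made-up dissimilarities: genes 0-2 are close, genes 3-4 are close.
dist = [
    [0.00, 0.10, 0.20, 0.90, 0.90],
    [0.10, 0.00, 0.20, 0.90, 0.90],
    [0.20, 0.20, 0.00, 0.90, 0.90],
    [0.90, 0.90, 0.90, 0.00, 0.15],
    [0.90, 0.90, 0.90, 0.15, 0.00],
]
modules = average_linkage_modules(dist, cut_height=0.5)
print(modules)
```

With the cutoff at 0.5 the two tight groups are kept as separate modules; raising the cutoff above 0.9 would merge them into one.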

Different Ways of Depicting Gene Modules. Topological overlap plot: 1) rows and columns correspond to genes; 2) red boxes along the diagonal are modules; 3) color bands = modules. Multidimensional scaling (traditional view): the idea is to use the network distance in MDS. [Figure panels: topological overlap plot, gene functions, multidimensional scaling]

Heatmap view of a module. Columns = tissue samples, rows = genes. The color band indicates module membership. Message: characteristic vertical bands indicate tight co-expression of module genes.

Module eigengene = measure of over-expression = average redness = 1st PC of a given module. [Figure: heatmap of the brown module (rows = genes, columns = microarrays) above a plot of the brown module eigengene across samples]

Gene expression database, a conceptual view. [Figure: a gene expression matrix (genes by samples) holding the gene expression levels, accompanied by gene annotations and sample annotations]

Singular value decomposition (SVD). Use SVD to get the eigengenes. Let X denote an m x n matrix of real-valued data with rank r ≤ min(m, n), where m is the number of genes and n the number of samples. The equation for the singular value decomposition of X is $X = U S V^T$, where U is an m x n matrix, S is an n x n diagonal matrix, and $V^T$ is also an n x n matrix.

$U^T U = V^T V = I$

$X = U\,\mathrm{diag}(w_1, \ldots, w_n)\,V^T$. The $w_i$ are called the singular values of X. If X is singular, some of the $w_i$ will be 0. In general, rank(X) = number of nonzero $w_i$. The SVD is mostly unique (up to permutation of the singular values, or if some $w_i$ are equal). For square, nonsingular X: $X^{-1} = (V^T)^{-1} S^{-1} U^{-1} = V S^{-1} U^T$. The columns of V correspond to eigenvectors of $X^T X$.
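To connect the SVD to the module eigengene, the sketch below approximates the first right singular vector of a standardized module expression matrix by power iteration on $X^T X$. This is a stand-in technique chosen so that the example needs only the standard library; a full SVD routine would normally be used. The function names, the iteration count, the sign convention (align with the average standardized expression) and the expression values are all assumptions, and genes with constant expression are not handled.

```python
import math

def standardize(row):
    """Center a gene's profile and scale it to unit sample variance."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    return [(v - mean) / sd for v in row]

def module_eigengene(expr, iterations=200):
    """Approximate the first right singular vector of the standardized
    genes x samples matrix by power iteration on X^T X."""
    x = [standardize(row) for row in expr]
    m, n = len(x), len(x[0])
    v = [1.0 + 0.1 * j for j in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        xv = [sum(x[i][j] * v[j] for j in range(n)) for i in range(m)]  # X v
        w = [sum(x[i][j] * xv[i] for i in range(m)) for j in range(n)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # A singular vector is only defined up to sign; orient it so that it
    # correlates positively with the average standardized expression.
    avg = [sum(x[i][j] for i in range(m)) / m for j in range(n)]
    if sum(a * b for a, b in zip(avg, v)) < 0:
        v = [-c for c in v]
    return v

# Made-up module of 3 genes measured in 4 samples (rows = genes).
module_expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 3.0, 2.0, 5.0],
]
eigengene = module_eigengene(module_expr)
print([round(c, 3) for c in eigengene])
```

The result has one value per sample and unit length, and it summarizes the shared expression trend of the module's genes.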

Module eigengenes can be used to determine whether 2 modules are correlated. If the correlation of the MEs is high, consider merging the modules. Eigengenes can be used to build separate networks. [Figure: pairwise scatter plot matrix with correlations for Martingale.Re, ME.blue, ME.brown, ME.green, ME.grey, ME.turquoise and ME.yellow]

Consensus eigengene networks in male and female mouse liver data and their relationship to physiological traits. Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology.

Important Task in Many Genomic Applications: given a network (pathway) of interacting genes, how to find the central players? Gene connectivity = row sum of the adjacency matrix. For unweighted networks = number of direct neighbors. For weighted networks = sum of connection strengths to other nodes: $k_i = \sum_{j \neq i} a_{ij}$. So the value of $k_i$ indicates the importance of the gene in a network.

A Case Study MC Oldham, S Horvath, DH Geschwind (2006) Conservation and evolution of gene coexpression networks in human and chimpanzee brain. PNAS

What changed? Despite pronounced phenotypic differences, genomic similarity is ~96% (including single-base substitutions and indels) [1]. Similarity is even higher in protein-coding regions. [1] Cheng, Z. et al. Nature 437, 88-93 (2005). Image courtesy of Todd Preuss (Yerkes National Primate Research Center).

Assessing the contribution of regulatory changes to human evolution. Hypothesis: changes in the regulation of gene expression were critical during recent human evolution (King & Wilson, 1975). Microarrays are ideally suited to test this hypothesis by comparing expression levels for thousands of genes simultaneously.

Gene expression is more strongly preserved than gene connectivity. [Figure: scatter plots of chimp vs human expression (Cor = 0.93) and chimp vs human connectivity (Cor = 0.60)] Hypothesis: molecular wiring makes us human. Raw data from Khaitovich et al., 2004. (Mike Oldham)

[Figure: panels A and B, human and chimp]

[Figure annotations: p = 1.33 x 10^-4, p = 8.93 x 10^-4, p = 1.35 x 10^-6, p = 1.33 x 10^-4]

Connectivity diverges across brain regions whereas expression does not

Conclusions: chimp/human. Gene expression is highly preserved across species' brains. Gene co-expression is less preserved. Some modules are highly preserved. Gene modules correspond roughly to brain architecture. Species-specific hubs can be validated in silico using sequence comparisons.

Software and Data Availability. Sample data and R software tutorials can be found at the following webpage: http://www.genetics.ucla.edu/labs/horvath/coexpressionnetwork An R package and accompanying tutorial can be found here: http://www.genetics.ucla.edu/labs/horvath/coexpressionnetwork/rpackages/wgcna/ Tutorial for this R package: http://www.genetics.ucla.edu/labs/horvath/coexpressionnetwork/rpackages/wgcna/tutorialwgcnapackage.doc

What is different from other analyses? Emphasis on modules (pathways) instead of individual genes: this greatly alleviates the problem of multiple comparisons (less than 20 comparisons versus 20000 comparisons). Use of intramodular connectivity to find key drivers: it quantifies module membership (centrality), and highly connected genes have an increased chance of validation. Module definition is based on gene expression data: no prior pathway information is used for module definition, and two modules (eigengenes) can be highly correlated. Emphasis on a unified approach for relating variables. Default: power of a correlation. Rationale: it puts different data sets on the same mathematical footing and considers effect size estimates (cor) as well as significance level; p-values are highly affected by sample sizes (cor = 0.01 is highly significant when dealing with 100000 observations). Technical details: soft thresholding with the power adjacency function, and the topological overlap matrix to measure interconnectedness.
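The remark about sample size can be checked with a short calculation. The sketch below uses the usual t statistic for a correlation coefficient and, as a simplifying assumption, replaces the t distribution by the standard normal, which is accurate for large n and only rough for small n. The function name `correlation_p_value` is hypothetical.

```python
import math

def correlation_p_value(r, n):
    """Two-sided p-value for H0: correlation = 0.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) and a normal approximation
    to the t distribution (an assumption that is reasonable for large n).
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return math.erfc(abs(t) / math.sqrt(2.0))

p_large = correlation_p_value(0.01, 100000)  # tiny effect, huge sample
p_small = correlation_p_value(0.01, 100)     # same effect, small sample
print(p_large, p_small)
```

The same correlation of 0.01 gives a p-value below 0.01 with 100000 observations but is nowhere near significance with 100 observations, which is why an effect size such as the correlation itself is reported alongside the p-value.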
