Weighted gene co-expression analysis. Yuehua Cui June 7, 2013


Weighted gene co-expression network (WGCNA). A type of scale-free network: a scale-free network is a network whose degree distribution follows a power law, at least asymptotically. That is, the fraction P(k) of nodes in the network having k connections to other nodes goes for large values of k as $P(k) \sim c\,k^{-\gamma}$, where c is a normalization constant and γ is a parameter whose value is typically in the range 2 < γ < 3, although occasionally it may lie outside these bounds. Scale-free networks are noteworthy because many empirically observed networks appear to be scale-free, including the world wide web, citation networks, biological networks, airline networks and some social networks. (From Wikipedia)

Reprinted from Linked: The New Science of Networks by Albert-Laszlo Barabasi. Scale-Free Network Models in Epidemiology, adapted from J.B. Dunham and F.B. Berlin.

Flight connections and hub airports. In a scale-free network, the nodes with the largest number of links (connections) are most important! Courtesy of A. Barabasi.

Note: The rest of the slides about WGCNA are adapted or modified from the slides on Dr. Steve Horvath's website at: http://www.genetics.ucla.edu/labs/horvath/coexpressionnetwork/

Philosophy of Weighted Gene Co-Expression Network Analysis. Understand the system instead of reporting a list of individual parts. Describe the functioning of the engine instead of enumerating individual nuts and bolts. Focus on modules as opposed to individual genes; this greatly alleviates the multiple testing problem. Network terminology is intuitive to biologists.

How to define a gene coexpression network?

Gene Co-expression Networks. In gene co-expression networks, each gene corresponds to a node. Two genes are connected by an edge if their expression values are highly correlated. The definition of high correlation is somewhat tricky: one can use statistical significance, but we propose a criterion for picking the threshold parameter, the scale free topology criterion.

P(k) vs k in scale free networks. Scale free topology refers to the frequency distribution of the connectivity k; p(k) = proportion of nodes that have connectivity k. [Figure: frequency distribution of connectivity; x-axis: connectivity k, y-axis: frequency.]

How to check Scale Free Topology? Idea: log-transform p(k) and k and look at scatter plots. Under a power law the points fall on a straight line, so the $R^2$ index of a fitted linear model can be used to quantify goodness of fit.
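The log-log check can be sketched in a few lines of Python. This is a minimal illustration, not the WGCNA implementation: the function name `scale_free_fit` is hypothetical, and it assumes integer connectivities (as in an unweighted network) with at least two distinct values. For a weighted network the connectivities would first have to be binned.

```python
import math
from collections import Counter

def scale_free_fit(connectivities):
    """Regress log10 p(k) on log10 k and return (slope, r_squared).

    Assumes integer connectivities with at least two distinct positive values.
    """
    counts = Counter(connectivities)
    n = len(connectivities)
    # p(k) = proportion of nodes with connectivity k; skip k = 0 (log undefined).
    points = [(math.log10(k), math.log10(c / n)) for k, c in counts.items() if k > 0]
    xs, ys = zip(*points)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # estimate of -gamma under a power law
    r_squared = sxy ** 2 / (sxx * syy) # goodness of fit of the straight line
    return slope, r_squared

# Toy data following p(k) proportional to k^-2 exactly.
toy = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope, r_squared = scale_free_fit(toy)
```

A slope close to -γ together with an $R^2$ close to 1 indicates approximate scale free topology.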

Our 'holistic' view. Weighted network view: all genes are connected; connection widths = connection strengths. Unweighted view: some genes are connected; all connections are equal. Hard thresholding may lead to an information loss. If two genes are correlated with r=0.79, they are deemed unconnected with regard to a hard threshold of τ=0.8.

Network = Adjacency Matrix. A network can be represented by an adjacency matrix, $A=[a_{ij}]$, that encodes whether/how a pair of nodes is connected. A is a symmetric matrix with entries in [0,1]. For unweighted networks, entries are 1 or 0 depending on whether or not 2 nodes are adjacent (connected). For weighted networks, the adjacency matrix reports the connection strength between gene pairs.

Connectivity. Gene connectivity = row sum of the adjacency matrix. For unweighted networks = number of direct neighbors. For weighted networks = sum of connection strengths to other nodes. $k_i = \sum_{j \neq i} a_{ij}$
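As a small illustration of the row-sum definition, the sketch below computes connectivities from an adjacency matrix stored as a list of lists. The function name `connectivity` and the two toy matrices are made up for the example. The diagonal is skipped, so the result does not depend on whether self-adjacencies are stored as 0 or 1.

```python
def connectivity(adj):
    """Return k_i = sum over j != i of a_ij for every node i."""
    n = len(adj)
    return [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]

# Unweighted toy network: a path 0 - 1 - 2 (entries are 0 or 1).
unweighted = [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]]

# Weighted toy network: entries are connection strengths in [0, 1].
weighted = [[1.0, 0.5, 0.1],
            [0.5, 1.0, 0.2],
            [0.1, 0.2, 1.0]]

k_unweighted = connectivity(unweighted)  # number of direct neighbors
k_weighted = connectivity(weighted)      # sum of connection strengths
```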

How to construct a weighted gene co-expression network?

Using an adjacency function to define a network. Measure co-expression by a similarity s(i,j) in [0,1], e.g. the absolute value of the Pearson correlation. Define an adjacency matrix A(i,j) using an adjacency function AF(s(i,j)). Here we consider 2 classes of AFs: the step function $AF(s) = I(s > \tau)$ with parameter τ (unweighted network), and the power function $AF(s) = s^{\beta}$ with parameter β (weighted network). The choice of the AF parameters (τ, β) determines the properties of the network.

Power adjacency function results in a weighted gene network: $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$. Often choosing β=6 works well, but in general we use the scale free topology criterion described in Zhang and Horvath 2005.
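The two adjacency functions can be compared on the slide's own example of a correlation of 0.79. The sketch below is illustrative only; the function names are hypothetical, and the step function uses a strict inequality as an assumption.

```python
def step_adjacency(s, tau):
    """Hard thresholding: 1 if the similarity exceeds tau, else 0."""
    return 1 if s > tau else 0

def power_adjacency(s, beta):
    """Soft thresholding: raise the similarity in [0, 1] to the power beta."""
    return s ** beta

s = 0.79                           # absolute Pearson correlation of a gene pair
hard = step_adjacency(s, tau=0.8)  # the pair is treated as unconnected
soft = power_adjacency(s, beta=6)  # the pair keeps a graded connection strength
```

With the hard threshold the pair drops out of the network, whereas the power function retains a connection strength of roughly 0.24, which is the information loss that soft thresholding avoids.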

Comparing the power adjacency functions with the step function. [Figure: adjacency (connection strength) plotted against gene co-expression similarity.]

The scale free topology criterion for choosing the parameter values of an adjacency function. A) Consider only those parameter values that result in approximate scale free topology. B) Select the parameters that result in the highest mean number of connections. Criterion A is motivated by the finding that most metabolic networks (including gene co-expression networks, protein-protein interaction networks and cellular networks) have been found to exhibit a scale free topology. Criterion B leads to high power for detecting modules (clusters of genes) and hub genes.
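Criteria A and B can be sketched as a two-step selection over candidate parameter values. Everything specific in this example is an assumption made for illustration: the function name `choose_power`, the $R^2$ cutoff of 0.8 used to decide what counts as "approximate" scale free topology, and the table of fit results, which is invented rather than taken from real data.

```python
def choose_power(candidates, r2_cutoff=0.8):
    """Apply criterion A (scale-free fit) and then criterion B (mean connectivity).

    candidates: list of dicts with keys "beta", "r2" and "mean_k".
    Returns the chosen beta, or None if no candidate passes criterion A.
    """
    # A) keep only parameters giving approximate scale free topology
    admissible = [c for c in candidates if c["r2"] >= r2_cutoff]
    if not admissible:
        return None
    # B) among those, take the highest mean number of connections
    return max(admissible, key=lambda c: c["mean_k"])["beta"]

# Hypothetical fit results for four candidate powers (invented numbers).
table = [
    {"beta": 2, "r2": 0.35, "mean_k": 120.0},
    {"beta": 4, "r2": 0.71, "mean_k": 45.0},
    {"beta": 6, "r2": 0.88, "mean_k": 18.0},
    {"beta": 8, "r2": 0.92, "mean_k": 8.0},
]
chosen = choose_power(table)
```

Because mean connectivity falls as the power increases, the rule in practice picks the smallest power that passes the scale-free check.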

General Framework for Network Analysis. 1) Define a gene co-expression similarity. 2) Define a family of adjacency functions. 3) Determine the AF parameters. 4) Define a measure of node dissimilarity. 5) Identify network modules (clustering). 6) Relate network concepts to each other. 7) Relate the network concepts to external gene or sample information.

How to detect network modules?

Steps for defining gene modules. 1) Define a dissimilarity measure between the genes. Standard choice: dissim(i,j) = 1 - abs(correlation). Choice by network community: 1 - Topological Overlap Matrix (TOM), used here. 2) Use the dissimilarity in hierarchical clustering. 3) Define modules as branches of the hierarchical clustering tree. 4) Visualize the modules and the clustering results in a heatmap plot.

The topological overlap dissimilarity is used as input of hierarchical clustering. $\mathrm{TOM}_{ij} = \frac{\sum_{u \neq i,j} a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}$, $\mathrm{DistTOM}_{ij} = 1 - \mathrm{TOM}_{ij}$, with $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$. Generalized in Zhang and Horvath (2005) to the case of weighted networks. Generalized in Yip and Horvath (2006) to higher order interactions.
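A direct, unoptimized sketch of the topological overlap computation is given below for a small adjacency matrix held as a list of lists. The function names are hypothetical, and setting the diagonal of the TOM to 1 is a convention assumed here. The toy network is an unweighted path 0 - 1 - 2.

```python
def tom_matrix(adj):
    """Topological overlap for every pair of nodes of a symmetric adjacency matrix."""
    n = len(adj)
    # Connectivity k_i = sum of adjacencies to all other nodes.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]  # diagonal set to 1 (assumed convention)
    for i in range(n):
        for j in range(i + 1, n):
            # Strength of connections that i and j share through third nodes u.
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u != i and u != j)
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom

def tom_dissimilarity(adj):
    """DistTOM = 1 - TOM, the input to hierarchical clustering."""
    return [[1.0 - t for t in row] for row in tom_matrix(adj)]

path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
tom = tom_matrix(path)
dist = tom_dissimilarity(path)
```

In the path example nodes 0 and 2 are not directly connected, yet they receive a nonzero overlap because they share neighbor 1.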

Using the TOM matrix to cluster genes. To group nodes with high topological overlap into modules (clusters), we typically use average linkage hierarchical clustering coupled with the TOM distance measure. Once a dendrogram is obtained from a hierarchical clustering method, we choose a height cutoff to arrive at a clustering. Here modules correspond to branches of the dendrogram. [Figure: TOM plot, with the hierarchical clustering dendrogram above the TOM matrix; genes correspond to rows and columns, and modules correspond to branches.]
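Average linkage clustering with a fixed height cutoff can be sketched without any libraries. The version below is a naive illustration under stated assumptions: the function name is hypothetical, the distance matrix is a small symmetric list of lists (for instance a TOM-based distance), and merging simply stops once the closest pair of clusters is farther apart than the cutoff, which for average linkage gives the same groups as cutting the dendrogram at that height.

```python
def average_linkage_clusters(dist, cut_height):
    """Agglomerate with average linkage; stop when the next merge exceeds cut_height."""
    clusters = [[i] for i in range(len(dist))]

    def average_distance(a, b):
        # Mean of all pairwise distances between members of the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cut_height:
            break  # remaining branches stay separate modules
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Toy distance matrix: genes 0 and 1 are close, genes 2 and 3 are close.
toy_dist = [[0.0, 0.1, 0.9, 0.9],
            [0.1, 0.0, 0.9, 0.9],
            [0.9, 0.9, 0.0, 0.2],
            [0.9, 0.9, 0.2, 0.0]]
modules = average_linkage_clusters(toy_dist, cut_height=0.5)
```

The cubic running time is acceptable only for toy inputs; real analyses rely on optimized clustering routines.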

Different Ways of Depicting Gene Modules. Topological overlap plot: 1) rows and columns correspond to genes; 2) red boxes along the diagonal are modules; 3) color bands = modules. Multi-dimensional scaling (traditional view): the idea is to use the network distance in MDS. [Figure label: Gene Functions.]

Heatmap view of module. Columns = tissue samples; rows = genes. Color band indicates module membership. Message: characteristic vertical bands indicate tight co-expression of module genes.

Module eigengene = measure of over-expression = average redness = 1st PC of a given module. [Figure: heatmap of the brown module (rows = genes, columns = microarray samples) with a plot of the brown module eigengene across samples.]

Gene expression database: a conceptual view. [Figure: a gene expression matrix of gene expression levels, indexed by genes and samples, accompanied by gene annotations and sample annotations.]

Singular value decomposition (SVD). Use SVD to get the eigengenes. Let X denote an m x n matrix of real-valued data with rank r ≤ min(m, n), for m genes and n samples. The equation for singular value decomposition of X is the following: $X = U S V^T$, where U is an m x n matrix, S is an n x n diagonal matrix, and $V^T$ is also an n x n matrix.

$U^T U = V^T V = I$

$X = U \,\mathrm{diag}(w_1, \ldots, w_n)\, V^T$. The $w_i$ are called the singular values of X. If X is singular, some of the $w_i$ will be 0. In general rank(X) = number of nonzero $w_i$. SVD is mostly unique (up to permutation of singular values, or if some $w_i$ are equal). For a square, nonsingular X, $X^{-1} = (V^T)^{-1} S^{-1} U^{-1} = V S^{-1} U^T$. The columns $v_k$ of V correspond to eigenvectors of $X^T X$.
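To make the link between SVD and the module eigengene concrete, the sketch below approximates the first right singular vector of a module's expression matrix by power iteration on $X^T X$, using only the standard library. It is not the WGCNA routine. The function names, the centering and scaling of each gene, the starting vector and the sign convention are all assumptions made for the illustration.

```python
import math

def scale_gene(row):
    """Center a gene's profile and scale it to unit length (assumes it is not constant)."""
    mean = sum(row) / len(row)
    centered = [x - mean for x in row]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def module_eigengene(expr, iterations=200):
    """Approximate the first right singular vector of a genes x samples matrix."""
    z = [scale_gene(row) for row in expr]
    m, n = len(z), len(z[0])
    v = z[0][:]  # start from one gene's profile, which lies in the row space
    for _ in range(iterations):
        scores = [sum(z[g][s] * v[s] for s in range(n)) for g in range(m)]  # Z v
        w = [sum(z[g][s] * scores[g] for g in range(m)) for s in range(n)]  # Z^T (Z v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The sign of a singular vector is arbitrary; orient it along the module's genes.
    total = sum(sum(z[g][s] * v[s] for s in range(n)) for g in range(m))
    return v if total >= 0 else [-x for x in v]

# Toy module: three genes with the same expression pattern across four samples.
toy_module = [[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [1.0, 2.0, 3.0, 4.0]]
eigengene = module_eigengene(toy_module)
```

For a tightly co-expressed module such as this toy case, the eigengene is simply the shared pattern rescaled to unit length.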

Module eigengenes can be used to determine whether 2 modules are correlated. If the correlation of MEs is high, consider merging. Eigengenes can be used to build separate networks. [Figure: pairwise scatter plots and correlations among Martingale.Re, ME.blue, ME.brown, ME.green, ME.grey, ME.turquoise and ME.yellow.]
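The merging rule can be illustrated with a plain Pearson correlation between two eigengenes. The function names, the threshold of 0.75 and the eigengene values below are all hypothetical; the slides do not state what counts as a "high" correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def should_merge(me_a, me_b, threshold=0.75):
    """Suggest merging two modules when their eigengenes correlate strongly."""
    return pearson(me_a, me_b) > threshold  # threshold is an assumed value

# Invented eigengene values across four samples.
me_blue = [0.1, 0.2, 0.3, 0.4]
me_brown = [0.2, 0.4, 0.6, 0.8]
me_green = [0.4, 0.1, 0.3, 0.2]

merge_blue_brown = should_merge(me_blue, me_brown)
merge_blue_green = should_merge(me_blue, me_green)
```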

Consensus eigengene networks in male and female mouse liver data and their relationship to physiological traits. Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology.

Important Task in Many Genomic Applications: given a network (pathway) of interacting genes, how to find the central players? Gene connectivity = row sum of the adjacency matrix. For unweighted networks = number of direct neighbors. For weighted networks = sum of connection strengths to other nodes. $k_i = \sum_{j \neq i} a_{ij}$. So the value of $k_i$ indicates the importance of the gene in a network.

A Case Study. MC Oldham, S Horvath, DH Geschwind (2006) Conservation and evolution of gene coexpression networks in human and chimpanzee brain. PNAS.

What changed? Despite pronounced phenotypic differences, genomic similarity is ~96% (including single-base substitutions and indels) [1]. Similarity is even higher in protein-coding regions. [1] Cheng, Z. et al. Nature 437, 88-93 (2005). Image courtesy of Todd Preuss (Yerkes National Primate Research Center).

Assessing the contribution of regulatory changes to human evolution. Hypothesis: changes in the regulation of gene expression were critical during recent human evolution (King & Wilson, 1975). Microarrays are ideally suited to test this hypothesis by comparing expression levels for thousands of genes simultaneously.

Gene expression is more strongly preserved than gene connectivity. [Figure: two scatter plots of chimp against human values, one for expression (Cor=0.93) and one for connectivity (Cor=0.60).] Hypothesis: molecular wiring makes us human. Raw data from Khaitovich et al., 2004. Mike Oldham.

[Figure: panels A and B comparing human and chimp; reported p-values: $1.33 \times 10^{-4}$, $8.93 \times 10^{-4}$, $1.35 \times 10^{-6}$, $1.33 \times 10^{-4}$.]

Connectivity diverges across brain regions whereas expression does not

Conclusions: chimp/human. Gene expression is highly preserved across species brains. Gene co-expression is less preserved. Some modules are highly preserved. Gene modules correspond roughly to brain architecture. Species-specific hubs can be validated in silico using sequence comparisons.

Software and Data Availability. Sample data and R software tutorials can be found at the following webpage: http://www.genetics.ucla.edu/labs/horvath/coexpressionnetwork
An R package and accompanying tutorial can be found here: http://www.genetics.ucla.edu/labs/horvath/coexpressionnetwork/rpackages/wgcna/
Tutorial for this R package: http://www.genetics.ucla.edu/labs/horvath/coexpressionnetwork/rpackages/wgcna/tutorialwgcnapackage.doc

What is different from other analyses?
Emphasis on modules (pathways) instead of individual genes. This greatly alleviates the problem of multiple comparisons: less than 20 comparisons versus 20000 comparisons.
Use of intramodular connectivity to find key drivers. It quantifies module membership (centrality). Highly connected genes have an increased chance of validation.
Module definition is based on gene expression data. No prior pathway information is used for module definition. Two modules (eigengenes) can be highly correlated.
Emphasis on a unified approach for relating variables. Default: power of a correlation. Rationale: this puts different data sets on the same mathematical footing. It considers effect size estimates (cor) and significance level; p-values are highly affected by sample sizes (cor=0.01 is highly significant when dealing with 100000 observations).
Technical details: soft thresholding with the power adjacency function, and the topological overlap matrix to measure interconnectedness.
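The remark about sample size can be checked numerically. The sketch below converts a correlation into the usual t statistic, $t = r\sqrt{(n-2)/(1-r^2)}$, and uses a normal approximation for the two-sided p-value, an assumption that is reasonable only for large n. The function name is hypothetical.

```python
import math

def correlation_p_value(r, n):
    """Approximate two-sided p-value for testing zero correlation (large-n normal approximation)."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    # Two-sided tail probability of a standard normal at |t|.
    return math.erfc(abs(t) / math.sqrt(2))

p_large = correlation_p_value(0.01, 100000)  # tiny effect, very large sample
p_small = correlation_p_value(0.01, 100)     # same effect, modest sample
```

With 100000 observations the p-value falls between 0.001 and 0.002, while with 100 observations the same correlation is nowhere near significant, which is why the effect size should be reported alongside the p-value.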