Tutorial I Female Mouse Liver Microarray Data Network Construction and Module Analysis

Steve Horvath
Correspondence: shorvath@mednet.ucla.edu

Content of this tutorial
1) Gene Co-expression Network Construction
2) Module Definition Based on Average Linkage Hierarchical Clustering with the Dynamic Tree Cut Algorithm
3) Relating Modules to Physiological Traits (module significance analysis)
4) Comparing Weighted Network Results to Unweighted Network Results
5) Studying the Clustering Coefficient

Abstract
We use microarray data from an F2 mouse intercross to examine the large-scale organization of gene co-expression networks in female mouse liver and annotate several gene modules in terms of 20 physiological traits. Finally, we study the relationship between connectivity and a measure of gene significance based on the physiological traits. The data and biological implications are described in the following references:

Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA, Lusis AJ, Horvath S (2006) Integrating Genetic and Network Analysis to Characterize Genes Related to Mouse Weight. PLoS Genetics 2(8), August 2006.

Fuller TF, Ghazalpour A, Aten JE, Drake TA, Lusis AJ, Horvath S (2007) Weighted Gene Coexpression Network Analysis Strategies Applied to Mouse Weight. Mamm Genome 18(6).

We provide the statistical code used for generating the weighted gene co-expression network results. Thus, the reader should be able to reproduce all of our findings. This document also serves as a tutorial on weighted gene co-expression network analysis. Some familiarity with the R software is desirable, but the document is fairly self-contained.
This document and the data files can be found at the following webpage:
More material on weighted network analysis can be found here:

Method Description
The network construction is conceptually straightforward: nodes represent genes, and nodes are connected if the corresponding genes are significantly co-expressed across appropriately chosen tissue samples. Here we study networks that can be specified with the following adjacency matrix: $A = [a_{ij}]$ is symmetric with entries in [0,1]. By convention, the diagonal elements are assumed to be zero. For unweighted networks, the adjacency matrix contains binary information (connected=1, unconnected=0). In weighted networks, the adjacency matrix encodes pairwise connection strengths.

Microarray data
RNA preparation and array hybridizations were performed at Rosetta Informatics. The custom ink-jet microarrays used in this study (Agilent Technologies, previously described [2, 24]) contain 2186 control probes and 23,574 non-control oligonucleotides extracted from mouse Unigene clusters and combined with RefSeq sequences and RIKEN full-length clones. Mouse livers were homogenized and total RNA was extracted using Trizol reagent (Invitrogen, CA) according to the manufacturer's protocol. Three µg of total RNA was reverse transcribed and labeled with either Cy3 or Cy5 fluorochrome. Purified Cy3 or Cy5 complementary RNA was hybridized to at least two microarray slides with fluor reversal for 24 hours in a hybridization chamber, washed, and scanned using a laser confocal scanner. Arrays were quantified on the basis of spot intensity relative to background, adjusted for experimental variation between arrays using average intensity over multiple channels, and fit to an error model to determine significance (type I error). Gene expression is reported as the mean log10 ratio (mlratio) of intensity relative to the pool derived from 150 mice randomly selected from the F2 population.

Data Reduction: To minimize noise in the gene expression data set, several data filtering steps were taken. First, preliminary evidence showed major differences in gene expression levels between sexes among the F2 mice used (Yang, manuscript in preparation), and therefore only female mice were used for network construction. The construction and comparison of the male network will be reported elsewhere. Only those mice with complete phenotype, genotype and array data were used. This gave a final experimental sample of 135 female mice used for network construction. Due to computational limitations, the following filtering steps were applied to the genome-wide expression data.
First, the 8000 most varying genes (by mlratio, the log10 of the ratio of experimental mouse gene expression to the F2 pool) across all mice were identified. Next, among these 8000 genes, after preliminary network construction, the 3600 most connected genes were chosen for use in further steps (all excluded genes had a connectivity of 1 or less). These 3600 genes were then examined and, where appropriate, gene isoforms and genes containing duplicate probes were excluded by keeping only those with the highest expression among the redundant transcripts. This final filtering step yielded 3421 genes for the experimental network construction.

The main reason for not using all genes is that genes with low variance across the mouse samples are likely to be less interesting in this analysis, which relates gene expression profiles to SNP markers and highly varying physiological traits. Noise genes may compromise module detection and thus our integrated model. A computational reason for restricting the analysis to 8000 genes is that our R software code becomes extremely slow when dealing with matrices of dimension larger than 8000x8000. For module detection, we limited our analysis to the 3600 most connected genes since our module construction method and visualization tools cannot handle larger data sets at this point. By definition, module genes are highly connected with the genes of their module (i.e. module genes tend to have relatively high connectivity). Thus, for the purpose of module detection, restricting the analysis to the most connected genes does not lead to major information loss for the key points of our application. However, there may be applications where genes with relatively low connectivity are biologically interesting, so that gene filtering based on connectivity would lead to information loss.
Finally, we eliminated multiple probes with similar expression patterns for the same gene since we are interested in studying gene networks as opposed to probe set networks. This resulted in 3421 genes in our final set, which we used for module detection.
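The first filtering step described above (keeping the most varying genes) can be sketched in base R as follows. This is a minimal illustration on simulated data, not the code used for the study: the matrix `exprSim`, the helper `topVaryingGenes`, and all sizes are hypothetical.

```r
# Hypothetical helper: keep the nTop columns (genes) with the largest variance.
# Rows are samples and columns are genes, as in the tutorial's expression matrix.
topVaryingGenes <- function(expr, nTop) {
  geneVar <- apply(expr, 2, var, na.rm = TRUE)
  keep <- order(geneVar, decreasing = TRUE)[seq_len(min(nTop, ncol(expr)))]
  expr[, sort(keep), drop = FALSE]   # preserve the original gene order
}

set.seed(1)
# Simulated data: 20 samples x 50 genes with gene-specific standard deviations
geneSd <- seq(0.1, 5, length.out = 50)
exprSim <- sapply(geneSd, function(s) rnorm(20, sd = s))
colnames(exprSim) <- paste0("gene", 1:50)

exprFiltered <- topVaryingGenes(exprSim, nTop = 10)
dim(exprFiltered)
```

The same pattern would apply to the connectivity-based filter by ranking genes on a connectivity vector instead of the variance.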

Network Construction
We pioneered the use of a weighted co-expression network for mapping complex disease genes. In co-expression networks, network nodes correspond to genes, and connection strengths are determined by the pairwise correlations between expression profiles. In contrast to unweighted networks, weighted networks use soft thresholding of the Pearson correlation matrix for determining the connection strengths between two genes. Soft thresholding of the Pearson correlation preserves the gene co-expression information and leads to weighted co-expression networks that are highly robust with respect to the construction method. The network construction algorithm is described in detail elsewhere (Zhang and Horvath 2005). Briefly, a gene co-expression similarity measure (the absolute value of Pearson's product-moment correlation) was computed for every pair of genes. An adjacency matrix was then constructed using a "soft" power adjacency function, $a_{ij} = \mathrm{power}(s_{ij}, \beta) = |s_{ij}|^{\beta}$, where $s_{ij}$ is the co-expression similarity and $a_{ij}$ is the resulting adjacency that measures the connection strength. The power $\beta$ is chosen using the scale free topology criterion proposed in Zhang and Horvath (2005). Briefly, the power was chosen such that the resulting network exhibited approximate scale free topology and a high mean number of connections. The scale free topology criterion led us to choose a power of $\beta = 6$ based on the preliminary network built from the 8000 most varying genes. However, since we are using a weighted network as opposed to an unweighted network, the biological findings are highly robust with respect to the choice of this power (Zhang and Horvath 2005).

Topological Overlap Matrix and Gene Modules
The adjacency matrix was then used to define a network distance measure, or more precisely a measure of node dissimilarity, based on the topological overlap matrix.
Specifically, the topological overlap matrix is given by

$$\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}}$$

where $l_{ij} = \sum_u a_{iu} a_{uj}$ (for an unweighted network, the number of nodes to which both i and j are connected), $k_i = \sum_u a_{iu}$ is the connectivity of node i, and u indexes the nodes of the network. The topological overlap matrix (TOM) is given by $\Omega = [\omega_{ij}]$. $\omega_{ij}$ is a number between 0 and 1 and is symmetric (i.e., $\omega_{ij} = \omega_{ji}$). The rationale for considering this similarity measure is that nodes that are part of highly integrated modules are expected to have high topological overlap with their neighbors.

Network Module Identification. Gene "modules" are groups of nodes that have high topological overlap. Module identification was based on the topological overlap matrix $\Omega = [\omega_{ij}]$ defined above. To use it in hierarchical clustering, it was turned into a dissimilarity measure by subtracting it from one (i.e., the topological overlap based dissimilarity measure is defined by $d^{\omega}_{ij} = 1 - \omega_{ij}$). Based on the dissimilarity matrix, we can use hierarchical clustering to discriminate one module from another. We used a dynamic tree cut algorithm for automatically and precisely identifying modules in a hierarchical clustering dendrogram (the "tree" method of cutreeDynamic, see Langfelder, Zhang and Horvath 2008).
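The adjacency and topological overlap definitions above can be sketched in a few lines of base R. This is a minimal illustration on simulated data and not the implementation used by the tutorial's functions; the object names (`exprSim`, `tom`, `dissTOMsim`) are hypothetical, and setting the diagonal of the overlap matrix to 1 is an assumed convention.

```r
set.seed(2)
# Simulated expression data: 30 samples (rows) x 8 genes (columns)
exprSim <- matrix(rnorm(30 * 8), nrow = 30)

beta <- 6                      # soft-thresholding power
s <- abs(cor(exprSim))         # co-expression similarity s_ij
a <- s^beta                    # weighted adjacency a_ij = s_ij^beta
diag(a) <- 0                   # diagonal elements are zero by convention

k <- rowSums(a)                # connectivity k_i = sum_u a_iu
l <- a %*% a                   # l_ij = sum_u a_iu * a_uj
tom <- (l + a) / (outer(k, k, pmin) + 1 - a)
diag(tom) <- 1                 # assumed convention: a node fully overlaps with itself
dissTOMsim <- 1 - tom          # TOM-based dissimilarity d_ij = 1 - omega_ij
round(dissTOMsim[1:3, 1:3], 3)
```

Because the diagonal of the adjacency matrix is zero, the terms with u = i and u = j drop out of the sum automatically, which keeps every overlap value between 0 and 1.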

The dynamic tree cut algorithm takes into account an essential feature of cluster occurrence and makes use of the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive process of cluster decomposition and combination, and the process is iterated until the number of clusters becomes stable. No claim is made that our module construction method is optimal. A comparison of different module construction methods is beyond the scope of this paper.

Detection of hub genes: To identify hub genes for the network, one may either consider the whole network connectivity (denoted by kTotal) or the intramodular connectivity (kWithin). We find that intramodular connectivity is far more meaningful than whole network connectivity.

Module Significance Analysis: Relating Gene Modules to Physiological Traits
In the BXH F2 cross, 20 physiological traits were measured for each animal. We used this information to explore the physiological relevance of each of the modules in the network. To do this, Pearson's product-moment correlation (Pearson's correlation) was computed between each gene within a module and each of the physiological traits. This measure is termed the gene significance of a particular gene with that trait. The geometric mean was then calculated for the absolute value of all gene significance scores within each module, yielding the module significance (MS) of that particular module with the trait. To explore the characteristics of our connectivity measure, we plotted the connectivity parameter versus variance and mean gene expression for each gene within the most functionally significant (Blue and Brown) modules. We observed an inverse relationship between connectivity and gene expression variance. This is consistent with the idea that the network's most highly connected hubs are resilient to genetic background variations since they are vital for core biological functions.
Statistical References

Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology 4(1): Article 17.

The following theoretical reference explores the meaning of co-expression network analysis:
Horvath S, Dong J (2008) Geometric Interpretation of Gene Co-Expression Network Analysis. PLoS Computational Biology 4(8).

The WGCNA R package is described in:
Langfelder P, Horvath S (2008) WGCNA: an R package for Weighted Correlation Network Analysis. BMC Bioinformatics, Dec 29; 9(1):559.

For the generalized topological overlap matrix as applied to unweighted networks see:
Yip A, Horvath S (2007) Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 8:22.

Module detection based on branch cutting is described in:
Langfelder P, Zhang B, Horvath S (2008) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, Mar 1; 24(5).

# Absolutely no warranty on the code. Please contact SH with suggestions.

# CONTENTS
# This document contains functions for carrying out the following tasks:
# A) Assessing scale free topology and choosing the parameters of the adjacency function
#    using the scale free topology criterion (Zhang and Horvath 05)
# B) Computing the topological overlap matrix
# C) Defining gene modules using clustering procedures
# D) Summing up modules by their first principal component (first eigengene)
# E) Relating a measure of gene significance to the modules
# F) Carrying out a within module analysis (computing intramodular connectivity)
#    and relating intramodular connectivity to gene significance.
# G) Miscellaneous other functions, e.g. for computing the cluster coefficient.

# Downloading the R software
# 1) Download R and install it on your computer.
# After installing R, you need to install several additional R library packages:
# For example, to install Hmisc, open R,
# go to menu "Packages/Install package(s) from CRAN",
# then choose Hmisc. R will automatically install the package.
# When asked "Delete downloaded files (y/n)? ", answer "y".
# Do the same for some of the other libraries mentioned below. But note that
# several libraries are already present in the software so there is no need to re-install them.

# To get this tutorial and data files, go to the following webpage
#
# Unzip all the files into the same directory.

## The user should copy and paste the following script into the R session.
## Text after "#" is a comment and is automatically ignored by R.
# read in the R libraries
library(MASS) # standard, no need to install
library(class) # standard, no need to install
library(cluster)
library(sma) # install it for the function plot.mat
library(impute) # install it for imputing missing values
library(scatterplot3d)
# Download the WGCNA library as a .zip file and choose
# "Install package(s) from local zip file" in the Packages tab
library(WGCNA)
options(stringsAsFactors=FALSE)

# Please adapt the file paths
setwd("C:/Documents and Settings/Steve Horvath/My Documents/ADAG/LinSong/NetworkScreening/MouseFemaleLiver")
# read in the custom network functions
source("networkfunctions.txt")
# The following 3421 probe sets were arrived at using the following steps:
# 1) reduce to the 8000 most varying, 2) 3600 most connected, 3) focus on unique genes
dat0=read.table("cnew_liver_bxh_f2female_8000mvgenes_p3600_unique_tommodules.xls",header=TRUE)
names(dat0)
# this contains information on the genes
datsummary=dat0[,c(1:8,144:150)]
# the following data frame contains
# the gene expression data: columns are genes, rows are arrays (samples)
datExpr=t(dat0[,9:143])
no.samples=dim(datExpr)[[1]]
dim(datExpr)
datclinicaltraits=read.csv("bxh_clinicaltraits_361mice_fornewbxh.csv",header=TRUE)
# Now we order the mice so that trait file and expression file agree
restrictmice=is.element(datclinicaltraits$MiceID,dimnames(datExpr)[[1]])
table(restrictmice)
datclinicaltraits=datclinicaltraits[restrictmice,]
ordermicetraits=order(datclinicaltraits$MiceID)
ordermiceexpr=order(dimnames(datExpr)[[1]])
datclinicaltraits=datclinicaltraits[ordermicetraits,]
datExpr=datExpr[ordermiceexpr,]
# From the following table, we verify that all 135 mice are in order
table(datclinicaltraits$MiceID==dimnames(datExpr)[[1]])
rm(dat0);gc()
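The alignment of trait rows and expression rows performed above can also be written with `match`, which avoids sorting both tables separately. The following is a minimal sketch on made-up data; the objects `traits`, `exprSim` and `traitsAligned` and all values in them are hypothetical.

```r
# Hypothetical trait table and expression matrix with sample IDs in different order
traits <- data.frame(MiceID = c("m3", "m1", "m5", "m2", "m4"),
                     weight = c(30.1, 25.4, 41.0, 28.7, 35.2),
                     stringsAsFactors = FALSE)
exprSim <- matrix(1:6, nrow = 3,
                  dimnames = list(c("m2", "m3", "m1"), c("geneA", "geneB")))

# Position of each expression sample in the trait table (NA if the sample is absent)
idx <- match(rownames(exprSim), traits$MiceID)
stopifnot(!anyNA(idx))          # every array must have trait data
traitsAligned <- traits[idx, ]

# Row i of the trait table now describes the same mouse as row i of the expression matrix
all(traitsAligned$MiceID == rownames(exprSim))
```

Checking the alignment explicitly, as the tutorial does with `table(...)`, is worthwhile because a silent mismatch would invalidate every downstream trait correlation.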

# SOFT THRESHOLDING for Weighted Network Construction
# To construct a weighted network (soft thresholding with the power adjacency function),
# we consider the following vector of candidate powers.
powers1=c(seq(1,10,by=1),seq(12,18,by=2))
# To choose the power, we propose to use the Scale-free Topology Criterion (Zhang and
# Horvath 2005). Here the focus is on the linear regression model fitting index
# (denoted below by scale.law.R.2) that quantifies how well a network
# satisfies a scale-free topology.
# The function pickSoftThreshold helps to choose the power
# when using soft thresholding with the power adjacency function.
# The first column lists the power beta.
# The second column reports the resulting scale free topology fitting index R^2.
# The third column reports the slope of the fitting line.
# The fourth column reports the fitting index for the truncated exponential scale free model.
# Usually we ignore it.
# The remaining columns list the mean, median and maximum connectivity.
RpowerTable=pickSoftThreshold(datExpr, powerVector=powers1)[[2]]
# Columns of the resulting table:
# Power scale.law.R.2 slope truncated.R.2 mean.k. median.k. max.k.
cex1=0.7
par(mfrow=c(1,2))
plot(RpowerTable[,1], -sign(RpowerTable[,3])*RpowerTable[,2],xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit, signed R^2",type="n")
text(RpowerTable[,1], -sign(RpowerTable[,3])*RpowerTable[,2], labels=powers1,cex=cex1,col="red")
# this line corresponds to using an R^2 cut-off of h
abline(h=0.8,col="red")
plot(RpowerTable[,1], RpowerTable[,5],xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n")
text(RpowerTable[,1], RpowerTable[,5], labels=powers1, cex=cex1,col="red")

[Figure: left panel, scale free topology model fit (signed R^2) versus soft threshold (power); right panel, mean connectivity versus soft threshold (power).]

To choose the power beta, we use the Scale-free Topology Criterion (Zhang and Horvath 2005). Here the focus is on the linear regression model fitting index (denoted as scale.law.R.2) that quantifies how well a network satisfies a scale-free topology. We choose the soft thresholding parameter beta=6 for the correlation matrix since this is where the R^2 curve seems to saturate. From the above table, we find that the resulting slope looks OK (negative, between -1 and -2). In the appendix, we investigate different choices of beta.

# Here the scale free topology criterion would lead us to pick a power of beta=6.
# In an appendix, we study how the biological findings depend on the choice of the power.
beta1=6 # this is the power adjacency function parameter in power(s,beta)
# By playing around with beta, you will find that the
# findings are highly robust with respect to beta1, which is an attractive property.

# The following computes the network connectivity (Connectivity)
Connectivity=softConnectivity(datExpr,power=beta1)
# Creating Scale Free Topology Plots (SUPPLEMENTARY FIGURE S1 in our article)
par(mfrow=c(1,1))
scaleFreePlot(Connectivity, truncated=TRUE,main=paste("beta=",as.character(beta1)))

[Figure: scale free topology plot of log10(p(k)) versus log10(k); beta= 6, scale free R^2= 0.88, slope= -1.61, trunc.R^2= 0.98]

The figure shows that the connectivity distribution p(k) is better modeled using an exponentially truncated power law, $p(k) \sim k^{-\gamma} \exp(-\alpha k)$. In practice, we find that the two parameters $\alpha$ and $\gamma$ provide too much flexibility in curve fitting. The truncated exponential model fitting index R^2 tends to be high irrespective of the adjacency function parameter. For this reason, we focus on the scale-free topology fitting index in our scale-free topology criterion. Exploring the use of the truncated exponential fitting index is beyond the scope of this article.
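To make the scale-free topology fitting index more concrete, the following base R sketch bins a connectivity vector, estimates p(k) from the bin frequencies, and regresses log10(p(k)) on log10(k). It is an illustration under stated assumptions rather than the package's own routine: the helper `scaleFreeFit`, the number of bins, and the simulated connectivities `kSim` are all hypothetical.

```r
# Hypothetical helper: scale-free topology fit for a connectivity vector k
scaleFreeFit <- function(k, nBreaks = 10) {
  bins <- cut(k, nBreaks)                       # equal-width connectivity bins
  meanK <- as.vector(tapply(k, bins, mean))     # mean connectivity per bin (NA if empty)
  pK <- as.vector(table(bins)) / length(k)      # fraction of nodes per bin
  ok <- !is.na(meanK) & pK > 0                  # drop empty bins before taking logs
  logP <- log10(pK[ok])
  logK <- log10(meanK[ok])
  fit <- lm(logP ~ logK)
  c(Rsquared = summary(fit)$r.squared, slope = unname(coef(fit)[2]))
}

set.seed(3)
# Simulated connectivities with a heavy right tail (exponent chosen arbitrarily)
kSim <- runif(2000)^(-1 / 1.5)
res <- scaleFreeFit(kSim)
res
# Signed fit index in the style of the plot above: -sign(slope) * R^2
-sign(res["slope"]) * res["Rsquared"]
```

A negative slope with a high R^2 indicates an approximately linear decay on the log-log scale, which is what the criterion looks for when a power is chosen.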

Module Detection
An important step in network analysis is module detection. To group genes with coherent expression profiles into modules, we use average linkage hierarchical clustering, which uses the topological overlap measure as dissimilarity. The topological overlap of two nodes reflects their similarity in terms of the commonality of the nodes they connect to (see [Ravasz et al 2002, Yip and Horvath 2007]).

# Now define the power adjacency matrix
ADJ=adjacency(datExpr,power=beta1)
gc()
# The following code computes the topological overlap matrix based on the
# adjacency matrix.
# TIME: Takes several minutes
dissTOM=TOMdist(ADJ)
gc()
# Now we carry out hierarchical clustering with the TOM matrix. Branches of the
# resulting clustering tree will be used to define gene modules.
hiertom=hclust(as.dist(dissTOM),method="average");
par(mfrow=c(1,1))
plot(hiertom,labels=FALSE)

# By our definition, modules correspond to branches of the tree.
# The question is what height cut-off should be used? This depends on the
# biology. Large height values lead to big modules, small values lead to small
# but very cohesive modules.

We used a dynamic tree cut algorithm for selecting branches of the hierarchical clustering dendrogram (Langfelder, Zhang and Horvath 2008). The algorithm takes into account an essential feature of cluster occurrence and makes use of the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive process of cluster decomposition and combination, and the process is iterated until the number of clusters becomes stable.
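For orientation, the basic idea of turning dendrogram branches into cluster labels can be sketched with base R alone. Note that this sketch uses a static, constant-height cut via `cutree`, which is a simpler stand-in and not the dynamic tree cut algorithm described above; the simulated points and the cut height of 3 are assumptions chosen for illustration.

```r
set.seed(4)
# Simulated data: two well-separated groups of 10 objects each
pts <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
dissSim <- dist(pts)

# Average linkage hierarchical clustering, as used in the tutorial
treeSim <- hclust(dissSim, method = "average")

# Static cut at a fixed height (assumed value); every branch below it becomes a cluster
labelsStatic <- cutree(treeSim, h = 3)
table(labelsStatic)
```

A single fixed height works here because the two groups are cleanly separated; with nested or unevenly tight branches, an adaptive method such as the dynamic tree cut is the motivation given in the text.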

# The following is the original code used in the paper by Ghazalpour et al
myheightcutoff=0.995
mydeepsplit=FALSE # fine structure within module
myminmodulesize=20 # modules must have this minimum number of genes
# new way for identifying modules based on hierarchical clustering dendrogram
colorh1=cutreeDynamic(hierclust=hiertom, deepSplit=mydeepsplit, maxTreeHeight=myheightcutoff, minModuleSize=myminmodulesize)
table(colorh1)

# Our code has slightly changed. If we could go back in time, we would use the following code
# for branch cutting. But please skip it...
colorhnewstep1=dynamicTreeCut::cutreeDynamic(dendro=hiertom, cutHeight=0.9965, minClusterSize=20, method="tree", deepSplit=FALSE)
# to turn the branch labels (which are integers) into colors we use
colorhnewstep2=labels2colors(colorhnewstep1)
colorhnewstep3=mergeCloseModules(datExpr, colors=colorhnewstep2, cutHeight=0.15, MEs=NULL, impute=TRUE, useAbs=FALSE)$colors
# Note that the resulting color is quite similar to the original one:
table(colorhnewstep3,colorh1)

# This results in the following color assignment.
par(mfrow=c(2,1))
plot(hiertom, main="Female Mouse Liver Network", labels=FALSE, xlab="", sub="");
plotColorUnderTree(hiertom,colors=data.frame(colorh1))
title("Colored by UNMERGED dynamic modules")

# Note that the colors correspond to portions of the branches.
# To determine whether some colors should be merged we a) represent each module by its
# module eigengene (defined as its first principal component) and b) cluster
# the principal components. If 2 module eigengenes (PCs) are highly correlated
# then the modules should be merged. A general rule may be that two modules are
# merged if the distance between the two is smaller than 0.1 (i.e., the correlation
# is bigger than 0.9)
datME=moduleEigengenes(as.matrix(datExpr), colorh1)[[1]]
dissmes=1-abs(cor(datME, use="p"))
dissmes=ifelse(is.na(dissmes), 0, dissmes)
hiermes=hclust(as.dist(dissmes),method="a")
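The notion of a module eigengene as a first principal component can be illustrated with base R's `svd`. This is a minimal sketch on a simulated module, not the package function itself; `moduleExpr`, `signal` and the noise level are hypothetical, and orienting the eigengene along the average expression is an assumed convention.

```r
set.seed(5)
nSamples <- 25
# Simulated module: 12 genes driven by one shared signal plus noise
signal <- rnorm(nSamples)
moduleExpr <- sapply(1:12, function(j) signal + rnorm(nSamples, sd = 0.3))

# Standardize each gene (column), then take the first left singular vector,
# which gives one value per sample: the first principal component of the module
scaled <- scale(moduleExpr)
sv <- svd(scaled)
eigengene <- sv$u[, 1]

# Assumed convention: orient the eigengene along the average expression
if (cor(eigengene, rowMeans(scaled)) < 0) eigengene <- -eigengene

# Proportion of the module's variance explained by the eigengene
propVar <- sv$d[1]^2 / sum(sv$d^2)
propVar
```

Two modules whose eigengenes are highly correlated carry nearly the same sample-level signal, which is the rationale for the merging rule stated in the comments above.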

# display ME hierarchical dendrogram on screen
par(mfrow=c(1,1), mar=c(0, 3, 1, 1) + 0.1, cex=1)
plot(hiermes, xlab="",ylab="",main="",sub="")
par(mfrow=c(1,1))

[Figure: dendrogram of the module eigengenes with leaves PCgrey60, PClightcyan, PClightyellow, PCsalmon, PCbrown, PCcyan, PCgrey, PCgreenyellow, PCblue, PCmagenta, PClightgreen, PCgreen, PCtan, PCblack, PCyellow, PCpurple, PCmidnightblue, PCturquoise, PCpink, PCred]

# This tree suggests merging several colors

# To merge a minor cluster to a major cluster, we use
colorh1=merge2Clusters(colorh1, mainclusterColor="lightcyan", minorclusterColor="grey60")
colorh1=merge2Clusters(colorh1, mainclusterColor="blue", minorclusterColor="magenta")
colorh1=merge2Clusters(colorh1, mainclusterColor="red", minorclusterColor="turquoise")
colorh1=merge2Clusters(colorh1, mainclusterColor="red", minorclusterColor="pink")
colorh1=merge2Clusters(colorh1, mainclusterColor="black", minorclusterColor="yellow")
colorh1=merge2Clusters(colorh1, mainclusterColor="green", minorclusterColor="lightgreen")
colorh1=merge2Clusters(colorh1, mainclusterColor="green", minorclusterColor="tan")
# After merging some colors we arrive at the following hierarchical plot
#### FIGURE 1A) in our manuscript
par(mfrow=c(2,1))
plot(hiertom, main="Female Mouse Liver Network", labels=FALSE, xlab="", sub="");
plotColorUnderTree(hiertom,colorh1, title1="Colored by female liver modules")

# We also propose to use classical multi-dimensional scaling plots
# for visualizing the network. Here we compute 4 scaling dimensions
# and use the first 3 for the 3-dimensional plot.
# This also takes about 10 minutes...
cmd1=cmdscale(as.dist(dissTOM),4)
par(mfrow=c(2,3))
plot(cmd1[,c(1,2)], col=as.character(colorh1))
plot(cmd1[,c(1,3)], col=as.character(colorh1))
plot(cmd1[,c(1,4)], col=as.character(colorh1))
plot(cmd1[,c(2,3)], col=as.character(colorh1))
plot(cmd1[,c(2,4)], col=as.character(colorh1))
plot(cmd1[,c(3,4)], col=as.character(colorh1))
### FIGURE 1 B in our article
par(mfrow=c(1,1))
scatterplot3d(cmd1[,1:3], color=as.character(colorh1), main="MDS plot",xlab="Scaling Dimension 1", ylab="Scaling Dimension 2", zlab="Scaling Dimension 3",cex.axis=1.5,angle=320)

TOM plot and MDS plots
To visualize the network, we used several plots. The topological overlap matrix plot represents the topological overlap matrix where rows and columns are sorted and colored according to the hierarchical clustering tree used in the module definition. A classical multi-dimensional scaling plot that uses the topological overlap matrix as input can also be used.

# An alternative view of this is the so-called TOM plot that is generated by the
# function TOMplot
# Inputs: TOM distance measure, hierarchical (hclust) object, color
# Warning: for large gene sets, say more than 2000 genes,
# this may take a while...
TOMplot(dissTOM, hiertom, colorh1)

Definition of trait based gene significance
For a given physiological trait, we defined a measure of gene significance by forming the absolute value of the Spearman correlation between trait and gene expression values. For example, the body weight can be used to define a gene significance of the i-th gene expression, GSweight(i) = |cor(x(i), weight)|, where x(i) is the gene expression profile of the i-th gene. A histogram of the clinical traits shows that several clinical traits appear to have outliers.

# To protect against outliers, we replace the values of the physiological traits
# by their ranks.
rank1=function(x) rank(x, na.last="keep")
rankdatclinicaltraits=apply(datclinicaltraits[,5:26],2,rank1)
# This function computes the correlation between a gene expression
# and a physiological trait
if(exists("GSfunction")) rm(GSfunction)
GSfunction=function(x) {cor(x,rankdatclinicaltraits,use="p")}
# the following data frame has as columns the gene significance variables
# for different clinical traits
GeneSignificance=t(apply(datExpr,2,GSfunction))
dimnames(GeneSignificance)[[2]]=paste("GS",dimnames(rankdatclinicaltraits)[[2]],sep="")
# Since we only care about absolute values of correlations between expression
# profiles and traits, we set
GeneSignificance=data.frame(abs(GeneSignificance))
names(GeneSignificance)
#  [1] "GSWeightG"         "GSLengthCM"        "GSAbFat"           "GSOtherFat"        "GSTotalFat"
#  [6] "GSX100xfat.weight" "GSTrigly"          "GSTotalChol"       "GSHDLChol"         "GSUC"
# [11] "GSFFA"             "GSGlucose"         "GSLDL.VLDL"        "GSMCP.1.phys."     "GSInsulin.ug.l."
# [16] "GSGlucose.Insulin" "GSLeptin.pg.ml."   "GSAdiponectin"     "GSAorticLesions"   "GSAneurysm"
# [21] "GSAorticCal.M"     "GSAorticCal.L"
# Here we define more conventional names to annotate Figure 2 and Supplementary Figure S2
namesgs=c("weight","length","abfat","otherfat","totalfat","index",
"Trigly","Chol","HDL","UC","FFA","Glucose","LDL+VLDL", "MCP1",
"Insulin","GlucoseInsulin", "Leptin", "Adiponectin", "AorticLesions",
"Aneurysm", "AorticCal.M", "AorticCal.L")

The mean gene significance for a particular module can be considered as a measure of module significance (MS) (see Materials and Methods for the statistical test), which means that MS provides a measure of the overall correlation between the trait and the module. This means that a module with a high MS value for "body weight" is on average composed of genes highly correlated with body weight.

# Here we use the function verboseBoxplot to create plots
# that show whether modules are enriched with significant genes.
# It also reports a Kruskal Wallis P-value.
# The gene significance can be a binary variable or a quantitative variable.
# It also plots the 95% confidence interval of the mean

par(mfrow=c(7,3), mar=c(1, 4, 3, 1)+0.1)
for (i in c(1:21)) {
verboseBoxplot(GeneSignificance[,i],colorh1,col=levels(factor(colorh1)),main=namesgs[i],xlab="module",ylab="GS")
abline(h=.3,col="red")
}

[Figure: gene significance (GS) by module for each trait, with Kruskal-Wallis p-values. Panels: Weight p = 7.5e-288; Length p = 3.6e-56; AbFat; OtherFat p = 2.7e-262; TotalFat p = 5.1e-266; Index p = 1.1e-233; Trigly p = 3.5e-171; Chol p = 5.1e-213; HDL p = 4.2e-46; UC p = 1.2e-201; FFA p = 1.7e-197; Glucose p = 3.6e-223; LDL+VLDL p = 1.8e-208; MCP1 p = 1.5e-211; Insulin p = 1.4e-265; GlucoseInsulin p = 2.2e-262; Leptin p = 2.7e-194; Adiponectin p = 7.2e-70; AorticLesions; Aneurysm; AorticCal.M]

So, interesting combinations include a) GSweight in the blue module and b) Glucose.Insulin in the brown module.
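The definitions of gene significance and module significance used above can be sketched on simulated data with base R only. This is an illustration under stated assumptions, not the tutorial's analysis: the trait, the two modules and their sizes are made up, and module significance is taken as the arithmetic mean of gene significance, as in the tutorial code.

```r
set.seed(6)
nSamples <- 40
trait <- rnorm(nSamples)                       # simulated physiological trait
# Hypothetical module "blue": 15 genes related to the trait
exprBlue <- sapply(1:15, function(j) trait + rnorm(nSamples))
# Hypothetical module "grey": 15 genes unrelated to the trait
exprGrey <- matrix(rnorm(nSamples * 15), nrow = nSamples)
exprSim <- cbind(exprBlue, exprGrey)
moduleColor <- rep(c("blue", "grey"), each = 15)

# Rank the trait to reduce the influence of outliers
traitRank <- rank(trait, na.last = "keep")
# Gene significance: absolute correlation between each gene and the ranked trait
GS <- abs(as.vector(cor(exprSim, traitRank, use = "p")))
# Module significance: mean gene significance within each module
MS <- tapply(GS, moduleColor, mean)
# Kruskal-Wallis test of whether gene significance differs between modules
kw <- kruskal.test(GS, factor(moduleColor))
MS
kw$p.value
```

In this simulation the "blue" module should show the higher module significance because its genes were constructed from the trait.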

Since our particular interest is in understanding body weight, we focus on the blue module. The following code creates a barplot for the body weight based gene significance measure.

verboseBarplot(GeneSignificance[,1],colorh1,col=levels(factor(colorh1)),main="Module significance",xlab="module",ylab="mean gene significance")

[Figure: barplot of mean gene significance for body weight by module (black, blue, brown, cyan, darkred, greenyellow, lightcyan, midnightblue, red, royalblue); p = 7.5e-288]

dim(GeneSignificance)
# Now we produce Figure 2 of the article.
whichmodule="blue"
# mean gene significance=module significance
meangs=apply(abs(GeneSignificance)[colorh1==whichmodule,],2,mean)
# corresponding standard error
stderrgs=apply(abs(GeneSignificance)[colorh1==whichmodule,],2,stderr1)
# The following code produces a barplot with rotated axis labels
## Increase bottom margin to make room for rotated labels
par(mar=c(7, 4, 4, 2) + 0.1)
## Create plot and get bar midpoints in 'mp'
mp=barplot(as.vector(meangs),col=whichmodule, ylab="Module Significance",cex.lab=1.5)
## Set up x axis with tick marks alone
axis(1, at=mp, labels=FALSE)
## text labels
labels=namesgs
## Plot x axis labels at mp, you may want to change the offset
text(mp, par("usr")[3], srt=45, adj=1, labels=labels, xpd=TRUE,cex=1.3)
## Plot x axis label at line 6
mtext(1, text="Physiological Traits", line=6,cex=1.5)
# This creates the error bars
err.bp(meangs, stderrgs, two.side=TRUE)

[Figure 2: barplot of module significance (y-axis) of the blue module for each physiological trait (x-axis): Weight, Length, AbFat, OtherFat, TotalFat, Index, Trigly, Chol, HDL, UC, FFA, Glucose, LDL+VLDL, MCP1, Insulin, GlucoseInsulin, Leptin, Adiponectin, AorticLesions, Aneurysm, AorticCal.M, AorticCal.L]

# To get a sense of how related the modules are, one can summarize each module
# by its first eigengene (referred to as principal component).
# Next we cluster the eigengenes. This is very similar to the code used above
# for identifying modules that should be merged.
dME2=moduleEigengenes(datExpr,colorh1)[[1]]
hclustdme2=hclust(as.dist(1-abs(t(cor(dME2, method="p")))), method="average")
par(mfrow=c(1,1))
plot(hclustdme2, main="Clustering the Module Eigengenes")

[Figure: "Clustering the Module Eigengenes", average linkage dendrogram (y-axis: Height) with leaves PClightcyan, PCsalmon, PCbrown, PCcyan, PCblack, PCpurple, PCmidnightblue, PCred, PClightyellow, PCgrey, PCgreen, PCblue, PCgreenyellow]

# Now we create scatter plots of the samples (arrays) along the module eigengenes.
dME2=dME2[,hclustdME2$order]
pairs( dME2, upper.panel = panel.smooth, lower.panel = panel.cor,
    diag.panel=panel.hist, main="Relation between modules")

Comment: each dot represents a mouse. Above the diagonal are scatterplots. Below the diagonal are the corresponding absolute values of the correlations.
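Note that panel.smooth ships with base R graphics, whereas panel.cor and panel.hist do not and have to be defined before calling pairs. The sketch below shows one possible pair of definitions, loosely adapted from the examples in the help page of pairs; it is an assumption about what the tutorial's versions do, and the toy matrix is hypothetical.

```r
# Histogram panel for the diagonal of a pairs() plot.
panel.hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  nB <- length(h$breaks)
  y <- h$counts / max(h$counts)
  rect(h$breaks[-nB], 0, h$breaks[-1], y, col = "cyan")
}

# Panel that prints the absolute correlation, with text size growing with |r|.
panel.cor <- function(x, y, digits = 2, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y, use = "p"))
  text(0.5, 0.5, format(r, digits = digits), cex = 1 + r)
}

# Toy eigengene-like matrix: 30 samples x 3 modules.
set.seed(3)
toy <- matrix(rnorm(90), 30, 3,
              dimnames = list(NULL, c("PCblue", "PCbrown", "PCred")))
pairs(toy, upper.panel = panel.smooth, lower.panel = panel.cor,
      diag.panel = panel.hist, main = "Relation between modules")
```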

# Now we study how connectivity is related to mean gene expression or variance of gene expression
#### This is Supplementary Figure S3 in our article
par(mfrow=c(2,2))
whichmodule="blue"
# mean expression of the blue module genes
meanExprModule=apply( datExpr[,colorh1==whichmodule],2,mean1)
# variance of expression
varExprModule=apply( datExpr[,colorh1==whichmodule],2,var1)
ConnectivityModule= SoftConnectivity(datExpr[,colorh1==whichmodule], power=beta1)
verboseScatterplot(ConnectivityModule, varExprModule,
    xlab=paste("Connectivity (k.in)", whichmodule, " module"), ylab="Variance", col=whichmodule)
verboseScatterplot(ConnectivityModule, meanExprModule,
    xlab=paste("Connectivity (k.in)", whichmodule, " module"), ylab="Mean Expression", col=whichmodule)
meanExpr=apply( datExpr,2,mean1)
varExpr=apply( datExpr,2,var1)
verboseScatterplot(Connectivity, varExpr,
    xlab=paste("Whole Network Connectivity (k.all)"), ylab="Variance", col=colorh1)
verboseScatterplot(Connectivity, meanExpr,
    xlab=paste("Whole Network Connectivity (k.all)"), ylab="Mean Expression", col=colorh1)

In the co-expression network presented here, we find that the expression levels of hub genes are less variable (lower variance) across all mice than those of other, less connected nodes. This is consistent with the idea that the network's most highly connected hubs are resilient to large genetic background variations since they are vital for core biological functions.
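As a rough illustration of the quantities being compared, the sketch below computes a soft (weighted) connectivity and the per-gene variance on simulated data. It assumes that the connectivity of a gene is the sum of |correlation|^beta over all other genes, which is our reading of SoftConnectivity; the toy data, the power beta = 6 and all variable names are hypothetical.

```r
# Simulated expression data: 30 samples x 8 genes; genes 1-4 share a signal.
set.seed(4)
n_samples <- 30
n_genes <- 8
expr <- matrix(rnorm(n_samples * n_genes), n_samples, n_genes)
shared <- rnorm(n_samples)
expr[, 1:4] <- expr[, 1:4] + 2 * shared

beta <- 6  # hypothetical soft-thresholding power

# Weighted adjacency and connectivity (self-connections excluded).
adj <- abs(cor(expr, use = "p"))^beta
diag(adj) <- 0
k <- rowSums(adj)

# Per-gene mean and variance, to be plotted against connectivity.
gene_mean <- apply(expr, 2, mean)
gene_var  <- apply(expr, 2, var)
plot(k, gene_var, xlab = "Connectivity", ylab = "Variance")
```

In this toy example the co-expressed genes have the higher variance by construction, so it shows only how the quantities are computed, not the biological pattern described in the text.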

# The following produces heatmap plots for each module.
# Here the rows are genes and the columns are samples.
# Well-defined modules result in characteristic band structures since the corresponding genes are
# highly correlated.
ClusterSamples=hclust(dist(datExpr[,] ),method="average")
par(mfrow=c(3,1), mar=c(1, 2, 4, 1))
which.module="black"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ]) ),
    nrgcols=30, rlabels=TRUE, clabels=TRUE, rcols=which.module, title=which.module )
which.module="blue"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ]) ),
    nrgcols=30, rlabels=TRUE, clabels=TRUE, rcols=which.module, title=which.module )
which.module="brown"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ]) ),
    nrgcols=30, rlabels=TRUE, clabels=TRUE, rcols=which.module, title=which.module )

# The function intramodularConnectivity computes the whole network connectivity kTotal,
# the within module connectivity kWithin, kOut=kTotal-kWithin,
# and kDiff=kWithin-kOut=2*kWithin-kTotal

ConnectivityMeasures=intramodularConnectivity(abs(cor(datExpr,use="p"))^beta1,colorh1)
names(ConnectivityMeasures)
[1] "kTotal" "kWithin" "kOut" "kDiff"

# The following plots show the gene significance vs intramodular connectivity
colorlevels=levels(factor(colorh1))
par(mfrow=c(4,3))
for (i in 1:12) {
    whichmodule=colorlevels[i]; restrict1=colorh1==whichmodule
    verboseScatterplot(ConnectivityMeasures$kWithin[restrict1], GeneSignificance[restrict1,1],
        col=whichmodule, main=whichmodule, xlab="Intramodular k", ylab="Gene Signif")
}
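The four connectivity measures can be written out in a few lines of base R. The sketch below is not the package implementation of intramodularConnectivity; the function intramodular_k and the toy adjacency matrix are hypothetical and only illustrate the definitions kOut = kTotal - kWithin and kDiff = kWithin - kOut.

```r
# Connectivity measures from a symmetric adjacency matrix and module labels.
intramodular_k <- function(adj, colors) {
  diag(adj) <- 0
  same_module <- outer(colors, colors, "==")
  k_total  <- rowSums(adj)
  k_within <- rowSums(adj * same_module)
  data.frame(kTotal = k_total,
             kWithin = k_within,
             kOut = k_total - k_within,
             kDiff = 2 * k_within - k_total)
}

# Toy weighted network with 4 genes in 2 modules.
adj <- matrix(c(0.0, 0.8, 0.1, 0.2,
                0.8, 0.0, 0.3, 0.1,
                0.1, 0.3, 0.0, 0.6,
                0.2, 0.1, 0.6, 0.0), 4, 4)
colors <- c("blue", "blue", "brown", "brown")
km <- intramodular_k(adj, colors)
print(km)
```

A gene with a large positive kDiff is connected mostly to genes of its own module, which is what one expects of an intramodular hub.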

APPENDIX: Constructing an unweighted network and comparing it to the weighted network

Here we redo the network analysis using hard thresholding, i.e. dichotomizing the correlation matrix. We show that our main biological findings are highly robust with respect to the network construction method. We use the scale-free topology criterion to find the hard threshold parameter tau.

thresholds1= c(seq(.1,.5, by=.1), seq(.55,.95, by=.05) )
TableHard=pickHardThreshold(datExpr, thresholds1)[[2]]
gc()

[Output table with columns: Cut, p.value, scale.law.r.2, slope., truncated.r.2, mean.k., median.k., max.k.]

To choose a cut-off value tau, we propose to use the Scale-free Topology Criterion (Zhang and Horvath 2005). Here the focus is on the fitting index of the linear regression model (denoted as scale.law.r.2), which quantifies how well a network satisfies a scale-free topology. We choose the cut value (tau) of 0.7 for the correlation matrix since this is where the R^2 curve seems to saturate. From the above table, we find that the resulting slope looks OK (negative and between -1 and -2), and the mean number of connections looks good. Below we investigate different choices of tau.
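The idea behind the scale-free topology criterion can be sketched in base R: bin the connectivities, regress log10(p(k)) on log10(k), and report the R^2 and slope of the fit. The function scale_free_fit below is a hypothetical simplification and not the code behind pickHardThreshold; the simulated data and the three thresholds are likewise assumptions chosen for illustration.

```r
# Fit index of the model log10(p(k)) ~ log10(k) on binned connectivities.
scale_free_fit <- function(k, n_breaks = 10) {
  k <- k[k > 0]
  if (length(unique(k)) < 3) return(c(R2 = NA_real_, slope = NA_real_))
  bins <- cut(k, n_breaks)
  dk <- as.vector(tapply(k, bins, mean))      # mean connectivity per bin
  pk <- as.vector(table(bins)) / length(k)    # relative frequency per bin
  keep <- !is.na(dk) & pk > 0
  if (sum(keep) < 3) return(c(R2 = NA_real_, slope = NA_real_))
  x <- log10(dk[keep])
  y <- log10(pk[keep])
  fit <- lm(y ~ x)
  c(R2 = summary(fit)$r.squared, slope = unname(coef(fit)[2]))
}

# Simulated data: 300 genes coupled to one signal with varying strength.
set.seed(5)
n_samples <- 40
n_genes <- 300
signal <- rnorm(n_samples)
loadings <- rexp(n_genes)
expr <- outer(signal, loadings) +
  matrix(rnorm(n_samples * n_genes), n_samples, n_genes)
abs_cor <- abs(cor(expr))

# Evaluate a few hard thresholds tau.
taus <- c(0.3, 0.5, 0.7)
fits <- t(sapply(taus, function(tau) {
  adj <- (abs_cor > tau) + 0
  diag(adj) <- 0
  scale_free_fit(rowSums(adj))
}))
result <- data.frame(Cut = taus, fits)
print(result)
```

As in the text, one would look for a threshold with a high R^2 and a negative slope while keeping the mean connectivity reasonably large.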

par(mfrow=c(1,1))
plot(thresholds1, -sign(TableHard[,4])*TableHard[,3], type="n",
    ylab="Scale Free Topology R^2", xlab="Hard Threshold tau",
    ylim=range(min(c( -sign(TableHard[,4])*TableHard[,3]),na.rm=TRUE),1) )
text(thresholds1, -sign(TableHard[,4])*TableHard[,3], labels= thresholds1, col="black")
abline(h=.8)

[Figure: signed scale-free topology fit index plotted against the hard threshold; x axis: Hard Threshold tau; y axis: Scale Free Topology R^2; horizontal reference line at 0.8.]

tau1=.65 # this parameter is the hard threshold parameter.
# Let's define the adjacency matrix of an unweighted network
ADJHARD = I(abs(cor(datExpr[,],use="p"))>tau1)+0.0
gc()
# This is the unweighted connectivity
ConnectivityHard =as.vector(apply(ADJHARD,2,sum))
scaleFreePlot(ConnectivityHard,truncated=TRUE,main=paste("tau=",as.character(tau1)))

[Figure: scale-free topology plot titled "tau= 0.65, scale free R^2= 0.78, slope= -1.74, trunc.r^2= 0.99"; x axis: log10(k); y axis: log10(p(k)).]

# Let's compare weighted to unweighted connectivity in a scatter plot
verboseScatterplot(ConnectivityHard, Connectivity, xlab="Unweighted Connectivity",
    ylab="Weighted Connectivity", col= as.character(colorh1))

# Comment: the connectivity measure is highly preserved between the weighted and unweighted networks, but there are marked differences for the brown module.
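The comparison between the two connectivity measures can be reproduced in miniature on simulated data. The sketch below is illustrative only: the expression matrix, the soft-thresholding power 6 and the variable names are assumptions, while tau = 0.65 follows the value used above.

```r
# Simulated data: 30 co-expressed genes plus 70 unrelated genes, 50 samples.
set.seed(6)
n_samples <- 50
shared <- rnorm(n_samples)
module_genes <- sapply(1:30, function(i) 2 * shared + rnorm(n_samples))
other_genes <- matrix(rnorm(n_samples * 70), n_samples, 70)
expr <- cbind(module_genes, other_genes)

abs_cor <- abs(cor(expr))
diag(abs_cor) <- 0

k_soft <- rowSums(abs_cor^6)        # weighted connectivity (power 6 assumed)
k_hard <- rowSums(abs_cor > 0.65)   # unweighted connectivity at tau = 0.65

agreement <- cor(k_soft, k_hard)
plot(k_hard, k_soft, xlab = "Unweighted Connectivity",
     ylab = "Weighted Connectivity")
```

The unweighted connectivity takes only integer values, so genes with the same number of neighbors can still differ in weighted connectivity.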

# The following code computes the topological overlap matrix based on the
# adjacency matrix.
dissTOMhard=TOMdist(ADJHARD)
gc()
# Now we carry out hierarchical clustering with the TOM matrix.
hierTOMhard = hclust(as.dist(dissTOMhard),method="average");
# Next, we study whether the `soft' modules of the weighted network described above can also be
# found in the unweighted network.
# The following shows the hierarchical tree based on the unweighted network, but the
# genes are colored according to their membership in the weighted network.
par(mfrow=c(2,1))
plot(hierTOMhard, main="Unweighted Network Module Tree ", labels=FALSE, xlab="", sub="");
plotColorUnderTree(hierTOMhard, colors=colorh1)
title("Dynamic Colors, weighted network")

Comment: Overall the colors stay together. This is particularly true for the blue module, which is the main interest of our paper. This demonstrates that the module assignment is robust with respect to the network construction method.
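For readers who want to see what a topological overlap computation involves, here is a minimal base R sketch for an unweighted network. It assumes the usual definition TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij counts shared neighbors; the function tom_similarity and the 5-node toy network are hypothetical and are not the TOMdist implementation.

```r
# Topological overlap similarity from a symmetric 0/1 adjacency matrix.
tom_similarity <- function(adj) {
  diag(adj) <- 0
  shared <- adj %*% adj             # l_ij: number of shared neighbors
  k <- rowSums(adj)
  k_min <- outer(k, k, pmin)
  tom <- (shared + adj) / (k_min + 1 - adj)
  diag(tom) <- 1
  tom
}

# Toy network: triangle 1-2-3, plus edges 3-4 and 4-5.
adj <- matrix(0, 5, 5)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

tom <- tom_similarity(adj)
diss_tom <- 1 - tom

# Average linkage hierarchical clustering on the TOM-based dissimilarity.
tree <- hclust(as.dist(diss_tom), method = "average")
```

Nodes 1 and 4 are not directly connected but share neighbor 3, so their overlap is positive; this is how the measure picks up indirect relationships.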

# An alternative view of this is the so-called TOM plot that is generated by the
# function TOMplot
# Inputs: TOM distance measure, hierarchical (hclust) object, color
# Here we use the unweighted module tree but color it by the weighted modules.
TOMplot(dissTOMhard, hierTOMhard, as.character(colorh1))
gc()
# Comment: module assignment is highly preserved.
ConnectivityMeasuresHARD=intramodularConnectivity(ADJHARD,colorh1)

Appendix: Computation of the cluster coefficient

Although we don't discuss the clustering coefficient in our main article, we report it here for the sake of completeness since it is an important network concept. The cluster coefficient measures the cliquishness of a gene.

# First, we start with the weighted network
cluster.coef= clusterCoef(ADJ)
gc()
# Now we plot cluster coefficient versus connectivity
# for all genes
par(mfrow=c(1,1),mar=c(2,2,2,1))
plot(Connectivity, cluster.coef, col=as.character(colorh1), xlab="Connectivity", ylab="Cluster Coefficient")

Overall, we find that the clustering coefficient in a weighted network is roughly constant for highly connected genes inside a given module. Across modules the clustering coefficient varies a lot.
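A short base R sketch may help to make "cliquishness" concrete. It assumes the generalized definition C_i = sum_{j,k} a_ij a_jk a_ki / ((sum_j a_ij)^2 - sum_j a_ij^2), which reduces to the familiar fraction of connected neighbor pairs when the adjacency matrix is binary; the function cluster_coef and the toy network are hypothetical and are not the tutorial's implementation.

```r
# Clustering coefficient for a symmetric adjacency matrix with entries in [0, 1].
cluster_coef <- function(adj) {
  diag(adj) <- 0
  numer <- diag(adj %*% adj %*% adj)            # closed paths of length 3
  denom <- rowSums(adj)^2 - rowSums(adj^2)
  ifelse(denom > 0, numer / denom, NA)          # undefined for fewer than 2 neighbors
}

# Toy unweighted network: triangle 1-2-3, plus edges 3-4 and 4-5.
adj <- matrix(0, 5, 5)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

cc <- cluster_coef(adj)
k <- rowSums(adj)
plot(k, cc, xlab = "Connectivity", ylab = "Cluster Coefficient")
```

Because the same function accepts weights between 0 and 1, it can be applied to a soft-thresholded adjacency matrix as well as to a dichotomized one.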

# Now we compute the clustering coefficient (CC) for the unweighted network
diag(ADJHARD)=0
cluster.coefHARD= clusterCoef(ADJHARD)
ConnectivityHARD= apply(ADJHARD,2,sum)
par(mfrow=c(1,1))
plot(ConnectivityHARD, cluster.coefHARD, col=as.character(colorh1), xlab="Connectivity", ylab="Cluster Coefficient")

# There is a marked difference between the weighted network and the unweighted network when it comes to the relationship between clustering coefficient and connectivity. This is further discussed in Zhang and Horvath 2005 and the following reference: Horvath and Dong, Yip (2008) PLoS Comp Biol

THE END

To cite the code and methods in this manual, please use
Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article


Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

Linear Regression (1/1/17)

Linear Regression (1/1/17) STA613/CBB540: Statistical methods in computational biology Linear Regression (1/1/17) Lecturer: Barbara Engelhardt Scribe: Ethan Hada 1. Linear regression 1.1. Linear regression basics. Linear regression

More information

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics: Homework Assignment, Evolutionary Systems Biology, Spring 2009. Homework Part I: Phylogenetics: Introduction. The objective of this assignment is to understand the basics of phylogenetic relationships

More information

Comparative Network Analysis

Comparative Network Analysis Comparative Network Analysis BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2016 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by

More information

Gene mapping in model organisms

Gene mapping in model organisms Gene mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman Goal Identify genes that contribute to common human diseases. 2

More information

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein The parsimony principle: A quick review Find the tree that requires the fewest

More information

Overview of clustering analysis. Yuehua Cui

Overview of clustering analysis. Yuehua Cui Overview of clustering analysis Yuehua Cui Email: cuiy@msu.edu http://www.stt.msu.edu/~cui A data set with clear cluster structure How would you design an algorithm for finding the three clusters in this

More information

Computer simulation of radioactive decay

Computer simulation of radioactive decay Computer simulation of radioactive decay y now you should have worked your way through the introduction to Maple, as well as the introduction to data analysis using Excel Now we will explore radioactive

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Introduction to clustering methods for gene expression data analysis

Introduction to clustering methods for gene expression data analysis Introduction to clustering methods for gene expression data analysis Giorgio Valentini e-mail: valentini@dsi.unimi.it Outline Levels of analysis of DNA microarray data Clustering methods for functional

More information

Zhongyi Xiao. Correlation. In probability theory and statistics, correlation indicates the

Zhongyi Xiao. Correlation. In probability theory and statistics, correlation indicates the Character Correlation Zhongyi Xiao Correlation In probability theory and statistics, correlation indicates the strength and direction of a linear relationship between two random variables. In general statistical

More information

Microarray data analysis

Microarray data analysis Microarray data analysis September 20, 2006 Jonathan Pevsner, Ph.D. Introduction to Bioinformatics pevsner@kennedykrieger.org Johns Hopkins School of Public Health (260.602.01) Copyright notice Many of

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia Expression QTLs and Mapping of Complex Trait Loci Paul Schliekelman Statistics Department University of Georgia Definitions: Genes, Loci and Alleles A gene codes for a protein. Proteins due everything.

More information

Teachers Guide. Overview

Teachers Guide. Overview Teachers Guide Overview BioLogica is multilevel courseware for genetics. All the levels are linked so that changes in one level are reflected in all the other levels. The BioLogica activities guide learners

More information

Statistics Toolbox 6. Apply statistical algorithms and probability models

Statistics Toolbox 6. Apply statistical algorithms and probability models Statistics Toolbox 6 Apply statistical algorithms and probability models Statistics Toolbox provides engineers, scientists, researchers, financial analysts, and statisticians with a comprehensive set of

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Statistical issues in QTL mapping in mice

Statistical issues in QTL mapping in mice Statistical issues in QTL mapping in mice Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman Outline Overview of QTL mapping The X chromosome Mapping

More information

Haploid & diploid recombination and their evolutionary impact

Haploid & diploid recombination and their evolutionary impact Haploid & diploid recombination and their evolutionary impact W. Garrett Mitchener College of Charleston Mathematics Department MitchenerG@cofc.edu http://mitchenerg.people.cofc.edu Introduction The basis

More information

Supplementary Discussion:

Supplementary Discussion: Supplementary Discussion: I. Controls: Some important considerations for optimizing a high content assay Edge effect: The external rows and columns of a 96 / 384 well plate are the most affected by evaporation

More information

Experimental Design. Experimental design. Outline. Choice of platform Array design. Target samples

Experimental Design. Experimental design. Outline. Choice of platform Array design. Target samples Experimental Design Credit for some of today s materials: Jean Yang, Terry Speed, and Christina Kendziorski Experimental design Choice of platform rray design Creation of probes Location on the array Controls

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

Introduction to clustering methods for gene expression data analysis

Introduction to clustering methods for gene expression data analysis Introduction to clustering methods for gene expression data analysis Giorgio Valentini e-mail: valentini@dsi.unimi.it Outline Levels of analysis of DNA microarray data Clustering methods for functional

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Human vs mouse Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University www.biostat.jhsph.edu/~kbroman [ Teaching Miscellaneous lectures] www.daviddeen.com

More information