
SUPPLEMENTARY METHODS

Data combination and normalization

Prior to data analysis, we first had to combine all 1617 arrays such that probeset identifiers matched to homologous genes. Probeset annotation files for each array platform were used to identify the gene symbol associated with each probeset, and this gene symbol was mapped to a unique identifier for homologous gene groups provided by the homologene database. Data were then combined from all experiments by matching homologene identifiers. To allow for comparison of absolute gene expression data between phenotypes across different experimental data sets, normalization of raw expression data was performed prior to all analyses. Median absolute deviation normalization of log transformed expression values was chosen for its performance advantages in sparse datasets [1]. Via this method, intensity-dependent differences in expression level were minimized for between-sample comparisons, as evidenced by M-A plot (Supplementary Figure 1). Of note, the Pearson correlation is invariant under data transformation of location and scale and thus this normalization step does not affect gene expression network analysis.

Species specificity

We next analyzed species specificity of gene coexpression to assess for primacy of species over phenotype. Experiments were clustered according to shared expression profiles using agglomerative hierarchical clustering. As shown in Supplementary Figure 2, array experiments clustered tightly by species, indicating a significant species dependency for gene coexpression. As such, datasets from the species with the most complete data with regard to phenotype, mouse, were chosen for further analysis.

Forming networks

To perform network modeling we used weighted gene coexpression analysis as described [2]. The vertices of a weighted gene co-expression graph represent genes, and the connection strengths (adjacencies) between them form the edges that connect vertices. The connectivity $k_i$ of the i-th vertex (gene) is defined as the sum of its adjacencies with the other nodes in the network. The frequency distribution of the connectivity often (but not always) follows a power law, which is referred to as scale free topology. Many (but not all) biologic networks have been found to satisfy scale free topology [3, 4]. The connection strength or adjacency was determined by the Pearson correlation of the corresponding expression profiles.
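The invariance property cited under "Data combination and normalization" can be checked numerically. The following is a minimal sketch on synthetic data, assuming that median absolute deviation scaling means centering a profile on its median and dividing by its median absolute deviation; the function names are illustrative and are not taken from the original analysis. It shows only that one location and one positive scale applied to a whole profile leave the Pearson correlation unchanged.

```python
import numpy as np

def mad_scale(values):
    """Center a vector on its median and divide by its median absolute deviation."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    mad = np.median(np.abs(values - center))
    return (values - center) / mad

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic log-scale profiles (assumed data, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(8.0, 2.0, size=50)
y = 0.6 * x + rng.normal(0.0, 1.0, size=50)

r_raw = pearson(x, y)
r_scaled = pearson(mad_scale(x), mad_scale(y))         # after MAD scaling
r_affine = pearson(3.0 * x + 5.0, 0.25 * y - 2.0)      # arbitrary shift and positive scale
```

All three correlations agree up to floating-point error, because each transformation is a shift plus a positive rescaling of one profile.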

Specifically, the connection strength (adjacency) $a_{ij}$ between expression profiles i and j is a power of the absolute value of their Pearson correlation. Forming the power of a correlation amounts to a soft-thresholding approach that emphasizes large correlations at the expense of low correlations and obviates the need for arbitrary thresholds for adjacency, thus preserving the continuous nature of the coexpression information. The extent to which a network satisfies scale free topology can be measured using a model fitting index $R^2$ [2]. Supplementary Figures 3 and 4 show how the scale free topology model fitting index $R^2$ (y-axis) depends on the soft threshold β (x-axis). We chose β according to the scale free topology criterion, which amounts to choosing the smallest β that leads to approximate scale free topology [2]. To facilitate comparisons, we chose β = 9 for all networks. A major advantage of weighted co-expression networks is that they are statistically highly robust with regard to the chosen threshold β.

The final adjacency was further defined using the topological overlap [5]. This network adjacency describes the commonality $t_{ij}$ of network neighbors for genes i and j and is given by:

$$
t_{ij} =
\begin{cases}
\dfrac{I_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}} & \text{if } i \neq j \\[2ex]
1 & \text{if } i = j
\end{cases}
\tag{1}
$$

where $I_{ij} = \sum_{u \neq i,j} a_{iu} a_{uj}$, $k_i = \sum_{u \neq i} a_{iu}$, and $k_j = \sum_{u \neq j} a_{uj}$, for $a_{iu}$ and $a_{uj}$ given by soft-thresholded Pearson correlation as described above. For all $t_{ij}$ a distance $d_{ij} = 1 - t_{ij}$ was defined, giving a symmetrical distance matrix of rows and columns. This distance matrix was used as the input for agglomerative hierarchical clustering in the training set of 43 fetal samples and subsequently in a validation set of 21 fetal samples and in other phenotypes. Separate distance matrices were constructed for each phenotype and for training and validation fetal datasets. The dynamic tree cut algorithm was used to define gene modules from the resultant hierarchical clustering.
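The chain from soft-thresholded adjacency to topological overlap, distance, and hierarchical clustering can be sketched as follows. This is an illustration on synthetic data, not the code used for the analysis: the function names are hypothetical, the expression matrix is random, and average linkage is an assumption, since the linkage method is not stated in the text. The dynamic tree cut step is not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def soft_threshold_adjacency(expr, power):
    """expr is samples x genes; returns |Pearson correlation| ** power, zero diagonal."""
    corr = np.corrcoef(expr, rowvar=False)
    adjacency = np.abs(corr) ** power
    np.fill_diagonal(adjacency, 0.0)
    return adjacency

def topological_overlap(adjacency):
    """Topological overlap t_ij computed from a symmetric adjacency with zero diagonal."""
    # With a zero diagonal, (A @ A)[i, j] is the sum over u != i, j of a_iu * a_uj.
    shared = adjacency @ adjacency
    k = adjacency.sum(axis=1)                  # connectivity k_i
    k_min = np.minimum.outer(k, k)             # min{k_i, k_j}
    tom = (shared + adjacency) / (k_min + 1.0 - adjacency)
    np.fill_diagonal(tom, 1.0)                 # t_ii = 1 by definition
    return tom

# Synthetic expression matrix (assumed data): 43 samples x 30 genes.
rng = np.random.default_rng(1)
expr = rng.normal(size=(43, 30))

adjacency = soft_threshold_adjacency(expr, power=9)   # power 9 as chosen in the text
tom = topological_overlap(adjacency)
dist = 1.0 - tom
dist = 0.5 * (dist + dist.T)                          # remove floating-point asymmetry
# Average linkage is an assumption made for this sketch.
tree = linkage(squareform(dist, checks=False), method="average")
```

The resulting linkage matrix has one row per merge and could be passed to any tree-cutting routine.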
For the dynamic tree cut algorithm, minimum cluster size was set at 25, the hybrid tree cut algorithm was chosen, deep split was set to one, and no minimum or maximum absolute core scatter was defined; sensitivity analysis for the parameters chosen for this function is described in Supplementary Figure 5. The 90 resulting networks were subsequently internally validated in the validation set using an iterative approach as described in the main manuscript (Supplementary Figure 6). The degree to which topology of these validated modules was shared between phenotypes was then evaluated using the same iterative approach. To illustrate topology of key marker genes previously identified to be associated with cardiac development and disease, gene coexpression topology of networks containing these genes is displayed in Supplementary Figure 7.

Transcription factor target identification

Transcription factor target identification was performed to assess for common transcription factor modulation of gene coexpression modules. We applied our previously described phylogenetic footprinting approach to determine putative gene targets for a large set of vertebrate transcription factors in the mouse genome [6]. We first clustered all eukaryotic transcription factor motifs from TRANSFAC v10.2 into 235 representative motifs as described [7, 8]. For each of the 235 motifs provided in the TRANSFAC database we scanned the mouse 1 kb promoter regions to first identify matches with a p-value <= …. We further filtered these matches to include only the binding sites that are either 80% conserved between human and mouse genomes, based on the genome alignment provided in the UCSC database, or have a match p-value <= …. In a cross validation with experimentally verified binding sites, these criteria yielded a false positive rate of 1 in 50 kb of genomic data searched, or a false discovery rate of 1 transcription factor for every 50 genes searched in our algorithm. The details of the method are as described [6, 8].
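The two-stage filter described above (an initial p-value cutoff, then retention of sites that are either conserved or pass a second p-value cutoff) can be sketched as follows. The `MotifMatch` record, the function name, and the motif and gene labels are hypothetical. The two p-value cutoffs are not given in this text, so they are left as caller-supplied parameters; the numbers in the example are arbitrary placeholders, not the thresholds used in the study. Treating "80% conserved" as a fraction of at least 0.80 is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class MotifMatch:
    motif: str           # hypothetical motif label
    gene: str            # hypothetical gene label
    p_value: float       # match p-value from the promoter scan
    conservation: float  # assumed: human-mouse conservation as a fraction in [0, 1]

def filter_matches(matches, p_initial, p_strict, min_conservation=0.80):
    """Keep matches that pass the initial scan and are either conserved or very strong."""
    kept = []
    for match in matches:
        if match.p_value > p_initial:
            continue  # fails the initial promoter scan
        if match.conservation >= min_conservation or match.p_value <= p_strict:
            kept.append(match)
    return kept

# Placeholder data and placeholder cutoffs, for illustration only.
matches = [
    MotifMatch("MOTIF_A", "GENE_1", 5e-4, 0.90),  # passes scan, conserved: kept
    MotifMatch("MOTIF_A", "GENE_2", 5e-4, 0.40),  # passes scan, neither criterion: dropped
    MotifMatch("MOTIF_B", "GENE_3", 1e-6, 0.10),  # not conserved but very strong: kept
    MotifMatch("MOTIF_B", "GENE_4", 5e-2, 0.95),  # fails the initial scan: dropped
]
kept = filter_matches(matches, p_initial=1e-3, p_strict=1e-5)
kept_genes = [match.gene for match in kept]
```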

SUPPLEMENTARY FIGURE 1

Supplementary Figure 1. Normalization of gene expression data using mean absolute deviation method reduces intensity-dependent differences for experiments from different experimental conditions. Panel A depicts the relationship between mean and difference between experimental arrays for all experiments within the same experimental set (top panel, i.e., similar laboratory conditions) and for experiments not within the same experimental set (bottom panel, i.e., different laboratory conditions) before performing normalization. Horizontal scatter envelopes centered at 0 indicate minimal intensity-dependent differences in expression level. Panel B depicts the same relationships after performing normalization and indicates minimization of intensity-dependent differences in expression among disparate experimental conditions.

SUPPLEMENTARY FIGURE 2

Supplementary Figure 2. Gene expression profiles from different experimental conditions cluster by species. Hierarchical clustering of experimental conditions was performed to evaluate for species-specific patterns of gene coexpression. The top half of each figure depicts the hierarchical clustering of experiments within each experimental phenotype and the bottom half contains a color-keyed representation of species identifiers for each experiment. In normal adult (A), fetal (B), failing (C), and hypertrophied (D) cardiac tissue, experiments clustered tightly by species, indicating strong species dependency of gene coexpression. Abbreviations: H = human, M = mouse, R = rat, D = dog.

SUPPLEMENTARY FIGURE 3

Supplementary Figure 3. Gene coexpression network adjacency soft threshold for A) normal adult and B) fetal cardiac tissue was chosen to approximate scale-free topology. The left-hand portion of each panel depicts the relationship between soft-thresholding coefficient (x axis, numbered in red) and the r value for the linear regression (y axis, slope constrained to -1) between the logarithm of the connectivity and the logarithm of the proportion of genes with that connectivity. The right-hand portion of each panel depicts the relationship between the soft-thresholding coefficient and mean connectivity.
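The regression behind the left-hand panels can be sketched as follows: connectivities are binned, and the logarithm of the proportion of genes in each bin is regressed on the logarithm of the bin's connectivity. This is an illustration only. The function name, the use of ten equal-width bins, and the toy connectivity values are assumptions, and the sketch fits a free slope rather than the constrained slope mentioned in the caption.

```python
import numpy as np

def scale_free_fit(connectivity, n_bins=10):
    """Return (slope, R^2) of log10 p(k) regressed on log10 k over equal-width bins."""
    connectivity = np.asarray(connectivity, dtype=float)
    counts, edges = np.histogram(connectivity, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = (counts > 0) & (centers > 0)          # logarithms need positive values
    log_k = np.log10(centers[keep])
    log_p = np.log10(counts[keep] / counts.sum())
    slope, _intercept = np.polyfit(log_k, log_p, 1)
    r = np.corrcoef(log_k, log_p)[0, 1]
    return float(slope), float(r ** 2)

# Toy connectivities (assumed data): counts fall off roughly as 1 / k^2.
k_values = np.arange(1, 11)
gene_counts = (1000 / k_values ** 2).astype(int)
connectivity = np.repeat(k_values, gene_counts)

slope, r_squared = scale_free_fit(connectivity)
```

Repeating the fit over a range of soft-thresholding powers and plotting the resulting index against the power would give a curve of the kind described in the caption.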

SUPPLEMENTARY FIGURE 4

Supplementary Figure 4. Gene coexpression network adjacency soft threshold for A) failing and B) hypertrophied cardiac tissue was chosen to approximate scale-free topology. Panels are as described above in Supplementary Figure 3.

SUPPLEMENTARY FIGURE 5

Supplementary Figure 5. Sensitivity analysis for parameters used in the dynamic tree cut algorithm. The z-axis describes module reproducibility averaged over all derived modules, which is in turn given by the Pearson correlation between vectors describing intra-modular connectivity for each gene. The chosen parameters for the dynamic tree cut algorithm (deepsplit = 1, labelunlabeled = TRUE) gave the highest average reproducibility of modules in the fetal dataset.
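The reproducibility measure named in this caption can be sketched as follows, assuming that the two vectors being correlated are the intra-modular connectivities of the same module genes computed once in a training network and once in a validation network. The function names and the random adjacency matrices are illustrative and are not taken from the original analysis.

```python
import numpy as np

def intramodular_connectivity(adjacency, module_genes):
    """Sum of each module gene's adjacencies to the other genes in the same module."""
    sub = adjacency[np.ix_(module_genes, module_genes)].copy()
    np.fill_diagonal(sub, 0.0)
    return sub.sum(axis=1)

def module_reproducibility(adj_train, adj_valid, module_genes):
    """Pearson correlation between training and validation intra-modular connectivity."""
    k_train = intramodular_connectivity(adj_train, module_genes)
    k_valid = intramodular_connectivity(adj_valid, module_genes)
    return float(np.corrcoef(k_train, k_valid)[0, 1])

def random_adjacency(rng, n_genes):
    """Symmetric random adjacency in [0, 1] with zero diagonal (synthetic data)."""
    upper = np.triu(rng.uniform(size=(n_genes, n_genes)), k=1)
    return upper + upper.T

rng = np.random.default_rng(2)
adj_train = random_adjacency(rng, 40)
adj_valid = random_adjacency(rng, 40)
module_genes = np.arange(25)   # hypothetical module of 25 genes

same_network = module_reproducibility(adj_train, adj_train, module_genes)
different_network = module_reproducibility(adj_train, adj_valid, module_genes)
```

Comparing a network with itself gives a correlation of 1, while two unrelated random networks give a value somewhere in [-1, 1]; averaging this quantity over modules gives a summary of the kind plotted on the z-axis.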

SUPPLEMENTARY FIGURE 6

Supplementary Figure 6. Identification and validation of fetal gene modules. Genes were clustered first by topological overlap in the training dataset (n = 43 samples) and then evaluated for significant reproducibility in the validation dataset (n = 21 samples). A) Cluster dendrogram and barplot describing module membership for each of the 12,620 genes. B) Cluster dendrogram and barplot describing module membership for each of the 12,620 genes (as defined in the training set) in the validation dataset. C) Significance (-log p value) of module reproducibility in the validation set. Colors below the bar graph correspond to the module membership as identified in the barplots in the above panels. Red line indicates cutoff for statistical significance (p < 5.5 x 10^-4, or -log(p value) > 3.26). Seventy-two modules met this criterion.
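The two forms of the cutoff in this caption can be checked with a few lines of arithmetic. The sketch below also notes that the cutoff is numerically close to 0.05 divided by the 90 modules tested; that reading as a Bonferroni correction is an inference and is not stated in the text, and the base-10 logarithm is likewise an assumption that the arithmetic happens to support.

```python
import math

reported_cutoff = 5.5e-4   # p-value cutoff given in the caption
n_modules = 90             # number of modules entering validation, from the methods
alpha = 0.05               # assumed family-wise error rate; not stated in the text

# The caption's second form of the cutoff, assuming a base-10 logarithm.
neg_log10_cutoff = -math.log10(reported_cutoff)

# A Bonferroni threshold under the assumed alpha, for comparison only.
bonferroni_cutoff = alpha / n_modules
```

Rounded to two decimals, `neg_log10_cutoff` matches the 3.26 in the caption, and `bonferroni_cutoff` is about 5.56 x 10^-4.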

Dewey, et al

SUPPLEMENTARY FIGURE 7

Supplementary Figure 7. Topology of developmental gene coexpression modules containing key marker genes. The hub genes for modules containing A) NPPA and MYH7 and B) SLC2A4 are depicted in red, while marker genes are depicted in yellow and all other nodes are depicted in green. Only edges with adjacency weight > 0.7 are shown for clarity. The proximity of each gene to the center of the figure indicates its connectivity, or sum of connection weights.

SUPPLEMENTARY TABLE 1.

Term    Count    %    P value    Benjamini corrected P value

GO: ~proteasomal protein catabolic process
GO: ~proteasomal ubiquitin-dependent protein catabolic process
GO: ~negative regulation of protein ubiquitination
GO: ~negative regulation of protein modification process
GO: ~anaphase-promoting complex-dependent proteasomal ubiquitin-dependent protein catabolic process
GO: ~negative regulation of ubiquitin-protein ligase activity during mitotic cell cycle
GO: ~negative regulation of ligase activity
GO: ~negative regulation of ubiquitin-protein ligase activity
GO: ~positive regulation of ubiquitin-protein ligase activity during mitotic cell cycle
GO: ~regulation of protein ubiquitination
GO: ~positive regulation of ubiquitin-protein ligase activity
GO: ~regulation of ubiquitin-protein ligase activity during mitotic cell cycle
GO: ~positive regulation of ligase activity
GO: ~regulation of ubiquitin-protein ligase activity
GO: ~regulation of ligase activity
GO: ~positive regulation of macromolecule metabolic process
GO: ~ubiquitin-dependent protein catabolic process
GO: ~positive regulation of protein ubiquitination
GO: ~negative regulation of cellular protein metabolic process
GO: ~regulation of cellular protein metabolic process
GO: ~negative regulation of protein metabolic process
GO: ~positive regulation of protein modification process
GO: ~positive regulation of cellular protein metabolic process
GO: ~translation
GO: ~positive regulation of protein metabolic process
GO: ~ER to Golgi vesicle-mediated transport
GO: ~proteoglycan metabolic process
GO: ~negative regulation of programmed cell death
GO: ~negative regulation of cell death
GO: ~mitotic cell cycle
GO: ~Golgi vesicle transport
GO: ~negative regulation of catalytic activity
GO: ~glycosaminoglycan biosynthetic process
GO: ~ncRNA metabolic process
GO: ~aminoglycan biosynthetic process
GO: ~positive regulation of cellular biosynthetic process
GO: ~axon regeneration in the peripheral nervous system
GO: ~ER-associated protein catabolic process
GO: ~modification-dependent macromolecule catabolic process
GO: ~modification-dependent protein catabolic process
GO: ~positive regulation of biosynthetic process
GO: ~regulation of protein modification process
GO: ~translational elongation
GO: ~negative regulation of apoptosis
GO: ~proteoglycan biosynthetic process
GO: ~response to toxin
GO: ~synapse organization
GO: ~neuromuscular process
GO: ~cell cycle
GO: ~positive regulation of macromolecule biosynthetic process
GO: ~microtubule-based process
GO: ~glycoprotein metabolic process
GO: ~proteolysis involved in cellular protein catabolic process
GO: ~cellular protein catabolic process
GO: ~anti-apoptosis
GO: ~regeneration
GO: ~maternal process involved in female pregnancy
GO: ~extracellular structure organization
GO: ~ossification
GO: ~sulfur metabolic process
GO: ~synaptogenesis
GO: ~palate development
GO: ~protein catabolic process
GO: ~cell cycle process
GO: ~protein ubiquitination
GO: ~negative regulation of molecular function
GO: ~protein transport
GO: ~bone development
GO: ~leukocyte homeostasis
GO: ~positive regulation of gene expression
GO: ~establishment of protein localization
GO: ~positive regulation of catalytic activity
GO: ~positive regulation of nitrogen compound metabolic process
GO: ~response to wounding
GO: ~positive regulation of T cell proliferation
GO: ~intracellular transport
GO: ~protein modification by small protein conjugation
GO: ~cellular macromolecule catabolic process
GO: ~cell-cell signaling
GO: ~Wnt receptor signaling pathway
GO: ~synaptic transmission
GO: ~proteolysis

Supplementary Table 1. Over-represented biological processes according to gene ontology (GO) in the fetal module recapitulated in heart failure and hypertrophy.

SUPPLEMENTARY TABLE 2.

Entrez ID    Gene Name
             Der1-like domain family, member
             F-box protein
             UBX domain protein
             anaphase promoting complex subunit 11
             cell division cycle 26 homolog (S. cerevisiae); cell division cycle 26 homolog (S. cerevisiae) pseudogene
5705         proteasome (prosome, macropain) 26S subunit, ATPase,
             proteasome (prosome, macropain) subunit, alpha type,
             ubiquitin B
7324         ubiquitin-conjugating enzyme E2E 1

Supplementary Table 2. Summary of genes involved in protein catabolism biological process from fetal module recapitulated in heart failure and hypertrophy.

REFERENCES

1. Barbacioru CC, Wang Y, Canales RD, Sun YA, Keys DN, Chan F, Poulter KA, Samaha RR. Effect of various normalization methods on Applied Biosystems expression array system data. BMC Bioinformatics. 2006;7:
2. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4:Article
3. Barabasi AL, Albert R. Emergence of scaling in random networks. Science. 1999;286:
4. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization of metabolic networks. Nature. 2000;407:
5. Yip AM, Horvath S. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics. 2007;8:
6. Levy S, Hannenhalli S. Identification of transcription factor binding sites in the human genome sequence. Mamm Genome. 2002;13:
7. Wingender E, Dietze P, Karas H, Knuppel R. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 1996;24:
8. Hannenhalli S, Putt ME, Gilmore JM, Wang J, Parmacek MS, Epstein JA, Morrisey EE, Margulies KB, Cappola TP. Transcriptional genomics associates FOX transcription factors with human heart failure. Circulation. 2006;114:
