Network Biology-part II Jun Zhu, Ph. D. Professor of Genomics and Genetic Sciences Icahn Institute of Genomics and Multi-scale Biology The Tisch Cancer Institute Icahn Medical School at Mount Sinai New York, NY http://research.mssm.edu/integrative-network-biology/ Email: jun.zhu@mssm.edu @IcahnInstitute
Why is it so hard to model biological systems? The more we learn, the more complicated it becomes! Epigenetic regulation: heritable changes in gene function that cannot be explained by changes in DNA sequence (DNA methylation, chromatin structure). Junk DNA? Post-transcriptional regulation: splicing (1981), RNA editing (1986), miRNA-mediated regulation (1993). It is not one gene to one protein anymore! Post-translational regulation: phosphorylation, glycosylation, acetylation.
How to model biological systems: Types of Network models time-dependent networks discrete, continuous Differential equations Prediction vs explanation Phenomenologically predictive networks correlation based, dependency nets, Explanatory pictorial Deterministic vs. Stochastic (probabilistic) Concentration vs. Bayesian nets for up/down/nc
Biological networks/pathways Observation-> description-> explanation-> prediction Data required to train models Gene sets Association networks Probabilistic causal networks Mechanism based models Biological details revealed
Theory of network biology: how are biological processes regulated? Observation-> description-> explanation-> prediction Transcription factor Multiple genes
Microarray: revolutionized the way we query a biological system 1995: Patrick Brown reported a proof-of-concept study
Microarray: two-channel system vs one-channel system Short-term cost, long-term cost, accuracy
Microarray: what are the assumptions and limitations? In the late 1990s, many EST libraries were sequenced and the human and mouse genomes were close to finished; Assuming all gene transcripts were known; Assuming all gene isoforms were known; Assuming there were no SNPs within probes. Outdated and will be replaced by RNA-seq: mRNAs (isoforms, allele-specific expression), miRNAs, long non-coding RNAs
Gene set enrichment analysis Transcription factor Multiple genes
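Gene set enrichment: Fisher's exact test (hypergeometric distribution cdf). Background: $M$ genes; foreground (pathway): $K$ genes; signature: $N$ genes; observed overlap: $x$ genes. $F(x-1 \mid M, K, N) = \sum_{i=0}^{x-1} \frac{\binom{K}{i}\binom{M-K}{N-i}}{\binom{M}{N}}$; the enrichment p-value is the upper tail, $p = 1 - F(x-1 \mid M, K, N)$. Problem: have to define a cutoff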
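The hypergeometric calculation can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the lecture's own code; the function names and the toy numbers (20 background genes, 5 pathway genes, 5 signature genes, 3 in the overlap) are assumptions made for the example.

```python
from math import comb


def hypergeom_cdf(x, M, K, N):
    """F(x | M, K, N): probability of at most x pathway genes in the signature."""
    if x < 0:
        return 0.0
    upper = min(x, K, N)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    hits = sum(comb(K, i) * comb(M - K, N - i) for i in range(upper + 1))
    return hits / comb(M, N)


def enrichment_pvalue(x, M, K, N):
    """Upper tail P(X >= x) = 1 - F(x - 1 | M, K, N)."""
    return 1.0 - hypergeom_cdf(x - 1, M, K, N)


# Hypothetical toy numbers, chosen only for illustration.
M, K, N, x = 20, 5, 5, 3
p = enrichment_pvalue(x, M, K, N)
print(f"enrichment p-value = {p:.4f}")
```

The signature size N depends on the fold-change or p-value cutoff used to call genes differentially expressed, which is the "have to define a cutoff" problem noted on the slide.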
Gene set enrichment: a non-parametric test Kolmogorov-Smirnov test: Order genes by fold changes or p-value Test whether genes involved in a Pathway are randomly distributed. pathway background Observed pathway
Gene set enrichment Analysis (GSEA) Difference from Kolmogorov-Smirnov test: using weighted sum Subramanian et al, PNAS, 2005
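The difference between the Kolmogorov-Smirnov statistic and the GSEA weighted sum can be sketched as a running sum over the ranked gene list. This is a minimal illustration following the enrichment score described in Subramanian et al. (2005): each pathway gene adds $|r_j|^p$ (normalized), each non-pathway gene subtracts a constant, and the score is the largest deviation from zero. With exponent $p = 0$ every hit counts equally, which gives the unweighted KS-like statistic. The gene names and scores are hypothetical, and the permutation step used to assess significance is omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Maximum deviation from zero of the weighted running sum."""
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, in_set)]
    norm = sum(weights)
    if norm == 0 or n_miss == 0:
        raise ValueError("need weighted pathway genes and at least one non-pathway gene")
    running, best = 0.0, 0.0
    for w, hit in zip(weights, in_set):
        # Step up at pathway genes, step down at all other genes.
        running += w / norm if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list: genes ordered by decreasing fold change.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
pathway = {"g1", "g4"}

es_ks = enrichment_score(genes, scores, pathway, p=0.0)    # unweighted, KS-like
es_gsea = enrichment_score(genes, scores, pathway, p=1.0)  # weighted sum
print(es_ks, es_gsea)
```

In this toy case the weighted score is larger because the pathway gene at the top of the list carries most of the weight.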
Gene set enrichment analysis: what are the assumptions and limitations? Can only analyze known pathways. Don't know how genes involved in a pathway are regulated. What are direct interactions and secondary interactions? Don't know how multiple pathways interact with each other if multiple pathways are involved. Transcription factor Multiple genes
How to define association? Association of two genes is context dependent: protein-protein interaction by Y2H experiments; co-cited in literature; protein-DNA interaction (ChIP-on-chip, ChIP-seq); correlation of mRNA expression levels
Yeast-2-hybrid system Gene fusion Gene fusion reporter gene Limitations: high false positive and false negative rates; only for soluble proteins; not in a physiological condition Lodish, et al., Molecular Cell Biology
Protein-protein interaction networks Stelzl et al, Cell, 2005 Genes in a pathway interact with each other; discover new members in the pathway
Are all genes equally important? Degree distribution: how many connections does a gene have? The majority of genes connect with a small number of genes, while a smaller number of genes connect to a large number of genes. Scale-free: log-log linear, $p(k) \sim k^{-\gamma}$, i.e. $\log p(k) \sim -\gamma \log k$. Clustering coefficient: $CC_p = \frac{2 n_p}{k_p (k_p - 1)}$, where $k_p$ is the degree of node $p$ and $n_p$ is the number of links among its $k_p$ neighbors.
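Both quantities are easy to compute on a small graph. The sketch below is a minimal illustration on a hypothetical five-node network (the edge list and function names are made up for the example). It computes the empirical degree distribution $p(k)$ and the clustering coefficient of each node as twice the number of links among its neighbors divided by $k(k-1)$.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy network; nodes stand in for genes or proteins.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]


def build_adjacency(edge_list):
    """Undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """p(k): fraction of nodes that have exactly k connections."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / len(adj) for k, c in sorted(counts.items())}


def clustering_coefficient(adj, node):
    """2 * (links among neighbors) / (k * (k - 1)); 0 when k < 2."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


adj = build_adjacency(edges)
print(degree_distribution(adj))
for node in sorted(adj):
    print(node, clustering_coefficient(adj, node))
```

On a real network, plotting $\log p(k)$ against $\log k$ and checking for an approximately straight line is the usual first look at whether the degree distribution is scale-free.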
Different types of Complex Networks Degree distribution Clustering coefficient Barabasi and Oltvai, 2004
Protein-DNA interaction: chromatin immunoprecipitation (ChIP): To find transcription factor binding targets
Association by gene expression correlation: how strong should the correlation of mRNA expression levels be? The p-value cutoff for correlation: assuming two expression levels are independent. FDR (False Discovery Rate) by permutation: no explicit assumption; data set specific. $\mathrm{FDR} = \frac{\text{total false positives}}{\text{positives detected}}$
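The permutation idea can be sketched as follows: count the gene pairs that pass the cutoff in the real data, shuffle each gene's samples independently to destroy any true co-expression, count again, and take the ratio. This is a minimal illustration under stated assumptions, not the procedure used in the lecture: the data are simulated, the function names are hypothetical, and a cutoff on the absolute Pearson correlation stands in for the p-value cutoff.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def count_pairs(data, cutoff):
    """Number of gene pairs with |r| at or above the cutoff."""
    genes = list(data)
    return sum(
        1
        for i in range(len(genes))
        for j in range(i + 1, len(genes))
        if abs(pearson(data[genes[i]], data[genes[j]])) >= cutoff
    )


def permutation_fdr(data, cutoff, rng):
    """FDR estimate = pairs passing in permuted data / pairs passing in real data."""
    observed = count_pairs(data, cutoff)
    permuted = {g: rng.sample(v, len(v)) for g, v in data.items()}
    false_pos = count_pairs(permuted, cutoff)
    fdr = false_pos / observed if observed else float("nan")
    return observed, false_pos, fdr


# Simulated data (assumption): 10 genes share a common factor, 20 are pure noise.
rng = random.Random(0)
n_samples = 20
factor = [rng.gauss(0, 1) for _ in range(n_samples)]
data = {}
for i in range(30):
    noise = [rng.gauss(0, 1) for _ in range(n_samples)]
    if i < 10:
        data[f"gene{i}"] = [f + 0.2 * e for f, e in zip(factor, noise)]
    else:
        data[f"gene{i}"] = noise

observed, false_pos, fdr = permutation_fdr(data, cutoff=0.7, rng=rng)
print(observed, false_pos, fdr)
```

A single permutation gives a noisy estimate; in practice the permuted count would typically be averaged over several permutations.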
Selecting a threshold for Gene-Gene Correlation (GGC) of 25,000 genes on a microarray chip
p-value <    total positive (from data)    false positive (from permuted data)    FDR
1e-10        40245988                      1079                                   2.68e-5
1e-15        22475531                      192                                    8.54e-6
1e-20        13755681                      38                                     2.76e-6
At p-value < 1e-20 there are only 38 false positives, so no module was detected for the permuted data. P-value < 1e-20 was chosen as the threshold.
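The FDR column follows directly from the two count columns (false positives from permuted data divided by total positives from the real data). A short check, using only the numbers in the table above; the variable names are illustrative:

```python
# (p-value cutoff, total positives from data, false positives from permuted data)
rows = [
    (1e-10, 40245988, 1079),
    (1e-15, 22475531, 192),
    (1e-20, 13755681, 38),
]


def fdr(total_positive, false_positive):
    """FDR = false positives / positives detected."""
    return false_positive / total_positive


# 25,000 genes give 25000 * 24999 / 2 candidate gene pairs.
n_pairs = 25000 * 24999 // 2

for cutoff, total, false in rows:
    print(f"p < {cutoff:.0e}: FDR = {fdr(total, false):.2e}, "
          f"fraction of all pairs called = {total / n_pairs:.3f}")
```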
Association by gene expression correlation Can two expression levels correlate because they both correlate to noise? Guilt by association is noise prone Stuart et al. 2003 Two gene expression levels correlate because they respond to common perturbation F2 intercross setting QTL overlap
Scale-free property is robust
Genetics filter makes the network closer to scale-free Chen*, Zhu*, et al. Nature, 2008
Genetics: eQTL overlapping enhances correlation signals Problem: a cutoff threshold is needed Chen Y*, Zhu J* et al. Nature 452:429-435 (2008)
Causality vs. association Smoking and disease risks
Causality vs. association Fat dogs and fat masters Stephen Friend
A simple biological question: are there causal/reactive relationships?
A Bayesian network approach: Best model
Differential equations explain why different correlations can be observed: coupled rate equations for $dg/dt$ and $dc/dt$ with interaction terms $v(g, c)$ and $u(g, c)$. Chen*, Zhu*, et al., Nature (2008)
Model a signaling pathway
Acknowledgements Zhu lab Seungyeul Yoo Eunjee Lee Li Wang Luan Lin Quan Long Supported by: Mount Sinai Genomics Institute Eric Schadt Bin Zhang Zhidong Tu Charles Powell Patrizia Casaccia Boston University Avrum Spira Joshua Campbell U Washington Roger Baumgarner Berkeley Rachel Brem Princeton Leonid Kruglyak Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai Janssen Canary Foundation Prostate Cancer Foundation NIH NCI