Bioinformatics I, CPBS 7711. October 29, 2015. Protein interaction networks. Debra Goldberg, debra@colorado.edu

Overview: networks, protein interaction networks (PINs); network models; what can we learn from PINs; discovering protein complexes; PIN evolution; final words.

Introduction to Networks

What is a network? A collection of objects (nodes, vertices) connected by binary relationships (edges), which may be directed. Also called a graph. Networks are everywhere!

Social networks. Nodes: people. Edges: friendship. (Image from www.liberality.org)

Sexual networks. Nodes: people. Edges: romantic and sexual relations.

Transportation networks. Nodes: locations. Edges: roads.

Power grids. Nodes: power stations. Edges: high-voltage transmission lines.

Airline routes. Nodes: airports. Edges: flights.

Internet. Nodes: MBone routers. Edges: physical connections.

World-Wide Web. Nodes: web documents. Edges: hyperlinks.

Quick activity What kinds of biological networks are there or might there be? Molecular biology

Gene and protein networks

Metabolic networks. Nodes: metabolites. Edges: biochemical reactions (enzymes). (Image from web.indstate.edu)

Signaling networks. Nodes: molecules (e.g., proteins or neurotransmitters). Edges: activation or deactivation. (Image from www.life.uiuc.edu)

Gene regulatory networks. Nodes: genes or gene products. Edges: regulation of expression, inferred from error-prone gene expression data. (Image from Wyrick et al. 2002)

Disease networks. Nodes: diseases. Edges: common genes. [Figure from Goh et al., PNAS 2007; labeled nodes include SARS (progression of), myocardial infarction, Alzheimer disease, obesity, hypertension, rheumatoid arthritis, and HIV.]

Disease gene networks. Nodes: genes. Edges: common diseases. (From Goh et al., PNAS 2007)

Protein interaction networks. Nodes: proteins. Edges: observed interactions. Used to predict gene function. (Image from www.embl.de)

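Synthetic sick or lethal (SSL) networks. Nodes: nonessential genes. Edges: pairs of genes that are co-lethal. Used to predict gene function and drug targets. [Figure from Tong et al. 2001, four panels for genes X and Y: cells live (wild type); cells live; cells live; cells die or grow slowly. The last panel is the case in which both X and Y are disrupted.]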

Other gene networks. Homology edges: sometimes used to connect other network types across species. Coexpression edges: genes transcribed at the same times and under the same conditions. Gene knockout / knockdown edges: similar phenotype (defects) when suppressed.

What they really look like: we need models!

Traditional graph modeling. Random: Erdös-Rényi (1960). Regular: lattice. (Figure from GD2002)

Network research renaissance. Change in direction of network research: 1998. Four factors: (1) theoretical analysis coupled with empirical evidence; (2) networks are not static, they evolve over time; (3) dynamical systems modeling real-world behaviors; (4) computing power, which enables large system analysis.

Small-World Networks

Small-world networks. Six degrees of separation: with 100 to 1000 friends each, six steps would reach $10^{12}$ to $10^{18}$ people. But we live in communities.

Small-world measures. Typical separation between two vertices: measured by the characteristic path length (average distance). Cliquishness of a typical neighborhood: measured by the clustering coefficient. [Figure: two example neighborhoods of a vertex v, with $C_v = 1.00$ and $C_v = 0.33$.]
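The two measures can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the graph is a plain dictionary of neighbor sets, the toy network and the function names are hypothetical, and vertices with fewer than two neighbors are assumed to have a clustering coefficient of 0.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention assumed here for vertices with < 2 neighbors
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def characteristic_path_length(adj):
    """Mean shortest-path distance over all pairs that can reach each other."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: v has neighbors a, b, c and only a-b are linked,
# so one of the three neighbor pairs is connected.
toy = {
    "v": {"a", "b", "c"},
    "a": {"v", "b"},
    "b": {"v", "a"},
    "c": {"v"},
}
print(round(clustering_coefficient(toy, "v"), 2))
print(round(characteristic_path_length(toy), 2))
```

With one of three neighbor pairs connected, vertex v reproduces the $C_v = 0.33$ case from the slide.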

Watts-Strogatz small-world model

Measures of the W-S model. Path length drops faster than cliquishness as the rewiring probability p increases, so a wide range of p has both small-world properties.
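A rough way to see this behavior is to build a ring lattice, rewire its edges with probability p, and track both measures. The sketch below is an illustration under stated assumptions: the function names, the network size (200 nodes, 4 neighbors each), and the sampled values of p are all hypothetical choices, and path length is averaged only over pairs that remain connected.

```python
import random
from collections import deque
from itertools import combinations


def watts_strogatz(n, k, p, rng):
    """Ring of n nodes, each joined to its k nearest neighbors (k even);
    every lattice edge is then rewired with probability p."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    for step in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + step) % n
            if j in adj[i] and rng.random() < p:
                # Move the far end of edge (i, j) to a random non-neighbor.
                choices = [t for t in range(n) if t != i and t not in adj[i]]
                if choices:
                    t = rng.choice(choices)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(t)
                    adj[t].add(i)
    return adj


def mean_clustering(adj):
    """Average clustering coefficient over all nodes."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
            total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def mean_path_length(adj):
    """Average BFS distance over pairs that can reach each other."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


rng = random.Random(0)
for p in (0.0, 0.01, 0.1, 1.0):
    g = watts_strogatz(200, 4, p, rng)
    print(p, round(mean_clustering(g), 3), round(mean_path_length(g), 2))
```

Rewiring keeps the number of edges fixed, so any change in the two measures comes from where the edges go, not from how many there are.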

Small-world measures of various graph types

Graph type          Cliquishness   Characteristic path length
Regular graph       High           Long
Random graph        Low            Short
Small-world graph   High           Short

Another network property: degree distribution P(k). The degree (notation: k) of a node is the number of its neighbors. The degree distribution is a histogram showing the frequency of nodes having each degree.
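As a small illustration of the definition, the degree distribution can be tabulated directly from an adjacency structure. The toy network and the function name below are hypothetical.

```python
from collections import Counter


def degree_distribution(adj):
    """Return {k: fraction of nodes whose degree is k}."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n = len(adj)
    return {k: counts[k] / n for k in sorted(counts)}


# Hypothetical toy network: a hub "h" with four neighbors, two of them linked.
toy = {
    "h": {"a", "b", "c", "d"},
    "a": {"h", "b"},
    "b": {"h", "a"},
    "c": {"h"},
    "d": {"h"},
}
print(degree_distribution(toy))
```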

Degree distribution of E-R random networks. Erdös-Rényi random graphs have a binomial degree distribution, well approximated by a Poisson. [Figure: P(k) versus degree k; network figures from Strogatz, Nature 2001.]

Degree distribution of many real-world networks. Scale-free networks: the degree distribution follows a power law, $P(k = x) = \alpha x^{-\beta}$. [Figure: P(k) versus degree k, and log P(k) versus log k.]

Other degree distributions Amaral, Scala, et al., PNAS (2000)

Hierarchical networks. Ravasz, et al., Science 2002

Properties of hierarchical networks: 1. Scale-free. 2. Clustering coefficient independent of N. 3. Scaling clustering coefficient (DGM).

C of 43 metabolic networks: independent of N. Ravasz, et al., Science 2002

Clustering coefficient scaling C(k) in metabolic networks. Ravasz, et al., Science 2002

Summary of network models

Model          Properties
Random         Poisson degree distribution
Small world    High CC, short path lengths
Scale-free     Power-law degree distribution
Hierarchical   High CC, modular, power-law degree distribution

Many real-world networks are small-world, scale-free: the World-Wide Web; collaboration of film actors (Kevin Bacon); mathematical collaborations (Erdös number); the power grid of the US; syntactic networks of English; the neuronal network of C. elegans; metabolic networks; protein-protein interaction networks.

So What?

There is information in a gene's position in the network. We can use this to predict relationships (interactions, regulatory relationships) and protein function (process, complex / molecular machine).

Implications from topology

Edges indicate function. In many types of biological networks, proteins that are connected by an edge are more likely to have a common function.

Adjacent edges indicate a third. In some biological networks, if gene A is connected to both gene B and gene C, then gene B is more likely to be connected to gene C.

False positives, false negatives. Topology can be used to assess confidence if true edges and false edges have different network properties: assess how well each edge fits the topology of the true network. It can also predict unknown relations.

SSL hubs might be good cancer drug targets. [Figure labels: normal cell, alive; cancer cells with random mutations, dead, dead.] (Tong et al., Science, 2004)

2-hop predictors for SSL between genes v and w: SSL and SSL (S-S); homology and SSL (H-S); co-expression and SSL (X-S); physical interaction and SSL (P-S); two physical interactions (P-P). Key: S, synthetic sickness or lethality (SSL); H, sequence homology; X, correlated expression; P, stable physical interaction. Wong, et al., PNAS 2004

Multi-color motifs. Key: S, synthetic sickness or lethality; H, sequence homology; X, correlated expression; P, stable physical interaction; R, transcriptional regulation. Zhang, et al., Journal of Biology 2005

Protein complexes. Tightly connected proteins may indicate a protein complex. (Figure from Girvan and Newman, PNAS 2002)

Beware of bias

Lethality. Hubs are more likely to be essential. Jeong, et al., Nature 2001

Protein abundance. Abundant proteins are more likely to be represented in some types of experiments, and are more likely to be essential. The correlation between degree (hubs) and essentiality disappears or is reduced when corrected for protein abundance. Bloom and Adami, BMC Evolutionary Biology 2003

Degree anti-correlation. Few edges run directly between hubs; edges between hubs and low-degree genes are favored. [Figure panels: regulatory network, PPI.] Maslov and Sneppen, Science 2002

Degree correlation. Anti-correlation of the degrees of interacting proteins disappears in unbiased data. [Figure: average degree K1 versus degree k, for essential and non-essential proteins.] Coulomb, et al., Proceedings of the Royal Society B 2005

Predicting protein function

Methods: predicting function. Homology; machine learning; graph-theoretic methods (direct methods, module-assisted methods). Review: Sharan, Ulitsky, Shamir. Molecular Systems Biology, 2007

Direct methods: neighborhood. Majority method: Schwikowski, Uetz, et al., Nat Biotechnol 18, 2000. Neighborhood method (how does frequency affect assignment?): Hishigaki, Nakai, et al., Yeast 18, 2001
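The idea behind a majority-style predictor can be sketched as a simple vote among annotated neighbors. This is an illustrative simplification, not the published procedure: the function name, the `top` parameter, and the protein and function labels are hypothetical.

```python
from collections import Counter


def majority_vote(adj, annotations, protein, top=1):
    """Rank candidate functions by how many neighbors are annotated with them."""
    votes = Counter()
    for nbr in adj[protein]:
        for func in annotations.get(nbr, ()):  # unannotated neighbors cast no vote
            votes[func] += 1
    return [func for func, _ in votes.most_common(top)]


# Hypothetical example: protein "u" is unannotated and has three neighbors.
adj = {
    "u": {"p1", "p2", "p3"},
    "p1": {"u"},
    "p2": {"u"},
    "p3": {"u"},
}
annotations = {
    "p1": {"transport"},
    "p2": {"transport", "signaling"},
    "p3": {"transport"},
}
print(majority_vote(adj, annotations, "u"))
```

A plain vote ignores how common each function is across the whole network, which is the issue the slide's question about frequency points to.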

Minimum cut (graph-theoretic) methods. Vazquez, Flammini, et al. (2003): globally tries to minimize the number of protein interactions between different annotations. Karaoz, Murali, et al. (2004): incorporates gene-expression data for better performance. Nabieva, Jim, et al. (2005): reformulated as an integer linear programming problem.

Functional flow Nabieva, Jim, et al., Bioinformatics 21 Suppl 1, 2005

A Markov random field method. Letovsky and Kasif, Bioinformatics 19 Suppl 1, 2003. Derives marginal probabilities given other proteins' putative assignments. Statistically, neighbors often share a label. Applies Bayes' rule, $p(L \mid N, k) = \frac{p(k \mid L, N)\, p(L)}{p(k \mid N)}$, iteratively to propagate probabilities, where L is a Boolean random variable that indicates whether or not a node has that label, N is the number of neighbors, and k is the number of neighbors with that label.
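A single application of this update can be sketched as follows. The sketch assumes that p(k | L, N) and p(k | not L, N) are binomial with success probabilities `p1` and `p0`; that modeling choice, the parameter values, and the function names are assumptions made for illustration, and the iterative propagation across the network is not shown.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_label(k, n, prior, p1, p0):
    """p(L | N, k) for one node by Bayes' rule.

    p1: assumed chance that a neighbor carries the label when the node does.
    p0: assumed chance that a neighbor carries the label when it does not.
    """
    with_label = binom_pmf(k, n, p1) * prior
    without_label = binom_pmf(k, n, p0) * (1 - prior)
    return with_label / (with_label + without_label)  # denominator is p(k | N)


# Hypothetical numbers: 4 of 5 neighbors carry the label, prior of 0.1.
print(round(posterior_label(k=4, n=5, prior=0.1, p1=0.6, p0=0.1), 3))
```

When `p1` equals `p0` the neighbors carry no information and the posterior stays at the prior, which is a quick sanity check on the formula.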

Module-assisted methods. Spirin and Mirny, PNAS 2003: find fully connected subgraphs (cliques), or find subgraphs that maximize density, $2m / (n(n-1))$, for a subgraph with n nodes and m edges. Bader and Hogue, BMC Bioinformatics 2003: MCODE (Molecular COmplex DEtection) weights vertices by neighborhood density and connectedness, then finds connected communities with high weights. Girvan and Newman, PNAS 2002: uses betweenness centrality to remove edges likely to go between communities.
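The density measure is easy to compute for any candidate set of proteins. The sketch below only evaluates density for a given node set; it does not search for the densest subgraphs, and the toy graph and function names are hypothetical.

```python
from itertools import combinations


def density(adj, nodes):
    """2m / (n(n - 1)) for the subgraph induced by `nodes`."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    m = sum(1 for a, b in combinations(nodes, 2) if b in adj[a])
    return 2.0 * m / (n * (n - 1))


def is_clique(adj, nodes):
    """A clique is a subgraph with density exactly 1."""
    return density(adj, nodes) == 1.0


# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(density(toy, ["a", "b", "c"]), is_clique(toy, ["a", "b", "c"]))
print(round(density(toy, ["a", "b", "c", "d"]), 3))
```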

Confidence assessment, edge prediction

Confidence assessment. Traditionally, biological networks were determined individually: high confidence, but slow. New methods look at the entire organism: lower confidence ( 50% false positives). Inferences are made based on these data.

Confidence assessment. Topology can be used to assess confidence if true edges and false edges have different network properties: assess how well each edge fits the topology of the true network. It can also predict unknown relations. Goldberg and Roth, PNAS 2003

Use the clustering coefficient, a local property. Number of triangles containing edge v-w = $|N(v) \cap N(w)|$, where N(x) is the neighborhood of node x. Normalization factor? [Figure: example with nodes v, w, x, and y.]

Mutual clustering coefficient (MCC), four definitions. Jaccard index: $\frac{|N(v) \cap N(w)|}{|N(v) \cup N(w)|}$. Meet / Min: $\frac{|N(v) \cap N(w)|}{\min(|N(v)|, |N(w)|)}$. Geometric: $\frac{|N(v) \cap N(w)|^2}{|N(v)| \cdot |N(w)|}$. Hypergeometric: a p-value.
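The four definitions can be sketched directly from neighbor sets. Two points here are assumptions of this illustration rather than details taken from the slides: N(x) is taken to be the plain neighbor set of x, and the hypergeometric p-value uses the total number of nodes in the graph as the population size. The toy graph and function names are hypothetical.

```python
from math import comb


def mutual_clustering(adj, v, w):
    """Jaccard, Meet/Min, and geometric MCC for the node pair (v, w)."""
    nv, nw = adj[v], adj[w]
    common = len(nv & nw)
    union = len(nv | nw)
    smaller = min(len(nv), len(nw))
    return {
        "jaccard": common / union if union else 0.0,
        "meet_min": common / smaller if smaller else 0.0,
        "geometric": common**2 / (len(nv) * len(nw)) if smaller else 0.0,
    }


def hypergeometric_pvalue(adj, v, w):
    """Chance of at least the observed number of shared neighbors if
    w's neighbors were drawn at random from all nodes (assumed null model)."""
    total = len(adj)
    a, b = len(adj[v]), len(adj[w])
    common = len(adj[v] & adj[w])
    tail = sum(
        comb(a, i) * comb(total - a, b - i)
        for i in range(common, min(a, b) + 1)
    )
    return tail / comb(total, b)


# Hypothetical toy graph: v and w are not linked but share neighbors b and c.
toy = {
    "v": {"a", "b", "c"},
    "w": {"b", "c"},
    "a": {"v", "d"},
    "b": {"v", "w"},
    "c": {"v", "w"},
    "d": {"a"},
}
print(mutual_clustering(toy, "v", "w"))
print(hypergeometric_pvalue(toy, "v", "w"))
```

For the three ratio definitions a larger value means more shared neighborhood; for the hypergeometric definition a smaller p-value does.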

Prediction. A v-w edge would have a high MCC.

Questions. Degree distribution? Clustering coefficient of nodes 2, 5, 9? Mutual clustering coefficient of nodes 2 & 7 (use the Meet/Min definition)?

Protein complexes. Groups of proteins that bind together to perform a specific task. Examples: ribosomes; proteasomes; replication complexes (GINS complex, DNA polymerase). Image from: Computation site for bioinformatics at Charité, Universitätsmedizin Berlin, found at http://bioinf.charite.de/hergo/intro.htm

Finding protein complexes. Dense regions may be an indication of a protein complex. (Figure from Girvan and Newman, PNAS 2002)

Protein-protein interaction network. Image from Yeast Proteomics, Genome News Network, 1-18-02

Looking for complexes. One goal of studying the interaction network is to discover previously unknown protein complexes. Methods: look for cliques or near-cliques; look for vertices with high clustering.

Community structure: Partitioning methods

Community structure. Proteins in a community may be involved in a common process or function. (Figure from Girvan and Newman, PNAS 2002)

Finding the communities: hierarchical clustering; betweenness centrality; dense subgraphs; similar subgraphs; spectral clustering; party and date hubs.

Hierarchical clustering (1): using natural edge weights. Gene co-expression, e.g., Eisen MB, et al., PNAS 1998. (Image from www.medscape.com)

Hierarchical clustering (2): topological overlap, a measure of neighborhood similarity. $l_{i,j}$ is 1 if there is a direct link between i and j, 0 otherwise. Ravasz, et al., Science 2002
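The slide's formula did not survive extraction, so the sketch below assumes one common form of topological overlap: the number of shared neighbors plus $l_{i,j}$, divided by the smaller of the two degrees. The exact normalization used in the lecture may differ, and the toy graph and function name are hypothetical.

```python
def topological_overlap(adj, i, j):
    """(shared neighbors + l_ij) / min(k_i, k_j), an assumed form of the measure."""
    link = 1 if j in adj[i] else 0  # l_ij: 1 when i and j are directly linked
    shared = len(adj[i] & adj[j])
    smaller = min(len(adj[i]), len(adj[j]))
    return (shared + link) / smaller if smaller else 0.0


# Hypothetical toy graph: i and j are linked and share neighbors a and b.
toy = {
    "i": {"j", "a", "b"},
    "j": {"i", "a", "b"},
    "a": {"i", "j"},
    "b": {"i", "j", "c"},
    "c": {"b"},
}
print(topological_overlap(toy, "i", "j"))
print(topological_overlap(toy, "a", "c"))
```

A matrix of these pairwise values can then serve as the similarity input to a standard hierarchical clustering routine.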

Hierarchical clustering (3): adjacency vector. Function cluster: Tong et al., Science 2004. Find drug targets: Parsons et al., Nature Biotechnology 2004

Party and date hubs. In a protein interaction network, partition hubs by the expression correlation of their neighbors. Han, et al., Nature 2004

Network connectivity. Scale-free networks are robust to random failures but vulnerable to attacks on hubs: removing hubs quickly disconnects a network and reduces the size of the largest component. Albert, et al., Nature 2000

Removing date hubs shatters the network into communities. [Figure labels: date hubs; many sub-networks; a single main component.]

Similar subgraphs. Across species; interaction network and genome sequence, e.g., Ogata, et al., Nucleic Acids Research 2000

Betweenness centrality. Consider the shortest path(s) between all pairs of nodes. The betweenness centrality of an edge is a measure of how many shortest paths traverse this edge. Edges between communities have higher centrality. Girvan and Newman, PNAS 2002
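Edge betweenness on an unweighted graph can be sketched with one breadth-first search per node followed by a backward accumulation pass (a Brandes-style computation, chosen here for brevity and not necessarily the procedure used in the cited paper). The toy graph and function name are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path count through each edge of an undirected, unweighted graph.

    When several shortest paths join a pair of nodes, each path carries
    an equal fraction of that pair's credit.
    """
    score = {}
    for s in adj:
        # Forward pass: BFS distances, path counts, and predecessors.
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    sigma[w] = 0.0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Backward pass: push credit from the farthest nodes toward s.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for u in preds[w]:
                credit = sigma[u] / sigma[w] * (1.0 + delta[w])
                edge = frozenset((u, w))
                score[edge] = score.get(edge, 0.0) + credit
                delta[u] += credit
    # Every unordered pair was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}


# Hypothetical toy graph: two triangles joined by the bridge c-d.
bridge_graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
scores = edge_betweenness(bridge_graph)
top_edge = max(scores, key=scores.get)
print(sorted(top_edge), scores[top_edge])
```

In the toy graph the bridge between the two triangles ranks highest, and removing it separates the two communities.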

Spectral clustering. Compute the eigenvectors of the adjacency matrix. Each eigenvector defines a cluster: the proteins with high-magnitude contributions. [Figure panels: positive eigenvalue, negative eigenvalue.] Bu, et al., Nucleic Acids Research 2003

Questions. How does the WWW evolve? How might protein interaction networks (PINs) evolve? How can we determine if our model is incorrect?

Model for scale-free networks: growth and preferential attachment. A new node has an edge to existing node v with probability proportional to the degree of v. Biologically plausible?
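Growth with preferential attachment can be sketched as below. Several details are assumptions of this illustration: each new node adds `m` edges, the process starts from a small complete graph, and the `m` targets are forced to be distinct by redrawing. The function name and parameter values are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n, m, rng):
    """Grow a graph to n nodes; each new node links to m existing nodes
    drawn with probability proportional to their current degree."""
    adj = {i: set() for i in range(m + 1)}
    for i in range(m + 1):  # seed: a complete graph on m + 1 nodes
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
    # Each node is listed once per incident edge, so a uniform draw from
    # this list is a degree-proportional draw over nodes.
    stubs = [v for v, nbrs in adj.items() for _ in nbrs]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # redraw until m distinct targets are found
            targets.add(rng.choice(stubs))
        adj[new] = set(targets)
        for t in targets:
            adj[t].add(new)
            stubs.extend((t, new))
    return adj


graph = preferential_attachment(500, 2, random.Random(0))
degree_counts = Counter(len(nbrs) for nbrs in graph.values())
print(sorted(degree_counts.items())[:5], "max degree:", max(degree_counts))
```

Most nodes stay near the minimum degree while a few early nodes accumulate many edges, which is the heavy-tailed pattern the model is meant to produce.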

Gene duplication gives functional diversity. A primary mechanism for diversity. After duplication, 2 routes to diversity. Subfunctionalization: function loss yields complementary subsets of the original functions (edge loss). Neofunctionalization: de novo acquisition of functions (edge gain). Protein interactions are a convenient proxy for functions.

Gene duplication in a PIN. Barabási and Oltvai, Nature Reviews Genetics (2004)

Another scale-free network model: duplication and divergence. New nodes are copies of existing nodes: same neighbors, then some gain/loss. Solé, Pastor-Satorras, et al. (2002)
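A duplication-and-divergence process can be sketched as follows. The parameter names `delta` (chance of losing each inherited edge) and `alpha` (chance of gaining an edge to each other node), their values, and the two-node starting network are assumptions of this illustration, not values taken from the cited paper.

```python
import random


def duplication_divergence(n, delta, alpha, rng):
    """Grow a network by copying nodes.

    Each step copies a random existing node with its edges, then drops
    each inherited edge with probability delta (edge loss) and adds an
    edge to each remaining node with probability alpha (edge gain).
    """
    adj = {0: {1}, 1: {0}}  # assumed seed: a single interacting pair
    while len(adj) < n:
        new = len(adj)
        parent = rng.randrange(new)
        kept = {v for v in adj[parent] if rng.random() >= delta}
        gained = {v for v in adj if v not in kept and rng.random() < alpha}
        adj[new] = kept | gained
        for v in adj[new]:
            adj[v].add(new)
    return adj


network = duplication_divergence(200, delta=0.5, alpha=0.005, rng=random.Random(0))
degrees = sorted((len(nbrs) for nbrs in network.values()), reverse=True)
print("largest degrees:", degrees[:5])
```

Because a copy shares its parent's neighbors, nodes that already have many neighbors tend to gain more edges as those neighbors are duplicated, which gives an effect similar to preferential attachment without assuming it directly.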

Advantages of this model. This model generates networks that are scale-free and highly clustered. PINs are also scale-free and highly clustered.

Question. Paralogs: x & w or y & t? [Figure: network with nodes y, x, w, v, t.]

Final words. Network analysis has become an essential tool for analyzing complex systems. There is still much biologists can learn from scientists in other disciplines. There is much other scientists can learn from us. An exciting new direction.