CSCI1950-Z Computational Methods for Biology* (*Working Title)
Lecture 1
Ben Raphael
January 21, 2009
Course Particulars
  Three major topics
    1. Phylogeny: ~50% lectures
    2. Functional Genomics: ~25% lectures
    3. Network/Systems Biology: ~25% lectures
  Tools
    Computer Science: Algorithms and discrete math (e.g. graph theory), Programming
    Mathematics: Discrete Probability, Linear algebra (vectors and matrices)
    Biology: Basics. (What is DNA?)
Course Particulars
  Webpage: http://cs.brown.edu/courses/csci1950-z/ (readings, including some background material)
  Textbook: None
  Assignments: mens et manus
    1. 4 written assignments: ~40% of grade
    2. 3 programming assignments: ~40% of grade
    3. Take-home final: ~20% of grade
  Graduate credit
    Extra assignment/project
    Talk to me before March 1
  Survey
Topic 1: Phylogeny
Early Evolutionary Studies
  200th anniversary of the birth of Charles Darwin
  [Figure: from On the Origin of Species (1859)]
  Darwin to 1960s: Anatomical features were the dominant criteria used to derive evolutionary relationships between species.
  Imprecise, often subjective, observations often led to inconclusive, contradictory, or incorrect evolutionary relationships between species.
  Molecular data (DNA and protein sequences) dramatically improved the situation.
Species Trees
  Is a panda more closely related to a bear or a raccoon?
  [Figure: comparison by looks and hibernation pattern, bear vs. raccoon]
  ~100 years of arguments
  Tree derived from DNA sequence data. Stephen O'Brien et al. (1985)
Human Evolutionary History
  From: Molecular Evolution: A Phylogenetic Approach, R. Page & E. Holmes
More Recent Human History
  Out of Africa Hypothesis: Most ancient ancestor lived in Africa roughly 200,000 years ago
  [Figure with labels 1 to 5. Source: http://www.becominghuman.org]
The Origin of Humans: Out of Africa vs. Multiregional Hypothesis
  Out of Africa:
    Humans evolved in Africa ~200,000 years ago
    Humans migrated out of Africa, replacing other humanoids around the globe
  Multiregional:
    Humans evolved in the last two million years as a single species. Independent appearance of modern traits in different areas
    Humans migrated out of Africa, mixing with other humanoids on the way
Human Evolutionary Tree
  DNA-based reconstruction of the human evolutionary tree
  http://www.mun.ca/biology/scarr/out_of_africa2.htm
Evolutionary Tree of Humans (mtDNA)
  Vigilant, Stoneking, Harpending, Hawkes, and Wilson (1991)
  African population is the most diverse (sub-populations had more time to diverge)
  Evolutionary tree separates one group of Africans from a group containing all five populations.
  Tree rooted on branch between groups of greatest difference.
Evolutionary Tree of Humans (microsatellites)
  Neighbor-joining tree for 14 human populations genotyped with 30 microsatellite loci.
Lineage of Genghis Khan?
  In humans, the Y chromosome is passed from the father only.
  Can be used to identify paternal lineages.
  ~8% of males in parts of Asia and 0.5% worldwide estimated to be descendants of a resident of Mongolia ~1000 years ago (Zerjal et al. AJHG 2003).
Lafayette, Louisiana, 1994: A woman claimed her ex-lover (who was a physician) injected her with HIV+ blood
  Records show the physician had drawn blood from an HIV+ patient that day
  Is there a way to show that blood from that HIV+ patient ended up in the woman?
HIV Transmission
  HIV has a high mutation rate, which can be used to trace paths of transmission
  Two people who were infected from different sources will have very different HIV sequences
  [Figure: alignment of fourteen amino acid sequences from the V3 region of HIV-1 gp120 genes. Azizi et al. BMC Immunology 2006, 7:25]
To the Lab!
  Wet lab
    Take multiple samples from the patient, the woman, and controls (non-related HIV+ people)
    Obtain DNA sequence from two HIV genes (gp120 and RT).
  Computer lab
    Build phylogenetic tree from the DNA sequences.
Phylogenetic Tree Conviction
  Three different tree reconstruction techniques used.
  In every reconstruction, the victim's sequences were related to the patient's sequences.
  Nesting of the victim's sequences within the patient's sequences indicated the direction of transmission was from patient to victim.
  First time phylogenetic analysis was used in a court case as evidence (Metzker et al., 2002)
Phylogenetic Trees
  How to build a phylogenetic tree from data?
  Data
    1. Characters/Features
    2. Pairwise distances
  Algorithm
Phylogenetic Trees
  What is a phylogenetic tree?
  Biology definition: None (picture). A branching diagram.
  Intuition:
    Leaves represent existing species.
    Branch points represent most recent common ancestors.
    Lengths of branches represent evolutionary time.
    Root represents the oldest evolutionary ancestor.
Phylogenetic Trees
  What is a phylogenetic tree?
  Computer science definition
    tree: A connected acyclic graph G = (V, E).
    graph: A set V of vertices and a set E of edges, where each edge (v_i, v_j) connects a pair of vertices.
Tree Definitions
  A path in G is a sequence (v_1, v_2, ..., v_n) of vertices in V such that (v_i, v_{i+1}) are edges in E.
  A graph is connected provided for every pair v_i, v_j of vertices, there is a path between v_i and v_j.
  A cycle is a path with the same starting and ending vertices.
  A graph is acyclic provided it has no cycles.
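The definitions above translate fairly directly into a check. The following is a minimal sketch, assuming an undirected graph given as a vertex list and a list of vertex pairs; the function name `is_tree` is made up for illustration. It relies on the standard fact that a connected graph on n vertices is acyclic exactly when it has n - 1 edges.

```python
from collections import defaultdict

def is_tree(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is connected and acyclic."""
    vertices = list(vertices)
    if not vertices:
        return False
    # A connected graph on n vertices is acyclic exactly when it has n - 1 edges.
    if len(edges) != len(vertices) - 1:
        return False
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Depth-first search from an arbitrary vertex to test connectivity.
    seen = {vertices[0]}
    stack = [vertices[0]]
    while stack:
        u = stack.pop()
        for w in adjacency[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)

# A star on four vertices is a tree; a triangle contains a cycle.
print(is_tree("abcd", [("a", "b"), ("b", "c"), ("b", "d")]))
print(is_tree("abc", [("a", "b"), ("b", "c"), ("c", "a")]))
```

The edge-count shortcut avoids explicit cycle detection; the search only has to confirm that every vertex is reachable.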
Tree Definitions
  tree: A connected acyclic graph G = (V, E).
  The degree of a vertex v is the number of edges incident to v.
  A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).
  A binary phylogenetic tree is a phylogenetic tree where every interior (non-leaf) vertex has degree 3; i.e. two children.
  A rooted (*binary) phylogenetic tree is a phylogenetic tree with a single designated vertex r (* of degree 2).
Rooted and Unrooted Trees
  In the unrooted tree the position of the root ("oldest ancestor") is unknown. Otherwise, they are like rooted trees.
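The degree conditions above can be checked mechanically. This is a small sketch under the assumption that the edge list already describes a tree; the helper names and the example trees (leaves A to D, interior vertices x and y, root r) are hypothetical.

```python
from collections import Counter

def vertex_degrees(edges):
    """Count, for every vertex, the number of edges incident to it."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree

def leaves(edges):
    """Leaves are the vertices of degree one."""
    return sorted(v for v, d in vertex_degrees(edges).items() if d == 1)

def is_binary(edges, root=None):
    """Check the degree conditions for a binary phylogenetic tree.

    Unrooted (root=None): every vertex has degree 1 or 3.
    Rooted: the root has degree 2, every other vertex has degree 1 or 3.
    Assumes `edges` already describes a tree.
    """
    degree = vertex_degrees(edges)
    if root is not None and degree[root] != 2:
        return False
    return all(d in (1, 3) for v, d in degree.items() if v != root)

# Hypothetical unrooted tree on four leaves, and a rooted version of it.
unrooted = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
rooted = [("r", "x"), ("r", "y"), ("A", "x"), ("B", "x"), ("C", "y"), ("D", "y")]

print(leaves(unrooted))
print(is_binary(unrooted), is_binary(rooted, root="r"))
```

In the unrooted example, four leaves come with two interior vertices and five edges, which matches the general count of n - 2 interior vertices and 2n - 3 edges for an unrooted binary tree on n leaves.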
Evaluating Different Phylogenies
  Character   Value1   Value2
  Mouth       Smile    Frown
  Eyebrows    Normal   Pointed
Character-Based Tree Reconstruction
  Which tree is better?
Character-Based Tree Reconstruction
  Count changes on tree
Character-Based Tree Reconstruction
  Parsimony: minimize number of changes on edges of tree
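Counting the minimum number of changes on a fixed tree can be sketched with Fitch's algorithm, one standard method for this step and not necessarily the one presented in the course. The sketch assumes a rooted binary tree written as nested pairs of leaf names; the four faces A to D and their character values are hypothetical, reusing the mouth and eyebrow characters from the earlier slide.

```python
def fitch(tree, states):
    """Return (candidate root states, minimum number of changes) for one character.

    tree:   a leaf name (str) or a pair (left_subtree, right_subtree)
    states: dict mapping each leaf name to its observed character value
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: at least one change occurs below this vertex.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, characters):
    """Sum the minimum number of changes over all characters."""
    return sum(fitch(tree, states)[1] for states in characters.values())

# Hypothetical data: four faces scored for two characters.
characters = {
    "mouth": {"A": "smile", "B": "smile", "C": "frown", "D": "frown"},
    "eyebrows": {"A": "normal", "B": "normal", "C": "pointed", "D": "pointed"},
}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(parsimony_score(tree_1, characters), parsimony_score(tree_2, characters))
```

On this toy data the first tree needs fewer changes than the second, so parsimony prefers it.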
Character-Based Tree Reconstruction
  Maximum Likelihood: Given Pr[change], what is the tree with maximum probability?
Identifying Highest-Scoring Tree
  Naïve, exhaustive algorithm: check all trees.
  How many possibilities? Restrict to binary trees.
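To get a feel for why exhaustive search is impractical, the number of binary trees can be computed directly. The sketch below uses the standard counting result that there are (2n - 5)!! = 3 x 5 x ... x (2n - 5) unrooted binary trees on n labeled leaves, and (2n - 3)!! rooted ones; the function names are made up for illustration.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

def num_rooted_binary_trees(n):
    """Number of rooted binary trees on n >= 2 labeled leaves: (2n - 3)!!.

    Rooting a tree on n leaves is equivalent to adding one extra leaf
    at the root position, hence the shift by one.
    """
    return num_unrooted_binary_trees(n + 1)

for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Already at 10 leaves there are 2,027,025 unrooted binary trees, and the count grows faster than exponentially, which motivates the more efficient methods discussed next.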
Phylogenetic Trees
  How to efficiently build trees from data?
  [Figure: two trees with leaf orders 1 4 3 2 5 and 1 4 2 3 5]
  Data
    1. Characters/Features
    2. Pairwise distances
Phylogenetic Trees
  How to efficiently build trees from data?
  Methods
    1. Characters/Features
       Parsimony: Minimum number of changes
       Probabilistic Model
    2. Pairwise distances
       Clustering (UPGMA, Neighbor joining, ...)
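As an illustration of the distance-based route, here is a minimal UPGMA sketch. It repeatedly merges the two closest clusters and averages distances weighted by cluster size. The input format (a dict keyed by `frozenset` pairs) and the toy distances are assumptions made for this example, and the sketch returns only the tree topology, not branch lengths.

```python
def upgma(names, dist):
    """Cluster leaves by UPGMA and return the tree as nested pairs.

    names: list of leaf names
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b
           (must contain every pair of leaves)
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = dict(dist)
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        pair = min(d, key=d.get)
        a, b = sorted(pair, key=repr)  # fixed order so the output is reproducible
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del d[pair]
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in sizes:
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            d[frozenset((merged, c))] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))

# Hypothetical distances: A and B are close, C and D are close.
toy_dist = {
    frozenset(("A", "B")): 2, frozenset(("C", "D")): 4,
    frozenset(("A", "C")): 10, frozenset(("A", "D")): 10,
    frozenset(("B", "C")): 10, frozenset(("B", "D")): 10,
}
print(upgma(["A", "B", "C", "D"], toy_dist))
```

This version rescans all pairwise distances at every merge, which is fine for a handful of leaves; larger inputs would call for a more careful data structure.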
Additional Models and Extensions
  Comparing trees
    Distances between trees.
    Statistical tests: bootstrap, permutation tests, etc.
  Supertrees and consensus
  Gene trees vs. species trees.
  Whole-genome phylogeny.
Topic 2: Functional Genomics
Biology 101
  Central Dogma
What can we measure?
  [Figure labels, in order:]
    Sequencing (expensive)
    Hybridization (noisy)
    Sequencing (expensive)
    Hybridization (noisy)
    Mass spectrometry (noisy)
    Hybridization (very noisy!)
DNA Basepairing
DNA Microarrays
Clustering of Gene Expression
  Each microarray experiment: expression vector u = (u_1, ..., u_n)
  u_i = expression value of gene i.
  Group similar vectors.
  [Figure: axes labeled Samples and Gene expression. Source: BMC Genomics 2006, 7:279]
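Grouping similar vectors first requires a notion of similarity. The sketch below shows two common choices, Euclidean distance and a distance based on Pearson correlation; the slides do not say which measure the course uses, so both are illustrative assumptions, as are the function names and the toy vectors.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two expression vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def correlation_distance(u, v):
    """1 - Pearson correlation: 0 for perfectly correlated profiles, 2 for opposite ones.

    Assumes neither vector is constant (otherwise the correlation is undefined).
    """
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return 1 - cov / (norm_u * norm_v)

# Hypothetical expression vectors: v follows the same trend as u at twice the scale.
u = (1.0, 2.0, 3.0)
v = (2.0, 4.0, 6.0)
w = (3.0, 2.0, 1.0)
print(euclidean_distance(u, v), correlation_distance(u, v), correlation_distance(u, w))
```

The two measures can disagree: u and v are far apart in Euclidean terms but have the same shape, so their correlation distance is essentially zero.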
Clustering
  [Figure: two trees with leaf orders 1 4 3 2 5 and 1 4 2 3 5]
  Clustering algorithms are related to distance-based phylogenetic algorithms.
  Phylogeny gives grouping of related data points.
Classification
  Binary classification
  Given a set of examples (x_i, y_i), where y_i = ±1, from an unknown distribution D.
  Design a function f: R^n → {-1, +1} that assigns additional samples x_i to one of two classes optimally.
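To make the classification setup concrete, here is a minimal sketch of a 1-nearest-neighbor classifier. It builds a function f from labeled examples (x_i, y_i) and assigns a new sample the label of its closest training example. This is a simple baseline, not an optimal classifier, and the training points are hypothetical.

```python
import math

def nearest_neighbor_classifier(examples):
    """Build f: R^n -> {-1, +1} from a list of (x, y) pairs.

    x is a tuple of floats and y is -1 or +1. The returned function labels
    a new sample with the label of the nearest training example.
    """
    def f(x):
        _, label = min(examples, key=lambda example: math.dist(example[0], x))
        return label
    return f

# Hypothetical training set with two features per sample.
training = [((0.0, 0.0), -1), ((0.2, 0.1), -1), ((1.0, 1.0), +1), ((0.9, 1.2), +1)]
f = nearest_neighbor_classifier(training)
print(f((0.1, 0.2)), f((0.8, 0.9)))
```

`math.dist` requires Python 3.8 or later; on older versions the Euclidean distance would need to be written out by hand.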
Topics
  Methods for Clustering
    Hierarchical, Matrix-based (PCA), Graph-based (Clique finding)
  Methods for Classification
    Nearest neighbors, support vector machines
  Data Integration: Bayesian Networks
Topic 3: Network and Systems Biology
Biological Interaction Networks
  Many types:
    Protein-DNA (regulatory)
    Protein-metabolite (metabolic)
    Protein-protein (signaling)
    RNA-RNA (regulatory)
    Genetic interactions (gene knockouts)
Regulatory Networks
Cis-regulatory Network
Metabolic Networks
  Nodes = reactants
  Edges = reactions labeled by enzyme (protein) that catalyzes reaction
Protein-Protein Interaction (PPI) Network
Protein-Protein Interaction Network?
  Proteins are nodes
  Interactions are edges
  Edges may have weights
  Yeast PPI network: H. Jeong et al. Nature 411, 41 (2001)
Computational Problems
  1. Classifying Network Topology
     Finding paths, cliques, dense subnetworks, etc.
  2. Comparing Networks Across Species
  3. Using networks to explain data
     Dependencies revealed by network topology
  4. Modeling dynamics of networks
Network Motifs
  Subnetworks with more occurrences than expected by chance.
  How to find?
  How to assess statistical significance?
  Shen-Orr et al. 2002
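As a small illustration of the "how to find" question, the sketch below counts occurrences of one candidate motif, the feed-forward loop (X regulates Y, Y regulates Z, and X also regulates Z), in a directed network. The choice of motif, the function name, and the toy network are assumptions for this example. Assessing significance would additionally require comparing the count against randomized networks, which is not shown here.

```python
from itertools import permutations

def count_feed_forward_loops(edges):
    """Count ordered triples (x, y, z) with edges x->y, y->z and x->z.

    Brute force over all ordered triples of nodes, so only suitable
    for small networks.
    """
    edge_set = set(edges)
    nodes = {v for edge in edge_set for v in edge}
    return sum(
        1
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    )

# Hypothetical regulatory network containing one feed-forward loop.
network = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]
print(count_feed_forward_loops(network))
```

A directed 3-cycle is not counted, since it lacks the direct X to Z edge that defines the loop.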
Network Alignment
  Sharan and Ideker. Modeling cellular machinery through biological network comparison. Nature Biotechnology 24, pp. 427-433, 2006
The Network Alignment Problem
  Given: k different interaction networks belonging to different species
  Find: Conserved sub-networks within these networks
  Conserved defined by protein sequence similarity (node similarity) and interaction similarity (network topology similarity)
Protein Signaling Networks
  Art Salomon, Biology Department
  Use machine learning methods (Bayesian networks, etc.) to derive network structure.
Course Themes
  Topics: Phylogeny, Functional Genomics, Systems & Network Biology
  Mixture of theory and practice (real data)
  Graph algorithms: Path and clique finding, isomorphism, heavy subgraphs, matching, vertex cover, spanning and Steiner problems, etc.
  Statistics: Hypothesis testing, permutation tests, bootstrap and resampling, enrichment (hypergeometric), etc.
  Data Mining and Machine Learning: Clustering and Classification
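The hypergeometric enrichment test mentioned under Statistics can be sketched in a few lines. The setting assumed here is the usual one: out of N genes in total, K carry some annotation; a selected group of n genes contains k annotated ones; the p-value is the probability of seeing k or more by chance. The function name and the numbers are illustrative.

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """P[X >= k] for X hypergeometric.

    N: total number of genes
    K: number of genes carrying the annotation
    n: number of genes in the selected group (e.g. a cluster)
    k: number of selected genes that carry the annotation
    """
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# Hypothetical example: 10 genes, 5 annotated, a cluster of 5 that are all annotated.
print(enrichment_pvalue(10, 5, 5, 5))
```

In this toy case only one of the 252 possible clusters of size 5 consists entirely of annotated genes, so the p-value is 1/252. `math.comb` requires Python 3.8 or later.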
Sources
  http://bioalgorithms.info (portions of Out of Africa and character phylogeny slides)