Opportunities and Challenges in Computational Biology


Opportunities and Challenges in Computational Biology

Srinivas Aluru
Electrical & Computer Engineering
Lawrence H. Baker Center for Bioinformatics & Biological Statistics
Iowa State University
aluru@iastate.edu
http://vulcan.ee.iastate.edu/~aluru

David A. Bader
Electrical & Computer Engineering
University of New Mexico
dbader@eece.unm.edu
http://www.eece.unm.edu/~dbader

Acknowledgments
National Science Foundation

Nov. 17, 2002 SC2002 Tutorial: Computational Biology

Opportunities and Challenges in Computational Biology
"Biology easily has 500 years of exciting problems to work on" - Donald E. Knuth

Outline
1. Molecular Biology Background
2. Sequence Alignments
3. String Data Structures and Algorithms
4. Genome Assembly
5. Gene Identification & Annotation
6. Microarrays & Gene Expression Analysis
7. Protein Folding
8. Comparative Genomics & Reconstruction of Evolutionary Histories

Schedule

Morning
8:30-9:15     (Part I) Biology Background
9:15-10:00    (Part II) Sequence Alignments
10:00-10:30   Break
10:30-11:15   (Part III) String Data Structures and Algorithms
11:15-12:00   (Part IV) Genome Assembly
12:00-1:30    Lunch

Afternoon
1:30-2:15     (Part V) Gene Identification & Annotation
2:15-3:00     (Part VI) Microarrays & Gene Expression Analysis
3:00-3:30     Break
3:30-4:15     (Part VII) Protein Folding
4:15-5:00     (Part VIII) Comparative Genomics & Evolutionary Histories

Part I: Molecular Biology Background

Biological Data
DNA: Self-replicating. Codes for proteins.
Proteins: Perform most functions in living organisms.

DNA: Sequence of nucleotides
Nucleotide: Deoxyribose sugar + Phosphate + Base
Nucleotides: A, T, G, and C
[Figure: chemical structure of a nucleotide, showing the phosphate group, the deoxyribose sugar and the base]

[Figure: double-stranded DNA, two antiparallel sugar-phosphate backbones (5' to 3' and 3' to 5') joined by the base pairs A-T, C-G and G-C]

For computational purposes, DNA = a sequence over the alphabet {A, C, G, T}
5' A T T C G G G A A T G C A T G C C A 3'
3' T A A G C C C T T A C G T A C G G T 5'
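The two strands above are related by base pairing, which is easy to express in code. The following is a minimal sketch in Python; the function names are illustrative and not part of the tutorial.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Base-by-base complement, written in the same left-to-right order."""
    return strand.translate(COMPLEMENT)

def reverse_complement(strand: str) -> str:
    """The opposite strand, written in its own 5' to 3' direction."""
    return complement(strand)[::-1]

top = "ATTCGGGAATGCATGCCA"        # 5' -> 3' strand from the slide
print(complement(top))            # lower strand as drawn, 3' -> 5'
print(reverse_complement(top))    # lower strand read 5' -> 3'
```

Because a fragment may come from either strand, later steps such as fragment assembly consider both a sequence and its reverse complement.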

Genome: Entire genetic constitution of a living organism
Chromosome: Linear strand of DNA
Gene: A contiguous stretch of DNA that codes for a protein

Species                               Number of Chromosomes   Genome Size (bp)
Bacteriophage λ                       1                       5 x 10^4
Escherichia coli (bacterium)          1                       5 x 10^6
Saccharomyces cerevisiae (yeast)      32                      1 x 10^7
Caenorhabditis elegans (worm)         12                      1 x 10^8
Drosophila melanogaster (fruit fly)   8                       2 x 10^8
Homo sapiens (human)                  46                      3 x 10^9

Proteins: Chains of amino acid residues. There are 20 different amino acids.
Functions:
Tissue building blocks (structural proteins)
Catalysts (enzymes)
Oxygen transport
Antibody defense

[Figure: polypeptide backbone with side chains R1, R2, R3 and the dihedral angles Φ and ψ]

First      Second position           Third
position   U     C     A     G       position
U          Phe   Ser   Tyr   Cys     U
           Phe   Ser   Tyr   Cys     C
           Leu   Ser   STOP  STOP    A
           Leu   Ser   STOP  Trp     G
C          Leu   Pro   His   Arg     U
           Leu   Pro   His   Arg     C
           Leu   Pro   Gln   Arg     A
           Leu   Pro   Gln   Arg     G
A          Ile   Thr   Asn   Ser     U
           Ile   Thr   Asn   Ser     C
           Ile   Thr   Lys   Arg     A
           Met   Thr   Lys   Arg     G
G          Val   Ala   Asp   Gly     U
           Val   Ala   Asp   Gly     C
           Val   Ala   Glu   Gly     A
           Val   Ala   Glu   Gly     G

Protein Synthesis (DNA → Protein)
DNA --(Transcription)--> mRNA --(Translation)--> Protein
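The two steps of protein synthesis can be sketched as string operations. This is a simplified illustration, assuming the input is the coding strand written 5' to 3' and that translation runs from the first AUG to the first STOP codon; it uses one-letter amino acid codes (M = Met, A = Ala, W = Trp and so on), and the function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(coding_dna: str) -> str:
    """mRNA carrying the same sequence as the coding strand (T becomes U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read codons from the first AUG until a STOP codon or the end."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate(transcribe("ATGGCCTGGTAA")))   # Met-Ala-Trp, then STOP
```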


What Can Be Done Experimentally?
DNA sequences of length up to 700-800 bp can be read (Sanger's method).
DNA samples can be amplified (PCR).
Protein sequences can be determined.
Structure of proteins can be determined using X-ray crystallography (expensive, tedious, time-consuming).

Challenges in Computational Biology
1. Find the genomes of all organisms.
2. Identify and annotate genes.
3. Find the sequences, three-dimensional structures and functions of all proteins.
4. Find sequences of proteins that have desired three-dimensional structures.
5. Compare DNA sequences and protein sequences for similarity.
6. Study the evolution of sequences and species.

Part II: Sequence Alignments

Pairwise Sequence Alignment
Problem: Find similarity between two sequences.
Variations:
Given two sequences, find if parts of them are similar (local alignment).
Given a large sequence and a short sequence, find if the short sequence is similar to a stretch of the long sequence.

Pairwise Global Alignment
Alignment: Stacking the sequences against each other, with gaps if necessary, to expose similarity.
Score: A measure of quality of an alignment.

  C   A   T   -   T   C   A   -   C
  C   -   T   C   G   C   A   G   C
 -----------------------------------
  1  -2   1  -2  -1   1   1  -2   1   = -2

Pairwise Global Alignment
T[i,j] = Score of optimally aligning first i bases of s with first j bases of t.

\[
T[i,j] = \max
\begin{cases}
T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\
T[i-1,j] - g \\
T[i,j-1] - g
\end{cases}
\]

where g is the gap penalty.

          C    T    C    G    C    A    G    C
     0   -2   -4   -6   -8  -10  -12  -14  -16
C   -2    1   -1   -3   -5   -7   -9  -11  -13
A   -4   -1    0   -2   -4   -6   -6   -8  -10
T   -6   -3    0   -1   -3   -5   -7   -7   -9
T   -8   -5   -2   -1   -2   -4   -6   -8   -8
C  -10   -7   -4   -1   -2   -1   -3   -5   -7
A  -12   -9   -6   -3   -2   -3    0   -2   -4
C  -14  -11   -8   -5   -4   -1   -2   -1   -1
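The table can be reproduced with a direct implementation of the recurrence. The sketch below assumes the scoring values suggested by the scored example alignment (match +1, mismatch -1, gap penalty g = 2); these values are inferred rather than stated on the slides, and the function name is illustrative.

```python
def global_alignment_table(s, t, match=1, mismatch=-1, gap=2):
    """Fill T[i][j] = best score of aligning s[:i] with t[:j]."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: i bases against gaps
        T[i][0] = -gap * i
    for j in range(1, n + 1):          # first row: j bases against gaps
        T[0][j] = -gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + sub,   # align s[i] with t[j]
                          T[i - 1][j] - gap,       # s[i] against a gap
                          T[i][j - 1] - gap)       # t[j] against a gap
    return T

T = global_alignment_table("CATTCAC", "CTCGCAGC")
for row in T:
    print(" ".join(f"{v:4d}" for v in row))
```

The bottom-right entry is the optimal global score. The table takes O(mn) time and space to fill.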

Local Alignment

\[
T[i,j] = \max
\begin{cases}
T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\
T[i-1,j] - g \\
T[i,j-1] - g \\
0
\end{cases}
\]

Initialize top row and leftmost column to zero.
Start with a maximal value in the table and trace back.
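A minimal sketch of local alignment follows: the extra 0 in the recurrence lets an alignment start anywhere, and the traceback begins at a maximal cell and stops at the first zero. The scoring values (match +1, mismatch -1, gap penalty 2) and the function name are assumptions for illustration.

```python
def local_alignment(s, t, match=1, mismatch=-1, gap=2):
    """Return (best score, aligned piece of s, aligned piece of t)."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]   # top row and left column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(0,
                          T[i - 1][j - 1] + sub,
                          T[i - 1][j] - gap,
                          T[i][j - 1] - gap)
            if T[i][j] > best:
                best, bi, bj = T[i][j], i, j
    # Trace back from the maximal cell until a zero entry is reached.
    top, bottom, i, j = [], [], bi, bj
    while i > 0 and j > 0 and T[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if T[i][j] == T[i - 1][j - 1] + sub:
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif T[i][j] == T[i - 1][j] - gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(local_alignment("XXACGTXX", "YYACGTYY"))
```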

Affine Gap Penalty Functions
Gap penalty = h + gk
where
k = length of a maximal sequence of gaps
h = gap opening penalty
g = gap continuation penalty

Some Results
Most pairwise sequence alignment problems can be solved in O(mn) time.
Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88].
Two highly similar sequences can be aligned in O(dn) time, where d is a measure of the distance between the sequences [Landau86].

Parallel Sequence Alignment
Each antidiagonal can be computed in parallel: cells with the same i + j depend only on cells of the two preceding antidiagonals, never on each other.
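The antidiagonal order can be made explicit with a small sketch. It runs sequentially, but the inner loop over one antidiagonal has no dependencies between its cells, which is the property a parallel implementation would exploit. The scoring values and function names are assumptions for illustration.

```python
def antidiagonals(m, n):
    """Yield, for d = 2 .. m + n, the interior cells (i, j) with i + j = d."""
    for d in range(2, m + n + 1):
        yield [(i, d - i) for i in range(max(1, d - n), min(m, d - 1) + 1)]

def align_by_antidiagonals(s, t, match=1, mismatch=-1, gap=2):
    """Global alignment score, filling the table one antidiagonal at a time."""
    m, n = len(s), len(t)
    # Border cells hold the cost of aligning a prefix against gaps only.
    T = [[-gap * (i + j) if i == 0 or j == 0 else 0 for j in range(n + 1)]
         for i in range(m + 1)]
    for cells in antidiagonals(m, n):
        # Each cell here reads only earlier antidiagonals, so this loop
        # could be divided among p processors.
        for i, j in cells:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + sub,
                          T[i - 1][j] - gap,
                          T[i][j - 1] - gap)
    return T[m][n]

print(list(antidiagonals(2, 2)))
print(align_by_antidiagonals("CATTCAC", "CTCGCAGC"))
```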

Some Known Results
Parallel sequence alignment can be computed in O(mn/p) time (Edmiston88).
Optimal space-saving algorithm requires only O((m+n)/p) space, but takes O((m+n)^2/p) time (Huang89).
A row-by-row parallelization is possible and is more communication-efficient. Space can be reduced to O(m+n/p) without sacrificing time optimality (Aluru99).

Multiple Sequence Alignment
VTISCTGSSSNIGAG NHVKWYQQLPG
VTISCTGTSSNIGS ITVNWYQQLPG
LRLSCSSSGFIFSS YAMYWVRQAPG
LSLTCTVSGTSFDD YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGA VTVAWKADS
ATLVCLISDFYPGA VTVAWKADS
AALGCLVKDYFPEP VTVSWNSG-
VSLTCLVKGFYPSD IAVEWESNG-

Induced Pairwise Alignment

S1:  S - T I S C T G - S - N I
S2:  L - T I - C N G S S - N I
S3:  L R T I S C S G F S Q N I

Induced pairwise alignment of S1 and S2:

S1:  S T I S C T G - S N I
S2:  L T I - C N G S S N I

Sum-of-Pairs Scoring Function

\[
\text{Score of multiple alignment}
  = \sum_{i<j} \mathrm{score}(S_i, S_j)
  = \sum_{t=1}^{l} \sum_{i<j} \mathrm{score}(S_{it}, S_{jt})
\]

where
score(S_i, S_j) = score of induced pairwise alignment
l = length of the multiple alignment
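Induced alignments and the sum-of-pairs score can be sketched in a few lines. The rows used below are a made-up toy alignment, and the column scores (match +1, mismatch -1, gap 2, and zero for a gap facing a gap) are assumptions rather than values from the tutorial.

```python
from itertools import combinations

def induced_pair(row_a, row_b):
    """Project two rows of a multiple alignment, dropping all-gap columns."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if (a, b) != ("-", "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)

def column_score(a, b, match=1, mismatch=-1, gap=2):
    if a == "-" and b == "-":
        return 0              # assumption: a gap facing a gap costs nothing
    if a == "-" or b == "-":
        return -gap
    return match if a == b else mismatch

def sum_of_pairs(rows):
    """Sum over all pairs i < j and all columns t of score(S_it, S_jt)."""
    return sum(column_score(a, b)
               for x, y in combinations(rows, 2)
               for a, b in zip(x, y))

rows = ["AC-T", "A-GT", "ACGT"]          # hypothetical multiple alignment
print(induced_pair("A--T", "A-GT"))      # the shared gap column disappears
print(sum_of_pairs(rows))
```

Since an all-gap column scores zero here, scoring the full rows pair by pair gives the same total as scoring each induced pairwise alignment.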

Multiple Alignment
Run-time of dynamic programming solution = O(2^k n^k)
where
n = length of each sequence
k = number of sequences
Space, O(n^k), is prohibitively large!
Example: 6 sequences of length 100 require 6.4 x 10^13 calculations!
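The figure in the example follows directly from the run-time bound, taking the constant hidden in the O-notation as 1. The same substitution gives the number of table cells that would have to be stored.

```latex
\[
2^{k} n^{k}\Big|_{k=6,\; n=100} = 2^{6} \cdot 100^{6} = 64 \cdot 10^{12} = 6.4 \times 10^{13},
\qquad
n^{k} = 100^{6} = 10^{12} \ \text{table cells}.
\]
```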

Carrillo-Lipman Heuristic
U = Upper bound on multiple alignment score

If
\[
T[i_1, i_2, \ldots, i_k] + \sum_{j<l} \mathrm{score}\bigl(S_j[i_j, \ldots, n],\; S_l[i_l, \ldots, n]\bigr) > U
\]
then T[i_1, i_2, ..., i_k] cannot be on an optimal path.

Multiple Alignment to a Phylogenetic Tree
A tree showing the evolutionary relationship between sequences is available.
Compute multiple alignment such that for each edge (i,j) in the tree:
Induced alignment between S_i and S_j = Optimal alignment between S_i and S_j.

Multiple Alignment to a Tree
Build the multiple alignment incrementally.
To add a new sequence, an edge should connect it in the tree to a sequence already incorporated in the multiple alignment.
Insert the new sequence according to its optimal alignment with the other sequence connected by the edge.
Adjust other sequences in the multiple alignment.
Run-time = time for k pairwise alignments.

Searching Biological Databases
BLAST (Basic Local Alignment Search Tool)
http://www.ncbi.nlm.nih.gov
BLASTN (DNA)
BLASTP (Protein)
BLASTX (DNA against Protein)
PSI-BLAST (Position Specific Iterative BLAST)

Multiple Alignment Software
ClustalW (http://www.ebi.ac.uk/clusalw)
MSA (http://softlib.rice.edu/softlib/msa.html)
HMMER (http://hmmer.wustl.edu/)
SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)

Open Problems - Sequential
1. Gene-to-gene alignment to identify exons and introns.
2. Full genome comparison. Genomes contain mobile components known as transposons. Due to transposons and genome rearrangements, full genome comparison is not straightforward.

Open Problems - Parallel
1. Parallel alignment of similar sequences.
2. Parallel spliced alignment: DNA to gene, gene to gene.
3. Parallel full-genome comparison.
4. Parallel multiple sequence alignment.

Part III: String Data Structures and Algorithms

Why Strings?
Biological sequences can be viewed as strings of characters over an alphabet.
Sequence similarities typically translate to functional similarities.

Suffix Tree
String: M A L A Y A L A M $
Index:  1 2 3 4 5 6 7 8 9 10
[Figure: suffix tree of MALAYALAM$; edges are labeled with substrings such as A, LA, M, YALAM$ and $, and each leaf is labeled with the starting position (1 to 10) of its suffix]

Suffix Tree
[Figure: the same suffix tree with each edge label stored as a pair of (start, end) positions in the string, for example (5, 10) for YALAM$]

Finding a Pattern in a String
1. Build a suffix tree of the string.
2. Starting from the root, traverse a path matching characters of the pattern.
3. If stuck, pattern not present in string. Otherwise, each leaf below gives a position of the pattern in the string.

Finding Pattern in a String
Find ALA
[Figure: the path spelling ALA in the suffix tree of MALAYALAM$; the leaves below it, 6 and 2, are the positions of ALA in the string]

Finding Common Substrings
Construct a generalized suffix tree for two strings (each suffix of each string is represented).
Label each leaf with the suffix number and string label.
Each internal node with a leaf from each string in its subtree gives a common substring.

Generalized Suffix Tree
WINDOW$   INDIGO$
1234567   1234567
[Figure: generalized suffix tree of WINDOW$ and INDIGO$, with each leaf labeled by a (string number, suffix number) pair such as (1, 2) or (2, 1)]

Suffix Array: Reducing Space

String: M A L A Y A L A M $
Index:  1 2 3 4 5 6 7 8 9 10

Suffix Array   Suffix        lcp Array
 6             ALAM$         3
 2             ALAYALAM$     1
 8             AM$           1
 4             AYALAM$       0
 7             LAM$          2
 3             LAYALAM$      0
 1             MALAYALAM$    1
 9             M$            0
 5             YALAM$        0
10             $             ---
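Both arrays can be computed naively by sorting the suffixes, as in the sketch below. It is intended only to make the definitions concrete and is far slower than the linear-time constructions. It follows the slide's convention that $ sorts after every letter, which is emulated here by comparing $ as '~'; the function names are illustrative.

```python
def suffix_array(text):
    """1-based start positions of the suffixes in sorted order; '$' sorts last."""
    def key(i):
        return text[i:].replace("$", "~")   # '~' is above 'A'..'Z' in ASCII
    return [i + 1 for i in sorted(range(len(text)), key=key)]

def lcp_array(text, sa):
    """lcp[r] = longest common prefix length of the suffixes ranked r and r + 1."""
    def lcp(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [lcp(sa[r] - 1, sa[r + 1] - 1) for r in range(len(sa) - 1)]

sa = suffix_array("MALAYALAM$")
print(sa)
print(lcp_array("MALAYALAM$", sa))
```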

Suffix Trees vs. Suffix Arrays
Suffix Array = Lexicographic order of the leaves of the Suffix Tree
Suffix Tree = Suffix Array + lcp Array

Pattern Search in Suffix Array
All suffixes that share a common prefix appear in consecutive positions in the array.
Pattern P can be located in the string using a binary search on the suffix array.
Naïve run-time = O(|P| log n).
Improved to O(|P| + log n) [Manber&Myers93], and to O(|P|) [Abouelhoda et al. 02].
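The naive O(|P| log n) search is two binary searches that bracket the block of suffixes starting with P. The sketch below assumes a suffix array sorted in plain string order (so $ sorts first, unlike the slide's figure) and builds it by direct sorting for brevity; the function name is illustrative.

```python
def find_occurrences(text, sa, pattern):
    """Sorted 1-based positions of `pattern`, found by binary search over `sa`.

    `sa` holds 1-based suffix start positions sorted in plain string order.
    """
    def prefix(rank):
        start = sa[rank] - 1
        return text[start:start + len(pattern)]

    lo, hi = 0, len(sa)
    while lo < hi:                      # first rank whose prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first rank whose prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "MALAYALAM$"
sa = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
print(find_occurrences(text, sa, "ALA"))
```

Each comparison inspects at most |P| characters, and there are O(log n) comparisons, which gives the naive bound quoted above.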

Other Applications
Common substrings of multiple strings
Suffix-prefix overlaps
Maximal and tandem repeats
Shortest unique substrings
Maximal unique matches [MUMmer]
Approximate matching with bounded errors

Limitations of String Data Structures
Can only be used to extract information in the absence of errors.
Problems dealing with errors may be solved by decomposing into components that do not involve errors.
Example: If two sequences exhibit similarity, there must be substrings common to them.

Some Results
1. Suffix trees can be constructed in O(n) time and O(n) space [Weiner73, McCreight76, Ukkonen92].
2. Suffix arrays can be constructed without using suffix trees in O(n) time [Pang&Aluru02].
3. Suffix trees can be built in O(log^4 n) time on the CREW PRAM model [Hariharan94].

Open Problems
1. Algorithms independent of alphabet size.
2. Practically efficient parallel algorithms for suffix trees and arrays.
3. What is the best way to store a biological database on a disk? Some work on disk-based data structures:
String B-trees [Ferragina & Grossi 95].
Suffix trees on disk [Clark & Munro 96].

Software Development Opportunities
Develop a general-purpose tree-based database system for efficiently storing, inserting and deleting, and querying biological sequences.
Current approach: Store sequences as a flat file. Entire database is searched for each query!

Part IV: Genome Assembly

Sequencing a Genome
Physical Mapping: Find markers along the genome, to find unique contigs (possibly overlapping) that cover the genome.
Fragment Assembly: Sequence each contig by breaking it into several short fragments, sequencing the fragments, and assembling them together.

Physical Mapping: Sequence Tagged Sites
A Sequence Tagged Site (STS) is a probe sequence that attaches to a unique position in the genome (length about 200-300 bases).
The probe can identify the existence of the short sequence in the genome but cannot specify its location.

Cutting With Restriction Enzymes
A restriction enzyme is a protein that cuts DNA at a specific pattern (typically a palindrome).
Example: EcoRI
[Figure: double-stranded DNA cut by EcoRI at its recognition site, GAATTC on one strand paired with CTTAAG on the other]

Physical Mapping
Generate a large number of fragments of the genome, called clones.
Find which probes attach to which clones.
Find order of the fragments along the genome.

Clones and Probes
[Figure: clones 1 to 6 drawn as overlapping intervals along the genome, with probes A to G marked along it]

STS Matrix

     A  B  C  D  E  F  G
1    0  1  1  0  0  0  0
2    1  0  0  0  1  0  1
3    1  0  0  0  1  0  1
4    1  0  0  1  0  1  0
5    1  0  0  0  1  1  0
6    1  0  1  0  1  0  1

STS Hybridization Problem
Given: STS matrix
Find: Permutation of the columns such that the 1s in each row are consecutive.
Algorithm runs in linear time, assuming the matrix has no errors (Booth76).
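For a matrix this small, the problem can be explored by brute force. The sketch below tries every column order and keeps those with the consecutive-ones property; it is an exponential-time illustration, not the linear-time algorithm cited above. The matrix is the STS matrix from the tutorial read row by row, with rows taken as clones 1 to 6 and columns as probes A to G, which is an assumption about its layout.

```python
from itertools import permutations

def ones_are_consecutive(row):
    """True if the 1s of a 0/1 row form a single contiguous block."""
    ones = [k for k, bit in enumerate(row) if bit]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def consecutive_ones_orders(matrix, labels):
    """All column orders that make the 1s of every row consecutive."""
    found = []
    for perm in permutations(range(len(labels))):
        if all(ones_are_consecutive([row[c] for c in perm]) for row in matrix):
            found.append("".join(labels[c] for c in perm))
    return found

STS = [  # rows = clones 1..6, columns = probes A..G (assumed layout)
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 0, 1],
]
print(consecutive_ones_orders(STS, "ABCDEFG"))
```

Under this reading, the only valid probe orders are one order and its mirror image, as expected for a map whose left-to-right orientation is unknown.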

Errors in STS Data
False positives: Clone is reported to contain an STS, but it does not.
False negatives: Clone is reported to not contain an STS, but it does.
Chimeras: Two different DNA fragments combine and act as one clone.

Mapping Problem in Presence of Errors
In the absence of errors, overlap information is an interval graph.
Find a way to discard some information in order to obtain an interval graph.
Several ways of modeling the problem are NP-complete!

Fragment Assembly
Given: A collection of DNA fragments
Assemble: The fragments into maximal-length contiguous sequences, or contigs, using overlap information.

Shortest Common Superstring
In the absence of errors, fragment assembly = finding the shortest common superstring of the given fragments.
Shortest common superstring problem is NP-hard.

Greedy Heuristic
Find two fragments that have a maximum overlap and combine them into one contig.
Iterate by treating contigs as fragments.
Greedy heuristic results in a 4-approximation algorithm.
Approximation factor has been improved to 2.2.
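The greedy heuristic is short enough to sketch directly. The version below assumes error-free fragments from a single strand and uses exact suffix-prefix overlaps; the fragments in the usage line are made up, and the function names are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(fragments):
    """Repeatedly merge the two fragments with the largest overlap."""
    # Drop duplicates and fragments contained in another fragment.
    frags = [f for f in set(fragments)
             if not any(f != g and f in g for g in fragments)]
    frags.sort()                        # fixed order, so ties break reproducibly
    while len(frags) > 1:
        k, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(frags)
                      for j, b in enumerate(frags) if i != j)
        merged = frags[i] + frags[j][k:]            # contig from the best pair
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags[0] if frags else ""

print(greedy_superstring(["ATTCG", "TCGGG", "GGGAAT", "AATGC"]))
```

Recomputing all pairwise overlaps in every round keeps the sketch short but costs far more than a practical assembler could afford; real tools also have to cope with errors, orientation and repeats.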

Difficulties in Fragment Assembly
Fragments contain errors
Lack of sufficient coverage
Different fragments may combine (chimeras)
Which strand did it come from? (Unknown orientation)
Repeats in the genome

Approach to Fragment Assembly
For each fragment and partial contig formed, consider both the sequence and its reverse complement.
Detect overlaps using dynamic programming to allow for errors.

Possible Fragment Overlaps
[Figure: the possible overlap configurations of two fragments F1 and F2]

Approach to Fragment Assembly
Preprocessing:
Eliminate pairs of fragments that cannot have significant overlap (quick check).
Compute overlap between promising pairs using dynamic programming.
If a fragment is completely contained in another, discard the shorter fragment.

Approach to Fragment Assembly
Forming Contigs (Greedy Heuristic):
Combine fragments with strongest evidence of overlap.
Treat the resulting partial contig as a single fragment and consider overlapping ends unavailable.
Iterate using next strongest available overlap.

Approach to Fragment Assembly
Generating Consensus Sequence:
Perform a multiple sequence alignment between parts of fragments overlapping in the same position to obtain better contigs.

Fragment Assembly Software
1. CAP3 (ftp://cs.mtu.edu/pub/huang)
2. Phrap (http://www.phrap.com)
3. TIGR Assembler (http://www.tigr.org/softlab/assembler)

Genome Sequencing
Complete genomes of over 800 organisms are currently (or soon to be) available.
http://www.ncbi.nlm.nih.gov/entrez/genome/main_genomes.html

Part V: Gene Identification and Annotation

Sequencing the genome is not an end-goal!
Identify genes on the genome.
Find the corresponding family of proteins.
Find the functions of the proteins and how they are regulated.
Study the natural variations in the gene among related species and different strains of the same species.
Study variation between healthy and disease-causing genes.

Gene Structure
[Figure: DNA with a promoter, exons and introns is transcribed into pre-mRNA; RNA splicing then yields mRNA with a 5' cap and a poly-A tail]

EST Clustering Provides Clues to Finding Genes
[Figure: genomic DNA with exons separated by introns 1 and 2, the spliced mRNA (exon 1, exon 2, exon 3), and the ESTs derived from it]

How to Obtain EST Data?
dbEST (http://www.ncbi.nlm.nih.gov/dbest)
12,845,578 ESTs as of September 20, 2002

Organism               Number of ESTs
Human                  4,691,979
Mouse                  2,706,977
Arabidopsis thaliana   174,624
Zea mays               180,587
Rice                   108,429

Goals of EST Clustering
Clustering: Build clusters with each cluster containing ESTs from the same gene.
Identification: Identify the gene.
Annotation: Find and assign a function to the corresponding protein.

Alternative Splicing
[Figure: two genes, each giving rise to two mRNAs (mRNA 1 and mRNA 2) built from exons, introns and an optional exon]

Difficulties in EST Clustering: Lack of Coverage
[Figure: an mRNA and the ESTs sampled from it]

Difficulties in EST Clustering: Duplicated Genes
[Figure: mRNA from a gene and mRNA from a duplicated gene; their ESTs show a high degree of similarity]

Approaches to EST Clustering
Use pairwise comparisons between ESTs to put ESTs into clusters.
1. Exhaustive approach: compare all pairs of ESTs.
2. Use fragment assembly software.

Fragment Assembly Is Not Suited to EST Clustering
Lack of sufficient coverage.
ESTs come from different individuals and different strains of the same species.
Genomic and protein databases provide additional clues to EST clustering.

Fragment Assembly Is Not Suited to EST Clustering
Number of EST fragments is too large.
ESTs are obtained in batches. Fragment assembly software is not incremental.

Evaluation of Current Software
Single node of IBM xSeries cluster

            n = 100,001                    n = 144,870
            TIGR       PHRAP    CAP3       TIGR   PHRAP     CAP3
Run-time    1200 min   91 min   150 min    X      154 min   241 min
Memory      1.84 GB    857 MB   1.93 GB    X      1.99 GB   2.03 GB

NIH UniGene Project
Perform database search for each EST.
Results are accrued incrementally using weekly builds on 80-processor Intel farm.
Quality overrides computational issues.

Space and Time Efficient EST Clustering
Initially, treat each EST as a cluster by itself.
If two ESTs from two different clusters show significant overlap, merge the clusters.
Use union-find data structure.
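The clustering loop above maps naturally onto a union-find structure. The following is a minimal sketch: the `overlaps` callback stands in for whatever alignment test decides that two ESTs overlap significantly, and the pair list and the function names are hypothetical.

```python
class UnionFind:
    """Disjoint sets with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra                # attach smaller tree to larger
        self.size[ra] += self.size[rb]
        return True

def cluster_ests(n_ests, promising_pairs, overlaps):
    """Merge clusters for every promising pair that passes the overlap test.

    Pairs whose ESTs already share a cluster are skipped without testing.
    Returns the clusters and the number of overlap tests performed.
    """
    uf = UnionFind(n_ests)
    tests = 0
    for a, b in promising_pairs:
        if uf.find(a) == uf.find(b):
            continue
        tests += 1
        if overlaps(a, b):
            uf.union(a, b)
    clusters = {}
    for est in range(n_ests):
        clusters.setdefault(uf.find(est), []).append(est)
    return sorted(clusters.values()), tests

pairs = [(0, 1), (1, 2), (0, 2), (3, 4), (2, 3)]
print(cluster_ests(5, pairs, lambda a, b: (a, b) != (2, 3)))
```

Skipping pairs that are already in the same cluster is what makes the order of pair generation matter: the earlier the true overlaps are reported, the fewer expensive alignments are needed.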

Reporting high-quality promising pairs first is important!
[Figure: flowchart of pair processing, in which a pair that passes the alignment test (a successful overlap) results in a merge]

Generating Promising Pairs
Quality of overlap = length of a maximal common substring.
Promising pairs are pairs that have a maximal common substring of length at least ψ.
Produce promising pairs on demand, in decreasing order of quality.

Pair Generation Algorithm
Build Generalized Suffix Tree (GST) of the ESTs.
Process the nodes in the GST in decreasing order of string-depth and generate pairs at each node.
Generate a pair at a node only if the corresponding overlap is maximal.

Main Idea of the Algorithm
[Figure: fragment of a generalized suffix tree illustrating a maximal common substring α = xβ shared by ESTs s1 and s2, with leaves (s1, i), (s2, j), (s1, i+1) and (s2, j+1)]

Parallel EST Software
[Figure: software organization with a construction/preprocessing phase and a parallel clustering phase]

Run-time vs. Number of Processors
[Plot: run-time in seconds (0 to 7,000) against number of processors (4, 8, 16, 32, 64), one curve each for n = 10,000, 20,000, 40,000, 80,000 and 144,870]

Number of Pairs vs. Number of ESTs
[Plot: number of pairs in thousands (0 to 5,000) against number of ESTs (10,000, 20,000, 40,000, 80,000, 144,870), split into aligned and accepted, aligned and rejected, and unaligned pairs]

Open Problems
Develop software that can cluster the human EST collection (~4.7 million currently).
Improve quality of clustering:
Detect alternative splicing.
Consult genomic & protein databases.
Develop a comprehensive software system for gene identification combining EST clustering, ab initio gene prediction and genome comparison.

Part VI: Microarrays and Gene Expression Analysis

Gene Expression Studies How does gene expression level differ in various cell types and states? How is gene expression changed by diseases? What are the functional roles of different genes? How are genes regulated?

Microarray A glass slide on which single-stranded DNA molecules are attached at fixed spots. Each molecule corresponds to a gene (e.g., an EST). When a solution containing single-stranded molecules is washed over the slide, binding based on complementarity takes place. A single microarray can contain tens of thousands of spots.

Comparing mRNA abundance mRNA from sample and control are labeled with different fluorescent dyes. Both solutions are washed over the microarray. Relative abundance of different mRNA can be judged by color/intensity difference.
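One common way to turn the two dye intensities of a spot into a number is a log ratio. The sketch below is an illustration only: the function names and the two-fold threshold are assumptions, not something the slides specify.

```python
import math


def log_ratio(sample_intensity, control_intensity):
    """log2(sample / control) for one spot; 0 means equal abundance."""
    return math.log2(sample_intensity / control_intensity)


def classify_spot(sample_intensity, control_intensity, threshold=1.0):
    """Label a spot by its log ratio; threshold=1.0 (two-fold) is an assumed cutoff."""
    m = log_ratio(sample_intensity, control_intensity)
    if m >= threshold:
        return "up"
    if m <= -threshold:
        return "down"
    return "unchanged"
```

With these definitions, a spot with sample intensity 800 and control intensity 200 has a log ratio of 2 and is labeled "up".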

Credit: A. Michael Cambell

The Full Yeast Genome on a Chip
Statistics:
  6116 yeast genes
  96 intergenic regions + lots of control samples
  Total spots printed: 707,520
  Total arrays: 110
  Actual time to print: 52 hours
  Actual speed: 226.7 spots/min
  Total cycles: 1608
  Total water usage: 23 liters
  Tip spacing: 221 µm
  Taps per tip: 176,880
  Completed: 25 April 1997
Patrick O. Brown Lab, Stanford: http://cmgm.stanford.edu/pbrown/yeastchip.html

Microarray Databases Repositories containing information obtained by microarray experiments: http://www.ncgr.org/research/genex/other_tools.html and http://www.biologie.ens.fr/en/genetiqu/puces/bddeng.html

Microarray Analysis Lists of software packages: http://genome-www5.stanford.edu/microarray/smd/restech.html and http://ihome.cuhk.edu.hk/~b400559/arraysoft.html Methods include hierarchical clustering and self-organizing maps.

Gene Expression Matrix A way to capture microarray data. Rows correspond to genes. Columns represent samples (different developmental stages, conditions and tissues).

Using Gene Expression Matrices Compare gene expression profiles. Find co-regulated genes. Compare expression profiles of samples. Find differentially expressed genes.

Finding Co-regulated Genes Each gene can be represented as a point in n-dimensional space. Use clustering algorithms to find co-regulated genes.
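A minimal sketch of this idea follows, treating each row of a gene expression matrix as a point and merging the closest clusters bottom-up. The choices of Euclidean distance and average linkage are assumptions made for illustration; the function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the genes of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clustering(profiles, num_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]
```

Each profile here is one gene's expression values across the samples; genes that end up in the same cluster are candidates for co-regulation. This naive version recomputes all distances at every merge and is meant for small examples only.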

Example - Hierarchical Clustering

Summary Microarrays are a relatively new technology, allowing simultaneous collection of vast experimental data. Data mining and AI techniques are used to discover information from microarray data. Innovative uses of microarrays are still being discovered.

Part VII: Protein Folding

Primary Structure
  1  AASXDXSLVE VHXXVFIVPP XILQAVVSIA
 31  TTRXDDXDSA AASIPMVPGW VLKQVXGSQA
 61  GSFLAIVMGG GDLEVILIXL AGYQESSIXA
 91  SRSLAASMXT TAIPSDLWGN XAXSNAAFSS
121  XEFSSXAGSV PLGFTFXEAG AKEXVIKGQI
151  TXQAXAFSLA XLXKLISAMX NAXFPAGDXX
181  XXVADIXDSH GILXXVNYTD AXIKMGIIFG
211  SGVNAAYWCD STXIADAADA GXXGGAGXMX
241  VCCXQDSFRK AFPSLPQIXY XXTLNXXSPX
271  AXKTFEKNSX AKNXGQSLRD VLMXYKXXGQ
301  XHXXXAXDFX AANVENSSYP AKIQKLPHFD
331  LRXXXDLFXG DQGIAXKTXM KXVVRRXLFL
361  IAAYAFRLVV CXIXAICQKK GYSSGHIAAX
391  GSXRDYSGFS XNSATXNXNI YGWPQSAXXS
421  KPIXITPAID GEGAAXXVIX SIASSQXXXA
451  XXSAXXA

Secondary Structure - α helix

Secondary Structure - β sheet

Tertiary Structure

Quaternary Structure

Problems in Protein Folding 1. Folding Problem: Given the sequence of a protein, computationally determine its structure. 2. Inverse Folding Problem: Given the structure into which a protein should fold, find a possible amino acid sequence of the protein.

Why Should Sequence → Structure Determination Be Possible? Proteins with sequence similarity tend to have structural similarity. If a protein is deformed under external force, it quickly folds back into its unique shape after the force is removed.

IBM Blue Gene Project $100 million, 100,000-processor petaflop supercomputer for protein folding. Expected to simulate one protein in a year. Blue Gene/L: 65,536 processors, 32 × 32 × 64 torus (by year 2004).

Approach I Molecular Dynamics Idea: Forces acting on the atoms in a protein and constraints are known. Perform simulation. Problem: Time step required is too small (10^-18 sec). Best reported simulation: 10^-6 sec. Folding requires a few seconds.

Approach II Lattice Models Proteins are represented as self-avoiding walks on lattices (cubic, hexagonal, etc.). Each amino acid residue is modeled as hydrophobic (H) or hydrophilic (P). Position the residues subject to the linear (chain) constraint, maximizing H-H contacts.

Approach II Lattice Models Problem is NP-complete. Approximation algorithms have been designed.
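To make the objective concrete, the sketch below scores one candidate fold on a 2D square lattice by counting H-H contacts. The 2D square lattice and the function name `hh_contacts` are assumptions chosen for illustration; it evaluates a given fold and does not search for the best one.

```python
def hh_contacts(sequence, coords):
    """Count H-H contacts of a fold on the 2D square lattice.

    sequence: string over {'H', 'P'}; coords: one (x, y) lattice point per residue.
    A contact is a pair of H residues on neighbouring lattice points that are
    not consecutive in the chain.
    """
    if len(sequence) != len(coords):
        raise ValueError("one coordinate per residue is required")
    if len(set(coords)) != len(coords):
        raise ValueError("walk is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")

    position = {c: i for i, c in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every neighbouring pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts
```

For the sequence HPPH folded around a unit square, the two H residues at the chain ends become lattice neighbours and the fold scores one contact; laid out in a straight line it scores zero.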

Approach III Energy Minimization Hypothesis: Different amino acids have different chemical, electrical and size properties. Different folds of a protein have different levels of energy. A protein folds into its minimum energy configuration.

Approach III Energy Minimization Start with a protein configuration. Compute the energy of the configuration. Incrementally fold the protein to reduce its energy. Iterate until convergence. Many known energy minimization methods can be applied (steepest descent, simulated annealing, etc.).
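The loop above can be sketched with simulated annealing, one of the methods named on the slide. Everything specific here is an assumption for illustration: the toy energy function over a tuple of angles stands in for a real protein energy model, and the cooling schedule, step size and function names are hypothetical.

```python
import math
import random


def simulated_annealing(energy, neighbour, start, steps=2000,
                        t_start=1.0, t_end=0.01, seed=0):
    """Minimize energy(config) by random local moves with a cooling temperature."""
    rng = random.Random(seed)
    current, e_cur = start, energy(start)
    best, e_best = current, e_cur
    for k in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (k / max(steps - 1, 1))
        cand = neighbour(current, rng)
        e_cand = energy(cand)
        # Always accept improvements; accept uphill moves with Boltzmann probability.
        if e_cand <= e_cur or rng.random() < math.exp((e_cur - e_cand) / t):
            current, e_cur = cand, e_cand
            if e_cur < e_best:
                best, e_best = current, e_cur
    return best, e_best


def toy_energy(angles):
    """Toy stand-in for a folding energy: minimal when every angle equals 1.0 rad."""
    return sum(1.0 - math.cos(a - 1.0) for a in angles)


def perturb(angles, rng):
    """Incremental move: nudge one randomly chosen angle."""
    k = rng.randrange(len(angles))
    moved = list(angles)
    moved[k] += rng.gauss(0.0, 0.3)
    return tuple(moved)
```

A call such as `simulated_annealing(toy_energy, perturb, (3.0, -2.0, 0.5))` returns the lowest-energy configuration seen and its energy; a real application would replace `toy_energy` and `perturb` with a physical energy model and valid conformational moves.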

Approach IV Protein Threading Find proteins with known structure that exhibit similarity to the protein to be folded. Use structures of highly similar components to determine a possible structure for the new protein. Use this structure as the basis for more computational folding operations.

Other Problems Structure Similarity Given: The three-dimensional structure of two proteins. Find: The structural similarity between them.

Other Problems Protein Docking Given: A receptor molecule and a drug molecule. Find: A matching between the receptor surface and the drug molecule surface maximizing the contact area between the surfaces.

Other Problems Accessible Surface Area Given: The three-dimensional structure of a protein. Find: The cumulative surface area of the atoms of the protein that is accessible to a solvent molecule. Atoms and the solvent molecule are modeled as spheres using van der Waals radii.
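A rough numerical sketch of this problem follows. It uses point sampling on probe-expanded spheres (in the spirit of the Shrake-Rupley method), which is a technique swapped in here for illustration and not necessarily the approach behind the tutorial. The probe radius of 1.4 (a value often used for water, in ångströms), the number of sample points and the function names are assumptions.

```python
import math


def sphere_points(n):
    """n roughly evenly spread points on the unit sphere (golden-spiral layout)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        phi = golden * k
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


def accessible_surface_area(atoms, probe_radius=1.4, n_points=200):
    """Estimate the solvent-accessible area; atoms is a list of (x, y, z, radius)."""
    unit = sphere_points(n_points)
    total = 0.0
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe_radius  # sphere traced by the probe centre
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            # A sample point is buried if it lies inside another expanded sphere.
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2
                < (rj + probe_radius) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms)
                if j != i
            )
            if not buried:
                exposed += 1
        total += 4.0 * math.pi * big_r ** 2 * exposed / n_points
    return total
```

For a single isolated atom every sample point is exposed, so the estimate equals the full area of the expanded sphere; overlapping atoms hide part of each other's surface and the total drops. The cost is quadratic in the number of atoms, so this is suitable only for small inputs.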

Part VIII: Comparative Genomics & Reconstructing Evolutionary Histories (Phylogenetic Trees)

Comparative Genomics [Figure: two genomes side by side] Chicken: NCBI accession #NC_001323. Human: NCBI accession #NC_001807.

Eukaryotic Cell

Organism                   Est. size             Est. # genes   Average gene density
Human                      3000 million bases    ~30,000        1 gene per 100,000 bases
M. musculus (mouse)        3000 million bases    30,000         1 gene per 100,000 bases
Drosophila (fruit fly)     135.6 million bases   13,061         1 gene per 13,781 bases
Arabidopsis (plant)        100 million bases     25,000         1 gene per 4,000 bases
C. elegans (roundworm)     97 million bases      19,099         1 gene per 5,079 bases
S. cerevisiae (yeast)      12.1 million bases    6,034          1 gene per 2,005 bases
E. coli (bacteria)         4.67 million bases    3,237          1 gene per 1,443 bases
H. influenzae (bacteria)   1.8 million bases     1,740          1 gene per 1,034 bases

Phylogenetics Find the genetic connections and relationships between species (or sequences). Hypothesis: All existing organisms are derived from some common ancestor. A new species arises by a splitting of one population into two (or more) populations that do not cross-breed.

Phylogenetic Trees Each species (or sequence) is described by a set of traits (called characters). Leaves of the tree are labeled with input species. Internal nodes are labeled with input or inferred species. Edges represent transition in values among certain traits.

Types of Phylogenies Relationships between taxa: species trees, gene trees. Data: morphological, nuclear genome, organelle genome. Tree of Life Web (Maddison/Maddison): http://tolweb.org/

Example Phylogenies [Figure: Campanulaceae (bluebell flowers) tree with branch lengths; taxa: Wahlenbergia, Merciera, Trachelium, Symphyandra, Campanula, Adenophora, Legousia, Asyneuma, Triodanus, Codonopsis, Cyananthus, Platycodon, Tobacco.] [Figure: some herpesviruses known to affect humans; taxa: HHV6, EBV, HHV7, HVS, EHV2, KHSV, HSV1, VZV, HSV2, PRV, EHV1, HCMV.] [Figure: Leeches.]

Techniques Maximum parsimony: Occam's razor (the simplest explanation for evolution); minimizes the sum of the number of evolutionary events along the tree branches. Maximum likelihood: statistical methods that use an evolutionary model, such as the transition/transversion rate ratio for the nuclear genome.

Genomic Parsimony: Examples of characters Specific nucleotide in a fixed position of a DNA sequence (conserved in all examined species). Does the amino acid sequence for a protein contain a specific subsequence? Is the expression of a certain protein regulated by another particular protein?

Example
Species    1  2  3  4  5  6
Aardvark   C  A  G  G  T  A
Bison      C  A  G  A  C  A
Chimp      C  G  G  G  T  A
Dog        T  G  C  A  C  T
Elephant   T  G  C  G  T  A

Example [Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant; edges are labeled with the characters that change along them: "1, 3", "2", "4, 5" and "4, 5, 6".]

Perfect Phylogeny for Binary Characters Given: An n × m 0-1 matrix representing n species and m binary characters. Find: A phylogenetic tree T such that the root of the tree represents an ancestor that has none of the m characters, and each character changes from 0 to 1 exactly once and never changes back.

Example
M    1  2  3  4  5
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  1  0
D    0  0  1  0  1
E    1  0  0  0  0
[Figure: the corresponding tree, with edges labeled by characters 1 to 5 and leaves in the order D, B, E, A, C.]
Runs in O(mn) time (Gusfield91).
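Whether a binary matrix admits a perfect phylogeny can be checked with a simple pairwise test: for every two characters, the sets of species having them must be disjoint or nested. The sketch below implements that straightforward test, which takes O(nm^2) time; it is not the O(mn) algorithm cited above, and the function name is hypothetical.

```python
def has_perfect_phylogeny(matrix):
    """Pairwise test on a 0-1 matrix (rows = species, columns = characters)."""
    num_chars = len(matrix[0]) if matrix else 0
    # For each character, the set of species (row indices) that have it.
    columns = [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(num_chars)
    ]
    for a in range(num_chars):
        for b in range(a + 1, num_chars):
            ca, cb = columns[a], columns[b]
            # Overlapping sets must be nested, otherwise no such tree exists.
            if ca & cb and not (ca <= cb or cb <= ca):
                return False
    return True


# The example matrix M (species A to E, characters 1 to 5).
M = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
```

For the example matrix the test succeeds: character 2 is nested in character 1, character 4 in character 2, character 5 in character 3, and characters 1 and 3 are disjoint.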

Perfect Phylogeny for Non-Binary Characters n species, m characters, at most r states. NP-complete. Polynomial time for any fixed r (Agarwala94).

Parsimony Parsimony score of a tree = total number of character changes in the tree: $P(T) = \sum_{(u,v) \in E(T)} |\{\, j : u_j \neq v_j \,\}|$

A Simpler Problem: Known Tree Given: Phylogenetic tree. Find: Minimum parsimony score and optimal labeling of internal nodes. Can be solved in O(nmr) time [Fitch71].
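The scoring half of this problem can be sketched with the bottom-up pass of Fitch's method. The sketch assumes a rooted binary tree written as nested 2-tuples of leaf names; the rooting chosen for the five-species example and the function names are assumptions, and the top-down pass that labels the internal nodes is omitted.

```python
def fitch_site(tree, states):
    """Bottom-up Fitch pass for one character.

    tree: a leaf name (str) or a pair (left, right); states: leaf name -> state.
    Returns (candidate state set at this node, number of changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # Disjoint child sets force one change on an edge below this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, sequences):
    """Sum the minimum number of changes over all characters."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_site(tree, {name: seq[j] for name, seq in sequences.items()})[1]
        for j in range(length)
    )


# Character values from the five-species example, written with character 1 first.
sequences = {
    "Aardvark": "CAGGTA",
    "Bison": "CAGACA",
    "Chimp": "CGGGTA",
    "Dog": "TGCACT",
    "Elephant": "TGCGTA",
}
tree = ((("Aardvark", "Bison"), "Chimp"), ("Dog", "Elephant"))
```

On this tree the six characters need 1, 1, 1, 2, 2 and 1 changes respectively, so `parsimony_score(tree, sequences)` is 8, matching the eight character labels on the edges of the example tree.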

Parsimony Problem NP-hard. Techniques used: Branch and bound [Hendy&Penny82]. Neighbor-Joining [Saitou&Nei87, Studier&Keppler88].

Exploiting data about gene content and gene order has proved extremely challenging from a computational perspective: tasks that can easily be carried out in linear time for DNA data have required entirely new theories (such as the computation of inversion distance) or appear to be NP-hard. The focus has thus been on simple genomes, preferably genomes consisting of a single chromosome, where evolution can reasonably be assumed to have been driven mostly through gene order changes.

Cell Organelles Chloroplasts and mitochondria have such genomes: around 120 genes for the chloroplasts of higher plants and typically 37 genes for the mitochondria of multicellular animals, in both cases packed onto a single chromosome. The gene content of these genomes is fairly constant across a wide phylogenetic range; differences are mostly in the ordering of the genes. [Figures: Chloroplast; Mitochondria]

Gene Order Phylogeny Many organelles appear to evolve mostly through processes that simply rearrange gene ordering (inversion, transposition) and perhaps alter gene content (duplication, loss). Chloroplasts have a single, typically circular, chromosome and appear to evolve mostly through inversion: ... i-1, i, ..., j, j+1 ... becomes ... i-1, -j, ..., -i, j+1 ... The sequence of genes i, i+1, ..., j is inverted and every gene is flipped.
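As a small illustration of the inversion operation, the sketch below represents a gene order as a list of signed integers (the sign gives the orientation). The representation and the function name `invert` are assumptions; positions are 0-based and inclusive, and the circularity of the chromosome is ignored.

```python
def invert(genome, i, j):
    """Return a copy of a signed gene order with positions i..j inverted.

    The segment is reversed and the sign of every gene in it is flipped.
    """
    if not 0 <= i <= j < len(genome):
        raise ValueError("need 0 <= i <= j < len(genome)")
    segment = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + segment + genome[j + 1:]
```

For example, inverting positions 1 to 3 of `[1, 2, 3, 4, 5]` gives `[1, -4, -3, -2, 5]`, and applying the same inversion again restores the original order.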

Phylogeny Heuristic Search [BPanalysis, Sankoff & Blanchette 98]
For each tree topology do    [(2n-5)!! = (2n-5)(2n-7)...5·3 trees]
    somehow assign initial genomes to the internal nodes
    repeat    [iterative heuristic; number of iterations unknown]
        for each internal node do
            compute a new genome that minimizes the distances to its three neighbors    [NP-hard]
            replace old genome by new if distance is reduced
    until no change
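The size of the search space follows directly from the (2n-5)!! formula. The helper below is a hypothetical illustration that evaluates it for n genomes.

```python
def num_tree_topologies(n):
    """(2n-5)!! = (2n-5)(2n-7)...5*3*1, the number of tree topologies on n leaves."""
    if n < 3:
        raise ValueError("the formula applies for n >= 3")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count
```

For 13 genomes this gives 13,749,310,575 topologies, roughly 14 billion, which is why exhaustive enumeration becomes expensive so quickly.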

Lower Bounding of a Tree [Figure: a tree with leaves a, b, c, d, e, and the same tree redrawn as the paths between consecutive leaves, labeled d(a,b), d(b,c), d(c,d), d(d,e), d(e,a).] Sum over the paths: d(a,b) + d(b,c) + d(c,d) + d(d,e) + d(e,a). (Same trick as in the twice-around-the-tree approximation for the TSP with triangle inequality.)
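The bound can be sketched as follows. Walking around the tree along a circular ordering of its leaves uses every edge exactly twice, so, when the distance obeys the triangle inequality, half the sum of distances between consecutive leaves cannot exceed the tree's score. The function name and the way the distance is passed in are assumptions for illustration.

```python
def circular_lower_bound(order, dist):
    """Half the sum of distances between consecutive leaves in a circular ordering.

    order: leaves in the circular order in which they appear around the tree;
    dist(a, b): pairwise distance, assumed to satisfy the triangle inequality.
    """
    n = len(order)
    total = sum(dist(order[k], order[(k + 1) % n]) for k in range(n))
    return total / 2
```

A tree topology whose bound already exceeds the best score found so far can be discarded without scoring it.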

Parallelization of the Phylogeny Algorithm Enumerating tree topologies is pleasantly parallel and allows multiple processors to independently search the tree space with little or no overhead. Load is evenly balanced when trees are cyclically assigned (e.g., in a round-robin fashion) to the processors. Linear speedup.

High-performance implementations enable: better approximations for difficult problems (MP, ML); true optimization for larger instances; realistic data exploration (e.g., testing evolutionary scenarios, assessing answers obtained through other means, etc.); use of more biologically meaningful models (inversions, transpositions, gene loss/duplication).

Inversion Distance (Hannenhalli-Pevzner Theory) NP-hard for unsigned permutations [Caprara 97]. Polynomial for signed permutations [Hannenhalli & Pevzner 95]. Compute combinatorial terms from the cycle graph: $d = b - c + h + f$ [Bafna & Pevzner 93, Setubal & Meidanis 97], where b = number of breakpoints, c = number of cycles, h = number of hurdles, f = (0/1) is there a fortress? O(n α(n)) time [Berman and Hannenhalli 96], where α(n) is the inverse Ackermann function (practically a constant no greater than 4). New result: O(n) inversion distance [Bader, Moret, Yan 01], a faster and simpler algorithm, both in theory and in practice.
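The cycle term can be illustrated with a short sketch that builds the breakpoint graph of a signed permutation and counts its alternating cycles. It computes only n + 1 minus the number of cycles (counting adjacencies as cycles of their own), which is assumed here to correspond to the b - c part of the formula; hurdles and the fortress are ignored, so the result is a lower bound and not the inversion distance itself. The function names are hypothetical.

```python
def breakpoint_graph_cycles(perm):
    """Count alternating cycles in the breakpoint graph of a signed permutation of 1..n."""
    n = len(perm)
    if sorted(abs(g) for g in perm) != list(range(1, n + 1)):
        raise ValueError("expected a signed permutation of 1..n")
    # Unsigned image: +g -> (2g-1, 2g), -g -> (2g, 2g-1), framed by 0 and 2n+1.
    seq = [0]
    for g in perm:
        if g > 0:
            seq += [2 * g - 1, 2 * g]
        else:
            seq += [-2 * g, -2 * g - 1]
    seq.append(2 * n + 1)
    # Black edges join the vertices at positions 2k and 2k+1 of the image.
    black = {}
    for k in range(0, len(seq), 2):
        a, b = seq[k], seq[k + 1]
        black[a], black[b] = b, a
    seen = set()
    cycles = 0
    for start in range(2 * n + 2):
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = black[v]  # follow the black edge
            seen.add(w)
            v = w ^ 1     # follow the gray edge, which joins 2k and 2k+1
    return cycles


def inversion_lower_bound(perm):
    """n + 1 - cycles: a lower bound on the inversion distance to the identity."""
    return len(perm) + 1 - breakpoint_graph_cycles(perm)
```

The identity permutation gives a bound of 0, and `[1, -2, 3]`, which a single inversion sorts, gives 1.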

Challenges in Phylogeny Exact inversion median-of-three [Siepel02]. Tree enumeration using circular ordering. Handle unequal gene content and duplicate genes (using exemplars?). Parallel branch and bound techniques for searching tree space. Improved SPR and TBR techniques (local searches around good trees).

Additional Challenges Network evolution. Recombination events. Large-scale phylogeny reconstruction. Comparison and accuracy of techniques and heuristics.

Parsimony Codes
Phylip (Felsenstein): http://evolution.genetics.washington.edu/phylip.html
Hennig86 (Farris): http://www.cladistics.org/
Nona (Goloboff) and TNT (Goloboff, Farris, Nixon): http://www.cladistics.com/
PAUP* (Swofford): http://paup.csit.fsu.edu/
MEGA (Kumar, Tamura, Jakobsen, Nei): http://www.megasoftware.net/
GRAPPA (Bader, Moret, Warnow): http://www.phylo.unm.edu/

Likelihood Codes
Phylip (Felsenstein): http://evolution.genetics.washington.edu/phylip.html
PAUP* (Swofford): http://paup.csit.fsu.edu/
PAML (Yang): http://abacus.gene.ucl.ac.uk/software/paml.html
FastDNAml (Olsen, Matsuda, Hagstrom, Overbeek): http://geta.life.uiuc.edu/~gary/programs/fastdnaml.html
Felsenstein's List of Software: http://evolution.genetics.washington.edu/phylip/software.html

GRAPPA: Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms http://www.phylo.unm.edu/ Open-source; already used by other computational phylogeny groups: Caprara, Pevzner, LANL, FBI, Smithsonian Institute, PharmCos. Gene-order phylogeny reconstruction: breakpoint median, inversion median. Over one-billion-fold speedup from previous codes. Parallelism scales linearly with the number of processors. [Bader, Moret, Warnow]

Using GRAPPA to solve Campanulaceae Phylogeny On the 512-processor cluster LosLobos at U. New Mexico, we ran the full analysis (all 14 billion trees) in under 1.5 hours, a 1,000,000-fold speedup (and using true inversion distance). The current release of GRAPPA (v. 1.6) now takes minutes to solve the same problem on several processors.

Campanulaceae [Figure: Campanulaceae, with Tobacco] Bob Jansen, UT-Austin; Linda Raubeson, Central Washington U

Epilogue

Epilogue Uses of Computation in Biology: 1. Discovering information from large data sets (ex: database searches). 2. Relating micro-behavior to macro-behavior (ex: protein folding). 3. Extending experimental capabilities (ex: genome sequencing).

Epilogue Computation will be an integral part of future biological discoveries. Computational biology is an exciting interdisciplinary area that will become increasingly important in the future.

Bookshelf
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998.
D. Graur and W.-H. Li. Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, MA, second edition, 2000.
D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.

Bookshelf
D.M. Hillis, C. Moritz, and B.K. Mable, eds. Molecular Systematics. Sinauer Associates, Sunderland, MA, second edition, 1996.
M. Nei and S. Kumar. Molecular Evolution and Phylogenetics. Oxford University Press, Oxford, UK, 2000.
P.A. Pevzner. Computational Molecular Biology: An Algorithmic Approach. MIT Press, Inc., Cambridge, MA, 2000.

Bookshelf
D. Sankoff and J. Kruskal, eds. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983.
J.C. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS, Boston, MA, 1997.
D. Sankoff and J.H. Nadeau, eds. Comparative Genomics: Empirical and Analytic Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families, volume 1 of Computational Biology. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2000.
M.S. Waterman. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall / CRC, Boca Raton, FL, 1995.

Related & Referenced Publications
M.I. Abouelhoda, S. Kurtz and E. Ohlebusch, The enhanced suffix array and its applications to genome analysis, 2nd Workshop on Algorithms in Bioinformatics, pp. 449-463, 2002.
R. Agarwala and D. Fernández-Baca, A polynomial-time algorithm for the perfect phylogeny problem when the number of character-states is fixed, SIAM J. Comp., 23(6):1216-1224, 1994.
S. Aluru, N. Futamura and K. Mehrotra, Biological sequence comparison using prefix computations, Proc. 13th IEEE Int'l Parallel Processing Symposium, pp. 653-659, 1999.
D.A. Bader, B.M.E. Moret, and L. Vawter, Industrial Applications of High-Performance Computing for Phylogeny Reconstruction, ITCom: Commercial Applications for High-Performance Computing, SPIE Vol. 4528, pp. 159-168, 2001.
D.A. Bader, B.M.E. Moret, and M. Yan, A Linear-Time Algorithm for Computing Inversion Distance Between Two Signed Permutations with an Experimental Study, Journal of Computational Biology, 8(5):483-491, 2001.
D.A. Bader, B.M.E. Moret, and P. Sanders, Algorithm Engineering for Parallel Computation, Experimental Algorithmics, Springer Verlag Lecture Notes in Computer Science, 2547:1-23, 2002.
D.A. Bader, S. Sreshta, and N.R. Weisse-Bernstein, Evaluating arithmetic expressions using tree contraction: A fast and scalable parallel implementation for symmetric multiprocessors (SMPs), Proc. 9th IEEE Int'l Conf. High-Performance Computing, 2002, to appear.

Related & Referenced Publications
V. Bafna and P.A. Pevzner, Genome rearrangements and sorting by reversals, Proc. 34th Ann. IEEE Symp. Foundations of Computer Science, pp. 148-157, 1993.
V. Bafna and P. Pevzner, Sorting permutations by transpositions, Proc. 6th Ann. Symp. Discrete Algorithms, pp. 614-623, 1995.
P. Berman and S. Hannenhalli, Fast sorting by reversal, Proc. 7th Ann. Symp. Combinatorial Pattern Matching, pp. 168-185, 1996.
K. Booth and G. Lueker, Testing for consecutive ones property, interval graphs and graph planarity testing using pq-tree algorithms, J. Comp. Sys. Sci., 13:333-379, 1976.
A. Caprara, Sorting by reversals is difficult, Proc. 1st ACM Conf. Computational Molecular Biology, pp. 75-83, 1997.
D.R. Clark and J.I. Munro, Efficient suffix trees on secondary storage, Proc. ACM-SIAM Symp. on Discrete Algorithms, pp. 383-391, 1996.
E. Edmiston, N. Core, J. Saltz, and R. Smith, Parallel processing of biological sequence comparison algorithms, Int'l Journal of Parallel Programming, 17(3):259-275, 1988.
J. Felsenstein, Evolutionary trees from DNA sequences: A maximum likelihood approach, Journal of Molecular Evolution, 17:368-376, 1981.
P. Ferragina and R. Grossi, Fast incremental text editing, Journal of Algorithms, 31:291-319, 1999. Also ACM-SIAM Symp. on Discrete Algorithms, 1995.
W.M. Fitch, Towards defining the course of evolution: Minimum change for a specific tree topology, Syst. Zool., 20:406-416, 1971.

Related & Referenced Publications
W.M. Fitch and E. Margoliash, Construction of phylogenetic trees, Science, 155:279-284, 1967.
N. Futamura, S. Aluru and X. Huang, Parallel syntenic alignments, Proc. 9th IEEE Int'l Conf. on High Performance Computing, to appear.
N. Futamura, S. Aluru, D. Ranjan and B. Hariharan, Efficient parallel algorithms for solvent accessible surface area of proteins, IEEE Trans. on Parallel and Distributed Systems, 13(6):544-555, 2002.
D. Gusfield, Efficient algorithms for inferring evolutionary trees, Networks, 21:19-28, 1991.
S. Hannenhalli and P.A. Pevzner, Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals), Proc. 27th ACM Ann. Symp. Theory of Computing, pp. 178-189, 1995.
R. Hariharan, Optimal parallel suffix tree construction, Proc. 26th IEEE Symp. Found. Computer Science, pp. 290-299, 1994.
M.D. Hendy and D. Penny, Branch and bound algorithms to determine minimal evolutionary trees, Mathematical Biosciences, 59:277-290, 1982.
X. Huang, A space-efficient parallel sequence comparison algorithm for a message-passing multiprocessor, Int'l Journal of Parallel Programming, 18(3):223-239, 1989.
X. Huang and A. Madan, CAP3: A DNA sequence assembly program, Genome Research, 9(9):868-877, 1999.