Opportunities and Challenges in Computational Biology


1 Opportunities and Challenges in Computational Biology Srinivas Aluru Electrical & Computer Engineering Lawrence H. Baker Center for Bioinformatics & Biological Statistics Iowa State University David A. Bader Electrical & Computer Engineering University of New Mexico

2 Acknowledgments National Science Foundation Nov. 17, 2002 SC2002 Tutorial: Computational Biology 1

3 Opportunities and Challenges in Computational Biology Biology easily has 500 years of exciting problems to work on -Donald E. Knuth

4 Outline 1. Molecular Biology Background 2. Sequence Alignments 3. String Data Structures and Algorithms 4. Genome Assembly 5. Gene Identification & Annotation 6. Microarrays & Gene Expression Analysis 7. Protein Folding 8. Comparative Genomics & Reconstruction of Evolutionary Histories

5 Schedule
Morning
8:30-9:15 (Part I) Biology Background
9:15-10:00 (Part II) Sequence Alignments
10:00-10:30 Break
10:30-11:15 (Part III) String Data Structures and Algorithms
11:15-12:00 (Part IV) Genome Assembly
12:00-1:30 Lunch
Afternoon
1:30-2:15 (Part V) Gene Identification & Annotation
2:15-3:00 (Part VI) Microarrays & Gene Expression Analysis
3:00-3:30 Break
3:30-4:15 (Part VII) Protein Folding
4:15-5:00 (Part VIII) Comparative Genomics & Evolutionary Histories

6 Part I: Molecular Biology Background

7 Biological Data. DNA: self-replicating; codes for proteins. Proteins: perform most functions in living organisms.

8 DNA: Sequence of nucleotides. Nucleotide: deoxyribose sugar + phosphate + base. Nucleotides: A, T, G, and C. [Figure: chemical structure of a nucleotide.]

10 [Figure: two paired DNA strands with phosphate (P) backbones and base pairs A-T, C-G, G-C.]

12 For computational purposes, DNA = a sequence over the alphabet {A,C,G,T}.
5' A T T C G G G A A T G C A T G C C A 3'
3' T A A G C C C T T A C G T A C G G T 5'
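The strand pairing on this slide can be checked with a few lines of Python. This is a minimal sketch; the function names `complement` and `reverse_complement` are illustrative choices and do not come from the tutorial.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the paired strand, read in the same left-to-right order (3' to 5')."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the paired strand read in its own 5' to 3' direction."""
    return complement(strand)[::-1]


top = "ATTCGGGAATGCATGCCA"          # 5' to 3' strand from the slide
print(complement(top))              # bottom strand as drawn on the slide
print(reverse_complement(top))      # same strand, written 5' to 3'
```

The reverse complement is the form usually stored in sequence files, since sequences are conventionally written 5' to 3'.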

13 Genome: Entire genetic constitution of a living organism. Chromosome: Linear strand of DNA. Gene: A contiguous stretch of DNA that codes for a protein.

14 Species: Bacteriophage λ, Escherichia coli (bacterium), Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fruit fly), Homo sapiens (human). Columns: Number of Chromosomes, Genome Size.

15 Proteins: Chains of amino acid residues. There are 20 different amino acids. Functions: tissue building blocks (structural proteins), catalysts (enzymes), oxygen transport, antibody defense.

17 [Figure: peptide backbone with side chains R1, R2, R3 attached to the Cα atoms, showing the dihedral angles Φ and ψ.]

19 Genetic code (mRNA codon to amino acid). Rows: first position. Columns: second position. Within each cell, entries are ordered by third position U, C, A, G.
     U                  C                  A                    G
U    Phe Phe Leu Leu    Ser Ser Ser Ser    Tyr Tyr STOP STOP    Cys Cys STOP Trp
C    Leu Leu Leu Leu    Pro Pro Pro Pro    His His Gln Gln      Arg Arg Arg Arg
A    Ile Ile Ile Met    Thr Thr Thr Thr    Asn Asn Lys Lys      Ser Ser Arg Arg
G    Val Val Val Val    Ala Ala Ala Ala    Asp Asp Glu Glu      Gly Gly Gly Gly

20 Protein Synthesis (DNA → Protein): DNA → (Transcription) → mRNA → (Translation) → Protein
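The two steps can be sketched in Python using the standard genetic code. This is a simplified illustration under stated assumptions: the input is taken to be the coding strand, so transcription reduces to replacing T with U, and translation starts at the first base and stops at the first stop codon. The names `transcribe` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G
# (first position varies slowest). '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_strand):
    """DNA coding strand -> mRNA (assumption: no splicing, T simply becomes U)."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """mRNA -> one-letter amino acid string, reading codons from position 0."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":        # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGGCCTAA")))
```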

22 Summary

23 What Can Be Done Experimentally? DNA sequences of length up to bp can be read (Sanger's method). DNA samples can be amplified (PCR). Protein sequences can be determined. The structure of proteins can be determined using X-ray crystallography (expensive, tedious, time-consuming).

24 Challenges in Computational Biology 1. Find the genomes of all organisms. 2. Identify and annotate genes. 3. Find the sequences, three-dimensional structures, and functions of all proteins. 4. Find sequences of proteins that have desired three-dimensional structures. 5. Compare DNA sequences and protein sequences for similarity. 6. Study the evolution of sequences and species.

25 Part II: Sequence Alignments

26 Pairwise Sequence Alignment Problem: Find similarity between two sequences. Variations: Given two sequences, find if parts of them are similar (local alignment). Given a large sequence and a short sequence, find if the short sequence is similar to a stretch of the long sequence.

27 Pairwise Global Alignment. Alignment: stacking the sequences against each other, with gaps if necessary, to expose similarity. Score: a measure of the quality of an alignment. Example:
C A T - T C A - C
C - T C G C A G C
Score = -2

28 Pairwise Global Alignment. T[i,j] = score of optimally aligning the first i bases of s with the first j bases of t.
\[ T[i,j] = \max \begin{cases} T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\ T[i-1,j] - g \\ T[i,j-1] - g \end{cases} \]
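The recurrence can be turned into a short dynamic programming routine. The sketch below assumes a simple scoring scheme that is not given on the slide: score(a, b) = +1 for a match and -1 for a mismatch, with a linear gap penalty g = 2. The function name `global_alignment_score` is illustrative.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, g=2):
    """Fill the table T of the global alignment recurrence and return T[m][n].

    Assumed scoring (not from the slides): +1 match, -1 mismatch, gap penalty g.
    """
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary: aligning a prefix against the empty string costs g per gap.
    for i in range(1, m + 1):
        T[i][0] = -g * i
    for j in range(1, n + 1):
        T[0][j] = -g * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + sub,   # align s[i] with t[j]
                          T[i - 1][j] - g,         # s[i] against a gap
                          T[i][j - 1] - g)         # t[j] against a gap
    return T[m][n]


print(global_alignment_score("ACGT", "AGT"))
```

The table has (m+1)(n+1) entries and each is computed in constant time, which gives the O(mn) run-time quoted later in this part.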

29 [Figure: dynamic programming table for aligning CATTCAC with CTCGCAGC.]

30 Local Alignment
\[ T[i,j] = \max \begin{cases} 0 \\ T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\ T[i-1,j] - g \\ T[i,j-1] - g \end{cases} \]
Initialize the top row and leftmost column to zero. Start with a maximal value in the table and trace back.
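A minimal sketch of the local alignment table follows. It returns only the best score (the traceback step is omitted for brevity) and assumes the same illustrative scoring as before: +1 match, -1 mismatch, gap penalty g = 2. None of these values come from the slides.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, g=2):
    """Return the maximum entry of the local alignment table.

    The top row and leftmost column stay at zero, and no entry drops below
    zero, so an alignment may start and end anywhere in s and t.
    """
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(0,
                          T[i - 1][j - 1] + sub,
                          T[i - 1][j] - g,
                          T[i][j - 1] - g)
            best = max(best, T[i][j])
    return best


print(local_alignment_score("TTACGTT", "GGACGGG"))
```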

31 Affine Gap Penalty Functions. Gap penalty = h + gk, where k = length of a maximal sequence of gaps, h = gap opening penalty, g = gap continuation penalty.

32 Some Results. Most pairwise sequence alignment problems can be solved in O(mn) time. Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]. Two highly similar sequences can be aligned in O(dn) time, where d is a measure of the distance between the sequences [Landau86].

33 Parallel Sequence Alignment. Each antidiagonal can be computed in parallel.
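The reason antidiagonals parallelize is that T[i,j] depends only on cells with a smaller i + j. The sketch below fills the global alignment table one antidiagonal at a time. It runs sequentially; the inner loop over one antidiagonal is the part a parallel implementation would distribute. The scoring values and function names are assumptions made for illustration.

```python
def antidiagonals(m, n):
    """Yield the cells (i, j), 1 <= i <= m, 1 <= j <= n, grouped by i + j."""
    for d in range(2, m + n + 1):
        yield [(i, d - i) for i in range(max(1, d - n), min(m, d - 1) + 1)]


def wavefront_score(s, t, match=1, mismatch=-1, g=2):
    """Global alignment score computed antidiagonal by antidiagonal."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = -g * i
    for j in range(1, n + 1):
        T[0][j] = -g * j
    for cells in antidiagonals(m, n):
        # All cells in this list are mutually independent: each reads only
        # entries from the two previous antidiagonals.
        for i, j in cells:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + sub, T[i - 1][j] - g, T[i][j - 1] - g)
    return T[m][n]


print(list(antidiagonals(2, 3)))
```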

34 Some Known Results. Parallel sequence alignment can be computed in O(mn/p) time (Edmiston88). The optimal space-saving algorithm requires only O((m+n)/p) space, but takes O((m+n)^2/p) time (Huang89). A row-by-row parallelization is possible and is more communication-efficient. Space can be reduced to O(m+n/p) without sacrificing time optimality (Aluru99).

35 Multiple Sequence Alignment
VTISCTGSSSNIGAG NHVKWYQQLPG
VTISCTGTSSNIGS ITVNWYQQLPG
LRLSCSSSGFIFSS YAMYWVRQAPG
LSLTCTVSGTSFDD YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGA VTVAWKADS
ATLVCLISDFYPGA VTVAWKADS
AALGCLVKDYFPEP VTVSWNSG-
VSLTCLVKGFYPSD IAVEWESNG-

36 Induced Pairwise Alignment
S1: S - T I S C T G - S - N I
S2: L - T I C N G S S - N I
S3: L R T I S C S G F S Q N I
Induced pairwise alignment of S1 and S2:
S1: S T I S C T G - S N I
S2: L T I C N G S S N I

37 Sum-of-Pairs Scoring Function
\[ \text{Score of multiple alignment} = \sum_{i<j} \mathrm{score}(S_i, S_j) = \sum_{t=1}^{l} \sum_{i<j} \mathrm{score}(S_{it}, S_{jt}) \]
where score(S_i, S_j) = score of the induced pairwise alignment of S_i and S_j, and l = length of the multiple alignment.
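The two forms of the sum (pair by pair, or column by column) can be compared directly in code. The sketch below assumes an illustrative column scoring that is not from the slides: +1 match, -1 mismatch, -2 for a residue against a gap, and 0 for two gaps, which is what dropping all-gap columns from an induced pairwise alignment amounts to.

```python
from itertools import combinations


def column_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of symbols (assumed scoring, '-' is a gap)."""
    if a == "-" and b == "-":
        return 0                      # column vanishes in the induced alignment
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def induced_pairwise_score(row_a, row_b):
    """Score of the pairwise alignment induced by two rows of the alignment."""
    return sum(column_score(a, b) for a, b in zip(row_a, row_b))


def sum_of_pairs(alignment):
    """Column-by-column form: sum over columns t, then over pairs i < j."""
    return sum(column_score(a, b)
               for column in zip(*alignment)
               for a, b in combinations(column, 2))


rows = ["AC-", "A-G", "ACG"]
print(sum_of_pairs(rows))
print(sum(induced_pairwise_score(a, b) for a, b in combinations(rows, 2)))
```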

38 Multiple Alignment. Run-time of the dynamic programming solution = O(2^k n^k), where n = length of each sequence and k = number of sequences. Space, O(n^k), is prohibitively large! Example: 6 sequences of length × 10^13 calculations!

39 Carrillo-Lipman Heuristic. U = upper bound on the multiple alignment score. If
\[ T[i_1, i_2, \ldots, i_k] + \sum_{j<l} \mathrm{score}\left(S_j[i_j..n],\, S_l[i_l..n]\right) > U \]
then T[i_1, i_2, …, i_k] cannot be on an optimal path.

40 Multiple Alignment to a Phylogenetic Tree. A tree showing the evolutionary relationship between sequences is available. Compute a multiple alignment such that, for each edge (i,j) in the tree, the induced alignment between S_i and S_j = the optimal alignment between S_i and S_j.

41 Multiple Alignment to a Tree. Build the multiple alignment incrementally. To add a new sequence, an edge should connect it in the tree to a sequence already incorporated in the multiple alignment. Insert the new sequence according to its optimal alignment with the other sequence connected by the edge. Adjust other sequences in the multiple alignment. Run-time = time for k pairwise alignments.

42 Searching Biological Databases. BLAST (Basic Local Alignment Search Tool): BLASTN (DNA), BLASTP (Protein), BLASTX (DNA against Protein), PSI-BLAST (Position Specific Iterative BLAST).

43 Multiple Alignment Software: ClustalW, MSA, HMMER, SAM (compbio/sam.html).

44 Open Problems - Sequential 1. Gene-to-gene alignment to identify exons and introns. 2. Full genome comparison. Genomes consist of mobile components known as transposons. Due to transposons and genome rearrangements, full genome comparison is not straightforward.

45 Open Problems - Parallel 1. Parallel alignment of similar sequences. 2. Parallel spliced alignment. DNA to Gene. Gene to Gene. 3. Parallel full-genome comparison. 4. Parallel multiple sequence alignment.

46 Part III: String Data Structures and Algorithms

47 Why Strings? Biological sequences can be viewed as strings of characters over an alphabet. Sequence similarities typically translate to functional similarities.

48 Suffix Tree of MALAYALAM$ [Figure: suffix tree with edges labeled by substrings and leaves labeled by suffix start positions.]

49 Suffix Tree of MALAYALAM$ [Figure: the same tree with each edge label stored as a (start, end) pair of positions in the string, e.g. (2, 2), (5, 10), (9, 10).]

50 Finding a Pattern in a String 1. Build a suffix tree of the string. 2. Starting from the root, traverse a path matching characters of the pattern. 3. If stuck, the pattern is not present in the string. Otherwise, each leaf below gives a position of the pattern in the string.

51 Finding a Pattern in a String. Find ALA. [Figure: the path for ALA traced from the root of the suffix tree of MALAYALAM$.]

52 Finding Common Substrings. Construct a generalized suffix tree for two strings (each suffix of each string is represented). Label each leaf with the suffix number and string label. Each internal node with a leaf from each string in its subtree gives a common substring.

53 Generalized Suffix Tree for WINDOW$ and INDIGO$ [Figure: leaves labeled (string number, suffix position), e.g. (1, 7), (2, 7), (2, 3).]

54 Suffix Array: Reducing Space. String: M A L A Y A L A M $ (positions 1 to 10). Suffixes in sorted order, each with its starting position ($ sorts after the letters):
6  ALAM$
2  ALAYALAM$
8  AM$
4  AYALAM$
7  LAM$
3  LAYALAM$
1  MALAYALAM$
9  M$
5  YALAM$
10 $
The suffix array stores the starting positions in this order; the lcp array stores the length of the longest common prefix of each pair of adjacent suffixes.

55 Suffix Trees vs. Suffix Arrays. Suffix Array = lexicographic order of the leaves of the Suffix Tree. Suffix Tree = Suffix Array + lcp Array.
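Both arrays for MALAYALAM$ can be computed with a naive sketch like the one below. It sorts the suffixes directly, which is simple but far slower than the linear-time constructions cited later in this part. To reproduce the ordering used on the slides, where $ comes after the letters, the sort key substitutes `~` for `$`; that substitution is an implementation convenience assumed here, not something from the tutorial.

```python
def suffix_array(text):
    """Starting indices (0-based) of the suffixes of text in sorted order.

    Naive construction by direct sorting; '$' is treated as the largest
    character by mapping it to '~' in the sort key (assumes text uses
    only letters and '$').
    """
    return sorted(range(len(text)), key=lambda i: text[i:].replace("$", "~"))


def lcp_array(text, sa):
    """lcp[r] = longest common prefix length of suffixes sa[r] and sa[r + 1]."""
    def lcp(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    return [lcp(sa[r], sa[r + 1]) for r in range(len(sa) - 1)]


text = "MALAYALAM$"
sa = suffix_array(text)
print([i + 1 for i in sa])       # 1-based positions, as on the slides
print(lcp_array(text, sa))
```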

56 Pattern Search in Suffix Array. All suffixes that share a common prefix appear in consecutive positions in the array. Pattern P can be located in the string using a binary search on the suffix array. Naïve run-time = O(|P| log n). Improved to O(|P| + log n) [Manber&Myers93], and to O(|P|) [Abouelhoda et al. 02].
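The naive O(|P| log n) search can be sketched as two binary searches that find the block of suffixes starting with P. For simplicity this version uses Python's default character order (in which $ sorts before letters); the search is correct for any order as long as construction and search use the same one. The function names are illustrative.

```python
def suffix_array(text):
    """Naive suffix array: 0-based suffix start indices in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, pattern):
    """Return the 1-based positions where pattern occurs in text."""
    sa = suffix_array(text)
    k = len(pattern)

    # First rank whose length-k prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First rank whose length-k prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in ranks [start, lo) begins with pattern.
    return sorted(i + 1 for i in sa[start:lo])


print(find_occurrences("MALAYALAM$", "ALA"))
```

Each comparison inspects at most |P| characters and there are O(log n) comparisons, which matches the naive bound on the slide.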

57 Other Applications: common substrings of multiple strings; suffix-prefix overlaps; maximal and tandem repeats; shortest unique substrings; maximal unique matches [MUMmer]; approximate matching with bounded errors.

58 Limitations of String Data Structures. Can only be used to extract information in the absence of errors. Problems dealing with errors may be solved by decomposing into components that do not involve errors. Example: If two sequences exhibit similarity, there must be substrings common to them.

59 Some Results 1. A suffix tree can be constructed in O(n) time and O(n) space [Weiner73, McCreight76, Ukkonen92]. 2. Suffix arrays can be constructed without using suffix trees in O(n) time [Pang&Aluru02]. 3. Suffix trees can be built in O(log^4 n) time on the CREW PRAM model [Hariharan94].

60 Open Problems 1. Algorithms independent of alphabet size. 2. Practically efficient parallel algorithms for suffix trees and arrays. 3. What is the best way to store a biological database on a disk? Some work on disk-based data structures: String B-trees [Ferragina & Grossi 95]. Suffix trees on disk [Clark & Munro 96].

61 Software Development Opportunities. Develop a general-purpose tree-based database system for efficiently storing, inserting and deleting, and querying biological sequences. Current approach: store sequences as a flat file. The entire database is searched for each query!

62 Part IV: Genome Assembly

63 Sequencing a Genome. Physical Mapping: Find markers along the genome, to find unique contigs (possibly overlapping) that cover the genome. Fragment Assembly: Sequence each contig by breaking into several short fragments, sequencing the fragments, and assembling them together.

64 Physical Mapping: Sequence Tagged Sites. Sequence Tagged Site (STS) is a probe sequence that attaches to a unique position in the genome (length about bases). The probe can identify the existence of the short sequence in the genome but cannot specify its location.

65 Cutting With Restriction Enzymes. A restriction enzyme is a protein that cuts DNA at a specific pattern (typically a palindrome). Example: EcoRI recognizes GAATTC (complementary strand: CTTAAG).

66 Physical Mapping. Generate a large number of fragments of the genome, called clones. Find which probes attach to which clones. Find the order of the fragments along the genome.

67 Clones and Probes [Figure: clones drawn against probes D, B, G, C, A, E, F.]

68 STS Matrix [Figure: 0/1 matrix with columns for probes A, B, C, D, E, F, G.]

69 STS Hybridization Problem. Given: STS matrix. Find: a permutation of the columns such that the 1s in each row are consecutive. The algorithm runs in linear time, assuming the matrix has no errors (Booth76).
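To make the problem statement concrete, the sketch below searches for such a column permutation by brute force. It tries every ordering, so it is exponential in the number of columns and is not the linear-time algorithm cited on the slide; it only illustrates what a valid answer looks like. The matrices are made-up examples.

```python
from itertools import permutations


def rows_consecutive(matrix, order):
    """True if, with columns arranged as `order`, the 1s of every row are adjacent."""
    for row in matrix:
        ones = [pos for pos, col in enumerate(order) if row[col] == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True


def find_column_order(matrix):
    """Return a column order with the consecutive-ones property, or None.

    Brute force over all permutations; suitable only for tiny examples.
    """
    for order in permutations(range(len(matrix[0]))):
        if rows_consecutive(matrix, order):
            return list(order)
    return None


clones = [[1, 0, 1],      # hypothetical clone containing probes 0 and 2
          [0, 1, 1]]      # hypothetical clone containing probes 1 and 2
print(find_column_order(clones))
```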

70 Errors in STS Data. False positives: Clone is reported to contain an STS, but it does not. False negatives: Clone is reported to not contain an STS, but it does. Chimeras: Two different DNA fragments combine and act as one clone.

71 Mapping Problem in Presence of Errors. In the absence of errors, overlap information is an interval graph. Find a way to discard some information in order to obtain an interval graph. Several ways of modeling the problem are NP-complete!

72 Fragment Assembly. Given: a collection of DNA fragments. Assemble: the fragments into maximal-length contiguous sequences, or contigs, using overlap information.

73 Fragment Assembly

74 Shortest Common Superstring. In the absence of errors, fragment assembly = finding the shortest common superstring of the given fragments. The shortest common superstring problem is NP-hard.

75 Greedy Heuristic. Find two fragments that have a maximum overlap and combine them into one contig. Iterate by treating contigs as fragments. The greedy heuristic results in a 4-approximation algorithm. The approximation factor has been improved to 2.2.
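A small error-free version of the greedy heuristic can be sketched as follows. It assumes exact overlaps and a single strand, so it ignores sequencing errors, orientation, and repeats; the quadratic overlap search is for clarity, not speed. Function names and the example fragments are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    # Drop duplicates and fragments contained in another fragment.
    unique = set(fragments)
    frags = sorted(f for f in unique
                   if not any(f != g and f in g for g in unique))
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0] if frags else ""


print(greedy_superstring(["ATTCG", "TCGGA", "GGAAT"]))
```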

76 Difficulties in Fragment Assembly: fragments contain errors; lack of sufficient coverage; different fragments may combine (chimeras); unknown orientation (which strand did a fragment come from?); repeats in the genome.

77 Approach to Fragment Assembly. For each fragment and partial contig formed, consider both the sequence and its reverse complement. Detect overlaps using dynamic programming to allow for errors.

78 Possible Fragment Overlaps [Figure: relative placements of two fragments F1 and F2.]

79 Approach to Fragment Assembly. Preprocessing: Eliminate pairs of fragments that cannot have significant overlap (quick check). Compute the overlap between promising pairs using dynamic programming. If a fragment is completely contained in another, discard the shorter fragment.

80 Approach to Fragment Assembly. Forming Contigs (Greedy Heuristic): Combine fragments with the strongest evidence of overlap. Treat the resulting partial contig as a single fragment and consider overlapping ends unavailable. Iterate using the next strongest available overlap.

81 Approach to Fragment Assembly. Generating Consensus Sequence: Perform a multiple sequence alignment between parts of fragments overlapping in the same position to obtain better contigs.

82 Fragment Assembly Software 1. CAP3 (ftp://cs.mtu.edu/pub/huang) 2. Phrap 3. TIGR Assembler

83 Genome Sequencing. Complete genomes of over 800 organisms are currently (or soon to be) available. main_genomes.html

84 Part V: Gene Identification and Annotation

85 Sequencing the genome is not an end-goal! Identify genes on the genome. Find the corresponding family of proteins. Find the functions of the proteins and how they are regulated. Study the natural variations in the gene among related species and different strains of the same species. Study variation between healthy and disease-causing genes.

87 Gene Structure [Figure: DNA with promoter, exons, and introns; transcription yields pre-mRNA; RNA splicing yields mRNA with a 5' cap and a poly-A tail.]

88 EST Clustering Provides Clues to Finding Genes [Figure: genomic DNA with exons and introns; the mRNA formed from exon 1, exon 2, and exon 3; ESTs drawn against the mRNA.]

89 How to Obtain EST Data? dbEST: 12,845,578 ESTs as of September 20, 2002. Organism: Human, Mouse, Arabidopsis thaliana, Zea mays, Rice. Number of ESTs: 4,691,979 2,706, , , ,429

90 Goals of EST Clustering. Clustering: Build clusters with each cluster containing ESTs from the same gene. Identification: Identify the gene. Annotation: Find and assign a function to the corresponding protein.

91 Alternative Splicing [Figure: Gene 1 and Gene 2, each giving rise to mRNA 1 and mRNA 2; legend: exon, intron, optional exon.]

92 Difficulties in EST Clustering: Lack of Coverage [Figure: mRNA and its ESTs.]

93 Difficulties in EST Clustering: Duplicated Genes [Figure: mRNA from a gene and mRNA from a duplicated gene; their ESTs show a high degree of similarity.]

94 Approaches to EST Clustering. Use pairwise comparisons between ESTs to put ESTs into clusters. 1. Exhaustive approach: Compare all pairs of ESTs. 2. Use fragment assembly software.

95 Fragment Assembly Is Not Suited to EST Clustering. Lack of sufficient coverage. ESTs come from different individuals and different strains of the same species. Genomic and protein databases provide additional clues to EST clustering.

96 Fragment Assembly Is Not Suited to EST Clustering. Number of EST fragments is too large. ESTs are obtained in batches. Fragment assembly software is not incremental.

97 Evaluation of Current Software. Single node of IBM xSeries cluster. n=100,001 n=144,870 TIGR PHRAP CAP3 TIGR PHRAP CAP min 91 min 150 min X 154 min 241 min X GB MB GB GB GB

98 NIH Unigene project. Perform database search for each EST. Results are accrued incrementally using weekly builds on 80-processor Intel farm. Quality overrides computational issues.

99 Space and Time Efficient EST Clustering. Initially, treat each EST as a cluster by itself. If two ESTs from two different clusters show significant overlap, merge the clusters. Use union-find data structure.
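The merging step maps directly onto a union-find (disjoint-set) structure. The sketch below uses path halving and union by size. ESTs are represented by integer indices, and the list of overlapping pairs is supplied by the caller, standing in for whatever overlap test is used; both are simplifying assumptions.

```python
class UnionFind:
    """Disjoint sets over 0..n-1 with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the clusters of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def cluster_ests(n, overlapping_pairs):
    """Start with n singleton clusters and merge along each overlapping pair."""
    uf = UnionFind(n)
    for a, b in overlapping_pairs:
        uf.union(a, b)
    clusters = {}
    for i in range(n):
        clusters.setdefault(uf.find(i), []).append(i)
    return sorted(clusters.values())


print(cluster_ests(6, [(0, 1), (1, 2), (4, 5)]))
```

Because `find` reveals whether two ESTs are already in the same cluster, a pair can be skipped before running an expensive alignment on it.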

100 Reporting high-quality promising pairs first is important! [Figure: flowchart; a successful overlap (the pair passes the alignment test) results in a merge.]

101 Generating Promising Pairs. Quality of overlap = length of a maximal common substring. Promising pairs are pairs that have a maximal common substring of length at least ψ. Produce promising pairs on demand, in decreasing order of quality.

102 Pair Generation Algorithm. Build the Generalized Suffix Tree (GST) of the ESTs. Process the nodes in the GST in decreasing order of string-depth and generate pairs at each node. Generate a pair at a node only if the corresponding overlap is maximal.

103 Main Idea of the Algorithm [Figure: maximal common substring α = xβ of s1 and s2 in the generalized suffix tree, with leaves (s1, i), (s2, j), (s1, i+1), (s2, j+1).]

104 Parallel EST Software [Figure: Construction/Preprocessing Phase; Parallel Clustering Phase.]

105 Run-time vs. Number of processors [Plot: run-time in seconds versus number of processors, one curve each for n=10,000, n=20,000, n=40,000, n=80,000, and n=144,870.]

106 Number of Pairs vs. Number of ESTs [Plot: number of pairs (in thousands) versus number of ESTs (10,000 to 144,870), split into aligned and accepted, aligned and rejected, and unaligned.]

107 Open Problems. Develop software that can cluster the human EST collection (~4.7 million currently). Improve the quality of clustering: detect alternative splicing; consult genomic & protein databases. Develop a comprehensive software system for gene identification combining EST clustering, ab initio gene prediction, and genome comparison.

108 Part VI: Microarrays and Gene Expression Analysis

109 Gene Expression Studies. How does gene expression level differ in various cell types and states? How is gene expression changed by diseases? What are the functional roles of different genes? How are genes regulated?

110 Microarray. A glass slide on which single-stranded DNA molecules are attached at fixed spots. Each molecule corresponds to a gene (Ex: EST). When a solution containing single-stranded molecules is washed over, binding based on complementarity takes place. A single microarray can contain tens of thousands of spots.


112 Comparing mRNA abundance. mRNA from sample and control are labeled with different fluorescent dyes. Both solutions are washed over the microarray. Relative abundance of different mRNA can be judged by color/intensity difference.

114 Credit: A. Michael Cambell

115 The Full Yeast Genome on a Chip. Statistics: 6116 yeast genes; 96 intergenic regions; lots of control samples. Total spots printed: 707,520. Total arrays: 110. Actual time to print: 52 hours. Actual speed: spots/min. Total cycles: 1608. Total water usage: 23 liters. Tip spacing: 221uM. Taps per tip: 176,880. Completed: 25 April 1997. Patrick O. Brown Lab, Stanford.

116 Microarray Databases. Repositories containing information obtained by microarray experiments.

117 Microarray analysis. Lists of software packages. Hierarchical Clustering. Self-Organizing Maps.

118 Gene Expression Matrix. A way to capture microarray data. Rows correspond to genes. Columns represent samples (different developmental stages, conditions and tissues).

119 Using Gene Expression Matrices. Compare gene expression profiles. Find co-regulated genes. Compare expression profiles of samples. Find differentially expressed genes.

120 Finding Co-regulated Genes. Each gene can be represented as a point in n-dimensional space. Use clustering algorithms to find co-regulated genes.
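As one possible illustration, the sketch below treats each row of a gene expression matrix as a point and groups genes by single-linkage agglomerative clustering with Euclidean distance. The choice of linkage and distance, the stopping rule (a fixed number of clusters k), and the expression values are all assumptions made for the example; the slides do not prescribe them.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(points, k):
    """Merge the two closest clusters until only k clusters remain.

    Cluster distance is the smallest distance between any two members
    (single linkage). Returns clusters as sorted lists of gene indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical expression levels of four genes across three samples.
profiles = [(1.0, 2.0, 3.0), (1.1, 2.1, 2.9), (9.0, 8.0, 7.0), (9.2, 7.9, 7.1)]
print(single_linkage(profiles, 2))
```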

121 Example - Hierarchical Clustering

122 Summary. Microarrays are a relatively new technology, allowing simultaneous collection of vast experimental data. Data mining and AI techniques are used to discover information from microarray data. Innovative uses of microarrays are still being discovered.

123 Part VII: Protein Folding

124 Primary Structure A A S X D X S L V E V H X X V F I V P P X I L Q A V V S I A 31 T T R X D D X D S A A A S I P M V P G W V L K Q V X G S Q A 61 G S F L A I V M G G G D L E V I L I X L A G Y Q E S S I X A 91 S R S L A A S M X T T A I P S D L W G N X A X S N A A F S S 121 X E F S S X A G S V P L G F T F X E A G A K E X V I K G Q I 151 T X Q A X A F S L A X L X K L I S A M X N A X F P A G D X X 181 X X V A D I X D S H G I L X X V N Y T D A X I K M G I I F G 211 S G V N A A Y W C D S T X I A D A A D A G X X G G A G X M X 241 V C C X Q D S F R K A F P S L P Q I X Y X X T L N X X S P X 271 A X K T F E K N S X A K N X G Q S L R D V L M X Y K X X G Q 301 X H X X X A X D F X A A N V E N S S Y P A K I Q K L P H F D 331 L R X X X D L F X G D Q G I A X K T X M K X V V R R X L F L 361 I A A Y A F R L V V C X I X A I C Q K K G Y S S G H I A A X 391 G S X R D Y S G F S X N S A T X N X N I Y G W P Q S A X X S 421 K P I X I T P A I D G E G A A X X V I X S I A S S Q X X X A 451 X X S A X X A Nov. 17, 2002 SC2002 Tutorial: Computational Biology 123

125 Secondary Structure - α helix

126 Secondary Structure - β sheet

127 Tertiary Structure

128 Quaternary Structure

129 Problems in Protein Folding 1. Folding Problem: Given the sequence of a protein, computationally determine its structure. 2. Inverse Folding Problem: Given the structure into which a protein should fold, find a possible amino acid sequence of the protein.

130 Why Should Sequence → Structure Determination Be Possible? Proteins with sequence similarity tend to have structural similarity. If a protein is deformed under external force, it quickly folds back into its unique shape after the force is removed.

131 IBM Blue Gene Project. $100 M, 100,000 processor petaflop supercomputer for protein folding. Expected to simulate one protein in a year. Blue Gene/L: 65,536 processors, 32 x 32 x 64 torus (by year 2004).

132 Approach I: Molecular Dynamics. Idea: forces acting on the atoms in a protein and constraints are known; perform a simulation. Problem: the time step required is too small (10^-18 sec). Best reported simulation: 10^-6 sec. Folding requires a few seconds.

133 Approach II: Lattice Models. Proteins are represented as self-avoiding walks on lattices (cubic, hexagonal, etc.). Each amino acid residue is modeled as hydrophobic (H) or hydrophilic (P). Position the residues subject to the linear constraint, maximizing H-H contacts.
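The objective can be made concrete with a small scoring function. The sketch below assumes a 2D square lattice and reads the linear constraint as chain connectivity (consecutive residues occupy adjacent lattice sites); both are interpretations chosen for illustration. It counts H-H contacts between residues that are lattice neighbors but not neighbors along the chain.

```python
def hh_contacts(sequence, coords):
    """Count H-H contacts of an HP chain placed on a 2D square lattice.

    sequence: string over {'H', 'P'}; coords: one (x, y) site per residue.
    Raises ValueError if the placement is not a valid self-avoiding walk.
    """
    if len(sequence) != len(coords):
        raise ValueError("one lattice site is required per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("walk is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")

    site_to_index = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for (x, y), i in site_to_index.items():
        if sequence[i] != "H":
            continue
        # Look only right and up so that each contact is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = site_to_index.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hh_contacts("HPPH", square))
```

A folding search would enumerate or sample self-avoiding walks and keep the one with the largest contact count.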

134 Approach II Lattice Models Problem is NP-complete. Approximation algorithms have been designed. Nov. 17, 2002 SC2002 Tutorial: Computational Biology 133

135 Hypothesis: Approach III Energy Minimization Different amino acids have different chemical, electrical and size properties. Different folds of a protein have different levels of energy. A protein folds into its minimum energy configuration. Nov. 17, 2002 SC2002 Tutorial: Computational Biology 134

136 Approach III Energy Minimization Start with a protein configuration. Compute the energy of the configuration. Incrementally fold the protein to reduce its energy. Iterate until convergence. Many known energy minimization methods. can be applied (steepest descent, simulated. annealing etc.). Nov. 17, 2002 SC2002 Tutorial: Computational Biology 135

137 Approach IV Protein Threading Find proteins with known structure that exhibit similarity to the protein to be folded. Use structures of highly similar components to determine a possible structure for the new protein. Use this structure as the basis for more computational folding operations. Nov. 17, 2002 SC2002 Tutorial: Computational Biology 136

139 Other Problems: Structure Similarity. Given: The three-dimensional structures of two proteins. Find: The structural similarity between them.

140 Other Problems: Protein Docking. Given: A receptor molecule and a drug molecule. Find: A matching between the receptor surface and the drug molecule surface maximizing the contact area between the surfaces.

141 Other Problems: Accessible Surface Area. Given: The three-dimensional structure of a protein. Find: The cumulative surface area of the atoms of the protein that is accessible to a solvent molecule. Atoms and the solvent molecule are modeled as spheres using van der Waals radii.

142 Part VIII: Comparative Genomics & Reconstructing Evolutionary Histories (Phylogenetic Trees)

143 Comparative Genomics [Figure: Chicken (NCBI accession #NC_) and Human (NCBI accession #NC_)]

144 Eukaryotic Cell

145 Genome sizes and gene densities
Organism | Est. size | Est. # genes | Average gene density
Human | 3000 million bases | ~30,000 | 1 gene per 100,000 bases
M. musculus (mouse) | 3000 million bases | 30,000 | 1 gene per 100,000 bases
Drosophila (fruit fly) | ? million bases | 13,061 | 1 gene per 13,781 bases
Arabidopsis (plant) | 100 million bases | 25,000 | 1 gene per 4,000 bases
C. elegans (roundworm) | 97 million bases | 19,099 | 1 gene per 5,079 bases
S. cerevisiae (yeast) | 12.1 million bases | 6,034 | 1 gene per 2,005 bases
E. coli (bacteria) | ? million bases | 3,237 | 1 gene per 1,443 bases
H. influenzae (bacteria) | 1.8 million bases | 1,740 | 1 gene per 1,034 bases

146 Phylogenetics. Find the genetic connections and relationships between species (or sequences). Hypothesis: All existing organisms are derived from some common ancestor. A new species arises by a splitting of one population into two (or more) populations that do not cross-breed.

147 Phylogenetic Trees. Each species (or sequence) is described by a set of traits (called characters). Leaves of the tree are labeled with input species. Internal nodes are labeled with input or inferred species. Edges represent transitions in the values of certain traits.

148 Types of Phylogenies
Relationships between taxa: species trees, gene trees
Data: morphological, nuclear genome, organelle genome
Tree of Life Web (Maddison/Maddison)

149 Example Phylogenies
Campanulaceae (Bluebell Flowers): Wahlenbergia, Merciera, Trachelium, Symphyandra, Campanula, Adenophora, Legousia, Asyneuma, Triodanus, Codonopsis, Cyananthus, Platycodon, Tobacco
Some herpesviruses known to affect humans: HHV6, EBV, HHV7, HVS, EHV2, KHSV, HSV1, VZV, HSV2, PRV, EHV1, HCMV
Leeches

150 Techniques
Maximum parsimony: Occam's razor, the simplest explanation for evolution; minimizes the sum of the number of evolutionary events along the tree branches.
Maximum likelihood: statistical methods that use an evolutionary model, such as the transition/transversion rate ratio for the nuclear genome.

151 Genomic Parsimony: Examples of characters
- Specific nucleotide in a fixed position of a DNA sequence (conserved in all examined species).
- Does the amino acid sequence for a protein contain a specific subsequence?
- Is the expression of a certain protein regulated by another particular protein?

152 Example
Species | Characters 1-6
Aardvark | C A G G T A
Bison | C A G A C A
Chimp | C G G G T A
Dog | T G C A C T
Elephant | T G C G T A

153 Example [Figure: tree over Aardvark, Bison, Chimp, Dog, Elephant with edges labeled by the character positions that change; surviving labels: 5; 4, 5, 6]

154 Perfect Phylogeny for Binary Characters. Given: An n × m 0-1 matrix representing n species and m binary characters. Find: A phylogenetic tree T such that the root of the tree represents an ancestor that has none of the m characters, and each character changes from 0 to 1 exactly once and never changes back.

155 Example [Figure: matrix M over species A, B, C, D, E and the resulting tree with leaves D, B, E, A, C] Runs in O(mn) time [Gusfield91].
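Whether a binary matrix admits a perfect phylogeny can be tested with a standard pairwise condition: for every two characters, the sets of species that have them must be either disjoint or nested. The sketch below applies that condition directly. It is a simple O(nm^2) check written for illustration, not the O(mn) algorithm cited on the slide, and the function name is hypothetical.

```python
def has_perfect_phylogeny(matrix):
    """Return True if the 0-1 matrix (rows = species, columns = characters)
    admits a perfect phylogeny whose root has none of the characters."""
    if not matrix:
        return True
    num_chars = len(matrix[0])
    # For each character, the set of species (row indices) that have it.
    carriers = [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(num_chars)
    ]
    for a in range(num_chars):
        for b in range(a + 1, num_chars):
            common = carriers[a] & carriers[b]
            # Overlapping sets must be nested: one contains the other.
            if common and common != carriers[a] and common != carriers[b]:
                return False
    return True
```

Two characters that overlap without nesting, for example one carried by species {0, 1} and another by {0, 2}, would each have to arise twice or be lost once, so no perfect phylogeny exists for them.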

156 Perfect Phylogeny for Non-Binary Characters. n species, m characters, at most r states. NP-complete. Polynomial time for any fixed r [Agarwala94].

157 Parsimony. Parsimony score of a tree = total number of character changes in the tree: $P(T) = \sum_{(u,v) \in E(T)} \left| \{\, j : u_j \neq v_j \,\} \right|$

158 A Simpler Problem: Known Tree. Given: A phylogenetic tree. Find: The minimum parsimony score and an optimal labeling of the internal nodes. Can be solved in O(nmr) time [Fitch71].
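The known-tree problem can be illustrated with the bottom-up pass of Fitch's method, which handles each character independently. The sketch below assumes a rooted binary tree written as nested pairs with species names at the leaves; this representation and the function names are illustrative assumptions. It returns only the score, not the labeling of the internal nodes.

```python
def fitch_character(tree, leaf_state):
    """Bottom-up Fitch pass for one character.

    tree: a species name (leaf) or a pair (left_subtree, right_subtree).
    leaf_state: dict mapping species name to its state for this character.
    Returns (candidate states at this node, changes needed in this subtree).
    """
    if isinstance(tree, str):
        return {leaf_state[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch_character(left, leaf_state)
    right_states, right_cost = fitch_character(right, leaf_state)
    common = left_states & right_states
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is charged at this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Sum the per-character minimum change counts over all positions."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_character(tree, {name: seq[j] for name, seq in sequences.items()})[1]
        for j in range(length)
    )
```

For the toy tree `(("A", "B"), ("C", "D"))` with sequences A = AC, B = AC, C = GC, D = GT, each of the two positions needs one change, so the score is 2.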

159 Parsimony Problem. NP-hard. Techniques used: branch and bound [Hendy & Penny 82]; neighbor-joining [Saitou & Nei 87, Studier & Keppler 88].

160 Exploiting data about gene content and gene order has proved extremely challenging from a computational perspective: tasks that can easily be carried out in linear time for DNA data have required entirely new theories (such as the computation of inversion distance) or appear to be NP-hard. The focus has thus been on simple genomes, preferably genomes consisting of a single chromosome, where evolution can reasonably be assumed to have been driven mostly through gene order changes.

161 Cell Organelles. Chloroplasts and mitochondria have such genomes: around 120 genes for the chloroplasts of higher plants and typically 37 genes for the mitochondria of multicellular animals, in both cases packed onto a single chromosome. The gene content of these genomes is fairly constant across a wide phylogenetic range; differences are mostly in the ordering of the genes. [Figures: chloroplast, mitochondria]

162 Gene Order Phylogeny. Many organelles appear to evolve mostly through processes that simply rearrange gene ordering (inversion, transposition) and perhaps alter gene content (duplication, loss). Chloroplasts have a single, typically circular, chromosome and appear to evolve mostly through inversion:
i-1, i, ..., j, j+1  becomes  i-1, -j, ..., -i, j+1
The sequence of genes i, i+1, ..., j is inverted and every gene is flipped.
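An inversion on a signed gene order can be written in a few lines. The sketch below represents a genome as a Python list of signed integers and uses 0-based, inclusive positions; both conventions and the function name `invert` are assumptions made for illustration.

```python
def invert(genome, i, j):
    """Return a copy of genome with positions i..j (0-based, inclusive)
    reversed in order and flipped in sign."""
    if not 0 <= i <= j < len(genome):
        raise ValueError("need 0 <= i <= j < len(genome)")
    # Reverse the segment and flip the orientation of every gene in it.
    segment = [-gene for gene in reversed(genome[i:j + 1])]
    return genome[:i] + segment + genome[j + 1:]
```

Applying `invert([1, 2, 3, 4, 5], 1, 3)` gives `[1, -4, -3, -2, 5]`, and applying the same inversion a second time restores the original order.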

163 Phylogeny Heuristic Search [BPanalysis, Sankoff & Blanchette 98]
For each tree topology do    [(2n-5)!! = (2n-5)(2n-7)...5·3 trees]
    assign initial genomes to the internal nodes    [somehow]
    repeat    [unknown iterative heuristic]
        for each internal node do
            compute a new genome that minimizes the distances to its three neighbors    [NP-hard]
            replace old genome by new if distance is reduced
    until no change
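The size of the search space, (2n-5)!!, grows very quickly with the number of genomes n. A short sketch of the count follows; the function name is illustrative.

```python
def num_tree_topologies(n):
    """Number of unrooted binary tree topologies on n labeled leaves:
    (2n-5)!! = (2n-5)(2n-7)...5*3, with the empty product equal to 1 for n = 3."""
    if n < 3:
        raise ValueError("the formula applies for n >= 3")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5.
    for factor in range(3, 2 * n - 4, 2):
        count *= factor
    return count
```

For 13 genomes the count is 13,749,310,575, which is consistent with the roughly 14 billion trees quoted for the Campanulaceae analysis later in the tutorial.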

164 Lower Bounding of a Tree. [Figure: a tree with leaves a, b, c, d, e, and its "tree version" as the paths d(a,b), d(b,c), d(c,d), d(d,e), d(e,a)] Sum around the leaves: d(a,b) + d(b,c) + d(c,d) + d(d,e) + d(e,a). These paths cover every tree edge exactly twice, so by the triangle inequality half of this sum is a lower bound on the tree score. (Same trick as in the twice-around-the-tree approximation for the TSP with triangle inequality.)
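The bound can be computed from pairwise distances alone, before any internal genome is assigned. The sketch below assumes the leaves are given in the circular order in which they appear around the tree and that `dist` is a metric supplied by the caller; the function name and the dictionary-based distance in the usage note are hypothetical.

```python
def circular_lower_bound(leaf_order, dist):
    """Half the sum of distances between consecutive leaves in circular order.

    leaf_order: leaves in the order they appear around the tree.
    dist: function (leaf, leaf) -> distance, assumed to satisfy the
          triangle inequality.
    """
    n = len(leaf_order)
    around = sum(dist(leaf_order[k], leaf_order[(k + 1) % n]) for k in range(n))
    # Every tree edge lies on exactly two of these paths, hence the factor 1/2.
    return around / 2
```

A tree whose lower bound already exceeds the best score found so far can be discarded without running the iterative improvement on it.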

165 Parallelization of the Phylogeny Algorithm. Enumerating tree topologies is pleasantly parallel and allows multiple processors to independently search the tree space with little or no overhead. Load is evenly balanced when trees are cyclically assigned (e.g., in a round-robin fashion) to the processors. Linear speedup.
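Cyclic assignment needs no communication: each processor can work out its own share of the trees from its rank alone. A minimal sketch, assuming trees are identified by an index from 0 to `num_trees - 1` (the indexing scheme and function name are assumptions, not GRAPPA's actual interface):

```python
def trees_for_processor(rank, num_procs, num_trees):
    """Indices of the trees handled by processor `rank` under round-robin
    assignment: rank, rank + num_procs, rank + 2 * num_procs, ..."""
    if not 0 <= rank < num_procs:
        raise ValueError("rank must be in range(num_procs)")
    return range(rank, num_trees, num_procs)
```

With 4 processors and 10 trees, processor 1 handles trees 1, 5 and 9, and no processor receives more than one tree more than any other.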

166 High-performance implementations enable:
- better approximations for difficult problems (MP, ML)
- true optimization for larger instances
- realistic data exploration (e.g., testing evolutionary scenarios, assessing answers obtained through other means, etc.)
- use of more biologically meaningful models (inversions, transpositions, gene loss/duplication)

167 Inversion Distance (Hannenhalli-Pevzner Theory)
- NP-hard for unsigned permutations [Caprara 97]
- Polynomial for signed permutations [Hannenhalli & Pevzner 95]
- Compute combinatorial terms from the cycle graph: d = b - c + h + f [Bafna & Pevzner 93, Setubal & Meidanis 97], where b = number of breakpoints, c = number of cycles, h = number of hurdles, and f = 1 if there is a fortress, 0 otherwise
- O(n α(n)) time [Berman and Hannenhalli 96], where α(n) is the inverse Ackermann function (practically a constant no greater than 4)
- New result: O(n) inversion distance [Bader, Moret, Yan 01]; a faster and simpler algorithm, both in theory and in practice
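Of the four terms in the formula, the breakpoint count b is the simplest to compute. The sketch below counts breakpoints of a signed permutation relative to the identity, framing the permutation with 0 and n+1 as is commonly done; the cycle, hurdle and fortress terms are not computed, and the function name is illustrative.

```python
def count_breakpoints(perm):
    """Number of breakpoints of a signed permutation of 1..n relative to
    the identity. The permutation is framed as 0, perm, n+1, and an
    adjacent pair (a, b) is a breakpoint unless b - a == 1."""
    n = len(perm)
    framed = [0] + list(perm) + [n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)
```

The permutation `[1, -3, -2, 4]` has two breakpoints, between 1 and -3 and between -2 and 4; a single inversion of the middle segment removes both.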

168 Challenges in Phylogeny
- Exact inversion median-of-three [Siepel02]
- Tree enumeration using circular ordering
- Handle unequal gene content and duplicate genes (using exemplars?)
- Parallel branch and bound techniques for searching tree space
- Improved SPR and TBR techniques (local searches around good trees)

169 Additional Challenges
- Network evolution
- Recombination events
- Large-scale phylogeny reconstruction
- Comparison and accuracy of techniques and heuristics

170 Parsimony Codes
- Phylip (Felsenstein)
- Hennig86 (Farris)
- Nona (Goloboff) and TNT (Goloboff, Farris, Nixon)
- PAUP* (Swofford)
- MEGA (Kumar, Tamura, Jakobsen, Nei)
- GRAPPA (Bader, Moret, Warnow)

171 Likelihood Codes
- Phylip (Felsenstein)
- PAUP* (Swofford)
- PAML (Yang)
- FastDNAml (Olsen, Matsuda, Hagstrom, Overbeek)
Felsenstein's List of Software

172 GRAPPA: Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms
- Open-source; already used by other computational phylogeny groups: Caprara, Pevzner, LANL, FBI, Smithsonian Institute, PharmCos.
- Gene-order phylogeny reconstruction: breakpoint median, inversion median
- Over one-billion-fold speedup from previous codes
- Parallelism scales linearly with the number of processors
[Bader, Moret, Warnow]

173 Using GRAPPA to solve Campanulaceae Phylogeny. On the 512-processor cluster LosLobos at U. New Mexico, we ran the full analysis (all 14 billion trees) in under 1.5 hours: a 1,000,000-fold speedup (and using true inversion distance). The current release of GRAPPA (v. 1.6) now takes minutes to solve the same problem on several processors.

174 Campanulaceae. Bob Jansen, UT-Austin; Linda Raubeson, Central Washington U. [Figure: Campanulaceae phylogeny including Tobacco]

175 Epilogue

176 Epilogue. Uses of Computation in Biology:
1. Discovering information from large data sets (ex: database searches).
2. Relating micro-behavior to macro-behavior (ex: protein folding).
3. Extending experimental capabilities (ex: genome sequencing).

177 Epilogue. Computation will be an integral part of future biological discoveries. Computational biology is an exciting interdisciplinary area that will become increasingly important in the future.

178 Bookshelf
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998.
D. Graur and W.-H. Li. Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, MA, second edition.
D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press.

179 Bookshelf
D.M. Hillis, C. Moritz, and B.K. Mable, eds. Molecular Systematics. Sinauer Associates, Sunderland, MA, second edition, 1996.
M. Nei and S. Kumar. Molecular Evolution and Phylogenetics. Oxford University Press, Oxford, UK, 2000.
P.A. Pevzner. Computational Molecular Biology: An Algorithmic Approach. MIT Press, Inc., Cambridge, MA.

180 Bookshelf
D. Sankoff and J. Kruskal, eds. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983.
J.C. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS, Boston, MA, 1997.
D. Sankoff and J.H. Nadeau, eds. Comparative Genomics: Empirical and Analytic Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families, volume 1 of Computational Biology. Kluwer Academic Publishers, Dordrecht, The Netherlands.
M.S. Waterman. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall / CRC, Boca Raton, FL.

181 Related & Referenced Publications
M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch, The enhanced suffix array and its applications to genome analysis, 2nd Workshop on Algorithms in Bioinformatics.
R. Agarwala and D. Fernández-Baca, A polynomial-time algorithm for the perfect phylogeny problem when the number of character-states is fixed, SIAM J. Comp., 23(6), 1994.
S. Aluru, N. Futamura, and K. Mehrotra, Biological sequence comparison using prefix computations, Proc. 13th IEEE Int'l Parallel Processing Symposium.
D.A. Bader, B.M.E. Moret, and L. Vawter, Industrial Applications of High-Performance Computing for Phylogeny Reconstruction, ITCom: Commercial Applications for High-Performance Computing, SPIE Vol. 4528.
D.A. Bader, B.M.E. Moret, and M. Yan, A Linear-Time Algorithm for Computing Inversion Distance Between Two Signed Permutations with an Experimental Study, Journal of Computational Biology, 8(5), 2001.
D.A. Bader, B.M.E. Moret, and P. Sanders, Algorithm Engineering for Parallel Computation, Experimental Algorithmics, Springer Verlag Lecture Notes in Computer Science, 2547:1-23.
D.A. Bader, S. Sreshta, and N.R. Weisse-Bernstein, Evaluating arithmetic expressions using tree contraction: A fast and scalable parallel implementation for symmetric multiprocessors (SMPs), Proc. 9th IEEE Int'l Conf. High-Performance Computing, 2002, to appear.

182 Related & Referenced Publications
V. Bafna and P.A. Pevzner, Genome rearrangements and sorting by reversals, Proc. 34th Ann. IEEE Symp. Foundations of Computer Science, 1993.
V. Bafna and P. Pevzner, Sorting permutations by transpositions, Proc. 6th Ann. Symp. Discrete Algorithms.
P. Berman and S. Hannenhalli, Fast sorting by reversal, Proc. 7th Ann. Symp. Combinatorial Pattern Matching, 1996.
K. Booth and G. Lueker, Testing for consecutive ones property, interval graphs and graph planarity testing using pq-tree algorithms, J. Comp. Sys. Sci., 13.
A. Caprara, Sorting by reversals is difficult, Proc. 1st ACM Conf. Computational Molecular Biology, 1997.
D.R. Clark and J.I. Munro, Efficient suffix trees on secondary storage, Proc. ACM-SIAM Symp. on Discrete Algorithms.
E. Edmiston, N. Core, J. Saltz, and R. Smith, Parallel processing of biological sequence comparison algorithms, Int'l Journal of Parallel Programming, 17(3).
J. Felsenstein, Evolutionary trees from DNA sequences: A maximum likelihood approach, Journal of Molecular Evolution, 17.
P. Ferragina and R. Grossi, Fast incremental text editing, Journal of Algorithms, 31. Also ACM-SIAM Symp. on Discrete Algorithms.
W.M. Fitch, Towards defining the course of evolution: Minimum change for a specific tree topology, Syst. Zool., 20, 1971.

183 Related & Referenced Publications
W.M. Fitch and E. Margoliash, Construction of phylogenetic trees, Science, 155.
N. Futamura, S. Aluru, and X. Huang, Parallel syntenic alignments, Proc. 9th IEEE Int'l Conf. on High Performance Computing, to appear.
N. Futamura, S. Aluru, D. Ranjan, and B. Hariharan, Efficient parallel algorithms for solvent accessible surface area of proteins, IEEE Trans. on Parallel and Distributed Systems, 13(6).
D. Gusfield, Efficient algorithms for inferring evolutionary trees, Networks, 21:19-28, 1991.
S. Hannenhalli and P.A. Pevzner, Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals), Proc. 27th ACM Ann. Symp. Theory of Computing, 1995.
R. Hariharan, Optimal parallel suffix tree construction, Proc. 26th IEEE Symp. Found. Computer Science.
M.D. Hendy and D. Penny, Branch and bound algorithms to determine minimal evolutionary trees, Mathematical Biosciences, 59, 1982.
X. Huang, A space-efficient parallel sequence comparison algorithm for a message-passing multiprocessor, Int'l Journal of Parallel Programming, 18(3).
X. Huang and A. Madan, CAP3: A DNA sequence assembly program, Genome Research, 9(9).
