Opportunities and Challenges in Computational Biology


1 Opportunities and Challenges in Computational Biology
Srinivas Aluru, Electrical & Computer Engineering, Lawrence H. Baker Center for Bioinformatics & Biological Statistics, Iowa State University
David A. Bader, Electrical & Computer Engineering, University of New Mexico

2 Acknowledgments National Science Foundation Nov. 17, 2002 SC2002 Tutorial: Computational Biology 1

3 Opportunities and Challenges in Computational Biology
"Biology easily has 500 years of exciting problems to work on." (Donald E. Knuth)

4 Outline
1. Molecular Biology Background
2. Sequence Alignments
3. String Data Structures and Algorithms
4. Genome Assembly
5. Gene Identification & Annotation
6. Microarrays & Gene Expression Analysis
7. Protein Folding
8. Comparative Genomics & Reconstruction of Evolutionary Histories

5 Schedule
Morning
8:30-9:15 (Part I) Biology Background
9:15-10:00 (Part II) Sequence Alignments
10:00-10:30 Break
10:30-11:15 (Part III) String Data Structures and Algorithms
11:15-12:00 (Part IV) Genome Assembly
12:00-1:30 Lunch
Afternoon
1:30-2:15 (Part V) Gene Identification & Annotation
2:15-3:00 (Part VI) Microarrays & Gene Expression Analysis
3:00-3:30 Break
3:30-4:15 (Part VII) Protein Folding
4:15-5:00 (Part VIII) Comparative Genomics & Evolutionary Histories

6 Part I: Molecular Biology Background

7 Biological Data
DNA: Self-replicating; codes for proteins.
Proteins: Perform most functions in living organisms.

8 DNA: Sequence of nucleotides
Nucleotide: Deoxyribose sugar + Phosphate + Base
Nucleotides: A, T, G, and C
[Figure: chemical structure of a nucleotide, showing the phosphate group, the deoxyribose sugar with its 5' and 3' carbons, and the base]

10 [Figure: two complementary DNA strands joined by base pairs, each with a sugar-phosphate (P) backbone and a marked 3' end]

12 For computational purposes, DNA = a sequence over the alphabet {A, C, G, T}
5' A T T C G G G A A T G C A T G C C A 3'
3' T A A G C C C T T A C G T A C G G T 5'
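The two strands on this slide determine each other: reading one strand backwards and swapping A with T and C with G gives the other. A minimal Python sketch of that relationship follows; the names `COMPLEMENT` and `reverse_complement` are illustrative and not part of the tutorial.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand):
    """Return the opposite strand, written in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

top = "ATTCGGGAATGCATGCCA"        # upper strand of the slide, 5' -> 3'
bottom = reverse_complement(top)  # lower strand, 5' -> 3'
aligned_bottom = bottom[::-1]     # lower strand read 3' -> 5', as drawn under `top`
```

Because a sequenced fragment may come from either strand, algorithms such as fragment assembly usually have to consider both a sequence and its reverse complement.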

13 Genome: Entire genetic constitution of a living organism.
Chromosome: Linear strand of DNA.
Gene: A contiguous stretch of DNA that codes for a protein.

14 [Table: Species, Number of Chromosomes, Genome Size; the numeric entries are not recoverable from the transcription]
Species listed:
Bacteriophage λ
Escherichia coli (bacterium)
Saccharomyces cerevisiae (yeast)
Caenorhabditis elegans (worm)
Drosophila melanogaster (fruit fly)
Homo sapiens (human)

15 Proteins: Chains of amino acid residues. There are 20 different amino acids.
Functions:
Tissue building blocks (structural proteins)
Catalysts (enzymes)
Oxygen transport
Antibody defense

17 [Figure: polypeptide backbone with side chains R1, R2, R3, showing the dihedral angles Φ and ψ around a Cα atom]

19 The genetic code (rows: first position; columns: second position; within each cell, the third position runs U, C, A, G)

First   Second = U         Second = C         Second = A           Second = G
U       Phe Phe Leu Leu    Ser Ser Ser Ser    Tyr Tyr STOP STOP    Cys Cys STOP Trp
C       Leu Leu Leu Leu    Pro Pro Pro Pro    His His Gln Gln      Arg Arg Arg Arg
A       Ile Ile Ile Met    Thr Thr Thr Thr    Asn Asn Lys Lys      Ser Ser Arg Arg
G       Val Val Val Val    Ala Ala Ala Ala    Asp Asp Glu Glu      Gly Gly Gly Gly

20 Protein Synthesis (DNA → Protein)
DNA → (transcription) → mRNA → (translation) → Protein
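The DNA → mRNA → protein pipeline can be sketched in a few lines of Python. This is a simplified illustration, not a model of the biology: it assumes the input is the coding strand (so transcription reduces to replacing T with U), ignores splicing, and starts translating at the first base. The names `CODON_TABLE`, `transcribe` and `translate` are illustrative, and amino acids are written with one-letter codes (`*` marks STOP).

```python
from itertools import product

# Standard genetic code, codons enumerated in the order UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(coding_strand):
    """DNA coding strand -> mRNA (assumption: no introns, no regulation)."""
    return coding_strand.replace("T", "U")

def translate(mrna):
    """Read the mRNA three bases at a time until a STOP codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

protein = translate(transcribe("ATGTTTGGATAA"))  # Met-Phe-Gly, then STOP
```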

22 Summary

23 What Can Be Done Experimentally?
DNA sequences up to a limited length (in bp) can be read (Sanger's method).
DNA samples can be amplified (PCR).
Protein sequences can be determined.
The structure of proteins can be determined using X-ray crystallography (expensive, tedious, time-consuming).

24 Challenges in Computational Biology
1. Find the genomes of all organisms.
2. Identify and annotate genes.
3. Find the sequences, three-dimensional structures and functions of all proteins.
4. Find sequences of proteins that have desired three-dimensional structures.
5. Compare DNA sequences and protein sequences for similarity.
6. Study the evolution of sequences and species.

25 Part II: Sequence Alignments

26 Pairwise Sequence Alignment
Problem: Find similarity between two sequences.
Variations:
Given two sequences, determine whether parts of them are similar (local alignment).
Given a large sequence and a short sequence, determine whether the short sequence is similar to a stretch of the long sequence.

27 Pairwise Global Alignment
Alignment: Stacking the sequences against each other, with gaps if necessary, to expose similarity.
Score: A measure of the quality of an alignment.
C A T - T C A - C
C - T C G C A G C
Score = -2

28 Pairwise Global Alignment
T[i,j] = score of optimally aligning the first i bases of s with the first j bases of t.
$$T[i,j] = \max \begin{cases} T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\ T[i-1,j] - g \\ T[i,j-1] - g \end{cases}$$
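The recurrence fills an (m+1) × (n+1) table row by row. The Python sketch below is one way to do that; the scoring values (match +1, mismatch -1, gap penalty g = 2) and the function name `global_alignment_score` are assumptions chosen for illustration, since the slide does not fix a scoring scheme.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=2):
    """Score of an optimal global alignment of s and t (O(mn) time and space)."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix against the empty string costs one gap per base.
    for i in range(1, m + 1):
        T[i][0] = -gap * i
    for j in range(1, n + 1):
        T[0][j] = -gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + pair,  # align s[i] with t[j]
                          T[i - 1][j] - gap,       # s[i] against a gap
                          T[i][j - 1] - gap)       # t[j] against a gap
    return T[m][n]
```

Keeping the whole table allows a traceback from T[m][n] to recover the alignment itself; if only the score is needed, two rows suffice.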

29 [Figure: dynamic programming table for aligning C A T T C A C with C T C G C A G C]

30 Local Alignment
$$T[i,j] = \max \begin{cases} 0 \\ T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\ T[i-1,j] - g \\ T[i,j-1] - g \end{cases}$$
Initialize the top row and leftmost column to zero. Start with a maximal value in the table and trace back.
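Compared with global alignment, the only changes are the extra 0 in the maximum, the zero initialization, and taking the best cell anywhere in the table. A short Python sketch under the same assumed scoring (match +1, mismatch -1, gap penalty 2; the function name is illustrative):

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=2):
    """Best score over all pairs of substrings of s and t."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]  # top row and left column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(0,                       # start a fresh alignment here
                          T[i - 1][j - 1] + pair,
                          T[i - 1][j] - gap,
                          T[i][j - 1] - gap)
            best = max(best, T[i][j])
    return best
```

A traceback from the cell holding `best`, stopping at the first 0, would recover the aligned substrings.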

31 Affine Gap Penalty Functions
Gap penalty = h + gk, where
k = length of a maximal sequence of gaps
h = gap opening penalty
g = gap continuation penalty

32 Some Results
Most pairwise sequence alignment problems can be solved in O(mn) time.
Space can be reduced to O(m+n) while keeping the run-time at O(mn) [Myers88].
Two highly similar sequences can be aligned in O(dn) time, where d is a measure of the distance between the sequences [Landau86].

33 Parallel Sequence Alignment
Each antidiagonal can be computed in parallel: T[i,j] depends only on T[i-1,j-1], T[i-1,j] and T[i,j-1], all of which lie on the two preceding antidiagonals.
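The antidiagonal order can be made explicit with a small generator. This is a sequential Python sketch of the schedule only (the helper name `antidiagonals` is illustrative); in a parallel implementation, the cells within each yielded list are the ones that could be assigned to different processors.

```python
def antidiagonals(m, n):
    """Yield, for d = 2 .. m+n, the cells (i, j) with i + j = d,
    where 1 <= i <= m and 1 <= j <= n."""
    for d in range(2, m + n + 1):
        yield [(i, d - i) for i in range(max(1, d - n), min(m, d - 1) + 1)]

# Cells in the same list do not depend on each other.
schedule = list(antidiagonals(2, 2))
```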

34 Some Known Results
Parallel sequence alignment can be computed in O(mn/p) time (Edmiston88).
The optimal space-saving algorithm requires only O((m+n)/p) space, but takes O((m+n)^2/p) time (Huang89).
A row-by-row parallelization is possible and is more communication-efficient.
Space can be reduced to O(m+n/p) without sacrificing time-optimality (Aluru99).

35 Multiple Sequence Alignment
VTISCTGSSSNIGAG NHVKWYQQLPG
VTISCTGTSSNIGS ITVNWYQQLPG
LRLSCSSSGFIFSS YAMYWVRQAPG
LSLTCTVSGTSFDD YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGA VTVAWKADS
ATLVCLISDFYPGA VTVAWKADS
AALGCLVKDYFPEP VTVSWNSG-
VSLTCLVKGFYPSD IAVEWESNG-

36 Induced Pairwise Alignment
S1: S - T I S C T G - S - N I
S2: L - T I C N G S S - N I
S3: L R T I S C S G F S Q N I
Induced pairwise alignment of S1 and S2 (columns in which both sequences have a gap are removed):
S1: S T I S C T G - S N I
S2: L T I C N G S S N I

37 Sum-of-Pairs Scoring Function
$$\text{Score of multiple alignment} = \sum_{i < j} \mathrm{score}(S_i, S_j) = \sum_{i < j} \sum_{t=1}^{l} \mathrm{score}(S_{it}, S_{jt})$$
where score(S_i, S_j) = score of the induced pairwise alignment of S_i and S_j, and l = length of the multiple alignment.
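The double sum translates directly into code: iterate over all pairs of rows and over all columns. The Python sketch below assumes a simple column scoring (match +1, mismatch -1, residue against gap -2) and scores a gap-gap column as 0, which matches dropping such columns from the induced pairwise alignment; the function names are illustrative.

```python
from itertools import combinations

def column_score(a, b, match=1, mismatch=-1, gap=2):
    """Score of one column of an induced pairwise alignment."""
    if a == "-" and b == "-":
        return 0          # column vanishes in the induced alignment
    if a == "-" or b == "-":
        return -gap
    return match if a == b else mismatch

def sum_of_pairs(rows):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(column_score(a, b)
               for x, y in combinations(rows, 2)
               for a, b in zip(x, y))

example = sum_of_pairs(["AC-", "A-G", "ACG"])
```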

38 Multiple Alignment
Run-time of the dynamic programming solution = O(2^k n^k), where
n = length of each sequence
k = number of sequences
Space, O(n^k), is prohibitively large!
Example: 6 sequences of length 100 require 2^6 × 100^6 = 6.4 × 10^13 calculations!

39 Carrillo-Lipman Heuristic
U = upper bound on the multiple alignment score.
If
$$T[i_1, i_2, \ldots, i_k] + \sum_{j < l} \mathrm{score}\left(S_j[i_j, n],\, S_l[i_l, n]\right) > U$$
then T[i_1, i_2, ..., i_k] cannot be on an optimal path.

40 Multiple Alignment to a Phylogenetic Tree
A tree showing the evolutionary relationship between the sequences is available.
Compute a multiple alignment such that, for each edge (i,j) in the tree, the induced alignment between S_i and S_j equals the optimal alignment between S_i and S_j.

41 Multiple Alignment to a Tree
Build the multiple alignment incrementally.
To add a new sequence, an edge should connect it in the tree to a sequence already incorporated in the multiple alignment.
Insert the new sequence according to its optimal alignment with the other sequence connected by the edge.
Adjust other sequences in the multiple alignment.
Run-time = time for k pairwise alignments.

42 Searching Biological Databases
BLAST (Basic Local Alignment Search Tool)
BLASTN (DNA)
BLASTP (Protein)
BLASTX (DNA against Protein)
PSI-BLAST (Position Specific Iterative BLAST)

43 Multiple Alignment Software
ClustalW
MSA
HMMER
SAM (compbio/sam.html)

44 Open Problems - Sequential
1. Gene-to-gene alignment to identify exons and introns.
2. Full genome comparison. Genomes contain mobile components known as transposons. Because of transposons and genome rearrangements, full genome comparison is not straightforward.

45 Open Problems - Parallel
1. Parallel alignment of similar sequences.
2. Parallel spliced alignment: DNA to gene; gene to gene.
3. Parallel full-genome comparison.
4. Parallel multiple sequence alignment.

46 Part III: String Data Structures and Algorithms

47 Why Strings?
Biological sequences can be viewed as strings of characters over an alphabet.
Sequence similarities typically translate to functional similarities.

48 Suffix Tree
[Figure: suffix tree of the string M A L A Y A L A M $, with edges labeled by substrings and leaves labeled by suffix start positions]

49 Suffix Tree
[Figure: the same suffix tree with each edge label stored as a pair of (start, end) positions in M A L A Y A L A M $, e.g. (5, 10) for YALAM$ and (3, 4) for LA]

50 Finding a Pattern in a String
1. Build a suffix tree of the string.
2. Starting from the root, traverse a path matching characters of the pattern.
3. If stuck, the pattern is not present in the string. Otherwise, each leaf below gives a position of the pattern in the string.

51 Finding a Pattern in a String
[Figure: finding the pattern ALA by following its path from the root of the suffix tree of MALAYALAM$]

52 Finding Common Substrings
Construct a generalized suffix tree for two strings (each suffix of each string is represented).
Label each leaf with the suffix number and string label.
Each internal node with a leaf from each string in its subtree gives a common substring.

53 Generalized Suffix Tree
[Figure: generalized suffix tree of WINDOW$ and INDIGO$; each leaf is labeled with a pair (string number, suffix position), from (1, 1) to (1, 7) and (2, 1) to (2, 7)]

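54 Suffix Array: Reducing Space
String: M A L A Y A L A M $ (positions 1 to 10; $ sorts after every letter)

Suffix Array   lcp Array   Suffix
6              3           ALAM$
2              1           ALAYALAM$
8              1           AM$
4              0           AYALAM$
7              2           LAM$
3              0           LAYALAM$
1              1           MALAYALAM$
9              0           M$
5              0           YALAM$
10                         $

Each lcp entry is the length of the longest common prefix of the suffix on that row and the suffix on the next row.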

55 Suffix Trees vs. Suffix Arrays
Suffix Array = Lexicographic order of the leaves of the Suffix Tree
Suffix Tree = Suffix Array + lcp Array

56 Pattern Search in Suffix Array
All suffixes that share a common prefix appear in consecutive positions in the array.
Pattern P can be located in the string using a binary search on the suffix array.
Naïve run-time = O(|P| log n).
Improved to O(|P| + log n) [Manber&Myers93], and to O(|P|) [Abouelhoda et al. 02].
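The naïve O(|P| log n) search can be sketched in Python as two binary searches that bracket the block of suffixes starting with P. The suffix array is built here by plain sorting, which is far slower than the linear-time constructions cited in this tutorial and is used only to keep the example short. Positions are 0-based, the function names are illustrative, and Python's string ordering places `$` before the letters (unlike the slide's order), which does not affect the search.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return the sorted 0-based start positions of pattern in text."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # Leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Leftmost suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "MALAYALAM$"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ALA")
```

Each comparison inspects at most |P| characters and each binary search takes O(log n) steps, which gives the naïve bound quoted on the slide.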

57 Other Applications
Common substrings of multiple strings
Suffix-prefix overlaps
Maximal and tandem repeats
Shortest unique substrings
Maximal unique matches [MUMmer]
Approximate matching with bounded errors

58 Limitations of String Data Structures
Can only be used to extract information in the absence of errors.
Problems dealing with errors may be solved by decomposing into components that do not involve errors.
Example: If two sequences exhibit similarity, there must be substrings in common to them.

59 Some Results
1. Suffix trees can be constructed in O(n) time and O(n) space [Weiner73, McCreight76, Ukkonen92].
2. Suffix arrays can be constructed without using suffix trees in O(n) time [Pang&Aluru02].
3. Suffix trees can be built in O(log^4 n) time on the CREW PRAM model [Hariharan94].

60 Open Problems
1. Algorithms independent of alphabet size.
2. Practically efficient parallel algorithms for suffix trees and arrays.
3. What is the best way to store a biological database on a disk?
Some work on disk-based data structures:
String B-trees [Ferragina & Grossi 95].
Suffix trees on disk [Clark & Munro 96].

61 Software Development Opportunities
Develop a general-purpose tree-based database system for efficiently storing, inserting and deleting, and querying biological sequences.
Current approach: Store sequences as a flat file. The entire database is searched for each query!

62 Part IV: Genome Assembly

63 Sequencing a Genome
Physical Mapping: Find markers along the genome, to find unique contigs (possibly overlapping) that cover the genome.
Fragment Assembly: Sequence each contig by breaking into several short fragments, sequencing the fragments, and assembling them together.

64 Physical Mapping: Sequence Tagged Sites
A Sequence Tagged Site (STS) is a short probe sequence that attaches to a unique position in the genome.
The probe can identify the existence of the short sequence in the genome but cannot specify its location.

65 Cutting With Restriction Enzymes
A restriction enzyme is a protein that cuts DNA at a specific pattern (typically a palindrome, i.e., a sequence that equals its own reverse complement).
Example: EcoRI, which cuts at G A A T T C.
[Figure: double-stranded DNA cut by EcoRI at the site G A A T T C / C T T A A G]

66 Physical Mapping
Generate a large number of fragments of the genome, called clones.
Find which probes attach to which clones.
Find the order of the fragments along the genome.

67 Clones and Probes
[Figure: overlapping clones along the genome and the probes, labeled A through G, that attach to them]

68 STS Matrix
[Figure: 0/1 matrix recording which of the probes A, B, C, D, E, F, G attach to which clones]

69 STS Hybridization Problem
Given: STS matrix.
Find: A permutation of the columns such that the 1s in each row are consecutive.
The algorithm runs in linear time, assuming the matrix has no errors (Booth76).

70 Errors in STS Data
False positives: Clone is reported to contain an STS, but it does not.
False negatives: Clone is reported to not contain an STS, but it does.
Chimeras: Two different DNA fragments combine and act as one clone.

71 Mapping Problem in Presence of Errors
In the absence of errors, the overlap information forms an interval graph.
Find a way to discard some information in order to obtain an interval graph.
Several ways of modeling the problem are NP-complete!

72 Fragment Assembly
Given: A collection of DNA fragments.
Assemble: The fragments into maximal-length contiguous sequences, or contigs, using overlap information.

73 Fragment Assembly

74 Shortest Common Superstring
In the absence of errors, fragment assembly = finding the shortest common superstring of the given fragments.
The shortest common superstring problem is NP-hard.

75 Greedy Heuristic
Find two fragments that have a maximum overlap and combine them into one contig.
Iterate by treating contigs as fragments.
The greedy heuristic results in a 4-approximation algorithm.
The approximation factor has been improved to 2.2.
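The greedy heuristic can be sketched in Python for the error-free case. This version is a quadratic-per-round illustration that compares all pairs with exact suffix-prefix matching; it ignores sequencing errors and fragment orientation, which real assemblers must handle, and the names `overlap` and `greedy_superstring` are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Discard fragments completely contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best = (-1, 0, 1)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]   # contig treated as a new fragment
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0] if frags else ""

contig = greedy_superstring(["ATTCG", "TCGGA", "GGAAT"])
```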

76 Difficulties in Fragment Assembly
Fragments contain errors
Lack of sufficient coverage
Different fragments may combine (chimeras)
Which strand did it come from? (unknown orientation)
Repeats in the genome

77 Approach to Fragment Assembly
For each fragment and partial contig formed, consider both the sequence and its reverse complement.
Detect overlaps using dynamic programming to allow for errors.

78 Possible Fragment Overlaps
[Figure: the possible overlap configurations of two fragments F1 and F2]

79 Approach to Fragment Assembly
Preprocessing:
Eliminate pairs of fragments that cannot have significant overlap (quick check).
Compute overlap between promising pairs using dynamic programming.
If a fragment is completely contained in another, discard the shorter fragment.

80 Approach to Fragment Assembly
Forming Contigs (Greedy Heuristic):
Combine fragments with strongest evidence of overlap.
Treat the resulting partial contig as a single fragment and consider overlapping ends unavailable.
Iterate using next strongest available overlap.

81 Approach to Fragment Assembly
Generating Consensus Sequence:
Perform a multiple sequence alignment between parts of fragments overlapping in the same position to obtain better contigs.

82 Fragment Assembly Software
1. CAP3 (ftp://cs.mtu.edu/pub/huang)
2. Phrap
3. TIGR Assembler

83 Genome Sequencing
Complete genomes of over 800 organisms are currently (or will soon be) available (main_genomes.html).

84 Part V: Gene Identification and Annotation

85 Sequencing the genome is not an end-goal!
Identify genes on the genome.
Find the corresponding family of proteins.
Find the functions of the proteins and how they are regulated.
Study the natural variations in the gene among related species and different strains of the same species.
Study variation between healthy and disease-causing genes.

87 Gene Structure
[Figure: DNA with promoter, exons and introns is transcribed into pre-mRNA; RNA splicing removes the introns, giving mRNA with a 5' cap and a poly-A tail]

88 EST Clustering Provides Clues to Finding Genes
[Figure: genomic DNA with exon 1, intron 1, exon 2, intron 2, exon 3; the mRNA joins exon 1, exon 2 and exon 3; ESTs are fragments of the mRNA]

89 How to Obtain EST Data?
dbEST: 12,845,578 ESTs as of September 20, 2002

Organism                Number of ESTs
Human                   4,691,979
Mouse                   2,706,…
Arabidopsis thaliana    …
Zea mays                …
Rice                    …,429

90 Goals of EST Clustering
Clustering: Build clusters with each cluster containing ESTs from the same gene.
Identification: Identify the gene.
Annotation: Find and assign a function to the corresponding protein.

91 Alternative Splicing
[Figure: two genes, each with exons, introns and an optional exon, and each giving rise to two different mRNAs (mRNA 1 and mRNA 2)]

92 Difficulties in EST Clustering: Lack of Coverage
[Figure: ESTs covering only parts of an mRNA]

93 Difficulties in EST Clustering: Duplicated Genes
[Figure: mRNA from a gene and mRNA from a duplicated gene; their ESTs show a high degree of similarity]

94 Approaches to EST Clustering
Use pairwise comparisons between ESTs to put ESTs into clusters.
1. Exhaustive approach: compare all pairs of ESTs.
2. Use fragment assembly software.

95 Fragment Assembly Is Not Suited to EST Clustering
Lack of sufficient coverage.
ESTs come from different individuals and different strains of the same species.
Genomic and protein databases provide additional clues to EST clustering.

96 Fragment Assembly Is Not Suited to EST Clustering
The number of EST fragments is too large.
ESTs are obtained in batches. Fragment assembly software is not incremental.

97 Evaluation of Current Software
Single node of an IBM xSeries cluster
[Table: run-time and memory usage of TIGR, PHRAP and CAP3 for n=100,001 and n=144,870 ESTs. Recoverable run-time entries: 91 min, 150 min, 154 min and 241 min; two entries are marked X; memory is reported in MB and GB.]

98 NIH UniGene Project
Perform a database search for each EST.
Results are accrued incrementally using weekly builds on an 80-processor Intel farm.
Quality overrides computational issues.

99 Space and Time Efficient EST Clustering
Initially, treat each EST as a cluster by itself.
If two ESTs from two different clusters show significant overlap, merge the clusters.
Use a union-find data structure.
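The clustering loop maps directly onto union-find. In the Python sketch below, ESTs are numbered 0 to n-1 and `overlapping_pairs` stands in for the pairs that pass the alignment test; both conventions, and the names `UnionFind` and `cluster_ests`, are assumptions made for illustration.

```python
class UnionFind:
    """Disjoint sets with union by size and path halving."""

    def __init__(self, n):
        self.parent = list(range(n))   # each EST starts as its own cluster
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False               # already in the same cluster
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return True

def cluster_ests(num_ests, overlapping_pairs):
    """Merge clusters for every pair of ESTs with significant overlap."""
    uf = UnionFind(num_ests)
    for a, b in overlapping_pairs:
        uf.union(a, b)
    clusters = {}
    for est in range(num_ests):
        clusters.setdefault(uf.find(est), []).append(est)
    return sorted(clusters.values())

clusters = cluster_ests(5, [(0, 1), (1, 2), (3, 4)])
```

A practical benefit of this structure is that a pair whose ESTs already share a cluster (`find` returns the same root) does not need to be aligned at all.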

100 Reporting high-quality promising pairs first is important!
[Figure: a promising pair that passes the alignment test is a successful overlap and results in a merge]

101 Generating Promising Pairs
Quality of overlap = length of a maximal common substring.
Promising pairs are pairs that have a maximal common substring of length at least ψ.
Produce promising pairs on demand, in decreasing order of quality.

102 Pair Generation Algorithm
Build the Generalized Suffix Tree (GST) of the ESTs.
Process the nodes of the GST in decreasing order of string-depth and generate pairs at each node.
Generate a pair at a node only if the corresponding overlap is maximal.

103 Main Idea of the Algorithm
[Figure: a maximal common substring α = xβ of ESTs s1 and s2, starting at positions i and j; the GST node v with path label α has suffixes (s1, i) and (s2, j) below it, and the node for β has (s1, i+1) and (s2, j+1)]

104 Parallel EST Software
[Figure: construction/preprocessing phase followed by a parallel clustering phase]

105 Run-time vs. Number of Processors
[Figure: run-time in seconds (up to 7,000) against number of processors, for n=10,000, n=20,000, n=40,000, n=80,000 and n=144,870]

106 Number of Pairs vs. Number of ESTs
[Figure: number of pairs in thousands (up to 5,000) against number of ESTs (10,000 to 144,870), split into aligned and accepted, aligned and rejected, and unaligned]

107 Open Problems
Develop software that can cluster the human EST collection (~4.7 million currently).
Improve quality of clustering: detect alternative splicing; consult genomic & protein databases.
Develop a comprehensive software system for gene identification combining EST clustering, ab initio gene prediction and genome comparison.

108 Part VI: Microarrays and Gene Expression Analysis

109 Gene Expression Studies
How does gene expression level differ in various cell types and states?
How is gene expression changed by diseases?
What are the functional roles of different genes?
How are genes regulated?

110 Microarray
A glass slide on which single-stranded DNA molecules are attached at fixed spots.
Each molecule corresponds to a gene (e.g., an EST).
When a solution containing single-stranded molecules is washed over the slide, binding based on complementarity takes place.
A single microarray can contain tens of thousands of spots.

112 Comparing mRNA Abundance
mRNA from sample and control are labeled with different fluorescent dyes.
Both solutions are washed over the microarray.
Relative abundance of different mRNA can be judged by color/intensity difference.

114 [Figure] Credit: A. Michael Cambell

115 The Full Yeast Genome on a Chip
Statistics:
6116 yeast genes
96 intergenic regions + lots of control samples
Total spots printed: 707,520
Total arrays: 110
Actual time to print: 52 hours
Total cycles: 1608
Total water usage: 23 liters
Tip spacing: 221 µm
Taps per tip: 176,880
Completed: 25 April 1997
Patrick O. Brown Lab, Stanford

116 Microarray Databases
Repositories containing information obtained by microarray experiments.

117 Microarray Analysis
Lists of software packages
Hierarchical Clustering
Self-Organizing Maps

118 Gene Expression Matrix
A way to capture microarray data.
Rows correspond to genes.
Columns represent samples (different developmental stages, conditions and tissues).

119 Using Gene Expression Matrices
Compare gene expression profiles. Find co-regulated genes.
Compare expression profiles of samples. Find differentially expressed genes.

120 Finding Co-regulated Genes
Each gene can be represented as a point in n-dimensional space.
Use clustering algorithms to find co-regulated genes.

121 Example - Hierarchical Clustering

122 Summary
Microarrays are a relatively new technology, allowing simultaneous collection of vast experimental data.
Data mining and AI techniques are used to discover information from microarray data.
Innovative uses of microarrays are still being discovered.

123 Part VII: Protein Folding

124 Primary Structure
  1 A A S X D X S L V E V H X X V F I V P P X I L Q A V V S I A
 31 T T R X D D X D S A A A S I P M V P G W V L K Q V X G S Q A
 61 G S F L A I V M G G G D L E V I L I X L A G Y Q E S S I X A
 91 S R S L A A S M X T T A I P S D L W G N X A X S N A A F S S
121 X E F S S X A G S V P L G F T F X E A G A K E X V I K G Q I
151 T X Q A X A F S L A X L X K L I S A M X N A X F P A G D X X
181 X X V A D I X D S H G I L X X V N Y T D A X I K M G I I F G
211 S G V N A A Y W C D S T X I A D A A D A G X X G G A G X M X
241 V C C X Q D S F R K A F P S L P Q I X Y X X T L N X X S P X
271 A X K T F E K N S X A K N X G Q S L R D V L M X Y K X X G Q
301 X H X X X A X D F X A A N V E N S S Y P A K I Q K L P H F D
331 L R X X X D L F X G D Q G I A X K T X M K X V V R R X L F L
361 I A A Y A F R L V V C X I X A I C Q K K G Y S S G H I A A X
391 G S X R D Y S G F S X N S A T X N X N I Y G W P Q S A X X S
421 K P I X I T P A I D G E G A A X X V I X S I A S S Q X X X A
451 X X S A X X A

125 Secondary Structure - α helix

126 Secondary Structure - β sheet

127 Tertiary Structure

128 Quaternary Structure

129 Problems in Protein Folding
1. Folding Problem: Given the sequence of a protein, computationally determine its structure.
2. Inverse Folding Problem: Given the structure into which a protein should fold, find a possible amino acid sequence of the protein.

130 Why Should Sequence → Structure Determination Be Possible?
Proteins with sequence similarity tend to have structural similarity.
If a protein is deformed under external force, it quickly folds back into its unique shape after the force is removed.

131 IBM Blue Gene Project
$100 M, 100,000-processor petaflop supercomputer for protein folding.
Expected to simulate one protein in a year.
Blue Gene/L: 65,536 processors, 32 × 32 × 64 torus (by year 2004).

132 Approach I: Molecular Dynamics
Idea: Forces acting on the atoms in a protein and constraints are known. Perform simulation.
Problem: The time step required is too small (10^-18 sec). Best reported simulation: 10^-6 sec. Folding requires a few seconds.

133 Approach II: Lattice Models
Proteins are represented as self-avoiding walks on lattices (cubic, hexagonal etc.).
Each amino acid residue is modeled as hydrophobic (H) or hydrophilic (P).
Position the residues subject to:
Linear constraint
Maximizing H-H contacts
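The quantity being maximized can be computed for any candidate fold. The Python sketch below uses a 2D square lattice for brevity (an assumption; the slide mentions cubic and hexagonal lattices), checks that the fold is a self-avoiding walk with consecutive residues on adjacent lattice points, and counts H-H contacts between residues that are not neighbors in the chain. The function name `hh_contacts` is illustrative.

```python
def hh_contacts(sequence, fold):
    """Count H-H contacts of an HP sequence placed on a 2D square lattice.

    sequence: string over {'H', 'P'}; fold: list of (x, y) lattice points,
    one per residue, in chain order.
    """
    if len(sequence) != len(fold):
        raise ValueError("one lattice point per residue is required")
    if len(set(fold)) != len(fold):
        raise ValueError("fold is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(fold, fold[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")

    position = {point: k for k, point in enumerate(fold)}
    contacts = 0
    for k, (x, y) in enumerate(fold):
        if sequence[k] != "H":
            continue
        # Looking only right and up counts every adjacent pair exactly once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            other = position.get(neighbor)
            if other is not None and sequence[other] == "H" and abs(other - k) > 1:
                contacts += 1
    return contacts

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
score = hh_contacts("HPPH", square)
```

Finding the fold that maximizes this count is the hard part; this function only evaluates a given fold.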

134 Approach II: Lattice Models
Problem is NP-complete.
Approximation algorithms have been designed.

135 Approach III: Energy Minimization
Hypothesis:
Different amino acids have different chemical, electrical and size properties.
Different folds of a protein have different levels of energy.
A protein folds into its minimum energy configuration.

136 Approach III: Energy Minimization
Start with a protein configuration.
Compute the energy of the configuration.
Incrementally fold the protein to reduce its energy.
Iterate until convergence.
Many known energy minimization methods can be applied (steepest descent, simulated annealing etc.).

137 Approach IV: Protein Threading
Find proteins with known structure that exhibit similarity to the protein to be folded.
Use structures of highly similar components to determine a possible structure for the new protein.
Use this structure as the basis for more computational folding operations.

139 Other Problems Structure Similarity Given: The three dimensional structure of two proteins Find: the structural similarity between them. Nov. 17, 2002 SC2002 Tutorial: Computational Biology 138

140 Other Problems: Protein Docking. Given: A receptor molecule and a drug molecule. Find: A matching between the receptor surface and the drug molecule surface that maximizes the contact area between the surfaces.

141 Other Problems: Accessible Surface Area. Given: The three-dimensional structure of a protein. Find: The total surface area of the protein's atoms that is accessible to a solvent molecule. Atoms and the solvent molecule are modeled as spheres using van der Waals radii.
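One common numerical way to estimate this quantity is point sampling in the style of the Shrake-Rupley method, which is swapped in here purely for illustration and is not claimed to be the approach the tutorial has in mind. Each atom's sphere is inflated by the probe radius, test points are spread over it, and points that fall inside another inflated sphere are discarded. The probe radius of 1.4 (a typical value for water, in angstroms), the point count, and all names are assumptions.

```python
import math

def sphere_points(n):
    """Return n roughly evenly spread points on the unit sphere (golden-spiral layout)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(golden_angle * k), r * math.sin(golden_angle * k), z))
    return points

def accessible_surface_area(atoms, probe_radius=1.4, n_points=960):
    """Estimate solvent-accessible area; atoms is a list of (x, y, z, radius)."""
    unit = sphere_points(n_points)
    total = 0.0
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe_radius            # sphere traced by the probe center
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            # A test point is buried if it lies inside any other inflated sphere.
            buried = any(
                j != i
                and (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe_radius) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms)
            )
            if not buried:
                exposed += 1
        total += 4.0 * math.pi * big_r * big_r * exposed / n_points
    return total
```

The triple loop costs time proportional to the number of atoms squared times the number of test points, which hints at why efficient and parallel algorithms for this problem are of interest.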

142 Part VIII: Comparative Genomics & Reconstructing Evolutionary Histories (Phylogenetic Trees)

143 Comparative Genomics [Figure: Chicken (NCBI accession #NC_) and Human (NCBI accession #NC_)]

144 Eukaryotic Cell

145
Organism                   Est. size             Est. # genes   Average gene density
Human                      3000 million bases    ~30,000        1 gene per 100,000 bases
M. musculus (mouse)        3000 million bases    30,000         1 gene per 100,000 bases
Drosophila (fruit fly)     million bases         13,061         1 gene per 13,781 bases
Arabidopsis (plant)        100 million bases     25,000         1 gene per 4,000 bases
C. elegans (roundworm)     97 million bases      19,099         1 gene per 5,079 bases
S. cerevisiae (yeast)      12.1 million bases    6,034          1 gene per 2,005 bases
E. coli (bacteria)         million bases         3,237          1 gene per 1,443 bases
H. influenzae (bacteria)   1.8 million bases     1,740          1 gene per 1,034 bases

146 Phylogenetics. Find the genetic connections and relationships between species (or sequences). Hypothesis: All existing organisms are derived from some common ancestor. A new species arises when one population splits into two (or more) populations that do not cross-breed.

147 Phylogenetic Trees. Each species (or sequence) is described by a set of traits (called characters). Leaves of the tree are labeled with input species. Internal nodes are labeled with input or inferred species. Edges represent transitions in the values of certain traits.

148 Types of Phylogenies. Relationships between taxa: species trees; gene trees. Data: morphological; nuclear genome; organelle genome. Tree of Life Web (Maddison/Maddison).

149 Example Phylogenies [Figures: (1) Campanulaceae (bluebell flowers): Wahlenbergia, Merciera, Trachelium, Symphyandra, Campanula, Adenophora, Legousia, Asyneuma, Triodanus, Codonopsis, Cyananthus, Platycodon, and Tobacco; (2) some herpesviruses known to affect humans: HHV6, EBV, HHV7, HVS, EHV2, KHSV, HSV1, VZV, HSV2, PRV, EHV1, HCMV; (3) leeches]

150 Techniques. Maximum parsimony: Occam's razor; the simplest explanation for evolution is the one that minimizes the sum of the number of evolutionary events along the tree branches. Maximum likelihood: statistical methods that use an evolutionary model (for example, one parameterized by the transition/transversion rate ratio for the nuclear genome).

151 Genomic Parsimony: Examples of characters Specific nucleotide in a fixed position of a DNA sequence (conserved in all examined species). Does the amino acid sequence for a protein contain a specific subsequence? Is the expression of a certain protein regulated by another particular protein?

152 Example
Species     Characters (sites 1 to 6)
Aardvark    C A G G T A
Bison       C A G A C A
Chimp       C G G G T A
Dog         T G C A C T
Elephant    T G C G T A

153 Example [Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant; surviving edge labels: ", 5" and "4, 5, 6"]

154 Perfect Phylogeny for Binary Characters. Given: An n × m 0-1 matrix representing n species and m binary characters. Find: A phylogenetic tree T such that the root of the tree represents an ancestor that has none of the m characters, and each character changes from 0 to 1 exactly once and never changes back.

155 Example [Figure: matrix M with rows A, B, C, D, E and a tree with leaves D, B, E, A, C]. Runs in O(mn) time (Gusfield91).
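Whether a 0-1 matrix admits such a tree can be checked with the standard pairwise condition: for every two characters, the sets of species carrying them must be either disjoint or nested. The sketch below applies that test directly, which costs O(n m^2) time and is therefore simpler but slower than the O(mn) algorithm cited on the slide. The function name and matrix encoding are illustrative assumptions.

```python
def has_perfect_phylogeny(matrix):
    """Decide whether a 0-1 species-by-character matrix admits a perfect phylogeny.

    matrix: list of rows (one per species), each a list of 0/1 values
    (one per character), with the root assumed to have no characters.
    """
    if not matrix:
        return True
    m = len(matrix[0])
    # For each character, the set of species (row indices) that carry it.
    carriers = [
        frozenset(i for i, row in enumerate(matrix) if row[j])
        for j in range(m)
    ]
    for a in range(m):
        for b in range(a + 1, m):
            s, t = carriers[a], carriers[b]
            # Overlapping but not nested sets cannot both arise from a single 0 -> 1 change.
            if s & t and not (s <= t or t <= s):
                return False
    return True
```

In the failing case, two characters are each present in a species that lacks the other and are also present together in a third species, so at least one of them would have to arise twice or be lost.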

156 Perfect Phylogeny for Non-Binary Characters. Input: n species, m characters, at most r states per character. The problem is NP-complete in general, but solvable in polynomial time for any fixed r (Agarwala94).

157 Parsimony. Parsimony score of a tree = total number of character changes in the tree: $P(T) = \sum_{(u,v) \in E(T)} \left| \{\, j : u_j \neq v_j \,\} \right|$

158 A Simpler Problem: Known Tree Given: Phylogenetic tree Find: Minimum parsimony score and optimal labeling of internal nodes. Can be solved in O(nmr) time [Fitch71].
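The known-tree problem can be sketched with the bottom-up pass of Fitch's method. The version below assumes a rooted binary tree written as nested 2-tuples of leaf names and returns only the minimum score; recovering an optimal labeling of the internal nodes needs an additional top-down pass that is omitted here. The input encoding and names are assumptions for illustration.

```python
def fitch_score(tree, leaf_states):
    """Minimum number of character changes on a fixed rooted binary tree.

    tree: a leaf name (str) or a 2-tuple of subtrees
    leaf_states: dict mapping each leaf name to a string of character states
    """
    score = 0

    def visit(node):
        nonlocal score
        if isinstance(node, str):
            # A leaf contributes one singleton state set per character.
            return [{state} for state in leaf_states[node]]
        left, right = (visit(child) for child in node)
        merged = []
        for a, b in zip(left, right):
            common = a & b
            if common:
                merged.append(common)      # children can agree: no change needed
            else:
                merged.append(a | b)       # children disagree: one change on this site
                score += 1
        return merged

    visit(tree)
    return score
```

With the single-site states A, A, C, C on leaves a, b, c, d, the tree ((a, b), (c, d)) needs one change, while ((a, c), (b, d)) needs two, so the first topology is the more parsimonious one for that character.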

159 Parsimony Problem. NP-hard. Techniques used: Branch and bound [Hendy&Penny82]. Neighbor-Joining [Saitou&Nei87, Studier&Keppler88].

160 Exploiting data about gene content and gene order has proved extremely challenging from a computational perspective: tasks that can easily be carried out in linear time for DNA data have required entirely new theories (such as the computation of inversion distance) or appear to be NP-hard. The focus has thus been on simple genomes, preferably genomes consisting of a single chromosome, where evolution can reasonably be assumed to have been driven mostly through gene order changes.

161 Cell Organelles. Chloroplasts and mitochondria have such genomes: around 120 genes for the chloroplasts of higher plants and typically 37 genes for the mitochondria of multicellular animals, in both cases packed onto a single chromosome. The gene content of these genomes is fairly constant across a wide phylogenetic range; differences are mostly in the ordering of the genes. [Figures: chloroplast, mitochondria]

162 Gene Order Phylogeny. Many organelles appear to evolve mostly through processes that simply rearrange gene ordering (inversion, transposition) and perhaps alter gene content (duplication, loss). Chloroplasts have a single, typically circular, chromosome and appear to evolve mostly through inversion: the gene order i-1, i, i+1, ..., j, j+1 becomes i-1, -j, ..., -(i+1), -i, j+1. The sequence of genes i, i+1, ..., j is inverted and every gene in it is flipped.
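The inversion operation is easy to state in code. The sketch below assumes a genome is a list of signed integers (the sign encodes gene orientation) treated as a linear order, and uses 0-based inclusive positions; these conventions and the function name are illustrative assumptions.

```python
def invert(genome, i, j):
    """Return a copy of `genome` with positions i..j (0-based, inclusive) inverted.

    The segment is reversed and every gene in it changes sign.
    """
    if not 0 <= i <= j < len(genome):
        raise ValueError("need 0 <= i <= j < len(genome)")
    flipped = [-gene for gene in reversed(genome[i:j + 1])]
    return genome[:i] + flipped + genome[j + 1:]
```

For instance, inverting positions 1 through 3 of [1, 2, 3, 4, 5] gives [1, -4, -3, -2, 5], and applying the same inversion again restores the original order.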

163 Phylogeny Heuristic Search [BPanalysis, Sankoff & Blanchette 98]
For each tree topology do   [(2n-5)!! = (2n-5) × (2n-7) × ... × 5 × 3 trees]
    somehow assign initial genomes to the internal nodes
    repeat   [unknown iterative heuristic]
        for each internal node do
            compute a new genome that minimizes the distances to its three neighbors   [NP-hard]
            replace old genome by new if distance is reduced
    until no change
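The size of the outer loop can be computed directly from the double factorial given above. This small sketch (the function name is an assumption) shows how quickly the number of topologies grows with the number of genomes n.

```python
def num_tree_topologies(n):
    """Return (2n-5)!! = 3 * 5 * ... * (2n-5), the count of tree topologies for n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for factor in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= factor
    return count
```

For 13 genomes the count is 13,749,310,575, which is consistent with the "14 billion trees" quoted for the Campanulaceae analysis later in the tutorial.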

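164 Lower Bounding of a Tree [Figure: a tree with leaves a, b, c, d, e, and a "tree version (paths)" of the same tree showing the paths between consecutive leaves]. Walking around the tree through the leaves in the circular order a, b, c, d, e traverses every edge exactly twice, so twice the tree length is at least d(a,b) + d(b,c) + d(c,d) + d(d,e) + d(e,a), with equality when d is measured along tree paths. (Same trick as in the twice-around-the-tree approximation for the TSP with triangle inequality.)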
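A minimal sketch of this bound follows, assuming the caller supplies a leaf ordering and a pairwise distance function that obeys the triangle inequality. The function name and the toy star-tree distance used to exercise it are hypothetical.

```python
def circular_lower_bound(order, dist):
    """Half the sum of distances between consecutive leaves in a circular order.

    order: list of leaves in circular order
    dist: function (x, y) -> pairwise distance satisfying the triangle inequality
    """
    n = len(order)
    around = sum(dist(order[k], order[(k + 1) % n]) for k in range(n))
    return around / 2

# Toy check on a star tree (hypothetical data): leaf x hangs off the center
# by an edge of length leaf_edge[x], so the tree length is 1 + 2 + 3 = 6.
leaf_edge = {"a": 1, "b": 2, "c": 3}

def star_distance(x, y):
    """Path length between two leaves of the toy star tree."""
    return 0 if x == y else leaf_edge[x] + leaf_edge[y]

bound = circular_lower_bound(["a", "b", "c"], star_distance)
```

In a search over tree topologies, a bound like this lets a tree be discarded without scoring it whenever the bound already exceeds the best score found so far.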

165 Parallelization of the Phylogeny Algorithm. Enumerating tree topologies is pleasantly parallel and allows multiple processors to search the tree space independently with little or no overhead. Load is evenly balanced when trees are cyclically assigned (e.g., in a round-robin fashion) to the processors. The result is linear speedup.
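The cyclic assignment can be sketched as follows, assuming trees are identified by their index in the enumeration order. The function name is an assumption, and a real parallel code would have each processor compute only its own share rather than build the full list.

```python
def cyclic_assignment(num_trees, num_procs):
    """Assign tree indices 0..num_trees-1 to processors in round-robin order.

    Processor `rank` receives indices rank, rank + num_procs, rank + 2*num_procs, ...
    """
    return [range(rank, num_trees, num_procs) for rank in range(num_procs)]

shares = cyclic_assignment(num_trees=10, num_procs=4)
```

No communication is needed to decide who evaluates which tree, and the per-processor counts differ by at most one.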

166 High-performance implementations enable: better approximations for difficult problems (MP, ML); true optimization for larger instances; realistic data exploration (e.g., testing evolutionary scenarios, assessing answers obtained through other means); use of more biologically meaningful models (inversions, transpositions, gene loss/duplication).

167 Inversion Distance (Hannenhalli-Pevzner Theory)
NP-hard for unsigned permutations [Caprara 97].
Polynomial for signed permutations [Hannenhalli & Pevzner 95].
Compute combinatorial terms from the cycle graph [Bafna & Pevzner 93, Setubal & Meidanis 97]: $d = b - c + h + f$, where b = number of breakpoints, c = number of cycles, h = number of hurdles, and f = 1 if there is a fortress and 0 otherwise.
O(n α(n)) time [Berman and Hannenhalli 96], where α(n) is the inverse Ackermann function (practically a constant no greater than 4).
New result: O(n) inversion distance [Bader, Moret, Yan 01], a faster and simpler algorithm both in theory and in practice.
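For intuition, the sketch below computes the breakpoint count b of a signed permutation and, separately, the exact inversion distance by brute-force breadth-first search. The search is exponential and only usable for very small n; it is not the linear-time algorithm cited above and does not compute cycles, hurdles, or fortresses. The function names and the convention of framing the permutation with 0 and n+1 are assumptions for illustration.

```python
from collections import deque

def breakpoints(perm):
    """Count breakpoints of a signed permutation of 1..n, framed by 0 and n+1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)

def inversion_distance_bfs(perm):
    """Exact inversion distance to the identity by breadth-first search (tiny n only)."""
    start = tuple(perm)
    target = tuple(range(1, len(perm) + 1))
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            return distance[current]
        n = len(current)
        for i in range(n):
            for j in range(i, n):
                # Invert positions i..j: reverse the segment and flip every sign.
                segment = tuple(-g for g in reversed(current[i:j + 1]))
                nxt = current[:i] + segment + current[j + 1:]
                if nxt not in distance:
                    distance[nxt] = distance[current] + 1
                    queue.append(nxt)
    raise ValueError("input is not a signed permutation of 1..n")
```

Since a single inversion can remove at most two breakpoints, b/2 is a simple lower bound on the distance. The permutation (+2, +1) has three breakpoints yet needs three inversions, which shows that breakpoints alone do not determine the distance.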

168 Challenges in Phylogeny: exact inversion median-of-three [Siepel02]; tree enumeration using circular ordering; handling unequal gene content and duplicate genes (using exemplars?); parallel branch and bound techniques for searching tree space; improved SPR and TBR techniques (local searches around good trees).

169 Additional Challenges: network evolution; recombination events; large-scale phylogeny reconstruction; comparison and accuracy of techniques and heuristics.

170 Parsimony Codes: Phylip (Felsenstein); Hennig86 (Farris); Nona (Goloboff) and TNT (Goloboff, Farris, Nixon); PAUP* (Swofford); MEGA (Kumar, Tamura, Jakobsen, Nei); GRAPPA (Bader, Moret, Warnow).

171 Likelihood Codes: Phylip (Felsenstein); PAUP* (Swofford); PAML (Yang); FastDNAml (Olsen, Matsuda, Hagstrom, Overbeek). Felsenstein's List of Software:

172 GRAPPA: Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms. Open source; already used by other computational phylogeny groups (Caprara, Pevzner, LANL, FBI, Smithsonian Institute, PharmCos). Gene-order phylogeny reconstruction: breakpoint median, inversion median. Over one-billion-fold speedup from previous codes. Parallelism scales linearly with the number of processors [Bader, Moret, Warnow].

173 Using GRAPPA to solve Campanulaceae Phylogeny. On the 512-processor cluster LosLobos at U. New Mexico, we ran the full analysis (all 14 billion trees) in under 1.5 hours, a 1,000,000-fold speedup (and using true inversion distance). The current release of GRAPPA (v. 1.6) now takes minutes to solve the same problem on several processors.

174 Campanulaceae. Bob Jansen, UT-Austin; Linda Raubeson, Central Washington U. [Figure: Campanulaceae phylogeny, including Tobacco]

175 Epilogue

176 Epilogue. Uses of Computation in Biology: 1. Discovering information from large data sets (e.g., database searches). 2. Relating micro-behavior to macro-behavior (e.g., protein folding). 3. Extending experimental capabilities (e.g., genome sequencing).

177 Epilogue Computation will be an integral part of future biological discoveries. Computational biology is an exciting interdisciplinary area that will become increasingly important in the future.

178 Bookshelf
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
D. Graur and W.-H. Li. Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, MA, second edition.
D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press.

179 Bookshelf
D.M. Hillis, C. Moritz, and B.K. Mable, eds. Molecular Systematics. Sinauer Associates, Sunderland, MA, second edition.
M. Nei and S. Kumar. Molecular Evolution and Phylogenetics. Oxford University Press, Oxford, UK.
P.A. Pevzner. Computational Molecular Biology: An Algorithmic Approach. MIT Press, Inc., Cambridge, MA.

180 Bookshelf
D. Sankoff and J. Kruskal, eds. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
J.C. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS, Boston, MA.
D. Sankoff and J.H. Nadeau, eds. Comparative Genomics: Empirical and Analytic Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families, volume 1 of Computational Biology. Kluwer Academic Publishers, Dordrecht, The Netherlands.
M.S. Waterman. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall / CRC, Boca Raton, FL.

181 Related & Referenced Publications
M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch, The enhanced suffix array and its applications to genome analysis, 2nd Workshop on Algorithms in Bioinformatics.
R. Agarwala and D. Fernández-Baca, A polynomial-time algorithm for the perfect phylogeny problem when the number of character-states is fixed, SIAM J. Comp., 23(6).
S. Aluru, N. Futamura, and K. Mehrotra, Biological sequence comparison using prefix computations, Proc. 13th IEEE Int'l Parallel Processing Symposium.
D.A. Bader, B.M.E. Moret, and L. Vawter, Industrial Applications of High-Performance Computing for Phylogeny Reconstruction, ITCom: Commercial Applications for High-Performance Computing, SPIE Vol. 4528.
D.A. Bader, B.M.E. Moret, and M. Yan, A Linear-Time Algorithm for Computing Inversion Distance Between Two Signed Permutations with an Experimental Study, Journal of Computational Biology, 8(5).
D.A. Bader, B.M.E. Moret, and P. Sanders, Algorithm Engineering for Parallel Computation, Experimental Algorithmics, Springer Verlag Lecture Notes in Computer Science, 2547:1-23.
D.A. Bader, S. Sreshta, and N.R. Weisse-Bernstein, Evaluating arithmetic expressions using tree contraction: A fast and scalable parallel implementation for symmetric multiprocessors (SMPs), Proc. 9th IEEE Int'l Conf. High-Performance Computing, 2002, to appear.

182 Related & Referenced Publications
V. Bafna and P.A. Pevzner, Genome rearrangements and sorting by reversals, Proc. 34th Ann. IEEE Symp. Foundations of Computer Science, 1993.
V. Bafna and P. Pevzner, Sorting permutations by transpositions, Proc. 6th Ann. Symp. Discrete Algorithms.
P. Berman and S. Hannenhalli, Fast sorting by reversal, Proc. 7th Ann. Symp. Combinatorial Pattern Matching.
K. Booth and G. Lueker, Testing for consecutive ones property, interval graphs and graph planarity testing using pq-tree algorithms, J. Comp. Sys. Sci., 13.
A. Caprara, Sorting by reversals is difficult, Proc. 1st ACM Conf. Computational Molecular Biology.
D.R. Clark and J.I. Munro, Efficient suffix trees on secondary storage, Proc. ACM-SIAM Symp. on Discrete Algorithms.
E. Edmiston, N. Core, J. Saltz, and R. Smith, Parallel processing of biological sequence comparison algorithms, Int'l Journal of Parallel Programming, 17(3).
J. Felsenstein, Evolutionary trees from DNA sequences: A maximum likelihood approach, Journal of Molecular Evolution, 17.
P. Ferragina and R. Grossi, Fast incremental text editing, Journal of Algorithms, 31. Also ACM-SIAM Symp. on Discrete Algorithms.
W.M. Fitch, Towards defining the course of evolution: Minimum change for a specific tree topology, Syst. Zool., 20.

183 Related & Referenced Publications
W.M. Fitch and E. Margoliash, Construction of phylogenetic trees, Science, 155.
N. Futamura, S. Aluru, and X. Huang, Parallel syntenic alignments, Proc. 9th IEEE Int'l Conf. on High Performance Computing, to appear.
N. Futamura, S. Aluru, D. Ranjan, and B. Hariharan, Efficient parallel algorithms for solvent accessible surface area of proteins, IEEE Trans. on Parallel and Distributed Systems, 13(6).
D. Gusfield, Efficient algorithms for inferring evolutionary trees, Networks, 21:19-28.
S. Hannenhalli and P.A. Pevzner, Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals), Proc. 27th ACM Ann. Symp. Theory of Computing.
R. Hariharan, Optimal parallel suffix tree construction, Proc. 26th IEEE Symp. Found. Computer Science.
M.D. Hendy and D. Penny, Branch and bound algorithms to determine minimal evolutionary trees, Mathematical Biosciences, 59.
X. Huang, A space-efficient parallel sequence comparison algorithm for a message-passing multiprocessor, Int'l Journal of Parallel Programming, 18(3).
X. Huang and A. Madan, CAP3: A DNA sequence assembly program, Genome Research, 9(9).

Recent Advances in Phylogeny Reconstruction

Recent Advances in Phylogeny Reconstruction Recent Advances in Phylogeny Reconstruction from Gene-Order Data Bernard M.E. Moret Department of Computer Science University of New Mexico Albuquerque, NM 87131 Department Colloqium p.1/41 Collaborators

More information

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data Mary E. Cosner Dept. of Plant Biology Ohio State University Li-San Wang Dept.

More information

AN EXACT SOLVER FOR THE DCJ MEDIAN PROBLEM

AN EXACT SOLVER FOR THE DCJ MEDIAN PROBLEM AN EXACT SOLVER FOR THE DCJ MEDIAN PROBLEM MENG ZHANG College of Computer Science and Technology, Jilin University, China Email: zhangmeng@jlueducn WILLIAM ARNDT AND JIJUN TANG Dept of Computer Science

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study

Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study Li-San Wang Robert K. Jansen Dept. of Computer Sciences Section of Integrative Biology University of Texas, Austin,

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Sequencing alignment Ameer Effat M. Elfarash

Sequencing alignment Ameer Effat M. Elfarash Sequencing alignment Ameer Effat M. Elfarash Dept. of Genetics Fac. of Agriculture, Assiut Univ. amir_effat@yahoo.com Why perform a multiple sequence alignment? MSAs are at the heart of comparative genomics

More information

Sequencing alignment Ameer Effat M. Elfarash

Sequencing alignment Ameer Effat M. Elfarash Sequencing alignment Ameer Effat M. Elfarash Dept. of Genetics Fac. of Agriculture, Assiut Univ. aelfarash@aun.edu.eg Why perform a multiple sequence alignment? MSAs are at the heart of comparative genomics

More information

Improving Tree Search in Phylogenetic Reconstruction from Genome Rearrangement Data

Improving Tree Search in Phylogenetic Reconstruction from Genome Rearrangement Data Improving Tree Search in Phylogenetic Reconstruction from Genome Rearrangement Data Fei Ye 1,YanGuo, Andrew Lawson 1, and Jijun Tang, 1 Department of Epidemiology and Biostatistics University of South

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Steps Toward Accurate Reconstructions of Phylogenies from Gene-Order Data 1

Steps Toward Accurate Reconstructions of Phylogenies from Gene-Order Data 1 Steps Toward Accurate Reconstructions of Phylogenies from Gene-Order Data Bernard M.E. Moret, Jijun Tang, Li-San Wang, and Tandy Warnow Department of Computer Science, University of New Mexico Albuquerque,

More information

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species Schedule Bioinformatics and Computational Biology: History and Biological Background (JH) 0.0 he Parsimony criterion GKN.0 Stochastic Models of Sequence Evolution GKN 7.0 he Likelihood criterion GKN 0.0

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

BIOINFORMATICS. New approaches for reconstructing phylogenies from gene order data. Bernard M.E. Moret, Li-San Wang, Tandy Warnow and Stacia K.

BIOINFORMATICS. New approaches for reconstructing phylogenies from gene order data. Bernard M.E. Moret, Li-San Wang, Tandy Warnow and Stacia K. BIOINFORMATICS Vol. 17 Suppl. 1 21 Pages S165 S173 New approaches for reconstructing phylogenies from gene order data Bernard M.E. Moret, Li-San Wang, Tandy Warnow and Stacia K. Wyman Department of Computer

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

A Framework for Orthology Assignment from Gene Rearrangement Data

A Framework for Orthology Assignment from Gene Rearrangement Data A Framework for Orthology Assignment from Gene Rearrangement Data Krister M. Swenson, Nicholas D. Pattengale, and B.M.E. Moret Department of Computer Science University of New Mexico Albuquerque, NM 87131,

More information

CGS 5991 (2 Credits) Bioinformatics Tools

CGS 5991 (2 Credits) Bioinformatics Tools CAP 5991 (3 Credits) Introduction to Bioinformatics CGS 5991 (2 Credits) Bioinformatics Tools Giri Narasimhan 8/26/03 CAP/CGS 5991: Lecture 1 1 Course Schedules CAP 5991 (3 credit) will meet every Tue

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

High-Performance Algorithm Engineering for Large-Scale Graph Problems and Computational Biology

High-Performance Algorithm Engineering for Large-Scale Graph Problems and Computational Biology High-Performance Algorithm Engineering for Large-Scale Graph Problems and Computational Biology David A. Bader Electrical and Computer Engineering Department, University of New Mexico, Albuquerque, NM

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Computational Structural Bioinformatics

Computational Structural Bioinformatics Computational Structural Bioinformatics ECS129 Instructor: Patrice Koehl http://koehllab.genomecenter.ucdavis.edu/teaching/ecs129 koehl@cs.ucdavis.edu Learning curve Math / CS Biology/ Chemistry Pre-requisite

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

More information

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr Introduction to Bioinformatics Shifra Ben-Dor Irit Orr Lecture Outline: Technical Course Items Introduction to Bioinformatics Introduction to Databases This week and next week What is bioinformatics? A

More information

Computational methods for predicting protein-protein interactions

Computational methods for predicting protein-protein interactions Computational methods for predicting protein-protein interactions Tomi Peltola T-61.6070 Special course in bioinformatics I 3.4.2008 Outline Biological background Protein-protein interactions Computational

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology 2012 Univ. 1301 Aguilera Lecture Introduction to Molecular and Cell Biology Molecular biology seeks to understand the physical and chemical basis of life. and helps us answer the following? What is the

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Supplemental Data. Perea-Resa et al. Plant Cell. (2012) /tpc

Supplemental Data. Perea-Resa et al. Plant Cell. (2012) /tpc Supplemental Data. Perea-Resa et al. Plant Cell. (22)..5/tpc.2.3697 Sm Sm2 Supplemental Figure. Sequence alignment of Arabidopsis LSM proteins. Alignment of the eleven Arabidopsis LSM proteins. Sm and

More information

Introduction to Molecular and Cell Biology

Introduction to Molecular and Cell Biology Introduction to Molecular and Cell Biology Molecular biology seeks to understand the physical and chemical basis of life. and helps us answer the following? What is the molecular basis of disease? What

More information

6.096 Algorithms for Computational Biology. Prof. Manolis Kellis

6.096 Algorithms for Computational Biology. Prof. Manolis Kellis 6.096 Algorithms for Computational Biology Prof. Manolis Kellis Today s Goals Introduction Class introduction Challenges in Computational Biology Gene Regulation: Regulatory Motif Discovery Exhaustive

More information

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2011 Genome Comparison Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Whole genome comparison/alignment Build better phylogenies Identify polymorphism

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary. Finding genes -- computer searches Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence

More information

DNA and protein databases. EMBL/GenBank/DDBJ database of nucleic acids

DNA and protein databases. EMBL/GenBank/DDBJ database of nucleic acids Database searches 1 DNA and protein databases EMBL/GenBank/DDBJ database of nucleic acids 2 DNA and protein databases EMBL/GenBank/DDBJ database of nucleic acids (cntd) 3 DNA and protein databases SWISS-PROT

More information

Sequences, Structures, and Gene Regulatory Networks

Sequences, Structures, and Gene Regulatory Networks Sequences, Structures, and Gene Regulatory Networks Learning Outcomes After this class, you will Understand gene expression and protein structure in more detail Appreciate why biologists like to align

More information

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5. Five Sami Khuri Department of Computer Science San José State University San José, California, USA sami.khuri@sjsu.edu v Distance Methods v Character Methods v Molecular Clock v UPGMA v Maximum Parsimony

More information
