Opportunities and Challenges in Computational Biology
1 Opportunities and Challenges in Computational Biology Srinivas Aluru Electrical & Computer Engineering Lawrence H. Baker Center for Bioinformatics & Biological Statistics Iowa State University David A. Bader Electrical & Computer Engineering University of New Mexico
2 Acknowledgments National Science Foundation Nov. 17, 2002 SC2002 Tutorial: Computational Biology 1
3 Opportunities and Challenges in Computational Biology "Biology easily has 500 years of exciting problems to work on." (Donald E. Knuth)
4 Outline 1. Molecular Biology Background 2. Sequence Alignments 3. String Data Structures and Algorithms 4. Genome Assembly 5. Gene Identification & Annotation 6. Microarrays & Gene Expression Analysis 7. Protein Folding 8. Comparative Genomics & Reconstruction of Evolutionary Histories
5 Schedule Morning: 8:30-9:15 (Part I) Biology Background; 9:15-10:00 (Part II) Sequence Alignments; 10:00-10:30 Break; 10:30-11:15 (Part III) String Data Structures and Algorithms; 11:15-12:00 (Part IV) Genome Assembly; 12:00-1:30 Lunch. Afternoon: 1:30-2:15 (Part V) Gene Identification & Annotation; 2:15-3:00 (Part VI) Microarrays & Gene Expression Analysis; 3:00-3:30 Break; 3:30-4:15 (Part VII) Protein Folding; 4:15-5:00 (Part VIII) Comparative Genomics & Evolutionary Histories
6 Part I: Molecular Biology Background
7 Biological Data DNA: self-replicating; codes for proteins. Proteins: perform most functions in living organisms.
8 DNA: Sequence of nucleotides. Nucleotide: deoxyribose sugar + phosphate + base. Nucleotides: A, T, G, and C. [Figure: chemical structure of a nucleotide, with the sugar carbons numbered 1' to 5']
10 [Figure: two paired DNA strands with phosphate (P) backbones, 3' ends marked, and base pairs A-T, C-G, G-C]
12 For computational purposes, DNA = a sequence over the alphabet {A,C,G,T}
5' A T T C G G G A A T G C A T G C C A 3'
3' T A A G C C C T T A C G T A C G G T 5'
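The two-strand picture above can be reproduced in a few lines. This is a minimal sketch, assuming DNA is held as a plain Python string over {A,C,G,T}; the function names `complement` and `reverse_complement` are illustrative and not from the tutorial.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Base-by-base complement, written in the same left-to-right order."""
    return strand.translate(COMPLEMENT)

def reverse_complement(strand: str) -> str:
    """The opposite strand written in its own 5'-to-3' direction."""
    return complement(strand)[::-1]

top = "ATTCGGGAATGCATGCCA"   # 5' -> 3', the upper strand on the slide
bottom = complement(top)      # 3' -> 5', as drawn underneath it
print(bottom)
print(reverse_complement(top))
```

The reverse complement is the form needed whenever the strand of origin of a fragment is unknown, since either strand may have been sequenced.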
13 Genome: Entire genetic constitution of a living organism. Chromosome: Linear strand of DNA. Gene: A contiguous stretch of DNA that codes for a protein.
14 [Table: Species, Number of Chromosomes, Genome Size; the numeric entries did not survive transcription] Species listed: Bacteriophage λ, Escherichia coli (bacterium), Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fruit fly), Homo sapiens (human).
15 Proteins: Chains of amino acid residues. There are 20 different amino acids. Functions: tissue building blocks (structural proteins), catalysts (enzymes), oxygen transport, antibody defense.
17 [Figure: polypeptide backbone of three residues with side chains R1, R2, R3, showing the dihedral angles Φ and ψ around the central Cα]
19 Codon table. Rows: first position; columns: second position; within each cell the third position runs U, C, A, G.
     U                C                A                  G
U    Phe Phe Leu Leu  Ser Ser Ser Ser  Tyr Tyr STOP STOP  Cys Cys STOP Trp
C    Leu Leu Leu Leu  Pro Pro Pro Pro  His His Gln Gln    Arg Arg Arg Arg
A    Ile Ile Ile Met  Thr Thr Thr Thr  Asn Asn Lys Lys    Ser Ser Arg Arg
G    Val Val Val Val  Ala Ala Ala Ala  Asp Asp Glu Glu    Gly Gly Gly Gly
20 Protein Synthesis (DNA → Protein): DNA → (transcription) → mRNA → (translation) → Protein
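The DNA → mRNA → protein pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: the standard genetic code is used, `transcribe` simply rewrites the coding strand with U in place of T (real transcription reads the template strand), translation starts at the first base and stops at the first STOP codon, and all names are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code as one-letter amino acids ('*' = STOP), with codons
# enumerated in U, C, A, G order for the first, second and third positions.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_strand: str) -> str:
    """Simplified transcription: the mRNA equals the coding strand with T -> U."""
    return coding_strand.replace("T", "U")

def translate(mrna: str) -> str:
    """Read codons from position 0 until a STOP codon or the end of the mRNA."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate(transcribe("ATGGCCTGGTAA")))  # Met-Ala-Trp, then STOP
```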
22 Summary
23 What Can Be Done Experimentally? DNA sequences of length up to … bp can be read (Sanger's method). DNA samples can be amplified (PCR). Protein sequences can be determined. Structure of proteins can be determined using X-ray crystallography (expensive, tedious, time-consuming).
24 Challenges in Computational Biology 1. Find the genomes of all organisms. 2. Identify and annotate genes. 3. Find the sequences, three-dimensional structures and functions of all proteins. 4. Find sequences of proteins that have desired three-dimensional structures. 5. Compare DNA sequences and protein sequences for similarity. 6. Study the evolution of sequences and species.
25 Part II: Sequence Alignments
26 Pairwise Sequence Alignment Problem: Find similarity between two sequences. Variations: Given two sequences, find if parts of them are similar (local alignment). Given a large sequence and a short sequence, find if the short sequence is similar to a stretch of the long sequence.
27 Pairwise Global Alignment Alignment: Stacking the sequences against each other, with gaps if necessary, to expose similarity. Score: A measure of the quality of an alignment. Example, with a total score of -2:
C A T - T C A - C
C - T C G C A G C
28 Pairwise Global Alignment T[i,j] = score of optimally aligning the first i bases of s with the first j bases of t, where g is the gap penalty:
$$T[i,j] = \max \begin{cases} T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\ T[i-1,j] - g \\ T[i,j-1] - g \end{cases}$$
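The recurrence can be turned into a short table-filling routine. The sketch below is illustrative rather than the tutorial's own code: the scoring values (match +1, mismatch -1, gap penalty g = 2) are assumptions, chosen because they give the example alignment on the previous slide its score of -2 (5 matches, 1 mismatch, 3 gaps), and the boundary cells are initialized to the cost of aligning a prefix against gaps only.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=2):
    """Fill T[i][j] = best score of aligning s[:i] with t[:j]; return T[m][n]."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        T[i][0] = -gap * i
    for j in range(1, n + 1):
        T[0][j] = -gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(
                T[i - 1][j - 1] + pair,  # align s[i] with t[j]
                T[i - 1][j] - gap,       # s[i] against a gap
                T[i][j - 1] - gap,       # t[j] against a gap
            )
    return T[m][n]

print(global_alignment_score("ACGT", "AGT"))  # one gap, three matches
```

The two nested loops make the O(mn) running time visible; only the previous row is needed to compute the next one, which is the starting point for the space-saving variants.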
29 [Figure: dynamic programming table for aligning CATTCAC with CTCGCAGC]
30 Local Alignment
$$T[i,j] = \max \begin{cases} 0 \\ T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\ T[i-1,j] - g \\ T[i,j-1] - g \end{cases}$$
Initialize the top row and leftmost column to zero. Start with a maximal value in the table and trace back.
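A minimal sketch of the local variant follows, again assuming match +1, mismatch -1 and gap penalty 2 (values not given in the tutorial). It returns the maximal table value and the cell where it occurs, which is where a traceback would start; the traceback itself is omitted.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=2):
    """Return (best score, (i, j)) where (i, j) is the cell holding the maximum."""
    m, n = len(s), len(t)
    # Top row and leftmost column stay at zero.
    T = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(
                0,                       # start a fresh local alignment here
                T[i - 1][j - 1] + pair,
                T[i - 1][j] - gap,
                T[i][j - 1] - gap,
            )
            if T[i][j] > best:
                best, best_cell = T[i][j], (i, j)
    return best, best_cell

print(local_alignment_score("TTACGTTT", "GGACGTGG"))  # shared stretch ACGT
```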
31 Affine Gap Penalty Functions Gap penalty = h + gk, where k = length of a maximal sequence of gaps, h = gap opening penalty, g = gap continuation penalty.
32 Some Results Most pairwise sequence alignment problems can be solved in O(mn) time. The space requirement can be reduced to O(m+n) without increasing the asymptotic run-time [Myers88]. Two highly similar sequences can be aligned in O(dn) time, where d is a measure of the distance between the sequences [Landau86].
33 Parallel Sequence Alignment All cells on an antidiagonal of the table can be computed in parallel.
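The reason antidiagonals parallelize is that cell (i, j) depends only on (i-1, j-1), (i-1, j) and (i, j-1), all of which lie on the two preceding antidiagonals. The sketch below only enumerates the wavefronts in the order they could be processed; it is an illustration with a hypothetical helper name, and the actual distribution of cells over processors (threads, MPI ranks) is not shown.

```python
def antidiagonals(m, n):
    """Yield the cells (i, j), 1 <= i <= m, 1 <= j <= n, grouped by i + j.

    Every cell in one group can be computed independently of the others
    in the same group once all earlier groups are finished.
    """
    for d in range(2, m + n + 1):
        yield [(i, d - i) for i in range(max(1, d - n), min(m, d - 1) + 1)]

for wave in antidiagonals(3, 4):
    print(wave)
```

There are m + n - 1 wavefronts, and the short ones near the table corners offer little parallelism, which is one motivation for the row-by-row schemes.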
34 Some Known Results Parallel sequence alignment can be computed in O(mn/p) time (Edmiston88). An optimal space-saving algorithm requires only O((m+n)/p) space, but takes O((m+n)^2/p) time (Huang89). A row-by-row parallelization is possible and is more communication-efficient. Space can be reduced to O(m+n/p) without sacrificing time optimality (Aluru99).
35 Multiple Sequence Alignment
VTISCTGSSSNIGAG NHVKWYQQLPG
VTISCTGTSSNIGS ITVNWYQQLPG
LRLSCSSSGFIFSS YAMYWVRQAPG
LSLTCTVSGTSFDD YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGA VTVAWKADS
ATLVCLISDFYPGA VTVAWKADS
AALGCLVKDYFPEP VTVSWNSG-
VSLTCLVKGFYPSD IAVEWESNG-
36 Induced Pairwise Alignment
S1: S - T I S C T G - S - N I
S2: L - T I - C N G S S - N I
S3: L R T I S C S G F S Q N I
Induced pairwise alignment of S1 and S2 (columns in which both sequences have a gap are dropped):
S1: S T I S C T G - S N I
S2: L T I - C N G S S N I
37 Sum-of-Pairs Scoring Function
$$\text{Score of multiple alignment} = \sum_{i<j} \mathrm{score}(S_i, S_j) = \sum_{t=1}^{l} \sum_{i<j} \mathrm{score}(S_{it}, S_{jt})$$
where score(S_i, S_j) = score of the induced pairwise alignment of S_i and S_j, S_it = character of S_i in column t, and l = length of the multiple alignment.
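The column-wise form of the sum-of-pairs score is straightforward to compute. The following sketch assumes match +1, mismatch -1 and gap penalty 2 (illustrative values), and scores a column in which both sequences have a gap as 0, since such columns vanish from the induced pairwise alignment.

```python
from itertools import combinations

def column_pair_score(a, b, match=1, mismatch=-1, gap=2):
    """Score one column of an induced pairwise alignment."""
    if a == "-" and b == "-":
        return 0            # column is dropped from the induced alignment
    if a == "-" or b == "-":
        return -gap
    return match if a == b else mismatch

def sum_of_pairs(rows):
    """Sum the pairwise column scores over all pairs of rows and all columns."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows of a multiple alignment must have equal length")
    return sum(
        column_pair_score(r1[t], r2[t])
        for r1, r2 in combinations(rows, 2)
        for t in range(length)
    )

print(sum_of_pairs(["AC-T", "A-GT", "ACGT"]))
```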
38 Multiple Alignment Run-time of the dynamic programming solution = $O(2^k n^k)$, where n = length of each sequence and k = number of sequences. Space, $O(n^k)$, is prohibitively large! Example: 6 sequences of length … require …×10^13 calculations!
39 Carrillo-Lipman Heuristic U = upper bound on the multiple alignment score. If
$$T[i_1, i_2, \ldots, i_k] + \sum_{j<l} \mathrm{score}\big(S_j[i_j, n],\, S_l[i_l, n]\big) > U$$
then T[i_1, i_2, …, i_k] cannot be on an optimal path. Here S_j[i_j, n] denotes the part of S_j from position i_j to n.
40 Multiple Alignment to a Phylogenetic Tree A tree showing the evolutionary relationship between sequences is available. Compute a multiple alignment such that for each edge (i,j) in the tree, the induced alignment between S_i and S_j = the optimal alignment between S_i and S_j.
41 Multiple Alignment to a Tree Build the multiple alignment incrementally. To add a new sequence, an edge should connect it in the tree to a sequence already incorporated in the multiple alignment. Insert the new sequence according to its optimal alignment with the other sequence connected by the edge. Adjust other sequences in the multiple alignment. Run-time = time for k pairwise alignments.
42 Searching Biological Databases BLAST (Basic Local Alignment Search Tool): BLASTN (DNA), BLASTP (Protein), BLASTX (DNA against Protein), PSI-BLAST (Position Specific Iterative BLAST).
43 Multiple Alignment Software ClustalW, MSA, HMMER, SAM (compbio/sam.html)
44 Open Problems - Sequential 1. Gene-to-gene alignment to identify exons and introns. 2. Full genome comparison. Genomes contain mobile components known as transposons. Due to transposons and genome rearrangements, full genome comparison is not straightforward.
45 Open Problems - Parallel 1. Parallel alignment of similar sequences. 2. Parallel spliced alignment: DNA to gene, gene to gene. 3. Parallel full-genome comparison. 4. Parallel multiple sequence alignment.
46 Part III: String Data Structures and Algorithms
47 Why Strings? Biological sequences can be viewed as strings of characters over an alphabet. Sequence similarities typically translate to functional similarities.
48 Suffix Tree [Figure: suffix tree of MALAYALAM$, with edges labeled by substrings and leaves by suffix start positions]
49 Suffix Tree [Figure: suffix tree of MALAYALAM$ with each edge label stored as a pair (start, end) of positions in the string, e.g. (5, 10) for YALAM$]
50 Finding a Pattern in a String 1. Build a suffix tree of the string. 2. Starting from the root, traverse a path matching characters of the pattern. 3. If stuck, the pattern is not present in the string. Otherwise, each leaf below gives a position of the pattern in the string.
51 Finding a Pattern in a String [Figure: searching for ALA in the suffix tree of MALAYALAM$]
52 Finding Common Substrings Construct a generalized suffix tree for two strings (each suffix of each string is represented). Label each leaf with the suffix number and string label. Each internal node with a leaf from each string in its subtree gives a common substring.
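For small inputs the same question (what is the longest substring common to two strings?) can be answered without building a generalized suffix tree. The sketch below deliberately swaps in a simple O(|s||t|) dynamic program instead of the linear-time suffix-tree method described above; it is meant only to make the problem concrete, and the function name is illustrative.

```python
def longest_common_substring(s, t):
    """Return one longest substring occurring in both s and t ("" if none)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)   # prev[j]: common suffix length of s[:i-1], t[:j]
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s[best_end - best_len:best_end]

print(longest_common_substring("WINDOW", "INDIGO"))
```

In the suffix-tree formulation the same answer is the deepest internal node that has leaves from both strings in its subtree.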
53 Generalized Suffix Tree [Figure: generalized suffix tree of WINDOW$ and INDIGO$, each leaf labeled (string number, suffix position)]
54 Suffix Array: Reducing Space. String: M A L A Y A L A M $ (positions 1 to 10). Suffixes in sorted order, with the lcp array giving the length of the longest common prefix of each suffix and the next:
Suffix Array  Suffix       lcp
6             ALAM$        3
2             ALAYALAM$    1
8             AM$          1
4             AYALAM$      0
7             LAM$         2
3             LAYALAM$     0
1             MALAYALAM$   1
9             M$           0
5             YALAM$       0
10            $
55 Suffix Trees vs. Suffix Arrays Suffix Array = lexicographic order of the leaves of the suffix tree. Suffix Tree = Suffix Array + lcp Array.
56 Pattern Search in Suffix Array All suffixes that share a common prefix appear in consecutive positions in the array. Pattern P can be located in the string using a binary search on the suffix array. Naïve run-time = O(|P| log n). Improved to O(|P| + log n) [Manber&Myers93], and to O(|P|) [Abouelhoda et al. 02].
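The naïve O(|P| log n) search can be sketched as two binary searches that bracket the block of suffixes starting with P. The construction used here is a simple sort of all suffixes, not one of the linear-time algorithms cited in this part; positions are 1-based to match the slides, and Python's character ordering places `$` before the letters, so the `$`-terminated entries may appear in a different order than on the slide. Function names are illustrative.

```python
def build_suffix_array(text):
    """1-based start positions of all suffixes, in sorted order (naive sort)."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])

def find_occurrences(text, sa, pattern):
    """All 1-based positions where pattern occurs, via binary search on sa."""
    def prefix(k):
        start = sa[k] - 1
        return text[start:start + len(pattern)]

    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "MALAYALAM$"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ALA"))
```

Each comparison inspects at most |P| characters and there are O(log n) comparisons, which gives the naïve bound; the improved bounds rely on the lcp information to avoid re-comparing characters.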
57 Other Applications Common substrings of multiple strings; suffix-prefix overlaps; maximal and tandem repeats; shortest unique substrings; maximal unique matches [MUMmer]; approximate matching with bounded errors.
58 Limitations of String Data Structures Can only be used to extract information in the absence of errors. Problems dealing with errors may be solved by decomposing into components that do not involve errors. Example: If two sequences exhibit similarity, there must be substrings common to them.
59 Some Results 1. Suffix trees can be constructed in O(n) time and O(n) space [Weiner73, McCreight76, Ukkonen92]. 2. Suffix arrays can be constructed without using suffix trees in O(n) time [Pang&Aluru02]. 3. Suffix trees can be built in O(log^4 n) time on the CREW PRAM model [Hariharan94].
60 Open Problems 1. Algorithms independent of alphabet size. 2. Practically efficient parallel algorithms for suffix trees and arrays. 3. What is the best way to store a biological database on a disk? Some work on disk-based data structures: String B-trees [Ferragina & Grossi 95]. Suffix trees on disk [Clark & Munro 96].
61 Software Development Opportunities Develop a general-purpose tree-based database system for efficiently storing, inserting and deleting, and querying biological sequences. Current approach: Store sequences as a flat file. The entire database is searched for each query!
62 Part IV: Genome Assembly
63 Sequencing a Genome Physical Mapping: Find markers along the genome, to find unique contigs (possibly overlapping) that cover the genome. Fragment Assembly: Sequence each contig by breaking it into several short fragments, sequencing the fragments, and assembling them together.
64 Physical Mapping: Sequence Tagged Sites A Sequence Tagged Site (STS) is a probe sequence that attaches to a unique position in the genome (length about … bases). The probe can identify the existence of the short sequence in the genome but cannot specify its location.
65 Cutting With Restriction Enzymes A restriction enzyme is a protein that cuts DNA at a specific pattern (typically a palindrome). Example: EcoRI, which cuts at GAATTC (complementary strand CTTAAG). [Figure: a double-stranded DNA fragment containing an EcoRI site]
66 Physical Mapping Generate a large number of fragments of the genome, called clones. Find which probes attach to which clones. Find the order of the fragments along the genome.
67 Clones and Probes [Figure: clones and probes labeled A to G]
68 STS Matrix [Table: clone-versus-probe matrix with columns A to G; the entries did not survive transcription]
69 STS Hybridization Problem Given: STS matrix. Find: a permutation of the columns such that the 1s in each row are consecutive. The algorithm runs in linear time, assuming the matrix has no errors (Booth76).
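To make the problem statement concrete, the sketch below searches for a column order with the consecutive-ones property by brute force over all permutations. This is not the linear-time algorithm cited on the slide; it is exponential and only usable for a handful of probes. The matrix layout (rows = clones, columns = probes) and the function names are assumptions for illustration.

```python
from itertools import permutations

def rows_consecutive(matrix, order):
    """True if, with columns arranged in `order`, the 1s of every row are adjacent."""
    for row in matrix:
        ones = [k for k, col in enumerate(order) if row[col] == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True

def find_probe_order(matrix):
    """Return some column order with the consecutive-ones property, or None."""
    for order in permutations(range(len(matrix[0]))):
        if rows_consecutive(matrix, order):
            return order
    return None

# Three clones, four probes: clone 0 contains probes 0 and 2, and so on.
sts_matrix = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
print(find_probe_order(sts_matrix))
```

When no order exists, as happens with erroneous data, the function returns None, which is the situation the error-tolerant formulations try to repair.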
70 Errors in STS Data False positives: Clone is reported to contain an STS, but it does not. False negatives: Clone is reported to not contain an STS, but it does. Chimeras: Two different DNA fragments combine and act as one clone.
71 Mapping Problem in Presence of Errors In the absence of errors, overlap information is an interval graph. Find a way to discard some information in order to obtain an interval graph. Several ways of modeling the problem are NP-complete!
72 Fragment Assembly Given: a collection of DNA fragments. Assemble: the fragments into maximal-length contiguous sequences, or contigs, using overlap information.
73 Fragment Assembly [Figure]
74 Shortest Common Superstring In the absence of errors, fragment assembly = finding the shortest common superstring of the given fragments. The shortest common superstring problem is NP-hard.
75 Greedy Heuristic Find two fragments that have a maximum overlap and combine them into one contig. Iterate by treating contigs as fragments. The greedy heuristic is a 4-approximation algorithm. The approximation factor has been improved to 2.2.
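A minimal sketch of the greedy heuristic for error-free fragments follows. It assumes exact suffix-prefix overlaps, a non-empty input, and that no fragment is contained in another; it ignores strand orientation and uses a quadratic scan per round, so it illustrates the idea rather than a practical assembler. Function names are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(fragments):
    """Repeatedly merge the pair with the largest overlap until one string is left."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (-1, 0, 0)                       # (overlap length, i, j)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]        # the contig replaces both fragments
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags[0]

print(greedy_superstring(["ATTCG", "TCGGA", "GGAAT"]))
```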
76 Difficulties in Fragment Assembly Fragments contain errors. Lack of sufficient coverage. Different fragments may combine (chimeras). Which strand did it come from? (unknown orientation). Repeats in the genome.
77 Approach to Fragment Assembly For each fragment and partial contig formed, consider both the sequence and its reverse complement. Detect overlaps using dynamic programming to allow for errors.
78 Possible Fragment Overlaps [Figure: the possible overlap configurations of fragments F1 and F2]
79 Approach to Fragment Assembly Preprocessing: Eliminate pairs of fragments that cannot have significant overlap (quick check). Compute overlap between promising pairs using dynamic programming. If a fragment is completely contained in another, discard the shorter fragment.
80 Approach to Fragment Assembly Forming Contigs (Greedy Heuristic): Combine fragments with the strongest evidence of overlap. Treat the resulting partial contig as a single fragment and consider overlapping ends unavailable. Iterate using the next strongest available overlap.
81 Approach to Fragment Assembly Generating Consensus Sequence: Perform a multiple sequence alignment between parts of fragments overlapping in the same position to obtain better contigs.
82 Fragment Assembly Software 1. CAP3 (ftp://cs.mtu.edu/pub/huang) 2. Phrap 3. TIGR Assembler
83 Genome Sequencing Complete genomes of over 800 organisms are currently (or soon to be) available. main_genomes.html
84 Part V: Gene Identification and Annotation
85 Sequencing the genome is not an end-goal! Identify genes on the genome. Find the corresponding family of proteins. Find the functions of the proteins and how they are regulated. Study the natural variations in the gene among related species and different strains of the same species. Study variation between healthy and disease-causing genes.
87 Gene Structure [Figure: DNA with promoter, exons and introns is transcribed into pre-mRNA; RNA splicing removes the introns, giving mRNA with a 5' cap and a poly-A tail]
88 EST Clustering Provides Clues to Finding Genes [Figure: genomic DNA with exons 1 to 3 separated by introns 1 and 2; the mRNA with exons 1, 2 and 3 joined; ESTs drawn from the mRNA]
89 How to Obtain EST Data? dbEST: 12,845,578 ESTs as of September 20, 2002. Number of ESTs by organism: Human 4,691,979; Mouse 2,706,…; Arabidopsis thaliana, Zea mays and Rice (counts truncated in the transcript).
90 Goals of EST Clustering Clustering: Build clusters with each cluster containing ESTs from the same gene. Identification: Identify the gene. Annotation: Find and assign a function to the corresponding protein.
91 Alternative Splicing [Figure: genes with exons, introns and optional exons; Gene 1 and Gene 2 each give rise to two mRNAs, mRNA 1 and mRNA 2]
92 Difficulties in EST Clustering: Lack of Coverage [Figure: an mRNA and the ESTs derived from it]
93 Difficulties in EST Clustering: Duplicated Genes [Figure: mRNA from a gene and mRNA from a duplicated gene, with a high degree of similarity between their ESTs]
94 Approaches to EST Clustering Use pairwise comparisons between ESTs to put ESTs into clusters. 1. Exhaustive approach: compare all pairs of ESTs. 2. Use fragment assembly software.
95 Fragment Assembly Is Not Suited to EST Clustering Lack of sufficient coverage. ESTs come from different individuals and different strains of the same species. Genomic and protein databases provide additional clues to EST clustering.
96 Fragment Assembly Is Not Suited to EST Clustering The number of EST fragments is too large. ESTs are obtained in batches. Fragment assembly software is not incremental.
97 Evaluation of Current Software (single node of an IBM xSeries cluster). TIGR Assembler, PHRAP and CAP3 compared on n=100,001 and n=144,870 ESTs. [Table: run-time and memory per program; surviving entries are 91 min, 150 min, 154 min and 241 min, plus entries marked X; the remaining values did not survive transcription]
98 NIH UniGene project Perform a database search for each EST. Results are accrued incrementally using weekly builds on an 80-processor Intel farm. Quality overrides computational issues.
99 Space and Time Efficient EST Clustering Initially, treat each EST as a cluster by itself. If two ESTs from two different clusters show significant overlap, merge the clusters. Use a union-find data structure.
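The clustering loop can be sketched with a standard union-find structure. In this illustration ESTs are identified by integers 0..n-1 and `overlapping_pairs` is assumed to contain only pairs that already passed the alignment test; the class and function names are hypothetical, not those of the tutorial's software.

```python
class UnionFind:
    """Disjoint sets with union by size and path halving."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra          # attach the smaller tree under the larger
        self.size[ra] += self.size[rb]
        return True

def cluster_ests(n_ests, overlapping_pairs):
    """Start with singleton clusters and merge along each overlapping pair."""
    uf = UnionFind(n_ests)
    for a, b in overlapping_pairs:
        # A pair already in the same cluster needs no further work.
        if uf.find(a) != uf.find(b):
            uf.union(a, b)
    clusters = {}
    for est in range(n_ests):
        clusters.setdefault(uf.find(est), []).append(est)
    return sorted(clusters.values())

print(cluster_ests(6, [(0, 1), (1, 2), (3, 4), (0, 2)]))
```

Checking `find` before aligning is what saves time in practice: once two ESTs are known to share a cluster, the expensive alignment of that pair can be skipped.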
100 Reporting high-quality promising pairs first is important! [Figure: flowchart in which a promising pair that passes the alignment test (a successful overlap) results in a merge]
101 Generating Promising Pairs Quality of overlap = length of a maximal common substring. Promising pairs are pairs that have a maximal common substring of length at least ψ. Produce promising pairs on demand, in decreasing order of quality.
102 Pair Generation Algorithm Build the generalized suffix tree (GST) of the ESTs. Process the nodes of the GST in decreasing order of string-depth and generate pairs at each node. Generate a pair at a node only if the corresponding overlap is maximal.
103 Main Idea of the Algorithm [Figure: a maximal common substring α = xβ of strings s1 and s2 at node v of the generalized suffix tree, with leaves (s1, i), (s2, j), (s1, i+1) and (s2, j+1)]
104 Parallel EST Software [Figure: construction/preprocessing phase followed by a parallel clustering phase]
105 Run-time vs. Number of processors [Plot: run-time in seconds (up to 7,000) against number of processors, one curve each for n=10,000, n=20,000, n=40,000, n=80,000 and n=144,870]
106 Number of Pairs vs. Number of ESTs [Plot: number of pairs in thousands (up to 5,000) against number of ESTs (10,000 to 144,870), broken down into aligned and accepted, aligned and rejected, and unaligned pairs]
107 Open Problems Develop software that can cluster the human EST collection (~4.7 million currently). Improve the quality of clustering: detect alternative splicing; consult genomic & protein databases. Develop a comprehensive software system for gene identification combining EST clustering, ab initio gene prediction and genome comparison.
108 Part VI: Microarrays and Gene Expression Analysis
109 Gene Expression Studies How does gene expression level differ in various cell types and states? How is gene expression changed by diseases? What are the functional roles of different genes? How are genes regulated?
110 Microarray A glass slide on which single-stranded DNA molecules are attached at fixed spots. Each molecule corresponds to a gene (e.g., an EST). When a solution containing single-stranded molecules is washed over it, binding based on complementarity takes place. A single microarray can contain tens of thousands of spots.
112 Comparing mRNA abundance mRNA from sample and control are labeled with different fluorescent dyes. Both solutions are washed over the microarray. Relative abundance of different mRNAs can be judged by color/intensity difference.
114 [Figure] Credit: A. Michael Cambell
115 The Full Yeast Genome on a Chip Statistics: 6116 yeast genes; 96 intergenic regions + lots of control samples; total spots printed: 707,520; total arrays: 110; actual time to print: 52 hours; actual speed: … spots/min; total cycles: 1608; total water usage: 23 liters; tip spacing: 221 µm; taps per tip: 176,880; completed: 25 April 1997. Patrick O. Brown Lab, Stanford.
116 Microarray Databases Repositories containing information obtained by microarray experiments
117 Microarray Analysis Lists of software packages Hierarchical Clustering Self-Organizing Maps
118 Gene Expression Matrix A way to capture microarray data. Rows correspond to genes. Columns represent samples (different developmental stages, conditions and tissues).
119 Using Gene Expression Matrices Compare gene expression profiles. Find co-regulated genes. Compare expression profiles of samples. Find differentially expressed genes.
120 Finding Co-regulated Genes Each gene can be represented as a point in n-dimensional space. Use clustering algorithms to find co-regulated genes.
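As one concrete instance of "use clustering algorithms", the sketch below runs single-linkage agglomerative clustering on expression profiles, treating each gene (a row of the gene expression matrix) as a point and using Euclidean distance. The choice of linkage, the distance measure, the stopping rule (a target number of clusters k) and the toy data are all assumptions for illustration; the tutorial does not prescribe them.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(profiles, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        best = None                               # (distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    euclidean(profiles[i], profiles[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Four genes measured in three samples (made-up values).
expression = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [5.0, 0.5, 0.2],
    [5.2, 0.4, 0.1],
]
print(single_linkage(expression, 2))
```

Recording the order of merges instead of stopping at k clusters would give the tree (dendrogram) usually drawn for hierarchical clustering.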
121 Example - Hierarchical Clustering
122 Summary Microarrays are a relatively new technology, allowing simultaneous collection of vast experimental data. Data mining and AI techniques are used to discover information from microarray data. Innovative uses of microarrays are still being discovered.
123 Part VII: Protein Folding
124 Primary Structure A A S X D X S L V E V H X X V F I V P P X I L Q A V V S I A 31 T T R X D D X D S A A A S I P M V P G W V L K Q V X G S Q A 61 G S F L A I V M G G G D L E V I L I X L A G Y Q E S S I X A 91 S R S L A A S M X T T A I P S D L W G N X A X S N A A F S S 121 X E F S S X A G S V P L G F T F X E A G A K E X V I K G Q I 151 T X Q A X A F S L A X L X K L I S A M X N A X F P A G D X X 181 X X V A D I X D S H G I L X X V N Y T D A X I K M G I I F G 211 S G V N A A Y W C D S T X I A D A A D A G X X G G A G X M X 241 V C C X Q D S F R K A F P S L P Q I X Y X X T L N X X S P X 271 A X K T F E K N S X A K N X G Q S L R D V L M X Y K X X G Q 301 X H X X X A X D F X A A N V E N S S Y P A K I Q K L P H F D 331 L R X X X D L F X G D Q G I A X K T X M K X V V R R X L F L 361 I A A Y A F R L V V C X I X A I C Q K K G Y S S G H I A A X 391 G S X R D Y S G F S X N S A T X N X N I Y G W P Q S A X X S 421 K P I X I T P A I D G E G A A X X V I X S I A S S Q X X X A 451 X X S A X X A Nov. 17, 2002 SC2002 Tutorial: Computational Biology 123
125 Secondary Structure - α helix
126 Secondary Structure - β sheet
127 Tertiary Structure
128 Quaternary Structure
129 Problems in Protein Folding 1. Folding Problem: Given the sequence of a protein, computationally determine its structure. 2. Inverse Folding Problem: Given the structure into which a protein should fold, find a possible amino acid sequence of the protein.
130 Why Should Sequence → Structure Determination Be Possible? Proteins with sequence similarity tend to have structural similarity. If a protein is deformed under external force, it quickly folds back into its unique shape after the force is removed.
131 IBM Blue Gene Project $100 M, 100,000-processor petaflop supercomputer for protein folding. Expected to simulate one protein in a year. Blue Gene/L: 65,536 processors, 32 × 32 × 64 torus (by year 2004).
132 Approach I: Molecular Dynamics Idea: Forces acting on the atoms in a protein and constraints are known. Perform simulation. Problem: Time step required is too small (10^-18 sec). Best reported simulation: 10^-6 sec. Folding requires a few seconds.
133 Approach II: Lattice Models Proteins are represented as self-avoiding walks on lattices (cubic, hexagonal etc.). Each amino acid residue is modeled as hydrophobic (H) or hydrophilic (P). Position the residues subject to the linear (chain) constraint while maximizing H-H contacts.
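The objective of the lattice model can be made concrete by scoring a given fold. The sketch below assumes a 2D square lattice (the slide mentions cubic and hexagonal lattices; the square lattice is chosen here only for brevity), encodes a fold as a string of moves U/D/L/R, and counts pairs of H residues that are lattice neighbors but not consecutive in the chain. It evaluates a fold; it does not search for the best one. All names are illustrative.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def fold_positions(moves):
    """Lattice coordinates of each residue, starting at the origin."""
    x, y = 0, 0
    positions = [(x, y)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        positions.append((x, y))
    return positions

def hh_contacts(sequence, moves):
    """Number of H-H contacts of the fold, or None if the walk crosses itself."""
    positions = fold_positions(moves)
    if len(positions) != len(sequence):
        raise ValueError("need exactly len(sequence) - 1 moves")
    if len(set(positions)) != len(positions):
        return None                                  # not self-avoiding
    index_at = {pos: i for i, pos in enumerate(positions)}
    contacts = 0
    for i, (x, y) in enumerate(positions):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighboring pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts

print(hh_contacts("HPPH", "RUL"))  # the two H residues end up side by side
```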
134 Approach II Lattice Models Problem is NP-complete. Approximation algorithms have been designed. Nov. 17, 2002 SC2002 Tutorial: Computational Biology 133
135 Hypothesis: Approach III Energy Minimization Different amino acids have different chemical, electrical and size properties. Different folds of a protein have different levels of energy. A protein folds into its minimum energy configuration. Nov. 17, 2002 SC2002 Tutorial: Computational Biology 134
136 Approach III Energy Minimization Start with a protein configuration. Compute the energy of the configuration. Incrementally fold the protein to reduce its energy. Iterate until convergence. Many known energy minimization methods. can be applied (steepest descent, simulated. annealing etc.). Nov. 17, 2002 SC2002 Tutorial: Computational Biology 135
137 Approach IV Protein Threading Find proteins with known structure that exhibit similarity to the protein to be folded. Use structures of highly similar components to determine a possible structure for the new protein. Use this structure as the basis for more computational folding operations. Nov. 17, 2002 SC2002 Tutorial: Computational Biology 136
138 Nov. 17, 2002 SC2002 Tutorial: Computational Biology 137
139 Other Problems Structure Similarity Given: The three dimensional structure of two proteins Find: the structural similarity between them. Nov. 17, 2002 SC2002 Tutorial: Computational Biology 138
140 Other Problems Protein Docking Given: A receptor molecule and a drug molecule Find: A matching between the receptor surface and the drug molecule surface maximizing the contact area between the surfaces. Nov. 17, 2002 SC2002 Tutorial: Computational Biology 139
141 Other Problems: Accessible Surface Area. Given: the three-dimensional structure of a protein. Find: the total surface area of the protein's atoms that is accessible to a solvent molecule. Atoms and the solvent molecule are modeled as spheres using van der Waals radii.
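One common way to estimate accessible surface area is point sampling, in the style of the Shrake-Rupley method. The tutorial does not say which method it has in mind, so this is a swapped-in illustration. Each atom's sphere is inflated by the probe radius, points are spread over it, and a point counts as accessible if it lies outside every other inflated sphere. The probe radius of 1.4 (a typical value for water, in angstroms) and the number of sample points are assumptions.

```python
import math
from typing import List, Tuple

Atom = Tuple[float, float, float, float]  # x, y, z, van der Waals radius


def sphere_points(n: int) -> List[Tuple[float, float, float]]:
    """n roughly uniform points on the unit sphere (golden-spiral layout)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(golden * k), r * math.sin(golden * k), z))
    return points


def accessible_surface_area(atoms: List[Atom], probe_radius: float = 1.4,
                            n_points: int = 500) -> float:
    """Estimate the total solvent-accessible surface area by point sampling."""
    unit = sphere_points(n_points)
    total = 0.0
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe_radius  # radius of the inflated sphere
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe_radius) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms) if j != i
            )
            if not buried:
                exposed += 1
        total += 4.0 * math.pi * big_r ** 2 * exposed / n_points
    return total
```

This direct version compares every sample point with every atom, so its cost grows quadratically with the number of atoms. Practical codes restrict the comparison to spatial neighbours.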
142 Part VIII: Comparative Genomics & Reconstructing Evolutionary Histories (Phylogenetic Trees)
143 Comparative Genomics. [Figure: Chicken (NCBI accession #NC_) and Human (NCBI accession #NC_).]
144 Eukaryotic Cell
145
Organism                   Est. size             Est. # genes   Average gene density
Human                      3000 million bases    ~30,000        1 gene per 100,000 bases
M. musculus (mouse)        3000 million bases    30,000         1 gene per 100,000 bases
Drosophila (fruit fly)     [?] million bases     13,061         1 gene per 13,781 bases
Arabidopsis (plant)        100 million bases     25,000         1 gene per 4,000 bases
C. elegans (roundworm)     97 million bases      19,099         1 gene per 5,079 bases
S. cerevisiae (yeast)      12.1 million bases    6,034          1 gene per 2,005 bases
E. coli (bacteria)         [?] million bases     3,237          1 gene per 1,443 bases
H. influenzae (bacteria)   1.8 million bases     1,740          1 gene per 1,034 bases
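The gene density column is the genome size divided by the estimated number of genes. A short check of that arithmetic, using only the rows whose size survives in this transcript, might look as follows (the dictionary and function names are illustrative):

```python
# (estimated genome size in bases, estimated number of genes)
GENOMES = {
    "Human": (3_000_000_000, 30_000),
    "Arabidopsis": (100_000_000, 25_000),
    "C. elegans": (97_000_000, 19_099),
    "S. cerevisiae": (12_100_000, 6_034),
    "H. influenzae": (1_800_000, 1_740),
}


def bases_per_gene(size_in_bases: int, num_genes: int) -> float:
    """Average number of bases per gene."""
    return size_in_bases / num_genes


for organism, (size, genes) in GENOMES.items():
    print(f"{organism}: 1 gene per {bases_per_gene(size, genes):,.0f} bases")
```

The computed values agree with the table to the nearest base, which shows how much more densely the small genomes are packed with genes.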
146 Phylogenetics. Find the genetic connections and relationships between species (or sequences). Hypothesis: All existing organisms are derived from some common ancestor. A new species arises by the splitting of one population into two (or more) populations that do not cross-breed.
147 Phylogenetic Trees. Each species (or sequence) is described by a set of traits (called characters). Leaves of the tree are labeled with input species. Internal nodes are labeled with input or inferred species. Edges represent transitions in the values of certain traits.
148 Types of Phylogenies. Relationships between taxa: species trees, gene trees. Data: morphological, nuclear genome, organelle genome. Tree of Life Web (Maddison/Maddison).
149 Example Phylogenies. [Figures: Campanulaceae (bluebell flowers): Wahlenbergia, Merciera, Trachelium, Symphyandra, Campanula, Adenophora, Legousia, Asyneuma, Triodanus, Codonopsis, Cyananthus, Platycodon, and Tobacco. Some herpesviruses known to affect humans: HHV6, EBV, HHV7, HVS, EHV2, KHSV, HSV1, VZV, HSV2, PRV, EHV1, HCMV. Leeches.]
150 Techniques. Maximum parsimony: Occam's razor; the simplest explanation for evolution minimizes the sum of the number of evolutionary events along the tree branches. Maximum likelihood: statistical methods that use an evolutionary model, such as the transition/transversion rate ratio for the nuclear genome.
151 Genomic Parsimony: Examples of Characters. A specific nucleotide in a fixed position of a DNA sequence (conserved in all examined species). Does the amino acid sequence for a protein contain a specific subsequence? Is the expression of a certain protein regulated by another particular protein?
152 Example (six character states, then the species)
A T G C G T   Elephant
T C A C G T   Dog
A T G G G C   Chimp
A C A G A C   Bison
A T G G A C   Aardvark
153 Example. [Figure: tree over Aardvark, Bison, Chimp, Dog, and Elephant; only the labels ", 5" and "4, 5, 6" survive from the figure.]
154 Perfect Phylogeny for Binary Characters. Given: an n × m 0-1 matrix representing n species and m binary characters. Find: a phylogenetic tree T such that the root of the tree represents an ancestor that has none of the m characters, and each character changes from 0 to 1 exactly once and never changes back.
155 Example. [Figure: a 0-1 matrix M over species A, B, C, D, E and the corresponding tree.] Runs in O(mn) time (Gusfield91).
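A binary matrix admits a perfect phylogeny of this kind exactly when, for every pair of characters, the sets of species possessing them are either disjoint or nested. The sketch below tests that condition directly. It is a simple pairwise check costing O(nm^2), not the O(mn) algorithm cited on the slide, and the function name is illustrative.

```python
from itertools import combinations
from typing import List


def admits_perfect_phylogeny(matrix: List[List[int]]) -> bool:
    """matrix[i][j] == 1 if species i has character j (rows are species)."""
    if not matrix:
        return True
    num_chars = len(matrix[0])
    # For each character, the set of species (row indices) that possess it.
    carriers = [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(num_chars)
    ]
    for a, b in combinations(carriers, 2):
        overlapping = bool(a & b)
        nested = a <= b or b <= a
        if overlapping and not nested:
            return False  # the two characters conflict
    return True
```

When the test succeeds, the nesting of the carrier sets gives the tree: each character labels the edge above the subtree containing exactly its carriers.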
156 Perfect Phylogeny for Non-Binary Characters. n species, m characters, at most r states per character. NP-complete. Polynomial time for any fixed r (Agarwala94).
157 Parsimony. Parsimony score of a tree = total number of character changes in the tree: $P(T) = \sum_{(u,v) \in E(T)} |\{ j : u_j \neq v_j \}|$, where $u_j$ is the state of character j at node u.
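For a tree whose nodes are all labeled, the score is a sum of mismatches over the edges. A minimal sketch, assuming each node label is an equal-length string of character states and the tree is given as an edge list (both are representation choices made for this example):

```python
from typing import Dict, List, Tuple


def parsimony_score(edges: List[Tuple[str, str]], labels: Dict[str, str]) -> int:
    """Sum over edges (u, v) of the number of positions j where u_j != v_j."""
    score = 0
    for u, v in edges:
        score += sum(1 for a, b in zip(labels[u], labels[v]) if a != b)
    return score
```

With hypothetical labels x = "ATG", y = "ATC", z = "GTC" on the path x-y-z, each edge contributes one change, so the score is 2.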
158 A Simpler Problem: Known Tree. Given: a phylogenetic tree. Find: the minimum parsimony score and an optimal labeling of the internal nodes. Can be solved in O(nmr) time [Fitch71].
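The bottom-up pass of Fitch's method can be sketched as follows, assuming a rooted binary tree written as nested pairs whose leaves are equal-length strings of character states. At each internal node and each character, the candidate state set is the intersection of the children's sets if that is non-empty, and otherwise their union at a cost of one change. Only the score is returned here; recovering an optimal labeling needs an additional top-down pass that is omitted.

```python
from typing import List, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_score(tree: Tree) -> int:
    """Minimum number of character changes on a fixed rooted binary tree."""

    def visit(node: Tree) -> Tuple[List[Set[str]], int]:
        if isinstance(node, str):  # leaf: one singleton set per character
            return [{state} for state in node], 0
        left_sets, left_cost = visit(node[0])
        right_sets, right_cost = visit(node[1])
        sets, cost = [], left_cost + right_cost
        for a, b in zip(left_sets, right_sets):
            common = a & b
            if common:
                sets.append(common)
            else:
                sets.append(a | b)
                cost += 1  # the children disagree: one change is needed
        return sets, cost

    return visit(tree)[1]
```

For instance, `fitch_score((("A", "A"), ("C", "C")))` needs a single change, while `fitch_score((("A", "C"), ("A", "C")))` needs two.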
159 Parsimony Problem. NP-hard. Techniques used: branch and bound [Hendy & Penny 82]; Neighbor-Joining [Saitou & Nei 87, Studier & Keppler 88].
160 Exploiting data about gene content and gene order has proved extremely challenging from a computational perspective: tasks that can easily be carried out in linear time for DNA data have required entirely new theories (such as the computation of inversion distance) or appear to be NP-hard. The focus has thus been on simple genomes, preferably genomes consisting of a single chromosome, where evolution can reasonably be assumed to have been driven mostly through gene order changes.
161 Cell Organelles. Chloroplasts and mitochondria have such genomes: around 120 genes for the chloroplasts of higher plants and typically 37 genes for the mitochondria of multicellular animals, in both cases packed onto a single chromosome. The gene content of these genomes is fairly constant across a wide phylogenetic range; differences are mostly in the ordering of the genes. [Figures: chloroplast, mitochondria.]
162 Gene Order Phylogeny. Many organelles appear to evolve mostly through processes that simply rearrange gene ordering (inversion, transposition) and perhaps alter gene content (duplication, loss). Chloroplasts have a single, typically circular, chromosome and appear to evolve mostly through inversion: ... i-1, i, ..., j, j+1 ... becomes ... i-1, -j, ..., -i, j+1 ... The sequence of genes i, i+1, ..., j is inverted and every gene is flipped.
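On a signed gene order, an inversion reverses a segment and negates every gene in it. A minimal sketch, assuming the genome is a Python list of signed integers treated as linear, with 0-based inclusive indices (the function name is illustrative):

```python
from typing import List


def invert(genome: List[int], i: int, j: int) -> List[int]:
    """Return a copy of genome with positions i..j reversed and sign-flipped."""
    if not 0 <= i <= j < len(genome):
        raise ValueError("need 0 <= i <= j < len(genome)")
    flipped = [-gene for gene in reversed(genome[i:j + 1])]
    return genome[:i] + flipped + genome[j + 1:]
```

Inverting genes 2 through 4 of 1, 2, 3, 4, 5 gives 1, -4, -3, -2, 5. Applying the same inversion a second time restores the original order.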
163 Phylogeny Heuristic Search [BPanalysis, Sankoff & Blanchette 98]
For each tree topology do    [(2n-5)!! = (2n-5) (2n-7) ... 5 3 trees]
    assign initial genomes to the internal nodes    [somehow]
    repeat    [unknown iterative heuristic]
        for each internal node do
            compute a new genome that minimizes the distances to its three neighbors    [NP-hard]
            replace old genome by new if distance is reduced
    until no change
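The size of the search space follows from the double factorial (2n-5)!!, the number of unrooted binary tree topologies on n labeled leaves. A short sketch (the function name is illustrative):

```python
def num_unrooted_binary_trees(n: int) -> int:
    """(2n-5)!! = 3 * 5 * ... * (2n-5), the number of topologies on n leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count
```

For n = 13 this gives 13,749,310,575, which appears consistent with the "14 billion trees" quoted later for the Campanulaceae analysis (twelve genera plus tobacco).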
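164 Lower Bounding of a Tree. [Figure: a tree with leaves a, b, c, d, e, and its path version with path lengths d(a,b), d(b,c), d(c,d), d(d,e), d(e,a).] Sum around the tree = d(a,b) + d(b,c) + d(c,d) + d(d,e) + d(e,a). Walking around the tree covers each edge twice, and by the triangle inequality each leaf-to-leaf path is at least as long as the corresponding pairwise distance, so half of this sum is a lower bound on the tree length. (Same trick as in the twice-around-the-tree approximation for the TSP with triangle inequality.)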
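The bound can be computed from pairwise distances alone, without labeling the internal nodes. The sketch below assumes the leaves are listed in the circular order in which they are met when walking around the tree, and that `dist` is any symmetric distance obeying the triangle inequality. Both names are illustrative.

```python
from typing import Callable, Hashable, Sequence


def circular_lower_bound(leaf_order: Sequence[Hashable],
                         dist: Callable[[Hashable, Hashable], float]) -> float:
    """Half the sum of distances between circularly consecutive leaves."""
    k = len(leaf_order)
    around = sum(dist(leaf_order[i], leaf_order[(i + 1) % k]) for i in range(k))
    return around / 2
```

A tree whose bound already exceeds the best score found so far can be discarded without scoring it, which is how such a bound is typically used in a branch-and-bound or filtering step.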
165 Parallelization of the Phylogeny Algorithm. Enumerating tree topologies is pleasantly parallel and allows multiple processors to independently search the tree space with little or no overhead. Load is evenly balanced when trees are cyclically assigned (e.g., in a round-robin fashion) to the processors. Linear speedup.
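Cyclic assignment needs no communication: if the trees are numbered in enumeration order, processor r out of p simply takes trees r, r+p, r+2p, and so on. A sketch of that indexing (the function name is hypothetical, and how a tree is generated from its index is not shown):

```python
def trees_for_processor(num_trees: int, num_procs: int, rank: int) -> range:
    """Indices of the trees handled by processor `rank` under round-robin."""
    if not 0 <= rank < num_procs:
        raise ValueError("rank must satisfy 0 <= rank < num_procs")
    return range(rank, num_trees, num_procs)
```

Every tree is assigned to exactly one processor, and the share sizes differ by at most one, which is the even load balance the slide refers to.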
166 High-performance implementations enable: better approximations for difficult problems (MP, ML); true optimization for larger instances; realistic data exploration (e.g., testing evolutionary scenarios, assessing answers obtained through other means, etc.); use of more biologically meaningful models (inversions, transpositions, gene loss/duplication).
167 Inversion Distance (Hannenhalli-Pevzner Theory). NP-hard for unsigned permutations [Caprara 97]. Polynomial for signed permutations [Hannenhalli & Pevzner 95]. Compute combinatorial terms from the cycle graph: d = b - c + h + f [Bafna & Pevzner 93, Setubal & Meidanis 97], where b = number of breakpoints, c = number of cycles, h = number of hurdles, and f = 1 if there is a fortress and 0 otherwise. O(n α(n)) time [Berman & Hannenhalli 96], where α(n) is the inverse Ackermann function (practically a constant no greater than 4). New result: O(n) inversion distance [Bader, Moret, Yan 01], a faster and simpler algorithm, both in theory and in practice.
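For intuition, and as a reference against which fast implementations can be tested on tiny inputs, the inversion distance of a signed permutation can be found by brute-force breadth-first search. This is not the Hannenhalli-Pevzner algorithm and not the linear-time method cited above: it explores up to 2^n n! states and is only practical for very small n. It assumes a linear genome and the identity 1, 2, ..., n as the target.

```python
from collections import deque
from typing import Sequence


def inversion_distance_bfs(perm: Sequence[int]) -> int:
    """Fewest inversions turning the signed permutation into +1, +2, ..., +n."""
    start = tuple(perm)
    n = len(start)
    target = tuple(range(1, n + 1))
    if sorted(abs(g) for g in start) != list(target):
        raise ValueError("perm must be a signed permutation of 1..n")
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, d = queue.popleft()
        if current == target:
            return d
        for i in range(n):
            for j in range(i, n):
                # Reverse the segment i..j and flip the sign of every gene in it.
                segment = tuple(-g for g in reversed(current[i:j + 1]))
                nxt = current[:i] + segment + current[j + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    raise RuntimeError("unreachable: every signed permutation can be sorted")
```

For example, -2, -1 is sorted by one inversion of the whole sequence, whereas +2, +1 needs three.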
168 Challenges in Phylogeny. Exact inversion median-of-three [Siepel02]. Tree enumeration using circular ordering. Handling unequal gene content and duplicate genes (using exemplars?). Parallel branch and bound techniques for searching tree space. Improved SPR and TBR techniques (local searches around good trees).
169 Additional Challenges. Network evolution. Recombination events. Large-scale phylogeny reconstruction. Comparison and accuracy of techniques and heuristics.
170 Parsimony Codes. Phylip (Felsenstein); Hennig86 (Farris); Nona (Goloboff) and TNT (Goloboff, Farris, Nixon); PAUP* (Swofford); MEGA (Kumar, Tamura, Jakobsen, Nei); GRAPPA (Bader, Moret, Warnow).
171 Likelihood Codes. Phylip (Felsenstein); PAUP* (Swofford); PAML (Yang); FastDNAml (Olsen, Matsuda, Hagstrom, Overbeek). See also Felsenstein's List of Software.
172 GRAPPA: Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms. Open-source; already used by other computational phylogeny groups: Caprara, Pevzner, LANL, FBI, Smithsonian Institute, PharmCos. Gene-order phylogeny reconstruction: breakpoint median, inversion median. Over one-billion-fold speedup from previous codes. Parallelism scales linearly with the number of processors. [Bader, Moret, Warnow]
173 Using GRAPPA to Solve Campanulaceae Phylogeny. On the 512-processor cluster LosLobos at U. New Mexico, we ran the full analysis (all 14 billion trees) in under 1.5 hours, a 1,000,000-fold speedup (and using true inversion distance). The current release of GRAPPA (v. 1.6) now takes minutes to solve the same problem on several processors.
174 Campanulaceae. Bob Jansen, UT-Austin; Linda Raubeson, Central Washington U. [Figure: tree including Tobacco.]
175 Epilogue
176 Epilogue. Uses of computation in biology: 1. Discovering information from large data sets (e.g., database searches). 2. Relating micro-behavior to macro-behavior (e.g., protein folding). 3. Extending experimental capabilities (e.g., genome sequencing).
177 Epilogue. Computation will be an integral part of future biological discoveries. Computational biology is an exciting interdisciplinary area that will become increasingly important in the future.
178 Bookshelf
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
D. Graur and W.-H. Li. Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, MA, second edition.
D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
179 Bookshelf
D.M. Hillis, C. Moritz, and B.K. Mable, eds. Molecular Systematics. Sinauer Associates, Sunderland, MA, second edition.
M. Nei and S. Kumar. Molecular Evolution and Phylogenetics. Oxford University Press, Oxford, UK.
P.A. Pevzner. Computational Molecular Biology: An Algorithmic Approach. MIT Press, Inc., Cambridge, MA.
180 Bookshelf
D. Sankoff and J. Kruskal, eds. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
J.C. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS, Boston, MA.
D. Sankoff and J.H. Nadeau, eds. Comparative Genomics: Empirical and Analytic Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families, volume 1 of Computational Biology. Kluwer Academic Publishers, Dordrecht, The Netherlands.
M.S. Waterman. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall / CRC, Boca Raton, FL.
181 Related & Referenced Publications
M.I. Abouelhoda, S. Kurtz and E. Ohlebusch, The enhanced suffix array and its applications to genome analysis, 2nd Workshop on Algorithms in Bioinformatics.
R. Agarwala, D. Fernández-Baca, A polynomial-time algorithm for the perfect phylogeny problem when the number of character-states is fixed, SIAM J. Comp., 23(6).
S. Aluru, N. Futamura and K. Mehrotra, Biological sequence comparison using prefix computations, Proc. 13th IEEE Int'l Parallel Processing Symposium.
D.A. Bader, B.M.E. Moret, and L. Vawter, Industrial Applications of High-Performance Computing for Phylogeny Reconstruction, ITCom: Commercial Applications for High-Performance Computing, SPIE Vol. 4528.
D.A. Bader, B.M.E. Moret, and M. Yan, A Linear-Time Algorithm for Computing Inversion Distance Between Two Signed Permutations with an Experimental Study, Journal of Computational Biology, 8(5).
D.A. Bader, B.M.E. Moret, and P. Sanders, Algorithm Engineering for Parallel Computation, Experimental Algorithmics, Springer Verlag Lecture Notes in Computer Science, 2547:1-23.
D.A. Bader, S. Sreshta, and N.R. Weisse-Bernstein, Evaluating arithmetic expressions using tree contraction: A fast and scalable parallel implementation for symmetric multiprocessors (SMPs), Proc. 9th IEEE Int'l Conf. High-Performance Computing, 2002, to appear.
182 Related & Referenced Publications
V. Bafna and P.A. Pevzner, Genome rearrangements and sorting by reversals, Proc. 34th Ann. IEEE Symp. Foundations of Computer Science, 1993.
V. Bafna and P. Pevzner, Sorting permutations by transpositions, Proc. 6th Ann. Symp. Discrete Algorithms.
P. Berman and S. Hannenhalli, Fast sorting by reversal, Proc. 7th Ann. Symp. Combinatorial Pattern Matching.
K. Booth and G. Lueker, Testing for consecutive ones property, interval graphs and graph planarity testing using pq-tree algorithms, J. Comp. Sys. Sci., 13.
A. Caprara, Sorting by reversals is difficult, Proc. 1st ACM Conf. Computational Molecular Biology.
D.R. Clark and J.I. Munro, Efficient suffix trees on secondary storage, Proc. ACM-SIAM Symp. on Discrete Algorithms.
E. Edmiston, N. Core, J. Saltz, and R. Smith, Parallel processing of biological sequence comparison algorithms, Int'l Journal of Parallel Programming, 17(3).
J. Felsenstein, Evolutionary trees from DNA sequences: A maximum likelihood approach, Journal of Molecular Evolution, 17.
P. Ferragina and R. Grossi, Fast incremental text editing, Journal of Algorithms, 31. Also ACM-SIAM Symp. on Discrete Algorithms.
W.M. Fitch, Towards defining the course of evolution: Minimum change for a specific tree topology, Syst. Zool., 20.
183 Related & Referenced Publications
W.M. Fitch and E. Margoliash, Construction of phylogenetic trees, Science, 155.
N. Futamura, S. Aluru and X. Huang, Parallel syntenic alignments, Proc. 9th IEEE Int'l Conf. on High Performance Computing, to appear.
N. Futamura, S. Aluru, D. Ranjan and B. Hariharan, Efficient parallel algorithms for solvent accessible surface area of proteins, IEEE Trans. on Parallel and Distributed Systems, 13(6).
D. Gusfield, Efficient algorithms for inferring evolutionary trees, Networks, 21:19-28.
S. Hannenhalli and P.A. Pevzner, Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals), Proc. 27th ACM Ann. Symp. Theory of Computing.
R. Hariharan, Optimal parallel suffix tree construction, Proc. 26th IEEE Symp. Found. Computer Science.
M.D. Hendy and D. Penny, Branch and bound algorithms to determine minimal evolutionary trees, Mathematical Biosciences, 59.
X. Huang, A space-efficient parallel sequence comparison algorithm for a message-passing multiprocessor, Int'l Journal of Parallel Programming, 18(3).
X. Huang and A. Madan, CAP3: A DNA sequence assembly program, Genome Research, 9(9).
Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its
More informationMETHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.
Chapter 12 (Strikberger) Molecular Phylogenies and Evolution METHODS FOR DETERMINING PHYLOGENY In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task. Modern
More informationMATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME
MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:
More informationEstimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057
Estimating Phylogenies (Evolutionary Trees) II Biol4230 Thurs, March 2, 2017 Bill Pearson wrp@virginia.edu 4-2818 Jordan 6-057 Tree estimation strategies: Parsimony?no model, simply count minimum number
More informationSimilarity or Identity? When are molecules similar?
Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are
More informationConstructing Evolutionary/Phylogenetic Trees
Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood
More informationBioinformatics and BLAST
Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists
More informationBME 5742 Biosystems Modeling and Control
BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various
More informationSequence Database Search Techniques I: Blast and PatternHunter tools
Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered
More informationSequence Based Bioinformatics
Structural and Functional Analysis of Inosine Monophosphate Dehydrogenase using Sequence-Based Bioinformatics Barry Sexton 1,2 and Troy Wymore 3 1 Bioengineering and Bioinformatics Summer Institute, Department
More informationTHE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT
COMMUNICATIONS IN INFORMATION AND SYSTEMS c 2009 International Press Vol. 9, No. 4, pp. 295-302, 2009 001 THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT DAN GUSFIELD AND YUFENG WU Abstract.
More informationProcedure to Create NCBI KOGS
Procedure to Create NCBI KOGS full details in: Tatusov et al (2003) BMC Bioinformatics 4:41. 1. Detect and mask typical repetitive domains Reason: masking prevents spurious lumping of non-orthologs based
More informationAlgorithms in Bioinformatics
Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology
More informationInvestigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST
Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Introduction Bioinformatics is a powerful tool which can be used to determine evolutionary relationships and
More informationProtein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.
Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein
More information08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega
BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments
More informationBio nformatics. Lecture 3. Saad Mneimneh
Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per
More informationComparative Bioinformatics Midterm II Fall 2004
Comparative Bioinformatics Midterm II Fall 2004 Objective Answer, part I: For each of the following, select the single best answer or completion of the phrase. (3 points each) 1. Deinococcus radiodurans
More informationSmall RNA in rice genome
Vol. 45 No. 5 SCIENCE IN CHINA (Series C) October 2002 Small RNA in rice genome WANG Kai ( 1, ZHU Xiaopeng ( 2, ZHONG Lan ( 1,3 & CHEN Runsheng ( 1,2 1. Beijing Genomics Institute/Center of Genomics and
More informationNewly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:
m Eukaryotic mrna processing Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus: Cap structure a modified guanine base is added to the 5 end. Poly-A tail
More informationSEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA
SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS 1 Prokaryotes and Eukaryotes 2 DNA and RNA 3 4 Double helix structure Codons Codons are triplets of bases from the RNA sequence. Each triplet defines an amino-acid.
More informationSingle alignment: Substitution Matrix. 16 march 2017
Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block
More informationEvolutionary Tree Analysis. Overview
CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based
More informationPhylogenetic analyses. Kirsi Kostamo
Phylogenetic analyses Kirsi Kostamo The aim: To construct a visual representation (a tree) to describe the assumed evolution occurring between and among different groups (individuals, populations, species,
More informationPacking of Secondary Structures
7.88 Lecture Notes - 4 7.24/7.88J/5.48J The Protein Folding and Human Disease Professor Gossard Retrieving, Viewing Protein Structures from the Protein Data Base Helix helix packing Packing of Secondary
More informationBLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010
BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for
More informationIntroduction to Bioinformatics Online Course: IBT
Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple
More informationProperties of amino acids in proteins
Properties of amino acids in proteins one of the primary roles of DNA (but not the only one!) is to code for proteins A typical bacterium builds thousands types of proteins, all from ~20 amino acids repeated
More informationPhylogenetic Reconstruction from Gene-Order Data
p.1/7 Phylogenetic Reconstruction from Gene-Order Data Bernard M.E. Moret compbio.unm.edu Department of Computer Science University of New Mexico p.2/7 Acknowledgments Close Collaborators: at UNM: David
More informationAlgorithms for Bioinformatics
Adapted from slides by Alexandru Tomescu, Leena Salmela, Veli Mäkinen, Esa Pitkänen 582670 Algorithms for Bioinformatics Lecture 5: Combinatorial Algorithms and Genomic Rearrangements 1.10.2015 Background
More informationPhylogenetic Reconstruction
Phylogenetic Reconstruction from Gene-Order Data Bernard M.E. Moret compbio.unm.edu Department of Computer Science University of New Mexico p. 1/71 Acknowledgments Close Collaborators: at UNM: David Bader
More information9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)
I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by
More informationQuiz answers. Allele. BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 17: The Quiz (and back to Eukaryotic DNA)
BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 17: The Quiz (and back to Eukaryotic DNA) http://compbio.uchsc.edu/hunter/bio5099 Larry.Hunter@uchsc.edu Quiz answers Kinase: An enzyme
More informationMassachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution
Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral
More informationBIOINFORMATICS: An Introduction
BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and
More informationAnalysis of Gene Order Evolution beyond Single-Copy Genes
Analysis of Gene Order Evolution beyond Single-Copy Genes Nadia El-Mabrouk Département d Informatique et de Recherche Opérationnelle Université de Montréal mabrouk@iro.umontreal.ca David Sankoff Department
More informationStatistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic
More informationPage 1. Evolutionary Trees. Why build evolutionary tree? Outline
Page Evolutionary Trees Russ. ltman MI S 7 Outline. Why build evolutionary trees?. istance-based vs. character-based methods. istance-based: Ultrametric Trees dditive Trees. haracter-based: Perfect phylogeny
More informationApplications of genome alignment
Applications of genome alignment Comparing different genome assemblies Locating genome duplications and conserved segments Gene finding through comparative genomics Analyzing pathogenic bacteria against
More informationO 3 O 4 O 5. q 3. q 4. Transition
Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in
More informationNJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees
NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana
More information