List of Code Challenges. Meet the Authors Meet the Development Team... xxxii Meet our Adopting Institutions... xxxiv Acknowledgments...

Contents

List of Code Challenges
Meet the Authors
Meet the Development Team
Meet our Adopting Institutions
Acknowledgments

1 Where in the Genome Does DNA Replication Begin?
  A Journey of a Thousand Miles
  Hidden Messages in the Replication Origin
  DnaA boxes
  Hidden messages in The Gold-Bug
  Counting words
  The Frequent Words Problem
  Frequent words in Vibrio cholerae
  Some Hidden Messages are More Surprising than Others
  An Explosion of Hidden Messages
  Looking for hidden messages in multiple genomes
  The Clump Finding Problem
  The Simplest Way to Replicate DNA
  Asymmetry of Replication
  Peculiar Statistics of the Forward and Reverse Half-Strands
  Lurking biological phenomenon or statistical fluke?
  Deamination
  The skew diagram
  Some Hidden Messages are More Elusive than Others
  A Final Attempt at Finding DnaA Boxes in E. coli
  Epilogue: Complications in ori Predictions
  Open Problems
  Multiple replication origins in a bacterial genome
  Finding replication origins in archaea
  Finding replication origins in yeast
  Computing probabilities of patterns in a string
  Charging Stations
  The frequency array
  Converting patterns to numbers and vice-versa
  Finding frequent words by sorting
  Solving the Clump Finding Problem
  Solving the Frequent Words with Mismatches Problem
  Generating the neighborhood of a string
  Finding frequent words with mismatches by sorting
  Detours
  Big-O notation
  Probabilities of patterns in a string
  The most beautiful experiment in biology
  Directionality of DNA strands
  The Towers of Hanoi
  The overlapping words paradox
  Bibliography Notes

2 Which DNA Patterns Play the Role of Molecular Clocks?
  Do We Have a Clock Gene?
  Motif Finding Is More Difficult Than You Think
  Identifying the evening element
  Hide and seek with motifs
  A brute force algorithm for motif finding
  Scoring Motifs
  From motifs to profile matrices and consensus strings
  Towards a more adequate motif scoring function
  Entropy and the motif logo
  From Motif Finding to Finding a Median String
  The Motif Finding Problem
  Reformulating the Motif Finding Problem
  The Median String Problem
  Why have we reformulated the Motif Finding Problem?
  Greedy Motif Search
  Using the profile matrix to roll dice
  Analyzing greedy motif finding
  Motif Finding Meets Oliver Cromwell
  What is the probability that the sun will not rise tomorrow?
  Laplace's Rule of Succession
  An improved greedy motif search
  Randomized Motif Search
  Rolling dice to find motifs
  Why randomized motif search works
  How Can a Randomized Algorithm Perform So Well?
  Gibbs Sampling
  Gibbs Sampling in Action
  Epilogue: How Does Tuberculosis Hibernate to Hide from Antibiotics?
  Charging Stations
  Solving the Median String Problem
  Detours
  Gene expression
  DNA arrays
  Buffon's needle
  Complications in motif finding
  Relative entropy
  Bibliography Notes

3 How Do We Assemble Genomes?
  Exploding Newspapers
  The String Reconstruction Problem
  Genome assembly is more difficult than you think
  Reconstructing strings from k-mers
  Repeats complicate genome assembly
  String Reconstruction as a Walk in the Overlap Graph
  From a string to a graph
  The genome vanishes
  Two graph representations
  Hamiltonian paths and universal strings
  Another Graph for String Reconstruction
  Gluing nodes and de Bruijn graphs
  Walking in the de Bruijn Graph
  Eulerian paths
  Another way to construct de Bruijn graphs
  Constructing de Bruijn graphs from k-mer composition
  De Bruijn graphs versus overlap graphs
  The Seven Bridges of Königsberg
  Euler's Theorem
  From Euler's Theorem to an Algorithm for Finding Eulerian Cycles
  Constructing Eulerian cycles
  From Eulerian cycles to Eulerian paths
  Constructing universal strings
  Assembling Genomes from Read-Pairs
  From reads to read-pairs
  Transforming read-pairs into long virtual reads
  From composition to paired composition
  Paired de Bruijn graphs
  A pitfall of paired de Bruijn graphs
  Epilogue: Genome Assembly Faces Real Sequencing Data
  Breaking reads into k-mers
  Splitting the genome into contigs
  Assembling error-prone reads
  Inferring multiplicities of edges in de Bruijn graphs
  Charging Stations
  The effect of gluing on the adjacency matrix
  Generating all Eulerian cycles
  Reconstructing a string spelled by a path in the paired de Bruijn graph
  Maximal non-branching paths in a graph
  Detours
  A short history of DNA sequencing technologies
  Repeats in the human genome
  Graphs
  The icosian game
  Tractable and intractable problems
  From Euler to Hamilton to de Bruijn
  The seven bridges of Kaliningrad
  Pitfalls of assembling double-stranded DNA
  The BEST Theorem
  Bibliography Notes

4 How Do We Sequence Antibiotics?
  The Discovery of Antibiotics
  How Do Bacteria Make Antibiotics?
  How peptides are encoded by the genome
  Where is Tyrocidine encoded in the Bacillus brevis genome?
  From linear to cyclic peptides
  Dodging the Central Dogma of Molecular Biology
  Sequencing Antibiotics by Shattering Them into Pieces
  Introduction to mass spectrometry
  The Cyclopeptide Sequencing Problem
  A Brute Force Algorithm for Cyclopeptide Sequencing
  A Branch-and-Bound Algorithm for Cyclopeptide Sequencing
  Mass Spectrometry Meets Golf
  From theoretical to real spectra
  Adapting cyclopeptide sequencing for spectra with errors
  From 20 to More than 100 Amino Acids
  The Spectral Convolution Saves the Day
  Epilogue: From Simulated to Real Spectra
  Open Problems
  The Beltway and Turnpike Problems
  Sequencing cyclic peptides in primates
  Charging Stations
  Generating the theoretical spectrum of a peptide
  How fast is CYCLOPEPTIDESEQUENCING?
  Trimming the peptide leaderboard
  Detours
  Gause and Lysenkoism
  Discovery of codons
  Quorum sensing
  Molecular mass
  Selenocysteine and pyrrolysine
  Pseudo-polynomial algorithm for the Turnpike Problem
  Split genes
  Bibliography Notes

5 How Do We Compare Biological Sequences?
  Cracking the Non-Ribosomal Code
  The RNA Tie Club
  From protein comparison to the non-ribosomal code
  What do oncogenes and growth factors have in common?
  Introduction to Sequence Alignment
  Sequence alignment as a game
  Sequence alignment and the longest common subsequence
  The Manhattan Tourist Problem
  What is the best sightseeing strategy?
  Sightseeing in an arbitrary directed graph
  Sequence Alignment is the Manhattan Tourist Problem in Disguise
  An Introduction to Dynamic Programming: The Change Problem
  Changing money greedily
  Changing money recursively
  Changing money using dynamic programming
  The Manhattan Tourist Problem Revisited
  From Manhattan to an Arbitrary Directed Acyclic Graph
  Sequence alignment as building a Manhattan-like graph
  Dynamic programming in an arbitrary DAG
  Topological orderings
  Backtracking in the Alignment Graph
  Scoring Alignments
  What is wrong with the LCS scoring model?
  Scoring matrices
  From Global to Local Alignment
  Global alignment
  Limitations of global alignment
  Free taxi rides in the alignment graph
  The Changing Faces of Sequence Alignment
  Edit distance
  Fitting alignment
  Overlap alignment
  Penalizing Insertions and Deletions in Sequence Alignment
  Affine gap penalties
  Building Manhattan on three levels
  Space-Efficient Sequence Alignment
  Computing alignment score using linear memory
  The Middle Node Problem
  A surprisingly fast and memory-efficient alignment algorithm
  The Middle Edge Problem
  Epilogue: Multiple Sequence Alignment
  Building a three-dimensional Manhattan
  A greedy multiple alignment algorithm
  Detours
  Fireflies and the non-ribosomal code
  Finding a longest common subsequence without building a city
  Constructing a topological ordering
  PAM scoring matrices
  Divide-and-conquer algorithms
  Scoring multiple alignments
  Bibliography Notes

6 Are There Fragile Regions in the Human Genome?
  Of Mice and Men
  How different are the human and mouse genomes?
  Synteny blocks
  Reversals
  Rearrangement hotspots
  The Random Breakage Model of Chromosome Evolution
  Sorting by Reversals
  A Greedy Heuristic for Sorting by Reversals
  Breakpoints
  What are breakpoints?
  Counting breakpoints
  Sorting by reversals as breakpoint elimination
  Rearrangements in Tumor Genomes
  From Unichromosomal to Multichromosomal Genomes
  Translocations, fusions, and fissions
  From a genome to a graph
  2-breaks
  Breakpoint Graphs
  Computing the 2-Break Distance
  Rearrangement Hotspots in the Human Genome
  The Random Breakage Model meets the 2-Break Distance Theorem
  The Fragile Breakage Model
  Epilogue: Synteny Block Construction
  Genomic dot-plots
  Finding shared k-mers
  Constructing synteny blocks from shared k-mers
  Synteny blocks as connected components in graphs
  Open Problem: Can Rearrangements Shed Light on Bacterial Evolution?
  Charging Stations
  From genomes to the breakpoint graph
  Solving the 2-Break Sorting Problem
  Detours
  Why is the gene content of mammalian X chromosomes so conserved?
  Discovery of genome rearrangements
  The exponential distribution
  Bill Gates and David X. Cohen flip pancakes
  Sorting linear permutations by reversals
  Bibliography Notes

7 Which Animal Gave Us SARS?
  The Fastest Outbreak
  Trouble at the Metropole Hotel
  The evolution of SARS
  Transforming Distance Matrices into Evolutionary Trees
  Constructing a distance matrix from coronavirus genomes
  Evolutionary trees as graphs
  Distance-based phylogeny construction
  Toward An Algorithm for Distance-Based Phylogeny Construction
  A quest for neighboring leaves
  Computing limb lengths
  Additive Phylogeny
  Trimming the tree
  Attaching a limb
  An algorithm for distance-based phylogeny construction
  Constructing an evolutionary tree of coronaviruses
  Using Least Squares to Construct Approximate Distance-Based Phylogenies
  Ultrametric Evolutionary Trees
  The Neighbor-Joining Algorithm
  Transforming a distance matrix into a neighbor-joining matrix
  Analyzing coronaviruses with the neighbor-joining algorithm
  Limitations of distance-based approaches to tree construction
  Character-Based Tree Reconstruction
  Character tables
  From anatomical to genetic characters
  How many times has evolution invented insect wings?
  The Small Parsimony Problem
  The Large Parsimony Problem
  Epilogue: Evolutionary Trees Fight Crime
  Detours
  When did HIV jump from primates to humans?
  Searching for a tree fitting a distance matrix
  The four point condition
  Did bats give us SARS?
  Why does the neighbor-joining algorithm find neighboring leaves?
  Computing limb lengths in the neighbor-joining algorithm
  Giant panda: bear or raccoon?
  Where did humans come from?
  Bibliography Notes

8 How Did Yeast Become a Wine Maker?
  An Evolutionary History of Wine Making
  How long have we been addicted to alcohol?
  The diauxic shift
  Identifying Genes Responsible for the Diauxic Shift
  Two evolutionary hypotheses with different fates
  Which yeast genes drive the diauxic shift?
  Introduction to Clustering
  Gene expression analysis
  Clustering yeast genes
  The Good Clustering Principle
  Clustering as an Optimization Problem
  Farthest First Traversal
  k-means Clustering
  Squared error distortion
  k-means clustering and the center of gravity
  The Lloyd Algorithm
  From centers to clusters and back again
  Initializing the Lloyd algorithm
  k-means++ Initializer
  Clustering Genes Implicated in the Diauxic Shift
  Limitations of k-means Clustering
  From Coin Flipping to k-means Clustering
  Flipping coins with unknown biases
  Where is the computational problem?
  From coin flipping to the Lloyd algorithm
  Return to clustering
  Making Soft Decisions in Coin Flipping
  Expectation maximization: the E-step
  Expectation maximization: the M-step
  The expectation maximization algorithm
  Soft k-means Clustering
  Applying expectation maximization to clustering
  Centers to soft clusters
  Soft clusters to centers
  Hierarchical Clustering
  Introduction to distance-based clustering
  Inferring clusters from a tree
  Analyzing the diauxic shift with hierarchical clustering
  Epilogue: Clustering Tumor Samples
  Detours
  Whole genome duplication or a series of duplications?
  Measuring gene expression
  Microarrays
  Proof of the Center of Gravity Theorem
  Transforming an expression matrix into a distance/similarity matrix
  Clustering and corrupted cliques
  Bibliography Notes

9 How Do We Locate Disease-Causing Mutations?
  What Causes Ohdo Syndrome?
  Introduction to Multiple Pattern Matching
  Herding Patterns into a Trie
  Constructing a trie
  Applying the trie to multiple pattern matching
  Preprocessing the Genome Instead
  Introduction to suffix tries
  Using suffix tries for pattern matching
  Suffix Trees
  Suffix Arrays
  Constructing a suffix array
  Pattern matching with the suffix array
  The Burrows-Wheeler Transform
  Genome compression
  Constructing the Burrows-Wheeler transform
  From repeats to runs
  Inverting the Burrows-Wheeler Transform
  A first attempt at inverting the Burrows-Wheeler transform
  The First-Last Property
  Using the First-Last property to invert the Burrows-Wheeler transform
  Pattern Matching with the Burrows-Wheeler Transform
  A first attempt at Burrows-Wheeler pattern matching
  Moving backward through a pattern
  The Last-to-First mapping
  Speeding Up Burrows-Wheeler Pattern Matching
  Substituting the Last-to-First mapping with count arrays
  Getting rid of the first column of the Burrows-Wheeler matrix
  Where are the Matched Patterns?
  Burrows and Wheeler Set Up Checkpoints
  Epilogue: Mismatch-Tolerant Read Mapping
  Reducing approximate pattern matching to exact pattern matching
  BLAST: Comparing a sequence against a database
  Approximate pattern matching with the Burrows-Wheeler transform
  Charging Stations
  Constructing a suffix tree
  Solving the Longest Shared Substring Problem
  Partial suffix array construction
  Detours
  The reference human genome
  Rearrangements, insertions, and deletions in human genomes
  The Aho-Corasick algorithm
  From suffix trees to suffix arrays
  From suffix arrays to suffix trees
  Binary search
  Bibliography Notes

10 Why Have Biologists Still Not Developed an HIV Vaccine?
  Classifying the HIV Phenotype
  How does HIV evade the human immune system?
  Limitations of sequence alignment
  Gambling with Yakuza
  Two Coins up the Dealer's Sleeve
  Finding CG-Islands
  Hidden Markov Models
  From coin flipping to a Hidden Markov Model
  The HMM diagram
  Reformulating the Casino Problem
  The Decoding Problem
  The Viterbi graph
  The Viterbi algorithm
  How fast is the Viterbi algorithm?
  Finding the Most Likely Outcome of an HMM
  Profile HMMs for Sequence Alignment
  How do HMMs relate to sequence alignment?
  Building a profile HMM
  Transition and emission probabilities of a profile HMM
  Classifying proteins with profile HMMs
  Aligning a protein against a profile HMM
  The return of pseudocounts
  The troublesome silent states
  Are profile HMMs really all that useful?
  Learning the Parameters of an HMM
  Estimating HMM parameters when the hidden path is known
  Viterbi learning
  Soft Decisions in Parameter Estimation
  The Soft Decoding Problem
  The forward-backward algorithm
  Baum-Welch Learning
  The Many Faces of HMMs
  Epilogue: Nature is a Tinkerer and not an Inventor
  Detours
  The Red Queen Effect
  Glycosylation
  DNA methylation
  Conditional probability
  Bibliography Notes

11 Was T. rex Just a Big Chicken?
  Paleontology Meets Computing
  Which Proteins Are Present in This Sample?
  Decoding an Ideal Spectrum
  From Ideal to Real Spectra
  Peptide Sequencing
  Scoring peptides against spectra
  Where are the suffix peptides?
  Peptide sequencing algorithm
  Peptide Identification
  The Peptide Identification Problem
  Identifying peptides in the unknown T. rex proteome
  Searching for peptide-spectrum matches
  Peptide Identification and the Infinite Monkey Theorem
  False discovery rate
  The monkey and the typewriter
  Statistical significance of a peptide-spectrum match
  Spectral Dictionaries
  T. rex Peptides: Contaminants or Treasure Trove of Ancient Proteins?
  The hemoglobin riddle
  The dinosaur DNA controversy
  Epilogue: From Unmodified to Modified Peptides
  Post-translational modifications
  Searching for modifications as an alignment problem
  Building a Manhattan grid for spectral alignment
  Spectral alignment algorithm
  Detours
  Gene prediction
  Finding all paths in a graph
  The Anti-Symmetric Path Problem
  Transforming spectra into spectral vectors
  The infinite monkey theorem
  The probabilistic space of peptides in a spectral dictionary
  Are terrestrial dinosaurs really the ancestors of birds?
  Solving the Most Likely Peptide Vector Problem
  Selecting parameters for transforming spectra into spectral vectors
  Bibliography Notes

Appendix: Introduction to Pseudocode
  What is Pseudocode?
  Nuts and Bolts of Pseudocode
  The if condition
  The for loop
  The while loop
  Recursive algorithms
  Arrays

Glossary
Bibliography
Image Courtesies

Similar documents

List of Code Challenges. About the Textbook Meet the Authors... xix Meet the Development Team... xx Acknowledgments... xxi

List of Code Challenges. About the Textbook Meet the Authors... xix Meet the Development Team... xx Acknowledgments... xxi Contents List of Code Challenges xvii About the Textbook xix Meet the Authors................................... xix Meet the Development Team............................ xx Acknowledgments..................................

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Hidden Markov Models

Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms Hidden Markov Models Hidden Markov Models Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training

More information

Hidden Markov Models 1

Hidden Markov Models 1 Hidden Markov Models Dinucleotide Frequency Consider all 2-mers in a sequence {AA,AC,AG,AT,CA,CC,CG,CT,GA,GC,GG,GT,TA,TC,TG,TT} Given 4 nucleotides: each with a probability of occurrence of. 4 Thus, one

More information

HIDDEN MARKOV MODELS

HIDDEN MARKOV MODELS Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm

More information

CONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109

CONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109 CONTENTS ix Preface xv Acknowledgments xxi Editors and contributors xxiv A computational micro primer xxvi P A R T I Genomes 1 1 Identifying the genetic basis of disease 3 Vineet Bafna 2 Pattern identification

More information

Hidden Markov Models

Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm

More information

Hidden Markov Models. Three classic HMM problems

Hidden Markov Models. Three classic HMM problems An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Hidden Markov Models Slides revised and adapted to Computational Biology IST 2015/2016 Ana Teresa Freitas Three classic HMM problems

More information

11.3 Decoding Algorithm

11.3 Decoding Algorithm 393 For convenience, we have introduced π 0 and π n+1 as the fictitious initial and terminal states begin and end. This model defines the probability P(x π) for a given sequence

More information

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from: Hidden Markov Models Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from: www.ioalgorithms.info Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm

More information

Hidden Markov Models

Hidden Markov Models Slides revised and adapted to Bioinformática 55 Engª Biomédica/IST 2005 Ana Teresa Freitas Forward Algorithm For Markov chains we calculate the probability of a sequence, P(x) How

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Tandem Mass Spectrometry: Generating function, alignment and assembly

Tandem Mass Spectrometry: Generating function, alignment and assembly With slides from Sangtae Kim and from Jones & Pevzner 2004 Determining reliability of identifications Can we use Target/Decoy to estimate

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Was T. rex Just a Big Chicken? Computational Proteomics

Was T. rex Just a Big Chicken? Computational Proteomics Phillip Compeau and Pavel Pevzner adjusted by Jovana Kovačević Bioinformatics Algorithms: an Active Learning Approach 215 by Compeau and Pevzner.

More information

Bio nformatics. Lecture 3. Saad Mneimneh

Bio nformatics. Lecture 3. Saad Mneimneh Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per

More information

HMMs and biological sequence analysis

HMMs and biological sequence analysis Hidden Markov Model A Markov chain is a sequence of random variables X 1, X 2, X 3,... That has the property that the value of the current state depends only on the

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Markov Chains and Hidden Markov Models. = stochastic, generative models

Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,

More information

Genome Rearrangements In Man and Mouse. Abhinav Tiwari Department of Bioengineering

Genome Rearrangements In Man and Mouse Abhinav Tiwari Department of Bioengineering Genome Rearrangement Scrambling of the order of the genome during evolution Operations on chromosomes Reversal Translocation

More information

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline Page Evolutionary Trees Russ. ltman MI S 7 Outline. Why build evolutionary trees?. istance-based vs. character-based methods. istance-based: Ultrametric Trees dditive Trees. haracter-based: Perfect phylogeny

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

More information

Hidden Markov Models

Hidden Markov Models Slides revised and adapted to Bioinformática 55 Engª Biomédica/IST 2005 Ana Teresa Freitas CG-Islands Given 4 nucleotides: probability of occurrence is ~ 1/4. Thus, probability of

More information

Sequence analysis and Genomics

Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

AN EXACT SOLVER FOR THE DCJ MEDIAN PROBLEM

AN EXACT SOLVER FOR THE DCJ MEDIAN PROBLEM MENG ZHANG College of Computer Science and Technology, Jilin University, China Email: zhangmeng@jlueducn WILLIAM ARNDT AND JIJUN TANG Dept of Computer Science

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Introduction to de novo RNA-seq assembly

Introduction to de novo RNA-seq assembly Introduction Ideal day for a molecular biologist Ideal Sequencer Any type of biological material Genetic material with high quality and yield Cutting-Edge Technologies

More information

Effects of Gap Open and Gap Extension Penalties

Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Chapter 12 (Strikberger) Molecular Phylogenies and Evolution METHODS FOR DETERMINING PHYLOGENY In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task. Modern

More information

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

More information

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

More information

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, 2015 Genome Comparison Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Whole genome comparison/alignment Build better phylogenies Identify polymorphism

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Background: Imagine it is time for your lunch break, you take your sandwich outside and you sit down to enjoy your lunch with a beautiful view of

Background: Imagine it is time for your lunch break, you take your sandwich outside and you sit down to enjoy your lunch with a beautiful view of Montana s Rocky Mountains. As you look up, you see what

More information

Phylogenetic inference

Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on: 17 Non-collinear alignment This exposition is based on: 1. Darling, A.E., Mau, B., Perna, N.T. (2010) progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147.

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Pair Hidden Markov Models

Pair Hidden Markov Models Scribe: Rishi Bedi Lecturer: Serafim Batzoglou January 29, 2015 1 Recap of HMMs alphabet: Σ = {b 1,...b M } set of states: Q = {1,..., K} transition probabilities: A = [a ij ]

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Sequence analysis and comparison

The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Hidden Markov Models for biological sequence analysis

Hidden Markov Models for biological sequence analysis Master in Bioinformatics UPF 2017-2018 http://comprna.upf.edu/courses/master_agb/ Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA

Greedy Algorithms. CS 498 SS Saurabh Sinha

Greedy Algorithms. CS 498 SS, Saurabh Sinha. Chapter 5.5: A greedy approach to the motif finding problem. Given t sequences of length n each, find a profile matrix of length l. Enumerative approach O(l n

networks in molecular biology Wolfgang Huber

Networks in molecular biology. Wolfgang Huber. Regulatory networks: components = gene products; interactions = regulation of transcription, translation, phosphorylation... Metabolic

Example: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding

Example: The Dishonest Casino. Hidden Markov Models. Durbin and Eddy, chapter 3. Game: 1. You bet $ 2. You roll 3. Casino player rolls 4. Highest number wins $ The casino has two dice: Fair die P(1) = P(2) = P(3)

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

Plan for today. Part 1: (Hidden) Markov models. Part 2: String matching and read mapping

Plan for today. Part 1: (Hidden) Markov models. Part 2: String matching and read mapping. 2.1 Exact algorithms. 2.2 Heuristic methods for approximate search. (Hidden) Markov models: Why consider probabilistics

Pattern Recognition and Machine Learning

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer. Contents: Preface; Mathematical notation; Contents; 1 Introduction; 1.1 Example: Polynomial Curve Fitting; 1.2 Probability

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome. Hilary S. Booth, Australian National University, Australia. Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

Computational Biology

Computational Biology, Lecture 6, 31 October 2004. Overview: Scoring matrices (thanks to Shannon McWeeney); BLAST algorithm; Start sequence alignment. What is a homologous sequence? A homologous sequence,

Chapter 19: Taxonomy, Systematics, and Phylogeny

Chapter 19: Taxonomy, Systematics, and Phylogeny AP Curriculum Alignment Chapter 19 expands on the topics of phylogenies and cladograms, which are important to Big Idea 1. In order for students to understand

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, 2011 Genome Comparison Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Whole genome comparison/alignment Build better phylogenies Identify polymorphism

Background: mass spectrometry

Background: Imagine it is time for your lunch break. You take your sandwich outside and sit down to enjoy your lunch with a beautiful view of Montana's Rocky Mountains. As you look up, you see what

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Exp 11 - THEORY. Sequence alignment is the process of aligning two sequences to achieve the maximum level of identity between them. This helps to derive functional, structural and evolutionary relationships between

Chapter 16: Reconstructing and Using Phylogenies

Chapter 16: Reconstructing and Using Phylogenies. Chapter Review 1. Use the phylogenetic tree shown at the right to complete the following. a. Explain how many clades are indicated: Three: (1) chimpanzee/human, (2) chimpanzee/human/gorilla, and (3) chimpanzee/human/

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

CISC 889 Bioinformatics (Spring 2004). Hidden Markov Models (II). a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm, Viterbi training. Hidden Markov models

Lecture 9. Intro to Hidden Markov Models (finish up)

Lecture 9. Intro to Hidden Markov Models (finish up). Review: Structure: number of states $Q_1, \dots, Q_N$; $M$ output symbols. Parameters: transition probability matrix $a_{ij}$; emission probabilities $b_i(a)$, which is

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics. BIOL 7711 Computational Bioscience. Consortium for Comparative Genomics, University of Colorado School of Medicine. Biochemistry and Molecular Genetics, Computational Bioscience Program, Consortium

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

BINF6201/8201. Molecular phylogenetic methods

BINF6201/8201. Molecular phylogenetic methods. 0-7-06. Phylogenetics: According to the evolutionary theory, all life forms on this planet are related to one another by descent. Traditionally, phylogenetics

Hidden Markov Models for biological sequence analysis I

Hidden Markov Models for biological sequence analysis I Master in Bioinformatics UPF 2014-2015 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Example: CpG Islands

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing. i.i.d. to sequential data: So far we assumed independent, identically distributed data. Sequential (non-i.i.d.) data: time-series data, e.g. speech

Multiple Sequence Alignment using Profile HMM

Multiple Sequence Alignment using Profile HMM. based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi,

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS CRYSTAL L. KAHN and BENJAMIN J. RAPHAEL Box 1910, Brown University Department of Computer Science & Center for Computational Molecular Biology

Algorithms for Bioinformatics

Adapted from slides by Alexandru Tomescu, Leena Salmela, Veli Mäkinen, Esa Pitkänen 582670 Algorithms for Bioinformatics Lecture 5: Combinatorial Algorithms and Genomic Rearrangements 1.10.2015 Background

Molecular evolution - Part 1. Pawan Dhar BII

Molecular evolution - Part 1. Pawan Dhar, BII. Theodosius Dobzhansky: "Nothing in biology makes sense except in the light of evolution." Age of life on earth: 3.85 billion years. Formation of planet: 4.5 billion

Bioinformatics 2 - Lecture 4

Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns

8/23/2014. Phylogeny and the Tree of Life

8/23/2014. Phylogeny and the Tree of Life, Chapter 26. Objectives: Explain the following characteristics of the Linnaean system of classification: a. binomial nomenclature b. hierarchical classification. List the major

Grundlagen der Bioinformatik, SS 09, D. Huson, June 16, 2009. Markov chains and Hidden Markov Models

Grundlagen der Bioinformatik, SS 09, D. Huson, June 16, 2009. 7 Markov chains and Hidden Markov Models. We will discuss: Markov chains; Hidden Markov Models (HMMs); Profile HMMs. This chapter is based on: R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis,

CMPSCI 311: Introduction to Algorithms Second Midterm Exam

CMPSCI 311: Introduction to Algorithms Second Midterm Exam April 11, 2018. Name: ID: Instructions: Answer the questions directly on the exam pages. Show all your work for each question. Providing more

Tree of Life. Biological Sequence Analysis, Chapter. http://tolweb.org/tree/ Phylogenetic Prediction. All organisms on Earth have a common ancestor. All species are related. The relationship is called a phylogeny

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

Introduction to spectral alignment

SI Appendix C. Introduction to spectral alignment Due to the complexity of the anti-symmetric spectral alignment algorithm described in Appendix A, this appendix provides an extended introduction to the

Exhaustive search. CS 466 Saurabh Sinha

Exhaustive search. CS 466, Saurabh Sinha. Agenda: Two different problems: restriction mapping and motif finding. Common theme: exhaustive search of solution space. Reading: Chapter 4. Restriction Mapping: Restriction

Stephen Scott.

Stephen Scott, sscott@cse.unl.edu. Introduction: Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1). Gives a probabilistic model of the proteins in the family. Useful for

Graph Alignment and Biological Networks

Graph Alignment and Biological Networks. Johannes Berg, http://www.uni-koeln.de/ berg, Institute for Theoretical Physics, University of Cologne, Germany. Networks in molecular biology: New large-scale

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

02-710 Computational Genomics. Systems biology. Putting it together: Data integration using graphical models. High throughput data: So far in this class we discussed several different types of high throughput

Name: Class: Date: ID: A

Ch 17 Practice test. 1. A segment of DNA that stores genetic information is called a(n) a. amino acid. b. gene. c. protein. d. intron. 2. In which of the following processes does change

Phylogenetic Trees. How do the changes in gene sequences allow us to reconstruct the evolutionary relationships between related species?

Phylogenetic Trees. How do the changes in gene sequences allow us to reconstruct the evolutionary relationships between related species? Why? The saying "Don't judge a book by its cover" could be applied

Alignment Algorithms. Alignment Algorithms

Alignment Algorithms. Midterm Results: Big improvement over scores from the previous two years. Since this class grade is based on the previous years' curve, this class will get higher grades than the previous years.

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

BMI/CS 776 Lecture #20 Alignment of whole genomes Colin Dewey (with slides adapted from those by Mark Craven) 2007.03.29 1 Multiple whole genome alignment Input set of whole genome sequences genomes diverged

BIOLOGY YEAR AT A GLANCE RESOURCE ( )

BIOLOGY YEAR AT A GLANCE RESOURCE (2016-17) DATES TOPIC/BENCHMARKS QUARTER 1 LAB/ACTIVITIES 8/22 8/25/16 I. Introduction to Biology Lab 1: Seed Germination A. What is Biology B. Science in the real world

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Hidden Markov Models Barnabás Póczos & Aarti Singh Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed

Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary. 1. Gene finding: computer searches, cDNAs, ESTs. 2. Microarrays. 3. Use BLAST to find homologous sequences. 4. Multiple sequence alignments (MSAs). 5. Trees quantify sequence

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall 2011 1 HMM Lecture Notes Dannie Durand and Rose Hoberman October 11th 1 Hidden Markov Models In the last few lectures, we have focussed on three problems

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction. Outline: Biological Data in Digital Symbol Sequences; Genomes: Diversity, Size, and Structure; Proteins and Proteomes; On the Information Content of Biological Sequences

BIOLOGY YEAR AT A GLANCE RESOURCE ( ) REVISED FOR HURRICANE DAYS

BIOLOGY YEAR AT A GLANCE RESOURCE (2017-18) REVISED FOR HURRICANE DAYS DATES TOPIC/BENCHMARKS QUARTER 1 LAB/ACTIVITIES 8/21 8/24/17 I. Introduction to Biology A. What is Biology B. Science in the real

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

Data Mining in Bioinformatics HMM

Data Mining in Bioinformatics: HMM. Microarray Problem: Major Objective. Major Objective: Discover a comprehensive theory of life's organization at the molecular level. Data Mining in Bioinformatics

Understanding relationship between homologous sequences

Understanding relationship between homologous sequences. Molecular Evolution: How and when were genes and proteins created? How old is a gene? How can we calculate the age of a gene? How did the gene evolve to the present form? What selective
