Bio nformatics. Lecture 3. Saad Mneimneh

Similar documents
An Introduction to Bioinformatics Algorithms Hidden Markov Models

Hidden Markov Models

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Hidden Markov Models

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

HIDDEN MARKOV MODELS

Bio nformatics. Lecture 16. Saad Mneimneh

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Analysis and Design of Algorithms Dynamic Programming

Hidden Markov Models

Hidden Markov Models

Lecture 15. Saad Mneimneh

DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F,

Moreover, the circular logic

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Lecture 2: Pairwise Alignment. CG Ron Shamir

Introduction to Bioinformatics Algorithms Homework 3 Solution

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Lecture 4: September 19

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Hidden Markov Models 1

Stephen Scott.

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

Sequence analysis and Genomics

Lecture 5,6 Local sequence alignment

Practical considerations of working with sequencing data

11.3 Decoding Algorithm

EECS730: Introduction to Bioinformatics

Sequence analysis and comparison

Pairwise & Multiple sequence alignments

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

EECS730: Introduction to Bioinformatics

Theoretical Computer Science

Sequence Alignment. Johannes Starlinger

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Minimum Edit Distance. Defini'on of Minimum Edit Distance

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Stephen Scott.

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

Lecture 5: September Time Complexity Analysis of Local Alignment

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Motivating the need for optimal sequence alignments...

Single alignment: Substitution Matrix. 16 march 2017

Bioinformatics and BLAST

Computational Biology

CSCE 471/871 Lecture 3: Markov Chains and

Objec&ves. Review. Dynamic Programming. What is the knapsack problem? What is our solu&on? Ø Review Knapsack Ø Sequence Alignment 3/28/18

List of Code Challenges. About the Textbook Meet the Authors... xix Meet the Development Team... xx Acknowledgments... xxi

Algorithms in Bioinformatics

Pair Hidden Markov Models

Hidden Markov Models for biological sequence analysis

Welcome to CS262: Computational Genomics

6.6 Sequence Alignment

1.5 Sequence alignment

Copyright 2000 N. AYDIN. All rights reserved. 1

Pairwise sequence alignment

Computational Biology Lecture 5: Time speedup, General gap penalty function Saad Mneimneh

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Large-Scale Genomic Surveys

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Sequence Comparison. mouse human

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009

Algorithms for biological sequence Comparison and Alignment

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Dynamic Programming: Edit Distance

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Hidden Markov Models. Three classic HMM problems

CS 580: Algorithm Design and Analysis

Pairwise sequence alignments

Linear-Space Alignment

More Dynamic Programming

Markov Chains and Hidden Markov Models. = stochastic, generative models

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Outline. Similarity Search. Outline. Motivation. The String Edit Distance

More Dynamic Programming

Approximation: Theory and Algorithms

Hidden Markov Models for biological sequence analysis I

Whole Genome Alignments and Synteny Maps

String Matching Problem

Similarity Search. The String Edit Distance. Nikolaus Augsten. Free University of Bozen-Bolzano Faculty of Computer Science DIS. Unit 2 March 8, 2012

Exhaustive search. CS 466 Saurabh Sinha

Sequence Alignment (chapter 6)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Chapter 4: Hidden Markov Models

Pattern Matching (Exact Matching) Overview

Algorithm Design and Analysis

Homework 9: Protein Folding & Simulated Annealing : Programming for Scientists Due: Thursday, April 14, 2016 at 11:59 PM

Today s Lecture: HMMs

Similarity Search. The String Edit Distance. Nikolaus Augsten.

Alignment & BLAST. By: Hadi Mozafari KUMS

Physical Mapping. Restriction Mapping. Lecture 12. A physical map of a DNA tells the location of precisely defined sequences along the molecule.

HMMs and biological sequence analysis

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Bio nformatics. Lecture 23. Saad Mneimneh

NUMB3RS Activity: DNA Sequence Alignment. Episode: Guns and Roses

Transcription:

Bio nformatics Lecture 3

Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per experiment. Entire genome ( 4GB long) must be assembled from the knowledge of these short fragments.

Shortest Superstring The simplest naive approximation of DNA sequencing, ignoring unavoidable experimental errors, is the following: Shortest Superstring Problem: Given a set of strings s 1,,s n, find the shortest string s such that each s i appears as a substring of s. This problem is NP-hard.

Shortest Superstring Set of strings: {000,001,010,011,100,101,011,111} A trivial superstring: 000 001 010 011 100 101 011 111 011 110 010 000 A shortest superstring: 0 0 0 1 1 1 0 1 0 0 001 111 101 100

Sequencing by Hybridization Although DNA sequencing is a fast and efficient procedure now, it was time consuming and hard 10 years ago. In 1988, 4 groups of biologist independently and simultaneously suggested a new approach called Sequencing by Hybridization (SBH). Build a DNA chip containing thousands of short DNA fragments (probes) working like the chip s memory. Each probe will reveal some information about an unknown DNA fragment. All the pieces of information combined would solve the DNA sequencing. Of course, in 1988, no one believed that such a thing could work! Now, building DNA arrays with thousands of probes has become an industry.

SBH Given a DNA fragment with an unknown sequence, a DNA array would provide its l-tuple composition, i.e. information about all substrings of length l contained in this fragment. SBH Problem: Reconstruct a string by its l-tuple decomposition. Although conventional DNA sequencing and SBH are different approaches, computationally they are similar. SBH is a special instance of the Shortest Superstring problem when s 1 s n represent all substrings of a fixed length. While the general Shortest Superstring problem in NP-hard, SBH can be solved efficiently.

Similarity Search After sequencing, biologist have no idea about the function of the newly sequenced gene. Hoping to find a clue, they compare it with previously sequenced genes with known functionality. Edit distance: number of operations needed to transform one string into another, where operations are insertion of a symbol, deletion of a symbol, and substitution of a symbol. Since mutations in DNA can be represented by the above operations, the edit distance is a natural measure of similarity between DNA fragments. Variations to the basic edit distance above are possible.

Longest Common Subsequence If the mutations are limited to insertions and deletions only (i.e. no substitution), then the edit distance problem is equivalent to the longest common subsequence problem (LCS). Given two strings, V=v 1 v n, and W=w 1 w m, a common subsequence of V and W of length k is a set of indices i 1 < < i k and j 1 < < j k such that V ix = W jx, for x=1 k. LCS problem: Given V and W, find their Longest common subsequence. Clearly, n + m - 2LCS(V,W) is the minimum number of insertions and deletions needed to transform V into W, where LCS(V,W) is the length of the longest common subsequence of V and W. Other mutation operations and edit distances lead to notions more general than LCS, e.g. Alignments (we will start by looking at sequence Alignment algorithms).

Finding CG-islands The most infrequent dinucleotide in many genomes is CG (CG has tendency to mutate to TG). However, CG appears relatively frequently around genes in areas called CG-islands. How to define and find CG-islands in a genome? This is similar to the following analogy of the Casino: the dealer uses two coins: biased and unbiased. He switches coins with probability p. Given a sequence of coin tosses, can you find out when the biased coin was used? Why is that a good analogy? Because as we go along the genome, we can switch between two states: CG-island and non CG-island. Each state has different probability for the occurrence of CG. Given the genome, can you tell when you are in a CG island?

Protein Folding As we have seen, the protein folds on itself in 3D. Determining the 3D structure of the protein is important to understand its functionality. Many protein folding models use lattices to describe the space of conformations that a protein can assume. The folding is therefore a self avoiding path in the lattice in which vertices are labeled by the amino acids. The protein folds in a way to minimize the conformation energy, i.e. maximize number of bonds between same acids.

Example Contact Model: Maximize # bonds between same amino acids 2 dimensional lattice Sequence bacbbcacba Maximum # bonds = 4 b c a a b c c b b a

HP Model (Dill 1985) Two kinds of amino acids: H and P H: Hydrophobic P: Polar Conformation to maximize number of bonds between H amino acids The hydrophobic amino acids prefer to be shielded from water and are driven to the core of the protein, making bonds among them

Protein Folding Complexity The problem of finding the optimal folding (minimum energy conformation) is NP-complete under various models and lattices. We will look at the different models and lattices.

Sequence Alignment

Similar Sequences These 2 look very much alike GACGGATTAG GATCGGAATAG Aligning them one above the other GA-CGGATTAG GATCGGAATAG

Alignment Alignment: Insertion of gaps in arbitrary locations along the sequences so that they end up with the same size. No gap in one sequence should be aligned to a gap in the other. We want the best alignment, but what is best?

Simple Scoring Two identical characters receive a score of + m (match) Two different characters receive a score of s (mismatch) A character and a gap receive a score of d (gap) score = (#matches).m (#mismatches).s (#gaps).d

Example m = 1 s = 1 d = 2 GA-CGGATTAG GATCGGAATAG score = +1(9) -1(1) -2(1) = 6 Why do we penalize gaps more? Insertions and Deletions are less likely than substitutions

More General Scoring The scoring scheme can be more general Given two sequences x and y, aligning x i and y j could add a score of s(x i,y j ). Therefore, we have a scoring matrix. Gap penalty is not linear, once you have a gap, it is likely to have another one so we should penalize the start of the gap more

Greedy Algorithm To obtain the best alignment, try all possible alignments and find the best one Exponentially many alignments! How many? (homework) Greedy would result in a very slow algorithm

Dynamic Programming Solving an instance of the problem by taking advantage of already computed solution for smaller instances of the problem. To find optimal alignment for sequences x and y, compute optimal alignments for prefixes of x and y.

Alignment is Additive The score of aligning x 1..x m y 1..y n is additive (with our particular scoring scheme) If the alignment is x 1..x i x i+1..x m y 1..y j y j+1..y n then the score is: score(x 1..x i, y 1..y j ) + score(x i+1..x m, y j+1..y n )

Dynamic programming Assume we want to align x 1..x m y 1..y n Let A(i,j) be the score of optimally aligning the two prefixes x 1..x i y 1..y j

Dynamic Programming (cont.) Three possible cases for aligning x 1..x i and y 1..y j 1. X 1..X i-1 x i y 1..y j-1 y j A(i,j) = A(i 1, j 1) + m, if x i = y j s, if not 2. X 1..X i-1 x i y 1..y j - A(i,j) = A(i 1, j) d 3. X 1..X i - y 1..y j-1 y j A(i,j) = A(i, j 1) d

Dynamic Programming (cont.) Inductive step: A(i, j 1), A(i 1, j), A(i 1, j 1) are correct Then, A(i, j) = max A(i 1, j 1) + s(x i, y j ) A(i 1, j) d A(i, j 1) d Where s(x i, y j ) = m, if x i = y j ; s, if not

Illustration A(i 1,j 1) A(i 1,j) A(i,j 1) A(i,j) A(m,n) will be the optimal score

What Else? Base case A(0,0) = 0 A(i,0) = d.i i = 1 m A(0,j) = d.j j = 1 n

AAAC and AGC A G C 0-2 -4-6 A -2 1-1 -3 A -4-1 0-2 A -6-3 -2-1 C -8-5 -4-1

Obtaining Actual Alignment A G C 0-2 -4-6 A -2 1-1 -3 A -4-1 0-2 A -6-3 -2-1 C -8-5 -4-1

Obtaining Actual Alignment A G C 0-2 -4-6 A -2 1-1 -3 A -4-1 0-2 A -6-3 -2-1 C -8-5 -4-1 AAAC AG-C

Needleman-Wunsch Algorithm (thanks to Serafim Batzoglou) 1. Initialization A(0, 0) = 0 A(i, 0) = i.d for i = 1 m A(0, j) = j.d for j = 1 n 2. Main Iteration (Aligning prefixes) for each i = 1 m for each j = 1 n A(i 1, j 1) + s(x i, y j ) [case 1] A(i, j) = max A(i 1, j) d [case 2] A(i, j 1) d [case 3] Diag [case 1] Ptr(i, j) = Up [case 2] Left [case 3] 3. Termination A(m, n) is the optimal score, and from Ptr(m, n) can trace back optimal alignment.

Complexity Time O(mn) Space O(mn)