Algorithms for biological sequence Comparison and Alignment

Sara Brunetti, Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche, University of Siena, Italy, sara.brunetti@unisi.it

A piece of history
1953: DNA structure (Watson and Crick)
1975: development of the sequencing technique (Sanger; Maxam and Gilbert)
1990: beginning of the Genome Project. Goals:
  1. sequence the entire human genome, producing the complete DNA transcript
  2. produce maps of the genome showing the locations of expressed sites
2000: Tony Blair and Bill Clinton announce the completion of the human genome sequencing. Cost: $3 \times 10^9$ euros
2002: high-throughput sequencing (HTS)
2008: 1000 Genomes pilot project
2009: 1000 Genomes phase 1
2011: 1000 Genomes phase 2

Amounts of data
Human genome: $3 \times 10^9$ bp, contained in $10^{15}$ cells
Macromolecular structures: 35,460 entries (PDB, 14 March 2006)
Bioinformatics: the study of problems of storage, organization and distribution of large amounts of genomic data

Computational biology
The study of mathematical and combinatorial problems of modeling biological processes in the cell, interpreting the data and providing theories about their biological relations.
1. Data representation
2. Problem formulation
3. (Efficient) algorithm design

Data representation: alphabets
Italian: A B C D E F G H I L M N O P Q R S T U V Z
English: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
DNA: A C G T (adenine, cytosine, guanine, thymine)
Protein: A Q W E R T Y I P L K H G F D S C V N M
Binary: 0 1

Data representation: strings
DNA string: ACCGTATATAAAAGGCCGGGTT
Length: 22
A prefix is an initial segment of the string, a suffix is a final segment, and a substring is any contiguous segment.

DNA information

Biological motivation
Goal: learning about the function or structure of a protein without performing any experiments.
Basic idea: in biomolecular sequences (DNA, RNA, amino acid sequences), high similarity usually implies significant functional or structural similarity.
About 25% sequence identity usually suffices for two proteins to have the same three-dimensional structure and almost identical function.

WARNING
Sequence similarity implies functional similarity, but the reverse is not necessarily true!
Besides sequences there are other levels to investigate (3D protein structure, cellular biochemistry, morphology, etc.), but sequences are easier to study.

DNA Sequence Comparison: First Success Story
Finding sequence similarities with genes of known function is a common approach to inferring the function of a newly sequenced gene.
In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene.

Compare sequences. Why?
The resemblance of two DNA sequences taken from different organisms can be explained by the theory that all contemporary genetic material has one ancient ancestral DNA. According to this theory, mutations occurred during the course of evolution, creating differences between families of contemporary species. Most of these changes are due to local mutations, each modifying the DNA sequence in a specific manner.

Similarities and differences?
- Differences between the human genome and the chimpanzee genome: 2%
- Differences between human and worm: 50%
- Similarity between two humans: 99.9%
But the genome length is $3 \times 10^9$ bp, so two humans can still differ in about $3 \times 10^6$ positions.

Sequence Comparison Problems
Informally: find which parts of the sequences are alike and which parts are different.
1) Given two sequences over the same alphabet, of about the same length (about 10,000 characters) and almost equal, find the places where differences occur.
   Motivation: the same gene is sequenced by two laboratories and they want to compare the results.
2) Given two sequences of a few hundred characters, find two similar substrings (one from each sequence).
3) Same as Problem 2), but one sequence is compared with thousands of others.
   Motivation for Problems 2) and 3): searching for local similarities in large databases of bio-sequences.

Sequence Comparison Problems
4) Given two sequences of a few hundred characters, find a prefix of one that is similar to a suffix of the other.
   Motivation: the fragment assembly procedure in large-scale DNA sequencing.
We introduce a single basic algorithmic idea to solve all the above problems.

Pairwise alignment
How to compare two sequences? Alignment, similarity.

Sequence alignment: an example
s: ATGCAGCTGAGCATCG
t: ATACAGCGAGTATCG
How should they be aligned?

Sequence alignment: an example
s: ATGCAGCTGAGCATCG
t: ATACAGC-GAGTATCG

Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter of v with the i-th letter of w:
v = ATATATAT
w = TATATATA
Hamming distance: d(v, w) = 8. Computing Hamming distance is a trivial task.
Edit distance may compare the i-th letter of v with the j-th letter of w. Just one shift makes it all line up:
v = -ATATATAT
w = TATATATA-
Edit distance: d(v, w) = 2. Computing edit distance is a non-trivial task.

Edit Distance: Example
TGCATAT to ATCCGAT in 5 steps:
TGCATAT (delete last T)
TGCATA (delete last A)
TGCAT (insert A at front)
ATGCAT (substitute C for the G in 3rd position)
ATCCAT (insert G before last A)
ATCCGAT (done)
What is the edit distance? 5?

Edit Distance: Example (cont'd)
TGCATAT to ATCCGAT in 4 steps:
TGCATAT (insert A at front)
ATGCATAT (delete the T in 6th position)
ATGCAAT (substitute G for the A in 5th position)
ATGCGAT (substitute C for the G in 3rd position)
ATCCGAT (done)
Can it be done in 3 steps?
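The two distances above can be checked with a short Python sketch. The function names are illustrative, and unit costs for insertion, deletion and substitution are assumed; this is one standard way to compute edit distance, not necessarily the one used in the course.

```python
def hamming_distance(v: str, w: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))


def edit_distance(v: str, w: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    prev = list(range(len(w) + 1))      # row for the empty prefix of v
    for i, a in enumerate(v, start=1):
        curr = [i]                      # delete the first i letters of v
        for j, b in enumerate(w, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # match or substitute
        prev = curr
    return prev[-1]


print(hamming_distance("ATATATAT", "TATATATA"))  # 8
print(edit_distance("ATATATAT", "TATATATA"))     # 2
print(edit_distance("TGCATAT", "ATCCGAT"))       # 4
```

Under these unit costs the last call returns 4, so the four-step transformation is optimal and three steps are not enough.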

Sequence alignment
Sequence 1: $s = (s_1, \dots, s_m)$ of size m
Sequence 2: $t = (t_1, \dots, t_n)$ of size n
An alignment (s', t') between s and t is obtained by inserting spaces at arbitrary positions along the sequences so that they end up with the same size l:
$s'_1\, s'_2 \dots s'_l$
$t'_1\, t'_2 \dots t'_l$
Each column $(s'_i, t'_i)$ pairs a character of s or a space "-" with a character of t or a space "-". The column (-, -) is not allowed.

Number of alignments
In how many ways can s be aligned with t? The aligned length l satisfies $\max(n, m) \le l \le n + m$; the upper bound is reached when every character of s and of t faces a space.
Let f(i, j) be the number of alignments of one sequence of i letters with another of j letters. Then
$$f(n, m) = f(n-1, m) + f(n-1, m-1) + f(n, m-1)$$
and, up to a constant factor, $f(n, n)$ grows like $(1 + \sqrt{2})^{2n+1}\, n^{-1/2}$ as $n \to \infty$.

Example: two sequences of length 1000 have on the order of $10^{764}$ possible alignments (there are about $10^{80}$ elementary particles in the universe).
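The recurrence for f can be evaluated directly. The sketch below is illustrative: the function name is made up, and it assumes the boundary values f(i, 0) = f(0, j) = 1, since a sequence can be aligned with the empty sequence in only one way.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of alignments f(n, m) from f(n,m) = f(n-1,m) + f(n-1,m-1) + f(n,m-1).

    Assumes f(i, 0) = f(0, j) = 1 (only one way to align against the empty string).
    """
    row = [1] * (m + 1)                 # f(0, j) for j = 0..m
    for _ in range(n):
        new = [1]                       # f(i, 0)
        for j in range(1, m + 1):
            new.append(row[j] + row[j - 1] + new[j - 1])
        row = new
    return row[m]


print(count_alignments(1, 1))   # 3
print(count_alignments(2, 2))   # 13
# Number of decimal digits of f(1000, 1000); Python integers do not overflow.
print(len(str(count_alignments(1000, 1000))))
```

Even for short sequences the count explodes, which is why enumerating all alignments is hopeless and dynamic programming is used instead.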

Global alignment
Given two sequences s and t of roughly the same length, determine the alignment of s and t with maximal (or minimal) score (Needleman & Wunsch algorithm):
AC-GCTTTG
-CATGTAT-
Motivation: the same gene is sequenced by two laboratories and they want to compare the results.

More about similarity and distance

Similarity and distance
There are two approaches to comparing strings.
Similarity measures how much the strings are alike. Its definition derives from the concept of one ancient ancestral DNA.
An alignment (s', t') of the strings s and t is obtained by inserting space characters "-" in them in such a way that:
1. $|s'| = |t'|$
2. Removal of "-" from s' gives s
3. Removal of "-" from t' gives t
4. For every i, either s'[i] or t'[i] is not "-"
A scoring system (p, g) has members $p: A \times A \to \mathbb{R}$ (score of a pair of letters) and $g < 0$ (score of a space). Scoring is additive, and
$$\mathrm{sim}(s, t) = \max_{(s', t')} \mathrm{score}(s', t')$$

Similarity and distance
Distance measures how much the strings differ. Its definition derives from the concept of mutations.
A distance d on a set E is a function $d: E \times E \to \mathbb{R}$ such that:
1. $d(x, x) = 0$ for all x in E, and $d(x, y) > 0$ for $x \ne y$
2. $d(x, y) = d(y, x)$ for all x, y in E
3. $d(x, y) \le d(x, z) + d(y, z)$ for all x, y, z in E
An alignment is obtained by successive applications of a number of admissible operations transforming s into t:
1. substitution $a \to b$
2. insertion or deletion of any character (indel)
A cost measure (c, h) has members $c: A \times A \to \mathbb{R}$ (cost of a substitution) and $h > 0$ (cost of an indel).

When are similarity and distance algorithms equivalent?
When sequences are aligned by distance in global alignment, there is a similarity algorithm that gives the same set of optimal alignments, and vice versa.
The measures are related by the formulas
$$p(a, b) = M - c(a, b), \qquad g = -h + \frac{M}{2}, \qquad \mathrm{dist}(s, t) + \mathrm{sim}(s, t) = \frac{M}{2}\,(|s| + |t|)$$
Example (edit distance, i.e. c(a,a) = 0, c(a,b) = 1, h = 1):
M = 0 gives p(a,a) = 0, p(a,b) = -1, g = -1;
M = 2 gives p(a,a) = 2, p(a,b) = 1, g = 0.
Same set of optimal solutions, different scores. Usually $0 \le M \le \max c(a, b)$.

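Derivation. Aligned letters correspond to substitutions (let their number be l) and spaces correspond to indel operations (let their number be r). Then
$$\mathrm{score}(s', t') = \sum_{i=1}^{l} p(a_i, b_i) + r g, \qquad \mathrm{cost}(s \to t) = \sum_{i=1}^{l} c(a_i, b_i) + r h$$
$$\mathrm{score}(s', t') + \mathrm{cost}(s \to t) = l M + r \frac{M}{2}$$
In a global alignment $|s| + |t| = 2l + r$, hence
$$\mathrm{score}(s', t') + \mathrm{cost}(s \to t) = \frac{M}{2}\,(|s| + |t|)$$
$$\mathrm{dist}(s, t) = \min \left( \frac{M}{2}(|s| + |t|) - \mathrm{score}(s', t') \right) = \frac{M}{2}(|s| + |t|) - \mathrm{sim}(s, t)$$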
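The relation between dist and sim can be verified numerically. The following sketch uses one generic routine for both measures; the names `global_opt`, `cost` and `score` are hypothetical, and the edit-distance costs with M = 2 from the example above are assumed.

```python
def global_opt(s, t, pair, space, best):
    """Optimal global alignment value: `pair(a, b)` per aligned pair, `space` per space.

    `best` is max for a similarity and min for a distance.
    """
    row = [j * space for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        new = [i * space]
        for j, b in enumerate(t, start=1):
            new.append(best(row[j - 1] + pair(a, b),   # align a with b
                            row[j] + space,            # a against a space
                            new[j - 1] + space))       # b against a space
        row = new
    return row[-1]


M = 2          # constant relating the two scoring schemes
h = 1          # indel cost (edit distance)


def cost(a, b):
    return 0 if a == b else 1        # c(a, b) for edit distance


def score(a, b):
    return M - cost(a, b)            # p(a, b) = M - c(a, b)


g = -h + M / 2                       # space score derived from h

s, t = "TGCATAT", "ATCCGAT"
dist = global_opt(s, t, cost, h, min)
sim = global_opt(s, t, score, g, max)
print(dist, sim, dist + sim == M / 2 * (len(s) + len(t)))
```

For this pair the sketch gives dist = 4 and sim = 10, and their sum equals (M/2)(|s| + |t|) = 14.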

Dynamic Programming Algorithm
Basic idea of dynamic programming: a problem is solved by taking advantage of already solved sub-problems.
Each optimal alignment contains optimal alignments of the sub-problems, for example:
-GCTGATATAGCT
GGGTGAT-TAGCT
This relies on the additivity of the penalty function.
Three essential components:
1. Recurrence relation
2. Tabular computation
3. Traceback

Dynamic Programming Algorithm: Recurrence relation
Sequence 1: s of size m. Sequence 2: t of size n. s[i..j] denotes the substring of s from character i to character j.
M(i, j) is the score of the best alignment between s[1..i] and t[1..j]. M(m, n) is computed by solving the more general problem of computing M(i, j) for all i, j.
Initialization: $M(i, 0) = -2i$, $M(0, j) = -2j$.
$$M[i,j] = \max \begin{cases} M[i, j-1] - 2 \\ M[i-1, j-1] + p(i, j) \\ M[i-1, j] - 2 \end{cases} \qquad p(i, j) = \begin{cases} +1 & \text{if } s_i = t_j \\ -1 & \text{if } s_i \ne t_j \end{cases}$$
The computation is not top-down but bottom-up, and it is arranged in an $(m+1) \times (n+1)$ array M.

Dynamic Programming Algorithm: Tabular computation
Rows are indexed by s = AAAC, columns by t = AGC:

        A   G   C
    0  -2  -4  -6
A  -2   1  -1  -3
A  -4  -1   0  -2
A  -6  -3  -2  -1
C  -8  -5  -4  -1

Row 0: comparison between t and an empty sequence. Column 0: comparison between s and an empty sequence.
M[i,j] is computed by looking at the 3 previous entries M[i-1,j-1], M[i,j-1] and M[i-1,j].
M[i-1,j-1]: a new character of s and a new character of t are considered; +1 is added in case of a match and -1 in case of a mismatch. Align s[1..i-1] with t[1..j-1] and match s_i with t_j.
M[i,j-1]: a new character of t is considered, corresponding to a space in s (-2). Align s[1..i] with t[1..j-1] and match a space with t_j.
M[i-1,j]: a new character of s is considered, corresponding to a space in t (-2). Align s[1..i-1] with t[1..j] and match a space with s_i.

Dynamic Programming Algorithm: Traceback
Trace back from M[m,n] (the best score, here -1) to find the best alignment(s):

solution 1    solution 2    solution 3
s: AAAC       s: AAAC       s: AAAC
t: AG-C       t: A-GC       t: -AGC

        A   G   C
    0  -2  -4  -6
A  -2   1  -1  -3
A  -4  -1   0  -2
A  -6  -3  -2  -1
C  -8  -5  -4  -1

Algorithm Similarity
input: S, T, m, n
output: M
for i ← 0 to m do
    M(i,0) ← i·g
for j ← 0 to n do
    M(0,j) ← j·g
for i ← 1 to m do
    for j ← 1 to n do
        M(i,j) ← max( M(i-1,j) + g, M(i-1,j-1) + p(i,j), M(i,j-1) + g )
return M
Complexity: O(nm)
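A direct Python transcription of Algorithm Similarity might look as follows. It is a sketch: the function name and keyword parameters are illustrative, and the scores +1 / -1 / -2 of the running example are assumed as defaults.

```python
def similarity_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Fill the (m+1) x (n+1) table M of best global alignment scores."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap                    # s[1..i] against the empty string
    for j in range(1, n + 1):
        M[0][j] = j * gap                    # t[1..j] against the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j] + gap,       # s_i against a space
                          M[i - 1][j - 1] + p,     # s_i against t_j
                          M[i][j - 1] + gap)       # t_j against a space
    return M


for row in similarity_matrix("AAAC", "AGC"):
    print(row)
```

Run on s = AAAC and t = AGC, this reproduces the table of the tabular computation slide, with M[4][3] = -1 as the best score.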

Align(i, j, len)
input: i, j, array M obtained by Algorithm Similarity
output: alignment in align-s, align-t, vectors of length len
if i = 0 and j = 0 then
    len ← 0
else if i > 0 and M[i,j] = M[i-1,j] + g then
    Align(i-1, j, len); len ← len + 1; align-s[len] ← s[i]; align-t[len] ← "-" (space)
else if i > 0 and j > 0 and M[i,j] = M[i-1,j-1] + p(i,j) then
    Align(i-1, j-1, len); len ← len + 1; align-s[len] ← s[i]; align-t[len] ← t[j]
else
    Align(i, j-1, len); len ← len + 1; align-s[len] ← "-" (space); align-t[len] ← t[j]
First call: Align(m, n, len). On return, max(|s|, |t|) ≤ len ≤ m + n.
Algorithm Align finds solution 1. By changing the order of the if tests it is possible to find the other solutions.
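The traceback can also be written iteratively. The sketch below fills the table and then walks back from M[m][n], testing the three cases in the same order as Align; the function name and default scores are assumptions made for illustration.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j] + gap, M[i - 1][j - 1] + p, M[i][j - 1] + gap)

    out_s, out_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and M[i][j] == M[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-")        # s_i against a space
            i -= 1
        elif (i > 0 and j > 0 and
              M[i][j] == M[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)):
            out_s.append(s[i - 1]); out_t.append(t[j - 1])   # s_i against t_j
            i -= 1; j -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1])        # t_j against a space
            j -= 1
    # Columns were collected from the end of the alignment, so reverse them.
    return M[m][n], "".join(reversed(out_s)), "".join(reversed(out_t))


print(global_align("AAAC", "AGC"))   # (-1, 'AAAC', 'AG-C')
```

With this order of tests the sketch returns AAAC / AG-C; reordering the branches yields the other optimal alignments.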

Complexity
Algorithm Similarity takes O(mn) time and space.
Algorithm Align takes O(m + n) time. Let h = m + n; then
$$T(h) = k \ \text{for } h \le 2, \qquad T(h) = T(h-1) + k \ \text{for } h > 2, \qquad T(h) = O(h) = O(m + n) \quad (k \text{ constant})$$
Algorithm Similarity can be refined to run in O(m + n) space: in a row-by-row computation, store only the last and the current row.
Algorithm Align can be designed to run in O(m + n) space with a divide-and-conquer strategy. It is not a trivial task!
The basic Algorithm Similarity can be modified to solve a variety of different problems.
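The row-by-row refinement can be sketched as follows. Only the score is computed (no traceback); the function name and default scores are illustrative, and swapping the arguments so that the shorter string indexes the row is an extra assumption that relies on the scoring being symmetric.

```python
def similarity_score(s, t, match=1, mismatch=-1, gap=-2):
    """Global similarity score using two rows instead of the full table."""
    if len(t) > len(s):
        s, t = t, s                      # keep the shorter string along the row
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * gap]
        for j, b in enumerate(t, start=1):
            p = match if a == b else mismatch
            curr.append(max(prev[j] + gap,        # a against a space
                            prev[j - 1] + p,      # a against b
                            curr[j - 1] + gap))   # b against a space
        prev = curr                      # the old row is no longer needed
    return prev[-1]


print(similarity_score("AAAC", "AGC"))   # -1
```

The price of the saving is that the path through the table is lost, which is why recovering the alignment itself in linear space needs the divide-and-conquer strategy mentioned above.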

Semi-global alignment

Semi-global Comparison
Find the best fit of a short sequence t of size n into a larger sequence s of size m, that is, align t with some substring $s_k \dots s_l$ of s.
Solving the problem as formulated above, with one global alignment for every pair (k, l), takes time proportional to
$$\sum_{k=1}^{m} \sum_{l=k}^{m} n\,(l - k) = O(n m^3)$$

(Exact matching)
Problem: given a pattern p and a larger string s, find all the occurrences of the pattern p in s. Is there an occurrence? How many times does p occur in s?
Methods: naive method, Boyer-Moore algorithm, Knuth-Morris-Pratt algorithm.

Semi-global Comparison
Ignore the spaces at the beginning and at the end of a sequence.
Problem: find the highest-scoring semi-global alignment between t and a substring (a prefix of a suffix) of s.
s: CAGCA-CTTGGATTCTCGG
t: ---CAGCTTGG(--------)
1. Ignore final spaces. Find the best score between t and a prefix of s. M[i,j] of Problem 1 contains the best score between s[1..i] and t[1..j], hence take the maximum value M[i,n] in the last column n. There is no need to reach the last row.

Semi-global Comparison
s: CAGCA-CTTGGATTCTCGG
t: (---)CAGCTTGG--------
2. Ignore initial spaces. Find the best alignment between t and a suffix of s. M[i,j] now contains the best score between t[1..j] and a suffix of s[1..i], hence the first column contains all zeroes.
[Figure: first column of M, with entries 0 next to the initial characters C, A, G, C, A of s]
Join solutions 1 and 2 to solve semi-global comparison!
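Joining the two modifications gives a short sketch: the first column is all zeroes (initial spaces are free) and the answer is the maximum of the last column (final spaces are free). The function name, the returned end position and the scores +1 / -1 / -2 are assumptions made for illustration.

```python
def semiglobal_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best score of t against any substring of s; returns (score, end index in s)."""
    n = len(t)
    prev = [j * gap for j in range(n + 1)]   # row 0: t[1..j] against nothing
    best, best_end = prev[n], 0
    for i, a in enumerate(s, start=1):
        curr = [0]                           # skipping a prefix of s is free
        for j, b in enumerate(t, start=1):
            p = match if a == b else mismatch
            curr.append(max(prev[j] + gap, prev[j - 1] + p, curr[j - 1] + gap))
        if curr[n] > best:                   # last column: the rest of s is free
            best, best_end = curr[n], i
        prev = curr
    return best, best_end


print(semiglobal_score("CAGCACTTGGATTCTCGG", "CAGCTTGG"))   # (5, 10)
```

For the example sequences the best fit is CA-CTTGG against CAGCTTGG (seven matches and one space, score 5), ending at position 10 of s, in O(nm) time rather than O(nm^3).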

Local Alignment

Local Alignment
Find the best fit between a substring $s_k \dots s_i$ of s and a substring $t_h \dots t_j$ of t.
Motivation: ignore stretches of non-coding DNA.

Global Alignment vs Local Alignment (Smith & Waterman algorithm)
Global alignment:
--T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C
Local alignment, a better alignment for finding a conserved segment (shown in uppercase):
tccCAGTTATGTCAGgggacacgagcatgcagagac
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Local Alignment: Example
[Figure: paths of the global alignment and of the local alignment in the alignment grid. Compute a "mini" global alignment to get the local alignment.]

Local Alignment: Smith & Waterman algorithm
The local alignment problem is still solved by computing M. Here M[i,j] holds the value of the best alignment between a suffix of s[1..i] and a suffix of t[1..j]. The first row and the first column are initialized with zeros.

Local Alignment
$$M[i,j] = \max \begin{cases} M[i, j-1] - 2 \\ M[i-1, j-1] + p(i, j) \\ M[i-1, j] - 2 \\ 0 \end{cases} \qquad p(i, j) = \begin{cases} +1 & \text{if } s_i = t_j \\ -1 & \text{if } s_i \ne t_j \end{cases}$$
For any entry M[i,j] there always exists the alignment between the empty suffixes of s[1..i] and of t[1..j], with score 0.
At the end, choose the entry M[i,j] with maximal score, in any position. Start tracing back the alignment from there, as before, until a value 0 is found.
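The local recurrence and its traceback can be sketched in Python as below. Names and default scores are illustrative assumptions; when several cells share the maximal score, this version keeps the first one found.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman sketch: return (score, aligned part of s, aligned part of t)."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]    # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(0, M[i - 1][j - 1] + p, M[i - 1][j] + gap, M[i][j - 1] + gap)
            if M[i][j] > best:
                best, bi, bj = M[i][j], i, j

    out_s, out_t = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and M[i][j] > 0:       # stop at the first 0 entry
        p = match if s[i - 1] == t[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + p:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))


s = "tcccagttatgtcaggggacacgagcatgcagagac"
t = "aattgccgccgtcgttttcagcagttatgtcagatc"
print(local_align(s, t))
```

Applied to the two sequences of the global versus local comparison, the score is at least 12, the length of the shared segment cagttatgtcag.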

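Example
Local alignment of HEAGAWGHEE (columns) and PAWHEAE (rows). The values are consistent with BLOSUM50 substitution scores and a penalty of 8 per space.

         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   5   0   5   0   0   0   0   0
W    0   0   0   0   2   0  20  12   4   0   0
H    0  10   2   0   0   0  12  18  22  14   6
E    0   2  16   8   0   0   4  10  18  28  20
A    0   0   8  21  13   5   0   4  10  20  27
E    0   0   6  13  18  12   4   0   4  16  26

The maximal entry is 28; tracing back from it until a 0 is reached gives the local alignment
AWGHE
AW-HE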

End free-space alignment: Motivation
Find the best fit of substrings of s and t, where at least one of these substrings must be a prefix of the original string and one must be a suffix.
Motivation: in the shotgun sequence assembly procedure, one has a large set of partially overlapping substrings that come from many copies of one original but unknown DNA sequence. The problem is to use comparisons of pairs of substrings to infer the correct original string.

End free-space alignment
$$V(i, 0) = 0, \qquad V(0, j) = 0$$
$$V(i, j) = \max \begin{cases} V(i-1, j-1) + p(S_i, T_j) \\ V(i-1, j) - 2 \\ V(i, j-1) - 2 \end{cases}$$
$$V(i^*, n) = \max_{1 \le i \le m} V(i, n), \qquad V(m, j^*) = \max_{1 \le j \le n} V(m, j)$$
$$V(S, T) = \max \{ V(i^*, n),\ V(m, j^*) \}$$
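These equations translate into a short sketch: zeros in the first row and column, the usual recurrence inside, and the result taken as the maximum over the last row and the last column. The function name and the +1 / -1 scores for p are assumptions; the maxima here also include index 0, which only adds the empty overlap with score 0.

```python
def end_free_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best alignment score when spaces at either end of either string are free."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]    # V(i,0) = V(0,j) = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + p, V[i - 1][j] + gap, V[i][j - 1] + gap)
    best_last_col = max(V[i][n] for i in range(m + 1))   # all of t is used up
    best_last_row = max(V[m][j] for j in range(n + 1))   # all of s is used up
    return max(best_last_col, best_last_row)


# A suffix of the first read (CGT) overlaps a prefix of the second.
print(end_free_score("AAACGT", "CGTTTT"))   # 3
```

This is the comparison used in fragment assembly, where the suffix of one read is matched against the prefix of another.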

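Example
End free-space alignment of HEAGAWGHEE (columns) and PAWHEAE (rows). The values are consistent with BLOSUM50 substitution scores and a penalty of 8 per space.

         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0  -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A    0  -2  -2   4  -1   3  -4  -4  -4  -3  -2
W    0  -3  -5  -4   1  -4  18  10   2  -6  -6
H    0  10   2  -6  -6  -1  10  16  20  12   4
E    0   2  16   8   0  -7   2   8  16  26  18
A    0  -2   8  21  13   5  -3   2   8  18  25
E    0   0   4  13  18  12   4  -4   2  14  24

The maximum over the last row and the last column is 25; tracing back from it to the border gives the alignment
GAWGHEE
PAW-HEA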

Kinds of Alignment
Global alignment. INPUT: two strings S and T of roughly the same length. QUESTION: what is the similarity between the two?
Semi-global alignment. INPUT: two strings S and T. QUESTION: what is the similarity between a substring of S and T?
Local alignment. INPUT: two strings S and T. QUESTION: what is the similarity (difference) between a substring of S and a substring of T? What are these most similar substrings?
Ends free-space alignment. INPUT: two strings S and T of different length. QUESTION: what is the similarity between substrings of S and T, respectively, where at least one of these substrings must be a prefix of the original string and one (not necessarily the other) must be a suffix?

Complexity of Alignments

Problem                      Time     Space
Global alignment             O(nm)    O(n+m) (O(nm) to trace back)
Semi-global alignment        O(nm)    O(nm)
Local alignment              O(nm)    O(nm)
Ends free-space alignment    O(nm)    O(nm)

The space complexity can be a critical bottleneck. How can we improve it?
Linear-space alignment: Hirschberg's algorithm, Miller and Myers' algorithm.

Extensions to the basic algorithm
Hirschberg's linear-space method for alignment uses a divide-and-conquer strategy.

Gap penalty

Gap penalty function
Gap: a run of $k \ge 1$ consecutive spaces.
From biology we know that, when mutations are involved, a gap of k spaces is more probable than k isolated spaces. One concrete example is given by cDNA matching.
In the previous problems the cost w(k) of k internal consecutive spaces was proportional to k: $w(k) = k g$.
Now $w(k) = h + k g$, where h and g are positive penalties, h + g is the cost of the first space of a gap and g is the cost of each following one.
Example: in CA-----CTTGG the gap has 5 spaces, costing h+g, g, g, g, g, so $w(5) = h + 5g$.

Attention!
The scoring system is no longer additive, i.e. we cannot break an alignment into two parts and expect the total score to be the sum of the partial scores.
AAC---AATTCCGACTAC
ACTACCT------CGC--
The scoring of an alignment is done at the block level.

Similarities with gap
We need three matrices a, b, c, with the following meaning:
a[i,j] = maximum score of an alignment between s[1..i] and t[1..j] where s[i] is matched with t[j].
b[i,j] = maximum score of an alignment between s[1..i] and t[1..j] that ends in a "-" aligned with t[j].
c[i,j] = maximum score of an alignment between s[1..i] and t[1..j] that ends in s[i] aligned with a "-".
$$a[i,j] = p(i,j) + \max \begin{cases} a[i-1, j-1] \\ b[i-1, j-1] \\ c[i-1, j-1] \end{cases}$$
$$b[i,j] = \max \begin{cases} a[i, j-1] - (h+g) & \text{(first space)} \\ b[i, j-1] - g \\ c[i, j-1] - (h+g) \end{cases} \qquad c[i,j] = \max \begin{cases} a[i-1, j] - (h+g) & \text{(first space)} \\ b[i-1, j] - (h+g) \\ c[i-1, j] - g \end{cases}$$

Initialization:
$a[0,0] = 0$, $a[i,0] = -\infty$ for $1 \le i \le m$, $a[0,j] = -\infty$ for $1 \le j \le n$
$b[i,0] = -\infty$ for $0 \le i \le m$, $b[0,j] = -(h + gj)$ for $1 \le j \le n$
$c[i,0] = -(h + gi)$ for $1 \le i \le m$, $c[0,j] = -\infty$ for $0 \le j \le n$
Final result: the maximum among a[m,n], b[m,n] and c[m,n].
Trace back to obtain the optimal alignment, remembering the current position and which array it belongs to.
Time: O(mn). Space: 3mn = O(mn).
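The three-matrix recurrence and its initialization can be sketched as follows. The function name and the +1 / -1 pair scores are illustrative assumptions; h and g are positive penalties, as in the gap cost w(k) = h + kg, and only the score is returned (no traceback).

```python
NEG_INF = float("-inf")


def affine_similarity(s, t, h, g, match=1, mismatch=-1):
    """Global similarity with gap cost w(k) = h + k*g, using matrices a, b, c."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends with s[i] on t[j]
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends with "-" on t[j]
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends with s[i] on "-"
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)            # one gap covering t[1..j]
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)            # one gap covering s[1..i]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            b[i][j] = max(a[i][j - 1] - (h + g),      # open a new gap
                          b[i][j - 1] - g,            # extend the current gap
                          c[i][j - 1] - (h + g))      # open a new gap
            c[i][j] = max(a[i - 1][j] - (h + g),
                          b[i - 1][j] - (h + g),
                          c[i - 1][j] - g)
    return max(a[m][n], b[m][n], c[m][n])


print(affine_similarity("AAAC", "AGC", h=0, g=2))   # -1, same as the linear model
print(affine_similarity("AAAA", "AA", h=3, g=1))    # -3: one gap of two spaces
```

With h = 0 the model reduces to the linear gap penalty; with h > 0 one long gap is preferred over several short ones.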

Other gap penalty models
Constant.
Affine.
Convex: each additional space in a gap contributes less to the gap weight than the previous space (e.g. log(q)); the problem is solvable in O(nm log(m)) time.
Arbitrary: any gap weight function is acceptable; the problem is solvable in O(nm(m+n)) time.