Algoritmi e strutture di dati 2


Algoritmi e strutture di dati 2 (Algorithms and Data Structures 2). Paola Vocca. Lecture 5: Sequence Alignment

Sequence Alignment: RNA Secondary Structure

RNA Secondary Structure
RNA. String $B = b_1 b_2 \dots b_n$ over the alphabet {A, C, G, U}.
Secondary structure. RNA is single-stranded, so it tends to loop back and form base pairs with itself. This structure is essential for understanding the behavior of the molecule.
Ex: GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA
Complementary base pairs: A-U, C-G.
[Figure: the example string folded back on itself, with its base pairs marked.]

RNA Secondary Structure
Secondary structure. A set of pairs $S = \{(b_i, b_j)\}$ that satisfy:
o [Watson-Crick.] S is a matching and each pair in S is a Watson-Crick complement: A-U, U-A, C-G, or G-C.
o [No sharp turns.] The ends of each pair are separated by at least 4 intervening bases. If $(b_i, b_j) \in S$, then $i < j - 4$.
o [Non-crossing.] If $(b_i, b_j)$ and $(b_k, b_l)$ are two pairs in S, then we cannot have $i < k < j < l$.
Free energy. The usual hypothesis is that an RNA molecule forms the secondary structure with the optimum total free energy, approximated here by the number of base pairs.
Goal. Given an RNA molecule $B = b_1 b_2 \dots b_n$, find a secondary structure S that maximizes the number of base pairs.

RNA Secondary Structure: Examples
[Figure: three example structures with their base pairs drawn, labeled "ok", "sharp turn" (the ends of a pair are separated by fewer than 4 bases), and "crossing".]

RNA Secondary Structure: Subproblems
First attempt. OPT(j) = maximum number of base pairs in a secondary structure of the substring $b_1 b_2 \dots b_j$.
Suppose $b_n$ is matched with $b_t$ for some $1 \le t < n$.
Difficulty. This results in two sub-problems:
o Finding a secondary structure in $b_1 b_2 \dots b_{t-1}$. This is OPT(t-1).
o Finding a secondary structure in $b_{t+1} b_{t+2} \dots b_{n-1}$. This is not a prefix of B, so more sub-problems are needed.

Dynamic Programming Over Intervals
Notation. OPT(i, j) = maximum number of base pairs in a secondary structure of the substring $b_i b_{i+1} \dots b_j$.
o Case 1. If $i \ge j - 4$: OPT(i, j) = 0 by the no-sharp-turns condition.
o Case 2. Base $b_j$ is not involved in a pair: OPT(i, j) = OPT(i, j-1).
o Case 3. Base $b_j$ pairs with $b_t$ for some $i \le t < j - 4$. The non-crossing constraint decouples the resulting sub-problems:
$$OPT(i, j) = 1 + \max_t \{ OPT(i, t-1) + OPT(t+1, j-1) \}$$
where the max is taken over all t such that $i \le t < j - 4$ and $b_t$ and $b_j$ are Watson-Crick complements.
For $i < j - 4$, OPT(i, j) is the larger of the values given by Case 2 and Case 3.

Bottom Up Dynamic Programming Over Intervals
Q. In what order should the sub-problems be solved?
A. Do shortest intervals first.
RNA(b_1, ..., b_n) {
   for k = 5, 6, ..., n-1
      for i = 1, 2, ..., n-k
         j = i + k
         Compute M[i, j] using recurrence
   return M[1, n]
}
[Figure: table M with rows i = 1..4 and columns j = 6..9; the entries with $i \ge j - 4$ are 0.]
Running time. $O(n^3)$.
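The interval recurrence and the shortest-intervals-first order can be sketched in Python as follows. This is a minimal sketch, not the slides' own code: the function name `rna_max_pairs` and the 0-based indexing are assumptions made for illustration.

```python
def rna_max_pairs(b: str) -> int:
    """Maximum number of base pairs in a secondary structure of b (0-based indices)."""
    complements = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    n = len(b)
    if n == 0:
        return 0
    # opt[i][j] = OPT for the substring b[i..j]; intervals with j - i <= 4 stay 0 (Case 1).
    opt = [[0] * n for _ in range(n)]
    for k in range(5, n):                 # interval length minus one, shortest first
        for i in range(0, n - k):
            j = i + k
            best = opt[i][j - 1]          # Case 2: b[j] is unpaired
            for t in range(i, j - 4):     # Case 3: b[j] pairs with b[t], i <= t < j - 4
                if (b[t], b[j]) in complements:
                    left = opt[i][t - 1] if t - 1 >= i else 0
                    right = opt[t + 1][j - 1]
                    best = max(best, 1 + left + right)
            opt[i][j] = best
    return opt[0][n - 1]
```

The three nested loops over k, i, and t give the O(n^3) running time stated on the slide.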

Sequence Comparison: Sequence Alignment

Sequence Alignment
Applications.
o Basis of the Unix diff command.
o Speech recognition.
o Computational biology.
Computational biology often concerns the study of sequences: DNA sequences, RNA sequences, and protein sequences. These sequences can be viewed as strings over an alphabet:
o DNA and RNA: 4-letter alphabet.
o Proteins: 20-letter alphabet.

Sequence Comparison
Finding similarities between sequences is important in many areas of biology. For example:
o Identifying genes or proteins with a common origin makes it possible to predict their function or structure.
o Locating common subsequences in genes and/or proteins identifies common motifs.
o Locating sequences that can overlap helps in sequence assembly.

Sequence Comparison: Why?
o It is one of the most widely used computational tools in biology.
o New sequences are compared with the sequences already stored in databases.
o Similar sequences often have a similar function or origin.
o Selection operates at the system level, but mutations occur at the sequence level.
o Similarities remain recognizable across long spans of time.

Sequence Alignment
Defn: An alignment of strings S, T is a pair of strings S', T' (with spaces) s.t.
1. |S'| = |T'| (where |S| denotes the length of S), and
2. removing all spaces from S', T' leaves S, T.

Alignment Scoring
The score of aligning (characters or spaces) x and y is $\sigma(x, y)$.
Value of an alignment: $\sum_{i=1}^{|S'|} \sigma(S'[i], T'[i])$.
An optimal alignment: one of maximum value.

Optimal Alignment: A Simple Algorithm
for all subsequences A of S, B of T s.t. |A| = |B| do
   align A[i] with B[i], 1 ≤ i ≤ |A|
   align all other chars to spaces
   compute its value
   retain the max
end
output the retained alignment

Analysis
Assume |S| = |T| = n.
The cost of evaluating one alignment is O(n).
The number of possible alignments is $\binom{2n}{n}$:
o Pick n characters of S and T together.
o Suppose k of them are in S.
o Align these k with the k characters of T that were not picked.
Total time: $n \binom{2n}{n} > 2^{2n}$ for n > 3.
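The bound on the brute-force cost can be checked numerically. The snippet below is only an illustration of the inequality $n \binom{2n}{n} > 2^{2n}$; the helper name `brute_force_work` is an assumption, and the work is counted in abstract units (one unit per character examined).

```python
from math import comb

def brute_force_work(n: int) -> int:
    """n units of work for each of the C(2n, n) candidate alignments."""
    return n * comb(2 * n, n)

# The inequality fails at n = 3 and holds from n = 4 onward.
for n in range(1, 8):
    print(n, brute_force_work(n), 2 ** (2 * n), brute_force_work(n) > 2 ** (2 * n))
```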

Optimal Substructure (In More Detail)
An optimal alignment ends in 1 of 3 ways:
o last chars of S and T aligned with each other
o last char of S aligned with a space in T
o last char of T aligned with a space in S
(never align a space with a space; $\sigma(-, -) < 0$)
In each case, the rest of S and T should be optimally aligned to each other.

Optimal Alignment in $O(n^2)$ via Dynamic Programming
Input: S, T, with |S| = n, |T| = m.
Output: value of an optimal alignment.
Solvable through the intermediate sub-problems:
V(i, j) = value of an optimal alignment of S[1], ..., S[i] with T[1], ..., T[j], for all 0 ≤ i ≤ n, 0 ≤ j ≤ m.
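A minimal sketch of how V(i, j) could be filled in, following the three ways an optimal alignment can end. The scoring function `example_sigma` (match +2, mismatch -1, space -1) and both function names are assumptions chosen for illustration; they are not taken from the slides.

```python
def example_sigma(x: str, y: str) -> int:
    """Hypothetical scoring: +2 for a match, -1 for a mismatch or a space ('-')."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -1

def optimal_alignment_value(s: str, t: str, sigma=example_sigma) -> int:
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a prefix aligned against the empty string is all spaces.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    # General case: best of the three possible last columns.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # last chars aligned
                V[i - 1][j] + sigma(s[i - 1], "-"),           # last char of S vs space
                V[i][j - 1] + sigma("-", t[j - 1]),           # last char of T vs space
            )
    return V[n][m]
```

Each of the (n+1)(m+1) entries is computed in constant time, which gives the O(nm) bound.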

Base Cases

General Case

Calculating One Entry

Example

Example

Complexity Notes
o Time: O(mn) (both for the value of the alignment and for the alignment itself).
o Space: O(mn) (both for the value of the alignment and for the alignment itself).
o It is easy to compute the value alone in O(mn) time and O(min{m, n}) space.
o It is possible to compute both the value and the alignment in O(mn) time and O(min{m, n}) space (next slides).

Sequence Comparison: String Similarity

String Similarity
How similar are two strings?
o ocurrance
o occurrence

o c u r r a n c e -
o c c u r r e n c e
6 mismatches, 1 gap

o c - u r r a n c e
o c c u r r e n c e
1 mismatch, 1 gap

o c - u r r - a n c e
o c c u r r e - n c e
0 mismatches, 3 gaps

Compared with the sequence alignment seen so far, this is a more general case. It takes into account:
o gaps (a character aligned with a space),
o mismatches,
o matches.
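The mismatch and gap counts of the three alignments can be reproduced with a few lines of Python. The function name `count_mismatches_and_gaps` and the use of "-" for a space are assumptions made for this sketch.

```python
def count_mismatches_and_gaps(top: str, bottom: str) -> tuple[int, int]:
    """Count mismatched columns and gap columns of an alignment written as two rows."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1          # a character aligned with a space
        elif x != y:
            mismatches += 1    # two different characters aligned
    return mismatches, gaps

print(count_mismatches_and_gaps("ocurrance-", "occurrence"))    # (6, 1)
print(count_mismatches_and_gaps("oc-urrance", "occurrence"))    # (1, 1)
print(count_mismatches_and_gaps("oc-urr-ance", "occurre-nce"))  # (0, 3)
```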

Edit Distance
Edit distance. [Levenshtein 1966, Needleman-Wunsch 1970]
o Gap penalty $\delta$; mismatch penalty $\alpha_{pq}$.
o Cost = sum of gap and mismatch penalties.

C T G A C C T A C C T
C C T G A C T A C A T
Cost: $\alpha_{TC} + \alpha_{GT} + \alpha_{AG} + 2\alpha_{CA}$

- C T G A C C T A C C T
C C T G A C - T A C A T
Cost: $2\delta + \alpha_{CA}$

Sequence Alignment
Goal: Given two strings $X = x_1 x_2 \dots x_m$ and $Y = y_1 y_2 \dots y_n$, find an alignment of minimum cost.
Def. An alignment M is a set of ordered pairs $x_i$-$y_j$ such that each item occurs in at most one pair and there are no crossings.
Def. The pairs $x_i$-$y_j$ and $x_{i'}$-$y_{j'}$ cross if $i < i'$ but $j > j'$.
$$\text{cost}(M) = \underbrace{\sum_{(x_i, y_j) \in M} \alpha_{x_i y_j}}_{\text{mismatch}} + \underbrace{\sum_{i:\, x_i \text{ unmatched}} \delta + \sum_{j:\, y_j \text{ unmatched}} \delta}_{\text{gap}}$$
Ex: CTACCG vs. TACATG.
Sol: M = $x_2$-$y_1$, $x_3$-$y_2$, $x_4$-$y_3$, $x_5$-$y_4$, $x_6$-$y_6$.
x1 x2 x3 x4 x5    x6
C  T  A  C  C  -  G
-  T  A  C  A  T  G
   y1 y2 y3 y4 y5 y6
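The cost definition can be evaluated directly from a set of pairs. In this sketch the pairs are 1-based, and the concrete penalties ($\delta = 2$, mismatch cost 1, match cost 0) and function names are assumptions chosen only to make the example computable.

```python
def unit_mismatch(p: str, q: str) -> int:
    """Hypothetical mismatch penalty: 0 for equal symbols, 1 otherwise."""
    return 0 if p == q else 1

def is_alignment(matching: list[tuple[int, int]]) -> bool:
    """Each item occurs in at most one pair and no two pairs cross."""
    pairs = sorted(matching)
    return all(a[0] < b[0] and a[1] < b[1] for a, b in zip(pairs, pairs[1:]))

def alignment_cost(x: str, y: str, matching, delta: int, alpha=unit_mismatch) -> int:
    if not is_alignment(matching):
        raise ValueError("pairs overlap or cross")
    mismatch = sum(alpha(x[i - 1], y[j - 1]) for i, j in matching)
    unmatched = (len(x) - len(matching)) + (len(y) - len(matching))
    return mismatch + delta * unmatched

M = [(2, 1), (3, 2), (4, 3), (5, 4), (6, 6)]
print(alignment_cost("CTACCG", "TACATG", M, delta=2))  # alpha_CA + 2*delta = 5
```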

Designing the Dynamic Programming
FACT. Let M be any alignment of X and Y. If (m, n) is not in M, then either $x_m$ is not matched in M or $y_n$ is not matched in M.
Proof. Otherwise a crossing would occur: if $x_m$ were matched with some $y_j$, j < n, and $y_n$ with some $x_i$, i < m, then i < m but n > j.

Sequence Alignment: Problem Structure
Def. OPT(i, j) = min cost of aligning strings $x_1 x_2 \dots x_i$ and $y_1 y_2 \dots y_j$.
o Case 1: OPT matches $x_i$-$y_j$. Pay the mismatch cost for $x_i$-$y_j$ plus the min cost of aligning $x_1 x_2 \dots x_{i-1}$ and $y_1 y_2 \dots y_{j-1}$.
o Case 2a: OPT leaves $x_i$ unmatched. Pay the gap cost for $x_i$ plus the min cost of aligning $x_1 x_2 \dots x_{i-1}$ and $y_1 y_2 \dots y_j$.
o Case 2b: OPT leaves $y_j$ unmatched. Pay the gap cost for $y_j$ plus the min cost of aligning $x_1 x_2 \dots x_i$ and $y_1 y_2 \dots y_{j-1}$.
$$OPT(i, j) = \begin{cases} j\delta & \text{if } i = 0 \\ i\delta & \text{if } j = 0 \\ \min \begin{cases} \alpha_{x_i y_j} + OPT(i-1, j-1) \\ \delta + OPT(i-1, j) \\ \delta + OPT(i, j-1) \end{cases} & \text{otherwise} \end{cases}$$

Sequence Alignment: Algorithm
Sequence-Alignment(m, n, x_1 x_2 ... x_m, y_1 y_2 ... y_n, δ, α) {
   for i = 0 to m
      M[i, 0] = iδ
   for j = 0 to n
      M[0, j] = jδ
   for i = 1 to m
      for j = 1 to n
         M[i, j] = min(α[x_i, y_j] + M[i-1, j-1],
                       δ + M[i-1, j],
                       δ + M[i, j-1])
   return M[m, n]
}
Analysis. Θ(mn) time and space.
English words or sentences: m, n ≤ 10.
Computational biology: m = n = 100,000. 10 billion operations is OK, but a 10 GB array?
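The pseudocode above translates almost line by line into Python. The traceback that recovers the matched pairs is an addition for illustration, and the names `sequence_alignment` and `unit_mismatch` are assumptions. Penalties are taken to be integers so that the equality tests in the traceback are exact.

```python
def unit_mismatch(p: str, q: str) -> int:
    """Hypothetical mismatch penalty: 0 for equal symbols, 1 otherwise."""
    return 0 if p == q else 1

def sequence_alignment(x: str, y: str, delta: int, alpha=unit_mismatch):
    """Return (min cost, list of 1-based matched pairs (i, j))."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta
    for j in range(n + 1):
        M[0][j] = j * delta
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(alpha(x[i - 1], y[j - 1]) + M[i - 1][j - 1],
                          delta + M[i - 1][j],
                          delta + M[i][j - 1])
    # Traceback: walk from (m, n) back to a border, re-checking which case was used.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if M[i][j] == alpha(x[i - 1], y[j - 1]) + M[i - 1][j - 1]:
            pairs.append((i, j))
            i, j = i - 1, j - 1
        elif M[i][j] == delta + M[i - 1][j]:
            i -= 1               # x_i left unmatched
        else:
            j -= 1               # y_j left unmatched
    pairs.reverse()
    return M[m][n], pairs
```

The full table M is what makes the traceback possible, and it is also the source of the quadratic space requirement noted in the analysis.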

Sequence Comparison: Sequence Alignment in Linear Space

Sequence Alignment: Linear Space
Q. Can we avoid using quadratic space?
Easy. Optimal value in O(m + n) space and O(mn) time.
o Compute OPT(i, ·) from OPT(i-1, ·).
o No longer a simple way to recover the alignment itself.
Theorem. [Hirschberg 1975] Optimal alignment in O(m + n) space and O(mn) time.
o Clever combination of divide-and-conquer and dynamic programming.
o Inspired by an idea of Savitch from complexity theory.
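The "easy" part, computing only the optimal value while keeping two rows of the table, might look like the sketch below. The function names and the integer penalties are assumptions made for illustration.

```python
def unit_mismatch(p: str, q: str) -> int:
    """Hypothetical mismatch penalty: 0 for equal symbols, 1 otherwise."""
    return 0 if p == q else 1

def alignment_cost_two_rows(x: str, y: str, delta: int, alpha=unit_mismatch) -> int:
    """Optimal alignment cost, storing only OPT(i-1, .) and OPT(i, .)."""
    prev = [j * delta for j in range(len(y) + 1)]       # row OPT(0, .)
    for i in range(1, len(x) + 1):
        cur = [i * delta] + [0] * len(y)                # OPT(i, 0) = i * delta
        for j in range(1, len(y) + 1):
            cur[j] = min(alpha(x[i - 1], y[j - 1]) + prev[j - 1],
                         delta + prev[j],
                         delta + cur[j - 1])
        prev = cur                                      # older rows are discarded
    return prev[len(y)]
```

Because earlier rows are thrown away, the alignment itself can no longer be read off by a traceback, which is the gap that Hirschberg's theorem closes.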

Sequence Alignment: Linear Space
Edit distance graph.
o Let f(i, j) be the length of the shortest path from (0, 0) to (i, j).
o Observation: f(i, j) = OPT(i, j).
[Figure: edit distance graph, a grid with rows $x_1, x_2, x_3$ and columns $y_1, \dots, y_6$, from node 0-0 to node m-n; the diagonal edge into node i-j has cost $\alpha_{x_i y_j}$.]

Sequence Alignment: Linear Space
Edit distance graph.
o Let f(i, j) be the length of the shortest path from (0, 0) to (i, j).
o Can compute f(·, j) for any j in O(mn) time and O(m + n) space (using the previous column).
[Figure: edit distance graph with column j highlighted.]

Sequence Alignment: Linear Space
Edit distance graph.
o Let g(i, j) be the length of the shortest path from (i, j) to (m, n).
o Can compute g by reversing the edge orientations and inverting the roles of (0, 0) and (m, n).
[Figure: edit distance graph with reversed edges; the diagonal edge at node i-j has cost $\alpha_{x_i y_j}$.]

Sequence Alignment: Linear Space
Edit distance graph.
o Let g(i, j) be the length of the shortest path from (i, j) to (m, n).
o Can compute g(·, j) for any j in O(mn) time and O(m + n) space.
[Figure: edit distance graph with reversed edges and column j highlighted.]

Sequence Alignment: Linear Space
Observation 1. The cost of the shortest path that uses (i, j) is f(i, j) + g(i, j).
[Figure: edit distance graph with a path from 0-0 through node i-j to m-n.]

Sequence Alignment: Linear Space
Proof. Let $\ell_{ij}$ denote the length of the shortest corner-to-corner path in $G_{XY}$ that passes through (i, j).
o Any such path must get from (0, 0) to (i, j) and then from (i, j) to (m, n).
o Its length is therefore at least f(i, j) + g(i, j), so $\ell_{ij} \ge f(i, j) + g(i, j)$.
o On the other hand, consider the corner-to-corner path that consists of a minimum-length path from (0, 0) to (i, j), followed by a minimum-length path from (i, j) to (m, n).
o This path has length f(i, j) + g(i, j), and so $\ell_{ij} \le f(i, j) + g(i, j)$.
Hence $\ell_{ij} = f(i, j) + g(i, j)$.

Sequence Alignment: Linear Space
Observation 2. Let q be an index that minimizes f(q, n/2) + g(q, n/2). Then there is a shortest path from (0, 0) to (m, n) that uses (q, n/2).
[Figure: edit distance graph with column n/2 highlighted and node (q, n/2) marked.]

Sequence Alignment: Linear Space
Divide: find the index q that minimizes f(q, n/2) + g(q, n/2) using DP.
o Align $x_q$ and $y_{n/2}$.
Conquer: recursively compute an optimal alignment in each piece.
[Figure: edit distance graph split at node (q, n/2) into an upper-left and a lower-right sub-problem.]
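The divide-and-conquer scheme can be sketched as follows. This is an illustrative version, not the slides' code: it returns the two aligned rows (with "-" for gaps) rather than a list of path nodes, it assumes integer penalties and a hypothetical `unit_mismatch` cost, and its string slicing copies data, so it is not a strictly O(m + n)-space implementation.

```python
def unit_mismatch(p: str, q: str) -> int:
    """Hypothetical mismatch penalty: 0 for equal symbols, 1 otherwise."""
    return 0 if p == q else 1

def last_row(a: str, b: str, delta: int, alpha) -> list[int]:
    """row[j] = min cost of aligning all of a with b[:j], keeping two rows only."""
    prev = [j * delta for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * delta] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(alpha(a[i - 1], b[j - 1]) + prev[j - 1],
                         delta + prev[j], delta + cur[j - 1])
        prev = cur
    return prev

def full_dp_alignment(x: str, y: str, delta: int, alpha) -> tuple[str, str]:
    """Standard table-based alignment with traceback; used only on tiny pieces."""
    m, n = len(x), len(y)
    M = [[(i + j) * delta if i == 0 or j == 0 else 0 for j in range(n + 1)]
         for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(alpha(x[i - 1], y[j - 1]) + M[i - 1][j - 1],
                          delta + M[i - 1][j], delta + M[i][j - 1])
    top, bottom, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == alpha(x[i - 1], y[j - 1]) + M[i - 1][j - 1]:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == delta + M[i - 1][j]:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

def hirschberg(x: str, y: str, delta: int, alpha=unit_mismatch) -> tuple[str, str]:
    if len(x) == 0 or len(y) <= 1:
        return full_dp_alignment(x, y, delta, alpha)    # table has O(m + n) cells here
    mid = len(y) // 2
    swapped = lambda p, q: alpha(q, p)                  # last_row sees (y char, x char)
    f = last_row(y[:mid], x, delta, swapped)            # f[q] = OPT(x[:q], y[:mid])
    g_rev = last_row(y[mid:][::-1], x[::-1], delta, swapped)  # g_rev[k] = OPT(x[m-k:], y[mid:])
    m = len(x)
    q = min(range(m + 1), key=lambda k: f[k] + g_rev[m - k])  # divide step
    left = hirschberg(x[:q], y[:mid], delta, alpha)     # conquer: upper-left piece
    right = hirschberg(x[q:], y[mid:], delta, alpha)    # conquer: lower-right piece
    return left[0] + right[0], left[1] + right[1]
```

Splitting on the middle column of Y mirrors the n/2 split on the slide; the backward costs are obtained by running the same forward routine on the reversed strings.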

Sequence Alignment: Running Time Analysis Warmup
Theorem. Let T(m, n) = max running time of the algorithm on strings of length at most m and n. Then T(m, n) = O(mn log n).
$$T(m, n) \le 2\,T(m, n/2) + O(mn) \;\Rightarrow\; T(m, n) = O(mn \log n)$$
Remark. The analysis is not tight because the two sub-problems are of size (q, n/2) and (m - q, n/2). In the next slide, we save the log n factor.

Sequence Alignment: Running Time Analysis
Theorem. Let T(m, n) = max running time of the algorithm on strings of length at most m and n. Then T(m, n) = O(mn).
Pf. (by induction on n)
o O(mn) time to compute f(·, n/2) and g(·, n/2) and find the index q.
o T(q, n/2) + T(m - q, n/2) time for the two recursive calls.
o Choose the constant c so that:
$$T(m, 2) \le cm, \qquad T(2, n) \le cn, \qquad T(m, n) \le cmn + T(q, n/2) + T(m - q, n/2)$$
o Base cases: m = 2 or n = 2.
o Inductive hypothesis: $T(m, n) \le 2cmn$.
$$\begin{aligned} T(m, n) &\le T(q, n/2) + T(m - q, n/2) + cmn \\ &\le 2cq\,n/2 + 2c(m - q)\,n/2 + cmn \\ &= cqn + cmn - cqn + cmn \\ &= 2cmn \end{aligned}$$

Sequence Comparison: Local Alignments and Gaps

Variations
Local Alignment
o The preceding gives a global alignment, i.e. over the full length of both strings.
o It might well miss strong similarity of part of the strings amidst dissimilar flanks.
Gap Penalties
o Should 10 adjacent spaces cost 10 x one space?
Many others.

Local Alignment: Motivations
Interesting (evolutionarily conserved, functionally related) segments may be a small part of the whole:
o Active site of a protein.
o Scattered genes or exons amidst junk, e.g. retroviral insertions, large deletions.
o We don't have the whole sequence.
Global alignment might miss them if the flanking junk outweighs the similar regions.

Local Alignment
Optimal local alignment of strings S and T: find substrings A of S and B of T whose global alignment has maximum value.
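One standard way to solve this in O(nm) time is the Smith-Waterman recurrence, in which every table entry may also restart at 0. Using it here is an assumption about the intended method, and the scoring function `example_sigma` (match +2, mismatch -1, space -1) is hypothetical as well.

```python
def example_sigma(x: str, y: str) -> int:
    """Hypothetical scoring: +2 for a match, -1 for a mismatch or a space ('-')."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -1

def local_alignment_value(s: str, t: str, sigma=example_sigma) -> int:
    """Max value of a global alignment between any substring of s and any substring of t."""
    best = 0
    prev = [0] * (len(t) + 1)          # row 0: empty substrings score 0
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            cur[j] = max(
                0,                                          # start a fresh local alignment
                prev[j - 1] + sigma(s[i - 1], t[j - 1]),    # extend with two characters
                prev[j] + sigma(s[i - 1], "-"),             # extend with a space in t
                cur[j - 1] + sigma("-", t[j - 1]),          # extend with a space in s
            )
            best = max(best, cur[j])                        # best alignment may end anywhere
        prev = cur
    return best

print(local_alignment_value("xxabcxx", "yyabcyy"))  # 6: "abc" against "abc"
```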

The Obvious Local Alignment Algorithm

Local Alignment in O(nm) via Dynamic Programming

Base Cases

General Case Recurrences

Scoring Local Alignments

Finding Local Alignments

Notes

Alignment With Gap Penalties

Alignment with Gaps
Alignment I:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA
Alignment II:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA

Gap Penalties

Global Alignment with Affine Gap Penalties

Affine Gap Algorithm

Gaps
Both alignments have the same number of matches and spaces, but alignment II seems better.
Definition: A gap is any maximal, consecutive run of spaces in a single string. The length of a gap is the number of spaces in it.
Example I has 11 gaps while example II has only 2 gaps.
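The definition of a gap as a maximal run of spaces is easy to make concrete. The helper names below are assumptions, and "-" stands for a space; the example uses the two rows of alignment II.

```python
from itertools import groupby

def gap_lengths(row: str) -> list[int]:
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [sum(1 for _ in run) for symbol, run in groupby(row) if symbol == "-"]

def count_gaps(top: str, bottom: str) -> int:
    """Total number of gaps over both rows of an alignment."""
    return len(gap_lengths(top)) + len(gap_lengths(bottom))

top_ii = "AACAATTAAGACTACGTTCATGAC---"
bottom_ii = "AACAATT--------GTTCATGACGCA"
print(count_gaps(top_ii, bottom_ii))  # 2
```

Counting gaps rather than individual spaces is what a gap-based penalty needs: a long run of spaces is charged as one event instead of many.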

Biological Motivation
Number of mutational events
o A single gap is due to a single event that removed a number of residues.
o Each separate gap is due to a distinct, independent event.
Protein structure
o Protein secondary structure consists of alpha helices, beta sheets, and loops.
o Loops of varying size can lead to very similar structure.

Alignment in Real Life
o One of the major uses of alignments is to find sequences in a database.
o Such collections contain a massive number of sequences (on the order of $10^6$).
o Finding homologies in these databases with dynamic programming can take too long.

Heuristic Search
o Instead, most searches rely on heuristic procedures.
o These are not guaranteed to find the best match.
o Sometimes they will completely miss a high-scoring match.
We now describe the main ideas used by some of these procedures. Actual implementations often contain additional tricks and hacks.

Basic Intuition
o Almost all heuristic search procedures are based on the observation that real-life matches often contain long strings with gapless matches.
o These heuristics try to find significant gapless matches and then extend them.
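A toy illustration of the find-then-extend idea: index the k-letter words of one string, look up the words of the other, and grow each hit in both directions while the characters keep matching. This is a deliberately simplified sketch under stated assumptions (exact matches only, hypothetical function name and seed length k); it does not reproduce the algorithm of any particular search tool.

```python
from collections import defaultdict

def seed_and_extend(query: str, target: str, k: int) -> tuple[int, int, int]:
    """Return (length, query_start, target_start) of the longest gapless exact match
    that contains at least one shared k-letter seed; (0, 0, 0) if there is none."""
    index = defaultdict(list)
    for t in range(len(target) - k + 1):
        index[target[t:t + k]].append(t)          # positions of every k-mer in target
    best = (0, 0, 0)
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], ()):   # seed hit: same k-mer in both strings
            lo_q, lo_t = q, t
            while lo_q > 0 and lo_t > 0 and query[lo_q - 1] == target[lo_t - 1]:
                lo_q -= 1; lo_t -= 1              # extend to the left
            hi_q, hi_t = q + k, t + k
            while hi_q < len(query) and hi_t < len(target) and query[hi_q] == target[hi_t]:
                hi_q += 1; hi_t += 1              # extend to the right
            if hi_q - lo_q > best[0]:
                best = (hi_q - lo_q, lo_q, lo_t)
    return best

print(seed_and_extend("xxGATTACAyy", "zzzGATTACAww", 3))  # (7, 2, 3)
```

Only positions that share a seed are ever examined, which is why such searches are fast, and also why a match without any exact k-letter word in common is missed entirely.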