Computing the Optimal Global Alignment Value. Sequences A and B with $|A| = |B| = n$. Classical Dynamic Programming: $O(n^2)$


Alignment Graph. Alignment Matrix.

Computing the Optimal Global Alignment Value (An Introduction to Bioinformatics Algorithms)
[Figure: alignment grid for sequences A and B, $|A| = |B| = n$, with rows and columns indexed 0 to 9.]
Classical Dynamic Programming: $O(n^2)$

The $O(n^2)$ Time, Classical Dynamic Programming Algorithm
The Alignment Graph. [Figure: alignment graph for A and B, $|A| = |B| = n$; each cell has input vertices $I_1, I_2, I_3$ and one output vertex $O$.]
Each cell computes $O = \max_{1 \le x \le 3} (I_x + \mathrm{edge}[I_x, O])$.
Can the quadratic complexity of the optimal alignment value computation be reduced without relaxing the problem?
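The single-cell recurrence above can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the function name `global_alignment_score` and the scoring parameters (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions, since the edge scores in the figure did not survive transcription.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment value of a and b in O(len(a) * len(b)) time.

    The scoring scheme (match, mismatch, gap) is an assumed example.
    """
    # prev holds the scores of the previous row of the alignment grid.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            # Each cell takes the best of its three incoming edges.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]
```

Only two rows are kept at a time, which is enough when just the alignment value (not the alignment itself) is needed.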

Is it Possible to Align Sequences in Subquadratic Time?
Dynamic Programming takes $O(n^2)$ for global alignment. Can we do better?
$O(h n^2 / \log n)$, $h \le 1$. Landau, Crochemore & Ziv-Ukelson 2003.
Techniques: (1) Compress the sequences. (2) Utilize the Total Monotonicity of DIST.

[Figure: the alignment graph partitioned into blocks.] Instead of $O(n^2)$ vertices, only the block borders are computed: $O(hn / \log n)$ rows of $n$ vertices plus $O(hn / \log n)$ columns of $n$ vertices.

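Standard, single-cell DP: $O = \max_{1 \le x \le 3} (I_x + \mathrm{edge}[I_x, O])$.
New, extended-cell DP: $O_4 = \max_{1 \le x \le 6} (I_x + \mathrm{DIST}[x,4])$.
[Figure: a single cell with inputs $I_1, I_2, I_3$ next to an extended block with input border $I_1, \dots, I_6$ and output border vertex $O_4$.]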

Computing the Score for Output Border Vertex $O_4$

Input    DIST         Input + DIST
I_1      DIST[1,4]    I_1 + DIST[1,4]
I_2      DIST[2,4]    I_2 + DIST[2,4]
I_3      DIST[3,4]    I_3 + DIST[3,4]
I_4      DIST[4,4]    I_4 + DIST[4,4]
I_5      DIST[5,4]    I_5 + DIST[5,4]
I_6      DIST[6,4]    I_6 + DIST[6,4]

$O_4$ is the maximum of the last column: $O_4 = \max_{1 \le x \le 6} (I_x + \mathrm{DIST}[x,4])$.

The DIST Matrix. [Figure: input vector I ($I_1, \dots, I_6$), the DIST matrix of block G, and the output vector O.]
$\mathrm{OUT}[x,j] = I_x + \mathrm{DIST}[x,j]$
The Main Challenges:
How to obtain the DIST for G in $O(t)$ time? (Take advantage of the incremental nature of LZ78 parsing.)
How to compute the column maxima of OUT in $O(t)$ time? (Utilize the Total Monotonicity Property of OUT.)

How does Total Monotonicity affect Column Maxima behavior?
For all $a < b$ and $c < d$: $\mathrm{OUT}[a,c] \le \mathrm{OUT}[b,c] \Rightarrow \mathrm{OUT}[a,d] \le \mathrm{OUT}[b,d]$.
Consequence: the row indices of the column maxima are monotonically non-decreasing.
SMAWK Matrix Searching [Aggarwal et al. 87]: the $t$ column maxima of a Totally Monotone array can be computed in $O(t)$ time, by querying only $O(t)$ elements.
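To see how monotone row indices shrink the search, here is a sketch of a simpler divide-and-conquer search. It is not SMAWK and does not reach the $O(t)$ bound quoted above; it evaluates roughly (rows + columns) * log(columns) entries. The function name `column_maxima_rows` and the callback `value(r, c)` are hypothetical names chosen for this illustration.

```python
def column_maxima_rows(value, n_rows, n_cols):
    """Row index of the bottommost maximum in every column.

    Assumes value(r, c) describes a totally monotone matrix, so these
    row indices are non-decreasing from left to right.
    """
    best = [0] * n_cols

    def solve(col_lo, col_hi, row_lo, row_hi):
        if col_lo > col_hi:
            return
        mid = (col_lo + col_hi) // 2
        arg = row_lo
        for r in range(row_lo, row_hi + 1):
            # ">=" keeps the bottommost maximum, which is the monotone choice.
            if value(r, mid) >= value(arg, mid):
                arg = r
        best[mid] = arg
        # Columns left of mid have their maximum at or above row arg,
        # columns right of mid at or below it.
        solve(col_lo, mid - 1, row_lo, arg)
        solve(mid + 1, col_hi, arg, row_hi)

    if n_rows > 0:
        solve(0, n_cols - 1, 0, n_rows - 1)
    return best
```

For example, the matrix with entries `-(r - c) ** 2` is totally monotone in the sense above, and its column maxima lie on the diagonal.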

Accessing Prefix Blocks in Constant Time. [Figure: trie for A and trie for B; block G (5,4) together with its left prefix (5,2), diagonal prefix (3,2) and top prefix (3,4).]

Another Technique to Align Sequences in Subquadratic Time? For limited edit scoring schemes, such as LCS, use the Four-Russians Speedup.

How Many Points Of Interest? With LZ-78 compression blocks of size $t$, there are $O(n^2/t)$ points of interest: $n/t$ rows with $n$ vertices each and $n/t$ columns with $n$ vertices each.

The Four-Russians technique for speeding up dynamic programming. Dan Gusfield: The idea comes from a paper by four authors concerning boolean matrix multiplication. The general idea taken from this paper has come to be known in the West as The Four-Russians technique, even though only one of the authors is Russian.

Arlazarov, Dinic, Kronrod and Faradzev. Masek & Paterson applied the Four Russians to the string edit problem.

Partitioning the Alignment Grid into Blocks of Equal Size $t$. [Figure: an $n \times n$ grid partitioned into $(n/t) \times (n/t)$ blocks of size $t \times t$.]

Block Alignment Problem
Goal: Find the longest block path through an edit graph.
Input: Two sequences, u and v, partitioned into blocks of size $t$. This is equivalent to an $n \times n$ edit graph partitioned into $t \times t$ subgrids.
Output: The block alignment of u and v with the maximum score (longest block path through the edit graph).

Stage 1: compute the mini-alignments. Solve a mini-alignment problem for every block pair (each small square in the figure represents one block pair). How many blocks? $(n/t) \cdot (n/t) = n^2/t^2$.

Stage 2: dynamic programming. Let $s_{i,j}$ denote the optimal block alignment score between the first $i$ blocks of u and the first $j$ blocks of v:
$s_{i,j} = \max \{\, s_{i-1,j} - \sigma_{block},\; s_{i,j-1} - \sigma_{block},\; s_{i-1,j-1} + \beta_{i,j} \,\}$
$\sigma_{block}$ is the penalty for inserting or deleting an entire block.
$\beta_{i,j}$ is the score of the pair of blocks in row $i$ and column $j$.
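The Stage 2 recurrence can be sketched directly over a precomputed table of block-pair scores. In this hedged example the names `block_alignment_score`, `beta` and `sigma_block` are illustrative, and the boundary conditions (score 0 at the origin, one block penalty per skipped block along the first row and column) are an assumption, since the slide does not state them.

```python
def block_alignment_score(beta, sigma_block):
    """Best block alignment score given block-pair scores beta[i][j].

    beta is an (n/t) x (n/t) list of lists; sigma_block is the penalty
    for inserting or deleting a whole block.
    """
    rows = len(beta)
    cols = len(beta[0]) if rows else 0
    s = [[0] * (cols + 1) for _ in range(rows + 1)]
    # Assumed boundary: skipping k blocks costs k * sigma_block.
    for i in range(1, rows + 1):
        s[i][0] = s[i - 1][0] - sigma_block
    for j in range(1, cols + 1):
        s[0][j] = s[0][j - 1] - sigma_block
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s[i][j] = max(
                s[i - 1][j] - sigma_block,
                s[i][j - 1] - sigma_block,
                s[i - 1][j - 1] + beta[i - 1][j - 1],
            )
    return s[rows][cols]
```

The double loop touches one cell per block pair, which is where the count of $n^2/t^2$ steps comes from once the block-pair scores are available.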

Block Alignment Runtime. Indices $i, j$ range from 0 to $n/t$. The running time of the algorithm is $O([n/t] \cdot [n/t]) = O(n^2/t^2)$ if we don't count the time to compute each $\beta_{i,j}$.

Block Alignment Runtime (cont'd). Computing all $\beta_{i,j}$ requires solving $(n/t) \cdot (n/t) = n^2/t^2$ mini block alignments, each of size $t \cdot t = t^2$. So computing all $\beta_{i,j}$ takes time $O(n^2/t^2 \cdot t^2) = O(n^2)$. This is the same as dynamic programming. How do we speed this up?

Four Russians Technique. Let the block size $t$ be on the order of $\log n$, where $n$ is the sequence size. Instead of having $(n/t) \cdot (n/t) = n^2/t^2$ mini-alignments, construct $4^t \times 4^t$ mini-alignments for all pairs of strings of $t$ nucleotides (huge size), and put them in a lookup table. However, the size of the lookup table is not really that huge if $t$ is small. Let $t = (\log n)/4$, with logarithms to base 2. Then $4^t \times 4^t = 4^{(\log n)/4} \times 4^{(\log n)/4} = 4^{(\log n)/2} = 2^{\log n} = n$.
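A quick numeric check of the table-size arithmetic, as a sketch. The helper name `lookup_table_entries` is hypothetical, and the example assumes $n$ is a power of two whose exponent is divisible by 4, so that $t = (\log_2 n)/4$ is an integer.

```python
import math


def lookup_table_entries(n):
    """Return (t, number of table entries) for t = log2(n) / 4."""
    t = int(math.log2(n)) // 4
    # One entry per ordered pair of t-nucleotide strings over {A, C, G, T}.
    return t, (4 ** t) * (4 ** t)


t, entries = lookup_table_entries(2 ** 16)
print(t, entries)  # 4 65536, and 65536 == 2 ** 16
```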

Look-up Table for Four Russians Technique. [Figure: lookup table Score whose rows and columns are indexed by all strings of $t$ nucleotides (AAAAAA, AAAAAC, AAAAAG, AAAAAT, AAAACA, ...); each sequence has $t$ nucleotides.]
The table size is only $n$, instead of $(n/t) \cdot (n/t)$. Let $t = (\log n)/4$. Then the number of entries in the lookup table is $4^t \times 4^t = n$.
Computing the score for each entry in the table requires dynamic programming for a $(\log n)$ by $(\log n)$ alignment: $(\log n)^2$ time. Altogether: $n (\log n)^2$.

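New Recurrence. The new lookup table Score is indexed by a pair of $t$-nucleotide strings, so
$s_{i,j} = \max \{\, s_{i-1,j} - \sigma_{block},\; s_{i,j-1} - \sigma_{block},\; s_{i-1,j-1} + \mathrm{Score}(i\text{-th block of } v,\ j\text{-th block of } u) \,\}$
Each lookup of Score takes $O(\log n)$ time.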

Four Russians Speedup Runtime. Since computing the lookup table Score of size $n$ takes $O(n (\log n)^2)$ time, the running time is mainly limited by the $n^2/t^2$ accesses to the lookup table. Each access takes $O(\log n)$ time. Overall running time: $O([n^2/t^2] \cdot \log n)$. Since $t$ is proportional to $\log n$, substitute in: $O([n^2/(\log n)^2] \cdot \log n) = O(n^2 / \log n)$.

So Far (restricted to block alignment). We can divide up the grid into blocks and run dynamic programming only on the corners of these blocks. In order to speed up the mini-alignment calculations to under $n^2$, we create a lookup table of size $n$, which consists of all scores for all $t$-nucleotide pairs. Running time goes from quadratic, $O(n^2)$, to subquadratic: $O(n^2 / \log n)$.

Four Russians Speedup for LCS. Unlike the block partitioned graph, the LCS path does not have to pass through the vertices of the blocks. [Figure: block alignment versus longest common subsequence.]

Block Alignment vs. LCS. In block alignment, we only care about the corners of the blocks. In LCS, we care about all points on the edges of the blocks, because those are points that the path can traverse. Recall, each sequence is of length $n$ and each block is of size $t$, so each sequence has $n/t$ blocks.

How Many Points Of Interest?
Block alignment: how many blocks? $(n/t) \cdot (n/t) = n^2/t^2$.
Longest common subsequence: how many points of interest? $O(n^2/t)$: $n/t$ rows with $n$ vertices each and $n/t$ columns with $n$ vertices each.

Traversing Blocks for LCS (cont'd). If we used regular dynamic programming to compute the grid, it would take quadratic, $O(n^2)$, time, but we want to do better. Use the Four Russians Tabulation! [Figure: a $t \times t$ block; we know the scores on its top and left borders, and we can calculate the scores on its bottom and right borders.]

I = ( (, 2, 3), (2, 3, 4), bc, bb) O = (4, 3, 3, 3, 2, 2, 3)

I = (,, 2, 3), (, 2, 3, 4), bc, bb)   O = (4, 3, 3, 3, 2, 2, 3)
Number of possible inputs: $n^t \cdot n^t \cdot 4^t \cdot 4^t = (4n)^{2t}$. This will be a huge table! We need another trick.

The Longest Common Subsequence
T = B C B A D B D C D
S = A B C B D B D D
X = LCS(S,T) = BCBDBDD
L = |LCS(S,T)| = |BCBDBDD| = 7
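The example can be reproduced with the classical quadratic dynamic program. This is a minimal sketch (the function name `lcs` is an illustrative choice) that fills the full table and then traces back one longest common subsequence.

```python
def lcs(s, t):
    """One longest common subsequence of s and t, in O(len(s) * len(t)) time."""
    m, n = len(s), len(t)
    # table[i][j] = LCS length of s[:i] and t[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Trace back from the bottom-right corner.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("ABCBDBDD", "BCBADBDCD"))  # BCBDBDD
```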

The LCS Alignment Graph. [Figure: alignment graph with $|S| = m$ rows and $|T| = n$ columns.]
Diagonal blue arrows are match points $\{(i,j) \mid S[i] = T[j]\}$, assigned a score of 1.
Horizontal black arrows are deletions from T, assigned a score of 0.
Vertical black arrows are deletions from S, assigned a score of 0.
Classical Dynamic Programming: $O(nm)$ (Crochemore, Landau, Ziv-Ukelson: $O(nm / \log m)$).

[Figure: input row I ($I_0, \dots, I_9$) and output row O ($O_0, \dots, O_9$) of the alignment graph for T = B C B A D B D C D.]
Observation. Due to the unit-step properties of LCS, both I and O are monotonically non-decreasing series, and their values go up by unit steps. [Hunt-Szymanski-77]

Reducing Table Size. Alignment scores in LCS are monotonically non-decreasing, and adjacent elements can't differ by more than 1. Example: 0,1,2,2,3,4 is ok; 0,1,2,4,5,8 is not, because 2 and 4 differ by more than 1 (and so do 5 and 8). Therefore, we only need to store quadruples whose scores are monotonically non-decreasing and differ by at most 1.

Efficient Encoding of Alignment Scores. Instead of recording numbers that correspond to the index in the sequences u and v, we can use binary to encode the differences between the alignment scores.
Original encoding: 0 1 2 2 3 4
Binary encoding (successive differences): 1 1 0 1 1
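The difference encoding can be sketched as a pair of helper functions. The names `encode_steps` and `decode_steps` are hypothetical; the encoder rejects any border whose neighbouring scores do not differ by 0 or 1, which is exactly the unit-step property the encoding relies on.

```python
def encode_steps(scores):
    """Encode a border of LCS scores as (first score, list of 0/1 steps)."""
    steps = [b - a for a, b in zip(scores, scores[1:])]
    if any(step not in (0, 1) for step in steps):
        raise ValueError("adjacent LCS scores must differ by 0 or 1")
    return scores[0], steps


def decode_steps(first, steps):
    """Rebuild the scores from the first score and the 0/1 steps."""
    scores = [first]
    for step in steps:
        scores.append(scores[-1] + step)
    return scores


print(encode_steps([0, 1, 2, 2, 3, 4]))  # (0, [1, 1, 0, 1, 1])
```

A border of $t + 1$ scores therefore needs only $t$ bits plus its starting value, and the starting value can be dropped from the lookup key because it only shifts every score in the block by the same amount.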

We need to precompute only (0,(0,,),(,,), bc, bb)

Reducing Lookup Table Size. There are $2^t$ possible step sequences per border ($t$ = size of blocks) and $4^t$ possible strings. The lookup table size is $(2^t \cdot 2^t) \cdot (4^t \cdot 4^t) = 2^{6t}$. Computing each entry in the table takes $t^2$ time. Total table construction time: $2^{6t} t^2$. Let $t = (\log n)/6$; the table construction time is then $O(2^{6 (\log n)/6} (\log n)^2) = O(n (\log n)^2)$.

Reducing Lookup Table Size. Let $t = (\log n)/6$.
Stage 1: table construction time is $O(2^{6 (\log n)/6} (\log n)^2) = O(n (\log n)^2)$.
Stage 2: alignment graph computation time is $O([n^2/t^2] \cdot t) = O([n^2/(\log n)^2] \cdot \log n) = O(n^2 / \log n)$.
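The two stages can be combined into a small end-to-end sketch. Several simplifications here are assumptions rather than the slides' method: the table is filled lazily through `functools.lru_cache` instead of being precomputed in a separate Stage 1, the cached outputs are stored as offsets from the block's top-left corner rather than as bit strings, and building each lookup key costs $O(t)$ time. The names `block_borders` and `lcs_length_four_russians` are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def block_borders(top, left, a, b):
    """Bottom row and right column of one block, relative to its top-left corner.

    top and left are tuples of 0/1 steps along the input borders;
    a and b are the substrings labelling the block's rows and columns.
    """
    prev = [0]
    for step in top:
        prev.append(prev[-1] + step)
    right = [prev[-1]]
    left_value = 0
    for r in range(len(a)):
        left_value += left[r]
        cur = [left_value]
        for c in range(len(b)):
            if a[r] == b[c]:
                cur.append(prev[c] + 1)
            else:
                cur.append(max(prev[c + 1], cur[c]))
        right.append(cur[-1])
        prev = cur
    return tuple(prev), tuple(right)


def lcs_length_four_russians(u, v, t):
    """LCS length of u and v, computed block by block on t x t blocks."""
    m = len(v)
    top_row = [0] * (m + 1)  # scores on the current horizontal block border
    for i0 in range(0, len(u), t):
        a = u[i0:i0 + t]
        left_col = [0] * (len(a) + 1)  # column 0 of the grid is all zeros
        new_row = [0] * (m + 1)
        for j0 in range(0, m, t):
            b = v[j0:j0 + t]
            base = top_row[j0]  # score at the block's top-left corner
            top = tuple(top_row[j0 + k + 1] - top_row[j0 + k] for k in range(len(b)))
            left = tuple(left_col[k + 1] - left_col[k] for k in range(len(a)))
            bottom, right = block_borders(top, left, a, b)
            for k, offset in enumerate(bottom):
                new_row[j0 + k] = base + offset
            left_col = [base + offset for offset in right]
        top_row = new_row
    return top_row[m]
```

Only the block borders are ever stored, matching the $O(n^2/t)$ points of interest, and any block whose inputs were seen before is answered from the cache instead of being recomputed.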

Summary. We take advantage of the fact that for each block of size $t = O(\log n)$, we can pre-compute all possible scores and store them in a lookup table of size $n$, whose values can be computed in time $O(n (\log n)^2)$. We used the Four Russians speedup to go from a quadratic running time for LCS to a subquadratic running time: $O(n^2 / \log n)$.