FLAG: Fast Local Alignment Generating Methodology. Abstract. Introduction


Romanian Biotechnological Letters Vol. 18, No. 1, 2013
Copyright 2013 University of Bucharest. Printed in Romania. All rights reserved.
SHORT COMMUNICATION

FLAG: Fast Local Alignment Generating Methodology

Received for publication, August 5, 2012. Accepted, November 2, 2012

Faculty of Computer Science, University Goce Delcev - Štip, Republic of Macedonia
Email: done.stojanov@ugd.edu.mk

Abstract

A new time- and space-efficient alignment methodology is presented, applicable to similar nucleotide sequences. Linear time complexity, O(m), has been determined when aligning similar sequences of approximately the same size. The time complexity improvement is due to the methodology according to which alignments are generated, and to the significant reduction of the space in which the search for alignments is carried out.

Keywords: linear time and space, un-gapped, local alignment, methodology

Introduction

Time inefficiency has been the major disadvantage of local pairwise alignment techniques. The Smith-Waterman algorithm (T. SMITH & al. [1]) requires fixed O(nm) time to identify one optimal (score-maximized) alignment, allowing gap insertion. Instead of finding one optimal alignment, M. WATERMAN and M. EGGERT [2] came up with the idea of identifying k suboptimal local alignments. The main disadvantage of the Waterman-Eggert algorithm is again its nonlinear time complexity. In order to reduce its space complexity, X. HUANG and W. MILLER [3] presented in 1991 a linear-space solution of the Waterman-Eggert algorithm, which was until then the most space-efficient local alignment technique. Newer heuristic ultrafast solutions, such as FASTA (D. LIPMAN & al. [4]) and BLAST (S. ALTSCHUL & al. [5]), are applicable for fast search of large genetic databases, identifying sequences similar to a reference sequence, but they do not always find the optimal solution.

Besides time complexity, space complexity is often a limiting factor when aligning large nucleotide sequences. In order to reduce the space complexity of an alignment, the methodology presented in (D. STOJANOV & al. [6]) represents each region of consecutive matching nucleotides with a triple, identifying the region's length and its starting positions in the two sequences. Based on this representation, measurements performed in [6] clearly show that linear space is required, while the time complexity is $O(nm^2)$. Most of the time in [6] is spent examining all combinations of un-gapped local alignments within overlapping sections, owing to the number of alignments being examined.

When aligning similar nucleotide sequences, large regions of consecutive matching nucleotides are part of the optimal un-gapped local alignment. Also, the probability of finding an optimal un-gapped local alignment within overlapping sections m nucleotides long is higher than the probability of finding it in overlapping sections with fewer than m nucleotides, where m is the length of the smaller of the two nucleotide sequences being aligned. Based on these observations, a fast local alignment generating methodology is presented, requiring linear time and space, O(m), when aligning similar nucleotide sequences of approximately the same size.
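The triple representation mentioned above can be illustrated with a short sketch. It scans one overlapping section and records every run of consecutive matching nucleotides as a triple (start in b, start in a, length). The function name `matching_regions`, the `offset` parameter and the use of 0-based positions are assumptions made for this illustration, not details taken from [6].

```python
def matching_regions(a, b, offset):
    """Return (p_b, p_a, l) triples for runs of matches when b[0] sits under a[offset].

    offset may be negative (b hangs off the left end of a) or so large that
    b hangs off the right end of a; only the overlapping part is compared.
    Positions are 0-based, which is an assumption of this sketch.
    """
    regions = []
    start = max(0, -offset)                # first index of b that overlaps a
    stop = min(len(b), len(a) - offset)    # one past the last overlapping index of b
    run_start = None                       # start (in b) of the run being read, if any
    for pb in range(start, stop):
        if a[pb + offset] == b[pb]:
            if run_start is None:
                run_start = pb
        elif run_start is not None:
            regions.append((run_start, run_start + offset, pb - run_start))
            run_start = None
    if run_start is not None:              # a run that reaches the end of the section
        regions.append((run_start, run_start + offset, stop - run_start))
    return regions


# Hypothetical toy sequences, chosen only to show the output shape.
print(matching_regions("ACGTTGCA", "ACCTTGA", 0))   # [(0, 0, 2), (3, 3, 3)]
```

Only the triples are stored, never the matched substrings themselves, so the memory needed per section grows with the number of regions rather than with the number of candidate alignments.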

Materials and methods

Methodology

As Table 1 shows, the smaller nucleotide sequence b overlaps n-m+1 different sections of sequence a with length m, as well as sections with length less than m. Overlapping sections shorter than the smaller sequence b are formed by one-place left and right shifts of sequence b beyond the ends of sequence a.

Table 1. Overlapping sections. Nucleotide sequences: $a = a_1 a_2 \dots a_n$, $b = b_1 b_2 \dots b_m$

m nucleotides long overlapping sections (section of a / section of b):
$a_1 a_2 \dots a_m$ / $b_1 b_2 \dots b_m$
$a_2 a_3 \dots a_{m+1}$ / $b_1 b_2 \dots b_m$
...
$a_{n-m+1} a_{n-m+2} \dots a_n$ / $b_1 b_2 \dots b_m$

Sequence b left shifted overlapping sections:
$a_1 a_2 \dots a_{m-1}$ / $b_2 b_3 \dots b_m$
$a_1 a_2 \dots a_{m-2}$ / $b_3 b_4 \dots b_m$
...
$a_1$ / $b_m$

Sequence b right shifted overlapping sections:
$a_{n-m+2} a_{n-m+3} \dots a_n$ / $b_1 b_2 \dots b_{m-1}$
$a_{n-m+3} a_{n-m+4} \dots a_n$ / $b_1 b_2 \dots b_{m-2}$
...
$a_n$ / $b_1$

When comparing parallel nucleotides within overlapping sections, $\chi$ regions composed of consecutive matching nucleotides, each with at least one match, are found. As we have shown in [6], each matching region can be represented with a triple $R: (p_b, p_a, l)$, where $p_b$ is the region's starting position in sequence b, $p_a$ is the region's starting position in sequence a, and $l$ is the region's length. An un-gapped local alignment consists of one or more matching regions, separated by region(s) of mismatching nucleotides.

The same alignment score metric used in [6] is also used here, awarding a positive score $\mu$ for each nucleotide match and penalizing each nucleotide mismatch with a negative score $-\delta$. Each alignment $A: (R_1 R_2 \dots R_{k-1} R_k)$ is assigned a unique score, computed with the formula presented in [6]:

$$f(A: R_1 R_2 \dots R_{k-1} R_k) = \mu \sum_{i=1}^{k} len(R_i) - \delta \sum_{j=2}^{k} dif(R_j, R_{j-1})$$

where $len(R_i)$ is the length of the matching region $R_i$, and $dif(R_j, R_{j-1})$ is the number of mismatching nucleotides separating matching regions $R_j$ and $R_{j-1}$.

Alignments are formed according to the following strategy, the fast local alignment generating methodology. Its output is a score-maximized un-gapped local alignment A with regard to the longest matching region within overlapping sections in which $\chi$ matching regions have been found:

Find the longest matching region: $R_\varsigma = \max_{len}\{R_1, R_2, \dots, R_{\chi-1}, R_\chi\}$, $1 \le \varsigma \le \chi$.

Take region $R_\varsigma$ initially as both the extending and the local alignment: $A_e \leftarrow R_\varsigma$, $A \leftarrow R_\varsigma$.

If $\varsigma = 1$, extend $A_e$ by appending regions $R_\xi$, $A_e \leftarrow A_e \bowtie R_\xi$, consecutively for $\xi = 2, 3, \dots, \chi-1, \chi$. If $f(A_e) > f(A)$, then $f(A) \leftarrow f(A_e)$, $A \leftarrow A_e$.

If $\varsigma = \chi$, extend $A_e$ by appending regions $R_\xi$, $A_e \leftarrow R_\xi \bowtie A_e$, consecutively for $\xi = \chi-1, \chi-2, \dots, 2, 1$. If $f(A_e) > f(A)$, then $f(A) \leftarrow f(A_e)$, $A \leftarrow A_e$.

If $1 < \varsigma < \chi$, extend $A_e$ by appending left-positioned regions $R_\xi$, $A_e \leftarrow R_\xi \bowtie A_e$, consecutively for $\xi = \varsigma-1, \varsigma-2, \dots, 2, 1$. If $f(A_e) > f(A)$, then $f(A) \leftarrow f(A_e)$, $A \leftarrow A_e$. Take the local alignment found at this stage as the extending alignment, $A_e \leftarrow A$, which is now subject to right-positioned extension, appending regions $R_\xi$, $A_e \leftarrow A_e \bowtie R_\xi$, consecutively for $\xi = \varsigma+1, \varsigma+2, \dots, \chi-1, \chi$. If $f(A_e) > f(A)$, then $f(A) \leftarrow f(A_e)$, $A \leftarrow A_e$.

If $f_{max}$ is the score of the highest-scoring un-gapped local alignment found within overlapping sections m nucleotides long, it also has to be checked whether an alignment with a score higher than $f_{max}$ exists within overlapping sections with fewer than m nucleotides, formed by one-place left shifts of sequence b beyond the end of sequence a.

Proposition 1: An alignment with a score higher than $f_{max}$ could be found within sequence b left shifted overlapping sections with lengths ranging between $m-1$ and $[f_{max}/\mu] + 1$, including those values.

Proof: Within overlapping sections of length $l$, the maximum possible score of an un-gapped local alignment is $l\mu$. Accordingly, an alignment scoring higher than $f_{max}$ could be found only if $l\mu > f_{max}$, from which we get $l > f_{max}/\mu \ge [f_{max}/\mu]$. Since $l$ is an integer, $l \ge [f_{max}/\mu] + 1$, while $m-1$ is the length of the longest shifted section.

According to Proposition 1, there is no need to search for an alignment with a score higher than $f_{max}$ within sequence b left shifted overlapping sections with lengths less than $[f_{max}/\mu] + 1$, once an alignment with the highest score $f_{max}$ has been found within overlapping sections m nucleotides long.

Proposition 2: An alignment with a score higher than $f_{max}$ could be found within sequence b right shifted overlapping sections with lengths ranging between $m-1$ and $[f_{max}/\mu] + 1$, including those values, if $f_{max}$ is the score of the optimal (highest-scoring) alignment found after examining alignments within overlapping sections m nucleotides long and sequence b left shifted overlapping sections, according to the fast local alignment generating methodology.

The proof of Proposition 2 is analogous to the proof of Proposition 1.

While searching for the optimal un-gapped local alignment within overlapping sections m nucleotides long and within left and right shifted overlapping sections, a data vector identifying the current highest-scoring un-gapped local alignment is kept in memory. The vector's content is updated dynamically whenever a new un-gapped local alignment scoring higher than the current one is found. The last update of this vector identifies the optimal un-gapped local alignment.

An example

The fast local alignment generating methodology will be demonstrated on a concrete example, taking the nucleotide sequences a: TGCTAACTTTGATTGCCTA and b: TGAATCCCTTGAATGAAC as samples. Since the length of sequence a is 19, while the length of sequence b is 18, sequence b overlaps n-m+1 = 19-18+1 = 2 different sections of sequence a with length 18. Alignments within overlapping sections are generated according to the fast local alignment generating methodology (Table 2), awarding +2 for each nucleotide match and penalizing each nucleotide mismatch with -1.

Table 2. Examining alignments within 18 nucleotides long overlapping sections

First overlapping sections:
a: TGCTAACTTTGATTGCCTA
b: TGAATCCCTTGAATGAAC
Matching region(s) found / region's score: $R_1$: (0, 0, 2) / $f(R_1) = 4$; $R_2$: (6, 6, 1) / $f(R_2) = 2$; $R_3$: (8, 8, 4) / $f(R_3) = 8$; $R_4$: (13, 13, 2) / $f(R_4) = 4$
Local alignment found according to the fast local alignment generating methodology / alignment's score: CTTTGATTG over CCTTGAATG / $f(A: R_2 R_3 R_4) = 12$

Second overlapping sections:
a: TGCTAACTTTGATTGCCTA
b:  TGAATCCCTTGAATGAAC
Matching region(s) found / region's score: $R_1$: (3, 4, 1) / $f(R_1) = 2$; $R_2$: (5, 6, 1) / $f(R_2) = 2$; $R_3$: (8, 9, 1) / $f(R_3) = 2$
Local alignment found according to the fast local alignment generating methodology / alignment's score: AAC over ATC / $f(A: R_1 R_2) = 3$

When comparing parallel nucleotides within the first overlapping sections, four matching regions are found. There are two matching nucleotides within the first region, one within the second, four within the third, and two within the fourth. The third matching region is the longest one and is initially taken as both the extending and the local alignment: $A_e \leftarrow R_3$, $A \leftarrow R_3$. $A_e$ is left extended, $A_e \leftarrow R_2 \bowtie A_e : R_2 R_3$, resulting in an alignment with score 9. Since the extended alignment's score is higher than the score of the local alignment, local alignment A is updated with $A_e$, $A \leftarrow A_e : R_2 R_3$. Further left-positioned extension of $A_e$ results in the alignment $A_e \leftarrow R_1 \bowtie A_e : R_1 R_2 R_3$, with score 9. The extended alignment's score now equals the local alignment's score, causing no change of the local alignment and its score. The optimal left-extended alignment with regard to the longest matching region is therefore $A: R_2 R_3$.

This alignment is now taken as the extending alignment, subject to right-positioned extension, $A_e \leftarrow A : R_2 R_3$. After appending $R_4$, the alignment $A_e \leftarrow A_e \bowtie R_4 : R_2 R_3 R_4$, with score 12, is obtained. The extended alignment's score is higher than the score of the local alignment A, causing the local alignment's update with $A_e$, $A \leftarrow A_e : R_2 R_3 R_4$.

Within the second overlapping sections, three matching regions, each with one matching nucleotide, are found. The local alignment according to the fast local alignment generating methodology is $A: R_1 R_2$, with score 3. The local alignment within the second overlapping sections does not score higher than the local alignment found within the first overlapping sections, whereby we can conclude that the optimal un-gapped local alignment found within 18 nucleotides long overlapping sections is $A: R_2 R_3 R_4$ = (6, 6, 1, 8, 8, 4, 13, 13, 2), with score 12.

According to Proposition 1, an alignment with a score higher than 12 might exist within left shifted overlapping sections with lengths between $[12/2] + 1 = 7$ and $m - 1 = 17$. Examining alignments within left shifted overlapping sections according to the fast local alignment generating methodology, no alignment with a higher score is found. Finally, according to Proposition 2, alignments within right shifted overlapping sections with lengths ranging between 7 and 17 are examined. Since no alignment with a higher score is found within those overlapping sections either, the un-gapped local alignment found within the first 18 nucleotides long overlapping sections, $A: R_2 R_3 R_4$ = (6, 6, 1, 8, 8, 4, 13, 13, 2), is the optimal one.

Table 3. Sequence b left and right shifted overlapping sections

Sequence b left shifted overlapping sections (shifts of one and two places):
a:  TGCTAACTTTGATTGCCTA
b: TGAATCCCTTGAATGAAC

a:   TGCTAACTTTGATTGCCTA
b: TGAATCCCTTGAATGAAC

Sequence b right shifted overlapping sections (shifts of one and two places):
a: TGCTAACTTTGATTGCCTA
b:   TGAATCCCTTGAATGAAC

a: TGCTAACTTTGATTGCCTA
b:    TGAATCCCTTGAATGAAC
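The extension strategy and the score formula can be sketched in a few lines of Python and checked against the worked example. This is only an illustrative sketch: the names `score`, `flag` and `shifted_lengths` are hypothetical, the triples are read as 0-based (p_b, p_a, l) positions, and ties for the longest region are broken in favour of the leftmost one, which is an assumption since the text does not state a tie rule.

```python
MU, DELTA = 2, 1   # match reward and mismatch penalty from the worked example


def score(alignment):
    """f(A) = MU * (matched nucleotides) - DELTA * (mismatches between neighbours)."""
    matches = sum(l for _, _, l in alignment)
    # Regions lie on one diagonal, so the gap is measured along sequence b.
    mismatches = sum(nxt[0] - (prev[0] + prev[2])
                     for prev, nxt in zip(alignment, alignment[1:]))
    return MU * matches - DELTA * mismatches


def flag(regions):
    """Grow an alignment around the longest region: leftwards first, then rightwards."""
    if not regions:
        return [], 0
    # Index of the first longest region (tie rule assumed, see lead-in).
    s = max(range(len(regions)), key=lambda i: regions[i][2])
    best = ext = [regions[s]]
    for i in range(s - 1, -1, -1):           # left-positioned extension
        ext = [regions[i]] + ext
        if score(ext) > score(best):
            best = ext
    ext = best                                # continue from the best left-extended alignment
    for i in range(s + 1, len(regions)):      # right-positioned extension
        ext = ext + [regions[i]]
        if score(ext) > score(best):
            best = ext
    return best, score(best)


def shifted_lengths(m, f_max, mu=MU):
    """Section lengths worth examining after shifting b: m-1 down to floor(f_max/mu)+1."""
    return range(m - 1, f_max // mu, -1)


first_section = [(0, 0, 2), (6, 6, 1), (8, 8, 4), (13, 13, 2)]
print(flag(first_section))   # ([(6, 6, 1), (8, 8, 4), (13, 13, 2)], 12)
```

The cases for the longest region being first, last, or in the middle collapse into one code path here, because a left or right loop over an empty range simply does nothing. For the example, `shifted_lengths(18, 12)` covers lengths 17 down to 7, in line with Propositions 1 and 2.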

Results and Discussion

An implementation

The fast local alignment generating methodology has been implemented in C++. While searching for the optimal solution, two data vectors whose content changes dynamically are kept in memory during execution: a data vector holding the set of triples that identify the matching regions found within the current overlapping sections, and a data vector holding the set of triples that identify the highest-scoring un-gapped local alignment found so far. As previously explained, each triple is a unique identifier of a matching region, holding the region's length and its starting positions in the sequences. For each set of matching regions found within the current overlapping sections, the fast local alignment generating function FLAG is called, generating as output an optimal un-gapped local alignment with regard to the longest local matching region. Afterwards, the alignment's score is compared with the score of the optimal alignment found until then. If a higher-scoring alignment is found, the optimal alignment and its score are updated.

function FLAG(input: R1, R2, ..., Rχ-1, Rχ, output: A)
    if (χ != 0)
        if (χ == 1)
            score ← μ · length(R1)
            A ← R1
        else
            Find Rς = max_len{R1, R2, ..., Rχ-1, Rχ}
            A ← Rς
            Ae ← Rς
            score ← μ · length(A)
            if (ς == 1)
                for (ξ = 2; ξ <= χ; ξ++)
                    Ae ← Ae ⋈ Rξ
                    if (f(Ae) > score)
                        score ← f(Ae)
                        A ← Ae
            else if (ς == χ)
                for (ξ = χ-1; ξ >= 1; ξ--)
                    Ae ← Rξ ⋈ Ae
                    if (f(Ae) > score)
                        score ← f(Ae)
                        A ← Ae
            else
                for (ξ = ς-1; ξ >= 1; ξ--)
                    Ae ← Rξ ⋈ Ae
                    if (f(Ae) > score)
                        score ← f(Ae)
                        A ← Ae
                Ae ← A
                for (ξ = ς+1; ξ <= χ; ξ++)
                    Ae ← Ae ⋈ Rξ
                    if (f(Ae) > score)
                        score ← f(Ae)
                        A ← Ae

Test results

The implementation's running time has been measured by aligning ten pairs of nucleotide sequences of different lengths on a Fujitsu computer with a Core(TM) 2 Duo CPU at 2.67 GHz and 2 GB RAM. A score metric awarding +2 for each nucleotide match and penalizing each mismatch with -1 has been used. Similar sequences of approximately the same size have been aligned. According to the results presented in Table 4, the implementation's linear time complexity O(m) is evident when aligning similar sequences of approximately the same size. Following our previous space-efficient implementation [6], two data vectors (the set of triples identifying matching regions found within overlapping sections, and the current optimal alignment) are kept in memory during execution, resulting in linear space complexity O(m).

Table 4. Implementation's running time

sequence a | sequence a's length $l_a$ | sequence b | sequence b's length $l_b$ | running time t (sec)
Columnea latent viroid RNA | 374 | Columnea latent viroid clone -6 | 37 | 47
Cherry chlorotic rusty spot associated small satellite-like dsRNA B | 69 | Cherry chlorotic rusty spot associated small satellite-like dsRNA C | 66 | 25
Ageratum leaf curl Cameroon betasatellite, isolate SatB6 | 38 | Ageratum leaf curl Cameroon betasatellite, isolate SatB4 | 379 | 343
Cyclovirus Chimp | 75 | Cyclovirus Chimp2 | 747 | 28
Stachytarpheta leaf curl virus - [Hn6] | 275 | Stachytarpheta leaf curl virus - [Hn54] | 2748 | 437
Adeno-associated virus 3 | 4726 | Adeno-associated virus 3B | 4722 | 29
Mouse parvovirus 4 | 48 | Mouse parvovirus 4b | 4794 | 72
Banana streak IM virus | 7769 | Banana streak Imove virus strain IRFA9 | 7768 | 685
Gremmeniella abietina type B RNA virus XL | 375 | Gremmeniella abietina type B RNA virus XL2 | 374 | 35
O'nyong-nyong virus strain SG65 | 822 | Igbo Ora virus strain IBH964 | 82 | 2729

Figure 1. Columnea latent viroid RNA and Columnea latent viroid clone -6 alignment

Given the data set $S = \langle l_{a_i}, l_{b_i}, t_i \rangle$, $i = 1, \dots, 10$, obtained during the experimental evaluation, Principal Components Analysis (PCA) was used to fit a linear regression that minimizes the perpendicular distances from the data to the fitted line. This problem is equivalent to searching for the linear subspace that maximizes the variance of the projected points, which is obtained by eigendecomposition of the covariance matrix. Eigenvectors corresponding to large eigenvalues are the directions in which the data has a strong component, or equivalently a large variance. PCA finds an orthogonal basis that best represents a given data set. In our case the fitted line can be described with the following equation:

$$r = n + t \cdot p$$

where n = [7738; 7775; 852933e-4] is the point on the fitted line, p = [46356e+3; 46329e+3; 42] is the line direction vector, and $t \in \mathbb{R}$. The fitted line, together with the orthogonal distances from each point to the line, is shown in Fig. 2. It shows a linear dependency of the aligning time on sequence length for similar nucleotide sequences.
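The orthogonal line fit described above can be sketched in plain Python. The sketch takes the centroid as the point n on the line and the dominant eigenvector of the covariance matrix as the direction p. Finding that eigenvector by power iteration is a choice made for this illustration (the paper does not say which eigensolver was used), the function names are hypothetical, and the sample points below are made up and are not the measurements of Table 4.

```python
import math


def fit_line_pca(points, iterations=200):
    """Fit r = n + t*p to points by orthogonal regression (first principal axis)."""
    k, dim = len(points), len(points[0])
    n = [sum(pt[d] for pt in points) / k for d in range(dim)]            # centroid
    centered = [[pt[d] - n[d] for d in range(dim)] for pt in points]
    cov = [[sum(c[i] * c[j] for c in centered) / (k - 1) for j in range(dim)]
           for i in range(dim)]                                          # sample covariance
    p = [1.0 / math.sqrt(dim)] * dim                                     # unit start vector
    for _ in range(iterations):                                          # power iteration
        q = [sum(cov[i][j] * p[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in q))
        if norm == 0.0:            # degenerate data, or start vector orthogonal to the axis
            break
        p = [x / norm for x in q]
    return n, p


def orthogonal_distance(point, n, p):
    """Perpendicular distance from point to the line n + t*p, with p of unit length."""
    diff = [x - c for x, c in zip(point, n)]
    t = sum(d * u for d, u in zip(diff, p))                 # projection onto the line
    residual = [d - t * u for d, u in zip(diff, p)]
    return math.sqrt(sum(x * x for x in residual))


# Hypothetical (l_a, l_b, t) triples lying close to a straight line.
sample = [(100, 98, 0.11), (200, 196, 0.19), (300, 294, 0.31), (400, 392, 0.40)]
n, p = fit_line_pca(sample)
distances = [orthogonal_distance(pt, n, p) for pt in sample]
```

Small orthogonal distances relative to the spread of the data indicate that running time grows roughly linearly with sequence length. Power iteration can stall when the start vector happens to be orthogonal to the dominant axis, so a library eigensolver would be the more robust choice in practice.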

Figure 2. Fitted line with the orthogonal distances from each point to the line (axes: $l_a$, $l_b$, $t$)

Conclusions

A linear time and space alignment technique has been presented, applicable to similar nucleotide sequences. The time complexity improvement is due to the methodology according to which un-gapped local alignments are generated within overlapping sections, and to the reduced alignment search space. The space complexity also remains linear, owing to the space-efficient representation of matching regions.

References

1. T. SMITH, M. WATERMAN, Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197 (1981).
2. M. WATERMAN, M. EGGERT, A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. Journal of Molecular Biology, 197, 723-728 (1987).
3. X. HUANG, W. MILLER, A time-efficient, linear-space local similarity algorithm. Advances in Applied Mathematics, 12, 337-357 (1991).
4. D. LIPMAN, W. PEARSON, Rapid and sensitive protein similarity searches. Science, 227(4693), 1435-1441 (1985).
5. S. ALTSCHUL, W. GISH, W. MILLER, E. MYERS, D. LIPMAN, Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410 (1990).
6. D. STOJANOV, A. MILEVA, S. KOCESKI, A new, space-efficient local pairwise alignment methodology. Advanced Studies in Biology, 4(2), 85-93 (2012).