CFG PSA Algorithm. Sequence Alignment Guided By Common Motifs Described By Context Free Grammars



Motivation
Find motifs: conserved regions that indicate a biological function or signature. Other alignment algorithms do not always align motif regions together. The idea is to incorporate knowledge about common structures into the alignment process, forcing the alignment to align such common motifs together.
(Figure: RNA secondary-structure elements: hairpin loop, internal loop, bulge, multi-loop, external base.)

The goal
To align the sequences in a way that includes all or some of the motif matches, in an order that optimizes the resulting score:

max { Σ_{i=1}^{k+1} α(z_i, w_i) + Σ_{i=1}^{k} max_{G ∈ Γ, x_i, y_i ∈ L(G)} β(G) }

Example: α(z_1, w_1) + α(z_2, w_2) + α(z_3, w_3) + α(z_4, w_4) + β(G_1) + β(G_2) + β(G_3)

Problem definition
Let α(u, v) denote the maximum ordinary alignment score for strings u and v, computed with a standard alignment algorithm.

Given:
- two sequences S1 and S2
- a set of context-free grammars Γ = {G_1, ..., G_f}, each of which represents a motif with all its possible variations
- a weight function β(G)

Compute:

max { Σ_{i=1}^{k+1} α(z_i, w_i) + Σ_{i=1}^{k} max_{G ∈ Γ, x_i, y_i ∈ L(G)} β(G) }

over all possible decompositions x_1, x_2, ..., x_k, y_1, y_2, ..., y_k with
S1 = z_1 x_1 z_2 x_2 ... z_k x_k z_{k+1}
S2 = w_1 y_1 w_2 y_2 ... w_k y_k w_{k+1}

Each z_j or w_j, 1 ≤ j ≤ k+1, can be an empty string. For all i, 1 ≤ i ≤ k, there exists a G ∈ Γ such that x_i, y_i ∈ L(G), where L(G) is the language generated by the CFG G; each such pair (x_i, y_i) is a motif-matching region.
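For concreteness, the objective above can be sketched in Python for one fixed decomposition. The `alpha` below is only a positional stand-in for the ordinary alignment score (a real implementation would run a full alignment such as Needleman-Wunsch), and the per-motif weights are assumed to have been chosen already:

```python
def alpha(u, v):
    # Stand-in pairwise score: +1 per matching position, -1 otherwise.
    # (A real alpha would be a proper alignment algorithm.)
    n = max(len(u), len(v))
    return sum(1 if i < len(u) and i < len(v) and u[i] == v[i] else -1
               for i in range(n))

def objective(z, w, motif_weights):
    """Score of one decomposition S1 = z1 x1 ... zk xk z(k+1),
    S2 = w1 y1 ... wk yk w(k+1): the sum of alpha(z_i, w_i) over the
    k+1 non-motif blocks plus the chosen motif weight beta(G) for each
    of the k motif-matching pairs (x_i, y_i)."""
    assert len(z) == len(w) == len(motif_weights) + 1
    return sum(alpha(zi, wi) for zi, wi in zip(z, w)) + sum(motif_weights)
```

The full algorithm then maximizes this quantity over every legal decomposition, rather than evaluating a single one.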

Γ = {G_1, G_2, G_3, G_4}

G_1 = (V, T, P, V_0), a context-free grammar:
- set of variables V = {V_0, V_1, V_2}
- V_0 is the start variable
- terminal symbols T = {A, U, C, G}
- set of rules P:
  V_0 → V_1 G | G V_1
  V_1 → G V_2
  V_2 → G G

G_2 = ({V_0, V_1, V_2}, {A, U, C, G}, P, V_0), with rules:
  V_0 → V_1 U | U V_1
  V_1 → V_2 G
  V_2 → G G

G_3 = ({V_0, V_1, V_2, V_3}, {A, U, C, G}, P, V_0), with rules:
  V_0 → V_1 U | G V_1 | V_1 G
  V_1 → V_2 U
  V_2 → V_3 U | U V_3
  V_3 →

Reminder: the CYK algorithm
CYK is an algorithm that receives a CNF grammar G and a string S, and determines whether S can be produced from G, and how.

ParseAll
ParseAll is a modification of CYK that receives a CNF grammar G and a string S, and finds all of the substrings of S that can be produced from G.

CNF (Chomsky normal form)
A CNF grammar is a CFG in which all of the production rules have one of the forms:
  V → TS (two variables)
  V → a (a single terminal)
  S → ε (only for the start variable)
Every CFG is easily convertible to CNF by following a simple algorithm.
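As a minimal sketch of CYK membership testing, here is a tiny CNF grammar of my own (not one from the slides), represented as a unit-rule map and a binary-rule map:

```python
from itertools import product

# Toy CNF grammar (hypothetical example): S -> AB | BA, A -> a, B -> b.
unit_rules = {"a": {"A"}, "b": {"B"}}                   # terminal -> variables
binary_rules = {("A", "B"): {"S"}, ("B", "A"): {"S"}}   # (X, Y) -> variables

def cyk_accepts(s, start="S"):
    """Return True iff the CNF grammar derives the string s (CYK)."""
    n = len(s)
    if n == 0:
        return False
    # T[i][j] = set of variables deriving s[i..j] (0-based, inclusive).
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):
        T[i][i] = set(unit_rules.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):          # split point of the span
                for X, Y in product(T[i][k], T[k + 1][j]):
                    T[i][j] |= binary_rules.get((X, Y), set())
    return start in T[0][n - 1]
```

With this grammar, `cyk_accepts("ab")` and `cyk_accepts("ba")` succeed while `cyk_accepts("aa")` fails.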

Running Example: a CNF grammar G

  V0 → AB
  A → you
  A1 → No
  A2 → one
  B → CB2
  C → can
  B2 → understand
  → DP | it.
  D → Dynamic
  P → Programming

String S = "Poor fool, do you think you can understand Dynamic Programming? No one can understand it."

(Slides: successive snapshots of the ParseAll table for S. The diagonal cells are filled first from the unit rules, e.g. the cell for "Dynamic" gets {D}, the cell for "Programming" gets {P}, and the cells for "can" and "understand" get their unit variables; longer spans are then filled bottom-up through DP, B2, B, and finally V0, and the two substrings derivable from V0 are read off as the matches S1 and S2.)

Algorithm ParseAll(input string S, length n, CNF grammar G)

Step 1. Find the substrings derived by rules of the form X → a:
  for i = 1 to n do
    set T[i, i] = ∅
    for each variable X
      if X → S[i] is a rule then add X to T[i, i]

Step 2. Find the substrings derived by rules of the form X → YZ:
  for l = 2 to n do
    for i = 1 to n - l + 1 do
      set j = i + l - 1
      for k = i to j - 1 do
        for each rule X → YZ
          if Y ∈ T[i, k] and Z ∈ T[k+1, j] then add X to T[i, j]

Step 3. Return the set of all substrings generated by G:
  P = ∅
  for i = 1 to n do
    for j = i to n do
      if V0 ∈ T[i, j] then add (i, j) to P
  return P

Runtime = O(|G| · n³)
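The three steps above can be transcribed directly into Python (a sketch using 0-based indices, with the CNF grammar assumed to be given as unit- and binary-rule maps as in the earlier CYK example):

```python
def parse_all(S, unit_rules, binary_rules, start="V0"):
    """Return every span (i, j), 0-based inclusive, of S whose substring
    is derivable from the start variable of the CNF grammar."""
    n = len(S)
    T = [[set() for _ in range(n)] for _ in range(n)]
    # Step 1: rules of the form X -> a fill the diagonal.
    for i, ch in enumerate(S):
        T[i][i] = set(unit_rules.get(ch, ()))
    # Step 2: rules of the form X -> YZ fill longer spans bottom-up.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for X in T[i][k]:
                    for Y in T[k + 1][j]:
                        T[i][j] |= binary_rules.get((X, Y), set())
    # Step 3: collect all spans whose cell contains the start variable.
    return [(i, j) for i in range(n) for j in range(i, n)
            if start in T[i][j]]
```

For the toy grammar V0 → AB, A → a, B → b, `parse_all("abab", ...)` finds both occurrences of "ab".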

(Slide: an example alignment dynamic-programming table for two RNA sequences; the first row and column are initialized 0, -1, -2, -3, -4, ... with the gap penalty, and the remaining entries are filled with the standard match/mismatch/gap recurrence.)

Preprocessing
For all G_k in Γ do:
  X_k ← ParseAll(S1, n1, G_k)
  Y_k ← ParseAll(S2, n2, G_k)
  Move every (i, i') ∈ X_k to a list X as (i, i', k)
  Move every (j, j') ∈ Y_k to a list Y as (j, j', k)
Sort X and Y in ascending order of the end points.

(Slide: the resulting sorted lists X and Y of motif-match triples (start, end, grammar index).)
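The preprocessing step can be sketched as a short routine that runs ParseAll for every grammar, tags each span with its grammar index, and sorts by end point (here `parse_all` is passed in as a callable returning (start, end) spans, matching the earlier pseudocode):

```python
def build_match_lists(S1, S2, grammars, parse_all):
    """Build the sorted motif-match lists X (for S1) and Y (for S2):
    each entry is a triple (start, end, k) where k indexes the grammar
    G_k in Gamma that derives the spanned substring."""
    X, Y = [], []
    for k, G in enumerate(grammars):
        X += [(i, j, k) for (i, j) in parse_all(S1, G)]
        Y += [(i, j, k) for (i, j) in parse_all(S2, G)]
    # Sort both lists in ascending order of the match end point.
    X.sort(key=lambda t: t[1])
    Y.sort(key=lambda t: t[1])
    return X, Y
```

Sorting by end point is what lets the main dynamic program consume the matches in the same order in which it sweeps the alignment table.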

Time Complexity
Let N = max{N1, N2}.
- Running ParseAll for every grammar in Γ and creating X, Y: O(|Γ| · N³)
- Sorting X, Y: O(|X| log |X| + |Y| log |Y|)
- Computing the max-terms: O(|X| · |Y| + N²), because of the usual N² alignment-table computation plus the new max-term

  max { h(i′, j′) + β(G_k1) } over (i′, i, k1) ∈ X and (j′, j, k2) ∈ Y with G_k1 = G_k2,

  which passes over every pair of matches in X and Y and checks for matching grammars.

Total time complexity: O(|Γ| · N³ + |X| · |Y|)
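The combined dynamic program can be sketched as follows. This is an assumed, simplified recurrence (linear gap costs, a plain scan over the match lists rather than an indexed lookup): besides the usual three alignment moves, a "motif move" jumps from just before a pair of same-grammar matches to their end points, collecting the motif weight β:

```python
def cfg_psa_dp(S1, S2, X, Y, beta, match=1, mismatch=-1, gap=-2):
    """H[i][j] = best score aligning S1[:i] with S2[:j]. X and Y hold
    motif matches as 1-based (start, end, grammar-index) triples; beta[k]
    is the weight of grammar G_k. Returns the optimal total score."""
    n1, n2 = len(S1), len(S2)
    H = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        H[i][0] = i * gap
    for j in range(n2 + 1):
        H[0][j] = j * gap
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            s = match if S1[i - 1] == S2[j - 1] else mismatch
            best = max(H[i-1][j-1] + s, H[i-1][j] + gap, H[i][j-1] + gap)
            # Motif move: for matches ending exactly at (i, j) under the
            # same grammar, jump from just before both motif starts.
            for (i0, i1, k1) in X:
                if i1 != i:
                    continue
                for (j0, j1, k2) in Y:
                    if j1 == j and k1 == k2:
                        best = max(best, H[i0 - 1][j0 - 1] + beta[k1])
            H[i][j] = best
    return H[n1][n2]
```

The inner double scan is exactly the O(|X| · |Y|) max-term cost from the analysis above; a production version would bucket X and Y by end point so each pair is touched once.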

Possible Modifications
- Affine gap penalties.
- Local alignment computations.
- More advanced algorithms for aligning the non-motif-matching regions, with the same or better complexity.
- Solving the general problem of optimally aligning multiple sequences guided by a given set of motifs described by CFGs.