Proceedings of BITCON-2015: Innovations for National Development
National Conference on Research and Development in Computer Science and Applications

Research Paper

SUBSTRING MATCHING IN CONTEXT FREE GRAMMAR

Pawan Kumar Patnaik (1), M. V. Padmavati (1), Jyoti Singh (2)

Address for Correspondence
(1) Deptt. of Computer Science & Engg., BIT, Durg
(2) Directorate of Technical Education, Raipur (C.G.), India

ABSTRACT
This paper studies the complexity of the membership and substring testing problems, where the substring is not necessarily contiguous, for context-free grammars in Chomsky Normal Form (CNF). We describe a new algorithm that extends the CYK algorithm for string languages and preserves its polynomial time complexity.

1. INTRODUCTION
Pattern matching has a wide range of applications in pattern recognition, image processing, computer vision, and related fields. In one dimension, the problem is referred to as string matching. String matching is used in text editing, text searching, database search, artificial intelligence, information retrieval, and similar areas. In many cases one needs to find the occurrences of more than one user-defined pattern in a given text. This problem is known as multiple pattern matching; a library bibliographic search program is one such application. In two dimensions the problem is referred to as pattern matching. In applications such as computational biology, it is often desirable to find approximate matches of the pattern in the given text rather than exact matches. In recent years, many efficient algorithms have been developed to locate all occurrences of any of a finite number of keywords and phrases in an arbitrary text string. More recently, several authors [1-5] have investigated the problem of exact-match substring identification in context-free languages.
Based on the decision problem of substrings in CFLs [1], this paper presents a method for substring matching in context-free grammars whose running time is O(n^3). First we consider the problem of substring testing. Let w = a_1 a_2 ... a_n be a string. A substring of w, not necessarily contiguous, is a string a_{i_1} a_{i_2} ... a_{i_k} with 1 ≤ i_1 < i_2 < ... < i_k ≤ n. The substring problem is: given a CFG G and a string w, does there exist a w' ∈ L(G) such that w is a substring of w'? We give algorithms for solving these problems by modifying the CYK algorithm.

Section 2 describes the background of the work presented. Section 3 presents the CYK algorithm with its complexity analysis and an example. Section 4 discusses a procedure for substring matching problems in CFGs. Finally, conclusions are drawn in Section 5.

2. Preliminaries
For simplicity, we assume that the grammar is given in Chomsky Normal Form (CNF). Let G = (V, T, P, S) be that grammar, where
  V denotes the set of non-terminal symbols,
  T denotes the set of terminals,
  P denotes the set of production rules of the form A → α,
  S ∈ V is the start symbol.

The language generated by the grammar is defined as
  L(G) = { w | w ∈ T*, S ⇒* w }

Note:
1. Two strings α, β are said to be related by ⇒, denoted α ⇒ β, when the second string is obtained from the first by one application of some production rule in P.
2. Suppose α_1, α_2, ..., α_m are strings from (V ∪ T)*, m ≥ 1, and α_1 ⇒ α_2, α_2 ⇒ α_3, ..., α_{m-1} ⇒ α_m; then we say α_1 ⇒* α_m. The relation ⇒* is the reflexive and transitive closure of ⇒.

Definition 2.1 (Chomsky Normal Form or CNF)
Any context-free language without ε is generated by a grammar in which all production rules are of the form A → BC or A → b. Here A, B and C are non-terminals and b is a terminal.

Membership testing of CFG
Given a CFG G = <V, T, P, S> in CNF and a string s in T*, test whether s ∈ L(G).

Solution: We present a simple cubic-time algorithm known as the Cocke-Younger-Kasami or CYK algorithm.
It is based on the dynamic programming technique. Given a string s of length n ≥ 1 and a grammar G, which we may assume is in CNF, determine for each i and j and for each non-terminal A whether A ⇒* x_ij, where x_ij is the substring of s from position i to position j. We proceed by induction on the length. For length = 1, that is i = j, A ⇒* x_ii if and only if A → x_ii is a production, since x_ii is the i-th symbol of s. For length > 1, A ⇒* x_ij if and only if there is some production A → BC and some k, i ≤ k ≤ j-1, such that B ⇒* x_ik and C ⇒* x_{k+1,j}. Finally, when we reach i = 1 and j = n, we can determine whether S ⇒* x_1n. As x_1n = s, s is in L(G) if and only if S ⇒* x_1n.

3. Algorithm

CYK Algorithm
  1. For i = 1 to n do
       V[i,i] ← { A | A → a_i ∈ P }
  2. For len = 2 to n do
       For i = 1 to n - len + 1 do
         j ← i + len - 1
         V[i,j] ← ∅
         For k = i to j-1 do
           V[i,j] ← V[i,j] ∪ { A | A → BC ∈ P, B ∈ V[i,k] and C ∈ V[k+1,j] }
  3. If S ∈ V[1,n] then the output is Yes, else the output is No.

Analysis: The time complexity of the algorithm is O(n^3). More precisely, the algorithm takes O(n^3 |P|) time.

Example: Consider the CFG
  S → AB | BC
  A → BA | a
  B → CC | b
  C → AB | a
and the input string x = baaba. The sets V[i,j] are shown in Table 3.1. S ∈ V[1,5] implies that the string x = baaba belongs to the language generated by the CFG.

Table 3.1: CYK Algorithm

         j=1      j=2      j=3      j=4      j=5
         b        a        a        b        a
  i=1    {B}      {A,S}    ∅        ∅        {S,A,C}
  i=2    -        {A,C}    {B}      {B}      {S,C,A}
  i=3    -        -        {A,C}    {S,C}    {B}
  i=4    -        -        -        {B}      {S,A}
  i=5    -        -        -        -        {A,C}

4. Design of Substring Matching in Context Free Grammar
In this section, we present algorithms for substring testing problems in CFGs. Let w = a_1 a_2 ... a_n be a string. A string a_i ... a_j is a substring of w if 1 ≤ i ≤ j ≤ n. The substring problem can be stated as follows: given a CFG G and a string w, does there exist a w' ∈ L(G) such that w is a substring of w'? We give algorithms for solving these problems by modifying the CYK algorithm.

Problem: Given a CFG G = <V, T, P, S> in CNF and a string w = a_1 a_2 ... a_n, find whether there exists a string w' ∈ L(G) such that w is a substring of w'.

Solution: Let G = <V, T, P, S> be a CFG in CNF. Without loss of generality, we assume L(G) ≠ ∅. The algorithm is based on the CYK algorithm. We make use of the notions of Left Closure and Right Closure of a set of non-terminals. The algorithms for these are given below, where N denotes the set of non-terminals of G and V ⊆ N is the argument set.

LeftClosure(V ⊆ N)
  lc ← V
  (1.1) Add A to lc if A → BC ∈ P for some C ∈ lc
        If no new non-terminal was added to lc in this iteration, return(lc); otherwise repeat (1.1)

RightClosure(V ⊆ N)
  rc ← V
  (1.1) Add A to rc if A → CD ∈ P for some C ∈ rc
        If no new non-terminal was added to rc in this iteration, return(rc); otherwise repeat (1.1)

Analysis: The complexity of the LeftClosure algorithm given above is O(|N| |P|). Note that LeftClosure(V) = ∪_{A ∈ V} LeftClosure({A}). The LeftClosure of each individual non-terminal can be precomputed. In that case, using the above relation, LeftClosure can be implemented in O(|N|^2). The analysis of the RightClosure algorithm is analogous, and its overall complexity is the same as that of LeftClosure.
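As a minimal sketch (not the authors' implementation), the CYK recurrence of Section 3 and the LeftClosure and RightClosure procedures above can be written in Python as follows. The grammar encoding (triples (A, B, C) for A → BC, pairs (A, a) for A → a) and the names cyk_table, left_closure and right_closure are assumptions made for illustration only. The closures follow the definition above literally, so the result always contains the seed set.

```python
def cyk_table(binary, unary, s):
    """Return the CYK table V, where V[i, j] holds every non-terminal that
    derives s[i..j] (1-based, inclusive). Assumes s is non-empty."""
    n = len(s)
    V = {}
    for i in range(1, n + 1):
        # Length 1: A is in V[i, i] iff A -> a_i is a production.
        V[i, i] = {A for A, a in unary if a == s[i - 1]}
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            V[i, j] = set()
            for k in range(i, j):
                # A -> BC with B deriving s[i..k] and C deriving s[k+1..j].
                V[i, j] |= {A for A, B, C in binary
                            if B in V[i, k] and C in V[k + 1, j]}
    return V


def _closure(binary, seed, child):
    """Fixpoint shared by both closures; `child` selects which right-hand
    side position (1 = left child, 2 = right child) must be in the set."""
    out = set(seed)
    changed = True
    while changed:
        changed = False
        for rule in binary:
            if rule[child] in out and rule[0] not in out:
                out.add(rule[0])
                changed = True
    return out


def left_closure(binary, seed):
    # Add A whenever A -> BC has its right child C already in the set.
    return _closure(binary, seed, 2)


def right_closure(binary, seed):
    # Add A whenever A -> CD has its left child C already in the set.
    return _closure(binary, seed, 1)


# Example grammar of Section 3 (hypothetical encoding).
BINARY = [("S", "A", "B"), ("S", "B", "C"), ("A", "B", "A"),
          ("B", "C", "C"), ("C", "A", "B")]
UNARY = [("A", "a"), ("B", "b"), ("C", "a")]

V = cyk_table(BINARY, UNARY, "baaba")
print("S" in V[1, 5])  # True, matching Table 3.1
```

Storing the table as a dictionary keyed by (i, j) keeps the 1-based indices of the pseudocode, at the cost of some speed compared with a preallocated array.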
4.1 Algorithm Substring

Input: A CFG G = <V, T, P, S> in CNF and a string w ∈ T+, where w = a_1 a_2 ... a_n.
Assumption: G does not contain any useless productions or useless symbols.
Output: If w is a substring of some w' ∈ L(G) then the output is Yes, else the output is No.
DataStructure: V[0:n+1, 0:n+1]; each entry is a set of non-terminals.

Algorithm:
  Step 1: CYK Algorithm
    For i = 1 to n do
      V[i,i] ← { A | A → a_i ∈ P }
    For len = 2 to n do
      For i = 1 to n - len + 1 do
        j ← i + len - 1
        V[i,j] ← ∅
        For k = i to j-1 do
          V[i,j] ← V[i,j] ∪ { A | A → BC ∈ P, B ∈ V[i,k] and C ∈ V[k+1,j] }

  Step 2:
    For j = 1 to n do
      For k = 0 to j-1 do
        If (k = 0) V[0,j] ← LeftClosure(V[1,j])                                    (2.1)
        Else V[0,j] ← V[0,j] ∪ { A | A → BC ∈ P, B ∈ V[0,k], C ∈ V[k+1,j] }        (2.2)
      V[0,j] ← LeftClosure(V[0,j])                                                 (2.3)

  Step 3:
    For i = n downto 1 do
      For k = n+1 downto i+1 do
        If (k = n+1) V[i,n+1] ← RightClosure(V[i,n])
        Else V[i,n+1] ← V[i,n+1] ∪ { A | A → BC ∈ P, B ∈ V[i,k-1], C ∈ V[k,n+1] }
      V[i,n+1] ← RightClosure(V[i,n+1])

  Step 4:
    4.1 If V[1,n] ≠ ∅, output Yes
    4.2 If V[0,n] ≠ ∅, output Yes
    4.3 If V[1,n+1] ≠ ∅, output Yes
    4.4 V[0,n+1] ← ∅
        For k = 1 to n-1 do
          V[0,n+1] ← V[0,n+1] ∪ { A | A → BC ∈ P, B ∈ V[0,k], C ∈ V[k+1,n+1] }
        If (V[0,n+1] ≠ ∅) output Yes, else output No

Example: Grammar G for a+ b+ c+ d+ e+ f+
  S → AX    X → YF    Y → BZ    Z → WE    W → CD
  A → AA    B → BB    C → CC    D → DD    E → EE    F → FF
  A → a     B → b     C → c     D → d     E → e     F → f

  LeftClosure({S}) = ∅          RightClosure({S}) = ∅
  LeftClosure({A}) = {A}        RightClosure({A}) = {A,S}
  LeftClosure({B}) = {B}        RightClosure({B}) = {B,Y,X}
  LeftClosure({C}) = {C}        RightClosure({C}) = {C,W,Z}
  LeftClosure({D}) = {D,W}      RightClosure({D}) = {D}
  LeftClosure({E}) = {E,Z,Y}    RightClosure({E}) = {E}
  LeftClosure({F}) = {F,X,S}    RightClosure({F}) = {F}
  LeftClosure({X}) = {S}        RightClosure({X}) = ∅
  LeftClosure({Y}) = ∅          RightClosure({Y}) = {X}
  LeftClosure({Z}) = {Y}        RightClosure({Z}) = ∅
  LeftClosure({W}) = ∅          RightClosure({W}) = {Z}

Table 4.1: CYK Algorithm

         j=1    j=2    j=3    j=4    j=5    j=6
         a      b      c      d      e      f
  i=1    {A}    ∅      ∅      ∅      ∅      {S}
  i=2    -      {B}    ∅      ∅      {Y}    {X}
  i=3    -      -      {C}    {W}    {Z}    ∅
  i=4    -      -      -      {D}    ∅      ∅
  i=5    -      -      -      -      {E}    ∅
  i=6    -      -      -      -      -      {F}

Input string: cd
The output is Yes, and the sets V[i,j] are shown in Table 4.2.

Analysis: Let |w| = n. Step 1 is the CYK algorithm, hence it takes O(|P| n^3) time. Step 2.1 takes O(|N|^2) time and Step 2.2 takes O(|P|) time. Hence Step 2 and Step 3 take O(n |N|^2 + n^2 |P|) time. Step 4 takes O(n |P|) time. Thus the substring algorithm runs in O(|P| n^3) time. For any fixed grammar G, |N| and |P| are constants, hence the overall complexity of the algorithm is O(n^3).
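The four steps of Algorithm 4.1 can be sketched in Python as below. This is an illustrative reading of the pseudocode under stated assumptions, not the authors' code: the grammar encoding (triples (A, B, C) for A → BC, pairs (A, a) for A → a), the helper names closure, combine and substring_test, and the choice to apply each closure once after the inner loop are all assumptions. The closure is seeded with its argument, as in the definition of LeftClosure and RightClosure. For the input cd, the sketch reproduces the entries reported in Table 4.2.

```python
def closure(binary, seed, child):
    """child=2 gives LeftClosure (seed member is the right child of A -> BC),
    child=1 gives RightClosure (seed member is the left child)."""
    out = set(seed)
    changed = True
    while changed:
        changed = False
        for rule in binary:
            if rule[child] in out and rule[0] not in out:
                out.add(rule[0])
                changed = True
    return out


def combine(binary, left, right):
    """All A with A -> BC, B in `left` and C in `right`."""
    return {A for A, B, C in binary if B in left and C in right}


def substring_test(binary, unary, w):
    """Return (answer, V) for a non-empty string w; rows and columns 0 and
    n+1 of V hold the left- and right-extended entries."""
    n = len(w)
    V = {}
    # Step 1: plain CYK table.
    for i in range(1, n + 1):
        V[i, i] = {A for A, a in unary if a == w[i - 1]}
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            V[i, j] = set()
            for k in range(i, j):
                V[i, j] |= combine(binary, V[i, k], V[k + 1, j])
    # Step 2: V[0, j], where a_1..a_j may be preceded by extra symbols.
    for j in range(1, n + 1):
        V[0, j] = closure(binary, V[1, j], 2)
        for k in range(1, j):
            V[0, j] |= combine(binary, V[0, k], V[k + 1, j])
        V[0, j] = closure(binary, V[0, j], 2)
    # Step 3: V[i, n+1], where a_i..a_n may be followed by extra symbols.
    for i in range(n, 0, -1):
        V[i, n + 1] = closure(binary, V[i, n], 1)
        for k in range(n, i, -1):
            V[i, n + 1] |= combine(binary, V[i, k - 1], V[k, n + 1])
        V[i, n + 1] = closure(binary, V[i, n + 1], 1)
    # Step 4: w split between a left-extended and a right-extended part.
    V[0, n + 1] = set()
    for k in range(1, n):
        V[0, n + 1] |= combine(binary, V[0, k], V[k + 1, n + 1])
    answer = bool(V[1, n] or V[0, n] or V[1, n + 1] or V[0, n + 1])
    return answer, V


# Example grammar for a+ b+ c+ d+ e+ f+ (hypothetical encoding).
BINARY = [("S", "A", "X"), ("X", "Y", "F"), ("Y", "B", "Z"),
          ("Z", "W", "E"), ("W", "C", "D")]
BINARY += [(x, x, x) for x in "ABCDEF"]
UNARY = [(x, x.lower()) for x in "ABCDEF"]

answer, V = substring_test(BINARY, UNARY, "cd")
print(answer)  # True
```

Because the closures are monotone and idempotent, applying them once after the inner loop yields the same sets as applying them after every union, and it matches the closure counts used in the complexity analysis.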

Table 4.2: Output of the Substring Algorithm

         j=0    j=1    j=2    j=3
                c      d
  i=0    -      {C}    {W}    {W}
  i=1    -      {C}    {W}    {Z,W}
  i=2    -      -      {D}    {D}
  i=3    -      -      -      -

4.2 Algorithm Substring

Problem: Given a CFG G = <V, T, P, S> in CNF and a string w = a_1 a_2 ... a_n, find whether there exists a string w' ∈ L(G) such that w is a (not necessarily contiguous) substring of w'.

Solution: We make use of the notion of the closure of a set of non-terminals. The algorithm is given below:

Closure(V ⊆ N)
  cc ← V
  (1.1) Add A to cc if A → CD ∈ P for some C ∈ cc or D ∈ cc
        If no new non-terminal was added to cc in this iteration, return(cc); otherwise repeat (1.1)

4.2.1 Algorithm:

Input: A CFG G = <V, T, P, S> in CNF and a string w ∈ T+, where w = a_1 a_2 ... a_n.
Assumption: G does not contain any useless productions or useless symbols.
Output: If w is a substring of some w' ∈ L(G) then the output is Yes, else the output is No.
DataStructure: V[1:n, 1:n]; each entry is a set of non-terminals.

Algorithm:
  1. For i = 1 to n do
       V[i,i] ← { A | A → a_i ∈ P }
       V[i,i] ← Closure(V[i,i])
  2. For len = 2 to n do
       For i = 1 to n - len + 1 do
         j ← i + len - 1
         V[i,j] ← ∅
         For k = i to j-1 do
           V[i,j] ← V[i,j] ∪ { A | A → BC ∈ P, B ∈ V[i,k], C ∈ V[k+1,j] }
         V[i,j] ← Closure(V[i,j])
  3. If S ∈ V[1,n] then the output is Yes, else the output is No.

Example 4.2.1: Grammar G for a+ b+ c+ d+ e+ f+
  S → AX    X → YF    Y → BZ    Z → WE    W → CD
  A → AA    B → BB    C → CC    D → DD    E → EE    F → FF
  A → a     B → b     C → c     D → d     E → e     F → f

Input string: bdf
Output: Yes; the sets V[i,j] are shown in Table 4.2.1.

Analysis: By precomputing Closure({A}) for every A ∈ V, the Closure algorithm can be implemented in O(|N|^2). Hence the substring algorithm takes O(|P| n^3 + n^2 |N|^2) time. For any fixed grammar G, |N| and |P| are constants, so the algorithm runs in O(n^3) time.

Table 4.2.1: Output of substring algorithm

         j=1          j=2              j=3
         b            d                f
  i=1    {B,Y,X,S}    {Y,X,S}          {X,S}
  i=2    -            {D,W,Z,Y,X,S}    {X,S}
  i=3    -            -                {F,X,S}
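A minimal Python sketch of Algorithm 4.2.1 follows, assuming the same hypothetical grammar encoding as before (triples (A, B, C) for A → BC, pairs (A, a) for A → a). The names closure and scattered_table are illustrative and do not come from the paper. For the input bdf the sketch yields the sets listed in Table 4.2.1.

```python
def closure(binary, seed):
    """Closure of Section 4.2: add A whenever A -> CD has C or D in the set."""
    cc = set(seed)
    changed = True
    while changed:
        changed = False
        for A, C, D in binary:
            if (C in cc or D in cc) and A not in cc:
                cc.add(A)
                changed = True
    return cc


def scattered_table(binary, unary, w):
    """CYK table in which every entry is closed, so V[i, j] holds the
    non-terminals deriving some string that contains a_i..a_j as a
    not necessarily contiguous substring. Assumes w is non-empty."""
    n = len(w)
    V = {}
    for i in range(1, n + 1):
        V[i, i] = closure(binary, {A for A, a in unary if a == w[i - 1]})
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            V[i, j] = set()
            for k in range(i, j):
                V[i, j] |= {A for A, B, C in binary
                            if B in V[i, k] and C in V[k + 1, j]}
            # One closure per table entry, as counted in the analysis.
            V[i, j] = closure(binary, V[i, j])
    return V


# Grammar of Example 4.2.1 (hypothetical encoding).
BINARY = [("S", "A", "X"), ("X", "Y", "F"), ("Y", "B", "Z"),
          ("Z", "W", "E"), ("W", "C", "D")]
BINARY += [(x, x, x) for x in "ABCDEF"]
UNARY = [(x, x.lower()) for x in "ABCDEF"]

V = scattered_table(BINARY, UNARY, "bdf")
print("S" in V[1, 3])  # True: bdf is a scattered substring of abcdef
```

The only difference from plain CYK is the closure applied to each entry, which is what lets symbols be skipped between and around the matched positions.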

5. CONCLUSION:
The substring matching problem can be solved efficiently by combining the CYK algorithm with the Left Closure and Right Closure algorithms. The procedure described here, together with the behavior of CYK, implies that the substring matching problem for context-free grammars can be solved in O(n^3) time. The problem can also be solved by modifying the grammar G, but our approach leaves the grammar unchanged.

REFERENCES
1. Mauricio Osorio and Juan Antonio Navarro Perez. Decision problem of substrings in context free languages. In Juan Humberto Sossa Azuela, Herbert Freeman, and C. Vizcaino, editors, CIC-X: Memorias del X Congreso Internacional de Computacion, pages 239-249. CIC-IPN, 2004.
2. Heron Molina-Lozano. A new fast fuzzy Cocke-Younger-Kasami algorithm for DNA strings analysis. Int. J. Mach. Learn. & Cyber., 2:209-218, 2011.
3. R. Axelsson, K. Heljanko, and M. Lange. Analyzing context-free grammars using an incremental SAT solver. In Proc. 35th Int. Coll. on Automata, Languages and Programming, ICALP 08, Part II, volume 5126 of LNCS, pages 410-422, 2008.
4. Stefano Crespi Reghizzi and Matteo Pradella. A CKY parser for picture grammars. Information Processing Letters, 105:213-217, 2008.
5. D. C. Kozen. Automata and Computability. Springer, 1997.
6. Kamala Krithivasan and Rama R. Introduction to Formal Languages, Automata Theory and Computation. Pearson, 2009.

Note: This Paper/Article is scrutinised and reviewed by Scientific Committee, BITCON-2015, BIT, Durg, CG, India