Proceedings of BITCON-2015, Innovations For National Development
National Conference on: Research and Development in Computer Science and Applications
Research Paper

SUBSTRING MATCHING IN CONTEXT FREE GRAMMAR
Pawan Kumar Patnaik (1), M. V. Padmavati (1), Jyoti Singh (2)
Address for Correspondence
(1) Deptt. of Computer Science & Engg., BIT, Durg
(2) Directorate of Technical Education, Raipur (C.G.), India

ABSTRACT
This paper examines the complexity of the membership and substring testing problems, where the substring is not necessarily contiguous, for context-free grammars in Chomsky Normal Form (CNF). We describe a new algorithm that extends the CYK algorithm for string languages and preserves its polynomial time complexity.

1. INTRODUCTION
Pattern matching has a wide range of applications in pattern recognition, image processing, computer vision and related fields. In one dimension, this problem is referred to as string matching. String matching has applications in text editing, text searching, database search, artificial intelligence, information retrieval and so on. There are many instances in which one needs to find the occurrences of more than one user-defined pattern in a given text. This problem is known as multiple pattern matching; a library bibliographic search program is one such application. In two dimensions this problem is referred to as pattern matching. In many applications, such as computational biology, it is desirable to find approximate matches of the pattern in the given text rather than exact matches. In recent years, many efficient algorithms have been developed to locate all occurrences of any of a finite number of keywords and phrases in an arbitrary text string. More recently, several authors [1-5] have investigated the problem of exact-match substring identification in context-free languages.
Based on the decision problem of substrings in CFLs [1], this paper presents a method for substring matching in context-free grammars whose running time is $O(n^3)$. First we consider the problem of substring testing. Let $w = a_1 a_2 \dots a_n$ be a string. A string $a_{i_1} a_{i_2} \dots a_{i_k}$, not necessarily contiguous, is a substring of $w$ if $1 \le i_1 < i_2 < \dots < i_k \le n$. The substring problem is: given a CFG $G$ and a string $w$, does there exist a $w' \in L(G)$ such that $w$ is a substring of $w'$? We give algorithms for solving these problems by modifying the CYK algorithm.

Section 2 gives the background of the work presented. Section 3 presents the CYK algorithm with its complexity analysis and an example. Section 4 discusses a procedure for substring matching problems in CFGs. Finally, conclusions are drawn in Section 5.

2. Preliminaries
For simplicity, we assume that the grammar is given in Chomsky Normal Form (CNF). Let $G = (V, T, P, S)$ be that grammar, where
    $V$ denotes the set of non-terminal symbols,
    $T$ denotes the set of terminals,
    $P$ denotes the set of production rules of the form $A \to \alpha$,
    $S \in V$ is the start symbol.
The language generated by the grammar is defined as
    $L(G) = \{\, w \mid w \in T^*,\ S \Rightarrow^* w \,\}$
Note:
1. Two strings $\alpha$, $\beta$ are related by $\Rightarrow$, written $\alpha \Rightarrow \beta$, when the second string is obtained from the first by one application of some production rule in $P$.
2. Suppose $\alpha_1, \alpha_2, \dots, \alpha_m$ are strings from $(V \cup T)^*$, $m \ge 1$, and $\alpha_1 \Rightarrow \alpha_2$, $\alpha_2 \Rightarrow \alpha_3$, ..., $\alpha_{m-1} \Rightarrow \alpha_m$. Then we say $\alpha_1 \Rightarrow^* \alpha_m$. The relation $\Rightarrow^*$ is the reflexive and transitive closure of $\Rightarrow$.

Definition 2.1 (Chomsky Normal Form or CNF)
Any context-free language without $\varepsilon$ is generated by a grammar in which all production rules are of the form $A \to BC$ or $A \to b$. Here $A$, $B$ and $C$ are non-terminals and $b$ is a terminal.

Membership testing of CFG
Given a CFG $G = \langle V, T, P, S \rangle$ in CNF and a string $s \in T^*$, test whether $s \in L(G)$.
Solution: We present a simple cubic-time algorithm known as the Cocke-Younger-Kasami (CYK) algorithm.
It is based on the dynamic programming technique. Given a string $s$ of length $n \ge 1$ and a grammar $G$, which we may assume is in CNF, determine for each $i$ and $j$ and for each non-terminal $A$ whether $A \Rightarrow^* x_{ij}$, where $x_{ij}$ is the substring of $s$ from position $i$ to position $j$. We proceed by induction on the length of $x_{ij}$. For length 1, that is $i = j$, $A \Rightarrow^* x_{ii}$ if and only if $A \to x_{ii}$ is a production, since $x_{ii}$ is the $i$-th symbol of $s$. For length greater than 1, $A \Rightarrow^* x_{ij}$ if and only if there is some production $A \to BC$ and some $k$, $i \le k \le j-1$, such that $B \Rightarrow^* x_{ik}$ and $C \Rightarrow^* x_{k+1,j}$. Both $x_{ik}$ and $x_{k+1,j}$ are shorter than $x_{ij}$, so by induction these two conditions have already been decided. Finally, when we reach $i = 1$ and $j = n$, we can determine whether $S \Rightarrow^* x_{1n}$. As $x_{1n} = s$, $s$ is in $L(G)$ if and only if $S \Rightarrow^* x_{1n}$.

3. Algorithm
CYK Algorithm
1. For $i = 1$ to $n$ do
       $V_{i,i} \leftarrow \{\, A \mid A \to a_i \in P \,\}$
2. For $len = 2$ to $n$ do
       For $i = 1$ to $n - len + 1$ do
           $j \leftarrow i + len - 1$
           $V_{i,j} \leftarrow \emptyset$
           For $k = i$ to $j - 1$ do
               $V_{i,j} \leftarrow V_{i,j} \cup \{\, A \mid A \to BC \in P,\ B \in V_{i,k},\ C \in V_{k+1,j} \,\}$
3. If $S \in V_{1,n}$ then the output is Yes, else the output is No.

Analysis: The time complexity of the algorithm is $O(n^3)$ in the length of the input. More precisely, the algorithm takes $O(n^3 |P|)$ time.

Example: Consider the CFG
    $S \to AB \mid BC$
    $A \to BA \mid a$
    $B \to CC \mid b$
    $C \to AB \mid a$
and the input string $x = baaba$. The sets $V_{i,j}$ are shown in Table 3.1. Since $S \in V_{1,5}$, the string $x = baaba$ belongs to the language generated by the CFG.

Table 3.1: CYK Algorithm

| i \ j | 1 (b) | 2 (a) | 3 (a) | 4 (b) | 5 (a) |
|---|---|---|---|---|---|
| 1 | {B} | {A,S} | $\emptyset$ | $\emptyset$ | {S,A,C} |
| 2 | - | {A,C} | {B} | {B} | {S,C,A} |
| 3 | - | - | {A,C} | {S,C} | {B} |
| 4 | - | - | - | {B} | {S,A} |
| 5 | - | - | - | - | {A,C} |

4. Design of Substring Matching in Context Free Grammar
In this section, we present algorithms for substring testing problems in CFGs. Let $w = a_1 a_2 \dots a_n$ be a string. A string $a_i \dots a_j$ is a substring of $w$ if $1 \le i \le j \le n$. The substring problem can be stated as follows: given a CFG $G$ and a string $w$, does there exist a $w' \in L(G)$ such that $w$ is a substring of $w'$? We give algorithms for solving these problems by modifying the CYK algorithm.

Problem: Given a CFG $G = \langle V, T, P, S \rangle$ in CNF and a string $w = a_1 a_2 \dots a_n$, find whether there exists a string $w' \in L(G)$ such that $w$ is a substring of $w'$.

Solution: Let $G = \langle V, T, P, S \rangle$ be a CFG in CNF. Without loss of generality, we assume $L(G) \ne \emptyset$. The algorithm is based on the CYK algorithm. We make use of the notions of Left Closure and Right Closure of a set of non-terminals. The algorithms for these are given below.

LeftClosure($V_N$)
    $lc \leftarrow V_N$
    (1.1) Add $A$ to $lc$ if $A \to BC \in P$ for some $C \in lc$
    If no new non-terminal was added to $lc$ in this iteration, return($lc$); otherwise repeat (1.1)

RightClosure($V_N$)
    $rc \leftarrow V_N$
    Add $A$ to $rc$ if $A \to CD \in P$ for some $C \in rc$
    If no new non-terminal was added to $rc$ in this iteration, return($rc$); otherwise repeat the previous step

Analysis: The complexity of the LeftClosure algorithm given above is $O(|N| \cdot |P|)$, where $|N|$ is the number of non-terminals and $|P|$ the number of productions. Note that LeftClosure($V_N$) $= \bigcup_{A \in V_N}$ LeftClosure($\{A\}$). The LeftClosure of each individual non-terminal can be precomputed. In that case, using the above relation, LeftClosure can be implemented in $O(|N|^2)$ time. The analysis of the RightClosure algorithm is the same as that of LeftClosure, and its overall complexity is the same.
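The CYK recurrence of Section 3 and the two closure procedures above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the encoding of the grammar (`UNIT` for rules $A \to a$, `BINARY` for rules $A \to BC$) and the function names `cyk_table`, `left_closure` and `right_closure` are assumptions made for the example, which reuses the grammar of Table 3.1.

```python
# Grammar of the CYK example (CNF): S -> AB | BC, A -> BA | a,
# B -> CC | b, C -> AB | a.
# UNIT and BINARY are illustrative encodings, not notation from the paper.
UNIT = {"a": {"A", "C"}, "b": {"B"}}           # terminal -> heads of A -> a
BINARY = [("S", "A", "B"), ("S", "B", "C"),    # (head, left, right) for A -> BC
          ("A", "B", "A"), ("B", "C", "C"), ("C", "A", "B")]


def cyk_table(word, unit, binary):
    """Return V where V[i][j] holds every A with A =>* word[i..j] (1-based)."""
    n = len(word)
    V = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):                  # substrings of length 1
        V[i][i] = set(unit.get(word[i - 1], ()))
    for length in range(2, n + 1):             # longer substrings, by length
        for i in range(1, n - length + 2):
            j = i + length - 1
            for k in range(i, j):              # split point between B and C
                for head, left, right in binary:
                    if left in V[i][k] and right in V[k + 1][j]:
                        V[i][j].add(head)
    return V


def left_closure(seed, binary):
    """Fixpoint: add A whenever A -> BC has its right child C in the set."""
    lc = set(seed)
    changed = True
    while changed:
        changed = False
        for head, _left, right in binary:
            if right in lc and head not in lc:
                lc.add(head)
                changed = True
    return lc


def right_closure(seed, binary):
    """Fixpoint: add A whenever A -> CD has its left child C in the set."""
    rc = set(seed)
    changed = True
    while changed:
        changed = False
        for head, left, _right in binary:
            if left in rc and head not in rc:
                rc.add(head)
                changed = True
    return rc


if __name__ == "__main__":
    V = cyk_table("baaba", UNIT, BINARY)
    print("S" in V[1][5])                       # membership answer for baaba
    print(sorted(left_closure({"B"}, BINARY)))  # left closure of {B}
```

For the input baaba the sketch yields $V_{1,5} = \{S, A, C\}$, which agrees with the last entry of the first row of Table 3.1. Each closure loop rescans all productions until nothing changes, which mirrors the iterative description above rather than the precomputed $O(|N|^2)$ variant.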
4.1 Algorithm Substring
Input: A CFG $G = \langle V, T, P, S \rangle$ in CNF and a string $w \in T^+$ where $w = a_1 a_2 \dots a_n$.
Assumption: $G$ does not contain any useless productions or useless symbols.
Output: If $w$ is a substring of some $w' \in L(G)$ then the output is Yes, else the output is No.
Data structure: $V[0:n+1,\ 0:n+1]$, where each entry is a set of non-terminals.
Algorithm:
Step 1: CYK Algorithm
    For $i = 1$ to $n$ do
        $V_{i,i} \leftarrow \{\, A \mid A \to a_i \in P \,\}$
    For $len = 2$ to $n$ do
        For $i = 1$ to $n - len + 1$ do
            $j \leftarrow i + len - 1$
            $V_{i,j} \leftarrow \emptyset$
            For $k = i$ to $j - 1$ do
                $V_{i,j} \leftarrow V_{i,j} \cup \{\, A \mid A \to BC \in P,\ B \in V_{i,k},\ C \in V_{k+1,j} \,\}$
Step 2:
    For $j = 1$ to $n$ do
        For $k = 0$ to $j - 1$ do
            If ($k = 0$)
                $V_{0,j} \leftarrow$ LeftClosure($V_{1,j}$)    (2.1)
            Else
                $V_{0,j} \leftarrow V_{0,j} \cup \{\, A \mid A \to BC \in P,\ B \in V_{0,k},\ C \in V_{k+1,j} \,\}$    (2.2)
        $V_{0,j} \leftarrow$ LeftClosure($V_{0,j}$)    (2.3)
Step 3:
    For $i = n$ downto 1 do
        For $k = n + 1$ downto $i + 1$ do
            If ($k = n + 1$)
                $V_{i,n+1} \leftarrow$ RightClosure($V_{i,n}$)
            Else
                $V_{i,n+1} \leftarrow V_{i,n+1} \cup \{\, A \mid A \to BC \in P,\ B \in V_{i,k-1},\ C \in V_{k,n+1} \,\}$
        $V_{i,n+1} \leftarrow$ RightClosure($V_{i,n+1}$)
Step 4:
    4.1 If $V_{1,n} \ne \emptyset$, output Yes
    4.2 If $V_{0,n} \ne \emptyset$, output Yes
    4.3 If $V_{1,n+1} \ne \emptyset$, output Yes
    4.4 $V_{0,n+1} \leftarrow \emptyset$
        For $k = 1$ to $n - 1$ do
            $V_{0,n+1} \leftarrow V_{0,n+1} \cup \{\, A \mid A \to BC \in P,\ B \in V_{0,k},\ C \in V_{k+1,n+1} \,\}$
        If ($V_{0,n+1} \ne \emptyset$) output Yes, else output No

Example: Grammar $G$ for $a^+ b^+ c^+ d^+ e^+ f^+$
    $S \to AX$    $X \to YF$    $Y \to BZ$    $Z \to WE$    $W \to CD$
    $A \to AA \mid a$    $B \to BB \mid b$    $C \to CC \mid c$
    $D \to DD \mid d$    $E \to EE \mid e$    $F \to FF \mid f$

With the closure of a set initialised to the set itself, as in the algorithms of Section 4, the closures of the individual non-terminals are:
    LeftClosure({S}) = {S}        RightClosure({S}) = {S}
    LeftClosure({A}) = {A}        RightClosure({A}) = {A,S}
    LeftClosure({B}) = {B}        RightClosure({B}) = {B,Y,X}
    LeftClosure({C}) = {C}        RightClosure({C}) = {C,W,Z}
    LeftClosure({D}) = {D,W}      RightClosure({D}) = {D}
    LeftClosure({E}) = {E,Z,Y}    RightClosure({E}) = {E}
    LeftClosure({F}) = {F,X,S}    RightClosure({F}) = {F}
    LeftClosure({X}) = {X,S}      RightClosure({X}) = {X}
    LeftClosure({Y}) = {Y}        RightClosure({Y}) = {Y,X}
    LeftClosure({Z}) = {Z,Y}      RightClosure({Z}) = {Z}
    LeftClosure({W}) = {W}        RightClosure({W}) = {W,Z}

Table 4.1: CYK Algorithm

| i \ j | 1 (a) | 2 (b) | 3 (c) | 4 (d) | 5 (e) | 6 (f) |
|---|---|---|---|---|---|---|
| 1 | {A} | $\emptyset$ | $\emptyset$ | $\emptyset$ | $\emptyset$ | {S} |
| 2 | - | {B} | $\emptyset$ | $\emptyset$ | {Y} | {X} |
| 3 | - | - | {C} | {W} | {Z} | $\emptyset$ |
| 4 | - | - | - | {D} | $\emptyset$ | $\emptyset$ |
| 5 | - | - | - | - | {E} | $\emptyset$ |
| 6 | - | - | - | - | - | {F} |

Input string: cd
The output is Yes, and the sets $V_{i,j}$ are shown in Table 4.2.

Analysis: Let $|w| = n$. Step 1 is the CYK algorithm, hence it takes $O(|P| n^3)$ time. Step 2.1 takes $O(|N|^2)$ time and Step 2.2 takes $O(|P|)$ time. Hence Step 2 and Step 3 take $O(n |N|^2 + n^2 |P|)$ time. Step 4 takes $O(n |P|)$ time. Thus the substring algorithm runs in $O(|P| n^3)$ time. For a fixed grammar $G$, $|N|$ and $|P|$ are constants, hence the overall complexity of the algorithm is $O(n^3)$.
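Steps 1 to 4 of Algorithm Substring can be sketched in Python as follows, using the grammar for $a^+ b^+ c^+ d^+ e^+ f^+$. This is a sketch under stated assumptions rather than a reference implementation: the names `closure`, `combine` and `is_substring`, the `UNIT`/`BINARY` encoding, and the choice to apply the final closure of Steps 2 and 3 once after each inner loop (which is what the $O(n|N|^2)$ term of the analysis suggests) are all assumptions of this example.

```python
# CNF grammar for a+ b+ c+ d+ e+ f+ (illustrative encoding, not from the paper).
UNIT = {t: {t.upper()} for t in "abcdef"}          # A -> a, B -> b, ...
BINARY = [("S", "A", "X"), ("X", "Y", "F"), ("Y", "B", "Z"),
          ("Z", "W", "E"), ("W", "C", "D")] + [(c, c, c) for c in "ABCDEF"]

LEFT, RIGHT = 2, 1   # LeftClosure tests the right child, RightClosure the left


def closure(seed, binary, child):
    """Add heads of A -> BC while the chosen child (index 1 or 2) is in the set."""
    result = set(seed)
    changed = True
    while changed:
        changed = False
        for prod in binary:
            if prod[child] in result and prod[0] not in result:
                result.add(prod[0])
                changed = True
    return result


def combine(left_set, right_set, binary):
    """Heads A of productions A -> BC with B in left_set and C in right_set."""
    return {a for a, b, c in binary if b in left_set and c in right_set}


def is_substring(word, unit, binary):
    """Return (answer, V) for the contiguous substring test of Section 4.1."""
    n = len(word)
    V = [[set() for _ in range(n + 2)] for _ in range(n + 2)]
    # Step 1: plain CYK on rows and columns 1..n.
    for i in range(1, n + 1):
        V[i][i] = set(unit.get(word[i - 1], ()))
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for k in range(i, j):
                V[i][j] |= combine(V[i][k], V[k + 1][j], binary)
    # Step 2: row 0 allows arbitrary material to the left of a_1 .. a_j.
    for j in range(1, n + 1):
        V[0][j] = closure(V[1][j], binary, LEFT)                    # (2.1)
        for k in range(1, j):
            V[0][j] |= combine(V[0][k], V[k + 1][j], binary)        # (2.2)
        V[0][j] = closure(V[0][j], binary, LEFT)                    # (2.3)
    # Step 3: column n+1 allows arbitrary material to the right of a_i .. a_n.
    for i in range(n, 0, -1):
        V[i][n + 1] = closure(V[i][n], binary, RIGHT)
        for k in range(n, i, -1):
            V[i][n + 1] |= combine(V[i][k - 1], V[k][n + 1], binary)
        V[i][n + 1] = closure(V[i][n + 1], binary, RIGHT)
    # Step 4: material allowed on both sides, then the final decision.
    for k in range(1, n):
        V[0][n + 1] |= combine(V[0][k], V[k + 1][n + 1], binary)
    found = bool(V[1][n] or V[0][n] or V[1][n + 1] or V[0][n + 1])
    return found, V


if __name__ == "__main__":
    print(is_substring("cd", UNIT, BINARY)[0])   # cd occurs inside abcdef
    print(is_substring("ba", UNIT, BINARY)[0])   # b never precedes a
```

For the input cd the sketch reports Yes and fills $V_{1,3} = \{Z, W\}$ and $V_{0,3} = \{W\}$. As in the algorithm, the answer is only meaningful when the grammar has no useless symbols, because a non-empty entry is taken as evidence that the non-terminal can be embedded in a derivation from $S$.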
Table 4.2: Output of the Substring Algorithm

| i \ j | 0 | 1 (c) | 2 (d) | 3 |
|---|---|---|---|---|
| 0 | - | {C} | {W} | {W} |
| 1 | - | {C} | {W} | {Z,W} |
| 2 | - | - | {D} | {D} |
| 3 | - | - | - | - |

4.2 Algorithm Substring
Problem: Given a CFG $G = \langle V, T, P, S \rangle$ in CNF and a string $w = a_1 a_2 \dots a_n$, find whether there exists a string $w' \in L(G)$ such that $w$ is a (not necessarily contiguous) substring of $w'$.

Solution: We make use of the notion of the closure of a set of non-terminals. The algorithm is given below.

Closure($V_N$)
    $cc \leftarrow V_N$
    Add $A$ to $cc$ if $A \to CD \in P$ for some $C \in cc$ or $D \in cc$
    If no new non-terminal was added to $cc$ in this iteration, return($cc$); otherwise repeat the previous step

4.2.1 Algorithm:
Input: A CFG $G = \langle V, T, P, S \rangle$ in CNF and a string $w \in T^+$ where $w = a_1 a_2 \dots a_n$.
Assumption: $G$ does not contain any useless productions or useless symbols.
Output: If $w$ is a substring of some $w' \in L(G)$ then the output is Yes, else the output is No.
Data structure: $V[1:n,\ 1:n]$, where each entry is a set of non-terminals.
Algorithm:
1. For $i = 1$ to $n$ do
       $V_{i,i} \leftarrow \{\, A \mid A \to a_i \in P \,\}$
       $V_{i,i} \leftarrow$ Closure($V_{i,i}$)
2. For $len = 2$ to $n$ do
       For $i = 1$ to $n - len + 1$ do
           $j \leftarrow i + len - 1$
           $V_{i,j} \leftarrow \emptyset$
           For $k = i$ to $j - 1$ do
               $V_{i,j} \leftarrow V_{i,j} \cup \{\, A \mid A \to BC \in P,\ B \in V_{i,k},\ C \in V_{k+1,j} \,\}$
           $V_{i,j} \leftarrow$ Closure($V_{i,j}$)
3. If $S \in V_{1,n}$ then the output is Yes, else the output is No.

Example 4.2.1: Grammar $G$ for $a^+ b^+ c^+ d^+ e^+ f^+$
    $S \to AX$    $X \to YF$    $Y \to BZ$    $Z \to WE$    $W \to CD$
    $A \to AA \mid a$    $B \to BB \mid b$    $C \to CC \mid c$
    $D \to DD \mid d$    $E \to EE \mid e$    $F \to FF \mid f$
Input string: bdf
Output: Yes. The sets $V_{i,j}$ are shown in Table 4.2.1.

Analysis: By precomputing Closure($\{A\}$) for every $A \in V$, the Closure algorithm can be implemented in $O(|N|^2)$ time. Hence the substring algorithm runs in $O(|P| n^3 + n^2 |N|^2)$ time. For a fixed grammar $G$, $|N|$ and $|P|$ are constants, so the algorithm runs in $O(n^3)$ time.

Table 4.2.1: Output of substring algorithm

| i \ j | 1 (b) | 2 (d) | 3 (f) |
|---|---|---|---|
| 1 | {B,Y,X,S} | {Y,X,S} | {X,S} |
| 2 | - | {D,W,Z,Y,X,S} | {X,S} |
| 3 | - | - | {F,X,S} |
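The closure-based algorithm of Section 4.2.1 can be sketched in Python as follows. The sketch is illustrative only: the function names `closure` and `scattered_table` and the `UNIT`/`BINARY` grammar encoding are assumptions of this example, and the closure is applied once per table cell after the split loop, in line with the $n^2 |N|^2$ term of the analysis.

```python
# CNF grammar for a+ b+ c+ d+ e+ f+ (illustrative encoding, not from the paper).
UNIT = {t: {t.upper()} for t in "abcdef"}          # A -> a, B -> b, ...
BINARY = [("S", "A", "X"), ("X", "Y", "F"), ("Y", "B", "Z"),
          ("Z", "W", "E"), ("W", "C", "D")] + [(c, c, c) for c in "ABCDEF"]


def closure(seed, binary):
    """Add A whenever A -> CD has either child already in the set."""
    cc = set(seed)
    changed = True
    while changed:
        changed = False
        for head, left, right in binary:
            if head not in cc and (left in cc or right in cc):
                cc.add(head)
                changed = True
    return cc


def scattered_table(word, unit, binary):
    """CYK table in which every cell is closed under Closure (1-based)."""
    n = len(word)
    V = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):                      # step 1
        V[i][i] = closure(unit.get(word[i - 1], ()), binary)
    for length in range(2, n + 1):                 # step 2
        for i in range(1, n - length + 2):
            j = i + length - 1
            for k in range(i, j):
                V[i][j] |= {a for a, b, c in binary
                            if b in V[i][k] and c in V[k + 1][j]}
            V[i][j] = closure(V[i][j], binary)
    return V


def is_scattered_substring(word, unit, binary):
    """Step 3: Yes exactly when S appears in V[1][n]."""
    V = scattered_table(word, unit, binary)
    return "S" in V[1][len(word)]


if __name__ == "__main__":
    print(is_scattered_substring("bdf", UNIT, BINARY))   # b, d, f in order
    print(is_scattered_substring("fa", UNIT, BINARY))    # f never precedes a
```

For the input bdf the sketch reproduces the entries of Table 4.2.1, for example $V_{1,1} = \{B, Y, X, S\}$ and $V_{1,3} = \{X, S\}$, so the answer is Yes.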
5. CONCLUSION
The substring matching problem can be solved efficiently by combining the CYK algorithm with the Left Closure and Right Closure algorithms. The procedure described here, together with the behaviour of CYK, implies that the substring matching problem for context-free grammars can be solved in $O(n^3)$ time. The problem can also be solved by modifying the grammar $G$, but the approach presented here leaves the grammar unmodified.

REFERENCES
1. Mauricio Osorio and Juan Antonio Navarro Perez. Decision problem of substrings in context free languages. In Juan Humberto Sossa Azuela, Herbert Freeman, and C. Vizcaino, editors, CIC-X: Memorias del X Congreso Internacional de Computacion, pages 239-249. CIC-IPN, 2004.
2. Heron Molina-Lozano. A new fast fuzzy Cocke-Younger-Kasami algorithm for DNA strings analysis. Int. J. Mach. Learn. & Cyber., 2:209-218, 2011.
3. R. Axelsson, K. Heljanko, and M. Lange. Analyzing context-free grammars using an incremental SAT solver. In Proc. 35th Int. Coll. on Automata, Languages and Programming, ICALP 08, Part II, volume 5126 of LNCS, pages 410-422, 2007.
4. Stefano Crespi Reghizzi and Matteo Pradella. A CKY parser for picture grammars. Information Processing Letters, 105:213-217, 2008.
5. D. C. Kozen. Automata and Computability. Springer, 1997.
6. Kamala Krithivasan and Rama R. Introduction to Formal Languages, Automata Theory and Computation. Pearson, 2009.

Note: This Paper/Article is scrutinised and reviewed by Scientific Committee, BITCON-2015, BIT, Durg, CG, India