Knuth-Morris-Pratt Algorithm

The Problem of String Matching

Given a string S, the problem of string matching deals with finding whether a pattern p occurs in S and, if p does occur, returning the position in S where p occurs.

An O(mn) approach

One of the most obvious approaches to the string matching problem is to compare the first element of the pattern to be searched, p, with the first element of the string S in which to locate p. If the first element of p matches the first element of S, compare the second element of p with the second element of S. If a match is found, proceed likewise until the entire p is found. If a mismatch is found at any position, shift p one position to the right and repeat the comparison beginning from the first element of p.
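The procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name naive_match is an assumption, indices are 0-based (the text counts from 1), and the sample string and pattern are the ones used in the illustration that follows.

```python
def naive_match(S: str, p: str) -> int:
    """Return the 0-based shift of the first occurrence of p in S, or -1."""
    n, m = len(S), len(p)
    for shift in range(n - m + 1):       # try every alignment of p against S
        j = 0
        while j < m and S[shift + j] == p[j]:
            j += 1                       # characters agree, move to the next one
        if j == m:                       # every character of p matched
            return shift
    return -1                            # no alignment matched


print(naive_match("abcabaabcabac", "abaa"))  # prints 3
```

The returned value is the number of positions p was shifted to the right before the match was found.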

How does the O(mn) approach work

Below is an illustration of how the previously described O(mn) approach works.

String S:  a b c a b a a b c a b a c
Pattern p: a b a a

Step 1: compare p[1] with S[1]
S: a b c a b a a b c a b a c
p: a b a a

Step 2: compare p[2] with S[2]
S: a b c a b a a b c a b a c
p: a b a a

Step 3: compare p[3] with S[3]
S: a b c a b a a b c a b a c
p: a b a a
Mismatch occurs here.

Since a mismatch is detected, shift p one position to the right and perform steps analogous to those from step 1 to step 3. At every position where a mismatch is detected, shift p one position to the right and repeat the matching procedure.

S: a b c a b a a b c a b a c
p:       a b a a
Finally, a match is found after shifting p three times to the right.

Drawbacks of this approach: if m is the length of pattern p and n the length of string S, the matching time is of the order O(mn). This is certainly a very slow algorithm. What makes this approach so slow is that elements of S that have already been compared are involved again and again in comparisons in later iterations. For example, when the mismatch is detected for the first time in the comparison of p[3] with S[3], pattern p is moved one position to the right and the matching procedure resumes from there. The first comparison that then takes place is between p[1] = a and S[2] = b. Note that S[2] = b was already involved in a comparison in step 2, so this is a repeated use of S[2] in another comparison. It is these repeated comparisons that lead to the runtime of O(mn).
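To make the cost of these repeated comparisons concrete, the following sketch counts character comparisons over every alignment of p against S. The function name count_naive_comparisons and the test input are assumptions chosen for illustration: a string of 20 a's and the pattern aaaab force a mismatch on the last character of p at each of the n − m + 1 alignments.

```python
def count_naive_comparisons(S: str, p: str) -> int:
    """Count character comparisons made by the shift-by-one approach
    when every alignment of p against S is tried."""
    n, m = len(S), len(p)
    comparisons = 0
    for shift in range(n - m + 1):
        for j in range(m):
            comparisons += 1             # one comparison of S[shift + j] with p[j]
            if S[shift + j] != p[j]:
                break                    # mismatch: shift p one position right
    return comparisons


S = "a" * 20
p = "a" * 4 + "b"
print(count_naive_comparisons(S, p))     # (20 - 5 + 1) * 5 = 80
```

Here every one of the 16 alignments costs 5 comparisons, which is the (n − m + 1) · m behaviour behind the O(mn) bound.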

The Knuth-Morris-Pratt Algorithm

Knuth, Morris and Pratt proposed a linear time algorithm for the string matching problem. A matching time of O(n) is achieved by avoiding comparisons with elements of S that have previously been involved in a comparison with some element of the pattern p to be matched, i.e., backtracking on the string S never occurs.

Components of the KMP algorithm

The prefix function, Π
The prefix function Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern p. In other words, this enables avoiding backtracking on the string S.

The KMP Matcher
With string S, pattern p and prefix function Π as inputs, finds the occurrence of p in S and returns the number of shifts of p after which the occurrence is found.

The prefix function, Π

The following pseudocode computes the prefix function, Π:

Compute-Prefix-Function(p)
 1  m ← length[p]                      // p is the pattern to be matched
 2  Π[1] ← 0
 3  k ← 0
 4  for q ← 2 to m
 5      do while k > 0 and p[k+1] != p[q]
 6             do k ← Π[k]
 7         if p[k+1] = p[q]
 8            then k ← k + 1
 9         Π[q] ← k
10  return Π
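A possible Python rendering of this pseudocode is sketched below. It assumes 0-based indexing, so pi[q] here plays the role of Π[q+1] in the pseudocode and the fallback step reads pi[k - 1] instead of Π[k]; the name compute_prefix_function and the sample pattern are illustrative choices, not part of the original.

```python
from typing import List


def compute_prefix_function(p: str) -> List[int]:
    """pi[q] = length of the longest proper prefix of p[:q + 1]
    that is also a suffix of p[:q + 1]."""
    m = len(p)
    pi = [0] * m
    k = 0                                # length of the currently matched prefix
    for q in range(1, m):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]                # fall back to a shorter matching prefix
        if p[k] == p[q]:
            k += 1                       # the prefix extends by one character
        pi[q] = k
    return pi


print(compute_prefix_function("aabaaab"))  # prints [0, 1, 0, 1, 2, 2, 3]
```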

Example: compute Π for the pattern p below:

q  1 2 3 4 5 6 7
p  a b a b a c a

Initially: m = length[p] = 7, Π[1] = 0, k = 0

Step 1: q = 2, k = 0, Π[2] = 0
q  1 2 3 4 5 6 7
Π  0 0

Step 2: q = 3, k = 0, Π[3] = 1
q  1 2 3 4 5 6 7
Π  0 0 1

Step 3: q = 4, k = 1, Π[4] = 2
q  1 2 3 4 5 6 7
Π  0 0 1 2

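Step 4: q = 5, k = 2, Π[5] = 3
q  1 2 3 4 5 6 7
Π  0 0 1 2 3

Step 5: q = 6, k = 3, Π[6] = 0
Here p[4] = b differs from p[6] = c, so k falls back to Π[3] = 1; p[2] = b also differs from c, so k falls back to Π[1] = 0; p[1] = a differs from c as well, so Π[6] = 0.
q  1 2 3 4 5 6 7
Π  0 0 1 2 3 0

Step 6: q = 7, k = 0, Π[7] = 1
q  1 2 3 4 5 6 7
Π  0 0 1 2 3 0 1

After iterating 6 times, the prefix function computation is complete:

q  1 2 3 4 5 6 7
p  a b a b a c a
Π  0 0 1 2 3 0 1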

The KMP Matcher

The KMP Matcher, with pattern p, string S and prefix function Π as input, finds a match of p in S. The following pseudocode computes the matching component of the KMP algorithm:

KMP-Matcher(S, p)
 1  n ← length[S]
 2  m ← length[p]
 3  Π ← Compute-Prefix-Function(p)
 4  q ← 0                              // number of characters matched
 5  for i ← 1 to n                     // scan S from left to right
 6      do while q > 0 and p[q+1] != S[i]
 7             do q ← Π[q]             // next character does not match
 8         if p[q+1] = S[i]
 9            then q ← q + 1           // next character matches
10         if q = m                    // is all of p matched?
11            then print "Pattern occurs with shift" i − m
12                 q ← Π[q]            // look for the next match

Note: KMP finds every occurrence of p in S. That is why KMP does not terminate in step 12; rather, it searches the remainder of S for any more occurrences of p.
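The matcher can be sketched in Python along the same lines. This is a sketch under stated assumptions rather than a reference implementation: indices are 0-based, the names kmp_match and compute_prefix_function are illustrative, an empty pattern is treated as having no occurrences, and shifts are collected in a list instead of being printed. With 0-based i, the shift i − m of the pseudocode becomes i − m + 1.

```python
from typing import List


def compute_prefix_function(p: str) -> List[int]:
    """0-based prefix function: pi[q] corresponds to Π[q + 1] in the pseudocode."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_match(S: str, p: str) -> List[int]:
    """Return the shifts of all occurrences of p in S."""
    n, m = len(S), len(p)
    if m == 0:
        return []                        # assumption: empty pattern yields no matches
    pi = compute_prefix_function(p)
    shifts = []
    q = 0                                # number of characters matched so far
    for i in range(n):                   # scan S from left to right, never backing up
        while q > 0 and p[q] != S[i]:
            q = pi[q - 1]                # next character does not match
        if p[q] == S[i]:
            q += 1                       # next character matches
        if q == m:                       # all of p matched
            shifts.append(i - m + 1)
            q = pi[q - 1]                # look for the next match
    return shifts


print(kmp_match("abcabaabcabac", "abaa"))  # prints [3]
print(kmp_match("aaaa", "aa"))             # prints [0, 1, 2]
```

The second call shows why the search continues after a match: overlapping occurrences are reported as well.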

Illustration: given a string S and pattern p as follows:

S: b a c b a b a b a b a c a ...
p: a b a b a c a

Let us execute the KMP algorithm to find whether p occurs in S. For p, the prefix function Π was computed previously and is as follows:

q  1 2 3 4 5 6 7
p  a b a b a c a
Π  0 0 1 2 3 0 1

Initially: n = size of S = 15; m = size of p = 7

Step 1: i = 1, q = 0
Comparing p[1] with S[1].
p[1] does not match S[1]. p will be shifted one position to the right.

Step 2: i = 2, q = 0
Comparing p[1] with S[2].
p[1] matches S[2]. Since there is a match, p is not shifted.

Step 3: i = 3, q = 1
Comparing p[2] with S[3].
p[2] does not match S[3]. Backtracking on p, comparing p[1] and S[3].

Step 4: i = 4, q = 0
Comparing p[1] with S[4].
p[1] does not match S[4].

Step 5: i = 5, q = 0
Comparing p[1] with S[5].
p[1] matches S[5].

Step 6: i = 6, q = 1
Comparing p[2] with S[6].
p[2] matches S[6].

Step 7: i = 7, q = 2
Comparing p[3] with S[7].
p[3] matches S[7].

Step 8: i = 8, q = 3
Comparing p[4] with S[8].
p[4] matches S[8].

Step 9: i = 9, q = 4
Comparing p[5] with S[9].
p[5] matches S[9].

Step 10: i = 10, q = 5
Comparing p[6] with S[10].
p[6] doesn't match S[10]. Backtracking on p, comparing p[4] with S[10] because after the mismatch q = Π[5] = 3. p[4] matches S[10].

Step 11: i = 11, q = 4
Comparing p[5] with S[11].
p[5] matches S[11].

Step 12: i = 12, q = 5
Comparing p[6] with S[12].
p[6] matches S[12].

Step 13: i = 13, q = 6
Comparing p[7] with S[13].
p[7] matches S[13].

Pattern p has been found to completely occur in string S. The total number of shifts that took place for the match to be found is i − m = 13 − 7 = 6 shifts.
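The walkthrough can be replayed with a small tracing sketch. It records, for each 1-based position i, the value of q before and after processing S[i]. The names kmp_trace and compute_prefix_function are illustrative, and the input string is an assumption: only the 13 characters of S that the steps above actually examine are used.

```python
from typing import List, Tuple


def compute_prefix_function(p: str) -> List[int]:
    """0-based prefix function: pi[q] corresponds to Π[q + 1] in the pseudocode."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_trace(S: str, p: str) -> Tuple[List[Tuple[int, int, int]], List[int]]:
    """Return (rows, shifts); each row is (i, q before, q after) with 1-based i."""
    pi = compute_prefix_function(p)
    rows, shifts = [], []
    q = 0
    for i, ch in enumerate(S, start=1):
        q_before = q
        while q > 0 and p[q] != ch:
            q = pi[q - 1]                # backtrack on p only, never on S
        if p[q] == ch:
            q += 1
        rows.append((i, q_before, q))
        if q == len(p):
            shifts.append(i - len(p))    # shift i - m, as in the pseudocode
            q = pi[q - 1]
    return rows, shifts


rows, shifts = kmp_trace("bacbabababaca", "ababaca")
for i, q_before, q_after in rows:
    print(f"i = {i:2d}  q before = {q_before}  q after = {q_after}")
print("shifts:", shifts)                 # prints shifts: [6]
```

The "q before" column reproduces the q values listed at the start of steps 1 to 13, including the fallback from 5 to 4 at i = 10.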

Running-time analysis

Compute-Prefix-Function
In the pseudocode for computing the prefix function, the for loop from step 4 to step 9 runs m − 1 times, and steps 1 to 3 take constant time. The inner while loop does not change this bound: k increases by at most 1 per iteration of the for loop (step 8), and every execution of step 6 decreases k, which never becomes negative, so step 6 executes fewer than m times in total. Hence the running time of Compute-Prefix-Function is Θ(m).

KMP-Matcher
The for loop beginning in step 5 runs n times, i.e., once for each character of the string S. Steps 1, 2 and 4 take constant time, and step 3 takes Θ(m) by the analysis of Compute-Prefix-Function. By the same argument as for the prefix function, step 7 executes at most n times in total, because q increases by at most 1 per iteration of the for loop. Thus the running time of the matching loop is Θ(n), and together with the prefix function computation the whole algorithm runs in Θ(m + n).
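The amortized argument can be checked empirically with a sketch that counts how often the fallback step inside the while loop of the matcher runs. The names count_fallbacks and compute_prefix_function are illustrative, the pattern is assumed to be non-empty, and the input is the same adversarial string and pattern that is costly for the shift-by-one approach.

```python
from typing import List


def compute_prefix_function(p: str) -> List[int]:
    """0-based prefix function: pi[q] corresponds to Π[q + 1] in the pseudocode."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def count_fallbacks(S: str, p: str) -> int:
    """Count executions of the fallback q = pi[q - 1] inside the while loop.
    Assumes p is non-empty."""
    pi = compute_prefix_function(p)
    q = 0
    fallbacks = 0
    for ch in S:
        while q > 0 and p[q] != ch:
            q = pi[q - 1]
            fallbacks += 1               # each fallback lowers q by at least 1
        if p[q] == ch:
            q += 1                       # q rises by at most 1 per character of S
        if q == len(p):
            q = pi[q - 1]
    return fallbacks


S = "a" * 20
p = "a" * 4 + "b"
print(count_fallbacks(S, p), len(S))     # prints 16 20
```

The number of fallbacks stays below n = 20, so the total work of the loop is linear in the length of S.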