General Suffix Automaton Construction Algorithm and Space Bounds


Mehryar Mohri, Pedro Moreno, Eugene Weinstein

Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012.
Google Research, 76 Ninth Avenue, New York, NY 10011.

Abstract

Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings $U$ has at most $2Q - 2$ states, where $Q$ is the number of nodes of a prefix-tree representing the strings in $U$. This bound significantly improves over $2\|U\| - 1$, the bound given by Blumer et al. (1987), where $\|U\|$ is the sum of the lengths of all strings in $U$. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton $S$ or factor automaton $F$ of $U$ in time $O(|S|)$. Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.

Key words: string-matching, pattern-matching, indexing, inverted text, finite automata, suffix trees, suffix automata, factor automata, music identification.

Email addresses: mohri@cs.nyu.edu (Mehryar Mohri), pedro@google.com (Pedro Moreno), eugenew@cs.nyu.edu (Eugene Weinstein)

Preprint submitted to Elsevier, April 26, 2009

1. Introduction

Searching for patterns in massive quantities of natural language texts, biological sequences, and other widely accessible digitized sequences is a problem of central importance in computer science. The problem has a variety of applications and has been extensively studied in the past [1, 2]. This paper considers the problem of constructing a full index, or inverted file, for a set of strings represented by a finite automaton. When the number of strings is large, such as thousands or even millions, the collection of strings can be compactly stored as an automaton, which also enables efficient implementations of search and other algorithms [3, 4]. In fact, in many contexts such as speech recognition or information extraction tasks, the entire set of strings is often directly given as an automaton.

An efficient and compact data structure for representing a full index of a set of strings is a suffix automaton, a minimal deterministic automaton representing the set of all suffixes of a set of strings. Since a substring is a prefix of a suffix, a suffix automaton can be used to determine if a string $x$ is a substring in time linear in its length $O(|x|)$, which is optimal. Additionally, as with suffix trees, suffix automata have other interesting properties in string-matching problems which make their use and construction attractive [1, 2]. Another similar data structure for representing a full index of a set of strings is a factor automaton, a minimal deterministic automaton representing the set of all factors or substrings of a set of strings. Factor automata offer the same optimal linear-time search property as suffix automata, and are never larger.

The construction and the size of a factor automaton have been specifically analyzed in the case of a single string [5, 6]. These authors demonstrated the remarkable result that the size of the factor automaton of a string $x$ is linear, and that, more precisely, for strings $x$ of length more than three, it has at most $2|x| - 2$ states and $3|x| - 4$ transitions. They also gave on-line linear-time algorithms for constructing a factor automaton from $x$. Similar results were given for suffix automata, the minimal deterministic automata accepting exactly the set of suffixes of a string.

The construction and the size of the factor automata of a finite set of strings $U = \{x_1, \ldots, x_m\}$ have also been previously studied [7]. These authors showed that an automaton accepting all factors of $U$ can be constructed that has at most $2\|U\| - 1$ states and $3\|U\| - 3$ transitions, where $\|U\|$ is the sum of the lengths of all strings in $U$, that is $\|U\| = \sum_{i=1}^{m} |x_i|$.

This paper proves a significantly better bound on the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings $U$ has at most $2Q - 2$ states, where $Q$ is the number of nodes of a prefix-tree representation of the strings in $U$. The number of nodes $Q$ can be dramatically smaller than $\|U\|$, the sum of the lengths of all strings. Thus, our space bound clearly improves on previous work [7]. More generally, we give novel bounds for the size of the suffix automaton or factor automaton of an acyclic finite automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the

strings accepted by the original automaton. This result can be compared to that of Inenaga et al. for compact directed acyclic word graphs whose complexity, $O(|\Sigma| Q)$, depends on the size of the alphabet [8].

Using our space bound analysis, we also give a simple algorithm for constructing the suffix automaton $S$ or factor automaton $F$ of $U$ in time $O(|S|)$ from a prefix tree representing $U$. Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string.

The original motivation for this work was the design of a large-scale music identification system [4, 9], where we represented our song database by a compact finite automaton, as we shall briefly describe later in this paper. To facilitate an efficient search of song snippets, we constructed the minimal deterministic factor automaton of the song automaton. Empirically, the size of the factor automaton was not prohibitive. But, to ensure the scalability of our approach to a larger set of songs, e.g., several million songs, we wished to derive a bound on the size of the factor automata of automata. One characteristic of the strings considered in this application as in many others is that the original strings do not share long suffixes. This motivated our analysis of the size of the factor automata with respect to the length of the common suffixes in the original automaton.

The remainder of the paper is organized as follows. Section 2 introduces the string and automata definitions and terminology used throughout the paper. In Section 3, we describe a novel analysis of factor automata and present new bounds on the size of the suffix automaton and factor automaton of an automaton. Section 4 gives a detailed description of a linear-time algorithm for the construction of the suffix automaton and factor automaton of a finite set of strings, or of any suffix-unique automaton, including pseudocode of the algorithm. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties [10]. Section 5 briefly describes the use of factor automata in music identification and reports several empirical results related to their size.

2. Factors of a Finite Automaton

This section reviews some key properties of factors of a fixed finite automaton, generalizing similar observations made by Blumer et al. for a single string [7].

We denote by $\Sigma$ a finite alphabet. The length of a string $x \in \Sigma^*$ over that alphabet is denoted by $|x|$. A factor, or substring, of a string $x \in \Sigma^*$ is a sequence of symbols appearing consecutively in $x$. Thus, $y$ is a factor of $x$ iff there exist $u, v \in \Sigma^*$ such that $x = uyv$. A suffix of a string $x \in \Sigma^*$ is a factor that appears at the end of $x$. Put otherwise, $y$ is a suffix of $x$ iff there exists $u \in \Sigma^*$ such that $x = uy$. Analogously, $y$ is a prefix of $x$ iff there exists $u \in \Sigma^*$ such that $x = yu$. More generally, a factor, suffix, or prefix of a set of strings $U$ or an automaton $A$, is a factor, suffix, or prefix of a string in $U$ or a string

accepted by $A$, respectively. The symbol $\epsilon$ represents the empty string. For any string $x \in \Sigma^*$, $\epsilon$ is always a prefix, suffix, and factor of $x$.

In some applications such as music identification the strings considered may be long, e.g., sequences of music sounds, but with relatively short common suffixes. This motivates the following definition.

Definition 1. Let $k$ be a non-negative integer. We will say that a finite automaton $A$ is $k$-suffix-unique if no two strings accepted by $A$ share a suffix of length $k$. $A$ is said to be suffix-unique when it is $k$-suffix-unique with $k = 1$.

Figure 1 gives an example of a simple automaton $A$ accepting three strings ending in distinct symbols. Note that $A$ is suffix-unique. The main results of this paper hold for suffix-unique automata, but we also present some results for the general case of arbitrary acyclic automata.

[Figure 1: Finite automaton $A$ accepting the strings ac, acab, acba.]

We denote by $F(A)$ the minimal deterministic automaton accepting the set of factors of a finite automaton $A$, that is the set of factors of the strings accepted by $A$. Similarly, we denote by $S(A)$ the minimal deterministic automaton accepting the set of suffixes of an automaton $A$.

Definition 2. Let $A$ be a finite automaton. For any string $x \in \Sigma^*$, we define end-set($x$) as the set of states of $A$ reached by the paths in $A$ that begin with $x$. We say that two strings $x$ and $y$ in $\Sigma^*$ are equivalent and denote this by $x \equiv y$, when end-set($x$) = end-set($y$). This defines a right-invariant equivalence relation on $\Sigma^*$. We denote by $[x]$ the equivalence class of $x \in \Sigma^*$.

Lemma 1. Assume that $A$ is suffix-unique. Then, a non-suffix factor $x$ of the automaton $A$ is the longest member of $[x]$ iff it is either a prefix of $A$, or both $ax$ and $bx$ are factors of $A$ for distinct $a, b \in \Sigma$.

Proof. Let $x$ be a non-suffix factor of $A$. Clearly, if $x$ is not a prefix, then there must be distinct $a$ and $b$ such that $ax$ and $bx$ are factors of $A$, otherwise $[x]$ would admit a longer member. Conversely, assume that $ax$ and $bx$ are both factors of $A$ with $a \neq b$. Let $y$ be the longest member of $[x]$. Let $q$ be a state in end-set($x$) = end-set($y$). Since $x$ is not a suffix, $q$ is not a final state, and there exists a non-empty string $z$ labeling a path from $q$ to a final state. Since $A$ is suffix-unique, both $xz$ and $yz$ are suffixes of the same string. Since $y$ is the longest member of $[x]$, $x$ must be a suffix of $y$. Since $ax$ and $bx$ are both factors of $A$ with $a \neq b$, we must have $y = x$. Finally, if $x$ is a prefix, then clearly it is the longest member of $[x]$.
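As a minimal illustration of Definition 1, the following sketch checks $k$-suffix-uniqueness for an explicitly listed finite set of strings rather than for an automaton. The function name `is_k_suffix_unique` and the example strings are assumptions made for illustration only; they are not taken from the paper.

```python
def is_k_suffix_unique(strings, k=1):
    """Return True if no two distinct strings share a suffix of length k.

    Strings shorter than k have no suffix of length k, so they cannot
    take part in a violation. With k = 0 the shared suffix is the empty
    string, so only sets with at most one string qualify.
    """
    seen = set()
    for s in set(strings):           # ignore duplicate listings of one string
        if len(s) < k:
            continue
        suffix = s[len(s) - k:]      # the unique suffix of length k
        if suffix in seen:
            return False
        seen.add(suffix)
    return True


# Hypothetical example: three strings ending in distinct symbols.
example = ["ac", "acab", "acba"]
print(is_k_suffix_unique(example))            # suffix-unique (k = 1)
print(is_k_suffix_unique(["ab", "cb"]))       # both end in "b"
print(is_k_suffix_unique(["ab", "cb"], k=2))  # distinct suffixes of length 2
```

The check runs in time proportional to the total length of the strings, which is enough to decide whether the suffix-unique results of this paper apply directly or whether the $k$-suffix-unique bounds are the relevant ones.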

[Figure 2: Suffix automaton $S(A)$ of the automaton $A$ of Figure 1.]

Proposition 1. Assume that $A$ is suffix-unique. Let $S_A = (Q_S, I_S, F_S, E_S)$ be the deterministic automaton whose states are the equivalence classes $Q_S = \{[x] : x \in \Sigma^*\}$, its initial state $I_S = \{[\epsilon]\}$, its final states $F_S = \{[x] : \text{end-set}(x) \cap F_A \neq \emptyset\}$ where $F_A$ is the set of final states of $A$, and its transition set $E_S = \{([x], a, [xa]) : [x], [xa] \in Q_S\}$. Then, $S_A$ is the minimal deterministic suffix automaton of $A$: $S_A = S(A)$.

Proof. By construction, $S_A$ is deterministic and accepts exactly the set of suffixes of $A$. Let $[x]$ and $[y]$ be two equivalent states of $S_A$. Then, for all $z \in \Sigma^*$, $[xz] \in F_S$ iff $[yz] \in F_S$, that is $xz$ is a suffix of $A$ iff $yz$ is a suffix of $A$. Since $A$ is suffix-unique, this implies that either $x$ is a suffix of $y$ or vice versa, and thus that $[x] = [y]$. Thus, $S_A$ is minimal.

In what follows, we will be interested in the case where the automaton $A$ is acyclic. We denote by $|A|_Q$ the number of states of $A$, by $|A|_E$ the number of transitions of $A$, and by $|A|$ the size of $A$ defined as the sum of the number of states and transitions of $A$.

3. Space Bounds for Factor Automata

The objective of this section is to derive new bounds on the size of $S(A)$ and $F(A)$ in the case of interest for our applications where $A$ is an acyclic automaton, typically deterministic and minimal, representing a set of strings.

When $A$ represents a single string, there are standard algorithms for constructing $S(A)$ and $F(A)$ from $A$ in linear time [5, 6]. In the general case, $S(A)$ can be constructed from $A$ as follows: add an $\epsilon$-transition from the initial state of $A$ to each state of $A$, then apply an $\epsilon$-removal algorithm, followed by determinization and minimization. $F(A)$ can be obtained similarly by further making all states final before applying $\epsilon$-removal, determinization, and minimization. It can also be obtained from $S(A)$ by making all states of $S(A)$ final and applying minimization.

For example, if $A$ is the simple automaton of Figure 1, then Figure 2 is its suffix automaton $S(A)$.

When $A$ represents a single string $x$, the size of the automata $S(A)$ and $F(A)$ can be proved to be linear in $|x|$.
More precisely, the following bounds hold for $S(A)$ and $F(A)$ [6, 5]:

$$|S(A)|_Q \le 2|x| - 1, \qquad |S(A)|_E \le 3|x| - 4, \qquad |F(A)|_Q \le 2|x| - 2, \qquad |F(A)|_E \le 3|x| - 4. \tag{1}$$

These bounds are tight for strings of length more than three. [7] gave similar results for the case of a set of strings $U$ by showing that the size of the factor automaton $F(U)$ representing this set is bounded as follows

$$|F(U)|_Q \le 2\|U\| - 1, \qquad |F(U)|_E \le 3\|U\| - 3, \tag{2}$$

where $\|U\|$ denotes the sum of the lengths of all strings in $U$. In general, the size of an acyclic automaton $A$ representing a finite set of strings $U$ can be substantially smaller than $\|U\|$. In fact, $|A|$ can be exponentially smaller than $\|U\|$. Thus, we are interested in bounding the size of $S(A)$ or $F(A)$ in terms of the size of $A$, rather than the sum of the lengths of all strings accepted by $A$.

For any state $q$ of $S(A)$, we denote by suff($q$) the set of strings labeling the paths from $q$ to a final state. We also denote by $N(q)$ the set of states in $A$ from which a path labeled with a non-empty string in suff($q$) reaches a final state.

Lemma 2. Let $A$ be a suffix-unique automaton and let $q$ and $q'$ be two states of $S(A)$ such that $N(q) \cap N(q') \neq \emptyset$, then

$$\big(\text{suff}(q) \subseteq \text{suff}(q') \text{ and } N(q) \subseteq N(q')\big) \quad \text{or} \quad \big(\text{suff}(q') \subseteq \text{suff}(q) \text{ and } N(q') \subseteq N(q)\big). \tag{3}$$

Proof. Since $S(A)$ is a minimal automaton, its states are accessible from the initial state. Let $u$ be the label of a path from the initial state $I$ of $S(A)$ to $q$ and similarly $u'$ the label of a path from $I$ to $q'$. By assumption, there exists $p \in N(q) \cap N(q')$. Thus, there exist non-empty strings $v \in \text{suff}(q)$ and $v' \in \text{suff}(q')$ such that both $v$ and $v'$ label paths from $p$ to a final state. By definition of $u$ and $u'$, both $uv$ and $u'v'$ are suffixes of $A$. Since $A$ is suffix-unique and $v$ is non-empty, there exists a unique string accepted by $A$ and ending with $v$. There exists also a unique string accepted by $A$ and ending with $uv$. Thus, these two strings must coincide. This implies that any string accepted by $A$ and admitting $v$ as suffix also admits $uv$ as suffix. In particular, the label of any path from an initial state to $p$ must admit $u$ as suffix. Reasoning in the same way for $v'$ lets us conclude that the label of any path from an initial state to $p$ must also admit $u'$ as suffix. Thus, $u$ and $u'$ are suffixes of the same string. Thus, $u$ is a suffix of $u'$ or vice-versa. Figure 3 illustrates this situation. Assume without loss of generality that $u$ is a suffix of $u'$. Then, for any string $w$, if $u'w$ is a suffix of $A$ so is $uw$. Thus, $\text{suff}(q') \subseteq \text{suff}(q)$, which implies $N(q') \subseteq N(q)$. When $u'$ is a suffix of $u$, we obtain similarly the other case of the statement of the lemma.

Note that Lemma 2 holds even when $A$ is a non-deterministic automaton.
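The generic construction described at the start of this section ($\epsilon$-transitions from the initial state to every state, $\epsilon$-removal, determinization, minimization) can be sketched by brute force for a small finite set of strings. The sketch below is not the linear-time algorithm of Section 4: it uses a plain subset construction, which can be exponential in general, followed by Moore-style partition refinement. The names `build_trie` and `suffix_automaton_size`, and the choice of a prefix tree as the input automaton, are assumptions made for illustration.

```python
from collections import deque


def build_trie(strings):
    """Prefix tree as (transitions, finals); state 0 is the root."""
    trans, finals = [{}], set()
    for s in strings:
        q = 0
        for ch in s:
            if ch not in trans[q]:
                trans.append({})
                trans[q][ch] = len(trans) - 1
            q = trans[q][ch]
        finals.add(q)
    return trans, finals


def suffix_automaton_size(strings, all_final=False):
    """State count of the minimal DFA for the suffixes of `strings`.

    With all_final=True every trie state is made final first, which
    yields the factor automaton instead.
    """
    trans, finals = build_trie(strings)
    if all_final:
        finals = set(range(len(trans)))
    # Adding epsilon-transitions from the root to every state and then
    # removing them amounts to starting the subset construction from
    # the set of all states.
    start = frozenset(range(len(trans)))
    dfa, queue = {start: {}}, deque([start])
    while queue:
        subset = queue.popleft()
        by_symbol = {}
        for q in subset:
            for ch, r in trans[q].items():
                by_symbol.setdefault(ch, set()).add(r)
        for ch, targets in by_symbol.items():
            target = frozenset(targets)
            dfa[subset][ch] = target
            if target not in dfa:
                dfa[target] = {}
                queue.append(target)
    # Moore-style minimization: refine the final/non-final partition
    # until the number of classes stops growing.
    block = {s: int(bool(s & finals)) for s in dfa}
    while True:
        ids, refined = {}, {}
        for s in dfa:
            sig = (block[s],
                   tuple(sorted((ch, block[t]) for ch, t in dfa[s].items())))
            refined[s] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            return len(ids)
        block = refined


# Hypothetical example set of strings.
U = ["ac", "acab", "acba"]
Q = len(build_trie(U)[0])                      # nodes of the prefix tree
print(Q, sum(map(len, U)), suffix_automaton_size(U))
```

Because the input is acyclic and every trie state can reach a final state, no dead state is created and the returned count follows the convention used in this paper. Passing `all_final=True` gives the size of the factor automaton, which is never larger than that of the suffix automaton.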

[Figure 3: Illustration of the situation described in Lemma 2. $uv$ and $u'v'$ are suffixes of the same string $x$. Thus, $u$ and $u'$ are also suffixes of the same string. Thus, $u$ is a suffix of $u'$ or vice-versa.]

Lemma 3. Let $A$ be a suffix-unique deterministic automaton and let $q$ and $q'$ be two distinct states of $S(A)$ such that $N(q) = N(q')$, then either $q$ is a final state and $q'$ is not, or $q'$ is a final state and $q$ is not.

Proof. Assume that $N(q) = N(q')$. By Lemma 2, this implies $\text{suff}(q) = \text{suff}(q')$. Thus, the same non-empty strings label the paths from $q$ to a final state or the paths from $q'$ to a final state. Since $S(A)$ is a minimal automaton, the distinct states $q$ and $q'$ are not equivalent. Thus, one must admit an empty path to a final state and not the other.

The following proposition extends the results of [7] which hold for a set of strings, to the case where $A$ is an automaton.

Proposition 2. Let $A$ be a suffix-unique deterministic and minimal automaton accepting strings of length more than three. Then, the number of states of the suffix automaton of $A$ is bounded as follows

$$|S(A)|_Q \le 2|A|_Q - 3. \tag{4}$$

Proof. If the strings accepted by $A$ are all of the form $a^n$, $S(A)$ can be derived from $A$ simply by making all its states final and the bound is trivially achieved. In the remainder of the proof, we can thus assume that not all strings accepted by $A$ are of this form.

Let $F$ be the unique final state of $S(A)$ with no outgoing transitions. Lemmas 2-3 help define a tree $T$ associated to all states of $S(A)$ other than $F$ by using the ordering:

$$N(q) \sqsubseteq N(q') \iff \begin{cases} N(q) \subset N(q'), \text{ or} \\ N(q) = N(q') \text{ and } q' \text{ final, } q \text{ non-final.} \end{cases} \tag{5}$$

We will identify each node of $T$ with its corresponding state in $S(A)$. By Proposition 1, each state $q$ of $S(A)$ can also be identified with an equivalence class $[x]$. Let $q$ be a state of $S(A)$ distinct from $F$, and let $[x]$ be its corresponding equivalence class. Observe that since $A$ is suffix-unique, end-set($x$) coincides with $N(q)$.

We will show that the number of nodes of $T$ is at most $2|A|_Q - 4$, which will yield the desired bound on the number of states of $S(A)$. To do so, we bound separately the number of non-branching and branching nodes of $T$.

Let $q$ be a node of $T$ and let $[x]$ be the corresponding equivalence class, with $x$ its longest member. The children of $q$ are the nodes corresponding to the equivalence classes $[ax]$ where $a \in \Sigma$ and $ax$ is a factor of $A$.

By Lemma 1, if $x$ is a non-suffix and non-prefix factor, then there exist factors $ax$ and $bx$ with $a \neq b$. Thus, $q$ admits at least two children corresponding to $[ax]$ and $[bx]$ and is thus a branching node. Thus non-branching nodes can only be either nodes $q$ where $x$ is a prefix, or those where $x$ is a suffix, that is when $q$ is a final state of $S(A)$.

Since the strings accepted by $A$ are not all of the form $a^n$ for some $a \in \Sigma$, the empty prefix $\epsilon$ occurs at least in two distinct left contexts $a$ and $b$ with $a \neq b$. Thus, the prefix $\epsilon$, which corresponds to the root of $T$, is necessarily branching. Also, let $f$ be the unique final state of $A$ with no outgoing transitions. The equivalence class of the longest factor ending in $f$, that is the longest string accepted by $A$, corresponds to the state $F$ in $S(A)$ which is not included in the tree $T$. Thus, there are at most $|A|_Q - 2$ non-branching prefixes. There can be at most one non-branching node for each string accepted by $A$. Let $N_{str}$ denote the number of strings accepted by $A$, then, the number of non-branching nodes $N_{nb}$ of $T$ is at most $N_{nb} \le |A|_Q - 2 + N_{str}$.

To bound the number of branching nodes $N_b$ of $T$, observe that since $A$ is suffix-unique, each string accepted by $A$ must end with a distinct symbol $a_i$, $i = 1, \ldots, N_{str}$. Each $a_i$ represents a distinct left context for the empty factor $\epsilon$, thus the root node $[\epsilon]$ admits all $[a_i]$s, $i = 1, \ldots, N_{str}$, as children. Let $T_{a_i}$ represent the sub-tree rooted at $[a_i]$ and let $n_{a_i}$ represent the number of leaves of $T_{a_i}$. Let $a_j$, $j = N_{str} + 1, \ldots, N_{str} + k$, denote the other children of the root and let $T_{a_j}$ denote each of the corresponding sub-trees. A tree with $n_{a_i}$ leaves has less than $n_{a_i}$ branching nodes. Thus, the number of branching nodes of $T_{a_i}$ is at most $n_{a_i} - 1$. The total number of leaves of $T$ is at most the number of disjoint subsets of $Q$ excluding the initial state and $f$. Note however that when the root node $[\epsilon]$ admits only $[a_i]$s, $i = 1, \ldots, N_{str}$, as children, that is when $k = 0$, then there is at least one $a_i$, say $a_1$, that is also a prefix of $A$ since any other symbol would have been the root node's child. The node $a_1$ will then have also a child since it corresponds to a suffix or final state of $S(A)$. Thus, $a_1$ cannot be a leaf in that case. Thus, there are at most as many as

$$\sum_{i=1}^{N_{str}+k} n_{a_i} \le |A|_Q - 2 - 1_{k=0}$$

leaves and the total number of branching nodes of $T$, including the root, is at most

$$N_b \le \sum_{i=1}^{N_{str}+k} (n_{a_i} - 1) + 1 \le |A|_Q - 2 - 1_{k=0} - (N_{str} + k) + 1 \le |A|_Q - 2 - N_{str}.$$

The total number of nodes of the tree $T$ is thus at most $N_{nb} + N_b \le 2|A|_Q - 4$.

In the specific case where $A$ represents a single string $x$, the bound of Proposition 2 matches that of [6] or [5] since $|A|_Q = |x| + 1$. The bound of Proposition 2 is tight for strings of length more than three and thus is also tight for automata accepting strings of length more than three. Note that the automaton of Figure 1 is suffix-unique, deterministic, and minimal and has $|A|_Q = 6$ states. The number of states of the minimal suffix automaton of $A$ is $|S(A)|_Q = 7 < 2|A|_Q - 3$.

Corollary 1. Let $A$ be a suffix-unique deterministic and minimal automaton accepting strings of length more than three. Then, the number of states of the factor automaton of $A$ is bounded as follows

$$|F(A)|_Q \le 2|A|_Q - 3. \tag{6}$$

Proof. As mentioned earlier, a factor automaton $F(A)$ can be obtained from a suffix automaton $S(A)$ by making all states final and applying minimization. Thus, $|F(A)|_Q \le |S(A)|_Q$. The result follows from Proposition 2.

Blumer et al. (1987) showed that an automaton accepting all factors of a set of strings $U$ has at most $2\|U\| - 1$ states, where $\|U\|$ is the sum of the lengths of all strings in $U$ [7]. The following gives a significantly better bound on the size of the factor automaton of a set of strings $U$ as a function of the number of nodes of a prefix-tree representing $U$, which is typically substantially smaller than $\|U\|$.

Corollary 2. Let $U = \{x_1, \ldots, x_m\}$ be a set of strings of length more than three and let $A$ be a prefix-tree representing $U$. Then, the number of states of the factor automaton $F(U)$ and that of the suffix automaton $S(U)$ of the strings of $U$ are bounded as follows

$$|F(U)|_Q \le 2|A|_Q - 2, \qquad |S(U)|_Q \le 2|A|_Q - 2. \tag{7}$$

Proof. Let $B$ be a prefix-tree representing the set $U' = \{x_1 \$_1, \ldots, x_m \$_m\}$, obtained by appending to each string of $U$ a new symbol $\$_i$, $i = 1, \ldots, m$, to make their suffixes distinct and let $B'$ be the automaton obtained by minimization of $B$. By construction, $B$ has $m$ more states than $A$, but since all final states of $B$ are equivalent and merged after minimization, $B'$ has at most one more state than $A$. By construction, $B'$ is a suffix-unique automaton and by Proposition 2, $|S(B')|_Q \le 2|B'|_Q - 3$. Removing from $S(B')$ the transitions labeled with the extra symbols $\$_i$ and connecting the resulting automaton yields the minimal suffix automaton $S(U)$. In $S(B')$, there must be a final state reachable by the transitions labeled with $\$_i$ and only such transitions, which becomes non-accessible after removal of the extra symbols. Thus, $S(U)$ has at least one state less than $S(B')$, which gives:

$$|S(U)|_Q \le |S(B')|_Q - 1 \le 2|B'|_Q - 4 \le 2|A|_Q - 2. \tag{8}$$

A similar bound holds for the factor automaton $F(U)$ following the argument given in the proof of Corollary 1.

When $A$ is $k$-suffix-unique with a relatively small $k$ as in our applications of interest, the following proposition provides a convenient bound on the size of the suffix automaton.

Proposition 3. Let $A$ be a $k$-suffix-unique deterministic automaton accepting strings of length more than three and let $n$ be the number of strings accepted by $A$. Then, the following bound holds for the number of states of the suffix automaton of $A$:

$$|S(A)|_Q \le 2|A_k|_Q + 2kn - 3, \tag{9}$$

where $A_k$ is the part of the automaton of $A$ obtained by removing the states and transitions of all suffixes of length $k$.

Proof. Let $A$ be a $k$-suffix-unique deterministic automaton accepting strings of length more than three and let the alphabet $\Sigma$ be augmented with $n$ temporary symbols $\$_1, \ldots, \$_n$. By marking each string accepted by $A$ with a distinct symbol $\$_i$, we can turn $A$ into a suffix-unique deterministic automaton $A'$. To do that, we first unfold all $k$-length suffixes of $A$. In the worst case, all these (distinct) suffixes were sharing the same $(k - 1)$-length suffix. Unfolding can thus increase the number of states of $A$ by as many as $kn - n$ states in the worst case. Marking the end of each suffix with a distinct $\$$-sign further increases the size by $n$. The resulting automaton $A'$ is deterministic and $|A'|_Q \le |A_k|_Q + kn$. By Proposition 2, the size of the suffix automaton of $A'$ is bounded as follows: $|S(A')|_Q \le 2|A'|_Q - 3$. Since transitions labeled with a $\$$-sign can only appear at the end of successful paths in $S(A')$, we can remove these transitions and make their origin state final, and minimize the resulting automaton to derive a deterministic automaton $A''$ accepting the set of suffixes of $A$. The statement of the proposition follows from the fact that $|A''|_Q \le |S(A')|_Q$.

Since the size of $F(A)$ is always less than or equal to that of $S(A)$, we obtain directly the following result.

Corollary 3. Let $A$ be a $k$-suffix-unique automaton accepting strings of length more than three. Then, the following bound holds for the factor automaton of $A$:

$$|F(A)|_Q \le 2|A_k|_Q + 2kn - 3. \tag{10}$$

The bound given by the corollary is not tight for relatively small values of $k$ in the sense that in practice, the size of the factor automaton does not depend on $kn$, the sum of the lengths of suffixes of length $k$, but rather on the number of states of $A$ used for their representation, which for a minimal automaton can be substantially less. However, for large $k$, e.g., when all strings are of the same length and $k$ is as long as the length of the strings accepted by $A$, our bound coincides with that of [7].

4. Suffix Automaton Construction Algorithm

This section describes a linear-time algorithm for the construction of the suffix automaton $S(A)$ of an input suffix-unique automaton $A$, or similarly the factor automaton $F(A)$ of $A$. Since a factor automaton can be obtained from $S(A)$ by making all states of $S(A)$ final and applying a linear-time acyclic minimization algorithm [11], it suffices to describe a linear-time algorithm for the construction of $S(A)$. It is possible however to give a similar direct linear-time algorithm for the construction of $F(A)$.

Figures 4-6 give the pseudocode for the algorithm for constructing the suffix automaton $S(A) = (Q_S, I, F_S, \delta_S)$ of an automaton $A = (Q_A, I, F_A, \delta_A)$, where $A$ is suffix-unique and where $\delta_S : Q_S \times \Sigma \to Q_S$ denotes the partial transition function of $S(A)$ and likewise $\delta_A : Q_A \times \Sigma \to Q_A$ that of $A$. As in the previous section, $f$ denotes the final state of $A$ with no outgoing transitions. Additionally,

Crete-Suffix-Automton(A, f) 1 S Q S {I} initil stte 2 s[i] undefined; l[i] 0 3 while S do 4 p Hed(S) 5 for eh suh tht δ A (p, ) undefined do 6 if δ A (p, ) f then 7 Q S Q S {p} 8 l[q] l[p] + 1 9 Suffix-Next(p,, q) 10 Enqueue(S, q) 11 Q S Q S {f} 12 for eh stte p Q A nd Σ suh tht δ A (p, ) = f do 13 Suffix-Next(p,, f) 14 Suffix-Finl(f) 15 for eh p F A do 16 Suffix-Finl(q) 17 return S(A) = (Q S, I, F S, δ S ) Figure 4: Algorithm for the onstrution of the suffix utomton of suffix-unique utomton A. we use the term suffix pointer to refer to the destintion stte of the suffix link trnsition. The lgorithm is generliztion to n input suffix-unique utomton of the stndrd onstrution for n input string. Our presenttion is similr to tht of [6]. The lgorithm mintins two vlues s[q] nd l[q] for eh stte q of S q. s[q] denotes the suffix pointer or filure stte of q. l[q] denotes the length of the longest pth from the initil stte to q in S(A). l is used to determine the so-lled solid edges or trnsitions in the onstrution of the suffix utomton. A trnsition (p,, q) is solid if l[p] + 1 = l[q], tht is it is on longest pth from the initil stte to q, otherwise, it is short-ut trnsition. S is queue storing the set of sttes to e exmined. The prtiulr queue disipline of S does not ffet the orretness of the lgorithm, ut we n ssume it to e FIFO order, whih orresponds to redth-first serh nd dmits of ourse liner-time implementtion. In eh itertion of the loop of lines 3-10 in Figure 4, new stte p is extrted from S. The proessing of the trnsitions (p,, f) with destintion stte f is delyed to lter stge (lines 12-14). This is euse of the prtiulr properties of f whih, s disussed in the previous setion, n e viewed s the hild of different nodes of the tree T, nd thus n dmit different suffix links. Other trnsitions (p,, q) re proessed one t time y reting, if neessry, the destintion stte q nd dding it to Q S, defining l[q] nd lling Suffix-Next(p,, q). 
The subroutine Suffix-Next processes each transition $(p, a, q)$ in a way that is very similar to the standard string suffix automaton construction. The loop of lines 2-4 inspects the iterated suffix pointers of $p$ that do not admit an outgoing transition labeled with $a$. It further creates such transitions reaching $q$ from all the iterated suffix pointers until the initial state or a state $p$ already admitting such a transition is reached. In the former case, the suffix pointer of $q$ is set to be the initial state $I$ and the transition $(I, a, q)$ is created. In the latter case, if the existing transition $(p, a, q')$ is solid and $q' \neq q$, then the suffix pointer of $q$ is simply set to be $q'$ (line 9). Otherwise, if $q' \neq q$, a copy $r$ of the state $q'$ with the same outgoing transitions is created (line 12) and the suffix pointer of $q$ is set to be $r$. The suffix pointer of $r$ is set to be $s[q']$ (line 15), that of $q'$ is set to $r$ (line 16), and $l[r]$ is defined as $l[p] + 1$ (line 17). The transitions labeled with $a$ leaving the iterated suffix pointers of $p$ are inspected and redirected to $r$ so long as they are non-solid transitions (lines 18-20).

The subroutine Suffix-Final sets the finality and the final weight of states in $S(A)$. For any state $p$ that is final in $A$, $p$ and all the states found by following the chain of suffix pointers starting at $p$ are made final in $S(A)$ in the loop of lines 3-5.

Suffix-Next(p, a, q)
 1  l[q] ← max(l[p] + 1, l[q])
 2  while p ≠ I and δ_S(p, a) = undefined do
 3      δ_S(p, a) ← q
 4      p ← s[p]
 5  if δ_S(p, a) = undefined then
 6      δ_S(I, a) ← q
 7      s[q] ← I
 8  elseif l[p] + 1 = l[δ_S(p, a)] and δ_S(p, a) ≠ q then
 9      s[q] ← δ_S(p, a)
10  else r ← q
11      if δ_S(p, a) ≠ q then
12          r ← copy of δ_S(p, a)          ▷ new state with same transitions
13          Q_S ← Q_S ∪ {r}
14          s[q] ← r
15          s[r] ← s[δ_S(p, a)]
16          s[δ_S(p, a)] ← r
17          l[r] ← l[p] + 1
18      while p ≠ undefined and l[δ_S(p, a)] ≥ l[r] do
19          δ_S(p, a) ← r
20          p ← s[p]

Figure 5: Subroutine of Create-Suffix-Automaton processing a transition of A from state p to state q labeled with a.

Suffix-Final(p)
 1  if p ∈ F_S then
 2      p ← s[p]
 3  while p ≠ undefined and p ∉ F_S do
 4      F_S ← F_S ∪ {p}
 5      p ← s[p]

Figure 6: Subroutine of Create-Suffix-Automaton making all states on the suffix chain of p final.

We have implemented and tested the suffix-construction algorithm just described. Figure 7 illustrates the application of the algorithm to a particular suffix-unique automaton. All intermediate stages of the construction of $S(A)$ are indicated, including the information about the suffix pointers $s[q]$ for each state $q$.

In the construction of the so-called suffix oracle [10], no new state is created with respect to the input. The suffix oracle of $A$ can thus be constructed in a similar way simply by replacing line 12 in Figure 5 by $r \leftarrow \delta_S(p, a)$ and removing lines 15-17. This algorithm thus straightforwardly extends the construction of the suffix oracle to the case of suffix-unique input automata.
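For orientation, the following is a minimal Python sketch of the standard on-line construction for a single input string, that is, the special case that the algorithm above generalizes. It is not the generalized algorithm of Figures 4-6. The names `State`, `build_suffix_automaton`, and `is_suffix` are illustrative assumptions. The clone step plays the role of lines 12-20 of Figure 5, and the final loop plays the role of Suffix-Final.

```python
class State:
    """One state of the suffix automaton (illustrative helper class)."""

    def __init__(self, length):
        self.next = {}        # outgoing transitions: symbol -> State
        self.link = None      # suffix pointer s[q]
        self.length = length  # l[q]: length of the longest path from the initial state
        self.final = False


def build_suffix_automaton(text):
    """Standard single-string construction; returns (initial state, list of states)."""
    initial = State(0)
    states = [initial]
    last = initial
    for symbol in text:
        q = State(last.length + 1)
        states.append(q)
        p = last
        # Create transitions to q from the iterated suffix pointers lacking one.
        while p is not None and symbol not in p.next:
            p.next[symbol] = q
            p = p.link
        if p is None:
            q.link = initial
        else:
            target = p.next[symbol]
            if p.length + 1 == target.length:
                # The existing transition is solid: reuse its destination.
                q.link = target
            else:
                # Non-solid transition: copy the destination state.
                r = State(p.length + 1)
                r.next = dict(target.next)
                r.link = target.link
                states.append(r)
                target.link = r
                q.link = r
                # Redirect the transitions that still reach the old destination.
                while p is not None and p.next.get(symbol) is target:
                    p.next[symbol] = r
                    p = p.link
        last = q
    # Make every state on the suffix chain of the last state final.
    p = last
    while p is not None:
        p.final = True
        p = p.link
    return initial, states


def is_suffix(initial, word):
    """True if word labels a path from the initial state to a final state."""
    state = initial
    for symbol in word:
        state = state.next.get(symbol)
        if state is None:
            return False
    return state.final


text = "abcbc"
initial, states = build_suffix_automaton(text)
```

With this input, `is_suffix(initial, "cbc")` holds while `is_suffix(initial, "bcb")` does not, since "bcb" is a factor of the text but not a suffix.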
Figure 7: Construction of the suffix automaton using Create-Suffix-Automaton. (a) Original automaton A. (b)-(h) Intermediate stages of the construction of S(A). For each state (n/s, l), n is the state number, s is the suffix pointer of n, and l is l[n].

For the complexity result that follows, we will assume an efficient representation of the transition function such that an outgoing transition with a specific label can be found in constant time $O(1)$ at any state. Other authors sometimes assume instead an adjacency list representation and a binary search to find a transition at a given state, which costs $O(\min\{\log |\Sigma|, e_{\max}\})$, where $e_{\max}$ is the maximum outdegree [6, 2]. If one adopts that assumption, the complexity results we report, as well as those of Blumer et al. [5, 7], should be multiplied by the factor $\min\{\log |\Sigma|, e_{\max}\}$. We refer to a redirection of a transition that has already been redirected as a multiple redirection.

Proposition 4. Let $A$ be a minimal deterministic suffix-unique automaton. Then, the runtime complexity of algorithm Create-Suffix-Automaton(A, f) is $O(|S(A)|)$.

Proof. We give a brief sketch of the proof. Suffix-Next is called at most once per transition, so the total number of calls of Suffix-Next is $O(|A|)$. Fix a transition $(p, a, q)$ of $A$ with $q \neq f$. The cost of the execution of steps 1-20 by Suffix-Next is proportional to the total number of iterated suffix link traversals in the loops of lines 2-4 and lines 18-20. Each iteration of lines 2-4 results in a new transition being created in $S(A)$, so the total number of loop iterations over all calls of Suffix-Next is $O(|S(A)|)$. The analysis of the total number of redirection iterations of the while loop of lines 18-20 relies on an extension of the analysis for the single-string case [12, 7]. The linear bound on the total number of redirections in the single-string case is applicable to our automaton case for a linear chain of states in $A$. Given the required combination of substrings in $A$ to cause a redirection, it can be shown that the total number of multiple redirections is $O(|A|)$. Thus, the total complexity is $O(|S(A)|)$.

Figure 8: Finite-state transducer $T_0$ mapping each song to its identifier.

5. Factor Automata for Music Identification

We have verified the above insights into factor automata in the context of a music identification system [4, 9]. Music identification is the task of matching an audio stream to a particular song. In our system, we learn an inventory of music phone units similar to phonemes in speech and a unique sequence of music phones characterizing each song. We view the music phone set as our alphabet and the music phone sequences as a set of strings, transforming the task into a factor recognition problem. Our approach is to construct a compact transducer mapping music phone sequences to corresponding song identifiers.

5.1. Factor Transducer Construction

Let $\Sigma$ denote the set of music phones and let the set of music phone sequences describing $m$ songs be $U = \{x_1, \ldots, x_m\}$, $x_i \in \Sigma^*$ for $i \in \{1, \ldots, m\}$. In our experiments, $m = 15{,}455$, $|\Sigma| = 1{,}024$, and the average length of a transcription $x_i$ is more than 1,700. Thus, in the worst case, there can be as many as $15{,}455 \times 1{,}700^2 \approx 45 \times 10^9$ factors. The size of a naive prefix-tree-based representation would thus be prohibitive. Hence, we represent the set of factors with a much more compact factor automaton. We construct a deterministic and minimal automaton representing the sequences in $U$ and subsequently a deterministic and minimal finite-state transducer mapping each song to its identifier using transducer determinization and minimization algorithms [13, 14]. Let $T_0$ be the unoptimized transducer mapping phone sequences to song identifiers. Figure 8 shows $T_0$ when $U$ is reduced to three short songs. Let $A$ be the acceptor obtained by omitting the output labels of $T_0$. The compact factor automaton $F(A)$ (Figure 9(a)) is constructed as described in Section 3: by creating $\epsilon$-transitions from the initial state of $A$ to all other states, making all states final, and applying $\epsilon$-removal, determinization, and minimization.
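The construction just described can be sketched on a toy input as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the input automaton is taken to be a plain prefix tree rather than a minimal automaton, the added ε-transitions and their removal are folded into the choice of the starting subset (all states), and the final minimization step is omitted. The function names `build_trie`, `factor_dfa`, and `accepts` are hypothetical.

```python
def build_trie(strings):
    """Prefix tree as a list of dicts: delta[state][symbol] -> state; state 0 is initial."""
    delta = [{}]
    for s in strings:
        q = 0
        for a in s:
            if a not in delta[q]:
                delta.append({})
                delta[q][a] = len(delta) - 1
            q = delta[q][a]
    return delta


def factor_dfa(delta):
    """Subset construction started from the set of all states.

    Adding epsilon-transitions from the initial state to every state makes the
    epsilon-closure of the initial state equal to the full state set.
    """
    start = frozenset(range(len(delta)))
    table = {}
    stack = [start]
    while stack:
        subset = stack.pop()
        if subset in table:
            continue
        moves = {}
        for q in subset:
            for a, r in delta[q].items():
                moves.setdefault(a, set()).add(r)
        table[subset] = {a: frozenset(t) for a, t in moves.items()}
        stack.extend(table[subset].values())
    return start, table


def accepts(start, table, word):
    """All states are final, so a word is accepted iff it labels a path."""
    state = start
    for a in word:
        state = table[state].get(a)
        if state is None:
            return False
    return True


U = ["abc", "bd"]
start, table = factor_dfa(build_trie(U))
```

Here `accepts(start, table, "bc")` and `accepts(start, table, "bd")` hold, while "abd" is rejected because it is not a factor of any string in `U`.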
Note that $F(A)$ does not output the song identifier associated with each factor. For the purposes of the following description, we briefly review some properties of weighted automata. A weighted automaton is defined over a semiring $(\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})$, which specifies the weight set used and the algebraic operations for combining weights along a path and between paths. The tropical semiring $(\mathbb{R} \cup \{-\infty, +\infty\}, \min, +, +\infty, 0)$ is one used extensively in fields such as speech and text processing. In the tropical semiring, the total weight assigned by the automaton to a string $s$ is the weight of the minimum-weight path in the automaton with the label $s$, where the total weight along a given path is found by adding the weights of the transitions composing the path.
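As a small illustration of the tropical semiring, the sketch below computes the weight assigned to a string by a toy weighted automaton: weights are added along a path and the minimum is taken over paths with the same label. The transition encoding `(source, label, weight, destination)` and the function name `tropical_weight` are assumptions made for this example only.

```python
import math


def tropical_weight(transitions, initial, final_weights, word):
    """Minimum over accepting paths labeled `word` of the sum of their weights."""
    best = {initial: 0.0}  # 0 is the identity of the path operation (+)
    for symbol in word:
        reached = {}
        for src, label, weight, dst in transitions:
            if label == symbol and src in best:
                candidate = best[src] + weight            # combine along a path
                if candidate < reached.get(dst, math.inf):
                    reached[dst] = candidate              # combine between paths (min)
        best = reached
    # +infinity is the identity of min: it is returned when no accepting path exists.
    return min(
        (w + final_weights[q] for q, w in best.items() if q in final_weights),
        default=math.inf,
    )


# Two paths labeled "ab": one of total weight 1 + 2, one of total weight 0 + 5.
transitions = [
    (0, "a", 1.0, 1),
    (1, "b", 2.0, 3),
    (0, "a", 0.0, 2),
    (2, "b", 5.0, 3),
]
final_weights = {3: 0.0}
weight_ab = tropical_weight(transitions, 0, final_weights, "ab")  # 3.0
weight_b = tropical_weight(transitions, 0, final_weights, "b")    # math.inf
```

In a deterministic automaton there is at most one path per label, so the minimum is trivial and the weight of a string is simply the sum of the weights along its path.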

Figure 9: (a) Deterministic and minimal unweighted factor acceptor $F(A)$ for two songs. (b) Deterministic and minimal weighted factor acceptor $F_w(A)$ for two songs.

To construct a factor automaton that preserves the song identifiers, we create a compact weighted acceptor over the tropical semiring accepting the factors of $U$ that associates the total weight $s_x$ to each factor $x$. A crucial advantage of this representation is the use of weighted determinization and minimization [13], during which the song identifier is treated as a weight possibly distributed along a path. These operations preserve the property that the total weight along the path labeled with $x$ is $s_x$. Let $F_w(A)$ be constructed analogously to $F(A)$, but with each added $\epsilon$-transition weighted with the corresponding song identifier. The weighted acceptor $F_w(A)$, after determinization and minimization over the tropical semiring, is transformed into a song recognition transducer $T$ by treating each output weight integer as an output symbol. Given a music phone sequence as input, the associated song identifier is obtained by summing the outputs yielded by $T$.

5.2. Automata Size

Figure 9(b) shows the weighted automaton $F_w(A)$ corresponding to the unweighted automaton $F(A)$ of Figure 9(a). Note that, in this example, $F_w(A)$ is no larger than $F(A)$. Remarkably, even in the case of 15,455 songs, the total number of transitions of $F_w(A)$ was 53.0M, only about 0.004% more than $F(A)$. We also have $|F(A)|_E \approx 2.1\,|A|_E$. As is illustrated in Figure 10(a), this multiplicative relationship is maintained as the song set size is varied between 1 and 15,455. Furthermore, for the case of 15,455 songs, $U$ is 45-suffix-unique. Figure 10(b) demonstrates that the number of suffix collisions drops rapidly as the suffix size is increased. We also have $|F_w(A)|_Q \approx 28.8\text{M} \approx 1.2\,|A|_Q$, meaning the bound of Corollary 3 is verified in this empirical context.

6. Conclusion

We presented a novel analysis of the size of the suffix automaton and factor automaton of a set of strings represented by an automaton in terms of the size of the original automaton. Our analysis shows that suffix automata and factor automata can be practical for constructing an index of a large number of strings.

Figure 10: (a) Comparison of automaton sizes for different numbers of songs (x-axis: # Songs; y-axis: Size). "# States/Arcs Non-factor" is the size of the automaton $A$ accepting the entire song transcriptions. "# States factor" and "# Arcs factor" are the numbers of states and transitions in the weighted factor acceptor $F_w(A)$, respectively. (b) Number of strings in $U$ for which the suffix of length $k$ is also a suffix of another string in $U$ (x-axis: $k$, the suffix length; y-axis: Non-unique songs).

Additionally, our application to a large-scale music identification task further demonstrates this fact. Factor automata of automata are likely to form a useful and compact index for very large-scale tasks. We further gave an algorithm for constructing the suffix automaton or factor automaton of a set of strings in time linear in the size of a prefix tree representing them. Our algorithm applies to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm and analysis raise the natural question of an efficient construction of the suffix automaton of an arbitrary input automaton.

Acknowledgments

We thank Cyril Allauzen for several discussions about the material presented. The research of Mehryar Mohri and Eugene Weinstein was partially supported by the New York State Office of Science Technology and Academic Research (NYSTAR). This project was also sponsored in part by the Department of the Army Award Number W81XWH-04-1-0307. The U.S. Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick MD 21702-5014 is the awarding and administering acquisition office. The content of this material does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

References

[1] D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, Cambridge, UK, 1997.
[2] M. Crochemore, W. Rytter, Jewels of Stringology, World Scientific, 2002.
[3] C. Allauzen, M. Mohri, M. Saraclar, General Indexation of Weighted Automata - Application to Spoken Utterance Retrieval, in: Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval (HLT/NAACL), Boston, Massachusetts, 2004, pp. 33-40.
[4] E. Weinstein, P. Moreno, Music Identification with Weighted Finite-State Transducers, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, 2007, pp. 689-692.
[5] A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, J. I. Seiferas, The smallest automaton recognizing the subwords of a text, Theoretical Computer Science 40 (1985) 31-55.
[6] M. Crochemore, Transducers and repetitions, Theoretical Computer Science 45 (1986) 63-86.
[7] A. Blumer, J. Blumer, D. Haussler, R. M. McConnell, A. Ehrenfeucht, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM 34 (1987) 578-589.
[8] S. Inenaga, H. Hoshino, A. Shinohara, M. Takeda, S. Arikawa, G. Mauri, G. Pavesi, On-line construction of compact directed acyclic word graphs, Discrete Applied Mathematics 146 (2) (2005) 156-179.
[9] M. Mohri, P. Moreno, E. Weinstein, Robust music identification, detection, and analysis, in: Proceedings of the International Conference on Music Information Retrieval (ISMIR), Vienna, Austria, 2007, pp. 135-139.
[10] C. Allauzen, M. Crochemore, M. Raffinot, Efficient experimental string matching by weak factor recognition, in: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM), Springer-Verlag, London, UK, 2001, pp. 51-72.
[11] D. Revuz, Minimisation of acyclic deterministic automata in linear time, Theoretical Computer Science 92 (1992) 181-189.
[12] J. A. Blumer, Algorithms for the directed acyclic word graph and related structures, Ph.D. thesis, Denver University (1985).
[13] M. Mohri, Finite-state transducers in language and speech processing, Computational Linguistics 23 (2) (1997) 269-311.
[14] M. Mohri, Statistical Natural Language Processing, in: M. Lothaire (Ed.), Applied Combinatorics on Words, Cambridge University Press, 2005.