Parameterized Suffix Arrays for Binary Strings

Satoshi Deguchi, Fumihito Higashijima, Hideo Bannai, Shunsuke Inenaga and Masayuki Takeda
Kyushu University, Japan

Contents
  Background
    Parameterized matching [Baker 93]
  Linear time algorithms for constructing
    parameterized suffix arrays
    parameterized LCP arrays
  for binary strings
  Computational Experiments
  Summary and Open Problems

Background
Sort program by student A:
    void sort(int *a, int n) {
        int p, q, t;
        for (p = 0; p < n; p++) {
            for (q = n-1; q > p; q--) {
                if (a[q] < a[q-1]) {
                    t = a[q]; a[q] = a[q-1]; a[q-1] = t;
                }
            }
        }
    }
Sort program by student B:
    void sort(int *p, int n) {
        int x, y, t;
        for (x = 0; x < n; x++) {
            for (y = n-1; y > x; y--) {
                if (p[y] < p[y-1]) {
                    t = p[y]; p[y] = p[y-1]; p[y-1] = t;
                }
            }
        }
    }

Parameterized pattern matching
Parameterized string [Baker 93]: a string over the parameter alphabet Π and the constant alphabet Σ (Π ∩ Σ = ∅). We will only consider the case Σ = ∅ in this talk.
Parameterized match: two parameterized strings x, y parameterized match if there exists a bijection f on Π ∪ Σ where f(c) = c for c ∈ Σ, and f(x) = f(x_1 ⋯ x_n) = f(x_1) ⋯ f(x_n) = y.
Example: the two lines
    if (a[q] < a[q-1]) { t = a[q]; a[q] = a[q-1]; a[q-1] = t; }
    if (p[y] < p[y-1]) { t = p[y]; p[y] = p[y-1]; p[y-1] = t; }
parameterized match under q → y, a → p.

Parameterized pattern matching [Baker 93]
Parameterized matching problem: given parameterized strings p and t, find all positions i where p and t_i ⋯ t_{i+|p|−1} parameterized match.
Example: t = xxyzxyyxzyx, p = xyzx
    position:  1  2  3  4  5  6  7  8  9 10 11
    t:         x  x  y  z  x  y  y  x  z  y  x
Strings that parameterized match p = xyzx:
    xyzx
    xzyx  (y → z, z → y)
    yxzy  (x → y, y → x)
    yzxy  (x → y, y → z, z → x)
    zxyz  (x → z, y → x, z → y)
    zyxz  (x → z, z → x)
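The definition can be checked directly by trying to build the bijection symbol by symbol. The following is a minimal Python sketch, assuming Σ = ∅ as in the talk. The names `p_match` and `p_match_positions` are illustrative and do not come from the slides, and the scan is the naive O(|t|·|p|) one, not an index-based method.

```python
def p_match(x, y):
    """True if some bijection on the parameter alphabet maps x onto y (case Σ = ∅)."""
    if len(x) != len(y):
        return False
    fwd, bwd = {}, {}  # partial bijection f and its inverse
    for a, b in zip(x, y):
        # Each symbol must keep a single image and a single preimage.
        if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False
    return True


def p_match_positions(t, p):
    """1-based positions i where t[i..i+|p|-1] parameterized matches p."""
    m = len(p)
    return [i + 1 for i in range(len(t) - m + 1) if p_match(p, t[i:i + m])]


print(p_match_positions("xxyzxyyxzyx", "xyzx"))  # [2, 3, 7, 8]
```

On the slide's example the matching windows are xyzx, yzxy, yxzy and xzyx, at positions 2, 3, 7 and 8.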

Parameterized pattern matching [Baker 93]
pv(s)[i] = 0       if s[i] ≠ s[j] for any 1 ≤ j < i
pv(s)[i] = i − k   if k = max{ j | s[i] = s[j], 1 ≤ j < i }
Example: pv(abaabaaaabba) = 002132111513
Proposition [Baker 93]: two parameterized strings s, t parameterized match ⇔ pv(s) = pv(t).
Example: t = xxyzxyyxzyx, p = xyzx: pv(yzxy) = 0003, pv(xyzx) = 0003
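The pv encoding is easy to compute in one left-to-right pass by remembering the last position of each symbol. The sketch below is an illustrative Python version (the dictionary-based implementation is an assumption, not the authors' code). It reproduces the slide's values for the 12-letter binary string abaabaaaabba, whose encoding is 002132111513, and for yzxy and xyzx.

```python
def pv(s):
    """Prev encoding: 0 for a first occurrence, else distance to the previous occurrence."""
    last = {}  # symbol -> index of its most recent occurrence
    out = []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out


print("".join(map(str, pv("abaabaaaabba"))))  # 002132111513
print(pv("yzxy") == pv("xyzx"))               # True: the two strings parameterized match
```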

Parameterized suffix tree (p-suffix tree) [Baker 93]
Construction: O(|t|) time* for a parameterized string t.
Matching: O(|p| + #occ) time*.
* Assuming constant size alphabet
Example, t = ababb: pv() of each suffix
    1. ababb  00221
    2. babb   0021
    3. abb    001
    4. bb     01
    5. b      0
[Figure: p-suffix tree of ababb, edges labeled with pv symbols, leaves labeled with suffix start positions]

Standard suffix trees and arrays
Example string: ababb$ ($ is an end marker that sorts before a and b)
suffix tree: linear time [Weiner 73]
suffix array SA[]: linear time [Ko & Aluru 03, Kärkkäinen & Sanders 03, Kim et al. 03], or linear time from the suffix tree (simple traversal)
LCP array LCP[]: linear time [Kasai et al. 01]
    SA[]  =  6  1  3  5  2  4
    LCP[] = -1  0  2  0  1  1
Suffix arrays are reported to be superior in terms of speed/memory usage for many applications.
[Figure: suffix tree of the example string]

P-suffix trees and arrays
Example string: ababb
p-suffix tree: linear time [Baker 93]
P-suffix array PSA[]: linear time for binary strings [this work], or linear time from the p-suffix tree (simple traversal)
P-LCP array PLCP[]: linear time for binary strings [this work]
    i   PSA[i]   pv(suffix)   PLCP[i]
    1   5        0            -1
    2   3        001           1
    3   2        0021          2
    4   1        00221         3
    5   4        01            1
[Figure: p-suffix tree of ababb]
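For a small string the two arrays can be obtained by brute force, which is useful for checking any faster method. The following Python sketch sorts the pv encodings of all suffixes. It is a quadratic-time illustration with hypothetical helper names, not the linear-time construction of the talk. For ababb it gives PSA = 5 3 2 1 4 and PLCP = -1 1 2 3 1.

```python
def pv(s):
    """Prev encoding: 0 for a first occurrence, else distance to the previous occurrence."""
    last, out = {}, []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out


def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k


def psa_plcp_naive(s):
    """P-suffix array (1-based start positions) and P-LCP array, by sorting encodings."""
    n = len(s)
    enc = {i: pv(s[i - 1:]) for i in range(1, n + 1)}  # pv of every suffix
    psa = sorted(enc, key=enc.get)                      # lexicographic order of encodings
    plcp = [-1] + [lcp(enc[psa[r - 1]], enc[psa[r]]) for r in range(1, n)]
    return psa, plcp


print(psa_plcp_naive("ababb"))  # ([5, 3, 2, 1, 4], [-1, 1, 2, 3, 1])
```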

This work
We consider the p-suffix array and the p-LCP array, and show linear time algorithms for direct construction of the p-suffix array and its p-LCP array for binary strings.

Difficulties of pv
A pv of a suffix is not necessarily a suffix of a pv.
    s = abaabaaaabba            pv(abaabaaaabba) = 002132111513
    suffix baabaaaabba          pv(baabaaaabba)  =  00132111513
    00132111513 is NOT a suffix of 002132111513 (that suffix is 02132111513).

Difficulties of pv
Lexicographic order between suffixes of two strings is not necessarily preserved, even when they share a common prefix.
    pv(baabaaaabba) = 00132111513  <  pv(abaabaaaabba) = 002132111513
but after removing the first character of each string:
    pv(aabaaaabba)  = 0102111513   >  pv(baabaaaabba)  = 00132111513
Linear time algorithms for direct construction of standard suffix arrays won't work on the pv array.
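Both difficulties can be observed on the 12-letter binary string abaabaaaabba, whose pv is 002132111513. The short Python check below is a sketch (the `pv` helper is an illustrative implementation, not the authors' code). It relies on Python comparing lists lexicographically.

```python
def pv(s):
    """Prev encoding: 0 for a first occurrence, else distance to the previous occurrence."""
    last, out = {}, []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out


s = "abaabaaaabba"

# Difficulty 1: pv of the suffix starting at position 2 differs from pv(s) minus its first entry.
print(pv(s[1:]) == pv(s)[1:])   # False

# Difficulty 2: the order of two encodings flips after dropping the first character.
print(pv(s[1:]) < pv(s))        # True:  00132111513 < 002132111513
print(pv(s[2:]) < pv(s[1:]))    # False: 0102111513 > 00132111513
```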

Direct construction of p-suffix array for binary strings
Observation: Monotonically decrease?

fw array
fw(s)[i] = ∞       if s[i] ≠ s[j] for any i < j ≤ |s|
fw(s)[i] = k − i   if k = min{ j | s[i] = s[j], i < j ≤ |s| }
    s = abaabaaaabba            fw(abaabaaaabba) = 2 3 1 2 5 1 1 1 3 1 ∞ ∞
    suffix baabaaaabba          fw(baabaaaabba)  =   3 1 2 5 1 1 1 3 1 ∞ ∞
A fw of a suffix is always a suffix of a fw!
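The fw encoding mirrors pv but looks forward, so it can be filled in with one right-to-left pass. The Python sketch below is illustrative: `math.inf` stands in for ∞ and the implementation is an assumption, not taken from the slides. It also checks the suffix property on the 12-letter string abaabaaaabba.

```python
import math


def fw(s):
    """Forward encoding: distance to the next occurrence of s[i], or inf if there is none."""
    nxt = {}                   # symbol -> index of its nearest occurrence to the right
    out = [math.inf] * len(s)
    for i in range(len(s) - 1, -1, -1):
        if s[i] in nxt:
            out[i] = nxt[s[i]] - i
        nxt[s[i]] = i
    return out


s = "abaabaaaabba"
print(fw(s))  # [2, 3, 1, 2, 5, 1, 1, 1, 3, 1, inf, inf]

# fw of every suffix is the corresponding suffix of fw(s).
print(all(fw(s[i:]) == fw(s)[i:] for i in range(len(s))))  # True
```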

Direct construction of p-suffix array for binary strings
Lemma: For any binary string s, the p-suffix array of s is equivalent to the standard suffix array of fw(s) (with the reverse order on integers).
Theorem: The p-suffix array for a binary string of length n can be constructed directly in O(n) time.
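The lemma can be tried out on small inputs by sorting suffixes both ways. The Python sketch below is only a check of the statement, under two assumptions: sorting is done naively (so it is not the O(n) construction), and "reverse order on integers" is realized by negating every fw value, which makes ∞ the smallest symbol.

```python
import math


def pv(s):
    """Prev encoding: 0 for a first occurrence, else distance to the previous occurrence."""
    last, out = {}, []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out


def fw(s):
    """Forward encoding: distance to the next occurrence of s[i], or inf if there is none."""
    nxt, out = {}, [math.inf] * len(s)
    for i in range(len(s) - 1, -1, -1):
        if s[i] in nxt:
            out[i] = nxt[s[i]] - i
        nxt[s[i]] = i
    return out


def psa_by_pv(s):
    """P-suffix array (1-based) by sorting the pv encoding of every suffix."""
    return sorted(range(1, len(s) + 1), key=lambda i: pv(s[i - 1:]))


def psa_by_fw(s):
    """Suffix array of fw(s) (1-based), comparing integers in reverse order."""
    f = fw(s)
    return sorted(range(1, len(s) + 1), key=lambda i: [-x for x in f[i - 1:]])


s = "abaabaaaabba"
print(psa_by_pv(s))                  # [12, 11, 5, 9, 2, 4, 1, 10, 8, 3, 7, 6]
print(psa_by_fw(s) == psa_by_pv(s))  # True
```

Because every suffix of fw(s) is itself the fw of a suffix, any linear-time suffix array algorithm for integer alphabets can in principle be run on fw(s), which is what makes the direct construction possible.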

Linear time LCP array construction
Standard suffix arrays: [Kasai et al. 2001]
Example, s = ababbaab$:
    i   SA[i]   s[SA[i]:n]   LCP_s[i]
    1   9       $            -1
    2   6       aab$          0
    3   7       ab$           1
    4   1       ababbaab$     2
    5   3       abbaab$       2
    6   8       b$            0
    7   5       baab$         1
    8   2       babbaab$      2
    9   4       bbaab$        1
    LCP[5] ≥ 2 − 1, LCP[8] ≥ 2 − 1
Depends on preservation of lexicographic order of suffixes with a common prefix: won't work for parameterized strings.
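For reference, the algorithm of Kasai et al. visits suffixes in text order and reuses the previous LCP value minus one as a starting point. Below is a compact Python sketch of that idea. The suffix array is built by naive sorting to keep the example short, and the end marker `$` is assumed to sort before a and b (true for ASCII). Indices inside the code are 0-based and converted to 1-based only for printing.

```python
def suffix_array(s):
    """Naive suffix array (0-based start positions); not linear time."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def kasai_lcp(s, sa):
    """LCP array in O(n): lcp[r] = LCP of the suffixes at ranks r-1 and r, lcp[0] = -1."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [-1] * n
    h = 0
    for i in range(n):                 # suffixes in text order
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]            # predecessor of suffix i in sorted order
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1                     # the next suffix shares at least h - 1 symbols
    return lcp


s = "ababbaab$"
sa = suffix_array(s)
print([i + 1 for i in sa])   # [9, 6, 7, 1, 3, 8, 5, 2, 4]
print(kasai_lcp(s, sa))      # [-1, 0, 1, 2, 2, 0, 1, 2, 1]
```

The `h -= 1` step is exactly the place where the algorithm relies on order preservation between suffixes with a common prefix.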

Lexicographic order is not preserved even when PLCP > 0. Standard LCP algorithm won't work.
Example, s = abaabaaaabba: suffixes 5 and 9 have ranks 3 < 4, but suffixes 6 and 10 have ranks 12 > 8.
PLCP: PLCP_s[4] = 3 for suffix 9, but PLCP_s[8] = 1 for suffix 10, and 3 − 1 ≰ 1.
    i    PSA[i]   s[PSA[i]:n]    pv(s[PSA[i]:n])   PLCP_s[i]
    1    12       a              0                 -1
    2    11       ba             00                 1
    3    5        baaaabba       00111513           2
    4    9        abba           0013               3
    5    2        baabaaaabba    00132111513        4
    6    4        abaaaabba      002111513          2
    7    1        abaabaaaabba   002132111513       4
    8    10       bba            010                1
    9    8        aabba          01013              3
    10   3        aabaaaabba     0102111513         3
    11   7        aaabba         011013             2
    12   6        aaaabba        0111013            3

Linear time parameterized LCP array construction for Binary Strings
The LCP array of fw can be used to compute the PLCP array of pv.
Lemma: For any binary string s,
    PLCP_s[i] = min{ LCP_fw(s)[i] + fw(s)[PSA[i] + l], |s| − PSA[i−1] + 1 },  where l = LCP_fw(s)[i]
    first term: LCP of fw + first disagreeing fw element
    second term: length of the PSA[i−1]-th suffix

PLCP_s[i] = min{ LCP_fw(s)[i] + fw(s)[PSA[i] + l], |s| − PSA[i−1] + 1 }
    i    PSA[i]   s[PSA[i]:n]    PLCP_s[i]   fw(s[PSA[i]:n])            LCP_fw(s)[i]
    1    12       a              -1          ∞                          -1
    2    11       ba              1          ∞ ∞                         1
    3    5        baaaabba        2          5 1 1 1 3 1 ∞ ∞             0
    4    9        abba            3          3 1 ∞ ∞                     0
    5    2        baabaaaabba     4          3 1 2 5 1 1 1 3 1 ∞ ∞       2
    6    4        abaaaabba       2          2 5 1 1 1 3 1 ∞ ∞           0
    7    1        abaabaaaabba    4          2 3 1 2 5 1 1 1 3 1 ∞ ∞     1
    8    10       bba             1          1 ∞ ∞                       0
    9    8        aabba           3          1 3 1 ∞ ∞                   1
    10   3        aabaaaabba      3          1 2 5 1 1 1 3 1 ∞ ∞         1
    11   7        aaabba          2          1 1 3 1 ∞ ∞                 1
    12   6        aaaabba         3          1 1 1 3 1 ∞ ∞               2
First term, LCP of fw + first disagreeing fw element. Example, i = 9: LCP of fw = 1, first disagreeing fw element = 3, giving 1 + 3.

PLCP_s[i] = min{ LCP_fw(s)[i] + fw(s)[PSA[i] + l], |s| − PSA[i−1] + 1 }
Second term, length of the PSA[i−1]-th suffix. Example, i = 9: PSA[i−1] = 10, so the length is 12 − 10 + 1 = 3. With the first term 1 + 3 (LCP of fw = 1, first disagreeing fw element = 3), PLCP_s[9] = min{ 1 + 3, 3 } = 3.

Examples for s = abaabaaaabba:
    i = 3 (PSA[i] = 5, suffix baaaabba):      PLCP_s[3] = min{ 0 + 5, 2 } = 2
    i = 5 (PSA[i] = 2, suffix baabaaaabba):   PLCP_s[5] = min{ 2 + 2, 4 } = 4
    i = 7 (PSA[i] = 1, suffix abaabaaaabba):  PLCP_s[7] = min{ 1 + 3, 9 } = 4
Theorem: The PLCP array for a binary string of length n can be constructed in O(n) time, given its p-suffix array.
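The PLCP formula can be evaluated row by row once fw(s) and the p-suffix array are known. The Python sketch below is an illustration under stated assumptions: the p-suffix array is obtained by naively sorting fw suffixes in reverse integer order, and LCP values of fw are computed by direct comparison, so the code shows the formula but does not run in O(n). `math.inf` stands in for ∞ and all helper names are hypothetical.

```python
import math


def fw(s):
    """Forward encoding: distance to the next occurrence of s[i], or inf if there is none."""
    nxt, out = {}, [math.inf] * len(s)
    for i in range(len(s) - 1, -1, -1):
        if s[i] in nxt:
            out[i] = nxt[s[i]] - i
        nxt[s[i]] = i
    return out


def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k


def plcp_via_fw(s):
    """PSA (1-based) and PLCP of a binary string, using the min{...} formula."""
    n, f = len(s), fw(s)
    psa = sorted(range(1, n + 1), key=lambda i: [-x for x in f[i - 1:]])
    plcp = [-1]
    for r in range(1, n):
        cur, prev = psa[r], psa[r - 1]
        l = lcp(f[cur - 1:], f[prev - 1:])   # LCP of the two fw suffixes
        first = l + f[cur - 1 + l]           # fw(s)[PSA[i] + l] in 1-based notation
        second = n - prev + 1                # length of the PSA[i-1]-th suffix
        plcp.append(min(first, second))
    return psa, plcp


print(plcp_via_fw("abaabaaaabba")[1])  # [-1, 1, 2, 3, 4, 2, 4, 1, 3, 3, 2, 3]
```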

Computational Experiments
We compared the speed of parameterized pattern matching on random binary strings by:
    p-suffix array of t, matched with pv(p)
    standard suffix array of t, matched with p and p'
    2 standard suffix arrays, of t and t', each matched with p
    (t', p': bitwise complement)
In general, for alphabet size k, we can
    construct k! standard suffix arrays, one for each permutation of the parameterized alphabet, or
    construct k! different patterns, one for each permutation of the parameterized alphabet, and do standard matching k! times.
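The "match p and its complement" baseline is simple to state for the binary alphabet {a, b}, because the only bijections are the identity and the swap. The Python sketch below illustrates that baseline with a plain scan instead of a suffix array (an assumption made to keep the example short). The function names and the test pattern aab are illustrative, not taken from the experiments.

```python
def find_all(t, p):
    """1-based start positions of all (possibly overlapping) exact occurrences of p in t."""
    return [i + 1 for i in range(len(t) - len(p) + 1) if t.startswith(p, i)]


def complement(p):
    """Swap a and b (the bitwise complement of a binary string)."""
    return p.translate(str.maketrans("ab", "ba"))


def p_match_binary(t, p):
    """Parameterized occurrences of p in t: exact occurrences of p or of its complement."""
    return sorted(set(find_all(t, p)) | set(find_all(t, complement(p))))


print(p_match_binary("abaabaaaabba", "aab"))  # [3, 8, 10]
```

For alphabet size k the same idea needs k! relabeled patterns, which is the cost the p-suffix array avoids.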

[Figure: text length = 100. Running time against pattern length; labeled curve: p-suffix array]

[Figure: text length = 1000. Running time against pattern length; labeled curve: p-suffix array]

Summary
Showed linear time algorithms to construct p-suffix arrays and p-LCP arrays directly from binary strings (does not construct the p-suffix tree).
Showed efficiency of p-suffix arrays through computational experiments: up to 200% faster than using suffix arrays twice.

Open Problems
Direct construction of the p-suffix array for alphabet size > 2.
Reverse problem: infer the parameterized string whose p-suffix array is a given permutation.