Longest Common Prefixes


The standard ordering for strings is the lexicographical order. It is induced by an order over the alphabet. We will use the same symbols (≤, <, ≥, >, etc.) for both the alphabet order and the induced lexicographical order.

We can define the lexicographical order using the concept of the longest common prefix.

Definition 1.4: The length of the longest common prefix of two strings A[0..m) and B[0..n), denoted by lcp(A, B), is the largest integer l ≤ min{m, n} such that A[0..l) = B[0..l).

Definition 1.5: Let A and B be two strings over an alphabet with a total order ≤, and let l = lcp(A, B). Then A is lexicographically smaller than or equal to B, denoted by A ≤ B, if and only if
1. either |A| = l
2. or |A| > l, |B| > l and A[l] < B[l].
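The following is a minimal Python sketch of Definitions 1.4 and 1.5 (not part of the original notes; the function names lcp and leq are illustrative only):

def lcp(A, B):
    """Length of the longest common prefix of strings A and B (Definition 1.4)."""
    l = 0
    while l < min(len(A), len(B)) and A[l] == B[l]:
        l += 1
    return l

def leq(A, B):
    """A <= B in the lexicographical order induced by the symbol order (Definition 1.5)."""
    l = lcp(A, B)
    # Case 1: A is a prefix of B (possibly A = B).
    if len(A) == l:
        return True
    # Case 2: both strings continue past the common prefix; compare the first differing symbols.
    return len(B) > l and A[l] < B[l]

assert lcp("potato", "pottery") == 3
assert leq("pot", "potato") and not leq("pottery", "potato")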

The concept of longest common prefixes can be generalized for sets:

Definition 1.6: For a string S and a string set R, define

  lcp(R)    = max{l | A[0..l) = B[0..l) for all A, B ∈ R}
  lcp(S, R) = max{lcp(S, T) | T ∈ R}
  Σlcp(R)   = ∑_{T ∈ R} lcp(T, R \ {T})

The concept of distinguishing prefix is closely related and often used in place of the longest common prefix for sets. The distinguishing prefix of a string is the shortest prefix that separates it from the other strings in the set. For a prefix free set R, the sum of the lengths of the distinguishing prefixes is Σdp(R) = Σlcp(R) + |R|. For a non-prefix free set, the distinguishing prefixes are not always fully defined.

However, even more interesting is a third measure of longest common prefixes in a set, defined next. It is slightly different from both Σlcp(R) and Σdp(R).
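As a quick illustration of Definition 1.6, here is a small Python sketch (not from the original notes) that computes lcp(R), lcp(S, R) and Σlcp(R) directly from the definitions, reusing the pairwise lcp function from the sketch above:

def lcp_of_set(R):
    """lcp(R): length of the prefix shared by all strings in R."""
    R = list(R)
    return min((lcp(R[0], T) for T in R[1:]), default=len(R[0]))

def lcp_str_set(S, R):
    """lcp(S, R): longest common prefix of S with any string in R."""
    return max((lcp(S, T) for T in R), default=0)

def sigma_lcp(R):
    """Sigma-lcp(R): sum over T in R of lcp(T, R without T)."""
    return sum(lcp_str_set(R[i], R[:i] + R[i+1:]) for i in range(len(R)))

R = ["pot$", "potato$", "pottery$", "tattoo$", "tempo$"]
print(sigma_lcp(R))  # 11, matching Example 1.8 below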

Definition 1.7: Let R = {S_1, S_2, ..., S_n} be a set of strings and assume S_1 < S_2 < ... < S_n. Then the LCP array LCP_R[1..n] is defined by

  LCP_R[i] = lcp(S_i, {S_1, ..., S_{i−1}}).

Furthermore, the LCP array sum is

  ΣLCP(R) = ∑_{i ∈ [1..n]} LCP_R[i].

Example 1.8: Let R = {pot$, potato$, pottery$, tattoo$, tempo$}. Then Σlcp(R) = 11, Σdp(R) = 16, ΣLCP(R) = 7 and the LCP array is:

  LCP_R[i]   S_i
     0       pot$
     3       potato$
     3       pottery$
     0       tattoo$
     1       tempo$
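A direct Python transcription of Definition 1.7 (illustrative, not from the notes), again reusing the pairwise lcp sketch from above; it assumes the input list is already in sorted order:

def lcp_array(R_sorted):
    """LCP array of Definition 1.7 (1-based in the text, 0-based here)."""
    return [
        max((lcp(S, T) for T in R_sorted[:i]), default=0)  # lcp(S_i, {S_1, ..., S_{i-1}})
        for i, S in enumerate(R_sorted)
    ]

R = sorted(["pot$", "potato$", "pottery$", "tattoo$", "tempo$"])
LCP = lcp_array(R)
print(LCP, sum(LCP))  # [0, 3, 3, 0, 1] and 7, as in Example 1.8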

Theorem 1.9: The number of nodes in trie(R) is exactly ||R|| − ΣLCP(R) + 1, where ||R|| is the total length of the strings in R.

Proof. Consider the construction of trie(R) by inserting the strings one by one in the lexicographical order using Algorithm 1.2. Initially, the trie has just one node, the root. When inserting a string S_i, the algorithm executes exactly |S_i| rounds of the two while loops, because each round moves one step forward in S_i. The first loop follows existing edges as long as possible, and thus the number of rounds is LCP_R[i] = lcp(S_i, {S_1, ..., S_{i−1}}). This leaves |S_i| − LCP_R[i] rounds for the second loop, each of which adds one new node to the trie. Thus the total number of nodes in the trie at the end is

  1 + ∑_{i ∈ [1..n]} (|S_i| − LCP_R[i]) = ||R|| − ΣLCP(R) + 1.  □

The proof reveals a close connection between LCP_R and the structure of the trie. We will later see that LCP_R is useful as an actual data structure in its own right.
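As a small sanity check of Theorem 1.9 (illustrative Python, using a plain dict-of-dicts trie rather than the notes' Algorithm 1.2):

def trie_node_count(R):
    """Insert every string into a dict-of-dicts trie and count the nodes, including the root."""
    root, nodes = {}, 1
    for S in R:
        node = root
        for c in S:
            if c not in node:
                node[c] = {}
                nodes += 1
            node = node[c]
    return nodes

R = ["pot$", "potato$", "pottery$", "tattoo$", "tempo$"]
total_length = sum(len(S) for S in R)             # ||R|| = 32
print(trie_node_count(R), total_length - 7 + 1)   # both are 26; ΣLCP(R) = 7 from Example 1.8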

The LCP array LCP_R and its sum have other interesting properties:

• ΣLCP(R) ≤ Σlcp(R) ≤ 2 · ΣLCP(R).
• For i ∈ [2..n], LCP_R[i] = lcp(S_i, S_{i−1}).
• Let π : [1..n] → [1..n] be an arbitrary permutation. Define
    LCP_{R,π}[i] = lcp(S_{π(i)}, {S_{π(1)}, ..., S_{π(i−1)}})
    ΣLCP_π(R) = ∑_{i ∈ [1..n]} LCP_{R,π}[i].
  In other words, LCP_{R,π} and ΣLCP_π(R) are the same as LCP_R and ΣLCP(R) except that the order of the strings is different. Then ΣLCP_π(R) = ΣLCP(R) and LCP_{R,π} is a permutation of LCP_R.

The proofs are left as exercises.

Compact Trie

Tries suffer from a large number of nodes, close to ||R|| in the worst case. The space requirement can be problematic, since typically each node needs much more space than a single symbol. Path compacted tries reduce the number of nodes by replacing branchless path segments with a single edge.

• Leaf path compaction applies this to path segments leading to a leaf. The number of nodes is now |R| + Σlcp(R) − ΣLCP(R) + 1 (exercise).
• Full path compaction applies this to all path segments. Then every internal node (except possibly the root) has at least two children. In such a tree, there are always at least as many leaves as internal nodes. Thus the number of nodes is at most 2|R|.

The full path compacted trie is called a compact trie.

Example 1.10: Path compacted tries for R = {pot$, potato$, pottery$, tattoo$, tempo$}.

  [Figure: the leaf path compacted trie (left) and the fully compacted trie (right), with edge labels such as pot, $, ato$, tery$, attoo$ and empo$.]

The edge labels are factors of the input strings. If the input strings are stored separately, the edge labels can be represented in constant space using pointers into the strings.

The time complexity of the basic operations on the compact trie is the same as for the trie (and depends on the implementation of the child operation in the same way), but prefix and range queries are faster on the compact trie (exercise).

Ternary Trie

Tries can be implemented for ordered alphabets, but a bit awkwardly, using a comparison-based child function. The ternary trie is a simpler data structure based on symbol comparisons.

A ternary trie is like a binary search tree except:
• Each internal node has three children: smaller, equal and larger.
• The branching is based on a single symbol at a given position as in a trie. The position is zero (first symbol) at the root and increases along the middle branches but not along the side branches.

Ternary tries have variants similar to the standard (σ-ary) trie:
• A basic ternary trie, which is a full representation of the strings.
• Compact ternary tries reduce space by compacting branchless path segments.
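Below is a minimal Python sketch of a basic ternary trie with insert and lookup (illustrative only; it assumes, as in the examples above, that every stored string ends with a terminator symbol such as $ so that no string is a proper prefix of another):

class Node:
    """One ternary trie node branching on the symbol ch at the node's position."""
    __slots__ = ("ch", "left", "mid", "right")
    def __init__(self, ch):
        self.ch = ch
        self.left = self.mid = self.right = None

def insert(root, S):
    """Insert string S and return the (possibly new) root."""
    def rec(node, i):
        if node is None:
            node = Node(S[i])
        if S[i] < node.ch:
            node.left = rec(node.left, i)       # side branch: position stays at i
        elif S[i] > node.ch:
            node.right = rec(node.right, i)     # side branch: position stays at i
        elif i + 1 < len(S):
            node.mid = rec(node.mid, i + 1)     # middle branch: position advances
        return node
    return rec(root, 0)

def contains(root, S):
    node, i = root, 0
    while node is not None:
        if S[i] < node.ch:
            node = node.left
        elif S[i] > node.ch:
            node = node.right
        elif i + 1 == len(S):
            return True                         # matched the final symbol (the terminator)
        else:
            node, i = node.mid, i + 1
    return False

root = None
for S in ["pot$", "potato$", "pottery$", "tattoo$", "tempo$"]:
    root = insert(root, S)
print(contains(root, "pot$"), contains(root, "temp$"))  # True False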

Example 1.11: Ternary tries for R = {pot$, potato$, pottery$, tattoo$, tempo$}.

  [Figure: a basic ternary trie and compact ternary tries for R.]

Ternary tries have the same asymptotic size as the corresponding tries.

A ternary trie is balanced if each left and right subtree contains at most half of the strings in its parent tree.

• The balance can be maintained by rotations similarly to binary search trees.

  [Figure: a rotation between adjacent nodes b and d, rearranging their subtrees A, B, C, D, E while preserving the in-order structure.]

• We can also get reasonably close to a balance by inserting the strings into the tree in a random order.

In a balanced ternary trie, each step down either moves the position forward (middle branch), or halves the number of strings remaining in the subtree (side branch).

• Thus, in a balanced ternary trie storing n strings, any downward traversal following a string S passes at most |S| middle edges and at most log n side edges.
• Thus the time complexity of insertion, deletion, lookup and lcp query is O(|S| + log n).

In comparison based tries, where the child function is implemented using binary search trees, the time complexities could be O(|S| log σ), a multiplicative factor O(log σ) instead of an additive factor O(log n). Prefix and range queries behave similarly (exercise).

String Sorting

Ω(n log n) is a well known lower bound for the number of comparisons needed for sorting a set of n objects by any comparison based algorithm. This lower bound holds both in the worst case and in the average case.

There are many algorithms that match the lower bound, i.e., sort using O(n log n) comparisons (worst or average case). Examples include quicksort, heapsort and mergesort.

If we use one of these algorithms for sorting a set of n strings, it is clear that the number of symbol comparisons can be more than O(n log n) in the worst case. Determining the order of A and B needs at least lcp(A, B) symbol comparisons, and lcp(A, B) can be arbitrarily large in general.

On the other hand, the average number of symbol comparisons for two random strings is O(1). Does this mean that we can sort a set of random strings in O(n log n) time using a standard sorting algorithm?

The following theorem shows that we cannot achieve O(n log n) symbol comparisons for every set of strings (when σ = n^o(1)).

Theorem 1.12: Let A be an algorithm that sorts a set of objects using only comparisons between the objects. Let R = {S_1, S_2, ..., S_n} be a set of n strings over an ordered alphabet Σ of size σ. Sorting R using A requires Ω(n log n log_σ n) symbol comparisons on average, where the average is taken over the initial orders of R.

• If σ is considered to be a constant, the lower bound is Ω(n (log n)^2).
• Note that the theorem holds for any comparison based sorting algorithm A and any string set R. In other words, we can choose A and R to minimize the number of comparisons and still not get below the bound.
• Only the initial order is random rather than arbitrary, i.e., the bound is on the average over initial orders rather than for every initial order. Otherwise, we could pick the correct order and use an algorithm that first checks whether the order is correct, needing only O(n + ΣLCP(R)) symbol comparisons.

An intuitive explanation for this result is that the comparisons made by a sorting algorithm are not random. In the later stages, the algorithm tends to compare strings that are close to each other in lexicographical order and thus are likely to have long common prefixes.

Proof of Theorem 1.12. Let k = ⌊(log_σ n)/2⌋. For any string α ∈ Σ^k, let R_α be the set of strings in R having α as a prefix. Let n_α = |R_α|.

Let us analyze the number of symbol comparisons when comparing strings in R_α against each other.

• Each string comparison needs at least k symbol comparisons.
• No comparison between a string in R_α and a string outside R_α gives any information about the relative order of the strings in R_α.
• Thus A needs to do Ω(n_α log n_α) string comparisons and Ω(k n_α log n_α) symbol comparisons to determine the relative order of the strings in R_α.

Thus the total number of symbol comparisons is Ω(∑_{α ∈ Σ^k} k n_α log n_α), and

  ∑_{α ∈ Σ^k} k n_α log n_α ≥ k(n − √n) log((n − √n)/σ^k) ≥ k(n − √n) log(√n − 1) = Ω(k n log n) = Ω(n log n log_σ n).

Here we have used the facts that σ^k ≤ √n, that ∑_{α ∈ Σ^k} n_α > n − σ^k ≥ n − √n, and that ∑_{α ∈ Σ^k} n_α log n_α ≥ (n − √n) log((n − √n)/σ^k) (see exercises).  □

The preceding lower bound does not hold for algorithms specialized for sorting strings.

Theorem 1.13: Let R = {S_1, S_2, ..., S_n} be a set of n strings. Sorting R into the lexicographical order by any algorithm based on symbol comparisons requires Ω(ΣLCP(R) + n log n) symbol comparisons.

Proof. If we are given the strings in the correct order and the job is to verify that this is indeed so, we need at least ΣLCP(R) symbol comparisons. No sorting algorithm could possibly do its job with fewer symbol comparisons. This gives a lower bound Ω(ΣLCP(R)). On the other hand, the general sorting lower bound Ω(n log n) must hold here too. The result follows from combining the two lower bounds.  □

• Note that the expected value of ΣLCP(R) for a random set of n strings is O(n log_σ n). The lower bound then becomes Ω(n log n).
• We will next see that there are algorithms that match this lower bound. Such algorithms can sort a random set of strings in O(n log n) time.

String Quicksort (Multikey Quicksort)

Quicksort is one of the fastest general purpose sorting algorithms in practice. Here is a variant of quicksort that partitions the input into three parts instead of the usual two parts.

Algorithm 1.14: TernaryQuicksort(R)
Input: (Multi)set R in arbitrary order.
Output: R in ascending order.
(1) if |R| ≤ 1 then return R
(2) select a pivot x ∈ R
(3) R_< ← {s ∈ R | s < x}
(4) R_= ← {s ∈ R | s = x}
(5) R_> ← {s ∈ R | s > x}
(6) R_< ← TernaryQuicksort(R_<)
(7) R_> ← TernaryQuicksort(R_>)
(8) return R_< · R_= · R_>
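A direct, list-based Python transcription of Algorithm 1.14 (illustrative; the pivot chosen here is simply the middle element by position, whereas the analysis below assumes an optimal pivot selection):

def ternary_quicksort(R):
    """Partition into <, =, > parts around a pivot and recurse on the two side parts."""
    if len(R) <= 1:
        return list(R)
    x = R[len(R) // 2]                  # illustrative pivot choice
    R_lt = [s for s in R if s < x]
    R_eq = [s for s in R if s == x]
    R_gt = [s for s in R if s > x]
    return ternary_quicksort(R_lt) + R_eq + ternary_quicksort(R_gt)

print(ternary_quicksort([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]))
# [1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 9]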

In the normal, binary quicksort, we would have two subsets R_≤ and R_≥, both of which may contain elements that are equal to the pivot.

• Binary quicksort is slightly faster in practice for sorting sets.
• Ternary quicksort can be faster for sorting multisets with many duplicate keys. Sorting a multiset of size n with σ distinct elements takes O(n log σ) comparisons (exercise).

The time complexity of both the binary and the ternary quicksort depends on the selection of the pivot (exercise). In the following, we assume an optimal pivot selection giving O(n log n) worst case time complexity.

String quicksort is similar to ternary quicksort, but it partitions using a single character position. String quicksort is also known as multikey quicksort.

Algorithm 1.15: StringQuicksort(R, l)
Input: (Multi)set R of strings and the length l of their common prefix.
Output: R in ascending lexicographical order.
(1) if |R| ≤ 1 then return R
(2) R_⊥ ← {S ∈ R | |S| = l};  R ← R \ R_⊥
(3) select a pivot X ∈ R
(4) R_< ← {S ∈ R | S[l] < X[l]}
(5) R_= ← {S ∈ R | S[l] = X[l]}
(6) R_> ← {S ∈ R | S[l] > X[l]}
(7) R_< ← StringQuicksort(R_<, l)
(8) R_= ← StringQuicksort(R_=, l + 1)
(9) R_> ← StringQuicksort(R_>, l)
(10) return R_⊥ · R_< · R_= · R_>

In the initial call, l = 0.
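The same kind of list-based Python sketch for Algorithm 1.15, using the word list of Example 1.16 below (illustrative only; the pivot choice is again arbitrary, while Theorem 1.17 assumes an optimal pivot):

def string_quicksort(R, l=0):
    """Sort strings in R, all of which share a common prefix of length l."""
    if len(R) <= 1:
        return list(R)
    R_done = [S for S in R if len(S) == l]   # strings that end at position l (R_⊥ in Algorithm 1.15)
    R = [S for S in R if len(S) > l]
    if not R:
        return R_done
    X = R[len(R) // 2]                       # illustrative pivot choice
    R_lt = [S for S in R if S[l] < X[l]]
    R_eq = [S for S in R if S[l] == X[l]]
    R_gt = [S for S in R if S[l] > X[l]]
    return (R_done
            + string_quicksort(R_lt, l)
            + string_quicksort(R_eq, l + 1)  # the known common prefix grows by one symbol
            + string_quicksort(R_gt, l))

words = ["alphabet", "alignment", "allocate", "algorithm",
         "alternative", "alias", "alternate", "all"]
print(string_quicksort(words) == sorted(words))  # True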

Example 1.16: A possible partitioning, when l = 2 (the symbol at position 2, on which the partitioning is based, is shown separated):

  al p habet          al i gnment
  al i gnment         al g orithm
  al l ocate          al i as
  al g orithm   ⇒    al l ocate
  al t ernative       al l
  al i as             al p habet
  al t ernate         al t ernative
  al l                al t ernate

Theorem 1.17: String quicksort sorts a set R of n strings in O(ΣLCP(R) + n log n) time.

Thus string quicksort is an optimal symbol comparison based algorithm. String quicksort is also fast in practice.

Proof of Theorem 1.17. The time complexity is dominated by the symbol comparisons on lines (4)–(6). We charge the cost of each comparison either on a single symbol or on a string, depending on the result of the comparison:

S[l] = X[l]: Charge the comparison on the symbol S[l].
• Now the string S is placed in the set R_=. The recursive call on R_= increases the common prefix length to l + 1. Thus S[l] cannot be involved in any future comparison and the total charge on S[l] is 1.
• Only lcp(S, R \ {S}) symbols in S can be involved in these comparisons. Thus the total number of symbol comparisons resulting in equality is at most Σlcp(R) = Θ(ΣLCP(R)). (Exercise: Show that the number is exactly ΣLCP(R).)

S[l] ≠ X[l]: Charge the comparison on the string S.
• Now the string S is placed in the set R_< or R_>. The size of either set is at most |R|/2, assuming an optimal choice of the pivot X.
• Every comparison charged on S halves the size of the set containing S, and hence the total charge accumulated by S is at most log n. Thus the total number of symbol comparisons resulting in inequality is O(n log n).  □