Tries and suffixes trees

Trie: a data structure for a set of words

Alon Efrat, Computer Science Department, University of Arizona

All words are over the alphabet Σ = {a, b, ..., z}. In the slides, let us say that the alphabet is only {a, b, c, d}.
S is a set of words over this alphabet.
We need to support the operations:
insert(w): add a new word w into S.
delete(w): delete the word w from S.
find(w): is w in S?
Future operation: given a text (many words), where is w in the text?
The time for each operation should be O(k), where k is the number of letters in w.
Usually each word is associated with additional info, not discussed here.

Trie (Tree + Retrieve) for S

A tree where each node is a struct consisting of an array of 4 child pointers (one per letter) and a flag:

    struct node {
        struct node *ar[4];  /* ar[i] is the child for the i-th letter, or NULL */
        char flag;           /* 1 if a word ends at this node, otherwise 0 */
    };

A trie: example

Rule: each node corresponds to a word w, and w is in S iff the flag of that node is 1. A node whose word is not in S has flag = 0.
For a node p, the child for the letter b is p->ar['b' - 'a'].
Note: the label of an edge is the label of the cell from which this edge exits.

Finding if a word w is in the tree

    p = root; i = 0;
    while (1) {
        if w[i] == '\0'    // we scanned all letters of w
            then return the flag of p;    // True/False
        if the entry of p corresponding to w[i] is NULL
            return false;
        set p to be the node pointed to by this entry, and set i++;
    }

Inserting a word w

Recall that we need to modify the tree so that find(w) would return TRUE.
Try to perform find(w). If it runs into a NULL pointer, create new node(s) along the path. The flag fields of all new nodes are 0.
Set the flag of the last node to 1.

Deleting a word w

Find the node p corresponding to w (using the find operation).
Set the flag field of p to 0.
If p is dead (i.e. flag == 0 and all pointers are NULL), then free(p), set the parent's pointer to p to NULL, set p = parent(p) and repeat this check.
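The three operations above can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not the deck's own code: the `parent` field, the names `new_node`, `trie_find`, `trie_insert`, `trie_delete` and `is_dead`, and the restriction to the letters a..d are all choices made here for illustration.

```c
#include <stdlib.h>

#define SIGMA 4                  /* alphabet {a, b, c, d}, as in the slides */

struct node {
    struct node *ar[SIGMA];      /* child per letter, NULL if absent */
    struct node *parent;         /* assumption: parent link, used by delete */
    char flag;                   /* 1 if a word ends here, otherwise 0 */
};

/* Allocates a node with no children and flag 0; returns NULL on failure. */
static struct node *new_node(struct node *parent) {
    struct node *p = malloc(sizeof *p);
    if (p == NULL) return NULL;
    for (int c = 0; c < SIGMA; c++) p->ar[c] = NULL;
    p->parent = parent;
    p->flag = 0;
    return p;
}

/* find(w): 1 if w is in the set, 0 otherwise. Words must use only a..d. */
int trie_find(const struct node *root, const char *w) {
    const struct node *p = root;
    for (size_t i = 0; w[i] != '\0'; i++) {
        p = p->ar[w[i] - 'a'];
        if (p == NULL) return 0;          /* ran into a NULL pointer */
    }
    return p->flag;
}

/* insert(w): creates missing nodes along the path, then sets the last flag.
   Returns 0 on success, -1 if memory runs out. */
int trie_insert(struct node *root, const char *w) {
    struct node *p = root;
    for (size_t i = 0; w[i] != '\0'; i++) {
        int c = w[i] - 'a';
        if (p->ar[c] == NULL) {
            p->ar[c] = new_node(p);
            if (p->ar[c] == NULL) return -1;
        }
        p = p->ar[c];
    }
    p->flag = 1;
    return 0;
}

/* A node is dead if its flag is 0 and all its pointers are NULL. */
static int is_dead(const struct node *p) {
    if (p->flag) return 0;
    for (int c = 0; c < SIGMA; c++)
        if (p->ar[c] != NULL) return 0;
    return 1;
}

/* delete(w): clears the flag, then frees dead nodes bottom-up.
   The root itself is never freed. */
void trie_delete(struct node *root, const char *w) {
    struct node *p = root;
    for (size_t i = 0; w[i] != '\0'; i++) {
        p = p->ar[w[i] - 'a'];
        if (p == NULL) return;            /* w has no node in the trie */
    }
    p->flag = 0;
    while (p != root && is_dead(p)) {
        struct node *parent = p->parent;
        for (int c = 0; c < SIGMA; c++)
            if (parent->ar[c] == p) parent->ar[c] = NULL;
        free(p);
        p = parent;
    }
}
```

Each function walks at most one node per letter of w, so all three run in O(k) time for a word of k letters (times a factor of SIGMA for the dead-node check in delete).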

Space requirements

Let m be the sum of the numbers of characters of all words in S.
The space required might be Θ(|Σ| m): for each letter of each word of S, we need an array of size |Σ|. (This might be an issue by itself, and might slow down performance.)

Heuristics for space saving

To save some space when Σ is larger, there are a few heuristics we can use. Assume Σ = {a, ..., z}. We use two types of nodes.

Type A is used when the number of children of a node is more than 3. Its fields are: type, one pointer per letter a ... z, flag. Note that the letters are not stored explicitly.

Type B is used if there are 3 or fewer children. The letter of each child is also stored. Its fields are: type, (letter, pointer), (letter, pointer), (letter, pointer), flag. In the slide's example the stored letters are B, F, R.
The rule of the flag is the same as in type A nodes. We only store the 3 pointers, but we need to know which letters they correspond to.

Another heuristic: path compression

Replace a long sequence of nodes that happen to have only a single child each with a single node (of type "pointer to string") that keeps a pointer to the next node and a pointer to a string.
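One possible layout for the two node types is sketched below in C. The struct names `node_a` and `node_b`, the shared header, the `count` field and the `child` helper are assumptions made for this illustration; the slides only name the fields type, letter, pointer and flag.

```c
#include <stddef.h>

#define SIGMA 26                       /* alphabet {a, ..., z} */

enum { TYPE_A, TYPE_B };

/* Common header, so that any node can be inspected through a struct node *. */
struct node {
    char type;                         /* TYPE_A or TYPE_B */
    char flag;                         /* 1 if a word ends at this node */
};

/* Type A: more than 3 children, letters are implicit in the array index. */
struct node_a {
    struct node hdr;
    struct node *ar[SIGMA];
};

/* Type B: 3 or fewer children, each pointer is stored with its letter. */
struct node_b {
    struct node hdr;
    unsigned char count;               /* assumption: number of used slots */
    char letter[3];
    struct node *ptr[3];
};

/* Returns the child of p for letter c (in a..z), or NULL if there is none. */
struct node *child(struct node *p, char c) {
    if (p->type == TYPE_A) {
        struct node_a *a = (struct node_a *)p;
        return a->ar[c - 'a'];
    }
    struct node_b *b = (struct node_b *)p;
    for (int i = 0; i < b->count; i++)
        if (b->letter[i] == c)
            return b->ptr[i];
    return NULL;
}
```

A type B lookup scans at most 3 slots, so it is still constant time, while the node itself needs room for 3 pointers instead of 26.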

Suffix tree

Assume B (for book) is a long text. We want to preprocess B so that when a word w is given, we can quickly find whether it is in B (incremental search), as well as its locations, how many times it appears, etc. We can find it in O(|w|).

Idea: consider B as a long string. Create a trie T of all suffixes of B.
In addition to the flag (specifying whether a word ends at a node), we also store the index in B where this word begins. To know where a word appears in B, we store with each node the index of the beginning of the suffix in B. (We can store only the first appearance of the word in the text.)

Size of suffix tree

Assume n = |B|.
Total length of all the strings: Θ(n²).
Size of a node is |Σ|, so the size of the tree is Θ(n² |Σ|) in the worst case.
Time to construct the tree: Θ(n²).
In addition to (or rather than) the flag, we store the first index (in the book) where the suffix starts.
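The idea can be sketched in C as below: insert every suffix of B into a trie and record, in each newly created node, the start index of the suffix that created it. This is a minimal sketch under assumptions: the names `suffix_trie_build`, `suffix_trie_find` and the field `first` are hypothetical, the alphabet is restricted to a..d, and the flag is dropped in favour of the index.

```c
#include <stdlib.h>
#include <string.h>

#define SIGMA 4                  /* alphabet {a, b, c, d} */

struct node {
    struct node *ar[SIGMA];      /* child per letter, NULL if absent */
    int first;                   /* index in B where this node's word first begins */
};

static struct node *new_node(int first) {
    struct node *p = malloc(sizeof *p);
    if (p == NULL) return NULL;
    for (int c = 0; c < SIGMA; c++) p->ar[c] = NULL;
    p->first = first;
    return p;
}

/* Builds the trie of all suffixes of B. Suffixes are inserted in order of
   increasing start index, so the suffix that creates a node is the first
   place in B where that node's word appears. Returns NULL on allocation
   failure (already built nodes are leaked, which is acceptable in a sketch). */
struct node *suffix_trie_build(const char *B) {
    size_t n = strlen(B);
    struct node *root = new_node(0);
    if (root == NULL) return NULL;
    for (size_t s = 0; s < n; s++) {          /* insert the suffix B[s..n-1] */
        struct node *p = root;
        for (size_t i = s; i < n; i++) {
            int c = B[i] - 'a';
            if (p->ar[c] == NULL) {
                p->ar[c] = new_node((int)s);
                if (p->ar[c] == NULL) return NULL;
            }
            p = p->ar[c];
        }
    }
    return root;
}

/* Returns the index of the first occurrence of w in B, or -1 if w does not
   occur. Runs in O(|w|), independent of the length of B. */
int suffix_trie_find(const struct node *root, const char *w) {
    const struct node *p = root;
    for (size_t i = 0; w[i] != '\0'; i++) {
        p = p->ar[w[i] - 'a'];
        if (p == NULL) return -1;
    }
    return p->first;
}
```

Building takes time proportional to the total length of all suffixes, which matches the Θ(n²) construction bound on the slide; a query only follows one pointer per letter of w.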

Suffix tries on a diet

Def: a shred is a path from a node u to a node v in the trie, consisting of nodes of outdegree 1 (except maybe the last one) and flag = 0.
Obs: there is a contiguous part of B identical to the string that the shred represents. We call this part the shred-string.
We store B itself as an array.
We use a new type of node, called a shred-node, that maintains only the indexes of the first (id1) and last (id2) letters of the shred-string in B. Its fields are: type, id1, id2.

Suffix tries on a diet (cont.)

Algorithm for constructing a thin trie:
Given B, create an empty trie T, and insert all n suffixes of B into T, generating a trie of size Θ(n²).
Traverse the trie, and each time that a shred is seen, replace all nodes of the shred with a single shred-node.

Suffix tries on a diet (cont.)

Clearly the use of shred-nodes saves some space, but can we prove something?
Observation: the number of leaves of T is at most n (every leaf is the end of one suffix).
In addition, there are nodes that have a single child, but whose flag = 1 (a suffix has ended there). We call them special nodes.
Observation: there are at most n special nodes.

Thanks for your patience. See you at the review.
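To illustrate why two indexes are enough, here is a small C sketch of how a search could step through a shred-node by comparing the query directly against B. The struct layout and the function `shred_match` are assumptions made for this example, and the string used in the usage note is hypothetical rather than the one on the slide.

```c
#include <stddef.h>

/* A shred-node stores no letters, only where its shred-string lies in B. */
struct shred_node {
    size_t id1;      /* index in B of the first letter of the shred-string */
    size_t id2;      /* index in B of the last letter of the shred-string */
};

/* Compares w against the shred-string B[id1..id2] and returns the number of
   leading letters that match. The caller then decides what to do:
   - the whole shred matched: continue at the next node with w + k;
   - w ended inside the shred: w occurs in B;
   - otherwise: mismatch, w does not occur along this path. */
size_t shred_match(const char *B, const struct shred_node *s, const char *w) {
    size_t k = 0;
    while (s->id1 + k <= s->id2 && w[k] != '\0' && w[k] == B[s->id1 + k])
        k++;
    return k;
}
```

For instance, with the hypothetical text B = "abcdddd" and a shred-node with id1 = 3 and id2 = 5 (shred-string "ddd"), the query "dda" matches 2 letters and then fails, while "ddd" matches the whole shred.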

Suffix tries on a diet (cont.)

Lemma: let T be a tree with m leaves in which each internal node has outdegree 2 or more. Then T has at most m internal nodes.
Back to thin suffix tries: T does not have exactly this property, but it is very close (no long shreds), so a massaged version of the lemma still works. So
#internal_nodes ≤ #leaf_nodes + #special_nodes, and
#leaf_nodes + #special_nodes ≤ #suffixes_of_B = n.
So the size of the trie is only a constant factor more than the size of the book.

Proof of lemma (just FYI)

Lemma: let T be a tree with m leaves and k internal nodes, in which each internal node has outdegree 2 or more. Then k ≤ m.
Proof: by induction on m. Assume the claim is true for all trees with strictly fewer than m leaves, and assume T has m leaves.
Find a leaf u whose distance from the root is maximum. Assume it has exactly one sibling v. Note that v is a leaf (why?). Let w be their common parent.
Remove both u and v from T. Let T' be the resulting tree. Let k', m' denote the numbers of internal nodes and leaves in T'.
1. Now in T', w is a leaf.
2. m' = m - 2 + 1 = m - 1.
3. k' = k - 1.
4. The outdegree of every internal node of T' is still 2 or more.
From induction, k' ≤ m'. Hence k ≤ m.