Fast Learning of Restricted Regular Expressions and DTDs

Dominik D. Freydenberger, Institut für Informatik, Goethe-Universität Frankfurt a.M., freydenberger@em.uni-frankfurt.de
Timo Kötzing, Max-Planck-Institute for Informatics, 66123 Saarbrücken, koetzing@mpi-inf.mpg.de

This work was done while this author was visiting the Max-Planck-Institute for Informatics in Saarbrücken.

ABSTRACT
We study the problem of generalizing from a finite sample to a language taken from a predefined language class. The two language classes we consider are subsets of the regular languages and have significance in the specification of XML documents (the classes corresponding to so-called chain regular expressions, Chares, and to single occurrence regular expressions, Sores). The previous literature gave a number of algorithms for generalizing to Sores providing a trade-off between quality of the solution and speed. Furthermore, a fast but non-optimal algorithm for generalizing to Chares is known. For each of the two language classes we give an efficient algorithm returning a minimal generalization from the given finite sample to an element of the fixed language class; such generalizations are called descriptive. In this sense, both our algorithms are optimal.

Keywords
subregular language learning, single occurrence regular expression, chain regular expression, descriptive generalization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EDBT/ICDT '13, March 2013, Genoa, Italy. Copyright 2013 ACM.

1. INTRODUCTION
The present paper follows and refines an approach for XML schema inference from positive examples that was introduced by Bex et al. [3]. The basic problem setting is as follows. Given a set of XML documents, generate a schema that describes these documents, while being compact and preferably human readable. Bex et al. approach this problem by learning deterministic regular expressions from positive examples; i.e., they consider the following problem: Given a finite set S of positive examples from an unknown target language L, find a deterministic regular expression for L. These regular expressions can immediately be used as DTDs (Document Type Definitions), and while XSDs (XML Schema Documents) require additional effort, algorithms that infer regular expressions can also be used as a component of XSD inference algorithms (see [3, 4] for further explanations). In particular, as argued in [3], the results in [] show that XSD inference requires deep insights into regular expression inference; as Bex et al. put it, one cannot hope to successfully infer XSDs without good algorithms for inferring regular expressions.

Using a classical technique from Gold [9], Bex et al. prove in [2] that even the class of deterministic regular expressions is too rich to be learnable from positive data. While, strictly speaking, the learnability criterion of Gold-style learning as defined in [9] (which is also called learning in the limit from positive data or explanatory learning) is different from the setting in [2, 3], its non-learnability results still provide valuable insights into necessary restrictions. In particular, Gold-style learning shows that, when learning from positive data, one has to balance the need for generalization (as in most cases, a regular expression that generates exactly the example is not considered a good hypothesis) with the need to avoid overgeneralization.

While there are numerous papers on restrictions on the class of regular languages that lead to learnability, apart from a few exceptions (e.g. [6]), most of these restrictions prior to [3] have been based on properties of automata. As explained in [3], this is problematic, as even under those restrictions, converting the inferred automaton to a regular expression can lead to an exponential size increase. In order to achieve learnability of concise deterministic regular expressions, Bex et al. propose single occurrence regular expressions (short Sores), regular expressions where each terminal letter (or element name) occurs at most once.
These Sores are deterministic by definition, and as an additional benefit, this restriction ensures that the length of the inferred expressions is at most linear in the number of different terminal letters. The corresponding Sore-inference algorithm RWR from [3] works as follows. First, it constructs a so-called single occurrence automaton (short Soa, as introduced by García and Vidal [8]). RWR then attempts to convert the Soa step by step into a Sore. As the class of Sore-languages is a proper subset of the class of Soa-languages, this conversion is not always possible. In these cases, RWR attempts to repair the Soa, and constructs a Sore that generates a generalization of the language of the Soa. In order to generalize as little as possible, [3] suggests different orderings on the set of repair rules, as well as the variant RWR^2_l, which uses additional heuristics and can have an exponential running time. Nonetheless, these variants may still infer Sores that are not inclusion-minimal generalizations of the input sample (within the class of all Sores).

In order to deal with insufficient data, Bex et al. propose a further restriction on Sores, the so-called chain regular expressions (short: Chares), and introduce the corresponding inference algorithm CRX. Analogously to RWR, CRX may infer Chares that are not inclusion-minimal generalizations.

The present paper focuses on inferring Sores and Chares that are inclusion-minimal generalizations. This approach to regular expression inference is based on a slightly different angle than Gold-style learning, namely on the learning paradigm of descriptive generalization that was introduced by Freydenberger and Reidenbach [7]. While Gold-style learning assumes that an exact representation of the target language is present in the hypothesis space, and that the learner is provided with sufficient positive information to correctly recognize the target language, descriptive generalization views the hypothesis space and the space of target languages as distinct. For a class D of language representation mechanisms (e.g., a class of automata, regular expressions, or grammars; see footnote 2), a language representation δ ∈ D is called D-descriptive of a sample S if L(δ) is an inclusion-minimal generalization of S, i.e., S ⊆ L(δ) and there is no γ ∈ D with S ⊆ L(γ) ⊂ L(δ). This concept allows us to define D-descriptive generalization as a natural extension of Gold-style learning: Instead of attempting to learn an exact representation of the target language L from a sample S, the learner has to infer a representation δ ∈ D that is D-descriptive of S. In other words, δ is a generalization of S that is as inclusion-minimal as possible within D. Descriptive generalization explicitly separates the hypothesis space from the class of target languages, while still providing a natural quality criterion for generalization from positive examples.

Footnote 1: Gold-style learning uses a growing set of samples and requires that the learner converges toward a correct hypothesis in finite time, while this setting uses only a single finite set for each inference instance.
In the present paper, we consider the class of Sores and the class of Chares as hypothesis spaces D, and examine the problem of inferring D-descriptive generalizations from finite samples. We approach this problem by first computing a Soa-descriptive Soa. As we shall see, this approach has the advantages that the descriptive Soa is uniquely defined, can be computed efficiently, and its language is included in the language of every descriptive Sore or Chare.

The main contributions of the present paper are two algorithms, Soa2Sore and Soa2Chare, that can be used to transform any given Soa into a Sore (resp. Chare) that is Sore-descriptive (resp. Chare-descriptive) of the language of that Soa. That is, given a sample S, these algorithms can be used to compute a generalization of S that is inclusion-minimal (or, in the terminology of [3], optimal) within the class of Sores or Chares (respectively). In addition to this, Soa2Chare and Soa2Sore are efficient: Soa2Chare runs in time O(m) (compared to O(m + n^3) for CRX), Soa2Sore in time O(nm) (compared to O(n^5) for RWR), where m is the number of edges and n the number of nodes in the Soa.

The paper is structured as follows. Section 2 contains some mathematical preliminaries, followed by some informative properties of the language classes considered. Section 3 discusses CRX as well as RWR and its variants in the context of descriptive regular expressions. In particular, we show that for each of these algorithms, there are samples over small alphabets where the algorithm does not compute a descriptive Chare or Sore. Sections 4 and 5 contain the algorithms Soa2Chare and Soa2Sore, respectively, as well as proofs of their correctness and running time. Finally, Section 6 concludes the paper. For space reasons, some of the proofs were omitted.

Footnote 2: The canonical class D is the class of NE-patterns, where descriptive patterns were introduced by Angluin [] in the context of exact learning from positive data. See [2] for a survey on the influence of pattern languages in this area.

2. PRELIMINARIES
Let ∅ denote the empty set, let ε denote the empty word. With |x|, we denote the length of x if x is a word, or the number of elements in x if x is a set.
We use ⊆ (and ⊂) to denote the inclusion (respectively proper inclusion) of sets. The difference of two sets A, B is denoted by A \ B and defined as {a ∈ A | a ∉ B}. A word v is a factor of a word x ∈ Σ* if there exist u, w ∈ Σ* such that x = uvw. A 2-gram is a factor of length 2. Let alph(w) denote the set of all letters occurring in a word w, and extend this to languages by defining alph(L) := ⋃_{w ∈ L} alph(w).

2.1 Introducing SORE, CHARE, SOA
This section introduces the classes of regular expressions and automata that are used in the present paper. We mostly follow the notations introduced in [3]. In particular, we use the following variant of regular expressions.

Definition 1. Let Σ be a finite alphabet (the set of terminal letters, also called element names). Every letter a ∈ Σ is a regular expression, as are ε and ∅, and L(x) = {x} for x ∈ Σ ∪ {ε}, while L(∅) = ∅. If α is a regular expression, then α⁺ and α? are regular expressions, where L(α⁺) = (L(α))⁺ and L(α?) = L(α) ∪ {ε}. Furthermore, if α and β are regular expressions, then α | β and α · β are also regular expressions, with L(α | β) = L(α) ∪ L(β) and L(α · β) = {uv | u ∈ L(α), v ∈ L(β)}.

For the sake of convenience, we sometimes omit the concatenation operator (i.e., we write αβ instead of α · β), and/or omit parentheses. For a regular expression α, we use alph(α) to denote the set of terminal letters that occur in α. We call two regular expressions α, β alphabet-disjoint if alph(α) ∩ alph(β) = ∅. Two regular expressions α and β are equivalent if L(α) = L(β). For any set A ⊆ Σ, we use the notation ALT(A) to denote the regular expression ALT(A) := (a_1 | ... | a_n), where a_1, ..., a_n are the letters of A, with ALT(∅) = ε (ALT stands for alternation). In a strict sense, this definition requires an ordering on the letters to be sound, but for the purpose of this paper, this is of no concern, and we assume that ALT(A) = ALT(B) if A = B.

The full class of regular expressions is too strong both for DTDs (which allow only deterministic regular expressions) and for learning from positive data (which requires language classes that are sufficiently sparse, cf. [9]). As proven in [2], even the class of deterministic regular expressions is still too large to be learnable from positive data.
Hence, [3] proposes the following subclasses of deterministic regular expressions.

Definition 2 (Sore/Chare). A single occurrence regular expression (or Sore) is a regular expression in which each terminal letter occurs (at most) once. A chain regular expression (or Chare) is a Sore of the form f_1 ... f_n (n ≥ 0), where each f_i is a chain factor, i.e., a Sore of the form (a_1 | ... | a_k), (a_1 | ... | a_k)?, (a_1 | ... | a_k)⁺, or (a_1 | ... | a_k)⁺?, where k ≥ 1, and each a_j is a terminal letter. In other words, a Chare consists of a concatenation of alphabet-disjoint chain factors.

We illustrate these definitions with a few short examples.

Example 3. Consider the regular expressions given as α = ()?( )⁺, β = (ab)⁺, and γ = . Here, α is a Chare (and, hence, also a Sore), as it consists of two alphabet-disjoint chain factors. On the other hand, β is a Sore (every letter occurs only once), but not a Chare (as it is not composed of chain factors). One can easily prove that L(β) is not a Chare-language: Assume there exists a Chare β′ with L(β′) = L(β). By definition, β′ must contain a and b. If a and b are not in the same chain factor of β′, then at least one of the 2-grams ab or ba cannot occur in any word of L(β′). But if a and b are in the same chain factor of β′, the same line of reasoning implies that this chain factor must be followed by ⁺ or ⁺?. Therefore, there are words in L(β′) that contain the 2-grams aa or bb, which contradicts L(β′) = L(β). Finally, γ is not a Sore (and therefore not a Chare), and one can prove that L(γ) is not a Sore-language (this is best proven using techniques that shall be introduced right after this example, hence we defer the proof of this claim to Remark 7 further down in this section).

While the focus of this paper is on learning regular expressions, most of our technical reasoning uses the following class of automata.

Definition 4 (Soa). Let Σ be a finite alphabet, and let src, snk be distinct symbols that do not occur in Σ. A single occurrence automaton (short: SOA) over Σ is a finite directed graph A = (V, E) such that (1) {src, snk} ⊆ V, and V ⊆ Σ ∪ {src, snk}, (2) src has only outgoing edges, snk has only incoming edges, and every v ∈ V lies on a path from src to snk. We call alph(A) := V \ {src, snk} the set of terminal letters in A. We define the relation →_A on V by →_A := E, and use →⁺_A and →*_A to denote the transitive and reflexive-transitive hull of →_A.
The language L(A) that is accepted by A is the set of all words w = a_1 ⋯ a_n (n ≥ 0) such that src →_A a_1 →_A ⋯ →_A a_n →_A snk. As usual, a strongly connected component of a Soa A is a non-empty and inclusion-maximal set C of vertices of A such that for all a, b ∈ C, a →*_A b and b →*_A a holds. A strongly connected looped component of a Soa A is a nonempty and inclusion-maximal set C of vertices of A such that for all a, b ∈ C, a →⁺_A b and b →⁺_A a holds. In other words, every strongly connected looped component contains exactly those vertices that are mutually reachable. Thus, a strongly connected component may be a singleton, while a singleton strongly connected looped component must have a self-loop. By definition, all strongly connected looped components of a Soa are disjoint, and src and snk cannot be part of any strongly connected looped component.

Although their definition is somewhat different, it is easy to see that Soas are a subclass of DFAs. In particular, a Soa can be understood as a DFA where for each a ∈ Σ, there exists a characteristic state q_a such that δ(q, a) ∈ {q_a, q_trap} for all states q ∈ Q (where q_trap is a trap state). This is illustrated by the following example.

Example 5. In the picture below, we have a Soa on the left side, and the corresponding DFA to the right side. Both automata generate the same language as the regular expression α = (( +?)(( +?) ( + )) +?)?. Note that α is not a Sore. In fact, L(α) is not a Sore-language, but proving this using only techniques that have been introduced at this point requires considerable effort. (The most straightforward way to prove this is to use techniques that are introduced in Section 5: Apply the algorithm Soa2Sore to the Soa, which returns the Sore (? +?) +?, which is not equivalent to α. By Theorem 25, this means that L(α) is not a Sore-language.)

In this paper, we frequently use Soas to approximate languages. For this, we rely on the following definition.

Definition 6. For every w ∈ Σ*, let first(w) and last(w) denote the first resp. last letter of w, and let gram_2(w) be the set of all 2-grams in w. We extend these functions on words to functions on languages by defining first(L) := {first(w) | w ∈ L}, last(L) := {last(w) | w ∈ L}, and gram_2(L) := ⋃_{w ∈ L} gram_2(w).
For every language L ⊆ Σ*, we define the Soa-approximation of L, SOA(L), by SOA(L) := (V_L, E_L), where V_L := alph(L) ∪ {src, snk}, and E_L contains the edges (src, a) for every a ∈ first(L); (a, snk) for every a ∈ last(L); (a, b) for all a, b ∈ Σ with ab ∈ gram_2(L); and (src, snk) if ε ∈ L.

Using this terminology, the approach for Soa-learning presented in [8] can be summarized as follows. Given a finite set S, compute SOA(S). In [3], the resulting algorithm is called 2T-INF. Furthermore, as computing SOA(L) is only as hard as computing first(L), last(L), and gram_2(L), note that SOA(L) can be constructed for languages from classes that are larger than the classes of finite or regular languages, e.g., for context-free languages. It is easy to see from the definition that L(SOA(L)) ⊇ L holds for every language L (in fact, we shall see in Proposition 14 that L(SOA(L)) is always the least general approximation of L that is possible with a Soa). This inclusion can be proper as follows.
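The construction of SOA(S) for a finite sample S can be sketched in Python as follows. This is only an illustration of the definition, not the 2T-INF implementation from [3]; the function names `soa_edges` and `accepts`, the encoding of words as strings with one character per terminal letter, and the sample {"aba"} are assumptions made for this example.

```python
def soa_edges(sample):
    """Return the edge set of SOA(S) for a finite sample S of strings.

    Assumption: every terminal letter is a single character, and the
    reserved node names 'src' and 'snk' are not letters.
    """
    edges = set()
    for w in sample:
        if not w:
            edges.add(("src", "snk"))      # the empty word is in S
            continue
        edges.add(("src", w[0]))           # first(w)
        edges.add((w[-1], "snk"))          # last(w)
        edges.update(zip(w, w[1:]))        # all 2-grams of w
    return edges


def accepts(edges, w):
    """Check whether w labels a path src -> ... -> snk in the Soa."""
    path = ["src", *w, "snk"]
    return all(step in edges for step in zip(path, path[1:]))


edges = soa_edges({"aba"})
print(accepts(edges, "aba"), accepts(edges, "ababa"), accepts(edges, "ab"))
```

For this hypothetical sample, the word "ababa" is accepted although it is not in the sample, so the inclusion L ⊆ L(SOA(L)) is proper here, while "ab" is rejected because there is no edge from b to snk.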

Remark 7. Note that even for finite languages L, the equality L(SOA(L)) = L is not necessary; e.g., consider L = {} (from Example 3). Then SOA(L) contains an edge from src to , from to , from to , from to itself, and from to snk. Hence, ∈ L(SOA(L)), while ∉ L. This also proves that L is not a Soa-language (and, as claimed in Example 3, not a Sore-language).

As Example 5 illustrates, there are Soa-languages that are not Sore-languages. On the other hand, we have that every Sore-language is a Soa-language (in other words, the Soa-approximation of a Sore-language is exact).

Lemma 8 ([3], proof of Proposition 9). Given any Sore α, we have L(SOA(L(α))) = L(α).

It is easy to see that SOA(L(α)) can be derived from every Sore α (in fact, even every regular expression α) by deriving the sets of first letters, last letters, and 2-grams in L(α) from the expression α (we already did this in Remark 7). Lemma 8 allows us to define SOA(α) as a notational shorthand for SOA(L(α)). Similarly, we use →_α to denote the relation →_{SOA(α)}. More importantly, we shall use Lemma 8 to develop a handy syntactic characterization of the inclusion for Sores (and Chares), which is based on the inclusion of Soas. We say that a Soa A covers a Soa B if A is a supergraph of B; in other words, alph(A) ⊇ alph(B) holds, and a →_B b implies a →_A b for all a, b ∈ alph(B). This definition leads to the following characterization of Soa-inclusion.

Lemma 9 ([8], Theorem 3.). For every pair A, B of Soas, L(A) ⊆ L(B) holds if and only if A is covered by B.

Although Lemma 9 is stated in [8] without proof (the authors just cite García's PhD thesis), it is easily proven considering the definition of SOA(L). Combining Lemma 9 with Lemma 8, we are able to characterize inclusion of Sores as follows.

Lemma 10. For every pair α, β of Sores, L(α) ⊆ L(β) holds if and only if SOA(α) is covered by SOA(β).

This obviously implies that two Sores (or Chares) are equivalent if their corresponding Soas are equivalent. More importantly, Lemma 10 provides a simple syntactic and characteristic criterion for inclusion.
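The covering criterion reduces language inclusion to a comparison of edge sets. The following Python sketch assumes that a Soa is represented only by its set of edges (pairs of node names, with 'src' and 'snk' as the special nodes); the helper names `covered_by` and `words_up_to` and the two hand-written automata are hypothetical and serve only to cross-check the criterion by brute force on short words.

```python
from itertools import product


def covered_by(edges_a, edges_b):
    """True if the Soa with edge set edges_b covers the Soa with edges_a."""
    return edges_a <= edges_b


def words_up_to(edges, alphabet, k):
    """All words of length at most k that the Soa accepts (brute force)."""
    found = set()
    for n in range(k + 1):
        for w in product(alphabet, repeat=n):
            path = ("src", *w, "snk")
            if all(step in edges for step in zip(path, path[1:])):
                found.add("".join(w))
    return found


# Assumed examples: the Soa of (ab)+ and the Soa of (a|b)+.
A = {("src", "a"), ("a", "b"), ("b", "a"), ("b", "snk")}
B = {(u, v) for u in ("src", "a", "b") for v in ("a", "b", "snk")}
B.discard(("src", "snk"))

print(covered_by(A, B), words_up_to(A, "ab", 4) <= words_up_to(B, "ab", 4))
print(covered_by(B, A), words_up_to(B, "ab", 4) <= words_up_to(A, "ab", 4))
```

On these two automata, the edge-set test and the brute-force comparison agree in both directions: A is covered by B, and B is not covered by A.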
While the algorithms in Sections 4 and 5 do not check for inclusion, their correctness proofs make heavy use of the fact that Sore-inclusion depends on the presence of edges in the corresponding Soa. Before we introduce the other central definition of this paper in Section 2.2, we discuss some concepts which will be useful, although not quite as significant.

One can verify with little effort that the classes of Soa-, Sore-, or Chare-languages are not closed under many of the operations that are commonly studied in formal language theory (e.g., concatenation, union, complementation, intersection with regular languages, morphism, inverse morphism). One of the few operations under which each of these classes is closed, and which we shall use, is projection. Let Σ be an alphabet. A projection from Σ to T ⊆ Σ is a morphism π_T : Σ* → T* that is defined by π_T(x) := x for all x ∈ T, and π_T(x) := ε for all x ∈ Σ \ T. We extend this to languages canonically, i.e., π_T(L) := {π_T(w) | w ∈ L}.

Lemma 11. The classes of Sore-, Chare-, and Soa-languages are closed under projection.

The proof was omitted for space reasons.

The main approach in the present paper (as well as in [3]) is converting Soas into Sores or Chares. During this process, it is occasionally convenient to work with a model that can be viewed as an intermediary step between a Soa and a regular expression.

Definition 12. A generalized single occurrence automaton (or generalized Soa) is a finite graph A = (V, E) such that (1) {src, snk} ⊆ V, and all vertices in V \ {src, snk} are pairwise alphabet-disjoint Sores; and (2) the edge relation E is such that src has only outgoing edges, snk has only incoming edges, and every v ∈ V lies on a path from src to snk. The relations →_A, →*_A, →⁺_A on V are defined analogously to (non-generalized) Soas. We extend alph to generalized Soas by defining alph(A) := ⋃_{v ∈ V \ {src, snk}} alph(v). The language L(A) is defined to be the set of all w ∈ alph(A)* for which there exist an n ≥ 0, nodes v_1, ..., v_n ∈ V \ {src, snk}, and words w_1, ..., w_n ∈ alph(A)* such that src →_A v_1 →_A ⋯ →_A v_n →_A snk, w = w_1 ⋯ w_n, and w_i ∈ L(v_i) holds for every 1 ≤ i ≤ n.

Note that generalized Soas accept the same class of languages as Soas.
2.2 Descriptivity
This section introduces the notion of descriptive expressions and automata, which is one of the central aspects of the present paper.

Definition 13. Let D be a class of regular expressions or finite automata over some alphabet Σ. A δ ∈ D is called D-descriptive of a non-empty language S ⊆ Σ* if L(δ) ⊇ S, and there is no γ ∈ D such that L(δ) ⊃ L(γ) ⊇ S.

In other words, an expression or automaton that is D-descriptive of a language S generates a language that is a generalization of S that is ⊆-minimal within the languages described by elements of D. If the class D is clear from the context, we simply write descriptive instead of D-descriptive. As stated in [8] (using quite different terminology), for every finite language S, SOA(S) is Soa-descriptive of S. This extends to infinite languages as well; for Sores and Chares, we can also prove the existence of descriptive regular expressions:

Proposition 14. Let Σ be a finite alphabet. For every language L ⊆ Σ*, SOA(L) is Soa-descriptive of L, and there exist a Sore-descriptive Sore δ_s and a Chare-descriptive Chare δ_c.

The proof was omitted for space reasons. In particular, this means that the algorithm 2T-INF from [3] that was mentioned in the previous section can be used to compute Soa-descriptive Soas for finite sample sets. Moreover, this shows that constructing a descriptive Soa for an arbitrary language L is merely as hard as computing the sets first(L), last(L), and gram_2(L). As we shall see, computing descriptive Sores or Chares is less straightforward. First, note that the first part of the proof of Proposition 14 implies the following observation:

Class | num. of languages | max. num. descriptive for a sample | max. num. edges to add for descr.
Chare | n!·2^{2n} ≤ c(n) ≤ n!·2^{3n} | n! | Θ(n^2)
Sore | n!·2^{3n − r log n} ≤ s(n) ≤ n!·2^{7n} | 2^n | Θ(n^2)
Soa | 2^{n^2 + O(n)} | |

Table 1: A summary of the numbers presented in Proposition 16. For each of the classes of languages generated by Chares, Sores, and Soas, the table lists the number of different languages in the class, the maximum number of descriptive expressions or automata for a given sample S ⊆ Σ*, and the maximum number of edges that need to be added to SOA(S) in order to obtain a Soa that corresponds to a descriptive Chare or Sore. In all cases, n denotes the size of Σ.

Corollary 15. Let Σ be a finite alphabet, and let L ⊆ Σ*. For every Sore (or Chare) δ that is Sore-descriptive (resp. Chare-descriptive) of L, L(δ) ⊇ L(SOA(L)) holds.

Hence, if some Sore (or Chare) is descriptive of a language L, it must be descriptive of L(SOA(L)) as well. This allows us to compute descriptive Sores and Chares not from a sample L, but from its Soa-approximation SOA(L). Furthermore, if L(SOA(L)) is not a Sore-language (or not a Chare-language), a Soa for some Sore descriptive of L can be obtained in principle from SOA(L) by adding new edges. As only a finite number of edges need to be added, and Soa-inclusion can be decided easily (cf. Lemma 9), the main question is whether this can be done efficiently. But as it can be necessary to add a substantial number of new edges in order to turn a Soa into a Soa that corresponds to a descriptive expression (see Proposition 16 just below), a brute force approach is probably not advisable.

The next proposition lists these and other numbers about counting and descriptive Sores and Chares. These results are summarized in Table 1. Recall that regular expressions are called equivalent if they accept the same language.

Proposition 16. Let n be the number of alphabet symbols. We have the following, for some constant r.
(1) The number of pairwise non-equivalent Chares is c(n) with n!·2^{2n} ≤ c(n) ≤ n!·2^{3n}.
(2) The number of pairwise non-equivalent Sores is s(n) with n!·2^{3n − r log n} ≤ s(n) ≤ n!·2^{7n} (see footnote 3).
(3) There is a sample S ⊆ Σ* such that S has 2^n pairwise non-equivalent descriptive Sores.
(4) There is a sample S ⊆ Σ* such that S has n! pairwise non-equivalent descriptive Chares.
(5) There is a Soa with Θ(n) edges such that a descriptive Sore with minimal number of edges in the corresponding Soa has Θ(n^2) edges.
(6) There is a Soa with Θ(n) edges such that a descriptive Chare with minimal number of edges in the corresponding Soa has Θ(n^2) edges.

The proof was omitted for space reasons. In particular, note that Proposition 16 also demonstrates that a given sample can have numerous different descriptive Sores (or Chares). Note that the number of different Chare- and Sore-languages can be better approximated using more advanced tools from combinatorics. Finally, if we are only interested in the number of different such languages modulo renaming of the terminal letters, then the same bounds without the factor n! hold.

Footnote 3: Note that [2, Proof of Theorem 3.] gives that any Sore-language has a Sore of length at most 0n 4, which gives a bound of 2^{O(n log n)}.

3. DESCRIPTIVITY VS. CRX AND RWR
Proposition 16 demonstrates that the number of non-equivalent descriptive Sores (or Chares) for a sample can be exponential in the size of the alphabet. Therefore, the present paper only examines the question how a single descriptive Sore (or Chare) can be found for a sample, instead of looking for an enumeration of all these expressions. As explained in Section 2.2 (in particular, Corollary 15), descriptive Chares and Sores can be obtained from the descriptive Soa, and moreover, for every language L and every Sore α that is descriptive of L, L(α) ⊇ L(SOA(L)) must hold.

This observation motivates our inference approach for Sores and Chares: Given a sample S, first compute the Soa-descriptive single occurrence automaton SOA(S), using 2T-INF. As explained in [8], this can be done in time O(ln), where l := Σ_{s ∈ S} |s|, and n := |alph(S)|. Using the algorithm Soa2Chare (Section 4) or Soa2Sore (Section 5), SOA(S) is then turned into a descriptive Chare or Sore (respectively). Before we discuss these algorithms and the respective proofs in detail, we observe that the algorithms CRX and RWR and its variants from [3] do not always compute descriptive Chares or Sores.
For the Chare-algorithm CRX, this is quite easy to see: As pointed out in [3] (as a remark after Theorem 35), on the sample S = {, e, e}, the algorithm CRX returns the Chare ????e?, while δ := ( )( e) is a better approximation of S. (In fact, we shall be able to see that δ is not only better, but Chare-descriptive. This can be verified by observing that δ is the output of Soa2Chare on SOA(S), and referring to Theorem 19 further down.) The proofs for the non-descriptivity of the Sore-algorithm RWR and its variants require more effort, and can be found in the following section.

3.1 RWR-Variants and Descriptivity
In this section we give theorems regarding properties of RWR-variants (we refer the reader to [3] for details on all variants). In particular, we show that every variant fails to find a descriptive Sore on some input. In [3, Algorithm 3] an algorithm RWR ("Rewrite with Repairs") was given to turn a given Soa (derived in the canonical way from an input sample) into a generalizing Sore, by rewriting the Soa step by step. This algorithm was proven in [3] to turn any Soa into an equivalent Sore, if existent; if, at some point in the run of RWR, no rewrite rules are applicable, the algorithm will make a generalization step by applying a repair rule. The four repair rules of RWR are as follows, given the current Soa A. For simplicity, we give a modification of the rules, where fewer edges are added. However, for the cases we use in this section, these rules are equivalent to the original set.

Repair r s: If there are two nodes r and s of A which share a successor or predecessor, add edges to A to make all successors of r or s successors of both r and s; similarly with the predecessors.
Repair r s?: If there are two nodes r and s of A such that r is the only predecessor of s, add edges to A to make all successors of r or s (except s) successors of both r and s.
Repair r? s: If there are two nodes r and s of A such that s is the only successor of r, add edges to A to make all predecessors of r or s (except r) predecessors of both r and s.
Repair r? s?: Let r and s be nodes of A such that s is a successor of r; add edges to A to make all successors of r or s successors of both r and s; similarly with the predecessors. Furthermore, for all predecessors u of r and all successors v of s, add an edge from u to v.

The authors of [3] prove that RWR (with the original repair rules) always terminates in O(n^5) steps (where n = |Σ|) and gives a Sore which generalizes the input Soa. They also suggest that these rules are checked for applicability in the given order, but admit that different situations might call for different rules (in particular, they note that the outcome of RWR is not always descriptive). Next, we formally show that RWR does not always return a descriptive Sore.

Theorem 17. For Σ a finite alphabet with |Σ| ≥ 3 and all orderings of the repair rules of RWR, there is a (finite) set of samples S ⊆ Σ* such that RWR on S produces a Sore which is not Sore-descriptive.

Proof. Let a, b, c ∈ Σ be three different symbols from Σ. First, consider the sample {, }. The corresponding Soa does not allow rewrite rules and requires a repair; below this Soa is depicted, along with two possible repairs, corresponding to the two possible repairs Repair and Repair ??.
The Soas resulting from the two repairs accept ( )⁺ and ( ), respectively, which is not descriptive of {, }, as witnessed by δ_1 := ((?))⁺ (a Sore which accepts the given sample and , but not, for example, , which is accepted by any of the Soas derived from the repair rules above).

Second, consider the sample S = {, , }. The corresponding Soa A is depicted as follows. A descriptive Sore for S is δ_2 := (( ))⁺, which we prove as follows. In comparison to A, the Soa that corresponds to δ_2 adds only a single edge, the edge from to . So the only possibility for a Sore-language L(γ) with L(A) ⊆ L(γ) ⊂ L(δ_2) is L(A) itself. However, L(A) is not a Sore-language, which can be seen, just as in Proposition 16, by applying either the Sore-construction algorithm RWR from [3] or our algorithm Soa2Sore from Section 5 (which both compute a Sore equivalent to a given Soa, if existent) and observing a strict generalization. Hence, δ_2 is Sore-descriptive of S. (We note without proof that (?)⁺? is another Sore that is descriptive of S. Necessarily, its language is incomparable to L(δ_2).)

An application of Repair ? on A and then, after rewriting, of Repair [?]? gives the following. This Soa corresponds to the Sore (??)⁺, and its language is a strict superset of L(δ_2) (for example is accepted by the former and not the latter). Deceiving the rule Repair r? s is symmetric to deceiving Repair r s?.

In [3], Bex et al. also propose a variant of RWR that is called RWR^2_l, which uses a natural number l as a branching parameter. The algorithm explores the (recursive) outcomes of the best l candidates for a repair rule, choosing the ones that lead to a minimal number of words of length at most 2n (= 2|Σ|) in the language accepted by the resulting Sore.

Theorem 18. For all l > 0 there is a finite alphabet Σ with |Σ| = 3l and a finite set of samples S ⊆ Σ* such that RWR^2_l on S produces a Sore which is not Sore-descriptive.

Proof. We first assume l = 1; consider again the sample {, , } with the following corresponding Soa. The three applicable repair rules are , ? and ? (plus some rules of the type r? s?, which explode the number of accepted words). This leads to the following Soas.

RegEx α | words of length ≤ 6 in L(α) | exp. growth basis | recurrence | base cases n ∈ {1, 2, 3}
( ) fα(n 2) 0, 2, 0
( ) f α(n 2) + f α(n 3) 0,,
( ) +? 5 ( + 5)/2.62 f α(n ) + f α(n 2), 3, 5

Table 2: Properties of the languages discussed in the proof of Theorem 18. For each regular expression α, f_α(n) denotes the number of words in L(α) of length n; given in the table are the number of words of length at most 6 accepted by α, the constant c such that f_α grows roughly as c^n, the recurrence relation for the (f_α(n))_{n ∈ N}, as well as f_α(n) for n ∈ {1, 2, 3}.

Table 2 gives an overview of the properties of these three Soas. Thus, we see that the second possibility accepts a minimal number of words of length at most 6 (= 2|Σ|), which means that only this option will be explored; the first and the third will be discarded. After rewriting by RWR, this results in the following Soa. The minimal repair for this results in (??)⁺, which is not descriptive as witnessed by (( ))⁺ as in the proof of Theorem 17. For l > 1, we use l independent copies of the sample used for l = 1 (i.e., using different alphabet symbols). Thus, RWR^2_l will fail on at least one of these copies.

4. DESCRIPTIVE CHARES
In this section, we give the first main algorithm of this paper, Soa2Chare, which efficiently computes descriptive Chares for given Soas.

4.1 The CHARE algorithm
The algorithm Soa2Chare uses a number of subroutines, which are written with dot-notation similar to some modern object oriented programming languages. For example, A.contract(U, l) denotes the application of the subroutine contract to the Soa A with parameters U and l. For a given Soa A, we let A.src and A.snk denote the source and the sink of A, respectively. The following subroutines are used in Soa2Chare.

contract on a Soa A takes a subset U of vertices of A and a label l. The procedure modifies A such that all vertices of U are contracted to a single vertex and labeled l (edges are moved accordingly).

constructLevelOrder on a Soa A = (V, E) assumes that A is acyclic and assigns a level number to every vertex v ∈ V, where the level number of a node v ∈ V is defined to be the length of the longest path from A.src to v.
Hence, A.src is on level number 0, and for every other node v, the level number is one more than the highest level number of the immediate predecessors of v.

isSkipLevel on a Soa A and a level number i returns true if level i is a skip level. A level i is a skip level if there exist nodes u, v ∈ V with (respective) level numbers j_u < i and j_v > i such that u →_A v. In other words, one can skip level i by transitioning from u to v.

Algorithm 1: Soa2Chare
 1  Input: SOA A = (V, E);
 2  while A has a cycle do
 3      Let U be a strongly connected looped component of A;
 4      A.contract(U, ALT(U)⁺);
 5  A.constructLevelOrder();
 6  result ← ε;
 7  for i = 1 to (level number of A.snk) − 1 do
 8      B ← all vertices with level number i and a ⁺;
 9      C ← all vertices with level number i and no ⁺;
10      foreach α ∈ B do
11          if A.isSkipLevel(i) or |B| + |C| > 1 then result ← result · α?;
12          else result ← result · α;
13      if |C| > 0 then
14          if A.isSkipLevel(i) or |B| > 0 then
15              result ← result · ALT(C)?;
16          else result ← result · ALT(C);
17  return result;

Note that the use of contract can turn the Soa into a generalized Soa. Intuitively speaking, the algorithm Soa2Chare works as follows: (1) Replace each strongly connected looped component A ⊆ V with a vertex that is labeled with the regular expression ALT(A)⁺. This turns A into a (possibly generalized) Soa that is a DAG. (2) Every node in the DAG is assigned a level number. (3) Every level is turned into one or more chain factors. If a level yields more than one chain factor (that is, it contains at least two nodes with a ⁺, or a node with a ⁺ together with at least one letter node), or if the level is a skip level, ? is appended to every chain factor on that level.

The following theorem states that Soa2Chare can be used to compute Chare-descriptive Chares in a highly efficient manner.

Theorem 19. For any given Soa A, Soa2Chare finds a Chare that is Chare-descriptive of L(A) in time O(m), where m is the number of transitions of A.
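The three steps of Soa2Chare can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the edge-set representation of a Soa, the names `soa2chare` and `alt`, the alphabetical ordering inside ALT, and the loop over the levels strictly between src and snk are choices made for this example. The simple reachability computation used to find the strongly connected looped components is also a simplification and does not achieve the O(m) bound of Theorem 19, which needs a linear-time component algorithm.

```python
from functools import lru_cache


def alt(letters):
    """ALT(A) as a string; letters are sorted to fix an ordering."""
    letters = sorted(letters)
    return letters[0] if len(letters) == 1 else "(" + "|".join(letters) + ")"


def soa2chare(edges):
    """Sketch of Soa2Chare on a Soa given as a set of (u, v) edges."""
    nodes = {x for e in edges for x in e}
    succ = {v: {b for a, b in edges if a == v} for v in nodes}

    def reach(u):
        """Nodes reachable from u via at least one edge."""
        seen, stack = set(), list(succ[u])
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(succ[x])
        return seen

    r = {v: reach(v) for v in nodes}
    # Step 1: map each node to its looped component, or to itself.
    comp = {v: frozenset(u for u in r[v] if v in r[u]) if v in r[v]
            else frozenset([v]) for v in nodes}
    looped = {comp[v] for v in nodes if v in r[v]}
    dag = {(comp[a], comp[b]) for a, b in edges if comp[a] != comp[b]}
    pred = {}
    for a, b in dag:
        pred.setdefault(b, set()).add(a)

    # Step 2: level = length of the longest path from src.
    @lru_cache(maxsize=None)
    def level(c):
        return 1 + max(level(p) for p in pred[c]) if c in pred else 0

    def is_skip(i):
        return any(level(a) < i < level(b) for a, b in dag)

    # Step 3: turn every level into chain factors.
    result = ""
    for i in range(1, level(comp["snk"])):
        here = [c for c in set(comp.values()) if level(c) == i]
        B = sorted(alt(c) + "+" for c in here if c in looped)
        C = {x for c in here if c not in looped for x in c}
        if is_skip(i) or len(B) + len(C) > 1:
            B = [f + "?" for f in B]
        result += "".join(B)
        if C:
            result += alt(C) + ("?" if is_skip(i) or B else "")
    return result


# Hypothetical Soa of the sample {"aacf", "bef", "af"}.
example = {("src", "a"), ("src", "b"), ("a", "a"), ("a", "c"), ("c", "f"),
           ("b", "e"), ("e", "f"), ("a", "f"), ("f", "snk")}
print(soa2chare(example))  # a+?b?(c|e)?f
```

In this hypothetical example, the self-loop on a is contracted to a+, the first level contains two nodes (so both chain factors become optional), and the edge from a to f makes the second level a skip level.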

Before we discuss the proof of Theorem 19 in Section 4.2, we illustrate the behavior of Soa2Chare with an example.

Example 20. Let S = {f, ef, f}. The corresponding Soa, SOA(S), is depicted as follows.

[Figure: SOA(S)]

First, Soa2Chare removes all cycles by contracting strongly connected looped components. This leads to the following generalized Soa.

[Figure: the generalized Soa after contraction]

Apart from the levels for A.src and A.snk, this generalized Soa has three levels: the first level with the nodes ( )+ and ()+, the second level with the nodes and e, and the third level with the node f. As there is an edge between ( )+ and f, the second level is a skip level. Thus, the levels lead to the respective Chares ( )+?()+?, (e )?, and f, which are concatenated to ( )+?()+?(e )?f. By Theorem 19, this Chare is Chare-descriptive of S.

4.2 Proof of Theorem 19

Proof. We first prove termination and running time, followed by the proof of correctness. Note that in this section, for simplicity's sake and in contrast to Section 5, we do not distinguish between a node and its label.

Termination and running time. Termination is obvious, as the two loops (in lines 2 and 7) are executed only a bounded number of times. Let n denote the number of vertices and m the number of edges in the input Soa. In the while-loop in line 2, the input Soa is transformed into an acyclic generalized Soa. Using Tarjan's algorithm (cf. [5]), this part can be realized in time O(m + n). Computing the level order and annotating, for each level, whether that level is a skip level, can also be done in time O(m + n), analogously to topological sorting. Finally, each node in the generalized Soa is turned into a chain factor. This takes time O(n). Hence, the individual steps sum up to a time of O(m + n), which results in a total time of O(m), as n ≤ m holds by definition.

Correctness. First, it is quite easy to see that Soa2Chare computes a Chare. Note that, in order to prove that this Chare is descriptive of the sample S, we do not need to argue about every Chare γ with L(γ) ⊇ S, but only about those with L(Soa2Chare(SOA(S))) ⊇ L(γ) ⊇ S.
This allows us to use Lemma 10 from two directions: on the one hand, every edge (and hence, every path) that is present in SOA(S) must be present in SOA(γ); on the other hand, SOA(γ) must not contain any edges that do not occur in SOA(δ).

Before we consider the main part of the proof, we first develop some technical tools that deal with strongly connected looped components.

Lemma 21. Let α be a Chare. A set A ⊆ alph(α) is a strongly connected looped component in SOA(α) if and only if α contains a chain factor of the form ALT(A)+ or ALT(A)+?.

The proof was omitted for space reasons. As Soa2Chare turns every strongly connected looped component A into a chain factor ALT(A)+ or ALT(A)+?, we observe that Soa2Chare does not change these components.

Corollary 22. Let Σ be an alphabet. For every finite and nonempty set S ⊆ Σ*, and every set A ⊆ alph(S), the following holds. A is a strongly connected looped component in SOA(S) if and only if A is a strongly connected looped component in SOA(Soa2Chare(SOA(S))).

Finally, according to Lemma 10, this immediately leads to the following observation:

Corollary 23. Let S ⊆ Σ* be a finite set, and let δ := Soa2Chare(SOA(S)). For every Chare γ with L(δ) ⊇ L(γ) ⊇ S, SOA(γ) must contain exactly the same strongly connected looped components as SOA(S) and SOA(δ).

We now possess all the tools we need to execute the main element of the proof of correctness of Soa2Chare.

Lemma 24. Let Σ be an alphabet, let S ⊆ Σ* be a nonempty set, and let δ := Soa2Chare(SOA(S)). Then L(δ) = L(γ) holds for every Chare γ with L(δ) ⊇ L(γ) ⊇ S.

The proof was omitted for space reasons. Lemma 24 implies that there is no Chare γ such that L(Soa2Chare(SOA(S))) ⊃ L(γ) ⊇ S. As we have, by definition, L(Soa2Chare(SOA(S))) ⊇ S, we get that the result of Soa2Chare on SOA(S) is Chare-descriptive of S, which concludes the proof of correctness.

5. DESCRIPTIVE SORES

In this section, we give the second main algorithm of this paper, which efficiently computes descriptive Sores for given Soas.

5.1 SORE Algorithm

As in Section 4, we use dot-notation to denote the application of subroutines, and for a given Soa A, we let A.src and A.snk denote the source and the sink of A, respectively.
contract on a Soa A takes a subset U of vertices of A and a label l. The procedure modifies A such that all vertices of U are contracted to a single vertex and labeled l (edges are moved accordingly). The procedure returns the newly created vertex.

extract on a Soa A takes as argument a set of vertices U (of A); it does not modify A, but returns a new Soa with copies of all vertices of U as well as two new vertices for source and sink; all edges between vertices of U are copied, all vertices in U having an incoming edge (in A) from outside of U now have an incoming edge from the new source, and all vertices in U having an outgoing edge (in A) to outside of U now have an outgoing edge to the new sink.

first returns all vertices v such that the only predecessor of v is the source.

addEpsilon on a Soa A adds a new vertex labeled ε; all outgoing edges from the source to vertices that have more than one predecessor (vertices that are not in the first-set) are redirected via this new vertex.

exclusive on a Soa A on argument v (a vertex of A) returns the set of all vertices u such that, on any path from the source to the sink that visits u, v is necessarily visited previously. Intuitively, the exclusive set of a vertex v is the set of all vertices exclusively reachable from v, not from any other vertex incomparable to v.

Finally, the most difficult subroutine is called bend and is used to prepare the treatment of strongly connected looped components of the input Soa A. First, it computes the set W of all vertices reachable from the source without passing through (or ending with) vertices which are predecessors of the sink. Then it redirects (bends) all transitions directed from an element outside of W to a successor of the source to point to the sink instead. In other words, we redirect all transitions from an element a to an element b ∈ A.succ(A.src) to now transition to A.snk iff for all words u such that ua is a path in A, we have that ua contains an element of A.pred(A.snk). In particular, all elements of A.pred(A.snk) do not transition to elements from A.succ(A.src). See Example 26 for an illustration.

Furthermore, we use the following three subroutines for the creation of labels.

plus on a label l returns (l)+.

concatenate on labels l and l′ returns the concatenation of l and l′.

or on labels l and l′ returns the alternation of l and l′.

The algorithm Soa2Sore is given in Algorithm 2. On a more intuitive level, the algorithm performs the following phases.

(1) Recurse on all strongly connected looped components; replace each with a vertex, labeled with the result of the recursion.
(2) After the Soa is a directed acyclic graph (DAG), focus on the set F of all vertices which can be reached from the source directly, but not via other vertices; make sure that there are no vertices which can be reached directly and via other vertices (if necessary, add an auxiliary node labeled ε).
(3) Recurse on the sets of vertices exclusively reachable from a vertex in F and contract these sets to vertices labeled with the result of the recursion.
(4) Combine vertices of F with or, recurse again on what is exclusively reachable from this new vertex.
(5) Once only one item is left in F, split it off and recurse on the remainder.

Algorithm 2: Soa2Sore
 1  Input: Soa A = (V, E);
 2  Output: Sore minimally generalizing L(A);
 3  if |V| = 2 then return ε;
 4  else if A has a cycle then
 5      Let U be a strongly connected looped component of A;
 6      B₀ ← A.extract(U).bend();
 7      A.contract(U, plus(Soa2Sore(B₀)));
 8  else if A.succ(A.src) ≠ A.first() then
 9      A.addEpsilon();
10  else if |A.first()| = 1 then
11      Let v be the only successor of src;
12      U ← V \ {A.src, v, A.snk};
13      l ← v.label();
14      l′ ← Soa2Sore(A.extract(U));
15      return concatenate(l, l′);
16  else if ∃v ∈ A.first(): A.exclusive(v) ≠ {v} then
17      Let v be such that A.exclusive(v) ≠ {v};
18      U ← A.exclusive(v);
19      A.contract(U, Soa2Sore(A.extract(U)));
20  else
21      Let u, v ∈ A.first() with u ≠ v s.t. A.reach(u) ∩ A.reach(v) is ⊆-maximal;
22      A.contract({u, v}, or(u.label(), v.label()));
23  return Soa2Sore(A);

Note that the algorithm introduces ? by way of constructing an or with ε. This can be cleaned up by postprocessing the resulting Sore. The following theorem states the correctness and the running time of the algorithm.

Theorem 25. The algorithm Soa2Sore, given a Soa A as input, finds a descriptive Sore for L(A) in time O(nm), where n is the number of alphabet symbols used in A, and m is the number of transitions in A. Furthermore, this algorithm produces a Sore such that the corresponding Soa has the same strongly connected components as the input Soa, and the same set of successors of the source.

Before we get to the proof of Theorem 25, we give two examples of Soa2Sore. The first example illustrates how strongly connected looped components are treated. The second illustrates the use of exclusive.
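The subroutines first and exclusive can be sketched directly from their descriptions. The Python below is a minimal illustration under stated assumptions: the Soa is acyclic and given as a successor dictionary, every vertex is reachable from the source, "src" and "snk" are hypothetical vertex names, and source and sink are kept out of the returned sets. It recomputes reachability for each call, so it does not match the annotation-based running time used in the proof of Theorem 25.

```python
SRC, SNK = "src", "snk"  # hypothetical names for source and sink


def predecessors(succ):
    """Invert the successor dictionary."""
    pred = {}
    for u, vs in succ.items():
        for v in vs:
            pred.setdefault(v, set()).add(u)
    return pred


def first(succ):
    """All vertices whose only predecessor is the source."""
    pred = predecessors(succ)
    return {v for v, ps in pred.items() if ps == {SRC} and v != SNK}


def exclusive(succ, v):
    """v together with every vertex that cannot be reached from SRC once v is deleted."""
    seen, stack = {SRC}, [SRC]
    while stack:
        u = stack.pop()
        for w in succ.get(u, ()):
            if w != v and w not in seen:
                seen.add(w)
                stack.append(w)
    everything = set(succ) | {w for ws in succ.values() for w in ws}
    return (everything - seen - {SNK}) | {v}


# Hypothetical acyclic Soa for the sample {"ac", "b"}.
example = {SRC: ["a", "b"], "a": ["c"], "b": [SNK], "c": [SNK]}
print(sorted(first(example)))           # ['a', 'b']
print(sorted(exclusive(example, "a")))  # ['a', 'c']
print(sorted(exclusive(example, "b")))  # ['b']
```

Here exclusive("a") differs from {"a"}, so the clause in line 16 of Algorithm 2 would apply to this input.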
Example 26. Consider the following Soa.

[Figure: the input Soa]

The labeled vertices of this Soa consist of a single strongly connected looped component, and application of bend computes the set W = {, }, which leads to the following Soa.

[Figure: the Soa after applying bend]

After resolving the strongly connected looped component containing and (all others are not "looped") and contract, we get the following.

[Figure: the Soa after contraction, with a vertex labeled ()+]

We can split off the first node twice now (as line 10 applies twice), recursing finally on the remaining Soa as follows.

[Figure: the remaining Soa after addEpsilon]

This results in ε, or, equivalently, ?. Going back through the recursions, we get (()+?)+.

Example 27. Consider now the following Soa.

[Figure: the input Soa]

For this Soa, line 16 applies and recurses on the upper arc; after contraction, this gives

[Figure: the Soa after contraction]

which results in () as desired (no generalizations were made).

5.2 Proof of Theorem 25

In this section we are concerned with proving Theorem 25. We start with a lemma which is used in its proof.

Lemma 28. There is a function f on Sores such that, for each Sore α, L(f(α)+) = L(α+) \ {ε} and, for all a ∈ α.succ(α.src) and b ∈ α.pred(α.snk), there is no transition from b to a in f(α).

The proof was omitted for space reasons. We are now ready to prove Theorem 25.

Proof. Let a Soa A be given. We proceed by first reasoning about termination and running time. After that, we will inductively show correctness, by assuming all recursive calls to be correct.

Termination and running time. We refer to [5] for standard graph algorithms, such as finding strongly connected (looped) components. As the algorithm never introduces self-loops, it is easy to see that the running time on a Soa A is at most the running time on A with all self-loops removed plus n. Thus, it suffices to show that Soa2Sore has a running time of O(nm) on self-loop free Soas.

We first bound the running time on acyclic Soas. We topologically sort the vertices of A (this takes O(m) time). We will now iteratively construct an annotation of all the vertices of G with subsets of A.first(), corresponding to what vertices they are reachable from. We start by annotating each vertex of G that corresponds to a vertex v ∈ A.first() with {v} and all others with ∅ (in time O(n)). We now iterate through all vertices u from first to last in the topological sort of G and, for each successor w of u, we add to the current annotation of w the annotation of u (assuming unit time for this kind of set operation; overall, this will then take O(m) time).
This results in the desired annotation of A, in a total of O(m) time. Extracting the exclusive sets for all elements of A.first() can now be done in O(m) time. From these annotations we can also find a pair of vertices with ⊆-maximal reach-sets in time O(m). Any two additions of ε-nodes are balanced in between by a splitting off of a starting node, as given in line 10. As for all other operations, the algorithm can make at most n contractions; hence, there can be only O(n) recursive calls. This results in an overall time of O(nm) for acyclic Soas.

We now turn to the general case. Finding strongly connected looped components takes time O(m), using well-known algorithms, for example Tarjan's algorithm. Soa2Sore first recurses on all strongly connected looped components, and then on the directed acyclic graph obtained by contracting all strongly connected looped components. The bend operation on a strongly connected looped component splits this component, as no vertex linked to the sink can now reach any of the elements of the first set. The running time is maximized when the recursions are as unbalanced as possible; this happens when each bend operation splits off only one vertex, and the remaining Soa is still strongly connected. This results in splitting off n times, with a time of O(m) for finding strongly connected looped components each time, plus the final work on acyclic Soas. This shows that the overall running time is O(nm).

Correctness. The statements about strongly connected components and the successors of the source are straightforward. Furthermore, it is clear that the result is a Sore. We show the following statement about Soa2Sore by induction (by termination, we assume the induction hypothesis to hold for all recursive calls).

Let a generalized Soa A be given, let A′ be a copy of the structure of A where all labels are replaced with single distinct symbols. Suppose that the claim holds for all recursive calls that Soa2Sore makes on A. Let δ = Soa2Sore(A) and let γ be a Sore such that L(A) ⊆ L(γ) ⊆ L(δ).

We distinguish a number of different cases, depending on which clause was used for Soa2Sore(A). We will show L(δ) = L(γ) in each case.

Case 1: The clause in line 3 was used. This case is trivial.

Case 2: The clause in line 4 was used.
Let U be as chosen in line 5. Let B₀ = A.extract(U).bend(); let z be a symbol not in alph(A) and B₁ = A.contract(U, z). Let δ̂₀ = Soa2Sore(B₀) and let δ₀ = δ̂₀+. We let δ₁ be Soa2Sore(B₁).

Let T be the syntax tree of γ. For each vertex x of T, we call x plussed iff inserting + in T at x does not change the language accepted by T.

Claim 1. There is a plussed vertex x in T such that, for the subtree γ₀ rooted at x, we have alph(γ₀) = alph(U).

The proof was omitted for space reasons. Let f be as shown existent in Lemma 28, and let x be the plussed vertex highest up in T such that alph(x) = U. Let γ̂₀ be the subtree of γ rooted at x; let γ₁ be derived from γ by substituting the subtree at x with a leaf labeled z if ε ∉ L(γ̂₀), and with the alternation of z and ε otherwise. Let γ₀ = f(γ̂₀). Clearly, it suffices to show that L(γ₀) = L(δ₀) and L(γ₁) = L(δ₁).

Claim 2. L(B₁) ⊆ L(γ₁) ⊆ L(δ₁).

Proof of Claim 2. In order to avoid unnecessary case distinctions, we first introduce two new and distinct terminal symbols, one used as a word-start symbol and the other as a word-end symbol. To this end, we define γ′ as γ enclosed by these two symbols (δ′, δ₁′, and γ₁′ are defined analogously). In addition to this, we define a Soa B₁′ whose language is L(B₁) enclosed by the two symbols, and a Soa B′ whose language is L(B) enclosed by the two symbols. (This is easily done by inserting new nodes, labeled with the start or end symbol, between the source and its successors, or the sink and its predecessors, respectively.) We first prove L(B₁′) ⊆ L(γ₁′) ⊆ L(δ₁′). After this is established, the claim follows by observing that projection preserves inclusion.

L(B₁′) ⊆ L(γ₁′): Let a, b ∈ alph(B₁′) \ {z} and suppose a →_{B₁′} b. We have a →_{B′} b, and, hence, a →_{γ′} b. From the definition of γ₁′ it is now easy to see that a →_{γ₁′} b. Let a ∈ alph(B₁′) \ {z} and suppose a →_{B₁′} z. Thus, there is b ∈ U such that a →_{B′} b, and, hence, a →_{γ′} b. From the definition of γ₁′ it is now easy to see that a →_{γ₁′} z. Let b ∈ alph(B₁′) \ {z} and suppose z →_{B₁′} b. Thus, there is an a ∈ U such that a →_{B′} b, and, hence, a →_{γ′} b. From the definition of γ₁′ it is now easy to see that z →_{γ₁′} b.

L(γ₁′) ⊆ L(δ₁′): Let a, b ∈ alph(γ₁′) \ {z} and suppose a →_{γ₁′} b. From the definition of γ₁′ it is now easy to see that a →_{γ′} b, and, hence, a →_{δ′} b. Thus, we get a →_{δ₁′} b. Let a ∈ alph(γ₁′) \ {z} and suppose a →_{γ₁′} z. Thus, there is b ∈ U such that a →_{γ′} b, and, hence, a →_{δ′} b. We have now a →_{δ₁′} z. Let b ∈ alph(γ₁′) \ {z} and suppose z →_{γ₁′} b. Thus, there is an a ∈ U such that a →_{γ′} b, and, hence, a →_{δ′} b. We have now z →_{δ₁′} b.

Hence, L(B₁′) ⊆ L(γ₁′) ⊆ L(δ₁′), which is equivalent to the same chain of inclusions for L(B₁), L(γ₁) and L(δ₁), each enclosed by the start and end symbols.
As inclusion is preserved under projection, this implies π_T(L(B₁′)) ⊆ π_T(L(γ₁′)) ⊆ π_T(L(δ₁′)), which proves the claim (for T := Σ without the start and end symbols). (End of proof of Claim 2.)

Thanks to the claim we can now apply the induction hypothesis to see that L(γ₁) = L(δ₁).

Similarly, we now show γ₀ and δ₀ to be equivalent by showing L(B₀) ⊆ L(γ₀) ⊆ L(δ₀). From the induction hypothesis we know that B₀.succ(B₀.src) = δ₀.succ(δ₀.src); this shows that γ₀.succ(γ₀.src) has to coincide with these sets.

Claim 3. We have that B₀.pred(B₀.snk) ⊆ γ₀.pred(γ₀.snk) ⊆ δ₀.pred(δ₀.snk).

The proof was omitted for space reasons. Lastly, we turn to pairs of elements from U.

Claim 4. On alph(U), →_{B₀} is a subrelation of →_{γ₀}, which in turn is a subrelation of →_{δ₀}.

Proof of Claim 4. This is straightforward, using the properties of f taken from Lemma 28. (End of proof of Claim 4.)

This finishes showing L(B₀) ⊆ L(γ₀) ⊆ L(δ₀); thus, using the induction hypothesis, L(γ₀) = L(δ₀). This finishes the reasoning for this case.

Case 3: The clause in line 8 was used. This case is trivial from the induction hypothesis, as the language is not changed by the addEpsilon() method.

Case 4: The clause in line 10 was used. Let v be the only successor of A.src; let a = v.label(). Note that a is the only successor of γ.src. Let U = alph(δ) \ {a}. As A does not have a strongly connected looped component, neither does SOA(γ); thus, we have L(γ) = a · π_U(L(γ)). Let γ′ equal γ with a replaced by ε and δ′ = Soa2Sore(A.extract(U)). Then we have L(A.extract(U)) ⊆ L(γ′) ⊆ L(δ′) and the claim follows by induction.

Case 5: The clause in line 16 was used. We now know that A is cycle free and, thus, δ does not contain +. Therefore, without loss of generality, γ does not contain + either. Let v be as chosen in line 17 and a = v.label(). Let U = A.exclusive(v). Let B₀ = A.extract(U); let z be a symbol not in alph(A) and B₁ = A.contract(U, z). Let δ₀ = Soa2Sore(B₀) and let δ₁ = Soa2Sore(B₁). By the induction hypothesis, we have that δ₀.first() = {a}. Thus, any word in L(γ) ⊆ L(δ) that contains an element of U has to start with an a.

Claim 5. There is a subtree γ₀ of γ such that alph(γ₀) = U.

The proof was omitted for space reasons.
Let γ₀ be a subtree of γ such that alph(γ₀) = U; let γ₁ be derived from γ by substituting γ₀ with a leaf labeled z. Note that ε ∉ L(γ₀) because of A.succ(A.src) = A.first(). We now clearly get L(B₀) ⊆ L(γ₀) ⊆ L(δ₀) and L(B₁) ⊆ L(γ₁) ⊆ L(δ₁). Thus, this case follows from the induction hypothesis, similarly to Case 2.

Case 6: The clause in line 20 was used. In this case we know that |A.first()| > 1, as no other case applies. Furthermore, we will use without mention that A is cycle free. Let u, v be as chosen in line 21. Let z be a symbol not in alph(A). Let B = A.contract({u, v}, z). Let δ₀ be Soa2Sore(B). Let a = u.label() and b = v.label(). From u, v ∈ δ.first() we have that there is a subtree β of γ with or at the root such that a and b are in different child trees.

Claim 6. L(β) is a set of letters.

The proof was omitted for space reasons. From the claim we get, without loss of generality, that the alternation of a and b is a subexpression of γ; thus, β is this alternation. Let γ₀ be derived from γ by substituting β with z. Clearly, we now have L(B) ⊆ L(γ₀) ⊆ L(δ₀). From the induction hypothesis we get L(γ₀) = L(δ₀); thus, L(γ) = L(δ).
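Both Soa2Chare and Soa2Sore take a Soa as input. As a minimal sketch of where such an input can come from, the following Python function builds SOA(S) from a finite sample in the 2-gram fashion commonly associated with 2T-INF: every letter becomes a vertex, every pair of consecutive letters becomes an edge, and the first and last letters of each word are connected to the source and the sink. The function name, the dictionary-of-sets representation and the vertex names "src" and "snk" are assumptions made for this illustration.

```python
SRC, SNK = "src", "snk"  # hypothetical names for source and sink


def soa_from_sample(sample):
    """Build the successor sets of SOA(S) from an iterable of words."""
    succ = {SRC: set()}
    for word in sample:
        prev = SRC
        for letter in word:
            succ.setdefault(prev, set()).add(letter)  # edge for the 2-gram (prev, letter)
            succ.setdefault(letter, set())
            prev = letter
        succ[prev].add(SNK)  # the last letter (or SRC for the empty word) leads to the sink
    return succ


soa = soa_from_sample(["ab", "b", "aab"])
for vertex in sorted(soa):
    print(vertex, "->", sorted(soa[vertex]))
# a -> ['a', 'b']
# b -> ['snk']
# src -> ['a', 'b']
```

The self-loop on a in this example makes {a} a strongly connected looped component, which is exactly the situation the contraction steps of both algorithms start from.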

6. CONCLUSIONS AND FURTHER WORK

This paper proposes a strategy for inferring descriptive Sores and descriptive Chares: first, use 2T-INF to compute a descriptive Soa, then use Soa2Sore or Soa2Chare to turn this automaton into a Sore or Chare. In [3], Bex et al. state that their schema inference algorithms outperform existing algorithms in accuracy, conciseness, and speed. Considering the results presented in Sections 3 to 5, the authors of the present paper feel confident to suggest that their new strategies outperform the algorithms from [3] at least with respect to both accuracy and speed. An experimental evaluation of the algorithms is planned for the near future. This will also give the opportunity to evaluate the quality of the results of the algorithms, for example with respect to different conciseness measures or how well they describe the target language.

We now discuss possible extensions, and possible directions for further work. In order to overcome the problem that Sores and Chares cannot count (beyond the trivial case of distinguishing between 0 and 1), Bex et al. [3] propose extending those models with numerical predicates, which can be obtained by postprocessing. It is easy to see that this extension can also be adapted to the approaches in the present paper.

If one is willing to fix a probability distribution on the sample, the learning algorithms could be adapted to feature a variant of stochastic finite learning (introduced by Rossmanith and Zeugmann [13]). This could lead to inference algorithms that do not need to process the whole input, which might be interesting for very large datasets.

From the authors' point of view, the following problem is probably the most interesting: In [2], Bex et al. examine the inference of k-occurrence regular expressions (short k-Ores); regular expressions where each terminal letter occurs at most k times. (Hence, Sores are 1-Ores.) Is it possible to extend Soa2Sore to deterministic k-Ores for some k ≥ 2, or Soa2Chare to the corresponding extension of Chares (where letters are allowed to occur up to k times)? It seems that one would need to develop not only a good generalization of Soas, but also a good inclusion criterion, preferably syntactic.
This conjecture is based on the following observation: While the results in the present paper make no direct use of the results and techniques that Freydenberger and Reidenbach [7] developed for descriptive generalization of pattern languages, both papers rely heavily on the fact that the inclusion problem for the respective language classes has a syntactic criterion for inclusion. The proofs on descriptive generalization of pattern languages in [7] rely on the fact that inclusion for terminal-free E-pattern languages is characterized by the existence of a morphism which maps the pattern that generates the superlanguage to the pattern that generates the sublanguage. This criterion is a versatile tool to prove the nonexistence of a (pattern) language between the target language and the language of a descriptive pattern. While the proofs of the present paper cannot make any direct use of the proofs from [7], the approaches are similar conceptually. In particular, the line of reasoning in which the correctness proofs of Soa2Chare and Soa2Sore use the fact that the inclusion problem for Sores (and Chares) is characterized by the covering of the respective Soas is structurally similar to the proofs for pattern languages. Moreover, although deciding whether such a pattern morphism exists is NP-complete, the techniques in [7] are not affected by the computational hardness. Hence, the hardness results on the decidability of the k-Ore-inclusion problem presented by Martens et al. [10] do not exclude the existence of such a criterion. This leaves room for hope that Soa2Sore can be extended to k-Ores with k ≥ 2.

Acknowledgements

The authors wish to thank the anonymous referees for their helpful remarks.

7. REFERENCES

[1] D. Angluin. Finding patterns common to a set of strings. Journal of Computer and System Sciences, 21(1):46–62, 1980.
[2] G. J. Bex, W. Gelade, F. Neven, and S. Vansummeren. Learning deterministic regular expressions for the inference of schemas from XML data. ACM Transactions on the Web, 4(4):14:1–14:32, 2010.
[3] G. J. Bex, F. Neven, T. Schwentick, and S. Vansummeren. Inference of concise regular expressions and DTDs. ACM Transactions on Database Systems, 35(2):11:1–11:47, 2010.
[4] G. J. Bex, F. Neven, and S. Vansummeren. Inferring XML schema definitions from XML data. In Proc. VLDB 2007.
[5] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. McGraw-Hill, 2nd edition, 2001.
[6] H. Fernau. Algorithms for learning regular expressions from positive data. Information and Computation, 207(4):521–541.
[7] D. D. Freydenberger and D. Reidenbach. Inferring descriptive generalisations of formal languages. In Proc. COLT 2010, 2010.
[8] P. García and E. Vidal. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(9), 1990.
[9] E. M. Gold. Language identification in the limit. Information and Control, 10(5), 1967.
[10] W. Martens, F. Neven, and T. Schwentick. Complexity of decision problems for XML schemas and chain regular expressions. SIAM Journal on Computing, 39(4).
[11] W. Martens, F. Neven, T. Schwentick, and G. J. Bex. Expressiveness and complexity of XML schema. ACM Transactions on Database Systems, 31(3):770–813.
[12] Y. K. Ng and T. Shinohara. Developments from enquiries into the learnability of the pattern languages from positive data. Theoretical Computer Science, 397(1–3):150–165.
[13] P. Rossmanith and T. Zeugmann. Stochastic finite learning of the pattern languages. Machine Learning, 44(1–2):67–91, 2001.
