Automata-based Pattern Mining from Imperfect Traces

Giles Reger
University of Manchester, Oxford Road, M13 9PL, Manchester, UK
regerg@cs.man.ac.uk

Howard Barringer
University of Manchester, Oxford Road, M13 9PL, Manchester, UK
howard@cs.man.ac.uk

David Rydeheard
University of Manchester, Oxford Road, M13 9PL, Manchester, UK
david@cs.man.ac.uk

ABSTRACT

This paper considers automata-based pattern mining techniques for extracting specifications from runtime traces and suggests a novel extension that allows these techniques to work with so-called imperfect traces, i.e. traces that do not exactly satisfy the intended specification of the system that produced them. We show that by taking the so-called edit-distance between an input trace and the language of a pattern we can extract specifications from imperfect traces and identify the parts of an input trace that do not satisfy the mined specification, thus aiding the identification and location of errors in programs.

Keywords

Pattern Mining, Specification Mining

1. INTRODUCTION

Formal program specifications are useful for a number of activities but they are often missing or incomplete. The field of specification mining [13, 18] aims to automatically construct formal program specifications from program artifacts. In this work we consider techniques that operate on program traces, i.e. finite sequences of events that occur whilst a program is running.

One such approach [19, 5, 6] uses a set of template patterns to detect predefined behaviours and then combines these together to form a specification. These regular patterns are described via automata, which allows for efficient checking. For example, consider the following pattern over the metasymbols a and b, where shaded states are accepting states.

[Figure: automaton for the alternating pattern over the metasymbols a and b]

We can apply this pattern to the following trace by considering six instantiations, with each pair of symbols in the trace instantiating the pattern.

connect.open.close

We detect three patterns: (1) [a ↦ connect, b ↦ open], (2) [a ↦ connect, b ↦ close] and (3) [a ↦ open, b ↦ close]. These can then be combined to form a larger specification:

[Figure: combined three-state specification over connect, open and close]

This general approach can be used with different patterns and methods of pattern combination.
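The introductory example can be sketched in a few lines of Python. This is only an illustration: the dictionary encoding of the automaton, the function names `holds` and `detect`, and the reading of the example trace as connect.open.close are assumptions made here, not part of the original presentation.

```python
from itertools import permutations

# Assumed encoding of the alternating pattern (ab)*: state 1 is initial and accepting.
ALTERNATING = {
    "initial": 1,
    "accepting": {1},
    "delta": {(1, "a"): 2, (2, "b"): 1},
}

def holds(pattern, mapping, trace):
    """Check one instantiation: skip irrelevant events, then run the automaton."""
    inverse = {event: meta for meta, event in mapping.items()}
    state = pattern["initial"]
    for event in trace:
        if event not in inverse:          # projection: drop events outside the instantiation
            continue
        state = pattern["delta"].get((state, inverse[event]))
        if state is None:                 # no transition, so this instantiation fails
            return False
    return state in pattern["accepting"]

def detect(pattern, metasymbols, trace):
    """Return every instantiation (a tuple of events, one per metasymbol) that holds."""
    events = sorted(set(trace))
    found = []
    for choice in permutations(events, len(metasymbols)):
        if holds(pattern, dict(zip(metasymbols, choice)), trace):
            found.append(choice)
    return found

trace = ["connect", "open", "close"]
print(detect(ALTERNATING, ["a", "b"], trace))
```

Running the same function on the sequence repeated many times and followed by only connect and open leaves a single detected instantiation, which is the drawback discussed next.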
However, it has a major drawback. Imagine if we had a trace with the above sequence repeated a thousand times followed by the two events connect and open, i.e. missing the final close. We would fail to detect the two patterns involving close and therefore not extract the above specification. The problem is that this approach assumes perfect traces, i.e. that the correct behaviour is contained within the given traces. This assumption is unrealistic: we would like to be able to deal with cases where there are small errors in traces. The notion is that a programming pattern may hold for the majority of a program but the program may contain one or two bugs. One approach [5] to dealing with this issue is to reset a pattern being checked to its initial state when an error occurs, but this technique would not detect the required patterns in our above example. Instead we want to be able to measure how closely a trace matches a pattern.

This paper presents an approach that extends the automata-based pattern mining approach to imperfect traces by considering so-called edit distances between a trace and a pattern's language. This work is motivated by the notion that automata-based pattern mining is a desirable approach to specification mining, but to make it applicable we need to deal with the issue of imperfect traces. The main advantage of the pattern-based approach is that we can use arbitrarily complex patterns, unlike techniques based on data mining, allowing us to develop more sound combination strategies (described in [17]) and consider further work (discussed in Section 9) that involves extensions to the pattern-mining approach that could not be applied to alternative approaches.

Structure. Section 2 formally introduces the concept of pattern checking and composition. Section 3 discusses methods for dealing with the imperfect traces problem and Sections 4, 5 and 6 present our proposed solutions. Section 7 presents two experiments and Section 8 discusses related work. Finally, we conclude in Section 9.

Part of this work was supported by the Engineering and Physical Sciences Research Council [grant number EP/P505208/1].

2. PATTERN CHECKING

In this section, we introduce a pattern checking framework by first describing how patterns are extracted from traces, then considering how this can be done efficiently and finally discussing how extracted patterns are combined.

2.1 Checking patterns

In this account, a pattern is a regular language over symbols, i.e. a set of traces (finite sequences) of symbols. We consider patterns as automata:

DEFINITION 1 (PATTERN). A pattern p = ⟨Q, Σ, δ, q0, F⟩ is an automaton where Q is a finite set of states, Σ is a finite alphabet of symbols, δ : Q × Σ → Q is a transition function, q0 ∈ Q is an initial state and F ⊆ Q is a set of accepting states. The language of a pattern, L(p), is the set of traces it accepts, i.e. τ ∈ L(p) iff there exists a path q0 →τ q with q ∈ F, where → is δ lifted to traces.

The process of checking a pattern against a trace considers all possible combinations of symbols in the trace as replacements for the pattern's current symbols. To replace a pattern's symbols we instantiate it.

DEFINITION 2 (INSTANTIATION). Given a pattern p and a map ϕ from p.Σ to Σ′, the instantiated pattern ϕ(p) has alphabet Σ′ and is the result of applying ϕ to every symbol in p.

The checking process then checks if each particular instantiation of the pattern holds on the trace. We say an instantiated pattern holds on a trace if the trace appears in the instantiated pattern's language after we remove irrelevant symbols. To remove irrelevant symbols we project the trace.

DEFINITION 3 (PROJECTION). The projection π_Σ(τ) of a trace τ over alphabet Σ is defined as τ with all elements not in Σ removed.

Therefore, the detected instantiated patterns are given as follows.

DEFINITION 4 (EXTRACTED PATTERNS). Given a pattern p and a trace τ the extracted patterns detect(p, τ) are

  {ϕ(p) | ∀a ∈ p.Σ : ϕ(a) ∈ τ ∧ π_{ϕ(p).Σ}(τ) ∈ L(ϕ(p))}

2.2 Checking patterns efficiently

We discuss two approaches that allow us to check patterns efficiently.

2.2.1 Checking many instantiations

For each pattern we need to check all possible instantiations. Typically we restrict this technique to patterns over 2 or 3 symbols (this does not restrict the number of symbols in the mined specification). We can then compute the extracted instantiated patterns for a pattern using a 2 or 3 dimensional grid of reached states; this approach was first used in [19]. For the introductory example the following matrix would represent the states reached in the pattern after checking the trace (- means failure).

[Matrix: states reached for each instantiation of a and b over connect, open and close]

The restriction of patterns to 2 or 3 symbols is for efficiency reasons, as this approach has space complexity O(n^m) and time complexity O(n^(m−1) |τ|) given an alphabet of size n and a pattern with m symbols. A more efficient symbolic approach using binary decision diagrams is explored in [6].

2.2.2 Checking many patterns

If we want to check multiple patterns we would currently need to repeat the above process multiple times, i.e. once for each pattern. However, given a set of patterns with the same set of symbols we can construct a pattern checker that checks all these patterns simultaneously by taking the union of the patterns and labelling states with the patterns that are accepting at that state. This approach was previously presented in [17].

DEFINITION 5 (PATTERN CHECKER). Given an alphabet of symbols Σ and a set of patterns p1, ..., pn over Σ let the pattern checker for these patterns be C(p1, ..., pn) = ⟨Q, Σ, δ_C, Γ⟩ where

  Q = p1.Q × ... × pn.Q
  δ_C(a, (q1, ..., qn)) = (p1.δ(a, q1), ..., pn.δ(a, qn))
  Γ((q1, ..., qn)) = {pi | qi ∈ pi.F}

The patterns detected by pattern checker C in trace τ are therefore

  C(τ) = {p | q0 →τ q ∧ p ∈ Γ(q)}

We can extend the notion of instantiation to pattern checkers and define extracted patterns for a pattern checker as follows.

DEFINITION 6 (PATTERN CHECKER EXTRACTED PATTERNS). Given a pattern checker C and a trace τ the extracted patterns detect(C, τ) are

  {p | ∃ϕ : ∀a ∈ p.Σ : ϕ(a) ∈ τ ∧ p ∈ ϕ(C)(π_{ϕ(p).Σ}(τ))}

For example, if we call the pattern in the introductory example p1 and call the following pattern p2

[Figure: pattern p2]

then the pattern checker for p1 and p2 would be

[Figure: pattern checker for p1 and p2, with states labelled {p1, p2}, {p2}, {p2}, {p2}]

where states are labelled using the output function Γ.

2.3 Combining patterns

The following is based on the technique introduced by Gabel and Su in [5]. Once we have extracted a set of patterns we can combine them together using standard automata intersection. However, this operation is only defined when two automata have the same alphabet. To give two automata the same alphabet we can expand them by placing self-looping transitions on each state for the missing symbols. For example, the three detected patterns from the introductory example become:

[Figure: the three detected patterns, each expanded with self-loops for its missing symbol]

The intersection of these three patterns is the specification given in the introduction. Formally, combination is defined as follows.

DEFINITION 7 (COMBINATION). Given a set of instantiated patterns p1, ..., pn with combined alphabet Σ, define their combination as

  combine(p1, ..., pn) = expand_{Σ\p1.Σ}(p1) ∩ ... ∩ expand_{Σ\pn.Σ}(pn)

where ∩ is automata intersection and expand_Σ is a function that adds self-looping transitions to a pattern for symbols in Σ.

We can either apply this combination operator directly or use it to define specific combination rules. To use combination directly we can saturate the set by repeated application or extract a specification for each alphabet of events in the trace by combining together patterns with the same alphabet. However, this might be costly and not all extracted patterns necessarily contain useful information. Therefore, an alternative approach (which we do not consider further here) is to develop specific combination rules for given patterns, as is done in [5]. For example, they introduce the following sequencing rule for the pattern in our introductory example, which we will represent by the regular expression (ab)*.

  (a L1 b)* ∧ (a L2 c)* ∧ (bc)* ⇒ (a L1 b L2 c)*

We applied this by taking L1 = L2 = ε. The use of L1 and L2 allows for repeated application of the rule. The closure of the set of extracted patterns with respect to a set of combination rules can then be computed.

3. DEALING WITH IMPERFECT TRACES

The previous framework will only detect a pattern if it matches exactly with an input trace. In this section we consider how it can be extended so that patterns are extracted if they match almost all of the input trace.

3.1 What are imperfect traces?

To say that a trace is imperfect we assume that there is an implicit specification that the program that produced the trace follows and there is some bug in the program that deviates from this specification. The process of specification mining is therefore to extract this implicit specification. Alternatively, the program might be correct but the trace recording process may be faulty. Either way, identifying a specification and the trace imperfections can aid debugging efforts.

We could view these imperfections as uniform noise. However, in the case of programming bugs, it is likely that these imperfections are introduced by common mistakes such as forgetting to close a resource or check a condition, or accidentally calling the wrong method. We can therefore think of imperfections as small edits that involve the removal, addition or substitution of events from a perfect trace.

3.2 The restart approach

Previous approaches (e.g. [6]) deal with imperfect traces by restarting the pattern and counting the number of such restarts. With small patterns such as the alternation pattern (p1 from before) this can be effective. Let us consider the following common 3-symbol resource usage pattern.

[Figure: resource usage pattern over the metasymbols a, b and c]

Consider checking the following (imperfect) trace for the instantiation [a ↦ open, b ↦ use, c ↦ close]. The checking would fail after the fifth event as an event is omitted.
..use.use..use...use.

If we restart here then we immediately fail again. Instead, we would like to detect that the event is missing and flag this as a potential bug. Note that our proposed approach is a generalisation of the restart approach: a restart corresponds to making a substitution edit for the missing symbol to reach the initial state.

3.3 Edit distance

As an alternative to the restart approach we consider replacing our previous condition that a trace must exactly match a pattern with the requirement that the edit-distance between the trace and any trace in the language of the pattern must be below some limit. The edit-distance we consider uses the following edit operations:

- inserting a new symbol;
- deleting an existing symbol;
- substituting an existing symbol for a new symbol.

The edit-distance between two traces is then given by the (minimum) number of edits that transform one trace into the other. This is sometimes called the Levenshtein distance [10]. Formally, this distance is given as follows.

DEFINITION 8 (LEVENSHTEIN DISTANCE). The Levenshtein distance between traces τ1 and τ2 is distance(τ1, τ2), defined as

\begin{align*}
\mathit{distance}(\tau_1, \epsilon) &= |\tau_1| \\
\mathit{distance}(\epsilon, \tau_2) &= |\tau_2| \\
\mathit{distance}(a\tau_1, b\tau_2) &= \min
  \begin{cases}
    \mathit{distance}(\tau_1, b\tau_2) + 1 \\
    \mathit{distance}(a\tau_1, \tau_2) + 1 \\
    \mathit{distance}(\tau_1, \tau_2) + 1 & \text{if } a \neq b \\
    \mathit{distance}(\tau_1, \tau_2) & \text{if } a = b
  \end{cases}
\end{align*}

We define an updated notion of extracted patterns using this metric.

DEFINITION 9 (IMPERFECT EXTRACTED PATTERNS). Given a pattern p, a trace τ and an integer γ > 0, which we call the tolerance, the imperfect extracted patterns imperfect_detect(p, τ, γ) are

  {ϕ(p) | ∀a ∈ p.Σ : ϕ(a) ∈ τ ∧ ∃τ′ ∈ L(ϕ(p)) : distance(τ′, π_{ϕ(p).Σ}(τ)) < γ}

We extend this definition for pattern checkers as we did before (Sec. 2.2.2).

3.4 Detecting bugs

So far our approach has been abstract, considering traces of symbols generated by a program. But our motivation has been to extract specifications that allow us to detect potential bugs. To do so we need to be able to access information about the part of a program that generates a trace. We assume this is contained in a program trace.

DEFINITION 10 (PROGRAM TRACE).
A program trace is a finite sequence of pairs of the form (code_point, event) where code_point identifies the point in the program that generates the event.

It is easy to extend our previous constructions to work on these program traces by ignoring the code point information. Our goal is to identify points in the program trace that should be edited for a mined specification to hold. These edits will follow those described above, i.e. the removal of an event, the addition of an event between two existing events or the replacement of one event with another. The solutions we describe in the following two sections will produce rewrites.

DEFINITION 11 (REWRITE). A rewrite ρ is a finite sequence of indexes and rewrite operations that can be applied to a program trace to produce an edited version. At index i we write deletion as (i, −), insertion of event a as (i, +a) and substitution with event a as (i, %a).

A rewrite can then be used to identify the code points that may contain bugs, and suggest potential solutions, i.e. edits.

4. EDITING ON FAILURE

We first consider an approach that does not use the true edit-distance, but introduces a new restart operation inspired by the metric. The idea is to introduce edit operations only when a trace fails to match a pattern.

4.1 Failing-edit-distance

In the following we introduce an alternative formulation of the edit-distance that only applies edits when we fail. We say a pattern fails for a trace if no extensions of the trace can satisfy the pattern.
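The notion of failure above amounts to reaching a state from which no accepting state is reachable. A minimal sketch of how such states could be computed is given below. The function name `failing_states`, the dictionary encoding of δ, the explicit sink state 0, and the reading of the resource usage pattern as (a b* c)* with only state 1 accepting are all assumptions made for illustration.

```python
def failing_states(states, delta, accepting):
    """Return the states from which no accepting state can be reached."""
    # Build reverse edges, then search backwards from the accepting states.
    reverse = {q: set() for q in states}
    for (source, _symbol), target in delta.items():
        reverse[target].add(source)
    alive = set(accepting)
    frontier = list(accepting)
    while frontier:
        q = frontier.pop()
        for predecessor in reverse[q]:
            if predecessor not in alive:
                alive.add(predecessor)
                frontier.append(predecessor)
    return set(states) - alive

# Assumed resource usage pattern (a b* c)*, completed with an explicit sink state 0.
STATES = {0, 1, 2}
DELTA = {
    (1, "a"): 2, (1, "b"): 0, (1, "c"): 0,
    (2, "a"): 0, (2, "b"): 2, (2, "c"): 1,
    (0, "a"): 0, (0, "b"): 0, (0, "c"): 0,
}
print(failing_states(STATES, DELTA, {1}))
```

Only the sink state is failing here, so a trace fails exactly when it takes a transition into the sink.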

Algorithm 1 Computing the failing-edit-distance with tolerance γ for pattern p = ⟨Q, Σ, δ, q0, F⟩ and trace τ.

  C ← {⟨[], q0⟩}
  for i in 1 to |τ| do
    a ← τ(i)
    C′ ← {}
    for ⟨ρ, q⟩ in C do
      q′ ← δ(q, a)
      if failing(q′) then
        if |ρ| < γ then
          C′ ← C′ ∪ {⟨(i, −).ρ, q⟩}
          C′ ← C′ ∪ {⟨(i, +b).ρ, δ(δ(q, b), a)⟩ | b ∈ Σ}
          C′ ← C′ ∪ {⟨(i, %b).ρ, δ(q, b)⟩ | b ∈ Σ}
      else
        C′ ← C′ ∪ {⟨ρ, q′⟩}
    C ← {⟨ρ, q⟩ ∈ C′ | ¬failing(q)}
  return min({|ρ| | ⟨ρ, q⟩ ∈ C ∧ q ∈ F})

For pattern p and trace τ let τ = good(τ).a.rest(τ) where good(τ) is the longest prefix of τ such that there exists a trace τ′ such that good(τ).τ′ ∈ L(p) but for all traces τ′′ we have good(τ).a.τ′′ ∉ L(p). Let edit be a function on symbols that non-deterministically replaces the symbol by the empty trace, by a trace consisting of another symbol from the trace followed by the original symbol, or by another symbol in the trace, i.e. it can pick one of the three edit operations discussed above. An edited trace is defined recursively as

  edited(τ) = edited(good(τ).edit(a).rest(τ))   if τ ∉ L(p)
  edited(τ) = τ                                 otherwise

i.e. the repeated application of the edit function to the event causing failure. As edit is non-deterministic, the failing-edit-distance is given as the minimum number of times the edited function must be applied to a trace. This is still an edit-distance, but not necessarily minimal.

4.2 Computing the failing-edit-distance

To compute the failing-edit-distance we explore the non-deterministic edit operations by maintaining a number of possible configurations of the instantiated pattern. A configuration is a pair consisting of a rewrite (Def. 11) and a state. We say that a trace τ reaches a configuration ⟨ρ, q⟩ for pattern p iff q0 →ρ(τ) q, where q0 and → are the initial state and transition relation of p.

Algorithm 1 gives an algorithm for computing the failing-edit-distance by computing the set of configurations reached by a trace. The algorithm uses a tolerance γ to restrict the size of rewrites and therefore the algorithm will only find the edit-distance if it is below this tolerance. The algorithm uses a function failing that returns true if a final state is not reachable from the given state.
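Algorithm 1 can be sketched in Python as follows. This is an illustrative reading rather than the authors' implementation: the function name, the encoding of rewrites as tuples of (index, operation) pairs with 1-based event positions, the treatment of a missing transition as failing, and the example pattern (a b* c)* with only state 1 accepting are assumptions.

```python
def failing_edit_distance(delta, initial, accepting, alphabet, trace, gamma):
    """Sketch of Algorithm 1: size of the smallest rewrite found, or None."""
    def step(q, a):
        return None if q is None else delta.get((q, a))

    # States from which an accepting state is reachable (everything else is failing).
    alive = set(accepting)
    changed = True
    while changed:
        changed = False
        for (q, _a), target in delta.items():
            if target in alive and q not in alive:
                alive.add(q)
                changed = True

    def failing(q):
        return q not in alive             # a missing transition (None) also counts as failing

    configs = {((), initial)}
    for i, a in enumerate(trace, start=1):
        next_configs = set()
        for rewrite, q in configs:
            q_next = step(q, a)
            if not failing(q_next):
                next_configs.add((rewrite, q_next))
            elif len(rewrite) < gamma:    # edits are only introduced on failure
                next_configs.add((rewrite + ((i, "-"),), q))                            # delete event i
                for b in alphabet:
                    next_configs.add((rewrite + ((i, "+" + b),), step(step(q, b), a)))  # insert b before event i
                    next_configs.add((rewrite + ((i, "%" + b),), step(q, b)))           # substitute b for event i
        configs = {(r, q) for r, q in next_configs if not failing(q)}
    sizes = [len(r) for r, q in configs if q in accepting]
    return min(sizes) if sizes else None

# Assumed resource usage pattern (a b* c)* with a = open, b = use, c = close.
DELTA = {(1, "a"): 2, (2, "b"): 2, (2, "c"): 1}
print(failing_edit_distance(DELTA, 1, {1}, "abc", list("aabcb"), gamma=2))
```

With a tolerance of 1 the same call returns None, since every surviving configuration needs two edits.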
The use of γ helps restrict the exponential blowup introduced by the nondeterminism of edit functions. Other optimisations that can reduce this blowup include restricting the number of edits allowed in a row and combining similar rewrites together.

4.3 Example of computing failing-edit-distance

Let us take the resource usage pattern introduced in Sec. 3.2 and consider the trace open.open.use.close.use for the instantiation [a ↦ open, b ↦ use, c ↦ close]. Checking this pattern will fail on the second event as there is no open transition from the second state. Two edit operations can be applied here, removal of the second event or addition of a close event immediately before the second, which leads to two alternative configurations:

  {⟨[(1, −)], 2⟩, ⟨[(1, +close)], 2⟩}

We continue checking and fail again on the fifth event, the final use. Here there are three edit operations that can be applied: removal of the event, addition of an open event or substitution of the use event with an open event. This leaves us with six final configurations:

  ⟨[(1, −), (5, −)], 1⟩, ⟨[(1, −), (5, +open)], 2⟩, ⟨[(1, −), (5, %open)], 2⟩,
  ⟨[(1, +close), (5, −)], 1⟩, ⟨[(1, +close), (5, +open)], 2⟩, ⟨[(1, +close), (5, %open)], 2⟩

Therefore, the instantiated pattern matches with failing-edit-distance 2.

5. USING THE TRUE EDIT DISTANCE

We now consider an approach that uses the true edit distance between the trace and the language. We consider a technique that uses weighted transducers to compute the edit-distance between a trace and a finite automaton [2]. The general idea is that we model the trace and pattern as weighted transducers T and P and model the edit operations as a transducer X. The composition T ∘ X ∘ P will capture the different ways that the trace can be rewritten to match the pattern, and the minimal edit-distance is the shortest path to an accepting state.

5.1 Weighted transducers

A weighted transducer has transitions labelled with an input symbol, an output symbol and a weight. For this application we take weights as being 0 or 1. We allow ɛ input and output transitions that can be taken without consuming or producing a symbol.

DEFINITION 12 (WEIGHTED TRANSDUCER).
A weighted transducer is a 5-tuple T = ⟨Q, Σ, Δ, δ, F⟩ where Q is a finite set of states, Σ is a finite input alphabet of symbols, Δ is a finite output alphabet of symbols, δ ⊆ Q × (Σ ∪ {ɛ}) × (Δ ∪ {ɛ}) × {0, 1} × Q is a finite set of transitions and F ⊆ Q is a set of final states.

We translate traces into weighted transducers by creating a transition to a new state per event, adding self-looping ɛ transitions and only making the last state final. For example, the trace a.a.b.c.b would become the following weighted transducer, where transitions are written input/output : weight. Note that we use a weight of 0 as there is no cost associated with following the trace.

[Figure: trace transducer with states 1 to 6, transitions a/a : 0, a/a : 0, b/b : 0, c/c : 0, b/b : 0 between consecutive states and an ɛ/ɛ : 0 self-loop on every state]

Patterns are translated by keeping the structure and labelling transitions with the same input and output symbols using a weight of 0, and adding self-looping ɛ transitions.

The edit transducer consists of a single state and looping transitions for each of the edit operations it can perform. For an alphabet of {a, b, c} this would be as follows. Note how ɛ is used to model deletions and additions and all edit operations have a weight of 1.

[Figure: edit transducer with a single state 1 and looping transitions a/a : 0, b/b : 0, c/c : 0, a/ɛ : 1, b/ɛ : 1, c/ɛ : 1, ɛ/a : 1, ɛ/b : 1, a/b : 1, a/c : 1, b/a : 1, b/c : 1, c/a : 1, c/b : 1]
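The two translations just described can be sketched as plain data constructions. The tuple layout (source, input, output, weight, target), the use of the empty string for ɛ, and the function names are assumptions made for this illustration; the sketch also includes an insertion transition for every symbol of the alphabet.

```python
EPS = ""  # stands for the empty symbol ɛ in this sketch

def trace_transducer(trace):
    """One new state per event; events are copied at weight 0, with ɛ/ɛ self-loops."""
    transitions = [(i, event, event, 0, i + 1) for i, event in enumerate(trace, start=1)]
    transitions += [(i, EPS, EPS, 0, i) for i in range(1, len(trace) + 2)]
    return {"initial": 1, "final": {len(trace) + 1}, "transitions": transitions}

def edit_transducer(alphabet):
    """Single state with looping transitions: matches cost 0, every edit costs 1."""
    transitions = []
    for x in alphabet:
        transitions.append((1, x, x, 0, 1))      # keep x unchanged
        transitions.append((1, x, EPS, 1, 1))    # delete x
        transitions.append((1, EPS, x, 1, 1))    # insert x
        transitions += [(1, x, y, 1, 1) for y in alphabet if y != x]  # substitute y for x
    return {"initial": 1, "final": {1}, "transitions": transitions}

T = trace_transducer(list("aabcb"))
X = edit_transducer("abc")
print(len(T["transitions"]), len(X["transitions"]))
```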

Algorithm 2 Computing the three-way composition of transducers T, X and P with the same input and output alphabets Σ and Δ.

  Enqueue(S, (T.q0, X.q0, P.q0))
  Q ← {(T.q0, X.q0, P.q0)}
  δ ← ∅, F ← ∅
  while ¬isEmpty(S) do
    (q1, q2, q3) ← Dequeue(S)
    if (q1, q2, q3) ∈ T.F × X.F × P.F then
      F ← F ∪ {(q1, q2, q3)}
    for (q1, i1, o1, w1, q1′) ∈ T.δ and (q3, i3, o3, w3, q3′) ∈ P.δ do
      for (q2, i2, o2, w2, q2′) ∈ X.δ where i2 = o1 ∧ o2 = i3 do
        if (q1′, q2′, q3′) ∉ Q then
          Q ← Q ∪ {(q1′, q2′, q3′)}
          Enqueue(S, (q1′, q2′, q3′))
        δ ← δ ∪ {((q1, q2, q3), i1, o3, w1 + w2 + w3, (q1′, q2′, q3′))}
  return ⟨Q, Σ, Δ, δ, F⟩

5.2 Composition

The composition T ∘ X of two transducers T and X considers all possible sequencing between strings of T and strings of X, i.e. if a/b.b/c is a string of T and b/d.c/a is a string of X then a/d.b/a is a string of T ∘ X. Here we consider three-way composition, i.e. T ∘ X ∘ P. We compute this as a single operation for efficiency reasons: if we computed T ∘ X and then (T ∘ X) ∘ P it is likely that (T ∘ X) would contain many superfluous transitions. An approach for doing this is presented in [1] and Algorithm 2 gives an algorithm for three-way composition.

5.3 An example of computing edit-distance

Let us take the same example we used for the failing edit-distance, i.e. the trace open.open.use.close.use and the resource usage pattern introduced in Sec. 3.2. For ease of presentation we translate the trace using a for open, b for use and c for close. This gives us the trace used as an example in Sec. 5.1 above. We therefore already have our weighted transducer T. We then compute the weighted transducer P for the resource usage pattern as follows.

[Figure: pattern transducer P with two states, transitions a/a : 0, b/b : 0 and c/c : 0, and an ɛ/ɛ : 0 self-loop on each state]

We now compute T ∘ X ∘ P, using the edit transducer X presented in Sec. 5.1 above. This gives us the weighted transducer in Figure 1. We then use Dijkstra's shortest path algorithm to find a shortest path between the initial state and an accepting state. We indicate one such shortest path with a dashed line; this corresponds to the string a/a.a/b.b/b.c/c.b/a with a weight of 2.
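The distance computed in this example can be cross-checked without building the transducers explicitly. The sketch below is a simplification, not the three-way composition of Algorithm 2: it runs Dijkstra's algorithm directly over pairs (trace position, pattern state), which plays the role of the composed machine. The function name and the encoding of the resource usage pattern as (a b* c)* with only state 1 accepting are assumptions.

```python
import heapq

def edit_distance_to_language(delta, initial, accepting, alphabet, trace):
    """Fewest edits turning `trace` into a trace accepted by the automaton (None if impossible)."""
    best = {(0, initial): 0}
    queue = [(0, 0, initial)]                     # entries are (cost, trace position, state)
    while queue:
        cost, i, q = heapq.heappop(queue)
        if cost > best.get((i, q), float("inf")):
            continue                              # stale queue entry
        if i == len(trace) and q in accepting:
            return cost
        moves = []
        if i < len(trace):
            moves.append((cost + 1, i + 1, q))    # delete trace[i]
        for x in alphabet:
            target = delta.get((q, x))
            if target is None:
                continue
            moves.append((cost + 1, i, target))   # insert x
            if i < len(trace):
                # follow trace[i] at weight 0 if it equals x, otherwise substitute x at weight 1
                moves.append((cost + (0 if x == trace[i] else 1), i + 1, target))
        for new_cost, j, r in moves:
            if new_cost < best.get((j, r), float("inf")):
                best[(j, r)] = new_cost
                heapq.heappush(queue, (new_cost, j, r))
    return None

# Assumed resource usage pattern (a b* c)* with a = open, b = use, c = close.
DELTA = {(1, "a"): 2, (2, "b"): 2, (2, "c"): 1}
print(edit_distance_to_language(DELTA, 1, {1}, "abc", list("aabcb")))
```

The search space has |τ| + 1 positions times |Q| states, so this variant avoids enumerating transitions of the edit transducer that can never be used.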
This gives two edits to our string: replacing the second event with a use event and the last use event with an open event. Note that there are multiple paths with a weight of 2 here, and therefore multiple ways we can rewrite our trace. A shortest path through the composition will always be at least as long as the trace and will give a rewrite by relating the projected trace back to the original trace.

If a pattern checker is used then, instead of computing the shortest distance to an accepting state, for each pattern we compute the shortest distance to an accepting state labelled with that pattern.

[Figure 1: An example of the composition T ∘ X ∘ P]

6. COMBINING IMPERFECT PATTERNS

The previous two sections presented two different techniques for extracting imperfect patterns from imperfect traces. Each pattern is given a set of rewrites that tell us how to edit the input trace to make it match the pattern. When combining patterns we now need to consider these rewrites. In this section we present an approach for combining a set of imperfect patterns that are compatible, i.e. have a set of rewrites that do not clash. We then discuss a saturation approach to producing a set of pattern combinations.

6.1 The approach

We first define what we mean by an imperfect pattern. If we took an imperfect pattern as a pair of a pattern and its shortest rewrite then when combining two patterns we might find that these shortest rewrites are incompatible, but that if we had chosen, say, the second shortest rewrite we would be able to combine the two patterns. Therefore, we consider all rewrites up to a certain size for a pattern. An imperfect pattern is a pair ⟨p, R⟩ where p is a pattern and R is a set of rewrites. In the case of the failing edit-distance approach R is given by the reached configurations.
In the case of the true edit-distance approach R is given by the language of the composition and can therefore be infinite, but in practice we use breadth-first search to select the k-shortest paths.

A set of imperfect patterns {..., ⟨pi, Ri⟩, ...} is compatible if there exists a set of rewrites {..., ρi, ... | ρi ∈ Ri} such that every pair of rewrites is compatible. Two rewrites are compatible if they do not attempt to make different rewrites at the same relevant points in a trace. We interpret no edit as an identity edit. A point in the trace is relevant to a rewrite if it is in the alphabet of the associated pattern. The edit-distance of ⟨p1, R1⟩, ..., ⟨pn, Rn⟩ is |ρ1 ∪ ... ∪ ρn|, i.e. the number of edits when all rewrites are combined. Therefore, given a set of compatible patterns we want to find the set of rewrites that minimizes this distance.
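The pairwise compatibility condition can be illustrated with a small sketch. It encodes a rewrite as a dictionary from 1-based trace index to an operation string and treats a missing entry as the identity edit. This encoding, the function names, and the example trace and rewrites are assumptions chosen for illustration, and the check only covers a single pair of rewrites rather than the search over sets of rewrites.

```python
def compatible(trace, rewrite1, alphabet1, rewrite2, alphabet2):
    """True if the two rewrites never disagree at a point relevant to both patterns."""
    for i, event in enumerate(trace, start=1):
        if event in alphabet1 and event in alphabet2:
            # A missing entry is read as the identity edit.
            if rewrite1.get(i, "id") != rewrite2.get(i, "id"):
                return False
    return True

def combined_distance(rewrite1, rewrite2):
    """Number of distinct edits once both rewrites are merged."""
    return len(set(rewrite1.items()) | set(rewrite2.items()))

# Hypothetical trace in which the tenth position holds a repeated close event.
trace = ("connect open send send close connect open send close close "
         "connect open send send close").split()

drop_ninth = {9: "-"}   # delete the ninth event
drop_first = {1: "-"}   # delete the first connect event

print(compatible(trace, drop_ninth, {"connect", "close"}, drop_ninth, {"open", "close"}))
print(compatible(trace, drop_ninth, {"connect", "close"}, drop_first, {"connect", "open", "send"}))
print(combined_distance(drop_ninth, drop_ninth))
```

The first pair agrees on every shared point, the second pair disagrees on the first connect event, and merging two identical rewrites still counts as a single edit.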

Algorithm 3 Computing the minimum compatibility between sets of rewrites R1 and R2 from imperfect patterns extracted from trace τ, where Ri is related to a pattern with alphabet Σi.

  G ← {R1 ∪ R2}
  for i from 1 to |τ| do
    G′ ← ∅
    for g ∈ G do
      D ← {ρ ∈ g | ρ(i) is defined}
      M ← [e ↦ {ρ ∈ D | ρ(i) = e}]
      M ← M[τ(i) ↦ {ρ ∈ g\D | τ(i) ∈ Σi ∧ ρ ∈ Ri}]
      if D = ∅ then
        G′ ← G′ ∪ {g}
      else
        G′ ← G′ ∪ {(g\D) ∪ d | (e ↦ d) ∈ M}
    G ← G′
  G_okay ← {g ∈ G | ∃ρ1 ∈ R1, ρ2 ∈ R2 : ρ1, ρ2 ∈ g}
  if G_okay = ∅ then
    return "incompatible"
  else
    return min({|g| | g ∈ G_okay})

  send("serverA", new String[]{"start", "45"});
  send("serverB", null);
  send("serverC", new String[]{"end", "23"});

  void send(String address, String[] lines) {
    Connection C = connect(address);
    Stream S = C.open();
    try {
      for (String line : lines)
        S.send(line);
    } catch (NullPointerException e) {
      S.send("empty");
      C.close();
    }
    C.close();
  }

Figure 2: A hypothetical piece of Java code.

6.2 Computing compatibility

We compute the compatibility between two sets of rewrites R1 and R2 by taking the set R1 ∪ R2 and repeatedly splitting it based on conflicts between rewrites and then checking that there is a set of rewrites with a rewrite in R1 and R2. An algorithm for computing compatibility is given in Algorithm 3. This can be extended to a set of sets of rewrites. The algorithm will return "incompatible" if the two sets of rewrites are incompatible and the smallest number of edits that makes them compatible otherwise. Let min be the function that returns this minimum distance and is undefined otherwise.

6.3 Saturating the set of patterns

Given a set of imperfect patterns P0 extracted from a trace we compute the ith saturation of P0 as follows, recalling that min(R1, R2) is only defined if R1 and R2 are compatible.

  P_{i+1} = {⟨combine(p1, p2), min(R1, R2)⟩ | ⟨p1, R1⟩, ⟨p2, R2⟩ ∈ P_i}

In general, |P_i| = ½ |P_{i−1}| (|P_{i−1}| − 1). Many of these combinations will be trivial and can be removed. Nevertheless, the saturation can grow exponentially.
Let P be the fixed point of P_i, i.e. the set P_i such that P_{i+1} = P_i. To make saturation practical we take the following steps:

- Limit - We place an upper limit on the saturation set, i.e. P_3.
- Prune - We filter out patterns if:
  - Subsumption - they are subsumed by another pattern.
  - Maximal alphabet - they do not use all symbols.
  - Minimum distance - they are not within a small bound of the minimum edit distance used.
- Rank - We rank patterns by edit-distance and size.

7. EXPERIMENTS

In this section we explore our new technique by first applying it to a hypothetical code snippet and then carrying out an experiment to evaluate accuracy where we attempt to recreate a known specification from imperfect traces.

7.1 Application to example code

Consider the Java code in Figure 2. This gives a hypothetical method for sending an array of lines to an address by first connecting to that address, opening a stream, sending the lines and then closing the stream. This example contains a bug: in the case where a null array of lines is given the connection is closed twice.

Let us assume we execute the above code, which calls the method three times with different inputs, recording the occurrences of the connect, open, send and close events. The resulting trace would be as follows.

connect.open.send.send.close.connect.open.send.close.close.connect.open.send.send.close

We now consider mining this trace with two patterns: the alternating pattern given in the introduction and the resource usage pattern given in Section 3.2.

We take the alternating pattern first. The following table gives the failing and true edit-distances (failing/true) for the above trace and the different instantiations of the alternating pattern. A - represents that no distance should be given (we do not consider the case where a = b) and an x represents that no distance is returned. The instantiation [a ↦ open, b ↦ connect] does not have a failing edit-distance as it finishes in a non-final state that can be extended to a final state. This is one drawback of the failing edit-distance approach.
For [a ↦ close, b ↦ connect] and [a ↦ close, b ↦ open] there is a shorter true edit-distance as this approach is allowed to make edits without failure, here removing the last event to bring the pattern into an accepting state. Note that all other distances are the same, which shows that the failing edit-distance can be a good approximation of the true edit-distance.

  b \ a     connect   open   send   close
  connect   -         x/2    3/3    4/3
  open      0/0       -      3      4/3
  send      2/2       2/2    -      4/3
  close     1/1       1/1    3/3    -

For one case, [a ↦ connect, b ↦ open], there is a distance of 0. This is because this instantiated pattern matches the trace exactly. If we consider the two cases where there is an edit-distance of 1 and look at the rewrite generated we see that all of these produce the same rewrite: the removal of the ninth event (the second close). Combining the three instantiated patterns with an edit distance of 0 or 1 we get the following pattern.

[Figure: combined three-state pattern over connect, open and close]

Now let us consider the resource usage pattern. The following table gives the failing and true edit distances as before, with each entry in the table representing the c dimension using a 4-tuple over connect, open, send and close. Here, again, computed distances are the same but the true edit-distance approach generates some distances where the failing edit-distance approach does not.

            connect             open                send                close
  connect   (-,-,-,-)           (-,-,5/5,4/4)       (-,3/3,-,5/5)       (-,2/2,4/4,-)
  open      (-,-,2/2,1/1)       (-,-,-,-)           (5,-,-,5)           (5/5,-,4/4,-)
  send      (-,4/4,-,1/1)       (1/1,-,-,1/1)       (-,-,-,-)           (x/6,x/6,-,-)
  close     (-,4/4,5/5,-)       (1/1,-,5/5,-)       (3/3,3/3,-,-)       (-,-,-,-)

There are five instantiations with an edit-distance of 1, but they represent different rewritings of the trace. One set removes the ninth event (as before) and one set removes the first connect event, therefore they are incompatible. When combined they give the following respectively:

[Figure: the two combined patterns, one for each set of compatible instantiations]

The rewrite for the first pattern is compatible with the rewrite for the pattern extracted using the alternation pattern and we can combine these patterns to form a final specification, which is the same as the one on the left above, but with only the initial state accepting.

7.2 An accuracy experiment

Before we begin we should note that this experiment is not fully measuring the expected usage of this technique as there is no manual inspection of the produced specifications. We evaluate the accuracy of our approach by generating traces from the following specification for the Lucene tool described in [5].

[Figure: Lucene specification automaton with transitions labelled document.Document.<init>, document.Field.<init>(String, String, Store, Index), document.Field.<init>(String, Reader) and index.IndexWriter.addDocument(Document)]

We generate imperfect traces by first generating perfect traces and then randomly editing events according to some noise level (probability). We then pass these traces to our techniques and test the resulting patterns for accuracy using a set of perfect traces generated from the specification. Table 1 gives the average results over three runs.
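The noise injection step can be sketched as follows. The text does not specify how an edit is chosen, so this sketch assumes that each event is edited independently with probability equal to the noise level and that the three edit operations are equally likely. The function name `add_noise` and its parameters are hypothetical.

```python
import random

def add_noise(trace, alphabet, noise, rng):
    """Independently edit each event of a perfect trace with probability `noise`."""
    noisy = []
    for event in trace:
        if rng.random() >= noise:
            noisy.append(event)                        # keep the event unchanged
            continue
        edit = rng.choice(["delete", "insert", "substitute"])
        if edit == "insert":
            noisy.extend([rng.choice(alphabet), event])  # add a random event before this one
        elif edit == "substitute":
            noisy.append(rng.choice(alphabet))         # may pick the same symbol again
        # "delete": append nothing
    return noisy

perfect = ["a", "b", "c"] * 4
print(add_noise(perfect, ["a", "b", "c"], 0.1, random.Random(0)))
```

Passing an explicitly seeded random.Random instance keeps runs repeatable, which matters when results are averaged over several runs.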
For each approach it reports the average accuracy, the minimum edit required to produce a pattern with maximum accuracy, the time taken for checking and then saturation and the size of the pruned 3-saturated set. Experiments were carried out with a range of trace lengths and noise levels and different γ for Failing and k-shortest paths for Perfect.

Every experiment with a non-empty P_3 produced at least one pattern with perfect accuracy. As expected, with zero noise we achieve perfect accuracy. The reason we saw empty P_3 sets for Failing was that the γ was not high enough and for Perfect it was due to the relevant rewrites not appearing in the top k-shortest sets. These parameters can be increased, but they currently have a high impact on running times. As expected, as noise increases accuracy generally decreases and in general the larger γ or k the better the accuracy.

Generally checking times are very fast, with saturation dominating the process. The main cost in saturation is the computation of compatibility between rewrites, which is why we saw the highest saturation times with Failing-5. Further work should consider methods for optimising this process and trimming the set of rewrites considered.

It is clear that saturation of this kind is too costly. If we combine this approach with the automata of [17] we would combine together compatible patterns directly, without saturation. Alternatively, the introduction of specific combination rules would reduce the cost of saturation.

8. RELATED WORK

We consider alternative techniques that mine specifications from runtime traces. A recent survey paper [18] gives a good overview of the field. Here we focus on how techniques deal with imperfect traces; in particular we are interested in automata-based pattern mining approaches.

Ammons et al. [3] developed an early approach that used a probabilistic finite automata learner from the field of grammar inference and requires the alphabet of the inferred specification to be known beforehand. Imperfect traces require human experts to check violations of the inferred specification in a coring phase. Lo et al.
[14] extend this pproch - one extension tht is relevnt here is the introduction of stge tht ttempts to filter out erroneous trces efore lerning. In contrst we ttempt to use this informtion to extrct specifiction nd identify the error. Techniques tht use frequent-itemset mining (i.e. [12]) nd d frequent sequentil pttern mining (i.e. [15]) rely on computing support nd confidence vlues where support reflects the level of imperfection, nd therefore cn hndle imperfect trces. However, the properties extrcted re not s strict s utomt-sed specifictions s in the first cse symols re only relted y frequent ssocition, not order, nd in the second the ordering reltion is simple. These techniques scle very well nd require miniml informtion to e provided. The utomt-sed pttern-mining technique ws first used y Engler et l. [4]. They focus on the lternting pttern () nd del with imperfect trces y counting the numer of times tht nd occur together in order, nd occurs without nd compute the likelihood tht they form specifiction. Goues nd Weimer [8] extend this pproch with techniques for pruning flse positives y exmining the source code. Yng et l. [19] introduced templte-sed technique focusing on extrcting specifictions from imperfect trces. They use the lternting pttern nd del with imperfect trces y prtitioning trce into sequences of one event followed y nother, i.e. + +, performing mining on ech sutrce nd then counting the numer of sutrces the pttern holds for. This is similr to restrting the pttern on filure ut llows for lrger rnge of filures. They lso introduce chining heuristic for comining their lternting ptterns. Gel nd Su. [5, 6] extend this pproch y introducing symolic method for specifiction mining using inry decision digrms nd the Jvert tool tht uses two ptterns () nd ( ) nd composition rules sed on utomt comintion to extrct lrge ptterns. They del with imperfect trces y restrting pttern t the initil stte on filure. 
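The restart-on-failure strategy just described can be sketched for the alternating pattern, written here as (ab)*. This is an illustrative sketch under stated assumptions rather than the Javert implementation: the function name, the decision to ignore events other than the two instantiated ones, the decision to drop the offending event on restart, and the event names `open`, `read` and `close` are all hypothetical.

```python
def alternating_with_restart(trace, a, b):
    """Check the alternating pattern over `trace` projected onto {a, b}.

    On an event with no legal transition, count a failure and restart at the
    initial state (assumption: the offending event is dropped).
    Returns (accepted, failures); `a` and `b` must be distinct.
    """
    expecting_a = True  # initial state, also the only accepting state
    failures = 0
    for event in trace:
        if event not in (a, b):
            continue  # events outside the instantiation are ignored
        if (event == a) == expecting_a:
            expecting_a = not expecting_a  # legal transition
        else:
            failures += 1
            expecting_a = True  # restart at the initial state
    return expecting_a, failures


# Hypothetical traces over made-up event names.
print(alternating_with_restart(["open", "read", "close", "open", "close"], "open", "close"))
# prints (True, 0)
print(alternating_with_restart(["open", "open", "close"], "open", "close"))
# prints (True, 2)
```

The failure count says how often the pattern was violated, but unlike an edit-distance it does not say which minimal set of changes to the trace would make the pattern hold.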
In [7] they extend this approach to infer and enforce temporal properties at runtime over a finite window, thus detecting potential bugs at runtime. Li et al. [11] extend this approach to mine specifications with timing bounds and more complex pattern composition rules, but cannot handle imperfect traces. Instead their focus is on mining specifications from perfect traces and using these to detect bugs in imperfect ones.

Finally, recent techniques [9, 16, 17] consider the parametric case where traces contain data, e.g. (12).(45).(12).(45). Whilst some approaches use ad-hoc methods to deal with context, these focus on slicing the trace based on this data and extracting traces from the resulting data-free traces. The work in [9] extends the approach taken by [3] and therefore uses the same coring technique to deal with imperfect traces, and [16] uses the notions of support and confidence from data mining.

Table 1: Results from accuracy experiment. A = accuracy. E = edits. Time gives checking and saturation time separately in seconds.

Trace length  Noise level  Approach    A     E  Time (checking, saturation)  P 3
10            0.0          Failing-2   1.0   0  0.06, 0.84                   36
10            0.0          Failing-5   1.0   0  0.06, 34.7                   34
10            0.0          Perfect-2   1.0   0  0.11, 0.13                   36
10            0.0          Perfect-20  1.0   0  0.05, 0.11                   34
10            0.05         Failing-2   0.48  1  0.01, 0.96                   11
10            0.05         Failing-5   0.83  3  0.01, 195                    1
10            0.05         Perfect-2   0.86  1  0.04, 0.05                   4
10            0.05         Perfect-20  0.78  1  0.03, 0.03                   4
10            0.1          Failing-2   0.58  2  0.01, 2.22                   2
10            0.1          Failing-5   0.80  3  0.01, 172                    14
10            0.1          Perfect-2   0.68  2  0.03, 0.06                   13
10            0.1          Perfect-20  0.87  2  0.03, 0.04                   4
100           0.0          Failing-2   1.0   0  0.09, 1.64                   35
100           0.0          Failing-5   1.0   0  0.09, 1.47                   36
100           0.0          Perfect-2   1.0   0  0.46, 0.27                   33
100           0.0          Perfect-20  1.0   0  0.26, 0.16                   34
100           0.05         Failing-2   0.33  1  0.01, 4.96                   1
100           0.05         Failing-5   0.53  4  0.01, 476                    2
100           0.05         Perfect-2   0.66  1  0.27, 0.17                   1
100           0.05         Perfect-20  0.0   -  0.29, 0.12                   0
1000          0.0          Failing-2   1.0   0  0.17, 11.5                   35
1000          0.0          Failing-5   1.0   0  0.17, 12.4                   32
1000          0.0          Perfect-2   1.0   0  3.29, 5.13                   33
1000          0.0          Perfect-20  1.0   0  3.09, 0.70                   35
1000          0.01         Failing-2   0.0   -  0.16, 2.48                   0
1000          0.01         Failing-5   0.33  2  0.16, 1382                   1
1000          0.01         Perfect-2   0.0   -  3.48, 3.67                   0
1000          0.01         Perfect-20  0.16  1  3.11, 3.5                    2

9. CONCLUSION

This paper has introduced a new approach for mining specifications from imperfect traces. Two techniques are introduced that use the notion of edit-distance to compute the number of changes that would have to be made to a trace for a pattern to hold, and a notion of when it is safe to combine two imperfect patterns is given. We also demonstrate the process by applying it to a small code snippet and then measure the accuracy of the approach using traces generated from a known specification. This technique not only produces specifications, but also a description of how a program should be updated to make the specification hold.
This would be useful in bug detection and location, but a case study is required to establish applicability.

Further work is required to improve the efficiency and applicability of the approach. This should involve incorporating existing techniques, for example the symbolic mining technique of [6] and the composition rules of [5, 11]. We also plan on combining this approach with the authors' pattern-mining approach taken in [17], which targets a specific alphabet of events to extract a parametric specification. That approach uses a class of automata for which all extracted patterns can be soundly combined to form a specification. Therefore, we would be able to use pattern combination directly, rather than introducing pattern combination rules. A further area of interest is the use of edit-distance as a fitness function in evolutionary techniques used to evolve a specification.

10. REFERENCES

[1] C. Allauzen and M. Mohri. 3-way composition of weighted finite-state transducers. In Proceedings of the 13th international conference on Implementation and Applications of Automata, CIAA '08, pages 262-273, Berlin, Heidelberg, 2008. Springer-Verlag.
[2] C. Allauzen and M. Mohri. Linear-space computation of the edit-distance between a string and a finite automaton. CoRR, abs/0904.4686, 2009.
[3] G. Ammons, R. Bodík, and J. R. Larus. Mining specifications. SIGPLAN Not., 37(1):4-16, Jan. 2002.
[4] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as deviant behavior: a general approach to inferring errors in systems code. SIGOPS Oper. Syst. Rev., 35(5):57-72, Oct. 2001.
[5] M. Gabel and Z. Su. Javert: fully automatic mining of general temporal properties from dynamic traces. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, SIGSOFT '08/FSE-16, pages 339-349, New York, NY, USA, 2008. ACM.
[6] M. Gabel and Z. Su. Symbolic mining of temporal specifications. In ICSE '08: Proceedings of the 30th international conference on Software engineering, pages 51-60, New York, USA, 2008. ACM.
[7] M. Gabel and Z. Su. Online inference and enforcement of temporal properties. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE '10, pages 15-24, New York, NY, USA, 2010. ACM.
[8] C. Goues and W. Weimer. Specification mining with few false positives. In Proceedings of the 15th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS '09, pages 292-306, Berlin, Heidelberg, 2009. Springer-Verlag.
[9] C. Lee, F. Chen, and G. Roşu. Mining parametric specifications. In Proceeding of the 33rd International Conference on Software Engineering (ICSE '11), pages 591-600. ACM, 2011.
[10] V. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10:707, 1966.
[11] W. Li, A. Forin, and S. A. Seshia. Scalable specification mining for verification and diagnosis. In DAC '10: Proceedings of the 47th Design Automation Conference, pages 755-760, New York, NY, USA, 2010. ACM.
[12] Z. Li and Y. Zhou. Pr-miner: automatically extracting implicit programming rules and detecting violations in large software code. SIGSOFT Softw. Eng. Notes, 30(5):306-315, Sept. 2005.
[13] D. Lo, K. Cheng, and J. Han. Mining Software Specifications: Methodologies and Applications. Chapman and Hall/CRC Data Mining and Knowledge Discovery. Taylor & Francis Group, 2011.
[14] D. Lo and S.-C. Khoo. Smartic: towards building an accurate, robust and scalable specification miner. In Proceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering, SIGSOFT '06/FSE-14, pages 265-275, New York, USA, 2006. ACM.
[15] D. Lo, S.-C. Khoo, and C. Liu. Mining temporal rules for software maintenance. J. Softw. Maint. Evol., 20(4):227-247, July 2008.
[16] D. Lo, G. Ramalingam, V. P. Ranganath, and K. Vaswani. Mining quantified temporal rules: Formalism, algorithms, and evaluation. Sci. Comput. Program., 77(6):743-759, 2012.
[17] G. Reger, H. Barringer, and D. Rydeheard. A pattern-based approach to parametric specification mining. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, November 2013. To appear.
[18] M. P. Robillard, E. Bodden, D. Kawrykow, M. Mezini, and T. Ratchford. Automated api property inference techniques. IEEE Transactions on Software Engineering, 39(5):613-637, 2013.
[19] J. Yang, D. Evans, D. Bhardwaj, T. Bhat, and M. Das. Perracotta: mining temporal api rules from imperfect traces. In ICSE '06: Proceedings of the 28th international conference on Software engineering, pages 282-291, New York, NY, USA, 2006. ACM.