Mining Frequent Web Access Patterns with Partial Enumeration

Peiyi Tang
Department of Computer Science, University of Arkansas at Little Rock
2801 S. University Ave., Little Rock, AR 72204

Markus P. Turkia
Department of Computer Science, University of Arkansas at Little Rock
2801 S. University Ave., Little Rock, AR 72204

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACMSE 2007, March 23-24, 2007, Winston-Salem, N. Carolina, USA. Copyright 2007 ACM 978-1-59593-629-5/07/0003...$5.00.

ABSTRACT

In this paper, we extend the pattern-growth web access pattern mining algorithms [1, 2, 3] with partial enumeration. The extended algorithm can grow the frequent patterns by more than one symbol at a time and unifies the pattern-growth and apriori algorithms [4]. The experimental results show that, for databases of long sequences, the best performance is given neither by the pattern-growth algorithms nor by the full apriori enumeration algorithms, but rather by mining with partial enumeration in the middle.

1. INTRODUCTION

With the explosive growth of Internet use, mining frequent web access patterns from huge datasets of web log files has become possible. The frequent web access patterns mined from the web log files are essential for web masters and designers to further improve the design of their web sites. Web access pattern mining is one example of general frequent pattern mining. Another example is the mining of biological sequences, where each sequence is a sequence of amino acids or nucleotides. Due to the importance of the problem and its wide applications, web access pattern mining has attracted significant attention in recent years [1, 2, 3, 4, 5, 6].

The frequent web access pattern mining algorithms proposed so far are either pattern-growth algorithms [1, 2, 3] or apriori enumeration algorithms [4]. The apriori enumeration algorithms enumerate and test the candidate patterns and prune infrequent patterns away based on the known infrequent patterns. The problem with full apriori enumeration is the large number of candidates, because the candidate pruning is always approximate. The pattern-growth algorithms grow the frequent patterns by only one symbol per recursive call. If the average length of the frequent patterns is large, it takes a deep chain of recursive calls to find these patterns. Numerous recursive calls bring about overhead and slow down the mining.

In this paper, we extend our pattern-growth web access mining algorithm [3]¹ using partial enumeration. The extended algorithm can grow the frequent patterns by more than one symbol at each recursive call. Our experimental results show that the extended algorithm can outperform the original pattern-growth algorithm by a factor as large as 3.48 for databases of long sequences. The extended algorithm can also be regarded as one that unifies the pattern-growth algorithms [1, 2, 3] and the apriori enumeration algorithms [4], with these two at the extremes of the web access pattern mining spectrum. Our experimental results reveal that, for databases of long sequences, the best performance is given neither by the pattern-growth algorithms nor by the full apriori enumeration algorithms, but rather by mining with partial enumeration in the middle.

The rest of the paper is organized as follows. In Section 2, we present the theory and the abstract algorithm of web access pattern mining with partial enumeration. In Section 3, we present an implementation of the partial enumeration algorithm using the FLWAP-tree [3] and a simple apriori enumerator. We present the results of the experimental evaluation in Section 4 and conclude the paper with a description of related work in Section 5.

2. MINING FREQUENT PATTERNS WITH PARTIAL ENUMERATION

Let $\Sigma$ be the set of symbols, where each symbol represents a website. A web access sequence $s$ is a sequence of a finite number of symbols from $\Sigma$, $s = s_1 \cdots s_m$ ($s_i \in \Sigma$ for all $1 \le i \le m < \infty$, and $s_i$ and $s_j$ are not necessarily different for $i \ne j$). A web access database $D$ is a multi-set of web access sequences.
A web access pattern is also a sequence of a finite number of symbols from $\Sigma$. A sequence $s' = s'_1 \cdots s'_l$ is a subsequence of sequence $s = s_1 \cdots s_m$, denoted as $s' \sqsubseteq s$, if and only if $l \le m$ and there exist $i_1, \ldots, i_l$ such that $1 \le i_1 < \cdots < i_l \le m$ and $s'_j = s_{i_j}$ for all $1 \le j \le l$. A sequence $s$ is said to support pattern $p$ if $p$ is a subsequence of $s$. The support of pattern $p$ in $D$, denoted as $Sup_D(p)$, is the number of web access sequences in $D$ that support $p$. Given a threshold $\xi$ in the interval $(0, 1]$, pattern $p$ is frequent with respect to $\xi$ and $D$ if $Sup_D(p) \ge \xi \cdot |D|$, where $|D|$ is the number of web access sequences in $D$. $\xi \cdot |D|$ is called the absolute threshold and denoted as $\eta$. The web access pattern mining problem is to find all the frequent patterns with respect to $\xi$ and $D$.

¹ Our pattern-growth web access mining algorithm [3] uses the First-Occurrence Linked WAP-tree and outperforms the PLWAP mining algorithm [2] significantly and consistently.

[Figure 1: Search Space Tree of Frequent Patterns. (a) Original Search Space Tree; (b) Re-arranged by Partial Enumeration]

Figure 1(a) illustrates the search space tree of frequent patterns of symbol set $\Sigma = \{a, b, c, d\}$. Each node is a possible frequent pattern. Each node has four children representing the patterns extended with symbols $a$, $b$, $c$ and $d$, respectively. The pattern-growth algorithms [1, 2, 3] grow the frequent patterns one symbol at a time in a depth-first search of the tree. Figure 2 shows the abstract algorithm of pattern-growth from [3].

    function Pattern-Grow(pattern q, database D) {
        F ← ∅;
        for each s in Σ do
            if (Sup_D(s) ≥ η) then
                F ← F ∪ {q·s};
                Construct the s-projection database D_s;
                F' ← Pattern-Grow(q·s, D_s);
                F ← F ∪ F';
        endfor
        return F;
    }

Figure 2: Pattern-Growth without Enumeration

The algorithm grows frequent patterns by mining projection databases recursively. Given a symbol $s$ in $\Sigma$ and database $D$, the $s$-projection database of $D$, denoted as $D_s$, consists of the $s$-projections of the sequences in $D$ in which $s$ occurs at least once. The prefix from the first symbol up to and including the first occurrence of $s$ is called the $s$-prefix of the sequence. The $s$-projection of the sequence is what is left after the $s$-prefix is taken away. For example, the first occurrence of […] in sequence […] is where […] is underlined. Thus, the […]-prefix of […] is […] and its […]-projection is […]. The operator $\cdot$ is the concatenation between sequences. The frequent patterns in $D$ are mined by invoking Pattern-Grow($\epsilon$, $D$), where $\epsilon$ is the empty pattern.

In order to grow the frequent patterns faster, with more than one symbol per recursive call, we first partition the symbol set $\Sigma$ into disjoint subsets: $\Sigma = \Sigma_1 \cup \cdots \cup \Sigma_l$ and $\Sigma_i \cap \Sigma_j = \emptyset$ for $i \ne j$. For example, we can partition $\Sigma = \{a, b, c, d\}$ into $\Sigma_1 = \{a, b\}$ and $\Sigma_2 = \{c, d\}$. Given a symbol subset $\Sigma_j$, the partial enumerations from $\Sigma_j$ are the non-empty patterns made of the symbols in $\Sigma_j$ whose length is less than or equal to the size of $\Sigma_j$. For example, the partial enumerations from $\Sigma_1 = \{a, b\}$ are $a$, $b$, $aa$, $ab$, $ba$ and $bb$. The set of partial enumerations from $\Sigma_j$ is denoted as $E(\Sigma_j)$. For example, $E(\{a, b\})$ is $\{a, b, aa, ab, ba, bb\}$.
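The definitions above (subsequence, support, $s$-projection and $E(\Sigma_j)$) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each symbol is a single character so that a sequence can be stored as a string, and all function names are hypothetical.

```python
from itertools import product


def is_subsequence(p, s):
    """Return True if pattern p is a (not necessarily contiguous) subsequence of s."""
    it = iter(s)
    # `sym in it` consumes the iterator up to the first match, so order is respected.
    return all(sym in it for sym in p)


def support(p, db):
    """Sup_D(p): number of sequences in the multi-set db that support p."""
    return sum(1 for s in db if is_subsequence(p, s))


def s_projection(s, sym):
    """What is left of s after its sym-prefix is removed; None if sym never occurs."""
    i = s.find(sym)
    return None if i < 0 else s[i + 1:]


def partial_enumerations(subset):
    """E(subset): all non-empty patterns over subset of length at most len(subset)."""
    symbols = sorted(subset)
    return ["".join(t)
            for n in range(1, len(symbols) + 1)
            for t in product(symbols, repeat=n)]
```

For a two-symbol subset such as `{"a", "b"}`, `partial_enumerations` yields the six patterns listed in the text; in general $|E(\Sigma_j)| = \sum_{n=1}^{k} k^n$ for $k = |\Sigma_j|$, which is why the subset size has to stay small.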
Figure 1(b) illustrates the basic idea of frequent web access pattern mining with partial enumeration. The search space is re-arranged so that the patterns can grow by more than one symbol at a time. The root node grows with the partial enumerations from both $\{a, b\}$ and $\{c, d\}$. For each other node, if it is grown from its parent with a partial enumeration shorter than the size of the corresponding symbol subset, it only grows with the partial enumerations from the other symbol subsets. For example, node $a$ in Figure 1(b) is grown from the root node with partial enumeration $a$ from $\{a, b\}$. The length of pattern $a$ is 1, which is less than the size of $\{a, b\}$, 2. Node $a$ then grows only with the partial enumerations from $\{c, d\}$. If it were allowed to grow with the partial enumerations from $\{a, b\}$, it would have children $aa$ and $ab$, which duplicate the nodes $aa$ and $ab$ grown from the root node. If a node is grown from its parent with a partial enumeration of length equal to the size of the corresponding symbol subset, it should grow with the partial enumerations from all the symbol subsets. For instance, node $ab$ in Figure 1(b) is grown from the root node with partial enumeration $ab$ of length 2. It should grow with the partial enumerations from both $\{a, b\}$ and $\{c, d\}$.

Based on this idea, the set of frequent patterns to be mined can be partitioned according to which symbol subset $\Sigma_j$ ($j = 1, \ldots, l$) the first symbol of the pattern comes from. Let $G(j, D)$ be the set of non-empty frequent patterns in database $D$ whose first symbol is from $\Sigma_j$. Obviously, the set of all non-empty frequent patterns in $D$ is

$$G(1, D) \cup \cdots \cup G(l, D) \qquad (1)$$

and

$$G(i, D) \cap G(j, D) = \emptyset \qquad (2)$$

for any $i \ne j$. Mining the frequent patterns from $D$ is, thus, partitioned into finding $G(j, D)$ for all $j = 1, \ldots, l$. There are no duplicates among the frequent patterns mined because of (2). This is important because no frequent pattern should be mined more than once. Otherwise, the mining time could increase dramatically.

Let us use $F(i, D)$ to denote the set of non-empty frequent patterns in $D$ excluding those starting with a symbol in $\Sigma_i$:

$$F(i, D) = \bigcup_{j=1}^{i-1} G(j, D) \;\cup \bigcup_{j=i+1}^{l} G(j, D) \qquad (3)$$

Therefore, the set of all non-empty frequent patterns in $D$ can be denoted as $F(0, D)$, because $F(0, D) = \bigcup_{j=1}^{l} G(j, D)$ according to (3).
We now describe how to find $G(j, D)$. Given a web access sequence $s$ and a pattern $p$ such that $p \sqsubseteq s$, the $p$-prefix of $s$ is the minimal prefix of $s$ that supports $p$. For example, the […]-prefix of sequence […] is […], rather than […]. The $p$-projection of $s$ is what is left after the $p$-prefix is taken away. For example, the […]-projection of sequence […] is […]. Note that the $p$-projection can be the empty sequence if the $p$-prefix of $s$ is $s$ itself. Given a partial enumeration $p$ from $\Sigma_j$, the $p$-projection database of $D$, denoted as $D_{p,\Sigma_j}$, is the multi-set of the $p$-projections of the sequences in $D$ that support $p$:

$$D_{p,\Sigma_j} = \{\, p\text{-projection of } s \mid p \sqsubseteq s \wedge s \in D \,\} \qquad (4)$$

It can be proved that the support of pattern $p'$ in the $p$-projection database $D_{p,\Sigma_j}$ is equal to the support of the concatenated pattern $p \cdot p'$ in the original database $D$:

$$Sup_{D_{p,\Sigma_j}}(p') = Sup_D(p \cdot p') \qquad (5)$$

The proof of (5) can be found in [7]. If the same absolute threshold $\eta = \xi \cdot |D|$ is used in mining frequent patterns in both $D$ and $D_{p,\Sigma_j}$, we can say based on (5) that pattern $p \cdot p'$ is frequent in $D$ if and only if $p'$ is frequent in $D_{p,\Sigma_j}$.

To mine $G(j, D)$, we further look at

$$\bigcup_{p \in E(\Sigma_j)} F(p, j, D)$$

where $F(p, j, D)$ is defined by

$$F(p, j, D) = \begin{cases} p \cdot F(0, D_{p,\Sigma_j}) \cup \{p\} & \text{if } |p| = |\Sigma_j| \\ p \cdot F(j, D_{p,\Sigma_j}) \cup \{p\} & \text{if } |p| < |\Sigma_j| \end{cases} \qquad (6)$$

if $p$ is frequent in $D$. If $p$ is not frequent in $D$, $F(p, j, D)$ is empty. Here, $p \cdot F(0, D_{p,\Sigma_j})$ is $\{\, p \cdot p' \mid p' \in F(0, D_{p,\Sigma_j}) \,\}$ (similarly for $p \cdot F(j, D_{p,\Sigma_j})$), the set of patterns that are the concatenations of $p$ with the patterns from $F(0, D_{p,\Sigma_j})$. Recall that $F(j, D_{p,\Sigma_j})$ is the set of non-empty frequent patterns in $D_{p,\Sigma_j}$ that do not start with the symbols in $\Sigma_j$, according to the definition in (3). $F(0, D_{p,\Sigma_j})$ is the set of non-empty frequent patterns in $D_{p,\Sigma_j}$ that start with any symbol (see (3) too). By using these facts, we can prove that

$$G(j, D) = \bigcup_{p \in E(\Sigma_j)} F(p, j, D) \qquad (7)$$

and

$$F(p, j, D) \cap F(p', j, D) = \emptyset \qquad (8)$$

for any $p, p' \in E(\Sigma_j)$ such that $p \ne p'$. Thus, (7) provides a disjoint partition of $G(j, D)$. We can mine $G(j, D)$ by mining $F(p, j, D)$ for all partial enumerations $p$ from $E(\Sigma_j)$, and no frequent pattern is mined more than once. The proofs of (7) and (8) can be found in [7].

    function Pattern-Grow(pattern q, int i, database D) {
        F ← ∅;
        for j = 1, l do
            if (j ≠ i) then
    1:          for each p in E(Σ_j) do
                    if (Sup_D(p) ≥ η) then
                        F ← F ∪ {q·p};
                        Construct the p-projection database D_{p,Σ_j};
                        if (|p| = |Σ_j|) then
                            F' ← Pattern-Grow(q·p, 0, D_{p,Σ_j});
                        else
                            F' ← Pattern-Grow(q·p, j, D_{p,Σ_j});
                        F ← F ∪ F';
                endfor
        endfor
        return F;
    }

Figure 3: Pattern-Growth with Partial Enumeration

[Figure 4: Mining with Partial Enumeration]
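The recursion of Figure 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the paper's implementation: sequences are plain strings of one-character symbols, the projection database is rebuilt as a list instead of an FLWAP-tree, every candidate in $E(\Sigma_j)$ is tested without apriori pruning, and the names `p_projection`, `enumerations` and `pattern_grow` are hypothetical.

```python
from itertools import product


def p_projection(s, p):
    """Remainder of s after its minimal prefix that supports p; None if unsupported."""
    pos = 0
    for sym in p:
        # Greedy leftmost matching yields the minimal supporting prefix.
        pos = s.find(sym, pos)
        if pos < 0:
            return None
        pos += 1
    return s[pos:]


def enumerations(subset):
    """E(subset): non-empty patterns over subset with length <= len(subset)."""
    syms = sorted(subset)
    return ["".join(t) for n in range(1, len(syms) + 1)
            for t in product(syms, repeat=n)]


def pattern_grow(q, i, db, partition, eta):
    """Frequent patterns extending q, skipping subset number i (i = 0 skips none)."""
    found = []
    for j, subset in enumerate(partition, start=1):
        if j == i:
            continue
        for p in enumerations(subset):
            projected = [r for r in (p_projection(s, p) for s in db)
                         if r is not None]
            if len(projected) >= eta:  # Sup_D(p) >= eta
                found.append(q + p)
                # Full-length enumeration: all subsets allowed again (state 0);
                # shorter enumeration: subset j is excluded in the child.
                next_i = 0 if len(p) == len(subset) else j
                found.extend(pattern_grow(q + p, next_i, projected, partition, eta))
    return found


if __name__ == "__main__":
    # Hypothetical toy database, not taken from the paper.
    print(pattern_grow("", 0, ["abc", "acb"], [{"a", "b"}, {"c"}], eta=2))
```

Because each pattern has exactly one decomposition into partial enumerations under the rule in (6), the returned list contains no duplicates, which is the property stated by (2) and (8).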
Based on the recursive equations (3), (7) and (6), we have the algorithm to find $F(i, D)$ shown in Figure 3. The frequent patterns in database $D$ can be found by calling Pattern-Grow($\epsilon$, 0, $D$).

Figure 4 shows the mining of a database with one sequence, $D = \{[…]\}$, and threshold $\xi = 100\%$. The symbol set $\Sigma = \{a, b, c, d\}$ is partitioned as $\Sigma_1 \cup \Sigma_2 = \{a, b\} \cup \{c, d\}$. Each node represents a projection database with its name and contents in the box. We use parentheses and subscripts to name projection databases. For example, $D_{[…]}$ is the […]-projection database of $D$ (i.e. $D_{[…],\Sigma_1}$ to be precise), and $(D_{[…]})_{[…]}$ is the […]-projection database of $D_{[…]}$. The edges show partial enumerations. At the top level, the mining is partitioned into mining $G(1, D)$ and $G(2, D)$ with $\Sigma_1 = \{a, b\}$ and $\Sigma_2 = \{c, d\}$, respectively. Five out of the six partial enumerations from $\Sigma_1 = \{a, b\}$ are frequent and have non-empty projection databases: $D_{[…]}$, $D_{[…]}$, $D_{[…]}$, $D_{[…]}$ and $D_{[…]}$. Notice that the recursive mining from $D_{[…]}$ excludes partial enumerations from $\Sigma_1$, because the length of […] is less than the size of $\Sigma_1$. On the other hand, the recursive mining from $D_{[…]}$ uses the partial enumerations from both $\Sigma_1 = \{a, b\}$ and $\Sigma_2 = \{c, d\}$, because the length of […] is two. Figure 4 shows the 29 non-empty frequent patterns mined from the database $D$ with $\xi = 100\%$. They are (in the order they are found): […].

When the sizes of all the symbol subsets $\Sigma_j$ ($j = 1, \ldots, l$) are one, the mining with partial enumeration shown in Figure 3 degenerates to the pattern-growth mining without enumeration in Figure 2.

The for loop at line 1 of the algorithm in Figure 3 enumerates all partial enumerations $p$ from $\Sigma_j$. If a partial enumeration $p$ is not frequent, any partial enumeration that has $p$ as its subsequence cannot be frequent either and, therefore, should not be considered. To prune away these infrequent partial enumerations, we need an apriori enumerator to enumerate only those partial enumerations whose known subsequences are all frequent. Figure 5 shows our final partial enumeration mining algorithm using an apriori partial enumerator.

    function Pattern-Grow(pattern q, int i, database D) {
        F ← ∅;
        for j = 1, l do
            if (j ≠ i) then
                Construct apriori enumerator Enum(Σ_j);
    1:          while ((p ← Enum.Next()) is not null) do
    2:              if (Sup_D(p) ≥ η) then
    3:                  Construct the p-projection database D_{p,Σ_j};
                        Call Enum.Confirm(p) to report p is frequent;
                        F ← F ∪ {q·p};
                        if (|p| = |Σ_j|) then
                            F' ← Pattern-Grow(q·p, 0, D_{p,Σ_j});
                        else
                            F' ← Pattern-Grow(q·p, j, D_{p,Σ_j});
                        F ← F ∪ F';
                endwhile
                Delete apriori enumerator Enum;
        endfor
        return F;
    }

Figure 5: Pattern-Growth with Apriori Partial Enumerators

The apriori partial enumerator has the following functions:

Next() returns the next partial enumeration in $E(\Sigma_j)$ after pruning away the infrequent partial enumerations in between. This method returns null if all the candidates are exhausted.

Confirm(p) confirms that $p$ is frequent. It is invoked only after the Next() which returns $p$ and before the next call of Next(). A partial enumeration $p$ is considered infrequent if the enumerator does not receive Confirm(p) before the next call of Next().

    class Enum
        W: working queue of ⟨p, Σ'⟩;
        P: working queue of sequences;
        Q: working queue of sequences;
        Lmax: integer;

        constructor Enum(Σ_j) begin
            Create empty W, P, Q;
            W.Enqueue(⟨ε, Σ_j⟩);
            Lmax ← |Σ_j|;
        end

        function Confirm(q) begin
            Q.Enqueue(q);
        end

        function Next() begin
            if (P is not empty) then return P.Dequeue();
            if (Q is not empty) then
                Let Q = (p·s_1, …, p·s_m);
                for j = 1, m do
                    W.Enqueue(⟨p·s_j, {s_1, …, s_m}⟩);
                endfor
                Empty Q;
            if (W is empty) then return null;
            while ((P is empty) and (W is not empty)) do
                ⟨p, Σ'⟩ ← W.Dequeue();
                if (|p| < Lmax) then
                    for each s in Σ' do
                        P.Enqueue(p·s);
                    endfor
            endwhile
            if (P is not empty) then return P.Dequeue();
            return null;
        end
    endclass

Figure 8: Apriori Partial Enumerator

3. IMPLEMENTATION

3.1 Database Representation and Projection

We use the First-Occurrence Linked WAP tree (FLWAP-tree) [3] as the database representation in our implementation. The web access mining using the FLWAP-tree [3] is a pattern-growth algorithm without enumeration, as shown in Figure 2. The FLWAP-tree is the basic WAP-tree of the database [1, 5] with the first-occurrences of each symbol linked together. For a simple database $D = \{[…]\}$ (with $\Sigma = \{a, b, c\}$), its FLWAP-tree is shown in Figure 6(a). Each web sequence in the database is represented by a path from the root node to the node of the last symbol of the sequence. Each node has a label for the name of the symbol and a count for the number of sequences that share the common prefix from the root node up to and including this node. The count in the root node is the total number of sequences, including empty sequences, in the database.

[Figure 6: The FLWAP-trees. (a) $D$; (b) the original projection database; (c) the smaller projection database]

Given a database $D$ represented by FLWAP-tree $T$ and a pattern $p$ from $\Sigma_j$, the function to find the support $Sup_D(p)$ and make the $p$-projection database $D_{p,\Sigma_j}$ (if $Sup_D(p) \ge \eta$) is shown in Figure 7.

    function Proj(T, p, support, Σ_j) {
        P ← ∅;
        Follow the first-occurrence links to find first-occurrences of the first symbol of p;
    1:  Traverse the tree from these first occurrences to find all the p-prefixes and add their last nodes to P;
        support ← sum of the counts of all nodes in P;
        if support ≥ η then
            Q ← ∅;
            if (|p| < |Σ_j|) then
    2:          for each node in P do
                    traverse its subtrees, passing the nodes of Σ_j, and add the first non-Σ_j nodes to Q;
    3:          endfor
            else
                for each node in P do
                    add its children nodes to Q;
                endfor
            Construct new FLWAP-tree T' for the sequences represented by the subtrees rooted at the nodes in Q and make support the count of its root node;
            return the projection database T';
        return the empty projection database;
    }

Figure 7: Making Projection Database $D_{p,\Sigma_j}$

After the last nodes of the $p$-prefixes are found (stored in set $P$) in line 1 in Figure 7, the support, $Sup_D(p)$, is known as the sum of the counts of these nodes. If the support is below $\eta$, we do not need to construct the projection database.

The code in Figure 7 also makes the $p$-projection databases smaller if $p$ is shorter than the size of $\Sigma_j$. If $p$ is shorter than the size of $\Sigma_j$, $p$ will not grow with the partial enumerations from $\Sigma_j$, according to the algorithm in Figure 3. Therefore, the symbols from $\Sigma_j$ at the beginning of the projections can be safely removed without any impact on the further mining. This is what is done by the code between lines 2 and 3. For example, the original projection database $D_{[…],\{[…]\}}$ is the one shown in Figure 6(b). The code in Figure 7 actually builds the smaller tree shown in Figure 6(c).

We use function Proj(T, p, support, Σ_j) in Figure 7 to find the support, $Sup_D(p)$, for line 2 of the code in Figure 5 and to construct the projection database $D_{p,\Sigma_j}$ for line 3 if the support is above the absolute threshold $\eta$.
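The Next()/Confirm() protocol of the apriori partial enumerator in Figure 8 can be sketched in Python as below. This is a rough transcription under assumptions, not the authors' code: patterns are strings of one-character symbols, Python deques stand in for the working queues, and the helper `candidates`, which drives the enumerator with a caller-supplied frequency test, is hypothetical.

```python
from collections import deque


class AprioriEnumerator:
    """Enumerates E(subset) while skipping extensions of unconfirmed patterns."""

    def __init__(self, subset):
        self.W = deque([("", tuple(sorted(subset)))])  # pairs (p, extension symbols)
        self.P = deque()                               # candidates ready to hand out
        self.Q = []                                    # candidates confirmed frequent
        self.lmax = len(subset)

    def confirm(self, q):
        self.Q.append(q)

    def next(self):
        if self.P:
            return self.P.popleft()
        if self.Q:
            # Everything in Q extends the same parent p; only the last symbols
            # of the confirmed patterns are worth using as further extensions.
            last_symbols = tuple(q[-1] for q in self.Q)
            for q in self.Q:
                self.W.append((q, last_symbols))
            self.Q = []
        while not self.P and self.W:
            p, symbols = self.W.popleft()
            if len(p) < self.lmax:
                self.P.extend(p + s for s in symbols)
        return self.P.popleft() if self.P else None


def candidates(subset, is_frequent):
    """Return the candidates actually tested, in order (Python 3.8+ for ':=')."""
    enum = AprioriEnumerator(subset)
    tried = []
    while (p := enum.next()) is not None:
        tried.append(p)
        if is_frequent(p):
            enum.confirm(p)
    return tried
```

With the subset `{"a", "b"}` and a test that accepts only `a` and `aa`, the enumerator hands out `a`, `b` and `aa` and never generates `ab`, `ba` or `bb`, since `b` was not confirmed.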
3.2 Apriori Partial Enumerator

We use a simple apriori enumeration strategy [4] for our apriori partial enumerator, as follows: if $p \cdot a$ is frequent and $p \cdot b$ is not frequent ($p$ is a pattern and $a$ and $b$ are symbols), we do not need to consider $p \cdot a \cdot b$ when extending $p \cdot a$, because $p \cdot a \cdot b$ cannot be frequent due to the fact that $p \cdot b$ is not frequent.

Figure 8 shows our algorithm for the apriori partial enumerator. The enumerator works through three working lists as follows:

$P$ is the buffer queue that stores the candidate partial enumerations ready to be picked up by the calls of Next().

$Q$ is the buffer queue for the partial enumerations that have been confirmed frequent. These sequences are deposited by the calls of Confirm(q).

$W$ is the working list of pairs $\langle p, \Sigma' \rangle$, where $p$ is a partial enumeration and $\Sigma'$ is a subset of $\Sigma_j$ holding the symbols to extend $p$. The extended partial enumerations will be put in $P$.

Initially, $W$ contains $\langle \epsilon, \Sigma_j \rangle$ only. All the other $\langle p, \Sigma' \rangle$ pairs in $W$ are formed by using the contents of $Q$, as shown in function Next(). When $P$ is empty and $Q$ is not empty, function Next() will empty the contents of $Q$. Let the contents of $Q$ be $p \cdot s_1, \ldots, p \cdot s_m$. According to the apriori enumeration strategy described above, only the partial enumerations extended from $p \cdot s_j$ ($1 \le j \le m$) with elements $s_1, \ldots, s_m$ need to be considered. Thus, $m$ pairs $\langle p \cdot s_l, \{s_1, \ldots, s_m\} \rangle$ ($1 \le l \le m$) will be enqueued to $W$.

4. EXPERIMENTAL EVALUATION

We conducted the experimental evaluation of our implementation by partitioning the symbol set $\Sigma$ of size $N$ into $\lceil N/k \rceil$ subsets, with the first $\lfloor N/k \rfloor$ subsets having $k$ symbols and the last subset $(N \bmod k)$ symbols. When $k = 1$, the mining degenerates to the pattern-growth mining without enumeration using the FLWAP-tree [3]. When $k = N$, the mining is the breadth-first full apriori mining². We vary the value of $k$ from 1 to $N$, and our purpose is to see how the different values of $k$ impact the performance.

We used the IBM data generator [8] to generate 12 datasets with $N = 10, 15$ and the average length of the sequences in the database $C = 4, 6, 8, 10, 12, 14$. The number of sequences in each database is $|D| = 1000$.
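The partitioning used in the experiments can be sketched as a simple chunking of the symbol list. This is only an illustration: which symbols end up in which subset is an assumption here (consecutive symbols are grouped), and `partition_symbols` is a hypothetical name.

```python
def partition_symbols(symbols, k):
    """Split symbols into ceil(N/k) subsets: full subsets of size k, then the rest."""
    symbols = list(symbols)
    return [symbols[i:i + k] for i in range(0, len(symbols), k)]
```

For N = 10 and k = 3 this gives three subsets of three symbols and a last subset of 10 mod 3 = 1 symbol; k = 1 gives N singleton subsets (plain pattern-growth) and k = N gives one subset holding all of Σ (full apriori enumeration).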
All the tests on these datasets were run on an Intel Pentium III processor of 497 MHz CPU with 512KB cache and 256MB RAM using the threshold $\xi = 0.005$. To compare the performance of the mining with partial enumeration ($k > 1$) with the mining without enumeration ($k = 1$), we calculate the speedup of partial enumeration as follows:

$$Sp(k) = \frac{T_1}{T_k}$$

where $T_1$ is the measured mining time without enumeration with $k = 1$ and $T_k$ is the measured mining time with partial enumeration with $k \ge 1$. The speedups $Sp(k)$ over $k = 1$ for $N = 10$ and $N = 15$ are plotted in Figures 9(a) and 9(b), respectively.

The experimental results show that the mining with partial enumeration outperforms the mining without enumeration for the datasets with a large $C/N$ ratio. For the datasets of $N = 10$, the partial enumeration improves the performance for $C = 8, 10, 12, 14$ with the $C/N$ ranging from 0.8 to 1.4. The best performances are given by $k = 3$ for all $C = 8, 10, 12, 14$. The highest speedup is 3.48 for $C/N = 1.4$ and $k = 3$.

For the datasets of $N = 15$, the partial enumeration improves the performance for $C = 10, 12, 14$ with the $C/N$ between 0.67 and 0.96. The best performances are given by $k = 2$ for all $C = 10, 12, 14$. However, for $C = 14$ ($C/N = 0.98$) the speedup of $k = 3$ is close to that of $k = 2$. The highest speedup is 2.73 for $C/N = 0.98$ and $k = 2$. This speedup is still higher than the speedup of 2.17 for $N = 10$ and $C/N = 1.0$. We expect that the speedup for $N = 15$ will be higher when $C/N$ goes beyond 0.98.

² This breadth-first full apriori enumeration mining differs from the depth-first full apriori mining [4] in that the maximum depth to which we accumulate the information about infrequent sequences is $N$.

We can also observe that the larger the $C/N$ ratio, the higher the speedup achieved. We also see that the larger the $C/N$, the larger the range of the values of $k$ that give a speedup greater than one.

[Figure 9: Speedup of Partial Enumeration over k=1. (a) Speedup over k=1 for N=10 and t=0.005 (N10D1Kt0.005); (b) Speedup over k=1 for N=15 and t=0.005 (N15D1Kt0.005). Each panel plots speedup against k, with one curve for each of C = 4, 6, 8, 10, 12, 14.]

5. RELATED WORK AND CONCLUSION

The idea of using partial enumeration to partition the search space was started in our early work on frequent itemset mining [9]. This paper is the first work on partial enumeration for frequent web access pattern mining. As in the frequent itemset mining with partial enumeration [9], our partitioning of the frequent web access patterns is disjoint. That is, there is no overlap among the frequent patterns mined from the projection databases. This is important because it guarantees that there is no redundant work done in the mining.

The closest work related to ours is [10], which combines GSP with an FP-growth algorithm (called GenPrefixSpan [11]) in an algorithm called SPaRSe. However, no evidence is found that their partitioning of frequent patterns is disjoint.

We have presented a frequent web access pattern mining algorithm with partial enumeration which unifies the pattern-growth algorithm and the apriori enumeration algorithm. The performance evaluation shows that the mining with partial enumeration can speed up the pattern-growth algorithm by a factor as large as 3.48 for databases of long sequences. The best performance for databases of long sequences is given neither by the pattern-growth algorithm nor by the full apriori enumeration algorithm, but rather by mining with partial enumeration in the middle.

6. REFERENCES

[1] Jian Pei, Jiawei Han, Behzad Mortazavi-asl, and Hua Zhu. Mining access patterns efficiently from web logs. In Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'00), pages 396-407, 2000.

[2] C.I. Ezeife and Yi Lu. Mining web log sequential patterns with position coded pre-order linked WAP-tree. International Journal of Data Mining and Knowledge Discovery, 10:5-38, 2005.

[3] Peiyi Tang, Markus P. Turkia, and Kyle A. Gallivan. Mining web access patterns with first-occurrence linked WAP-tree. Technical Report titus.compsci.ualr.edu/~ptang/papers/flwap-rpt.pdf, Department of Computer Science, University of Arkansas at Little Rock, 2006.

[4] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu. Sequential pattern mining using a bitmap representation. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 429-435, 2002.

[5] Myra Spiliopoulou and Lukas C. Faulstich. WUM: A tool for web utilization analysis. In Proceedings of EDBT Workshop WebDB'98, 1998.

[6] C.I. Ezeife and Min Chen. Mining web sequential patterns incrementally with revised PLWAP tree. In Proceedings of the 5th International Conference on Web-Age Information Management (WAIM 2004), pages 539-548, 2004.

[7] Peiyi Tang and Markus P. Turkia. Mining frequent web access patterns with partial enumeration. Technical Report titus.compsci.ualr.edu/~ptang/papers/fwap-perpt.pdf, Department of Computer Science, University of Arkansas at Little Rock, 2006.

[8] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the International Conference on Extending Database Technology, pages 3-17, 1996.

[9] Peiyi Tang and Markus P. Turkia. Mining frequent itemsets with partial enumeration. In Proceedings of the 44th Annual Association for Computing Machinery Southeast Conference (ACMSE'06), pages 180-185, Melbourne, Florida, USA, March 2006.

[10] Claudia Antunes and Arlindo L. Oliveira. Sequential pattern mining algorithms: Trade-offs between speed and memory. In Proceedings of the Second Workshop on Mining Graphs, Trees and Sequences at the 15th European ECML and the 8th European PKDD, 2004.

[11] Claudia Antunes and Arlindo L. Oliveira. Generalization of pattern-growth methods for sequential pattern mining with gap constraints. In Proceedings of the 2003 International Conference on Machine Learning and Data Mining, pages 239-251, 2003.