AN EFFICIENT PROCEDURE FOR MINING STATISTICALLY SIGNIFICANT FREQUENT ITEMSETS. Predrag Stanišić and Savo Tomović

Size: px

Start display at page:

Download "AN EFFICIENT PROCEDURE FOR MINING STATISTICALLY SIGNIFICANT FREQUENT ITEMSETS. Predrag Stanišić and Savo Tomović"

Morris Farmer
5 years ago
Views:

1 PUBLICATIONS DE L INSTITUT MATHÉMATIQUE Nouvelle série, tome 87(101) (2010), DOI: /PIM S AN EFFICIENT PROCEDURE FOR MINING STATISTICALLY SIGNIFICANT FREQUENT ITEMSETS Predrag Staišić ad Savo Tomović Commuicated by Žarko Mijajlović Abstract. We suggest the origial procedure for frequet itemsets geeratio, which is more efficiet tha the appropriate procedure of the well kow Apriori algorithm. The correctess of the procedure is based o a special structure called Rymo tree. For its implemetatio, we suggest a modified sort-merge-joi algorithm. Fially, we explai how the support measure, which is used i Apriori algorithm, gives statistically sigificat frequet itemsets. 1. Itroductio Associatio rules have received lots of attetio i data miig due to their may applicatios i marketig, advertisig, ivetory cotrol, ad umerous other areas. The motivatio for discoverig associatio rules has come from the requiremet to aalyze large amouts of supermarket basket data. A record i such data typically cosists of the trasactio uique idetifier ad the items bought i that trasactio. Items ca be differet products which oe ca buy i supermarkets or o-lie shops, or car equipmet, or telecommuicatio compaies services etc. A typical supermarket may well have several thousad items o its shelves. Clearly, the umber of subsets of the set of items is immese. Eve though a purchase by a customer ivolves a small subset of this set of items, the umber of such subsets is very large. For example, eve if we assume that o customer has more tha five items i her/his shoppig cart, there are 5 ( ) i=1 = possible cotets of this cart, which correspods to the subsets havig o more tha five items of a set that has 10,000 items, ad this is ideed a large umber. The supermarket is iterested i idetifyig associatios betwee item sets; for example, it may be iterestig to kow how may of the customers who bought 2010 Mathematics Subject Classificatio: Primary 03B70; Secodary 68T27, 68Q17. Key words ad phrases: data miig, kowledge discovery i databases, associatio aalysis, Apriori algorithm. 109 i

2 110 STANIŠIĆ AND TOMOVIĆ bread ad cheese also bought butter. This kowledge is importat because if it turs out that may of the customers who bought bread ad cheese also bought butter, the supermarket will place butter physically close to bread ad cheese i order to stimulate the sales of butter. Of course, such a piece of kowledge is especially iterestig whe there is a substatial umber of customers who buy all three items ad a large fractio of those idividuals who buy bread ad cheese also buy butter. For example, the associatio rule bread cheese butter [support = 20%, cofidece = 85%] represets facts: 20% of all trasactios uder aalysis cotai bread, cheese ad butter; 85% of the customers who purchased bread ad cheese also purchased butter. The result of associatio aalysis is strog associatio rules, which are rules satisfyig a miimal support ad miimal cofidece threshold. The miimal support ad the miimal cofidece are iput parameters for associatio aalysis. The problem of associatio rules miig ca be decomposed ito two subproblems [1]: Discoverig frequet or large itemsets. Frequet itemsets have a support greater tha the miimal support; Geeratig rules. The aim of this step is to derive rules with high cofidece (strog rules) from frequet itemsets. For each frequet itemset l oe fids all oempty subsets of l; for each a l a oe geerates the rule a l a, if support(l) support(a) > miimal cofidece. We do ot cosider the secod sub-problem i this paper, because the overall performaces of miig associatio rules are determied by the first step. Efficiet algorithms for solvig the secod sub-problem are exposed i [12]. The paper is orgaized as follows. Sectio 2 provides formalizatio of frequet itemsets miig problem. Sectio 3 describes Apriori algorithm. Sectio 4 presets our cadidate geeratio procedure. Sectio 5 explais how to use hypothesis testig to validate frequet itemsets. 2. Prelimiaries Suppose that I is a fiite set; we refer to the elemets of I as items. Defiitio 2.1. A trasactio dataset o I is a fuctio T : {1,..., } P (I). The set T (k) is the k-th trasactio of T. The umbers 1,..., are the trasactio idetifiers (TIDs). Give a trasactio data set T o the set I, we would like to determie those subsets of I that occur ofte eough as values of T. Defiitio 2.2. Let T : 1,..., P (I) be a trasactio data set o a set of items I. The support cout of a subset K of the set of items I i T is the umber suppcout T (K) give by: suppcout T (K) = {k 1 k K T (k)}. The support of a item set K (i the followig text istead of a item set K we will use a itemset K) is the umber: supp T (K) = suppcout T (K)/.

3 MINING STATISTICALLY SIGNIFICANT FREQUENT ITEMSETS 111 The followig rather straightforward statemet is fudametal for the study of frequet itemsets. Theorem 2.1. Let T : 1,..., P (I) be a trasactio data set o a set of items I. If K ad K are two itemsets, the K K implies supp T (K ) supp T (K). Proof. The theorem states that supp T for a itemset has the ati-mootoe property. It meas that support for a itemset ever exceeds the support for its subsets. For proof, it is sufficiet to ote that every trasactio that cotais K also cotais K. The statemet from the theorem follows immediately. Defiitio 2.3. A itemset K is μ-frequet relative to the trasactio data set T if supp T (K) mu. We deote by F μ T the collectio of all μ-frequet itemsets relative to the trasactio data set T ad by F μ T,r the collectio of μ-frequet itemsets that cotai r items for r 1 (i the followig text we will use r-itemset istead of itemset that cotais r items). Note that F μ T = r 1 F μ T,r. If μ ad T are clear from the cotext, the we will omit either or both adormets from this otatio. 3. Apiori algorithm We will briefly preset the Apriori algorithm, which is the most famous algorithm i the associatio aalysis. The Apriori algorithm iteratively geerates frequet itemsets startig from frequet 1-itemsets to frequet itemsets of maximal size. Each iteratio of the algorithm cosists of two phases: cadidate geeratio ad support coutig. We will briefly explai both phases i the iteratio k. I the cadidate geeratio phase potetially frequet k-itemsets or cadidate k-itemsets are geerated. The ati-mootoe property of the itemset support is used i this phase ad it provides elimiatio or pruig of some cadidate itemsets without calculatig its actual support (cadidate cotaiig at least oe ot frequet subset is prued immediately, before support coutig phase, see Theorem 2.1). The support coutig phase cosists of calculatig support for all previously geerated cadidates. I the support coutig phase, it is essetial to efficietly determie if the cadidates are cotaied i particular trasactio, i order to icrease their support. Because of that, the cadidates are orgaized i hash tree [2]. The cadidates, which have eough support, are termed as frequet itemsets. The Apriori algorithm termiates whe oe of the frequet itemsets ca be geerated. We will use C T,k to deote cadidate k-itemsets which are geerated i the iteratio k of the Apriori algorithm. Pseudocode for the Apriori algorithm comes ext. Algorithm: Apriori Iput: A trasactio dataset T ; Miimal support threshold μ Output: F μ T

4 112 STANIŠIĆ AND TOMOVIĆ Method: 1. C T,1 = {{i} i I} 2. F μ T,1 = {K C T,1 supp T (K) μ} 3. for (k =2; F μ T,k 1 ; k++) C T,k = apriori_ge(f μ T,k 1 ) F μ T,k = {K C T,k supp T μ} ed for 4. F μ T = r 1 F μ T,r The apriori_ge is a procedure for cadidate geeratio i the Apriori algorithm [2]. I the iteratio k, it takes as argumet F μ T,k 1 the set of all μ-frequet (k 1)-itemsets (which is geerated i the previous iteratio) ad returs the set C T,k. The set C T,k cotais cadidate k-itemsets (potetially frequet k-itemsets) ad accordigly it cotais the set F μ T,k the set of all μ-frequet k-itemsets. The procedure works as follows. First, i the joi step, we joi F μ T,k with itself: INSERT INTO C T,k SELECT R 1.item 1, R 1.item 2,..., R 1.item k 1, R 2.item k 1 FROM F μ T,k 1 AS R 1, F μ T,k 1 AS R 2 WHERE R 1.item 1 = R 2.item 1 R 1.item k 2 = R 2.item k 2 R 1.item k 1 < R 2.item k 1 I the previous query the set F μ T,k is cosidered as a relatioal table with (k 1) attributes: item 1, item 2,..., item k 1. The previous query requires k 2 equality comparisos ad is implemeted with ested-loop joi algorithm i the origial Apriori algorithm from [2]. The ested-loop joi algorithm is expesive, sice it examies every pair of tuples i two relatios: for each l 1 F μ T,k 1 for each l 2 F μ T,k 1 if l 1 [1] = l 2 [1] l 1 [k 2] = l 2 [k 2] l 1 [k 1] < l 2 [k 1] ew_cad ={l 1 [1], l 1 [2],..., l 1 [k 1], l 2 [k 1]} Cosider the cost of ested-loop joi algorithm. The umber of pairs of itemsets (tuples) to be cosidered is F μ T,k 1 * F μ T,k 1, where F μ T,k 1 deotes the umber of itemsets i F μ T,k 1. For each itemset from F μ T,k 1, we have to perform a complete sca o F μ T,k 1. I the worst case the buffer ca hold oly oe block of F μ T,k 1, ad a total of F μ T,k 1 * b + b block trasfers would be required, where deotes the umber of blocks cotaiig itemsets of F μ T,k 1. Next i the prue step, we delete all cadidate itemsets c C T,k such that some (k 1)-subset of c is ot i F μ T,k 1 (see Theorem 2.1): for each c C T,k for each (k -1)-subset s i c if s / F μ T,k 1 delete c from C T,k I the ext sectio we suggest a more efficiet procedure for cadidate geeratio phase ad a efficiet algorithm for its implemetatio.

5 MINING STATISTICALLY SIGNIFICANT FREQUENT ITEMSETS New cadidate geeratio procedure We assume that ay itemset K is kept sorted accordig to some relatio <, where for all x, y K, x < y meas that the object x is i frot of the object y. Also, we assume that all trasactios i the database T ad all subsets of K are kept sorted i lexicographic order accordig to the relatio <. For cadidate geeratio we suggest the origial method by which the set C T,k is calculated by joiig F μ T,k 1 with F μ T,k 2, i the iteratio k, for k 3. A cadidate k-itemset is created from oe large (k 1)-itemset ad oe large (k 2)-itemset i the followig way. Let X = {x 1,..., x k 1 } F μ T,k 1 ad Y = {y 1,..., y k 2 } F μ T,k 2. Itemsets X ad Y are joied if ad oly if the followig coditio is satisfied: (4.1) x i = y i, (1 i k 3) x k 1 < y k 2 producig the cadidate k-itemset {x 1,..., x k 2, x k 1, y k 2 }. We will prove the correctess of the suggested method. I the followig text we will deote this method by C T,k = F μ T,k 1 F μ T,k 2. Let I = i 1,..., i be a set of items that cotais elemets. Deote by G I = (P (I), E) the Rymo tree (see Appedix) of P (I). The root of the tree is. A vertex K = {i p1,..., i pk } with i p1<i p2 < <ip k has i pk childre K j, where i pk < j. Let S r be the collectio of itemsets that have r elemets. The ext theorem suggest a techique for geeratig S r startig from S r 1 ad S r 2. Theorem 4.1. Let G be the Ryma tree of P (I), where I = i 1,..., i. If W S r, where r 3, the there exists a uique pair of distict sets U S r 1 ad V S r 2 that has a commo immediate acestor T S r 3 i G such that U V S r 3 ad W = U V. Proof. Let u ad v ad p be the three elemets of W that have the largest, the secod-largest ad the third-largest subscripts, respectively. Cosider the sets U = W {u} ad V = W {v, p}. Note that U S r 1 ad V S r 2. Moreover, Z = U V belogs to S r 3 because it cosists of the first r 3 elemets of W. Note that both U ad V are descedats of Z ad that U V = W (for r = 3 we have Z = ). The pair (U, V ) is uique. Ideed, suppose that W ca be obtaied i the same maer from aother pair of distict sets U 1 is r 1 ad V 1 S r 2 such that U 1 ad V 1 are immediate descedats of a set Z 1 S r 3. The defiitio of the Rymo tree G I implies that U 1 = Z 1 {i m, i q } ad V 1 = Z 1 {i y }, where the letters i Z 1 are idexed by a umber smaller tha mi{m, q, y}. The, Z 1 cosists of the first r 3 symbols of W, so Z 1 = Z. If m < q < y, the m is the third-highest idex of a symbol i W, q is the secod-highest idex of a symbol i W ad y is the highest idex of a symbol i W, so U 1 = U ad V 1 = V. The followig theorem directly proves correctess of our method C T,k = F μ T,k 1 F μ T,k 2. Theorem 4.2. Let T be a trasactio data set o a set of items I ad let k N such that k > 2. If W is a μ-frequet itemset ad W = k, the there exists

6 114 STANIŠIĆ AND TOMOVIĆ a μ-frequet itemset Z ad two itemsets {i m, i q } ad {i y } such that Z = k 3, Z W, W = Z {i m, i q, i y } ad both Z {i m, i q } ad Z {i y } are μ-frequet itemsets. Proof. If W is a itemset such that W = k, tha we already kow that W is the uio of two subsets U ad V of I such that U = k 1, V = k 2 ad that Z = U V has k 3 elemets (it follows from Theorem 2). Sice W is a μ-frequet itemset ad Z, U, V are subsets of W, it follows that each of these sets is also a μ-frequet itemset (it follows from Theorem 1). We have see i the previous sectio that i the origial Apriori algorithm from [2], the joi procedure is suggested. It geerates cadidate k-itemset by joiig two large (k 1)-itemsets, if ad oly if they have first (k 2) items i commo. Because of that, each joi operatio requires (k 2) equality comparisos. If a cadidate k-itemset is geerated by the method C T,k = F μ T,k 1 F μ T,k 2 for k 3, it is eough (k 3) equality comparisos to process. The method C T,k = F μ T,k 1 F μ T,k 2 ca be represeted by the followig SQL query: INSERT INTO C T,k SELECT R 1.item 1,..., R 1.item k 1, R 2.item k 2 FROM F μ T,k 1 AS R 1, F μ T,k 2 AS R 2 WHERE R 1.item 1 = R 2.item 1 R 1.item k 3 = R 2.item k 3 R 1.item k 1 < R 2.item k 2 For the implemetatio of the joi C T,k = F μ T,k 1 F μ T,k 2 we suggest a modificatio of sort-merge-joi algorithm (ote that F μ T,k 1 ad F μ T,k 2 are sorted because of the way they are costructed ad lexicographic order of itemsets). By the origial sort-merge-joi algorithm [9], it is possible to compute atural jois ad equi-jois. Let r(r) ad s(s) be the relatios ad R S deote their commo attributes. The algorithm keeps oe poiter o the curret positio i relatio r(r) ad aother oe poiter o the curret positio i relatio s(s). As the algorithm proceeds, the poiters move through the relatios. It is supposed that the relatios are sorted accordig to joiig attributes, so tuples with the same values o the joiig attributes are i a cosecutive order. Thereby, each tuple eeds to be read oly oce, ad, as a result, each relatio is also read oly oce. The umber of block trasfers is equal to the sum of the umber of blocks i both sets F μ T,k 1 ad F μ T,k 2, b1 + b2. We have see that ested-loop joi requires F μ T,k 1 * b1 + b1 block trasfers ad we ca coclude that merge-joi is more efficiet ( F μ T,k 1 * b1 b2 ). The modificatio of sort-merge-joi algorithm we suggest refers to the elimiatio of restrictios that a joi must be atural or equi-joi. First, we separate the coditio (4.1): (4.2) (4.3) x i = y i, 1 i k 3 x k 1 < y k 2.

7 MINING STATISTICALLY SIGNIFICANT FREQUENT ITEMSETS 115 Joiig C T,k = F μ T,k 1 F μ T,k 2 is calculated accordig to the coditio (4.2), i other words we compute the atural joi. For this, the described sort-merge-joi algorithm is used, ad our modificatio is: before X = {x 1,..., x k 1 } ad Y = {y 1,..., y k 2 }, for which X F μ T,k 1 ad Y F μ T,k 2 ad x i = y i, 1 i k 3 is true, are joied, we check if coditio (4.3) is satisfied, ad after that we geerate a cadidate k-itemset {x 1,..., x k 2, x k 1, y k 2 }. The pseudocode of apriori_ge fuctio comes ext. FUNCTION apriori_ge(f μ T,k 1, F μ T,k 2 ) 1. i = 0 2. j = 0 3. while i F μ T,k 1 j F μ T,k 2 iset 1 = F μ T,k 1 [i + +] S = {iset 1 } doe = false while doe = false i F μ T,k 1 iset 1a = F μ T,k 1 [i + +] if iset 1a [w] = iset 1 [w], 1 w k 2 the S = S {iset 1a } i + + else doe = true ed if ed while iset 2 = F μ T,k 2 [j] while j F μ T,k 2 iset 2[1,..., k 2] iset 1 [1,..., k 2] iset 2 = F μ T,k 2 [j + +] ed while while j F μ T,k 2 iset 1[w] = iset 2 [w], 1 w k 2 for each s S if iset 1 [k 1] iset 2 [k 2] the c = {iset 1 [1],..., iset 1 [k 1], iset 2 [k 2]} if c cotais-ot-frequet-subset the DELETE c else C T,k = C T,k {c} ed if ed for j++ iset 2 = F μ T,k 2 [j] ed while ed while We implemeted the origial Apriori [2] ad the algorithm proposed here. Algorithms are implemeted i C i order to evaluate its performaces. Experimets are performed o PC with a CPU Itel(R) Core(TM)2 clock rate of 2.66GHz ad

8 116 STANIŠIĆ AND TOMOVIĆ with 2GB of RAM. Also, ru time used here meas the total executio time, i.e., the period betwee iput ad output istead of CPU time measured i the experimets i some literature. I the experimets dataset which ca be foud o is used. It cotais biary trasactios. The average legth of trasactios is 8. Table 1 shows that the origial Apriori algorithm from [2] is outperformed by the algorithm preseted here. Table 1. Executio time i secods mi sup(%) Apriori Rymo tree based implemetatio Validatig frequet itemsets I this sectio we use hypothesis testig to validate frequet itemsets which have bee geerated i the Apriori algorithm. Hypothesis testig is a statistical iferece procedure to determie whether a hypothesis should be accepted or rejected based o the evidece gathered from the data. Examples of hypothesis tests iclude verifyig the quality of patters extracted by may data miig algorithms ad validatig the sigificace of the performace differece betwee two classificatio models. I hypothesis testig, we are usually preseted with two cotrastig hypothesis, which are kow, respectively, as the ull hypothesis ad the alterative hypothesis. The geeral procedure for hypothesis testig cosists of the followig four steps: Formulate the ull ad alterative hypotheses to be tested. Defie a test statistic θ that determies whether the ull hypothesis should be accepted or rejected. The probability distributio associated with the test statistic should be kow. Compute the value of θ from the observed data. Use the kowledge of the probability distributio to determie a quatity kow as p-value. Defie a sigificace level a which cotrols the rage of θ values i which the ull hypothesis should be rejected. The rage of values for θ is kow as the rejectio regio. Now, we will back to miig frequet itemsets ad we will evaluate the quality of the discovered frequet itemsets from a statistical perspective. The criterio for judgig whether a itemset is frequet depeds o the support of the itemset. Recall that support measures the umber of trasactios i which the itemset is actually observed. A itemset X is cosidered frequet i the data set T, if supp T (X) > mi sup, where misup is a user-specified threshold.

9 MINING STATISTICALLY SIGNIFICANT FREQUENT ITEMSETS 117 The problem ca be formulated ito the hypothesis testig framework i the followig way. To validate if the itemset X is frequet i the data set T, we eed to decide whether to accept the ull hypothesis, H 0 : supp T (X) = mi sup, or the alterative hypothesis H 1 : supp T (X) > mi sup. If the ull hypothesis is rejected, the X is cosidered as frequet itemset. To perform the test, the probability distributio for supp T (X) must also be kow. Theorem 5.1. The measure supp T (X) for the itemset X i trasactio data set T has the biomial distributio with mea supp T (X) ad variace 1 (supp T (X)* (1 supp T (X))), where is the umber of trasactios i T. Proof. We will use measure suppcout T (X) ad calculate the mea ad the variace for it ad later derive mea ad variace for the measure supp T (X). The measure suppcout T (X) = X presets the umber of trasactios i T that cotai itemset X, ad supp T (X) = suppcout T (X)/ (Defiitio 2.2). The measure X is aalogous to determiig the umber of heads that shows up whe tossig cois. Let us calculate E(X ) ad D(X ). Mea is E(X ) = * p, where p is the probability of success, which meas (i our case) the itemset X appears i oe trasactio. Accordig to the Beroulli law it is: ε > 0, lim N P { X / p ε} = 1. Freely speakig, for large (we work with large databases so ca be cosidered large), we ca use relative frequecy istead of probability. So, we ow have: For variace we compute: E(X ) = p X D(X ) = p(1 p) X ( 1 X = X ) = X ( X ) Now we compute E(supp T (X)) ad D(supp T (X)). Recall that supp T (X) = X /. We have: ( X ) E(supp T (X)) = E = 1 E(X ) = X = supp T (X). ( X ) D(supp T (X)) = D = 1 2 D(X ) = 1 X ( X ) 2 = 1 supp T (X)(1 supp T (X)) The biomial distributio ca be further approximated usig a ormal distributio if is sufficietly large, which is typically the case i associatio aalysis. Regardig the previous paragraph ad Theorem 5.1, uder the ull hypothesis supp T (X) is assumed to be ormally distributed with mea mi sup ad variace 1 (mi sup(1 mi sup)). To test whether the ull hypothesis should be accepted or rejected, the followig statistic ca be used: (5.1) W N = supp T (X) mi sup (mi sup(1 mi sup))/.

10 118 STANIŠIĆ AND TOMOVIĆ The previous statistic, accordig to the Cetral Limit Theorem, has the distributio N(0, 1). The statistic essetially measures the differece betwee the observed support supp T (X) ad the mi sup threshold i uits of stadard deviatio. Let N = 10000, supp T (X) = 0.11, mi sup = 0.1 ad α = The last parameter is the desired sigificace level. It cotrols Type 1 error which is rejectig the ull hypothesis eve though the hypothesis is true. I the Apriori algorithm we compare supp T (X) = 0.11 > 0.1 = mi sup ad we declare X as frequet itemset. Is this validatio procedure statistically correct? Uder the hypothesis H 1 statistics W is positive ad for the rejectio regio we choose R = {(x 1,..., x ) w > k}, k > 0. Let us fid k = P H0 {W > k} P H0 {W > k} = k = 3.09 Now we compute w = ( ) ( 0.1*(1 0.1) ) 1/ = We ca see that w > k, so we are withi the rejectio regio ad H 1 is accepted, which meas the itemset X is cosidered statistically sigificat. 6. Appedix Rymo tree was itroduced i [8] i order to provide a uified search-based framework for several problems i artificial itelligece; the Rymo tree is also useful for data miig algorithms. I Defiitios 6.1 ad 6.2 we defie ecessary cocepts ad i Defiitio 6.3 we defie the Rymo tree. Defiitio 6.1. Let S be a set ad let d : S N be a ijective fuctio. The umber d(x) is the idex of x S. If P S, the view of P is the subset view(d, P ) = {s S d(s) > max p P d(p)}. Defiitio 6.2. A collectio of sets C is hereditary if U C ad W U implies W C. Defiitio 6.3. Let C be a hereditary collectio of subsets of a set S. The graph G = (C, E) is a Rymo tree for C ad the idexig fuctio d if: the root of the G is the childre of a ode P are the sets of the form P {s}, where s view(d, P ) If S = {s 1,..., s } ad d(s i ) = i for 1 i, we will omit the idexig fuctio from the defiitio of the Rymo tree for P (S). A key property of a Rymo tree is stated ext. Theorem 6.1. Let G be a Rymo tree for a hereditary collectio C of subsets of a set S ad a idexig fuctio d. Every set P of C occurs exactly oce i the tree. Proof. The argumet is by iductio o p = P. If p = 0, the P is the root of the tree ad the theorem obviously holds.

11 MINING STATISTICALLY SIGNIFICANT FREQUENT ITEMSETS 119 Suppose that the theorem holds for sets havig fewer tha p elemets, ad let P C be such that P = p. Sice C is hereditary, every set of the form P {x} with x P belogs to C ad, by the iductive hypothesis, occurs exactly oce i the tree. Let z be the elemet of P that has the largest value of the idex fuctio d. The view(p {z}) cotais z ad P is a child of the vertex P {z}. Sice the paret of P is uique, it follows that P occurs exactly oce i the tree. Note that i the Rymo tree of a collectio of the form P (S), the collectio of sets of S r that cosists of sets located at distace r from the root deotes all subsets of the size r of S. Refereces [1] R. Agrawal, R. Srikat, Fast algorithms for miig associatio rules, Proc. VLDB-94 (1994), [2] F. P. Coee, P. Leg, S. Ahmed, T-trees, vertical partitioi-g ad distributed associatio rule miig, Proc. It. Cof. Discr. Math (2003), [3] F. P. Coee, P. Leg, S. Ahmed, Data structures for associatio rule miig: T-trees ad P-trees, IEEE Tras. Data Kowledge Egi. 16(6), (2004), [4] F. P. Coee, P. Leg, G. Goulboure, Tree structures for miig associatio rules, J. Data Mi. Kowledge Discovery 8(1) (2004), [5] G. Goulboure, F. P. Coee, P. Leg, Algorithms for computig associatio rules usig a partial-support tree, J. Kowledge-based Syst. 13 (1999), [6] G. Grahe, J. Zhu, Efficietly usig prefix-trees i miig frequet itemsets, Proc. IEEE ICDM Workshop Frequet Itemset Miig Implemetatios (2003). [7] J. Ha, J. Pei, P.S. Yu, Miig frequet patters without cadidate geeratio, Proc. ACM SIGMOD Cof. Maagemet of Data, (2000), [8] R. Rymo, Search through systematic set eumeratio, Proc. 3rd It. Cof. Priciples of Kowledge Represetatio ad Reasoig, (1992), [9] A. Silberschatz, H. F. Korth, S. Sudarsha, Database System Cocepts, McGraw Hill, New York (2006) [10] A. D. Simovici, C. Djeraba, Mathematical tools for data miig (Set Theory, Partial Orders, Combiatorics), Spriger-Verlag, Lodo, [11] P. Staišić, S. Tomović, Apriori multiple algorithm for miig associatio rules, Iformatio Techology ad Cotrol 37(4) (2008), [12] P. N. Ta., M. Steibach, V. Kumar, Itroductio to Data Miig, Addiso Wesley, Departmet of Mathematics ad Computer Sciece (Received ) Uiversity of Moteegro (Revised ) Podgorica Moteegro pedjas@ac.me savotom@gmail.com

CS284A: Representations and Algorithms in Molecular Biology

CS284A: Representations and Algorithms in Molecular Biology CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by