Mining Probabilistic Association Rules from Uncertain Databases with Pruning

Size: px

Start display at page:

Download "Mining Probabilistic Association Rules from Uncertain Databases with Pruning"

Jane Lucas
6 years ago
Views:

1 Miig Probabilistic Associatio Rules from Ucertai Databases with Pruig Erich A. Peterso Departmet of Computer Sciece Uiversity of Arkasas at Little Rock Little Rock, AR 7224 Liag Zhag Departmet of Biological Statistics ad Computatioal Biology Corell Uiversity Peiyi Tag Departmet of Computer Sciece Uiversity of Arkasas at Little Rock Little Rock, AR Ithaca, NY Abstract-I this paper, we rigorously defie the problem of miig probabilistic associatio rules from ucertai databases. We further aalyze the probability distributio space of a cadidate probabilistic associatio rule, ad propose a efficiet miig algorithm with pruig to fid all probabilistic associatio rules from ucertai databases. I. INTRODUCTION I recet years, a great deal of iterest i miig probabilistic frequet itemsets (p-fis) from ucertai databases has bee geerated [1]-[6], [8]. Sice the ladmark work of Berecker et al. [3], almost all the research i miig p-fis uses a probabilistic model based o possible world sematics. While frequet item sets are useful, the most COlmo purpose for miig frequet itemsets is the to fid associatio rules from them. While geeratig associatio rules from frequet itemsets is straightforward for certai databases [7], to fid probabilistic associatio rules from probabilistic frequet itemsets is ot trivial. I this paper, we give a rigorous defiitio of a probabilistic associatio rule (p-ar), based o possible world sematics, usig two ratios: (1) a frequecy ratio ad (2) a cofidece ratio 'fi, both betwee ad 1. Usig two ratios istead of oe (cofidece ratio), as i [8], we foud that if'fl <, X =} Y is always a probabilistic associatio rule with respect to a miimum probability threshold miprob, give that X U Y is a probabilistic frequet itemset with respect to ad miprob. This observatio offers the meas of a extremely useful pruig techique whe performig probabilistic associatio rule miig. I other words, as log as 'fi < ad Z is a probabilistic frequet itemset with respect to, X =} Z - X is guarateed to be a probabilistic associatio rule for ay XcZ. We also derive the recursive equatio for the probability distributio of a p-ar ad show a dyamic prograrmig algorithm for miig p-ars. We further itroduce additioal pruig techiques i the miig algorithm based o the value of 'fi which reduces the computatio required to determie whether a cadidate is a probabilistic associatio rule. The rest of the paper is orgaized as follows: Sectio II itroduces the ucertai database model used i this paper, probabilistic support based o possible world sematics, ad the probabilistic frequet itemset. This sectio also icludes our defiitio of a probabilistic associatlo rule usig two ratios. Sectio III aalyzes the probability distributio of associatio rules ad presets a theorem of pruig whe 'fi <. Sectio IV presets the dyamic programmig algorithm to determie whether a cadidate is a probabilistic associatio rule ad a algorithm for eumeratig cadidates, both with pruig. Sectio V describes the results of our prelimiary evaluatio of our miig algorithm. Sectio VI provides a discussio of related ad future work. Sectio VII cocludes the paper. II. A. Ucertai Database Model PRELIMINARIES A ucertai database of size deoted as T is a set of trasactios, T = {t 1,..., t}. A trasactio t j (1 :s; j :s; ) is a set of items, each of which has a o-zero probability o < p :s; 1 that it exists i the trasactio tj. That is, p = Pr(x E tj) for ay item x cotaied i tj (x E tj). The items i a trasactio with a existetial probability betwee ad 1 are called ucertai items, whereas the itemset with existetial probability 1 are called certai items. Whe defiig probabilistic frequet itemsets ad associatio rules from ucertai databases, the priciples ad theories of possible world sematics must be used. Uder possible world sematics, a possible world is a certai database istatiated from the ucertai database by icludig or ot icludig every ucertai item. If the assumptio of idepedece betwee the trasactios i T, as well as the items i each trasactio holds, the probability of each possible world w ca be calculated as follows: Pr(w) = II (II Pr(x E t f ). II (1- Pr(x E t f ))) (1) tet(w) xet x<1-t where T( w) is the certai database of possible world w, t is a trasactio i T( w), t f is the correspodig trasactio i the ucertai database T from which t is istatiated, ad x is a ucertai item of t f. I the traditioal certai database model, a itemset is either cotaied i a trasactio or ot. I the ucertai database model, we caot say itemset X is cotaied i trasactio tj with certaity if X cotais at least oe ucertai item, eve though X <;;; tj. What we have is a probability that X is cotaied i trasactio tj, Pr(X <;;; tj). Accordig to

possible world sematics, this is the margial probability that X is cotaied i tj as follows Pr(w) (2) where W is the set of all possible worlds ad tj (w) is the j-th trasactio i the certai database of

Probabilistic Support ad Frequet ltemsets Give a ucertai database of size,

::; i ::;, deoted as Pi (X), is the margial probability (4) Pr(w) (5) wew,sup(x,t(w)) = i where Sup(X, T(w)) is the support of X i the certai database T( w) of possible world w, which is the umber of

2 possible world sematics, this is the margial probability that X is cotaied i tj as follows Pr(w) (2) where W is the set of all possible worlds ad tj (w) is the j-th trasactio i the certai database of possible world w. From (1) ad (2), this margial probability ca be expressed as follows [3]: Pr(X <:;; tj) = II Pr(x E tj) (3) xex The probability that tj does ot cotai X thus is B. Probabilistic Support ad Frequet ltemsets Give a ucertai database of size, T, ad a itemset X, the probabilistic support of X i T, deoted as Sup(X, T), is a radom variable that ca have values, 1,...,. Accordig to possible world sematics, the probability that Sup(X, T) = i, ::; i ::;, deoted as Pi (X), is the margial probability (4) Pr(w) (5) wew,sup(x,t(w)) = i where Sup(X, T(w)) is the support of X i the certai database T( w) of possible world w, which is the umber of trasactios i T ( w) that cotai X. Pi (X) for i =,..., is the probability mass fuctio (pmt) of the support of X i T ad we have L:7=oPi(X) = l. Let S be a set of i trasactios from T. The probability that every trasactio i S cotais itemset X ad every trasactio i T -S does ot cotai X is II Pr(X <:;; t) II (1 -Pr(X <:;; t)) tes tet-s This is obtaied by sulmig the probability of each possible world i which every trasactio i S cotais itemset X ad every trasactio i T -S does ot cotai X, usig (1), (3) ad (4). Sice there are may differet subsets S <:;; T with size i, the probability of Sup(X, T) = i, Pi(X), is the summatio of the probabilities above over differet subsets S [3]: Pi(X) = L II Pr(X <:;; t) II (l -Pr(X <:;; t)) (6) S,; T, IISII i tes tet-s Give a miimum support ratio E (,1), a itemset X is a probabilistic frequet itemset (p-fi) if the probability that Sup(X, T) is greater tha or equal to IITII (i.e. ) is above a miimum probability threshold deoted as miprob. Sice Pi(X) is the pmf of Sup(X, T), the probability that Sup(X, T) is greater tha or equal to is L:7=Il Pi(X), Hece, we say that X is a frequet itemset if ad oly if L:7=Il Pi(X) ;::: miprob. Computig the pmf Pi(X) usig (6) is expesive. Berecker et al. proposed a dyamic programmig algorithm to calculate Pi (X) to mie probabilistic frequet itemsets efficietly i [3]. C. Probabilistic Associatio Rules (p-ars) Give a ucertai database T of size, a miimum frequecy E (,1), a miimum cofidece T) E (,1), ad a miimum probability threshold miprob, X =? Y is a probabilistic associatio rule (p-ar) if ad oly if X Y =, ad the probability that Sup(X U Y, T) ;::: ad Sup(XUY, T) ;::: T)Sup(X, T) is greater or equal to miprob. Accordig to possible worlds sematics, the probability that Sup(X UY, T) ;::: ad Sup(X UY, T) ;::: T)Sup(X, T), deoted as P(X, Y, C T), T), is the sum of the probability of all the small possible worlds w satisfyig Sup(XUY, T(w)) ;::: ad Sup(XUY, T(w)) T)Sup(X, T(w)), where T(w) is the certai database of small world w. That is, P(X, Y,, T), T) = Pr(w) we W, Sup(X U Y, T(w»), Sup(X U Y, T(w)) 2: "Sup(X, T(w)) (7) The problem of miig probabilistic associatio rules, is to fid all the probabilistic associatio rules from a ucertai base T, give the miimum frequecy, the miimum cofidece T), ad the miimum probability threshold miprob. III. PROBABILITY DISTRIBUTION FOR ASSOCIATION RULES Calculatig P(X, Y, C T), T) usig (7) is ifeasible, as the umber of possible worlds are too umerous. Thus, we eed to fid aother way to calculate it. Let us cosider the probability that a trasactio tj cotais both X ad Y with X Y = : Pr(X uy <:;; tj) II Pr(z E tj) zexuy II Pr(x E tj) II Pr(y E tj) xex The third equality above is due to X Y = ad the fourth follow (3). yey Pr(X <:;; tj )Pr(Y <:;; tj) (8) ad the secod The probability that a trasactio tj cotais X ad but ot Y is, thus, Pr(X <:;; tj 1\ Y tj) = Pr(X <:;; t)(l -Pr(Y <:;; t)) (9) because Pr(X <:;; tj 1\ Y t j) + Pr(X <:;; tj 1\ Y <:;; t = j) Pr(X <:;; tj). Cosider the possible worlds w where j trasactios cotai both X ad Y ad aother i trasactios cotai X but ot Y. These j + i trasactios are the trasactios cotaiig X ad the rest -(j +i) trasactios do ot cotai X. Thus, the support of X ad Xu Y i theses possible worlds are j + i ad j, respectively. Let S1 ad S2 be the sets of the trasactios that cotai X but ot Y, ad X ad Y, respectively ad we have IISIII = i ad IIS211 = j. Let Sl ad S2 also deote the same sets of trasactios i the ucertai database. The, the probability that the i trasactios of S1 cotai X but ot Y,

j j o (1 - 'I]) (a) whe rt > t; (b) whe rt :S t; Fig. 1.

II Pr(X <:;; t)pr(y <:;; t). II tet-5s-51 accordig to (9), (8) ad (4).

i, deoted as Pi, j(x, Y), is give as follows: Pi, j(x, Y ) II Pr(X <:;; t)(l -Pr(Y <:;; t)). 31 C;; T, te51 113111 = i, 32 C;; T, 113211 = j II Pr(X <:;; t)pr(y <:;; t).

Also, we have L Pi, j(x, Y) = 1. O::;i, j::;, j+i::; Therefore, the probability that Sup(X U Y, T)?: ad Sup(X U Y, T)?

3 j j o (1 - 'I]) (a) whe rt > t; (b) whe rt :S t; Fig. 1. Probability Distributio of Pi,j (X, Y) the j trasactios of S2 cotai both X ad Y, ad the rest of -(j + i) trasactios do ot cotai X is as follows: II Pr(X <:;; t)(l -Pr(Y <:;; t)). II Pr(X <:;; t)pr(y <:;; t). II tet-5s-51 accordig to (9), (8) ad (4). (1 -Pr(X <:;; t)) Further, the probability that there are j trasactios i T cotaiig both X ad Y ad i trasactios cotaiig X but ot Y, or equivaletly the Sup(X U Y, T) = j ad Sup(X, T) -Sup(X U Y, T) = i, deoted as Pi, j(x, Y), is give as follows: Pi, j(x, Y ) II Pr(X <:;; t)(l -Pr(Y <:;; t)). 31 C;; T, te = i, 32 C;; T, = j II Pr(X <:;; t)pr(y <:;; t). II (1 -Pr(X <:;; t)) (1) Obviously, j + i :s: ad :s: i,j :s:. Pi, j (X, Y) is actually the pmf of the umber of trasactios cotaiig X but ot Y ad the umber of trasactios cotaiig both X ad Y. Also, we have L Pi, j(x, Y) = 1. O::;i, j::;, j+i::; Therefore, the probability that Sup(X U Y, T)?: ad Sup(X U Y, T)?: T)Sup(X, T), (deoted as P(X, Y,, T), T» is the sum of those Pi, j (X, Y) satisfyig both j?: ad j?: T)( j + i). Sice j?: T)(j + i) is equivalet to j?: 1: 1) i, we have P(X, Y,, T), T) = ::; j ::;, j + i ::; j2:: i,i2:: Pi, j(x, Y) (11) Figure I(a) shows the poits where Pi, j (X, Y) is defied. The shaded area is the itersectio of the two triagles (1) j?:, i?: ad j + i :s: ad (2) j?: 1: 1) i, i?: ad j + i :s:. The summatio of the Pi, j(x, Y) of the poits i the shaded area gives P(X, Y, C T), T) as show i (11). X =} Y is a probabilistic associatio rule if ad oly if this P(X, Y,, T), T) is greater or equal to the miimum probability threshold miprob. Notice that the itersectio of lies j = ad j+i = -2ryi is (i, j) = ((1 -T)), T)). Thus, whe T) :s:, the itersectio of the two triagles is the first triagle j?:, i?: ad j + i :s: itself as illustrated i Figure I(b). The summatio of the Pi, j (X, Y) i this triagle is the probability that j = Sup(X U Y, T)?:. If X U Y is a p-fi with respect to defied i Sectio II-B ad T) :s:, the summatio of Pi, j (X, Y) i this triagle is greater tha or equal to miprob. Thus, P(X, Y, C T), T)?: miprob ad X =} Y is a p-ar with respect to ad T). We sulmarize this fidig i the followig theorem. Theorem 1: If itemset Z is a probabilistic frequet itemset with respect to the miimum frequecy ad the miimum probability threshold miprob ad the miimum cofidece T) is less tha or equal to, X =} Y is a probabilistic associatio rule with respect to, T) ad miprob for ay X C Z ad Y = Z -X. Proof If Z = X U Y is a probabilistic frequet itemset with respect to ad miprob, the the probability that Sup(X U Y, T)?:, where = IITII, is greater tha or equal to miprob. The probability that Sup(X U Y, T)?: is equal to L i2:,j+ is;, Pi, j(x,y). We thus have ::; j ::; L i >, j + i <, ::; j ::;- Pi, j(x, Y)?: miprob.

= We ow show that j?:, j + i ::; ad?: T) implies j?: I'/ First we have i::; - j::; - = (1 -O. Sice j?: ad?: T), we have t > _ = > _ T) i -(1 -O 1 - - 1 -T) which gives j?: -2rii. Sice j?:, j + i::; ad?

Pi,j(X, Y) Thus, X =} Y is a probabilistic associatio rule with respect to, T) ad miprob.

MINING PROBABILISTIC ASSOCIATION RULES A. Dyamic Programmig Algorithm The complexity of computig Pi,j(X, Y) directly by (1) is expoetial with. We ca use dyamic programmig method to reduce it to O(3).

4 = We ow show that j?:, j + i ::; ad?: T) implies j?: I'/ First we have i::; - j::; - = (1 -O. Sice j?: ad?: T), we have t > _ = > _ T) i -(1 -O T) which gives j?: -2rii. Sice j?:, j + i::; ad?: T) implies j?: -2rii, we have i 2':, j + i, ' ::s:: j ::s::, j 2': -2ryi Pi,j(X, Y) 'i 2:::, j + i :-:;, :::; j :::; > miprob. Pi,j(X, Y) Thus, X =} Y is a probabilistic associatio rule with respect to, T) ad miprob. Theorem 1 tells us that if T) ::; ad Z is a p-fi, we ca simply eumerate all the subset X of Z ad every X =} Z -X is a p-ar; thus, we eed oly mie p-ars whe T) >. IV. MINING PROBABILISTIC ASSOCIATION RULES A. Dyamic Programmig Algorithm The complexity of computig Pi,j(X, Y) directly by (1) is expoetial with. We ca use dyamic programmig method to reduce it to O(3). Let Tk be a ucertai database made of the first k trasactios {tl,..., td from T. That is, To = ad Tk = Tk-I U {tk} for k = 1,...,. Let the probability that there are j trasactios i Tk cotaiig both X ad Y ad i trasactios cotaiig X but ot Y, or equivaletly Sup(XUY, Tk l = j ad Sup(X, Tk) -SUp(XUY, Tk) = i, be deoted as p (X, Y). Obviously, i + j::; k ad ::; i,j::; k i p (X, Y) ad we have P (k i,j ) (X, Y) = 1. ""5. i, j ""5. k,j+i""5. k (12) That is, all the p (X, Y) outside of the triagle of ::; i, j::; k,j+i::; k are zeros. Notice that P)(X, Y) is the Pi,j(X, Y) give i (1). Sice Tk = Tk-I U {td, p(x, Y) ca be calculated from p -I )(X, Y) usig the followig recursio: P (k i, j ) (X, Y) p,(x, Y)Pr(X tk)(1 -Pr(Y tk)) p-=- I )(X, Y)Pr(X tk)pr(y tk) + p -I )(X, Y)(I -Pr(X tk)) (13) with pl!'i)x, Y) = ad P2I (X, Y) = due to (12). This is because trasactio tk either cotais both X ad Y, or cotais X but ot Y, or does ot cotai X at all. The recursive equatio (13) above is derived from the Law of Total Probability ad the fact that trasactio tk is idepedet from all the trasactios i Tk-I. I particular, whe the case is that Tk has j trasactios cotaiig both X ad Y ad i trasactios cotaiig X but ot Y, oly oe of the followig situatios will be true: (1) tj does ot cotai X ad Tk-I has j trasactios cotaiig both X ad Y ad i trasactios cotaiig X but ot Y, (2) tj cotais both X ad Y ad Tk-I has j -1 trasactios cotaiig both X ad Y ad i trasactios cotaiig X but ot Y, or (3) tj cotais X but ot Y ad Tk-I has j trasactios cotaiig both X ad Y ad i -I trasactios cotaiig X but ot Y. Note that the values of p -I \X, Y) are oly used to calculate the values of p (X, Y). Oce used, they do ot eed to exist. Thus, we ca use two 2-dimesioal data arrays ( fo[i,j]) ad (h[i,j]) to store values of p(x, Y) for eve ad odd k, respectively. The followig triple-ested loop will calculate ad leave p)(x, Y) i f mod 2[i,j]: { 1 if i = Iitialize array fo with fo [i, j j] = o' therwlse Iitialize array h with zero for all elemets; for k = for i =, for j = l, k, k -i tmp = fk-i mod 2[i,j](1 -pr); if (i > ) tmp += f(k-i ) mod 2[i-1,j]pr(1-pk); if (j > ) tmp += f(k-i ) mod 2[i,j -l]prpk; fk mod 2[i,j] = tmp; where pr ad Pk are the shorthads for Pr(X tk) ad Pr(Y tk), respectively. The ested loop above is the dyamic programmig method to calculate p (X, Y) for k = 1,..., usig (13). The if statemets i the loop body are to deal with boudary coditios for p (X, Y), where i = or j = which require values PI!'IJ) (X, Y) ad p/) (X, Y). These values should be all zeros, because either of i ad j ca be egative. Iteratio k (k = 1,...,) of the outer-most loop calculates each p (X, Y) ad stores them i ik mod 2[i, j] i the triagle delieated by ::; i, j::; k, j+ i::; k. It uses the p -I )(X, Y) values calculated by the previous iteratio k - 1 ad stores them i f(k-i ) mod 2[i,j] i the triagle delieated by ::; i,j ::; k -l,j + i ::; k - 1 (whe k > 1), as well as the iitial values stored i the f(k-i ) mod 2[i,j] o the diagoal of j + i = k. The iitial values o the diagoal j + i = k of f(k-i ) mod 2[i,j] are p -I )(X, Y) with j + i = k > k -1. They should be zeros, because j + i caot exceed k -1 i p -I )(X, Y). (Also see (12». The iitial value of fo[o,o] is P6 6(X, Y). Accordig to (12), this value should be 1. Thus, the array fo should be iitialized as follows.. { I if i = j = fort,] otherwise ad h should be iitialized with all zero. B. Pruig To determie whether X =} Y is a p-ar with respect to,t), ad miprob (whe T) > O-we ca either sum the values of Pi, j (X, Y) of the poits i the shaded area as i Figure l(a) ad see if the sum is greater tha or equal to miprob, or sum the values of Pi, j (X, Y) of the poits i the =

5 = u-shaded area to see if the sum is less tha or equal to 1 - miprob. The itersectio of the lies j+i = ad j = l rj i is (i,j) = ((I-7]),7]) as show i Figure lea). To compute the sum of the values of Pi,j (X, Y) i the shaded area, we oly eed to calculate the values of Pi,j (X, Y) i the trapezoid bouded by i =, i = (1-7]), j =, ad j = - i. If we wat to calculate the sum of the values OfPi,j(X, Y) i the ushaded area, we oly eed to calculate the values of Pi,j (X, Y) i the trapezoid bouded by j =, j = 7],, ad i = i = - j. We wat to chose the smaller trapezoid. The areas of trapezoids for computig the shaded ad u-shaded areas are ( l- rj2) 2 d ( 2 rj - rj2) 2. I 2 a 2 respective y. Th us, I 'f 2 7] > _ 1, l.e.. ' 7]?:.5, we chose to calculate the trapezoid for the shaded area. Otherwise, we calculate the trapezoid for the u-shaded area. The dyamic programmig algorithm with pruig to determie whether X =} Y is a p-ar is show i Figure 2. It is assumed that fuctio p-ar(x, Y, C 7], miprob) is called oly whe X Y = ad X U Y is a p-fi with respect to ad miprob. C. Fidig Associatio Rules For X =} Y with X Y = to be a p-ar with respect to, 7] ad miprob, X U Y has to be a p-fi with respect to ad miprob. We use the apriori set eumeratio [7] ad Berecker's dyamic programmig algorithm [3] to fid probabilistic frequet itemsets (p-fis) Z with respect to ad miprob. For each p-fi Z, we use a algorithm to eumerate the cadidates X, Y with Xu Y = Z ad X Y =. Figure 3 shows the eumeratio tree of p-ar cadidates X, Y for p-fi Z = abed. At each iteratio of the eumeratio, a item is moved from atecedet X which is less tha all the items i cosequet Y ad add it to Y, to be a child of X, Y. Formally, let X = Xl... X ad Y = YI... Ym ad items i X ad Y are all ordered accordig to the total order of the items, i.e. Xi < Xj ad Yi < Yj for i < j. The X, Y has oly those childre Xl... Xi-IXi+1... X, XiYI... Ym such that Xi < YI Notice that the cosequets i each of the descedats of X, Y are supersets of Y. Accordig to the ati-mootoicity priciple proved by [8], if X =} Y is ot a p-ar, oe of the descedats of X, Y would be a p-ar ad, thus, the etire subtree rooted at ode X, Y ca be prued out. V. PERFORMANCE EVALUATION Here some prelimiary tests are dissemiated, whose results were culled from a implemetatio of the p-ar miig algorithm described above. The ucertai database we used was coverted from the certai database che s s from the Frequet Itelset Miig Dataset Repository fimi.ua.ac.be/data/. The trasformatio of the che s s database was performed as follows: for each item i each trasactio, we assig a radom probability P betwee o ad 1 draw from Beta distributio with ex = 1 ad (3 = 1, to be used as the item's existetial probability. This procedure was ru five times to produce five radom datasets. Each experimet (data poit i the performace evaluatio) was ru o all five datasets ad the results of each averaged. To reduce the amout of time eeded to ru each experimet, the total umber of uique items was reduced to 14, oly the boolea p-ar(x, Y,, 7], miprob) { } if (7] :s; ) retur true; Iitialize array fo with fo [i, j] = { I if i = j. o th erwlse Iitialize array h with zero for all elemets; float sum = ; if (7]?:.5) for k = for i = for j = l, O,mi(k, l(1-7])j), k - i tmp = fk-l mod 2[i,j](1 -p; if (i > ) tmp =+ f ( k-l ) mod 2[i-l,j]p,; (1 -p; if (j > ) tmp =+ f ( k-l ) mod 2[i,j -1]p';Pk; fk mod 2 [i, j] = tmp; if (k = /\ j?: i7] l/\ j?: i2ry i l ) sum += fk mod 2[i,j]; if (sum?: miprob) retur true; ed if retur false; else for k = for i = for j = l,, O,mi(k -i, l7]j) tmp = fk-l mod 2[i,j](1 -p,;); if (i > ) tmp =+ f ( k-l ) mod 2[i-l,j]p,; (1 -p; if (j > ) tmp =+ f ( k-l ) mod 2[i,j -1]p';Pk; fk mod 2 [i, j] = tmp; if (k = /\ (j < l7] J V j < l2ry ij ) sum += fk mod 2[i,j]; if (sum> 1 - miprob) retur false; ed if retur true; ed if Fig. 2. Fig. 3. Dyamic Programmig Algorithm abed, abc, d abd, e aed, b bcd, a ab, cd ae, bd be, ad ad, be bd, ae cd, ab a, bed b, aed e, abd d, abc, abed Eumeratio of AR Cadidates

6 TABLE I. AVERAGE TIME IN p-aro (MS) ' first 1 trasactios were used, ad oly rules X =} Y of total legth 6 (legth of X plus the legth of Y) or less were cosidered as possible p-ars. The aforemetioed restrictios were ecessary because miig probabilistic associatio rules is extremely time-expesive: to fid all probabilistic frequet itemsets is already expoetial. The, to eumerate allocatio rule cadidates for each frequet itemset as show i Figure 3 is expoetial with respect to the legth of the frequet itemset. Berecker et al.'s dyamic programmig algorithm [3] is used to fid probabilistic frequet itemsets. For each probabilistic frequet itemset foud, we follow the eumeratio i Figure 3 with apriori pruig ad call fuctio p-aro i Figure 2 to determie whether a cadidate is a p-ar. We measured two times i our aalysis: (1) the total time of miig all probabilistic associatio rules ad (2) the average time of ruig fuctio p-aro i Figure 2. The tests are ru o a Dell Precisio T75 machie with Itel Xeo CPU (E562) 2.4 GHz wi 4 Cores ad 48GB RAM. I all experimets we fix miprob =.9, ad vary (from.1 to.9 i.1 icremets) ad 'T} (from.1 to.95 i.5 icremets). I Table I the average time i millisecods eeded to determie whether a cadidate is a probabilistic associatio rule (i.e., the average time take to execute the fuctio p AR()) for various values of 'T} ad. I the table, oe sees the drastic effect of pruig accordig to the value of 'T} with respect to. I particular, oe sees that whe 'T} <, that the time is virtually zero, as the fuctio p-aro simply returs t rue. Oe also sees the effect of the differig areas of the trapezoid beig computed whe 'T} >. For =.1,.2,.3 ad.4, times all peeked at 'T} =.45. We see that the time is decreasig as 'T} decreases from.45 to. This is because the trapezoid coverig the ushaded area, which is computed whe 'T} <.5, is decreasig as 'T} decreases. The time is also decreasig as 'T} icreases from.5 to.95. This is because the trapezoid coverig the shaded area, which is computed whe 'T} 2:.5, is decreasig as 'T} icreases. We also see the time is decreasig as 'T} icreases from to.95 for =.5,...,.9 for the same reaso. Table II shows the total average time of miig probabilistic frequet itemsets (p-fis) ad probabilistic associatio rules (p-ars) i secods. Oe ca see that whe 'T} is held costat, TABLE II. AVERAGE TOTAL EXECUTION TIME (SECS) ' I I I I the total miig time decreases over icreasig values of, because for lower values of far more p-fis are geerated. Whe is held costat, we see agai that for values of > 'T}, the total time is small because of the ear zero time eeded to retur t rue from p-aro; however, we see that as 'T} icreases past, total time decreases as the umber of cadidate p-ars decreases. For example, for =.1, the total miig time decreases as 'T} icreases from.25 to.95. This is because with a higher 'T} fewer cadidate p-ars are geerated. The low total time for 'T} =.1 is because the ruig time of fuctio p-aro is very ear zero. Similarly for =.4, we see that the total time decreases as 'T} icreases from.45 to.95. The total times for.1 'T}.45 are low; agai, because of the ear zero time eeded to retur true from p-aro. VI. RELATED AND FUTURE WORK Most work i miig ucertai data based o possible world sematics are cetered aroud fidig probabilistic frequet itemsets [3], [6], [8] or probabilistic frequet closed itemsets [4], [5]. I [8], Su et al. preseted a algorithm for miig probabilistic associatio rules. The algorithm dissemiated i [8], uses parameter misup, a iteger betwee

7 o ad the size of database for the ummum support, ad micof (7] i our paper), a real umber betwee ad 1 for the miimum cofidece, to defie a probabilistic associatio rule, whereas we use two ratios (miimum frequecy) ad 7] (miimum cofidece), both betwee ad 1. As a cosequece, we are able to fid the mathematical relatioship betwee the two parameters (Theorem 1) by aalyzig the probability distributio space, ad use it to prue the p-ar search space sigificatly whe 7]. We further prue the search space by choosig the smaller area of the probability distributio space depedig o the value of 7], whereas [8] always calculates the probability distributio i the shaded area o matter how small 7] is. The associatio rule probability distributio i [8] is calculated from itemset support probability distributio, usig de-covolutio through Fast Fourier Trasformatio. I this paper, we use the two-dimesioal dyamic programmig approach to calculate a cadidate associatio rule's probability distributio directly. The theoretical time complexity of the algorithm i [8] is ( 2 Jog), where ours is (3) ad ( 1), whe 7] > ad 7], respectively. Eve whe 7] >, sice our algorithm has deep pruig with 7] close to 1 or, it is our belief that our algorithm will perform well, particularly whe 7] is close to 1 or O. Oe future work, is to compare the performace betwee the two algorithms usig practical experimets-usig databases of differet sizes ad usig varyig thresholds. [8] Liwe Su, Reyold Cheg, David W. Cheug, ad Jiefeg Cheg. Miig ucertai data with probabilistic guaratees. I Proceedigs of the 16th ACM SIGKDD Iteratioal Coferece o Kowledge Discovery ad Data Miig, KDD ' 1, pages , New York, NY, USA, 21. ACM. VII. CONCLUSION I this paper, we defied probabilistic associatio rules (p-ars) usig two miimum ratios, miimum frequecy ad miimum cofidece 7]. We thoroughly aalyzed the probability distributio space for a cadidate associatio rule, ad proposed a efficiet dyamic programmig algorithm with pruig for miig probabilistic associatio rules. [l] REFERENCES CK Chui, B Kao, ad E Hug. Miig frequet itemsets from ucertai data. Advaces i Kowledge Discovery ad Data Miig, pages 47-58, 27. [2] C Chui ad B Kao. A decremetai approach for miig frequet itemsets from ucertai data. PAKDD '8, Proceedigs of the 12th Pacific-Asia coferece o Advaces i kowledge discovery ad data miig, pages 64-75, Ja 28. [3] Thomas Bemecker, Has-Peter Kriegel, Matthias Rez, Floria Verhei, ad Adreas Zuefte. Probabilistic frequet itemset miig i ucertai databases. I Proceedigs of the 15th ACM SIGKDD iteratioal coriferece o Kowledge discovery ad data miig, KDD '9, pages , New York, NY, USA, 29. ACM. [4] Peiyi Tag ad Erich A. Peterso. Miig probabilistic frequet closed itemsets i ucertai databases. I Proceedigs of the 49-th Aual Associatio for Computig Machiery Southeast Coferece (ACMSE'll), pages 86-91, Keesaw, GA, USA, March 211. [5] Yogxi Tog, Lei Che, ad Boli Dig. Discoverig threshold-based frequet closed itemsets over probabilistic data. I Data Egieerig (ICDE), 212 IEEE 28th Iteratioal Coferece o, pages , 212. [6] Liag Wag, D.W.-L. Cheug, R. Cheg, Sau-Da Lee, ad X.S. Yag. Efficiet miig of frequet item sets o large ucertai databases. Kowledge ad Data Egieerig, ieee Trasactios o, 24(12): , 212. [7] Rakesh Agrawal ad Ramakrisha Srikat. Fast algorithms for miig associatio rules. I Proceedigs of the 2th Iteratioal Coferece o Very Large Data Bases, pages , 1994.

Recursive Algorithm for Generating Partitions of an Integer. 1 Preliminary

Recursive Algorithm for Generating Partitions of an Integer. 1 Preliminary Recursive Algorithm for Geeratig Partitios of a Iteger Sug-Hyuk Cha Computer Sciece Departmet, Pace Uiversity 1 Pace Plaza, New York, NY 10038 USA scha@pace.edu Abstract. This article first reviews the