Efficient discovery of the top-k optimal dependency rules with Fisher's exact test of significance


1 Efficient discovery of the top-k optimal dependency rules with Fisher's exact test of significance. Wilhelmiina Hämäläinen, Department of Computer Science, University of Helsinki, Finland, whamalai@cs.helsinki.fi. ICDM

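2 Problem. Given a set of binary attributes $R = \{A_1, \ldots, A_k\}$, search for the most significant positive and negative dependency rules $X \rightarrow A$, where $X \subsetneq R$ and $A \in R \setminus X$. Notation: $A$ stands for $A = 1$ and $\neg A$ for $A = 0$. There is a negative dependency between $X$ and $A$ if $P(XA) < P(X)P(A)$, which is equivalent to $P(X \neg A) > P(X)P(\neg A)$, i.e. a positive dependency between $X$ and $\neg A$. It is therefore enough to search for positive dependencies $X \rightarrow A = a$ for any consequence $A = a$, $A \in R$, $a \in \{0, 1\}$.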
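The equivalence between a negative dependency on a consequence and a positive dependency on its complement can be checked directly from counts. The following is a minimal sketch, not part of the slides: the function name `dependency_sign` and its parameters (row count `n` and absolute frequencies `m_x`, `m_a`, `m_xa`) are illustrative assumptions.

```python
def dependency_sign(n, m_x, m_a, m_xa):
    """Classify the dependency between X and A from absolute frequencies.

    Compares P(XA) = m_xa / n with P(X)P(A) = (m_x / n) * (m_a / n).
    Multiplying both sides by n*n gives an exact integer comparison:
    n * m_xa  versus  m_x * m_a.
    """
    lhs = n * m_xa
    rhs = m_x * m_a
    if lhs > rhs:
        return "positive"
    if lhs < rhs:
        return "negative"
    return "independent"


# Hypothetical counts: 100 rows, X in 40 rows, A in 50 rows, X and A together in 10.
n, m_x, m_a, m_xa = 100, 40, 50, 10

# Dependency between X and A.
sign_a = dependency_sign(n, m_x, m_a, m_xa)

# Dependency between X and not-A, using m(not A) = n - m(A)
# and m(X, not A) = m(X) - m(XA).
sign_not_a = dependency_sign(n, m_x, n - m_a, m_x - m_xa)

print(sign_a, sign_not_a)
```

With these counts the first call reports a negative dependency and the second a positive one, which is why only positive dependencies need to be searched once both values of the consequence are considered.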

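3 Requirements for $X \rightarrow A = a$. The rule must be sufficiently significant by Fisher's exact test: $$p_F(X \rightarrow A = a) = \sum_{i=0}^{J} \frac{\binom{m(X)}{m(XA=a)+i} \binom{m(\neg X)}{m(\neg X A \neq a)+i}}{\binom{n}{m(A=a)}}, \qquad J = \min\{m(X A \neq a),\, m(\neg X A = a)\},$$ where $m(\cdot)$ is an absolute frequency and $n$ is the number of rows. The rule must also be non-redundant ($X$ contains no extra attributes which do not improve the dependency): $\nexists\, Y \subsetneq X$ such that $p_F(X \rightarrow A = a) \geq p_F(Y \rightarrow A = a)$. No other restrictions (like minimum frequency thresholds)!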
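The two requirements on this slide can be made concrete with a small sketch. It is an illustration under stated assumptions, not the author's implementation: the function names, the representation of data as a list of Python sets (each row holding the attributes that are true), and the restriction to consequences of the form A = 1 are all choices made here for brevity. The sum runs over the tables that are at least as extreme as the observed one, which is the standard one-sided Fisher's exact test.

```python
from itertools import combinations
from math import comb


def fisher_p(n, m_x, m_a, m_xa):
    """One-sided Fisher p-value for the rule X -> A from absolute frequencies."""
    m_notx = n - m_x                 # m(not X)
    m_x_nota = m_x - m_xa            # m(X, not A)
    m_notx_a = m_a - m_xa            # m(not X, A)
    m_notx_nota = m_notx - m_notx_a  # m(not X, not A)
    upper = min(m_x_nota, m_notx_a)  # largest shift that keeps all cells >= 0
    total = sum(
        comb(m_x, m_xa + i) * comb(m_notx, m_notx_nota + i)
        for i in range(upper + 1)
    )
    return total / comb(n, m_a)


def rule_p(rows, x, a):
    """p-value of X -> (a = 1), counting frequencies by a scan over the rows."""
    m_x = sum(1 for r in rows if x <= r)
    m_a = sum(1 for r in rows if a in r)
    m_xa = sum(1 for r in rows if x <= r and a in r)
    return fisher_p(len(rows), m_x, m_a, m_xa)


def is_non_redundant(rows, x, a):
    """True if no proper subset Y of X gives a p-value at least as good."""
    p = rule_p(rows, x, a)
    for size in range(len(x)):  # proper subsets only, including the empty set
        for y in combinations(sorted(x), size):
            if rule_p(rows, frozenset(y), a) <= p:
                return False
    return True


# Made-up data: B determines A exactly, C is unrelated noise.
rows = [{"A", "B", "C"}] * 3 + [{"A", "B"}] * 3 + [{"C"}] * 3 + [set()] * 3

print(rule_p(rows, frozenset({"B"}), "A"))
print(is_non_redundant(rows, frozenset({"B"}), "A"))
print(is_non_redundant(rows, frozenset({"B", "C"}), "A"))
```

In this toy data the rule with antecedent {B} is non-redundant, while adding C makes the p-value worse, so the rule with antecedent {B, C} is redundant. The brute-force subset scan is only meant to show the definition; it does not reflect how an efficient search would check redundancy.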

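4 Example. [Table: data over a four-attribute set $R$ with $n$ rows, listing sets and their frequencies. Table: best non-redundant rules with their $p_F$ values, plus an example of a redundant rule. Table contents not recoverable.]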

5 Algorithm: the main idea. Generate an enumeration tree and keep a record of the possible consequences in each node. Consequence $A_i = a_i$ is impossible in set $X$ if, for all sets $Q$, the rule $X \setminus \{A_i\} Q \rightarrow A_i = a_i$ is insignificant or redundant. The possible consequences in a node are stored in two bit vectors, pos and neg.

6 Algorithm: the stopping criterion. A node (and its subtree) can be pruned out when it has no possible consequences left. The algorithm stops when no more nodes can be created. Note: fr = 0 is not a sufficient condition for pruning out a node! However, no children are created for such a node.

7 Algorithm: possible consequences. Given node (set) $X$, its parents are $Y_i = X \setminus \{A_i\}$ for all $A_i \in X$. 1. Initialization: combine the parents' consequence vectors (bit-and). 2. Updating: estimate lower bounds for $p_F(X \setminus \{A_i\} Q \rightarrow A_i = a_i)$ and decide whether $A_i = a_i$ is impossible in node $X$. The bounds are LB1, when only $m(A_i = a_i)$ is known; LB2, when $m(A_i = a_i)$ and $m(X)$ are known and $A_i \notin X$; LB3, when $m(A_i = a_i)$ and $m(X)$ are known and $A_i \in X$.
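The initialization and updating steps can be sketched with Python integers used as bit vectors. This is a hedged illustration only: the function names are invented here, attribute $A_i$ is mapped to bit $i$ by assumption, and the lower bounds themselves are not spelled out on the slide, so `lower_bound` is a hypothetical caller-supplied function returning a lower bound on the p-value for consequence $A_i = a$.

```python
def combine_parents(parents, k):
    """Initialization: bitwise AND of all parents' (pos, neg) vectors.

    parents: iterable of (pos_bits, neg_bits) pairs, one per parent node.
    k: number of attributes; bit i stands for attribute A_i (an assumption).
    """
    full = (1 << k) - 1  # start with every consequence possible
    pos, neg = full, full
    for p_pos, p_neg in parents:
        pos &= p_pos
        neg &= p_neg
    return pos, neg


def update(pos, neg, k, lower_bound, p_max):
    """Updating: clear a bit when even the lower bound exceeds the threshold.

    lower_bound(i, a) is a hypothetical callable supplied by the caller.
    If the best achievable p-value is already above p_max, no rule with
    consequence A_i = a can be significant here, so the bit is cleared.
    """
    for i in range(k):
        bit = 1 << i
        if pos & bit and lower_bound(i, 1) > p_max:
            pos &= ~bit
        if neg & bit and lower_bound(i, 0) > p_max:
            neg &= ~bit
    return pos, neg


# Two parents over k = 4 attributes, with made-up vectors.
pos, neg = combine_parents([(0b1110, 0b0111), (0b1011, 0b0101)], k=4)
print(bin(pos), bin(neg))

# A dummy bound that rules out every consequence on attribute 1.
pos, neg = update(pos, neg, 4, lambda i, a: 0.5 if i == 1 else 0.0, p_max=0.01)
print(bin(pos), bin(neg))
```

A consequence survives initialization only if it is still possible in every parent, which is what makes the bitwise AND the natural operation here.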

8 Algorithm: possible consequences. 3. Minimality condition: if $P(A_i = a_i \mid Y_i) = 1.0$, then $A_i$, $\neg A_i$, and all $A_j = a_j$ with $A_j \notin X$ are impossible. 4. Lapis philosophorum (LP) principle: if $A_i = a_i$ is impossible in $X$, it is impossible in the parent node $Y_i = X \setminus \{A_i\}$ and all its children. This gives efficient pruning! (Chess, Accidents and Pumsb could not be handled without LP.)
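Both pruning rules on this slide act on the consequence bit vectors, and a small sketch may help to see their effect. Everything below is an assumption made for illustration: the function names, the mapping of attribute $A_i$ to bit $i$, the node table (a dict from attribute-index sets to `[pos, neg]` lists), and the reading of "children" as stored sets that extend the parent by exactly one attribute.

```python
def apply_minimality(pos, neg, x_mask, i, m_y, m_y_cons):
    """Minimality condition for the rule Y_i -> A_i = a_i.

    x_mask: bit mask of the attributes in X.
    m_y: absolute frequency of Y_i; m_y_cons: frequency of Y_i with A_i = a_i.
    If the two are equal, the rule holds with confidence 1, so both values of
    A_i and every attribute outside X are cleared from pos and neg.
    """
    if m_y > 0 and m_y_cons == m_y:
        keep = x_mask & ~(1 << i)  # only attributes of X other than A_i remain
        pos &= keep
        neg &= keep
    return pos, neg


def lp_propagate(nodes, x, i, positive):
    """Propagate 'A_i = a_i is impossible in X' to Y_i and Y_i's children.

    nodes: dict mapping frozenset of attribute indices -> [pos_bits, neg_bits].
    positive: True for the consequence A_i = 1, False for A_i = 0.
    """
    y = x - {i}
    slot = 0 if positive else 1
    clear = ~(1 << i)
    for s, vec in nodes.items():
        is_child = len(s) == len(y) + 1 and y < s
        if s == y or is_child:
            vec[slot] &= clear


# Minimality: X = {A_0, A_1, A_2} out of 4 attributes, rule on A_0 has confidence 1.
print(apply_minimality(0b1111, 0b1111, x_mask=0b0111, i=0, m_y=5, m_y_cons=5))

# LP: A_0 = 1 found impossible in X = {0, 1}; parent {1} and its children follow.
nodes = {
    frozenset({1}): [0b111, 0b111],
    frozenset({0, 1}): [0b111, 0b111],
    frozenset({1, 2}): [0b111, 0b111],
    frozenset({2}): [0b111, 0b111],
}
lp_propagate(nodes, frozenset({0, 1}), 0, positive=True)
print({tuple(sorted(s)): [bin(v) for v in vec] for s, vec in nodes.items()})
```

In the second example only the unrelated node {2} keeps bit 0 in its pos vector, which illustrates how one impossibility result spreads to a parent and its siblings' subtrees.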

9 Simulation (level 1). Search for the best rules, when $p_{max}$ = .2 8 [digits lost in extraction]. LB1: […] and […] are impossible consequences. Otherwise, possible consequences are determined by LB2.

10 Simulation (level 2). Combine the parents' bit vectors.

11 Simulation (level 2). Rule […] is minimal, so […] and […] become impossible consequences (more specific rules would be redundant).

12 Simulation (level 2). LP: set […] as impossible in node […].

13 Simulation (level 2). Rules […] and […] are minimal, so […] and […] are impossible consequences. Node […] is removed.

14 Simulation (level 2). Node […] is removed. LP: set […] as impossible in node […], and […] and […] as impossible in node […].

15 Simulation (level 2). Node […]: no rules found.

16 Simulation (level 2). LB3: […] is an impossible consequence. Rule […] (not minimal).

17 Simulation (level 2). LP: […] is an impossible consequence also in node […].

18 Simulation (level 2). […] and […]: no rules found.

19 Simulation (level 3). Rule […] is minimal, so […] is an impossible consequence. Node […] is removed, then LP is applied.

20 Simulation (level 3). Rules […] and […] are minimal. Node […] is removed, then LP is applied.

21 Results. Implementation: Kingfisher. The algorithm itself suits all common goodness measures for statistical dependencies; currently only $p_F$ and $\chi^2$ are implemented, but new measures are easy to add. It is very efficient without any extra restrictions ([…] GB of memory suffices for Pumsb, Accidents, Retail, Chess, etc.). $p_F$ is more efficient and produces more reliable results than $\chi^2$!

22 Thank you!

23 Possible questions. Why are the rules called dependency rules and not association rules? Are there any previous solutions to the problem? Why do the consequence vectors contain all attributes in all nodes? Wouldn't it be enough to store only those attributes which can be added under a node?
