Alternative Approach to Mining Association Rules


Jan Rauch (1), Milan Šimůnek (1, 2)
(1) Faculty of Informatics and Statistics, University of Economics Prague, Czech Republic
(2) Institute of Computer Sciences, Czech Academy of Sciences, Czech Republic
rauch@vse.cz, simunek@{vse.cz, cs.cas.cz}

Abstract

An alternative approach to mining association rules is described. It relies on special techniques and algorithms that allow a much richer syntax of association rules while the computation time remains linear in the number of rows of the analysed data. The free and open system LISp-Miner implements these algorithms and serves as a demonstration of the techniques used. The same techniques can be used in other kinds of mining, e.g. multi-relational mining and conditional frequency analysis.

1. Introduction

An association rule is commonly understood as an expression of the form X → Y, where X and Y are sets of items. The intuitive meaning is that transactions (e.g. supermarket baskets) containing the set X of items tend to contain the set Y of items. Two measures of the intensity of an association rule are used: confidence and support. An association rule discovery task is the task of finding all association rules of the form X → Y such that the support and confidence of X → Y are above the user-defined thresholds minsup and minconf.

The conventional algorithm of association rule discovery proceeds in two steps. All frequent itemsets are found in the first step; a frequent itemset is an itemset that is included in at least minsup transactions. The association rules with confidence of at least minconf are generated in the second step [1].

Particular items can be represented by Boolean attributes, and a Boolean data matrix can represent the whole set of transactions. The algorithm can be modified to deal with attributes with more than two values. Thus association rules of the form e.g. A(a_1) ∧ B(b_3) → C(c_7) can be mined. We suppose that the attribute A has k particular values a_1, …, a_k.
The expression A(a_1) denotes the Boolean attribute that is true if the value of the attribute A is a_1, etc.

The goal of this paper is to draw attention to an alternative approach to mining association rules, based on representing each possible value of each attribute by a single string of bits. It is then possible to mine for association rules of the form e.g. A(α) ∧ B(β) → C(δ), where α is a coefficient (a subset of all the possible values) of the attribute A. The expression A(α) denotes the Boolean attribute that is true in a particular row of the data matrix if the value of A in this row belongs to α; similarly for B(β) and C(δ).

The bit string approach also makes it easy to compute all the necessary frequencies. We can then mine not only for association rules based on confidence and support but also for rules corresponding to various further relations of Boolean attributes, including relations described by statistical hypothesis tests. It is also possible to mine for conditional association rules and to deal with missing information. The presented form of association rules can be understood as a contribution to the discussion about the notion of interesting patterns.

Several data structures consisting of disjunctions and conjunctions of bit strings representing particular values of attributes are maintained to optimise the generation and verification of association rules. The resulting algorithm is very fast, and its running time depends linearly on the number of rows of the analysed data matrix. Time and memory complexity are discussed in section 3.

As a demonstration of the capabilities of the bit string approach we present the procedure 4ft-Miner (see section 2). The 4ft-Miner procedure is a part of the academic data mining system LISp-Miner. The bit string approach has proved to be very efficient. Experience with it has led to the development of new mining procedures; an example can be found in section 4.
The presented approach was first applied in connection with the development of the GUHA method of mechanised hypothesis formation [2], [3].

2. Procedure 4ft-Miner

The procedure 4ft-Miner mines for association rules of the form ϕ ≈ ψ and for conditional association rules of the form ϕ ≈ ψ / χ. Here ϕ, ψ and χ are conjunctions of Boolean attributes automatically derived from many-valued attributes in various ways. The symbol ≈ is called a 4ft-quantifier. The association rule ϕ ≈ ψ means that the Boolean attributes ϕ and ψ are associated in the sense of the 4ft-quantifier ≈. A conditional association rule ϕ ≈ ψ / χ means that ϕ and ψ are associated (in the sense of ≈) if the condition χ is satisfied.

The left part of the association rule (ϕ) is called the antecedent, the part denoted ψ is called the succedent, and χ is the condition. All three parts are referred to as cedents.

This section describes features of the procedure 4ft-Miner to show the advantages of the bit string approach. The first one is the richness of the possibilities for defining, in a simple way, the set of interesting association rules to be automatically generated and verified, see section 2.1. The second one is the possibility of dealing with many types of association rules, see section 2.2. The important features of the output of 4ft-Miner are outlined in section 2.3.

2.1. Sets of Interesting Association Rules

The analysed data for the procedure are stored in a data matrix. Rows of the data matrix correspond to observed objects and columns correspond to attributes, i.e. properties of the observed objects. An example is the data matrix Loans, see Figure 1.

Client  Age  Sex  Salary     District  Quality
1       45   M    very high  Prague    good
2       22   F    very low   Plzen     bad
3       37   F    average    Brno      good
4       53   F    high       Benesov   good
             M    low        Kolin     bad
             F    high       Brod      good

Figure 1. Data matrix Loans

Each row of the data matrix Loans describes one loan given to a client of a bank. There are 6181 loans. The first row describes a loan received by a 45-year-old man. This man has a very high salary and lives in the district of Prague. The quality of his loan is good.

Each cedent is a conjunction of Boolean attributes called literals. A literal is an expression of the form A(α), where A is an attribute and α is a subset of all the possible values (i.e. categories) of the attribute A. The subset α is called the coefficient of the literal A(α). Examples of cedents ϕ, ψ and χ are:

- ϕ = Age<20;30), which is true if the value of the attribute Age is in the interval <20;30),
- ψ = Quality(good), which is true if the value of the attribute Quality is good,
- χ = District(Prague, Plzen) ∧ Salary(very high), which is true if both the value of the attribute District is Prague or Plzen and the value of the attribute Salary is very high.

The set of interesting association rules to be generated and tested on the given data matrix is defined by:

- a simple definition of all antecedents,
- an analogous simple definition of all succedents,
- an analogous simple definition of all conditions (if desired),
- a definition of a 4ft-quantifier; there are 17 types of 4ft-quantifiers.

The antecedents are conjunctions of literals automatically generated from the given set of antecedent attributes. It is also possible to divide this set into several subsets called partial antecedents. A partial antecedent is also a conjunction of literals, and the antecedent as a whole is a conjunction of partial antecedents. A partial antecedent is given by:

- a list of attributes, some of which are marked as basic (a partial antecedent must contain at least one basic attribute),
- a minimal and maximal number of attributes to be used in the partial cedent,
- a simple definition of the set of all literals to be generated from each attribute.

Any literal can be positive or negative. The positive literal is the literal A(α) itself. The negative literal is the expression ¬A(α), the Boolean negation of A(α). The set of all literals to be generated for a particular attribute is given by:

- the type of coefficient; six types are available: subsets, intervals, left cuts, right cuts, cuts, and one particular value,
- a minimal and maximal number of values in the coefficient,
- the positive/negative literal option: a) only positive literals are generated, b) only negative literals are generated, c) both positive and negative literals are generated.

We use the attribute A with categories {1, 2, 3, 4, 5} to give examples of particular types of coefficients:

- Subsets: the definition of subsets with 2-3 categories defines the literals A(1,2), A(1,3), A(1,4), A(1,5), A(2,3), …, A(4,5), A(1,2,3), A(1,2,4), A(1,2,5), A(2,3,4), …, A(3,4,5).
- Intervals: the definition of intervals with 2-3 categories defines the literals A(1,2), A(2,3), A(3,4), A(4,5), A(1,2,3), A(2,3,4) and A(3,4,5).
- Left cuts: the definition of left cuts with at most 3 categories defines the literals A(1), A(1,2) and A(1,2,3).
- Right cuts: the definition of right cuts with at most 4 categories defines the literals A(5), A(5,4), A(5,4,3) and A(5,4,3,2).
- Cuts: both left cuts and right cuts.
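The coefficient types listed above can be illustrated by a short sketch. It is not the LISp-Miner implementation; the function names (`subsets`, `intervals`, `left_cuts`, `right_cuts`, `cuts`) and the representation of a coefficient as a tuple of categories are assumptions made only for this example. The categories are assumed to be given in their natural order.

```python
from itertools import combinations

# Illustrative sketch only: names and data layout are assumptions,
# not part of 4ft-Miner. A coefficient is a tuple of categories.

def subsets(cats, lo, hi):
    """All subsets of cats containing between lo and hi categories."""
    return [c for k in range(lo, hi + 1) for c in combinations(cats, k)]

def intervals(cats, lo, hi):
    """All runs of lo..hi consecutive categories (a sliding window)."""
    n = len(cats)
    return [tuple(cats[i:i + k])
            for k in range(lo, hi + 1)
            for i in range(n - k + 1)]

def left_cuts(cats, lo, hi):
    """Intervals that start at the first category."""
    return [tuple(cats[:k]) for k in range(lo, min(hi, len(cats)) + 1)]

def right_cuts(cats, lo, hi):
    """Intervals that end at the last category, listed from the end."""
    return [tuple(reversed(cats[-k:]))
            for k in range(lo, min(hi, len(cats)) + 1)]

def cuts(cats, lo, hi):
    """Both left cuts and right cuts."""
    return left_cuts(cats, lo, hi) + right_cuts(cats, lo, hi)

if __name__ == "__main__":
    A = [1, 2, 3, 4, 5]
    print(intervals(A, 2, 3))   # the seven interval coefficients
    print(left_cuts(A, 1, 3))   # coefficients of A(1), A(1,2), A(1,2,3)
    print(right_cuts(A, 1, 4))  # coefficients of A(5), ..., A(5,4,3,2)
```

Each returned tuple α corresponds to one positive literal A(α); negative literals would reuse the same coefficients.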


An example of the antecedent definition is in Figure 2.

Figure 2. Example of the antecedent definition

There are two partial antecedents in Figure 2. The partial antecedent Client_Basic contains the attributes Sex, Salary and District. Each line defines the types of coefficients to be generated for the corresponding attribute. The line Sex(*), 1-1 means that subsets of categories of length 1 to 1 are to be generated for the attribute Sex, i.e. the literals Sex(F) and Sex(M). Cuts are to be generated for the attribute Salary. This attribute has the categories very low, low, average, high and very high. All the possible cuts of length 1 to 2 are the literals Salary(very low), Salary(very low, low), Salary(very high) and Salary(high, very high). Subsets of length 1 to 2 are to be generated for the attribute District, see District(*), 1-2. It means that all single districts, e.g. District(Prague), and all pairs of districts, e.g. District(Plzen, Prague), will be generated. There are 77 particular districts, thus 77 + 2926 = 3003 literals are defined this way.

The partial cedent Client_Basic has length from 1 to 3, so at least one of the attributes Sex, Salary and District will always be used in the antecedent. The partial cedent Client_Age is defined such that none or one of two types of literals concerning Age will be used. By defining Age(int), 5-5 we ask for all the intervals of length 5 to be generated; in other words, there will be a sliding window of length 5. The definition Age(lcut), 1-10 means that left cuts will be generated, thus we will investigate young clients.

An example of a coefficient given by one value is in Figure 3. In such a case we concentrate on the loans with bad quality.

Figure 3. Example of the coefficient of one value

Let us emphasize that each cedent and even each partial cedent is treated as an object and can be copied or moved to another task or cedent.

2.2. Verification of Association Rules

The association rule ϕ ≈ ψ means that the Boolean attributes ϕ and ψ are associated in the sense of the 4ft-quantifier ≈. The rule ϕ ≈ ψ can be true or false in the analysed data matrix M. The conditional association rule ϕ ≈ ψ / χ is true in the analysed data matrix M if the rule ϕ ≈ ψ is true in the data matrix M / χ. The data matrix M / χ consists of all rows of the matrix M satisfying the condition χ. There must exist at least one such row for ϕ ≈ ψ / χ to be true.

The association rule ϕ ≈ ψ is verified on the basis of the four-fold table 4ft(ϕ, ψ, M) of ϕ, ψ in M, see Figure 4.

M     ψ    ¬ψ
ϕ     a    b
¬ϕ    c    d

Figure 4. Four-fold table 4ft(ϕ, ψ, M) of ϕ, ψ in M

Here a is the number of objects satisfying both ϕ and ψ, b is the number of objects satisfying ϕ and not satisfying ψ, c is the number of objects not satisfying ϕ and satisfying ψ, and d is the number of objects satisfying neither ϕ nor ψ.

Each 4ft-quantifier defines a true/false function of the frequencies <a,b,c,d> from the four-fold table. The association rule ϕ ≈ ψ is true in the data matrix M if the function defined by the 4ft-quantifier ≈ is true for the four-fold table 4ft(ϕ, ψ, M) of ϕ, ψ in M. Various 4ft-quantifiers are defined in [2] and [4]. Here follow some examples.

Founded implication ⇒_{p;Base}
Parameters: 0 < p ≤ 1 and Base > 0
True iff

$$\frac{a}{a+b} \ge p \;\wedge\; a \ge Base$$

The association rule ϕ ⇒_{p;Base} ψ can be interpreted as "100p per cent of objects satisfying ϕ satisfy also ψ" or "ϕ implies ψ on the level 100p per cent".

Lower critical implication ⇒!_{p;α;Base}
Parameters: 0 < p ≤ 1, Base > 0 and 0 < α ≤ 0.5
True iff

$$\sum_{i=a}^{a+b} \frac{(a+b)!}{i!\,(a+b-i)!}\, p^{i} (1-p)^{a+b-i} \le \alpha \;\wedge\; a \ge Base$$

The association rule ϕ ⇒!_{p;α;Base} ψ corresponds to a test (on the level α) of the null hypothesis H_0: P(ψ | ϕ) ≤ p against the alternative hypothesis H_1: P(ψ | ϕ) > p. If the association rule ϕ ⇒!_{p;α;Base} ψ is true in the data matrix M then the alternative hypothesis is accepted.
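As a rough illustration of how a 4ft-quantifier acts as a true/false function on the four-fold table, the two quantifiers defined so far can be sketched as follows. This is a minimal sketch, not the LISp-Miner code: the function names are hypothetical, ϕ and ψ are assumed to be given as equally long sequences of truth values (one per row), and the binomial sum is evaluated directly rather than through the pre-computed tables of critical frequencies.

```python
from math import comb

# Hypothetical helper names; phi and psi hold one truth value per row.

def four_fold_table(phi, psi):
    """Return the frequencies (a, b, c, d) of the four-fold table."""
    a = b = c = d = 0
    for in_phi, in_psi in zip(phi, psi):
        if in_phi and in_psi:
            a += 1
        elif in_phi:
            b += 1
        elif in_psi:
            c += 1
        else:
            d += 1
    return a, b, c, d

def founded_implication(a, b, p, base):
    """True iff a / (a + b) >= p and a >= Base."""
    return a >= base and a + b > 0 and a / (a + b) >= p

def binomial_upper_tail(a, n, p):
    """Sum over i = a..n of C(n, i) * p^i * (1 - p)^(n - i)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(a, n + 1))

def lower_critical_implication(a, b, p, alpha, base):
    """True iff the binomial tail for n = a + b is <= alpha and a >= Base."""
    return a >= base and binomial_upper_tail(a, a + b, p) <= alpha

if __name__ == "__main__":
    phi = [1, 1, 1, 1, 0, 0]
    psi = [1, 1, 1, 0, 1, 0]
    a, b, c, d = four_fold_table(phi, psi)
    print((a, b, c, d))
    print(founded_implication(a, b, p=0.7, base=3))
```

Both functions depend only on a and b, which reflects the implicational character of these two quantifiers; quantifiers that also use c or d would take the full table.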

Double founded implication ⇔_{p;Base}
Parameters: 0 < p ≤ 1 and Base > 0
True iff

$$\frac{a}{a+b+c} \ge p \;\wedge\; a \ge Base$$

The association rule ϕ ⇔_{p;Base} ψ can be interpreted as "100p per cent of objects satisfying ϕ or ψ satisfy both ϕ and ψ" or "ϕ ∨ ψ implies ϕ ∧ ψ on the level 100p per cent".

All the implemented 4ft-quantifiers are described at http://lispminer.vse.cz\overview\4ft_quantifier.html.

The four-fold table can be computed very quickly, see section 3. Let us remark that pre-computed tables of critical frequencies can be used to verify 4ft-quantifiers based on statistical hypothesis tests [4]. This way we need only one test of an inequality instead of the computation of a complex formula.

When we deal with missing information we have to compute nine-fold tables or even eighteen-fold tables. The bit string approach is again used for very fast computation of these tables. There are also several possibilities for reducing these tables back to a four-fold table. For details see e.g. [5].

2.3. Output of 4ft-Miner

The output of the procedure consists of all prime association rules. An association rule is prime if it is true in the analysed data matrix and it does not follow immediately from other, simpler association rules already in the output. The question is what it means that the association rule ϕ ≈ ψ immediately follows from a simpler association rule ϕ_1 ≈ ψ_1. The answer depends on the properties of the 4ft-quantifier used.

The definition of a prime association rule for the 4ft-quantifier ⇒_{p;Base} of founded implication must take into account that if the association rule e.g. Sex(M) ⇒_{p;Base} District(Prague) is true then the association rule Sex(M) ⇒_{p;Base} District(Prague, Plzen) is also always true. Thus the second association rule immediately follows from the first, simpler one. All such followers are automatically omitted from the output. There is a theoretical background concerning the logical properties of association rules. For details see section 4 or e.g. [4].

An example of the output of 4ft-Miner is in Figure 5. This output represents the task with the sets of interesting antecedents and succedents defined in Figure 2 and Figure 3 respectively, and with the quantifier ⇒_{0.7;20} of founded implication. The whole solution contains 46 prime association rules.

Figure 5. Example of the 4ft-Miner output

3. Bit String Approach

The basic principle of the bit string approach is the representation of the analysed data by suitable strings of bits (see section 3.1). This makes it possible to use simple algorithms and data structures to compute the necessary frequencies efficiently (see section 3.2).

3.1. Bit-string Representation of Attributes

Each category of each attribute (i.e. each of its possible values) is represented by one string of bits. This string is called the card of the category [3]. We can use the attribute District as an example. The attribute District has 77 categories: Benesov, Brno, …, Prague, Plzen, …, Znojmo. Its representation is shown in Figure 6.

Client     1       2      3     4
District   Prague  Plzen  Brno  Benesov  Kolin  Brod
Brno       0       0      1     0        0      0
Kolin      0       0      0     0        1      0
Plzen      0       1      0     0        0      0
Prague     1       0      0     0        0      0

Figure 6. Cards of categories

The first row of this table corresponds to the column Client (row number) of the data matrix Loans, see Figure 1. The second row of the table corresponds to the column District. Each of the further rows of Figure 6 is the card of one category. Each bit of the card of a category corresponds to one row of the data matrix Loans. The first bit corresponds to the first row, the second bit corresponds to the second row, etc. A particular bit is 1 if the value (i.e. category) in the column District of the row corresponding to this bit is the given category. Otherwise the bit is 0. The first bit of the card of the category Benesov is 0 because the value in the first row of the data matrix is not Benesov (but Prague). The third bit of the card of the category Brno is 1 because the value in the third row is Brno, etc.

There are 6181 rows in the data matrix Loans, therefore 6181 bits or 773 bytes are necessary to represent one category by its card. The attribute District has 77 categories, so 77 × 773 = 59 521 bytes (about 58 KB) are necessary to represent this attribute.

3.2. Algorithm and Data Structures

A structure named the card of antecedent represents each antecedent. We denote it by Card_[antecedent]. It is a string of bits of the same length as the number of rows in the analysed data matrix. Each bit of the card again corresponds to one row of the analysed data matrix. A particular bit is 1 if the row corresponding to this bit satisfies the antecedent. The card of antecedent is thus the bit-wise representation of the Boolean attribute antecedent. It is created as the conjunction of the cards of all its literals. The card of a literal is created beforehand as the disjunction of the cards of the categories in the literal's coefficient. A detailed description is beyond the scope of this article and can be found e.g. in [3].

The number of 1s in the card of antecedent is the number of rows satisfying the antecedent. We use a low-level bit-string function Count(α) returning the number of values 1 in the string α. The number of rows satisfying the antecedent must be equal to or greater than the value of the parameter Base, see section 2.2. For every generated antecedent we test whether Count(Card_[antecedent]) ≥ Base to decide whether this antecedent can be a part of a true association rule at all. This test can be understood as a test of whether the corresponding itemset is frequent [1].
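The construction of cards described in this section can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the data structure used in LISp-Miner: a card is stored as a Python integer whose bit i (counted from the least significant bit) stands for row i of the data matrix, and all function names (`cards_of_categories`, `card_of_literal`, `card_of_cedent`, `count`, `is_frequent`) are hypothetical.

```python
# Illustrative sketch: a card is a Python int, bit i corresponds to row i.
# All names are assumptions made for this example.

def cards_of_categories(column):
    """Build one card per category occurring in a column of the data matrix."""
    cards = {}
    for row, value in enumerate(column):
        cards[value] = cards.get(value, 0) | (1 << row)
    return cards

def card_of_literal(cards, coefficient, n_rows, negative=False):
    """Disjunction (OR) of the cards of all categories in the coefficient."""
    card = 0
    for category in coefficient:
        card |= cards.get(category, 0)
    if negative:
        # Negative literal: bit-wise NOT restricted to the n_rows valid bits.
        card = ~card & ((1 << n_rows) - 1)
    return card

def card_of_cedent(literal_cards, n_rows):
    """Conjunction (AND) of the cards of all literals of the cedent."""
    card = (1 << n_rows) - 1
    for literal_card in literal_cards:
        card &= literal_card
    return card

def count(card):
    """Number of 1 bits in the card, i.e. number of rows satisfying it."""
    return bin(card).count("1")

def is_frequent(card, base):
    """The test Count(Card_[antecedent]) >= Base."""
    return count(card) >= base

if __name__ == "__main__":
    # The six rows of the data matrix Loans visible in Figure 1.
    district = ["Prague", "Plzen", "Brno", "Benesov", "Kolin", "Brod"]
    sex = ["M", "F", "F", "F", "M", "F"]
    n = len(district)
    d_cards = cards_of_categories(district)
    s_cards = cards_of_categories(sex)
    antecedent = card_of_cedent(
        [card_of_literal(d_cards, ["Prague", "Plzen"], n),
         card_of_literal(s_cards, ["F"], n)], n)
    print(count(antecedent), is_frequent(antecedent, base=1))
```

Python integers are used here only because they give arbitrary-length bit strings with built-in AND, OR and NOT; an implementation tuned for speed would more likely work on arrays of machine words.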
Both Card_[antecedent] and Card_[succedent] (analogous to the card of antecedent) are used to compute the frequencies of the four-fold table of antecedent and succedent, see Figure 7.

M             Succedent   ¬Succedent
Antecedent    a           b
¬Antecedent   c           d

Figure 7. Four-fold table from cards

The particular frequencies are computed in the following way:

a = Count(Card_[Antecedent] ∧ Card_[Succedent])
b = Count(Card_[Antecedent]) − a
c = Count(Card_[Succedent]) − a
d = n − a − b − c

Here n is the total number of rows in the data matrix M.

The memory used by the strings of bits while running a data mining task is not a significant problem, especially when compared with the significant time savings during generation and verification. Let us remark that e.g. a lot of medical data concerns thousands of patients and tens or hundreds of attributes. The corresponding data mining tasks can be solved without problems on common PCs. Moreover, in many cases we get the solution in several minutes or even in several seconds. Therefore 4ft-Miner is also suitable for teaching purposes.

Here we provide the results of an experiment on a Pentium 400 MHz computer with 98 MB RAM. We solved tasks to find true and prime association rules in the data matrices Loans, Loans_10 and Loans_20. The data matrix Loans_10 has 10 times more rows than the original data matrix Loans. Analogously, the data matrix Loans_20 has 20 times more rows. According to the task definition, a large set of relevant association rules has to be generated and verified. Owing to the optimisations, some of which are described above, only a part of these rules was actually verified. The time of solution for the particular data matrices is given in Figure 8.

Figure 8. Time of solution of various tasks (data matrices Loans, Loans_10 and Loans_20; number of rows and time of solution in seconds)

Let us emphasize that the time of the bit string operations AND, NOT, OR and Count depends linearly on the length of the particular cards. The length of each card is equal to the number of rows of the analysed data matrix. Thus the time the procedure 4ft-Miner needs to solve a given task depends linearly on the number of rows of the analysed data matrix.

4. New Data Mining Procedures

The advantages of the bit string approach can be further used in new data mining procedures. An example is the procedure Pareto-Miner. Figures 9 and 10 express the motivation for this procedure. Both figures concern the distribution of clients (see the data matrix Loans, Figure 1) among particular regions. The first one concerns all clients and the second one concerns the clients with high salary only.

Figure 9. Distribution of all clients among regions

Figure 10. Distribution of clients with high salary among regions

The distribution of clients with high salary differs remarkably from the distribution of all clients. The difference concerns namely the pair Prague and South Moravia. It can be useful to find all segments of clients that differ in a given way from the segment of all clients in the distribution of clients among particular regions. The Pareto-Miner procedure is intended to solve such tasks. Its input consists of:

- a data matrix with columns linked to attributes and rows corresponding to observed objects,
- an analysed attribute A (usually with several values),
- parameters defining a large set of conditions in the same way as a set of conditions is defined in the 4ft-Miner procedure,
- a criterion of interestingness of a particular condition.

The criterion of interestingness describes a distribution of rows of the data matrix among the particular values of the attribute A. Examples of the criteria are:

- a remarkable difference between the distribution when the particular condition is satisfied and the distribution for the whole analysed data matrix; the difference can be measured e.g. by the number of values with a different order,
- a remarkable difference between the distribution when the particular condition is satisfied and the distribution under another given condition.

The evaluation of these criteria requires knowledge of the frequencies of particular values of the attribute A under the condition in question. These frequencies can be computed using cards of cedents for the conditions and cards of particular categories. Thus the tools already developed can be used, both for generating the particular conditions C and for computing the card Card_[C]. The particular frequencies can be computed as f_{i,j} = Count(Card_[a_i] ∧ Card_[s_j] ∧ Card_[C]).

This paper has been supported by the grant COST ACTION 274 TARSKI (Theory and Applications of Relational Structures as Knowledge Instruments).

Literature

[1] Agrawal, R. et al.: Fast Discovery of Association Rules. In: Advances in Knowledge Discovery and Data Mining (Fayyad, U. M. et al., eds.), AAAI Press / The MIT Press, 1996.
[2] Hájek, P., Havránek, T.: Mechanising Hypothesis Formation: Mathematical Foundations for a General Theory. Springer-Verlag, 1978.
[3] Rauch, J.: Some Remarks on Computer Realisations of GUHA Procedures. International Journal of Man-Machine Studies 10, 1978.
[4] Rauch, J.: Classes of Four-Fold Table Quantifiers. In: Principles of Data Mining and Knowledge Discovery (J. Zytkow, M. Quafafou, eds.), Springer-Verlag, 1998.
[5] Rauch, J.: Four-fold Table Calculi and Missing Information. In: JCIS'98, Association for Intelligent Machinery, Vol. II (Wang, P., ed.), Durham, Duke University.
[6] Rauch, J., Šimůnek, M.: Mining for 4ft Association Rules by 4ft-Miner. In: INAP 2001, The Proceeding of the International Conference on Applications of Prolog, Prolog Association of Japan, Tokyo, October 2001.
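As a supplementary sketch, both the four-fold table frequencies a, b, c, d of section 3.2 and the conditional frequencies of section 4 reduce to counting bits in conjunctions of cards. The code below illustrates this under the same assumption as before, namely that a card is a Python integer with one bit per row. All names are hypothetical, and the function `values_with_different_order` is only one possible reading of the criterion "number of values with different order" (ties are broken by category name, which is an assumption).

```python
# Illustrative sketch: cards are Python ints, one bit per row.
# Names and the tie-breaking rule are assumptions for this example.

def count(card):
    """Number of 1 bits in a card."""
    return bin(card).count("1")

def four_fold_from_cards(card_antecedent, card_succedent, n):
    """Frequencies a, b, c, d computed from two cards; n is the row count."""
    a = count(card_antecedent & card_succedent)
    b = count(card_antecedent) - a
    c = count(card_succedent) - a
    d = n - a - b - c
    return a, b, c, d

def conditional_frequencies(category_cards, *condition_cards):
    """Frequency of each category among rows satisfying all given cards."""
    freqs = {}
    for category, card in category_cards.items():
        for condition in condition_cards:
            card &= condition
        freqs[category] = count(card)
    return freqs

def values_with_different_order(freq_all, freq_cond):
    """Count positions where the frequency rankings of categories differ."""
    def ranking(freq):
        return sorted(freq, key=lambda cat: (-freq[cat], cat))
    return sum(1 for x, y in zip(ranking(freq_all), ranking(freq_cond))
               if x != y)

if __name__ == "__main__":
    # Cards for the six rows of Loans visible in Figure 1 (bit 0 = row 1).
    sex_f = 0b101110          # Sex(F)
    quality_good = 0b101101   # Quality(good)
    print(four_fold_from_cards(sex_f, quality_good, 6))
    districts = {"Prague": 0b000001, "Benesov": 0b001000, "Brod": 0b100000}
    salary_high = 0b101000    # Salary(high)
    print(conditional_frequencies(districts, salary_high))
```

Passing several condition cards to `conditional_frequencies` corresponds to conjoining further cards in the Count expression, as in the formula for f_{i,j}.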
