A Logical Formulation of the Granular Data Model

2008 IEEE International Conference on Data Mining Workshops

Tuan-Fang Fan, Department of Computer Science and Information Engineering, National Penghu University, Penghu 880, Taiwan, dffan@npu.edu.tw
Tsau-Young (T.Y.) Lin, Department of Computer Science, San Jose State University, San Jose, CA 95192, tylin@cs.sjsu.edu
Churn-Jung Liau, Institute of Information Science, Academia Sinica, Taipei 115, Taiwan, liaucj@iis.sinica.edu.tw
Karen Lee, Department of Computer Science, San Jose State University, San Jose, CA 95192

Abstract

In data mining problems, data is usually provided in the form of data tables. To represent knowledge discovered from data tables, decision logic (DL) is proposed in rough set theory. While DL is an instance of propositional logic, we can also describe data tables by other logical formalisms. In this paper, we use a kind of many-sorted logic, called attribute value-sorted logic, to study association rule mining from the perspective of granular computing. By using a logical formulation, it is easy to show that patterns are properties of classes of isomorphic data tables. We also show that a granular data model can act as a canonical model of a class of isomorphic data tables. Consequently, association rule mining can be restricted to such granular data models.

Keywords: Data table, rough set theory, decision logic, first-order logic, granular data model.

1 Introduction

In recent years, knowledge discovery in databases (KDD) and data mining have received a great deal of attention because of their practical applications. Many different forms of knowledge have been considered by KDD researchers, notably association rules and sequential patterns [1, 2]. Rough set theory, proposed by Pawlak, provides an effective tool for extracting knowledge from data tables [6]. In fact, many powerful data mining algorithms have been designed based on rough set theory (see the papers in [7-9] for some examples). To represent and reason about extracted knowledge, a decision logic (DL) is also proposed in [6]. The semantics of DL is defined in a Tarskian style through the notions of models and satisfaction. While DL is an instance of propositional logic, we can also represent knowledge extracted from data tables by using first-order logic (FOL) or many-sorted first-order logic (MSFOL) [3]. In this paper, we explore the description of data tables based on MSFOL. The attribute value-sorted logic (AVSL) proposed in [3] is investigated from the perspective of granular computing. By using a logical formulation, it is easy to show that patterns are properties of classes of isomorphic data tables. It can also be shown that a granular data model can act as a canonical model of a class of isomorphic data tables. Consequently, association rule mining can be restricted to such granular data models.

In the next section, we review rough set theory and decision logic. In Section 3, the syntax and semantics of attribute value-sorted logic (AVSL) are presented. In Section 4, we show that isomorphic data tables have the same set of patterns, which implies that association rules are syntactic in nature. We also show that a canonical granular model can be obtained for each class of isomorphic data tables, and that all patterns discovered from the class of data tables can be discovered from the granular data model. We then summarize our conclusions in Section 5.

(This work was partially supported by the NSC (Taiwan) under Grant no. 95-2221-E-001-029-MY3.)

2 Rough Set Theory: A Review

2.1 Approximation space

The basic construct of rough set theory is an approximation space, defined as a pair $(U, R)$, where $U$ is the universe and $R \subseteq U \times U$ is an equivalence relation on $U$. An equivalence relation partitions the universe $U$ into a family of equivalence classes so that each element of $U$ belongs to exactly one of these equivalence classes. Thus, we can write an equivalence class of $R$ as $[x]_R$ if it contains the element $x$. Note that $[x]_R = [y]_R$ iff $(x, y) \in R$.

In philosophy, the extension of a concept is defined as the set of objects that are instances of the concept. Following this terminology, a subset of the universe is called a concept or a category in rough set theory. Given an approximation space $(U, R)$, each equivalence class of $R$ is called an $R$-basic category or $R$-basic concept, and any union of $R$-basic categories is called an $R$-category. Now, for an arbitrary concept $X \subseteq U$, we are interested in the definability of $X$ based on $R$-basic categories. We say that $X$ is $R$-definable if $X$ is an $R$-category; otherwise, $X$ is $R$-undefinable. $R$-definable concepts are also called $R$-exact sets, whereas $R$-undefinable concepts are said to be $R$-inexact or $R$-rough. When the approximation space is explicit from the context, we simply omit the qualifier $R$ and call a set an exact set or a rough set.

A rough set can be approximated from below and above by two exact sets. The lower approximation and upper approximation of $X$ are denoted by $\underline{R}X$ and $\overline{R}X$ respectively and defined as follows:

$\underline{R}X = \{x \in U \mid [x]_R \subseteq X\}$,
$\overline{R}X = \{x \in U \mid [x]_R \cap X \neq \emptyset\}$.
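To make the two approximation operators concrete, here is a minimal Python sketch (our illustration, not part of the paper; function names such as lower_approx are hypothetical) that computes them from an explicitly given partition of the universe:

```python
def lower_approx(partition, X):
    """Union of the equivalence classes wholly contained in X."""
    X = set(X)
    return {x for block in partition if set(block) <= X for x in block}

def upper_approx(partition, X):
    """Union of the equivalence classes that intersect X."""
    X = set(X)
    return {x for block in partition if set(block) & X for x in block}

# A toy approximation space: U = {1,...,6} partitioned into three granules.
partition = [{1, 2}, {3, 4, 5}, {6}]
X = {1, 2, 3}
print(lower_approx(partition, X))  # {1, 2}             -- lower approximation
print(upper_approx(partition, X))  # {1, 2, 3, 4, 5}    -- upper approximation
```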
2.2 Data tables and decision logic

For data mining problems, data is usually provided in the form of data tables (DT). The following formal definition of a data table is given in [6].

Definition 1 A data table (also called a knowledge representation system, an information system, or an attribute-value system) is a tuple $T = (U, A, \{V_i \mid i \in A\}, \{f_i \mid i \in A\})$, where $U$ is a nonempty finite set, called the universe; $A$ is a nonempty finite set of primitive attributes; for each $i \in A$, $V_i$ is the domain of values for $i$; and for each $i \in A$, $f_i : U \rightarrow V_i$ is a total function.

In [6], a decision logic (DL) is proposed for the representation of knowledge discovered from data tables. It is called decision logic because it is particularly useful for a special kind of data table, called a decision table. A decision table is a data table $T = (U, C \cup D, \{V_i \mid i \in A\}, \{f_i \mid i \in A\})$, where the set of attributes is partitioned into two sets, $C$ and $D$, called condition attributes and decision attributes respectively. Decision rules relating the condition and the decision attributes can be derived from the table by data analysis. A rule is then represented as an implication between formulas of the logic.

The basic alphabet of a DL consists of a finite set of attribute symbols $A$ and, for each $i \in A$, a finite set of value symbols $V_i$. The syntax of DL is then defined as follows.

Definition 2
1. An atomic formula of DL is a descriptor $(i, v)$, where $i \in A$ and $v \in V_i$.
2. The set of DL well-formed formulas (wffs) is the smallest set that contains the atomic formulas and is closed under the Boolean connectives $\neg$, $\wedge$, and $\vee$.
3. If $\varphi$ and $\psi$ are wffs of DL, then $\varphi \rightarrow \psi$ is a rule in DL, where $\varphi$ is called the antecedent of the rule and $\psi$ is called the consequent.

A data table $T = (U, A, \{V_i \mid i \in A\}, \{f_i \mid i \in A\})$ is related to a given DL if there is a bijection $\tau$ from the attribute symbols of the logic to the attributes in $A$ such that, for every attribute symbol $a$, $V_{\tau(a)} = V_a$; that is, the value symbols for $a$ are exactly the values in the domain of the attribute $\tau(a)$. Thus, by somewhat abusing the notation, we usually denote an atomic formula as $(i, v)$, where $i \in A$ and $v \in V_i$, when the data table is clear from the context.

Intuitively, each element in the universe of a data table corresponds to a data record, and an atomic formula (which is in fact an attribute-value pair) describes the value of some attribute in the data record. Thus, the atomic formulas (and therefore the wffs) can be satisfied or not satisfied with respect to each data record. This generates a satisfaction relation between the universe and the set of wffs.

Definition 3 Given a DL and a data table $T = (U, A, \{V_i \mid i \in A\}, \{f_i \mid i \in A\})$ related to it, the satisfaction relation $\models_T$ between $U$ and the wffs of the DL is defined inductively as follows (the subscript $T$ is omitted for brevity):
1. $x \models (i, v)$ iff $f_i(x) = v$,
2. $x \models \neg\varphi$ iff $x \not\models \varphi$,
3. $x \models \varphi \wedge \psi$ iff $x \models \varphi$ and $x \models \psi$,
4. $x \models \varphi \vee \psi$ iff $x \models \varphi$ or $x \models \psi$.

If $\varphi$ is a DL wff, the set

$m_T(\varphi) = \{x \in U \mid x \models \varphi\}$    (1)

is called the meaning set of the formula $\varphi$ in $T$. If $T$ is understood, we simply write $m(\varphi)$.
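As an illustration (not from the paper; the tuple encoding of formulas is our own choice), the following Python sketch stores a small data table and evaluates DL wffs against its records, computing meaning sets as in Eq. (1):

```python
# A toy data table: universe {o1, o2, o3} with attributes Age and Dept.
table = {
    "o1": {"Age": "young", "Dept": "sales"},
    "o2": {"Age": "young", "Dept": "it"},
    "o3": {"Age": "old",   "Dept": "sales"},
}

def sat(table, x, phi):
    """x |= phi, for DL formulas encoded as nested tuples:
    ('desc', i, v), ('not', p), ('and', p, q), ('or', p, q)."""
    op = phi[0]
    if op == "desc":
        return table[x][phi[1]] == phi[2]
    if op == "not":
        return not sat(table, x, phi[1])
    if op == "and":
        return sat(table, x, phi[1]) and sat(table, x, phi[2])
    if op == "or":
        return sat(table, x, phi[1]) or sat(table, x, phi[2])
    raise ValueError(op)

def meaning(table, phi):
    """m_T(phi) = {x in U | x |= phi}, as in Eq. (1)."""
    return {x for x in table if sat(table, x, phi)}

phi = ("and", ("desc", "Age", "young"), ("desc", "Dept", "sales"))
print(meaning(table, phi))  # {'o1'}
```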

Sometimes, the notations $T, x \models \varphi$ and $x \models_T \varphi$ are used interchangeably with $x \models \varphi$ when the data table $T$ must be made explicit. A formula $\varphi$ is said to be valid in a data table $T$ (written as $\models_T \varphi$, or $\models \varphi$ for short when $T$ is clear from the context) if and only if $m(\varphi) = U$; that is, $\varphi$ is satisfied by all individuals in the universe. Moreover, $\varphi$ is said to be satisfiable in a data table $T$ if $m(\varphi) \neq \emptyset$.

2.3 The connection

Although an approximation space is an abstract framework for representing classification knowledge, it can be easily derived from a concrete data table. Let $T = (U, A, \{V_i \mid i \in A\}, \{f_i \mid i \in A\})$ be a data table and $B \subseteq A$ be a subset of attributes. Then, we can define an equivalence relation, called the indiscernibility relation based on $B$, as

$ind(B) = \{(x, y) \in U \times U \mid f_i(x) = f_i(y) \ \text{for all} \ i \in B\}$.

In other words, $x$ and $y$ are $B$-indiscernible if they have the same values with respect to all attributes in $B$. Consequently, for each $B \subseteq A$, $(U, ind(B))$ is an approximation space. In terms of DL, each equivalence class of $ind(B)$ is characterized by a DL formula $\bigwedge_{i \in B} (i, v_i)$, and any formula $\varphi$ of DL can be regarded as a concept $m_T(\varphi)$. Then, the equivalence class is a subset of the lower (resp. upper) approximation of the concept if the rule $\bigwedge_{i \in B} (i, v_i) \rightarrow \varphi$ is valid (resp. the formula $\bigwedge_{i \in B} (i, v_i) \wedge \varphi$ is satisfiable).
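To connect the two pictures, the following short Python sketch (ours, not the paper's; the helper name indiscernibility_partition is hypothetical) derives the partition induced by $ind(B)$ from a data table; its blocks can then be fed to the approximation functions sketched earlier:

```python
from collections import defaultdict

def indiscernibility_partition(table, B):
    """Group objects by their value tuple on the attributes in B,
    i.e., compute the equivalence classes of ind(B)."""
    blocks = defaultdict(set)
    for x, record in table.items():
        key = tuple(record[i] for i in sorted(B))
        blocks[key].add(x)
    return list(blocks.values())

table = {
    "o1": {"Age": "young", "Dept": "sales"},
    "o2": {"Age": "young", "Dept": "it"},
    "o3": {"Age": "old",   "Dept": "sales"},
}
print(indiscernibility_partition(table, {"Age"}))
# [{'o1', 'o2'}, {'o3'}]  -- the granules of ind({Age})
```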
3 Attribute Value-sorted Logic

We have shown that DL can be used to describe knowledge discovered from a data table. In fact, DL is an instance of propositional logic (PL) in which each descriptor $(i, v)$ is a primitive proposition and each object in a data table is considered an interpretation (a model) of the logic. Consequently, from the PL viewpoint, a data table is a set of models of DL. Alternatively, we can describe data tables by using first-order logic (FOL) or many-sorted first-order logic (MSFOL) [3].

3.1 Syntax

To describe data tables by using MSFOL, we consider a special instance, called attribute value-sorted logic (AVSL). The set of sorts for AVSL is $\Sigma = \{\sigma_i \mid i \in I\} \cup \{\sigma_u\}$, where $I$ is an index set. The sort $\sigma_u$ is called the object sort and each $\sigma_i$ is called an attribute value sort. The alphabet (or vocabulary) of AVSL consists of:
1. a set of constant symbols $\Lambda = \{c_1, d_1, \ldots\}$,
2. a finite set of monadic predicates $\Pi = \{P_1, Q_1, \ldots\}$,
3. a set of dyadic predicates $\{R_i \mid i \in I\}$,
4. a set of equality predicates $\{\doteq_i \mid i \in I \cup \{u\}\}$,
5. a set of variables $\Xi = \{x_1, x_2, \ldots, y_1, y_2, \ldots\}$, and
6. logical symbols: the Boolean connectives and the universal quantifier $\forall$.

We assume that a rank function is used to assign a rank to constant symbols, predicate symbols, and variables. The rank of a constant symbol or a variable is an element of $\Sigma$, and the rank of a predicate symbol is in $\Sigma^k$ if its arity is $k$. A constant (resp. variable) of rank $\sigma_u$ is called an object constant (resp. variable); otherwise, it is called an attribute domain constant (resp. variable). For each $i \in I$, $R_i$ is of rank $(\sigma_u, \sigma_i)$ and is called an attribute predicate. In addition, a monadic predicate of rank $\sigma_u$ is called a concept predicate, and for each $i \in I$, a monadic predicate of rank $\sigma_i$ is called a value predicate. We also assume that the equality predicate $\doteq_i$ is of rank $(\sigma_i, \sigma_i)$ for each $i \in I \cup \{u\}$. A term is either a constant symbol or a variable, and the rank of a term is that of the constant or the variable.

If $P$ is a predicate of rank $(\sigma_1, \ldots, \sigma_k)$ and $t_1, t_2, \ldots, t_k$ are terms of ranks $\sigma_1, \sigma_2, \ldots, \sigma_k$ respectively, then $P(t_1, t_2, \ldots, t_k)$ is an atomic formula ($k = 1, 2$). The set of wffs of AVSL is then defined as the smallest set $\Phi$ that contains all atomic formulas and is closed under the following formation rules:
1. if $\varphi \in \Phi$, then $\neg\varphi \in \Phi$;
2. if $\varphi, \psi \in \Phi$, then $\varphi \wedge \psi \in \Phi$ and $\varphi \vee \psi \in \Phi$;
3. if $x$ is a variable and $\varphi \in \Phi$, then $\forall x\varphi \in \Phi$.

As usual, we abbreviate $\neg\varphi \vee \psi$ as $\varphi \rightarrow \psi$, $(\varphi \rightarrow \psi) \wedge (\psi \rightarrow \varphi)$ as $\varphi \leftrightarrow \psi$, and $\neg\forall x\neg\varphi$ as $\exists x\varphi$.

3.2 Semantics

For the semantics, an interpretation for AVSL is a tuple $\mathcal{A} = ((D_\sigma)_{\sigma \in \Sigma}, A)$, where $D_\sigma$ is the domain of sort $\sigma$ and $A$ assigns meanings to the symbols of AVSL. Each constant symbol $c$ of rank $\sigma$ is interpreted as an element $c^A \in D_\sigma$, and each predicate symbol $P$ of rank $(\sigma_1, \ldots, \sigma_k)$ is interpreted as a relation $P^A \subseteq D_{\sigma_1} \times \cdots \times D_{\sigma_k}$ ($k = 1, 2$). In particular, the interpretation of an equality predicate is the identity relation on the corresponding domain. We assume that $D_{\sigma_u}$ and $\bigcup_{i \in I} D_{\sigma_i}$ are disjoint.

A variable assignment of the interpretation $\mathcal{A} = ((D_\sigma)_{\sigma \in \Sigma}, A)$ is defined as a mapping $\mu : \Xi \rightarrow \bigcup_{\sigma \in \Sigma} D_\sigma$. A variable assignment $\mu$ must satisfy the constraint that, if $x$ is of rank $\sigma$, then $\mu(x) \in D_\sigma$.

Such a variable assignment can be extended to a valuation on the set of all terms by setting $\mu(c) = c^A$ for each constant symbol $c$. Let $x \in \Xi$ and $d \in \bigcup_{\sigma \in \Sigma} D_\sigma$ be a variable and an element of the domain respectively. Then, we denote by $\mu[d/x]$ the variable assignment $\mu'$ such that $\mu'(x) = d$ and $\mu'(y) = \mu(y)$ for all $y \neq x$. Note that $\mu[d/x]$ is only well-defined when $d \in D_\sigma$, where $x$ is of rank $\sigma$.

The satisfaction of a wff $\varphi$ with respect to an interpretation $\mathcal{A}$ and a variable assignment $\mu$, denoted by $\mathcal{A}, \mu \models \varphi$, is defined as follows:
1. $\mathcal{A}, \mu \models P(t_1, \ldots, t_k)$ iff $(\mu(t_1), \ldots, \mu(t_k)) \in P^A$,
2. $\mathcal{A}, \mu \models \neg\varphi$ iff $\mathcal{A}, \mu \not\models \varphi$,
3. $\mathcal{A}, \mu \models \varphi \wedge \psi$ iff $\mathcal{A}, \mu \models \varphi$ and $\mathcal{A}, \mu \models \psi$,
4. $\mathcal{A}, \mu \models \varphi \vee \psi$ iff $\mathcal{A}, \mu \models \varphi$ or $\mathcal{A}, \mu \models \psi$,
5. $\mathcal{A}, \mu \models \forall x\varphi$ iff for all $d \in D_\sigma$, $\mathcal{A}, \mu[d/x] \models \varphi$, where $\sigma$ is the rank of $x$.

Let $\Gamma$ be a set of wffs. We write $\mathcal{A}, \mu \models \Gamma$ if $\mathcal{A}, \mu \models \varphi$ for all $\varphi \in \Gamma$. A wff $\varphi$ is true for the interpretation $\mathcal{A}$, denoted by $\mathcal{A} \models \varphi$, if for every variable assignment $\mu$, we have $\mathcal{A}, \mu \models \varphi$. A wff $\varphi$ is valid, denoted by $\models \varphi$, if it is true for all interpretations. Let $\Gamma$ be a set of wffs and $\varphi$ be a wff. Then, $\varphi$ is a logical consequence of $\Gamma$, denoted by $\Gamma \models \varphi$, if for every interpretation $\mathcal{A}$ and variable assignment $\mu$, $\mathcal{A}, \mu \models \Gamma$ implies $\mathcal{A}, \mu \models \varphi$. Moreover, $\varphi$ is satisfiable if there exist an interpretation $\mathcal{A}$ and a variable assignment $\mu$ such that $\mathcal{A}, \mu \models \varphi$.

It is sometimes necessary to distinguish between free and bound occurrences of a variable in a wff. To do this, we first define the scope of a quantifier. In wffs of the form $\forall x\varphi$ or $\exists x\varphi$, $\varphi$ is called the scope of the quantifier $\forall x$ or $\exists x$. A variable $x$ is said to have a bound occurrence in a wff if the occurrence is within the scope of a quantifier $\forall x$ or $\exists x$; otherwise, the occurrence is said to be free. Note that a variable can have both bound and free occurrences in a wff simultaneously. However, it can be shown that $\forall x\varphi \leftrightarrow \forall y\varphi(y/x)$ and $\exists x\varphi \leftrightarrow \exists y\varphi(y/x)$ are both valid, where $y$ is a new variable not occurring in $\varphi$ and $\varphi(y/x)$ denotes the result of replacing all free occurrences (if any) of $x$ in $\varphi$ by the variable $y$. Thus, we can always rename a bound occurrence of a variable with a new variable and assume that no variable has both bound and free occurrences in a wff simultaneously. In this way, a variable with free (resp. bound) occurrences in a wff $\varphi$ is called a free (resp. bound) variable of $\varphi$. We usually write $\varphi(x_1, x_2, \ldots, x_n)$ to emphasize that $x_1, x_2, \ldots, x_n$ are all the free variables occurring in $\varphi$. A wff without free variables is called a sentence, and a set of sentences is called a theory (an AVSL theory). In contrast to ordinary wffs, a sentence $\varphi$ has the property that, for any interpretation, either $\varphi$ or $\neg\varphi$ is true for the interpretation. We define the set of models of a theory $\Gamma$, denoted by $Mod(\Gamma)$, as the set of all interpretations in which all sentences in $\Gamma$ are true.

Let $\varphi(x_1, x_2, \ldots, x_n)$ ($n \geq 1$) be a wff and $\mathcal{A}$ be an interpretation. Then, the extension of $\varphi$ under $\mathcal{A}$ is defined as:

$\|\varphi\|^{\mathcal{A}} = \{(\mu(x_1), \ldots, \mu(x_n)) \mid \mathcal{A}, \mu \models \varphi\}$.
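Because all domains in a data-table interpretation are finite, the satisfaction clauses above translate directly into a recursive evaluator. The following Python sketch is our illustration (the tuple encoding of formulas and the pairing of variable names with sort names are ad hoc assumptions), evaluating quantified AVSL formulas over finite sorted domains:

```python
def holds(interp, mu, phi):
    """interp = (domains, relations): domains maps a sort name to a finite set,
    relations maps a predicate name to a set of tuples.  mu maps variables,
    written as (name, sort) pairs, to domain elements.  phi is a nested tuple."""
    domains, relations = interp
    op = phi[0]
    if op == "atom":                      # ('atom', P, t1, ..., tk)
        args = tuple(mu.get(t, t) for t in phi[2:])
        return args in relations[phi[1]]
    if op == "not":
        return not holds(interp, mu, phi[1])
    if op == "and":
        return holds(interp, mu, phi[1]) and holds(interp, mu, phi[2])
    if op == "or":
        return holds(interp, mu, phi[1]) or holds(interp, mu, phi[2])
    if op == "forall":                    # ('forall', (x, sort), body)
        var, sort = phi[1]
        return all(holds(interp, {**mu, (var, sort): d}, phi[2])
                   for d in domains[sort])
    raise ValueError(op)

# Example: forall x exists v R_Age(x, v), with exists v psi written as
# the abbreviation not forall v not psi.
domains = {"u": {"o1", "o2"}, "Age": {"young", "old"}}
relations = {"R_Age": {("o1", "young"), ("o2", "old")}}
phi = ("forall", ("x", "u"),
       ("not", ("forall", ("v", "Age"),
                ("not", ("atom", "R_Age", ("x", "u"), ("v", "Age"))))))
print(holds((domains, relations), {}, phi))  # True
```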
4 Granular Data Models

4.1 Data tables as AVSL models

For a data table $T = (U, A, \{V_i \mid i \in A\}, \{f_i \mid i \in A\})$, we can consider a particular AVSL in which the set of sort indices is $I = A$. Thus, the set of attribute predicates is $\{R_i \mid i \in A\}$. The table $T$ is then regarded as an AVSL interpretation $\mathcal{A} = (U, (V_i)_{i \in A}, A)$ such that the meaning of the predicate symbol $R_i$ is $R_i^A = \{(u, v) \mid u \in U, v \in V_i, f_i(u) = v\}$. The syntax allows us to quantify over both the objects and the attribute values.

Two basic axioms for the AVSL formulation of a data table are the existence and uniqueness of the attribute values of each object. A third axiom assumes that each attribute value is possessed by some individual. (This assumption is not essential; however, it simplifies the presentation without loss of generality.) Formally, a theory $\Gamma$ is called a basic data theory (BDT) if it contains
1. $\forall x \exists v R_i(x, v)$,
2. $\forall x \forall v \forall v' (R_i(x, v) \wedge R_i(x, v') \rightarrow v \doteq_i v')$, and
3. $\forall v \exists x R_i(x, v)$
for all $i \in I$. In this paper, we are only interested in BDTs, so we use the terms theory and BDT interchangeably.
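As an illustration (not from the paper; the helper names are hypothetical), the sketch below builds the attribute-predicate relations $R_i$ from a data table and checks the three BDT axioms directly on the finite interpretation:

```python
def table_to_relations(table, domains):
    """R_i = {(u, v) | f_i(u) = v} for each attribute i."""
    return {i: {(u, rec[i]) for u, rec in table.items()} for i in domains}

def satisfies_bdt(table, domains):
    """Check existence and uniqueness of each object's value, and that
    every value in the domain is possessed by some object."""
    rels = table_to_relations(table, domains)
    objs = set(table)
    for i, vals in domains.items():
        existence = all(any((u, v) in rels[i] for v in vals) for u in objs)
        uniqueness = all(sum((u, v) in rels[i] for v in vals) <= 1 for u in objs)
        onto = all(any((u, v) in rels[i] for u in objs) for v in vals)
        if not (existence and uniqueness and onto):
            return False
    return True

table = {"o1": {"Age": "young"}, "o2": {"Age": "old"}, "o3": {"Age": "old"}}
print(satisfies_bdt(table, {"Age": {"young", "old"}}))  # True
```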

4.2 Isomorphism of AVSL models

Two interpretations $\mathcal{A} = ((D_\sigma)_{\sigma \in \Sigma}, A)$ and $\mathcal{B} = ((D'_\sigma)_{\sigma \in \Sigma}, B)$ are isomorphic iff there is a one-one correspondence $g : \bigcup_{\sigma \in \Sigma} D_\sigma \rightarrow \bigcup_{\sigma \in \Sigma} D'_\sigma$ such that
1. $g(a) \in D'_\sigma$ iff $a \in D_\sigma$, for each $\sigma \in \Sigma$ and each $a \in \bigcup_{\sigma \in \Sigma} D_\sigma$,
2. $g(c^A) = c^B$ for each constant symbol $c$, and
3. $(a_1, \ldots, a_k) \in P^A$ iff $(g(a_1), \ldots, g(a_k)) \in P^B$ for each predicate $P$ and $(a_1, \ldots, a_k) \in D_{\sigma_1} \times \cdots \times D_{\sigma_k}$, where $(\sigma_1, \ldots, \sigma_k)$ is the rank of $P$.

The correspondence $g$ is called an isomorphism between $\mathcal{A}$ and $\mathcal{B}$, and the notation $\mathcal{A} \cong \mathcal{B}$ is used to indicate that $\mathcal{A}$ and $\mathcal{B}$ are isomorphic. Sometimes, it is written as $\mathcal{A} \cong_g \mathcal{B}$ to make the isomorphism $g$ explicit. Let $\mu$ be a variable assignment on $\mathcal{A}$ and $g$ be an isomorphism between $\mathcal{A}$ and $\mathcal{B}$. Then, $g \circ \mu$ is a variable assignment on $\mathcal{B}$ such that $(g \circ \mu)(x) = g(\mu(x))$. The following proposition, presented in [5], shows that isomorphic models satisfy the same set of wffs.

Proposition 1 Let $\mathcal{A}$ and $\mathcal{B}$ be two interpretations such that $\mathcal{A} \cong_g \mathcal{B}$. Then, for any wff $\varphi$ and variable assignment $\mu$ on $\mathcal{A}$, we have $\mathcal{A}, \mu \models \varphi$ iff $\mathcal{B}, g \circ \mu \models \varphi$.

By interpreting a data table as an AVSL model, an association rule or a pattern is simply an AVSL formula $\varphi(x)$ with a single free variable of rank $\sigma_u$. Then, the extension of $\varphi$ under the interpretation $\mathcal{A}$ is its support set in the corresponding data table. A direct corollary of the above proposition shows that a pattern has the same support in isomorphic interpretations.

Corollary 1 Let $\mathcal{A}$ and $\mathcal{B}$ be two interpretations such that $\mathcal{A} \cong_g \mathcal{B}$. Then, for any pattern $\varphi(x)$, we have $g(\|\varphi\|^{\mathcal{A}}) = \|\varphi\|^{\mathcal{B}}$.

Since $g$ is a one-one correspondence, $\|\varphi\|^{\mathcal{A}}$ and $\|\varphi\|^{\mathcal{B}}$ have the same cardinality, so the supports are the same.

4.3 Logical formulation of definability

Next, we formulate the indiscernibility relation and definability of rough set theory in AVSL. Let $x$ and $y$ be object variables, $v$ be an attribute domain variable, and $S$ be a subset of the index set $I$. Then, we can define the indiscernibility formula (with respect to $S$) as:

$\varepsilon_S(x, y) = \bigwedge_{i \in S} \forall v (R_i(x, v) \leftrightarrow R_i(y, v))$.

When $S$ is a singleton $\{i\}$, we write $\varepsilon_S(x, y)$ as $\varepsilon_i(x, y)$. Given an arbitrary concept predicate $P$, we can define two formulas corresponding to its lower and upper approximations as follows:

$\underline{\varepsilon}P_S(x) = \forall y (\varepsilon_S(x, y) \rightarrow P(y))$,
$\overline{\varepsilon}P_S(x) = \exists y (\varepsilon_S(x, y) \wedge P(y))$.

Let $\Gamma$ be an AVSL theory that contains only predicate symbols in $\{R_i \mid i \in S\} \cup \{P\}$. Then, we say that $P$ is indiscernibly $S$-definable with respect to $\Gamma$ if $\Gamma \models \forall x (\underline{\varepsilon}P_S(x) \leftrightarrow \overline{\varepsilon}P_S(x))$.

4.4 Isomorphic attribute predicates

An important notion in rough set theory is that of a reduct, which is a minimal subset of attributes that can induce the same indiscernibility relation as the original set of attributes. In particular, if two attributes induce the same indiscernibility relation, then at least one of them is dispensable. This can be formulated in AVSL with the isomorphism between attribute predicates. Two attribute predicates, $R_1$ and $R_2$, are said to be isomorphic with respect to a BDT $\Gamma$, denoted by $\Gamma \models R_1 \cong R_2$, if $\Gamma \models \forall v_1 \exists v_2 \forall x (R_1(x, v_1) \leftrightarrow R_2(x, v_2))$. We then have the following proposition.

Proposition 2 Let $\Gamma$ be a BDT and $R_i$ and $R_j$ be two attribute predicates. Then $\Gamma \models R_i \cong R_j$ iff $\Gamma \models \forall x \forall y (\varepsilon_i(x, y) \leftrightarrow \varepsilon_j(x, y))$.

In an ordinary AVSL interpretation, two isomorphic attribute predicates could be interpreted as different binary relations because their domains may be different. However, we can always construct a kind of parsimonious model such that all isomorphic attribute predicates are interpreted as the same binary relation. This can be achieved by using the granular data model (GDM).
To introduce the GDM, we recall the notion of a partition. Given a domain $D$, a partition of $D$ is a subset $\{s_1, s_2, \ldots, s_m\} \subseteq 2^D$ such that $\bigcup_{i=1}^m s_i = D$ and $s_i \cap s_j = \emptyset$ for $1 \leq i \neq j \leq m$. As mentioned in Section 2, an equivalence relation on $D$ can induce a partition of $D$, i.e., the set of all equivalence classes of the equivalence relation. Conversely, an equivalence relation can be obtained from a partition by taking each $s_i$ as an equivalence class. In granular computing, the equivalence classes in a partition are also called granules.

Now, an AVSL interpretation $\mathcal{A} = ((D_\sigma)_{\sigma \in \Sigma}, A)$ is called a granular data model if it satisfies the following two conditions for each $i \in I$:
1. $D_{\sigma_i}$ is a partition of $D_{\sigma_u}$, and
2. for each $a \in D_{\sigma_u}$ and $s \in D_{\sigma_i}$, $(a, s) \in R_i^A$ iff $a \in s$.

Let $\mathcal{A} = ((D_\sigma)_{\sigma \in \Sigma}, A)$ be a model of any BDT $\Gamma$. Then, we can transform $\mathcal{A}$ into an isomorphic GDM $\mathcal{B} = ((D'_\sigma)_{\sigma \in \Sigma}, B)$ as follows:

$D'_{\sigma_u} = D_{\sigma_u}$,
$D'_{\sigma_i} = \{\{a \mid (a, v) \in R_i^A\} \mid v \in D_{\sigma_i}\}$ for each $i \in I$,
$R_i^B = \{(a, s) \mid a \in s, a \in D'_{\sigma_u}, s \in D'_{\sigma_i}\}$ for each $i \in I$.

Obviously, $\mathcal{A}$ and $\mathcal{B}$ are isomorphic by the following one-one correspondence:

$a \mapsto a$ for all $a \in D_{\sigma_u}$,
$v \mapsto \{a \mid (a, v) \in R_i^A\}$ for all $v \in D_{\sigma_i}$ and $i \in I$.
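A small Python sketch of this construction (ours; the name to_granular_model is hypothetical): each attribute value $v$ is replaced by the granule of objects carrying it, and $R_i$ becomes the membership relation between objects and granules.

```python
def to_granular_model(table, attributes):
    """Turn a data-table model into a granular data model: the value domain of
    each attribute i becomes the partition {granule(v) | v in V_i}, and
    R_i relates every object to the granule containing it."""
    objects = set(table)
    value_domains = {}
    relations = {}
    for i in attributes:
        granules = {}
        for u in objects:
            v = table[u][i]
            granules.setdefault(v, frozenset(
                x for x in objects if table[x][i] == v))
        value_domains[i] = set(granules.values())   # a partition of the universe
        relations[i] = {(u, granules[table[u][i]]) for u in objects}
    return objects, value_domains, relations

table = {"o1": {"Age": "young"}, "o2": {"Age": "old"}, "o3": {"Age": "old"}}
_, doms, rels = to_granular_model(table, ["Age"])
print(doms["Age"])  # {frozenset({'o1'}), frozenset({'o2', 'o3'})}
```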

Because $\mathcal{A}$ is a model of a BDT $\Gamma$, the above construction will result in a GDM. Note that if $\Gamma \models R_i \cong R_j$, then $D'_{\sigma_i} = D'_{\sigma_j}$ by the construction. In other words, we have two copies of the same partition that serve as the value domains of attributes $i$ and $j$ respectively. Thus, by slightly abusing the notation, we can show that $R_i^B = R_j^B$. In summary, we have the following proposition.

Proposition 3 For each BDT model $\mathcal{A}$, we can find an isomorphic GDM. Furthermore, two isomorphic attribute predicates are interpreted as the same binary relation in such a GDM.

Since all isomorphic GDMs are identical up to a renaming of the elements of the domain $D_{\sigma_u}$, we can arbitrarily select a GDM as the canonical model for a class of isomorphic interpretations.

4.5 Association rule mining

As mentioned earlier, association rules are represented as AVSL formulas with a single free variable of rank $\sigma_u$. In fact, the formulas in this form are more general than the traditional association rules, since AVSL is more expressive than DL. For example, we can represent a rule as follows:

$\forall v_1 \forall v_2 (R_1(x, v_1) \wedge R_2(x, v_2) \rightarrow \forall y (R_1(y, v_1) \rightarrow R_2(y, v_2)))$.

The rule expresses the dependency of attribute 2 on attribute 1; however, in DL, we can only express the dependency between particular attribute values.

The subset of AVSL formulas corresponding to the DL-styled association rules is the class of ground association rules (GAR). Given a fixed object variable $x$, the set of GARs based on $x$ is defined by the following grammar:

$\varphi(x) ::= R_i(x, c_i) \mid \neg\varphi(x) \mid \varphi_1(x) \wedge \varphi_2(x) \mid \varphi_1(x) \vee \varphi_2(x)$,

where $i \in I$ and $c_i$ is a constant of rank $\sigma_i$. Let $\mathcal{A}$ be a GDM. Then the extension of a GAR $\varphi(x)$ under $\mathcal{A}$ can be easily computed by performing set-theoretic operations on subsets of $D_{\sigma_u}$ because, for each formula $R_i(x, c_i)$, we have $\|R_i(x, c_i)\|^{\mathcal{A}} = c_i^A \subseteq D_{\sigma_u}$.

Based on the computation of the extension of a GAR, we can transform the mining of frequent patterns into an inequality constraint satisfaction problem [4]. For a given data table, let us assume that the set of constants of rank $\sigma_i$ is equal to the set of values of attribute $i$ that appear in the data table. Then, we define an elementary formula as a GAR $\bigwedge_{i \in I} R_i(x, c_i)$, where each $c_i$ is a constant of rank $\sigma_i$. Let $\mathcal{A}$ be a GDM corresponding to the given data table. Then,

$\|\bigwedge_{i \in I} R_i(x, c_i)\|^{\mathcal{A}} = \bigcap_{i \in I} c_i^A$.

For an elementary formula $\varphi(x)$, let us denote the cardinality of $\|\varphi\|^{\mathcal{A}}$ by $|\varphi|$. Then, we can find all frequent patterns of the data table by solving the following linear inequality:

$\sum_{\varphi \in \Phi_0} |\varphi| \cdot s_\varphi \geq r$,

where $\Phi_0$ is the set of all elementary formulas, $s_\varphi \in \{0, 1\}$ for all $\varphi \in \Phi_0$, and $r$ is the threshold for high frequency. A solution of the inequality is a mapping $\rho : \{s_\varphi \mid \varphi \in \Phi_0\} \rightarrow \{0, 1\}$ such that the inequality is satisfied by substituting each variable $s_\varphi$ with $\rho(s_\varphi)$. Each solution $\rho$ of the inequality constraint then corresponds to the GAR $\bigvee \{\varphi \in \Phi_0 \mid \rho(s_\varphi) = 1\}$.
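The following Python sketch (our illustration of the idea in [4], not the paper's algorithm) computes the cardinalities of the elementary granules and enumerates the 0/1 choices of $s_\varphi$ that satisfy the threshold inequality, i.e., the unions of elementary granules that are frequent:

```python
from itertools import product

def elementary_granules(table, attributes):
    """Map each value combination (c_i)_{i in I} to its granule, i.e., the
    extension of the elementary formula /\_i R_i(x, c_i)."""
    granules = {}
    for u, rec in table.items():
        key = tuple(rec[i] for i in attributes)
        granules.setdefault(key, set()).add(u)
    return granules

def frequent_unions(table, attributes, r):
    """Enumerate selections s: Phi_0 -> {0,1} with sum |phi| * s_phi >= r."""
    granules = elementary_granules(table, attributes)
    keys = list(granules)
    solutions = []
    for bits in product([0, 1], repeat=len(keys)):
        if sum(len(granules[k]) * b for k, b in zip(keys, bits)) >= r:
            chosen = [k for k, b in zip(keys, bits) if b]
            solutions.append(chosen)  # the GAR \/ of the chosen elementary formulas
    return solutions

table = {"o1": {"Age": "young", "Dept": "sales"},
         "o2": {"Age": "young", "Dept": "sales"},
         "o3": {"Age": "old",   "Dept": "it"}}
for sol in frequent_unions(table, ["Age", "Dept"], r=2):
    print(sol)
# [('young', 'sales')] and [('young', 'sales'), ('old', 'it')] meet the threshold
```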
5 Conclusion

In this paper, we propose using AVSL as the description language for data tables. Semantically, each data table corresponds to an AVSL interpretation. Association rules are then represented as AVSL formulas with only a single free object variable. Based on basic results in classical logic, we can see that the sets of patterns discovered from two isomorphic models are the same. This implies that associations are syntactic in nature. For each class of isomorphic models, we can obtain a canonical model, i.e., a granular data model. Thus, it is appropriate to perform association mining with the granular data model. Building on the observation in [4], we have shown that the mining of frequent patterns can be transformed into an inequality constraint satisfaction problem.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), 1994.
[2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE'95), pages 3-14, 1995.
[3] T. Fan, C. Liau, and D. Liu. Definability in logic and rough set theory. In Proceedings of the 18th European Conference on Artificial Intelligence, pages 749-750, 2008.
[4] T. Lin. Mining associations by linear inequalities. In Proceedings of the 4th IEEE International Conference on Data Mining, pages 154-161, 2004.
[5] E. Mendelson. Introduction to Mathematical Logic. Chapman & Hall/CRC, fourth edition, 1997.
[6] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.
[7] L. Polkowski, S. Tsumoto, and T. Lin, editors. Rough Set Methods and Applications. Physica-Verlag, 2000.
[8] A. Skowron and L. Polkowski, editors. Rough Sets in Knowledge Discovery 1: Methodology and Applications. Physica-Verlag, 1998.

[9] A. Skowron and L. Polkowski, editors. Rough Sets in Knowledge Discovery 2: Applications, Case Studies and Software Systems. Physica-Verlag, 1998.