Issues in Modeling for Data Mining

Issues in Modeling for Data Mining Tsau Young (T.Y.) Lin Department of Mathematics and Computer Science San Jose State University San Jose, CA 95192 tylin@cs.sjsu.edu ABSTRACT Modeling in data mining has not been fully explored; so data mining is often regarded as a set of added operations in database systems. The consequence many notions are overloaded. Some terms often have different semantics. This paper continues previous effort to explore and reiterate some fundamental issues in data mining. Keywords. Modeling, data mining, equivalence relation, lattice. 1. Introduction A DBMS vender often offers a data mining service by adding a set of tools into its existing DBMS. This practice has been translates into, in academic world, the following view:data mining is a set of added operations onto the classical data model. So database terms are used without changes in data mining research. This is a confusing, in fact, an incorrect view. Some semantics of these terms can be very different between database and data mining. Roughly database stores data according to its semantics. While data mining is to discover what the data have expressed. A set of data may not be able to fully expressed the given semantics. 1.1. What is data mining? A commonly quoted definition defines data mining as a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from data [3]. We have pointed out in several occasions that some terms, such as novel, useful, and understandable are subjective and can not be used as scientific criteria. However, they do point out one important requirement: The discovered patterns should be relevant to the real world. We call it a real-world-pattern. We believe Real-world-patterns are the primary goal of data mining This leads us to examine the mathematical modeling of realworldforthepurposeofdatamining. 1.2. Notations Let U be the set of the real world entities and A ={A 1,A 2,, A n } be a set of attributes. A relation K will be viewed as a knowledge representation that assigns each entity or object a unique tuple, K: U Dom(A 1 ) Dom(A 2 )... Dom(A n ); u (a 1,a 2,,a n ), where Dom(-) is the active domain (= the set of attribute values currently in use). Traditionally, the image of K is referred to as a relation. For modeling, it is more convenient to have the independent variables available for discussions, so we will use the graph (u, K(u)) and call it information table or simply table. 2. Are the Given Attributes Adequate? A tuple (of degree n) in a numerical relation (of degree n) can be regarded as a coordinate of a point in n dimensional Euclidean space. A coordinate system {X 1, X 2,,X n } of an Euclidean space is a set {A 1,A 2,,A n } of attribute names. A point p x with coordinate x = (x 1,x 2,, x n ) is a tuple. The set P = {p x }of points is a coordinate dependent concept; it is a set of points, real world objects. While the set R of numerical tuples is a coordinate dependent concept; it is merely a merely representation, a matrix. A pattern (a statement about property of R) may or may not be a real world event (a statement about geometry). A pattern with real world significance is, then, a geometric property of P. 2.1. Invisible Association Rules? Letusre-visittheexample[p]:Thefirstcolumnisthelist of entities(directed segments). Attribute B is the Starting point of a direct segment, the others are the Degree and

Length ( polar coordinates ). This table only has one association rule of length 2, namely, (p 3, 2.0) if the required support is 4. Segment# S D L S 1 p 1 0 6.0 S 2 p 3 90 2.0 S 3 p 3 120 2.0 S 4 p 3 135 2.0 S 5 p 3 150 2.0 S 6 p 3 180 2.0 S 7 p 3 210 2.0 S 8 p 3 225 2.0 S 9 p 3 240 2.0 S 10 p 3 270 2.0 Table 1. 10 directed segments in polar coordinates Segment# B H V S 1 p 1 6 0 S 2 p 3 0 2 S 3 p 3-1 3 S 4 p 3-2 2 S 5 p 3-3 1 S 6 p 3-2 0 S 7 p 3-3 -1 S 8 p 3-2 - 2 S 9 p 3-3 -1 S 10 p 3 0-2 Table 2. 10 directed segments in (X,Y)-coordinates Table 2 is transformed from Table 1 by switching to Cartesian coordinate system; the two new attributes are Horizontal and Vertical length. The only association rule disappears. In this geometric database, the association rule is a real world phenomenon (a geometric fact), the same information should be still carried by Table 2. The question is then: How can such invisible association rules be discovered from Table 2? 3. Mining on Derived Attributes? To understand what is really reflected by attributes, let us examine a simple example. Table 3 is a numerical information table of 9 points in the Euclidean 3-space. The first column is the names of the points, the second columns and etc are Z-X- and Y- coordinates point X Y Z P1 200 0 0 P2 100 0 0 P3 173 100 1 P4 87 50 1 P5 44 25 1 P6 141 141 2 P7 71 71 2 P8 36 36 2 P9 0 200 3 Table 3. 9 points with coordinate in Euclidean 3-space Next, let us consider a transformation that rotates the X- and Y- axes around Z-axis. Table 4 expands Table 3 by including new X- and Y- coordinates of various rotations. The suffixes, 20, 30, 52, and 79, represent the new coordinate axes (attribute names) that rotate 0.20, 0.30 and etc radians. Each new axis is a new derived attribute. point Z X Y X20 Y20 X30 Y30 X52 Y52 X79 Y79 P1 0 200 0 196-40 191-59 174-99 141-142 P2 0 100 0 98-20 96-30 87-50 70-71 P3 1 173 100 189 64 195 44 200 0 193-53 P4 1 87 50 95 32 98 22 100 0 97-27 P5 1 44 25 48 16 49 11 51 0 49-14 P6 2 141 141 166 110 176 93 192 52 199 0 P7 2 71 71 84 55 89 47 97 26 100 0 P8 2 36 36 42 28 45 24 49 13 51 0 P9 3 0 200 40 196 59 191 99 174 142 141 Table 4. 9 points with coordinate in various axes; they rotate radians, 0.2, 0.3, 0.53(=30 o ), 0.79(=45 o )

The table seems indicate that there are at least as many new attributes as angles. Even worse, there are many non-linear transformations not listed in Table 4. Several questions arise: 1. Are all these attributes distinct? (from data mining point of view) 2. Could we find ALL derived attributes? 3. Should data mining consider all of them? We will answer the first and second one in next few sections. For third question, from scientific point of view, we should consider all, however, from practical application point of view, we may only want to select some from all. This indicates that we should extend the common practices of feature selection to this full set, not on the given set of attiutes. 3. Is Labeling Concepts/Data Necessary? 3.1. Isomorphic Attributes and Tables Definition 1. Attributes A i and A j are isomorphic iff there is a one-to-one and onto map, s: Dom(A i ) Dom(A j ) such that A j (v)= s(a i (v)), v V.siscalled an isomorphism [p]. Definition 2. Let K and H be two information tables with thesameuniversev.leta={a 1,A 2,,A n }andb ={B 1, B 2,, B m } be the attributes of K and H respectively. Then, K and H are said to be isomorphic if every A i is isomorphic to some B j, and vice versa. If n = m, then it is said to be strict isomorphic. Intuitively isomorphic is re-labeling. theorem should be obvious. The following Theorem 1. Let K and H be strictly isomorphic. Then patterns, such as, association rules, of K can be transformed to those of H by re-labeling the data. Corollary Labeling the attribute values (base concepts) does not influence the results of data mining. Now, we apply the notion of isomorphism to Table 4, we find that X X20 X30 X52 X79 Y20 Y30; where means isomorphic. So by dropping the isomorphic attributes, Table 4 is reduced to Table 5. point Z X Y Y52 Y79 P1 1 200 0-99 -142 P2 1 100 0-50 -71 P3 1 173 100 0-53 P4 1 87 50 0-27 P5 1 44 25 0-14 P6 2 141 141 52 0 P7 2 71 71 26 0 P8 2 36 36 13 0 P9 3 0 200 174 141 Table 5 Simplified Table of 9 points The association rules are (support 3) (Z=1,Y52=0); (Z =2,Y79=0); 3.2. Isomorphic High Level Concepts One of the important data mining practices is the concept hierarchy (attribute oriented generalization). In the current state of arts, human users supply the concept hierarchies [4]. In other words, the nested sequence of equivalence relations are labeled by human users. The question is: Is such a labeling necessary for data mining? In other words, will we get new rules if we rename all the nodes of a hierarchy? By considering the table of high level concepts and results of Section 3.1, the answer is no (re-labeling Table 6 will not change the high level association rules) Theorem2. Isomorphic AOG have isomorphic high level association rules. Corollary Given a nested sequence of partitions, the generalized (high level) association rules are independent of labeling. Corollary Labeling the high level and base concepts does not influence the results of data mining. point Z X Y

P1 P2 P3 P4 P5 P6 P7 P8 Lz45={2} Lz45={2} Lz45={2} Lx00= {200, 100} Ly00={0} Lx00= {200, 100} Ly00={0} Lx30= Ly30= {173, 87, 44} {100, 50,25} Lx30= Ly30= {173, 87, 44} {100, 50,25} Lx30= Ly30= {173, 87, 44} {100, 50,25} Lx45= Ly45= {141, 71, 36} {141, 71,36} Lx45= Ly45= {141, 71, 36} {141, 71,36} Lx45= Ly45= {141, 71, 36} {141, 71,36} P9 3 0 200 Table 6. 4. Granular Data Model High Level Table; granules are labeled In last section, we have shown that labels are not essential. So for data mining the underlying mathematical structures is important and we will use them to label the data [7,8, 9, 10]. An attribute can be interpreted as the composition of K and the projection P j :Dom(A 1 ) Dom(A 2 )... Dom(A n ) Dom(A j ) A j :U Dom(A j ). Each A j induces a partition on U. By replacing attribute values a j with its inverse image (A j ) (-1) (a i ); the collection of these inverse images together forms a partition on U. For example, Pi= {p i }, P12={p 1,p 2 }and,p345={p 3, p 4,p 5 }, P678=={p 6,p 7,p 8 } are equivalence classes. One can label an equivalence class by itself (called canonical names ). So Pi, Pij, Pijk are labels (canonical names) of the respective equivalence classes. Using these canonical names, Table 4 is transformed to Table 9; such a table is called a unnamed relation, because each granule is not assigned a human oriented meaningful name; they also have been called machine oriented model [10]. Each attribute value of Table 4 is a human oriented meaningful name of the corresponding Pi in Table 9. This naming is an isomorphism, that is, Table 4 is isomorphic to Table 9. Note that the attribute names, say A, are replaced by the induced equivalence relation E A. We hope, we have convinced the readers the following theorem: Theorem 3. Unnamed information table (relation) is isomorphic to the original table K, hence their respective association rules are isomorphic. The essential ingredients in an unnamed relations are U and E, where E ={E 1, E 2,, E n } is the set of equivalence relations induced by attributes. We will call the pair (U, E={E 1,E 2,,E n }) a Granular Data Model (GDM) of K. Pawlak called (U, E) an approximation space, if E has only one equivalence relation, and in general a knowledge base; knowledge base often has different meaning, we will use Granular Data Model. Summarize the discussions above we have: Definition. Given an (ordinary) relation K, then the information table (relation), obtained by using the induced equivalence relations as the names of attributes and equivalence classes as attribute values, is called an unnamed table (relation) of K; the Pair (U, E) is called the granular data model of K. point E Z E X E Y E X20 E Y20 E X30 E Y30 E X52 E Y52 E X79 E Y79 P1 P12 P1 P12 P1 P1 P1 P1 P1 P1 P1 P1 P2 P12 P2 P12 P2 P2 P2 P2 P2 P2 P2 P2 P3 P345 P3 P3 P3 P3 P3 P3 P3 P345 P3 P3 P4 P345 P4 P4 P4 P4 P4 P4 P4 P345 P4 P4 P5 P345 P5 P5 P5 P5 P5 P5 P5 P345 P5 P5 P6 P678 P6 P6 P6 P6 P6 P6 P6 P6 P6 P678 P7 P678 P7 P7 P7 P7 P7 P7 P7 P7 P7 P678 P8 P678 P8 P8 P8 P8 P8 P8 P8 P8 P8 P678 P9 P9 P9 P9 P9 P9 P9 P9 P9 P9 P9 P9 Table 9. Unnamed Relation (attribute values are expressed by granules)

The uses of unnamed relations have some advantage over the traditional ones in machine processing (including data mining). For example, we have shown in Table 4 that many attributes are isomorphic (with efforts); while in Table 9 isomorphic attributes are the identical columns. So the simplification of Table 9 into Table 10 becomes effortless in unnamed relations. point Z X Y Y52 Y79 P1 P12 P1 P12 P1 P1 P2 P12 P2 P12 P2 P2 P3 P345 P3 P3 P345 P3 P4 P345 P4 P4 P345 P4 P5 P345 P5 P5 P345 P5 P6 P678 P6 P6 P6 P678 P7 P678 P7 P7 P7 P678 P8 P678 P8 P8 P8 P678 P9 P9 P9 P9 P9 P9 Table 10. Simplified Unnamed Relation 5. The Features Completion- the Lattice in Partition In this section, we will show how one can use granular data model/unnamed relation to find all derived attributes. 5.1. Attribute Transformations By a new attribute Y we mean there is a subset, denoted by Dom(Y), and an onto map U Dom(Y). Let B={B 1, B 2,,B k } be a subset of of A, that is, each B i is some A ji. Letusassumethereisamap f: Dom(B 1 ) Dom(B 2 )... Dom(B k ) Dom(Y). such that f(b 1 (u), B 2 (u),, B k (u))=y(u) u. Then we said Y is an attribute transformed from B = {B 1,B 2,, B k }, and denoted by Y= f(b 1,B 2,,B k ) The details of f are illustrated in Table 7. V (B 1 B 2 B k Y) v 1 (b 1 1 b 2 1... b k 1 f 1 =f(b 1 1,b 2 1,...,b k 1)) v 2 (b 1 2 b 2 2... b k 2 f 2 =f(b 1 2,b 2 2,...,b k 2)) v 3 (b 1 3 b 2 3... b k 3 f 3 =f(b 1 3,b 2 3,...,b k 3))...... v i (b 1 i b 2 i b k i f i =f(b 1 i,b 2 i,...,b k i))...... Table 7 Attribute (feature) Transformations In database theory, we say Y is (extension) functionally depended on B. The equivalence relation induced by Y is denoted by Y E. Theorem 4. Y is an attribute transformed from B ={B 1,B 2,, B k }iffy E is a coarsening of E B1 E B2... E Bk,where each E Bi is the equivalence relation induced by B i. 5.2. Lattices of Partitions induced by Attributes Let E be the set of equivalence relation induced by the attributes of K; see the beginning of Section 4. The Boolean algebra 2 E is a lattice. Let Π (U) be the lattice of all partitions (equivalence relations) on U; its join is the intersection of equivalence relations and the meet is the union of equivalence relations, where the union is the smallest coarsening of its components. T. T. Lee uses the map θ: 2 A Π(U), to summarize the fact that each attribute induces an equivalence relation on the universe U. He observed that the map only respects the meet, but not the join [5]. The image of θ was called the relation lattice; he observed then that the image is a subset of Π(U), but not a sublattice. This is the point of our departure. Previously, we have let L(E) be the smallest sublattice that contains E and called it the generalized relation lattice [8, 9]. We also have called the smallest sublattice that contains E and all its coarsening the lattice-in-partitions and denoted by L(E*) [6]. Theorem 5. L(E*) is a finite lattice and the feature completion of E. This theorem says that the number of derived attributes is finite. For data mining, we may consider the following granular data model (GDM) for a given K. 1. (U, E) is the GDM of K 2. (U, L(E)) is a new GDM associated with K 3. (U, L(E*)) is a new GDM that contains feature completion 4. (U, G) is a new GDM that contains essential attributes, where G consists only of all join-irreducible equivalence relations in L(E*). All possible generalized association canbe found here.

Let us consider a Table that collapses redundant tuples in Table 6 point Z X Y P12=u 1 Lz03 Lx00 Ly00 P345= u 2 Lz03 Lx30 Ly30 P678= u 3 Lz45 Lx45 Ly45 P9= u 4 3 0 200 Table 11. An Information Table K (Reduced Table 6) It is clear X Y, so we drop Y and have an unnamed relation. point E Z E X P12=u 1 {u 1,u 2 } {u 1 } P345= u 2 {u 1,u 2 } {u 2 } P678= u 3 {u 3 } {u 3 } P9= u 4 {u 4 } {u 4 } Table 12. Unnamed relation of K Since E Z E X has 4 equivalence classes, so the number of all possible derived attribute is the Bell number B 4 =16; see [2]. We will not display them. References [1] G. Birkhoff and S. MacLane, A Survey of Modern Algebra, Macmillan, 1977 [2] R. Brualdi, Introductory Combinatorics, Prentice Hall, 1992. [3] Fayad, U. M., Piatetsky-Sjapiro, G. Smyth, P. From Data Mining to Knowledge Discovery: An overview. In Fayard, Piatetsky-Sjapiro, Smyth, and Uthurusamy eds., Knowledge Discovery in Databases, AAAI/MIT Press, 1996. [4]. Hadjimichael, H. Hamilton, N. Cercone. ``Extracting Concept Hierarchies from Relational Databases,'' in Proceedings, FLAIRS-95, Florida AI Research Symposium, Melbourne Beach, Florida, 1995. [5] T. T. Lee, Algebraic Theory of Relational Databases, The Bell System Technical Journal Vol 62, No 10, December, 1983, pp.3159-3204 [6] T.Y. Lin, Data Modeling for Data Mining. In: Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV, B. Dasarathy (ed), Proceeding of SPIE Vol 4730, Orlando, Fl, April 1-5, 2002 [7] T. Y. Lin, "Data Mining: Granular Computing Approach." In: Methodologies for Knowledge Discovery and Data Mining, Lecture Notes in Artificial Intelligence 1574, Third Pacific-Asia Conference, Beijing, April 26-28, 1999, 24-33. [8] T.Y. Lin, Feature Transformations and Structure of Attributes In: Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV, Proceeding of SPIE s aerosence 2002 1-5 April 2002 Orlando, FL. [9] T.Y. Lin, The Lattice Structure of Database and Mining High Level Rules. In: Proceedings of Workshop on Data Mining and E-Organizations, International Conference on Computer Software and Applications, Chicago, 2001, October 8-12, 2001. Also in Bulletin of International Rough Set Society [10] T. Y. Lin, ``Data Mining and Machine Oriented Modeling: A Granular Computing Approach," Journal of Applied Intelligence, Kluwer, Vol. 13, No 2, September/October,2000, pp.113-124