Issues in Modeling for Data Mining


Tsau Young (T.Y.) Lin
Department of Mathematics and Computer Science
San Jose State University, San Jose, CA 95192
tylin@cs.sjsu.edu

ABSTRACT

Modeling in data mining has not been fully explored, so data mining is often regarded as a set of operations added onto database systems. As a consequence, many notions are overloaded, and the same terms often carry different semantics. This paper continues previous efforts to explore and reiterate some fundamental issues in data mining.

Keywords. Modeling, data mining, equivalence relation, lattice.

1. Introduction

A DBMS vendor often offers a data mining service by adding a set of tools to its existing DBMS. In the academic world, this practice has been translated into the following view: data mining is a set of operations added onto the classical data model. Database terms are therefore used unchanged in data mining research. This is a confusing, in fact incorrect, view: the semantics of these terms can be very different between databases and data mining. Roughly speaking, a database stores data according to its semantics, while data mining discovers what the data actually express; a set of data may not fully express the given semantics.

1.1. What is data mining?

A commonly quoted definition describes data mining as a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from data [3]. We have pointed out on several occasions that terms such as "novel", "useful", and "understandable" are subjective and cannot be used as scientific criteria. However, they do point to one important requirement: the discovered patterns should be relevant to the real world. We call such a pattern a real-world pattern, and we believe real-world patterns are the primary goal of data mining. This leads us to examine the mathematical modeling of the real world for the purpose of data mining.

1.2. Notations

Let U be the set of real-world entities and A = {A_1, A_2, ..., A_n} a set of attributes. A relation K will be viewed as a knowledge representation that assigns each entity (object) a unique tuple:

   K: U -> Dom(A_1) x Dom(A_2) x ... x Dom(A_n);  u |-> (a_1, a_2, ..., a_n),

where Dom(-) is the active domain (the set of attribute values currently in use). Traditionally, the image of K is referred to as a relation. For modeling, it is more convenient to have the independent variable available for discussion, so we will use the graph (u, K(u)) and call it an information table, or simply a table.
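Concretely, here is a minimal Python sketch of these notations; the entities and values are borrowed from Table 3 below, and the helper names are our own illustration:

```python
attributes = ("Z", "X", "Y")

# K assigns each entity a unique tuple of attribute values
# (a few rows borrowed from Table 3).
K = {
    "P1": (0, 200, 0),
    "P2": (0, 100, 0),
    "P3": (1, 173, 100),
}

# The traditional "relation" is the image of K: a set of tuples
# in which the entities themselves are forgotten.
relation = set(K.values())

# The information table keeps the independent variable: pairs (u, K(u)).
table = [(u, K[u]) for u in K]

def active_domain(name):
    """Dom(-): the set of attribute values currently in use."""
    j = attributes.index(name)
    return {tup[j] for tup in K.values()}

print(active_domain("Z"))  # {0, 1}
```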

2. Are the Given Attributes Adequate?

A tuple (of degree n) in a numerical relation (of degree n) can be regarded as the coordinates of a point in n-dimensional Euclidean space. A coordinate system {X_1, X_2, ..., X_n} of a Euclidean space plays the role of a set {A_1, A_2, ..., A_n} of attribute names, and a point p_x with coordinates x = (x_1, x_2, ..., x_n) is a tuple. The set P = {p_x} of points is a coordinate-free concept; it is a set of points, of real-world objects. The set R of numerical tuples, on the other hand, is a coordinate-dependent concept; it is merely a representation, a matrix. A pattern (a statement about a property of R) may or may not be a real-world event (a statement about the geometry). A pattern with real-world significance is, then, a geometric property of P.

2.1. Invisible Association Rules?

Let us revisit the example [p]. The first column lists the entities (directed segments); attribute B is the starting point of a directed segment, and the other attributes are the degree D and the length L (polar coordinates). If the required support is 4, this table has only one association rule of length 2, namely (p3, 2.0).

   Segment#   B     D     L
   S1         p1      0   6.0
   S2         p3     90   2.0
   S3         p3    120   2.0
   S4         p3    135   2.0
   S5         p3    150   2.0
   S6         p3    180   2.0
   S7         p3    210   2.0
   S8         p3    225   2.0
   S9         p3    240   2.0
   S10        p3    270   2.0

   Table 1. 10 directed segments in polar coordinates

   Segment#   B     H     V
   S1         p1     6     0
   S2         p3     0     2
   S3         p3    -1     3
   S4         p3    -2     2
   S5         p3    -3     1
   S6         p3    -2     0
   S7         p3    -3    -1
   S8         p3    -2    -2
   S9         p3    -3    -1
   S10        p3     0    -2

   Table 2. The same 10 directed segments in (X,Y)-coordinates

Table 2 is obtained from Table 1 by switching to the Cartesian coordinate system; the two new attributes H and V are the horizontal and vertical lengths. The only association rule disappears. In this geometric database the association rule is a real-world phenomenon (a geometric fact), so the same information should still be carried by Table 2. The question is then: how can such invisible association rules be discovered from Table 2?
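The disappearance can be checked mechanically. Here is a small sketch using the exact rows of Tables 1 and 2; frequent_pairs is our own naive helper, not a library function:

```python
from collections import Counter

# Rows of Table 1 (polar) and Table 2 (Cartesian), Segment# column dropped.
polar = [
    ("p1", 0, 6.0), ("p3", 90, 2.0), ("p3", 120, 2.0), ("p3", 135, 2.0),
    ("p3", 150, 2.0), ("p3", 180, 2.0), ("p3", 210, 2.0), ("p3", 225, 2.0),
    ("p3", 240, 2.0), ("p3", 270, 2.0),
]
cartesian = [
    ("p1", 6, 0), ("p3", 0, 2), ("p3", -1, 3), ("p3", -2, 2), ("p3", -3, 1),
    ("p3", -2, 0), ("p3", -3, -1), ("p3", -2, -2), ("p3", -3, -1), ("p3", 0, -2),
]

def frequent_pairs(rows, names, min_support):
    """Naively count every length-2 pattern (pair of attribute-value items)."""
    counts = Counter()
    for row in rows:
        items = list(zip(names, row))
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                counts[(items[i], items[j])] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(polar, ("B", "D", "L"), 4))
# {(('B', 'p3'), ('L', 2.0)): 9} -- the rule (p3, 2.0)
print(frequent_pairs(cartesian, ("B", "H", "V"), 4))
# {} -- the same geometric fact is invisible in (X,Y)-coordinates
```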

3. Mining on Derived Attributes?

To understand what is really reflected by the attributes, let us examine a simple example. Table 3 is a numerical information table of 9 points in Euclidean 3-space. The first column gives the names of the points; the remaining columns give their X-, Y-, and Z-coordinates.

   point    X    Y   Z
   P1     200    0   0
   P2     100    0   0
   P3     173  100   1
   P4      87   50   1
   P5      44   25   1
   P6     141  141   2
   P7      71   71   2
   P8      36   36   2
   P9       0  200   3

   Table 3. 9 points with coordinates in Euclidean 3-space

Next, let us consider transformations that rotate the X- and Y-axes around the Z-axis. Table 4 expands Table 3 by including the new X- and Y-coordinates under various rotations. The suffixes 20, 30, 52, and 79 identify the new coordinate axes (attribute names) obtained by rotating 0.20, 0.30, 0.52, and 0.79 radians. Each new axis is a new derived attribute.

   point  Z    X    Y   X20  Y20  X30  Y30  X52  Y52  X79  Y79
   P1     0  200    0   196  -40  191  -59  174  -99  141 -142
   P2     0  100    0    98  -20   96  -30   87  -50   70  -71
   P3     1  173  100   189   64  195   44  200    0  193  -53
   P4     1   87   50    95   32   98   22  100    0   97  -27
   P5     1   44   25    48   16   49   11   51    0   49  -14
   P6     2  141  141   166  110  176   93  192   52  199    0
   P7     2   71   71    84   55   89   47   97   26  100    0
   P8     2   36   36    42   28   45   24   49   13   51    0
   P9     3    0  200    40  196   59  191   99  174  142  141

   Table 4. The 9 points with coordinates on various axes, rotated by 0.2, 0.3, 0.52 (= 30 degrees), and 0.79 (= 45 degrees) radians

The table seems to indicate that there are at least as many new attributes as there are angles. Even worse, there are many non-linear transformations not listed in Table 4. Several questions arise:

1. Are all these attributes distinct (from the data mining point of view)?
2. Could we find ALL derived attributes?
3. Should data mining consider all of them?

We answer the first and second questions in the next few sections. As for the third: from the scientific point of view we should consider all of them, but from the practical point of view we may only want to select some. This indicates that the common practice of feature selection should be extended to this full set, not applied only to the given set of attributes.

4. Is Labeling Concepts/Data Necessary?

4.1. Isomorphic Attributes and Tables

Definition 1. Attributes A_i and A_j are isomorphic iff there is a one-to-one and onto map s: Dom(A_i) -> Dom(A_j) such that A_j(u) = s(A_i(u)) for all u in U; s is called an isomorphism [p].

Definition 2. Let K and H be two information tables with the same universe U, and let A = {A_1, A_2, ..., A_n} and B = {B_1, B_2, ..., B_m} be the attributes of K and H respectively. Then K and H are said to be isomorphic if every A_i is isomorphic to some B_j, and vice versa. If n = m, they are said to be strictly isomorphic.

Intuitively, isomorphism is re-labeling, so the following theorem should be obvious.

Theorem 1. Let K and H be strictly isomorphic. Then the patterns of K, such as association rules, can be transformed into those of H by re-labeling the data.

Corollary. Labeling the attribute values (base concepts) does not influence the results of data mining.

Applying the notion of isomorphism to Table 4, we find that

   X ≅ X20 ≅ X30 ≅ X52 ≅ X79 ≅ Y20 ≅ Y30,

where ≅ means "isomorphic". By dropping the isomorphic attributes, Table 4 is reduced to Table 5.

   point  Z    X    Y   Y52   Y79
   P1     0  200    0   -99  -142
   P2     0  100    0   -50   -71
   P3     1  173  100     0   -53
   P4     1   87   50     0   -27
   P5     1   44   25     0   -14
   P6     2  141  141    52     0
   P7     2   71   71    26     0
   P8     2   36   36    13     0
   P9     3    0  200   174   141

   Table 5. Simplified table of the 9 points

The association rules (with support 3) are (Z=1, Y52=0) and (Z=2, Y79=0).
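Definition 1 can be tested mechanically: two attributes are isomorphic exactly when the row-wise correspondence of their values is a well-defined bijection. A minimal sketch (the helper is our own), applied to columns of Table 4:

```python
def isomorphic(col_a, col_b):
    """Definition 1 as a test: is there a bijection s with col_b[i] = s(col_a[i])?"""
    forward, backward = {}, {}
    for a, b in zip(col_a, col_b):
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False  # the row-wise correspondence is not a bijection
    return True

# Columns of Table 4 (rows P1..P9).
X   = [200, 100, 173, 87, 44, 141, 71, 36, 0]
X20 = [196, 98, 189, 95, 48, 166, 84, 42, 40]
Y   = [0, 0, 100, 50, 25, 141, 71, 36, 200]
Y52 = [-99, -50, 0, 0, 0, 52, 26, 13, 174]

print(isomorphic(X, X20))  # True  -- X20 is merely a re-labeling of X
print(isomorphic(Y, Y52))  # False -- they induce different partitions
```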

4.2. Isomorphic High Level Concepts

One important data mining practice is the concept hierarchy (attribute-oriented generalization, AOG). In the current state of the art, human users supply the concept hierarchies [4]; in other words, the nested sequences of equivalence relations are labeled by human users. The question is: is such labeling necessary for data mining? In other words, will we get new rules if we rename all the nodes of a hierarchy? By considering the table of high-level concepts (Table 6) and the results of Section 4.1, the answer is no: re-labeling Table 6 will not change the high-level association rules.

Theorem 2. Isomorphic AOGs have isomorphic high-level association rules.

Corollary. Given a nested sequence of partitions, the generalized (high-level) association rules are independent of the labeling.

Corollary. Labeling the high-level and base concepts does not influence the results of data mining.

   point  Z          X                  Y
   P1     Lz03       Lx00={200,100}     Ly00={0}
   P2     Lz03       Lx00={200,100}     Ly00={0}
   P3     Lz03       Lx30={173,87,44}   Ly30={100,50,25}
   P4     Lz03       Lx30={173,87,44}   Ly30={100,50,25}
   P5     Lz03       Lx30={173,87,44}   Ly30={100,50,25}
   P6     Lz45={2}   Lx45={141,71,36}   Ly45={141,71,36}
   P7     Lz45={2}   Lx45={141,71,36}   Ly45={141,71,36}
   P8     Lz45={2}   Lx45={141,71,36}   Ly45={141,71,36}
   P9     3          0                  200

   Table 6. High-level table; the granules are labeled

5. Granular Data Model

In the last section we showed that labels are not essential; for data mining it is the underlying mathematical structures that are important, and we will use them to label the data [7, 8, 9, 10]. An attribute A_j can be interpreted as the composition of K and the projection P_j: Dom(A_1) x Dom(A_2) x ... x Dom(A_n) -> Dom(A_j), giving A_j: U -> Dom(A_j). Each A_j induces a partition on U: replacing each attribute value a_j by its inverse image (A_j)^(-1)(a_j), the collection of these inverse images forms a partition of U. For example, Pi = {p_i}, P12 = {p_1, p_2}, P345 = {p_3, p_4, p_5}, and P678 = {p_6, p_7, p_8} are equivalence classes. One can label an equivalence class by itself (its canonical name), so Pi, Pij, Pijk are labels (canonical names) of the respective equivalence classes. Using these canonical names, Table 4 is transformed into Table 9. Such a table is called an unnamed relation, because no granule is assigned a human-oriented, meaningful name; it has also been called a machine-oriented model [10]. Each attribute value of Table 4 is a human-oriented, meaningful name of the corresponding granule in Table 9, and this naming is an isomorphism; that is, Table 4 is isomorphic to Table 9. Note that each attribute name, say A, is replaced by the induced equivalence relation E_A. We hope we have convinced the reader of the following theorem:

Theorem 3. The unnamed information table (relation) is isomorphic to the original table K; hence their respective association rules are isomorphic.

The essential ingredients of an unnamed relation are U and E = {E_1, E_2, ..., E_n}, the set of equivalence relations induced by the attributes. We call the pair (U, E = {E_1, E_2, ..., E_n}) a Granular Data Model (GDM) of K. Pawlak called (U, E) an approximation space when E has only one equivalence relation, and in general a knowledge base; since "knowledge base" often carries a different meaning, we will use "Granular Data Model". Summarizing the discussion above:

Definition. Given an (ordinary) relation K, the information table (relation) obtained by using the induced equivalence relations as the names of attributes and the equivalence classes as attribute values is called an unnamed table (relation) of K; the pair (U, E) is called the granular data model of K.

   point  E_Z   E_X  E_Y   E_X20  E_Y20  E_X30  E_Y30  E_X52  E_Y52  E_X79  E_Y79
   P1     P12   P1   P12   P1     P1     P1     P1     P1     P1     P1     P1
   P2     P12   P2   P12   P2     P2     P2     P2     P2     P2     P2     P2
   P3     P345  P3   P3    P3     P3     P3     P3     P3     P345   P3     P3
   P4     P345  P4   P4    P4     P4     P4     P4     P4     P345   P4     P4
   P5     P345  P5   P5    P5     P5     P5     P5     P5     P345   P5     P5
   P6     P678  P6   P6    P6     P6     P6     P6     P6     P6     P6     P678
   P7     P678  P7   P7    P7     P7     P7     P7     P7     P7     P7     P678
   P8     P678  P8   P8    P8     P8     P8     P8     P8     P8     P8     P678
   P9     P9    P9   P9    P9     P9     P9     P9     P9     P9     P9     P9

   Table 9. Unnamed relation (attribute values are expressed by granules)
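This construction is easy to mechanize. A minimal sketch (the helper names are our own) that computes the induced partition of a column and re-labels each value by its granule, reproducing two columns of Table 9:

```python
def induced_granules(column):
    """Replace each attribute value by its inverse image: the granule of
    row indices carrying that value (the induced partition of U)."""
    inverse = {}
    for i, value in enumerate(column):
        inverse.setdefault(value, set()).add(i)
    return [frozenset(inverse[value]) for value in column]

def canonical_name(granule):
    """Canonical names in the paper's style: {P3, P4, P5} -> 'P345'."""
    return "P" + "".join(str(i + 1) for i in sorted(granule))

# Two columns of Table 4 (rows P1..P9).
Z   = [0, 0, 1, 1, 1, 2, 2, 2, 3]
Y52 = [-99, -50, 0, 0, 0, 52, 26, 13, 174]

for name, column in (("E_Z", Z), ("E_Y52", Y52)):
    print(name, [canonical_name(g) for g in induced_granules(column)])
# E_Z   ['P12', 'P12', 'P345', 'P345', 'P345', 'P678', 'P678', 'P678', 'P9']
# E_Y52 ['P1', 'P2', 'P345', 'P345', 'P345', 'P6', 'P7', 'P8', 'P9']
```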

The use of unnamed relations has some advantages over traditional relations in machine processing (including data mining). For example, for Table 4 we had to show with some effort that many attributes are isomorphic, whereas in Table 9 isomorphic attributes are simply identical columns. The simplification of Table 9 into Table 10 is therefore effortless.

   point  E_Z   E_X  E_Y   E_Y52  E_Y79
   P1     P12   P1   P12   P1     P1
   P2     P12   P2   P12   P2     P2
   P3     P345  P3   P3    P345   P3
   P4     P345  P4   P4    P345   P4
   P5     P345  P5   P5    P345   P5
   P6     P678  P6   P6    P6     P678
   P7     P678  P7   P7    P7     P678
   P8     P678  P8   P8    P8     P678
   P9     P9    P9   P9    P9     P9

   Table 10. Simplified unnamed relation

6. The Feature Completion: the Lattice of Partitions

In this section we show how one can use the granular data model/unnamed relation to find all derived attributes.

6.1. Attribute Transformations

By a new attribute Y we mean that there is a set, denoted Dom(Y), and an onto map Y: U -> Dom(Y). Let B = {B_1, B_2, ..., B_k} be a subset of A, that is, each B_i is some A_ji. Let us assume there is a map

   f: Dom(B_1) x Dom(B_2) x ... x Dom(B_k) -> Dom(Y)

such that f(B_1(u), B_2(u), ..., B_k(u)) = Y(u) for all u. Then we say Y is an attribute transformed from B = {B_1, B_2, ..., B_k}, written Y = f(B_1, B_2, ..., B_k). The details of f are illustrated in Table 7.

   U     B_1    B_2    ...  B_k    Y
   u_1   b_1^1  b_2^1  ...  b_k^1  f_1 = f(b_1^1, b_2^1, ..., b_k^1)
   u_2   b_1^2  b_2^2  ...  b_k^2  f_2 = f(b_1^2, b_2^2, ..., b_k^2)
   u_3   b_1^3  b_2^3  ...  b_k^3  f_3 = f(b_1^3, b_2^3, ..., b_k^3)
   ...   ...    ...    ...  ...    ...
   u_i   b_1^i  b_2^i  ...  b_k^i  f_i = f(b_1^i, b_2^i, ..., b_k^i)
   ...   ...    ...    ...  ...    ...

   Table 7. Attribute (feature) transformations

In database theory we say Y is (extensionally) functionally dependent on B. The equivalence relation induced by Y is denoted by E_Y.

Theorem 4. Y is an attribute transformed from B = {B_1, B_2, ..., B_k} iff E_Y is a coarsening of the intersection E_B1 ∩ E_B2 ∩ ... ∩ E_Bk, where each E_Bi is the equivalence relation induced by B_i.

6.2. Lattices of Partitions Induced by Attributes

Let E be the set of equivalence relations induced by the attributes of K (see the beginning of Section 5). The Boolean algebra 2^E is a lattice. Let Π(U) be the lattice of all partitions (equivalence relations) on U; its join is the intersection of equivalence relations, and its meet is the union of equivalence relations, where the union is the smallest coarsening of its components. T. T. Lee used the map θ: 2^A -> Π(U) to summarize the fact that each attribute induces an equivalence relation on the universe U. He observed that the map respects only the meet, not the join [5]. The image of θ was called the relation lattice; he observed that the image is a subset of Π(U) but not a sublattice. This is our point of departure. Previously, we let L(E) be the smallest sublattice of Π(U) that contains E and called it the generalized relation lattice [8, 9]. We have also called the smallest sublattice that contains E and all its coarsenings the lattice in partitions, denoted L(E*) [6].

Theorem 5. L(E*) is a finite lattice and is the feature completion of E.

This theorem says that the number of derived attributes is finite. For data mining, we may consider the following granular data models (GDMs) for a given K:

1. (U, E), the GDM of K;
2. (U, L(E)), a new GDM associated with K;
3. (U, L(E*)), a new GDM that contains the feature completion;
4. (U, G), a new GDM that contains the essential attributes, where G consists of all the join-irreducible equivalence relations in L(E*). All possible generalized association rules can be found here.
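A sketch of the two basic operations these lattices rest on, with a partition represented as a set of frozenset blocks (the helper names are our own): the intersection of equivalence relations, which is the join above, and the coarsening test of Theorem 4:

```python
from itertools import product

def refine(p, q):
    """Intersection of two equivalence relations (the join above): its
    blocks are the non-empty pairwise intersections of their blocks."""
    return {b & c for b, c in product(p, q) if b & c}

def is_coarsening(coarse, fine):
    """Theorem 4 test: every block of `fine` lies inside one block of `coarse`."""
    return all(any(b <= c for c in coarse) for b in fine)

# Partitions of U = {P1..P9} induced by Z, X, and Y52 (read off Table 9).
E_Z   = {frozenset({"P1", "P2"}), frozenset({"P3", "P4", "P5"}),
         frozenset({"P6", "P7", "P8"}), frozenset({"P9"})}
E_X   = {frozenset({p}) for p in
         ("P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9")}
E_Y52 = {frozenset({"P1"}), frozenset({"P2"}), frozenset({"P3", "P4", "P5"}),
         frozenset({"P6"}), frozenset({"P7"}), frozenset({"P8"}), frozenset({"P9"})}

print(is_coarsening(E_Y52, E_X))    # True:  E_X is the finest, so Y52 = f(X)
print(is_coarsening(E_Y52, E_Z))    # False: Y52 splits the Z-block {P1, P2}
print(refine(E_Z, E_Y52) == E_Y52)  # True:  the join of E_Z and E_Y52 is E_Y52
```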

Let us consider a table that collapses the redundant tuples of Table 6.

   point       Z     X     Y
   P12 = u_1   Lz03  Lx00  Ly00
   P345 = u_2  Lz03  Lx30  Ly30
   P678 = u_3  Lz45  Lx45  Ly45
   P9 = u_4    3     0     200

   Table 11. An information table K (reduced Table 6)

It is clear that X ≅ Y, so we drop Y and obtain an unnamed relation.

   point       E_Z        E_X
   P12 = u_1   {u_1,u_2}  {u_1}
   P345 = u_2  {u_1,u_2}  {u_2}
   P678 = u_3  {u_3}      {u_3}
   P9 = u_4    {u_4}      {u_4}

   Table 12. Unnamed relation of K

Since E_Z ∩ E_X has 4 equivalence classes, and by Theorem 4 the derived attributes correspond to the coarsenings of E_Z ∩ E_X, the number of all possible derived attributes is the number of partitions of a 4-element set, namely the Bell number B_4 = 15; see [2]. We will not display them.

References

[1] G. Birkhoff and S. MacLane, A Survey of Modern Algebra, Macmillan, 1977.
[2] R. Brualdi, Introductory Combinatorics, Prentice Hall, 1992.
[3] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From Data Mining to Knowledge Discovery: An Overview." In: Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[4] Hadjimichael, H. Hamilton, and N. Cercone, "Extracting Concept Hierarchies from Relational Databases." In: Proceedings, FLAIRS-95, Florida AI Research Symposium, Melbourne Beach, Florida, 1995.
[5] T. T. Lee, "Algebraic Theory of Relational Databases," The Bell System Technical Journal, Vol. 62, No. 10, December 1983, pp. 3159-3204.
[6] T. Y. Lin, "Data Modeling for Data Mining." In: Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV, B. Dasarathy (ed.), Proceedings of SPIE Vol. 4730, Orlando, FL, April 1-5, 2002.
[7] T. Y. Lin, "Data Mining: Granular Computing Approach." In: Methodologies for Knowledge Discovery and Data Mining, Lecture Notes in Artificial Intelligence 1574, Third Pacific-Asia Conference, Beijing, April 26-28, 1999, pp. 24-33.
[8] T. Y. Lin, "Feature Transformations and Structure of Attributes." In: Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV, Proceedings of SPIE AeroSense 2002, Orlando, FL, April 1-5, 2002.
[9] T. Y. Lin, "The Lattice Structure of Database and Mining High Level Rules." In: Proceedings of the Workshop on Data Mining and E-Organizations, International Conference on Computer Software and Applications, Chicago, October 8-12, 2001. Also in the Bulletin of the International Rough Set Society.
[10] T. Y. Lin, "Data Mining and Machine Oriented Modeling: A Granular Computing Approach," Journal of Applied Intelligence, Kluwer, Vol. 13, No. 2, September/October 2000, pp. 113-124.