Outlier Detection Using Rough Set Theory

Feng Jiang 1,2, Yuefei Sui 1, and Cungen Cao 1

1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, P.R. China
2 Graduate School of Chinese Academy of Sciences, Beijing 100039, P.R. China
jiangkong@163.net, {yfsui, cgcao}@ict.ac.cn

Abstract. In this paper, we propose exploiting the framework of rough sets to detect outliers: individuals who behave in an unexpected way or feature abnormal properties. The ability to locate outliers can help to maintain knowledge base integrity and to single out irregular individuals. First, we formally define the notions of exceptional set and minimal exceptional set. We then analyze some special cases of exceptional sets and minimal exceptional sets. Finally, we introduce a new definition of outliers as well as the definition of exceptional degree. By calculating the exceptional degree of each object in the minimal exceptional sets, we can find all outliers in a given dataset.

1 Introduction

Rough set theory, introduced by Z. Pawlak [1,2,3], is an extension of set theory for the study of intelligent systems characterized by insufficient and incomplete information. It is motivated by practical needs in classification and concept formation. The rough set philosophy is based on the assumption that with every object of the universe there is associated a certain amount of information (data, knowledge), expressed by means of some attributes used for object description. Objects having the same description are indiscernible (similar) with respect to the available information. In recent years, there has been fast-growing interest in this theory. The successful applications of the rough set model to a variety of problems have amply demonstrated its usefulness and versatility. In this paper, we suggest a somewhat different usage of rough sets. The basic idea is as follows.
For any subset X of the universe and any equivalence relation on the universe, the difference between the upper and lower approximations constitutes the boundary region of the rough set, whose elements cannot be characterized with certainty as belonging or not belonging to X using the available information (the equivalence relation). The information about objects from the boundary region is, therefore, inconsistent or ambiguous. Given a set of equivalence relations (the available information), if an object in X lies in the boundary region with respect to every equivalence relation, then we may consider this object as not behaving normally according to the knowledge (set of equivalence relations) at hand. We call such objects outliers. An outlier in X is an element that can never be characterized with certainty as belonging or not belonging to X using the given knowledge.

This work is supported by the National NSF of China (60273019 and 60073017), the National 973 Project of China (G1999032701), Ministry of Science and Technology (2001CCA03000) and the National Laboratory of Software Development Environment.
D. Ślęzak et al. (Eds.): RSFDGrC 2005, LNAI 3642, pp. 79-87, 2005. © Springer-Verlag Berlin Heidelberg 2005

Recently, the detection of outliers (exceptions) has gained considerable interest in KDD. Outliers exist extensively in the real world, and they are generated from different sources, such as a heavy-tailed distribution or errors in data entry. While there is no single, generally accepted, formal definition of an outlier, Hawkins' definition captures the spirit: an outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism [4]. Finding outliers is important in different applications, such as credit fraud detection and network intrusion detection. Outlier detection has a long history in statistics [4, 5], but has largely focused on univariate data with a known distribution. These two limitations have restricted the ability to apply these types of methods to large real-world databases, which typically have many different fields and offer no easy way of characterizing the multivariate distribution of examples. Other researchers, beginning with the work by Knorr and Ng [6,7,8], have taken a non-parametric approach and proposed using an example's distance to its nearest neighbors as a measure of unusualness [9,10,11]. Eskin et al. [12], and Lane and Brodley [13] applied distance-based outliers to detecting computer intrusions from audit data.
Although distance is an effective non-parametric approach to detecting outliers, its drawback is the amount of computation time required. Straightforward algorithms, such as those based on nested loops, typically require $O(N^2)$ distance computations. This quadratic scaling means that it becomes very difficult to mine outliers as we tackle increasingly large datasets.

In this paper, we formally state the ideas briefly sketched above within the context of Pawlak's rough set theory. Our goal is to develop a new approach to outlier definition and outlier detection. The remainder of this paper is organized as follows. In the next section, we present some preliminaries of rough set theory that are relevant to this paper. In Section 3, we give formal definitions of exceptional set and minimal exceptional set, and discuss their basic properties. In Section 4, we analyze some special cases of exceptional sets and minimal exceptional sets. Section 5 introduces a new definition of outliers along with the definitions of exceptional degree (degree of outlier-ness). Conclusions are given in Section 6.

2 Preliminaries

Let $U$ denote a finite and nonempty set called the universe, and let $\theta \subseteq U \times U$ denote an equivalence relation on $U$. The pair $apr = (U, \theta)$ is called an approximation space. The equivalence relation $\theta$ partitions the set $U$ into disjoint subsets. Such a partition of the universe is denoted by $U/\theta$. If two elements $x, y$ in $U$ belong to the same equivalence class, we say that $x$ and $y$ are indistinguishable. The equivalence classes of $\theta$ and the empty set $\emptyset$ are called the elementary or atomic sets in the approximation space.

Given an arbitrary set $X \subseteq U$, it may be impossible to describe $X$ precisely using the equivalence classes of $\theta$. In this case, one may characterize $X$ by a pair of lower and upper approximations:

$\underline{X}_\theta = \bigcup \{[x]_\theta : [x]_\theta \subseteq X\}$,
$\overline{X}_\theta = \bigcup \{[x]_\theta : [x]_\theta \cap X \neq \emptyset\}$,

where $[x]_\theta = \{y \mid x \theta y\}$ is the equivalence class containing $x$. The pair $(\underline{X}_\theta, \overline{X}_\theta)$ is called the rough set with respect to $X$. The lower approximation $\underline{X}_\theta$ is the union of all the elementary sets which are subsets of $X$, and the upper approximation $\overline{X}_\theta$ is the union of all the elementary sets which have a nonempty intersection with $X$. An element in the lower approximation necessarily belongs to $X$, while an element in the upper approximation possibly belongs to $X$.

3 Exceptional Set and Minimal Exceptional Set

In contrast to current methods for outlier detection, we take a two-step strategy. First, we find all exceptional sets and minimal exceptional sets in a given dataset $X$. Second, we detect all outliers in $X$ from the minimal exceptional sets of $X$. Here we assume that all outliers in $X$ must belong to some minimal exceptional set of $X$. That is, if an object in $X$ does not belong to any minimal exceptional set of $X$, then we can conclude that it is not an outlier of $X$. What remains is to judge whether an object from a minimal exceptional set is an outlier of $X$.

In the rest of this paper, given a finite and nonempty universe $U$, we do not consider only one equivalence relation on $U$ at a time, but a number of equivalence relations on $U$ simultaneously, denoted by the set $R = \{r_1, r_2, \ldots, r_m\}$. First, we give the definition of exceptional set.

Definition 1 [Exceptional Set]. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$.
Let $e \subseteq X$ be a subset of $X$. If for every equivalence relation $r_i \in R$, $e \cap B_i^X \neq \emptyset$, $i = 1, 2, \ldots, m$, then $e$ is called an exceptional set of $X$ with respect to $R$, where $B_i^X = BN_i(X) \cap X = X - \underline{X}_i$. $\underline{X}_i$ and $\overline{X}_i$ are respectively the lower approximation and the upper approximation of $X$ with respect to $r_i$. $BN_i(X) = \overline{X}_i - \underline{X}_i$ is called the boundary of $X$ with respect to $r_i$. We call $B_i^X$ the inner boundary of $X$ with respect to $r_i$. When $X$ is clear from the context, we simply use $B_i$ to denote $B_i^X$.

If an exceptional set $e \subseteq \bigcup_{i=1}^{m} B_i^X$, then $e$ is called a type 1 exceptional set; otherwise $e$ is called a type 2 exceptional set.

In order to define the concept of minimal exceptional set, we give the following two definitions first.
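Before moving on, the approximations of Section 2 and the exceptional-set test of Definition 1 can be illustrated with a minimal Python sketch. It assumes that each equivalence relation $r_i$ is supplied as an explicit partition of $U$ (a list of disjoint sets); the function names and the toy data are illustrative and do not come from the paper.

```python
from typing import Hashable, List, Set

# Assumption: one equivalence relation is given as its list of equivalence classes.
Partition = List[Set[Hashable]]


def lower_approx(X: Set[Hashable], partition: Partition) -> Set[Hashable]:
    """Union of the equivalence classes that are subsets of X."""
    return set().union(*(c for c in partition if c <= X))


def upper_approx(X: Set[Hashable], partition: Partition) -> Set[Hashable]:
    """Union of the equivalence classes that intersect X."""
    return set().union(*(c for c in partition if c & X))


def inner_boundary(X: Set[Hashable], partition: Partition) -> Set[Hashable]:
    """Inner boundary B_i^X = BN_i(X) ∩ X = X minus the lower approximation."""
    return X - lower_approx(X, partition)


def is_exceptional(e: Set[Hashable], X: Set[Hashable],
                   partitions: List[Partition]) -> bool:
    """Definition 1: e ⊆ X meets the inner boundary of X under every relation."""
    return e <= X and all(e & inner_boundary(X, p) for p in partitions)


# Toy data (hypothetical): U = {1, ..., 6}, X = {1, 2, 3, 4}, two relations.
X = {1, 2, 3, 4}
r1 = [{1, 5}, {2, 3}, {4}, {6}]
r2 = [{1, 2}, {3, 6}, {4, 5}]

print(inner_boundary(X, r1))              # inner boundary under r1
print(inner_boundary(X, r2))              # inner boundary under r2
print(is_exceptional({1, 3}, X, [r1, r2]))
print(is_exceptional({1}, X, [r1, r2]))
```

In this toy case the inner boundaries are {1} and {3, 4}, so {1, 3} is an exceptional set (of type 1), whereas {1} alone is not, because it misses the second inner boundary.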

Definition 2. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. Let $e \subseteq X$ be an exceptional set of $X$ with respect to $R$. For any $x \in e$, if $e - \{x\}$ is also an exceptional set of $X$ with respect to $R$, then the element $x$ is called dispensable in the set $e$ with respect to $R$; otherwise $x$ is indispensable.

Definition 3. Let $e \subseteq X$ be an exceptional set of $X$ with respect to $R$. If all the elements of $e$ are indispensable in $e$ with respect to $R$, then the exceptional set $e$ is called independent with respect to $R$; otherwise $e$ is dependent.

Now we can define a minimal exceptional set as an exceptional set which is independent with respect to the corresponding set $R$ of equivalence relations.

Definition 4 [Minimal Exceptional Set]. Let $e \subseteq X$ be an exceptional set of $X$ with respect to $R$. If $f = e - e'$ ($e' \subseteq e$) is an independent exceptional set of $X$ with respect to $R$, then $f$ is called a minimal exceptional set of $X$ with respect to $R$ in $e$. We use $Min(e)$ to denote the set of all minimal exceptional sets of $X$ with respect to $R$ in $e$.

It is not difficult to prove that exceptional sets and minimal exceptional sets have the following basic properties.

Proposition 1. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. If $e \subseteq X$ is an exceptional set of $X$ with respect to $R$, then there exists at least one minimal exceptional set $f$ in $e$.

Proposition 2. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. If $e \subseteq X$ is an exceptional set of $X$ with respect to $R$ and $f$ is a minimal exceptional set in $e$, then
(i) $f \subseteq e$;
(ii) $f \subseteq \bigcup_{i=1}^{m} B_i$, where $B_i$ denotes the inner boundary of $X$ with respect to equivalence relation $r_i$, $i = 1, 2, \ldots, m$.

Proposition 3. Let $E$ be the set of all exceptional sets of $X$ with respect to $R$ and $F$ be the set of all minimal exceptional sets of $X$ with respect to $R$, denoted by $F = \bigcup_{e \in E} Min(e)$, where $Min(e)$ is the set of all minimal exceptional sets in $e$.
Then
(i) $F \subseteq E$;
(ii) For any $e, e' \subseteq X$, if $e \in E$ and $e \subseteq e'$, then $e' \in E$;
(iii) For any $e, e' \subseteq X$, if $e \notin E$ and $e' \subseteq e$, then $e' \notin E$;
(iv) For any $e \in E$, $e \neq \emptyset$, that is, an exceptional set cannot be empty;
(v) For any $e, e' \in E$, if $e \subseteq e'$, then all minimal exceptional sets in $e$ are also minimal exceptional sets in $e'$.

Proposition 4. If $E$ is the set of all exceptional sets of $X$ with respect to $R$ and $F$ is the set of all minimal exceptional sets of $X$ with respect to $R$, then
(i) For any $e_1, e_2 \in E$, $e_1 \cup e_2 \in E$;
(ii) For any $e_1, e_2 \in F$, if $e_1 \neq e_2$, then $e_1 \cap e_2 \notin E$.

Proof. (i) Given any $e_1, e_2 \in E$, for every $1 \le i \le m$, $e_1 \cap B_i \neq \emptyset$ and $e_2 \cap B_i \neq \emptyset$, where $B_i$ denotes the inner boundary of $X$ with respect to equivalence relation $r_i$. Therefore for every $1 \le i \le m$, $(e_1 \cup e_2) \cap B_i = (e_1 \cap B_i) \cup (e_2 \cap B_i) \neq \emptyset$. So $e_1 \cup e_2$ is an exceptional set of $X$ with respect to $R$ by Definition 1, that is, $e_1 \cup e_2 \in E$.
(ii) (Proof by contradiction) Assume that $e_1 \cap e_2 \in E$. Since $e_1 \neq e_2$, $e_1 \cap e_2 \subset e_1$ and $e_1 \cap e_2 \subset e_2$. Therefore $e_1 - (e_1 \cap e_2) \neq \emptyset$, that is, there exists an $x \in (e_1 - (e_1 \cap e_2))$. Furthermore, $e_1 \in F \subseteq E$ and $e_1 \cap e_2 \in E$. Since $e_1 \cap e_2 \subseteq e_1 - \{x\}$, Proposition 3 (ii) gives $e_1 - \{x\} \in E$. So $x$ is dispensable in the set $e_1$ with respect to $R$, and $e_1$ is dependent with respect to $R$, that is, $e_1$ is not a minimal exceptional set of $X$ with respect to $R$. This contradicts the condition $e_1 \in F$. So if $e_1 \neq e_2$, then $e_1 \cap e_2 \notin E$.

Proposition 5. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. Let $F$ be the set of all minimal exceptional sets of $X$ with respect to $R$ and $B = \{B_1, B_2, \ldots, B_m\}$ be the set of all inner boundaries of $X$ with respect to each equivalence relation in $R$. The union of all minimal exceptional sets in $F$ equals the union of all inner boundaries in $B$, that is, $\bigcup_{f \in F} f = \bigcup_{i=1}^{m} B_i$.

4 Some Special Cases

As seen above, a given $X$ and $R$ usually yield a number of exceptional sets and minimal exceptional sets. In order to detect all outliers in $X$ from these minimal exceptional sets, it is necessary to investigate some special cases first. We begin by defining the concept of boundary degree.

Definition 5 [Boundary Degree]. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. Let $B = \{B_1, B_2, \ldots, B_m\}$ be the set of all inner boundaries of $X$ with respect to each equivalence relation in $R$. For every object $x \in X$, the number of different inner boundaries which contain $x$ is called the boundary degree of $x$, denoted by $\mathit{Degree\_B}(x)$.
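As a small illustration of Definition 5, the boundary degree can be computed by counting the inner boundaries that contain an object. The sketch below assumes the inner boundaries are already available as a list of Python sets indexed by relation (so two relations with equal inner boundaries are still counted separately); the name `boundary_degree` and the sample data are hypothetical.

```python
from typing import Hashable, List, Set


def boundary_degree(x: Hashable, boundaries: List[Set[Hashable]]) -> int:
    """Definition 5: number of inner boundaries B_i that contain x."""
    return sum(1 for b in boundaries if x in b)


# Hypothetical inner boundaries B_1, ..., B_4 of some set X.
B = [{1, 2}, {2, 3}, {2, 4}, {3, 5}]
degrees = {x: boundary_degree(x, B) for x in sorted(set().union(*B))}
print(degrees)  # {1: 1, 2: 3, 3: 2, 4: 1, 5: 1}
```

An object of $X$ that lies in no inner boundary simply receives degree 0.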
Then, we can consider a special kind of minimal exceptional set which contains the fewest elements among all minimal exceptional sets. We define it as the shortest minimal exceptional set.

Definition 6 [The Shortest Minimal Exceptional Set]. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. Let $F$ be the set of all minimal exceptional sets of $X$ with respect to $R$. If there exists a minimal exceptional set $f \in F$ such that for any $f' \in F$, $|f| \le |f'|$, where $|p|$ denotes the cardinal number of $p$, then $f$ is called the shortest minimal exceptional set of $X$ with respect to $R$.
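For small inputs, Definitions 2 to 4 and Definition 6 can be checked by exhaustive enumeration. The following sketch is not part of the paper: it assumes the inner boundaries are given as a list of sets with sortable elements, and it relies on the observation (from Proposition 3 (ii)) that an exceptional set is independent exactly when no proper subset of it is exceptional. The function names are illustrative, and the running time is exponential in the number of boundary objects.

```python
from itertools import combinations
from typing import FrozenSet, Hashable, List, Set


def hits_all(e: FrozenSet[Hashable], boundaries: List[Set[Hashable]]) -> bool:
    """True if e intersects every inner boundary, i.e. e is exceptional."""
    return all(e & b for b in boundaries)


def minimal_exceptional_sets(boundaries: List[Set[Hashable]]) -> List[FrozenSet[Hashable]]:
    """Brute-force enumeration of all independent exceptional sets."""
    candidates = sorted(set().union(*boundaries))
    found: List[FrozenSet[Hashable]] = []
    # Visit subsets by increasing size so that smaller minimal sets are found first.
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            e = frozenset(combo)
            # A superset of a minimal set contains a dispensable element.
            if any(f <= e for f in found):
                continue
            if hits_all(e, boundaries):
                found.append(e)
    return found


# Hypothetical inner boundaries B_1, ..., B_4.
B = [{1, 2}, {2, 3}, {2, 4}, {3, 5}]
F = minimal_exceptional_sets(B)
shortest_size = min(len(f) for f in F)
shortest = [f for f in F if len(f) == shortest_size]
print(F)
print(shortest)
```

On this sample the minimal exceptional sets are {2, 3}, {2, 5} and {1, 3, 4}; the first two have the smallest cardinality, which shows that a shortest minimal exceptional set need not be unique.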

Next, we give an algorithm for finding the shortest minimal exceptional set of $X$ with respect to $R$.

Algorithm 1. Find the shortest minimal exceptional set of $X$ with respect to $R$.
Input: the set of inner boundaries $B = \{B_1, B_2, \ldots, B_m\}$
Output: the shortest minimal exceptional set $f$
(1) $f = \emptyset$ // Initialize $f$ as an empty set;
(2) While ($B \neq \emptyset$) do {
(3)   For each $B_i \in B$
(4)     For each $x \in B_i$
(5)       Compute the boundary degree of $x$ // $\mathit{Degree\_B}(x)$;
(6)   Find an element $y$ which has the largest boundary degree among all $B_i \in B$ (if there is more than one such element, select one at random);
(7)   $f = f \cup \{y\}$;
(8)   Delete all the inner boundaries which contain $y$ from $B$;
(9) }
(10) Return $f$.

We can also define two special kinds of exceptional set: the greatest exceptional set and the least exceptional set.

Definition 7 [The Greatest Exceptional Set]. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. If $E$ is the set of all exceptional sets of $X$ with respect to $R$, then the union of all elements in $E$ is called the greatest exceptional set of $X$ with respect to $R$, denoted by $GES_R(X) = \bigcup_{e \in E} e$.

Proposition 6. The greatest exceptional set of $X$ with respect to $R$ is unique and equals $X$ itself, that is, $\bigcup_{e \in E} e = X$.

Definition 8 [The Least Exceptional Set]. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. Let $E$ be the set of all exceptional sets of $X$ with respect to $R$. If there exists a set $l \in E$ such that for any $e \in E$, $l \subseteq e$, then $l$ is called the least exceptional set of $X$ with respect to $R$.

Proposition 7. Let $E$ be the set of all exceptional sets of $X$ with respect to $R$ and $F$ be the set of all minimal exceptional sets of $X$ with respect to $R$. If $l$ is the least exceptional set of $X$ with respect to $R$, then
(i) $l$ is also a minimal exceptional set of $X$, that is, $l \in F$;
(ii) $l$ is not empty and unique;
(iii) $l$ equals the intersection of all the elements in $E$, denoted by $l = \bigcap_{e \in E} e$.
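Algorithm 1 above can be sketched in Python as follows. The sketch assumes the inner boundaries are passed as a list of sets with sortable elements, and it departs from step (6) in one labeled respect: ties are broken deterministically (smallest element first) rather than at random, so that the result is reproducible. The function name is hypothetical.

```python
from typing import Dict, Hashable, List, Set


def greedy_exceptional_set(boundaries: List[Set[Hashable]]) -> Set[Hashable]:
    """Greedy selection following Algorithm 1 (deterministic tie-breaking)."""
    remaining = [set(b) for b in boundaries]
    if any(not b for b in remaining):
        # With an empty inner boundary no exceptional set exists at all.
        raise ValueError("every inner boundary must be nonempty")
    f: Set[Hashable] = set()
    while remaining:  # step (2)
        # Steps (3)-(5): boundary degree of each object over the remaining boundaries.
        degree: Dict[Hashable, int] = {}
        for b in remaining:
            for x in b:
                degree[x] = degree.get(x, 0) + 1
        # Step (6): largest degree; assumption: ties go to the smallest element.
        y = max(sorted(degree), key=degree.get)
        f.add(y)  # step (7)
        # Step (8): drop every inner boundary that already contains y.
        remaining = [b for b in remaining if y not in b]
    return f


B = [{1, 2}, {2, 3}, {2, 4}, {3, 5}]
print(greedy_exceptional_set(B))  # {2, 3}
```

Because each step is a locally best choice, this is a heuristic: on some inputs the returned set may contain more elements than the smallest minimal exceptional set, so comparing it with an exhaustive enumeration on small instances is a reasonable sanity check.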
The least exceptional set is the intersection of all exceptional sets, but this intersection may be empty. So when does the least exceptional set exist? The next proposition gives an answer.

Proposition 8. Let $E$ be the set of all exceptional sets of $X$ with respect to $R$, and $F$ be the set of all minimal exceptional sets of $X$ with respect to $R$. The least exceptional set of $X$ with respect to $R$ exists if and only if there is exactly one element in $F$, and in that case the only element of $F$ is the least exceptional set.

5 Defining Outliers

Most current methods for outlier detection give a binary classification of objects (data points): an object either is or is not an outlier. In real life, it is not so simple. For many scenarios, it is more meaningful to assign to each object a degree of being an outlier. Therefore, Breunig et al. proposed a method for identifying density-based local outliers [14]. They define a local outlier factor (LOF) that indicates the degree of outlier-ness of an object using only the object's neighborhood. The outlier factor of an object $p$ captures the degree to which we call $p$ an outlier. We define two types of exceptional degree, for objects and for sets respectively.

Definition 9 [Exceptional Degree of Object]. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. Let $B = \{B_1, B_2, \ldots, B_m\}$ be the set of all inner boundaries of $X$ with respect to each equivalence relation in $R$. For any object $x \in X$, the boundary degree of $x$ (namely $\mathit{Degree\_B}(x)$) divided by the cardinal number of the set $B$ (which equals $m$) is called the exceptional degree of $x$ with respect to $R$, denoted by

$\mathit{ED\_Object}(x) = \frac{\mathit{Degree\_B}(x)}{m}$.

Obviously, $0 \le \mathit{ED\_Object}(x) \le 1$.

Once we have worked out the exceptional degree of all objects in the minimal exceptional sets of $X$, it is not difficult to find all outliers in $X$ with respect to $R$. We assume that all the objects in minimal exceptional sets whose exceptional degree is greater than a given threshold value are outliers, and that the other objects in minimal exceptional sets are not outliers.

Definition 10 [Outlier]. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. Let $F$ be the set of all minimal exceptional sets of $X$ with respect to $R$. For any object $o \in \bigcup_{f \in F} f$, if $\mathit{ED\_Object}(o) > \mu$, then the object $o$ is called an outlier in $X$ with respect to $R$, where $\mu$ is a given threshold value.

Definition 11 [Exceptional Degree of Set]. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. For any set $Y \subseteq X$, the sum of the exceptional degrees of all objects in $Y$ divided by the cardinal number of $Y$ is called the exceptional degree of the set $Y$, denoted by

$\mathit{ED\_Set}(Y) = \frac{\sum_{y \in Y} \mathit{ED\_Object}(y)}{|Y|}$.

Obviously, $0 \le \mathit{ED\_Set}(Y) \le 1$.
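The exceptional degrees and the thresholding step can be put together in a short Python sketch. It assumes the inner boundaries are given as a list of $m$ sets, that the candidate objects (the union of the minimal exceptional sets) have already been computed, and that the comparison with the threshold $\mu$ is strict, following the "greater than" wording above. All names and the sample data are hypothetical.

```python
from typing import Hashable, Iterable, List, Set


def ed_object(x: Hashable, boundaries: List[Set[Hashable]]) -> float:
    """Definition 9: boundary degree of x divided by m = len(boundaries)."""
    return sum(1 for b in boundaries if x in b) / len(boundaries)


def ed_set(Y: Set[Hashable], boundaries: List[Set[Hashable]]) -> float:
    """Definition 11: mean exceptional degree of the objects in a nonempty Y."""
    return sum(ed_object(y, boundaries) for y in Y) / len(Y)


def outliers(candidates: Iterable[Hashable], boundaries: List[Set[Hashable]],
             mu: float) -> Set[Hashable]:
    """Definition 10 (assumed strict): candidates with exceptional degree > mu."""
    return {o for o in candidates if ed_object(o, boundaries) > mu}


# Hypothetical inner boundaries; the candidates are the objects of the
# minimal exceptional sets {2, 3}, {2, 5} and {1, 3, 4} of this sample.
B = [{1, 2}, {2, 3}, {2, 4}, {3, 5}]
candidates = {1, 2, 3, 4, 5}

found = outliers(candidates, B, mu=0.4)
print(found)             # {2, 3}
print(ed_set(found, B))  # 0.625
```

Here object 2 lies in three of the four inner boundaries (degree 0.75) and object 3 in two (degree 0.5), so both exceed the threshold 0.4, while the remaining candidates have degree 0.25. The threshold itself is a free parameter that has to be chosen for the application at hand.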

Proposition 9. Given an arbitrary set $X \subseteq U$ and a set $R = \{r_1, r_2, \ldots, r_m\}$ of equivalence relations on $U$. Let $B = \{B_1, B_2, \ldots, B_m\}$ be the set of all inner boundaries of $X$ with respect to each equivalence relation in $R$. If $\bigcap_{i=1}^{m} B_i \neq \emptyset$, then $\mathit{ED\_Set}(\bigcap_{i=1}^{m} B_i) = 1$.

Proof. Since $\bigcap_{i=1}^{m} B_i \neq \emptyset$, there exists an $x \in \bigcap_{i=1}^{m} B_i$, that is, for every $B_i \in B$, $x \in B_i$, where $i = 1, 2, \ldots, m$. Therefore for any $y \in \bigcap_{i=1}^{m} B_i$, $\mathit{Degree\_B}(y) = m$ and $\mathit{ED\_Object}(y) = 1$. Let $Y = \bigcap_{i=1}^{m} B_i$; then

$\mathit{ED\_Set}(Y) = \frac{\sum_{y \in Y} \mathit{ED\_Object}(y)}{|Y|} = \frac{\sum_{y \in Y} 1}{|Y|} = 1$.

So $\mathit{ED\_Set}(\bigcap_{i=1}^{m} B_i) = 1$.

6 Conclusion

Finding outliers is an important task for many KDD applications. In this paper, we present a new method for outlier definition and outlier detection. The method exploits the framework of rough sets for detecting outliers. The main idea is that objects in the boundary region are more likely to be outliers than objects in the lower approximation. There are two directions for ongoing work. The first is to analyze the complexity of our method. The second is to compare our method with other outlier detection methods.

References

1. Pawlak, Z.: Rough sets, International Journal of Computer and Information Sciences, 11 (1982) 341-356
2. Pawlak, Z.: Rough sets: Theoretical Aspects of Reasoning about Data, (Kluwer Academic Publishers, Dordrecht, 1991)
3. Pawlak, Z., Grzymala-Busse, J.W., Slowinski, R., and Ziarko, W.: Rough sets, Comm. ACM, 38 (1995) 89-95
4. Hawkins, D.: Identification of Outliers, (Chapman and Hall, London, 1980)
5. Barnett, V., and Lewis, T.: Outliers in Statistical Data, (John Wiley & Sons, 1994)
6. Knorr, E., and Ng, R.: A Unified Notion of Outliers: Properties and Computation, Proc. of the Int. Conf. on Knowledge Discovery and Data Mining, (1997) 219-222
7. Knorr, E., and Ng, R.: Algorithms for Mining Distance-based Outliers in Large Datasets, VLDB Conference Proceedings, (1998)
8. Knorr, E., and Ng, R.: Finding intensional knowledge of distance-based outliers, In Proc. of the 25th VLDB Conf., (1999)
9. Angiulli, F., and Pizzuti, C.: Fast outlier detection in high dimensional spaces, In Proc. of the Sixth European Conf. on the Principles of Data Mining and Knowledge Discovery, (2002) 15-226
10. Ramaswamy, S., Rastogi, R., and Shim, K.: Efficient algorithms for mining outliers from large datasets, In Proc. of the ACM SIGMOD Conf., (2000)
11. Knorr, E., Ng, R., and Tucakov, V.: Distance-based outliers: algorithms and applications, VLDB Journal: Very Large Databases, 8(3-4) (2000) 237-253
12. Eskin, E., Arnold, A., Prerau, M., Portnoy, L., and Stolfo, S.: A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data, In Data Mining for Security Applications, (2002)
13. Lane, T., and Brodley, C.E.: Temporal sequence learning and data reduction for anomaly detection, ACM Transactions on Information and System Security, 2(3) (1999) 295-331
14. Breunig, M.M., Kriegel, H.P., Ng, R.T., and Sander, J.: LOF: Identifying density-based local outliers, In Proc. ACM SIGMOD Conf., (2000) 93-104