REGIONALIZATION AS SPATIAL DATA MINING PROBLEM BASED ON CLUSTERING: REVIEW


Geetinder Saini (1), Kamaljit Kaur (2)
(1) Department of Computer Science & Engineering
(2) Assistant Professor, Department of Computer Science & Engineering
Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India

ABSTRACT: Regionalization is one of the major problems that spatial data mining faces when representing economic and social geography. It can be addressed with spatial clustering algorithms that group spatial objects. The main purpose of regionalization is to find compact and dense regions that also show a homogeneous distribution of non-spatial variables. This paper surveys the clustering algorithms used to solve regionalization problems in spatial data mining and compares the performance of K-means and Ward's algorithm on the cohesion, variance, precision and recall parameters.

Keywords: Spatial data mining, Regionalization, Data clustering, K-Means, Ward's Method, Single linkage, Double linkage, Average Linkage, DBSCAN Clustering, Cohesion, Variance, Precision and Recall

[1] INTRODUCTION
Spatial data mining is the process of discovering interesting, previously unknown, but potentially useful patterns from spatial databases. Extracting such patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data because of the complexity of spatial data types, spatial relationships and spatial autocorrelation. Spatial data are data related to objects that occupy space. A spatial database stores spatial objects represented by spatial data types, together with the spatial relationships among those objects [7][12]. Spatial data mining techniques fall into several groups: clustering, outlier detection, association and co-location, classification, and trend detection. Clustering is the most common technique used in spatial data mining.
Clustering is the process of partitioning a set of data objects into subsets such that the elements in a cluster are similar to one another and different from the elements of other clusters [1]. It is a critical task in data mining: similar data are put in one group and dissimilar data in other groups. The set of clusters resulting from a cluster analysis can be referred to as a clustering. In this context, different clustering methods may generate different clusterings on the same data set. The partitioning is performed not by humans but by the clustering algorithm.

Spatial clustering is an important component of spatial data mining. It aims to group similar spatial objects into clusters so that objects within a cluster are highly similar to one another but dissimilar to objects in other clusters [13]. Spatial clustering is applicable to many problems. An important application area for spatial clustering algorithms is social and economic geography. Within this scope, regionalization, a classical methodological problem of social geography, can be considered [13]. Cluster analysis is widely used for data analysis; it organizes a set of data items into groups or clusters so that items in the same group are similar to each other and different from those in other groups. Cluster analysis has a wide range of applications in business intelligence, image pattern recognition, web search, biology, space and security [16].

[2] REGIONALIZATION
Regionalization is one of the important tasks in spatial data mining. It is the process of dividing a large area into smaller regions. More precisely, regionalization aggregates a large set of spatial objects into a smaller number of spatially contiguous regions while optimizing a homogeneity measure of the derived regions. It is a classification procedure applied to spatial objects with an area representation, which groups them into homogeneous contiguous regions. The intent of regionalization is to find spatially compact and dense regions of arbitrary shape with a homogeneous internal distribution of non-spatial variables [5]. It is helpful for many applications, e.g. direct mailing, to have special-purpose regions that depend on the kind of homogeneity one is interested in [13].
Different techniques are used for regionalization, and clustering is the one most commonly used.

[3] VARIOUS DATA CLUSTERING TECHNIQUES FOR REGIONALIZATION
In spatial data mining, many clustering methods have been developed, and they can be classified into different categories. Clustering methods can be broadly classified into two groups: partitioning clustering and hierarchical clustering. Partitioning clustering methods, such as K-means and the self-organizing map (SOM), divide a set of data items into a number of non-overlapping clusters. A data item is assigned to the closest cluster based on a proximity or dissimilarity measure. Hierarchical clustering, on the other hand, organizes data items into a hierarchy with a sequence of nested partitions or groupings. Commonly used hierarchical clustering methods include Ward's method (Ward, 1963), single-linkage clustering, average-linkage clustering, and complete-linkage clustering. Some common techniques used to solve regionalization issues are described below.

[3.1] Partitional Clustering

Partitional clustering methods determine a partition that divides a group of points into different clusters, such that the points in a cluster are more similar to each other than to points in different clusters. These methods start with some arbitrary initial clusters and iteratively reallocate points to clusters until a stopping criterion is met. They tend to find clusters with hyperspherical shapes [14]. Partitional clustering algorithms include k-means and k-medoids.

[3.1.1] K-Means Clustering
K-means is a partitioning method and one of the simplest unsupervised learning algorithms for solving the clustering problem. The K-means clustering algorithm is a simple method for estimating the means (vectors) of a set of K groups. For spatial data mining, k-means represents an attempt to find k locations such that the sum of the distances from every point to its nearest center is minimized. The K-means algorithm is:
1. Select the initial k means for the k clusters.
2. a) Calculate the dissimilarity between an object and the mean of each cluster.
   b) Allocate the object to the cluster whose mean is nearest to it.
   c) Recompute the mean of each cluster from the objects allocated to it, so that the intra-cluster dissimilarity is minimized.
3. Repeat the second step until a complete pass through all the objects results in no object moving from one cluster to another. The clusters are then stable and the clustering process ends [11].

K-Means Algorithm Properties:
- There is always at least one object in each cluster.
- The clusters are non-hierarchical and they do not overlap.
- Every member of a cluster is closer to the mean of its own cluster than to the mean of any other cluster.
- There are always K clusters.
- Results depend on the initial choice of centers.
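The three K-means steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function names (`k_means`, `squared_distance`, `mean_point`), the use of squared Euclidean distance as the dissimilarity, the random choice of initial means, and the `max_iter` safeguard are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (assignment, means) for a list of points and k clusters."""
    rng = random.Random(seed)
    # Step 1: select k initial means (here: k distinct input points, an assumption).
    means = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 2a and 2b: allocate every object to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        # Step 3: stop when a complete pass moves no object.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2c: recompute each mean from the objects allocated to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old mean if a cluster received no objects
                means[j] = mean_point(members)
    return assignment, means


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, means = k_means(points, k=2)
print(assignment, means)
```

Because the result depends on the initial means, running the sketch with different `seed` values is a simple way to observe the sensitivity to initialization noted in the properties above.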

Figure 1. K-means algorithm process [22].

[3.2] Hierarchical Clustering
A hierarchical method creates a hierarchical decomposition of a given set of data objects; it builds a hierarchy of clusters, also called a tree or dendrogram. In hierarchical clustering, we first assign each object to its own cluster, so that N objects form N clusters. We then find the clusters that behave most similarly and merge them into a single cluster, and compute the distance between the merged cluster and each of the old clusters. This procedure is repeated until all objects are grouped into K clusters [6]. There are two approaches to hierarchical clustering. The first is "bottom up", i.e. grouping small clusters into larger ones, and is called agglomerative clustering. The second is "top down", i.e. splitting larger clusters into smaller ones, and is called divisive clustering.

[3.2.1] Agglomerative (Bottom Up)
Agglomerative hierarchical clustering, or bottom-up clustering, starts with individual data objects and progressively groups them into bigger clusters until the root cluster containing all the data objects is formed. This is done with a greedy approach that, at each step, groups the clusters most similar to each other according to a user-provided cluster dissimilarity function. In this bottom-up method, clusters have sub-clusters, which in turn have sub-clusters, and so on. It starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters until all the objects are in a single cluster or a certain termination condition is satisfied. The single cluster becomes the root of the hierarchy. In the merging step, the method finds the two clusters that are closest to each other and combines them to form one cluster [1].

Ward's Method
Ward's method is an agglomerative hierarchical clustering method.
Ward's clustering method reduces the number of clusters one at a time, starting from one cluster per object and ending when one cluster comprises all the objects. At each reduction, the method merges the two clusters whose merger gives the smallest increase in the total sum of squares of the distances of each point to its cluster centroid. Thus, Ward's algorithm forms clusters by selecting the merger that minimizes the within-cluster sum of squares, or error sum of squares (ESS) [3]:

$$ESS_k = \sum_{i=1}^{n} x_{ik}^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_{ik}\right)^2 \qquad (1)$$

where:
x_{ik}: the attribute value of object i in cluster k
n: the size of cluster k
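As a small numerical check, the error sum of squares of one cluster can be computed directly. The sketch below assumes the usual single-attribute form of the ESS, namely the sum of the squared values minus the squared sum divided by the cluster size; the function names are hypothetical. It also confirms that this form equals the sum of squared deviations from the cluster mean.

```python
def error_sum_of_squares(values):
    """ESS of one cluster for a single attribute: sum(x^2) - (sum(x))^2 / n."""
    n = len(values)
    total = sum(values)
    return sum(x * x for x in values) - (total * total) / n


def squared_deviations(values):
    """Equivalent form: sum of squared deviations from the cluster mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


cluster = [1.0, 2.0, 3.0]
print(error_sum_of_squares(cluster))  # 14 - 36/3 = 2.0
print(squared_deviations(cluster))    # (1-2)^2 + 0 + (3-2)^2 = 2.0
```

A cluster containing a single object always has an ESS of zero, which is why the total error is zero before any merger takes place.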

The ESS values of all clusters are summed to give the total error:

$$E = \sum_{k=1}^{K} ESS_k \qquad (2)$$

where:
K: the number of clusters

Algorithm for Ward's clustering [3]
START
1. Start with the largest number of clusters, each cluster consisting of exactly one object. The value of E is 0.
2. Reduce the number of clusters by one by merging the two clusters that minimize the increase of the total error sum of squares.
3. If more than one cluster remains, go back to step 2.
4. Display the results in the form of an inverted tree showing, at each stage, which two clusters were merged and the corresponding total error sum of squares (E) or total number of clusters (K).

Single-Linkage Clustering
Single linkage, also called nearest neighbor or shortest distance, is a method of calculating distances between clusters. In single linkage, the distance between the two closest objects in the two clusters is computed. It is a bottom-up strategy that compares each object with every other object. Each object is first placed in a separate cluster, and at each step the closest pair of clusters is merged until some termination condition is satisfied. This requires defining a notion of cluster proximity. For single linkage, the proximity of two clusters is defined as the minimum of the distances between any two points in the two clusters [11]. The chaining phenomenon is the main drawback of this method: clusters may be forced together because single objects are close to each other, even though many of the objects in each cluster may be very distant from each other.

Complete-Linkage Clustering
Complete-linkage clustering is also known as maximum clustering. In complete-linkage clustering, the distance between two clusters is the maximum of the distances between all pairs of variable vectors drawn from the two clusters [6].
Average-Linkage Clustering
Average linkage is another method of calculating the distance between clusters. In the average-linkage algorithm, the distance between two clusters is defined as the average of the distances between every object in the first cluster and every object in the second cluster [9].
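The single, complete and average linkage definitions differ only in how the pairwise distances between two clusters are summarized. The following sketch makes that explicit and plugs the chosen linkage into a naive agglomerative loop. It assumes Euclidean distance between points given as tuples; the function names are hypothetical, and the loop is written for clarity rather than efficiency.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance, available from Python 3.8


def single_link(a, b):
    """Minimum distance between any point of a and any point of b."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Maximum distance between any point of a and any point of b."""
    return max(dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Average distance over all pairs with one point in a and one in b."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def agglomerate(points, k, linkage):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # every object starts in its own cluster
    while len(clusters) > k:
        # Find the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


a = [(0, 0), (1, 0)]
b = [(3, 0), (6, 0)]
print(single_link(a, b), complete_link(a, b), average_link(a, b))
print(agglomerate([(0, 0), (1, 0), (5, 0), (6, 0)], 2, single_link))
```

Passing `complete_link` or `average_link` instead of `single_link` changes only the merge criterion; the merging loop itself stays the same.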

[3.3] DBSCAN Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) grows clusters according to the density of neighboring objects. It is based on the concepts of density reachability and density connectivity, both of which depend on two input parameters: the size of the epsilon neighborhood (e) and the minimum number of points required in that neighborhood. These parameters control the size of the neighborhood and the size of the clusters. The algorithm starts with an arbitrary starting point that has not been visited [6]. The e-neighborhood of the point is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise the point is labeled as noise. The minimum-number-of-points parameter affects the detection of outliers. DBSCAN targets low-dimensional spatial data; DENCLUE is a related density-based algorithm [12].

[4] EXISTING METHODS USED FOR SOLVING REGIONALIZATION ISSUES
Researchers have used various clustering methods to solve the regionalization issue. Some used existing algorithms, some improved existing algorithms, some presented new algorithms that combine two algorithms, and others compared hybrid clustering algorithms for regionalization. This section reviews previous studies from the literature that presented different clustering methods for solving the regionalization issue in spatial data mining.

Xie et al.'s Scheme - [3] proposed a spatial clustering algorithm for efficient processing of objects with neighborhood relations. The cluster membership of an object is determined by its own spatial attributes as well as by the attributes of the objects in its neighborhood. Clusters with the shortest distance, based on geomorphologic discrepancy laws, are combined. The drawback of this method is that regional homogeneity is not guaranteed.

Sharma et al.'s Scheme - [12] proposed an efficient clustering technique for regionalization of a spatial database (RCSDB).
This algorithm combines the spatial density approach with a covariance-based method to inductively find spatially dense and non-spatially homogeneous clusters of arbitrary shape. RCSDB takes into account spatial point distributions as well as the distribution of several non-spatial characteristics. RCSDB classifies a database of geographical locations into homogeneous, planar and density-connected subsets called regions. It finds internally density-connected sets.

Srinivas et al.'s Scheme - [13] presented a comparative study of the regionalization techniques used in spatial data mining. They divided regionalization techniques into four groups: conventional clustering methods, the maximization of regional compactness approach, the explicit spatial contiguity constraint approach, and the density-based approach.

Lokesh Kumar et al.'s Scheme - [4] proposed an algorithm to solve regionalization, a prominent problem from social geography, by combining the spatial density clustering approach with a covariance-based method to inductively find spatially dense and non-spatially homogeneous clusters of arbitrary shape.

Ildiko Pelczer et al.'s Scheme - [8] applied cluster analysis to regionalize the Sonora River Basin in Sonora State, Mexico, into homogeneous zones. The identification of homogeneous zones is fundamental for the study of the climatic variations throughout the basin. Their research analyzes the frequency of rain and flood events, as well as other variables that can be very significant in the definition of similar areas. Hierarchical and non-hierarchical algorithms were applied in six experiments based on the precipitation and temperature data sets available from traditional weather stations. To validate the results, four indices applicable to both types of algorithms were used. The experiments showed that better results were achieved when several variables were considered together than when each parameter was analyzed alone. It was also observed that working with average values could mask maximum and minimum values that can influence climatic variability. By comparing the results of the cluster analysis with ancillary data, the authors concluded that the K-means algorithm was an effective method for obtaining climatically homogeneous zones.

Sheng-Tun et al.'s Scheme - [6] discussed the results of cluster analysis using data generated from the discrete wavelet transform and the continuous wavelet transform. Data generated from the continuous wavelet transform provide detailed time-variation features that can be used to detect the spatial variation of air pollutants in a selected time period.

Christina et al.'s Scheme - [5] performed regionalization using three agglomerative clustering methods and developed a system to study quality distribution. Three different hybrid clustering methods were analyzed for grouping sites into non-overlapping, contiguous and homogeneous regions. The paper also validates the homogeneity of the regions formed and suggests future lines of research for improving these methods. Its results show that the clusters formed for grouping sites are homogeneous and that Ward's method combined with k-means is better than the other methods for regionalization.

Ramachandra Rao et al.'s Scheme - [10] investigated three hybrid clustering algorithms to determine the effectiveness of hybrid-cluster analysis in regionalization. In these algorithms, a partitional clustering procedure identifies groups of similar catchments by refining the clusters derived from agglomerative hierarchical clustering algorithms. The hierarchical clustering algorithms used are single linkage, complete linkage and Ward's algorithm, while the partitional clustering algorithm used is K-means. The regions given by the clustering algorithms are, in general, not statistically homogeneous. Hybrid-cluster analysis is found to be useful in minimizing the effort needed to identify homogeneous regions. The hybrid of Ward's and K-means algorithms is better for regionalization than the other hybrids. The hybrid method provides enough flexibility and offers prospects for improvement in regionalization studies.

[5] EXPERIMENTAL RESULT
Of the many clustering techniques used to solve regionalization issues, we evaluate two: K-means and Ward's algorithm.

[5.1] Data Used in the Study
Different spatial datasets are used to compare K-means and Ward's algorithm. The datasets used to test the clustering algorithms and compare them are obtained from the UCI Machine Learning repository [17]. The experimental environment is implemented in MATLAB. Three different datasets are used: the 3D Road Network (North Jutland, Denmark) Data Set, the Gas Sensor Array Drift Dataset at Different Concentrations Data Set, and the Water Treatment Plant Dataset.

[5.2] Evaluation Measure
The regions formed by the clustering algorithms are tested under the following measures.
(1) Cohesion: measures how closely related the objects in a cluster are to each other.
(2) Variance: measures how well separated the clusters are from each other.
(3) Precision: the fraction of retrieved instances that are relevant.
(4) Recall: the fraction of relevant instances that are retrieved.

A region can be regarded as acceptably homogeneous if HM < 1, possibly homogeneous if 1 < HM < 2, and definitely heterogeneous if HM > 2, where HM is the heterogeneity measure [5]. K-means and Ward's algorithm are used for spatial analysis of the data, and the performance of these algorithms is evaluated by comparing their results. It is deduced that Ward's algorithm provides better cohesion values than K-means. Since Ward's algorithm merges the data objects that result in the minimum within-cluster variance, it obtains a better cohesion value than the K-means algorithm. For an algorithm to find homogeneous clusters, the right selection of parameters is essential. In the context of regionalization, it is natural to use clustering algorithms that find arbitrarily shaped clusters.

Table I. Analysis of the average cohesion, average variance, precision and recall
Datasets: 3D Road Network (North Jutland, Denmark) Data Set [17]; Gas Sensor Array Drift Dataset at Different Concentrations Data Set [17]; Water Treatment Plant Dataset [17]
Clustering methods: K-means algorithm, Ward's algorithm
Measures reported for each dataset: Cohesion, Variance, Precision, Recall
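The precision and recall definitions and the heterogeneity thresholds from Section 5.2 can be written down directly. The sketch below is illustrative only: the function names are hypothetical, instances are represented as arbitrary hashable identifiers, and the treatment of the boundary values HM = 1 and HM = 2, which the thresholds in the text leave open, is an assumption.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved instances that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def classify_homogeneity(hm):
    """Label a region from its heterogeneity measure HM.

    Assumption: HM == 1 is treated as 'possibly homogeneous' and HM == 2 as
    'definitely heterogeneous'; the text defines only the strict inequalities.
    """
    if hm < 1:
        return "acceptably homogeneous"
    if hm < 2:
        return "possibly homogeneous"
    return "definitely heterogeneous"


p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
print(p, r)  # 2 of 4 retrieved are relevant; 2 of 3 relevant are retrieved
print(classify_homogeneity(0.4), classify_homogeneity(1.5), classify_homogeneity(2.7))
```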

From the graphs in Figures 3, 4 and 5, it is noticed that as the number of clusters increases, the clusters tend to be more homogeneous.

Figure 3. Homogeneity measure of K-means vs. Ward's algorithm on the first dataset.

Figure 4. Homogeneity measure of K-means vs. Ward's algorithm on the second dataset.

Figure 5. Homogeneity measure of K-means vs. Ward's algorithm on the third dataset.

[6] CONCLUSION
This paper presented various data clustering techniques for the regionalization issue and analyzed the clustering methods used by different researchers for grouping sites into contiguous, non-overlapping and homogeneous regions. We compared the performance of two clustering algorithms, K-means and Ward's clustering algorithm, on the three data sets. The result analysis of K-means and Ward's algorithm on the different air pollution datasets shows non-overlapping clusters based on the feature vectors. Selecting the optimum number of clusters plays a vital role in obtaining homogeneous clusters. When the number of clusters is less than five, there remains at least one cluster that is heterogeneous. When the number of clusters is six, all the clusters are homogeneous. Thus six is the optimum number of clusters for the data sets considered, as found by our analysis. We found that Ward's algorithm gives more cohesion and homogeneity with fewer clusters than K-means for our data sets. In future, this work can be extended to other clustering algorithms related to regionalization.

International Journal of Computer Engineering and Applications, Volume VI, Issue II/III, May 14

REFERENCES
[1] Jiawei Han, Data Mining: Concepts and Techniques, 2006.
[2] Margaret H. Dunham, Data Mining: Introductory and Advanced Concepts (Pearson Education, 2006).
[3] Caixiang Xie, Shilin Chen, Fengmei Suo, and Dan Yang, Regionalization of Chinese Medicinal Plants Based on Spatial Data Mining, Seventh International Conference on Fuzzy Systems and Knowledge Discovery, 2010.
[4] Lokesh Kumar Sharma, Simon Scheider, Willy Kloesgen, Om Prakash Vyas, Efficient clustering technique for regionalisation of a spatial database, Int. J. of Business Intelligence and Data Mining, Vol. 3, No. 1, 2008.
[5] J. Christina, Dr. K. Komathy, Analysis of Hard Clustering Algorithms Applicable to Regionalization, Proceedings of 2013 IEEE Conference on Information and Communication Technologies (ICT 2013).
[6] Sheng-Tun Li, Shih-Wei Chou, and Jeng-Jong Pan, Multi-Resolution Spatio-temporal Data Mining for the Study of Air Pollutant Regionalization, Proceedings of the 33rd Hawaii International Conference on System Sciences.
[7] N. Sumathi, R. Geetha, Spatial data mining - techniques trends and its applications, Journal of Computer Applications, Vol. 1, No. 4, Oct-Dec 2008.
[8] Pelczer, Ramos, Domínguez, González, Establishment of regional homogeneous zones in a watershed using clustering algorithms, International Journal of Business Intelligence and Data Mining, Volume 3, Number 1, 25 April 2008.
[9] Rao, Regionalization of Indiana Watersheds for Flood Flow Predictions Phase I: Studies in Regionalization of Indiana Watersheds, FHWA/IN/JTRP-2002/02, Joint Transportation Research Program, Indiana Department of Transportation and Purdue University, West Lafayette, Indiana.
[10] Rao, Srinivas, Regionalization of watersheds by hybrid-cluster analysis, Journal of Hydrology 318 (2006).
[11] Ramachandra Rao and V.V. Srinivas (2006), Regionalization of watersheds by fuzzy cluster analysis, Journal of Hydrology, ScienceDirect.
[12] L.K. Sharma, S. Scheider, W. Kloesgen and O.P. Vyas, Efficient clustering technique for regionalisation of a spatial database, International Journal of Business Intelligence and Data Mining, Vol. 3, No. 1, 2008.
[13] PVS Srinivas, Susanta K Satpathy, Lokesh K Sharma, and Ajaya K Akasapu (2011), Regionalisation as Spatial Data Mining Problem: A Comparative Study, International Journal of Computer Trends and Technology, Vol. 18, No. 5.
[14] Xin Wang, Jing Wang, Using Clustering methods in geospatial information systems, GEOMATICA, Vol. 64, No. 3, 2010, pp. 347 to 361.
[15] Teknomo, Kardi, K-Means Clustering tutorial\kmean\
[16] Assuncao, Neves, Efficient regionalization techniques for socio-economic geographical units using minimum spanning trees, International Journal of Geographical Information Science, Vol. 20, No. 7, August 2006.
[17] The UCI Machine Learning [online]. Available:
