REGIONALIZATION AS SPATIAL DATA MINING PROBLEM BASED ON CLUSTERING: REVIEW
Geetinder Saini 1, Kamaljit Kaur 2
1 Department of Computer Science & Engineering
2 Assistant Professor, Department of Computer Science & Engineering
Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India

ABSTRACT: Regionalization is one of the major problems in spatial data mining when representing economic and social geography. It can be addressed with spatial clustering algorithms that group spatial objects. The main purpose of regionalization is to find compact, dense regions that also exhibit a homogeneous distribution of non-spatial variables. This paper reviews the clustering algorithms used to solve regionalization problems in spatial data mining and compares the performance of K-means and Ward's algorithm in terms of cohesion, variance, precision and recall.

Keywords: Spatial data mining, Regionalization, Data clustering, K-Means, Ward's Method, Single linkage, Double linkage, Average linkage, DBSCAN Clustering, Cohesion, Variance, Precision and Recall

[1] INTRODUCTION
Spatial data mining is the process of discovering interesting, previously unknown, but potentially useful patterns from spatial databases. Extracting such patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data because of the complexity of spatial data types, spatial relationships and spatial autocorrelation. Spatial data are data related to objects that occupy space. A spatial database stores spatial objects represented by spatial data types, together with the spatial relationships among those objects [7][12]. Spatial data mining techniques include clustering, outlier detection, association and co-location, classification, and trend detection. Clustering is the most commonly used of these techniques in spatial mining.
Clustering is the process of partitioning a set of data objects into subsets such that the elements in a cluster are similar to one another and different from the elements of other clusters [1]. The set of clusters resulting from a cluster analysis can be referred to as a clustering. Clustering is a critical task in data mining, in which similar data are placed in one group and dissimilar data in other groups. In this context, different clustering methods may generate different clusterings on the same data set. The partitioning is performed not by humans but by the clustering algorithm.

Spatial clustering is an important component of spatial data mining. It aims to group similar spatial objects into clusters so that objects within a cluster are highly similar to one another but dissimilar to objects in other clusters [13]. Spatial clustering is applicable to many problems. An important application area is social and economic geography, where regionalization, a classical methodological problem of social geography, can be considered [13]. More generally, cluster analysis is widely used for data analysis and has a wide range of applications in business intelligence, image pattern recognition, web search, biology, space and security [16].

[2] REGIONALIZATION
Regionalization is one of the important tasks in spatial data mining. It is the process of dividing a large set of spatial objects into a smaller number of spatially contiguous regions while optimizing a homogeneity measure of the derived regions. Equivalently, it is a classification procedure applied to spatial objects with an area representation, which groups them into homogeneous contiguous regions. The intent of regionalization is to find spatially compact and dense regions of arbitrary shape with a homogeneous internal distribution of non-spatial variables [5]. For many applications, e.g. direct mailing, it is helpful to have purpose-specific regions that depend on the kind of homogeneity one is interested in [13].
Different types of techniques are used for regionalization, and clustering is the most commonly used among them.

[3] VARIOUS DATA CLUSTERING TECHNIQUES FOR REGIONALIZATION
In spatial data mining, many clustering methods have been developed, and they can be classified into different categories. Broadly, clustering methods fall into two groups: partitioning clustering and hierarchical clustering. Partitioning clustering methods, such as K-means and the self-organizing map (SOM), divide a set of data items into a number of non-overlapping clusters. A data item is assigned to the closest cluster based on a proximity or dissimilarity measure. Hierarchical clustering, on the other hand, organizes data items into a hierarchy with a sequence of nested partitions or groupings. Commonly used hierarchical clustering methods include Ward's method (Ward, 1963), single-linkage clustering, average-linkage clustering, and complete-linkage clustering. Some common techniques used to solve regionalization issues are described below.

[3.1] Partitional Clustering
Partitional clustering methods determine a partition that divides a group of points into clusters such that the points in a cluster are more similar to each other than to points in different clusters. These methods start with some arbitrary initial clusters and iteratively reallocate points among clusters until a stopping criterion is met. They tend to find clusters with hyperspherical shapes [14]. Partitional clustering algorithms include k-means and k-medoids.

[3.1.1] K-Means Clustering
K-means is a partitioning method and one of the simplest unsupervised learning algorithms for solving the clustering problem. It is a simple method for estimating the means (vectors) of a set of K groups. For spatial data mining, k-means attempts to find k center locations such that the sum of the distances from every object to its nearest center is minimized. The K-means algorithm is:
1. Select the initial k means for the k clusters.
2. a) Calculate the dissimilarity between each object and the mean of each cluster.
b) Allocate each object to the cluster whose mean is nearest to it.
c) Recompute the mean of each cluster from the objects allocated to it, so that the intra-cluster dissimilarity is minimized.
3. Repeat the second step until a complete pass through all the objects results in no object moving from one cluster to another. At that point the clusters are stable and the clustering process ends [11].

K-Means Algorithm Properties:
- There are always K clusters.
- There is always at least one object in each cluster.
- The clusters are non-hierarchical and do not overlap.
- Every member of a cluster is closer to the mean of its own cluster than to the mean of any other cluster.
- Results depend on the initial choice of centers.
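The K-means steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study (which was written in MATLAB): the function names `squared_distance`, `mean_point` and `k_means`, the use of squared Euclidean distance as the dissimilarity, and the sample points are all assumptions made for the example. The initial means are passed in explicitly because the result depends on them.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_means, max_passes=100):
    """Return (means, labels) for the given points and initial means."""
    # Step 1: the caller supplies the initial k means.
    means = list(initial_means)
    k = len(means)
    labels = None
    for _ in range(max_passes):
        # Steps 2a-2b: allocate every object to its nearest mean.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]
        # Step 3: stop once a complete pass moves no object.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2c: relocate each mean to the centroid of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: keep the old mean if a cluster empties
                means[c] = mean_point(members)
    return means, labels


# Hypothetical 2-D sample: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
means, labels = k_means(points, initial_means=[points[0], points[3]])
print(labels)
```

With one initial mean taken from each group, the first pass already separates the two groups and the second pass confirms that no object moves, so the loop stops.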
Figure 1. K-means algorithm process [22].

[3.2] Hierarchical Clustering
A hierarchical method creates a hierarchical decomposition of a given set of data objects, building a hierarchy of clusters in the form of a tree, or dendrogram. In agglomerative hierarchical clustering, each object is first assigned to its own cluster, so that N objects give N clusters. The two most similar clusters are found and merged into a single cluster, and the distances between the merged cluster and each of the old clusters are then computed. This procedure is repeated until all objects are grouped into the desired number of clusters, K [6]. There are two approaches to hierarchical clustering: the "bottom-up" approach, which groups small clusters into larger ones and is called agglomerative clustering, and the "top-down" approach, which splits larger clusters into smaller ones and is called divisive clustering.

[3.2.1] Agglomerative (Bottom Up)
Agglomerative hierarchical clustering, or bottom-up clustering, starts with individual data objects and progressively groups them into larger clusters until the root cluster containing all the data objects is formed. It uses a greedy approach that, at each step, merges the clusters most similar to each other according to a user-provided cluster dissimilarity function. Clusters therefore have sub-clusters, which in turn have sub-clusters, and so on. The method starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters until all the objects are in a single cluster or a termination condition is satisfied. The single cluster becomes the root of the hierarchy. In the merging step, it finds the two clusters that are closest to each other and combines them into one cluster [1].

[3.2.1.1] Ward's Method
Ward's method is an agglomerative hierarchical clustering method.
Ward's clustering method reduces the number of clusters one at a time, starting from one cluster per compound and ending when a single cluster comprises all the compounds. At each reduction, the method merges the two clusters whose merger gives the smallest increase in the total sum of squared distances of each point to its cluster centroid. Thus, Ward's algorithm forms clusters by choosing the merger that minimizes the within-cluster sum of squares, or error sum of squares (ESS) [3]:

ESS_k = \sum_{i=1}^{n} x_{ik}^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_{ik} \right)^2    (1)

where:
x_{ik}: the attribute value of molecule i in cluster k
n: the size of cluster k
The ESS values are summed over all clusters:

E = \sum_{k=1}^{K} ESS_k    (2)

where:
K: the number of clusters

Algorithm for Ward's clustering [3]
START
1. Start with the largest number of clusters, each consisting of exactly one compound. The value of E is 0.
2. Reduce the number of clusters by one by merging the two clusters that minimize the increase in the total error sum of squares.
3. If more than one cluster remains, go back to step 2.
4. Display the results in the form of an inverted tree showing, at each stage, which two clusters were merged and the corresponding total error sum of squares (E) or total number of clusters (K).

[3.2.1.2] Single-Linkage Clustering
Single linkage, also called nearest neighbor or shortest distance, is a method of calculating the distance between clusters. In single linkage, the distance between two clusters is the distance between the two closest objects, one from each cluster. It is a bottom-up strategy: each object is first placed in a separate cluster, and at each step the closest pair of clusters is merged until some termination condition is satisfied. This requires a notion of cluster proximity. For single linkage, the proximity of two clusters is defined as the minimum distance between any two points, one in each cluster [11]. The main drawback of this method is the chaining phenomenon: clusters may be forced together because single objects are close to each other, even though many of the objects in each cluster may be very distant from each other.

[3.2.1.3] Complete-Linkage Clustering
Complete-linkage clustering is also known as maximum clustering. In complete-linkage clustering, the distance between two clusters is the maximum of the distances between all pairs of variable vectors drawn one from each cluster [6].
[3.2.1.4] Average-Linkage Clustering
Average linkage is a method of calculating the distance between clusters. In the average-linkage algorithm, the distance between two clusters is defined as the average of the distances between every object in the first cluster and every object in the second cluster [9].
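The four agglomerative criteria described above differ only in how they score a candidate merge of two clusters. The sketch below computes each score for one pair of clusters; it is an illustration under assumptions (Euclidean distance, hypothetical helper names and sample points), not code from the reviewed papers. For Ward's method the score is the increase in the error sum of squares caused by the merge, with ESS computed as the sum of squared distances of the points to their centroid.

```python
from itertools import product


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def distance(p, q):
    """Euclidean distance between two points."""
    return squared_distance(p, q) ** 0.5


def single_linkage(a, b):
    """Minimum distance over all pairs with one point from each cluster."""
    return min(distance(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Maximum distance over all pairs with one point from each cluster."""
    return max(distance(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    """Mean distance over all pairs with one point from each cluster."""
    return sum(distance(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def ess(cluster):
    """Error sum of squares: squared distances of the points to their centroid."""
    n = len(cluster)
    centroid = tuple(sum(coords) / n for coords in zip(*cluster))
    return sum(squared_distance(p, centroid) for p in cluster)


def ward_increase(a, b):
    """Increase in the total ESS if clusters a and b were merged."""
    return ess(a + b) - ess(a) - ess(b)


# Hypothetical clusters: two vertical pairs of points, three units apart.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
single = single_linkage(a, b)
complete = complete_linkage(a, b)
average = average_linkage(a, b)
ward = ward_increase(a, b)
print(single, average, complete, ward)
```

An agglomerative pass would evaluate the chosen score for every pair of current clusters and merge the pair with the smallest value. For any pair of clusters the single-linkage distance is never larger than the average-linkage distance, which in turn is never larger than the complete-linkage distance.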
[3.3] DBSCAN Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) grows clusters according to the density of neighborhood objects. It is based on the concepts of density reachability and density connectivity, both of which depend on two input parameters: the size of the epsilon neighborhood (e) and the minimum number of points required within that neighborhood. These parameters control the size of the neighborhood and the size of the clusters. The algorithm starts with an arbitrary point that has not been visited [6]. The point's e-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise the point is labeled as noise. The minimum-points parameter therefore affects the detection of outliers. DBSCAN targets low-dimensional spatial data; DENCLUE is a related density-based algorithm [12].

[4] EXISTING METHODS USED FOR SOLVING REGIONALIZATION ISSUES
Researchers have used various clustering methods to solve the regionalization problem. Some used existing algorithms, some improved existing algorithms, some presented new algorithms that combine two algorithms, and others compared hybrid clustering algorithms. This section reviews previous studies from the literature that present clustering methods for solving regionalization problems in spatial data mining.

Xie et al.'s Scheme - [3] proposed a spatial clustering algorithm for efficient processing of objects with neighborhood relations. The cluster membership of an object is determined by its own spatial attributes as well as the attributes of the objects in its neighborhood. Clusters with the shortest distance, based on geomorphologic discrepancy laws, are combined. The drawback of this method is that regional homogeneity is not guaranteed.

Sharma et al.'s Scheme - [12] proposed an efficient clustering technique for regionalization of a spatial database (RCSDB).
This algorithm combines spatial density with a covariance-based method to inductively find spatially dense and non-spatially homogeneous clusters of arbitrary shape. RCSDB takes into account spatial point distributions as well as the distributions of several non-spatial characteristics. It classifies a database of geographical locations into homogeneous, planar and density-connected subsets called regions, and it finds internally density-connected sets.

Srinivas et al.'s Scheme - [13] carried out a comparative study of the regionalization techniques used in spatial data mining. They divided regionalization techniques into four groups: conventional clustering methods, the maximization of regional compactness approach, the explicit spatial contiguity constraint approach, and the density-based approach.

Lokesh Kumar et al.'s Scheme - [4] proposed an algorithm to solve regionalization, a prominent problem from social geography, by combining the spatial density clustering approach with a covariance-based method to inductively find spatially dense and non-spatially homogeneous clusters of arbitrary shape.

Ildiko Pelczer et al.'s Scheme - [8] applied cluster analysis to regionalize the Sonora River Basin in Sonora State, Mexico, into homogeneous zones. The identification of homogeneous zones is fundamental for the study of climatic variations throughout the basin. The study analyzes the frequency of rain and flood events as well as other variables that can be significant in defining similar areas. Hierarchical and non-hierarchical algorithms were applied in six experiments based on the precipitation and temperature data sets available from traditional weather stations. To validate the results, four indices applicable to both types of algorithms were used. The experiments showed that better results were achieved when several variables were considered together than when each parameter was analyzed alone, and that working with average values can mask the maximum and minimum values that influence climatic variability. By comparing the results of the cluster analysis with ancillary data, the authors concluded that the K-means algorithm was an effective method for obtaining climatically homogeneous zones.

Sheng-Tun et al.'s Scheme - [6] discussed the results of cluster analysis using data generated from the discrete wavelet transform and the continuous wavelet transform. Data generated from the continuous wavelet transform provide detailed time-variation features that can be used to detect the spatial variation of air pollutants in a selected time period.

Christina et al.'s Scheme - [5] performed regionalization using three agglomerative clustering methods and developed a system to study quality distribution. Three hybrid clustering methods are analyzed for grouping sites into non-overlapping, contiguous and homogeneous regions. The paper also validates the homogeneity of the regions formed and suggests future lines of research for improving these methods. The results show that the clusters formed for grouping sites are homogeneous and that Ward's method combined with k-means performs better than the others for regionalization.

Ramachandra Rao et al.'s Scheme - [10] investigated the effectiveness of three hybrid clustering algorithms for regionalization, in which a partitional clustering procedure identifies groups of similar catchments by refining the clusters derived from agglomerative hierarchical clustering algorithms. The hierarchical clustering algorithms used are single linkage, complete linkage and Ward's algorithm, while the partitional clustering algorithm is K-means. The regions given by the clustering algorithms are, in general, not statistically homogeneous. Hybrid-cluster analysis is found to be useful in minimizing the effort needed to identify homogeneous regions. The hybrid of Ward's and K-means algorithms is better for regionalization than the other combinations. The hybrid method provides sufficient flexibility and offers prospects for improvement in regionalization studies.

[5] EXPERIMENTAL RESULT
Of the many clustering techniques used to solve regionalization problems, we evaluate two: K-means and Ward's algorithm.

[5.1] Data Used in the study
Different spatial datasets are used to compare k-means and Ward's algorithm. The datasets used to test and compare the clustering algorithms are obtained from the site given in [17]. The experiments are implemented in MATLAB. Three datasets are used: the 3D Road Network (North Jutland, Denmark) Data Set, the Gas Sensor Array Drift Dataset at Different Concentrations Data Set, and the Water Treatment Plant Dataset.

[5.2] Evaluation Measure
The regions formed by the clustering algorithms are tested under the following measures:
(1) Cohesion: measures how closely the objects in a cluster are related to each other.
(2) Variance: measures how well separated the clusters are from each other.
(3) Precision: the fraction of retrieved instances that are relevant.
(4) Recall: the fraction of relevant instances that are retrieved.
A region can be regarded as acceptably homogeneous if HM < 1, possibly homogeneous if 1 < HM < 2, and definitely heterogeneous if HM > 2, where HM is the heterogeneity measure [5].

K-means and Ward's algorithm are used for spatial analysis of the data, and their performance is evaluated by comparing their results. Ward's algorithm is found to give better cohesion values than k-means: because Ward's algorithm merges the data objects that result in the minimum within-cluster variance, it achieves a better cohesion value than k-means. For an algorithm to find homogeneous clusters, the right selection of parameters is essential. In the context of regionalization, it is natural to use clustering algorithms that can find arbitrarily shaped clusters.

Table I. Analysis of the average cohesion, average variance, precision and recall
Datasets: 3D Road Network (North Jutland, Denmark) Data Set [17]; Gas Sensor Array Drift Dataset at Different Concentrations Data Set [17]; Water Treatment Plant Dataset [17]
Clustering methods: K-means Algorithm; Ward's Algorithm
Measures reported per dataset: Cohesion, Variance, Precision, Recall
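The precision, recall and heterogeneity thresholds of Section [5.2] can be made concrete with a short sketch. The function names and the sample identifiers are hypothetical, and the treatment of the boundary values HM = 1 and HM = 2, which the text leaves unspecified, is an assumption: here they fall into the less homogeneous class.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of instance ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved instances that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant instances that are retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def homogeneity_class(hm):
    """Map a heterogeneity measure HM to the three classes used in the text."""
    if hm < 1:
        return "acceptably homogeneous"
    if hm < 2:  # assumption: HM == 1 is treated as possibly homogeneous
        return "possibly homogeneous"
    return "definitely heterogeneous"  # assumption: HM == 2 lands here


# Hypothetical example: four retrieved instances, three relevant ones.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5})
print(precision, recall, homogeneity_class(1.4))
```

In this example two of the four retrieved instances are relevant, so precision is 0.5, and two of the three relevant instances are retrieved, so recall is two thirds.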
From the graphs in Figures 3, 4 and 5, it can be seen that as the number of clusters increases, the clusters tend to be more homogeneous.
Figure 3. Homogeneity measure of K-means vs. Ward's algorithm on the first dataset.
Figure 4. Homogeneity measure of K-means vs. Ward's algorithm on the second dataset.
Figure 5. Homogeneity measure of K-means vs. Ward's algorithm on the third dataset.

[6] CONCLUSION
This paper presented various data clustering techniques for the regionalization problem and reviewed the clustering methods that different researchers have used for grouping sites into contiguous, non-overlapping and homogeneous regions. We compared the performance of two clustering algorithms, k-means and Ward's clustering algorithm, on three data sets. The results of K-means and Ward's algorithm on the different air pollution datasets show non-overlapping clusters based on the feature vectors. Selecting the optimum number of clusters plays a vital role in obtaining homogeneous clusters. When the number of clusters is less than five, at least one cluster remains heterogeneous. When the number of clusters is six, all the clusters are homogeneous. Thus six is the optimum number of clusters for the data sets considered, as found by this analysis. We found that Ward's algorithm gives more cohesion and homogeneity with fewer clusters than k-means for our data sets. In future, this work can be extended to other clustering algorithms related to regionalization.
International Journal of Computer Engineering and Applications, Volume VI, Issue II/III, May 14

REFERENCES
[1] Jiawei Han, Data Mining: Concepts and Techniques, 2006.
[2] Margaret H. Dunham, Data Mining: Introductory and Advanced Concepts (Pearson Education, 2006).
[3] Caixiang Xie, Shilin Chen, Fengmei Suo, and Dan Yang, Regionalization of Chinese Medicinal Plants Based on Spatial Data Mining, Seventh International Conference on Fuzzy Systems and Knowledge Discovery, 2010.
[4] Lokesh Kumar Sharma, Simon Scheider, Willy Kloesgen, Om Prakash Vyas, Efficient clustering technique for regionalisation of a spatial database, Int. J. of Business Intelligence and Data Mining, Vol. 3, No. 1, 2008.
[5] J. Christina, Dr. K. Komathy, Analysis of Hard Clustering Algorithms Applicable to Regionalization, Proceedings of 2013 IEEE Conference on Information and Communication Technologies (ICT 2013).
[6] Sheng-Tun Li, Shih-Wei Chou, and Jeng-Jong Pan, Multi-Resolution Spatio-temporal Data Mining for the Study of Air Pollutant Regionalization, Proceedings of the 33rd Hawaii International Conference on System Sciences.
[7] N. Sumathi, R. Geetha, Spatial data mining - techniques, trends and its applications, Journal of Computer Applications, Vol. 1, No. 4, Oct-Dec 2008.
[8] Pelczer, Ramos, Domínguez, González, Establishment of regional homogeneous zones in a watershed using clustering algorithms, International Journal of Business Intelligence and Data Mining, Volume 3, Number 1, 25 April 2008.
[9] Rao, Regionalization of Indiana Watersheds for Flood Flow Predictions Phase I: Studies in Regionalization of Indiana Watersheds, FHWA/IN/JTRP-2002/02, Joint Transportation Research Program, Indiana Department of Transportation and Purdue University, West Lafayette, Indiana.
[10] Rao, Srinivas, Regionalization of watersheds by hybrid-cluster analysis, Journal of Hydrology 318 (2006).
[11] Ramachandra Rao and V.V. Srinivas (2006), Regionalization of watersheds by fuzzy cluster analysis, Journal of Hydrology, ScienceDirect.
[12] L.K. Sharma, S. Scheider, W. Kloesgen and O.P. Vyas, Efficient clustering technique for regionalisation of a spatial database, International Journal Business Intelligence and Data Mining, Vol. 3, No. 1, 2008.
[13] PVS Srinivas, Susanta K Satpathy, Lokesh K Sharma, and Ajaya K Akasapu (2011), Regionalisation as Spatial Data Mining Problem: A Comparative Study, Proc. International Journal of Computer Trends and Technology, Vol. 18, No. 5.
[14] Xin Wang, Jing Wang, Using Clustering methods in geospatial information systems, GEOMATICA, Vol. 64, No. 3, 2010, pp. 347 to 361.
[15] Teknomo, Kardi, K-Means Clustering tutorial\kmean\
[16] Assuncao, Neves, Efficient regionalization techniques for socio-economic geographical units using minimum spanning trees, International Journal of Geographical Information Science, Vol. 20, No. 7, August 2006.
[17] The UCI Machine Learning [online]. Available:
More informationA Modified DBSCAN Clustering Method to Estimate Retail Centre Extent
A Modified DBSCAN Clustering Method to Estimate Retail Centre Extent Michalis Pavlis 1, Les Dolega 1, Alex Singleton 1 1 University of Liverpool, Department of Geography and Planning, Roxby Building, Liverpool
More informationClassification Based on Logical Concept Analysis
Classification Based on Logical Concept Analysis Yan Zhao and Yiyu Yao Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 E-mail: {yanzhao, yyao}@cs.uregina.ca Abstract.
More informationIV Course Spring 14. Graduate Course. May 4th, Big Spatiotemporal Data Analytics & Visualization
Spatiotemporal Data Visualization IV Course Spring 14 Graduate Course of UCAS May 4th, 2014 Outline What is spatiotemporal data? How to analyze spatiotemporal data? How to visualize spatiotemporal data?
More informationMachine Learning on temporal data
Machine Learning on temporal data Classification rees for ime Series Ahlame Douzal (Ahlame.Douzal@imag.fr) AMA, LIG, Université Joseph Fourier Master 2R - MOSIG (2011) Plan ime Series classification approaches
More informationClassification Using Decision Trees
Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association
More informationComputer Vision Group Prof. Daniel Cremers. 14. Clustering
Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it