Distributed Clustering and Local Regression for Knowledge Discovery in Multiple Spatial Databases


Aleksandar Lazarevic, Dragoljub Pokrajac, Zoran Obradovic
School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164-2752, USA
{alazarev, dpokraja, zoran}@eecs.wsu.edu

Abstract. Many large-scale spatial data analysis problems involve an investigation of relationships in heterogeneous databases. In such situations, instead of making predictions uniformly across entire spatial data sets, in a previous study we used clustering to identify similar spatial regions and then constructed local regression models describing the relationship between data characteristics and the target value inside each cluster. That approach requires all the data to be resident on a central machine, and it is not applicable when a large volume of spatial data is distributed across multiple sites. Here, a novel distributed method for learning from heterogeneous spatial databases is proposed. Similar regions in multiple databases are identified by independently applying a spatial clustering algorithm at each site, followed by transferring the convex hulls corresponding to the identified clusters and integrating them. For each discovered region, local regression models are built and transferred among the data sites. The proposed method is shown to be computationally efficient and fairly accurate when compared to an approach where all the data are available at a central location.

1. Introduction

The number and the size of spatial databases are rapidly growing in various GIS applications, ranging from remote sensing and satellite telemetry systems to computer cartography and environmental planning. Many large-scale spatial data analysis problems also involve an investigation of relationships among attributes in heterogeneous data sets.
Therefore, instead of applying global recommendation models across entire spatial data sets, models are varied to better match site-specific needs, thus improving prediction capabilities [1]. Our recently proposed approach to such modeling is to define spatial regions having similar characteristics and to build local regression models on them, describing the relationship between the spatial data characteristics and the target attribute [2]. However, spatial data are often inherently distributed at multiple sites and cannot be localized on a single machine for a variety of practical reasons, including data physically dispersed over many different geographic locations, security services and competitive reasons. In such situations, the proposed approach of building local regressors [2] cannot be applied, since the data needed for clustering cannot be

centralized at a single site. Therefore, there is a need to extend this method to learn from large spatial databases located at multiple data sites. A new viable approach for distributed learning of locally adapted models is explored in this paper. Given a number of distributed, spatially dispersed data sets, we first define more homogeneous spatial regions in each data set using a distributed clustering algorithm. The next step is to build local regression models and transfer them among the sites. Our experimental results show that this method is computationally effective and fairly accurate when compared to an approach where all data are localized at a central machine.

2. Methodology

Partitioning spatial data sets into regions having similar attribute values should result in regions of similar target value. Therefore, using the relevant features, a spatial clustering algorithm is used to partition each spatial data set independently into similar regions. The clustering algorithm is applied in an unsupervised manner (ignoring the target attribute value). As a result, a number of partitions (clusters) is obtained on each spatial data set. Assuming similar data distributions across the observed data sets, the number of clusters on each data set is usually the same (Figure 1). If this is not the case, the discovery of an identical number of clusters on each data set can easily be enforced by choosing appropriate clustering parameter values. The next step is to match the clusters among the distributed sites, i.e. to determine which cluster from one data set is most similar to which cluster in another spatial data set. This is followed by building the local regression models on the identified clusters at sites with known target attribute values. Finally, the learned models are transferred to the remaining sites, where they are integrated and applied to estimate unknown target values at the appropriate clusters.

2.1.
Learning at a single site

Although the proposed method can be applied to an arbitrary number of spatial data sets, for the sake of simplicity assume first that we predict on the set D2 using local regression models built on the set D1. Each of the k clusters C1,i, i = 1,...,k, identified on D1 (k = 5 in Figure 1), is used to construct a corresponding local regression model Mi. To apply local models trained on D1 subsets to the unseen data set D2, we construct a convex hull for each cluster on the data set D1 and transfer all convex hulls to the site containing the unseen data set D2 (Figure 1). Using the convex hulls of the clusters from D1 (shown with solid lines in Figure 1), we identify the correspondence between the clusters from the two spatial data sets. This is determined by finding the best matches between the clusters C1,i (from the set D1) and the clusters C2,j (from the set D2). For example, the convex hull H1,4 in Figure 1 covers both the clusters C2,5 and C2,4, but it covers a much larger fraction of C2,5 than of C2,4. Therefore, we conclude that the cluster C1,4 matches the cluster C2,5, and the local regression model M4 built on the cluster C1,4 is applied to the cluster C2,5. However, there are also situations where an exact matching cannot be determined, since there are significant overlapping regions between the clusters from different

data sets (e.g. the convex hull H1,1 covers both the clusters C2,2 and C2,3 in Figure 1, so there is an overlapping region O1). To improve the prediction, a combination of the local regression models built on the neighboring clusters is used on the overlapping regions. For example, the prediction for the region O1 in Figure 1 is made by simple averaging of the local prediction models learned on the clusters C1,1 and C1,5. In this way we hope to achieve better prediction accuracy than with local predictors built on entire clusters.

Figure 1. Clusters in the feature space for two spatial data sets, D1 and D2, and convex hulls (H1,i) from data set D1 (a) transferred to the data set D2 (b).

2.2. Learning from multiple data sites

When data from more physically distributed sites are available for modeling, the prediction can be further improved by integrating the learned models from several data sites. Without loss of generality, assume there are three dispersed data sites, where the prediction is made on the third data set (D3) using the local prediction models from the first two data sets, D1 and D2. The key idea is the same as in the two-data-set scenario, except that more overlapping is likely to occur.
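The prediction scheme just described — apply the model whose transferred hull covers a point, and average the models where hulls overlap — can be sketched per point as follows. This is a minimal illustration only, not the paper's implementation: it assumes NumPy/SciPy are available, that each transferred hull is given by its vertex array, and that the `models` objects (hypothetical here) expose a `predict` method.

```python
import numpy as np
from scipy.spatial import Delaunay

def in_hull(points, hull_vertices):
    """True for each point lying inside the convex hull given by its vertices."""
    return Delaunay(hull_vertices).find_simplex(points) >= 0

def predict_with_transferred_models(X_spatial, X_features, hulls, models, fallback):
    """For each point, average the predictions of every local model whose
    transferred convex hull covers the point (overlap regions are thereby
    handled by simple averaging); uncovered points fall back to a global model."""
    preds = np.empty(len(X_spatial))
    # membership[i, j] == True iff hull i covers spatial point j
    membership = np.stack([in_hull(X_spatial, h) for h in hulls])
    for j in range(len(X_spatial)):
        covering = np.flatnonzero(membership[:, j])
        if covering.size:
            preds[j] = np.mean([models[i].predict(X_features[j:j + 1])[0]
                                for i in covering])
        else:
            preds[j] = fallback.predict(X_features[j:j + 1])[0]
    return preds
```

Note that this per-point formulation subsumes both cases in the text: a point covered by exactly one hull gets its matched cluster's model, while a point in an overlap such as O1 gets the average of the overlapping models.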
To simplify the presentation, we discuss the algorithm only for the matching clusters C1,1, C2,2 and C3,2 from the data sets D1, D2 and D3, respectively (Figure 2). The intersection of H1,1, H2,2 and C3,2 (region C) represents the portion of the cluster C3,2 where the clusters from all three fields match. Therefore, the prediction on this region is made by averaging the models built on the clusters C1,1 and C2,2, whose contours are represented in Figure 2 by the convex hulls H1,1 and H2,2, respectively. Making the predictions on the overlapping portions Oi, i = 1,2,3, is similar to learning

at a single site. For example, the prediction on the overlapping portion O1 is made by averaging the models learned on the clusters C1,1, C2,2 and C2,3.

Figure 2. Transferring the convex hulls from two sites with spatial data sets to a third site.

Figure 3. The alternative representation of the clusters with MBRs.

2.3. The comparison to minimal bounding rectangle representation

An alternative method of representing the clusters, popular in the database community, is to construct a minimal bounding rectangle (MBR) for each cluster. The apparent advantages of this approach are limiting the data transfer further, since an MBR can be represented by fewer data points than a convex hull, and reducing the computational complexity from Θ(n log n) for computing a convex hull of n points to Θ(n) for computing the corresponding MBR. However, this approach results in large overlaps between neighboring clusters (see the shaded part of Figure 3). Therefore, using a convex hull based algorithm leads to a much better cluster representation at the price of a slight increase in computation time and data transfer.

3. Experimental Results

Our experiments were performed using artificial data sets generated with our spatial data simulator [4] by mixing 5 homogeneous data distributions, each having different relevant attributes for the generation of the target attribute. Each data set had 6561 patterns with 5 relevant attributes, where the degree of relevance was different for each distribution. Spatial clustering is performed using the density-based algorithm DBSCAN [5], which was previously used in our centralized spatial regression modeling. As local regression models, we trained 2-layer feedforward neural network models with 5, 10 and 15 hidden neurons.
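The clustering-and-hull step at one site might look as follows. This is a sketch only: it assumes, for illustration, scikit-learn's DBSCAN and SciPy's ConvexHull, and the `eps`/`min_samples` values are placeholders, not the parameters used in the experiments.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull

def cluster_and_hulls(X, eps=0.3, min_samples=5):
    """Cluster one site's feature data with DBSCAN (unsupervised; the target
    attribute is ignored) and return, per cluster, the hull vertices that
    would be transferred to the other sites."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    hulls = {}
    for c in sorted(set(labels) - {-1}):          # -1 marks DBSCAN noise points
        pts = X[labels == c]
        hulls[c] = pts[ConvexHull(pts).vertices]  # hull vertices only
    return labels, hulls
```

Only `hulls[c]` — a handful of vertex coordinates per cluster rather than the raw points — needs to travel between sites, which is what keeps the communication overhead small.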
We used the Levenberg-Marquardt [3] learning algorithm and repeated the experiments starting from 3 random initializations of the network parameters. For each of these models, the prediction accuracy was measured using the coefficient of determination, defined as R2 = 1 - MSE/σ2, where σ is the standard deviation of the target attribute. The R2 value is a measure of the explained variability of

the target variable, where 1 corresponds to a perfect prediction and 0 to a trivial mean predictor.

Table 1. Models built on set D1, applied on D2.

  Method                                                  R2 ± std
  Global model                                            0.73±0.01
  Matching clusters                                       0.82±0.02
  Matching clusters + averaging for overlapping regions   0.87±0.03
  Centralized clustering (upper bound)                    0.87±0.02

Table 2. Models built on sets D1 and D2, applied to D3 (R2 ± std when combining models from a single site vs. from all sites).

  Method                                                  single site   all sites
  Global models                                           0.75±0.02     0.77±0.02
  Matching clusters                                       0.89±0.02     0.90±0.02
  Matching clusters + averaging for overlapping regions   0.90±0.02     0.92±0.03
  Centralized clustering (upper bound)                    0.90±0.01     0.92±0.02

When constructing regressors using spatial data from a single site and testing on spatial data from another site, the prediction accuracies averaged over 9 experiments are given in Table 1. The site-specific local regression models significantly outperformed the global model trained on all D1 data. By combining models on the significant overlapping regions between clusters, the prediction capability was improved further, indicating that the confidence of the prediction in the overlapping parts can indeed be increased by averaging the appropriate local predictors. In summary, for this data set the proposed distributed method closely approaches the upper bound of the centralized technique, in which the two spatial data sets are merged at a single site and the clustering is applied to the merged data set. The prediction changes depending on the noise level and on the number and type of noisy features (features used for clustering and modeling, or for modeling only). We experimented with adding different levels of Gaussian noise (5%, 10% and 15%) to the clustering and modeling features, for a total number of noisy features ranging from 1 to 5 (Figure 4).
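The accuracy measure used throughout can be computed directly from predictions; a minimal sketch of the stated definition:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination R2 = 1 - MSE/sigma^2, where sigma is the
    standard deviation of the target: 1 = perfect fit, 0 = trivial mean predictor."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return 1.0 - mse / np.var(y_true)  # np.var gives sigma^2

# The trivial mean predictor scores 0 by construction:
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, np.full_like(y, y.mean())))  # 0.0
print(r_squared(y, y))                          # 1.0 (perfect prediction)
```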
Figure 4. The influence of different noise levels on the prediction accuracy (R-square value vs. noise level of 5%, 10% and 15%, for one or two noisy clustering features and a total number of noisy features from 0 to 5). We added none, 1, 2 or 3 noisy modeling features to the 1 or 2 noisy clustering features, using matching clusters.

Figure 4 shows that when small noise is present in the features (5%, 10%), even if some of them are clustering features, the method is fairly robust. However, by

increasing the noise level (15%), the prediction accuracy starts to decrease significantly. Finally, when models from two distributed data sites are combined to make the prediction on a third spatial data set, the prediction accuracy was improved beyond that obtained using only the models from a single site (Table 2). The influence of noise is similar in this case, and the experimental results are omitted for lack of space.

4. Conclusions

Experiments on two and three simulated heterogeneous spatial data sets indicate that the proposed method for learning local site-specific models in a distributed environment can result in significantly better predictions than using a global model built on the entire data set. When comparing the proposed approach to a centralized method (all data available at a single data site), we observe no significant difference in the prediction accuracy achieved on the unseen spatial data sets. The communication overhead of data exchange among the multiple data sites is small, since only the convex hulls and the models built on the clusters are transferred. Furthermore, the suggested algorithm is very robust to small amounts of noise in the input features. Although the performed experiments provide evidence that the proposed approach is suitable for distributed learning in spatial databases, further work is needed to optimize methods for combining models in larger distributed systems. We are currently extending the method to a distributed scenario with different sets of known features in the various databases.

5. References

1. Hergert, G., Pan, W., Huggins, D., Grove, J., Peck, T.: Adequacy of Current Fertilizer Recommendation for Site-Specific Management. In: Pierce, F. (ed.): The State of Site-Specific Management for Agriculture, American Society of Agronomy, Crop Science Society of America, Soil Science Society of America, chapter 13, pp. 283-300, 1997.
2. Lazarevic, A., Xu, X., Fiez, T.
and Obradovic, Z.: Clustering-Regression-Ordering Steps for Knowledge Discovery in Spatial Databases. Proc. IEEE/INNS Int'l Conf. on Neural Networks, Washington, D.C., July 1999, ISBN 0-7803-5532-0, No. 345, Session 8.1B.
3. Hagan, M., Menhaj, M.B.: Training Feedforward Networks with the Marquardt Algorithm. IEEE Transactions on Neural Networks, Vol. 5, pp. 989-993, 1994.
4. Pokrajac, D., Fiez, T., Obradovic, Z.: A Spatial Data Simulator for Agriculture Knowledge Discovery Applications, in review.
5. Sander, J., Ester, M., Kriegel, H.-P., Xu, X.: Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications. Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 169-194, 1998.