CHAPTER 2: DATA MINING - A MODERN TOOL FOR ANALYSIS

Due to elements of uncertainty, many problems in this world appear to be complex. The uncertainty may lie either in the parameters defining the problem or in the situations in which it arises. Probability theory has long been the tool for handling this uncertainty. However, it is suitable only for cases whose characteristics are based on random processes, and there is a considerable number of problems whose uncertainty is characterized by nonrandom processes. Since the advent of computers, data mining techniques have been found to be very useful in reducing this uncertainty. In this context, pattern generation and the classifier approach are very popular. This chapter elaborates the importance and application of data mining to meteorological data sets.

2.1. DATA MINING FUNCTIONALITIES

Data mining has found increasing application in society in recent years. With the advent of advanced information and communication technologies, huge amounts of data are now available, and one has to extract the maximum possible useful information and knowledge from them. This is no doubt a challenging task. The fast-growing, tremendous amount of data collected and stored in large and numerous data repositories has far exceeded the human ability for comprehension without powerful tools. As a result, data collected in large repositories become "data tombs", that is, data archives that are seldom visited. Important decisions are often made not on the basis of the huge data banks but on intuition and speculation, because the decision makers have no proper tools to extract the valuable knowledge embedded in the stored data. Expert system technologies, in fact, depend heavily on domain experts to manually input knowledge into databases. This is time consuming and, moreover, not very reliable, as the whole system is prone to erroneous and biased interpretation of data.

Data mining is used to find hidden information in large databases (Agrawal et al 1993; Larose 2006). It is the newest answer to this problem and is being used both to increase revenues and to reduce costs; the potential returns are enormous. Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. Knowledge discovery in data is defined as the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. The overall process consists of turning low-level data into high-level knowledge. Classification, regression, clustering, image retrieval, summarization, functional dependencies, association rules, rule extraction etc. are some of the significant functions of data mining. A particular data mining algorithm is usually an instantiation of the model-preference-search components.

Data mining has been found to be very useful in many fields such as bioinformatics, genetics, medicine, social sciences, education, power engineering, electronics etc. (Larose 2006). In the study of human genetics, sequence mining helps to attain the significant goal of understanding the mapping relationship between inter-individual variation in human DNA sequence and variability in disease susceptibility (Tory et al 2000). In power engineering, data mining methods have been used for condition monitoring of high voltage electrical equipment (McGrail et al 2002). Data clustering techniques have been used for vibration monitoring and analysis of transformer on-load tap changers (Allen et al 2000; Zhu and Davidson 2007). Another example is the use of data mining in expertise finding systems in the field of education (Superby et al 2006). Data mining thus has a variety of applications beyond management and the social sciences.

2.2. ASSOCIATION RULES

Association rule mining (Agrawal et al 1995) discovers relationships between items in a set of transactions. These relationships can be expressed by association rules such as [$i_1 \Rightarrow i_2, i_3$; support $s = 2\%$, confidence $c = 60\%$]. This association rule means that in 2% of the transactions under scrutiny the items $i_1$, $i_2$ and $i_3$ appear jointly. A confidence of 60% means that 60% of the transactions carrying $i_1$ also carry $i_2$ and $i_3$. Associations may include any number of items on either side of the rule. Agrawal et al (1993) first introduced the problem of mining association rules, and it was later enhanced by data mining scholars (Agrawal and Srikant 1994; Bayardo and Agrawal 1999; Sarawagi et al 2000).

Let $I = \{i_1, i_2, \ldots, i_m\}$ denote a set of literals, namely items. Moreover, let $D$ represent a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. A unique identifier, namely TID, is associated with each transaction. A transaction $T$ is said to contain $X$, a set of some items in $I$, if $X \subseteq T$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set $D$ with confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$. The rule has support $s$ in the transaction set $D$ if $s\%$ of the transactions in $D$ contain $X \cup Y$.

An efficient algorithm is required that restricts the search space and checks only a subset of all association rules, yet does not miss important rules. The Apriori algorithm is very useful for retrieving the frequent patterns and their relationships. However, the interestingness (validity) of a rule is based only on support and confidence.

With the constant collection and storage of considerable quantities of business data, association rules are discovered from domain databases and applied in many areas, such as marketing, logistics and manufacturing. In marketing, advertising and sales, corporations have found that they can benefit enormously if implicit and previously unknown buying and calling patterns of customers can be discovered from large volumes of business data. Generally, support and confidence are taken as the two measurable factors for evaluating the 'interestingness' of association rules. Association rules are regarded as interesting if their support and confidence are greater than the user-specified minimum support and minimum confidence, respectively. In data mining, it is important but difficult to determine rule validity.

In meteorology and climatology, a lot of data on various weather elements such as temperature, pressure, winds, humidity, cloud data etc. are available. The India Meteorological Department is more than 125 years old, and its weather records are available. The World Meteorological Organization (WMO), an agency of the United Nations, is the coordinating agency for the World Weather Watch and weather services. Weather data from various countries are sent to the WMO after scrutiny, and it has kept a global record of weather since the last century. The National Oceanic and Atmospheric Administration of the USA maintains a website that makes world weather data available. Thus there is a huge collection of meteorological data that can have intra- and inter-correlations. Data mining techniques were therefore apt for our purpose and were selected to develop the pattern and methodology for spot forecasts. The methodology is applied to the forecast for selected points in Tamilnadu in peninsular India.

A few attempts have been made to test the potential of data mining techniques on meteorological data (Cavozos 2000; Chen 2007; Cofino et al 2003; Dai 1996; Feng et al 2001; Jiang 2010; Kohail 2011; Lahoz and Miguel 2004; Nandagopal and Arunachalam 2010; Xiang Le et al 2008). But no systematic attempt towards the forecast of any parameter seems to have been made using this tool. The exercise can also establish the capability of data mining for operational forecasting in the field of atmospheric science.

2.3. WEATHER PATTERN DERIVATION

The discovery of associations and correlations among items in large climate data sets can be made using frequent item set mining. With huge amounts of data continuously being collected and stored, many industries are becoming interested in mining such patterns from their databases. In the proposed method, the discovery of interesting correlation relationships among huge numbers of daily surface data records containing weather information can help in decision making, such as weather forecasts relevant to agricultural operations, supply of analyzed data to end users for planning agricultural strategy, and behavior analysis of rainfall during peak monsoons.

When association rules are applied to the various atmospheric observations, such as minimum temperature, maximum temperature, daily mean temperature, sea level pressure, cloud density, humidity etc., taken at a definite time in a definite area, the rules could look like

Rule 1: If the temperature is high, then there is no rain in the same area at the same time.

Although Rule 1 reflects some relationships among the meteorological elements, its role in weather prediction is inadequate, as users are often more concerned about the weather along a time dimension, as in

Rule 2: If the humidity is medium and the temperature is low, then it stays warm for the next 24 hours.

Traditional association rules capture only associations among items within the same transaction. They are therefore intra-transactional rules. The notion of a transaction could be a usage pattern of communication preference, the occurrence of rainfall, atmospheric events such as the occurrence of hot days and cool days with similar weather patterns, and so on. An inter-transactional association rule, however, can also represent the association of items among different transactions along certain dimensions, such as 24 hours, 48 hours, 72 hours or 1 week in advance.

For mining association rules from the filtered data set, we use the predictive Apriori algorithm to find the hidden relationships between various atmospheric parameters. The basic property of Apriori is that all non-empty subsets of a frequent item set must also be frequent. Building on this property, the predictive Apriori algorithm searches with an increasing support threshold for the best N rules with respect to a support-based corrected confidence value. Using the predictive Apriori algorithm, the association rules are generated on the weather data set with support and confidence threshold values.

2.4. BAYESIAN CLASSIFICATION FOR CLIMATE DATA

A classification is made by combining the impact that the different attributes have on the prediction to be made. Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes theorem. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
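As a minimal sketch of how such a classifier can estimate class membership probabilities, the following Python fragment applies Bayes theorem to a handful of categorical weather tuples, assuming that the attribute values are conditionally independent given the class. The toy data, the attribute names and the helper functions `fit` and `posterior` are hypothetical illustrations and are not taken from the study, which relies on Weka rather than on hand-written code.

```python
from collections import Counter, defaultdict

# Hypothetical toy training tuples: (humidity, temperature) -> rain label.
TRAIN = [
    (("high", "low"), "yes"),
    (("high", "high"), "yes"),
    (("medium", "low"), "yes"),
    (("high", "high"), "no"),
    (("medium", "high"), "no"),
    (("low", "low"), "no"),
]


def fit(train):
    """Count class frequencies and, per class, the frequency of each attribute value."""
    class_counts = Counter(label for _, label in train)
    # value_counts[label][k][value] = number of tuples of class `label`
    # whose k-th attribute equals `value`.
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in train:
        for k, value in enumerate(attrs):
            value_counts[label][k][value] += 1
    return class_counts, value_counts


def posterior(attrs, class_counts, value_counts):
    """Return the normalized posterior probability of each class for one tuple."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_c in class_counts.items():
        score = n_c / total  # prior probability of the class
        for k, value in enumerate(attrs):
            # conditional probability of the k-th attribute value given the class
            score *= value_counts[label][k][value] / n_c
        scores[label] = score
    norm = sum(scores.values())
    return {label: (s / norm if norm else 0.0) for label, s in scores.items()}


class_counts, value_counts = fit(TRAIN)
result = posterior(("high", "low"), class_counts, value_counts)
print(result)
```

The sketch uses raw relative frequencies without any smoothing, so an attribute value never seen with a class drives that class's score to zero; a practical implementation would normally guard against this.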

Given a data value $X_i$, the probability that a related tuple $T_i$ is in class $C_j$ is described by $P(C_j \mid X_i)$. Training data can be used to determine $P(X_i)$, $P(X_i \mid C_j)$ and $P(C_j)$. From these values, Bayes theorem, $P(C_j \mid X_i) = P(X_i \mid C_j)\,P(C_j) / P(X_i)$, allows us to estimate the posterior probability $P(C_j \mid X_i)$ and then $P(C_j \mid T_i)$. When classifying a target tuple, the conditional and prior probabilities generated from the training set are used to make the prediction. This is done by combining the effects of the different attribute values of the tuple. Suppose that tuple $T_i$ has $p$ independent attribute values $\{X_{i1}, X_{i2}, \ldots, X_{ip}\}$. We then estimate $P(T_i \mid C_j)$ by

$$P(T_i \mid C_j) = \prod_{k=1}^{p} P(X_{ik} \mid C_j).$$

Classification is a form of data analysis that can be used to extract models describing important data classes and to predict future data trends. It can be of great value in weather forecasting by giving a better understanding of the data at large (Ancel et al 2004). The proposed Bayesian classification for the meteorological data set is implemented with the open source machine learning tool Weka 3.6.6. Many real-time problems have been solved with the help of this tool (Weka 3).

In the learning process of the climate data classification, training data are analyzed by a classification algorithm. For rainfall prediction, the class label attribute is precipitation, which has two values, yes and no. For rainfall estimation, the class label attribute is precipitation, which has two values, high and low. For temperature prediction during the winter monsoon, the class label is post_minimum, which has two values, low and normal. For maximum temperature prediction during the summer months, the class label is post_maximum, which has two values, high and normal. The learned model or classifier is represented in the form of classification rules. In the classification process of the climate data set, test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples. Figure 2.1 shows the proposed model for the learning procedure of the climate data classification process. Figure 2.2 shows the proposed model for the classification procedure of the climate data set.

The proposed backpropagation neural model for meteorological data classification learns by iteratively processing a weather data set of training tuples, comparing the network's prediction for each tuple with the actual known target value. The target value may be the known class label of the training tuple. For each training tuple, the weights are modified so as to minimize the mean squared error between the weather prediction and the actual target value, as shown in Figure 2.3.
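The weight update described above can be sketched as follows. This is only an illustrative Python example under stated assumptions: a single hidden layer of three sigmoid units, a learning rate of 0.5, per-tuple updates, and four hypothetical scaled tuples (humidity, temperature) with a binary rain target. None of these choices, nor the function names, come from the study, whose actual network is the one shown in Figure 2.3.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in, n_hidden, seed=0):
    """Small random weights; the last entry of each weight list is the bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w_hidden, w_out


def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output for one tuple."""
    h = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
         for ws in w_hidden]
    y = sigmoid(sum(w * v for w, v in zip(w_out[:-1], h)) + w_out[-1])
    return h, y


def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One backpropagation update for one tuple, descending 0.5 * (y - target)^2."""
    h, y = forward(x, w_hidden, w_out)
    delta_out = (y - target) * y * (1.0 - y)
    # Hidden deltas use the output weights as they were before this update.
    delta_hidden = [delta_out * w_out[j] * h[j] * (1.0 - h[j])
                    for j in range(len(h))]
    for j in range(len(h)):
        w_out[j] -= lr * delta_out * h[j]
    w_out[-1] -= lr * delta_out
    for j, ws in enumerate(w_hidden):
        for i in range(len(x)):
            ws[i] -= lr * delta_hidden[j] * x[i]
        ws[-1] -= lr * delta_hidden[j]


def mse(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data) / len(data)


# Hypothetical inputs scaled to [0, 1]: (humidity, temperature); target 1 = rain.
DATA = [((0.9, 0.2), 1.0), ((0.8, 0.3), 1.0), ((0.3, 0.9), 0.0), ((0.2, 0.8), 0.0)]

w_hidden, w_out = init_network(n_in=2, n_hidden=3)
error_before = mse(DATA, w_hidden, w_out)
for _ in range(2000):
    for x, t in DATA:
        train_step(x, t, w_hidden, w_out)
error_after = mse(DATA, w_hidden, w_out)
print(error_before, error_after)
```

Updating the weights after every tuple, rather than once per pass over the data, mirrors the per-tuple description in the text; batch updates would be an equally valid variant.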

Figure 2.1. Learning procedure of meteorological data classification process

Figure 2.2. Classification process of meteorological data

Figure 2.3. BPN model for meteorological data classification process

The frequent weather patterns are extracted using the predictive Apriori algorithm for rainfall prediction and rainfall estimation in the Northeast and Southwest monsoon seasons for a location-specific forecasting model. Summer maximum and winter minimum temperature patterns are also generated. Using the classifier model, a given weather data set can be classified.

The significance of the present study becomes clear when it is seen in a broader perspective. Scholars have used data mining techniques for forecasting tropical cyclone intensity change, evolving association rules for droughts and floods, spatial association rule mining over the plateau, dissolved gas analysis etc. (Dhanya and Nagesh Kumar 2009; Esp and McGrail 2000; Guo et al 2004; Jian et al 2004; Yang et al 2011). The application of this technique to spot-specific forecasting of weather events, however, is yet to be attempted, particularly for the Indian subcontinent. Earlier works have concentrated on forecasting synoptic-scale patterns such as tropical cyclones and droughts, but the efficacy of data mining for rainfall forecasting is yet to be established. The present work makes an attempt in that direction for peninsular India.
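To make the support and confidence measures of Section 2.2 and the Apriori property of Section 2.3 concrete, the following Python sketch mines frequent item sets level by level and derives rules from them. It is a plain Apriori-style illustration, not the predictive Apriori variant used in this work. The discretized daily records, the item names and the thresholds are hypothetical.

```python
from itertools import combinations

# Hypothetical daily "transactions" of discretized weather items.
DAYS = [
    {"humidity=high", "temp=low", "rain=yes"},
    {"humidity=high", "temp=low", "rain=yes"},
    {"humidity=high", "temp=high", "rain=no"},
    {"humidity=medium", "temp=low", "rain=yes"},
    {"humidity=low", "temp=high", "rain=no"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-k candidates are built from frequent (k-1)-item sets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    frequent = {}
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        k = len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
                 and support(c, transactions) >= min_support]
    return frequent


def rules(frequent, min_confidence):
    """Return (X, Y, support, confidence) for every rule X => Y above the threshold."""
    out = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = s / frequent[lhs]  # support(X u Y) / support(X)
                if conf >= min_confidence:
                    out.append((set(lhs), set(itemset - lhs), s, conf))
    return out


freq = frequent_itemsets(DAYS, min_support=0.4)
for lhs, rhs, s, conf in rules(freq, min_confidence=0.9):
    print(sorted(lhs), "=>", sorted(rhs), "support", s, "confidence", conf)
```

On this toy data the rule temp=low => rain=yes has support 0.6 and confidence 1.0, since all three low-temperature days are also rain days.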