CHAPTER 2: DATA MINING - A MODERN TOOL FOR ANALYSIS

Many problems in this world appear complex because of elements of uncertainty. The uncertainty may lie either in the parameters defining the problem or in the situations in which it arises. Probability theory has long been the tool used to handle such uncertainty. However, it is suitable only in cases whose characteristics are based on random processes, and there is a considerable number of problems whose uncertainty is characterized by nonrandom processes. Since the advent of computers, data mining techniques have been found to be very useful in reducing this uncertainty. In this context, the pattern generation and classifier approaches are very popular. This chapter elaborates the importance of data mining and its application to the meteorological data set.

2.1. DATA MINING FUNCTIONALITIES

Data mining has found increasing application in society in recent years. With the advent of advanced information and communication technologies, huge amounts of data are now available, and one has to extract the maximum possible useful information and knowledge from them. This is no doubt a challenging task. The fast-growing, tremendous amount of data collected and stored in large and numerous data repositories has far exceeded the human ability to comprehend it without powerful tools. As a result, data collected in large repositories become "data tombs": data archives that are seldom visited. Important decisions are often made not on the basis of the huge data banks but on intuition and speculation, because the decision makers have no proper tools
to extract the valuable knowledge embedded in the stored data. Expert system technologies, in fact, depend heavily on domain experts to enter knowledge into databases manually. This is time consuming and also not very reliable, as the whole system is prone to erroneous and biased interpretation of data. Data mining is used to find hidden information in large databases (Agrawal et al 1993; Larose 2006). It is the newest answer to this need and is being used both to increase revenues and to reduce costs. The potential returns are enormous.

Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. Knowledge discovery in data is defined as the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. The overall process consists of turning low-level data into high-level knowledge. Classification, regression, clustering, image retrieval, summarization, functional dependencies, association rules and rule extraction are some of the significant functions of data mining. A particular data mining algorithm is usually an instantiation of the model-preference-search components.

Data mining has been found to be very useful in many fields such as bioinformatics, genetics, medicine, social sciences, education, power engineering and electronics (Larose 2006). In the study of human genetics, sequence mining helps to attain the significant goal of understanding the mapping relationship between inter-individual variation in human DNA sequence and variability in disease susceptibility (Tory et al 2000). In power engineering, data mining methods have been used for condition monitoring of high
voltage electrical equipment (McGrail et al 2002). Data clustering techniques have been applied to vibration monitoring and analysis of transformer on-load tap changers (Allen et al 2000; Zhu and Davidson 2007). Another example is the use of data mining in expertise finding systems in education (Superby et al 2006). Data mining thus has a variety of applications beyond management and social sciences.

2.2. ASSOCIATION RULES

Association rule mining (Agrawal et al 1995) discovers the relationships between items in a set of transactions. These relationships can be expressed by association rules such as [$i_1 \Rightarrow i_2, i_3$; support $s = 2\%$, confidence $c = 60\%$]. This rule means that items $i_1$, $i_2$ and $i_3$ appear jointly in 2% of the transactions under scrutiny. A confidence of 60% means that 60% of the transactions carrying $i_1$ also carry $i_2$ and $i_3$. Associations may include any number of items on either side of the rule. Agrawal et al (1993) first introduced the problem of mining association rules, and it was later enhanced by other data mining scholars (Agrawal and Srikant 1994; Bayardo and Agrawal 1999; Sarawagi et al 2000).

Let $I = \{i_1, i_2, \ldots, i_m\}$ denote a set of literals, called items. Let $D$ represent a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. A unique identifier, TID, is associated with each transaction. A transaction $T$ is said to contain $X$, a set of some items in $I$, if $X \subseteq T$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set
$D$ with confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$. The rule has support $s$ in the transaction set $D$ if $s\%$ of the transactions in $D$ contain $X \cup Y$. An efficient algorithm is required that restricts the search space and checks only a subset of all association rules, yet does not miss important rules. The Apriori algorithm is very useful for retrieving the frequent patterns and their relationships. However, the interestingness (validity) of a rule is then based only on support and confidence.

With the constant collection and storage of considerable quantities of business data, association rules are discovered from domain databases and applied in many areas, such as marketing, logistics and manufacturing. In marketing, advertising and sales, corporations have found that they can benefit enormously if implicit and previously unknown buying and calling patterns of customers can be discovered from large volumes of business data. Generally, support and confidence are taken as the two measurable factors for evaluating the 'interestingness' of association rules. Association rules are regarded as interesting if their support and confidence are greater than the user-specified minimum support and minimum confidence, respectively. In data mining, it is important but difficult to determine rule validity.

In meteorology and climatology, a great deal of data is available on weather elements such as temperature, pressure, wind, humidity and cloud. The India Meteorological
Department is more than 125 years old, and its weather records are available. The World Meteorological Organization (WMO), an agency of the United Nations, is the coordinating agency for the World Weather Watch and weather services. Weather data from various countries are sent to the WMO after scrutiny, and it has kept a global record of weather since the last century. The National Oceanic and Atmospheric Administration of the USA maintains a website that makes world weather data available. Thus there is a huge collection of meteorological data that can have intra- and inter-correlations. Data mining techniques were therefore apt for our purpose and were selected to develop the pattern and methodology for spot forecasts. The methodology is applied to the forecast of selected points in Tamilnadu in peninsular India.

A few attempts have been made to test the potential of data mining techniques on meteorological data (Cavozos 2000; Chen 2007; Cofino et al 2003; Dai 1996; Feng et al 2001; Jiang 2010; Kohail 2011; Lahoz and Miguel 2004; Nandagopal and Arunachalam 2010; Xiang Le et al 2008). However, no systematic attempt to forecast any parameter using this tool seems to have been made. The exercise can also establish the capability of data mining for operational forecasting in the field of atmospheric science.

2.3. WEATHER PATTERN DERIVATION

Associations and correlations among items in large climate data sets can be discovered using frequent item set mining. With enormous amounts of data continuously being collected and stored, many industries are becoming interested in
mining such patterns from their databases. In the proposed method, the discovery of interesting correlation relationships among huge numbers of daily surface data records containing weather information can help in decision making, such as weather forecasts relevant to agricultural operations, the supply of analyzed data to end users for planning agricultural strategy, and the behavior analysis of rainfall during peak monsoons.

When association rules are applied to atmospheric observations such as minimum temperature, maximum temperature, daily mean temperature, sea level pressure, cloud density and humidity, taken at a definite time in a definite area, they could look like this:

Rule 1: If the temperature is high, then there is no rain in the same area at the same time.

Although Rule 1 reflects some relationships among the meteorological elements, its role in weather prediction is inadequate, as users are often more concerned about the weather along a time dimension, as in:

Rule 2: If the humidity is medium and the temperature is low, then it keeps warm for the next 24 hours.

Traditional association rules capture only associations among items within the same transaction. They are therefore intra-transactional rules. The notion of the
transaction could be the usage pattern of communication preference, the occurrence of rainfall, atmospheric events such as the occurrence of hot and cool days with similar weather patterns, and so on. An inter-transactional association rule, however, can also represent the association of items among different transactions along certain dimensions, such as 24 hours, 48 hours, 72 hours or 1 week in advance.

For mining association rules from the filtered data set, we use the predictive Apriori algorithm to find the hidden relationships between the various atmospheric parameters. The basic property of Apriori is that all non-empty subsets of a frequent item set must also be frequent. The predictive Apriori algorithm searches with an increasing support threshold for the best N rules with respect to a support-based corrected confidence value. Using this algorithm, the association rules are generated on the weather data set with support and confidence threshold values.

2.4. BAYESIAN CLASSIFICATION FOR CLIMATE DATA

A classification is made by combining the impact that the different attributes have on the prediction to be made. Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes theorem. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
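As a minimal sketch of how a Bayesian classifier turns training frequencies into class membership probabilities, the following Python example treats the attributes as independent given the class and applies Bayes theorem. The training tuples, attribute values and function names are hypothetical and chosen only for illustration; this is not the Weka implementation used in this work, and it applies no smoothing to zero counts.

```python
from collections import Counter, defaultdict

# Hypothetical training tuples: (humidity, temperature) -> precipitation class.
TRAINING = [
    (("high", "low"), "yes"),
    (("high", "low"), "yes"),
    (("high", "high"), "yes"),
    (("medium", "low"), "yes"),
    (("medium", "high"), "no"),
    (("low", "high"), "no"),
    (("low", "high"), "no"),
    (("low", "low"), "no"),
]


def train(tuples):
    """Estimate class priors and per-class attribute value counts by relative frequency."""
    class_counts = Counter(label for _, label in tuples)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for attributes, label in tuples:
        for k, value in enumerate(attributes):
            value_counts[(k, label)][value] += 1
    priors = {c: n / len(tuples) for c, n in class_counts.items()}
    return priors, value_counts, class_counts


def posterior(attributes, priors, value_counts, class_counts):
    """Return the posterior probability of every class for one tuple."""
    scores = {}
    for c, prior in priors.items():
        likelihood = 1.0
        # Independence assumption: multiply the per-attribute conditional probabilities.
        for k, value in enumerate(attributes):
            likelihood *= value_counts[(k, c)][value] / class_counts[c]
        scores[c] = prior * likelihood
    total = sum(scores.values())
    if total == 0:
        raise ValueError("attribute values never seen together with any class")
    # Dividing by the total plays the role of the evidence term in Bayes theorem.
    return {c: s / total for c, s in scores.items()}


if __name__ == "__main__":
    model = train(TRAINING)
    print(posterior(("medium", "low"), *model))
```

For the tuple (medium, low), the unnormalized scores are 0.5 x 1/4 x 3/4 for "yes" and 0.5 x 1/4 x 1/4 for "no", so the sketch assigns a posterior of 0.75 to rain and 0.25 to no rain.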

Given a data value $X_i$, the probability that a related tuple $T_i$ is in class $C_j$ is described by $P(C_j \mid X_i)$. Training data can be used to determine $P(X_i)$, $P(X_i \mid C_j)$ and $P(C_j)$. From these values, Bayes theorem allows us to estimate the posterior probability $P(C_j \mid X_i)$ and then $P(C_j \mid T_i)$. When a target tuple is classified, the conditional and prior probabilities generated from the training set are used to make the prediction. This is done by combining the effects of the different attribute values of the tuple. Suppose that tuple $T_i$ has $p$ independent attribute values $\{X_{i1}, X_{i2}, \ldots, X_{ip}\}$. We then estimate $P(T_i \mid C_j)$ by

$$P(T_i \mid C_j) = \prod_{k=1}^{p} P(X_{ik} \mid C_j).$$

Classification is a form of data analysis that can be used to extract models describing important data classes and to predict future data trends. It is of great value in weather forecasting, as it gives a better understanding of the data at large (Ancel et al 2004). The proposed Bayesian classification for the meteorological data set is implemented with the machine learning tool Weka 3.6.6, which is available as open source. Many real-time problems have been solved with the help of this open source tool (Weka 3).

In the learning process of climate data classification, the training data are analyzed by a classification algorithm. For rainfall prediction, the class label attribute is precipitation, which takes two values, yes and no. For rainfall estimation, the class label attribute is precipitation, which takes two values, high and low. For temperature prediction during the winter monsoon, the class label is post_minimum, which takes the values low and normal. For maximum temperature prediction during the summer months, the class label is post_maximum, which takes the values high and normal. The learned model or classifier is represented in the
form of classification rules. In the classification process, test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples. Figure 2.1 shows the proposed model for the learning procedure of the climate data classification process. Figure 2.2 shows the proposed model for the classification procedure of the climate data set.

The proposed backpropagation neural (BPN) model for meteorological data classification learns by iteratively processing a weather data set of training tuples, comparing the network's prediction for each tuple with the actual known target value. The target value may be the known class label of the training tuple. For each training tuple, the weights are modified so as to minimize the mean squared error between the weather prediction and the actual target value, as shown in Figure 2.3.
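The weight update described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the network size (one hidden layer with two sigmoid units), the learning rate, the number of epochs and the normalized (humidity, temperature) tuples are all hypothetical and are not taken from the model of this work.

```python
import math
import random

# Hypothetical training tuples: normalized [humidity, temperature] -> rain (1.0) or no rain (0.0).
DATA = [
    ([0.9, 0.2], 1.0), ([0.8, 0.3], 1.0), ([0.7, 0.4], 1.0),
    ([0.3, 0.8], 0.0), ([0.2, 0.9], 0.0), ([0.4, 0.7], 0.0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, n_hidden, rng):
    """Small random weights; the last weight of every unit is its bias."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(x, hidden, output):
    """Return the hidden activations and the network prediction for one tuple."""
    h = [sigmoid(sum(w * v for w, v in zip(unit, x + [1.0]))) for unit in hidden]
    y = sigmoid(sum(w * v for w, v in zip(output, h + [1.0])))
    return h, y


def mean_squared_error(data, hidden, output):
    return sum((t - forward(x, hidden, output)[1]) ** 2 for x, t in data) / len(data)


def train(data, hidden, output, rate=0.5, epochs=2000):
    """Update the weights in place after every training tuple."""
    for _ in range(epochs):
        for x, target in data:
            h, y = forward(x, hidden, output)
            # Error term of the output unit: prediction error times the sigmoid slope.
            delta_out = (target - y) * y * (1.0 - y)
            # Error terms of the hidden units, propagated back through the output weights.
            delta_hidden = [delta_out * output[j] * h[j] * (1.0 - h[j])
                            for j in range(len(h))]
            for j, v in enumerate(h + [1.0]):
                output[j] += rate * delta_out * v
            for j, unit in enumerate(hidden):
                for i, v in enumerate(x + [1.0]):
                    unit[i] += rate * delta_hidden[j] * v
    return hidden, output


if __name__ == "__main__":
    hidden, output = init_network(2, 2, random.Random(0))
    print("MSE before training:", mean_squared_error(DATA, hidden, output))
    train(DATA, hidden, output)
    print("MSE after training:", mean_squared_error(DATA, hidden, output))
```

The hidden-layer error terms are computed before the output weights are changed, so each update uses the weights that produced the prediction being corrected.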

Figure 2.1. Learning procedure of meteorological data classification process

Figure 2.2. Classification process of meteorological data

Figure 2.3. BPN model for meteorological data classification process

The frequent weather patterns are extracted using the predictive Apriori algorithm for rainfall prediction and rainfall estimation in the Northeast and Southwest monsoon seasons for a location-specific forecasting model. Summer maximum and winter minimum temperature patterns are also generated. Using the classifier model, a given weather data set can be classified.

The significance of the present study becomes clear when it is seen in a broader perspective. Scholars have used data mining techniques to forecast tropical cyclone intensity change, to derive association rules for droughts and floods, to apply spatial association rule mining over the plateau, and for dissolved gas analysis (Dhanya and Nagesh Kumar 2009; Esp and McGrail 2000; Guo et al 2004; Jian et al 2004; Yang et al 2011). However, the application of the technique to spot-specific forecasts of weather events is yet to be attempted, particularly for the Indian subcontinent. Earlier works have concentrated on applying data mining to synoptic-scale patterns such as tropical cyclones and droughts, but the efficacy of data mining for rainfall forecasting is yet to be established. The present work makes an attempt in that direction for peninsular India.
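The support and confidence measures of Section 2.2 and the Apriori property used in Section 2.3 can be illustrated with a small Python sketch. It implements the plain level-wise Apriori search for frequent item sets, not the predictive Apriori variant used in this work, and the discretized daily observations and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions: one set of discretized weather observations per day.
DAYS = [
    {"temp=high", "humidity=low", "rain=no"},
    {"temp=high", "humidity=low", "rain=no"},
    {"temp=high", "humidity=medium", "rain=no"},
    {"temp=low", "humidity=high", "rain=yes"},
    {"temp=low", "humidity=high", "rain=yes"},
    {"temp=low", "humidity=medium", "rain=no"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the item set."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of (X union Y) divided by support of X, for the rule X => Y."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-k candidates are joined from frequent (k-1)-item sets."""
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support}
    frequent = set(level)
    k = 2
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori property: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if support(c, transactions) >= min_support}
        frequent |= level
        k += 1
    return frequent


if __name__ == "__main__":
    print(support({"temp=high", "rain=no"}, DAYS))       # support of Rule 1
    print(confidence({"temp=high"}, {"rain=no"}, DAYS))  # confidence of Rule 1
    print(frequent_itemsets(DAYS, 0.5))
```

On these six hypothetical days, the rule "temp=high => rain=no", which mirrors Rule 1, has a support of 0.5 and a confidence of 1.0. With a minimum support of 0.5, the only frequent item set with more than one item is {temp=high, rain=no}.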