International Journal of Research in Computer and Communication Technology, Vol 3, Issue 7, July

Hybrid SVM Data mining Techniques for Weather Data Analysis of Krishna District of Andhra Region N.Rajasekhar 1, Dr. T. V. Rajini Kanth 2 1 (Assistant Professor, Department of Computer Science & Engineering, VNRVJIET, Hyderabad) 2 (Professor, Department of Computer Science &Engineering, SNIST, Hyderabad) ABSTRACT: Weather Prediction is the application of science and technology to estimate the state of atmosphere at a particular spatial location. Due to the availability of huge data researchers got interest to analyze and forecast the weather. By predicting accurately it helps in safeguarding human life and their wealth. Forecasting techniques are useful in effective weather prediction; crops yield growth, traffic congestions, marine navigation, forests growth & defense purposes. The Data Mining techniques / algorithms are proved to be better than the existing techniques / methodologies / traditional statistical methods. In this paper, we have proposed a new Hybrid SVM (Support vector machines) model for effective weather prediction by analyzing the given huge weather data sets and to find the suitable patterns existing in it. SVM is one of the supervised learning methods under classifications & regression. Good results were arrived in weather prediction than the other Data Mining Techniques. For this, as a case study we have considered Krishna district weather data sets were considered for analysis using this proposed hybrid SVM technique. Keywords - Weather Prediction, Spatial Location, Hybrid Support Vector Machine, Machine Learning Algorithms, Data Mining Techniques. 1. INTRODUCTION Analysis of Weather data is used for finding the atmospheric pattern and apart from knowing the status of weather at a specific location for desired time duration. Krishna district [Ref.1] is a district in the Coastal Andhra region of Andhra Pradesh, India. It is named after the Krishna River, the third longest river that flows within India, flows through the district and joins Bay of Bengal near Hamsaladevi village of the district. Machilipatnam is the administrative headquarters for the district. Vijayawada is the biggest city in the district and also the commercial center of the state. According to 2011 Census, it has a population of 4,529,009, of which 41.00% is urban and a literacy rate of 74.37%. The district is bounded to the north-west and west by Telangana, to the north-east by West Godavari District, to the south-east by the Bay of Bengal, to the southwest by Guntur District. All the numerical data about the previous and current status of the rainfall is collected for the estimation of weather. It is useful to study and analyze weather data sets for effective prediction. The present weather prediction models are useful in estimation of weather based on the atmospheric conditions. In expert systems there is no automation methods exists in order to pick up the best possible predictive models for effective weather forecast. For this, new mathematical models are to be designed in order to address very huge time and computational complexities that occurs due to uncertainty of weather conditions. By this, the new models developed will benefit us in enhancing weather prediction accuracies. Effective forecasting of weather has lots of uses like by providing early information about natural disasters to the public to save their lives and minimize the property damages. These predictions make the agriculturalists and business people to plan crops and yields. Weather prediction is classified depending on time factor in to four types namely www.ijrcct.org Page 743

Short range- forecast up to 48 hours, Extended-forecast from 3 5 days, Medium range 3 to 7 days and Longrange forecasts more than 7 days [1]. We consider the rainfall data of three districts of Andhra Region. Here we considered Krishna district weather data for study and analysis for effective weather prediction. 2. LITERATURE SURVEY 2.1 K- Means clustering algorithm: K- Means clustering algorithm [6, 7, 8] is one of the most widely used clustering algorithm in the real world applications. It has been successfully applied in domains such as relational databases, gene expression data and decision support systems. One of the drawbacks of this algorithm is the choice of centroids location which is at random at the beginning of the algorithm; variables treatment as numbers and the number of clusters K is unknown. The drawback of the first one is overcome by multiple runs or particular initialization methods. But these initialization methods are not better than random centroids. The second drawback is overcome by handling the categorical parameters by using the measure of matching dissimilarity. For the third drawback, the input parameter that is number of clusters is fixed before in the regular K-means algorithm. This problem can be addressed by the use of cluster validity indices. Like other data mining algorithms, reliability through K- means reduced when treating high-dimensional data. In this type data sets are nearly always too sparse. The application of Euclidean distance becomes meaningless in high dimensional sparse spaces. By combining K- means with feature extraction methods such as Principal Component Analysis (PCA) and Self Organizing Maps (SOM) there is a possibility of solution. 2.2Support vector machine: Support vector machine (SVMs) [2, 5] is a supervised learning method that generates a decision-making system to predict new values. When a set of training samples were given the SVM algorithm develops a model to predict to which category this samples falls. SVM model segregates as distant as possible the given samples by a clear gap by representing them as points in space. The given examples to be predicted are then mapped into the either side of the gap based on the category that they belongs. SVM constructs a hyper plane or set of hyper-planes in the high dimensional space that can be used for machine learning algorithms like classification, or regression. The largest distanced hyper plane has a good margin even to the closest training data points to whatever class they belongs to. If there is higher separation there will be smaller generalization of classifier error. This technique can be explained as a combination of two steps: Learning: This step consists of examples along with SVM training. Prediction: It is necessary when ever results are not known by inclusion of new samples. Every example is represented as a pair of (input, output) in which data set is the input and how to categorize is denoted by output. In this every example is represented as a pair ( x, y) in mathematical format where x is the real numbers vector and y is the Boolean value (i.e. y/n, 1/ -1, T/F), or a real number. Out of these two cases first one, is treated as a problem of classification, and the second one treated as a problem of regression. Learning results are compared to the (x, y) points representing a real function in the space and SVM generates the interpolated function interpolates these series of points to the best. The Training Set consists of group of pairs that make up the SVM, where as in Test Set it contains the group of pairs for prediction. The data set is derived after clustering the data set. 3. PROPOSED APPROACH The realistic data sets that are collected are in raw format. In first step they have to be converted in to experimental suitable data [3] sets format by preprocessing techniques like noise removal etc. Initially the K-means clustering technique will be applied on the data sets and after that SVM classification technique will be applied over that clustered data set. This SVM hybrid technique [4] is suitable for effective analysis of weather data sets. 3.1 Implementation of the Proposed Approach: K- means clustering was applied on the data set for about 102 years namely Monthly Mean for each year Average Temperature (1901-2002). K-means www.ijrcct.org Page 744

clustering algorithm was applied on the data set [3] Monthly mean for each year Average Temperature and the Clustering model is on full training set. There are 13 attributes namely year, Jan, Feb, Mar, April, May, June, July, August, Sep, Oct, Nov and Dec. Euclidean distance was applied for making clusters. The number of iterations is 6 and within cluster sum of squared errors is 33.373323263149274. Time taken to build model (full training data) is 0.03 seconds. In this, missing values are globally replaced with mean/mode. Table 1 is representing the cluster centroids for about 5 clusters. Total number of Instances is 102. Cluster 0 has 33(32%), Cluster 1 has 15(15%), Cluster 2 has 24(24%), Cluster 3 has 13(13%) and Cluster 4 has 17(17%) number of instances. Fig.1 shows clusters Histogram. Fig.2 shows cluster graph of Monthly Mean for each year Average temperature with Instance number along x-axis and year along y-axis. SVM classification was done with 14 attributes namely year, names of 12 months and Clusters. The total number of Instances is 102. It was evaluated on total training set. LibSVM wrapper, actual code by Yasser EL- Manzalawy (= WLSVM). The time taken to build the model is 0.14 seconds and 0.02 seconds to test the model on training data set. Agreement between the classifications and the true classes is represented by Kappa Statistic. It is a Chance-corrected measure. It is measured by considering expected agreement by chance away from observed agreement and divides it by highest possible agreement. If the value is more than 0 it means that the classifier is doing better. The Kappa statistic is near to 1 so it is near to perfect agreement. The kappa statistic measures the agreement of prediction with the true class 0.8828 signifies almost there is complete agreement. The correctly classified instances are 92 and incorrectly classified instances are 10. That means it has classified perfectly. There is no considerable error most of the errors are near to zero except the root relative squared error. Fig.3 shows SVM classifier graph of monthly mean for each year Average temperature. In this graph Clusters were taken along x-axis and predicted cluster along y-axis. The Annual Mean values of Minimum, Average and Maximum temperatures for all the years 1901-2002 are shown in Fig.4 along with their trend line Polynomial equations. In this month s are taken along x-axis and temperature along y-axis. The equation for Mean Minimum Temperature is given by y = -0.0019x 5 + 0.0631x 4-0.7665x 3 + 3.8138x 2-5.4917x + 20.378 ---- (1) Equation for Mean Average Temperature is given by y = -0.0024x 5 + 0.0787x 4-0.9231x 3 + 4.3635x 2-5.9399x + 25.725 --- (2) Equation for Mean Maximum Temperature is given by y = -0.0029x 5 + 0.0943x 4-1.0808x 3 + 4.9155x 2-6.3828x + 31.093 --- (3) The regression [9] equation for Annual Mean values is given by Averagetemperature = 0.0442 * cloudcover + ( - 1.2564) * DiurnalTemperatureRange + (-9.431)* GroundFrostFrequency + 0.0208 * precipitation + 0.0824 * Vapour Pressure + -0.7168 * Wet day Frequency + 3.9871 * Potential Evapotranspiration +13.3485 --- (4) 4. FIGURES AND TABLES OF CLUSTER CENTROIDS Attribute Full Data 0 1 2 3 4 (102) (33) (15) (24) (13) (17) Cluster# ======================================= ================================ Year 1951.5 1924.697 1930.2 1977.3333 1986.6923 1958.9412 Jan 23.3345 23.1897 22.7651 23.5955 24.1856 23.0984 Feb 25.1828 24.8517 24.5215 25.6661 26.159 24.9804 www.ijrcct.org Page 745

Mar 27.6269 27.0164 27.1878 28.0848 28.5775 27.8259 Apr 30.3108 30.0328 29.5188 30.6396 30.975 30.5771 May 32.3848 32.1833 31.8497 32.8083 32.6614 32.4384 Jun 31.0933 31.0168 30.7815 31.8466 30.7131 30.7443 Jul 28.838 28.6428 28.7983 29.185 29.5832 28.192 Aug 28.4052 28.4061 28.0938 28.7352 28.6475 28.0271 Sep 28.2424 28.0239 28.2679 28.6095 28.9003 27.6229 Oct 27.1811 26.91 27.5127 27.3202 27.7857 26.7561 Nov 24.7899 24.2983 25.2432 25.1427 25.6115 24.2178 Dec 23.1654 22.8922 23.3564 23.493 23.9412 22.4715 Table1: Clustered Centroids of Monthly Mean For Each Year Average Temperature Fig.1: Clusters Histogram Fig.2: Clustered graph generated using k-means clustering algorithm DETAILED ACCURACY BY CLASS TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class 0.970 0.072 0.865 0.970 0.914 0.873 0.949 0.848 cluster0 0.600 0.000 1.000 0.600 0.750 0.749 0.800 0.659 cluster1 0.958 0.038 0.885 0.958 0.920 0.895 0.960 0.858 cluster2 0.846 0.000 1.000 0.846 0.917 0.910 0.923 0.866 cluster3 1.000 0.024 0.895 1.000 0.944 0.935 0.988 0.895 cluster4 Weighted Avg. 0.902 0.036 0.912 0.902 0.897 0.875 0.933 0.833 Table 2: Detailed accuracy by clustered class www.ijrcct.org Page 746

CONFUSION MATRIX a b c d e <-- classified as 32 0 0 0 1 a = cluster0 5 9 1 0 0 b = cluster1 0 0 23 0 1 c = cluster2 0 0 2 11 0 d = cluster3 0 0 0 0 17 e = cluster4 Table 3: Confusion Matrix Fig.5: Graph of Annual mean values of weather parameters for all the years (1901-2002) Fig.3: SVM classifier graph of monthly mean for each year Average temperature Fig.4: Graph of Annual Mean Temperatures for all the years 1901-2002 along with Trend line equations 5. CONCLUSION: In the K-means cluster analysis given by Table.1 cluster 0 is 33 with highest number of instances has 1925 year as the centroid. Cluster 4 has lowest temperature 22.4715 0 c in the month of December. Cluster3 has highest temperature values in all months except May and June with 13 instances. Cluster2 with number of instances 24 has highest temperature values in these two months May and June. Detailed accuracies are given by Table.2 indicates that cluster 4 has TP rate as 1 and F-measure as 0.944. Cluster0 and cluster2 has TP rate values as 0.970 and 0.958 and F-measure values are 0.914 and 0.920 respectively. It tells that cluster 4 was classified accurately followed by cluster0 and cluster2. In the confusion matrix represented by the Table.3 each column represents instances in the predicted class and each row indicates instances in actual class. The classifier has classified accurately for the cluster4, cluster0 and cluster2 but less accurate in case of cluster1 and cluster 3 indicated by confusion matrix. Only cluster 4 was classified accurately. Fig.4. gives the following conclusions or inferences in which months are taken along x-axis and different weather parameter values are taken along y-axis. There is a correlation between mean precipitation and mean cloud cover its value is 0.86114364. Likewise there is a correlation between average temperature and mean www.ijrcct.org Page 747

vapour pressure and is 0.851278951. There is a correlation between mean diurnal temperature and mean potential evapotranspiration and is 0.665903472. There is negative correlation between Mean ground frost frequency and mean wet day frequency and its value is -0.729960042. Mean average temperature has correlation with all the parameters namely mean cloud cover, diurnal temperature range, ground frost frequency, precipitation, vapour pressure, wet day frequency, potential evapotranspiration and reference crop evapotranspiration. It has highest correlation with reference crop evapotranspiration and negative correlation with ground frost frequency. The future scope is this Hybrid SVM technique can be extended to any weather data belongs to any spatial location for further analysis based on various weather parameters. International Conference on Advanced Computing methodologies ICACM-2013, 02-03 Aug 2013, Elsevier Publications, Pg. No: 459-465, ISBN No: 978-93-35107-14- 95. [9] Dr.David, B.Stephenson, Data analysis methods in weather and climate research Department of Meteorology University of Reading, July,20, 2005, http://www.met.rdg.ac.uk /cag/courses REFERENCES [1] http://www.timeanddate.com/weather/forecast-accuracytime.html [2] http://en.wikipedia.org/wiki/weather_forecasting [3] Folorunsho Olaiya, Application of Data Mining Techniques in Weather Prediction and Climate Change Studies, I.J. Information Engineering and Electronic Business, 2012, 1, 51-59, Published Online February 2012 in MECS ( http://www.mecs-press.org/) DOI: 10.5815/ijieeb.2012.01.07 [4]Dr. M.H.Dunham, Companion slides for the text, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002. [5] Y.Radhika and M.Shashi, Atmospheric Temperature Prediction using Support Vector Machines International Journal of Computer Theory and Engineering, Vol. 1, No. 1, April 2009 1793-8201 [6] Dr.T. V. Rajini Kanth, Ananthoju Vijay Kumar, Estimation of the Influence of Fertilizer Nutrients Consumption on the Wheat Crop yield in India- a Data mining Approach, 30 Dec 2013, Volume 3, Issue 2, Pg.No: 316-320, ISSN: 2249-8958 (Online). [7] Dr.T. V. Rajini Kanth, Ananthoju Vijay Kumar, A Data Mining Approach for the Estimation of Climate Change on the Jowar Crop Yield in India, 25Dec2013,Volume 2 Issue 2, Pg.No:16-20, ISSN: 2319-6378 (Online). [8] A. Vijay Kumar, Dr. T. V. Rajini Kanth Estimation of the Influential Factors of rice yield in India 2nd www.ijrcct.org Page 748