CHAPTER 4: PREDICTION AND ESTIMATION OF RAINFALL DURING NORTHEAST MONSOON

After analyzing the weather patterns and the seasonal rainfall forecast for the Cauvery delta region, this chapter attempts a method to predict dry and wet days and to estimate whether significant rainfall will occur on a rainy day. Apart from the features of synoptic systems, local topography and terrain influence the rainfall over a place. For this study, coastal and interior stations are the two major types of stations considered, because coastal places experience phenomena such as sea breeze and land breeze, which modify the local circulation and weather (Sivaramakrishnan and Prakash Rao 1989). Moisture is abundant in a coastal town, whereas interior stations inside the peninsula are mostly dry and their atmospheric stability conditions differ from those of the coast. Hence two sample stations, Chennai (Latitude 13° 11' N / Longitude 80° 11' E) and Karaikal (Latitude 10° 43' N / Longitude 79° 49' E), are taken to represent the coast. Trichy (Latitude 10° 46' N / Longitude 78° 43' E) and Coimbatore (Latitude 11° 43' N / Longitude 76° 49' E) are typical inland or interior stations. Another type of topography is the island. Rameswaram is the only island of Tamilnadu state, but there is no observatory there; hence Port Blair (Latitude 11° 40' N / Longitude 92° 43' E), which lies in the Bay of Bengal at about the latitude of Coimbatore, was selected for testing the models. The station locations are shown in Figure 4.1.

Figure 4.1. Station locations in peninsular India under study

4.1. MATERIALS AND METHODS

Many meteorological parameters are correlated in nature, as they jointly determine the atmospheric dynamics. Rainfall requires the presence of moisture. Atmospheric humidity is indicated by the dew point. Temperature drives evaporation, which adds moisture. Wind can mix air masses and so change the moisture content. Visibility depends on aerosols, which act as condensation nuclei on which the moisture condenses. Hence the parameters considered are temperature, dew point, wind speed and visibility. The extracted data set consists of the atmospheric situation prevailing 24 hours and 48 hours prior to the actual occurrence of rainfall.

4.1.1. DATA COLLECTION

The global summary of the surface daily data for the period 1961-2010 for the Northeast monsoon months (October-December) is collected from the National Climatic Data Centre, Asheville, USA at ncdc.noaa.gov. The sample stations are Chennai Airport (Latitude 13° 11' N / Longitude 80° 11' E) and Karaikal (Latitude 10° 43' N / Longitude 79° 49' E), two typical coastal observatories; Tiruchirappalli (Latitude 10° 46' N / Longitude 78° 43' E) and Coimbatore (Latitude 11° 43' N / Longitude 76° 49' E), sample interior locations; and Port Blair (Latitude 11° 40' N / Longitude 92° 43' E), an island station. These stations are taken for analysis and the data mining model is tested on them. The India Meteorological Department maintains these observatories and records the atmospheric parameters temperature, dew point, wind speed, visibility and precipitation (rainfall).

These are considered for analysis. The values of the parameters are the averages of the observations (measurements) taken at the eight synoptic hours (00, 03, 06, 09, 12, 15, 18 and 21 hours GMT). The following procedures, namely data cleaning, data selection, data transformation and data mining, are adopted for the analysis.

4.1.2. DATA CLEANING

Data preprocessing steps are applied to the new set of seasonal data, and the values are converted to nominal values by applying an unsupervised attribute discretization filter. After these operations are carried out, the input instances of the individual stations are presented for analysis. The discretization algorithm produces best-fit ranges for the five atmospheric conditions used in the analysis. Each input parameter is represented with 3 bins (low, medium and high), and the class label rainfall takes the values yes and no for rainfall prediction and high and low for rainfall estimation. Table [4.1] describes the nominal values of the atmospheric variables for the sample station Chennai.

Table 4.1. Nominal values of atmospheric parameters for Chennai station

Parameter | 24 hour advance: Low | Medium | High | 48 hour advance: Low | Medium | High
Mean temperature in Fahrenheit | <75.96 | 75.96-83.03 | >83.03 | <75.96 | 75.96-83.03 | >83.03
Dew point in Fahrenheit | <63.1 | 63.1-71.3 | >71.3 | <63.1 | 63.1-71.3 | >71.3
Visibility in miles | <5.53 | 5.53-9.76 | >9.76 | <4.46 | 4.46-7.63 | >7.63
Wind speed in knots | <7.33 | 7.33-14.66 | >14.66 | <7.33 | 7.33-14.66 | >14.66

Metric equivalents of the cut points: 75.96 °F = 24.4 °C and 83.03 °F = 28.3 °C (temperature); 63.1 °F = 17.2 °C and 71.3 °F = 21.8 °C (dew point); 5.53 miles = 8.8 km, 9.76 miles = 15.7 km, 4.46 miles = 7.1 km and 7.63 miles = 12.2 km (visibility).
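The three-bin discretization described above can be illustrated with a short sketch. It assumes equal-width binning, which the text does not state explicitly, and the wind speed range of 0 to 22 knots is a hypothetical range chosen only because it reproduces the cut points 7.33 and 14.66 of Table 4.1. The function names are illustrative and are not part of Weka.

```python
def equal_width_edges(lo, hi, n_bins=3):
    """Return the interior cut points that split [lo, hi] into n_bins equal-width bins."""
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def to_nominal(value, edges, labels=("low", "medium", "high")):
    """Map a numeric value to a nominal label.

    Each bin is closed on the right, matching interval names such as (63.1-71.3].
    """
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]


# Assumed (hypothetical) observed range of mean wind speed in knots.
wind_edges = equal_width_edges(0.0, 22.0)

# Cut points for dew point taken directly from Table 4.1 (degrees Fahrenheit).
dew_edges = [63.1, 71.3]

print([round(e, 2) for e in wind_edges])
print(to_nominal(10.0, wind_edges), to_nominal(72.0, dew_edges))
```

Passing the cut points of Table 4.1 directly, as done for the dew point, avoids any assumption about how the filter derived them.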

4.1.3. DATA SELECTION

At this stage, the data relevant to the analysis was decided on and retrieved from the extracted weather data set. The extracted weather data set had five weather attributes in addition to the date fields; their types and descriptions are presented in Table [4.2].

Table 4.2. Attributes of weather data set

Attribute | Type | Description
Year | Numerical | Year considered
Month | Numerical | Month considered
Day | Numerical | Day considered
Temperature | Numerical Real | Mean temperature for the day in degrees Fahrenheit to tenths. (Celsius to tenths for metric version.)
Dew point | Numerical Real | Mean dew point for the day in degrees Fahrenheit to tenths. (Celsius to tenths for metric version.)
Wind speed | Numerical Real | Mean wind speed for the day in knots to tenths. (Meters/second to tenths for metric version.)
Visibility | Numerical Real | Mean visibility for the day in miles to tenths. (Kilometers to tenths for metric version.)
Precipitation (rainfall) | Numerical Real | Total precipitation (rain) reported during the day in inches and hundredths. (For metric version, units = millimeters to tenths.)
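Since the data set pairs each rainfall observation with the atmospheric situation 24 or 48 hours earlier, the selection step can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the dictionary keys, the function name build_lagged_rows and the sample values are all hypothetical.

```python
from datetime import date, timedelta


def build_lagged_rows(daily, lead_days=1):
    """Pair each day's rainfall with the weather observed lead_days earlier.

    daily maps a date to a dict with the keys temp, dewp, wdsp, visib and prcp.
    Days whose predictor day is missing from the record are skipped.
    """
    rows = []
    for day in sorted(daily):
        prior = day - timedelta(days=lead_days)
        if prior not in daily:
            continue
        predictors = {k: daily[prior][k] for k in ("temp", "dewp", "wdsp", "visib")}
        rows.append({"date": day, **predictors, "prcp": daily[day]["prcp"]})
    return rows


# Made-up illustrative values, not taken from the thesis data.
sample = {
    date(2010, 10, 1): {"temp": 84.1, "dewp": 74.0, "wdsp": 5.2, "visib": 4.3, "prcp": 0.0},
    date(2010, 10, 2): {"temp": 81.5, "dewp": 75.2, "wdsp": 6.0, "visib": 3.9, "prcp": 0.4},
    date(2010, 10, 3): {"temp": 78.9, "dewp": 74.8, "wdsp": 8.1, "visib": 3.1, "prcp": 1.6},
}

rows_24h = build_lagged_rows(sample, lead_days=1)  # 24 hours prior
rows_48h = build_lagged_rows(sample, lead_days=2)  # 48 hours prior
```

Each resulting row carries the predictors of the earlier day and the precipitation of the target day, which is the form needed for the classification described later.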

4.1.4. DATA TRANSFORMATION

This stage is also known as data consolidation. It is the stage in which the selected data is transformed into forms appropriate for data mining. The data file was saved in comma-separated values (CSV) format and the datasets were normalized to reduce the effect of scaling on the data.

4.1.5. DATA MINING STAGE

The data mining stage was divided into three phases. At each phase all the algorithms were used to analyze the weather datasets. In the first phase, the precipitation patterns were generated and their relationships with the other weather parameters were found. In the second phase, the output weather forecasting variable was classified. In the third phase, the model was tested with the percentage split method, which trains on a percentage of the dataset, cross-validates on it and tests on the remaining percentage.

4.1.6. PSEUDO-CODE OF PROPOSED TECHNIQUE

In the proposed technique, the correlation coefficient of the sample input instances of the extracted monsoon weather data is obtained for the occurrence of rainfall, using the class labels yes and no for rainfall prediction and high and low for rainfall estimation. The adopted solution gives better accuracy than the auto regressive integrated moving average (ARIMA) numerical statistical model by

thorough investigation using the data mining tool Weka 3.6.6. The stratified classification can be enhanced by raising the prediction level and reducing the uncertainty of the classifier model. Thereafter, interesting patterns representing knowledge were identified. The novel algorithmic approach is described in Figure 4.2.

Figure 4.2. A novel algorithmic approach for NE monsoon wet/dry day prediction

Input: Weather DATASet, PredictionType // Prediction type may be 24 or 48 hours prior
Output: FSOL // Solution type may be 24 or 48 hours prior WET or DRY day
Method:
begin
    FSOL = { };
    DATATYPE = { }; // Defines the type of finding, such as rainfall prediction or estimation
    BESTRules = { }; // Defines the best rules which can be utilized to obtain climatic patterns
    // Select the required weather parameters such as Temp, Dewpoint, Visibility etc.
    RequiredParameters = GetParameterNAMES(DATASet);
    // MONSOON_NAME: NORTHEAST
    // MONSOON_MONTHS: OCTOBER to DECEMBER
    // ExtracteddataType: 24 hours PRIOR or 48 hours PRIOR
    ExtractedData = FORMAT_REQUEST(DATASet, FromMONTHYEARRange, ToMONTHYEARRange, RequiredParameters, MONSOON_NAME, MONSOON_MONTHS, ExtracteddataType);

    ExtDataSize = Findlength(ExtractedData);
    // Extract the required weather parameters
    for (ln_dataset = 0 to ExtDataSize - 1)
        MeanTemp = ExtractDataMeanTemp(ExtractedData(ln_dataset));
        DewPoint = ExtractDataDewpt(ExtractedData(ln_dataset));
        Wind_Speed = ExtractDataSpeed(ExtractedData(ln_dataset));
        Visibility = ExtractDataVisibility(ExtractedData(ln_dataset));
    end for
    LocalizedDATASet = Revise(MeanTemp, DewPoint, Wind_Speed, Visibility);
    if (isvalid(LocalizedDATASet))
        // Valid data set, no need for the noise removal process
        RevisedDATASet = LocalizedDATASet;
    else
        // Preprocess the data
        RevisedDATASet = RemoveNoiseData(LocalizedDATASet);
    end if
    // Transform the dataset
    RevisedDATASet = UpdateClassLabel(RevisedDATASet);
    BinsData = GenBins(RevisedDATASet); // low, medium, high
    // Generate nominal values
    OstensibleValues = GenNominalDataVal(RevisedDATASet, BinsData);
    for (ln_dataset = 0 to RevisedDATASet.Size - 1)
        DATATYPE = PredictionType; // RAINFALL_PREDICTION or RAINFALL_ESTIMATION
        if (DATATYPE == RAINFALL_PREDICTION)
            if (RAIN > 0)
                RAIN = YES; // RAIN class label data
            else
                RAIN = NO;
            end if
        end if
        if (DATATYPE == RAINFALL_ESTIMATION)
            if (RAIN > 1 inch, i.e. 2.54 cm)
                RAIN = HIGH; // RAIN class label data
            else
                RAIN = LOW;
            end if
        end if
        ConcludingWeatherDATASet += ClassifyDATASet(RevisedDATASet, RAIN);
    end for

    // Pattern generation process
    BESTRules = GEN_PATTERNS(ConcludingWeatherDATASet);
    Mean_Abs_Sqr_Err_Data = GenMeanSqr_Err(ConcludingWeatherDATASet);
    Root_Mean_Sqr_Err_Data = GenRootMeanSqr_Err(ConcludingWeatherDATASet);
    Rel_Abs_Sqr_ErrData = GenRelAbsSqr_Err(ConcludingWeatherDATASet);
    Root_Rel_Sqr_ErrData = GenRootRelSqr_Err(ConcludingWeatherDATASet);
    FSOL = Unite(Mean_Abs_Sqr_Err_Data, Root_Mean_Sqr_Err_Data, Rel_Abs_Sqr_ErrData, Root_Rel_Sqr_ErrData, BESTRules);
    // Solution type may be:
    //   prediction of wet / dry day
    //   estimate of heavy / moderate rainfall
    FSOL = GEN_SOLUTION(FSOL);
    return (FSOL);
end

4.2. EVALUATION METRICS

In selecting the appropriate algorithms and parameters to create the model for weather forecasting, the following performance metrics are used.

4.2.1. CORRELATION COEFFICIENT

This measures the statistical correlation between the predicted and actual values. The measure is unique in that it does not change when the values of the test cases are rescaled. A higher number means a better model, with 1 meaning a perfect statistical correlation and 0 meaning no correlation at all.
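The scale invariance mentioned above can be checked with a small sketch. It assumes the Pearson product-moment form of the correlation coefficient, which the text does not name explicitly; the rainfall values are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up actual and predicted rainfall amounts in inches.
actual = [0.0, 0.2, 1.5, 0.0, 3.1]
predicted = [0.1, 0.3, 1.2, 0.2, 2.6]

r_inches = pearson_r(actual, predicted)
# Converting both series to millimetres rescales the values but not the correlation.
r_mm = pearson_r([a * 25.4 for a in actual], [p * 25.4 for p in predicted])
```

Both calls return the same value up to floating-point rounding, which is the property that makes the coefficient independent of the measurement unit.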

4.2.2. CLASSIFIER ACCURACY MEASURES

Using training data to derive a classifier or predictor and then to estimate the accuracy of the resulting learned model can give misleading, overoptimistic estimates, because the learning algorithm overspecializes to the data. Instead, accuracy is better measured on a test set consisting of class-labeled tuples that were not used to train the model. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. In the pattern recognition literature, this is also referred to as the overall recognition rate of the classifier, which reflects how well the classifier recognizes tuples of the various classes. The error rate or misclassification rate of a classifier M is simply 1 - Acc(M), where Acc(M) is the accuracy of M. If the training set is used to estimate the error rate of a model, the resulting quantity is known as the resubstitution error. This error estimate is optimistic with respect to the true error rate because the model is not tested on any samples that it has not already seen.

The confusion matrix is a useful tool for analyzing how well a classifier can recognize tuples of different classes. Given m classes, a confusion matrix is a table of at least size m by m. An entry $CM_{i,j}$ in the first m rows and m columns indicates the number of tuples of class i that are labeled by the classifier as class j. For a classifier to have good accuracy, ideally most of the tuples would be represented along the diagonal of the confusion matrix, from entry

$CM_{1,1}$ to entry $CM_{m,m}$, with the rest of the entries being close to zero. The table may have additional rows or columns to provide totals or recognition rates per class. Given two classes, in terms of positive tuples such as precipitation_occurrence = yes versus negative tuples such as precipitation_occurrence = no, true positives are the positive tuples that are correctly labeled by the classifier, while true negatives are the negative tuples that are correctly labeled by the classifier. False positives are the negative tuples that are incorrectly labeled (tuples of class precipitation_occurrence = no for which the classifier predicted precipitation_occurrence = yes). Similarly, false negatives are the positive tuples that are incorrectly labeled (tuples of class precipitation_occurrence = yes for which the classifier predicted precipitation_occurrence = no). These terms are useful when analyzing a classifier's ability and are summarized in Table [4.3].

Table 4.3. A confusion matrix for positive and negative tuples

Actual class | Predicted class (Precipitation = YES) | Predicted class (Precipitation = NO)
(Precipitation = YES) | True positives | False negatives
(Precipitation = NO) | False positives | True negatives

Table [4.4] shows the confusion matrix for the 24 hour prior wet / dry day prediction during the Northeast monsoon for the sample stations. The percentage of correctly classified instances is reported as the correlation coefficient of the classifier. The detailed accuracy by class for the 24 hour prior wet / dry day prediction during the Northeast monsoon is shown in

Table [4.5] and for the 48 hour prior prediction in Table [4.6], which contain the true positive rate (TP Rate), false positive rate (FP Rate), precision, recall, F-measure and ROC area. Precision is a measure of exactness (what percentage of tuples labeled as positive are actually positive), whereas recall is a measure of completeness (what percentage of positive tuples are labeled as such). The F-measure is the harmonic mean of precision and recall; it gives equal weight to both. The area under the receiver operating characteristic (ROC) curve summarizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR). TPR is the sensitivity, and FPR equals 1 - specificity.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Table 4.4. Confusion matrix for 24 hr advance NE wet/dry day prediction

Station name | True positives | False negatives | False positives | True negatives | No. of instances | Correlation coefficient
Chennai | 1041 | 361 | 706 | 1436 | 3544 | 69.9 %
Karaikal | 684 | 350 | 227 | 977 | 2238 | 74.2 %
Trichy | 2175 | 18 | 1121 | 79 | 3393 | 66.4 %
Coimbatore | 366 | 473 | 306 | 1976 | 3121 | 75 %
Port Blair | 822 | 325 | 506 | 982 | 2635 | 68.5 %
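As a worked check, the measures defined above can be computed from the Chennai row of Table 4.4. The sketch below uses only the standard definitions; the function name binary_metrics is illustrative. The rounded results agree with the YES row for Chennai in Table 4.5 (TP rate 0.743, FP rate 0.33, precision 0.596, F-measure 0.661) and with the reported 69.9 % of correctly classified instances.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard two-class measures derived from confusion matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also the true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "fp_rate": fp / (fp + tn),
    }


# Chennai, 24 hr advance prediction (counts from Table 4.4).
chennai = binary_metrics(tp=1041, fn=361, fp=706, tn=1436)

for name, value in chennai.items():
    print(f"{name}: {value:.3f}")
```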

Table 4.5. Detailed accuracy by class for 24 hour advance NE rain occurrence

Station | No. of samples | Correlation coefficient (%) | Class | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area
Chennai | 3544 | 69.8928 | NO | 0.67 | 0.257 | 0.799 | 0.67 | 0.729 | 0.73
Chennai | | | YES | 0.743 | 0.33 | 0.596 | 0.743 | 0.661 | 0.73
Chennai | | | Wg. Avg. | 0.699 | 0.286 | 0.719 | 0.699 | 0.702 | 0.73
Karaikal | 2238 | 74.2262 | NO | 0.812 | 0.338 | 0.736 | 0.812 | 0.772 | 0.78
Karaikal | | | YES | 0.662 | 0.188 | 0.751 | 0.662 | 0.704 | 0.78
Karaikal | | | Wg. Avg. | 0.742 | 0.269 | 0.743 | 0.742 | 0.74 | 0.78
Trichy | 3393 | 66.4309 | NO | 0.066 | 0.008 | 0.814 | 0.066 | 0.122 | 0.689
Trichy | | | YES | 0.992 | 0.934 | 0.66 | 0.992 | 0.792 | 0.689
Trichy | | | Wg. Avg. | 0.664 | 0.607 | 0.715 | 0.664 | 0.555 | 0.689
Coimbatore | 3121 | 75.0401 | NO | 0.866 | 0.564 | 0.807 | 0.866 | 0.835 | 0.763
Coimbatore | | | YES | 0.436 | 0.134 | 0.545 | 0.436 | 0.484 | 0.763
Coimbatore | | | Wg. Avg. | 0.75 | 0.448 | 0.736 | 0.75 | 0.741 | 0.763
Port Blair | 2635 | 68.463 | NO | 0.66 | 0.283 | 0.751 | 0.66 | 0.703 | 0.716
Port Blair | | | YES | 0.717 | 0.34 | 0.619 | 0.717 | 0.664 | 0.716
Port Blair | | | Wg. Avg. | 0.685 | 0.308 | 0.694 | 0.685 | 0.686 | 0.716

Table 4.6. Detailed accuracy by class for 48 hour advance NE rain occurrence

Station | No. of samples | Correlation coefficient (%) | Class | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area
Chennai | 3545 | 65.134 | NO | 0.679 | 0.39 | 0.726 | 0.679 | 0.702 | 0.676
Chennai | | | YES | 0.61 | 0.321 | 0.554 | 0.61 | 0.581 | 0.676
Chennai | | | Wg. Avg. | 0.651 | 0.363 | 0.658 | 0.651 | 0.654 | 0.676
Karaikal | 2269 | 63.7288 | NO | 0.824 | 0.583 | 0.625 | 0.824 | 0.711 | 0.623
Karaikal | | | YES | 0.417 | 0.176 | 0.669 | 0.417 | 0.514 | 0.623
Karaikal | | | Wg. Avg. | 0.637 | 0.396 | 0.645 | 0.637 | 0.62 | 0.623
Trichy | 3394 | 64.4909 | NO | 0.617 | 0.316 | 0.717 | 0.617 | 0.663 | 0.669
Trichy | | | YES | 0.684 | 0.383 | 0.579 | 0.684 | 0.627 | 0.669
Trichy | | | Wg. Avg. | 0.646 | 0.345 | 0.657 | 0.646 | 0.648 | 0.669
Coimbatore | 3120 | 73.1176 | NO | 1 | 1 | 0.731 | 1 | 0.845 | 0.536
Coimbatore | | | YES | 0 | 0 | 0 | 0 | 0 | 0.536
Coimbatore | | | Wg. Avg. | 0.731 | 0.731 | 0.535 | 0.731 | 0.618 | 0.536
Port Blair | 2635 | 64.63 | NO | 0.617 | 0.316 | 0.717 | 0.617 | 0.663 | 0.669
Port Blair | | | YES | 0.684 | 0.383 | 0.579 | 0.684 | 0.627 | 0.669
Port Blair | | | Wg. Avg. | 0.646 | 0.345 | 0.657 | 0.646 | 0.648 | 0.669

4.2.3. PREDICTOR ERROR MEASURES

Let $D_T$ be a test set of the form $(X_1, y_1), (X_2, y_2), \ldots, (X_d, y_d)$, where the $X_i$ are the n-dimensional test tuples with associated known values $y_i$ for a response variable $y$, and $d$ is the number of tuples in $D_T$. Since predictors return a continuous value rather than a categorical label, it is difficult to say exactly whether the predicted value $y'_i$ for $X_i$ is correct. Rather than asking whether $y'_i$ is an exact match with $y_i$, loss functions measure the error between $y_i$ and the predicted value $y'_i$.

The most common loss functions are:

Absolute error: $|y_i - y'_i|$
Squared error: $(y_i - y'_i)^2$

Based on the above, the test error (rate) is the average loss over the test set. Thus, we get the following error rates:

Mean absolute error: $\frac{1}{d}\sum_{i=1}^{d} |y_i - y'_i|$
Mean squared error: $\frac{1}{d}\sum_{i=1}^{d} (y_i - y'_i)^2$

Relative measures of error include:

Relative absolute error: $\frac{\sum_{i=1}^{d} |y_i - y'_i|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$
Relative squared error: $\frac{\sum_{i=1}^{d} (y_i - y'_i)^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$

where $\bar{y}$ is the mean value of the $y_i$'s of the training data, that is, $\bar{y} = \frac{1}{t}\sum_{i=1}^{t} y_i$, with $t$ the number of training tuples. The square root of the relative squared error is taken to obtain the root relative squared error, so that the resulting error is of the same magnitude as the quantity predicted.

Here the 10-fold cross validation method (Stone 1974) is used to evaluate the accuracy of a classifier or predictor. The use of stratified 10-fold cross-validation for estimating classifier accuracy is recommended over the other methods, based on a theoretical and empirical study (Kohavi 1995). Testing instances are classified using the K star classification method (Cleary 1995), and the 10-fold cross validation method is used to validate the testing data set. The detailed statistical summary for the 24 hour prior wet/dry day prediction during the Northeast monsoon is shown in Table 4.7.

Table 4.7. Statistical summary for 24 hour advance NE rainfall prediction

Measures | Chennai | Karaikal | Trichy | Coimbatore | Port Blair
Total number of instances | 3544 | 2238 | 3393 | 3121 | 2635
Correctly classified instances | 2477 | 1661 | 2254 | 2342 | 1804
Incorrectly classified instances | 1067 | 577 | 1139 | 779 | 831
Correctly classified instances in % | 69.8928 | 74.2268 | 66.43 | 75.0401 | 68.463
Incorrectly classified instances in % | 30.1072 | 25.7732 | 33.57 | 24.9599 | 31.537
Kappa statistic | 0.3961 | 0.4772 | 0.0728 | 0.3224 | 0.3699
Mean absolute error | 0.4167 | 0.3879 | 0.4197 | 0.3059 | 0.4387
Root mean squared error | 0.4463 | 0.4306 | 0.4503 | 0.4126 | 0.4585
Relative absolute error in % | 87.1412 | 78.02 | 91.7907 | 77.799 | 89.2267
Root relative squared error in % | 91.2622 | 86.3724 | 94.1908 | 93.063 | 92.4874
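The error measures of Section 4.2.3 and the kappa statistic of Table 4.7 can be sketched as below. This is an illustration under stated assumptions, not Weka's implementation: the error functions follow the formulas for numeric predictions given above, the sample values are made up, and the kappa function uses the usual Cohen's kappa definition, which the text does not spell out. Applied to the Chennai counts of Table 4.4, it reproduces the value 0.3961 listed in Table 4.7.

```python
import math


def error_measures(actual, predicted, train_mean):
    """Mean absolute, root mean squared, relative absolute and root relative squared error."""
    d = len(actual)
    abs_err = [abs(y - p) for y, p in zip(actual, predicted)]
    sq_err = [(y - p) ** 2 for y, p in zip(actual, predicted)]
    # Baseline: always predicting the mean of the training data.
    base_abs = sum(abs(y - train_mean) for y in actual)
    base_sq = sum((y - train_mean) ** 2 for y in actual)
    return {
        "mae": sum(abs_err) / d,
        "rmse": math.sqrt(sum(sq_err) / d),
        "rae": sum(abs_err) / base_abs,
        "rrse": math.sqrt(sum(sq_err) / base_sq),
    }


def kappa(tp, fn, fp, tn):
    """Cohen's kappa: agreement beyond what the class totals alone would give."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return (observed - expected) / (1 - expected)


# Made-up values for illustration only.
actual = [1.0, 0.0, 1.0, 1.0]
predicted = [0.8, 0.3, 0.6, 0.9]
scores = error_measures(actual, predicted, train_mean=0.5)

# Chennai, 24 hr advance prediction (counts from Table 4.4).
chennai_kappa = kappa(tp=1041, fn=361, fp=706, tn=1436)
```

A relative error below 1 (100 %) means the predictor does better than the baseline that always predicts the training mean.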

4.3. RESULTS AND DISCUSSIONS

Predictive mining is the task of performing inference on the current data in order to make a prediction. Here the weather parameters temperature, dew point, visibility, wind speed and precipitation are taken for analysis using classification and association mining. The rule $A \Rightarrow B$ holds in the transaction set D with support s, where s is the percentage of transactions in D that contain $A \cup B$ (i.e., the union of sets A and B, or say, both A and B). This is taken to be the probability $P(A \cup B)$. The rule $A \Rightarrow B$ has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability $P(B \mid A)$.

$\mathrm{support}(A \Rightarrow B) = P(A \cup B)$
$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)}$

The predictive Apriori algorithm generates the best association rules on the weather dataset. Four weather parameters (temperature, dew point, wind speed, visibility) are used, each with three different values (low, medium, high). Of the rules generated, the best association rules are shown in the tables below with their support and confidence values. The behavioral aspects of the precipitation patterns for the occurrence of rainfall 24 hours prior and 48 hours prior have been generated by the machine learning tool Weka (Bouckaert et al 2010). For wet/dry day prediction, the generated weather patterns in the form of association rules for the 24 hour prior prediction for Chennai station are shown in

Table [4.8] and those for the 48 hour prior prediction in Table [4.9]. In Table [4.8] the association rules 1, 2, 6 and 9 are dry day weather patterns. These association rules describe the weather patterns of a non-rainy day, with the temperature, dew point, visibility and wind speed values falling in the low, medium or high range according to their nominal values. The association rules 3, 4, 5, 7 and 8 are wet day weather patterns. These association rules represent the weather patterns of a rainy day, with the values of the atmospheric parameters based on their nominal values. In Table [4.9] the association rules 1, 2, 6 and 9 are wet day weather patterns and the association rules 3, 4, 5, 7 and 8 are dry day weather patterns of the 48 hour advance prediction. The best rules are listed in the respective tables with their support and confidence values.

For rainfall prediction, the class label prcp (precipitation) describes the precipitation (rainfall) information and has two values: yes for the occurrence of rainfall (wet day) and no for the non-occurrence of rainfall (dry day). The third association rule in Table [4.8] describes a weather pattern for the occurrence of rainfall (wet day). It states that if the temperature is low, the dew point is high and the wind speed is low according to their nominal values, then the location would receive rainfall in the next 24 hours.

Table 4.8. Weather patterns for 24 hour advance NE rainfall prediction

S.No. | Association rule | Support | Confidence
1 | DEWP='(-inf-63.1]' VISIB='(-inf-5.533333]' WDSP='(-inf-7.333333]' ==> PRCP=no | 69 | 0.99454
2 | TEMP='(-inf-75.966667]' DEWP='(63.1-71.3]' VISIB='(5.533333-9.766667]' WDSP='(-inf-7.333333]' ==> PRCP=no | 12 | 0.97576
3 | TEMP='(-inf-75.966667]' DEWP='(71.3-inf)' WDSP='(-inf-7.333333]' ==> PRCP=yes | 129 | 0.8814
4 | DEWP='(71.3-inf)' WDSP='(14.666667-inf)' ==> PRCP=yes | 4 | 0.91683
5 | TEMP='(-inf-75.966667]' DEWP='(71.3-inf)' ==> PRCP=yes | 147 | 0.87656
6 | TEMP='(-inf-75.966667]' DEWP='(63.1-71.3]' VISIB='(-inf-5.533333]' WDSP='(-inf-7.333333]' ==> PRCP=no | 294 | 0.84071
7 | TEMP='(-inf-75.966667]' VISIB='(-inf-5.533333]' WDSP='(7.333333-14.666667]' ==> PRCP=yes | 36 | 0.77525
8 | TEMP='(75.966667-83.033333]' DEWP='(71.3-inf)' WDSP='(7.333333-14.666667]' ==> PRCP=yes | 26 37 | 0.70204
9 | TEMP='(75.966667-83.033333]' DEWP='(63.1-71.3]' WDSP='(7.333333-14.666667]' ==> PRCP=no | 38 | 0.63371
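The support and confidence definitions of Section 4.3 can be illustrated on a handful of nominal records. The records below are made up, and the helper names support_count and rule_stats are hypothetical; the sketch reports support as a fraction of all records, following the definition in the text, whereas the Support column of the tables holds counts.

```python
def support_count(records, items):
    """Number of records that contain every attribute=value pair in items."""
    return sum(all(rec.get(k) == v for k, v in items.items()) for rec in records)


def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent ==> consequent."""
    both = support_count(records, {**antecedent, **consequent})
    ante = support_count(records, antecedent)
    support = both / len(records)
    confidence = both / ante if ante else 0.0
    return support, confidence


# Made-up discretized records, for illustration only.
records = [
    {"TEMP": "low", "DEWP": "high", "WDSP": "low", "PRCP": "yes"},
    {"TEMP": "low", "DEWP": "high", "WDSP": "low", "PRCP": "yes"},
    {"TEMP": "low", "DEWP": "high", "WDSP": "low", "PRCP": "no"},
    {"TEMP": "medium", "DEWP": "low", "WDSP": "low", "PRCP": "no"},
    {"TEMP": "high", "DEWP": "medium", "WDSP": "medium", "PRCP": "no"},
]

# Same shape as rule 3 of Table 4.8: low temperature, high dew point, low wind speed.
support, confidence = rule_stats(
    records,
    antecedent={"TEMP": "low", "DEWP": "high", "WDSP": "low"},
    consequent={"PRCP": "yes"},
)
```

Here three records match the antecedent and two of them are wet days, so the rule has a support of 2/5 and a confidence of 2/3 on this toy data.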

Table 4.9. Weather patterns for 48 hour advance NE rainfall prediction

S.No. | Association rule | Support | Confidence
1 | DEWP='(63.1-71.3]' WDSP='(14.666667-inf)' ==> PRCP=yes | 3 | 0.82818
2 | TEMP='(-inf-75.966667]' DEWP='(71.3-inf)' VISIB='(-inf-4.466667]' WDSP='(-inf-7.333333]' ==> PRCP=yes | 125 | 0.76302
3 | TEMP='(-inf-75.966667]' VISIB='(4.466667-7.633333]' ==> PRCP=no | 63 | 0.90159
4 | TEMP='(75.966667-83.033333]' DEWP='(-inf-63.1]' WDSP='(-inf-7.333333]' ==> PRCP=no | 38 | 0.88434
5 | TEMP='(75.966667-83.033333]' DEWP='(-inf-63.1]' VISIB='(4.466667-7.633333]' WDSP='(-inf-7.333333]' ==> PRCP=no | 18 | 0.98344
6 | TEMP='(-inf-75.966667]' DEWP='(71.3-inf)' ==> PRCP=yes | 146 | 0.72907
7 | VISIB='(4.466667-7.633333]' WDSP='(7.333333-14.666667]' ==> PRCP=no | 37 | 0.7184
8 | TEMP='(-inf-75.966667]' VISIB='(7.633333-inf)' ==> PRCP=no | 5 | 0.69593
9 | TEMP='(75.966667-83.033333]' DEWP='(71.3-inf)' VISIB='(-inf-4.466667]' WDSP='(7.333333-14.666667]' ==> PRCP=yes | 28 | 0.66696

For rainfall estimation, the precipitation patterns are given in Table [4.10] for the 24 hour prior forecast and in Table [4.11] for the 48 hour prior forecast. The class label of the association rules for rainfall estimation is prcp, which describes the precipitation (rainfall) information and has two values. On consultation with meteorologists, it was learnt that rainfall above 25 mm generally affects normal activity on rainy days, so a day with less than 25 mm of rainfall is considered as a normal rainy day or

moderate rain. Hence the class label prcp for rainfall estimation takes the values high and low. The value high describes rainfall above 2.54 cm (1 inch), which is considered significant rainfall, and low describes rainfall below 2.54 cm (1 inch), which is considered moderate rainfall. In Table [4.10] the association rules 1, 2, 3 and 5 are moderate rain day weather patterns and the association rules 4, 6 and 7 are significant rain day weather patterns. In Table [4.11] the association rules 1, 2, 3 and 4 are moderate rain day weather patterns and the association rules 5 and 6 are significant rain day weather patterns.

Table 4.10. Weather patterns for 24 hour advance NE rainfall estimation

S.No. | Association rule | Support | Confidence
1 | TEMP='(75.1-81.3]' DEWP='(67.733333-73.266667]' VISIB='(4.7-8.1]' ==> PRCP=low | 13 | 0.96991
2 | TEMP='(81.3-inf)' DEWP='(73.266667-inf)' VISIB='(4.7-8.1]' ==> PRCP=low | 21 | 0.91262
3 | TEMP='(75.1-81.3]' DEWP='(67.733333-73.266667]' WDSP='(-inf-7.133333]' ==> PRCP=low | 293 | 0.88519
4 | TEMP='(-inf-75.1]' DEWP='(73.266667-inf)' WDSP='(7.133333-14.266667]' ==> PRCP=high | 3 | 0.88129
5 | TEMP='(75.1-81.3]' DEWP='(67.733333-73.266667]' WDSP='(7.133333-14.266667]' ==> PRCP=low | 26 | 0.72744
6 | TEMP='(-inf-75.1]' WDSP='(14.266667-inf)' ==> PRCP=high | 4 | 0.72681
7 | TEMP='(-inf-75.1]' DEWP='(67.733333-73.266667]' VISIB='(-inf-4.7]' WDSP='(7.133333-14.266667]' ==> PRCP=high | 14 | 0.71506
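The two class labelling schemes used in this chapter (yes/no for prediction, high/low for estimation) can be sketched as below. The sketch assumes daily precipitation is given in inches, as in Table 4.2, and assumes that a day with exactly 2.54 cm is labelled low; the text leaves that boundary case open. The function names are illustrative.

```python
INCH_TO_CM = 2.54


def prediction_label(prcp_inches):
    """Wet/dry class label: any measurable rainfall counts as a wet day."""
    return "yes" if prcp_inches > 0 else "no"


def estimation_label(prcp_inches, threshold_cm=2.54):
    """Significant/moderate class label for a rainy day.

    Assumption: rainfall strictly above the threshold is 'high'.
    """
    return "high" if prcp_inches * INCH_TO_CM > threshold_cm else "low"


for amount in (0.0, 0.02, 0.5, 1.5):
    print(amount, prediction_label(amount), estimation_label(amount))
```

Keeping the threshold as a parameter makes it easy to test other cut-offs, such as the 25 mm value mentioned by the meteorologists.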

Table 4.11. Weather patterns for 48 hour advance NE rainfall estimation

S.No. | Association rule | Support | Confidence
1 | TEMP='(-inf-75.2]' VISIB='(4.1-6.9]' WDSP='(-inf-7.466667]' ==> PRCP=low | 8 | 0.95424
2 | TEMP='(75.2-81.5]' DEWP='(-inf-64.433333]' VISIB='(4.1-6.9]' ==> PRCP=low | 3 | 0.88161
3 | TEMP='(75.2-81.5]' DEWP='(64.433333-71.966667]' VISIB='(-inf-4.1]' WDSP='(-inf-7.466667]' ==> PRCP=low | 154 | 0.84405
4 | TEMP='(81.5-inf)' DEWP='(71.966667-inf)' VISIB='(4.1-6.9]' WDSP='(-inf-7.466667]' ==> PRCP=low | 151 | 0.81755
5 | TEMP='(-inf-75.2]' DEWP='(71.966667-inf)' WDSP='(7.466667-14.733333]' ==> PRCP=high | 4 | 0.71084
6 | TEMP='(75.2-81.5]' DEWP='(71.966667-inf)' VISIB='(4.1-6.9]' WDSP='(7.466667-14.733333]' ==> PRCP=high | 4 | 0.57645

The success rate of wet day (rainy day) or dry day (non-rainy day) prediction during the Northeast monsoon for the 24 hour advance and 48 hour advance forecasts is taken from the correlation coefficients obtained with the cross validation method. The rates for the sample locations, Chennai and Karaikal (coastal stations), Tiruchirappalli and Coimbatore (interior stations) and Port Blair (island station), are shown in Table [4.12]. These success rates for the two prediction types (24 hours prior / 48 hours prior) are represented in Figure 4.3. The success rates for the individual tested years 2006-2010 are represented in Figure 4.4. The success rate of quantitative rainfall estimation during the NE monsoon for the 24 hours prior and 48 hours prior forecasts for the tested years (2006, 2007, 2008, 2009 and 2010), based on the training data set (1961-2005), is shown in Table [4.13]. These results are plotted in Figure 4.5 and Figure 4.6.

Table 4.12. Success rate of wet / dry day prediction on NE monsoon

S.No. | Station name | No. of samples | Prediction type | Correlation coefficient % | 2006 % | 2007 % | 2008 % | 2009 % | 2010 %
1 | Chennai | 3544 | 24 hr prior | 69.89 | 79.55 | 66.66 | 66.66 | 77.27 | 67.04
2 | Chennai | 3545 | 48 hr prior | 65.13 | 74.71 | 59.30 | 73.26 | 72.41 | 68.97
3 | Karaikal | 2238 | 24 hr prior | 74.23 | 72.1 | 63.4 | 70.4 | 76.92 | 66.66
4 | Karaikal | 2269 | 48 hr prior | 63.73 | 71.3 | 61.1 | 70.2 | 81.52 | 66.30
5 | Tiruchirappalli | 3393 | 24 hr prior | 66.4 | 56.04 | 69.23 | 72.53 | 63.33 | 46.51
6 | Tiruchirappalli | 3394 | 48 hr prior | 64.5 | 56.04 | 68.84 | 62.64 | 43.33 | 59.30
7 | Coimbatore | 3121 | 24 hr prior | 75.04 | 60 | 69.32 | 63.22 | 71.91 | 61.36
8 | Coimbatore | 3121 | 48 hr prior | 73.12 | 59.55 | 68.97 | 62.79 | 72.73 | 62.07
9 | Port Blair | 2635 | 24 hr prior | 68.46 | 62.92 | 73.56 | 71.11 | 60.23 | 76.17
10 | Port Blair | 2635 | 48 hr prior | 64.63 | 57.47 | 72.09 | 66.29 | 54.023 | 73.56

The columns 2006 to 2010 are the testing years.

Table 4.13. Success rate of rainfall estimation on NE monsoon (HIGH rainfall implies >=2.54 cm, LOW rainfall implies <2.54 cm)

S.No. | Station name | No. of samples | Prediction type | Correlation coefficient % | 2006 % | 2007 % | 2008 % | 2009 % | 2010 %
1 | Chennai | 1402 | 24 hr prior | 74.5992 | 72.7 | 83.3 | 70.2 | 63.1 | 84.7
2 | Chennai | 1404 | 48 hr prior | 74.4805 | 72.7 | 83.3 | 70.2 | 63.1 | 82.6
3 | Karaikal | 1042 | 24 hr prior | 72.9412 | 73.7 | 74.3 | 70.5 | 72 | 72.5
4 | Karaikal | 1042 | 48 hr prior | 70.00 | 72.7 | 74.9 | 69.8 | 62 | 70.5
5 | Tiruchirappalli | 1200 | 24 hr prior | 87.6482 | 92.5 | 83.8 | 83.3 | 83.3 | 88.0
6 | Tiruchirappalli | 1199 | 48 hr prior | 88.0317 | 92.5 | 83.8 | 77.4 | 83.3 | 88.0
7 | Coimbatore | 839 | 24 hr prior | 86.1111 | 86.1 | 81.4 | 93.7 | 88.4 | 85.2
8 | Coimbatore | 839 | 48 hr prior | 86.12 | 86.1 | 81.4 | 93.7 | 88.4 | 85.2
9 | Port Blair | 1117 | 24 hr prior | 84.8786 | 76.6 | 81.8 | 74.2 | 90.6 | 76.2
10 | Port Blair | 1117 | 48 hr prior | 84.8786 | 76.6 | 81.8 | 74.2 | 90.6 | 76.2

The columns 2006 to 2010 are the testing years.

Figure 4.3. Success rate for NE monsoon Wet/Dry day prediction (bar chart: success rate in % using the cross validation method, by station, for the 24 hr and 48 hr NE predictions)

Figure 4.4. Success rate of NE Wet / Dry day prediction on testing years (bar chart: success rate by station for the individual years 2006-2010 and their average)

Figure 4.5. Success rate of 24 hour advance RF estimation on NE monsoon (bar chart: success rate in % by station for the years 2006-2010)

Figure 4.6. Success rate of 48-hour advance RF estimation on NE monsoon (bar chart: success rate in % by station for the years 2006-2010)

When the success rates based on the correlation coefficients [Figure 4.3] of the 24 hour prior and 48 hour prior forecasts of rain occurrence during the NE monsoon months are compared across the different typical stations, there is not much difference between the values: the minimum and maximum success rates of the proposed 24 hour advance prediction are 66% and 75%, and the minimum and maximum success rates [Figure 4.7] of the proposed 48 hour advance prediction are 63% and 73%. Hence the method can also be taken as effective for 48 hour advance prediction.

Figure 4.7. Comparison of 24 hr / 48 hr advance RF prediction on NE monsoon (chart: success rate in % by prediction type, 24 Hr and 48 Hr, for each station)

The success rates of quantitative estimation for the 24 hour advance and 48 hour advance predictions are shown in [Figure 4.8]. The success rates are practically the same for both lead times and lie between 70% and 88%. Hence this model can be used for 48 hour advance rainfall estimation during the NE monsoon for coastal as well as interior stations.

Figure 4.8. Comparison of RF prediction success rate on NE monsoon (bar chart: success rate in % by station for the 24 Hr and 48 Hr estimations)

4.3.1. PERFORMANCE ANALYSIS

The success rate of the proposed data mining model (DMM) for Northeast monsoon rainfall prediction is compared with the existing synoptic model (SYM) and statistical model (STATM) (Box et al 2003) on the test years 2009 and 2010. The synoptic model

is followed by weather forecast services globally and is based on the synoptic situations. The time series statistical model (Box et al 2003) is prepared for forecasting by the National Centre for Medium Range Weather Forecasting, India, and is based on the auto regressive integrated moving average (ARIMA). The performance of the prediction results has been compared with the synoptic forecast model (Asnani 1993; Haltiner and Martin 1967) and the pure statistical model for the testing years 2009 and 2010. The results are shown in Table [4.14].

Table 4.14. Comparative analysis of NE monsoon rainfall performance

Station | Test year | Data mining model % | Synoptic model % | Statistical model %
Chennai | 2009 NE | 77 | 71 | 56
Chennai | 2010 NE | 68 | 59 | 51
Karaikal | 2009 NE | 79 | 77 | 51
Karaikal | 2010 NE | 67 | 57 | 58
Trichy | 2009 NE | 65 | 63 | 53
Trichy | 2010 NE | 60 | 54 | 49
Coimbatore | 2009 NE | 71 | 45 | 58
Coimbatore | 2010 NE | 61 | 46 | 52

Figure 4.9. Comparative analysis of NE rainfall forecast performance on 2009 (bar chart: success rate by station for DMM, SYM and STATM)

Figure 4.10. Comparative analysis of NE rainfall forecast performance on 2010 (bar chart: success rate by station for DMM, SYM and STATM)

Figure 4.9 and Figure 4.10 present the comparative analysis of Northeast monsoon rainfall forecast performance for the testing years 2009 and 2010. The results show that, for every station and test year compared, the data mining model for extracting the weather patterns of rainfall occurrence and classifying wet / dry days during the Northeast monsoon achieves a higher success rate than the existing synoptic and statistical models.
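The comparison can be checked arithmetically from the values of Table 4.14. The short sketch below copies those success rates and computes, for each station and year, the margin of the data mining model over the better of the two existing models; the variable names are illustrative.

```python
# (station, year) -> (DMM, SYM, STATM) success rates in %, copied from Table 4.14.
table_4_14 = {
    ("Chennai", 2009): (77, 71, 56),
    ("Chennai", 2010): (68, 59, 51),
    ("Karaikal", 2009): (79, 77, 51),
    ("Karaikal", 2010): (67, 57, 58),
    ("Trichy", 2009): (65, 63, 53),
    ("Trichy", 2010): (60, 54, 49),
    ("Coimbatore", 2009): (71, 45, 58),
    ("Coimbatore", 2010): (61, 46, 52),
}

# Margin of the data mining model over the better of the two existing models.
margins = {
    key: dmm - max(sym, statm) for key, (dmm, sym, statm) in table_4_14.items()
}
dmm_always_best = all(margin > 0 for margin in margins.values())

for (station, year), margin in margins.items():
    print(f"{station} {year}: +{margin} percentage points")
```

On these figures the margin is positive in all eight cases, ranging from 2 to 13 percentage points.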