Improved Closest Fit Techniques to Handle Missing Attribute Values


J. Comp. & Math. Sci. Vol.2 (2), (2011)

SANJAY GAUR and M S DULAWAT
Department of Mathematics and Statistics, Maharana Bhupal Campus, Mohanlal Sukhadia University, Udaipur, India
sanjay.since@gmai.com & dulawat_ms@rediffmail.com

ABSTRACT

Data preparation for data mining is a fundamental stage of data analysis. Preparing complete, high-quality, real-world data is a key prerequisite of successful data mining, whose aim is to discover something new from the facts already recorded in a database. Data with missing values complicates both the analysis and the application of a solution to new data. To overcome this, statistical techniques must be employed during data preparation. With the help of statistical methods, we can recover missing data and reduce ambiguity. In this paper, we introduce two sequential methods by which missing attribute values are replaced. A comparative study of the two methods is given; both are based on the moving average method for numerical variables of time series data.

Keywords: Missing Values, Attribute, Data preparation, Incompleteness, Moving average, Chronological.

MSC (2010) Subject Classification: 62-07, 62N02, 62Q

1. INTRODUCTION

Missing values in a database are one of the biggest problems faced in data analysis and in data mining applications. Missing values produce imbalanced databases, and their effects are reflected in the final results. Our prime goal is to obtain the final result in the consolidated form on which decisions are taken. In this study, two statistical methods are introduced and discussed which provide an approach to find a pattern to recover or generate missing values from a real imbalanced database with missing values. Therefore, the objective of this comparative study is to find the best-fitted method to recover missing values and to select completely filled records for further applications. This is based on bivariate analysis.

Statistical methods have gained importance in exploring estimation and prediction techniques. Buck [2] suggested a method of estimating missing values suitable for use with an electronic computer. Kim and Curry [8] considered the treatment of missing data in their analysis. Rubin [10] explored inference with missing data and multiple imputation for non-response in surveys. Allison [1] investigated the estimation of linear models with incomplete data. Smyth [11] and Zhang et al. [12] considered data preparation to be a fundamental stage of data analysis. Chen et al. [3] studied multiple imputation for missing ordinal data. Qin [9] considered semi-parametric optimization for missing data imputation. Gaur and Dulawat [4, 5] discussed various algorithms useful for estimating missing values and also gave a univariate analysis that uses the mean value in place of missing values for data preparation. Grzymala-Busse [7] proposed replacing every missing attribute value by all possible known values, and also provided the global closest fit and concept closest fit methods for missing attribute values. The objective of the proposed study is to determine the statistical technique which may be significant in the handling of missing attribute values.

2. FORMULATION OF PROBLEM

The proposed methods are based on replacing missing attribute values by values generated from a moving average. These methods are most useful for numerical attributes and fall under chronological (time series) analysis.
In general, these methods center on searching for a value that is close to the central tendency of the attribute and closest to the values just preceding and succeeding the missing value.

2.1 Average Fit Approach

This is one of the simplest approaches for generating a close-fit value for a missing value. We first read the complete attribute, including the missing value cases. The values of the attribute are divided into two sets, observed and missing. The search for missing cases in the attribute then starts. A missing value case is located by the subscript of the attribute, denoted by the variable $i$. After locating a missing value case, we record the preceding value ($x_p$) and the succeeding value ($x_s$) relative to the missing value subscript ($i$):

$$x_p = \mathrm{value}(x_{i-1}), \qquad x_s = \mathrm{value}(x_{i+1}),$$

where $x_p \neq \mathrm{NULL}$ and $x_s \neq \mathrm{NULL}$.

At the next stage, after recording the values just preceding and succeeding the missing value subscript, we compute the average of both values:

$$\bar{x} = \frac{x_p + x_s}{2} \qquad (2.1.3)$$

The average given by equation (2.1.3) is treated as the estimated value for the current missing value subscript:

$$x_{est} = \bar{x}$$

The value of $x_{est}$ is computed separately for every missing value subscript.

Algorithm (Average Fit Approach)

    Read X = {x_1, x_2, ..., x_n}        // Attribute with observed and missing values
      where X_obs = {observed values}    // Attribute values observed
            X_mis = {missing values}     // Attribute values missing
    For i = 1 to n do
        If (value(x_i) == NULL) then
            x_p = value(x_{i-1})         // Value preceding x_i
            x_s = value(x_{i+1})         // Value succeeding x_i
            x_bar = (x_p + x_s) / 2      // Average of preceding and succeeding
            x_est = x_bar                // Estimated value
            value(x_i) = x_est           // Assign estimated value to the missing value place
        End If
    End For
    Stop

2.2 Moving Average Fitting Approach

The moving average fitting method is based on the moving average concept. This approach is also most useful for numerical attributes. It searches for a close-fitting value that is close to the true mean of the attribute and to the values just preceding and succeeding the missing value, in association with the central tendency of the attribute. In the proposed method, we first determine the range of the moving average for the attribute. The proposed range is at least 10% of the dataset used. The preceding range is then half (50%) of the moving average range, and the same holds for the succeeding range. The search for missing cases in the attribute then starts. A missing value case is located by the subscript of the attribute, denoted by the variable $i$. After locating a missing value case, we record the preceding values ($x_{p1}, x_{p2}, \ldots, x_{pm}$) and the succeeding values ($x_{s1}, x_{s2}, \ldots, x_{sm}$) relative to the missing value subscript ($i$).
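The Average Fit Approach of Section 2.1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: missing values are represented by `None`, the function name `average_fit` is hypothetical, and the handling of series ends and of consecutive missing values (falling back to whichever neighbour is available) is an assumption, since the algorithm as stated requires both neighbours to be non-NULL.

```python
from typing import List, Optional


def average_fit(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing entry (None) with the mean of its two neighbours."""
    filled = list(values)  # work on a copy; the input list is left untouched
    for i, value in enumerate(filled):
        if value is not None:
            continue
        # Preceding and succeeding values of position i (None at the series ends).
        # The preceding value may itself be an earlier estimate, because the
        # series is filled sequentially, as in the pseudocode.
        x_p = filled[i - 1] if i > 0 else None
        x_s = filled[i + 1] if i + 1 < len(filled) else None
        # Assumption: if one neighbour is unavailable, use the other one alone.
        neighbours = [x for x in (x_p, x_s) if x is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Illustrative numbers only, not taken from the paper's tables.
series = [2506.0, None, 2562.0, 2586.0, None, 2707.0]
print(average_fit(series))
```

Working on a copy keeps the original attribute available for comparing means before and after filling.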

The preceding values are computed as $x_{p1} = \mathrm{value}(x_{i-1})$, $x_{p2} = \mathrm{value}(x_{i-2})$, ..., $x_{pm} = \mathrm{value}(x_{i-m})$, and the succeeding values as $x_{s1} = \mathrm{value}(x_{i+1})$, $x_{s2} = \mathrm{value}(x_{i+2})$, ..., $x_{sm} = \mathrm{value}(x_{i+m})$.

At the next stage, after recording the preceding and succeeding values, we calculate the average of the preceding values ($\bar{x}_p$) and of the succeeding values ($\bar{x}_s$):

$$\bar{x}_p = \frac{x_{p1} + x_{p2} + \cdots + x_{pm}}{m}, \qquad \bar{x}_s = \frac{x_{s1} + x_{s2} + \cdots + x_{sm}}{m}$$

The average of $\bar{x}_p$ and $\bar{x}_s$ is the moving average, that is, the estimated value for the missing data cell:

$$x_{est} = \bar{x} = \frac{\bar{x}_p + \bar{x}_s}{2}$$

The estimated value replaces the missing value:

$$x_i = x_{est}$$

The search for missing values continues until the last element of the attribute.

Algorithm (Moving Average Fitting Approach)

    Read X = {x_1, x_2, ..., x_n}            // Attribute with observed and missing values
      where X_obs = {observed values}        // Attribute values observed
            X_mis = {missing values}         // Attribute values missing
    x_t = int((count(X) * 10) / 100)         // Set the range of the moving average. Here it is 10% of the dataset.
    N = x_t % 2                              // Find the remainder
    If (N == 0) then
        m = x_t / 2
    else
        m = (x_t + 1) / 2
    For i = 1 to n do
        If (value(x_i) == NULL) then
            x_p1 = value(x_{i-1})            // Preceding value x_{i-1}
            x_p2 = value(x_{i-2})            // Preceding value x_{i-2}
            ...
            x_pm = value(x_{i-m})            // Preceding value x_{i-m}
            x_s1 = value(x_{i+1})            // Succeeding value x_{i+1}
            x_s2 = value(x_{i+2})            // Succeeding value x_{i+2}
            ...
            x_sm = value(x_{i+m})            // Succeeding value x_{i+m}
            x_bar_p = (x_p1 + x_p2 + ... + x_pm) / m
            x_bar_s = (x_s1 + x_s2 + ... + x_sm) / m
            x_bar = (x_bar_p + x_bar_s) / 2  // Average of preceding and succeeding
            x_est = x_bar                    // Estimated value
            value(x_i) = x_est               // Assign estimated value to the missing value place
        End If
    End For
    Stop

3. DISCUSSION OF RESULTS

Table-A, given in the appendix, shows the worldwide emission of carbon dioxide (CO2) from the consumption of Oil and Natural Gas respectively, for the years from 1960 onward. The mean emissions of carbon dioxide (CO2) due to Oil and Natural Gas are 2262 and 879 million tonnes respectively. Table-B shows the variables with observed and missing values. In a planned way, 20% of the values from Table-A were removed at random for each variable. The means calculated from the incomplete data sets are 2259 for Oil and 874 for Natural Gas. The mean values of the incomplete data sets of Table-B are thus slightly lower than the mean values of both variables in Table-A.

The proposed simple average fit method is applied to the data sets of Table-B to fill in the missing values. The values recovered or generated by this approach are shown in Table-C for both variables and are highlighted by underlining. The mean values obtained after replacing the missing values by the closest fit values in Table-C are quite close to the actual means given in Table-A. The proposed moving average approach gives results similar to those of the simple average fit method, again close to the original means given in Table-A.

4. CONCLUSION

It is well known that no technique for handling missing attribute values is 100% efficient. The proposed moving average fit methods are useful for numerical attributes with minor deviation from the mean.
These methods are appropriate for consolidated reports and are also well suited to fitting individual missing values. Here the estimated value follows the order of the preceding and succeeding values. Consequently, techniques for handling missing attribute values should be chosen individually, based on the nature and type of the data.
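The Moving Average Fitting Approach of Section 2.2 can be sketched in Python along the same lines. This is a hedged illustration under stated assumptions rather than the authors' code: the names `window_half_width` and `moving_average_fit` are hypothetical, missing values are `None`, at least one neighbour per side is used even for very short series, and neighbours that are missing or fall outside the series are skipped.

```python
from typing import List, Optional


def window_half_width(n: int, percent: int = 10) -> int:
    """Half-width m of the moving-average range, following the rule in Section 2.2."""
    x_t = (n * percent) // 100  # range of the moving average: 10% of the dataset by default
    m = x_t // 2 if x_t % 2 == 0 else (x_t + 1) // 2
    return max(m, 1)  # assumption: use at least one neighbour per side


def moving_average_fit(
    values: List[Optional[float]], percent: int = 10
) -> List[Optional[float]]:
    """Fill each missing entry with the mean of the preceding and succeeding averages."""
    n = len(values)
    m = window_half_width(n, percent)
    filled = list(values)  # work on a copy; the input list is left untouched
    for i in range(n):
        if filled[i] is not None:
            continue
        # Up to m values on each side; missing or out-of-range entries are skipped (assumption).
        preceding = [x for x in filled[max(0, i - m):i] if x is not None]
        succeeding = [x for x in filled[i + 1:i + 1 + m] if x is not None]
        side_means = [sum(side) / len(side) for side in (preceding, succeeding) if side]
        if side_means:
            filled[i] = sum(side_means) / len(side_means)
    return filled


# Illustrative numbers only, not taken from the paper's tables.
series: List[Optional[float]] = [float(i * i) for i in range(40)]
series[10] = None
print(moving_average_fit(series)[10])
```

With 40 observations the range is 4, so two values are averaged on each side of the gap. With a range of 2 or 3 the half-width is 1 and the estimate coincides with the simple average fit of Section 2.1.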

5. REFERENCES

1. Allison, P.D., Estimation of linear models with incomplete data, Sociological Methodology, San Francisco: Jossey Bass (1987).
2. Buck, S.F., A method of estimation of missing values in multivariate data suitable for use with an electronic computer, J. Royal Statistical Society, Series B, Vol-2 (1960).
3. Chen, L., Drane, M.T., Valois, R.F., and Drane, J.W., Multiple imputation for missing ordinal data, Journal of Modern Applied Statistical Methods, Vol.-4, No. 1 (2005).
4. Gaur, Sanjay and Dulawat, M.S., A perception of statistical inference in data mining, International Journal of Computer Science and Communication, Vol.-1, No. 2 (2010).
5. Gaur, Sanjay and Dulawat, M.S., Univariate Analysis for Data Preparation in context of Missing Values, Journal of Computer and Mathematical Sciences, Vol.-1, No. 5 (2010).
6. Gaur, Sanjay and Dulawat, M.S., A Closest Fit Approach to Missing Attribute Values in Data Mining, Communicated to publishing (2011).
7. Grzymala-Busse, J.W., Data with missing attribute values: Generalization of indiscernibility relation and rule induction, Transactions on Rough Sets, Lecture Notes in Computer Science Journal Subline, Springer-Verlag, Vol-1 (2004).
8. Kim, J.O., and Curry, J., The treatment of missing data in multivariate analysis, Sociological Methods and Research, Vol.-6 (1977).
9. Qin, Y.S., Semi-parametric optimization for missing data imputation, Applied Intelligence, Vol.-27, No. 1 (2007).
10. Rubin, D.B., Inference and missing data, Biometrika, 63 (1976).
11. Smyth, P., Data mining at the interface of computer science and statistics, Data mining for scientific and engineering applications, Department of Information and Computer Science, University of California, CA, Chapter-1 (2001).
12. Zhang, S., Zhang, C., and Yang, Q., Data preparation for data mining, Applied Artificial Intelligence, Vol.-17 (2003).
Appendix: Global Carbon Dioxide Emissions from Fossil Fuel Burning by Fuel Type

Table-A (Original Table), Table-B (Missing Values), Table-C (Simple Average), Table-D (Moving Average)

Columns in each table: Year, Oil (Million Tonnes), Natural Gas (Million Tonnes); the last row of each table gives the Average.

[Table values not recoverable from the source.]
