International Mathematical Forum, Vol. 8, 2013, no. 4, 53-65

Handling Missing Data on Asymmetric Distribution

Ahmad M. H. Al-Khazaleh
Department of Mathematics, Faculty of Science
Al-albayt University, Al-Mafraq-Jordan
amed_005k@yahoo.com

Abstract

The problem of imputing missing observations arises in many areas. Data often contain missing observations due to factors such as machine failures and human error. An incomplete dataset usually causes bias due to differences between observed and unobserved data. This paper proposes using the Neyman allocation method to estimate an asymmetric winsorized mean for handling missing observations when the data follow the exponential distribution. Different values of the exponential distribution parameter were used for illustration. Data were generated from the exponential distribution to compare the performance of the proposed Neyman allocation method with that of other methods: regression trend, average of the whole data, naive forecast, and average bound of the holes. The goodness-of-fit criteria used were the mean absolute error (MAE) and the mean squared error (MSE). The proposed method gave the best fit in the sense of having smaller error, in particular for a large percentage of missing observations.

Keywords: Neyman allocation, missing data, single imputation methods, winsorized mean

1. INTRODUCTION

The presence of missing observations in statistical survey data is an important issue to deal with (Little and Rubin, 1987). Besides missing observations, time series data may be contaminated by outliers or may be heterogeneous. Incomplete datasets may lead to results that differ from those that would have been obtained from a complete dataset (Hawthorne and Elliot, 2005).
This paper proposes a method of substituting these missing observations by using the asymmetric winsorized mean. In this method, we prepare the whole data by dividing it into groups, using stratified sampling to determine the boundaries among the strata. Stratification is the process of grouping members of the population into relatively homogeneous groups before sampling. The strata should be mutually exclusive: every element in the population must be assigned to only one stratum (Cyert and Davidson, 1962). One of the main objectives of stratified sampling is to reduce the variance of the estimator and to obtain more statistical precision than with simple random sampling (Cochran, 1977; Hanif, 2000; Hess et al., 1966; Sheikh and Ahmad, 2001). In this paper, the Neyman allocation method is used to determine the boundaries of the groups.

The missing observations problem is an old one in data analysis. The waste of data that can result from casewise deletion of missing values makes it necessary to propose alternative approaches. A problem frequently encountered in data collection is that observations are missing or virtually impossible to obtain, because of either time or cost constraints. To replace those observations, several options are available to researchers (Pankratz, 1983; Patricia, 1994). First, replace the missing observations with the mean of the series. Second, replace the missing observations with the naive forecast. Third, replace the missing observations with a simple trend forecast. Finally, replace the missing observations with the average of the last two known observations that bound the missing observations.

2. NEYMAN ALLOCATION

Neyman allocation is a sample allocation method that may be used with stratified samples. We used Neyman allocation to determine the stratum boundaries; when the population is highly skewed, Neyman allocation should be used (Cochran, 1977; Ahmad Mahir et al., 2007; Samiuddin et al., 1998).
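The suffix $h$ denotes the stratum and $i$ the unit within the stratum. When a population of $N$ units is stratified into $L$ strata and the samples from each stratum are selected with simple random sampling, an unbiased estimate of the population mean for the estimation variable $x$ is $\bar{x}_{st}$ (st for stratified), where

\[ \bar{x}_{st} = \frac{\sum_{h=1}^{L} N_h \bar{x}_h}{N} = \sum_{h=1}^{L} W_h \bar{x}_h \tag{2.1} \]

Here $N_h$ is the number of units in stratum $h$, $\bar{x}_h$ is the sample mean in stratum $h$, and $W_h = N_h / N$ is the stratum weight. For stratified random sampling, the variance of the estimate $\bar{x}_{st}$ is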
\[ V(\bar{x}_{st}) = \frac{1}{N^2} \sum_{h=1}^{L} N_h^2 \frac{S_h^2}{n_h} = \sum_{h=1}^{L} \frac{W_h^2 S_h^2}{n_h} \tag{2.2} \]

Under Neyman allocation, the variance reduces to

\[ V_{Ney}(\bar{x}_{st}) = \frac{1}{n} \left( \sum_{h=1}^{L} W_h S_h \right)^2 \tag{2.3} \]

Let $x_0$ and $x_L$ be the smallest and largest values of $x$ in the population. The problem is to find the intermediate stratum boundaries $y_1, y_2, \ldots, y_{L-1}$ that minimize $V_{Ney}(\bar{x}_{st})$. The minimum is obtained by differentiating with respect to each boundary $y_h$, since $y_h$ appears in the sum only in the terms $W_h S_h$ and $W_{h+1} S_{h+1}$ (Cochran, 1977). Hence we have the formula for finding the optimum stratum boundaries:

\[ \frac{(y_h - \mu_h)^2 + S_h^2}{S_h} = \frac{(y_h - \mu_{h+1})^2 + S_{h+1}^2}{S_{h+1}}, \quad h = 1, 2, 3, \ldots, L-1 \tag{2.6} \]

Suppose that the distribution of $x$ is continuous with density function $f(x)$, $a < x < b$. To form $L$ strata (groups), the range of $x$ is cut at the points $y_1 < y_2 < y_3 < \ldots < y_{L-1}$. The relative frequency $W_h$, the population mean $\mu_h$ and the population variance $S_h^2$ of the $h$-th stratum are given by

\[ W_h = \int_{y_{h-1}}^{y_h} f(x)\,dx \tag{2.7} \]

\[ \mu_h = \frac{1}{W_h} \int_{y_{h-1}}^{y_h} x f(x)\,dx \tag{2.8} \]

\[ S_h^2 = \frac{1}{W_h} \int_{y_{h-1}}^{y_h} x^2 f(x)\,dx - \mu_h^2 \tag{2.9} \]

These equations are difficult to solve, since $\mu_h$ and $S_h$ depend on $y_h$. We therefore solve them iteratively using a C++ computer program (Ahmad Mahir et al., 2007). We divide the whole data into groups by putting $(y_0, y_1)$ into group one, where $y_0 = x_0$, $(y_1, y_2)$ into group two, and so on.

3. WINSORIZED MEAN

The winsorized mean is a statistical measure of central tendency, like the mean and the median, and is even more similar to the truncated mean (Mingxin and
Yijun, 2009; Barnett and Lewis, 1994). The winsorized mean eliminates the outliers at both ends of an ordered set of observations. Unlike the trimmed mean, the winsorized mean replaces the outliers with observed values rather than discarding them. It is the mean calculated after replacing given parts of a distribution at the high and low ends with the most extreme remaining values. The winsorized mean is defined by

\[ \bar{x}_w = \frac{1}{n} \left[ (r+1)\,x_{(r+1)} + \sum_{i=r+2}^{n-s-1} x_{(i)} + (s+1)\,x_{(n-s)} \right] \tag{3.1} \]

which means that the winsorized mean is the average of the observations in which the $r$ smallest values are replaced by the $(r+1)$-th smallest value $x_{(r+1)}$, and the $s$ largest values are replaced by the $(s+1)$-th largest value $x_{(n-s)}$ (Barnett and Lewis, 1994).

First, we determine a suitable number of groups (there must be more than two) for the whole data by using Equation (2.6), which gives the boundaries of the first and the last groups. We then count the number of observations inside these two groups: the number of observations in the first group is $r$ and the number of observations in the last group is $s$. The asymmetric winsorized mean can then be calculated using Equation (3.1).

The data were generated from the exponential distribution. The whole data were then ordered from smallest to largest; the smallest value is zero and the largest value is 7.. The data were assumed to have 5%, 10%, and 20% randomly missing observations in this study.

4. THE BOUNDARIES OF GROUPS IN EXPONENTIAL DISTRIBUTION

Suppose that the time series $x_1, x_2, \ldots, x_n$ of $n$ observations comes from an exponential distribution with probability density function

\[ f(x) = \lambda e^{-\lambda x}, \quad \lambda > 0,\ x > 0 \tag{4.1} \]

where $\lambda$ is the parameter of the exponential distribution. To form $L$ strata (groups), the domain of $f(x)$ was truncated from $(0, \infty)$ to $(0, b]$, where $b$ is the largest value of $x$.
From Equations (2.7), (2.8) and (2.9), the relative frequency $W_h$, mean $\mu_h$ and variance $S_h^2$ of the $h$-th stratum for the exponential distribution are

\[ W_h = e^{-\lambda y_{h-1}} - e^{-\lambda y_h} \tag{4.2} \]
\[ \mu_h = \frac{\left( y_{h-1} + \frac{1}{\lambda} \right) e^{-\lambda y_{h-1}} - \left( y_h + \frac{1}{\lambda} \right) e^{-\lambda y_h}}{e^{-\lambda y_{h-1}} - e^{-\lambda y_h}} \tag{4.3} \]

\[ S_h^2 = \frac{\left( y_{h-1}^2 + \frac{2 y_{h-1}}{\lambda} + \frac{2}{\lambda^2} \right) e^{-\lambda y_{h-1}} - \left( y_h^2 + \frac{2 y_h}{\lambda} + \frac{2}{\lambda^2} \right) e^{-\lambda y_h}}{e^{-\lambda y_{h-1}} - e^{-\lambda y_h}} - \mu_h^2 \tag{4.4} \]

Table 1 Stratum boundaries for three to eight groups in the case of the exponential distribution when λ = 0.2

Boundaries   3G       4G       5G       6G       7G       8G
y_1          3.95     .306     .8005    .4783    .539     .0887
y_2          7.9365   5.303    3.9989   3.4      .688     .307
y_3                   9.699    6.834    5.365    4.3633   3.7033
y_4                            0.783    7.984    6.37757  5.39
y_5                                     .6395    8.904    7.547
y_6                                              .98      9.654
y_7                                                       .83

Table 2 Stratum boundaries for three to eight groups in the case of the exponential distribution when λ = 0.4

Boundaries   3G       4G       5G       6G       7G       8G
y_1          .884     .3550    .0594    0.8699   0.7379   0.6407
y_2          4.9603   3.33     .4064    .94      .604     .3759
y_3                   6.63     4.593    3.66     .659     .383
y_4                            7.567    5.0994   3.986    3.84
y_5                                     8.0536   5.808    4.60
y_6                                              8.730    6.4054
y_7                                                       9.74

Table 3 Stratum boundaries for three to eight groups in the case of the exponential distribution when λ = 0.6

Boundaries   3G       4G       5G       6G       7G       8G
y_1          .73      0.969    0.776    0.5898   0.5007   0.435
y_2          3.377    .886     .634     .307     .0903    0.9356
y_3                   4.87     .905     .33      .8074    .55
y_4                            5.008    3.4935   .73      .49
y_5                                     5.5878   3.993    3.570
y_6                                              6.084    4.453
y_7                                                       6.543
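The closed forms in Equations (4.2) to (4.4) and the boundary condition in Equation (2.6) can be illustrated with a short Python sketch. This is not the C++ program used to produce the tables. The function names, the equal-width starting boundaries and the coordinate-wise ternary search are assumptions made for illustration; the search minimizes the sum of W_h S_h directly (the quantity whose stationarity condition is Equation (2.6)) and stands in for whatever iteration the original program applied.

```python
import math

def stratum_moments(lam, a, c):
    """Return (W, mu, S2) of the stratum (a, c] for an exponential density."""
    ea, ec = math.exp(-lam * a), math.exp(-lam * c)
    w = ea - ec                                           # Eq. (4.2)
    mu = ((a + 1 / lam) * ea - (c + 1 / lam) * ec) / w    # Eq. (4.3)
    m2 = ((a * a + 2 * a / lam + 2 / lam ** 2) * ea
          - (c * c + 2 * c / lam + 2 / lam ** 2) * ec) / w
    return w, mu, max(m2 - mu * mu, 0.0)                  # Eq. (4.4), clamped at 0

def objective(lam, bounds):
    """Sum of W_h * S_h over the strata; bounds = [y_0, y_1, ..., y_L]."""
    total = 0.0
    for a, c in zip(bounds, bounds[1:]):
        w, _, s2 = stratum_moments(lam, a, c)
        total += w * math.sqrt(s2)
    return total

def residuals(lam, bounds):
    """Left side minus right side of Eq. (2.6) at each interior boundary."""
    out = []
    for h in range(1, len(bounds) - 1):
        _, mu1, v1 = stratum_moments(lam, bounds[h - 1], bounds[h])
        _, mu2, v2 = stratum_moments(lam, bounds[h], bounds[h + 1])
        y = bounds[h]
        out.append(((y - mu1) ** 2 + v1) / math.sqrt(v1)
                   - ((y - mu2) ** 2 + v2) / math.sqrt(v2))
    return out

def optimise_boundaries(lam, b, n_strata, sweeps=200):
    """Coordinate-wise search for boundaries on (0, b]; an assumed scheme."""
    bounds = [b * h / n_strata for h in range(n_strata + 1)]  # equal-width start
    for _sweep in range(sweeps):
        for h in range(1, n_strata):
            lo, hi = bounds[h - 1], bounds[h + 1]
            gap = (hi - lo) * 1e-6          # stay strictly inside the neighbours
            lo, hi = lo + gap, hi - gap

            def f(y):
                return objective(lam, bounds[:h] + [y] + bounds[h + 1:])

            for _step in range(60):         # ternary search along one coordinate
                m1 = lo + (hi - lo) / 3
                m2 = hi - (hi - lo) / 3
                if f(m1) < f(m2):
                    hi = m2
                else:
                    lo = m1
            candidate = (lo + hi) / 2
            if f(candidate) < f(bounds[h]):  # accept only if the objective improves
                bounds[h] = candidate
    return bounds
```

For example, `optimise_boundaries(0.4, 17.0, 3)` returns the list `[y_0, y_1, y_2, y_3]` for a hypothetical upper limit b = 17.0, and `residuals` applied to that list indicates how closely Equation (2.6) is satisfied at each interior boundary.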
The number of groups (strata) depends on the size of the population. If we want to divide the whole data into three groups, we need to calculate two boundaries, $y_1$ and $y_2$, since $x_0 = y_0 = 0$ and $y_3 = 7.$ is the largest value. The boundary $y_1$ depends on $\mu_1$, $\mu_2$, $S_1$ and $S_2$; likewise $y_2$ depends on $\mu_2$, $\mu_3$, $S_2$ and $S_3$. We calculated these values by applying Equation (2.6) together with Equations (2.8) and (2.9), taking the parameter of the exponential distribution to be λ = 0.2, 0.4, and 0.6. If we divide the whole data into four groups, we need to calculate three boundaries, $y_1$, $y_2$ and $y_3$. The boundaries of the strata for five, six, seven and eight groups are found in the same way by using the C++ program. Tables 1, 2, and 3 show the boundaries for all groups for λ = 0.2, 0.4, and 0.6, respectively.

5. SUMMARY AND DISCUSSION

The two parameters $(r, s)$ of the asymmetric winsorized mean were determined as follows when the whole data were divided into three to eight groups. We assumed that 5%, 10%, or 20% of the observations, selected at random, were missing. For the first case, with 5% missing observations, the asymmetric winsorized mean was calculated as follows. Suppose the whole data with 5% missing observations and λ = 0.2 are divided into three groups. The first group has the interval (0, 3.95), the second group has the interval (3.95, 7.9365) and the third group has the interval (7.9365, 7.) (see Table 1). Since we are interested in the first and the last groups, we count the number of observations in both. The number of observations in the first group was 9 and the number of observations in the third group was 55, as shown in Table 5a. Finally, the asymmetric winsorized mean, calculated by applying Equation (3.1) with r = 9 and s = 55, is equal to 5.0037.
The same procedure was followed to compute the asymmetric winsorized mean for the other groupings and parameter values with 5% missing observations, as shown in Tables 5b and 5c.
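The counting step and Equation (3.1) can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: the function names, the convention that an observation lying exactly on the first boundary belongs to the first group, and the small data set in the usage note are all assumptions.

```python
def count_tails(values, first_boundary, last_boundary):
    """Count r (observations in the first group) and s (in the last group).

    Assumed convention: the first group is x <= y_1 and the last group is
    x > y_{L-1}.
    """
    r = sum(1 for v in values if v <= first_boundary)
    s = sum(1 for v in values if v > last_boundary)
    return r, s

def asymmetric_winsorized_mean(values, r, s):
    """Equation (3.1): the r smallest values are replaced by x_(r+1) and the
    s largest by x_(n-s) before averaging."""
    x = sorted(values)
    n = len(x)
    if r < 0 or s < 0 or r + s > n - 2:
        raise ValueError("need r >= 0, s >= 0 and r + s <= n - 2")
    low = x[r]                   # x_(r+1) in the 1-based notation of Eq. (3.1)
    high = x[n - s - 1]          # x_(n-s)
    middle = x[r + 1:n - s - 1]  # x_(r+2), ..., x_(n-s-1)
    return ((r + 1) * low + sum(middle) + (s + 1) * high) / n
```

As a usage example with hypothetical observed values `[0.1, 0.5, 1.0, 2.0, 3.0, 4.0, 9.0, 20.0]` and hypothetical boundaries 0.75 and 5.0, `count_tails` gives r = 2 and s = 2, and the winsorized data become 1, 1, 1, 2, 3, 4, 4, 4, whose mean is 2.5. In the setting of this paper the values passed in would be the observed (non-missing) data, and the resulting mean would be the imputed value.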
Table 5a Values of the two parameters and the asymmetric winsorized mean when λ = 0.2 and 5% of the observations are missing, for three to eight groups
r        9 65 46 3 5
s        55 5 3 9 8
W(r, s)  5.0037 4.977 4.9573 4.9303 4.948 4.949

Moreover, for the data with 10% missing observations, when the whole data were divided into three groups with λ = 0.2, the number of observations in the first group was 9, which was the value of r, and the number of observations in the last group was 50, which was the value of s; the asymmetric winsorized mean was then equal to 4.9483. The asymmetric winsorized mean was calculated in the same way for the other groupings. Tables 6a, 6b, and 6c show the numbers of observations in the first and last groups for the different values of the exponential distribution parameter, together with the values of the asymmetric winsorized mean.

Table 5b Values of the two parameters and the asymmetric winsorized mean when λ = 0.4 and 5% of the observations are missing, for three to eight groups
r        48 8 7 3
s        70 09 76 5 4 3
W(r, s)  3.933 4.343 4.5547 4.675 4.749 4.803

Table 5c Values of the two parameters and the asymmetric winsorized mean when λ = 0.6 and 5% of the observations are missing, for three to eight groups
r        6 8 9
s        67 08 69 39 6 00
W(r, s)  .93 3.530 3.866 4.085 4.45 4.3604

Table 6a Values of the two parameters and the asymmetric winsorized mean when λ = 0.2 and 10% of the observations are missing, for three to eight groups
r        9 67 48 3 4 0
s        50 0 8 7
W(r, s)  4.9483 4.8934 4.8633 4.839 4.849 4.8555
Table 6b Values of the two parameters and the asymmetric winsorized mean when λ = 0.4 and 10% of the observations are missing, for three to eight groups
r        49 7 0 6 3
s        56 00 69 47 38 9
W(r, s)  3.885 4.7 4.4784 4.594 4.6656 4.753

Table 6c Values of the two parameters and the asymmetric winsorized mean when λ = 0.6 and 10% of the observations are missing, for three to eight groups
r        5 7 0 9
s        46 9 55 8 06 9
W(r, s)  .9366 3.4768 3.8089 4.070 4.78 4.903

Finally, for 20% missing observations, the data were divided into three groups. The number of observations in the first group was 03, which was the value of the parameter r, and the number of observations in the third group was 5, which was the value of the parameter s; the asymmetric winsorized mean was found to be 5.0346, as shown in Table 7a. The asymmetric winsorized means for the other groupings with 20% missing observations were obtained similarly, as shown in Tables 7b and 7c.

Table 7a Values of the two parameters and the asymmetric winsorized mean when λ = 0.2 and 20% of the observations are missing, for three to eight groups
r        03 59 4 9 3 9
s        5 5 3 9 8
W(r, s)  5.0346 5.0074 4.9975 4.969 4.989 4.9948

Table 7b Values of the two parameters and the asymmetric winsorized mean when λ = 0.4 and 20% of the observations are missing, for three to eight groups
r        43 5 9 5
s        45 96 68 49 40 3
W(r, s)  3.9075 4.36 4.5480 4.6763 4.7635 4.85
Table 7c Values of the two parameters and the asymmetric winsorized mean when λ = 0.6 and 20% of the observations are missing, for three to eight groups
r        3 7 0 9 8
s        4 74 44 0 89
W(r, s)  .9646 3.4945 3.834 4.0508 4.7 4.345

We compared the results of the asymmetric winsorized mean method with other methods for estimating missing observations. The first method replaces the missing observations with the mean of the series, which can be calculated over the entire range of the sample. The second method replaces the missing observations with the naive forecast. The naive model is the simplest form of a univariate forecast model: it uses the current time period's value for the next time period, that is, $\hat{X}_{t+1} = X_t$. The third method replaces the missing observations with a simple trend forecast. This is accomplished by estimating a regression equation of the form $X_t = a + bt$ (where $t$ is time) for the periods prior to the missing value, and then using the equation to fit the missing time periods. Finally, we can replace the missing observations with the average of the last two known observations that bound the missing observations (Patricia, 1994).

The accuracy of estimating the missing observations with the asymmetric winsorized mean depends on how close the estimated values are to the actual values. We define the difference between the actual and the estimated values as the error. If the estimation is doing a good job, the error will be relatively small; the error for each time period is then purely random fluctuation around the original value, so it should be equal or close to 0. We tested the asymmetric winsorized mean approach for estimating missing data points in time series data sets against the other methods. In this paper the following statistical measures were used to assess accuracy (Patricia, 1994).

- The mean absolute error

\[ MAE = \frac{1}{n} \sum_{t=1}^{n} |e_t| \tag{5.1} \]

- The mean square error

\[ MSE = \frac{1}{n} \sum_{t=1}^{n} e_t^2 \tag{5.2} \]
where $n$ is the number of imputations and $e_t$ is the difference between the actual and the estimated values. To evaluate the accuracy of the estimation procedure for missing observations, we used the mean absolute error (MAE) of Equation (5.1) and the mean square error (MSE) of Equation (5.2). Tables 8a to 9c show the MAE and MSE for the estimation of the missing observations by the proposed method, and Tables 10 and 11 show them for the other methods.

The MAE values are shown in Tables 8a to 8c for the proposed methods and in Table 10 for the other methods. For 5% missing observations, the simple average method does better than the asymmetric winsorized methods when λ = 0.2 with three and four groups, but for the other values the proposed method is clearly better than the other methods. For 10% missing observations, the average bound method does better than the other methods, while the simple average method does better than the asymmetric winsorized method only when λ = 0.2 with three and four groups, λ = 0.4 with three groups, and λ = 0.6 with three, four and five groups. Finally, for 20% missing observations, there is only one case in which the simple average method does better than the asymmetric winsorized method: λ = 0.2 with three groups.

For the second accuracy measure, the MSE is shown in Tables 9a to 9c for the proposed methods and in Table 11 for the other methods. For 5% missing observations, the asymmetric winsorized mean methods generally do better than the others, but the simple average method does better than the proposed method when λ = 0.2 with three and four groups. For 10% missing observations, the asymmetric winsorized methods do better than the naive and trend methods, the average bound method does better than the proposed method, and the comparison between the proposed methods and the simple average method gives mixed results. Finally, for 20% missing observations, the simple average method does better than the proposed method in two cases: λ = 0.2 with three groups, and λ = 0.6 with three and four groups.
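The error measures of Equations (5.1) and (5.2) and the four comparison methods described in this section can be sketched in Python as follows. The sketch rests on several assumptions that the paper does not state: missing values are marked with `None`, the series has at least one observed value and does not start with a gap for the naive method, consecutive gaps in the naive method carry the last filled value forward, and the trend method falls back to the first earlier observation when fewer than two earlier observations exist. All function names are hypothetical.

```python
def mae(actual, estimate):
    """Mean absolute error, Eq. (5.1)."""
    return sum(abs(a - e) for a, e in zip(actual, estimate)) / len(actual)

def mse(actual, estimate):
    """Mean square error, Eq. (5.2)."""
    return sum((a - e) ** 2 for a, e in zip(actual, estimate)) / len(actual)

def impute_mean(series):
    """Replace each gap with the mean of all observed values."""
    known = [v for v in series if v is not None]
    m = sum(known) / len(known)
    return [m if v is None else v for v in series]

def impute_naive(series):
    """Replace each gap with the previous value (naive forecast)."""
    out, last = [], None
    for v in series:
        if v is None:
            v = last
        out.append(v)
        last = v
    return out

def impute_bound_average(series):
    """Replace each gap with the average of the nearest observed neighbours."""
    out = list(series)
    n = len(series)
    for t, v in enumerate(series):
        if v is None:
            before = next((series[i] for i in range(t - 1, -1, -1)
                           if series[i] is not None), None)
            after = next((series[i] for i in range(t + 1, n)
                          if series[i] is not None), None)
            neighbours = [u for u in (before, after) if u is not None]
            out[t] = sum(neighbours) / len(neighbours)
    return out

def fit_line(points):
    """Least-squares intercept a and slope b of X_t = a + b t."""
    n = len(points)
    tbar = sum(t for t, _ in points) / n
    xbar = sum(x for _, x in points) / n
    sxx = sum((t - tbar) ** 2 for t, _ in points)
    b = sum((t - tbar) * (x - xbar) for t, x in points) / sxx
    return xbar - b * tbar, b

def impute_trend(series):
    """Replace each gap with a trend fitted to the observations before it."""
    out = list(series)
    for t, v in enumerate(series):
        if v is None:
            prior = [(i, series[i]) for i in range(t) if series[i] is not None]
            if len(prior) >= 2:
                a, b = fit_line(prior)
                out[t] = a + b * t
            elif prior:
                out[t] = prior[0][1]
    return out
```

For a hypothetical series `[2.0, 4.0, None, 8.0, None, 12.0]` whose true missing values are 6.0 and 10.0, mean imputation fills both gaps with 6.5 (MAE 2.0, MSE 6.25), the naive forecast fills them with 4.0 and 8.0 (MAE 2.0, MSE 4.0), and the bound average and trend methods recover 6.0 and 10.0 because this toy series is exactly linear.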
In this paper, 18 asymmetric winsorized mean methods were proposed to handle data with 5%, 10%, and 20% missing observations. Overall, the results show that the proposed methods do better than the other methods, in particular when the data have 20% missing observations. We recommend using the asymmetric winsorized method when the data have a larger share of missing observations, and dividing the whole data into more than four groups.

Table 8a Values of the mean absolute error of the asymmetric winsorized mean for 5% missing data, λ = 0.2, 0.4, and 0.6, for different groups

MAE       3G      4G      5G      6G      7G      8G
λ = 0.2   .9554   .9403   .9335   .907    .96     .996
λ = 0.4   .4483   .64     .748    .7986   .8349   .8600
λ = 0.6   .037    .804    .449    .586    .596    .6507
Table 8b Values of the mean absolute error of the asymmetric winsorized mean for 10% missing data, λ = 0.2, 0.4, and 0.6, for different groups

MAE       3G      4G      5G      6G      7G      8G
λ = 0.2   .4763   .47     .468    .4659   .4669   .4675
λ = 0.4   .507    .463    .455    .45     .4546   .4569
λ = 0.6   .788    .5967   .567    .483    .4679   .465

Table 8c Values of the mean absolute error of the asymmetric winsorized mean for 20% missing data, λ = 0.2, 0.4, and 0.6, for different groups

MAE       3G      4G      5G      6G      7G      8G
λ = 0.2   .937    .949    .94     .954    .904    .98
λ = 0.4   .8533   .83     .8396   .8559   .87     .889
λ = 0.6   .0876   .9343   .865    .8384   .83     .83

Table 9a Values of the mean square error of the asymmetric winsorized mean for 5% missing data, λ = 0.2, 0.4, and 0.6, for different groups

MSE       3G      4G      5G      6G      7G      8G
λ = 0.2   0.575   0.455   0.40    0.30    0.344   0.37
λ = 0.4   7.6499  8.4974  9.0695  9.455   9.67    9.8500
λ = 0.6   6.99    7.339   7.5377  7.986   8.658   8.546

Table 9b Values of the mean square error of the asymmetric winsorized mean for 10% missing data, λ = 0.2, 0.4, and 0.6, for different groups

MSE       3G      4G      5G      6G      7G      8G
λ = 0.2   9.3746  9.3545  9.3460  9.3405  9.346   9.344
λ = 0.4   0.06    9.5468  9.3974  9.355   9.3354  9.3307
λ = 0.6   .5735   0.998   0.95    9.8494  9.649   9.5303

Table 9c Values of the mean square error of the asymmetric winsorized mean for 20% missing data, λ = 0.2, 0.4, and 0.6, for different groups

MSE       3G      4G      5G      6G      7G      8G
λ = 0.2   5.56    5.478   5.4645  5.470   5.4534  5.4609
λ = 0.4   5.749   5.0050  5.057   5.3     5.00    5.578
λ = 0.6   6.84    5.6858  5.44    5.0773  5.044   5.0055
Table 10 Values of the mean absolute error for the other methods for 5%, 10% and 20% missing observations

MAE    Simple Average   Bound Average   Trend Regression   Naive
5%     .93538           4.03374         3.96569            3.4393
10%    .46873           3.34            .956               .98584
20%    .965             .5009           .36689             4.033494

Table 11 Values of the mean square error for the other methods for 5%, 10% and 20% missing observations

MSE    Simple Average   Bound Average   Trend Regression   Naive
5%     0.465            4.90699         5.0595             5.4698
10%    9.347366         6.8883          8.39393            5.04977
20%    5.486735         0.637           9.0379             8.30867

REFERENCES

[1] A.K. Sheikh and M. Ahmad, Statistical Models of Accelerated Life Testing. Pak. J. Statist. 17 (): (2001) 75-0.
[2] A. Pankratz, Forecasting with Univariate Box-Jenkins Models. John Wiley, USA, 1983.
[3] E. G. Patricia, Introduction to Time Series Modeling and Forecasting in Business and Economics. McGraw-Hill, Inc., New York, 1994.
[4] G. Hawthorne and P. Elliot, Imputing Cross-Sectional Missing Data: Comparison of Common Techniques. Australian and New Zealand Journal of Psychiatry 39 (2005) 583-590.
[5] I. Hess, V.K. Sethi and T.R. Balakrishnan, Stratification: a practical investigation. J. Amer. Statist. Assoc. 61 (1966) 74-90.
[6] M. Hanif, Design and Model-based sampling inference. Pak. J. Statist. 16(3) (2000) 9-46.
[7] M. Samiuddin, M. Hanif and A.K.A Kattan, An optimum form for the design based ratio estimator. Pak. J. Statist. 14 () (1998) 8-96.
[8] R. Ahmad Mahir, A.M.H. Al-Khazaleh and M.M.T. Al-Kassab, Approximation Method in Finding Optimum Stratum Depending on Neyman Allocation Applied on Beta Distribution. Proceedings of the 12th WSEAS International Conference on Applied Mathematics, Cairo, Egypt. WSEAS Press (2007) 34-345.
[9] R.M. Cyert and H.J. Davidson, Statistical Sampling for Accounting Information. Prentice-Hall, Englewood Cliffs, NJ (1962) pp. 6 7.
[10] R. J. A. Little and D.B. Rubin, Statistical Analysis with Missing Data. John Wiley, New York, 1987.
[11] V. Barnett and T. Lewis, Outliers in Statistical Data. 3rd Edition, John Wiley, New York, USA, 1994.
[12] W.G. Cochran, Sampling Techniques. 3rd Edition, John Wiley, New York, USA, 1977.
[13] W. Mingxin and Z. Yijun, Trimmed and winsorised means based on a scaled deviation. J. of Stat. Planning and Inference 139 (2009) 350-365.

Received: September, 0