Kasetsart J. (Nat. Sci.) 45 : 736 - 742 (2011)

Estimation of the Correlation Coefficient for a Bivariate Normal Distribution with Missing Data

Juthaphorn Sinsomboonthong*

ABSTRACT

This study proposes an estimator of the correlation coefficient for a bivariate normal distribution with missing data, via the complete observation analysis method. The proposed estimator ($\hat{\rho}_J$) was evaluated against the Pearson correlation coefficient ($\hat{\rho}_P$) in a simulation study. For a higher percentage of missing data in a large sample, the absolute bias of $\hat{\rho}_J$ was less than that of $\hat{\rho}_P$ when the population correlation coefficient ($\rho$) was not close to zero. In addition, the mean square error of $\hat{\rho}_J$ did not differ from that of $\hat{\rho}_P$ in any situation.

Keywords: bivariate normal distribution, correlation coefficient, bias, missing data, mean square error

INTRODUCTION

Missing data, which are almost always found in research studies and arise for many possible reasons, usually introduce bias and inefficiency in parameter estimation (Little and Rubin, 2002; Norazian et al., 2008). Hence, well-designed data analysis is especially necessary. In principle, incomplete data can be analyzed using standard statistical methods that ignore the missing data, but an important limitation is that these methods are appropriate only for studies that contain small amounts of missing data. Moreover, standard statistical approaches can also cause deficiencies of data when incomplete cases are discarded. Data deficiency always causes imprecision and also an escalation in biases (Rao et al., 1999; Little and Rubin, 2002). Acock (2005) mentioned that this may further reduce or exaggerate statistical power. Likewise, Rotnitzky and Wypij (1994), Roth et al.
(1996), Gorelick (2006) and Fitzmaurice (2008) mentioned that these may result in invalid conclusions, since the degree of bias and the loss of precision depend not only on the fraction of complete cases and the pattern of missing data, but also on differences between the complete and incomplete cases, and on the parameters of interest.

Department of Statistics, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand.
* Corresponding author: e-mail: fscijps@ku.ac.th
Received date : // Accepted date : 9/6/

Recently, several authors have investigated problems regarding the estimation of the population correlation coefficient, $\rho$, for samples from a bivariate normal distribution. The maximum likelihood estimator of $\rho$ for a bivariate normal distribution was proposed by Dahiya and Korwar (1980) for equal variances and an incomplete dataset. Garren (1998) examined the problem of maximum likelihood estimation for the correlation coefficient and its asymptotic properties in a bivariate normal model with missing data. Mudelsee (2003) studied the Pearson correlation coefficient with bootstrap confidence intervals from a bivariate climate time series and
unknown data distributions. In addition, the performances of the Pearson correlation coefficient and the Spearman correlation coefficient have been further investigated by Huson et al. (2007). They found that the Pearson correlation coefficient is in fact, for most practical purposes, an adequate choice for investigating correlation. Moreover, the Pearson estimate was better than the much more widely known Spearman estimate. However, Neter et al. (1996) and Zimmerman et al. (2003) mentioned that the Pearson correlation coefficient is a biased estimator of the population correlation coefficient for bivariate normal populations. In addition, the bias decreased when the sample size increased, and it was zero when the population correlation coefficient was zero or one.

Furthermore, Efron and Tibshirani (1993) and Smith and Pontius (2006) have applied the jackknife method (Quenouille, 1949, 1956; Tukey, 1958) of bias reduction to the estimation of parameters. The basic idea behind the jackknife method lies in systematically recomputing a statistic from samples that leave out one observation at a time from the sample set. From each new set of observations, an estimate can be calculated.

In the current study, incomplete data from a bivariate normal distribution were examined. The estimator of the population correlation coefficient was obtained by modifying the Pearson correlation coefficient, with the jackknife method applied for bias reduction.

MATERIALS AND METHODS

The Pearson correlation coefficient

Consider an incomplete bivariate sample from a bivariate normal distribution with mean vector $(\mu_1, \mu_2)$, variance-covariance matrix

\[ \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \]

and correlation coefficient $\rho$. Assume that $r$ pairs of $(X_1, X_2)$ are completely observed with bivariate normal distribution, but the remaining $n - r$ observations of $X_2$ are lost, so that only the $n$ observations ($0 < r < n$) of $X_1$ are collected in full (see Figure 1). All data pairs are independent and identically distributed and data are assumed to be missing completely at random (Little and Rubin, 2002).
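In other words, whether or not data are missing is independent of both the observed and the unobserved values of $X_1$ and $X_2$. Based on the $r$ complete data pairs, it is well known that the maximum likelihood estimator of $\rho$, denoted by $\hat{\rho}_P$, is given by

\[ \hat{\rho}_P = \frac{\sum_{j=1}^{r} (x_{1j} - \bar{x}_1)(x_{2j} - \bar{x}_2)}{\sqrt{\sum_{j=1}^{r} (x_{1j} - \bar{x}_1)^2 \, \sum_{j=1}^{r} (x_{2j} - \bar{x}_2)^2}} \]

where $\bar{x}_1 = \frac{1}{r}\sum_{j=1}^{r} x_{1j}$ and $\bar{x}_2 = \frac{1}{r}\sum_{j=1}^{r} x_{2j}$ (Neter et al., 1996; Anderson, 2003). This estimator is often called the Pearson correlation coefficient.

Observations:   1      2      ...   r      r+1       ...   n
Variable 1:     x_11   x_12   ...   x_1r   x_1,r+1   ...   x_1n
Variable 2:     x_21   x_22   ...   x_2r

Figure 1 Monotone missing data pattern for a bivariate normal distribution.

It is a biased estimator of $\rho$ (unless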
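$\rho = 0$ or $1$), with a bias that is usually small when the sample size is large (Neter et al., 1996; Zimmerman et al., 2003).

The jackknife method of bias reduction

This section proposes the estimator of $\rho$, obtained by applying the jackknife method for bias reduction to $\hat{\rho}_P$ as follows:

1. Given a sample $X = (x_{11}, x_{12}, \ldots, x_{1r}, x_{21}, x_{22}, \ldots, x_{2r})$ and an estimator

\[ \delta(X) = \hat{\rho}_P = \frac{\sum_{j=1}^{r} (x_{1j} - \bar{x}_1)(x_{2j} - \bar{x}_2)}{\sqrt{\sum_{j=1}^{r} (x_{1j} - \bar{x}_1)^2 \, \sum_{j=1}^{r} (x_{2j} - \bar{x}_2)^2}} \tag{1} \]

where $\bar{x}_1 = \frac{1}{r}\sum_{j=1}^{r} x_{1j}$ and $\bar{x}_2 = \frac{1}{r}\sum_{j=1}^{r} x_{2j}$.

2. The $i$th jackknife sample, $X_{(-i)}$, consists of the data set with the $i$th observation removed:

\[ X_{(-i)} = (x_{11}, x_{12}, \ldots, x_{1(i-1)}, x_{1(i+1)}, \ldots, x_{1r}, x_{21}, x_{22}, \ldots, x_{2(i-1)}, x_{2(i+1)}, \ldots, x_{2r}) \]

for $i = 1, 2, \ldots, r$.

3. $\delta(X_{(-i)})$ is the $i$th jackknife replication of $\delta(X)$:

\[ \delta(X_{(-i)}) = \hat{\rho}_{P(-i)} = \frac{\sum_{j \neq i} (x_{1j} - \bar{x}_{1(-i)})(x_{2j} - \bar{x}_{2(-i)})}{\sqrt{\sum_{j \neq i} (x_{1j} - \bar{x}_{1(-i)})^2 \, \sum_{j \neq i} (x_{2j} - \bar{x}_{2(-i)})^2}} \tag{2} \]

where $\bar{x}_{1(-i)} = \frac{1}{r-1}\sum_{j \neq i} x_{1j}$ and $\bar{x}_{2(-i)} = \frac{1}{r-1}\sum_{j \neq i} x_{2j}$ for $i = 1, 2, \ldots, r$.

4. Calculate the pseudo-values $J_i$, where

\[ J_i = r\,\delta(X) - (r-1)\,\delta(X_{(-i)}) = r\hat{\rho}_P - (r-1)\hat{\rho}_{P(-i)}. \]

5. The proposed estimator of $\rho$ is given by $\hat{\rho}_J$, where

\[ \hat{\rho}_J = \frac{1}{r}\sum_{i=1}^{r} J_i = \frac{1}{r}\sum_{i=1}^{r} \left[ r\hat{\rho}_P - (r-1)\hat{\rho}_{P(-i)} \right] = \sum_{i=1}^{r} \hat{\rho}_P - \frac{r-1}{r}\sum_{i=1}^{r} \hat{\rho}_{P(-i)} = r\hat{\rho}_P - \frac{r-1}{r}\sum_{i=1}^{r} \hat{\rho}_{P(-i)} \]

and $\hat{\rho}_P$ and $\hat{\rho}_{P(-i)}$ are given by equations (1) and (2), respectively.

RESULTS

In order to empirically evaluate the validity and reliability of the proposed estimator, a simulation study was conducted. In the study, populations of $(X_1, X_2)$ at a size of N =, were generated from a bivariate normal distribution with µ =, µ = 3, σ = 4 and σ = 9. For population correlation coefficients of $(X_1, X_2)$ at -0.9, -0.8, ..., 0, 0.1, 0.2, ..., 0.9, samples of size n =, 3 and 6 were drawn by simple random sampling with replacement, with ,-times repetitions, and missing data were set at, and 3 percentage of the total cases, thus creating 171 situations for the simulation study (19 correlation values, 3 sample sizes and 3 percentages of missing data). Then, absolute bias and mean square error (MSE) comparisons of $\hat{\rho}_J$ and $\hat{\rho}_P$ were empirically performed.

The simulation results presented in Figure 2 reveal the absolute biases of $\hat{\rho}_J$ and $\hat{\rho}_P$.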
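The five steps of the jackknife procedure in the Methods can be sketched in Python as follows. This is an illustrative sketch, not the author's program: the function names (`pearson_r`, `complete_cases`, `jackknife_r`) and the use of `None` to mark a lost observation of the second variable are assumptions made here, and only the standard library is used.

```python
import math


def pearson_r(x1, x2):
    """Pearson correlation of two equal-length lists of numbers."""
    r = len(x1)
    m1 = sum(x1) / r
    m2 = sum(x2) / r
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    return sxy / math.sqrt(sxx * syy)


def complete_cases(x1, x2):
    """Keep only the pairs in which the second variable is observed.

    Assumption of this sketch: a missing value is stored as None.
    """
    pairs = [(a, b) for a, b in zip(x1, x2) if b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]


def jackknife_r(x1, x2):
    """Jackknife bias-reduced correlation based on the complete pairs."""
    c1, c2 = complete_cases(x1, x2)
    r = len(c1)
    if r < 3:
        # Each leave-one-out sample needs at least two pairs.
        raise ValueError("at least three complete pairs are required")
    rho_p = pearson_r(c1, c2)                       # step 1
    loo = [
        pearson_r(c1[:i] + c1[i + 1:], c2[:i] + c2[i + 1:])
        for i in range(r)                           # steps 2 and 3
    ]
    pseudo = [r * rho_p - (r - 1) * v for v in loo]  # step 4
    return sum(pseudo) / r                           # step 5
```

For example, `jackknife_r([1, 2, 3, 4, 5], [2, 4, 6, 8, None])` uses the four complete pairs; because they are exactly collinear, every leave-one-out correlation equals 1 and the estimate is 1 up to rounding. Note that, unlike the Pearson coefficient, the jackknife estimate is not constrained to lie in [-1, 1].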
When the sample size and percentage of the missing data were and %, respectively, the absolute bias of $\hat{\rho}_J$ was less than that of $\hat{\rho}_P$ for population correlation coefficients ($\rho$) which fell between. and.. For % missing data and the sample size of, the absolute bias of $\hat{\rho}_J$ was less than that of $\hat{\rho}_P$ when the population correlation coefficients were not close to zero.
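A comparison of this kind can be reproduced in miniature with a small Monte Carlo experiment. The sketch below is not the author's simulation program. It draws the complete pairs directly from a bivariate normal distribution with standard normal margins (the correlation coefficient is unaffected by location and scale) instead of resampling a finite population, and the function names and the settings in the usage line are illustrative assumptions rather than the design of this study.

```python
import math
import random


def pearson_r(x1, x2):
    """Pearson correlation of two equal-length lists of numbers."""
    r = len(x1)
    m1, m2 = sum(x1) / r, sum(x2) / r
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    return sxy / math.sqrt(sxx * syy)


def jackknife_r(x1, x2):
    """Jackknife estimate: r * rho_P - (r - 1) * mean of leave-one-out values."""
    r = len(x1)
    rho_p = pearson_r(x1, x2)
    loo_mean = sum(
        pearson_r(x1[:i] + x1[i + 1:], x2[:i] + x2[i + 1:]) for i in range(r)
    ) / r
    return r * rho_p - (r - 1) * loo_mean


def simulate(rho, n, miss_frac, reps, seed=0):
    """Empirical absolute bias and MSE of both estimators.

    Under complete-case analysis only the r complete pairs enter either
    estimator, so only those pairs are generated here.
    """
    rng = random.Random(seed)
    r = n - int(round(n * miss_frac))  # number of complete pairs
    estimates = {"pearson": [], "jackknife": []}
    for _ in range(reps):
        x1, x2 = [], []
        for _ in range(r):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            x1.append(z1)
            # Standard construction of a pair with correlation rho.
            x2.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        estimates["pearson"].append(pearson_r(x1, x2))
        estimates["jackknife"].append(jackknife_r(x1, x2))
    summary = {}
    for name, values in estimates.items():
        bias = sum(values) / reps - rho
        mse = sum((v - rho) ** 2 for v in values) / reps
        summary[name] = {"abs_bias": abs(bias), "mse": mse}
    return summary


if __name__ == "__main__":
    # Illustrative settings only (hypothetical, not taken from the study).
    print(simulate(rho=0.5, n=30, miss_frac=0.2, reps=200, seed=1))
```

Fixing the seed makes a run reproducible. With few repetitions the two absolute biases are noisy, so many more repetitions would be needed before drawing any conclusion from such a sketch.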
Figure 2 Comparison of absolute biases for $\hat{\rho}_J$ and $\hat{\rho}_P$ when n =, 3 and 6. (Nine panels, one for each combination of sample size and percentage of missing data, each showing the Pearson estimate and the proposed estimate.)

Likewise, the absolute bias of $\hat{\rho}_J$ was less than that of $\hat{\rho}_P$ for population correlation coefficients which were not between -.5 and -. when the percentage of missing data was 3% and the sample size was. In addition, the absolute bias of $\hat{\rho}_J$ was less than that of $\hat{\rho}_P$ when the sample size was greater than and the percentage of missing data was greater than % for population correlation coefficients which were not close to zero. Moreover, the absolute biases of $\hat{\rho}_J$ and $\hat{\rho}_P$ were less than.4 and.4, respectively, for sample sizes of 3 and 6. Thus, the absolute bias of $\hat{\rho}_P$ seemed to be greater than that of $\hat{\rho}_J$ when sample sizes were 3 and 6. With a data loss of %, the absolute bias of $\hat{\rho}_J$ was less than that of $\hat{\rho}_P$ at all levels of population correlation coefficient for n = 3, whereas for n = 6, the absolute bias of $\hat{\rho}_J$ was less than that of $\hat{\rho}_P$ when the population correlation coefficients were positive. Moreover, the absolute biases of $\hat{\rho}_J$ and $\hat{\rho}_P$ seemed to decrease whenever the sample size increased.

Figure 3 indicates that the mean square error of $\hat{\rho}_J$ seems to have no difference from that of $\hat{\rho}_P$ in each situation for this study. Furthermore, the mean square errors of $\hat{\rho}_J$ and $\hat{\rho}_P$ seem to
decrease whenever the sample size increased, whatever the percentage of missing data. The ranges of the mean square errors of $\hat{\rho}_J$ and $\hat{\rho}_P$ were found to be narrower when the sample size was large, with % and 3% missing data. This simulation study found that the performance of $\hat{\rho}_J$ was better than that of $\hat{\rho}_P$ for sample sizes of 3 and 6 with a higher percentage of missing data when the population correlation coefficients were not close to zero.

Figure 3 Mean square errors of $\hat{\rho}_J$ and $\hat{\rho}_P$ when n =, 3 and 6. (Nine panels, one for each combination of sample size and percentage of missing data, with mean square error on the vertical axis, each showing the Pearson estimate and the proposed estimate.)

DISCUSSION

The simulation results indicated that $\hat{\rho}_P$ seemed to be a biased estimator, as Neter et al. (1996) and Zimmerman et al. (2003) mentioned. Hence, the bias of $\hat{\rho}_P$ can be reduced by the jackknife method, as reported (Efron and Tibshirani, 1993; Smith and Pontius, 2006). Moreover, the bias of the proposed estimator reduced to zero for a large sample size. These findings can be applied in research in education, psychology, medicine and other fields. The jackknife method can be applied
in the elimination of biases in correlation coefficient estimation for incomplete samples from bivariate normal populations. In addition, the proposed estimator can be calculated without difficulty by computer programming.

CONCLUSION

This paper proposed an estimator of the correlation coefficient for a bivariate normal distribution when observations are missing from one of the variables. The proposed estimator ($\hat{\rho}_J$) was derived from the Pearson correlation coefficient ($\hat{\rho}_P$) and based on the analysis of complete cases. The results of the simulation study indicated that the absolute bias of $\hat{\rho}_J$ was less than that of $\hat{\rho}_P$ when the sample size was large, the percentage of missing data was higher and the population correlation coefficients were not close to zero. Furthermore, the absolute bias of $\hat{\rho}_J$ was less than.4 for sample sizes of 3 and 6, whatever the percentage of missing data. In addition, the mean square error of $\hat{\rho}_J$ seemed to be no different from that of $\hat{\rho}_P$ in each situation for this simulation study.

ACKNOWLEDGEMENTS

The author would like to thank the Department of Statistics, Faculty of Science, Kasetsart University for financial support and necessary facilities during the research.

LITERATURE CITED

Acock, A.C. 2005. Working with missing values. Journal of Marriage and Family 67: 1012-1028.
Anderson, T.W. 2003. An Introduction to Multivariate Statistical Analysis. 3rd ed. Wiley. New Jersey. 721 pp.
Dahiya, R.C. and R.M. Korwar. 1980. Maximum likelihood estimates for a bivariate normal distribution with missing data. The Annals of Statistics 8: 687-692.
Efron, B. and R.J. Tibshirani. 1993. An Introduction to the Bootstrap. Chapman & Hall/CRC. USA. 45 pp.
Fitzmaurice, G. 2008. Missing data: Implications for analysis. Nutrition 24: 200-202.
Garren, S.T. 1998. Maximum likelihood estimation of the correlation coefficient in a bivariate normal model with missing data. Statistics & Probability Letters 38: 281-288.
Gorelick, M.H. 2006. Bias arising from missing data in predictive models. Journal of Clinical Epidemiology 59: 1115-1123.
Huson, L.W., Biostatistics Group and F.H. La-Roche. 2007. Performance of some correlation coefficients when applied to zero-clustered data. Journal of Modern Applied Statistical Methods 6: 530-536.
Little, R.J.A. and D.B. Rubin. 2002. Statistical Analysis with Missing Data. Wiley. New Jersey. 49 pp.
Mudelsee, M. 2003. Estimating Pearson's correlation coefficient with bootstrap confidence interval from serially dependent time series. Math. Geol. 35: 651-665.
Neter, J., M.H. Kutner, C.J. Nachtsheim and W. Wasserman. 1996. Applied Linear Statistical Models. 4th ed. Irwin. Chicago.,43 pp.
Norazian, M.N., Y.A. Shukri, R.N. Azam and A.M.M. Al Bakri. 2008. Estimation of missing values in air pollution data using single imputation techniques. ScienceAsia 34: 341-345.
Quenouille, M.H. 1949. Approximate tests of correlation in time-series. Journal of the Royal Statistical Society. Series B (Methodological) 11: 68-84.
Quenouille, M.H. 1956. Notes on bias in estimation. Biometrika 43: 353-360.
Rao, C.R., H. Toutenburg and A. Fieger. 1999. Linear Models: Least Squares and
Alternatives. 2nd ed. Springer-Verlag. New York. 44 pp.
Roth, P.L., J.E. Campion and S.D. Jones. 1996. The impact of four missing data techniques on validity estimates in human resource management. Journal of Business and Psychology 11: 101-112.
Rotnitzky, A. and D. Wypij. 1994. A note on the bias of estimators with missing data. Biometrics 50: 1163-1170.
Smith, C.D. and J.S. Pontius. 2006. Jackknife estimator of species richness with S-PLUS. Journal of Statistical Software 15: 1-12.
Tukey, J.W. 1958. Bias and confidence in not-quite large samples. Annals of Mathematical Statistics 29: 614-623.
Zimmerman, D.W., B.D. Zumbo and R.H. Williams. 2003. Bias in estimation and hypothesis testing of correlation. Psicológica 24: 133-158.