Methods of Detecting Outliers in A Regression Analysis Model.

Methods of Detectng Outlers n A Regresson Analyss Model. Ogu, A. I. *, Inyama, S. C+, Achugamonu, P. C++ *Department of Statstcs, Imo State Unversty,Owerr +Department of Mathematcs, Federal Unversty of Technology, Owerr ++Department of Mathematcs, Alvan Ikoku Federal College of Educaton, Owerr Abstract Ths study detects outlers n a unvarate and bvarate data by usng both Rosner s and Grubb s test n a regresson analyss model. The study shows how an observaton that causes the least square pont estmate of a Regresson model to be substantally dfferent from what t would be f the observaton were removed from the data set. A Bolers data wth dependent varable Y (man-hour) and four ndependent varables (Boler Capacty), (Desgn Pressure), 3 (Boler Type), 4 (Drum Type) were used. The analyss of the Bolers data revewed an unexpected group of Outlers. The results from the fndngs showed that an observaton can be outlyng wth respect to ts Y (dependent) value or (ndependent) value or both values and yet nfluental to the data set. Key Words: Outlners, unvarate, bvarate data, Regresson Analyss,.0 Bref Hstory and Background of Study Outlers are unusual data values that occur almost n all research projects nvolvng data collecton. Ths s especally true n observatonal studes where data naturally take on very unusual values, even f they come from relable sources. Although defntons vares. An outler s generally consdered to be a data pont that s far outsde the norm for a varable or populaton Jarrell [4], Rasmussen [5]) and Steven [6].. Causes of Outlers Outlers can arse from several dfferent mechansms or causes. Ascombe (960) sorts nto two major categores. Those arsng from errors n the data and those arsng from the nherent varablty of the data.. Identfcaton Of Outlers. There s no such thng as a smple test. However, there are many ways to look at a dstrbuton of numercal values, to see f certan ponts seem out of lne wth the majorty of the data. Ths can be acheved by: (I) By vsual Ads (II) By computaton of IQR (III) By plottng a scatter plot..3 Dealng wth Outlers There s a great deal of debates as what to do wth dentfed outlers. If West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 05

your data set contans an outler two questons arses () Are they merely fluke of some knd? () How much have the coeffcents error statstcs and predctons been affected?.0 Data Presentaton and Methodology Ths study ntends to examne the causes, problems, methods of detecton and approaches to data analyss of R + = ( ) ( ) S ( ) outler n a unvarate and Bvarate data. In order to do ths, a Broler data were collected from Kelly Uscategu, unversty of Connectcut on Brolers.. Method of Data Analyss.. ROSNER S TEST (Rosner,983) The procedure entals removng from the data set the observaton that s fartherest from the mean. The test statstc R s calculated and compared wth the crtcal value. The Rosner s R test where ( ) s g v e n b y n ( ) =... j n a n d j = n S = n j = ( ) ( ) ( ) ( j )... ( ),0... k L + Tabled Crtcal Value for Comparson wth R +... Test Crtera/Decson Rule Hypothess H 0 : There s no outler n the data set H AK : There s at least one outler n the data set. Decson Rule Reject H 0 f R + > L + at the stated level of sgnfcance otherwse do not reject H 0.. GRUBB S TEST (Grubb, 950) Grubb s test detects one outler at a tme. Ths outler s expunged from the data set and the test s terated untl no outler s detected. A test statstc G s calculated and compared wth the crtcal value. The Grubb s test statstc s gven by G = M a x Y S Y West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 06

Test Crtera/Decson Rule. Hypothess H O : There s no outler n the gven data set. H ak : There s outler n the gven data set. Decson Rule: Reject H O f G α N-I T, N N > N N T α + N N at a gven ( α ) level of sgnfcance, otherwse do not reject H O. Regresson analyss s an estmatng equaton whch expresses the functonal relatonshp between two or more varables as well take care of the error term whch s classfed nto; Smple lnear regresson and multple lnear regresson..3. Smple Lnear Regresson. Ths s the type of lnear regresson that nvolves only two varables one ndependent and one dependent plus the random error term. The smple lnear regresson model assumes that there s a straght lne (lnear) relatonshp between the dependent varable Y and the ndependent varable. Ths can be estmated by the least square estmate method expressed by..3 Regresson Analyss n n n n Y Y =I b = =I =I -------------------------- (4) n n n = I =I.3. Multple Lnear Regressons Multple lnear regresson analyses three or more varables and the random error term. Ths s expressed as follows. West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 07

Y = β + β + β +...+ β + E... (5) O k k y... k y... k Where Y = = M M M y n n n... kn nxk β = b e b e e = M M bk en kx nx T h e m atrx becom es. Y... n b e Y... n b e = + M M M M y... b k e n n n n n.4 : Brolers Data As Used In The Study Table Man- Hours Boler Capacty Desgn Pressure Boler Type Drum Type S/N Y 3 4 337 0000 375 3590 65000 750 3 456 50000 500 4 085 073877 70 0 5 403 50000 35 6 7606 60000 500 0 7 3748 8800 399 8 97 8800 399 9 363 8800 399 0 4065 90000 40 048 30000 35 6500 44000 40 3 565 44000 40 4 6565 44000 40 5 6387 44000 40 6 6454 67000 55 0 7 698 60000 500 0 8 468 50000 500 9 479 089490 970 0 0 680 5000 750 974 0000 375 0 West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 08

965 65000 750 0 3 566 50000 500 0 4 55 50000 50 0 5 000 50000 500 0 6 735 50000 35 0 7 3698 60000 500 0 0 8 635 90000 40 0 9 06 30000 35 0 30 3775 44000 40 0 3 30 44000 40 0 3 406 44000 40 0 33 4006 44000 40 0 34 378 67000 55 0 0 35 3 60000 500 0 0 36 00 30000 35 0 Note Y s the dependent varable whch x, x,x 3 and x 4 are the ndependent varables. For the purpose of ths study, the followng holds: Y represents man hours represent boler capacty represent desgn pressure 3 represent boler type 4 represent drum type 3.0 Data Analyss 3. Dependent Varable Y Usng Rosner s Test To Check For Outler In The Dataset 00 06 55 965 000 048 566 635 680 735 97 974 30 337 363 3 3590 3698 378 3738 3775 4006 403 4065 406 468 456 565 6387 6454 6500 6565 698 7606 085 479 I n- Y Sy () Y () R ] + λ+ α( 0.05) 0 O 36 490.7500 70.75 00.44.99 35 4379.057 688.965 479 3.87.98 34 407.835 06.96 085 3.348.97. The decson rule s to reject H O f R y. > λ + from our result above. R y. < λ +. That s.44 <.99 Accept H O. R y. > λ +. That s 3.87 >.98 Reject H O R y. 3 > λ + That s 3.348 >.97 Reject H O. We therefore conclude that observaton 479 and 085 are outlers. West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 09

3. 3 Rosner s Test On The Independent Varable The Data Becomes. 30000 30000 30000 65000 65000 8800 90000 90000 0000 0000 5000 5000 50000 5000 50000 50000 50000 50000 44000 44000 44000 4000 44000 44000 44000 44000 60000 60000 6000 60000 67000 67000 073877 089490 From the data above we have. n- ( ) S ( ) x ( ) R + λ ( α ) =0.05 0 36 3847.3056 847.7874 3000.05.99 35 3673.349 8093.5943 089490.74.98 34 30478.7353 55.9055 073877 3.060.97 + R. > λ + That s.05 <.99 accept H O R. < λ + That s.74 <.98 accept H O R x.3 > λ + That s 3.060 >.97 Reject H O Therefore only observaton 073877 s an outler n the data set of ndependent varable. 3.3 Grubb s Test The null and alternatve hypotheses are stated as fellows. H O : There are no outlers n the data set. H A : There s at least one outler n the data set. Crtcal regon = 36 4.3 36 36-+4.3 =.90 Grubb s Test on Y the Dependent Varable. Y = 90.7500, S = 70.75.. G 0.47 (accept H O ) <.90 Not an outler 0.59 (accept H O ) <.90 Not an outler 3 0.087 (accept H O ) <.90 Not an outler 4.47 (reject H O ) >.90 An outler 5 0.099 (accept H O ) <.90 Not an outler 6.7 (accept H O ) <.90 Not an outler 7 0.0 (accept H O ) <.90 Not an outler 8 0.488 (accept H O ) <.90 Not an outler 9 0.47 (accept H O ) <.90 Not an outler 0 0.084 (accept H O ) <.90 Not an outler 0.830 (accept H O ) <.90 Not an outler 0.87 (accept H O ) <.90 Not an outler 3 0.503 (accept H O ) <.90 Not an outler West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 0

4 0.84 (accept H O ) <.90 Not an outler 5 0.776 (accept H O ) <.90 Not an outler 6 0.800 (accept H O ) <.90 Not an outler 7 0.976 (accept H O ) <.90 Not an outler 8 0.008 (accept H O ) <.90 Not outler 9 3.885 (Reject H O ) >.90 An outler 0 0.596 (accept H O ) <.90 Not an outler 0.487 (accept H O ) <.90 Not an outler 0.86 (accept H O ) >.90 Not an outler 3 0.638 (accept H O ) <.90 Not an outler 4.07 (accept H O ) <.90 Not an outler 5 0.848 (accept H O ) <.90 Not an outler 6 0.576 (accept H O ) <.90 Not an outler 7 0.9 (accept H O ) <.90 Not an outler 8 0.63 (accept H O ) <.90 Not an outler 9.4 (accept H O ) <.90 Not an outler 30 0.9 (accept H O ) <.90 Not an outler 3 0.433 (accept H O ) <.90 Not an outler 3 0.03 (accept H O ) <.90 Not an outler 33 0.05 (accept H O ) <.90 Not an outler 34 0.08 (accept H O ) <.90 Not an outler 35 0.400 (accept H O ) <.90 Not an outler 36.44 (accept H O ) <.90 Not an outler Ths shows that observaton 4 and 9 are outlers on the dependent varable (Y) usng Grubb s Test Method. they can cause potental computatonal problem and thus nfluences problems. 4. Recommendaton 4.0 Concluson Havng carred out ths study The above dscussed statstcal tests successfully the followng are used to determne f expermental recommendatons were made. observatons are statstcal outlers n the (a) We recommend that data set. Of course effectve workng wth outlers n numercal data can be rather dffcult and frustratng experence. Nether gnorng nor deletng them at all wll be good soluton f you do nothng, you wll end up wth a model that descrbes essentally none of the data nether the bulk of the data nor the outlers. Even though your numbers may be perfectly legtmate, f they le expermenters should keep good record for each experment. All data should be recorded wth any possble explanaton or addtonal nformaton. (b) We recommend that analyst should employ robust statstcal methods. These methods are mnmally affected by outlers. outsde the verge of most of the data, West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03

References [] Anscombe, F.J. (960): Rejecton of Outlers Technometrcs,, 3-47. ]] Grubbs,F.E (950): Sample Crtera for Testng Outlyng observatons: Annals of Mathematcal scences. [3] Jarrell M.G. (994). A Comparson of two procedures, the Mabalanobs Dstance and the Andrews Pregbon statstcs for dentfyng multvarate outlers. Researchers n the schools, :49-58. [4] Rosner s Multple Outler Test Technometrcs 5, No May, (983), 65-7. [5] Rasmussen, J. L. (988): Evaluatng outler dentfcaton tests: Mahalanobs D Squared and Comrey, 3 (), 89-0. [6] Steven, J.P. (984). Outlers and Influental ponts n Regresson Analyss. Psychologcal Bulletn, 95, 339-344.. West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03

Relatve Effcency of Splt-plot Desgn (SPD) to Randomzed Complete Block Desgn (RCBD) Oladugba, A. V+, Onuoha, Desmond O*, Opara Pus N.++ +Department of Statstcs, Unversty of Ngera, Nsukka, *Dept of Maths/Statstcs, Fed. Polytechnc Nekede, Owerr, 0803544403, ++Datafeld Logstcs Servces, Port Harcourt, Rvers State. Abstract The relatve effcency of splt-plot desgn (SPD) to randomzed complete block desgn (RCBD) was computed usng ther error varance, senstvty analyss and desgn plannng. The result of ths work showed that conductng an experment usng splt-plot (SPD) wthout replcaton s more effcent to randomzed complete block desgn (RCBD) based on comparson of ther error varances, senstvty analyss and desgn plannng consderaton. Key words: Splt-plot Desgn, Randomzed Complete Block Desgn, Error varance, Senstvty Analyss and Desgn plannng. Introducton In expermental desgn, the Relatve Effcency (RE)of desgn say A to another desgn say B denoted as RE(A:B) s defned n terms of the number of replcates of desgn B requred to acheve the same result as one replcate of desgn A. In vew of ths, the relatve effcency of splt-plot desgn (SPD) to randomzed complete block desgn (RCBD) denoted as RE (SPD:RCBD) s the number of replcates of RCBD requred to acheve the same result as one replcate of SPD. Relatve effcency can be expressed n terms of percentage by multplyng t by 00. If RE (SPD:RCBD) > 00%, SPD s sad to be more effcent to RCBD and f RE(SPD:RCBD) 00% SPD s sad to be less effcent to RCBD. The relatve effcency of two desgns s mostly measured n terms of comparng ther error varances and the desgn wth the smallest varance s sad to be more effcent than the other. Ths measure of relatve effcency does not put nto consderaton the probablty of obtanng sgnfcant dfference or detectng sgnfcant dfference f they exst between the treatments. RCBD s sad to be more effcent to complete randomzed desgn (CRD) based on the comparson of ther error varance snce the error varance of RCBD s always smaller than that of complete randomzed desgn (CRD). There s a decrease n the error degree of freedom of RCBD compare to CRD and a decrease n the error degree of freedom leads to an ncrease n the tabulated value thereby reducng the probablty of obtanng a sgnfcant result snce the decson rule s always to reject the null hypothess f F-calculated s greater than F- tabulated. Based on ths assessment whch s senstvty analyss, CRD s sad to be more effcent than RCBD; n other words, the senstvty of RCBD s decreased. From above, t can be clearly seen that the relatve effcency of any two desgns cannot be best judged by consderng the rato of ther error West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 3