Methods of Detecting Outliers in A Regression Analysis Model.

Similar documents
Comparison of Regression Lines

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6

Statistics II Final Exam 26/6/18

Economics 130. Lecture 4 Simple Linear Regression Continued

Statistics for Economics & Business

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Statistics for Business and Economics

Statistics for Managers Using Microsoft Excel/SPSS Chapter 13 The Simple Linear Regression Model and Correlation

Chapter 11: Simple Linear Regression and Correlation

Statistics Chapter 4

2016 Wiley. Study Session 2: Ethical and Professional Standards Application

LINEAR REGRESSION ANALYSIS. MODULE VIII Lecture Indicator Variables

Correlation and Regression. Correlation 9.1. Correlation. Chapter 9

Durban Watson for Testing the Lack-of-Fit of Polynomial Regression Models without Replications

Chapter 15 - Multiple Regression

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands

Chapter 12 Analysis of Covariance

x = , so that calculated

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

Lecture 6: Introduction to Linear Regression

Department of Statistics University of Toronto STA305H1S / 1004 HS Design and Analysis of Experiments Term Test - Winter Solution

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Scatter Plot x

Joint Statistical Meetings - Biopharmaceutical Section

First Year Examination Department of Statistics, University of Florida

Basic Business Statistics, 10/e

Chapter 13: Multiple Regression

A Robust Method for Calculating the Correlation Coefficient

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu

Chapter 15 Student Lecture Notes 15-1

Topic- 11 The Analysis of Variance

Chapter 8 Indicator Variables

January Examinations 2015

Lecture 9: Linear regression: centering, hypothesis testing, multiple covariates, and confounding

7.1. Single classification analysis of variance (ANOVA) Why not use multiple 2-sample 2. When to use ANOVA

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

Lecture 9: Linear regression: centering, hypothesis testing, multiple covariates, and confounding

STAT 3008 Applied Regression Analysis

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

Topic 23 - Randomized Complete Block Designs (RCBD)

Chapter 6. Supplemental Text Material

DO NOT OPEN THE QUESTION PAPER UNTIL INSTRUCTED TO DO SO BY THE CHIEF INVIGILATOR. Introductory Econometrics 1 hour 30 minutes

Chapter 9: Statistical Inference and the Relationship between Two Variables

STATISTICS QUESTIONS. Step by Step Solutions.

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

[The following data appear in Wooldridge Q2.3.] The table below contains the ACT score and college GPA for eight college students.

Lecture 6 More on Complete Randomized Block Design (RBD)

Basically, if you have a dummy dependent variable you will be estimating a probability.

Lecture 3 Stat102, Spring 2007

Negative Binomial Regression

Linear regression. Regression Models. Chapter 11 Student Lecture Notes Regression Analysis is the

UNIVERSITY OF TORONTO Faculty of Arts and Science. December 2005 Examinations STA437H1F/STA1005HF. Duration - 3 hours

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

Chapter 14 Simple Linear Regression

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis

Polynomial Regression Models

Learning Objectives for Chapter 11

Systematic Error Illustration of Bias. Sources of Systematic Errors. Effects of Systematic Errors 9/23/2009. Instrument Errors Method Errors Personal

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

PHYS 450 Spring semester Lecture 02: Dealing with Experimental Uncertainties. Ron Reifenberger Birck Nanotechnology Center Purdue University

CHAPTER IV RESEARCH FINDING AND DISCUSSIONS

Lecture 16 Statistical Analysis in Biomaterials Research (Part II)

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Chapter 3 Describing Data Using Numerical Measures

x i1 =1 for all i (the constant ).

DERIVATION OF THE PROBABILITY PLOT CORRELATION COEFFICIENT TEST STATISTICS FOR THE GENERALIZED LOGISTIC DISTRIBUTION

ECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition)

Global Sensitivity. Tuesday 20 th February, 2018

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests

and V is a p p positive definite matrix. A normal-inverse-gamma distribution.

ANOVA. The Observations y ij

A Method for Analyzing Unreplicated Experiments Using Information on the Intraclass Correlation Coefficient

where I = (n x n) diagonal identity matrix with diagonal elements = 1 and off-diagonal elements = 0; and σ 2 e = variance of (Y X).

Statistical Evaluation of WATFLOOD

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

18. SIMPLE LINEAR REGRESSION III

Using Multivariate Rank Sum Tests to Evaluate Effectiveness of Computer Applications in Teaching Business Statistics

Regression. The Simple Linear Regression Model

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

Online Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting

Linear Regression Analysis: Terminology and Notation

28. SIMPLE LINEAR REGRESSION III

Regression Analysis. Regression Analysis

Chapter 5: Hypothesis Tests, Confidence Intervals & Gauss-Markov Result

Midterm Examination. Regression and Forecasting Models

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y)

Testing for seasonal unit roots in heterogeneous panels

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

Introduction to Regression

Answers Problem Set 2 Chem 314A Williamsen Spring 2000

/ n ) are compared. The logic is: if the two

Composite Hypotheses testing

Lecture Notes for STATISTICAL METHODS FOR BUSINESS II BMGT 212. Chapters 14, 15 & 16. Professor Ahmadi, Ph.D. Department of Management

x yi In chapter 14, we want to perform inference (i.e. calculate confidence intervals and perform tests of significance) in this setting.

Estimation: Part 2. Chapter GREG estimation

NANYANG TECHNOLOGICAL UNIVERSITY SEMESTER I EXAMINATION MTH352/MH3510 Regression Analysis

NUMERICAL DIFFERENTIATION

Homework Assignment 3 Due in class, Thursday October 15

Transcription:

Methods of Detectng Outlers n A Regresson Analyss Model. Ogu, A. I. *, Inyama, S. C+, Achugamonu, P. C++ *Department of Statstcs, Imo State Unversty,Owerr +Department of Mathematcs, Federal Unversty of Technology, Owerr ++Department of Mathematcs, Alvan Ikoku Federal College of Educaton, Owerr Abstract Ths study detects outlers n a unvarate and bvarate data by usng both Rosner s and Grubb s test n a regresson analyss model. The study shows how an observaton that causes the least square pont estmate of a Regresson model to be substantally dfferent from what t would be f the observaton were removed from the data set. A Bolers data wth dependent varable Y (man-hour) and four ndependent varables (Boler Capacty), (Desgn Pressure), 3 (Boler Type), 4 (Drum Type) were used. The analyss of the Bolers data revewed an unexpected group of Outlers. The results from the fndngs showed that an observaton can be outlyng wth respect to ts Y (dependent) value or (ndependent) value or both values and yet nfluental to the data set. Key Words: Outlners, unvarate, bvarate data, Regresson Analyss,.0 Bref Hstory and Background of Study Outlers are unusual data values that occur almost n all research projects nvolvng data collecton. Ths s especally true n observatonal studes where data naturally take on very unusual values, even f they come from relable sources. Although defntons vares. An outler s generally consdered to be a data pont that s far outsde the norm for a varable or populaton Jarrell [4], Rasmussen [5]) and Steven [6].. Causes of Outlers Outlers can arse from several dfferent mechansms or causes. Ascombe (960) sorts nto two major categores. Those arsng from errors n the data and those arsng from the nherent varablty of the data.. Identfcaton Of Outlers. There s no such thng as a smple test. However, there are many ways to look at a dstrbuton of numercal values, to see f certan ponts seem out of lne wth the majorty of the data. Ths can be acheved by: (I) By vsual Ads (II) By computaton of IQR (III) By plottng a scatter plot..3 Dealng wth Outlers There s a great deal of debates as what to do wth dentfed outlers. If West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 05

your data set contans an outler two questons arses () Are they merely fluke of some knd? () How much have the coeffcents error statstcs and predctons been affected?.0 Data Presentaton and Methodology Ths study ntends to examne the causes, problems, methods of detecton and approaches to data analyss of R + = ( ) ( ) S ( ) outler n a unvarate and Bvarate data. In order to do ths, a Broler data were collected from Kelly Uscategu, unversty of Connectcut on Brolers.. Method of Data Analyss.. ROSNER S TEST (Rosner,983) The procedure entals removng from the data set the observaton that s fartherest from the mean. The test statstc R s calculated and compared wth the crtcal value. The Rosner s R test where ( ) s g v e n b y n ( ) =... j n a n d j = n S = n j = ( ) ( ) ( ) ( j )... ( ),0... k L + Tabled Crtcal Value for Comparson wth R +... Test Crtera/Decson Rule Hypothess H 0 : There s no outler n the data set H AK : There s at least one outler n the data set. Decson Rule Reject H 0 f R + > L + at the stated level of sgnfcance otherwse do not reject H 0.. GRUBB S TEST (Grubb, 950) Grubb s test detects one outler at a tme. Ths outler s expunged from the data set and the test s terated untl no outler s detected. A test statstc G s calculated and compared wth the crtcal value. The Grubb s test statstc s gven by G = M a x Y S Y West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 06

Test Crtera/Decson Rule. Hypothess H O : There s no outler n the gven data set. H ak : There s outler n the gven data set. Decson Rule: Reject H O f G α N-I T, N N > N N T α + N N at a gven ( α ) level of sgnfcance, otherwse do not reject H O. Regresson analyss s an estmatng equaton whch expresses the functonal relatonshp between two or more varables as well take care of the error term whch s classfed nto; Smple lnear regresson and multple lnear regresson..3. Smple Lnear Regresson. Ths s the type of lnear regresson that nvolves only two varables one ndependent and one dependent plus the random error term. The smple lnear regresson model assumes that there s a straght lne (lnear) relatonshp between the dependent varable Y and the ndependent varable. Ths can be estmated by the least square estmate method expressed by..3 Regresson Analyss n n n n Y Y =I b = =I =I -------------------------- (4) n n n = I =I.3. Multple Lnear Regressons Multple lnear regresson analyses three or more varables and the random error term. Ths s expressed as follows. West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 07

Y = β + β + β +...+ β + E... (5) O k k y... k y... k Where Y = = M M M y n n n... kn nxk β = b e b e e = M M bk en kx nx T h e m atrx becom es. Y... n b e Y... n b e = + M M M M y... b k e n n n n n.4 : Brolers Data As Used In The Study Table Man- Hours Boler Capacty Desgn Pressure Boler Type Drum Type S/N Y 3 4 337 0000 375 3590 65000 750 3 456 50000 500 4 085 073877 70 0 5 403 50000 35 6 7606 60000 500 0 7 3748 8800 399 8 97 8800 399 9 363 8800 399 0 4065 90000 40 048 30000 35 6500 44000 40 3 565 44000 40 4 6565 44000 40 5 6387 44000 40 6 6454 67000 55 0 7 698 60000 500 0 8 468 50000 500 9 479 089490 970 0 0 680 5000 750 974 0000 375 0 West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 08

965 65000 750 0 3 566 50000 500 0 4 55 50000 50 0 5 000 50000 500 0 6 735 50000 35 0 7 3698 60000 500 0 0 8 635 90000 40 0 9 06 30000 35 0 30 3775 44000 40 0 3 30 44000 40 0 3 406 44000 40 0 33 4006 44000 40 0 34 378 67000 55 0 0 35 3 60000 500 0 0 36 00 30000 35 0 Note Y s the dependent varable whch x, x,x 3 and x 4 are the ndependent varables. For the purpose of ths study, the followng holds: Y represents man hours represent boler capacty represent desgn pressure 3 represent boler type 4 represent drum type 3.0 Data Analyss 3. Dependent Varable Y Usng Rosner s Test To Check For Outler In The Dataset 00 06 55 965 000 048 566 635 680 735 97 974 30 337 363 3 3590 3698 378 3738 3775 4006 403 4065 406 468 456 565 6387 6454 6500 6565 698 7606 085 479 I n- Y Sy () Y () R ] + λ+ α( 0.05) 0 O 36 490.7500 70.75 00.44.99 35 4379.057 688.965 479 3.87.98 34 407.835 06.96 085 3.348.97. The decson rule s to reject H O f R y. > λ + from our result above. R y. < λ +. That s.44 <.99 Accept H O. R y. > λ +. That s 3.87 >.98 Reject H O R y. 3 > λ + That s 3.348 >.97 Reject H O. We therefore conclude that observaton 479 and 085 are outlers. West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 09

3. 3 Rosner s Test On The Independent Varable The Data Becomes. 30000 30000 30000 65000 65000 8800 90000 90000 0000 0000 5000 5000 50000 5000 50000 50000 50000 50000 44000 44000 44000 4000 44000 44000 44000 44000 60000 60000 6000 60000 67000 67000 073877 089490 From the data above we have. n- ( ) S ( ) x ( ) R + λ ( α ) =0.05 0 36 3847.3056 847.7874 3000.05.99 35 3673.349 8093.5943 089490.74.98 34 30478.7353 55.9055 073877 3.060.97 + R. > λ + That s.05 <.99 accept H O R. < λ + That s.74 <.98 accept H O R x.3 > λ + That s 3.060 >.97 Reject H O Therefore only observaton 073877 s an outler n the data set of ndependent varable. 3.3 Grubb s Test The null and alternatve hypotheses are stated as fellows. H O : There are no outlers n the data set. H A : There s at least one outler n the data set. Crtcal regon = 36 4.3 36 36-+4.3 =.90 Grubb s Test on Y the Dependent Varable. Y = 90.7500, S = 70.75.. G 0.47 (accept H O ) <.90 Not an outler 0.59 (accept H O ) <.90 Not an outler 3 0.087 (accept H O ) <.90 Not an outler 4.47 (reject H O ) >.90 An outler 5 0.099 (accept H O ) <.90 Not an outler 6.7 (accept H O ) <.90 Not an outler 7 0.0 (accept H O ) <.90 Not an outler 8 0.488 (accept H O ) <.90 Not an outler 9 0.47 (accept H O ) <.90 Not an outler 0 0.084 (accept H O ) <.90 Not an outler 0.830 (accept H O ) <.90 Not an outler 0.87 (accept H O ) <.90 Not an outler 3 0.503 (accept H O ) <.90 Not an outler West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 0

4 0.84 (accept H O ) <.90 Not an outler 5 0.776 (accept H O ) <.90 Not an outler 6 0.800 (accept H O ) <.90 Not an outler 7 0.976 (accept H O ) <.90 Not an outler 8 0.008 (accept H O ) <.90 Not outler 9 3.885 (Reject H O ) >.90 An outler 0 0.596 (accept H O ) <.90 Not an outler 0.487 (accept H O ) <.90 Not an outler 0.86 (accept H O ) >.90 Not an outler 3 0.638 (accept H O ) <.90 Not an outler 4.07 (accept H O ) <.90 Not an outler 5 0.848 (accept H O ) <.90 Not an outler 6 0.576 (accept H O ) <.90 Not an outler 7 0.9 (accept H O ) <.90 Not an outler 8 0.63 (accept H O ) <.90 Not an outler 9.4 (accept H O ) <.90 Not an outler 30 0.9 (accept H O ) <.90 Not an outler 3 0.433 (accept H O ) <.90 Not an outler 3 0.03 (accept H O ) <.90 Not an outler 33 0.05 (accept H O ) <.90 Not an outler 34 0.08 (accept H O ) <.90 Not an outler 35 0.400 (accept H O ) <.90 Not an outler 36.44 (accept H O ) <.90 Not an outler Ths shows that observaton 4 and 9 are outlers on the dependent varable (Y) usng Grubb s Test Method. they can cause potental computatonal problem and thus nfluences problems. 4. Recommendaton 4.0 Concluson Havng carred out ths study The above dscussed statstcal tests successfully the followng are used to determne f expermental recommendatons were made. observatons are statstcal outlers n the (a) We recommend that data set. Of course effectve workng wth outlers n numercal data can be rather dffcult and frustratng experence. Nether gnorng nor deletng them at all wll be good soluton f you do nothng, you wll end up wth a model that descrbes essentally none of the data nether the bulk of the data nor the outlers. Even though your numbers may be perfectly legtmate, f they le expermenters should keep good record for each experment. All data should be recorded wth any possble explanaton or addtonal nformaton. (b) We recommend that analyst should employ robust statstcal methods. These methods are mnmally affected by outlers. outsde the verge of most of the data, West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03

References [] Anscombe, F.J. (960): Rejecton of Outlers Technometrcs,, 3-47. ]] Grubbs,F.E (950): Sample Crtera for Testng Outlyng observatons: Annals of Mathematcal scences. [3] Jarrell M.G. (994). A Comparson of two procedures, the Mabalanobs Dstance and the Andrews Pregbon statstcs for dentfyng multvarate outlers. Researchers n the schools, :49-58. [4] Rosner s Multple Outler Test Technometrcs 5, No May, (983), 65-7. [5] Rasmussen, J. L. (988): Evaluatng outler dentfcaton tests: Mahalanobs D Squared and Comrey, 3 (), 89-0. [6] Steven, J.P. (984). Outlers and Influental ponts n Regresson Analyss. Psychologcal Bulletn, 95, 339-344.. West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03

Relatve Effcency of Splt-plot Desgn (SPD) to Randomzed Complete Block Desgn (RCBD) Oladugba, A. V+, Onuoha, Desmond O*, Opara Pus N.++ +Department of Statstcs, Unversty of Ngera, Nsukka, *Dept of Maths/Statstcs, Fed. Polytechnc Nekede, Owerr, 0803544403, ++Datafeld Logstcs Servces, Port Harcourt, Rvers State. Abstract The relatve effcency of splt-plot desgn (SPD) to randomzed complete block desgn (RCBD) was computed usng ther error varance, senstvty analyss and desgn plannng. The result of ths work showed that conductng an experment usng splt-plot (SPD) wthout replcaton s more effcent to randomzed complete block desgn (RCBD) based on comparson of ther error varances, senstvty analyss and desgn plannng consderaton. Key words: Splt-plot Desgn, Randomzed Complete Block Desgn, Error varance, Senstvty Analyss and Desgn plannng. Introducton In expermental desgn, the Relatve Effcency (RE)of desgn say A to another desgn say B denoted as RE(A:B) s defned n terms of the number of replcates of desgn B requred to acheve the same result as one replcate of desgn A. In vew of ths, the relatve effcency of splt-plot desgn (SPD) to randomzed complete block desgn (RCBD) denoted as RE (SPD:RCBD) s the number of replcates of RCBD requred to acheve the same result as one replcate of SPD. Relatve effcency can be expressed n terms of percentage by multplyng t by 00. If RE (SPD:RCBD) > 00%, SPD s sad to be more effcent to RCBD and f RE(SPD:RCBD) 00% SPD s sad to be less effcent to RCBD. The relatve effcency of two desgns s mostly measured n terms of comparng ther error varances and the desgn wth the smallest varance s sad to be more effcent than the other. Ths measure of relatve effcency does not put nto consderaton the probablty of obtanng sgnfcant dfference or detectng sgnfcant dfference f they exst between the treatments. RCBD s sad to be more effcent to complete randomzed desgn (CRD) based on the comparson of ther error varance snce the error varance of RCBD s always smaller than that of complete randomzed desgn (CRD). There s a decrease n the error degree of freedom of RCBD compare to CRD and a decrease n the error degree of freedom leads to an ncrease n the tabulated value thereby reducng the probablty of obtanng a sgnfcant result snce the decson rule s always to reject the null hypothess f F-calculated s greater than F- tabulated. Based on ths assessment whch s senstvty analyss, CRD s sad to be more effcent than RCBD; n other words, the senstvty of RCBD s decreased. From above, t can be clearly seen that the relatve effcency of any two desgns cannot be best judged by consderng the rato of ther error West Afrcan Journal of Industral and Academc Research Vol.7 No. June 03 3