Indices of Distances: Characteristics and Detection of Abnormal Points


International Journal of Mathematics and Computer Science, 8(2013), no. 2, 55–68

Hicham Y. Abdallah
Department of Applied Mathematics, Faculty of Science-1, Lebanese University, Hadath, Lebanon
email: habdalah@ul.edu.lb

(Received September 5, 2012, Accepted November 1, 2013)

Abstract

Getting a robust regression requires the detection of abnormal points. In this article we give a solution to this problem based on Cook's distance, DFFITS, DFBETAS, among others. We then compare the bounds for those distances which are used to detect abnormal points.

Key words and phrases: Outliers, influential points, distances.
AMS (MOS) Subject Classifications: 62J07, 62J12.
ISSN 1814-0424 © 2013, http://ijmcs.future-in-tech.net

1 Introduction

Statistics is the science whose object is to collect, process and analyze data from the observation of random phenomena, that is, phenomena in which chance plays a part. Data analysis is used to describe these phenomena and to make predictions and decisions about them. Several statistical methods are available for such studies, but the most widely used is regression. Despite its effectiveness, influential points and outliers make it less robust and degrade its results. Our objective is to address this problem by showing the difference between influential points and outliers. We begin our study with a definition of influential points and outliers and then discuss the different methods for detecting abnormal values. In addition, we propose a theoretical comparison between the different indices to find the most effective index for detecting abnormal points.

2 Influential Points and Outliers

The study of the residuals $y_i - \hat{y}_i$, where $\hat{y}$ is the prediction of $y$, can identify outlying observations or observations that play an important role in determining the regression. The two types of abnormal points are outliers and influential points. The first correspond to observations outside the norm, while the second are points that weigh (unduly) on the estimates: if they were removed, the results would differ significantly. There are several methods for detecting these values according to their type. Once these observations have been identified, it may be better to remove them or to use other, more robust criteria.

1 - Influential points: An influential point weighs heavily in the regression; that is to say, the results are quite different depending on whether or not the point is taken into account. The problem of influential values arises especially in business surveys, where the economic variables collected have highly asymmetric distributions. Influential values are problematic because they generally lead to unstable estimators. In other words, including or excluding an influential sample value usually has a significant impact on the volatility of an estimator. It is possible to minimize their impact through an appropriate sampling plan. However, it is generally not possible to completely eliminate the problem of influential values at each step of the plan. As a result, it is important to develop estimation methods that are robust in the presence of influential values.

2 - Outliers: Before presenting concepts related to outliers, it is necessary to define them more precisely, as many authors have attempted to describe them and the definitions have changed over time. Among these authors, Grubbs defined these values as follows: an outlier is an observation that appears to deviate markedly from the other members of the sample in which it occurs.
In 1994, Barnett and Lewis defined an outlier in a data set as an observation (or set of observations) which appears to be inconsistent with the other data. That is, outliers correspond to observations outside the norm of the population studied. These points can distort the results of the regression.

3 The methods for detection of abnormal values

1 - Detection of outliers: The indices that help us detect outliers are:

a - Standardized residual $e'_i$: This residual compares $y_i$ with $\hat{y}_i$, the prediction being computed in the presence of the ith observation. An abnormally large standardized residual is considered suspect. More precisely, the model is correct 95% of the time if one has

$|e'_i| \le t(0.025;\, n-p)$, where $e'_i = \dfrac{e_i}{s\sqrt{1-h_{ii}}}$.

The observations that qualify as outliers belong to the region $|e'_i| > t(0.025;\, n-p)$.

b - Student residual $e^*_i$: This residual compares $y_i$ with $\hat{y}_i$ computed without the ith observation. In this case, an abnormally large Student residual is considered suspect. For such observations we have $|e^*_i| > t(0.025;\, n-p-1)$.

In practice, we use the following formulas:

$e^*_i = \dfrac{e_{(-i)}}{s_{(-i)}\sqrt{1-h_{ii}}}$, $\quad e_i = y_i - \hat{y}_i$, $\quad h_{ii} = x_i (X'X)^{-1} x_i'$,

where $s$ is the standard deviation of the residuals, $x_i$ is the ith row of $X$, $e_{(-i)}$ is the residual without the ith observation, and $s_{(-i)}$ is the standard deviation without the ith observation.
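The two residuals above can be sketched numerically as follows. This is a minimal illustration, not the computation performed in the paper: the function name `residual_diagnostics` and the synthetic data are assumptions, the design matrix is assumed to contain an intercept plus p explanatory variables, and the Student residual is computed with the usual closed form $e_i / (s_{(-i)}\sqrt{1-h_{ii}})$, where $s_{(-i)}$ is obtained without refitting the model.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Return leverages, standardized residuals and Student residuals.

    X is an (n, p) array of explanatory variables WITHOUT the intercept
    column; y is the response vector of length n.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])          # design matrix with intercept
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta = ZtZ_inv @ Z.T @ y                      # least-squares coefficients
    e = y - Z @ beta                              # raw residuals e_i
    h = np.einsum("ij,jk,ik->i", Z, ZtZ_inv, Z)   # leverages h_ii = z_i (Z'Z)^-1 z_i'
    dof = n - p - 1                               # residual degrees of freedom
    s2 = e @ e / dof                              # residual variance s^2
    standardized = e / np.sqrt(s2 * (1.0 - h))
    # Variance without observation i, using SSE_(-i) = SSE - e_i^2 / (1 - h_ii)
    s2_loo = (dof * s2 - e**2 / (1.0 - h)) / (dof - 1)
    student = e / np.sqrt(s2_loo * (1.0 - h))
    return h, standardized, student

# Synthetic data (an assumption, used only to exercise the function)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(size=30)
h, r_std, r_stu = residual_diagnostics(X_demo, y_demo)
suspect = np.flatnonzero(np.abs(r_stu) > 2)       # rule of thumb |residual| > 2
```

The cut-off of 2 is the rounded rule of thumb; an exact Student quantile could be substituted for it.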

The previous two criteria help detect potentially influential observations through the size of their residuals. This information is synthesized in criteria that directly assess the influence of an observation on certain parameters. All these indicators compare a parameter estimated without the ith observation with the same parameter estimated from all the observations.

c - Treatment of outliers: Having presented the main methods of outlier detection, we now turn to the treatment of outliers in order to obtain a well-specified regression model. Several treatments are possible:

Reject: eliminate the extreme values identified by a discordancy test.
Incorporate: change the distribution model in order to incorporate the outliers.
Identify: keep the outliers, since they may represent particularly important features.
Accommodate: adopt statistical methods that minimize the impact of outliers on the statistical analysis.

2 - Detection of influential points: The indices that help us detect influential points are as follows:

a - Leverage point: In regression analysis, we call a leverage point an observation i that significantly affects the estimators because its values on the explanatory variables differ greatly from those of the rest of the data. The leverage indicates the distance between observation i and the centre of gravity of the cloud of points. The leverage $h_{ii}$ of an observation is read from the main diagonal of the hat matrix, and it measures the influence of the ith observation on its own prediction. In practice, an observation i is considered a leverage point if $h_{ii} > \frac{2(p+1)}{n}$. Note also that an observation with $h_{ii}$ approaching 1 has a very large leverage.

b - Cook's distance: The Cook's distance of an observation is a measure of the influence of this observation on the whole set of predictions of a model. One calculates a distance between the vector $\hat\beta$ of regression coefficients and the vector $\hat\beta(-i)$ obtained by repeating the regression without observation i.

When Cook's distance is normalized, a value greater than 1 is suspect. However, a limit of $\frac{4}{n-p-1}$ is often better, as the calibration of 1 can let influential values pass undetected.
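As a brief aside on the leverage cut-off stated earlier in this subsection, the following worked equation sketches where $2(p+1)/n$ comes from. It assumes that $X$ contains an intercept column together with the p explanatory variables (so $p+1$ columns in all) and has full column rank; the symbol $\bar{h}$ for the mean leverage is introduced here for illustration only.

```latex
\begin{align*}
H &= X (X'X)^{-1} X', \qquad h_{ii} = x_i (X'X)^{-1} x_i', \\
\sum_{i=1}^{n} h_{ii} &= \operatorname{tr}(H)
   = \operatorname{tr}\!\big((X'X)^{-1} X'X\big) = p + 1, \\
\bar{h} &= \frac{p+1}{n}, \qquad
\text{rule: } h_{ii} > 2\,\bar{h} = \frac{2(p+1)}{n}.
\end{align*}
```

Read this way, the rule flags any observation whose leverage is more than twice the average leverage.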

In practice, we have the following two formulas in this case:

$D_i = \dfrac{\sum_{j=1}^{n} \big(\hat{y}_j - \hat{y}_j(-i)\big)^2}{\hat\sigma_\varepsilon^2\,(p+1)}$

$D_i = \dfrac{\big(\hat\beta - \hat\beta(-i)\big)'\,(X'X)\,\big(\hat\beta - \hat\beta(-i)\big)}{\hat\sigma_\varepsilon^2\,(p+1)}$

c - DFBETAS: The purpose of this index is to measure the influence of a point on an estimated coefficient. It is normalized so as to be comparable from one variable to another. A suspect observation is such that $|\mathrm{DFBETAS}| > \frac{2}{\sqrt{n}}$.

Note: If there are many variables, we first consider the globally influential observations (Cook) and then, for these observations, the variable(s) causing that influence (DFBETAS). In practice,

$(\mathrm{DFBETAS})_{j,i} = \dfrac{\hat\beta_j - \hat\beta_j(-i)}{s_{(-i)}\sqrt{(X'X)^{-1}_{j,j}}}$,

where $(X'X)^{-1}_{j,j}$ is the jth entry of the diagonal of $(X'X)^{-1}$.

3 - Other criteria:

a - DFFITS: The DFFITS of an observation is a measure of the influence of this observation on the prediction of its own value by the model. It gives the difference between the fitted value for observation i and the predicted value of y for i in the model estimated without this observation. We consider an observation influential when

$|\mathrm{DFFITS}_i| > 2\sqrt{\dfrac{p+1}{n}}$, where $\mathrm{DFFITS}_i = \dfrac{\hat{y}_i - \hat{y}_i(-i)}{s_{(-i)}\sqrt{h_{ii}}} = e^*_i \sqrt{\dfrac{h_{ii}}{1-h_{ii}}}$.

b - COVRATIO: The COVRATIO measures disparities between the precision of the estimators, that is to say, between the generalized variances of the estimators, given by

$\mathrm{Var}(\hat\beta) = s^2 \det\big((X'X)^{-1}\big)$.

The presence of observation i improves the accuracy, in the sense that it reduces the variance of the estimators, if COVRATIO > 1. Conversely, COVRATIO < 1 indicates that the presence of observation i degrades the variance. But the most common detection rule is

$|\mathrm{COVRATIO}_i - 1| > \dfrac{3(p+1)}{n}$.

V - Theoretical comparison of distances

Based on the critical regions of these different distances, we can choose the one that gives the best detection result.

Rule: The method with the smallest critical region is the most accurate.

So we compare these regions:

For leverage and COVRATIO: the critical value for the leverage is $\frac{2(p+1)}{n}$ and that of COVRATIO is $\frac{3(p+1)}{n} + 1$; then $\frac{2(p+1)}{n} < \frac{3(p+1)}{n} + 1$, and so the leverage is more accurate than COVRATIO.

For DFBETAS and DFFITS: the critical value for DFBETAS is $\frac{2}{\sqrt{n}}$ and that for DFFITS is $2\sqrt{\frac{p+1}{n}}$; then $\frac{2}{\sqrt{n}} < 2\sqrt{\frac{p+1}{n}}$, and therefore DFBETAS is more accurate than DFFITS.

For Cook's distance and leverage: the critical value for the leverage is $\frac{2(p+1)}{n}$ and that of Cook's distance is $\frac{4}{n-p-1}$, and so $\frac{4}{n-p-1} < \frac{2(p+1)}{n}$.

For Cook's distance and DFBETAS: the critical value for DFBETAS is $\frac{2}{\sqrt{n}}$ and that of Cook's distance is $\frac{4}{n-p-1}$, and so $\frac{4}{n-p-1} < \frac{2}{\sqrt{n}}$.

Comparing the critical regions of the criteria mentioned above,

$\frac{4}{n-p-1} < 1 < 2$, $\quad \frac{2}{\sqrt{n}} < 1 < 2$, $\quad \frac{2(p+1)}{n} < 2$ for $p+1 < n$, $\quad$ and $\frac{3(p+1)}{n} + 1 > 2$,

shows that Cook's distance (DCOOK) is the best distance for detecting suspect points.
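The influence indices defined in this section can be sketched by brute force, refitting the regression once without each observation. This is an illustrative sketch under stated assumptions, not the SAS computation used later in the paper: the names `fit_ols` and `influence_measures` and the synthetic data are hypothetical, the design matrix is assumed to include an intercept, and COVRATIO is computed with the common definition $\det\big(s_{(-i)}^2 (X_{(-i)}'X_{(-i)})^{-1}\big) / \det\big(s^2 (X'X)^{-1}\big)$.

```python
import numpy as np

def fit_ols(Z, y):
    """Least-squares fit of y on the columns of Z (intercept column included)."""
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta = ZtZ_inv @ Z.T @ y
    resid = y - Z @ beta
    s2 = resid @ resid / (len(y) - Z.shape[1])    # residual variance
    return beta, s2, ZtZ_inv

def influence_measures(X, y):
    """Leverage, Cook's D, DFBETAS, DFFITS, COVRATIO via leave-one-out refits."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])
    beta, s2, ZtZ_inv = fit_ols(Z, y)
    h = np.einsum("ij,jk,ik->i", Z, ZtZ_inv, Z)   # leverages h_ii
    cook = np.empty(n)
    dffits = np.empty(n)
    covratio = np.empty(n)
    dfbetas = np.empty((n, p + 1))
    for i in range(n):
        keep = np.arange(n) != i                  # drop observation i
        beta_i, s2_i, ZtZ_inv_i = fit_ols(Z[keep], y[keep])
        diff = beta - beta_i
        cook[i] = diff @ (Z.T @ Z) @ diff / (s2 * (p + 1))
        dfbetas[i] = diff / np.sqrt(s2_i * np.diag(ZtZ_inv))
        dffits[i] = Z[i] @ diff / np.sqrt(s2_i * h[i])
        covratio[i] = np.linalg.det(s2_i * ZtZ_inv_i) / np.linalg.det(s2 * ZtZ_inv)
    return h, cook, dfbetas, dffits, covratio

# Synthetic data (an assumption, used only to exercise the functions)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 2))
y_demo = 0.5 + X_demo @ np.array([1.5, -0.7]) + rng.normal(size=40)
h, cook, dfbetas, dffits, covratio = influence_measures(X_demo, y_demo)

n, p = X_demo.shape
flags = {
    "leverage": np.flatnonzero(h > 2 * (p + 1) / n),
    "cook": np.flatnonzero(cook > 4 / (n - p - 1)),
    "dfbetas": np.flatnonzero((np.abs(dfbetas) > 2 / np.sqrt(n)).any(axis=1)),
    "dffits": np.flatnonzero(np.abs(dffits) > 2 * np.sqrt((p + 1) / n)),
    "covratio": np.flatnonzero(np.abs(covratio - 1) > 3 * (p + 1) / n),
}
```

Refitting n times is wasteful for large files but mirrors the definitions directly. The values agree with the standard closed forms, for example $D_i = \frac{(e'_i)^2}{p+1}\cdot\frac{h_{ii}}{1-h_{ii}}$ with $e'_i = e_i/(s\sqrt{1-h_{ii}})$, which is one way to see how Cook's distance combines the residual and the leverage.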

4 Relationships between the distances

The purpose of this section is to explain the relationships among the different distances.

Leverage and Cook's distance: The leverage measures, horizontally, the difference between the observed point and the mean $\bar{X}$ of the explanatory variable; it depends only on the values of that variable. Cook's distances, for their part, measure in a sense the overall importance of the horizontal and vertical gaps. In general, a point can have a large residual without being very influential, if its leverage is not very high. Similarly, a point may have a large leverage without being particularly influential, if its residual is small. A point is therefore influential in the sense of Cook's distance only if both its residual and its leverage are large.

DFFITS and Cook's distance: Cook's distance and DFFITS both depend on the leverage, and Cook's distance can be expressed as a function of the leverage and the Student residual, and even of DFFITS. This shows that the observations with high leverage have the highest values of DCook and DFFITS and thus have a great influence on the predictions of the model.

COVRATIO, DFFITS and DFBETAS: Comparing the rule for COVRATIO with those for DFFITS and DFBETAS, we note that the latter depend on the sample size through $\sqrt{n}$, while the rule for COVRATIO depends on $n$.

DCook and DFBETAS: If there are many variables, we first look at the globally influential observations (DCook) and then, for these observations, at the variable(s) causing that influence (DFBETAS).

5 Summary table

The following table summarizes the rules for the detection of abnormal points.

| Index | Result | Purpose |
|---|---|---|
| Standardized residuals | $\lvert e'_i \rvert > 2$: the ith residual is significantly different from 0 | Detection of a large residual, thus of an atypical observation |
| Student residuals | $\lvert e^*_i \rvert > 2$: the ith observation requires investigation | Detection of a large residual, thus of an atypical observation |
| Leverage point | $h_{ii} > \frac{2(p+1)}{n}$ | Measure the influence of observation i on the estimators |
| Cook's distance | CookD > 1 indicates an abnormal effect | Measure the effect of removing observation i on the predicted values |
| DFBETAS | $\lvert \mathrm{DFBETAS} \rvert > \frac{2}{\sqrt{n}}$ | Measure the influence of a point on the estimated coefficient |
| COVRATIO | $\lvert \mathrm{COVRATIO} - 1 \rvert > \frac{3(p+1)}{n}$ | Measure the effect of the ith observation on accuracy |
| DFFITS | $\lvert \mathrm{DFFITS} \rvert > 2\sqrt{\frac{p+1}{n}}$ | Measure the influence of observation i on the prediction of its own value |

6 Practical Application

To illustrate our study, we use a real sample of 100 students at the Faculty of Science of the Lebanese University, taking as variables the average scores in the Master's and in the first, second and third years. To detect outliers in this example, we first perform a regression explaining the Master's marks by two explanatory variables, namely the first-year average and the combined average of the second and third years; as a second step, we determine the critical regions of the indices cited in the study:

| Student number | Major | Master's average (AM) | Average of second and third years (A23) | Average of first year (A1) |
|---|---|---|---|---|
| 1 | biology | 71.48 | 64.92 | 50.17 |
| 2 | biology | 76.35 | 70.82 | 60 |
| 3 | Chemical | 61.55 | 64.13 | 53 |
| 4 | biochemistry | 66.37 | 66.86 | 65.17 |
| 5 | biology | 72.13 | 64.55 | 67 |
| 6 | biochemistry | 81.18 | 79.26 | 79.92 |
| 7 | biology | 70.85 | 63.13 | 61.67 |
| 8 | biochemistry | 76 | 77.54 | 76.50 |
| 9 | biology | 80.40 | 79.20 | 69.50 |
| 10 | Fundamental | 71.37 | 68.52 | 60.17 |
| 11 | Electronics | 69.37 | 63.52 | 51.33 |
| 12 | biology | 81.57 | 71.88 | 57.67 |
| 13 | biology | 74 | 71 | 66 |
| 14 | biochemistry | 58.85 | 62.48 | 58.25 |
| 15 | biology | 81.08 | 72.30 | 60.75 |
| 16 | biochemistry | 64.17 | 67.86 | 51.25 |
| 17 | mathematical | 63.50 | 64.85 | 69.42 |
| 18 | chemistry molecular | 66.07 | 68.44 | 55.50 |
| 19 | Molecular Chemistry | 77.13 | 78.58 | 84.67 |
| 20 | biochemistry | 77.62 | 74.92 | 56.33 |
| 21 | Fundamental | 71.58 | 68.68 | 55.75 |
| 22 | Electronics | 61.78 | 61.81 | 52.83 |
| 23 | biochemistry | 72.18 | 69.63 | 64.75 |
| 24 | Fundamental | 66.92 | 69.66 | 63.33 |
| 25 | biology | 75.60 | 66.67 | 50 |
| 26 | Biology: Elective | 71.37 | 68.65 | 50 |
| 27 | Chemical | 67.92 | 72.19 | 62.50 |
| 28 | Computer | 63.42 | 70.90 | 66.25 |
| 29 | biology | 66.72 | 65.43 | 60.67 |
| 30 | biochemistry | 64.83 | 67.02 | 59 |
| 31 | biology | 73.07 | 70.17 | 50 |
| 32 | biochemistry | 66.30 | 67.40 | 59.75 |
| 33 | biochemistry | 59.95 | 66.28 | 56.17 |
| 34 | chemistry molecular | 75.20 | 76.25 | 74.48 |
| 35 | Electronics | 69 | 61.94 | 53 |
| 36 | biochemistry | 66.33 | 68.93 | 68.92 |
| 37 | biochemistry | 68.85 | 66.55 | 57.48 |
| 38 | Computer | 66.02 | 63.22 | 58.50 |
| 39 | Fundamental | 74.42 | 72.54 | 63.83 |
| 40 | Chemical | 65.63 | 68.66 | 63 |
| 41 | biochemistry | 76.17 | 72.53 | 61.17 |
| 42 | chemistry molecular | 67.65 | 63.63 | 50 |
| 43 | biochemistry | 59.30 | 62.28 | 59.33 |
| 44 | biochemistry | 58.65 | 61.06 | 56 |
| 45 | biology | 77.08 | 71.85 | 78.25 |
| 46 | biology | 76.23 | 68.23 | 54.17 |
| 47 | chemistry molecular | 58.60 | 60.67 | 50 |
| 48 | Chemistry option Environmental Sciences | 81.05 | 83.88 | 82.67 |
| 49 | biology | 81.83 | 80.37 | 80.83 |
| 50 | biology | 70.88 | 71.83 | 51.42 |
| 51 | biology | 73.48 | 66.62 | 62.08 |
| 52 | Electronics | 68.63 | 64.34 | 52.92 |
| 53 | biochemistry | 68.08 | 66.31 | 63.58 |
| 54 | biology | 67.75 | 62.68 | 57 |
| 55 | biology | 62.03 | 68.90 | 64.83 |
| 56 | Computer | 69.62 | 77.71 | 77.67 |
| 57 | biology | 70.80 | 64.92 | 66.17 |
| 58 | Electronics | 76.88 | 72.21 | 54.50 |
| 59 | biochemistry | 77.33 | 76.10 | 78.42 |
| 60 | Computer | 61.03 | 63.53 | 66.08 |
| 61 | biochemistry | 76.60 | 75.39 | 69.83 |
| 62 | biochemistry | 69.73 | 64.29 | 53.50 |
| 63 | biology | 87.97 | 83.12 | 82.08 |
| 64 | chemistry molecular | 67.35 | 69.60 | 50 |
| 65 | Chemical | 59.73 | 63.75 | 52.50 |
| 66 | biology | 77.93 | 70.55 | 50.50 |
| 67 | biochemistry | 66.45 | 67.41 | 59.58 |
| 68 | biochemistry | 62.30 | 66.28 | 50 |
| 69 | biology | 76.25 | 66.63 | 56.92 |
| 70 | chemistry molecular | 72.22 | 71.83 | 62.33 |
| 71 | Fundamental | 78.53 | 82.57 | 74.50 |
| 72 | chemistry molecular | 69.80 | 69.81 | 52.08 |
| 73 | Computer | 68.40 | 67.17 | 67.58 |
| 74 | Fundamental | 69.95 | 66.58 | 50 |
| 75 | biochemistry | 70.05 | 66.38 | 60.83 |
| 76 | Chemical | 66.13 | 68.80 | 64.42 |
| 77 | biochemistry | 70.22 | 64.78 | 52.83 |
| 78 | Computer | 64.37 | 71.80 | 60.42 |
| 79 | Electronics | 68.68 | 63.58 | 57.67 |
| 80 | biochemistry | 65.35 | 60.78 | 59.58 |
| 81 | mathematics | 74.50 | 76.20 | 84.33 |
| 82 | biochemistry | 66.80 | 67.93 | 54.92 |
| 83 | Electronics | 65.90 | 67.24 | 61.17 |
| 84 | mathematics | 57.30 | 61.10 | 67.75 |
| 85 | Electronics | 67.80 | 63.17 | 50.58 |
| 86 | biochemistry | 64.07 | 64.26 | 56 |
| 87 | Electronics | 68.42 | 58.47 | 50.75 |
| 88 | biochemistry | 69.92 | 68.70 | 67.17 |
| 89 | biochemistry | 71.73 | 72.03 | 73.42 |
| 90 | Fundamental | 64.67 | 71.22 | 53.92 |
| 91 | Computer | 64.17 | 68.44 | 59 |
| 92 | biology | 71.52 | 63.87 | 60 |
| 93 | Electronics | 67.55 | 64.36 | 50 |
| 94 | biology | 70.30 | 65.55 | 67.50 |
| 95 | Computer | 59.60 | 60.85 | 57.42 |
| 96 | biochemistry | 71.95 | 70.32 | 59.58 |
| 97 | biology | 73.35 | 70.33 | 53.08 |
| 98 | biochemistry | 71.22 | 67.13 | 58.75 |
| 99 | Electronics | 77.57 | 76.54 | 56.08 |
| 100 | Biology: Elec- | 65.30 | 62.98 | 50 |
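As a quick check on this data set, the sketch below evaluates the detection cut-offs for n = 100 and p = 2, together with the raw residuals of three observations under the fitted equation reported in the Interpretation section. The helper names `predict_am` and `critical_values` are hypothetical. Only three rows of the table are copied in, and the coefficients are the rounded published values, so the residuals are approximate.

```python
import math

# Coefficients of the fitted model AM = b0 + b1*A23 + b2*A1, as reported
# in the Interpretation section (rounded published values).
INTERCEPT, B_A23, B_A1 = 10.78164, 0.92954, -0.07701

def predict_am(a23, a1):
    """Predicted Master's average from the two explanatory averages."""
    return INTERCEPT + B_A23 * a23 + B_A1 * a1

def critical_values(n, p):
    """Cut-offs of the detection rules for n observations and p variables."""
    return {
        "leverage": 2 * (p + 1) / n,
        "cook": 4 / (n - p - 1),
        "dfbetas": 2 / math.sqrt(n),
        "dffits": 2 * math.sqrt((p + 1) / n),
        "covratio": 3 * (p + 1) / n,
    }

# (AM, A23, A1) copied from rows 12, 56 and 78 of the table above
rows = {
    12: (81.57, 71.88, 57.67),
    56: (69.62, 77.71, 77.67),
    78: (64.37, 71.80, 60.42),
}

# Raw residual e_i = observed AM minus predicted AM
residuals = {k: am - predict_am(a23, a1) for k, (am, a23, a1) in rows.items()}
cutoffs = critical_values(n=100, p=2)

for k, e in residuals.items():
    print(f"observation {k}: residual = {e:.2f}")
for name, value in cutoffs.items():
    print(f"{name}: {value:.4f}")
```

The residuals printed here are raw, not standardized; dividing by $s\sqrt{1-h_{ii}}$ requires the full fit on all 100 rows.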

7 Interpretation

The proposed study has two explanatory variables (p = 2) and 100 observations (n = 100). The fitted model is

AM = 10.78164 + 0.92954 A23 − 0.07701 A1.

At the chosen threshold, using the SAS software, we obtained the following results:

Using the Student residual and the standardized residual, each observation whose value exceeds 2 in absolute value is abnormal. Examining the residuals shows that observation 12 (A1 = 57.67, A23 = 71.88, AM = 81.57), with a Student residual of 2.0624, is the first atypical value, and that observation 78 (A1 = 60.42, A23 = 71.80, AM = 64.37), with STUDENT RESIDUAL and RSTUDENT values given as 2035 and −2.0694, is the second atypical value.

Using the leverage, each observation with $h_{ii}$ greater than $\frac{2(p+1)}{n} = \frac{2(2+1)}{100} = 0.06$ is an unusual observation. The software gives us the diagonal values of the hat matrix. We take a few examples: observation 6 (A1 = 79.92, A23 = 79.26, AM = 81.18) has leverage h = 0.0611; observation 9 (A1 = 69.50, A23 = 79.20, AM = 80.40) has h = 0.0801; observation 48 (A1 = 82.67, A23 = 83.88, AM = 81.05) has h = 0.0966; ...

Using Cook's distance, each observation with D greater than $\frac{4}{n-p-1} = \frac{4}{100-2-1} = 0.041$ is an unusual observation. Examining the COOK column, we found that observations 56 (A1 = 77.67, A23 = 77.71, AM = 69.62) with D = 0.056, 63 (A1 = 82.08, A23 = 83.12, AM = 87.97) with D = 0.080, and 84 (A1 = 67.75, A23 = 61.10, AM = 57.30) with D = 0.045 are outliers.

For DFBETAS, each observation whose value for a variable exceeds $\frac{2}{\sqrt{n}} = \frac{2}{\sqrt{100}} = 0.2$ in absolute value is unusual for that variable. For the variable A23, the unusual observations are 5 (A23 = 64.55), 7 (63.13), 12 (71.88), 63 (83.12), 66 (70.55), 84 (61.10), 87 (58.47) and 90 (71.22); for the variable A1, they are 5 (A1 = 67.00), 12 (57.67), 25 (50.00), 45 (78.25), 66 (50.50), 84 (67.75) and 90 (53.92).

Using COVRATIO, each observation with $|\mathrm{COVRATIO} - 1| > \frac{3(p+1)}{n}$, a threshold reported as 0.07, is an unusual observation (with p = 2 and n = 100 the formula itself evaluates to 0.09). It follows that observations 6 (A1 = 79.92, A23 = 79.26, AM = 81.18), 8 (A1 = 76.50, A23 = 77.54, AM = 76.00), 9 (A1 = 69.50, A23 = 79.20, AM = 80.40), 12 (A1 = 57.67, A23 = 71.88, AM = 81.57), 19 (A1 = 84.67, A23 = 78.58, AM = 77.13) and 20 (A1 = 56.33, A23 = 74.92, AM = 77.62) are outliers.

Finally, we note that the outliers differ from one distance to another; this is due to the differences between the distances. In our project, comparing the outliers obtained using the different distances shows that DCook is best. The results given by DCOOK are as follows:

- for individual 56, the averages in the first three years (A1 = 77.67, A23 = 77.71) are higher than the Master's average (AM = 69.62);
- for individual 63, the averages in the first three years (A1 = 82.08, A23 = 83.12) are lower than the Master's average (AM = 87.97);
- for individual 84, the averages in the first three years (A1 = 67.75, A23 = 61.10) are higher than the Master's average (AM = 57.30);
- for individual 87, the averages in the first three years (A1 = 50.75, A23 = 58.47) are lower than the Master's average (AM = 68.42).

Thus, these results are consistent with our study: these points are atypical because the explanatory variables and the variable to be explained vary in opposite directions.

8 Conclusion

The observations contained in databases must absolutely be validated, because the appearance of outliers is inevitable given the quality of the data processed and the various sources of error during acquisition. To ensure high-quality information, a search for outliers must be carried out before the databases are used. This article has helped us differentiate an outlier from an influential point. We also studied several methods for the detection of outliers, showing that the index best suited to the detection of abnormal points is Cook's distance. Simulations on large files are the subject of current research.