SOME METHODS OF DETECTION OF OUTLIERS IN LINEAR REGRESSION MODEL


RANJIT KUMAR PAUL, M.Sc. (Agricultural Statistics), Roll No.
IASRI, Library Avenue, New Delhi
Chairperson: Dr. L. M. Bhar

Abstract: An outlier is an observation that deviates markedly from the majority of the data. To know which observations have greater influence on the parameter estimates, detection of outliers is very important. There are several methods for the detection of outliers available in the literature, and a good number of test statistics for detecting outliers have been developed. In contrast to detection, outliers are also tackled through robust regression techniques such as the M-estimator and Least Median of Squares (LMS). Robust regression provides parameter estimates that are insensitive to the presence of outliers and also helps to detect outlying observations. Recently, the Forward Search (FS) method has been developed, in which a small number of robustly chosen observations are used to fit a model by the Least Squares (LS) method; more observations are then included in subsequent steps. This forward search procedure provides a wealth of information not only for outlier detection but, much more importantly, on the effect of each observation on aspects of inference about the model. It also reveals the masking problem, if present, very nicely.

Key words: Outlier, Leverage Point, Least Squares (LS), Least Median of Squares (LMS), Robust Regression, Forward Search (FS), Masking.

1. Introduction
No observation can be guaranteed to be a totally dependable manifestation of the phenomenon under study. The probable reliability of an observation is reflected by its relationship to other observations that were obtained under similar conditions. Observations that, in the opinion of the investigator, stand apart from the bulk of the data have been called outliers, extreme observations, discordant observations, rogue values, contaminants, surprising values, mavericks or dirty data. An outlier is one that appears to deviate markedly from the other members of the sample in which it occurs.
An outlier is a data point that is located far from the rest of the data. Given a mean and standard deviation, a statistical distribution expects data points to fall within a specific range; those that do not are called outliers and should be investigated. The sources of influential subsets are diverse. First, there is the inevitable occurrence of improperly recorded data, either at their source or in their transcription to computer-readable form. Second, observational errors are often inherent in the data. Although procedures more appropriate than ordinary least squares exist for estimation in this situation, the diagnostics may reveal the unsuspected existence and severity of observational errors. Third, outlying data points may be legitimately occurring extreme observations. Such data often contain valuable information that improves estimation efficiency by its presence. Even in this beneficial situation, however, it is constructive to isolate extreme points and to determine the extent to which the parameter estimates depend on these desirable data. Fourth, since the data could have been generated by model(s) other than that specified, diagnostics may reveal patterns suggestive of these alternatives.

The fact that a small subset of the data can have a disproportionate influence on the estimated parameters or predictions is of concern to users of regression analysis. It is quite possible that the model estimates are based primarily on this data subset rather than on the majority of the data. When a regression model is fitted by least squares, the estimated parameters of the fitted model depend on a few statistics aggregated over all the data. If some of the observations are different in some way from the bulk of the data, the overall conclusion drawn from the data set may be wrong. There is a series of powerful general methods for detecting and investigating observations that differ from the bulk of the data. These may be individual observations that do not belong to the general model, that is, outliers; or there may be a subset of the data that is systematically different from the majority.

2. Detection of Outliers
There are two types of outliers, depending on the variable in which they occur. Outliers in the response variable represent model failure. Outliers with respect to the predictors are called leverage points; they can affect the regression model. Their response values need not be outliers, yet they may almost uniquely determine the regression coefficients. They may also cause the standard errors of the regression coefficients to be much smaller than they would be if these observations were excluded. Leverage points do not necessarily correspond to outliers: an observation with sufficiently high leverage might exert enough influence to drag the regression equation close to its response and so mask the fact that it might otherwise be an outlier. The ordinary or simple residuals (observed minus predicted values) are the most commonly used measures for detecting outliers. The ordinary residuals sum to zero but do not all have the same standard deviation. Many other measures improve on or complement simple residuals. Standardized residuals are the residuals divided by estimates of their standard errors; they have mean 0 and standard deviation 1.
There are two common ways to calculate the standardized residual for the i-th observation: using the residual mean square error from the model fitted to the full dataset (internally studentized residuals), or using the residual mean square error from the model fitted to all of the data except the i-th observation (externally studentized residuals). The externally studentized residuals follow a t distribution with n − p − 1 degrees of freedom, where n is the total number of observations and p is the number of parameters. Outlier diagnostics are statistics that focus attention on observations having a large influence on the Least Squares (LS) estimator. Several diagnostic measures have been designed to detect individual cases or groups of cases that may differ from the bulk of the data. The field of diagnostics consists of a combination of numerical and graphical tools. Some statistics commonly used in the detection of outliers are described now.

2.1 Row Deletion Methods
There are many methods for the detection of outliers available in the literature. Some statistics obtained through the row deletion method are considered here. It is examined in turn how the deletion of each row affects the estimated coefficients, the predicted (fitted) values, the residuals, and the estimated covariance structure of the coefficients.
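Both kinds of studentized residuals can be computed from a single fit, using the standard deletion identity for the leave-one-out variance s²(i); a minimal NumPy sketch (function names and the planted-outlier demo data are mine, for illustration only):

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally and externally studentized residuals for an OLS fit.

    X is the n x p design matrix (including an intercept column).
    No refitting: s^2_(i) comes from the closed-form deletion identity.
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
    h = np.diag(H)                                   # leverages h_i
    e = y - H @ y                                    # ordinary residuals
    s2 = e @ e / (n - p)                             # residual mean square
    internal = e / np.sqrt(s2 * (1 - h))
    # deletion variance: (n-p-1) s^2_(i) = (n-p) s^2 - e_i^2/(1-h_i)
    s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    external = e / np.sqrt(s2_i * (1 - h))
    return internal, external

# hypothetical demo: a straight line plus noise, one shifted response
rng = np.random.default_rng(0)
x = np.arange(10.0)
y = 2 + 3 * x + rng.normal(scale=0.5, size=10)
y[5] += 10.0                                         # planted outlier
X = np.column_stack([np.ones(10), x])
internal, external = studentized_residuals(X, y)
```

For a genuine outlier the externally studentized residual is markedly larger than the internal one, because the outlier no longer inflates the scale estimate it is compared against.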

Consider the following linear regression model:
Y = Xβ + ε,
where Y is an n × 1 vector of observations, X is an n × p matrix of explanatory variables, β is a p × 1 vector of parameters, and ε is an n × 1 vector of errors such that E(ε) = 0 and E(εε′) = σ²I.

(i) DFBETA
Since the estimated coefficients are often of primary interest in regression models, we look first at the change in the estimate of the regression coefficients that would occur if the i-th row were deleted. Denoting the coefficients estimated with the i-th row deleted by β̂(i), this change is computed by the formula
DFBETAᵢ = β̂ − β̂(i) = (X′X)⁻¹xᵢ′eᵢ / (1 − hᵢ),
where xᵢ is the i-th row of the X matrix, eᵢ is the i-th residual and hᵢ is the i-th diagonal element of the matrix X(X′X)⁻¹X′. The cutoff value for DFBETA is 2/√n.

(ii) Cook's Distance
Cook (1977) proposed a statistic for the detection of outliers as follows:
Dᵢ = (β̂(i) − β̂)′ X′X (β̂(i) − β̂) / (p s²),
where s² is the estimate of σ². Large values of Dᵢ indicate observations that are influential on joint inferences about all the linear parameters in the model. A suggestive alternative form of Dᵢ is
Dᵢ = (Ŷ(i) − Ŷ)′(Ŷ(i) − Ŷ) / (p s²), where Ŷ(i) = Xβ̂(i).
An interpretation is that Dᵢ measures the sum of squared changes in the predictions when observation i is not used in estimating β. Dᵢ approximately follows the F(p, n − p) distribution. The cutoff value of the Cook statistic is 4/n.

(iii) DFFIT
This is the difference between the predicted response from the model constructed using the complete data and the predicted response from the model constructed with the i-th observation set aside. It is similar to Cook's distance, but unlike Cook's distance it does not look at all of the predicted values with the i-th observation set aside; it looks only at the predicted value for the i-th observation. DFFIT is computed as
DFFITᵢ = Ŷᵢ − Ŷ(i) = xᵢ(β̂ − β̂(i)) = hᵢeᵢ / (1 − hᵢ).
The cutoff value of DFFIT is 2√(p/n).
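Because DFBETA, Cook's distance and DFFIT all have closed-form deletion identities, none of them requires refitting the model n times. A hedged sketch of all three (plus the leverages with the 2p/n cutoff discussed below); the function and demo names are mine:

```python
import numpy as np

def deletion_diagnostics(X, y):
    """DFBETA, Cook's distance, DFFIT and leverage flags from one LS fit,
    using the closed-form row-deletion identities (no refitting)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)     # diagonal of hat matrix
    e = y - X @ (XtX_inv @ (X.T @ y))               # ordinary residuals
    s2 = e @ e / (n - p)
    # row i of dfbeta is (X'X)^{-1} x_i' e_i / (1 - h_i)
    dfbeta = (X @ XtX_inv) * (e / (1 - h))[:, None]
    cook = e**2 * h / (p * s2 * (1 - h) ** 2)       # Cook's D_i, cutoff 4/n
    dffit = h * e / (1 - h)                          # change in i-th fitted value
    leverage_flag = h > 2 * p / n                    # leverage cutoff 2p/n
    return dfbeta, cook, dffit, leverage_flag

# hypothetical demo: straight line with one shifted response
X = np.column_stack([np.ones(12), np.arange(12.0)])
y = 1 + 2 * np.arange(12.0)
y[3] -= 8.0                                          # planted outlier
dfbeta, cook, dffit, lev = deletion_diagnostics(X, y)
```

A response outlier in the middle of the x-range shows up strongly in Cook's distance and DFFIT even though its leverage is unremarkable.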
(iv) Covariance Matrix
Another major aspect of regression is the covariance matrix of the estimated coefficients. We again consider the diagnostic technique of row deletion, this time in a comparison of the covariance matrix using the entire data, σ²(X′X)⁻¹, with the covariance matrix that results when the i-th row has been deleted, σ²[X(i)′X(i)]⁻¹. Of the various alternative means for comparing two such positive definite symmetric matrices, the ratio of their determinants, det{σ²[X(i)′X(i)]⁻¹}/det{σ²(X′X)⁻¹}, is one of the simplest and is quite appealing. Since these two matrices differ only by the inclusion of the i-th row in the sum of squares and cross products, values of this ratio near unity can be taken to indicate that the covariance matrix is insensitive to the deletion of row i. Of course, the estimator s² of σ² also changes with the deletion of the i-th observation. We can bring the y data into consideration by comparing the two matrices s²(X′X)⁻¹ and s(i)²[X(i)′X(i)]⁻¹ in the determinantal ratio

COVRATIO = det{s(i)²[X(i)′X(i)]⁻¹} / det{s²(X′X)⁻¹} = (s(i)²/s²)ᵖ det[X′X] / det[X(i)′X(i)],

or equivalently

COVRATIO = 1 / { [ (n − p − 1)/(n − p) + e*ᵢ²/(n − p) ]ᵖ (1 − hᵢ) },

where e*ᵢ = eᵢ / [s(i)√(1 − hᵢ)] is the externally studentized residual. For COVRATIO, the cutoff value is 1 ± 3p/n.

2.2 Hat Matrix
Besides the above statistics, the hat matrix is sometimes used to detect influential observations. The hᵢ are the diagonal elements of the least squares projection matrix, also called the hat matrix,
H = X(X′X)⁻¹X′.
This determines the fitted or predicted values, since Ŷ = Xβ̂ = HY. The influence of the response value Yᵢ on the fit is most directly reflected in its impact on the corresponding fitted value Ŷᵢ, and this information is contained in hᵢ. Where there are two or fewer explanatory variables, scatter plots will quickly reveal any x outliers, and it is not hard to verify that they have relatively large hᵢ values. Here the cutoff value is 2p/n, where p is the rank of the X matrix. The i-th observation is called a leverage point when hᵢ exceeds 2p/n.

2.3 Outlier Detection Based on Robust Regression
Many diagnostics are based on the residuals resulting from LS. However, this starting point may lead to useless results, for the following reason. By definition, LS tries to avoid large residuals. Consequently, one outlying case may cause a poor fit for the majority of the data, because the LS estimator tries to accommodate this case at the expense of the remaining observations.

Therefore an outlier may have a small residual, especially when it is a leverage point. As a consequence, diagnostics based on LS residuals often fail to reveal such points. A least squares model can be distorted by a single observation: the fitted line or surface might be tipped so that it no longer passes through the bulk of the data. In order to reduce the effect of one very large error, least squares will introduce many small or moderate errors. For example, if a large error is reduced from 200 to 50, its square is reduced from 40,000 to 2,500; increasing an error from 5 to 15 increases its square only from 25 to 225. Thus, a least squares fit might introduce many small errors in order to reduce a large one.

Figure 2.1

In Figure 2.1 the line A denotes the regression line that passes through the bulk of the data. In the presence of the outlier, however, the line is dragged towards the outlier, because the principle of the least squares method requires the residual sum of squares to be minimum. The line B indicates the regression line in the presence of the outlier. Again, in some cases the use of the LS estimate to detect outliers may diagnose a clean point as an outlier and an outlier as a clean point.

Figure 2.2

In Figure 2.2, point 1 is actually an outlier, and in its presence the regression line is dragged out towards it, with the result that point 2 appears as an outlier even though it is a clean point, because the distance of this point from the line is now the largest. Moreover, all the statistics described above involve LS residuals; if they are applied to the detection of outliers, they may wrongly declare some clean observations to be outliers. It is therefore suggested to use robust regression in place of LS. Robust regression not only provides parameter estimates that are insensitive to the presence of outliers, but also helps in detecting outliers. Robust regression is a term used to describe model-fitting procedures that are insensitive to the effects of outlying observations. Many robust regression methods have been developed, of which the M-estimator is the most popular.

M-estimator
In general, a class of robust estimators that minimize a function f of the residuals is defined by

minimize over β:  Σᵢ₌₁ⁿ f(eᵢ) = Σᵢ₌₁ⁿ f(yᵢ − xᵢβ),  (2.1)

where xᵢ denotes the i-th row of X. Generally the following problem is solved:

minimize over β:  Σᵢ₌₁ⁿ f(eᵢ/s) = Σᵢ₌₁ⁿ f[(yᵢ − xᵢβ)/s],  (2.2)

where

s = median|eᵢ − median(eᵢ)| / 0.6745.  (2.3)

Here s is an approximately unbiased estimator of σ if n is large and the error distribution is normal. An estimator of this type is called an M-estimator, where M stands for maximum likelihood: the function f is related to the likelihood function for an appropriate choice of the error distribution. For example, if the method of least squares is used (implying the error distribution is normal), then f(z) = ½z². To minimize Eq. (2.2), equate the first partial derivatives of f with respect to βⱼ (j = 1, 2, …, p) to zero, yielding a necessary condition for a minimum. This gives the system of p equations

Σᵢ₌₁ⁿ ψ[(yᵢ − xᵢβ)/s] xᵢⱼ = 0,  j = 1, 2, …, p,  (2.4)

where s is the robust estimate of scale, ψ = f′, xᵢⱼ is the i-th observation on the j-th regressor and xᵢ₀ = 1. In general the ψ function is nonlinear and Eq. (2.4) must be solved by iterative methods.
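The iterative solution of Eq. (2.4) is described next; a minimal sketch of it in NumPy is given here, using the Huber ψ function (the choice of ψ and the tuning constant c = 1.345 are my illustrative assumptions; the paper leaves f unspecified):

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """M-estimation of beta by iteratively reweighted least squares.
    Weights are psi(u)/u with the Huber psi; the scale s is the
    rescaled MAD of the residuals (Eq. 2.3)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # LS starting values
    for _ in range(n_iter):
        e = y - X @ beta
        s = np.median(np.abs(e - np.median(e))) / 0.6745  # robust scale
        s = max(s, 1e-8)                                  # guard against zero scale
        u = e / s
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))  # Huber weights psi(u)/u
        XW = X.T * w                                      # X'W for diagonal W
        beta = np.linalg.solve(XW @ X, XW @ y)            # weighted LS normal equations
    return beta

# hypothetical demo: a clean line plus two gross errors at large x
rng = np.random.default_rng(1)
x = np.arange(20.0)
y = 1 + 2 * x + rng.normal(scale=0.5, size=20)
y[18] += 50.0
y[19] += 50.0
X = np.column_stack([np.ones(20), x])
beta_robust = huber_irls(X, y)
```

The two gross errors receive weights far below 1 and barely move the robust slope, whereas an ordinary LS fit to the same data is pulled well away from the true slope of 2.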

Iteratively reweighted least squares is the most widely used; this approach is usually attributed to Beaton and Tukey (1974). To use iteratively reweighted least squares, suppose that an initial estimate β̂₀ is available and that s is an estimate of scale. Then the p equations in (2.4),

Σᵢ₌₁ⁿ ψ[(yᵢ − xᵢβ)/s] xᵢⱼ = 0,  (2.5)

can be written as

Σᵢ₌₁ⁿ xᵢⱼ wᵢ⁰ (yᵢ − xᵢβ) = 0,  (2.6)

where

wᵢ⁰ = ψ[(yᵢ − xᵢβ̂₀)/s] / [(yᵢ − xᵢβ̂₀)/s]  if yᵢ ≠ xᵢβ̂₀,  and  wᵢ⁰ = 1  if yᵢ = xᵢβ̂₀.  (2.7)

In matrix notation, Eq. (2.6) becomes

X′W₀Xβ = X′W₀y,  (2.8)

where W₀ is an n × n diagonal matrix of weights with diagonal elements w₁⁰, w₂⁰, …, wₙ⁰ given by Eq. (2.7). We recognize Eq. (2.8) as the usual weighted least squares normal equations. Consequently, the one-step estimator is

β̂₁ = (X′W₀X)⁻¹X′W₀y.  (2.9)

In the next step the weights are recomputed from Eq. (2.7), but using β̂₁ instead of β̂₀.

Least Median of Squares (LMS) Estimator
Least Median of Squares (LMS) regression, developed by Rousseeuw (1984), minimizes the median squared residual. Since it focuses on the median residual, up to half of the observations can disagree without masking a model that fits the rest of the data. For the linear regression model E(Y) = Xβ, with X of rank p, let b be any estimate of β. With n observations, the residuals from this estimate are eᵢ(b) = yᵢ − xᵢb, i = 1, 2, …, n. The LMS estimate β̂ₚ is the value of b minimizing the median of the squared residuals; thus β̂ₚ minimizes the scale estimate

σ̂²(b) = e²₍med₎(b),  (2.10)

where e²₍k₎(b) is the k-th ordered squared residual. In order to allow for estimation of the parameters of the linear model, the median is taken as

med = integer part of (n + p + 1)/2.  (2.11)

The parameter estimate satisfying (2.10) has, asymptotically, a breakdown point of 50%. Thus, for large n, almost half the data can be outliers, or come from some other model, and LMS will still provide an unbiased estimate of the regression parameters. This is the maximum breakdown that can be tolerated: for a higher proportion of outliers there is no longer a model that fits the majority of the data. The very robust behaviour of the LMS estimate is in contrast to that of the least squares estimate β̂ minimizing

S(β) = (y − Xβ)′(y − Xβ), which can be written as S(b) = Σᵢ₌₁ⁿ eᵢ²(b).

Only one observation needs to be moved towards infinity to cause an arbitrarily large change in the estimate β̂: the breakdown point of β̂ is zero. The definition of β̂ₚ in (2.10) gives no indication of how to find such a parameter estimate. Since the surface to be minimized has many local minima, approximate methods are used. Rousseeuw (1984) finds an approximation to β̂ₚ by searching only over elemental sets, that is, subsets of p observations taken at random.

Fitting an LMS regression model poses some difficulties. The first is computational: unlike least squares regression, there is no formula that can be used to calculate the coefficients of an LMS regression. Rousseeuw (1984) proposed an algorithm to obtain the LMS estimator. According to this algorithm, random samples of size p are drawn, a regression surface is fitted to each set of observations, and the median squared residual is calculated; the model with the smallest median squared residual is used. Once a robust fit of the model is obtained, residuals from this model are used for detecting outlying observations. Figure 2.3 illustrates that the LMS fit is not affected by the outlying observation.

Figure 2.3
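Rousseeuw's elemental-set algorithm just described can be sketched as follows (a hedged illustration; the sample count, seed and data are my own choices, not the paper's):

```python
import numpy as np

def lms_fit(X, y, n_samples=1000, seed=0):
    """Approximate LMS: fit exactly p points at a time and keep the fit
    with the smallest median squared residual, where
    med = integer part of (n + p + 1)/2 (Eq. 2.11)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    med = (n + p + 1) // 2 - 1                 # 0-based index of e^2_[med]
    best_beta, best_crit = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(n, size=p, replace=False)
        try:
            b = np.linalg.solve(X[idx], y[idx])  # exact fit to p points
        except np.linalg.LinAlgError:
            continue                             # skip degenerate subsets
        crit = np.sort((y - X @ b) ** 2)[med]
        if crit < best_crit:
            best_beta, best_crit = b, crit
    return best_beta, best_crit

# hypothetical demo: a line with 30% of the responses contaminated
x = np.arange(30.0)
y = 1 + 2 * x
y[:9] += 40.0                                   # 9 of 30 observations shifted
X = np.column_stack([np.ones(30), x])
beta_lms, crit = lms_fit(X, y)
```

With 30% contamination the ordinary LS fit is badly tilted, but the LMS slope recovers the uncontaminated line because the median squared residual ignores the shifted minority.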

One of the drawbacks of the LMS estimate is that it does not consider all the data points for the estimation of the parameters. Also, it does not reveal the masking effect of outliers, if any. For these reasons the forward search method for the detection of outliers has been used.

2.4 Forward Search
If the values of the parameters of the model were known, there would be no difficulty in detecting the outliers, which would have large residuals. The difficulty arises because the outliers are included in the data used to estimate the parameters, which can then be badly biased. Most methods for outlier detection therefore seek to divide the data into two parts, a larger clean part and the outliers. For detecting multiple outliers, some apply the single row deletion method repeatedly, but this fails when there is a problem of masking. Multiple row deletion has also been suggested, but the difficulty there is the explosion in the number of combinations to be considered. To overcome such problems, the forward search was developed by Atkinson (1994). The basic idea of this method is to order the observations by their closeness to the fitted model. It starts with a fit to very few observations and then successively fits to larger subsets. The starting point is found by fitting to a large number of small subsets, using methods from robust statistics to determine which subset fits best. Then all the observations are ordered by closeness to this fitted model; for regression models, the residuals determine closeness. The subset size is increased by one and the model refitted to the observations with the smallest residuals for the increased subset size. Usually one observation enters at a time, but sometimes two or more enter as one or more leave. The process continues, with increasing subset sizes, until finally all the data are fitted. As a result of this forward search we have an ordering of the observations by closeness to the assumed model.
If the model and the data agree, the robust and least squares fits will be similar, as will the parameter estimates and residuals from the two fits. But often the estimates and residuals of the fitted model change appreciably during the forward search. The changes in these quantities and in various statistics, such as score tests for transformation, are monitored as the process moves forward through the data, adding one observation at a time. This forward procedure provides a wealth of information not only for outlier detection but, much more importantly, on the effect of each observation on aspects of inference about the model. In the forward search, large subsamples of outlier-free observations are found by starting from small subsets and incrementing them with observations that have small residuals and so are unlikely to be outliers. The method was introduced by Hadi (1992) for the detection of outliers from a fit using approximately half the observations. Different versions of the method are described by Hadi and Simonoff (1993), Hadi and Simonoff (1994) and Atkinson (1994). Suppose that at some stage in the forward search the set of m observations used in fitting is S(m). Fitting to this subset by least squares (for regression models) yields the parameter estimates β̂ₘ. From these parameter estimates, a set of n residuals eᵢ(m) can be

calculated, together with an estimate s²(m) of σ². Suppose the subset S(m) is clear of outliers. There will then be n − m observations not used in fitting that may contain outliers. The interest is in the evolution, as m goes from p to n, of quantities such as residuals and test statistics, together with Cook's distance and other diagnostic quantities. The sequence of parameter estimates β̂ₘ and the related t statistics are also monitored. The changes that occur, which can always be associated with the introduction of a particular group of observations (in practice almost always one observation) into the subset used for fitting, are monitored. Interpretation of these changes is complemented by examination of changes in the forward plot of residuals.

Remark 1
The search starts with the approximate LMS estimator found by sampling subsets of size p. Let this be β̂ₚ, and let the least squares estimator at the end of the search be β̂ₙ = β̂. In the absence of outliers and systematic departures from the model, E(β̂ₚ) = E(β̂) = β; that is, both parameter estimates are unbiased estimators of the same quantity. The same property holds for the sequence of estimates β̂ₘ produced in the forward search. Therefore, in the absence of outliers, it is expected that both parameter estimates and residuals will remain sensibly constant during the forward search.

Remark 2
Now suppose there are k outliers. Starting from a clean subset, the forward procedure will include these towards the end of the search, usually in the last k steps. Until these outliers are included, the condition of Remark 1 will hold, so residual plots and parameter estimates will remain sensibly constant until the outliers are incorporated in the subset used for fitting.

Remark 3
If there are indications that the regression data should be transformed, it is important to remember that outliers on one transformed scale may not be outliers on another scale. If the data are analyzed using the wrong transformation, the k outliers may enter the search well before the end.
The forward algorithm is made up of three steps: the first concerns the choice of an initial subset, the second the way in which the forward search progresses, and the third the monitoring of statistics during the progress of the search.

Step 1: Choice of the Initial Subset
A formal definition of the algorithm used to find the LMS estimator is now given. If the model contains p parameters, the forward search algorithm starts with the selection of a subset of p units. Observations in this subset are intended to be outlier free. If n is moderate and p << n, the choice of the initial subset can be performed by exhaustive

enumeration of all distinct p-tuples Sₗ(p). Let e²ᵢ,ₗ(p) be the squared least squares residual for unit i, given the observations in Sₗ(p). The initial subset is taken as the p-tuple S*(p) which satisfies

e²₍med₎,*(p) = minₗ e²₍med₎,ₗ(p),

where e²₍k₎,ₗ(p) is the k-th ordered squared residual among e²ᵢ,ₗ(p), i = 1, 2, …, n, and med is the integer part of (n + p + 1)/2. If the number of distinct p-tuples is too large, 1,000 samples are taken.

Step 2: Adding Observations during the Forward Search
Given a subset of dimension m ≥ p, say S(m), the forward search moves to dimension m + 1 by selecting the m + 1 units with the smallest squared residuals, the units being chosen by ordering all the squared residuals e²ᵢ(m), i = 1, 2, …, n. The forward search estimator β̂_FS is defined as the collection of least squares estimators in each step of the forward search; that is, β̂_FS = (β̂ₚ, …, β̂ₙ). In most moves from m to m + 1 just one new unit joins the subset. It may also happen that two or more units join S(m) as one or more leave. Such an event is, however, quite unusual, occurring only when the search includes one unit that belongs to a cluster of outliers: at the next step the remaining outliers in the cluster seem less outlying, and so several may be included at once, forcing several other units to leave the subset. The search avoids, in its first steps, the inclusion of outliers and provides a natural ordering of the data according to the specified null model. In this approach, a highly robust method and least squares estimators are used at the same time. The zero breakdown point of the least squares estimators does not, in the context of the forward search, turn out to be disadvantageous. The introduction of atypical (influential) observations is signaled by sharp changes in the curves that monitor parameter estimates, t tests, or any other statistic at every step. In this context the robustness of the method derives not from the choice of a particular estimator with a high breakdown point, but from the progressive inclusion of units into a subset that is outlier free.
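Steps 1 and 2 (together with a record of the entry order, the raw material for the monitoring of Step 3) can be sketched as below; this is an illustrative reading of the algorithm, not the authors' code, and the sampling constants are my own choices:

```python
import numpy as np

def forward_search(X, y, n_start_samples=500, seed=0):
    """Forward search sketch: a robust (LMS-style) initial subset of p
    units, then grow one unit at a time by refitting with least squares
    and keeping the m+1 smallest squared residuals."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    med = (n + p + 1) // 2 - 1
    # Step 1: initial subset minimizing the median squared residual
    best_idx, best_crit = None, np.inf
    for _ in range(n_start_samples):
        idx = rng.choice(n, size=p, replace=False)
        try:
            b = np.linalg.solve(X[idx], y[idx])
        except np.linalg.LinAlgError:
            continue
        crit = np.sort((y - X @ b) ** 2)[med]
        if crit < best_crit:
            best_idx, best_crit = idx, crit
    subset = list(best_idx)
    entry_order = list(subset)
    s2_path = []                                  # s^2(m) along the search
    # Step 2: grow the subset, refitting by LS at each size m
    for m in range(p, n):
        b = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
        e2 = (y - X @ b) ** 2
        s2_path.append(e2[subset].sum() / max(m - p, 1))
        subset = list(np.argsort(e2)[: m + 1])    # m+1 smallest residuals
        for i in subset:
            if i not in entry_order:
                entry_order.append(i)
    return entry_order, s2_path

# hypothetical demo: one planted outlier should enter last
x = np.arange(20.0)
y = 3 + 1.5 * x
y[7] += 25.0
X = np.column_stack([np.ones(20), x])
order, s2_path = forward_search(X, y)
```

Because the planted outlier always has the largest residual from every clean fit, it is the last unit to join the subset, exactly the ordering behaviour the text describes.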
As a bonus of the suggested procedure, the observations can be naturally ordered according to the specified null model, and it is possible to know how many of them are compatible with a particular specification. Furthermore, the suggested approach enables analysis of the influential effect of the atypical units (outliers) on the results of the statistical analyses.

Remark 4
The method is not sensitive to the method used to select an initial subset, provided unmasked outliers are not included at the start. For example, the least median of squares criterion for regression can be replaced by that of Least Trimmed Squares (LTS). This criterion provides estimators with better properties than LMS estimators, found by minimizing the sum of the smallest h squared residuals,

S_h(b) = Σᵢ₌₁ʰ e²₍i₎(b),  for some h with [(n + p + 1)/2] ≤ h < n.
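The LTS criterion of Remark 4 can be written down directly; a small sketch that only evaluates the criterion (minimizing it would need a search such as the elemental-set sampling used for LMS; the comparison data are made up for illustration):

```python
import numpy as np

def lts_criterion(b, X, y, h):
    """S_h(b): sum of the h smallest squared residuals at parameter b."""
    e2 = np.sort((y - X @ b) ** 2)
    return e2[:h].sum()

# hypothetical demo: the true line beats the LS fit on S_h when a
# minority of the responses is contaminated
x = np.arange(30.0)
y = 1 + 2 * x
y[:6] += 35.0                                   # 6 contaminated cases
X = np.column_stack([np.ones(30), x])
h = (30 + 2 + 1) // 2                           # lower bound [(n+p+1)/2]
good = lts_criterion(np.array([1.0, 2.0]), X, y, h)
ls_beta = np.linalg.lstsq(X, y, rcond=None)[0]
bad = lts_criterion(ls_beta, X, y, h)
```

The trimmed sum ignores the largest n − h squared residuals, so the contaminated cases simply drop out when the trial parameter is close to the model followed by the majority.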

What is important in this procedure is that the initial subset is either free of outliers or contains unmasked outliers which are immediately removed by the forward procedure. The search is often able to recover from a start that is not very robust. An example for regression is given by Atkinson and Mulira (1993), and one for spatial data by Cerioli and Riani (1999).

Step 3: Monitoring the Search
Step 2 of the forward search is repeated until all units are included in the subset. If just one observation enters S(m) at each move, the algorithm provides an ordering of the data according to the specified null model, with observations furthest from it joining the subset at the last stages of the procedure.

Example 2.1: A data set used by Weisberg (1985) is considered here to introduce the ideas of regression analysis. There are 17 observations on the boiling point of water, in °F, at different pressures, obtained from measurements at a variety of elevations in the Alps. The purpose of the original experiment was to allow prediction of pressure from boiling point, which is easily measured, and so to provide an estimate of altitude: the higher the altitude, the lower the pressure and the consequent boiling point. Weisberg (1985) gives values of both pressure and 100 log(pressure) as possible responses. The variables are x: boiling point (°F) and y: 100 log(pressure).

Table 2.1: Data on air pressure in the Alps and the boiling point of water (columns: Observation Number, Boiling Point, 100 log Pressure)

The data are plotted in Figure 2.4. A quick glance at the plot shows a strong linear relationship between log(pressure) and boiling point. A slightly longer glance reveals that one of the points lies slightly off the line. Linear regression of y on x yields a t value for the regression of 54.45, clear evidence of the significance of the relationship.

Two plots of the least squares residuals eᵢ are often used to check fitted models. Figure 2.5 shows a plot of residuals against fitted values ŷᵢ. This clearly shows one outlier, observation 12. The normal plot of the studentized residuals is an almost straight line, from which the large residual for observation 12 is clearly distanced. It is clear that observation 12 is an outlier.

Figure 2.4
Figure 2.5

It is now shown how the forward search can reveal this point as an outlier. The search starts with a least squares fit to two observations, robustly chosen. From this fit, residuals for all 17 observations are calculated, and the next fit is to the three observations with the smallest residuals. In general, we fit to a subset of size m, order the residuals, and take as the next subset the m + 1 cases with the smallest residuals. This gives a forward search through the data, ordered by closeness to the model. It is expected that the last observations to enter the search will be those which are furthest from the model and so may cause changes once

they are included in the subset used for fitting. In the search through the data, the outlying observation 12 was the last to enter.

Figure 2.6
Figure 2.7

For each value of m from 2 to 17, quantities such as the residuals and the parameter estimates are calculated to see how they change. Figure 2.6 is a plot of the values of the parameter estimates during the forward search. The values are extremely stable, reflecting the closeness of all observations to the straight line. The introduction of observation 12 at the end of the search causes virtually no change in the position of the line. However, Figure 2.7 shows that the introduction of observation 12 causes a huge increase in s², the residual mean square estimate of the error variance σ². The information from these plots about observation 12 confirms and quantifies that from the scatter plot of Figure 2.4: observation 12 is an outlier, but the observation is in the centre of the data, so that its exclusion or inclusion has a small effect on the estimated parameters. The plots also show that all other observations agree with the overall model. Throughout the search, all cases have small residuals, apart from case 12, which is outlying from all fitted subsets. Even when it is included in the last step of the search, its residual only decreases slightly.

Remark 5
The estimate of σ² does not remain constant during the forward search, as observations are sequentially selected to have small residuals. Thus, even in the absence of outliers, the

residual mean square estimate satisfies s²(m) < s²(n) = s² for m < n. The smooth increase of s²(m) with m is typical of what is expected when the data agree with the model and are correctly ordered by the forward search.

Example 2.2: Table 2.2 gives 60 observations on a response y together with the values of three explanatory variables [Atkinson and Riani (2000)].

Table 2.2 (columns: obs, x1, x2, x3, y)

The plot of residuals against fitted values, Figure 2.8, shows no obvious pattern; the largest residual is that of case 43. There is therefore no clear indication that the data are not homogeneous and well behaved. Evidence of the structure of the data is clearly shown in Figure 2.9, the scaled squared residuals from the forward search. This fascinating plot reveals the presence of 6 masked outliers. The left-hand end of the plot gives the residuals from the LMS estimates found by sampling 1000 subsets of size p = 4. From the most extreme residual downwards, the cases giving rise to the outliers are 9, 30, 31, 38, 47 and 21. When all the data are fitted, the largest residuals belong, in order, to cases 43, 51, 47, 31, 9, 38, 29, 7 and 48.

Figure 2.8

Figure 2.9

The assessment of the importance of these outliers can be made from the behaviour of the parameter estimates and of the related t statistics. Apart from β̂₁, all remain positive, with t values around 10 or greater, during the course of the forward search. We therefore concentrate on the behaviour of t₁, the t statistic for β₁. The values for the last 20 steps of the forward search are plotted in Figure 2.10. The general downward trend is typical of plots of t statistics from the forward search; it is caused by the increasing value of s² (Figure 2.11) as observations with larger residuals enter during the search. An important feature in the interpretation of Figure 2.10 is the two upward jumps in the value of the statistic. The first results from the inclusion of observation 43 when m = 54, giving a t value of 2.5, evidence significant at the 3% level of a positive value of β₁. Thereafter the outliers enter the subset, with observation 43 leaving when m = 58 as two outliers enter. When m = 59 the value of the statistic has decreased to −1.93, close to evidence for a negative value of the parameter. Reintroduction of observation 43 in the last step of the search results in a value of −1.6, indicating that β₁ may well be zero. It is therefore important that the outliers be identified.

Figure 2.10

Figure 2.11

This example shows very clearly the existence of masked outliers, which would not be detected by LS. The forward plot of residuals in Figure 2.9, however, clearly indicates a structure that is hidden in the conventional plot of residuals.

3. Conclusions
The least squares method is very sensitive to extreme observations: only one observation needs to be moved towards infinity to cause an arbitrarily large change in the estimate β̂. The principle of LS is to minimize the residual sum of squares; if any outlier is present, LS will try to minimize the residual of this particular point at the expense of the other observations, so that sometimes a good observation may be flagged as an outlier. It also cannot cope when masking is present. In contrast to LS, robust regression procedures provide parameter estimates that are insensitive to the presence of outliers. Least Median of Squares (LMS) is one of the popular robust regression methods; it has a 50% breakdown point, and it not only estimates the parameters but also detects outliers. However, it does not consider all the observations for parameter estimation, so some good observations may not be used. The forward search reveals outliers which were initially masked and not revealed by LS. It starts with p observations chosen by LMS, and the subset is then increased one observation at a time, the observations being ordered according to their residuals. Therefore, if any outlier is present in the data, it will enter the forward search at the end. The search avoids, in its first steps, the inclusion of outliers and provides a natural ordering of the data. The method uses a highly robust method and, at the same time, least squares estimators; the zero breakdown point of the least squares estimators, in the context of the forward search, does not turn out to be disadvantageous.

References
Atkinson, A. C. and Riani, M. (2000). Robust Diagnostic Regression Analysis. 1st edition, Springer, New York.

Atkinson, A. C. (1994). Fast very robust methods for the detection of outliers. J. Amer. Statist. Assoc., 89.
Atkinson, A. C. and Mulira, H. M. (1993). The stalactite plot for the detection of multivariate outliers. Statistics and Computing, 3.
Beaton, A. E. and Tukey, J. W. (1974). The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16.
Beckman, R. J. and Cook, R. D. (1983). Outlier.s. Technometrics, 25.
Cerioli, A. and Riani, M. (1999). The ordering of spatial data and the detection of multiple outliers. Journal of Computational and Graphical Statistics, 8.
Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19.
Hadi, A. S. (1992). Identifying multiple outliers in multivariate data. J. of the Royal Statistical Society, Series B, 54.
Hadi, A. S. and Simonoff, J. S. (1993). Procedures for identification of multiple outliers in linear models. J. Amer. Statist. Assoc., 88.
Hadi, A. S. and Simonoff, J. S. (1994). Improving the estimation and outlier identification properties of the least median of squares and minimum volume ellipsoid estimators. Parisankhyan Samikkha, 1.
Hocking, R. R. and Pendleton, O. J. (1983). The regression dilemma. Communications in Statistics - Theory and Methods, 12.
Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc., 79.
Rousseeuw, P. J. and Van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. J. Amer. Statist. Assoc., 85.
Weisberg, S. (1985). Applied Linear Regression. 2nd edition, Wiley, New York.
Woodruff, D. and Rocke, D. M. (1994). Computable robust estimation of multivariate location and shape in high dimension using compound estimators. J. Amer. Statist. Assoc., 89.


More information

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis Resource Allocaton and Decson Analss (ECON 800) Sprng 04 Foundatons of Regresson Analss Readng: Regresson Analss (ECON 800 Coursepak, Page 3) Defntons and Concepts: Regresson Analss statstcal technques

More information

Chap 10: Diagnostics, p384

Chap 10: Diagnostics, p384 Chap 10: Dagnostcs, p384 Multcollnearty 10.5 p406 Defnton Multcollnearty exsts when two or more ndependent varables used n regresson are moderately or hghly correlated. - when multcollnearty exsts, regresson

More information

Turbulence classification of load data by the frequency and severity of wind gusts. Oscar Moñux, DEWI GmbH Kevin Bleibler, DEWI GmbH

Turbulence classification of load data by the frequency and severity of wind gusts. Oscar Moñux, DEWI GmbH Kevin Bleibler, DEWI GmbH Turbulence classfcaton of load data by the frequency and severty of wnd gusts Introducton Oscar Moñux, DEWI GmbH Kevn Blebler, DEWI GmbH Durng the wnd turbne developng process, one of the most mportant

More information

Statistics MINITAB - Lab 2

Statistics MINITAB - Lab 2 Statstcs 20080 MINITAB - Lab 2 1. Smple Lnear Regresson In smple lnear regresson we attempt to model a lnear relatonshp between two varables wth a straght lne and make statstcal nferences concernng that

More information

Basic Business Statistics, 10/e

Basic Business Statistics, 10/e Chapter 13 13-1 Basc Busness Statstcs 11 th Edton Chapter 13 Smple Lnear Regresson Basc Busness Statstcs, 11e 009 Prentce-Hall, Inc. Chap 13-1 Learnng Objectves In ths chapter, you learn: How to use regresson

More information

Topic- 11 The Analysis of Variance

Topic- 11 The Analysis of Variance Topc- 11 The Analyss of Varance Expermental Desgn The samplng plan or expermental desgn determnes the way that a sample s selected. In an observatonal study, the expermenter observes data that already

More information

Chapter 15 - Multiple Regression

Chapter 15 - Multiple Regression Chapter - Multple Regresson Chapter - Multple Regresson Multple Regresson Model The equaton that descrbes how the dependent varable y s related to the ndependent varables x, x,... x p and an error term

More information

ANOMALIES OF THE MAGNITUDE OF THE BIAS OF THE MAXIMUM LIKELIHOOD ESTIMATOR OF THE REGRESSION SLOPE

ANOMALIES OF THE MAGNITUDE OF THE BIAS OF THE MAXIMUM LIKELIHOOD ESTIMATOR OF THE REGRESSION SLOPE P a g e ANOMALIES OF THE MAGNITUDE OF THE BIAS OF THE MAXIMUM LIKELIHOOD ESTIMATOR OF THE REGRESSION SLOPE Darmud O Drscoll ¹, Donald E. Ramrez ² ¹ Head of Department of Mathematcs and Computer Studes

More information