SOME METHODS OF DETECTION OF OUTLIERS IN LINEAR REGRESSION MODEL


RANJIT KUMAR PAUL, M.Sc. (Agricultural Statistics), Roll No.
IASRI, Library Avenue, New Delhi
Chairperson: Dr. L. M. Bhar

Abstract: An outlier is an observation that deviates markedly from the majority of the data. To know which observations have greater influence on the parameter estimates, detection of outliers is very important. There are several methods for the detection of outliers available in the literature, and a good number of test statistics for detecting outliers have been developed. In contrast to detection, outliers are also tackled through robust regression techniques such as the M-estimator and Least Median of Squares (LMS). Robust regression provides parameter estimates that are insensitive to the presence of outliers and also helps to detect outlying observations. Recently, the Forward Search (FS) method has been developed, in which a small number of robustly chosen observations are used to fit a model by the Least Squares (LS) method; more observations are then included in subsequent steps. This forward search procedure provides a wealth of information not only for outlier detection but, much more importantly, on the effect of each observation on aspects of inference about the model. It also reveals the masking problem, if present, very nicely.

Key words: Outlier, Leverage Point, Least Squares (LS), Least Median of Squares (LMS), Robust Regression, Forward Search (FS), Masking.

1. Introduction
No observation can be guaranteed to be a totally dependable manifestation of the phenomenon under study. The probable reliability of an observation is reflected by its relationship to other observations that were obtained under similar conditions. Observations that, in the opinion of the investigator, stand apart from the bulk of the data have been called outliers, extreme observations, discordant observations, rogue values, contaminants, surprising values, mavericks or dirty data. An outlier is one that appears to deviate markedly from the other members of the sample in which it occurs.
An outlier is a data point that is located far from the rest of the data. Given a mean and standard deviation, a statistical distribution expects data points to fall within a specific range; those that do not are called outliers and should be investigated. The sources of influential subsets are diverse. First, there is the inevitable occurrence of improperly recorded data, either at their source or in their transcription to computer-readable form. Second, observational errors are often inherent in the data. Although procedures more appropriate than ordinary least squares exist for estimation in this situation, the diagnostics may reveal the unsuspected existence and severity of observational errors. Third, outlying data points may be legitimately occurring extreme observations. Such data often contain valuable information that improves estimation efficiency by its presence. Even in this beneficial situation, however, it is constructive to isolate extreme points and to determine the extent to which the parameter estimates depend on these desirable data. Fourth, since the data could have been generated by model(s) other than that specified, diagnostics may reveal patterns suggestive of these alternatives.

The fact that a small subset of the data can have a disproportionate influence on the estimated parameters or predictions is of concern to users of regression analysis. It is quite possible that the model estimates are based primarily on this data subset rather than on the majority of the data. When a regression model is fitted by least squares, the estimated parameters of the fitted model depend on a few statistics aggregated over all the data. If some of the observations are different in some way from the bulk of the data, the overall conclusion drawn from the data set may be wrong. There is a series of powerful general methods for detecting and investigating observations that differ from the bulk of the data. These may be individual observations that do not belong to the general model, that is, outliers; or there may be a subset of the data that is systematically different from the majority.

2. Detection of Outliers
There are two types of outliers, depending on the variable in which they occur. Outliers in the response variable represent model failure. Outliers with respect to the predictors are called leverage points; they can affect the regression model. Their response values need not be outliers, yet they may almost uniquely determine the regression coefficients. They may also cause the standard errors of the regression coefficients to be much smaller than they would be if these observations were excluded. Leverage points do not necessarily correspond to outliers: an observation with sufficiently high leverage might exert enough influence to drag the regression equation close to its response and so mask the fact that it might otherwise be an outlier. The ordinary or simple residuals (observed minus predicted values) are the most commonly used measures for detecting outliers. The ordinary residuals sum to zero but do not all have the same standard deviation. Many other measures improve on or complement simple residuals. Standardized residuals are the residuals divided by estimates of their standard errors; they have mean 0 and standard deviation 1.
There are two common ways to calculate the standardized residual for the i-th observation: using the residual mean square error from the model fitted to the full dataset (internally studentized residuals), or using the residual mean square error from the model fitted to all of the data except the i-th observation (externally studentized residuals). The externally studentized residuals follow a t distribution with n − p − 1 degrees of freedom, where n is the total number of observations and p is the number of parameters. Outlier diagnostics are statistics that focus attention on observations having a large influence on the Least Squares (LS) estimator. Several diagnostic measures have been designed to detect individual cases or groups of cases that may differ from the bulk of the data. The field of diagnostics consists of a combination of numerical and graphical tools. Some statistics commonly used in the detection of outliers are described now.

2.1 Row Deletion Methods
There are many methods for the detection of outliers available in the literature. Some statistics obtained through the row deletion method are considered here. It is examined in turn how the deletion of each row affects the estimated coefficients, the predicted (fitted) values, the residuals, and the estimated covariance structure of the coefficients.
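Both kinds of studentized residuals can be computed from a single fit, using the standard deletion identity for the leave-one-out variance s²(i); a minimal NumPy sketch (function names and the planted-outlier demo data are mine, for illustration only):

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally and externally studentized residuals for an OLS fit.

    X is the n x p design matrix (including an intercept column).
    No refitting: s^2_(i) comes from the closed-form deletion identity.
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
    h = np.diag(H)                                   # leverages h_i
    e = y - H @ y                                    # ordinary residuals
    s2 = e @ e / (n - p)                             # residual mean square
    internal = e / np.sqrt(s2 * (1 - h))
    # deletion variance: (n-p-1) s^2_(i) = (n-p) s^2 - e_i^2/(1-h_i)
    s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    external = e / np.sqrt(s2_i * (1 - h))
    return internal, external

# hypothetical demo: a straight line plus noise, one shifted response
rng = np.random.default_rng(0)
x = np.arange(10.0)
y = 2 + 3 * x + rng.normal(scale=0.5, size=10)
y[5] += 10.0                                         # planted outlier
X = np.column_stack([np.ones(10), x])
internal, external = studentized_residuals(X, y)
```

For a genuine outlier the externally studentized residual is markedly larger than the internal one, because the outlier no longer inflates the scale estimate it is compared against.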

Consider the following linear regression model:
Y = Xβ + ε,
where Y is an n × 1 vector of observations, X is an n × p matrix of explanatory variables, β is a p × 1 vector of parameters, and ε is an n × 1 vector of errors such that E(ε) = 0 and E(εε′) = σ²I.

(i) DFBETA
Since the estimated coefficients are often of primary interest in regression models, we look first at the change in the estimate of the regression coefficients that would occur if the i-th row were deleted. Denoting the coefficients estimated with the i-th row deleted by β̂(i), this change is computed by the formula
DFBETAᵢ = β̂ − β̂(i) = (X′X)⁻¹xᵢ′eᵢ / (1 − hᵢ),
where xᵢ is the i-th row of the X matrix, eᵢ is the i-th residual and hᵢ is the i-th diagonal element of the matrix X(X′X)⁻¹X′. The cutoff value for DFBETA is 2/√n.

(ii) Cook's Distance
Cook (1977) proposed a statistic for the detection of outliers as follows:
Dᵢ = (β̂(i) − β̂)′ X′X (β̂(i) − β̂) / (p s²),
where s² is the estimate of σ². Large values of Dᵢ indicate observations that are influential on joint inferences about all the linear parameters in the model. A suggestive alternative form of Dᵢ is
Dᵢ = (Ŷ(i) − Ŷ)′(Ŷ(i) − Ŷ) / (p s²), where Ŷ(i) = Xβ̂(i).
An interpretation is that Dᵢ measures the sum of squared changes in the predictions when observation i is not used in estimating β. Dᵢ approximately follows the F(p, n − p) distribution. The cutoff value of the Cook statistic is 4/n.

(iii) DFFIT
This is the difference between the predicted response from the model constructed using the complete data and the predicted response from the model constructed with the i-th observation set aside. It is similar to Cook's distance, but unlike Cook's distance it does not look at all of the predicted values with the i-th observation set aside; it looks only at the predicted value for the i-th observation. DFFIT is computed as
DFFITᵢ = Ŷᵢ − Ŷ(i) = xᵢ(β̂ − β̂(i)) = hᵢeᵢ / (1 − hᵢ).
The cutoff value of DFFIT is 2√(p/n).
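Because DFBETA, Cook's distance and DFFIT all have closed-form deletion identities, none of them requires refitting the model n times. A hedged sketch of all three (plus the leverages with the 2p/n cutoff discussed below); the function and demo names are mine:

```python
import numpy as np

def deletion_diagnostics(X, y):
    """DFBETA, Cook's distance, DFFIT and leverage flags from one LS fit,
    using the closed-form row-deletion identities (no refitting)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)     # diagonal of hat matrix
    e = y - X @ (XtX_inv @ (X.T @ y))               # ordinary residuals
    s2 = e @ e / (n - p)
    # row i of dfbeta is (X'X)^{-1} x_i' e_i / (1 - h_i)
    dfbeta = (X @ XtX_inv) * (e / (1 - h))[:, None]
    cook = e**2 * h / (p * s2 * (1 - h) ** 2)       # Cook's D_i, cutoff 4/n
    dffit = h * e / (1 - h)                          # change in i-th fitted value
    leverage_flag = h > 2 * p / n                    # leverage cutoff 2p/n
    return dfbeta, cook, dffit, leverage_flag

# hypothetical demo: straight line with one shifted response
X = np.column_stack([np.ones(12), np.arange(12.0)])
y = 1 + 2 * np.arange(12.0)
y[3] -= 8.0                                          # planted outlier
dfbeta, cook, dffit, lev = deletion_diagnostics(X, y)
```

A response outlier in the middle of the x-range shows up strongly in Cook's distance and DFFIT even though its leverage is unremarkable.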
(iv) Covariance Matrix
Another major aspect of regression is the covariance matrix of the estimated coefficients. We again consider the diagnostic technique of row deletion, this time in a comparison of the covariance matrix using the entire data, σ²(X′X)⁻¹, with the covariance matrix that results when the i-th row has been deleted, σ²[X(i)′X(i)]⁻¹. Of the various alternative means for comparing two such positive definite symmetric matrices, the ratio of their determinants, det{σ²[X(i)′X(i)]⁻¹}/det{σ²(X′X)⁻¹}, is one of the simplest and is quite appealing. Since these two matrices differ only by the inclusion of the i-th row in the sum of squares and cross products, values of this ratio near unity can be taken to indicate that the covariance matrix is insensitive to the deletion of row i. Of course, the estimator s² of σ² also changes with the deletion of the i-th observation. We can bring the y data into consideration by comparing the two matrices s²(X′X)⁻¹ and s(i)²[X(i)′X(i)]⁻¹ in the determinantal ratio

COVRATIO = det{s(i)²[X(i)′X(i)]⁻¹} / det{s²(X′X)⁻¹} = (s(i)²/s²)ᵖ det[X′X] / det[X(i)′X(i)],

or equivalently

COVRATIO = 1 / { [ (n − p − 1)/(n − p) + e*ᵢ²/(n − p) ]ᵖ (1 − hᵢ) },

where e*ᵢ = eᵢ / [s(i)√(1 − hᵢ)] is the externally studentized residual. For COVRATIO, the cutoff value is 1 ± 3p/n.

2.2 Hat Matrix
Besides the above statistics, the hat matrix is sometimes used to detect influential observations. The hᵢ are the diagonal elements of the least squares projection matrix, also called the hat matrix,
H = X(X′X)⁻¹X′.
This determines the fitted or predicted values, since Ŷ = Xβ̂ = HY. The influence of the response value Yᵢ on the fit is most directly reflected in its impact on the corresponding fitted value Ŷᵢ, and this information is contained in hᵢ. Where there are two or fewer explanatory variables, scatter plots will quickly reveal any x outliers, and it is not hard to verify that they have relatively large hᵢ values. Here the cutoff value is 2p/n, where p is the rank of the X matrix. The i-th observation is called a leverage point when hᵢ exceeds 2p/n.

2.3 Outlier Detection Based on Robust Regression
Many diagnostics are based on the residuals resulting from LS. However, this starting point may lead to useless results, for the following reason. By definition, LS tries to avoid large residuals. Consequently, one outlying case may cause a poor fit for the majority of the data, because the LS estimator tries to accommodate this case at the expense of the remaining observations.

Therefore an outlier may have a small residual, especially when it is a leverage point. As a consequence, diagnostics based on LS residuals often fail to reveal such points. A least squares model can be distorted by a single observation: the fitted line or surface might be tipped so that it no longer passes through the bulk of the data. In order to reduce the effect of one very large error, least squares will introduce many small or moderate errors. For example, if a large error is reduced from 200 to 50, its square is reduced from 40,000 to 2,500; increasing an error from 5 to 15 increases its square only from 25 to 225. Thus, a least squares fit might introduce many small errors in order to reduce a large one.

Figure 2.1

In Figure 2.1 the line A denotes the regression line that passes through the bulk of the data. In the presence of the outlier, however, the line is dragged towards the outlier, because the principle of the least squares method requires the residual sum of squares to be minimum. The line B indicates the regression line in the presence of the outlier. Again, in some cases the use of the LS estimate to detect outliers may diagnose a clean point as an outlier and an outlier as a clean point.

Figure 2.2

In Figure 2.2, point 1 is actually an outlier, and in its presence the regression line is dragged out towards it, with the result that point 2 appears as an outlier even though it is a clean point, because the distance of this point from the line is now the largest. Moreover, all the statistics described above involve LS residuals; if they are applied to the detection of outliers, they may wrongly declare some clean observations to be outliers. It is therefore suggested to use robust regression in place of LS. Robust regression not only provides parameter estimates that are insensitive to the presence of outliers, but also helps in detecting outliers. Robust regression is a term used to describe model-fitting procedures that are insensitive to the effects of outlying observations. Many robust regression methods have been developed, of which the M-estimator is the most popular.

M-estimator
In general, a class of robust estimators that minimize a function f of the residuals is defined by

minimize over β:  Σᵢ₌₁ⁿ f(eᵢ) = Σᵢ₌₁ⁿ f(yᵢ − xᵢβ),  (2.1)

where xᵢ denotes the i-th row of X. Generally the following problem is solved:

minimize over β:  Σᵢ₌₁ⁿ f(eᵢ/s) = Σᵢ₌₁ⁿ f[(yᵢ − xᵢβ)/s],  (2.2)

where

s = median|eᵢ − median(eᵢ)| / 0.6745.  (2.3)

Here s is an approximately unbiased estimator of σ if n is large and the error distribution is normal. An estimator of this type is called an M-estimator, where M stands for maximum likelihood: the function f is related to the likelihood function for an appropriate choice of the error distribution. For example, if the method of least squares is used (implying the error distribution is normal), then f(z) = ½z². To minimize Eq. (2.2), equate the first partial derivatives of f with respect to βⱼ (j = 1, 2, …, p) to zero, yielding a necessary condition for a minimum. This gives the system of p equations

Σᵢ₌₁ⁿ ψ[(yᵢ − xᵢβ)/s] xᵢⱼ = 0,  j = 1, 2, …, p,  (2.4)

where s is the robust estimate of scale, ψ = f′, xᵢⱼ is the i-th observation on the j-th regressor and xᵢ₀ = 1. In general the ψ function is nonlinear and Eq. (2.4) must be solved by iterative methods.
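The iterative solution of Eq. (2.4) is described next; a minimal sketch of it in NumPy is given here, using the Huber ψ function (the choice of ψ and the tuning constant c = 1.345 are my illustrative assumptions; the paper leaves f unspecified):

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """M-estimation of beta by iteratively reweighted least squares.
    Weights are psi(u)/u with the Huber psi; the scale s is the
    rescaled MAD of the residuals (Eq. 2.3)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # LS starting values
    for _ in range(n_iter):
        e = y - X @ beta
        s = np.median(np.abs(e - np.median(e))) / 0.6745  # robust scale
        s = max(s, 1e-8)                                  # guard against zero scale
        u = e / s
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))  # Huber weights psi(u)/u
        XW = X.T * w                                      # X'W for diagonal W
        beta = np.linalg.solve(XW @ X, XW @ y)            # weighted LS normal equations
    return beta

# hypothetical demo: a clean line plus two gross errors at large x
rng = np.random.default_rng(1)
x = np.arange(20.0)
y = 1 + 2 * x + rng.normal(scale=0.5, size=20)
y[18] += 50.0
y[19] += 50.0
X = np.column_stack([np.ones(20), x])
beta_robust = huber_irls(X, y)
```

The two gross errors receive weights far below 1 and barely move the robust slope, whereas an ordinary LS fit to the same data is pulled well away from the true slope of 2.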

Iteratively reweighted least squares is the most widely used; this approach is usually attributed to Beaton and Tukey (1974). To use iteratively reweighted least squares, suppose that an initial estimate β̂₀ is available and that s is an estimate of scale. Then the p equations in (2.4),

Σᵢ₌₁ⁿ ψ[(yᵢ − xᵢβ)/s] xᵢⱼ = 0,  (2.5)

can be written as

Σᵢ₌₁ⁿ xᵢⱼ wᵢ⁰ (yᵢ − xᵢβ) = 0,  (2.6)

where

wᵢ⁰ = ψ[(yᵢ − xᵢβ̂₀)/s] / [(yᵢ − xᵢβ̂₀)/s]  if yᵢ ≠ xᵢβ̂₀,  and  wᵢ⁰ = 1  if yᵢ = xᵢβ̂₀.  (2.7)

In matrix notation, Eq. (2.6) becomes

X′W₀Xβ = X′W₀y,  (2.8)

where W₀ is an n × n diagonal matrix of weights with diagonal elements w₁⁰, w₂⁰, …, wₙ⁰ given by Eq. (2.7). We recognize Eq. (2.8) as the usual weighted least squares normal equations. Consequently, the one-step estimator is

β̂₁ = (X′W₀X)⁻¹X′W₀y.  (2.9)

In the next step the weights are recomputed from Eq. (2.7), but using β̂₁ instead of β̂₀.

Least Median of Squares (LMS) Estimator
Least Median of Squares (LMS) regression, developed by Rousseeuw (1984), minimizes the median squared residual. Since it focuses on the median residual, up to half of the observations can disagree without masking a model that fits the rest of the data. For the linear regression model E(Y) = Xβ, with X of rank p, let b be any estimate of β. With n observations, the residuals from this estimate are eᵢ(b) = yᵢ − xᵢb, i = 1, 2, …, n. The LMS estimate β̂ₚ is the value of b minimizing the median of the squared residuals; thus β̂ₚ minimizes the scale estimate

σ̂²(b) = e²₍med₎(b),  (2.10)

where e²₍k₎(b) is the k-th ordered squared residual. In order to allow for estimation of the parameters of the linear model, the median is taken as

med = integer part of (n + p + 1)/2.  (2.11)

The parameter estimate satisfying (2.10) has, asymptotically, a breakdown point of 50%. Thus, for large n, almost half the data can be outliers, or come from some other model, and LMS will still provide an unbiased estimate of the regression parameters. This is the maximum breakdown that can be tolerated: for a higher proportion of outliers there is no longer a model that fits the majority of the data. The very robust behaviour of the LMS estimate is in contrast to that of the least squares estimate β̂ minimizing

S(β) = (y − Xβ)′(y − Xβ), which can be written as S(b) = Σᵢ₌₁ⁿ eᵢ²(b).

Only one observation needs to be moved towards infinity to cause an arbitrarily large change in the estimate β̂: the breakdown point of β̂ is zero. The definition of β̂ₚ in (2.10) gives no indication of how to find such a parameter estimate. Since the surface to be minimized has many local minima, approximate methods are used. Rousseeuw (1984) finds an approximation to β̂ₚ by searching only over elemental sets, that is, subsets of p observations taken at random.

Fitting an LMS regression model poses some difficulties. The first is computational: unlike least squares regression, there is no formula that can be used to calculate the coefficients of an LMS regression. Rousseeuw (1984) proposed an algorithm to obtain the LMS estimator. According to this algorithm, random samples of size p are drawn, a regression surface is fitted to each set of observations, and the median squared residual is calculated; the model with the smallest median squared residual is used. Once a robust fit of the model is obtained, residuals from this model are used for detecting outlying observations. Figure 2.3 illustrates that the LMS fit is not affected by the outlying observation.

Figure 2.3
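Rousseeuw's elemental-set algorithm just described can be sketched as follows (a hedged illustration; the sample count, seed and data are my own choices, not the paper's):

```python
import numpy as np

def lms_fit(X, y, n_samples=1000, seed=0):
    """Approximate LMS: fit exactly p points at a time and keep the fit
    with the smallest median squared residual, where
    med = integer part of (n + p + 1)/2 (Eq. 2.11)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    med = (n + p + 1) // 2 - 1                 # 0-based index of e^2_[med]
    best_beta, best_crit = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(n, size=p, replace=False)
        try:
            b = np.linalg.solve(X[idx], y[idx])  # exact fit to p points
        except np.linalg.LinAlgError:
            continue                             # skip degenerate subsets
        crit = np.sort((y - X @ b) ** 2)[med]
        if crit < best_crit:
            best_beta, best_crit = b, crit
    return best_beta, best_crit

# hypothetical demo: a line with 30% of the responses contaminated
x = np.arange(30.0)
y = 1 + 2 * x
y[:9] += 40.0                                   # 9 of 30 observations shifted
X = np.column_stack([np.ones(30), x])
beta_lms, crit = lms_fit(X, y)
```

With 30% contamination the ordinary LS fit is badly tilted, but the LMS slope recovers the uncontaminated line because the median squared residual ignores the shifted minority.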

One of the drawbacks of the LMS estimate is that it does not consider all the data points for the estimation of the parameters. Also, it does not reveal the masking effect of outliers, if any. For these reasons the forward search method for the detection of outliers has been used.

2.4 Forward Search
If the values of the parameters of the model were known, there would be no difficulty in detecting the outliers, which would have large residuals. The difficulty arises because the outliers are included in the data used to estimate the parameters, which can then be badly biased. Most methods for outlier detection therefore seek to divide the data into two parts, a larger clean part and the outliers. For detecting multiple outliers, some apply the single row deletion method repeatedly, but this fails when there is a problem of masking. Multiple row deletion has also been suggested, but the difficulty there is the explosion in the number of combinations to be considered. To overcome such problems, the forward search was developed by Atkinson (1994). The basic idea of this method is to order the observations by their closeness to the fitted model. It starts with a fit to very few observations and then successively fits to larger subsets. The starting point is found by fitting to a large number of small subsets, using methods from robust statistics to determine which subset fits best. Then all the observations are ordered by closeness to this fitted model; for regression models, the residuals determine closeness. The subset size is increased by one and the model refitted to the observations with the smallest residuals for the increased subset size. Usually one observation enters at a time, but sometimes two or more enter as one or more leave. The process continues, with increasing subset sizes, until finally all the data are fitted. As a result of this forward search we have an ordering of the observations by closeness to the assumed model.
If the model and the data agree, the robust and least squares fits will be similar, as will the parameter estimates and residuals from the two fits. But often the estimates and residuals of the fitted model change appreciably during the forward search. The changes in these quantities and in various statistics, such as score tests for transformation, are monitored as the process moves forward through the data, adding one observation at a time. This forward procedure provides a wealth of information not only for outlier detection but, much more importantly, on the effect of each observation on aspects of inference about the model. In the forward search, large subsamples of outlier-free observations are found by starting from small subsets and incrementing them with observations that have small residuals and so are unlikely to be outliers. The method was introduced by Hadi (1992) for the detection of outliers from a fit using approximately half the observations. Different versions of the method are described by Hadi and Simonoff (1993), Hadi and Simonoff (1994) and Atkinson (1994). Suppose that at some stage in the forward search the set of m observations used in fitting is S(m). Fitting to this subset by least squares (for regression models) yields the parameter estimates β̂ₘ. From these parameter estimates, a set of n residuals eᵢ(m) can be

calculated, together with an estimate s²(m) of σ². Suppose the subset S(m) is clear of outliers. There will then be n − m observations not used in fitting that may contain outliers. The interest is in the evolution, as m goes from p to n, of quantities such as residuals and test statistics, together with Cook's distance and other diagnostic quantities. The sequence of parameter estimates β̂ₘ and the related t statistics are also monitored. The changes that occur, which can always be associated with the introduction of a particular group of observations (in practice almost always one observation) into the subset used for fitting, are monitored. Interpretation of these changes is complemented by examination of changes in the forward plot of residuals.

Remark 1
The search starts with the approximate LMS estimator found by sampling subsets of size p. Let this be β̂ₚ, and let the least squares estimator at the end of the search be β̂ₙ = β̂. In the absence of outliers and systematic departures from the model, E(β̂ₚ) = E(β̂) = β; that is, both parameter estimates are unbiased estimators of the same quantity. The same property holds for the sequence of estimates β̂ₘ produced in the forward search. Therefore, in the absence of outliers, it is expected that both parameter estimates and residuals will remain sensibly constant during the forward search.

Remark 2
Now suppose there are k outliers. Starting from a clean subset, the forward procedure will include these towards the end of the search, usually in the last k steps. Until these outliers are included, the condition of Remark 1 will hold, so residual plots and parameter estimates will remain sensibly constant until the outliers are incorporated in the subset used for fitting.

Remark 3
If there are indications that the regression data should be transformed, it is important to remember that outliers on one transformed scale may not be outliers on another scale. If the data are analyzed using the wrong transformation, the k outliers may enter the search well before the end.
The forward algorithm is made up of three steps: the first concerns the choice of an initial subset, the second the way in which the forward search progresses, and the third the monitoring of statistics during the progress of the search.

Step 1: Choice of the Initial Subset
A formal definition of the algorithm used to find the LMS estimator is now given. If the model contains p parameters, the forward search algorithm starts with the selection of a subset of p units. Observations in this subset are intended to be outlier free. If n is moderate and p << n, the choice of the initial subset can be performed by exhaustive

enumeration of all distinct p-tuples Sₗ(p). Let e²ᵢ,ₗ(p) be the squared least squares residual for unit i, given the observations in Sₗ(p). The initial subset is taken as the p-tuple S*(p) which satisfies

e²₍med₎,*(p) = minₗ e²₍med₎,ₗ(p),

where e²₍k₎,ₗ(p) is the k-th ordered squared residual among e²ᵢ,ₗ(p), i = 1, 2, …, n, and med is the integer part of (n + p + 1)/2. If the number of distinct p-tuples is too large, 1,000 samples are taken.

Step 2: Adding Observations during the Forward Search
Given a subset of dimension m ≥ p, say S(m), the forward search moves to dimension m + 1 by selecting the m + 1 units with the smallest squared residuals, the units being chosen by ordering all the squared residuals e²ᵢ(m), i = 1, 2, …, n. The forward search estimator β̂_FS is defined as the collection of least squares estimators in each step of the forward search; that is, β̂_FS = (β̂ₚ, …, β̂ₙ). In most moves from m to m + 1 just one new unit joins the subset. It may also happen that two or more units join S(m) as one or more leave. Such an event is, however, quite unusual, occurring only when the search includes one unit that belongs to a cluster of outliers: at the next step the remaining outliers in the cluster seem less outlying, and so several may be included at once, forcing several other units to leave the subset. The search avoids, in its first steps, the inclusion of outliers and provides a natural ordering of the data according to the specified null model. In this approach, a highly robust method and least squares estimators are used at the same time. The zero breakdown point of the least squares estimators does not, in the context of the forward search, turn out to be disadvantageous. The introduction of atypical (influential) observations is signaled by sharp changes in the curves that monitor parameter estimates, t tests, or any other statistic at every step. In this context the robustness of the method derives not from the choice of a particular estimator with a high breakdown point, but from the progressive inclusion of units into a subset that is outlier free.
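Steps 1 and 2 (together with a record of the entry order, the raw material for the monitoring of Step 3) can be sketched as below; this is an illustrative reading of the algorithm, not the authors' code, and the sampling constants are my own choices:

```python
import numpy as np

def forward_search(X, y, n_start_samples=500, seed=0):
    """Forward search sketch: a robust (LMS-style) initial subset of p
    units, then grow one unit at a time by refitting with least squares
    and keeping the m+1 smallest squared residuals."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    med = (n + p + 1) // 2 - 1
    # Step 1: initial subset minimizing the median squared residual
    best_idx, best_crit = None, np.inf
    for _ in range(n_start_samples):
        idx = rng.choice(n, size=p, replace=False)
        try:
            b = np.linalg.solve(X[idx], y[idx])
        except np.linalg.LinAlgError:
            continue
        crit = np.sort((y - X @ b) ** 2)[med]
        if crit < best_crit:
            best_idx, best_crit = idx, crit
    subset = list(best_idx)
    entry_order = list(subset)
    s2_path = []                                  # s^2(m) along the search
    # Step 2: grow the subset, refitting by LS at each size m
    for m in range(p, n):
        b = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
        e2 = (y - X @ b) ** 2
        s2_path.append(e2[subset].sum() / max(m - p, 1))
        subset = list(np.argsort(e2)[: m + 1])    # m+1 smallest residuals
        for i in subset:
            if i not in entry_order:
                entry_order.append(i)
    return entry_order, s2_path

# hypothetical demo: one planted outlier should enter last
x = np.arange(20.0)
y = 3 + 1.5 * x
y[7] += 25.0
X = np.column_stack([np.ones(20), x])
order, s2_path = forward_search(X, y)
```

Because the planted outlier always has the largest residual from every clean fit, it is the last unit to join the subset, exactly the ordering behaviour the text describes.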
As a bonus of the suggested procedure, the observations can be naturally ordered according to the specified null model, and it is possible to know how many of them are compatible with a particular specification. Furthermore, the suggested approach enables analysis of the influential effect of the atypical units (outliers) on the results of the statistical analyses.

Remark 4
The method is not sensitive to the method used to select an initial subset, provided unmasked outliers are not included at the start. For example, the least median of squares criterion for regression can be replaced by that of Least Trimmed Squares (LTS). This criterion provides estimators with better properties than LMS estimators, found by minimizing the sum of the smallest h squared residuals,

S_h(b) = Σᵢ₌₁ʰ e²₍i₎(b),  for some h with [(n + p + 1)/2] ≤ h < n.
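The LTS criterion of Remark 4 can be written down directly; a small sketch that only evaluates the criterion (minimizing it would need a search such as the elemental-set sampling used for LMS; the comparison data are made up for illustration):

```python
import numpy as np

def lts_criterion(b, X, y, h):
    """S_h(b): sum of the h smallest squared residuals at parameter b."""
    e2 = np.sort((y - X @ b) ** 2)
    return e2[:h].sum()

# hypothetical demo: the true line beats the LS fit on S_h when a
# minority of the responses is contaminated
x = np.arange(30.0)
y = 1 + 2 * x
y[:6] += 35.0                                   # 6 contaminated cases
X = np.column_stack([np.ones(30), x])
h = (30 + 2 + 1) // 2                           # lower bound [(n+p+1)/2]
good = lts_criterion(np.array([1.0, 2.0]), X, y, h)
ls_beta = np.linalg.lstsq(X, y, rcond=None)[0]
bad = lts_criterion(ls_beta, X, y, h)
```

The trimmed sum ignores the largest n − h squared residuals, so the contaminated cases simply drop out when the trial parameter is close to the model followed by the majority.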

What is important in this procedure is that the initial subset is either free of outliers or contains unmasked outliers which are immediately removed by the forward procedure. The search is often able to recover from a start that is not very robust. An example for regression is given by Atkinson and Mulira (1993), and one for spatial data by Cerioli and Riani (1999).

Step 3: Monitoring the Search
Step 2 of the forward search is repeated until all units are included in the subset. If just one observation enters S(m) at each move, the algorithm provides an ordering of the data according to the specified null model, with observations furthest from it joining the subset at the last stages of the procedure.

Example 2.1: A data set used by Weisberg (1985) is considered here to introduce the ideas of regression analysis. There are 17 observations on the boiling point of water, in °F, at different pressures, obtained from measurements at a variety of elevations in the Alps. The purpose of the original experiment was to allow prediction of pressure from boiling point, which is easily measured, and so to provide an estimate of altitude: the higher the altitude, the lower the pressure and the consequent boiling point. Weisberg (1985) gives values of both pressure and 100 log(pressure) as possible responses. The variables are x: boiling point (°F) and y: 100 log(pressure).

Table 2.1: Data on air pressure in the Alps and the boiling point of water (columns: Observation Number, Boiling Point, 100 log Pressure)

The data are plotted in Figure 2.4. A quick glance at the plot shows a strong linear relationship between log(pressure) and boiling point. A slightly longer glance reveals that one of the points lies slightly off the line. Linear regression of y on x yields a t value for the regression of 54.45, clear evidence of the significance of the relationship.

Two plots of the least squares residuals eᵢ are often used to check fitted models. Figure 2.5 shows a plot of residuals against fitted values ŷᵢ. This clearly shows one outlier, observation 12. The normal plot of the studentized residuals is an almost straight line, from which the large residual for observation 12 is clearly distanced. It is clear that observation 12 is an outlier.

Figure 2.4
Figure 2.5

It is now shown how the forward search can reveal this point as an outlier. The search starts with a least squares fit to two observations, robustly chosen. From this fit, residuals for all 17 observations are calculated, and the next fit is to the three observations with the smallest residuals. In general, we fit to a subset of size m, order the residuals, and take as the next subset the m + 1 cases with the smallest residuals. This gives a forward search through the data, ordered by closeness to the model. It is expected that the last observations to enter the search will be those which are furthest from the model and so may cause changes once

they are included in the subset used for fitting. In the search through the data, the outlying observation 12 was the last to enter.

Figure 2.6
Figure 2.7

For each value of m from 2 to 17, quantities such as the residuals and the parameter estimates are calculated to see how they change. Figure 2.6 is a plot of the values of the parameter estimates during the forward search. The values are extremely stable, reflecting the closeness of all observations to the straight line. The introduction of observation 12 at the end of the search causes virtually no change in the position of the line. However, Figure 2.7 shows that the introduction of observation 12 causes a huge increase in s², the residual mean square estimate of the error variance σ². The information from these plots about observation 12 confirms and quantifies that from the scatter plot of Figure 2.4: observation 12 is an outlier, but the observation is in the centre of the data, so that its exclusion or inclusion has a small effect on the estimated parameters. The plots also show that all other observations agree with the overall model. Throughout the search, all cases have small residuals, apart from case 12, which is outlying from all fitted subsets. Even when it is included in the last step of the search, its residual only decreases slightly.

Remark 5
The estimate of σ² does not remain constant during the forward search, as observations are sequentially selected to have small residuals. Thus, even in the absence of outliers, the

residual mean square estimate satisfies s²(m) < s²(n) = s² for m < n. The smooth increase of s²(m) with m is typical of what is expected when the data agree with the model and are correctly ordered by the forward search.

Example 2.2: Table 2.2 gives 60 observations on a response y together with the values of three explanatory variables [Atkinson and Riani (2000)].

Table 2.2 (columns: obs, x1, x2, x3, y)

The plot of residuals against fitted values, Figure 2.8, shows no obvious pattern; the largest residual is that of case 43. There is therefore no clear indication that the data are not homogeneous and well behaved. Evidence of the structure of the data is clearly shown in Figure 2.9, the scaled squared residuals from the forward search. This fascinating plot reveals the presence of 6 masked outliers. The left-hand end of the plot gives the residuals from the LMS estimates found by sampling 1000 subsets of size p = 4. From the most extreme residual downwards, the cases giving rise to the outliers are 9, 30, 31, 38, 47 and 21. When all the data are fitted, the largest residuals belong, in order, to cases 43, 51, 47, 31, 9, 38, 29, 7 and 48.

Figure 2.8

Figure 2.9

The assessment of the importance of these outliers can be made from the behaviour of the parameter estimates and of the related t statistics. Apart from β̂₁, all remain positive, with t values around 10 or greater, during the course of the forward search. We therefore concentrate on the behaviour of t₁, the t statistic for β₁. The values for the last 20 steps of the forward search are plotted in Figure 2.10. The general downward trend is typical of plots of t statistics from the forward search; it is caused by the increasing value of s² (Figure 2.11) as observations with larger residuals enter during the search. An important feature in the interpretation of Figure 2.10 is the two upward jumps in the value of the statistic. The first results from the inclusion of observation 43 when m = 54, giving a t value of 2.5, evidence significant at the 3% level of a positive value of β₁. Thereafter the outliers enter the subset, with observation 43 leaving when m = 58 as two outliers enter. When m = 59 the value of the statistic has decreased to −1.93, close to evidence for a negative value of the parameter. Reintroduction of observation 43 in the last step of the search results in a value of −1.6, indicating that β₁ may well be zero. It is therefore important that the outliers be identified.

Figure 2.10

Figure 2.11

This example shows very clearly the existence of masked outliers, which would not be detected by LS. The forward plot of residuals in Figure 2.9, however, clearly indicates a structure that is hidden in the conventional plot of residuals.

3. Conclusions
The least squares method is very sensitive to extreme observations: only one observation needs to be moved towards infinity to cause an arbitrarily large change in the estimate β̂. The principle of LS is to minimize the residual sum of squares; if any outlier is present, LS will try to minimize the residual of this particular point at the expense of the other observations, so that sometimes a good observation may be flagged as an outlier. It also cannot cope when masking is present. In contrast to LS, robust regression procedures provide parameter estimates that are insensitive to the presence of outliers. Least Median of Squares (LMS) is one of the popular robust regression methods; it has a 50% breakdown point, and it not only estimates the parameters but also detects outliers. However, it does not consider all the observations for parameter estimation, so some good observations may not be used. The forward search reveals outliers which were initially masked and not revealed by LS. It starts with p observations chosen by LMS, and the subset is then increased one observation at a time, the observations being ordered according to their residuals. Therefore, if any outlier is present in the data, it will enter the forward search at the end. The search avoids, in its first steps, the inclusion of outliers and provides a natural ordering of the data. The method uses a highly robust method and, at the same time, least squares estimators; the zero breakdown point of the least squares estimators, in the context of the forward search, does not turn out to be disadvantageous.

References
Atkinson, A. C. and Riani, M. (2000). Robust Diagnostic Regression Analysis. 1st edition, Springer, New York.

Atkinson, A. C. (1994). Fast very robust methods for the detection of outliers. J. Amer. Statist. Assoc., 89.
Atkinson, A. C. and Mulira, H. M. (1993). The stalactite plot for the detection of multivariate outliers. Statistics and Computing, 3.
Beaton, A. E. and Tukey, J. W. (1974). The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16.
Beckman, R. J. and Cook, R. D. (1983). Outlier.s. Technometrics, 25.
Cerioli, A. and Riani, M. (1999). The ordering of spatial data and the detection of multiple outliers. Journal of Computational and Graphical Statistics, 8.
Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19.
Hadi, A. S. (1992). Identifying multiple outliers in multivariate data. J. of the Royal Statistical Society, Series B, 54.
Hadi, A. S. and Simonoff, J. S. (1993). Procedures for identification of multiple outliers in linear models. J. Amer. Statist. Assoc., 88.
Hadi, A. S. and Simonoff, J. S. (1994). Improving the estimation and outlier identification properties of the least median of squares and minimum volume ellipsoid estimators. Parisankhyan Samikkha, 1.
Hocking, R. R. and Pendleton, O. J. (1983). The regression dilemma. Communications in Statistics - Theory and Methods, 12.
Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc., 79.
Rousseeuw, P. J. and Van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. J. Amer. Statist. Assoc., 85.
Weisberg, S. (1985). Applied Linear Regression. 2nd edition, Wiley, New York.
Woodruff, D. and Rocke, D. M. (1994). Computable robust estimation of multivariate location and shape in high dimension using compound estimators. J. Amer. Statist. Assoc., 89.


More information

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis Resource Allocaton and Decson Analss (ECON 800) Sprng 04 Foundatons of Regresson Analss Readng: Regresson Analss (ECON 800 Coursepak, Page 3) Defntons and Concepts: Regresson Analss statstcal technques

More information

Chap 10: Diagnostics, p384

Chap 10: Diagnostics, p384 Chap 10: Dagnostcs, p384 Multcollnearty 10.5 p406 Defnton Multcollnearty exsts when two or more ndependent varables used n regresson are moderately or hghly correlated. - when multcollnearty exsts, regresson

More information

Turbulence classification of load data by the frequency and severity of wind gusts. Oscar Moñux, DEWI GmbH Kevin Bleibler, DEWI GmbH

Turbulence classification of load data by the frequency and severity of wind gusts. Oscar Moñux, DEWI GmbH Kevin Bleibler, DEWI GmbH Turbulence classfcaton of load data by the frequency and severty of wnd gusts Introducton Oscar Moñux, DEWI GmbH Kevn Blebler, DEWI GmbH Durng the wnd turbne developng process, one of the most mportant

More information

Statistics MINITAB - Lab 2

Statistics MINITAB - Lab 2 Statstcs 20080 MINITAB - Lab 2 1. Smple Lnear Regresson In smple lnear regresson we attempt to model a lnear relatonshp between two varables wth a straght lne and make statstcal nferences concernng that

More information

Basic Business Statistics, 10/e

Basic Business Statistics, 10/e Chapter 13 13-1 Basc Busness Statstcs 11 th Edton Chapter 13 Smple Lnear Regresson Basc Busness Statstcs, 11e 009 Prentce-Hall, Inc. Chap 13-1 Learnng Objectves In ths chapter, you learn: How to use regresson

More information

Topic- 11 The Analysis of Variance

Topic- 11 The Analysis of Variance Topc- 11 The Analyss of Varance Expermental Desgn The samplng plan or expermental desgn determnes the way that a sample s selected. In an observatonal study, the expermenter observes data that already

More information

Chapter 15 - Multiple Regression

Chapter 15 - Multiple Regression Chapter - Multple Regresson Chapter - Multple Regresson Multple Regresson Model The equaton that descrbes how the dependent varable y s related to the ndependent varables x, x,... x p and an error term

More information

ANOMALIES OF THE MAGNITUDE OF THE BIAS OF THE MAXIMUM LIKELIHOOD ESTIMATOR OF THE REGRESSION SLOPE

ANOMALIES OF THE MAGNITUDE OF THE BIAS OF THE MAXIMUM LIKELIHOOD ESTIMATOR OF THE REGRESSION SLOPE P a g e ANOMALIES OF THE MAGNITUDE OF THE BIAS OF THE MAXIMUM LIKELIHOOD ESTIMATOR OF THE REGRESSION SLOPE Darmud O Drscoll ¹, Donald E. Ramrez ² ¹ Head of Department of Mathematcs and Computer Studes

More information