CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS


Influential observations are observations whose presence in the data can have a distorting effect on the parameter estimates and possibly on the entire analysis, e.g. by identifying the wrong model.

Distinct from outliers, though it is possible for one observation to be both influential and an outlier.

Outliers:
1. Data points that contain unusual dependent (y) values.
2. Outlying independent (x) values: these do not indicate lack of fit of the model, but some observations still influence the fit more than others.

Detection: in simple linear regression, usually easy from plots of the data, but in multiple regression more formal measures are required.

Figure 4.1. Three least squares lines fitted to sample data (y against x), where the observation at x = 8 is allowed to move between the three points A, B and C. The corresponding least squares fits are the solid, dashed and dotted lines respectively.

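The hat matrix

Recall $\hat{Y} = HY$ with $H = X(X^T X)^{-1} X^T$, so the covariance matrix of $\hat{Y}$ is
$$ \mathrm{Var}\{\hat{Y}\} = H\sigma^2. $$
The variance of $\hat{y}_i$ is $h_{ii}\sigma^2$, and the variance of the $i$th residual $e_i$ is $(1 - h_{ii})\sigma^2$.

Properties of the $\{h_{ii}\}$ values include
$$ 0 \le h_{ii} \le 1 \quad \text{for all } i, \tag{1} $$
$$ \sum_i h_{ii} = p. \tag{2} $$
Property (1) follows simply from the fact that both $h_{ii}\sigma^2$ and $(1 - h_{ii})\sigma^2$ are the variances of random quantities, and therefore are nonnegative. For property (2), note that $\mathrm{tr}(H) = \mathrm{tr}\{(X^T X)^{-1} X^T X\} = \mathrm{tr}(I_p) = p$.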
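Properties (1) and (2) are easy to check numerically. The following is a minimal sketch on simulated data; the design matrix, sample size and seed are illustrative assumptions and do not come from the chapter.

```r
# Simulated design: intercept plus two covariates (illustrative data only)
set.seed(1)
n <- 20
X <- cbind(1, rnorm(n), runif(n))
p <- ncol(X)

# Hat matrix H = X (X'X)^{-1} X' and its diagonal
H <- X %*% solve(crossprod(X)) %*% t(X)
h <- diag(H)

range(h)   # property (1): every h_ii lies between 0 and 1
sum(h)     # property (2): the h_ii sum to p

# The same values from a fitted model; the response is arbitrary
# because H depends on X only
y <- rnorm(n)
fit <- lm(y ~ X - 1)
all.equal(unname(hatvalues(fit)), h)
```

In practice `hatvalues()` is preferable to forming $H$ explicitly, since it works from the QR decomposition of the fitted model.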

Leverage

A data point with large $h_{ii}$ is called a point of high leverage.

How high is high? By (2), the average value of $h_{ii}$ is $p/n$. A standard criterion is to call any data point for which $h_{ii} > 2p/n$ a point of high leverage.

Note that since $h_{ii}$ is a function of $X$ alone, it has no sampling distribution, and thus there is no formal test.

Example: Consider the artificial data of Fig. 4.1. The twelve x values here are 0, 0.2, 0.4, ..., 1.8, 2, 8. The corresponding $h_{ii}$ values are .1342, .1221, ..., .0869, .9182.

The last observation, corresponding to x = 8, is clearly highly influential (its $h_{ii}$ is far above $2p/n = 4/12 \approx 0.33$). Intuitively, this is because if this point is moved up or down, the least squares straight line will tend to follow it: the overall least squares fit on the other 11 observations is not much affected by modest changes in the slope of the fitted straight line, but the fit at x = 8 has a big influence.

Note that this has nothing to do with $y_{12}$ possibly being an outlier, since for any $i$, the actual value of $y_i$ does not even enter into the calculation of $h_{ii}$.
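The leverages quoted for the artificial data can be reproduced from the x values alone. A short sketch follows; the variable names are illustrative, and only the twelve x values are taken from the example.

```r
# The twelve x values of the artificial example
x <- c(seq(0, 2, by = 0.2), 8)
n <- length(x)
p <- 2                      # intercept and slope

# Leverages from the design matrix; no y values are needed
X <- cbind(1, x)
h <- diag(X %*% solve(crossprod(X), t(X)))
round(h, 4)

# Points exceeding the 2p/n criterion
which(h > 2 * p / n)
```

Only the twelfth point (x = 8) exceeds the threshold, whatever its y value is.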

Real data examples from Chapter 3

Tree data: Highest $h_{ii}$ value is $h_{20} = 0.2428$ (diameter = 13.8, height = 64), not extreme for either independent variable but does correspond to a fairly large diameter combined with the second smallest height.

Next three largest values of $h_{ii}$ are $h_3 = 0.1975$, $h_{31} = 0.1803$ and $h_2 = 0.1672$. In this case p = 3, n = 31, so according to the $2p/n = 0.1935$ criterion, observations 3 and 20 are flagged as high-leverage, potentially influential points.

This draws attention to two observations which would not necessarily be identified as influential from initial inspection of the data.

Figure: Plot of tree diameter (diam) against height.

Real data examples from Chapter 3

Nuclear power data: $2p/n = 14/32 = 0.4375$. Largest values of $h_{ii}$ are in rows 26 (0.4126), 19 (0.3614) and 22 (0.3526). Not especially large, but it turns out later that these observations are influential.

    > nuk.inf <- lm.influence(nuk.lm)
    > print(nuk.inf$hat, digits = 2)
        1     2     3     4     5     6     7     8     9
    0.221 0.263 0.263 0.242 0.242 0.282 0.143 0.291 0.114
       10    11    12    13    14    15    16    17    18
    0.262 0.155 0.165 0.189 0.130 0.197 0.098 0.179 0.189
       19    20    21    22    23    24    25    26    27
    0.361 0.137 0.198 0.353 0.189 0.220 0.349 0.413 0.182
       28    29    30    31    32
    0.264 0.176 0.176 0.182 0.176

Deletion diagnostics

Recall $\mathrm{Var}(e_i) = (1 - h_{ii})\sigma^2$. This suggests that after estimating $\sigma^2$ by the mean squared error $s^2$, we may then define
$$ e_i^* = \frac{e_i}{s\sqrt{1 - h_{ii}}} \tag{3} $$
as a standardized form of residual. We call (3) the internally standardized residual. (Also known as studentized.)

This does not take into account the influence of an outlier on the parameter estimates. We could do that with the deletion residual
$$ d_i = y_i - \hat{y}_{i(i)}. \tag{4} $$
The subscript $(i)$ means that the model is refitted without the $i$th observation. Thus $\hat{y}_{i(i)}$ means the predicted value of $y_i$ based on the model fit in which observation $i$ is omitted.

Formula:
$$ d_i = \frac{e_i}{1 - h_{ii}}. \tag{5} $$
Since $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$, it follows that $\mathrm{Var}(d_i) = \sigma^2 / (1 - h_{ii})$, and we can estimate this by
$$ s^2(d_i) = \frac{s_{(i)}^2}{1 - h_{ii}} \tag{6} $$
in which $s_{(i)}^2$ denotes the estimated mean squared error with the $i$th observation deleted.

Both $y_i$ and $\hat{y}_{i(i)}$ are statistically independent of $s_{(i)}^2$, so $d_i$ and $s_{(i)}^2$ are independent, and
$$ \frac{d_i}{s(d_i)} \sim t_{n-p-1}. \tag{7} $$
This suggests that we define
$$ d_i^* = \frac{d_i}{s(d_i)} \tag{8} $$
as an externally studentized residual.

Calculate $s_{(i)}^2$ from
$$ (n - p)s^2 = (n - p - 1)s_{(i)}^2 + \frac{e_i^2}{1 - h_{ii}}. \tag{9} $$
Combining these formulae leads to
$$ d_i^* = e_i \left[ \frac{n - p - 1}{(1 - h_{ii})(n - p)s^2 - e_i^2} \right]^{1/2}. \tag{10} $$
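The shortcut formulae (5), (9) and (10) mean that no refitting is needed in practice. As a sanity check, the sketch below compares them with an explicit leave-one-out refit on simulated data; the data, seed and variable names are assumptions made for illustration. It also compares with R's built-in `rstandard()` and `rstudent()`.

```r
set.seed(2)
n <- 25
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)

e  <- resid(fit)
h  <- hatvalues(fit)
p  <- length(coef(fit))
s2 <- sum(e^2) / (n - p)

# Brute force: refit without observation i, then predict y_i
loo <- vapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = dat[-i, ])
  d_i  <- dat$y[i] - unname(predict(fit_i, newdata = dat[i, ]))
  s2_i <- sum(resid(fit_i)^2) / (n - 1 - p)
  c(d = d_i, s2 = s2_i)
}, c(d = 0, s2 = 0))

d      <- e / (1 - h)                                            # (5)
s2_del <- ((n - p) * s2 - e^2 / (1 - h)) / (n - p - 1)           # (9)
d_star <- e * sqrt((n - p - 1) / ((1 - h) * (n - p) * s2 - e^2)) # (10)
e_star <- e / sqrt(s2 * (1 - h))                                 # (3)

all.equal(unname(d), unname(loo["d", ]))
all.equal(unname(s2_del), unname(loo["s2", ]))
all.equal(d_star, rstudent(fit))    # externally studentized
all.equal(e_star, rstandard(fit))   # internally standardized
```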

Examples

Tree data: recall apparent outliers in observations 15, 16, 18. Internally standardized residuals are -2.109, -1.834, -2.162. Externally studentized residuals are -2.258, -1.919 and -2.326. Since $t_{27,.975} = 2.052$, a formal test of fit (at the 5% level of significance) would reject observations 15 and 18 as outliers. Largest positive externally studentized residual is 1.703, definitely not significant. Could go on to delete all three (discussed in text).

Nuclear power data: largest internally standardized residuals in magnitude are +2.275 in row 19, -2.220 in row 7, -2.052 in row 26 and -1.815 in row 12. When externally studentized these become +2.503, -2.427, -2.205 and -1.908. For comparison, $t_{25,.975} = 2.060$ and $t_{25,.995} = 2.787$. (By (7) the reference distribution has $n - p - 1 = 24$ degrees of freedom, for which the corresponding points are 2.064 and 2.797; the conclusions are the same.)

Measures of influence

1. DFFITS

Detects influence on fitted values.
$$ (\mathrm{DFFITS})_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{s_{(i)}\sqrt{h_{ii}}} = d_i^* \sqrt{\frac{h_{ii}}{1 - h_{ii}}}. \tag{11} $$
Rationale: a standardized formula for examining the difference between $\hat{y}_i$ and $\hat{y}_{i(i)}$. However, the second equality in (11) shows that it is equivalent to a rescaled form of the externally studentized residual, where the rescaling depends on the leverage of the $i$th data point. Thus DFFITS may be thought of as a combined measure of influence that takes into account the leverage of the data point as well as the size of the residual.

Rule of thumb: an observation is influential if $|\mathrm{DFFITS}|$ is greater than 1 in the case of small data sets, or $2\sqrt{p/n}$ for large data sets.
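Both forms of (11) can be checked against R's built-in `dffits()`. This is a sketch on simulated data (the data and names are illustrative assumptions); the first form is computed by actually refitting without each observation.

```r
set.seed(3)
n <- 30
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + dat$x + rnorm(n)
fit <- lm(y ~ x, data = dat)
p <- length(coef(fit))
h <- hatvalues(fit)

# Prediction for case i from the fit that omits case i
yhat_del <- vapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = dat[-i, ])
  unname(predict(fit_i, newdata = dat[i, ]))
}, numeric(1))

s_del <- lm.influence(fit)$sigma          # the s_(i) values

dffits_def  <- (fitted(fit) - yhat_del) / (s_del * sqrt(h))  # first form
dffits_resc <- rstudent(fit) * sqrt(h / (1 - h))             # second form

all.equal(dffits_def, dffits_resc)
all.equal(dffits_resc, dffits(fit))

# Large-sample rule of thumb
which(abs(dffits_resc) > 2 * sqrt(p / n))
```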

Examples

Tree data: $2\sqrt{p/n} = 0.6222$; the only influential value is observation 18, with DFFITS = -0.8811. This combines an outlier with moderate leverage ($h_{ii} = 0.1255$).

Nuclear power data: $2\sqrt{p/n} = 0.9354$ is easily exceeded in magnitude by the DFFITS values for rows 19 (1.8830) and 26 (-1.8481), and to a lesser extent in row 7 (-0.9908). Since we have already seen that these three rows have the largest residuals in magnitude, and that rows 26 and 19 are the ones with the highest leverage, these results are scarcely surprising.

2. DFBETAS

Intended to measure the influence of an observation on the parameter estimates.

If we were to test the hypothesis $H_0: \beta_k = \beta_{k0}$ for the $k$th parameter estimate, where $\beta_{k0}$ is a given numerical value, then a suitable test statistic would be
$$ \frac{\hat{\beta}_k - \beta_{k0}}{s\sqrt{c_{kk}}} $$
where $c_{kk}$ is the $k$th diagonal entry of $(X^T X)^{-1}$. This statistic has a $t_{n-p}$ distribution. Motivated by this, we define
$$ (\mathrm{DFBETAS})_{k(i)} = \frac{\hat{\beta}_k - \hat{\beta}_{k(i)}}{s_{(i)}\sqrt{c_{kk}}}. \tag{12} $$
Rule of thumb: $|\mathrm{DFBETAS}| > 1$ for a small data set or $2/\sqrt{n}$ for a large data set.
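Definition (12) can be computed directly by refitting and compared with R's built-in `dfbetas()`. A minimal sketch on simulated data (data and names are illustrative assumptions):

```r
set.seed(4)
n <- 30
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + dat$x1 + 0.5 * dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)

X     <- model.matrix(fit)
c_kk  <- diag(solve(crossprod(X)))   # diagonal of (X'X)^{-1}
s_del <- lm.influence(fit)$sigma     # the s_(i) values

# Row i holds beta-hat minus beta-hat_(i), obtained by refitting without case i
delta <- t(vapply(seq_len(n), function(i) {
  coef(fit) - coef(lm(y ~ x1 + x2, data = dat[-i, ]))
}, numeric(3)))

dfbetas_manual <- delta / outer(s_del, sqrt(c_kk))   # (12)
all.equal(unname(dfbetas_manual), unname(dfbetas(fit)))

# (observation, coefficient) pairs exceeding the large-sample rule of thumb
which(abs(dfbetas_manual) > 2 / sqrt(n), arr.ind = TRUE)
```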

Examples

Tree data: $2/\sqrt{n} = 0.3592$; offending values are 0.7571 (i = 18, k = 3), 0.7450 (i = 18, k = 1), 0.4930 (i = 17, k = 3) and 0.4770 (i = 17, k = 1). Row 17 causing trouble as well as row 18?

Nuclear power data: $2/\sqrt{n} = 0.3535$ is exceeded by several values for rows 7, 19 and 26 (largest overall value: 1.4899 for the LN coefficient in row 19). There are also three values in the 0.5 to 0.7 range for row 22.

3. Cook's D statistic

Overall measure of the influence of the $i$th observation on all the parameter estimates.

If we want to test $H_0: \beta = \beta_0$ for a given vector $\beta_0$, then when $H_0$ is true,
$$ \frac{(\hat{\beta} - \beta_0)^T X^T X (\hat{\beta} - \beta_0)}{p s^2} \sim F_{p,n-p}. $$
Cook defined
$$ D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^T X^T X (\hat{\beta} - \hat{\beta}_{(i)})}{p s^2} = \frac{e_i^2}{p s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}. \tag{13} $$
Identify the $i$th observation as influential if $D_i$ is greater than the 10% point of the $F_{p,n-p}$ distribution, and highly influential if it is greater than the 50% point.
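The two expressions in (13) can be compared numerically, together with R's built-in `cooks.distance()`. The sketch below uses simulated data (an illustrative assumption), and `qf()` gives the 10% and 50% points of $F_{p,n-p}$ used as thresholds.

```r
set.seed(5)
n <- 30
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + dat$x1 - dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)

X  <- model.matrix(fit)
e  <- resid(fit)
h  <- hatvalues(fit)
p  <- length(coef(fit))
s2 <- sum(e^2) / (n - p)

# Quadratic-form definition, refitting without each observation in turn
D_quad <- vapply(seq_len(n), function(i) {
  b <- coef(fit) - coef(lm(y ~ x1 + x2, data = dat[-i, ]))
  drop(t(b) %*% crossprod(X) %*% b) / (p * s2)
}, numeric(1))

# Closed form on the right-hand side of (13)
D_closed <- e^2 / (p * s2) * h / (1 - h)^2

all.equal(D_quad, unname(D_closed))
all.equal(D_closed, cooks.distance(fit))

# Thresholds: 10% and 50% points of F(p, n - p)
thresholds <- qf(c(0.10, 0.50), p, n - p)
which(D_closed > thresholds[1])
```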

Examples

Tree data: largest value of Cook's D is 0.224 in row 18, followed by 0.106 in row 17. For the $F_{3,28}$ distribution, the 10% point is 0.193 and the 50% point 0.81. Again row 18 stands out.

Nuclear power data: D = 0.423 in row 26, 0.418 in row 19. The 10% point of $F_{7,25}$ is 0.388 and the 50% point 0.93.

The modified Cook statistic

From the form of (13) in comparison with (11) and (12), it is natural to ask why, in (13), we did not use $s_{(i)}$ in place of $s$. In fact Atkinson (1985) took this point of view to define a modified Cook statistic which turns out, after scaling by a constant, to be equivalent to DFFITS.

Thus if Atkinson's point of view is adopted, there seems no need to consider Cook's statistic as a separate diagnostic, since all the relevant information is in DFFITS.

Our examples rather support this point of view, since it appears that Cook's D is downgrading the evidence of influence in the case of some observations which seemed highly influential when judged by the earlier diagnostics.

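4. COVRATIO

Measures the effect of deletion on the variances of the parameter estimates.
$$ (\mathrm{COVRATIO})_i = \frac{\mathrm{Det}[\{X_{(i)}^T X_{(i)}\}^{-1} s_{(i)}^2]}{\mathrm{Det}\{(X^T X)^{-1} s^2\}} \tag{14} $$
where $\mathrm{Det}[A]$ means the determinant of a matrix $A$. An equivalent formula is
$$ (\mathrm{COVRATIO})_i = \left( \frac{n - p - 1}{n - p} + \frac{d_i^{*2}}{n - p} \right)^{-p} (1 - h_{ii})^{-1}. \tag{15} $$
The suggested criterion here is
$$ |(\mathrm{COVRATIO})_i - 1| > \frac{3p}{n}. \tag{16} $$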
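The equivalence of (14) and (15) can be confirmed numerically. The sketch below (simulated data, illustrative names) computes the determinant ratio by refitting, using the fact that `vcov()` of an `lm` fit returns $s^2 (X^T X)^{-1}$, and compares it with the closed form and with R's built-in `covratio()`.

```r
set.seed(6)
n <- 30
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 + dat$x1 + dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)

p      <- length(coef(fit))
h      <- hatvalues(fit)
d_star <- rstudent(fit)

# (14): ratio of determinants of the estimated covariance matrices
cov_det <- vapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = dat[-i, ])
  det(vcov(fit_i)) / det(vcov(fit))
}, numeric(1))

# (15): closed form using the externally studentized residual and leverage
cov_closed <- ((n - p - 1) / (n - p) + d_star^2 / (n - p))^(-p) / (1 - h)

all.equal(cov_det, unname(cov_closed))
all.equal(cov_closed, covratio(fit))

# (16): flag observations whose COVRATIO is far from 1
which(abs(cov_closed - 1) > 3 * p / n)
```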

Examples

Tree data: (16) gives the critical values of COVRATIO as 0.710 and 1.290. At the lower end of the range we have only 0.6882 (row 15), i.e. variances are decreased by omitting this observation. At the upper end there are several offenders (rows 1, 2, 3, 20, 31, with largest value 1.47 in row 20), which seems to point towards observations of high leverage as those whose omission would tend to increase the variances.

Nuclear power data: critical values of COVRATIO are 0.344 and 1.656. Row 7 (0.334) is influential at the lower end, while rows 25 (2.011), 8 (1.788) and 28 (1.716) are the ones with high COVRATIOs. Row 25 has fairly high leverage ($h_{ii} = 0.3491$, which is fourth largest in the data set) but rows 8 and 28 ($h_{ii} = 0.2910$ and 0.2637) do not, so it is hard to see a clear-cut explanation of these.

Graphical methods of assessing influence

Ideas due to Atkinson (1985): refine previous rules of thumb using simulation.

No formal hypothesis test is possible for high leverage, but for the other measures we have seen, it is possible to construct a formal test that the observations are normal. In the case of single deletion residuals, we have seen the exact sampling distribution ($t_{n-p-1}$). However, even this doesn't easily extend to the problem of the largest deletion residual in a sample (multiple testing problem). For DFFITS etc., no exact test seems possible.

As an alternative, use simulation.

Atkinson's idea: use probability plots (normal or half-normal).

Half-normal plot: plot the ordered absolute values of the deletion residuals against the $n$ largest expected order statistics from a normal sample of size $2n + 1$.

As an approximation to the expected value of the $k$th smallest order statistic in a standard normal sample of size $N$, Atkinson used Blom's approximation
$$ z_{(k - 0.375)/(N + 0.25)} $$
where $z$ is the inverse of the standard normal c.d.f. This is slightly different from the formula $z_{(k - 0.5)/N}$ which was proposed in Section 2.6.1, though it makes very little difference in practice which form is used. We follow Atkinson's usage here.
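The plotting positions for the half-normal plot can be sketched as follows. The function name `halfnormal_scores` is hypothetical (it is not one of the course functions), and the indexing reflects one reading of the description above: with $N = 2n + 1$, the $n$ largest order statistics are those with ranks $n + 2, \ldots, 2n + 1$.

```r
# Expected half-normal scores via Blom's approximation.
# Hypothetical helper, written for illustration only.
halfnormal_scores <- function(n) {
  N <- 2 * n + 1
  k <- (n + 2):N                       # ranks of the n largest order statistics
  qnorm((k - 0.375) / (N + 0.25))      # Blom's approximation
}

# Example: half-normal plot of deletion residuals for a simulated fit
set.seed(7)
x <- rnorm(20)
y <- 1 + x + rnorm(20)
fit <- lm(y ~ x)
scores <- halfnormal_scores(20)
plot(scores, sort(abs(rstudent(fit))),
     xlab = "Expected Value", ylab = "Observed Value")
```

With this choice the middle order statistic (rank $n + 1$) has expected value exactly zero under Blom's formula, so the $n$ scores retained are all positive.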

Examples

The circles in Figure 4.2(a) show a half-normal plot for the deletion residuals for the tree data, and Figure 4.3(a) shows the same thing for the nuclear power data. In each case the plot is reasonably close to a straight line, and even tends to flatten off at the right hand end. Even with no other means of assessment, this would suggest that the largest residuals are not excessive outliers.

The same idea can be tried for the other influence measures. The circles in Figures 4.2(b) and 4.3(b) show a half-normal plot of the values of DFFITS for the tree and nuclear power data respectively. In the case of the tree data, the plot again appears close to a straight line, but with the nuclear power data it is obvious that the largest two values are behaving differently from the rest.

Figure 4.2. Simulation envelope plots for tree data (observed value against expected value). (a) Deletion residual, (b) DFFITS.

Figure 4.3. Simulation envelope plots for nuclear power data (observed value against expected value). (a) Deletion residual, (b) DFFITS.

Assessing significance

(Described for deletion residuals, but the same idea applies to the other measures.)

1. Generate a simulated sample of $n$ standard normal random variables, and calculate the deletion residuals based on that sample. OK to take $\beta = 0$ and $\sigma^2 = 1$ for this simulation.
2. Order the absolute values of the deletion residuals to obtain a simulated sample of order statistics.
3. Repeat this whole procedure $m$ times. For each $i$ between 1 and $n$, find the 5% largest and smallest values of the $m$ simulations for that value of $i$. Mark these to obtain approximate confidence limits for that value on the plot.
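The three steps above can be sketched in R as follows. This is not the course program; the function name `sim_envelope`, its arguments, and the use of the 5% and 95% empirical quantiles as the envelope limits are assumptions made for illustration. Externally studentized residuals from `rstudent()` are used as the deletion residuals.

```r
# Simulation envelope for ordered absolute deletion residuals.
# X: matrix of covariates (without intercept column); m: number of simulations.
sim_envelope <- function(X, m = 200, probs = c(0.05, 0.95)) {
  n <- nrow(X)
  sims <- replicate(m, {
    y_sim <- rnorm(n)                      # step 1: beta = 0, sigma^2 = 1
    sort(abs(rstudent(lm(y_sim ~ X))))     # step 2: ordered absolute values
  })
  # sims is n x m; row k holds the m simulated values of the k-th order statistic
  t(apply(sims, 1, quantile, probs = probs))   # step 3: pointwise limits
}

# Example on a simulated design
set.seed(8)
X <- cbind(x1 = rnorm(20), x2 = rnorm(20))
env <- sim_envelope(X, m = 200)
head(env)
```

The two columns of `env` can be added to a half-normal plot with `lines()` to give lower and upper envelope curves; replacing `rstudent()` by `dffits()` gives the corresponding envelope for DFFITS.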

Results

For the tree data (Fig. 4.2), neither of the plots (deletion residuals or DFFITS) strays outside the envelope. No serious outliers or influential points in this data set.

Fig. 4.3(a): same message for deletion residuals with the nuclear power data. But Fig. 4.3(b) is different. There really is a problem with the influence of the two largest data points.

Note that this procedure did not adjust for multiple comparison, which is possible in principle but more computationally expensive (e.g., for a fixed envelope one can evaluate by simulation the probability that the largest DFFITS is outside the envelope).

Remedial measures

First question: is it a genuine error (e.g. wrong data)?

Even if not, consider deleting the observation and refitting the model, but there is a danger of overdoing this.

Tree data

There is a group of three suspect observations: consider deleting them all at once.

New estimates of $\beta_2$ and $\beta_3$ become 1.9521 and 1.2503, with standard errors .0584 and .1654 (compare old estimates 1.9825, 1.1166, SEs .0750, .2044). Parameter estimates are not significantly different.

$s$ is reduced from .0814 to .0625. Question whether it is valid to quote the lower value.

The F statistic for the test $\beta_2 = 2$, $\beta_3 = 1$ is now 1.14, compared with 0.17 last time, but this is still nowhere near significant.

These calculations show that the three suspect observations do not substantially affect the main questions of interest, and there therefore seems no reason to remove them from the data.

Nuclear power data

The nuclear power data were refitted without the influential data points in rows 19 and 26. Use the same model as before. The fitted regression model then becomes
$$ \mathrm{LC} = -13.510 + 0.218\,\mathrm{D} + 0.689\,\mathrm{LS} + 0.220\,\mathrm{NE} + 0.197\,\mathrm{CT} - 0.044\,\mathrm{LN} - 0.232\,\mathrm{PT} + \epsilon $$
with $s = 0.137$. Standard errors are 3.537 for the intercept, and 0.047, 0.119, 0.065, 0.056, 0.042 and 0.104 for the six coefficients.

Most of the coefficients are about the same size as before, the largest differences relative to their standard errors being in CT and LN. Indeed, according to the present model the coefficient of LN is not significant and could be dropped from the model.

The other main thing to note, as with the previous example, is that the residual standard deviation $s$ is noticeably smaller.

We can repeat most of the analyses tried the first time, with similar results. As an example, Figure 4.4(a) shows a plot of residuals against fitted values, and Figure 4.4(b) a normal probability plot of the (internally) standardized residuals.

Figure 4.4. Residual plots for nuclear power data with rows 19 and 26 deleted. (a) Residuals against fitted values. (b) Normal probability plot for standardized residuals (observed value against expected value).

Figure 4.5. Simulation envelope plots for nuclear power data with rows 19 and 26 deleted (observed value against expected value). (a) Deletion residual, (b) DFFITS.

Normal probability plot again seems OK.

Residuals vs. fitted values plot seems more random than before.

A plot of residuals against PT (not shown) shows even stronger evidence that the variances are different with the two values of PT.

Do we need to delete more observations? When diagnostics are recomputed there are still questions about some observations, but Fig. 4.5 does not suggest more deletions are needed.

In conclusion, for this data set there does indeed seem to be a case that the two most influential observations are distorting the whole analysis and should be omitted, but there do not seem to be any further instances for which intervention is needed.

Calculations in R

In R, Cook's D statistics are available by plotting lm objects:

    nuk.lm <- lm(lc ~ d + ls + ne + ct + ln + pt)
    plot(nuk.lm, which = 4)   # which = 4 selects the Cook's distance plot

Some diagnostics are available using the function lm.influence:

    nuk.inf <- lm.influence(nuk.lm)

For example, nuk.inf$coefficients contains, for each observation in turn, the change in the regression coefficients that results from deleting that observation; nuk.inf$sigma gives the $s_{(i)}$ values, and nuk.inf$hat contains the diagonal entries of the hat matrix.

Calculations in R (continued)

Further diagnostics can be calculated from these. dfbetas, dffits, stanres and studres can be computed using functions in diagnose.r at the course website. These can also be incorporated into a simulation to create simulation-based diagnostics (e.g. program dnsim.r on the course web page).

    source("diagnose.r")      # load the course diagnostic functions
    plot(stanres(nuk.lm))
    plot(studres(nuk.lm))
    plot(dffits(nuk.lm))
    dfbetas(nuk.lm)

    source("dnsim.r")
    attach(nukes)
    lc <- log(c)              # log-transformed variables used in the model
    ls <- log(s)
    ln <- log(n)
    par(mfrow = c(1, 2))      # two plots side by side
    dnsim(lc, cbind(d, ls, ne, ct, ln, pt), nsim = 1000)
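If the course files are not to hand, base R's stats package provides functions covering the same diagnostics. The sketch below uses R's built-in `trees` data set with a log-log model; it is only an assumption that this resembles the tree data and model of Chapter 3, so the numbers need not agree exactly with those quoted in this chapter.

```r
# Built-in data set; assumed (not confirmed) to resemble the chapter's tree data
fit <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)
n <- nrow(trees)
p <- length(coef(fit))

diag_tab <- data.frame(
  hat      = hatvalues(fit),        # leverages h_ii
  stanres  = rstandard(fit),        # internally standardized residuals
  studres  = rstudent(fit),         # externally studentized residuals
  dffits   = dffits(fit),
  cook     = cooks.distance(fit),
  covratio = covratio(fit)
)

# Observations with the largest |DFFITS|
head(diag_tab[order(-abs(diag_tab$dffits)), ], 3)

# Rules of thumb from this chapter
which(diag_tab$hat > 2 * p / n)
which(abs(diag_tab$dffits) > 2 * sqrt(p / n))
which(abs(diag_tab$covratio - 1) > 3 * p / n)

dfbetas(fit)[abs(dffits(fit)) > 2 * sqrt(p / n), , drop = FALSE]
```

`influence.measures(fit)` collects most of these quantities in a single call and marks the cases it regards as influential, using its own cutoffs rather than the ones given here.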