CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS


Influential observations are observations whose presence in the data can have a distorting effect on the parameter estimates and possibly on the entire analysis, e.g. by leading us to identify the wrong model. They are distinct from outliers, though it is possible for one observation to be both influential and an outlier:

1. Outliers are data points that contain unusual dependent (y) values.
2. Outlying independent (x) values do not indicate lack of fit of the model, but some observations still influence the fit more than others.

Detection: in simple linear regression, influential observations are usually easy to spot from plots of the data, but in multiple regression more formal measures are required.

[Figure 4.1: three least squares lines fitted to sample data (y against x), where the observation at x = 8 is allowed to move between the three points A, B and C. The corresponding least squares fits are the solid, dashed and dotted lines respectively.]

The hat matrix

Recall Ŷ = HY with H = X(X^T X)^{-1} X^T, so the covariance matrix of Ŷ is Var{Ŷ} = Hσ². The variance of ŷ_i is h_ii σ², and the variance of the ith residual e_i is (1 − h_ii)σ². Properties of the {h_ii} values include

  0 ≤ h_ii ≤ 1 for all i,   (1)
  Σ_i h_ii = p.   (2)

Property (1) follows simply from the fact that both h_ii σ² and (1 − h_ii)σ² are the variances of random quantities, and therefore are nonnegative. For property (2), note that tr(H) = tr{X(X^T X)^{-1} X^T} = tr{(X^T X)^{-1} X^T X} = tr(I_p) = p.
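As a quick numerical check of properties (1) and (2) — an illustrative Python/NumPy sketch on a made-up design matrix, not part of the course's R code:

```python
import numpy as np

# Made-up design matrix: intercept column plus two random covariates.
rng = np.random.default_rng(0)
n, p = 10, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Hat matrix H = X (X^T X)^{-1} X^T and its diagonal entries h_ii.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

print(h.min() >= 0 and h.max() <= 1)  # property (1): 0 <= h_ii <= 1
print(np.isclose(h.sum(), p))         # property (2): sum of h_ii = tr(H) = p
```

Both checks hold for any full-rank design matrix, since H is a projection matrix of rank p.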

Leverage

A data point with large h_ii is called a point of high leverage. How high is high? By (2), the average value of h_ii is p/n. A standard criterion is to call any data point for which h_ii > 2p/n a point of high leverage. Note that since h_ii is a function of X only, it has no distribution, and thus there is no formal test.

Example: Consider the artificial data of Fig. 4.1. The twelve x values here are 0, 0.2, 0.4, ..., 1.8, 2, 8. The corresponding h_ii values are 0.1342, 0.1221, ..., 0.0869, 0.9182. The last observation, corresponding to x = 8, is clearly highly influential. Intuitively, this is because if this point is moved up or down, the least squares straight line will tend to follow it: the overall least squares fit on the other 11 observations is not much affected by modest changes in the slope of the fitted straight line, but the fit at x = 8 has a big influence. Note that this has nothing to do with y_12 possibly being an outlier, since for any i, the actual value of y_i does not even enter into the calculation of h_ii.
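These leverage values can be reproduced directly; the sketch below (Python/NumPy rather than the course's R) uses no y values at all, since h_ii depends only on X:

```python
import numpy as np

# Simple linear regression design with the twelve x values
# 0, 0.2, ..., 1.8, 2, 8 from the artificial data of Fig. 4.1.
x = np.append(np.linspace(0.0, 2.0, 11), 8.0)
X = np.column_stack([np.ones(len(x)), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

print(round(float(h[0]), 4))   # 0.1342
print(round(float(h[-1]), 4))  # 0.9182, far above the 2p/n = 4/12 cutoff
```

The observation at x = 8 carries almost all the leverage, exactly as the plot suggests.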

Real data examples from Chapter 3

Tree data: the highest h_ii value is h_20 = 0.2428 (diameter = 13.8, height = 64); this is not extreme for either independent variable, but does correspond to a fairly large diameter combined with the second smallest height. The next three largest values of h_ii are h_3 = 0.1975, h_31 = 0.1803 and h_2 = 0.1672. In this case p = 3 and n = 31, so according to the 2p/n = 0.1935 criterion, observations 3 and 20 are influential. This draws attention to two observations which would not necessarily be identified as influential from an initial inspection of the data.

[Figure: plot of tree diameter against height.]

Real data examples from Chapter 3

Nuclear power data: 2p/n = 14/32 = 0.4375. The largest values of h_ii are in rows 26 (0.4126), 19 (0.3614) and 22 (0.3526). These are not especially large, but it turns out later that these observations are influential.

> nuk.inf <- lm.influence(nuk.lm)
> print(nuk.inf$hat, digits=2)
    1     2     3     4     5     6     7     8     9
0.221 0.263 0.263 0.242 0.242 0.282 0.143 0.291 0.114
   10    11    12    13    14    15    16    17    18
0.262 0.155 0.165 0.189 0.130 0.197 0.098 0.179 0.189
   19    20    21    22    23    24    25    26    27
0.361 0.137 0.198 0.353 0.189 0.220 0.349 0.413 0.182
   28    29    30    31    32
0.264 0.176 0.176 0.182 0.176

Deletion diagnostics

Recall Var(e_i) = (1 − h_ii)σ². This suggests that after estimating σ² by the mean squared error s², we may then define

  e*_i = e_i / (s √(1 − h_ii))   (3)

as a standardized form of residual. We call (3) the internally standardized residual. (Also known as the studentized residual.)

This does not take into account the influence of an outlier on the parameter estimates. We could do that via the deletion residual

  d_i = y_i − ŷ_{i(i)}.   (4)

The subscript (i) means that the model is refitted without the ith observation; thus ŷ_{i(i)} is the predicted value of y_i based on the model fit in which observation i is omitted. A convenient formula is

  d_i = e_i / (1 − h_ii).   (5)

Since Var(e_i) = σ²(1 − h_ii), it follows that Var(d_i) = σ²/(1 − h_ii), and we can estimate this by

  s²(d_i) = s²_(i) / (1 − h_ii),   (6)

in which s²_(i) denotes the estimated mean squared error with the ith observation deleted.

Both y_i and ŷ_{i(i)} are statistically independent of s²_(i), so d_i and s²_(i) are independent, and

  d_i / s(d_i) ~ t_{n−p−1}.   (7)

This suggests that we define

  d*_i = d_i / s(d_i)   (8)

as an externally studentized residual. We can calculate s²_(i) from

  (n − p)s² = (n − p − 1)s²_(i) + e²_i / (1 − h_ii).   (9)

Combining these formulae leads to

  d*_i = e_i [ (n − p − 1) / ((1 − h_ii)(n − p)s² − e²_i) ]^{1/2}.   (10)
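Formulas (3) and (10) can be checked numerically; the hedged Python/NumPy sketch below uses made-up data and cross-checks the shortcut (10) against an explicit leave-one-out refit:

```python
import numpy as np

# Made-up simple regression data.
rng = np.random.default_rng(1)
n, p = 15, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)                                  # mean squared error s^2

e_std = e / np.sqrt(s2 * (1 - h))                     # eq. (3), internal
d_std = e * np.sqrt((n - p - 1) /
                    ((1 - h) * (n - p) * s2 - e**2))  # eq. (10), external

# Explicit deletion check for i = 0: refit without the observation,
# then use d*_i = e_i / sqrt((1 - h_ii) s^2_(i)).
i = 0
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
beta_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]
s2_i = np.sum((yi - Xi @ beta_i) ** 2) / (n - 1 - p)  # s^2_(i)
print(np.isclose(d_std[i], e[i] / np.sqrt((1 - h[i]) * s2_i)))  # True
```

The agreement follows from substituting (9) into the definition (8), which is exactly how (10) is derived.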

Examples

Tree data: recall the apparent outliers in observations 15, 16 and 18. The internally standardized residuals are −2.109, −1.834 and −2.162; the externally studentized residuals are −2.258, −1.919 and −2.326. Since t_{27,.975} = 2.052, a formal test of fit (at the 5% level of significance) would reject observations 15 and 18 as outliers. The largest positive externally studentized residual is 1.703, definitely not significant. One could go on to delete all three; this is discussed in the text.

Nuclear power data: the largest internally standardized residuals in magnitude are +2.275 in row 19, −2.220 in row 7, −2.052 in row 26 and 1.815 in row 12. When externally studentized these become 2.503, −2.427, −2.205 and 1.908. For reference, t_{25,.975} = 2.060 and t_{25,.995} = 2.787.

Measures of influence

1. DFFITS

Detects influence on the fitted values:

  (DFFITS)_i = (ŷ_i − ŷ_{i(i)}) / (s_(i) √h_ii) = d*_i √(h_ii / (1 − h_ii)).   (11)

Rationale: (11) is a standardized formula for examining the difference between ŷ_i and ŷ_{i(i)}. However, the second equality in (11) shows that it is equivalent to a rescaled form of the externally studentized residual, where the rescaling depends on the leverage of the ith data point. Thus DFFITS may be thought of as a combined measure of influence that takes into account the leverage of the data point as well as the size of the residual.

Rule of thumb: an observation is influential if |DFFITS| is greater than 1 in the case of small data sets, or 2√(p/n) for large data sets.
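The two forms in (11) can be verified against each other; an illustrative Python/NumPy sketch on made-up data (the course itself uses R):

```python
import numpy as np

# Made-up simple regression data.
rng = np.random.default_rng(4)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
yhat = H @ y
e = y - yhat
s2 = e @ e / (n - p)

# Externally studentized residuals, eq. (10), then DFFITS via the
# second (shortcut) form in eq. (11).
d_std = e * np.sqrt((n - p - 1) / ((1 - h) * (n - p) * s2 - e**2))
dffits = d_std * np.sqrt(h / (1 - h))

# Defining form for i = 0: (yhat_i - yhat_{i(i)}) / (s_(i) sqrt(h_ii)),
# with an explicit refit omitting observation i.
i = 0
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
beta_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]
s2_i = np.sum((yi - Xi @ beta_i) ** 2) / (n - 1 - p)
defn = (yhat[i] - X[i] @ beta_i) / np.sqrt(s2_i * h[i])
print(np.isclose(dffits[i], defn))  # True
```

The equality rests on the identity ŷ_i − ŷ_{i(i)} = h_ii e_i / (1 − h_ii).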

Examples

Tree data: 2√(p/n) = 0.6222; the only influential value is observation 18, with DFFITS = −0.8811, which combines an outlier with moderate leverage (h_ii = 0.1255).

Nuclear power data: 2√(p/n) = 0.9354 is easily exceeded in magnitude by the DFFITS values for rows 19 (1.8830) and 26 (−1.8481), and to a lesser extent row 7 (−0.9908). Since we have already seen that these three rows have the largest residuals in magnitude, and that rows 26 and 19 are the ones with the highest leverage, these results are scarcely surprising.

2. DFBETAS

Intended to measure the influence of an observation on the parameter estimates. If we were to test the hypothesis H_0: β_k = β_k0 for the kth parameter, where β_k0 is a given numerical value, then a suitable test statistic would be

  (β̂_k − β_k0) / (s √c_kk),

where c_kk is the kth diagonal entry of (X^T X)^{-1}. This statistic has a t_{n−p} distribution. Motivated by this, we define

  (DFBETAS)_{k(i)} = (β̂_k − β̂_{k(i)}) / (s_(i) √c_kk).   (12)

Rule of thumb: |DFBETAS| > 1 for a small data set, or 2/√n for a large data set.
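Definition (12) translates directly into code; the sketch below (illustrative Python/NumPy on made-up data) computes the full n × p matrix of DFBETAS by explicit deletion:

```python
import numpy as np

# Made-up regression data with p = 3 parameters.
rng = np.random.default_rng(3)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
c = np.diag(XtX_inv)            # c_kk, kth diagonal entry of (X^T X)^{-1}

dfbetas = np.empty((n, p))
for i in range(n):
    # Refit with observation i deleted, then apply eq. (12).
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    beta_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    s2_i = np.sum((yi - Xi @ beta_i) ** 2) / (n - 1 - p)
    dfbetas[i] = (beta - beta_i) / np.sqrt(s2_i * c)

# Large-sample rule of thumb: flag entries exceeding 2/sqrt(n).
flag = np.abs(dfbetas) > 2 / np.sqrt(n)
print(dfbetas.shape)  # (20, 3)
```

In practice the explicit loop is avoided via updating formulas (as R's dfbetas does), but the loop makes the definition transparent.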

Examples

Tree data: 2/√n = 0.3592; the offending values are 0.7571 (i = 18, k = 3), 0.7450 (i = 18, k = 1), 0.4930 (i = 17, k = 3) and 0.4770 (i = 17, k = 1). Is row 17 causing trouble as well as row 18?

Nuclear power data: 2/√n = 0.3535 is exceeded by several values for rows 7, 19 and 26 (largest overall value: 1.4899 for the LN coefficient in row 19). There are also three values in the 0.5–0.7 range for row 22.

3. Cook's D statistic

An overall measure of the influence of the ith observation on all the parameter estimates. If we want to test H_0: β = β_0 for a given vector β_0, then when H_0 is true,

  (β̂ − β_0)^T X^T X (β̂ − β_0) / (p s²) ~ F_{p,n−p}.

Motivated by this, Cook defined

  D_i = (β̂ − β̂_(i))^T X^T X (β̂ − β̂_(i)) / (p s²) = (e²_i / (p s²)) · h_ii / (1 − h_ii)².   (13)

Identify the ith observation as influential if D_i is greater than the 10% point of the F_{p,n−p} distribution, and highly influential if it is greater than the 50% point.
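The two forms of D_i in (13) can be checked against each other; an illustrative Python/NumPy sketch on made-up data:

```python
import numpy as np

# Made-up simple regression data.
rng = np.random.default_rng(5)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

XtX = X.T @ X
beta = np.linalg.solve(XtX, X.T @ y)
e = y - X @ beta
h = np.diag(X @ np.linalg.solve(XtX, X.T))
s2 = e @ e / (n - p)

# Shortcut form of eq. (13).
D = e**2 * h / (p * s2 * (1 - h) ** 2)

# Direct quadratic-form version for i = 0, via an explicit refit.
i = 0
beta_i = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(y, i),
                         rcond=None)[0]
diff = beta - beta_i
print(np.isclose(D[i], diff @ XtX @ diff / (p * s2)))  # True
```

The agreement follows from the identity β̂ − β̂_(i) = (X^T X)^{-1} x_i e_i / (1 − h_ii).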

Examples

Tree data: the largest value of Cook's D is 0.224 in row 18, followed by 0.106 in row 17. For the F_{3,28} distribution, the 10% point is 0.193 and the 50% point is 0.81. Again row 18 stands out.

Nuclear power data: D = 0.423 in row 26 and 0.418 in row 19. The 10% point of F_{7,25} is 0.388 and the 50% point is 0.93.

The modified Cook statistic

From the form of (13) in comparison with (11) and (12), it is natural to ask why, in (13), we did not use s_(i) in place of s. In fact Atkinson (1985) took this point of view to define a modified Cook statistic which turns out, after scaling by a constant, to be equivalent to DFFITS. Thus if Atkinson's point of view is adopted, there seems no need to consider Cook's statistic as a separate diagnostic, since all the relevant information is in DFFITS. Our examples rather support this point of view, since it appears that Cook's D is downgrading the evidence of influence in the case of some observations which seemed highly influential when judged by the earlier diagnostics.

4. COVRATIO

Measures the effect of deletion on the variances of the parameter estimates:

  (COVRATIO)_i = Det[{X^T_(i) X_(i)}^{-1} s²_(i)] / Det[(X^T X)^{-1} s²],   (14)

where Det[A] means the determinant of a matrix A. An equivalent formula is

  (COVRATIO)_i = [ (n − p − 1)/(n − p) + d*²_i/(n − p) ]^{−p} (1 − h_ii)^{−1}.   (15)

The suggested criterion here is

  |COVRATIO − 1| > 3p/n.   (16)
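The equivalence of the determinant definition (14) and the closed form (15) can be confirmed numerically; an illustrative Python/NumPy sketch on made-up data:

```python
import numpy as np

# Made-up simple regression data.
rng = np.random.default_rng(6)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
d_std = e * np.sqrt((n - p - 1) / ((1 - h) * (n - p) * s2 - e**2))  # eq. (10)

# Deleted-observation quantities for i = 0.
i = 0
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
beta_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]
s2_i = np.sum((yi - Xi @ beta_i) ** 2) / (n - 1 - p)

# Determinant definition, eq. (14).
cov_def = (np.linalg.det(np.linalg.inv(Xi.T @ Xi) * s2_i)
           / np.linalg.det(np.linalg.inv(X.T @ X) * s2))

# Closed form, eq. (15).
cov_closed = ((n - p - 1 + d_std[i] ** 2) / (n - p)) ** (-p) / (1 - h[i])
print(np.isclose(cov_def, cov_closed))  # True
```

The link between the two is the determinant identity Det(X^T_(i) X_(i)) = (1 − h_ii) Det(X^T X) together with (9).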

Examples

Tree data: (16) gives the critical values of COVRATIO as 0.710 and 1.290. At the lower end of the range we have only 0.6882 (row 15), i.e. the variances are decreased by omitting this observation. At the upper end there are several offenders (rows 1, 2, 3, 20, 31, with largest value 1.47 in row 20), which seems to point towards observations of high leverage as those whose omission would tend to increase the variances.

Nuclear power data: the critical values of COVRATIO are 0.344 and 1.656. Row 7 (0.334) is influential at the lower end, while rows 25 (2.011), 8 (1.788) and 28 (1.716) are the ones with high COVRATIOs. Row 25 has fairly high leverage (h_ii = 0.3491, fourth largest in the data set) but rows 8 and 28 (h_ii = 0.2910 and 0.2637) do not, so it is hard to see a clear-cut explanation of these.

Graphical methods of assessing influence

Ideas due to Atkinson (1985): refine the previous rules of thumb using simulation. No formal hypothesis test is possible for high leverage, but for the other measures we have seen, it is possible to construct a formal test that the observations are normal. In the case of single deletion residuals, we have seen the exact sampling distribution (t_{n−p−1}). However, even this does not easily extend to the problem of the largest deletion residual in a sample (a multiple testing problem). For DFFITS etc., no exact test seems possible. As an alternative, use simulation.

Atkinson's idea: use probability plots (normal or half-normal). Half-normal plot: plot the ordered absolute values of the deletion residuals against the n largest expected order statistics from a normal sample of size 2n + 1. As an approximation to the expected value of the kth largest order statistic in a standard normal sample of size N, Atkinson used Blom's approximation z_{(k−0.375)/(N+0.25)}, where z_q denotes the qth quantile of the standard normal distribution (the inverse of the standard normal c.d.f.). This is slightly different from the formula z_{(k−0.5)/N} which was proposed in Section 2.6.1, though it makes very little difference in practice which form is used. We follow Atkinson's usage here.
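The plotting positions just described can be sketched with the standard library alone (an illustrative Python translation; the sample size n = 31 is borrowed from the tree data):

```python
from statistics import NormalDist

# Half-normal plotting positions: the n largest expected order
# statistics of a standard normal sample of size N = 2n + 1, using
# Blom's approximation Phi^{-1}((k - 0.375)/(N + 0.25)).
n = 31
N = 2 * n + 1
inv_cdf = NormalDist().inv_cdf
positions = [inv_cdf((k - 0.375) / (N + 0.25)) for k in range(N - n + 1, N + 1)]

print(len(positions))                  # 31 positions, one per observation
print(all(q > 0 for q in positions))   # all positive: half-normal scale
```

The ordered |deletion residuals| (or |DFFITS| values) are then plotted against these positions.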

Examples

The circles in Figure 4.2(a) show a half-normal plot of the deletion residuals for the tree data, and Figure 4.3(a) shows the same thing for the nuclear power data. In each case the plot is reasonably close to a straight line, and even tends to flatten off at the right-hand end. Even with no other means of assessment, this would suggest that the largest residuals are not excessive outliers.

The same idea can be tried for the other influence measures. The circles in Figures 4.2(b) and 4.3(b) show half-normal plots of the DFFITS values for the tree and nuclear power data respectively. In the case of the tree data, the plot again appears close to a straight line, but with the nuclear power data it is obvious that the largest two values are behaving differently from the rest.

[Figure 4.2: simulation envelope plots for tree data, observed against expected values. (a) Deletion residuals, (b) DFFITS.]

[Figure 4.3: simulation envelope plots for nuclear power data, observed against expected values. (a) Deletion residuals, (b) DFFITS.]

Assessing significance

(Described for deletion residuals, but the same idea applies to the other measures.)

1. Generate a simulated sample of n standard normal random variables, and calculate the deletion residuals based on that sample. It is OK to take β = 0 and σ² = 1 for this simulation.
2. Order the absolute values of the deletion residuals to obtain a simulated sample of order statistics.
3. Repeat this whole procedure m times. For each i between 1 and n, find the 5% largest and smallest values of the m simulations for that value of i. Mark these to obtain approximate confidence limits for that value on the plot.
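The three steps above can be sketched as follows (an illustrative Python/NumPy version with a made-up fixed design; β = 0 and σ² = 1 are legitimate here because the deletion residuals do not depend on them):

```python
import numpy as np

def deletion_residuals(X, y):
    """Externally studentized (deletion) residuals via eq. (10)."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - p)
    return e * np.sqrt((n - p - 1) / ((1 - h) * (n - p) * s2 - e**2))

rng = np.random.default_rng(2)
n, p, m = 20, 2, 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design

# Steps 1-2, repeated m times (step 3): simulate y ~ N(0, I), compute
# the deletion residuals, and sort their absolute values.
sims = np.sort(np.abs(np.array(
    [deletion_residuals(X, rng.normal(size=n)) for _ in range(m)])), axis=1)

# Envelope limits at each rank i, approximating the 5% extremes.
lower = np.quantile(sims, 0.05, axis=0)
upper = np.quantile(sims, 0.95, axis=0)
```

Plotting (lower, upper) against the half-normal positions gives the envelopes shown as bands in Figures 4.2 and 4.3.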

Results

For the tree data (Fig. 4.2), neither of the plots (deletion residuals or DFFITS) strays outside the envelope: there are no serious outliers or influential points in this data set.

Fig. 4.3(a) carries the same message for the deletion residuals with the nuclear power data. But Fig. 4.3(b) is different: there really is a problem with the influence of the two largest data points.

Note that this procedure did not adjust for multiple comparisons, which is possible in principle but more computationally expensive (e.g., for a fixed envelope one can evaluate by simulation the probability that the largest DFFITS value falls outside the envelope).

Remedial measures

The first question: is it a genuine error (e.g. wrong data)? Even if not, consider deleting the observation and refitting the model, though there is a danger of overdoing this.

Tree data

There is a group of three suspect observations; consider deleting them all at once. The new estimates of β_2 and β_3 become 1.9521 and 1.2503, with standard errors 0.0584 and 0.1654 (compare the old estimates 1.9825 and 1.1166, with SEs 0.0750 and 0.2044). The parameter estimates are not significantly different. s is reduced from 0.0814 to 0.0625; it is questionable whether it is valid to quote the lower value. The F statistic for the test β_2 = 2, β_3 = 1 is now 1.14, compared with 0.17 last time, but this is still nowhere near significant. These calculations show that the three suspect observations do not substantially affect the main questions of interest, and there therefore seems no reason to remove them from the data.

Nuclear power data

The nuclear power data were refitted without the influential data points in rows 19 and 26, using the same model as before. The fitted regression model then becomes

  LC = −13.510 + 0.218D + 0.689LS + 0.220NE + 0.197CT − 0.044LN − 0.232PT + ɛ

with s = 0.137. The standard errors are 3.537 for the intercept, and 0.047, 0.119, 0.065, 0.056, 0.042 and 0.104 for the six coefficients. Most of the coefficients are about the same size as before, the largest differences relative to their standard errors being in CT and LN. Indeed, according to the present model the coefficient of LN is not significant and could be dropped from the model. The other main thing to note, as with the previous example, is that the residual standard deviation s is noticeably smaller.

We can repeat most of the analyses tried the first time, with similar results. As an example, Figure 4.4(a) shows a plot of residuals against fitted values, and Figure 4.4(b) a normal probability plot of the (internally) standardized residuals.

[Figure 4.4: residual plots for nuclear power data with rows 19 and 26 deleted. (a) Residuals against fitted values. (b) Normal probability plot for standardized residuals.]

[Figure 4.5: simulation envelope plots for nuclear power data with rows 19 and 26 deleted. (a) Deletion residuals, (b) DFFITS.]

The normal probability plot again seems OK, and the residuals vs. fitted values plot seems more random than before. A plot of residuals against PT (not shown) shows even stronger evidence that the variances are different for the two values of PT.

Do we need to delete more observations? When the diagnostics are recomputed there are still questions about some observations, but Fig. 4.5 does not suggest that more deletions are needed.

In conclusion, for this data set there does indeed seem to be a case that the two most influential observations are distorting the whole analysis and should be omitted, but there do not seem to be any further instances for which intervention is needed.

Calculations in R

In R, Cook's D statistics are available by plotting lm objects:

nuk.lm <- lm(lc ~ d + ls + ne + ct + ln + pt)
plot(nuk.lm, which=4)

Some diagnostics are available using the function lm.influence:

nuk.inf <- lm.influence(nuk.lm)

For example, nuk.inf$coefficients contains the changes in the regression coefficients corresponding to deletion of each observation in turn, nuk.inf$sigma gives the s_(i) values, and nuk.inf$hat contains the diagonal entries of the hat matrix.

Calculations in R

Further diagnostics can be calculated from these. DFBETAS, DFFITS and the standardized/studentized residuals can be computed using the functions dfbetas, dffits, stanres and studres in diagnose.r at the course website. These can also be incorporated into a simulation to create simulation-based diagnostics (e.g. the program dnsim.r on the course web page).

source("diagnose.r")
plot(stanres(nuk.lm))
plot(studres(nuk.lm))
plot(dffits(nuk.lm))
dfbetas(nuk.lm)

source("dnsim.r")
attach(nukes)
lc <- log(c)
ls <- log(s)
ln <- log(n)
par(mfrow=c(1,2))
dnsim(lc, cbind(d,ls,ne,ct,ln,pt), nsim=1000)