Detecting and Assessing Data Outliers and Leverage Points


1 Chapter 9 Detecting and Assessing Data Outliers and Leverage Points

2 Section 9.1 Background

3 Background Because OLS estimators arise from minimizing the sum of squared errors, large residuals (outliers) can exert considerable influence on parameter estimates, estimates of standard errors, and model prediction. Observations on the explanatory variables also may become more influential as they move farther from their respective means. These observations do not involve the dependent (endogenous) variable in the econometric analysis, and they are labeled leverage points. Leverage points, too, can exert undue influence on parameter estimates, estimates of standard errors, and model prediction. Influential points, by definition, are those observations that qualify as both outliers and leverage points.

4 Influence diagnostics relate only to the observations or data points indigenous to the econometric model. Concern: particular observations (outliers/leverage points) exert undue influence on regression results. Influential observations may be legitimate, or they may be errors in measurement; if they correspond to errors in measurement, obtain the corrected data. Influential observations (if legitimate) may cast doubt either on the validity of the observation or on the general adequacy of the model.

5 Section 9.2 Influence Diagnostics

6 Influence Diagnostics
Nature of Problem: Outliers; Leverage Points
Identification: Studentized Residuals; Hat Matrix; DFFITS; DFBETAS; Cook's D Statistic; COVRATIO; Robust Regression

7 Influence Diagnostics
(1) Residuals: e_i = y_i − ŷ_i
(2) Elements of the hat matrix H = X(XᵀX)⁻¹Xᵀ. The hat matrix comes from Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = HY.

8 Hat diagonal elements (leverage points):
h_ii = X_iᵀ(XᵀX)⁻¹X_i
This provides a measure of the standardized distance from the point X_i to the center of the X data. These diagonal elements highlight observations that are extreme in the X's (data points far from the mean are relatively influential). Leverage is bounded below by 1/n and bounded above by 1; the closer the leverage is to unity, the greater the leverage of the observation.
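The hat-diagonal computation and the two properties just stated (sum of leverages equals p; each h_ii lies between 1/n and 1) can be sketched in a few lines of NumPy. This is a hypothetical illustration on simulated data, not from the chapter (whose computations use SAS):

```python
# Illustrative sketch: hat diagonals h_ii = x_i'(X'X)^(-1)x_i on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3                                  # n observations, p parameters
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix H = X(X'X)^(-1)X'
h = np.diag(H)                                # leverage of each observation

# Properties stated in the text: sum of leverages equals p,
# and each h_ii lies between 1/n and 1.
print(round(h.sum(), 6))                      # -> 3.0
print(bool(np.all((h >= 1 / n - 1e-12) & (h <= 1 + 1e-12))))  # -> True
```

Observations whose h_ii exceeds 2p/n would be flagged as leverage points under the guideline given later in the chapter.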

9 Outlier Detection Outlier detection involves determining whether the residual (actual value − predicted value) is an extreme negative or positive value. But we must take into account the variance of the residuals in detecting outliers to level the playing field.

10 Create Standardized Residuals A standardized residual is a residual divided by its standard deviation:
resid_standardized,i = (y_i − ŷ_i)/s, where s = standard deviation of the residuals.
If the standardized residuals associated with observations of the data set have values in excess of 3.5 or below −3.5, then the corresponding data points are labeled outliers.

11 Studentized Residuals
R-student_i = e_i / (s_(i) √(1 − h_ii))
where s_(i) = standard error with the i-th observation deleted and h_ii = leverage. This statistic is large if one or more of the following conditions holds: (1) e_i is large; (2) h_ii is large; or (3) s²_(i) is small. The R-student statistic is distributed as t with n − p degrees of freedom, where n is the number of observations and p is the number of parameters in the model.
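R-student can be computed without refitting the model n times, via the standard deletion formula s²_(i) = [(n − p)s² − e_i²/(1 − h_ii)]/(n − p − 1). A hypothetical NumPy sketch on simulated data with one planted response outlier (the chapter itself uses SAS):

```python
# Illustrative sketch: externally studentized residuals (R-student).
import numpy as np

rng = np.random.default_rng(1)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[5] += 4.0                                   # plant an outlier in y

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y                                 # OLS residuals e_i = y_i - yhat_i
s2 = e @ e / (n - p)                          # residual variance

# Deleted variance s2_(i) and R-student = e_i / (s_(i) * sqrt(1 - h_ii));
# the deletion formula avoids refitting the model n times.
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
rstudent = e / np.sqrt(s2_del * (1 - h))

outliers = np.flatnonzero(np.abs(rstudent) > 2)
print(outliers)                               # observation 5 is flagged
```

The |R-student| > 2 cutoff used here anticipates the guideline given on the next slides.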

12 To detect potentially high-influence observations, note: (1) data points with large hat diagonals (leverage points); (2) large R-student values (size of residuals with appropriate standardization).
R-student_i = (y_i − ŷ_i) / [s²_(i) (1 − h_ii)]^(1/2)
where s²_(i) is the residual variance after deleting the i-th observation.
For hat diagonals, Σ_{i=1}^{n} h_ii = p, where p is the number of model parameters and n is the number of observations.

13 Guidelines
h_ii > 2p/n: the observation exerts considerable leverage (a conservative rule of thumb is 3p/n).
|R-student| > 2: the observation is an outlier.
To be an influential observation, both criteria must be met.


15 Influence on the predicted (or fitted) value ŷ_i
Diagnostic: DFFITS_i = (ŷ_i − ŷ_(i)) / (s_(i) √h_ii)
DFFITS measures the difference between the result with the i-th observation and without it; the subscript (i) indicates that the i-th observation is not involved in the computation. DFFITS_i represents, for the i-th point, the number of estimated standard errors that the fitted value ŷ_i changes if the i-th point is removed from the data set.
Cutoff: 2 √(p/n)
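DFFITS can be computed for all observations at once through the algebraic identity DFFITS_i = R-student_i · √(h_ii/(1 − h_ii)), which the sketch below checks against the brute-force definition (drop observation i and refit). A hypothetical NumPy illustration on simulated data:

```python
# Illustrative sketch: DFFITS via the deletion identity, verified by refit.
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

def ols(Xm, ym):
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
rstudent = e / np.sqrt(s2_del * (1 - h))
dffits = rstudent * np.sqrt(h / (1 - h))      # identity form of DFFITS

# Brute force for one observation: drop it, refit, compare fitted values.
i = 7
mask = np.arange(n) != i
brute = (X[i] @ ols(X, y) - X[i] @ ols(X[mask], y[mask])) / np.sqrt(s2_del[i] * h[i])
assert abs(brute - dffits[i]) < 1e-8

print(np.flatnonzero(np.abs(dffits) > 2 * np.sqrt(p / n)))  # cutoff 2*sqrt(p/n)
```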

16 Influence on the regression coefficients For each regression coefficient, the influence diagnostics provide a statistic which gives the number of standard errors that the coefficient changes if the i-th observation were set aside:
DFBETAS_(j,i) = (β̂_j − β̂_(j,i)) / (s_(i) √C_jj)
where C_jj is the j-th diagonal element of (XᵀX)⁻¹. A large value of DFBETAS_(j,i) indicates that the i-th observation has a sizable impact on the j-th regression coefficient.
Cutoff: 2/√n

17 Use DFBETAS_(j,i) to ascertain which observations influence specific regression coefficients. The analyst must examine n × p statistics to assess the influence on the regression coefficients. A composite measure of the influence on the set of coefficients is Cook's Distance (Cook's D):
D_i = (β̂ − β̂_(i))ᵀ (XᵀX) (β̂ − β̂_(i)) / (p s²)
Cook's distance represents the standardized distance between the vector of least squares coefficients β̂ and β̂_(i); D_i ≥ 0. Belsley suggests 4/(n − p) as a cutoff.
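Cook's D also has a closed form, D_i = e_i² h_ii / (p s² (1 − h_ii)²), which agrees with the vector definition above. The hypothetical NumPy sketch below verifies the agreement for one observation with a leave-one-out refit:

```python
# Illustrative sketch: Cook's D closed form vs. the leave-one-out definition.
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

XtX = X.T @ X
b = np.linalg.solve(XtX, X.T @ y)             # full-sample OLS coefficients
H = X @ np.linalg.inv(XtX) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)

cooks_d = e**2 * h / (p * s2 * (1 - h) ** 2)  # closed form

# Definition: D_i = (b - b_(i))' X'X (b - b_(i)) / (p * s^2)
i = 4
mask = np.arange(n) != i
b_del = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
d = b - b_del
assert abs(d @ XtX @ d / (p * s2) - cooks_d[i]) < 1e-8

print(np.flatnonzero(cooks_d > 4 / (n - p)))  # Belsley cutoff 4/(n - p)
```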

18 Cutoff for Cook's D = 4/(n − p) = 0.24


21 Influence on the Variance of Regression Coefficients
COVRATIO_i = det[ s²_(i) (X_(i)ᵀ X_(i))⁻¹ ] / det[ s² (XᵀX)⁻¹ ]
The COVRATIO_i statistic measures the change in the determinant of the covariance matrix of the estimates from deleting the i-th observation. If COVRATIO_i > 1, then the i-th point provides an improvement, a reduction in the estimated generalized variance of the coefficients over what would be produced without the data point. If COVRATIO_i < 1, inclusion of the i-th point results in increasing the generalized variance. Belsley, Kuh, and Welsch yardstick (for large samples): the i-th data point exerts an unusual amount of influence on the generalized variance if COVRATIO_i > 1 + (3p/n) or COVRATIO_i < 1 − (3p/n) (if n > 3p).
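COVRATIO can be computed either directly from the two determinants or from the equivalent closed form COVRATIO_i = (s²_(i)/s²)^p / (1 − h_ii). A hypothetical NumPy sketch on simulated data checking that the two agree:

```python
# Illustrative sketch: COVRATIO by determinant ratio vs. its closed form.
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 0.5, 1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # deleted variance

i = 11
mask = np.arange(n) != i
Xd = X[mask]
num = np.linalg.det(s2_del[i] * np.linalg.inv(Xd.T @ Xd))  # det without obs i
den = np.linalg.det(s2 * np.linalg.inv(X.T @ X))           # det with all obs
covratio_det = num / den

covratio_closed = (s2_del[i] / s2) ** p / (1 - h[i])
assert abs(covratio_det - covratio_closed) < 1e-8
print(covratio_det)
```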

22 Cutoff for Cook's D = 4/(22 − 6) = 0.25


24 Examples of the use of influence diagnostics
(1) Uri, N.D. and R. Boyd, "Estimating the Regional Demand for Softwood Lumber in the United States," North Central Journal of Agricultural Economics, 12, 1 (January 1990). In this article, use of: RSTUDENT, HAT DIAGONAL, COVRATIO, DFFITS.
(2) Swinton, S.M. and R.P. King, "Evaluating Robust Regression Techniques for Detrending Crop Yield Data with Nonnormal Errors," American Journal of Agricultural Economics, May (1991). In this article, use of: RSTUDENT, HAT DIAGONAL, DFBETAS.

25 The INFLUENCE option (in the MODEL statement) requests the statistics proposed by Belsley, Kuh, and Welsch (1980) to measure the influence of each observation. Belsley, Kuh, and Welsch influence diagnostics: h_ii (leverage points), R-student (studentized residual), COVRATIO, DFFITS, DFBETAS, Cook's D.

26 Section 9.3 Solutions to the Problem of Influential Observations

27 Solutions to the Problem of Influential Observations
(1) Standard practice: omit the influential observation(s) from the analysis; NOT a feasible solution if these observations are legitimate.
(2) Robust regression techniques: weight down the influential observation(s), Huber (1973).
continued...

28 (3) Use of Dummy or Indicator Variables Associated with Influential Points. Create dummy variables corresponding to each influential data point. Operationally, augment the model with these dummy variables and re-estimate with either OLS or GLS procedures. Use either intercept-shifter and/or slope-shifter variables. If a dummy variable is associated with an influential observation, then the corresponding coefficient estimate is likely to be statistically different from zero. Identifying how many dummies to use can be tricky. For example, suppose you have 4 outliers in a dataset of 100 observations. Are we going to use: 1) one dummy that identifies all outliers; 2) two dummies, one for the outliers that seem to be far above the mean and one for those far below; or 3) four dummies, one for each outlier separately?
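The mechanics of the intercept-shifter remedy can be sketched as follows. Adding an indicator that equals 1 only for the flagged observation makes the augmented OLS fit reproduce that observation exactly, and the remaining coefficients equal the OLS fit with the observation deleted. A hypothetical NumPy illustration on simulated data:

```python
# Illustrative sketch: dummy variable (intercept shifter) for one outlier.
import numpy as np

rng = np.random.default_rng(5)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=n)
bad = 12
y[bad] += 5.0                                  # an influential outlier

d = np.zeros(n)
d[bad] = 1.0                                   # dummy = 1 only for obs 12
Xa = np.column_stack([X, d])                   # augmented design matrix

b = np.linalg.lstsq(Xa, y, rcond=None)[0]
resid = y - Xa @ b
assert abs(resid[bad]) < 1e-8                  # flagged point fit exactly

# Intercept and slope now match OLS with obs 12 removed entirely.
mask = np.arange(n) != bad
b_drop = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
assert np.allclose(b[:2], b_drop)
```

This equivalence is why a significant dummy coefficient signals an influential observation: the dummy absorbs the point's entire residual.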

29 Example of Robust Regression Techniques from the Literature Swinton, S.M. and R.P. King, Evaluating Robust Regression Techniques for Detrending Crop Yield Data with Nonnormal Errors, American Journal of Agricultural Economics, May (1991): Because OLS minimizes the sum of squared errors, large residuals or outliers can exert considerable influence on parameter estimates. Robust regression methods give less weight than OLS to influential observations. They can be viewed as automated means of reducing outlier influence and are classified by their approach to controlling influential outliers. 29

30 Section 9.4 Robust Regression Techniques

31 Robust Regression Techniques
M-estimators employ maximum likelihood (ML) techniques for finding estimates of model parameters that minimize some function of the regression residuals. Example: multivariate t (Judge et al., 1988).
L-estimators, linear combinations of order statistics, calculate estimates of model parameters based on quantiles of the residuals. The quantiles then are combined with specified weights. Examples: least absolute error (LAE) (most common); trimmed mean (TRIM); five-quantile weighted regression quantile (FIVEQUAN); Gastwirth weighted regression quantile (GASTWIRTH); Tukey tri-mean weighted regression quantile (TUKEY).
There are no closed-form solutions for M-estimators and L-estimators. Estimation is done by optimization techniques, but convergence is not guaranteed.

32 Robust Regression OLS suffers in performance in the presence of outliers and certain non-normal error distributions (heavy tails). Robust estimators are less sensitive to outliers; they essentially weight down the influence of data points that produce residuals large in magnitude. MAD (Mean Absolute Deviation), LAR (Least Absolute Residual), and LAE (Least Absolute Error) are names for the estimator that minimizes the sum of the absolute values of the residuals.
Robust (LAE) criterion: minimize Σ_{i=1}^{n} |y_i − ŷ_i|
OLS criterion: minimize Σ_{i=1}^{n} (y_i − ŷ_i)²
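One simple way to approximate the LAE criterion is iteratively reweighted least squares with weights 1/|e_i|; as the text notes for robust estimators generally, convergence is not guaranteed, and this is only a hypothetical NumPy sketch on simulated data (production work would use a dedicated routine such as SAS's procedures):

```python
# Illustrative sketch: LAE (L1) regression via iteratively reweighted LS.
import numpy as np

rng = np.random.default_rng(6)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.5, size=n)
y[0] += 20.0                                   # gross outlier

def wls(Xm, ym, w):
    Xw = Xm * w[:, None]                       # weighted least squares step
    return np.linalg.solve(Xm.T @ Xw, Xw.T @ ym)

b = np.linalg.lstsq(X, y, rcond=None)[0]       # start from the OLS fit
for _ in range(100):
    e = y - X @ b
    w = 1.0 / np.maximum(np.abs(e), 1e-8)      # weight 1/|e|; floor avoids /0
    b = wls(X, y, w)

l1_ols = np.abs(y - X @ np.linalg.lstsq(X, y, rcond=None)[0]).sum()
l1_lae = np.abs(y - X @ b).sum()
assert l1_lae <= l1_ols + 1e-6                 # L1 loss no worse than OLS
```

With the gross outlier planted in the first observation, the L1 fit tracks the bulk of the data while OLS tilts toward the outlier.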

33 The ROBUSTREG Procedure in SAS
Overview: The main purpose of robust regression is to detect outliers and provide resistant (stable) results in the presence of outliers. In order to achieve this stability, robust regression limits the influence of outliers. Historically, three classes of problems have been addressed with robust regression techniques:
problems with outliers in the y-direction (response direction);
problems with multivariate outliers in the x-space (i.e., outliers in the covariate space, also referred to as leverage points);
problems with outliers in both the y-direction and the x-space.

34 Many methods have been developed in response to these problems. However, in statistical applications of outlier detection and robust regression, the methods most commonly used today are Huber M estimation, high breakdown value estimation, and combinations of these two methods. The ROBUSTREG procedure provides four such methods: M estimation, LTS estimation, S estimation, and MM estimation. continued...

35 (1) M estimation was introduced by Huber (1973), and it is the simplest approach both computationally and theoretically. Although it is not robust with respect to leverage points, it is still used extensively in analyzing data for which it can be assumed that the contamination is mainly in the response direction.
(2) Least Trimmed Squares (LTS) estimation is a high breakdown value method introduced by Rousseeuw (1984). The breakdown value is a measure of the proportion of contamination that an estimation method can withstand and still maintain its robustness. The performance of this method was improved by the FAST-LTS algorithm of Rousseeuw and Van Driessen (2000).
(3) S estimation is a high breakdown value method introduced by Rousseeuw and Yohai (1984). With the same breakdown value, it has a higher statistical efficiency than LTS estimation.
(4) MM estimation, introduced by Yohai (1987), combines high breakdown value estimation and M estimation. It has both the high breakdown property and a higher statistical efficiency than S estimation.
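The core of Huber M estimation is an IRLS loop: residuals are standardized by a robust scale (here the MAD), points with small standardized residuals keep weight 1, and points beyond the tuning constant c = 1.345 get weight c/|r|. A hypothetical NumPy sketch on simulated data, not PROC ROBUSTREG itself:

```python
# Illustrative sketch: Huber M estimation via IRLS with MAD scale.
import numpy as np

rng = np.random.default_rng(7)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.0, 1.0]) + rng.normal(scale=0.4, size=n)
bad = 3
y[bad] += 10.0                                 # response-direction outlier

c = 1.345                                      # Huber tuning constant
b = np.linalg.lstsq(X, y, rcond=None)[0]       # start from the OLS fit
for _ in range(50):
    e = y - X @ b
    scale = np.median(np.abs(e - np.median(e))) / 0.6745   # MAD-based scale
    r = e / scale                              # standardized residuals
    w = np.where(np.abs(r) <= c, 1.0, c / np.abs(r))       # Huber weights
    Xw = X * w[:, None]
    b = np.linalg.solve(X.T @ Xw, Xw.T @ y)    # weighted LS update

# The outlier ends up heavily downweighted; the bulk keeps full weight.
assert w[bad] < 0.2
assert np.median(w) > 0.9
```

Because the weighting acts only on residual size, this mirrors the text's caveat: M estimation guards against y-direction contamination but not against leverage points.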

36 Example 1: Pine Tree Problem Consider the forestry data in the table below:
STAND CHARACTERISTICS FOR PINE TREES (AGE, HD, N, MDBH)
Source: Burkhart, H.E., R.C. Parker, M.R. Strob, and R. Oberwald, Yields of Old-field Loblolly Pine Plantations, Division of Forestry and Wildlife Resources Publication FWS-3-72, Virginia Tech, Blacksburg, VA, 1972.

37 MDBH_i = β₀ + β₁ HD_i + β₂ (AGE_i · N_i) + β₃ (HD_i / N_i) + ε_i
where
MDBH_i = the average diameter at breast height (measured at 4.5 feet above ground) at AGE_i,
HD_i = the average height of dominant trees in feet,
N_i = the number of pine trees per acre at age AGE_i, and
AGE_i = the age of a particular pine stand.
Compute HAT diagonals, Cook's D values, DFFITS values, and DFBETAS for each of the 20 observations. Determine whether or not any of the 20 observations exert a disproportionate influence on the results. If so, use robust regression methods to circumvent the problem.

38 EXAMPLE: Pine Tree Data, 20 observations; p = 4
cutoff for h_ii: 2p/n = 8/20 = 0.40
cutoff for DFFITS: 2√(p/n) = 2√(4/20) ≈ 0.89
cutoff for Cook's D: 4/(n − p) = 4/16 = 0.25
cutoff for DFBETAS: 2/√n = 2/√20 ≈ 0.45
cutoffs for COVRATIO: 1 ± 3p/n = 1 ± 12/20, i.e., 1.6 and 0.4
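The cutoff arithmetic above is easy to reproduce (pure-stdlib Python, shown only to make the formulas concrete):

```python
# The pine-tree cutoff arithmetic with n = 20 observations, p = 4 parameters.
from math import sqrt

n, p = 20, 4
print(round(2 * p / n, 2))          # h_ii cutoff      -> 0.4
print(round(2 * sqrt(p / n), 2))    # DFFITS cutoff    -> 0.89
print(round(4 / (n - p), 2))        # Cook's D cutoff  -> 0.25
print(round(2 / sqrt(n), 2))        # DFBETAS cutoff   -> 0.45
print(round(1 + 3 * p / n, 1), round(1 - 3 * p / n, 1))  # COVRATIO -> 1.6 0.4
```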

39 (Table: DFBETAS for Intercept, HD, AGEN, and HDN; DFFITS; Cook's D; and COVRATIO by observation.)
Observation 9 is a leverage point but not an outlier. Observation 10 is an outlier but not a leverage point. Observation 20 is an influential point.

40 Robust Regression Procedures OLS LAE a M Estimation S Estimation LTS Estimation MM Estimation Intercept (0.3466) (0.2729) (0.3838) (0.4219) NR b (0.3864) HD (0.0254) (0.0199) (0.0281) (0.0314) NR (0.0287) AGEN ( ) ( ) (0.0001) (0.0001) NR (0.0001) HDN (8.3738) (6.5914) (9.2718) ( ) NR (9.6084) R a From Shazam, not SAS b NR Not Reported by SAS 40

41 Use of Intercept Shifter for Influential Observation (OBS 20) OLS Intercept (0.3467) HD (0.0254) OLS with Intercept Shifter (0.3436) (0.0249) AGEN ( ) ( ) HDN (8.3738) (9.3732) OBS (0.3700) R R

42 Dependent Variable: MDBH Number of Observations Read 20 Number of Observations Used 20 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model <.0001 Error Corrected Total Root MSE R-Square Dependent Mean Adj R-Sq Coeff Var continued...

43 OLS parameter estimates Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > t Intercept <.0001 HD agen hdn Durbin-Watson D Pr < DW Pr > DW Number of Observations 20 1st Order Autocorrelation NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is the p-value for testing negative autocorrelation. 43 continued...

44 Output Statistics Dependent Predicted Std Error Std Error Student Cook's Obs Variable Value Mean Predict Residual Residual Residual D ** *** ** * * * **** * * * *** ***

45 Hat Diag Cov DFBETAS Obs RStudent H Ratio DFFITS Intercept HD agen hdn continued...

46 Hat Diag Cov DFBETAS Obs RStudent H Ratio DFFITS Intercept HD agen hdn Influence Diagnostics 46

47 Sum of Residuals 0 Sum of Squared Residuals Predicted Residual SS (PRESS) Model Information Data Set WORK.PINETREE Dependent Variable MDBH Number of Independent Variables 3 Number of Observations 20 Method M Estimation Number of Observations Read 20 Number of Observations Used 20 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD HD agen hdn MDBH

48 Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq M estimation Intercept <.0001 HD agen hdn Scale Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * * * Diagnostics Summary Observation Type Proportion Cutoff 48 Outlier Leverage continued...

49 Goodness-of-Fit Statistic Value R-Square AICR BICR Deviance The ROBUSTREG Procedure Model Information Data Set WORK.PINETREE Dependent Variable MDBH Number of Independent Variables 3 Number of Observations 20 Method S Estimation Number of Observations Read 20 Number of Observations Used 20 S estimation Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD HD agen hdn MDBH continued...

50 S estimation S Profile Total Number of Observations 20 Number of Coefficients 4 Subset Size 4 Chi Function Tukey K Breakdown Value Efficiency Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr>ChiSq Intercept <.0001 HD agen hdn Scale continued...

51 Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage Goodness-of-Fit Statistic Value 51 R-Square Deviance continued...

52 Model Information Data Set WORK.PINETREE Dependent Variable MDBH Number of Independent Variables 3 Number of Observations 20 Method LTS Estimation Number of Observations Read 20 Number of Observations Used 20 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD HD agen hdn MDBH LTS Profile 52 Total Number of Observations 20 Number of Squares Minimized 16 Number of Coefficients 4 Highest Possible Breakdown Value continued...

53 LTS Parameter Estimates Parameter DF Estimate Intercept HD agen hdn Scale (slts) Scale (Wscale) LTS estimation The ROBUSTREG Procedure Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * * * * Diagnostics Summary Observation Type Proportion Cutoff 53 Outlier Leverage continued...

54 R-Square for LTS Estimation R-Square Model Information Data Set WORK.PINETREE Dependent Variable MDBH Number of Independent Variables 3 Number of Observations 20 Method MM Estimation 54 Number of Observations Read 20 Number of Observations Used 20 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD HD agen hdn MDBH continued...

55 Profile for the Initial LTS Estimate Total Number of Observations 20 Number of Squares Minimized 16 Number of Coefficients 4 Highest Possible Breakdown Value MM estimation MM Profile Chi Function Tukey K Efficiency Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr>ChiSq 55 Intercept <.0001 HD agen hdn Scale continued...

56 Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage Goodness-of-Fit Statistic Value 56 R-Square AICR BICR Deviance continued...

57 OLS estimation with dummy variable for the outlier (obs 20) The REG Procedure Dependent Variable: MDBH Number of Observations Read 20 Number of Observations Used 20 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model <.0001 Error Corrected Total Root MSE R-Square Dependent Mean Adj R-Sq Coeff Var continued...

58 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > t Intercept <.0001 HD agen hdn inf Dependent Variable: MDBH Durbin-Watson D Pr < DW Pr > DW Number of Observations 20 1st Order Autocorrelation NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is the p-value for testing negative autocorrelation.

59 Section 9.5 Examples

60 Example 2: Demand Function for FF20 PUBLIX
165 observations; p = 11. Cutoffs:
h_ii: 2p/n = 22/165 = 0.13
DFFITS: 2√(p/n) = 2√(11/165) = 0.52
DFBETAS: 2/√n = 2/√165 = 0.16
Cook's D: 4/(n − p) = 4/154 = 0.03
COVRATIO: 1 − 3p/n = 1 − 33/165 = 0.8 and 1 + 3p/n = 1.2
Influential Observations: 16, 27, 32, 52
(Table: DFFITS, Cook's D, and COVRATIO for OBS 16, 27, 32, and 52.)

61 Robust Regression Procedures OLS M Estimation S Estimation LTS Estimation MM Estimation Intercept (0.7926) (0.7109) (0.7843) NR (0.7351) Week (0.0003) (0.0003) (0.0003) NR (0.0003) LOGDISC (0.1890) (0.1695) (0.2064) NR (0.1915) LOGPRICE (0.5534) (0.4964) (0.5474) NR (0.5129) FSI (0.0893) (0.0801) (0.0885) NR (0.0827) LOGDISP (0.1873) (0.1680) (0.1991) NR (0.1846) LOGAD (0.1020) (0.0915) (0.1183) NR (0.1084) LOGDIST (3.7027) (3.3213) (3.8424) NR (3.5320) 61

62 Robust Regression Procedures OLS M Estimation S Estimation LTS Estimation MM Estimation Q (0.0406) (0.0364) (0.0408) NR (0.0382) Q (0.0419) (0.0376) (0.0417) NR (0.0391) Q (0.0429) (0.0384) (0.0450) NR (0.0415) R R Influential Observations , 27, 32, 52 27, 52 27, 52 27, 52, 149, , 52 NR not reported by SAS 62

63 The REG Procedure Dependent Variable: LOGUNITS Number of Observations Read 165 Number of Observations Used 165 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model <.0001 Error Corrected Total Root MSE R-Square Dependent Mean Adj R-Sq Coeff Var continued...

64 Parameter Estimates OLS parameter estimates Parameter Standard Variable DF Estimate Error t Value Pr > t Intercept <.0001 WEEK LOGDISC <.0001 LOGPRICE <.0001 FSI LOGDISP <.0001 LOGAD <.0001 LOGDIST Q Q Q continued...

65 The REG Procedure Dependent Variable: LOGUNITS Durbin-Watson D Pr < DW <.0001 Pr > DW Number of Observations 165 1st Order Autocorrelation NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is the p-value for testing negative autocorrelation. 65 continued...

66 Output Statistics Dependent Predicted Std Error Std Error Student Cook's Obs Variable Value Mean Predict Residual Residual Residual D ** ** ** ** * ** continued...

67 **** * * *** * *** ****** continued...

68 * ***** ** ** **** * * * * * * continued...

69 * * * * ****** * ** * * * continued...

70 Output Statistics Hat Diag Cov DFBETAS Obs RStudent H Ratio DFFITS Intercept WEEK LOGDISC LOGPRICE FSI continued...

71 continued...

72 continued...

73 continued...

74 Output Statistics DFBETAS Obs LOGDISP LOGAD LOGDIST Q1 Q2 Q continued...

75 continued...

76 continued...

77 Sum of Residuals E-13 Sum of Squared Residuals Predicted Residual SS (PRESS) Model Information Data Set WORK.PUBLIXFF20 Dependent Variable LOGUNITS Number of Independent Variables 10 Number of Observations 165 Method M Estimation Number of Observations Read 165 Number of Observations Used continued...

78 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD WEEK LOGDISC LOGPRICE FSI LOGDISP LOGAD LOGDIST Q Q Q LOGUNITS continued...

79 Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr>ChiSq Intercept <.0001 WEEK LOGDISC <.0001 LOGPRICE <.0001 FSI LOGDISP <.0001 LOGAD <.0001 LOGDIST Q Q Q Scale continued...

80 The ROBUSTREG Procedure Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage Goodness-of-Fit Statistic Value 80 R-Square AICR BICR Deviance continued...

81 Model Information Data Set WORK.PUBLIXFF20 Dependent Variable LOGUNITS Number of Independent Variables 10 Number of Observations 165 Method S Estimation Number of Observations Read 165 Number of Observations Used 165 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD 81 WEEK LOGDISC LOGPRICE FSI LOGDISP LOGAD LOGDIST Q Q Q LOGUNITS

82 S Profile Total Number of Observations 165 Number of Coefficients 11 Subset Size 11 Chi Function Tukey K Breakdown Value Efficiency Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept <.0001 WEEK LOGDISC <.0001 LOGPRICE < continued...

83 The ROBUSTREG Procedure Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq FSI LOGDISP <.0001 LOGAD <.0001 LOGDIST Q Q Q Scale continued...

84 Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage Goodness-of-Fit Statistic Value R-Square Deviance Model Information Data Set WORK.PUBLIXFF20 Dependent Variable LOGUNITS Number of Independent Variables 10 Number of Observations 165 Method LTS Estimation 84 Number of Observations Read 165 Number of Observations Used 165 continued...

85 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD WEEK LOGDISC LOGPRICE FSI LOGDISP LOGAD LOGDIST Q Q Q LOGUNITS LTS Profile 85 Total Number of Observations 165 Number of Squares Minimized 126 Number of Coefficients 11 Highest Possible Breakdown Value

86 LTS Parameter Estimates Parameter DF Estimate Intercept WEEK LOGDISC LOGPRICE FSI LOGDISP LOGAD LOGDIST LTS Parameter Estimates Parameter DF Estimate Q Q Q Scale (slts) Scale (Wscale) continued...

87 Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage R-Square for LTS Estimation 87 R-Square continued...

88 Model Information Data Set WORK.PUBLIXFF20 Dependent Variable LOGUNITS Number of Independent Variables 10 Number of Observations 165 Method MM Estimation Number of Observations Read 165 Number of Observations Used Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD WEEK LOGDISC LOGPRICE FSI LOGDISP LOGAD LOGDIST Q Q Q LOGUNITS

89 Profile for the Initial LTS Estimate Total Number of Observations 165 Number of Squares Minimized 126 Number of Coefficients 11 Highest Possible Breakdown Value MM Profile Chi Function Tukey K Efficiency continued...

90 Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept <.0001 WEEK LOGDISC <.0001 LOGPRICE <.0001 FSI LOGDISP <.0001 LOGAD <.0001 LOGDIST Q Q Q Scale continued...

91 Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage Goodness-of-Fit Statistic Value 91 R-Square AICR BICR Deviance

92 Section 9.6 Commentary

93 Commentary (1) Always examine the data to determine the existence of influential observations. (2) Check for the existence of both leverage points (h_ii values) and outliers (R-student statistics). (3) Use DFFITS, DFBETAS, Cook's D, and COVRATIO to determine the impact of influential observations on prediction, estimated coefficients, and the variance of the estimated coefficients. (4) Solutions to this issue amount to the use of robust regression procedures (sophisticated) or the use of dummy variables (pedestrian). (5) The dummy variable approach is more straightforward and may involve either intercept and/or slope shifters. (6) Unless the data are incorrect, never eliminate influential observations from the analysis.


More information

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 Time allowed: 3 HOURS. STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 This is an open book exam: all course notes and the text are allowed, and you are expected to use your own calculator.

More information

STATISTICS 110/201 PRACTICE FINAL EXAM

STATISTICS 110/201 PRACTICE FINAL EXAM STATISTICS 110/201 PRACTICE FINAL EXAM Questions 1 to 5: There is a downloadable Stata package that produces sequential sums of squares for regression. In other words, the SS is built up as each variable

More information

Regression Model Building

Regression Model Building Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation in Y with a small set of predictors Automated

More information

Alternative Biased Estimator Based on Least. Trimmed Squares for Handling Collinear. Leverage Data Points

Alternative Biased Estimator Based on Least. Trimmed Squares for Handling Collinear. Leverage Data Points International Journal of Contemporary Mathematical Sciences Vol. 13, 018, no. 4, 177-189 HIKARI Ltd, www.m-hikari.com https://doi.org/10.1988/ijcms.018.8616 Alternative Biased Estimator Based on Least

More information

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises LINEAR REGRESSION ANALYSIS MODULE XVI Lecture - 44 Exercises Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Exercise 1 The following data has been obtained on

More information

REGRESSION DIAGNOSTICS AND REMEDIAL MEASURES

REGRESSION DIAGNOSTICS AND REMEDIAL MEASURES REGRESSION DIAGNOSTICS AND REMEDIAL MEASURES Lalmohan Bhar I.A.S.R.I., Library Avenue, Pusa, New Delhi 110 01 lmbhar@iasri.res.in 1. Introduction Regression analysis is a statistical methodology that utilizes

More information

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines)

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines) Dr. Maddah ENMG 617 EM Statistics 11/28/12 Multiple Regression (3) (Chapter 15, Hines) Problems in multiple regression: Multicollinearity This arises when the independent variables x 1, x 2,, x k, are

More information

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

unadjusted model for baseline cholesterol 22:31 Monday, April 19, unadjusted model for baseline cholesterol 22:31 Monday, April 19, 2004 1 Class Level Information Class Levels Values TRETGRP 3 3 4 5 SEX 2 0 1 Number of observations 916 unadjusted model for baseline cholesterol

More information

EXST7015: Estimating tree weights from other morphometric variables Raw data print

EXST7015: Estimating tree weights from other morphometric variables Raw data print Simple Linear Regression SAS example Page 1 1 ********************************************; 2 *** Data from Freund & Wilson (1993) ***; 3 *** TABLE 8.24 : ESTIMATING TREE WEIGHTS ***; 4 ********************************************;

More information

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of

More information

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc IES 612/STA 4-573/STA 4-576 Winter 2008 Week 1--IES 612-STA 4-573-STA 4-576.doc Review Notes: [OL] = Ott & Longnecker Statistical Methods and Data Analysis, 5 th edition. [Handouts based on notes prepared

More information

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c Inference About the Slope ffl As with all estimates, ^fi1 subject to sampling var ffl Because Y jx _ Normal, the estimate ^fi1 _ Normal A linear combination of indep Normals is Normal Simple Linear Regression

More information

a. YOU MAY USE ONE 8.5 X11 TWO-SIDED CHEAT SHEET AND YOUR TEXTBOOK (OR COPY THEREOF).

a. YOU MAY USE ONE 8.5 X11 TWO-SIDED CHEAT SHEET AND YOUR TEXTBOOK (OR COPY THEREOF). STAT3503 Test 2 NOTE: a. YOU MAY USE ONE 8.5 X11 TWO-SIDED CHEAT SHEET AND YOUR TEXTBOOK (OR COPY THEREOF). b. YOU MAY USE ANY ELECTRONIC CALCULATOR. c. FOR FULL MARKS YOU MUST SHOW THE FORMULA YOU USE

More information

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 LAST NAME: SOLUTIONS FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 302 STA 1001 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator.

More information

Nonlinear Regression. Summary. Sample StatFolio: nonlinear reg.sgp

Nonlinear Regression. Summary. Sample StatFolio: nonlinear reg.sgp Nonlinear Regression Summary... 1 Analysis Summary... 4 Plot of Fitted Model... 6 Response Surface Plots... 7 Analysis Options... 10 Reports... 11 Correlation Matrix... 12 Observed versus Predicted...

More information

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti Prepared by: Prof Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti Putra Malaysia Serdang M L Regression is an extension to

More information

General Linear Model (Chapter 4)

General Linear Model (Chapter 4) General Linear Model (Chapter 4) Outcome variable is considered continuous Simple linear regression Scatterplots OLS is BLUE under basic assumptions MSE estimates residual variance testing regression coefficients

More information

Booklet of Code and Output for STAC32 Final Exam

Booklet of Code and Output for STAC32 Final Exam Booklet of Code and Output for STAC32 Final Exam December 8, 2014 List of Figures in this document by page: List of Figures 1 Popcorn data............................. 2 2 MDs by city, with normal quantile

More information

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure. STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed

More information

Beam Example: Identifying Influential Observations using the Hat Matrix

Beam Example: Identifying Influential Observations using the Hat Matrix Math 3080. Treibergs Beam Example: Identifying Influential Observations using the Hat Matrix Name: Example March 22, 204 This R c program explores influential observations and their detection using the

More information

Chapter 11: Robust & Quantile regression

Chapter 11: Robust & Quantile regression Chapter 11: Robust & Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I 1/17 11.3: Robust regression 11.3 Influential cases rem. measure: Robust regression

More information

Lecture 9 SLR in Matrix Form

Lecture 9 SLR in Matrix Form Lecture 9 SLR in Matrix Form STAT 51 Spring 011 Background Reading KNNL: Chapter 5 9-1 Topic Overview Matrix Equations for SLR Don t focus so much on the matrix arithmetic as on the form of the equations.

More information

Lecture notes on Regression & SAS example demonstration

Lecture notes on Regression & SAS example demonstration Regression & Correlation (p. 215) When two variables are measured on a single experimental unit, the resulting data are called bivariate data. You can describe each variable individually, and you can also

More information

Autocorrelation or Serial Correlation

Autocorrelation or Serial Correlation Chapter 6 Autocorrelation or Serial Correlation Section 6.1 Introduction 2 Evaluating Econometric Work How does an analyst know when the econometric work is completed? 3 4 Evaluating Econometric Work Econometric

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Lecture 1: Linear Models and Applications

Lecture 1: Linear Models and Applications Lecture 1: Linear Models and Applications Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Introduction to linear models Exploratory data analysis (EDA) Estimation

More information

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:

More information

Circle a single answer for each multiple choice question. Your choice should be made clearly.

Circle a single answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 4, 215 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 31 questions. Circle

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Introduction to Linear regression analysis. Part 2. Model comparisons

Introduction to Linear regression analysis. Part 2. Model comparisons Introduction to Linear regression analysis Part Model comparisons 1 ANOVA for regression Total variation in Y SS Total = Variation explained by regression with X SS Regression + Residual variation SS Residual

More information

((n r) 1) (r 1) ε 1 ε 2. X Z β+

((n r) 1) (r 1) ε 1 ε 2. X Z β+ Bringing Order to Outlier Diagnostics in Regression Models D.R.JensenandD.E.Ramirez Virginia Polytechnic Institute and State University and University of Virginia der@virginia.edu http://www.math.virginia.edu/

More information

Department of Mathematics The University of Toledo. Master of Science Degree Comprehensive Examination Applied Statistics.

Department of Mathematics The University of Toledo. Master of Science Degree Comprehensive Examination Applied Statistics. Department of Mathematics The University of Toledo Master of Science Degree Comprehensive Examination Applied Statistics April 8, 205 nstructions Do all problems. Show all of your computations. Prove all

More information

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication G. S. Maddala Kajal Lahiri WILEY A John Wiley and Sons, Ltd., Publication TEMT Foreword Preface to the Fourth Edition xvii xix Part I Introduction and the Linear Regression Model 1 CHAPTER 1 What is Econometrics?

More information

Math 3330: Solution to midterm Exam

Math 3330: Solution to midterm Exam Math 3330: Solution to midterm Exam Question 1: (14 marks) Suppose the regression model is y i = β 0 + β 1 x i + ε i, i = 1,, n, where ε i are iid Normal distribution N(0, σ 2 ). a. (2 marks) Compute the

More information

Circle the single best answer for each multiple choice question. Your choice should be made clearly.

Circle the single best answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 6, 2017 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 32 multiple choice

More information

ssh tap sas913, sas

ssh tap sas913, sas B. Kedem, STAT 430 SAS Examples SAS8 ===================== ssh xyz@glue.umd.edu, tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm Multiple Regression ====================== 0. Show

More information

Variable Selection and Model Building

Variable Selection and Model Building LINEAR REGRESSION ANALYSIS MODULE XIII Lecture - 37 Variable Selection and Model Building Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur The complete regression

More information

Two Stage Robust Ridge Method in a Linear Regression Model

Two Stage Robust Ridge Method in a Linear Regression Model Journal of Modern Applied Statistical Methods Volume 14 Issue Article 8 11-1-015 Two Stage Robust Ridge Method in a Linear Regression Model Adewale Folaranmi Lukman Ladoke Akintola University of Technology,

More information

10 Model Checking and Regression Diagnostics

10 Model Checking and Regression Diagnostics 10 Model Checking and Regression Diagnostics The simple linear regression model is usually written as i = β 0 + β 1 i + ɛ i where the ɛ i s are independent normal random variables with mean 0 and variance

More information

Lecture 3: Inference in SLR

Lecture 3: Inference in SLR Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals

More information

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response.

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response. Leverage Some cases have high leverage, the potential to greatly affect the fit. These cases are outliers in the space of predictors. Often the residuals for these cases are not large because the response

More information

T-test: means of Spock's judge versus all other judges 1 12:10 Wednesday, January 5, judge1 N Mean Std Dev Std Err Minimum Maximum

T-test: means of Spock's judge versus all other judges 1 12:10 Wednesday, January 5, judge1 N Mean Std Dev Std Err Minimum Maximum T-test: means of Spock's judge versus all other judges 1 The TTEST Procedure Variable: pcwomen judge1 N Mean Std Dev Std Err Minimum Maximum OTHER 37 29.4919 7.4308 1.2216 16.5000 48.9000 SPOCKS 9 14.6222

More information

8. Example: Predicting University of New Mexico Enrollment

8. Example: Predicting University of New Mexico Enrollment 8. Example: Predicting University of New Mexico Enrollment year (1=1961) 6 7 8 9 10 6000 10000 14000 0 5 10 15 20 25 30 6 7 8 9 10 unem (unemployment rate) hgrad (highschool graduates) 10000 14000 18000

More information

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression BSTT523: Kutner et al., Chapter 1 1 Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression Introduction: Functional relation between

More information

Lecture 10 Multiple Linear Regression

Lecture 10 Multiple Linear Regression Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club BE640 Intermediate Biostatistics 2. Regression and Correlation Simple Linear Regression Software: SAS Emergency Calls to the New York Auto Club Source: Chatterjee, S; Handcock MS and Simonoff JS A Casebook

More information

Multiple Linear Regression

Multiple Linear Regression Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach

More information

Regression Diagnostics Procedures

Regression Diagnostics Procedures Regression Diagnostics Procedures ASSUMPTIONS UNDERLYING REGRESSION/CORRELATION NORMALITY OF VARIANCE IN Y FOR EACH VALUE OF X For any fixed value of the independent variable X, the distribution of the

More information

Lecture 11: Simple Linear Regression

Lecture 11: Simple Linear Regression Lecture 11: Simple Linear Regression Readings: Sections 3.1-3.3, 11.1-11.3 Apr 17, 2009 In linear regression, we examine the association between two quantitative variables. Number of beers that you drink

More information

UCD CENTRE FOR ECONOMIC RESEARCH WORKING PAPER SERIES

UCD CENTRE FOR ECONOMIC RESEARCH WORKING PAPER SERIES UCD CENTRE FOR ECONOMIC RESEARCH WORKING PAPER SERIES 2005 Doctors Fees in Ireland Following the Change in Reimbursement: Did They Jump? David Madden, University College Dublin WP05/20 November 2005 UCD

More information

MATH Notebook 4 Spring 2018

MATH Notebook 4 Spring 2018 MATH448001 Notebook 4 Spring 2018 prepared by Professor Jenny Baglivo c Copyright 2010 2018 by Jenny A. Baglivo. All Rights Reserved. 4 MATH448001 Notebook 4 3 4.1 Simple Linear Model.................................

More information

Multi-Equation Structural Models: Seemingly Unrelated Regression Models

Multi-Equation Structural Models: Seemingly Unrelated Regression Models Chapter 15 Multi-Equation Structural Models: Seemingly Unrelated Regression Models Section 15.1 Seemingly Unrelated Regression Models Modeling Approaches Econometric (Structural) Models Time-Series Models

More information

13 Simple Linear Regression

13 Simple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 3 Simple Linear Regression 3. An industrial example A study was undertaken to determine the effect of stirring rate on the amount of impurity

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression September 24, 2008 Reading HH 8, GIll 4 Simple Linear Regression p.1/20 Problem Data: Observe pairs (Y i,x i ),i = 1,...n Response or dependent variable Y Predictor or independent

More information

Treatment Variables INTUB duration of endotracheal intubation (hrs) VENTL duration of assisted ventilation (hrs) LOWO2 hours of exposure to 22 49% lev

Treatment Variables INTUB duration of endotracheal intubation (hrs) VENTL duration of assisted ventilation (hrs) LOWO2 hours of exposure to 22 49% lev Variable selection: Suppose for the i-th observational unit (case) you record ( failure Y i = 1 success and explanatory variabales Z 1i Z 2i Z ri Variable (or model) selection: subject matter theory and

More information

Chapter 8 (More on Assumptions for the Simple Linear Regression)

Chapter 8 (More on Assumptions for the Simple Linear Regression) EXST3201 Chapter 8b Geaghan Fall 2005: Page 1 Chapter 8 (More on Assumptions for the Simple Linear Regression) Your textbook considers the following assumptions: Linearity This is not something I usually

More information

Detecting outliers and/or leverage points: a robust two-stage procedure with bootstrap cut-off points

Detecting outliers and/or leverage points: a robust two-stage procedure with bootstrap cut-off points Detecting outliers and/or leverage points: a robust two-stage procedure with bootstrap cut-off points Ettore Marubini (1), Annalisa Orenti (1) Background: Identification and assessment of outliers, have

More information

Polynomial Regression

Polynomial Regression Polynomial Regression Summary... 1 Analysis Summary... 3 Plot of Fitted Model... 4 Analysis Options... 6 Conditional Sums of Squares... 7 Lack-of-Fit Test... 7 Observed versus Predicted... 8 Residual Plots...

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

Economics 308: Econometrics Professor Moody

Economics 308: Econometrics Professor Moody Economics 308: Econometrics Professor Moody References on reserve: Text Moody, Basic Econometrics with Stata (BES) Pindyck and Rubinfeld, Econometric Models and Economic Forecasts (PR) Wooldridge, Jeffrey

More information

Lecture 8: Instrumental Variables Estimation

Lecture 8: Instrumental Variables Estimation Lecture Notes on Advanced Econometrics Lecture 8: Instrumental Variables Estimation Endogenous Variables Consider a population model: y α y + β + β x + β x +... + β x + u i i i i k ik i Takashi Yamano

More information

Lecture 4: Regression Analysis

Lecture 4: Regression Analysis Lecture 4: Regression Analysis 1 Regression Regression is a multivariate analysis, i.e., we are interested in relationship between several variables. For corporate audience, it is sufficient to show correlation.

More information

The General Linear Model. April 22, 2008

The General Linear Model. April 22, 2008 The General Linear Model. April 22, 2008 Multiple regression Data: The Faroese Mercury Study Simple linear regression Confounding The multiple linear regression model Interpretation of parameters Model

More information

Analysis of Variance. Source DF Squares Square F Value Pr > F. Model <.0001 Error Corrected Total

Analysis of Variance. Source DF Squares Square F Value Pr > F. Model <.0001 Error Corrected Total Math 221: Linear Regression and Prediction Intervals S. K. Hyde Chapter 23 (Moore, 5th Ed.) (Neter, Kutner, Nachsheim, and Wasserman) The Toluca Company manufactures refrigeration equipment as well as

More information

The General Linear Model. November 20, 2007

The General Linear Model. November 20, 2007 The General Linear Model. November 20, 2007 Multiple regression Data: The Faroese Mercury Study Simple linear regression Confounding The multiple linear regression model Interpretation of parameters Model

More information

Introduction to Regression

Introduction to Regression Introduction to Regression Using Mult Lin Regression Derived variables Many alternative models Which model to choose? Model Criticism Modelling Objective Model Details Data and Residuals Assumptions 1

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

The Steps to Follow in a Multiple Regression Analysis

The Steps to Follow in a Multiple Regression Analysis ABSTRACT The Steps to Follow in a Multiple Regression Analysis Theresa Hoang Diem Ngo, Warner Bros. Home Video, Burbank, CA A multiple regression analysis is the most powerful tool that is widely used,

More information

Multiple Linear Regression

Multiple Linear Regression Chapter 3 Multiple Linear Regression 3.1 Introduction Multiple linear regression is in some ways a relatively straightforward extension of simple linear regression that allows for more than one independent

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE

More information

5.3 Three-Stage Nested Design Example

5.3 Three-Stage Nested Design Example 5.3 Three-Stage Nested Design Example A researcher designs an experiment to study the of a metal alloy. A three-stage nested design was conducted that included Two alloy chemistry compositions. Three ovens

More information

Lecture 12 Robust Estimation

Lecture 12 Robust Estimation Lecture 12 Robust Estimation Prof. Dr. Svetlozar Rachev Institute for Statistics and Mathematical Economics University of Karlsruhe Financial Econometrics, Summer Semester 2007 Copyright These lecture-notes

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

Prediction of Bike Rental using Model Reuse Strategy

Prediction of Bike Rental using Model Reuse Strategy Prediction of Bike Rental using Model Reuse Strategy Arun Bala Subramaniyan and Rong Pan School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, USA. {bsarun, rong.pan}@asu.edu

More information

A Modified M-estimator for the Detection of Outliers

A Modified M-estimator for the Detection of Outliers A Modified M-estimator for the Detection of Outliers Asad Ali Department of Statistics, University of Peshawar NWFP, Pakistan Email: asad_yousafzay@yahoo.com Muhammad F. Qadir Department of Statistics,

More information

REGRESSION OUTLIERS AND INFLUENTIAL OBSERVATIONS USING FATHOM

REGRESSION OUTLIERS AND INFLUENTIAL OBSERVATIONS USING FATHOM REGRESSION OUTLIERS AND INFLUENTIAL OBSERVATIONS USING FATHOM Lindsey Bell lbell2@coastal.edu Keshav Jagannathan kjaganna@coastal.edu Department of Mathematics and Statistics Coastal Carolina University

More information

Chapter 12: Multiple Regression

Chapter 12: Multiple Regression Chapter 12: Multiple Regression 12.1 a. A scatterplot of the data is given here: Plot of Drug Potency versus Dose Level Potency 0 5 10 15 20 25 30 0 5 10 15 20 25 30 35 Dose Level b. ŷ = 8.667 + 0.575x

More information