Detecting and Assessing Data Outliers and Leverage Points


1 Chapter 9 Detecting and Assessing Data Outliers and Leverage Points

2 Section 9.1 Background

3 Background Because OLS estimators arise from minimizing the sum of squared errors, large residuals (outliers) can exert considerable influence on parameter estimates, estimates of standard errors, and model prediction. Observations on the explanatory variables also may become more influential as they move farther from their respective means. These observations do not involve the dependent (endogenous) variable in the econometric analysis, and they are labeled leverage points. Leverage points, too, can exert undue influence on parameter estimates, estimates of standard errors, and model prediction. Influential points, by definition, are those observations that qualify as both outliers and leverage points.

4 Influence diagnostics relate only to the observations or data points indigenous to the econometric model. Concern: particular observations (outliers/leverage points) exert undue influence on regression results. Influential observations may be legitimate, or they may be errors in measurement; if they correspond to errors in measurement, obtain the corrected data. Influential observations (if legitimate) may cast doubt either on the validity of the observation or on the general adequacy of the model.

5 Section 9.2 Influence Diagnostics

6 Influence Diagnostics
Nature of Problem: Outliers; Leverage Points
Identification: Studentized Residuals; Hat Matrix; DFFITS; DFBETAS; Cook's D Statistic; COVRATIO; Robust Regression

7 Influence Diagnostics
(1) Residuals: e_i = y_i − ŷ_i
(2) Elements of the hat matrix H = X(XᵀX)⁻¹Xᵀ. The hat matrix comes from Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = HY.

8 Hat diagonal elements (leverage points):
h_ii = X_iᵀ(XᵀX)⁻¹X_i
This provides a measure of the standardized distance from the point X_i to the center of the X data. These diagonal elements highlight observations that are extreme in the X's (data points far from the mean are relatively influential). Leverage is bounded below by 1/n and bounded above by 1; the closer the leverage is to unity, the greater the leverage of the observation.
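The hat-diagonal computation and the two properties just stated (sum of leverages equals p; each h_ii lies between 1/n and 1) can be sketched in a few lines of NumPy. This is a hypothetical illustration on simulated data, not from the chapter (whose computations use SAS):

```python
# Illustrative sketch: hat diagonals h_ii = x_i'(X'X)^(-1)x_i on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3                                  # n observations, p parameters
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix H = X(X'X)^(-1)X'
h = np.diag(H)                                # leverage of each observation

# Properties stated in the text: sum of leverages equals p,
# and each h_ii lies between 1/n and 1.
print(round(h.sum(), 6))                      # -> 3.0
print(bool(np.all((h >= 1 / n - 1e-12) & (h <= 1 + 1e-12))))  # -> True
```

Observations whose h_ii exceeds 2p/n would be flagged as leverage points under the guideline given later in the chapter.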

9 Outlier Detection Outlier detection involves determining whether the residual (actual value − predicted value) is an extreme negative or positive value. But we must take into account the variance of the residuals in detecting outliers to level the playing field.

10 Create Standardized Residuals A standardized residual is a residual divided by its standard deviation:
resid_standardized,i = (y_i − ŷ_i)/s, where s = standard deviation of the residuals.
If the standardized residuals associated with observations of the data set have values in excess of 3.5 or below −3.5, then the corresponding data points are labeled outliers.

11 Studentized Residuals
R-student_i = e_i / (s_(i) √(1 − h_ii))
where s_(i) = standard error with the i-th observation deleted and h_ii = leverage. This statistic is large if one or more of the following conditions holds: (1) e_i is large; (2) h_ii is large; or (3) s²_(i) is small. The R-student statistic is distributed as t with n − p degrees of freedom, where n is the number of observations and p is the number of parameters in the model.
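R-student can be computed without refitting the model n times, via the standard deletion formula s²_(i) = [(n − p)s² − e_i²/(1 − h_ii)]/(n − p − 1). A hypothetical NumPy sketch on simulated data with one planted response outlier (the chapter itself uses SAS):

```python
# Illustrative sketch: externally studentized residuals (R-student).
import numpy as np

rng = np.random.default_rng(1)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[5] += 4.0                                   # plant an outlier in y

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y                                 # OLS residuals e_i = y_i - yhat_i
s2 = e @ e / (n - p)                          # residual variance

# Deleted variance s2_(i) and R-student = e_i / (s_(i) * sqrt(1 - h_ii));
# the deletion formula avoids refitting the model n times.
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
rstudent = e / np.sqrt(s2_del * (1 - h))

outliers = np.flatnonzero(np.abs(rstudent) > 2)
print(outliers)                               # observation 5 is flagged
```

The |R-student| > 2 cutoff used here anticipates the guideline given on the next slides.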

12 To detect potentially high-influence observations, note: (1) data points with large hat diagonals (leverage points); (2) large R-student values (size of residuals with appropriate standardization).
R-student_i = (y_i − ŷ_i) / [s²_(i) (1 − h_ii)]^(1/2)
where s²_(i) is the residual variance after deleting the i-th observation.
For hat diagonals, Σ_{i=1}^{n} h_ii = p, where p is the number of model parameters and n is the number of observations.

13 Guidelines
h_ii > 2p/n: the observation exerts considerable leverage (a conservative rule of thumb is 3p/n).
|R-student| > 2: the observation is an outlier.
To be an influential observation, both criteria must be met.


15 Influence on the predicted (or fitted) value ŷ_i
Diagnostic: DFFITS_i = (ŷ_i − ŷ_(i)) / (s_(i) √h_ii)
DFFITS measures the difference between the result with the i-th observation and without it; the subscript (i) indicates that the i-th observation is not involved in the computation. DFFITS_i represents, for the i-th point, the number of estimated standard errors that the fitted value ŷ_i changes if the i-th point is removed from the data set.
Cutoff: 2 √(p/n)
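DFFITS can be computed for all observations at once through the algebraic identity DFFITS_i = R-student_i · √(h_ii/(1 − h_ii)), which the sketch below checks against the brute-force definition (drop observation i and refit). A hypothetical NumPy illustration on simulated data:

```python
# Illustrative sketch: DFFITS via the deletion identity, verified by refit.
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

def ols(Xm, ym):
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
rstudent = e / np.sqrt(s2_del * (1 - h))
dffits = rstudent * np.sqrt(h / (1 - h))      # identity form of DFFITS

# Brute force for one observation: drop it, refit, compare fitted values.
i = 7
mask = np.arange(n) != i
brute = (X[i] @ ols(X, y) - X[i] @ ols(X[mask], y[mask])) / np.sqrt(s2_del[i] * h[i])
assert abs(brute - dffits[i]) < 1e-8

print(np.flatnonzero(np.abs(dffits) > 2 * np.sqrt(p / n)))  # cutoff 2*sqrt(p/n)
```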

16 Influence on the regression coefficients For each regression coefficient, the influence diagnostics provide a statistic which gives the number of standard errors that the coefficient changes if the i-th observation were set aside:
DFBETAS_(j,i) = (β̂_j − β̂_(j,i)) / (s_(i) √C_jj)
where C_jj is the j-th diagonal element of (XᵀX)⁻¹. A large value of DFBETAS_(j,i) indicates that the i-th observation has a sizable impact on the j-th regression coefficient.
Cutoff: 2/√n

17 Use DFBETAS_(j,i) to ascertain which observations influence specific regression coefficients. The analyst must examine n × p statistics to assess the influence on the regression coefficients. A composite measure of the influence on the set of coefficients is Cook's Distance (Cook's D):
D_i = (β̂ − β̂_(i))ᵀ (XᵀX) (β̂ − β̂_(i)) / (p s²)
Cook's distance represents the standardized distance between the vector of least squares coefficients β̂ and β̂_(i); D_i ≥ 0. Belsley suggests 4/(n − p) as a cutoff.
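Cook's D also has a closed form, D_i = e_i² h_ii / (p s² (1 − h_ii)²), which agrees with the vector definition above. The hypothetical NumPy sketch below verifies the agreement for one observation with a leave-one-out refit:

```python
# Illustrative sketch: Cook's D closed form vs. the leave-one-out definition.
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

XtX = X.T @ X
b = np.linalg.solve(XtX, X.T @ y)             # full-sample OLS coefficients
H = X @ np.linalg.inv(XtX) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)

cooks_d = e**2 * h / (p * s2 * (1 - h) ** 2)  # closed form

# Definition: D_i = (b - b_(i))' X'X (b - b_(i)) / (p * s^2)
i = 4
mask = np.arange(n) != i
b_del = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
d = b - b_del
assert abs(d @ XtX @ d / (p * s2) - cooks_d[i]) < 1e-8

print(np.flatnonzero(cooks_d > 4 / (n - p)))  # Belsley cutoff 4/(n - p)
```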

18 Cutoff for Cook's D = 4/(n − p) = 0.24


21 Influence on the Variance of Regression Coefficients
COVRATIO_i = det[ s²_(i) (X_(i)ᵀ X_(i))⁻¹ ] / det[ s² (XᵀX)⁻¹ ]
The COVRATIO_i statistic measures the change in the determinant of the covariance matrix of the estimates from deleting the i-th observation. If COVRATIO_i > 1, then the i-th point provides an improvement, a reduction in the estimated generalized variance of the coefficients over what would be produced without the data point. If COVRATIO_i < 1, inclusion of the i-th point results in increasing the generalized variance. Belsley, Kuh, and Welsch yardstick (for large samples): the i-th data point exerts an unusual amount of influence on the generalized variance if COVRATIO_i > 1 + (3p/n) or COVRATIO_i < 1 − (3p/n) (if n > 3p).
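COVRATIO can be computed either directly from the two determinants or from the equivalent closed form COVRATIO_i = (s²_(i)/s²)^p / (1 − h_ii). A hypothetical NumPy sketch on simulated data checking that the two agree:

```python
# Illustrative sketch: COVRATIO by determinant ratio vs. its closed form.
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 0.5, 1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # deleted variance

i = 11
mask = np.arange(n) != i
Xd = X[mask]
num = np.linalg.det(s2_del[i] * np.linalg.inv(Xd.T @ Xd))  # det without obs i
den = np.linalg.det(s2 * np.linalg.inv(X.T @ X))           # det with all obs
covratio_det = num / den

covratio_closed = (s2_del[i] / s2) ** p / (1 - h[i])
assert abs(covratio_det - covratio_closed) < 1e-8
print(covratio_det)
```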

22 Cutoff for Cook's D = 4/(22 − 6) = 0.25


24 Examples of the use of influence diagnostics
(1) Uri, N.D. and R. Boyd, "Estimating the Regional Demand for Softwood Lumber in the United States," North Central Journal of Agricultural Economics, 12, 1 (January 1990). In this article, use of: RSTUDENT, HAT DIAGONAL, COVRATIO, DFFITS.
(2) Swinton, S.M. and R.P. King, "Evaluating Robust Regression Techniques for Detrending Crop Yield Data with Nonnormal Errors," American Journal of Agricultural Economics, May (1991). In this article, use of: RSTUDENT, HAT DIAGONAL, DFBETAS.

25 The INFLUENCE option (in the MODEL statement) requests the statistics proposed by Belsley, Kuh, and Welsch (1980) to measure the influence of each observation. Belsley, Kuh, and Welsch influence diagnostics: h_ii (leverage points), R-student (studentized residual), COVRATIO, DFFITS, DFBETAS, Cook's D.

26 Section 9.3 Solutions to the Problem of Influential Observations

27 Solutions to the Problem of Influential Observations
(1) Standard practice: omit the influential observation(s) from the analysis; NOT a feasible solution if these observations are legitimate.
(2) Robust regression techniques: weight down the influential observation(s), Huber (1973).
continued...

28 (3) Use of Dummy or Indicator Variables Associated with Influential Points. Create dummy variables corresponding to each influential data point. Operationally, augment the model with these dummy variables and re-estimate with either OLS or GLS procedures. Use either intercept-shifter and/or slope-shifter variables. If a dummy variable is associated with an influential observation, then the corresponding coefficient estimate is likely to be statistically different from zero. Identifying how many dummies to use can be tricky. For example, suppose you have 4 outliers in a dataset of 100 observations. Are we going to use: 1) one dummy that identifies all outliers; 2) two dummies, one for the outliers that seem to be far above the mean and one for those far below; or 3) four dummies, one for each outlier separately?
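The mechanics of the intercept-shifter remedy can be sketched as follows. Adding an indicator that equals 1 only for the flagged observation makes the augmented OLS fit reproduce that observation exactly, and the remaining coefficients equal the OLS fit with the observation deleted. A hypothetical NumPy illustration on simulated data:

```python
# Illustrative sketch: dummy variable (intercept shifter) for one outlier.
import numpy as np

rng = np.random.default_rng(5)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=n)
bad = 12
y[bad] += 5.0                                  # an influential outlier

d = np.zeros(n)
d[bad] = 1.0                                   # dummy = 1 only for obs 12
Xa = np.column_stack([X, d])                   # augmented design matrix

b = np.linalg.lstsq(Xa, y, rcond=None)[0]
resid = y - Xa @ b
assert abs(resid[bad]) < 1e-8                  # flagged point fit exactly

# Intercept and slope now match OLS with obs 12 removed entirely.
mask = np.arange(n) != bad
b_drop = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
assert np.allclose(b[:2], b_drop)
```

This equivalence is why a significant dummy coefficient signals an influential observation: the dummy absorbs the point's entire residual.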

29 Example of Robust Regression Techniques from the Literature Swinton, S.M. and R.P. King, Evaluating Robust Regression Techniques for Detrending Crop Yield Data with Nonnormal Errors, American Journal of Agricultural Economics, May (1991): Because OLS minimizes the sum of squared errors, large residuals or outliers can exert considerable influence on parameter estimates. Robust regression methods give less weight than OLS to influential observations. They can be viewed as automated means of reducing outlier influence and are classified by their approach to controlling influential outliers. 29

30 Section 9.4 Robust Regression Techniques

31 Robust Regression Techniques
M-estimators employ maximum likelihood (ML) techniques for finding estimates of model parameters that minimize some function of the regression residuals. Example: multivariate t (Judge et al., 1988).
L-estimators, linear combinations of order statistics, calculate estimates of model parameters based on quantiles of the residuals. The quantiles then are combined with specified weights. Examples: least absolute error (LAE) (most common); trimmed mean (TRIM); five-quantile weighted regression quantile (FIVEQUAN); Gastwirth weighted regression quantile (GASTWIRTH); Tukey tri-mean weighted regression quantile (TUKEY).
There are no closed-form solutions for M-estimators and L-estimators. Estimation is done by optimization techniques, but convergence is not guaranteed.

32 Robust Regression OLS suffers in performance in the presence of outliers and certain non-normal error distributions (heavy tails). Robust estimators are less sensitive to outliers; they essentially weight down the influence of data points that produce residuals large in magnitude. MAD (Mean Absolute Deviation), LAR (Least Absolute Residual), and LAE (Least Absolute Error) are names for the estimator that minimizes the sum of the absolute values of the residuals.
Robust (LAE) criterion: minimize Σ_{i=1}^{n} |y_i − ŷ_i|
OLS criterion: minimize Σ_{i=1}^{n} (y_i − ŷ_i)²
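One simple way to approximate the LAE criterion is iteratively reweighted least squares with weights 1/|e_i|; as the text notes for robust estimators generally, convergence is not guaranteed, and this is only a hypothetical NumPy sketch on simulated data (production work would use a dedicated routine such as SAS's procedures):

```python
# Illustrative sketch: LAE (L1) regression via iteratively reweighted LS.
import numpy as np

rng = np.random.default_rng(6)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.5, size=n)
y[0] += 20.0                                   # gross outlier

def wls(Xm, ym, w):
    Xw = Xm * w[:, None]                       # weighted least squares step
    return np.linalg.solve(Xm.T @ Xw, Xw.T @ ym)

b = np.linalg.lstsq(X, y, rcond=None)[0]       # start from the OLS fit
for _ in range(100):
    e = y - X @ b
    w = 1.0 / np.maximum(np.abs(e), 1e-8)      # weight 1/|e|; floor avoids /0
    b = wls(X, y, w)

l1_ols = np.abs(y - X @ np.linalg.lstsq(X, y, rcond=None)[0]).sum()
l1_lae = np.abs(y - X @ b).sum()
assert l1_lae <= l1_ols + 1e-6                 # L1 loss no worse than OLS
```

With the gross outlier planted in the first observation, the L1 fit tracks the bulk of the data while OLS tilts toward the outlier.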

33 The ROBUSTREG Procedure in SAS
Overview: The main purpose of robust regression is to detect outliers and provide resistant (stable) results in the presence of outliers. In order to achieve this stability, robust regression limits the influence of outliers. Historically, three classes of problems have been addressed with robust regression techniques:
problems with outliers in the y-direction (response direction);
problems with multivariate outliers in the x-space (i.e., outliers in the covariate space, also referred to as leverage points);
problems with outliers in both the y-direction and the x-space.

34 Many methods have been developed in response to these problems. However, in statistical applications of outlier detection and robust regression, the methods most commonly used today are Huber M estimation, high breakdown value estimation, and combinations of these two methods. The ROBUSTREG procedure provides four such methods: M estimation, LTS estimation, S estimation, and MM estimation. continued...

35 (1) M estimation was introduced by Huber (1973), and it is the simplest approach both computationally and theoretically. Although it is not robust with respect to leverage points, it is still used extensively in analyzing data for which it can be assumed that the contamination is mainly in the response direction.
(2) Least Trimmed Squares (LTS) estimation is a high breakdown value method introduced by Rousseeuw (1984). The breakdown value is a measure of the proportion of contamination that an estimation method can withstand and still maintain its robustness. The performance of this method was improved by the FAST-LTS algorithm of Rousseeuw and Van Driessen (2000).
(3) S estimation is a high breakdown value method introduced by Rousseeuw and Yohai (1984). With the same breakdown value, it has a higher statistical efficiency than LTS estimation.
(4) MM estimation, introduced by Yohai (1987), combines high breakdown value estimation and M estimation. It has both the high breakdown property and a higher statistical efficiency than S estimation.
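The core of Huber M estimation is an IRLS loop: residuals are standardized by a robust scale (here the MAD), points with small standardized residuals keep weight 1, and points beyond the tuning constant c = 1.345 get weight c/|r|. A hypothetical NumPy sketch on simulated data, not PROC ROBUSTREG itself:

```python
# Illustrative sketch: Huber M estimation via IRLS with MAD scale.
import numpy as np

rng = np.random.default_rng(7)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.0, 1.0]) + rng.normal(scale=0.4, size=n)
bad = 3
y[bad] += 10.0                                 # response-direction outlier

c = 1.345                                      # Huber tuning constant
b = np.linalg.lstsq(X, y, rcond=None)[0]       # start from the OLS fit
for _ in range(50):
    e = y - X @ b
    scale = np.median(np.abs(e - np.median(e))) / 0.6745   # MAD-based scale
    r = e / scale                              # standardized residuals
    w = np.where(np.abs(r) <= c, 1.0, c / np.abs(r))       # Huber weights
    Xw = X * w[:, None]
    b = np.linalg.solve(X.T @ Xw, Xw.T @ y)    # weighted LS update

# The outlier ends up heavily downweighted; the bulk keeps full weight.
assert w[bad] < 0.2
assert np.median(w) > 0.9
```

Because the weighting acts only on residual size, this mirrors the text's caveat: M estimation guards against y-direction contamination but not against leverage points.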

36 Example 1: Pine Tree Problem Consider the forestry data in the table below:
STAND CHARACTERISTICS FOR PINE TREES (AGE, HD, N, MDBH)
Source: Burkhart, H.E., R.C. Parker, M.R. Strob, and R. Oberwald, Yields of Old-field Loblolly Pine Plantations, Division of Forestry and Wildlife Resources Publication FWS-3-72, Virginia Tech, Blacksburg, VA, 1972.

37 MDBH_i = β₀ + β₁ HD_i + β₂ (AGE_i · N_i) + β₃ (HD_i / N_i) + ε_i
where
MDBH_i = the average diameter at breast height (measured at 4.5 feet above ground) at AGE_i,
HD_i = the average height of dominant trees in feet,
N_i = the number of pine trees per acre at age AGE_i, and
AGE_i = the age of a particular pine stand.
Compute HAT diagonals, Cook's D values, DFFITS values, and DFBETAS for each of the 20 observations. Determine whether or not any of the 20 observations exert a disproportionate influence on the results. If so, use robust regression methods to circumvent the problem.

38 EXAMPLE: Pine Tree Data, 20 observations; p = 4
cutoff for h_ii: 2p/n = 8/20 = 0.40
cutoff for DFFITS: 2√(p/n) = 2√(4/20) ≈ 0.89
cutoff for Cook's D: 4/(n − p) = 4/16 = 0.25
cutoff for DFBETAS: 2/√n = 2/√20 ≈ 0.45
cutoffs for COVRATIO: 1 ± 3p/n = 1 ± 12/20, i.e., 1.6 and 0.4
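The cutoff arithmetic above is easy to reproduce (pure-stdlib Python, shown only to make the formulas concrete):

```python
# The pine-tree cutoff arithmetic with n = 20 observations, p = 4 parameters.
from math import sqrt

n, p = 20, 4
print(round(2 * p / n, 2))          # h_ii cutoff      -> 0.4
print(round(2 * sqrt(p / n), 2))    # DFFITS cutoff    -> 0.89
print(round(4 / (n - p), 2))        # Cook's D cutoff  -> 0.25
print(round(2 / sqrt(n), 2))        # DFBETAS cutoff   -> 0.45
print(round(1 + 3 * p / n, 1), round(1 - 3 * p / n, 1))  # COVRATIO -> 1.6 0.4
```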

39 (Table: DFBETAS for Intercept, HD, AGEN, and HDN; DFFITS; Cook's D; and COVRATIO by observation.)
Observation 9 is a leverage point but not an outlier. Observation 10 is an outlier but not a leverage point. Observation 20 is an influential point.

40 Robust Regression Procedures OLS LAE a M Estimation S Estimation LTS Estimation MM Estimation Intercept (0.3466) (0.2729) (0.3838) (0.4219) NR b (0.3864) HD (0.0254) (0.0199) (0.0281) (0.0314) NR (0.0287) AGEN ( ) ( ) (0.0001) (0.0001) NR (0.0001) HDN (8.3738) (6.5914) (9.2718) ( ) NR (9.6084) R a From Shazam, not SAS b NR Not Reported by SAS 40

41 Use of Intercept Shifter for Influential Observation (OBS 20) OLS Intercept (0.3467) HD (0.0254) OLS with Intercept Shifter (0.3436) (0.0249) AGEN ( ) ( ) HDN (8.3738) (9.3732) OBS (0.3700) R R

42 Dependent Variable: MDBH Number of Observations Read 20 Number of Observations Used 20 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model <.0001 Error Corrected Total Root MSE R-Square Dependent Mean Adj R-Sq Coeff Var continued...

43 OLS parameter estimates Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > t Intercept <.0001 HD agen hdn Durbin-Watson D Pr < DW Pr > DW Number of Observations 20 1st Order Autocorrelation NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is the p-value for testing negative autocorrelation. 43 continued...

44 Output Statistics Dependent Predicted Std Error Std Error Student Cook's Obs Variable Value Mean Predict Residual Residual Residual D ** *** ** * * * **** * * * *** ***

45 Hat Diag Cov DFBETAS Obs RStudent H Ratio DFFITS Intercept HD agen hdn continued...

46 Hat Diag Cov DFBETAS Obs RStudent H Ratio DFFITS Intercept HD agen hdn Influence Diagnostics 46

47 Sum of Residuals 0 Sum of Squared Residuals Predicted Residual SS (PRESS) Model Information Data Set WORK.PINETREE Dependent Variable MDBH Number of Independent Variables 3 Number of Observations 20 Method M Estimation Number of Observations Read 20 Number of Observations Used 20 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD HD agen hdn MDBH

48 Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq M estimation Intercept <.0001 HD agen hdn Scale Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * * * Diagnostics Summary Observation Type Proportion Cutoff 48 Outlier Leverage continued...

49 Goodness-of-Fit Statistic Value R-Square AICR BICR Deviance The ROBUSTREG Procedure Model Information Data Set WORK.PINETREE Dependent Variable MDBH Number of Independent Variables 3 Number of Observations 20 Method S Estimation Number of Observations Read 20 Number of Observations Used 20 S estimation Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD HD agen hdn MDBH continued...

50 S estimation S Profile Total Number of Observations 20 Number of Coefficients 4 Subset Size 4 Chi Function Tukey K Breakdown Value Efficiency Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr>ChiSq Intercept <.0001 HD agen hdn Scale continued...

51 Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage Goodness-of-Fit Statistic Value 51 R-Square Deviance continued...

52 Model Information Data Set WORK.PINETREE Dependent Variable MDBH Number of Independent Variables 3 Number of Observations 20 Method LTS Estimation Number of Observations Read 20 Number of Observations Used 20 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD HD agen hdn MDBH LTS Profile 52 Total Number of Observations 20 Number of Squares Minimized 16 Number of Coefficients 4 Highest Possible Breakdown Value continued...

53 LTS Parameter Estimates Parameter DF Estimate Intercept HD agen hdn Scale (slts) Scale (Wscale) LTS estimation The ROBUSTREG Procedure Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * * * * Diagnostics Summary Observation Type Proportion Cutoff 53 Outlier Leverage continued...

54 R-Square for LTS Estimation R-Square Model Information Data Set WORK.PINETREE Dependent Variable MDBH Number of Independent Variables 3 Number of Observations 20 Method MM Estimation 54 Number of Observations Read 20 Number of Observations Used 20 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD HD agen hdn MDBH continued...

55 Profile for the Initial LTS Estimate Total Number of Observations 20 Number of Squares Minimized 16 Number of Coefficients 4 Highest Possible Breakdown Value MM estimation MM Profile Chi Function Tukey K Efficiency Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr>ChiSq 55 Intercept <.0001 HD agen hdn Scale continued...

56 Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage Goodness-of-Fit Statistic Value 56 R-Square AICR BICR Deviance continued...

57 OLS estimation with dummy variable for the outlier (obs 20) The REG Procedure Dependent Variable: MDBH Number of Observations Read 20 Number of Observations Used 20 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model <.0001 Error Corrected Total Root MSE R-Square Dependent Mean Adj R-Sq Coeff Var continued...

58 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > t Intercept <.0001 HD agen hdn inf Dependent Variable: MDBH Durbin-Watson D Pr < DW Pr > DW Number of Observations 20 1st Order Autocorrelation NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is the p-value for testing negative autocorrelation.

59 Section 9.5 Examples

60 Example 2: Demand Function for FF20 PUBLIX
165 observations; p = 11. Cutoffs:
h_ii: 2p/n = 22/165 = 0.13
DFFITS: 2√(p/n) = 2√(11/165) = 0.52
DFBETAS: 2/√n = 2/√165 = 0.16
Cook's D: 4/(n − p) = 4/154 = 0.03
COVRATIO: 1 − 3p/n = 1 − 33/165 = 0.8 and 1 + 3p/n = 1.2
Influential Observations: 16, 27, 32, 52
(Table: DFFITS, Cook's D, and COVRATIO for OBS 16, 27, 32, and 52.)

61 Robust Regression Procedures OLS M Estimation S Estimation LTS Estimation MM Estimation Intercept (0.7926) (0.7109) (0.7843) NR (0.7351) Week (0.0003) (0.0003) (0.0003) NR (0.0003) LOGDISC (0.1890) (0.1695) (0.2064) NR (0.1915) LOGPRICE (0.5534) (0.4964) (0.5474) NR (0.5129) FSI (0.0893) (0.0801) (0.0885) NR (0.0827) LOGDISP (0.1873) (0.1680) (0.1991) NR (0.1846) LOGAD (0.1020) (0.0915) (0.1183) NR (0.1084) LOGDIST (3.7027) (3.3213) (3.8424) NR (3.5320) 61

62 Robust Regression Procedures OLS M Estimation S Estimation LTS Estimation MM Estimation Q (0.0406) (0.0364) (0.0408) NR (0.0382) Q (0.0419) (0.0376) (0.0417) NR (0.0391) Q (0.0429) (0.0384) (0.0450) NR (0.0415) R R Influential Observations , 27, 32, 52 27, 52 27, 52 27, 52, 149, , 52 NR not reported by SAS 62

63 The REG Procedure Dependent Variable: LOGUNITS Number of Observations Read 165 Number of Observations Used 165 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model <.0001 Error Corrected Total Root MSE R-Square Dependent Mean Adj R-Sq Coeff Var continued...

64 Parameter Estimates OLS parameter estimates Parameter Standard Variable DF Estimate Error t Value Pr > t Intercept <.0001 WEEK LOGDISC <.0001 LOGPRICE <.0001 FSI LOGDISP <.0001 LOGAD <.0001 LOGDIST Q Q Q continued...

65 The REG Procedure Dependent Variable: LOGUNITS Durbin-Watson D Pr < DW <.0001 Pr > DW Number of Observations 165 1st Order Autocorrelation NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is the p-value for testing negative autocorrelation. 65 continued...

66 Output Statistics Dependent Predicted Std Error Std Error Student Cook's Obs Variable Value Mean Predict Residual Residual Residual D ** ** ** ** * ** continued...

67 **** * * *** * *** ****** continued...

68 * ***** ** ** **** * * * * * * continued...

69 * * * * ****** * ** * * * continued...

70 Output Statistics Hat Diag Cov DFBETAS Obs RStudent H Ratio DFFITS Intercept WEEK LOGDISC LOGPRICE FSI continued...

71 continued...

72 continued...

73 continued...

74 Output Statistics DFBETAS Obs LOGDISP LOGAD LOGDIST Q1 Q2 Q continued...

75 continued...

76 continued...

77 Sum of Residuals E-13 Sum of Squared Residuals Predicted Residual SS (PRESS) Model Information Data Set WORK.PUBLIXFF20 Dependent Variable LOGUNITS Number of Independent Variables 10 Number of Observations 165 Method M Estimation Number of Observations Read 165 Number of Observations Used continued...

78 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD WEEK LOGDISC LOGPRICE FSI LOGDISP LOGAD LOGDIST Q Q Q LOGUNITS continued...

79 Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr>ChiSq Intercept <.0001 WEEK LOGDISC <.0001 LOGPRICE <.0001 FSI LOGDISP <.0001 LOGAD <.0001 LOGDIST Q Q Q Scale continued...

80 The ROBUSTREG Procedure Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage Goodness-of-Fit Statistic Value 80 R-Square AICR BICR Deviance continued...

81 Model Information Data Set WORK.PUBLIXFF20 Dependent Variable LOGUNITS Number of Independent Variables 10 Number of Observations 165 Method S Estimation Number of Observations Read 165 Number of Observations Used 165 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD 81 WEEK LOGDISC LOGPRICE FSI LOGDISP LOGAD LOGDIST Q Q Q LOGUNITS

82 S Profile Total Number of Observations 165 Number of Coefficients 11 Subset Size 11 Chi Function Tukey K Breakdown Value Efficiency Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept <.0001 WEEK LOGDISC <.0001 LOGPRICE < continued...

83 The ROBUSTREG Procedure Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq FSI LOGDISP <.0001 LOGAD <.0001 LOGDIST Q Q Q Scale continued...

84 Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage Goodness-of-Fit Statistic Value R-Square Deviance Model Information Data Set WORK.PUBLIXFF20 Dependent Variable LOGUNITS Number of Independent Variables 10 Number of Observations 165 Method LTS Estimation 84 Number of Observations Read 165 Number of Observations Used 165 continued...

85 Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD WEEK LOGDISC LOGPRICE FSI LOGDISP LOGAD LOGDIST Q Q Q LOGUNITS LTS Profile 85 Total Number of Observations 165 Number of Squares Minimized 126 Number of Coefficients 11 Highest Possible Breakdown Value

86 LTS Parameter Estimates Parameter DF Estimate Intercept WEEK LOGDISC LOGPRICE FSI LOGDISP LOGAD LOGDIST LTS Parameter Estimates Parameter DF Estimate Q Q Q Scale (slts) Scale (Wscale) continued...

87 Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage R-Square for LTS Estimation 87 R-Square continued...

88 Model Information Data Set WORK.PUBLIXFF20 Dependent Variable LOGUNITS Number of Independent Variables 10 Number of Observations 165 Method MM Estimation Number of Observations Read 165 Number of Observations Used Summary Statistics Standard Variable Q1 Median Q3 Mean Deviation MAD WEEK LOGDISC LOGPRICE FSI LOGDISP LOGAD LOGDIST Q Q Q LOGUNITS

89 Profile for the Initial LTS Estimate Total Number of Observations 165 Number of Squares Minimized 126 Number of Coefficients 11 Highest Possible Breakdown Value MM Profile Chi Function Tukey K Efficiency continued...

90 Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept <.0001 WEEK LOGDISC <.0001 LOGPRICE <.0001 FSI LOGDISP <.0001 LOGAD <.0001 LOGDIST Q Q Q Scale continued...

91 Diagnostics Robust Standardized Mahalanobis MCD Robust Obs Distance Distance Leverage Residual Outlier * * Diagnostics Summary Observation Type Proportion Cutoff Outlier Leverage Goodness-of-Fit Statistic Value 91 R-Square AICR BICR Deviance

92 Section 9.6 Commentary

93 Commentary (1) Always examine the data to determine the existence of influential observations. (2) Check for the existence of both leverage points (h_ii values) and outliers (R-student statistics). (3) Use DFFITS, DFBETAS, Cook's D, and COVRATIO to determine the impact of influential observations on prediction, estimated coefficients, and the variance of the estimated coefficients. (4) Solutions to this issue amount to the use of robust regression procedures (sophisticated) or the use of dummy variables (pedestrian). (5) The dummy variable approach is more straightforward and may involve either intercept and/or slope shifters. (6) Unless the data are incorrect, never eliminate influential observations from the analysis.


More information

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 Time allowed: 3 HOURS. STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 This is an open book exam: all course notes and the text are allowed, and you are expected to use your own calculator.

More information

STATISTICS 110/201 PRACTICE FINAL EXAM

STATISTICS 110/201 PRACTICE FINAL EXAM STATISTICS 110/201 PRACTICE FINAL EXAM Questions 1 to 5: There is a downloadable Stata package that produces sequential sums of squares for regression. In other words, the SS is built up as each variable

More information

Regression Model Building

Regression Model Building Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation in Y with a small set of predictors Automated

More information

Alternative Biased Estimator Based on Least. Trimmed Squares for Handling Collinear. Leverage Data Points

Alternative Biased Estimator Based on Least. Trimmed Squares for Handling Collinear. Leverage Data Points International Journal of Contemporary Mathematical Sciences Vol. 13, 018, no. 4, 177-189 HIKARI Ltd, www.m-hikari.com https://doi.org/10.1988/ijcms.018.8616 Alternative Biased Estimator Based on Least

More information

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises LINEAR REGRESSION ANALYSIS MODULE XVI Lecture - 44 Exercises Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Exercise 1 The following data has been obtained on

More information

REGRESSION DIAGNOSTICS AND REMEDIAL MEASURES

REGRESSION DIAGNOSTICS AND REMEDIAL MEASURES REGRESSION DIAGNOSTICS AND REMEDIAL MEASURES Lalmohan Bhar I.A.S.R.I., Library Avenue, Pusa, New Delhi 110 01 lmbhar@iasri.res.in 1. Introduction Regression analysis is a statistical methodology that utilizes

More information

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines)

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines) Dr. Maddah ENMG 617 EM Statistics 11/28/12 Multiple Regression (3) (Chapter 15, Hines) Problems in multiple regression: Multicollinearity This arises when the independent variables x 1, x 2,, x k, are

More information

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

unadjusted model for baseline cholesterol 22:31 Monday, April 19, unadjusted model for baseline cholesterol 22:31 Monday, April 19, 2004 1 Class Level Information Class Levels Values TRETGRP 3 3 4 5 SEX 2 0 1 Number of observations 916 unadjusted model for baseline cholesterol

More information

EXST7015: Estimating tree weights from other morphometric variables Raw data print

EXST7015: Estimating tree weights from other morphometric variables Raw data print Simple Linear Regression SAS example Page 1 1 ********************************************; 2 *** Data from Freund & Wilson (1993) ***; 3 *** TABLE 8.24 : ESTIMATING TREE WEIGHTS ***; 4 ********************************************;

More information

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of

More information

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc IES 612/STA 4-573/STA 4-576 Winter 2008 Week 1--IES 612-STA 4-573-STA 4-576.doc Review Notes: [OL] = Ott & Longnecker Statistical Methods and Data Analysis, 5 th edition. [Handouts based on notes prepared

More information

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c Inference About the Slope ffl As with all estimates, ^fi1 subject to sampling var ffl Because Y jx _ Normal, the estimate ^fi1 _ Normal A linear combination of indep Normals is Normal Simple Linear Regression

More information

a. YOU MAY USE ONE 8.5 X11 TWO-SIDED CHEAT SHEET AND YOUR TEXTBOOK (OR COPY THEREOF).

a. YOU MAY USE ONE 8.5 X11 TWO-SIDED CHEAT SHEET AND YOUR TEXTBOOK (OR COPY THEREOF). STAT3503 Test 2 NOTE: a. YOU MAY USE ONE 8.5 X11 TWO-SIDED CHEAT SHEET AND YOUR TEXTBOOK (OR COPY THEREOF). b. YOU MAY USE ANY ELECTRONIC CALCULATOR. c. FOR FULL MARKS YOU MUST SHOW THE FORMULA YOU USE

More information

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 LAST NAME: SOLUTIONS FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 302 STA 1001 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator.

More information

Nonlinear Regression. Summary. Sample StatFolio: nonlinear reg.sgp

Nonlinear Regression. Summary. Sample StatFolio: nonlinear reg.sgp Nonlinear Regression Summary... 1 Analysis Summary... 4 Plot of Fitted Model... 6 Response Surface Plots... 7 Analysis Options... 10 Reports... 11 Correlation Matrix... 12 Observed versus Predicted...

More information

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti Prepared by: Prof Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti Putra Malaysia Serdang M L Regression is an extension to

More information

General Linear Model (Chapter 4)

General Linear Model (Chapter 4) General Linear Model (Chapter 4) Outcome variable is considered continuous Simple linear regression Scatterplots OLS is BLUE under basic assumptions MSE estimates residual variance testing regression coefficients

More information

Booklet of Code and Output for STAC32 Final Exam

Booklet of Code and Output for STAC32 Final Exam Booklet of Code and Output for STAC32 Final Exam December 8, 2014 List of Figures in this document by page: List of Figures 1 Popcorn data............................. 2 2 MDs by city, with normal quantile

More information

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure. STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed

More information

Beam Example: Identifying Influential Observations using the Hat Matrix

Beam Example: Identifying Influential Observations using the Hat Matrix Math 3080. Treibergs Beam Example: Identifying Influential Observations using the Hat Matrix Name: Example March 22, 204 This R c program explores influential observations and their detection using the

More information

Chapter 11: Robust & Quantile regression

Chapter 11: Robust & Quantile regression Chapter 11: Robust & Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I 1/17 11.3: Robust regression 11.3 Influential cases rem. measure: Robust regression

More information

Lecture 9 SLR in Matrix Form

Lecture 9 SLR in Matrix Form Lecture 9 SLR in Matrix Form STAT 51 Spring 011 Background Reading KNNL: Chapter 5 9-1 Topic Overview Matrix Equations for SLR Don t focus so much on the matrix arithmetic as on the form of the equations.

More information

Lecture notes on Regression & SAS example demonstration

Lecture notes on Regression & SAS example demonstration Regression & Correlation (p. 215) When two variables are measured on a single experimental unit, the resulting data are called bivariate data. You can describe each variable individually, and you can also

More information

Autocorrelation or Serial Correlation

Autocorrelation or Serial Correlation Chapter 6 Autocorrelation or Serial Correlation Section 6.1 Introduction 2 Evaluating Econometric Work How does an analyst know when the econometric work is completed? 3 4 Evaluating Econometric Work Econometric

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Lecture 1: Linear Models and Applications

Lecture 1: Linear Models and Applications Lecture 1: Linear Models and Applications Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Introduction to linear models Exploratory data analysis (EDA) Estimation

More information

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:

More information

Circle a single answer for each multiple choice question. Your choice should be made clearly.

Circle a single answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 4, 215 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 31 questions. Circle

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Introduction to Linear regression analysis. Part 2. Model comparisons

Introduction to Linear regression analysis. Part 2. Model comparisons Introduction to Linear regression analysis Part Model comparisons 1 ANOVA for regression Total variation in Y SS Total = Variation explained by regression with X SS Regression + Residual variation SS Residual

More information

((n r) 1) (r 1) ε 1 ε 2. X Z β+

((n r) 1) (r 1) ε 1 ε 2. X Z β+ Bringing Order to Outlier Diagnostics in Regression Models D.R.JensenandD.E.Ramirez Virginia Polytechnic Institute and State University and University of Virginia der@virginia.edu http://www.math.virginia.edu/

More information

Department of Mathematics The University of Toledo. Master of Science Degree Comprehensive Examination Applied Statistics.

Department of Mathematics The University of Toledo. Master of Science Degree Comprehensive Examination Applied Statistics. Department of Mathematics The University of Toledo Master of Science Degree Comprehensive Examination Applied Statistics April 8, 205 nstructions Do all problems. Show all of your computations. Prove all

More information

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication G. S. Maddala Kajal Lahiri WILEY A John Wiley and Sons, Ltd., Publication TEMT Foreword Preface to the Fourth Edition xvii xix Part I Introduction and the Linear Regression Model 1 CHAPTER 1 What is Econometrics?

More information

Math 3330: Solution to midterm Exam

Math 3330: Solution to midterm Exam Math 3330: Solution to midterm Exam Question 1: (14 marks) Suppose the regression model is y i = β 0 + β 1 x i + ε i, i = 1,, n, where ε i are iid Normal distribution N(0, σ 2 ). a. (2 marks) Compute the

More information

Circle the single best answer for each multiple choice question. Your choice should be made clearly.

Circle the single best answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 6, 2017 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 32 multiple choice

More information

ssh tap sas913, sas

ssh tap sas913, sas B. Kedem, STAT 430 SAS Examples SAS8 ===================== ssh xyz@glue.umd.edu, tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm Multiple Regression ====================== 0. Show

More information

Variable Selection and Model Building

Variable Selection and Model Building LINEAR REGRESSION ANALYSIS MODULE XIII Lecture - 37 Variable Selection and Model Building Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur The complete regression

More information

Two Stage Robust Ridge Method in a Linear Regression Model

Two Stage Robust Ridge Method in a Linear Regression Model Journal of Modern Applied Statistical Methods Volume 14 Issue Article 8 11-1-015 Two Stage Robust Ridge Method in a Linear Regression Model Adewale Folaranmi Lukman Ladoke Akintola University of Technology,

More information

10 Model Checking and Regression Diagnostics

10 Model Checking and Regression Diagnostics 10 Model Checking and Regression Diagnostics The simple linear regression model is usually written as i = β 0 + β 1 i + ɛ i where the ɛ i s are independent normal random variables with mean 0 and variance

More information

Lecture 3: Inference in SLR

Lecture 3: Inference in SLR Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals

More information

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response.

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response. Leverage Some cases have high leverage, the potential to greatly affect the fit. These cases are outliers in the space of predictors. Often the residuals for these cases are not large because the response

More information

T-test: means of Spock's judge versus all other judges 1 12:10 Wednesday, January 5, judge1 N Mean Std Dev Std Err Minimum Maximum

T-test: means of Spock's judge versus all other judges 1 12:10 Wednesday, January 5, judge1 N Mean Std Dev Std Err Minimum Maximum T-test: means of Spock's judge versus all other judges 1 The TTEST Procedure Variable: pcwomen judge1 N Mean Std Dev Std Err Minimum Maximum OTHER 37 29.4919 7.4308 1.2216 16.5000 48.9000 SPOCKS 9 14.6222

More information

8. Example: Predicting University of New Mexico Enrollment

8. Example: Predicting University of New Mexico Enrollment 8. Example: Predicting University of New Mexico Enrollment year (1=1961) 6 7 8 9 10 6000 10000 14000 0 5 10 15 20 25 30 6 7 8 9 10 unem (unemployment rate) hgrad (highschool graduates) 10000 14000 18000

More information

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression BSTT523: Kutner et al., Chapter 1 1 Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression Introduction: Functional relation between

More information

Lecture 10 Multiple Linear Regression

Lecture 10 Multiple Linear Regression Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club BE640 Intermediate Biostatistics 2. Regression and Correlation Simple Linear Regression Software: SAS Emergency Calls to the New York Auto Club Source: Chatterjee, S; Handcock MS and Simonoff JS A Casebook

More information

Multiple Linear Regression

Multiple Linear Regression Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach

More information

Regression Diagnostics Procedures

Regression Diagnostics Procedures Regression Diagnostics Procedures ASSUMPTIONS UNDERLYING REGRESSION/CORRELATION NORMALITY OF VARIANCE IN Y FOR EACH VALUE OF X For any fixed value of the independent variable X, the distribution of the

More information

Lecture 11: Simple Linear Regression

Lecture 11: Simple Linear Regression Lecture 11: Simple Linear Regression Readings: Sections 3.1-3.3, 11.1-11.3 Apr 17, 2009 In linear regression, we examine the association between two quantitative variables. Number of beers that you drink

More information

UCD CENTRE FOR ECONOMIC RESEARCH WORKING PAPER SERIES

UCD CENTRE FOR ECONOMIC RESEARCH WORKING PAPER SERIES UCD CENTRE FOR ECONOMIC RESEARCH WORKING PAPER SERIES 2005 Doctors Fees in Ireland Following the Change in Reimbursement: Did They Jump? David Madden, University College Dublin WP05/20 November 2005 UCD

More information

MATH Notebook 4 Spring 2018

MATH Notebook 4 Spring 2018 MATH448001 Notebook 4 Spring 2018 prepared by Professor Jenny Baglivo c Copyright 2010 2018 by Jenny A. Baglivo. All Rights Reserved. 4 MATH448001 Notebook 4 3 4.1 Simple Linear Model.................................

More information

Multi-Equation Structural Models: Seemingly Unrelated Regression Models

Multi-Equation Structural Models: Seemingly Unrelated Regression Models Chapter 15 Multi-Equation Structural Models: Seemingly Unrelated Regression Models Section 15.1 Seemingly Unrelated Regression Models Modeling Approaches Econometric (Structural) Models Time-Series Models

More information

13 Simple Linear Regression

13 Simple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 3 Simple Linear Regression 3. An industrial example A study was undertaken to determine the effect of stirring rate on the amount of impurity

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression September 24, 2008 Reading HH 8, GIll 4 Simple Linear Regression p.1/20 Problem Data: Observe pairs (Y i,x i ),i = 1,...n Response or dependent variable Y Predictor or independent

More information

Treatment Variables INTUB duration of endotracheal intubation (hrs) VENTL duration of assisted ventilation (hrs) LOWO2 hours of exposure to 22 49% lev

Treatment Variables INTUB duration of endotracheal intubation (hrs) VENTL duration of assisted ventilation (hrs) LOWO2 hours of exposure to 22 49% lev Variable selection: Suppose for the i-th observational unit (case) you record ( failure Y i = 1 success and explanatory variabales Z 1i Z 2i Z ri Variable (or model) selection: subject matter theory and

More information

Chapter 8 (More on Assumptions for the Simple Linear Regression)

Chapter 8 (More on Assumptions for the Simple Linear Regression) EXST3201 Chapter 8b Geaghan Fall 2005: Page 1 Chapter 8 (More on Assumptions for the Simple Linear Regression) Your textbook considers the following assumptions: Linearity This is not something I usually

More information

Detecting outliers and/or leverage points: a robust two-stage procedure with bootstrap cut-off points

Detecting outliers and/or leverage points: a robust two-stage procedure with bootstrap cut-off points Detecting outliers and/or leverage points: a robust two-stage procedure with bootstrap cut-off points Ettore Marubini (1), Annalisa Orenti (1) Background: Identification and assessment of outliers, have

More information

Polynomial Regression

Polynomial Regression Polynomial Regression Summary... 1 Analysis Summary... 3 Plot of Fitted Model... 4 Analysis Options... 6 Conditional Sums of Squares... 7 Lack-of-Fit Test... 7 Observed versus Predicted... 8 Residual Plots...

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

Economics 308: Econometrics Professor Moody

Economics 308: Econometrics Professor Moody Economics 308: Econometrics Professor Moody References on reserve: Text Moody, Basic Econometrics with Stata (BES) Pindyck and Rubinfeld, Econometric Models and Economic Forecasts (PR) Wooldridge, Jeffrey

More information

Lecture 8: Instrumental Variables Estimation

Lecture 8: Instrumental Variables Estimation Lecture Notes on Advanced Econometrics Lecture 8: Instrumental Variables Estimation Endogenous Variables Consider a population model: y α y + β + β x + β x +... + β x + u i i i i k ik i Takashi Yamano

More information

Lecture 4: Regression Analysis

Lecture 4: Regression Analysis Lecture 4: Regression Analysis 1 Regression Regression is a multivariate analysis, i.e., we are interested in relationship between several variables. For corporate audience, it is sufficient to show correlation.

More information

The General Linear Model. April 22, 2008

The General Linear Model. April 22, 2008 The General Linear Model. April 22, 2008 Multiple regression Data: The Faroese Mercury Study Simple linear regression Confounding The multiple linear regression model Interpretation of parameters Model

More information

Analysis of Variance. Source DF Squares Square F Value Pr > F. Model <.0001 Error Corrected Total

Analysis of Variance. Source DF Squares Square F Value Pr > F. Model <.0001 Error Corrected Total Math 221: Linear Regression and Prediction Intervals S. K. Hyde Chapter 23 (Moore, 5th Ed.) (Neter, Kutner, Nachsheim, and Wasserman) The Toluca Company manufactures refrigeration equipment as well as

More information

The General Linear Model. November 20, 2007

The General Linear Model. November 20, 2007 The General Linear Model. November 20, 2007 Multiple regression Data: The Faroese Mercury Study Simple linear regression Confounding The multiple linear regression model Interpretation of parameters Model

More information

Introduction to Regression

Introduction to Regression Introduction to Regression Using Mult Lin Regression Derived variables Many alternative models Which model to choose? Model Criticism Modelling Objective Model Details Data and Residuals Assumptions 1

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

The Steps to Follow in a Multiple Regression Analysis

The Steps to Follow in a Multiple Regression Analysis ABSTRACT The Steps to Follow in a Multiple Regression Analysis Theresa Hoang Diem Ngo, Warner Bros. Home Video, Burbank, CA A multiple regression analysis is the most powerful tool that is widely used,

More information

Multiple Linear Regression

Multiple Linear Regression Chapter 3 Multiple Linear Regression 3.1 Introduction Multiple linear regression is in some ways a relatively straightforward extension of simple linear regression that allows for more than one independent

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE

More information

5.3 Three-Stage Nested Design Example

5.3 Three-Stage Nested Design Example 5.3 Three-Stage Nested Design Example A researcher designs an experiment to study the of a metal alloy. A three-stage nested design was conducted that included Two alloy chemistry compositions. Three ovens

More information

Lecture 12 Robust Estimation

Lecture 12 Robust Estimation Lecture 12 Robust Estimation Prof. Dr. Svetlozar Rachev Institute for Statistics and Mathematical Economics University of Karlsruhe Financial Econometrics, Summer Semester 2007 Copyright These lecture-notes

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

Prediction of Bike Rental using Model Reuse Strategy

Prediction of Bike Rental using Model Reuse Strategy Prediction of Bike Rental using Model Reuse Strategy Arun Bala Subramaniyan and Rong Pan School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, USA. {bsarun, rong.pan}@asu.edu

More information

A Modified M-estimator for the Detection of Outliers

A Modified M-estimator for the Detection of Outliers A Modified M-estimator for the Detection of Outliers Asad Ali Department of Statistics, University of Peshawar NWFP, Pakistan Email: asad_yousafzay@yahoo.com Muhammad F. Qadir Department of Statistics,

More information

REGRESSION OUTLIERS AND INFLUENTIAL OBSERVATIONS USING FATHOM

REGRESSION OUTLIERS AND INFLUENTIAL OBSERVATIONS USING FATHOM REGRESSION OUTLIERS AND INFLUENTIAL OBSERVATIONS USING FATHOM Lindsey Bell lbell2@coastal.edu Keshav Jagannathan kjaganna@coastal.edu Department of Mathematics and Statistics Coastal Carolina University

More information

Chapter 12: Multiple Regression

Chapter 12: Multiple Regression Chapter 12: Multiple Regression 12.1 a. A scatterplot of the data is given here: Plot of Drug Potency versus Dose Level Potency 0 5 10 15 20 25 30 0 5 10 15 20 25 30 35 Dose Level b. ŷ = 8.667 + 0.575x

More information