STAT 4385 Topic 06: Model Diagnostics
STAT 4385 Topic 06: Model Diagnostics
Xiaogang Su, Ph.D.
Department of Mathematical Sciences
University of Texas at El Paso
Spring
Outline
- Several Types of Residuals: Raw, Standardized, Studentized Residuals; Jackknife Residuals
- Assumption Checking: Normality, Independence, Homoscedasticity, Linearity
- Outlier Detection: Outliers in X-Space, Outliers in Y-Space, Influential Points
- Multicollinearity
Model Diagnostics in General
Once a best model is selected, the next step is model diagnostics. Model diagnostics involves three specific tasks: checking the model assumptions; detecting outliers; and evaluating computational problems. Most methods are based on residuals of different types.
1987 Baseball Salary Data
We consider the 1987 baseball salary data, originally from the 1988 ASA (American Statistical Association) exposition competition. The data contain salary information for 263 major league hitters and 22 predictors, as listed in the following table. The response variable is the logarithm of salary. One research question of interest is: are baseball salaries based on performance?
Variable Description: 1987 Baseball Salary Data

X   Name    Description               X   Name     Description
1   bat86   times at bat in 86        12  rbcr     runs batted in during career
2   hit86   hits in 86                13  wlkcr    walks in career
3   hr86    home runs in 86           14  leag86   league in 86
4   run86   runs in 86                15  div86    division in 86
5   rb86    runs batted in in 86      16  team86   team in 86
6   wlk86   walks in 86               17  pos86    position in 86
7   yrs     years in major league     18  puto86   put outs in 86
8   batcr   times at bat in career    19  asst86   assists in 86
9   hitcr   hits in career            20  err86    errors in 86
10  hrcr    home runs in career       21  leag87   league in 87
11  runcr   runs during career        22  team87   team in 87
Hoaglin and Velleman's (HV, 1995) Best Model
This baseball data set has been widely studied, and various modeling analyses using different statistical methods have been tried. Hoaglin and Velleman (HV, 1995) provided a nice overview of these analyses, and they found that the following model yields a good fit and leads to sensible interpretations:

log(salary) = β_0 + β_1 (runcr/yrs) + β_2 run86 + β_3 min[(yrs - 2)_+, 5] + β_4 (yrs - 7)_+ + ε,

where ε ~ N(0, σ²), and the segmentation on years is based on a player's eligibility for arbitration or free agency.
Residuals: Raw, Standardized, Studentized Residuals
Several Types of Residuals
There are several types of residuals, listed below in ascending order of preference:
- The raw residual r_i = y_i - ŷ_i mimics the error term ε_i = y_i - µ_i.
- Motivated by the fact that ε_i/σ ~ N(0, 1), the standardized residual is defined as z_i = r_i/σ̂.
- Noting that var(r_i) = σ²(1 - h_ii), the studentized residual is defined as t_i = r_i / √(σ̂²(1 - h_ii)). If the model is true, then t_i ~ N(0, 1) approximately.
In the above definitions, h_ii is the i-th diagonal element of the hat matrix (or projection matrix) H = X(X^T X)^{-1} X^T. Recall that ŷ = Hy; h_ii is also called the leverage of the i-th observation.
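The three residual types above differ only in their scaling, and all three come directly out of the hat matrix. A minimal numpy sketch on simulated data (all variable names here are illustrative):

```python
# Raw, standardized, and studentized residuals for an OLS fit.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design with intercept
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix; yhat = H y
yhat = H @ y
r = y - yhat                                    # raw residuals
sigma2_hat = r @ r / (n - p - 1)                # MSE: p predictors plus intercept
z = r / np.sqrt(sigma2_hat)                     # standardized residuals
h = np.diag(H)                                  # leverages h_ii
t = r / np.sqrt(sigma2_hat * (1 - h))           # studentized residuals
```

Note that |t_i| ≥ |z_i| for every observation, since dividing by √(1 - h_ii) < 1 only inflates the residual.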
Residuals: Raw, Standardized, Studentized Residuals
Deleted Residuals
In order to achieve independence between y_i and its predicted value, the prediction of y_i is calculated from the data with the i-th observation omitted; the same idea is used in obtaining PRESS. The deleted residual is defined as

e(-i) = y_i - ŷ(-i) = r_i/(1 - h_ii),

where ŷ(-i) = x_i^T β̂(-i) denotes the predicted value for y_i from the least squares fit on the data leaving the i-th observation out, and β̂(-i) denotes the resulting LSE of β.
Residuals: Jackknife Residuals
Jackknife Residuals
The studentized deleted residual (also called the jackknife residual) is given by

r(-i) = r_i / √(σ̂²(-i) (1 - h_ii)) = t_i √[(n - p - 2)/(n - p - 1 - t_i²)],    (1)

where σ̂²(-i), the estimate of σ² based on the sample without the i-th observation, can be computed via

(n - p - 2) σ̂²(-i) = (n - p - 1) σ̂² - r_i²/(1 - h_ii).
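The two forms of (1) can be checked against each other numerically; a sketch on simulated data (names illustrative):

```python
# Jackknife residuals computed two ways: from t_i directly, and via the
# delete-one variance estimate sigma^2_(-i).
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
r = y - H @ y
s2 = r @ r / (n - p - 1)
t = r / np.sqrt(s2 * (1 - h))                   # studentized residuals

# Second form of (1): jackknife residual from t_i alone
r_jack = t * np.sqrt((n - p - 2) / (n - p - 1 - t**2))

# First form of (1): via the delete-one variance estimate
s2_del = ((n - p - 1) * s2 - r**2 / (1 - h)) / (n - p - 2)
r_jack2 = r / np.sqrt(s2_del * (1 - h))
```

The second form is the practical one: it needs only one fit of the full model, not n delete-one refits.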
Residuals: Jackknife Residuals
1987 Baseball Data: Residuals from the HV Model
[Table: original data (y, x_1, ..., x_4), fitted values ŷ_i, and residuals e_i, z_i, t_i, r(-i) for selected observations.]
Residuals: Jackknife Residuals
Facts on Jackknife Residuals
The jackknife residual r(-i) is the most preferable residual for model diagnostics. Since σ̂²(-i) in (1) is independent of β̂(-i), and hence of r(-i)'s numerator, it can be verified that r(-i) ~ t(n - p - 2) exactly if the model assumptions are correct. Moreover, the jackknife residuals can be easily computed using the second formula in (1).
Residuals: Jackknife Residuals
Plot of Jackknife Residuals vs. Fitted Values
It is common practice to plot r(-i) versus the predicted values ŷ_i. Since r ⊥ ŷ, r(-i) and ŷ_i are independent of each other. If the model assumptions are valid, the jackknife residuals are expected to scatter randomly around the horizontal line y = 0. On the other hand, any systematic nonrandom pattern in the jackknife residuals may indicate a violation of the assumptions in one way or another.
Residuals: Jackknife Residuals
Hypothetical Plots of r(-i) vs. ŷ_i
[Figure: two hypothetical plots of r(-i) against ŷ; panel (a) shows random scatter around zero, panel (b) shows variance increasing with the fitted value.]
Assumption Checking: Normality
Checking Normality
Note that r(-i) ~ t(n - p - 2), which is approximated by N(0, 1) when n ≫ p. The informal histogram or quantile-quantile (Q-Q) plot of the r(-i)'s can be used to examine the normality assumption. Various formal goodness-of-fit tests, such as Pearson's χ² test, the Shapiro-Wilk (1965) test, or the Kolmogorov-Smirnov test, can be used to formally test for normality.
Assumption Checking: Normality
Normality Based on r(-i)
[Figure: (a) histogram of the jackknife residuals; (b) normal Q-Q plot of the jackknife residuals (sample quantiles vs. theoretical quantiles).]
Assumption Checking: Independence
Checking Independence
Examining the assumption of independence among the errors (or response observations) is not an easy task, and only a few limited tests are available. However, the plausibility of independence can usually be inspected from the experimental design or the way the data were collected. One common violation of independence occurs when observations are taken in time order and hence exhibit serial correlation. Graphically, the plot of r(-i) versus the sequence order i (or the lag plot of the residuals) can be used to examine dependence among the errors. Furthermore, the runs test (Wald and Wolfowitz, 1940) provides a rough check for randomness, while the Durbin-Watson (1950; 1951) statistic and the autocorrelation function (ACF) can be used to detect autocorrelation.
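As one concrete check, the Durbin-Watson statistic has the simple closed form DW = Σ_t (e_t - e_{t-1})² / Σ_t e_t², with values near 2 consistent with uncorrelated errors and values near 0 indicating positive first-order autocorrelation. A minimal sketch on simulated error sequences (names illustrative):

```python
# Durbin-Watson statistic for a residual sequence.
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences over sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(2)
e_indep = rng.normal(size=500)                  # independent errors: DW near 2
e_ar = np.empty(500)                            # AR(1) errors with rho = 0.8
e_ar[0] = rng.normal()
for i in range(1, 500):
    e_ar[i] = 0.8 * e_ar[i - 1] + rng.normal()
```

Roughly, DW ≈ 2(1 - ρ̂), where ρ̂ is the lag-1 sample autocorrelation, so the AR(1) sequence here gives a value well below 2.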
Assumption Checking: Homoscedasticity
Checking Homoscedasticity
The assumption of homoscedasticity, or equal variances, can be inspected from the residual plot. For example, panel (b) of the hypothetical plots of r(-i) vs. ŷ_i illustrates one scenario typically encountered with financial price data, where the error variance increases with the predicted value. It is interesting to note that the LSE remains unbiased under unequal error variances but is no longer BLUE. Formal tests for constant error variance include White's (1980) test, Cook and Weisberg's (1983) score test, and several others, all checking whether the variability in e_i or e_i² can be accounted for by regressing it on the predictors X (or on the estimated mean response ŷ). Another natural approach is to incorporate the error variance function explicitly in the model and then check whether it reduces to a constant variance.
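The score-type tests mentioned above share one mechanic: regress a function of the residuals (here e_i²) on the predictors and ask whether it explains any variability. A minimal Breusch-Pagan-style sketch, using the statistic n·R² from regressing e_i² on X (data and names illustrative, not the exact Cook-Weisberg formulation):

```python
# Score-type check for heteroscedasticity: regress squared residuals on X
# and use n * R^2 as the test statistic (approx. chi-square under H0).
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit of y on X (X includes an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
y = 1 + 2 * x + rng.normal(size=n) * x          # error sd grows with x

e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
lm = n * r_squared(X, e ** 2)                   # large values suggest heteroscedasticity
```

With the error standard deviation proportional to x, the squared residuals trend strongly with x and the statistic comes out far above typical chi-square critical values.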
Assumption Checking: Homoscedasticity
Jackknife Residual Plot: 1987 Baseball Salary Data
[Figure: jackknife residuals vs. fitted values for the 1987 baseball data.]
Assumption Checking: Linearity
Checking Linearity
Inadequacy of linearity (i.e., linearity in the regression parameters) can be a serious problem. While the residual plot provides useful diagnostic information for this problem, it does not generally supply any clues as to the true functional form. To this end, partial residual plots have been recommended. The i-th partial residual for X_j is defined as

r_i^(j) = y_i - (β̂_0 + β̂_1 x_i1 + ··· + β̂_{j-1} x_i(j-1) + β̂_{j+1} x_i(j+1) + ··· + β̂_p x_ip)
        = y_i - (x_i^T β̂ - β̂_j x_ij)
        = r_i + β̂_j x_ij,    for j = 1, ..., p.
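Computing r_i^(j) is a one-liner once the full fit is in hand. A useful sanity check, sketched below on simulated data (names illustrative), is that regressing the partial residuals on x_j returns exactly the full-model coefficient β̂_j, since the raw residuals are orthogonal to every column of X:

```python
# Partial residuals r_i^(j) = r_i + beta_j * x_ij for one predictor.
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta

j = 1                                           # partial residuals for predictor x_1
pr = r + beta[j] * X[:, j]

slope = np.polyfit(X[:, j], pr, 1)[0]           # slope of the partial residual plot
```

The plot of `pr` against `X[:, j]` is exactly the partial residual plot described above, with least squares slope equal to β̂_j.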
Assumption Checking: Linearity
Partial Residual Plots
The plot of r_i^(j) versus x_ij provides a pictorial exploration of the appropriate functional form for an individual predictor X_j after accounting for the other predictors. The figure on the next page gives three examples that reflect different diagnostic interpretations regarding the functional form of X_j: (a) X_j might not be needed in the current model; (b) X_j should be included in linear form; (c) a curvilinear form of X_j is needed.
Assumption Checking: Linearity
Hypothetical Partial Residual Plots
[Figure: three hypothetical plots of r^(j) vs. x_j, panels (a)-(c).]
Assumption Checking: Linearity
Example: Partial Residual Plots with 1987 Baseball Data
[Figure: residuals and partial residuals plotted against hits during career.]
Assumption Checking: Linearity
Partial Regression Plots
Another similar tool, the partial regression leverage plot (i.e., the added-variable plot), plots the residuals from the linear model regressing Y on the predictors without X_j against the residuals from the linear model regressing X_j on the other predictors. The partial regression plot can be interpreted in the same manner as the partial residual plot.
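The added-variable plot has a well-known property (the Frisch-Waugh-Lovell result): the least squares slope through the plotted points equals the full-model coefficient β̂_j. A sketch on simulated data (names illustrative):

```python
# Added-variable (partial regression) plot coordinates, and a check that
# the slope of the plot equals the full-model coefficient beta_j.
import numpy as np

def resid(X, y):
    """Residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(5)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -2.0, 0.8]) + rng.normal(size=n)

j = 2
X_others = np.delete(X, j, axis=1)
ey = resid(X_others, y)                         # y adjusted for the other predictors
ex = resid(X_others, X[:, j])                   # x_j adjusted for the other predictors

slope = (ex @ ey) / (ex @ ex)                   # slope of the added-variable plot
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Plotting `ey` against `ex` gives the added-variable plot itself; curvature or stray points in it flag the same problems the partial residual plot would.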
Outlier Detection
Outlier Detection
From the perspective of sensitivity analysis, variable selection is concerned with the influence of each column of X on model estimation, while outlier detection is concerned with the influence of each row of the data. In the regression setting, an observation (row) could be an outlier in three main ways: an outlier in the X-space; an outlier in the Y-space; or an influential point that affects the estimation of β̂ and model prediction.
Outlier Detection: Outlier in X-Space
Outlier in X-Space
An observation is said to have high leverage if it is an outlier in terms of its predictor value x_i. This can be assessed by the leverage h_ii, which is closely related to the Mahalanobis distance from each x_i to the center x̄ = Σ_{i=1}^n x_i/n. Points with h_ii > 2(p + 1)/n are often considered outliers in the x-space, recalling that the average leverage is h̄ = (p + 1)/n.
Outlier Detection: Outlier in X-Space
Properties of Leverage h_ii
Relation to the Mahalanobis distance from x̄: let S_X = Σ_{i=1}^n (x_i - x̄)(x_i - x̄)^T/(n - 1) denote the sample variance-covariance matrix of the x_i. Then the squared Mahalanobis distance is d_i² = (x_i - x̄)^T S_X^{-1} (x_i - x̄), and it can be shown that

h_ii = 1/n + d_i²/(n - 1).

An observation with high leverage h_ii is one that is distant from the center of the points in the X-space. The value of h_ii ranges from 1/n to 1, with average (p + 1)/n, since

tr(H) = Σ_{i=1}^n h_ii = tr{X(X^T X)^{-1} X^T} = tr{(X^T X)^{-1} X^T X} = tr(I_{p+1}) = p + 1.
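Both the Mahalanobis identity and the trace identity are easy to verify numerically; a sketch with simulated predictors (names illustrative):

```python
# Leverages from the hat matrix vs. leverages from the Mahalanobis distance,
# plus the trace identity tr(H) = p + 1.
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 3
Z = rng.normal(size=(n, p))                     # predictors without intercept
X = np.column_stack([np.ones(n), Z])

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages h_ii

xbar = Z.mean(axis=0)
S = np.cov(Z, rowvar=False)                     # (n-1)-denominator covariance
d2 = np.array([(z - xbar) @ np.linalg.solve(S, z - xbar) for z in Z])
h_from_d2 = 1 / n + d2 / (n - 1)

flags = np.flatnonzero(h > 2 * (p + 1) / n)     # candidate x-space outliers
```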
Outlier Detection: Outlier in Y-Space
Outlier in Y-Space
A response observation y_i is identified as an outlier if it is sufficiently different from its predicted value. The jackknife residual r(-i) is recommended for this assessment. Since r(-i) ~ t(n - p - 2), the 2.5th and 97.5th percentiles of t(n - p - 2) may be used as benchmarks, albeit at the risk of multiplicity.
Outlier Detection: Influential Points
Influential Points
An observation is said to be an influential point if its removal or inclusion causes a dramatic change in the model estimates or predictions. The delete-one jackknife technique is the natural approach to this issue. Many measures have been developed, depending on the specific aspect to be examined.
Outlier Detection: Influential Points
DFBETA
DFBETA examines the influence of each observation on each β̂_j:

DFBETA_ij = (β̂_j - β̂_j(-i)) / √(σ̂²(-i) (X^T X)^{-1}_jj),

where β̂_j(-i) denotes the LSE of β_j without the i-th observation and (X^T X)^{-1}_jj is the j-th diagonal element of the matrix (X^T X)^{-1}.
Outlier Detection: Influential Points
DFFITS
DFFITS examines the influence of each observation on its own fitted value:

DFFITS_i = (ŷ_i - ŷ(-i)) / √(σ̂²(-i) h_ii) = r(-i) √(h_ii/(1 - h_ii)).
Outlier Detection: Influential Points
Cook's Distance
The ultimate measure for detecting influential points is Cook's distance (Cook, 1977),

D_i = (β̂ - β̂(-i))^T X^T X (β̂ - β̂(-i)) / ((p + 1) σ̂²)
    = ‖ŷ - ŷ(-i)‖² / ((p + 1) σ̂²)
    = (t_i²/(p + 1)) · h_ii/(1 - h_ii),

where t_i is the studentized residual. Muller and Mok (1997) studied the distribution of D_i and provided some critical values. However, the multiplicity issue remains a concern when using these critical values for outlier detection in practice. For the sake of simplicity, one may use the benchmark of 1.0 to help identify potential outliers (see Weisberg, 2005).
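The closed-form expression avoids refitting the model n times, yet agrees exactly with the brute-force delete-one definition. A sketch comparing the two on simulated data (names illustrative):

```python
# Cook's distance: closed form via t_i and h_ii vs. explicit delete-one refits.
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)     # leverages h_ii
r = y - X @ beta
s2 = r @ r / (n - p - 1)
t = r / np.sqrt(s2 * (1 - h))                   # studentized residuals

# Closed form: D_i = t_i^2 / (p + 1) * h_ii / (1 - h_ii)
D = t**2 / (p + 1) * h / (1 - h)

# Brute force: refit without each observation and use the definition of D_i
D_brute = np.empty(n)
for i in range(n):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    bi = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    diff = beta - bi
    D_brute[i] = diff @ (X.T @ X) @ diff / ((p + 1) * s2)
```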
Outlier Detection: Influential Points
HV Model with 1987 Baseball Data: Potential Outliers
[Table: DFBETAs (intercept and x_1-x_4), DFFITS, Cook's distance, and leverage for the flagged observations.]
Outlier Detection: Influential Points
Example: Outliers with 1987 Baseball Data
The figure on the next page provides a bubble plot of the three diagnostic measures r(-i), h_ii, and D_i, where the size of each bubble corresponds to Cook's distance D_i. Twenty-four potential outliers are found: eleven outliers in the x-space, detected via the benchmark 2(p + 1)/n = 0.038; eleven outliers in the y-space, identified by the benchmarks t(0.025, n - p - 2) = -1.969 and t(0.975, n - p - 2) = 1.969; and two outliers (observations 92 and 252) in both the x-space and the y-space. In addition, Cook's distance indicates that observation 252 has a large influence on the regression parameter estimates, as determined by either the benchmark of 1 or Muller and Mok's critical value.
Outlier Detection: Influential Points
Example: Bubble Plot for Outlier Detection
[Figure: bubble plot of r(-i) vs. h_ii with reference lines at 2(p + 1)/n and the t(n - p - 2) percentiles; bubble size proportional to Cook's distance.]
Multicollinearity
Multicollinearity
One common numerical issue in linear regression is multicollinearity (or collinearity), which occurs when two or more predictors in the linear model are highly correlated with each other. When multicollinearity occurs, the standard errors (SE) of some of the β̂_j can be unreasonably large, leading to difficulty in model interpretation. Multicollinearity may not be a big concern, however, when the analytic goal is prediction.
Multicollinearity
Large SE
To see why multicollinearity leads to large SEs, a closer look reveals that

SE(β̂_j) = (s_y/s_j) √[(1 - R²_{Y·X}) / ((1 - R²_{Xj·X(-j)})(n - p - 1))],    (2)

where s_y and s_j are the sample standard deviations of y and x_j, respectively; R²_{Y·X} denotes the R² obtained by regressing Y on X; and R²_{Xj·X(-j)} denotes the R² from regressing the j-th predictor X_j on the remaining predictors X(-j). If X_j can be expressed as a linear combination of the other predictors, then R²_{Xj·X(-j)} = 1 and SE(β̂_j) in (2) is infinite.
Multicollinearity
Assessing Multicollinearity - Method I
The first method for detecting multicollinearity considers the spectral decomposition of X^T X. Let λ_1 ≥ λ_2 ≥ ··· ≥ λ_p denote the eigenvalues of X^T X. If X^T X is not positive definite, some of its eigenvalues are zero. If the condition number, defined as λ_1/λ_p, is very large, then multicollinearity may be present. An informal rule of thumb is that a condition number of 15 or more indicates a concern, and one greater than 30 a very serious concern. Belsley, Kuh, and Welsch (1980), in their book Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, suggest 10 and 100 as the points at which collinearity begins to affect, and seriously affects, the estimates.
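A sketch of the eigenvalue-based check on simulated data (names illustrative; note that some software instead reports the square root of this ratio, computed after scaling the columns):

```python
# Condition number of X^T X: ratio of largest to smallest eigenvalue.
import numpy as np

rng = np.random.default_rng(8)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)             # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

eigvals = np.linalg.eigvalsh(X.T @ X)           # eigenvalues in ascending order
cond = eigvals[-1] / eigvals[0]                 # lambda_1 / lambda_p
```

With x2 nearly a copy of x1, the smallest eigenvalue is close to zero and the condition number is enormous, far beyond the rule-of-thumb thresholds above.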
Multicollinearity
Assessing Multicollinearity - Method II
Another measure for detecting collinearity is the variance inflation factor (VIF),

VIF_j = 1/(1 - R²_{Xj·X(-j)}),    for j = 1, ..., p.

In practice, a maximum VIF in excess of 10 is often taken as an indication of multicollinearity.
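A direct implementation of this definition, with a cross-check against the classical identity that VIF_j equals the j-th diagonal entry of the inverse correlation matrix of the predictors (simulated data; names illustrative):

```python
# VIF_j = 1 / (1 - R^2 of x_j regressed on the other predictors).
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of X (no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(9)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # corr ~ 0.95 with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

v = vif(X)
Rx = np.corrcoef(X, rowvar=False)
v_alt = np.diag(np.linalg.inv(Rx))              # same VIFs via the correlation matrix
```

Here x1 and x2 carry VIFs near 1/(1 - 0.95²) ≈ 10, while the unrelated x3 stays near 1.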
Multicollinearity
Why VIF?
The name VIF comes from the following observation. Suppose we are working with normalized or standardized data, in which case X^T X becomes the correlation matrix R_X among the predictors. From cov(β̂) = σ² R_X^{-1}, it can be shown that var(β̂_j) = σ² VIF_j. If the columns of X were independent, then R_X = I and hence var(β̂_j) = σ². Therefore, VIF_j shows how much var(β̂_j) is inflated by the multicollinearity between X_j and the remaining predictors, compared with the independent case.
Multicollinearity
Discussion
Thanks! Questions?
More informationDiagnostics and Remedial Measures: An Overview
Diagnostics and Remedial Measures: An Overview Residuals Model diagnostics Graphical techniques Hypothesis testing Remedial measures Transformation Later: more about all this for multiple regression W.
More informationWeighted Least Squares
Weighted Least Squares The standard linear model assumes that Var(ε i ) = σ 2 for i = 1,..., n. As we have seen, however, there are instances where Var(Y X = x i ) = Var(ε i ) = σ2 w i. Here w 1,..., w
More informationMulticollinearity and A Ridge Parameter Estimation Approach
Journal of Modern Applied Statistical Methods Volume 15 Issue Article 5 11-1-016 Multicollinearity and A Ridge Parameter Estimation Approach Ghadban Khalaf King Khalid University, albadran50@yahoo.com
More informationK. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =
K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing
More informationUNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017
UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Tuesday, January 17, 2017 Work all problems 60 points are needed to pass at the Masters Level and 75
More informationAny of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.
STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed
More informationRemedial Measures for Multiple Linear Regression Models
Remedial Measures for Multiple Linear Regression Models Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 1 / 25 Outline
More informationMath 5305 Notes. Diagnostics and Remedial Measures. Jesse Crawford. Department of Mathematics Tarleton State University
Math 5305 Notes Diagnostics and Remedial Measures Jesse Crawford Department of Mathematics Tarleton State University (Tarleton State University) Diagnostics and Remedial Measures 1 / 44 Model Assumptions
More informationRegression Analysis for Data Containing Outliers and High Leverage Points
Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain
More information9 Correlation and Regression
9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the
More informationLAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION
LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of
More informationCircle a single answer for each multiple choice question. Your choice should be made clearly.
TEST #1 STA 4853 March 4, 215 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 31 questions. Circle
More information14 Multiple Linear Regression
B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in
More informationholding all other predictors constant
Multiple Regression Numeric Response variable (y) p Numeric predictor variables (p < n) Model: Y = b 0 + b 1 x 1 + + b p x p + e Partial Regression Coefficients: b i effect (on the mean response) of increasing
More informationMultiple Linear Regression
Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach
More informationAnalysing data: regression and correlation S6 and S7
Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association
More informationRegression Analysis By Example
Regression Analysis By Example Third Edition SAMPRIT CHATTERJEE New York University ALI S. HADI Cornell University BERTRAM PRICE Price Associates, Inc. A Wiley-Interscience Publication JOHN WILEY & SONS,
More informationSimple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation.
Statistical Computation Math 475 Jimin Ding Department of Mathematics Washington University in St. Louis www.math.wustl.edu/ jmding/math475/index.html October 10, 2013 Ridge Part IV October 10, 2013 1
More informationOutline. Review regression diagnostics Remedial measures Weighted regression Ridge regression Robust regression Bootstrapping
Topic 19: Remedies Outline Review regression diagnostics Remedial measures Weighted regression Ridge regression Robust regression Bootstrapping Regression Diagnostics Summary Check normality of the residuals
More information10. Time series regression and forecasting
10. Time series regression and forecasting Key feature of this section: Analysis of data on a single entity observed at multiple points in time (time series data) Typical research questions: What is the
More informationPhd Program in Transportation. Transport Demand Modeling. MODULE 2 Multiple Linear Regression
Phd Program in Transportation Transport Demand Modeling Filipe Moura MODULE 2 Multiple Linear Regression Phd in Transportation / Transport Demand Modelling 1 Outline 1. Learning objectives 2. What is MR
More informationPsychology Seminar Psych 406 Dr. Jeffrey Leitzel
Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Structural Equation Modeling Topic 1: Correlation / Linear Regression Outline/Overview Correlations (r, pr, sr) Linear regression Multiple regression interpreting
More informationDiagnostic Procedures
Diagnostic Procedures Joseph W. McKean Western Michigan University Simon J. Sheather Texas A&M University Abstract Diagnostic procedures are used to check the quality of a fit of a model, to verify the
More informationPolynomial Regression
Polynomial Regression Summary... 1 Analysis Summary... 3 Plot of Fitted Model... 4 Analysis Options... 6 Conditional Sums of Squares... 7 Lack-of-Fit Test... 7 Observed versus Predicted... 8 Residual Plots...
More informationPackage svydiags. June 4, 2015
Type Package Package svydiags June 4, 2015 Title Linear Regression Model Diagnostics for Survey Data Version 0.1 Date 2015-01-21 Author Richard Valliant Maintainer Richard Valliant Description
More information4 Multiple Linear Regression
4 Multiple Linear Regression 4. The Model Definition 4.. random variable Y fits a Multiple Linear Regression Model, iff there exist β, β,..., β k R so that for all (x, x 2,..., x k ) R k where ε N (, σ
More informationRegression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics
Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns
More informationRegression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics
Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns
More informationThe Steps to Follow in a Multiple Regression Analysis
ABSTRACT The Steps to Follow in a Multiple Regression Analysis Theresa Hoang Diem Ngo, Warner Bros. Home Video, Burbank, CA A multiple regression analysis is the most powerful tool that is widely used,
More informationMultiple Regression Analysis. Part III. Multiple Regression Analysis
Part III Multiple Regression Analysis As of Sep 26, 2017 1 Multiple Regression Analysis Estimation Matrix form Goodness-of-Fit R-square Adjusted R-square Expected values of the OLS estimators Irrelevant
More informationAssumptions of the error term, assumptions of the independent variables
Petra Petrovics, Renáta Géczi-Papp Assumptions of the error term, assumptions of the independent variables 6 th seminar Multiple linear regression model Linear relationship between x 1, x 2,, x p and y
More informationDetection of single influential points in OLS regression model building
Analytica Chimica Acta 439 (2001) 169 191 Tutorial Detection of single influential points in OLS regression model building Milan Meloun a,,jiří Militký b a Department of Analytical Chemistry, Faculty of
More informationChapter 12: Multiple Regression
Chapter 12: Multiple Regression 12.1 a. A scatterplot of the data is given here: Plot of Drug Potency versus Dose Level Potency 0 5 10 15 20 25 30 0 5 10 15 20 25 30 35 Dose Level b. ŷ = 8.667 + 0.575x
More informationSTAT5044: Regression and Anova. Inyoung Kim
STAT5044: Regression and Anova Inyoung Kim 2 / 51 Outline 1 Matrix Expression 2 Linear and quadratic forms 3 Properties of quadratic form 4 Properties of estimates 5 Distributional properties 3 / 51 Matrix
More informationIntroduction The framework Bias and variance Approximate computation of leverage Empirical evaluation Discussion of sampling approach in big data
Discussion of sampling approach in big data Big data discussion group at MSCS of UIC Outline 1 Introduction 2 The framework 3 Bias and variance 4 Approximate computation of leverage 5 Empirical evaluation
More informationModule 6: Model Diagnostics
St@tmaster 02429/MIXED LINEAR MODELS PREPARED BY THE STATISTICS GROUPS AT IMM, DTU AND KU-LIFE Module 6: Model Diagnostics 6.1 Introduction............................... 1 6.2 Linear model diagnostics........................
More informationAvailable online at (Elixir International Journal) Statistics. Elixir Statistics 49 (2012)
10108 Available online at www.elixirpublishers.com (Elixir International Journal) Statistics Elixir Statistics 49 (2012) 10108-10112 The detention and correction of multicollinearity effects in a multiple
More information