Remedial Measures, Brown-Forsythe test, F test


Remedial Measures, Brown-Forsythe Test, F Test
Dr. Frank Wood, fwood@stat.columbia.edu
Linear Regression Models, Lecture 7

Remedial Measures
How do we know that the regression function explains the observed data well?
- Plotting
- Tests
What if it does not? What can we do about it?
- Transformation of variables (next lecture)

Graphical Diagnostics for the Predictor Variable
- Dot plot: useful for visualizing the distribution of inputs
- Sequence plot: useful for visualizing dependencies between error terms
- Box plot: useful for visualizing the distribution of inputs
Toluca manufacturing example again: production time vs. lot size.

Dot Plot
- How many observations per input value?
- What is the range of inputs?
(figure)

Sequence Plot
If observations are made over time, is there a correlation between the input and its position in the observation sequence?
(figure)

Box Plot
Shows:
- Median
- 1st and 3rd quartiles
- Maximum and minimum
(figure)

Residuals
Remember the definition of a residual:
$e_i = Y_i - \hat{Y}_i$
and the difference between that and the unknown true error:
$\epsilon_i = Y_i - E\{Y_i\}$
In a normal regression model the $\epsilon_i$ are assumed to be iid $N(0, \sigma^2)$ random variables. The observed residuals $e_i$ should reflect these properties.

Residual Properties
Remember:
- Mean: $\bar{e} = \frac{\sum e_i}{n} = 0$
- Variance: $s^2 = \frac{\sum (e_i - \bar{e})^2}{n - 2} = \frac{SSE}{n - 2} = MSE$

Nonindependence of Residuals
The residuals $e_i$ are not independent random variables: the fitted values $\hat{Y}_i$ are all based on the same fitted regression line. The residuals are subject to two constraints:
- The sum of the $e_i$ equals 0: $\sum e_i = 0$
- The sum of the products $X_i e_i$ equals 0: $\sum X_i e_i = 0$
When the sample size is large in comparison to the number of parameters in the regression model, the dependency among the residuals $e_i$ can reasonably safely be ignored.

Definition: Semistudentized Residuals
As usual, since the standard deviation of $\epsilon_i$ is $\sigma$ (itself estimated by $\sqrt{MSE}$), a natural standardization to consider is
$e_i^* = \frac{e_i - 0}{\sqrt{MSE}}$
This is called a semistudentized residual.
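To make the definitions above concrete, here is a minimal sketch in Python (not from the lecture; the toy data, coefficients, and noise level are invented for illustration) that fits a simple linear regression by least squares and computes the residuals, MSE, and semistudentized residuals:

```python
import numpy as np

# Toy data loosely in the spirit of the Toluca example (invented values).
rng = np.random.default_rng(0)
x = np.linspace(20, 120, 25)                     # lot sizes
y = 62.0 + 3.57 * x + rng.normal(0, 48, x.size)  # work hours plus noise

n = x.size
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x             # fitted values
e = y - y_hat                   # residuals e_i = Y_i - Yhat_i
mse = np.sum(e ** 2) / (n - 2)  # SSE / (n - 2)
e_star = e / np.sqrt(mse)       # semistudentized residuals

print(f"mean of residuals: {e.mean():.2e} (zero up to rounding)")
print(f"MSE: {mse:.1f}")
```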

Departures from Model to Be Studied by Residuals
- Regression function not linear
- Error terms do not have constant variance
- Error terms are not independent
- Model fits all but one or a few outlier observations
- Error terms are not normally distributed
- One or more predictor variables have been omitted from the model

Diagnostics for Residuals
- Plot of residuals against predictor variable
- Plot of absolute or squared residuals against predictor variable
- Plot of residuals against fitted values
- Plot of residuals against time or other sequence
- Plots of residuals against omitted predictor variables
- Box plot of residuals
- Normal probability plot of residuals

Diagnostic Residual Plots (figure)

Scatter and Residual Plot (figure)

Prototype Residual Plots (figure)

Nonconstancy of Error Variance (figure)

Presence of Outliers
Outliers can strongly affect the fitted values of the regression line.

Outlier Effect on Residuals (figure)

Nonindependence of Error Terms
Sequential observations (figure)

Non-normality of Error Terms
- Distribution plots
- Comparison of frequencies
- Normal probability plot

Normal Probability Plot
For a normal error model with variance estimated by MSE, a good approximation of the expected value of the $k$-th smallest observation in a random sample of size $n$ is
$\sqrt{MSE}\,\left[ z\!\left( \frac{k - 0.375}{n + 0.25} \right) \right]$
where $z(\cdot)$ denotes the inverse CDF of the standard normal distribution. A normal probability plot consists of plotting the expected value of the $k$-th smallest observation against the observed $k$-th smallest observation.
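A minimal sketch of this construction, assuming residuals `e` and `mse` computed as in the earlier sketch, with scipy's `norm.ppf` playing the role of $z(\cdot)$:

```python
import numpy as np
from scipy.stats import norm

def expected_order_stats(e, mse):
    """Approximate expected values of the ordered residuals under normality."""
    n = e.size
    k = np.arange(1, n + 1)
    return np.sqrt(mse) * norm.ppf((k - 0.375) / (n + 0.25))

# Normal probability plot: sorted observed residuals vs. expected values.
# Points close to a straight line support the normality assumption.
# import matplotlib.pyplot as plt
# plt.scatter(expected_order_stats(e, mse), np.sort(e)); plt.show()
```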

Omission of Important Predictor Variables
Example: a qualitative variable such as type of machine.
- Partitioning the data can reveal dependence on omitted variable(s)
- Works for quantitative variables as well
- Can suggest that inclusion of other inputs is important

Tests Involving Residuals
- Tests for randomness
- Tests for constancy of variance
- Tests for outliers
- Tests for normality

Correlation Test for Normality
Calculate the coefficient of correlation $r$ (equivalently, $\sqrt{R^2}$ from regressing the ordered residuals on their expected values) between the residuals $e_i$ and their expected values under normality. Table B.6 in the book gives critical values of $r$ under the null hypothesis that the errors are normally distributed.
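A sketch of the test statistic computation (the Table B.6 critical value itself must be looked up in the book and is not reproduced here):

```python
import numpy as np
from scipy.stats import norm

def normality_corr(e, mse):
    """Correlation between ordered residuals and their expected values under normality."""
    n = e.size
    k = np.arange(1, n + 1)
    expected = np.sqrt(mse) * norm.ppf((k - 0.375) / (n + 0.25))
    return np.corrcoef(np.sort(e), expected)[0, 1]

# Conclude non-normality if r falls below the Table B.6 critical value
# for the chosen significance level and sample size n.
```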

Tests for Constancy of Error Variance
The Brown-Forsythe test does not depend on normality of the error terms. It is applicable to simple linear regression when:
- The variance of the error terms either increases or decreases with X
- The sample size is large enough to ignore dependencies between the residuals
It is basically a t-test for whether the means of two normally distributed populations are the same.

Brown-Forsythe Test
- Divide X into $X_1$ (the low values of X) and $X_2$ (the high values of X)
- Let $e_{i1}$ be the residuals for $X_1$, and $e_{i2}$ those for $X_2$
- Let $n = n_1 + n_2$
The Brown-Forsythe test uses the absolute deviations of the residuals around their group median, e.g.
$d_{i1} = |e_{i1} - \tilde{e}_1|$

Brown-Forsythe Test
The test statistic for comparing the means of the absolute deviations of the residuals around the group medians is
$t_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
where
$s^2 = \frac{\sum (d_{i1} - \bar{d}_1)^2 + \sum (d_{i2} - \bar{d}_2)^2}{n - 2}$

Brown-Forsythe Test
If $n_1$ and $n_2$ are not extremely small, then approximately
$t_{BF} \sim t(n - 2)$
From this, confidence intervals and tests can be constructed.
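The last three slides translate directly into code. A minimal sketch follows; splitting X at its median into low and high groups is one reasonable choice, since the slides leave the split point unspecified:

```python
import numpy as np
from scipy.stats import t

def brown_forsythe(x, e):
    """Brown-Forsythe test on residuals e, splitting at the median of x."""
    low = x <= np.median(x)
    e1, e2 = e[low], e[~low]
    # Absolute deviations of residuals around their group medians.
    d1 = np.abs(e1 - np.median(e1))
    d2 = np.abs(e2 - np.median(e2))
    n1, n2 = d1.size, d2.size
    s2 = (np.sum((d1 - d1.mean()) ** 2) +
          np.sum((d2 - d2.mean()) ** 2)) / (n1 + n2 - 2)
    t_bf = (d1.mean() - d2.mean()) / np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
    p_value = 2 * t.sf(abs(t_bf), df=n1 + n2 - 2)  # approximately t(n - 2)
    return t_bf, p_value
```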

F Test for Lack of Fit
A formal test for determining whether a specific type of regression function adequately fits the data.
Assumptions (the usual ones): the observations Y given X are independent, normally distributed, and have the same variance $\sigma^2$.
Requires: repeat observations at one or more X levels (called replicates).

Example
- 12 similar branches of a bank offered gifts for setting up money market accounts
- Minimum initial deposits were specified to qualify for the gift
- The value of the gift was proportional to the specified minimum deposit
- Interested in: the relationship between the specified minimum deposit and the number of new accounts opened

F Test Example: Data and ANOVA Table (figure)

Fit (figure)

Data Arranged to Highlight Replicates
The observed value of the response variable for the $i$-th replicate at the $j$-th level of X is $Y_{ij}$. The mean of the Y observations at the level $X = X_j$ is $\bar{Y}_j$.

Full Model vs. Regression Model
The full model is
$Y_{ij} = \mu_j + \epsilon_{ij}$
where the $\mu_j$ are parameters, $j = 1, \ldots, c$, and the $\epsilon_{ij}$ are iid $N(0, \sigma^2)$.
Since the error terms have expectation zero,
$E\{Y_{ij}\} = \mu_j$

Full Model
In the full model there is a different mean (a free parameter) for each level $X_j$. In the regression model the mean responses are constrained to lie on a line:
$E\{Y_{ij}\} = \beta_0 + \beta_1 X_j$

Fitting the Full Model
The estimators of $\mu_j$ are simply the cell means:
$\hat{\mu}_j = \bar{Y}_j$
The error sum of squares for the full model therefore is
$SSE(F) = \sum_j \sum_i (Y_{ij} - \bar{Y}_j)^2 = SSPE$
(the pure error sum of squares).

Degrees of Freedom
An ordinary total sum of squares has $n - 1$ degrees of freedom. Each of the $c$ per-level terms in SSPE is an ordinary total sum of squares, so each has $n_j - 1$ degrees of freedom. The number of degrees of freedom of SSPE is the sum of the component degrees of freedom:
$df_F = \sum_j (n_j - 1) = n - c$

General Linear Test
Remember: the general linear test proposes a reduced model as the null hypothesis; this will be our normal regression model. The full model will be as described (one independent mean for each level of X).

SSE for Reduced Model
The SSE for the reduced model is as before,
$SSE(R) = \sum_j \sum_i (Y_{ij} - \hat{Y}_{ij})^2$, where $\hat{Y}_{ij} = b_0 + b_1 X_j$,
and has $n - 2$ degrees of freedom.

SSE(R) (figure)

F Test Statistic
From the general linear test approach, a little algebra yields the lack-of-fit F statistic:
$F^* = \frac{SSE(R) - SSE(F)}{(n - 2) - (n - c)} \div \frac{SSE(F)}{n - c} = \frac{SSLF}{c - 2} \div \frac{SSPE}{n - c} = \frac{MSLF}{MSPE}$
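A minimal sketch of the whole lack-of-fit computation. Here `x` must contain repeat values; the data would come from something like the bank example, which is not reproduced here:

```python
import numpy as np
from scipy.stats import f

def lack_of_fit_test(x, y):
    """Lack-of-fit F test for simple linear regression with replicates."""
    n = x.size
    # Reduced model: least-squares line, SSE(R) with n - 2 df.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    sse_r = np.sum((y - (b0 + b1 * x)) ** 2)

    # Full model: one mean per X level, SSE(F) = SSPE with n - c df.
    levels = np.unique(x)
    c = levels.size
    sspe = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)

    sslf = sse_r - sspe  # lack-of-fit sum of squares, c - 2 df
    f_star = (sslf / (c - 2)) / (sspe / (n - c))
    p_value = f.sf(f_star, c - 2, n - c)  # small p-value: linear fit inadequate
    return f_star, p_value
```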

F Test Rule
From the F test we know that large values of $F^*$ lead us to reject the null hypothesis: reject $H_0$ if $F^* > F(1 - \alpha;\, c - 2,\, n - c)$.
For this example we have: (figure)

Example Conclusion
If we set a significance level and look up the corresponding value of the F inverse CDF, we can conclude that the null hypothesis should be rejected: the linear regression function does not adequately fit the data.
