
DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS
Posc/Uapp 816  Class 17: Regression Methods

MULTIPLE REGRESSION METHODS

I.   AGENDA:
     A. Residuals
     B. Transformations
        1. A useful procedure for making transformations
     C. Reading: Agresti and Finlay, Statistical Methods for the Social Sciences, 3rd edition.

II.  RESIDUALS:
     A. MINITAB identifies cases that exert leverage on (disproportionately affect) the estimators, as well as very poorly fitting cases (that is, cases with large residuals).
        1. Normally one would look at each of these cases carefully to make sure there are no measurement errors or substantive reasons that should be taken into account.
     B. Partial regression plots: summary
        1. Assume a response Y and K independent or predictor variables.
        2. A partial plot shows the relationship between Y and one of the predictors after both have been adjusted for the influence of the remaining K - 1 variables.
        3. Method:
           i.   Regress Y on the K - 1 variables and obtain residuals.
           ii.  Regress X_k on the K - 1 variables and obtain residuals.
           iii. Plot the first set of residuals against the second to obtain the partial regression plot.
                1) This plot may indicate the need to transform the data. (See below.)
           iv.  Regress the first set of residuals on the second.
                1) The intercept will be 0.
                2) The regression coefficient will equal the partial regression coefficient obtained when Y is regressed on all K variables.

III. TRANSFORMING DATA
     A. Frequently plots will reveal patterns indicating that one or more variables should be transformed in order to meet the assumptions and requirements of regression analysis.
        1. OLS regression assumes that the model has been correctly specified; in particular, Y should be a linear function of the X's.
        2. Moreover, variables sometimes need to be transformed to make their observed distributions more symmetrical.
        3. The "raw" or "original" data can sometimes be transformed to new values, Y' and/or X', in a way that creates linear relations and/or symmetry.
        4. One way to find an appropriate transformation is to use the so-called "ladder of powers."
     B. Here's a motivating example:
        1. The next figure shows the relationship between sulfur dioxide and mortality.
        2. The relationship seems slightly curved, right?
     C. Sometimes a variable will be highly skewed.
        1. To see this let's switch to a new data set, one used last semester.
        2. It includes per capita crime rates and percent living in poverty (or classified as poor) for 506 districts in Boston.
           i. The data were drawn from the Data and Story Library (DASL) at StatLib, located at Carnegie Mellon University.
        3. Here is a stem-and-leaf display of the per capita crime variable.

           (400)  0  00000000000000000000000000000000000000000000000000000000000000000+
            106   0  5555555555555666666667777777778888888888999999999999
             54   1  000011111122233333444444
             30   1  555555678889
             18   2  0022344
             11   2  558
              8   3
              8   3  78
              6   4  1
              5   4  5
              4   5  1
              3   5
              3   6
              3   6  7
              2   7  3
              1   7
              1   8
              1   8  8

        4. It's clear that the data are highly skewed. Most values are below 1.0.
        5. Moreover, it is hard to plot 500-plus data points.
        6. So I took a random sample of 50 cases from the file to use in a preliminary analysis.
        7. Here is the plot of crime versus percent poor.
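The skew check and the sampling step can be sketched in code. This is a stand-in simulation, not the course data: the lognormal distribution below is my own assumption, chosen only to mimic the heavy right skew visible in the stem-and-leaf display.

```python
import random
import statistics

# Stand-in for the 506 Boston districts (the real data come from DASL);
# a lognormal variable mimics the heavy right skew in the stem-and-leaf.
random.seed(42)
crime = [random.lognormvariate(0, 1.5) for _ in range(506)]

# Heavy right skew shows up as a mean far above the median.
print(statistics.mean(crime) > statistics.median(crime))

# Preliminary analysis: draw a random sample of 50 cases, as in the notes.
sample = random.sample(range(506), 50)
print(len(sample))   # 50
```

Comparing the mean to the median is a quick numeric stand-in for reading the stem-and-leaf display: for a right-skewed variable the mean is dragged upward by the long tail.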

           i. We can see that a linear model may not be appropriate, partly because Y is so highly skewed and perhaps because the relationship is not linear.
     D. What to do?
        1. We need a systematic way to decide how to transform variables.
        2. First let's consider bivariate relationships.
        3. Basic idea:
           i.   Rank the X scores from lowest to highest.
           ii.  Divide them into three roughly equal batches (i.e., each batch has about 1/3 of the cases).
                1) If N is divisible by 3, each batch has the same number of cases.
                2) If N divided by 3 has a remainder of 1, put the extra case in the middle batch.
                3) If N divided by 3 has a remainder of 2, put an extra case in each end batch.
           iii. Find the median X in each of the three batches. Call these medians X_L, X_M, and X_H.
           iv.  Find the median Y's for the Y's that correspond to the X's in each batch. The median Y may or may not involve the same cases as the X median. In other words,
                1) The X's have been divided into three groups.
                2) Find the Y's that correspond to these X's.
                3) For each of the three batches of Y's find the medians: Y_L, Y_M, and Y_H.
                4) These medians need not be actual data points.
           v.   Find the half slopes:
                1) The left or lower half slope is:

                   b_L = (Y_M - Y_L) / (X_M - X_L)

                2) and the upper or right half slope is:

                   b_R = (Y_H - Y_M) / (X_H - X_M)

        4. The half slopes can be used to check for linearity and to pick an appropriate transformation (if any exists) that will "straighten out" the relationship so that OLS can be applied:
           i.   Find the half slopes and sketch them in the scatter plot. If the data are linear, the two half slopes will be roughly equal and their graph will be a nearly straight line. If, on the other hand, the relationship is not linear, then the graphs of the half slopes will form an "arrow" (see below) which you can use to pick a transformation.
           ii.  Calculate the half slope ratio by dividing b_L by b_R: if the relationship is linear, the ratio will be about 1.0; if not, it will be less than or greater than 1.0.
           iii. If the half slope ratio is negative, that means one slope is positive and one is negative, and the ladder of powers will not help.
     E. Using the half slopes. Consider the following:
        1. Suppose data points were dispersed roughly as shown.
        2. There is a relationship between X and Y, but it is not linear.
        3. You can imagine finding the half slopes.
           i. I've sketched them in. They are of course not drawn to scale.
        4. You can also imagine obtaining their ratio, which in this figure is greater than zero.
           i. Both slopes have the same sign, here negative.
        5. The left slope is larger (steeper) than the right slope.
           i. So you can determine that the ratio is greater than 1.0.

Figure 3

        6. You can imagine drawing an arrow using the two half slopes, as I have done.
           i.  This arrow points down the Y and X axes.
           ii. That in turn suggests that we transform either Y or X or both by taking powers down the ladder.
               1) See below. For now, going down means taking the square root or logarithm or some other power of X and/or Y.
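The batch-and-median procedure just described can be sketched in code. This is my own minimal implementation of the half-slope idea, with made-up data; function and variable names are not from the notes.

```python
import statistics

def half_slopes(x, y):
    """Return (b_L, b_R, b_L / b_R) using the three-batch median method."""
    pairs = sorted(zip(x, y))                 # rank the X scores low to high
    n = len(pairs)
    base, rem = divmod(n, 3)
    # Batch sizes: remainder 1 -> extra case in the middle batch;
    # remainder 2 -> one extra case in each end batch.
    if rem == 0:
        sizes = (base, base, base)
    elif rem == 1:
        sizes = (base, base + 1, base)
    else:
        sizes = (base + 1, base, base + 1)
    lo = pairs[:sizes[0]]
    mid = pairs[sizes[0]:sizes[0] + sizes[1]]
    hi = pairs[sizes[0] + sizes[1]:]
    med = lambda batch, i: statistics.median(p[i] for p in batch)
    x_l, x_m, x_h = med(lo, 0), med(mid, 0), med(hi, 0)
    y_l, y_m, y_h = med(lo, 1), med(mid, 1), med(hi, 1)
    b_left = (y_m - y_l) / (x_m - x_l)
    b_right = (y_h - y_m) / (x_h - x_m)
    return b_left, b_right, b_left / b_right

# For an exactly linear relation, y = 2x + 1, the two half slopes agree.
xs = list(range(1, 16))
b_l, b_r, ratio = half_slopes(xs, [2 * v + 1 for v in xs])
print(round(ratio, 6))   # 1.0 for a straight line
```

For a convex curve such as y = x^2 the same function returns a ratio below 1.0 (the right half slope is steeper), which is the "arrow" signal the notes use to pick a transformation.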

        7. It's possible that data would be related as indicated in Figure 4:
           i.   Now there is a curved positive correlation.
           ii.  The left half slope is smaller than the right, although they both have positive signs.
                1) So again the ratio is positive, and a transformation of either X or Y or both might help.
           iii. The arrow formed by sketching the half slopes points up the X axis and down the Y axis.
                1) As we will see, this implies converting X by taking a higher power, such as X^2, and/or a lower power of Y, such as log(Y).

Figure 4

        8. Now look at the next figure. We can analyze it in the same way by drawing half slopes and creating arrows.

Figure 5

           i.  The arrow points up the Y axis and down the X axis, so we would reverse the transformations mentioned above.
           ii. We might have to push Y up and/or pull X down.
     F. Each of these figures contains an implied arrow that represents the half slopes. Since there are "bends" in the line (hence the arrows), we can see that the relationships are nonlinear.
        1. The direction that the implied arrow points indicates what transformations of X and/or Y may help make the relationship more nearly linear.
           i. "Push up" means take powers of the variable that are greater than 1.0. That is, "push up X" means transform X by squaring or cubing it, or perhaps taking the 2.5 power (that is, X^2.5); trial and error is necessary to see which transformation works best.
     G. "Pull down" means take a power that is less than 1.0; for example, one can take the square root (the 1/2 power) or the logarithm (the "0 power") of X or Y or both. Again, trial and error is necessary to find the best fit.
        1. Ladder of powers: when "pushing" or "pulling" a variable, one can use the so-called ladder of powers (named by John Tukey, a statistician at Bell Labs):

The Ladder of Powers

     Step on ladder (power)   Transformation            Result
     ...
          3                   X^3 = cube                Pushes X "up"
          2                   X^2 = square
          1                   "raw" score               No change
         1/2                  X^(1/2) = square root
          0                   log(X) (base 10)
        -1/2                  reciprocal root
         -1                   -1/X                      Pulls X "down"
     ...

        2. It is possible to take half or even more refined intermediate steps, such as raising X to the 3/4 power (i.e., X^(3/4)).

IV.  AN EXAMPLE WITH SIMULATED DATA:
     A. Here is an example using simulated data.
        1. I created a population based on the model:

           Y_i = β_0 + β_1 X_i^2.9 + ε_i

        2. Note that β_0 = 0 and β_1 = 1.0. Y is simply X^2.9 plus an error term; that is, X has been raised to the 2.9 power.
        3. I then sampled 100 cases from this population.
        4. Assume then that I have 100 X-Y pairs and am trying to find the best fitting model for them.
        5. Normally, I would plot Y against X. In this case the plot is:

Figure 6

        6. Since I am assuming that the "true" model is not known, my first guess is a simple linear equation:

           Y_i = β_0 + β_1 X_i + ε_i

        7. But the plot suggests that there is a nonlinear relationship between Y and X.
           i. Indeed, if one imagined half slopes forming the head of an arrow, one would think to transform X by going up the ladder of powers, that is, taking, say, X squared, or by moving down the ladder with Y, that is, using the square root of Y.
        8. But for now I can proceed as if using raw X and Y were satisfactory.
           i. Here are the results from a bare-bones regression analysis.

The regression equation is
SampleY = -10180 + 998 SampleX

Predictor       Coef    StDev       T      P
Constant      -10180     1107   -9.20  0.000
SampleX       997.68    42.14   23.68  0.000

S = 5840      R-Sq = 85.1%

Analysis of Variance

Source           DF            SS            MS       F      P
Regression        1   19120576375   19120576375  560.60  0.000
Residual Error   98    3342509414      34107239
Total            99   22463085789

           ii.  The sample data seem to fit the linear model quite well. Look at R^2 and s.
           iii. The estimated coefficient relating Y to X is 997.7, which we know is incorrect.
                1) Also, the constant is -10180, which we know is wrong since we created the population to have β_0 = 0.
           iv.  Still, the data provide a good fit.
           v.   But if we use the half slope ratios, or approximations of them, we can possibly improve the fit.
                1) The imaginary arrow suggests going up the ladder in X (or down in Y, but let's try X first), so we can create a variable X*, which is simply X* = X^2.
                2) The plot of it against Y follows.

Figure 7

           vi. The points seem to lie on a straight line, so we use regression procedures to obtain:

The regression equation is
SampleY = -2624 + 21.3 SampleX2

Predictor       Coef    StDev       T      P
Constant     -2623.8    325.7   -8.06  0.000
SampleX2     21.3132   0.3325   64.11  0.000

S = 2311      R-Sq = 97.7%     R-Sq(adj) = 97.6%

Analysis of Variance

Source           DF            SS            MS        F      P
Regression        1   21939895085   21939895085  4109.61  0.000
Residual Error   98     523190705       5338681
Total            99   22463085789
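The simulated-data exercise can be reproduced in miniature. The sketch below is my own simulation (different seed, error variance, and sample than the notes, and a hand-rolled least-squares helper rather than MINITAB), so the numbers will not match the output above, but the pattern, a higher R^2 after squaring X, should.

```python
import random

def ols(x, y):
    """Bivariate least squares: returns (intercept, slope, R-squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return b0, b1, 1 - ss_res / ss_tot

# Population model from the notes: Y = X^2.9 plus error (beta_0 = 0, beta_1 = 1);
# the error scale here is an arbitrary choice of mine.
random.seed(1)
x = [random.uniform(1, 50) for _ in range(100)]
y = [v ** 2.9 + random.gauss(0, 500) for v in x]

_, _, r2_raw = ols(x, y)                    # linear fit on raw X
_, _, r2_sq = ols([v ** 2 for v in x], y)   # after going up the ladder: X* = X^2
print(r2_sq > r2_raw)                       # the transformed fit explains more variance
```

As in the notes, the raw-X fit already looks respectable by R^2 alone, which is exactly why the half-slope diagnostics are worth running before settling on a model.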

           vii. Although R^2 has become nearly perfect, we know, because we created the population model, that the estimated coefficients are off.
                1) Of course, they are closer to the population values of β_0 = 0 and β_1 = 1.0.
                2) Were we to transform X still again, by taking, say, X^3, we would find the coefficients closer to the true values.
                   a) Actually, the figure above hints at a slightly curved relationship.
                3) Also, don't forget that these data constitute a relatively small sample from the population in which Y = X^2.5 + error.
                   a) So our transformation is not too bad.

V.   NOTES ARE CONTINUED ON THE NEXT PAGES:
     A. The file is too large to fit on a single disk, so I split it into two parts.