Notes 6. Basic Stats Procedures part II


Statistics 5106, Fall 2007

Testing for Correlation between Two Variables

You have probably all heard about correlation. When two variables are correlated, they are dependent, which means that knowing the outcome of one variable tells you something about the probability of the various outcomes of the other variable. We can test for correlation by looking at scatterplots and by running a test in SAS.

The correlation coefficient r is a measure of the strength of the linear relationship between two quantitative variables. It has the following properties:

- $-1 \le r \le 1$.
- $r = 0$ indicates no linear relationship; $r > 0$ indicates a positive relationship and $r < 0$ indicates a negative relationship.
- $r = 1$ occurs only when the data fall perfectly on a line with positive slope; $r = -1$ occurs only when the data fall perfectly on a line with negative slope.

Computing the correlation coefficient

We calculate r for any two quantitative variables using the following formula:

$$ r = \frac{\sum \left(\frac{x - \bar{x}}{S_x}\right)\left(\frac{y - \bar{y}}{S_y}\right)}{n-1} = \frac{\sum Z_x Z_y}{n-1} \qquad (1) $$

This is sometimes called Pearson's r or Pearson's correlation to distinguish it from other measures of association; however, the phrase "correlation coefficient" in statistics refers specifically to r.

Question: Why is r a measure of the linear relationship between two variables? The z-scores measure how far each value lies from its mean, in standard deviations. If every $Z_x$ equals the corresponding $Z_y$ (that is, each x-value sits exactly as far from its mean as the paired y-value sits from its mean), then each product $Z_x Z_y$ is a positive square and r equals one. If the paired z-scores deviate from each other only slightly, the correlation is close to one; if they deviate greatly, the correlation is close to zero.
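Formula (1) can be sketched directly in SAS: standardize both variables, multiply the z-scores pairwise, and divide the sum by n - 1. This is for illustration only, since PROC CORR (used below) does it in one step; the dataset and variable names (cars, weight, MPG) are those of the example on the next page.

proc standard data=cars mean=0 std=1 out=zcars;
   var weight MPG;            /* replace each variable by its z-scores */
run;

data zprod;
   set zcars;
   zxzy = weight * MPG;       /* the product Z_x * Z_y for each row */
run;

proc means data=zprod sum n;  /* r = (sum of zxzy) / (n - 1) */
   var zxzy;
run;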

Other properties of correlation:

- It makes no difference which variable you call x and which you call y when computing correlation.
- The correlation is unchanged by changing the units of measurement for x or y.

A scatterplot is a graphical analog to the correlation. Remember that correlations should never be examined without also examining the scatterplots.

Let's look at the correlation between weight and MPG in the 81 cars data set. We saw the scatterplot earlier, and it looked like there was a correlation. We'll now do a formal test to see whether this relationship is statistically significant. Import the 81cars data set again and try the SAS code below. Remember to plot the scatterplot (you learned this earlier in Notes 3).

proc corr data=cars;
   var weight MPG;
run;

This code produces Pearson's coefficient by default.

Spearman's Coefficient

Spearman's rank correlation is calculated by converting each variable to its ranks. For example, in testing for correlation of blood pressure vs. body weight, the lightest person would get a rank of 1, the second-lightest a rank of 2, and so on; the lowest blood pressure would get a rank of 1, the second-lowest a rank of 2, and so on. When two or more observations are tied, the average rank is used: if two observations are tied for the second-lowest rank, each gets a rank of 2.5 (the average of 2 and 3). Once the two variables are converted to ranks, Pearson's correlation coefficient is calculated for the two columns of ranks, and its significance is tested. To get Spearman's coefficient we add the spearman option to the PROC CORR statement.

proc corr data=cars spearman;
   var weight MPG;
run;
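For a sketch of what the spearman option does behind the scenes, you can build the ranks yourself with PROC RANK (ties=mean assigns tied observations their average rank, as described above) and then compute the ordinary Pearson correlation on the rank columns; the result should match the Spearman coefficient.

proc rank data=cars out=ranked ties=mean;
   var weight MPG;
   ranks rweight rMPG;       /* rank versions of the two variables */
run;

proc corr data=ranked;       /* Pearson's r on the ranks...        */
   var rweight rMPG;         /* ...equals Spearman's coefficient   */
run;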

Simple Linear Regression

When we have a continuous response and a continuous predictor, we can get a scatterplot of the data. Often the response varies linearly with the predictor, so that it looks reasonable to draw a straight line through the scatterplot to represent the relationship between the variables. Simple linear regression means fitting a line to the data. It is appropriate when both the predictor (x) and response (y) variables are continuous, and the relationship between the variables is linear.

Technically, the term linear relationship means that the variables x and y vary together in such a way that the average increase in y associated with a change of one unit in x is constant over the range of x (i.e., the change in y given a unit change in x doesn't depend on the original value of x).

Intuitively, the best line should be closest to the data. To find this, we need a specific numerical measure of closeness. The least-squares criterion (which is not the only option) is the most common measure of closeness, and the only one discussed in this text. To describe the criterion and the best fit line, we first need to define a residual. Below is a plot of measurements of height (inches) and weight (pounds) for 22 kindergarteners.

[Figure: weight of kindergarteners as a function of height, with the least-squares line (slope = 0.661) superimposed.]

For each datapoint on the plot, draw a vertical line connecting the point to the line. The lengths of these little vertical lines are the (absolute values of) the residuals. The datapoints above the fitted line have positive residuals, and the residuals for the datapoints below the line are negative. The principle of least squares is: the best fit line minimizes the sum of the squared residuals.

Some notation and definitions:

- n is the number of observations. The observations are ordered pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where $x_i$ is the ith predictor variable observation and $y_i$ is the ith dependent variable observation.
- The equation for the best fit line is written $\hat{y} = b_0 + b_1 x$. This is called the estimated regression equation as well as the best fit line. $b_0$ and $b_1$ are called the coefficients of the best fit line.
- $b_0$ is the intercept of the best fit line. This is its y-coordinate at $x = 0$.
- $b_1$ is the slope of the best fit line. A one-unit increase in the x-coordinate is associated with an increase of $b_1$ units in the y-coordinate. (If $b_1$ is negative, the y value of the best fit line decreases as x increases.)
- $\hat{y}_i$ is the y-coordinate of the best fit line at $x = x_i$. It is the predicted or estimated value of the average dependent variable value at $x_i$.

Now we can write the sum of squared residuals as

$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

where SSE stands for sum of squared errors. (The words residual and error can be used interchangeably here.)

The formulas. We can derive the coefficients of the best fit line using calculus. If you know calculus, this is what you do:

1. Write the best fit criterion as a function of $b_0$ and $b_1$:
   $$ f(b_0, b_1) = \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2. $$
2. Take the partial derivatives $\frac{\partial f}{\partial b_0}$ and $\frac{\partial f}{\partial b_1}$.
3. Set the partial derivatives to zero and simultaneously solve the pair of linear equations for $b_0$ and $b_1$.

If you don't know calculus, here are the answers:

$$ b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} $$

and $b_0 = \bar{y} - b_1 \bar{x}$.
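As an illustration only (in practice we let SAS do this), here is a small SAS/IML sketch that evaluates the two formulas on a made-up five-point dataset; the numbers are hypothetical, and IML is assumed to be available.

proc iml;
   /* hypothetical data, for illustration only */
   x = {1, 2, 3, 4, 5};
   y = {2.1, 3.9, 6.2, 7.8, 10.1};
   xbar = x[:];                                /* mean of x */
   ybar = y[:];                                /* mean of y */
   b1 = sum((x - xbar) # (y - ybar)) / sum((x - xbar) ## 2);
   b0 = ybar - b1 * xbar;                      /* intercept */
   print b1 b0;
quit;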

In practice we won't be using the formulas anyway: SAS can do that for us.

Goodness of fit measure: the $R^2$. We've already defined the sum-of-squared errors, or SSE. Let's look at a small dataset with the response variable y being the cholesterol level of the subject and the predictor variable x being the tofu consumption in pounds per week. The sum of the squares of the dotted lines shown connecting the data points to the line is the SSE.

[Figure: linear relationship between cholesterol level and soy consumption (gm/week), with dotted vertical lines from the data points to the fitted line.]

We can compute the total sum of squares and the regression sum of squares in a similar manner, as follows:

[Figure: two panels for the cholesterol level vs. soy consumption (gm/week) data, computing the total sum of squares (left) and the regression sum of squares (right).]

These sums of squares are measures of variation:

SST: The total sum of squares is the measure of the total variation of the y-coordinates.
SSR: The regression sum of squares is the variation in the y-coordinates which is explained by the regression line, or by the linear relationship with x. This is also called the explained variation.
SSE: The error sum of squares is the unexplained variation.

Not surprisingly, we have SST = SSR + SSE; that is, the total variation is the sum of the explained variation and the unexplained variation. Now we can define the coefficient of determination:

$$ R^2 = \frac{SSR}{SST}, $$

which is interpreted as the proportion of variation in y that is explained by the linear relationship with x. You can see that $0 \le R^2 \le 1$. Datasets for which the scatterplot shows points tightly clustered about a line have high $R^2$. If all the points are on a line, we get $R^2 = 1$. We get $R^2 = 0$ if the best fit line has slope zero, which would indicate that the predicted value of y does not depend on the value of x.
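As a quick worked example with made-up numbers: if SST = 100 and SSE = 40 for some dataset, then

$$ SSR = SST - SSE = 100 - 40 = 60, \qquad R^2 = \frac{SSR}{SST} = \frac{60}{100} = 0.60, $$

so the linear relationship with x would explain 60% of the variation in y.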

When we do a simple linear regression, we are making the assumption that there is a true underlying linear relationship between y and x, that is,

$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, $$

where $\beta_0$ and $\beta_1$ are the real intercept and slope defining the relationship. The $\epsilon_i$ are random variables, called errors, that have mean zero. These unpredictable errors are why we can't get the exact parameter values from our sample. If we didn't have these errors, our data would lie exactly on a straight line, from which we could calculate the intercept $\beta_0$ and slope $\beta_1$.

We are estimating $\beta_0$ and $\beta_1$ with $b_0$ and $b_1$. The $\beta_0$ and $\beta_1$ are parameters in the model, because they are a characteristic of the population. The estimates $b_0$ and $b_1$ are random variables, or statistics, because they are calculated from our sample: if we took a different sample of observations of the $y_i$ values, at the same $x_i$ values, we would calculate different $b_0$ and $b_1$.

SAS reports the standard errors (i.e., estimates of the standard deviation) of $b_0$ and $b_1$ along with their values. These are used to calculate t-statistics (for the slope, $t = b_1/\mathrm{SE}(b_1)$, compared to a t-distribution with $n - 2$ degrees of freedom) so we can do inference about the slope and the intercept. SAS reports the t-statistics and corresponding p-values to test the hypotheses

$$ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \ne 0 $$

and also

$$ H_0: \beta_0 = 0 \quad \text{vs.} \quad H_a: \beta_0 \ne 0. $$

Example: Salary Data, revisited

Recall the data in salary.txt, about salaries at XYZ corporation. We have plotted salary against seniority and concluded that there is a general trend: as seniority increases, salary increases too. To get the best linear fit of salary to seniority, we can use proc reg as follows:

proc reg;
   model salary=seniority;
run;

The default SAS output is:

The REG Procedure
Model: MODEL1
Dependent Variable: salary

[SAS output: the Analysis of Variance table (Source, DF, Sum of Squares, Mean Square, F Value, Pr > F, with rows Model, Error, and Corrected Total), the fit statistics (Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var), and the Parameter Estimates table (Parameter Estimate, Standard Error, t Value, Pr > |t| for Intercept and seniority). The numeric values were not recovered, apart from the intercept's p-value of <.0001.]

The top part of the table you will recognize as looking like the ANOVA table, with sources of variation. In fact, the p-value for the F-statistic is calculated in the exact same way. Confirm that the R-Square is the model sum of squares divided by the total sum of squares. (The R-Square is also reported for one-way ANOVA, and has the exact same interpretation, but we didn't discuss it then.)

The Parameter Estimates part of the table contains the parameter estimates, standard errors, t-statistics, and the p-values. Here the label for the slope is the predictor variable name. Note that the p-value for the F-statistic is the same as the p-value for the slope. The p-value for the intercept tells us that the starting salary for someone with seniority=0 (just starting out) is significantly different from zero. Well, it should be! Often we're not really interested in hypothesis tests about the intercept. The p-value for the slope tells us that salary increases significantly with seniority.

The $R^2$ tells us that about 7.4% of the variation in salary is explained by the linear relationship with seniority. The rest of the variation is called error variation but is due in part to things like employee type, and maybe education level, competence, etc. Maybe gender? That's what we'd like to know! We'll come back to this question again later when we discuss proc glm.

Example: Beta Carotene Data

Let's look again at the dataset betaplasma.txt. Plot blood beta carotene level against fiber consumption. What is the general trend?

Do a simple linear regression and report the least-squares line. If you want to look at the plot with the least-squares line superimposed, just include the plot statement within proc reg (illustrated below with the salary example). Try this and look at the graph. There is a lot of information included.

proc reg;
   model salary=seniority;
   plot salary*seniority;
run;

What proportion of the variation in blood beta carotene level is explained by the linear relationship with fiber?

Interpret the regression coefficients in the context of the problem.

Is there a significant linear relationship between blood beta carotene level and fiber (use α = 0.01)? Describe.

What is the estimated blood beta carotene level for a person with fiber intake of 25? What about for someone with fiber intake of 200? (Note that it's generally not a good idea to extrapolate too far out of the data range.)

Report a p-value. State the hypotheses corresponding to the p-value. State the conclusion in the context of the problem.

Now do the same analyses, but investigate the relationship between blood beta carotene level and cholesterol. What is the $R^2$ associated with the simple linear regression, and what does this mean in the context of the problem?

Interpret the regression coefficients in the context of the problem.

Report a p-value. State the hypotheses corresponding to the p-value. State the conclusion in the context of the problem.

Example: Body and Brain sizes

Let's look again at the dataset sleeptime.txt. The data were collected to study sleep and dreaming patterns in mammals, but we can use them to investigate the relationship between brain weight and body weight.

Plot brain weight against body weight (put brain weight on the y-axis). Describe what you see.

Now plot the logarithm of brain weight against the logarithm of body weight. Describe what you see. Which do you think is more appropriate for simple linear regression?

Do a simple linear regression and report the least-squares line. Report the $R^2$ and interpret it in the context of the problem.

Look at the scatterplot of the logarithm of brain weight against the logarithm of body weight, with the best fit line superimposed. Some points are above the line; others are below. Describe the points above the line, compared to those below, in the context of the problem.

Which mammal has the largest residual? Interpret in the context of the problem.

What is the predicted brain size for a mammal with a 60 kg body size, according to the regression model?

Example: Population and Profit

Let's look again at the dataset iceprof.txt. Recall that this file contains data concerning the net profit of ice cream stores in a chain around the country. The variables are store ID, region, population, and profit. We have investigated whether profits differ over regions. Now we want to ask whether profit varies with the population of the town. What do you guess about the relationship?

Plot profit against population. Describe what you see. Make a guess: is the relationship significant?

Let $y_i$ be the profit of the ith store, and $x_i$ be the population of the town of the ith store. Write down a simple linear regression model.

Fit the model and report the best fit line. Interpret the coefficients in the context of the problem.

Do the profits increase significantly with population? Explain.

What proportion of the variation in profit is explained by the linear relationship with population?

Predict profits for a store located in a town of population 500,000.
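For the prediction question, one common SAS idiom is sketched below: append an extra observation with the new population value and a missing profit. PROC REG excludes the incomplete row when fitting the line but still fills in its predicted value. The dataset and variable names (iceprof, population, profit) are taken from the description above.

data iceplus;
   set iceprof end=last;
   output;
   if last then do;          /* append one extra row after the real data */
      population = 500000;   /* the new town we want a prediction for    */
      profit = .;            /* missing response: excluded from the fit  */
      output;
   end;
run;

proc reg data=iceplus;
   model profit = population;
   output out=preds p=predicted;  /* last row of preds holds the answer */
run;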

Residual Analyses

A statistical model is always a simplified approximation to a real-world situation. The model assumptions may be enumerated, and given these assumptions, statistical tests and confidence intervals are derived. No one expects the model to be exactly correct, but each time a model is used, an attempt should be made to find obvious deviations from the model. If deviations are found, a new, more complicated model can be developed using the new information.

For the simple linear regression, we are assuming that there is a linear relationship between y and x, and that we have independent normal random errors that all have the same variance. Residual analysis is one way to check these assumptions. Using residual analysis, we can also find anomalies in the data, such as observations that are outliers or influential points, which should be investigated by the researcher. Every time we do a regression analysis, we ought to plot our residuals to check assumptions. In this section we show how to spot deviations from the regression assumptions, but unfortunately most of the fixes for these situations are beyond the scope of this text.

Recall the assumptions for a simple linear regression model:

1. We assume that the underlying structure of the data fits the model. That is, the y-coordinates of our data are varying linearly with the x-coordinates, plus some random error. Deviations from this assumption are called lack of fit. Maybe the true relationship could be better described with a function that is not linear, such as a parabola. Perhaps there is another predictor variable, related to both the response and the main predictor variable, that we should include in the model.

2. We assume that the random errors are normally distributed with mean zero. The two most common deviations from this assumption are skewed error distributions and heavy-tailed error distributions. Both cause outliers in the dataset: observations with unusually large y values, compared to others.

3. We assume that the random errors all have the same variance. This assumption is called homoskedasticity, and the deviation from the assumption is heteroskedasticity; these are impressive-sounding terms! Two types of heteroskedasticity are common and can be found with residual plots: error variances that increase or decrease with the value of the predictor variable, and error variances that increase or decrease with the expected value of the response variable.

4. We assume that the random errors are independent of each other. The most common deviation from this assumption is called serial correlation, or autoregressive errors. This phenomenon often occurs when the predictor variable values are consecutive in time or space; the value of one error can then depend on the value of the previous error. Positive autocorrelation implies that consecutive errors tend to be closer together: if one error is large and positive, the next error is more likely to be positive than negative. Negative autocorrelation means that large positive residuals are more likely to be followed by negative residuals, and vice versa. (A numerical check for this assumption is sketched after the next paragraph.)

There are several types of residual plots for simple linear regression. Each plot checks some of the model assumptions. We will enumerate the types of residual plots that should be generated, and discuss which assumptions can be checked with each.
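One side note on assumption 4 before we turn to the plots: when the observations are in time (or space) order, PROC REG can also print the Durbin-Watson statistic, a numerical check for serial correlation. A minimal sketch, using the generic dataset and variable names (a, y, x) from the plotting examples later in this handout:

proc reg data=a;
   model y=x / dw;   /* adds the Durbin-Watson statistic to the output */
run;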

1. Plot the residuals against the predictor values

We put the residuals on the vertical axis and the predictor variable on the horizontal axis. If the regression assumptions hold, the residual plot should look like a random horizontal band of points; it should have no patterns in it. Here are some specific patterns to look for, which indicate violations of the regression assumptions.

For example, we can look for lack of fit. Suppose we fit the equation of a straight line to data for which y is really quadratic in x; we might get the type of pattern illustrated in the plot on the left:

[Figure: two panels, a line fit to a curved scatterplot (left) and the corresponding residual plot (right).]

We can see some lack of fit in the scatterplot, but it is more noticeable in the residual plot (on the right). If we see this kind of lack of fit, we should consider fitting a function other than a straight-line model, perhaps a parabola (other types of regression functions are discussed in Chapter 7).

We can also look for non-constant error variance, or heteroskedasticity. Sometimes when the error variance is not constant, it is increasing or decreasing in the x-values. The next figure shows residuals for the former situation. Note that as x gets bigger, the absolute values of the residuals tend to increase, in a fanning-out pattern.

[Figure: two panels, the fit to the scatterplot (left) and the residual plot with a fanning-out pattern (right).]

An example of a dataset where this type of residual plot might be seen is as follows. Suppose a marketing researcher is looking at luxury spending (y) as a function of income (x). In this case, smaller incomes will have a smaller variation of luxury spending, since people with small incomes do not have the option of large spending on luxuries. On the other hand, there might be quite a bit of variation in luxury spending for people with large incomes.

The remedy for error variances that increase or decrease in x is weighted regression. A brief treatment of this subject will be presented in a later handout, but in broad brushstrokes, the idea is to weight the observations by the inverse of the x values if the residual variance seems to be increasing in x, and by the x values themselves if the residual variance seems to be decreasing in x. The observations with smaller variances get larger weights and hence count more in the analyses.

2. Plot the residuals against the predicted values

If all the assumptions are met, the plot of the residuals against the predicted values (i.e., the $\hat{y}_i$) should show a random-looking horizontal band. Again, we want to see no patterns in the residuals. One common pattern is the fanning out or funneling in that was discussed above concerning the plots of the residuals against the predictor values $x_i$. In the figure below, we see the fanning-out pattern in the plot of the residuals against the predicted values. This type of pattern is another example of heteroskedasticity. Sometimes a log-transformation of the response variable is helpful in correcting this problem.

[Figure: residuals plotted against the predicted values, fanning out as the predicted values increase.]

3. Histogram of the residuals

A histogram provides an approximate distribution of the residuals. Of course, we're looking for a nice symmetric bell shape. If the sample size is fairly large (say, $n \ge 50$), the histogram will indicate whether or not the residuals have a bell-shaped type of distribution.

The next figure shows three plots using a dataset with n = 80. The scatterplot of the response against the predictor is shown on the left. We can see the upward skew in the scatterplot: it looks as if the positive residuals might tend to be larger in absolute value than the negative residuals.

[Figure: three panels, the scatterplot of y against x, the residuals plotted against the predictor values, and a histogram of the residuals.]

In the center of the figure, the residuals are plotted against the predictor values. Here the upward skew is a little more obvious. Finally, a histogram of the residuals shows a definite skew to the right. Skewed residuals indicate a violation of the normality assumption, and so cast doubt on the validity of the regression results. When the skew is to the right, sometimes the handy log-transformation of the response values will fix the problem; that is, convert the model to one with no obvious violations of the assumptions.
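A sketch of one way to draw such a histogram in SAS, assuming the residuals have already been saved as a variable res in a dataset bdat (the output statement that does this appears at the end of this handout):

proc univariate data=bdat;
   var res;
   histogram res / normal;   /* overlay the fitted normal curve */
run;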

4. Normal probability plot of the residuals

Another way to visually assess the normality assumption is through a normal probability plot. As discussed in the proc univariate handout, the sorted residuals are plotted against quantiles of the normal distribution. If the resulting points fall roughly on a straight line, this is evidence for normality. If the line of points is curved in an obvious way, this is evidence of deviation from normality.

Plot (a) in the figure below shows the sorted residuals from the skewed-right residuals above, plotted against the normal quantiles. There is a clear curved pattern, indicating non-normality. Plot (b) shows the sorted residuals from a model in which the normal assumptions are met, plotted against the normal quantiles. Here the points lie roughly on a straight line, with some random jitter.

[Figure: panels (a) and (b), sorted residuals plotted against normal quantiles for the skewed-right model and the well-behaved model, respectively.]

There are also statistical tests for normality, which we learned in the proc univariate material.

Residuals may sometimes be heavy-tailed. This indicates that there are both unusually large and unusually small values of the response. The normal density has thin tails, meaning that it is very unlikely to see observations more than 2 or 3 standard deviations away from the mean. For some densities, such as the t-density with small degrees of freedom, it is not so surprising to see values more than 3 or 4 standard deviations away. We call these observations outliers. When residuals from a dataset with heavy-tailed errors are plotted against the normal quantiles, an upward skew is seen at the right and a downward skew is seen at the left. This pattern is seen in plot (a) of the figure below. The regression errors were generated from a t-density with two degrees of freedom, instead of a normal density. There is a pronounced indication of heavy tails. Note that the curve is concave-down on the left, then concave-up on the right.

[Figure: panels (a) and (b), sorted residuals plotted against normal quantiles for heavy-tailed and thin-tailed errors, respectively.]

In plot (b) of the above figure, we see the opposite phenomenon: residuals from a distribution that is thinner-tailed than the normal density. The curve is concave-up, then concave-down. This is a more unusual situation.

Getting residual plots using proc reg

We know how to get a scatterplot with the best fit line superimposed: we use a plot statement within the regression procedure:

proc reg data=a;
   model y=x;
   plot y*x;
run;

We can do the same for residual plots. The terms residual. and predicted. (actually, you can use r. and p. for short!) are variables that you can use in the plot statement, although they are not variables in the dataset. Try this:

proc reg data=a;
   model y=x;
   plot residual.*predicted.;
run;

To get a probability plot:

proc reg data=a;
   model y=x;
   plot residual.*npp.;
run;
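Alternatively, once the residuals are saved to a dataset (the output statement below creates bdat with the residuals in res), PROC UNIVARIATE can draw the normal probability plot directly; a sketch:

proc univariate data=bdat;
   var res;
   qqplot res / normal(mu=est sigma=est);  /* reference line from the fitted normal */
run;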

If you want to save the residuals and predicted values, the following will create a new dataset bdat with all the variables of adat, plus the residuals (res) and predicted values (pred).

proc reg data=adat;
   model y=x;
   output out=bdat r=res p=pred;
run;

Other useful things you can save are:

COOKD=name (Cook's D influence statistic)
STUDENT=name (studentized residuals, which are the residuals divided by their standard errors)
RSTUDENT=name (a studentized residual with the current observation deleted)

Outliers and Influential Points

After performing regression analysis to check the model assumptions, we should look for outliers and influential points as part of the residual analysis. In simple linear regression, we can often see these points in the scatterplot of y on x. An outlier is far from the best fit line, and will produce an unusually large residual. An observation is an influential point if removing it from the fit (deleting it from the analysis) changes the slope of the line in a substantial way.

Plot (a) in the figure below contains an example of a point that is an outlier but not an influential point. The solid line is the best fit with all the data, and the dotted line is the best fit with the circled point removed. If we draw a vertical line from the point to the solid line, we see that the residual for the circled point is considerably larger than the residuals for the other observations, but the presence of the point does not much alter the fit.

[Figure: panels (a), (b), and (c), scatterplots of y against x, each with one circled point: an outlier only, an influential point only, and a point that is both.]

In plot (b), the circled point is an influential point but not an outlier. When it is included in the regression, we get the solid line. Because the point is as near the line as any of the others, its residual is not large compared to the others, so it is not an outlier. However, the absence of the circled point drastically changes the fit to the remaining points, shown as a dotted line. The circled point greatly influences the fit to the data. The circled point in plot (c) is both an outlier and an influential point. It has a larger residual than the other observations, and it has the effect of pulling the regression line toward itself.

Outliers can be seen in the plots of the residuals against the predictor values, and of the residuals against the predicted values. Influential points can be harder to spot. One way to check whether an observation is influential is to remove the observation from the dataset, perform a regression analysis using the reduced dataset, and compare the results to the full-data results. This might take a long time if we want to check every point to see if it is influential. Fortunately, the computer can do this for us.

Cook's D Influence

Suppose we do a regression with the ith observation removed, and let $\hat{y}_{j(i)}$ be the predicted jth response, using the slope and intercept obtained with the reduced dataset. The notation (i) in the subscript stands for "with the ith observation removed." The idea is to compare the predictions $\hat{y}_{j(i)}$, $j = 1, 2, \ldots, n$, with the predictions $\hat{y}_j$ using the full dataset, to see if they are substantially different. The measure of influence of the ith observation is called Cook's D influence, and written

$$ D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{k \, MSE}, $$

where $\hat{y}_j$ is the predicted jth response for the regression on the entire dataset. The number k is two for simple linear regression; for the more complicated models of Chapter 7, it is the number of parameters in the model. The value of $D_i$ is large if the ith observation is influential, that is, if the predictions without the ith observation are far from the predictions using the entire dataset. It is calculated at each observation, and we can tell the software package to save these values in a new column. Then we plot the influence statistics against the predictor variable, for example, to look for observations with unusually high influence.

There is a rule for how large a value of $D_i$ should be before the observation is considered to be an influential point; the cut-off depends on n and the number of parameters k in the model. The values of $D_i$ are compared with the 95th percentile of an $F(k, n-k)$ density; if the value of $D_i$ is larger than the percentile, the ith observation is considered influential.
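Putting the pieces together, here is a sketch that saves the Cook's D values, computes the cut-off with SAS's FINV function, and plots the statistics against the predictor. The dataset and variable names (adat, y, x) and the sample size n = 25 are hypothetical.

proc reg data=adat;
   model y=x;
   output out=bdat cookd=cd;        /* save Cook's D for each observation */
run;

data bdat;
   set bdat;
   n = 25;                          /* hypothetical sample size           */
   k = 2;                           /* parameters in simple linear reg.   */
   cutoff = finv(0.95, k, n - k);   /* 95th percentile of F(k, n-k)       */
   flag = (cd > cutoff);            /* 1 = influential by the rule above  */
run;

proc gplot data=bdat;
   plot cd*x;                       /* look for unusually large values    */
run;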

Outliers and influential points should always be investigated. Sometimes they are the results of simple mistakes in entering the data, and should be corrected. Sometimes they are observations from an anomalous case. Sometimes these should be removed from the dataset, and other times they should be left in, depending on the context of the problem and the purpose of the analysis.

Exercises:

1. Look again at the betaplasma.txt dataset. When you plotted the beta carotene blood plasma level (BCBPL) against fiber consumption or BMI, you got an unsatisfying scatterplot, and the fit didn't look so good. Re-do your regression analyses, and get residual plots. Describe the distribution of the residuals.

2. Now do the same thing, only use log(BCBPL) as your response variable. Get residual plots. Did the transformation fix the problem?

3. Check the residual plots for the body temperature against pulse rate analysis. Are the model assumptions met?

4. Plot the Cook's influence statistics. Are there any data points with undue influence?

5. Interpret the results in the context of the problem.


Lecture 4 Scatterplots, Association, and Correlation

Lecture 4 Scatterplots, Association, and Correlation Lecture 4 Scatterplots, Association, and Correlation Previously, we looked at Single variables on their own One or more categorical variables In this lecture: We shall look at two quantitative variables.

More information

Lecture 4 Scatterplots, Association, and Correlation

Lecture 4 Scatterplots, Association, and Correlation Lecture 4 Scatterplots, Association, and Correlation Previously, we looked at Single variables on their own One or more categorical variable In this lecture: We shall look at two quantitative variables.

More information

Multiple Regression and Regression Model Adequacy

Multiple Regression and Regression Model Adequacy Multiple Regression and Regression Model Adequacy Joseph J. Luczkovich, PhD February 14, 2014 Introduction Regression is a technique to mathematically model the linear association between two or more variables,

More information

10 Model Checking and Regression Diagnostics

10 Model Checking and Regression Diagnostics 10 Model Checking and Regression Diagnostics The simple linear regression model is usually written as i = β 0 + β 1 i + ɛ i where the ɛ i s are independent normal random variables with mean 0 and variance

More information

Regression Analysis IV... More MLR and Model Building

Regression Analysis IV... More MLR and Model Building Regression Analysis IV... More MLR and Model Building This session finishes up presenting the formal methods of inference based on the MLR model and then begins discussion of "model building" (use of regression

More information

Density Temp vs Ratio. temp

Density Temp vs Ratio. temp Temp Ratio Density 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Density 0.0 0.2 0.4 0.6 0.8 1.0 1. (a) 170 175 180 185 temp 1.0 1.5 2.0 2.5 3.0 ratio The histogram shows that the temperature measures have two peaks,

More information

POL 681 Lecture Notes: Statistical Interactions

POL 681 Lecture Notes: Statistical Interactions POL 681 Lecture Notes: Statistical Interactions 1 Preliminaries To this point, the linear models we have considered have all been interpreted in terms of additive relationships. That is, the relationship

More information

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure. STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed

More information

y n 1 ( x i x )( y y i n 1 i y 2

y n 1 ( x i x )( y y i n 1 i y 2 STP3 Brief Class Notes Instructor: Ela Jackiewicz Chapter Regression and Correlation In this chapter we will explore the relationship between two quantitative variables, X an Y. We will consider n ordered

More information

Chapter 4: Regression Models

Chapter 4: Regression Models Sales volume of company 1 Textbook: pp. 129-164 Chapter 4: Regression Models Money spent on advertising 2 Learning Objectives After completing this chapter, students will be able to: Identify variables,

More information

Regression Models. Chapter 4. Introduction. Introduction. Introduction

Regression Models. Chapter 4. Introduction. Introduction. Introduction Chapter 4 Regression Models Quantitative Analysis for Management, Tenth Edition, by Render, Stair, and Hanna 008 Prentice-Hall, Inc. Introduction Regression analysis is a very valuable tool for a manager

More information

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:

More information

Overview. Overview. Overview. Specific Examples. General Examples. Bivariate Regression & Correlation

Overview. Overview. Overview. Specific Examples. General Examples. Bivariate Regression & Correlation Bivariate Regression & Correlation Overview The Scatter Diagram Two Examples: Education & Prestige Correlation Coefficient Bivariate Linear Regression Line SPSS Output Interpretation Covariance ou already

More information

Statistiek II. John Nerbonne. March 17, Dept of Information Science incl. important reworkings by Harmut Fitz

Statistiek II. John Nerbonne. March 17, Dept of Information Science incl. important reworkings by Harmut Fitz Dept of Information Science j.nerbonne@rug.nl incl. important reworkings by Harmut Fitz March 17, 2015 Review: regression compares result on two distinct tests, e.g., geographic and phonetic distance of

More information

Regression Analysis. Regression: Methodology for studying the relationship among two or more variables

Regression Analysis. Regression: Methodology for studying the relationship among two or more variables Regression Analysis Regression: Methodology for studying the relationship among two or more variables Two major aims: Determine an appropriate model for the relationship between the variables Predict the

More information

Statistics for exp. medical researchers Regression and Correlation

Statistics for exp. medical researchers Regression and Correlation Faculty of Health Sciences Regression analysis Statistics for exp. medical researchers Regression and Correlation Lene Theil Skovgaard Sept. 28, 2015 Linear regression, Estimation and Testing Confidence

More information

STAB22 Statistics I. Lecture 7

STAB22 Statistics I. Lecture 7 STAB22 Statistics I Lecture 7 1 Example Newborn babies weight follows Normal distr. w/ mean 3500 grams & SD 500 grams. A baby is defined as high birth weight if it is in the top 2% of birth weights. What

More information