Lecture notes on Regression & SAS example demonstration


Regression & Correlation (p. 215)

When two variables are measured on a single experimental unit, the resulting data are called bivariate data. You can describe each variable individually, and you can also explore the relationship between the two variables.

Simple Linear Regression & Correlation (p. 214)

For quantitative variables one could employ methods of regression analysis. Regression analysis is an area of statistics concerned with finding a model that describes the relationship that may exist between variables, and with determining the validity of such a relationship.

Examples:
- Do housing prices vary according to distance to a major freeway?
- Does respiration rate vary with altitude?
- Is snowfall related to elevation, and if so, what kind of relationship is there between these two variables?

Speaking of snow, let's consider wind chill.

Example 10.1 (p. 214)
Suppose we are interested in determining the wind chill temperature. For those of us from regions where the winters are extremely cold (like North Dakota), we know that this temperature depends on variables such as the wind velocity (speed and direction), the absolute temperature, relative humidity, etc.

Dependent (response) variable: wind chill temperature
Independent (regressor/predictor) variables: temperature, wind velocity, relative humidity

Is the wind chill temperature important?
California: +85°; Minneapolis: -23°; wind chill temperature: -78°
What do you say to that? Pretty cold if you ask me!

Regression analysis allows us to represent the relationship between the variables: to examine how the variable of interest (wind chill), often called the dependent or response variable, is affected by one or more control or independent variables (wind speed, actual temperature, relative humidity). It provides us with a simplified view of the relationship between variables, a way of fitting a model to our data, and a means for evaluating the importance of the variables included in the model and the correctness of the model. Correlation analysis will be used as a measure of the strength of the given relationship.

Note the following concepts:
- Quantitative variables may be classified according to types.
- Response variable: a variable whose changes are of interest to an experimenter.
- Explanatory variable: a variable that explains or causes changes in a response variable.

NOTE: We will generally denote the explanatory variable by x and the response variable by y.

To study the relationship between variables, one could use the following as guides:
- Start by preparing a graph (scatterplot).
- Examine the graph for an overall pattern and deviations from that pattern (check for outliers, etc.).
- Add numerical descriptive measures for additional information and support.

Scatterplots
- Plot the explanatory (independent) variable on the horizontal axis and the response variable on the vertical axis.
- Look for pattern: form, direction, and strength of the relationship.

Note the following:

Association
- Positive association: large values of one variable correspond to large values of the other.
- Negative association: large values of one variable correspond to small values of the other.

EXAMPLE 10.3 (p. 215): Physicians have used the so-called diving reflex to reduce abnormally rapid heartbeats in humans by submerging the patient's face in cold water. (The reflex, triggered by cold water temperatures, is an involuntary neural response that shuts off circulation to the skin, muscles, and internal organs, and diverts extra oxygen-carrying blood to the heart, lungs, and brain.) A research physician conducted an experiment to investigate the effects of various cold water temperatures on the pulse rates of 10 children, with the following results: (See Lecture Notes)

Scatterplot of Diving Reflex: the data look reasonably linear, with redpr (reduction in pulse rate) decreasing as temp increases.

Correlation (p. 220)
If two variables are related in such a way that the value of one is indicative of the value of the other, we say the variables are correlated. The population correlation coefficient ρ is a measure of the strength of the linear relationship between two variables; it is estimated by the sample correlation coefficient r. See formulas on this page.

SOME NOTES (p. 221)
- The closer r is to ±1, the stronger the linear relationship.
- The closer r is to 0, the weaker the linear relationship.
- If r = ±1, the relationship is perfectly linear (all the points lie exactly on the line).
- r > 0: as x increases, y increases (positive association).
- r < 0: as x increases, y decreases (negative association).
- r = 0: no linear association.
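The transcript does not reproduce the formulas from that page. For reference, the usual sample quantities (standard definitions, supplied here rather than copied from the textbook page) are

$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},
\qquad
S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}),
\quad
S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2,
\quad
S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2,
$$

so that r always lies between -1 and +1.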

Your Task: Read the general guidelines.

PROC CORR (p. 223)
- Produces a correlation matrix, which lists the Pearson correlation coefficients between all sets of included variables.
- Produces descriptive statistics and the p-value for testing the population correlation coefficient ρ = 0 for each pair of variables.

GENERAL FORM
proc corr data = datasetname options;
   by variables;
   var variables;
   with variables;
   partial variables;
See Lecture Notes for options.

SAS (p. 223)
proc corr;
   var list-of-variables;
NOTE: If you do not specify a list of variables, SAS will report the correlation between all pairs of variables.

EXAMPLE 10.11 (p. 224)
Refer to the previous example on the diving reflex. Use SAS to find the correlation between reduction in pulse rate and cold water temperature. We write the following SAS code:

options nocenter nodate ps=55 ls=70 nonumber;

/* Set up temporary SAS dataset named diving */
data diving;
   input temp redpr @@;
   datalines;
68 2 65 5 70 1 62 10 60 9 55 13 58 10 65 3 69 4 63 6
;
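Before computing the correlation, it is natural to plot the data first, as suggested earlier. A minimal sketch (this step is not in the original code) using the diving dataset just created:

/* Scatterplot: response (redpr) on the vertical axis,
   explanatory variable (temp) on the horizontal axis */
proc plot data = diving vpercent = 70 hpercent = 70;
   plot redpr*temp;
run;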

/* Use proc corr to obtain correlation
   noprob    suppress printing of p-value for testing rho = 0
   nosimple  suppress printing of descriptive statistics */
proc corr noprob nosimple;
   var temp redpr;
run;
quit;

Example 10.11 (p. 224) output:

            temp      redpr
temp     1.00000   -0.94135
redpr   -0.94135    1.00000

NOTE: The correlation matrix is symmetric, with 1's along the main diagonal and the correlations along the off-diagonal.

NOTE: Corr(X,X) = 1, so Corr(temp,temp) = 1; and Corr(X,Y) = Corr(Y,X).

Value & Interpretation: r = -0.94135 indicates a strong inverse (negative) linear relationship between reduction in pulse rate and cold water temperature.

SIMPLE LINEAR REGRESSION
GOAL: Find the equation of the line that best describes the linear relationship between the dependent variable and a single independent variable.
- Simple: a single independent variable.
- Linear: the equation of a line, linear in the parameters.

Deterministic Model: y = β₀ + β₁x
- Requires that all points lie exactly on the line.
- Perfect linear relationship.

Probabilistic Model: y = β₀ + β₁x + ε
- Does NOT require that all points lie exactly on the line.
- Allows for some error/deviation from the line: ε ~ N(0, σ²), independent, representing random error.

For a particular value of x:
Vertical distance = (observed value of y) − (predicted value of y obtained from the estimated regression equation).

Method of Least Squares
β₀ and β₁ are unknown parameters and need to be estimated. We want to estimate the slope and y-intercept in such a way that the errors are minimized:

min SSE = min Σᵢ₌₁ⁿ εᵢ² = min Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

The least squares estimates are

b = β̂₁ = Sxy / Sxx      (estimate of slope)
a = β̂₀ = ȳ − β̂₁x̄       (estimate of y-intercept)

ŷ = β̂₀ + β̂₁x = a + bx    (estimated regression equation = least squares regression equation)
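As a check on these formulas (a hand computation, not shown in the original notes), for the diving-reflex data one finds x̄ = 63.5, ȳ = 6.3, Sxx = 214.5, and Sxy = −165.5, so

$$
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-165.5}{214.5} \approx -0.77156,
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 6.3 - (-0.77156)(63.5) \approx 55.29417,
$$

which matches the PROC REG output shown next.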

PROC REG in SAS (p. 230)
GENERAL FORMAT:
proc reg data = datasetname options;
   by variables;
   model dependentvariable = independentvariables / options;
   plot yvariable*xvariable symbol / options;
   output out = newdataset keywords = names;
See Lecture Notes for options.

EXAMPLE 10.15 (p. 232)
Refer to the previous example on the diving reflex. Use SAS to find the estimated regression equation relating reduction in pulse rate and cold water temperature. We add the following SAS code to our existing code, just before the run statement:

proc reg;
   model redpr = temp;

REMEMBER: model dependent = independent;

The REG Procedure
Model: MODEL1
Dependent Variable: redpr

Analysis of Variance
                         Sum of        Mean
Source          DF      Squares      Square    F Value    Pr > F
Model            1    127.69347   127.69347      62.26    <.0001
Error            8     16.40653     2.05082
Corr Total       9    144.10000

Root MSE          1.43207    R-Square    0.8861
Dependent Mean    6.30000    Adj R-Sq    0.8719
Coeff Var        22.73122

Parameter Estimates
                    Parameter    Standard
Variable    DF       Estimate       Error    t Value    Pr > |t|
Intercept    1       55.29417     6.22552       8.88      <.0001
temp         1       -0.77156     0.09778      -7.89      <.0001

ŷ = β̂₀ + β̂₁x = 55.29417 − 0.77156x

Suppose x = 61; then ŷ = 55.29417 − 0.77156(61) ≈ 8.23.

Suppose x = 150. Would you use this equation? NO

Suppose x = 34. Would you use this equation? NO

THE LESSON: BE CAREFUL! This equation is NOT universally valid.

Evaluating the Regression Equation (p. 236)
Once we have the regression equation, we need to evaluate its effectiveness:
- Correlation
- Coefficient of determination
- Test of the slope
- Validate assumptions

Coefficient of Determination, R² (p. 236)
- Represents the proportion of variability in the dependent variable, y, that can be accounted for by the variability in the independent variable, x.
- Equivalently, the reduction in SSE obtained by using the regression equation to predict y as opposed to just using the sample mean.
- 0 ≤ R² ≤ 1; the closer R² gets to 1, the better the fit.
- In SLR, R² = (correlation coefficient)².

R² = (regression sum of squares)/(total sum of squares) = (model sum of squares)/(total sum of squares) = SSR/TSS = SSM/TSS

(Refer to the PROC REG output above: R-Square = 0.8861, with Model SS = 127.69347 and Corrected Total SS = 144.10000.)
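Plugging the ANOVA values from that output into the formula (a worked check, not in the original notes):

$$
R^2 = \frac{SSR}{TSS} = \frac{127.69347}{144.10000} \approx 0.8861.
$$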

R² = 0.8861: 88.61% of the variability in reduction in pulse rate can be accounted for by the variability in cold water temperature. OR: One gets an 88.61% reduction in the SSE by using the model to predict the dependent variable instead of just using the sample mean to predict the dependent variable.

NOTE: This means that approximately 11.39% of the sample variability in reduction in pulse rate cannot be accounted for by the current model.

CI & Tests of Hypothesis
What if the slope were 0? You would have a horizontal line, so knowing x would not help predict y, and our regression equation would not be useful! We can perform a test of hypothesis to determine whether the slope is 0.

EXAMPLE 10.20 (p. 237)
Refer to the diving reflex example. Test whether the slope is significantly different from 0. (Usual t-test.)

EXAMPLE 10.20 Solution
1. H₀: β₁ = 0
2. Hₐ: β₁ ≠ 0

(Refer again to the Parameter Estimates in the PROC REG output above: for temp, t Value = −7.89 with Pr > |t| < .0001.)
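The test statistic is computed from those parameter estimates (a worked check, not spelled out in the original):

$$
t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{-0.77156}{0.09778} \approx -7.89,
\qquad df = n - 2 = 8.
$$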

EXAMPLE 10.20 Solution (continued)
3. p-value < 0.0001
4. Reject H₀ if p-value < α = 0.05.
5. Since p-value < 0.0001 < 0.05, reject H₀ and conclude the slope is significantly different from 0.

Confidence Intervals
A confidence interval for the slope is

β̂₁ ± t(α/2, n−2) · s/√Sxx

(point estimate ± distribution point × standard deviation of the point estimate). As a worked example (not in the original notes), a 95% CI for the diving data is −0.77156 ± 2.306(0.09778) = −0.77156 ± 0.22548, i.e., roughly (−0.997, −0.546).

Soft Drink Example (Handout)
A soft drink vendor, set up near a beach for the summer (clearly summer has not yet arrived in Riverside), was interested in examining the relationship between sales of soft drinks, y (in gallons per day), and the maximum temperature of the day, x. See Handout for data. Write a SAS program to read in and print out the data.

options ls=78 nocenter nodate ps=55 nonumber;

/* Create temporary SAS dataset and enter data */
data e1q1;
   input x y @@;
/* Add titles */
title1 'Statistics 157 Extra SLR Example';
title2 'Winter 2008';
title3 'Linda M. Penas';
title4 'Question 1';
datalines;
90 7.3 95 8.5 101 10.1 95 9.3 87 6.7 97 9.2
102 10.2 88 6.7 88 7.1 99 9.9 101 9.9 83 10.2
;

/* Print the data as a check */
proc print;
run;

Correlation Coefficient for the Example
Find and interpret the correlation between sales of soft drinks and maximum temperature of the day. Add the following lines of code:

/* Use proc corr to generate correlation information
   nosimple  suppress printing of descriptive statistics
   noprob    suppress printing of p-value for testing rho = 0 */
proc corr nosimple noprob;
   var x y;
run;

Correlation Output
The CORR Procedure
2 Variables: x y

Pearson Correlation Coefficients, N = 12
          x         y
x   1.00000   0.62180
y   0.62180   1.00000

r = 0.62180: a moderate positive linear relationship between maximum temperature and soft drink sales.

Regression
Find the estimated regression equation ŷ = β̂₀ + β̂₁x.

/* Use proc reg to generate regression information
   model dependent = independent */
proc reg;
   model y = x;
run;

Regression Output
Parameter Estimates
                    Parameter    Standard
Variable    DF       Estimate       Error    t Value    Pr > |t|
Intercept    1       -4.19781     5.17157      -0.81      0.4359
x            1        0.13808     0.05500       2.51      0.0309

ŷ = −4.19781 + 0.13808x

Coefficient of Determination
Find and interpret the coefficient of determination.

Root MSE          1.17396    R-Square    0.3866
Dependent Mean    8.75833    Adj R-Sq    0.3253

R² = 0.3866: only 38.66% of the variability in sales can be accounted for by the variability in maximum temperature. Bad model!

Intro to Residual Analysis (p. 243)
For each xᵢ: residual = eᵢ = observed error = yᵢ − ŷᵢ, i = 1, 2, ..., n, where
yᵢ = observed value (in the data)
ŷᵢ = corresponding predicted or fitted value (calculated from the equation).
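For instance (a worked check, not in the original notes), for the first soft-drink observation x = 90 and y = 7.3, so

$$
\hat{y}_1 = -4.19781 + 0.13808(90) \approx 8.229,
\qquad
e_1 = 7.3 - 8.229 = -0.929,
$$

matching the first line of the residual output shown later.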

For a given value of x:
Residual = difference between what we observe in the data and what is predicted by the regression equation
= amount the regression equation has not been able to explain
= observed error if the model is correct.

We can examine the residuals through the use of various plots. Abnormalities would be indicated if:
- The plot shows a fan shape (indicates violation of the common variance assumption).
- The plot shows a definite linear trend (indicates the need for a linear term in the model).
- The plot shows a quadratic shape (indicates the need for quadratic or cross-product terms in the model).

NOTE: It is often easier to examine the standardized or studentized residuals. We can interpret them similarly to z-scores:
2 < |std residual| < 3: suspect outlier
|std residual| > 3: extreme outlier
(Outlier = doesn't seem to fit with the rest of the data = seems out of place.)

[Residual plot examples shown here in the original slides: one with a quadratic pattern ("quadratic term needed") and one with no pattern ("looks random").]

Examine the following output to see if there are any suspect or extreme outliers:

Obs     x        y       Fit    SE Fit   Residual   St Resid
 1     1.0    50.00    30.06    12.63      19.94       1.44
 2     2.0   110.00   101.03     8.03       8.97       0.53
 3     2.0    90.00   101.03     8.03     -11.03      -0.65
 4     3.0   150.00   163.86     6.45     -13.86      -0.79
 5     3.0   140.00   163.86     6.45     -23.86      -1.36
 6     3.0   180.00   163.86     6.45      16.14       0.92

 7     4.0   190.00   218.54     7.15     -28.54      -1.65
 8     6.0   310.00   303.47     8.47       6.53       0.39
 9     6.0   330.00   303.47     8.47      26.53       1.59
10     7.0   340.00   333.73     8.16       6.27       0.37
11     8.0   360.00   355.84     7.84       4.16       0.25
12    10.0   380.00   375.62    12.54       4.38       0.32
13    10.0   360.00   375.62    12.54     -15.62      -1.12

CONCLUSION: The residual plot shows no apparent pattern, and since every |std resid| < 2, there are no suspect or extreme outliers either.

[Residual plot example shown here in the original slides: residuals fanning out, indicating non-constant variance.]

To get residuals and residual plots in SAS (EXAMPLE 10.26: Diving Reflex Example; for the diving data the model statement would be model redpr = temp):

proc reg;
   /* P = predicted values
      R = residuals
      Student = studentized residuals (act like z-scores)
      output out = datasetname */
   model y = x / P R;
   output out = a P = pred R = resid Student = stdres;
run;

Residual Plot
Generate a residual plot of studentized residuals versus predicted values:

proc plot vpercent = 70 hpercent = 70;
   plot stdres*pred;

The identical statements are used for the Soft Drink Example (where the variables really are named y and x).
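As an aside (not part of the original notes): in SAS 9.2 and later, enabling ODS Graphics makes PROC REG produce residual and diagnostic plots automatically, with no separate PROC PLOT step. A minimal sketch, assuming the soft-drink dataset e1q1 from above:

ods graphics on;
proc reg data = e1q1;
   model y = x;   /* a fit-diagnostics panel, including residual plots, is produced automatically */
run;
quit;
ods graphics off;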

Residual Info
The REG Procedure
Model: MODEL1
Dependent Variable: y

Output Statistics
       Dep Var   Predicted    Std Error               Std Error    Student
Obs          y       Value   Mean Predict   Residual   Residual   Residual
  1     7.3000      8.2290       0.3991      -0.9290      1.104     -0.841
  2     8.5000      8.9194       0.3449      -0.4194      1.122     -0.374
  3    10.1000      9.7479       0.5198       0.3521      1.053      0.335
  4     9.3000      8.9194       0.3449       0.3806      1.122      0.339
  5     6.7000      7.8148       0.5060      -1.1148      1.059     -1.052
  6     9.2000      9.1956       0.3810     0.004426      1.110    0.00399
  7    10.2000      9.8860       0.5626       0.3140      1.030      0.305
  8     6.7000      7.9529       0.4667      -1.2529      1.077     -1.163
  9     7.1000      7.9529       0.4667      -0.8529      1.077     -0.792
 10     9.9000      9.4717       0.4423       0.4283      1.087      0.394
 11     9.9000      9.7479       0.5198       0.1521      1.053      0.145
 12    10.2000      7.2625       0.6854       2.9375      0.953      3.082

Note that observation 12 (x = 83, y = 10.2) has a studentized residual of 3.082: an extreme outlier.

Residual Plot
Generate a residual plot of studentized residuals versus predicted values:

proc plot vpercent = 70 hpercent = 70;
   plot stdres*pred;

PART 2
Generate new information with the outlier (83, 10.2) removed:

data e1q2;
   input x y @@;
title4 'Question 2';
datalines;
90 7.3 95 8.5 101 10.1 95 9.3 87 6.7 97 9.2
102 10.2 88 6.7 88 7.1 99 9.9 101 9.9
;
proc print;

proc corr nosimple noprob;
   var x y;

/* Make sure you use different names for your residuals
   so you do not overwrite the old ones */
proc reg;
   model y = x / P R;
   output out = b P = pred1 R = resid1 Student = stdres1;

proc plot vpercent = 70 hpercent = 70;
   plot stdres1*pred1;
run;

New Output
Output Statistics
       Dep Var   Predicted    Std Error               Std Error    Student
Obs          y       Value   Mean Predict   Residual   Residual   Residual
  1     7.3000      7.4515       0.1114      -0.1515      0.254     -0.597
  2     8.5000      8.6716       0.0835      -0.1716      0.264     -0.650
  3    10.1000     10.1358       0.1262      -0.0358      0.247     -0.145
  4     9.3000      8.6716       0.0835       0.6284      0.264      2.380
  5     6.7000      6.7194       0.1459      -0.0194      0.235    -0.0823
  6     9.2000      9.1597       0.0899       0.0403      0.262      0.154
  7    10.2000     10.3799       0.1380      -0.1799      0.240     -0.749
  8     6.7000      6.9634       0.1336      -0.2634      0.243     -1.086
  9     7.1000      6.9634       0.1336       0.1366      0.243      0.563
 10     9.9000      9.6478       0.1052       0.2522      0.256      0.985
 11     9.9000     10.1358       0.1262      -0.2358      0.247     -0.957

One should continue to remove the potential outliers and generate new models, residuals, etc., until reaching the final information on pages 6-7.

Normality of Residuals (add-on)
Normality test:

proc univariate normal;
   ods select TestsForNormality;
   var stdres;

Example
The UNIVARIATE Procedure
Variable: stdres (Studentized Residual)

Tests for Normality
Test                   --Statistic---    -----p Value------
Shapiro-Wilk           W      0.897717   Pr < W       0.2068
Kolmogorov-Smirnov     D      0.250982   Pr > D       0.0739
Cramer-von Mises       W-Sq   0.106686   Pr > W-Sq    0.0818
Anderson-Darling       A-Sq   0.577399   Pr > A-Sq    0.0989

Normality Test
1. H₀: errors are normally distributed
2. Hₐ: errors are not normally distributed
3. TS: Shapiro-Wilk W = 0.897717, p-value = 0.2068
4. RR: Reject H₀ if p-value < α = 0.05
5. Since p-value = 0.2068 is not < α = 0.05, do not reject H₀; it is OK to assume the errors are normally distributed.

Some Relationships
Sxy > 0 ⟺ β̂₁ > 0, r > 0
Sxy < 0 ⟺ β̂₁ < 0, r < 0
Sxy = 0 ⟺ β̂₁ = 0, r = 0

SOME MORE INFO
Total sum of squares = TSS = Syy = SSE (sum of squares of the error) + SSR (sum of squares due to the regression model).
- TSS is constant for a given set of data.
- SSE and SSR vary depending on the model: change the model, and SSE and SSR may/will change (but their sum is always constant = TSS).
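For the diving-reflex ANOVA output above, this decomposition can be verified directly (a worked check, not in the original):

$$
TSS = SSE + SSR: \qquad 144.10000 = 16.40653 + 127.69347.
$$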

TSS = Syy = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Syy − β̂₁Sxy
SSR = TSS − SSE
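As a numerical check using the diving-reflex quantities computed earlier (Syy = 144.10, Sxy = −165.5, β̂₁ ≈ −0.771562; a worked example, not in the original):

$$
SSE = S_{yy} - \hat{\beta}_1 S_{xy} = 144.10 - (-0.771562)(-165.5) \approx 144.10 - 127.6935 = 16.4065,
$$

which matches the Error sum of squares in the ANOVA table, and then SSR = 144.10 − 16.4065 ≈ 127.6935.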