TOPIC 9: SIMPLE REGRESSION & CORRELATION

Basic Linear Relationships

Mathematical representation: Y = a + bX

X is the independent variable [the variable whose value we can choose, or the input variable].
Y is the dependent variable [the output variable whose value is determined by the relationship].
a is the constant.
b is the coefficient of X.

Graphical representation: [Graph: a straight line labelled Y = a + bX, crossing the Y axis at height a and rising b units for every 1-unit increase in X.]

In real situations, X and Y are replaced by more meaningful variables.
Example: A tradesperson charges $30 per hour plus a callout fee of $20. This can be represented mathematically as

C = 20 + 30T

where T is the time in hours and C is the total cost.

Graphically: [Graph: the line C = 20 + 30T, with T from 0 to 5 on the horizontal axis and C from 0 to 160 on the vertical axis.]

From either of the representations it is possible to determine that the total cost of a 4-hour job would be $140.
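The cost formula above can also be written as a one-line function. This is a minimal Python sketch of the tradesperson example; the function name total_cost is illustrative, not part of the notes:

```python
def total_cost(hours):
    """Total cost of a job: $20 callout fee plus $30 per hour (C = 20 + 30T)."""
    return 20 + 30 * hours

print(total_cost(4))  # a 4-hour job costs $140
```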
Simple Regression: Finding Basic Linear Relationships from Real Data

Example: Suppose we kept records of the time it takes (in hours) to load removal vans and the number of rooms in each house. We have the following sample of cases:

ROOMS   TIME
  3       4
  3       5
  4       5
  4       7
  5       6
  5       7
  6       6
  6       8
  7       7
  7       8

These can be plotted on a Scatter Diagram. Then, using a ruler, we can find a reasonably well-fitting line that passes through this data.

[Scatter diagram: Rooms (0 to 8) on the horizontal axis, Time (0 to 9) on the vertical axis, with a hand-fitted line of intercept a = 2.9 and slope b = 0.75.]

We can then estimate the values of a and b and obtain the mathematical representation below:

TIME = 2.9 + 0.75 ROOMS (approximately)
Alternatively: We can feed the raw data into a computer and run a Regression program which will find the values of a and b automatically. The typical output from such a program is given below. Before we believe and use any computer output, we must interpret and test it. This involves asking important questions and knowing what to look for:

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .780   .609       .560                .8874

ANOVA
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression    9.800            1    9.800        12.444   .008
   Residual      6.300            8     .787
   Total        16.100            9

Coefficients
                Unstandardised Coefficients    Standardised Coefficients
Model           B        Std. Error            Beta                        t       Sig.
1  (Constant)   2.800    1.031                                             2.716   .026
   ROOMS         .700     .198                  .780                       3.528   .008
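The key numbers in this output can be reproduced with the standard least-squares formulas. The following pure-Python sketch applies them to the removalist data above (variable names are illustrative):

```python
import math

# Raw removalist data from the example: (rooms, loading time in hours).
rooms = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
time_ = [4, 5, 5, 7, 6, 7, 6, 8, 7, 8]

n = len(rooms)
mean_x = sum(rooms) / n
mean_y = sum(time_) / n

# Sums of squared deviations and cross-products.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rooms, time_))
sxx = sum((x - mean_x) ** 2 for x in rooms)
syy = sum((y - mean_y) ** 2 for y in time_)

b = sxy / sxx             # slope (coefficient of ROOMS)
a = mean_y - b * mean_x   # intercept (the constant)

ss_regression = b * sxy             # explained sum of squares
ss_residual = syy - ss_regression   # residual sum of squares
r_square = ss_regression / syy
std_error = math.sqrt(ss_residual / (n - 2))  # standard error of the estimate

print(round(a, 1), round(b, 1))   # 2.8 0.7, matching the Coefficients table
print(round(r_square, 3))         # 0.609, matching the Model Summary
print(round(std_error, 4))        # 0.8874, matching the Model Summary
```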
Question 1: Does any linear relationship exist in that data?

To find the answer we must look in the ANOVA table for the Sig. or p-value, and do an F-test.

Ho: There is no linear relationship.
Ha: There is some linear relationship.
Alpha: .05
Rule: If the Sig. or p-value of F < Alpha, then conclude Ha.
Conclusion: Sig. = .008 < .05. Therefore there is some linear relationship.

Question 2: How strong is this relationship?

We look for R-SQUARE, the Coefficient of Determination. This is always somewhere between 0 and 1. Values close to 0 indicate a very weak relationship; values close to 1 suggest a very strong relationship. Other values in between can be described using appropriate language, e.g. 0.7 might be called moderately strong.

Statistical interpretation of R-square: An R-square value of 0.7 means that the independent variable accounts for 70% of the variation in the dependent variable. Another way of looking at it is that the independent variable supplies 70% of the information needed to accurately predict the dependent variable. In other words, the model is 70% complete.

In our removalist example: R-square = .609. The number of ROOMS accounts for 60.9% of the variation in TIME.
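R-square can be read straight out of the ANOVA table: it is the regression (explained) sum of squares divided by the total sum of squares. A quick check using the figures from the output above:

```python
# Sums of squares taken from the ANOVA table in the computer output.
ss_regression = 9.800
ss_total = 16.100

r_square = ss_regression / ss_total
print(round(r_square, 3))  # 0.609 - ROOMS explains 60.9% of the variation in TIME
```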
Question 3: What is the relationship?

To formulate it, we must find the constant and the coefficient of the independent variable in the B column of the computer output. We can then write down the full Model / Equation / Formula:

TIME = 2.8 + 0.7 ROOMS

This can be used for making predictions. For example, to find the predicted time to load the van when moving a 5-room house:

TIME = 2.8 + 0.7 x 5 = 6.3 HOURS

However, the answer we get is only an average or approximate value because the formula cannot predict perfectly. In practice, we must apply a safety margin to give us a prediction interval.

Prediction Interval

The safety margin we need to apply depends on:
1. Our desired level of confidence.
2. The position of our independent variable value in relation to the centre of our data.
3. The standard error of estimate.

When our sample is large and we are interpolating (as we generally should be), the following simplified formula may be used:

Y = (Predicted Y) ± 2 x (Standard Error of Estimate)

In our removalist example: TIME = 6.3 ± 2 x (.8874) = 6.3 ± 1.7748 hours, with 95% confidence.
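The prediction and its simplified interval can be sketched in a few lines of Python. This assumes the fitted values from the computer output above; the "± 2 standard errors" rule is the large-sample approximation described in the notes:

```python
# Fitted model from the regression output: TIME = 2.8 + 0.7 * ROOMS.
a, b = 2.8, 0.7
std_error = 0.8874  # standard error of the estimate

rooms = 5
predicted = a + b * rooms   # point prediction (about 6.3 hours)
margin = 2 * std_error      # simplified 95% safety margin
lower, upper = predicted - margin, predicted + margin

print(f"{predicted:.1f}")           # 6.3
print(f"{lower:.2f} to {upper:.2f}")  # 4.53 to 8.07 hours, with ~95% confidence
```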
Correlation

Correlation describes the degree of association between two variables without regard to which variable is independent or dependent, and with no intention of formulating the relationship or using it to make predictions.

Typical computer output:

Correlations
                               ROOMS     TIME
ROOMS   Pearson Correlation    1.000      .780**
        Sig. (2-tailed)         .         .008
        N                      10        10
TIME    Pearson Correlation     .780**   1.000
        Sig. (2-tailed)         .008      .
        N                      10        10
**. Correlation is significant at the 0.01 level (2-tailed).

The value of a correlation coefficient lies between -1 and +1. Negative values indicate a downward-sloping relationship. Positive values indicate an upward-sloping relationship. Values between 0 and either extreme measure how closely the actual data fits a straight line. These situations are illustrated below:

[Five scatter diagrams illustrating r = -1, r = -0.6, r = 0, r = 0.9 and r = 1.]
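The Pearson correlation of .780 in this output can be checked directly from the raw data with the standard formula r = Sxy / sqrt(Sxx * Syy). A pure-Python sketch (variable names are illustrative):

```python
import math

# Raw removalist data: (rooms, loading time in hours).
rooms = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
time_ = [4, 5, 5, 7, 6, 7, 6, 8, 7, 8]

n = len(rooms)
mean_x = sum(rooms) / n
mean_y = sum(time_) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rooms, time_))
sxx = sum((x - mean_x) ** 2 for x in rooms)
syy = sum((y - mean_y) ** 2 for y in time_)

r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
print(f"{r:.3f}")               # 0.780, matching the Correlations table
```

Note that for simple regression the correlation coefficient is the square root of R-square: sqrt(.609) = .780.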
Testing the Correlation:

Because our correlation coefficient is calculated only from a sample, we can never be certain that the same value applies to the population. In fact, regardless of the sample correlation, there might be no linear association in the population at all. Therefore the correlation must be tested.

Example: Suppose we obtained a correlation of 0.57 between the amount of time spent studying and the marks on a given assessment, using a sample of 100 students. Does this show, beyond reasonable doubt, that there is some (linear) connection between study time and results?

Ho: Correlation = 0
Ha: Correlation not = 0
Alpha: .05

We use a t-test with DF = N - 2. With r = .57 and n = 100, the two-tailed critical values are ±1.984 (.025 in each tail).

[Diagram: t-distribution with rejection regions (Ha) below -1.984 and above +1.984, and the acceptance region (Ho) between them.]

t = r x sqrt(n - 2) / sqrt(1 - r^2) = .57 x sqrt(98) / sqrt(1 - .57^2) = 6.87

Conclusion: Since 6.87 > 1.984, we reject Ho. There is some linear relationship between study time and results.

Notes:
# Correlations might not be linear, and may therefore look weaker than they really are.
# Do not confuse correlation with causality.
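The t statistic for this test is quick to compute. A minimal Python sketch of the study-time example above:

```python
import math

# Study-time example: sample correlation and sample size from the notes.
r, n = 0.57, 100
critical = 1.984  # two-tailed critical value for DF = 98 at alpha = .05

# Test statistic for Ho: population correlation = 0.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"{t:.2f}")          # 6.87
print(t > critical)        # True: reject Ho, some linear relationship exists
```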