Section 11: Quantitative analyses: Linear relationships among variables

MGMT617 Research Methods

© 2014 Australian Catholic University. ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced or used in any form or by any means (graphic, electronic, or mechanical, including photocopying, recording, taping, Web distribution, or information storage and retrieval systems) without the written permission of the publisher.

Disclaimer: No person should rely on the contents of this publication without first obtaining advice from a qualified professional person. This publication is distributed on the terms and understanding that: (1) the authors, consultants and editors are not responsible for the results of any actions taken on the basis of information in this publication, nor for any error in or omission from this publication; and (2) the publisher is not engaged in legal, accounting, professional or other advice or services. The publisher, and the authors, consultants and editors, expressly disclaim all and any liability and responsibility to any person, whether a purchaser or a reader of this publication or not, in respect of anything, and of the consequences of anything, done or omitted to be done by any such person in reliance, whether wholly or partially, upon the whole or any part of the contents of this publication. Without limiting the generality of the above, no author, consultant or editor shall have any responsibility for any act or omission of any other author, consultant or editor.
11.1 Linear relationship between two variables

Prescribed reading 11.1
Zikmund, WG, Babin, BJ, Carr, JC & Griffin, M 2013, Business research methods, 9th edn, South-Western, Cengage Learning, Mason, OH. Chapter 23.

Recall the use of a scatter plot as a guide to the linear relationship between two variables. Examples are as follows.

Example 1: No clear relationship

(Figure: scatter plot showing no clear pattern)

Example 2: Approximate linear relationship

(Figure: scatter plot showing points lying close to a straight line)
Example 3: Non-linear relationship

(Figure: scatter plot showing a curved, non-linear pattern)

Correlation coefficients

A linear relationship between two variables is assessed using correlation coefficients. If we have samples of n observations of variables X and Y, the sample correlation between them is calculated using the formula

r_xy = Σ (x_i − x̄)(y_i − ȳ) / ((n − 1) s_x s_y)

where the sum runs from 1 to n, x̄ and ȳ are the sample means, and s_x and s_y are the sample standard deviations of X and Y.

This is only sample data, so we need to test whether it gives evidence for a nonzero population correlation. We test the null hypothesis H0: ρ = 0 against the alternative hypothesis Ha: ρ ≠ 0. We do a t-test, using the calculation

t = r / √((1 − r²)/(n − 2)),

where t is distributed as T with n − 2 degrees of freedom.

Simple linear regression

If the scatter plot from two samples gives us reason to believe that there is a linear relationship between two variables, we can do a calculation that fits a straight line to the observations that is the best possible such line, in the sense that it minimises the sum of the squared deviations of the actual observation points in the scatter plot from the line. This is what Zikmund et al. (p. 571) call the Ordinary Least Squares (OLS) method.

To use the method, we need first to know what its assumptions are. The basic assumption is that the two variables are related by a formula of the form
Y = β0 + β1 X + error,

where at each point the errors are independent, with zero mean and all their standard deviations equal.

It is worth giving the formulae we use to produce the least squares regression line. The equation is

ŷ = b0 + b1 x, where b1 = r_xy s_y / s_x and b0 = ȳ − b1 x̄,

in which x̄ and ȳ are the sample means, s_x and s_y the sample standard deviations, and r_xy the sample correlation between X and Y.

Checking the assumptions

The assumption that the observations are independent can be checked against the sampling method. A simple random sample, for example, would satisfy this. One should also check the deviations from the line for normality and equal variances. A rough check can be made using a residual plot, which should show no real pattern, with deviations of approximately the same size over the different x values, that is, not systematically larger or smaller over the x values.

Testing for the significance of the linear relationship

We want to test H0: β1 = 0 against the usual Ha: β1 ≠ 0. One usually does the calculation using a computer package. Excel can be used to get the regression line, and the output contains the information needed for the significance test.

We look at the following example of output from Excel. Take the data set from Example 2 above, where the scatter plot indicated that there was a linear relationship. The output contains a first block, as follows.

Regression Statistics
Multiple R          0.774391
R Square            0.599681
Adjusted R Square   0.559649
Standard Error      0.884208
Observations        12
It contains Multiple R, which is actually the simple correlation coefficient between X and Y. The square of this is R Square, 0.599681, which means that about 59.97% of the variation in Y is explained by the variation in X.

The next block is an analysis of variance table for the assessment of the overall significance of the result.

ANOVA
              df    SS          MS          F          Significance F
Regression     1    11.71177    11.71177    14.98007   0.003107
Residual      10     7.818233    0.781823
Total         11    19.53

The value of F is highly significant. Information about the coefficients b0 and b1 is in a section below, as follows.

              Coefficients   Standard Error   t Stat     P-value
Intercept     1.550981       0.670335         2.313742   0.043231
X Variable 1  0.427125       0.110357         3.870409   0.003107

This tells us that the regression line has the equation

ŷ = 1.550981 + 0.427125 x

The coefficient of x is followed by the information from the t-test for the given hypothesis H0: β1 = 0 versus Ha: β1 ≠ 0. We get t = 3.870409, and P(T > 3.870409 or T < −3.870409) = 0.003107. So the result is highly significant and we reject the null hypothesis.

The Excel program can also give a guide to the rest of the assumptions. The residuals are the values y − ŷ, that is, the deviations of the observed y-values from the y-values predicted by the line. The program can plot these as follows.
(Figure: "X Variable 1 Residual Plot" — residuals plotted against X Variable 1)

These should be randomly positive or negative, and there should be no systematic variation in size as x increases. The plot is consistent with this requirement.

The residuals are also required to be normally distributed. A normal quantile plot for them is copied below.

(Figure: normal quantile plot of the residuals)

This is approximately a straight line. So overall one can conclude that the regression method could be validly used.

11.2 Multiple regression

Prescribed reading 11.2
Zikmund, WG, Babin, BJ, Carr, JC & Griffin, M 2013, Business research methods, 9th edn, South-Western, Cengage Learning, Mason, OH. Chapter 24, pp. 582–590.

Multiple regression can also be done in Excel.
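Before turning to several predictors, it may help to see the Section 11.1 calculations (the sample correlation, the fitted line, and the t statistic) collected in one place. The unit itself uses Excel; the following is only an illustrative sketch in Python, with made-up data:

```python
import math
from statistics import mean, stdev

def sample_correlation(x, y):
    """r_xy = sum of (x_i - x_bar)(y_i - y_bar), divided by (n - 1) * s_x * s_y."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

def least_squares_line(x, y):
    """Least squares fit: b1 = r_xy * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    r = sample_correlation(x, y)
    b1 = r * stdev(y) / stdev(x)
    b0 = mean(y) - b1 * mean(x)
    return b0, b1

def t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2)), compared with a T distribution on n - 2 df."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Made-up data lying close to a straight line (illustration only)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.4, 2.1, 2.3, 3.1, 3.4, 4.1, 4.4, 5.1]
r = sample_correlation(x, y)
b0, b1 = least_squares_line(x, y)
t = t_statistic(r, len(x))
```

As in the hand calculation, a large |t| relative to the T distribution on n − 2 degrees of freedom indicates a significant linear relationship.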
Assume that there are several independent variables x1, x2, …, xn, each of which makes a linear contribution to the dependent variable y. So we fit an equation of the following form:

ŷ = b0 + b1 x1 + b2 x2 + … + bn xn.

The assumptions are similar to those for simple regression. That is, the relation with each xi is assumed to be linear, and the residuals are assumed to be independent and normally distributed, with equal variances. The calculation is also the same, using the least squares criterion to give the coefficients.

In general, there is a choice of methods for doing multiple regression, but here one just needs to be aware of the fact that the contribution of each variable is assessed in the context of the contributions from the other variables. What this means is that related independent variables may not contribute together much more than one of them does alone.

Example

The variables are X1, X2 and Y, on 25 observations. We look first at the correlations among the three variables. These are in the table below.

      X2     Y
X1    0.72   0.56
X2           0.67

Note that X1 and X2 correlate quite strongly with each other, as well as with Y. We now do a multiple regression of Y on the Xi. The output is below.

Regression Statistics
Multiple R          0.682181
R Square            0.465371
Adjusted R Square   0.416769
Standard Error      1.329628
Observations        25
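These summary statistics hang together in a checkable way: for a regression with k predictors on n observations, the overall F statistic reported in the ANOVA table, and the Adjusted R Square, can both be recovered from R Square alone. A quick Python sketch (not part of the unit materials, which use Excel) using the standard formulae:

```python
def overall_f(r_square, n, k):
    """Overall regression F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), on (k, n - k - 1) df."""
    return (r_square / k) / ((1.0 - r_square) / (n - k - 1))

def adjusted_r_square(r_square, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_square) * (n - 1) / (n - k - 1)

f = overall_f(0.465371, 25, 2)            # about 9.575, as in the ANOVA table
adj = adjusted_r_square(0.465371, 25, 2)  # about 0.4168, as in the block above
```

Checks like these are a useful habit when reading package output, since they tie the separate blocks of the printout back to one underlying calculation.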
ANOVA
              df    SS          MS          F          Significance F
Regression     2    33.85558    16.92779    9.575029   0.00102
Residual      22    38.89402     1.76791
Total         24    72.7496

              Coefficients   Standard Error   t Stat     P-value
Intercept     2.233293654    0.694954746      3.213581   0.00402
X Variable 1  0.109961128    0.147046839      0.747797   0.46252
X Variable 2  0.42190633     0.170718337      2.471359   0.021679

The analysis of variance table gives the significance of the whole regression, which is very high. If we look at the table of coefficients, we see that the coefficient for X2 is significantly different from zero, but the coefficient for X1 is not. This means that, in the presence of X2, the variable X1 does not contribute significantly more.

Other methods

In Chapter 24, Zikmund et al. describe some other methods that can be used with multiple variables, but for these one would need a package designed primarily for statistics.
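Although the unit uses Excel for these calculations, the least squares fit behind multiple regression output like the above can be sketched directly. The following Python sketch (made-up data, not the data set in the example) solves the normal equations for the two-predictor case:

```python
from statistics import mean

def two_predictor_ols(x1, x2, y):
    """Fit y = b0 + b1*x1 + b2*x2 by least squares, via the normal equations on centred data."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12   # zero would mean x1 and x2 are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Data built to satisfy y = 1 + 2*x1 + 3*x2 exactly, so the fit recovers (1.0, 2.0, 3.0)
x1 = [1, 2, 3, 4]
x2 = [2, 1, 4, 3]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
b0, b1, b2 = two_predictor_ols(x1, x2, y)
```

The t-test for each coefficient then assesses that variable's contribution given the other variable, which is why strongly correlated predictors, as in the example above, can each look insignificant on its own.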