Simple Linear Regression and Correlation

Introduction

Previously, our attention has been focused on a single variable, which we designated by x. Frequently, it is desirable to learn something about the relationship between two (or more) variables. For example, we might be interested in studying the relationship between

o cholesterol level and age,
o blood pressure and age,
o height and weight,
o the amount of exercise and heart rate,
o the concentration of an injected drug and heart rate,
o the consumption level of some nutrient and weight gain.

The nature and strength of the relationship between two variables may be examined by regression and correlation analyses, two related statistical techniques that serve different purposes.

Regression is used to discover the probable form of the relationship between two variables x and y by finding an appropriate equation. The ultimate objective of this method is usually to predict or estimate the value of one variable corresponding to a given value of another, i.e. to predict or estimate the value of y for a given value of x.

Correlation analysis, on the other hand, is concerned with measuring how strong the relationship between two variables x and y is, i.e. the degree of correlation between them.

SIMPLE LINEAR REGRESSION

In simple linear regression, the variable x is usually referred to as the explanatory or independent variable, and the other variable, y, is called the predicted or dependent variable; we speak of the regression of y on x. In the above examples, the investigator could predict cholesterol level and blood pressure from age, weight from height, heart rate from the concentration of injected drug, and so on. Thus cholesterol level, blood pressure, weight and heart rate would be the predicted or dependent variables, and age, height and the concentration of injected drug would be the explanatory or independent variables.
We assume that for each value of x there is a whole population of y values that is normally distributed, and that all of these y populations have equal variances. In simple linear regression, the object of the researcher's interest is the regression equation that describes the true relationship between the dependent variable y and the independent variable x.

Scatter diagram

A useful first step in studying the relationship between two variables is to prepare a scatter diagram of the data. The points are plotted by assigning values of the independent variable x to the horizontal axis and values of the dependent variable y to the vertical axis. The pattern made by the points on the scatter diagram usually suggests the basic nature and the strength of the relationship between the two variables.
BIOSTATISTICS NURS 3324

Example

Relationship between x and optical density (y)

x      Optical density (y)
3      0.10
4      0.20
4.5    0.25
5      0.32
5.5    0.33
6      0.35
6.5    0.47
7      0.49
7.5    0.53

Figure: scatter diagram of optical density (vertical axis) against x (horizontal axis).

In our example we can see that, in general, as x increases the optical density also increases, so the two variables have a positive relationship.

The least-squares line

We can also see that the points seem to be scattered around an invisible line that would describe the relationship between x and y. These impressions suggest that the relationship between the two variables may be described by a straight line crossing the y-axis near the origin and making approximately a 45 degree angle with the x-axis.

Thinking Challenge

It looks as though this line would be easy to draw by hand, but it is doubtful that the lines drawn by any two people would be exactly the same. In other words, for every person drawing such a line by eye, or freehand, we would expect a different line. Which line best describes the relationship between the variables? What is needed to obtain the desired line?
Answer

If the scatter diagram has a linear trend, we need a mathematical way to obtain the best line through the data. The method we employ is known as the method of least squares, and the resulting line is called the least-squares line. The reason for calling the method by this name will be explained in the discussion that follows.

Equation for straight line (Linear Equation)

Recall from algebra that the general equation for a straight line is

$$y = a + bx$$

where y is a value on the vertical axis and x is a value on the horizontal axis.
a is the point where the line crosses the vertical axis, and is referred to as the y-intercept.
b shows the amount by which y changes for each unit change in x, and is referred to as the slope of the line.

Figure: the line y = a + bx, with a = y-intercept and b = slope = (change in y) / (change in x).

To draw a line based on the equation, we need the numerical values of the constants a and b. Given these constants, we may substitute various values of x into the equation y = a + bx to obtain the corresponding values of y. The resulting points may then be plotted.

Computation

Finding the b-value

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

$$b = \frac{9(18.2) - (49)(3.04)}{9(284) - (49)^2} = \frac{14.84}{155} = 0.0957$$
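Finding the y-intercept

$$a = \bar{y} - b\bar{x}$$

where $\bar{y}$ = mean of the y values and $\bar{x}$ = mean of the x values.

$$\bar{y} = \frac{3.04}{9} = 0.3378 \qquad \bar{x} = \frac{49}{9} = 5.444$$

$$a = 0.3378 - 0.0957(5.444) = -0.1835$$

(The value -0.1835 is obtained by carrying the unrounded slope, 14.84/155, through the calculation.)

x       Optical density (y)   x^2      y^2      xy
3       0.10                  9        0.0100   0.300
4       0.20                  16       0.0400   0.800
4.5     0.25                  20.25    0.0625   1.125
5       0.32                  25       0.1024   1.600
5.5     0.33                  30.25    0.1089   1.815
6       0.35                  36       0.1225   2.100
6.5     0.47                  42.25    0.2209   3.055
7       0.49                  49       0.2401   3.430
7.5     0.53                  56.25    0.2809   3.975
Total   Σy = 3.04             Σx^2 = 284   Σy^2 = 1.1882   Σxy = 18.2   (Σx = 49)
Mean    x = 5.444, y = 0.3378

Alternatively,

$$a = \frac{\sum y - b\sum x}{n}$$

The equation for the least-squares line is:

$$\hat{y} = a + bx$$
$$\hat{y} = -0.1835 + 0.0957x$$
$$\hat{y} = 0.0957x - 0.1835$$

Note that we use the symbol $\hat{y}$ (y-hat) because this value is computed from the equation and is not an observed value of y. Now we can substitute various values of x into the equation to obtain the corresponding values of $\hat{y}$. The resulting points may be plotted.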
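The slope and intercept calculations above can be checked with a few lines of code. The following is only a minimal sketch in plain Python: the function name `least_squares_line` is our own choice for illustration, and the x and y lists are the nine data pairs as read from the computation table (treat them as an assumption if your copy of the table differs).

```python
# Data pairs from the worked example (x, optical density y).
# Assumption: values as read from the computation table.
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def least_squares_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x.

    Uses the same sums as the hand computation:
    b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2),  a = (Sy - b*Sx) / n.
    """
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = least_squares_line(x, y)
print(f"slope b = {b:.4f}, intercept a = {a:.4f}")
```

Because the code carries unrounded values, its results may differ from a hand calculation in the last decimal place.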
Example: Predicting y for a given x using the regression equation

Choose a value for x (within the range of the x values): x = 6.8.
Substitute the selected x in the regression equation: $\hat{y} = 0.0957(6.8) - 0.1835$.
Determine the corresponding value of $\hat{y}$: $\hat{y} = 0.0957x - 0.1835 = 0.4673$.

According to the equation, an x value of 6.8 would have a predicted optical density of about 0.47.

Drawing the least-squares line

Since any two points determine a straight line, we may select any two values in the range of x, compute the two corresponding $\hat{y}$ values, locate the points on a graph, and connect them with a straight line to obtain the line corresponding to the equation.

The following point will always be on the least-squares line: $(\bar{x}, \bar{y})$. Use 5.444 and 0.3378, the averages of the x's and the y's, respectively.

Try x = 4. Compute: $\hat{y} = 0.0957(4) - 0.1835 = 0.1993$

Sketching the Line Using the Points (5.444, 0.3378) and (4, 0.1993)

Figure: the least-squares line $\hat{y} = 0.0957x - 0.1835$ drawn through these two points on the scatter diagram.

What we have now obtained is called the best line for describing the relationship between our two variables. By what criterion is it considered best? Before the criterion is stated, let us examine the figure obtained. Note that the least-squares line does not pass through most of the observed points plotted on the scatter diagram. In other words, the observed points deviate from the line by varying amounts.
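The prediction step and the deviations of the observed points from the line can be sketched in code as follows. This is an illustration only: the helper name `predict` is hypothetical, the data lists are the nine pairs as read from the example, and the coefficients are recomputed without rounding, so the printed values can differ slightly from the hand-rounded ones.

```python
# Data pairs from the worked example (assumed as read from the table).
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]
n = len(x)

# Least-squares coefficients from the usual sums.
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n


def predict(x_value):
    """Return y-hat = a + b*x for a single x value."""
    return a + b * x_value


x_bar, y_bar = sum_x / n, sum_y / n

# Two points are enough to draw the line: the mean point and x = 4.
print(f"y-hat at x = 6.8 : {predict(6.8):.4f}")
print(f"y-hat at x = 4   : {predict(4):.4f}")
print(f"y-hat at x-bar   : {predict(x_bar):.4f} (y-bar = {y_bar:.4f})")

# Vertical deviation of each observed point from the line.
deviations = [yi - predict(xi) for xi, yi in zip(x, y)]
for xi, yi, d in zip(x, y, deviations):
    print(f"x = {xi:<4} y = {yi:.2f}  deviation = {d:+.4f}")
```

The deviations have both signs and sum to zero (up to rounding), which is why the mean point always lies on the line.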
Figure: vertical deviations $(y_i - \hat{y}_i)$ of the observed points from the least-squares line.

The line that we have drawn is best in this sense: the sum of the squared vertical deviations of the observed data points $(y_i)$ from the least-squares line is smaller than the sum of the squared vertical deviations of the observed data points from any other line.

CORRELATION

Pearson's Correlation coefficient r

1. Pearson's correlation coefficient measures the strength of the relationship between two numerical variables, represented as x and y.
2. The correlation coefficient is denoted by r and is calculated using the formula:

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$

Computation Table

x       Optical density (y)   xy       x^2      y^2
3       0.10                  0.300    9        0.0100
4       0.20                  0.800    16       0.0400
4.5     0.25                  1.125    20.25    0.0625
5       0.32                  1.600    25       0.1024
5.5     0.33                  1.815    30.25    0.1089
6       0.35                  2.100    36       0.1225
6.5     0.47                  3.055    42.25    0.2209
7       0.49                  3.430    49       0.2401
7.5     0.53                  3.975    56.25    0.2809
Σx = 49   Σy = 3.04           Σxy = 18.2   Σx^2 = 284   Σy^2 = 1.1882

$$r = \frac{9(18.2) - (49)(3.04)}{\sqrt{9(284) - (49)^2}\,\sqrt{9(1.1882) - (3.04)^2}} = \frac{14.84}{\sqrt{155}\,\sqrt{1.4522}} = 0.9891 \approx 0.99$$
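The computation of r from the column totals can be reproduced with a short script. This is a sketch under stated assumptions: the function name `pearson_r` is our own, and the data lists are the nine pairs as read from the computation table.

```python
import math

# Data pairs from the worked example (assumed as read from the table).
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def pearson_r(x, y):
    """Pearson's r computed from the raw sums, as in the hand formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(
        n * sum_y2 - sum_y ** 2
    )
    return numerator / denominator


r = pearson_r(x, y)
print(f"r = {r:.4f}")
```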
Coefficient of Correlation Values

The statistic r has the following properties:

1. r measures the extent of linear association between two variables.
2. r has a value between -1 and +1.
3. r = +1 if and only if all the observations are on a straight line with positive slope.
4. r = -1 if and only if all the observations are on a straight line with negative slope.
5. r tends to be close to zero if there is no linear association between x and y.
6. Although there is no fixed rule for interpreting the strength of a correlation, we will say that the correlation is
   Strong if |r| is 0.8 or above,
   Moderate if |r| is below 0.8 but above a lower cutoff,
   Weak if |r| is below that lower cutoff.

Coefficient of determination or r-squared ($r^2$)

Sometimes the correlation is squared ($r^2$) to form a useful statistic called the coefficient of determination or r-squared.

$r^2 = 1.0$ means that a given value of one variable perfectly predicts the value of the other variable.
$r^2 = 0$ means that knowing either variable does not predict the other variable.
The higher the $r^2$ value, the stronger the correlation between the two variables.

The coefficient of determination expresses the proportion of the variance in one variable that is accounted for, or explained, by the variance in the other variable. So, if a study finds a correlation (r) of 0.40 between salt intake and blood pressure, it could be concluded that 0.40 × 0.40 = 0.16, or 16%, of the variance in blood pressure in this study is accounted for by variance in salt intake.

In the above example, approximately 98 percent (0.9891 × 0.9891 = 0.978) of the variation in optical density is accounted for by variation in x, and about 2% is explained by other causes.
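The reading of r-squared as a proportion of explained variance can be illustrated numerically. The sketch below is not part of the original computation: it fits the least-squares line to the example data (the nine pairs as read from the table, an assumption), computes the share of the total variation in y that the line accounts for as 1 - SSE/SST, and compares it with r squared. The names `sse`, `sst` and `explained` are ours.

```python
import math

# Data pairs from the worked example (assumed as read from the table).
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)

# Least-squares line and Pearson's r from the same sums.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n
r = (n * sum_xy - sum_x * sum_y) / (
    math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
)

y_bar = sum_y / n
# SST: total variation of y about its mean.
sst = sum((yi - y_bar) ** 2 for yi in y)
# SSE: variation left over after fitting the line (squared deviations).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

explained = 1 - sse / sst
print(f"r^2           = {r ** 2:.4f}")
print(f"1 - SSE / SST = {explained:.4f}")
```

For a simple linear regression with an intercept the two printed quantities agree, which is the sense in which about 98% of the variation in optical density is "accounted for" in this example.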
Figure: Scatter plots illustrating how the correlation coefficient, r, is a measure of the linear association between two variables.