Biostatistics 4: Trends and Differences
Dr. Jessica Ketchum, PhD. Email: McKinneyJL@vcu.edu

Objectives
1) Know how to see the strength, direction, and linearity of relationships in a scatter plot.
2) Interpret Pearson's correlation coefficient, r.
3) Understand the relationship between r and the slope of a linear regression line.
4) Know the difference between dependent and independent variables.

I. Example: The following is taken from Essex-Sorlie (Medical Biostatistics & Epidemiology, Appleton & Lange, 1995, p. 246). It provides some background and justification for using peak expiratory flow rate as a surrogate for FEV1.

A commonly used index of expiratory flow is FEV1, the amount of air an individual can expel in one second. Many clinicians suggest initiating bronchodilator drugs when FEV1 is less than 80% of the expected value for a given age, height, race and gender. Because it requires full expiration, the measurement of FEV1 may provoke bronchoconstriction and increase symptoms. Further, FEV1 measurement must be made by a qualified respiratory therapist on special equipment that must be calibrated frequently. Although FEV1 is considered the gold standard of expiratory flow rates, other indices are available. One such index, peak expiratory flow rate, can be obtained on a mini-Wright peak flow meter; this device is inexpensive, lightweight, accurate, compact, and easy to use, even by young patients. Measuring peak flow on the mini-Wright does not require full expiration; this reduces the likelihood of symptom exacerbation. Some clinicians suggest teaching an asthmatic patient how to measure and use peak flow, in order to determine when to initiate bronchodilator treatment.

How can we be sure that peak flow is an adequate surrogate for FEV1? Does peak flow adequately predict FEV1?
A study, conducted to ascertain the relationship between peak flow and FEV1, measured the peak flow and FEV1 of 97 patients, ages 6 to 18 years, seen by a pediatric allergist. A first step is to construct a scatter plot.

II. Scatter Plot
A. Scatter plot: a plot constructed by plotting pairs of data points on an X, Y coordinate plane.
1. Example: Figure 4.1 contains the scatter plot with peak flow plotted along the X (or horizontal) axis and FEV1 plotted along the Y (or vertical) axis. Does there appear to be a relationship between peak flow and FEV1?

Figure 4.1: Scatter Plot of FEV1 and Peak Flow
[Figure: scatter plot with FEV1 on the vertical axis and Peak Flow on the horizontal axis, with a line drawn through the points]

Although a picture could suggest that two variables are related, pictures can be deceiving. The line plotted through the points seems to indicate a trend. Could this trend just be the result of a chance relationship?

III. Correlation
A. Pearson's correlation coefficient: a measure of the strength of the linear relationship between X and Y. The true correlation, $\rho_{XY}$, is estimated from the data by the sample correlation $r_{XY}$:

$r_{XY} = \text{Slope} \times \dfrac{SD(X)}{SD(Y)}$   (4.1)

where SD(X) is the standard deviation of the horizontal variable, SD(Y) is the standard deviation of the vertical variable, and the Slope is that of the best-fitting straight line predicting Y from X.

Test tip: The strength and direction of the correlation are related to the slope of the linear trend.

The correlation coefficient is bounded above by +1 (perfect positive linear relationship) and below by -1 (perfect negative linear relationship). When the correlation is zero, there is no linear relationship between X and Y.

Note: Computing r only makes sense if the form of the relationship is linear (a straight line).

1. Because peak flow and FEV1 appear to be linearly related and in a positive direction, we anticipate that the sample correlation is greater than 0 but less than 1.
2. Example: SD(FEV) = 20.02 and SD(PeakFlow) = 21.34. The slope is 0.61. So, applying equation 4.1 to the data, the sample correlation is $r_{XY} = 0.61 \times 21.34 / 20.02 = 0.65$.
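As a quick check of equation 4.1, the arithmetic in the example can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `correlation_from_slope` is made up for illustration, and the inputs are the rounded summary values quoted in the example, not the raw study data.

```python
def correlation_from_slope(slope, sd_x, sd_y):
    """Equation 4.1: r = slope * SD(X) / SD(Y)."""
    return slope * sd_x / sd_y


# Rounded summary values quoted in the FEV1 / peak flow example.
sd_fev1 = 20.02       # SD(Y), the vertical variable
sd_peak_flow = 21.34  # SD(X), the horizontal variable
slope = 0.61          # slope of the line predicting FEV1 from peak flow

r = correlation_from_slope(slope, sd_peak_flow, sd_fev1)
print(round(r, 2))  # 0.65
```

Because the two standard deviations are nearly equal here, the correlation ends up close to the slope itself; with very different spreads in X and Y the two numbers can differ a lot even though they always share the same sign.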
3. The following list includes some important features of the sample correlation coefficient:
a. Pearson's correlation coefficient assumes that both X and Y are sampled from a normal distribution. If this assumption is in question, other correlation coefficients have been suggested, including Spearman's rank correlation and Kendall's tau.
b. The scientific hypothesis of significant correlation (the correlation coefficient is not equal to zero) leads to the statistical hypotheses

$H_0: \rho_{XY} = 0$ vs. $H_A: \rho_{XY} \neq 0$   (4.2)

Note that testing whether the correlation is zero is equivalent to testing whether the slope is zero (flat). There is a formula for testing this, but you do NOT need to remember it or be able to calculate it. A test of this hypothesis is given by the t statistic

$t = \dfrac{r_{XY}\sqrt{n-2}}{\sqrt{1 - r_{XY}^2}}$   (4.3)

which follows a t distribution with n - 2 degrees of freedom. For the FEV1 data, t = 8.13 with 95 degrees of freedom leads to rejection of the null hypothesis with p < 0.001. We conclude that there is a significant correlation between FEV1 and peak flow. (No, you don't need to know this formula.)
c. The correlation is a measure of the strength of a linear relationship between X and Y. In other words, X and Y could have a strong curvilinear relationship and still have a correlation of zero. Thus it is not appropriate to test for a correlation (linear relationship) when the relationship is nonlinear.

IV. Dependent & Independent Variables
In the first section, we asked the question: Does peak flow (X) adequately predict FEV1 (Y)? When asking questions about relationships between two variables, it's often the case that there is a direction implied by the question: If I knew peak flow, could I predict FEV1? The question was not: If I knew FEV1, could I predict peak flow?

In statistics, the dependent variable, Y, is the variable that we are trying to predict. It's the outcome variable. Think of it this way: its value depends on other characteristics. That is, there are other variables that may be used to predict Y. For our case, the outcome variable is FEV1.

The variable that is used to predict the dependent variable is the independent variable, X. Other names for the independent variable include predictor variable and regressor variable. Other synonyms are factor and risk factor. For our case, the independent variable is peak flow.

V. Regression
A. In our high school algebra class we were taught the equation for a line:

$Y = aX + b$   (4.4)

where a is the slope and b is the Y intercept.

Test tip: Any linear trend can be characterized by its slope and intercept.

The relationship between X and Y in equation 4.4 is deterministic; for any value of X, Y is known precisely. This is not the case for the FEV1 data, i.e., knowledge of peak flow (X) does not guarantee precise knowledge of FEV1. Regression techniques model the empirical (data-driven) relationship between X and Y by adding an error term to equation 4.4. We call these types of equations regression models. Two regression models are presented next.
B.
Straight Line Equation: Simple Linear Regression
In medical statistics, we don't use a for the slope and b for the intercept; we use betas ($\beta$). It's just the convention.
1. The straight line simple linear regression equation is

$Y = \beta_0 + \beta_1 X + \varepsilon$   (4.5)

where Y is the dependent variable, $\beta_0$ is the true (but unknown) Y intercept, X is the known independent variable, $\beta_1$ is the true (but unknown) slope, and $\varepsilon$ is the unknown error. (So, $\beta_0$ and $\beta_1$ are the parameters in regression.) Basically, this is the simple equation for the line we are familiar with plus some error ($\varepsilon$). Actually, in general we have n observations, so we have n of these equations, or

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$   (4.6)
for i = 1, 2, ..., n. We call equation 4.6 the simple linear regression model.
2. Using software, we can obtain estimates of the slope and intercept, so you don't need to remember formulas. Just know the interpretation of the slope $\beta_1$: for every one-unit change in X, how much does Y change? And know the interpretation of $\beta_0$: if X is (exactly) 0, what is the predicted (best guess) value for Y? Often this is really not an interesting question.
a. Example: Using the 97 FEV1/peak flow pairs, the estimated regression line is

$Y_i = 35.43 + 0.61 X_i + \varepsilon_i$   (4.7)

Figure 4.1 contains the scatter plot with this estimated regression line drawn through the data.
3. There are many tools available to address the adequacy of the prediction.
a. When the slope of the linear regression line is zero, X has no predictive value. A t test of the null hypothesis versus the alternative hypothesis

$H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$   (4.8)

is a test of the scientific hypothesis: Does peak flow predict FEV1? For the FEV1 data, the t test rejects this null hypothesis in favor of the alternative with p < 0.001. That is, in regression the question is: If I presume that there is no relationship (the slope is zero), do the data support this? In this example, it's very unlikely to observe a slope this steep by chance alone. So we conclude that peak flow is a significant predictor of FEV1.
b. The quality of the regression (how strong the prediction is) is measured by the coefficient of determination. For simple linear regression, the coefficient of determination is $r_{XY}^2$, the square of the sample correlation coefficient. Know this: R square is interpreted as the proportion of the variability in Y that is explained by X. An R square of 1 (100%) is a perfect relationship and an R square of 0 (0%) is pure noise (no relationship).

For our example, this is $r_{XY}^2 = (0.65)^2 = 0.42$. Therefore, 42% of the variability in FEV1 is explained by peak flow. So there is still a good amount (58%) of variability not explained by peak flow. You should be able to assess how strong a relationship is by the R square value.

The correlation (as well as the R square) of X and Y is the same as the correlation of Y and X; direction doesn't matter. However, the slope predicting X from Y is different from the slope predicting Y from X. Even so, tests of the correlation and tests of the slope (regardless of direction) all give the exact same p value. So the direction of prediction does not change the significance, just the interpretation.
c. The assumptions for regression are: the Y's are sampled from a normal distribution, the relationship between X and Y is linear, the spread of the points around the line has equal variance, and every observation (every X, Y pair) is measured independently of (unrelated to) every other observation.

C. More Than One X: Multiple Regression
1. Peak flow explained only 42% of the variation in FEV1. Can we do better? Suppose we are interested in predicting FEV1 using three independent variables instead of just one. This technique is called multiple regression because we have more than one independent variable. The model in equation 4.6 is modified to accommodate the two extra independent variables. The multiple regression model becomes

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i$   (4.9)

2.
All of the inferences previewed for the simple linear regression model can be applied to the multiple regression model. However, all are beyond the scope of these lectures. The bottom line is that each X has its own slope (its own effect upon Y) that, when added to each of the other X's effects, will result in a predicted Y value.

VI. Bottom Line
A. Know all terms in bold font; you should be familiar with each.
B. Know how to interpret scatter plots. Is the strength of the relationship weak or strong? Positive or negative? Linear?
C. Know the concepts of regression and correlation. In the simple linear regression case, the question "Is the slope zero?" is the same question as "Is the correlation zero?"
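The equivalence in point C can be checked numerically. The sketch below uses only the Python standard library and a small set of hypothetical (X, Y) pairs invented for illustration (they are NOT the FEV1 study data). It computes the least-squares slope, the Pearson correlation, the t statistic from equation 4.3, and the t statistic for the slope (slope divided by its standard error, a standard formula that is not given in these notes), and then prints them side by side.

```python
import math
import statistics

# Hypothetical (X, Y) pairs, made up purely for illustration; NOT the study data.
x = [62, 70, 75, 81, 88, 93, 99, 104, 110, 118]
y = [71, 80, 70, 92, 85, 101, 88, 97, 110, 103]
n = len(x)

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares slope and intercept for predicting Y from X.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Pearson correlation, computed directly and via equation 4.1.
r = sxy / math.sqrt(sxx * syy)
r_from_slope = slope * statistics.stdev(x) / statistics.stdev(y)

# t statistic for H0: correlation = 0 (equation 4.3).
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# t statistic for H0: slope = 0, i.e. the slope divided by its standard error.
residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
se_slope = math.sqrt(residual_ss / (n - 2) / sxx)
t_from_slope = slope / se_slope

# Slope for predicting X from Y: a different number, but the same correlation.
slope_x_on_y = sxy / syy

print(f"r = {r:.3f}, r via eq. 4.1 = {r_from_slope:.3f}, R square = {r ** 2:.3f}")
print(f"t from r = {t_from_r:.3f}, t from slope = {t_from_slope:.3f}")
print(f"slope Y on X = {slope:.3f}, slope X on Y = {slope_x_on_y:.3f}")
```

The two t statistics agree (up to floating-point rounding), which is why the test of the correlation and the test of the slope give the same p value. The two slopes differ, yet their product equals R square, so swapping the roles of X and Y changes the interpretation but not the strength or significance of the relationship.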
Summary
Simplest form of association between two continuous variables is measured by correlation
  Assumes normality
  Which requires linearity
  Test for significance by testing r = 0 or slope = 0
Dependent variable = random outcome
Independent variable = explanatory predictor
Same: straight line model = slope of X, Y trend line = simple linear regression = Pearson correlation
If the FORM is not a straight line, there are other models (and tests)
Multiple regression has multiple predictors
The word multivariate refers to multiple outcomes

VII. Homework Exercises
1. A study collected measurements of IgE antibodies (IU/ml) and skin test (ng/ml) from 23 subjects in the presence of Lol p 5, a purified allergen from grass pollen. A scatterplot of the relationship between IgE and skin test is shown in the figure below.

[Figure: Bivariate Fit of IGE By SkinTest, with IGE on the vertical axis and SkinTest on the horizontal axis]

a. Is it appropriate to assess the linear relationship between IgE and skin test?
b. Is it appropriate to compute Pearson's correlation coefficient here?
c. What methods should be used to assess the relationship between IgE and skin test?

2. A study was conducted involving women (ages 34 to 87) who attended a hospital outpatient department for bone density measurements and underwent lumbar spine radiation. Among the data collected were the measures for the anteroposterior (ABMD) and lateral (LBMD) bone mineral density. A scatterplot of the relationship between
ABMD and LBMD is shown in the figure below. The estimated correlation is 0.73 with p < 0.0001.

[Figure: Bivariate Fit of ABMD By LBMD, with ABMD on the vertical axis and LBMD on the horizontal axis]

a. What can you say about the relationship between ABMD and LBMD?
i. Strength (significant or not significant)?
ii. Direction (positive or negative)?
iii. Form (linear or not linear)?

3. From the data in exercise 2 above, the estimated intercept and slope are 0.28 and 1.05, respectively.
a. For a woman with an LBMD of 0.5, we predict her ABMD to be what?

4. The following plot of age and systolic blood pressure is from 20 healthy adults. The estimated slope is 0.45 and the estimated correlation is 0.97.

[Figure: scatter plot with BP on the vertical axis and Age on the horizontal axis]

a. Complete the following sentence: A unit increase in age is associated with a _____ unit increase in BP.
b. The estimated 95% CI on the slope is (0.39, 0.50). What can you say about the relationship between age and BP?
i. Strength (significant or not significant)?
ii. Direction (positive or negative)?
iii. Form (linear or not linear)?