Biostatistics 4: Trends and Differences

Dr. Jessica Ketchum, PhD. Email: McKinneyJL@vcu.edu

Objectives
1) Know how to assess the strength, direction, and linearity of relationships in a scatter plot.
2) Interpret Pearson's correlation coefficient, r.
3) Understand the relationship between r and the slope of a linear regression line.
4) Know the difference between dependent and independent variables.

I. Example

The following, taken from Essex-Sorlie (Medical Biostatistics & Epidemiology, Appleton & Lange, 1995, p. 246), provides some background and justification for using peak expiratory flow rate as a surrogate for FEV1.

A commonly used index of expiratory flow is FEV1, the amount of air an individual can expel in one second. Many clinicians suggest initiating bronchodilator drugs when FEV1 is less than 80% of the expected value for a given age, height, race, and gender. Because it requires full expiration, the measurement of FEV1 may provoke bronchoconstriction and increase symptoms. Further, FEV1 measurement must be made by a qualified respiratory therapist on special equipment that must be calibrated frequently. Although FEV1 is considered the gold standard of expiratory flow rates, other indices are available. One such index, peak expiratory flow rate, can be obtained with a mini-Wright peak flow meter; this device is inexpensive, lightweight, accurate, compact, and easy to use, even by young patients. Measuring peak flow on the mini-Wright does not require full expiration, which reduces the likelihood of symptom exacerbation. Some clinicians suggest teaching an asthmatic patient how to measure and use peak flow in order to determine when to initiate bronchodilator treatment. How can we be sure that peak flow is an adequate surrogate for FEV1? Does peak flow adequately predict FEV1?

Example

A study, conducted to ascertain the relationship between peak flow and FEV1, measured the peak flow and FEV1 of 97 patients, ages 6 to 18 years, seen by a pediatric allergist. A first step is to construct a scatter plot.

II. Scatter Plot

A. Scatter plot: a plot constructed by plotting pairs of data points on an X, Y coordinate plane.

1. Example: Figure 4.1 contains the scatter plot with peak flow plotted along the X (or horizontal) axis and FEV1 plotted along the Y (or vertical) axis. Does there appear to be a relationship between peak flow and FEV1?

Figure 4.1: Scatter Plot of FEV1 and Peak Flow
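A plot like Figure 4.1 takes only a few lines of code to produce. The following is a minimal sketch, not part of the original lecture: it assumes Python with matplotlib installed, and the two lists are hypothetical stand-in values, since the study's 97 measurements are not reproduced here.

```python
# Minimal scatter plot sketch (hypothetical data, not the study's 97 patients).
import matplotlib.pyplot as plt

peak_flow = [45, 60, 72, 85, 90, 110, 130]   # X: hypothetical peak flow values
fev1 = [55, 70, 75, 88, 92, 100, 115]        # Y: hypothetical FEV1 values

plt.scatter(peak_flow, fev1)
plt.xlabel("Peak Flow")
plt.ylabel("FEV1")
plt.title("Scatter Plot of FEV1 and Peak Flow")
plt.show()
```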

[Figure 4.1: scatter plot of FEV1 (vertical axis, roughly 20 to 140) against peak flow (horizontal axis, roughly 30 to 150), with a fitted line through the points.]

Although a picture may suggest that two variables are related, pictures can be deceiving. The line plotted through the points seems to indicate a trend. Could this trend just be the result of a chance relationship?

III. Correlation

A. Pearson's correlation coefficient: a measure of the strength of the linear relationship between X and Y. The true correlation, ρ_XY, is estimated from the data by the sample correlation r_XY:

r_XY = Slope × SD(X) / SD(Y)    (4.1)

where SD(X) is the standard deviation of the horizontal variable, SD(Y) is the standard deviation of the vertical variable, and Slope is the slope of the best-fitting straight line predicting Y from X.

Test tip: The strength and direction of the correlation are related to the slope of the linear trend.

The correlation coefficient is bounded above by +1 (perfect positive linear relationship) and below by -1 (perfect negative linear relationship). When the correlation is zero, there is no linear relationship between X and Y. Note: computing r only makes sense if the form of the relationship is linear, i.e., a straight line.

1. Because peak flow and FEV1 appear to be linearly related and in a positive direction, we anticipate that the sample correlation is greater than 0 but less than 1.

2. Example: SD(FEV1) = 20.02 and SD(PeakFlow) = 21.34; the slope is 0.61. Applying equation 4.1 to the data, the sample correlation is r_XY = 0.61 × 21.34/20.02 = 0.65.
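To make equation 4.1 concrete, here is a short sketch showing that the sample correlation computed directly agrees with the slope rescaled by the two standard deviations. It is illustrative only: it assumes NumPy is available, and the arrays are hypothetical stand-ins for the real peak flow/FEV1 pairs.

```python
# Sketch: Pearson's r computed directly and via equation 4.1 (r = slope * SD(X)/SD(Y)).
import numpy as np

x = np.array([45.0, 60.0, 72.0, 85.0, 90.0, 110.0, 130.0])  # peak flow (X), hypothetical
y = np.array([55.0, 70.0, 75.0, 88.0, 92.0, 100.0, 115.0])  # FEV1 (Y), hypothetical

# Direct computation of the sample correlation coefficient.
r_direct = np.corrcoef(x, y)[0, 1]

# Equation 4.1: slope of the least-squares line predicting Y from X, rescaled by the SDs.
slope, intercept = np.polyfit(x, y, deg=1)
r_from_slope = slope * x.std(ddof=1) / y.std(ddof=1)

print(r_direct, r_from_slope)  # the two values agree (up to rounding)
```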

3. The following list includes some important features of the sample correlation coefficient:

a. Pearson's correlation coefficient assumes that both X and Y are sampled from a normal distribution. If this assumption is in question, other correlation coefficients have been suggested, including Spearman's rank correlation and Kendall's tau.

b. The scientific hypothesis of a significant correlation (the correlation coefficient not equal to zero) leads to the statistical hypotheses

H_0: ρ_XY = 0 vs. H_A: ρ_XY ≠ 0    (4.2)

Note that testing whether the correlation is zero is equivalent to testing whether the slope is zero (flat). There is a formula for this test, but you do NOT need to remember it or be able to calculate it. A test of the hypothesis is given by the t statistic

t = r_XY √(n − 2) / √(1 − r_XY²)    (4.3)

which follows a t distribution with n − 2 degrees of freedom. For the FEV1 data, t = 8.13 with 95 degrees of freedom leads to rejection of the null hypothesis with p < 0.001. We conclude that there is a significant correlation between FEV1 and peak flow. (No, you don't need to know this formula.)
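For reference, the t statistic in equation 4.3 and its two-sided p-value can be computed in a couple of lines. This sketch assumes SciPy is available and plugs in the reported values (r = 0.65, n = 97); the resulting t is slightly larger than the 8.13 quoted above because 0.65 is a rounded correlation, but the conclusion (p < 0.001) is the same.

```python
# Sketch: t test for H0: rho = 0 using equation 4.3, with the reported r and n.
import math
from scipy import stats

r, n = 0.65, 97
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # equation 4.3
p = 2 * stats.t.sf(abs(t), df=n - 2)               # two-sided p-value, n - 2 df

print(round(t, 2), p)  # roughly t = 8.3, p well below 0.001
```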

c. The correlation is a measure of the strength of a linear relationship between X and Y. In other words, X and Y could have a strong curvilinear relationship and still have a correlation of zero. Thus it is not appropriate to test for a correlation (a linear relationship) when the relationship is nonlinear.

IV. Dependent & Independent Variables

In the first section, we asked the question: does peak flow (X) adequately predict FEV1 (Y)? When asking questions about the relationship between two variables, there is often a direction implied by the question. The question was "if I knew peak flow, could I predict FEV1?", not "if I knew FEV1, could I predict peak flow?"

In statistics, the dependent variable, Y, is the variable we are trying to predict. It is the outcome variable. Think of it this way: its value depends on other characteristics; that is, there are other variables that may be used to predict Y. In our case, the outcome variable is FEV1.

The variable used to predict the dependent variable is the independent variable, X. Other names for the independent variable include predictor variable and regressor variable; further synonyms are factor and risk factor. In our case, the independent variable is peak flow.

V. Regression

A. In high school algebra we were taught the equation for a line,

Y = aX + b    (4.4)

where a is the slope and b is the Y intercept.

Test tip: Any linear trend can be characterized by its slope and intercept.

The relationship between X and Y in equation 4.4 is deterministic; for any value of X, Y is known precisely. This is not the case for the FEV1 data, i.e., knowledge of peak flow (X) does not guarantee precise knowledge of FEV1. Regression techniques model the empirical (data-driven) relationship between X and Y by adding an error term to equation 4.4. We call these types of equations regression models. Two regression models are presented next.

B. Straight Line Equation: Simple Linear Regression

In medical statistics we don't use a for the slope and b for the intercept; we use betas (β). It's just the convention.

1. The straight-line simple linear regression equation is

Y = β_0 + β_1 X + ε    (4.5)

where Y is the dependent variable, β_0 is the true (but unknown) Y intercept, X is the known independent variable, β_1 is the true (but unknown) slope, and ε is the unknown error. (So β_0 and β_1 are the parameters in regression.) Basically, this is the familiar equation for a line plus some error (ε). In general we have n observations, so we have n of these equations:

Y_i = β_0 + β_1 X_i + ε_i    (4.6)

for i = 1, 2, ..., n. We call equation 4.6 the simple linear regression model.

2. Using software, we can obtain estimates of the slope and intercept, so you don't need to remember formulas. Just know the interpretation of the slope β_1 (for every one-unit change in X, how much does Y change?) and the interpretation of β_0 (if X is exactly 0, what is the predicted, best-guess value for Y?). Often the latter is not really an interesting question.

a. Example: Using the 97 FEV1/peak flow pairs, the estimated regression line is

Y_i = 35.43 + 0.61 X_i + ε_i    (4.7)

Figure 4.1 contains the scatter plot with this estimated regression line drawn through the data.

3. There are many tools available to address the adequacy of the prediction.

a. When the slope of the linear regression line is zero, X has no predictive value. A t test of the null hypothesis against the alternative hypothesis,

H_0: β_1 = 0 vs. H_A: β_1 ≠ 0    (4.8)

is a test of the scientific hypothesis: does peak flow predict FEV1? For the FEV1 data, the t test rejects this null hypothesis in favor of the alternative with p < 0.001. That is, in regression the question is: presuming there is no relationship (the slope is zero), do the data support this? In this example it is very unlikely to observe a slope this steep by chance alone, so we conclude that peak flow is a significant predictor of FEV1.
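As a sketch of how such estimates are obtained from software, the snippet below fits a simple linear regression with scipy.stats.linregress. The data arrays are hypothetical placeholders rather than the actual 97 pairs, so the printed intercept and slope will not match 35.43 and 0.61.

```python
# Sketch: fitting the simple linear regression Y = b0 + b1*X.
import numpy as np
from scipy import stats

peak_flow = np.array([45.0, 60.0, 72.0, 85.0, 90.0, 110.0, 130.0])  # X, hypothetical
fev1 = np.array([55.0, 70.0, 75.0, 88.0, 92.0, 100.0, 115.0])       # Y, hypothetical

fit = stats.linregress(peak_flow, fev1)
print("intercept (b0):", fit.intercept)            # predicted FEV1 when peak flow is 0
print("slope (b1):", fit.slope)                    # change in FEV1 per one-unit change in peak flow
print("p-value for H0: slope = 0:", fit.pvalue)    # the test in equation 4.8
```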

b. The quality of the regression (how strong the prediction is) is measured by the coefficient of determination. For simple linear regression, the coefficient of determination is r_XY², the square of the sample correlation coefficient.

Know this: R squared is interpreted as the proportion of the variability in Y that is explained by X. An R squared of 1 (100%) is a perfect relationship; an R squared of 0 (0%) is pure noise, i.e., no relationship.

For our example, r_XY² = (0.65)² = 0.42. Therefore, 42% of the variability in FEV1 is explained by peak flow, which leaves a good amount (58%) of variability unexplained. You should be able to assess how strong a relationship is from the R squared value.

The correlation (as well as the R squared) of X and Y is the same as the correlation of Y and X; direction doesn't matter. However, the slope predicting X from Y is different from the slope predicting Y from X. Even so, tests of the correlation and tests of the slope (regardless of direction) all give exactly the same p-value. So the direction of prediction does not change significance, just the interpretation.

c. The assumptions for regression are: the Y's are sampled from a normal distribution, the relationship between X and Y is linear, the spread of the points around the line has equal variance, and every observation (every X, Y pair) is measured independently of every other observation.

C. More Than One X: Multiple Regression

1. Peak flow only explained 42% of the variation in FEV1. Can we do better? Suppose we are interested in predicting FEV1 using three independent variables instead of just one. This technique is called multiple regression because we have more than one independent variable. The model in equation 4.6 is modified to accommodate the two extra independent variables. The multiple regression model becomes

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + β_3 X_i3 + ε_i    (4.9)

2. All of the inferences previewed for the simple linear regression model can be applied to the multiple regression model; however, they are beyond the scope of these lectures. The bottom line is that each X has its own slope, its own effect upon Y, which, when added to each of the other X's effects, results in a predicted Y value. (A short code sketch of such a model appears after the Bottom Line section below.)

VI. Bottom Line

A. Know all terms in bold font; you should be familiar with each.

B. Know how to interpret scatter plots. Is the strength of the relationship weak or strong? Positive or negative? Linear?

C. Know the concepts of regression and correlation. In the simple linear regression case, the question "Is the slope zero?" is the same question as "Is the correlation zero?"
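As promised above, here is a sketch of fitting a multiple regression model of the form in equation 4.9 by least squares. It is illustrative only: it assumes NumPy is available, and the three predictors (and the outcome) are simulated stand-ins, since the lecture does not specify which extra variables would be added to peak flow.

```python
# Sketch: multiple regression Y = b0 + b1*X1 + b2*X2 + b3*X3, fit by least squares.
import numpy as np

n = 20
rng = np.random.default_rng(0)
x1 = rng.uniform(40, 140, n)    # e.g. peak flow (hypothetical)
x2 = rng.uniform(6, 18, n)      # e.g. age (hypothetical)
x3 = rng.uniform(110, 180, n)   # e.g. height (hypothetical)
y = 10 + 0.5 * x1 + 1.2 * x2 + 0.1 * x3 + rng.normal(0, 5, n)  # simulated outcome

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(n), x1, x2, x3])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

b0, b1, b2, b3 = coefs
print("intercept:", b0)
print("slopes (one per predictor):", b1, b2, b3)  # each X has its own effect on Y
```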

Summary

The simplest form of association between two continuous variables is measured by correlation.
Correlation assumes normality, which requires linearity.
Test for significance by testing r = 0, or equivalently slope = 0.
Dependent variable = random outcome; independent variable = explanatory predictor.
These are the same thing: the straight-line model = the slope of the X, Y trend line = simple linear regression = Pearson correlation.
If the FORM is not a straight line, there are other models (and tests).
Multiple regression has multiple predictors.
The word multivariate refers to multiple outcomes.

VII. Homework Exercises

1. A study collected measurements of IgE antibodies (IU/ml) and skin test (ng/ml) from 23 subjects in the presence of Lol p 5, a purified allergen from grass pollen. A scatterplot of the relationship between IgE and skin test is shown in the figure below.

[Figure: bivariate fit of IgE (vertical axis, roughly 0 to 25) by skin test (horizontal axis, roughly -10 to 60).]

a. Is it appropriate to assess the linear relationship between IgE and skin test?
b. Is it appropriate to compute Pearson's correlation coefficient here?
c. What methods should be used to assess the relationship between IgE and skin test?

2. A study was conducted involving women (ages 34-87) who attended a hospital outpatient department for bone density measurements and underwent lumbar spine radiography. Among the data collected were the measures of anteroposterior (ABMD) and lateral (LBMD) bone mineral density. A scatterplot of the relationship between ABMD and LBMD is shown in the figure below.

The estimated correlation is 0.73 with p < 0.0001.

[Figure: bivariate fit of ABMD (vertical axis, roughly 0.5 to 1.5) by LBMD (horizontal axis, roughly 0.1 to 1).]

a. What can you say about the relationship between ABMD and LBMD?
i. Strength (significant or not significant)?
ii. Direction (positive or negative)?
iii. Form (linear or not linear)?

3. From the data in exercise 2 above, the estimated intercept and slope are 0.28 and 1.05, respectively.

a. For a woman with an LBMD of 0.5, what do we predict her ABMD to be?

4. The following plot of age and systolic blood pressure is from 20 healthy adults. The estimated slope is 0.45 and the estimated correlation is 0.97.

[Figure: scatter plot of systolic BP (vertical axis, roughly 115 to 145) against age (horizontal axis, roughly 10 to 80).]

a. Complete the following sentence: a one-unit increase in age is associated with a ____ unit increase in BP.

b. The estimated 95% CI for the slope is (0.39, 0.50). What can you say about the relationship between age and BP?
i. Strength (significant or not significant)?
ii. Direction (positive or negative)?
iii. Form (linear or not linear)?