BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression


Introduction to Correlation and Regression

The procedures discussed in the previous ANOVA labs are most useful in cases where we are interested in testing hypotheses about differences in the locations of several populations in terms of one or more "factors" which represent discrete levels. However, we may also be interested in examining the relationship between random variables Y and X, where both X and Y are continuous measures on each subject sampled from a single population. When there are just two such variables, the relationship is known as a bivariate relationship. If there is no implied relationship in which, say, variable Y "depends" on variable X, then we are just asking whether the two variables Y and X are associated, and we calculate correlation coefficients to determine the strength of this association. However, if the two variables are related in such a way that the value of one variable, X, is useful in predicting the value of the other variable, Y, then we can explore the "regression" of variable Y on variable X by fitting a linear model. Correlation and linear regression analyses are useful in evaluating the association between variables and expressing the nature of their relationship.

Correlation

Correlation measures the strength of the association between two variables, say Y and X. Correlation is related to regression, but correlation analyses make different assumptions about the data. First, in correlation there is no independent or dependent variable, so one is not predicting Y from X as in regression. The Pearson product-moment linear correlation coefficient assumes the data are independent and bivariate normal - that is, that the joint probability distribution of (Y, X) is bivariate normal. The formula for Pearson's r is:

$$ r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\left[ \sum_{i=1}^{n} (y_i - \bar{y})^2 \, \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{1/2}} $$

Pearson's r takes on values between -1 and +1. Values near 0 indicate no correlation (no association); values near +1 indicate a strong positive association in which Y and X increase together; and values near -1 indicate a strong negative association in which the value of Y goes down as the value of X goes up, or vice versa. Significance tests are available to establish that the estimated correlation is unlikely to have arisen by chance at some α level.
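As a concrete check on the formula above, Pearson's r can be computed directly from the deviations and compared with R's built-in cor function. This is a minimal sketch using made-up vectors, not the lab data:

# Hypothetical example vectors (not the kenyabees data)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 1.8, 3.4, 3.0, 4.9, 5.2)

# Numerator: sum of cross-deviations; denominator: square root of the
# product of the two sums of squared deviations
num <- sum((y - mean(y)) * (x - mean(x)))
den <- sqrt(sum((y - mean(y))^2) * sum((x - mean(x))^2))
num / den

# Should agree with the built-in calculation
cor(x, y, method = "pearson")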

There are also non-parametric correlation coefficients. The most commonly used is Spearman's rho, ρ. To calculate Spearman's ρ, first rank the Y and X values separately, and then calculate the difference in the Y and X ranks for each subject (d_i = R_y - R_x). Spearman's ρ is then:

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

In R, one can apply the functions cor or cor.test to calculate Pearson's, Spearman's, or Kendall's correlation coefficient. The function cor does not provide a significance test, but cor.test does. The necessary arguments are cor.test(x, y, method="pearson"); alternatively, insert "spearman" or "kendall" for the method argument.

For example, using the data file kenyabees.csv we can perform a correlation analysis computing Pearson's r or Spearman's ρ. The data represent the coefficient of variation in bee abundance (CVN) at pan traps on farms spread across several regions in Kenya. The coefficient of variation is calculated as CV = (s / x̄) × 100 and is a measure of variability that attempts to remove the fact that data with high means tend to have higher variances, in order to compare variability among samples with different means. The variable CTYPE is the number of different crop species planted at each farm. Zero indicates that the farm had no planted crops, only pasture.
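To make the CV formula concrete, here is a small sketch computing a coefficient of variation for a hypothetical vector of counts (the values are made up, not taken from kenyabees.csv):

# Hypothetical bee counts from repeated pan-trap samples at one farm
counts <- c(12, 7, 22, 15, 9, 18)

# CV = (s / xbar) * 100: the standard deviation scaled by the mean,
# expressed as a percentage
(sd(counts) / mean(counts)) * 100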

Read in the data and obtain a scatter plot.

dat = read.csv("k:/biometry/biometry-fall-2015/lab9/kenyabees.csv", header=TRUE)
head(dat)
plot(dat$ctype, dat$cvn, xlab="Number of Crop Species", ylab="CV Bees",
     lwd=1, cex.lab=1.25, cex.axis=1.25, cex=1.5)

There is a trend toward lower values of the CV in bee abundance on farms with more crop species. However, it is difficult to tell how strong the trend is. Now calculate Pearson's r using cor.test.

cor.test(dat$ctype, dat$cvn, na.rm=TRUE, method="pearson")

Pearson's product-moment correlation

data: dat$ctype and dat$cvn
t = , df = 93, p-value =
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
sample estimates:
cor

Now calculate Spearman's ρ using cor.test.

cor.test(dat$ctype, dat$cvn, method="spearman")

Warning in cor.test.default(dat$ctype, dat$cvn, method = "spearman"):
Cannot compute exact p-value with ties

Spearman's rank correlation rho

data: dat$ctype and dat$cvn
S = , p-value =
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho

Neither correlation coefficient is significant at α = 0.05, but the size of the Pearson's r is suggestive of a trend.
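Since cor.test returns a list, the individual quantities printed above can also be extracted directly; a minimal sketch (assuming dat has been read in as above):

# Store the test object rather than just printing it
ct <- cor.test(dat$ctype, dat$cvn, method="pearson")

ct$estimate   # the correlation coefficient r
ct$statistic  # the t statistic
ct$p.value    # the p-value
ct$conf.int   # the 95 percent confidence interval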

If one just wanted the value of the correlation coefficient, the cor function could be used to calculate Pearson's, Spearman's, or Kendall's correlation coefficient. Note, however, that cor does not automatically discard cases with incomplete or missing data as cor.test does. One must specify the "use" argument to be "complete.obs".

cor(dat$ctype, dat$cvn, use="complete.obs", method="pearson")

Regression

In regression, Y is termed the dependent variable and X the independent variable, and one builds a model using X to predict values of Y. For example, consider the relationship between crop yield and precipitation. Yield (Y) is a function of precipitation (X), since we hypothesize that water availability affects plant growth, but on average plant growth does not affect precipitation. Using linear regression, we can quantify the observed relationship between the two variables. We might ask if there is a significant regression of yield on precipitation, indicating that yield can be predicted from knowledge of precipitation, or if no regression exists and yield cannot be predicted from knowledge of precipitation.

In bivariate regression, we assume that the relationship between the variables can be described by a straight line. The line relating two variables X and Y is described by the equation:

$$ Y = b_0 + b_1 X + \varepsilon $$

where b_0 is called the intercept, corresponding to the point where X = 0 and the line intercepts the Y axis, and b_1 is the slope, the change in Y per unit change in X. The independent variable, X, is used to predict Y, the dependent variable. b_0 and b_1 are the regression coefficients, the parameters of the line which fix and define the linear relationship between Y and X. ε is the "error", the component of the variation in the values of Y that cannot be predicted by the regression of Y on X. ε arises both because the fit of the regression model to the data may be inadequate and because there is inherent variability in the values of Y observed at each value of X. The differences (errors) between the actual values of Y and the values predicted by the regression equation - a line fitting the data - are called residuals. The parameters are estimated by finding the "best fit" regression line, the line that minimizes the sum of squares of the deviations of the observed values from the predicted values. This method is called the method of Ordinary Least Squares (OLS).
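The OLS estimates have a simple closed form: the slope is the sum of cross-deviations divided by the sum of squared X deviations, and the intercept then follows from the means. A minimal sketch with made-up vectors (not the lab data), checked against lm:

# Hypothetical data for illustration
x <- c(10, 20, 30, 40, 50)
y <- c(15, 22, 31, 35, 46)

# Closed-form OLS estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)

# Should match the coefficients from R's linear model fit
coef(lm(y ~ x))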

To fit a regression model to the data on the CV of bee abundance from Kenya, we use CV as the dependent variable and CTYPE as the independent variable. This is because it seems possible that the variability in bee abundance might depend on the number of crop species on a farm, but unlikely that the number of crop types on a farm depends on the variability in bee abundance; rather, it is up to the farmer how many crops to plant.

To fit a linear regression in R, we use the lm (linear model) function and then get a summary of the model fit. All we need to specify is a formula for the model, which is of the form y ~ x (y is a function of x).

m1 = lm(dat$cvn ~ dat$ctype)
summary(m1)

Call:
lm(formula = dat$cvn ~ dat$ctype)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              <2e-16 ***
dat$ctype

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 93 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 93 DF, p-value:

Using the str(m1) function in R, you can see all the information contained in the model object m1. Several items can be extracted using the extractor symbol $, for example $coefficients, $residuals, and $fitted.values. The summary of the model object includes a summary of the values of the residuals. The regression coefficients and their standard errors are given in the "Estimate" and "Std. Error" columns, respectively. The row labeled "(Intercept)" is the y-intercept (b_0) in the model, and the row labeled dat$ctype is the regression coefficient for CTYPE (b_1). On the same rows there are t-tests of the null hypothesis that the respective coefficient equals 0. At the bottom of the summary you will see the residual standard error (the square root of the mean square residual), an F-test for the significance of the overall model, and an estimate of R², the proportion of variation in the Y variable that is accounted for by the X variable.
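Because the printed summary is only a display, the quantities it reports can also be pulled out of the model and summary objects directly; a minimal sketch (assuming m1 was fit as above):

# Individual pieces of the fitted model object
m1$coefficients         # b0 and b1
head(m1$residuals)      # observed minus fitted values
head(m1$fitted.values)  # predicted CV for each farm used in the fit

# The summary object can be queried as well
summary(m1)$r.squared   # proportion of variation in Y explained by X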

We can plot the fitted regression line on the scatter plot of points using the function abline in R. We see the downward trend in CV, but there is substantial scatter of the data about the fitted line.

plot(dat$ctype, dat$cvn, xlab="Number of Crop Species", ylab="CV Bees",
     lwd=1, cex.lab=1.25, cex.axis=1.25, cex=1.5)
abline(lm(dat$cvn ~ dat$ctype), lwd=2)

Assumptions of Linear Regression

In using linear regression, a number of assumptions must be made; these are discussed in detail in lecture. In summary, it is assumed that:

1. each value of X is measured without error;
2. the set of observed (X, Y) values consists of n independent measures;
3. ε is a normally distributed error with a mean of zero and some standard deviation which is constant for all values of X (homogeneity of variances);
4. Y is a linear function of X.

If we make these assumptions, fitting the regression model is very simple; however, to use the regression model to predict or estimate values of Y, we must test the assumptions we have made.

The residuals are the differences between the observed and predicted values of Y. These deviations can be used as a tool to see if the necessary assumptions for regression have been met, and to further investigate the adequacy or goodness of fit of the regression model. If the assumptions have been met, the residuals (errors) should be independent and, for each value of X, the set of possible residuals should be approximately normally distributed with a mean of zero and a variance σ_e² that is not a function of X.

If the residuals do not have these characteristics, then some of the assumptions made in fitting the model must be incorrect and the results of the regression cannot be deemed valid. Therefore, it is not reasonable to accept a regression without examining whether the assumptions are met.

There are many methods that can be used to evaluate the residuals from a regression equation. Some of these methods are objective hypothesis tests; the alternative is to graph the residuals versus the values of X and evaluate them subjectively. The properties to be evaluated are: 1) independence, 2) normality, and 3) constant variance.

The property of independence can be tested in a variety of ways. Basically, though, we can assume that if the errors are independent, they will not tend to have any pattern; they will be random. If they are not random, this fact will be evident because there will be a few long series of either positive or negative values, instead of numerous shorter series. A hypothesis test known as a "runs" test can be used to evaluate the randomness of the residuals (a sketch follows below); alternatively, a subjective evaluation can be performed. Other objective tests are also available.

Normality is another desirable property of the residuals. What is required is that for any particular value of X the set of possible errors should be normally distributed. Unless a large number of measurements of Y have been obtained for each of the values of X, there is no reasonable way of testing this assumption. Therefore, we examine the overall distribution of the residuals for normality using graphical and/or statistical approaches.

The requirement of a constant variance is very important in evaluating regression models. However, unless the number of residuals is relatively large, a subjective visual evaluation is usually all that can be performed. Visually, residuals that suggest the variances are constant look like a rectangular scatter that is evenly spread about the regression line all along its length. However, when sample sizes are small it is often difficult to make an accurate judgment about constancy of variances based on a visual inspection of the residuals.
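The runs test mentioned above counts unbroken stretches of same-sign residuals; too few runs suggests the residuals are not independent. This is a minimal hand-rolled sketch of the run count only, not the full hypothesis test, and it assumes the model object m1 fitted earlier (a formal version is available, for example, as runs.test in the tseries package):

# Signs of the residuals, ordered by the fitted values
s <- sign(m1$residuals[order(m1$fitted.values)])
s <- s[s != 0]  # drop any exactly-zero residuals

# A new run starts wherever the sign changes
n_runs <- 1 + sum(s[-1] != s[-length(s)])
n_runs

Very roughly, independent residuals with balanced signs should produce on the order of half as many runs as residuals; far fewer runs would hint at long same-sign stretches and hence non-independence.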

One way in R to get some quick diagnostic information to help determine whether the data meet the assumptions of regression is to plot the model object. This produces 4 diagnostic plots, so if you want to see them simultaneously you need to set the plotting parameter mfrow, which generates two rows of plots, each with two plots.

par(mfrow=c(2,2))
plot(m1)

The upper left plot shows the raw residuals plotted against the predicted y-values from the model. I prefer to plot the standardized residuals against either the values of X or the standardized values of X. In either case, the scatter of the residuals should be similar along the length of the plot. The lower left plot shows the square root of the standardized residuals, with a similar pattern. These plots are useful for determining whether the data meet the homogeneity of variances assumption; an equal range of scatter of the data about the horizontal line would suggest that the assumption is met. The upper right plot is a normal probability plot of the residuals. If the residuals are normally distributed, the points should fall on the dotted line. In this case they deviate substantially at the upper end of the plot.
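Before turning to the QuantPsyc package below, note that base R can address the same normality question; a minimal sketch:

# Shapiro-Wilk test of normality on the residuals (base R)
shapiro.test(m1$residuals)

# Visual checks: histogram and normal Q-Q plot
hist(m1$residuals, main="Residuals", xlab="Residual")
qqnorm(m1$residuals); qqline(m1$residuals)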

Using the norm function in the QuantPsyc package on the residuals shows that the residuals are not normally distributed.

library(QuantPsyc)

Loading required package: boot
Loading required package: MASS

Attaching package: 'QuantPsyc'

The following object is masked from 'package:base':

    norm

norm(m1$residuals)

         Statistic SE t-val    p
Skewness                     e-07
Kurtosis                     e-02

The final plot in the lower right is used to diagnose data points that are outliers or overly influential. The solid red line demarcates where points would have Cook's D values of 1 or more from those with Cook's D values of less than 1. Cook's D is a measure of the influence of a data point on the regression; small values (less than 1) are preferred.
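Cook's D values can also be computed directly rather than read off the diagnostic plot; a minimal sketch using base R's cooks.distance:

# Cook's distance for every observation used in the fit
cd <- cooks.distance(m1)

# Flag any points exceeding the rule-of-thumb cutoff of 1
which(cd > 1)

# Quick index plot of the values against the cutoff
plot(cd, type="h", ylab="Cook's D")
abline(h=1, lty=2)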

To get a plot of the standardized residuals versus the X values, we have to get the X values actually used in the model fit, since some data points were dropped because their CV values were missing (NA). If you use the str(m1) command, you will see that m1 consists of 13 lists. The second item in the 13th list contains those values of the X variable. If you just tried to plot CTYPE versus m1$residuals, an error indicating that the vectors are not the same length would occur, because the cases in CTYPE for which CV was NA are still present.

nctype = m1[[13]][[2]]
stdresid = scale(m1$residuals, center=TRUE, scale=TRUE)
plot(nctype, stdresid)
abline(h=0, lty=2)

This plot also does not have an equal scatter about the fitted line (represented by the horizontal dashed line). However, we cannot tell whether the low scatter at the right end of the plot is due to inherently unequal variances, or to there being few farms that plant a large number of crops in our sample of farms.

Further Instructions for Lab 9

Data files for regression and correlation require that each subject be represented by a line in the data file and each column represent a variable. So, for correlation or bivariate regression, an R data file need only have 2 columns of values. However, if you have more than two variables for a single set of subjects and you want to calculate their correlations, just enter all the variables in separate columns and R can calculate the correlations between the variables in each pair of columns - a correlation matrix. Instead of inserting the X and Y variable names when using the cor function, insert the name of the data frame and all pairwise correlations will be calculated, as sketched below.
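A minimal sketch of the correlation-matrix call, using a hypothetical three-variable data frame (the names and values are illustrative only):

# Hypothetical data frame: three measurements per subject
df <- data.frame(height = c(152, 160, 171, 168, 181),
                 mass   = c(51, 59, 68, 64, 78),
                 age    = c(21, 25, 30, 28, 35))

# All pairwise Pearson correlations at once
cor(df, use="complete.obs", method="pearson")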

LAB 9 Assignment

PART 1: Introduction to Correlation and Regression

The Bermuda Petrel is an oceanic bird spending most of its year on the open sea, returning to land only during the breeding season. Its nesting sites are on a small, uninhabited island of the Bermuda group, where careful hatching records have been kept over several years. The Bermuda Petrel feeds only upon fish caught in the open ocean waters far from land. Unfortunately, DDT is now so widespread, and is so concentrated by the biological amplification system known as the "food chain," that the Bermuda Petrel can no longer lay hard-shelled eggs. Since DDT breaks down so slowly, it would appear that this beautiful bird is doomed to extinction (along with how many others?). The data below represent hatching rates of clutches of eggs over a number of years. Use correlation and linear regression in R to see if there is a significant relationship between the percent of clutches hatching and time. Interpret the output. Also produce a scatter plot of the relationship between hatching rate and year.

Year    % of Clutches Hatching

PART 2: Assumptions of simple linear regression

A) Using the kenyabees.csv data, is it possible to transform the CV data to an alternative scale on which the residuals and the Y variable are normally distributed? For example, what if we log-transformed the CV data? (A sketch of one way to start appears after this list.)

B) Estimate the linear regression model for each of the three sample data sets (reg1, reg3, reg5) using the lm function in R. Use the data in column 1 as the X variate and the data in column 2 as the Y variate in each data file.

C) Write the regression equation for at least two of the data sets.

D) Reiterating from the lab, the null hypothesis to be tested in each instance states that Y is not a linear function of X, and thus X will not be a good predictor of Y. More specifically, under the null hypothesis we are testing that the slope, b_1, will be equal to zero, since this would be indicative of no relationship between the two variables. At the α = 0.05 level, based on the output of the regression alone (F-test), for which of the three data sets would you reject the null hypothesis?

E) Based on the R² values, which model reveals the best fit?

F) To see if the models are adequate, you must check whether the assumptions of regression have been met. Use graphical and/or statistical methods to assess the assumptions of normality, homogeneity of variances, and linearity for each data set. For which data sets is linear regression appropriate, and for which is it clear that a linear regression model should not be imposed on the data? Would some transformation of scale for the Y or X data make these data normal and homoscedastic? Would transformation of X improve linearity?
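For Part 2A, one way to start is to refit the model with log-transformed CV and rerun the same diagnostics used in the lab; a minimal sketch (assumes dat as read in earlier and that all CV values are positive):

# Refit with log-transformed CV as the response
m2 <- lm(log(dat$cvn) ~ dat$ctype)

# Re-examine the residuals on the new scale
par(mfrow=c(2,2))
plot(m2)
shapiro.test(m2$residuals)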
