Worksheet 4 - Multiple and nonlinear regression models

Multiple and non-linear regression references: Quinn & Keough (2002) - Chpt 6

Question 1 - Multiple Linear Regression

Paruelo & Lauenroth (1996) analyzed the geographic distribution and the effects of climate variables on the relative abundance of a number of plant functional types (PFTs) including shrubs, forbs, succulents (e.g. cacti), C3 grasses and C4 grasses. They used data from 73 sites across temperate central North America (see paruelo.csv) and calculated the relative abundance of C3 grasses at each site as a response variable.

Format of paruelo.csv data file

C3      Relative abundance of C3 grasses at each site - response variable
LAT     Latitude in centesimal degrees - predictor variable
LONG    Longitude in centesimal degrees - predictor variable
MAP     Mean annual precipitation (mm) - predictor variable
MAT     Mean annual temperature (°C) - predictor variable
JJAMAP  Proportion of MAP that fell in June, July and August - predictor variable
DJFMAP  Proportion of MAP that fell in December, January and February - predictor variable

Open the paruelo data file. HINT.

Q1-1. In the table below, list the assumptions of multiple linear regression along with how violations of each assumption are diagnosed and/or the risks of violations are minimized.

Assumption      Diagnostic/Risk Minimization
I.
II.
III.
IV.
V.

Q1-2. Construct a scatterplot matrix to investigate these assumptions (HINT)
a. Focussing on the assumptions of Normality, Homogeneity of Variance and Linearity, is there evidence of violations of these assumptions (y or n)?
b. Try applying a temporary square root transformation (HINT). Does this improve some of these specific assumptions (y or n)?
c. Is there any evidence that the assumption of Collinearity is likely to be violated (y or n)?
d. Collinearity occurs when one or more of the predictor variables are correlated with one another, so this assumption can also be examined by calculating the pairwise correlation coefficients between all the predictor variables (HINT). Which predictor variables are highly correlated?

Q1-3. (Multi)collinearity can also be diagnosed with tolerance and variance inflation factor (VIF) measures.
a. Calculate the VIF values for each of the predictor variables (HINT).
b. Calculate the tolerance values for each of the predictor variables (HINT).

Predictor       Tolerance       VIF
LAT
LONG
MAP
MAT
JJAMAP
log10(DJFMAP)

c. Is there any evidence of (multi)collinearity, and if so, which variables are responsible for violations of this assumption?
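Tolerance and VIF can be computed by hand in R without any extra packages: the tolerance for a given predictor is 1 - R² from regressing that predictor on all the other predictors, and VIF is its reciprocal. The sketch below is illustrative only — it uses simulated stand-ins for a few of the paruelo predictors rather than the real data file:

```r
# Illustrative sketch: VIF and tolerance by hand on simulated predictors.
# Names mimic some paruelo predictors, but the values are fabricated.
set.seed(1)
n <- 73
LAT  <- rnorm(n, 40, 5)
LONG <- rnorm(n, 0, 3) + LAT - 140      # deliberately correlated with LAT
MAP  <- rnorm(n, 500, 100)              # independent of the others
X <- data.frame(LAT, LONG, MAP)

# Tolerance for predictor j = 1 - R^2 from regressing j on the others;
# VIF is 1 / tolerance.
vif_by_hand <- sapply(names(X), function(j) {
  r2 <- summary(lm(reformulate(setdiff(names(X), j), response = j),
                   data = X))$r.squared
  1 / (1 - r2)
})
print(cbind(Tolerance = 1 / vif_by_hand, VIF = vif_by_hand))
```

In practice you would run the same loop over all six paruelo predictors (or use vif() from the car package, if it is available) — the collinear pair shows inflated VIFs, while the independent predictor stays near 1.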

We obviously cannot easily incorporate all 6 predictors into the one model because of the collinearity problem. Paruelo and Lauenroth (1996) separated the predictors into two groups for their analyses: one group included LAT and LONG, and the other included MAP, MAT, JJAMAP and DJFMAP. We will focus on the relationship between the square-root-transformed relative abundance of C3 plants and latitude and longitude. This relationship will investigate the geographic pattern in abundance of C3 plants.

Q1-4. Just like Paruelo and Lauenroth (1996), we will fit the multiplicative model for LAT and LONG.
a. Write out the full multiplicative model
b. Check the assumptions of this linear model. In particular, check collinearity. HINT
c. Obviously, this model will violate collinearity: it is highly likely that LAT and LONG will be correlated with the LAT:LONG interaction term. It turns out that if we centre the variables, the individual terms will no longer be correlated with the interaction. Centre the LAT and LONG variables (HINT) and (HINT)

Q1-5. Fit the linear multiplicative model on the centred LAT and LONG (HINT)
a. Examine the diagnostic plots (HINT), especially the residual plot, to confirm there is no further evidence of violations of the analysis assumptions
b. Complete the following table (HINT).

Coefficient     Estimate        t-value P-value
Intercept
clat
clong
clat:clong

Q1-6. Examine the partial regression plots of LAT and LONG (HINT).

Q1-7. There is clearly an interaction between LAT and LONG. This indicates that the degree to which latitude affects the relative abundance of C3 plants depends on longitude. To investigate this further, we will examine the simple effects of latitude at a specific range of longitudes. The levels of longitude that we will use are the mean longitude value as well as the mean plus or minus 1 SD and plus or minus 2 SD.
a. Calculate the five levels of longitude at which the simple effects of latitude will be investigated. (HINT, HINT, etc).
b. Investigate the simple effects of latitude on the relative abundance of C3 plants for each of these levels of longitude. (HINT, HINT, etc).

Question 2 - Multiple Linear Regression

Loyn (1987) modeled the abundance of forest birds with six predictor variables (patch area, distance to nearest patch, distance to nearest larger patch, grazing intensity, altitude and years since the patch had been isolated).

Format of loyn.csv data file

ABUND   Abundance of forest birds in patch - response variable
DIST    Distance to nearest patch - predictor variable
LDIST   Distance to nearest larger patch - predictor variable
AREA    Size of the patch - predictor variable
GRAZE   Grazing intensity (1 to 5, representing light to heavy) - predictor variable
ALT     Altitude - predictor variable
YR.ISOL Number of years since the patch was isolated - predictor variable

Open the loyn data file. HINT.

Q2-1. In the table below, list the assumptions of multiple linear regression along with how violations of each assumption are diagnosed and/or the risks of violations are minimized.

Assumption      Diagnostic/Risk Minimization
I.
II.
III.
IV.
V.
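Looking back at the centring step in Q1-4c, a small simulation shows why centring works: with raw predictors, each main effect is strongly correlated with the product term (the interaction), but after subtracting the means that correlation largely disappears. The numbers below are fabricated; only the trick itself is the point:

```r
# Illustrative sketch of centring (Q1-4c): fabricated lat/long-like data.
set.seed(42)
lat  <- rnorm(73, 40, 5)
long <- rnorm(73, -100, 8)
cor(lat, lat * long)          # raw predictor vs interaction: |r| is large

clat  <- lat  - mean(lat)     # scale(lat, scale = FALSE) does the same
clong <- long - mean(long)
cor(clat, clat * clong)       # after centring: close to zero

# The multiplicative model on centred predictors (y is fabricated too):
y   <- 0.5 + 0.02 * clat - 0.01 * clong + rnorm(73, 0, 0.1)
fit <- lm(y ~ clat * clong)   # expands to clat + clong + clat:clong
summary(fit)
```

The fitted model has four coefficients (intercept, clat, clong, clat:clong), matching the table in Q1-5b.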

Q2-2. Construct a scatterplot matrix to investigate these assumptions (HINT)
a. Focussing on the assumptions of Normality, Homogeneity of Variance and Linearity, is there evidence of violations of these assumptions (y or n)?
b. Try applying a temporary log10 transformation to the skewed variables (HINT). Does this improve some of these specific assumptions (y or n)?
c. Is there any evidence that the assumption of Collinearity is likely to be violated (y or n)?

Predictor       Tolerance       VIF
log10(DIST)
log10(LDIST)
log10(AREA)
GRAZE
ALT
YR.ISOL

d. Collinearity occurs when one or more of the predictor variables are correlated, therefore this assumption can also be examined by calculating the pairwise correlation coefficients between all the predictor variables (use transformed versions for those that required it) (HINT). Which predictor variables are highly correlated?

Since none of the predictor variables are highly correlated with one another, we can include all of them in the linear model fitting.

Q2-3. Fit an additive linear model relating ABUND to each of the predictor variables, but no interactions (HINT)
a. Examine the diagnostic plots (HINT), especially the residual plot, to confirm there is no further evidence of violations of the analysis assumptions
b. Were any of the partial regression slopes significantly different from 0? Which one(s)?
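The additive fit in Q2-3 (and the fit statistics asked for in Q2-4) can be sketched as follows. Since the real loyn.csv is not reproduced here, the sketch fabricates stand-in variables with the same names; only the model formula and the extraction of diagnostics and fit statistics carry over to the real data:

```r
# Illustrative sketch of Q2-3/Q2-4 on fabricated loyn-like data.
set.seed(7)
n <- 56
dat <- data.frame(
  DIST    = rlnorm(n, 4, 1),
  LDIST   = rlnorm(n, 5, 1),
  AREA    = rlnorm(n, 2, 1.5),
  GRAZE   = sample(1:5, n, replace = TRUE),
  ALT     = rnorm(n, 150, 40),
  YR.ISOL = sample(1890:1976, n, replace = TRUE)
)
dat$ABUND <- with(dat, 10 + 7 * log10(AREA) - 2 * GRAZE + rnorm(n, 0, 4))

# Additive model: all six predictors, no interactions
full <- lm(ABUND ~ log10(DIST) + log10(LDIST) + log10(AREA) + GRAZE +
             ALT + YR.ISOL, data = dat)

par(mfrow = c(2, 2)); plot(full)    # diagnostic plots, incl. residual plot
summary(full)$coefficients          # partial slopes with t- and P-values

# Fit statistics for the Q2-4 table
c(adj.r2 = summary(full)$adj.r.squared, AIC = AIC(full), BIC = BIC(full))
```

All-subsets comparison for Q2-5 can then be automated, e.g. with dredge() from the MuMIn package or with whatever routine the worksheet's biology package provides (an assumption here — follow the worksheet's HINT for the intended function).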

Coefficient     Estimate        t-value P-value
Intercept
log10(DIST)
log10(LDIST)
log10(AREA)
GRAZE
ALT
YR.ISOL

Q2-4. We would now like to find the 'best' regression model. Calculate the adjusted r² (HINT), AIC (HINT) and BIC (HINT) for the full regression containing all six predictor variables.

Model   Adj. r² AIC     BIC
Full

Q2-5. Compare all possible models and select the 'best' model based on AIC and BIC (HINT). Note that this requires loading the biology package!

Selection based on      Model (e.g. ABUND ~ log10(DIST) + ALT)  Adj. r² AIC
Adj. r²
AIC
BIC

Question 3 - Hierarchical partitioning

An alternative model selection procedure is called hierarchical partitioning. Hierarchical partitioning essentially determines the contribution of each predictor variable, both as an individual predictor and as a joint predictor, to explaining the variation in the response variable. We will use hierarchical partitioning for model selection on the Loyn (1987) data set.

Open the loyn data file. HINT.

Q3-1. Perform the hierarchical partitioning

Q3-2. There are no formal hypothesis tests to determine what constitutes a significant contribution, and therefore which predictor variables to retain in the model. However, there are two ways in which the significance of predictors can be inferred:
a. Retain all those predictors whose total contribution is greater than an appropriate critical correlation coefficient. We do this by first converting the r² values into standardized, normal z-scores (via correlation coefficients) and then comparing these scores to a critical z-score of 1.65 (for P = 0.05). Perform these calculations for the individual and total contributions and indicate which of the variables contribute the most to the explained variance in forest bird abundance.
b. The second method is to use a randomization procedure to generate a distribution of partitioned r² values. The hier.part package comes with one such routine, which performs a randomization procedure for the independent contributions only. Run the randomization procedure and indicate which of the variables contribute the most to the explained variance in forest bird abundance.

Question 4 - Polynomial regression

Rademaker and Cerqueira (2006) compiled data from the literature on the reproductive traits of opossums (Didelphis) so as to investigate latitudinal trends in reproductive output. In particular, they were interested in whether there were any patterns in mean litter size across a latitudinal gradient from 44°N to 34°S. Analyses of data compiled from second-hand sources are called meta-analyses and are very useful at revealing overall trends across a range of studies.

Format of rademaker.csv data file

SPECIES LATITUDE        L/Y     MAOP    MLS     REFERENCE
D.v.    44      2       16.8    8.4     Tyndale-Biscoe and Mackenzie (1976)
D.v.    41      1.5     14.1    9.4     Hossler et al. (1994)
D.v.    41.5    ?       ?       8.6     Reynolds (1952)
D.v.    41      2       18      9       Wiseman and Hendrickson (1950)
D.v.    40      2       15.8    7.9     Sanderson (1961)
...

SPECIES Didelphid species (D.al. = Didelphis albiventris, D.au. = Didelphis aurita, D.m. = Didelphis marsupialis, D.v. = Didelphis virginiana) - Descriptor variable

LATITUDE L/Y MAOP MLS REFERENCE Lattitude (degees) of study site - Predictor variable Mean number of litter per year - Response variable Mean annual offspring production - Response variable Mean litter size - Response variable Original source of data Open the rademaker data file. The main variables of interest in this data set are MLS (mean litter size) and LATITUDE. The other variables were included so as to enable you to see how meta data might be collected and collated from a number of other sources. The relationship between two continuous variables can be analyzed by simple linear regression, as was seen in question 1. Before performing the analysis we need to check the assumptions. To evaluate the assumptions of linearity, normality and homogeneity of variance, construct a scatterplot of MLS against LATITUDE including a lowess smoother and boxplots on the axes. (HINT) Q4-1. Is there any evidence that any of the assumptions are likely to be violated? To get an appreciation of what a residual plot would look like when there is some evidence that the linearity assumption has been violated, perform the simple linear regression (by fitting a linear model) purely for the purpose of examining the regression diagnostics (particularly the residual plot) Q4-2. How would you describe the residual plot? For this sort of trend that is clearly non-linear (yet the boxplots suggest normal data), transformations are of no use. Therefore, rather than attempt to model the data on a simple linear relationship (straight line), it is better to attempt to model the data on a curvilinear linear relationship (curved line). Note it is important to make the distinction between line (relationship) linearity and model linearity Q4-3. 
If the assumptions are otherwise met, perform the second order polynomial regression analysis (fit the quadratic polynomial model), examine the output, and use the information to construct the regression equation relating the number of mean litter size to latitude: DV = intercept + slope 1 x IV 2 + slope 2 x IV Mean litter size = + x latitude 2 + x latitude Q4-4. In polynomial regression, there are two hypotheses of interest. Firstly, as there are two slopes, we are now interested in whether the individual slopes are equal to one another and to zero (that is, does

the overall model explain our data better than just a horizontal line (no trend). Secondly, we are interested in whether the second order polynomial model fits the data any better than a simple first order (straight-line) model. Test these null hypotheses. a. The slopes (choose correct option) significantly different from one another and to 0 (F =, df =,, P = ) b. The second order polynomial regression model (choose correct option) fit the data significantly better than a first order (straight line) regression model (F =, df =,, P = ) Q4-5. What are your conclusions (statistical and biological)? Q4-6. Such an outcome might also be accompanied by a scatterpoint that illustrates the relationship between the mean litter size and latitude. Construct a scatterplot without a smoother or marginal boxplots (HINT). Include a quadratic (second order) trendline on this graph (HINT). Question 5 - Nonlinear Regression Peake and Quinn (1993) investigated the relationship between the size of mussel clumps (m 2 ) and the number of other invertebrate species supported. Format of peake.csv data file AREA SPECIES INDIV 516.0 3 18 469.06 7 60 462.25 6 57......... AREA SPECIES INDIV Area of the mussel clump (m 2 )- predictor variable Number of other invertebrate species found in the mussel clumps - response variable Number of other invertebrate individuals found in the mussel clumps - ignore this response variable Open the peake data file. HINT.
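The power-function fit Species = α(AREA)^β that Question 5 works towards can be sketched with nls(), R's nonlinear least squares routine. The data below are simulated from a power curve purely for illustration (the worksheet itself uses the peake data), and the starting values for α and β are rough guesses, which nls() requires:

```r
# Illustrative sketch: power-function fit via nls() on fabricated
# species-area data (the worksheet's real analysis uses peake.csv).
set.seed(11)
AREA    <- runif(30, 500, 25000)
SPECIES <- round(0.8 * AREA^0.33 + rnorm(30, 0, 1.5))

# nls() needs rough starting values for the parameters alpha and beta
fit <- nls(SPECIES ~ alpha * AREA^beta,
           start = list(alpha = 1, beta = 0.3))
summary(fit)                                    # estimates with t- and P-values

plot(AREA, resid(fit)); abline(h = 0, lty = 2)  # residual plot

plot(AREA, SPECIES)                             # scatterplot with fitted curve
curve(coef(fit)["alpha"] * x^coef(fit)["beta"], add = TRUE)
```

Because the response approaches an asymptote-like flattening as area grows, the power model usually tracks species-area data better than a quadratic, which must eventually turn back down.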

Q5-1. For this question we will focus on the relationship between the number of species occupying a mussel clump and the size of the mussel clump. Construct a scatterplot to investigate the assumptions of simple linear regression (HINT)
a. Focussing on the assumptions of Normality, Homogeneity of Variance and Linearity, is there evidence of violations of these assumptions (y or n)?

Q5-2. Although we could try a logarithmic transformation of the AREA variable, species/area curves are known to be non-linear. We might expect that small clumps would support only a small number of species. However, as area increases, there should be a dramatic increase in the number of species supported. Finally, further increases in area should see the number of species approach an asymptote. Species-area data are often modeled with non-linear functions:
a. Try fitting a second-order polynomial trendline through these data, e.g. Species = α × AREA + β × AREA² + c (note that this is just an extension of y = mx + c).
b. Try fitting a power trendline through these data, e.g. Species = α(AREA)^β, whereby the number of species is proportional to the area raised to the power of some coefficient (β).
c. Which model is the most appropriate for these data and why?

Q5-3. Fit the above power function to the data and complete the following table (HINT).

Parameter       Estimate        t-value P-value
α
β

Q5-4. Examine the residual plot (HINT).

Q5-5. What conclusions (statistical and biological) would you draw from the analysis?

Q5-6. Create a plot of species number against mussel clump area (HINT).
a. Fit the nonlinear trend line to this plot (HINT)

Welcome to the end of Worksheet 4!