Online Resource 2: Why Tobit regression?


March 8, 2017

Contents

1 Introduction
2 Inspect data graphically
3 Why is linear regression not good enough?
  3.1 Model assumptions are not fulfilled
  3.2 Pragmatism vs. rigorousness
  3.3 Why not log-transform?
  3.4 The sampling design of the predictors induces a systematic error
  3.5 Why not ANOVA?
  3.6 Discussion
4 Tobit regression in practice
  4.1 Assumptions are fulfilled
5 Conclusions

1 Introduction

In this brief document we explain why Tobit regression was used to analyse the data, what its advantages over a linear model are, and how it can be implemented in R. To do so, we introduce a toy example in which the effect of the distance to the next garden on the cover of Trachycarpus fortunei is analysed.

Disclaimer: this document does not aim to be an introduction to Tobit regression. Here we illustrate the reasons that led us to use this method in our specific analysis. The book "Analysis of Failure and Survival Data" by Peter Smith (Chapman & Hall/CRC) can be used as an introduction to Tobit regression and related techniques.

The data used here is a subset of the real data analysed in this paper. We start by loading it and selecting the variables we need.

d.1 <- readRDS(file = "DataConedera2017.RDS")
d.1 <- subset(d.1, select = c(T.Tra, Tra, dng))
str(d.1)
'data.frame': 200 obs. of 3 variables:
 $ T.Tra: num
 $ Tra  : num
 $ dng  : num
head(d.1)

## install.packages('regr0', repos='
require(regr0)
require(lattice)

T.Tra is the transformed cover of hemp palm, one of the four response variables analysed in this publication. Tra is the untransformed cover of hemp palm (range ). dng, the sole predictor here, is the untransformed distance to the next garden (given in metres).
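Readers without access to the RDS file can still follow the code below with a simulated stand-in for d.1. All numbers in this sketch (apart from the sample size of 200) are invented for illustration and do not reproduce the real observations:

```r
## Hypothetical stand-in for "DataConedera2017.RDS": a simulated data set
## with the same structure as d.1 (all values invented for illustration).
set.seed(42)
n     <- 200
dng   <- round(exp(runif(n, log(5), log(500))))       # distance to garden [m]
pot   <- 0.35 - 0.08 * log(dng) + rnorm(n, sd = 0.08) # latent "potential"
T.Tra <- pmax(0, pot)            # arcsine-sqrt-scale cover, censored at zero
Tra   <- sin(T.Tra)^2 * 100      # back-transformed cover in percent
d.1   <- data.frame(T.Tra = T.Tra, Tra = Tra, dng = dng)
str(d.1)
```

The latent "potential" is truncated at zero here on purpose, so that the simulated data show the same excess of zero covers at large distances as the real data.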

2 Inspect data graphically

We start by plotting the values of the response variable against the predictor. As explained in the main text, we arcsine-square-root transformed the response variable to stabilise the variance of the residuals¹. We also chose the logarithmic scale for the distance to the next garden. To better visualise the data we used jittering on the x-axis (we add a small amount of noise to the x values) and transparency (observations are semi-transparent). Both measures alleviate the effect of overlap; in particular, they highlight the large number of observations that have zero cover.

[Figure: hemp palm cover (asin.sqrt transformed) against log distance to next garden [m], with a least squares regression line.]

The line drawn on this graph is a least squares regression line, which shows that sampling plots close to gardens have the highest covers of hemp palm. This is not unexpected, as the gardens in the study region often contain the hemp palm and may act as seed reservoirs.

3 Why is linear regression not good enough?

One could fit a linear regression to these data. However, as we will show below, this has several important drawbacks.

3.1 Model assumptions are not fulfilled

In order to make statistical inference (i.e. compute p-values and confidence intervals) on a normal linear model, we assume that the errors are normally distributed, that they are independent of each other, and that their variance is constant. Mathematically, we can summarise this as:

    y = β₀ + β₁x + ε,   ε iid ~ N(0, σε²)    (1)

¹ This is standard procedure for proportions.

In our example, y is the transformed cover of the hemp palm (i.e. T.Tra), x is the log-distance to gardens (i.e. log(dng)), the βs are the regression coefficients (intercept and slope), and the ε are the errors. We fit the model:

lm.0 <- lm(T.Tra ~ log(dng), data = d.1)

The usual way to assess whether the model assumptions are fulfilled is to produce residual diagnostics. For linear models the most important tool is the Tukey-Anscombe plot (i.e. a plot of the residuals against the fitted values). We reproduce this plot here for the linear model fitted to the toy data. Note that transparency is used again.

[Figure: Tukey-Anscombe plot for lm (residuals against fitted values).]

The variance of the residuals is evidently not constant, but increases with the fitted values. As the range of the vertical axis makes clear, the residuals are far from symmetric. In addition, we can clearly see the bounding effect of the zeros: all the residuals of the observations with zero cover lie on a line at the bottom of the graph. A quantile-quantile plot (not drawn here) would also show that the residuals do not follow a normal distribution. Thus, the model assumptions are grossly violated.

3.2 Pragmatism vs. rigorousness

The above arguments against the use of linear regression for this analysis may sound overly rigorous and not pragmatic. Indeed, the model assumptions are never perfectly fulfilled. However, fitting a linear model to these data also has other practical implications. As an example, the predicted values for sampling plots at more than 250 metres from a garden are negative. Given that we are modelling cover, this does not make sense.
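The Tukey-Anscombe plot of Section 3.1 can be reproduced along the following lines; the graphical details (point size, transparency, reference line) are our assumptions, not taken from the original figure:

```r
## Residuals against fitted values for the linear model lm.0, with
## semi-transparent points (graphical details assumed).
plot(fitted(lm.0), resid(lm.0),
     pch = 16, col = rgb(0, 0, 0, alpha = 0.3),
     xlab = "fitted values", ylab = "residuals",
     main = "Tukey-Anscombe plot for lm")
abline(h = 0, lty = 2)
```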

3.3 Why not log-transform?

To force the fitted values to be positive, we could log-transform the response variable. Note that we need to add a small positive value prior to the log transform, as some observed covers are 0. The positive constant added to the response variable prior to transformation is usually either 1 or the smallest non-zero value observed in the data. Adding 1 is irrational, since its effect depends on the measurement unit of the variable (e.g. percent or parts per thousand). We therefore used the second choice. A better-behaved modified logarithm that solves the problem of the zeros in a rational way is implemented in the function logst() of the package regr0, which we use in our analysis.

( min.cover <- min(d.1$Tra[d.1$Tra != 0]) )
d.1$log.Tra <- log(d.1$Tra + min.cover)

We then plot the newly obtained response variable against the predictor to inspect their relationship.

[Figure: hemp palm cover (log transformed) against log distance to next garden [m].]

We now fit the model on the newly obtained response variable:

lm.log <- lm(log.Tra ~ log(dng), data = d.1)
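The plot of the log-transformed cover described above can be sketched as follows; the jitter amount, colours and labels are our assumptions:

```r
## Log-transformed cover against log distance, with x-jitter and
## transparency, plus the least squares line (graphical details assumed).
plot(jitter(log(d.1$dng), amount = 0.05), d.1$log.Tra,
     pch = 16, col = rgb(0, 0, 1, alpha = 0.3),
     xlab = "log distance to next garden [m]",
     ylab = "Hemp palm cover (log transformed)")
abline(lm.log)
```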

In order to assess the model assumptions, we look at the Tukey-Anscombe plot again.

[Figure: Tukey-Anscombe plot for lm.log (residuals against fitted values).]

It is clear that the assumptions of the log-transformed model are not fulfilled either. In addition, the back-transformed fitted values are not all positive, as we had hoped. Indeed, the addition of a constant prior to transformation results, in some cases, in negative fitted values. As an example, sites at 350 and 450 metres are predicted to have a negative cover (see below).

y.hat <- predict(lm.log, newdata = data.frame(dng = c(25, 75, 250, 350, 450)))
round((exp(y.hat) - min.cover), 2)

Adding 1 instead of the minimum cover makes no difference. Because we are modelling cover, negative fitted values on the original scale are clearly not sensible. Thus, the log-transformation is of no help here in obtaining strictly positive fitted values. To work around this problem, negative values could be rounded up to zero. Nevertheless, as we will show, Tobit regression offers a more elegant solution to this problem and solves other issues too.

3.4 The sampling design of the predictors induces a systematic error

Again, one could argue that in this publication we are only interested in comparing the effects of the predictors in a fair way, and that it is therefore unimportant whether the model assumptions are perfectly fulfilled and whether the fitted values are all positive. From a very pragmatic point of view this could be considered true. Nevertheless, we should note that the regression coefficient of dng depends on the sampling of the predictor. In particular, if we had sampled sites at further distances (e.g. at 750, 1500 and 5000 metres), we would very likely have observed zero covers. This would change the estimate obtained for dng (i.e. a flatter line would be obtained). Below we display this situation graphically. We create two fake data sets with additional observations at further distances: d.fake.1 goes up to 5000 metres, while d.fake.2 goes up to 20000 metres. All these new observations have zero cover. We then fit two linear models with these additional observations. Ideally, modifying the sampling of observations should not have any influence on the estimates.

d.temp.1 <- data.frame(T.Tra = 0, Tra = NA, log.Tra = NA,
                       dng = rep(c(750, 1500, 5000), each = 10))
d.fake.1 <- rbind(d.1, d.temp.1)
lm.1 <- update(lm.0, data = d.fake.1)

d.temp.2 <- data.frame(T.Tra = 0, Tra = NA, log.Tra = NA,
                       dng = rep(c(750, 1500, 5000, 10000, 20000), each = 10))
d.fake.2 <- rbind(d.1, d.temp.2)
lm.2 <- update(lm.0, data = d.fake.2)

To display how the further sampling would affect the estimates, we reproduce the earlier scatter plot and add the fitted lines of the two new models.

[Figure: hemp palm cover (asin.sqrt transformed) against log distance to next garden [m], with the regression lines for the original data, fake 1 and fake 2.]

The lines obtained here are clearly flatter than the one obtained with the original data (blue line). The pink line is obtained with the data set where observations go up to 5000 metres. The green line, which is even flatter, represents the regression with data up to 20000 metres. We can formally compare the coefficients obtained.

coef(lm.0)["log(dng)"]
coef(lm.1)["log(dng)"]

coef(lm.2)["log(dng)"]

The regression coefficient for the log-distance to the next garden (as well as the t- and p-values) differs between the models. The further out you sample the predictor (i.e. the distance to the next garden), the lower the regression estimate will be. This is unfortunate and unwanted, as the design should not influence the estimated regression coefficient in a systematic way.

3.5 Why not ANOVA?

We could analyse these data within the ANOVA framework (i.e. take dng as a categorical variable with 6 levels). However, note that the other predictors analysed in the main analysis are continuous variables. This implies that the comparison between predictors would not be fair, as the numbers of estimated parameters differ. Note, in addition, that if we were to analyse this toy data within the ANOVA framework, we could formally compare the levels of the factor in a post-hoc analysis. Here, we would conclude that the groups 250 m, 350 m and 450 m are not statistically different from each other². However, in practice we would expect the suitability at these distances to differ. Thus, it is important to highlight that observed cover does not indicate suitability here. To make a more extreme example, we could compare a plot at 250 metres and one in the middle of the lake. In both cases all observations would be zero. However, if we enlarged our sample, we might be able to find the hemp palm in the 250-metre plot, but certainly not in the one in the middle of the lake. We therefore need to account for the fact that not all zeros carry the same information in this context. Essentially, we require a technique that allows zeros to differ from one another. One possible solution to this problem is Tobit regression. This method enables the user to discriminate between zeros. In this example, all zero observations are said to be censored; in other words, we assume that zero is the lowest value that we can possibly measure.
3.6 Discussion

As we have seen above, not accounting for the censoring of the data can lead to misleading results. In addition, the model assumptions were clearly violated in all the models fitted. Thus, looking for a more appropriate model is supported by both practical (i.e. biological interpretation of the results) and theoretical (i.e. distributional assumptions) reasons.

² To carry out this post-hoc analysis, we would have to assume that the dng factor has a significant effect and that the model assumptions are fulfilled. This is not the case, as will be shown further down.

4 Tobit regression in practice

A satisfactory way out of the discussed difficulties consists of using a model that is suitable for target variables which are either positive or zero, with a potentially high probability for the value zero. This is called Tobit regression, and it relies on the following idea: the occurrence of the plants is driven by a variable that we can call the potential for their growth. For clearly positive values, the potential is the expected cover, from which the observed cover deviates by the usual random error. If the potential declines to zero and below, the probability of observing zero cover grows and eventually reaches one. More precisely, this probability equals the probability that the potential plus the random error is negative. This corresponds exactly to a regular linear model, with the modification that the observations are censored at 0.

Fitting a Tobit model with the regr() function (package regr0) is trivial. In this analysis, we consider all zero observations to be censored.

d.1$censored <- d.1$T.Tra == 0
table(d.1$censored)

FALSE  TRUE
  139    61

There are 61 censored observations out of 200. Here we fit a Tobit model using the wrapper function regr(). With the limit argument we declare the censoring point (i.e. the smallest value that can possibly be observed). After fitting the model, we can look at the summary output.

tob.0 <- regr(Tobit(T.Tra, limit = 0) ~ log(dng), data = d.1)
summary(tob.0)

Call:
regr(formula = Tobit(T.Tra, limit = 0) ~ log(dng), data = d.1)
Fitting function: survreg

[Summary output: the coefficient table (coef, df, cilow, cihigh, R2.x, signif, p.value, p.symb) shows a highly significant negative coefficient for log(dng) (p.symb ***); the deviance test of the model against the null yields a p-value of the order of e-17. Distribution: gaussian, with an estimated shape parameter `scale`.]

Not unexpectedly, the summary tells us that the distance to the next garden has a strong negative effect on the response variable.
In addition, the regression coefficient that takes the censoring of the data into account is markedly more negative than the one obtained with the normal linear model. Indeed, the zeros are no longer considered to carry all the same information, so the zero covers observed at 150 metres are treated differently from those at 450 metres.
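For readers who prefer not to depend on the regr0 package: the summary above reports survreg as the underlying fitting function, and the same Tobit model can be fitted directly with the survival package by declaring the zeros as left-censored. The object name tob.surv is ours:

```r
## Equivalent Tobit fit via survival::survreg, with observations
## left-censored at 0 (zero covers are recorded as "at most zero").
library(survival)
tob.surv <- survreg(Surv(T.Tra, T.Tra > 0, type = "left") ~ log(dng),
                    data = d.1, dist = "gaussian")
summary(tob.surv)
```

In the Surv() call, the second argument flags uncensored observations (T.Tra > 0), and type = "left" declares the remaining ones as left-censored at their observed value of 0.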

Again, we can compare the coefficients of the two models.

coef(lm.0)["log(dng)"]
coef(tob.0)["log(dng)"]

4.1 Assumptions are fulfilled

The model fitted here assumes that the observed data are censored at zero. However, what is actually modelled is a latent (unobserved) variable that is not censored. In this case the latent variable is assumed to follow a normal distribution. Biologically, the latent variable can be interpreted as suitability. We can thus check the model assumptions with the classical Tukey-Anscombe plot³.

set.seed(1)
res.sim.1 <- resid(tob.0)[, "random"]

[Figure: Tukey-Anscombe plot for the Tobit model (residuals against fitted values), with censored and uncensored observations marked.]

No obvious violations of the model assumptions are visible here. There is no bounding effect, and the variance of the residuals appears to be reasonably stable (i.e. homoscedastic). Note that the apparently smaller spread of the residuals for small fitted values is partly due to the fact that there are fewer observations in this range⁴.

Finally, we show that the sampling design has no effect on the estimates: the regression coefficients obtained are exactly the same for all the data sets used.

³ Note that, to obtain meaningful plots, the residuals of censored observations are simulated.
⁴ A scale-location plot (i.e. absolute residuals plotted against the fitted values) would show this clearly. It is not shown here for the sake of brevity and because the information it conveys is partially redundant with the Tukey-Anscombe plot shown.
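The diagnostic plot above can be sketched as follows, using the simulated residuals res.sim.1; the colours, legend and other graphical details are our assumptions:

```r
## Tukey-Anscombe plot for the Tobit fit: simulated residuals against
## fitted values, censored observations highlighted (details assumed).
plot(fitted(tob.0), res.sim.1,
     pch = 16, col = ifelse(d.1$censored, "red", rgb(0, 0, 0, 0.3)),
     xlab = "fitted values", ylab = "residuals",
     main = "Tukey-Anscombe plot for tobit model")
abline(h = 0, lty = 2)
legend("topleft", pch = 16, col = c("black", "red"),
       legend = c("not censored", "censored"))
```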

tob.fake.1 <- update(tob.0, data = d.fake.1)
tob.fake.2 <- update(tob.0, data = d.fake.2)
coef(tob.0)
coef(tob.fake.1)
coef(tob.fake.2)

5 Conclusions

The advantages of using Tobit regression in this context are multiple. From a practical point of view, it is important that the sampling design of the predictors does not influence the regression coefficients obtained. In addition, the modelling of a latent variable (i.e. suitability) enables us to discriminate between zero covers and solves the problem of the negative fitted covers obtained with the other models. From a more rigorous point of view, the model assumptions are fulfilled and we can compare the effects of the predictors in a fair manner.


More information

cor(dataset$measurement1, dataset$measurement2, method= pearson ) cor.test(datavector1, datavector2, method= pearson )

cor(dataset$measurement1, dataset$measurement2, method= pearson ) cor.test(datavector1, datavector2, method= pearson ) Tutorial 7: Correlation and Regression Correlation Used to test whether two variables are linearly associated. A correlation coefficient (r) indicates the strength and direction of the association. A correlation

More information

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression:

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression: Biost 518 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture utline Choice of Model Alternative Models Effect of data driven selection of

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

MODULE 6 LOGISTIC REGRESSION. Module Objectives:

MODULE 6 LOGISTIC REGRESSION. Module Objectives: MODULE 6 LOGISTIC REGRESSION Module Objectives: 1. 147 6.1. LOGIT TRANSFORMATION MODULE 6. LOGISTIC REGRESSION Logistic regression models are used when a researcher is investigating the relationship between

More information

MORE ON SIMPLE REGRESSION: OVERVIEW

MORE ON SIMPLE REGRESSION: OVERVIEW FI=NOT0106 NOTICE. Unless otherwise indicated, all materials on this page and linked pages at the blue.temple.edu address and at the astro.temple.edu address are the sole property of Ralph B. Taylor and

More information

A brief introduction to mixed models

A brief introduction to mixed models A brief introduction to mixed models University of Gothenburg Gothenburg April 6, 2017 Outline An introduction to mixed models based on a few examples: Definition of standard mixed models. Parameter estimation.

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Using R in 200D Luke Sonnet

Using R in 200D Luke Sonnet Using R in 200D Luke Sonnet Contents Working with data frames 1 Working with variables........................................... 1 Analyzing data............................................... 3 Random

More information

Example. Multiple Regression. Review of ANOVA & Simple Regression /749 Experimental Design for Behavioral and Social Sciences

Example. Multiple Regression. Review of ANOVA & Simple Regression /749 Experimental Design for Behavioral and Social Sciences 36-309/749 Experimental Design for Behavioral and Social Sciences Sep. 29, 2015 Lecture 5: Multiple Regression Review of ANOVA & Simple Regression Both Quantitative outcome Independent, Gaussian errors

More information

STAT 3022 Spring 2007

STAT 3022 Spring 2007 Simple Linear Regression Example These commands reproduce what we did in class. You should enter these in R and see what they do. Start by typing > set.seed(42) to reset the random number generator so

More information

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X. Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.

More information

Single and multiple linear regression analysis

Single and multiple linear regression analysis Single and multiple linear regression analysis Marike Cockeran 2017 Introduction Outline of the session Simple linear regression analysis SPSS example of simple linear regression analysis Additional topics

More information

The Multiple Regression Model

The Multiple Regression Model Multiple Regression The Multiple Regression Model Idea: Examine the linear relationship between 1 dependent (Y) & or more independent variables (X i ) Multiple Regression Model with k Independent Variables:

More information

Estimability Tools for Package Developers by Russell V. Lenth

Estimability Tools for Package Developers by Russell V. Lenth CONTRIBUTED RESEARCH ARTICLES 195 Estimability Tools for Package Developers by Russell V. Lenth Abstract When a linear model is rank-deficient, then predictions based on that model become questionable

More information

Final Exam. Name: Solution:

Final Exam. Name: Solution: Final Exam. Name: Instructions. Answer all questions on the exam. Open books, open notes, but no electronic devices. The first 13 problems are worth 5 points each. The rest are worth 1 point each. HW1.

More information

Basic Business Statistics, 10/e

Basic Business Statistics, 10/e Chapter 4 4- Basic Business Statistics th Edition Chapter 4 Introduction to Multiple Regression Basic Business Statistics, e 9 Prentice-Hall, Inc. Chap 4- Learning Objectives In this chapter, you learn:

More information

Introduction to Statistics and R

Introduction to Statistics and R Introduction to Statistics and R Mayo-Illinois Computational Genomics Workshop (2018) Ruoqing Zhu, Ph.D. Department of Statistics, UIUC rqzhu@illinois.edu June 18, 2018 Abstract This document is a supplimentary

More information

Multiple Comparisons

Multiple Comparisons Multiple Comparisons Error Rates, A Priori Tests, and Post-Hoc Tests Multiple Comparisons: A Rationale Multiple comparison tests function to tease apart differences between the groups within our IV when

More information

Regression and Models with Multiple Factors. Ch. 17, 18

Regression and Models with Multiple Factors. Ch. 17, 18 Regression and Models with Multiple Factors Ch. 17, 18 Mass 15 20 25 Scatter Plot 70 75 80 Snout-Vent Length Mass 15 20 25 Linear Regression 70 75 80 Snout-Vent Length Least-squares The method of least

More information

Variance. Standard deviation VAR = = value. Unbiased SD = SD = 10/23/2011. Functional Connectivity Correlation and Regression.

Variance. Standard deviation VAR = = value. Unbiased SD = SD = 10/23/2011. Functional Connectivity Correlation and Regression. 10/3/011 Functional Connectivity Correlation and Regression Variance VAR = Standard deviation Standard deviation SD = Unbiased SD = 1 10/3/011 Standard error Confidence interval SE = CI = = t value for

More information

STAT 4385 Topic 01: Introduction & Review

STAT 4385 Topic 01: Introduction & Review STAT 4385 Topic 01: Introduction & Review Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 Outline Welcome What is Regression Analysis? Basics

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples. Regression models Generalized linear models in R Dr Peter K Dunn http://www.usq.edu.au Department of Mathematics and Computing University of Southern Queensland ASC, July 00 The usual linear regression

More information

Relations in epidemiology-- the need for models

Relations in epidemiology-- the need for models Plant Disease Epidemiology REVIEW: Terminology & history Monitoring epidemics: Disease measurement Disease intensity: severity, incidence,... Types of variables, etc. Measurement (assessment) of severity

More information

Chapter 4: Regression Models

Chapter 4: Regression Models Sales volume of company 1 Textbook: pp. 129-164 Chapter 4: Regression Models Money spent on advertising 2 Learning Objectives After completing this chapter, students will be able to: Identify variables,

More information

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference. Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences

More information

In the previous chapter, we learned how to use the method of least-squares

In the previous chapter, we learned how to use the method of least-squares 03-Kahane-45364.qxd 11/9/2007 4:40 PM Page 37 3 Model Performance and Evaluation In the previous chapter, we learned how to use the method of least-squares to find a line that best fits a scatter of points.

More information

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651 y 1 2 3 4 5 6 7 x Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html Lecture 32 Suhasini Subba Rao Previous lecture We are interested in whether a dependent

More information

We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model.

We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model. Statistical Methods in Business Lecture 5. Linear Regression We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model.

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Online Courses for High School Students

Online Courses for High School Students Online Courses for High School Students 1-888-972-6237 Algebra I Course Description: Students explore the tools of algebra and learn to identify the structure and properties of the real number system;

More information

Linear Regression Models

Linear Regression Models Linear Regression Models Model Description and Model Parameters Modelling is a central theme in these notes. The idea is to develop and continuously improve a library of predictive models for hazards,

More information

Notes on Maxwell & Delaney

Notes on Maxwell & Delaney Notes on Maxwell & Delaney PSY710 9 Designs with Covariates 9.1 Blocking Consider the following hypothetical experiment. We want to measure the effect of a drug on locomotor activity in hyperactive children.

More information

Regression in R I. Part I : Simple Linear Regression

Regression in R I. Part I : Simple Linear Regression UCLA Department of Statistics Statistical Consulting Center Regression in R Part I : Simple Linear Regression Denise Ferrari & Tiffany Head denise@stat.ucla.edu tiffany@stat.ucla.edu Feb 10, 2010 Objective

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

Chapter 3 - Linear Regression

Chapter 3 - Linear Regression Chapter 3 - Linear Regression Lab Solution 1 Problem 9 First we will read the Auto" data. Note that most datasets referred to in the text are in the R package the authors developed. So we just need to

More information

Linear Modelling: Simple Regression

Linear Modelling: Simple Regression Linear Modelling: Simple Regression 10 th of Ma 2018 R. Nicholls / D.-L. Couturier / M. Fernandes Introduction: ANOVA Used for testing hpotheses regarding differences between groups Considers the variation

More information

1 A Review of Correlation and Regression

1 A Review of Correlation and Regression 1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then

More information

STAT 512 MidTerm I (2/21/2013) Spring 2013 INSTRUCTIONS

STAT 512 MidTerm I (2/21/2013) Spring 2013 INSTRUCTIONS STAT 512 MidTerm I (2/21/2013) Spring 2013 Name: Key INSTRUCTIONS 1. This exam is open book/open notes. All papers (but no electronic devices except for calculators) are allowed. 2. There are 5 pages in

More information

Assumptions, Diagnostics, and Inferences for the Simple Linear Regression Model with Normal Residuals

Assumptions, Diagnostics, and Inferences for the Simple Linear Regression Model with Normal Residuals Assumptions, Diagnostics, and Inferences for the Simple Linear Regression Model with Normal Residuals 4 December 2018 1 The Simple Linear Regression Model with Normal Residuals In previous class sessions,

More information

Generalized Linear Models

Generalized Linear Models York SPIDA John Fox Notes Generalized Linear Models Copyright 2010 by John Fox Generalized Linear Models 1 1. Topics I The structure of generalized linear models I Poisson and other generalized linear

More information

MULTIPLE REGRESSION ANALYSIS AND OTHER ISSUES. Business Statistics

MULTIPLE REGRESSION ANALYSIS AND OTHER ISSUES. Business Statistics MULTIPLE REGRESSION ANALYSIS AND OTHER ISSUES Business Statistics CONTENTS Multiple regression Dummy regressors Assumptions of regression analysis Predicting with regression analysis Old exam question

More information

Comparing Several Means: ANOVA

Comparing Several Means: ANOVA Comparing Several Means: ANOVA Understand the basic principles of ANOVA Why it is done? What it tells us? Theory of one way independent ANOVA Following up an ANOVA: Planned contrasts/comparisons Choosing

More information

Stats fest Analysis of variance. Single factor ANOVA. Aims. Single factor ANOVA. Data

Stats fest Analysis of variance. Single factor ANOVA. Aims. Single factor ANOVA. Data 1 Stats fest 2007 Analysis of variance murray.logan@sci.monash.edu.au Single factor ANOVA 2 Aims Description Investigate differences between population means Explanation How much of the variation in response

More information

Bivariate data analysis

Bivariate data analysis Bivariate data analysis Categorical data - creating data set Upload the following data set to R Commander sex female male male male male female female male female female eye black black blue green green

More information

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression INTRODUCTION TO CLINICAL RESEARCH Introduction to Linear Regression Karen Bandeen-Roche, Ph.D. July 17, 2012 Acknowledgements Marie Diener-West Rick Thompson ICTR Leadership / Team JHU Intro to Clinical

More information

Chapter 5 Exercises 1

Chapter 5 Exercises 1 Chapter 5 Exercises 1 Data Analysis & Graphics Using R, 2 nd edn Solutions to Exercises (December 13, 2006) Preliminaries > library(daag) Exercise 2 For each of the data sets elastic1 and elastic2, determine

More information