
Model Modifications
Bret Larget
Departments of Botany and of Statistics, University of Wisconsin Madison
Statistics 572 (Spring 2007)
February 6, 2007

The Big Picture: Remedies after Model Diagnostics

Residual plots can indicate lack of model fit. There are several possible remedies, including:

1. Transform one or both variables and check whether the standard assumptions are reasonable for the transformed variable(s). This can be useful when residual plots indicate non-linearity and/or heteroscedasticity. Conventional transformations include logarithms and square roots; the Box-Cox family of transformations is also useful.

2. Use weighted least squares when there is explainable heteroscedasticity but the linear model is otherwise fine.

3. Use polynomial regression when there is non-linearity (curvature) but variances are close to constant.

Bacteria Count

The data consist of the number of surviving bacteria after exposure to X-rays for different time periods. Time denotes time measured in six-minute intervals; N denotes the number of survivors in hundreds.

Time:   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
N:    355  211  197  166  142  166  104   60   56   38   36   32   21   19   15

We begin by plotting the data, then fit a linear model and assess the fit informally with residual plots.
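As a cross-check outside R, the straight-line fit that these diagnostics examine can be reproduced with a short numpy sketch (Python here rather than the notes' R; the printed coefficients are computed from the data above, not taken from the slides):

```python
import numpy as np

# Bacteria data from the notes: Time in six-minute intervals,
# N = survivors in hundreds.
time = np.arange(1, 16)
N = np.array([355, 211, 197, 166, 142, 166, 104, 60,
              56, 38, 36, 32, 21, 19, 15])

# Ordinary least squares fit of N = b0 + b1 * Time.
X = np.column_stack([np.ones_like(time, dtype=float), time])
b0, b1 = np.linalg.lstsq(X, N.astype(float), rcond=None)[0]
print(b0, b1)  # intercept and slope of the straight-line fit
```

The strongly negative slope summarizes the decay, but, as the residual plots below show, a straight line on the raw scale is not an adequate description of these data.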

Scatterplot and Residual Plot

> Time = 1:15
> N = c(355, 211, 197, 166, 142, 166, 104, 60, 56, 38,
+       36, 32, 21, 19, 15)
> par(las = 1, pch = 16)
> plot(Time, N)
> fit1 = lm(N ~ Time)
> plot(fit1, which = 1)

[Scatterplot of N versus Time and plot of residuals versus fitted values.]

The scatterplot shows lack of linearity. The residual plot also shows increasing variance, and observation #1 is a bit of an outlier. Consider transforming variables to see if the model fits better.

Review of Exponentiation and Logarithms

The constant e ≈ 2.718 is the base of the natural logarithm. Recall from calculus that e is the unique base for which the derivative of the exponential equals the function itself: d/dx (e^x) = e^x. I will use log, not ln, to stand for the natural logarithm.

- Products of exponentials are exponentials of sums: e^a e^b = e^(a+b).
- The natural logarithm of e is one: log e = 1.
- Any logarithm of 1 is zero: log 1 = 0.
- Rule for exponents: log(a^b) = b log a.
- Logarithms of products: log(ab) = log(a) + log(b).
- e^x exists for all x and e^x > 0; log x is defined only for x > 0.

Exponential Model

Here there is a theoretical model:

    n_t = n_0 e^(β1 t) E,

where t is time, n_t is the number of bacteria at time t, n_0 is the number of bacteria at time t = 0, β1 < 0 is a decay rate, and E is a multiplicative error.

Take natural logs of both sides of the model:

    log(n_t) = log(n_0 e^(β1 t) E)
             = log(n_0) + log(e^(β1 t)) + log(E)
             = log(n_0) + β1 t + log(E)
             = β0 + β1 t + e.

That is, we log-transform n_t and the result is the usual straight-line model, provided the error E on the original scale is multiplicative and its logarithm is normally distributed.
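The log-linear fit that this derivation justifies can be verified outside R; a minimal numpy sketch, assuming only the data given in the notes:

```python
import numpy as np

# Bacteria data from the notes.
time = np.arange(1, 16)
N = np.array([355, 211, 197, 166, 142, 166, 104, 60,
              56, 38, 36, 32, 21, 19, 15])

# Least squares on the log scale: log(N) = b0 + b1 * t + error.
X = np.column_stack([np.ones_like(time, dtype=float), time])
b0, b1 = np.linalg.lstsq(X, np.log(N), rcond=None)[0]

n0 = np.exp(b0)  # back-transformed intercept, the fitted n_0
print(b0, b1, n0)
```

The estimates agree with the R summary of fit2 below (b0 ≈ 6.03, b1 ≈ −0.222, n0 ≈ 415).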

Scatterplot and Residual Plot (log scale)

> par(las = 1, pch = 16)
> plot(Time, log(N))
> fit2 = lm(log(N) ~ Time)
> plot(fit2, which = 1)

[Scatterplot of log(N) versus Time and plot of residuals versus fitted values.]

Diagnostics are consistent with a model that fits well: there is no obvious non-linearity, no obvious heteroscedasticity, and the residual plot has no large deviations from random scatter.

Fitted Model for Log-Transformed Data

> summary(fit2)

Call:
lm(formula = log(N) ~ Time)

Residuals:
      Min        1Q    Median        3Q       Max
-0.233578 -0.091798 -0.007255  0.050165  0.413068

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.028695   0.088259   68.31  < 2e-16 ***
Time        -0.221629   0.009707  -22.83  7.1e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1624 on 13 degrees of freedom
Multiple R-Squared: 0.9757, Adjusted R-squared: 0.9738
F-statistic: 521.3 on 1 and 13 DF, p-value: 7.103e-12

The estimates are β̂0 = 6.03 and β̂1 = −0.222. On the original scale, exp(β̂0) ≈ 415, so the fitted model is

    y = 415 e^(−0.222 x),

where y is the bacteria count in hundreds and x is time in six-minute intervals.

Confidence and Prediction Intervals

> t0 = data.frame(Time = 10)
> predict(fit2, t0, interval = "c")
          fit     lwr      upr
[1,] 3.812403 3.71256 3.912246
> predict(fit2, t0, interval = "p")
          fit     lwr      upr
[1,] 3.812403 3.44756 4.177246
> exp(predict(fit2, t0, interval = "c"))
          fit      lwr      upr
[1,] 45.25907 40.95853 50.01116
> exp(predict(fit2, t0, interval = "p"))
          fit      lwr      upr
[1,] 45.25907 31.42363 65.18610

The point estimate for the mean bacteria count in the 10th time interval is 45.3 (hundreds). A 95% confidence interval for the mean goes from 41.0 to 50.0, and a 95% prediction interval goes from 31.4 to 65.2.
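The interval arithmetic behind predict() can be sketched by hand; a minimal numpy version, using the standard simple-linear-regression standard errors (the critical value 2.1604 is qt(0.975, 13)):

```python
import numpy as np

# Data from the notes.
time = np.arange(1, 16)
N = np.array([355, 211, 197, 166, 142, 166, 104, 60,
              56, 38, 36, 32, 21, 19, 15])
y = np.log(N)

# Fit log(N) = b0 + b1 * t by least squares.
n = len(time)
xbar, ybar = time.mean(), y.mean()
Sxx = np.sum((time - xbar) ** 2)
b1 = np.sum((time - xbar) * (y - ybar)) / Sxx
b0 = ybar - b1 * xbar

# Residual standard error with n - 2 degrees of freedom.
resid = y - (b0 + b1 * time)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Intervals at t = 10; tcrit is the 97.5% point of t with 13 df.
x0, tcrit = 10, 2.1604
fit = b0 + b1 * x0
se_mean = s * np.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)
se_pred = s * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
ci = np.exp([fit - tcrit * se_mean, fit + tcrit * se_mean])
pi = np.exp([fit - tcrit * se_pred, fit + tcrit * se_pred])
```

Exponentiating the endpoints reproduces the intervals on the count scale: about 41 to 50 for the mean, and 31 to 65 for a new observation.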

Other Transformations

We could transform either y or x or both. Common transformations include:

- natural log: log (ln)
- log base 10: log10
- square root: √y
- reciprocal: 1/y

Less common transformations include:

- squaring: y^2
- reciprocal squaring: 1/y^2
- cube root: y^(1/3)
- arcsine transformation: arcsin(√y), useful when y is a proportion

Box-Cox Transformations

Box-Cox transformations are a continuous family of power transformations; the log transformation corresponds to a power of 0. The Box-Cox transformation is

    y(λ) = (y^λ − 1) / λ,  if λ ≠ 0
    y(λ) = log y,          if λ = 0.

In R, this transformation is implemented by the boxcox function in the MASS package. In theory, we can estimate λ as a continuous parameter; in practice, we might select a common transformation power close to the continuous estimate.

Box-Cox Transformation on Bacteria

> Time = 1:15
> N = c(355, 211, 197, 166, 142, 166, 104, 60, 56,
+       38, 36, 32, 21, 19, 15)
> fit1 = lm(N ~ Time)
> library(MASS)
> boxcox(fit1)

[Plot of the profile log-likelihood against λ, with a 95% confidence interval for λ.]

Remarks

Transformations are typically chosen empirically. Transformation of y can affect both linearity and variance homogeneity, while transformations of x only affect linearity. Sometimes solving one problem causes another: for example, transforming y to stabilize the variance can introduce curvature. Start simple and experiment.
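The piecewise definition above is continuous in λ, since (y^λ − 1)/λ → log y as λ → 0; a tiny Python sketch of the transformation itself (the helper name boxcox is ours, not MASS's R function):

```python
import math

def boxcox(y, lam):
    """Box-Cox power transformation of a positive value y."""
    if lam == 0:
        return math.log(y)  # the limiting case of (y**lam - 1)/lam
    return (y ** lam - 1) / lam
```

For example, boxcox(4, 0.5) = (√4 − 1)/0.5 = 2, while for λ near 0 the value is essentially log(4); this is why the log transform is treated as the λ = 0 member of the family.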

Weighted Regression

Weighted regression is a method that can be used when linearity is correct but the variance is not constant. Sometimes there may be a theory justifying a formula for the variance as a function of x, such as σ²_x = σ²x², which would be the case if the variance increased quadratically. Sometimes people use empirical methods to estimate variances. Differences in variances can also result from having access to means, but not individual values, from samples of different sizes.

In weighted regression, we use a weighted least squares criterion. If observation y_i has variance σ_i², the appropriate least squares criterion is

    Σ_{i=1}^{n} w_i (y_i − (b_0 + b_1 x_i))²,  where w_i = 1/σ_i².

Here is an example of artificial data reflecting missing data. In a planned experiment, there were 10 seeds in each of five treatment groups, but not all seeds germinated, resulting in actual sample sizes of 10, 10, 7, 8, and 6. The treatment is a concentration of fertilizer, and the response, height at a specific number of days after planting, is reported as a mean per plant.

Data:

conc      1.0   2.0   4.0   8.0  16.0
height   20.7  21.1  23.1  25.1  28.7
n          10    10     7     8     6

We could use simple linear regression on the raw data before taking means, but there are real settings where the individual data are unavailable.

R and Residual Plots

> conc = c(1, 2, 4, 8, 16)
> ht = c(20.7, 21.1, 23.1, 25.1, 28.7)
> n = c(10, 10, 7, 8, 6)
> fit.u = lm(ht ~ conc)
> fit.w = lm(ht ~ conc, weights = n)
> summary(fit.u)
> summary(fit.w)
> plot(fitted(fit.u), rstudent(fit.u))
> plot(fitted(fit.w), rstudent(fit.w))

[Studentized residuals versus fitted values for the unweighted and weighted fits.]

> coef(fit.u)
(Intercept)        conc
 20.4333333   0.5333333
> sqrt(10) * summary(fit.u)$sigma
[1] 1.577621
> coef(fit.w)
(Intercept)        conc
 20.3433552   0.5441396
> summary(fit.w)$sigma
[1] 1.395566
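The weighted criterion above has a closed-form solution, b = (XᵀWX)⁻¹XᵀWy with W = diag(w_i); a minimal numpy sketch (not the notes' R) reproduces lm's weighted coefficients for the germination data:

```python
import numpy as np

# Germination example from the notes: group means and group sizes.
conc = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
ht = np.array([20.7, 21.1, 23.1, 25.1, 28.7])
n = np.array([10, 10, 7, 8, 6])

# Weighted least squares: minimize sum(w_i * (y_i - b0 - b1*x_i)^2)
# with w_i = n_i, since Var(mean of n_i values) = sigma^2 / n_i.
X = np.column_stack([np.ones_like(conc), conc])
W = np.diag(n.astype(float))
b = np.linalg.solve(X.T @ W @ X, X.T @ W @ ht)
print(b)  # intercept and slope of the weighted fit
```

Weighting by n_i pulls the fit toward the groups whose means are based on more seeds, which is why the weighted coefficients differ slightly from the unweighted ones.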