Transformations. Merlise Clyde. Readings: Gelman & Hill Ch 2-4, ALR 8-9


Assumptions of Linear Regression

$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \epsilon_i$

- Model is linear in the $X_j$, but $X_j$ could be a transformation of the original variables
- $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
- Equivalently, $Y_i \overset{ind}{\sim} N(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}, \sigma^2)$
- correct mean function
- constant variance
- independent Normal errors
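As a point of reference, here is a minimal simulation sketch (not from the slides) of data that satisfies all of these assumptions; the diagnostics below are what we rely on when we cannot simulate the truth:

set.seed(42)
n <- 100
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
eps <- rnorm(n, mean = 0, sd = 2)    # iid N(0, sigma^2) errors
y <- 1 + 2*x1 - 0.5*x2 + eps         # correct (linear) mean function
fit <- lm(y ~ x1 + x2)
summary(fit)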

Animals

Read in the Animals data from MASS. The data set contains measurements of body weight (kg) and brain weight (g) for 28 animals. Let's try to predict brain weight from body weight.

library(MASS)
data(Animals)
brain.lm = lm(brain ~ body, data = Animals)

Diagnostic Plots

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

[Four standard lm diagnostic plots for brain.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. The African elephant, Asian elephant, and Human stand out in the residual plots; Brachiosaurus dominates the leverage plot, beyond the Cook's distance contours at 0.5 and 1.]

Outliers

- Issues with multiple comparisons if we compare each p-value to α = 0.05
- Bonferroni compares p-values to α/n
- Other corrections for false discoveries are possible (STA 532)

Bonferroni Correction & Multiple Testing

$H_1, \ldots, H_n$ are a family of hypotheses and $p_1, \ldots, p_n$ their corresponding p-values; $n_0$ of the $n$ are true.

The familywise error rate (FWER) is the probability of rejecting at least one true $H_i$ (making at least one Type I error):

$$\text{FWER} = P\left( \bigcup_{i=1}^{n_0} \left\{ p_i \le \frac{\alpha}{n} \right\} \right) \le \sum_{i=1}^{n_0} P\left( p_i \le \frac{\alpha}{n} \right) = n_0 \frac{\alpha}{n} \le n \frac{\alpha}{n} = \alpha$$

This does not require any assumptions about dependence among the p-values or about how many of the null hypotheses are true.

Link: https://en.wikipedia.org/wiki/Bonferroni_correction
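In R, the same rule can be applied with p.adjust (a minimal sketch; pval is any vector of p-values, such as the one computed on the next slide):

alpha <- 0.05
n <- length(pval)
which(pval < alpha / n)                               # compare raw p-values to alpha/n
which(p.adjust(pval, method = "bonferroni") < alpha)  # equivalent: adjust, then compare to alpha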

Outliers

Flag outliers after the Bonferroni correction: $p_i < \alpha/n$.

pval = 2*(1 - pt(abs(rstudent(brain.lm)), brain.lm$df.residual - 1))
rownames(Animals)[pval < .05/nrow(Animals)]

## [1] "Asian elephant"  "African elephant"

(rstudent gives externally studentized residuals, which have one fewer residual degree of freedom.)
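The car package wraps this same Bonferroni-corrected test of the studentized residuals (assuming car is installed):

library(car)
outlierTest(brain.lm)  # largest |rstudent| with Bonferroni-adjusted p-values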

Cook's Distance

Measure of the influence of case $i$ on predictions, based on removing the $i$th case:

$$D_i = \frac{\lVert \hat{Y} - \hat{Y}_{(i)} \rVert^2}{p\,\hat{\sigma}^2}$$

Easier ways to calculate:

$$D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \left[ \frac{h_{ii}}{(1 - h_{ii})^2} \right] = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

where $e_i$ is the residual, $h_{ii}$ the leverage, and $r_i$ the standardized residual.

Flag cases where $D_i > 1$ or $D_i > 4/n$.

rownames(Animals)[cooks.distance(brain.lm) > 1]

## [1] "Brachiosaurus"
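The same flagging with the more liberal 4/n cutoff (a quick sketch; output not shown on the slides):

rownames(Animals)[cooks.distance(brain.lm) > 4 / nrow(Animals)]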

Remove Influential Point & Refit

brain2.lm = lm(brain ~ body, data = Animals,
               subset = !(cooks.distance(brain.lm) > 1))
par(mfrow = c(2,2)); plot(brain2.lm)

[Diagnostic plots for brain2.lm: with Brachiosaurus removed, Dipliodocus now dominates the Residuals vs Leverage plot, and the Asian and African elephants are still flagged in the residual plots.]

Keep removing points?

[Diagnostics after also removing Dipliodocus: Triceratops becomes the new high-leverage point with large Cook's distance, and the elephants are again flagged in the residual plots.]

And another one bites the dust

[Diagnostics after removing Triceratops: the African elephant now has the largest Cook's distance, with Human and the Asian elephant flagged in the residual plots.]

and another one

[Diagnostics after removing the African elephant: Human, Horse, Cow, and the Asian elephant are flagged next.]

And they just keep coming! Figure 1: Walt Disney Fantasia

Plot of Original Data (what you should always do first!)

library(ggplot2)
ggplot(Animals, aes(x = body, y = brain)) + geom_point() +
  xlab("Body Weight (kg)") + ylab("Brain Weight (g)")

[Scatterplot of brain weight (g, up to about 4,000) against body weight (kg, up to about 75,000): most animals are compressed near the origin by a few enormous body weights.]

Log Transform

[Side-by-side histograms of body weight and log(body weight): the raw weights are extremely right-skewed, while the log scale is far more symmetric.]

Plot of Transformed Data

library(dplyr)
Animals = mutate(Animals, log.body = log(body))
ggplot(Animals, aes(log.body, brain)) + geom_point()

# Base-graphics alternative: plot(brain ~ body, Animals, log="x")

[Scatterplot of brain weight against log(body weight).]

Diagnostics with log(body)

[Diagnostic plots for the regression of brain on log(body): cases 7, 15, and 26 are flagged, and the Scale-Location plot shows the spread of the residuals growing with the fitted values.]

Variance increasing with mean

Try Log-Log

Animals = mutate(Animals, log.brain = log(brain))
ggplot(Animals, aes(log.body, log.brain)) + geom_point()

# Base-graphics alternative: plot(brain ~ body, Animals, log="xy")

[Scatterplot of log(brain weight) against log(body weight): the relationship now looks linear.]
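The next slide shows diagnostics for this model; a sketch of the fit they presumably come from (the model name braintransxy.lm is an assumption, not from the slides):

braintransxy.lm = lm(log.brain ~ log.body, data = Animals)
par(mfrow = c(2,2)); plot(braintransxy.lm)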

Diagnostics with log(body) & log(brain)

[Diagnostic plots for the log-log model: the residuals are much better behaved; cases 6, 16, and 26 are mildly flagged, and no case has a large Cook's distance.]

Optimal Transformation for Normality

The Box-Cox procedure can be used to find the best power transformation $\lambda$ of $Y$ (for positive $Y$) for a given set of transformed predictors:

$$\Psi(Y, \lambda) = \begin{cases} \dfrac{Y^\lambda - 1}{\lambda} & \text{if } \lambda \ne 0 \\ \log(Y) & \text{if } \lambda = 0 \end{cases}$$

Find the value of $\lambda$ that maximizes the likelihood derived from $\Psi(Y, \lambda) \sim N(X\beta_\lambda, \sigma^2_\lambda)$ (need to obtain the distribution of $Y$ first). Equivalently, find $\lambda$ to minimize

$$RSS(\lambda) = \lVert \Psi_M(Y, \lambda) - X\hat{\beta}_\lambda \rVert^2$$

$$\Psi_M(Y, \lambda) = \begin{cases} GM(Y)^{1-\lambda} \, (Y^\lambda - 1)/\lambda & \text{if } \lambda \ne 0 \\ GM(Y) \log(Y) & \text{if } \lambda = 0 \end{cases}$$

where $GM(Y) = \exp\left( \sum \log(Y_i)/n \right)$ is the geometric mean.
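These formulas can be evaluated directly; a minimal sketch (an illustration, not from the slides) of the RSS(λ) profile for the log(body) model:

# Compute RSS(lambda) on a grid using the geometric-mean-scaled transform
psiM <- function(y, lambda) {
  gm <- exp(mean(log(y)))                       # geometric mean GM(Y)
  if (lambda == 0) gm * log(y) else gm^(1 - lambda) * (y^lambda - 1) / lambda
}
lambdas <- seq(-2, 2, by = 0.1)
rss <- sapply(lambdas, function(l)
  sum(resid(lm(psiM(Animals$brain, l) ~ log.body, data = Animals))^2))
lambdas[which.min(rss)]                         # lambda minimizing RSS(lambda)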

boxcox in R: Profile likelihood

library(MASS)
# braintransx.lm is the fit with the log-transformed predictor, e.g.
# braintransx.lm = lm(brain ~ log.body, data = Animals)
boxcox(braintransx.lm)

[Profile log-likelihood of λ over (-2, 2), with the 95% confidence interval marked.]
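boxcox also returns the grid of λ values and profile log-likelihoods, so the maximizing λ can be extracted directly (a small usage sketch):

bc = boxcox(braintransx.lm, plotit = FALSE)
bc$x[which.max(bc$y)]  # lambda with the highest profile log-likelihood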

Caveats

- The Box-Cox transformation depends on the choice of transformations of the X's
- For choosing an X transformation, use boxTidwell in library(car) (see the sketch after this list)
- Transformations of the X's can reduce leverage values (potential influence)
- If the dynamic range of Y or X is small, i.e. max/min less than about 10, a transformation may have little effect
- Transformations such as logs may still be useful for interpretability
- Outliers that are not influential may still affect the estimate of σ and the width of confidence/prediction intervals
- Reproducibility: describe your steps and decisions
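A minimal boxTidwell sketch for the predictor transformation mentioned above (assuming car is installed; convergence on data this skewed is not guaranteed):

library(car)
boxTidwell(brain ~ body, data = Animals)  # estimates a power transformation for body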

Review of Last Class

In the model with both response and predictor log transformed, are the dinosaurs outliers?

- Should you test each one individually or as a group? If as a group, how do you think you would do this using lm? (See the sketch below.)
- Do you think your final model is adequate? What else might you change?
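One possible group test (a sketch of my own, not the slides' stated answer): give the dinosaurs a common mean-shift term and compare nested fits with an F-test.

# Indicator for the three dinosaurs; separate indicators (3 df) are an alternative
Animals$dino = rownames(Animals) %in% c("Brachiosaurus", "Dipliodocus", "Triceratops")
reduced = lm(log.brain ~ log.body, data = Animals)
full = lm(log.brain ~ log.body + dino, data = Animals)
anova(reduced, full)  # F-test: do the dinosaurs share a shift from the line?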

Check Your Prediction Skills

After you determine whether the dinos can stay or go and refine your model, what about prediction?

- I would like to predict Aria's brain size given her current weight of 259 grams. Give me a prediction and an interval estimate. (See the sketch below.)
- Is her body weight within the range of the data in Animals, or will you be extrapolating? What are the dangers here?
- Can you find any data on Rose-Breasted Cockatoo brain sizes? Are the values in the prediction interval?
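A hedged sketch of the prediction mechanics, assuming the log-log fit braintransxy.lm from earlier (note that 259 g must be converted to kg to match the units of body):

newdata = data.frame(log.body = log(0.259))  # 259 g = 0.259 kg
pred = predict(braintransxy.lm, newdata, interval = "prediction")
exp(pred)  # back-transform the point prediction and interval to grams

Exponentiating the interval endpoints is legitimate because exp is monotone, though exp of the fitted value estimates the median, not the mean, of brain weight on the original scale.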