Transformations Merlise Clyde Readings: Gelman & Hill Ch 2-4, ALR 8-9
Assumptions of Linear Regression

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \epsilon_i$$

- Model linear in $X_j$, but $X_j$ could be a transformation of the original variables
- $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
- $Y_i \overset{ind}{\sim} N(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip},\ \sigma^2)$
- correct mean function
- constant variance
- independent Normal errors
Animals

Read in the Animals data from MASS. The data set contains measurements of body weight (kg) and brain weight (g) for 28 animals. Let's try to predict brain weight from body weight.

library(MASS)
data(Animals)
brain.lm = lm(brain ~ body, data=Animals)
Diagnostic Plots

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

[Figure: standard lm diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); African elephant, Asian elephant, Human, and Brachiosaurus are flagged]
Outliers

- Issues with multiple comparisons if we compare each p-value to α = 0.05
- Bonferroni compares p-values to α/n
- Other corrections for false discoveries possible (STA 532)
Bonferroni Correction & Multiple Testing

- $H_1, \ldots, H_n$ are a family of hypotheses and $p_1, \ldots, p_n$ their corresponding p-values
- $n_0$ of the $n$ are true
- The familywise error rate (FWER) is the probability of rejecting at least one true $H_i$ (making at least one Type I error):
$$\text{FWER} = P\left\{\bigcup_{i=1}^{n_0}\left(p_i \le \frac{\alpha}{n}\right)\right\} \le \sum_{i=1}^{n_0} P\left(p_i \le \frac{\alpha}{n}\right) \le n_0\,\frac{\alpha}{n} \le n\,\frac{\alpha}{n} = \alpha$$
- This bound does not require any assumptions about dependence among the p-values or about how many of the null hypotheses are true.
- Link: https://en.wikipedia.org/wiki/Bonferroni_correction
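A quick simulation illustrates the bound above (a Python/numpy sketch with simulated uniform p-values, not the Animals data): with $n$ independent true nulls, rejecting whenever $p_i \le \alpha/n$ keeps the familywise error rate at or below $\alpha$, while the unadjusted rule does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, reps = 20, 0.05, 20000

# Under the null, each p-value is Uniform(0,1); here all n hypotheses are true (n0 = n).
pvals = rng.uniform(size=(reps, n))

# Unadjusted: reject any p_i <= alpha -> FWER far exceeds alpha.
fwer_raw = np.mean((pvals <= alpha).any(axis=1))

# Bonferroni: reject any p_i <= alpha/n -> FWER is at most alpha.
fwer_bonf = np.mean((pvals <= alpha / n).any(axis=1))

print(round(fwer_raw, 3), round(fwer_bonf, 3))
```

With $n = 20$ tests, the unadjusted FWER is roughly $1 - 0.95^{20} \approx 0.64$, while the Bonferroni rule stays near 0.05.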
Outliers

Flag outliers after the Bonferroni correction: $p_i < \alpha/n$

pval = 2*(1 - pt(abs(rstudent(brain.lm)), brain.lm$df.residual - 1))
rownames(Animals)[pval < .05/nrow(Animals)]

## [1] "Asian elephant"  "African elephant"
Cook's Distance

Measure of the influence of case $i$ on predictions: compare the fitted values with and without the $i$th case

$$D_i = \frac{\|\hat{Y} - \hat{Y}_{(i)}\|^2}{p\,\hat{\sigma}^2}$$

Easier way to calculate:

$$D_i = \frac{e_i^2}{p\,\hat{\sigma}^2}\left[\frac{h_{ii}}{(1 - h_{ii})^2}\right] = \frac{r_i^2}{p}\,\frac{h_{ii}}{1 - h_{ii}}$$

where $e_i$ is the residual, $r_i$ the standardized residual, and $h_{ii}$ the leverage of case $i$.

Flag cases where $D_i > 1$ or $D_i > 4/n$

rownames(Animals)[cooks.distance(brain.lm) > 1]

## [1] "Brachiosaurus"
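The "easier" leverage-based formula and the direct leave-one-out definition agree exactly. A numpy sketch on simulated data (not the Animals fit) checks this, computing $D_i$ both by refitting without case $i$ and via the shortcut:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2                               # p = number of coefficients (intercept + slope)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
h = np.diag(H)                             # leverages h_ii
yhat = H @ y
e = y - yhat
sigma2 = e @ e / (n - p)                   # \hat{sigma}^2

# Shortcut: D_i = e_i^2 h_ii / (p sigma^2 (1 - h_ii)^2)
D_short = e**2 * h / (p * sigma2 * (1 - h)**2)

# Direct definition: refit without case i, compare all n fitted values
D_direct = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D_direct[i] = np.sum((yhat - X @ beta_i)**2) / (p * sigma2)

print(np.allclose(D_short, D_direct))      # the two formulas agree
```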
Remove Influential Point & Refit

brain2.lm = lm(brain ~ body, data=Animals,
               subset = !(cooks.distance(brain.lm) > 1))
par(mfrow=c(2,2)); plot(brain2.lm)

[Figure: diagnostic plots for brain2.lm; Dipliodocus, Asian elephant, African elephant, and Triceratops are now flagged]
Keep removing points?

[Figure: diagnostic plots after removing Brachiosaurus; Triceratops, Asian elephant, and African elephant are flagged]
And another one bites the dust

[Figure: diagnostic plots after the next removal; Human, Asian elephant, and African elephant are flagged]
and another one

[Figure: diagnostic plots after yet another removal; Human, Horse, Cow, and Asian elephant are flagged]
And they just keep coming! Figure 1: Walt Disney Fantasia
Plot of Original Data (what you should always do first!)

library(ggplot2)
ggplot(Animals, aes(x=body, y=brain)) + geom_point() +
  xlab("Body Weight (kg)") + ylab("Brain Weight (g)")

[Figure: scatterplot of brain weight (g) vs body weight (kg); most points cluster near the origin with a few extreme body weights]
Log Transform

[Figure: histograms of Body Weight (heavily right-skewed) and log(Body Weight) (roughly symmetric)]
Plot of Transformed Data

library(dplyr)
Animals = mutate(Animals, log.body = log(body))
ggplot(Animals, aes(log.body, brain)) + geom_point()
#plot(brain ~ body, Animals, log="x")

[Figure: scatterplot of brain weight vs log(body weight)]
Diagnostics with log(body)

[Figure: diagnostic plots for the model with log(body); cases 7, 15, and 26 are flagged]

Variance increasing with mean
Try Log-Log

Animals = mutate(Animals, log.brain = log(brain))
ggplot(Animals, aes(log.body, log.brain)) + geom_point()
#plot(brain ~ body, Animals, log="xy")

[Figure: scatterplot of log(brain weight) vs log(body weight); the relationship is approximately linear]
Diagnostics with log(body) & log(brain)

[Figure: diagnostic plots for the log-log model; cases 6, 16, and 26 are flagged]
Optimal Transformation for Normality

The Box-Cox procedure can be used to find the best power transformation $\lambda$ of $Y$ (for positive $Y$) for a given set of transformed predictors.

$$\Psi(Y, \lambda) = \begin{cases} \dfrac{Y^\lambda - 1}{\lambda} & \text{if } \lambda \ne 0 \\ \log(Y) & \text{if } \lambda = 0 \end{cases}$$

- Find the value of $\lambda$ that maximizes the likelihood derived from $\Psi(Y, \lambda) \sim N(X\beta_\lambda, \sigma_\lambda^2)$ (need to obtain the distribution of $Y$ first)
- Equivalently, find $\lambda$ to minimize
$$\text{RSS}(\lambda) = \|\Psi_M(Y, \lambda) - X\hat{\beta}_\lambda\|^2$$
$$\Psi_M(Y, \lambda) = \begin{cases} GM(Y)^{1-\lambda}\,(Y^\lambda - 1)/\lambda & \text{if } \lambda \ne 0 \\ GM(Y) \log(Y) & \text{if } \lambda = 0 \end{cases}$$
where $GM(Y) = \exp\left(\sum \log(Y_i)/n\right)$ is the geometric mean.
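The RSS-minimization version of the procedure is easy to sketch directly (a Python/numpy toy example on simulated data, not a replacement for MASS::boxcox): implement $\Psi_M$ with the geometric-mean scaling and grid-search $\lambda$. The data below are generated so that the log transform ($\lambda = 0$) is approximately optimal.

```python
import numpy as np

def psi_m(y, lam):
    """Modified power transform Psi_M(Y, lambda): scaled by GM(Y)^(1-lambda)
    so that RSS(lambda) is comparable across lambda values."""
    gm = np.exp(np.mean(np.log(y)))
    if abs(lam) < 1e-8:                    # lambda = 0 case: GM(Y) * log(Y)
        return gm * np.log(y)
    return gm**(1 - lam) * (y**lam - 1) / lam

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=100)
# log(y) is linear in log(x) with Normal errors, so lambda near 0 should win
y = np.exp(1 + 0.5 * np.log(x) + rng.normal(scale=0.1, size=100))

X = np.column_stack([np.ones_like(x), np.log(x)])

def rss(lam):
    z = psi_m(y, lam)
    beta = np.linalg.lstsq(X, z, rcond=None)[0]
    return np.sum((z - X @ beta)**2)

grid = np.linspace(-2, 2, 81)
best = grid[np.argmin([rss(l) for l in grid])]
print(best)                                # grid value of lambda minimizing RSS
```

The chosen $\lambda$ lands near 0, matching the log-log model that the diagnostics favored.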
boxcox in R: Profile likelihood

library(MASS)
boxcox(braintransx.lm)

[Figure: profile log-likelihood of λ with 95% confidence interval]
Caveats

- The Box-Cox transformation depends on the choice of transformations of the X's
- For the choice of X transformation, use boxTidwell in library(car)
- Transformations of the X's can reduce leverage values (potential influence)
- If the dynamic range of Y or X is less than one order of magnitude (i.e., max/min < 10), transformation may have little effect
- Transformations such as logs may still be useful for interpretability
- Outliers that are not influential may still affect the estimate of σ and the width of confidence/prediction intervals
- Reproducibility: describe your steps and decisions
Review of Last Class

- In the model with both response and predictor log transformed, are the dinosaurs outliers?
- Should you test each one individually or as a group? If as a group, how do you think you would do this using lm?
- Do you think your final model is adequate? What else might you change?
Check Your Prediction Skills

After you determine whether the dinos can stay or go and refine your model, what about prediction?

- I would like to predict Aria's brain size given her current weight of 259 grams. Give me a prediction and an interval estimate.
- Is her body weight within the range of the data in Animals, or will you be extrapolating? What are the dangers here?
- Can you find any data on Rose-Breasted Cockatoo brain sizes? Are the values in the prediction interval?
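One hint for the back-transformation step (a numpy/scipy sketch on simulated stand-in data, not the actual Animals fit): with a log-log model, build the prediction interval on the log scale and exponentiate the endpoints. Note the unit conversion — body weight is in kg in Animals, so 259 g enters the model as 0.259 kg.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 28
logx = rng.uniform(-4, 12, size=n)                           # stand-in for log(body in kg)
logy = 2.5 + 0.75 * logx + rng.normal(scale=0.7, size=n)     # stand-in for log(brain in g)

# Fit log(brain) ~ log(body) by least squares
X = np.column_stack([np.ones(n), logx])
beta, *_ = np.linalg.lstsq(X, logy, rcond=None)
e = logy - X @ beta
s2 = e @ e / (n - 2)

# New case: 259 grams = 0.259 kg, on the log scale
x0 = np.array([1.0, np.log(0.259)])
se_pred = np.sqrt(s2 * (1 + x0 @ np.linalg.inv(X.T @ X) @ x0))
t = stats.t.ppf(0.975, n - 2)
fit = x0 @ beta

# Back-transform endpoints to grams (exp is monotone, so the interval maps directly)
lo, hi = np.exp(fit - t * se_pred), np.exp(fit + t * se_pred)
print(lo, hi)
```

The back-transformed interval is for the median of brain weight on the original scale, not the mean, since exponentiation is nonlinear.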