Outline: Ch. 5, Transformations and Weighting. Three approaches:

1. Variance-stabilizing transformations; Box-Cox transformations (Sections 5.2 and 5.4)
2. Transformations to linearize the model (Section 5.3)
3. Weighted regression (Section 5.5)

Variance-Stabilizing Transformations

Model assumptions:

    E[y|x] = β_0 + β_1 x,    V(y|x) = σ²

Set μ_y = E[y|x]. What if V(y|x) = σ² f(μ_y), where f is some non-constant function? Try to find a function g(y) so that V(g(y)|x) = constant.

Obtain a Taylor expansion of g(y) about μ_y:

    g(y) = g(μ_y) + (y − μ_y) g′(μ_y) + ((y − μ_y)²/2) g″(μ_y) + …

Then, to first order,

    V(g(y)) ≈ V(y) (g′(μ_y))² = σ² f(μ_y) (g′(μ_y))²

so V(g(y)) will be (approximately) constant if

    g′(μ_y) = 1/√f(μ_y),   i.e.   g′(z) = 1/√f(z).

Examples:

1. f(x) = x (e.g. Poisson data): g′(x) = x^(−1/2), so g(y) = √y.

[Figure: residuals vs fitted values for lm(formula = yy ~ xx), Poisson data]
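The delta-method calculation above can be checked numerically. A quick simulation sketch (in Python with numpy, unlike the R used elsewhere in these notes; the mean levels are assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Poisson counts at two mean levels: the raw variance tracks the mean,
# so V(y) = f(mu) = mu is far from constant across levels.
y_lo = rng.poisson(25.0, size=200_000).astype(float)
y_hi = rng.poisson(100.0, size=200_000).astype(float)
print(y_lo.var(), y_hi.var())            # roughly 25 and 100

# After g(y) = sqrt(y), the delta method gives
# V(g(y)) ~ (g'(mu))^2 * mu = (1/(2*sqrt(mu)))^2 * mu = 1/4 at both levels.
print(np.sqrt(y_lo).var(), np.sqrt(y_hi).var())   # both roughly 0.25
```

The transformed variances agree at the two mean levels, which is exactly what "variance-stabilizing" means.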
[Figure: residuals vs fitted values for lm(sqrt(yy) ~ xx), Poisson data after the square-root transformation]

2. f(x) = x² (e.g. exponential data): g′(x) = 1/x, so g(y) = log(y).

[Figure: residuals vs fitted values for lm(formula = yy ~ xx), exponential data]

3. f(x) = x(1 − x) (e.g. binomial data): since

    (d/dx) arcsin(√x) = 1 / (2√(x(1 − x))),

take g(y) = arcsin(√y).

5.4. Box-Cox Transformations (on response)

Select the power λ in the transformation

    g(y) = y^λ

by maximum likelihood. This is equivalent to minimizing the SSE with respect to λ (and the other parameters).
Caution: The residual sums of squares are not comparable for different values of λ. We need to ensure that comparisons are made according to the same standard:

    y^(λ) = (y^λ − 1) / (λ ẏ^(λ−1)),   λ ≠ 0
    y^(λ) = ẏ log y,                    λ = 0

where ẏ = geometric mean of the y_i's.

Strategy:

1. Perform the transformation y_1^(λ), …, y_n^(λ) for several values of λ.
2. Compute the SSE for each value of λ.
3. Select the λ which gives the minimum value.
4. Fit y^(λ) = Xβ + ɛ.

Approximate confidence intervals for λ can also be obtained. In R, use boxcox(y ~ x, data = dataset).

Examples:

1. Bacteria data (Ex. 5.3) - the average number of surviving bacteria (y) in a canned food product versus time (t) of exposure to 300°F heat.

> library(MPV)
> data(p5.3)
> bact.lm <- lm(bact ~ min, data=p5.3)
> plot(bact.lm, which=1)      # residuals vs fitted values
> plot(bact.lm, which=2)      # normal Q-Q plot
> library(MASS)
> boxcox(bact.lm)             # profile log-likelihood for lambda
> bactlog.lm <- lm(log(bact) ~ min, data=p5.3)
> plot(bactlog.lm, which=1)   # residuals vs fitted values
> plot(bactlog.lm, which=2)   # normal Q-Q plot

[Figure: residuals vs fitted values for lm(formula = bact ~ min, data = p5.3)]
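The four-step strategy above can also be sketched outside R. A minimal Python version (numpy only; the data are simulated from a log-linear truth, so the search should land near λ = 0 — these values are assumptions, not the bacteria data):

```python
import numpy as np

def boxcox_scaled(y, lam):
    """Scaled Box-Cox transform y^(lambda): SSEs are comparable across lambda."""
    gm = np.exp(np.mean(np.log(y)))              # geometric mean, y-dot
    if abs(lam) < 1e-8:                          # lambda = 0 case
        return gm * np.log(y)
    return (y**lam - 1.0) / (lam * gm**(lam - 1.0))

def sse(y, X):
    """Residual sum of squares from the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

rng = np.random.default_rng(2)
t = np.linspace(1.0, 10.0, 60)
y = np.exp(3.0 - 0.4 * t + rng.normal(0.0, 0.1, t.size))   # log-linear truth
X = np.column_stack([np.ones_like(t), t])

lams = np.linspace(-2.0, 2.0, 81)                          # step 1: several lambdas
sses = [sse(boxcox_scaled(y, lam), X) for lam in lams]     # step 2: SSE for each
lam_hat = lams[int(np.argmin(sses))]                       # step 3: minimizer
print(lam_hat)                                             # near 0: the log transform
```

Without the geometric-mean scaling in `boxcox_scaled`, the SSEs at different λ would not be comparable and the grid search would be meaningless.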
[Figure: normal Q-Q plot of standardized residuals for lm(formula = bact ~ min, data = p5.3)]

[Figure: Box-Cox profile log-likelihood vs λ for bact.lm, with 95% confidence interval marked; λ shown over (−2, 2)]
[Figure: residuals vs fitted values for lm(formula = log(bact) ~ min, data = p5.3)]

[Figure: normal Q-Q plot of standardized residuals for lm(formula = log(bact) ~ min, data = p5.3)]

A model of the form

    log(y) = β_0 + β_1 t + ε

is reasonable; note that the fitted slope is negative (β̂_1 = −0.236), as expected for bacteria dying off over time.

2. trees data: 31 observations on Girth (g), Height (h) and Volume (V).

Simple model (treating the trunk as a cylinder with girth, i.e. circumference, g):

    V ≈ g² h / (4π)

or, in regression form,

    log V = β_0 + β_1 log h + β_2 log g + ε

> library(DAAG)
> data(trees); attach(trees)
> trees.lm <- lm(log(volume) ~ log(girth) + log(height))
> boxcox(trees.lm)   # lambda = 1 is OK
> summary(trees.lm)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -6.632      0.800   -8.29  5.1e-09
log(height)    1.117      0.204    5.46  7.8e-06
log(girth)     1.983      0.075   26.43  < 2e-16

[Figure: Box-Cox profile log-likelihood vs λ for trees.lm, with 95% confidence interval; λ = 1 lies inside the interval]

The coefficient of log(height) is not distinguishable from 1, and the coefficient of log(girth) is not distinguishable from 2, consistent with the cylinder model.

5.3 Linearizing Transformations

Intrinsically linear model: The relationship between y and x is such that a simple transformation can produce a linear model.

Example: Fit the model

    E[y] = β_0 e^(β_1 x)

Taking logs,

    log E[y] = log β_0 + β_1 x

which suggests fitting

    log y_i = β_0′ + β_1 x_i + ε_i,   where β_0′ = log β_0.

Note that this implies multiplicative errors:

    y_i = e^(β_0′ + β_1 x_i + ε_i) = β_0 e^(β_1 x_i) e^(ε_i)

If the error is additive, i.e. y_i = β_0 e^(β_1 x_i) + ε_i, then the transformation is not appropriate.

Other possibilities from the text:
1. E[y] = β_0 x^(β_1). Taking logs,

    log E[y] = log β_0 + β_1 log x

New model: log y_i = β_0′ + β_1 log x_i + ε_i.

2. E[y] = x / (β_0 x + β_1). Taking reciprocals,

    1/E[y] = β_0 + β_1 (1/x)

New model: 1/y_i = β_0 + β_1 (1/x_i) + ε_i.

Example - windmill data (see wind). These data concern the relation between the electrical output of a windmill subjected to different wind velocities. A decent model is

    DC output = β_0 + β_1 (1/velocity) + ε

[Figure: windmill data, untransformed: DC output vs wind velocity]

[Figure: windmill data, transformed: DC output vs 1/(wind velocity)]
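The multiplicative-error point matters in practice: the log-scale fit recovers the parameters only when the error enters multiplicatively. A Python sketch for the power model E[y] = β_0 x^(β_1) (all numeric values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
b0, b1 = 2.0, 0.7
x = np.linspace(0.5, 5.0, 400)

# Multiplicative error: y = b0 * x**b1 * exp(eps), so log y is linear in log x.
y = b0 * x**b1 * np.exp(rng.normal(0.0, 0.2, x.size))

# Fit log y = log(b0) + b1 * log(x) + eps by ordinary least squares.
X = np.column_stack([np.ones_like(x), np.log(x)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print(np.exp(coef[0]), coef[1])   # estimates of b0 and b1
```

With additive errors the same log-scale fit would be biased, which is why the error structure, not just the mean function, decides whether a linearizing transformation is appropriate.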
Some models are intrinsically nonlinear:

e.g. Michaelis-Menten model (useful for modelling chemical reaction rates):

    y = β_0 x / (β_1 + x) + ε

e.g. Mitscherlich law (useful for modelling chemical yield, etc.):

    y = β_0 − β_1 γ^x + ε

e.g. logistic growth model:

    y = β_0 / (1 + β_1 e^(−kx)) + ε

Box-Tidwell transformation of a predictor variable

Consider the model

    y = β_0 + β_1 x^α + ε

If α is known, β_0 and β_1 can be estimated by least squares. How can α be estimated? Suppose we have a good guess α_0. Taylor expand x^α about α_0:

    x^α = x^(α_0) + (α − α_0) x^(α_0) log(x) + O((α − α_0)²)

so if α_0 is close to α, we have

    x^α ≈ x^(α_0) + (α − α_0) x^(α_0) log(x)

Our regression model then looks like

    y ≈ β_0 + β_1 x^(α_0) + β_1 (α − α_0) x^(α_0) log(x) + ε

so consider

    y ≈ β_0 + β_1 x^(α_0) + β_2 x^(α_0) log(x) + ε,   where β_2 = β_1 (α − α_0).

This gives the updating equation:

    α_1 = β_2/β_1 + α_0

Algorithm:

1. Guess α: α_0.
2. Fit y = β_0 + β_1 x^(α_0) + ε, obtaining β̂_1.
3. Fit y = β_0 + β_1 x^(α_0) + β_2 x^(α_0) log(x) + ε, obtaining β̂_2.
4. Update: α_1 = β̂_2/β̂_1 + α_0.

Repeat the above steps to get α_2, and so on. Convergence usually occurs in about three iterations, although there are instances where this procedure may not converge at all.

Example: Windmill generation of electricity. DC output is measured against wind velocity:
      v     DC
1   5.00  1.582
2   6.00  1.822
3   3.40  1.057
4   2.70  0.500
...
24  3.95  1.144
25  2.45  0.123

The scatterplot (windmill.pdf) indicates the need for a transformation. We saw earlier the usefulness of the reciprocal transformation of the velocity, i.e. the model

    y = β_0 + β_1 (1/v) + ε

Does the Box-Tidwell procedure agree? Starting from the initial guess α_0 = 1:

> boxtidwell.lm(DC ~ v, data=wind)
initial guess   alpha_1   alpha_2   alpha_3   alpha_4
        1.000    -0.98    -0.836    -0.833    -0.833

This suggests the model y = β_0 + β_1 v^(−0.833) + ε:

> wind.lm <- lm(DC ~ I(v^(-0.833)), data=wind)
> summary(wind.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.2608     0.0514    63.4   <2e-16
I(v^(-0.833))  -6.4677     0.1880   -34.4   <2e-16

Fitted model: ŷ = 3.26 − 6.47 v^(−0.833).

[Figure: windmill data, DC output vs wind velocity, with two transformed least-squares fits overlaid; red curve: reciprocal of v; black curve: v^(−0.833)]
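The four-step Box-Tidwell iteration is straightforward to implement directly. A Python sketch on simulated data with a known power (an illustration under assumed values; this is neither the wind data nor the course's boxtidwell.lm):

```python
import numpy as np

def lsfit(y, X):
    """Ordinary least-squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def box_tidwell(y, x, alpha0=1.0, iters=8):
    """Estimate alpha in y = b0 + b1 * x**alpha + eps by Box-Tidwell updates."""
    alpha = alpha0
    for _ in range(iters):
        xa = x**alpha
        one = np.ones_like(x)
        b = lsfit(y, np.column_stack([one, xa]))                   # step 2
        b2 = lsfit(y, np.column_stack([one, xa, xa * np.log(x)]))  # step 3
        alpha = b2[2] / b[1] + alpha                               # step 4
    return alpha

rng = np.random.default_rng(4)
x = rng.uniform(1.0, 10.0, 500)
y = 3.0 - 6.0 * x**(-0.8) + rng.normal(0.0, 0.05, x.size)  # true power -0.8
a_hat = box_tidwell(y, x)
print(a_hat)   # the iterates settle near the true power, about -0.8
```

As in the windmill example, the iteration starting from α_0 = 1 settles quickly; once β̂_2 is near zero the update leaves α essentially unchanged.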
[Figure: normal Q-Q plot of standardized residuals for wind.lm (sample), alongside three simulated Q-Q plots for reference]

These plots indicate that this model fits fairly well. Note that the textbook implementation of the Box-Tidwell procedure is incorrect.

Exercises on Box-Cox and Box-Tidwell: 5.4 (data are in p5.4; do you need to transform the response or the predictor? Check all diagnostics before and after transforming. Also, obtain a plot of the data with the overlaid curve.), 5.2 (data are in p5.2; for part (c), check the Box-Tidwell transformation: is it consistent with the theory?), 5.3 (p5.3), 5.5 (p5.5).

5.5.2 Weighted Least Squares

Consider the regression-through-the-origin model

    y_i = β_1 x_i + ε_i

with E[ε_i] = 0, and suppose V(y_i | x_i) = σ²/w_i, where w_i is a known weight; i.e. E[ε_i²] = σ²/w_i.

The least-squares estimate was previously found by minimizing Σ_{i=1}^n ε_i²:

    β̂_1 = Σ x_i y_i / Σ x_i²

Gauss-Markov theorem: When the variances are constant, β̂_1 has the smallest variance of any linear unbiased estimator of β_1. But β̂_1 is not the best linear unbiased estimator (BLUE) for β_1 when there are weights w_i.

To find the BLUE now, multiply the model by a_i:

    a_i y_i = β_1 a_i x_i + a_i ε_i,   or   y_i′ = β_1 x_i′ + ε_i′

Compute β̂_1′ for the new data (x_i′, y_i′):

    β̂_1′ = Σ x_i′ y_i′ / Σ (x_i′)²
β̂_1′ is unbiased, E[β̂_1′] = β_1, with variance

    V(β̂_1′) = σ² (Σ x_i² a_i⁴ / w_i) / (Σ a_i² x_i²)²

How do we choose a_1, a_2, …, a_n to make this as small as possible?

Recall the Cauchy-Schwarz inequality:

    (Σ_{i=1}^n u_i v_i)² ≤ (Σ_{j=1}^n u_j²)(Σ_{k=1}^n v_k²)

(equality holds if the u_i's are proportional to the v_i's: u_i = c v_i).

Apply this to the denominator of our variance, taking u_i = a_i² x_i / √w_i and v_i = √w_i x_i:

    (Σ a_i² x_i²)² ≤ (Σ a_i⁴ x_i² / w_i)(Σ w_i x_i²)

Equality holds if a_i⁴ x_i² / w_i is proportional to w_i x_i², i.e. if a_i = √w_i. Thus V(β̂_1′) is minimized by taking a_i = √w_i, giving

    V(β̂_1′) = σ² / Σ_{i=1}^n w_i x_i²

Note also that E[√w_i ε_i] = 0 and V(√w_i ε_i) = σ², and that instead of minimizing Σ_{i=1}^n ε_i² we are now minimizing Σ_{i=1}^n w_i ε_i².

Example: roller data.

Ordinary least squares:

roller.lm <- lm(depression ~ weight, data=roller)
plot(roller.lm, which=1)

The residual plot indicates that the variance might not be constant.

[Figure: residuals vs fitted values for lm(formula = depression ~ weight, data = roller)]
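The variance formula V(β̂_1′) = σ²/Σ w_i x_i² can be checked by simulation: with Var(y_i) = σ²/w_i, the weighted estimator Σ w_i x_i y_i / Σ w_i x_i² is more stable over repeated samples than the unweighted one. A Python sketch (synthetic data; the weights w_i = 1/x_i² are an assumed choice):

```python
import numpy as np

rng = np.random.default_rng(5)
beta1, sigma = 2.0, 1.0
x = np.linspace(1.0, 10.0, 50)
w = 1.0 / x**2                     # known weights: Var(y_i) = sigma^2 / w_i

ols, wls = [], []
for _ in range(2000):
    y = beta1 * x + rng.normal(0.0, sigma / np.sqrt(w))   # sd_i grows with x_i
    ols.append(np.sum(x * y) / np.sum(x**2))              # minimizes sum e_i^2
    wls.append(np.sum(w * x * y) / np.sum(w * x**2))      # minimizes sum w_i e_i^2

ols, wls = np.array(ols), np.array(wls)
print(ols.var(), wls.var())          # WLS shows the smaller sampling variance
print(sigma**2 / np.sum(w * x**2))   # theoretical V = sigma^2 / sum(w_i x_i^2)
```

Both estimators are unbiased; the point of the Cauchy-Schwarz argument is that the weighted one attains the smallest variance among them.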
Weighted least squares:

roller.wlm <- lm(depression ~ weight, data=roller, weights=1/weight^2)
plot(roller.wlm, which=1)

[Figure: residuals vs fitted values for lm(formula = depression ~ weight, data = roller, weights = 1/weight^2)]

The residual plot now shows a more random pattern.

[Figure: roller data, depression vs weight, comparing the OLS and WLS fitted lines]

5.5.1 Generalized Least Squares

Model:

    y = Xβ + ɛ,   E[ɛ] = 0,   E[ɛɛᵀ] = Σ = σ² V

Σ must be symmetric and positive definite. This implies, among other things, that Σ possesses an inverse.

Weighted least squares is the special case where Σ is a diagonal matrix whose i-th diagonal element is σ²/w_i.

Write V = K² for some symmetric nonsingular K, and consider the transformed model

    K⁻¹ y = K⁻¹ Xβ + K⁻¹ ɛ
Note that

    Var(K⁻¹ ɛ) = E[K⁻¹ ɛɛᵀ K⁻¹] = K⁻¹ σ² V K⁻¹ = σ² I

By multiplying through by K⁻¹ we now have a constant variance, so β can be estimated by least squares:

    β̂ = (Xᵀ K⁻² X)⁻¹ Xᵀ K⁻² y = (Xᵀ V⁻¹ X)⁻¹ Xᵀ V⁻¹ y

β̂ is the generalized least-squares estimator for β. It is unbiased, E[β̂] = β, with variance

    Var(β̂) = (Xᵀ V⁻¹ X)⁻¹ Xᵀ V⁻¹ Σ V⁻¹ X (Xᵀ V⁻¹ X)⁻¹ = σ² (Xᵀ V⁻¹ X)⁻¹
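The equivalence above can be checked numerically: OLS on the transformed data reproduces the closed-form (XᵀV⁻¹X)⁻¹XᵀV⁻¹y. A Python sketch (V is an assumed AR(1)-type correlation matrix; a Cholesky factor K with KKᵀ = V is used in place of the symmetric square root, which yields the same estimator):

```python
import numpy as np

rng = np.random.default_rng(6)
n, rho = 40, 0.6

X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
idx = np.arange(n)
V = rho ** np.abs(np.subtract.outer(idx, idx))   # AR(1)-type covariance

beta = np.array([1.0, 0.5])
K = np.linalg.cholesky(V)                        # K @ K.T = V
y = X @ beta + K @ rng.normal(0.0, 1.0, n)       # errors with Var = V (sigma^2 = 1)

# Closed form: beta_hat = (X^T V^-1 X)^-1 X^T V^-1 y
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Same estimate via OLS on the transformed model K^-1 y = K^-1 X beta + K^-1 eps
Kinv = np.linalg.inv(K)
beta_tr, *_ = np.linalg.lstsq(Kinv @ X, Kinv @ y, rcond=None)

print(beta_gls, beta_tr)   # the two estimates agree up to rounding
```

The Cholesky factor works here because K⁻¹ V K⁻ᵀ = I and K⁻ᵀK⁻¹ = V⁻¹, so the transformed errors have constant variance and the OLS solution collapses to the GLS formula.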