Approximate Median Regression via the Box-Cox Transformation

Approximate Median Regression via the Box-Cox Transformation Garrett M. Fitzmaurice,StuartR.Lipsitz, and Michael Parzen Median regression is used increasingly in many different areas of applications. The usual median regression estimating equations are derived from minimizing the least absolute deviations (LAD). Because they are not a smooth function of the regression parameters, a solution is best obtained using a linear programming algorithm. As an alternative, we propose estimating the median regression parameters via Gaussian estimating equations after applying a Box-Cox transformation to both the outcome and the linear predictor. The proposed estimator is notably more efficient than the standard LAD estimator, albeit with an acknowledged loss of robustness. KEY WORDS: Least absolute deviations; Maximum likelihood; Median regression. 1. INTRODUCTION Median regression, in which the median of the outcome is assumed to be linear in the regression parameters, is used increasingly for analyzing non-normal continuous outcome data. For example, median regression has been applied to positively skewed data, such as length of hospital stay data (Lee, Fung, and Fu 2003) or survival data (Koenker and Geling 2001). Median regression models are appealing because they are robust to outliers in the outcome; their regression parameters also have simple interpretations. The median regression parameters can be consistently estimated by minimizing the least absolute deviations (LAD) via a linear programming algorithm (Basset and Koenker 1982). In this article, we propose estimating the median regression parameters via Gaussian estimating equations after applying a Box-Cox transformation (Box and Cox 1964) to both the outcome and linear predictor. We relate the mean of the Box-Cox transformed outcome to the same transformation of the linear predictor; the transformation parameter is assumed to be unknown. When the distribution of the transformed outcome is symmetric, the median of the untransformed outcome equals the linear predictor. The median regression parameters, and the unknown transformation parameter, are estimated by maximizing the Gaussian likelihood function. Certainly, not all data can be transformed to a normal distribution. However, Draper and Cox (1969) reported that even Garrett M. Fitzmaurice is Associate Professor, Department of Biostatistics, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115 (E-mail: fitzmaur@hsph.harvard.edu). Stuart R. Lipsitz is Associate Professor, Harvard Medical School, Boston, MA. Michael Parzen is Associate Professor, Emory University, Atlanta, GA. The authors are grateful for the support provided by grants AI 60373 and GM 29745 from the U.S. National Institutes of Health. in cases where no Box-Cox transformation can transform the outcome exactly to a normal random variable, the transformed variable satisfies certain restrictions on the first four moments, and its distribution will usually be close to symmetric. When its distribution is almost symmetric, we would expect little bias for estimating the median regression. We explore the robustness of the proposed method in more detail via simulations in Section 3. 2. MEDIAN REGRESSION VIA BOX-COX TRANSFORMATION Let Y i be a continuous response variable for the ith independent observation and x i be a J 1 covariate vector, i = 1,...,n. We assume the median of Y i (conditional on x i )ism i = x i β or, equivalently, Pr(Y i >x i β) = 0.5. Next, consider the Box-Cox transformation of Y i, denoted by U i = u(y i,λ,c),where (Y i +c) λ 1 λ if λ = 0, U i = u(y i,λ,c)= (1) log(y i + c) if λ = 0; c is a fixed constant to ensure Y i + c > 0; and λ is an unknown parameter to be estimated. Note that because u(y i,λ,c) is a monotone transformation of Y i and Pr(Y i >x i β) = 0.5, then Pr{u(Y i,λ,c) > u(x i β,λ,c)} =0.5. Furthermore, if the distribution of U i is symmetric, then µ i = E(U i x i ) = u(x i β,λ,c) (x i β + c)λ 1 if λ = 0, = λ (2) log(x i β + c) if λ = 0. In this article we assume that the distribution of Y i after the optimal Box-Cox transformation is almost symmetric, and thus the mean µ i is approximately equal to the median of U i, that is, Pr(U i >µ i ) Pr(Y i >x i β) = 0.5. The accuracy of this approximation depends on the symmetry of the distribution of U i. Thus, the proposed estimator requires that the distribution of U i is approximately symmetrical rather than normal. Taylor (1985) showed that the Box-Cox transformation is generally the most suitable method for transforming to symmetry. Although the focus is on the Box-Cox transformation, we note that the proposed method can be applied to any monotonic transformation of Y i (and the linear predictor). Finally, assuming the variance of U i is σ 2, and that U i has an approximate normal distribution, U i N(µ i,σ 2 ), the loglikelihood contribution from subject i is l i (β, σ 2,λ)= 1 2 log(2πσ2 ) 1 2σ 2 (U i µ i ) 2 + (λ 1) log(y i + c). (3) c 2007 American Statistical Association DOI: 10.1198/000313007X220534 The American Statistician, August 2007, Vol. 61, No. 3 233

Table 1. Results of simulation study for estimated medians at each value of x with Y i exponential. n = 80 n = 160 n = 320 x x x Approach 1 1.5 2 2.5 1 1.5 2 2.5 1 1.5 2 2.5 Relative Box-Cox 0.74 0.59 0.45 0.34 1.06 1.33 1.56 1.77 1.43 1.48 1.52 1.56 Bias (%) LAD 3.73 2.68 1.74 0.92 1.15 0.88 0.65 0.44 0.28 0.44 0.59 0.72 MLE 0.17 0.23 0.58 0.90 0.12 0.01 0.12 0.22 0.07 0.01 0.08 0.14 Simulation Box-Cox 2.68 1.29 1.56 3.48 1.32 0.67 0.81 1.74 0.68 0.33 0.39 0.86 Variance LAD 4.69 1.93 2.31 5.82 2.22 0.99 1.23 2.96 1.07 0.49 0.59 1.39 MLE 2.13 0.95 1.17 2.77 1.05 0.47 0.58 1.40 0.52 0.24 0.29 0.68 Coverage Box-Cox 96.0 96.4 95.9 95.8 95.3 94.5 95.2 95.3 95.4 94.4 94.7 95.2 Probability a LAD 95.2 96.5 96.2 95.0 94.0 95.0 94.4 94.4 95.0 95.6 95.0 95.6 MLE 93.2 93.4 93.9 93.3 94.3 95.1 94.6 94.1 94.4 94.5 94.6 94.6 a Coverage Probability of 95% Confidence Intervals Note, this contribution is the log of the normal density of U i plus the log of the Jacobian of the transformation (derivative of U i with respect to y i ). In cases where the distribution of U i is approximately symmetrical rather than normal, this method yields a quasi-likelihood estimator of β and the variance of β can be estimated using the so-called sandwich variance estimator (e.g., White 1982). The Gaussian assumption is used only to derive an estimator of β; the proposed method is based on, but does not require, that U i has a Gaussian distribution. The validity of the proposed estimator requires the weaker assumption that the distribution of U i is approximately symmetrical rather than normal. 3. MONTE CARLO SIMULATION STUDY To explore the properties of the proposed estimator, and its robustness to varying degrees of asymmetry in the distribution of U i, we performed a simulation study with four different specifications of the distribution of Y i given x i : log-normal, exponential, gamma, and Pareto. Note, when Y i is specified as lognormal, this is a special case of the Box-Cox transformation with λ = 0 and the proposed method is exact. For the remaining distributions, the Box-Cox transformation does not yield an exact normal random variable. In all specifications except for the gamma distribution, we let the median of Y i given x i be M i = β 0 +β 1 x i = 6.5+1.0 x i ; for the gamma distribution, we let M i = 6.0+1.0 x i. We considered total sample sizes n = 80, 160, and 320; we let x i take on values 1.0, 1.5, 2.0, and 2.5, with each value represented by 25% of the sample. We performed 2,500 simulation replications for each of the four possible distributions of Y i. We estimated the median regression parameters via the proposed Box-Cox transformation, in addition to minimizing the LAD via the linear programming algorithm of Basset and Koenker (1982). For the log-normal, exponential, and Pareto distributions of Y i, we also estimated (β 0,β 1 ) via maximum likelihood under the true distribution for Y i. Finally, for the case where Y i has a Pareto distribution, we also estimated (β 0,β 1 ) via the Box-Cox transformation applied to log(y i ) (and to the log of the linear predictor); as discussed earlier, the proposed method can be applied to any monotonic transformation of Y i. The Box-Cox Gaussian likelihood was maximized using the SAS procedure NLMIXED (SAS Institute Inc. 2000); sample SAS code for implementing the proposed method is available upon request from the authors. The results of the simulations when Y i is log-normal (not shown, but available upon request from the authors), with log(y i ) N(log(6.5 + x i ), 1), indicate that the estimates of the medians obtained via the optimal Box-Cox transformation are unbiased for n = 80, 160, and 320. This result is to be expected because the log-normal distribution is a special case where the Box-Cox transformation can be accurately applied. The relative bias of the Box-Cox and LAD estimates of the medians (when averaged over the distribution of the covariates and the three sample sizes) is approximately 1%, while for the MLE it is less than 0.25%. Of note, the Box-Cox estimates are almost as efficient as the MLEs (which can be considered a Box-Cox estimator with λ = 0 assumed known) and are notably more efficient than the standard LAD estimates. Specifically, the simulation variances of the LAD estimates are 50 75% larger than those for the estimates from the Box-Cox transformation. Finally, all three methods show good coverage probabilities for 95% confidence intervals. To explore robustness of the proposed method to varying degrees of asymmetry in the distribution of U i,welety i have an exponential distribution, with p(y i x i,β) = θ i e θ iy i, where θ i ={log(2)}/(β 0 + β 1 x i ); note, no transformation of Y i will yield a normal random variable. However, the average skewness of the residuals from the regression of the Box-Cox transformed Y i over all simulation replications was small (skewness = 0.05), indicating that the optimal Box-Cox transformation achieved a high degree of symmetry. The simulation results reported in Table 1 indicate that all three estimators show very little bias for n = 80, 160, and 320. The simulation variances of the LAD estimates are approximately 80% larger than those 234 Statistical Practice

Table 2. Results of simulation study for estimated medians at each value of x with Y i Gamma. n = 80 n = 160 n = 320 x x x Approach 1 1.5 2 2.5 1 1.5 2 2.5 1 1.5 2 2.5 Relative Box-Cox 2.24 3.38 4.38 5.27 5.02 5.24 5.43 5.60 5.47 5.48 5.49 5.50 Bias (%) LAD 7.54 5.57 3.86 2.34 3.28 2.83 2.45 2.11 0.43 0.87 1.25 1.58 Simulation Box-Cox 5.72 2.94 3.48 7.34 2.79 1.36 1.63 3.59 1.36 0.70 0.84 1.79 Variance LAD 10.37 4.57 5.60 13.47 5.07 2.29 2.83 6.69 2.34 1.05 1.35 3.25 Coverage Box-Cox 96.0 95.5 95.4 96.1 95.6 94.5 94.1 95.1 94.3 92.9 92.6 94.1 Probability a LAD 95.1 96.8 96.8 95.7 94.6 95.4 96.1 94.6 95.6 96.2 95.4 95.1 a Coverage Probability of 95% Confidence Intervals for the estimates from the Box-Cox transformation. Next, we increased the degree of asymmetry in the distribution of U i by letting Y i have a gamma distribution with shape parameter = 0.5. This is an example of a skewed, heavy-tailed distribution where even the Box-Cox transformed Y i is likely to show skewness. The average skewness of the residuals from the regression of U i over all simulation replications was 0.10. The results indicate that the proposed estimator yields biased estimates of β 0 and β 1. As expected, skewness in the residuals are likely to have somewhat greater impact on the intercept than on the slope. This bias results in biased estimates of the medians (see Table 2), with relative bias between 2 6%. Note, however, that the absolute bias is almost negligible when compared to the variance, even when n = 320. The LAD estimator also has relative bias in the range 1 8% for n = 80 and n = 160. On average, the relative bias of the Box-Cox and LAD median estimates is 4.9% and 2.8%, respectively. Finally, we considered the extreme case where Y i has a Pareto distribution with scale parameter equal to 1. The Pareto is an example of an extremely skewed, heavy-tailed distribution where U i is likely to show very discernible skewness. The average skewness of the residuals from the regression of the transformed Y i was 0.29, indicating that the transformation did not achieve a high degree of symmetry. As a result, the proposed estimator yields badly biased estimates of β 0, and consequently, of the medians (not shown). As would be expected, the coverage probabilities are poor in this setting. In contrast, the LAD estimator of (β 0,β 1 ) is far more robust and almost unbiased when the sample sizes are large (e.g., n = 320). Finally, when the Box- Cox transformation is applied to log(y i ) (and to log(x i β)),the bias of the proposed estimator is substantially reduced and the coverage probabilities meet the target coverage; in addition, the simulation variances of the estimates from the Box-Cox transformation of log(y i ) are approximately half the size of those for the LAD estimates. This improvement can be explained by the latter transformation of Y i achieving a far greater degree of symmetry (the average skewness was 0.05 instead of 0.29). This highlights an important point: when the distribution of Y i after a Box-Cox transformation is asymmetrical, there may be alternative monotonic transformations of Y i that can achieve a higher degree of symmetry. In summary, the results of the simulation study suggest that the proposed estimator is relatively robust to modest degrees of asymmetry in the distribution of Y i after a Box-Cox transformation. In these cases the relative bias is less than 5% and of comparable magnitude to the finite sample bias of the LAD estimator. In addition, the proposed method provides discernibly more efficient estimates than the standard LAD estimator. Of note, when the degree of asymmetry is modest, the bias is almost negligible compared to the variance. However, when there is strong asymmetry in the distribution of U i, the proposed estimator can yield badly biased estimates. 4. APPLICATION: DEVELOPMENTAL TOXICITY STUDY OF DYME Developmental toxicity studies of laboratory animals play a crucial role in the testing and regulation of chemicals and pharmaceutical compounds. Exposure to developmental toxicants during pregnancy typically causes adverse effects, such as fetal malformations and reduced fetal weight. In a typical experiment, laboratory animals are assigned to increasing doses of a test substance. In this section we describe an analysis of fetal weight data from a study of diethylene glycol dimethyl ether (DYME), a widely used organic solvent. In this study of laboratory mice, conducted through the National Toxicology Program, DYME was administered at five doses (0, 62.5, 125, 250, and 500 mg/kg/day) to 111 pregnant mice (dams) just after implantation. Following sacrifice, fetal weight was recorded for each live fetus with the goal of determining the effect of dose on weight. We considered a linear model in dose ( 10 3 ) for the median fetal weight, estimated via the Box-Cox transformation and standard LAD estimator. Results of these analyses produced very similar estimates of the intercept and slope for both methods. The two estimates of the slope for dose are 0.90 (Box-Cox) and 0.88 (LAD), indicating that the decrease in median fetal weight, comparing the highest dose group to the control group, is approximately 0.45 gm (or 0.5 0.9). The two fitted median lines, presented in Figure 1, are virtually indistinguishable and fit the The American Statistician, August 2007, Vol. 61, No. 3 235

Figure 1. Scatterplot of fetal weight (gm) versus dose (mg/kg/day), with the empirical medians (solid circles) and fitted medians for fetal weight for the estimators based on the Box-Cox transformation (solid line) and LAD (dashed line). empirical median at each dose well. However, the standard error for the slope estimated via the Box-Cox transformation (SE = 0.039) is discernibly smaller than the corresponding standard error for the LAD estimator (SE = 0.045); the ratio of the two estimates of variance is 1.33 and highlights the potential gains in efficiency. Finally, to highlight settings where the proposed estimator can and cannot be validly applied, we generated two simulated datasets from the exponential and Pareto distributions, respectively. Specifically, we generated 1,000 observations from exponential and Pareto (scale parameter = 1) distributions with median, M i = 6.5 + 1.0X i, where X i has a uniform distribution over the interval [0,4]. For the data from the exponential distribution, both methods produced similar estimates of the intercept and slope; the corresponding fitted lines (dashed lines) are presented in Figure 2(a) and are completely indistinguishable from the true median regression line (solid line). Note, the Y -axis in Figure 2(a) is on the log scale. In contrast, for the date from the Pareto distribution, the Box-Cox and LAD methods produced discernibly different estimates of the intercept (6.43 and 5.21, respectively) and slope (1.46 and 1.03, respectively). Figure 2(b) presents the fitted lines for the Box Cox (long-dashed line) and standard LAD (short-dashed line) estimators which are now clearly distinguishable. Although the Box-Cox estimator of the intercept is almost unbiased, the estimator of the slope has a relative bias of almost 50%. In this setting, the Box-Cox transformation is unable to symmetrize the data; moreover, even after transformation, there are numerous extreme observations that have an inordinate influence on the estimated slope. For example, although the median ranges from 6.5 10.5, over 10% of the observations (on the original scale) were larger than 1,000 (these extreme observation have been omitted from the scatterplot in Figure 2(b)), with 5% greater than 8,000, and the largest equal to 9.31 10 11. The estimator based on the Box-Cox transformation is not robust to such extreme observations. 5. CONCLUDING REMARKS In this article we propose estimating the median regression parameters via Gaussian estimating equations after applying the optimal Box-Cox transformation to both the outcome and the linear predictor. The proposed estimator is notably more efficient than the standard LAD estimator, albeit with an acknowledged loss of robustness. Although the proposed estimator is relatively robust to modest degrees of asymmetry in the distribution of the transformed response, U i, we caution that it can yield badly biased estimates when there is strong asymmetry in the distribution of U i. Fortunately, the degree of skewness can be assessed easily from the data at hand. In cases where there is discernible skewness, the use of a more general class of transformations, for example, folded-power transformations (Mosteller and Tukey 1977) or the Zellner-Revankar transformation (Zellner and Revankar 1969), may yield a higher degree of symmetry. An appealing property of the standard LAD estimator is that it is robust to outliers. Although the Box-Cox transformation is likely to reduce the number of and influence of outliers on the transformed scale, because the proposed method relies on regressing the mean of the transformed response, it will likely be sensitive to extreme or outlying observations. When outliers are apparent on the transformed scale, the proposed approach 236 Statistical Practice

Figure 2(a). Scatterplot of 1,000 observations from an exponential distribution, with comparison of the true median regression line (solid line) and the fitted median lines for the estimators based on the Box-Cox transformation (long-dashed line) and LAD (short-dashed line). Figure 2(b). Scatterplot of 1,000 observations from a Pareto distribution, with comparison of the true median regression line (solid line) and the fitted median lines for the estimators based on the Box-Cox transformation (long-dashed line) and LAD (short-dashed line). The American Statistician, August 2007, Vol. 61, No. 3 237

can be made more robust by simply downweighting or trimming the most extreme residuals. Any potential loss of information from trimming the most extreme residuals should be more than compensated by the increased precision over LAD estimation. [Received November 2005. Revised April 2007.] REFERENCES Bassett, G., and Koenker, R. (1982), An Empirical Quantile Function for Linear Models With iid Errors, Journal of the American Statistical Association, 77, 407 415. Box, G.E.P., and Cox, D. R. (1964), An Analysis of Transformations, Journal of the Royal Statistical Society, Series B, 26, 211 243. Draper, N.R., and Cox, D.R. (1969), On Distributions and Their Transformation to Normality, Journal of the Royal Statistical Society, Series B, 31, 472 476. Koenker, R., and Geling, O. (2001), Reappraising Medfly Longevity: A Quantile Regression Survival Analysis, Journal of the American Statistical Association, 96, 458 468. Lee, A.H., Fung, W.K., and Fu, B. (2003), Analyzing Hospital Length of Stay: Mean or Median Regression? Medical Care, 41, 681 686. Mosteller, F., and Tukey, J.W. (1977), Data Analysis and Linear Regression, Reading, MA: Addison-Wesley. SAS Institute Inc. (2000), SAS/STAT User s Guide (Version 8 ed.), Cary, NC: SAS Institute. Taylor, J.M.G. (1985), Power Transformations to Symmetry, Biometrika, 72, 145 152. Wedderburn, R.W.M. (1974), Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method, Biometrika, 61, 439 447. White, H. (1982), Maximum Likelihood Estimation Under Misspecified Models, Econometrica, 50, 1 26. Zellner, A., and Revankar, N. (1969), Generalized Production Functions, Review of Economic Studies, 36, 241 250. 238 Statistical Practice