Versatile Regression: simple regression with a non-normal error distribution

Benjamin Dean*, Robert A. R. King

University of Newcastle, School of Mathematical and Physical Sciences, Callaghan 2308, NSW Australia

* Corresponding author. Tel.: +61 2 4921 6384
Email addresses: Benjamin.Dean@newcastle.edu.au (Benjamin Dean), Robert.King@newcastle.edu.au (Robert A. R. King)

Third Annual ASEARC Conference

Abstract

We present a simple regression technique, called Versatile Regression, where the error distribution is described by the Generalized Lambda Distribution. The flexibility of this distribution allows the error distribution to be heavy-tailed, skewed or approximately normal. Versatile Regression was found to perform well on heavy-tailed and skewed data. Versatile Regression also provided a reasonable approximation to Normal-Error Regression. Simulation studies found that Versatile Regression produced accurate parameter estimates.

Key words: Simple Regression, Generalized Lambda Distribution, Non-Normal Error Distribution

1. Introduction

Versatile Regression (VR) is a regression technique with a single predictor variable. The error distribution is non-normal, homoscedastic and identically distributed for all levels of the predictor variable. The error distribution is described using the Generalized Lambda Distribution (GLD), a quantile-defined distribution with flexible shape.

1.1. The Generalized Lambda Distribution

The GLD can represent a vast range of shapes. Freimer et al. (1988) nicely articulated this by saying "[the GLD] is very rich in the variety of density and tail shapes. It contains unimodal, U-shaped, J-shaped and monotone probability density functions. These can be symmetric or asymmetric and their tails can be smooth, abrupt or truncated, and long, medium or short." Furthermore, the GLD contains the logistic, exponential and uniform distributions as limiting cases, and provides good approximations to other distributions including the normal and gamma. A comprehensive discussion of GLD shapes is given in Karian and Dudewicz (2000).

The GLD is defined in terms of its quantile function $F^{-1}(u)$. This article makes use of Freimer et al.'s (1988) parameterisation of the GLD. We use slightly different notation than Freimer et al. (1988) and write the quantile function as

\[ F^{-1}(u) = \lambda_1 + \frac{1}{\lambda_2}\left[ \frac{u^{\lambda_3} - 1}{\lambda_3} - \frac{(1-u)^{\lambda_4} - 1}{\lambda_4} \right] \tag{1} \]

where $u \in [0, 1]$ and $\lambda_2 > 0$. The probability density function of the GLD is not available in closed form. However, it is available as a function of $u$:

\[ f(x) = \frac{\lambda_2}{u^{\lambda_3 - 1} + (1-u)^{\lambda_4 - 1}} \quad \text{at } x = F^{-1}(u). \tag{2} \]

$\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are parameters that can attain any real value (except for $\lambda_2$, which must be positive), and $u$ is a variable that assumes values between zero and one. $\lambda_1$ is the location parameter, $\lambda_2$ is the (inverse) scale parameter, and $\lambda_3$, $\lambda_4$ are the shape parameters ($\lambda_3$ controls the left tail shape and $\lambda_4$ controls the right tail shape). A GLD with parameters $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ is denoted by GLD($\lambda_1, \lambda_2, \lambda_3, \lambda_4$).

The support of the GLD is given in Table 1. The GLD has infinite support when both $\lambda_3$ and $\lambda_4$ are non-positive, and half-infinite support when exactly one of $\lambda_3$ and $\lambda_4$ is non-positive.

$\lambda_3$    $\lambda_4$    Support
$\le 0$        $\le 0$        $(-\infty, \infty)$
$> 0$          $\le 0$        $[\lambda_1 - 1/(\lambda_2 \lambda_3), \infty)$
$\le 0$        $> 0$          $(-\infty, \lambda_1 + 1/(\lambda_2 \lambda_4)]$
$> 0$          $> 0$          $[\lambda_1 - 1/(\lambda_2 \lambda_3), \lambda_1 + 1/(\lambda_2 \lambda_4)]$

Table 1: Support of the GLD.

1.2. Alternative regression methods

1.2.1. Transformations

The regression model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$, is extremely popular because of its simplicity and practicality ($y_i$ denotes the value of the response variable, $x_i$ denotes the value of the predictor variable, $\beta_0$ and $\beta_1$ are the regression coefficients, and $\epsilon_i$ is a random error term). Following the terminology used by Neter et al. (1996, pp. 29-30), we refer to this as the model for Normal-Error Regression (NER). When NER's assumptions of linearity, normality and homoscedasticity are not satisfied, it is common practice to transform the response and/or predictor variables. However, transformations introduce two main problems.
- It can be difficult to justify why a linear relationship should exist between the transformed variables.
- Interpretation, relative to the original variables of interest, is often complicated.

December 7-8, 2009, Newcastle, Australia
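As a concrete illustration of the GLD formulas in Section 1.1, the following Python sketch evaluates the quantile function, the density as a function of u, and the support endpoints under the Freimer et al. (1988) parameterisation. The function names (`gld_quantile`, `gld_density_at_u`, `gld_support`) are ours and purely illustrative; this is not the authors' implementation, and the shape parameters passed to the first function are assumed to be non-zero so that the limiting cases need no special handling.

```python
import math


def gld_quantile(u, lam1, lam2, lam3, lam4):
    """Quantile function F^{-1}(u) of the GLD (Freimer et al. form).

    Assumes 0 < u < 1, lam2 > 0 and non-zero lam3, lam4.
    """
    left = (u ** lam3 - 1.0) / lam3
    right = ((1.0 - u) ** lam4 - 1.0) / lam4
    return lam1 + (left - right) / lam2


def gld_density_at_u(u, lam2, lam3, lam4):
    """Density f(x) evaluated at x = F^{-1}(u); it does not depend on lam1."""
    return lam2 / (u ** (lam3 - 1.0) + (1.0 - u) ** (lam4 - 1.0))


def gld_support(lam1, lam2, lam3, lam4):
    """Support endpoints as in Table 1 (infinite when the shape is <= 0)."""
    lower = lam1 - 1.0 / (lam2 * lam3) if lam3 > 0 else -math.inf
    upper = lam1 + 1.0 / (lam2 * lam4) if lam4 > 0 else math.inf
    return lower, upper


if __name__ == "__main__":
    # The GLD that the paper later uses as an approximation to N(0, 1).
    params = (0.0, 1.464, 0.1349, 0.1349)
    print(gld_quantile(0.75, *params))          # upper quartile
    print(gld_density_at_u(0.5, *params[1:]))   # density at the median
    print(gld_support(*params))                 # finite: both shapes are positive
```

Because the density is only available through u, evaluating it at a given x requires solving $F^{-1}(u) = x$ numerically, which is why likelihood-based fitting of the GLD is more involved than for distributions with a closed-form density.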

1.2.2. Quantile Regression

Quantile Regression (Koenker and Bassett, 1978) fits a linear regression function by minimizing a weighted sum of absolute residuals, where positive and negative residuals receive a weight of $p$ and $1 - p$, respectively ($0 < p < 1$). Quantile Regression makes no assumptions regarding the form of the error distribution. This makes the technique simple to perform, quick to compute, and widely applicable. However, Quantile Regression has two main disadvantages.

- The regressed p-quantile functions can cross over.
- Quantile Regression is a non-parametric technique and it may not produce the most accurate predictions and prediction intervals. Parametric methods that successfully estimate the error distribution are likely to obtain better results.

1.2.3. Other non-normal regression methods

There have been numerous attempts to develop regression techniques where the error distribution is non-normal. Most of the literature has appeared within the past 30 years due to advances in computing power.

A study by Zeckhauser and Thompson (1970) modelled the errors using a power distribution with density $f(z; \mu, \sigma, \theta) = k(\sigma, \theta) \exp\left(-\sigma^{-\theta} |z - \mu|^{\theta}\right)$, where $k(\sigma, \theta)$ was a normalizing factor and $\mu$, $\sigma$ and $\theta$ were the location, scale and kurtosis parameters, respectively. The distribution offered promise because it assumed the form of the normal, double exponential and uniform distributions for certain values of $\theta$. Zeckhauser and Thompson (1970) were criticized because the error distribution had a cusp at the origin when $\theta < 1$, rendering it unrealistic (Mandelbrot, 1971).

Relevant work in more recent years includes Geweke (1993) and Fernandez and Steel (1998). Geweke (1993) employed Bayesian methods to construct a linear model where the errors followed a symmetric t-distribution. Fernandez and Steel (1998) used Bayesian MCMC methods to develop a linear model where the errors followed a skewed t-distribution.
King, Gerlach and Wraith (2) proposed a regression technique, called Starship Regression, where the error distribution followed the GLD. Parameter estimation was performed using Owen's (1988) Starship method. The estimation algorithm used a grid-based search to obtain starting values for a minimization routine. The grid-based search made the technique slow, especially when dealing with large sample sizes. Parameter estimation was also complicated by the presence of a penalty term in the objective function.

Dean, King and Howley (2009) developed a regression technique called Stretched Regression. The regression model was formulated in terms of a response distribution, rather than an error distribution centered on a regression function. The response distribution dispersed or tapered in shape as the predictor variable increased, whilst the left tail minimum remained fixed in position. The response distribution was described by the GLD.

2. Method

In VR, the regression model is $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim$ GLD($\lambda_1, \lambda_2, \lambda_3, \lambda_4$). The median of the error distribution was set to zero (since the GLD can generate a wide range of shapes, including severely skewed distributions, it was more appropriate to work with the median than the mean). This required the following constraint.

\[ \lambda_1 = -\frac{1}{\lambda_2}\left[ \frac{0.5^{\lambda_3} - 1}{\lambda_3} - \frac{0.5^{\lambda_4} - 1}{\lambda_4} \right] \tag{3} \]

Parameter estimation was performed in two steps. Firstly, the residuals were computed using $e_i = y_i - (\beta_0 + \beta_1 x_i)$. Secondly, the log likelihood function of the error distribution parameters was maximized. The log likelihood function is

\[ \log L(\lambda_1, \lambda_2, \lambda_3, \lambda_4) = \log f(\mathbf{e}; \lambda_1, \lambda_2, \lambda_3, \lambda_4) = \log \prod_{i=1}^{n} f(e_i; \lambda_1, \lambda_2, \lambda_3, \lambda_4) = \sum_{i=1}^{n} \log\left[ f(e_i; \lambda_1, \lambda_2, \lambda_3, \lambda_4) \right]. \tag{4} \]

The values of $\lambda_2, \lambda_3, \lambda_4, \beta_0, \beta_1$ that maximized (4) were chosen as the final parameter estimates (since $\lambda_1$ was defined as a function of $\lambda_2, \lambda_3, \lambda_4$, the independent parameters reduced to $\lambda_2, \lambda_3, \lambda_4, \beta_0, \beta_1$).
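To see where the median constraint comes from, it may help to spell out the step explicitly. The short derivation below is ours and uses only the quantile function and density from Section 1.1: the median of the error distribution is $F^{-1}(0.5)$, and setting it to zero fixes $\lambda_1$. The second display rewrites the log likelihood in terms of the value $u_i$ associated with each residual, which is the form one would actually evaluate.

```latex
\begin{align*}
F^{-1}(0.5) &= \lambda_1 + \frac{1}{\lambda_2}\left[\frac{0.5^{\lambda_3}-1}{\lambda_3}
              - \frac{0.5^{\lambda_4}-1}{\lambda_4}\right] = 0 \\
\Longrightarrow\quad
\lambda_1 &= -\frac{1}{\lambda_2}\left[\frac{0.5^{\lambda_3}-1}{\lambda_3}
              - \frac{0.5^{\lambda_4}-1}{\lambda_4}\right], \\[6pt]
\log L &= \sum_{i=1}^{n}\left[\log\lambda_2
          - \log\!\left(u_i^{\lambda_3-1} + (1-u_i)^{\lambda_4-1}\right)\right],
\qquad \text{where } F^{-1}(u_i) = e_i .
\end{align*}
```

Two observations follow. When $\lambda_3 = \lambda_4$ the bracket vanishes, so a symmetric error distribution has $\lambda_1 = 0$. And because the density is only available as a function of $u$, each $u_i$ has to be found by numerically inverting the quantile function at the residual $e_i$.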
The optimization was performed using the BFGS algorithm (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970). The optimization always used a starting value of $-0.4$ for $\lambda_3$ and $\lambda_4$ (heavy tails ensured decent coverage of the data). The $\beta_0$ and $\beta_1$ starting points were set equal to the regression coefficients produced by NER. The starting point for $\lambda_2$ was set to

\[ \left[ \frac{0.75^{\lambda_3} - 0.25^{\lambda_3}}{\lambda_3} + \frac{0.75^{\lambda_4} - 0.25^{\lambda_4}}{\lambda_4} \right] \Big/ \mathrm{IQR}(e_{NER}), \]

where $\lambda_3, \lambda_4 = -0.4$. This is obtained by approximating the NER residuals by GLD($\lambda_1, \lambda_2, \lambda_3, \lambda_4$), and rearranging the expression for the interquartile range (given by $F^{-1}(0.75; \lambda_1, \lambda_2, \lambda_3, \lambda_4) - F^{-1}(0.25; \lambda_1, \lambda_2, \lambda_3, \lambda_4)$).

3. Results

3.1. Approximation of Normal-Error Regression

Since the GLD can approximate the normal distribution, VR can provide an approximation of NER. The quality of the approximation was assessed by a simulation study. The simulation study involved randomly generating 1, datasets with $x_i \sim$ Uniform(0, 20) and $y_i \sim \beta_0 + \beta_1 x_i + N(0, 1)$, for sample sizes of n = 50, 100, 200, 500, 1000. $\beta_0$ and $\beta_1$ were arbitrarily set to $\beta_0 = 2$ and $\beta_1 = 1$. VR and NER were applied to the datasets. This produced parameter estimates $(\hat\lambda_2, \hat\lambda_3, \hat\lambda_4, \hat\beta_0, \hat\beta_1)$ for VR, and parameter estimates $(\hat\beta_0, \hat\beta_1, \hat\sigma)$ for NER. The performance of VR and NER was compared using the Mean Square Error (MSE) of the regression coefficients. The results are shown in Figure 1.

3.2. Accuracy of parameter estimates

A simulation study assessed the accuracy of VR's parameter estimates. For a given parameter set $(\lambda_2, \lambda_3, \lambda_4, \beta_0, \beta_1)$, 1, datasets were randomly generated with $x_i \sim$ Uniform(0, 20) and $y_i \sim \beta_0 + \beta_1 x_i +$ GLD($\lambda_1, \lambda_2, \lambda_3, \lambda_4$) (where $\lambda_1$ was given by (3)).
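The data-generating step of these simulations can be sketched with inverse-transform sampling: draw u uniformly on (0, 1), push it through the GLD quantile function, and add the result to the regression line. The sketch below is ours, not the authors' code. The function names, the seed handling and the default predictor range `x_max` are assumptions made for illustration, and $\lambda_1$ is set from the zero-median constraint so that the errors have median zero.

```python
import numpy as np


def _shape_term(v, lam):
    """(v**lam - 1) / lam, with the limiting value log(v) when lam is 0."""
    return np.log(v) if lam == 0 else (v ** lam - 1.0) / lam


def gld_quantile(u, lam1, lam2, lam3, lam4):
    """GLD quantile function in the Freimer et al. parameterisation."""
    return lam1 + (_shape_term(u, lam3) - _shape_term(1.0 - u, lam4)) / lam2


def median_zero_lam1(lam2, lam3, lam4):
    """Location that places the median of the error distribution at zero."""
    return -gld_quantile(0.5, 0.0, lam2, lam3, lam4)


def simulate_dataset(n, lam2, lam3, lam4, beta0, beta1, x_max=20.0, rng=None):
    """Draw (x, y) with x ~ Uniform(0, x_max) and zero-median GLD errors."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(0.0, x_max, size=n)
    u = rng.uniform(0.0, 1.0, size=n)          # inverse-transform sampling
    lam1 = median_zero_lam1(lam2, lam3, lam4)
    errors = gld_quantile(u, lam1, lam2, lam3, lam4)
    return x, beta0 + beta1 * x + errors


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical parameter values: heavy left tail (lam3 < 0), lighter right tail.
    x, y = simulate_dataset(500, lam2=1.0, lam3=-0.4, lam4=0.0,
                            beta0=2.0, beta1=1.0, rng=rng)
    residuals = y - (2.0 + 1.0 * x)
    print(np.median(residuals))  # sample median of the errors
```

Sampling through the quantile function is the natural choice for the GLD, since the quantile function is the one quantity available in closed form.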

Figure 1: MSE of the regression coefficients ($\beta_0$ and $\beta_1$) against sample size n when modelling data suited for NER. Results for VR and NER are represented by circles and triangles, respectively.

VR was applied to the datasets and the estimated parameters $(\hat\lambda_2, \hat\lambda_3, \hat\lambda_4, \hat\beta_0, \hat\beta_1)$ were compared to the true parameters. This process was repeated for sample sizes of n = 50, 100, 200, 500, 1000.

A preliminary study used 16 different $(\lambda_2, \lambda_3, \lambda_4, \beta_0, \beta_1)$ parameter sets. These sets were produced by the possible combinations of $\lambda_2 = .1$, $\lambda_3 = (-0.4, 0)$, $\lambda_4 = (-0.4, 0)$, $\beta_0 = (2, 4)$ and $\beta_1 = (, 1)$. Setting $\lambda_2 = .1$ generated data with large spread ($\lambda_2$ is the inverse scale parameter). Setting $\lambda_3, \lambda_4 = -0.4$ generated data with heavy tails (see Karian and Dudewicz (2000) for a description of GLD shapes). Setting $\lambda_3, \lambda_4 = 0$ generated data with lighter tails, but still on infinite support (see Table 1). $\beta_0$ and $\beta_1$ were set to arbitrary values.

The preliminary study showed that $\hat\lambda_2, \hat\lambda_3, \hat\lambda_4, \hat\beta_0, \hat\beta_1$ had sampling distributions whose shape and spread depended upon $\lambda_3$ and $\lambda_4$, but not upon $\beta_0$ and $\beta_1$. Hence, publishing results for more than one combination of $\beta_0$ and $\beta_1$ was found to be redundant. Consequently, this article only presents results for 4 of the 16 parameter sets from the preliminary study. These parameter sets are defined in Table 2.

Set    $\lambda_2$    $\lambda_3$    $\lambda_4$    $\beta_0$    $\beta_1$
A      .1             -0.4           -0.4           2
B      .1             -0.4           0              2
C      .1             0              -0.4           2
D      .1             0              0              2

Table 2: Definition of parameter sets.

Figures 2, 3, 4 and 5 show typical datasets and regression models produced in the simulations (the data and regression line are drawn in the xy-plane, and the error distribution is drawn above the regression line at x = 0, 10 and 20). The error distributions were approximately symmetric for sets A and D, but were left and right skewed for sets B and C, respectively.

Figure 6 shows the results of the simulation study. The accuracy of parameter estimates is presented for a range of sample sizes and parameter sets.

4. Discussion

Section 3.1 assessed VR's performance on data where the response variable was normally distributed. For a sample size of n = 50, VR's regression coefficients had much larger MSE than NER. However, VR rapidly improved in performance as the sample size increased, and there was minimal difference between VR and NER for sample sizes of n = 200, 500 and 1000.

Table 3 summarizes the $\hat\lambda_2, \hat\lambda_3, \hat\lambda_4$ values produced by VR in Section 3.1. Since GLD(0, 1.464, 0.1349, 0.1349) closely approximates N(0, 1), as determined by the method of moments (Karian and Dudewicz, 2000), Table 3 indicates that VR's error distribution became approximately normal as the sample size increased. Hence, VR provided a good approximation of NER as the sample size increased.

n       $\hat\lambda_2$    $\hat\lambda_3$    $\hat\lambda_4$
50      1.26 (.44)         .34 (.27)          .34 (.28)
100     1.38 (.26)         .21 (.14)          .21 (.14)
200     1.43 (.1)          .16 (.7)           .16 (.7)
500     1.4 (.1)           .1 (.)             .1 (.)
1000    1.4 (.7)           .14 (.3)           .14 (.3)

Table 3: Mean $\hat\lambda_2$, $\hat\lambda_3$ and $\hat\lambda_4$ values produced by VR in Section 3.1. The bracketed values represent the standard error.

Section 3.2 found that VR's parameter estimates decreased in bias and spread as the sample size increased (spread was measured by the difference between the 0.975 and 0.025 quantiles of the sampling distributions). Figure 6 shows that set D had the smallest spread of $\beta_0$ and $\beta_1$ estimates. This is a consequence of the data being more condensed (see Figure 5).

Execution time is an important aspect of any modelling technique. Table 4 summarizes the execution times of VR for the simulations performed in Section 3.2. The simulations were performed on computers with dual-core AMD Opteron 2 processors (2.4 GHz) and 4 GB of RAM. The fast execution times can be largely attributed to the VR objective function being written in the C programming language.

n       A             B             C             D
50      .8 (.8)       .11 (.11)     .1 (.7)       .11 (.7)
100     .12 (.1)      .16 (.7)      .14 (.3)      .16 (.4)
200     .27 (.6)      .3 (.1)       .3 (.6)       .3 (.7)
500     .48 (.7)      . (.12)       . (.9)        .3 (.12)
1000    .94 (.12)     1.31 (.34)    1.2 (.28)     1.22 (.33)

Table 4: Mean execution times (seconds) for the simulations performed in Section 3.2. The bracketed values represent the standard deviation.

All modelling techniques have their disadvantages and VR is no exception. The shortcomings of VR are listed below.

- The GLD has finite support when both $\lambda_3$ and $\lambda_4$ are positive. Consequently, VR's error distribution is only defined on a finite domain when both $\lambda_3$ and $\lambda_4$ are positive.
- The optimization routine may not converge to a solution (this is a problem of optimization in general). In Section 3.2, 2, simulations were performed and every one of these simulations successfully converged to a solution. In practice, if the optimization routine did fail, the user could override the $\lambda_3$ and $\lambda_4$ starting values and the modelling process could be rerun.
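To tie the pieces of the method together, here is a minimal end-to-end sketch of a VR fit in Python with SciPy, including the rerun with different shape starting values suggested above. It is our illustration rather than the authors' implementation (whose objective function was written in C). Several choices are assumptions of ours: $\lambda_2$ is optimized on the log scale to keep it positive; the density at each residual is obtained by inverting the quantile function with vectorized bisection; parameter vectors that place a residual outside the support, or that leave an arbitrary guard range, receive a large penalty value; the heavy-tailed starting shape is taken to be -0.4, and the alternatives -0.2 and -0.6 used for reruns are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

TINY = 1e-12      # keeps u strictly inside (0, 1)
PENALTY = 1e10    # objective value for inadmissible parameters (our choice)


def shape_term(v, lam):
    """(v**lam - 1) / lam, with the limiting value log(v) when lam is 0."""
    return np.log(v) if lam == 0 else np.expm1(lam * np.log(v)) / lam


def quantile(u, lam1, lam2, lam3, lam4):
    return lam1 + (shape_term(u, lam3) - shape_term(1.0 - u, lam4)) / lam2


def neg_log_lik(theta, x, y):
    """Negative log likelihood; theta = (log lam2, lam3, lam4, beta0, beta1)."""
    log_lam2, lam3, lam4, b0, b1 = theta
    if abs(log_lam2) > 20 or abs(lam3) > 5 or abs(lam4) > 5:
        return PENALTY                                   # arbitrary guard range
    with np.errstate(all="ignore"):
        lam2 = np.exp(log_lam2)
        lam1 = -quantile(0.5, 0.0, lam2, lam3, lam4)     # zero-median constraint
        e = y - (b0 + b1 * x)                            # residuals
        lo = np.full(e.shape, TINY)
        hi = np.full(e.shape, 1.0 - TINY)
        if (np.any(e <= quantile(lo, lam1, lam2, lam3, lam4))
                or np.any(e >= quantile(hi, lam1, lam2, lam3, lam4))):
            return PENALTY                               # residual outside support
        for _ in range(60):                              # bisection for F^{-1}(u) = e
            mid = 0.5 * (lo + hi)
            below = quantile(mid, lam1, lam2, lam3, lam4) < e
            lo = np.where(below, mid, lo)
            hi = np.where(below, hi, mid)
        u = 0.5 * (lo + hi)
        log_f = log_lam2 - np.log(u ** (lam3 - 1.0) + (1.0 - u) ** (lam4 - 1.0))
        value = -np.sum(log_f)
    return float(value) if np.isfinite(value) else PENALTY


def fit_vr(x, y, shape_starts=(-0.4, -0.2, -0.6)):
    """Fit by BFGS; rerun from other starting shapes if it does not converge."""
    b1, b0 = np.polyfit(x, y, 1)                         # least-squares (NER) start
    q25, q75 = np.percentile(y - (b0 + b1 * x), [25, 75])
    results = []
    for s in shape_starts:
        # lam2 start from the IQR expression with lam3 = lam4 = s
        lam2 = 2.0 * (0.75 ** s - 0.25 ** s) / s / (q75 - q25)
        start = np.array([np.log(lam2), s, s, b0, b1])
        res = minimize(neg_log_lik, start, args=(x, y), method="BFGS",
                       options={"maxiter": 200})
        if res.success:
            return res
        results.append(res)
    return min(results, key=lambda r: r.fun)             # best of the failed runs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 20.0, size=300)
    y = 2.0 + 1.0 * x + rng.standard_normal(300)         # data suited to NER
    res = fit_vr(x, y)
    log_lam2, lam3, lam4, b0, b1 = res.x
    print("lam2, lam3, lam4:", np.exp(log_lam2), lam3, lam4)
    print("beta0, beta1:", b0, b1)
```

With a finite-difference gradient, BFGS can stop with a precision warning even when it is close to the optimum, so the sketch falls back to the run with the smallest objective value instead of raising an error. The penalty value makes the objective discontinuous at the edge of the support, which is a pragmatic shortcut and not a substitute for a properly constrained optimizer.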

Figure 2: Typical regression model for set A.

Figure 3: Typical regression model for set B.

Figure 4: Typical regression model for set C.

Figure 5: Typical regression model for set D.

Figure 6: Accuracy of parameter estimates for a range of sample sizes and parameter sets (one column of panels per parameter set A, B, C, D; one row of panels per parameter $\lambda_2$, $\lambda_3$, $\lambda_4$, $\beta_0$, $\beta_1$). Each horizontal line represents the true parameter value. Each circle represents the mean value of the sampling distribution. Each upward and downward facing triangle represents the 0.025 and 0.975 quantiles of the sampling distribution, respectively.

5. Conclusion

This article presented Versatile Regression, a simple regression technique where the error distribution was described by the Generalized Lambda Distribution. The flexibility of this distribution allowed the error distribution to be heavy-tailed, skewed or approximately normal. Versatile Regression performed well on heavy-tailed and skewed data. Versatile Regression also provided a reasonable approximation to Normal-Error Regression.

References

Broyden, C. G., 1970. The convergence of a class of double-rank minimization algorithms. IMA Journal of Applied Mathematics, 6, 76-90.

Dean, B., King, R. A. R. and Howley, P. P., 2009. Stretched Regression: simple regression with a non-normal response distribution that smoothly changes scale and shape. In preparation. (Dept. of Statistics, University of Newcastle, Callaghan, NSW Australia).

Fernandez, C. and Steel, M. F. J., 1998. On Bayesian modeling of fat tails and skewness. Journal of the American Statistical Association, 93(441), 359-371.

Fletcher, R., 1970. A new approach to variable metric algorithms. Computer Journal, 13, 317-322.

Freimer, M., Mudholkar, G. S., Kollia, G. and Lin, C. T., 1988. A study of the Generalized Tukey Lambda family. Communications in Statistics - Theory and Methods, 17, 3547-3567.

Geweke, J., 1993. Bayesian treatment of the independent student-t linear model. Journal of Applied Econometrics, 8, 19-40.

Goldfarb, D., 1970. A family of variable metric updates derived by variational means. Mathematics of Computation, 24, 23-26.

Karian, Z. A. and Dudewicz, E. J., 2000. Fitting statistical distributions: the Generalized Lambda Distribution and Generalized Bootstrap methods. Boca Raton, CRC Press.

King, R. A. R., Gerlach, R. and Wraith, D., 2. Starship regression: A parametric quantile regression method with flexibly-shaped errors. In preparation. (Dept. of Statistics, University of Newcastle, Callaghan, NSW Australia).

Koenker, R. and Bassett, G., 1978. Regression quantiles. Econometrica, 46, 33-50.

Mandelbrot, B., 1971. Linear regression with non-normal error terms: a comment. The Review of Economics and Statistics, 53(2), 205-206.

Neter, J., Kutner, M. H., Nachtsheim, C. J. and Wasserman, W., 1996. Applied linear statistical models, fourth ed. McGraw-Hill/Irwin.

Owen, D. B., 1988. The starship. Communications in Statistics - Simulation and Computation, 17, 315-323.

Shanno, D. F., 1970. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24, 647-656.

Zeckhauser, R. and Thompson, M., 1970. Linear regression with non-normal error terms. The Review of Economics and Statistics, 52(3), 280-286.