Dummy Variable Multiple Regression Forecasting Model
International Journal of Engineering Science Invention
www.ijesi.org ǁ Volume 2, Issue 3 ǁ March 2013

Dummy Variable Multiple Regression Forecasting Model

Nwankwo, H.¹, Oyeka, I.²
¹Department of Statistics, Nnamdi Azikiwe University, Awka, Nigeria
²Department of Statistics, Nnamdi Azikiwe University, Awka, Nigeria

ABSTRACT: A method of using multiple regression to make forecasts for data arranged in a sequence, including time order, is presented here. The method uses dummy variables, which makes it robust. The design matrix is obtained by a cumulative coding procedure, which enables it to overcome the setback of equal spacing associated with dummy variable regression methods. The use of direct effects, obtained by the method of path analysis, makes the whole procedure unique.

Keywords: Cumulative Coding, Dummy, Path Analysis, Design Matrix, Robust, Direct Effect

I. INTRODUCTION
Usually, multiple regression models are used for estimating the contributions or effects of independent variables on a dependent variable. A first-order multiple regression model with two independent variables X1 and X2 has the form [1]

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ε_i    (1)

This model is linear in the parameters and linear in the independent variables. Y_i denotes the response in the i-th trial, and X_i1 and X_i2 are the values of the two independent variables in the i-th trial. The parameters of the model are β_0, β_1 and β_2, and the error term is ε_i. Assuming that E(ε_i) = 0, the regression function for (1) is

E(Y) = β_0 + β_1 X1 + β_2 X2    (2)

Note that the regression function (2) is a plane and not a line. The parameter β_0 is the Y intercept of the regression plane. The parameter β_1 indicates the change in the mean response per unit increase in X1 when X2 is held constant. Likewise, β_2 indicates the change in the mean response per unit increase in X2 when X1 is held constant. When there are more than two independent variables, say p−1 independent variables X1, …, X_{p−1}, the first-order model is

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + … + β_{p−1} X_{i,p−1} + ε_i    (3)
Assuming that E(ε_i) = 0, the response function for (3) is

E(Y) = β_0 + β_1 X1 + β_2 X2 + … + β_{p−1} X_{p−1}    (4)

The response function (4) is a hyperplane, that is, a plane in more than two dimensions. It is no longer possible to picture this response surface as we were able to do with (2), which is a plane in three dimensions. Multiple regression is one of the most widely used of all statistical tools. A regression model for which the response surface is a plane can be used either in its own right, when appropriate, or as an approximation to a more complex response surface: many complex response surfaces are approximated well by a plane over limited ranges of the independent variables [1]. The use of multiple regression is largely limited to estimating the contributions, or effects, of the independent variables on the dependent variable. Forecasting values of the dependent variable using multiple regression models is often of interest to researchers, though forecasting is not a common feature of existing regression methods. The problem is that independent variables are not available for the period for which forecasts are sought.
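As a minimal numerical sketch of fitting the first-order model (1) by ordinary least squares, b = (X'X)⁻¹X'y, here is an illustration on simulated data (the data, coefficients, and seed are assumptions for illustration, not values from the paper):

```python
import numpy as np

# Simulated data for Y = b0 + b1*X1 + b2*X2 + error, as in equation (1).
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 1, n)

# Design matrix with an intercept column, then the least-squares
# estimate b = (X'X)^{-1} X'y.
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # roughly [2.0, 1.5, -0.5]
```

The fitted b recovers the simulated coefficients up to sampling noise, and the same matrix formula carries over unchanged to the dummy-variable design matrices developed below.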
A method for carrying out forecasts of this sort is proposed here and is applicable when the data are in at least ordinal form.

II. PROPOSED METHOD
To achieve the objective of this paper, which is to develop a multiple regression forecasting model, we propose the use of dummy variable multiple regression modelling methods [2]. Specifically, a dummy variable coding system is used in which each category or level of a parent independent variable in a regression model is represented by a pattern of 1s and 0s, forming a dummy variable set. In order to avoid linear dependence among the dummy variables of a parent variable, each parent variable is always represented by one dummy variable less than the number of its categories [3][4]. Thus if a given parent variable Z has z categories or levels, the corresponding design matrix will be represented ordinally by z−1 column vectors of ordinally coded dummy variables x_d of 1s and 0s (for d = 1, 2, …, z−1). The 1s and 0s in each x_d are cumulative if the observations at each level of the parent variable are arranged together. Specifically, the p-th level (p = 1, 2, …, z) of the z levels of a parent variable Z is represented by the z−1 ordinally coded column vectors of 1s and 0s through

x_id = 1 if observation i belongs to a level of Z higher than level d, and x_id = 0 otherwise, for d = 1, 2, …, z−1    (5)

Then if, without loss of generality, the observations in each level of Z are arranged together, the n × (z−1) design matrix representing Z consists of a set of z−1 cumulatively coded column vectors x_d of 1s and 0s of the form

X = [x_1 x_2 … x_{z−1}], where column x_d has 0 in its first n_1 + … + n_d positions and 1 in the remaining positions    (6)

Equation (6) is a prototype of an ordinally coded design matrix with z−1 cumulatively coded column vectors x_d of 1s and 0s representing the z levels of the parent variable Z. Note that the first n_1 elements of the first column x_1, representing the first level of Z, are 0s while the remaining n−n_1 are all 1s. The first n_1 + n_2 elements of
x_2 are 0s, while the remaining n−(n_1+n_2) elements are all 1s, and so on, until finally all the elements of x_{z−1} are 0s except the last n_z elements, which are all 1s. Note that all the observations in the first level (level 1) of Z are coded 0 in all the columns of the design matrix, while observations in the last level (level z) of Z are coded 1 in all the columns. Note also that Z may be any of a set of parent independent variables such as A, B, C, etc., with levels a, b, c, etc. respectively. An ordinal dummy variable multiple regression model of y_i on the x_ij's may be expressed as

y_i = β_0 + β_1;A x_i1;A + β_2;A x_i2;A + … + β_{a−1};A x_{i,a−1};A + … + β_{c−1};C x_{i,c−1};C + e_i    (7)

where the β_j's are partial regression coefficients and the e_i are error terms uncorrelated with the x_ij's, with E(e_i) = 0; A has a levels, B has b levels, C has c levels, etc. Note that the expected value of y_i is

E(y_i) = β_0 + β_1;A x_i1;A + β_2;A x_i2;A + … + β_{a−1};A x_{i,a−1};A + … + β_{c−1};C x_{i,c−1};C    (8)

Equation (7) may alternatively be expressed in matrix form as

y = Xβ + e    (9)

where y is an n×1 column vector of outcome values; X is an n×r cumulatively coded design matrix of 1s and 0s; β is an r×1 column vector of regression coefficients; and e is an n×1 column vector of error terms uncorrelated with X, with E(e) = 0, where r is the rank of the design matrix. Use of the method of least squares with either equation (7) or (9) yields an unbiased estimator of β as

b = (X'X)⁻¹ X'y    (10)

where (X'X)⁻¹ is the matrix inverse of X'X. The resulting predicted regression model is ŷ = Xb. The following analysis of variance (ANOVA) table (Table 1) enables the testing of the adequacy of equation (7) or (9) using the F test.

Table 1: Analysis of Variance (ANOVA) Table for Equation (9)

Source of Variation | Sum of Squares (SS)   | Degrees of Freedom (DF) | Mean Sum of Squares (MS) | F Ratio
Regression          | SSR = b'X'y − n·ȳ²    | r − 1                   | MSR = SSR/(r−1)          | F = MSR/MSE
Error               | SSE = y'y − b'X'y     | n − r                   | MSE = SSE/(n−r)          |
Total               | SST = y'y − n·ȳ²      | n − 1                   |                          |    (11)

The null hypothesis to be tested for the adequacy of the regression model (equation 7 or 9) is

H_0: β = 0 versus H_1: β ≠ 0    (12)

H_0 is tested using the test statistic F = MSR/MSE. This has an F distribution with r−1 and
n−r degrees of freedom. H_0 is rejected at the α level of significance if

F ≥ F_{1−α; r−1, n−r}    (13)

Otherwise we do not reject H_0, where F_{1−α; r−1, n−r} is the critical value of the F distribution with r−1 and n−r degrees of freedom for a specified α level.
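A sketch of the F test of Table 1, together with the contrast t test that follows in equations (15)–(16), on simulated cumulatively coded data (the data, effect sizes, and seed are assumptions for illustration, not values from the paper):

```python
import numpy as np
from scipy import stats

# Illustrative data: a 3-level parent variable with cumulatively coded
# dummies as in equation (5): x_d = 1 when the level exceeds d.
rng = np.random.default_rng(2)
levels = np.repeat([1, 2, 3], 10)
Xd = np.column_stack([(levels > d).astype(int) for d in (1, 2)])
X = np.column_stack([np.ones(len(levels)), Xd])
y = 5.0 + 1.5 * Xd[:, 0] + 0.2 * Xd[:, 1] + rng.normal(0, 0.3, len(levels))

n, r = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                        # equation (10)

# ANOVA of Table 1: SSR = b'X'y - n*ybar^2, SSE = y'y - b'X'y.
ssr = b @ X.T @ y - n * y.mean() ** 2
sse = y @ y - b @ X.T @ y
F = (ssr / (r - 1)) / (sse / (n - r))
p_F = stats.f.sf(F, r - 1, n - r)

# Contrast t test of equation (15): H0: beta_1 = beta_2, with a vector c
# carrying 1 and -1 at the two coefficients' positions.
mse = sse / (n - r)
c = np.array([0.0, 1.0, -1.0])
t = (c @ b) / np.sqrt(mse * (c @ XtX_inv @ c))
p_t = 2 * stats.t.sf(abs(t), n - r)
print(F, p_F, t, p_t)
```

With the strong simulated effects, both the overall F test and the contrast between the two dummy coefficients come out significant; note that SSR + SSE equals SST exactly, as Table 1 requires.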
If H_0 is rejected, indicating that not all β_j's are equal to zero, then other hypotheses concerning the β_j's may be tested. Note that β_k is interpreted in the ordinal dummy variable regression model as the amount by which the dependent variable y on the average changes for every unit increase in x_k compared with x_{k−1}, or one unit decrease in x_k relative to x_{k+1}, when all other independent variables in the model are held constant. That is, β_k measures the amount by which, on the average, the dependent variable y increases or decreases for every unit change in x_k compared with a corresponding unit change in either x_{k−1} or x_{k+1} respectively, when all other independent variables in the model are held constant. Research interest may be in comparing the differential effects of any two ordinal dummy variables of a parent independent variable on the dependent variable. For example, one may be interested in testing the null hypothesis

H_0: β_l;A = β_j;A versus H_1: β_l;A ≠ β_j;A    (14)

where the β's are estimated from equation (10) by the corresponding b's, for l = 1, 2, …, a−1; j = 1, 2, …, a−1; l ≠ j. The null hypothesis of equation (14) may be tested using the test statistic

t = (b_l;A − b_j;A) / se(b_l;A − b_j;A) = c'b / √(MSE · c'(X'X)⁻¹c)    (15)

where c is an r-row vector of the form (0, 0, …, 1, …, −1, …, 0), in which the 1 and −1 correspond to the positions of β_l;A and β_j;A respectively in the r×1 column vector b, and all other elements of c are 0. H_0 is rejected at the α level of significance if

|t| ≥ t_{1−α; n−r}    (16)

Otherwise we do not reject H_0, where t_{1−α; n−r} is the critical value of the t distribution with n−r degrees of freedom for a specified α level.

In general, several other hypotheses may be tested. For example, one may be interested in comparing the effects of the i-th level of factor A, say, and the j-th level of factor C, say, or of some combinations of some levels of several factors. Thus interest may be in testing

H_0: β_i;A = β_j;C versus H_1: β_i;A ≠ β_j;C    (17)

using the test statistic

t = (b_i;A − b_j;C) / se(b_i;A − b_j;C) = c'b / √(MSE · c'(X'X)⁻¹c)    (18)

where c is a row vector as in equation (15), except that the 1 and −1 now occur at the positions corresponding to the i-th level of factor A and the j-th level
of factor C in b. H_0 is rejected as in equation (16). Further interest may also be in estimating the total or overall effect of a given parent independent variable, through the effects of its representative ordinal dummy variables, on the dependent variable. To do this, it should be noted that any parent variable is completely determined by its set of representative ordinal dummy variables.

III. ESTIMATION OF EFFECTS
For the purposes of parameter estimation in dummy variable regression, the dummy variables of a parent independent variable are treated as intermediate (independent) variables between the parent variable and the dependent variable of interest [4][5]. Each dummy variable of the parent independent variable is then treated as a separate variable determined by its parent variable and determining the specified dependent variable. Therefore, in developing an expression for the regression effect of a parent independent variable on a specified
dependent variable, use is made of the simple effects of the set of dummy variables representing that parent independent variable on the dependent variable. Now an equation expressing the determination of a dependent variable by the d-th dummy variable x_d (d = 1, 2, …, s−1) of a parent independent variable V with s levels may be written in the form

y = β_d x_d + β_(x) x_(x) + e    (19)

In equation (19), y is an n-column vector representing the dependent variable; x_d is an n-column vector denoting the d-th ordinal dummy variable representing the parent independent variable V, for d = 1, 2, …, s−1; x_(x) is an n × (s−2) matrix of full column rank s−2 representing all the other s−2 ordinal dummy variables for V; β_d is the regression effect of x_d on y; β_(x) is an (s−2)-column vector of the regression effects of x_(x) on y; and e is an n-column vector of uncorrelated error terms. Note that equation (19) may be expressed more compactly in the form

y = Xβ + e    (20)

where X = (x_d, x_(x)) and

β̂ = (X'X)⁻¹ X'y    (21)

Without loss of generality, take β_d as the first component of β. In ordinal dummy variable regression, β_d is a regression coefficient measuring the net effect of the d-th level of a parent independent variable V on a dependent variable y, relative to the effect of the adjacent level of V, after adjustments have been made for the effects of the other variables in the regression model. Now, to estimate the direct effect [6][2], B_v, of a given parent independent variable V on a dependent variable y, the dummies of the parent independent variable V are treated as intermediate variables between V and y. Then, following the method of path analysis [6][4][2], B_v is obtained as a weighted sum of the β_d:

B_v = Σ_{d=1}^{s−1} α_d β_d    (22)

where the weight α_d is the simple regression coefficient of x_d on V, subject to the constraint

Σ_{d=1}^{s−1} α_d = 1    (23)

The difference between the total effect b_v, namely the simple regression coefficient of y on the parent independent variable V, and B_v, the effect of V on y through
the variables x d, is the indirect effect of V on y [6][2] IV FORESTING onventionally, forecasting using regression methods where the parent independent variables are categorical and represented by codes is carried out using the method of least squares In this case, the value of the dependent variable to be predicted or forecasted, for given values of the independent variables, is obtained by inserting the values of these independent variables represented by codes in the fitted regression model, where the independent variable is coded from to n (n being the number of observations in ascending order of time, or the orthogonal coding system where the earliest in time is coded ( until the latest in time is coded ( ), increasing arithmetically by unit, if the number of observations are odd, example is t = -3,-2,-,,,2,3 if there are 7 observations, alternatively the codes will be -(n-), -(n-3), -(n-5), starting with the earliest in time to the latest in time, if the number of observations is even These codes are used as independent variables against the real dependent variable, y This method suffers from the restriction of equal spacing of levels in the codes using normal dummy variable regression models plus the additional difficulty of interpreting the regression coefficients generated wwwijesiorg 6 P a g e
In cases where the parent independent variables are each represented by a set of dummy variables of 1s and 0s, the use of direct effects as the forecast model coefficients is advocated. This use of direct effects as coefficients takes care of the requirement and constraint of equal spacing, and the coefficients are interpretable, hence more useful for practical purposes. The multiple regression model expressing the dependence or relationship between the dependent variable Y and the parent independent variables A, B, C, etc., represented by their respective sets of ordinal dummy variables, is

Y_i = β_0 + β_1;A X_i1;A + β_2;A X_i2;A + … + β_{a−1};A X_{i,a−1};A + β_1;B X_i1;B + β_2;B X_i2;B + … + β_{b−1};B X_{i,b−1};B + β_1;C X_i1;C + β_2;C X_i2;C + … + β_{c−1};C X_{i,c−1};C + … + e_i    (24)

where X_ij;A is the ordinal dummy variable representing the j-th level of factor A with regression effect β_j;A, j = 1, 2, …, a−1; X_ij;B is the ordinal dummy variable representing the j-th level of factor B with regression effect β_j;B, j = 1, 2, …, b−1; X_ij;C is the ordinal dummy variable representing the j-th level of factor C with regression effect β_j;C, j = 1, 2, …, c−1; etc., and e_i is the error term uncorrelated with the X_ij's, with E(e_i) = 0 for i = 1, 2, …, n. The expected value of equation (24) is

E(y_i) = β_0 + Σ_{j=1}^{a−1} β_j;A E(X_ij;A) + Σ_{j=1}^{b−1} β_j;B E(X_ij;B) + Σ_{j=1}^{c−1} β_j;C E(X_ij;C) + …    (25)

Now define the regression effect of any parent independent variable, say A, on the dependent variable y through the effects of the set of ordinal dummy variables representing that parent independent variable. We take the partial derivative of equation (25), the expected value of y, with respect to A, obtaining

dE(y_i)/dA = Σ_{j=1}^{a−1} β_j;A dE(X_ij;A)/dA + Σ_{j=1}^{b−1} β_j;B dE(X_ij;B)/dA + Σ_{j=1}^{c−1} β_j;C dE(X_ij;C)/dA + …

where dE(X_ij;A)/dA = α_j;A is the simple regression coefficient of X_ij;A regressed on the parent independent variable A, with Σ_{j=1}^{a−1} α_j;A = 1 (Oyeka, 1993), and dE(X_ij;Z)/dA = 0 for all parent independent variables Z ≠ A, Z = B, C, …, so that

dE(y_i)/dA = Σ_{j=1}^{a−1} α_j;A β_j;A

Hence the partial regression effect of the parent independent variable of factor A, through the
effects of its representative set of ordinal dummy variables, on the dependent variable y is

β_y;A = Σ_{j=1}^{a−1} α_j;A β_j;A    (26)

whose sample estimate is

B_A = Σ_{j=1}^{a−1} α_j;A b_j;A    (27)

where b_j;A is the sample estimate of the partial regression coefficient β_j;A, j = 1, 2, …, a−1. Estimates of the partial effects of other parent independent variables in the model are similarly obtained.

V. ILLUSTRATIVE EXAMPLE
Mean monthly maximum temperature data, in degrees centigrade, were collected by a research centre for five consecutive years (2007–2011), as shown in Table 2 below.

Table 2: Mean Monthly Maximum Temperature (°C), by year (2007–2011) and month (Jan–Dec); the numerical entries are not recoverable from the source.
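Given data laid out as in Table 2, the cumulative coding of equation (5) can be sketched as follows (the layout is shrunk to 2 years × 3 months for readability; sizes and names here are illustrative assumptions):

```python
import numpy as np

def cumulative_dummies(levels, z):
    """Cumulatively coded dummies of equation (5): column d (d = 1..z-1)
    is 1 for observations whose level exceeds d, else 0."""
    levels = np.asarray(levels)
    return np.column_stack([(levels > d).astype(int) for d in range(1, z)])

# Two years x three months, observations arranged in time order.
year = np.repeat([1, 2], 3)             # parent variable for year
month = np.tile([1, 2, 3], 2)           # parent variable for month
X_year = cumulative_dummies(year, 2)    # z - 1 = 1 dummy column
X_month = cumulative_dummies(month, 3)  # z - 1 = 2 dummy columns
design = np.column_stack([X_year, X_month])
print(design)
# first-level rows are all 0s in their columns; last-level rows all 1s
```

The printed matrix reproduces the pattern of equation (6): each column switches from 0 to 1 exactly once as its parent variable passes level d, which is what distinguishes cumulative coding from ordinary one-hot dummies.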
Applying equation (5) to obtain the design matrix, note must be taken that two independent variables are involved here: year, represented in the dummy design matrix as X_id, d = 1, 2, 3, 4, and month, represented in the dummy design matrix as M_id, d = 1, 2, …, 11. Note also that the design matrix is obtained using the cumulative coding system proposed in equation (5) for both the years and the months. Table 3 below shows the cumulatively coded design matrix for the year and month variables. Note that one year (2011) was dropped, and that one month (Dec) was dropped. P_m and P_x are the parent independent variables for month and year respectively.

Table 3: Cumulatively coded design matrix for the maximum temperature data, with columns P_m, P_x, X_1–X_4 and M_1–M_11; the entries are not recoverable from the source.
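The regressions described next, of each dummy variable on its parent variable, supply the weights α_d of equation (22); a sketch follows, with hypothetical β values standing in for the fitted partial coefficients (Table 5's actual values are not reproduced here):

```python
import numpy as np

def cumulative_dummies(levels, z):
    levels = np.asarray(levels)
    return np.column_stack([(levels > d).astype(int) for d in range(1, z)])

def simple_slope(x, v):
    """Simple regression coefficient of x on v: cov(x, v) / var(v)."""
    v = np.asarray(v, float)
    return np.cov(x, v, bias=True)[0, 1] / np.var(v)

# A parent variable with 4 levels (e.g. year coded 1..4, one value per
# month), and its cumulative dummies.
v = np.repeat([1, 2, 3, 4], 12)
Xd = cumulative_dummies(v, 4)
alpha = np.array([simple_slope(Xd[:, d], v) for d in range(Xd.shape[1])])
print(alpha.sum())   # 1.0 -- the constraint of equation (23)

# Direct effect of equation (22): a weighted sum of the partial
# coefficients. These beta values are hypothetical placeholders.
beta = np.array([0.8, 0.5, 0.3])
B_v = alpha @ beta
print(B_v)
```

That the weights sum to exactly 1 is not a coincidence of this example: the cumulative dummies of a parent variable V sum pointwise to V − 1, so their simple-regression slopes on V must sum to cov(V, V)/var(V) = 1.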
For the direct effects, the dummy variables X_id (d = 1, 2, 3, 4), as independent variables, are each regressed on P_x, the parent independent variable for year, using the method of least squares. Similarly, the dummy variables M_id (d = 1, 2, …, 11), as independent variables, are each regressed on P_m, the parent independent variable for month, by the same method of least squares. Table 4 below shows the outputs of these regressions.

Table 4: Simple regression coefficients α_i from regressing each dummy variable on its parent variable (X_1–X_4 on P_x; M_1–M_11 on P_m); the coefficient values are not recoverable from the source.

Note that the simple regression coefficients for each of the parent variables (year and month) sum to 1. Recall that the direct effect for the years, B_x, is obtained as B_x = Σ_{i=1}^{4} α_i β_i, where the β_i are the regression coefficients obtained from regressing the maximum temperature data (as dependent variable) on the cumulatively coded dummy variable design matrix (Table 3 without P_x and P_m), and the α_i are the simple regression coefficients related to the year dummies X_1, X_2, X_3, X_4 respectively, as shown in Table 4 above. B_m = Σ α_i β_i is obtained similarly for the months, where the α_i are the simple regression coefficients related to the month dummies M_1, M_2, …, M_11 respectively, as shown in Table 4. Table 5 below shows the coefficients (β) obtained from the regression.
Table 5: Regression coefficients (β_i) and p-values from the regression of the maximum temperature data on the cumulatively coded dummy variable design matrix (constant, X_1–X_4, M_1–M_11); the coefficient values are not recoverable from the source.

For the regression above, we have R² = 0.877 and MSE = 0.979 with p-value = 0.000; hence a significant multiple regression exists.

Direct Effects and Tests of Significance
The direct effect for year, B_x = Σ_{i=1}^{4} α_i β_i, is obtained as 3.3 × 10⁻³. Similarly, the direct effect for month, B_m = Σ_{i=1}^{11} α_i β_i, is −0.257. To test the significance of the direct effect for the years, B_x, the statistic

t = B_x / √Var(B_x)

follows the Student t distribution with n−p degrees of freedom, where p is the number of parameters in the model. Var(B_x) = Var(α'β) = α'Var(β)α, where Var(β) = MSE(X_d'X_d)⁻¹ [1], so Var(B_x) = α'MSE(X_d'X_d)⁻¹α; this yields Var(B_x) = 0.0000167 and hence t_x = 0.8. For 60 − 4 = 56 degrees of freedom, a p-value of more than 0.76 was observed; this supports the hypothesis of no significant direct effect, due to the years, on the regression equation. Similarly, for a significance test of the direct effect of the months, B_m, on the regression equation, Var(B_m) is obtained. Hence
t_m = B_m / √Var(B_m) = −1.83. For n − p = 60 − 11 = 49 degrees of freedom, a p-value of less than 0.05 was observed. This supports a significant direct effect due to the months.

Forecasting
The forecast equation for maximum temperature is therefore of the form

Ť_ym = B_0 + B_x t_xi + B_m t_mi,  i = 1, 2, …, 12    (28)

where B_0 is the overall mean effect, estimated by the constant term obtained as in Table 5 (the value here is 33.17); B_x is the direct effect of year on maximum temperature obtained by the method of path analysis (the value here is 3.3 × 10⁻³), which is not statistically significant and hence is dropped; and B_m is the direct effect of month on maximum temperature, also obtained by the method of path analysis (the value here is −0.257), which is statistically significant. [7] interpreted path coefficients (direct effects) thus: given a path diagram A → B with path coefficient θ = 0.8, if region A increases by 1 standard deviation from its mean, region B would be expected to increase by 0.8 of its own standard deviation from its own mean, holding all other relevant regional connections constant. Our forecast equation here, using the direct effects as coefficients, is

Ť = 33.17 − 0.257 t_mi,  i = 1, 2, …, 12    (29)

where t_mi is the serial count of the months from January (t_mi = 1) to December (t_mi = 12) of each year. If a forecast for December 2012, say, is required, then t_mi = 12 and Ť = 33.17 − 0.257(12) = 30.09 °C. Similarly, if a forecast for February 2013 is required, t_mi = 2, and Ť = 33.17 − 0.257(2) = 32.66 °C.

VI. CONCLUSION
So far it has been demonstrated how one can make forecasts on a variable which can be arranged in any sequence, including time order, using the multiple regression approach and the robust method of dummy variables, with the design matrix obtained by a cumulative coding arrangement. The direct effects, obtained through the method of path analysis, are used as parameter estimates where they are significant. This procedure can be viewed as having the ability to break down a single sequence of data into its hitherto not visible
components, thus creating a way into the bits and pieces that make up the variable, with the relevant pieces reassembled to obtain a forecasting model for the variable.

REFERENCES
[1] Neter, J.; Wasserman, W.; Kutner, M.H. (1983): Applied Linear Regression Models. Richard D. Irwin Inc., Illinois.
[2] Oyeka, I.C.A. (1993): Estimating Effects in Ordinal Dummy Variable Regression. STATISTICA, anno LIII, n. 2, pp. 261–268.
[3] Boyle, R.P. (1970): Path Analysis and Ordinal Data. American Journal of Sociology, 75, 461–480.
[4] Lyons, M. (1971): Techniques for Using Ordinal Measures in Regression and Path Analysis, in Herbert Costner (ed.), Sociological Methodology, Jossey-Bass Publishers, San Francisco.
[5] Wikipedia, the free encyclopedia (2012): Path Analysis (Statistics). Modified 2 November 2012.
[6] Wright, S. (1960): Path Coefficients and Path Regressions: Alternative or Complementary Concepts? Biometrics, 16, pp. 189–202.
[7] Brian, P.: Structural Equation Modelling (SEM) or Path Analysis. http://afni.nimh.nih.gov
Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.
More informationWe like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model.
Statistical Methods in Business Lecture 5. Linear Regression We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model.
More informationSimple Linear Regression
9-1 l Chapter 9 l Simple Linear Regression 9.1 Simple Linear Regression 9.2 Scatter Diagram 9.3 Graphical Method for Determining Regression 9.4 Least Square Method 9.5 Correlation Coefficient and Coefficient
More informationSTAT Chapter 11: Regression
STAT 515 -- Chapter 11: Regression Mostly we have studied the behavior of a single random variable. Often, however, we gather data on two random variables. We wish to determine: Is there a relationship
More information16.3 One-Way ANOVA: The Procedure
16.3 One-Way ANOVA: The Procedure Tom Lewis Fall Term 2009 Tom Lewis () 16.3 One-Way ANOVA: The Procedure Fall Term 2009 1 / 10 Outline 1 The background 2 Computing formulas 3 The ANOVA Identity 4 Tom
More information6. Multiple Linear Regression
6. Multiple Linear Regression SLR: 1 predictor X, MLR: more than 1 predictor Example data set: Y i = #points scored by UF football team in game i X i1 = #games won by opponent in their last 10 games X
More informationSummary of Chapter 7 (Sections ) and Chapter 8 (Section 8.1)
Summary of Chapter 7 (Sections 7.2-7.5) and Chapter 8 (Section 8.1) Chapter 7. Tests of Statistical Hypotheses 7.2. Tests about One Mean (1) Test about One Mean Case 1: σ is known. Assume that X N(µ, σ
More informationRegression used to predict or estimate the value of one variable corresponding to a given value of another variable.
CHAPTER 9 Simple Linear Regression and Correlation Regression used to predict or estimate the value of one variable corresponding to a given value of another variable. X = independent variable. Y = dependent
More informationCh 3: Multiple Linear Regression
Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery
More informationChapter 16. Simple Linear Regression and dcorrelation
Chapter 16 Simple Linear Regression and dcorrelation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will
More informationSTA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007
STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 LAST NAME: SOLUTIONS FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 302 STA 1001 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator.
More informationInference for Regression Simple Linear Regression
Inference for Regression Simple Linear Regression IPS Chapter 10.1 2009 W.H. Freeman and Company Objectives (IPS Chapter 10.1) Simple linear regression p Statistical model for linear regression p Estimating
More informationChapter 8 Heteroskedasticity
Chapter 8 Walter R. Paczkowski Rutgers University Page 1 Chapter Contents 8.1 The Nature of 8. Detecting 8.3 -Consistent Standard Errors 8.4 Generalized Least Squares: Known Form of Variance 8.5 Generalized
More informationTable of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).
Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). For example P(X.04) =.8508. For z < 0 subtract the value from,
More informationAnalysis of Variance. Source DF Squares Square F Value Pr > F. Model <.0001 Error Corrected Total
Math 221: Linear Regression and Prediction Intervals S. K. Hyde Chapter 23 (Moore, 5th Ed.) (Neter, Kutner, Nachsheim, and Wasserman) The Toluca Company manufactures refrigeration equipment as well as
More information11 Hypothesis Testing
28 11 Hypothesis Testing 111 Introduction Suppose we want to test the hypothesis: H : A q p β p 1 q 1 In terms of the rows of A this can be written as a 1 a q β, ie a i β for each row of A (here a i denotes
More informationLecture 6 Multiple Linear Regression, cont.
Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression
More informationLECTURE 6. Introduction to Econometrics. Hypothesis testing & Goodness of fit
LECTURE 6 Introduction to Econometrics Hypothesis testing & Goodness of fit October 25, 2016 1 / 23 ON TODAY S LECTURE We will explain how multiple hypotheses are tested in a regression model We will define
More informationSTAT420 Midterm Exam. University of Illinois Urbana-Champaign October 19 (Friday), :00 4:15p. SOLUTIONS (Yellow)
STAT40 Midterm Exam University of Illinois Urbana-Champaign October 19 (Friday), 018 3:00 4:15p SOLUTIONS (Yellow) Question 1 (15 points) (10 points) 3 (50 points) extra ( points) Total (77 points) Points
More informationMath 3330: Solution to midterm Exam
Math 3330: Solution to midterm Exam Question 1: (14 marks) Suppose the regression model is y i = β 0 + β 1 x i + ε i, i = 1,, n, where ε i are iid Normal distribution N(0, σ 2 ). a. (2 marks) Compute the
More informationInference for the Regression Coefficient
Inference for the Regression Coefficient Recall, b 0 and b 1 are the estimates of the slope β 1 and intercept β 0 of population regression line. We can shows that b 0 and b 1 are the unbiased estimates
More informationVariance Decomposition and Goodness of Fit
Variance Decomposition and Goodness of Fit 1. Example: Monthly Earnings and Years of Education In this tutorial, we will focus on an example that explores the relationship between total monthly earnings
More informationSimple Linear Regression
Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)
More informationDiscrete distribution. Fitting probability models to frequency data. Hypotheses for! 2 test. ! 2 Goodness-of-fit test
Discrete distribution Fitting probability models to frequency data A probability distribution describing a discrete numerical random variable For example,! Number of heads from 10 flips of a coin! Number
More informationChapter 7 Student Lecture Notes 7-1
Chapter 7 Student Lecture Notes 7- Chapter Goals QM353: Business Statistics Chapter 7 Multiple Regression Analysis and Model Building After completing this chapter, you should be able to: Explain model
More informationStatistics for Managers using Microsoft Excel 6 th Edition
Statistics for Managers using Microsoft Excel 6 th Edition Chapter 13 Simple Linear Regression 13-1 Learning Objectives In this chapter, you learn: How to use regression analysis to predict the value of
More informationChapter 16. Simple Linear Regression and Correlation
Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will
More informationSTAT 212 Business Statistics II 1
STAT 1 Business Statistics II 1 KING FAHD UNIVERSITY OF PETROLEUM & MINERALS DEPARTMENT OF MATHEMATICAL SCIENCES DHAHRAN, SAUDI ARABIA STAT 1: BUSINESS STATISTICS II Semester 091 Final Exam Thursday Feb
More informationNon-Parametric Two-Sample Analysis: The Mann-Whitney U Test
Non-Parametric Two-Sample Analysis: The Mann-Whitney U Test When samples do not meet the assumption of normality parametric tests should not be used. To overcome this problem, non-parametric tests can
More informationTesting Linear Restrictions: cont.
Testing Linear Restrictions: cont. The F-statistic is closely connected with the R of the regression. In fact, if we are testing q linear restriction, can write the F-stastic as F = (R u R r)=q ( R u)=(n
More informationT-test: means of Spock's judge versus all other judges 1 12:10 Wednesday, January 5, judge1 N Mean Std Dev Std Err Minimum Maximum
T-test: means of Spock's judge versus all other judges 1 The TTEST Procedure Variable: pcwomen judge1 N Mean Std Dev Std Err Minimum Maximum OTHER 37 29.4919 7.4308 1.2216 16.5000 48.9000 SPOCKS 9 14.6222
More information22s:152 Applied Linear Regression. Take random samples from each of m populations.
22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each
More informationMultiple comparisons - subsequent inferences for two-way ANOVA
1 Multiple comparisons - subsequent inferences for two-way ANOVA the kinds of inferences to be made after the F tests of a two-way ANOVA depend on the results if none of the F tests lead to rejection of
More informationMultiple Regression. Inference for Multiple Regression and A Case Study. IPS Chapters 11.1 and W.H. Freeman and Company
Multiple Regression Inference for Multiple Regression and A Case Study IPS Chapters 11.1 and 11.2 2009 W.H. Freeman and Company Objectives (IPS Chapters 11.1 and 11.2) Multiple regression Data for multiple
More informationSection 3: Simple Linear Regression
Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction
More informationThe legacy of Sir Ronald A. Fisher. Fisher s three fundamental principles: local control, replication, and randomization.
1 Chapter 1: Research Design Principles The legacy of Sir Ronald A. Fisher. Fisher s three fundamental principles: local control, replication, and randomization. 2 Chapter 2: Completely Randomized Design
More informationLecture 14 Simple Linear Regression
Lecture 4 Simple Linear Regression Ordinary Least Squares (OLS) Consider the following simple linear regression model where, for each unit i, Y i is the dependent variable (response). X i is the independent
More informationOne-Way Analysis of Variance: A Guide to Testing Differences Between Multiple Groups
One-Way Analysis of Variance: A Guide to Testing Differences Between Multiple Groups In analysis of variance, the main research question is whether the sample means are from different populations. The
More informationFactorial designs. Experiments
Chapter 5: Factorial designs Petter Mostad mostad@chalmers.se Experiments Actively making changes and observing the result, to find causal relationships. Many types of experimental plans Measuring response
More information22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA
22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each
More informationSIMPLE REGRESSION ANALYSIS. Business Statistics
SIMPLE REGRESSION ANALYSIS Business Statistics CONTENTS Ordinary least squares (recap for some) Statistical formulation of the regression model Assessing the regression model Testing the regression coefficients
More informationLecture 9 SLR in Matrix Form
Lecture 9 SLR in Matrix Form STAT 51 Spring 011 Background Reading KNNL: Chapter 5 9-1 Topic Overview Matrix Equations for SLR Don t focus so much on the matrix arithmetic as on the form of the equations.
More informationSTA 2101/442 Assignment Four 1
STA 2101/442 Assignment Four 1 One version of the general linear model with fixed effects is y = Xβ + ɛ, where X is an n p matrix of known constants with n > p and the columns of X linearly independent.
More informationST430 Exam 2 Solutions
ST430 Exam 2 Solutions Date: November 9, 2015 Name: Guideline: You may use one-page (front and back of a standard A4 paper) of notes. No laptop or textbook are permitted but you may use a calculator. Giving
More informationCHAPTER EIGHT Linear Regression
7 CHAPTER EIGHT Linear Regression 8. Scatter Diagram Example 8. A chemical engineer is investigating the effect of process operating temperature ( x ) on product yield ( y ). The study results in the following
More informationCorrelation and the Analysis of Variance Approach to Simple Linear Regression
Correlation and the Analysis of Variance Approach to Simple Linear Regression Biometry 755 Spring 2009 Correlation and the Analysis of Variance Approach to Simple Linear Regression p. 1/35 Correlation
More informationSTA441: Spring Multiple Regression. This slide show is a free open source document. See the last slide for copyright information.
STA441: Spring 2018 Multiple Regression This slide show is a free open source document. See the last slide for copyright information. 1 Least Squares Plane 2 Statistical MODEL There are p-1 explanatory
More informationCategorical Predictor Variables
Categorical Predictor Variables We often wish to use categorical (or qualitative) variables as covariates in a regression model. For binary variables (taking on only 2 values, e.g. sex), it is relatively
More informationY i = η + ɛ i, i = 1,...,n.
Nonparametric tests If data do not come from a normal population (and if the sample is not large), we cannot use a t-test. One useful approach to creating test statistics is through the use of rank statistics.
More informationLinear models and their mathematical foundations: Simple linear regression
Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction
More informationSTK4900/ Lecture 3. Program
STK4900/9900 - Lecture 3 Program 1. Multiple regression: Data structure and basic questions 2. The multiple linear regression model 3. Categorical predictors 4. Planned experiments and observational studies
More informationAnalysis of Variance (ANOVA)
Analysis of Variance ANOVA) Compare several means Radu Trîmbiţaş 1 Analysis of Variance for a One-Way Layout 1.1 One-way ANOVA Analysis of Variance for a One-Way Layout procedure for one-way layout Suppose
More informationSimple and Multiple Linear Regression
Sta. 113 Chapter 12 and 13 of Devore March 12, 2010 Table of contents 1 Simple Linear Regression 2 Model Simple Linear Regression A simple linear regression model is given by Y = β 0 + β 1 x + ɛ where
More informationMath 423/533: The Main Theoretical Topics
Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)
More informationEconometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018
Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate
More informationTopic 17 - Single Factor Analysis of Variance. Outline. One-way ANOVA. The Data / Notation. One way ANOVA Cell means model Factor effects model
Topic 17 - Single Factor Analysis of Variance - Fall 2013 One way ANOVA Cell means model Factor effects model Outline Topic 17 2 One-way ANOVA Response variable Y is continuous Explanatory variable is
More informationGlossary. The ISI glossary of statistical terms provides definitions in a number of different languages:
Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the
More information17: INFERENCE FOR MULTIPLE REGRESSION. Inference for Individual Regression Coefficients
17: INFERENCE FOR MULTIPLE REGRESSION Inference for Individual Regression Coefficients The results of this section require the assumption that the errors u are normally distributed. Let c i ij denote the
More information