Notes on the use of the Goldfeld-Quandt test for heteroscedasticity in environment research

Size: px

Start display at page:

Download "Notes on the use of the Goldfeld-Quandt test for heteroscedasticity in environment research"

Griffin Taylor
6 years ago
Views:

1 Biometrical Letters Vol. 45(008), No., Notes o the use of the Goldfeld-Quadt test for heteroscedasticity i eviromet research Aa Budka, Dauta Kachlicka, Maria Kozłowska Departmet of Mathematical ad Statistical Methods, Poza Uiversity of Life Scieces, Wojska Polskiego 8, Pozań, Polad budka@up.poza.pl SUMMARY Some methods of usig the Goldfeld-Quadt test are described. The use of classical statistics (mea, stadard deviatio) or positio statistics (media, average deviatio, media absolute deviatio) is proposed. The results ca be used to test the hypothesis that the residuals from a liear regressio are homoscedastic, whe the experimetal distributio of the cosidered variable is symmetrical or asymmetrical (right-sided skew or left-sided skew). Usig Mote Carlo method, properties of these modificatios of the Goldfeld-Quadt procedure are explored. A compariso of the effectiveess of the described methods for eviromet research is preseted. Key words: Goldfeld-Quadt test, heteroscedasticity, liear regressio, positio statistics. Itroductio The pure scieces have wide applicatio to experimetatio i the field of evirometal egieerig. Statistics ad ecoometrics offer quatitative methods which are a importat tool i the aalysis of results of eviromet research. These iclude research o the cotet of the air, air quality beig a importat issue. High air quality meas that the level of pollutio of the air is low. Pollutio may come from sources which are either atural or athropogeic (caused by huma activity). Natural pollutio is produced maily by forest fires or the decompositio of livig orgaisms, while athropogeic sources iclude trasport, fuel burig, idustrial processes ad others. The quality compositio of the air is variable, ad what is worse there arise secodary pollutats, which are ofte more toxic ad harmful tha the primary oes. The course of these processes depeds o may factors, icludig the spatial distributio of cocetratios. Aother importat aspect is the asphalt

2 44 A. Budka, D. Kachlicka, M. Kozłowska effect. Whe we cover the soil surface with a impermeable layer of cocrete or asphalt, the damp from the groud caot pass to the air. The air becomes dry ad susceptible to pollutio. Therefore a fudametal criterio to cosider whe buyig a plot of buildig lad is its positio. Research ito the effect o lad prices of, amog other thigs, the distace of the plot from the earest highway ad airport requires certai coditios to be checked, which we do usig the Goldfeld- Quadt test. The object of this paper is to preset certai modificatios of the use of the Goldfeld-Quadt test for ivestigatig the presece of heteroscedasticity.. Framework We cosider theliear model y = Xβ + ε, () where y is a x vector of observatios o the depedet variable, X=[, x,, x k- ] is a x k matrix of observatios o k- idepedet variables, β is a k x vector of ukow parameters, ad ε is a x radom vector of errors, where is the vector of oes. We assume that errors are ormally distributed ad the value of the error term correspodig to a observed value of the depedet variable is statistically idepedet of the value of the error term correspodig to ay other observed value of the variable. The error is the distace betwee the observed value of the depedet variable ad its expected value. This expectatio is labelled E(y) ad E(y) = Xβ, because we assume that E(ε) = 0, where 0 is the ull vector. I order for the estimates of the parameters ot to be biased (or best liear ubiased estimates) oe more assumptio should hold: the variace should be costat. The variace matrix is labelled D(ε) ad D(ε) = I, where I is the idetity matrix. If we observe departures from these assumptios the the estimators of the parameters are biased ad cosequetly they are ot best or other drawbacks may to be. Hece cotrol of the model is a very importat part of the regressio aalysis. It is well kow that i the presece of heteroscedasticity of error variaces D(ε)=diag(,,..., ), the least squares method has two major drawbacks:

3 Notes o the use of the Goldfeld-Quadt test for heteroscedasticity 45 iefficiet parameter estimates ad biased variace estimates which make stadard hypothesis tests iappropriate. Kow tests for heteroscedasticity based o error aalysis iclude, for example, the most popular Goldfeld-Quadt test, the Breusch-Paga test ad the White test, described respectively by Goldfeld ad Quadt (965), Breusch ad Paga (979), White (980). The literature o testig for heteroscedasticity icludes may more tests (see Dufour et al., 004). The test for heteroscedasticity i regressio models based o the Goldfeld Quadt methodology defied by Carapeto ad Holt (003) also deserves attetio. Most of these tests are what Goldfeld ad Quadt call o-costructive tests, i that they ca be used to determie the presece or absece of heteroscedasticity, but reveal othig about the form of the variace structure (Buse, 984). We are iterested i testig for heteroscedasticity i situatios with kow deflator. Our ull hypothesis is H 0 : i = for i=,,,, ad the alterative hypothesis is H : ~H 0 (the symbol ~ deotes egatio). The Goldfeld-Quadt test is proposed by, for example, Goldfeld ad Quadt (965), Buse (984) or Maddala (006) for verificatio of the ull hypothesis. The test proposed by Goldfeld-Quadt (965) is carried out i four steps. I the first step we must sort the multivariate variable with respect to the choice of oe idepedet variable, for example x d, d {l: l=,,,k-}. This variable is a potetial deflator. Next, i the secod step, we ca discard oe or more cetral observatios. I the third step, we fit separate regressio aalyses to each of two remaiig sets of observatios. I the last step, we form the statistic, which has F-distributio. The statistic is the quotiet of the mea squares for error from the separate regressios. Let x d x d x d, where the secod subscript deotes the umber of observatios of the variable x d. We obtai a set of observatios of the multivariate variable i the ew orderig. Now we cosider model () i the form y y y 3 X = X X 3 ε + β ε ε 3, ()

4 46 A. Budka, D. Kachlicka, M. Kozłowska where for j=,, 3, y j ad ε j are j x vectors, X j is a j x k dimesio matrix. Goldfeld ad Quadt (965) state that the vector y icludes some cetral observatios after they are sorted with respect to the chose potetial deflator. The choice of the x dimesio vector y is the subject of cosideratio i the ext sectio. The choice of the vector y follows as a cosequece of the choice of cetral observatios of the variable x d. Goldfeld ad Quadt (965) state that the dimesio of vectors y ad y 3 is the same, but other authors, for example Thursby (98), Dufour et al. (004), take ito accout varyig dimesios of these vectors. Goldfeld ad Quadt (965) proposed usig the test for heteroscedasticity after the removal of some umber of cetral observatios. They did ot specify how may observatios should be removed. They gave the relative frequecy of cases i which the false hypothesis is rejected for samples of dimesio =30 ad =60 after omittig 0, 4, 8, or 6 cetral observatios, ad they estimated the power of the test. They obtaied the largest frequecy for =30 ad =60 after the omissio, respectively, of 8 ad 6 cetral observatios (equal to 6.7%). Buse (984) aalysed the problem for =0, 40, 80 removig 0% of cetral observatios. The same dimesio of the removed set of observatios was used by Dufour et al. (004) for =50 ad =00. Maddala (006) suggests the removal of cetral observatios to icrease the power of the test, but he does ot aswer the questio of how may observatios to remove. 3. Results We must quote here two seteces formulated by Goldfeld ad Quadt (965): (i) The power of this test will clearly deped upo the value of, the umber of omitted observatios; for every large value of the power will be small but it is ot obvious that the power icreases mootoically as teds to 0, (ii) The power of the test will clearly deped o the ature of the sample of values for the variable which is the deflator. Thus, if the variace of x d is small relative to the mea of x d the power ca be expected to be small ad coversely. We believe that the umber of omitted observatios depeds o the precisio of the measuremets carried out ad o their distributio.

5 Notes o the use of the Goldfeld-Quadt test for heteroscedasticity 47 The distributio of the idepedet variable called the deflator may be defied by a cumulative probability fuctio labelled F(x d ). If for a value x d(q), where q (0;), the cumulative probability fuctio F(x d(q) ) is equal to q, the x d(q) is called a quatile of order q. Lower quartile, media ad upper quartile are measures of locatio ad are labelled x d(/4), x d(/) ad x d(3/4) respectively. Let us recall that may authors suggest removig 0% of observatios; these are values of the variable x d withi the iterval (x d(/5) ; x d(3/5) ). Whe for a set of several measuremets the mea has bee calculated, the stadard deviatio ca be calculated for the set too. The stadard deviatio tells us how repeatable the placemet of the measure is, ad how much this cotributes to the ucertaity of the mea value. From these, the estimated stadard deviatio of the mea (the stadard error) may be calculated. The stadard deviatio of the mea has also bee called a stadard ucertaity. The stadard ucertaity is a margi whose size ca be thought of as plus or mius oe stadard error. We propose usig classical statistics or positio statistics with the aim of defiig the set of removed observatios. 3.. Method based o classical statistics Propositio. Whe the variable x d has symmetrical distributio, the mea is the cetral poit of the distributio. We assume that stadard error is the measure of deviatio from the mea. From the set of observatios of variable x d, such that x d x d x d, we propose to remove all observatios belogig to the iterval defied by the mea of the variable plus or mius the stadard error, that is ˆ ˆ x di ( µ ˆ ; µ ˆ + ), (3) where µˆ ad ˆ are estimates of the mea ad stadard deviatio of the variable x d respectively. Let us otice that, for the variable x d N(µ;), the orders of the quatile x (q ) = µ ad = µ + have the forms: d x d(q )

6 48 A. Budka, D. Kachlicka, M. Kozłowska q q = F( µ = F = F 0; ( = F( µ + 0; ) = P(x ) = ) = ( ) = d π x P( π d < µ e e µ < ) = xd xd dx x P( dx ) = d µ < ) =, Hece, we have q q = F0; (/ ). I the case whe =6 the differece of the orders is equal to /5, that is q q = F0; (0.5) = / 5. Let us otice that for icreasig sample size, we have lim ( q q) = lim (F0; ( ) ) = 0. Whe x d has the ormal distributio, the legth of the iterval (3) depeds o the size of the sample. I the case we will remove for example 0% of observatios whe =6 or 0% whe = Methods based o positio statistics Whe the variable x d has asymmetrical distributio the the media med(x d ) is the cetral poit of the distributio. The media is i the middle of the sorted data (x d x d x d ). I a similar way we ca defie the first quartile to be /4 of the way through the sorted data (x d(/4) ), ad the third quartile to be 3/4 of the way through the sorted data (x d(3/4) ). For ivestigatio of the variability of the media, a positio statistic called media absolute deviatio is used. The media absolute deviatio (MAD) is defied as the media of the absolute value of the differece betwee a variable ad its media, that is MAD(x d ) = med x d med(x d ). For ormally distributed x d N(µ;), the MAD is give by: MAD(x d )= med (x d µ)/ =0.67

7 Notes o the use of the Goldfeld-Quadt test for heteroscedasticity 49 Let us otice that µ MAD(x d )=med(x d ) MAD(x d )=x d(/4), µ+mad(x d )= =med(x d )+MAD(x d )= x d(3/4) ad F(µ+0.5 MAD(x d )) F(µ 0.5 MAD(x d ))=0.5 because x d µ F( µ / ) F( µ 0.67 / ) = P( < 0.67 / ) = F (0.67 / ) = 0.5 0; We propose takig MAD(x d ) as the upper limitatio of the legth of the iterval icludig omitted observatios of the variable x d. We may ivestigate the symmetry of distributio of the variable x d by calculatig quartiles. Whe the media is equal to half of the iter-quartile rage, that is med(x d )=½ (x d(3/4) x d(/4) ), the the distributio is symmetrical. For right-sided skew distributio we have med(x d )<½ (x d(3/4) x d(/4) ), ad for left-sided skew distributio med(x d )>½ (x d(3/4) x d(/4) ). For variable x d havig asymmetrical distributio we propose some methods for defiig the set of omitted observatios. Propositio. Adoptig the asymptotic approach for buildig the cofidece iterval for the media applyig stadard error (see Dawid, 98) we propose to defie the set of removed observatios by the iterval media plus or mius stadard error. Hece we propose removig the followig observatios : ˆ ˆ x di ( med(x d ) ; med(x d ) + ). (4) Propositio 3. Aother measure of variatio is mea absolute deviatio. The mea absolute deviatio (MeaAD) is defied as follows: MeaAD (xd ) = xdi med(x d ). (5) i The mea absolute deviatio takes accout of values from zero to half of the rage of observatios. Hece we believe that we may remove observatios belogig to the iterval media plus or mius the mea absolute deviatio divided by the square root of the umber of observatios, amely MeaAD(x d ) MeaAD(x d ) x di ( med(xd ) ;med(x d ) + ). (6) =.

8 50 A. Budka, D. Kachlicka, M. Kozłowska Propositio 4. I experimets there is o such thig as a perfect measuremet. Each measuremet cotais a degree of ucertaity due to the limits of istrumets ad the people usig them. The ucertaity idex for sample size has the form.859 MAD(x d ) /sqrt(), where sqrt() deotes the square root of. The rage of omitted values of x d ca be defied as the media plus or mius the value of the ucertaity idex. We propose removig the followig cetral observatios:.859 MAD(xd ).859 MAD(xd ) x di ( med(xd ) ;med(xd ) + ) (7) Above we have give four propositios for defiig the set of cetral values of observatios of the idepedet variable x d called a deflator. These propositios cocer the use of measures of variatio such as the stadard error, the mea absolute deviatio ad the media absolute deviatio. The rage of removed observatios is defied as a iterval whose cetral poit is the mea or media of the variable x d. We will check the usefuless of the above propositios o geerated data. 4. Mote Carlo simulatio I order to compare the effectiveess of the described modificatios of the Goldfeld-Quadt procedure we performed a Mote Carlo study usig SEPATH from Statistica 7.. We cosidered sample size =00 ad we used the sigle regressor model as applied for example i Goldfeld ad Quadt (965), Griffiths ad Surekha (986), Carapeto ad Holt (003). However we performed our study i a differet way tha these authors. I the first step, values y i ad x i for i=,,,00, were geerated from a ormal distributio N(0;), such that these values (y i ;x i ) had highly sigificat correlatio. I the secod step we sorted (y i ;x i ) with respect to x i. I the third step we omitted some cetral observatios. We used oe of six methods of defiig the set. I the ext step we calculated the Goldfeld-Quadt statistics. Nie hudred replicatios were geerated. I the last step we calculated the p-value for the Goldfeld-Quadt test ad the cumulative frequecy of cases i which the ull

9 Notes o the use of the Goldfeld-Quadt test for heteroscedasticity 5 hypothesis is ot rejected. The power of the Goldfeld-Quadt test was calculated usig Statistica 7. for first five hudred replicatios. We cocetrate ow o comparig differet methods for omittig cetral observatios. Two stadard methods are cosidered here; the first method was described by Goldfeld ad Quadt (965) ad the secod by, for example, Buse (984). The first method ivolves use of the test without removig cetral observatios, ad i the secod method twety percet of cetral observatios are omitted. These two methods are compared with our four propositios. Table displays the experimetally calculated mea p-value, cumulative frequecy of cases i which the ull hypothesis is ot rejected o sigificace level 0.05 or 0.0 or 0.00 ad mea power of the Goldfeld-Quadt test tabulated by six methods. Table. Mea p-value, cumulative frequecy of cases i which ull hypothesis is ot rejected (α=0.05, 0.0, 0.00) mea power of the Goldfield-Quadt test for Mote-Carlo experimets ad umber of removed observatios (ro) Set of omitted observatios p-value frequecy power Nro Null % cetral observatios mea ± stadard error media ± stadard error media ± MeaAD/sqrt() media ±.859 MAD/sqrt() The magitude of the differeces i the p-values of our propositios compared with the two methods from earlier papers is small (<%). We obtaied the smallest p-value for the iterval mea plus or mius stadard error. The power give i Table clearly idicates that the power of the Goldfeld-Quadt test deped upo the umber of omitted observatios but it is ot true that power icreases mootoically as this umber teds to ull. This cosideratio shows that the stadard method recommeded by may authors (0% omitted cetral observatios) gives the same result as our methods. Whe we make measuremets, we have o way of kowig how accurate the values are. The use of variatio measures such as stadard error, media absolute deviatio or stadard ucertaity make it possible to avoid takig a overoptimistic view of reality. By performig 'truthful' aalysis, we

10 5 A. Budka, D. Kachlicka, M. Kozłowska ca refie the Goldfeld-Quadt procedure. We ucover ad take advatage the major sources of the ucertaity. I doig so, we use iformatio about measuremets. 5. Research problem Regressio models i eviromet research, ad particularly the estimated stadard errors, rely upo the assumptio that the residuals are idepedetly ad idetically distributed. Two commo violatios of this assumptio are serial correlatio ad heteroscedasticity. We assume that serial correlatio is ot a sigificat issue. However heteroscedasticity, which refers to o-costat error variace, is a commo problem i eviromet regressio models. The preset aalysis is based o the results of research described i example 4.7 by Maddala (006). I this example six variables were described. The price of lad per acre is here called the depedet variable labelled y, the percetage afforested area is labelled x, the distace of the lad parcel from the airport is labelled x, the distace from the highway is labelled x 3, the area of the lad parcel is labelled x 4 ad the moth i which it was sold is labelled x 5. The depedet variable y ad two idepedet variables x ad x 3 were trasformed ad deoted respectively ly, lx ad lx 3. The idepedece variable lx 3 is called the deflator. Hece we ca write x d =lx 3. Table displays classical ad positio statistics tabulated by the six variables. The aalysis bega with a study of homoscedasticity for each idepedet variable separately ad the depedet variable (Table 3). We performed the Goldfeld-Quadt test usig six methods of defiig the set of omitted observatios, separately. The same aalyses were made for two groups of idepedet variables, i which we sorted these data with respect to x d =lx 3. We performed the Goldfeld-Quadt test for two groups to compare their results. We calculated the media absolute deviatio of the deflator MAD(x d )=MAD(lx 3 )=0.397 ad iter-quatile rage (x d(3/5) -x d(/5) )= Table 3 examies the p-value for these aalyses ad the umber of omitted observatios of the deflator lx 3 for two stadard methods ad our four propositios for defiig the set of omitted observatios. For the last two aalyses we also calculated the power of the test.

11 Notes o the use of the Goldfeld-Quadt test for heteroscedasticity 53 Table. Classical statistics ad positio statistics for the variables cosidered i example 4.7 by Maddala (006) ly x lx lx 3 x 4 x 5 Mea Stadard deviatio Stadard error Media Miimum observatio Maksimum observatio Before we cosider the ifluece of a defied of set of omitted observatios o the p-value ad power of the Goldfeld-Quadt test, it is ecessary to make a short digressio o the relatio betwee the magitude of the sets cosidered. I the example the greatest umber of observatios belogs to the secod stadard set (0% =3), but the smallest p-value ad the highest power is for the propositios mea plus or mius stadard deviatio ad media plus or mius stadard deviatio. I the respective sets we have 9 observatios i each. For the ext two propositios, based o mea absolute deviatio or media absolute deviatio, we obtaied worse results. 6. Coclusio To coclude our cosideratios we offer four methods for buildig the set of omitted observatios. O the basis of the Mote Carlo simulatio we geerally prefer first propositio. Propositio is preferred whe the deflator has symmetrical distributio, propositio two whe the distributio is asymmetrical. For the cosidered research problem the results of each applicatio of the Goldfeld-Quadt test are differet. I may cases our propositios are better tha the secod stadard method cosidered here. Methods based o measuremet of the deviatio of the deflator are better tha the removal of 0% cetral observatios without regard to their distributio. We should reiterate that the use of variatio measures such as stadard error, media absolute deviatio or stadard ucertaity makes it possible to avoid takig a overoptimistic view of reality.

12 54 A. Budka, D. Kachlicka, M. Kozłowska Table 3. The p-value for the Goldfeld-Quadt test for simple liear regressio ad two multivariate liear regressio with xd=lx3 (ro umber of removed observatios of variable lx3) Set of omitted observatios lx lx3 x4 x5 ro x; lx; lx3; x4; x5 lx; lx3 p-value p-value p-value p-value p-value power p-value power ull % cetral observatios mea ± stadard error media ± stadard error media ± MeaAD/sqrt() media ±.859 MAD/sqrt()

13 Notes o the use of the Goldfeld-Quadt test for heteroscedasticity 55 REFERENCE Breusch T., Paga A. (979): A Simple Test for Heteroscedasticity ad Radom Coefficiet Variatio. Ecoometrica 47: Buse A. (984): Tests for additive heteroskedasticity Goldfeld ad Quadt revisited. Empirical Ecoomics 9: Carapeto M., Holt W. (003): Testig for heteroscedasticity i regressio models. Joural of Applied Statistics 30(): 3-0. Dawid H.A. (98): Order Statistics. New York, Wiley. Dufour J.-M., Khalaf L., Berard J.-T., Geest I. (004): Simulatio-based fiitesample tests for heteroskedasticity ad ARCH effects. Joural of Ecoometrics : Griffiths W.E., Surekha K. (986): A Mote Carlo evaluatio of the power of some tests for heteroscedasticity. Joural of Ecoometrics 3: 9-3. Goldfeld S.M., Quadt R.E. (965): Some tests for homoscedasticity. America Statistical Associatio Joural 60: Maddala G.S. (006): Ekoometria. PWN, Warszawa Thursby G. (98): Misspecificatio, heteroscedasticity, ad the chow ad Goldfeld- Quadt tests. The Review of Ecoomics ad Statistics 64(): White H. (980): A Heteroscedasticity-Cosistet Covariace Matrix Estimator ad a Direct Test for Heteroscedasticity. Ecoometrica 48:

Properties and Hypothesis Testing

Properties and Hypothesis Testing Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.