GEOGRAPHICALLY WEIGHTED REGRESSION
Multiple Hypothesis Testing: Setting critical values for local pseudo-t statistics
stewart.fotheringham@nuim.ie
http://ncg.nuim.ie
ncg.nuim.ie/gwr/

Outline
- Significance tests in regression
- Parameter significance in GWR
- Multiple testing procedures
- Critical value adjustment in GWR
- Excel and R functions
- Example: Georgia data

Inference in regression
- It is common in regression to test whether each parameter is zero.
- A variable whose parameter is zero does not contribute to the variation in the model.
- The t statistic is commonly reported on regression output for the null hypothesis that β_n = 0.
- If the reported value of t for a parameter is greater than +1.96 or less than -1.96 (the normal approximation, i.e. a Wald test at the 5% level), we reject the null hypothesis for that parameter.
- The parameter estimate is then described as significant.

Critical values
- Note that in this context, significant does not mean important, merely not zero.
- Note as well that the test statistic follows a t distribution.
- However, if the sample is large enough, you can use the percentage points of the normal distribution as critical values.
- If you use the 5% significance level (two-sided), the critical values for various degrees of freedom are:
  t(α=0.05, df=30) = 2.042
  t(α=0.05, df=100) = 1.984
  t(α=0.05, df=500) = 1.965
  t(α=0.05, df=1000) = 1.962
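The large-sample limit of these critical values can be checked with a few lines of Python. This is only a sketch of the normal approximation: the standard library has no t distribution, so the df-specific values listed above are not reproduced here, and the function name `normal_critical_value` is an illustrative choice rather than anything from the original slides.

```python
from statistics import NormalDist


def normal_critical_value(alpha: float) -> float:
    """Two-sided critical value from the standard normal distribution.

    Half of alpha goes in each tail, so we look up the 1 - alpha/2 quantile.
    """
    return NormalDist().inv_cdf(1 - alpha / 2)


# Critical values for the 5% and 1% significance levels
z_05 = normal_critical_value(0.05)
z_01 = normal_critical_value(0.01)

print(round(z_05, 2), round(z_01, 2))
```

The two rounded values are the familiar 1.96 and 2.58 that the t critical values approach as the degrees of freedom grow.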

What about GWR?
- If we map the parameter estimates we see some apparent variation.
- Sometimes most of the parameter estimates will be positive but a small handful are negative.
- This suggests that the variable's influence is positive virtually everywhere, but negative in some places (or vice versa).

Local pseudo-t statistics
- We can divide the local parameter estimates by their associated local standard errors to obtain local pseudo-t statistics.
- There will be one t for each of the m+1 parameters (m variables plus the intercept, which has local t statistics as well) at each of the n regression points.

Critical value
- This gives us n * (m+1) simultaneous tests.
- So can we use the percentage points of the normal distribution as the critical values? NO!
- If we carried out 100 independent statistical tests at the 5% significance level, we would expect about 5 of them to be significant by chance.
- If n = 1000 and m = 4, then we will be carrying out 5000 tests, and about 250 would be significant by chance.
- So the chance of at least one false positive across the whole family of tests is far higher than the 5% of an individual test.

Adjusting the critical value
- This means that we have to increase the critical value for a given α level to account for this.
- This means changing the test-wise error rate to maintain the family-wise error rate at α.
- So you can't just use 1.96 or 2.58.

Adjustments
- Our problem is that we have a large number of dependent tests (generally they are positively dependent).
- There are a number of different ways of adjusting the critical value.
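The arithmetic behind the multiple testing problem can be sketched as follows. The family-wise error rate formula used here assumes independent tests, which is a simplifying assumption made only for illustration (GWR tests are dependent); the function names are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests significant by chance when every null is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float) -> float:
    """Probability of at least one false positive, ASSUMING independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Example from the text: n regression points, m variables plus an intercept
n, m = 1000, 4
n_tests = n * (m + 1)

expected = expected_false_positives(n_tests, 0.05)
fwer_100 = familywise_error_rate(100, 0.05)
fwer_all = familywise_error_rate(n_tests, 0.05)

print(n_tests, round(expected), round(fwer_100, 3))
```

Even with only 100 independent tests, the chance of at least one false positive is already above 99%, which is why the test-wise level has to be tightened.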

Bonferroni adjustment
- A commonly encountered adjustment in multiple testing is the Bonferroni adjustment.
- The adjusted test level is: β = α / n, where n here is the number of tests.
- With 5000 tests and α = 0.05, the adjusted level β is 0.00001.
- So, instead of 1.96, you use 4.44.

Bonferroni
- The effect of this is to make the test more stringent.
- The Bonferroni adjustment remains valid when the tests are dependent.
- But it is widely regarded as being far too conservative; that is, it declares too few tests significant.

Yoav Benjamini
- Benjamini has long been interested in the multiple testing problem.
- He is associated with the concept of the False Discovery Rate.
- His work with Daniel Yekutieli has dealt with adjustment where the tests are dependent.

Adjustment procedures
- The Benjamini-Yekutieli adjustment, and several similar procedures, are complex to operationalise.
- We have discovered that there is a Bonferroni-like adjustment which gives results very similar to the B-Y test.
- Byrne et al., 2009, "Multiple Dependent Hypothesis Tests in Geographically Weighted Regression", presented at Geocomputation 2009 in Sydney.

Fotheringham adjustment
- Stewart made the observation that the adjustment would be proportional to the number of degrees of freedom in the GWR model:

\[ \beta = \frac{\alpha}{1 + p_e - \dfrac{p_e}{np}} \]
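Returning to the Bonferroni arithmetic, the adjusted level for α = 0.05 and 5000 tests can be checked with a short sketch. Because the Python standard library has no t distribution, the critical value below comes from the normal approximation and so sits slightly under the t-based value quoted on the slide; the function names are illustrative assumptions.

```python
from statistics import NormalDist


def bonferroni_level(alpha: float, n_tests: int) -> float:
    """Test-wise level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def normal_critical_value(level: float) -> float:
    """Two-sided critical value under the normal approximation."""
    return NormalDist().inv_cdf(1 - level / 2)


beta = bonferroni_level(0.05, 5000)
z = normal_critical_value(beta)

print(beta, round(z, 2))
```

The adjusted level works out to 0.05 / 5000 = 0.00001, and the corresponding critical value is more than twice the unadjusted 1.96.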

Procedure

\[ \beta = \frac{\alpha}{1 + p_e - \dfrac{p_e}{np}} \]

- β is the adjusted test level.
- α is the unadjusted test level (frequently 0.05).
- p_e is the effective number of parameters in the model.
- n is the number of regression points.
- p is the number of parameters to be estimated.
- You can compute the critical value by using the TINV function in Excel: TINV(β, n - p_e).
- The resulting value will lie between 1.96 and the over-conservative value from the Bonferroni adjustment.

General Procedure
- Compute the adjusted critical value, t(β, n - p_e), using the TINV function in Excel (or qt() in R, or whatever).
- Map the parameter estimates for a variable of interest.
- Select those locations where the absolute value of the associated local t value is in excess of the adjusted critical value.
- Create a layer from this selection (these are the locations where the parameter estimate is significant) and map this layer.

Excel and R
- In Excel, TINV(0.05,1000) returns 1.962339.
- In R, qt(1-(0.05/2),1000) returns 1.962339.

Georgia
- As an example, we'll fit global and local models to the Georgia data.
- PctBach is the dependent variable.
- PctEld, PctFB, PctPov and PctBlack are the 4 independent variables in this model.

Global model

Parameter  Estimate (B)     Std Err         T                p(b=0)
---------  ---------------  --------------  ---------------  ------
Intercept  12.671125639718  1.636823331703   7.741291046143  0.0000
PctEld     -0.105305563762  0.137837644439  -0.763982594013  0.4461
PctFB       2.545211464154  0.298392136976   8.529753684998  0.0000
PctPov     -0.282906604601  0.078104575505  -3.622151374817  0.0004
PctBlack    0.076847513749  0.028028073689   2.741805076599  0.0068

- In the global model, the t statistics for the null hypotheses suggest that the variation in every variable except PctEld contributes to the variation in the model. These variables are significant.
- The critical value of T with α=0.05 and 159-5 degrees of freedom is 1.98 (so the normal approximation of 1.96 is reasonable).
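The global model result can be reproduced by dividing each estimate by its standard error and comparing the result with the critical value. This is a minimal sketch: the estimates and standard errors are copied from the global model output above, the 1.98 critical value is taken from the text rather than computed, and the names `global_model` and `t_statistic` are illustrative.

```python
# (estimate, standard error) pairs copied from the global model output
global_model = {
    "Intercept": (12.671125639718, 1.636823331703),
    "PctEld": (-0.105305563762, 0.137837644439),
    "PctFB": (2.545211464154, 0.298392136976),
    "PctPov": (-0.282906604601, 0.078104575505),
    "PctBlack": (0.076847513749, 0.028028073689),
}

# Critical value quoted in the text for alpha = 0.05 and 159 - 5 degrees of freedom
CRITICAL_VALUE = 1.98


def t_statistic(estimate: float, std_err: float) -> float:
    """t statistic for the null hypothesis that the parameter is zero."""
    return estimate / std_err


# Keep the parameters whose |t| exceeds the critical value
significant = [
    name
    for name, (estimate, std_err) in global_model.items()
    if abs(t_statistic(estimate, std_err)) > CRITICAL_VALUE
]

print(significant)
```

Only PctEld falls inside the ±1.98 band, which matches the p(b=0) column.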

GWR
- There are n=159 regression points (centroids of the counties).
- There are p=5 parameters estimated at each regression point.

Local model
- The effective number of parameters (p_e) in the local model is 12.359.
- The test-wise level to keep the family-wise error rate, α, at 0.05 will be 0.00375, which yields a critical value of 2.95.
- TINV(0.00375,159-12.359) => 2.95

Label     Minimum    Lwr Quartile  Median     Upr Quartile  Maximum
--------  ---------  ------------  ---------  ------------  ---------
Intrcept  10.056587  11.026283     12.064082  13.257461     15.077528
PctEld    -0.455205  -0.286661     -0.203335  -0.116267      0.053857
PctFB      0.963658   1.594383      3.020416   3.629860      3.771199
PctPov    -0.309038  -0.264154     -0.162981  -0.078514      0.027344
PctBlack  -0.029594   0.008555      0.034712   0.089672      0.134230

- PctPov seems to change sign: mostly it's negative, but there are some counties where the relationship appears, counterintuitively, to be positive.
- This is variable 4: we plot PARM_4 and select those areas where abs(tval_4) is greater than 2.95.

Selection/Select by Attributes
- To highlight the significant counties, select those where the absolute value of the T statistic is greater than 2.946.

PctPov parameter estimates
- The selected counties are highlighted: this variable is only significant in the southernmost part of the state.
- The values around the sign change are too small to be significant.

End
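The Georgia numbers can be checked with a short sketch. It assumes the adjustment has the form β = α / (1 + p_e - p_e / (n p)), which reproduces the 0.00375 quoted above. The county names and local t values are hypothetical and serve only to illustrate the selection step, and the 2.95 critical value is taken from the text rather than computed, since the Python standard library has no t distribution.

```python
def adjusted_level(alpha: float, p_e: float, n: int, p: int) -> float:
    """Adjusted test-wise level, assuming beta = alpha / (1 + p_e - p_e / (n * p))."""
    return alpha / (1 + p_e - p_e / (n * p))


# Georgia example: 159 regression points, 5 parameters, 12.359 effective parameters
beta = adjusted_level(alpha=0.05, p_e=12.359, n=159, p=5)

# Bonferroni level over all n * p tests, for comparison
bonferroni = 0.05 / (159 * 5)

# Critical value quoted in the text: TINV(0.00375, 159 - 12.359)
CRITICAL_VALUE = 2.95

# Hypothetical local pseudo-t values for PctPov (not from the Georgia output)
local_t = {"county_a": -3.4, "county_b": -1.2, "county_c": 0.4}

# Keep the locations whose |t| exceeds the adjusted critical value
selected = [name for name, t in local_t.items() if abs(t) > CRITICAL_VALUE]

print(round(beta, 5), selected)
```

The adjusted level sits between the unadjusted 0.05 and the much smaller Bonferroni level, so the resulting critical value is stricter than 1.96 without being as severe as the Bonferroni one.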