Models for Count and Binary Data. Poisson and Logistic GWR Models. 24/07/2008 GWR Workshop 1

Outline
I: Modelling counts: Poisson regression
II: Modelling binary events: logistic regression
III: Poisson regression in GWR: premature mortality in Tokyo
IV: Logistic regression in GWR: landslides in Clearwater National Forest, Idaho

Background
Standard GWR uses OLS (Ordinary Least Squares) methods. These are not always the best option. OLS assumes a Normal (Gaussian) error term.

Non-Normality (1)
Count data need a model form that cannot predict negative values: the Poisson model. Examples include cases of an illness, sightings of a rare animal, the number of crimes, and the number of earth tremors.

Non-Normality (2)
Dichotomous (binary; yes/no; 1/0) outcomes need a model form that predicts the probability that an observation is 1, and hence must give probabilities between 0 and 1: the logistic or binary logit model. Examples: Does an individual have a disease or not? Is a house detached or not? Was a crime committed at this location within the past week?

I: Models for Count Data: Poisson Regression

The Poisson Model
Lambda is the expected count of objects given the conditions at location i. The betas are regression coefficients. The x's are predictor variables. Note that lambda will always be greater than or equal to zero.
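The equation for this slide is not preserved in the text. As a sketch only, assuming the usual log-linear form (which agrees with the y = n e^{Xβ} offset form given later in these slides, and borrows the predictor count m from the logistic section), the model can be written as:

```latex
\lambda_i = \exp\left(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_m x_{mi}\right)
```

Because the exponential is never negative, a model of this form cannot predict a negative count.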

Offsets
Often we have a population at risk for count data, for example the population susceptible to a disease, or the number of households (for rates of household burglaries). For zone-based data, this quantity changes from zone to zone and we need to allow for this in our model. We do this by using an offset.

Adding the Offset
P_i is a variable (not a parameter to calibrate). It represents the population at risk. Calibration of the betas is by an iterative process, so it takes longer.
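To illustrate the iterative calibration, here is a minimal pure-Python sketch of a Poisson fit with an offset using iteratively reweighted least squares. It is not the GWR software's implementation: the function names, the starting values, the fixed iteration count, and the zone data are all assumptions made for illustration.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_poisson_with_offset(X, y, P, iterations=25):
    """Fit log(lambda_i) = log(P_i) + X_i . beta; each row of X starts with 1."""
    k = len(X[0])
    beta = [0.0] * k
    beta[0] = math.log(sum(y) / sum(P))  # start from the overall rate
    for _ in range(iterations):
        xtwx = [[0.0] * k for _ in range(k)]
        xtwz = [0.0] * k
        for xi, yi, pi in zip(X, y, P):
            linear = sum(b * v for b, v in zip(beta, xi))
            mu = pi * math.exp(linear)        # expected count, offset included
            z = linear + (yi - mu) / mu       # working response (offset removed)
            for r in range(k):
                xtwz[r] += mu * xi[r] * z     # weights are the current mu
                for c in range(k):
                    xtwx[r][c] += mu * xi[r] * xi[c]
        beta = solve_linear(xtwx, xtwz)
    return beta


# Made-up zone data (hypothetical, not the burglary data from the slides).
households = [1000.0, 2500.0, 1800.0, 4000.0, 3200.0]  # offset P_i
young_unemp = [2.0, 5.0, 3.5, 8.0, 6.0]                # single predictor
burglaries = [12, 41, 25, 95, 60]                      # counts y_i

design = [[1.0, x] for x in young_unemp]
beta = fit_poisson_with_offset(design, burglaries, households)
print("intercept, slope:", beta)
```

Each pass re-estimates the betas from weights that depend on the previous estimate, which is why calibration takes longer than a single least-squares solve.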

Example
Household burglary counts for 43 Police Forces in the UK (the y_i's).
Offset: number of households (P_i).
Predictors (the x_ij's): population density (persons per sq km) and unemployed males aged 16-24 as a % of the total population.

Results
              Estimate    Std. Error   z value   P-value
(Intercept)   -5.837      1.144e-02    -510.1    <0.001
Youngunemp    9.601e-02   5.930e-04    161.9     <0.001
Density       1.884e-04   1.067e-06    176.6     <0.001

Question
Why are the z-values so high? Sometimes this is due to variation in the counts being more than expected for a Poisson distribution. Maybe this is because the parameters vary over space?

Geographically Weighted Poisson Regression
The above is still a global model. Question: do the same relationships hold everywhere? Is the linkage between unemployed young males and burglary the same everywhere? We need to extend the previous model to a geographically weighted version.

Geographically Weighted Poisson Model
Here u_i and v_i are the coordinates of observation i.
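The model equation for this slide is not preserved in the text. A sketch, assuming it mirrors the geographically weighted logistic form given later in these slides (betas as functions of location) and includes the offset P_i:

```latex
\lambda_i = P_i \exp\left(\beta_0(u_i, v_i) + \beta_1(u_i, v_i)\, x_{1i} + \beta_2(u_i, v_i)\, x_{2i} + \dots\right)
```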

Results for Burglary Data

II: Models for Binary Data: Logistic Regression

Logistic Models
$$\Pr(y_i = 1) = p_i, \qquad \Pr(y_i = 0) = 1 - p_i$$
Here p_i is the mean of the distribution for a dichotomous y_i, where y_i is the dependent variable. We need to pay attention to p_i, which depends on the explanatory variables. How does p_i depend on (x_{1i}, x_{2i}, ..., x_{mi})?

The logistic model
$$p_i = \operatorname{logit}(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots)$$
where
$$\operatorname{logit}(z) = \frac{\exp(z)}{1 + \exp(z)}$$
(The function written here as logit(z) is more commonly called the logistic, or inverse-logit, function.)

Graph of Logit function
[Figure: logit(x), ranging from 0.0 to 1.0, plotted against x from -10 to 10]

Alternative form
$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots$$
where the left-hand side of the equation is the log odds for y_i = 1.

Interpreting the parameters
As with Poisson, parameters make more sense if you take antilogs. We can write
$$\frac{p_i}{1 - p_i} = \exp(\beta_0)\exp(\beta_1 x_{1i})\exp(\beta_2 x_{2i})\dots$$
Each exp(β_j) gives a multiplicative factor for the odds that y_i = 1 when the corresponding predictor increases by 1 unit. Note that the multiplicative factor is for the ODDS, not the PROBABILITY.

An Example: Housing in the UK
Dependent variable: does a house have more than 1 bathroom? This is a dichotomous (binary) value.
Independent variable: floor area (sq. m).

Results
            Estimate   Std. Error   z value   Pr(>|z|)
Intercept   -6.101     0.7214       -8.457    <0.001
FloorArea   0.0295     0.0050       5.884     <0.001

            Exp(beta)   as % Increase   Pr(>|z|)
FloorArea   1.0299      2.99            <0.001
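As a worked check on these results, the short sketch below recomputes Exp(beta) and the percentage increase from the FloorArea estimate, and then shows that the factor applies to the odds rather than to the probability. The floor areas of 100 and 101 sq. m are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

# Estimates copied from the results for the housing example.
INTERCEPT = -6.101
B_FLOOR_AREA = 0.0295


def prob_second_bathroom(floor_area_sqm):
    """Predicted probability that a house has more than one bathroom."""
    z = INTERCEPT + B_FLOOR_AREA * floor_area_sqm
    return math.exp(z) / (1.0 + math.exp(z))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


odds_factor = math.exp(B_FLOOR_AREA)            # Exp(beta)
percent_increase = 100.0 * (odds_factor - 1.0)  # "as % Increase"

p_100 = prob_second_bathroom(100.0)  # illustrative floor areas
p_101 = prob_second_bathroom(101.0)
odds_ratio = odds(p_101) / odds(p_100)
prob_ratio = p_101 / p_100

print(f"exp(beta) = {odds_factor:.4f} ({percent_increase:.2f}% increase in odds)")
print(f"odds ratio for +1 sq. m: {odds_ratio:.4f}")
print(f"probability ratio for +1 sq. m: {prob_ratio:.4f}")
```

The odds ratio equals exp(beta) whatever the starting floor area, whereas the probability ratio depends on where on the curve the house sits.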

Geographically Weighted Logistic Regression
$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\, x_{1i} + \beta_2(u_i, v_i)\, x_{2i} + \dots$$
where (u_i, v_i) are the coordinates of observation i. Note, as before, that the betas are functions, not coefficients, to be estimated using non-parametric methods, i.e. β_0(u, v) and so on.

Interpretation
Second bathrooms are more likely in houses in South Wales and southern England: a general North/South effect. Perhaps, because property is more expensive in the south of the UK, people tend to add second bathrooms to smaller houses.

Issues
Convergence problems can occur if there are areas dominated by y_i's all equal to zero or one. Take care with automatic bandwidth selection. Possibly the best policy in some cases is trial and error with manual bandwidth control.

Poisson and Logistic Regression in GWR

III: Poisson Regression: premature mortality in Tokyo
IV: Logistic Regression: landslides in Clearwater National Forest, Idaho

III: Poisson GW Regression

Poisson Regression
To the user this looks as if it is implemented in GWR in almost the same way as ordinary Gaussian regression. The dependent variable must be a count variable (i.e. the values must be integers [whole numbers]).

Rates: Count & Offsets
If you wish to model counts which relate to areas with a varying underlying population, you can do this with an offset variable:
$$y = n e^{X\beta}$$
(n is the offset)

The Offset Variable
To keep the Model Editor simple, a variable entered as the weight variable in Poisson regression is treated as the offset.

Outputs
These are similar to those which are obtained for ordinary Gaussian regression. The interpretation of the parameters is slightly different.

The data
We will use data for 261 municipalities in the Tokyo Metropolitan Area. We are considering determinants of premature mortality. The independent variables are proportions of the elderly, professionals, home owners, and unemployed.

Offset
The premature mortality count will obviously vary according to the size of the zone. As an offset, we will use the expected number of premature deaths.

The data
The data are in the folder \SampleData\Tokyo. There are shapefiles for both the municipality and prefecture boundaries.

The data
There are some dBASE files with socioeconomic data. There is a data file for GWR3. There is a table with information on the various geographies; we will need this for mapping the GWR results.

The model editor
After you have selected the data you complete the model editor thus. Notice the offset variable as the weight.

Patience
The Poisson model is fitted using a method known as iteratively reweighted least squares. This takes about 5 times as long as fitting a Gaussian model.

Header
***************************************************************
*                                                             *
*         GEOGRAPHICALLY WEIGHTED POISSON REGRESSION          *
*                                                             *
***************************************************************
Number of data cases read: 262
Sample data file read...
*Number of observations, nobs= 262
*Number of predictors, nvar= 4
Observation Easting extent: 131840.781
Observation Northing extent: 120125.898

Calibration
*Finding bandwidth...
... using all regression points
This can take some time...
*Calibration will be based on 262 cases
*Adaptive kernel sample size limits: 65 262
*Crossvalidation begins...
Bandwidth          CV Score
125.876348015000   77093.504704727718
163.500000000000   78852.510762618243
102.623652260339   76090.041761883127
88.252695924830    75486.337205487056
79.370956440679    75159.569289465624
73.881739549149    75717.804317078335
82.763479033300    75179.304206706714
77.274262166597    75473.565361751971
80.666784759217    75218.352828294883
** Convergence after 9 function calls
** Convergence: Local Sample Size= 79

Global Model
********** Global Poisson Model Diagnostics **********
Convergence after 3 iterations
Log-likelihood: -194.640790
Deviance (-2LogLikelihood): 389.281580
Trace of the Hat Matrix: 5.000000
Number of parameters in model: 5.000000
Akaike Information Criterion: 399.281580
Corrected AIC (AICc) 399.515955
Bayesian Information Criterion: 417.123303

Parameter   Estimate   Std Err   T         Exp(B)   Sd(Exp(B))
---------   --------   -------   -------   ------   ----------
Intercept   0.007      0.065     0.115     1.007    0.066
Professl    -2.288     0.162     -14.123   0.101    0.016
Elderly     2.199      0.198     11.093    9.019    1.788
OwnHome     -0.260     0.047     -5.519    0.771    0.036
Unemply     0.064      0.011     5.822     1.066    0.012
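The slides do not state how the information criteria are computed. As a sketch, the standard textbook formulas below (an assumption about what the software does, with hypothetical function names) reproduce the AIC, AICc, and BIC values printed in the global Poisson listing from its deviance, parameter count, and number of observations.

```python
import math


def aic(deviance, k):
    """Akaike Information Criterion: deviance plus twice the parameter count."""
    return deviance + 2.0 * k


def aicc(deviance, k, n):
    """AIC with the usual small-sample correction."""
    return aic(deviance, k) + 2.0 * k * (k + 1.0) / (n - k - 1.0)


def bic(deviance, k, n):
    """Bayesian Information Criterion: deviance plus k * ln(n)."""
    return deviance + k * math.log(n)


# Deviance, number of parameters, and nobs copied from the global Poisson listing.
deviance, k, n = 389.281580, 5, 262

print(f"AIC  = {aic(deviance, k):.6f}")
print(f"AICc = {aicc(deviance, k, n):.6f}")
print(f"BIC  = {bic(deviance, k, n):.6f}")
```

The same functions accept a non-integer k, so they can be applied to the effective number of parameters reported for a local model.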

Local Model
********** Local Poisson Model Diagnostics **********
Log Likelihood: -145.678378
Deviance: 291.356756
Trace of the hat matrix... 32.639607
Residual sum of squares... 51466.082227
Effective number of parameters.. 32.639607
Akaike Information Criterion... 356.635970
Corrected AIC... 366.252205
Bayesian Information Criterion.. 473.105336

5-number summaries
**********************************************************
*              PARAMETER 5-NUMBER SUMMARIES              *
**********************************************************
Label      Minimum     Lwr Quartile   Median      Upr Quartile   Maximum
Intrcept   -1.179779   -0.033398      0.105704    0.254133       0.591215
Professl   -4.348735   -2.712938      -2.450336   -1.633525      2.501697
Elderly    0.915348    1.529775       2.022919    2.544694       5.226847
OwnHome    -0.678879   -0.397862      -0.289317   -0.193223      0.242000
Unemply    -0.064240   0.015048       0.033627    0.080714       0.196946

Mapping
The municipality shapefile and the data used for the GWR model are in two different map projections. However, there is a lookup table to link the two attribute tables, so this time we do not use a spatial join.

GeogIndex.csv
The lookup table is a comma-separated values (CSV) file. The ID field contains the IDs used in the municipality shapefile. The GWR_ID field contains the sequence numbers assigned by GWR3.
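The workshop performs this link inside the GIS. Purely as an illustration of what the lookup does, the sketch below joins results to municipality IDs through the ID and GWR_ID fields. Only those two field names come from the slides; the row values, the EST_PROFESSL column, and the function names are hypothetical.

```python
import csv
import io

# Hypothetical stand-in for GeogIndex.csv (field names ID and GWR_ID as described).
geog_index_csv = """ID,GWR_ID
13101,1
13102,2
13103,3
"""

# Hypothetical stand-in for a table of GWR results keyed by sequence number.
gwr_results_csv = """GWR_ID,EST_PROFESSL
1,-2.41
2,-2.38
3,-1.97
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_results(index_rows, result_rows):
    """Attach result columns to each municipality ID via the shared GWR_ID."""
    by_gwr_id = {row["GWR_ID"]: row for row in result_rows}
    joined = {}
    for row in index_rows:
        match = by_gwr_id.get(row["GWR_ID"])
        if match is not None:
            joined[row["ID"]] = {
                key: value for key, value in match.items() if key != "GWR_ID"
            }
    return joined


joined = join_results(read_rows(geog_index_csv), read_rows(gwr_results_csv))
print(joined["13102"])
```

Because the join is on attribute values rather than geometry, the differing map projections do not matter.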

TMABSU attributes

GeogIndex.csv

Tokyo point attributes

The result
You can now map parameter values and the other diagnostics.
[Map: Professionals]

IV: Logistic GW Regression

Logistic Regression
You use logistic regression when your dependent variable is binary or dichotomous: the values should be 0 or 1. The independents can be continuous or binary-valued dummy variables.

Predicted π
The predicted value is the probability that the dependent variable is 1. It is continuous, lying between 0 and 1.

Landslides
In November 1995 and February 1996 there were some 865 landslides in Clearwater National Forest, Idaho. Gorsevski et al. (TGIS, 10(3), 395-415) suggested that topographic factors may influence landslide occurrence.

The Data
We have extracted data for a subset of 239 observations: 138 landslide sites and 101 control sites.

The Data
Our dataset also contains some topographic indicators for sites in Clearwater National Forest where there have been landslides. A similar number of locations have been chosen randomly in the study area where there have not been landslides: the control sites. A binary variable, Landslid, indicates whether the sample is a landslide or a control site, coded 1 or 0 respectively.

Landslide sites are yellow, control sites are black

Topographic variables
The variables have been mostly generated from a digital elevation model. This was created from Shuttle Radar Topography Mission data. The grid size is 25 m.

Predictor Variables
Elevation (metres)
Slope (% - 0=flat, 10=vertical)
Sine of the aspect (-1 to 1)
Cosine of the aspect (-1 to 1)
Absolute deviation from due south (0 to 180 degrees)
Distance to the nearest watercourse (metres)
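The slides do not say how the three aspect-based predictors were derived. A minimal sketch, assuming aspect is recorded in degrees clockwise from north (0 to 360), is given below; the function name and dictionary keys are hypothetical.

```python
import math


def aspect_predictors(aspect_deg):
    """Derive the three aspect-based predictors from an aspect in degrees.

    Assumes aspect is measured clockwise from north, so due south is 180.
    """
    a = aspect_deg % 360.0
    rad = math.radians(a)
    return {
        "sin_aspect": math.sin(rad),      # ranges over -1 to 1
        "cos_aspect": math.cos(rad),      # ranges over -1 to 1
        "abs_dev_south": abs(a - 180.0),  # 0 (south) to 180 (north)
    }


for aspect in (0.0, 90.0, 180.0, 270.0):
    print(aspect, aspect_predictors(aspect))
```

Sine and cosine avoid the jump between 359 and 0 degrees that a raw aspect value would introduce, while the deviation from due south gives a single measure of how far a slope faces away from south.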

The data file
The data file is named landslides.csv and is in the \SampleData\Clearwater folder. There is also a georeferenced scanned map of the study area called basemap.jpg.

Input and Output Files

Model Editor
For the first model use Landslid as the Dependent Variable, and Elev and Slope as the Independent Variables. Use an Adaptive kernel; we have Cartesian coordinates.

Running the Model
As with Poisson Regression, the model is fitted using iteratively reweighted least squares. This means it will take a little longer to run than an ordinary GWR, so be patient.

Control and Listing Files

Header
***************************************************************
*                                                             *
*         GEOGRAPHICALLY WEIGHTED LOGISTIC REGRESSION         *
*                                                             *
***************************************************************
Number of data cases read: 239
Sample data file read...
*Number of observations, nobs= 239
*Number of predictors, nvar= 2
P(Landslid=1 X) = 0.577406
Observation Easting extent: 33543.0625
Observation Northing extent: 28802.6289

We have 239 observations with 2 predictors. The study area extent is 33.5 km by 28.8 km.

Calibration
*Adaptive kernel sample size limits: 59 239
*AICc minimisation begins...
Bandwidth          AICc
114.623059100000   261.564370933634
149.000000000000   263.977389197345
93.376941151579    259.338341719932
80.246118103905    259.206909758644
72.130823143768    259.220048698357
85.261646191441    259.090515014747
88.361413027338    259.178113791002
83.345884939802    259.175696235611
** Convergence after 8 function calls
** Convergence: Local Sample Size= 85

This is quite a large bandwidth. Notice that there is not much variation between the AICc values for a range of bandwidths.

Global Model
********** Global Logistic Model Diagnostics **********
Convergence after 5 iterations
Log-likelihood: -136.074826
Deviance (-2LogLikelihood): 272.149653
Number of parameters in model: 3.000000
Akaike Information Criterion: 278.149653
Corrected AIC (AICc) 278.251781
Bayesian Information Criterion: 288.579043

Parameter   Estimate   Std Err   T        Exp(B)   Sd(Exp(B))
---------   --------   -------   ------   ------   ----------
Intercept   0.884      0.780     1.134    2.420    1.887
Elev        -0.002     0.001     -4.711   0.998    0.001
Slope       0.089      0.019     4.693    1.094    0.021

The global AICc is 278.25. The elevation and slope parameters are both significant: higher elevations decrease, and steeper slopes increase, the probability of a landslide.

Local Model
********** Local Logistic Model Diagnostics **********
Log Likelihood: -109.945257
Deviance: 219.890514
Residual sum of squares... 35.881650
Effective number of parameters.. 18.038670
Akaike Information Criterion... 255.967853
Corrected AIC... 259.090513
Bayesian Information Criterion.. 318.678626

There is a smaller AICc with the local model, so we have some improvement in fit.

5-number summaries
**********************************************************
*              PARAMETER 5-NUMBER SUMMARIES              *
**********************************************************
Label      Minimum     Lwr Quartile   Median      Upr Quartile   Maximum
--------   ---------   ------------   ---------   ------------   ---------
Intrcept   -5.856532   -0.240704      1.719492    3.787028       16.416200
Elev       -0.013785   -0.004208      -0.002330   -0.000893      0.000381
Slope      -0.027931   0.035047       0.078778    0.133318       0.243211

Most of the local elevation parameters are negative, and most of the local slope parameters are positive.

Visualising
As before, you use ArcMap. The scanned basemap is basemap.jpg. The Interchange File must be converted to a coverage. Visualisation is as before.

Predicted/Observed
The residual is the difference between the observed y (0/1) and the predicted probability of 1. Values from -1 to -0.5 are locations where no landslide occurred but one has been predicted. Values from 0.5 to 1 are landslide sites where no landslide has been predicted; we might look at these to see what other characteristics they might have which are not in the model.
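The residual ranges described here can be expressed as a small classification rule. The sketch below is illustrative only: the function name, the labels, and the choice to treat residuals of exactly -0.5 or 0.5 as misclassified are assumptions, and the example probabilities are made up.

```python
def classify_residual(observed, predicted_prob):
    """Return (residual, label) for an observed 0/1 value and a predicted probability."""
    residual = observed - predicted_prob
    if residual <= -0.5:
        # No landslide occurred, but one was predicted.
        return residual, "false positive"
    if residual >= 0.5:
        # A landslide occurred, but none was predicted.
        return residual, "false negative"
    return residual, "consistent with prediction"


# Made-up examples: (observed Landslid value, predicted probability of 1).
for observed, prob in [(0, 0.9), (1, 0.2), (1, 0.7), (0, 0.1)]:
    residual, label = classify_residual(observed, prob)
    print(observed, prob, round(residual, 2), label)
```

Mapping the two outer classes separately makes clusters of false negatives or false positives easy to spot.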

The default symbology choices for the RESID variable: to change these click on Classify

The blue bars representing the breaks are removed: click to highlight, then right-click to choose Delete Break

3 breaks at -0.5, 0.5 and the top value

Small cluster of 3 false negatives near Sheep Mountain Work Center

End of presentation