bruary 8, 0
Geographically Weighted Regression (GWR)
In a nutshell: a local statistical technique to analyse spatial variations in relationships.
Global averages of spatial data are not always helpful:
  climate data
  health data
This problem can also occur with global statistics that measure relationships in spatial data:
  regression
  correlation
Spatial Non-Stationarity
Spatial non-stationarity occurs when a relationship (or pattern) that applies in one region does not apply in another.
Global models are statements about processes or patterns which
  are assumed to be stationary, and as such are location independent
  are assumed to apply in all locations
Local models are spatial disaggregations of global models, the results of which are location-specific.
The template of the model is the same, but the specifics may alter: e.g. the model may always be a linear regression model with certain variables, but the coefficients alter geographically.
The above is essentially a description of GWR.
An Example of Spatial Non-Stationarity - 1
[Figure: maps of % With Degrees and % Foreign Born, Georgia State (Source: US Census 1990)]
An Example of Spatial Non-Stationarity - 2
[Figure: panels of % With Degrees against % Foreign Born, by Northing]
The GWR Model
Standard Global Regression
  $y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$
Geographically Weighted Regression
  $y_i = \alpha(u_i, v_i) + \beta_1(u_i, v_i) x_{1i} + \beta_2(u_i, v_i) x_{2i} + \dots + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ and $(u_i, v_i)$ is the location of observation $i$
Note: the coefficients in GWR are now functions, not variables.
A Calibration Algorithm - 1
For a given point (u, v):
  Consider a window of radius h
  Calibrate the regression using only the data falling in that window
Scanning the window across the study area gives a surface of regression parameters...
A Calibration Algorithm - 2
To avoid sudden jumps when scanning the window:
  We use a weighted regression calibration.
  Hence Geographically Weighted Regression.
A Calibration Algorithm - 3
Weighting Details: A possible scheme
  $w_i(u, v) = \begin{cases} \{1 - (d/h)^2\}^2 & \text{if } d < h \\ 0 & \text{otherwise} \end{cases}$
  where $d = \sqrt{(u - u_i)^2 + (v - v_i)^2}$.
h is called the bandwidth.
Other weighting functions could be used, e.g. Gaussian.
Results are generally more sensitive to h than to the choice of weighting function.
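The weighting scheme can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the bisquare form $\{1 - (d/h)^2\}^2$ for the compact-support kernel and one common parameterisation of the Gaussian alternative; the function names and the toy coordinates are hypothetical.

```python
import numpy as np

def bisquare_weights(coords, u, v, h):
    """Bisquare weights: (1 - (d/h)^2)^2 when d < h, otherwise 0."""
    d = np.hypot(coords[:, 0] - u, coords[:, 1] - v)
    return np.where(d < h, (1.0 - (d / h) ** 2) ** 2, 0.0)

def gaussian_weights(coords, u, v, h):
    """Gaussian alternative (assumed form): exp(-0.5 (d/h)^2), never exactly 0."""
    d = np.hypot(coords[:, 0] - u, coords[:, 1] - v)
    return np.exp(-0.5 * (d / h) ** 2)

# Toy observation locations at distances 0, 5 and 10 from the origin.
coords = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
w_bisquare = bisquare_weights(coords, 0.0, 0.0, h=10.0)
w_gaussian = gaussian_weights(coords, 0.0, 0.0, h=10.0)
print(w_bisquare, w_gaussian)
```

The bisquare kernel drops to exactly zero at distance h, so observations outside the window are ignored entirely, whereas the Gaussian kernel only down-weights them.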
A Calibration Algorithm - 4
Calibration Formula
  $\hat{\beta}(u, v) = \{X^T W(u, v) X\}^{-1} X^T W(u, v) y$
  where $W(u, v) = \mathrm{Diagonal}(w_1(u, v), w_2(u, v), \dots, w_n(u, v))$
  X is the matrix of independent variables.
  y is the vector of the dependent variable.
cf. Global Regression Formula
  $\hat{\beta} = \{X^T X\}^{-1} X^T y$
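The calibration formula translates almost directly into code. The sketch below assumes a bisquare kernel and synthetic data whose slope drifts from west to east; `gwr_fit_at` and all variable names are hypothetical, not part of any library.

```python
import numpy as np

def gwr_fit_at(u, v, coords, X, y, h):
    """Local coefficients {X^T W X}^{-1} X^T W y at the point (u, v)."""
    d = np.hypot(coords[:, 0] - u, coords[:, 1] - v)
    w = np.where(d < h, (1.0 - (d / h) ** 2) ** 2, 0.0)  # assumed bisquare kernel
    XtW = X.T * w                    # equivalent to X.T @ np.diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

# Synthetic example: the slope rises from 2 in the west to 5 in the east.
rng = np.random.default_rng(0)
n = 200
coords = rng.uniform(0, 1, size=(n, 2))
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # intercept column plus one predictor
y = 1.0 + (2.0 + 3.0 * coords[:, 0]) * x

beta_west = gwr_fit_at(0.1, 0.5, coords, X, y, h=0.3)
beta_east = gwr_fit_at(0.9, 0.5, coords, X, y, h=0.3)
print(beta_west, beta_east)
```

Setting every weight to 1 recovers the global formula, which is one way to see that the global model is a special case of the local one.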
A Calibration Algorithm - 5
An extension of the method
  Use a different bandwidth in different places, i.e. h(u, v)
  Typically, the bandwidth at (u, v) is the distance to the kth nearest neighbour.
  Useful if the density of observations is variable, e.g. urban/rural.
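A kth-nearest-neighbour bandwidth takes only a couple of lines. In this sketch the function name and the toy coordinates are hypothetical, and an observation located exactly at (u, v) counts as its own nearest neighbour.

```python
import numpy as np

def adaptive_bandwidth(coords, u, v, k):
    """Bandwidth h(u, v): distance from (u, v) to its kth nearest observation."""
    d = np.hypot(coords[:, 0] - u, coords[:, 1] - v)
    return np.partition(d, k - 1)[k - 1]

# Three clustered ("urban") sites and one isolated ("rural") site.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
h_dense = adaptive_bandwidth(coords, 1.0, 0.0, k=3)
h_sparse = adaptive_bandwidth(coords, 10.0, 0.0, k=3)
print(h_dense, h_sparse)
```

The window widens automatically where observations are sparse, so each local fit draws on a similar number of data points.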
Over- and Under-Fitting
[Figure: panels illustrating over- and under-fitting]
Cross-Validation and h
[Figure: Cross Validation Example; RMS Prediction Error against Bandwidth h (km)]
Cross-validation: predict a holdback sample from a model fitted to the remaining data, for a range of h-values, then choose the h-value that is the best predictor.
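A bandwidth search by cross-validation might look like the sketch below. It assumes leave-one-out holdback (each observation is predicted from all the others), a bisquare kernel, and synthetic data; the slides do not say which holdback scheme was used, and the function and variable names are hypothetical.

```python
import numpy as np

def loo_cv_rmse(coords, X, y, h):
    """RMS error when each observation is predicted from the others."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        d = np.hypot(coords[:, 0] - coords[i, 0], coords[:, 1] - coords[i, 1])
        w = np.where(d < h, (1.0 - (d / h) ** 2) ** 2, 0.0)
        w[i] = 0.0                                   # hold back observation i
        XtW = X.T * w
        # lstsq copes with windows that contain too few points to invert
        beta = np.linalg.lstsq(XtW @ X, XtW @ y, rcond=None)[0]
        errors[i] = y[i] - X[i] @ beta
    return np.sqrt(np.mean(errors ** 2))

rng = np.random.default_rng(0)
n = 150
coords = rng.uniform(0, 1, size=(n, 2))
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + (2.0 + 3.0 * coords[:, 0]) * x + rng.normal(scale=0.1, size=n)

candidates = [0.1, 0.2, 0.4, 0.8, 5.0]
scores = {h: loo_cv_rmse(coords, X, y, h) for h in candidates}
best_h = min(scores, key=scores.get)
print(scores, best_h)
```

Very small bandwidths over-fit (few points per window) and very large ones approach the global model, so the score typically has a minimum in between.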
Results of GWR - Slope
[Figure: map of Slope Coefficient]
Results of GWR - Intercept
[Figure: map of Intercept Coefficient]
Results of GWR - Slope
[Figure: map of Slope Coefficient - Using Grid Sampling]
Further Issues
Local standard error - reliability of estimates
Significance testing - Monte Carlo approaches
  $H_0$: No spatial variation in coefficients
  $H_1$: GWR assumption is true
  Tests whether the GWR assumption is valid
  Could also be used to justify global models on occasions
Multivariate GWR
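One way to run such a Monte Carlo test is sketched below. It assumes the test statistic is the variance of each local coefficient across the observation locations, and that the null distribution is simulated by randomly reassigning locations to observations; both choices are assumptions for illustration, as are the function names and the synthetic data.

```python
import numpy as np

def local_coefs(coords, X, y, h):
    """GWR coefficients at every observation location (bisquare kernel assumed)."""
    out = np.empty((len(y), X.shape[1]))
    for i, (u, v) in enumerate(coords):
        d = np.hypot(coords[:, 0] - u, coords[:, 1] - v)
        w = np.where(d < h, (1.0 - (d / h) ** 2) ** 2, 0.0)
        XtW = X.T * w
        out[i] = np.linalg.lstsq(XtW @ X, XtW @ y, rcond=None)[0]
    return out

def monte_carlo_pvalues(coords, X, y, h, n_sim=99, seed=0):
    """P-value per coefficient for H0: no spatial variation."""
    rng = np.random.default_rng(seed)
    observed = local_coefs(coords, X, y, h).var(axis=0)
    exceed = np.ones_like(observed)          # the observed value counts once
    for _ in range(n_sim):
        perm = rng.permutation(len(y))       # shuffle locations under H0
        simulated = local_coefs(coords[perm], X, y, h).var(axis=0)
        exceed += simulated >= observed
    return exceed / (n_sim + 1)

rng = np.random.default_rng(1)
n = 100
coords = rng.uniform(0, 1, size=(n, 2))
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + (2.0 + 3.0 * coords[:, 0]) * x + rng.normal(scale=0.1, size=n)

p_values = monte_carlo_pvalues(coords, X, y, h=0.4)
print(p_values)   # one p-value for the intercept, one for the slope
```

A large p-value for every coefficient would be the situation in which a global model can be justified.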
Results of GWR - Multivariate
[Figure: maps of coefficients for % Foreign Born and % Elderly ( 65)]
Note: adding extra variables can alter interpretation due to correlation between predictors. Just like in other kinds of regression...
PCA
Multivariate relationships:
  Issues with collinearity
  Treating variables symmetrically
  Multivariate outliers
Principal Components
  Identifies collinearity: based on the Σ-matrix of several variables
  Can identify multivariate outliers
PCA as Model - 1
[Figure: scatter plot with fitted line. Comparison: OLS Regression]
PCA as Model - 2
[Figure: scatter plot with fitted line. Comparison: PCA]
Interpretation
PCA is a kind of line-fitting algorithm
  Based on perpendicular distances.
  The error to be minimised is based on fitting both x and y, not just y.
Residuals are the perpendicular distances mentioned above.
The equation of the best-fit line gives the loadings on each variable for PC1.
The projections of the points onto the line correspond to the scores for PC1.
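The contrast between the two line fits can be checked numerically. The sketch below uses synthetic two-variable data (all names are hypothetical): the first principal component comes from an SVD of the centred data, and the OLS line from `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)
Z = np.column_stack([x, y])
Zc = Z - Z.mean(axis=0)                      # centre both variables

# First principal component: direction minimising perpendicular distances.
_, _, Vt = np.linalg.svd(Zc, full_matrices=False)
loadings = Vt[0]                             # loadings on x and y
scores = Zc @ loadings                       # projections onto the line
perp_dist = np.linalg.norm(Zc - np.outer(scores, loadings), axis=1)

pca_slope = loadings[1] / loadings[0]
ols_slope = np.polyfit(x, y, 1)[0]           # minimises vertical distances only

def perp_ss(slope):
    """Sum of squared perpendicular distances to a line through the centroid."""
    direction = np.array([1.0, slope]) / np.hypot(1.0, slope)
    resid = Zc - np.outer(Zc @ direction, direction)
    return np.sum(resid ** 2)

print(pca_slope, ols_slope, perp_ss(pca_slope), perp_ss(ols_slope))
```

The PCA line is never worse than the OLS line when judged by perpendicular distances, and it is at least as steep because it treats x and y symmetrically.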
The Multivariate Situation
Same idea still applies BUT
For the first k components in m dimensions:
  Find the k-dimensional subspace minimising perpendicular distances in m-space; the equations of the subspace give the loadings in terms of the input variables.
  Residuals are the perpendicular distances mentioned above.
Coordinates projected onto the subspace: ordination plot
  Found in the above plot: Type 1 multidimensional outliers
  They fit the subspace model, but are unusual in the subspace
Big residuals: Type 2 multidimensional outliers
  Don't even fit the subspace model!
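The two kinds of outlier can be told apart with two diagnostics from the same decomposition. The sketch below is an illustration on synthetic data lying near a plane in three dimensions; the function name, the standardised score distance, and the two planted outliers are assumptions made for the example.

```python
import numpy as np

def pca_diagnostics(Z, k):
    """Scores in the k-dim subspace, plus two outlier diagnostics."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    V = Vt[:k].T                                   # m x k loadings
    scores = Zc @ V                                # ordination coordinates
    perp_dist = np.linalg.norm(Zc - scores @ V.T, axis=1)   # residuals
    score_dist = np.linalg.norm(scores / scores.std(axis=0), axis=1)
    return scores, perp_dist, score_dist

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
third = base[:, 0] + base[:, 1] + 0.05 * rng.normal(size=200)
Z = np.column_stack([base, third])                 # points near the plane z = x + y
Z = np.vstack([Z,
               [6.0, 6.0, 12.0],                   # row 200: on the plane, far out
               [0.0, 0.0, 4.0]])                   # row 201: off the plane

scores, perp_dist, score_dist = pca_diagnostics(Z, k=2)
print(np.argmax(score_dist), np.argmax(perp_dist))
```

The point that fits the subspace but sits far out within it stands out in the score distance, while the point that does not fit the subspace stands out in the perpendicular distance.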
Geographical Weighting PCA
Might want to find outliers locally
A local outlier:
  Is not an unusual observation in the data set as a whole
  But is unlike its geographical neighbours
Can use locally weighted PCA to investigate local multivariate outliers.
How to do it:
  Apply geographical weighting windows to the perpendicular-distance-minimising algorithm
  Thus PCA loadings are viewed as functions of (u, v), like regression coefficients in GWR.
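A locally weighted PCA at a single point can be sketched as an eigen-decomposition of a geographically weighted covariance matrix, which is equivalent to minimising weighted squared perpendicular distances about the weighted mean. The bisquare kernel, the function name and the synthetic data are assumptions; each loading vector is flipped so that its first element is positive.

```python
import numpy as np

def gw_pca_at(u, v, coords, Z, h, k=1):
    """Leading k eigenvalues and loadings of the weighted covariance at (u, v)."""
    d = np.hypot(coords[:, 0] - u, coords[:, 1] - v)
    w = np.where(d < h, (1.0 - (d / h) ** 2) ** 2, 0.0)   # assumed bisquare kernel
    mean = (w[:, None] * Z).sum(axis=0) / w.sum()          # local weighted mean
    Zc = Z - mean
    cov = (Zc * w[:, None]).T @ Zc / w.sum()               # local weighted covariance
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1][:k]                    # largest eigenvalues first
    load = evecs[:, order]
    load = load * np.where(load[0] < 0, -1.0, 1.0)         # first sign positive
    return evals[order], load

# Two variables: positively related in the west, negatively in the east.
rng = np.random.default_rng(2)
n = 400
coords = rng.uniform(0, 1, size=(n, 2))
a = rng.normal(size=n)
b = np.where(coords[:, 0] < 0.5, 1.0, -1.0) * a + 0.3 * rng.normal(size=n)
Z = np.column_stack([a, b])

_, load_west = gw_pca_at(0.2, 0.5, coords, Z, h=0.25)
_, load_east = gw_pca_at(0.8, 0.5, coords, Z, h=0.25)
print(load_west.ravel(), load_east.ravel())
```

Evaluating `gw_pca_at` at every site yields loadings as functions of (u, v), in the same way that GWR yields coefficient surfaces.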
An Example
Baltic Soil Survey (Reimann et al, 2000).
Agricultural soils were collected from 10 European countries over a large region surrounding the Baltic Sea: 768 sites.
Here we concentrate on topsoil samples. Trace compounds: SiO2, TiO2, Al2O3, Fe2O3, MnO, MgO, CaO, Na2O, K2O and P2O5. % by weight calculated.
Data has 768 rows and 10 columns. Also, the x and y coordinates of each site are recorded.
Data standardised to z-scores.
Key task: identify local patterns and outliers...
Survey Locations
Choosing h for Bandwidth Selection
Much like the procedure in GWR
Measure perpendicular distances in a holdback sample
Choose h to minimise this
[Figure: CV Score against Bandwidth]
PCA Results - Highest Loadings
PCA Results - Sternutation Plot of Loading
PCA Results - Sternutation Plot of Loading
Unique sign patterns in geographically weighted loadings

SiO2  TiO2  Al2O3  Fe2O3  MnO  MgO  CaO  Na2O  K2O  P2O5
 +     -     -      -      -    -    -    -     -    -
 +     +     +      +      +    +    +    +     +    -
 +     +     +      +      +    +    +    +     +    +

Relatively small number of patterns exhibited: only 3 out of a possible $2^{10} = 1024$.
NB. First sign always positive by convention.
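Tabulating the sign patterns is a small bookkeeping exercise. In this sketch the function name and the three-variable toy loadings are hypothetical; each row is first flipped so that its first loading is positive, matching the convention stated above.

```python
import numpy as np

def sign_patterns(loadings):
    """Count distinct sign patterns over sites; loadings is (n_sites, n_vars)."""
    flip = np.where(loadings[:, :1] < 0, -1.0, 1.0)   # make first sign positive
    signs = np.where(loadings * flip >= 0, "+", "-")
    patterns = ["".join(row) for row in signs]
    unique, counts = np.unique(patterns, return_counts=True)
    return dict(zip(unique.tolist(), counts.tolist()))

# Toy loadings for three sites and three variables.
toy_loadings = np.array([[0.5, -0.2, 0.1],
                         [-0.5, 0.2, -0.1],     # same pattern as row 0 after flipping
                         [0.3, 0.4, 0.2]])
pattern_counts = sign_patterns(toy_loadings)
print(pattern_counts)
```

Mapping each site's pattern then shows where the structure of the leading component changes across the study area.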
PCA Results - Sign Patterns
Hunting of Type 2 - High Perpendicular Distances
[Figure: four labelled sites with high perpendicular distances]
Hunting of Type 2 - Parallel Coordinates
[Figure: parallel coordinates plots for the four sites, over SiO2, TiO2, Al2O3, Fe2O3, MnO, MgO, CaO, Na2O, K2O and P2O5]
GWR/ as data miner
  Certainly a useful rôle for
But PCA can also be seen as a model
  Possibly data mining / data modelling not such a clear cut distinction?
Further extensions...
The End
With thanks to Martin Charlton for his helpful comments and discussion.