Statistics: A review. Why statistics?

What statistical concepts should we know?
Why statistics? To summarize, to explore, to look for relations, to predict
What kinds of data exist? Nominal, ordinal, interval and ratio
Populations and samples (sampling frames) (support)
Sampling frames: random, systematic, stratified, other?
How to summarize?
  Measures of central tendency
  Measures of dispersion
Looking for relations?
  Visualizations: tables, graphics (scatterplots)
  Quantitative approaches: correlations, regression

How to summarize a distribution using a single number?
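As a minimal sketch of the idea, the snippet below computes common single-number summaries (central tendency and dispersion) with Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration; they do not come from the course data.

```python
import statistics

# Hypothetical sample, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency
mean = statistics.mean(values)      # arithmetic mean
median = statistics.median(values)  # middle of the sorted values
mode = statistics.mode(values)      # most frequent value

# Measures of dispersion
spread = statistics.pstdev(values)       # population standard deviation
value_range = max(values) - min(values)  # range

print(mean, median, mode, spread, value_range)
```

Which summary is appropriate depends on the kind of data: the mode works for nominal data, the median needs at least ordinal data, and the mean assumes interval or ratio data.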

Example: Grouping analysis

Z-score
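A z-score expresses a value as its distance from the mean in units of standard deviation, z = (x - mean) / sd. The sketch below assumes the population standard deviation; the function name `z_scores` and the sample values are illustrative, not taken from the slides.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / standard deviation for every value."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population SD; statistics.stdev() gives the sample SD
    return [(x - mu) / sigma for x in values]

# Hypothetical sample, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(values)
print(z)
```

By construction the z-scores have mean 0 and standard deviation 1, which makes variables measured in different units comparable.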

Measures of association
Crosstabulation (chi-square statistic)
Correlation: Pearson's r (interval/ratio data), Spearman's r (ordinal data)
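The three measures can be sketched in plain Python as follows. This is an illustrative implementation under simple assumptions (no missing values, no zero expected counts). The helper names `pearson_r`, `ranks`, `spearman_r` and `chi_square` and the example data are hypothetical. Spearman's coefficient is computed here as Pearson's r applied to ranks.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_r(x, y):
    """Spearman rank correlation: Pearson's r on the ranked data."""
    return pearson_r(ranks(x), ranks(y))

def chi_square(table):
    """Chi-square statistic for a crosstabulation given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical data: y increases monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson_r(x, y), spearman_r(x, y))
print(chi_square([[10, 20], [20, 10]]))
```

In the example, the relation is perfectly monotonic, so the rank-based coefficient reaches 1 while Pearson's r stays slightly below 1 because the relation is not linear.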

Regression What are some of the considerations we should make? Terminology?

Regression
Variables considered as or
Relations amongst variables? (Think r)
How many cases? Too many variables?
Residuals should be

Regression modelling
Simple regression: a statistical perspective
One (or more) dependent (response) variables
One or more independent (predictor) variables
Linear regression is linear in coefficients: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots$, or $y = X\beta$
Vector/matrix form often used
Over-determined equations & least squares

Ordinary Least Squares (OLS) model
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \dots + \varepsilon_i$, or $y = X\beta + \varepsilon$
Minimise sum of squared errors (or residuals)
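For the special case of a single predictor, the least-squares coefficients have a closed form, which the sketch below uses. It is a simplified illustration rather than a general multi-predictor solver; the function name `ols_simple` and the data are hypothetical.

```python
def ols_simple(x, y):
    """Fit y = b0 + b1 * x by minimising the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx        # slope
    b0 = my - b1 * mx     # intercept
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)  # the quantity OLS minimises
    return b0, b1, residuals, sse

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, residuals, sse = ols_simple(x, y)
print(b0, b1, sse)
```

A useful check on any OLS fit with an intercept is that the residuals sum to zero (up to rounding).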

OLS models and assumptions
Model simplicity and parsimony
Model over-determination, multi-collinearity and variance inflation
Typical assumptions:
  Data are independent random samples from an underlying population
  Model is valid and meaningful (in form and statistically)
  Errors are iid: independent; no heteroscedasticity; common distribution
  Errors are distributed $N(0, \sigma^2)$

Spatial modelling and OLS
Positive spatial autocorrelation is the norm, hence dependence between samples exists
Datasets are often non-normal, so transformations may be required (log, Box-Cox, logistic) (show income)
Samples are often clustered, so spatial declustering may be required
Heteroscedasticity is common
Spatial coordinates (x, y) can form part of the modelling process

Choosing between models
Information content perspective and AIC
$\mathrm{AIC} = -2\ln(L) + 2k$
$\mathrm{AICc} = -2\ln(L) + 2k\left(\frac{n}{n-k-1}\right)$
where n is the sample size, k (and p below) is the number of parameters used in the model, and L is the likelihood function (ln is the natural log)
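A small sketch of the two criteria, using the standard definitions AIC = -2 ln(L) + 2k and AICc = -2 ln(L) + 2k n/(n - k - 1). The function names `aic` and `aicc` and the log-likelihood value in the example are hypothetical. Lower values indicate the preferred model.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion from the maximised log-likelihood ln(L)."""
    return -2.0 * log_likelihood + 2.0 * k

def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC; requires n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    return -2.0 * log_likelihood + 2.0 * k * n / (n - k - 1)

# Hypothetical model: ln(L) = -100 with k = 3 parameters and n = 24 cases.
print(aic(-100.0, 3), aicc(-100.0, 3, 24))
```

The correction term grows as k approaches n, so AICc penalises heavily parameterised models more strongly on small samples and converges to AIC as n becomes large.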

Some regression terminology
Simple linear
Multiple (multiple independent variables)
Multivariate (multiple dependent and independent variables; canonical analysis)
SAR (Spatial Autoregressive model)
CAR (Conditional Autoregressive model)
Logistic: binary data
Poisson: count data
Ecological
Hedonic
Analysis of variance (aka analysis of differences between means)
Analysis of covariance

Regression & spatial autocorrelation (SA)
Analyse the data for SA
If SA is significant, then:
  Proceed and ignore SA, or
  Permit the coefficients, β, to vary spatially (GWR), or
  Modify the regression model to incorporate the SA (e.g., include the Y values of the neighbours as predictor variables in the OLS expression; or identify the missing variable that would explain the spatial autocorrelation)
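The slides do not name a specific test for SA. One widely used global measure is Moran's I, sketched below as an assumption about how the "analyse the data for SA" step might be carried out (for example on OLS residuals). The function name `morans_i`, the binary adjacency weights and the example values are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * (cross / sum(d * d for d in dev))

# Four locations along a line; neighbours are adjacent locations (assumed).
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
trend = [1, 2, 3, 4]        # similar values sit next to each other
alternating = [1, 4, 1, 4]  # dissimilar values sit next to each other
print(morans_i(trend, w), morans_i(alternating, w))
```

Positive values indicate that neighbours tend to be alike (positive SA), negative values that they tend to differ. Judging significance would additionally need a reference distribution, for example from random permutations, which this sketch omits.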

Geographically Weighted Regression (GWR)
Coefficients, β, allowed to vary spatially, β(t)
Model: $y = X\beta(t) + \varepsilon$
Coefficients determined by examining neighbourhoods of points, t, using distance decay functions (fixed or adaptive bandwidths)
Weighting matrix, W(t), defined for each point
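To make the idea concrete, here is a deliberately simplified sketch of GWR with one predictor: at every point t, a weighted least-squares line is fitted in which nearby observations count more. It assumes a Gaussian distance-decay kernel with a fixed bandwidth; the function name `gwr_simple`, the kernel choice and the data are illustrative assumptions, not the method of any particular GWR package.

```python
import math

def gwr_simple(coords, x, y, bandwidth):
    """Return local (b0, b1) for y = b0(t) + b1(t) * x at every point t."""
    results = []
    for tx, ty in coords:
        # Diagonal of W(t): Gaussian decay of weight with distance from t.
        w = [math.exp(-0.5 * (math.hypot(cx - tx, cy - ty) / bandwidth) ** 2)
             for cx, cy in coords]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        b1 = sxy / sxx
        b0 = my - b1 * mx
        results.append((b0, b1))
    return results

# Hypothetical transect of 10 points: slope is 1 in the west, 3 in the east.
coords = [(float(i), 0.0) for i in range(10)]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi if i < 5 else 3.0 * xi for i, xi in enumerate(x)]
local = gwr_simple(coords, x, y, bandwidth=1.0)
print(local[0], local[-1])
```

A single global OLS fit would average the two regimes, whereas the local coefficients recover a slope near 1 at the western end and near 3 at the eastern end. The result is sensitive to the bandwidth: a very large bandwidth approaches the global fit.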

Geographically Weighted Regression
Sensitivity: model, decay function, bandwidth, point/centroid selection
ESDA: mapping of surface, residuals, parameters
Significance testing
Increased apparent explanation of variance
Effective number of parameters
AICc computations

Regression & spatial autocorrelation (SA)
Modify the regression model to incorporate the SA, i.e. produce a Spatial Autoregressive model (SAR)
Many approaches, including:
  SAR: e.g. pure spatial lag model, mixed model, spatial error model, etc.
  CAR: a range of models that assume the expected value of the dependent variable is conditional on the (distance-weighted) values of neighbouring points (e.g., GWR)
  Spatial filtering: e.g. OLS on spatially filtered data (e.g., declustering)