Hypothesis Testing


Hypothesis Testing
In this case, we'd be trying to form an inference about that neighborhood: do people there shop more often than people who are members of the larger population? To ascertain this, we can make use of the hypothesis testing approach in inferential statistics, which is a multistep process:
1. State the null hypothesis (H0)
2. State the alternative hypothesis (HA)
3. Choose α, our significance level
4. Select a statistical test, and calculate the test statistic
5. Determine the critical value at which H0 will be rejected
6. Compare the test statistic with the critical value

Hypothesis Testing - Tests
4. Select a statistical test, and calculate the test statistic. To test the hypothesis, we must construct a test statistic, which frequently takes the form:
Test statistic = (θ − θ0) / (standard error of θ)
For example, using the normal distribution, the z-test is formulated as:
z = (x̄ − µ) / σx̄
where σx̄ = σ/√n when σ is known, or σx̄ ≈ s/√n when we have to estimate the standard deviation from the sample data

Hypothesis Testing - Tests
5. Determine the critical value at which H0 will be rejected (cont.). For example, suppose we are applying a z-test to compare the mean of a large sample to the mean of a population, and we choose a significance level of α = 0.05 (a 95% confidence level). If we formulate our alternative hypothesis as HA: x̄ ≠ µ, we are testing whether x̄ is significantly different from µ in either direction, so the acceptance region must include the central 95% of the normal distribution around the mean, and the rejection region must include 2.5% of the area in each of the two tails. We reject H0 when |Ztest| > Zcrit.

Hypothesis Testing - Tests
5. Determine the critical value at which H0 will be rejected (cont.). On the other hand, suppose we formulate our alternative hypothesis as HA: x̄ > µ or HA: x̄ < µ. Then we are testing whether x̄ is significantly different from µ in a particular direction, so the rejection region must include 5% of the normal distribution's area in one tail, and the acceptance region must include the remaining 95% of the area. For example, using HA: x̄ > µ, we reject H0 when Ztest > Zcrit.

Hypothesis Testing - One-Sample Z-test
The example we looked at in the last lecture used the one-sample z-test, which is formulated as:
Ztest = (x̄ − µ) / (σ/√n)    (difference between means over the standard error)
We use this test statistic:
1. To compare a sample mean to the population mean
2. When the size of the sample is reasonably large, i.e. n > 30
3. When the population standard deviation is known (although we can estimate it from the sample standard deviation), so that we can use this value to calculate the standard error in the denominator
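As a minimal sketch of this calculation (not part of the original slides), the following Python snippet computes the one-sample z statistic and a two-tailed p-value for hypothetical data; the sample values, µ0, and σ are made up for illustration:
    import numpy as np
    from scipy import stats

    def one_sample_z(sample, mu0, sigma):
        """One-sample z-test: compare a sample mean to a known population mean,
        using the known population standard deviation sigma."""
        n = len(sample)
        xbar = np.mean(sample)
        se = sigma / np.sqrt(n)              # standard error of the mean
        z = (xbar - mu0) / se                # test statistic
        p_two_tailed = 2 * stats.norm.sf(abs(z))
        return z, p_two_tailed

    # Hypothetical data: 40 observations, testing against mu0 = 5 with sigma = 2
    rng = np.random.default_rng(0)
    sample = rng.normal(5.6, 2.0, size=40)
    z, p = one_sample_z(sample, mu0=5.0, sigma=2.0)
    print(f"z = {z:.3f}, two-tailed p = {p:.4f}")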

Hypothesis Testing - One-Sample t-test
The one-sample t-test is formulated very much like the one-sample Z-test we looked at earlier:
ttest = (x̄ − µ) / (s/√n)    (difference between means over the standard error)
We use this test statistic:
1. To compare a sample mean to the population mean
2. When the size of the sample is somewhat small, i.e. n ≤ 30
3. When we do not know the population standard deviation; we estimate the standard error from the sample standard deviation s, although we still need to know the population mean for purposes of comparison with the sample mean

Hypothesis Testing - Two-Sample t-tests
Two-sample t-tests are used to compare one sample mean with another sample mean, rather than with a population parameter. The form of the two-sample t-test that is appropriate depends on whether or not we can treat the variances of the two samples as being equal. If the variances can be assumed to be equal (a condition called homoscedasticity), the t-statistic is:
ttest = (x̄1 − x̄2) / (sp √(1/n1 + 1/n2))
where sp is the pooled estimate of the standard deviation:
sp = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]

Hypothesis Testing - Two-Sample t-tests
Two-sample t-tests that use the equal variance assumption have degrees of freedom equal to the sum of the number of observations in the two samples, less two, since we are estimating the values of two means here: df = n1 + n2 − 2. If we cannot assume that the two samples have equal variances, the appropriate t-statistic takes a slightly different form, since we cannot produce a pooled estimate for the standard error portion of the statistic:
ttest = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

Hypothesis Testing - Two-Sample t-tests
Unfortunately, in the heteroscedastic case (where the variances are unequal), calculating the degrees of freedom appropriate to use for the critical t-score requires a somewhat involved formula (equation 3.17 on p. 50). As an alternative, Rogerson suggests using the lesser of n1 − 1 and n2 − 1:
df = min[(n1 − 1), (n2 − 1)]
on the grounds that this value will always be lower than that produced by the involved calculation, and thus will produce a higher tcrit score at the selected α; this is a conservative choice because it makes it even harder to mistakenly reject the null hypothesis and commit a Type I error
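A short Python sketch (not from the slides) of how the pooled and unequal-variance forms differ in practice, using made-up data; scipy's ttest_ind covers both forms through its equal_var flag, and the conservative df is computed separately:
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    sample1 = rng.normal(10.0, 2.0, size=25)   # hypothetical group 1
    sample2 = rng.normal(11.0, 4.0, size=18)   # hypothetical group 2, larger spread

    # Pooled (equal-variance) two-sample t-test: df = n1 + n2 - 2
    t_pooled, p_pooled = stats.ttest_ind(sample1, sample2, equal_var=True)

    # Unequal-variance (Welch) form; scipy evaluates the involved df formula internally
    t_welch, p_welch = stats.ttest_ind(sample1, sample2, equal_var=False)

    print(f"pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
    print(f"welch:  t = {t_welch:.3f}, p = {p_welch:.4f}")

    # Rogerson's conservative alternative: df = min(n1 - 1, n2 - 1)
    df_conservative = min(len(sample1) - 1, len(sample2) - 1)
    t_crit = stats.t.ppf(0.975, df_conservative)   # two-tailed critical value, alpha = 0.05
    print(f"conservative df = {df_conservative}, critical t = {t_crit:.3f}")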

Hypothesis Testing - F-test
In order to decide whether the variances of two samples are similar enough or different enough to warrant the use of one form of the two-sample t-test or the other, we have a further statistical test that we use to compare the variances. The F-test, a.k.a. the variance ratio test, assesses whether or not the variances are equal by computing a test statistic of the form:
Ftest = s1² / s2²
Critical values are taken from the F-distribution, which has a 2-dimensional array of degrees of freedom (i.e. n1 − 1 df in the numerator, n2 − 1 df in the denominator)
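As an illustration (not from the slides), a small Python function for the variance ratio test; the convention of placing the larger variance in the numerator and the doubling for a two-tailed p-value are assumptions of this sketch, and the data are hypothetical:
    import numpy as np
    from scipy import stats

    def variance_ratio_test(sample1, sample2):
        """F-test (variance ratio test) for equality of two sample variances.
        Convention here: put the larger variance in the numerator so F >= 1."""
        s1, s2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)
        if s1 >= s2:
            f, df1, df2 = s1 / s2, len(sample1) - 1, len(sample2) - 1
        else:
            f, df1, df2 = s2 / s1, len(sample2) - 1, len(sample1) - 1
        p_two_tailed = min(2 * stats.f.sf(f, df1, df2), 1.0)
        return f, (df1, df2), p_two_tailed

    rng = np.random.default_rng(2)
    f, dfs, p = variance_ratio_test(rng.normal(0, 2, 25), rng.normal(0, 4, 18))
    print(f"F = {f:.3f}, df = {dfs}, p = {p:.4f}")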

Hypothesis Testing - Matched Pairs t-tests
The form of the test statistic is based upon the calculated differences between the two samples:
ttest = d̄ / (sd/√n)
where d̄ is the average of the differences and sd is their standard deviation:
sd = √[ Σ (di − d̄)² / (n − 1) ]
We use this test statistic:
1. To compare the sample means of paired samples
2. When the size of the samples is somewhat small, i.e. n ≤ 30
3. When the two samples contain members that were not sampled at random but represent observations of the same entities, usually at different times or after some treatment has been applied
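A minimal Python sketch (not from the slides) of the paired calculation on hypothetical before/after measurements, cross-checked against scipy's built-in paired test:
    import numpy as np
    from scipy import stats

    # Hypothetical before/after measurements on the same 12 entities
    before = np.array([12.1, 10.4, 11.8, 9.9, 13.2, 10.7, 12.5, 11.1, 10.2, 12.9, 11.6, 10.8])
    after  = np.array([11.4,  9.8, 11.1, 9.7, 12.5, 10.1, 12.0, 10.6,  9.9, 12.2, 11.0, 10.5])

    d = before - after
    n = len(d)
    t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # d-bar over its standard error
    p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

    # scipy's paired test should agree with the manual calculation
    t_scipy, p_scipy = stats.ttest_rel(before, after)
    print(f"manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
    print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.4f}")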

ANOVA - An F-test
The ANOVA F-test is formulated as:
Ftest = [BSS / (k − 1)] / [WSS / (N − k)]
where k is the number of groups, N is the total number of observations, BSS is the between-group sum of squares, and WSS is the within-group sum of squares. The total sum of squares is the sum of the between-group and within-group sums, i.e. TSS = BSS + WSS (important because BSS can be tedious to calculate, but by calculating WSS and TSS, BSS = TSS − WSS)

Arrangement of Data for ANOVA
             Category 1   Category 2   Category 3   ...   Category k
Obs. 1       x11          x12          x13          ...   x1k
Obs. 2       x21          x22          x23          ...   x2k
Obs. 3       x31          x32          x33          ...   x3k
Obs. 4       x41          x42          x43          ...   x4k
...
Obs. i       xi1          xi2          xi3          ...   xik
No. of obs.  n1           n2           n3           ...   nk
Mean         x̄+1          x̄+2          x̄+3          ...   x̄+k
Std. Dev.    s1           s2           s3           ...   sk
Overall Mean: x̄++

ANOVA Table
A useful way to go through the process of calculating an ANOVA is to fill in an ANOVA table:
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F-Test
Between Groups        BSS              k − 1                MSB           MSB / MSW
Within Groups         WSS              N − k                MSW
Total Variation       TSS              N − 1
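To show how the table's entries come together, here is a Python sketch (not from the slides) that fills in BSS, WSS, and the F-test for hypothetical groups, and compares against scipy's one-way ANOVA:
    import numpy as np
    from scipy import stats

    # Hypothetical observations in k = 3 groups
    groups = [
        np.array([23.0, 25.1, 22.8, 24.6, 26.0]),
        np.array([27.2, 28.1, 26.5, 27.8]),
        np.array([21.9, 22.4, 23.3, 21.5, 22.0, 23.1]),
    ]

    all_obs = np.concatenate(groups)
    N, k = len(all_obs), len(groups)
    grand_mean = all_obs.mean()

    tss = ((all_obs - grand_mean) ** 2).sum()                    # total sum of squares
    wss = sum(((g - g.mean()) ** 2).sum() for g in groups)       # within-group SS
    bss = tss - wss                                              # between-group SS (TSS - WSS)

    ms_between = bss / (k - 1)
    ms_within = wss / (N - k)
    f_manual = ms_between / ms_within
    p_manual = stats.f.sf(f_manual, k - 1, N - k)

    # scipy's one-way ANOVA should give the same F and p
    f_scipy, p_scipy = stats.f_oneway(*groups)
    print(f"manual: F = {f_manual:.3f}, p = {p_manual:.4f}")
    print(f"scipy:  F = {f_scipy:.3f}, p = {p_scipy:.4f}")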

Covariance Formulae
The covariance of variable X with respect to variable Y can be calculated using the following computational formula:
Cov[X, Y] = [ Σ(i=1 to n) xi yi − n x̄ ȳ ] / (n − 1)
The formula for covariance can be expressed in many ways. The following equation is an equivalent expression of covariance (due to the distributive property, after expanding the product):
Cov[X, Y] = [ Σ(i=1 to n) (xi − x̄)(yi − ȳ) ] / (n − 1)

Pearson's Correlation Coefficient
A standardized measure of covariance provides a value that describes the degree to which two variables correlate with one another, expressing this using a value ranging from −1 to +1, where −1 denotes a perfect inverse relationship and +1 denotes a perfect positive relationship. One such measure is known as Pearson's Correlation Coefficient (a.k.a. Pearson's Product Moment correlation), and it is produced by standardizing the covariance, dividing it by the product of the standard deviations of the X and Y variables:
r = Cov[X, Y] / (sX sY)

Pearson's Correlation Coefficient
As is the case with covariance, the correlation coefficient can be expressed in several equivalent ways:
r = [ Σ(i=1 to n) (xi − x̄)(yi − ȳ) ] / [ (n − 1) sX sY ]
It can also be expressed in terms of z-scores, which is convenient if you have already calculated them:
r = [ Σ(i=1 to n) zx zy ] / (n − 1)

A Significance Test for r
The sampling distribution of r follows a t-distribution with (n − 2) degrees of freedom, and we can estimate the standard error of r using:
SEr = √[ (1 − r²) / (n − 2) ]
The test itself takes the form of the correlation coefficient divided by the standard error, thus:
ttest = r / SEr = r √(n − 2) / √(1 − r²)
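A short Python sketch (not from the slides) computing r and this t-test for hypothetical data, compared with scipy's pearsonr, which applies the same t-based test:
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.normal(0, 1, 30)
    y = 0.6 * x + rng.normal(0, 1, 30)   # hypothetical correlated variable
    n = len(x)

    r = np.corrcoef(x, y)[0, 1]
    t_manual = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 2)

    r_scipy, p_scipy = stats.pearsonr(x, y)
    print(f"r = {r:.3f}, t = {t_manual:.3f}, p(manual) = {p_manual:.4f}, p(scipy) = {p_scipy:.4f}")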

Spearman's Rank Correlation Coefficient
We have an alternative correlation coefficient we can use with ordinal data: Spearman's Rank Correlation Coefficient (rs):
rs = 1 − [ 6 Σ(i=1 to n) di² ] / (n³ − n)
where n is the sample size and di is the difference in the rankings of each value with respect to each variable

A Significance Test for rs
As was the case for Pearson's Correlation Coefficient, we can test the significance of an rs result using a t-test. The test statistic and degrees of freedom are formulated a little differently for rs, although many of the characteristics of the distribution of r values are present here as well. In this case, rs values follow a t-distribution with (n − 1) degrees of freedom, and their standard error can be estimated using:
SErs = 1 / √(n − 1)
yielding the test statistic:
ttest = rs / SErs = rs √(n − 1)

Simple Linear Regression
Simple linear regression models the relationship between an independent variable (x) and a dependent variable (y) using an equation that expresses y as a linear function of x, plus an error term:
y = a + bx + e
where x is the independent variable, y is the dependent variable, b is the slope of the fitted line, a is the intercept of the fitted line, and e is the error term (ε)

Least Squares Method
The least squares method operates mathematically, minimizing the error term e over all points. We can describe the line of best fit we will find using the equation ŷ = a + bx, and you'll recall from a previous slide that the formula for our linear model was expressed using y = a + bx + e. We use the value ŷ on the line to estimate the true value, y. The difference between the two is (y − ŷ) = e; this difference is positive for points above the line, and negative for points below it

Error Sum of Squares
By squaring the differences between y and ŷ, and summing these values for all points in the data set, we calculate the error sum of squares (usually denoted SSE):
SSE = Σ(i=1 to n) (yi − ŷi)²
The least squares method selects the line of best fit by finding the parameters of a line (intercept a and slope b) that minimize the error sum of squares, i.e. it is known as the least squares method because it finds the line that makes the SSE as small as it can possibly be, minimizing the vertical distances between the line and the points

Finding Regression Coefficients
The equations used to find the values for the slope (b) and intercept (a) of the line of best fit using the least squares method are:
b = [ Σ(i=1 to n) (xi − x̄)(yi − ȳ) ] / [ Σ(i=1 to n) (xi − x̄)² ]
a = ȳ − b x̄
where xi is the i-th independent variable value, yi is the i-th dependent variable value, x̄ is the mean of all the xi values, and ȳ is the mean of all the yi values
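A minimal Python sketch (not from the slides) of these two formulas applied to hypothetical data, cross-checked against numpy's least-squares polynomial fit:
    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 10, 25)
    y = 2.0 + 0.8 * x + rng.normal(0, 1.5, 25)   # hypothetical linear relationship with noise

    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()

    # Cross-check against numpy's least-squares fit of a degree-1 polynomial
    b_np, a_np = np.polyfit(x, y, 1)
    print(f"manual: a = {a:.3f}, b = {b:.3f}")
    print(f"numpy:  a = {a_np:.3f}, b = {b_np:.3f}")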

Regression Slope and Correlation
The interpretation of the sign of the slope parameter and the correlation coefficient is identical, and this is no coincidence: the numerator of the slope expression is identical to that of the correlation coefficient
r = [ Σ(i=1 to n) (xi − x̄)(yi − ȳ) ] / [ (n − 1) sX sY ]
b = [ Σ(i=1 to n) (xi − x̄)(yi − ȳ) ] / [ Σ(i=1 to n) (xi − x̄)² ]
so the regression slope can be expressed in terms of the correlation coefficient:
b = r (sy / sx)

Coefficient of Determination (r²)
The regression sum of squares (SSR) expresses the improvement made in estimating y by using the regression line:
SSR = Σ(i=1 to n) (ŷi − ȳ)²
The total sum of squares (SST) expresses the overall variation between the values of y and their mean ȳ:
SST = Σ(i=1 to n) (yi − ȳ)²
The coefficient of determination (r²) expresses the amount of variation in y explained by the regression line (the strength of the relationship):
r² = SSR / SST

Partitioning the Total Sum of Squares
We can decompose the total sum of squares into those two components:
Σ(i=1 to n) (yi − ȳ)²  =  Σ(i=1 to n) (ŷi − ȳ)²  +  Σ(i=1 to n) (yi − ŷi)²
      SST                      SSR                      SSE
In other words: SST = SSR + SSE, and the coefficient of determination expresses the portion of the total variation in y explained by the regression line

Regression ANOVA Table
We can create an analysis of variance table that allows us to display the sums of squares, their degrees of freedom, mean square values (for the regression and error sums of squares), and an F-statistic:
Component          Sum of Squares    df      Mean Square      F
Regression (SSR)   Σ (ŷi − ȳ)²       1       SSR / 1          MSSR / MSSE
Error (SSE)        Σ (yi − ŷi)²      n − 2   SSE / (n − 2)
Total (SST)        Σ (yi − ȳ)²       n − 1

A Significance Test for r²
We can test to see if the regression line has been successful in explaining a significant portion of the variation in y by performing an F-test. This operates in a similar fashion to how we used the F-test in ANOVA, this time testing the null hypothesis that the true coefficient of determination of the population ρ² = 0, using an F-test formulated as:
Ftest = r² (n − 2) / (1 − r²) = MSSR / MSSE
which has an F-distribution with degrees of freedom df = (1, n − 2)

Significance Tests for Regression Parameters
In addition to evaluating the overall significance of a regression model by testing the r² value using an F-test, we can also test the significance of individual regression parameters using t-tests. These t-tests have the regression parameter in some form in the numerator, and the standard error of the regression parameter in the denominator. First, we must calculate the standard error of the estimate, also known as the standard deviation of the residuals (se):
se = √[ Σ(i=1 to n) (yi − ŷi)² / (n − 2) ]

Significance Test for Regression Slope
We can formulate a t-test to test the significance of the regression slope (b). We will be testing the null hypothesis that the true value of the slope is equal to zero, i.e. H0: β = 0, using the following t-test:
ttest = b / sb
where sb is the standard deviation of the slope parameter:
sb = √[ se² / ((n − 1) sx²) ]
and degrees of freedom = (n − 2)

Significance Test for Regression Intercept
We can formulate a similar t-test to test the significance of the regression intercept (a). We will be testing the null hypothesis that the true value of the intercept is equal to zero, i.e. H0: α = 0, using the following t-test:
ttest = a / sa
where sa is the standard deviation of the intercept:
sa = √[ se² Σxi² / (n Σ(xi − x̄)²) ]
and degrees of freedom = (n − 2)
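Pulling the last few slides together, a Python sketch (not from the slides, hypothetical data) that fits the line, computes se, the standard errors of a and b, their t-tests, and the overall F-test for r²:
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 25
    x = rng.uniform(0, 10, n)
    y = 2.0 + 0.8 * x + rng.normal(0, 1.5, n)

    # Least-squares fit (as on the earlier slide)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    # Standard error of the estimate (standard deviation of the residuals)
    s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))

    # Standard errors of slope and intercept, their t-statistics and p-values
    s_b = np.sqrt(s_e**2 / ((n - 1) * np.var(x, ddof=1)))
    s_a = np.sqrt(s_e**2 * np.sum(x**2) / (n * np.sum((x - x.mean()) ** 2)))
    t_b, t_a = b / s_b, a / s_a
    p_b = 2 * stats.t.sf(abs(t_b), df=n - 2)
    p_a = 2 * stats.t.sf(abs(t_a), df=n - 2)

    # Overall F-test for r^2
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    F = r2 * (n - 2) / (1 - r2)
    p_F = stats.f.sf(F, 1, n - 2)
    print(f"b = {b:.3f} (t = {t_b:.2f}, p = {p_b:.4f}); a = {a:.3f} (t = {t_a:.2f}, p = {p_a:.4f})")
    print(f"r^2 = {r2:.3f}, F = {F:.2f}, p = {p_F:.4f}")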

Spatial Patterns
We will examine methods that are used to analyze patterns in two sorts of spatial data:
Point Pattern Analysis - These methods concern themselves with the location information associated with point data (not attributes associated with those locations, just where they are found)
Geographic Patterns in Areal Data - These methods are used to examine the pattern of attribute values associated with polygon representations of geographic phenomena (i.e. is there a pattern in the attributes of a set of adjacent polygons?)

Point Pattern Analysis
While being able to qualitatively describe a point pattern as being {regular, random, clustered} is useful, we want to have a rigorous, quantitative means of describing these patterns. We will examine two approaches for doing so:
1. The Quadrat Method - Divide the study area into equal sections, count points per section, and derive a statistic to compare counts to expectations
2. Nearest Neighbor Analysis - Compare the distances between points to an expected distance between points

χ² Test in the Quadrat Method
Once we have calculated the mean number of points per quadrat and the variance of points per quadrat, we can calculate the χ² test statistic using:
χ² = (m − 1) s² / x̄ = (m − 1) · VMR
where m is the number of quadrats, s² is the variance of the points per quadrat, and x̄ is the mean of the points per quadrat. This χ² test statistic has (m − 1) degrees of freedom, and is compared to a critical value from the χ² distribution, yet another probability distribution for which we have tables of values (Table A.6, p. 221)

Summary of the Quadrat Method
1. Divide a study region into m cells of equal size
2. Find the mean number of points per cell, which is equal to the total number of points divided by the total number of cells
3. Find the variance of the number of points per cell (s²) using
s² = Σ(i=1 to m) (xi − x̄)² / (m − 1)
where xi is the number of points in cell i

Summary of the Quadrat Method
4. Calculate the variance to mean ratio (VMR): VMR = s² / x̄
5. Interpret the variance to mean ratio (VMR), and if a hypothesis test is desired, calculate the χ² statistic for quadrat analysis:
χ² = (m − 1) s² / x̄ = (m − 1) · VMR
comparing the test statistic to critical values from the χ² distribution with df = (m − 1)
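A Python sketch (not from the slides) of the full quadrat calculation for hypothetical cell counts; the interpretation comments reflect the usual reading of VMR, and both tails of the χ² distribution are reported:
    import numpy as np
    from scipy import stats

    # Hypothetical counts of points in m = 12 equal-sized quadrats
    counts = np.array([0, 2, 1, 5, 0, 3, 1, 0, 4, 2, 0, 6])
    m = len(counts)

    mean = counts.mean()
    var = counts.var(ddof=1)
    vmr = var / mean                      # variance-to-mean ratio
    chi2 = (m - 1) * vmr                  # quadrat-method chi-square statistic

    # Large chi2 (VMR > 1) suggests clustering; small chi2 (VMR < 1) suggests regularity
    p_upper = stats.chi2.sf(chi2, df=m - 1)
    p_lower = stats.chi2.cdf(chi2, df=m - 1)
    print(f"VMR = {vmr:.2f}, chi2 = {chi2:.2f}, upper-tail p = {p_upper:.4f}, lower-tail p = {p_lower:.4f}")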

2. Nearest Neighbor Analysis
An alternative approach to assessing a point pattern can be formulated that examines the distances between points in the pattern, in terms of the distance between any given point and its nearest neighbor. If we define di as the distance between a point and its nearest neighbor, the average distance between neighboring points (RO) can be written as:
RO = Σ(i=1 to n) di / n

The Nearest Neighbor Statistic
We can also calculate an expected distance between nearest neighbors (RE) in a point pattern (where the expected pattern conforms to our usual null hypothesis of a random point pattern):
RE = 1 / (2 √λ)
where λ is the number of points per unit area. The ratio between the observed and expected distances is the nearest neighbor statistic (R):
R = RO / RE = x̄ / [1 / (2 √λ)]
where x̄ is the average observed distance di

Interpreting the Nearest Neighbor Statistic
Values of R can range from 0, when all points are coincident and the distances between them are thus 0, up to a theoretical maximum of 2.1491 for a perfectly uniform pattern of points spread out on an infinitely large 2-dimensional plane. Through the examination of many random point patterns, the variance of the mean distances between neighbors has been found to be:
V[RE] = (4 − π) / (4πλn)
where n is the number of points

Interpreting the Nearest Neighbor Statistic
Since we have a means of estimating the variance of RE, we can calculate a standard error for RE and formulate a test statistic to test the null hypothesis that the pattern is random:
Ztest = (RO − RE) / √V[RE] = (RO − RE) / √[(4 − π)/(4πλn)] = 3.826 (RO − RE) √(λn)
This test statistic is normally distributed with mean 0 and variance 1, thus we can use the standard normal distribution to assess its significance
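A Python sketch (not from the slides) of the nearest neighbor statistic and its Z-test for hypothetical points in an assumed 100 × 100 study area; the use of a k-d tree for the neighbor search is simply a convenient implementation choice:
    import numpy as np
    from scipy import stats
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(6)
    points = rng.uniform(0, 100, size=(50, 2))    # hypothetical points in a 100 x 100 study area
    area = 100.0 * 100.0
    n = len(points)
    lam = n / area                                 # points per unit area

    # Distance from each point to its nearest neighbor (k=2: the point itself plus its neighbor)
    dists, _ = cKDTree(points).query(points, k=2)
    r_obs = dists[:, 1].mean()

    r_exp = 1.0 / (2.0 * np.sqrt(lam))             # expected mean distance under randomness
    R = r_obs / r_exp                              # nearest neighbor statistic

    var_re = (4 - np.pi) / (4 * np.pi * lam * n)
    z = (r_obs - r_exp) / np.sqrt(var_re)
    p = 2 * stats.norm.sf(abs(z))
    print(f"R = {R:.3f}, z = {z:.3f}, p = {p:.4f}")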

Contingency Tables and the χ² Test
Once we have observed and expected frequencies for each cell in the contingency table, we can use those values to calculate the χ² test statistic:
χ² = Σ(i=1 to n) (Oi − Ei)² / Ei
where Oi is the observed frequency, Ei is the expected frequency, and n is the number of cells. This χ² test statistic has (r − 1)(c − 1) degrees of freedom, where r and c are the number of rows and columns in the contingency table. If the observed frequencies are very different from the expected frequencies, the χ² test statistic will be larger than the 1-tailed critical value it is compared to, thus detecting the presence of a spatial pattern

The Joint Count Statistic
The first step in this method is to enumerate all of the pairs of polygons that share a boundary by creating a binary connectivity table (a.k.a. a spatial matrix). For example, using the following five region system (regions A through E):
1. Label the regions
2. Create a table with the same row & column labels
3. Fill in the table with 1s and 0s to indicate which regions share a boundary
    A  B  C  D  E
A   0  1  1  0  0
B   1  0  1  1  0
C   1  1  0  1  1
D   0  1  1  0  1
E   0  0  1  1  0

The Joint Count Statistic
We can now take the sum of all the 1s in the binary connectivity table and divide by 2 to calculate the total number of shared boundaries in the system (J):
J = [ Σ(i=1 to n) xi ] / 2
Next, we are ready to look at the attribute information associated with the polygons to determine if each pair of polygons that shares a boundary has the same values or different values. The joint count statistic is designed to be used with binary nominal attributes, i.e. the attribute values need to be reduced to some 2-class description (e.g. + and −) for use in this statistic

The Joint Count Statistic
The expected number of +− boundaries is calculated as:
E[+−] = 2JPM / [N(N − 1)]
where J is the total number of shared boundaries, P is the number of + polygons, M is the number of − polygons, and N is the total number of polygons. For our example, E[+−] is calculated as:
E[+−] = 2JPM / [N(N − 1)] = (2 · 7 · 3 · 2) / (5 · 4) = 84 / 20 = 4.2
We will form a statistic by comparing the expected number of +− boundaries to the observed number of +−, which we obtain by simply counting the number of shared boundaries with this characteristic (being careful not to double count)

The Joint Count Statistic
For our example five region system, the observed number of shared +− boundaries is 5. The last ingredient we need to be able to build a test statistic is an estimate of the variance in E[+−], and unfortunately, calculating this quantity requires a somewhat involved expression:
V[+−] = E[+−] − E[+−]² + [ Σ Li(Li − 1) P M ] / [ N(N − 1) ] + 4 [ J(J − 1) − Σ Li(Li − 1) ] P(P − 1) M(M − 1) / [ N(N − 1)(N − 2)(N − 3) ]
where Li is the total number of boundaries shared by region i. In our example, V[+−] = 0.56

The Joint Count Statistic
We can now calculate a test statistic to compare the observed number of +− boundaries to the expected number of +− boundaries as a Z-statistic:
Ztest = (Obs.[+−] − E[+−]) / √V[+−]
This test statistic is normally distributed with mean 0 and variance 1, thus we can use the standard normal distribution to assess its significance. An exceptional Z-statistic value would indicate a level of spatial autocorrelation that exceeds the expected amount for our system
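A Python sketch (not from the slides) of the whole join count calculation for the five region example; the slides give only the counts (P = 3, M = 2, 5 observed +− joins), so the specific +/− labeling below is a hypothetical assignment chosen to reproduce those counts, and the code should recover E[+−] = 4.2 and V[+−] = 0.56:
    import numpy as np
    from scipy import stats

    # Binary connectivity matrix for the five-region example (A, B, C, D, E)
    W = np.array([
        [0, 1, 1, 0, 0],
        [1, 0, 1, 1, 0],
        [1, 1, 0, 1, 1],
        [0, 1, 1, 0, 1],
        [0, 0, 1, 1, 0],
    ])
    # Hypothetical binary attribute: 1 = "+", 0 = "-" (chosen so that 5 joins are +-)
    attr = np.array([1, 0, 0, 1, 1])   # A, B, C, D, E

    N = len(attr)
    J = W.sum() / 2                                   # total shared boundaries
    P, M = attr.sum(), N - attr.sum()                 # number of + and - polygons
    L = W.sum(axis=1)                                 # boundaries shared by each region

    # Observed +- joins: pairs (i < j) that share a boundary and differ in attribute
    obs = sum(int(W[i, j] == 1 and attr[i] != attr[j])
              for i in range(N) for j in range(i + 1, N))

    E = 2 * J * P * M / (N * (N - 1))
    sum_LL = np.sum(L * (L - 1))
    V = (E - E**2
         + sum_LL * P * M / (N * (N - 1))
         + 4 * (J * (J - 1) - sum_LL) * P * (P - 1) * M * (M - 1)
           / (N * (N - 1) * (N - 2) * (N - 3)))

    z = (obs - E) / np.sqrt(V)
    p = 2 * stats.norm.sf(abs(z))
    print(f"obs = {obs}, E = {E:.2f}, V = {V:.2f}, z = {z:.3f}, p = {p:.4f}")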

Moran's I Statistic
Thus, for each and every pair of polygons in the system, a weight expresses the degree to which they are spatially related (close to each other, connected, etc.). This weight term is multiplied by an expression that compares the attribute values of each and every pair of polygons, obtained by calculating the mean and standard deviation for the whole data set, and then comparing the z-scores of the variable values for each polygon to those of the other:
Moran's I = [ n Σi Σj wij zi zj ] / [ (n − 1) Σi Σj wij ]
where n is the number of polygons, wij is the weight for the combination of the polygon in column i and the polygon in row j of the connectivity matrix, and zi and zj are z-scores
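A minimal Python sketch (not from the slides) of this formulation of Moran's I, reusing the five-region connectivity matrix as a binary weight matrix and a hypothetical numeric attribute for the regions:
    import numpy as np

    def morans_i(values, W):
        """Moran's I as formulated on the slide: weights times cross-products of z-scores,
        scaled by n / ((n - 1) * sum of weights)."""
        values = np.asarray(values, dtype=float)
        n = len(values)
        z = (values - values.mean()) / values.std(ddof=1)   # z-scores of the attribute
        num = np.sum(W * np.outer(z, z))                     # sum over i, j of w_ij * z_i * z_j
        return n * num / ((n - 1) * W.sum())

    # Five-region binary connectivity matrix used as the weights
    W = np.array([
        [0, 1, 1, 0, 0],
        [1, 0, 1, 1, 0],
        [1, 1, 0, 1, 1],
        [0, 1, 1, 0, 1],
        [0, 0, 1, 1, 0],
    ])
    values = [12.0, 14.5, 15.0, 20.0, 22.5]   # hypothetical attribute for regions A..E
    print(f"Moran's I = {morans_i(values, W):.3f}")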