Hypothesis Testing


1 Hypothesis Testing In this case, we'd be trying to form an inference about that neighborhood: Do people there shop more often than those people who are members of the larger population? To ascertain this, we can make use of the hypothesis testing approach in inferential statistics, which is a multistep process: 1. State the null hypothesis (H0) 2. State the alternative hypothesis (HA) 3. Choose α, our significance level 4. Select a statistical test, and calculate the test statistic 5. Determine the critical value at which H0 will be rejected 6. Compare the test statistic with the critical value

2 Hypothesis Testing - Tests 4. Select a statistical test, and calculate the test statistic. To test the hypothesis, we must construct a test statistic, which frequently takes the form: test statistic = (θ - θ0) / (std. error). For example, using the normal distribution, the z-test is formulated as: z = (x̄ - µ) / σx̄, where σx̄ = σ/√n when σ is known, or σx̄ ≈ s/√n when we have to estimate the standard deviation from the sample data
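The z statistic above can be sketched in a few lines of Python. This is a minimal illustration with made-up numbers (the function name and values are not from the lecture):

```python
import math

def z_test(sample_mean, pop_mean, pop_sd, n):
    """One-sample z statistic: (x-bar - mu) / (sigma / sqrt(n))."""
    std_error = pop_sd / math.sqrt(n)
    return (sample_mean - pop_mean) / std_error

# Hypothetical example: sample of n = 100 with mean 5.2,
# population mean 5.0 and known sigma = 0.8
print(round(z_test(5.2, 5.0, 0.8, 100), 2))  # 2.5
```

When σ is unknown, the same function can be called with the sample standard deviation s in place of σ, as the slide notes.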

3 Hypothesis Testing - Tests 5. Determine the critical value where H0 will be rejected (cont.) For example, suppose we are applying a z-test to compare the mean of a large sample to the mean of a population, and we choose a 95% level of significance. If we formulate our alternative hypothesis as HA: x̄ ≠ µ, we are testing whether x̄ is significantly different from µ in either direction, so the acceptance region must include the central 95% of the normal distribution's area around the mean, and the rejection region must include 2.5% of the area in each of the two tails: we reject H0 when |Z_test| > Z_crit, i.e. when Z_test falls beyond -Z_crit or +Z_crit

4 Hypothesis Testing - Tests 5. Determine the critical value where H0 will be rejected (cont.) On the other hand, suppose we formulate our alternative hypothesis as HA: x̄ > µ or HA: x̄ < µ. Then we are testing whether x̄ is significantly different from µ in a particular direction, so the rejection region must include 5% of the normal distribution's area in one tail, and the acceptance region must include the remaining 95% of the area: e.g. using HA: x̄ > µ, we reject H0 when Z_test > Z_crit

5 Hypothesis Testing - One-Sample Z-test The example we looked at in the last lecture used the one-sample z-test, which is formulated as: Z_test = (x̄ - µ) / (σ/√n), i.e. the difference between means over the standard error. We use this test statistic: 1. To compare a sample mean to the population mean 2. If the size of the sample is reasonably large (conventionally n > 30) 3. When the population standard deviation is known (although we can estimate it from the sample standard deviation), so that we can use this value to calculate the standard error in the denominator

6 Hypothesis Testing - One-Sample t-test The one-sample t-test is formulated very much like the one-sample Z-test we looked at earlier: t_test = (x̄ - µ) / (s/√n), i.e. the difference between means over the standard error. We use this test statistic: 1. To compare a sample mean to the population mean 2. If the size of the sample is somewhat small (conventionally n ≤ 30) 3. When the population standard deviation is unknown: we do not need it to calculate the standard error, although we still need to know the population mean for purposes of comparison with the sample mean
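The one-sample t statistic, with s estimated from the data, can be sketched as follows. The function name and the small sample are hypothetical, chosen only to illustrate the formula:

```python
import math

def t_test_one_sample(sample, pop_mean):
    """One-sample t statistic: (x-bar - mu) / (s / sqrt(n)),
    with s estimated from the sample itself."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (mean - pop_mean) / (s / math.sqrt(n))

# Hypothetical small sample (n = 5) compared against a population mean of 5.0
print(round(t_test_one_sample([4.8, 5.1, 5.4, 4.9, 5.3], 5.0), 3))  # 0.877
```

The resulting statistic would be compared against a critical t value with n - 1 = 4 degrees of freedom.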

7 Hypothesis Testing - Two-Sample t-tests Two-sample t-tests are used to compare one sample mean with another sample mean, rather than with a population parameter. The form of the two-sample t-test that is appropriate depends on whether or not we can treat the variances of the two samples as being equal. If the variances can be assumed to be equal (a condition called homoscedasticity), the t-statistic is: t_test = (x̄1 - x̄2) / (s_p · √(1/n1 + 1/n2)), where s_p is the pooled estimate of the standard deviation: s_p = √[((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2)]

8 Hypothesis Testing - Two-Sample t-tests Two-sample t-tests that use the equal variance assumption have degrees of freedom equal to the sum of the number of observations in the two samples, less two, since we are estimating the values of two means here: df = n1 + n2 - 2. If we cannot assume that the two samples have equal variances, the appropriate t-statistic takes a slightly different form, since we cannot produce a pooled estimate for the standard error portion of the statistic: t_test = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
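Both two-sample forms can be sketched side by side in Python. The function names and sample data are hypothetical; the heteroscedastic version uses the conservative min(n1 - 1, n2 - 1) degrees of freedom discussed in the lecture rather than the involved exact formula:

```python
import math

def pooled_t(x1, x2):
    """Two-sample t assuming equal variances (pooled s_p), df = n1 + n2 - 2."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    s1sq = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    s2sq = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    sp = math.sqrt(((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2))
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def unequal_var_t(x1, x2):
    """Two-sample t without the equal-variance assumption;
    conservative df = min(n1 - 1, n2 - 1)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    s1sq = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    s2sq = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    t = (m1 - m2) / math.sqrt(s1sq / n1 + s2sq / n2)
    return t, min(n1 - 1, n2 - 1)
```

With equal sample sizes and equal sample variances, the two statistics coincide; they diverge as the variances differ.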

9 Hypothesis Testing - Two-Sample t-tests Unfortunately, in the heteroscedastic case (where the variances are unequal), calculating the degrees of freedom appropriate to use for the critical t-score uses a somewhat involved formula (equation 3.17 on p. 50) As an alternative, Rogerson suggests using the lesser value of n1 - 1 or n2 - 1: df = min[(n1 - 1), (n2 - 1)] based on the grounds that this value will always be lower than that produced by the involved calculation, and thus will produce a higher t_crit score at the selected α; this is a conservative assumption because it makes it even harder to reject the null hypothesis mistakenly and commit a type I error

10 Hypothesis Testing - F-test In order to decide whether the variances of two samples are similar enough or different enough to warrant the use of one form of the two-sample t-test or the other, we have a further statistical test that we use to compare the variances. The F-test, a.k.a. the variance ratio test, assesses whether or not the variances are equal by computing a test statistic of the form: F_test = s1² / s2². Critical values are taken from the F-distribution, which has a 2-dimensional array of degrees of freedom (i.e. n1 - 1 df in the numerator, n2 - 1 df in the denominator)
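A sketch of the variance ratio test in Python. The function name is hypothetical, and placing the larger variance in the numerator (so F ≥ 1 and a one-tailed critical value can be used) is a common convention rather than something stated on this slide:

```python
def variance_ratio_f(x1, x2):
    """F = s1^2 / s2^2, with the larger variance placed in the numerator."""
    def var(x):
        m = sum(x) / len(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    s1sq, s2sq = var(x1), var(x2)
    if s1sq < s2sq:
        # swap so the larger variance is on top
        s1sq, s2sq = s2sq, s1sq
        x1, x2 = x2, x1
    # df: (numerator sample size - 1, denominator sample size - 1)
    return s1sq / s2sq, (len(x1) - 1, len(x2) - 1)
```

The returned pair of degrees of freedom indexes into the 2-dimensional F table mentioned above.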

11 Hypothesis Testing - Matched Pairs t-tests The form of the sample statistic is based upon the calculated differences between the two samples: t_test = d̄ / (s_d/√n), where d̄ is the average of the differences and s_d = √[Σ(d_i - d̄)² / (n - 1)]. We use this test statistic: 1. To compare the sample means of paired samples 2. If the size of the samples is somewhat small, i.e. n ≤ 30 3. When the two samples contain members that were not sampled at random but represent observations of the same entities, usually at different times or after some treatment has been applied

12 ANOVA - An F-test The ANOVA F-test is formulated as: F_test = [BSS / (k - 1)] / [WSS / (N - k)], where k is the number of groups, N is the total number of observations, BSS is the between-group sum of squares, and WSS is the within-group sum of squares. The total sum of squares is the sum of the between-group and within-group sums, i.e. TSS = BSS + WSS (important because BSS can be tedious to calculate, but by calculating WSS and TSS, BSS = TSS - WSS)
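The ANOVA F statistic, including the BSS = TSS - WSS shortcut from the slide, can be sketched as follows (the function name and grouping structure are illustrative):

```python
def anova_f(groups):
    """One-way ANOVA F = (BSS/(k-1)) / (WSS/(N-k)),
    recovering BSS via the shortcut BSS = TSS - WSS."""
    all_vals = [v for g in groups for v in g]
    N, k = len(all_vals), len(groups)
    grand = sum(all_vals) / N
    # total sum of squares: deviations from the grand mean
    tss = sum((v - grand) ** 2 for v in all_vals)
    # within-group sum of squares: deviations from each group's own mean
    wss = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    bss = tss - wss          # TSS = BSS + WSS
    return (bss / (k - 1)) / (wss / (N - k))
```

Passing, say, two hypothetical groups `[[1, 2, 3], [4, 5, 6]]` yields a large F because the group means differ far more than the within-group spread.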

13 Arrangement of Data for ANOVA

            Category 1  Category 2  Category 3  ...  Category k
Obs. 1      x11         x12         x13         ...  x1k
Obs. 2      x21         x22         x23         ...  x2k
Obs. 3      x31         x32         x33         ...  x3k
Obs. 4      x41         x42         x43         ...  x4k
...
Obs. i      xi1         xi2         xi3         ...  xik
No. of obs. n1          n2          n3          ...  nk
Mean        x̄+1         x̄+2         x̄+3         ...  x̄+k
Std. Dev.   s1          s2          s3          ...  sk

Overall Mean: x̄++

14 ANOVA Table A useful way to go through the process of calculating an ANOVA is to fill in an ANOVA table:

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square Variance   F-Test
Between Groups        BSS              k - 1                MS_B = BSS/(k - 1)     MS_B / MS_W
Within Groups         WSS              N - k                MS_W = WSS/(N - k)
Total Variation       TSS              N - 1

15 Covariance Formulae The covariance of variable X with respect to variable Y can be calculated using the following formula: Cov[X, Y] = (1/(n - 1)) · [Σ x_i y_i - n·x̄·ȳ], summing over i = 1 to n. The formula for covariance can be expressed in many ways. The following equation is an equivalent expression of covariance (expanding the products via the distributive property): Cov[X, Y] = (1/(n - 1)) · Σ (x_i - x̄)(y_i - ȳ), summing over i = 1 to n
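The equivalence of the two covariance expressions is easy to check numerically. A minimal sketch, with hypothetical function names and data:

```python
def cov_deviations(x, y):
    """Cov[X,Y] = sum((x_i - x-bar)(y_i - y-bar)) / (n - 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def cov_products(x, y):
    """Equivalent computational form: (sum(x_i * y_i) - n * x-bar * y-bar) / (n - 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum(a * b for a, b in zip(x, y)) - n * mx * my) / (n - 1)
```

For any paired data set the two functions return the same value (up to floating-point rounding).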

16 Pearson's Correlation Coefficient A standardized measure of covariance provides a value that describes the degree to which two variables correlate with one another, expressing this using a value ranging from -1 to +1, where -1 denotes a perfect inverse relationship and +1 denotes a perfect positive relationship. One such measure is known as Pearson's Correlation Coefficient (a.k.a. Pearson's Product Moment), and it is produced by standardizing the covariance, dividing it by the product of the standard deviations of the X and Y variables: r = Cov[X, Y] / (s_X · s_Y)

17 Pearson's Correlation Coefficient As is the case with covariance, the correlation coefficient can be expressed in several equivalent ways: r = Σ (x_i - x̄)(y_i - ȳ) / [(n - 1) · s_X · s_Y], summing over i = 1 to n. It can also be expressed in terms of z-scores, which is convenient if you have already calculated them: r = Σ z_x z_y / (n - 1), summing over i = 1 to n
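Pearson's r as the standardized covariance can be sketched directly from the definitions above (function name and data are illustrative):

```python
import math

def pearson_r(x, y):
    """r = Cov[X,Y] / (s_X * s_Y)
       = sum((x_i - x-bar)(y_i - y-bar)) / ((n - 1) * s_X * s_Y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return cov / (sx * sy)
```

A perfectly linear increasing relationship gives r = +1, a perfectly linear decreasing one gives r = -1.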

18 A Significance Test for r The sampling distribution of r follows a t-distribution with (n - 2) degrees of freedom, and we can estimate the standard error of r using: SE_r = √[(1 - r²) / (n - 2)]. The test itself takes the form of the correlation coefficient divided by the standard error, thus: t_test = r / SE_r = r / √[(1 - r²)/(n - 2)] = r·√(n - 2) / √(1 - r²)

19 Spearman's Rank Correlation Coefficient We have an alternative correlation coefficient we can use with ordinal data: Spearman's Rank Correlation Coefficient (r_s): r_s = 1 - [6 · Σ d_i²] / (n³ - n), summing over i = 1 to n, where n = sample size and d_i = the difference in the rankings of each value with respect to each variable
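The rank-based formula can be sketched as follows. This minimal version assigns ranks assuming no tied values (a tie correction would be needed otherwise); the function names are hypothetical:

```python
def spearman_rs(x, y):
    """r_s = 1 - 6 * sum(d_i^2) / (n^3 - n), where d_i is the
    difference between the ranks of each pair. Assumes no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_sq / (n ** 3 - n)
```

Because only the rankings enter the formula, any monotone increasing relationship (however nonlinear) gives r_s = +1.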

20 A Significance Test for r_s As was the case for Pearson's Correlation Coefficient, we can test the significance of an r_s result using a t-test. The test statistic and degrees of freedom are formulated a little differently for r_s, although many of the characteristics of the distribution of r values are present here as well. In this case, r_s values follow a t-distribution with (n - 1) degrees of freedom, and their standard error can be estimated using: SE_rs = 1/√(n - 1), yielding the test statistic: t_test = r_s / SE_rs = r_s · √(n - 1)

21 Simple Linear Regression Simple linear regression models the relationship between an independent variable (x) and a dependent variable (y) using an equation that expresses y as a linear function of x, plus an error term: y = a + bx + e, where x is the independent variable, y is the dependent variable, b is the slope of the fitted line, a is the intercept of the fitted line, and e is the error term (ε)

22 Least Squares Method The least squares method operates mathematically, minimizing the error term e across all points. We can describe the line of best fit we will find using the equation ŷ = a + bx, and you'll recall from the previous slide that the formula for our linear model was expressed using y = a + bx + e. We use the value ŷ on the line to estimate the true value, y. The difference between the two is (y - ŷ) = e; this difference is positive for points above the line, and negative for points below it

23 Error Sum of Squares By squaring the differences between y and ŷ, and summing these values for all points in the data set, we calculate the error sum of squares (usually denoted by SSE): SSE = Σ (y_i - ŷ_i)², summing over i = 1 to n. The least squares method of selecting a line of best fit functions by finding the parameters of a line (intercept a and slope b) that minimize the error sum of squares, i.e. it is known as the least squares method because it finds the line that makes the SSE as small as it can possibly be, minimizing the vertical distances between the line and the points

24 Finding Regression Coefficients The equations used to find the values for the slope (b) and intercept (a) of the line of best fit using the least squares method are: b = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)² (sums over i = 1 to n) and a = ȳ - b·x̄, where x_i is the i-th independent variable value, y_i is the i-th dependent variable value, x̄ is the mean value of all the x_i values, and ȳ is the mean value of all the y_i values
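The two coefficient equations translate directly into code. A minimal sketch with a hypothetical function name and data:

```python
def least_squares(x, y):
    """Slope b = sum((x_i - x-bar)(y_i - y-bar)) / sum((x_i - x-bar)^2);
    intercept a = y-bar - b * x-bar. Returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    b = num / den
    a = my - b * mx
    return a, b
```

For points lying exactly on a line, the fit recovers that line: e.g. the data (1, 3), (2, 5), (3, 7) yield a = 1 and b = 2.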

25 Regression Slope and Correlation The interpretation of the sign of the slope parameter and the correlation coefficient is identical, and this is no coincidence: the numerator of the slope expression, b = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)², is identical to that of the correlation coefficient, r = Σ (x_i - x̄)(y_i - ȳ) / [(n - 1) · s_X · s_Y]. The regression slope can be expressed in terms of the correlation coefficient: b = r · (s_y / s_x)

26 Coefficient of Determination (r²) The regression sum of squares (SSR) expresses the improvement made in estimating y by using the regression line: SSR = Σ (ŷ_i - ȳ)², summing over i = 1 to n. The total sum of squares (SST) expresses the overall variation between the values of y and their mean ȳ: SST = Σ (y_i - ȳ)², summing over i = 1 to n. The coefficient of determination (r²) expresses the amount of variation in y explained by the regression line (the strength of the relationship): r² = SSR / SST

27 Partitioning the Total Sum of Squares We can decompose the total sum of squares into those two components: SST = Σ (y_i - ȳ)² = Σ (ŷ_i - ȳ)² + Σ (y_i - ŷ_i)², with all sums over i = 1 to n. In other words: SST = SSR + SSE, and the coefficient of determination expresses the portion of the total variation in y explained by the regression line
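The partition SST = SSR + SSE can be verified numerically on any fitted line. A sketch that fits by least squares, checks the partition, and returns r² (function name and data are illustrative):

```python
def r_squared(x, y):
    """Fit a least-squares line, verify SST = SSR + SSE, and return r^2 = SSR/SST."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    yhat = [a + b * xi for xi in x]
    ssr = sum((yh - my) ** 2 for yh in yhat)            # explained variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual variation
    sst = sum((yi - my) ** 2 for yi in y)               # total variation
    assert abs(sst - (ssr + sse)) < 1e-9                # the partition holds
    return ssr / sst
```

Points exactly on a line give r² = 1; scattered points give a value between 0 and 1.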

28 Regression ANOVA Table We can create an analysis of variance table that allows us to display the sums of squares, their degrees of freedom, mean square values (for the regression and error sums of squares), and an F-statistic:

Component           Sum of Squares     df      Mean Square     F
Regression (SSR)    Σ (ŷ_i - ȳ)²       1       SSR / 1         MSSR / MSSE
Error (SSE)         Σ (y_i - ŷ_i)²     n - 2   SSE / (n - 2)
Total (SST)         Σ (y_i - ȳ)²       n - 1

29 A Significance Test for r² We can test to see if the regression line has been successful in explaining a significant portion of the variation in y, by performing an F-test. This operates in a similar fashion to how we used the F-test in ANOVA, this time testing the null hypothesis that the true coefficient of determination of the population ρ² = 0, using an F-test formulated as: F_test = r²(n - 2) / (1 - r²) = MSSR / MSSE, which has an F-distribution with degrees of freedom df = (1, n - 2)

30 Significance Tests for Regression Parameters In addition to evaluating the overall significance of a regression model by testing the r² value using an F-test, we can also test the significance of individual regression parameters using t-tests. These t-tests have the regression parameter in some form in the numerator, and the standard error of the regression parameter in the denominator. First, we must calculate the standard error of the estimate, also known as the standard deviation of the residuals (s_e): s_e = √[Σ (y_i - ŷ_i)² / (n - 2)], summing over i = 1 to n

31 Significance Test for Regression Slope We can formulate a t-test to test the significance of the regression slope (b). We will be testing the null hypothesis that the true value of the slope is equal to zero, i.e. H0: β = 0, using the following t-test: t_test = b / s_b, where s_b is the standard deviation of the slope parameter: s_b = √[s_e² / ((n - 1)·s_x²)], and degrees of freedom = (n - 2)

32 Significance Test for Regression Intercept We can formulate a similar t-test to test the significance of the regression intercept (a). We will be testing the null hypothesis that the true value of the intercept is equal to zero, i.e. H0: α = 0, using the following t-test: t_test = a / s_a, where s_a is the standard deviation of the intercept: s_a = √[s_e² · Σ x_i² / (n · Σ (x_i - x̄)²)], and degrees of freedom = (n - 2)

33 Spatial Patterns We will examine methods that are used to analyze patterns in two sorts of spatial data: Point Pattern Analysis - These methods concern themselves with the location information associated with point data (not attributes associated with those locations, just where they are found) Geographic Patterns in Areal Data - These methods are used to examine the pattern of attribute values associated with polygon representations of geographic phenomena (i.e. is there a pattern in the attributes of a set of adjacent polygons?)

34 Point Pattern Analysis While being able to qualitatively describe a point pattern as being {regular, random, clustered} is useful, we want to have a rigorous, quantitative means of describing these patterns We will examine two approaches for doing so: 1. The Quadrat Method - Divide the study area into equal sections, count points per section, and derive a statistic to compare counts to expectations 2. Nearest Neighbor Analysis - Compare the distances between points to an expected distance between points

35 χ² Test in the Quadrat Method Once we have calculated the mean number of points per quadrat and the variance of points per quadrat, we can calculate the χ² test statistic using: χ² = (m - 1)·s²/x̄ = (m - 1) · VMR, where m is the number of quadrats, s² is the variance of the points per quadrat, and x̄ is the mean of the points per quadrat. This χ² test statistic has (m - 1) degrees of freedom, and is compared to a critical value from the χ² distribution, yet another probability distribution for which we have tables of values (Table A.6, p. 221)

36 Summary of the Quadrat Method 1. Divide a study region into m cells of equal size 2. Find the mean number of points per cell (x̄), which is equal to the total number of points divided by the total number of cells 3. Find the variance of the number of points per cell (s²) using: s² = Σ (x_i - x̄)² / (m - 1), summing over i = 1 to m, where x_i is the number of points in cell i

37 Summary of the Quadrat Method 4. Calculate the variance to mean ratio (VMR): VMR = s²/x̄ 5. Interpret the variance to mean ratio (VMR), and if a hypothesis test is desired, calculate the χ² statistic for quadrat analysis: χ² = (m - 1)·s²/x̄ = (m - 1) · VMR, comparing the test statistic to critical values from the χ² distribution with df = (m - 1)
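The whole quadrat procedure reduces to a few lines once the per-cell counts are in hand. A sketch with a hypothetical function name, taking the list of counts per quadrat:

```python
def quadrat_chi2(counts):
    """Quadrat analysis: VMR = s^2 / x-bar, chi^2 = (m - 1) * VMR, df = m - 1."""
    m = len(counts)
    mean = sum(counts) / m
    var = sum((c - mean) ** 2 for c in counts) / (m - 1)
    vmr = var / mean
    return (m - 1) * vmr, m - 1
```

A perfectly even pattern (every quadrat holding the same count) gives VMR = 0, while clustering inflates the variance and pushes VMR, and hence χ², upward.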

38 2. Nearest Neighbor Analysis An alternative approach to assessing a point pattern can be formulated that examines the distances between points in the pattern in terms of the distance between any given point and its nearest neighbor. If we define d_i as the distance between a point and its nearest neighbor, the average distance between neighboring points (R_O) can be written as: R_O = Σ d_i / n, summing over i = 1 to n

39 The Nearest Neighbor Statistic We can also calculate an expected distance between nearest neighbors (R_E) in a point pattern (where the expected pattern conforms to our usual null hypothesis of a random point pattern): R_E = 1 / (2√λ), where λ is the number of points per unit area. The ratio between the observed and expected distances is the nearest neighbor statistic (R): R = R_O / R_E = d̄ / [1 / (2√λ)], where d̄ is the average observed distance between nearest neighbors

40 Interpreting the Nearest Neighbor Statistic Values of R can range from: 0, when all points are coincident and the distances between them are thus 0, up to a theoretical maximum of approximately 2.149, for a perfectly uniform pattern of points spread out on an infinitely large 2-dimensional plane. Through the examination of many random point patterns, the variance of the mean distances between neighbors has been found to be: V[R_E] = (4 - π) / (4πλn), where n is the number of points

41 Interpreting the Nearest Neighbor Statistic Since we have a means of estimating the variance of R_E, we can calculate a standard error for R_E and formulate a test statistic to test the null hypothesis that the pattern is random: Z_test = (R_O - R_E) / √V[R_E] = (R_O - R_E) / √[(4 - π)/(4πλn)]. This test statistic is normally distributed with mean 0 and variance 1, thus we can use the standard normal distribution to assess its significance
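Given the list of nearest-neighbor distances and the study-area size, the statistic and its Z-test can be sketched as follows (function name and inputs are hypothetical):

```python
import math

def nearest_neighbor_test(distances, area):
    """R = R_O / R_E with R_E = 1/(2*sqrt(lambda));
    Z = (R_O - R_E) / sqrt((4 - pi) / (4*pi*lambda*n))."""
    n = len(distances)
    lam = n / area                       # lambda: points per unit area
    r_obs = sum(distances) / n           # observed mean nearest-neighbor distance
    r_exp = 1 / (2 * math.sqrt(lam))     # expected distance under randomness
    se = math.sqrt((4 - math.pi) / (4 * math.pi * lam * n))
    return r_obs / r_exp, (r_obs - r_exp) / se
```

If the observed mean distance happens to equal the random-pattern expectation, R = 1 and Z = 0; clustered patterns give R < 1 (negative Z), dispersed patterns R > 1 (positive Z).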

42 Contingency Tables and the χ² Test Once we have observed and expected frequencies for each cell in the contingency table, we can use those values to calculate the χ² test statistic: χ² = Σ (O - E)² / E, summing over the n cells, where O is the observed frequency, E is the expected frequency, and n is the number of cells. This χ² test statistic has (r - 1)·(c - 1) degrees of freedom, where r and c are the number of rows and columns in the contingency table. If the observed frequencies are very different from the expected frequencies, the χ² test statistic will be larger than the 1-tailed critical value it will be compared to, thus detecting the presence of a spatial pattern

43 The Joint Count Statistic The first step in this method is to enumerate all of the pairs of polygons that share a boundary by creating a binary connectivity table (a.k.a. a spatial matrix). For example, using a five region system with regions A, B, C, D, and E: 1. Label the regions 2. Create a table with the same row and column labels (A through E on both the rows and the columns) 3. Fill in the table with 1s and 0s to indicate which regions share a boundary

44 The Joint Count Statistic We can now take the sum of all the 1s in the binary connectivity table and divide by 2 to calculate the total number of shared boundaries in the system (J): J = (Σ x_i) / 2, where the x_i are the entries of the connectivity table (each shared boundary appears twice in the symmetric table). Next, we are ready to look at the attribute information associated with the polygons to determine if each pair of polygons that shares a boundary has the same values or different values. The joint count statistic is designed to be used with binary nominal attributes, i.e. the attribute values need to be reduced to some 2-class (e.g. + and -) description for use in this statistic

45 The Joint Count Statistic The expected number of +- boundaries is calculated as: E[+-] = 2JPM / (N(N - 1)), where J is the total number of shared boundaries, P is the number of + polygons, M is the number of - polygons, and N is the total number of polygons. For our example, E[+-] is calculated as: E[+-] = 2JPM / (N(N - 1)) = (2·7·3·2) / (5·(5 - 1)) = 84/20 = 4.2. We will form a statistic by comparing the expected number of +- boundaries to the observed number of +- boundaries, which we obtain by simply counting the number of shared boundaries with this characteristic (being careful not to double count)

46 The Joint Count Statistic For our example five region system, the observed number of shared +- boundaries is 5. The last ingredient we need to be able to build a test statistic is an estimate of the variance in E[+-], and unfortunately, calculating this quantity requires a somewhat involved expression: V[+-] = E[+-] + Σ L_i(L_i - 1)PM / (N(N - 1)) + 4·[J(J - 1) - Σ L_i(L_i - 1)]·P(P - 1)M(M - 1) / (N(N - 1)(N - 2)(N - 3)) - E[+-]², where L_i is the total number of boundaries shared by region i. In our example, V[+-] = 0.56

47 The Joint Count Statistic We can now calculate a test statistic to compare the observed number of +- boundaries to the expected number of +- boundaries as a Z-statistic: Z_test = [(Obs. +-) - E[+-]] / √V[+-]. This test statistic is normally distributed with mean 0 and variance 1, thus we can use the standard normal distribution to assess its significance. An exceptional Z-statistic value would indicate a level of spatial autocorrelation that exceeds the expected amount for our system
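The counting stages of the joint count method (J, the observed +- joins, and E[+-]) can be sketched from a connectivity matrix and a list of +/- labels. The function name and the small four-region adjacency below are hypothetical, and the involved variance expression is deliberately omitted for brevity:

```python
def joint_count_summary(conn, labels):
    """Joint count building blocks: J (shared boundaries), the observed
    number of +- joins, and E[+-] = 2JPM / (N(N - 1)).
    conn: symmetric 0/1 connectivity matrix; labels: '+' or '-' per region."""
    N = len(labels)
    J = sum(sum(row) for row in conn) // 2   # each boundary counted twice
    P = labels.count('+')
    M = labels.count('-')
    # count each adjacent pair once (i < j) whose labels differ
    obs = sum(1 for i in range(N) for j in range(i + 1, N)
              if conn[i][j] and labels[i] != labels[j])
    expected = 2 * J * P * M / (N * (N - 1))
    return J, obs, expected
```

The observed and expected counts returned here would then feed the Z-statistic once V[+-] has been computed from the variance formula.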

48 Moran's I Statistic Thus, for each and every pair of polygons in the system, a weight expresses the degree to which they are spatially related (close to each other, connected, etc.). This weight term is multiplied by an expression that compares the attribute values of each and every pair of polygons, by calculating the mean and standard deviation for the whole data set, and then comparing the z-scores of the variable values for each polygon to those of the others: Moran's I = n · ΣΣ w_ij z_i z_j / [(n - 1) · ΣΣ w_ij], where n is the number of polygons, w_ij is the weight for the combination of the polygon in column i and the polygon in row j of the connectivity matrix, and z_i and z_j are z-scores
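Moran's I as formulated on this slide can be sketched directly, using a binary connectivity matrix as the weights (function name and example data are hypothetical):

```python
def morans_i(weights, values):
    """Moran's I = n * sum_ij(w_ij * z_i * z_j) / ((n - 1) * sum_ij(w_ij)),
    with z-scores computed from the whole data set."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    z = [(v - mean) / sd for v in values]
    num = sum(weights[i][j] * z[i] * z[j]
              for i in range(n) for j in range(n))
    wsum = sum(sum(row) for row in weights)
    return n * num / ((n - 1) * wsum)
```

On a hypothetical 2×2 grid of regions, a checkerboard of attribute values (every neighbor pair holding opposite values) gives I = -1, the extreme of negative spatial autocorrelation; positive autocorrelation (like values adjacent) pushes I toward +1.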


More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

Correlation and Linear Regression

Correlation and Linear Regression Correlation and Linear Regression Correlation: Relationships between Variables So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means

More information

CS 5014: Research Methods in Computer Science

CS 5014: Research Methods in Computer Science Computer Science Clifford A. Shaffer Department of Computer Science Virginia Tech Blacksburg, Virginia Fall 2010 Copyright c 2010 by Clifford A. Shaffer Computer Science Fall 2010 1 / 207 Correlation and

More information

Variance. Standard deviation VAR = = value. Unbiased SD = SD = 10/23/2011. Functional Connectivity Correlation and Regression.

Variance. Standard deviation VAR = = value. Unbiased SD = SD = 10/23/2011. Functional Connectivity Correlation and Regression. 10/3/011 Functional Connectivity Correlation and Regression Variance VAR = Standard deviation Standard deviation SD = Unbiased SD = 1 10/3/011 Standard error Confidence interval SE = CI = = t value for

More information

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS In our work on hypothesis testing, we used the value of a sample statistic to challenge an accepted value of a population parameter. We focused only

More information

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables.

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables. Regression Analysis BUS 735: Business Decision Making and Research 1 Goals of this section Specific goals Learn how to detect relationships between ordinal and categorical variables. Learn how to estimate

More information

ECO220Y Simple Regression: Testing the Slope

ECO220Y Simple Regression: Testing the Slope ECO220Y Simple Regression: Testing the Slope Readings: Chapter 18 (Sections 18.3-18.5) Winter 2012 Lecture 19 (Winter 2012) Simple Regression Lecture 19 1 / 32 Simple Regression Model y i = β 0 + β 1 x

More information

Finding Relationships Among Variables

Finding Relationships Among Variables Finding Relationships Among Variables BUS 230: Business and Economic Research and Communication 1 Goals Specific goals: Re-familiarize ourselves with basic statistics ideas: sampling distributions, hypothesis

More information

Lecture 3: Inference in SLR

Lecture 3: Inference in SLR Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals

More information

OHSU OGI Class ECE-580-DOE :Design of Experiments Steve Brainerd

OHSU OGI Class ECE-580-DOE :Design of Experiments Steve Brainerd Why We Use Analysis of Variance to Compare Group Means and How it Works The question of how to compare the population means of more than two groups is an important one to researchers. Let us suppose that

More information

Simple Linear Regression: One Quantitative IV

Simple Linear Regression: One Quantitative IV Simple Linear Regression: One Quantitative IV Linear regression is frequently used to explain variation observed in a dependent variable (DV) with theoretically linked independent variables (IV). For example,

More information

Statistics for Managers using Microsoft Excel 6 th Edition

Statistics for Managers using Microsoft Excel 6 th Edition Statistics for Managers using Microsoft Excel 6 th Edition Chapter 13 Simple Linear Regression 13-1 Learning Objectives In this chapter, you learn: How to use regression analysis to predict the value of

More information

Lecture 15 Multiple regression I Chapter 6 Set 2 Least Square Estimation The quadratic form to be minimized is

Lecture 15 Multiple regression I Chapter 6 Set 2 Least Square Estimation The quadratic form to be minimized is Lecture 15 Multiple regression I Chapter 6 Set 2 Least Square Estimation The quadratic form to be minimized is Q = (Y i β 0 β 1 X i1 β 2 X i2 β p 1 X i.p 1 ) 2, which in matrix notation is Q = (Y Xβ) (Y

More information

BNAD 276 Lecture 10 Simple Linear Regression Model

BNAD 276 Lecture 10 Simple Linear Regression Model 1 / 27 BNAD 276 Lecture 10 Simple Linear Regression Model Phuong Ho May 30, 2017 2 / 27 Outline 1 Introduction 2 3 / 27 Outline 1 Introduction 2 4 / 27 Simple Linear Regression Model Managerial decisions

More information

Much of the material we will be covering for a while has to do with designing an experimental study that concerns some phenomenon of interest.

Much of the material we will be covering for a while has to do with designing an experimental study that concerns some phenomenon of interest. Experimental Design: Much of the material we will be covering for a while has to do with designing an experimental study that concerns some phenomenon of interest We wish to use our subjects in the best

More information

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

Chapte The McGraw-Hill Companies, Inc. All rights reserved. 12er12 Chapte Bivariate i Regression (Part 1) Bivariate Regression Visual Displays Begin the analysis of bivariate data (i.e., two variables) with a scatter plot. A scatter plot - displays each observed

More information

Regression Analysis II

Regression Analysis II Regression Analysis II Measures of Goodness of fit Two measures of Goodness of fit Measure of the absolute fit of the sample points to the sample regression line Standard error of the estimate An index

More information

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006 Chapter 17 Simple Linear Regression and Correlation 17.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 Time allowed: 3 HOURS. STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 This is an open book exam: all course notes and the text are allowed, and you are expected to use your own calculator.

More information

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X. Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.

More information

Applied Regression Modeling: A Business Approach Chapter 2: Simple Linear Regression Sections

Applied Regression Modeling: A Business Approach Chapter 2: Simple Linear Regression Sections Applied Regression Modeling: A Business Approach Chapter 2: Simple Linear Regression Sections 2.1 2.3 by Iain Pardoe 2.1 Probability model for and 2 Simple linear regression model for and....................................

More information

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization. Statistical Tools in Evaluation HPS 41 Dr. Joe G. Schmalfeldt Types of Scores Continuous Scores scores with a potentially infinite number of values. Discrete Scores scores limited to a specific number

More information

PubH 7405: REGRESSION ANALYSIS. MLR: INFERENCES, Part I

PubH 7405: REGRESSION ANALYSIS. MLR: INFERENCES, Part I PubH 7405: REGRESSION ANALYSIS MLR: INFERENCES, Part I TESTING HYPOTHESES Once we have fitted a multiple linear regression model and obtained estimates for the various parameters of interest, we want to

More information

Chapter 4: Regression Models

Chapter 4: Regression Models Sales volume of company 1 Textbook: pp. 129-164 Chapter 4: Regression Models Money spent on advertising 2 Learning Objectives After completing this chapter, students will be able to: Identify variables,

More information

Statistics Introductory Correlation

Statistics Introductory Correlation Statistics Introductory Correlation Session 10 oscardavid.barrerarodriguez@sciencespo.fr April 9, 2018 Outline 1 Statistics are not used only to describe central tendency and variability for a single variable.

More information

Chapter Eight: Assessment of Relationships 1/42

Chapter Eight: Assessment of Relationships 1/42 Chapter Eight: Assessment of Relationships 1/42 8.1 Introduction 2/42 Background This chapter deals, primarily, with two topics. The Pearson product-moment correlation coefficient. The chi-square test

More information

Psychology 282 Lecture #4 Outline Inferences in SLR

Psychology 282 Lecture #4 Outline Inferences in SLR Psychology 282 Lecture #4 Outline Inferences in SLR Assumptions To this point we have not had to make any distributional assumptions. Principle of least squares requires no assumptions. Can use correlations

More information

We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model.

We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model. Statistical Methods in Business Lecture 5. Linear Regression We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model.

More information

Correlation: Relationships between Variables

Correlation: Relationships between Variables Correlation Correlation: Relationships between Variables So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means However, researchers are

More information

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). For example P(X 1.04) =.8508. For z < 0 subtract the value from

More information

Categorical Predictor Variables

Categorical Predictor Variables Categorical Predictor Variables We often wish to use categorical (or qualitative) variables as covariates in a regression model. For binary variables (taking on only 2 values, e.g. sex), it is relatively

More information

Chapter 14 Simple Linear Regression (A)

Chapter 14 Simple Linear Regression (A) Chapter 14 Simple Linear Regression (A) 1. Characteristics Managerial decisions often are based on the relationship between two or more variables. can be used to develop an equation showing how the variables

More information

Correlation. A statistics method to measure the relationship between two variables. Three characteristics

Correlation. A statistics method to measure the relationship between two variables. Three characteristics Correlation Correlation A statistics method to measure the relationship between two variables Three characteristics Direction of the relationship Form of the relationship Strength/Consistency Direction

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

Interactions. Interactions. Lectures 1 & 2. Linear Relationships. y = a + bx. Slope. Intercept

Interactions. Interactions. Lectures 1 & 2. Linear Relationships. y = a + bx. Slope. Intercept Interactions Lectures 1 & Regression Sometimes two variables appear related: > smoking and lung cancers > height and weight > years of education and income > engine size and gas mileage > GMAT scores and

More information

(ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box.

(ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box. FINAL EXAM ** Two different ways to submit your answer sheet (i) Use MS-Word and place it in a drop-box. (ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box. Deadline: December

More information

Chapter 14 Student Lecture Notes Department of Quantitative Methods & Information Systems. Business Statistics. Chapter 14 Multiple Regression

Chapter 14 Student Lecture Notes Department of Quantitative Methods & Information Systems. Business Statistics. Chapter 14 Multiple Regression Chapter 14 Student Lecture Notes 14-1 Department of Quantitative Methods & Information Systems Business Statistics Chapter 14 Multiple Regression QMIS 0 Dr. Mohammad Zainal Chapter Goals After completing

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

The Multiple Regression Model

The Multiple Regression Model Multiple Regression The Multiple Regression Model Idea: Examine the linear relationship between 1 dependent (Y) & or more independent variables (X i ) Multiple Regression Model with k Independent Variables:

More information

Degrees of freedom df=1. Limitations OR in SPSS LIM: Knowing σ and µ is unlikely in large

Degrees of freedom df=1. Limitations OR in SPSS LIM: Knowing σ and µ is unlikely in large Z Test Comparing a group mean to a hypothesis T test (about 1 mean) T test (about 2 means) Comparing mean to sample mean. Similar means = will have same response to treatment Two unknown means are different

More information

In a one-way ANOVA, the total sums of squares among observations is partitioned into two components: Sums of squares represent:

In a one-way ANOVA, the total sums of squares among observations is partitioned into two components: Sums of squares represent: Activity #10: AxS ANOVA (Repeated subjects design) Resources: optimism.sav So far in MATH 300 and 301, we have studied the following hypothesis testing procedures: 1) Binomial test, sign-test, Fisher s

More information

AMS 7 Correlation and Regression Lecture 8

AMS 7 Correlation and Regression Lecture 8 AMS 7 Correlation and Regression Lecture 8 Department of Applied Mathematics and Statistics, University of California, Santa Cruz Suumer 2014 1 / 18 Correlation pairs of continuous observations. Correlation

More information

Ch. 16: Correlation and Regression

Ch. 16: Correlation and Regression Ch. 1: Correlation and Regression With the shift to correlational analyses, we change the very nature of the question we are asking of our data. Heretofore, we were asking if a difference was likely to

More information

Psych 230. Psychological Measurement and Statistics

Psych 230. Psychological Measurement and Statistics Psych 230 Psychological Measurement and Statistics Pedro Wolf December 9, 2009 This Time. Non-Parametric statistics Chi-Square test One-way Two-way Statistical Testing 1. Decide which test to use 2. State

More information

Business Statistics. Chapter 14 Introduction to Linear Regression and Correlation Analysis QMIS 220. Dr. Mohammad Zainal

Business Statistics. Chapter 14 Introduction to Linear Regression and Correlation Analysis QMIS 220. Dr. Mohammad Zainal Department of Quantitative Methods & Information Systems Business Statistics Chapter 14 Introduction to Linear Regression and Correlation Analysis QMIS 220 Dr. Mohammad Zainal Chapter Goals After completing

More information

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression Chapter 12 12-1 North Seattle Community College BUS21 Business Statistics Chapter 12 Learning Objectives In this chapter, you learn:! How to use regression analysis to predict the value of a dependent

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

Section 9.4. Notation. Requirements. Definition. Inferences About Two Means (Matched Pairs) Examples

Section 9.4. Notation. Requirements. Definition. Inferences About Two Means (Matched Pairs) Examples Objective Section 9.4 Inferences About Two Means (Matched Pairs) Compare of two matched-paired means using two samples from each population. Hypothesis Tests and Confidence Intervals of two dependent means

More information

The One-Way Repeated-Measures ANOVA. (For Within-Subjects Designs)

The One-Way Repeated-Measures ANOVA. (For Within-Subjects Designs) The One-Way Repeated-Measures ANOVA (For Within-Subjects Designs) Logic of the Repeated-Measures ANOVA The repeated-measures ANOVA extends the analysis of variance to research situations using repeated-measures

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Chapter 14 Student Lecture Notes 14-1

Chapter 14 Student Lecture Notes 14-1 Chapter 14 Student Lecture Notes 14-1 Business Statistics: A Decision-Making Approach 6 th Edition Chapter 14 Multiple Regression Analysis and Model Building Chap 14-1 Chapter Goals After completing this

More information

4/22/2010. Test 3 Review ANOVA

4/22/2010. Test 3 Review ANOVA Test 3 Review ANOVA 1 School recruiter wants to examine if there are difference between students at different class ranks in their reported intensity of school spirit. What is the factor? How many levels

More information

Lecture 10 Multiple Linear Regression

Lecture 10 Multiple Linear Regression Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Week 12 Hypothesis Testing, Part II Comparing Two Populations

Week 12 Hypothesis Testing, Part II Comparing Two Populations Week 12 Hypothesis Testing, Part II Week 12 Hypothesis Testing, Part II Week 12 Objectives 1 The principle of Analysis of Variance is introduced and used to derive the F-test for testing the model utility

More information

Regression Analysis. Table Relationship between muscle contractile force (mj) and stimulus intensity (mv).

Regression Analysis. Table Relationship between muscle contractile force (mj) and stimulus intensity (mv). Regression Analysis Two variables may be related in such a way that the magnitude of one, the dependent variable, is assumed to be a function of the magnitude of the second, the independent variable; however,

More information

Lecture 6 Multiple Linear Regression, cont.

Lecture 6 Multiple Linear Regression, cont. Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics DETAILED CONTENTS About the Author Preface to the Instructor To the Student How to Use SPSS With This Book PART I INTRODUCTION AND DESCRIPTIVE STATISTICS 1. Introduction to Statistics 1.1 Descriptive and

More information

Simple Linear Regression

Simple Linear Regression 9-1 l Chapter 9 l Simple Linear Regression 9.1 Simple Linear Regression 9.2 Scatter Diagram 9.3 Graphical Method for Determining Regression 9.4 Least Square Method 9.5 Correlation Coefficient and Coefficient

More information

This gives us an upper and lower bound that capture our population mean.

This gives us an upper and lower bound that capture our population mean. Confidence Intervals Critical Values Practice Problems 1 Estimation 1.1 Confidence Intervals Definition 1.1 Margin of error. The margin of error of a distribution is the amount of error we predict when

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression ST 430/514 Recall: a regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates).

More information

Reminder: Student Instructional Rating Surveys

Reminder: Student Instructional Rating Surveys Reminder: Student Instructional Rating Surveys You have until May 7 th to fill out the student instructional rating surveys at https://sakai.rutgers.edu/portal/site/sirs The survey should be available

More information

SMA 6304 / MIT / MIT Manufacturing Systems. Lecture 10: Data and Regression Analysis. Lecturer: Prof. Duane S. Boning

SMA 6304 / MIT / MIT Manufacturing Systems. Lecture 10: Data and Regression Analysis. Lecturer: Prof. Duane S. Boning SMA 6304 / MIT 2.853 / MIT 2.854 Manufacturing Systems Lecture 10: Data and Regression Analysis Lecturer: Prof. Duane S. Boning 1 Agenda 1. Comparison of Treatments (One Variable) Analysis of Variance

More information

Chapter 14. Linear least squares

Chapter 14. Linear least squares Serik Sagitov, Chalmers and GU, March 5, 2018 Chapter 14 Linear least squares 1 Simple linear regression model A linear model for the random response Y = Y (x) to an independent variable X = x For a given

More information

Can you tell the relationship between students SAT scores and their college grades?

Can you tell the relationship between students SAT scores and their college grades? Correlation One Challenge Can you tell the relationship between students SAT scores and their college grades? A: The higher SAT scores are, the better GPA may be. B: The higher SAT scores are, the lower

More information

Econ 3790: Business and Economics Statistics. Instructor: Yogesh Uppal

Econ 3790: Business and Economics Statistics. Instructor: Yogesh Uppal Econ 3790: Business and Economics Statistics Instructor: Yogesh Uppal yuppal@ysu.edu Sampling Distribution of b 1 Expected value of b 1 : Variance of b 1 : E(b 1 ) = 1 Var(b 1 ) = σ 2 /SS x Estimate of

More information

Spatial inference. Spatial inference. Accounting for spatial correlation. Multivariate normal distributions

Spatial inference. Spatial inference. Accounting for spatial correlation. Multivariate normal distributions Spatial inference I will start with a simple model, using species diversity data Strong spatial dependence, Î = 0.79 what is the mean diversity? How precise is our estimate? Sampling discussion: The 64

More information
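The z-test procedure described earlier (state H0 and HA, choose α, compute z = (x̄ − µ)/σx̄, compare against the critical value) can be sketched in code. This is a minimal illustration, not part of the original text: the function name `z_test` and the example numbers (n = 100, x̄ = 42.8, µ = 42, σ = 3) are hypothetical, and the population standard deviation is assumed known.

```python
from statistics import NormalDist  # standard normal CDF/inverse CDF (Python 3.8+)

def z_test(sample_mean, mu, sigma, n, alpha=0.05, two_tailed=True):
    """Large-sample z-test for a mean, assuming sigma is known.

    Returns (z, z_crit, reject): the test statistic, the critical
    value, and whether H0 is rejected at significance level alpha.
    """
    se = sigma / n ** 0.5              # standard error of the mean, sigma/sqrt(n)
    z = (sample_mean - mu) / se        # test statistic
    if two_tailed:
        # HA: x-bar != mu -> alpha/2 in each tail
        z_crit = NormalDist().inv_cdf(1 - alpha / 2)
        reject = abs(z) > z_crit
    else:
        # HA: x-bar > mu -> all of alpha in the upper tail
        z_crit = NormalDist().inv_cdf(1 - alpha)
        reject = z > z_crit
    return z, z_crit, reject

# Hypothetical example: n = 100, sample mean 42.8, H0: mu = 42, sigma = 3
z, z_crit, reject = z_test(42.8, 42, 3, 100)
# z = 0.8 / 0.3 ~ 2.667 > z_crit ~ 1.96, so H0 is rejected
```

Note that the one-tailed branch uses all of α in a single tail, which is why a one-tailed test rejects with a smaller critical value (about 1.645 at α = 0.05) than the two-tailed test (about 1.96), matching the acceptance/rejection regions described above.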