Introduction to hypothesis testing

Review: Logic of Hypothesis Tests
Usually, we test (attempt to falsify) a null hypothesis (H0), which includes all possibilities except the prediction in the hypothesis of interest (HA).
If the hypothesis (HA) is that an experimental treatment has an effect, the null hypothesis is that there is no effect.
Rejecting H0 is taken as evidence that the actual hypothesis (HA) is true.

Decision criterion
How low a probability should make us reject H0?
If the probability is less than the significance level (critical p-value, α), then reject H0; otherwise do not reject.
Convention sets the significance level at α = 0.05 (5%).
This is arbitrary: other significance levels might be valid, depending on context.

Three special types of hypothesis tests based on the t distribution
1. The mean of a distribution is different from a constant (one sample t test).
2. The mean difference in pairs of observations is different from a constant (paired t test).
3. Two distributions differ, i.e. the means from two sets of observations do not come from the same distribution of means (two sample t test).

t statistic
General form of the t statistic:
$t = \frac{S_t - \theta}{SE}$
where $S_t$ is the sample statistic, $\theta$ is the parameter value specified in H0 and $SE$ is the standard error of the sample statistic.
Specific form for a population mean:
$t = \frac{\bar{y} - \mu_0}{s / \sqrt{n}}$
where $\mu_0$ is the value of the mean specified in H0.

Test statistics
Sampling distributions of t, one for each sample size, when H0 is true: use degrees of freedom (df = n - 1).
The area under each sampling (probability) distribution equals one.
The distributions give the probabilities of obtaining particular ranges of t when H0 is true.

Simple null hypothesis
Test of the hypothesis that the population mean equals a particular value (H0: $\mu = \mu_0$).
These values may come from the literature, other research or legislation.

One sample t-test
[Figure: mean births-to-deaths ratio, Mean(B_To_D), by Group: Europe, Islamic, NewWorld. Data set: Ourworld.]
Populations are fairly stable if the ratio of births to deaths (B/D) is close to 1.25.
H0: B/D ratio = 1.25
HA: B/D ratio ≠ 1.25
1) Are the B/D ratios for any of these groups equal to 1.25?
2) Test using a one sample t-test.

One sample t-tests
Single population: H0: $\mu = 0$ (or any other pre-specified value: here 1.25)
$t = \frac{\bar{y} - 1.25}{s_{\bar{y}}} = \frac{\bar{y} - 1.25}{s / \sqrt{n}}$, with df = n - 1

Results: Europe
[Figure: B/D ratios for Europe shown as 1. box plot, 2. normal approximation, 3. histogram; horizontal axis: Probability.]

More Results

Islamic
Hypothesized Value   1.25
Actual Estimate      3.47825
DF                   15
Std Dev              1.17943
t Test
Test Statistic       7.5570
Prob > |t|           <.0001*
Prob > t             <.0001*
Prob < t             1.0000

New World
Hypothesized Value   1.25
Actual Estimate      3.95091
DF                   20
Std Dev              1.50949
t Test
Test Statistic       8.1995
Prob > |t|           <.0001*
Prob > t             <.0001*
Prob < t             1.0000

Even more: a way to present the results
[Figure: births / deaths (95% CI) by group, with the H0 value marked.]
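The test statistics in the output above can be reproduced from the summary values alone. The following is a minimal sketch, assuming n = DF + 1 for each group; the function name `one_sample_t` is illustrative and not part of the original output.

```python
import math


def one_sample_t(mean, sd, n, mu0):
    """Return (t, df) for a one sample t test from summary statistics."""
    se = sd / math.sqrt(n)  # standard error of the mean
    return (mean - mu0) / se, n - 1


# Summary statistics from the output; n is assumed to be DF + 1.
t_islamic, df_islamic = one_sample_t(3.47825, 1.17943, 16, 1.25)
t_newworld, df_newworld = one_sample_t(3.95091, 1.50949, 21, 1.25)

print(f"Islamic:   t = {t_islamic:.4f}, df = {df_islamic}")
print(f"New World: t = {t_newworld:.4f}, df = {df_newworld}")
```

Both values should agree with the reported test statistics (7.5570 and 8.1995) to rounding.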

Two sample t-test
Used to compare two populations, each of which has been sampled.
The simplest form of tests among multiple populations.
Example: does the average annual income differ for males and females? H0: income (males) = income (females)
[Figure: income by SEX (Female, Male). Data set: Survey2.]
Calculation: H0: $\mu_1 = \mu_2$, i.e. $\mu_1 - \mu_2 = 0$; independent observations.
$t = \frac{(\bar{y}_1 - \bar{y}_2) - (\mu_1 - \mu_2)}{s_{\bar{y}_1 - \bar{y}_2}} = \frac{\bar{y}_1 - \bar{y}_2}{s_{\bar{y}_1 - \bar{y}_2}} = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
where $s_p$ = the pooled standard deviation (more later), and df = $(n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$.

Logic of the two sample t test
Assume H0: $\mu_1 = \mu_2$ and HA: $\mu_1 > \mu_2$.
1) If H0 is true then the null distribution is known (for a set df).
2) If HA is true, we don't know the distribution, but we do know that it is not the null distribution.
[Figure: probability of t against $t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, showing the central t distribution (H0 true) and a non-central t distribution (HA true).]

Assume H0: $\mu_1 = \mu_2$, 4 df
[Figure: central t distribution (H0 true) with the critical value $t_{0.05,\,4\,df}$ = 2.13 marked.]
Any t > 2.13 will lead to incorrect rejection of H0.
1. This means that the difference between $\bar{y}_1$ and $\bar{y}_2$ is greater than 2.13 standard errors (pooled).
2. This will happen 5% of the time.

Assume HA: $\mu_1 > \mu_2$, 4 df
[Figure: non-central t distribution (HA true) with the critical value $t_{0.05,\,4\,df}$ = 2.13 marked.]
Any t < 2.13 will lead to incorrect rejection of HA (that is, a failure to reject H0 when HA is true).
1. This means that the difference between $\bar{y}_1$ and $\bar{y}_2$ is less than 2.13 standard errors (pooled).
2. The probability that this will happen depends on n and the true difference between $\mu_1$ and $\mu_2$.

Results of example
What is the conclusion?
[Output tables: Difference in Means.]
The unequal variance t-test is based on the Satterthwaite adjustment (of degrees of freedom). It is not recommended unless the variance terms are very different and the sample sizes (n) are very different.

[Figure: histograms of annual income for Female and Male; annual income (mean ± SE) by SEX.]

Paired t tests: the logic
1. Often there is interest in comparisons of observations that can be considered paired within a subject or replicate.
   a) For example:
      i. A comparison of activity level before and after eating in the same individual
      ii. A comparison of longevity of males vs females, where county is the replicate
2. In such cases there is often benefit in accounting for variance that could be caused by differences among subjects (or replicates).

Paired observations: Paired t-test
H0: $\mu_d = 0$, where d is the difference between paired observations.
$t = \frac{\bar{d}}{s_{\bar{d}}} = \frac{\bar{d}}{s_d / \sqrt{n_d}}$
where $s_d$ = standard deviation of the sample of differences, and df = $n_d - 1$, where $n_d$ is the number of pairs.

Paired t-test example II
Pisaster comes in two colors along the west coast: purple and orange.
H0: density of purple per site = density of orange
Individual reefs are the replicates of interest.
Looks like a no brainer.
[Figure: density of sea stars by COLOR (Orange, Purple), all sites, treated as two samples.]

Results of a 2 sample test

GROUP     N    Mean         Standard Deviation
Orange    7    144.71429    101.75086
Purple    7    457.28571    353.47829

Pooled Variance
Difference in Means          : -312.57143
95.00% Confidence Interval   : -615.48591 to -9.65695
t                            : -2.24827
df                           : 12.00000
p-value                      : 0.04413

Marginally significant. WHY?
[Figure: counts of NUMBER by COLOR (Orange, Purple); density (95% CI) by color of seastars.]

Consider the variability added at the level of replicate (site).
Given that observations are paired at the level of site, can this be accounted for?
[Figure: density by COLOR (Orange, Purple); density by site (Govpt, Boat, Stair, Shell Beach, Hazards, Cayucos, PSN); density by SITE with COLOR distinguished.]
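The pooled variance t in the two sample output can be recomputed from the group means, standard deviations and sample sizes. This is a minimal sketch under that assumption; the helper name `pooled_t` is illustrative only.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two sample t test with a pooled variance, from summary statistics."""
    ss1 = (n1 - 1) * sd1 ** 2          # sum of squares, group 1
    ss2 = (n2 - 1) * sd2 ** 2          # sum of squares, group 2
    df = n1 + n2 - 2
    sp = math.sqrt((ss1 + ss2) / df)   # pooled standard deviation
    se = sp * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / se, df, se


# Orange vs purple sea star densities (summary values from the output).
t_stars, df_stars, se_stars = pooled_t(144.71429, 101.75086, 7,
                                       457.28571, 353.47829, 7)
print(f"t = {t_stars:.5f}, df = {df_stars}, SE = {se_stars:.3f}")
```

The standard error of the difference is large relative to the difference itself because it includes the site-to-site variation in density.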

Paired test: Details of calculation

Site          Purple   Orange   Difference
Govpt         1023     306      717
Boat          585      155      430
Stair         476      143      333
PSN           233      142      91
Cayucos       107      31       76
Hazards       728      222      506
Shell Beach   49       14       35

Mean difference    312.5714
SE of difference   97.25882
t                  3.21381

[Figure: values for ORANGE and PURPLE by index of case, with each site joined by a line.]
Note the slopes: are they the same? Perhaps rates are a better comparison:
1) Convert to rates, or
2) Log transform

Paired test: Details of calculation: use of log transformed data

Site          Purple (log)   Orange (log)   Difference
Govpt         3.0098756      2.4857214      0.524154
Boat          2.7671559      2.1903317      0.576824
Stair         2.677607       2.155336       0.522271
PSN           2.3673559      2.1522883      0.215068
Cayucos       2.0293838      1.4913617      0.538022
Hazards       2.8621314      2.346353       0.515778
Shell Beach   1.6901961      1.146128       0.544068

Mean difference    0.490884
SE of difference   0.046604
t                  10.53299

[Figure: values for LORANGE and LPURPLE by index of case, with each site joined by a line.]
Note the slopes are much more similar.
This indicates that purples are more common by a constant ratio rather than by a constant amount.
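Both paired calculations can be checked directly from the site counts. The sketch below assumes the logs in the table are base 10 (which matches the tabulated values); the function name `paired_t` is illustrative.

```python
import math
import statistics

# Site order: Govpt, Boat, Stair, PSN, Cayucos, Hazards, Shell Beach
purple = [1023, 585, 476, 233, 107, 728, 49]
orange = [306, 155, 143, 142, 31, 222, 14]


def paired_t(a, b):
    """Return (t, mean difference, SE of difference) for paired samples."""
    d = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(d)
    se_d = statistics.stdev(d) / math.sqrt(len(d))  # s_d / sqrt(n_d)
    return mean_d / se_d, mean_d, se_d


t_raw, mean_raw, se_raw = paired_t(purple, orange)
t_log, mean_log, se_log = paired_t([math.log10(x) for x in purple],
                                   [math.log10(x) for x in orange])

print(f"raw counts: mean d = {mean_raw:.4f}, SE = {se_raw:.5f}, t = {t_raw:.5f}")
print(f"log10:      mean d = {mean_log:.6f}, SE = {se_log:.6f}, t = {t_log:.5f}")
```

The log scale gives a much larger t because the differences between colors are nearly constant across sites once expressed as ratios.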

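Review: calculations of t
One sample test: $t = \frac{\bar{y} - \mu_0}{s / \sqrt{n}}$
Two sample test: $t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Paired test: $t = \frac{\bar{d}}{s_d / \sqrt{n_d}}$

Calculations of Standard Error
1) One sample t-test: $SE = \frac{s}{\sqrt{n}}$, with $s^2 = \frac{SS}{n - 1}$
2) Paired t-test: $SE = \frac{s_d}{\sqrt{n_d}}$, with $s_d^2 = \frac{SS_d}{n_d - 1}$
3) Two sample t-test (calculation based on pooled variance term): $SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, with $s_p^2 = \frac{SS_1 + SS_2}{(n_1 - 1) + (n_2 - 1)} = \frac{SS_1 + SS_2}{n_1 + n_2 - 2}$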

Testing statistical null hypotheses Hypothesis construction

General Hypothesis
A hypothesis that addresses the general question of interest.
H0: There will be no difference in the density of urchins on vertical vs horizontal surfaces
HA: There will be a difference in the density of urchins on vertical vs horizontal surfaces

Specific hypotheses
A hypothesis that represents the specific question addressed in your study. The specifics include:
Location of study
Time period
Replication
Simple description of design

Specific Hypothesis
H0: There will be no difference in the density of (species name) on vertical vs horizontal surfaces, based on 10 replicate quadrats for each treatment randomly placed within site A sampled on date B
HA: There will be a difference in the density of (species name) on vertical vs horizontal surfaces, based on 10 replicate quadrats for each treatment randomly placed within site A sampled on date B
Note: much of this can be placed in the methods section, which would alleviate the need to state these details. However, also note that the hypotheses above are actually what are being tested.

Depiction of hypotheses
H0: There will be no difference in the density of (species name) on vertical vs horizontal surfaces, based on 10 replicate quadrats for each treatment randomly placed within site A sampled on date B
[Figure: horizontal axis Horizontal Density - Vertical Density of Urchins, running from - through 0 (H0) to +; the likelihood that H0 is incorrect increases in both directions away from 0.]

Depiction of hypotheses: what should the units be?
[Figure: horizontal axis Horizontal Density - Vertical Density of Urchins, running from - through 0 (H0) to +; the likelihood that H0 is incorrect increases in both directions away from 0.]

Goal:
To use the same units for all assessments, irrespective of species or system
To have the same set of probabilities based on those units
Hence, units should link to an estimate of confidence.
The most common form is t-values, which provide an estimate of the difference in mean values calibrated by an estimate of error in the assessment of the mean values.

T-statistic
$T = \frac{\bar{X}_1 - \bar{X}_2}{SE}$

SE and SD
$SE = \frac{SD}{\sqrt{N}}$ (standard error; N = number of replicates)
$SD = \sqrt{\frac{\sum_i (X_i - \bar{X})^2}{N - 1}}$ (standard deviation)
Example: X = 30, 40, 45, 37 gives $\bar{X}$ = 38.000, SD = 6.272, SE = 3.136

Depiction of hypotheses: what should the units be?
[Figure: horizontal axis T = (Horizontal Density - Vertical Density of Urchins) / SE, running from - through 0 (H0) to +; the likelihood that H0 is incorrect increases in both directions away from 0.]
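The SD and SE example (30, 40, 45, 37) can be verified with a few lines. This is only a sketch of the two formulas, written out by hand rather than taken from any particular statistics package.

```python
import math

x = [30, 40, 45, 37]
n = len(x)

mean = sum(x) / n
# Sample standard deviation: squared deviations divided by N - 1.
sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))
# Standard error of the mean: SD scaled by the square root of N.
se = sd / math.sqrt(n)

print(f"mean = {mean:.3f}, SD = {sd:.3f}, SE = {se:.3f}")
```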

Depiction of hypotheses: what should the units be?
[Figure: horizontal axis T = (Horizontal Density - Vertical Density of Urchins) / SE, from -3 to 3, centered on 0 (H0); the likelihood that H0 is incorrect increases in both directions away from 0.]
The t distribution (central t) is a null probability distribution.
It depicts the probability of obtaining each value of T when the null hypothesis is correct.
One use is to estimate confidence levels.

H0: There will be no difference in the density of urchins on vertical vs horizontal surfaces
[Figure: t distribution centered on 0; horizontal axis T = (Horizontal Density - Vertical Density of Urchins) / SE, from -3 to 3.]

H0: There will be no difference in the density of urchins on vertical vs horizontal surfaces
Including error yields a confidence interval, e.g. 95% confident that the true t value is between the limits marked on the figure.
[Figure: t distribution with the central 95% CI marked; horizontal axis T = (Horizontal Density - Vertical Density of Urchins) / SE, from -3 to 3.]

HA: There will be a difference in the density of urchins on vertical vs horizontal surfaces
[Figure: t distribution (100% CI) with the central 95% CI and a 2.5% region in each tail; same T axis, from -3 to 3.]

The importance of directionality of the alternative hypothesis (HA)
Consider:
H0: There will be no difference in the density of urchins on vertical vs horizontal surfaces
HA: There will be a difference in the density of urchins on vertical vs horizontal surfaces
vs
H01: Urchin density on horizontal surfaces will be greater than or equal to that on vertical surfaces
HA1: Urchins will be more dense on vertical than on horizontal surfaces

H01: Urchin density on horizontal surfaces will be greater than or equal to that on vertical surfaces
[Figure: t distribution (100% CI) with a 95% CI and a single 5% tail; horizontal axis T = (Horizontal Density - Vertical Density of Urchins) / SE, from -3 to 3.]

HA1: Urchins will be more dense on vertical than on horizontal surfaces
[Figure: t distribution (100% CI) with a 95% CI and a single 5% tail; horizontal axis T = (Horizontal Density - Vertical Density of Urchins) / SE, from -3 to 3.]

One vs two tailed hypotheses
1. Which is more interesting?
2. Which is more informed?
HA1: Urchins will be more dense on vertical than on horizontal surfaces
HA: There will be a difference in the density of urchins on vertical vs horizontal surfaces
[Figure: two panels on the same T axis, from -3 to 3. HA1: 95% CI with a single 5% tail. HA: central 95% CI with 2.5% in each tail.]

One vs two tailed hypotheses
1. Which is more powerful?
HA1: Urchins will be more dense on vertical than on horizontal surfaces (T = -1.79, p = 0.04)
HA: There will be a difference in the density of urchins on vertical vs horizontal surfaces (T = -1.79, p = 0.08)
[Figure: two panels on the axis T = (Horizontal Density - Vertical Density of Urchins) / SE, from -3 to 3. HA1: 95% CI with a single 5% tail. HA: central 95% CI with 2.5% in each tail.]

One vs two tailed hypotheses: conversion to original units
HA1: Urchins will be more dense on vertical than on horizontal surfaces (Difference = -11.78, p = 0.04)
HA: There will be a difference in the density of urchins on vertical vs horizontal surfaces (Difference = -11.78, p = 0.08)
[Figure: the same two panels with the horizontal axis in original units, Horizontal Density - Vertical Density of Urchins, with ticks at -19.5, -13.3, -6.65, 0, 6.65, 13.3, 19.5.]

This is the difference between 1 and 2 tailed hypotheses: make sure you know which you are dealing with.
Always strive for one tailed hypotheses.
Is there a directional prediction (e.g. > or, separately, <)? If so: one tailed. If not: two tailed.

Assumptions of t test
The t test is a parametric test.
The t statistic only follows a t distribution if:
the variable has a normal distribution (normality assumption)
the two groups have equal population variances (homogeneity of variance assumption)
the observations are independent or specifically paired (independence assumption)

Normality assumption
Data in each group are normally distributed.
Checks:
Frequency distributions (be careful)
Boxplots
Probability plots
Formal tests for normality
Solutions:
Transformations
Don't worry, run it anyway (just kidding, but not entirely)

Homogeneity of variance
Population variances are equal in the 2 groups.
Checks:
Subjective comparison of sample variances
Boxplots
F-ratio test of H0: $\sigma_1^2 = \sigma_2^2$
Solutions:
Transformations
Don't worry, run it anyway (just kidding again, but again not entirely)

F-test on variances
H0: $\sigma_1^2 = \sigma_2^2$
F statistic (F-ratio) = ratio of the 2 sample variances: $F = s_1^2 / s_2^2$
Reject H0 if F is sufficiently less than or greater than 1.
If H0 is true, the F-ratio follows an F distribution.
Usual logic of a statistical test.

Boxplot
[Figure: boxplot of LENGTH (axis 50 to 350), labelled with the median, the 25% of values on each side of it within the box, the smallest value and the largest value.]
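As a sketch of the F-ratio, the sample standard deviations from the sea star two sample output (orange 101.75086, purple 353.47829, n = 7 each) can be used. Placing the larger variance in the numerator is a common convention and is an assumption here, not something the slides specify; the function name `variance_ratio` is illustrative.

```python
def variance_ratio(sd1, n1, sd2, n2):
    """Return (F, df_numerator, df_denominator), larger variance on top."""
    v1, v2 = sd1 ** 2, sd2 ** 2
    if v1 >= v2:
        return v1 / v2, n1 - 1, n2 - 1
    return v2 / v1, n2 - 1, n1 - 1


# Orange vs purple sea star densities.
f_ratio, df_num, df_den = variance_ratio(101.75086, 7, 353.47829, 7)
print(f"F = {f_ratio:.2f} on {df_num} and {df_den} df")
```

An F-ratio this far from 1 suggests the equal variance assumption deserves a closer look for the untransformed densities.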

[Figure: histogram of limpet numbers per quadrat (Count against limpet numbers per quadrat, 0 to 90).]
[Figure: example boxplots: 1. IDEAL, 2. SKEWED, 3. OUTLIERS, 4. UNEQUAL VARIANCES.]

Use of transformations to control departures from normality and homogeneity of variances assumptions
[Figure: probability plots of Pop_1990 on the original scale and on a log scale.]

Variance         Pop_1990   Lpop1990
Europe           441        0.17
Islamic          1378       0.30
Newworld         1042       0.34
Greatest ratio   3.12:1     2:1

[Figure: POP_1990 and LPOP1990 by GROUP (Europe, Islamic, NewWorld). Data set: Ourworld.]

Nonparametric tests
Usually based on ranks of the data.
H0: samples come from populations with identical distributions (equal means or medians).
Don't assume a particular underlying distribution of data: normal distributions are not necessary.
Equal variances and independence are still required.
Typically much less powerful than parametric tests.

Mann-Whitney-Wilcoxon test
Calculates the sum of ranks in the 2 samples: these should be similar if H0 is true.
Compares the rank sum to the sampling distribution of rank sums (the distribution of rank sums when H0 is true).
Equivalent to a t test on data transformed to ranks.

Additional slides

A brief digression to re-sampling theory

Number inside   Number outside
3               10
5               7
2               9
8               12
7               8
Mean 5          Mean 9.2

Traditional evaluation would probably involve a t test; another approach is re-sampling.

Resampling

Treatment   Number
Inside      3
Inside      5
Inside      2
Inside      8
Inside      7
Outside     10
Outside     7
Outside     9
Outside     12
Outside     8

1) Assume both treatments come from the same distribution
2) Resample groups of 5 observations, with replacement, but irrespective of treatment

Resampling (continued)
3) Calculate the mean for each group
4) Repeat many times
5) Calculate differences between pairs of means (remember the null hypothesis is that there is no effect of treatment). This generates a distribution of differences.

Mean 1   Mean 2   Difference
8        7.8      0.2
5.6      8.2      2.6
6        9        3
8        5        3
6        6        0
7        8        1
6        6.8      0.8
8        7.2      0.8
8        6.6      1.4
7        8.4      1.4
6        5.4      0.6
7        6.4      0.6
6.4      6.8      0.4
5        3.4      1.6
6.8      4.8      2
6.4      7.2      0.8
7.2      8        0.8
6.4      4.6      1.8
8.4      6        2.4
7.4      6.6      0.8
5.6      8.4      2.8
8.2      6.2      2
7.8      8.4      0.6
8.6      6.6      2
6        10.2     4.2
6.8      5.6      1.2
6.4      7.8      1.4
7.2      4.8      2.4
6.6      7.2      0.6
7        5.2      1.8
6.6      9.8      3.2
8.4      7.8      0.6

Distribution of differences
[Figure: histogram of 1000 resampled differences in means (number of observations and proportion per bar against difference in means, -10 to 10).]
OK, now what?

Compare distribution of differences to real difference
Number inside: 3, 5, 2, 8, 7 (mean 5). Number outside: 10, 7, 9, 12, 8 (mean 9.2).
Real difference = 4.2
Estimate the likelihood that the real difference comes from two similar distributions.

Mean 1   Mean 2   Difference   Proportion of differences less than current
10.2     3.6      6.6          1
10       3.8      6.2          0.999
10.2     4.4      5.8          0.998
9.2      3.6      5.6          0.997
9.8      4.8      5            0.996
8.8      4.2      4.6          0.995
9.6      5.2      4.4          0.994
9.8      5.6      4.2          0.993
9.8      5.8      4            0.992
9.4      5.4      4            0.991
And on through 1000 differences.

Likelihood is 0.007 that the distributions are the same.
What are the constraints of this sort of approach?
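The resampling steps can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used to produce the slide values: the seed, the 10,000 iterations and the one-sided comparison (resampled difference at least as large as the observed one) are all choices made here, so the proportion it reports will not match 0.007 exactly.

```python
import random

inside = [3, 5, 2, 8, 7]
outside = [10, 7, 9, 12, 8]


def resample_p(a, b, n_iter=10000, seed=1):
    """One-sided resampling p-value for mean(b) - mean(a).

    Both groups are pooled (H0: same distribution) and two groups are
    drawn with replacement, irrespective of treatment, on each iteration.
    """
    rng = random.Random(seed)
    pooled = a + b
    observed = sum(b) / len(b) - sum(a) / len(a)
    count = 0
    for _ in range(n_iter):
        mean1 = sum(rng.choices(pooled, k=len(b))) / len(b)
        mean2 = sum(rng.choices(pooled, k=len(a))) / len(a)
        # Small tolerance so ties are not lost to floating point error.
        if mean1 - mean2 >= observed - 1e-9:
            count += 1
    return observed, count / n_iter


observed_diff, p_resample = resample_p(inside, outside)
print(f"observed difference = {observed_diff:.1f}")
print(f"proportion of resampled differences at least as large = {p_resample:.4f}")
```

Changing the seed or the number of iterations changes the estimate slightly, which is one of the constraints of the approach worth discussing.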

T-test vs resampling

Test         P-value
Resampling   0.007
T-test       0.0093

Why the difference?

Additional examples

Worked example
Fecundity of predatory gastropods: samples of 37 and 42 egg capsules of Lepsiella from the littorinid zone and mussel zone respectively.
Counted the number of eggs per capsule.
Null hypothesis: no difference between zones in mean number of eggs per capsule.
Ward & Quinn (1988); Quinn & Keough (2002), Box 3.1

Specify H0 and choose test statistic:
H0: $\mu_M = \mu_L$, i.e. the population mean numbers of eggs per capsule from both zones are equal.
The t statistic is an appropriate test statistic for comparing 2 population means.

Specify a priori significance (probability) level (α):
By convention, use α = 0.05 (5%).

Collect data, check assumptions, calculate test statistic from sample data:

              Mean    SD     n
Littorinid:   8.70    2.03   37
Mussel:       11.36   2.33   42

t = -5.39, df = 77
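The reported t can be checked from the summary statistics with the pooled variance formula. This sketch assumes a littorinid zone SD of 2.03, the value that reproduces the reported t = -5.39 to rounding (an SD of 3.03 would give t of about -4.40); the function name is illustrative.

```python
import math


def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Two sample t (pooled variance) and df from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / se, df


# Littorinid zone vs mussel zone; the SD of 2.03 is an assumption (see above).
t_eggs, df_eggs = pooled_t_from_summary(8.70, 2.03, 37, 11.36, 2.33, 42)
print(f"t = {t_eggs:.2f}, df = {df_eggs}")
```

The small remaining gap from -5.39 is consistent with the means and SDs having been rounded to two decimals.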

Compare the value of the t statistic to its sampling distribution, the probability distribution of the statistic (for the specific df) when H0 is true:
What is the probability of obtaining a t value of 5.39 or greater (in absolute value) from a t distribution with 77 df?
Equivalently, what is the probability of taking samples with the observed or a greater mean difference from 2 populations with the same means?
Probability (from JMP): P < 0.001
Look up in t table: P < 0.05

If the probability of obtaining this value or larger is less than α, conclude H0 is unlikely to be true and reject it: statistically significant result.
Our probability (< 0.001) is less than 0.05, so reject H0: statistically significant result.
If the probability of obtaining this value or larger is greater than α, conclude that H0 is likely to be true and do not reject it: statistically non-significant result.

Presenting results of t test
Methods: An independent t test was used to compare the mean number of eggs per capsule from the two zones. Assumptions were checked with.
Results: The mean number of eggs per capsule from the mussel zone was significantly greater than that from the littorinid zone (t = 5.39, df = 77, P < 0.001; see Fig. 2).