Statistical inference (estimation, hypothesis tests, confidence intervals) Oct 2018
Sampling A trait is measured on each member of a population. f(y) = proportion of individuals in the popn with measurement y. P = probability distn which assigns probability f(y) to y. The trait value of an individual randomly selected from the popn is a random variable Y with distn P. The mean and variance of the trait in the population are m = Σ y f(y), σ² = Σ (y − m)² f(y). These are also the mean and variance of the random variable Y.
Summary statistics A random sample of size n drawn from the popn generates a sequence of observations Y₁, …, Yₙ. If the size of the sample is much less than the size of the popn, they can be treated as independent random variables, each with probability distn P. These observations are random variables, and we can calculate the sampling distn of any summary statistic (sample mean, median, variance, range, etc).
A sampling distribution The sample mean Ȳ might be used to estimate the population mean m. Sampling distn of Ȳ: E(Ȳ) = m and var(Ȳ) = σ²/n, where n is the sample size. It can be shown that if the distn (P) of the trait in the popn is normal, the distn of Ȳ is normal, i.e. Ȳ ~ N(m, σ²/n). When n is large, this will be approximately true even when the distn in the popn is not normal (central limit theorem).
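The two results E(Ȳ) = m and var(Ȳ) = σ²/n can be checked by simulation. A minimal sketch in Python (the later slides use R; the values of m, σ and n here are arbitrary illustrative choices):

```python
import random
import statistics

random.seed(1)
m, sigma, n = 10.0, 2.0, 25   # illustrative popn mean, s.d., and sample size

# Draw 5000 samples of size n and record each sample mean.
means = [statistics.fmean(random.gauss(m, sigma) for _ in range(n))
         for _ in range(5000)]

# The mean of the sample means should be close to m = 10,
# and their variance close to sigma^2/n = 0.16.
print(statistics.fmean(means), statistics.variance(means))
```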
Another sampling distn A binary trait takes one of two possible values, e.g. eyes are blue, eyes are not blue. Let Yᵢ = 1 if the i-th member of the sample has blue eyes, otherwise Yᵢ = 0. Then Ȳ is the proportion of the sample with blue eyes, and the sampling distn of Ȳ is a scaled binomial: Pr(Ȳ = y/n) = C(n, y) m^y (1 − m)^(n−y), for y = 0, 1, …, n, where m is the population mean, i.e. the proportion of the population with blue eyes.
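The scaled-binomial probabilities above are easy to compute directly; a small Python sketch (the choice n = 10, m = 0.3 is arbitrary):

```python
import math

def pr_ybar(y, n, m):
    """Pr(Ybar = y/n): scaled binomial for a binary trait with popn mean m."""
    return math.comb(n, y) * m ** y * (1 - m) ** (n - y)

# The probabilities sum to 1 over y = 0, ..., n.
total = sum(pr_ybar(y, 10, 0.3) for y in range(11))
print(total)
```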
Inference from sample to population A sample provides information about the popn from which it is drawn. For example, the sample mean Ȳ tells us something about the popn mean m. 1) Different samples give rise to different estimates. The value of the estimate cannot be predicted in advance, and we regard it as a random variable with a probability distribution (the sampling distribution of the estimator). 2) The sample estimate will differ from the population parameter, but if the sample is large enough, the estimate will be close to the true value with high probability.
Inference from sample to population Inference usually takes the form of a probability statement based on the sampling distn of the appropriate summary statistic. If the distn of Y in the popn is normal, the sampling distn of Ȳ is N(m, σ²/n). When σ² is known, a simple form of inference takes the form of a statement that the event |Ȳ − m| < kE occurs with high probability. Here E = √(σ²/n) is the standard error of Ȳ, and k is a suitable quantile of the standard normal distn.
Mean square error, bias, variance Here θ represents some feature of the popn, and T is a sample statistic. If T is used as an estimator of θ, the estimation error is T − θ. The mean squared error is E(T − θ)², which we would like to be as small as possible. Let m_T = E(T). The MSE can be split into two components: E(T − θ)² = E(T − m_T)² + (m_T − θ)², i.e. MSE = variance + bias².
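The decomposition can be verified numerically. The sketch below (Python; the five equally likely values of T are a made-up example, not from the slides) computes both sides:

```python
import statistics

# Hypothetical estimator T of theta = 5, taking five equally likely values.
theta = 5.0
values = [3.0, 4.0, 5.0, 6.0, 9.0]

mse = statistics.fmean((t - theta) ** 2 for t in values)
m_t = statistics.fmean(values)                     # E(T)
var = statistics.fmean((t - m_t) ** 2 for t in values)
bias = m_t - theta

# The two numbers agree: MSE = variance + bias^2.
print(mse, var + bias ** 2)
```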
The sample variance The sample variance S² = (n − 1)⁻¹ Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² estimates the population variance σ². An unbiased estimator is obtained by using the divisor n − 1 (rather than n). The standard deviation (square root of the variance) of the sampling distribution of an estimator is called the standard error of the estimator. For example, the standard error of Ȳ is √(σ²/n). Usually the value of σ² is unknown, in which case the estimated standard error is calculated by replacing σ² in the formula by an estimate (e.g. the sample variance): estimated se(Ȳ) = √(S²/n).
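A short Python sketch of these definitions (the four data values are an arbitrary illustration; `statistics.variance` uses the n − 1 divisor):

```python
import math
import statistics

sample = [4, 8, 6, 2]                # small illustrative sample
n = len(sample)

s2 = statistics.variance(sample)     # divisor n - 1, so E(S^2) = sigma^2
se = math.sqrt(s2 / n)               # estimated standard error of the mean

print(s2, se)
```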
The t distribution The sample mean Ȳ is distributed N(m, σ²/n). The standardised value √n(Ȳ − m)/σ has an N(0,1) distribution. Replacing σ by S (the square root of the sample variance S²) changes the distn: √n(Ȳ − m)/S has a t distn with n − 1 d.f. More generally, if Z is N(0, σ²), and S² is an estimate of σ² with f d.f., independent of Z, then Z/S has a t distn with f degrees of freedom. The t distn with few d.f. has thicker tails than the normal distn.
Studentization The form of inference described earlier (the statement that |Ȳ − m| < kE with high probability) requires that we know the value of σ². If this value is unknown, the procedure must be modified: 1) Replace the unknown value of σ² by an estimate, for example S² (the sample variance). 2) Take the quantile k from tables of the t distn (instead of the normal distn). The d.f. for t are those associated with the estimate of σ² (or the sum of squares on which it is based).
Hypothesis tests Test of null hypothesis H₀ (about parameter θ): 1) Choose a summary statistic T (typically, an estimator of θ). 2) Reject H₀ if T ∈ C, where C is a subset of the possible values of T. C is the rejection region, chosen so that P(T ∈ C) when H₀ is true is a small number α, called the significance level, or size of the test. α is the probability of rejecting H₀ when it is true (type I error). The smaller the value of α, the more stringent the test. Failing to reject a false hypothesis is the type II error. The probability of rejecting a false hypothesis is called the power of the test.
Example of a hypothesis test Here we have a random sample from N(m, σ²), and use the sample mean Ȳ as a test statistic for hypotheses about m. The test rejects H₀: m = m₀ when |Ȳ − m₀|/E > k, where E = √(σ²/n) is the standard error of Ȳ and k is a suitable quantile of the standard normal distn (when σ² is known) or the t distn (when σ² is estimated). This test is sometimes called the z test (σ² known), or the one-sample t test (σ² estimated). It is a two-sided (two-tail) test: we reject H₀ if either Ȳ > m₀ + kE or Ȳ < m₀ − kE.
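The known-σ² (z test) case can be sketched in a few lines of Python; the function name and the numerical inputs below are illustrative, not from the slides:

```python
import math
from statistics import NormalDist

def z_test(ybar, m0, sigma2, n, alpha=0.05):
    """Two-sided z test of H0: m = m0 when sigma^2 is known.

    Returns True if H0 is rejected at significance level alpha."""
    se = math.sqrt(sigma2 / n)
    k = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    return abs(ybar - m0) / se > k

# |10.5 - 10.0| / 0.2 = 2.5 > 1.96, so H0 is rejected here.
print(z_test(ybar=10.5, m0=10.0, sigma2=4.0, n=100))
```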
Level of significance By convention, certain values are used as guidelines: 0.05, 0.01, 0.001, representing increasing strength of evidence against H₀. The smaller the significance level, the stronger the evidence. The following descriptions of a significant result are suggested, although there is no general agreement on these:

Significance level   Conclusion when H₀ is rejected
0.05                 Evidence against H₀
0.01                 Strong evidence against H₀
0.001                Very strong evidence against H₀
The p value Given the observed value of the test statistic, the p value is the smallest α at which the test is significant. Alternatively, it is the probability, computed under the null hypothesis, of obtaining a value of the test statistic at least as extreme as the one observed. The p value can be regarded as a measure of the strength of evidence against the hypothesis: the smaller the p value, the stronger the evidence, and the less we are inclined to believe that the hypothesis is true. The p value should not be interpreted as the probability that the hypothesis is true.
Confidence intervals A confidence interval tells us which values of the parameter are consistent with the data. In the case of inference about a normal mean, this is just a matter of rearranging one inequality (on the left) into another (on the right): m − kE < Ȳ < m + kE  ⟺  Ȳ − kE < m < Ȳ + kE. The first statement says that the random variable Ȳ lies between given limits. The second statement says that the random interval (Ȳ − kE, Ȳ + kE) includes the unknown value of m. The value of k is chosen so that the statement is true with a given probability (e.g. 0.95).
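The interval (Ȳ − kE, Ȳ + kE) for known σ² can be sketched as follows (Python; the function name and the inputs are illustrative assumptions):

```python
import math
from statistics import NormalDist

def normal_ci(ybar, sigma2, n, level=0.95):
    """Interval (Ybar - kE, Ybar + kE) for m, with sigma^2 known."""
    e = math.sqrt(sigma2 / n)
    k = NormalDist().inv_cdf((1 + level) / 2)   # 1.96 for a 95% interval
    return ybar - k * e, ybar + k * e

lo, hi = normal_ci(ybar=10.5, sigma2=4.0, n=100)
print(lo, hi)   # roughly 10.5 +/- 1.96 * 0.2
```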
One-sample t-test Ȳ and S 2 are the sample mean and variance of a random sample of size n from N(m, σ 2 ). The variance σ 2 is unknown, and it is required to test H 0 : m = 0. The test statistic is T = Ȳ/E, where E = S 2 /n is the estimated standard error of Ȳ. The null distn of the test statistic is the t distn with n 1 d.f. H 0 is rejected if T > k, where k is a suitable quantile taken from tables of the t distn with n 1 d.f. There are many other versions of the t test. This is the simplest.
Matched pairs experiment The one-sample t test is used to compare two treatments when observations consist of matched pairs. Nine twin pairs are chosen for the experiment. For each pair, one twin (chosen randomly) is given the standard diet (control), the other is given the standard diet plus a food additive.

Pair        1   2   3   4   5   6   7   8   9
Difference  10  2   22  23  6   31  3   7   15

The one-sample t test is applied to the differences (treated minus control). These are regarded as a single sample from a normal distn with mean m. The null hypothesis is H₀: m = 0 (no treatment effect).
One-sample t test The nine differences for the matched-pairs experiment are 10, 2, 22, 23, 6, 31, 3, 7, 15. The sum is 99, the mean is 11.0, and the uncorrected sum of squares is 10² + 2² + ⋯ + 15² = 2397. The corrected sum of squares is 2397 − 99²/9 = 1308. The estimate of σ² is S² = 1308/8 = 163.5. The estimated s.e. of Ȳ is E = √(S²/9) = 4.262, and the t statistic is Ȳ/E = 2.58. The upper 0.025 point of the t distn with 8 d.f. is 2.306, so the two-tail test is significant at the 0.05 level. There is some evidence that the food additive improves growth rate. A 95% confidence interval for the effect of the additive is 11.0 ± 2.306 E (between 1.2 and 20.8 g/d).
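The arithmetic on this slide can be reproduced from the summary numbers (sum 99, uncorrected SSQ 2397, n = 9); a Python sketch, with the t quantile 2.306 hard-coded from the slide:

```python
import math

n, total, ussq = 9, 99, 2397

cssq = ussq - total ** 2 / n       # corrected sum of squares = 1308
s2 = cssq / (n - 1)                # estimate of sigma^2 = 163.5
se = math.sqrt(s2 / n)             # estimated s.e. of Ybar
t = (total / n) / se               # t statistic

# 95% CI using the upper 0.025 point of t with 8 d.f. (2.306).
lo, hi = total / n - 2.306 * se, total / n + 2.306 * se
print(round(se, 3), round(t, 2), round(lo, 1), round(hi, 1))
```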
A neat way to set out the calculations Write down the ANOVA table:

Source    DF  SSQ   MSQ    F
Mean      1   1089  1089   6.661
Residual  8   1308  163.5
Total     9   2397

The value 1089 in the first row is the correction term from the previous slide. In each row, MSQ (mean square) is SSQ (sum of squares) divided by DF. In the last column, F is the ratio of the two MSQs. The t statistic is the square root of F, with the DF of the residual row.
Algebra of the ANOVA table

Source    DF     SSQ              MSQ   F
Mean      1      C.F. = nȲ²       nȲ²   nȲ²/S²
Residual  n − 1  Corrected SSQ    S²
Total     n      Uncorrected SSQ

The square root of F is Ȳ/√(S²/n).
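The identity √F = Ȳ/√(S²/n) can be checked numerically on the matched-pairs figures (n = 9, mean 11.0, corrected SSQ 1308); a Python sketch:

```python
import math

n, ybar, cssq = 9, 11.0, 1308.0

cf = n * ybar ** 2          # correction term (Mean SSQ) = 1089
s2 = cssq / (n - 1)         # residual mean square = 163.5
f = cf / s2                 # F ratio
t = ybar / math.sqrt(s2 / n)

# sqrt(F) equals the t statistic.
print(round(f, 3), round(math.sqrt(f), 2), round(t, 2))
```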
An experiment with two unmatched samples A random sample of n₁ = 9 lambs receives the standard diet plus a food additive. An independent random sample of n₂ = 8 lambs receives the standard diet alone. Growth rates are measured on all 17 lambs.

Treated   108  110  105  131  104  96   115  118  121
Controls  99   95   120  112  80   106  98   102

Assumptions: all measurements are independently normally distributed with variance σ². The population means are m₁ (treated) and m₂ (control). Null hypothesis: m₁ = m₂.
The two-sample t test The test is based on Ȳ₁ − Ȳ₂, which has variance σ²(1/n₁ + 1/n₂). The test statistic is T = (Ȳ₁ − Ȳ₂)/E, where E = √(S²(1/n₁ + 1/n₂)) and S² is an estimate of σ². The null distn of T is the t distn with n₁ + n₂ − 2 d.f.
Calculating the estimate of σ²

          n  sum   uncorrected SSQ
Treated   9  1008  113772
Controls  8  812   83414

Calculate the corrected sum of squares separately for each sample, then pool the sums of squares and degrees of freedom.

          Treated       Controls      Pooled
Source    DF  SSQ       DF  SSQ       DF  SSQ     MSQ
Mean      1   112896    1   82418
Residual  8   876       7   996       15  1872    124.8
Total     9   113772    8   83414     17  197186

The estimate of σ² is S² = 124.8 with 15 d.f., and the estimated s.e. of Ȳ₁ − Ȳ₂ is √(124.8 (1/9 + 1/8)) = 5.428.
Two-sample t test

Ȳ₁     Ȳ₂     S²     E      T
112.0  101.5  124.8  5.428  1.93

T = 10.5/5.428 = 1.93 with 8 + 7 = 15 d.f. The upper 2.5% point of the t distn with 15 d.f. is 2.131. The two-sided test is not quite significant at the 0.05 level: the data are consistent with the null hypothesis that the additive has no effect. A 95% confidence interval for the benefit of the food additive is 10.5 ± 2.131 × 5.428 (between −1.1 and 22.1 g/d).
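The pooled calculation can be reproduced from the sample summaries on the previous slide; a Python sketch:

```python
import math

n1, sum1, ussq1 = 9, 1008, 113772    # treated
n2, sum2, ussq2 = 8, 812, 83414      # controls

ss1 = ussq1 - sum1 ** 2 / n1         # corrected SSQ, treated = 876
ss2 = ussq2 - sum2 ** 2 / n2         # corrected SSQ, controls = 996
s2 = (ss1 + ss2) / (n1 + n2 - 2)     # pooled estimate of sigma^2 = 124.8
e = math.sqrt(s2 * (1 / n1 + 1 / n2))
t = (sum1 / n1 - sum2 / n2) / e      # two-sample t statistic

print(round(s2, 1), round(e, 3), round(t, 2))
```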
Chi-squared goodness-of-fit test Frequencies n₁, …, n_k (with Σ nᵢ = n) are multinomially distributed with probabilities p₁, …, p_k. The probabilities are specified by the null hypothesis H₀. The chi-squared test statistic is X² = Σᵢ₌₁ᵏ (nᵢ − npᵢ)²/(npᵢ), often written Σ (O − E)²/E, where O is the observed frequency nᵢ and E is the expected frequency npᵢ. An alternative formula is X² = Σ O²/E − n. X² is a measure of discrepancy between observed and expected frequencies. A large value indicates departure from H₀ (therefore a one-sided test, with large values significant).
The chi-squared distribution The distn of the sum of squares of ν independent N(0,1) r.v.s is called the chi-squared distn with ν d.f. (ν = 1, 2, 3, …). For example, the corrected sum of squares for a sample of size n from a normal distn has a scaled chi-squared distn with ν = n − 1 d.f. The distn also arises as the null distn of the X² test statistic. The mean of the distribution is ν and the variance is 2ν.

      Upper tail probability (%)
d.f.  10     5      2.5    1      0.5    0.1
1     2.706  3.841  5.024  6.635  7.88   10.83
2     4.605  5.991  7.378  9.210  10.60  13.82
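The definition, mean and variance can be checked by simulation; a Python sketch (ν = 4 and the number of replicates are arbitrary choices):

```python
import random
import statistics

random.seed(2)
nu = 4   # degrees of freedom

# Sum of squares of nu independent N(0,1) variables, repeated 20000 times.
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(nu))
         for _ in range(20000)]

# The mean should be close to nu = 4 and the variance close to 2*nu = 8.
print(statistics.fmean(draws), statistics.variance(draws))
```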
Example 1 A roulette wheel with three compartments is spun 99 times, with the following results. Is the wheel fair?

Side       1   2   3   Total
Frequency  42  27  30  99

The null hypothesis is p₁ = p₂ = p₃ = 1/3. Each expected frequency is equal to 33, and X² = [(42 − 33)² + (27 − 33)² + (30 − 33)²]/33 = 3.818, with 2 d.f. The upper 5% point of the chi-squared distn with 2 d.f. is 5.991, so the result is not significant. There is no evidence of bias: the data are consistent with the wheel being fair.
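Both forms of the test statistic from the goodness-of-fit slide give the same answer on these data; a Python sketch:

```python
obs = [42, 27, 30]                      # roulette frequencies from the slide
n = sum(obs)                            # 99
exp = [n / 3] * 3                       # each expected frequency is 33

x2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
x2_alt = sum(o ** 2 / e for o, e in zip(obs, exp)) - n   # sum O^2/E - n

print(round(x2, 3), round(x2_alt, 3))
```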
Example 2 Sometimes the specification of probabilities by the null hypothesis is incomplete, leaving s parameters to be estimated. In this case the null distn of X² is chi-squared with k − 1 − s d.f. Are the blood group frequencies in the table below consistent with Hardy-Weinberg equilibrium?

MM   MN   NN   Total
233  385  129  747

The H-W hypothesis specifies probabilities p², 2pq and q², where p and q are the M and N allele frequencies, which must be estimated. The estimates are p = 0.5696, q = 0.4304, and the expected frequencies are

MM      MN      NN      Total
242.36  366.26  138.38  747

X² = 1.96 with 3 − 1 − 1 = 1 d.f. (not significant).
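The whole calculation fits in a few lines of Python (p is estimated by gene counting, i.e. the observed proportion of M alleles, which is the maximum-likelihood estimate here):

```python
obs = [233, 385, 129]   # MM, MN, NN counts from the slide
n = sum(obs)            # 747

# Estimate the M allele frequency p by gene counting.
p = (2 * obs[0] + obs[1]) / (2 * n)
q = 1 - p
exp = [n * p ** 2, n * 2 * p * q, n * q ** 2]

x2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
print(round(p, 4), round(x2, 2))
```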
Chi-squared association test Attributes A and B each take one of two possible values. Both are recorded for a sample of N individuals. Is there association between the attributes? (In the table below, a, b, c, and d are frequencies.)

       B₁   B₂   Total
A₁     a    b    a+b
A₂     c    d    c+d
Total  a+c  b+d  N

The null hypothesis is independence of the row and column events, i.e. that Pr(A₁ ∩ B₁) = Pr(A₁) Pr(B₁), for example. The test statistic is X² = Σ (O − E)²/E. The null distn is the chi-squared distn with 4 − 1 − 2 = 1 d.f. The expected frequency for the top-left cell is (a + b)(a + c)/N, etc.
Example Relationship between nasal carrier rate for Streptococcus pyogenes and size of tonsils among 1398 children.

              Not enlarged  Enlarged     Total
Carriers      19 (26.6)     53 (45.4)    72
Non-carriers  497 (489.4)   829 (836.6)  1326
Total         516           882          1398

Expected frequencies are shown in brackets. X² = (19 − 26.6)²/26.6 + 3 more terms = 3.61. (Or use the alternative formula: 19²/26.6 + 3 more terms − 1398.) The test is not significant at the 0.05 level.
Larger tables Calculate the expected frequency for each cell as E = (row total) × (column total)/(grand total), then sum (O − E)²/E over all cells of the table. For a table with r rows and c columns, X² has (r − 1)(c − 1) degrees of freedom. Example: a more detailed breakdown of the tonsils data gives X² = 7.88 with 2 d.f. (P = 0.02).

              None         Mild         Severe
Carriers      19 (26.6)    29 (30.3)    24 (15.1)
Non-carriers  497 (489.4)  560 (558.7)  269 (277.9)
Total         516          589          293
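The general r × c recipe can be sketched as a small Python function (the function name is an illustrative choice), applied here to both versions of the tonsils table:

```python
def chisq_table(table):
    """X^2 for an r x c table: E = (row total)(column total)/(grand total)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    return sum((o - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
               for i, r in enumerate(table) for j, o in enumerate(r))

# 2 x 2 table and 2 x 3 breakdown from the tonsils example.
print(round(chisq_table([[19, 53], [497, 829]]), 2))
print(round(chisq_table([[19, 29, 24], [497, 560, 269]]), 2))
```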
Using R pchisq( ) and qchisq( ) calculate probabilities and quantiles of the chi-squared distn; pt( ) and qt( ) do the same for the t distn. chisq.test( ) deals with the goodness-of-fit and association tests. Note: in the case where parameters are estimated, it reports the wrong d.f. t.test( ) can be used for the one- and two-sample versions of the t test. For the two-sample version, set var.equal = TRUE. binom.test( ) can be used with binomial data. This test is exact, based on binomial probabilities, and usually gives a result similar to the chi-squared test.
Simulation Deriving sampling distns is a job for the mathematical statistician, but an approximate answer can usually be obtained by simulation. The R function replicate( ) is useful here. Example: the code below repeatedly draws samples of size 100 from a normal distn with unit variance, and compares the histogram of the results with the theoretical distn (normal with variance 1/100).

curve(dnorm(x, sd = 0.1), -0.3, 0.3, col = "red", ann = FALSE, las = 1)
hist(replicate(1000, mean(rnorm(100))), freq = FALSE, add = TRUE)