Faculty of Health Sciences
Power and nonparametric methods
Basic statistics for experimental researchers, 2017
Julie Lyng Forman, Department of Biostatistics, University of Copenhagen

Outline
- Statistical power
- Nonparametric methods

What's the conclusion? Errors in statistical testing

Does the treatment have an effect? Three answers you can get from your statistical analysis:
- Yes, there is a significant treatment effect.
- No, there is no evidence of a treatment effect, and the confidence limits are sufficiently narrow that we can rule out a practically relevant difference.
- Maybe. We do not see a significant difference, but the confidence limits are so wide that we cannot rule out a relevant difference either.

Can we do anything to avoid the maybes?

Probabilities of making the right or wrong decision

                     Decision
    Truth        Accept H0                Reject H0
    H0 true      1 - α                    α  (type I error = significance level)
    H0 false     β  (type II error)       1 - β  (power)

Usually the significance level is fixed at α = 5%. Preferably 1 - β should be at least 80% (but this is not always so).
Science as a learning process

If the power is low, then many real differences / effective treatments go undetected in our investigations, and a higher proportion of the significant discoveries will be false.

What does power depend on?
- Sample size.
- True difference / effect size.
- Variability in outcome, i.e. the population standard deviation.
- Significance level (usually fixed at 5%).

Today we only consider t-tests; otherwise also:
- Other model parameters.
- Statistical method for analysis.
- Experimental design.

Detectable differences: a rough impression
- To detect a 1/2 SD difference with 80% power, n = 64 is needed in each group.
- To detect a 1 SD difference with 80% power, n = 17 is needed in each group.
- To detect a 2 SD difference with 80% power, n = 6 is needed in each group.

Based on a two-sample t-test assuming outcomes are normally distributed with equal standard deviations in the two populations / under the two treatments.

Choosing the relevant difference

What difference / effect should be detected by the experiment?

Principled choice: the minimum relevant difference is the smallest difference that could be of practical or clinical relevance or of general scientific interest. BUT: small differences are more difficult to detect...

Pragmatic choices:
- The smallest difference that it would be embarrassing to overlook.
- The minimum detectable difference that can be found with the available sample size.
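The group sizes quoted under "Detectable differences" (64, 17, 6 per group) come from exact power calculations for the two-sample t-test, as done by R's power.t.test. As a sketch of what such a calculation does, here is a minimal Python version using scipy's noncentral t distribution (the function names are my own):

```python
from math import sqrt
from scipy import stats

def two_sample_power(n, delta, sd=1.0, alpha=0.05):
    """Power of a two-sided two-sample t-test with n observations per group."""
    df = 2 * n - 2
    nc = delta / (sd * sqrt(2.0 / n))          # noncentrality parameter
    tcrit = stats.t.ppf(1 - alpha / 2, df)     # critical value
    # P(|T| > tcrit) when T follows a noncentral t distribution
    return (1 - stats.nct.cdf(tcrit, df, nc)) + stats.nct.cdf(-tcrit, df, nc)

def n_per_group(delta, sd=1.0, power=0.80, alpha=0.05):
    """Smallest integer group size reaching the target power."""
    n = 2
    while two_sample_power(n, delta, sd, alpha) < power:
        n += 1
    return n
```

For example, n_per_group(0.5) returns 64, the first of the group sizes quoted above.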
Estimating variability

To make a sensible estimate of the power, we need an estimate of the variability in the outcome:
- An estimate from a pilot study (beware of statistical uncertainty).
- An expert guess or similar experiments in the literature. (Normal distribution: roughly 4 SDs from the lower to the upper limit of the normal range.)

After data has been collected, it may be a good idea to review the power calculation. Was the assumption you made about the standard deviation too optimistic or pessimistic? Could you plan better in the future?

Approximate power of the two-sample t-test

Textbook formula: the required sample size (in each of two equal-size groups) to detect a difference of δ with power 1-β at significance level α is approximately

    n = 2 (z_{1-α/2} + z_{1-β})² / (δ/σ)²

where
- σ is the standard deviation of the outcome,
- z_{1-α/2} = 1.96 for α = 0.05,
- z_{1-β} = 0.84 for 1-β = 0.80
(the z's are quantiles of the standard normal distribution).

NB: this formula is only valid for larger sample sizes (better to use R).

R: exact power of the two-sample t-test

Use the power.t.test function in R:

    power.t.test(delta=1, sd=1, power=0.80, sig.level=0.05, type="two.sample")

         Two-sample t test power calculation

                  n = 16.71477
              delta = 1
                 sd = 1
          sig.level = 0.05
              power = 0.8
        alternative = two.sided

    NOTE: n is number in *each* group

Note: the arguments sig.level=0.05 and type="two.sample" are the defaults and can be omitted.

Power of the one-sample/paired t-test

The required sample size (number of pairs) to detect a difference of δ with power 1-β at significance level α is approximately

    n = (z_{1-α/2} + z_{1-β})² / (δ/σ_d)²

where σ_d is the standard deviation of the differences. But again we can use R for exact power calculations:

    power.t.test(delta=1, sd=0.8, power=0.80, sig.level=0.05, type="paired")

         Paired t test power calculation

                  n = 7.171643
              delta = 1
                 sd = 0.8
          sig.level = 0.05
              power = 0.8
        alternative = two.sided

    NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
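The textbook approximation for the two-sample case can be evaluated directly. A small Python sketch using scipy's normal quantiles (function name is mine):

```python
from scipy.stats import norm

def approx_n_two_sample(delta, sd, power=0.80, alpha=0.05):
    """Normal-approximation sample size per group for the two-sample t-test:
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / (delta/sd)^2."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for power = 0.80
    return 2 * (z_alpha + z_beta) ** 2 / (delta / sd) ** 2
```

For delta = sd = 1 this gives about 15.7, noticeably below the exact answer 16.71 from power.t.test, which illustrates why the formula should only be trusted for larger sample sizes.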
The standard deviation of the differences

Sometimes we do not have a natural estimate of the standard deviation of the differences; then the following formula is useful:

    σ_d² = 2 (1 - ρ) σ²

- σ is the standard deviation of a single outcome (the population SD).
- ρ is the correlation between the two measurements in a pair (more about correlations in lecture 3).

Note: if the correlation is reasonably strong (ρ > 0.50), the differences are less variable than the single outcomes. This is the reason why paired t-tests are often more powerful than two-sample t-tests.

Attainable power and least detectable difference

If you know in advance that you can only get a limited number of observations, then your power calculation should focus on...

What power do I have for detecting the relevant difference?

    power.t.test(n=10, delta=0.5, sd=0.8)

         Two-sample t test power calculation
              power = 0.2622537

What is the smallest difference I can detect with decent power?

    power.t.test(n=10, sd=0.8, power=0.80)

         Two-sample t test power calculation
              delta = 1.059957

If the answers you get are not reasonable, it might be better to give up on the investigation rather than waste time and money.

Power in other situations

Simple power calculators / textbook formulae are available for:
- Two-sample t-tests with unequal sample sizes / variances.
- Comparing two frequencies (2x2 table).
- Simple linear regression / correlation.

In any other case: talk to a statistician about it.
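Before moving on to nonparametric methods: the formula σ_d² = 2(1 - ρ)σ² for the standard deviation of the differences can be checked numerically. In the sketch below, the correlation value 0.68 is purely illustrative, chosen so that the result matches the sd=0.8 used in the paired power.t.test example:

```python
from math import sqrt

def sd_of_differences(sd, rho):
    """SD of within-pair differences: sqrt(2 * (1 - rho) * sd^2)."""
    return sqrt(2.0 * (1.0 - rho) * sd ** 2)

# Population SD 1 and an (illustrative) correlation of 0.68
# give differences with SD 0.8.
sd_d = sd_of_differences(1.0, 0.68)
```

Note that at ρ = 0.5 the two SDs coincide; any stronger correlation makes the differences less variable than the single outcomes, and hence the paired design more powerful.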
The normal assumption

How important is the normality assumption for the t-tests?
- Small samples, n < 10: important for obtaining valid results, but hard to assess (use experience, other studies)...
- Medium samples, 10 ≤ n ≤ 30: results are most often valid. Beware of marked skewness and/or outliers. Does it make sense to interpret the mean as the typical outcome?
- Larger samples, n > 30: rarely important for obtaining valid results. Beware of many/extreme outliers. Does it make sense to interpret the mean as the typical outcome?

Alternative: nonparametric methods

Most traditional analyses of continuous outcomes (t-tests, ANOVA, linear regression, Pearson correlation) rely on an assumption that data are normally distributed. Statistical methods that avoid making such distributional assumptions are termed nonparametric or distribution free.

Note: distribution free is not the same as assumption free.

Why use nonparametric statistics?
1. Data is obviously not normally distributed.
2. We don't know whether data is normally distributed or not, because the sample size is too small to tell.
3. We want a fast analysis without having to check whether or not data is normally distributed. (J.W. Tukey referred to nonparametric methods as "quick and dirty".)

Classical nonparametric methods and their conventional counterparts

    Normal-distribution method                  Nonparametric counterpart
    Paired t-test                               Sign test; Wilcoxon's signed rank test
    Two-sample t-test                           Mann-Whitney U-test (equivalent to the
                                                Wilcoxon rank sum test)
    One-way ANOVA                               Kruskal-Wallis test
    Two-way ANOVA with repeated measurements    Friedman test
    Pearson's correlation                       Spearman's correlation
Rank-based analysis

Classical nonparametric analyses are carried out in two steps:
- First, observations are replaced with their ranks: the smallest observation gets rank 1, the second smallest rank 2, etc. Example: 17.5 (5), 8.5 (1), 12.6 (4), 8.7 (2), 10.8 (3).
- Secondly, the ranks are used for hypothesis testing.

Two-sample testing

The Wilcoxon-Mann-Whitney test:
- Assumes that observations have continuous distributions.
- The null hypothesis is that the two populations have the same distribution (not just the same median!).
- Compares the ranks between the two groups.
- With the further assumption that the two distributions differ only by a shift in location, it is possible to get a confidence interval for this shift.

If there is no difference between the groups / no effect of treatment, all allocations of ranks to the groups/treatments are equally likely due to randomisation. We have evidence against the null hypothesis if, e.g., one group has all the high ranks and the other all the low ranks.

Two versions of the same test

The Wilcoxon rank sum test (originally for equal sample sizes):
- Add up the ranks in each of the two groups: R_1 and R_2.
- Use the smallest rank sum as test statistic: W = min(R_1, R_2).

The Mann-Whitney U-test (extension to unequal sample sizes):
- Add up the ranks in each of the two groups: R_1 and R_2.
- Compute U_i = n_1 n_2 + n_i(n_i + 1)/2 - R_i for i = 1, 2.
- Use max(U_1, U_2) as test statistic.

The two tests are equivalent - you get the exact same p-value!

Troubleshooting: tied data

There is a problem with ranking if some observations are identical.
Solution: replace the tied ranks with midranks (i.e. the average of the tied ranks). Example: 17.5 (5), 8.5 (1), 10.8 (3.5), 8.7 (2), 10.8 (3.5) - ranks 3 and 4 are replaced by midranks 3.5 and 3.5.

Traditionally, tests have been adjusted for ties, but the adjusted p-values are approximate, not exact. This is problematic if the sample is small or many observations are tied. Today, exact p-values can be obtained by means of a permutation test. The R package coin can do this, but beware of other software.
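The ranking and midranking steps above can be reproduced with scipy's rankdata, which assigns midranks to ties by default (the data values are the slide's own examples):

```python
from scipy.stats import rankdata

values = [17.5, 8.5, 12.6, 8.7, 10.8]
ranks = rankdata(values)          # 5, 1, 4, 2, 3 as in the slide example

tied = [17.5, 8.5, 10.8, 8.7, 10.8]
midranks = rankdata(tied)         # 5, 1, 3.5, 2, 3.5: ranks 3 and 4 become 3.5
```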
Permutation tests

The Wilcoxon-Mann-Whitney test is an example of a permutation test. The distribution of a test statistic under the null hypothesis can be simulated by randomly assigning the ranks to the two treatment groups: if treatment has no effect, the group labels are just a random labelling.

The idea is quite general and dates back to Fisher's exact test (1934). A wide range of applications is possible today, using computer simulations to randomly reassign treatments to subjects. Permutation tests do not make any distributional assumptions and are valid for all sample sizes. In case of repeated measurements, the pairing/clustering must be preserved in the permutations. A range of permutation tests is available in the R package coin.

Paired testing

Analysis is based on the differences between the paired observations.

The sign test:
- No assumptions apart from observation pairs having been sampled independently from the same population.
- The null hypothesis is that the median difference is zero.
- Counts the positive/negative differences (zeros don't count).

The Wilcoxon signed rank test:
- Assumes that the differences have a symmetric distribution.
- The null hypothesis is that the median difference is zero.
- Ranks the absolute values of the differences.
- Note: we also get a confidence interval for the median difference.

Beware of skewness

Example: sample median = 0 and P = 1 for the sign test, but P = 0.03 for the Wilcoxon signed rank test. With skewed differences the two tests can disagree, because the signed rank test's symmetry assumption is violated.

Drawbacks of nonparametric statistics

Many nonparametric methods aim solely at hypothesis testing. With few exceptions, you get no quantification from the analysis, that is, no estimates or confidence intervals. Limited conclusions can be drawn from testing alone:
- Accept: there may or may not be a difference; we might not have sufficient data to tell.
- Reject: it is very likely that there is a difference, but we don't know how big it is and cannot compare with other studies.

Power calculations are difficult:
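To make the permutation idea concrete, here is a minimal Python sketch of an exact Wilcoxon-Mann-Whitney test that enumerates every way of allocating the pooled ranks to the first group (only feasible for small samples; the function name is my own):

```python
from itertools import combinations
from scipy.stats import rankdata

def exact_rank_sum_test(x, y):
    """Two-sided exact p-value for the Wilcoxon-Mann-Whitney test,
    computed by enumerating all allocations of ranks to group 1."""
    pooled = list(x) + list(y)
    ranks = rankdata(pooled)                 # midranks in case of ties
    n1, total_n = len(x), len(pooled)
    observed = sum(ranks[:n1])               # rank sum of the first group
    expected = n1 * (total_n + 1) / 2.0      # its mean under H0
    extreme = total = 0
    for subset in combinations(range(total_n), n1):
        s = sum(ranks[i] for i in subset)
        total += 1
        # count allocations at least as extreme as the observed one
        if abs(s - expected) >= abs(observed - expected) - 1e-9:
            extreme += 1
    return extreme / total
```

For x = [1, 2, 3] and y = [4, 5, 6] this returns 2/20 = 0.1: of the 20 possible rank allocations, only the observed split and its mirror image are as extreme.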
We need to assume a particular distribution for the data; then the power can be estimated by computer simulation.
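As an illustration of power by simulation, the sketch below estimates the power of the Wilcoxon-Mann-Whitney test when the data are assumed normal with a one-SD shift (the assumed distribution is the analyst's choice; all names here are mine):

```python
import numpy as np
from scipy import stats

def simulated_power(n, delta, sd=1.0, alpha=0.05, nsim=2000, seed=None):
    """Monte Carlo power of the two-sided Wilcoxon-Mann-Whitney test
    for a shift of delta between two normal populations, n per group."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(nsim):
        x = rng.normal(0.0, sd, n)
        y = rng.normal(delta, sd, n)
        p = stats.mannwhitneyu(x, y, alternative="two-sided").pvalue
        rejections += p < alpha
    return rejections / nsim
```

With n = 17 and delta = sd = 1 (where the two-sample t-test has 80% power), the estimate comes out a little below 0.80, reflecting the Wilcoxon test's small efficiency loss under normality.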
Common (mis)beliefs about nonparametric tests

"Nonparametric tests are less efficient (powerful) than parametric tests." - This is only true if the distributional assumptions of the parametric model are correct.

"The efficiency of the Wilcoxon rank sum test is never less than 86% of that of the two-sample t-test." - This is only true if the two distributions are identical apart from a possible shift in location.

The relative efficiency of different tests depends on the true distribution of the data.

R demo

    # Sign test (SignTest is from the R package DescTools).
    SignTest(ckd0$aixchange)

    # Wilcoxon's signed rank test (assuming a symmetric distribution);
    # approximate p-value and confidence interval:
    wilcox.test(ckd0$aix0, ckd0$aix24, paired=TRUE, conf.int=TRUE)

    # More tests with the R package coin:
    library(coin)

    # Wilcoxon's signed rank test;
    # exact p-value, but no confidence interval:
    wilcoxsign_test(aix0 ~ aix24, data=ckd0, distribution="exact")

    # Wilcoxon-Mann-Whitney test;
    # exact p-value and confidence interval:
    wilcox_test(aixchange ~ group, data=ckd.complete,
                distribution="exact", conf.int=TRUE)
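The point that relative efficiency depends on the true distribution can be demonstrated by simulation. Below, data are drawn from a heavy-tailed t distribution with 3 degrees of freedom, a setting where the rank test is generally expected to beat the t-test (an illustrative sketch; the function name and parameter values are my own):

```python
import numpy as np
from scipy import stats

def power_compare(n=30, delta=0.7, nsim=800, seed=2):
    """Monte Carlo power of the t-test vs. the Wilcoxon-Mann-Whitney
    test under heavy-tailed t(3) noise with a location shift of delta."""
    rng = np.random.default_rng(seed)
    t_rejections = w_rejections = 0
    for _ in range(nsim):
        x = stats.t.rvs(3, size=n, random_state=rng)
        y = stats.t.rvs(3, size=n, random_state=rng) + delta
        t_rejections += stats.ttest_ind(x, y).pvalue < 0.05
        w_rejections += stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < 0.05
    return t_rejections / nsim, w_rejections / nsim
```

In this setting the Wilcoxon power typically comes out clearly above the t-test power: the opposite ranking of the two tests compared to the normal case.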