Power and nonparametric methods: Basic statistics for experimental researchers, 2017


Faculty of Health Sciences
Power and nonparametric methods
Basic statistics for experimental researchers, 2017
Julie Lyng Forman, Department of Biostatistics, University of Copenhagen

Outline
- Statistical power
- Nonparametric methods

What's the conclusion?
Does the treatment have an effect? Three answers you can get from your statistical analysis:
- Yes, there is a significant treatment effect.
- No, there is no evidence of a treatment effect, and the confidence limits are sufficiently narrow that we can rule out a practically relevant difference.
- Maybe. We don't see a significant difference, but the confidence limits are so wide that we cannot rule out a relevant difference either.
Can we do anything to avoid the maybes?

Errors in statistical testing
Probabilities of making the right or wrong decision:

                     Decision
Truth                Accept H0                  Reject H0
H0 true              1 - α                      α (type I error = significance level)
H0 false             β (type II error)          1 - β (power)

Usually the significance level is fixed at α = 5%. Preferably 1 - β should be at least 80% (but this is not always the case).
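As a complement to the table above, here is a minimal simulation sketch (not from the slides) that estimates the type I error rate and the power of a two-sample t-test; the settings n = 20 per group, true difference δ = 1 and SD σ = 1 are chosen for illustration only.

set.seed(1)
n <- 20; delta <- 1; sigma <- 1; nsim <- 5000

# p-values when H0 is true (no difference between the groups)
p_null <- replicate(nsim, t.test(rnorm(n, 0, sigma), rnorm(n, 0, sigma))$p.value)
# p-values when H0 is false (true difference = delta)
p_alt <- replicate(nsim, t.test(rnorm(n, delta, sigma), rnorm(n, 0, sigma))$p.value)

mean(p_null < 0.05)   # estimated type I error rate, should be close to α = 0.05
mean(p_alt < 0.05)    # estimated power, 1 - β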

Science as a learning process
If the power is low, then many real differences / effective treatments go undetected in our investigations, and a higher proportion of the significant discoveries will be false.

What does power depend on?
- Sample size.
- True difference / effect size.
- Variability in the outcome, i.e. the population standard deviation.
- Significance level (usually fixed at 5%).
Today we only consider t-tests; otherwise also:
- Other model parameters.
- The statistical method used for the analysis.
- The experimental design.

Detectable differences: a rough impression
- To detect a 1/2 SD difference with 80% power, n = 64 is needed in each group.
- To detect a 1 SD difference with 80% power, n = 17 is needed in each group.
- To detect a 2 SD difference with 80% power, n = 6 is needed in each group.
Based on a two-sample t-test assuming outcomes are normally distributed with equal standard deviations in the two populations / under the two treatments.

Choosing the relevant difference
What difference / effect should the experiment be able to detect?
Principled choice:
- The minimum relevant difference: the smallest difference that could be of practical or clinical relevance, or of general scientific interest. BUT: small differences are more difficult to detect...
Pragmatic choices:
- The smallest difference that it would be embarrassing to overlook.
- The minimum detectable difference that can be found with the available sample size.
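A minimal sketch (not from the slides): the rule-of-thumb sample sizes under "Detectable differences" above can be checked with R's power.t.test function, which is introduced on a later slide.

sapply(c(0.5, 1, 2), function(d)
  ceiling(power.t.test(delta = d, sd = 1, power = 0.80)$n))
# rounds up to 64, 17 and 6 per group, as quoted above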

Estimating variability
To make a sensible estimate of the power, we need an estimate of the variability in the outcome:
- An estimate from a pilot study (beware of statistical uncertainty).
- An expert guess, or similar experiments in the literature. (Normal distribution: roughly 4 SDs from the lower to the upper limit of the normal range.)
After the data have been collected, it might be a good idea to review the power calculation. Was the assumption you made about the standard deviation too optimistic or too pessimistic? Could you plan better in the future?

Approximate power of the two-sample t-test
Textbook formula: the required sample size (in each of two equal-sized groups) to detect a difference of δ with power 1 - β at significance level α is approximately

    n = 2 (z_{1-α/2} + z_{1-β})^2 / (δ/σ)^2

where
- σ is the standard deviation of the outcome,
- z_{1-α/2} = 1.96 for α = 0.05,
- z_{1-β} = 0.84 for 1 - β = 0.80
(the z's are quantiles of the standard normal distribution).
NB: This formula is only valid for larger sample sizes (better to use R).

R: Exact power of the two-sample t-test
Use the power.t.test function in R:

power.t.test(delta=1, sd=1, power=0.80, sig.level=0.05, type="two.sample")

     Two-sample t test power calculation

              n = 16.71477
          delta = 1
             sd = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

Note: the arguments sig.level=0.05 and type="two.sample" are the defaults and can be omitted.

Power of the one-sample/paired t-test
The required sample size (number of pairs) to detect a difference of δ with power 1 - β at significance level α is approximately

    n = (z_{1-α/2} + z_{1-β})^2 / (δ/σ_d)^2

where σ_d is the standard deviation of the differences. But again we can use R for exact power calculations:

power.t.test(delta=1, sd=0.8, power=0.80, sig.level=0.05, type="paired")

     Paired t test power calculation

              n = 7.171643
          delta = 1
             sd = 0.8
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
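To see how good the textbook approximation is, the following sketch (not from the slides) compares it with the exact calculation from power.t.test for the same example as above (δ = 1, σ = 1, α = 0.05, 80% power).

delta <- 1; sigma <- 1; alpha <- 0.05; pow <- 0.80
n_approx <- 2 * (qnorm(1 - alpha/2) + qnorm(pow))^2 / (delta/sigma)^2
n_exact <- power.t.test(delta = delta, sd = sigma, power = pow)$n
c(approximate = n_approx, exact = n_exact)   # roughly 15.7 vs 16.7 per group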

The standard deviation of the differences
Sometimes we do not have a natural estimate of the standard deviation of the differences; then it is useful to know the following formula:

    σ_d^2 = 2 (1 - ρ) σ^2

- σ is the standard deviation of a single outcome (the population SD).
- ρ is the correlation between the two measurements in a pair (more about correlations in lecture 3).
Note: If the correlation is reasonably strong (> 0.50), then the differences are less variable than the single outcomes. This is the reason why paired t-tests are often more powerful than two-sample t-tests.

Attainable power and least detectable difference
If you know in advance that you can only get a limited number of observations, then your power calculation should focus on:

What power do I have for detecting the relevant difference?

power.t.test(n=10, delta=0.5, sd=0.8)

     Two-sample t test power calculation
          power = 0.2622537

What is the smallest difference I can detect with a decent power?

power.t.test(n=10, sd=0.8, power=0.80)

     Two-sample t test power calculation
          delta = 1.059957

If the answers you get aren't reasonable, then it might be a better idea to give up on the investigation instead of wasting time and money.

Power in other situations
Simple power calculators / textbook formulae are available for:
- Two-sample t-tests with unequal sample sizes / variances.
- Comparing two frequencies (2x2 table).
- Simple linear regression / correlation.
In any other case: talk to a statistician about it.

Outline
- Statistical power
- Nonparametric methods
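As a small illustration (not from the slides), the formula above can be used to derive σ_d from the population SD and the within-pair correlation and feed it into a paired power calculation; σ = 0.8 and ρ = 0.5 are hypothetical values chosen so that σ_d matches the paired example shown earlier.

sigma <- 0.8; rho <- 0.5                   # hypothetical values
sigma_d <- sqrt(2 * (1 - rho)) * sigma     # SD of the within-pair differences, here 0.8
power.t.test(delta = 1, sd = sigma_d, power = 0.80, type = "paired")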

The normal assumption
How important is the normal assumption for the t-tests?
- Small sample size, n < 10: Important for obtaining valid results, but hard to assess (use experience, other studies)...
- Medium sample size, 10 ≤ n ≤ 30: Results are most often valid. Beware of marked skewness and/or outliers. Does it make sense to interpret the mean as the typical outcome?
- Larger samples, n > 30: Rarely important for obtaining valid results. Beware of many/extreme outliers. Does it make sense to interpret the mean as the typical outcome?

Alternative: nonparametric methods
Most traditional analyses of continuous outcomes (t-tests, ANOVA, linear regression, Pearson correlation) rely on the assumption that data are normally distributed. Statistical methods that avoid making such distributional assumptions are termed nonparametric or distribution free.
Note: Distribution free is not the same as assumption free.

Why use nonparametric statistics?
1. Data are obviously not normally distributed.
2. We don't know whether data are normally distributed or not, because the sample size is too small to tell.
3. We want to do a fast analysis without having to check whether or not data are normally distributed. (J. W. Tukey referred to nonparametric methods as "quick and dirty".)

Classical nonparametric methods and their conventional counterparts

Normal distribution                        Nonparametric
Paired t-test                              Sign test; Wilcoxon's signed rank test
Two-sample t-test                          Mann-Whitney U-test (equivalent to Wilcoxon rank sum test)
One-way ANOVA                              Kruskal-Wallis test
Two-way ANOVA with repeated measurements   Friedman test
Pearson's correlation                      Spearman's correlation
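For reference, here is a sketch (not from the slides) of the base-R calls that correspond to the test pairs in the table above, illustrated on small simulated data; all variable names are hypothetical.

set.seed(2)
before <- rnorm(12); after <- before + rnorm(12, mean = 0.5)    # paired data
g2 <- gl(2, 10, labels = c("A", "B")); y2 <- rnorm(20) + (g2 == "B")
g3 <- gl(3, 10); y3 <- rnorm(30); x <- rnorm(20)

binom.test(sum(after > before), length(before))   # sign test on the paired differences
wilcox.test(before, after, paired = TRUE)         # Wilcoxon's signed rank test
t.test(y2 ~ g2)                                   # two-sample t-test
wilcox.test(y2 ~ g2)                              # Mann-Whitney U / Wilcoxon rank sum test
summary(aov(y3 ~ g3))                             # one-way ANOVA
kruskal.test(y3 ~ g3)                             # Kruskal-Wallis test
cor.test(x, y2, method = "spearman")              # Spearman's correlation
# friedman.test(y ~ treatment | block) covers the repeated-measures row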

Rank-based analysis
Classical nonparametric analyses are carried out in two steps.
- First, observations are replaced with their ranks: the smallest observation gets rank 1, the second smallest rank 2, etc.
  Example: 17.5 (5), 8.5 (1), 12.6 (4), 8.7 (2), 10.8 (3).
- Secondly, the ranks are used for hypothesis testing.

Two-sample testing
The Wilcoxon-Mann-Whitney test:
- Assumes that observations have continuous distributions.
- The null hypothesis is that the two populations have the same distribution (not just the same median!).
- Compares the ranks between the two groups.
- With the further assumption that the two distributions differ only by a shift in location, it is possible to get a confidence interval for this shift.
If there is no difference between the groups / no effect of treatment, all distributions of ranks between the groups/treatments are equally likely due to randomisation. We have evidence against the null hypothesis if, e.g., one group has all the high ranks and the other all the low ranks.

Two versions of the same test
The Wilcoxon rank sum test (originally for equal sample sizes):
- Add up the ranks in the two groups: R_1 and R_2.
- Use the smallest rank sum as test statistic: W = min(R_1, R_2).
The Mann-Whitney U-test (extension to unequal sample sizes):
- Add up the ranks in the two groups: R_1 and R_2.
- Compute U_i = n_1 n_2 + n_i (n_i + 1)/2 - R_i for i = 1, 2.
- Use max(U_1, U_2) as test statistic.
The two tests are equivalent: you get the exact same p-value!

Trouble shooting: tied data
Problem with ranking if some observations are identical.
Solution: replace ranks with midranks (i.e. the average of the tied ranks).
Example: 17.5 (5), 8.5 (1), 10.8 (3.5), 8.7 (2), 10.8 (3.5); ranks 3 and 4 are replaced by midranks 3.5 and 3.5.
Traditionally, tests have been adjusted for ties:
- Adjusted p-values are approximate, not exact.
- This is problematic if the data set is small or many observations are tied.
Today, exact p-values can be obtained by use of a permutation test. The R package coin can do it, but beware of other software.
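A small sketch (not from the slides) showing ranks and midranks with base R's rank(), and the rank sum / Mann-Whitney U on two hypothetical groups. Note that R's wilcox.test reports U = R_1 - n_1(n_1 + 1)/2 for the first group (labelled W in the output); this convention and the U_i on the slide above sum to n_1 n_2, so they carry the same information.

x <- c(17.5, 8.5, 12.6, 8.7, 10.8)
rank(x)                                     # 5 1 4 2 3, as in the example above

x_tied <- c(17.5, 8.5, 10.8, 8.7, 10.8)
rank(x_tied)                                # midranks: 5.0 1.0 3.5 2.0 3.5

g1 <- c(3.1, 4.7, 5.0); g2 <- c(2.2, 2.9, 6.3, 1.8)   # hypothetical groups
r <- rank(c(g1, g2))
R1 <- sum(r[seq_along(g1)])                 # rank sum of the first group
R1 - length(g1) * (length(g1) + 1) / 2      # Mann-Whitney U for group 1
wilcox.test(g1, g2)$statistic               # the same number, labelled W by R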

Permutation tests
The Wilcoxon-Mann-Whitney test is an example of a permutation test. The distribution of a test statistic under the null hypothesis can be simulated by randomly assigning the ranks to the two treatment groups; i.e., if treatment has no effect, the group labels are just a random labelling.
The idea is quite general and dates back to Fisher's exact test (1934). A wide range of applications is possible today, using computer simulations to randomly reassign treatments to subjects.
- Permutation tests do not make any distributional assumptions and are valid for all sample sizes.
- In case of repeated measurements, the pairing/clustering must be preserved in the permutations.
- A range of permutation tests is available in the R package coin.

Paired testing
Analysis is based on the differences between the paired observations.
The sign test:
- No assumptions, save that the observation pairs have been sampled independently from the same population.
- The null hypothesis is that the median difference is zero.
- Counts the positive/negative differences (zeros don't count).
The Wilcoxon signed rank test:
- Assumes that the differences have a symmetric distribution.
- The null hypothesis is that the median difference is zero.
- Ranks the absolute values of the differences.
- Note: We also get a confidence interval for the median difference.

Beware of skewness
Example: sample median = 0 and P = 1 for the sign test, but P = 0.03 for the Wilcoxon signed rank test. Because the signed rank test assumes a symmetric distribution of the differences, it can reject even when the median difference is zero if the distribution of the differences is skewed.

Drawbacks of nonparametric statistics
Many nonparametric methods aim solely at hypothesis testing. With few exceptions, you get no quantification from the analysis, that is, no estimates or confidence intervals. Limited conclusions can be drawn from testing alone:
- Accept: There may or may not be a difference; we might not have sufficient data to tell.
- Reject: It is very likely that there is a difference, but we don't know how big it is and cannot compare with other studies.
Power calculations are difficult: we need to assume a particular distribution of the data; then we can estimate the power from computer simulations.
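To make the permutation idea concrete, here is a minimal Monte Carlo sketch (not from the slides) of a permutation version of the rank sum test on hypothetical data; the coin package computes the exact counterpart.

set.seed(3)
y <- c(3.1, 4.7, 5.0, 2.2, 2.9, 6.3, 1.8)              # hypothetical outcomes
group <- factor(c("A", "A", "A", "B", "B", "B", "B"))

obs <- sum(rank(y)[group == "A"])                      # observed rank sum in group A
perm <- replicate(10000, sum(rank(y)[sample(group) == "A"]))   # random relabellings
n1 <- sum(group == "A")
mu <- n1 * (length(y) + 1) / 2                         # expected rank sum under H0
# two-sided Monte Carlo p-value; valid here because the null distribution of
# the rank sum is symmetric around its mean when there are no ties
mean(abs(perm - mu) >= abs(obs - mu))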

Common (mis)beliefs about nonparametric tests
"Nonparametric tests are less efficient (powerful) than parametric tests."
- This is only true if the distributional assumptions of the parametric model are correct.
"The efficiency of the Wilcoxon rank sum test is never less than 86% of that of the two-sample t-test."
- This is only true if the two distributions are identical save for a possible shift in location.
The relative efficiency between different tests depends on the true distribution of the data.

R demo

# Sign test (SignTest() is provided, e.g., by the DescTools package).
SignTest(ckd0$aixchange)

# Wilcoxon's signed rank test (assuming a symmetric distribution);
# approximate p-value and confidence interval with:
wilcox.test(ckd0$aix0, ckd0$aix24, paired=TRUE, conf.int=TRUE)

# More tests with the R package coin:
library(coin)

# Wilcoxon's signed rank test;
# exact p-value, but no confidence interval, with:
wilcoxsign_test(aix0 ~ aix24, data=ckd0, distribution="exact")

# Wilcoxon-Mann-Whitney test;
# exact p-value and confidence interval with:
wilcox_test(aixchange ~ group, data=ckd.complete,
            distribution="exact", conf.int=TRUE)
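The efficiency claims above can be explored by simulation, which is also how the previous slide suggests estimating power for nonparametric tests. A minimal sketch (not from the slides), comparing the estimated power of the t-test and the Wilcoxon rank sum test under a skewed (lognormal) alternative with hypothetical settings:

set.seed(4)
n <- 25; nsim <- 2000; shift <- 1                 # hypothetical settings
pt <- pw <- numeric(nsim)
for (i in seq_len(nsim)) {
  x <- rlnorm(n)                                  # skewed control data
  y <- rlnorm(n) + shift                          # same distribution shifted by 1
  pt[i] <- t.test(x, y)$p.value
  pw[i] <- wilcox.test(x, y)$p.value
}
c(t.test = mean(pt < 0.05), wilcoxon = mean(pw < 0.05))   # estimated power of each test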