Sensitiveness analysis: Sample sizes for t-tests for paired samples (J. D. Perezgonzalez, 2016, Massey University, New Zealand; doi: 10.13140/RG.2.2.32249.47203)

Table 1 shows the sample sizes required for obtaining a statistically significant result for a desired minimum effect size (MES) when carrying out Fisher's tests of significance (e.g., Fisher, 1954) to assess mean differences between paired observations (dependent means) using t-tests.

Table 1. Sample sizes for paired-sample t-tests

                  sig = 0.001          sig = 0.01           sig = 0.05
MES (d_z, d_4)  2-tailed  1-tailed   2-tailed  1-tailed   2-tailed  1-tailed
0.05                4329      3818       2653      2164       1536      1082
0.10                1000       960        667       544        387       273
0.15                 487       430        299       244        174       123
0.20                 277       244        170       139         99        70
0.25                 180       158        110        90         64        46
0.30                 127       112         78        64         46        32
0.35                  95        84         58        48         34        24
0.40                  74        65         46        38         27        19
0.45                  60        53         37        30         22        16
0.50                  50        44         31        25         18        13
0.55                  42        37         26        22         16        11
0.60                  36        32         23        19         14        10
0.65                  32        28         20        17         12         9
0.70                  28        25         18        15         11         8
0.75                  25        23         16        13         10         7
0.80                  23        21         15        12          9         7
0.85                  21        19         13        11          8         6
0.90                  19        17         12        10          8         6
0.95                  18        16         12        10          7         6
1.00                  17        15         11         9          7         5

Notes: Sample sizes capture the MES up to four decimal places. Main source: Perezgonzalez, J. D. (2016). Statistical sensitiveness for science. arXiv:1604.01844 (retrievable from http://arxiv.org/abs/1604.01844).
Notes.

# A minimum effect size (MES) is the minimum amount of standardized difference between the mean of the null hypothesis and the level of significance of interest to the research project at hand. (It would correspond to Cohen's d_z and d_4 (Cohen, 1988) if the latter were found to be the mean of the population effect size.) Unlike Cohen's effect sizes, an MES does not make a claim about the (unknown) population effect size but is independent of it. Instead, an MES sets an a priori standard of importance, asking: "How small ought a difference to be for me to consider it of importance (a.k.a., of practical significance)?" (That is, once estimated, the real effect may be larger or smaller than the MES, although this should not have retroactive impact on the initial decision of importance for the research project.)

Because an MES does not make a claim about population effect sizes, any decision about importance is made before knowing the real effect of the research treatment in the population. This makes the MES a good construct for those situations in which population effect sizes are unknown (thus, a power analysis is not possible) as well as when Fisher's tests of significance are used (the latter because these tests effectively ignore any knowledge about the population effect size and the Type II error).

A sensitiveness analysis provides the sample size required for capturing the desired MES (or larger) as a statistically significant result. The probability of capturing such an effect, however, depends on the unknown population effect size: it is greater when the population effect size is larger than the MES and smaller when the population effect size is smaller than the MES. Because we do not actually know the population effect size, it is not possible to predict such probability (otherwise known as power).
Sensitiveness and power share a common background insofar as a power analysis is a sensitiveness analysis with the MES calculated from known information about the population effect size (e.g., a power analysis based on a one-tailed paired-sample t-test, ES = 0.50, α = 0.01, and power = 0.80 implies an MES = 0.37, and thus requires the same sample size as a sensitiveness analysis based on a one-tailed paired-sample t-test, MES = 0.37, and sig = 0.01; both call for the same critical value, CV_t(42) = 2.418). However, although a power analysis is a sensitiveness analysis, the opposite is not true: we cannot know the power of a test without prior knowledge of the population effect size.
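The equivalence in this example can be checked numerically; a minimal sketch using SciPy, taking the sample size n = 43 implied by the power analysis above:

```python
from math import sqrt

from scipy.stats import t

# Sample size implied by the power analysis in the example
# (one-tailed paired-sample t-test, ES = 0.50, alpha = 0.01, power = 0.80).
n = 43
cv = t.ppf(1 - 0.01, n - 1)  # one-tailed critical value with df = 42
mes = cv / sqrt(n)           # smallest d_z reaching that critical value

print(round(cv, 3))   # 2.418, the CV_t(42) quoted in the text
print(round(mes, 2))  # 0.37, the implied MES
```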
Table 2 shows ranges of effect sizes that will not be captured under the alternative hypothesis (a.k.a., as "significant") by Neyman-Pearson's tests (1933), the effect sizes at the boundary effectively becoming the MES of the corresponding power analyses.

Table 2. Effect sizes under the alternative hypothesis that will not be so captured via power analysis

                          pwr = 0.90                                            pwr = 0.80
                 α = 0.01                  α = 0.05                  α = 0.01                  α = 0.05
ES (d_z, d_4)  2-tailed       1-tailed   2-tailed       1-tailed   2-tailed       1-tailed   2-tailed       1-tailed
0.20           [-0.13, 0.13]  [-, 0.12]  [-0.12, 0.12]  [-, 0.11]  [-0.15, 0.15]  [-, 0.15]  [-0.14, 0.14]  [-, 0.13]
0.50           [-0.33, 0.33]  [-, 0.32]  [-0.30, 0.30]  [-, 0.28]  [-0.38, 0.38]  [-, 0.37]  [-0.35, 0.35]  [-, 0.33]
0.80           [-0.53, 0.53]  [-, 0.51]  [-0.48, 0.48]  [-, 0.45]  [-0.60, 0.60]  [-, 0.59]  [-0.55, 0.55]  [-, 0.52]

# Minimum effect sizes have the same definition as Cohen's effect sizes, so that MES = 0.20 may be considered small, MES = 0.50 medium, and MES = 0.80 large. Although Table 1 provides sample sizes for MES as large as one standard deviation, the researcher ought to consider the implications of choosing a particularly large MES. Indeed, reproducible results will only occur when the population effect size is larger than the MES (the larger the better), and a large MES implies that the effect size in the population is so large that it may be plainly visible even before starting the research, something not too common in science.

# Table 1 also provides sample sizes for the conventional significance levels of 5%, 1%, and 0.1%. The typical (mis)use of tests of significance as tests of hypotheses calls for a level of significance of 1% or lower as a more appropriate standard for better science than larger levels, such as the popular 5% (e.g., Sellke, Bayarri, & Berger, 2001).

# A procedure for calculating sample sizes for a desired MES is given in Perezgonzalez (2016).
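The Table 2 boundaries can be reproduced with a small power routine based on the noncentral t distribution. A sketch (the function name is mine, not from the source), shown here for ES = 0.50, α = 0.05 two-tailed, pwr = 0.80: a power analysis first yields the required n, and the critical d_z of that design is the boundary effect size.

```python
from math import sqrt

from scipy.stats import nct, t

def power_n(es, alpha, power, tails=2, n_max=10_000):
    """Smallest paired-sample n whose achieved power reaches the target.

    Power is computed from the upper rejection region only, which is
    accurate for positive effect sizes.
    """
    for n in range(2, n_max):
        df = n - 1
        t_crit = t.ppf(1 - alpha / tails, df)
        achieved = 1 - nct.cdf(t_crit, df, es * sqrt(n))  # noncentrality = d_z * sqrt(n)
        if achieved >= power:
            return n
    return None

n = power_n(0.50, 0.05, 0.80)               # classic power-analysis sample size (34)
mes = t.ppf(1 - 0.05 / 2, n - 1) / sqrt(n)  # boundary effect size of that design
print(n, round(mes, 2))
```

The boundary of about 0.35 matches Table 2's entry [-0.35, 0.35] for ES = 0.50, α = 0.05, 2-tailed, pwr = 0.80.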
A simpler procedure can be obtained using Excel, as follows:

     A           B
1    MES = d_z   0.37      Input the desired MES here
2    sig         0.01      Input the desired level of significance here
3    n           43        Use this cell to increase the sample size iteratively
4    df          42        Set up a formula that automatically subtracts 1 degree of freedom from n above (i.e., [ =B3-1 ])
5    CV(t)       2.70      Set up a t-test function, either [ =T.INV.2T(B2,B4) ] for a two-tailed test or [ =T.INV(B2,B4)*(-1) ] for a one-tailed test
6    d           0.4115    Set up a formula that automatically calculates Cohen's d_z from CV(t) (i.e., [ =B5/SQRT(B3) ]). Compare the result against the MES: if larger, increase n; if smaller, decrease n.

# The formula for calculating Cohen's d_z (or d_4) from a paired-sample t-test is: d_z = t / √n.
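The same iteration can be scripted outside Excel; a minimal Python sketch using SciPy (the function name is mine): find the smallest n whose critical d_z = CV(t)/√n falls at or below the desired MES.

```python
from math import sqrt

from scipy.stats import t

def sensitiveness_n(mes, sig, tails=2, n_max=100_000):
    """Smallest paired-sample n that captures the desired MES as significant."""
    for n in range(2, n_max):
        df = n - 1
        cv = t.ppf(1 - sig / tails, df)  # critical value CV(t)
        if cv / sqrt(n) <= mes:          # critical d_z = CV(t) / sqrt(n)
            return n
    return None

print(sensitiveness_n(0.37, 0.01, tails=1))  # 43, the one-tailed example in the text
print(sensitiveness_n(0.50, 0.05, tails=2))  # 18, matching Table 1
```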
References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd Edn. New York, NY: Psychology Press. doi:10.4324/9780203771587
Fisher, R. A. (1954). Statistical Methods for Research Workers, 12th Edn. Edinburgh, UK: Oliver and Boyd.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289-337. doi:10.1098/rsta.1933.0009
Perezgonzalez, J. D. (2016). Statistical sensitiveness for science. arXiv:1604.01844 (retrievable from http://arxiv.org/abs/1604.01844)
Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62-71.
[Figure omitted: chart locating sensitiveness analysis among data-testing methods in science.] Sensitiveness analysis provides a methodological tool for sample-size calculation appropriate for Fisher's tests of significance (akin to what power analysis does for Neyman-Pearson's tests of acceptance). It also helps put importance (i.e., practical significance) at the forefront of research goals.