Things you always wanted to know about statistics but were afraid to ask Christoph Amma Felix Putze Design and Evaluation of Innovative User Interfaces 6.12.13 1/43
Overview In the last lecture, we learned about statistical analysis: t-tests to compare the means of two samples. During this analysis, some questions arise: The t-test assumes a normal distribution. How can we learn about the distribution of the data? What to do if the data does not follow a normal distribution? How to handle study designs with more than two cells? How do we know our sample contains enough participants to yield useful results? 2/43
Literature There are numerous books on statistics. Besides the mathematically profound ones, this one is an interesting read when it comes to practical problems: Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking, Harvey Motulsky 3/43
Analyzing distributions Statistical methods often depend on the specific distribution of random variables, e.g. the t-test assumes a normal distribution. Do we know the underlying distribution of experimental data? We might in rare cases; in our use cases we don't. Is there an underlying theoretical distribution? In most cases, we can only approximate our data by a theoretical distribution. Can we tell if datasets are sampled from different distributions? How can we find out which distribution to choose? Can we show that our data is close to a theoretical distribution? 4/43
Histograms The first step is to look at the data. Histograms show the distribution of our data, and we can get a first impression of a possible underlying distribution. Sample data set: mean = 4.3, stddev = 0.41 5/43
QQ-Plots A more accurate and expressive method are QQ-plots. QQ-plots can be used to compare two samples, or to compare a sample with a theoretical distribution. Quantiles of both distributions are plotted against each other. Remember: the p-quantile is the smallest value for which at least p·100% of the data is smaller or equal, i.e. inf{ x in R : F(x) >= p }. Compute the same quantiles for both distributions (e.g. 0.1, 0.2, 0.3, ...) and plot them against each other. 6/43
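As an aside, the quantile pairing described above is easy to reproduce in code. The following is a minimal Python sketch (the slides later use R; Python is used here only for illustration), with made-up sample data, using the inf-based quantile definition from this slide:

```python
import math

def quantile(sample, p):
    """p-quantile per the slide's definition: the smallest value x
    with empirical CDF F_hat(x) >= p."""
    xs = sorted(sample)
    i = max(1, math.ceil(p * len(xs)))  # smallest 1-based index with i/n >= p
    return xs[i - 1]

def qq_points(a, b, ps):
    """Pair up the p-quantiles of two samples for a QQ plot."""
    return [(quantile(a, p), quantile(b, p)) for p in ps]

# Hypothetical samples: if b is just a scaled version of a,
# the paired quantiles fall on a straight line (but not on y = x).
points = qq_points([1, 2, 3, 4, 5], [10, 20, 30, 40, 50],
                   [0.2, 0.4, 0.6, 0.8])
```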
QQ-Plots: Construction Compute the p-quantiles of both distributions and plot each quantile against its counterpart. If the distributions are equal, all points lie approximately on the line y = x. The sample here is the same as the one used for the histogram; the theoretical distribution is a normal distribution with mean and variance estimated from the sample data. 7/43
QQ-Plots: Example 8/43
Equality of Distributions Statistical tests can quantify how well two distributions fit together. Two distributions can be tested for equality: empirical vs. theoretical, or empirical vs. empirical. Kolmogorov-Smirnov test: continuous distributions. χ²-test: discrete distributions. Both make no assumptions about the underlying distribution. 9/43
Kolmogorov-Smirnov Test The test uses the maximum distance between two distributions as test statistic. Two-sample test: are two samples drawn from the same distribution? One-sample test: one sample is compared to a reference distribution. The parameters of the theoretical distribution must not be estimated from the data (solution: the Lilliefors test, a variant of the KS test). 10/43
Kolmogorov-Smirnov Test Given: two distributions F1 and F2. Null hypothesis: the distributions are equal. The test is based on the maximum distance between the two distributions: D = max_x | F1(x) - F2(x) |. For sample data, we use the empirical distribution function F^(x) = (1/n) · sum_{i=1..n} 1{x_i <= x}, where the indicator 1{x_i <= x} is 1 if x_i <= x and 0 otherwise. 11/43
Kolmogorov-Smirnov Test D is a random variable of which we observe the realization d. D follows the Kolmogorov distribution. We can compute p = P(D >= d) = 1 - P(D < d) by evaluating that distribution. Example (two samples, in R): ks.test(data1, data2) Two-sample Kolmogorov-Smirnov test data: data1 and data2 D = 0.3062, p-value = 1.653e-05 12/43
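The statistic D itself is simple to compute by hand; only the p-value requires the Kolmogorov distribution, which ks.test evaluates for us. A pure-Python sketch with made-up data, for illustration only:

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum distance between the two
    empirical CDFs, evaluated at every observed data point."""
    def ecdf(xs, x):
        return sum(1 for v in xs if v <= x) / len(xs)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

d = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

The p-value step is omitted here; in practice one uses ks.test in R or an equivalent library routine.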
χ² Test Works on discrete distributions: the data is binned, and continuous distributions must be binned first. Tests an empirical vs. a theoretical distribution. The test uses the difference between the observed and expected frequencies of each outcome. 13/43
χ² Test Given: empirical distribution F^(x), theoretical distribution T(x). H₀: F(x) = T(x) (distributions are equal), with F(x) being the underlying distribution of F^(x). Test statistic: χ² = sum_{i=1..n} (O_i - E_i)² / E_i, where O_i are the observed frequencies, E_i the expected frequencies, and n the number of bins (not samples). Compute the p-value from the χ² distribution: p = 1 - χ²_k(x), where k denotes the degrees of freedom, k = n - 1. 14/43
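The test statistic can be computed directly from the binned counts. A small Python sketch (illustration only; the counts are a hypothetical fair-coin example, and the p-value lookup in the χ² distribution is left to a statistics package):

```python
def chi_square_statistic(observed, expected):
    """Chi-square statistic: sum over bins of (O_i - E_i)^2 / E_i."""
    assert len(observed) == len(expected)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical example: 60 coin flips binned into heads/tails,
# compared against the expected frequencies of a fair coin.
stat = chi_square_statistic([36, 24], [30, 30])
```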
Testing for Normality The t-test assumes a normal distribution. Can we assume that our data is normally distributed? We could test this hypothesis with the Lilliefors variant of the Kolmogorov-Smirnov test. More correctly: we test whether our data is consistent with the assumption of sampling from a Gaussian distribution. Why not the KS test: we must estimate mean/variance from the data, and there are specialized tests for this task with greater power. Do not use the KS test for testing for normality. There are many such tests with different properties, e.g. Shapiro-Wilk and D'Agostino-Pearson, both available in R. 15/43
Interpreting the p-value The p-value answers: if you randomly sample from a Gaussian population, what is the probability of obtaining a sample that deviates from a Gaussian distribution as much as or more than this sample does? High p-value: the data is not inconsistent with a Gaussian distribution; this is no proof that the data is drawn from a Gaussian distribution. Low p-value: the data is unlikely to be sampled from a Gaussian distribution. What to do: check for another distribution; maybe outliers cause the normality test to fail; look at the data again, maybe you can ignore the deviation (large datasets tend to produce low p-values even for mild deviations from a Gaussian); or switch to non-parametric tests. 16/43
Wilcoxon rank-sum test What to do if our sample does not approximately follow a normal distribution? What to do if our data is ordinally scaled (ordered values)? We cannot apply a t-test in these cases. The Wilcoxon rank-sum test can handle them: a non-parametric test (no assumption about the underlying probability distribution) that works on ordinally scaled values (tests for equality of the median). 17/43
Wilcoxon rank-sum test The test follows the well-known scheme for statistical tests. We have sampled X and Y from distributions A and B. Our hypothesis H1 is: A and B differ. The null hypothesis H0 is: A and B are equal. Assumptions: values are ordered (ordinally scaled); all observations from both groups are independent of each other; A and B differ only by a shift in location, A(x) = B(x + m). Otherwise the null hypothesis can only be formulated as: the probability that a sample from A is greater than or equal to a sample from B is 0.5. The Mann-Whitney U test is equivalent, but uses the U statistic. 18/43
Wilcoxon rank-sum test Test statistic: 1. Order all observations in ascending sequence and rank them. In case of equal values in both samples: compute the ordering as usual with an arbitrary sequence of the equal values, compute the mean rank of all equal values, and assign this mean rank to each of them. 2. Compute the rank sums for both samples: add up the ranks R1 which came from A and the ranks R2 which came from B. The invariant R1 + R2 = N(N+1)/2, with N = |A| + |B|, must hold. The test statistic W is the smaller of these two sums. 3. Compute the p-value from the distribution of W (the distribution of W is given depending on the sizes of A and B). The Mann-Whitney U statistic can be computed from W. 19/43
Wilcoxon rank-sum test: example Example: two groups perform a task with a user interface. A = [3,5,8,10,20], B = [2,5,7,9,11]. Ranked sequence: 1:2, 2:3, 3.5:5, 3.5:5, 5:7, 6:8, 7:9, 8:10, 9:11, 10:20 (the two equal values 5 share the mean rank 3.5). Rank sums: A: 2+3.5+6+8+10 = 29.5; B: 1+3.5+5+7+9 = 25.5; R1 + R2 = 55 = 10·(10+1)/2. W = 25.5, p = 0.75. 20/43
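The ranking scheme from the previous slide, including the mean-rank handling of ties, can be checked in code on exactly this example. A Python sketch (the slides themselves use R; this is illustration only):

```python
def ranks_with_ties(values):
    """Map each value to its rank; tied values share the mean of
    the ranks they occupy."""
    xs = sorted(values)
    rank_of = {}
    i = 0
    while i < len(xs):
        j = i
        while j < len(xs) and xs[j] == xs[i]:
            j += 1
        rank_of[xs[i]] = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        i = j
    return rank_of

def rank_sum_W(a, b):
    """Wilcoxon rank-sum statistic: the smaller of the two rank sums."""
    rank_of = ranks_with_ties(a + b)
    r1 = sum(rank_of[v] for v in a)
    r2 = sum(rank_of[v] for v in b)
    n = len(a) + len(b)
    assert r1 + r2 == n * (n + 1) / 2  # invariant from the previous slide
    return min(r1, r2)

w = rank_sum_W([3, 5, 8, 10, 20], [2, 5, 7, 9, 11])
```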
Wilcoxon rank-sum test: special example Consider the following dataset: X = (1, ..., 50, 151, ..., 200), Y = (51, ..., 150). Do they differ? The Wilcoxon rank-sum test says: W = 10050 for both datasets, p = 1.0, so H0 holds. What did we do wrong? We missed an assumption of the test: to support the given H0, A and B must differ only by a shift in location, A(x) = B(x + m). Otherwise H0 is weaker. 21/43
Wilcoxon rank-sum test: special example So what should have been our way to go? First check whether X and Y are likely to have the same underlying distribution (except for a shift in location): visually inspect, and run a Kolmogorov-Smirnov test. 22/43
Multiple t-tests? Often in an experiment, we have several independent variables, or independent variables with more than two values. This results in more than two cells/samples we need to compare. Reminder: in [Nass, 2000], we had two independent variables: system personality (values: introvert or extrovert) and user personality (values: introvert or extrovert). The hypothesis was: evaluation of the system differs between those groups. 4 cells in a factorial design → 6 pairs of cells to compare. Perform 6 independent t-tests? With an α of 0.05, there is a 5% chance of wrongly rejecting a true null hypothesis in one test. When doing multiple pairwise t-tests, this probability accumulates to up to 26.5%! 23/43
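The 26.5% figure follows from accumulating the per-test error probability over six independent tests; a one-line check:

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent
    tests, each performed at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

fwer = family_wise_error_rate(0.05, 6)  # six pairwise comparisons
```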
Another illustrative example A publication which deliberately demonstrates the problems that arise from multiple testing (Bennett et al., 2009). Context: fMRI-based analysis of emotional activation. Affective stimuli invoking different valence (images, videos, ...) are presented; fMRI measures brain activity based on oxygen concentration, with good spatial resolution, yielding 100,000s of small voxels. The researchers performed the experiment but replaced the human subject with a dead salmon. Result: they found a cluster of three adjacent voxels in the fish's brain which related significantly to the emotional stimulation. Observation: without controlling for multiple testing, a large number of tests will almost surely yield significant results. 24/43
ANOVA: Analysis of Variance Analysis of Variance (ANOVA) can handle designs with more than two cells. It tests whether there exists a difference between any two groups, but not between which groups (that needs further analysis). One-way ANOVA: one factor with multiple (>2) levels. Two-way ANOVA: two factors with multiple levels. More factors are possible, but the analysis gets more and more complicated. Idea: variance within groups vs. variance between groups. In this lecture, we will present the basic ideas, not the full formulas; consult a statistical textbook for details. A good starting point with detailed explanations and examples: http://faculty.vassar.edu/lowry/webtext.html 25/43
One-Way ANOVA ANOVA is another statistical test, like the t-test. Requirements: all populations are normally distributed, all populations have the same variance, and the samples are of equal size. Test statistic: F = (between-group scatter) / (within-group scatter). Estimate* the between-group scatter as sum_{i=1..k} (X̄_i - X̄)² and the within-group scatter as (1/k) · sum_{i=1..k} SD_i². If H0 (no differences between any two populations) is true, F is distributed according to the F-distribution. Proceed as for the t-test (calculate critical values, compare with the observed F-value). *) Formulas are just for explanation and not complete! 26/43
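The slide's scatter estimates are deliberately simplified; the complete F statistic divides the between-group and within-group sums of squares by their degrees of freedom. A Python sketch with made-up groups (the p-value from the F-distribution is left to a statistics package):

```python
def one_way_anova_F(groups):
    """One-way ANOVA F statistic: between-group mean square over
    within-group mean square (complete formulas)."""
    k = len(groups)                       # number of groups
    N = sum(len(g) for g in groups)       # total number of observations
    grand = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ssb / (k - 1)) / (ssw / (N - k))

f = one_way_anova_F([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```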
ANOVA vs. t-test For one factor with two levels, ANOVA is equivalent to a t-test. Reminder: T = (X̄ - Ȳ) / σ_{X-Y}. The numerator X̄ - Ȳ measures the scatter between groups; the denominator σ_{X-Y} measures the scatter within groups. 27/43
Two-Way ANOVA Extension of one-way ANOVA: handles two factors with multiple levels. We can now observe two types of effects: main effects, a consistent significant difference when manipulating only one factor (e.g. both introvert and extravert users prefer the extravert system), and interaction effects, a significant difference when changing multiple factors (e.g. introverts prefer the introvert system, extraverts prefer the extravert system). The model of one-way ANOVA is augmented by changing the definition of the between-group scatter to differentiate between variance caused by main effects and by interaction effects (no details here). 28/43
Main effect vs. interaction effect [Plot: dependent variable against factor F1, one line per value of F2 (value 1 = red, value 2 = green)] Main effect: F1 influences the dependent variable averaged over all values of F2. Interaction effect: the simultaneous effect of F1 and F2 on the dependent variable is not additive. 29/43
ANCOVA ANOVA has many variants which cover common problems and questions. Analysis of Covariance (ANCOVA) can be used to eliminate the effect of confounding variables: despite random allocation of participants to groups, the groups can differ in relevant variables (age, gender, intelligence, attitude, mood, ...). Confounding variables can artificially inflate or deflate effects. Can we remove the effect of those confounding variables from the ANOVA? General approach: ANCOVA calculates the linear influence of the confounding variable on the dependent variable regardless of group; removing this influence reduces the within-group scatter. Be careful when covariates are actually related to the factors of the design. 30/43
Which cells are significantly different? Often, we are not only interested in the existence of some difference but want to know which cells differ significantly. We already saw that multiple unmodified t-tests are not advisable. A number of tests are designed for this purpose: the Scheffé test, Tukey's honestly significant difference test, ... All these tests use some method to control for multiple testing. These approaches are also relevant outside of a-posteriori analysis of ANOVA. 31/43
Bonferroni/Sidak Correction Adjust α to account for the global null hypothesis H0 = "all tests show no significant effect". Replace α by a corrected level α' depending on the number of tests n performed and the originally desired significance level. For independent tests (Sidak): α' = 1 - (1 - α)^(1/n). For arbitrary tests (Bonferroni): use Boole's inequality (the probability of a union of events is at most the sum of the single probabilities), so P(at least one test significant) <= n·α, giving the correction α' = α / n. 32/43
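Both correction factors are one-liners; here applied to the six pairwise tests from the earlier example:

```python
def sidak_alpha(alpha, n):
    """Per-test significance level for n independent tests (Sidak)."""
    return 1 - (1 - alpha) ** (1 / n)

def bonferroni_alpha(alpha, n):
    """Per-test significance level for n arbitrary tests (Bonferroni)."""
    return alpha / n

a_sidak = sidak_alpha(0.05, 6)
a_bonf = bonferroni_alpha(0.05, 6)
# Bonferroni is slightly more conservative than Sidak.
```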
Bonferroni-Holm Method The Bonferroni correction yields a binary result: we do not learn which or how many of the null hypotheses are wrongly rejected, and it does not differentiate between large and small p-values. Improvement: the Bonferroni-Holm method. Sort the p-values such that p_1 <= ... <= p_N. Set i = 1 and iterate: if p_i <= α / (N - i + 1), reject the corresponding null hypothesis and continue (for i = 1, this corresponds to the simple Bonferroni correction); else, stop and do not reject any of the remaining null hypotheses. This still does not take the interdependence of tests into account: if a test is significant, a similar test is more likely to also yield a significant result. The Benjamini-Hochberg procedure (1995) extends this method with more precise bounds for every iteration. 33/43
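The step-down iteration can be sketched as follows; the p-values in the example are made up for illustration:

```python
def holm_reject(p_values, alpha):
    """Bonferroni-Holm: visit p-values in ascending order, reject while
    p_i <= alpha / (N - i + 1), and stop at the first failure."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    N = len(p_values)
    rejected = [False] * N
    for step, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / (N - step + 1):
            rejected[idx] = True
        else:
            break  # do not reject any of the remaining hypotheses
    return rejected

# Made-up p-values: only the smallest survives the step-down thresholds
# (0.001 <= 0.05/4, but 0.02 > 0.05/3, so the iteration stops there).
decisions = holm_reject([0.001, 0.04, 0.02, 0.30], alpha=0.05)
```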
Evaluation of correction methods Very conservative approaches (especially the simple ones) have a high probability of missing true effects ("low power"). Bonferroni-Holm is strictly better than the original Bonferroni method. Practical issue: which hypotheses to include? Obviously irrelevant ones? Tests planned for future investigations? Tests by others on the same data set? Family-wise error rate: the probability of wrongly rejecting the joint null hypothesis of related tests (related in terms of content or use). Do not routinely apply correction for every test performed. Typical use cases for correction: applying different procedures for measuring the same construct; investigating many (artificial) subgroups of the sample; explorative analysis, e.g. speculative application of a large number of tests. 34/43
Larger Sample = better? The probability of missing a true effect (β-error) depends on the size of the sample: a larger sample means a smaller error probability. Should we therefore collect as much data as possible? No: the tests will then give significant results for the tiniest effects and become prone to small jitter in the data. We need to know how large the effect is that we are looking for! Collect enough data to identify the desired effect as significant, and do not collect more data than necessary. Consequence: the significance level α must be fixed in advance; it is not meaningful to report by how much the test beats the predefined α. 35/43
Effect Size The size of an effect depends on the standard deviation of the underlying distribution. For a comparison of two distribution means (e.g. with a t-test), we can measure the effect size as ES = (μ_X - μ_Y) / σ (note that there are other methods). According to Cohen (1988), we can classify effect sizes as follows: ES = 0.2 small, ES = 0.5 medium, ES = 0.8 large. The smaller the effect size, the harder it is to detect in a sample, and the larger the sample has to be (for fixed α). How do you know the effect size in advance? Pilot experiments, comparable data. 36/43
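For two samples, the effect size can be estimated by plugging in the sample means and a pooled standard deviation (one common variant; as the slide notes, there are others). A Python sketch with made-up values:

```python
import math

def cohens_d(x, y):
    """Effect size estimate: |mean difference| over the pooled
    standard deviation of the two samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    pooled = math.sqrt(((len(x) - 1) * vx + (len(y) - 1) * vy)
                       / (len(x) + len(y) - 2))
    return abs(mx - my) / pooled

d = cohens_d([4, 5, 6], [6, 7, 8])
```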
Power We often choose α = 0.05 by convention; for medical applications, α is traditionally much lower. What is the effect of modifying α? A smaller α reduces the probability of α-errors (wrongly rejecting the null hypothesis) and thereby improves reliability. As a side effect, we also reduce our chances to actually find any significant effect (e.g. if the effect size is small)! Power: 1 - β is called the power of an experiment, where β is the probability of a β-error ("false negative"). It is not possible to optimize both α and 1 - β, as improving one degrades the other. 37/43
The big picture of experiment design Four parameters describe an experiment design: Significance level α Power 1 β Effect size ES Sample size n From three given parameters, we can derive the (optimal value of the) fourth one Which parameters are given depends on the situation Not all situations are equally desirable 38/43
Types of Analysis (1) A priori Analysis Significance level, Power and effect size are given (Significance level and power are selected, ES is estimated beforehand) Estimate optimal sample size Perform this analysis during the planning step of an experiment Generate a sample of exactly this size to avoid undesired effects Post-hoc Analysis Significance level, effect size and sample size are given Estimate achieved power Performed after experiment design or execution How high is/was the risk of not observing the effect? 39/43
Types of Analysis (2) Sensitivity Analysis Significance level, power and sample size are given Estimate required effect size Performed after an experiment design or execution Typically performed when no significant effect was found (Effects of which size did we have a chance to find?) Criterion Analysis Power, effect size and sample size are given Estimate required significance level Rarely executed, significance level is typically fixed in advance 40/43
G*Power Free tool maintained by the University of Düsseldorf. Available for Windows and Mac (created by psychologists, after all). http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3 Performs power analysis for a number of test paradigms: t-tests (including regression), F-tests (ANOVA, ...). Allows one to perform all types of analysis. Of course, R and other statistics toolboxes have comparable power-analysis capabilities. 41/43
Example Assume we compare the efficiency of two user interfaces: our traditional baseline interface A and our innovative interface B. Efficiency is measured in seconds to solve a given task. We set the significance level α to the traditional level of 0.05. We set the power 1 - β to 0.95, as we estimate missing an opportunity for optimization and overestimating the improvement to be equally harmful. From a pilot study we estimate the effect size to be 0.5; we might also not be interested in smaller effects, as the new interface is more expensive. We design a between-subject experiment and evaluate the data using a t-test to test whether B is more efficient than A: independent samples, one-sided t-test. 42/43
Overpowered and Underpowered Experiments A priori analysis yields an optimal sample size of 176, i.e. 88 participants per group. What would happen if we could do a within-subject design? What happens if we collect less data? The study is underpowered: we must assume that we cannot reject H0 for the expected effect size. What happens if we collect more data? The study is overpowered: tiny effects (e.g. participants being a little more reluctant on a Monday than on a Friday) may create a significant result. Should we stop collecting data, even if more is available? Not necessarily: expect to have some outliers and/or corrupt data points, and subgroup analysis may require a larger corpus. 43/43
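The 88 per group comes from an exact t-based computation such as G*Power performs. A normal-approximation sketch in Python comes close (this approximation is an addition for illustration, not the slide's method, and slightly understates the exact answer):

```python
import math
from statistics import NormalDist

def n_per_group(alpha, power, effect_size):
    """Approximate per-group sample size for a one-sided two-sample
    test, using the normal approximation to the t-distribution:
    n = 2 * ((z_{1-alpha} + z_{power}) / ES)^2."""
    z = NormalDist().inv_cdf
    n = 2 * ((z(1 - alpha) + z(power)) / effect_size) ** 2
    return math.ceil(n)

n = n_per_group(alpha=0.05, power=0.95, effect_size=0.5)
```

This yields 87 per group, close to the 88 from the exact computation; the difference comes from the heavier tails of the t-distribution.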