Things you always wanted to know about statistics but were afraid to ask

Christoph Amma, Felix Putze. Design and Evaluation of Innovative User Interfaces, 6.12.13.

Overview

In the last lecture, we learned about statistical analysis: t-tests to compare the means of two samples. During this analysis, some questions arise:
- The t-test assumes a normal distribution. How can we learn about the distribution of the data?
- What to do if the data does not follow a normal distribution?
- How to handle study designs with more than two cells?
- How do we know our sample contains enough participants to yield useful results?

Literature

There are numerous books on statistics. Besides the mathematically profound ones, this is an interesting read when it comes to practical problems: Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking, Harvey Motulsky.

Analyzing distributions

Statistical methods often depend on the specific distribution of random variables; e.g., the t-test assumes a normal distribution.
- Do we know the underlying distribution of experimental data? We might in rare conditions; in our use cases we don't.
- Is there an underlying theoretical distribution? In most cases, we can only approximate our data by a theoretical distribution.
- How can we find out which theoretical distribution to choose?
- Can we tell whether two datasets are sampled from different distributions?
- Can we show that our data is close to a theoretical distribution?

Histograms

The first step is to look at the data. Histograms show the distribution of our data and give a first impression of a possible underlying distribution. Sample data set: mean = 4.3, stddev = 0.41.

QQ-Plots

A more accurate and expressive method is the QQ-plot. QQ-plots can be used to
- compare two samples, or
- compare a sample with a theoretical distribution.
Quantiles of both distributions are plotted against each other. Remember: the p-quantile is the value of the sample for which p · 100% of the data is smaller or equal: inf{x ∈ R : F(x) ≥ p}. Compute the same quantiles for both distributions (e.g. 0.1, 0.2, 0.3, …) and plot them against each other.
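The quantile pairing behind a QQ-plot can be sketched in a few lines. This is an illustrative sketch using the inf-based quantile definition above; the lecture's tools are R-based, and the function names here are our own:

```python
import math

def empirical_quantile(sample, p):
    """p-quantile: the smallest sample value x with F_hat(x) >= p."""
    s = sorted(sample)
    i = max(1, math.ceil(p * len(s)))  # smallest 1-based index i with i/n >= p
    return s[i - 1]

def qq_pairs(sample_a, sample_b, probs):
    """Quantile pairs to plot against each other; points near y = x suggest equal distributions."""
    return [(empirical_quantile(sample_a, p), empirical_quantile(sample_b, p))
            for p in probs]
```

For example, qq_pairs(a, b, [i/10 for i in range(1, 10)]) yields the nine points for the deciles.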

QQ-Plots: Construction

- Compute the p-quantiles of both distributions.
- Plot each quantile against the corresponding one of the other distribution.
- If the distributions are equal, all points lie approximately on the line y = x.
The sample is the same as the one used for the histogram; the theoretical distribution is a normal distribution with mean and variance estimated from the sample data.

QQ-Plots: Example

Equality of Distributions

Statistical tests can measure how well two distributions fit together. Two distributions can be tested for equality:
- empirical vs. theoretical
- empirical vs. empirical
The Kolmogorov-Smirnov test handles continuous distributions, the χ² test discrete ones. Both make no assumptions about the underlying distribution.

Kolmogorov-Smirnov Test

The test uses the maximum distance between two distributions as its test statistic.
- Two-sample test: are two samples drawn from the same distribution?
- One-sample test: one sample is compared to a reference distribution. The parameters of the theoretical distribution must not be estimated from the data (solution: the Lilliefors test, a variant of the KS test).

Kolmogorov-Smirnov Test

Given: two distributions F1 and F2. Null hypothesis: the distributions are equal. The test is based on the maximum distance between the two distributions:

    D = max_x |F1(x) − F2(x)|

For sample data, we use the empirical distribution function

    F̂(x) = (1/n) Σ_{i=1}^{n} 1{x_i ≤ x}

where the indicator 1{x_i ≤ x} is 1 if x_i ≤ x and 0 otherwise.

Kolmogorov-Smirnov Test

D is a random variable of which we observe the value d. Under the null hypothesis, D follows the Kolmogorov distribution, so we can compute p = P(D ≥ d) = 1 − P(D < d) by evaluating that distribution. Example with two samples in R:

ks.test(data1, data2)

Two-sample Kolmogorov-Smirnov test
data: data1 and data2
D = 0.3062, p-value = 1.653e-05
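The D statistic itself is easy to compute by hand. Here is a minimal two-sample sketch in Python (the ks.test call in R also computes the p-value, which is omitted here; the function names are our own):

```python
import bisect

def ecdf(sample):
    """Empirical CDF: F_hat(x) = (#values <= x) / n."""
    s = sorted(sample)
    n = len(s)
    return lambda x: bisect.bisect_right(s, x) / n

def ks_statistic(a, b):
    """Two-sample KS statistic: maximum distance between the two empirical CDFs.
    The maximum is attained at one of the observed values, so checking those suffices."""
    f, g = ecdf(a), ecdf(b)
    return max(abs(f(x) - g(x)) for x in sorted(set(a) | set(b)))
```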

χ² Test

- Works on discrete distributions; data is binned. Continuous distributions must be binned first.
- Tests an empirical vs. a theoretical distribution.
- The test uses the difference between the observed and theoretical frequencies of each outcome.

χ² Test

Given: empirical distribution F̂(x), theoretical distribution T(x). H₀: F(x) = T(x) (the distributions are equal), with F(x) being the underlying distribution of F̂(x). Test statistic:

    χ² = Σ_{i=1}^{n} (O_i − E_i)² / E_i

where O_i are the observed frequencies, E_i the expected frequencies, and n the number of bins (not samples). Compute the p-value from the χ² distribution:

    p = 1 − F_{χ²(k)}(χ²)

where k denotes the degrees of freedom, k = n − 1.
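The statistic is a one-line sum over bins; a small sketch (our own function, not from the lecture):

```python
def chi_square_stat(observed, expected):
    """Chi-square statistic: sum over bins of (O_i - E_i)^2 / E_i."""
    assert len(observed) == len(expected)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For a die thrown 60 times with observed counts [12, 8, 10, 10, 9, 11] against an expected 10 per face, the statistic is 1.0; the p-value would then be read from the χ² distribution with k = 5 degrees of freedom.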

Testing for Normality

The t-test assumes a normal distribution. Can we assume that our data is normally distributed?
- We could test this hypothesis with the Lilliefors variant of the Kolmogorov-Smirnov test. More precisely, we test whether our data is consistent with the assumption of sampling from a Gaussian distribution.
- Why not the KS test itself? We must estimate mean and variance from the data, and there are specialized tests for this task with greater power. Do not use the KS test for testing for normality.
- There are many such tests with different properties, e.g. Shapiro-Wilk and D'Agostino-Pearson, both available in R.

Interpreting the p-value

The p-value answers: if you randomly sample from a Gaussian population, what is the probability of obtaining a sample that deviates from a Gaussian distribution as much as or more than this sample does?
- High p-value: the data is not inconsistent with a Gaussian distribution. This is no proof that the data is drawn from a Gaussian distribution.
- Low p-value: the data is not sampled from a Gaussian distribution. What to do:
  - Check for another distribution.
  - Maybe outliers cause the normality test to fail; look at the data again, maybe you can ignore the deviation (large datasets tend to produce low p-values even for mild deviations from Gaussian).
  - Switch to non-parametric tests.

Wilcoxon rank-sum test

What to do if our sample does not approximately follow a normal distribution, or if our sample is ordinally scaled (ordered values)? We cannot apply a t-test in these cases, but the Wilcoxon rank-sum test can handle them:
- non-parametric test (no assumption about the underlying probability distribution)
- works on ordinally scaled values (tests for equality of the median)

Wilcoxon rank-sum test

The test follows the well-known scheme for statistical tests. We have sampled X and Y from distributions A and B. Our hypothesis H1 is: A and B differ. The null hypothesis H0 is: A and B are equal.

Assumptions:
- Values are ordered (ordinally scaled).
- All observations from both groups are independent of each other.
- A and B differ only by a shift in location, A(x) = B(x + m); otherwise the null hypothesis can only be formulated as: the probability that a sample from A is greater than or equal to a sample from B is 0.5.

The Mann-Whitney U test is equivalent but uses the U statistic.

Wilcoxon rank-sum test

Test statistic:
1. Order all observations in ascending sequence and rank them. In case of equal values in both samples: compute the ordering as usual with an arbitrary sequence of the equal values, compute the mean rank of all equal values, and assign this mean rank to all of them.
2. Compute the rank sums for both samples: add up the ranks R1 that came from A and the ranks R2 that came from B. The invariant R1 + R2 = N(N+1)/2 with N = |A| + |B| must hold. The test statistic W is the smaller of the two sums.
3. Compute the p-value from the distribution of W (which is given depending on the sizes of A and B). The Mann-Whitney U statistic can be computed from W.

Wilcoxon rank-sum test: example

Two groups perform a task with a user interface:
A = [3, 5, 8, 10, 20]
B = [2, 5, 7, 9, 11]
Ranked sequence: 1:2, 2:3, 3.5:5, 3.5:5 (equal values), 5:7, 6:8, 7:9, 8:10, 9:11, 10:20
Rank sums: A: 2 + 3.5 + 6 + 8 + 10 = 29.5; B: 1 + 3.5 + 5 + 7 + 9 = 25.5; together 55 = 10·(10+1)/2.
W = 25.5, p = 0.75
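The ranking procedure, including the mean-rank rule for ties, can be replayed in code (a sketch; the function names are our own):

```python
def tie_ranks(values):
    """Rank all values in ascending order; equal values receive the mean of the ranks they span."""
    s = sorted(values)
    rank_of = {}
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1  # advance past the run of equal values
        rank_of[s[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    return rank_of

def rank_sums(a, b):
    """Rank sums R1, R2 over the combined sample; W is the smaller of the two."""
    r = tie_ranks(a + b)
    return sum(r[x] for x in a), sum(r[x] for x in b)
```

rank_sums([3, 5, 8, 10, 20], [2, 5, 7, 9, 11]) reproduces R1 = 29.5 and R2 = 25.5 from the example, so W = 25.5.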

Wilcoxon rank-sum test: special example

Consider the following dataset:
X = (1, …, 50, 151, …, 200)
Y = (51, …, 150)
Do they differ? The Wilcoxon rank-sum test says: W = 10050 for both datasets, p = 1.0, so H0 is not rejected. What did we do wrong? We missed an assumption of the test: to support this H0, A and B must differ only by a shift in location, A(x) = B(x + m). Otherwise H0 is weaker.

Wilcoxon rank-sum test: special example

So what should have been our way to go? First check whether X and Y are likely to have the same underlying distribution (except for a shift in location): inspect visually, and run a Kolmogorov-Smirnov test.

Multiple t-tests?

Often an experiment has several independent variables, or independent variables with more than two values, resulting in more than two cells/samples we need to compare. Reminder: in [Nass, 2000], we had two independent variables, system personality (introvert or extrovert) and user personality (introvert or extrovert), and the hypothesis was that the evaluation of the system differs between those groups. With 4 cells in the factorial design, there are 6 pairs of cells to compare. Perform 6 independent t-tests? With an α of 0.05, there is a 5% chance of wrongly rejecting a null hypothesis for one test. When doing multiple pair-wise t-tests, this probability accumulates to up to 26.5%.
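The 26.5% figure follows from the complement rule (a sketch, assuming the six tests are independent):

```python
def familywise_error(alpha, n_tests):
    """P(at least one false positive) = 1 - P(no false positive in any of n independent tests)."""
    return 1 - (1 - alpha) ** n_tests
```

familywise_error(0.05, 6) is about 0.265, i.e. a 26.5% chance of at least one spurious "significant" result among the six comparisons.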

Another illustrative example

A publication deliberately demonstrates the problems that occur from multiple testing (Bennett et al., 2009). Context: fMRI-based analysis of emotional activation. Affective stimuli invoking different valence (images, videos, …) are presented; fMRI measures brain activity based on oxygen concentration, with good spatial resolution, yielding 100,000s of small voxels. The researchers performed the experiment but replaced the human subject with a dead salmon. Result: they found a cluster of three adjacent voxels in the fish's brain which related significantly to the emotional stimulation. Observation: without controlling for multiple testing, a large number of tests will almost surely yield significant results.

ANOVA: Analysis of Variance

Analysis of Variance (ANOVA) can handle designs with more than two cells. It tests whether there exists a difference between any two groups, but not between which groups (that needs further analysis).
- One-way ANOVA: one factor with multiple (>2) levels
- Two-way ANOVA: two factors with multiple levels
- More factors are possible, but the analysis gets more and more complicated.
Idea: variance within groups vs. variance between groups. This lecture presents the basic ideas, not the full formulas; consult a statistics textbook for details. A good starting point with detailed explanations and examples: http://faculty.vassar.edu/lowry/webtext.html

One-Way ANOVA

ANOVA is another statistical test, like the t-test. Requirements:
- All populations are normally distributed.
- All populations have the same variance.
- The samples are of equal size.
Test statistic F:

    F = between-group scatter / within-group scatter

Estimate* the between-group scatter as Σ_{i=1}^{k} (X̄_i − X̄)² and the within-group scatter as (1/k) Σ_{i=1}^{k} SD_i². If H0 (no differences between any two populations) is true, F is distributed according to the F-distribution. Proceed as for the t-test (calculate critical values, compare with the observed F-value).

*) The formulas are only for explanation and not complete.
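As the footnote says, the scatter formulas above are simplified. A sketch of the standard mean-square formulation (our own code, not from the lecture) makes the between/within idea concrete:

```python
def one_way_f(groups):
    """F = mean square between groups / mean square within groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # between-group sum of squares: group means vs. grand mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # within-group sum of squares: observations vs. their own group mean
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

For the groups [1, 2, 3], [2, 3, 4], [3, 4, 5] this yields F = 3.0; the p-value would come from the F-distribution with (2, 6) degrees of freedom.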

ANOVA vs. t-test

For one factor with two levels, ANOVA is equivalent to a t-test. Reminder:

    T = (X̄ − Ȳ) / σ_{X̄−Ȳ}

X̄ − Ȳ measures the scatter between the groups; σ_{X̄−Ȳ} measures the scatter within the groups.

Two-Way ANOVA

An extension of one-way ANOVA that handles two factors with multiple levels. We can now observe two types of effects:
- Main effects: a consistent significant difference when manipulating only one factor (e.g. both introvert and extravert users prefer the extravert system).
- Interaction effects: a significant difference when changing multiple factors (e.g. introverts prefer the introvert system, extraverts prefer the extravert system).
The model of one-way ANOVA is augmented by changing the definition of the between-group scatter to differentiate between variance caused by main effects and by interaction effects (no details here).

Main effect vs. interaction effect

[Figure: the dependent variable plotted against factor F1, with one line per value of factor F2 (value 1 = red, value 2 = green).]
- Main effect: F1 influences the dependent variable averaged over all values of F2.
- Interaction effect: the simultaneous effect of F1 and F2 on the dependent variable is not additive.

ANCOVA

ANOVA has many variants which address common problems and questions. Analysis of Covariance (ANCOVA) can be used to eliminate the effect of confounding variables:
- Despite random allocation of participants to groups, the groups can differ in relevant variables: age, gender, intelligence, attitude, mood, …
- Confounding variables can artificially inflate or deflate effects.
- Can we remove the effect of those confounding variables from the ANOVA?
General approach: ANCOVA calculates the linear influence of the confounding variable on the dependent variable regardless of class; removing this influence reduces the within-group scatter. Be careful when covariates are actually related to the factors of the design.

Which cells are significantly different?

Often we are not only interested in the existence of some difference but want to know which cells differ significantly. We already saw that multiple unmodified t-tests are not advisable. A number of tests is designed for this purpose, e.g.:
- the Scheffé test
- Tukey's honest significant difference test
All these tests use some method to control for multiple testing. Those approaches are also relevant outside of a-posteriori analysis of ANOVA.

Bonferroni/Šidák Correction

Adjust α to account for the global null hypothesis H0 = "all tests show no significant effect". Replace α by a corrected value α′ depending on the number of tests n and the originally desired significance level.
- For independent tests (Šidák): α′ = 1 − (1 − α)^(1/n)
- For arbitrary tests (Bonferroni): use Boole's inequality (the probability of a union of events is at most the sum of the single probabilities), so P(at least one test significant) ≤ nα, giving α′ = α/n.

Bonferroni-Holm Method

The Bonferroni correction yields a binary result: we do not learn which or how many of the null hypotheses are wrongly rejected, and it does not differentiate between large and small p-values. Improvement: the Bonferroni-Holm method.
- Sort the p-values such that p_1 ≤ … ≤ p_N.
- Set i = 1 and iterate: if p_i ≤ α/(N − i + 1), reject the corresponding null hypothesis and continue (for i = 1, this corresponds to the simple Bonferroni correction). Else, stop and do not reject any of the following null hypotheses.
This still does not take the interdependence of tests into account: if a test is significant, a similar test is more likely to also yield a significant result. The Benjamini-Hochberg procedure (1995) extends this method with more precise bounds for every iteration.
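The step-down iteration can be sketched directly from the description above (function names are our own):

```python
def holm_reject(p_values, alpha=0.05):
    """Bonferroni-Holm: compare sorted p-values against alpha/(N - i + 1); stop at the first failure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for step, idx in enumerate(order):  # step 0 corresponds to i = 1 on the slide
        if p_values[idx] <= alpha / (n - step):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values are not rejected either
    return reject
```

For example, holm_reject([0.01, 0.04, 0.03, 0.005]) rejects the first and last hypotheses: 0.005 ≤ 0.05/4 and 0.01 ≤ 0.05/3 pass, but 0.03 > 0.05/2 stops the procedure.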

Evaluation of correction methods

Very conservative approaches (especially the simple ones) have a high probability of failing to detect real effects ("low power"). Bonferroni-Holm is strictly better than the original Bonferroni method. A practical issue: which hypotheses to include? Obviously irrelevant ones? Tests planned for future investigations? Tests by others on the same data set? Family-wise error rate: the probability of wrongly rejecting the joint null hypothesis of related tests (in terms of content or use). Do not routinely apply correction for every test performed. Typical use cases for correction:
- applying different procedures for measuring the same construct
- investigating many (artificial) subgroups of the sample
- explorative analysis, e.g. speculative application of a large number of tests

Larger Sample = better?

The probability of a β-error (missing a real effect) depends on the size of the sample: a larger sample means a smaller error probability. Should we therefore collect as much data as possible? No:
- Tests will give significant results for the tiniest effect.
- Tests become prone to small jitter in the data.
We need to know how large the effect is we are looking for: collect enough data to identify the desired effect as significant, and do not collect more data than necessary. Consequence: the significance level α must be fixed in advance; it is not instrumental to report by how much the test beats the predefined α.

Effect Size

The size of an effect is measured relative to the standard deviation of the underlying distribution. For a comparison of two distribution means (e.g. with a t-test), we can measure effect size as (note that there are other methods):

    ES = (μ_X − μ_Y) / σ

According to Cohen (1988), we can classify effect sizes as follows: ES = 0.2 is small, ES = 0.5 is medium, ES = 0.8 is large. The smaller the effect size, the harder it is to detect in a sample, and the larger the sample has to be (for fixed α). How do you know the effect size in advance? Pilot experiments, comparable data.
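For two samples, the ES above is commonly estimated with the pooled standard deviation (Cohen's d); a sketch under that assumption:

```python
from statistics import mean, variance

def cohens_d(x, y):
    """Effect size estimate: difference of sample means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    # pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5
```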

Power

We often choose α = 0.05 by convention; for medical applications, α is traditionally much lower. What is the effect of modifying α? A smaller α reduces the probability of α-errors (wrongly rejecting the null hypothesis) and thus improves reliability. As a side effect, it also reduces our chances of actually finding any significant effect (e.g. if the effect size is small). Power: 1 − β is called the power of an experiment, where β is the probability of a β-error ("false negative"). It is not possible to optimize both α and 1 − β, as improving one parameter degrades the other.

The big picture of experiment design

Four parameters describe an experiment design:
- significance level α
- power 1 − β
- effect size ES
- sample size n
From three given parameters, we can derive the (optimal value of the) fourth one. Which parameters are given depends on the situation; not all situations are equally desirable.

Types of Analysis (1)

A priori analysis:
- Significance level, power, and effect size are given (significance level and power are selected, ES is estimated beforehand); estimate the optimal sample size.
- Perform this analysis during the planning step of an experiment and generate a sample of exactly this size to avoid undesired effects.
Post-hoc analysis:
- Significance level, effect size, and sample size are given; estimate the achieved power.
- Performed after experiment design or execution: how high is/was the risk of not observing the effect?

Types of Analysis (2)

Sensitivity analysis:
- Significance level, power, and sample size are given; estimate the required effect size.
- Performed after an experiment design or execution, typically when no significant effect was found (effects of which size did we have a chance to find?).
Criterion analysis:
- Power, effect size, and sample size are given; estimate the required significance level.
- Rarely executed, as the significance level is typically fixed in advance.

G*Power

A free tool maintained by the University of Düsseldorf, available for Windows and Mac (created by psychologists, after all): http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3
It performs power analysis for a number of test paradigms, e.g. t-tests (including regression) and F-tests (ANOVA, …), and supports all of the types of analysis above. Of course, R and other statistics toolboxes have comparable power-analysis capabilities.

Example

Assume we compare the efficiency of two user interfaces: our traditional baseline interface A and our innovative interface B. Efficiency is measured in seconds to solve a given task.
- We set the significance level α to the traditional level of 0.05.
- We set the power 1 − β to 0.95, estimating that missing an opportunity for optimization and overestimating the improvement are equally harmful.
- From a pilot study we estimate the effect size to be 0.5. We might also not be interested in smaller effects, as the new interface is more expensive.
- We design a between-subject experiment and evaluate the data with an independent-samples, one-sided t-test to test whether B is more efficient than A.
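For this scenario, the a priori sample size can be approximated with the normal distribution (a rough sketch; the exact t-based computation, as performed by G*Power, gives a slightly larger n):

```python
import math
from statistics import NormalDist

def n_per_group(alpha, power, effect_size):
    """Normal-approximation sample size per group for a one-sided two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)
```

n_per_group(0.05, 0.95, 0.5) gives 87 per group; the exact t-based analysis for these parameters arrives at 88 per group (176 in total).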

Overpowered and Underpowered Experiments

A priori analysis yields an optimal sample size of 176, i.e. 88 participants per group. What would happen if we could do a within-subject design?
- What happens if we collect less data? The study is underpowered: we must assume that we cannot reject H0 for the expected effect size.
- What happens if we collect more data? The study is overpowered: tiny effects (e.g. participants being a little more reluctant on a Monday than on a Friday) may create a significant result.
Should we stop collecting data even if more is available? Expect to have some outliers and/or corrupt data points, and subgroup analysis may require a larger corpus.