Statistical Significance Testing In Theory and In Practice


Statistical Significance Testing In Theory and In Practice. Ben Carterette, University of Delaware. http://ir.cis.udel.edu/ictir13tutorial

Hypotheses and Experiments. Hypothesis: Using an SVM for classification will give better accuracy than using Naïve Bayes; a Symbol-Refined Tree Substitution Grammar will give better parsing results than a simple TSG; expanding a short keyword query with synonyms will improve search engine effectiveness. Experiment: Build a baseline system; improve it based on your hypothesis; test both systems on one or more datasets.

Experimental Results from Shindo et al., Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing, ACL 2012

So What? Do these results support my hypothesis? Are these results meaningful? Is it possible that my results are due to chance? Statistical significance testing!

Part 1 TESTING STATISTICAL SIGNIFICANCE

Using R. R is a software environment for statistical computing. Includes built-in implementations of many common tests. Also has its own programming language for implementing your own. Download from http://r-project.org. Download TREC-8 evaluation data from http://ir.cis.udel.edu/ictir13tutorial/trec8.rdata
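
A minimal R sketch of getting started (not from the slides; the object names stored inside the .rdata file are not specified here, so ls() is used to inspect what was loaded):

download.file("http://ir.cis.udel.edu/ictir13tutorial/trec8.rdata", destfile = "trec8.rdata")
load("trec8.rdata")   # restores the saved R objects into the workspace
ls()                  # list the loaded objects to see what is available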

Commonly-Used Tests. Parametric: Student's t-test, ANOVA. Non-parametric: Wilcoxon signed-rank test, sign test/binomial test. Distribution-free: randomization test, bootstrap test.

Student's t-test Example

Topic   A     B     B-A
1      .25   .35   +.10
2      .43   .84   +.41
3      .39   .15   -.24
4      .75   .75     0
5      .43   .68   +.25
6      .15   .85   +.70
7      .20   .80   +.60
8      .52   .50   -.02
9      .49   .58   +.09
10     .50   .75   +.25

t = mean(B - A) / (sd(B - A) / √n), with mean(B - A) = 0.214 and sd(B - A) = 0.291, so t = 2.33

Student's t-test. t = mean(B - A) / (sd(B - A) / √n) = 0.214 / (0.291 / √10) = 2.33; p-value = 0.02
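
A minimal R sketch (not part of the original slides) of this paired t-test on the example data; the one-sided alternative matches the reported p-value of 0.02:

A <- c(.25, .43, .39, .75, .43, .15, .20, .52, .49, .50)
B <- c(.35, .84, .15, .75, .68, .85, .80, .50, .58, .75)
t.test(B, A, paired = TRUE, alternative = "greater")
# gives t = 2.33 on 9 degrees of freedom and a one-sided p-value of about 0.02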

Wilcoxon Signed-Rank Test Example (same A, B, and B-A values as in the t-test example). The zero difference on topic 4 is dropped and the remaining differences are ranked by absolute value:

Rank   B-A
1      -.02
2      +.09
3      +.10
4      -.24
5.5    +.25
5.5    +.25
7      +.41
8      +.60
9      +.70

W = (sum of positive ranks) - (sum of negative ranks) = 40 - 5 = 35

Wilcoxon Signed-Rank Test. W = 40 - 5 = 35; p-value = 0.03. [Figure: null distribution of W (density), with the observed value of W marked.]
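
A hedged R sketch of the same test; note that R reports V, the sum of positive ranks, rather than the W shown above, and its approximate p-value may differ slightly from the slide's 0.03:

wilcox.test(B, A, paired = TRUE, alternative = "greater")
# V = 40 (sum of positive ranks); the zero difference and the tied ranks force
# a normal approximation, so R warns that the p-value is not exact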

Sign Test Example

Topic   A     B     B-A    B > A?
1      .25   .35   +.10    +1
2      .43   .84   +.41    +1
3      .39   .15   -.24    -1
4      .75   .75     0      0
5      .43   .68   +.25    +1
6      .15   .85   +.70    +1
7      .20   .80   +.60    +1
8      .52   .50   -.02    -1
9      .49   .58   +.09    +1
10     .50   .75   +.25    +1

S = 7; p(7 | 10 trials, ½ probability) = 0.05
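
A hedged R sketch using the binomial test; how the tied pair (topic 4) is handled changes the exact p-value, so the output may not match the slide's 0.05 exactly:

d <- B - A
S <- sum(d > 0)        # 7 positive differences
n <- sum(d != 0)       # 9 non-zero pairs if the tie is dropped
binom.test(S, n, p = 0.5, alternative = "greater")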

Randomization Test Example. Under H0 the two systems are exchangeable, so for each topic the A and B scores can be swapped, which flips the sign of that topic's B-A difference. The observed mean difference is μ0 = mean(B - A) = 0.214; two example relabelings of the data give permuted means μ1 = 0.008 and μ2 = 0.093.

Randomization Test. μ0 = mean(B - A) = 0.214; p-value = 0.02. [Figure: randomization distribution of the mean difference, with the observed value marked.]
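
A minimal R sketch of a sign-flipping randomization test (the seed and the number of resamples are arbitrary choices, not from the tutorial):

set.seed(1)
d <- B - A
obs <- mean(d)
flips <- replicate(100000, mean(d * sample(c(-1, 1), length(d), replace = TRUE)))
mean(flips >= obs)    # one-sided p-value, close to 0.02 for this data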

Bootstrap Test Example (same A, B, and B-A values as above). Bootstrap samples s1, s2, s3, ... are drawn by resampling the ten B-A differences with replacement; the mean of each resample contributes one point to the bootstrap distribution.

Bootstrap Distribution. p-value = 0.005. [Figure: bootstrap distribution of the mean difference.]
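
A hedged R sketch of one common bootstrap test (an unstudentized shift-method variant; the tutorial's exact procedure may differ, and the seed and resample count are arbitrary):

set.seed(1)
d <- B - A
obs <- mean(d)
d0 <- d - obs          # shift the differences so the null hypothesis (mean 0) holds
boot <- replicate(100000, mean(sample(d0, length(d0), replace = TRUE)))
mean(boot >= obs)      # one-sided bootstrap p-value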

ANOVA. Compare variance due to system to variance due to topic. Example (same A, B, B-A data as above): σ² = MSE = 0.042, σS² = MST = 0.229, F = MST/MSE = 5.41
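
A minimal R sketch (not from the slides) of this two-way ANOVA with system and topic effects, using a long-format data frame:

scores <- data.frame(score  = c(A, B),
                     system = factor(rep(c("A", "B"), each = 10)),
                     topic  = factor(rep(1:10, times = 2)))
summary(aov(score ~ system + topic, data = scores))
# the F ratio for system is MST/MSE = 0.229/0.042, about 5.41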

Summary. These are 6 of the most common tests seen in IR experimentation. Many others in the literature: chi-squared, proportion test, ANCOVA/MANOVA/MANCOVA. All have in common: the use of some probability distribution, and computation of a p-value from that distribution.

Part 2 FUNDAMENTALS OF SIGNIFICANCE TESTING

Testing Paradigms: Ronald Fisher, Egon Pearson, Jerzy Neyman, Harold Jeffreys

What Are Tests Really Telling Us? Formal set-up: H0: μ = 0 versus H1: μ ≠ 0. The null hypothesis is a model. We are looking to prove the model false. The p-value is the probability that you would have found the same results if H0 were true. If that probability is low, conclude H0 is false.

What Are Tests Really Telling Us? Fisher: the p-value is the likelihood of the data under H0; the p-value is a conclusion about this particular experiment only, nothing more, nothing less. Neyman-Pearson: p < 0.05 means we can reject H0 as being unlikely to be true; p-values lead to inference about the population; the p-value itself is not interesting, the inference is; note that we do not accept that H1 is true! Jeffreys: the posterior probability of H0 being true can be compared to the posterior probability of other models.

Terms and Definitions. Single-sample vs two-sample tests: a single-sample test is generally based on applying one or more treatments (search algorithms) to a single sample of subjects (queries/topics); in a two-sample test, different samples are used for each treatment. Paired vs unpaired: paired tests are a special case of single-sample tests, subtracting the evaluation results for each example to obtain the measurements to summarize; unpaired tests can be single-sample too.

Test Statistics and Distributions. Test statistic: a summary of the data, usually designed to have specific distribution guarantees (asymptotically). Parametric vs non-parametric: if the test statistic distribution has any free parameters, the test is said to be parametric. Confidence interval: an interval estimate of the effect size, i.e. the range of values that would not be rejected at level α.

Sizes and Values. Sample size: the number of subjects/examples in the experiment, assumed to be sampled i.i.d. from a much larger population. Effect size: a measure of the difference between two treatments or algorithms in the population, independent of sample size; H0: no effect. p-value: the likelihood of observing the results assuming H0 is true. Critical value: the minimum effect size necessary to obtain p < α with a given sample size; α usually = 0.05.
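
A small R sketch of the critical value idea for the paired t-test example (an assumption-laden illustration: one-sided test at α = 0.05, treating the sample standard deviation as known):

alpha <- 0.05; n <- 10; s <- 0.291
qt(1 - alpha, df = n - 1) * s / sqrt(n)   # roughly the smallest mean difference that gives p < alpha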

Variance. Total variance: the sum of the squared differences between measurements and the overall mean. Within-group variance: variance due to instances in the sample; paired tests subtract this variance out. Between-group variance: variance due to the treatments/systems.

Accuracy and Power. Accuracy: the probability of getting p ≥ α when H0 is actually true, i.e. the probability of correctly not rejecting H0; it is the complement of the false positive rate. Power: the probability of getting p < α when the null hypothesis is actually false, i.e. the probability of correctly rejecting H0; the true positive rate. Most tests are defined to have a false positive rate of α when H0 is true. Achieving a certain power level involves estimating effect size and sample size.

                   H0 not rejected      H0 rejected
H0 true            accuracy             Type I error
H0 false           Type II error        power
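
A hedged R sketch of a power calculation for the paired t-test example (the one-sided alternative and the 80% target power are illustrative choices, not from the slides):

power.t.test(n = 10, delta = 0.214, sd = 0.291, sig.level = 0.05,
             type = "paired", alternative = "one.sided")
# or solve for the sample size needed to reach 80% power:
power.t.test(power = 0.8, delta = 0.214, sd = 0.291,
             type = "paired", alternative = "one.sided")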

The Linear Model. Statistical tests are classifiers. Like classifiers, they are based on an underlying model. Unlike classifiers, we cannot evaluate them directly. The t-test is based on the linear regression model y_i = β0 + β1 x_i + ε_i
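
A minimal R sketch of that correspondence (this is the unpaired, equal-variance two-sample t-test; the paired version corresponds to adding a topic term, as in the ANOVA example above):

x <- rep(c(0, 1), each = 10)   # 0 for system A, 1 for system B
y <- c(A, B)
summary(lm(y ~ x))             # the t statistic on the x coefficient tests beta1 = 0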

"All models are wrong, but some are useful." (George E. P. Box)

Myths and Misconceptions. Significance tests lend rigor to our experimentation; without them, the usual differences of < 5% would be difficult to interpret. But they are widely misunderstood: p-values can be incorrectly interpreted, and p-values can be easily manipulated (even unintentionally). They are fundamentally no more rigorous than any AI approach to classification, though they may have a much deeper theoretical basis.

Myth: H0 is a Realistic Model. The first and biggest misconception: that the null hypothesis is sometimes true, that is, that there is a chance that there really is no effect. In AI, the null hypothesis is almost never true; really only when the experimenter made a mistake. The only question is how big a sample it will take to reject it; there is always some sample big enough to reject it.

Myth: Rejecting H0 Means it is False. First, H0 is always false. But even if it were true, we could still reject it for many reasons: something about our sample; violations of test model assumptions; failure to model important sources of variance; unintentional overfitting. Rejecting H0 should not be taken to mean our system is definitely better.

Myth: Test Assumptions Are Important. Consider the t-test based on the linear model. Assumptions: y is unbounded; linearity and additivity; homoscedasticity; normality of errors (note: normality of data is not an assumption). All of these are false! But we can evaluate how much their falseness affects accuracy and power.

Myth: Test Assumptions Are Important. OK, so t-test assumptions are false. Why not use a different test? Every test is based on some model, and every model is false. Even so-called assumption-free tests like Fisher's exact test or the bootstrap actually do involve assumptions. The tradeoff is between simplicity and power: fewer assumptions means less power, which means fewer significant results. The t-test is popular because it is powerful, robust to violations of its assumptions, and computationally easy.

Myth: p-Values Have Intrinsic Meaning. p < 0.05 is often taken as a gold standard of proof. Two things to keep in mind: the p-value comes out of a model, and all models are wrong; 0.05 is an arbitrary value that was probably first used as an example. Any meaning given to a p-value is extrinsic, usually granted by a community of scientists.

Myth: p-Values Have Intrinsic Meaning. The real gold standard is whether it helps users. Any IR evaluation based on the Cranfield paradigm cannot directly answer that.

Myth: Lower p-Values are Better. If a p-value of 0.04 is better than a p-value of 0.06, then a p-value of 0.02 is even better, right? A p-value can be lower for three reasons: the effect size is bigger (good); the sample size is bigger (bad); modeling effects, including random effects. There's no way to know which of these is the reason.

Myth: Lower p-Values are Better. p-value = P(data | H0, test model, inputs). Any change to the underlying model results in a different probability distribution, and that includes changes to the systems being tested. p-values should not be compared directly. Fisher and Neyman/Pearson would have agreed on this!

Myth: Running Many Tests is OK. AI experimentation often happens like this: implement system, compare to baseline, run test. Not significant? Re-implement system, compare to baseline, run test. Significant? Start writing a paper. How many tests does it take to get to the endpoint?

Sequential Testing. Suppose (hypothetically) that the null hypothesis is actually true. The probability of concluding it is false after one test is α (normally 0.05). The probability of concluding it is false after two tests is .05 + .95*.05 = .0975. After three tests, .05 + .95*.05 + .95*.95*.05 = .143. After 14 tests, ~0.5. After 27 tests, ~0.75. After 90 tests, ~0.99.
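
These cumulative false positive probabilities are just 1 - (1 - α)^k; a one-line R check:

alpha <- 0.05
k <- c(1, 2, 3, 14, 27, 90)
round(1 - (1 - alpha)^k, 3)   # 0.050 0.098 0.143 0.512 0.750 0.990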

Multiple Testing. Suppose three different people have the same null hypothesis. If each of them does one experiment, the probability that there will be one false positive is 0.143. If each of them does three experiments, the probability goes to ~0.4. Result: very high probability that any given published result is false! ("Why Most Published Research Findings Are False", Ioannidis, PLoS Medicine, 2005)

Multiple Testing

Correcting for Multiple Comparisons. We should adjust our p-values up for the fact that we have made multiple comparisons. Many different approaches: Bonferroni correction; Tukey's Honest Significant Differences; multivariate t-test.
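
A small R sketch of p-value adjustment (the raw p-values here are hypothetical, only to show the call):

p <- c(0.003, 0.02, 0.04, 0.30)        # hypothetical raw p-values from four comparisons
p.adjust(p, method = "bonferroni")     # 0.012 0.080 0.160 1.000
p.adjust(p, method = "holm")           # a step-down method, never more conservative than Bonferroni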

Tukey's HSD. Omnibus hypothesis: H0: S1 = S2 = S3 = ... = Sn. ANOVA fits a linear model to all data; it rejects the null if there is any difference between any pair of systems. The maximum difference is the one most likely to be the cause of rejection. Tukey: compute a distribution of the maximum difference, and base all p-values on that.
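
A hedged R sketch of the call, using the long-format scores data frame from the ANOVA example above; with only two systems the adjustment is trivial, but with n systems it adjusts for all pairwise comparisons:

fit <- aov(score ~ system + topic, data = scores)
TukeyHSD(fit, which = "system")   # all pairwise system differences with adjusted p-values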

Tukey's HSD

Effect on TREC-8 Evaluation

Families of Experiments. p-values should be adjusted based on families of experiments: all experiments testing the same hypothesis. How do we define a family of experiments? Suppose we are testing hypotheses about clustering for IR: H: augmenting LM retrieval with clusters improves ad hoc retrieval; H: augmenting BM25 retrieval with clusters improves ad hoc retrieval; H: augmenting any ranking function with clusters improves ad hoc retrieval; H: clusters are good for retrieval.

What is a Family? In TREC data, families could be: all pairs of submitted systems; pairs of systems submitted by each participating group, in the context of the full set of systems; pairs of systems submitted by each participating group, in the context of just that group's systems. p-values can be corrected based on each family type, which results in different adjustments for each. The third is the least honest, yet is really the only thing you can do on your own.

Summary. Significance tests are just models. When we use them out of the box, we fail to model many sources of variance in IR: variance in relevance, in user behavior, in interactions between system components, and so on. The things we do model are probably being modeled wrong, in particular the additivity of system and topic effects. More correct models could change our conclusions about systems. We know that modeling multiple testing changes our conclusions dramatically; most other concerns are extremely minor in comparison. But we don't really truly know how to adjust for multiple comparisons; it depends very much on what other researchers are thinking and doing. The one thing I want you to take away from this talk: never trust a p-value, even one you computed yourself.

Part 3 SIGNIFICANCE TESTING IN IR RESEARCH

What Does it Mean? You can always find significance: with the right sample, the right sample size, the right test, enough iterations of testing; fishing expeditions. Significance is only a rough proxy for interestingness, a heuristic.

Searching for Interesting Results. How do we use significance tests in research? Conference program committees and journal editors use them as a guide for determining what to publish. Publication determines the research directions that people follow, and published systems get implemented as baselines. Essentially they act as a heuristic in a search for the best algorithms. They can easily be used as a substitute for human judgment; like most AI, they should be used as an aid to human judgment. There isn't one right way to do it; the No Free Lunch Theorem applies to significance testing.

Searching for Interesting Results. What if significance was granted more conservatively? e.g. by: correcting for multiple comparisons; using tests that make fewer assumptions; using a lower value of alpha (0.01, for instance). Is a more conservative heuristic always better?

The State of Research Today. [Figure: overlapping sets of all hypotheses, statistically significant results, interesting results, and published results.]

The State of Research When Significance is Granted More Conservatively. [Figure: the same sets, with a smaller region of statistically significant results.] Fewer publications overall, which means fewer uninteresting publications, but also that fewer truly interesting results can be published.

Thought Experiment. Suppose statistical significance is a necessary and sufficient condition for publication. Consequences: many published papers are not interesting; some interesting results are not published; most uninteresting results are not published. Published uninteresting papers -> time wasted reading and re-implementing. Unpublished interesting results -> time wasted each time results are re-discovered. Unpublished uninteresting papers -> time wasted each time the experiment is tried and fails.

Example: NLP for IR. NLP generally doesn't work for IR; maybe in some domains, for some tasks, but in general not. But almost every IR grad student has had some idea for using NLP to improve IR. Result: a handful of published papers from a very large number of experiments, mostly due to randomness (e.g. multiple testing), which gives hope to the next generation of students (who don't know about the very low success rate), which results in a lot of wasted time as they re-do experiments already done by every previous generation.

Example: NLP for IR. Would we have been better off had that handful of papers never been published? Would we have been better off if all those negative results had been published? Or are we better off with grad students having done the work to gain some intuition about why it doesn't work?

Takeaways. Always do significance tests, but don't worry too much about which ones to do; the t-test is always a good option, and correcting for multiple testing is probably not necessary. Don't just report p-values or stars to indicate significance; always report estimated effect sizes and confidence intervals. Always take the results of tests with a grain of salt, especially when the effect size is low. Build your intuition and use it. Never say "barely significant" or "just missed being significant".
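
A last R sketch of the "report effect sizes and confidence intervals" point, using the example data from Part 1:

tt <- t.test(B, A, paired = TRUE)
tt$estimate    # the estimated effect size: the mean difference, 0.214
tt$conf.int    # a 95% confidence interval for that difference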