Beyond p values and significance: "Accepting the null hypothesis", power, and the utility of a result. Cohen, Empirical Methods, CS650



Showing that things are NOT different

Example: Oates and Heeringa wanted to show that their grammar induction algorithm performed "the same" as the inside/outside algorithm.

Approaches:
- Confidence interval around the difference
- Power analysis
- Showing that the proportion of variance due to algorithm is smaller than the proportion due to problem (analysis of variance)
- Showing that the difference, though significant, is meaningless

"Accepting the null hypothesis" Sometimes you want to show A and B are not different Hypothesis testing doesn't allow that! Ok, can we say A and B are equal if we cannot reject Ho: A = B? = (A) (B) s.e. (A ) (B) = (A) (B) ˆ (A) (B) N

So, what can we do to "accept the null hypothesis"?

[Figure: two sampling distributions of x̄_A − x̄_B, one narrow and one wide, each plotted on a 0–1 axis.]

If these are sampling distributions of A − B, which makes you more confident that A − B ≈ 0?

Example: Animal Watch

[Figure: histograms of total math problems, scores 1–9, for males and females.]

"Accept" the hypothesis that male and female scores are equal?

t = (4.23 − 3.63) / 0.4 = 0.6 / 0.4 = 1.5, not significant

95% confidence interval around the difference:

(x̄_male − x̄_female) − 1.96 s.e. ≤ µ_male − µ_female ≤ (x̄_male − x̄_female) + 1.96 s.e.
0.6 − 1.96(0.4) ≤ µ_male − µ_female ≤ 0.6 + 1.96(0.4)
−0.184 ≤ µ_male − µ_female ≤ 1.384
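The slide's arithmetic can be checked with a few lines of Python (a sketch; the means 4.23 and 3.63 and the standard error 0.4 are taken from the slide):

```python
# Numbers from the Animal Watch slide: group means and the standard
# error of the difference between male and female means.
mean_m, mean_f, se_diff = 4.23, 3.63, 0.4

# t statistic for the difference between the means
t = (mean_m - mean_f) / se_diff          # 0.6 / 0.4 = 1.5, not significant

# 95% confidence interval around the difference (z = 1.96)
lo = (mean_m - mean_f) - 1.96 * se_diff  # -0.184
hi = (mean_m - mean_f) + 1.96 * se_diff  #  1.384

print(f"t = {t:.2f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Because the interval contains 0, we cannot reject H0, but its width shows how far from 0 the true difference could plausibly be.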

Bootstrap distribution of the difference between male and female scores. Confidence interval: [−0.184, 1.39].

(defun two-sample-bootstrap (sample1 sample2 statistic k)
  (let* ((n1 (length sample1))
         (n2 (length sample2))
         (s1* (make-array n1))
         (s2* (make-array n2))
         (dist nil))
    (dotimes (i k)
      (dotimes (j n1) (setf (aref s1* j) (nth (random n1) sample1)))
      (dotimes (j n2) (setf (aref s2* j) (nth (random n2) sample2)))
      (push (funcall statistic s1* s2*) dist))
    (values dist)))

(two-sample-bootstrap m f #'(lambda (x y) (- (mean x) (mean y))) 500)

[Figure: histogram of the 500 bootstrapped differences, ranging roughly from 0 to 1.5.]
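The slide's code is Common Lisp; the same two-sample bootstrap can be sketched in Python. The male and female samples below are hypothetical stand-ins, since the slide's raw data is not shown:

```python
import random

def two_sample_bootstrap(sample1, sample2, statistic, k):
    """Resample each group with replacement k times and collect the
    statistic's value on each pair of resamples (mirrors the Lisp code)."""
    dist = []
    for _ in range(k):
        s1 = [random.choice(sample1) for _ in sample1]
        s2 = [random.choice(sample2) for _ in sample2]
        dist.append(statistic(s1, s2))
    return dist

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical score samples, for illustration only.
males = [4, 5, 3, 6, 4, 5, 4, 3, 5, 4]
females = [3, 4, 3, 5, 4, 3, 4, 3, 4, 3]

dist = sorted(two_sample_bootstrap(males, females,
                                   lambda a, b: mean(a) - mean(b), 500))
# Rough 95% bootstrap interval: the 2.5th and 97.5th percentiles.
ci = (dist[12], dist[487])
```

The empirical percentiles of the bootstrap distribution give the confidence interval directly, with no normality assumption.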

Errors and Power

Type I error: rejecting H0 when H0 is true
Type II error: failing to reject H0 when H0 is false
Power: 1 − Pr(Type II error) = Pr(rejecting H0 when H0 is false)

[Figure: H0 and H1 sampling distributions centered at 0 and 3; the critical value to reject H0 at, say, α = .05 marks off Pr(Type I error) under H0 and the power under H1.]
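The two error types and power can be illustrated with a small Monte Carlo sketch. The populations here are hypothetical standard normals, with a 0.5 shift under H1 chosen for illustration:

```python
import random

random.seed(1)
ALPHA_CRIT = 1.645   # one-tailed z critical value at alpha = .05
TRIALS, N = 2000, 25

def z_stat(sample, mu0=0.0, sigma=1.0):
    """z score of a sample mean against H0: mu = mu0, sigma known."""
    n = len(sample)
    return (sum(sample) / n - mu0) / (sigma / n ** 0.5)

# Type I error rate: generate data under H0 (mu = 0), count rejections.
type1 = sum(
    z_stat([random.gauss(0, 1) for _ in range(N)]) > ALPHA_CRIT
    for _ in range(TRIALS)
) / TRIALS

# Power: generate data under H1 (mu = 0.5), count rejections.
power = sum(
    z_stat([random.gauss(0.5, 1) for _ in range(N)]) > ALPHA_CRIT
    for _ in range(TRIALS)
) / TRIALS
```

The rejection rate under H0 hovers near α, while the rejection rate under H1 is the power of the test against that particular alternative.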

Power and H1

Power can be assessed only with respect to H1. You must specify H1 before you can calculate the power of a test.

[Figure: H0 and H1 distributions centered at 0 and 3, with the critical value to reject H0 at, say, α = .05.]

Example: What is the power of a t test to find a difference of at least .5 between the means of males and females?

H0: µ_males − µ_females = 0
H1: µ_males − µ_females = 0.5

[Figure: the male and female score histograms (scores 1–9), and the H0 and H1 sampling distributions separated by 0.5.]

Example: What is the power of a t test to find a difference of at least .5 between the means of males and females?

H0: µ_males − µ_females = 0
H1: µ_males − µ_females = 0.5

From earlier slides we know the standard error of the difference between the means is 0.4, so the one-tailed critical value is 1.645 × 0.4 = 0.658. Assuming the H1 sampling distribution has the same form, 0.658 is 0.158 / 0.4 = 0.395 standard error units away from the mean of the H1 distribution. 34.6% of a normal curve lies beyond 0.395 standard deviations from the mean, so the power of the test to detect a difference of .5 is .346.

[Figure: H0 and H1 distributions centered at 0 and 0.5, with the critical value at 0.658.]
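The slide's power computation, step by step, in Python. The standard error 0.4 and the H1 difference 0.5 come from the slide; `normal_sf` is a small helper built on `math.erf`:

```python
from math import erf, sqrt

def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * (1 - erf(z / sqrt(2)))

se = 0.4                    # standard error of the mean difference
crit = 1.645 * se           # one-tailed critical value under H0: 0.658
h1_mean = 0.5               # difference assumed under H1

z = (crit - h1_mean) / se   # 0.158 / 0.4 = 0.395 s.e. units past the H1 mean
power = normal_sf(z)        # fraction of the H1 distribution beyond crit
```

Running this gives a power of about .346, matching the slide.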

Which factors in a test affect power?

Standard error (variance, sample size), effect size, and alpha.

[Figure: the H0 and H1 distributions centered at 0 and 0.5, with the critical value at 0.658.]

Power curves: fix three of the factors, vary one.

[Figure: H0/H1 distribution pairs with the H1 mean at 0.5 and at 0.75, and the resulting power curve plotted against the mean under H1.]

Power curves: N

For normal sampling distributions, Crit.05 = 1.645 · σ̂ · √(2/N). As N increases, Crit.05 decreases and power increases.

[Figure: change in Crit.05 for the male/female test data as N runs from 50 to 300, assuming the variances for males and females remain constant; the critical value falls from about 1.5 toward 0.5.]
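A sketch of this power curve over N, assuming the critical value for two equal-size groups with common σ is 1.645 · σ · √(2/N); σ = 1 and an H1 difference of 0.5 are illustrative choices, not the slide's data:

```python
from math import erf, sqrt

def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * (1 - erf(z / sqrt(2)))

def power_at(n, sigma=1.0, h1_diff=0.5, z_crit=1.645):
    """One-tailed power to detect h1_diff with n observations per group,
    using se = sigma * sqrt(2 / n) for the difference between two means."""
    se = sigma * sqrt(2.0 / n)
    crit = z_crit * se                    # Crit.05 shrinks as n grows
    return normal_sf((crit - h1_diff) / se)

powers = [power_at(n) for n in (10, 50, 100, 200)]
```

Evaluating `power_at` over a range of n traces out the power curve: the critical value drops toward 0 and power climbs toward 1.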

Yes, there's a difference, but does it mean anything?

People make a big deal over differences in mathematics scores between boys and girls. These differences are tiny compared with those between American and Japanese students.

The difference between KOSO and KOSO* raw runtimes is tiny compared with the random effect of the problem on which they are tested.

Significant and meaningful are not synonymous

In the RKF summer trials, knowledge engineers (KEs) got significantly higher scores than naïve users (SMEs) (p < .0001). How much predictive power does this knowledge afford? Suppose you wanted to predict whether a score was higher or lower than the median of all scores. How much would it help to know whether the score belonged to a KE or an SME?

[Figure: score histograms for SMEs (N = 277, 161 values < 2.59), KEs + SMEs (N = 417, median = 2.59), and KEs (N = 140, 101 values > 2.59).]

Guess whether x > 2.59. Error reduction from knowing whether x belongs to an SME or a KE:

- No knowledge: 417 / 2 = 208.5 expected errors if you always say x > 2.59
- You know x comes from an SME: guess x < 2.59 and make 277 − 161 = 116 errors
- You know x comes from a KE: guess x > 2.59 and make 140 − 101 = 39 errors

Error reduction is (208.5 − (116 + 39)) / 208.5 ≈ 25.7%.

[Figure: the SME, KE + SME, and KE score histograms, with N = 277, 417, and 140 and median 2.59.]
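The error-reduction arithmetic as code, with the counts taken from the slide:

```python
# Counts from the RKF summer trials slide.
n_total, n_sme, n_ke = 417, 277, 140
sme_below, ke_above = 161, 101        # values on the predicted side of the median

baseline = n_total / 2                # 208.5 expected errors with no knowledge
sme_errors = n_sme - sme_below        # guess "x < 2.59" for SMEs: 116 errors
ke_errors = n_ke - ke_above           # guess "x > 2.59" for KEs: 39 errors

reduction = (baseline - (sme_errors + ke_errors)) / baseline
```

So a highly significant group difference buys only about a quarter fewer prediction errors.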

Significant and meaningful are not synonyms

Suppose you wanted to use the knowledge that the ring is controlled by KOSO or KOSO* for some prediction. How much predictive power would this knowledge confer?

- Grand median k = 1.11; Pr(trial i has k > 1.11) = .5
- Probability that trial i under KOSO has k > 1.11 is 0.57
- Probability that trial i under KOSO* has k > 1.11 is 0.43

Predict for trial i whether k > 1.11:
- If it's a KOSO* trial you'll say no, making (.43 × 150) = 64.5 errors
- If it's a KOSO trial you'll say yes, making ((1 − .57) × 160) = 68.8 errors
- If you don't know which, you'll make (.5 × 310) = 155 errors

155 − (64.5 + 68.8) ≈ 22 fewer errors. Knowing the algorithm reduces the error rate from .5 to .43.

Stay/go decision

An epoch: collect several views of an object and give them a common (but new) label. The robot has experienced M epochs and is k views into the current epoch. Should it collect more views or go?

Intuition: if additional views cannot help it discriminate the current object from others in memory, it should go.

Model: you have a sample s1 and you are accumulating data into s2. When the data do not improve the discrimination of s1 and s2, stop sampling.

Stay/go math

φ = (SSg − (SSa + SSb)) / SSg

φ reaches its theoretical maximum value when Nb = Na · std(a)/std(b) (from Paola Sebastiani).
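A minimal sketch of φ, assuming the reading φ = (SSg − (SSa + SSb)) / SSg, where SSg is the sum of squared deviations in the pooled sample and SSa, SSb are the within-sample sums of squares. The two sample pairs below are hypothetical:

```python
def ss(xs):
    """Sum of squared deviations from the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def phi(a, b):
    """Proportion of the pooled sum of squares explained by the a/b split:
    (SSg - (SSa + SSb)) / SSg."""
    ssg = ss(a + b)
    return (ssg - (ss(a) + ss(b))) / ssg

# Hypothetical view statistics: a well-separated pair vs. an overlapping pair.
separated = phi([1, 2, 1, 2], [8, 9, 8, 9])   # groups far apart: phi near 1
overlap = phi([1, 2, 1, 2], [2, 1, 2, 2])     # groups overlap: phi near 0
```

When new views stop increasing φ, the current object is as discriminable from memory as it is going to get, so the robot should go.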

Stay/go experiments

[Figure: φ (0.01–0.06) plotted against number of views (up to about 100) for two conditions, A1 (different) and A2 (similar), with the A1 curve above the A2 curve.]

Significant and meaningful (or useful) are not synonyms

Suppose you wanted to predict the run-time of a trial. If you don't know Algorithm, your best guess is the grand mean and your uncertainty is the grand variance. If you do know Algorithm, then your uncertainty is less. The reduction in uncertainty due to knowing Algorithm is estimated by:

ω̂² = (t² − 1) / (t² + N1 + N2 − 1)

Recall t = 2.42 from the Rosenberg study:

ω̂² = (2.42² − 1) / (2.42² + 160 + 150 − 1) = .015

All other things equal, increasing sample size decreases the utility of knowing the group to which a trial belongs.
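The ω̂² estimate as code; t = 2.42 and the group sizes 160 and 150 come from the slide:

```python
# Estimated reduction in variance from knowing the Algorithm factor,
# using omega-squared = (t^2 - 1) / (t^2 + N1 + N2 - 1).
t, n1, n2 = 2.42, 160, 150
omega_sq = (t ** 2 - 1) / (t ** 2 + n1 + n2 - 1)   # about .015
```

With 310 trials, a comfortably significant t of 2.42 explains only about 1.5% of the variance in run-time, which is the slide's point: significance and utility are different questions.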